Pipeline process / data model
See the about section of the planning.data website to learn more about the website and programme objectives:
“Our platform collects planning and housing data from local planning authorities (LPAs) and transforms it into a consistent state, across England. Anyone can view, download and analyse the data we hold.”
We ask Local Planning Authorities (LPAs) to publish open data on their websites in the form of an accessible URL or API endpoint; these URLs are called endpoints.
The system used to take data from endpoints and process it into a consistent format is called the pipeline. The pipeline can collect data hosted in many different formats, identify common quality issues with the data (and in some cases resolve them), and transform the data into a consistent state to be presented on the website.
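The pipeline itself is not shown here, but as a rough sketch of those three responsibilities (collect, fix common issues, transform into a consistent state) the Python outline below may help. The stage and function names are assumptions made for illustration, not the pipeline's real modules or API.

```python
# Illustrative sketch only: the real pipeline lives in the digital-land
# codebase, and these stage and function names are assumptions, not its API.
import csv
import io
import urllib.request


def collect(endpoint_url: str) -> bytes:
    """Fetch the raw bytes published at an endpoint (CSV, JSON, GeoJSON, ...)."""
    with urllib.request.urlopen(endpoint_url) as response:
        return response.read()


def convert(raw: bytes) -> list[dict]:
    """Convert the collected file into rows of field/value pairs (CSV assumed here)."""
    return list(csv.DictReader(io.StringIO(raw.decode("utf-8"))))


def harmonise(rows: list[dict]) -> list[dict]:
    """Fix simple quality issues, e.g. stray whitespace around fields and values."""
    return [{k.strip(): (v or "").strip() for k, v in row.items()} for row in rows]


def run(endpoint_url: str) -> list[dict]:
    """Collect, convert and harmonise one endpoint's data."""
    return harmonise(convert(collect(endpoint_url)))
```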
For more detail on how the pipeline works, see the documentation here.
Data is organised into separate datasets, each of which may consist of data collected from one or many endpoints. Datasets may be referred to as either compiled or national, depending on how data for them is provided. For example, the article-4-direction-area dataset has many providers, as we collect data from LPAs to add to it, and is therefore a compiled dataset. The agricultural-land-classification dataset, on the other hand, has just one provider, as it is a dataset with national coverage published by Natural England, and is therefore a national dataset.
Datasets are also organised into collections: groups of datasets brought together based on their similarity. For example, the conservation-area-collection is the home for the conservation-area and conservation-area-document datasets. There are a few key components to collections, which are outlined below using the conservation-area-collection as an example:
- The collection repo (note the “-collection” after the name): https://github.com/digital-land/conservation-area-collection/. This is the repo used to build the collection data; the build is triggered each night by a GitHub workflow.
- The collection and pipeline configuration files, which store the configuration data that controls how data feeding into the collection is processed (see the section below for more detail, and the sketch after this list).
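The configuration files themselves aren't listed in this section. As a rough sketch, a collection repository typically includes an endpoint register listing the URLs to collect from; the path and column names used below (collection/endpoint.csv, endpoint, endpoint-url, end-date) are assumptions based on that pattern rather than a documented interface.

```python
# Sketch of reading a collection's endpoint register. The path and column
# names are assumptions about a typical collection repo, not a documented API.
import csv
from pathlib import Path


def load_endpoints(repo_root: str) -> list[dict]:
    """Return the endpoints that are still active (no end-date)."""
    endpoint_file = Path(repo_root) / "collection" / "endpoint.csv"
    with endpoint_file.open(newline="") as f:
        return [row for row in csv.DictReader(f) if not row.get("end-date")]


for endpoint in load_endpoints("conservation-area-collection"):
    print(endpoint["endpoint"], endpoint["endpoint-url"])
```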
The data management team is responsible for adding data to the platform and maintaining it once it’s there; see here for the list of team responsibilities in the Planning Data Service Handbook.
Resources
Once an endpoint is added to our data processing pipeline it is checked each night for the latest data. When an endpoint is added for the first time we take a copy of the data; this unique copy is referred to as a resource. If the pipeline detects any change in the data, no matter how small, it saves a new copy of all the data from the endpoint, creating a new resource. Each resource is given a unique reference which we can use to identify it.
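One way to picture “any change, no matter how small, creates a new resource” is to derive the resource reference from a hash of the downloaded content. Treat this as a sketch: the use of SHA-256 here is an assumption about how the unique reference is produced, and the sample data is invented.

```python
# Sketch: a resource is a snapshot of an endpoint's data, identified here by a
# content hash so that any change, however small, produces a new resource.
# The choice of SHA-256 is an assumption, not a statement of the real scheme.
import hashlib


def resource_reference(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()


previous = resource_reference(b"reference,name\nAr4.28,Smithy Lane\n")
latest = resource_reference(b"reference,name\nAr4.28,Smithy  Lane\n")  # one extra space
assert previous != latest  # even a whitespace change yields a new resource
```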
Facts
The data from each resource is saved as a series of facts. If we imagine a resource as a table of data, then each combination of entry (row) and field (column) generates a separate fact: a record of the value for that entry and field. For example, if a table has a field called “reference” and the value of that field for the first entry is “Ar4.28”, we record the name of the field and its value, along with a unique reference for this fact. You can see how this appears in our system here.
So a table with 10 rows and 10 columns would generate 100 facts. And each time data changes on an endpoint, all of the facts for the new resource are recorded again, including any new facts. We can use these records to trace back through the history of data from an endpoint.
A fact has the following attributes:
- fact - UUID, primary key on fact table in database
- entity - optional, numeric ID, entity to which fact applies
- start-date - optional, date at which fact begins to apply (not date at which fact is created within data platform)
- end-date - optional, date at which fact ceases to apply
- entry-date - optional, date at which fact was first collected
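As a minimal sketch, and assuming the field name and value described above are stored alongside these attributes, a fact and the “one fact per entry/field combination” rule could be modelled like this (the class and helper are illustrative, not the platform’s actual schema):

```python
# Illustrative only: a Fact record mirroring the attributes listed above,
# plus the field name and value that the Facts section describes.
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fact:
    fact: str                       # UUID, primary key on the fact table
    field: str                      # e.g. "reference"
    value: str                      # e.g. "Ar4.28"
    entity: Optional[int] = None    # entity to which the fact applies
    start_date: Optional[str] = None
    end_date: Optional[str] = None
    entry_date: Optional[str] = None


def facts_from_resource(rows: list[dict], entry_date: str) -> list[Fact]:
    """Each combination of entry (row) and field (column) becomes one fact."""
    return [
        Fact(fact=str(uuid.uuid4()), field=field, value=value, entry_date=entry_date)
        for row in rows
        for field, value in row.items()
    ]


rows = [{"reference": "Ar4.28", "name": "Smithy Lane"}]
print(len(facts_from_resource(rows, "2024-01-01")))  # 1 row x 2 fields = 2 facts
```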
Entities
An entity is the basic unit of data within the platform. It can take on one of many types defined by digital-land/specification/typology.csv. An entity has the following attributes:
- entity - incrementing numeric ID, manually assigned on ingest; different numeric ranges represent different datasets; primary key on entity table in SQLite and PostGIS databases
- start-date - optional, date at which entity comes into existence (not date at which entity is created within data platform)
- end-date - optional, date at which entity ceases to exist
- entry-date - optional, date at which entity was first collected
- dataset - optional, name of dataset (which should correspond to the dataset field in digital-land/specification/dataset.csv) to which entity belongs
- geojson - optional, a JSON object conforming to the RFC 7946 specification which specifies the geographical bounds of the entity
- typology - optional, the type of the entity, which should correspond to the typology field in digital-land/specification/typology.csv
- json - optional, a JSON object containing metadata relating to the entity
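A minimal sketch of an entity record with the attributes listed above; the class, the example entity number and the typology value are illustrative assumptions rather than the platform’s actual schema.

```python
# Illustrative only: an Entity record mirroring the attributes listed above.
# The entity number and typology value in the example are assumptions.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entity:
    entity: int                      # numeric ID; different ranges represent different datasets
    dataset: Optional[str] = None    # e.g. "conservation-area"
    typology: Optional[str] = None   # should match a typology in typology.csv
    start_date: Optional[str] = None
    end_date: Optional[str] = None
    entry_date: Optional[str] = None
    geojson: Optional[dict] = None   # RFC 7946 object giving the geographical bounds
    json: Optional[dict] = None      # other metadata relating to the entity


example = Entity(
    entity=44000001,                 # hypothetical number within a dataset's range
    dataset="conservation-area",
    typology="geography",            # assumed typology for this dataset
    geojson={"type": "Point", "coordinates": [-0.1, 51.5]},
)
```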
Facts collected from resources are assigned to entities based on a combination of the reference of the record in the resource, the organisation that provided the resource, and the dataset it belongs to (this needs more clarification, or a link out to more detail somewhere).
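A sketch of that matching step, assuming the combination of dataset, providing organisation and reference is held in some kind of lookup; the dictionary, organisation identifier and entity number below are placeholders, not the platform’s real lookup store.

```python
# Sketch of assigning incoming records to entities. The key combines the
# dataset, the organisation that provided the resource and the record's
# reference; the dictionary and example values are placeholders.
lookups: dict[tuple[str, str, str], int] = {
    ("article-4-direction-area", "local-authority:ABC", "Ar4.28"): 7010000001,
}


def entity_for(dataset: str, organisation: str, reference: str) -> int | None:
    return lookups.get((dataset, organisation, reference))


print(entity_for("article-4-direction-area", "local-authority:ABC", "Ar4.28"))  # 7010000001
```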
So as well as the default(?) attributes above, an entity in the article-4-direction-area dataset can also have attributes like permitted-development-rights and notes.
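For illustration only, such an entity might carry those dataset-specific attributes alongside the core ones; every value below is invented.

```python
# Invented example: dataset-specific attributes such as
# permitted-development-rights and notes sitting alongside the core
# attributes of an article-4-direction-area entity.
article_4_direction_area_entity = {
    "entity": 7010000001,            # hypothetical ID within the dataset's range
    "dataset": "article-4-direction-area",
    "reference": "Ar4.28",
    "permitted-development-rights": "Class A",                  # invented value
    "notes": "Direction covers the terraces on Smithy Lane.",   # invented value
}
```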