Pipeline process / data model
See the about section of the planning.data website to learn more about the website and programme objectives:
“Our platform collects planning and housing data from local planning authorities (LPAs) and transforms it into a consistent state, across England. Anyone can view, download and analyse the data we hold.”
We ask Local Planning Authorities (LPAs) to publish open data on their websites in the form of an accessible URL or API endpoint; these URLs are called endpoints.
The system used to take data from endpoints and process it into a consistent format is called the pipeline. The pipeline can collect data hosted in many different formats, identify common quality issues with the data (and in some cases resolve them), and transform the data into a consistent state to be presented on the website.
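The pipeline itself is not shown here, but as a rough sketch of those three responsibilities (collect, fix common issues, transform into a consistent state) the Python outline below may help. The stage and function names are assumptions made for illustration, not the pipeline's real modules or API.

```python
# Illustrative sketch only: the real pipeline lives in the digital-land
# codebase, and these stage and function names are assumptions, not its API.
import csv
import io
import urllib.request


def collect(endpoint_url: str) -> bytes:
    """Fetch the raw bytes published at an endpoint (CSV, JSON, GeoJSON, ...)."""
    with urllib.request.urlopen(endpoint_url) as response:
        return response.read()


def convert(raw: bytes) -> list[dict]:
    """Convert the collected file into rows of field/value pairs (CSV assumed here)."""
    return list(csv.DictReader(io.StringIO(raw.decode("utf-8"))))


def harmonise(rows: list[dict]) -> list[dict]:
    """Fix simple quality issues, e.g. stray whitespace around fields and values."""
    return [{k.strip(): (v or "").strip() for k, v in row.items()} for row in rows]


def run(endpoint_url: str) -> list[dict]:
    """Collect, convert and harmonise one endpoint's data."""
    return harmonise(convert(collect(endpoint_url)))
```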
For more detail on how the pipeline works, see the documentation here.
Data is organised into separate datasets, each of which may consist of data collected from one or many endpoints. Datasets may be referred to as either compiled or national, depending on how data for them is provided. For example, the article-4-direction-area dataset has many providers, as we collect data from LPAs to add to it, and is therefore a compiled dataset. The agricultural-land-classification dataset, on the other hand, has just one provider, as it is a dataset with national coverage published by Natural England, and is therefore a national dataset.
Datasets are also organised into collections: groups of datasets brought together based on their similarity. For example, the conservation-area-collection is the home for the conservation-area and conservation-area-document datasets. There are a few key components to collections, which are outlined below using the conservation-area-collection as an example:
- The collection repo (note the “-collection” after the name): https://github.com/digital-land/conservation-area-collection/. This is the repo used to build the collection data; the build is triggered each night by a GitHub workflow.
- The collection and pipeline configuration files, which store the configuration data that controls how data feeding into the collection is processed (see the section below for more detail, and the sketch after this list).
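The configuration files themselves aren't listed in this section. As a rough sketch, a collection repository typically includes an endpoint register listing the URLs to collect from; the path and column names used below (collection/endpoint.csv, endpoint, endpoint-url, end-date) are assumptions based on that pattern rather than a documented interface.

```python
# Sketch of reading a collection's endpoint register. The path and column
# names are assumptions about a typical collection repo, not a documented API.
import csv
from pathlib import Path


def load_endpoints(repo_root: str) -> list[dict]:
    """Return the endpoints that are still active (no end-date)."""
    endpoint_file = Path(repo_root) / "collection" / "endpoint.csv"
    with endpoint_file.open(newline="") as f:
        return [row for row in csv.DictReader(f) if not row.get("end-date")]


for endpoint in load_endpoints("conservation-area-collection"):
    print(endpoint["endpoint"], endpoint["endpoint-url"])
```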
The data management team is responsible for adding data to the platform and maintaining it once it’s there; see here for the list of team responsibilities in the Planning Data Service Handbook.
Resources
Once an endpoint is added to our data processing pipeline it is checked each night for the latest data. When an endpoint is added for the first time we take a copy of the data; this unique copy is referred to as a resource. If the pipeline detects any change in the data, no matter how small, it saves a new copy of all the data from the endpoint, creating a new resource. Each resource is given a unique reference which we can use to identify it.
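One way to picture “any change, no matter how small, creates a new resource” is to derive the resource reference from a hash of the downloaded content. Treat this as a sketch: the use of SHA-256 here is an assumption about how the unique reference is produced, and the sample data is invented.

```python
# Sketch: a resource is a snapshot of an endpoint's data, identified here by a
# content hash so that any change, however small, produces a new resource.
# The choice of SHA-256 is an assumption, not a statement of the real scheme.
import hashlib


def resource_reference(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()


previous = resource_reference(b"reference,name\nAr4.28,Smithy Lane\n")
latest = resource_reference(b"reference,name\nAr4.28,Smithy  Lane\n")  # one extra space
assert previous != latest  # even a whitespace change yields a new resource
```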
Facts
The data from each resource is saved as a series of facts. If we imagine a resource as a table of data, then each combination of entry (row) and field (column) generates a separate fact: a record of the value for that entry and field. For example, if a table has a field called “reference” and the value of that field for the first entry is “Ar4.28”, we record the name of the field and its value, along with a unique reference for this fact. You can see how this appears in our system here.
So a table with 10 rows and 10 columns would generate 100 facts. And each time data changes on an endpoint, all of the facts for the new resource are recorded again, including any new facts. We can use these records to trace back through the history of data from an endpoint.
A fact has the following attributes:
- fact - UUID, primary key on fact table in database
- entity - optional, numeric ID, entity to which fact applies
- start-date - optional, date at which fact begins to apply (not date at which fact is created within data platform)
- end-date - optional, date at which fact ceases to apply
- entry-date - optional, date at which fact was first collected
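As a minimal sketch, and assuming the field name and value described above are stored alongside these attributes, a fact and the “one fact per entry/field combination” rule could be modelled like this (the class and helper are illustrative, not the platform’s actual schema):

```python
# Illustrative only: a Fact record mirroring the attributes listed above,
# plus the field name and value that the Facts section describes.
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fact:
    fact: str                       # UUID, primary key on the fact table
    field: str                      # e.g. "reference"
    value: str                      # e.g. "Ar4.28"
    entity: Optional[int] = None    # entity to which the fact applies
    start_date: Optional[str] = None
    end_date: Optional[str] = None
    entry_date: Optional[str] = None


def facts_from_resource(rows: list[dict], entry_date: str) -> list[Fact]:
    """Each combination of entry (row) and field (column) becomes one fact."""
    return [
        Fact(fact=str(uuid.uuid4()), field=field, value=value, entry_date=entry_date)
        for row in rows
        for field, value in row.items()
    ]


rows = [{"reference": "Ar4.28", "name": "Smithy Lane"}]
print(len(facts_from_resource(rows, "2024-01-01")))  # 1 row x 2 fields = 2 facts
```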
Entities
An entity is the basic unit of data within the platform. It can take on one of many types defined by digital-land/specification/typology.csv. An entity has the following attributes:
- entity - incrementing numeric ID, manually assigned on ingest; different numeric ranges represent different datasets; primary key on entity table in SQLite and PostGIS databases
- start-date - optional, date at which entity comes into existence (not date at which entity is created within data platform)
- end-date - optional, date at which entity ceases to exist
- entry-date - optional, date at which entity was first collected
- dataset - optional, name of dataset (which should correspond to the dataset field in digital-land/specification/dataset.csv) to which entity belongs
- geojson - optional, a JSON object conforming to the RFC 7946 specification which specifies the geographical bounds of the entity
- typology - optional, the type of the entity, which should correspond to the typology field in digital-land/specification/typology.csv
- json - optional, a JSON object containing metadata relating to the entity
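A minimal sketch of an entity record with the attributes listed above; the class, the example entity number and the typology value are illustrative assumptions rather than the platform’s actual schema.

```python
# Illustrative only: an Entity record mirroring the attributes listed above.
# The entity number and typology value in the example are assumptions.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entity:
    entity: int                      # numeric ID; different ranges represent different datasets
    dataset: Optional[str] = None    # e.g. "conservation-area"
    typology: Optional[str] = None   # should match a typology in typology.csv
    start_date: Optional[str] = None
    end_date: Optional[str] = None
    entry_date: Optional[str] = None
    geojson: Optional[dict] = None   # RFC 7946 object giving the geographical bounds
    json: Optional[dict] = None      # other metadata relating to the entity


example = Entity(
    entity=44000001,                 # hypothetical number within a dataset's range
    dataset="conservation-area",
    typology="geography",            # assumed typology for this dataset
    geojson={"type": "Point", "coordinates": [-0.1, 51.5]},
)
```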
Facts collected from resources are assigned to entities based on a combination of the reference of the record in the resource, the organisation that provided the resource, and the dataset it belongs to (this needs more clarification, or a link out to more detail somewhere).
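A sketch of that matching step, assuming the combination of dataset, providing organisation and reference is held in some kind of lookup; the dictionary, organisation identifier and entity number below are placeholders, not the platform’s real lookup store.

```python
# Sketch of assigning incoming records to entities. The key combines the
# dataset, the organisation that provided the resource and the record's
# reference; the dictionary and example values are placeholders.
lookups: dict[tuple[str, str, str], int] = {
    ("article-4-direction-area", "local-authority:ABC", "Ar4.28"): 7010000001,
}


def entity_for(dataset: str, organisation: str, reference: str) -> int | None:
    return lookups.get((dataset, organisation, reference))


print(entity_for("article-4-direction-area", "local-authority:ABC", "Ar4.28"))  # 7010000001
```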
So as well as the default(?) attributes above, an entity in the article-4-direction-area dataset can also have attributes like permitted-development-rights and notes.
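For illustration only, such an entity might carry those dataset-specific attributes alongside the core ones; every value below is invented.

```python
# Invented example: dataset-specific attributes such as
# permitted-development-rights and notes sitting alongside the core
# attributes of an article-4-direction-area entity.
article_4_direction_area_entity = {
    "entity": 7010000001,            # hypothetical ID within the dataset's range
    "dataset": "article-4-direction-area",
    "reference": "Ar4.28",
    "permitted-development-rights": "Class A",                  # invented value
    "notes": "Direction covers the terraces on Smithy Lane.",   # invented value
}
```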