Purpose

This tutorial isn’t tied to a particular task. Instead, it aims to show the data that is produced when a new endpoint is added to our configuration.

You can follow it before or after you’ve been through the adding data process yourself. By following the links here, you’ll start to build up a more complete picture of how data is collected, processed and published on our platform, which will help you understand the adding data process better.

If you’ve not added an endpoint yourself yet, have a look at the example Pull Request in the Configuration section below, which was made to add a new endpoint for Southwark.

Otherwise, if you have recently added an endpoint yourself, you could follow the links in the Platform and Datasette sections and update the queries to check your own endpoint instead.

The data flow

Configuration

See the following PR: https://github.com/digital-land/config/pull/463

It adds a new tree-preservation-zone endpoint for Southwark.

Three files are updated:

  • collection/endpoint.csv
  • collection/source.csv
  • pipeline/lookup.csv

These are the key files which tell the pipeline where to collect data from and how to process it. Once these changes are made in the config repo, the overnight pipeline run will collect any new data from the endpoint.
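As a rough illustration of how these files relate to each other, the sketch below checks that every endpoint referenced in source.csv also has a row in endpoint.csv. The shared column name `endpoint`, the helper name `missing_endpoints` and the sample rows are all assumptions made for illustration, so check the real headers in the config repo before relying on it.

```python
import csv
import io


def missing_endpoints(endpoint_csv: str, source_csv: str, key: str = "endpoint") -> list[str]:
    """Return endpoint values used in source.csv that have no row in endpoint.csv."""
    known = {row[key] for row in csv.DictReader(io.StringIO(endpoint_csv))}
    missing = []
    for row in csv.DictReader(io.StringIO(source_csv)):
        value = row[key]
        # Skip blank values and avoid reporting the same endpoint twice.
        if value and value not in known and value not in missing:
            missing.append(value)
    return missing


# Made-up sample data; the real files have more columns and hashed identifiers.
ENDPOINT_CSV = "endpoint,endpoint-url\nabc123,https://example.org/tpz.csv\n"
SOURCE_CSV = "source,endpoint,organisation\ns1,abc123,org-a\ns2,def456,org-a\n"

print(missing_endpoints(ENDPOINT_CSV, SOURCE_CSV))  # prints ['def456']
```

To run the same check against a local checkout of the config repo, read the two files from the collection directory and pass their contents in as strings.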

Platform

You can see this endpoint is live on our different services:

Datasette

This is how we interact with the dataset files which are produced by the pipeline, as well as some other databases which summarise data that’s useful for reporting.

Performance tables

https://datasette.planning.data.gov.uk/performance

These tables summarise information about provisions, endpoints, resources, and issues to make it easier to see what’s going on at a high level.

Check for the Southwark endpoint that’s been added to see how it’s captured in these tables.

  • provision_summary table can be queried by organisation to see all of the data we’re expecting an organisation to provide, and its status.
  • reporting_historic_endpoints shows all of the Southwark tree-preservation-zone (TPZ) endpoints that we’ve added, and the resources that have been collected from them. Take a look at the start and end dates for endpoints and resources to see the changes that have happened over time. You can query this table to find provisions based on the dataset and organisation.
  • Find the endpoint in the endpoint_dataset_issue_type_summary table. This shows any issues that have been logged for the endpoint and all of its resources.
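The tables above can also be queried through Datasette’s JSON API rather than the web interface. The sketch below builds a filtered URL for one of them. The `__exact` filter suffix and `_shape=array` option are standard Datasette features. The helper name `table_json_url`, the column names `organisation` and `dataset`, and the organisation identifier are assumptions, so check them against the table before use.

```python
from urllib.parse import urlencode

BASE_URL = "https://datasette.planning.data.gov.uk"


def table_json_url(database: str, table: str, **filters: str) -> str:
    """Build a Datasette JSON URL for a table, filtered by exact column matches."""
    params = {f"{column}__exact": value for column, value in filters.items()}
    params["_shape"] = "array"  # ask Datasette for a plain JSON list of rows
    return f"{BASE_URL}/{database}/{table}.json?{urlencode(params)}"


url = table_json_url(
    "performance",
    "provision_summary",
    organisation="local-authority:SWK",  # assumed identifier format for Southwark
    dataset="tree-preservation-zone",    # assumed column name
)
print(url)
```

The resulting URL can be opened in a browser or fetched with `urllib.request.urlopen` and parsed with `json.load`. Swapping in a different table name or organisation lets you repeat the check for an endpoint you added yourself.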

tree-preservation-zone dataset

Every night the pipeline runs and uses the configuration data to collect and process all of the data for each dataset. It produces a .sqlite3 database for each separate dataset; these databases have a common architecture across all datasets and store all of the data which is then served on the platform.

This is the datasette page for the TPZ dataset: https://datasette.planning.data.gov.uk/tree-preservation-zone

Explore each of the tables to see how they relate to the configuration changes that were made above, and the data that the pipeline processed from Southwark’s new endpoint after the changes were made. We’ll use one of the resources that was collected from the Southwark TPZ endpoint as an example: ff3c0871f2b415dea86fffa4058633d914c01b0df048077ad48ad4dda6ca0db0.

  • dataset_resource shows some processing information about the resource.
  • column_field shows how the columns in the resource have been mapped to the fields of our data specification.
  • issue shows any issues that have been logged while processing the resource.
  • fact records all of the facts that have been linked to an entity.
  • fact_resource lets you trace these facts back to particular resources: it records the hashes of all the facts that were recorded from each resource.
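The link between fact and fact_resource can be sketched as a join. The example below uses a small in-memory SQLite database with simplified stand-in tables. The column names (`fact`, `entity`, `field`, `value`, `resource`), the helper name `facts_for_resource` and the sample rows are assumptions for illustration, not the real schema, so compare them with the table definitions shown in Datasette.

```python
import sqlite3

# The example resource hash used in this tutorial.
RESOURCE = "ff3c0871f2b415dea86fffa4058633d914c01b0df048077ad48ad4dda6ca0db0"

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Simplified stand-ins for the real tables; column names are assumptions.
    CREATE TABLE fact (fact TEXT PRIMARY KEY, entity INTEGER, field TEXT, value TEXT);
    CREATE TABLE fact_resource (fact TEXT, resource TEXT);
    """
)
# Made-up rows: two facts recorded from our resource, one from another resource.
conn.executemany(
    "INSERT INTO fact VALUES (?, ?, ?, ?)",
    [
        ("f1", 1, "name", "Example zone"),
        ("f2", 1, "reference", "TPZ-1"),
        ("f3", 2, "name", "Other zone"),
    ],
)
conn.executemany(
    "INSERT INTO fact_resource VALUES (?, ?)",
    [("f1", RESOURCE), ("f2", RESOURCE), ("f3", "another-resource-hash")],
)


def facts_for_resource(conn: sqlite3.Connection, resource: str) -> list[tuple]:
    """Return (entity, field, value) for every fact recorded from one resource."""
    query = """
        SELECT f.entity, f.field, f.value
        FROM fact_resource AS fr
        JOIN fact AS f ON f.fact = fr.fact
        WHERE fr.resource = ?
        ORDER BY f.entity, f.field
    """
    return conn.execute(query, (resource,)).fetchall()


rows = facts_for_resource(conn, RESOURCE)
for row in rows:
    print(row)
```

A query of the same shape can be pasted into the SQL box on the Datasette page for the dataset, with the column names adjusted to match the real tables.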

For more detail on the relationship between resources, facts and entities, see this blog post