Monitoring Data Quality
This page explains the processes we follow to fix the data quality issues that we actively monitor for.
Each section covers a different issue, each of which is defined by our data quality requirements.
Unknown entities
To keep the datasets up-to-date on the platform, we need to check “unknown entity” issues every week and assign entities.
An unknown entity issue usually occurs when an LPA updates the data on an endpoint we are retrieving from and adds new records. These records have reference values that are not yet on the platform, so when the system detects the new data and cannot match the new references, it raises an unknown entity issue.
The datasets that require assigning entities are categorised into three main scopes:
ODP Datasets – These datasets are supported by ODP funding. Datasets categorised as ODP can be found in ODP Data.
Mandated Datasets – These are datasets that LPAs are legally required to provide; this includes the brownfield-land and developer-contributions datasets.
Single Source Datasets – This category includes data obtained from authoritative sources or seeded data received from the Data Design Team.
The recommended steps to resolve this are as follows:
1. Set up the Config repo
Clone the Config repository if you have not already done so, then create and activate a virtual environment.
2. Run the script
Run the script with the command `python3 batch_assign_entities.py`. Upon execution, the script downloads the `issue_summary.csv` file to the root directory of the Config folder. The downloaded `issue_summary.csv` includes a column called `scope`, which indicates the scope of each dataset using the categories described above: ODP, Mandated and Single source.
3. Analyse unknown entity issues
Open `issue_summary.csv` and apply a filter to the `scope` column to display only entries related to ODP. Begin by analysing all unknown entity issues associated with the ODP scope (a sketch of this triage follows these steps). If the `count_issue` for any dataset is unusually high, verify that the entities are valid and new; `count_issue` may also be high if the LPA has recently changed the references for existing entities. Keep a note of endpoints with an unusually high `count_issue` to review once the entities have been assigned.
The command will prompt you to confirm; type “yes” to assign unknown entities for ODP. It will then prompt you to enter a scope (odp/mandated/single-source); type “odp” to assign entities.
The script downloads all the resources for unknown entities into a resources folder, assigns the entities automatically, and then deletes the downloaded resource files. The affected dataset’s `lookup.csv` should now have new rows with the assigned entities; the number of entities that needed to be assigned should match the number of rows added to the lookup file.
Review the entities assigned for the endpoints you noted. The key thing to check is whether the references are a continuation of, or follow a similar format to, the existing lookups for that provision.
Note: If the entities belong to the Conservation Area dataset, you should check for duplicates using the endpoint checker (refer to Step 3 in Validating an endpoint). Once the new entries for `lookup.csv` have been generated, use the outputs from the “Duplicates with different entity numbers” section of the endpoint checker to replace the newly generated entity numbers for any duplicates with the entity numbers of the existing entities they match.
4. Assign entities for Mandated and Single source datasets
Repeat Step 3 for the Mandated and Single source scopes, typing “mandated” or “single-source” at the scope prompt as required.
5. Merge changes
Raise a PR and merge it after it has been reviewed. Create a new sheet in this Google doc and rename the sheet with the sprint start date. Paste in the rows of “unknown entity” issues you have resolved; this will help track the datasets you have updated and ensure they are noted for future review.
6. Review changes
Once merged, use the `endpoint_dataset_issue_type_summary` table to check whether the previous unknown entity issues are resolved. Note in the sheet if you were not able to assign entities for any LPA.
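A minimal sketch of the Step 3 triage, assuming the `scope` and `count_issue` columns named above, a hypothetical `dataset` column, and a purely illustrative threshold for what counts as “unusually high”:

```python
# triage_unknown_entities.py — a minimal sketch, not part of the Config repo.
# Assumes issue_summary.csv has "dataset", "scope" and "count_issue" columns;
# "scope" and "count_issue" are named in this guide, "dataset" is an assumption.
import pandas as pd

HIGH_COUNT_THRESHOLD = 50  # illustrative; judge "unusually high" per dataset

df = pd.read_csv("issue_summary.csv")

# Start with the ODP scope, as described in Step 3
odp = df[df["scope"] == "ODP"]

# Endpoints to re-review once the entities have been assigned
flagged = odp[odp["count_issue"] > HIGH_COUNT_THRESHOLD]
print(flagged.sort_values("count_issue", ascending=False))
```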
Success criteria:
Ideally, the number of unknown entity errors should be zero after completing the above steps.
Check deleted entities
To keep the datasets up-to-date on the platform, we need to check every week for entities that have been deleted from the latest resource. This occurs when an LPA has deleted entities on their endpoint without telling us. Once we have identified which entities appear to have been deleted, we contact the LPA to confirm; once we have their confirmation, we can retire the entities.
The recommended steps to resolve this are as follows:
- Run this report
- For each dataset, compare the `Latest resource entity count` with the `Platform entity count`. Make a note of any dataset where the platform count is higher than the latest resource count (see the sketch after these steps).
- List those that need to be retired here. You will want the collection, endpoint, and source.
- Contact the LPAs. Once they confirm that these entities were deleted, follow the retire entities process.
- Note in the sheet if an entity could not be retired.
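A minimal sketch of that comparison, assuming the report can be exported as a CSV with hypothetical `dataset`, `latest_resource_entity_count` and `platform_entity_count` columns:

```python
# compare_entity_counts.py — a sketch only; the column names are assumptions,
# check them against the actual report export before relying on this.
import pandas as pd

df = pd.read_csv("entity_count_report.csv")  # hypothetical export of the report

# Datasets where the platform holds more entities than the latest resource:
# candidates for retirement once the LPA confirms the deletions.
candidates = df[df["platform_entity_count"] > df["latest_resource_entity_count"]]
print(candidates[["dataset", "latest_resource_entity_count", "platform_entity_count"]])
```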
Success criteria:
The count of entities on the platform and on the latest resource should be the same. Re-run the report to make sure the counts match.
Retire broken, non-primary endpoints
Trigger
We define an endpoint as “broken” once it has been logged with a non-200 status for more than 30 consecutive days. We proactively end-date broken endpoints and their sources when they are not the primary endpoint for a provision, i.e. the endpoint is not the only or most recently added endpoint for a provision.
e.g. Wiltshire Council has two active `brownfield-land` endpoints, one added in 2025 and one added in 2024. This makes the 2025 endpoint the primary endpoint. The 2024 endpoint has had a 404 status for 62 days, so it should be given an end-date.
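A minimal sketch of that test, assuming a simple per-day status log (the real status history comes from the collection logs, which this sketch does not query):

```python
# broken_endpoint_check.py — illustrative only; the real status history comes
# from the collection logs, which this sketch does not query.
from datetime import date, timedelta

def days_broken(log: list[tuple[date, int]]) -> int:
    """Count the most recent run of consecutive daily non-200 statuses."""
    run = 0
    for _, status in sorted(log, reverse=True):  # newest entry first
        if status == 200:
            break
        run += 1
    return run

# Example: an endpoint that has returned 404 every day for 62 days
log = [(date(2025, 4, 1) + timedelta(days=i), 404) for i in range(62)]
print(days_broken(log) > 30)  # True -> treat the endpoint as broken
```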
NOTE
This applies to all datasets, including ODP, statutory and single-source.
Task
- Identify broken, non-primary endpoints through this datasette query.
- Download the query results as a csv file called `retire.csv` in the root of your local `config` directory.
- Run the following command:
digital-land retire-endpoints-and-sources retire.csv
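Before running the command, it can help to sanity-check `retire.csv`. A minimal sketch; the exact columns required by `digital-land retire-endpoints-and-sources` should be taken from the datasette query export, not from this example:

```python
# preview_retire.py — a quick sanity check before retiring; illustrative only.
import csv
from pathlib import Path

path = Path("retire.csv")  # root of your local config directory
assert path.exists(), "download the datasette results as retire.csv first"

with path.open() as f:
    rows = list(csv.DictReader(f))

print(f"{len(rows)} rows to retire; columns: {list(rows[0]) if rows else 'none'}")
for row in rows[:5]:  # eyeball the first few before end-dating anything
    print(row)
```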
Test
Once the changes have been merged into main, the endpoints and sources you retired should no longer appear in the datasette query results.
Identify new data sources for broken, primary endpoints
Trigger
We define an endpoint as “broken” once it has been logged with a non-200 status for more than 30 consecutive days. We call an endpoint “primary” when it is the most recently added, or the only, endpoint for a provision.
e.g.
- the `archaeological-priority-area` dataset has one endpoint; this is the primary endpoint
- the `local-authority-district` dataset has 3 active endpoints; the most recently added one is the primary endpoint
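Under this definition, selecting the primary endpoint is just a matter of taking the only, or most recently added, active endpoint. A minimal sketch, with assumed field names:

```python
# primary_endpoint.py — illustrative only; the field names are assumptions.
from datetime import date

endpoints = [
    {"endpoint_url": "https://example.org/data-2024.csv", "entry_date": date(2024, 5, 1)},
    {"endpoint_url": "https://example.org/data-2025.csv", "entry_date": date(2025, 2, 1)},
]

# The only endpoint, or the most recently added one, is the primary endpoint.
primary = max(endpoints, key=lambda e: e["entry_date"])
print(primary["endpoint_url"])  # -> the 2025 endpoint
```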
For ODP datasets, endpoint errors are raised back to data providers through the Submit service, so we don’t need to do anything.
However, when non-ODP datasets have broken, primary endpoints we should search for alternatives.
Task
- Identify broken, primary endpoints through this datasette query.
- For any broken endpoints, search the data provider’s website for any newly published endpoint URLs (the source URL for the broken endpoint should take you to the correct site).
- If you find any newer endpoints, add an end-date for the broken endpoint and source and follow the adding data process to add the new one.
Test
Once the changes have been merged into main, the primary endpoint for the provision should no longer appear in the datasette query.
Identify new data sources for stale endpoints
Trigger
We define an endpoint as “stale” when it has not been updated with new data within the time period we expect.
e.g. the source of the latest endpoint we have for `flood-risk-zone` data published by the Environment Agency states that the dataset is updated quarterly. If the start date of the latest resource is 01/01/2024 and today’s date is 30/06/2024, there hasn’t been an update for 6 months, so we would say this endpoint is stale.
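A minimal sketch of that staleness test, assuming the expected update frequency is expressed in days:

```python
# stale_endpoint_check.py — illustrative; the frequency value is an assumption.
from datetime import date

EXPECTED_FREQUENCY_DAYS = 91  # roughly quarterly, per the example above

latest_resource_start = date(2024, 1, 1)
today = date(2024, 6, 30)

is_stale = (today - latest_resource_start).days > EXPECTED_FREQUENCY_DAYS
print(is_stale)  # True -> look for a newly published endpoint
```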
NOTE
For our compiled datasets, local planning authorities are responsible for updating endpoints or publishing new ones for new datasets, so we don’t monitor for staleness. For our single source datasets (i.e. those with national coverage from a single data provider) we need to check whether we have added the most up-to-date data.
Task
- Check for any stale endpoints by running the monitor frequency of datasets report. This will identify any endpoints which have not been updated within the expected time period.
- For any identified datasets you should check to see whether the data provider has published more up to date data on a new endpoint. You can use the source of existing endpoints to find their website.
- If you find a new endpoint you will need to add it. Check the “new endpoint for existing provision” scenario on the Adding data page to find the steps to follow to retire old endpoints, add the new one, and assign any new entities if required.
Test
Once you’ve added the new endpoint and merged the changes, re-run the monitor frequency of datasets report; the dataset you’ve updated should no longer be in the list.
Out of range entities
Trigger
This is a configuration error where the entity numbers that have been used in a dataset are not within the range defined for that dataset. These issues will be raised in the issue report for ODP datasets or all datasets, where the `issue_type` = “entity number out of range”.
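A minimal sketch of the check behind this issue; the range values here are made up, and the real ranges live in the platform configuration:

```python
# entity_range_check.py — illustrative; these range values are made up.
# Each dataset has a defined entity-number range in the platform configuration.
DATASET_RANGES = {"brownfield-land": (1_800_000, 1_899_999)}  # hypothetical

def out_of_range(dataset: str, entity: int) -> bool:
    lo, hi = DATASET_RANGES[dataset]
    return not (lo <= entity <= hi)

print(out_of_range("brownfield-land", 1_900_005))  # True -> raises this issue
```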
Task
To fix this, for each dataset with issues you should:
- Delete the entries in `lookup.csv` which use an incorrect entity number.
- Follow the assign entities process to assign new entity numbers and replace the deleted lookup entries.
Test
Once fixed, there should no longer be any issues raised in the issue report.
Invalid Organisations
One of our monitoring tasks is patching any invalid organisation issues that arise. This usually happens when the organisation value provided in the endpoint is wrong or missing, e.g. it could be a blank field or the wrong organisation name / identifier.
A list of invalid organisation issues can be obtained by downloading the issue report for ODP datasets or all datasets and filtering for invalid organisation under `issue-type`.
To fix this, we can make use of the `patch.csv` file. More information on how this file works can be found in the pipeline/patch section in configure an endpoint.
For example, if we were given the wrong `OrganisationURI` in a `brownfield-land` dataset, we can patch it by targeting the endpoint, giving the current URI in the `pattern` section and the desired URI in the `value` section, like so:
brownfield-land,,OrganisationURI,http://opendatacommunities.org/id/london-borough-council/hammersmith-and-,http://opendatacommunities.org/doc/london-borough-council/hammersmith-and-fulham,,,,,890c3ac73da82610fe1b7d444c8c89c92a7f368316e3c0f8d2e72f0c439b5245
To test it, follow the guidance in building a collection locally but keep the new patch entry and focus on the desired endpoint.
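Conceptually, each `patch.csv` row rewrites field values that match `pattern` into `value`. A minimal sketch of that substitution with made-up URIs, assuming a regex-style match; the real behaviour is implemented by the pipeline’s patch phase, not this code:

```python
# patch_sketch.py — conceptual only; the real behaviour is implemented by the
# pipeline (see the pipeline/patch guidance), and the URIs here are made up.
import re

pattern = r"http://example\.org/id/.*"      # the patch row's "pattern" column
value = "http://example.org/doc/canonical"  # the patch row's "value" column

def apply_patch(field_value: str) -> str:
    # Replace any field value matching the pattern with the desired value
    return value if re.fullmatch(pattern, field_value) else field_value

print(apply_patch("http://example.org/id/old-identifier"))  # -> the canonical URI
```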
Organisational changes
When organisations are created or ended we need to:
- Create new organisation entities for any newly created organisations.
- Add an end-date to the entities for any organisations which have been ended.
- If appropriate, make sure any entities that the ended organisations were responsible for are moved to the new responsible organisations.
Note, this guidance relates to local-authority organisation changes, which have the most impact on ODP datasets.
Trigger
We will know that organisations need to be ended when changes are announced by [governing body]. These may involve multiple existing councils being replaced by a single unitary authority, or other variations of changes.
We may be notified of changes like this by team members, in which case we should act immediately.
Otherwise, we make sure we are alerted to any possible changes by monitoring updates to organisation typology datasets. Since changes to local authority organisations are expected at least once each year, when the time between the latest resource start date and the current day’s date is greater than a year we should check for any changes which need to be reflected in the dataset. This test is defined by data-quality need `T-05`.
Task
- Use the dataset editor to add new records for new organisations. You may need to cross-reference some other datasets to add all the necessary details, for instance the ONS codes for the local authority and the local planning authority.
- Use the dataset editor to add an end-date for any organisations which have been terminated. Use the end date from the official announcement.
The next step will vary depending on how the local authorities transition existing datasets to the new organisation.
If there is not a new endpoint from the new organisation
Sometimes it may take a long time for data to be transitioned to a newly created organisation. In that case, the existing endpoints from the old organisation should be kept live and maintained until there is a new one for the new organisation, at which point the process below can be followed.
If there is a new endpoint from the new organisation
- Follow the standard process for validating and adding a new endpoint. During the validation step, use the duplicate check to look for any duplicates with existing data. This should highlight any existing entities that the new records match to. For any matches, existing entity numbers can be used instead of the new ones generated by the `add-data` process. Any new entities from the new organisation which don’t match can be given new entity numbers.
- Any existing entities from the old organisation which haven’t been matched to new records from the new organisation should be given an end-date [note: we don’t currently have a process for this].
- The entity-organisation range should be assigned to the new organisation for any of the entity numbers which are now being used for the new organisation’s records.
- Retire any endpoints for the old organisation’s provisions so they are no longer collected.