Monitoring Data Quality
This page explains the processes we follow to fix the data quality issues that we actively monitor for.
Each section covers a different issue, each of which is defined by our data quality requirements.
Unknown entities
To keep the datasets up-to-date on the platform, we need to check “unknown entity” issues every week and assign entities.
An unknown entity issue usually occurs when an LPA updates the data on an endpoint we are retrieving from and adds new records. These records have reference values that are not yet on the platform, so when the system detects the new data and cannot match the new references, it raises an unknown entity issue.
The datasets that require assigning entities are categorised into three main scopes:
ODP Datasets – These datasets are supported by ODP funding. Datasets categorised as ODP can be found in ODP Data.
Mandated Datasets – These are datasets that LPAs are legally required to provide; this includes the brownfield-land and developer-contributions datasets.
Single Source Datasets – This category includes data obtained from authoritative sources or seeded data received from the Data Design Team.
The recommended steps to resolve this are as follows:
1. Set up the Config repo
Clone the Config repository if you have not already done so, then create and activate a virtual environment.
2. Run the script
Run the script with the command `python3 batch_assign_entities.py`. Upon execution, the script downloads the `issue_summary.csv` file to the root directory of the Config folder. The downloaded `issue_summary.csv` includes a column called `scope`, which indicates the scope of each dataset using the categories described above: ODP, Mandated and Single source.
3. Analyse unknown entity issues
Open `issue_summary.csv` and apply a filter to the `scope` column to display only entries related to ODP. Begin by analysing all unknown entity issues associated with the ODP scope (a sketch of this triage follows these steps). If the `count_issue` for any dataset is unusually high, verify that the entities are valid and new; `count_issue` may also be high if the LPA has recently changed the references for existing entities. Keep a note of endpoints with an unusually high `count_issue` to review once the entities have been assigned.
The command will prompt you to confirm; type “yes” to assign unknown entities for ODP. It will then prompt you to enter a scope (odp/mandated/single-source); type “odp” to assign entities.
The script downloads all the resources for unknown entities into a resources folder, assigns the entities automatically, and then deletes the downloaded resource files. The affected dataset’s `lookup.csv` should now have new rows with the assigned entities; the number of entities that needed to be assigned should match the number of rows added to the lookup file.
Review the entities assigned for the endpoints you noted. The key thing to check is whether the references are a continuation of, or follow a similar format to, the existing lookups for that provision.
Note: If the entities belong to the Conservation Area dataset, you should check for duplicates using the endpoint checker (refer to Step 3 in Validating an endpoint). Once the new entries for `lookup.csv` have been generated, use the outputs from the “Duplicates with different entity numbers” section of the endpoint checker to replace the newly generated entity numbers for any duplicates with the entity numbers of the existing entities they match.
4. Assign entities for Mandated and Single source datasets
Repeat Step 3 for the Mandated and Single source scopes, typing “mandated” or “single-source” at the scope prompt as required.
5. Merge changes
Raise a PR and merge it after it has been reviewed. Create a new sheet in this Google doc and rename the sheet with the sprint start date. Paste in the rows of “unknown entity” issues you have resolved; this will help track the datasets you have updated and ensure they are noted for future review.
6. Review changes
Once merged, use the `endpoint_dataset_issue_type_summary` table to check whether the previous unknown entity issues are resolved. Note in the sheet if you were not able to assign entities for any LPA.
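A minimal sketch of the Step 3 triage, assuming the `scope` and `count_issue` columns named above, a hypothetical `dataset` column, and a purely illustrative threshold for what counts as “unusually high”:

```python
# triage_unknown_entities.py — a minimal sketch, not part of the Config repo.
# Assumes issue_summary.csv has "dataset", "scope" and "count_issue" columns;
# "scope" and "count_issue" are named in this guide, "dataset" is an assumption.
import pandas as pd

HIGH_COUNT_THRESHOLD = 50  # illustrative; judge "unusually high" per dataset

df = pd.read_csv("issue_summary.csv")

# Start with the ODP scope, as described in Step 3
odp = df[df["scope"] == "ODP"]

# Endpoints to re-review once the entities have been assigned
flagged = odp[odp["count_issue"] > HIGH_COUNT_THRESHOLD]
print(flagged.sort_values("count_issue", ascending=False))
```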
Success criteria:
Ideally, the number of unknown entity errors should be zero after completing the above steps.
Check deleted entities
To keep the datasets up-to-date on the platform, we need to check every week for entities that have been deleted from the latest resource. This occurs when an LPA has deleted entities on their endpoint without telling us. Once we have identified which entities appear to have been deleted, we contact the LPA to confirm; once we have their confirmation, we can retire the entities.
The recommended steps to resolve this are as follows:
- Run this report
- For each dataset, compare the `Latest resource entity count` with the `Platform entity count`. Make a note of any dataset where the platform count is higher than the latest resource count (see the sketch after these steps).
- List those that need to be retired here. You will want the collection, endpoint, and source.
- Contact the LPAs. Once they confirm that these entities were deleted, follow the retire entities process.
- Note in the sheet if an entity could not be retired.
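A minimal sketch of that comparison, assuming the report can be exported as a CSV with hypothetical `dataset`, `latest_resource_entity_count` and `platform_entity_count` columns:

```python
# compare_entity_counts.py — a sketch only; the column names are assumptions,
# check them against the actual report export before relying on this.
import pandas as pd

df = pd.read_csv("entity_count_report.csv")  # hypothetical export of the report

# Datasets where the platform holds more entities than the latest resource:
# candidates for retirement once the LPA confirms the deletions.
candidates = df[df["platform_entity_count"] > df["latest_resource_entity_count"]]
print(candidates[["dataset", "latest_resource_entity_count", "platform_entity_count"]])
```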
Success criteria:
The count of entities on the platform and on the latest resource should be the same. Re-run the report to make sure the counts match.
Retire broken, non-primary endpoints
Trigger
We define an endpoint as “broken” once it has been logged with a non-200 status for more than 30 consecutive days. We proactively end-date broken endpoints and their sources when they are not the primary endpoint for a provision, i.e. the endpoint is not the only or most recently added endpoint for a provision.
e.g. Wiltshire Council has two active `brownfield-land` endpoints, one added in 2025 and one added in 2024. This makes the 2025 endpoint the primary endpoint. The 2024 endpoint has had a 404 status for 62 days, so it should be given an end-date.
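A minimal sketch of that test, assuming a simple per-day status log (the real status history comes from the collection logs, which this sketch does not query):

```python
# broken_endpoint_check.py — illustrative only; the real status history comes
# from the collection logs, which this sketch does not query.
from datetime import date, timedelta

def days_broken(log: list[tuple[date, int]]) -> int:
    """Count the most recent run of consecutive daily non-200 statuses."""
    run = 0
    for _, status in sorted(log, reverse=True):  # newest entry first
        if status == 200:
            break
        run += 1
    return run

# Example: an endpoint that has returned 404 every day for 62 days
log = [(date(2025, 4, 1) + timedelta(days=i), 404) for i in range(62)]
print(days_broken(log) > 30)  # True -> treat the endpoint as broken
```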
NOTE
This applies to all datasets, including ODP, statutory and single-source.
Task
- Identify broken, non-primary endpoints through this datasette query.
- Download the query results as a csv file called `retire.csv` in the root of your local `config` directory.
- Run the following command:
digital-land retire-endpoints-and-sources retire.csv
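Before running the command, it can help to sanity-check `retire.csv`. A minimal sketch; the exact columns required by `digital-land retire-endpoints-and-sources` should be taken from the datasette query export, not from this example:

```python
# preview_retire.py — a quick sanity check before retiring; illustrative only.
import csv
from pathlib import Path

path = Path("retire.csv")  # root of your local config directory
assert path.exists(), "download the datasette results as retire.csv first"

with path.open() as f:
    rows = list(csv.DictReader(f))

print(f"{len(rows)} rows to retire; columns: {list(rows[0]) if rows else 'none'}")
for row in rows[:5]:  # eyeball the first few before end-dating anything
    print(row)
```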
Test
Once the changes have been merged into main, the endpoints and sources you retired should no longer appear in the datasette query results.
Identify new data sources for broken, primary endpoints
Trigger
We define an endpoint as “broken” once it has been logged with a non-200 status for more than 30 consecutive days. We call an endpoint “primary” when it is the most recently added, or the only, endpoint for a provision.
e.g.
- the `archaeological-priority-area` dataset has one endpoint; this is the primary endpoint
- the `local-authority-district` dataset has 3 active endpoints; the most recently added one is the primary endpoint
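Under this definition, selecting the primary endpoint is just a matter of taking the only, or most recently added, active endpoint. A minimal sketch, with assumed field names:

```python
# primary_endpoint.py — illustrative only; the field names are assumptions.
from datetime import date

endpoints = [
    {"endpoint_url": "https://example.org/data-2024.csv", "entry_date": date(2024, 5, 1)},
    {"endpoint_url": "https://example.org/data-2025.csv", "entry_date": date(2025, 2, 1)},
]

# The only endpoint, or the most recently added one, is the primary endpoint.
primary = max(endpoints, key=lambda e: e["entry_date"])
print(primary["endpoint_url"])  # -> the 2025 endpoint
```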
For ODP datasets, endpoint errors are raised back to data providers through the Submit service, so we don’t need to do anything.
However, when non-ODP datasets have broken, primary endpoints we should search for alternatives.
Task
- Identify broken, primary endpoints through this datasette query.
- For any broken endpoints, search the data provider’s website for any newly published endpoint URLs (the source URL for the broken endpoint should take you to the correct site).
- If you find any newer endpoints, add an end-date for the broken endpoint and source and follow the adding data process to add the new one.
Test
Once the changes have been merged into main, the primary endpoint for the provision should no longer appear in the datasette query.
Identify new data sources for stale endpoints
Trigger
We define an endpoint as “stale” when it has not been updated with new data within the time period we expect.
e.g. the source of the latest endpoint we have for `flood-risk-zone` data published by the Environment Agency states that the dataset is updated quarterly. If the start date of the latest resource is 01/01/2024 and today’s date is 30/06/2024, there hasn’t been an update for 6 months, so we would say this endpoint is stale.
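A minimal sketch of that staleness test, assuming the expected update frequency is expressed in days:

```python
# stale_endpoint_check.py — illustrative; the frequency value is an assumption.
from datetime import date

EXPECTED_FREQUENCY_DAYS = 91  # roughly quarterly, per the example above

latest_resource_start = date(2024, 1, 1)
today = date(2024, 6, 30)

is_stale = (today - latest_resource_start).days > EXPECTED_FREQUENCY_DAYS
print(is_stale)  # True -> look for a newly published endpoint
```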
NOTE
For our compiled datasets, local planning authorities are responsible for updating endpoints or publishing new ones for new datasets, so we don’t monitor for staleness. For our single source datasets (i.e. those with national coverage from a single data provider) we need to check whether we have added the most up-to-date data.
Task
- Check for any stale endpoints by running the monitor frequency of datasets report. This will identify any endpoints which have not been updated within the expected time period.
- For any identified datasets you should check to see whether the data provider has published more up to date data on a new endpoint. You can use the source of existing endpoints to find their website.
- If you find a new endpoint you will need to add it. Check the “new endpoint for existing provision” scenario on the Adding data page to find the steps to follow to retire old endpoints, add the new one, and assign any new entities if required.
Test
Once you’ve added the new endpoint and merged the changes, re-run the monitor frequency of datasets report; the dataset you’ve updated should no longer be in the list.
Out of range entities
Trigger
This is a configuration error where the entity numbers that have been used in a dataset are not within the range defined for that dataset. These issues will be raised in the issue report for ODP datasets or all datasets, where the `issue_type` = “entity number out of range”.
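A minimal sketch of the check behind this issue; the range values here are made up, and the real ranges live in the platform configuration:

```python
# entity_range_check.py — illustrative; these range values are made up.
# Each dataset has a defined entity-number range in the platform configuration.
DATASET_RANGES = {"brownfield-land": (1_800_000, 1_899_999)}  # hypothetical

def out_of_range(dataset: str, entity: int) -> bool:
    lo, hi = DATASET_RANGES[dataset]
    return not (lo <= entity <= hi)

print(out_of_range("brownfield-land", 1_900_005))  # True -> raises this issue
```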
Task
To fix this, for each dataset with issues you should:
- Delete the entries in `lookup.csv` which use an incorrect entity number.
- Follow the assign entities process to assign new entity numbers and replace the deleted lookup entries.
Test
Once fixed, there should no longer be any issues raised in the issue report.
Invalid Organisations
One of our monitoring tasks is patching any invalid organisation issues that arise. This usually happens when the organisation value provided in the endpoint is wrong or missing, e.g. it could be a blank field or the wrong organisation name / identifier.
A list of invalid organisation issues can be obtained by downloading the issue report for ODP datasets or all datasets and filtering for invalid organisation under `issue-type`.
To fix this, we can make use of the `patch.csv` file. More information on how this file works can be found in the pipeline/patch section in configure an endpoint.
For example, if we were given the wrong `OrganisationURI` in a `brownfield-land` dataset, we can patch it by targeting the endpoint, giving the current URI in the `pattern` section and the desired URI in the `value` section, like so:
brownfield-land,,OrganisationURI,http://opendatacommunities.org/id/london-borough-council/hammersmith-and-,http://opendatacommunities.org/doc/london-borough-council/hammersmith-and-fulham,,,,,890c3ac73da82610fe1b7d444c8c89c92a7f368316e3c0f8d2e72f0c439b5245
To test it, follow the guidance in building a collection locally but keep the new patch entry and focus on the desired endpoint.
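Conceptually, each `patch.csv` row rewrites field values that match `pattern` into `value`. A minimal sketch of that substitution with made-up URIs, assuming a regex-style match; the real behaviour is implemented by the pipeline’s patch phase, not this code:

```python
# patch_sketch.py — conceptual only; the real behaviour is implemented by the
# pipeline (see the pipeline/patch guidance), and the URIs here are made up.
import re

pattern = r"http://example\.org/id/.*"      # the patch row's "pattern" column
value = "http://example.org/doc/canonical"  # the patch row's "value" column

def apply_patch(field_value: str) -> str:
    # Replace any field value matching the pattern with the desired value
    return value if re.fullmatch(pattern, field_value) else field_value

print(apply_patch("http://example.org/id/old-identifier"))  # -> the canonical URI
```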
Organisational changes
When organisations are created or ended we need to:
- Create new organisation entities for any newly created organisations.
- Add an end-date to the entities for any organisations which have been ended.
- If appropriate, make sure any entities that the ended organisations were responsible for are moved to the new responsible organisations.
Note, this guidance relates to local-authority organisation changes, which have the most impact on ODP datasets.
Trigger
We will know that organisations need to be ended when changes are announced by [governing body]. These may involve multiple existing councils being replaced by a single unitary authority, or other variations of changes.
We may be notified of changes like this by team members, in which case we should act immediately.
Otherwise, we make sure we are alerted to any possible changes by monitoring updates to organisation typology datasets. Since changes to local authority organisations are expected at least once each year, when the time between the latest resource start date and the current day’s date is greater than a year we should check for any changes which need to be reflected in the dataset. This test is defined by data-quality need `T-05`.
Task
- Use the dataset editor to add new records for new organisations. You may need to cross-reference some other datasets to add all the necessary details, for instance the ONS codes for the local authority and the local planning authority.
- Use the dataset editor to add an end-date for any organisations which have been terminated. Use the end date from the official announcement.
The next step will vary depending on how the local authorities transition existing datasets to the new organisation.
If there is not a new endpoint from the new organisation
Sometimes it may take a long time for data to be transitioned to a newly created organisation. In that case, the existing endpoints from the old organisation should be kept live and maintained until there is a new one for the new organisation, at which point the process below can be followed.
If there is a new endpoint from the new organisation
- Follow the standard process for validating and adding a new endpoint. During the validation step, use the duplicate check to look for any duplicates with existing data. This should highlight any existing entities that the new records match to. For any matches, existing entity numbers can be used instead of the new ones generated by the `add-data` process. Any new entities from the new organisation which don’t match can be given new entity numbers.
- Any existing entities from the old organisation which haven’t been matched to new records from the new organisation should be given an end-date [note: we don’t currently have a process for this].
- The entity-organisation range should be assigned to the new organisation for any of the entity numbers which are now being used for the new organisation’s records.
- Retire any endpoints for the old organisation’s provisions so they are no longer collected.