Monitoring Data Quality
This section covers how we monitor the quality of data on the platform and fix any issues that arise.
Unknown entities
To keep the datasets up-to-date on the platform, we need to check “unknown entity” issues every week and assign entities.
The unknown entity issue usually occurs when an LPA updates the data on the endpoint we are retrieving from and adds new records. These new records have reference values that are not yet on the platform, so when the system collects the new data and cannot match those references to existing entities, it raises an unknown entity issue.
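As a rough mental model, the check amounts to a set difference between the references an LPA supplies and the references the platform already knows about. The sketch below is illustrative only: the simple CSV inputs and the `reference` column name are assumptions, and the real pipeline's lookup files may be laid out differently.

```python
# A minimal sketch of why "unknown entity" issues fire: any reference in
# the newly collected resource that the platform cannot map to an entity
# is flagged. The file layout and "reference" column name are assumptions
# for illustration.
import csv

def unknown_references(resource_csv: str, lookup_csv: str) -> set[str]:
    with open(lookup_csv, newline="") as f:
        known = {row["reference"] for row in csv.DictReader(f)}
    with open(resource_csv, newline="") as f:
        supplied = {row["reference"] for row in csv.DictReader(f)}
    return supplied - known  # references with no entity on the platform
```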
The recommended steps to resolve this are as follows:
- Go to the report found here
- Download the CSV file with ‘Download Current Table’, which will download a file called `odp_issues`
- Analyse the “unknown entity” issues. Look for `issue_type` with the value `unknown entity` (see the filtering sketch after this list)
- If possible, for each of the issues you have identified, follow the steps in Assign entities. Keep track of the row of the issue.
- Raise a PR and merge it
- Once merged, run the issue report again and check whether the previous unknown entity issues are resolved. Paste the rows of “unknown entity” issues you have tracked into this Google doc after the changes are merged on the platform
- Note in the sheet if you are not able to assign entities for any LPA
- Insert a new workbook with the next sprint date
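If it helps to work with the download programmatically, here is a minimal sketch of the filtering step, assuming the file is saved as `odp_issues.csv` and that the `issue_type` column is named as shown in the report; adjust the names to match the real export.

```python
# Filter the downloaded report for "unknown entity" issues and keep a
# copy of the rows so they can be re-checked after the PR is merged.
import pandas as pd

issues = pd.read_csv("odp_issues.csv")
unknown = issues[issues["issue_type"] == "unknown entity"]

unknown.to_csv("unknown_entity_rows.csv", index=False)
print(f"{len(unknown)} unknown entity issues to assign")
```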
Success criteria:
Ideally, the number of unknown entity errors should be zero after completing the above steps.
Check deleted entities
To keep the datasets up-to-date on the platform, we need to check every week for entities that have been deleted from the latest resource. This occurs when an LPA has deleted entities on their endpoint without telling us. Once we have identified which entities appear to have been deleted, we contact the LPA to confirm. Once we have received confirmation, we can retire the entities.
The recommended steps to resolve this are as follows:
- Run this report
- For each dataset, compare the `Latest resource entity count` with the `Platform entity count`. Make a note of any dataset where the platform count is higher than the latest resource count (see the comparison sketch after this list)
- List those that need to be retired here. You will want the collection, endpoint, and source.
- Contact the LPAs. Once they confirm that these entities were deleted, follow the retire entities process
- Note in the sheet if an entity could not be retired
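A quick sketch of the count comparison, assuming the report has been exported as `deleted_entities_report.csv`; the two count columns are named as in the report, while the `dataset` column name is an assumption.

```python
# List datasets where the platform holds more entities than the latest
# resource, i.e. candidates for retirement pending LPA confirmation.
import pandas as pd

report = pd.read_csv("deleted_entities_report.csv")
candidates = report[
    report["Platform entity count"] > report["Latest resource entity count"]
]
print(candidates[["dataset", "Platform entity count", "Latest resource entity count"]])
```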
Success criteria:
The count of entities on the platform and on the latest resource should be the same. Run the report again to make sure the counts match.
Retire erroring endpoints
One of our quality measures is to reduce the number of erroring endpoints on our platform, so that we are no longer collecting data from endpoints that consistently fail. We can retire endpoints in a batch by running a script.
The scope of this task can be either non-ODP datasets or ODP datasets, so only retire endpoints that are within the scope of the ticket.
The recommended steps to resolve this are as follows:
- Run this query, which will return all endpoints that have been erroring for more than 90 days (i.e. where `n_days_since_last_200 > 90`)
- Select the data as CSV and copy the contents
- In VS Code, create a `retire.csv` file in the root and paste the content in there
- Run the command: `digital-land retire-endpoints-and-sources retire.csv` (see the preparation sketch after this list)
- Write down the endpoints you have retired (and note any you have been unable to retire) in the sheet
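If you prefer to prepare `retire.csv` programmatically rather than copy-pasting, here is a sketch assuming the query result is saved as `erroring_endpoints.csv` and that the retire command accepts the query's columns as-is; check the command's actual column requirements before running it.

```python
# Filter the query export down to endpoints erroring for more than 90
# days and write the file the retire command reads.
import pandas as pd

rows = pd.read_csv("erroring_endpoints.csv")
to_retire = rows[rows["n_days_since_last_200"] > 90]
to_retire.to_csv("retire.csv", index=False)
print(f"retire.csv written with {len(to_retire)} endpoints")
```

Then run `digital-land retire-endpoints-and-sources retire.csv` as above.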
Success criteria:
No erroring endpoints listed in the query for the scope of the ticket.
Invalid Organisations
One of our monitoring tasks is patching any `invalid organisation` issues that arise. This usually happens when the organisation value provided in the endpoint is wrong or missing, e.g. a blank field or the wrong organisation name / identifier.
A list of invalid organisation issues can be obtained by downloading a CSV file from either the issue summary table or the overview issue table and filtering for invalid organisations under `issue-type` (a counting sketch follows below).
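To size the work before patching, a small sketch that counts invalid organisation issues per dataset in the downloaded CSV; the exact `issue-type` value, the `dataset` column name, and the `issues.csv` file name are assumptions that may need adjusting to the real export.

```python
# Count invalid organisation issues per dataset to see where to start.
import csv
from collections import Counter

with open("issues.csv", newline="") as f:
    rows = [r for r in csv.DictReader(f) if r.get("issue-type") == "invalid organisation"]

by_dataset = Counter(r.get("dataset", "unknown") for r in rows)
for dataset, count in by_dataset.most_common():
    print(dataset, count)
```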
To fix this, we can make use of the `patch.csv` file. More information on how this file works can be found in the pipeline/patch section in configure an endpoint.
For example, if we were given the wrong `OrganisationURI` in a `brownfield-land` dataset, we can patch it by targeting the endpoint, giving the current URI in the `pattern` column and the desired URI in the `value` column, like so:
brownfield-land,,OrganisationURI,http://opendatacommunities.org/id/london-borough-council/hammersmith-and-,http://opendatacommunities.org/doc/london-borough-council/hammersmith-and-fulham,,,,,890c3ac73da82610fe1b7d444c8c89c92a7f368316e3c0f8d2e72f0c439b5245
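Conceptually, a patch row says: where this field's value matches the pattern, replace it with the value. The sketch below illustrates that idea using the row above; it is not the pipeline's exact matching logic, so consult the pipeline/patch documentation for the authoritative behaviour.

```python
# Illustration only: replace a field value when it matches the patch
# pattern, mirroring the pattern/value columns of the patch.csv row above.
import re

PATTERN = "http://opendatacommunities.org/id/london-borough-council/hammersmith-and-"
VALUE = "http://opendatacommunities.org/doc/london-borough-council/hammersmith-and-fulham"

def apply_patch(field_value: str) -> str:
    if re.match(PATTERN, field_value):
        return VALUE
    return field_value

print(apply_patch("http://opendatacommunities.org/id/london-borough-council/hammersmith-and-fulham"))
```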
To test it, follow the guidance in building a collection locally, keeping the new patch entry and focusing on the desired endpoint.