Monitoring Data Quality
This section covers how we monitor the quality of data on the platform and fix any issues that arise.
Unknown entities
To keep the datasets up-to-date on the platform, we need to check “unknown entity” issues every week and assign entities.
The unknown entity issue usually occurs when an LPA updates the data on the endpoint we are retrieving from and adds new records. These new records have reference values that are not yet on the platform, so when the system collects the new data and cannot match those references to existing entities, it raises an unknown entity issue.
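As a rough mental model, the check amounts to a set difference between the references an LPA supplies and the references the platform already knows about. The sketch below is illustrative only: the simple CSV inputs and the `reference` column name are assumptions, and the real pipeline's lookup files may be laid out differently.

```python
# A minimal sketch of why "unknown entity" issues fire: any reference in
# the newly collected resource that the platform cannot map to an entity
# is flagged. The file layout and "reference" column name are assumptions
# for illustration.
import csv

def unknown_references(resource_csv: str, lookup_csv: str) -> set[str]:
    with open(lookup_csv, newline="") as f:
        known = {row["reference"] for row in csv.DictReader(f)}
    with open(resource_csv, newline="") as f:
        supplied = {row["reference"] for row in csv.DictReader(f)}
    return supplied - known  # references with no entity on the platform
```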
The recommended steps to resolve this are as follows:
- Go to the report found here
- Download the CSV file with ‘Download Current Table’, which will download a file called `odp_issues`
- Analyse the “unknown entity” issues. Look for `issue_type` with the value `unknown entity` (see the filtering sketch after this list)
- If possible, for each of the issues you have identified, follow the steps in Assign entities. Keep track of the row of the issue.
- Raise a PR and merge it
- Once merged, run the issue report again and check whether the previous unknown entity issues are resolved. Paste the rows of “unknown entity” issues you have tracked into this Google doc after the changes are merged on the platform
- Note in the sheet if you are not able to assign entities for any LPA
- Insert a new workbook with the next sprint date
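If it helps to work with the download programmatically, here is a minimal sketch of the filtering step, assuming the file is saved as `odp_issues.csv` and that the `issue_type` column is named as shown in the report; adjust the names to match the real export.

```python
# Filter the downloaded report for "unknown entity" issues and keep a
# copy of the rows so they can be re-checked after the PR is merged.
import pandas as pd

issues = pd.read_csv("odp_issues.csv")
unknown = issues[issues["issue_type"] == "unknown entity"]

unknown.to_csv("unknown_entity_rows.csv", index=False)
print(f"{len(unknown)} unknown entity issues to assign")
```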
Success criteria:
Ideally, the number of unknown entity errors should be zero after completing the above steps.
Check deleted entities
To keep the datasets up-to-date on the platform, we need to check every week for entities that have been deleted from the latest resource. This occurs when an LPA has deleted entities on their endpoint without telling us. Once we have identified which entities appear to have been deleted, we contact the LPA to confirm. Once we have received confirmation, we can retire the entities.
The recommended steps to resolve this are as follows:
- Run this report
- For each dataset, compare the `Latest resource entity count` with the `Platform entity count`. Make a note of any dataset where the platform count is higher than the latest resource count (see the comparison sketch after this list)
- List those that need to be retired here. You will want the collection, endpoint, and source.
- Contact the LPAs. Once they confirm that these entities were deleted, follow the retire entities process
- Note in the sheet if an entity could not be retired
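A quick sketch of the count comparison, assuming the report has been exported as `deleted_entities_report.csv`; the two count columns are named as in the report, while the `dataset` column name is an assumption.

```python
# List datasets where the platform holds more entities than the latest
# resource, i.e. candidates for retirement pending LPA confirmation.
import pandas as pd

report = pd.read_csv("deleted_entities_report.csv")
candidates = report[
    report["Platform entity count"] > report["Latest resource entity count"]
]
print(candidates[["dataset", "Platform entity count", "Latest resource entity count"]])
```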
Success criteria:
The count of entities on the platform and on the latest resource should be the same. Run the report again to make sure the counts match.
Retire erroring endpoints
One of our quality measures is to reduce the number of erroring endpoints on our platform, so that we are no longer collecting data from endpoints that consistently fail. We can retire endpoints in a batch by running a script.
The scope of this task can be either non-ODP datasets or ODP datasets, so only retire endpoints that are within the scope of the ticket.
The recommended steps to resolve this are as follows:
- Run this query, which will return all endpoints that have been erroring for more than 90 days (i.e. where `n_days_since_last_200 > 90`)
- Select the data as CSV and copy the contents
- In VS Code, create a `retire.csv` file in the root and paste the content in there
- Run the command: `digital-land retire-endpoints-and-sources retire.csv` (see the preparation sketch after this list)
- Write down the endpoints you have retired (and note any you have been unable to retire) in the sheet
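If you prefer to prepare `retire.csv` programmatically rather than copy-pasting, here is a sketch assuming the query result is saved as `erroring_endpoints.csv` and that the retire command accepts the query's columns as-is; check the command's actual column requirements before running it.

```python
# Filter the query export down to endpoints erroring for more than 90
# days and write the file the retire command reads.
import pandas as pd

rows = pd.read_csv("erroring_endpoints.csv")
to_retire = rows[rows["n_days_since_last_200"] > 90]
to_retire.to_csv("retire.csv", index=False)
print(f"retire.csv written with {len(to_retire)} endpoints")
```

Then run `digital-land retire-endpoints-and-sources retire.csv` as above.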
Success criteria:
No erroring endpoints listed in the query for the scope of the ticket.
Invalid Organisations
One of our monitoring tasks is patching any `invalid organisation` issues that arise. This usually happens when the organisation value provided in the endpoint is wrong or missing, e.g. a blank field or the wrong organisation name / identifier.
A list of invalid organisation issues can be obtained by downloading a CSV file from either the issue summary table or the overview issue table and filtering for invalid organisations under `issue-type` (a counting sketch follows below).
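To size the work before patching, a small sketch that counts invalid organisation issues per dataset in the downloaded CSV; the exact `issue-type` value, the `dataset` column name, and the `issues.csv` file name are assumptions that may need adjusting to the real export.

```python
# Count invalid organisation issues per dataset to see where to start.
import csv
from collections import Counter

with open("issues.csv", newline="") as f:
    rows = [r for r in csv.DictReader(f) if r.get("issue-type") == "invalid organisation"]

by_dataset = Counter(r.get("dataset", "unknown") for r in rows)
for dataset, count in by_dataset.most_common():
    print(dataset, count)
```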
To fix this, we can make use of the `patch.csv` file. More information on how this file works can be found in the pipeline/patch section in configure an endpoint.
For example, if we were given the wrong `OrganisationURI` in a `brownfield-land` dataset, we can patch it by targeting the endpoint, giving the current URI in the `pattern` column and the desired URI in the `value` column, like so:
brownfield-land,,OrganisationURI,http://opendatacommunities.org/id/london-borough-council/hammersmith-and-,http://opendatacommunities.org/doc/london-borough-council/hammersmith-and-fulham,,,,,890c3ac73da82610fe1b7d444c8c89c92a7f368316e3c0f8d2e72f0c439b5245
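Conceptually, a patch row says: where this field's value matches the pattern, replace it with the value. The sketch below illustrates that idea using the row above; it is not the pipeline's exact matching logic, so consult the pipeline/patch documentation for the authoritative behaviour.

```python
# Illustration only: replace a field value when it matches the patch
# pattern, mirroring the pattern/value columns of the patch.csv row above.
import re

PATTERN = "http://opendatacommunities.org/id/london-borough-council/hammersmith-and-"
VALUE = "http://opendatacommunities.org/doc/london-borough-council/hammersmith-and-fulham"

def apply_patch(field_value: str) -> str:
    if re.match(PATTERN, field_value):
        return VALUE
    return field_value

print(apply_patch("http://opendatacommunities.org/id/london-borough-council/hammersmith-and-fulham"))
```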
To test it, follow the guidance in building a collection locally, keeping the new patch entry and focusing on the desired endpoint.