Configure an endpoint
collection/endpoint
Endpoints identify the URLs which are used to collect data from.
Important fields:
- `attribution` - a statement about the source of the data
- `collection` - repeats the name of the thing we are collecting
- `documentation-url` - the web link to the provider's documentation about the data
- `endpoint-url` - the URL of the endpoint itself
- `endpoint` - the endpoint hash
- `entry-date` - the date the endpoint record was added
- `end-date` - the date from which an endpoint is no longer valid
- `plugin` - any plugin that is required to successfully collect from an endpoint, e.g. `wfs`, `arcgis`
- `start-date` - the date from which an endpoint is valid
The available plugins are listed at https://github.com/digital-land/digital-land-python/tree/main/digital_land/plugins
collection/source
Sources identify where we get the data from, including providing a documentation-url to the publisher's website, which can provide additional information on data downloaded from the source. Sources are separate to endpoints, but each endpoint should be associated with a source.
Important fields:
- `documentation-url` - the URL of the webpage that links to the endpoint
- `end-date` - used to end a source
- `endpoint` - the hash for the endpoint associated with the source
- `licence` - the licence type provided with the source, one of these types
- `organisation` - the organisation providing the source
- `pipelines` - the pipelines used to process the source; multiple should be separated by a semicolon
- `source` - the hash for the source
collection/old-resource
This table is used to identify resources which should no longer be processed. When a resource is added here all of the facts it generated will be deleted (though the resource itself will be retained).
Important fields:
- `old-resource` - the hash of the resource being ended
- `status` - the status code for this entry; use `410` for ended (!! are there others which can be used?)
- `notes` - to record why this configuration change was made
pipeline/column
Used to map column headers in an endpoint or resource to specification field names. Unlike transform.csv, which handles spec-level renames globally, column.csv is typically used to handle inconsistent or non-standard column naming in specific endpoints. Leaving resource and endpoint blank applies the mapping globally.
Example
Mapping a field named `UID` in an endpoint to our `reference` field.
Important fields:
- `column` - the column header in the resource being mapped
- `field` - the field name in our specification the column header should be mapped to
- `endpoint` - (optional) limit the mapping to a specific endpoint
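Based on the fields above, the `UID` → `reference` example might look like the row below. The endpoint hash is a made-up placeholder, and real config files may carry additional columns (e.g. dates), so treat this as a sketch rather than a copy-paste template:

```csv
endpoint,column,field
aaa111bbb222c333,UID,reference
```

Leaving `endpoint` blank in a row like this would apply the mapping globally.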
pipeline/combine
Used to merge field values across multiple facts for the same entity. This runs later in the pipeline than other configuration (after entity resolution) so operates on facts rather than raw rows.
For geometry fields, values are merged into a single Multipolygon using a spatial union rather than string joining. For all other fields, unique values are deduplicated, sorted and joined using the specified separator.
Example
In the `agricultural-land-classification` collection, the `geometry` field of the Natural England endpoint is grouped by `reference`, resulting in individual polygons being grouped into a multipolygon or geometry collection.
Important fields:
- `field` - the field that should be grouped
- `separator` - a separator to use between the grouped field values, e.g. a hyphen or semicolon to separate strings
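An illustrative sketch of combine.csv rows, assuming only the two documented columns (the real file may also include scoping columns). The `geometry` row triggers the spatial-union behaviour described above, so its separator is unused; the `notes` row is hypothetical and would have its unique values joined with a semicolon:

```csv
field,separator
geometry,
notes,;
```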
pipeline/concat
Used to combine values across multiple fields into a single one.
Example
Concatenate the values from the `name` and `area_ref` fields to create a unique value for `reference` when there is no unique reference provided in the resource.
Important fields:
- `field` - the field to write the concatenated value to
- `fields` - a list of the fields in the resource (or our schema?) to concatenate, separated by a semicolon
- `separator` - an optional separator to use between the concatenated values
- `prepend` - optional text to add before the concatenated values
- `append` - optional text to add after the concatenated values
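The `name`/`area_ref` example above could be expressed as the following row (assuming only the documented columns; the hyphen separator is an illustrative choice):

```csv
field,fields,separator
reference,name;area_ref,-
```

A record with `name` of `Smith Farm` and `area_ref` of `07` would then get a `reference` along the lines of `Smith Farm-07`.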
pipeline/convert
Not currently in active use. This file was intended to configure conversion behaviour for specific resources, but the functionality is handled automatically by the pipeline. The only existing configuration is in brownfield-land and contains no active parameters.
pipeline/default
Used to populate an empty field by copying the value from another field in the same row. Only applies when the target field has no value - existing values are never overwritten. This is different to default-value.csv which sets a hardcoded value rather than copying from another field.
Example
If `start-date` should default to the value of `actual-date` when not provided, add a row mapping `start-date` → `actual-date`.
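The section above doesn't list this file's columns, so the sketch below assumes a `field`/`default-field` header pair; check an existing default.csv in the config repo before relying on these names:

```csv
field,default-field
start-date,actual-date
```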
pipeline/default-value
Used to set a hardcoded default value for a field when it is empty. Unlike default.csv which copies a value from another field, this sets a fixed literal value.
Example
Set the value of `flood-risk-level` to `2` for all records from an endpoint in the `flood-risk-zone` collection. The data is provided split into a different endpoint per flood risk level, but each resource doesn't record the level explicitly in a field. See: https://github.com/digital-land/config/blob/main/pipeline/flood-risk-zone/default-value.csv#L3
Important fields:
- `field` - the field to enter the default value in
- `value` - the value to enter as the default in the field
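The flood-risk example might look like the row below. The `endpoint` column is an assumption based on the example scoping the default to one endpoint, and the hash is a placeholder; compare with the linked flood-risk-zone file:

```csv
endpoint,field,value
ddd444eee555f666,flood-risk-level,2
```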
pipeline/entity-organisation
This configuration file is used to assign the organisation responsible for managing an entity or range of entities. For any entities within the dataset and entity range given, facts from the assigned organisation will be prioritised over facts from any other organisation. In practice this means when we have multiple sources of data for a single entity, the organisation can be kept as the authoritative organisation by setting the entity-organisation in this file.
Example
Entity `44002714` has multiple entries in `lookup.csv`: one for organisation `local-authority:TOB` (Torbay Council) and one for `government-organisation:PB1164` (Historic England). That means we have data from both of these sources for this entity. This entity number falls within the range 44002711 - 44002734, which is assigned to `local-authority:TOB` in `entity-organisation.csv`. That means that the facts on the entity page should all be from Torbay, even if there are more recent facts from Historic England. And importantly, the Organisation on the entity page will remain as Torbay Council.
Important fields:
- `dataset` - the dataset to target, e.g. `conservation-area-document`
- `organisation` - the organisation to apply, e.g. `local-authority:BAB`
- `entity-minimum` - sets the starting point of the entity range (inclusive)
- `entity-maximum` - sets the ending point of the entity range (inclusive)
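The Torbay example above as a row; the `dataset` value here is an assumption (the entity range suggests conservation areas), so substitute the actual dataset:

```csv
dataset,organisation,entity-minimum,entity-maximum
conservation-area,local-authority:TOB,44002711,44002734
```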
pipeline/expect
This file is used to define rules that will be used by the pipeline code to generate and test expectations. The fields define the operation that will be used and the parameters to pass it, as well as what the expected result is and some metadata for the expectation that will appear in the expectation table in datasette.
Important fields:
- `dataset` - the dataset or list of datasets (separated by `;`) within the collection which the expectation should be executed against.
- `organisations` - the organisation or list of organisations which the expectation should be executed for, e.g. `local-authority:BAB` (list separated by `;`). If given a list, the rule will create multiple expectations, each of which will be executed only against entities within each organisation. You can also use a dataset to specify a list of organisations, e.g. `local-authority` or `national-park-authority`, or leave the field blank if the expectation should be executed at dataset level.
- `operation` - the expectation operation to be executed. Must be defined in digital-land/expectations/operation.py.
- `parameters` - a JSON string passing the operation parameters. Keys should be enclosed in double quotes, and values in double quotes and braces. This is to handle Jinja formatting, which can use class attributes to parameterise some of the inputs, e.g. `"{""lpa"":""""}"`.
- `name` - the name for any expectations that will be created by the rule. It's best not to use parameters in this field, as a generic name for each expectation created by the same rule makes it easier to quickly compare results in the expectation table.
- `description` - a description for the expectation. This can accept Jinja formatting to output parameters in the same way as the `parameters` field, e.g. `A test to check there are no listed-building-outline entities outside of the boundary for`
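A sketch of a rule row, mainly to show the CSV escaping of the `parameters` JSON: inside a quoted CSV field, every literal double quote is doubled. The operation name and the Jinja attribute used here are hypothetical; use an operation actually defined in digital-land/expectations/operation.py:

```csv
dataset,organisations,operation,parameters,name,description
conservation-area,local-authority:BAB,count_entities_outside_boundary,"{""lpa"":""{{ organisation.reference }}""}",entities outside boundary,"Check there are no conservation-area entities outside the boundary for {{ organisation.reference }}"
```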
pipeline/filter
Used to filter a resource so that only a subset of the records are processed, based on whether the values in one of the resource's fields are in a user-defined list.
Example
To only add records from a resource where the `tree-preservation-zone-type` field has a value of "Area" to the `tree-preservation-zone` dataset. See: https://github.com/digital-land/config/blob/main/pipeline/tree-preservation-order/filter.csv#L6
Important fields:
- `field` - the field to search for the pattern
- `pattern` - the pattern to search for in the field (can just be a string; does this accept regex like in patch?)
NOTE!
Filter config for a dataset will only work for fields that are in the dataset schema. So if you need to filter based on a column that's in the source data and not in the schema, you will first need to map it to a schema column using `column.csv` config.
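The tree-preservation-zone example above as a row, assuming only the two documented columns (the linked file may include extra scoping columns):

```csv
field,pattern
tree-preservation-zone-type,Area
```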
pipeline/lookup
Used to map the reference that a data provider uses to describe a thing to the entity number that we have assigned to that thing. It is important to appreciate that there can be a 1:1 or a many:1 relationship here, because we may collect data from multiple providers who publish information about the same thing (e.g. both LPAs and Historic England publish conservation area data, so we may map a reference from each to the same entity).
These records are produced at the entity assignment phase when an endpoint is added for the first time. Verifying that the lookups have been produced as expected, and also managing any merging with existing entities is an important stage of the Adding an endpoint process.
Important fields:
- `prefix` - the dataset this entity is a part of
- `resource` - (not used?)
- `organisation` - the organisation who provides data about this entity
- `reference` - the identifier used by the provider to refer to this entity
- `entity` - the reference we have generated to refer to this entity
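A sketch of the many:1 case described above, using the conservation-area example: two providers' references both resolve to entity 44002714. The `reference` values are invented placeholders:

```csv
prefix,resource,organisation,reference,entity
conservation-area,,local-authority:TOB,CA-7,44002714
conservation-area,,government-organisation:PB1164,1000001,44002714
```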
pipeline/old-entity
Used to redirect and remove entities.
To prevent an entity from appearing on the platform, add a record with status 410 and the entity number to the file and it will be removed.
To redirect an entity, use status 301 along with the current entity number and the target entity number (mainly used for merging geographically duplicated entities).
Important fields:
- `old-entity` - the current entity number
- `status` - `301` (redirect) or `410` (remove)
- `entity` - the target entity number
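Illustrative rows (the entity numbers are invented): the first removes an entity from the platform, the second redirects one entity to another, as when merging duplicates:

```csv
old-entity,status,entity
44002900,410,
44002901,301,44002714
```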
pipeline/patch
Used to replace values that match a particular string or string pattern with another value.
Example
Change any string in the `listed-building-grade` field of a resource in the `listed-building` collection that contains the numeric character "2" to be "II" instead, in order to match the specification. See: https://github.com/digital-land/config/blob/main/pipeline/listed-building/patch.csv#L10
Important fields:
- `endpoint` - (optional) the endpoint hash, used to target a specific endpoint for a patch
- `field` - the field to search for the pattern
- `pattern` - the pattern to search for in the field (should this be a regex pattern?)
- `value` - the value to use as a replacement for incoming values that match the pattern
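The listed-building example might look like the row below. The pattern is written as a plain substring based on the description above ("contains the numeric character 2"); since it is unconfirmed here whether patterns are treated as regexes, check the linked patch.csv before copying:

```csv
field,pattern,value
listed-building-grade,2,II
```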
pipeline/skip
Sometimes, the raw data contains extraneous lines that can cause issues during processing. To address this, the skip.csv file is used to skip specific lines from the raw data.
Example
Consider this endpoint URL: the file begins with some extra data that we need to skip. See: https://github.com/digital-land/config/blob/main/pipeline/developer-contributions/skip.csv#L7
Important fields:
- `pattern` - the pattern to search for in the raw endpoint file
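An illustrative row (the pattern text is invented): any leading line of the raw file matching this pattern would be skipped before conversion:

```csv
pattern
^Some preamble text
```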
pipeline/transform
Used to rename fields to match the latest specification. Maps old field names to their current replacements, applied globally across all resources. Use this when a field has been renamed in the specification and you need existing data to continue flowing through correctly.
Example
The brownfield-land specification changed from using fields like `OrganisationURI` and `SiteNameAddress` to `organisation` and `site-address`. These changes were added to the relevant `transform.csv` to accommodate this specification change. See: https://github.com/digital-land/config/blob/main/pipeline/brownfield-land/transform.csv
Important fields:
- `field` - the old field name in the source data
- `replacement-field` - the new field name in the current specification
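The brownfield-land example as rows, following the column meanings above (compare with the linked transform.csv for the authoritative form):

```csv
field,replacement-field
OrganisationURI,organisation
SiteNameAddress,site-address
```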