Configure an endpoint
collection/endpoint
Endpoints identify the URLs from which data is collected.
Important fields:
attribution
- statement about the source of the data
collection
- repeats the name of the thing we are collecting
documentation-url
- the web link to the provider's documentation about the data
endpoint-url
- the URL of the endpoint itself
endpoint
- the endpoint hash
entry-date
- the date the endpoint record was added
end-date
- the date from which an endpoint is no longer valid
plugin
- any plugin that is required to successfully collect from an endpoint, e.g. wfs, arcgis
start-date
- the date from which an endpoint is valid
The available plugins are in: https://github.com/digital-land/digital-land-python/tree/main/digital_land/plugins
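As a rough illustration of how start-date and end-date bound an endpoint's validity, the sketch below checks one endpoint record against a given day. The helper name endpoint_is_valid, the sample record and the treatment of blank dates as open-ended are assumptions made for this example, not the behaviour of the digital-land code.

```python
from datetime import date


def endpoint_is_valid(record: dict, on: date) -> bool:
    """Return True if `on` falls inside the record's validity window."""
    start = record.get("start-date", "")
    end = record.get("end-date", "")
    # Assumption: a blank date leaves the window open on that side.
    if start and on < date.fromisoformat(start):
        return False
    # end-date is the date from which the endpoint is no longer valid.
    if end and on >= date.fromisoformat(end):
        return False
    return True


# Hypothetical endpoint record using the field names described above.
record = {
    "endpoint": "hypothetical-endpoint-hash",
    "endpoint-url": "https://example.com/data.csv",
    "start-date": "2023-01-01",
    "end-date": "2024-01-01",
}
```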
collection/source
Sources identify where we get the data from, and include a documentation-url pointing to the publisher's website, which can provide additional information on data downloaded from the source. Sources are separate from endpoints, but each endpoint should be associated with a source.
Important fields:
documentation-url
- the URL of the webpage that links to the endpoint
end-date
- used to end a source
endpoint
- the hash for the endpoint associated with the source
licence
- the licence type provided with the source, one of these types
organisation
- the organisation providing the source
pipelines
- the pipelines used to process the source; multiple pipelines should be separated by a semicolon
source
- the hash for the source
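The semicolon-separated pipelines field can be read as a list. The following is a minimal sketch, assuming a source record held as a dictionary; the function name source_pipelines and the sample values are hypothetical.

```python
def source_pipelines(source: dict) -> list:
    """Split the semicolon-separated pipelines field into a list of names."""
    raw = source.get("pipelines", "")
    # Ignore empty entries, e.g. from a trailing semicolon.
    return [name.strip() for name in raw.split(";") if name.strip()]


# Hypothetical source record; the hashes are placeholders.
source = {
    "source": "hypothetical-source-hash",
    "endpoint": "hypothetical-endpoint-hash",
    "organisation": "local-authority:BAB",
    "pipelines": "conservation-area;conservation-area-document",
}
pipelines = source_pipelines(source)
```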
collection/old-resource
This table is used to identify resources which should no longer be processed. When a resource is added here all of the facts it generated will be deleted (though the resource itself will be retained).
Important fields:
old-resource
- the hash of the resource being ended
status
- the status code for this entry, use 410 for ended (!! are there others which can be used?)
resource
- uncertain, can a resource be re-directed?
notes
- to record why this configuration change was made
pipeline/column
This table is used to add extra mappings from the resource column headers to our specification field names, for example mapping a field named UID in a resource to our reference field.
Important fields:
column
- the column header in the resource being mapped
field
- the field name in our specification the column header should be mapped to
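To make the mapping concrete, here is a small sketch of renaming resource column headers to specification field names. The function map_columns and the sample row are invented for illustration; leaving unmapped headers unchanged is an assumption, and the real pipeline may treat them differently.

```python
def map_columns(row: dict, column_config: list) -> dict:
    """Rename the keys of a resource row using column/field mappings."""
    mapping = {entry["column"]: entry["field"] for entry in column_config}
    # Assumption: headers without a mapping are kept unchanged.
    return {mapping.get(header, header): value for header, value in row.items()}


# One mapping row, as described above: UID in the resource becomes reference.
column_config = [{"column": "UID", "field": "reference"}]
row = {"UID": "CA-001", "name": "Old Town"}
mapped = map_columns(row, column_config)
```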
pipeline/combine
Used to combine values across multiple rows. The grouping is based on the reference field, so this only works when there are multiple rows per reference. Note that this happens after concat, so concat can be used to create a reference from multiple fields and control the grouping to some extent.
Example
In the agricultural-land-classification collection, the geometry field of the Natural England data is grouped by the reference, resulting in individual polygons being grouped into a multipolygon or geometry collection.
Important fields:
field
- the field that should be grouped
separator
- a separator to use between the grouped field values, e.g. a hyphen or semicolon to separate strings
Uncertain about how different field types are combined, need more detail.
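For string fields, the grouping described above might look like the sketch below. It only covers joining text values with a separator; how geometries or other field types are combined is not covered, in line with the open question above. The function name combine and the sample rows are hypothetical.

```python
from collections import defaultdict


def combine(rows: list, field: str, separator: str) -> list:
    """Group rows by reference and join the values of `field` in each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["reference"]].append(row)

    combined = []
    for group in groups.values():
        # Assumption: other fields are taken from the first row of the group.
        merged = dict(group[0])
        values = [row[field] for row in group if row.get(field)]
        merged[field] = separator.join(values)
        combined.append(merged)
    return combined


rows = [
    {"reference": "A", "notes": "north parcel"},
    {"reference": "A", "notes": "south parcel"},
    {"reference": "B", "notes": "single parcel"},
]
result = combine(rows, field="notes", separator=";")
```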
pipeline/concat
Used to combine values across multiple fields into a single one.
Example
Concatenate the values from name and area_ref fields to create a unique value for reference when there is no unique reference provided in the resource.
Important fields:
field
- the field to use the concatenated value in
fields
- a list of the fields in the resource (or our schema?) to concatenate, separated by a semicolon
separator
- an optional separator to use between the concatenated values
prepend
- optional text to add before the concatenated values
append
- optional text to add after the concatenated values
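The fields above fit together roughly as in this sketch, which builds a reference from name and area_ref. The function concat and the sample row are illustrative only, and skipping blank source values is an assumption rather than documented behaviour.

```python
def concat(row: dict, field: str, fields: str,
           separator: str = "", prepend: str = "", append: str = "") -> dict:
    """Write the concatenation of several fields into `field`."""
    names = [name.strip() for name in fields.split(";")]
    # Assumption: blank or missing source values are skipped.
    values = [row[name] for name in names if row.get(name)]
    result = dict(row)
    result[field] = prepend + separator.join(values) + append
    return result


row = {"name": "Mill Lane", "area_ref": "12"}
out = concat(row, field="reference", fields="name;area_ref", separator="-")
```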
pipeline/convert
Unsure!
pipeline/default-value
Used to set a default value for all values in a field.
Example
Set the value of flood-risk-level to 2 for all values from an endpoint in the flood-risk-zone, because the data is provided split into a different endpoint per flood risk level but each resource doesn’t record the level explicitly in a field.
See: https://github.com/digital-land/config/blob/main/pipeline/flood-risk-zone/default-value.csv#L3
Important fields:
field
- the field to use the default value in
value
- the value to enter as default in the field
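A minimal sketch of applying a default value to rows from one endpoint follows. It assumes the default only fills blank or missing values and leaves existing ones alone; that is a guess at the intended behaviour, and the function name apply_default_value and the sample rows are hypothetical.

```python
def apply_default_value(rows: list, field: str, value: str) -> list:
    """Fill `field` with `value` wherever it is blank or missing."""
    # Assumption: values already present in the resource are not overwritten.
    return [{**row, field: row.get(field) or value} for row in rows]


rows = [
    {"reference": "1"},
    {"reference": "2", "flood-risk-level": ""},
    {"reference": "3", "flood-risk-level": "3"},
]
filled = apply_default_value(rows, field="flood-risk-level", value="2")
```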
pipeline/default
I think to set a default value using another field in the resource, but uncertain how this is different to column. Need more info.
pipeline/entity-organisation
Used to set the entity range for organisations within the conservation-area collection. This is done to ensure that entities within a range are linked to a certain organisation in the case that we have lookup entries for the same entity from different organisations. This helps prioritise data from the authoritative source.
Example
For the conservation-area dataset, we have an entry for local-authority:BAB with an entity-minimum of 44005968 and an entity-maximum of 44005997. This sets out that any entity within that range will be part of that organisation. More ranges for that organisation and dataset can be added in following rows, e.g. setting the next range as 44008683 -> 44008684.
Important fields:
dataset
- the dataset to target, e.g. conservation-area-document
organisation
- the organisation to apply to, e.g. local-authority:BAB
entity-minimum
- sets the starting point of that range (inclusive)
entity-maximum
- sets the ending point of that range (inclusive)
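The inclusive range check can be sketched as below, using the two ranges from the example. The function organisation_for_entity is hypothetical and only illustrates the lookup; it is not taken from the pipeline code.

```python
from typing import Optional


def organisation_for_entity(entity: int, dataset: str, ranges: list) -> Optional[str]:
    """Return the organisation whose range contains `entity`, if any."""
    for row in ranges:
        if row["dataset"] != dataset:
            continue
        # Both ends of the range are inclusive.
        if int(row["entity-minimum"]) <= entity <= int(row["entity-maximum"]):
            return row["organisation"]
    return None


ranges = [
    {"dataset": "conservation-area", "organisation": "local-authority:BAB",
     "entity-minimum": "44005968", "entity-maximum": "44005997"},
    {"dataset": "conservation-area", "organisation": "local-authority:BAB",
     "entity-minimum": "44008683", "entity-maximum": "44008684"},
]
```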
pipeline/expect
This file is used to define rules that will be used by the pipeline code to generate and test expectations. The fields define the operation that will be used and the parameters to pass it, as well as what the expected result is and some metadata for the expectation that will appear in the expectation table in datasette.
Important fields:
dataset
- the dataset or list of datasets (separated by ‘;’) within the collection which the expectation should be executed against.
organisations
- the organisation or list of organisations which the expectation should be executed for, e.g. local-authority:BAB (list separated by ‘;’). If given a list, the rule will create multiple expectations, each of which will be executed against only entities within each organisation. You can also use a dataset to specify a list of organisations, e.g. local-authority, or national-park-authority. Or leave the field blank if the expectation should be executed at a dataset level.
operation
- the expectation operation to be executed. Must be defined in digital-land/expectations/operation.py.
parameters
- a JSON string passing the operation parameters. Keys should be enclosed in double quotes, and values in double quotes and braces. This is to handle jinja formatting, which can use class attributes to parameterise some of the inputs, e.g. "{""lpa"":""""}"
name
- the name for any expectations that will be created by the rule. It’s best to not use parameters in this field, as a generic name for each expectation created by the same rule makes it easier to quickly compare results in the expectation table.
description
- a description for the expectation. This can accept jinja formatting to output parameters in the same way as the parameters field, e.g. A test to check there are no listed-building-outline entities outside of the boundary for
pipeline/filter
Used to filter a resource so that only a subset of the records are processed, based on whether the values in one of the resource's fields are in a user-defined list.
Example
Only add records from a resource to the tree-preservation-zone dataset where the tree-preservation-zone-type field has a value of “Area”.
See: https://github.com/digital-land/config/blob/main/pipeline/tree-preservation-order/filter.csv#L6
Important fields:
field
- the field to search for the pattern
pattern
- the pattern to search for in the field (can just be a string, does this accept regex like in patch?)
NOTE!
Filter config for a dataset will only work for fields that are in the dataset schema. So if you need to filter based on a column that’s in the source data and not in the schema, you will first need to map it to a schema column using column.csv config.
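As a sketch of the filtering step, the function below keeps only rows whose field value matches the pattern. Treating the pattern as a regular expression that must match the whole value is an assumption (the question above about regex support is still open), and filter_rows and the sample rows are hypothetical.

```python
import re


def filter_rows(rows: list, field: str, pattern: str) -> list:
    """Keep rows where the value of `field` matches `pattern`."""
    # Assumption: the pattern is a regex matched against the whole value.
    compiled = re.compile(pattern)
    return [row for row in rows if compiled.fullmatch(row.get(field, ""))]


rows = [
    {"reference": "1", "tree-preservation-zone-type": "Area"},
    {"reference": "2", "tree-preservation-zone-type": "Group"},
    {"reference": "3", "tree-preservation-zone-type": "Woodland"},
]
kept = filter_rows(rows, field="tree-preservation-zone-type", pattern="Area")
```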
pipeline/lookup
Used to map the relationships between the reference that a data provider uses to describe a thing, to the entity number that we have assigned to that thing. It is important to appreciate that there can be a 1:1 or a many:1 relationship here because we may collect data from multiple providers who publish information about the same thing (e.g. both LPAs and Historic England publish conservation area data, so we may map a reference from each to the same entity).
These records are produced at the entity assignment phase when an endpoint is added for the first time. Verifying that the lookups have been produced as expected, and also managing any merging with existing entities is an important stage of the Adding an endpoint process.
Important fields:
prefix
- the dataset this entity is a part of
resource
- (not used?)
organisation
- the organisation who provides data about this entity
reference
- the identifier used by the provider to refer to this entity
entity
- the reference we have generated to refer to this entity
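The many:1 relationship can be pictured as a dictionary keyed on prefix, organisation and reference. In this sketch the second organisation identifier and both references are made up; only the field names come from the table above.

```python
def build_lookup(rows: list) -> dict:
    """Index lookup rows by (prefix, organisation, reference)."""
    return {
        (row["prefix"], row["organisation"], row["reference"]): int(row["entity"])
        for row in rows
    }


# Two providers describing the same thing map to the same entity number.
lookup_rows = [
    {"prefix": "conservation-area", "organisation": "local-authority:BAB",
     "reference": "CA-01", "entity": "44005968"},
    {"prefix": "conservation-area", "organisation": "hypothetical-organisation:HE",
     "reference": "HE-0042", "entity": "44005968"},
]
lookup = build_lookup(lookup_rows)
```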
pipeline/old-entity
Used to redirect and remove entities.
To prevent an entity from appearing on the platform, add a record with status 410 and the entity number to the file and it will be removed.
To redirect an entity, use status 301 along with the current entity number and the target entity number. This is mainly used for merging geographically duplicated entities.
Important fields:
old-entity
- current entity number
status
- 301 (redirect) / 410 (remove)
entity
- target entity number
pipeline/patch
Used to replace values that match a particular string or string pattern with another value.
Example
Change any string in the listed-building-grade field of a resource in the listed-building collection that contains the numeric character “2” to be “II” instead, in order to match the specification.
See: https://github.com/digital-land/config/blob/main/pipeline/listed-building/patch.csv#L10
Important fields:
endpoint
- (optional) the endpoint hash, used to target a specific endpoint for a patch
field
- the field to search for the pattern
pattern
- the pattern to search for in the field (should this be a regex pattern?)
value
- the value to use as a replacement for incoming values that match the pattern
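The listed-building example can be sketched as follows. Treating the pattern as a regular expression searched anywhere in the value is an assumption, since the question above is still open, and apply_patch here is a hypothetical helper rather than the pipeline's own function.

```python
import re


def apply_patch(value: str, pattern: str, replacement: str) -> str:
    """Replace the whole value when `pattern` is found in it."""
    # Assumption: the pattern is a regex and may match anywhere in the value.
    if re.search(pattern, value):
        return replacement
    return value


patched = apply_patch("Grade 2", pattern="2", replacement="II")
unchanged = apply_patch("I", pattern="2", replacement="II")
```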
pipeline/skip
Sometimes, the raw data contains extraneous lines that can cause issues during processing. To address this, the skip.csv file is used to skip specific lines from the raw data.
Example
Consider this endpoint URL: the file begins with some extra data that we need to skip.
See: https://github.com/digital-land/config/blob/main/pipeline/developer-contributions/skip.csv#L7
Important fields:
pattern
- the pattern to search for in the raw endpoint file
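A small sketch of skipping unwanted lines before a file is parsed is shown below. It assumes each pattern is a regular expression tested from the start of the line; the function skip_lines and the sample file content are hypothetical.

```python
import re


def skip_lines(lines: list, patterns: list) -> list:
    """Drop raw lines that match any of the skip patterns."""
    # Assumption: patterns are regexes matched from the start of each line.
    compiled = [re.compile(pattern) for pattern in patterns]
    return [
        line for line in lines
        if not any(regex.match(line) for regex in compiled)
    ]


raw_lines = [
    "Report generated for internal use",
    "reference,name",
    "1,Example agreement",
]
cleaned = skip_lines(raw_lines, patterns=["^Report generated"])
```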
pipeline/transform
Unsure!