
Configure an endpoint

collection/endpoint

Endpoints identify the URLs from which data is collected.

Important fields:

  • attribution - statement about the source of the data
  • collection - repeats the name of the thing we are collecting
  • documentation-url - the web link to the provider's documentation about the data
  • endpoint-url - the URL of the endpoint itself
  • endpoint - the endpoint hash
  • entry-date - the date the endpoint record was added
  • end-date - the date from which an endpoint is no longer valid
  • plugin - any plugin that is required to successfully collect from an endpoint, e.g. wfs, arcgis
  • start-date - the date from which an endpoint is valid

Available plugins: https://github.com/digital-land/digital-land-python/tree/main/digital_land/plugins
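
As a rough illustration of how these fields fit together, the sketch below derives an endpoint hash and checks whether an endpoint record is valid on a given day. It assumes the hash is the SHA-256 hex digest of the endpoint-url and that a blank date means "no limit"; both are assumptions rather than documented behaviour, and the function names and example URL are hypothetical.

```python
import hashlib
from datetime import date


def endpoint_hash(endpoint_url: str) -> str:
    # Assumption: the endpoint hash is the SHA-256 hex digest of the URL.
    return hashlib.sha256(endpoint_url.encode("utf-8")).hexdigest()


def is_valid_on(record: dict, day: date) -> bool:
    # Assumption: a blank start-date or end-date means "no limit".
    start = record.get("start-date") or ""
    end = record.get("end-date") or ""
    if start and day < date.fromisoformat(start):
        return False
    # end-date is the date from which the endpoint is no longer valid.
    if end and day >= date.fromisoformat(end):
        return False
    return True


record = {
    "endpoint-url": "https://example.com/data.csv",  # hypothetical URL
    "start-date": "2024-01-01",
    "end-date": "2025-01-01",
}
record["endpoint"] = endpoint_hash(record["endpoint-url"])
```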

collection/source

Sources identify where we get the data from, and include a documentation-url pointing to the publisher's website, which can provide additional information on data downloaded from the source. Sources are separate from endpoints, but each endpoint should be associated with a source.

Important fields:

  • documentation-url - the URL of the webpage that links to the endpoint
  • end-date - used to end a source
  • endpoint - the hash for the endpoint associated with the source
  • licence - the licence type provided with the source, one of these types
  • organisation - the organisation providing the source
  • pipelines - the pipelines used to process the source, multiple should be separated by a semi-colon
  • source - the hash for the source

collection/old-resource

This table is used to identify resources which should no longer be processed. When a resource is added here all of the facts it generated will be deleted (though the resource itself will be retained).

Important fields:

  • old-resource - the hash of the resource being ended
  • status - the status code for this entry; use 410 for ended (it is not yet confirmed whether other codes can be used)
  • notes - to record why this configuration change was made

pipeline/column

Used to map column headers in an endpoint or resource to specification field names. Unlike transform.csv, which handles spec-level renames globally, column.csv is typically used to handle inconsistent or non-standard column naming in specific endpoints. Leaving resource and endpoint blank applies the mapping globally.

Example
Mapping a field named UID in an endpoint to our reference field

Important fields:

  • column - the column header in the resource being mapped
  • field - the field name in our specification the column header should be mapped to
  • endpoint - (optional) limit the mapping to a specific endpoint
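
The mapping described above could be sketched roughly as follows. This is not the pipeline's own code: the function names and the endpoint hash "abc123" are hypothetical, headers are matched exactly, and it assumes an endpoint-specific row takes precedence over a global one.

```python
def build_column_map(rows, endpoint):
    """Return {column header: field} for one endpoint."""
    mapping = {}
    # Global rows (blank endpoint) are applied first so that
    # endpoint-specific rows can override them (an assumption).
    for row in sorted(rows, key=lambda r: bool(r.get("endpoint"))):
        if not row.get("endpoint") or row["endpoint"] == endpoint:
            mapping[row["column"]] = row["field"]
    return mapping


def rename_columns(record, mapping):
    # Headers without a mapping are passed through unchanged.
    return {mapping.get(col, col): value for col, value in record.items()}


column_rows = [
    {"endpoint": "", "column": "Ref", "field": "reference"},
    {"endpoint": "abc123", "column": "UID", "field": "reference"},
]
mapped = rename_columns(
    {"UID": "TPO-1", "name": "Oak"},
    build_column_map(column_rows, "abc123"),
)
```

Here the UID header is renamed to reference only for the endpoint "abc123"; for any other endpoint only the global Ref mapping applies.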

pipeline/combine

Used to merge field values across multiple facts for the same entity. This runs later in the pipeline than other configuration (after entity resolution) so operates on facts rather than raw rows.

For geometry fields, values are merged into a single Multipolygon using a spatial union rather than string joining. For all other fields, unique values are deduplicated, sorted and joined using the specified separator.

Example
In the agricultural-land-classification collection, the geometry field of the Natural England data is grouped by reference, so individual polygons are merged into a multipolygon or geometry collection.

See: https://github.com/digital-land/config/blob/main/pipeline/agricultural-land-classification/combine.csv#L2

Important fields:

  • field - the field that should be grouped
  • separator - a separator to use between the grouped field values, e.g. a hyphen or semicolon to separate strings

Uncertain about how different field types are combined, need more detail.
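
For non-geometry fields, the deduplicate, sort and join behaviour described above might look like the sketch below. The function name is hypothetical, dropping blank values is an assumption, and geometry fields are not covered because they need a spatial union rather than a string join.

```python
def combine_values(values, separator):
    # Drop blanks (an assumption), deduplicate, sort, then join.
    unique = sorted({value for value in values if value})
    return separator.join(unique)


combined = combine_values(["Grade 2", "Grade 1", "Grade 2", ""], ";")
```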

pipeline/concat

Used to combine values across multiple fields into a single one.

Example
Concatenate the values from name and area_ref fields to create a unique value for reference when there is no unique reference provided in the resource

See: https://github.com/digital-land/config/blob/main/pipeline/agricultural-land-classification/concat.csv#L2

Important fields:

  • field - the field to use the concatenated value in
  • fields - a list of the fields in the resource (or our schema?) to concatenate, separated by a semicolon
  • separator - an optional separator to use between the concatenated values
  • prepend - optional text to add before the concatenated values
  • append - optional text to add after the concatenated values
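
A minimal sketch of how these fields could interact is shown below. The function name and the example values are hypothetical, and leaving out empty source values is an assumption.

```python
def concat_fields(row, fields, separator="", prepend="", append=""):
    # "fields" is the semicolon-separated list from the configuration row.
    # Assumption: empty or missing source values are left out.
    parts = [row[name] for name in fields.split(";") if row.get(name)]
    return prepend + separator.join(parts) + append


row = {"name": "Riverside", "area_ref": "A12"}
row["reference"] = concat_fields(row, "name;area_ref", separator="-")
```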

pipeline/convert

Not currently in active use. This file was intended to configure conversion behaviour for specific resources, but the functionality is handled automatically by the pipeline. The only existing configuration is in brownfield-land and contains no active parameters.

pipeline/default

Used to populate an empty field by copying the value from another field in the same row. Only applies when the target field has no value - existing values are never overwritten. This is different to default-value.csv which sets a hardcoded value rather than copying from another field.

Example
If start-date should default to the value of actual-date when not provided, add a row mapping start-date to actual-date.

See: https://github.com/digital-land/config/blob/78c2167948503f794b6023ae17796b5d086514de/pipeline/local-plan/default.csv#L5

pipeline/default-value

Used to set a hardcoded default value for a field when it is empty. Unlike default.csv which copies a value from another field, this sets a fixed literal value.

Example
Set the value of flood-risk-level to 2 for all values from an endpoint in the flood-risk-zone, because the data is provided split into a different endpoint per flood risk level but each resource doesn’t record the level explicitly in a field.

See: https://github.com/digital-land/config/blob/main/pipeline/flood-risk-zone/default-value.csv#L3

Important fields:

  • field - the field to use the default value in
  • value - the value to enter as default in the field
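
The difference between default.csv and default-value.csv can be sketched as two small functions. Both function names are hypothetical and this is only an illustration of the rule that existing values are never overwritten, not the pipeline's implementation.

```python
def apply_default(row, field, source_field):
    # default.csv style: copy from another field, only when the target is empty.
    if not row.get(field) and row.get(source_field):
        row[field] = row[source_field]
    return row


def apply_default_value(row, field, value):
    # default-value.csv style: set a fixed literal, only when the target is empty.
    if not row.get(field):
        row[field] = value
    return row


copied = apply_default(
    {"start-date": "", "actual-date": "2020-05-01"}, "start-date", "actual-date"
)
filled = apply_default_value({"flood-risk-level": ""}, "flood-risk-level", "2")
kept = apply_default_value({"flood-risk-level": "3"}, "flood-risk-level", "2")
```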

pipeline/entity-organisation

This configuration file is used to assign the organisation responsible for managing an entity or range of entities. For any entities within the dataset and entity range given, facts from the assigned organisation will be prioritised over facts from any other organisation. In practice this means that when we have multiple sources of data for a single entity, one organisation can be kept as the authoritative organisation by setting it in this file.

Example
Entity 44002714 has multiple entries in lookup.csv: one for organisation local-authority:TOB (Torbay council) and one for government-organisation:PB1164 (Historic England). That means we have data from both of these sources for this entity.

This entity number falls within the range of 44002711 - 44002734 which is assigned to local-authority:TOB in entity-organisation.csv. That means that the facts on the entity page should all be from Torbay, even if there are more recent facts from Historic England. And importantly, the Organisation on the entity page will remain as Torbay Council.

Important fields:

  • dataset - the dataset to target, e.g. conservation-area-document
  • organisation - the organisation to apply to, e.g. local-authority:BAB
  • entity-minimum - sets the starting point of that range (inclusive)
  • entity-maximum - sets the ending point of that range (inclusive)
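
The range check from the Torbay example could be sketched like this. The function name is hypothetical, and the dataset name "conservation-area" is assumed for illustration because the example above does not state it.

```python
def authoritative_organisation(rows, dataset, entity):
    """Return the organisation assigned to an entity, or None if unassigned."""
    for row in rows:
        in_range = int(row["entity-minimum"]) <= entity <= int(row["entity-maximum"])
        if row["dataset"] == dataset and in_range:  # both bounds are inclusive
            return row["organisation"]
    return None


entity_organisation = [
    {
        "dataset": "conservation-area",  # assumed dataset name
        "organisation": "local-authority:TOB",
        "entity-minimum": "44002711",
        "entity-maximum": "44002734",
    },
]
owner = authoritative_organisation(entity_organisation, "conservation-area", 44002714)
```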

pipeline/expect

This file is used to define rules that will be used by the pipeline code to generate and test expectations. The fields define the operation that will be used and the parameters to pass it, as well as what the expected result is and some metadata for the expectation that will appear in the expectation table in datasette.

Important fields:

  • dataset - The dataset or list of datasets (separated by ‘;’) within the collection, which the expectation should be executed against.

  • organisations - the organisation or list of organisations which the expectation should be executed for, e.g. local-authority:BAB (list separated by ‘;’).

    If given a list, the rule will create multiple expectations, each of which will be executed against only the entities belonging to one organisation.

    You can also use a dataset to specify a list of organisations, e.g. local-authority, or national-park-authority. Or leave the field blank if the expectation should be executed at a dataset level.

  • operation - the expectation operation to be executed. Must be defined in digital-land/expectations/operation.py.

  • parameters - a JSON string passing the operation parameters. Keys should be enclosed in double quotes, and values in double quotes and braces. This is to handle jinja formatting, which can use class attributes to parameterise some of the inputs, e.g.

    "{""lpa"":""""}"

  • name - the name for any expectations that will be created by the rule. It’s best to not use parameters in this field as a generic name for each expectation created by the same rule makes it easier to quickly compare results in the expectation table.

  • description - a description for the expectation. This can accept jinja formatting to output parameters in the same way as the parameters field, e.g.

    A test to check there are no listed-building-outline entities outside of the boundary for
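
The doubled double quotes in the parameters example come from CSV quoting: the whole cell is quoted, so each quote inside the JSON is written twice. The sketch below reads one hypothetical CSV line holding only the organisations and parameters cells and shows what the pipeline code would see after parsing; it does not attempt any jinja rendering, and the second organisation is illustrative.

```python
import csv
import io
import json

# A hypothetical two-cell CSV line: organisations, then parameters.
line = 'local-authority:BAB;local-authority:TOB,"{""lpa"":""""}"\n'
organisations_cell, parameters_cell = next(csv.reader(io.StringIO(line)))

# Lists of organisations are separated by ";".
organisations = organisations_cell.split(";")

# After CSV unquoting, the cell is ordinary JSON.
parameters = json.loads(parameters_cell)
```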

pipeline/filter

Used to filter a resource so that only a subset of the records are processed, based on whether the values in one of the resource's fields are in a user-defined list.

Example
To only add records from a resource where the tree-preservation-zone-type field has a value of “Area” to the tree-preservation-zone dataset.

See: https://github.com/digital-land/config/blob/main/pipeline/tree-preservation-order/filter.csv#L6

Important fields:

  • field - the field to search for the pattern
  • pattern - the pattern to search for in the field (can just be a string, does this accept regex like in patch?)

NOTE!
Filter config for a dataset will only work for fields that are in the dataset schema. So if you need to filter based on a column that’s in the source data and not in the schema, you will first need to map it to a schema column using column.csv config.
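
A minimal sketch of the filtering step follows. It assumes the pattern is treated as a regular expression matched from the start of the value, which the field description above leaves open; the function name and sample rows are hypothetical.

```python
import re


def filter_rows(rows, field, pattern):
    # Assumption: the pattern is a regular expression matched from the
    # start of the field value. Rows that do not match are dropped.
    regex = re.compile(pattern)
    return [row for row in rows if regex.match(row.get(field, ""))]


rows = [
    {"reference": "T1", "tree-preservation-zone-type": "Area"},
    {"reference": "T2", "tree-preservation-zone-type": "Group"},
]
kept_rows = filter_rows(rows, "tree-preservation-zone-type", "Area")
```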

pipeline/lookup

Used to map the relationships between the reference that a data provider uses to describe a thing, to the entity number that we have assigned to that thing. It is important to appreciate that there can be a 1:1 or a many:1 relationship here because we may collect data from multiple providers who publish information about the same thing (e.g. both LPAs and Historic England publish conservation area data, so we may map a reference from each to the same entity).

These records are produced at the entity assignment phase when an endpoint is added for the first time. Verifying that the lookups have been produced as expected, and managing any merging with existing entities, is an important stage of the Adding an endpoint process.

Important fields:

  • prefix - the dataset this entity is a part of
  • resource - (not used?)
  • organisation - the organisation who provides data about this entity
  • reference - the identifier used by the provider to refer to this entity
  • entity - the reference we have generated to refer to this entity
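
The many:1 relationship can be pictured as a dictionary keyed on prefix, organisation and reference. In the sketch below the function name, the prefix and both provider references are hypothetical; the organisations and entity number reuse the Torbay and Historic England example from the entity-organisation section.

```python
def build_lookup(rows):
    # Key on (prefix, organisation, reference); several keys may share an entity.
    return {
        (row["prefix"], row["organisation"], row["reference"]): int(row["entity"])
        for row in rows
    }


lookup_rows = [
    {"prefix": "conservation-area", "organisation": "local-authority:TOB",
     "reference": "CA-01", "entity": "44002714"},   # hypothetical reference
    {"prefix": "conservation-area", "organisation": "government-organisation:PB1164",
     "reference": "HE-9876", "entity": "44002714"},  # hypothetical reference
]
lookup = build_lookup(lookup_rows)
```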

pipeline/old-entity

Used to redirect and remove entities.

To prevent an entity from appearing on the platform, add a record with status 410 and the entity number to the file and it will be removed.

To redirect an entity, use status 301 along with the current entity number and the target entity number. This is mainly used for merging geographically duplicated entities.

Important fields:

  • old-entity - current entity number
  • status - 301 (redirect) / 410 (remove)
  • entity - target entity number
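
The two status codes could be applied roughly as follows. The function name and entity numbers are hypothetical, and the sketch does not follow chains of redirects.

```python
def resolve_entity(old_entities, entity):
    """Return the entity to show, or None if it has been removed."""
    row = old_entities.get(entity)
    if row is None:
        return entity  # no old-entity record: unchanged
    if row["status"] == "410":
        return None  # removed from the platform
    if row["status"] == "301":
        return int(row["entity"])  # redirected to the target entity
    return entity


# Keyed by old-entity number; values are hypothetical rows.
old_entities = {
    1001: {"old-entity": "1001", "status": "410", "entity": ""},
    1002: {"old-entity": "1002", "status": "301", "entity": "1003"},
}
```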

pipeline/patch

Used to replace values that match a particular string or string pattern with another value.

Example
Change any string in the listed-building-grade field of a resource in the listed-building collection that contains the numeric character “2” to be “II” instead, in order to match the specification.

See: https://github.com/digital-land/config/blob/main/pipeline/listed-building/patch.csv#L10

Important fields:

  • endpoint - (optional) the endpoint hash, used to limit the patch to a specific endpoint
  • field - the field to search for the pattern
  • pattern - the pattern to search for in the field (should this be a regex pattern?)
  • value - the value to use as a replacement for incoming values that match the pattern
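
The listed-building-grade example might be sketched as below. It assumes the pattern is a regular expression searched for anywhere in the value and that a match replaces the whole value; neither point is confirmed by the text above. The function name and the endpoint hash "abc123" are hypothetical.

```python
import re


def apply_patches(row, patches, endpoint):
    for patch in patches:
        # A patch with an endpoint hash only applies to that endpoint.
        if patch.get("endpoint") and patch["endpoint"] != endpoint:
            continue
        current = row.get(patch["field"], "")
        # Assumption: regex search; a match replaces the whole value.
        if re.search(patch["pattern"], current):
            row[patch["field"]] = patch["value"]
    return row


patches = [
    {"endpoint": "", "field": "listed-building-grade", "pattern": "2", "value": "II"},
]
patched = apply_patches({"listed-building-grade": "Grade 2"}, patches, "abc123")
```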

pipeline/skip

Sometimes, the raw data contains extraneous lines that can cause issues during processing. To address this, the skip.csv file is used to skip specific lines from the raw data.

Example
Consider this endpoint URL: the file begins with some extra data that we need to skip.

See: https://github.com/digital-land/config/blob/main/pipeline/developer-contributions/skip.csv#L7

Important fields:

  • pattern - the pattern to search for in the raw endpoint file
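
Skipping unwanted lines could be sketched as follows, assuming each pattern is a regular expression matched from the start of a raw line. The function name, the pattern and the sample lines are hypothetical.

```python
import re


def skip_lines(lines, patterns):
    # Assumption: a line is skipped when any pattern matches from its start.
    regexes = [re.compile(pattern) for pattern in patterns]
    return [line for line in lines if not any(r.match(line) for r in regexes)]


raw_lines = [
    "Report generated 2024-01-01",  # extra line before the header
    "reference,name",
    "DC1,Example",
]
cleaned = skip_lines(raw_lines, [r"^Report generated"])
```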

pipeline/transform

Used to rename fields to match the latest specification. Maps old field names to their current replacements, applied globally across all resources. Use this when a field has been renamed in the specification and you need existing data to continue flowing through correctly.

Example
The brownfield-land specification changed from using fields like OrganisationURI and SiteNameAddress to organisation and site-address. These changes were added to the relevant transform.csv to accommodate this specification change.

See: https://github.com/digital-land/config/blob/main/pipeline/brownfield-land/transform.csv

Important fields:

  • field - the old field name in the source data
  • replacement-field - the new field name in the current specification
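
The brownfield-land renames mentioned above could be applied with a sketch like this one. The function name and sample row are hypothetical, and the pairing of old and new names follows the order given in the example rather than the contents of the linked file.

```python
def transform_fields(row, transforms):
    # Build {old field name: replacement field name} from the config rows.
    renames = {t["field"]: t["replacement-field"] for t in transforms}
    # Fields without a rename are passed through unchanged.
    return {renames.get(name, name): value for name, value in row.items()}


transforms = [
    {"field": "OrganisationURI", "replacement-field": "organisation"},
    {"field": "SiteNameAddress", "replacement-field": "site-address"},
]
renamed = transform_fields(
    {"OrganisationURI": "local-authority:BAB", "SiteNameAddress": "1 High Street", "notes": ""},
    transforms,
)
```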