Add an endpoint

Prerequisites:

Clone the config repo by running git clone [gitURL] and update it with make init in your virtual environment
Validate the data. If you haven’t done this yet, follow the steps in ‘Validating an Endpoint’ before continuing.

NOTE!
The endpointchecker will pre-populate some of the commands mentioned in the steps below, check the end of the notebook underneath ‘_scripting’.

Create an import file
If you don’t already have an import.csv file in the root of the config file, simply create one with the command touch import.csv
Add configurations

NOTE!
Check the Endpoint-edge-cases section below for guidance on how to handle configuration for some non-standard scenarios, like a single endpoint being used for multiple provisions, or an endpoint for the tree dataset with polygon instead of point geometry.
1. Populate the import file
  
  The following columns need to be included in import.csv:
  - endpoint-url - the url that the collector needs to extract data from
  - documentation-url - a url on the provider’s website which contains information about the data
  - start-date - the date that the collector should start from (this can be in the past)
  - plugins - if a plugin is required to extract that data then it can be noted here otherwise leave blank
  - pipelines - the pipelines that need to be ran on resources collected from this endpoint. These are equivalent to the datasets and where more than one is necessary they should be separated by ;
  - organisation - the organisation which the endpoint belongs to. The name should be in this list
  - licence - the type of licence the data is published with. This can usually be found at the dataset’s documentation url.
  The endpoint checker should output text you can copy into import.csv with the required headers and values, or alternatively copy the headers below:
```
organisation,endpoint-url,documentation-url,start-date,pipelines,plugin,licence
```
  Using the same example from validating an endpoint, the import.csv should look like this:
```
organisation,documentation-url,endpoint-url,start-date,pipelines,plugin,licence
local-authority-eng:SAW,https://www.sandwell.gov.uk/downloads/download/868/article-4-directions-planning-data,https://www.sandwell.gov.uk/downloads/file/2894/article-4-direction-dataset,,article-4-direction,,ogl3
```
2. Make changes to pipeline configuration files
  
  Use the how to configure an endpoint guide to see how each of the configuration files works.
  
  The most common step here will be using column.csv to add in extra column mappings.
Run add_data OR add_endpoint_and_lookups script
1. (Preferred) Run add_data
  
  Run the following command inside the config repository within the virtual environment:
```
digital-land add-data [INPUT-CSV-PATH] [COLLECTION_NAME] -c collection/[COLLECTION_NAME] -p ./pipeline/[COLLECTION_NAME] 
```
  The command will fetch from the endpoint, process the resource, and assign entities if necessary, providing feedback and warnings along the way.
  An example command would be:
```
digital-land add-data ./import.csv brownfield-land -c collection/brownfield-land/ -p pipeline/brownfield-land/
```
2. (Legacy) Run add_endpoints_and_lookups_script
  
  Run the following command inside the config repository within the virtual environment:
```
digital-land add-endpoints-and-lookups [INPUT-CSV-PATH] [COLLECTION_NAME] -c ./collection/[COLLECTION_NAME] -p ./pipeline/[COLLECTION_NAME]
```
  The completed command will be given in the scripting section of the endpoint_checker.
  
  For example (the actual command will vary based on the dataset added, article-4-direction is used as an example):
```
digital-land add-endpoints-and-lookups ./import.csv article-4-direction -c./collection/article-4-direction -p ./pipeline/article-4-direction
```
  Improved Method:
  This method defines the required prerequisite parameters for us.
  
  Syntax
```
make add-data COLLECTION=[COLLECTION_NAME] INPUT_CSV=[INPUT_FILE]
```
  For example
```
make add-data COLLECTION=conservation-area INPUT_CSV=import.csv
```
(Optional) Update entity-organisation.csv

If the data that has been added is part of the conservation-area collection e.g conservation-area and conservation-area-document, the entity range must be added as a new row. This is done using the entities generated in lookup. Use the first and the last of the entity numbers of the newly generated lookups e.g if 44012346 is the first and 44012370 the last, use these as entity-minimum and entity-maximum.

For an explanation of how the file works, see entity-organisation.
Check results
After running the command, the endpoint.csv, lookup.csv, and source.csv should be modified.
- A new line should be added to endpoint.csv and source.csv.
- For each new lookup, a new line should be added to the lookup.csv.
  The console output will show a list of new lookups entries organised by organisation and resource-hash. Seeing this is a good indication that the command ran successfully.
  For example:
```
----------------------------------------------------------------------
>>> organisations:['local-authority-eng:SAW']
>>> resource:2b142efd3bcfe29660a3b912c4f742b9c7ff31c8ca0a02d93c9aa8b60e8e2469
----------------------------------------------------------------------
article-4-direction,,local-authority-eng:SAW,A4D1,6100323
article-4-direction,,local-authority-eng:SAW,A4D2,6100324
...
```
Test locally
Once the changes have been made and pushed, the next step is to test locally if the changes have worked. Follow the steps in building a collection locally
Push changes
Commit your changes to a new branch that is named after the organisation whose endpoints are being added (use the 3 letter code for succinct names, e.g. add-LBH-data).

Push the changes on your branch to remote and create a new PR. This should be reviewed and approved by a colleague in the Data Management team before being merged into main.

Once the chages are merged they will be picked up by the nightly Airflow jobs which will build an updated dataset.
Run action workflow (optional)
Optionally, if you don’t want to wait until the next day, you can manually execute the workflow that usually runs overnight yourself in order to be able to check if the data is actually on the platform. Simply follow the instructions in the guide for triggering a collection manually.

Endpoint edge-cases

Handling Combined Endpoints

Note that, when adding an endpoint that feeds into separate datasets or pipelines (such as an endpoint with data for tree-preservation-zone and tree), the pipeline field in the import.csv file should be formatted to contains both datasets as follows:

pipelines
tree-preservation-zone;tree

When handling this type of endpoint, two possible scenarios may arise.

The endpoint includes two datasets: one spatial and one non-spatial - It may be necessary to use separate columns as a reference field for each dataset. In such cases, add a column mapping for each dataset contained within the endpoint in the column.csv file.
The endpoint includes two datasets both being spatial - For scenarios where the endpoint includes a column that determines the dataset, use the filter.csv file. Follow the instructions in ‘TPZ and Tree data in the same endpoint’ for this scenario.

TPZ and Tree data in the same endoint

We might receive an endpoint that contains both Tree and TPZ data. When this happens we can usually use a filter.csv configuration to process a subset of the endpoint data for each dataset. Data supplied like this should have a tree-preservation-zone-type field for the TPZ data, which should contain one of area, woodland or group for TPZs and individual for trees.

NOTE!
filter.csv config for a dataset will only work with a field that is in the dataset schema, and the tree-preservation-zone-type is not in the tree schema. So if you need to filter tree data using this field, it will first need to be mapped to a field in the tree schema that can then be used by filter.csv. You can use the tree-preservation-order-tree field (which isn’t in the website guidance or tech spec, but is in the specification repo spec), like this example in column.csv.

For example:

column.csv config

dataset,endpoint,resource,column,field,start-date,end-date,entry-date
tree,d6abdbc3123bc4b60ee9d34ab1ec52dda34d67e6260802df6a944a5f7d09352b,,tree_preservation_zone_type,tree-preservation-order-tree,,,

filter.csv config

dataset,resource,field,pattern,entry-number,start-date,end-date,entry-date,endpoint
tree-preservation-zone,,tree-preservation-zone-type,(?!Individual),,,,,d6abdbc3123bc4b60ee9d34ab1ec52dda34d67e6260802df6a944a5f7d09352b
tree,,tree-preservation-order-tree,Individual,,,,,d6abdbc3123bc4b60ee9d34ab1ec52dda34d67e6260802df6a944a5f7d09352b

Tree data with polygon instead of point

By default, the tree dataset wkt field (which is the incoming geometry from the resource) is mapped to point, with by a global mapping in column.csv. When a provider gives us a polygon data instead of a point, we need to add a mapping in the column.csvfile for the specific endpoint or resource from wkt to geometry which will override the default mapping.

For example:

tree,422e2a9f2fb1d809d8849e05556aa7c232060673c1cc51d84bcf9bb586d5de52,,WKT,geometry,,,

As an example, this datasette query shows a resource where we were provided a polygon dataset for tree so we mapped wkt to geometry.
Whereas this one was a point format so we did not need to override the mapping. You’ll notice that the field related to the column wkt is point.