Collect

The collect process is used in our data collection pipelines to download a resource from an endpoint. It produces one collection log per endpoint for a given date.

Data Collection Pipeline

The collect process takes a CSV, which should conform to the Endpoint dataset specification, and a single directory in which two outputs are recorded:

  • log - a JSON log file named after the hash of the endpoint that was checked. It is added to a log directory, which is partitioned by entry-date.
  • resource - a file (which could be in any format) named after the hash of its contents. The data file is always referred to by this hash in our system (see the sketch after this list).
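
As an illustration, the name of a resource file can be reproduced from its contents. This is a minimal sketch assuming the hash algorithm is SHA-256; check digital-land-python for the canonical implementation:

# Recompute the content hash that names a resource file.
# Assumption: SHA-256 is the hashing algorithm used.
sha256sum collection/resource/<resource-file> | cut -d' ' -f1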

digital-land-python

The collect process uses a small amount of code from digital-land-python. The command is found in the commands.py file. It offers a functional way of collecting resources, but the heavy lifting is done by the Collector class found in collect.py.
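
For example, the collector can be run locally through the CLI. A minimal sketch, assuming digital-land-python is installed from its GitHub repository and that endpoint.csv follows the Endpoint dataset specification:

# Install digital-land-python (assumption: installing directly from GitHub)
pip install git+https://github.com/digital-land/digital-land-python.git

# Download each endpoint listed in endpoint.csv, writing logs and resources under ./collection
digital-land collect endpoint.csv --collection-dir ./collection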

Batch Implementation

This process is run as part of our data collection pipelines.

They are run during a task in our collection DAGs in Airflow. The file generating these DAGs can be found here.

Airflow triggers an ECS task in Fargate, which uses the collection-task repository.

The script for all processes run in the ECS task is run.sh.

In that script, the make collect target is used, which invokes the digital-land CLI:

digital-land ${DIGITAL_LAND_OPTS} collect <ENDPOINT_CSV> --collection-dir <COLLECTION_DIR>

The argument options are passed in by setting environment variables; defaults are chosen via make if they are not set.
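
A sketch of such an invocation is below; the exact environment variable names are assumptions taken from the placeholders in the command above:

# Hypothetical invocation; ENDPOINT_CSV and COLLECTION_DIR are assumed variable names.
export DIGITAL_LAND_OPTS=""                      # extra CLI options, none here
export ENDPOINT_CSV=collection/endpoint.csv
export COLLECTION_DIR=collection/
make collect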

Once complete, the results are pushed to S3 and saved using another make target.
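
The target itself is not shown here, but the push might look like the following sketch, assuming the AWS CLI; the bucket name is purely illustrative:

# Hypothetical sketch of the S3 push; the bucket and paths are illustrative only.
aws s3 sync collection/ s3://<collection-bucket>/collection/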

Dynamic Implementation

This process is run as part of the following tasks:

  • check_url