Running A Data Collection Pipeline
For data engineers, and often others on our team, this is a key process: it generates the files that are later loaded into the platform.
Once you understand how to run it for one collection, you can apply the same steps to any other, for example to debug errors that occurred overnight. Before starting, it is worth reading the key concepts in the data operations manual for clarity on the terms we use; this tutorial describes the practical application of those concepts.
Anatomy of collection config
First, it is good to understand the anatomy of a collection's configuration. The config repo is the home of the configuration for all collections.
Note: these files may not exist yet, as they are generated when the collection is initialised and run.
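As a rough illustration of that initialisation step, here is a minimal sketch that creates the directory skeleton a run expects. The helper name `initialise_collection` is hypothetical and the directory list is taken from the Inputs and Outputs sections below; the real tooling may create these differently.

```python
import os

# Directories from the Inputs/Outputs lists below; creating them up
# front is an assumption about how initialisation works, not a
# description of the real tooling.
COLLECTION_DIRS = [
    "collection/resource",
    "collection/log",
    "fixed",
    "harmonised",
]

def initialise_collection(root="."):
    """Create the directory skeleton a collection run expects (hypothetical helper)."""
    for directory in COLLECTION_DIRS:
        os.makedirs(os.path.join(root, directory), exist_ok=True)

if __name__ == "__main__":
    initialise_collection()
```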
Inputs
The inputs to the collection process are listed below; a sketch of a collector that consumes them follows the list.
- collection/source.csv — the list of data sources by organisation, see specification/source
- collection/endpoint.csv — the list of endpoint URLs for the collection, see specification/endpoint
- collection/resource/ — collected resources
- collection/resource.csv — a list of collected resources, see specification/resource
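To make the relationship between these inputs and the outputs below concrete, here is a minimal sketch of a collector that reads endpoint.csv, fetches each URL, and stores the response under the hash of its content. The column name `endpoint-url` and the log fields used here are assumptions; the authoritative schemas are in specification/endpoint and specification/log.

```python
import csv
import hashlib
import json
import urllib.request
from datetime import datetime, timezone

def collect(endpoint_csv="collection/endpoint.csv"):
    """Fetch every endpoint URL and save each response as a resource.

    Resources are named by the hash of their content, so re-collecting
    an unchanged endpoint yields the same resource file.
    """
    with open(endpoint_csv, newline="") as f:
        for row in csv.DictReader(f):
            url = row["endpoint-url"]  # assumed column name
            entry = {
                "endpoint": row.get("endpoint", ""),
                "entry-date": datetime.now(timezone.utc).isoformat(),
            }
            content = None
            try:
                with urllib.request.urlopen(url, timeout=30) as response:
                    content = response.read()
                    entry["status"] = str(response.status)
            except Exception as exc:  # record failures for later debugging
                entry["exception"] = str(exc)
            if content is not None:
                resource = hashlib.sha256(content).hexdigest()
                entry["resource"] = resource
                with open(f"collection/resource/{resource}", "wb") as out:
                    out.write(content)
            # One log JSON file per fetch (see Outputs below).
            log_name = hashlib.sha256(url.encode()).hexdigest()
            with open(f"collection/log/{log_name}.json", "w") as log:
                json.dump(entry, log, indent=2)
```

Because resources are named by content hash, re-running the collector against unchanged endpoints does not create duplicate resource files.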
Outputs
- collection/log/ — individual log JSON files, created by the collection process
- collection/log.csv — a collection log assembled from the individual log files (a sketch of this step follows the list), see specification/log
- collection/resource.csv — a list of collected resources, see specification/resource
- fixed/ — contains amended resources that previously could not be processed
- harmonised/ — the output of the harmonise stage of the pipeline
- /var/converted/ — contains CSV files (named by the hash of the resource) with the outputs of intermediary steps used to create the transformed/ files
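Finally, collection/log.csv is assembled from the individual JSON files in collection/log/. A minimal sketch of that assembly step, which also prints any fetches that did not return HTTP 200 (the overnight errors this tutorial is about debugging), might look like this; the field names are assumptions, with the authoritative list in specification/log.

```python
import csv
import glob
import json

# Assumed log fields; the authoritative list is in specification/log.
FIELDNAMES = ["endpoint", "entry-date", "status", "exception", "resource"]

def assemble_log(log_dir="collection/log", out_csv="collection/log.csv"):
    """Combine individual log JSON files into a single log.csv."""
    rows = []
    for path in sorted(glob.glob(f"{log_dir}/*.json")):
        with open(path) as f:
            entry = json.load(f)
        rows.append({key: str(entry.get(key, "")) for key in FIELDNAMES})
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
    # Surface anything that did not return HTTP 200 -- these are the
    # overnight failures worth investigating first.
    for row in rows:
        if row["status"] != "200":
            print("check:", row["endpoint"], row["status"] or row["exception"])

if __name__ == "__main__":
    assemble_log()
```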