Building a collection locally
Before we commit any changes to main, we need to be sure that the changes have the desired effect. To do this, we build a datasette database locally for any collection.
Testing can be done from the config
repo. Testing the entire collection can take a while so it’s quicker focussing on only the newly added endpoint data.
NOTE!
This assumes you have the collection setup. If not, simple run the commandsmake makerules COLLECTION=[collection_name]
andmake init COLLECTION=[collection_name]
-
In the dataset’s
collection
folder, edit all the csv files to only include the newest record, everything else should be removed. Soendpoints.csv
andsource.csv
should only have the headers and the newly added endpoints/sources.old-resource.csv
should be empty (unless the change was retiring a resource). Remove theresource.csv
andlog.csv
as well as the samely named folders (with everything in it) as these can sometimes cause errors. -
In the dataset’s
pipeline
folder, anything targetting another resource/endpoint should be removed. This includes the following files[column.csv, combine.csv, old-entity.csv, and patch.csv]
. This is done to ensure that no errors will appear during the collection and transformation of the resource as well as the creation of the sql database. -
Run the pipeline in the collection repo by running
make COLLECTION=[collection_name]
e.g.make COLLECTION=brownfield-land
. -
After the pipeline has finished running, use
make datasette COLLECTION=[collection_name]
to interrogate the local datasets; this will enable you to check that the data is on the local platform as expected. Check that the expected data is there and no issues that are unexpected show up.
NOTE!
If testing the entire collection is required, only removing theresource
andlog
files and folders is required. To speed up the download process, it’s best to run the tasks in parallel, as downloading each resource individually could take hours for some of the larger datasets. You can achieve this by using the -j option with the make command, for example:
make -j 16 COLLECTION=[collection_name]
The -j option specifies the number of jobs (commands) to run concurrently. The higher the number, the more tasks will run at the same time, which can significantly speed up the process. You can adjust this number depending on the number of CPU cores available on your machine. For example, if your machine has 8 cores, you could use-j 8
; for 16 cores, you might try-j 16
.
Test and rebuild quicker!
Sometimes it might be necessary to test multiple times to get the right configuration down. To make this easier and quicker,make clobber COLLECTION=[collection_name]
can be used. This keeps the downloaded resources but removes all the logs.