
What are expectations?

Expectation is the term we use for an assertion made about our data for data quality purposes.

A single assertion might be something like:

conservation-area entities belonging to the London Borough of Lambeth should not have geometries which are beyond the Lambeth local-planning-authority boundary

Understanding whether or not this assertion is true helps us identify data quality issues (in this case whether there are entities with geometries beyond the expected boundary of the providing organisation).

Creating a new expectation

Expectations can be created using rules defined in the pipeline configuration file expect.csv (see the relevant section of the configure an endpoint guide for more detail).

The pipeline code uses these rules to create sets of expectations. Each expectation defines a single assertion, which is made by executing an operation with a given set of parameters against a given set of data. Each assertion produces an entry in the expectation log showing whether the assertion is True or False.

Operations are functions defined in digital-land/expectations/operation.py and designed to work on a particular format of data (e.g. sqlite or csv). They can accept whatever arguments they need but must return three things (a sketch follows this list):

  • result, a boolean value based on whether the expectation has been met or not
  • message, a string giving summarised context to the result
  • details, a dictionary object which can be used to pass back more detailed context for the result, such as a list of entities which did not meet the expectation.
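
As an illustration, here is a minimal sketch of what an operation with this shape might look like, assuming sqlite-format data. The function name, table, column names and file path are hypothetical, not taken from operation.py:

import sqlite3

def count_entities(conn, dataset, minimum, **kwargs):
    # Hypothetical operation: check that a dataset contains at least
    # `minimum` entities. Table and column names are assumptions.
    count = conn.execute(
        "SELECT COUNT(*) FROM entity WHERE dataset = ?", (dataset,)
    ).fetchone()[0]

    result = count >= minimum
    message = f"{dataset} has {count} entities (expected at least {minimum})"
    details = {"actual": count, "expected_minimum": minimum}
    return result, message, details

# Example usage against a local dataset sqlite file (path is illustrative):
conn = sqlite3.connect("dataset/conservation-area.sqlite3")
result, message, details = count_entities(conn, "conservation-area", 1)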

Testing expectations

Once expectations are defined in expect.csv, all you need to do is build the collection locally to see the results.

Expectation logs are saved in the config directory as .parquet files in the following location:

log/expectation/dataset=[DATASET-NAME]/[DATASET-NAME].parquet

One of the quickest ways to examine these is to install DuckDB and use its command line interface (CLI) to query the parquet files.

Once installed, run duckdb to start an interactive session. You can then type a SQL command followed by a semicolon and press Enter to execute it. DuckDB can read all files that match a glob pattern, for example:

SELECT *
FROM 'log/expectation/dataset=conservation-area/*.parquet'
LIMIT 10;

NOTE
To exit the DuckDB CLI, run .quit
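
If you would rather query the logs from Python than from the CLI, the duckdb Python package (installable with pip install duckdb) can run the same query; a minimal sketch, assuming the log files exist at the location described above:

import duckdb

# Read every expectation log parquet file for a dataset and show the
# first few rows.
duckdb.sql(
    "SELECT * FROM 'log/expectation/dataset=conservation-area/*.parquet' LIMIT 10"
).show()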

Alternatively, if you don’t want to build the collection locally, commit your changes to the config repo, run the Airflow workflow for the affected dataset in the development environment, and check the results in the development Datasette instance.

Re-testing

If you want to make changes to the rules and re-test them, there are some steps you can follow to avoid re-building the entire collection from scratch after each change.

Any rules in expect.csv are loaded into var/[COLLECTION-NAME]/cache/config.sqlite3, so for any changes to expect.csv to take effect you need to re-run make init for the collection, which rebuilds the config sqlite database.

So if you want to make expectation rule changes and test them without having to download all of the collection files again, one option is to:

  • make changes to expect.csv
  • re-run make init to rebuild config.sqlite3
  • run rm -rf dataset to remove dataset files
  • run make dataset to build the dataset
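
If you want to confirm that the rebuilt config.sqlite3 has picked up your changes before building the dataset, you can inspect it directly. A minimal sketch in Python; the table layout is not documented here, so list the tables first and query whichever one holds the expectation rules:

import sqlite3

# Open the rebuilt config database (substitute your collection name).
conn = sqlite3.connect("var/[COLLECTION-NAME]/cache/config.sqlite3")

# List the tables, then query the one containing the expectation rules.
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(tables)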