Data validation with Airflow

Data validation with Airflow can be a simple way to run data quality checks without extra overhead. Here I will briefly cover data checks, or data quality tests, when importing user-input data (for example, when integrating other data management systems such as CRMs, or when taking mapping tables as input).

Data validation methods

1. Data validation

Data validation refers to verifying that your assumptions about the data hold: that a cost is a number and not a string, that a date is a valid date, that entries are unique, and that values are not left empty.

Ideally, you not only validate the data itself (that it fits the expected conditions, formats, data types, primary keys, etc.) but also validate its referential integrity.

2. Referential validation

Referential validation means checking that the references to or from the data you imported are valid.

If you import entities (orders, products, users) from a data management system (CRM, CMS), make sure those entities also exist in your database. For example, make sure that the companies you import from Salesforce or another CRM also exist in your company dimension table.
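As a rough illustration (the table and column names here are hypothetical), the Salesforce example boils down to selecting the imported companies that have no match in the dimension table:

```python
# Hypothetical tables: crm_companies holds the imported CRM records,
# dim_company is the company dimension table in the warehouse.
ORPHANED_COMPANIES_SQL = """
    SELECT c.company_id, c.company_name
    FROM   crm_companies AS c
    LEFT JOIN dim_company AS d ON d.company_id = c.company_id
    WHERE  d.company_id IS NULL
"""
```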

3. Anomaly detection

This is about detecting potential data issues or events underlying the data. It often takes the form of a monitoring system for data quality that alerts on unexpected changes.

Data validation with Airflow

Anomaly detection is usually done on streaming data and does not make sense to run from Airflow, which is a workflow engine meant for batch jobs. For data quality checks, however, Airflow works perfectly well.

For data validation, loading your data into a table with constraints (unique, not null, etc.) will produce a fairly meaningful message on failure. You can simply add the data producer to the on-failure emails, and they will receive the message whenever the load fails.
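Here is a minimal sketch of that setup, assuming a Postgres warehouse and the Airflow Postgres provider; the connection id, table names, and email address are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

# The data producer is on the failure email list, so a constraint
# violation reaches them directly.
default_args = {
    "email": ["data-producer@example.com"],
    "email_on_failure": True,
}

with DAG(
    dag_id="import_mapping_table",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    # The target table carries the constraints, e.g.:
    #   CREATE TABLE mapping_table (
    #       product_id TEXT PRIMARY KEY,
    #       cost       NUMERIC NOT NULL,
    #       valid_from DATE NOT NULL
    #   );
    # A duplicate key, a NULL, or a value that cannot be cast fails the
    # task, and the database error message lands in the failure email.
    load_mapping_table = PostgresOperator(
        task_id="load_mapping_table",
        postgres_conn_id="warehouse_db",
        sql="""
            INSERT INTO mapping_table (product_id, cost, valid_from)
            SELECT product_id, cost::NUMERIC, valid_from::DATE
            FROM staging.mapping_table_raw
        """,
    )
```

The point of this pattern is that the constraints, not custom code, do the validating; the same approach works with any database Airflow can talk to.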

For referential validation, a foreign key error would likely not give the data owner enough information to fix the issue. Instead, I prefer to run a query that selects the offending records and send the output to the data producer. I wrap the SQL in a Python function that asserts the output is empty and create an Airflow task for it in the relevant DAG. If offending records exist, Airflow sends the failure message to the data owner with the offending keys.
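A minimal sketch of such a task, reusing the ORPHANED_COMPANIES_SQL query from the referential validation section above (the connection id, DAG name, and email address are again placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def check_company_references():
    # Run the referential check and fail loudly when offending records
    # exist; the assertion message ends up in the task log and in the
    # failure email, so the data owner sees exactly which keys to fix.
    hook = PostgresHook(postgres_conn_id="warehouse_db")
    offending = hook.get_records(ORPHANED_COMPANIES_SQL)  # query sketched earlier
    assert not offending, f"Companies missing from dim_company: {offending}"


with DAG(
    dag_id="crm_import",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    default_args={
        "email": ["crm-owner@example.com"],  # the data owner gets the failure mail
        "email_on_failure": True,
    },
    catchup=False,
) as dag:
    check_company_dimension = PythonOperator(
        task_id="check_company_dimension",
        python_callable=check_company_references,
    )
```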

How do you do your data checks? Bonus points if your solution makes use of existing components without overhead.