Data validation with Airflow

Data validation with Airflow can be a simple way to add data quality checks without much overhead. Here I will briefly cover data checks and data quality tests when importing user-input data (for example, when integrating other data management systems such as CRMs, or when taking mapping tables as input).

Data validation methods

1. Data validation

Data validation refers to verifying that what you think about the data is true: that a cost is a number and not a string, that a date is a valid date, that entries are unique, that values are not left empty.

Ideally, you not only validate the data itself (that it fits the expected conditions, formats, data types, primary keys, etc.) but also validate its referential integrity.

2. Referential validation

Referential validation is about validating that the references to or from the data you imported are valid.

If you import entities (orders, products, users) from a data management system (CRM, CMS), make sure those entities exist in your database. For example, make sure that the companies you import from Salesforce or another CRM also exist in your company dimension table.

3. Anomaly detection

This is about detecting potential data issues or events underlying the data. It often takes the form of a monitoring system for data quality that alerts on unexpected changes.

Data validation with Airflow

Anomaly detection is usually done on streaming data and does not make sense to run from Airflow, which is a workflow engine meant for batch jobs. For data quality checks, however, Airflow works perfectly.
For data validation, loading your data into a table with constraints (unique, not null, etc.) will produce a reasonably meaningful error message on failure. You can simply add the data producer to the on-failure email list, and they will receive that message when the load fails.

For referential validation, a foreign key error alone would likely not give the data owner enough information to fix the issue. As such, I prefer to use a query that selects the offending records and send the output to the data producer. I wrap the SQL in a Python method that asserts the output is empty and create a task for it in the relevant Airflow DAG. If offending records exist, Airflow sends the failure message, including the offending keys, to the data owner.
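A minimal sketch of this pattern is shown below, assuming Airflow 2 with the Postgres provider installed; the DAG name, connection id, query, table names, and email address are all illustrative placeholders.

```python
# Hedged sketch: a referential-integrity check as an Airflow task.
# Connection id, table names, and email are placeholders, not a real setup.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def check_companies_exist():
    """Fail the task (and trigger the failure email) if orphaned records exist."""
    hook = PostgresHook(postgres_conn_id="dwh")
    orphans = hook.get_records(
        """
        SELECT c.company_id
        FROM crm_companies AS c
        LEFT JOIN dim_company AS d ON d.company_id = c.company_id
        WHERE d.company_id IS NULL
        """
    )
    # The assertion message ends up in the failure email with the offending keys.
    assert not orphans, f"Companies missing from dim_company: {orphans}"


with DAG(
    dag_id="crm_import",
    start_date=datetime(2017, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"email": ["data.producer@example.com"], "email_on_failure": True},
) as dag:
    PythonOperator(
        task_id="check_referential_integrity",
        python_callable=check_companies_exist,
    )
```

The same DAG-level default_args cover the plain data validation case too: a load task that hits a unique or not-null constraint fails, and the data producer receives the database error message by email.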

How do you do your data checks? Bonus points if your solution makes use of existing components without overhead.

Easy ETL, or “friends don’t let friends write loading code”

Easy ETL is what every tool promises, but most do not deliver. Over the last few years there has been a strong shift in BI towards the use of Python. If previously any scripting language would do, nowadays it is almost impossible to manage without Python specifically.

Take the ever more common open source tech stack of:
SQL for storage and analysis
Python as the glue language
Airflow (Python) as the workflow engine
Airflow (Python) + custom utils (any scripting language) as the framework.

You can reduce the complexity by sticking with Python as far as possible for the utils. This makes a lot of sense, as you can import the utils directly in the job (DAG) definition file and call and parametrize your methods with ease.

Having worked on multiple BI data warehousing projects over the last few years, the most common mistake I see is underestimating the cost of SQL maintenance, and the most common difficulty is staffing enough people to maintain and extend the large codebases.

The cost of maintaining SQL throughout its lifecycle is usually estimated at 3-4x the time it takes to create it in the first place. The truth is, for the most vital components of your transformations, the cost of maintenance is closer to 10x the cost of the original development. You will touch this code again and again every time there is a source data change, an addition, and so on. Add a new column? Easy! Just modify your data loading scripts, your DM creation scripts, your reporting scripts, your data dictionary, and the frontend. Change a column? Not so easy anymore.

So we are already familiar with the hard way: write code to do the transformations, and when the schema changes, add or modify the code.

So what is the easy way? Write code to write the code that does the transformations. Many of the transformation steps are repetitive, so if you standardize things, you can easily re-use the same few parametrized operations. This is called "don't repeat yourself", or DRY, and it's a basic concept for producing decent, maintainable code. You know, what you SHOULD be doing instead of copy-pasting all that SQL. But while many of us know our code is WET, it is sometimes hard to abstract and standardize. So how do you do it?

Standardize.

When it comes to loading data into a table, incrementally or not, there are only a few ways to do this. Ideally, stick with the basics and re-use the code. If most of your operations are of one type, and a new one might run faster if done differently, consider that writing new loading logic and maintaining it 10 times over is likely more expensive than the drawbacks of your existing solution.

Abstract.

Split your jobs into subroutines. You will quickly find that there are many common parts that can be abstracted out and written as reusable methods in shared utils. You can then construct the job from those blocks, leaving only the job-specific code to manage in any particular job.

Take the following examples:

A job that pulls some data from an API:
1. Download from the API to a CSV
2. Load the CSV into the database
3. Merge the increment

A job that loads data from an RDBMS into the DWH:
1. Download from the DB to a CSV
2. Load the CSV into the database
3. Merge the increment

A job that creates some reports:
1. Run some SQL
2. Run some code
3. Send the report

A job that monitors your data:
1. Run some SQL
2. Check if normal
3. Send a report if abnormal

A job that delivers some segments into your CRM tool:
1. Run some SQL
2. Write to a file
3. Push to the API

Notice how many of the steps are shared across jobs? What if we abstracted as much of the common stuff as we could? We would only need to write the components that we do not already have.

A job that pulls some data from an API:
1. Download from the API to a CSV

A job that loads data from an RDBMS into the DWH:
1. Make a list of tables to copy

A job that creates some reports:
1. Write the report SQL

A job that monitors your data:
1. Write some SQL
2. Check if normal

A job that delivers some segments into your CRM tool:
1. Run some SQL
2. Push to the API

Notice we now have only 7 of the 15 pieces of code to deal with. Reducing complexity in this manner not only cuts development time, but also makes future maintenance easier. Want to stop loading to Postgres and load to Redshift instead? Just change the loading method once. Want to stop worrying about staging tables? Create them on the fly based on the target table and drop them when you are done.
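As a rough illustration, a job built from shared blocks might look like the sketch below; the util names (download_from_api, load_csv, merge_increment) are hypothetical stand-ins for your own shared utils.

```python
# Hedged sketch: composing a job from shared, reusable building blocks.
# The `utils` module and its functions are hypothetical examples.
from utils import download_from_api, load_csv, merge_increment


def orders_job():
    # Step 1 is the only job-specific piece; steps 2 and 3 reuse shared code.
    csv_path = download_from_api("https://api.example.com/orders", "/tmp/orders.csv")
    load_csv(csv_path, table="staging.orders")
    merge_increment(source="staging.orders", target="dwh.orders", keys=["order_id"])
```

Swapping Postgres for Redshift then means changing load_csv once, not editing every job.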

Pareto principle, KISS, lean, or whatever you prefer.

You will do 80% of the work with 20% of the code. Do not try to solve every potential problem at once. Start with something that solves the majority of your worries.

You do not need to have the best code running from the start. Solve the problems as well as you can without adding complexity. While your shared methods might not always be the best suited for each case, they are better than 5x the code.

A few examples:

For loading data, if you are using Postgres, MySQL, or another RDBMS, the fastest way to load is through the bulk copy command (COPY in Postgres, LOAD DATA in MySQL). Write a Python method that takes the file and table as parameters. Now you can call it as method(file, table, args=defaults) every time you need to load some data, without writing the copy statement each time. For engines like Redshift, your method will additionally need to copy the file to S3 first and parametrize the COPY with S3 credentials.
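A minimal sketch of such a loader, assuming Postgres and psycopg2; the connection parameters and function name are illustrative.

```python
# Hedged sketch: a reusable bulk loader built around Postgres COPY.
import psycopg2


def load_csv(file_path, table, conn_params, delimiter=","):
    """Bulk-load a CSV file into `table` using the COPY command."""
    copy_sql = f"COPY {table} FROM STDIN WITH CSV HEADER DELIMITER '{delimiter}'"
    with psycopg2.connect(**conn_params) as conn:
        with conn.cursor() as cur, open(file_path) as f:
            cur.copy_expert(copy_sql, f)


# Usage (connection details are placeholders):
# load_csv("orders.csv", "staging.orders", {"host": "localhost", "dbname": "dwh"})
```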

For incremental loading, write methods for your common operations. If you want to do an increment merge (described in the Redshift documentation), it is easy to automate as merge_increment(source, target, keys=[]).
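A minimal sketch of such a method, following the staging-table upsert pattern (delete matching keys, then insert); the cursor is assumed to come from the same connection handling as the loader above.

```python
# Hedged sketch: a parametrized increment merge (delete-and-insert upsert).
def merge_increment(cursor, source, target, keys):
    """Replace rows in `target` that share key values with `source`, then append."""
    join_condition = " AND ".join(f"{target}.{key} = {source}.{key}" for key in keys)
    cursor.execute(f"DELETE FROM {target} USING {source} WHERE {join_condition}")
    cursor.execute(f"INSERT INTO {target} SELECT * FROM {source}")


# Usage: merge_increment(cur, "staging.orders", "dwh.orders", keys=["order_id"])
```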

For indexing: all your join keys and heavily used columns tend to have standard names – id, sk, date, from, to, ts, and so on. Why adjust indexes by hand every time you make a change when you can use standardised naming to your advantage? Using the system tables, you can identify the table name, column, and data type for each index candidate based on your naming convention, and create the indexes automatically.
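A minimal sketch of convention-based indexing on Postgres, using information_schema; the suffix list and naming convention are illustrative.

```python
# Hedged sketch: create indexes automatically based on column naming conventions.
INDEX_SUFFIXES = ("_id", "_sk", "_date", "_from", "_to", "_ts")


def index_by_convention(cursor, schema):
    """Index every column in `schema` whose name matches the naming convention."""
    cursor.execute(
        "SELECT table_name, column_name "
        "FROM information_schema.columns WHERE table_schema = %s",
        (schema,),
    )
    for table, column in cursor.fetchall():
        if column == "id" or column.endswith(INDEX_SUFFIXES):
            # IF NOT EXISTS keeps the routine idempotent across runs.
            cursor.execute(
                f"CREATE INDEX IF NOT EXISTS ix_{table}_{column} "
                f"ON {schema}.{table} ({column})"
            )
```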

I hope this helps some of you in your data delivery journey. How do you keep your complexity low?

ETL vs ELT: It’s all about maintenance!

 

When designing a job, you will have to decide at some point between ETL and ELT. This boils down to the question of whether you perform the transformation to a dimensional model in SQL or in a programming language.

Why, you ask?

What does ETL vs ELT mean for you practically?

Well, ETL stands for "Extract – Transform – Load": you extract data from a source, transform it into a dimensional model, and in the last step load it into the data warehouse for analysis.

ELT, on the other hand, stands for – you guessed it – "Extract – Load – Transform": you extract the data, load it into a staging database, and from there transform it into the dimensional model.

The difference is that the transformation step occurs before loading into the database (ETL) versus after (ELT). In theory, the data is loaded before transformation to make it easier to access; the transformation then takes place from the loaded staging database into the data warehouse dimensional model. In practice, the fundamental decision between ETL and ELT is the one between using a programming language or SQL, because once the data is in a staging database, the transformation will usually happen in complex SQL.

ETL vs ELT: my recommendation

I am going to assume you are using a tech stack of one or more programming languages plus SQL. In this case, what you need to decide is whether you want to write the complex transformations in a programming language or in the query language. In my experience, it's easier to maintain jobs where you do a little more programming and write less SQL. This is because SQL does not support testing natively, and you cannot easily refactor it, so you end up rebuilding instead.

Write less SQL


Let me start by saying I like SQL, and I believe it is fantastic in its ease of use. It’s meant for rearranging and aggregating data, making it easy to create reports from a dimensional model. I do however believe we should not misuse it, and simply write less SQL.

Now, there are two types of SQL code in BI: Simple SQL and Complex SQL.

Simple SQL

Simple SQL is the type of SQL that makes reports from a dimensional model – it seldom has subqueries, and all it does is some joining/grouping/aggregating, as all the numbers are available in fact tables in ready-to-use form. It's easy enough that even some business users can understand it and possibly maintain it.

Simple SQL is fantastic because it allows easy operations on prepared data sets. It is easy to read and to maintain.

Complex SQL

Complex SQL is the SQL that makes fact tables, or the SQL of data products (data transformations for data applications). This SQL often has many subqueries, multiple transformation steps, and sometimes data cleaning and enrichment.

Complex SQL is hard to maintain, and even harder to extend or modify. It often ends up worse and worse after each iteration. It’s next to impossible to refactor and you usually end up having to rewrite it after a while.

Complex SQL causes a high rate of bugs, high maintenance costs, and a low bus factor. Since SQL is not easy to test (compared to tests in a programming language), it is unlikely ever to have good test coverage, making it dangerous to work with. Subtle logic bugs stay well hidden, and due to the monolithic nature of applications written in SQL, it is very hard to maintain and debug.

I suggest avoiding complex SQL entirely. The key concept for this argument is ETL versus ELT: transform your data before it hits the database, and do it with easily testable, DRY code.

Many decide to do the complex transformations in SQL mainly because it's quick and easy. In the long run, however, this quick and easy SQL code costs you quite a bit of maintenance, and it will get worse with each iteration, as it cannot easily be refactored.

Instead of the query language, use a programming language with tests. This is simply the better approach if you want iteration speed and accuracy.

Write less SQL, use a programming language.

That's right, write less SQL. Use a programming language and DRY your code. Only use SQL for easy or temporary transformations, and use something like Python for the complex transformations. There are limitations to this approach, but they are minor compared to the improvement it brings to the BI development process. Make a leap of faith and write less SQL!

The data flow breakdown

For easy reference, I'd like to define the typical data flow encountered in business intelligence by its steps: data integration, dimensional modelling, and data consumption. These are the typical flows that make first- and third-party data available for analysis, and it is this data flow that brings a return on investment when investing in BI.

1. Data flow part one: Data Integration

Integration measures how many of the first and third party data sources have been unified in a single place. This single place is usually called an integration layer. It can be any easy-to-access data store. Typically it would take the form of files on a network location (FTP, S3, GCS) or a relational database.

Usually this part is pretty messy, in that the external sources are diverse and different. Often each source requires its own access method to extract the relevant data, so you end up with a very messy environment. Ideally, handle it with a widely used scripting language to 'glue' the systems together. It is good practice here to use some form of shared utils to DRY your code; common utils for loading data sources will prove very time-saving.

2. Data flow part two: Dimensional Modelling

Dimensional modelling as a step in the data flow is the transformation of integration data into a format suitable for analysis, usually in the form of a snowflake schema. The goal is to arrange the dimension and fact tables such that metrics are unified and therefore the same across reports.

Some small reporting set-ups omit dimensional modelling, but it is useful in enabling self service reporting.

Ultimately, the end goal of dimensional modelling is to provide fact and dimension tables containing correct data, which you will later use for reporting and analysis. Ideally, locate them in a SQL database for easy access.

3. Data flow part three: Data consumption

Here, I want to avoid using the terms reporting or analysis. In my opinion, reporting is just one of many channels through which a data consumer or analyst accesses data. Such data consumption can take the form of an analysis that requires ad hoc query writing and statistical work, or, say, a seasonal product analysis based on data from a product performance dashboard.

The end goal of data consumption is to boost business metrics through informed decision making. Consequently this is where investing in data actually pays off, so make sure the needs of the decision-makers are met.

Python and SQL open source business intelligence stack

I’d like to cover here some basics about the common business intelligence stack of Python and SQL. Let’s see why this stack is so popular, and why it’s not yet the one and only gold standard.


Why Python for business intelligence?

Readability.

Most business intelligence professionals using Python come from all walks of life and do not have a formal programming education. This can lead to incredibly random coding styles – less so with Python, though, since it enforces indentation.
This doesn't mean that someone inexperienced will produce good code or a good architecture, and regardless of how clean the code looks, people can still do very obscure and roundabout things. What it does mean is that their code is very easy to read, which is the first step in understanding it.

Staffing.

Python is an easy-to-learn, widespread, and easy-to-use language. It is much easier to find Python developers than developers with VBA knowledge, and it is harder still to find developers who combine BI knowledge with knowledge of other, more niche languages.

Libraries.

For BI work, you need to extract and load data, transform data, and analyze data. Python can do all of it: most advertising data producers offer Python libraries to access their APIs, Python can access any database you can think of, it natively has great features for data munging, and it can be used for large parallel computations or statistical analysis. Data science, too.

Widespread usage/versatility.

This means that whatever problem you encounter, someone else has probably encountered it too. You will have a very easy time finding solutions or getting help for anything Python. While other languages might be better at specific things, Python is a jack of all trades and master of some.

Why not Python?

Spoon-fed.

Many BI developers stop there because the language is so versatile. This means, however, that they never get the opportunity to work with languages designed for large software projects, and they miss out on best practices and programming paradigms. Too easy sometimes means people stop learning.

Slow.

Python is slower than most other languages. In the context of working with data, this only becomes an issue when doing complex calculations on very large data sets.

Why SQL for business intelligence?

I will talk here about PostgreSQL for small data specifically, since it is the best open source database for analysis. It has better functionality than paid solutions like Microsoft's, and performance and functionality equal to or better than Oracle's product. It is free, without licensing fees or vendor lock-in, and built by developers, for developers, with robustness and ease of use in mind. You can even run Python on Postgres 🙂

For big data, the leading SQL solution seems to be the Postgres-like product from Amazon, Redshift.

Ease of access:

For the sake of easy data access, SQL makes a lot of sense. There are plenty of tools that can run SQL against a database to generate dashboards, or that connect easily to Excel. And it is easy for analysts to learn.

Ease of data manipulation:

Most of the data you have will not be single numbers or random sentences, but data structured in tables. This means that instead of conventional algebra, we use relational algebra to perform operations between data sets rather than between single values. This is where SQL shines, since it is designed specifically for these kinds of operations. Once you have the data in a database, you can leverage the power of this language to perform these operations very easily.

Staffing:

Every self-respecting BI professional knows at least a little SQL, and most good BI engineers know a lot of it. The particular flavor of SQL is not very relevant, as the dialects are very similar.

Why not SQL?

Poor options for architecture.

SQL is a query language, not a programming language. It is meant for rearranging and computing data, not for complex operations. You end up producing a lot of code, often WET (the opposite of DRY). The code ends up monolithic, and it is not easy to see at a glance what a query does. Basically, you cannot write easily maintainable code.

Lowers the bar.

SQL is so easy to learn that a lot of people end up being able to use it. However, this does not suffice for writing clean code or designing an ETL, a BI application, or a data warehouse. I've personally seen ridiculous projects that were a dependency hell of views on views on views. Sadly, knowing SQL says little about one's ability to deliver a quality product, but the two often get confused in the world of BI.

I hope you enjoyed this article, and if you have any opinions on the topic do not hesitate to leave a comment.

Lifecycle stages of Business Intelligence solutions in startups


Business Intelligence in companies is a decision support system. Its implementation is a reflection of the questions that need to be answered in the company at any given time. Many companies start with just a few spreadsheets of data, populated with copy-pasted numbers. This allows them to understand how their business is evolving and where the next opportunities lie. As time goes on, companies usually go through several stages of Business Intelligence solutions.

1. The copy paste spreadsheets

At the earliest stage in the company's life, there is little automation of data extraction and reporting. Usually, the founder already has some profit and loss spreadsheets and will ask employees to fill in the data relevant to their departments. This is when the team defines its first metrics. Unfortunately, the calculations are often incomplete due to missing data.

2. The VBA reports

First, let me state that I consider VBA + Excel reporting to be a disaster when it comes to data solutions. The reasons are simple: it is a programming paradigm that follows zero best practices. It's programming, so it's too hard for the average Joe to build on. It's VBA, so it's not made for data munging, versioning, collaborative work, easy file handling, etc. It works for a one-off project, but you end up with too much code that is too hard to maintain for too little benefit.

This usually appears as a quick-and-dirty way of automating some reporting, coming from someone with a financial background. This happens because finance folks like to have things neatly ordered in Excel, with custom functionality for particular reports. It works on a small scale, but it is not worth maintaining bad code with custom functionality and particular layouts for all the reports on everyone's computers.

3. The Excel Hell

Natural evolution: Each department fills in some data sheets that feed into the managers’ reports. After a while, each department wants their own reports as well, so they start using the data sources from the other departments. Soon, they want new numbers and start changing the data sources to add more data. This inevitably breaks other dependent reports and creates pain for everyone.

Soon, the reports become data sources for new reports, in an attempt to avoid redundant maintenance. This only makes things worse, as there are now dependencies in many directions between many things. One small change can trigger catastrophic failure at company level. Moreover, the knowledge needed to maintain this system is so fragmented that it cannot be centralized.

4. The data bucket

After Excel hell, some companies fall back to what they know: data source files. To allow collaborative development, they put these files on a shared drive or a cloud storage solution. From there, the Excel-savvy can build their own reports.

This has the advantage of standardizing data sources. Moreover, it allows anyone to build their own reports based on those sources, instead of building on top of existing reports.

Besides scale, the first shortcoming is the manual effort involved. Additionally, the lack of maintainability (you cannot understand at a glance the logic behind a report) causes redundant reports to be built, which only adds to the work.

5. The data river

The company looks online for how to do better BI and receives misinformation from advertorials, ads, agencies, and parrots (people who just repeat what they read). They end up buying some dashboard solution that promises seamless data integration and real-time stats. They get IT to write some SQL queries for the tool, and they connect it to the other services (advertising, ticketing systems, etc.) with ease.

It now looks like all the data is available – in a tool that can display data but cannot transform it into a useful format. This is the birth of the need for a data warehouse: complex logic (and there is always complex logic in data munging) will need to be done in code.

Now we can see our data, but we cannot leverage it for the reporting we need.

6. The data store, or unstructured data warehouse

After struggling with keeping the data in files, or piping it directly into reports via a tool, the need arises for a database that can store both raw and processed data.

Of course it makes sense to combine the data in a SQL database so that all the other tools can pull data from it too. Most of the transformations will also happen in SQL: first of all, it is easy to learn and use, and secondly, it is a very powerful language for working with sets of data – tables.

This collection usually consists of everything that can be imported and is usually not structured or grouped together by any particular logic. It is simply a place to store the data to make it more accessible for analysis and further transformations. Often, much of the data comes from manually populated data sources.

Some argue they are too cool for SQL, but in the end SQL just works because of how limited and how powerful the language is. It is powerful because it performs all operations at a table or column/row level. This allows you, for example, to perform a Cartesian join between two data sets with a single keyword. Think of all the loops you'd write (and run) in a programming language!

SQL has limitations because it only allows a restricted set of operations on tables. This is actually a huge boon, as it allows both good developers and bad developers to speak the same language. This means the code is unlikely to become too complex to understand.

7. The data warehouse

When a BI person wants to churn out a variety of reports, the benefits of a dimensional model become apparent. Typically, one adds this as a layer on top of the raw data. This makes it possible to map business processes at each grain or step, and thereby also enables easy creation of reports.

8. The self service reporting

Once the dimensional model is available, the team can create data cubes and thereby make the data easily available to the rest of the company. Every stakeholder can filter the data cubes or drill down and roll up. With this functionality, a business analyst with access to the data can build complex reports and dashboards in minutes; the tool containing the data would usually be Excel or a dashboarding tool. An operational data consumer might notice an anomaly and be able to troubleshoot it by drilling down in the same window. This is the end goal for most companies.

9. The advanced self service reporting

Drilling down along predefined paths is not enough for complex analyses. Also, for the sake of agility, many data sources will not be fully integrated into the data warehouse (for example, a test of an ad network). This is where custom complex transformations come into play. These often serve to generate a report on a new data source, or to analyze particular details such as user behavior.

10. The data products: Automated decision making instead of BI as decision support.

Oftentimes, a decision based on data is logical and repeatable, so the data team creates algorithms to automate it. Take, for example, the decision of which products to recommend to each customer, or the adjustment of bids based on performance metrics.

The data team can create these data products with relative ease. Sometimes they are built on top of the dimensional model of the DWH; often, due to performance and different SLA requirements, they run independently of the data warehouse, consuming data from the source or being integrated into an application.

I hope you enjoyed this, and let me know in the comments what else you have encountered. Where do you find yourself on this scale?


Twitter ads API: gotchas when downloading report

I recently built an ETL to download ad reports from the Twitter ads API.

If you are about to do the same, take note of the following gotchas.

Include historic data in your download

First, you should download at least two weeks of history with every new daily load. The reason? Twitter updates its ads data retroactively as they work out the actual numbers (removing bots, spam, etc.). This is common behavior for many ad APIs.

Provide datetime

Second, because of their API design, you NEED to provide a start datetime and an end datetime covering whole days. Notice that it requires a datetime, not a date, even though you clearly request daily aggregation! Additionally, you need to provide midnight of your account's timezone expressed in UTC, or the API call will return an error. (Don't ask me why you handle this on your end when the timezone info comes from the API...)

You can retrieve the timezone of each account from the endpoint '/2/accounts/your_account_id', in the body['data']['timezone'] field of the response. Then calculate the start and end dates for your report request and adjust each of them to the right UTC offset based on the account timezone. You can do this easily in Python with the pytz module. Note that if your start–end interval contains a daylight saving time change, you cannot apply the same offset to both dates, as the offset from UTC changes; the best approach is to calculate the offset from the timezone separately for each of the start and end dates.
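A minimal sketch of the offset calculation with pytz; the timezone string is assumed to have already been read from the accounts endpoint, and the dates are illustrative.

```python
# Hedged sketch: convert the account's local midnight to UTC, per date,
# so a daylight saving change between start and end is handled correctly.
from datetime import date, datetime

import pytz


def local_midnight_to_utc(day, tz_name):
    """Return the UTC datetime of local midnight for `day` in timezone `tz_name`."""
    tz = pytz.timezone(tz_name)
    local_midnight = tz.localize(datetime(day.year, day.month, day.day))
    return local_midnight.astimezone(pytz.utc)


# account_tz would come from body['data']['timezone'] of the accounts response.
# start_time = local_midnight_to_utc(date(2017, 3, 10), account_tz)
# end_time = local_midnight_to_utc(date(2017, 3, 15), account_tz)
```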

Keep watching for gotchas in the Twitter ads API

There are a few other gotchas and undocumented limitations in their API, but you will find those easier to address correctly (for example, only request stats for 10 tweet_ids at a time).

Yes, it stinks, but at least Twitter ads API is still better than Bing ads API.

Good luck!

Example MySQL to Gmail script

This is a usage example for the MySQL to HTML table script combined with the BI mailer. We will make use of the two previous scripts to create a scheduled email sent from a MySQL database through a Gmail address with minimal effort.

The idea behind it was to be able to create a new email using only the basic building blocks: the MySQL query as the data source, and the subject and recipients as the envelope and destination.

The advantage of this combination is that a new email needs only 3(!) lines of code, allowing you to leverage email as an easy and flexible distribution channel for BI data.

In the example below we also add a greeting along with a relevant subject line, which raises our grand total to four easily readable lines of code to send an email from MySQL to Gmail. The connection string I used for the query points to an open MySQL database with DNA info (this should go in your sql_to_html file).

The email contains a random sample of DNA shortened to 64 characters in length. It takes a MySQL query and sends an HTML table via Gmail.
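A hedged sketch of what those few lines might look like; sql_to_html_table and send_mail stand in for whatever the two previous scripts expose, and the query, subject, and recipient are illustrative.

```python
# Hedged sketch: the handful of lines needed for a new email report.
# The imported helpers and their signatures are hypothetical placeholders.
from sql_to_html import sql_to_html_table
from bimailer import send_mail

html_table = sql_to_html_table("SELECT dna_sample FROM dna ORDER BY RAND() LIMIT 10")
body = "Hello, here is today's DNA sample:<br>" + html_table
send_mail(subject="Daily DNA sample", recipients=["analyst@example.com"], html_body=body)
```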

Result: [screenshot of the resulting MySQL to Gmail email]

While this example is Gmail-specific, it can easily be changed to use any email service. Keep in mind that even when sending through Gmail you can reach any other email address, and if you choose to avoid Gmail altogether, you can simply change bimailer.py to use another email server.

MySQL query to HTML table with Python

For easy email reporting, I like to send plain HTML tables. An easy implementation is to take the query output and HTML-ize it.

The script below does just that: it connects to MySQL, sends a query, and turns the result into an HTML table. The table can then be sent with the bimailer, or used in any other way.

The MySQL connector for Python can be found on the official MySQL website.
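A minimal sketch of the idea, assuming the mysql-connector-python package; the function name and connection handling are illustrative rather than the original script.

```python
# Hedged sketch: run a MySQL query and return the result set as an HTML table.
import mysql.connector


def sql_to_html_table(query, conn_params):
    """Execute `query` against MySQL and render the result as a simple HTML table."""
    conn = mysql.connector.connect(**conn_params)
    try:
        cursor = conn.cursor()
        cursor.execute(query)
        header = "".join(f"<th>{col[0]}</th>" for col in cursor.description)
        rows = "".join(
            "<tr>" + "".join(f"<td>{value}</td>" for value in row) + "</tr>"
            for row in cursor.fetchall()
        )
        return f"<table><tr>{header}</tr>{rows}</table>"
    finally:
        conn.close()
```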