Easy ETL, or “friends don’t let friends write loading code”

Easy ETL is what every tool promises, but most do not deliver. Over the last few years there has been a strong shift in BI towards the use of Python. If previously any scripting language would do, nowadays it is almost impossible to manage without Python specifically.

Take the ever more common open source tech stack of:
SQL for storage and analysis
Python as glue language
Airflow (Python) as workflow engine
Airflow (Python) + custom utils (any scripting language) as framework.

You can reduce the complexity by sticking with Python as far as possible for the utils. It makes a lot of sense, as you can import the utils directly in the job (DAG) definition file and call and parametrize your methods with ease, as in the sketch below.
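For illustration, here is a minimal sketch of what that looks like in an Airflow 2.x DAG file. The shared package `dwh_utils` and its `load_csv` and `merge_increment` helpers are hypothetical names, not an existing library:

```python
# Minimal sketch of a DAG that reuses shared utils.
# `dwh_utils`, `load_csv` and `merge_increment` are hypothetical names.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from dwh_utils import load_csv, merge_increment

with DAG(
    dag_id="load_orders",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load = PythonOperator(
        task_id="load_csv",
        python_callable=load_csv,
        op_kwargs={"file_path": "/tmp/orders.csv", "table": "staging.orders"},
    )
    merge = PythonOperator(
        task_id="merge_increment",
        python_callable=merge_increment,
        op_kwargs={"source": "staging.orders", "target": "dwh.orders", "keys": ["order_id"]},
    )
    load >> merge  # run the load, then merge the increment
```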

Having worked on multiple BI data warehousing projects over the last few years, the most common mistake I see is underestimating the cost of SQL maintenance, and the most common difficulty is staffing enough people to maintain and extend the large codebases.

The cost of maintaining SQL throughout its lifecycle is usually estimated at 3-4x the time it takes to create it in the first place. The truth is, for the most vital components of your transformation, the cost of maintenance is closer to 10x the cost of the original development. You will touch this code again and again with every source data change or addition. Add a new column? Easy! Modify your data loading scripts, your data mart creation scripts, your reporting scripts, your data dictionary, and the frontend to include it. Change a column? Not so easy anymore.

So we are already familiar with the hard way: Write code to do the transformations. When the schema changes, add or modify the code.

So what is the easy way? Write code to write the code that does the transformations. Many of the transformation steps are repetitive, so if you standardize things, you can easily re-use the same few parametrized operations. This is 'don't repeat yourself', or DRY, a basic principle for producing decent, maintainable code. You know, what you SHOULD be doing instead of copy-pasting all that SQL. But while many of us know our code is 'WET', it is sometimes hard to abstract and standardize. So how do you do it?

Standardize.

When it comes to loading data into a table, incrementally or not, there are only a few ways to do this. Ideally, stick with the basics and re-use the code. If most of your operations are of one type, and a new one might run faster if done differently, consider that writing new loading logic and maintaining it 10 times over is likely more expensive than the drawbacks of your existing solution.

Abstract.

Split your jobs into subroutines. You will quickly identify many common parts that can be abstracted out and written as reusable methods in some shared utils. You can then construct each job from those blocks, leaving only the job-specific code to manage in any particular job.

Take the following examples:

A job that pulls some data from an API:
1. Download from API to CSV
2. Load CSV into database
3. Merge increment

A job that loads data from one RDBMS to the DWH:
1. Download from DB
2. Load CSV into database
3. Merge increment

A job that creates some reports:
1. Run some SQL
2. Run some code
3. Send report

A job that monitors your data:
1. Run some SQL
2. Check if normal
3. Send report if abnormal

A job that delivers some segments into your CRM tool:
1. Run some SQL
2. Write to file
3. Push to API

Notice how many of the steps are shared with other jobs? What if we abstracted as much of the common stuff as we could? We would only need to write the components we do not already have.

A job that pulls some data from an API:
1. Download from API to CSV

A job that loads data from one RDBMS to the DWH:
1. Make a list of tables to copy

A job that creates some reports:
1. Write report SQL

A job that monitors your data:
1. Write some SQL
2. Check if normal

A job that delivers some segments into your CRM tool:
1. Run some SQL
2. Push to API

Notice we now have only 7 of the original 15 pieces of code to deal with. Reducing complexity in this manner not only reduces development time, but also makes future maintenance easier. Want to stop loading to Postgres and load to Redshift instead? Just change the loading method once. Want to stop worrying about staging tables? Create them on the fly based on your target table and drop them when you are done.

Pareto principle, KISS, lean, or whatever you prefer.

You will do 80% of the work with 20% of the code. Do not try to solve every potential problem at once. Start with something that solves the majority of your worries.

You do not need to have the best code running from the start. Solve the problems as well as you can without adding complexity. While your shared methods might not always be the best suited for each case, they are better than 5x the code.

A few examples:

For loading data into Postgres, MySQL, or another RDBMS, the fastest way to load is through the COPY command (or its equivalent, such as LOAD DATA for MySQL). Write a Python method with the file and table as parameters. Now you can call this method as method(file, table, args=defaults) every time you need to load some data, without writing the COPY statement every time. For engines like Redshift, your method will additionally need to copy the file to S3 first and parametrize the COPY with S3 credentials.
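A minimal sketch of such a loader for Postgres, assuming psycopg2; the function name, defaults, and connection string are illustrative:

```python
# Reusable CSV loader built around the COPY command (Postgres, psycopg2).
# Function name, defaults and DSN are illustrative, not a fixed API.
import psycopg2

def load_csv(file_path, table, delimiter=",", header=True, dsn="dbname=dwh"):
    copy_sql = "COPY {} FROM STDIN WITH (FORMAT csv, DELIMITER '{}'{})".format(
        table, delimiter, ", HEADER" if header else ""
    )
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur, open(file_path) as f:
            cur.copy_expert(copy_sql, f)

# Usage: the same one-liner every time you need to load a file.
# load_csv("/tmp/orders.csv", "staging.orders")
```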

For incremental loading, write methods for your common operations. If you want to do an increment merge (see the Redshift manual for the staging-table merge pattern), it is easy to automate as merge_increment(source, target, keys=[]).
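A sketch of what that helper could look like, following the staging-table merge pattern: delete the target rows that also exist in the source, then insert everything from the source. psycopg2 and the DSN are assumptions, and the table/column names are assumed to be trusted, internally generated strings:

```python
# Increment merge: delete target rows that also exist in the source
# (matched on the key columns), then insert all rows from the source.
import psycopg2

def merge_increment(source, target, keys, dsn="dbname=dwh"):
    join_cond = " AND ".join(
        "{t}.{k} = {s}.{k}".format(t=target, s=source, k=k) for k in keys
    )
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("DELETE FROM {t} USING {s} WHERE {c}".format(t=target, s=source, c=join_cond))
        cur.execute("INSERT INTO {t} SELECT * FROM {s}".format(t=target, s=source))

# merge_increment("staging.orders", "dwh.orders", keys=["order_id"])
```

The same helper can also create the staging table on the fly (for example with CREATE TEMP TABLE staging (LIKE target)) and drop it when done, so you never have to manage staging tables by hand.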

For indexing: all your join keys and heavily used columns have pretty standard naming: id, sk, date, from, to, ts, and so on. Why change indices by hand every time you make a change when you can use standardized naming to your advantage? Using the system tables, you can identify the table name, column, and data type for the index based on your naming convention, and create the indexes in an automated manner.
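A sketch of the idea for Postgres, querying information_schema for columns that match the convention; the suffix list, psycopg2 usage, and index naming are assumptions:

```python
# Convention-based indexing: find columns whose names match the naming
# convention and create an index for each one if it does not exist yet.
import psycopg2

INDEXED_SUFFIXES = ("_id", "_sk", "_date", "_from", "_to", "_ts")  # illustrative

def index_by_convention(schema="dwh", dsn="dbname=dwh"):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT table_name, column_name FROM information_schema.columns "
            "WHERE table_schema = %s",
            (schema,),
        )
        for table, column in cur.fetchall():
            if column == "id" or column.endswith(INDEXED_SUFFIXES):
                cur.execute(
                    "CREATE INDEX IF NOT EXISTS ix_{t}_{c} ON {s}.{t} ({c})".format(
                        t=table, c=column, s=schema
                    )
                )
```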

I hope this helps some of you in your data delivery journey. How do you keep your complexity low?

Python and SQL open source business intelligence stack

I’d like to cover here some basics about the common business intelligence stack of Python and SQL. Let’s see why this stack is so popular, and why it’s not yet the one and only gold standard.


Why Python for business intelligence?

Readability.

Business intelligence professionals using Python come from all walks of life, and many do not have a formal programming education. This can lead to incredibly random ways of coding, though less so in Python, since the language enforces indentation.
This doesn't mean that someone with no idea will produce good code or a good architecture, and regardless of how clean the code is, people can still do very obscure and roundabout things. What it does mean is that their code is very easy to read, which is the first step in understanding it.

Staffing.

Python is an easy-to-learn, widespread, and easy-to-use language. It's much easier to find Python developers than developers with VBA knowledge, and it's harder still to find developers who combine BI knowledge with more niche languages.

Libraries.

For BI work, you need to extract/load data, transform data, and analyze data. Python can do it all: most advertising data producers offer Python libraries to access their APIs, Python can access any database you can think of, it natively has great features for data munging, and it can be used for large parallel computations or statistical analysis. Data science too.

Widespread usage/versatility.

This means that whatever problem you encounter, someone else has probably encountered it too. You will have a very easy time finding solutions or getting help for anything Python. While other languages might be better at specific things, Python is a jack of all trades and a master of some.

Why not Python?

Spoon-fed.

Many BI developers stop there because the language is so versatile. However, this means they will not have the opportunity to work with languages designed for large software projects, and they will miss out on best practices and programming paradigms. Too easy sometimes means people stop learning.

Slow.

Python is slower than many other languages. In the context of working with data, this only becomes an issue when doing complex calculations on very large data sets.

Why SQL for business intelligence?

I will talk here about PostgreSQL for small data specifically, since it is the best open source database for analysis. It has better functionality than paid solutions like Microsoft's, and equal or better performance and functionality compared to Oracle's product. It is free, with no licensing fees or vendor lock-in, and it is built by developers, for developers, with robustness and ease of use in mind. You can even run Python inside Postgres 🙂

For big data, the leading SQL solution seems to be Amazon's Postgres-like product, Redshift.

Ease of access:

For the sake of easy data access, SQL makes a lot of sense. There are plenty of tools that can run SQL against a database to generate dashboards, or you can connect it easily to Excel. And it is easy for analysts to learn.

Ease of data manipulation:

Most of the data you will have will not be single numbers or random sentences, but data structured in tables. This means that instead of conventional algebra, we use relational algebra to perform operations between data sets rather than single values. This is where SQL shines, since it is designed specifically for these kinds of operations. Once you have the data in a database, you can leverage the power of this language to perform these operations very easily.

Staffing:

Every self-respecting BI professional knows at least a little SQL. Most good BI engineers know a lot of SQL. The particular flavor of SQL someone knows is not very relevant either, as the flavors are all very similar.

Why not SQL?

Poor options for architecture.

SQL is a query language, not a programming language. It is meant to rearrange and compute data, not to do complex operations. You end up producing a lot of code, often WET (the opposite of DRY). The code ends up monolithic, and it is not easy to see at a glance what a query does. Basically, you cannot easily write maintainable code.

Lowers the bar.

SQL is so easy to learn that a lot of people end up being able to use it. However, that is not enough for writing clean code or designing an ETL, a BI application, or a data warehouse. I've personally seen ridiculous projects that were a dependency hell of views on views on views. Sadly, knowing SQL says little about one's ability to deliver a quality product as a professional, but the two often get confused in the world of BI.

I hope you enjoyed this article, and if you have any opinions on the topic do not hesitate to leave a comment.

Twitter ads API: gotchas when downloading reports

I recently built an ETL to download ad reports from the Twitter ads API.

If you are about to do the same, take note of the following gotcha.

Include historic data in your download

First, you should download at least 2+ weeks of history with every new daily load. The reason: Twitter will update historical ads data as they work out the actual numbers (removing bots, spam, etc.). This is common behavior for many ad APIs.

Provide datetime

Second, because of their API design, you NEED to provide a start datetime and an end datetime for whole days. Notice that it requires a datetime, and not a date, even though you clearly request daily aggregation! Additionally, you need to provide the midnight of your account's timezone expressed in UTC, or the API call will return an error. (Don't ask me why you have to handle this on your end when the timezone info comes from the API…)

You can retrieve the timezone of each account from the endpoint '/2/accounts/your_account_id', in body['data']['timezone'] of the response. Then you calculate the start and end dates for your report request and adjust both of them to the right UTC offset based on the account timezone. You can achieve this easily in Python with the pytz module. Please note that if your start-end interval contains a daylight savings time change, you cannot apply the same offset to both dates, as the offset from UTC changes. The best approach is to calculate the offset separately for each of the start and end dates, based on the timezone.
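A sketch of that calculation with pytz; the function name and the 14-day window are illustrative, and the timezone string is assumed to come from the accounts endpoint:

```python
# Build UTC start/end datetimes for a daily Twitter ads report request:
# midnight in the account's timezone, with the UTC offset computed
# separately for each end of the window so DST changes are handled.
from datetime import date, datetime, timedelta
import pytz

def report_window_utc(account_tz_name, days_back=14):
    tz = pytz.timezone(account_tz_name)  # e.g. "America/New_York", from the API
    end_day = date.today()
    start_day = end_day - timedelta(days=days_back)
    # Localize each midnight on its own; a single shared offset would be
    # wrong if a daylight savings change falls inside the window.
    start_local = tz.localize(datetime.combine(start_day, datetime.min.time()))
    end_local = tz.localize(datetime.combine(end_day, datetime.min.time()))
    return start_local.astimezone(pytz.utc), end_local.astimezone(pytz.utc)

# start_utc, end_utc = report_window_utc("Europe/Berlin")
```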

Keep watching for gotchas in the Twitter ads API

There are a few other gotchas/undocumented limitations in their API, but those are easier to handle correctly. (Example: only request stats for 10 tweet_ids at a time, as sketched below.)
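A tiny sketch of the batching, assuming a hypothetical fetch_stats request function:

```python
# Request stats in batches of 10 tweet IDs at a time.
def chunks(items, size=10):
    for i in range(0, len(items), size):
        yield items[i:i + size]

# for batch in chunks(tweet_ids):
#     fetch_stats(batch)  # hypothetical function that calls the stats endpoint
```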

Yes, it stinks, but at least Twitter ads API is still better than Bing ads API.

Good luck!

Example MySQL to Gmail script

This is a usage example for the MySQL to HTML table combined with the BI mailer. We will make use of the two previous scripts to create a scheduled email sent from a MySQL database through a Gmail address with minimal effort.

The idea behind it was to be able to create a new email from only the basic building blocks: the MySQL query as the data source, and the subject/recipients as the envelope/destination.

The advantage of this combination is that a new email needs only 3(!) lines of code, allowing you to leverage email as an easy and flexible distribution channel for BI data.

In the example below we also add a greeting along with a relevant subject line, which raises our grand total to 4 easily readable lines of code to send an email from MySQL to Gmail. The connection string I used for the query points to an open MySQL database with DNA info (this should go in your sql_to_html file).

The email contains a random sample of DNA shortened to 64 characters in length. It takes a MySQL query and sends an HTML table via Gmail.
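A sketch of those few lines; the helper names below (sql_to_html_table and BiMailer) are assumptions based on the earlier posts, and the query, table, credentials, and addresses are placeholders:

```python
# A few readable lines: query MySQL, build an HTML table, send it via Gmail.
from sql_to_html import sql_to_html_table  # assumed helper from the previous post
from bimailer import BiMailer              # assumed helper from the previous post

greeting = "<p>Hello, here is today's DNA sample:</p>"
table = sql_to_html_table("SELECT LEFT(dna, 64) AS dna FROM samples LIMIT 10")  # placeholder query
BiMailer("me@gmail.com", "app-password").send(
    subject="Daily DNA sample", recipients=["you@example.com"], html_body=greeting + table)
```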

Result:
(screenshot of the MySQL to Gmail result)

While this example is Gmail-specific, it can easily be changed to use any email service. Keep in mind that even when using Gmail you can reach any other email address, and if you choose to avoid Gmail altogether, you can just change bimailer.py to use any other email server.

MySQL query to HTML table with Python

For easy email reporting, I like to send plain HTML tables. An easy implementation is to take the query output and HTML-ize it.

The script below does just that: it connects to MySQL, sends a query, and turns the result into an HTML table. The table can then be sent with the bimailer, or used in any other way.

The MySQL connector for Python can be found on the official MySQL website.
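A minimal sketch of such a script, assuming the official mysql-connector-python package; the function name and connection parameters are illustrative placeholders:

```python
# Run a MySQL query and render the result set as a plain HTML table.
import mysql.connector

# Connection details live in this file, as mentioned above; values are placeholders.
CONN_PARAMS = {"host": "localhost", "user": "reader", "password": "secret", "database": "dna"}

def sql_to_html_table(query, conn_params=CONN_PARAMS):
    conn = mysql.connector.connect(**conn_params)
    cur = conn.cursor()
    cur.execute(query)
    headers = [col[0] for col in cur.description]
    rows = cur.fetchall()
    cur.close()
    conn.close()

    html = ["<table border='1'>"]
    html.append("<tr>" + "".join("<th>{}</th>".format(h) for h in headers) + "</tr>")
    for row in rows:
        html.append("<tr>" + "".join("<td>{}</td>".format(v) for v in row) + "</tr>")
    html.append("</table>")
    return "".join(html)
```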

Scheduled emails with Python for BI

As part of reporting in business intelligence, scheduled emails are a very useful tool. They allow you to send various daily summaries, or attach larger reports.

For these purposes I put together a Python BI mailer class that makes it easy to handle everything you might normally need to send: text, summary/KPI HTML tables, embedded images or charts, and larger report attachments.

I will put the code here, and in the next articles I will give examples of how to use this easy Python emailer to accomplish daily reporting chores.

With this script you are able to:
send an email
optionally with text or multiple tables in the body
optionally with attachments
optionally use the attachments as embedded images.
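
A condensed sketch of such a mailer, assuming Gmail's SMTP server with an app password; the class and method names are illustrative, and embedded images are left out for brevity:

```python
# Minimal BI mailer: HTML body (text and/or tables) plus optional attachments,
# sent through Gmail's SMTP server.
import os
import smtplib
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

class BiMailer(object):
    def __init__(self, user, password, host="smtp.gmail.com", port=465):
        self.user, self.password, self.host, self.port = user, password, host, port

    def send(self, subject, recipients, html_body="", attachments=None):
        msg = MIMEMultipart()
        msg["Subject"] = subject
        msg["From"] = self.user
        msg["To"] = ", ".join(recipients)
        msg.attach(MIMEText(html_body, "html"))
        for path in attachments or []:
            with open(path, "rb") as f:
                part = MIMEApplication(f.read(), Name=os.path.basename(path))
            part["Content-Disposition"] = 'attachment; filename="{}"'.format(os.path.basename(path))
            msg.attach(part)
        with smtplib.SMTP_SSL(self.host, self.port) as server:
            server.login(self.user, self.password)
            server.sendmail(self.user, recipients, msg.as_string())

# BiMailer("me@gmail.com", "app-password").send("Daily KPIs", ["boss@example.com"], "<p>All green.</p>")
```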