ETL vs ELT: It’s all about maintenance!

When designing a job, you will at some point have to decide between ETL and ELT. This boils down to the question of whether you should perform the transformation to a dimensional model in SQL, or in a programming language.

Why, you ask?

What does ETL vs ELT mean for you practically?

Well, ETL means “Extract – Transform – Load”. So it is about extracting data from a source and then transforming it into a dimensional model. In the last step, you load it into the data warehouse for analysis.

ELT, on the other hand, stands for – you guessed it – “Extract – Load – Transform”. It is about extracting data, loading it into a staging database, and from there transforming it into the dimensional model.

The difference is that the transformation step occurs before loading into a database (ETL) versus after (ELT). In theory, the data is loaded before transformation to make it easier to access; the transformation then takes place from the loaded staging database into the data warehouse’s dimensional model. In practice, the fundamental decision between ETL and ELT is the one between using SQL or a programming language, because once the data is in a staging database, the transformation will usually happen in complex SQL.
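To make the ordering concrete, here is a minimal Python sketch of the two flows. Everything in it is a placeholder standing in for real systems: plain dicts play the role of the staging database and the warehouse, and the “transform” is a trivial aggregation.

```python
# Minimal sketch of the two orderings. Plain dicts stand in for the staging
# database and the warehouse; the "transform" is a trivial aggregation.

def extract(source_rows):
    # In reality: call an API, read files, or query a source system.
    return list(source_rows)

def transform(rows):
    # Build the dimensional model; here just one aggregated fact row.
    total = sum(r["amount"] for r in rows)
    return [{"metric": "total_amount", "value": total}]

def etl(source_rows, warehouse):
    rows = extract(source_rows)
    warehouse["facts"] = transform(rows)            # transform *before* loading

def elt(source_rows, staging, warehouse):
    staging["raw"] = extract(source_rows)           # load the raw data first...
    warehouse["facts"] = transform(staging["raw"])  # ...then transform (usually in SQL)

src = [{"amount": 10}, {"amount": 5}]
etl_wh, elt_wh, staging = {}, {}, {}
etl(src, etl_wh)
elt(src, staging, elt_wh)
assert etl_wh == elt_wh  # same result, different place for the transformation
```

The end result is the same; what changes is where the transformation logic lives, and therefore how maintainable it is.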

ETL vs ELT: my recommendation

I am going to assume you are using a tech stack of one or more programming languages plus SQL. In this case, what you need to decide is whether you want to write the complex transformations in a programming language or in a query language. In my experience, it’s easier to maintain jobs where you do a little more programming and write less SQL. This is because SQL does not support testing natively, and you cannot easily refactor it, so you end up rebuilding instead.

Write less SQL

Let me start by saying I like SQL, and I believe it is fantastic in its ease of use. It’s meant for rearranging and aggregating data, making it easy to create reports from a dimensional model. I do however believe we should not misuse it, and simply write less SQL.

Now, there are two types of SQL code in BI: Simple SQL and Complex SQL.

Simple SQL

Simple SQL is the type of SQL that makes reports from a dimensional model – it seldom has subqueries, and all it does is some joining/grouping/aggregating, as all the numbers are available in fact tables in ready-to-use form. It’s easy enough that even business users can understand it and possibly maintain it.

Simple SQL is fantastic because it allows easy operations on prepared data sets. It is easy to read and to maintain.
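To make this concrete, here is what I mean by Simple SQL, wrapped in a small Python script so it runs end to end. The `fact_sales` and `dim_product` tables are made-up names for an assumed dimensional model, not a reference to any real schema.

```python
import sqlite3

# "Simple SQL": a report query over an assumed dimensional model.
# fact_sales / dim_product are invented names for illustration.
REPORT_QUERY = """
SELECT p.category,
       SUM(f.revenue) AS total_revenue
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_id = f.product_id
GROUP BY p.category
ORDER BY total_revenue DESC;
"""

# A tiny in-memory warehouse so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER, category TEXT);
CREATE TABLE fact_sales  (product_id INTEGER, revenue REAL);
INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
INSERT INTO fact_sales  VALUES (1, 10.0), (2, 25.0), (1, 5.0);
""")
print(conn.execute(REPORT_QUERY).fetchall())  # [('games', 25.0), ('books', 15.0)]
```

One join, one aggregation, no subqueries: anyone can read it, and there is not much that can rot.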

Complex SQL

Complex SQL is the SQL that makes fact tables, or the SQL of data products (data transformations for data applications). This SQL often has many subqueries, multiple transformation steps, and sometimes data cleaning and enrichment.

Complex SQL is hard to maintain, and even harder to extend or modify. It often ends up worse and worse after each iteration. It’s next to impossible to refactor and you usually end up having to rewrite it after a while.

Complex SQL is a cause of a high rate of bugs, high maintenance costs, and a low bus factor. Since SQL is not easy to test (compared with tests in a programming language), it is unlikely to ever have good test coverage, which makes it dangerous to work with. Subtle logic bugs stay well hidden, and due to the monolithic nature of applications written in SQL, they are very hard to maintain and debug.

I suggest avoiding Complex SQL entirely. The key concept for this argument is ETL vs ELT: transform your data before it hits the database, and do it with easily testable, DRY code.

Many decide to do the complex transformations in SQL mainly because it’s quick and easy. In the long run, however, this quick and easy SQL code costs you quite a bit of maintenance, and it will get worse with each iteration because it cannot be easily refactored.

Instead of the query language, use a programming language with tests. This is simply the better approach if you care about iteration speed and accuracy.
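A minimal sketch of what that can look like: a plain-Python transformation step with a pytest-style unit test. The function and field names are invented for illustration, not taken from any particular framework.

```python
# A small, testable transformation step: clean and enrich raw order rows
# into a fact-table-ready shape. Field names are invented for illustration.

def build_order_facts(raw_orders, fx_rates):
    """Turn raw order dicts into fact rows with a unified EUR amount."""
    facts = []
    for order in raw_orders:
        if order.get("status") == "cancelled":
            continue  # cleaning step: drop cancelled orders
        rate = fx_rates[order["currency"]]
        facts.append({
            "order_id": order["id"],
            "amount_eur": round(order["amount"] * rate, 2),
        })
    return facts

# pytest-style test: each rule of the transformation is pinned down.
def test_build_order_facts_converts_and_filters():
    raw = [
        {"id": 1, "amount": 100.0, "currency": "USD", "status": "paid"},
        {"id": 2, "amount": 50.0, "currency": "EUR", "status": "cancelled"},
    ]
    facts = build_order_facts(raw, fx_rates={"USD": 0.9, "EUR": 1.0})
    assert facts == [{"order_id": 1, "amount_eur": 90.0}]
```

The test pins the behaviour down, so refactoring the transformation later is safe; the equivalent logic buried in a multi-step SQL query has no such safety net.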

Write less SQL, use a programming language.

That’s right, write less SQL. Use a programming language and keep your code DRY. Only use SQL for easy or temporary transformations, and use something like Python for the complex transformations. There are limitations to this approach, but they are minor compared with the improvement it brings to the BI development process. Make a leap of faith and write less SQL!

The data flow breakdown

For easy reference, I’d like to define the typical data flow encountered in business intelligence by its steps: data integration, dimensional modelling, and data consumption. These are the typical flows that make first- and third-party data available for analysis, and it is this data flow that brings an ROI when investing in BI.

1. Data flow part one: Data Integration

Integration measures how many of the first- and third-party data sources have been unified in a single place. This single place is usually called an integration layer, and it can be any easy-to-access data store. Typically it takes the form of files on a network location (FTP, S3, GCS) or a relational database.

Usually this part is pretty messy, in that the external sources are diverse and different: each source often requires its own access method to extract the relevant data. As a result, you get a very messy environment. Ideally, handle it with a widely used scripting language to ‘glue’ systems together. It is good practice here to use some form of shared utils to keep your code DRY: using common utils for loading data sources will prove very time-saving.
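As an illustration of what such shared utils might look like, here is one possible design in Python: a small reader registry, where each integration source only has to register how its files are parsed. The names and the registry approach are my own assumption, not a prescription.

```python
import csv
import json
from pathlib import Path

# One shared entry point for loading integration-layer files, so each new
# source only registers a reader instead of re-implementing loading.
_READERS = {}

def reader(extension):
    """Register a reader function for a file extension."""
    def register(func):
        _READERS[extension] = func
        return func
    return register

@reader(".csv")
def _read_csv(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

@reader(".json")
def _read_json(path):
    with open(path) as f:
        return json.load(f)

def load_source(path):
    """Load any supported integration file into a list of records."""
    suffix = Path(path).suffix
    if suffix not in _READERS:
        raise ValueError(f"No reader registered for {suffix!r} files")
    return _READERS[suffix](path)
```

Every new source then costs you one small, testable reader function rather than another copy of the loading boilerplate.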

2. Data flow part two: Dimensional Modelling

Dimensional modelling, as a step in the data flow, is the transformation of integration data into a format suitable for analysis, usually in the form of a snowflake schema. The goal is to arrange the dimension and fact tables such that metrics are unified, and thus the same across reports.

Some small reporting set-ups omit dimensional modelling, but it is useful in enabling self-service reporting.

Ultimately the end goal of dimensional modelling is to provide fact and dimension tables containing correct data. Later on, you will use this data for reporting and analysis. Ideally locate it in a SQL database for easy access.
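For illustration, here is a minimal sketch of this step using pandas and SQLAlchemy (assumed to be available), with invented table and column names: raw integration records are split into a dimension table and a fact table, then landed in a SQL database where Simple SQL can take over.

```python
import pandas as pd
from sqlalchemy import create_engine

# Raw integration-layer records standing in for what a real loader would return.
raw = pd.DataFrame([
    {"order_id": 1, "customer": "ACME",   "country": "DE", "amount": 120.0},
    {"order_id": 2, "customer": "ACME",   "country": "DE", "amount": 80.0},
    {"order_id": 3, "customer": "Globex", "country": "FR", "amount": 200.0},
])

# Dimension table: one row per customer, with a surrogate key.
dim_customer = (
    raw[["customer", "country"]]
    .drop_duplicates()
    .reset_index(drop=True)
    .assign(customer_key=lambda d: d.index + 1)
)

# Fact table: one row per order, referencing the dimension by its key.
fact_orders = raw.merge(dim_customer, on=["customer", "country"])[
    ["order_id", "customer_key", "amount"]
]

# Land both tables in a SQL database so reporting can use Simple SQL on them.
engine = create_engine("sqlite:///warehouse.db")
dim_customer.to_sql("dim_customer", engine, if_exists="replace", index=False)
fact_orders.to_sql("fact_orders", engine, if_exists="replace", index=False)
```

The transformation logic stays in code, where it can be unit tested, while the finished fact and dimension tables end up in the database for easy access.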

3. Data flow part three: Data consumption

Here, I want to avoid using the terms reporting or analysis. In my opinion, reporting is just one of many channels that a data consumer or analyst uses to access data. Such data consumption can take the form of an analysis that requires ad hoc query writing and statistical work, or, say, a seasonal product analysis based on data from a product performance dashboard.

The end goal of data consumption is to boost business metrics through informed decision making. Consequently this is where investing in data actually pays off, so make sure the needs of the decision-makers are met.