Write less SQL

Let me start by saying I like SQL, and I believe it is fantastic for its ease of use. It’s meant for rearranging and aggregating data, which makes it easy to create reports from a dimensional model. I do, however, believe we should not misuse it, and should simply write less SQL.

Now, there are two types of SQL code in BI: Simple SQL and Complex SQL.

Simple SQL

Simple SQL is the type of SQL that makes reports from a dimensional model – it seldom has subqueries, and all it does is some joining/grouping/aggregating, as all the numbers are available in fact tables in ready-to-use form. It’s easy enough that even some business users can understand it and possibly maintain it.

Simple SQL is fantastic because it allows easy operations on prepared data sets. It is easy to read and to maintain.
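For illustration, a typical piece of Simple SQL might look like this (the star-schema tables and columns are hypothetical):

SELECT
    d.year,
    d.month,
    s.store_name,
    SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_date d ON d.date_id = f.date_id
JOIN dim_store s ON s.store_id = f.store_id
GROUP BY d.year, d.month, s.store_name;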

Complex SQL

Complex SQL is the SQL that makes fact tables, or the SQL of data products (data transformations for data applications). This SQL often has many subqueries, multiple transformation steps, and sometimes data cleaning and enriching.

Complex SQL is hard to maintain, and even harder to extend or modify. It often ends up worse and worse after each iteration. It’s next to impossible to refactor and you usually end up having to rewrite it after a while.

Complex SQL causes a high rate of bugs, high maintenance costs, and a low bus factor. Since SQL is not easy to test (compared to tests in a programming language), it is unlikely ever to have good test coverage, making it dangerous to work with. Subtle logic bugs stay well hidden, and the monolithic nature of applications written in SQL makes them very hard to maintain and debug.

I suggest avoiding Complex SQL entirely. The key concept here is ETL vs. ELT: transform your data before it hits the database, and do it with easily testable, DRY code.

Many decide to do the complex transformations in SQL mainly because it’s quick and easy. However, in the long run, this quick and easy SQL code costs you quite a bit of maintenance, and it will get worse with each iteration, as it cannot easily be refactored.

Instead of the query language, use a programming language with tests. This is simply the better approach if you want iteration speed and accuracy.

Write less SQL, use a programming language.

That’s right, write less SQL. Use a programming language and keep your code DRY. Only use SQL for easy or temporary transformations, and use something like Python for the complex ones. There are limitations to this approach, but they are minor compared to the improvement it brings to the BI development process. Make a leap of faith and write less SQL!

Python and SQL open source business intelligence stack

I’d like to cover here some basics about the common business intelligence stack of Python and SQL. Let’s see why this stack is so popular, and why it’s not yet the one and only gold standard.


Why Python for business intelligence?

Readability.

Business intelligence professionals using Python come from all walks of life, and many have no formal programming education. This can lead to incredibly random coding styles – less so in Python, though, since the language enforces indentation.
This doesn’t mean that someone inexperienced will produce good code or good architecture, and regardless of how clean the code looks, people can still do obscure and roundabout things. What it does mean is that their code is very easy to read, which is the first step in understanding it.

Staffing.

Python is an easy-to-learn, widespread, easy-to-use language. It’s much easier to find Python developers than developers with VBA knowledge, and harder still to find developers who combine BI knowledge with other, more specific languages.

Libraries.

For BI work, you need to extract/load data, transform data, and analyze data. Python can do it all – most advertising data producers offer Python libraries to access their APIs, Python can access any database you can think of, it natively has great features for data munging, and it can handle large parallel computations or statistical analysis. Data science too.

Widespread usage/versatility.

This means that whatever problem you are encountering, someone else has probably encountered it too. You will have a very easy time finding solutions or getting help for anything Python. While other languages might be better at specific things, Python is a jack of all trades, master of some.

Why not Python?

Spoon-fed.

Many BI developers stop at Python because the language is so versatile. This means they never get the opportunity to work with languages designed for large software projects, and they miss out on the best practices and programming paradigms those languages teach. Too easy sometimes means people stop learning.

Slow.

Python is slower than most other languages. In the context of working with data, this only becomes an issue when doing complex calculations on very large data sets.

Why SQL for business intelligence?

I will talk here about PostgreSQL for small data specifically, since it is the best open source database for analysis. It has better functionality than paid solutions like Microsoft’s, and performance and functionality equal to or better than Oracle’s product. It is free, without licensing fees or vendor lock-in, and built by developers, for developers, with robustness and ease of use in mind. You can even run Python on Postgres 🙂

For big data, the leading SQL solution seems to be Amazon Redshift, a Postgres-like product.

Ease of access:

For the sake of easy data access, SQL makes a lot of sense. There are plenty of tools that can run SQL against a database to generate dashboards, or connect it easily to Excel. And it is easy for analysts to learn.

Ease of data manipulation:

Most of the data you will have will not be single numbers or random sentences, but structured in tables. This means that instead of conventional algebra, we use relational algebra to perform operations between data sets rather than single values. This is where SQL shines, since it is designed specifically for these kinds of operations. Once you have the data in a database, you can leverage the power of this language to perform these operations with ease.

Staffing:

Every self-respecting BI professional knows at least a little SQL, and most good BI engineers know a lot of it. The particular type of SQL is not very relevant either, as the flavors are very similar.

Why not SQL?

Poor options for architecture.

SQL is a query language, not a programming language. It’s meant to be used to rearrange and compute data, and not to do complex operations. You end up producing a lot of code, often wet (opposite of DRY). The code ends up being monolithic, and it is not easy to see at a glance what a query does. Basically, you cannot write easily maintainable code.

Lowers the bar.

SQL is so easy to learn that a lot of people end up being able to use it. However, this does not suffice for writing clean code, designing an ETL, a BI application, or a data warehouse. I’ve personally seen ridiculous projects that were a dependency hell of views on views on views. Sadly, knowing SQL says little about one’s ability as a professional to deliver a quality product, but the two often get confused in the world of BI.

I hope you enjoyed this article, and if you have any opinions on the topic do not hesitate to leave a comment.

Example MySQL to Gmail script

This is a usage example for the MySQL to HTML table combined with the BI mailer. We will make use of the two previous scripts to easily create a scheduled email sending from a MySQL database through a Gmail address with minimal effort.

The idea behind it was to be able to create a new email with only the base building blocks: the MySQL query for data source, subject/recipients as envelope/destination.

The advantage of this combination is that a new email takes only 3(!) lines of code, allowing you to leverage email as an easy and flexible distribution channel for BI data.

In the example below we also add a greeting along with a relevant subject line, which raises our grand total to 4 easily readable lines of code to send an email from MySQL to Gmail. The connection string I used for the query is an open MySQL database with DNA info (this should go in your sql_to_html file).

The email contains a random sample of DNA shortened to 64 characters. The script takes a MySQL query and sends the result as an HTML table via Gmail.
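A minimal sketch of those lines, assuming helper functions named sql_to_html_table() and send_mail() from the two scripts (the names, query, and recipient are illustrative, not the exact ones from the files):

# The query, the greeting, and the send call – the few readable lines described above.
from sql_to_html import sql_to_html_table
from bimailer import send_mail

query = "SELECT LEFT(sequence, 64) AS dna FROM dna_samples ORDER BY RAND() LIMIT 10"
body = "Hello,<br><br>here is today's DNA sample:<br>" + sql_to_html_table(query)
send_mail("recipient@example.com", "Daily DNA sample", body)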

Result: a screenshot of the resulting email – the query output rendered as an HTML table in the Gmail inbox.

While this example is Gmail specific, it can easily be changed to use any email service. Keep in mind that even when using Gmail you can reach any other email address, and if you choose to avoid Gmail altogether, you can simply change bimailer.py to use any other email server.

MySQL query to HTML table with Python

For easy email reporting, I like to send plain HTML tables. An easy implementation is to take the query output and HTML-ize it.

The script below does just that – it connects to MySQL, sends a query, and turns the result into an HTML table. This table can further be used to be sent with the bimailer, or in any other way.

The MySQL connector for Python can be found on the official MySQL website.
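A minimal sketch of the approach, using the official mysql-connector-python package (connection parameters are placeholders, and the HTML is deliberately bare):

import mysql.connector

def sql_to_html_table(query):
    # Connect, run the query, and fetch both the column names and the rows.
    conn = mysql.connector.connect(
        host="localhost", user="user", password="password", database="db"
    )
    cursor = conn.cursor()
    cursor.execute(query)
    headers = [col[0] for col in cursor.description]
    rows = cursor.fetchall()
    cursor.close()
    conn.close()
    # Build the table: one header row, then one row per result tuple.
    html = "<table border='1'><tr>"
    html += "".join("<th>{}</th>".format(h) for h in headers)
    html += "</tr>"
    for row in rows:
        html += "<tr>" + "".join("<td>{}</td>".format(v) for v in row) + "</tr>"
    html += "</table>"
    return html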

MySQL Federated database creation

MySQL does not support entire federated databases, but it does support federated tables – as long as the option is enabled in my.ini or my.cnf.

To create an entire federated DB in a non-manual, easy-to-maintain way (refreshing table definitions if the source table structure changes), we can use the following procedure.

First, the caveats:

1. Federated tables do not use indexes in the usual way, so you may need to create a local copy of the table and index it before expecting any performance.

2. When using a client such as HeidiSQL that runs SHOW TABLE STATUS, the connection between your DB and the source DB of the federated table remains open. An issue can arise if the number of tables is significant while max_connections on the source DB is low. To get around this, periodically flush tables on the destination DB to close the connections. There are alternatives, but this is the simplest workaround (it may not be the best).

Personally, to get around this issue I scheduled a MySQL event that flushes every federated table individually (not FLUSH TABLES, but FLUSH TABLE tablename1; FLUSH TABLE tablename2; ...), to avoid blocking other queries. FLUSH TABLES flushes all tables, so it has to wait for all other queries to finish, and any new queries have to wait for it to finish, ruling out concurrent queries. A sketch of such an event is below (table names and the schedule are placeholders, and the event scheduler must be enabled):
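DELIMITER $$
CREATE EVENT flush_federated_tables
ON SCHEDULE EVERY 1 HOUR
DO
BEGIN
  -- One FLUSH TABLE per federated table, so other queries are not blocked.
  FLUSH TABLE fed_table_1;
  FLUSH TABLE fed_table_2;
END$$
DELIMITER ;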

1. Get a no-data mysqldump the same way you would get a normal dump, but with the -d switch for no data (hosts, users, and file names in the commands below are placeholders):
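mysqldump -d -h source_host -u source_user -p source_db > nodata_dump.sql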


2. Convert the engine part of each table definition into FEDERATED plus a CONNECTION string (here done with Python):
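# A sketch of the conversion; SOURCE is a placeholder connection string,
# and the regex assumes the standard layout of a mysqldump file.
import re

SOURCE = "mysql://user:password@source_host:3306/source_db"

with open("nodata_dump.sql") as f:
    dump = f.read()

# Match each CREATE TABLE statement, capturing its name and column list.
pattern = re.compile(r"CREATE TABLE `(\w+)` \((.*?)\) ENGINE=\w+[^;]*;", re.DOTALL)

def federate(match):
    table, columns = match.group(1), match.group(2)
    # Point the federated table at the same table on the source server.
    return (
        "CREATE TABLE `{0}` ({1}) "
        "ENGINE=FEDERATED CONNECTION='{2}/{0}';".format(table, columns, SOURCE)
    )

with open("federated_dump.sql", "w") as f:
    f.write(pattern.sub(federate, dump))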


3. Load up your new dump:
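mysql -h destination_host -u destination_user -p destination_db < federated_dump.sql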


You can run this procedure periodically or on demand to refresh the table definitions whenever the source definitions change.

MySQL Rank() function

UPDATE: On July 18, 2018 the MySQL server team announced the introduction of window functions. This article was written before that, offering a workaround to gain similar functionality. I have now updated the article with the latest, best-practice way to do this correctly, but still leave the old article below for reference.

As of MySQL 8, you can use window functions such as ROW_NUMBER() OVER (PARTITION BY country). Interestingly, you can reference your window by an alias and define the partition in a separate WINDOW clause (much like a WHERE clause), allowing reuse:
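-- A minimal sketch; the cities table and its columns are illustrative.
SELECT
    country,
    city,
    population,
    RANK() OVER w AS population_rank
FROM cities
WINDOW w AS (PARTITION BY country ORDER BY population DESC);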


Happy window aggregating!

OLD ARTICLE FROM 2014 BELOW:

A quick and dirty custom MySQL rank() function to emulate RANK() from PostgreSQL in MySQL. The same disclaimers apply as for the custom rownum() function below. In this case I am using it to rank decimals, but feel free to change the input type.
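Since the original listing is not reproduced here, below is a hedged reconstruction in the same spirit (note that on MySQL 8, rank is a reserved word, so the function would need a different name there):

DELIMITER $$
CREATE FUNCTION rank(val DECIMAL(20,6)) RETURNS INT
    NOT DETERMINISTIC
    NO SQL
BEGIN
  -- Reset the counters whenever the statement start time changes
  -- (microsecond resolution; see the rownum() disclaimers below).
  IF @rank_time IS NULL OR @rank_time <> NOW(6) THEN
    SET @rank_time = NOW(6);
    SET @rank_row = 0;
    SET @rank_prev = NULL;
    SET @rank_value = 0;
  END IF;
  SET @rank_row = @rank_row + 1;
  -- Ties keep the same rank; the next distinct value skips ahead.
  IF @rank_prev IS NULL OR @rank_prev <> val THEN
    SET @rank_value = @rank_row;
  END IF;
  SET @rank_prev = val;
  RETURN @rank_value;
END$$
DELIMITER ;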


To invoke it, use rank(column_to_rank), such as below:
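-- The results table is illustrative; sorting in a subquery first
-- keeps the variable-based ranking in the intended order.
SELECT score, rank(score) AS score_rank
FROM (SELECT score FROM results ORDER BY score DESC) AS sorted;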


MySQL rownum() row number function

UPDATE: On July 18, 2018 the MySQL server team announced the introduction of window functions. This article was written before that, offering a workaround to gain similar functionality. I have now updated the article with the latest, best-practice way to do this correctly, but still leave the old article below for reference.

As of MySQL 8, you can use window functions such as ROW_NUMBER() OVER (PARTITION BY country). Interestingly, you can reference your window by an alias and define the partition in a separate WINDOW clause (much like a WHERE clause), allowing reuse:
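-- A minimal sketch; the cities table and its columns are illustrative.
SELECT
    country,
    city,
    ROW_NUMBER() OVER w AS row_num
FROM cities
WINDOW w AS (PARTITION BY country ORDER BY population DESC);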


Happy window aggregating!

 

OLD ARTICLE FROM 2014 BELOW:

MySQL does not have a built-in row number function, which makes getting the row number an ugly hack – one that needs to be repeated over and over, every time you need it.

I have taken the liberty of writing a rownum() function that is just as bad, but at least it will keep your code clean.

DISCLAIMER:

  1. It does not work on older versions of MySQL.
  2. It is based on variables that only reset when time changes. If the function is invoked in two separate places in the same microsecond (one millionth of a second), the output will not be as expected: the variable will not reset, and the output will not start counting from 1.
  3. If you run the function more than once in the same query, it will increment every time it is run, so possibly multiple times per row.

For the purposes of BI, I find the associated risks, when used as intended, non-existent.
For older versions of MySQL server, you can try finding another creative way to reset the counter for the row number.

 

Row number function
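Since the original listing is not reproduced here, below is a hedged reconstruction that behaves as the disclaimers describe:

DELIMITER $$
CREATE FUNCTION rownum() RETURNS INT
    NOT DETERMINISTIC
    NO SQL
BEGIN
  -- Reset the counter whenever the statement start time changes
  -- (microsecond resolution), as per disclaimer 2.
  IF @rownum_time IS NULL OR @rownum_time <> NOW(6) THEN
    SET @rownum_time = NOW(6);
    SET @rownum_counter = 0;
  END IF;
  SET @rownum_counter = @rownum_counter + 1;
  RETURN @rownum_counter;
END$$
DELIMITER ;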


To invoke it, use rownum(), such as below:
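-- The cities table is illustrative.
SELECT rownum() AS row_num, country, city
FROM cities;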

Good luck!

Normalizing codes: MySQL alphanumeric extraction functions

I wanted to share an SQL function I end up using quite frequently when dealing with user input data: normalization to alphanumeric characters. It can also be modified into a function that extracts only numeric values. If you do modify it that way, make sure you keep RETURNS VARCHAR, for the cases where extracted numeric values begin with 0. You can also add other characters, such as the minus sign, if required.

Example use cases:

Matching producer codes:
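-- Illustrative: join two sources whose codes differ only in punctuation,
-- spacing, or case. alphanum() is the function defined at the end of
-- this post (the name is my choice for this sketch, not a built-in).
SELECT o.order_id, p.producer_name
FROM orders o
JOIN producers p ON alphanum(o.producer_code) = alphanum(p.producer_code);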

Normalizing/matching phone numbers:
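-- Illustrative: '+1 (555) 010-2030' and '1-555-010-2030' both normalize
-- to '15550102030', so the join below matches them (a digits-only
-- variant of the function works equally well here).
SELECT c.customer_id, l.lead_id
FROM customers c
JOIN leads l ON alphanum(c.phone) = alphanum(l.phone);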


The function (the listing below is a sketch, named alphanum() to match the examples above):
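DELIMITER $$
CREATE FUNCTION alphanum(input VARCHAR(255)) RETURNS VARCHAR(255)
    DETERMINISTIC
    NO SQL
BEGIN
  DECLARE i INT DEFAULT 1;
  DECLARE c CHAR(1);
  DECLARE result VARCHAR(255) DEFAULT '';
  -- Walk the input character by character, keeping only a-z, A-Z, 0-9.
  -- RETURNS VARCHAR so extracted values beginning with 0 survive intact.
  WHILE i <= CHAR_LENGTH(input) DO
    SET c = SUBSTRING(input, i, 1);
    IF c REGEXP '[a-zA-Z0-9]' THEN
      SET result = CONCAT(result, c);
    END IF;
    SET i = i + 1;
  END WHILE;
  RETURN result;
END$$
DELIMITER ;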