Let me start by saying I like SQL, and I believe it is fantastic in its ease of use. It's meant for rearranging and aggregating data, which makes it easy to create reports from a dimensional model. I do, however, believe we should not misuse it, and should simply write less SQL.
Now, there are two types of SQL code in BI: Simple SQL and Complex SQL.
Simple SQL
Simple SQL is the type of SQL that makes reports from a dimensional model. It seldom has subqueries, and all it does is some joining, grouping, and aggregating, since all the numbers are available in fact tables in ready-to-use form. It's easy enough that even some business users can understand it and possibly maintain it.
Simple SQL is fantastic because it allows easy operations on prepared data sets. It is easy to read and to maintain.
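To make it concrete, here is the flavor of query I mean; the fact and dimension names below are invented for illustration:

SELECT d.country, SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_customer d ON d.customer_id = f.customer_id
GROUP BY d.country;

One join, one aggregate, no subqueries: anyone who knows the model can read it at a glance.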
Complex SQL
Complex SQL is the SQL that makes fact tables, or the SQL of data products (data transformations for data applications). This SQL often has many subqueries and multiple steps of transformation, and sometimes does data cleaning and enriching as well.
Complex SQL is hard to maintain, and even harder to extend or modify. It often gets worse after each iteration. It's next to impossible to refactor, and you usually end up having to rewrite it after a while.
Complex SQL causes high bug rates, high maintenance costs, and low bus factors. Since SQL is not easy to test (compared to tests in a programming language), it is unlikely ever to have good test coverage, which makes it dangerous to work with. Subtle logic bugs stay well hidden, and the monolithic nature of applications written in SQL makes them very hard to maintain and debug.
I suggest avoiding Complex SQL entirely. The key concept for this argument is ETL vs. ELT: transform your data before it hits the database, and do it with easily testable, DRY code.
Many decide to do the complex transformations in SQL mainly because it's quick and easy. In the long run, however, this quick and easy SQL code costs you quite a bit of maintenance, and it will get worse with each iteration, as it cannot easily be refactored.
Instead of the query language, use a programming language with tests. This is simply the better approach if you want iteration speed and accuracy.
Write less SQL, use a programming language.
That's right, write less SQL. Use a programming language and DRY out your code. Only use SQL for easy or temporary transformations, and use something like Python for the heavy transformations. There are limitations to this approach, but they are minor compared to the improvement it brings to the BI development process. Make a leap of faith and write less SQL!
I’d like to cover here some basics about the common business intelligence stack of Python and SQL. Let’s see why this stack is so popular, and why it’s not yet the one and only gold standard.
Why Python for business intelligence?
Readability.
Business intelligence professionals using Python come from all walks of life, and many have no formal programming education. This can lead to incredibly random coding styles; less so in Python, though, since Python enforces indentation.
This doesn't mean that someone with no idea will produce good code or good architecture, and regardless of how clean the code, people can still do very obscure, roundabout, complex things. What it does mean is that their code is very easy to read, which is the first step to understanding it.
Staffing.
Python is an easy-to-learn, widespread, easy-to-use language. It's much easier to find Python developers than, say, developers with VBA knowledge, and harder still to find developers who combine BI knowledge with more niche languages.
Libraries.
For BI work, you need to extract and load data, transform data, and analyze data. Python can do it all: most advertising data providers offer Python libraries to access their APIs, Python can access any database you can think of, it natively has great features for data munging, and it can be used for large parallel computations or statistical analysis. Data science too.
Widespread usage/versatility.
This means that whatever problem you encounter, someone else has probably encountered it too. You will have a very easy time finding solutions or getting help for anything Python. While other languages might be better at specific things, Python is a jack of all trades and a master of some.
Why not Python?
Spoon-fed.
Many BI developers stop at Python because the language is so versatile. This means they never get the opportunity to work with languages designed for large software projects, and they miss out on the best practices and programming paradigms those languages teach. Too easy sometimes means people stop learning.
Slow.
Python is slower than most other languages. In the context of working with data, this only becomes an issue when doing complex calculations on very large data sets.
Why SQL for business intelligence?
I will talk here about PostgreSQL specifically, for small data, since it is the best open source database for analysis. It has better functionality than paid solutions like Microsoft's, and equal or better performance and functionality compared to Oracle's product. It is free, without licensing fees or vendor lock-in, and it is built by developers, for developers, with robustness and ease of use in mind. You can even run Python on Postgres 🙂
For big data, the leading SQL solution seems to be Amazon's Postgres-like product, Redshift.
Ease of access:
For the sake of easy data access, SQL makes a lot of sense. There are plenty of tools that can run SQL against a database to generate dashboards, or that can connect it easily to Excel. And it is easy for analysts to learn.
Ease of data manipulation:
Most of the data you will have will not be single numbers or random sentences, but data structured in tables. This means that instead of conventional algebra, we use relational algebra, performing operations between data sets instead of between single values. This is where SQL shines, since it is designed specifically for these kinds of operations. Once you have the data in a database, you can leverage the power of this language to perform these operations very easily.
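As a tiny illustration of an operation between whole data sets rather than single values (invented table names, PostgreSQL syntax):

-- Which rows are in staging but not yet in production?
SELECT * FROM staging_orders
EXCEPT
SELECT * FROM production_orders;

A single statement compares every row of two tables; expressed value by value, this would be a loop and a lookup.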
Staffing:
Every self-respecting BI professional knows at least a little SQL. Most good BI engineers know a lot of SQL. Also, the particular flavor of SQL is not very relevant, as the flavors are all very similar.
Why not SQL?
Poor options for architecture.
SQL is a query language, not a programming language. It is meant for rearranging and computing data, not for complex operations. You end up producing a lot of code, often WET (the opposite of DRY). The code ends up monolithic, and it is not easy to see at a glance what a query does. Basically, you cannot write easily maintainable code.
Lowers the bar.
SQL is so easy to learn that a lot of people end up being able to use it. However, that does not suffice for writing clean code or designing an ETL, a BI application, or a data warehouse. I've personally seen ridiculous projects that were a dependency hell of views on views on views. Sadly, knowing SQL says little about one's ability to deliver a quality product, but the two often get confused in the world of BI.
I hope you enjoyed this article, and if you have any opinions on the topic do not hesitate to leave a comment.
As of MySQL 8, you can now use window functions such as ROW_NUMBER() OVER (PARTITION BY country). Interestingly, you can reference your window by an alias and define it in a separate WINDOW clause (a section of the query, much like the WHERE clause), allowing reuse:

mysql> SELECT
           val,
           ROW_NUMBER() OVER w AS 'row_number',
           RANK() OVER w AS 'rank'
       FROM numbers
       WINDOW w AS (ORDER BY val);
Happy window aggregating!
OLD ARTICLE FROM 2014 BELOW:
A quick and dirty custom MySQL rank() function to emulate RANK() from PostgreSQL in MySQL. Same disclaimers as for the custom rownum() function. In this case, I am using it to rank decimals, but feel free to change the input type.
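The original listing did not survive on this page, so below is only a sketch of the classic session-variable pattern such an emulation is built on, not the author's exact code; the numbers table and val column are borrowed from the update above:

-- Emulating RANK() with session variables on pre-8.0 MySQL.
-- Relies on left-to-right evaluation of the SELECT list, which MySQL does
-- not strictly guarantee; one more reason to prefer real window functions.
SELECT val,
       @row_num := @row_num + 1 AS row_num,
       @rnk := IF(val = @prev, @rnk, @row_num) AS ranking,
       @prev := val AS prev_val
FROM (SELECT val FROM numbers ORDER BY val) ordered
CROSS JOIN (SELECT @row_num := 0, @rnk := 0, @prev := NULL) vars;

Ties keep the rank of the first row in their group, which is exactly what RANK() does.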
OLD ARTICLE FROM 2014 BELOW:
MySQL does not have a built-in row number function, which makes getting a row number an ugly hack; an ugly hack that needs to be repeated over and over, every time you need it.
I have taken the liberty of writing a rownum() function that is just as bad, but at least it will keep your code clean.
DISCLAIMER:
It does not work on older versions of MySQL.
It is based on variables that only reset when the time changes. If the function is invoked in two separate statements within the same microsecond (one millionth of a second), the output will not be as expected: the variable will not reset, and the output will not start counting from 1.
If you run the function more than once in the same query, it will increment every time it is evaluated, so possibly multiple times per row.
For the purposes of BI, I find the associated risks, when the function is used as intended, nonexistent.
For older versions of MySQL server, you can try finding another creative way to reset the counter for the row number.
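The function body itself was lost from this page; the sketch below is my reconstruction of the approach the disclaimer describes, resetting a session variable whenever the statement's microsecond timestamp changes. The rownum name is the post's; the body is an assumption:

-- Requires MySQL 5.6.4+ for fractional-second NOW(6).
DELIMITER //
CREATE FUNCTION rownum() RETURNS INT NOT DETERMINISTIC
BEGIN
  -- NOW(6) is constant within a statement, so the counter resets once per
  -- query. Two statements starting in the same microsecond share a counter,
  -- which is exactly the edge case the disclaimer warns about.
  IF @rownum_ts IS NULL OR @rownum_ts <> NOW(6) THEN
    SET @rownum_ts = NOW(6);
    SET @rownum_counter = 0;
  END IF;
  SET @rownum_counter = @rownum_counter + 1;
  RETURN @rownum_counter;
END//
DELIMITER ;

-- Usage:
SELECT rownum() AS rn, val FROM numbers;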
Here is a quick overview of how to stream gzip with Ruby. For the purposes of an OS-agnostic script to gzip a SQL dump, I needed a Ruby script. Unfortunately, for large files, Ruby returns an error when the file is too large to fit in RAM. The error looks like this:
"failed to allocate memory (NoMemoryError)"
So I decided to do a sort of stream, combining some basic examples found online. As I was not able to find anything ready-made, I combined file streaming with gzipping.
The script below does a SQL dump with date, gzips it in chunks with Ruby, and deletes the non-zipped file. (It is written for a Windows server, so change the system commands if moving to Linux.)
Additionally I wanted to check performance, so I also added timing.
Stream gzip with Ruby
require 'zlib'

start = Time.now # timing, to check how long the whole thing takes

time = Time.new
date = time.to_s[0, 10]
database = 'mydatabasename'
source = database + '_' + date + '.sql'
dest = source + '.gz'
MEGABYTE = 1024 * 1024 # change the chunk size if you find larger chunks run faster

# Do not use exec to start the mysqldump, or the Ruby script will terminate.
system 'mysqldump --port=3307 --host=127.0.0.1 -u root --password=mypassword ' + database + ' > ' + source

# Only the first lines of the original 41-line listing survived here; the rest
# below is a reconstruction of what the text describes: gzip the dump chunk by
# chunk so it never has to fit in RAM, then delete the plain dump.
File.open(source, 'rb') do |input|
  Zlib::GzipWriter.open(dest) do |gz|
    while (chunk = input.read(MEGABYTE))
      gz.write(chunk)
    end
  end
end

system 'del ' + source # written for a Windows server; use rm on Linux

puts "Finished in #{Time.now - start} seconds"
I wanted to share an SQL function I end up using quite frequently when dealing with user input data: normalization to alphanumeric. It can also be modified into a function that extracts numeric values; if you do modify it, make sure you keep returns varchar, for the cases where extracted numeric values begin with 0. You can also add other characters, such as the minus sign, if required.
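The listing itself is missing from this page, so here is only a minimal sketch of such a function, assuming PostgreSQL; the name and the regex are mine:

-- Strip everything that is not a letter or a digit. It returns varchar, not
-- a numeric type, so leading zeros survive if you adapt it to extract
-- numbers only.
CREATE OR REPLACE FUNCTION normalize_alphanumeric(input text)
RETURNS varchar AS $$
  SELECT regexp_replace(input, '[^a-zA-Z0-9]', '', 'g');
$$ LANGUAGE sql IMMUTABLE;

-- Usage:
SELECT normalize_alphanumeric(' ABC-123 x!'); -- 'ABC123x'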
Calculating the mode and the modal share on PostgreSQL has never been easier than just running these functions.
Some background:
I needed to calculate the aggregate mode and the frequency of that mode, to be able to draw some conclusions about the stability of some particularly noisy data. Sample standard deviation can be very misleading as a measure of data noise, as it mixes the amplitude of variation with its frequency.
Imagine measuring the best temperature to drink wine based on what people would report.
Some would say that for a particular wine the temperature would be 10 Celsius, while many others would say 11-12 degrees. And since we are talking about user input, someone may jokingly recommend drinking it at -20 Celsius or at 200 Celsius. The data cannot be trusted.
So let’s take the array recommended_temperature [8,10,10,12,11,10,10,12,10,11,-20,200,10,10]
Finding the right temperature to serve the wine at from this array can be tricky. Data is noisy and not to be trusted.
Total numbers: 14
Mean (average): 21.71429
Standard deviation: 51.95624
Variance: 2699.45055
Do any of the above numbers even come close? No. You would not serve the wine at the average of 22 Celsius, and the standard deviation is high, but that doesn't tell you whether the values are spread out or whether an outlier is messing everything up.
What about the mode? The mode in our case is 10 (the most common value).
Sounds reasonable, wine tastes nice at 10 degrees. But how sure are we that this is the right temperature?
We need more information, mainly the modal share (the number of values in the mode divided by all values).
For that, I put together a mode_count function in PostgreSQL, to let you find how many values are in the mode, and from that calculate a percentage of all values.
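The function listing did not survive here, so this is a minimal sketch of what mode_count could look like; the name is the post's, the body is my assumption:

-- How many rows share the most common value in an array?
CREATE OR REPLACE FUNCTION mode_count(vals anyarray)
RETURNS bigint AS $$
  SELECT count(*)
  FROM unnest(vals) AS t(v)
  WHERE v = (SELECT v
             FROM unnest(vals) AS t2(v)
             GROUP BY v
             ORDER BY count(*) DESC
             LIMIT 1);
$$ LANGUAGE sql IMMUTABLE;

-- Usage, with the wine array (the mode itself is available via the built-in
-- mode() WITHIN GROUP aggregate on PostgreSQL 9.4+):
SELECT mode_count(ARRAY[8,10,10,12,11,10,10,12,10,11,-20,200,10,10]); -- 7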
So now we have: Mode: 10; Modal share: 0.5 (7/14).
We can say with 50% certainty that the appropriate temperature to serve the wine at, based on user recommendations, is 10. Add in the high standard deviation, and you will easily detect that strong outliers are present in the data, giving you even more confidence in your mode value.
While data will most often be used as a tool to assist in making decisions, it sometimes also serves as a motivator. Live dashboards can allow your client to react quickly to any changes in their KPIs, or can assist a sales team in seeing how close they are to their daily targets.
Stick them in fullscreen mode on a wall display or a TV, and they will be there to offer this transparency non-stop.
There are quite a few pieces of software out there that let you create live, automatically updating, automatically refreshing dashboards, but most of them cost more than they are worth, and some are quite glitchy. I will not go into these tools now, but will instead offer a very fast and easy alternative.
When looking for this, my first requirement was that the refresh be easy to accomplish and smooth. HTML offers that, so I wanted something that could be wrapped in HTML.
HTML also offers the advantage of being flexible about where you put it: standalone on a dedicated web page, or embedded in an admin frontend or dashboard panel.
Google has a nice free Javascript API for creating charts, compatible with data pulled from almost any source. You can find more details about the look and feel of it here.
My final result was this, for a daily chart updated every 3 minutes:
Google does not provide any connectors for MySQL or other common SQL databases, and most often that's where live data will be.
I tried looking around for resources that would pull the SQL data into the chart, but my search was not very fruitful. The Google API wants the data in a very specific format, so I took the liberty of writing a little PHP to do the job.
If you are not interested in how this works, just scroll all the way down and copy paste the full code into Notepad or TextEdit and save it as PHP. If you want to edit the code, I use Notepad++ as it highlights the operators.
2. Second part: transforming the data into the right format. We do this with an echo.
PHP

<?php
// My SQL table has the columns Date, Margin, Target, Last month Margin. I want the
// chart lines to contain summed-up values for each day, except Target, which is the
// monthly target repeated on each row, to create a horizontal line.
$sqlQuery = "SELECT fulldate,
    round((SELECT SUM(last_month_margin) FROM dummy_value_daily d2 WHERE d1.fulldate >= d2.fulldate), 2) AS last_month_margin,
    round(Target, 2) AS Target,
    round((SELECT SUM(margin) FROM dummy_value_daily d2 WHERE d1.fulldate >= d2.fulldate), 2) AS margin
FROM dummy_value_daily d1";

// Most of the original listing was lost from this page; the echo below is a
// reconstruction of the step described above, assuming a mysqli connection
// $connection opened earlier (that part of the post is missing).
$result = mysqli_query($connection, $sqlQuery);
while ($row = mysqli_fetch_assoc($result)) {
    // One JavaScript array row per day, in the format the Google Charts API expects.
    echo "['" . $row['fulldate'] . "', " . $row['last_month_margin'] . ", " . $row['Target'] . ", " . $row['margin'] . "],";
}
?>
3. Final part: creating the final PHP with the Google API. You can find all the documentation for the Google Charts API here.
Note that the refresh is set to 180 seconds with meta http-equiv="refresh" content="180". The PHP runs, the query pulls the data, and only after that succeeds does the chart refresh, so the update is very smooth to the eye.
PHP

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- Only the first line of the original 72-line listing survived on this page.
     What follows is a minimal reconstruction of the structure described above,
     using the current Google Charts loader rather than whatever the original used. -->
<html>
<head>
<!-- The 180-second refresh mentioned above. -->
<meta http-equiv="refresh" content="180" />
<script type="text/javascript" src="https://www.gstatic.com/charts/loader.js"></script>
<script type="text/javascript">
google.charts.load('current', {packages: ['corechart']});
google.charts.setOnLoadCallback(drawChart);
function drawChart() {
  var data = google.visualization.arrayToDataTable([
    ['Date', 'Last month margin', 'Target', 'Margin'],
    <?php
    // The query and echo from the second part: each day becomes one JS array row.
    $result = mysqli_query($connection, $sqlQuery);
    while ($row = mysqli_fetch_assoc($result)) {
        echo "['" . $row['fulldate'] . "', " . $row['last_month_margin'] . ", " . $row['Target'] . ", " . $row['margin'] . "],";
    }
    ?>
  ]);
  var chart = new google.visualization.LineChart(document.getElementById('chart_div'));
  chart.draw(data, {title: 'Daily margin vs. target'});
}
</script>
</head>
<body>
<div id="chart_div" style="width: 100%; height: 600px;"></div>
</body>
</html>
To run the PHP you need a dev server, a web server, or someone who can upload it somewhere for you. I will cover how to install a lightweight dev server in a separate post. I recommend the EasyPHP dev server if this is all you will ever do with it; alternatively, XAMPP is a better option, allowing you to publish to the web as well. Guide for installing and configuring EasyPHP here.
5. Result:
So I hope you enjoyed this guide, and that you will find it useful. One final note: the Google API has all kinds of charts, and it's easy to switch between types. You can change this chart to a column chart by changing just one word in the HTML. Just give it a shot 🙂