MySQL Rank() function

UPDATE: On July 18 2018 the MySql server team announced the introduction of window functions. The article was written before this, offering an workaround to gain some similar functionality. I updated this article now with the latest and best practice way to do this correctly, but sill leave the old article for reference.

AS of Mysql 8, you can now use window functions such as  rownum() OVER(PARTITION BY country) . Interestingly, you can just reference your partition by an alias, and define the partition in a separate window clause section (like where clause), allowing reuse:

 

Happy window aggregating!

OLD ARTICLE FROM 2014 BELOW:

A quick and dirty custom MySQL Rank() function to emulate rank from PostgreSQL into mySQL. Same disclaimers as the custom rownum() function. In this case, I am using it to rank decimals, but feel free to change the input type.

 

to invoke, use rank(column to rank),such as below:

output:

MySQL rownum() row number function

UPDATE: On July 18 2018 the MySql server team announced the introduction of window functions. The article was written before this, offering an workaround to gain some similar functionality. I updated this article now with the latest and best practice way to do this correctly, but sill leave the old article for reference.

AS of Mysql 8, you can now use window functions such as  rownum() OVER(PARTITION BY country) . Interestingly, you can just reference your partition by an alias, and define the partition in a separate window clause section (like where clause), allowing reuse:

 

Happy window aggregating!

 

OLD ARTICLE FROM 2014 BELOW:

MySQL does not have a built in row number function, which makes it an ugly hack to get the row number –  an ugly hack that needs to be repeated over and over, every time you need it.

I have taken the liberty of writing a rownum() function, that is just as bad, but at least it will keep your code clean.

DISCLAIMER:

  1. does not work on older versions of mySQL
  2. It is based on using variables that only reset when time changes. If the function is invoked in 2 separate places in the same microsecond(1 millionth of a second), the output will not be as expected. The variable will not reset, the output will not start counting from 1.
  3. If you run the function more than once in the same query, it will increment every time it is run, so possibly multiple times per row.

For the purposes of BI, I find the associated risks, when used as intended, non existent.
For older versions of mySQL server, you can try finding another creative way to reset the counter for the row number.

 

Row number function

 

to invoke it, use rownum(), such as below

Good luck!

Normalizing codes, mySQL extracting alphanumeric functions

Wanted to share an sql function I end up using quite frequently when dealing with user input data – normalization to alphanumeric. It can also be modified into a function to extract numeric values. If you do modify it to that, make sure you keep returns varchar for the cases of extracting numeric values beginning with 0. You can also add other characters, such as the minus sign, if required.

Example use cases:

Matching producer codes:

Normalizing/matching  phone numbers:

 

function:

 

Cleaning noisy data: Postgresql Mode and Modal Share aggregate function

Calculating mode and modal share on postgresql has never been easier than just running these functions.

Some background:
I needed to calculate the aggregate mode and the frequency of that mode to be able to draw some conclusions about the stability of some particularly noisy data. Sample standard deviation can be very misleading for measuring data noise, as it’s a mix of amplitude of variation and frequency.

Imagine measuring the best temperature to drink wine based on what people would report.

Some would say for a particular wine the temperature would be 10 celsius, while many others will say 11-12 degrees.  As we are talking about user input, someone may jokingly recommend drinking it at -20 Celsius or at 120 Celsius. The data cannot be trusted.

So let’s take the array recommended_temperature [8,10,10,12,11,10,10,12,10,11,-20,200,10,10]

Finding the right temperature to serve the wine at from this array can be tricky. Data is noisy and not to be trusted.

Total Numbers: 14

Mean (Average): 21.71429

Standard deviation:51.95624

Variance(Standard deviation):2699.45055

Do any of the above numbers even come close? No. You would not serve the wine at an average of 22 Celsius, and the standard deviation is high, but that doesn’t tell you if the values are spread, or if an outlier is messing everything up.

What about the mode? Mode in our case is 10 (the most common value)
Sounds reasonable, wine tastes nice at 10 degrees. But how sure are we that this is the right temperature?
We need more info, mainly modal share (number of values in the mode/all values)

For that, I put together in PostgreSQL a mode_count function, to allow you to find how many values you have in the mode, and possibly calculate a % of all values.
so now we have Mode:10;Modal_share 0.5 (7/14)
We can say with 50% certainty that the appropriate temperature to serve the wine at, based on user recommendations, is 10. Add in the high standard deviation, and you will easily detect the fact that there are strong outliers present in the data, giving you even more confidence towards your mode value.

The function from postgres manual for mode: 

and the custom mode count function:

They are aggregate functions, so call them as:

Hope you enjoy!