Cleaning noisy data: Postgresql Mode and Modal Share aggregate function

Calculating mode and modal share on postgresql has never been easier than just running these functions.

Some background:
I needed to calculate the aggregate mode and the frequency of that mode to be able to draw some conclusions about the stability of some particularly noisy data. Sample standard deviation can be very misleading for measuring data noise, as it’s a mix of amplitude of variation and frequency.

Imagine measuring the best temperature to drink wine based on what people would report.

Some would say for a particular wine the temperature would be 10 celsius, while many others will say 11-12 degrees.  As we are talking about user input, someone may jokingly recommend drinking it at -20 Celsius or at 120 Celsius. The data cannot be trusted.

So let’s take the array recommended_temperature [8,10,10,12,11,10,10,12,10,11,-20,200,10,10]

Finding the right temperature to serve the wine at from this array can be tricky. Data is noisy and not to be trusted.

Total Numbers: 14

Mean (Average): 21.71429

Standard deviation:51.95624

Variance(Standard deviation):2699.45055

Do any of the above numbers even come close? No. You would not serve the wine at an average of 22 Celsius, and the standard deviation is high, but that doesn’t tell you if the values are spread, or if an outlier is messing everything up.

What about the mode? Mode in our case is 10 (the most common value)
Sounds reasonable, wine tastes nice at 10 degrees. But how sure are we that this is the right temperature?
We need more info, mainly modal share (number of values in the mode/all values)

For that, I put together in PostgreSQL a mode_count function, to allow you to find how many values you have in the mode, and possibly calculate a % of all values.
so now we have Mode:10;Modal_share 0.5 (7/14)
We can say with 50% certainty that the appropriate temperature to serve the wine at, based on user recommendations, is 10. Add in the high standard deviation, and you will easily detect the fact that there are strong outliers present in the data, giving you even more confidence towards your mode value.

The function from postgres manual for mode: 

and the custom mode count function:

They are aggregate functions, so call them as:

Hope you enjoy!

Author: Adrian

I'm a Business intelligence professional working in the tech scene in Berlin, Germany. I'm currently freelancing. If you'd like to check my services, visit adrian.brudaru.com