Calculating mode and modal share on postgresql has never been easier than just running these functions.
Some background:
I needed to calculate the aggregate mode and the frequency of that mode to be able to draw some conclusions about the stability of some particularly noisy data. Sample standard deviation can be very misleading for measuring data noise, as it’s a mix of amplitude of variation and frequency.
Imagine measuring the best temperature to drink wine based on what people would report.
Some would say for a particular wine the temperature would be 10 celsius, while many others will say 11-12 degrees. As we are talking about user input, someone may jokingly recommend drinking it at -20 Celsius or at 120 Celsius. The data cannot be trusted.
So let’s take the array recommended_temperature [8,10,10,12,11,10,10,12,10,11,-20,200,10,10]
Finding the right temperature to serve the wine at from this array can be tricky. Data is noisy and not to be trusted.
Total Numbers: 14
Mean (Average): 21.71429
Standard deviation:51.95624
Variance(Standard deviation):2699.45055
Do any of the above numbers even come close? No. You would not serve the wine at an average of 22 Celsius, and the standard deviation is high, but that doesn’t tell you if the values are spread, or if an outlier is messing everything up.
What about the mode? Mode in our case is 10 (the most common value)
Sounds reasonable, wine tastes nice at 10 degrees. But how sure are we that this is the right temperature?
We need more info, mainly modal share (number of values in the mode/all values)
For that, I put together in PostgreSQL a mode_count function, to allow you to find how many values you have in the mode, and possibly calculate a % of all values.
so now we have Mode:10;Modal_share 0.5 (7/14)
We can say with 50% certainty that the appropriate temperature to serve the wine at, based on user recommendations, is 10. Add in the high standard deviation, and you will easily detect the fact that there are strong outliers present in the data, giving you even more confidence towards your mode value.
The function from postgres manual for mode:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
CREATE OR REPLACE FUNCTION _final_mode(anyarray) RETURNS anyelement AS $BODY$ SELECT a FROM unnest($1) a GROUP BY 1 ORDER BY COUNT(1) DESC, 1 LIMIT 1; $BODY$ LANGUAGE 'sql' IMMUTABLE; <em id="__mceDel" style="line-height: 1.714285714; font-size: 1rem;"><em id="__mceDel">DROP AGGREGATE IF EXISTS mode(anyelement);</em></em> CREATE AGGREGATE mode(anyelement) ( SFUNC=array_append, STYPE=anyarray, FINALFUNC=_final_mode, INITCOND='{}' ); |
and the custom mode count function:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
CREATE OR REPLACE FUNCTION _final_mode_count(anyarray) RETURNS bigint AS $BODY$ SELECT COUNT(*) FROM unnest($1) a GROUP BY a ORDER BY COUNT(1) DESC, a LIMIT 1; $BODY$ LANGUAGE 'sql' IMMUTABLE; -- Tell Postgres how to use our aggregate DROP AGGREGATE IF EXISTS mode_count(anyelement); CREATE AGGREGATE mode_count(anyelement) ( SFUNC=array_append, --Function to call for each row. Just builds the array STYPE=anyarray, FINALFUNC=_final_mode_count, --Function to call after everything has been added to array INITCOND='{}' --Initialize an empty array when starting ); |
They are aggregate functions, so call them as:
1 2 3 4 5 |
SELECT mode(some_value) AS modal_value, mode_count(some_value) as count_at_mode, mode_count(some_value)/(count(some_value)::float as modal_share <span style="font-size: 0.857142857rem; line-height: 1.714285714;">FROM t;</span> |
Hope you enjoy!