Scheduled emails with Python for BI

As part of business intelligence reporting, scheduled emails are a very useful tool. They let you send daily summaries directly in the body, or attach larger reports.

For these purposes I put together a Python BI-mailer class that makes it easy to manage everything you normally need to send: text, summary/KPI HTML tables, embedded images or charts, and larger report attachments.

I will put the code here, and in the next articles I will give examples of how to use this simple Python emailer for daily reporting chores.

With this script you are able to:
send an email
optionally with text or multiple tables in the body
optionally with attachments
optionally with the attachments embedded as images in the body.
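
As a rough sketch of the moving parts (not the BI-mailer class itself), a minimal send function using Python's standard smtplib and email modules could look like this; the SMTP host and addresses are placeholders:

import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.image import MIMEImage
from email.mime.application import MIMEApplication
from os.path import basename

# Placeholders: replace with your SMTP server and sender address.
SMTP_HOST = 'smtp.example.com'
SENDER = 'reports@example.com'

def send_report(recipients, subject, html_body, attachments=(), embedded_images=()):
    """Send an HTML report, optionally with attachments and embedded images."""
    msg = MIMEMultipart('related')
    msg['From'] = SENDER
    msg['To'] = ', '.join(recipients)
    msg['Subject'] = subject
    msg.attach(MIMEText(html_body, 'html'))

    # Embedded images are referenced from the HTML body as
    # <img src="cid:image0">, <img src="cid:image1">, and so on.
    for i, path in enumerate(embedded_images):
        with open(path, 'rb') as f:
            img = MIMEImage(f.read())
        img.add_header('Content-ID', '<image%d>' % i)
        msg.attach(img)

    # Larger reports travel as ordinary attachments.
    for path in attachments:
        with open(path, 'rb') as f:
            part = MIMEApplication(f.read(), Name=basename(path))
        part['Content-Disposition'] = 'attachment; filename="%s"' % basename(path)
        msg.attach(part)

    with smtplib.SMTP(SMTP_HOST) as server:
        server.sendmail(SENDER, list(recipients), msg.as_string())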

 

MySQL Federated database creation

MySQL does not support entire federated databases, but it does support federated tables, as long as the federated option is enabled in my.ini or my.cnf.

To create an entire federated DB in a way that is not manual and is easy to maintain (refreshing the table definitions whenever the source table structure changes), we can use the following procedure.

First, the caveats:

1. Federated tables do not use indexes in the usual way, so you may need to copy the data into a local table and index it there before expecting any performance.

2. When using a client such as HeidiSQL that runs SHOW TABLE STATUS, the connection between your DB and the source DB of the federated table remains open. This becomes an issue when the number of tables is significant while max_connections on the source DB is low. To get around it, periodically run FLUSH TABLES on the destination DB to close the connections. There are alternatives, but this is the simplest workaround (not necessarily the best).

Personally, to get around this issue I scheduled a MySQL event that flushes every federated table individually (not FLUSH TABLES, but FLUSH TABLE tablename; FLUSH TABLE tablename2; ...), to avoid blocking other queries. FLUSH TABLES flushes all tables, so it has to wait for running queries to finish, and any new queries have to wait for it to finish, ruling out concurrent queries while it runs.
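
A minimal sketch of such an event (table names, schedule and the requirement that the event scheduler is enabled are all assumptions to adapt to your setup):

-- requires event_scheduler=ON
DELIMITER //
CREATE EVENT flush_federated_tables
ON SCHEDULE EVERY 1 HOUR
DO
BEGIN
  -- flush each federated table on its own, so only that table is briefly affected
  FLUSH TABLE fed_orders;
  FLUSH TABLE fed_customers;
END//
DELIMITER ;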

1. Get a no-data mysqldump the same way you would get a normal dump, but with the -d switch for no data.
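
For example (host, user and schema names are placeholders):

mysqldump -d -h source-host -u repl_user -p source_db > nodata_dump.sql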

 

2. Convert the ENGINE part of each table definition into FEDERATED plus a CONNECTION string (here done with Python).
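
A minimal sketch of the idea in Python (not the original script; the connection string and file names are placeholders for your source server):

import re

# FEDERATED connection string template pointing at the source server (placeholder values)
SOURCE = "mysql://fed_user:fed_pass@source-host:3306/source_db/{table}"

current_table = None
converted = []
with open('nodata_dump.sql') as dump:
    for line in dump:
        m = re.match(r'CREATE TABLE `([^`]+)`', line)
        if m:
            current_table = m.group(1)
        # swap the original storage engine for FEDERATED plus a CONNECTION
        # string that points at the same table on the source server
        line = re.sub(r'ENGINE=\w+',
                      "ENGINE=FEDERATED CONNECTION='%s'" % SOURCE.format(table=current_table),
                      line)
        converted.append(line)

with open('federated_dump.sql', 'w') as out:
    out.writelines(converted)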

 

3. Load up your new dump
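
For example (DB name and credentials are placeholders):

mysql -u bi_user -p destination_db < federated_dump.sql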

 

You can run this procedure periodically or on demand to refresh the table definitions whenever the source definitions change.

MySQL Rank() function

UPDATE: On July 18 2018 the MySQL server team announced the introduction of window functions. This article was written before that, offering a workaround to gain some similar functionality. I have now updated the article with the current best-practice way to do this correctly, but still leave the old article below for reference.

As of MySQL 8, you can use window functions such as RANK() OVER (PARTITION BY country ORDER BY ...). Interestingly, you can also refer to the window by an alias and define it in a separate WINDOW clause (a section of its own, much like the WHERE clause), which allows reuse:
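
A hypothetical example (the sales table and its columns are made up for illustration):

SELECT
  country,
  customer,
  revenue,
  RANK() OVER w AS revenue_rank
FROM sales
WINDOW w AS (PARTITION BY country ORDER BY revenue DESC);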

 

Happy window aggregating!

OLD ARTICLE FROM 2014 BELOW:

A quick and dirty custom MySQL rank() function to emulate RANK() from PostgreSQL in MySQL. The same disclaimers apply as for the custom rownum() function. In this case I am using it to rank decimals, but feel free to change the input type.
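
The stored function wraps up the classic user-variable trick; shown inline with a hypothetical sales table, the underlying idea looks like this:

-- note: relying on user-variable evaluation order is undocumented behaviour;
-- on MySQL 8 prefer RANK() OVER (...) as shown above
SELECT
  customer,
  revenue,
  @rnk := IF(revenue = @prev, @rnk, @cnt + 1) AS revenue_rank,
  @cnt := @cnt + 1 AS cnt,
  @prev := revenue AS prev
FROM (SELECT customer, revenue FROM sales ORDER BY revenue DESC) s,
     (SELECT @rnk := 0, @cnt := 0, @prev := NULL) init;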

 

To invoke it, use rank(column_to_rank), such as below:
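
A hypothetical invocation (products and price are made-up names, and this assumes the custom rank() function exists):

SELECT price, rank(price) AS price_rank
FROM (SELECT price FROM products ORDER BY price DESC) p;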

output:

MySQL rownum() row number function

UPDATE: On July 18 2018 the MySQL server team announced the introduction of window functions. This article was written before that, offering a workaround to gain some similar functionality. I have now updated the article with the current best-practice way to do this correctly, but still leave the old article below for reference.

As of MySQL 8, you can use window functions such as ROW_NUMBER() OVER (PARTITION BY country ORDER BY ...). Interestingly, you can also refer to the window by an alias and define it in a separate WINDOW clause (a section of its own, much like the WHERE clause), which allows reuse:
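
A hypothetical example (the sales table and its columns are made up for illustration):

SELECT
  country,
  customer,
  revenue,
  ROW_NUMBER() OVER w AS rn
FROM sales
WINDOW w AS (PARTITION BY country ORDER BY revenue DESC);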

 

Happy window aggregating!

 

OLD ARTICLE FROM 2014 BELOW:

MySQL does not have a built-in row number function, which makes getting the row number an ugly hack – an ugly hack that needs to be repeated over and over, every time you need it.

I have taken the liberty of writing a rownum() function, that is just as bad, but at least it will keep your code clean.

DISCLAIMER:

  1. It does not work on older versions of MySQL.
  2. It is based on user variables that only reset when the time changes. If the function is invoked in two separate queries within the same microsecond (one millionth of a second), the counter will not reset and the output will not start counting from 1.
  3. If you run the function more than once in the same query, it will increment every time it is run, so possibly multiple times per row.

For the purposes of BI, I find the associated risks, when the function is used as intended, nonexistent.
For older versions of MySQL server, you can try finding another creative way to reset the counter for the row number.

 

Row number function
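
A sketch along these lines (user variables that reset whenever the statement timestamp changes; NOW(6) with fractional seconds requires MySQL 5.6+):

DELIMITER //
CREATE FUNCTION rownum() RETURNS INT
NOT DETERMINISTIC
NO SQL
BEGIN
  -- NOW(6) is constant for the duration of the calling statement, so the
  -- counter resets at most once per query (see disclaimer 2 above)
  IF @rownum_ts IS NULL OR @rownum_ts <> NOW(6) THEN
    SET @rownum_ts = NOW(6);
    SET @rownum = 0;
  END IF;
  SET @rownum = @rownum + 1;
  RETURN @rownum;
END//
DELIMITER ;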

 

To invoke it, use rownum(), such as below:
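
A hypothetical example (customers is a made-up table):

SELECT rownum() AS rn, customer_name
FROM customers;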

Good luck!

Get yesterday’s date in a windows batch file – correct version

Hi all,

If you use a Windows server and need yesterday's date for something like a download, you have probably used the popular script that can be found here and there. Unfortunately, that script is somewhat broken when it comes to getting the last day of the month. The logic is solid, but somewhere between the idea and the implementation, something was lost.

See below for the working, correct version:
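
A sketch of a simpler route (not the corrected pure-batch script itself): on servers where PowerShell is available, you can sidestep the date arithmetic entirely and let PowerShell compute yesterday's date, capturing the result in a batch variable:

@echo off
rem Let PowerShell do the date math; usebackq keeps the inner quotes intact
for /f "usebackq delims=" %%i in (`powershell -NoProfile -Command "(Get-Date).AddDays(-1).ToString('yyyy-MM-dd')"`) do set YESTERDAY=%%i
echo Yesterday was %YESTERDAY%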

 

 

Stream gzip with Ruby

Here is a quick overview of how to stream gzip with Ruby. I needed an OS-agnostic script to gzip a SQL dump, so I went with Ruby. Unfortunately, gzipping the whole file in one go makes Ruby raise an error when the file is too large to fit in RAM. The error looks like this:

So I decided to do a sort of stream. As I was not able to find anything ready-made, I combined basic examples from around the web: file streaming plus gzipping.

The script below does a SQL dump with date, gzips it in chunks with Ruby, and deletes the non-zipped file. (It is written for a Windows server, so change the system commands if moving to Linux.)

Additionally I wanted to check performance, so I also added timing.

Stream gzip with Ruby
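
A minimal sketch of the approach (paths, credentials, the dump command and the chunk size are placeholders):

require 'zlib'

# Placeholders - adjust paths and the dump command to your environment.
date      = Time.now.strftime('%Y-%m-%d')
dump_file = "C:/dumps/backup_#{date}.sql"
gz_file   = "#{dump_file}.gz"
chunk     = 1024 * 1024   # read/compress 1 MB at a time

start = Time.now

# 1. Create the SQL dump (Windows-style system call, as in the article)
system("mysqldump -u backup_user -p\"secret\" mydb > #{dump_file}")

# 2. Gzip it in chunks so the whole file never has to fit in RAM
Zlib::GzipWriter.open(gz_file) do |gz|
  File.open(dump_file, 'rb') do |f|
    while (data = f.read(chunk))
      gz.write(data)
    end
  end
end

# 3. Remove the uncompressed dump and report timing
File.delete(dump_file)
puts "Done in #{Time.now - start} seconds"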

There you go, an easy way to stream gzip with Ruby. Enjoy!

Using Cost Revenue Ratio to measure Campaign Efficiency

The Cost Revenue Ratio (CRR), also known as the efficiency ratio, is a way of comparing expenses to revenue. While the CRR is mainly used in banking, it is also an interesting key performance indicator for performance marketers.

The calculation is very straightforward: just divide the spend of your campaign by the revenue of that campaign. The result is a percentage telling you how much you have spent to generate one euro/dollar of revenue, meaning the lower the percentage, the more efficiently the campaign creates revenue. If your result is larger than 100%, your costs are higher than the revenue created and optimization of the campaign is urgently needed!

CRR = Spend/Revenue
CRR formula
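
For example, a campaign that spends 2,000 € and generates 10,000 € of revenue (made-up numbers) has a CRR of 2,000 / 10,000 = 20%, i.e. 20 cents spent for every euro of revenue.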

To use this metric for measuring your Google AdWords campaigns, make sure to assign a value to each conversion action. This enables the “Total conv. value” column in your campaign overview, which you will need to calculate the CRR.

The result could look like in the following example:

extract campaign report

Here you see an extract from a campaign performance report with the Cost Revenue Ratio in the last column: the cost of the respective campaign is divided by its total conversion value. If you do not have any CRR benchmarks from your finance department, it can be a good idea to use the brand campaign for orientation. That campaign will most likely reach the best cost revenue ratio in your account, so the rest of your campaigns should land somewhere between it and 100%; the closer to the brand campaign benchmark, the better.

Normalizing codes: MySQL alphanumeric extraction function

I wanted to share a SQL function I end up using quite frequently when dealing with user input data: normalization to alphanumeric characters. It can also be modified into a function that extracts only numeric values. If you do modify it that way, make sure it still returns VARCHAR, to cover numeric values beginning with 0. You can also allow other characters, such as the minus sign, if required.

Example use cases:

Matching producer codes:

Normalizing/matching phone numbers, for example:
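
A hypothetical example, using the alphanum() function defined below to match phone numbers regardless of formatting (table and column names are made up):

SELECT c.customer_id, l.lead_id
FROM customers c
JOIN leads l ON alphanum(c.phone) = alphanum(l.phone);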

 

function:
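
A sketch of such a function (looping over the string and keeping only alphanumeric characters; adjust the character class to taste):

DELIMITER //
CREATE FUNCTION alphanum(input VARCHAR(255)) RETURNS VARCHAR(255)
DETERMINISTIC NO SQL
BEGIN
  DECLARE i INT DEFAULT 1;
  DECLARE c CHAR(1);
  DECLARE result VARCHAR(255) DEFAULT '';
  WHILE i <= CHAR_LENGTH(input) DO
    SET c = SUBSTRING(input, i, 1);
    -- keep only letters and digits; add '-' or other characters here if needed
    IF c REGEXP '[0-9A-Za-z]' THEN
      SET result = CONCAT(result, c);
    END IF;
    SET i = i + 1;
  END WHILE;
  RETURN result;
END//
DELIMITER ;

-- example: SELECT alphanum('+49 (0) 171-234 56 78');  -- returns '4901712345678'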

 

KPI emails with Google Charts URL parameters

I recently created a job that calculates some KPIs and writes a txt file that gets picked up as the email body for daily delivery. It was a few lines of text with the date and KPI values, like the example below:

2014-01-16  New Customers: 7    Sold_Items: 15
2014-01-15   New Customers: 6    Sold_Items: 12
2014-01-14   New Customers: 5    Sold_Items: 6
2014-01-13   New Customers: 4    Sold_Items: 8
2014-01-12   New Customers: 3     Sold_Items: 8
2014-01-11   New Customers: 2     Sold_Items: 4
2014-01-10   New Customers: 1     Sold_Items: 3

It was the easiest approach to just concatenate the date, customers and items coming from my data source with some strings, but of course it was hard to read. There were several such blocks of text in the email, so while the data was there, it was hard to see a trend.

The SQL for it went something like:

SELECT concat_ws(' ', date, ' New Customers:', users, ' Sold_Items:', items)
from kpi_table
ORDER BY date DESC
LIMIT 7
INTO OUTFILE 'C:\\folder\\email_kpi.txt' FIELDS TERMINATED BY ';';

Easy, but far from the best way to deliver an email report.

To display this data properly, keeping in mind that the recipient has to read and understand it, it needs to be presented differently.

The best way is a visual aid, along with a structured format that lets the recipient work with the data, so I decided on a chart and a table. Luckily, Google Charts offers a simple way to put the parameters into a URL, which then returns the chart as an image. So I had a go at it and came up with the result below.
Note that the URL is not secured; on the other hand, this is not a problem, as you can only access the specific URL containing the data you put into it. In other words, if you don't have the data in the first place, you can't access it. It is not dynamic, so there is no possibility of future leaks.

 

Google Image chart for emailing, with parameters in url

As for the table, a simple HTML table will do. It allows you to copy-paste straight into spreadsheets and keep the formatting, and the simpler the better. Table formatting may vary and it will normally look pretty plain, but some email clients support inline CSS, so if you have the time, feel free to experiment.

date        New Customers  Sold Items
2014-01-26  15             30
2014-01-25  20             40
2014-01-24  49             80
2014-01-23  45             73
2014-01-22  22             40
2014-01-21  21             50
2014-01-20  18             33

What about the code to create this every day?

I use plain MySQL to generate the two lines as text.

 

 

for the table:
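
A sketch of the idea, using the same kpi_table columns as above (the exact markup is up to you):

SELECT CONCAT(
  '<table border="1"><tr><th>date</th><th>New Customers</th><th>Sold Items</th></tr>',
  GROUP_CONCAT(
    CONCAT('<tr><td>', date, '</td><td>', users, '</td><td>', items, '</td></tr>')
    ORDER BY date DESC SEPARATOR ''),
  '</table>') AS html_table
FROM (SELECT date, users, items FROM kpi_table ORDER BY date DESC LIMIT 7) t;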

for the chart:
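
And a sketch for the chart: concatenate the static chart parameters with the data series built by GROUP_CONCAT (only a subset of the parameters listed below is shown here):

SELECT CONCAT(
  'https://chart.googleapis.com/chart?cht=lc&chs=650x300&chds=0,100&chtt=Daily_Sales',
  '&chdl=Items_Sold|New_Customers',
  '&chd=t:', GROUP_CONCAT(items ORDER BY date SEPARATOR ','),
  '|', GROUP_CONCAT(users ORDER BY date SEPARATOR ',')) AS chart_url
FROM (SELECT date, users, items FROM kpi_table ORDER BY date DESC LIMIT 7) t;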

For the Google image chart it's a bit more work to get the vertical limit and the axes to update according to your maximum values, but otherwise it's pretty straightforward. The easiest way to figure out what you need is the Google chart playground at the link below. Alternatively, you can just change the parameters in the URL of the Google chart above.

https://developers.google.com/chart/image/docs/chart_playground?

My parameters were as follows:

cht=lc
chs=650x300
chd=t:30,40,80,73,40,50,33|15,20,49,45,22,21,18
chds=0,100
chxt=x,y
chxs=0,022222,11,0,lt|1,011111,10,1,lt
chdl=Items_Sold|New_Customers
chxl=0:|2014-01-20|2014-01-21|2014-01-22|2014-01-23|2014-01-24|2014-01-25|2014-01-26|1:|0|20|40|60|80|100
chco=044444,099999
chg=16.67,20
chm=N,004040,0,-1,11|N,005050,1,-1,11
chtt=Daily_Sales

You end up with just one line for each, perfect for shooting into an HTML email. Easy to read, looks good.

Easy html email

and the final result:

google charts email with html table

Stay tuned for more!

 

Cleaning noisy data: PostgreSQL mode and modal share aggregate functions

Calculating the mode and modal share in PostgreSQL has never been easier than just running these functions.

Some background:
I needed to calculate the aggregate mode and the frequency of that mode to be able to draw some conclusions about the stability of some particularly noisy data. Sample standard deviation can be very misleading for measuring data noise, as it’s a mix of amplitude of variation and frequency.

Imagine measuring the best temperature to drink wine based on what people would report.

Some would say that for a particular wine the temperature should be 10 Celsius, while many others will say 11-12 degrees. As we are talking about user input, someone may jokingly recommend drinking it at -20 Celsius or at 200 Celsius. The data cannot be trusted.

So let’s take the array recommended_temperature [8,10,10,12,11,10,10,12,10,11,-20,200,10,10]

Finding the right temperature to serve the wine at from this array can be tricky. Data is noisy and not to be trusted.

Total numbers: 14

Mean (average): 21.71429

Sample standard deviation: 51.95624

Sample variance: 2699.45055

Do any of the above numbers even come close? No. You would not serve the wine at an average of 22 Celsius, and the standard deviation is high, but that doesn’t tell you if the values are spread, or if an outlier is messing everything up.

What about the mode? The mode in our case is 10 (the most common value).
Sounds reasonable, wine tastes nice at 10 degrees. But how sure are we that this is the right temperature?
We need more information, mainly the modal share (the number of values equal to the mode divided by the total number of values).

For that, I put together a mode_count function in PostgreSQL that tells you how many values fall in the mode, so you can calculate it as a percentage of all values.
So now we have mode: 10; modal share: 0.5 (7/14).
We can say with 50% certainty that the appropriate temperature to serve the wine at, based on user recommendations, is 10. Add in the high standard deviation, and you will easily detect that there are strong outliers present in the data, giving you even more confidence in your mode value.

The function for mode, from the PostgreSQL manual:
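
A commonly circulated version (it may differ in detail from the exact snippet referenced) builds up an array and picks the most frequent element in the final function; note that recent PostgreSQL versions also ship a built-in mode() WITHIN GROUP (ORDER BY ...):

CREATE OR REPLACE FUNCTION _final_mode(anyarray)
  RETURNS anyelement AS
$$
  SELECT a
  FROM unnest($1) a
  GROUP BY 1
  ORDER BY COUNT(1) DESC, 1
  LIMIT 1;
$$ LANGUAGE sql IMMUTABLE;

CREATE AGGREGATE mode(anyelement) (
  SFUNC = array_append,
  STYPE = anyarray,
  FINALFUNC = _final_mode,
  INITCOND = '{}'
);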

and the custom mode count function:
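
A sketch of a matching mode_count aggregate: the same array-building state, with a final function that returns how many times the most frequent value occurs:

CREATE OR REPLACE FUNCTION _final_mode_count(anyarray)
  RETURNS bigint AS
$$
  SELECT COUNT(1)
  FROM unnest($1) a
  GROUP BY a
  ORDER BY COUNT(1) DESC
  LIMIT 1;
$$ LANGUAGE sql IMMUTABLE;

CREATE AGGREGATE mode_count(anyelement) (
  SFUNC = array_append,
  STYPE = anyarray,
  FINALFUNC = _final_mode_count,
  INITCOND = '{}'
);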

They are aggregate functions, so call them as:
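
A hypothetical example over a recommended_temperature(wine_id, temperature) table:

SELECT wine_id,
       mode(temperature) AS mode_temp,
       mode_count(temperature)::numeric / COUNT(*) AS modal_share
FROM recommended_temperature
GROUP BY wine_id;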

Hope you enjoy!