Python and SQL open source business intelligence stack

I’d like to cover here some basics about the common business intelligence stack of Python and SQL. Let’s see why this stack is so popular, and why it’s not yet the one and only gold standard.

Why Python for business intelligence?

Readability.

Most business intelligence professionals using Python come from all walks of life and have no formal programming education. This can lead to wildly inconsistent coding styles – less so in Python, though, since the language enforces indentation.
This doesn’t mean that someone with no idea what they are doing will produce good code or a good architecture. And regardless of how clean the code is, people can still do very obscure and roundabout things. What it does mean is that their code is very easy to read, which is the first step in understanding it.

Staffing.

Python is an easy-to-learn, widespread, and easy-to-use language. It’s much easier to find Python developers than developers with VBA knowledge, and it’s harder still to find developers who combine BI knowledge with other, more specialized languages.

Libraries.

For BI work, you need to extract/load data, transform data, and analyze data. Python can do it all: most advertising data producers offer Python libraries to access their APIs, Python can talk to any database you can think of, and it natively has great features for data munging. It can also be used for large parallel computations or statistical analysis – data science too.
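To make this concrete, here is a minimal extract-transform-load sketch with requests, pandas and sqlite3; the API URL and field names are invented for illustration, not taken from any real provider.

# A minimal extract-transform-load sketch with requests and pandas.
# The API URL and field names are made up for illustration.
import requests
import pandas as pd
import sqlite3

# Extract: pull a daily spend report from some (hypothetical) advertising API
response = requests.get("https://api.example-ads.com/v1/report?date=2016-01-01")
rows = response.json()["data"]          # assume a list of dicts

# Transform: clean types and aggregate spend per campaign with pandas
df = pd.DataFrame(rows)
df["spend"] = df["spend"].astype(float)
daily = df.groupby(["date", "campaign_id"], as_index=False)["spend"].sum()

# Load: write the result into a database for later analysis
with sqlite3.connect("bi.db") as conn:
    daily.to_sql("ad_spend_daily", conn, if_exists="append", index=False)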

Widespread usage/versatility.

This means that whatever problem you encounter, someone else has probably encountered it too. You will have a very easy time finding solutions or getting help for anything Python. While other languages might be better at specific things, Python is a jack of all trades, master of some.

Why not Python?

Spoon-fed.

Well, many BI developers stop there because the language is so versatile. However, this means they never get the opportunity to work with languages designed for large software projects, and they miss out on best practices and programming paradigms. Too easy sometimes means people stop learning.

Slow.

Python is slower than most other languages. In the context of working with data, this only becomes an issue when doing complex calculations on very large data sets.

Why SQL for business intelligence?

I will talk here about PostgreSQL for small data specifically, since it is the best open source database for analysis. It has better functionality than paid solutions like Microsoft’s, and equal or better performance and functionality compared to Oracle’s product. It is free, without licensing fees or vendor lock-in, and it is built by developers, for developers, with robustness and ease of use in mind. You can even run Python on Postgres 🙂

For big data, the leading SQL solution seems to be Amazon’s Postgres-like product, Redshift.

Ease of access:

For the sake of easy data access, SQL makes a lot of sense. There are plenty of tools that can run SQL against a database to generate dashboards, or that connect to it easily from Excel. And it is easy for analysts to learn.

Ease of data manipulation:

Most of the data you will have will not be single numbers or random sentences, but structured in tables. This means that instead of conventional algebra over single values, we use relational algebra to perform operations between whole data sets. This is where SQL shines, since it is designed specifically for these kinds of operations. Once you have the data in a database, you can leverage the power of this language to perform them very easily.
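As a toy illustration (made-up tables, run through Python’s sqlite3 for convenience), a join and an aggregation over whole tables take a single statement:

# Toy relational-algebra example: join two data sets and aggregate,
# using made-up tables in an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, country TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'DE'), (2, 'FR');
    INSERT INTO orders VALUES (10, 1, 99.0), (11, 1, 25.0), (12, 2, 60.0);
""")

# One statement expresses both the join and the aggregation over whole tables
query = """
    SELECT c.country, SUM(o.amount) AS revenue
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    GROUP BY c.country
"""
print(conn.execute(query).fetchall())   # [('DE', 124.0), ('FR', 60.0)] (order may vary)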

Staffing:

Every self-respecting BI professional knows at least a little SQL. Most good BI engineers know a lot of SQL. The particular flavor of SQL is not very relevant either, as the dialects are all very similar.

Why not SQL?

Poor options for architecture.

SQL is a query language, not a programming language. It’s meant to rearrange and compute over data, not to perform complex operations. You end up producing a lot of code, often WET (the opposite of DRY). The code ends up monolithic, and it is not easy to see at a glance what a query does. Basically, you cannot write easily maintainable code.

Lowers the bar.

SQL is so easy to learn that a lot of people end up being able to use it. However, that alone does not suffice for writing clean code, designing an ETL, a BI application, or a data warehouse. I’ve personally seen ridiculous projects that were a dependency hell of views on views on views. Sadly, knowing SQL says little about one’s ability as a professional to deliver a quality product, but the two often get confused in the world of BI.

I hope you enjoyed this article, and if you have any opinions on the topic do not hesitate to leave a comment.

Lifecycle stages of Business Intelligence solutions in startups

Business Intelligence in companies is a decision support system. Its implementation reflects the questions that need to be answered in the company at any given time. Many companies start with just a few spreadsheets populated with copy-pasted numbers. This allows them to understand how their business is evolving and where the next opportunities lie. As time goes on, companies usually go through several stages of Business Intelligence solutions.

1. The copy-paste spreadsheets

At the earliest stage in the company’s life, there is little automation of data extraction and reporting. Usually, the founder already has some profit and loss spreadsheets, and will ask the employees to fill in data relevant to their departments. This is when the team defines the first metrics. Unfortunately, the calculations are often incomplete due to missing data.

2. The VBA reports

First, let me state that I consider VBA+Excel reporting a disaster when it comes to data solutions. The reasons are simple: it is a programming paradigm that follows zero best practices. It’s programming, so it’s too hard for the average Joe to build on. It’s VBA, so it’s not made for data munging, versioning, collaborative work, easy file handling, etc. It works for a one-off project, but you will end up with too much code that is too hard to maintain for too little benefit.

This usually appears as a quick-and-dirty way of automating some reporting, coming from someone with a financial background. Finance folks like to have things neatly ordered in Excel, with custom functionality for particular reports. This works on a small scale, but it is not worth maintaining bad code with custom functionality and particular layouts for every report on everyone’s computer.

3. The Excel Hell

Natural evolution: Each department fills in some data sheets that feed into the managers’ reports. After a while, each department wants their own reports as well, so they start using the data sources from the other departments. Soon, they want new numbers and start changing the data sources to add more data. This inevitably breaks other dependent reports and creates pain for everyone.

Soon, the reports become data sources for new reports, in an attempt to avoid redundant maintenance. This only makes things worse, as now there are dependencies in many directions between many things. One small change can trigger catastrophic failure at the company level. Moreover, the knowledge required to maintain this system is so fragmented that it cannot be centralized.

4. The data bucket

After Excel hell, some companies fall back to what they know: data source files. To allow collaborative development, they put these files on a shared drive or a cloud storage solution. From here, the Excel-savvy can build their own reports.

This has the advantage of standardizing data sources. Moreover, it allows anyone to build their own reports based on those sources, instead of building on top of existing reports.

Besides scale, the first shortcoming is the manual effort involved. Additionally, the lack of maintainability (you cannot understand the logic behind a report at a glance) causes redundant reports to be built, which just adds to the work.

5. The data river

The company looks online for how to do better BI and receives misinformation from advertorials, ads, agencies and parrots (people who just repeat what they read). They end up buying some dashboard solution that promises seamless data integration and real-time stats. They get IT to write some SQL queries for the tool, and they connect to the other services (advertising, ticketing systems, etc.) with ease.

It now looks like all the data is available – but in a tool that can display data, not transform it into a useful format. This is the birth of the need for a data warehouse. Complex logic (and there is always complex logic in data munging) will need to be done in code.

Now we can see our data, but we cannot leverage it for the reporting we need.

6. The data store, or unstructured data warehouse

After struggling with keeping the data in files, or piping the data directly to reports via a tool, the need appears for a database that can store both raw data and processed data.

Of course it makes sense to combine the data in a SQL database so that all the other tools can pull data from it too. Most of the transformations will also happen in SQL: first of all, it is easy to learn and use, and secondly, it is a very powerful language for working with sets of data – tables.

This collection usually consists of everything that can be imported, and is usually not structured or grouped together by any particular logic. It’s simply a place to store the data to make it more accessible for analysis and further transformations. Often, much of the data comes from manually populated data sources.

Some argue they are too cool for SQL. But in the end SQL just works, precisely because of how limited and yet powerful the language is. It’s powerful because it performs all operations at the level of tables, columns, and rows. This allows you, for example, to perform a Cartesian join between two data sets with a single keyword. Think of all the loops you’d have to write (and run) in a programming language!
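A tiny sketch of that point, with toy data in an in-memory SQLite database: the Cartesian product is one CROSS JOIN in SQL versus explicit loops in Python.

# The Cartesian product of two tables: one CROSS JOIN keyword in SQL
# versus the nested loops you would write in a general-purpose language.
import sqlite3
from itertools import product

sizes = ["S", "M", "L"]
colors = ["red", "blue"]

# The loop version (itertools.product is just the nested loops, written for us)
combinations_py = list(product(sizes, colors))

# The same thing in SQL, with a single keyword
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sizes (size TEXT)")
conn.execute("CREATE TABLE colors (color TEXT)")
conn.executemany("INSERT INTO sizes VALUES (?)", [(s,) for s in sizes])
conn.executemany("INSERT INTO colors VALUES (?)", [(c,) for c in colors])
combinations_sql = conn.execute("SELECT size, color FROM sizes CROSS JOIN colors").fetchall()

print(sorted(combinations_py) == sorted(combinations_sql))  # True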

SQL has limitations because it only allows a restricted set of operations on tables. This is actually a huge boon, as it allows both good developers and bad developers to speak the same language. This means the code is unlikely to become too complex to understand.

7. The data warehouse

When a BI person wants to churn out a variety of reports, the benefits of a dimensional model become apparent. Typically, one adds this as a layer on top of the raw data. It makes it possible to map business processes at each grain or step, and thereby also enables easy creation of reports.

8. The self service reporting

Once the dimensional model is available, the team can create data cubes. This way they can easily make the data available to the rest of the company. Every stakeholder can filter the data cubes or drill down/roll up. With this functionality, a business analyst with access to the data can build complex reports and dashboards in minutes, usually in Excel or a dashboarding tool. An operational data consumer might notice an anomaly and be able to troubleshoot it by drilling down in the same window. This is the end goal for most companies.
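As a rough stand-in for what a cube gives you, here is a pandas pivot_table sketch of roll-up and drill-down; the fact table and its column names are made up.

# A rough stand-in for cube-style slicing: pivot a small fact table by
# two dimensions, then "drill down" by adding finer-grained dimensions.
# Column names and data are invented for illustration.
import pandas as pd

facts = pd.DataFrame({
    "month":   ["2016-01", "2016-01", "2016-02", "2016-02"],
    "country": ["DE", "FR", "DE", "FR"],
    "channel": ["search", "social", "search", "social"],
    "revenue": [100.0, 80.0, 120.0, 90.0],
})

# Roll-up: revenue per month
print(facts.pivot_table(values="revenue", index="month", aggfunc="sum"))

# Drill-down: revenue per month, split further by country and channel
print(facts.pivot_table(values="revenue", index="month",
                        columns=["country", "channel"], aggfunc="sum"))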

9. The advanced self service reporting

Drilling down predefined paths is not enough for complex analyses. Also, for the sake of agility, many data sources will not be fully integrated into the data warehouse (for example, a test of an ad network). This is where custom complex transformations come into play. These are often used to generate a report on a new data source, or to analyze particular details such as user behavior.

10. The data products: Automated decision making instead of BI as decision support.

Oftentimes, the decision based on data is logical and repeatable, so the data team creates algorithms to automate that decision. Take for example the decision of which products to recommend to each customer, or the adjustment of bids based on performance metrics.

The data team can create these data products with relative ease. Sometimes the team builds them on top of the dimensional model of the DWH. Often, due to performance and different SLA requirements, the products run independently of the data warehouse, consuming data from the source or integrated directly into an application.
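A deliberately naive sketch of such a rule-based data product, with made-up thresholds and fields, just to show the shape of the logic:

# A simple rule-based bid adjustment: raise bids on keywords with a good
# cost per acquisition, lower them on expensive ones.
# Thresholds and numbers are made up for illustration.
def adjust_bid(current_bid, cost, conversions, target_cpa=10.0):
    if conversions == 0:
        return round(current_bid * 0.9, 2)      # no conversions: back off
    cpa = cost / conversions
    if cpa < target_cpa:
        return round(current_bid * 1.1, 2)      # cheap conversions: bid up
    return round(current_bid * 0.95, 2)         # too expensive: bid down

print(adjust_bid(current_bid=1.00, cost=50.0, conversions=8))   # 1.1
print(adjust_bid(current_bid=1.00, cost=50.0, conversions=2))   # 0.95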

I hope you enjoyed this, and let me know in the comments what else you have encountered. Where do you find yourself on this scale?

Twitter ads API: gotchas when downloading report

I recently built an ETL to download ad reports from the Twitter ads API.

If you are about to do the same, take note of the following gotchas.

Include historic data in your download

First, you should download at least two weeks of history with every new daily load. The reason? Twitter updates historical ads data retroactively as they work out the actual numbers (removing bots, spam, etc.). This is common behavior for many ad APIs.
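A small sketch of what that looks like in practice: recompute a rolling window of dates with every daily load (the 14-day lookback here is my own choice, not a Twitter requirement).

# Re-request a rolling window of history with each daily load, so that
# Twitter's retroactive corrections end up in your data.
from datetime import date, timedelta

LOOKBACK_DAYS = 14

end_date = date.today()
start_date = end_date - timedelta(days=LOOKBACK_DAYS)
print(start_date, end_date)   # request stats for this whole range every day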

Provide datetime

Second, because of their API design, you NEED to provide a start datetime and an end datetime for whole days. Notice it requires a datetime, not a date, even though you clearly request daily aggregation! Additionally, you need to provide the midnight of your account’s timezone expressed in UTC, or the API call will return an error. (Don’t ask me why you have to handle this on your end when the timezone info is coming from the API…)

You can retrieve the timezone of each account from the endpoint '/2/accounts/your_account_id', in body['data']['timezone'] of the response. Then you calculate the start and end date for your report request, and adjust both of them to the right UTC offset based on the account timezone. You can achieve this easily in Python with the pytz module. Please note that if your start–end interval contains a daylight saving time change, you cannot apply the same offset to both dates, as the offset from UTC changes. The best approach is to calculate the offset separately for each of the start and end dates, based on the timezone.
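Here is one way I’d sketch that with pytz; the account timezone value is just an example of what the endpoint can return.

# Turn plain report dates into the UTC datetimes the API expects:
# midnight in the account's timezone, converted to UTC, computed separately
# for each date so a DST change in between is handled correctly.
from datetime import datetime
import pytz

def to_utc_midnight(day, account_timezone):
    tz = pytz.timezone(account_timezone)
    local_midnight = tz.localize(datetime(day.year, day.month, day.day))
    return local_midnight.astimezone(pytz.utc)

account_tz = "America/New_York"     # example value, as returned in body['data']['timezone']
start = to_utc_midnight(datetime(2016, 3, 10), account_tz)
end = to_utc_midnight(datetime(2016, 3, 20), account_tz)   # DST starts in between
print(start.isoformat(), end.isoformat())
# 2016-03-10T05:00:00+00:00 2016-03-20T04:00:00+00:00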

Keep watching for gotchas in the Twitter ads API

There are a few other gotchas/undocumented limitations in their API, but you will find those easier to address correctly (for example: only request stats for 10 tweet_ids at a time).
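A small helper for the tweet_ids limitation might look like this (the ids here are made up):

# Request stats in chunks of at most 10 tweet_ids per call,
# since larger batches get rejected.
def chunks(items, size=10):
    for i in range(0, len(items), size):
        yield items[i:i + size]

tweet_ids = [str(n) for n in range(1, 26)]        # 25 made-up ids
for batch in chunks(tweet_ids):
    print(",".join(batch))                        # pass each batch to the stats endpoint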

Yes, it stinks, but at least the Twitter ads API is still better than the Bing ads API.

Good luck!