The data flow breakdown

For easy reference, I'd like to define the typical data flow encountered in business intelligence by its steps: data integration, dimensional modelling, and data consumption. These are the steps that make first- and third-party data available for analysis, and it is this data flow that delivers an ROI when investing in BI.

1. Data flow part one: Data Integration

Integration measures how many of the first- and third-party data sources have been unified in a single place. This single place is usually called an integration layer, and it can be any easy-to-access data store. Typically it takes the form of files on a network location (FTP, S3, GCS) or a relational database.

This part is usually pretty messy: the external sources are diverse, and each one often requires its own access method to extract the relevant data. The result is a heterogeneous environment, best handled with a widely used scripting language that can 'glue' the systems together. It is good practice here to collect shared utilities so your code stays DRY; using common utils for loading data sources will prove very time-saving as the number of sources grows.
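
As a concrete illustration, here is a minimal sketch of what such shared loading utilities might look like. It is only a sketch under assumptions: the function names, file path, and API URL are hypothetical, not a prescribed design.

```python
import csv
import json
import urllib.request


def load_csv(path, delimiter=","):
    """Read a delimited file into a list of dicts - one entry point for all file-based sources."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f, delimiter=delimiter))


def load_json_api(url, headers=None):
    """Fetch a JSON endpoint - one entry point for all API-based sources."""
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Each source only declares *what* to load; the shared utils handle *how*.
SOURCES = {
    "crm_contacts": lambda: load_csv("/data/crm/contacts.csv"),
    "ad_spend": lambda: load_json_api("https://api.example.com/spend"),
}

if __name__ == "__main__":
    for name, loader in SOURCES.items():
        rows = loader()
        print(f"{name}: loaded {len(rows)} records")
```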

2. Data flow part two: Dimensional Modelling

Dimensional modelling, as a step in the data flow, is the transformation of the integrated data into a format suitable for analysis, usually a snowflake schema. The goal is to arrange the dimension and fact tables so that metrics are unified and therefore the same across all reports.

Some small reporting set-ups omit dimensional modelling, but it is useful in enabling self-service reporting.

Ultimately, the goal of dimensional modelling is to provide fact and dimension tables containing correct data, which you will later use for reporting and analysis. Ideally, this data lives in a SQL database for easy access.
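
To make this concrete, here is a minimal sketch of a dimension table, a fact table, and a metric computed from them. The table and column names are purely illustrative, and SQLite stands in for whatever SQL database you use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One dimension table describing products, one fact table holding measurable events.
conn.executescript("""
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT
);

CREATE TABLE fact_sales (
    date_key    TEXT,
    product_key INTEGER REFERENCES dim_product(product_key),
    units_sold  INTEGER,
    revenue     REAL
);
""")

# Every report computes revenue from the same fact table, so the metric is
# defined once and stays consistent across reports.
revenue_by_category = conn.execute("""
    SELECT d.category, SUM(f.revenue) AS revenue
    FROM fact_sales f
    JOIN dim_product d USING (product_key)
    GROUP BY d.category
""").fetchall()
```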

3. Data flow part three: Data consumption

Here, I want to avoid using the terms reporting or analysis. In my opinion, reporting is just one of many channels that a data consumer or analyst uses to access data. Data consumption can take the form of an analysis that requires ad hoc query writing and statistical work, or, say, a seasonal product analysis based on data from a product performance dashboard.
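
As an illustration of such ad hoc consumption, here is a minimal sketch of a seasonal analysis in pandas. The warehouse file and the fact_sales table are the hypothetical ones from the earlier sketch, not a required setup.

```python
import sqlite3

import pandas as pd

# Hypothetical warehouse connection - point this at your own database.
conn = sqlite3.connect("warehouse.db")

sales = pd.read_sql(
    "SELECT date_key, product_key, revenue FROM fact_sales",
    conn,
    parse_dates=["date_key"],
)

# Ad hoc seasonality check: average revenue per calendar month across all years.
seasonality = (
    sales.assign(month=sales["date_key"].dt.month)
         .groupby("month")["revenue"]
         .mean()
)
print(seasonality)
```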

The end goal of data consumption is to boost business metrics through informed decision making. Consequently this is where investing in data actually pays off, so make sure the needs of the decision-makers are met.

Python and SQL open source business intelligence stack

I’d like to cover here some basics about the common business intelligence stack of Python and SQL. Let’s see why this stack is so popular, and why it’s not yet the one and only gold standard.

Why Python for business intelligence?

Readability.

Business intelligence professionals using Python come from all walks of life, and many do not have a formal programming education. This can lead to wildly inconsistent coding styles – less so in Python, though, since the language enforces indentation.
This doesn't mean that someone with no idea will produce good code or good architecture, and regardless of how clean the code is, people can still do very obscure, roundabout, and complex things. What it does mean is that their code is very easy to read, which is the first step in understanding it.

Staffing.

Python is an easy-to-learn, widespread, and easy-to-use language. It's much easier to find Python developers than developers with VBA knowledge, and it's harder still to find developers who combine BI knowledge with other, more specific languages.

Libraries.

For BI work, you need to extract and load data, transform data, and analyze data. Python can do all of this: most advertising data providers offer Python libraries to access their APIs, Python can access any database you can think of, it has great native features for data munging, and it can be used for large parallel computations, statistical analysis, and data science.
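
To show what this looks like in practice, here is a minimal sketch of an extract-transform-load step in Python. The API endpoint, the column names, and the connection string are assumptions made for illustration; requests, pandas, and SQLAlchemy are common choices from the ecosystem, not the only ones.

```python
import pandas as pd
import requests
from sqlalchemy import create_engine

# Extract: hypothetical advertising API returning JSON rows of campaign stats.
response = requests.get("https://api.example-ads.com/v1/campaign_stats", timeout=30)
stats = pd.DataFrame(response.json())

# Transform: typical munging - type casting and a derived metric.
stats["date"] = pd.to_datetime(stats["date"])
stats["cpc"] = stats["cost"] / stats["clicks"]

# Load: push the result into the integration layer (here a Postgres database).
engine = create_engine("postgresql://user:password@localhost/dwh")
stats.to_sql("ad_campaign_stats", engine, if_exists="replace", index=False)
```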

Widespread usage/versatility.

This means that whatever problem you encounter, someone else has probably encountered it too. You will have a very easy time finding solutions or getting help for anything Python. While other languages might be better at specific things, Python is a jack of all trades and a master of some.

Why not Python?

Spoon-fed.

Because the language is so versatile, many BI developers never move beyond it. This means they miss the opportunity to work with languages designed for large software projects, along with the best practices and programming paradigms that come with them. Too easy sometimes means people stop learning.

Slow.

Python is slower than most other languages. In the context of working with data, this only becomes an issue when doing complex calculations on very large data sets.

Why SQL for business intelligence?

I will talk here about PostgreSQL specifically for small data, since it is the best open source database for analysis. It has better functionality than paid solutions like Microsoft's, and equal or better performance and functionality compared to Oracle's product. It is free, without licensing fees or vendor lock-in, and it is built by developers, for developers, with robustness and ease of use in mind. You can even run Python on Postgres 🙂

For big data, the leading SQL solution seems to be the Postgres-like product from Amazon, Redshift.

Ease of access:

For the sake of easy data access, SQL makes a lot of sense. There are plenty of tools that can run SQL against a database to generate dashboards or connect it easily to Excel. And it is easy for analysts to learn.

Ease of data manipulation:

Most of the data you have will not be single numbers or random sentences, but structured tables. This means that instead of conventional algebra, we use relational algebra to perform operations between whole data sets rather than single values. This is where SQL shines, since it is designed specifically for these kinds of operations. Once you have the data in a database, you can leverage the power of the language to perform these operations very easily.
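
As a small, hedged illustration of what set-level operations mean in practice: one declarative statement below finds all customers without orders, with no hand-written loops. The table names are made up, and SQLite simply stands in for any relational database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, name TEXT);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
""")

# A set operation between whole tables: customers who have never placed an order.
inactive_customers = conn.execute("""
    SELECT customer_id FROM customers
    EXCEPT
    SELECT customer_id FROM orders
""").fetchall()
```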

Staffing:

Every self-respecting BI professional knows at least a little SQL, and most good BI engineers know a lot of it. The particular flavor of SQL is not very relevant, as the dialects are very similar.

Why not SQL?

Poor options for architecture.

SQL is a query language, not a programming language. It is meant for rearranging and computing data, not for complex operations. You end up producing a lot of code, often WET (the opposite of DRY). The code ends up monolithic, and it is not easy to see at a glance what a query does. Basically, you cannot write easily maintainable code.

Lowers the bar.

SQL is so easy to learn that a lot of people end up being able to use it. However, that alone does not suffice for writing clean code or designing an ETL, a BI application, or a data warehouse. I've personally seen ridiculous projects that were a dependency hell of views on views on views. Sadly, knowing SQL says little about one's ability to deliver a quality product, but the two often get confused in the world of BI.

I hope you enjoyed this article, and if you have any opinions on the topic do not hesitate to leave a comment.

Lifecycle stages of Business Intelligence solutions in startups

Business Intelligence in companies is a decision support system. Its implementation is a reflection of the questions that need to be answered in the company at any given time. Many companies start with just a few spreadsheets of data, populated with copy-pasted numbers. This allows them to understand how their business is evolving and where the next opportunities lie. As time goes on, companies usually go through several stages of Business Intelligence solutions.

1. The copy paste spreadsheets

At the earliest stage in the company's life, there is little automation of data extraction and reporting. Usually, the founder already has some profit-and-loss spreadsheets and asks employees to fill in the data relevant to their departments. This is when the team defines the first metrics. Unfortunately, the calculations are often incomplete due to missing data.

2. The VBA reports

First, let me state that I consider VBA+excel reporting to be a disaster when it comes to data solutions. The reasons are simple: it is a programming paradigm that follows zero best practices. It’s programming, so it’s too hard for the average Joe to build on. It’s VBA, so it’s not made for data munging, versioning, collaborative working, easy file handling etc. It works for a one-off project, but you will end up with too much code that is too hard to maintain for too little benefit.

This usually appears as a quick-and-dirty way of automating some reporting, coming from someone with a financial background. It happens because finance folks like to have things neatly ordered in Excel, with custom functionality for particular reports. It works on a small scale, but it is not worth maintaining bad code with custom functionality and particular layouts for all the reports on everyone's computers.

3. The Excel Hell

Natural evolution: Each department fills in some data sheets that feed into the managers’ reports. After a while, each department wants their own reports as well, so they start using the data sources from the other departments. Soon, they want new numbers and start changing the data sources to add more data. This inevitably breaks other dependent reports and creates pain for everyone.

Soon, the reports become data sources for new reports, in an attempt to avoid redundant maintenance. This only makes things worse, as there are now dependencies in many directions between many things, and one small change can trigger catastrophic failure at company level. Moreover, the knowledge needed to maintain this system is so fragmented that it cannot be centralized.

4. The data bucket

After Excel hell, some companies fall back to what they know: data source files. To allow collaborative development, they put these files on a shared drive or cloud storage solution. From here, the Excel-savvy can build their own reports.

It has the advantage of standardizing data sources. Moreover, it allows anyone to build their own reports based on those sources, instead of building on top of existing reports.

Besides scale, the first shortcoming is the manual effort involved. Additionally, the lack of maintainability (you cannot understand the logic behind a report at a glance) causes redundant reports to be built, which just adds to the work.

5. The data river

The company looks online for how to do better BI and receives misinformation from advertorials, ads, agencies, and parrots (people who just repeat what they read). They end up buying some dashboard solution that promises seamless data integration and real-time stats. They get IT to write some SQL queries for the tool, and they connect it to the other services (advertising, ticketing systems, etc.) with ease.

It now looks like all the data is available – but only in a tool that can display data, not transform it into a useful format. This is the birth of the need for a data warehouse: complex logic (and there is always complex logic in data munging) will need to be implemented in code.

Now we can see our data, but we cannot leverage it for the reporting we need.

6. The data store, or unstructured data warehouse

After struggling with keeping the data in files, or piping the data directly into reports via a tool, the need for a database that can store both raw and processed data appears.

Of course, it makes sense to combine the data in a SQL database so that all the other tools can pull data from it too. Most of the transformations will also happen in SQL: first, because it is easy to learn and use, and second, because it is a very powerful language for working with sets of data – tables.

This collection usually consists of everything that can be imported and is usually not structured or grouped by any particular logic. It's simply a place to store the data to make it more accessible for analysis and further transformations. Often, much of the data comes from manually populated data sources.

Some argue they are too cool for SQL, but in the end SQL just works because of how limited and yet powerful the language is. It's powerful because it performs all operations at the level of tables, columns, and rows. This allows you, for example, to perform a Cartesian join between two data sets with a single keyword. Think of all the loops you'd have to write (and run) in a programming language!
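
As a small illustration, here is that Cartesian join expressed with a single CROSS JOIN keyword; the tables are made up and SQLite stands in for any SQL database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE report_dates (date_key TEXT);
CREATE TABLE products (product_name TEXT);
INSERT INTO report_dates VALUES ('2019-01-01'), ('2019-01-02');
INSERT INTO products VALUES ('basic'), ('premium');
""")

# One keyword produces every date/product combination - the nested loops
# you would otherwise write yourself in application code.
scaffold = conn.execute(
    "SELECT date_key, product_name FROM report_dates CROSS JOIN products"
).fetchall()
print(scaffold)  # 2 dates x 2 products = 4 rows
```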

SQL has limitations because it only allows a restricted set of operations on tables. This is actually a huge boon, as it allows both good developers and bad developers to speak the same language. This means the code is unlikely to become too complex to understand.

7. The data warehouse

When a BI person wants to churn out a variety of reports, the benefits of a dimensional model become apparent. Typically, one adds this as a layer on top of the raw data. This makes it possible to map business processes at each grain or step, and thereby enables easy creation of reports.

8. The self-service reporting

Once the dimensional model is available, the team can create data cubes and thereby make the data easily available to the rest of the company. Every stakeholder can filter the data cubes or drill down and roll up. With this functionality, a business analyst with access to the data can build complex reports and dashboards in minutes, usually in Excel or a dashboarding tool. An operational data consumer might notice an anomaly and be able to troubleshoot it by drilling down in the same window. This is the end goal for most companies.
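
For intuition, here is a minimal sketch of roll-up and drill-down on a tiny, made-up extract of a sales cube, with pandas standing in for the cube tool.

```python
import pandas as pd

# Made-up extract from a sales cube: one row per month and country.
sales = pd.DataFrame({
    "month":   ["2019-01", "2019-01", "2019-02", "2019-02"],
    "country": ["DE", "FR", "DE", "FR"],
    "revenue": [1200.0, 800.0, 1500.0, 700.0],
})

# Roll up: aggregate revenue per month across all countries.
rolled_up = sales.pivot_table(index="month", values="revenue", aggfunc="sum")

# Drill down: break the same metric out by country again.
drilled_down = sales.pivot_table(index="month", columns="country",
                                 values="revenue", aggfunc="sum")
print(rolled_up)
print(drilled_down)
```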

9. The advanced self-service reporting

Drilling down predefined paths is not enough for complex analyses. Also, for the sake of agility, many data sources will not be fully integrated into the data warehouse (for example, a test of an ad network). This is where custom, complex transformations come into play. These are often used to generate a report from a new data source, or to analyze particular details such as user behavior.

10. The data products: Automated decision making instead of BI as decision support.

Oftentimes, the decision based on data is logical and repeatable, so the data team creates algorithms to automate that decision. Take, for example, the decision of which products to recommend to each customer, or the adjustment of bids based on performance metrics.

The data team can create these data products with relative ease. Sometimes the team builds them on top of the dimensional model of the DWH. Often, due to performance and different SLA requirements, the products run independently of the data warehouse. They consume data from the source or are integrated in an application.
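
To give a flavor of such a data product, here is a minimal, hypothetical sketch of an automated bid-adjustment rule; the function name, thresholds, and parameters are made up for illustration.

```python
def adjust_bid(current_bid, target_cpa, actual_cpa, max_change=0.2):
    """Nudge a bid toward the target cost-per-acquisition, capped per run."""
    if actual_cpa <= 0:
        return current_bid  # no conversions yet - leave the bid alone
    change = (target_cpa / actual_cpa) - 1.0
    change = max(-max_change, min(max_change, change))
    return round(current_bid * (1.0 + change), 2)


# A campaign converting at 12.50 against a 10.00 target gets its bid lowered by 20%.
print(adjust_bid(current_bid=1.00, target_cpa=10.00, actual_cpa=12.50))  # 0.8
```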

Hope you enjoyed and let me know in the comments what else you have encountered. And where do you find yourself on this scale?


Scheduled emails with Python for BI

As part of reporting in business intelligence, scheduled emails are a very useful tool. They allow you to send various daily summaries, or attach larger reports.

For these purposes, I put together a Python BI-mailer class that allows easy management of everything you might normally need to send: text, summary/KPI HTML tables, embedded images or charts, and larger attached reports.

I will put the code here, and in the next articles I will give examples of how to use this easy Python emailer to accomplish daily reporting chores.

With this script you are able to:
send an email
optionally with text or multiple tables in body
optionally with attachments
optionally use the attachments as embedded images.
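
The original class isn't reproduced in this excerpt, so here is a minimal sketch of what such a mailer might look like, using only Python's standard smtplib and email modules. The class name BiMailer, its parameters, and the SMTP settings are illustrative assumptions, not the original code.

```python
import smtplib
from email.mime.application import MIMEApplication
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


class BiMailer:
    """Send BI emails with an HTML body, attachments, and embedded images."""

    def __init__(self, smtp_host, smtp_port, user, password, sender):
        self.smtp_host = smtp_host
        self.smtp_port = smtp_port
        self.user = user
        self.password = password
        self.sender = sender

    def send(self, recipients, subject, html_body="", attachments=None, embed_images=None):
        msg = MIMEMultipart("related")
        msg["From"] = self.sender
        msg["To"] = ", ".join(recipients)
        msg["Subject"] = subject

        # Text and KPI tables go into the HTML body; embedded images are
        # referenced from the body via <img src="cid:..."> tags.
        msg.attach(MIMEText(html_body, "html"))

        for path in attachments or []:
            with open(path, "rb") as f:
                part = MIMEApplication(f.read())
            part.add_header("Content-Disposition", "attachment", filename=path.split("/")[-1])
            msg.attach(part)

        for cid, path in (embed_images or {}).items():
            with open(path, "rb") as f:
                image = MIMEImage(f.read())
            image.add_header("Content-ID", f"<{cid}>")
            msg.attach(image)

        with smtplib.SMTP(self.smtp_host, self.smtp_port) as server:
            server.starttls()
            server.login(self.user, self.password)
            server.sendmail(self.sender, recipients, msg.as_string())
```

For the KPI tables mentioned above, a pandas DataFrame can be rendered straight into the HTML body with its to_html() method before passing it to send().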