Learnings from my first 1.5 years freelancing in data

0. Ground Zero:  Learning the Mindset

 Your cost of opportunity is an employed job.

Your worst case scenario is getting a job. Basically, if you fail, you’re still ending up in a pretty decent place. Such, it’s worth taking a risk and trying freelancing if you think it might be for you.

 

Do it now, not later.

From my observation of people saying the above, if you do it later, you have <20% conversion rate in the first year.  If you do it now, you have 100%. Stop delaying. Your cost of opportunity as a freelancer is your hourly rate, which would probably translate into 2x your salary (unless you make 200k).

Once you have become a freelancer, you will look back and ask “was it worth paying 5K/month to stay in that job instead of having extra freedom and responsibility?” The answer will be a definite no.

I know it’s hard to see it that way without living it first, but try to put yourself in the freelancer’s shoes.

 

Value:

Don’t doubt your value or have impostor syndrome. Good workers are worth their weight in gold. A typical freelancer can be much cheaper than employees because he’s bringing in the right skills at the right time, slashing through company’s cost of delay. Also, many freelancers become better professionals due to the various exposure, and they end up being worth it even as salaried employees at those rates, because typically the top 1% of the workforce in salary has very different salaries anyway. Value based pricing in data, in the context of organisational change that brings a 20-40% revenue growth, would be worth millions for many companies. Your expertise is key in making that happen and your hourly rate is peanuts in comparison. Keep in mind that world – scale asian companies are sometimes paying millions per year as a salary for the data scientist they want – all is good as long as the ROI is fantastic, and it typically is for those companies with the scale to make it matter.

 

How to start:

Starting well is key to the first year survival. If you want to keep your transition to freelancing easy and without risk for  large gaps in your monthly pay,  consider the following tips:

1. Starting: Be able to take a freelance job.

The vast majority of freelancing gigs get filled relatively fast (<1 month), because freelancers tend to have short projects and are able to take on work more often. Such, it is ideal that you are able to start work on a new project on short notice, ideally <2-4 weeks.
There’s a few ways to be in that situation:

a) Be in an easy to leave job

If you just started a new job, are in probation period and it turns our to be different than what you wanted, you can look for a freelance project and jump ship quickly. If you’re stale in a job with 3 months quitting time, quit, start looking for a project and if you cannot find one and cannot afford the unemployed time, just take another job (maybe look in parallel).

b) Get fired.

Myself, I was in a situation where I could not work because the futility of the work was so clear, and the requirements so hazy, that I was unable to produce anything that wasn’t nonsense. I had a boss that dictated ‘Sql Only’ because that was the best he could do. He personally checked the code on pull request of each of his data team. He re-wrote everything himself, and only left the bugs to the others to worry about. He believed he invented ‘truncsert’ (truncate table and insert from a view in the db), a form of non incremental data loading where you still have to maintain ddls in database and deal with migrations (the worst of both worlds). I was in a team mixed  competences including brown nosers that kept their mouth shut and collected bonuses and promotions. I wouldn’t be a part of that and I got fired. Getting fired gave me 2 ‘paid’ leave months. It gave me 1 month to recover from the experience, after which I started looking for freelance projects. My girlfriend at the time (now my wife) supported me to stick to what is important and not jump into another job. Two months from being fired, I was already starting my first project.

I said ‘paid’ in quotes when referring to the unemployment, because in hindsight the unemployment help was too small to be relevant, and it was rather a mental comfort more than any significant financial contribution. Frankly, if you can afford it, don’t wait to get fired. Just quit and save yourself the hassle. In the long run, your cost of waiting is higher than the unemployment benefits.

c) Not have a job in the first place.

I have met quite a few freelancers that started as freelancers during university and never took a employed job, and have been freelancing since (years). Wherever you are in your life, if you are not currently employed, that’s already a step in the right direction. Get a project instead.

2. Find a freelance job.

Agencies and freelancers are your best friends. Connect on Linkedin to agencies (try big ones like Hays or DIS AG for Germany rather than those typical recruiters), and they will let you know every time a large client is looking for work. Work with agencies that specialize in freelancing, the others are optimized for jobs and don’t have freelance opportunities (or get them late). Ideally, do not work with amateurs – If they are flaky or scammy, avoid them at all cost.

Freelancers are best met face to face. Look them up on Linkedin and ask them for a breakfast/lunch/dinner. Be prepared to travel – it’s you that wants something. Form a network, get remembered, and you will have work coming in. Not just data freelancers, but any freelancers from intersecting domains. Data? Don’t forget tech, finance and marketing freelancers.

Hiring managers are also a typical customer. CTOs, founders, team leads. These guys sometimes need something done well from the first try, so they bring in expertise to build and hire. Sometimes, they just want a replacement to fill a gap left by someone that changed position, quit or went on parental leave/sabbaticals.

Be flexible with your requirements and don’t be too fussy as long as it’s in your field of expertise/specialty. It’s best to start off OK rather than not start off at all. Even OK freelancing is better than employment.

3. Do a freelance job.

Once you found a job, interview as usual, come to work as usual. You’re basically an employee, so you are expected to perform at the level of seniority you were hired for. Whatever you do, you’ll be fine as long as you have been honest about your ability.

Get your accounting sorted. If you’re into this stuff, do it yourself, else, an accountant is TOTALLY worth it. Your time and peace of mind for correct accounting is worth more than the accounting fees, and good advice will bring you further financially. A skilled tax accountant will often save you significantly more than you pay them. If you’re a Berliner, I recommend Suat Göydeniz or Firma.de.  Definitely find someone you are comfortable with and that answers your questions quickly.

My original choice for accounting was felix1, which turned out to be horrible. Felix1 overall took me 5 months to get my money back from, only after reaching out to a C level of theirs on linkedin to ask if these are their business practices or if I fell through the gaps… repeatedly. They were unable to start performing the service I paid for 2 months after first call and payment. I almost lost a client due to them. Don’t compromise on a decent accountant.

Set aside money for taxes. Find some way to approximate how much income tax you will have to pay in real-time (I use a google sheet with the brackets in a formula) and make sure you can afford the year end or tax pre payments. As a rule of thumb, if you save 50% of what you bill for taxes and health insurance, you should be fine.

Learning 1:

There is always more work. It’s like in a job – every day there’s something that needs done. Finished your main project goal? Your client will often find more uses for you.

If you’re hired by a team, they will give you as much work or nice to have things to do as they have budget for. Why not? It’s always nice to have a senior helping hand.

If you’re hired by a business owner, you’re typically in for a lasting relationship as long as your ROI is positive. In data speak, this is either when the client is big enough that your output optimizations are  worth your pay, or until they get data-mature to the point where most things are already optimized.

Learning 2: Vacation is different.

You know the joke about freelancers not having employer paid medical leave? “Freelancers don’t get sick.”
Of course you do, cos you’re human. And it’s OK to have unpaid medical leave because you deserve the best care you can afford: you are your best workhorse and it is only through good health that you are able to perform. Don’t mistreat yourself for money. Besides, you don’t have to feel bad about missing work – your client is cost- free for this time, so it’s even more relaxing than paid sick leave.

Sick leave aside, don’t end up calculating your vacation’s cost of opportunity. 40 hours/week *100/h*0.65 net rate = 2600 in your pocket and that’s not actually relevant to you enjoying your life. Remember why you are freelancing and imagine what happens if you advance 5 years into the future and you’ve done nothing else in your life but work. Don’t forget, your goal is to live the way you want, not to earn bank points. Take a week off if you feel the need to connect with your inner self, learn something new, spend time with your loved ones, follow a hobby, find divinity in the ordinary or go fishing.

You’re the youngest you’ll ever be, right now. Don’t sell that.

Learning 3: How to fail.

Freelancing and entrepreneurship are not the same.

Ask any data (or tech) freelancer if they would ever take a full time job again. They will tell you Definitely NOT! Whatever reasons they have (family, time, money, learning, culture, life), taking a job would be detrimental in almost every way. You’d halve your pay, let your skills get rusty, let company politics get to you, get stuck with all the poop, miss out on vacations and life, surrender control over your time, etc etc. Same recipe, different taste.

Except…

Except for the ones that are struggling or who struggled and sank. How do you sink? From my observations there’s two easy ways:

a. Charge too little.

Discount 50% for remote. Don’t charge toilet breaks. Don’t charge travel expenses. Don’t charge client calls. Don’t charge travel time for in person meetings.

This is a surefire way to make your effective rate lower than what you’s make in a job, but with much more extra hassle and stress. Don’t do this to yourself. Keep your opportunity cost in mind.

b. Building a product upfront.

If you believe you can sell  product, don’t build it upfront. Find a client and build it for them, while charging them for the work.

This way, your worst case scenario is if you cannot find a client. Such, it’s better to not find a client and lose the time spent looking for a client, than to not find a client and lose also the product building time. Your best case scenario? Still better at the end of the build with a client, since you have your payments early.

Learning 4. About going to Asia

I have heard from 2 clients so far that they had another freelancer but they went to Asia (Thailand, Vietnam) and never came back. You can imagine my deja vu the second time I heard this. When I told this story, I heard another about a freelancer (not data this time) that went to Vietnam while working remotely, and decided to stay and open a bar. Apologies if I am getting any of the 3 mixed up.

Regardless, you can go to Asia, but keep in mind rates are lower – so if you’re gonna take a pay cut, rather take a vacation and enjoy travelling.

Go to Asia, or wherever you wanna travel to, and enjoy.

Learning 5: The ideal client

The ideal client is a different one depending on what you are trying to achieve. In a perfect world, the perfect client is someone who is not price sensitive (you don’t want to argue about discounting a train ride to the office), who always buys your work but at the same time doesn’t depend on you.

However, while you’re not in control of what clients exist on the market, you are in control of how you develop your relationship.

Maintain a healthy long term relationship, without becoming a crutch for your client. You want a client with whom you are honest, and consult them on what’s best for them, even if it isn’t in your direct interest.

A smart client can understand the value you are able to bring, and will always have a place for you, or a recommendation. A successful freelancer is someone who is competitive in the workforce (read: a professional that doesn’t stop learning), and such workers are often rare and add more value than the average worker. Such, a smart client will have no issue paying for a freelancer rate either until staffing, or for the duration of a project (and often after if budget is there). Of course, know your strengths and weaknesses and tackle the projects you can take with your client, and don’t jump on things you cannot do well.

A smart freelancer will manage expectation, to keep their client informed and able to take good decisions regarding ROI.

Learning 6:  Work ethics

Having met various freelancers, I can confidently say most of them are nice and have good work ethics. In fact, it’s more common that freelancers will bill a little less fussy about the details (read: charge less than they should).

However, there are bad apples out there. Every forest has its stumps. Don’t be one of them.

Your work ethic should be solid. Be always honest with the client regarding your relationship, bill fairly, and don’t double book. Why? Because without values, all you have is a racket.

What if you are reselling code that took you hours to do the first time around? There’s no rule for this. I have seen the following:

a) Bill nothing extra, just bill for the time.

I am currently doing this but am having second thoughts, as I feel it’s counter productive to bill 2h for implementing something that might take a week to build.

b) Bill implementation + fraction (up to full) of the hours it took to first build.

Agencies do this. They fill in a config and suddenly hundreds of lines of code are  hard at work in your codebase.

c) Fixed price project

Here, I would stay away unless you know your client – you might end up with scope creep and those 50h budgeted that were supposed to be 20 for implementation  suddenly become 100.

Learning 7: Hourly or daily? Fixed price?

I bill hourly, rounded down to the nearest 10min increment, with the exception of short projects where I round up to the hour because of overhead and cost of switching.

I would advise to go for hourly billing. Why? Because sometimes work days are over 8h. More often, the work is finished after 7:30h for the day, and warming a seat for 30min is counter productive for both you and your client – so just call it a day then. I found I work an average just under 8h, if I there’s no pressure,  and just under 9h on the heavy load days/weeks. Another freelancer recently told me they switched from daily to hourly, and is now billing 20% more. Others told me how important it is to be able to come in 2h late when you have some other life appointments, without worrying how to make up for the time. Hourly = bill what’s fair.

The pros for daily rate is that depending on your client, you might get peppered with small requests. Such, charging half day or a full day for something that takes 10min but blocks you form other stuff is fair – but you can do this with hourly too, as long as you are upfront about it. Or, you could charge small clients on a daily rate and large ones hourly.

Fixed price? I feel this is a gamble. Usually those that advocate for fixed price, inflate the average price by something like 5x to give themselves plenty of wiggle room and high profits if accepted. Otherwise, they would be doing hourly and save themselves the headaches of scoping.  As a client, you should also be aware that any time estimate you receive does not take into account blockers on your side, since they are unknown to the external.

 

Data validation with airflow

Data validation with airflow can be a simple way to do some data quality checks without any overhead. Here I will briefly cover the topic of doing data checks or data quality tests when importing user-input data (like when integrating other data management systems such as CRMs or when taking some mapping tables as input).

Data validation methods

1. Data validation

Data validation refers to verifying that what you think about the data is true, such as that cost is a number and not a string, that a date is a valid date,  that the entries are unique, that values are not left unfilled.

It’s ideal that you do not only validate the data (that it fits expected conditions/formats/data types /primary keys etc) but also validate the referential integrity.

2. Referential validation

Referential validation is about validating that the references to or from the data you imported are valid.

If you import some entities (orders, products, users) from some data management system (CRM, CMS), make sure that those entities exist in your database. For example, make sure that the companies you import from salesforce or other crm also exist in your company dimension table.

3. Anomaly detection

This is about detecting potential data issues or events underlined in the data. It often takes the form of a monitoring system for data quality that alerts on unexpected changes.

Data validation with airflow

The anomaly detection is usually done on streaming data and does not make sense to run from Airflow, which is a workflow engine meant for running batch jobs. However, for data quality checks, airflow can work perfectly.
For your data validation, loading your data into a table with constraints (unique, not null etc) will produce a pretty meaningful message on fail. You can simply add the data producer to the on-fail emailing, and they will get the message on fail.

For referential validation, an error from the foreign keys would likely be insufficient for providing enough information to the data owner to fix the issue. Such, I prefer to use a query to select the offending records, and then send the output to the data producer. I wrap the sql in a python method that asserts output is empty, and create a task on airflow in the relevant DAG. If output exists, airflow will send the failure message to the data owner with the relevant offending keys.

How do you do your data checks? Bonus points if your solution makes use of existing components without overhead.

Easy ETL, or “friends don’t let friends write loading code”

Easy ETL is what every tool promises, but most do not deliver. Over the last years there has been a strong shift in BI towards the use of python. If previously any scripting language would do, nowadays it’s almost impossible to manage without python specifically.

Take the ever more common open source tech stack of
SQL for storage and analysis
Python as glue language
Airflow (python) as workflow engine
Airflow (python)+ custom utils (any scripting language) as framework.

You can reduce the complexity by deciding to stick with python as far as possible with regards to the utils. It makes a lot of sense as you can just import the utils directly in the job (DAG) definition file to call and parametrize your methods with ease.

Having worked on multiple BI data warehousing projects over the last years, the most common mistake I see made is underestimating the cost of SQL maintenance, and the most common difficulty is staffing with enough workers to maintain and extend the large codebases.

The cost of maintaining sql throughout its lifecycle is usually estimated at 3-4x the time it takes to create it in the first place. The truth is, for the most vital components of your transformation, the cost of maintenance is closer to 10x the cost of original development. You will touch this code again and again every time there is a source data change, addition etc. Add a new column? Easy! Modify your data loading scripts, your DM creation scripts, your reporting scripts to include it, your data dictionary, and the frontend. Change a column? not so easy anymore.

So we are already familiar with the hard way: Write code to do the transformations. When the schema changes, add or modify the code.

So what is the easy way? Write code to write code to do the transformations. Many of the transformation steps are repetitive, so if you were to standardize things, you could easily re-use the same few parametrized operations. It’s called don’t repeat yourself or ‘DRY’, and it’s a basic concept for producing decent, maintainable code. You know, what you SHOULD be doing instead of copy pasting all that sql. But while many of us know our code is ‘WET’, it is sometimes hard to abstract and standardize. So how do you do it?

Standardize.

When it comes to loading data into a table, incrementally or not, there are only a few ways to do this. Ideally, stick with the basics and re-use the code. If most of your operations are of one type, and a new one might run faster if done differently, consider that writing new loading logic and maintaining it 10 times over is likely more expensive than the drawbacks of your existing solution.

Abstract.

Split your jobs into subroutines. You will quickly identify that there are many common parts, that can be abstracted out and written as reusable methods in some shared utils. You can then construct the job from those blocks, having only the job-specific code to manage in any particular job.

Take the following examples:

A job that pulls some data from an api:
1. Download from api to csv
2. load csv in database
3. merge increment

A job that loads data from one rdbms to the dwh:
1. Download from db.
2. load csv in database
3. merge increment

A job that creates some reports:
1. Run some sql
2. Run some code
3. Send report

A job that monitors your data
1. Run some sql
2. Check if normal
3. Send report if abnormal

A job that delivers some segments into your CRM tool.
1. Run some sql
2. Write to file
3. Push to api.

Notice how many of the steps are closely shared with other jobs? what if we abstracted as much of the comon stuff as we could? We would only need to write the components that we do not already have.

A job that pulls some data from an api:
1. Download from api to csv

A job that loads data from one rdbms to the dwh:
1. Make a list of tables to copy

A job that creates some reports:
1. Write report SQL

A job that monitors your data
1. write some sql
2. Check if normal

A job that delivers some segments into your CRM tool.
1. Run some sql
2. Push to api.

Notice we now only have 7/15 parts of code to deal with. Reducing complexity in this manner not only reduces development time, but also makes it easier to maintain in the future. Want to stop loading to Postgres and load to Redshift instead? Just change the loading method once. Or, want to stop worrying about staging tables? You can just create them on the fly based on your target table and drop them when you are done.

Pareto principle, KISS, lean, or whatever you prefer.

You will do 80% of the work with 20% of the code. Do not try to solve every potential problem at once. Start with something that solves the majority of your worries.

You do not need to have the best code running from the start. Solve the problems as well as you can without adding complexity. While your shared methods might not always be the best suited for each case, they are better than 5x the code.

A few examples:

For loading data, if you are using postgres, mysql or other rdbms, the fastest way to load is through the copy command. Write a python method with file and table as parameter. Now, you can use this method as method(file, table, args = defaults) every time you need to load some data, without writing the copy statement every time. For engines like redshift, your method will additionally need to copy the file to s3 first and parametrize the copy with s3 credentials.

For incremental loading, write methods for your common operations. If you want to do an increment merge(redshift manual link), it is easy to automate as merge_increment(source, target, keys = [])

For indexing: all your join keys and columns with typical heavy usage have pretty standard naming. id, sk, date, from, to, ts, and so on. Why change indices every time you make a change when you can just use standardised naming to your advantage? Using system tables, you are able to identify the tablename, column and datatype for the index based on your naming convention, and index accordingly in an automated manner.

I hope this helps some of you in your data delivery journey. How do you keep your complexity low?

ETL vs ELT: It’s all about maintenance!

 

When designing a job, you will have to decide at some point between ETL vs ELT. This boils down to the question if you should perform the transformation to dimensional model in SQL. Or if you want to do it in a programming language.

Why, you ask?

What does ETL vs ELT mean for you practically?

Well, ETL means “Extract – Transform – Load”. So it is about extracting data from a source and then transforming it to a dimensional model. In the last step you load to the data warehouse for analysis.

ELT on the other hand stands for – you guessed it – “Extract – Load – Transform”. It is about extracting data and then loading it to a staging database. And from there transforming it to the dimensional model.

The difference is that the transformation step occurs before loading to a database (ETL) vs after (ELT). In theory, the data is loaded before transformation in order to assist with ease of access. The transformation takes place from the loaded staging database to the data warehouse dimensional model. In practice, the fundamental decision between ETL vs ELT is the one between using SQL or a programming language. Because once the data is in a staging database, the transformation will usually happen using complex SQL.

ETL vs ELT: my recommendation

I am going to assume you are using a tech stack of one or more programming languages, and SQL. In this case, what you need to decide is if you want to write the complex transformations in a programming language or if you want to use a query language. In my experience, it’s easier to maintain jobs where you do a little more programming and write less SQL.  This happens because SQL does not support testing natively, and you cannot refactor it, so you have to rebuild instead.

Write less SQL

Write less SQL

Let me start by saying I like SQL, and I believe it is fantastic in its ease of use. It’s meant for rearranging and aggregating data, making it easy to create reports from a dimensional model. I do however believe we should not misuse it, and simply write less SQL.

Now, there are two types of SQL code in BI: Simple SQL and Complex SQL.

Simple SQL

Simple SQL is the type of SQL that makes reports from a dimensional model – it seldom has subqueries, and all it does is some joining/grouping/aggregating, as all the numbers are available in fact tables in ready-to-use form. It’s easy enough that even some business users can easily understand it and possibly maintain it.

Simple SQL is fantastic because it allows easy operations on prepared data sets. It is easy to read and to maintain.

Complex SQL

Complex SQL is the SQL that makes fact tables, or the SQL of data products(data transformations for data applications). This SQL often has many subqueries, multiple steps of transformation, sometimes data cleaning and enriching.

Complex SQL is hard to maintain, and even harder to extend or modify. It often ends up worse and worse after each iteration. It’s next to impossible to refactor and you usually end up having to rewrite it after a while.

Complex SQL is cause for high rate of bugs, high maintenance and low bus factors. Since SQL is not easy to test (vs tests in a programming language), it is unlikely that it will ever have good test coverage, making it dangerous to work with. Subtle logic bugs will be well hidden, and due to the monolithic nature of the applications written in SQL it is very hard to maintain and debug.

I suggest to avoid Complex SQL entirely. The key concept for this argument is the ETL vs the ELT. Transform your data before it hits the database, and do it with easily testable and dry code.

Many decide to do the complex transformations in SQL mainly because it’s quick and easy. However, in the long run, this quick and easy SQL code costs you quite a bit of maintenance. And will get worse and worse with each iteration, as it is not possible to easily refactor it.

Instead of the query language, use a programing language with tests. This simply is the better approach if you desire iteration speed and accuracy.

Write less SQL, use a programming language.

That’s right, write less SQL. Use a programming language and dry your code. Only use SQL in easy or temporary transformations, and use something like Python when it comes to doing the transformations. There are limitations to this approach, but they are minor in comparison to the improvement it brings to the BI development process. Make a leap of faith and write less SQL!

The data flow breakdown

Data FlowFor easy reference, I’d like to define the typical data flow encountered in business intelligence by its steps: Data integration, dimensional modelling, and data consumption. Basically these are the typical flows that make first and third party data available for analysis. Most of all it’s this data flow that brings a ROI when investing into BI.

1. Data flow part one: Data Integration

Integration measures how many of the first and third party data sources have been unified in a single place. This single place is usually called an integration layer. It can be any easy-to-access data store. Typically it would take the form of files on a network location (FTP, S3, GCS) or a relational database.

Usually this part is pretty messy in that the external sources are diverse and different. Often, each source requires its own access method to extract relevant data. As a result you get a very messy environment. Ideally handle it by a widely used scripting language to ‘glue’ systems together. It is a good practice here to use some form of utils to dry your code.Using common utils for loading data sources will prove verz time/saving when loading data.

2. Data flow part two: Dimensional Modelling

Dimensional modelling as a step in the data flow is the transformation of integration data to a format suitable for analysis. Usually in the form of a snowflake schema. The goal of this is to arrange the dimensions and fact tables such that metrics are unified.  Thus they are the same across reports.

Some small reporting set-ups omit dimensional modelling, but it is useful in enabling self service reporting.

Ultimately the end goal of dimensional modelling is to provide fact and dimension tables containing correct data. Later on, you will use this data for reporting and analysis. Ideally locate it in a SQL database for easy access.

3. Data flow part three: Data consumption

Here, I want to avoid using the terms reporting or analysis. In my opinion reporting is just one of many channels that a data consumer or analyst uses to access data. Such data consumption can take the form of performing an analysis that requires ad hoc query writing and statistical analysis. Or say the analyst has to perform a  seasonal product analysis based on data from a product performance dashboard.

The end goal of data consumption is to boost business metrics through informed decision making. Consequently this is where investing in data actually pays off, so make sure the needs of the decision-makers are met.

Python and SQL open source business intelligence stack

I’d like to cover here some basics about the common business intelligence stack of Python and SQL. Let’s see why this stack is so popular, and why it’s not yet the one and only gold standard.

Python Logo & SQL Logo

Why Python for business intelligence?

Readability.

Most business intelligence professionals using Python come from all walks of life and do not have formal programming education. This can lead to incredibly random ways of coding – not for Python though, as Python requires indentation.
This doesn’t mean that someone who has no idea will produce good code or a good architecture. And regardless of how clean the code, people can do very obscure and roundabout complex things. What it does mean that it is very easy to read their code, which is the first step in understanding it.

Staffing.

Python is an easy to learn, widespread, and easy to use language. It’s much easier to find Python developers than developers with VBA knowledge. Additionally, it’s even more difficult to find developers with BI knowledge and knowledge in other specific languages.

Libraries.

For BI work, you need to extract/load data, transform data and analyze data. Python can do all – most advertising  data producers offer Python libraries to access their APIs. Python can access any database you can think of. Python natively has great features for data munging. Python can be used to perform large parallel computations or statistical analysis. Data science too.

Widespread usage/versatility.

This means that whatever problem you are encountering, someone else probably encountered it too. You will have a very easy time to find solutions or get help for anything Python. While other languages might be better at specific things, Python is a jack of all trades, master of some.

Why not Python?

Spoon-fed.

Well, many BI developers stop there because the language is so versatile. However, this means they will not have the opportunity to work with languages designed for large software projects. Also they will miss out on best practices and programming paradigms. Too easy sometimes means people stop learning.

Slow.

Python is relatively slower than most other languages. In the context of working with data, this only appears as an issue when doing complex calculations on very large data sets.

Why SQL for business intelligence?

I will talk here about PostgreSQL for small data specifically, since it is the best open source database for analysis. It has better functionality that paid solutions like Microsoft’s, and equal or better performance and functionality to Oracle’s product. For free, without licensing fees or vendor locks, it is built by developers, for developers, with robustness and ease of use in mind. You can even run Python on Postgres 🙂

For big data, the leading SQL solution seems to be the postgres-like product from Amazon, Redshift.

Ease of access:

For the sake of easy data access, SQL makes a lot of sense. There are plenty of tools that can run SQL against a database to generate dashboards or can connect it easily in excel. And it is easy to learn for analysts.

Ease of data manipulation:

Most of the data you will have will not be single numbers or random sentences, but structured in tables. This means that instead of conventional algebra, we use relational algebra, to perform operations between data sets instead of single values. This is where SQL shines, since it is designed for these kinds of operations specifically. Once you have the data in a database, you can leverage the power of this language to very easily perform these operations.

Staffing:

Every self-respecting BI professional known at least a little SQL. Most good BI engineers know a lot of SQL. Also, particular type of SQL is not very relevant, as the flavors are very similar.

Why not SQL?

Poor options for architecture.

SQL is a query language, not a programming language. It’s meant to be used to rearrange and compute data, and not to do complex operations. You end up producing a lot of code, often wet (opposite of DRY). The code ends up being monolithic, and it is not easy to see at a glance what a query does. Basically, you cannot write easily maintainable code.

Lowers the bar.

SQL is so easy to learn, that a lot of people end up being able to use it. However, this does not suffice for writing clean code, designing an ETL, a BI application, or a data warehouse. I’ve personally seen ridiculous projects that were a dependency hell of views on views on views . Sadly, knowing SQL is not enough to say about one’s ability as a professional to deliver a quality product, but the two often get confused in the world of BI.

I hope you enjoyed this article, and if you have any opinions on the topic do not hesitate to leave a comment.

Lifecycle stages of Business Intelligence solutions in startups

Man showing business graph on wood table

Business Intelligence in companies is a decision support system. Its implementation is a reflection of the questions that need to be answered in the company at any given time. Many companies start with just a few spreadsheets with data. They populate them with copy pasted numbers. This allows them to understand how their business is evolving and where the next opportunities lie. As time goes on, the companies usually go through several stages of Business Intelligence solutions.

1. The copy paste spreadsheets

At the earliest stage in the company’s life, there is little automation of data extraction and reporting. Usually, the founder already has some profit and loss spreadsheets. Then he will ask the employees to fill in data relevant to their departments. This is when the team defines the first metrics. Unfortunately oftentimes the calculations are not complete due to missing data.

2. The VBA reports

First, let me state that I consider VBA+excel reporting to be a disaster when it comes to data solutions. The reasons are simple: it is a programming paradigm that follows zero best practices. It’s programming, so it’s too hard for the average Joe to build on. It’s VBA, so it’s not made for data munging, versioning, collaborative working, easy file handling etc. It works for a one-off project, but you will end up with too much code that is too hard to maintain for too little benefit.

This usually appears as a quick-and-dirty way of automating some reporting coming from someone with financial background. This happens because finance folks like to have things neatly ordered in Excel, with custom functionality for particular reports. This works on a small scale, but it is not worth maintaining bad code with custom functionality and particular layouts for all the reports on everyone’s computers.

3. The Excel Hell

Natural evolution: Each department fills in some data sheets that feed into the managers’ reports. After a while, each department wants their own reports as well, so they start using the data sources from the other departments. Soon, they want new numbers and start changing the data sources to add more data. This inevitably breaks other dependent reports and creates pain for everyone.

Soon, the reports become data sources for new reports, to try to avoid redundant maintenance. This only makes things worse, as now there are dependencies in many directions between many things. One small change could trigger catastrophic failure on company level. Moreover all the knowledge to maintain this system is so fragmented that it is not possible to centralize.

4. The data bucket

After excel hell, some companies fall back to what they know: Data source files. To allow collaborative development, they put these files on a shared drive or cloud storage solution. From here, the excel-savvy can build their own reports.

It has the advantage of standardizing data sources. Moreover it allows anyone to build their own reports based on those sources, instead of people building on top of existing reports.

Besides scale, the first shortcoming is the manual effort involved. Additionally, the lack of maintainability (cannot at a glace understand the logic behind a report) causes redundant reports to be build, which just adds to the work.

5. The data river

The company looks online to how to do better BI and receives misinformation from advertorials, ads, agencies and parrots (people who just repeat what they read). They end up buying some dashboard solution that promises seamless data integration and real time stats. They get IT to write some SQL queries for the tool, and they connect to the other services (advertising, ticketing systems etc) with ease.

It now looks like all the data is available. In a tool that displays, but cannot transform data to a useful format. This is the birth of the need for a data warehouse. Complex logic (and there is always complex logic in data munging) will need to be done by coding.

Now we can see our data, but we cannot leverage it for the reporting we need.

6. The data store, or unstructured data warehouse

After having struggled with having the data in files, or by piping the data directly to reports via a tool, the need for a database that can store both raw data and processed data appears.

Of course it makes sense to combine the data in a SQL database to allow all the other tools to pull data from it too. Also most of the transformations will happen in SQL. First of all it is easy to learn and use. And secondly it is a  very powerful language for working with sets of data  – tables.

This collection usually consists of everything that can be imported and is usually not structured or grouped together by a particular logic. It’s simply a place to store the data to make it  more accessible for analysis and further transformations. Often, much of the data is coming from manually populated data sources.

Some argue they are too cool for SQL. But in the end SQL just works because of how limited and powerful the language is. It’s powerful because it performs all operations on a table or column/row level. This allows for example to perform a Cartesian join between 2 data sets with a single keyword. Think of all the loops you’d write (run) in a programming language!

SQL has limitations because it only allows a restricted set of operations on tables. This is actually a huge boon, as it allows both good developers and bad developers to speak the same language. This means the code is unlikely to become too complex to understand.

7. The data warehouse

When a BI person wants to churn out a variety of reports, the benefits of a dimensional model become apparent. Typically, one adds this as a layer on top of the raw data. This makes the mapping of business processes at each grain or step possible. And thereby also enables easy creation of reports.

8. The self service reporting

Once the dimensional model is available, the team can create data cubes. This way they can make the data available easily to the rest of the company. Every stakeholder can filter the data cubes or drill down/roll up. With this functionality, a business analyst with access to the data can build complex reports and dashboards in minutes. The tool containing the data would usually be Excel or Dashboard tools. An operational data consumer might notice an anomaly and be able to troubleshoot it by drilling down in the same window. This is the end goal for most companies.

9. The advanced self service reporting

Drilling down predefined paths is not enough for complex analyses. Also, for the sake of agility many data sources will not be fully integrated into the data warehouse. (Example: a test of an ad network.) This is where custom complex transformations come into place. These are often to generate a report of a new data source, or to analyze particular details such as user behavior.

10. The data products: Automated decision making instead of BI as decision support.

Often times, the decision based on data is logical and repeatable. Therefore the data team creates algorithms to automate that decision. Take for example the decision of which products you should recommend to each customer, or the adjustment of bids based on performance metrics.

The data team can create these data products with relative ease. Sometimes the team builds them on top of the dimensional model of the DWH. Often, due to performance and different SLA requirements, the products run independently of the data warehouse. They consume data from the source or are integrated in an application.

Hope you enjoyed and let me know in the comments what else you have encountered. And where do you find yourself on this scale?

 

 

 

Twitter ads API: gotchas when downloading report

I recently built an ETL to download ad reports from the Twitter ads API.

If you are about to do the same, take note of the following gotcha.

Include historic data in your download

First, you should download at least 2+ weeks of history with every new daily load. The reason behind this? Twitter ads data will be updated historically by Twitter as they work out the actual numbers (removing bots, spam etc). This is common behavior for many ad APIs.

Provide datetime

Second, because of their API design, you NEED to provide start datetime and end datetime for whole days. Notice it requires datetime, and not date, even though you clearly request daily aggregation! Additionally, you need to provide the midnight of your account’s timezone in UTC time, or the API call will return an error. (Don’t ask me why you handle this on your end when the timezone info is coming from the API…)

You can retrieve the timezone of each account from the endpoint '/2/accounts/your_account_id', the body['data']['timezone'] of the response. Then you calculate your start and end date for your report request, and adjust both of them to the right UTC offset based on the account timezone. You can achieve this easily in Python with pytz module. Please note that if your start-end interval contains a daylight savings time change, you cannot apply the same offset to both dates, as offset from UTC changes. The best approach is to calculate this offset based on the timezone for each of the start and end dates.

Keep watching for gotchas in the Twitter ads API

There are a few other gotchas/undocumented limitations in their API, but you will find it easier to address those correctly. (Example: only request stats for 10 tweet_ids at a time).

Yes, it stinks, but at least Twitter ads API is still better than Bing ads API.

Good luck!

Example MySQL to Gmail script

This is a usage example for the MySQL to HTML table combined with the BI mailer. We will make use of the two previous scripts to easily create a scheduled email sending from a MySQL database through a Gmail address with minimal effort.

The idea behind it was to be able to create a new email with only the base building blocks: the MySQL query for data source, subject/recipients as envelope/destination.

The advantage of this combination is the necessity of only 3(!) lines of code for a new email, allowing you to leverage email as an easy and flexible distribution channel for BI data.

In the example below we also add a greeting along with a relevant subject line, which raises our grand total to 4 easily readable lines of code to send an email from MySQL to Gmail. The connection string I used for the query is an open MySQL database with DNA info (this should go in your sql_to_html file).

The email contains a random sample of DNA shortened to 64 characters length. It takes a MySQL query and sends an HTML table via Gmail.

Result:
MySQL to Gmail result

While this example is Gmail specific, it can be easily changed to use any email service. Keep in mind that even when using Gmail you should be able to reach any other email, and if you choose to avoid Gmail altogether, you can just change the bimailer.py to use any other email server.