Choosing Your Analysis Database

There are a lot of databases out there. Many of them work well for their users and were well-chosen for their specific analysis use case. But there’s one antipattern that frustrates analysts again and again: MySQL.

This post is an attempt to do some good in the world. If you’re choosing a database for analysis purposes and you stumble across this post, read on and learn why MySQL is the wrong choice.

1. No Window Functions

Window functions are one of the greatest tools in an analyst’s tool belt. Their superpower is their flexibility in letting you aggregate across groupings without restructuring your query.

Let’s take an example: Daily revenue with day-over-day deltas. Here’s the SQL:

select dt, revenue,
  (revenue - rev_yesterday) / rev_yesterday as daily_delta
from (
  select
    date(created_at) as dt,
    sum(price) as revenue,
    lag(sum(price), 1) over (order by date(created_at)) as rev_yesterday
  from purchases
  group by 1
) t

This simple query is enabled by the lag window function, which lets us compute yesterday’s revenue inline in the query. Altogether, it gives us a very nice revenue graph, complete with growth rate, seen here:

Revenue graph with growth chart

Let’s do the same computation with MySQL:

select dt, revenue,
  (revenue - rev_yesterday) / rev_yesterday as daily_delta
from (
  select
    date(purchases_today.created_at) as dt,
    sum(purchases_today.price) as revenue,
    sum(purchases_yesterday.price) as rev_yesterday
  from purchases purchases_today
  join purchases purchases_yesterday
    on datediff(
         date(purchases_today.created_at),
         date(purchases_yesterday.created_at)
       ) = 1
  group by 1
) t

This query is both cumbersome and very expensive. The self-join of a potentially large table like purchases is pretty bad. On top of that, we have to compute the date of every single created_at twice, and then do a datediff on every pair! All for a simple daily delta.
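For readers who want to try the window-function version end to end, SQLite (3.25 and later) also implements lag, so the same shape of query runs with nothing but Python’s standard library. The purchases rows below are made-up sample data:

```python
import sqlite3

# Hypothetical in-memory stand-in for the purchases table.
conn = sqlite3.connect(":memory:")
conn.execute("create table purchases (created_at text, price real)")
conn.executemany(
    "insert into purchases values (?, ?)",
    [
        ("2023-01-01 09:00", 10.0),
        ("2023-01-01 12:00", 20.0),
        ("2023-01-02 10:00", 45.0),
        ("2023-01-03 11:00", 90.0),
    ],
)

# Same structure as the Postgres query: lag() fetches the previous
# day's aggregate without any self-join.
rows = conn.execute("""
    select dt, revenue,
           (revenue - rev_yesterday) / rev_yesterday as daily_delta
    from (
        select date(created_at) as dt,
               sum(price) as revenue,
               lag(sum(price), 1) over (order by date(created_at)) as rev_yesterday
        from purchases
        group by 1
    ) t
""").fetchall()

for dt, revenue, delta in rows:
    print(dt, revenue, delta)
```

Note that the first day’s delta is NULL, since lag has no prior row to look back to.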

2. No Set-Returning Functions

In Postgres, a set-returning function is a function that returns a table which you can join to the rest of your query. These functions are useful in many ways, but for a simple example, let’s look at a common function: generate_series.

Let’s say we’ve only had purchases on three of the days in the last week. If we want to graph purchases per day, a simple group-and-count will give these results:

Purchases per day

The empty days are not showing up at all! To get meaningful results, we have to find a way to get zeroes on the other days. We do that by joining to a list of dates created by generate_series:

select d, coalesce(sum(price), 0)
from generate_series(
    date(now() - interval '7 day'),
    date(now()),
    '1 day'
  ) d
  left join purchases on date(purchases.created_at) = d
group by 1

As we can see, d becomes a table of dates that is left-joined to purchases. Here are the new results in a pretty graph:

Purchases by day

This time, MySQL has no in-query workaround, clunky or otherwise: it’s simply impossible to generate the list of dates as part of the query. Instead, many analysts resort to maintaining a helper table filled with numbers or dates, updated nightly by a script.
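To make the date-spine idea concrete, here is a runnable sketch using SQLite via Python’s standard library. A recursive CTE plays the role of generate_series, so that empty days join to nothing and coalesce to zero; the table contents and date range are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table purchases (created_at text, price real)")
conn.executemany(
    "insert into purchases values (?, ?)",
    [
        ("2023-01-01 09:00", 10.0),
        ("2023-01-04 10:00", 25.0),
    ],
)

# The recursive CTE builds the full list of days, standing in for
# generate_series; days with no purchases show up as 0.
rows = conn.execute("""
    with recursive days(d) as (
        select '2023-01-01'
        union all
        select date(d, '+1 day') from days where d < '2023-01-04'
    )
    select d, coalesce(sum(price), 0)
    from days
    left join purchases on date(purchases.created_at) = d
    group by 1
    order by d
""").fetchall()

for d, total in rows:
    print(d, total)
```

The two quiet days now appear with zero revenue instead of vanishing from the graph.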

3. No Strictness on Groupings

Few things ruin an analyst’s day as badly as building a bunch of reports that appear to be right, only to discover later that they’re incorrect. Yet that is exactly the risk this gap creates.

To understand, let’s take a look at a broken query:

select date(created_at), platform, sum(price) 
from purchases
group by 1

At first glance, this appears to be revenue by platform by day. However, it’s missing a second group by! Postgres-family databases will return an error insisting that the query group by platform:

  ERROR:  column "purchases.platform" must appear in the GROUP BY clause or be used in an aggregate function

MySQL, however, will return the wrong results! Here they are:

Wrong results

Look closely: There’s only one row per date! MySQL picks one of the platforms at random for each day and attaches it to the daily total, so what looks like per-platform revenue is really revenue summed across every platform. You’re left to realize later that each platform’s real numbers don’t match what your report showed.
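SQLite happens to be as permissive as MySQL on this point, which makes the silent failure easy to reproduce with Python’s standard library. The schema and rows below are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table purchases (created_at text, platform text, price real)"
)
conn.executemany(
    "insert into purchases values (?, ?, ?)",
    [
        ("2023-01-01", "ios", 10.0),
        ("2023-01-01", "android", 20.0),
    ],
)

# platform is neither grouped nor aggregated: SQLite, like MySQL,
# silently picks an arbitrary platform per date, while sum(price)
# totals across BOTH platforms.
rows = conn.execute("""
    select date(created_at), platform, sum(price)
    from purchases
    group by 1
""").fetchall()
print(rows)  # one row for the date, both platforms' revenue merged
```

A strict database would reject this query outright instead of guessing.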

4. No JSON Support

After three increasingly unacceptable gaps, this one is more of a nice-to-have. But the fact is: App developers will often log semistructured data in JSON fields, and analysts will be left to pick up the pieces if they want to analyze that data.

Postgres provides a rich set of functions for getting at that data. For example, it’s increasingly common to store app events like pageviews in JSON blobs. You might have an object each day that looks like this:

  {
    "pageview": 12,
    "shopping_cart_click": 7,
    "purchase": 3
  }

In Postgres, you’d count the daily pageviews this way:

select d, ct from (
  select date(event_time) d, key, sum(value::float) ct
  from events, json_each_text(events.event_data)
  group by 1, 2
) t where key = 'pageview'

In MySQL, you’re out of luck. If your data’s formatted this way, counting pageviews by day is impossible without writing an ETL in another language.
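The same expand-and-filter pattern can be tried with SQLite’s json_each (available in builds with the JSON1 functions, standard in modern Python), which plays the role json_each_text plays in Postgres above. The events row is fabricated sample data:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table events (event_time text, event_data text)")
conn.execute(
    "insert into events values (?, ?)",
    (
        "2023-01-01 09:00",
        json.dumps({"pageview": 12, "shopping_cart_click": 7, "purchase": 3}),
    ),
)

# json_each expands the blob into (key, value) rows right inside the
# query, so we can filter to pageviews and sum per day.
rows = conn.execute("""
    select date(event_time) d, sum(value) ct
    from events, json_each(events.event_data)
    where key = 'pageview'
    group by 1
""").fetchall()
print(rows)
```

No external ETL step is needed; the unpacking happens in the query itself.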

When to use MySQL

MySQL is a solid choice as a production serving database, especially in high-load, highly-replicated environments. MySQL’s replication is well-understood and battle-tested, whereas Postgres’s built-in replication is a much more recent addition and less proven in large deployments.

But when choosing a database for analysis, don’t miss out on all these great features. Postgres itself is a solid choice for small datasets. For larger ones, we recommend a warehouse in the Postgres family.