Calculating Lifetime Metrics
Lifetime metrics are a great way to get a long term sense of the health of a business. They’re particularly useful for seeing how healthy a segment is by comparing it to others over the long term.
For example, here’s a fictional graph of lifetime game players by platform:
We can see that iOS is our fastest growing platform, quickly surpassing the others in lifetime players despite being launched months later!
However, metrics like these can be incredibly slow, since they’re typically calculated with distinct counts over asymmetric joins. In this post, we’ll show you how to calculate these metrics in milliseconds instead of minutes.
The easiest way to write the query is to join gameplays onto itself. The first gameplays, which we name dates, provides the dates that will go on our X axis. The second gameplays, called plays, supplies every gameplay that happened on or before each date.
The SQL is nice and simple:
select date(dates.created_at), plays.platform, count(distinct plays.user_id) from gameplays dates join gameplays plays group by 1, 2
This query has miserable performance: 37 minutes on a 10,000-row table!
To see why, let’s look at our explain:
By asymmetrically joining gameplays to itself, we’re bogging down in an n2 loop. Then we’re sorting the whole blown out table!
Fortunately, there’s a better way.
Astute readers will notice that we’re only using the first gameplays table in the join for its dates. It conveniently has the exact dates we need for our graph — the dates of every gameplay — but we can select those separately and then drop them into a generate_series.
The resulting SQL is still pretty simple:
select d, platform, count(distinct user_id) from generate_series( (select date(min(created_at)) from gameplays), (select date(max(created_at)) from gameplays), '1 day' ) join gameplays on created_at <= d group by 1, 2
This puppy runs in 26 seconds, achieving an 86X speedup over the previous query!
Don’t be fooled by the scans at the top. Those are just to get the single min and max values for generate_series.
Notice that instead of joining and nested-looping over gameplays multiplied by gameplays, we are joining with gameplays and a Postgres function, in this case, generate_series. Since the series has only one row per date, the result is a lot faster to loop over and sort.
But we can still do a lot better.
Window Functions To The Rescue
Every user started playing on a particular platform on one day. Once they started, they count forever. So let’s start with the first gameplay for each user on each platform:
select date(min(created_at)) dt, platform, user_id from gameplays group by platform, user_id
Now that we have that, we can aggregate it into how many first gameplays there were on each platform on each day:
with first_gameplays as ( select date(min(created_at)) dt, platform, user_id from gameplays group by platform, user_id ) select dt, platform, count(1) user_ct from first_gameplays group by dt, platform
We’ve included each user’s first gameplay per platform in a with clause, and then simply counted the number of users for each date and platform.
Now that we have the data organized this way, we can simply sum the values for each platform over time to get the lifetime numbers! Because we need each date’s sum to include previous dates but not new dates, and we need the sums to be separated by platform, we’ll use a window function. Here’s the full query:
with first_gameplays as ( select date(min(created_at)) dt, platform, user_id from gameplays group by platform, user_id ), daily_first_gameplays as ( select dt, platform, count(1) user_ct from first_gameplays group by dt, platform ) select dt, platform, sum(user_ct) over ( partition by platform order by dt rows between unbounded preceding and current row ) users from daily_first_gameplays
The window function is worth a closer look:
sum(user_ct) over ( partition by platform order by dt rows between unbounded preceding and current row )
partition by platform means we want a separate sum for each platform. order by dt specifies the order of the rows, in this case, by date. That’s important because rows between unbounded preceding and current row specifies that each row’s sum will include all previous rows (e.g. all previous dates), but no future rows. Thus it becomes a rolling sum, which is exactly what we want.
With the full query in hand, let’s look at the explain:
No joins at all! And when we’re sorting, it’s only over the relatively small daily_first_gameplays with-clause, and the final aggregated result.
As a result, this version runs in 28 milliseconds, a whopping 928X speedup over the previous version, and an unbelievable 79,000X speedup over the original self-join. Window functions to the rescue!