Cohort analysis is a powerful tool for evaluating the behavior of groups of users over time. One of the most common ways to cohort users is by when they started using your product or playing your game.
Here’s a graph of how many games the average user plays for a fictitious game company, cohorted by the month the user started playing:
select [users.created_at:month] , count(distinct gameplays.id) / count(distinct users.id) from [users+gameplays] group by 1
This chart makes it look like the game is getting dramatically less engaging over time — but it’s simply not true! Earlier users have had more time to play the game, so their average games per user is much higher.
To correct for this bias, we can introduce a time bound on each user’s gameplays. This way all users will have the same amount of time to play the game in this chart, and simply being an earlier user won’t bias the data.
We’ll add a 45-day window to the where clause. This ensures that the average games per user will only ever be calculated off the first 45 days of each user’s lifetime:
December through April look good, but something is wrong with May and June due to a second bias.
Like our earlier users having more time to play the game, our newest users have had less time. These users haven’t had 45 days to play the game, so their average games per user is unfairly lower.
To correct for that, we can add a second where clause to exclude users created in the last 45 days, ensuring that all users in the metric have had exactly 45 days of play time.
select [users.created_at:month] , count(distinct gameplays.id) / count(distinct users.id) from [users+gameplays] where gameplays.created_at <= users.created_at + interval '45 days' and users.created_at < getdate() - interval '45 days' group by 1
And finally we have a correct gameplays per user chart. Users are cohorted by the month they started and exactly 45 days of gameplays counted for each of them.
To further this analysis, we could start segmenting the data to see if certain dimensions were far less stable, or experiment with different cohorting windows (daily, weekly, yearly) and time bounds to achieve the right granularity.