A SQL-Based Approach to Cohort Retention and Analysis

Why Retention 

Some brands are great at connecting with new audiences. They might have a killer marketing campaign, demonstrably fulfill a gap in the marketplace, or just have that certain “it factor.” Being able to get new customers on board can lead to fantastic short-term financial gains. 

However, none of that matters if you can’t keep your audience. If people download your app and never open it, sign up for your newsletter and never go back to your site, or make one purchase in your store and never return, all of your marketing efforts were for naught. 

To sustain a successful business, it’s critical to understand what’s working or not working with your marketing efforts, your content, and your product(s). The ability to observe, monitor, and fix where a customer cohort begins to lose interest is tantamount to continued brand loyalty, revenue growth, and year-over-year success. 

The Value of Cohorts 

Cohort analysis segments users into groups based on time and then examines them against metrics such as retention and churn. Performing cohort analysis using SQL is a technique many skilled analysts use to understand differences between groups of customers. In this example, we’ll evaluate established and new customers to observe when product interest drops for either group, changes which may be associated with seasonal shifts, demographics, marketing, or any other number of factors. 

Maintaining and analyzing cohorts can improve your marketing, refine your user experience, and ensure your products and services are meeting the needs of your most valuable audiences. 

In this guide, we’ll outline step-by-step methods to script your own cohort analysis in SQL with examples and interpretations of the resulting data. 

How Do We Define Retention? 

The way a company decides to define retention varies by the business model. 

For example:  

  • A mobile app company may measure app engagement – such as daily active users (DAU)—when gauging retention. 
  • A subscription-based service may measure months-to-churn.
  • A retailer may measure repeat club member purchases over a fiscal year. 

For our example, let’s pretend we’re a mobile gaming company that looks at mobile app engagement—DAU, in our case—as a central key performance indicator (KPI). We have two groups of people who play the game in different, but identifiable ways:

  • User One plays the game on Monday and Tuesday. They’re a retained user. 
  • User Two plays the game Monday, but not on Tuesday. They’re considered a lapsed user. 
  • Retention for Monday is the number of retained users divided by the number of total users. If User One was the only user on Tuesday, then retention for Tuesday is 50%. How do we take this even further with SQL? 

Calculating Basic User Retention 

The key to calculating retention is counting users who were active at one time period, then counting how many were active at another. An easy way to do this in SQL is to left join your user activity table to itself.

Calculating retention rate

 We can then do our self-join like so:

select *
from activity
left join activity as future_activity
on activity.user_id = future_activity.user_id
 and activity.date = future_activity.date - interval
‘1 day’

Now, for every row of user activity, we have—in that same row—their activity one day in the future. This gives us an ideal table for calculating retention with some simple counts:

 , count(distinct gameplays.user_id) as active_users
 , count(distinct future_gameplays.user_id)
 as retained_users
 , count(distinct future_gameplays.user_id) /
 count(distinct gameplays.user_id)::float
 as retention
from gameplays
left join gameplays as future_activity
 on gameplays.user_id = future_activity.user_id
 and gameplays.date = future_activity.date - interval
‘1 day’
group by 1

Entering this into our SQL editor will give us this chart:

User retention rate

You can also change the retention interval to seven days or 30 days to get a sense of longer-term user engagement. 

Calculating Retention of New Versus Existing Users

Often, retention is quite different for users who just signed up as compared to loyal, long-time users. To calculate new-user retention, simply join in your user’s table and only look at activity rows that occurred on the user’s join date.

 users.date as date
 , count(distinct gameplays.user_id) as new_users,
 , count(distinct future_activity.user_id) as ,
 , count(distinct future_activity.user_id) /
 count(distinct gameplays.user_id)::float
 as retention
from gameplays
-- Limits gameplays to activity from new users
join users on
 gameplays.user_id = users.id
 and users.date = gameplays.date
left join gameplays as future_activity
on gameplays.user_id = future_activity.user_id
 and gameplays.date = future_activity.date
- interval ‘1 day’
group by 1
New user retention

Now we can see that while overall retention is 46%, new-user retention is only 5.4%! This is why it’s helpful to split out user segments—we now understand that improving new-user retention should clearly be a priority. In the case of our mobile game, it could mean that either our features aren’t appealing enough to keep new users engaged, or that our acquisition efforts are targeting the wrong type of user. 

Now let’s check out returning-user retention. To calculate this, simply change:

users.date = gameplays.date
users.date != gameplays.date

This effectively excludes activity from users who joined that day. Our query now looks like:

 gameplays.date as date
 , count(distinct gameplays.user_id) as new_users
 , count(distinct future_activity.user_id) as
 , count(distinct future_activity.user_id) /
 count(distinct gameplays.user_id)::float
 as retention
from gameplays
join users on
 gameplays.user_id = users.id
 and users.date != gameplays.date
left join gameplays as future_activity on
 gameplays.user_id = future_activity.user_id and
gameplays.date = future_activity.date - interval ‘1
group by 1

As expected, returning-user retention is higher than the overall average: 68% versus 46%. 

Calculating Retention in Cohorts

It can be illuminating to compare the retention of users who joined on Day A with those who joined on Day B. Or comparing the retention of users who joined as a result of different acquisition channels. This data gives us helpful insight into the impact that product or marketing changes have on improving retention. Ideally, we’d end up with a chart that shows the diminishing number of returning users in a cohort over time, like this:

Retention by cohort

We’ll start by defining a few handy subqueries to simplify the problem. New_user_activity restricts user activity to — you guessed it — new users, then identifies their date of first use and the length of their tenure.

with new_user_activity as (
 , min(date(created_at)) as first_use_date
 , max(date(created_at)) - min(date(created_at)) as
 from gameplays
group by 1)

We’ll then group this table by first use date and tenure.

, new_user_tenure as (
 , tenure
 , count(distinct user_id) as users
 from new_user_activity
 group by 1, 2)

cohort_active_user_count calculates the total number of active users — the denominator in our retention calculation — in each daily cohort.

, cohort_active_user_count as (
 , sum(users) as first_users
from new_user_activity
group by 1)

On top of that, we’ll use some additional techniques: 

1. In this query, we lose the simple count of active users in the cohort. Fortunately, we thought of this and made our cohort_ active_user_count subquery, which we can join in and use as the denominator. 

2. We will use a sequence table called all_numbers as the leftmost table. This will ensure populated data fields for all tenure lengths. 

3. Notice also the range, or inequality, join. This is one of our favorite SQL tricks and enables us to get multiple days of retention in one chart. 

4. Let’s take a look at the full query.

 , all_numbers.tenure || ‘ Days’ as tenure
 , sum(new_user_tenure.users) /
 as retention_rate
 left join new_user_tenure
 on all_numbers.tenure <= new_user_tenure.tenure
 left join cohort_active_user_count
 on new_user_tenure.first_use_date = cohor_active_
 where all_numbers.tenure != 0 and all_numbers.tenure
<= 7
 group by 1, 2, cohort_active_user_count.first_users
 order by 1,2

This results in the table:

Retention table

Using Sisense for Cloud Data Teams, we can automatically pivot the result and then color by percentile.

Retention by cohort

Interpreting Your Cohort Analysis 

For cohort analysis, we’re looking for significant differences between groups going down the chart to observe retention by start day. For example, we see an uptick in retention on 12/20 but a drop on 12/23. This would indicate to the analyst that they should investigate what occurred on those dates that might have contributed to a drop in retention. 

Improving Your Customer Retention Analysis With SQL 

Using the techniques outlined above, you can calculate your retention metrics and make better decisions for your business’ future. These comparative metrics can mean the difference between high-yield feature development and declining user numbers following disinterest and churn. 

While many of the functions outlined can be performed using a myriad of open source and spreadsheet software, the processes are very manual. Sisense for Cloud Data Teams offers an end-to-end analytics platform that reduces the time spent in clumsy workflows and produces immediate, up-to-date results.

Spend less time writing queries, and more time discovering and acting on business insights. Watch a demo, or sign up for a free trial today.

Watch Demo Start a Trial