Rolling averages are useful for comparing a specific point with those around it. However, in some cases, we might want to instead classify and group a period’s data based on its contemporary periods.
For each period, we examine those that came before or after it, and in doing so, perform rolling classifications. In this post, we’ll show how to determine the window of rolling, and how to use the different periods to classify data.
Defining the Window and Classifications
We’ll start off with an orders table that has two columns we care about: customer_id and order_date. For this example, we’ll aggregate the order dates by quarters. Our problem is to find the number of users that fall into the following classifications for each quarter:
- New Customer: If a customer made a purchase in the same quarter that they were created
- Active: If a customer made a purchase in the current quarter and were not created in the same quarter
- Pending: If a customer has not made an order in the current quarter, but made an order in the previous two quarters
- Non-Active: If a customer made a purchase two quarters ago, but not in the previous quarter or the current quarter
- Churned: If a customer has not made a purchase in the current quarter or the last two quarters
Based on these rules, our window will always cover the previous two quarters from the current quarter being evaluated. For each entry in the orders table, we’ll calculate whether or not the same customer made a purchase in the previous two quarters.
Grabbing All Possible Combinations
We’ll first want to create a table that, for each distinct customer_id, has rows for all the possible year and quarter combinations. To do that, we can extract the distinct customer_id’s, years, and quarters from our orders table into their own subqueries. Then we cross-join the three tables like so:
with year_table as ( select distinct extract(year from order_date) as year from orders ) , quarter_table as ( select distinct extract(qtr from order_date) as quarter from orders ) , customer_table as ( select distinct customer_id from orders ) , all_possible_orders as ( select * from customer_table, year_table, quarter_table )
Get Column for Current Quarter
In order to avoid repeated data where the same customer makes more than one purchase in a quarter, we’ll create a subquery from the orders table for distinct customer purchases.
distinct_orders as ( select distinct customer_id , extract(year from order_date) as year , extract(qtr from order_date) as quarter from orders )
Now that we have a table of orders without repeats in a quarter, we can join it to our all_possible_orders table. We’ll use a left join in order to not lose any of the necessary rows.
orders_with_current_quarter as ( select all_orders.customer_id , all_orders.year , all_orders.quarter , distinct_orders.year as curr_year distinct_orders.quarter as curr_quarter from all_orders left join distinct_orders on all_orders.customer_id = distinct_orders.customer_id and all_orders.year = distinct_orders.year and all_orders.quarter = distinct_orders.quarter )
The curr_year and curr_quarter columns will be null when there was not an order in that period, and will otherwise display the year and quarter for the user_id’s order. These columns will be used later for classifying the data, and are the key to checking the previous quarters.
Calculating Previous Quarters
The next step is to add a column for each of the two previous quarters. We’ll start by adding a column for the previous quarter first, as the logic for the second would be similar. To do this, we’ll want to left join the orders_with_current_quarter table onto itself, lagging the quarters by one.
Creating the join condition is the trickiest part of the query, and we have to consider the two possible cases:
- The previous quarter is in the same year (e.g. 2016 Quarter 2 → 2016 Quarter 1)
- The previous quarter is in the past year (e.g. 2016 Quarter 1 → 2015 Quarter 4)
The first possibility is relatively straightforward, and we can compare the values directly:
t1.year = t2.year and t1.quarter = t2.quarter + 1
The second possibility is a bit more complicated. We’ll have to use the modulus operator to account for the year potentially changing and the quarter resetting. To do this, we can use the modulus operator. Since we’re aggregating by quarters, our modulus check would be with four:
t1.year = t2.year + 1 and t1.quarter = (t2.quarter + 1) % 4 and t2.quarter = 4
Now we’ll apply this logic to the query we’ve built so far.
includes_one_previous_period as ( select t1.* , t2.quarter as one_prev_period from orders_with_current_quarter t1 left join orders_with_current_quarter t2 on t1.user_id = t2.user_id and ( ( t1.curr_year = t2.curr_year and t1.curr_quarter = t2.curr_quarter + 1 ) or ( t1.curr_year = t2.curr_year + 1 and t1.curr_quarter = (t2.curr_quarter + 1) % 4 and t2.curr_quarter = 3 ) ) )
Now we have a table with columns for both the current quarter and the previous quarter for all orders. To add a column for two quarters ago, we write a query that’s nearly identical to the one above. We modify it by using the includes_one_previous_period table, and adding two instead of one in the quarter join conditions.
includes_two_previous_period as ( select t1.* , t2.quarter as two_prev_period from includes_one_previous_period t1 left join includes_one_previous_period t2 on t1.user_id = t2.user_id and ( ( t1.curr_year = t2.curr_year and t1.curr_quarter = t2.curr_quarter + 2 ) or ( t1.curr_year = t2.curr_year + 1 and t1.curr_quarter = (t2.curr_quarter + 2) % 4 and t2.curr_quarter = 3 ) ) )
And our resulting table now has columns for the current quarter, one quarter ago, and two quarters ago.
We’re now one join condition away (We promise!) from being able to classify the data using our rules!
Join User Created Date
We now have the necessary columns to count the number of customers who fall into Categories 2-5 for each quarter. We’ll add one more column that lets us determine whether or not someone was a new customer in a specific Quarter. To do this, we can conveniently employ our customers table that has two columns of value: customer_id and created_at.
All we have to do is left join this onto our includes_two_previous_period subquery using the customer_id fields.
master_table as ( select t1.* , extract(year from created_at) as customer_created_year , extract(qtr from created_at) as customer_created_quarter from includes_two_previous_period left join customers on includes_two_previous_period.customer_id = customers.customer_id )
Now we have everything we need to build our classification rules. The finish line is in sight!
Putting It All Together
With all the necessary pieces, we can employ a case when statement to determine whether or not a row meets one of the 5 established classifications.
select year , quarter , case when year = customer_created_year and quarter = customer_created_quarter then ‘New Customer’ when current_year is not null then ‘Active’ when current_year is null and one_prev_period is not null and two_prev_period is not null then ‘Pending’ when current_year is null and one_prev_period is null and two_previs_period is not null then ‘Non-Active’ when current_year is null and one_prev_period is null and two_prev_period is null then ‘Churned’ end , count(1) from master_table group by 1,2,3
And voila! We now have a beautiful chart that groups customers into different categories each quarter depending on their activity in past quarters.
The same logic can be applied to go even further back or forward to grab the periods needed for classification. Now that you can classify data by rolling through aggregated periods, the world is your oyster! …or at least your dataset.