Filling in gaps in your dataset usually takes one of two forms. You can interpolate, where intermediate points are generated based on their neighbors, or you can carry values forward.
In this post we’re going to carry values forward. Below we have a log of when a customer changed payment plans. Since the customer didn’t change plans everyday, we only have data for certain days.
Most daily revenue metrics work best when you have daily data, so this isn’t the most convenient form for data analysis. Using a correlated subquery, we’ll convert this list of events to a time series showing the customer’s current plan on each date.
This customer upgraded from a $10/day plan to $15 and then $20, and then downgraded to $15:
select date(created_at) as d, plan_amount as plan from playment_plans where customer_id = 1 order by 1
By using generate series or joining a list of dates, we can build most of the daily plan dataset we’re looking for:
select * from spotty_data order by 1
Filling in the Gaps
We want the $10 plan value to fill down to 2016-03-24 and 2016-03-25, for $15 to fill in 2016-03-27 and 2016-03-28, etc.
It looks like window functions could work well here, but they won’t. The lag window function can look at previous rows, but it cannot look at previous versions of itself, so you’ll end up with a result like this:
select * , coalesce(plan, lag(plan) over(order by d)) from spotty_data order by 1
With the lag function, we only fill in one new value. To fill all of them in, we need a correlated subquery.
Correlated Subqueries for Multi-Row Lag
Be careful: correlated subqueries can be very slow. They execute once for each row in your result set, so make sure they’re using indices or the data set is small.
In our case, we want to find the most recent plan, at or before the current date. Using a correlated subquery here is similar to a self join or getting the first row from a join, but with the logic in the select statement as a new column:
select * , (select plan from spotty_data as backfill where backfill.d <= original.d and backfill.plan is not null order by backfill.d desc limit 1) as coalesce_down_plan from spotty_data as original order by 1
We define a new column that’s selecting a single value out of the same table, renamed backfill. The original spotty_data we’ll call original.
The condition backfill.d <= original.d coupled with order by backfill.d desc ensures that the top row of our subselect is the most recent plan. The limit 1 is required as the expression can only return one row to be used as the value for the new column.
With our correlated subquery, we’re able to fill in the blanks for each date, making our revenue analysis much easier.