Beyond group and count
Window functions are a wonderfully useful SQL technique. They make complex aggregations simple to build.
Simple aggregations and percentages
Let’s start with some data from a video game company. For each platform, we want to know how many times a user played a game on that platform and what percent of all gameplays that platform has.
select platform, count(1) as plays, count(1) / (sum(count(1)) over ())::float as "% of plays" from gameplays group by 1
The window function in this query is sum(count(1)) over ().
sum(count(1)) gives us the total number of gameplays. over () specifies to aggregate over all the rows without collapsing them. Thus this function gives us the total number of gameplays across all platforms.
The count(1) in the numerator is not part of the window function, and so it applies to all rows in the group, giving us a per-platform count.
Putting it all together, here are the results:
Quartiles or deciles can be a very useful way to split a dataset. Windows are by far the easiest way to do this in SQL. Let’s look at user spend quartiles, and the min and max spend within each quartile.
select quartile, min(spend) as min, max(spend) as max from ( select spend, ntile(4) over (order by spend asc) quartile from ( select user_id, sum(price) as spend from purchases group by 1 ) user_spend ) user_spend_quartiles group by 1 order by ntile asc
The inner query gives us a table of spend per user. The middle query annotates each row with the quartile — ntile(4) — of spend. Finally, the outer query aggregates the rows into just the min and max of each quartile.
Say what you will about cumulative metrics — they are certainly to be used sparingly — but they are easy to calculate with window functions. Here we’ll compute a running sum of all revenue.
select day, sum(spend) over ( order by day asc rows between unbounded preceding and current row ) from ( select date(created_at) as day, sum(price) as spend from purchases group by 1 ) daily_revenue
The inner query defines a simple daily sum of all revenue. The outer query makes it cumulative, summing all the values between the first day and the current day.
That’s accomplished with rows between unbounded preceding and current row. For each row, unbounded preceding begins the sum at the beginning of the table, and current row halts the sum at, well, the current row.
Here are the results of both the inner and outer queries:
Determining the position of a row
Ordering information is another useful trick window functions give us. Let’s take the previous query, and also add a ranking column for which platform has the highest number of plays:
select platform, plays, plays / (sum(plays) over ())::float as "% of plays", rank() over (order by plays desc) from ( select platform, count(1) as plays from gameplays group by 1 ) plays_by_platform
rank() gives the row’s number, and over (order by plays desc) specifies the order in which to apply the rank.
Multiple windows with partition
Often we want a separate ordering for different parts of the table. This is what the partition feature enables. It splits the window function, applying it separately to each specified partition.
For example, let’s find the players with the most gameplays for each platform:
select platform, user_id, plays, rank() over (partition by platform order by plays desc) from ( select platform, user_id, count(1) as plays from gameplays group by 1, 2 ) plays_by_user_and_platform
Our partition by platform makes the rank() function give us a separate rank for each platform.
How It All Works
Superficially, window functions are similar to your basic “group by” functionality. However, rather than subdividing tables into exclusive “groups” of rows and collapsing them, window functions can look at arbitrary “windows” and do so without collapsing the windows into a single row.
Pieces of a Window Function
Dissecting our last example, rank() over (partition by platform order by plays desc), we can pull out three pieces:
- rank() — the function, which aggregates, ranks, or filters the rows in the partition
- over(…) — the window, which specifies which rows the function applies to
- partition by platform — which subset of rows are considered (in this case, all rows with the same platform are in this partition)
- order by plays desc — the order of the rows in the window; this is especially useful for functions like first() or row_number() which depend on ordering
Finally, the over() window definition can also have a row specifier, which further restricts which rows are in the window. The cumulative metrics section above goes into this in more detail.
Window functions are evaluated after the join, group, and having clauses, at the same time as other select statements.
That, unfortunately, means your window functions can’t refer to other fields in the select statement. To do this, you’ll need to wrap the select in a subquery and put your window function in the outer query.
Window Function Availability
Window functions were defined in SQL:2003 and are available in PostgreSQL, SQL Server, Redshift (which supports a subset of Postgres’s functions) and Oracle (which calls them “analytic functions”).
More Neat Tricks
As you can see, we’re big fans of window functions! Here are some times we’ve used them to great effect:
- Predicting Exponential Growth with SQL, in which they calculate a regression of exponential data
- Use window functions for time-series percentages, in which they make a time series proportional
- Generate Series in Redshift and MySQL, in which they replace Redshift’s unfortunate lack of generate_series
- 4 Ways to Join Only The First Row in SQL, in which (spoiler alert!) one of the ways is to use a window function
- ASCII Art Charts in the Terminal, in which they auto-scale our ascii charts