Beyond group and count

Window functions are a wonderfully useful SQL technique. They make complex aggregations simple to build.

After using them to great effect for selecting only one row, computing running averages, and breaking out day-over-day changes, we thought it was high time to explain them in more detail.

Simple aggregations and percentages

Let’s start with some data from a video game company. For each platform, we want to know how many times a user played a game on that platform and what percent of all gameplays that platform has.

  count(1) as plays,
  count(1) / (sum(count(1)) over ())::float as "% of plays"
from gameplays
group by 1

The window function in this query is sum(count(1)) over ().

sum(count(1)) gives us the total number of gameplays. over () specifies to aggregate over all the rows without collapsing them. Thus this function gives us the total number of gameplays across all platforms.

The count(1) in the numerator is not part of the window function, and so it applies to all rows in the group, giving us a per-platform count.

Putting it all together, here are the results:

Platform plays

Calculating ntiles

Quartiles or deciles can be a very useful way to split a dataset. Windows are by far the easiest way to do this in SQL. Let’s look at user spend quartiles, and the min and max spend within each quartile.

  min(spend) as min, 
  max(spend) as max
from (
    ntile(4) over (order by spend asc) quartile
  from (
    select user_id, sum(price) as spend
    from purchases
    group by 1
  ) user_spend
) user_spend_quartiles
group by 1 
order by ntile asc

The inner query gives us a table of spend per user. The middle query annotates each row with the quartile — ntile(4) — of spend. Finally, the outer query aggregates the rows into just the min and max of each quartile.

Ntile table

Cumulative metrics

Say what you will about cumulative metrics — they are certainly to be used sparingly — but they are easy to calculate with window functions. Here we’ll compute a running sum of all revenue.

  sum(spend) over (
    order by day asc 
    rows between unbounded preceding and current row
from (
    date(created_at) as day, 
    sum(price) as spend
  from purchases
  group by 1
) daily_revenue

The inner query defines a simple daily sum of all revenue. The outer query makes it cumulative, summing all the values between the first day and the current day.

That’s accomplished with rows between unbounded preceding and current row. For each row, unbounded preceding begins the sum at the beginning of the table, and current row halts the sum at, well, the current row.

Here are the results of both the inner and outer queries:

Revenue by day

Determining the position of a row

Ordering information is another useful trick window functions give us. Let’s take the previous query, and also add a ranking column for which platform has the highest number of plays:

  plays / (sum(plays) over ())::float as "% of plays",
  rank() over (order by plays desc)
from (
  select platform, count(1) as plays
  from gameplays
  group by 1
) plays_by_platform

rank() gives the row’s number, and over (order by plays desc) specifies the order in which to apply the rank.

Platform plays 2

Multiple windows with partition

Often we want a separate ordering for different parts of the table. This is what the partition feature enables. It splits the window function, applying it separately to each specified partition.

For example, let’s find the players with the most gameplays for each platform:

  rank() over (partition by platform order by plays desc)
from (
  select platform, user_id, count(1) as plays
  from gameplays
  group by 1, 2
) plays_by_user_and_platform

Our partition by platform makes the rank() function give us a separate rank for each platform.

Platform user ID table

How It All Works

Superficially, window functions are similar to your basic “group by” functionality. However, rather than subdividing tables into exclusive “groups” of rows and collapsing them, window functions can look at arbitrary “windows” and do so without collapsing the windows into a single row.

Pieces of a Window Function

Dissecting our last example, rank() over (partition by platform order by plays desc), we can pull out three pieces:

  • rank() — the function, which aggregates, ranks, or filters the rows in the partition
  • over(…) — the window, which specifies which rows the function applies to
  • partition by platform — which subset of rows are considered (in this case, all rows with the same platform are in this partition)
  • order by plays desc — the order of the rows in the window; this is especially useful for functions like first() or row_number() which depend on ordering

Finally, the over() window definition can also have a row specifier, which further restricts which rows are in the window. The cumulative metrics section above goes into this in more detail.

Evaluation Order

Window functions are evaluated after the join, group, and having clauses, at the same time as other select statements.

That, unfortunately, means your window functions can’t refer to other fields in the select statement. To do this, you’ll need to wrap the select in a subquery and put your window function in the outer query.

Window Function Availability

Window functions were defined in SQL:2003 and are available in PostgreSQL, SQL Server, Redshift (which supports a subset of Postgres’s functions) and Oracle (which calls them “analytic functions”).

Unfortunately, they’re not supported on MySQL, though you can get a lot of mileage out of variables and group_concat.

More Neat Tricks

As you can see, we’re big fans of window functions! Here are some times we’ve used them to great effect: