It’s rare for one table to have all the information needed for a query. Across all our customers, the average chart joins 4 tables to get its result!

Many of these charts use the standard inner join. However, there are many other kinds of joins that are frequently used in data analysis. We’ll take a look at the most common joins we see:

  • Inner Joins
  • Left (and Right) Joins
  • Self-Range Joins

To illustrate each of these joins, we’ll work with a database for a fictitious concierge app used at the front office of an apartment complex. It stores information about the community’s residents and their activity.

Inner Joins

The inner join is the workhorse of SQL joins. This is the default and most common way of combining the data from two tables. When your data has a one-to-many or many-to-many relationship, use inner joins.

For example, we have two tables: residents and packages. Packages arrive for residents at the front office, and get logged in the app. To see which residents we should notify about their packages, we use an inner join:

select *
from residents
  join packages on
    residents.id = packages.residents_id
id      name resident_id     carrier
1Harry1UPS
3 Rya 7 FedEx
13John13 UPS

This inner join will show us residents with packages. It will not show residents without packages, or packages without residents.

Left (and Right) Joins

The inner join above will not include residents without packages because inner joins require rows from both tables involved to have data satisfying the join condition.

Since residents without packages won’t have any rows in the packages table, those residents are excluded from the results of an inner join.

We can use a left join to include rows from the residents table even if they don’t have any matching rows in the packages table.

Left joins will include every row from the left (residents) table at least one time. If there are no matching results from the right (packages) table, the columns in packages will simply be null.

select *
from residents
  left join packages on
    residents.id = packages.residents_id
id         name resident_id   carrier
1Harry1UPS
2Andreas (null) (null)
3                          Rya 7 FedEx

This left join includes all residents whether or not they have a package.

Likewise, if we wanted to see all packages including those without residents, we’d use a right join or swap the order of residents and packages in the select statement.

And we can use the fact that non-matching rows will return nulls to find packages without residents. This can be useful to find packages stamped with an incorrect resident id, because the residents.id field will be null:

select *
from residents
  right join packages on
    residents.id = packages.residents_id
where residents.id is null
id          name resident_id     carrier
(null)                  (null) -5 UPS
(null)              (null) (null) FedEx

Range Joins

The range join is very common when computing lifetime metrics. In our case, we want to calculate the cumulative distinct number of residents receiving packages over time.

Our packages table has a record for each package showing when it was received, making a daily count of distinct package recipients very easy:

select date(received_at), count(distinct recipient_id)
from packages
group by 1

However, this doesn’t get us close to the lifetime metric. Simply making the results cumulative will double-count residents, which we don’t want.

Instead, we’ll use a range join to join each date in our dates table to the packages that came on or before it. Our dates table is just a list of dates, so you could also use generate series or a subquery to produce the same thing.

The important difference with this join is that it doesn’t use equality, it uses greater than / less than instead. This makes it possible for a different number of rows to join to each date:

select date(dates.d), count(distinct recipient_id)
from dates
  join packages on
    packages.received_at <= dates.d
group by 1

With this range join, each date will be joined to all the rows in the packages table that have a received_at on that date, or any previous date. Now the count(distinct recipient_id) will be cumulative and correct!

Tags: