It’s rare for one table to have all the information needed for a query. Across all our customers, the average chart joins 4 tables to get its result!
Many of these charts use the standard inner join. However, there are many other kinds of joins that are frequently used in data analysis. We’ll take a look at the most common joins we see:
- Inner Joins
- Left (and Right) Joins
- Self-Range Joins
To illustrate each of these joins, we’ll work with a database for a fictitious concierge app used at the front office of an apartment complex. It stores information about the community’s residents and their activity.
Inner Joins
The inner join is the workhorse of SQL joins. This is the default and most common way of combining the data from two tables. When your data has a one-to-many or many-to-many relationship, use inner joins.
For example, we have two tables: residents and packages. Packages arrive for residents at the front office, and get logged in the app. To see which residents we should notify about their packages, we use an inner join:
select *
from residents
join packages on
residents.id = packages.residents_id
id | name | resident_id | carrier |
1 | Harry | 1 | UPS |
3 | Rya | 7 | FedEx |
13 | John | 13 | UPS |
… | … | … | … |
This inner join will show us residents with packages. It will not show residents without packages, or packages without residents.
Left (and Right) Joins
The inner join above will not include residents without packages because inner joins require rows from both tables involved to have data satisfying the join condition.
Since residents without packages won’t have any rows in the packages table, those residents are excluded from the results of an inner join.
We can use a left join to include rows from the residents table even if they don’t have any matching rows in the packages table.
Left joins will include every row from the left (residents) table at least one time. If there are no matching results from the right (packages) table, the columns in packages will simply be null.
select *
from residents
left join packages on
residents.id = packages.residents_id
id | name | resident_id | carrier |
1 | Harry | 1 | UPS |
2 | Andreas | (null) | (null) |
3 | Rya | 7 | FedEx |
… | … | … | … |
This left join includes all residents whether or not they have a package.
Likewise, if we wanted to see all packages including those without residents, we’d use a right join or swap the order of residents and packages in the select statement.
And we can use the fact that non-matching rows will return nulls to find packages without residents. This can be useful to find packages stamped with an incorrect resident id, because the residents.id field will be null:
select *
from residents
right join packages on
residents.id = packages.residents_id
where residents.id is null
id | name | resident_id | carrier |
(null) | (null) | -5 | UPS |
(null) | (null) | (null) | FedEx |
Range Joins
The range join is very common when computing lifetime metrics. In our case, we want to calculate the cumulative distinct number of residents receiving packages over time.
Our packages table has a record for each package showing when it was received, making a daily count of distinct package recipients very easy:
select date(received_at), count(distinct recipient_id)
from packages
group by 1
However, this doesn’t get us close to the lifetime metric. Simply making the results cumulative will double-count residents, which we don’t want.
Instead, we’ll use a range join to join each date in our dates table to the packages that came on or before it. Our dates table is just a list of dates, so you could also use generate series or a subquery to produce the same thing.
The important difference with this join is that it doesn’t use equality, it uses greater than / less than instead. This makes it possible for a different number of rows to join to each date:
select date(dates.d), count(distinct recipient_id)
from dates
join packages on
packages.received_at <= dates.d
group by 1
With this range join, each date will be joined to all the rows in the packages table that have a received_at on that date, or any previous date. Now the count(distinct recipient_id) will be cumulative and correct!