Many events tables contain low-level information that isn’t always easy to reason about. In this post we’ll cluster low-level delivery event data into trips, making further analysis much easier.
Suppose you operate a delivery business 10 minutes from the city, and the only information you have are the driver IDs and the delivery times:
You would like to determine the number of trips each driver took to complete their deliveries. To do so, we’ll cluster the drivers’ deliveries based on the time of delivery.
It takes 10 minutes to go from the warehouse to the city. All deliveries made within 10 minutes of the previous delivery are part of the same trip since there isn’t enough time to head back to the warehouse.
With the help of window functions, we can easily cluster the delivery times based on these 10 minutes gaps.
First, we want to determine the difference in time between deliveries for each driver using the lag window function.
with delivery_difference as ( select driver_id , time_of_delivery , lag(time_of_delivery) over ( partition by driver_id order by driver_id, time_of_delivery ) as previous_delivery , extract(epoch from (time_of_delivery - lag(time_of_delivery) over ( partition by driver_id order by driver_id, time_of_delivery ) )) as difference from driver_delivery )
Now that we have a difference in time between deliveries, we can cluster deliveries that are fewer than 10 minutes apart. To do this, we calculate if each row is the beginning of a new cluster or belongs to the current cluster.
clustering as ( select * , case when difference > 600 or difference is null then true else null end as new_cluster from delivery_difference )
To assign a cluster ID for each row, we utilize count as part of a window function. Since count leaves out null values, it will only count incrementally when non-null values appear. This identifies each cluster by the same value.
assigned_clustering as ( select * , count(new_cluster) over ( order by driver_id, time_of_delivery rows unbounded preceding ) as cluster_id from clustering )
Putting all of the above together, we’ve successfully clustered deliveries into trips. This allows us to finally calculate the deliveries per driver’s trip.
select driver_id , cluster_id , count(*) from assigned_clustering group by driver_id, cluster_id order by driver_id, cluster_id