Many events tables contain low-level information that isn’t always easy to reason about. In this post we’ll cluster low-level delivery event data into trips, making further analysis much easier.



Suppose you operate a delivery business whose warehouse is 10 minutes from the city, and the only information you have is each delivery’s driver ID and delivery time:



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-01-10.png" alt="Driver id and delivery" class="wp-image-78179"/></figure>



You would like to determine the number of trips each driver took to complete their deliveries. To do so, we’ll cluster each driver’s deliveries by delivery time.



It takes 10 minutes to go from the warehouse to the city. All deliveries made within 10 minutes of the previous delivery are part of the same trip since there isn’t enough time to head back to the warehouse.
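
To make the rule concrete, here is a minimal sketch of a sample table. The column types and every row are made up for illustration; only the table name driver_delivery and its two column names come from the queries in this post.

<pre class="wp-block-code"><code>-- Hypothetical sample data; the column types are assumptions.
create table driver_delivery (
  driver_id integer not null
  , time_of_delivery timestamp not null
);

insert into driver_delivery (driver_id, time_of_delivery) values
  (1, '2024-01-01 09:00:00')
  , (1, '2024-01-01 09:04:00')
  , (1, '2024-01-01 09:09:00')
  , (1, '2024-01-01 09:45:00')
  , (1, '2024-01-01 09:50:00')
  , (2, '2024-01-01 09:02:00')
  , (2, '2024-01-01 09:30:00')
  , (2, '2024-01-01 09:38:00');</code></pre>

With these rows, driver 1’s deliveries at 09:00, 09:04, and 09:09 are each within 10 minutes of the one before, so they belong to one trip. The 36-minute gap before the 09:45 delivery starts a second trip.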



With the help of <a href="https://www.sisense.com/blog/window-functions-by-example/">window functions</a>, we can easily cluster the delivery times based on these 10-minute gaps.



First, we want to determine the difference in time between deliveries for each driver using the lag window function.



<pre class="wp-block-code"><code>with delivery_difference as (
  select
    driver_id
    , time_of_delivery
    -- previous delivery by the same driver (null for the driver's first delivery)
    , lag(time_of_delivery) over (
        partition by driver_id
        order by time_of_delivery
      ) as previous_delivery
    -- gap since the previous delivery, in seconds
    , extract(epoch from (
        time_of_delivery - lag(time_of_delivery) over (
          partition by driver_id
          order by time_of_delivery
        )
      )) as difference
  from
    driver_delivery
)</code></pre>



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-02-11.png" alt="Time difference between deliveries" class="wp-image-78173"/></figure>
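
The query above spells out the same window twice. As a possible variation, Postgres lets us define a window once in a WINDOW clause and refer to it by name. The sketch below assumes the same driver_delivery table; the window name w is arbitrary.

<pre class="wp-block-code"><code>select
  driver_id
  , time_of_delivery
  , lag(time_of_delivery) over w as previous_delivery
  -- gap since the driver's previous delivery, in seconds
  , extract(epoch from (
      time_of_delivery - lag(time_of_delivery) over w
    )) as difference
from
  driver_delivery
-- one window definition shared by both lag() calls
window w as (
  partition by driver_id
  order by time_of_delivery
)</code></pre>

This should return the same columns as the delivery_difference step, with only one place to edit if the partitioning or ordering changes.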



Now that we have the time difference between deliveries, we can cluster deliveries that are no more than 10 minutes apart. To do this, we flag whether each row begins a new cluster or belongs to the current one.



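<pre class="wp-block-code"><code>, clustering as (
  select
    *
    -- a gap of more than 600 seconds (10 minutes), or no previous
    -- delivery at all, marks the first delivery of a new trip
    , case when difference > 600
        or difference is null then true
        else null
      end as new_cluster
  from
    delivery_difference
)</code></pre>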



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-03-10.png" alt="fewer than 10 minutes apart" class="wp-image-78167"/></figure>



To assign a cluster ID to each row, we use count as a window function. Since count skips null values, the running count only increases on rows where new_cluster is non-null, so every row in a cluster ends up with the same ID.
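
To see the null-skipping behavior in isolation, here is a toy query over five made-up rows. The names t, flag, and running_count are hypothetical and exist only for this illustration.

<pre class="wp-block-code"><code>select
  t
  , flag
  -- count(flag) ignores nulls, so the running count only
  -- increases on rows where flag is set
  , count(flag) over (
      order by t
      rows unbounded preceding
    ) as running_count
from
  (values (1, true), (2, null), (3, null), (4, true), (5, null)) as v(t, flag)
order by t
-- running_count: 1, 1, 1, 2, 2</code></pre>

Rows 1 through 3 share the value 1 and rows 4 and 5 share the value 2, which is exactly the grouping we want for trips.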



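<pre class="wp-block-code"><code>, assigned_clustering as (
  select
    *
    -- running count of cluster starts; count() ignores nulls, so the
    -- value only increases on the first delivery of each trip
    , count(new_cluster) over (
        order by driver_id, time_of_delivery
        rows unbounded preceding
      ) as cluster_id
  from
    clustering
)</code></pre>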



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-04-10.png" alt="Count table" class="wp-image-78191"/></figure>
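
The cluster IDs above keep increasing from one driver to the next. If we would rather have trip numbers that restart at 1 for every driver, one option is to partition the running count by driver. This is a sketch of that variation, not the approach used in the rest of this post; flagged and trip_number are hypothetical names, and the interval comparison stands in for the 600-second check.

<pre class="wp-block-code"><code>with flagged as (
  select
    driver_id
    , time_of_delivery
    -- true on a driver's first delivery or after a gap of more than 10 minutes
    , case
        when lag(time_of_delivery) over w is null
          or time_of_delivery - lag(time_of_delivery) over w > interval '10 minutes'
          then true
      end as new_cluster
  from
    driver_delivery
  window w as (
    partition by driver_id
    order by time_of_delivery
  )
)
select
  driver_id
  , time_of_delivery
  -- the running count restarts for each driver
  , count(new_cluster) over (
      partition by driver_id
      order by time_of_delivery
      rows unbounded preceding
    ) as trip_number
from
  flagged
order by driver_id, time_of_delivery</code></pre>

Because each driver’s first delivery always has no previous delivery, every driver’s numbering starts at 1.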



Putting all of the above together, we’ve successfully clustered deliveries into trips. Now we can finally count the deliveries in each of a driver’s trips.



<pre class="wp-block-code"><code>select
 driver_id
 , cluster_id
 , count(*)
from
 assigned_clustering
group by driver_id, cluster_id
order by driver_id, cluster_id</code></pre>



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-05-8.png" alt="Deliveries per trip" class="wp-image-78185"/></figure>
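
Since the original goal was the number of trips each driver took, here is a sketch that chains the three steps into a single statement and counts distinct clusters per driver. It assumes the same driver_delivery table; the output column names trips and deliveries are our own.

<pre class="wp-block-code"><code>with delivery_difference as (
  select
    driver_id
    , time_of_delivery
    -- gap since the driver's previous delivery, in seconds
    , extract(epoch from (
        time_of_delivery - lag(time_of_delivery) over (
          partition by driver_id
          order by time_of_delivery
        )
      )) as difference
  from
    driver_delivery
)
, clustering as (
  select
    *
    -- flag the first delivery of each trip
    , case when difference > 600 or difference is null then true end as new_cluster
  from
    delivery_difference
)
, assigned_clustering as (
  select
    *
    , count(new_cluster) over (
        order by driver_id, time_of_delivery
        rows unbounded preceding
      ) as cluster_id
  from
    clustering
)
select
  driver_id
  , count(distinct cluster_id) as trips
  , count(*) as deliveries
from
  assigned_clustering
group by driver_id
order by driver_id</code></pre>

Each cluster belongs to exactly one driver, because a driver’s first delivery always starts a new cluster, so counting distinct cluster IDs per driver gives that driver’s trip count.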
