Tracking application pageload time
At Sisense, we are constantly tracking performance statistics for our computing infrastructure, database query times, and, of course, web servers. One of our most important charts shows the loading time of the application, which we monitor to ensure a user has a lightning fast experience.
Last week we were alerted by Sisense for Cloud Data Teams to a spike in the data — the web app was returning five times slower than average!
We manually inspected the data and discovered it was only affecting a small percentage of users. After resolving the issue, we wanted a way to more quickly understand how many users were being affected during a spikey event.
Performance distribution with ntile
We’re big fans of window functions at Sisense. For the chart we’re after, we can use ntile to bin the data in sorted order. ntile divides the rows evenly and sets the bin number for each row.
We have our response times stored in client_timing_logs, and want to append the bin using ntile. To do so, we’ll order by the app_duration column when assigning bins.
select [created_at:pst:date] , app_duration , ntile(3) over ( order by app_duration desc ) as ntile from client_timing_logs
Running this query gives us the date, response time, and the bin assigned by ntile, which it distributed as evenly as possible:
Let’s save the result of this table to binned_times so we can look at the distribution over days within each bin. For our query, we want to look at 10 bins for the app response time over the last 10 days. We’ll take the average of all the run times for each (date, bin) pair to discover anomalies across the entire bins. This gives us a sense of the performance characteristics for users at each level of response time.
with binnedTimes as ( select date(created_at) , app_duration , ntile(10) over ( order by app_duration desc ) as ntile from client_timing_logs where created_at > now() - interval '10 days' ) select date_trunc , ntile , avg(app_duration) from binnedTimes group by 1 , 2 order by 2 desc
We take the average of each bin and plot the proportional distribution — how much of the runtime was dominated by each day for that bin.
Each bar represents the average response time in groups of 10%. The first bar is the users with the top 10% fastest pageloads, and the last bar are the users with the 10% slowest page loads.
We can see the same spike on June 22 that we saw in our original graph, and by checking the other 9 bars, we know that it is only the slowest responses that have been affected.
This type of graph lets us easily compare across two dimensions: comparing across the distribution of response times along the X axis, and across different dates within each stacked bar.
In this case, we were able to determine that our data was invalid. We were measuring the time between when a request was made and when we received a “done loading” message.
Yet, if a user closed their laptop mid request, we wouldn’t receive a “done loading” message until they opened it again. This made it appear that queries were taking much longer than they actually were.
We can compare this data anomaly to a failure one of our caching web servers experienced in February, and see how run times were longer in bins 10, 9, and 8 during the outage:
On February 18, we see a “fan out” type shape as users in the fastest percentile were returning more slowly than average, a result of dramatically increased cache misses.
This shows us that the left-most points of our distribution had shifted towards the right, as opposed to the clustering of outliers we saw in the first example.
Tracking Percentiles Over Time
A popular use case of ntile is to find the median by using ntile(2) and taking the maximum of the first bin.
We can easily compare the traditional quintiles, [minimum, Q1, median, Q3, maximum], over time by using the same strategy and ntile(4). For example, we use this query to track performance patterns over the last six months:
with binnedTimes as ( select date(created_at) , resource_duration , ntile(4) over ( order by resource_duration desc ) as ntile from client_timing_logs where created_at > now() - interval '180 days' ) select date_trunc as date , ntile , max(resource_duration) as value from binnedTimes group by 1 , 2
The column max(resource_duration) of the first, second, third, and fourth bins are now Q1, the median, Q3, and the maximum. We use both max and avg to track performance distributions. Using the average is an easy wat to amortize outliers from the minimum and maximum bin. Here we plot time on the X Axis to track the response time from the resource server with ntil(5), using the average of each bin as an approximation of the quintile.
This lets us easily check for things like day of week effects. The peaks in the blue line are long response times on weekends. Because there are fewer total requests on weekends, long-running cache misses have a higher impact on the average.
A key advantage to using ntile is you only need to specify the number of tiles. We didn’t have to set ranges, use integer division or negative rounding to create the bins, so we can be confident that our query will function even as the data grows and evolves over time.