<h2 class="wp-block-heading">Tracking application pageload time</h2>



At Sisense, we are constantly tracking performance statistics for our computing infrastructure, database query times, and, of course, web servers. One of our most important charts shows the loading time of the application, which we monitor to ensure users have a lightning-fast experience.



Last week we were alerted by <a href="https://www.sisense.com/product/data-teams/">Sisense for Cloud Data Teams</a> to a spike in the data: the web app was responding five times slower than average!



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/oflast10daysper10Tile.png" alt="Average application page load" class="wp-image-77068"/></figure>



We manually inspected the data and discovered the spike was only affecting a small percentage of users. After resolving the issue, we wanted a faster way to understand how many users are affected during a spike.



<h2 class="wp-block-heading">Performance distribution with ntile</h2>



We’re big fans of <a href="https://www.sisense.com/blog/window-functions-by-example/">window functions</a> at Sisense. For the chart we’re after, we can use ntile to bin the data in sorted order. ntile divides the sorted rows into a given number of bins that are as equal in size as possible and assigns each row its bin number.



We have our response times stored in&nbsp;client_timing_logs, and want to append the bin using ntile. To do so, we’ll&nbsp;order by&nbsp;the&nbsp;app_duration&nbsp;column when assigning bins. 



<pre class="wp-block-code"><code>select
  [created_at:pst:date]
  , app_duration
  , ntile(3) over (
      order by
        app_duration desc
    )
    as ntile
from
  client_timing_logs</code></pre>



Running this query gives us the date, response time, and the bin assigned by ntile, which it distributed as evenly as possible:



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/ntile_table.png" alt="Ntile table" class="wp-image-77058"/></figure>
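To make "as evenly as possible" concrete, here is a minimal sketch that runs the same kind of ntile(3) query against an in-memory SQLite database from Python. SQLite (3.25 or newer, for window function support) is only a stand-in for our warehouse, the seven durations are made-up sample values, and the bin alias is our own choice for this example.

```python
import sqlite3

# Hypothetical sample data: seven page-load durations in seconds.
durations = [0.4, 0.6, 0.9, 1.1, 1.5, 2.3, 9.8]

conn = sqlite3.connect(":memory:")
conn.execute("create table client_timing_logs (app_duration real)")
conn.executemany(
    "insert into client_timing_logs (app_duration) values (?)",
    [(d,) for d in durations],
)

# Same shape as the query above: slowest rows first, three bins.
rows = conn.execute(
    """
    select
      app_duration
      , ntile(3) over (order by app_duration desc) as bin
    from client_timing_logs
    order by app_duration desc
    """
).fetchall()

for app_duration, bin_number in rows:
    print(app_duration, bin_number)
```

Seven rows cannot be split into three equal bins, so the first bin receives the extra row: the bin sizes are 3, 2, and 2.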



Let’s wrap this query in a common table expression named binnedTimes so we can look at the distribution over days within each bin. For our query, we want to look at 10 bins for the app response time over the last 10 days. We’ll take the average of all the run times for each (date, bin) pair so that anomalies stand out within each bin. This gives us a sense of the performance characteristics for users at each level of response time.



<pre class="wp-block-code"><code>with
  binnedTimes as (
    select
      -- alias the date so the outer query can reference it by name
      date(created_at) as log_date
      , app_duration
      , ntile(10) over (
          order by
            app_duration desc
        )
        as ntile
    from
      client_timing_logs
    where
      created_at > now() - interval '10 days'
  )
select
  log_date
  , ntile
  , avg(app_duration)
from
  binnedTimes
group by
  1
  , 2
order by
  2 desc</code></pre>



We take the average of each bin and plot the proportional distribution, that is, how much of the runtime in each bin came from each day.



Each bar represents the average response time for one 10% slice of page loads. The first bar is the fastest 10% of page loads, and the last bar is the slowest 10%.



We can see the same spike on June 22 that we saw in our original graph, and by checking the other 9 bars, we know that it is only the slowest responses that have been affected.



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/AverageApplicationPageLoad.png" alt="Average application page load" class="wp-image-77050"/></figure>



This type of graph lets us easily compare across two dimensions: the distribution of response times along the X axis, and the different dates within each stacked bar.



In this case, we were able to determine that our data was invalid. We were measuring the time between when a request was made and when we received a “done loading” message.



Yet, if a user closed their laptop mid-request, we wouldn’t receive a “done loading” message until they opened it again. This made it appear that requests were taking much longer than they actually were.



We can compare this data anomaly to a failure one of our caching web servers experienced in February, and see how run times were longer in bins 10, 9, and 8 during the outage:



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/cache_outage.png" alt="Cache outage" class="wp-image-77063"/></figure>



On February 18, we see a “fan out” type shape as requests in the fastest bins returned more slowly than average, a result of dramatically increased cache misses.



This shows us that the left-most points of our distribution had shifted towards the right, as opposed to the clustering of outliers we saw in the first example.



<h3 class="wp-block-heading">Tracking Percentiles Over Time</h3>



A popular use case of ntile is to <a href="https://blog.carbonfive.com/2013/01/17/postgresql-aggregates-medians-and-a-brief-command-reference/" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">find the median</a> by using ntile(2) and taking the maximum of the first bin.
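As a quick check of that median trick, here is a minimal sketch using an in-memory SQLite database from Python. The helper name median_via_ntile, the table t, and the sample values are all hypothetical, and we assume the rows are ordered ascending so the first bin holds the lower half.

```python
import sqlite3


def median_via_ntile(values):
    """Return max() of the first ntile(2) bin, ordered ascending."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table t (v real)")
    conn.executemany("insert into t (v) values (?)", [(v,) for v in values])
    (result,) = conn.execute(
        """
        with halves as (
          select v, ntile(2) over (order by v) as half from t
        )
        select max(v) from halves where half = 1
        """
    ).fetchone()
    conn.close()
    return result


print(median_via_ntile([5, 1, 4, 2, 3]))  # odd number of rows
print(median_via_ntile([4, 1, 3, 2]))     # even number of rows
```

With an odd number of rows the first bin gets the extra row, so its maximum is the true median. With an even number of rows this returns the lower of the two middle values rather than their average, which is usually close enough for monitoring.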



We can easily compare the traditional quartile summary,&nbsp;[minimum, Q1, median, Q3, maximum], over time by using the same strategy and&nbsp;ntile(4). For example, we use this query to track performance patterns over the last six months:



<pre class="wp-block-code"><code>with
  binnedTimes as (
    select
      date(created_at) as log_date
      , resource_duration
      -- bin each day's rows in ascending order so bin 1 holds the fastest quarter
      , ntile(4) over (
          partition by
            date(created_at)
          order by
            resource_duration
        )
        as ntile
    from
      client_timing_logs
    where
      created_at > now() - interval '180 days'
  )
select
  log_date as date
  , ntile
  , max(resource_duration) as value
from
  binnedTimes
group by
  1
  , 2</code></pre>



When the bins are assigned in ascending order of resource_duration,&nbsp;max(resource_duration)&nbsp;for the first, second, third, and fourth bins gives Q1, the median, Q3, and the maximum. We use both&nbsp;max&nbsp;and&nbsp;avg&nbsp;to track performance distributions. Using the average is an easy way to dampen outliers in the minimum and maximum bins. Here we plot time on the X axis to track the response time from the resource server with&nbsp;ntile(5), using the average of each bin as an approximation of the quintile.



<figure class="wp-block-image fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/resource_duration.png" alt="Resource duration" class="wp-image-77073"/></figure>



This lets us easily check for things like day-of-week effects. The peaks in the blue line are long response times on weekends. Because there are fewer total requests on weekends, long-running cache misses have a higher impact on the average.
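Going back to the quartile idea, here is a minimal sketch of per-day quartiles, run against an in-memory SQLite database from Python. It assumes bins are assigned within each day (partition by) and in ascending order. The two days of data, the day column, and the bin alias are invented for this example.

```python
import sqlite3

# Hypothetical sample: two days with eight resource durations each.
logs = [("2020-06-01", d) for d in [1, 2, 3, 4, 5, 6, 7, 8]] + [
    ("2020-06-02", d) for d in [10, 20, 30, 40, 50, 60, 70, 80]
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table client_timing_logs (day text, resource_duration real)"
)
conn.executemany("insert into client_timing_logs values (?, ?)", logs)

# The max of each bin approximates Q1, the median, Q3, and the maximum.
quartiles = conn.execute(
    """
    with binned as (
      select
        day
        , resource_duration
        , ntile(4) over (
            partition by day
            order by resource_duration
          ) as bin
      from client_timing_logs
    )
    select day, bin, max(resource_duration) as value
    from binned
    group by day, bin
    order by day, bin
    """
).fetchall()

for day, bin_number, value in quartiles:
    print(day, bin_number, value)
```

Each day produces its own four rows, so a slow day shows up as higher values in that day's bins instead of being absorbed into one global ranking.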



A key advantage to using ntile is you only need to specify the number of tiles. We didn’t have to set ranges, use integer division or <a href="http://blog.shlomoid.com/2011/08/how-to-quickly-create-histogram-in.html" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">negative rounding</a> to create the bins, so we can be confident that our query will function even as the data grows and evolves over time.
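To illustrate the difference, here is a minimal sketch that bins the same rows two ways in an in-memory SQLite database from Python: once with fixed-width bins, where we have to pick a width ourselves (one second here, by truncating to an integer), and once with ntile. The durations are made-up values with one slow outlier, and the bin alias is our own.

```python
import sqlite3

# Hypothetical durations in seconds, skewed by one slow outlier.
durations = [0.2, 0.3, 0.4, 0.5, 0.7, 0.9, 1.4, 12.0]

conn = sqlite3.connect(":memory:")
conn.execute("create table client_timing_logs (app_duration real)")
conn.executemany(
    "insert into client_timing_logs values (?)", [(d,) for d in durations]
)

# Fixed-width bins: truncating to whole seconds hard-codes a 1-second width.
fixed_width = conn.execute(
    """
    select cast(app_duration as integer) as bin, count(*)
    from client_timing_logs
    group by bin
    order by bin
    """
).fetchall()

# ntile bins: we only choose how many bins we want.
ntile_bins = conn.execute(
    """
    select bin, count(*)
    from (
      select ntile(4) over (order by app_duration) as bin
      from client_timing_logs
    )
    group by bin
    order by bin
    """
).fetchall()

print("fixed width:", fixed_width)
print("ntile:", ntile_bins)
```

The fixed-width version piles most rows into the first bin and leaves a gap before the outlier, while ntile puts two rows in every bin no matter how the durations are spread.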



Check out ntile for <a href="https://www.postgresql.org/docs/8.4/functions-window.html" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">Postgres</a>, <a href="https://docs.microsoft.com/en-us/sql/t-sql/functions/ntile-transact-sql?redirectedfrom=MSDN&amp;view=sql-server-ver15" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">SQL Server</a>, or <a href="https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions101.htm" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">Oracle</a> and let us know if it works for you!

