In Talking Data, we delve into the rapidly evolving worlds of Natural Language Processing and Generation. Text data is proliferating at a staggering rate, and programming languages like Python and R are among the most practical tools for pulling insights out of these datasets at scale.

As the title suggests, in this article we’ll explore best practices in natural language processing (NLP). To do this, we’ll use machine learning (ML) algorithms, which are based on concepts of linear algebra and statistics. These best practices will help us address a common question: “How do we represent text for machine learning systems?”


Text preprocessing

Understanding the real meaning of words by analyzing the context of the surrounding text is called semantic analysis. One of the first things you have to do before semantic analysis in an NLP project is text preprocessing. This vital practice makes the data more understandable for the algorithms. To understand text preprocessing, let’s use a common natural language processing task, sentiment analysis, as an example.

Sentiment analysis has many applications, and it’s something we do as humans without really thinking. For instance: When customers write product reviews, a human reader can easily identify whether a given review is positive or negative. For computers, this process is much more complicated, so preprocessing steps are essential to get clean data.

Let’s take a sample review posted for a product on https://www.fakereviewsite.com:

https://www.fakereviewsite.com Can’t do sugar.  Have tried scores of SF Syrups.  NONE of them can touch the excellence of this product.<br /><br />Thick, delicious.  Perfect.  3 ingredients: Water, Maltitol, Natural Maple Flavor.  PERIOD.  No chemicals.  No garbage.<br /><br/> Unbelievably delicious…<br /><br /> Can you tell I like it? 🙂

Preprocessing works like this:

1. Begin by removing the URLs and HTML tags

URLs like https://www.fakereviewsite.com in the above example and HTML tags like <br /> don’t add any value to the review text. Think of this as doing “noise removal” for a piece of recorded audio.

Sample code:

import re
from bs4 import BeautifulSoup

# Strip URLs first, then let BeautifulSoup parse out the remaining HTML tags
text_review = re.sub(r"http\S+", "", text_review)
text = BeautifulSoup(text_review, 'lxml').get_text()

Output:

Can’t do sugar.  Have tried scores of SF Syrups.  NONE of them can touch the excellence of this product.Thick, delicious.  Perfect.  3 ingredients: Water, Maltitol, Natural Maple Flavor.  PERIOD.  No chemicals.  No garbage. Unbelievably delicious… Can you tell I like it? 🙂

2. Expand the contractions (text standardization)

Contractions, the shortened/combined forms of commonly used words, are usually easier for humans to understand. We often think of them as making text and speech sound more “natural.” But for machines, using the full words allows them to handle sentiment analysis more easily. That’s why the next step involves expanding these contractions into complete words (e.g., “can’t” becomes “can not”). This is also called “text standardization.”

Sample code:

text = re.sub(r"Can\'t", "can not", text)

Output:

can not do sugar.  Have tried scores of SF Syrups.  NONE of them can touch the excellence of this product.Thick, delicious.  Perfect.  3 ingredients: Water, Maltitol, Natural Maple Flavor.  PERIOD.  No chemicals.  No garbage. Unbelievably delicious… Can you tell I like it? 🙂
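
The regex above handles only the single contraction in our sample. In practice, you would expand many contractions at once. Here is a minimal sketch using a small hand-rolled mapping (the CONTRACTIONS dictionary below is illustrative, not exhaustive):

import re

# Illustrative, incomplete contraction map; extend as needed
CONTRACTIONS = {
    "can't": "can not",
    "won't": "will not",
    "n't": " not",  # generic fallback: "doesn't" -> "does not"
    "'re": " are",
    "'ve": " have",
}

def expand_contractions(text):
    # Try longer patterns first so "can't" wins over the generic "n't" fallback
    for pattern in sorted(CONTRACTIONS, key=len, reverse=True):
        text = re.sub(re.escape(pattern), CONTRACTIONS[pattern], text, flags=re.IGNORECASE)
    return text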

3. Remove numeric and alphanumeric words

Remove numbers and any word containing numbers, as numbers do not add any value in sentiment analysis.

Sample code:

# Drop any whitespace-delimited token that contains a digit
text = re.sub(r"\S*\d\S*", "", text).strip()

Output:

can not do sugar.  Have tried scores of SF Syrups.  NONE of them can touch the excellence of this product.Thick, delicious.  Perfect.   ingredients: Water, Maltitol, Natural Maple Flavor.  PERIOD.  No chemicals.  No garbage. Unbelievably delicious… Can you tell I like it? 🙂

4. Remove punctuation and special characters

Remove all punctuation and special characters: pound signs (#), commas, periods, parentheses, etc. The text analysis we are about to perform derives no value from these characters.

Sample code:

# Collapse every run of non-alphanumeric characters into a single space
text = re.sub('[^A-Za-z0-9]+', ' ', text)

Output:

can not do sugar Have tried scores of SF Syrups NONE of them can touch the excellence of this product Thick delicious Perfect ingredients Water Maltitol Natural Maple Flavor PERIOD No chemicals No garbage Unbelievably delicious Can you tell I like it

5. Convert the text to lowercase

Next, we continue preprocessing the text by converting it all to lowercase. This is because the ML system interprets words with different cases as being different words. For example, “delicious” and “Delicious” appearing in the same text will be counted as two different words, as opposed to two instances of the same word!

6. Remove stop words

“Stop words” occur in every language and include common words like “we,” “they,” “can,” “not,” etc. These stop words do not add any value in NLP analysis, so we remove them. The Natural Language Toolkit (NLTK) ships a list of English stop words in nltk.corpus.stopwords.

However, be careful with stop words like “no” and “not,” as these words change the meaning of a sentence. For example, “tasty” and “not tasty” lead to two different sentiments.

So, we generally do not remove these specific stop words.

Sample code for text to lowercase and remove stop words:

from nltk.corpus import stopwords

# Keep "no" and "not" so negations survive (see the caveat above)
stop_words = set(stopwords.words('english')) - {'no', 'not'}
text = ' '.join(e.lower() for e in text.split() if e.lower() not in stop_words)

Output:

not sugar tried scores sf syrups none touch excellence product thick delicious perfect ingredients water maltitol natural maple flavor period no chemicals no garbage unbelievably delicious tell like

7. Lemmatization

Lemmatization means converting a word to its meaningful base form. For example, “tried” is turned into “try,” “cries” to “cry,” “cars” to “car,” etc. So, words like “tried,” “tries,” and “try” will all be considered by the system as multiple instances of the same word: “try.” Lemmatization can also be considered a part of normalization.

Sample code:

import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
tokenization = nltk.word_tokenize(text)
# pos="v" lemmatizes each word as a verb: "tried" -> "try"
output = ' '.join([lemmatizer.lemmatize(w, pos="v") for w in tokenization])

Output:

not sugar try score sf syrup none touch excellence product thick delicious perfect ingredients water maltitol natural maple flavor period no chemical no garbage unbelievably delicious tell like
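
Putting all seven steps together, here is a minimal end-to-end sketch of the pipeline described above (the helper name preprocess_review is our own, and the contraction handling is deliberately reduced to the one case in our sample):

import re
import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocess_review(text):
    text = re.sub(r"http\S+", "", text)              # 1. remove URLs
    text = BeautifulSoup(text, 'lxml').get_text()    # 1. remove HTML tags
    text = re.sub(r"Can't", "can not", text)         # 2. expand contractions (simplified)
    text = re.sub(r"\S*\d\S*", "", text).strip()     # 3. drop numeric/alphanumeric words
    text = re.sub('[^A-Za-z0-9]+', ' ', text)        # 4. drop punctuation/special characters
    stop_words = set(stopwords.words('english')) - {'no', 'not'}
    words = [w.lower() for w in text.split() if w.lower() not in stop_words]  # 5 & 6
    lemmatizer = WordNetLemmatizer()
    return ' '.join(lemmatizer.lemmatize(w, pos="v") for w in words)          # 7. lemmatize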

That does it for text preprocessing. Let’s move to the next best practice of NLP: data tokenization.


Data tokenization

Tokenization is one of the most common best practices when working with text data. Tokenization means splitting the text into sentences or splitting the sentences into words. These split units are called tokens.

For example, the phrase “best practices in natural language processing” would be tokenized as: “best,” “practices,” “in,” “natural,” “language,” and “processing.”

Tokenization is important in NLP because the essence of a text can often be interpreted by analyzing the tokens it contains.

Sample code:

tokenization = nltk.word_tokenize(output)

Output:

['not', 'sugar', 'try', 'score', 'sf', 'syrup', 'none', 'touch', 'excellence', 'product', 'thick', 'delicious', 'perfect', 'ingredients', 'water', 'maltitol', 'natural', 'maple', 'flavor', 'period', 'no', 'chemical', 'no', 'garbage', 'unbelievably', 'delicious', 'tell', 'like']
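
Since tokenization also works at the sentence level, here is a minimal sketch using nltk.sent_tokenize on a fragment of the original (unpreprocessed) review, which still has its sentence boundaries:

import nltk

sentences = nltk.sent_tokenize("Thick, delicious. Perfect. Can you tell I like it?")
# ['Thick, delicious.', 'Perfect.', 'Can you tell I like it?']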

Word embedding

Word embedding is a feature-engineering technique in NLP in which words or tokens are mapped to vectors of real numbers. Each word is converted into a vector with some fixed number of dimensions, say “d.”

This process produces a learned representation of text in which words or tokens with similar meanings have similar representations and can easily be understood by machine learning algorithms.

Word embedding extracts meaning from text data, since it also captures semantic relationships such as gender, verb tense, and country-capital pairings.

There are various word embedding models available, such as word2vec by Google, GloVe by Stanford, and fastText by Facebook.

As an example, let’s look at GloVe’s 300-dimensional embedding for the word “delicious.”
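
The lookup below assumes an embeddings_index dictionary mapping each word to its GloVe vector. Here is a minimal sketch of how it could be built (the file name glove.6B.300d.txt is an assumption, matching the standard 300-dimensional GloVe download):

import numpy as np

embeddings_index = {}
# Each line of a GloVe file holds a word followed by its vector components
with open('glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')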

Sample code:

embeddings_index["delicious"]

Output:

array([
-0.27801  , -0.14519  ,  0.49453  ,  0.12529  , -0.057677 ,
0.70151  ,  0.28826  , -0.20441  ,  0.03009  ,  1.3899   ,
-0.26564  ,  0.43441  , -0.47501  , -0.13348  ,  0.24737  ,
-0.45528  , -0.67027  ,  1.1701   ,  0.040979 , -0.12553  ,
-0.075785 , -0.27344  , -0.049158 , -0.42694  , -0.041506 ,
-0.11606  ,  0.3838   , -0.1245   ,  0.018793 , -0.78534  ,
0.055234 ,  0.007924 ,  0.055043 , -0.20598  , -0.06414  ,
-0.10594  , -0.06305  , -0.027444 , -0.26364  ,  0.64279  ,
-0.29369  ,  0.11132  ,  0.28754  , -0.27219  ,  0.48337  ,
0.93093  , -0.23844  ,  0.61936  ,  0.12593  , -0.24751  ,
-0.08677  ,  0.19172  , -0.36446  , -0.071028 ,  0.64807  ,
0.12868  , -0.046247 ,  0.42061  , -0.12793  ,  0.19642  ,         
0.68146  , -0.55865  , -0.27874  ,  0.039101 , -0.17919  ,
-0.59897  ,  0.20486  ,  0.15241  ,  0.34993  ,  0.47898  ,
0.36544  ,  0.57892  ,  0.24779  , -0.35317  ,  0.2616   ,
-0.22896  , -0.22391  ,  0.16569  , -0.61168  , -0.18378  ,
-0.023205 , -0.18056  ,  0.054312 , -0.1776   , -0.098411 ,
-0.6113   ,  0.38856  ,  0.88379  , -0.29055  , -0.12958  ,
0.015754 ,  0.23812  ,  0.10429  ,  0.41016  ,  0.23708  ,
1.0123   , -0.86614  , -0.16838  ,  0.066406 , -0.0050272,
-0.22711  , -0.28863  ,  0.36877  , -0.25895  , -0.22054  ,
-0.31888  ,  0.58853  ,  0.21332  ,  0.55837  , -0.23193  ,
-0.21208  , -0.56664  , -0.66216  ,  0.22095  ,  0.12373  ,
-0.48547  , -0.44839  ,  0.0091947, -0.27908  ,  0.014443 ,
0.21652  ,  0.18283  , -0.35423  , -0.4034   ,  0.27554  ,
-0.52523  ,  0.436    , -0.60157  ,  0.18374  ,  0.11548  ,
-0.17291  , -0.89038  ,  0.38101  ,  0.32373  , -0.25688  ,
0.19965  , -0.11587  ,  0.025945 , -0.041164 , -0.31512  ,
-2.5693   ,  0.35069  ,  0.57936  ,  0.24183  , -0.025946 ,
-1.205    , -0.061434 ,  0.057562 ,  0.55008  , -0.016724 ,
0.32504  ,  0.11345  ,  0.40961  ,  0.16263  ,  0.19778  ,
-0.017293 , -0.12313  , -0.1714   ,  0.80623  , -0.039624 ,
-0.39989  ,  0.52971  ,  0.42122  ,  0.077219 , -0.005385 ,
-0.63461  ,  0.25213  , -0.37005  ,  0.20136  , -0.052806 ,
0.59213  , -0.12338  ,  0.22055  , -0.195    , -0.66998  ,
-0.18867  ,  0.022199 , -0.82456  , -0.080926 , -0.41921  ,
0.034355 , -0.69545  ,  0.24931  , -0.10916  ,  0.12605  ,
-0.75361  , -0.033696 , -0.2305   , -0.0046053, -0.31902  ,
-0.31114  , -0.036903 , -0.022895 , -0.13569  , -0.27607  ,
-0.1031   ,  0.30701  ,  0.34186  ,  0.45891  , -0.13587  ,
-0.45223  , -0.090191 , -0.099395 , -0.074891 ,  0.19245  ,
0.29045  , -0.39008  ,  0.73567  ,  0.62942  , -0.2174   ,
-0.51588  , -0.19546  ,  0.081308 , -0.030894 , -0.34068  ,
0.28449  ,  0.40505  , -0.33851  ,  0.24456  ,  0.093235 ,
-0.53475  ,  0.34995  ,  0.1656   ,  0.80673  ,  0.48231  ,
-0.39488  , -0.20581  ,  0.063178 ,  0.065316 ,  0.17051  ,
-0.16726  , -0.28956  , -0.50795  , -0.39699  ,  0.19386  ,      
-0.16445  , -0.4318   , -0.47626  , -0.20233  ,  0.11089  ,
0.13755  ,  0.026714 , -0.93893  , -0.20077  ,  0.20623  ,
0.7216   , -0.58006  , -0.38965  , -0.2282   ,  0.17188  ,
-0.3815   ,  0.04917  , -0.32791  ,  0.10739  ,  0.023031 ,
0.67157  ,  0.32911  ,  0.28143  ,  0.036222 ,  0.1453   ,
-0.12512  , -0.27149  ,  0.04054  , -0.020042 , -0.056311 ,
0.34275  , -0.62091  , -0.0058532, -0.49363  ,  0.072698 ,
-0.81502  , -0.026662 , -0.23517  , -0.34235  , -0.54425  ,
0.45515  ,  0.085665 ,  0.070533 ,  0.36966  ,  0.95099  ,
0.47395  , -0.1195   ,  0.12501  , -0.50397  ,  0.10813  ,
0.53519  ,  0.58557  ,  0.56703  , -0.17101  ,  0.48838  ,
0.46119  ,  0.4737   , -0.14692  ,  0.0055303, -0.37672  ,
-0.090149 ,  0.52314  , -0.97767  ,  0.18443  ,  0.25023  ],
dtype=float32)

The token vectors produced by word embedding feed the next ML step (not covered in this exercise): modeling. Modeling is where we teach a program to learn from training data using ML algorithms.
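
One simple, common way to hand these vectors to a downstream model is to average the token vectors into a single document vector per review. A minimal sketch, assuming the embeddings_index and token list from above:

import numpy as np

tokens = ['not', 'sugar', 'try', 'delicious']  # subset of our review's tokens, for illustration
vectors = [embeddings_index[t] for t in tokens if t in embeddings_index]
doc_vector = np.mean(vectors, axis=0)  # one 300-dimensional feature vector for the review
print(doc_vector.shape)  # (300,)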

Proper preparation, accurate execution

Now we’ve got a properly prepared text dataset: Unnecessary filler words have been removed, all text is lowercase and in its simplest form, and words have been assigned values that our ML algorithms can work with. These best practices will help you get the most from your text analysis, pulling sentiments out of human words without having to read each one yourself. And that’s just the start of what ML can do for you. Happy coding!


Nidhi Bansal is a data scientist, machine learning/artificial intelligence enthusiast, and writer who loves to experiment with data and write about it. She has over a decade of experience in software development in various programming languages and holds a B.Tech and an M.E. in Electronics and Communications Engineering.
