Your data warehouse is a vital part of your business, so making decisions like upgrading your read replica vs switching to Redshift are important. If you value fast queries, Redshift is the way to go.
When benchmarking Amazon Redshift against Amazon RDS Postgres, Redshift came out to be 100-1,000 times faster on common analytics queries.
To make the comparison as fair as possible, we benchmarked the largest RDS Postgres box (DB.R3.8XLarge) against a similarly priced and spec’d Redshift cluster (16 DW2.Large nodes). Both our RDS Postgres box and our Redshift cluster used default settings.
We ran each test query 3 times on an otherwise idle setup. The reported time is the average of the second two executions.
Each query was run against a transactions table that’s comprised of:
- 1 billion rows
- 50 million unique users in user_id
- 10 thousand unique products in product_id
- Timestamps spanning one year in created_at
- And a dozen extra columns representing various attributes of the transaction
The RDS Postgres version of this table had indexes on created_at, user_id, and product_id.
The Redshift table used user_id as the dist key, (user_id, created_at) as the sort key, and the compression encodings recommended by analyze compression.
Both tables were analyzed and vacuumed before running any queries.
Many of our customers look at metrics like Daily Revenue, Daily Active Users, and Daily ARPU. On average, Redshift was 500x faster than RDS Postgres:
Here are the metrics queries we tested:
-- Daily Revenue select date(created_at), sum(amount) from transactions group by 1 -- Daily Active Users select date(created_at), count(distinct user_id) from transactions group by 1 -- Daily ARPU select date(created_at), sum(amount) / count(distinct user_id) from transactions group by 1
Whether it’s 30-day retention or unique sessions, many analytics queries rely on being able to count the distinct number of elements in a set very fast. On average, Redshift was 200x faster than RDS Postgres for these queries.
Here are the distincting queries we tested:
-- Users per Product select product_id, count(distinct user_id) from transactions group by 1 -- Products per User select user_id, count(distinct product_id) from transactions group by 1 -- Products per Date select date(created_at), count(distinct product_id) from transactions group by 1
How is Redshift so Fast?
Redshift owes its speed to the following three factors:
Compressed Columnar Storage
Postgres stores data by row. This means you have to read the whole table to sum the price column.
Redshift stores its data organized by column. This allows the database to compress records because they’re all the same data type. Once they’re compressed, there’s less data to read off disk and store in RAM.
Block Storage and 100% CPU
Postgres does not use multiple cores for a single query. While this allows more queries to run in parallel, no single query can use all of the machine’s resources.
Redshift stores each table’s data in thousands of chunks called blocks. Each block can be read in parallel. This allows the Redshift cluster to use all of its resources on a single query.
Clusters make IOPS easy
The RDS Postgres box we used had the standard 3K Input/Output Operations Per Second (IOPS). Even raising that to 10K IOPS for another $1000 a month barely moved the needle. Reading from disk is just really slow.
A Redshift cluster can achieve higher much IOPS. Each node reads from a different disk and the IOPS sums across the cluster. Our benchmark cluster achieved over 50K IOPS.
Faster Queries without the Extra Work
You’ll always have faster results querying a lot of data on Redshift versus on a large read replica such as RDS Postgres.
If you want the speed of Redshift but don’t want to spend the time ETLing your data, Sisense can give you the best of both worlds.