In this post, we’ll explore how to calculate trendlines from a scatter plot of data. Sometimes the trend is obvious, but other times not so obvious. Calculating the precise linear trendline can give us a good idea of how one variable correlates with another. We can use SQL operations to do this!
Let’s work through creating the following chart comparing IMDb and RottenTomatoes ratings.
First, consider the dataset of the top 50 movies by domestic gross (not adjusted for inflation) from each year since 1980.
select * from movie_data order by gross_dollars desc limit 10;
We have some interesting data. Let’s look at the ratings.
Finding the Correlation Between IMDb and Tomatometer Ratings
How do IMDb and Tomatometer ratings relate? We’d like to understand the relationship between IMDb ratings and the RottenTomatoes Tomatometer. Does a movie rating on one scale translate directly to the other, or does one of the databases skew towards a different scale?
select title , tomatometer , imdbrating * 10 as imdbrating_n from movie_data order by tomatometer asc
This query returns each title with its Tomatometer and IMDb ratings. IMDb ratings are between 0.0 and 10.0, so we scale it by 10 to normalize to the Tomameter scale from 0 to 100. With Sisense for Cloud Data Teams, we can easily take this tabular data and build a scatter plot.
It appears the IMDb ratings correlate linearly with Tomatometer scores. With the help of easy calculations guide, we can calculate the the slope and intercept of the least square regression line with a few lines of SQL.
with -- define the dataset with X and Y columns dataset as ( select title , tomatometer as X , imdbrating * 10 as Y from movie_data order by tomatometer asc ) -- calculate all the primary values , terms as ( select count(*) as cnt , sum(X) as sum_X , sum(Y) as sum_Y , sum(X * Y) as sum_X_times_Y , sum(X * X) as sum_X_squared from dataset ) -- use the primary values to calculate the slope , slope as ( select ( cnt * sum_X_times_Y - sum_X * sum_Y ) / (cnt * sum_X_squared - sum_X * sum_X) as slope from terms ) -- use the primary values and slope to calculate the intercept , intercept as ( select ( terms.sum_Y - slope.slope * terms.sum_X ) / terms.cnt as intercept from terms , slope ) -- select the slope and interface select * from slope , intercept
Cool! Now we know the parameters. It appears that each IMDb rating earns about 49 points on the Tomatometer for free, but for every point increase on the Tomatometer, the IMDb rating increases just .27 points. This means movies with low Tomatometer scores tend to be overrated on IMDb and those with high scores tend to be underrated on IMDb. We can calculate the point at which the trend goes from overrated to underrated by finding the point of intersection of the trendline and the line y = x. This comes to about 68 points on the Tomatometer.
Instead of just returning the slopes and intercept, we can alter the query to plot points on the trendline by replacing
select * from slope , intercept
select distinct tomatometer , (tomatometer * slope) + intercept as imdbrating_predicted from movie_data , slope , intercept order by tomatometer asc
Finally, we can overlay the trendline on the scatter plot.
How Have the Ratings of the Top Grossing Movies Changed Over Time?
Now that we have all the tools we need to build trendlines with SQL, let’s try to answer one more question: how have the ratings of the top grossing movies in each year changed over time? First, we define a query to get the top 15 grossing movies for each year.
with ranks as ( select DATE_PART('year', year) - DATE_PART('year', '1980-01-01'::date) as years_since_1980 , title , gross_dollars , tomatometer , rank() over(partition by year order by gross_dollars desc) as rank from movie_data ) select years_since_1980 , title , gross_dollars , tomatometer from ranks where rank <= 15 order by years_since_1980 asc
Replace the dataset of the early calculation with the output of this query and you get the following trendline.
This shows the average rating of the top 15 films has not changed over time. There’s virtually no correlation. With these two examples, we’ve shown how SQL can be used compute trendlines to analyze your data.