In this post, we’ll explore how to calculate trendlines from a scatter plot of data. Sometimes the trend is obvious, but other times not so obvious. Calculating the precise linear trendline can give us a good idea of how one variable correlates with another. We can use SQL operations to do this!

Let’s work through creating the following chart comparing IMDb and RottenTomatoes ratings.

First, consider the dataset of the top 50 movies by domestic gross (not adjusted for inflation) from each year since 1980.

`select * from movie_data order by gross_dollars desc limit 10;`

We have some interesting data. Let’s look at the ratings.

## Finding the Correlation Between IMDb and Tomatometer Ratings

How do IMDb and Tomatometer ratings relate? We’d like to understand the relationship between IMDb ratings and the RottenTomatoes Tomatometer. Does a movie rating on one scale translate directly to the other, or does one of the databases skew towards a different scale?

```
select
title
, tomatometer
, imdbrating * 10 as imdbrating_n
from
movie_data
order by
tomatometer asc
```

This query returns each title with its Tomatometer and IMDb ratings. IMDb ratings are between 0.0 and 10.0, so we scale it by 10 to normalize to the Tomameter scale from 0 to 100. With Sisense for Cloud Data Teams, we can easily take this tabular data and build a scatter plot.

It appears the IMDb ratings correlate linearly with Tomatometer scores. With the help of easy calculations guide, we can calculate the the slope and intercept of the least square regression line with a few lines of SQL.

```
with
-- define the dataset with X and Y columns
dataset as (
select
title
, tomatometer as X
, imdbrating * 10 as Y
from
movie_data
order by
tomatometer asc
)
-- calculate all the primary values
, terms as (
select
count(*) as cnt
, sum(X) as sum_X
, sum(Y) as sum_Y
, sum(X * Y) as sum_X_times_Y
, sum(X * X) as sum_X_squared
from
dataset
)
-- use the primary values to calculate the slope
, slope as (
select
(
cnt * sum_X_times_Y - sum_X * sum_Y
)
/ (cnt * sum_X_squared - sum_X * sum_X) as slope
from
terms
)
-- use the primary values and slope to calculate the intercept
, intercept as (
select
(
terms.sum_Y - slope.slope * terms.sum_X
)
/ terms.cnt as intercept
from
terms
, slope
)
-- select the slope and interface
select
*
from
slope
, intercept
```

Results:

Cool! Now we know the parameters. It appears that each IMDb rating earns about 49 points on the Tomatometer for free, but for every point increase on the Tomatometer, the IMDb rating increases just .27 points. This means movies with low Tomatometer scores tend to be overrated on IMDb and those with high scores tend to be underrated on IMDb. We can calculate the point at which the trend goes from overrated to underrated by finding the point of intersection of the trendline and the line y = x. This comes to about 68 points on the Tomatometer.

Instead of just returning the slopes and intercept, we can alter the query to plot points on the trendline by replacing

```
select
*
from
slope
, intercept
```

with

```
select distinct
tomatometer
, (tomatometer * slope) + intercept as imdbrating_predicted
from
movie_data
, slope
, intercept
order by
tomatometer asc
```

Finally, we can overlay the trendline on the scatter plot.

### How Have the Ratings of the Top Grossing Movies Changed Over Time?

Now that we have all the tools we need to build trendlines with SQL, let’s try to answer one more question: how have the ratings of the top grossing movies in each year changed over time? First, we define a query to get the top 15 grossing movies for each year.

```
with
ranks as (
select
DATE_PART('year', year)
- DATE_PART('year', '1980-01-01'::date)
as years_since_1980
, title
, gross_dollars
, tomatometer
, rank() over(partition by year order by gross_dollars desc)
as rank
from
movie_data
)
select
years_since_1980
, title
, gross_dollars
, tomatometer
from
ranks
where
rank <= 15
order by
years_since_1980 asc
```

Replace the dataset of the early calculation with the output of this query and you get the following trendline.

This shows the average rating of the top 15 films has not changed over time. There’s virtually no correlation. With these two examples, we’ve shown how SQL can be used compute trendlines to analyze your data.

**Tags:**data team