Data professionals are often tasked with finding a correlation between variables in a dataset to determine if one variable (the x) can be a strong predictor of another (the y). For those new to this type of analysis, a great place to start is with linear regressions. Linear regressions involve fitting a set of independent and dependent variables to a linear equation that attempts to find any sort of correlated relationship.
The model most commonly used to make these predictions is the least squares model, which takes a scatter plot and fits the line that minimizes the sum of the squared vertical distances between the line and each point. That line uses the formula y = mx + b, where m is the slope of the line, calculated as m = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²), and b is the y-intercept, calculated as b = (Σy − mΣx) / n, with n being the number of points plotted. Before using a linear regression model, it is important to understand what it is, when it should and should not be used, and how to evaluate its performance.
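To make the slope and intercept formulas concrete, here is a minimal sketch in plain Python. The function name fit_line and the sample points are our own illustrative assumptions, not part of any library; the points were chosen to lie exactly on y = 2x + 1 so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least squares line y = m*x + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Denominator of the slope formula: n*Σx² − (Σx)²
    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical; slope is undefined")

    m = (n * sum_xy - sum_x * sum_y) / denom
    b = (sum_y - m * sum_x) / n
    return m, b


# Made-up points (assumption) that lie exactly on y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
m, b = fit_line(xs, ys)
print(m, b)  # slope 2.0, intercept 1.0
```

With real data the points will not sit exactly on the line, and m and b describe the best compromise across all of them.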
When to Use a Linear Regression
Common use cases for linear regressions are determining the strength of the relationship between two variables, understanding how much a change in an independent variable will affect a dependent variable, or predicting a future set of results.
In the case of a prediction, linear regression can be used for either interpolation or extrapolation. To illustrate the difference, consider a set of x values between 1 and 1000, but no data for the exact value of 500. Interpolation is determining a fair estimate of the y value for x = 500. Extrapolation would be predicting the y outcome for x values outside the range of our dataset, for example 1,100.
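The distinction can be sketched in a few lines of Python. The slope and intercept below are hypothetical placeholder values rather than a fit to real data, and the helper names predict and prediction_type are our own.

```python
def predict(m, b, x):
    """Evaluate the regression line y = m*x + b at x."""
    return m * x + b


def prediction_type(x, observed_xs):
    """Label a prediction by whether x falls inside the observed x range."""
    lo, hi = min(observed_xs), max(observed_xs)
    return "interpolation" if lo <= x <= hi else "extrapolation"


# x values from 1 to 1000 with no observation at exactly 500
observed_xs = [x for x in range(1, 1001) if x != 500]

# Hypothetical fitted coefficients (assumption, for illustration only)
m, b = 0.5, 10.0

for x in (500, 1100):
    print(x, prediction_type(x, observed_xs), predict(m, b, x))
```

The arithmetic is identical in both cases. What changes is how much we should trust the answer, since nothing in the data confirms that the linear trend continues beyond x = 1000.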
Equally important is knowing when not to use a linear regression. For example, the amount saved in a 401k vs. the annual return might fit more of a quadratic model, starting with a slow incline and becoming much steeper as the value of the account grows. A great way to start determining if a linear regression is appropriate is plotting your data with a scatter plot to look for visible trends. In Sisense for Cloud Data Teams, we can easily plot our data and draw a line of least squares by clicking the “Show Trendlines” box.
Looking at this dataset from a fictitious gaming company showing revenue vs. gameplays, we can see that there is clearly a positive relationship between the two variables that fits a linear equation and is probably worth exploring further.
Common Mistakes and Misconceptions
It’s important to note that just because your data fits a linear model does not necessarily mean that x causes y. It merely means that x is a good predictor of y. For example, you might build a regression that shows a positive correlation between sunscreen sales and amusement park attendance. However, it would be misguided for an amusement park to hold a sale on sunscreen in hopes of increasing ticket sales, since both are more plausibly driven by a third factor such as sunny weather.
Another common mistake is to remove outliers from your dataset without justification. Excluding outliers can be a valid practice if the outlier is an impossible outcome or represents bad data, such as a 1000% on a test. However, if the outlier represents a possible (but unlikely) outcome, then it is usually best to leave it in. These points are often among the most interesting and insightful for a business and are worth studying.
Like wine, models tend to get better with time. As we introduce more data, the estimates of the slope and intercept become more stable, so the model generally predicts outcomes better. We might start with a sparse dataset that produces a weak model, but with time our model can improve if there is a true correlation between the variables.
Applying a Linear Regression Model to Our Data
Once we have determined that a linear regression analysis might be appropriate for our dataset, we can use the power of Python or R for deeper exploration. Our Data Community is a great source for coding examples and has detailed walkthroughs on how to perform linear regression analysis in Python and in R, depending on your preferred language.
The first step will be to split your data into training and test datasets. The best way to test the effectiveness of a model is to set aside a smaller set of data to be later used to see how the model performed in predicting a y variable. A 70/30 split is commonly used. When making the split, it is essential that entries are randomly assigned to the testing or training dataset. If these two groups have some inherent differences, that will lead to inaccurate model generation.
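As a rough sketch of a random 70/30 split, the snippet below uses only Python's standard library. The function name split_train_test, the fixed seed, and the sample rows are our own assumptions; libraries such as scikit-learn provide their own helpers for this step.

```python
import random


def split_train_test(rows, test_fraction=0.3, seed=42):
    """Randomly assign rows to a training set and a test set."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")

    shuffled = list(rows)                  # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable

    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Made-up (x, y) rows (assumption, for illustration only)
rows = [(x, 2 * x + 1) for x in range(100)]
train, test = split_train_test(rows)
print(len(train), len(test))  # 70 training rows, 30 test rows
```

Shuffling before slicing is what keeps the split random. Taking the first 70 rows of a dataset that is sorted by date or by value would build exactly the kind of inherent difference between the two groups that we want to avoid.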
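Evaluating the Performance of Our Model
After generating the model, it’s helpful to look at the m coefficient, which quantifies the slope of our regression. The sign of the slope shows whether the relationship is positive or negative, and its magnitude is the expected change in y for a one-unit increase in x. Because the slope depends on the units of x and y, it does not by itself measure how strong the correlation is.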
Next, you might want to evaluate the performance of your model. The posts linked above detail how your model score can be found, but the closer your score is to 1, the better the model predicted the y variable, and the closer to 0, the further our predicted values were from the actual y values. The model score is also known as R squared (R²), which divides the sum of the squared differences between the regression line and the mean y value by the sum of the squared differences between each data point and the mean y value: R² = Σ(ŷᵢ − ȳ)² / Σ(yᵢ − ȳ)².
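Here is a small sketch of that calculation in plain Python. The data is made up, and the predicted values are assumed to come from the least squares line y = 0.6x + 2.2 fitted to those same points. The snippet also computes the equivalent form 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)², which many libraries report as the score. The two agree for a least squares line with an intercept evaluated on the data it was fitted to, but they can differ on a held-out test set.

```python
def r_squared_explained(actual, predicted):
    """R² as explained variation over total variation."""
    mean_y = sum(actual) / len(actual)
    ss_reg = sum((p - mean_y) ** 2 for p in predicted)
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return ss_reg / ss_tot


def r_squared_residual(actual, predicted):
    """R² as 1 minus residual variation over total variation."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ss_res / ss_tot


# Made-up data (assumption); predictions are 0.6*x + 2.2 for x = 1..5
actual = [2, 4, 5, 4, 5]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]

r2_explained = r_squared_explained(actual, predicted)
r2_residual = r_squared_residual(actual, predicted)
print(round(r2_explained, 4), round(r2_residual, 4))  # both about 0.6
```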
Residuals are another performance indicator. A residual is the difference between the actual value and the predicted value. Residuals can be visualized as a scatter plot with the residuals on the y axis. Alternatively, we can look at the MAE (mean absolute error) or RMSE (root mean squared error). The mean absolute error takes the sum of the absolute values of the residuals divided by the number of data points: MAE = Σ|yᵢ − ŷᵢ| / n. The RMSE, calculated as √(Σ(yᵢ − ŷᵢ)² / n), takes the square root of the mean of the squared residuals. While both equations serve a similar purpose, the RMSE adds weight to data points with larger residuals, which may be desired if you would like to penalize the model for predictions that missed by a larger margin. Both values range from 0 to infinity, with values closer to 0 indicating that our model’s predicted values came closer to the actual y values.
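Both error measures are short enough to sketch directly in plain Python. The function names and the sample values below are our own illustrative assumptions.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of the absolute residuals."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared residual."""
    mse = sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)


# Made-up actual and predicted values (assumption, for illustration only)
actual = [2, 4, 5, 4, 5]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]

mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(round(mae, 4), round(rmse, 4))  # about 0.64 and 0.6928
```

The RMSE is never smaller than the MAE on the same data, and the gap between them widens when a few predictions miss by a large margin.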
Adding Additional Variables
Evaluating the relationship between two variables is often a great place to start when predicting outcomes. However, once you are comfortable with single-variable linear regressions, you might notice that the y value can be influenced by more than one variable. We can also look at how multiple variables combine to influence a single dependent variable, and determine which are stronger and weaker indicators. If you feel ready to start getting your feet wet with the next step in linear regression, feel free to check out these multivariate linear regressions.
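For a first taste of what that looks like, here is a minimal sketch that fits y = b0 + b1·x1 + b2·x2 by solving the least squares normal equations in plain Python. The helper names and the data are our own assumptions. The y values were generated from y = 1 + 2·x1 + 3·x2 so the recovered coefficients are easy to check. In practice you would likely reach for a statistics library rather than hand-rolled elimination.

```python
def solve_linear_system(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique solution")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_multiple(rows, ys):
    """Return [intercept, coef_1, ..., coef_k] minimizing squared error."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up predictors (assumption); y is generated from 1 + 2*x1 + 3*x2
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

coefs = fit_multiple(rows, ys)
print([round(c, 6) for c in coefs])  # close to [1.0, 2.0, 3.0]
```

Each coefficient is the expected change in y for a one-unit change in that variable with the other held fixed. Comparing coefficients as stronger or weaker indicators is only fair once the variables are on comparable scales.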