In my last post I talked about why you need to check your regression analysis residuals. In a nutshell, your predictors should be so good at explaining (or predicting) the response that only the inherent randomness of any real-world phenomenon is left over for the error portion. If you observe explanatory or predictive power in the error, you know that your predictors are missing some of the predictive information. In this post I'll cover a specific type of pattern that you can see in the residuals and show you how to fix the problem.
Regression residuals should have a constant spread across all fitted values. If your plot looks like the one below, you've got a problem known as heteroscedasticity or non-constant variance. You can see that as the fitted values get larger, so does the vertical spread of the residuals. That increasing spread represents predictive information that is leaking over into your residual plot. If you can figure out why the variability of your residuals changes, you can improve your analysis accordingly.
Why fix the problem? There are two big reasons:
Today, I'll look at a common solution that Minitab statistical software provides, weighted regression.
We'll model the number of car accidents as a function of the population. You can download the data to do this yourself. This is not real-world data but it accurately reflects the problem and the cure.
*It is crucial to use standardized residuals. Regular residuals will show us the problem with non-constant variance. However, only standardized residuals will show us that we have fixed the problem. Standardized residuals have other benefits as well, but that's the subject of another post.
After running the analysis, out pops the Residuals versus fits plot that is shown above. As we assess the model, we notice that this plot displays residuals with non-constant variance. For this example, we'll only look at the Residuals versus fits plots that we create because they are the only place where we can see the heteroscedasticity.
Cross-sectional studies, such as this one, are at a greater risk of exhibiting heteroscedasticity due to a greater disparity between the largest and smallest values of a predictor. Think of small towns versus large cities! In this case, we focus on population size because the values vary greatly. It is likely that the variance of the error increases with the size of the population.
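If you want to see this pattern outside of Minitab, here is a minimal Python sketch. It does not use the downloadable data set. Instead it simulates hypothetical data in which the error variance is proportional to population (the coefficients, noise level, and sample size are all assumptions chosen for illustration), fits an ordinary regression with NumPy, and compares the residual spread for low and high fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data (not the post's data set): the error standard
# deviation grows with sqrt(population), so the variance grows with population.
population = rng.uniform(1_000, 1_000_000, size=500)
accidents = 50 + 0.002 * population + rng.normal(0, 0.5 * np.sqrt(population))

# Ordinary (unweighted) least squares: accidents = b0 + b1 * population
X = np.column_stack([np.ones_like(population), population])
coef, *_ = np.linalg.lstsq(X, accidents, rcond=None)
fitted = X @ coef
residuals = accidents - fitted

# Compare the residual spread in the lower and upper halves of the fitted values.
order = np.argsort(fitted)
half = len(order) // 2
spread_low = residuals[order[:half]].std()
spread_high = residuals[order[half:]].std()
print(f"residual SD, low fitted values:  {spread_low:.1f}")
print(f"residual SD, high fitted values: {spread_high:.1f}")
```

A noticeably larger spread for the high fitted values is the numerical counterpart of the fan shape in the Residuals versus fits plot.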
Weighted regression is one method that you can use to correct non-constant variance in the residuals. Determining the proper weight to use can be a challenging task and requires subject-area knowledge. This procedure is particularly useful when you can identify a variable that changes with the variance of the residuals.
For the reasons discussed above, the weights in this example are based on the population variable. I created a column of weights by calculating the reciprocal of the population (1/population) for each row. You can do this in Minitab with the tools in Calc > Calculator.
Weighted regression works by weighting each data point based on the variability of its fitted value. In this case, data points with a larger population have residuals with a higher variance. We want to give places with a higher population a lower weight in order to shrink their squared residuals. With the proper weight, this procedure minimizes the sum of weighted squared residuals to produce residuals with a constant variance (homoscedasticity).
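For readers who like to see the mechanics, here is a rough Python sketch of the idea. It is not Minitab's implementation. It relies on the standard fact that minimizing the sum of weighted squared residuals is the same as running ordinary least squares after multiplying each row by the square root of its weight. The function name weighted_least_squares and the simulated data are hypothetical, and the weights are the 1/population values described above.

```python
import numpy as np

def weighted_least_squares(X, y, w):
    """Minimize sum(w * (y - X @ b)**2) by rescaling each row by sqrt(w)."""
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef

rng = np.random.default_rng(1)

# Hypothetical synthetic data: error variance proportional to population.
population = rng.uniform(1_000, 1_000_000, size=500)
accidents = 50 + 0.002 * population + rng.normal(0, 0.5 * np.sqrt(population))

X = np.column_stack([np.ones_like(population), population])
weights = 1.0 / population  # larger populations get smaller weights

coef_wls = weighted_least_squares(X, accidents, weights)
print("weighted intercept and slope:", coef_wls)
```

The choice of 1/population only makes sense if the error variance really is roughly proportional to population. A different variance pattern would call for different weights.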
So, let's rerun this analysis with the column of weights.
Out pops another Residuals versus fits plot!
This one looks much better! The vertical spread of the residuals is consistent across the range of fitted values. We've cured the case of heteroscedasticity!
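The same before-and-after comparison can be sketched in Python. The example below again uses hypothetical simulated data, fits the weighted model, and computes standardized residuals with one common definition: the weighted residual divided by s times the square root of (1 minus leverage). That definition is an assumption on my part and may differ in detail from what Minitab reports. The helper spread_ratio is also made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical synthetic data: error variance proportional to population.
population = rng.uniform(1_000, 1_000_000, size=500)
accidents = 50 + 0.002 * population + rng.normal(0, 0.5 * np.sqrt(population))

X = np.column_stack([np.ones_like(population), population])
w = 1.0 / population
sw = np.sqrt(w)

# Weighted fit via the sqrt(weight) rescaling.
Xw, yw = X * sw[:, None], accidents * sw
coef, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
fits = X @ coef
raw_resid = accidents - fits

# Leverages of the rescaled design, taken from a reduced QR decomposition.
Q, _ = np.linalg.qr(Xw)
leverage = np.sum(Q**2, axis=1)

# Standardized residuals: weighted residual / (s * sqrt(1 - leverage)).
weighted_resid = sw * raw_resid
s = np.sqrt(np.sum(weighted_resid**2) / (len(accidents) - X.shape[1]))
std_resid = weighted_resid / (s * np.sqrt(1.0 - leverage))

def spread_ratio(resid, fitted):
    """SD of residuals in the upper half of fits divided by SD in the lower half."""
    order = np.argsort(fitted)
    half = len(order) // 2
    return resid[order[half:]].std() / resid[order[:half]].std()

print(f"raw residual spread ratio:          {spread_ratio(raw_resid, fits):.2f}")
print(f"standardized residual spread ratio: {spread_ratio(std_resid, fits):.2f}")
```

A ratio near 1 for the standardized residuals, alongside a ratio well above 1 for the regular residuals, illustrates why only the standardized residuals can show that the weights did their job.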
We've just compared the residual plots and it is clear that weighted regression produced better residuals than the regular regression. We can trust the weighted regression results.
If you like, you can compare the output in the session window. However, the differences between any given pair of weighted and unweighted analyses are unpredictable. The general takeaway here is that in this type of situation you can trust the weighted results more than the unweighted results.
Keep in mind that there are different reasons why residuals can have non-constant variance. We tackled one that involved a predictor variable that had a large range of values and was associated with the changing variance. Other reasons for heteroscedasticity can include an incorrect model, such as a missing predictor. Weighted regression is not an appropriate solution if the heteroscedasticity is caused by an omitted variable. So, you really have to use your subject-area knowledge to first determine what is causing the problem and then figure out how to fix it!