In regression analysis, we look at the correlations between one or more input variables, or factors, and a response. We might look at how baking time and temperature relate to the hardness of a piece of plastic, or how educational levels and the region of one's birth relate to annual income. The number of potential factors you might include in a regression model is limited only by your imagination...and your capacity to actually gather the data you imagine.
But before throwing data about every potential predictor under the sun into your regression model, stop to consider multicollinearity. With regression, as with so many things in life, there comes a point where adding more is not better. In fact, sometimes not only does adding "more" factors to a regression model fail to make things clearer, it actually makes things harder to understand!
In regression, "multicollinearity" refers to predictors that are correlated with other predictors. Multicollinearity occurs when your model includes multiple factors that are correlated not just to your response variable, but also to each other. In other words, it results when you have factors that are a bit redundant.
You can think about it in terms of a football game: If one player tackles the opposing quarterback, it's easy to give credit for the sack where credit's due. But if three players are tackling the quarterback simultaneously, it's much more difficult to determine which of the three makes the biggest contribution to the sack.
Not that into football? All right, try this analogy instead: You go to see a rock and roll band with two great guitar players. You're eager to see which one plays best. But on stage, they're both playing furious leads at the same time! When they're both playing loud and fast, how can you tell which guitarist has the biggest effect on the sound? Even though they aren't playing the same notes, what they're doing is so similar it's difficult to tell one from the other.
That's the problem with multicollinearity.
Multicollinearity inflates the standard errors of the coefficients. Those larger standard errors, in turn, can make the coefficients of some independent variables appear not significantly different from 0. In other words, by inflating the standard errors, multicollinearity can make variables look statistically insignificant when they should be significant. Without multicollinearity (and thus, with lower standard errors), those same coefficients might be significant.
A little bit of multicollinearity isn't necessarily a huge problem: extending the rock band analogy, if one guitar player is louder than the other, you can easily tell them apart. But severe multicollinearity is a major problem, because it increases the variance of the regression coefficients, making them unstable. The more variance they have, the more difficult it is to interpret the coefficients.
So, how do you know if you need to be concerned about multicollinearity in your regression model? Fortunately, there's a simple way to check.
One way to measure multicollinearity is the variance inflation factor (VIF), which assesses how much the variance of an estimated regression coefficient increases if your predictors are correlated. If no factors are correlated, the VIFs will all be 1.
To have Minitab Statistical Software calculate and display the VIF for your regression coefficients, just select it in the "Options" dialog when you perform your analysis.
With Display VIF selected as an option, Minitab will include a VIF column in the table of coefficients in its output. Consider an example using data on the relationship between researcher salary, number of publications, and years of employment. If a VIF is equal to 1, there is no multicollinearity among the factors, but if a VIF is greater than 1, the predictors may be moderately correlated. In this example, the VIFs for the Publications and Years factors are about 1.5, which indicates some correlation, but not enough to be overly concerned about. A VIF between 5 and 10 indicates high correlation that may be problematic. And if the VIF goes above 10, you can assume that the regression coefficients are poorly estimated due to multicollinearity.
You'll want to do something about that.
If multicollinearity is a problem in your model -- if the VIF for a factor is near or above 5 -- the solution may be relatively simple. Try one of these:

- Remove one of the highly correlated predictors from the model. Because they supply redundant information, removing one often doesn't drastically reduce the R-squared.
- Combine the correlated predictors into a single factor, for example by adding them together or taking their average.
- Use an analysis designed for highly correlated variables, such as principal components analysis or partial least squares regression.
With Minitab Statistical Software, it's easy to use the tools in the Stat > Regression menu to quickly test different regression models and find the best one. If you're not already using it, we invite you to try Minitab for free for 30 days.
Have you ever run into issues with multicollinearity? How did you solve the problem?