Choosing the correct linear regression model can be difficult. After all, the world and how it works are complex, and trying to model it with only a sample doesn’t make it any easier. In this post, I'll review some common statistical methods for selecting models, describe complications you may face, and provide some practical advice for choosing the best regression model.
It starts when a researcher wants to mathematically describe the relationship between some predictors and the response variable. The research team tasked with investigating typically measures many variables but includes only some of them in the model. The analysts try to eliminate the variables that are not related to the response and include only those with a true relationship. Along the way, the analysts consider many possible models.
They strive to achieve a Goldilocks balance with the number of predictors they include.
- Too few: An underspecified model tends to produce biased estimates.
- Too many: An overspecified model tends to have less precise estimates.
- Just right: A model with the correct terms has no bias and the most precise estimates.
Statistical Methods for Finding the Best Regression Model
For a good regression model, you want to include the variables that you are specifically testing along with other variables that affect the response in order to avoid biased results. Minitab statistical software offers statistical measures and procedures that help you specify your regression model. I’ll briefly review the common methods here.
Adjusted R-squared and Predicted R-squared: Generally, you choose the models that have higher adjusted and predicted R-squared values. These statistics are designed to avoid a key problem with regular R-squared—it increases every time you add a predictor and can trick you into specifying an overly complex model.
- The adjusted R-squared increases only if the new term improves the model more than would be expected by chance. It can also decrease when you add poor-quality predictors.
- The predicted R-squared is a form of cross-validation, and it can also decrease. Cross-validation assesses how well your model generalizes to other data sets by holding out part of your data, fitting the model to the rest, and checking how well it predicts the held-out observations.
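To make these three statistics concrete, here is a minimal sketch in Python with NumPy (my choice for illustration, not Minitab's implementation). The function name `r_squared_summary` and the simulated data are hypothetical. It computes predicted R-squared from the PRESS statistic, which is the leave-one-out form of cross-validation, and uses the leverage shortcut so the model never has to be refit.

```python
import numpy as np

def r_squared_summary(X, y):
    """Return (regular, adjusted, predicted) R-squared for an OLS fit.

    X: (n, p) array of predictors WITHOUT an intercept column.
    y: (n,) response.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])           # add the intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sse = resid @ resid
    sst = np.sum((y - y.mean()) ** 2)

    r2 = 1 - sse / sst
    adj_r2 = 1 - (sse / (n - p - 1)) / (sst / (n - 1))

    # Leverages h_ii are the row sums of Q**2 from a reduced QR of A.
    # The leave-one-out residual for row i is e_i / (1 - h_ii).
    Q, _ = np.linalg.qr(A)
    h = np.sum(Q ** 2, axis=1)
    press = np.sum((resid / (1 - h)) ** 2)
    pred_r2 = 1 - press / sst
    return r2, adj_r2, pred_r2

# Simulated example (assumed data): two real predictors, then five pure-noise ones.
rng = np.random.default_rng(0)
n = 60
x_real = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * x_real[:, 0] - 1.0 * x_real[:, 1] + rng.normal(size=n)
x_noisy = np.column_stack([x_real, rng.normal(size=(n, 5))])

small = r_squared_summary(x_real, y)
big = r_squared_summary(x_noisy, y)
print("2 predictors: R2=%.3f adj=%.3f pred=%.3f" % small)
print("7 predictors: R2=%.3f adj=%.3f pred=%.3f" % big)
```

Regular R-squared can never go down when the noise columns are added, while the adjusted and predicted versions are free to drop, which is exactly why they are the ones to compare across models.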
P-values for the predictors: In regression, low p-values indicate terms that are statistically significant. “Reducing the model” refers to the practice of including all candidate predictors in the model, and then systematically removing the term with the highest p-value one-by-one until you are left with only significant predictors.
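The "reducing the model" procedure can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `approx_p_values` and `backward_eliminate` are hypothetical, the data are simulated, and the p-values use a normal approximation to the t-distribution to keep the example dependency-free. For small samples you would want proper t-based p-values from your statistical software.

```python
import math
import numpy as np

def approx_p_values(A, y):
    """Two-sided p-values for OLS coefficients (normal approximation)."""
    n, k = A.shape
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - k)               # residual variance
    cov = sigma2 * np.linalg.inv(A.T @ A)          # coefficient covariance
    z = beta / np.sqrt(np.diag(cov))
    return np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])

def backward_eliminate(X, y, names, alpha=0.05):
    """Remove the term with the highest p-value until all are below alpha."""
    keep = list(range(X.shape[1]))
    while keep:
        A = np.column_stack([np.ones(len(y)), X[:, keep]])
        p = approx_p_values(A, y)[1:]              # skip the intercept
        worst = int(np.argmax(p))
        if p[worst] < alpha:
            break                                  # everything left is significant
        keep.pop(worst)                            # drop one term, then refit
    return [names[i] for i in keep]

# Simulated example (assumed data): x1 and x2 matter, the rest are noise.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)
names = ["x1", "x2", "noise1", "noise2"]

selected = backward_eliminate(X, y, names)
print(selected)
```

Note that the model is refit after every removal, because p-values for the remaining terms can change each time a term leaves the model.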
Stepwise regression and Best subsets regression: These are two automated procedures that can identify useful predictors during the exploratory stages of model building. With best subsets regression, Minitab provides Mallows’ Cp, which is a statistic specifically designed to help you manage the tradeoff between precision and bias.
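For readers who want to see what a best subsets search with Mallows' Cp looks like mechanically, here is a small sketch. It is not Minitab's procedure: the function names and the simulated data are hypothetical, and it brute-forces every subset, which is only practical for a handful of candidate predictors. It uses the usual definition Cp = SSE_subset / MSE_full - n + 2 * (number of parameters, including the intercept).

```python
from itertools import combinations
import numpy as np

def sse(A, y):
    """Residual sum of squares for an OLS fit of y on the columns of A."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def best_subsets_cp(X, y, names):
    """Return (Cp, n_params, predictor names) for every subset, best Cp first."""
    n, k = X.shape
    ones = np.ones((n, 1))
    mse_full = sse(np.hstack([ones, X]), y) / (n - k - 1)
    rows = []
    for size in range(1, k + 1):
        for subset in combinations(range(k), size):
            params = size + 1                      # predictors plus intercept
            A = np.hstack([ones, X[:, list(subset)]])
            cp = sse(A, y) / mse_full - n + 2 * params
            rows.append((cp, params, [names[i] for i in subset]))
    return sorted(rows, key=lambda row: row[0])

# Simulated example (assumed data): x1 and x2 matter, the rest are noise.
rng = np.random.default_rng(0)
n, k = 100, 4
X = rng.normal(size=(n, k))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)
names = ["x1", "x2", "noise1", "noise2"]

results = best_subsets_cp(X, y, names)
for cp, params, subset in results[:5]:
    print("Cp=%8.2f  params=%d  %s" % (cp, params, subset))
```

A Cp value close to the number of parameters suggests a model with little bias. The full model always has Cp equal to its parameter count by construction, so look for smaller models that also land near that line.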
Real World Complications
Great, there are a variety of statistical methods to help us choose the best model. Unfortunately, there also are a number of potential complications. Don’t worry, I’ll provide some practical advice!
- The best model can be only as good as the variables measured by the study. The results for the variables you include in the analysis can be biased by the significant variables that you don’t include. Read about an example of omitted variable bias.
- Your sample might be unusual, either by chance or by data collection methodology. False positives and false negatives are part of the game when working with samples.
- P-values can change based on the specific terms in the model. In particular, multicollinearity can sap significance and make it difficult to determine the role of each predictor.
- If you assess enough models, you will find variables that appear to be significant but are only correlated by chance. This form of data mining can make random data appear significant. Checking the predicted R-squared is a good way to detect this problem, because a model built on chance correlations predicts new observations poorly and its predicted R-squared will be low.
- P-values, predicted and adjusted R-squared, and Mallows’ Cp can suggest different models.
- Stepwise regression and best subsets regression are great tools and can get you close to the correct model. However, studies have found that they generally don’t pick the correct model.
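One way to check for the multicollinearity mentioned in the list above is the variance inflation factor (VIF). The VIF is a standard diagnostic that I am bringing in here for illustration; the function name `vif` and the simulated data are hypothetical. Each predictor is regressed on all the others, and VIF = 1 / (1 - R-squared) from that regression.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        # Regress predictor j on an intercept plus all the other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out.append(1 / (1 - r2))
    return np.array(out)

# Simulated example (assumed data): x2 is nearly a copy of x1, x3 is independent.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)

vifs = vif(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 1))
```

A VIF near 1 means a predictor is essentially unrelated to the others. A common rule of thumb treats values above roughly 5 or 10 as a sign that multicollinearity may be inflating standard errors and sapping significance.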
Recommendations for Finding the Best Regression Model
Choosing the correct regression model is as much a science as it is an art. Statistical methods can help point you in the right direction but ultimately you’ll need to incorporate other considerations.
Research what others have done and incorporate those findings into constructing your model. Before beginning the regression analysis, develop an idea of what the important variables are along with their relationships, coefficient signs, and effect magnitudes. Building on the results of others makes it easier both to collect the correct data and to specify the best regression model without the need for data mining.
Theoretical considerations should not be discarded based solely on statistical measures. After you fit your model, determine whether it aligns with theory and possibly make adjustments. For example, based on theory, you might include a predictor in the model even if its p-value is not significant. If any of the coefficient signs contradict theory, investigate and either change your model or explain the inconsistency.
You might think that complex problems require complex models, but many studies show that simpler models generally produce more precise predictions. Given several models with similar explanatory ability, the simplest is most likely to be the best choice. Start simple, and only make the model more complex as needed. The more complex you make your model, the more likely it is that you are tailoring the model to your dataset specifically, and generalizability suffers.
Verify that added complexity actually produces narrower prediction intervals. Check the predicted R-squared and don’t mindlessly chase a high regular R-squared!
As you evaluate models, check the residual plots because they can help you avoid inadequate models and help you adjust your model for better results. For example, the bias in underspecified models can show up as patterns in the residuals, such as the need to model curvature. The simplest model that produces random residuals is a good candidate for being a relatively precise and unbiased model.
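Residual plots are best judged by eye, but the idea can be sketched numerically. In this hypothetical simulation, the true relationship is curved. I fit a straight line and then a model with a squared term, and use the correlation between the residuals and x squared as a crude stand-in for "is there a curved pattern left in the residual plot?" The variable names and the data are assumptions for illustration only.

```python
import numpy as np

def fit_residuals(A, y):
    """Residuals from an OLS fit of y on the columns of A."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ beta

# Simulated data (assumed): the response really does curve with x.
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 120)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(scale=1.0, size=x.size)

ones = np.ones_like(x)
resid_linear = fit_residuals(np.column_stack([ones, x]), y)        # underspecified
resid_quad = fit_residuals(np.column_stack([ones, x, x**2]), y)    # curvature modeled

# Leftover curvature: how strongly do the residuals track x squared?
curv_linear = abs(np.corrcoef(resid_linear, x**2)[0, 1])
curv_quad = abs(np.corrcoef(resid_quad, x**2)[0, 1])
print("straight line:     %.3f" % curv_linear)
print("with squared term: %.3f" % curv_quad)
```

The straight-line fit leaves a strong pattern in its residuals, while adding the squared term removes it. This check only looks for one specific pattern, so it is no substitute for actually plotting the residuals against the fitted values and each predictor.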
In the end, no single measure can tell you which model is the best. Statistical methods don't understand the underlying process or subject-area. Your knowledge is a crucial part of the process!
If you're learning about regression, read my regression tutorial!
* The image of Rodin's The Thinker was taken by flickr user innoxius and licensed under CC BY 2.0.