Choosing the correct linear regression model can be difficult. Trying to model it with only a sample doesn’t make it any easier. In this post, we'll review some common statistical methods for selecting models, complications you may face, and provide some practical advice for choosing the best regression model.
It starts when a researcher wants to mathematically describe the relationship between some predictors and the response variable. The research team tasked to investigate typically measures many variables but includes only some of them in the model. The analysts try to eliminate the variables that are not related and include only those with a true relationship. Along the way, the analysts consider many possible models.
They strive to achieve a Goldilocks balance with the number of predictors they include.
Master statistics anytime, anywhere with animated lessons, quizzes and hands-on exercises in Quality Trainer.
For a good regression model, you want to include the variables that you are specifically testing along with other variables that affect the response in order to avoid biased results. Minitab Statistical Software offers statistical measures and procedures that help you specify your regression model.
Adjusted R-squared and Predicted R-squared: Generally, you choose the models that have higher adjusted and predicted R-squared values. These statistics are designed to avoid a key problem with regular R-squared—it increases every time you add a predictor and can trick you into specifying an overly complex model.
P-values for the predictors: In regression, low p-values indicate terms that are statistically significant. “Reducing the model” refers to the practice of including all candidate predictors in the model, and then systematically removing the term with the highest p-value one-by-one until you are left with only significant predictors.
Stepwise regression and Best subsets regression: These are two automated procedures that can identify useful predictors during the exploratory stages of model building. With best subsets regression, Minitab provides Mallows’ Cp, which is a statistic specifically designed to help you manage the tradeoff between precision and bias.
Related Blog: Get even more expert resources to help you along the way with our regression tutorial.
Great, there are a variety of statistical methods to help us choose the best model. Unfortunately, there also are a number of potential complications. Don’t worry, we'll provide some practical advice!
Choosing the correct regression model is as much a science as it is an art. Statistical methods can help point you in the right direction but ultimately you’ll need to incorporate other considerations.
Theory
Research what others have done and incorporate those findings into constructing your model. Before beginning the regression analysis, develop an idea of what the important variables are along with their relationships, coefficient signs, and effect magnitudes. Building on the results of others makes it easier both to collect the correct data and to specify the best regression model without the need for data mining.
Theoretical considerations should not be discarded based solely on statistical measures. After you fit your model, determine whether it aligns with theory and possibly make adjustments. For example, based on theory, you might include a predictor in the model even if its p-value is not significant. If any of the coefficient signs contradict theory, investigate and either change your model or explain the inconsistency.
Complexity
You might think that complex problems require complex models, but many studies show that simpler models generally produce more precise predictions. Given several models with similar explanatory ability, the simplest is most likely to be the best choice. Start simple, and only make the model more complex as needed. The more complex you make your model, the more likely it is that you are tailoring the model to your dataset specifically, and generalizability suffers.
Verify that added complexity actually produces narrower prediction intervals. Check the predicted R-squared and don’t mindlessly chase a high regular R-squared!
Residual Plots
As you evaluate models, check the residual plots because they can help you avoid inadequate models and help you adjust your model for better results. For example, the bias in underspecified models can show up as patterns in the residuals, such as the need to model curvature. The simplest model that produces random residuals is a good candidate for being a relatively precise and unbiased model.
In the end, no single measure can tell you which model is the best. Statistical methods don't understand the underlying process or subject-area. Your knowledge is a crucial part of the process!