We recently got a question from one of our friends on Facebook about stepwise regression. I’m new to stepwise regression myself, and I turned to a Minitab training manual for a little help in trying to explain this analysis. I found an interesting example about identifying the major sources of energy usage at a manufacturing plant that I thought might be helpful to share.
When Is Stepwise Regression Appropriate?
Stepwise regression is an appropriate analysis when you have many variables and you’re interested in identifying a useful subset of the predictors. In Minitab, the standard stepwise regression procedure both adds and removes predictors one at a time. Minitab stops when all variables not included in the model have p-values that are greater than a specified Alpha-to-Enter value and when all variables that are in the model have p-values that are less than or equal to a specified Alpha-to-Remove value.
In addition to the standard stepwise method, Minitab offers two other types of stepwise procedures:
- Forward selection: Minitab starts with no predictors in the model and adds the most significant variable for each step. Minitab stops when all variables not in the model have p-values that are greater than the specified Alpha-to-Enter value.
- Backward elimination: Minitab starts with all predictors in the model and removes the least significant variable for each step. Minitab stops when all variables in the model have p-values that are less than or equal to the specified Alpha-to-Remove value.
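The three procedures described above share one loop and differ only in where they start and which checks they run. Here is a minimal Python sketch of that loop, not Minitab's actual implementation. The function name `stepwise_select`, the `p_values` callable, and the 0.15 default thresholds are all assumptions made for illustration; a real version would get its p-values by fitting a regression at each step.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: given the predictors in a trial model, return the
# p-value of each one. A real version would fit a regression to get them.
PValueFn = Callable[[Sequence[str]], Dict[str, float]]


def stepwise_select(
    candidates: Sequence[str],
    p_values: PValueFn,
    alpha_enter: float = 0.15,   # placeholder threshold, not from the article
    alpha_remove: float = 0.15,  # placeholder threshold, not from the article
    method: str = "stepwise",
) -> List[str]:
    """Pick predictors using Alpha-to-Enter / Alpha-to-Remove thresholds.

    method is "stepwise" (add and remove), "forward" (add only),
    or "backward" (start with everything, remove only).
    """
    if method not in ("stepwise", "forward", "backward"):
        raise ValueError(f"unknown method: {method!r}")

    # Backward elimination starts full; the other two start empty.
    model: List[str] = list(candidates) if method == "backward" else []

    # Cap the number of steps as a guard against add/remove cycling.
    for _ in range(4 * len(candidates) + 1):
        # Removal check: drop the least significant predictor in the model
        # if its p-value is greater than alpha_remove.
        if method != "forward" and model:
            in_model = p_values(model)
            worst = max(model, key=lambda name: in_model[name])
            if in_model[worst] > alpha_remove:
                model.remove(worst)
                continue

        # Entry check: add the most significant predictor outside the model
        # if its p-value is less than or equal to alpha_enter.
        if method != "backward":
            outside = [name for name in candidates if name not in model]
            trial = {name: p_values(model + [name])[name] for name in outside}
            if trial:
                best = min(trial, key=trial.get)
                if trial[best] <= alpha_enter:
                    model.append(best)
                    continue

        break  # nothing entered or left, so the stopping rule is met

    return model
```

Keeping the p-value calculation behind a callable means the selection logic can be read and tested on its own, and any regression routine can be plugged in later.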
Stepwise Regression Example
In this example of using stepwise regression to identify the major sources of energy usage, analysts from the manufacturing plant considered the following predictor variables: total units produced, total equipment run time, staff size, mean outside temperature, minimum outside temperature, maximum outside temperature, percentage of sun, and mean equipment age. This example has only eight predictors, but stepwise regression becomes especially helpful when you have many more, even more than 100 predictor variables!
Their goal was to narrow these variables into a list of the top predictors of energy usage. To run the analysis, analysts chose Stat > Regression > Stepwise in Minitab and completed the dialog box by entering the response ‘Energy’ and the list of predictors from above.
Minitab presented a model that included three predictors: total equipment run time, maximum temperature, and average equipment age. Minitab left out the other variables because their p-values were greater than the Alpha-to-Enter value.
To obtain a final model, analysts chose Stat > Regression > Regression, and completed the dialog box by including ‘Energy’ as the response and the three significant variables as predictors. (To check residual plots, choose Graphs in the dialog box and then under Residual Plots, choose Four in one.)
The regression equation below indicates that energy usage increases as total equipment run time, maximum temperature, and average equipment age increase:
Judging by the T-statistics, total equipment run time is the most significant predictor in the model. Maximum temperature is second, followed by average equipment age.
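For reference, the T-statistic for each predictor is its estimated coefficient divided by that coefficient's standard error. The notation here is the standard textbook form, not something taken from the training manual:

```latex
t_j = \frac{\hat{\beta}_j}{\operatorname{SE}(\hat{\beta}_j)}
```

The larger the absolute value of $t_j$, the smaller the corresponding p-value, which is why the ranking by T-statistic matches the ranking by significance.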
With this analysis, the analysts were able to conclude that energy usage is significantly higher when air conditioner usage is extensive, and that newer equipment appears to reduce energy usage. The plant might want to limit running equipment during peak times when air conditioning is in constant use, and consider purchasing new equipment before the summer season.
Pitfalls of Stepwise Regression
While a lot can be learned with stepwise regression, there are some potential pitfalls to be aware of:
- If two independent variables are highly correlated, only one may end up in the model even though both may be important.
- Because the procedure fits many models, it could be selecting models that fit the data well due to chance alone.
- Stepwise regression may not always end with the model with the highest R² value possible for a given number of predictors.
- Automatic procedures cannot take into account special knowledge the analyst may have about the data. Therefore, the model selected may not be the most practical one.
- Graphing individual predictors against the response is often misleading because graphs do not account for other predictors in the model.
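One way to guard against the first pitfall is to check how strongly the predictors are correlated with each other before running the procedure. In the energy example, the mean, minimum, and maximum outside temperature could plausibly move together. Below is a rough Python sketch of such a check; the function names and the 0.8 cutoff are illustrative assumptions, not a rule from Minitab or the training manual.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length columns."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("columns must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant column")
    return sxy / math.sqrt(sxx * syy)


def correlated_pairs(
    columns: Dict[str, Sequence[float]],
    threshold: float = 0.8,  # illustrative cutoff, an assumption
) -> List[Tuple[str, str, float]]:
    """Return predictor pairs whose |r| is at or above the threshold."""
    flagged = []
    for name_a, name_b in combinations(columns, 2):
        r = pearson_r(columns[name_a], columns[name_b])
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged
```

If a pair is flagged, that is a cue to look at the stepwise results with some skepticism: the predictor that was left out may matter just as much as the one that was kept.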
If you'd like to work with this data set yourself, download the data on Scribd.
A special thanks to Jim for his help with this blog!
How do you typically use stepwise regression?