# Using Stepwise Regression to Explain Plant Energy Usage

We recently got a question from one of our friends on Facebook about stepwise regression. I’m new to stepwise regression myself, and I turned to a Minitab training manual for a little help in trying to explain this analysis. I found an interesting example about identifying the major sources of energy usage at a manufacturing plant that I thought might be helpful to share.

## When Is Stepwise Regression Appropriate?

Stepwise regression is an appropriate analysis when you have many variables and you’re interested in identifying a useful subset of the predictors. In Minitab, the standard stepwise regression procedure both adds and removes predictors one at a time. Minitab stops when all variables not included in the model have p-values that are greater than a specified Alpha-to-Enter value and when all variables that are in the model have p-values that are less than or equal to a specified Alpha-to-Remove value.
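The stopping rule just described can be written as a small check. This is only a sketch of the rule as stated in the paragraph above, not Minitab's actual code. The function name, the dictionary layout, the default alpha values of 0.15, and the p-values are all assumptions made up for illustration.

```python
def stepwise_is_done(p_not_in_model, p_in_model, alpha_enter=0.15, alpha_remove=0.15):
    """Return True when the stopping rule is met.

    p_not_in_model: p-value each excluded predictor would have if it were added.
    p_in_model: p-value of each predictor currently in the model.
    The 0.15 defaults are illustrative assumptions, not values from the post.
    """
    nothing_to_add = all(p > alpha_enter for p in p_not_in_model.values())
    nothing_to_remove = all(p <= alpha_remove for p in p_in_model.values())
    return nothing_to_add and nothing_to_remove


# Hypothetical p-values, invented purely for illustration.
excluded = {"staff_size": 0.42, "pct_sun": 0.61}
included = {"run_time": 0.001, "max_temp": 0.03}
print(stepwise_is_done(excluded, included))
```

If any excluded predictor had a p-value at or below `alpha_enter`, or any included predictor had one above `alpha_remove`, the check would return `False` and the procedure would take another step.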

In addition to the standard stepwise method, Minitab offers two other types of stepwise procedures:

• Forward selection: Minitab starts with no predictors in the model and adds the most significant variable at each step. Minitab stops when all variables not in the model have p-values that are greater than the specified Alpha-to-Enter value.
• Backward elimination: Minitab starts with all predictors in the model and removes the least significant variable at each step. Minitab stops when all variables in the model have p-values that are less than or equal to the specified Alpha-to-Remove value.
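For readers who want to see the two procedures outside Minitab, here is a rough Python sketch using NumPy and SciPy. It is not Minitab's implementation. The helper names (`ols_p_values`, `forward_selection`, `backward_elimination`), the alpha values of 0.15, and the synthetic data are all assumptions for illustration; the data is randomly generated and is not the plant data set.

```python
import numpy as np
from scipy import stats


def ols_p_values(X, y):
    """Fit y ~ intercept + X by least squares; return p-values for the columns of X."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    dof = n - design.shape[1]
    cov = (resid @ resid / dof) * np.linalg.inv(design.T @ design)
    t_stats = beta / np.sqrt(np.diag(cov))
    p = 2 * stats.t.sf(np.abs(t_stats), dof)  # two-sided p-values
    return p[1:]  # drop the intercept


def forward_selection(X, y, names, alpha_enter=0.15):
    selected = []
    while True:
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        if not remaining:
            break
        # p-value each candidate would have if added to the current model
        trial = {j: ols_p_values(X[:, selected + [j]], y)[-1] for j in remaining}
        best = min(trial, key=trial.get)
        if trial[best] > alpha_enter:
            break  # nothing left that passes Alpha-to-Enter
        selected.append(best)
    return [names[j] for j in selected]


def backward_elimination(X, y, names, alpha_remove=0.15):
    selected = list(range(X.shape[1]))
    while selected:
        p = ols_p_values(X[:, selected], y)
        worst = int(np.argmax(p))
        if p[worst] <= alpha_remove:
            break  # everything in the model passes Alpha-to-Remove
        selected.pop(worst)
    return [names[j] for j in selected]


# Synthetic data (an assumption): three real effects plus two pure-noise columns.
rng = np.random.default_rng(0)
n = 200
names = ["run_time", "max_temp", "equip_age", "staff_size", "pct_sun"]
X = rng.normal(size=(n, 5))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(size=n)

forward = forward_selection(X, y, names)
backward = backward_elimination(X, y, names)
print("forward: ", forward)
print("backward:", backward)
```

Both procedures should keep the three predictors that truly drive the response here. Whether a noise column also slips in depends on the random draw and on the alpha you choose.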

## Stepwise Regression Example

In this example of using stepwise regression to identify the major sources of energy usage, analysts from the manufacturing plant considered the following predictor variables: total units produced, total equipment run time, staff size, mean outside temperature, minimum outside temperature, maximum outside temperature, percentage of sun, and mean equipment age. Eight predictors is a fairly short list; stepwise regression can become especially helpful when you have 100 or more predictor variables.

Their goal was to narrow these variables down to a list of the top predictors of energy usage. To run the stepwise procedure, the analysts chose Stat > Regression > Stepwise in Minitab and completed the dialog box by entering the response ‘Energy’ and the list of predictors from above.

The resulting model included three predictors: total equipment run time, maximum temperature, and average equipment age. Minitab left the other variables out of the model because their p-values were greater than the ‘Alpha-to-Enter’ value.

To obtain a final model, analysts chose Stat > Regression > Regression, and completed the dialog box by including ‘Energy’ as the response and the three significant variables as predictors. (To check residual plots, choose Graphs in the dialog box and then under Residual Plots, choose Four in one.)

The resulting regression equation indicates that energy usage increases as total equipment run time, maximum temperature, and average equipment age increase.

Total equipment run time has the largest impact according to the T-statistics. Maximum temperature is second, followed by average equipment age.
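To make the T-statistic comparison concrete, here is a small sketch that fits an ordinary least squares model with NumPy and ranks predictors by the absolute value of their t-statistics (coefficient divided by its standard error). The data is synthetic and the coefficients are invented, so the numbers are assumptions for illustration and will not match the plant's results.

```python
import numpy as np

# Synthetic data (an assumption): three predictors with effects of different sizes.
rng = np.random.default_rng(1)
n = 120
names = ["run_time", "max_temp", "equip_age"]
X = rng.normal(size=(n, 3))
y = 50 + 4.0 * X[:, 0] + 2.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(size=n)

# Ordinary least squares with an intercept column.
design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

# Standard errors from the residual variance and (X'X)^-1.
resid = y - design @ beta
dof = n - design.shape[1]
std_err = np.sqrt(resid @ resid / dof * np.diag(np.linalg.inv(design.T @ design)))
t_stats = beta / std_err

# Rank predictors (skipping the intercept) by |t|.
ranking = sorted(zip(names, t_stats[1:]), key=lambda pair: abs(pair[1]), reverse=True)
for name, t in ranking:
    print(f"{name}: t = {t:.1f}")
```

Unlike raw coefficients, t-statistics do not depend on the units of each predictor, which is why they are a convenient way to compare predictors measured on different scales.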

With this analysis, the analysts were able to conclude that energy usage is significantly higher due to extensive air conditioner usage, and that newer equipment appears to reduce energy usage. The plant might want to limit running equipment during peak times when air conditioning use is consistent and consider purchasing new equipment before the summer season.

## Pitfalls of Stepwise Regression

While a lot can be learned with stepwise regression, there are some potential pitfalls to be aware of:

• If two independent variables are highly correlated, only one may end up in the model even though both may be important.
• Because the procedure fits many models, it could be selecting models that fit the data well due to chance alone.
• Stepwise regression may not always end with the model with the highest R² value possible for a given number of predictors.
• Automatic procedures cannot take into account special knowledge the analyst may have about the data. Therefore, the model selected may not be the most practical one.
• Graphing individual predictors against the response is often misleading because graphs do not account for other predictors in the model.
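The "chance alone" pitfall in the list above can be made concrete with a little arithmetic. If a predictor is pure noise, it still has roughly an alpha chance of passing a single significance test. The sketch below computes the chance that at least one of several noise predictors passes one screening pass. It assumes the tests are independent, which is a simplification, and the alpha of 0.15 and the function name are assumptions chosen for illustration.

```python
def prob_spurious_entry(n_noise_predictors, alpha_enter):
    """Chance that at least one pure-noise predictor looks significant in one pass,
    assuming independent tests (a simplifying assumption)."""
    return 1 - (1 - alpha_enter) ** n_noise_predictors


# 8 matches the number of candidates in the example; 100 matches the larger case.
for k in (1, 8, 100):
    print(k, round(prob_spurious_entry(k, 0.15), 4))
```

The probability climbs quickly as the candidate list grows, which is one reason to combine stepwise results with subject-matter knowledge.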

If you'd like to work with this data set yourself, download the data on Scribd: http://www.scribd.com/doc/91580526/Stepwise-Regression-Plant-Energy

A special thanks to Jim for his help with this blog!

How do you typically use stepwise regression?

Name: tamoghna • Friday, June 15, 2012

Cool article!!
Can we go ahead with stepwise regression if the data is categorical (such as the Titanic survival rate)?

Name: Peter Flom • Saturday, June 16, 2012

Stepwise is a seriously flawed method that gives wrong results. The p-values will be too low, the standard errors too small, the coefficients biased away from 0, and more.

If you must use an automated variable selection procedure, it should be one that penalizes for multiple fits, e.g., lasso or LAR. See, e.g., my paper Stopping Stepwise, available here: www.nesug.org/proceedings/nesug07/sa/sa07.pdf

Name: Carly Barry • Monday, June 18, 2012

Hi Tamoghna - Thanks so much for your comment and for reading my post! We are looking into adding a feature to our next Minitab release that will allow you to perform stepwise regression with the survival rates data used in the Titanic blog post.
-Carly

Name: Carly Barry • Monday, June 18, 2012

Hi Peter - Thank you for mentioning that there are alternate model selection methods. Although “wrong results” is debatable, there certainly are risks with stepwise regression, as with other statistical analyses, which is why I thought it was important to mention the pitfalls. And as your article points out, employing expert knowledge in the model selection process is also imperative to creating a good model and should not be overlooked.

Carly

Name: Safri Ishmayana • Friday, March 22, 2013

Thanks for the great article. I really need this at the moment.

I have a question: in an experiment where the factors can be manipulated, will replication increase the power of the statistical analysis? The example you gave consists of only one replicate for each combination ("condition"). What if the same condition is repeated once or twice?

Name: Carly Barry • Friday, March 22, 2013

Hi Safri - Thank you for reading! In general, the larger your sample size is, the more precise your estimate will be for the strength of the relationship between the response and predictor variables. If this “condition” you refer to could change from one replicate to another, you may want to consider including the “condition” as a factor in your analysis to assess its significance.

Hope this helps!
Carly

Name: peter kolesar • Thursday, August 14, 2014

I could not access the energy consumption file -- it did not appear on the Scribd site. Could you send it?
Professor Kolesar

Name: Eston • Friday, August 15, 2014

Hi Peter - our apologies about not being able to find the file on Scribd; it's there, but not as easy to find as it was when we first published this post. Here's the direct link, which we've also updated above: http://www.scribd.com/doc/91580526/Stepwise-Regression-Plant-Energy

Thank you for reading!
Eston