Using Regression to Evaluate Project Results, part 2
In part 1 of this post, I covered how Six Sigma students at Rose-Hulman Institute of Technology cleaned up and prepared project data for a regression analysis. Now we're ready to start our analysis. We’ll detail the steps in that process and what we can learn from our results.
What Factors Are Important?
We collected data about 11 factors we believe could be significant:
- Whether the date of collection was a Monday or a Tuesday
- The number of trash cans in a team's area
- The ratio of recycle bins to trash cans
- Number of plastic cups and bottles collected
- Number of Java City (a coffee shop on campus) cups collected
- Number of paper sheets collected
- Number of newspapers collected
- Number of glass bottles collected
- Number of aluminum cans collected
- Number of cardboard items collected
- Whether the data was collected pre or post improvement
Just because we collected data about 11 factors doesn't mean that they are all important. Any good regression model should attempt to keep the number of factors down to a minimum. So how do we go about finding out which factors are important? The easiest way is to use Minitab's Best Subsets regression tool! Best Subsets evaluates the regression models that can be formed from different combinations of factors and reports key summary statistics for each. The resulting output table lists the number of factors in each model, its R² and adjusted R², and which factors are included in each model.
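For readers who want to see the idea outside Minitab, here is a minimal Python sketch of a best-subsets search. It is not Minitab's implementation: the function names `fit_r2` and `best_subsets`, the `top` parameter, and the synthetic data are all assumptions made for illustration. It fits every combination of factors by ordinary least squares and keeps the highest-R² models of each size.

```python
from itertools import combinations

import numpy as np


def fit_r2(X, y):
    """Fit least squares with an intercept; return (R2, adjusted R2)."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R2 penalizes each extra factor.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return r2, adj_r2


def best_subsets(X, y, names, top=2):
    """For each model size, keep the `top` subsets ranked by R2."""
    results = []
    for k in range(1, X.shape[1] + 1):
        rows = []
        for cols in combinations(range(X.shape[1]), k):
            r2, adj = fit_r2(X[:, list(cols)], y)
            rows.append((k, r2, adj, tuple(names[c] for c in cols)))
        rows.sort(key=lambda r: r[1], reverse=True)
        results.extend(rows[:top])
    return results


# Synthetic demo data (NOT the project data): only x0 and x2 drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(scale=0.1, size=60)
names = ["x0", "x1", "x2", "x3"]

table = best_subsets(X, y, names)
for k, r2, adj, factors in table:
    print(f"{k} factors  R2={r2:.3f}  adj R2={adj:.3f}  {factors}")
```

An exhaustive search like this fits every non-empty combination, which for 11 factors is 2^11 - 1 = 2,047 models. That is manageable here, but the count doubles with each added factor.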
Results of the Best Subsets
Looking at Adjusted R²
The output from the Best Subsets analysis gave us quite a lot of potential models we could use. Which one should we choose? We used two components to narrow down the options. The first was the adjusted R² value, since this statistic takes into account the number of variables used. We want this value to be as high as possible. When we plot the adjusted R² values against the number of factors in each model, we see a point beyond which adding more factors yields diminishing returns. For this set of data, that point was at five factors.
Notice how at 5 variables and beyond the adjusted R-squared value hits a plateau? That’s our point of diminishing returns!
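We found the plateau by eye, but the same judgment can be written down as a simple rule. The sketch below is one possible rule, not something Minitab does for you: the function name `plateau_size`, the `min_gain` threshold, and the adjusted R² values are all made up for illustration and are not our project's numbers.

```python
def plateau_size(best_adj_r2, min_gain=0.01):
    """Return the smallest model size after which adding one more
    factor improves the best adjusted R2 by less than `min_gain`.

    best_adj_r2 maps number of factors -> best adjusted R2 at that size.
    """
    sizes = sorted(best_adj_r2)
    for small, large in zip(sizes, sizes[1:]):
        if best_adj_r2[large] - best_adj_r2[small] < min_gain:
            return small
    return sizes[-1]  # no plateau found: the largest model keeps helping


# Illustrative values only (not the project's results).
example = {1: 0.52, 2: 0.71, 3: 0.84, 4: 0.845, 5: 0.843}
print(plateau_size(example))
```

The threshold is a judgment call, so it is worth checking the answer against the plot rather than trusting the rule blindly.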
The Factors that Always Seem to Appear
The second component we considered was which factors consistently appeared in the top models generated. If these factors keep appearing in the top models, we reasoned, there's a good chance they’re significant.
When we look at the results from our Best Subsets, we find that five factors are consistently chosen by the algorithm: the numbers of plastic items, paper sheets, newspapers, and aluminum cans collected, and the effect of the improvement efforts.
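Counting how often each factor shows up in the top models is easy to do by hand, but it can also be tallied in a few lines. This is only a sketch: the function name `factor_frequency` and the placeholder factor names "A" through "D" are hypothetical and do not come from our output.

```python
from collections import Counter


def factor_frequency(models):
    """Return the share of models in which each factor appears."""
    counts = Counter(name for model in models for name in set(model))
    total = len(models)
    return {name: counts[name] / total for name in counts}


# Placeholder models, each listed as the factors it contains.
top_models = [
    ("A", "B"),
    ("A", "B", "C"),
    ("A", "C", "D"),
    ("A", "B", "D"),
]

freq = factor_frequency(top_models)
for name, share in sorted(freq.items(), key=lambda kv: -kv[1]):
    print(f"{name}: appears in {share:.0%} of the top models")
```

A factor that appears in nearly every top model is a strong candidate for the final model, though this is a screening heuristic and not a significance test.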
Identifying those five factors enables us to generate our final model.
Verifying the Final Model
Great! So we went through all this and got ourselves a model. Now we are ready to make conclusions, right? Not quite. We still need to ensure that the model we’ve created adheres to the assumptions that are associated with regression analysis. If our model does not meet these assumptions, then we can't make any definitive conclusions. Luckily for us, the process doesn't change from before.
As before, first we need to check whether the errors have a mean of zero and constant variance (that is, whether they are homoscedastic).
Plots used to verify regression assumptions.
As before, the plots give us no reason to believe that the residuals are not independent and identically distributed (IID). Moving on, we check whether the residuals are normally distributed.
Last but not least, we continue to assume that the teams counted accurately, so that our predictor values contain no measurement error.
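We made these checks by looking at Minitab's residual plots. As a rough numeric companion, the sketch below computes three summaries from a model's fitted values and residuals. Everything in it is an assumption for illustration: the function name `residual_checks`, the choice of summaries, and the synthetic data, which is not our project data.

```python
from statistics import NormalDist

import numpy as np


def residual_checks(fitted, resid):
    """Summarize residuals: mean, spread ratio, and Q-Q correlation."""
    fitted = np.asarray(fitted, dtype=float)
    resid = np.asarray(resid, dtype=float)
    n = resid.size

    # Constant variance: compare residual spread for low vs. high fits.
    order = np.argsort(fitted)
    low, high = resid[order[: n // 2]], resid[order[n // 2:]]

    # Normality: correlate sorted residuals with normal quantiles,
    # which is the numeric analogue of a normal probability plot.
    probs = (np.arange(1, n + 1) - 0.5) / n
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    qq_corr = float(np.corrcoef(theo, np.sort(resid))[0, 1])

    return {
        # With an intercept, least squares forces this to ~0, so the
        # residuals-vs-fits plot is still needed to spot trends.
        "mean_residual": float(resid.mean()),
        "spread_ratio": float(high.std(ddof=1) / low.std(ddof=1)),
        "qq_correlation": qq_corr,
    }


# Synthetic straight-line data with well-behaved noise.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=200)
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
checks = residual_checks(fitted, y - fitted)
print(checks)
```

A spread ratio near 1 and a Q-Q correlation near 1 are consistent with the assumptions, but these numbers complement the plots and do not replace them.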
It appears that this new model does in fact meet the regression assumptions. The final model created from this data is:
Final Results: What Have We Learned?
At the end of all of this, we determined our regression model, all ready to go and verified. But what does this single equation we created tell us? What can we use it for?
For starters, we now have an accurate model that we can use to predict the weight of recyclables disposed of in the trash, based solely on five factors. This is nice, as we can predict the weight of recyclables from various areas simply by looking at which items are present in the trash!
We also learned that of the 11 factors we started with, only five of them have a significant relationship with the weight of the recyclables. Plastic cups and bottles, sheets of paper, newspapers, and aluminum cans were found to be significant contributors to the total weight of recyclables disposed of in the trash. This is important to know, since it tells us what to focus on in future efforts.
The last factor found to be significant was the effect of the improvement phase of our project. More importantly, if you look at the equation for the final model, this factor has a negative coefficient. This tells us that our efforts were successful: the improvement had a statistically significant effect, and that effect decreased the amount of recyclables thrown away in the trash.
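To see why a negative coefficient on the improvement factor means a reduction, it helps to look at how a before/after factor enters a regression as a 0/1 indicator. The sketch below uses synthetic data with made-up coefficients, not our model or our measurements, and the variable names are hypothetical.

```python
import numpy as np

# Synthetic data: weight depends on an item count and on whether the
# sample was taken after the improvement (post = 1) or before (post = 0).
rng = np.random.default_rng(2)
n = 80
items = rng.poisson(10, size=n).astype(float)
post = np.repeat([0.0, 1.0], n // 2)
weight = 0.5 + 0.2 * items - 1.5 * post + rng.normal(scale=0.3, size=n)

# Least-squares fit of: weight = b0 + b1 * items + b2 * post
A = np.column_stack([np.ones(n), items, post])
coef, *_ = np.linalg.lstsq(A, weight, rcond=None)
intercept, per_item, post_effect = coef

# b2 is the shift in predicted weight after the improvement,
# holding the item count fixed. A negative value means less weight.
print(f"post-improvement shift: {post_effect:.2f}")
```

The sign shows the direction of the change. Judging whether the change is statistically significant still requires the coefficient's p-value, which Minitab reports in its regression output.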
Now that wasn’t too bad, was it? With regression and a little help from Minitab, there was no chance our data analysis efforts would go to waste!
About the Guest Blogger
Peter Olejnik is a graduate student at the Rose-Hulman Institute of Technology in Terre Haute, Indiana. He holds a bachelor’s degree in mechanical engineering and his professional interests include controls engineering and data analysis.