The goal of regression is to make accurate predictions. Two factors that affect a model’s predictive ability are the terms included in the model (linear, interaction, quadratic) and the sample data used to estimate it. Models with too many terms often overfit the sample data, fitting it closely but predicting new data values poorly.
Previously we discussed how to quickly build, verify, and visualize the predictive model. Now we will get into the more advanced features of validating the model’s predictive power, automating analysis and model selection, and predicting new outcomes.
The figures below show a model that overfits the sample data. When new data from the same process are added, the model does a poor job of predicting the new measurements. If a linear model had been used to fit the original data, more accurate predictions could have been made. Validation is used to prevent building models with low predictive ability.
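For readers who want to see this effect outside Minitab, here is a minimal Python sketch. It uses simulated data (not the data behind the article’s figures) to compare a straight-line fit with a deliberately overfit 9th-degree polynomial when both are asked to predict new measurements from the same process.

```python
# Illustrative sketch (simulated data): an overfit polynomial versus a linear fit,
# each evaluated on new data from the same underlying process.
import numpy as np

rng = np.random.default_rng(1)

def simulate(n):
    x = np.linspace(0, 10, n)
    y = 2.0 + 1.5 * x + rng.normal(scale=2.0, size=n)  # true process: linear + noise
    return x, y

x_train, y_train = simulate(10)
x_new, y_new = simulate(50)          # new measurements from the same process

linear = np.polynomial.Polynomial.fit(x_train, y_train, deg=1)
overfit = np.polynomial.Polynomial.fit(x_train, y_train, deg=9)

def r_squared(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

for name, model in [("linear", linear), ("9th-degree", overfit)]:
    print(f"{name:>10}  train R2 = {r_squared(y_train, model(x_train)):.3f}  "
          f"new-data R2 = {r_squared(y_new, model(x_new)):.3f}")
```

The overfit model reproduces the training points almost perfectly but performs far worse on the new data, which is exactly the pattern validation is designed to catch.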
Validation is a two-step process: first, build a model on a subset of the data (the training set); then use that model to make predictions on the data omitted from model-building (the test set). There are three common validation techniques: leave-one-out, K-fold, and validation with a test set.
When validation is used, the analyst needs to understand the reported model and the corresponding R² values. These R² values describe how much variation the model explains in the sample data and how accurately it predicts new values; higher is better. If overfitting is a problem, the R² values will differ drastically between the training set and the test set, with the test R² being much lower.
Leave-one-out validation omits one data point as the test set. The remaining n−1 observations are used to fit a training model, which is then used to calculate the prediction error for the omitted point. This process is repeated for each observation, and the prediction errors are used to compute the predicted R². Note that predicted R² is standard output for all Regression models.
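As an illustration of the calculation, the following Python sketch (simulated data, scikit-learn assumed) computes a PRESS-style predicted R² via leave-one-out and compares it with the training R².

```python
# Sketch: leave-one-out "predicted R-squared" on simulated data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = 5 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=30)

# Each observation is predicted by a model fit to the other n-1 observations.
y_loo = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())

press = np.sum((y - y_loo) ** 2)                  # prediction error sum of squares
predicted_r2 = 1 - press / np.sum((y - y.mean()) ** 2)
training_r2 = LinearRegression().fit(X, y).score(X, y)
print(f"Training R2 = {training_r2:.3f}, Predicted R2 = {predicted_r2:.3f}")
```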
In K-fold validation, the data are randomly assigned to K equally sized groups, often K = 10. The first group is removed as the test set, and a model is built using the remaining groups as the training set. The omitted group is then predicted with the training model to calculate its prediction error. This process is repeated for each group, and a composite K-fold R² is calculated.
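The same idea can be sketched in Python with scikit-learn and simulated data: each of the K = 10 folds is held out in turn and predicted by a model built on the other folds, and the pooled predictions yield a composite R².

```python
# Sketch: 10-fold cross-validated R-squared on simulated data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 5 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=100)

# Each of the 10 randomly assigned groups is predicted by a model
# built on the remaining 9 groups.
folds = KFold(n_splits=10, shuffle=True, random_state=1)
y_cv = cross_val_predict(LinearRegression(), X, y, cv=folds)

kfold_r2 = 1 - np.sum((y - y_cv) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"10-fold R2 = {kfold_r2:.3f}")
```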
In validation with a test set, a random subset of the data, say 30%, is set aside as the test set, and the remaining 70% (the training set) is used to fit the predictive model. The model is then validated against the test set to calculate the test R².
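A minimal sketch of this split, again with scikit-learn and simulated data:

```python
# Sketch: validation with a 30% test set on simulated data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 5 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=500)

# 70% of the data fits the model; the held-out 30% provides the test R-squared.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
model = LinearRegression().fit(X_train, y_train)

print(f"Training R2 = {model.score(X_train, y_train):.3f}")
print(f"Test R2     = {model.score(X_test, y_test):.3f}")
```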
K-fold validation is better suited to moderately sized samples, while validation with a test set is ideal for very large datasets. It is important to note that the leave-one-out and K-fold techniques validate only the form of the model, not the exact model coefficients, as validation with a test set does.
Model selection for Regression is typically a manual process. However, datasets are not just growing in the number of observations; more variables are also being measured. Removing terms manually can become daunting.
Model selection can be automated. Three common procedures are forward selection, backward elimination, and stepwise selection.
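To make the idea concrete, here is a minimal forward-selection sketch in Python with statsmodels. The simulated data and the 0.15 alpha-to-enter threshold are illustrative assumptions, not Minitab’s implementation.

```python
# Sketch: a simple forward-selection loop on simulated data.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(100, 4)), columns=["x1", "x2", "x3", "x4"])
y = 3 + 2.0 * data["x1"] - 1.5 * data["x3"] + rng.normal(scale=1.0, size=100)

selected, remaining, alpha_to_enter = [], list(data.columns), 0.15
while remaining:
    # p-value of each candidate term when added to the current model
    pvals = {}
    for term in remaining:
        X = sm.add_constant(data[selected + [term]])
        pvals[term] = sm.OLS(y, X).fit().pvalues[term]
    best = min(pvals, key=pvals.get)
    if pvals[best] >= alpha_to_enter:
        break                          # no remaining term meets the entry criterion
    selected.append(best)
    remaining.remove(best)

print("Selected terms:", selected)
```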
These methods often lead to different results; therefore, it is best to infuse industry knowledge to find the most practical and impactful solution.
Regression analysis is a powerful tool, and once the “best” model is selected, it can be used to make predictions. Consider an example involving a clean room at a manufacturing facility. It is important to understand the impact that several predictors have on the count of particles of size 0.5 µm or larger per cubic foot that exceeds a total of 100. Process engineers build a predictive model for particle counts.
The model is used to predict the particle count for a production volume of 1000 with 7 employees and 24 entrances/exits from the clean room.
The predicted average count of particles of size 0.5 µm or larger per cubic foot that exceeds a total of 100 is 87.63. Confidence intervals and prediction intervals quantify the potential error in this prediction.
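For readers working in code, the following statsmodels sketch shows how a point prediction and its confidence and prediction intervals can be obtained. The clean-room data and the fitted model from the article are not available here, so the training data are simulated stand-ins and the output will not match 87.63.

```python
# Sketch: point prediction with 95% confidence and prediction intervals.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
train = pd.DataFrame({
    "volume":    rng.uniform(500, 1500, 60),
    "employees": rng.integers(3, 12, 60),
    "entrances": rng.integers(10, 40, 60),
})
train["particles"] = (20 + 0.05 * train["volume"] + 2 * train["employees"]
                      + 0.5 * train["entrances"] + rng.normal(scale=5, size=60))

model = smf.ols("particles ~ volume + employees + entrances", data=train).fit()

# New settings from the article: production volume 1000, 7 employees, 24 entrances/exits.
new = pd.DataFrame({"volume": [1000], "employees": [7], "entrances": [24]})
pred = model.get_prediction(new).summary_frame(alpha=0.05)

print(pred[["mean",
            "mean_ci_lower", "mean_ci_upper",   # 95% confidence interval for the mean
            "obs_ci_lower", "obs_ci_upper"]])   # 95% prediction interval for a new observation
```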
The ease of use of Minitab empowers analysts to take advantage of all the contemporary tools for Regression. If you’re not already using the power of Minitab to get the maximum value from your data, download a free, fully functional 30-day trial of Minitab Statistical Software today.