In April 2012, I wrote a short paper on using binary logistic regression to analyze wine tasting data. At that time, François Hollande was about to be elected president of France, and in the U.S., Mitt Romney was winning the Republican primaries. That seems like a long time ago…
Now, in 2014, Minitab 17 Statistical Software has just been released. Had Minitab 17 been available in 2012, would I have conducted my analysis differently? Would the results still look similar? I decided to re-analyze my April 2012 data with Minitab 17 and assess the differences, if any.
There were no fewer than 12 parameters to analyze against a binary response. Eleven of them were continuous variables and one was a categorical factor (white or red wine: a qualitative variable). The number of two-factor interactions that could be studied was huge: 12 parameters give 66 possible pairs.
The parameters studied:
Variable | Details | Units
Type | red or white | N/A
pH | acidity (below 7) or alkalinity (over 7) | N/A
Density | density | grams/cubic centimeter
Sulphates | potassium sulfate | grams/liter
Alcohol | percentage alcohol | % volume
Residual sugar | residual sugar | grams/liter
Chlorides | sodium chloride | grams/liter
Free SO2 | free sulphur dioxide | milligrams/liter
Total SO2 | total sulphur dioxide | milligrams/liter
Fixed acidity | tartaric acid | grams/liter
Volatile acidity | acetic acid | grams/liter
Citric acid | citric acid | grams/liter
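The figure of 66 potential two-factor interactions follows directly from the 12 parameters listed above: each unordered pair of distinct parameters is one candidate interaction. A minimal Python sketch of that count (the list name `predictors` is just an illustrative label, not something taken from Minitab):

```python
from itertools import combinations

# The 12 parameters from the table above.
predictors = [
    "Type", "pH", "Density", "Sulphates", "Alcohol", "Residual sugar",
    "Chlorides", "Free SO2", "Total SO2", "Fixed acidity",
    "Volatile acidity", "Citric acid",
]

# Every unordered pair of distinct parameters is one candidate
# two-factor interaction.
interactions = list(combinations(predictors, 2))

print(len(predictors))    # 12
print(len(interactions))  # 66
```

With 12 main effects plus 66 interactions, a full initial model would contain 78 candidate terms, which is why eliminating them by hand was impractical.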
Restricting Analysis to the Main Effects
In 2012, due to the very large number of potential two-factor interactions, I restricted my analysis to the main effects (not considering the interactions between continuous variables).
The terms had to be eliminated manually, one at a time, according to their p values: the term with the highest p value is removed and the model is refitted, until every term remaining in the model has a p value below 0.05. This was a very lengthy process.
To avoid ending up with an excessively complex final model, I eventually decided to analyze white and red wines separately (one model for the white wines, another for the red wines), which amounts to assuming that the effects of some of the variables differ according to the type of wine.
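The manual procedure described above is a backward elimination loop. The sketch below shows only the control flow, under stated assumptions: `fit_p_values` is a hypothetical callable that refits the model on the given terms and returns their p values (in practice it would wrap whatever statistics package you use), and the p values in the small demonstration are placeholders, not results from the wine data.

```python
from typing import Callable, Dict, List


def backward_eliminate(
    terms: List[str],
    fit_p_values: Callable[[List[str]], Dict[str, float]],
    alpha: float = 0.05,
) -> List[str]:
    """Remove the term with the highest p value, refit, and repeat
    until every remaining term has a p value below alpha."""
    remaining = list(terms)
    while remaining:
        p_values = fit_p_values(remaining)  # refit on the current terms
        worst = max(remaining, key=lambda term: p_values[term])
        if p_values[worst] < alpha:
            break  # every remaining term is significant
        remaining.remove(worst)  # drop exactly one term per pass
    return remaining


# Placeholder p values for demonstration only; NOT from the wine data.
# A real fit would return different p values after each refit.
def toy_fit(terms: List[str]) -> Dict[str, float]:
    made_up = {"A": 0.001, "B": 0.40, "C": 0.03, "D": 0.20}
    return {term: made_up[term] for term in terms}


print(backward_eliminate(["A", "B", "C", "D"], toy_fit))  # ['A', 'C']
```

Each pass requires a full refit, so with dozens of candidate terms the number of manual steps grows quickly.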
Including 2-Way Interactions in the Analysis
Using Minitab 17 makes a substantial difference in this respect. All 2-way interactions can be easily selected to generate an initial model:
With Minitab 17, you can use stepwise binary logistic regression to quickly build a final model and identify the significant effects. In 2012, I used a backward elimination approach, starting with all the variables and manually removing one variable at a time.
This lengthy and tedious process takes just a single click in Minitab 17:
The results above show that Alcohol and Acidity (both fixed and volatile) seem to play a major role.
The Residual sugar by Type of wine interaction is only marginally significant: its p value (0.087) is larger than 0.05 but smaller than 0.1.
The R-squared value (R-Sq) is also available in Minitab 17 to assess the proportion of the total variability that is explained by the model. The larger the R-squared value, the more complete the model: a large R-squared means that the model captures most of what drives the response, while a low R-squared means that the model explains only a small part of the variability in the response. In this example, the R-squared is relatively low (28%), leaving 72% of the total variability unexplained by the model.
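For a binary response, an R-squared of this kind is commonly based on deviance rather than on sums of squares. The sketch below is a generic deviance-based version for 0/1 outcomes, assumed here for illustration; it is not taken from Minitab's documentation and may not reproduce the exact R-Sq value that Minitab reports. The function names are hypothetical.

```python
import math
from typing import Sequence


def binomial_deviance(y: Sequence[int], p: Sequence[float]) -> float:
    """Deviance (-2 * log-likelihood) of fitted probabilities p
    for 0/1 outcomes y."""
    return -2.0 * sum(
        math.log(pi) if yi == 1 else math.log(1.0 - pi)
        for yi, pi in zip(y, p)
    )


def deviance_r_squared(y: Sequence[int], p_model: Sequence[float]) -> float:
    """Share of the null deviance that the fitted model removes.
    Assumes y contains both 0s and 1s and all probabilities lie
    strictly between 0 and 1."""
    base_rate = sum(y) / len(y)  # intercept-only model: overall event rate
    null_deviance = binomial_deviance(y, [base_rate] * len(y))
    model_deviance = binomial_deviance(y, p_model)
    return 1.0 - model_deviance / null_deviance
```

A model that predicts the overall event rate for every wine scores 0, and the value approaches 1 as the fitted probabilities move toward the observed outcomes. Read this way, a value of 28% means the model removes about 28% of the null deviance.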
In 2012, the final result consisted of two equations that could be used to understand which variables were significant for each type of wine in order to improve their taste.
Optimizing the Response
In Minitab 17, I can go one step further and use the optimization tool to identify the ideal settings and help the experimenter make the right decision.
The optimization tool shows that tasters tend to prefer wines with a large amount of alcohol and both high fixed acidity and high volatile acidity.
Finally, showing graphs is important to convince colleagues and managers that the right decision has been made. A visual representation is also very useful for understanding the factor effects. In Minitab 17, contour plots and response surface plots are available in the binary logistic regression sub-menu to describe the variable effects.
The contour plot below shows that tasters prefer wines either with both high fixed acidity and high volatile acidity, or with both low fixed acidity and low volatile acidity. The balance between the two types of acidity seems to be crucial.
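This "both high or both low" pattern is what a positive interaction term produces in a logistic model. The sketch below illustrates the mechanism only: the inputs are assumed to be centered (0 means an average wine), and every coefficient is hypothetical rather than a fitted value from this analysis.

```python
import math


def preference_probability(
    fixed_c: float,
    volatile_c: float,
    b0: float = 0.0,
    b_fixed: float = 0.0,
    b_volatile: float = 0.0,
    b_inter: float = 1.0,
) -> float:
    """Logistic model with a Fixed acidity by Volatile acidity interaction.
    Inputs are centered; all coefficients are hypothetical."""
    logit = (
        b0
        + b_fixed * fixed_c
        + b_volatile * volatile_c
        + b_inter * fixed_c * volatile_c  # the interaction term
    )
    return 1.0 / (1.0 + math.exp(-logit))


# Evaluate the four corners: low/high fixed acidity by low/high volatile acidity.
for fixed_c in (-1.0, 1.0):
    for volatile_c in (-1.0, 1.0):
        p = preference_probability(fixed_c, volatile_c)
        print(fixed_c, volatile_c, round(p, 3))
```

With a positive interaction coefficient, the product of the two centered acidities is positive when both are high or both are low, so the predicted probability is above 0.5 in those two corners and below 0.5 in the mixed corners. A model with main effects only cannot produce this shape.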
The models I arrived at in April 2012 are different from the one I found with Minitab 17. In 2012, the two types of Acidity (Fixed and Volatile) were significant in the model for white wines, while Alcohol and Fixed Acidity were selected in the final model for red wines.
But the main difference is that the Fixed Acidity by Volatile Acidity interaction had not been considered in 2012. In April 2012, two-factor interactions were not on my radar, and I focused only on the individual main effects and their impact on wine taste.
Fortunately, with Minitab 17 it is a lot easier to build an initial model, even a complex one with 66 potential two-factor interactions, and stepwise regression allows you to consider a much larger number of potential effects in the initial full model.
Conclusion
Ultimately, this study shows that the methods you use in a statistical analysis affect your conclusions. I got a simpler model using the tools available in Minitab 17, and therefore I did not need to study white and red wines separately. The optimization tool and the graphs were very useful for understanding the effects of the significant variables.