Using Statistical Regression to Model Wine Tastes

Wine tastes are often described in very poetic terms: “full-bodied and rich but not heavy, high in alcohol, yet neither acidic nor tannic, with substantial black cherry flavor despite its delicacy.” Descriptors drawn from specific flowers, fruits, and more are meant to help drinkers understand the flavors in a glass of wine. The conversion of fruit to wine is considered an art form.

Of course, there is no substitute for good grapes, diligent winemaking practices, and barrel aging. Wine contains many of the natural chemical compounds found in fruit and spices, and specific compounds shape the flavor profiles we taste: sweet, sour, bitter, and so on.

Each winemaking phase has a different impact on flavor, because each phase changes which chemicals are present in the wine. All the flavors in wine come from the grapes and the winemaking process, of course, but manipulating these phases can result in a wine with a better flavor.

Wine tasting may sound ethereal, but flavor all comes down to chemical compounds that affect the taste of your wine. Behind the loving descriptions of wine as living art, there’s science. Acids primarily add sour notes. Alcohol compounds also affect taste: ethanol, for example, adds bitter, sweet, and sour flavors. To use knowledge of how certain compounds affect flavor, one must understand which phase naturally produces each compound.

Although wine tastes vary from person to person and there are many different profiles of wine tasters (De gustibus non est disputandum: “In matters of taste, there can be no disputes”), some wines are clearly better than others, and most people would probably recognize a good wine from a bad one.

When you need to understand situations in which variability and noise play an important part, statistical models are very efficient at identifying the key inputs in seemingly chaotic data. This article details how wine-tasting data and powerful modeling techniques yielded insight into the variables that were important to a panel of experienced wine tasters. The analysis illustrates that even taste preferences can be assessed with statistics if you choose the right analysis.

In this article, we are interested in using statistics to understand whether a wine that has, for instance, more sulphates or more chlorides would taste better. Based on that understanding, it could be possible to make a better wine. We will consider many potential predictors, such as acidity, sulphur dioxide, and percentage of alcohol.

A panel of oenologists tasted several types of white and red wines and provided binary assessments of quality—good (1) or poor (0)—for each. Our goal is to identify which of these many variables have a significant effect on wine quality.

Using Regression to Analyze Binary Taste Data

Simple graphs are not sufficient to identify which variables might be important due to complexity and variability in this data set. Regression analysis lets us see how multiple factors affect an outcome, so it is an ideal method to look at the wine-tasting variables.

However, our panel simply ranked each wine as either high- or low-quality. This means we have binary and not continuous response data, so we need to proceed with caution—using a standard regression or ANOVA to analyze a binary response is generally not a good idea.

Because binary data follow a binomial distribution rather than a normal, bell-shaped distribution, standard regression may result in probability predictions that are negative or larger than 100%. We might get an unnecessarily complex model, in which some spurious interactions seem to be significant. In addition, the variance for binary data is not constant.
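To make the first and last of these problems concrete, here is a minimal Python sketch. The seven alcohol values and 0/1 ratings are invented for illustration and are not taken from the wine-tasting data set. It fits an ordinary least-squares line to a binary response and shows that the fitted "probabilities" escape the 0 to 1 range, and that the variance of a binary outcome, p(1 - p), changes with p.

```python
# Hypothetical data (NOT from the article's data set): alcohol percentage
# and a 0/1 quality rating for seven wines.
alcohol = [8, 9, 10, 11, 12, 13, 14]
good = [0, 0, 0, 1, 0, 1, 1]


def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(alcohol, good)


def linear_prediction(x):
    """'Probability' of a good wine according to the straight-line fit."""
    return intercept + slope * x


def bernoulli_variance(p):
    """Variance of a 0/1 outcome whose probability of success is p."""
    return p * (1 - p)


print(linear_prediction(8))     # negative "probability"
print(linear_prediction(15))    # "probability" above 1
print(bernoulli_variance(0.5))  # variance is largest at p = 0.5 ...
print(bernoulli_variance(0.9))  # ... and shrinks as p approaches 0 or 1
```

The straight line has no way of knowing that a probability must stay between 0 and 1, which is the gap a logistic model is designed to close.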

Fortunately, there’s a simple solution: since we have binary response data, we simply need to use the appropriate tool, binary logistic regression.
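As a rough illustration of what a binary logistic model does, the sketch below fits one with a single predictor using plain gradient ascent on the log-likelihood. This is an assumption made for readability, not a description of how Minitab estimates the model, and the data are again invented. The point is that the model works on the log-odds scale, so every predicted probability lands strictly between 0 and 1.

```python
import math

# Hypothetical data (NOT from the article's data set).
alcohol = [8, 9, 10, 11, 12, 13, 14]
good = [0, 0, 0, 1, 0, 1, 1]


def sigmoid(z):
    """Map a log-odds value to a probability, avoiding overflow in exp()."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(x, y, step=0.1, iterations=5000):
    """Maximum-likelihood fit of logit(p) = b0 + b1 * x by gradient ascent."""
    mean_x = sum(x) / len(x)
    centered = [xi - mean_x for xi in x]  # centering helps convergence
    b0 = b1 = 0.0
    for _ in range(iterations):
        residuals = [yi - sigmoid(b0 + b1 * ci) for ci, yi in zip(centered, y)]
        b0 += step * sum(residuals)
        b1 += step * sum(r * c for r, c in zip(residuals, centered))
    # Convert the intercept back to the original (uncentered) scale.
    return b0 - b1 * mean_x, b1


intercept, slope = fit_logistic(alcohol, good)


def probability_good(x):
    """Predicted probability that a wine with alcohol level x is rated good."""
    return sigmoid(intercept + slope * x)


for level in (5, 8, 11, 14, 20):
    print(level, round(probability_good(level), 4))
```

Even far outside the observed range (5% or 20% alcohol in this toy example), the prediction remains a valid probability.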

Full Model Regression Analysis

A standard practice in regression analysis is to start with the “full model,” one that includes all of the potentially significant factors for which you collected data. In this case, we begin the analysis by including all variables and all interactions between those variables and types of wine.

To include interactions, in Minitab go to Stat > Regression > Binary Logistic Regression > Fit Binary Logistic Model > Model > Add interactions.

When introducing interactions, it is also useful to standardize the continuous predictors in your model to avoid disruptive scale effects (Stat > Regression > Regression > Fit Regression Model > Coding).
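One commonly cited reason for standardizing is that a raw interaction term can be almost a copy of one of its parent predictors. The sketch below uses invented density and residual sugar values (not the article's data) and assumes the usual "subtract the mean, divide by the standard deviation" coding. It compares how strongly the interaction column correlates with residual sugar before and after coding.

```python
import math


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def correlation(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical measurements for five wines (illustrative values only).
density = [0.990, 0.992, 0.994, 0.996, 0.998]
sugar = [2.0, 12.0, 4.0, 10.0, 7.0]

# Interaction built from the raw columns: density is always close to 1,
# so the product is nearly identical to the sugar column itself.
raw_interaction = [s * d for s, d in zip(sugar, density)]
raw_corr = correlation(sugar, raw_interaction)

# Interaction built from the standardized (coded) columns.
z_sugar = standardize(sugar)
z_density = standardize(density)
coded_interaction = [s * d for s, d in zip(z_sugar, z_density)]
coded_corr = correlation(z_sugar, coded_interaction)

print("raw:  ", round(raw_corr, 5))
print("coded:", round(coded_corr, 5))
```

With the raw columns the interaction carries almost no information beyond sugar itself; after coding, the two terms are much easier for the model to tell apart.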

We used the stepwise method to automatically build the best model step by step and identify a useful subset of terms from a very large number of candidate terms. To do that, go to Stat > Regression > Binary Logistic Regression > Fit Binary Logistic Model > Stepwise.

The criterion used to identify the best model in this stepwise approach was the Akaike Information Criterion (AIC). AIC estimates the relative amount of information lost by a given model, and it is used to compare different models fitted to the same data. The smaller the AIC, the better the balance between fit and complexity: AIC includes a penalty that increases with the number of estimated parameters to discourage overfitting. The objective is to avoid both overfitting and underfitting.
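The trade-off is easy to see from the standard formula, AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. The short Python sketch below applies it to two candidate models. The log-likelihoods and parameter counts are made-up numbers chosen for illustration, not output from the wine analysis.

```python
def aic(log_likelihood, n_parameters):
    """Akaike Information Criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_parameters - 2 * log_likelihood


# Hypothetical candidates: the larger model fits slightly better
# (higher log-likelihood) but needs four more parameters to do so.
candidates = {
    "small model": {"log_likelihood": -120.0, "n_parameters": 5},
    "large model": {"log_likelihood": -118.5, "n_parameters": 9},
}

scores = {name: aic(**model) for name, model in candidates.items()}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(name, score)
print("preferred:", best)
```

Here the extra four parameters improve the log-likelihood by only 1.5, which is not enough to pay for their penalty, so the smaller model is preferred.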

Ultimately, this iterative process leads us to the model below.

With 12 terms, this model might seem difficult to understand and explain, but it does give us a clue to how we can delve deeper into these data to better understand which factors contribute most to good-tasting wine.

Coded (standardized) coefficients are useful to understand which variables are the most important:

Density has the largest effect (-3.504). Residual Sugar coupled with Types of Wines has the second largest effect (2.75 for the Residual Sugar * Types of Wines interaction), followed by Fixed acidity (1.33) and the Fixed acidity * Density interaction (1.213).
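Because coded coefficients share a common scale, ranking them by absolute value is a quick way to read off relative importance. The small Python sketch below does this for the four coefficients quoted above; the remaining terms of the 12-term model are not listed in the text, so they are left out.

```python
# Coded coefficients quoted in the text (only four of the model's 12 terms).
coded_coefficients = {
    "Density": -3.504,
    "Residual Sugar * Types of Wines": 2.75,
    "Fixed acidity": 1.33,
    "Fixed acidity * Density": 1.213,
}

# Sort by magnitude: the sign gives the direction of the effect,
# the absolute value gives its strength on the coded scale.
ranked = sorted(
    coded_coefficients.items(), key=lambda item: abs(item[1]), reverse=True
)

for term, coefficient in ranked:
    direction = "raises" if coefficient > 0 else "lowers"
    print(f"{term}: {coefficient:+.3f} ({direction} the log-odds of a good rating)")
```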

The interaction plot shows that the effect of Residual Sugar on wine quality is virtually nonexistent in red wines; however, it plays an important role in white wines.
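To see how an interaction term produces this kind of pattern, consider the minimal sketch below. Every coefficient in it is hypothetical and chosen only for illustration, and it assumes wine type is coded 0 for red and 1 for white; neither the values nor the coding come from the fitted model, and the direction of the sugar effect in whites is likewise an assumption.

```python
import math

# Hypothetical coefficients on the coded scale (NOT the article's fitted values).
INTERCEPT = 0.0
SUGAR = 0.05          # effect of standardized residual sugar in red wines
WHITE = 0.0           # shift in log-odds for white wines
SUGAR_X_WHITE = 1.5   # additional sugar effect that applies only to whites


def probability_good(sugar_z, is_white):
    """Predicted probability of a good rating; is_white is 0 (red) or 1 (white)."""
    logit = (
        INTERCEPT
        + SUGAR * sugar_z
        + WHITE * is_white
        + SUGAR_X_WHITE * sugar_z * is_white
    )
    return 1.0 / (1.0 + math.exp(-logit))


# Change in predicted probability when residual sugar moves from one standard
# deviation below its mean to one standard deviation above it.
red_change = probability_good(1, 0) - probability_good(-1, 0)
white_change = probability_good(1, 1) - probability_good(-1, 1)

print("red wines:  ", round(red_change, 3))
print("white wines:", round(white_change, 3))
```

In red wines only the small main effect applies, so the probability barely moves. In white wines the main effect and the interaction add up, which gives a much steeper response to the same change in sugar.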

Now that we have models for the wines, we can see what the data tell us about the wine characteristics that influenced our panel’s rankings. For example, the main effects plot summarizes the relationship between fixed acidity, density, and the probability of making a good wine. Higher fixed acidity and a lower density tend to improve wine quality.

So when you need to understand situations that, at least on the surface, defy data analysis or when the number of candidate variables is large, why not dig a little deeper by using techniques such as binary logistic regression?

You can use a similar approach to what we did with this wine-tasting data to analyze marketing or sales data, to better understand customer preferences, and to gain insight into factors that are important—even if, like taste preferences, they seem hard to measure.

