The Lady Tasting Beer: Evaluating a Go/No-Go Gage (Part II)

Minitab Blog Editor 09 April, 2012

If your measurement system depends on subjective decision-making, it’s critical to evaluate its consistency and accuracy.

Lately, I’ve been bottle-feeding my work colleagues skunky and sour beer to see how well they can identify defective grog.  

Now it’s time to evaluate my (hic) go/no-go gage in Minitab using an Attribute Agreement Analysis.

How well did our appraisers do? Let's look at the results from the Minitab Assistant (Assistant > Measurement Systems Analysis > Attribute Agreement Analysis).

Accuracy of Appraisals

First, the accuracy of all the appraisals taken together:

Overall, our appraisers correctly identified the “good” and “bad” samples about 60% of the time. That's just slightly better than what you’d expect by pure chance (50%). There’s definitely room for improvement in this appraisal system!

Now let's break it down by appraiser:


Morgan appears to be the most accurate appraiser, discriminating good grog from bad 75% of the time. Tootsie was close at her heels, at 62.5%.

Monique and Jaenbe, on the other hand, each have an accuracy rate of 50%. That is below the average of all the appraisers (60%) and no better than what you'd expect by pure chance. To improve their accuracy rates, they've both volunteered for intensive retraining at Fred's Bar and Grill, if I continue to supply them with free Stella Artois. (Am I being had?)

So we know our measurement system has some problems with accuracy. But what types of errors are causing the problem?


When appraisers misclassified a sample, they were slightly more likely to mistake good beer for bad beer (43.8%) than vice versa (37.5%), as shown in the chart below:

This makes me wonder about a possible carryover effect. For example, could the sourness from a previous vinegary beer sample be lingering on the palate of an appraiser when they taste a subsequent “good beer” sample? 

Carryover can be a problem in appraisal studies. Even if it’s not a taste test, appraisers might recognize an item they appraised before and rate it from memory. To prevent that, make sure enough time passes between appraisals.

In this study, to help guard against carryover effect, I supplied appraisers with filtered water and gluten-free, salt-free rice cakes to clear their palates between samples. (It had to be something tasteless, like packing peanuts, to prevent introducing another source of taste bias.)

But most appraisers didn’t nibble their rice cakes or sip water between each sample. In hindsight, I should have made that a mandatory part of the appraisal process. (This shows why running a small experimental pilot study is a good idea—you can identify and address these types of issues before you do a more expensive, time-consuming full-fledged study.)

To evaluate the overall self-consistency of appraisers, look at the Mixed ratings result in the above chart. About 31% of the time, the same appraiser rated the same beer sample differently across the two trials. A high percentage of mixed ratings can indicate that appraisers need additional training or that the standards themselves are not clearly defined.
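To make the idea of a "mixed rating" concrete, here is a minimal Python sketch of how that percentage could be tallied. The data are entirely hypothetical (one made-up appraiser, four made-up samples), and the function name is my own; this is not Minitab's implementation, just the counting rule described above.

```python
def percent_mixed(ratings):
    """Percent of (appraiser, sample) pairs rated differently across trials.

    `ratings` maps (appraiser, sample) -> list of ratings, one per trial.
    """
    mixed = sum(1 for trials in ratings.values() if len(set(trials)) > 1)
    return 100 * mixed / len(ratings)


# Hypothetical two-trial data, NOT the results of the beer study.
example = {
    ("Appraiser A", 1): ["Good", "Good"],  # consistent
    ("Appraiser A", 2): ["Good", "Bad"],   # mixed
    ("Appraiser A", 3): ["Bad", "Bad"],    # consistent
    ("Appraiser A", 4): ["Bad", "Good"],   # mixed
}

print(percent_mixed(example))  # 50.0
```

If the study had 4 appraisers each rating 4 samples twice (an assumption on my part), the 31% figure would correspond to 5 mixed pairs out of 16.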

The bar charts of misclassifications, such as the % Bad rated Good chart (shown below), help to clarify mixed ratings in your study.

It’s clear that the appraisers had a hard time discerning the sourness of the vinegar beer. I should have added more vinegar to allow them to more clearly differentiate the “good” from the “bad.” Based on this chart, I’ll take the rap for the fuzzy standards!

The Appraiser Misclassification Rates chart reveals some interesting patterns for each individual appraiser.  

Jaenbe seems to mistake good beer for bad beer at about the same rate (50%) as she mistakes bad beer for good beer. Interestingly, although her ratings weren't very accurate, Jaenbe was the only appraiser who was completely self-consistent in her ratings (% Rated both ways chart). She seems to be marching to the beat of a different bartender.

Monique was less likely to mistakenly classify good beer as bad than the other appraisers. On the other hand, she also rated bad beer as good 75% of the time—more than any other appraiser. She might benefit from retraining to better recognize Things That Taste Vile. (Although if the beer always tastes good, regardless of whether it’s flat or spiked with vinegar, she must be a very happy and content person. So maybe we should just leave her be.)

Morgan correctly identified every sample of defective beer (the % Bad rated Good chart indicates 0). When she did err, she mistook good beer for bad. That supports a possible carryover effect. So I checked the raw data. Sure enough, both times she rated a good beer as “bad,” she had just sampled a skunky beer. Does old, flat beer leave a lingering bad taste in Morgan’s mouth? If so, to paraphrase Marie Antoinette (or Jean-Jacques Rousseau), “Let her eat rice cake!”

Tootsie’s performance was somewhere in the middle of the pack. She was more accurate than Monique and Jaenbe at picking out bad beer, but less accurate than Morgan. Her self-consistency was higher than Morgan's and Monique's, but lower than Jaenbe’s.
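For readers who like to check the arithmetic, here is a minimal Python sketch that rebuilds the rates quoted above. The counts are not raw data from the study. They are back-calculated from the reported percentages, assuming each appraiser made 8 appraisals (4 of good beer, 4 of bad). Tootsie's split in particular is inferred from the overall totals rather than stated anywhere in this post.

```python
# ASSUMPTION: 4 good-beer and 4 bad-beer appraisals per appraiser.
N_GOOD = 4
N_BAD = 4

# (good rated bad, bad rated good) per appraiser, back-calculated from the
# percentages in the post; these are inferred counts, not the raw data.
errors = {
    "Jaenbe": (2, 2),
    "Monique": (1, 3),
    "Morgan": (2, 0),
    "Tootsie": (2, 1),
}


def rates(good_rated_bad, bad_rated_good, n_good, n_bad):
    """Return accuracy and the two misclassification rates, in percent."""
    total = n_good + n_bad
    return {
        "accuracy": 100 * (total - good_rated_bad - bad_rated_good) / total,
        "good_rated_bad": 100 * good_rated_bad / n_good,
        "bad_rated_good": 100 * bad_rated_good / n_bad,
    }


per_appraiser = {
    name: rates(g, b, N_GOOD, N_BAD) for name, (g, b) in errors.items()
}

# Pool all appraisers to get the overall rates.
overall = rates(
    sum(g for g, _ in errors.values()),
    sum(b for _, b in errors.values()),
    N_GOOD * len(errors),
    N_BAD * len(errors),
)

for name, r in per_appraiser.items():
    print(name, r)
print("Overall", overall)
```

Under these assumptions the pooled figures come out to 59.375% accuracy, 43.75% good rated bad, and 37.5% bad rated good, which match the rounded values reported earlier.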

If You Don't Read Anything Else, Read This!

The Accuracy Report shows 95% confidence intervals for the accuracy rate of each appraiser. Here’s where the rubber meets the road, statistically speaking.

Yikes! Look at those humongous confidence intervals. You could dock a cruise ship in those babies. More formally stated, you can be 95% confident that:

  • The accuracy rates of Jaenbe and Monique each fall between 15% and 84%
  • Morgan’s accuracy rate falls between 40% and 97%
  • Tootsie’s accuracy rate falls between 25% and 92%

Remember, you can be 100% confident that any appraiser’s accuracy rate is between 0% and 100%, without performing any statistical analysis whatsoever. So if your confidence intervals come close to that span, your data doesn’t tell you much.
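To see why so few appraisals produce such wide intervals, here is a rough Python sketch of an exact (Clopper-Pearson) binomial confidence interval, written with only the standard library. Two assumptions are mine, not the post's: that each appraiser made 8 appraisals, and that the exact binomial method is the one behind the chart. The function names are hypothetical helpers, not Minitab commands.

```python
from math import comb


def binom_tail_ge(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def exact_ci(k, n, conf=0.95):
    """Clopper-Pearson interval for k successes in n trials, via bisection."""
    alpha = 1 - conf

    def solve(f, target):
        # f is increasing in p on [0, 1]; find the p where f(p) = target.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid) < target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: the p at which P(X >= k) equals alpha/2.
    lower = 0.0 if k == 0 else solve(
        lambda p: binom_tail_ge(k, n, p), alpha / 2
    )
    # Upper bound: the p at which P(X <= k) equals alpha/2,
    # i.e. P(X >= k + 1) equals 1 - alpha/2.
    upper = 1.0 if k == n else solve(
        lambda p: binom_tail_ge(k + 1, n, p), 1 - alpha / 2
    )
    return lower, upper


# ASSUMPTION: 8 appraisals each; correct counts inferred from the percentages.
for name, correct in {"Jaenbe": 4, "Monique": 4, "Morgan": 6, "Tootsie": 5}.items():
    low, high = exact_ci(correct, 8)
    print(f"{name}: {100 * low:.1f}% to {100 * high:.1f}%")
```

Under these assumptions the sketch gives about 16% to 84% for 4 of 8 correct, 24% to 91% for 5 of 8, and 35% to 97% for 6 of 8. That is close to the figures read off the report above, though the lower bound for 6 of 8 comes out a little below the 40% quoted for Morgan, so the report may round or compute its intervals differently.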

So, I really had no right to impugn the skills of the appraisers using their individual results. Based on these huge confidence intervals, they could all successfully sue me for libel and slander.

To obtain more conclusive results, I'd need to conduct a much larger study, with more appraisals, more trials, and clearer standards for “good” and “bad.” 

Problem is, I can't afford to buy any more six-packs of Stella. And my colleagues can’t drink 30 beer samples at work. And even if they could, the ensuing buzz might introduce yet another source of bias into the appraisal process.

Ach! I think I could use a beer...