I’d like to give you a chance to win some money. Tell me which of these games you’d rather play:
Game 1: We flip a coin once. If it lands on tails, I’ll give you 100 bucks.
Game 2: We flip a coin 10 times. If it lands on tails at least once, I’ll give you 100 bucks.
If you said “D’oh!” and picked Game 2, either you’re Homer Simpson or you already have a good intuitive understanding of an important statistical concept called “the multiple comparisons problem.”
You realized that the cumulative probability of getting at least one tail in 10 flips (about 99.9%) is much greater than the probability of getting a tail on a single flip (50%), even though the probability of a tail is constant at 50% for each flip.
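If you want to check those numbers yourself, here’s a quick Python sketch of the arithmetic. It’s just an illustration of the two games, assuming a fair coin:

```python
# Chance of winning each game, assuming a fair coin (P(tail) = 0.5 per flip)
p_tail = 0.5

p_game1 = p_tail                     # Game 1: a tail on a single flip
p_game2 = 1 - (1 - p_tail) ** 10     # Game 2: at least one tail in 10 flips

print(f"Game 1: {p_game1:.1%}")      # 50.0%
print(f"Game 2: {p_game2:.1%}")      # 99.9%
```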
With that in mind, think about what happens if you perform a hypothesis test many times on the same set of data. Each hypothesis test has a “built-in” error rate, called alpha, which indicates the probability that the test will find a statistically significant result based on the sample data when, in reality, no such difference actually exists. Statisticians call this a Type I error.
By convention, alpha is often set at 0.05, which corresponds to a 5% error rate for each test. But if you perform a test multiple times, the cumulative error rate for all those tests together is going to be greater than 5%.
How much greater is this cumulative error rate, which statisticians call the experiment-wise or family error rate? It depends on how many tests you perform: for n independent tests, each run at the same alpha, the family error rate is 1 - (1 - alpha)^n.
Suppose a researcher, Dr. Dredge, collects data on the number of hours worked per day by people in different countries. A little green around the gills, statistically, Dr. Dredge decides to use a 2-sample t-test to compare the mean hours worked per day by the British and the Japanese, then performs the 2-sample t-test again to compare Brazilians with Americans, then uses it again to compare the French with Australians, and then again and again and again...
If each t-test has a Type I error rate (alpha) of 0.05, what would happen?
To find out, we can set up a Minitab worksheet to automatically calculate the family error rate based on the number of comparisons:
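If you don’t have Minitab handy, a few lines of Python will do the same calculation. This is just a sketch of the formula above, assuming the tests are independent; it’s not Minitab output:

```python
# Family (experiment-wise) error rate for n independent tests,
# each run at alpha = 0.05: 1 - (1 - alpha)^n
alpha = 0.05

for n in (1, 10, 45, 100):
    family_error = 1 - (1 - alpha) ** n
    print(f"{n:>3} comparisons -> family error rate = {family_error:.1%}")
```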
Here’s what you should get:

Number of comparisons     Family error rate
1                         5.0%
10                        40.1%
45                        90.1%
100                       99.4%
The more comparisons Dr. Dredge makes, the more likely he’s going to find a statistically significant difference that’s really just a Type I error.
“Yes, but who would ever make 100 comparisons?” you might be thinking. But just a few groups in your data can create the potential for quite a lot of pairwise comparisons. For example, if Dr. Dredge records daily work hours for just 10 countries, and then compares each possible pair of countries in his data, he’ll make (10 × 9)/2 = 45 pairwise comparisons and incur a family error rate of about 90%!
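To see where that 45 comes from: the number of pairs among k groups is k(k - 1)/2, also known as “k choose 2.” Here’s a quick check in the same vein, again as an illustration rather than Minitab output:

```python
import math

k = 10                                   # number of countries in the data
pairs = math.comb(k, 2)                  # k*(k-1)/2 = 45 pairwise comparisons
family_error = 1 - (1 - 0.05) ** pairs   # family error rate at alpha = 0.05

print(f"{pairs} comparisons -> family error rate = {family_error:.0%}")  # 90%
```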
So what do you do in this situation? Luckily, Minitab Statistical Software provides an easy way out. We’ll explore that in my next blog.
P.S. -- If you’d like to learn more about the multiple comparisons problem, from a different angle, see what happens when a professor creates a “fake” data set of completely random numbers and then has students interpret the "significant" results of the multiple comparisons.