For this post, let's look at a more recent Mythbusters episode, “Battle of the Sexes – Round Two,” to see how the show has progressed in handling sample size. There are some encouraging signs: during the show, Adam Savage, one of the hosts, explains, “Sample size is everything in science; the more you have, the better your results.”
To paraphrase the show, here at Minitab, we don’t just talk about the hypotheses; we put them to the test. We’ll use two different hypothesis tests and this worksheet to determine whether:
The Mythbusters wanted to determine whether women are better multitaskers than men. To test this, they had 10 men and 10 women perform a set of tasks that required multitasking in order to have sufficient time to complete all of the tasks. They use a scoring system that produces scores between 0 and 100.
The women end up with an average of 72, while the men average 64. The Mythbusters conclude that this 8 point difference confirms the myth that women are better multitaskers. Does statistical analysis agree?
The average scores are based on samples rather than the entire population of men and women. Samples contain error because they are a subset of the entire population. Consequently, a sample mean and the corresponding population mean are likely to be different. It’s possible that if we reran the experiment, the sample results could change.
We want to be reasonably sure that the observed difference between samples actually represents a true difference between the entire population of men and women. This is where hypothesis tests play a role.
Because we want to compare the means between two groups, you might think that we’ll use the 2-Sample t test. However, based on a Normality Test, these data appear to be nonnormal.
The 2-Sample t test is robust to nonnormal data when each sample has at least 15 subjects (30 total). However, our sample sizes are too small for this test to handle nonnormal data. Therefore, we can’t trust the p-value calculated by the 2-Sample t test for these data.
Instead, we’ll use the nonparametric Mann-Whitney test, which compares the medians. Nonparametric tests have fewer requirements and are particularly useful when your data are nonnormal and you have small sample sizes. We’ll use a one-tailed test to determine whether the median multitasking score for women is greater than the median men’s score.
To run the test in Minitab statistical software, go to: Stat > Nonparametrics > Mann-Whitney
The p-value of 0.1271 is greater than 0.05, which indicates that the women’s median is not significantly greater than the men’s median. Further, the 95% confidence interval suggests that the median pairwise difference is likely between -9.99 and 30.01. Because the confidence interval includes both positive and negative values, it would not be surprising to repeat the experiment and find that men had the higher median!
The Mythbusters looked at the sample means and “Confirmed” the myth. However, the data do not support the conclusion that women have a higher median score than men.
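For readers without Minitab, here is a minimal Python sketch of a one-tailed Mann-Whitney test. It is not Minitab's implementation: it uses a plain normal approximation with a continuity correction and no tie correction, so its p-value can differ slightly from Minitab's output. The scores are made up for illustration (chosen only so the group averages are 72 and 64); they are not the show's data, so the result will not reproduce the p-value above.

```python
from math import sqrt
from statistics import NormalDist


def mann_whitney_greater(x, y):
    """One-sided Mann-Whitney test of whether x tends to exceed y.

    Returns (U, p). U counts the (x, y) pairs with x > y, with ties
    counting one half. p comes from a normal approximation with a
    continuity correction; no tie correction is applied to the variance.
    """
    n1, n2 = len(x), len(y)
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u - 0.5) / sd_u
    return u, 1 - NormalDist().cdf(z)


# Hypothetical scores on the 0-100 scale; NOT the data from the episode.
women = [95, 88, 82, 79, 75, 70, 66, 60, 55, 50]
men = [90, 80, 74, 68, 65, 62, 58, 52, 48, 43]

u, p = mann_whitney_greater(women, men)
print(f"U = {u}, one-sided p = {p:.4f}")
```

With only 10 subjects per group, a difference in averages of this size can easily come with a p-value above 0.05, which is the point of the analysis above.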
If the Mythbusters were to perform this experiment again, how many subjects should they recruit? For a start, if they recruit at least 15 subjects per group, they can use the more powerful 2-Sample t test.
I’ll perform a power analysis for a 2-sample t test to estimate a good sample size based on the following:
In Minitab, go to Stat > Power and Sample Size > 2-Sample t and fill in the dialog as follows:
Under Options, choose Greater than, and click OK in all dialogs.
The output shows that we need 29 subjects per group, for a total of 58, to have a reasonable chance of detecting a meaningful difference, if that difference actually exists between the two populations.
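To show the kind of calculation behind a sample size estimate, here is a rough Python sketch based on the normal approximation n ≈ 2((z_alpha + z_power) × SD / difference)² for a one-sided comparison of two means. It is not Minitab's calculation, which works with the t distribution and typically asks for a subject or two more per group. The inputs are assumptions for illustration only: the 8-point difference is the one observed on the show, while the standard deviation of 10, the power of 0.80, and the alpha of 0.05 are hypothetical and are not the values entered in the dialog above.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(difference, sd, alpha=0.05, power=0.80):
    """Approximate subjects per group for a one-sided two-sample comparison.

    Uses the normal approximation, so it slightly understates the
    sample size that an exact t-based power analysis would report.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)   # critical value for a one-sided test
    z_power = z.inv_cdf(power)       # quantile matching the target power
    return ceil(2 * ((z_alpha + z_power) * sd / difference) ** 2)


# Hypothetical inputs: detect an 8-point difference, assuming SD = 10.
print(n_per_group(difference=8, sd=10))

# Halving the difference to detect roughly quadruples the sample size.
print(n_per_group(difference=4, sd=10))
```

The second call illustrates why small effects are expensive: the required sample size grows with the square of SD divided by the difference.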
The Mythbusters also wanted to determine whether men are better at parallel parking than women. They devised a test that produces scores between 0 and 100. At first glance, this appears to be a scenario similar to the multitasking myth, where we’ll compare means or medians. However, the means and medians are virtually identical and are not significantly different according to any test.
There’s a different story behind this myth. During the parking test, the hosts notice that the women’s scores seem more variable than the men’s. The women are either really good or really bad, while men are somewhere in between, as you can see below.
We want to be reasonably sure that the observed difference in variability actually represents a true difference between the populations. We need to use the correct hypothesis test, which is Two Variances (Stat > Basic Statistics > 2 Variances). The test results are below:
The null hypothesis is that the variability in both groups is equal. Because the p-value (reported as 0.000) is less than 0.05, we can reject the null hypothesis and conclude that women’s scores for parallel parking are more variable than men’s scores.
The Mythbusters correctly busted this myth because the means and medians are essentially equal. We can't conclude that one gender is better at parallel parking than the other.
However, we can conclude that men are more consistent at parallel parking than women.
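The idea of testing spread rather than center can also be sketched in Python. The code below is a swapped-in technique, not the test that Minitab's 2 Variances command runs: it is a simple permutation test on each score's absolute deviation from its group median. The scores are hypothetical, invented to mimic the pattern described above (women either very high or very low, men clustered in the middle, with nearly equal centers); they are not the show's data.

```python
import random
from statistics import median


def abs_deviations(scores):
    """Absolute distance of each score from its own group's median."""
    center = median(scores)
    return [abs(s - center) for s in scores]


def spread_permutation_test(a, b, n_perm=10_000, seed=0):
    """One-sided permutation test that group a is more spread out than b.

    The statistic is the difference in mean absolute deviation from the
    group median. Returns (observed difference, p-value).
    """
    dev_a, dev_b = abs_deviations(a), abs_deviations(b)
    n_a, n_b = len(dev_a), len(dev_b)
    observed = sum(dev_a) / n_a - sum(dev_b) / n_b
    pooled = dev_a + dev_b
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        if diff >= observed - 1e-12:
            hits += 1
    # Add one to both counts so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical parking scores on the 0-100 scale; NOT the episode's data.
women = [10, 15, 20, 25, 30, 75, 80, 85, 90, 95]
men = [45, 47, 48, 49, 50, 50, 51, 52, 53, 55]

observed, p = spread_permutation_test(women, men)
print(f"difference in spread = {observed:.1f}, one-sided p = {p:.4f}")
```

Note that the two hypothetical groups have almost the same median, so a test of centers would find nothing, while the test of spread picks up the difference clearly.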
In one of their videos, Adam and Jamie explain that they understand the importance of sample size. However, Adam states that the Mythbusters put more effort into the methodology of collecting good data. It’s true, they are great at reducing sources of variation, obtaining accurate measurements, etc. He goes on to explain that they just don’t have the resources to obtain larger sample sizes. Fair enough—for a television show.
However, if you’re in science or Six Sigma, you don’t have this luxury. You must:
Without all of the above, you risk drawing incorrect conclusions.