Using Hypothesis Tests to Bust Myths about the Battle of the Sexes

Minitab Blog Editor 05 September, 2013

In my home, we’re huge fans of Mythbusters, the show on Discovery Channel. This fun show mixes science and experiments to prove or disprove various myths, urban legends, and popular beliefs. It’s a great show because it brings the scientific method to life. I’ve written about Mythbusters before to show how, without proper statistical analysis, it’s difficult to know when a result is statistically significant. How much data do you need to collect and how large does the difference need to be?

For this blog, let's look at a more recent Mythbusters episode, “Battle of the Sexes – Round Two.” I want to see how they’ve progressed with handling sample size. There are some encouraging signs: during the show, Adam Savage, one of the hosts, explains, “Sample size is everything in science; the more you have, the better your results.”

To paraphrase the show, here at Minitab, we don’t just talk about the hypotheses; we put them to the test. We’ll use two different hypothesis tests and this worksheet to determine whether:

  • Women are better at multitasking
  • Men are better at parallel parking

Are Women Better Multitaskers?

The Mythbusters wanted to determine whether women are better multitaskers than men. To test this, they had 10 men and 10 women perform a set of tasks that could be completed in the allotted time only by multitasking. They used a scoring system that produces scores between 0 and 100.

The women end up with an average of 72, while the men average 64. The Mythbusters conclude that this 8 point difference confirms the myth that women are better multitaskers. Does statistical analysis agree?

The statistical perspective

The average scores are based on samples rather than the entire population of men and women. Samples contain error because they are a subset of the entire population. Consequently, a sample mean and the corresponding population mean are likely to be different. It’s possible that if we reran the experiment, the sample results could change.
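To get a feel for how much sample means can move around by chance, here is a minimal simulation sketch. It assumes, purely for illustration, that men and women come from the same normally distributed population with a mean of 68 and a standard deviation of 15; neither value comes from the Mythbusters data, and real scores are bounded between 0 and 100. The function name and its parameters are hypothetical.

```python
import random

def share_of_large_gaps(n_per_group, gap=8.0, mean=68.0, sd=15.0,
                        trials=5_000, seed=1):
    """Fraction of simulated experiments in which group A outscores group B
    by at least `gap` points, even though both groups are drawn from the
    SAME population (mean and sd are illustrative assumptions)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        a = [rng.gauss(mean, sd) for _ in range(n_per_group)]
        b = [rng.gauss(mean, sd) for _ in range(n_per_group)]
        if sum(a) / n_per_group - sum(b) / n_per_group >= gap:
            hits += 1
    return hits / trials

# Compare 10 subjects per group (as on the show) with 100 per group.
print(share_of_large_gaps(10))
print(share_of_large_gaps(100))
```

Under these assumptions, an 8-point gap between two groups of 10 shows up by chance far more often than it does between two groups of 100, which is why the sample size matters so much here.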

We want to be reasonably sure that the observed difference between samples actually represents a true difference between the entire population of men and women. This is where hypothesis tests play a role.

Choosing the correct hypothesis test

Because we want to compare the means between two groups, you might think that we’ll use the 2-Sample t test. However, based on a Normality Test, these data appear to be nonnormal.

The 2-Sample t test is robust to nonnormal data when each sample has at least 15 subjects (30 total). However, our sample sizes are too small for this test to handle nonnormal data. Therefore, we can’t trust the p-value calculated by the 2-Sample t test for these data.

Instead, we’ll use the nonparametric Mann-Whitney test, which compares the medians. Nonparametric tests have fewer requirements and are particularly useful when your data are nonnormal and you have small sample sizes. We’ll use a one-tailed test to determine whether the median multitasking score for women is greater than the median men’s score.

To run the test in Minitab statistical software, go to: Stat > Nonparametrics > Mann-Whitney
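For readers who want to see what the test does under the hood, the sketch below computes a one-sided Mann-Whitney test using the normal approximation with tie and continuity corrections. The scores are made-up placeholders, not the values from the worksheet, and the function name is hypothetical. With only 10 subjects per group an exact p-value is preferable; SciPy's `scipy.stats.mannwhitneyu` with `alternative="greater"` is one library option for that.

```python
from collections import Counter
from math import sqrt
from statistics import NormalDist

def mann_whitney_greater(x, y):
    """One-sided Mann-Whitney test of H1: values in x tend to exceed those in y.
    Returns (U, p) using the normal approximation."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # U counts the (x, y) pairs in which x wins; a tie counts as half a win.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    # Tie correction: each group of t tied values shrinks the variance of U.
    tie_term = sum(t**3 - t for t in Counter(list(x) + list(y)).values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    # Subtract 0.5 as a continuity correction before converting to a z-score.
    z = (u - n1 * n2 / 2 - 0.5) / sqrt(var_u)
    return u, 1 - NormalDist().cdf(z)

# Hypothetical scores for illustration only (NOT the Mythbusters data).
women = [55, 60, 65, 70, 72, 75, 80, 85, 90, 95]
men = [40, 50, 58, 60, 62, 66, 70, 74, 78, 88]
u, p = mann_whitney_greater(women, men)
print(f"U = {u}, one-sided p = {p:.4f}")
```

A small p-value would indicate that the first group tends to score higher; a p-value above 0.05 means the data do not rule out chance as the explanation.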

The Mann-Whitney test results

Mann-Whitney test results

The p-value of 0.1271 is greater than 0.05, which indicates that the women’s median is not significantly greater than the men’s median. Further, the 95% confidence interval suggests that the median pairwise difference is likely between -9.99 and 30.01. Because the confidence interval includes both positive and negative values, it would not be surprising to repeat the experiment and find that men had the higher median!

The Mythbusters looked at the sample means and “Confirmed” the myth. However, the data do not support the conclusion that women have a higher median score than men.

Power analysis to determine sample size

If the Mythbusters were to perform this experiment again, how many subjects should they recruit? For a start, if they recruit at least 15 subjects per group, they can use the more powerful 2-Sample t test.

I’ll perform a power analysis for a 2-sample t test to estimate a good sample size based on the following:

  • I’ll assume that the difference must be at least 10 points to be practically meaningful.
  • I want to have an 80% chance of detecting a meaningful difference if it exists.
  • I’ll use the sample standard deviation.

In Minitab, go to Stat > Power and Sample Size > 2-Sample t and fill in the dialog as follows:

Power and sample size for 2-sample t dialog

Under Options, choose Greater than, and click OK in all dialogs.

Power and sample size results for 2-sample t test

The output shows that we need 29 subjects per group, for a total of 58, to have a reasonable chance of detecting a meaningful difference, if that difference actually exists between the two populations.
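The logic behind this kind of calculation can be sketched with the standard normal-approximation formula for a one-sided, two-sample comparison. This is not Minitab's calculation, which is based on the t distribution and usually asks for slightly more subjects. The standard deviation of 15 below is a placeholder assumption, because the sample standard deviation is not quoted in this post, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def subjects_per_group(difference, sd, alpha=0.05, power=0.80):
    """Approximate subjects needed per group for a one-sided, two-sample
    comparison of means, using the normal approximation
    n = 2 * ((z_alpha + z_power) * sd / difference) ** 2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)  # cutoff for the one-sided significance level
    z_power = z.inv_cdf(power)      # cutoff for the desired power
    return ceil(2 * ((z_alpha + z_power) * sd / difference) ** 2)

# 10-point meaningful difference and 80% power, as in the post;
# sd=15 is an ASSUMED value, not the worksheet's standard deviation.
print(subjects_per_group(difference=10, sd=15))
```

The formula makes the trade-offs visible: halving the difference you want to detect, or doubling the standard deviation, roughly quadruples the required sample size.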

Are Men Better at Parallel Parking?

The Mythbusters also wanted to determine whether men are better at parallel parking than women. They devised a test that produces scores between 0 and 100. At first glance, this scenario appears similar to the multitasking myth, where we would compare means or medians. However, the men’s and women’s means and medians are virtually identical and are not significantly different according to any test.

Descriptive statistics for parallel parking by gender

There’s a different story behind this myth. During the parking test, the hosts notice that the women’s scores seem more variable than the men’s. The women are either really good or really bad, while men are somewhere in between, as you can see below.

Individual value plot of parallel parking scores by gender

We want to be reasonably sure that the observed difference in variability actually represents a true difference between the populations. We need to use the correct hypothesis test, which is Two Variances (Stat > Basic Statistics > 2 Variances). The test results are below:

Two variances test results for parallel parking by gender

The null hypothesis is that the variability in both groups is equal. Because the p-value (displayed as 0.000, which means it is below 0.0005) is less than 0.05, we can reject the null hypothesis and conclude that women’s scores for parallel parking are more variable than men’s scores.
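As a rough illustration of what an equal-variability test checks, here is a permutation-test sketch built on absolute deviations from each group's median. It is a stand-in for illustration, not the method Minitab's 2 Variances procedure uses, and the scores are invented to mimic the pattern described above (one group at the extremes, the other in the middle). The function name is hypothetical.

```python
import random
from statistics import fmean, median

def spread_permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for H0: both groups have equal spread.
    Spread is measured as the mean absolute deviation from the group median."""
    # Center each group at its own median so a shift in location
    # is not mistaken for a difference in spread.
    dev_a = [abs(v - median(a)) for v in a]
    dev_b = [abs(v - median(b)) for v in b]
    observed = abs(fmean(dev_a) - fmean(dev_b))
    pooled = dev_a + dev_b
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign deviations to the two groups
        diff = abs(fmean(pooled[:len(a)]) - fmean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add 1 to numerator and denominator so the p-value is never exactly 0.
    return (hits + 1) / (n_perm + 1)

# Hypothetical scores for illustration only (NOT the Mythbusters data).
women = [5, 8, 10, 12, 15, 88, 90, 92, 95, 100]
men = [47, 48, 49, 50, 50, 50, 50, 51, 52, 53]
print(spread_permutation_test(women, men))
```

The add-one adjustment is also a reminder of why a reported p-value of 0.000 should be read as "very small" rather than literally zero.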

The Mythbusters correctly busted this myth because the means and medians are essentially equal. We can't conclude that one gender is better at parallel parking than the other.

However, we can conclude that men are more consistent at parallel parking than women.

Closing Thoughts

In one of their videos, Adam and Jamie explain that they understand the importance of sample size. However, Adam states that the Mythbusters put more effort into the methodology of collecting good data. It’s true, they are great at reducing sources of variation, obtaining accurate measurements, etc. He goes on to explain that they just don’t have the resources to obtain larger sample sizes. Fair enough—for a television show.

However, if you’re in science or Six Sigma, you don’t have this luxury. You must:

  • Have a good methodology for collecting data
  • Have a sufficient sample size
  • Use the correct statistical analysis

Without all of the above, you risk drawing incorrect conclusions.