# Busting the Mythbusters with Statistics: Are Yawns Contagious?

*This looks like a typical Mythbusters experiment!*

Statistics can be unintuitive. What’s a large difference? What’s a large sample size? When is something statistically significant? You might *think* you know, based on experience and intuition, but you really don’t know until you actually run the analysis. You have to run the proper statistical tests to know what the data are telling you!

Even experts can get tripped up by their hunches, as we'll see.

In my family, we’re huge fans of the Mythbusters. This fun Discovery Channel show mixes science and experiments to prove or disprove various myths, urban legends, and popular beliefs. Are Daddy Longleg spiders really super poisonous? Can diving underwater protect you from an explosion, or being shot? Are toilets really the cleanest place in the house? What is the fastest way to cool a beer? They often find a way to work in impressive explosions, one of their hallmarks. Thanks to Mythbusters, my 7-year-old daughter was able to explain to me that you can identify the explosive ANFO because it’s made out of pellets!

I love the Mythbusters because they make science fun. They find ways to test the myths and go to extensive efforts to rule out competing variables. The hosts go through extensive planning and small-scale testing before conducting the full-sized experiment. The Mythbusters’ skilled crew and well-stocked workshop can build a rig or robot to test virtually anything in a controlled and repeatable fashion. They also place a strong focus on collecting data and using that data to make decisions about the myths. This show is a fun way to bring the scientific method alive for our young daughter. Good stuff!

Having said that, I did catch them making a statistical mistake during an episode we watched recently. I’m pointing this mistake out only to highlight how non-intuitive statistics can be, and not to put down the hard work of the Mythbusters.

## The Myth: Yawning is Contagious

This episode tested the myth that yawning is contagious—so if you see someone yawn, you’re more likely to yawn yourself. They recruited 50 people who thought they were being considered for an appearance on the show. One by one, each subject spoke with the recruiter who either yawned, or not, during the spiel. The subjects then sat by themselves in an isolation room and were told to wait. While in the isolation room for a set amount of time, unbeknownst to them, the Mythbusters watched to see if they yawned.

The results:

- 25%, 4 out of 16, who were not exposed to a yawn, yawned while waiting. I’ll call this the non-yawn group.
- 29%, 10 out of 34, who were exposed to a yawn, yawned. I’ll call this the yawn group.

Jamie Hyneman, one of the hosts, concluded that because of their large sample size (n=50), the difference of 4 percentage points was meaningful. They didn’t run a statistical test; the decision rested on his intuition about the statistical power that the sample size gave them. Let’s test this out a bit more rigorously.

## Testing the Myth with the Two Proportions Test

To test their data, we’ll need to use the two proportions test in Minitab (**Stat > Basic Statistics > 2 Proportions**). We can use summarized data rather than data in a worksheet.

Fill in the main 2 Proportions dialog like this:

The Mythbusters wanted to test whether the proportion for the yawn group was greater than the non-yawn group. So we need to perform a one-sided test, which also provides a little more statistical power.

Click **Options** and choose **greater than** as the alternative hypothesis to determine whether the first proportion is greater than the second proportion.

We get the following output:

You’ll see that there are two p-values. The Fisher’s exact test is for small sample sizes. The note about the normal approximation and small sample sizes indicates that we should use the Fisher’s exact test P-Value of 0.513. This value is greater than any reasonable alpha value (typically 0.05), so we can’t reject the null hypothesis.

**Conclusion:** the data do **not** show that there is a higher proportion of yawning subjects in the yawn group than in the non-yawn group. Further, rather than having a large sample, Minitab indicates that the sample is small.
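For readers without Minitab, the one-sided Fisher’s exact P-value can be checked by hand. The sketch below is not Minitab’s implementation; it simply sums hypergeometric probabilities using Python’s standard library, and the function name `fisher_exact_greater` is made up for this illustration. With the counts from the episode it should land on the same 0.513.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact P-value for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a), where X is the number of "successes" in row 1 when
    the row and column totals are held fixed (hypergeometric distribution).
    """
    row1 = a + b          # size of group 1 (yawn group)
    row2 = c + d          # size of group 2 (non-yawn group)
    successes = a + c     # total number of subjects who yawned
    total = row1 + row2
    max_a = min(row1, successes)

    denom = comb(total, successes)
    numer = sum(
        comb(row1, k) * comb(row2, successes - k)
        for k in range(a, max_a + 1)
    )
    return numer / denom


# Yawn group: 10 of 34 yawned. Non-yawn group: 4 of 16 yawned.
p_value = fisher_exact_greater(10, 24, 4, 12)
print(f"One-sided Fisher's exact P-value: {p_value:.3f}")
```

Because the test is exact, no normal approximation is involved, which is why it is the safer choice for groups this small.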

## Power and Sample Size: How Large is Large Enough?

Fans of the show know that when they can’t confirm a myth, the Mythbusters find an exaggerated way to replicate the myth to show the extreme conditions that are necessary to make the myth happen. This method is a great way to increase the number of explosions they get to show!

As much as I want to, I can’t give you an impressive explosion for the blog post finale! However, I can give you a startling answer to the question of how large a sample the Mythbusters needed to have a good chance to detect a difference of 29% versus 25%. The answer is so large that you might just end up waving your arms around like Adam Savage!

To figure this out, we’ll use Minitab’s Power and Sample Size calculation for Two Proportions (**Stat > Power and Sample Size > 2 Proportions**). We’ll use the proportions from the study and a power of 0.8, which is a good standard value, as I’ve discussed here.

In a nutshell, a power of 0.8 indicates that a study has an 80% chance of detecting a difference between the 2 populations if that difference truly exists.

Fill in the dialog like this:

Under **Options**, choose **Greater than (p1 > p2)**. We get the following results:

The results show that the Mythbusters needed a whopping 1,523 subjects per group (3,046 total) to have an 80% chance of detecting the small difference in population proportions! That's a far cry from the 50 subjects that they actually had. Why is this so large? There are two main reasons.

First, the effect size is small and that requires a larger sample. Second, the data for this test are categorical rather than continuous. The subjects either yawned or did not yawn while in the isolation room. Generally speaking, any given amount of categorical data represents less useful information than the same amount of continuous data. Consequently, you need a larger sample size when you're analyzing categorical data.
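To see where a number like 1,523 comes from, here is a rough sketch using the textbook normal-approximation formula for comparing two proportions with equal group sizes. Whether Minitab uses exactly this formula internally is an assumption on my part, and the function name is hypothetical; the point is only that the required sample size grows with the inverse square of the difference.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, power=0.8, alpha=0.05):
    """Per-group sample size for a one-sided two-proportions test.

    Normal approximation with equal group sizes (an assumption of this
    sketch): pooled variance under the null, unpooled under the alternative.
    """
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha)   # one-sided critical value
    z_beta = norm.inv_cdf(power)

    p_bar = (p1 + p2) / 2
    null_sd = sqrt(2 * p_bar * (1 - p_bar))
    alt_sd = sqrt(p1 * (1 - p1) + p2 * (1 - p2))

    n = ((z_alpha * null_sd + z_beta * alt_sd) / (p1 - p2)) ** 2
    return ceil(n)


n_per_group = sample_size_two_proportions(0.29, 0.25)
print(f"Subjects needed per group: {n_per_group}")
```

The difference `p1 - p2` sits in the denominator and is squared, so halving the effect size roughly quadruples the sample you need.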

## Retrospective Power Analysis

We can also take the results of the study and use them to determine how much power the study had. To do this, we input the sample size and the estimate of each proportion from the study into the power and sample size dialog. Of course, we don’t know the true values of the population proportions, but the study provides the best estimates that we have at this point.

For this study, Minitab calculates a power of 0.09. This value indicates that there was less than a 10% chance of detecting such a small difference, assuming that the difference truly exists. Therefore, nonsignificant results are to be expected for this study regardless of whether the difference truly exists.
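The power value can be approximated the same way. This sketch assumes two equal groups of 25 (the 50 subjects split evenly, as in my reply in the comments), which is a simplification because the actual groups had 16 and 34 subjects. It uses a normal approximation, and the function name is invented for the example.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a one-sided two-proportions test (p1 > p2).

    Normal approximation with equal group sizes; an illustrative sketch,
    not necessarily the exact method Minitab uses.
    """
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha)

    p_bar = (p1 + p2) / 2
    null_sd = sqrt(2 * p_bar * (1 - p_bar))
    alt_sd = sqrt(p1 * (1 - p1) + p2 * (1 - p2))

    z = ((p1 - p2) * sqrt(n_per_group) - z_alpha * null_sd) / alt_sd
    return norm.cdf(z)


power_25 = power_two_proportions(0.29, 0.25, 25)
power_50 = power_two_proportions(0.29, 0.25, 50)
print(f"Power with 25 per group: {power_25:.3f}")
print(f"Power with 50 per group: {power_50:.3f}")
```

Even doubling the study to 50 per group barely moves the power, which shows how far the experiment was from being able to detect a 4-point difference.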

## Closing Thoughts: The Mythbusters Need Minitab

Given the results of the 2 Proportions Test and the power analysis, we can conclude:

- There is no evidence that yawns are contagious.
- The study had inadequate power to detect a difference.

Coming from the university world of academic research projects, I would say that the Mythbusters conducted a pilot study. These are small experiments designed to gather initial estimates (such as the proportions) and determine the feasibility of conducting a larger study. At this point, the main result is that the study, as it was performed, was not up to the task at hand. It could not reasonably detect the size of the difference that is likely to exist, if there is even a difference.

That does not mean that this project was a waste of time, though, because you don’t know this until you do at least some research.

In the research world, the question now would be whether further research is worthwhile. This determination is different for each research project. You need to balance the effect size (small in this case), the benefits (negligible), and the additional costs (very large for a much larger sample size). So, I'd guess that a large follow-up study is unlikely to happen!

We remain huge fans of the Mythbusters! This case study only serves to highlight the fact that conducting research and data analysis is a tricky business that can trip up even the experts! That’s why you need Minitab Statistical Software in your corner. The Mythbusters should look into getting a copy!

If you like the Mythbusters, read my post about the Mythbusters and the battle of the sexes!

Name: Dave Blundell • Friday, July 13, 2012

Hi Jim

Attempted to reproduce your result of 0.09 from the section entitled "Retrospective Power Analysis". The result I got using the power and sample size for 2 proportions was 0.124952. What have I done wrong?

Best regards,

Dave

Name: Jim Frost • Friday, July 13, 2012

Hi Dave,

I couldn't quite replicate the power value that you got, but I'm wondering if you entered 50 for the sample size? When I do that, I get a power of 0.115931, which is close to yours. If so, you actually need to enter the sample size for each group, which is 25 (2 X 25 = 50).

Also, be sure to go to Options and make it a one-sided test. Here's exactly what I entered:

- Sample sizes: 25
- Comparison proportions (p1): 0.29
- Power values: (blank)
- Baseline proportion (p2): 0.25

Under Options, choose Greater than (p1 > p2).

The power I get is 0.0921459.

Thanks for reading!

Jim

Name: Tom Ryan • Sunday, July 15, 2012

I object to the use of the term "retrospective power analysis", which is a nonsensical expression. This has been pointed out in the literature, including articles by Lenth (2001, The American Statistician), Hoenig and Heisey (2001, The American Statistician), and Zumbo and Hubley (1998, The Statistician).

In particular, quoting from Zumbo and Hubley (p. 387): "Finally, it is important to note that retrospective power cannot, generally, be computed in a research setting". They also stated "We suggest that it is nonsensical to make power calculations after a study has been conducted and a decision has been made".

It is indeed nonsensical if we realize that "power" and "probability" are synonymous terms. Please do not contribute to bad statistical practice by using such expressions.

Name: Jim Frost • Monday, July 16, 2012

Hi Tom,

Thanks for writing! You raise some good points. Retrospective power studies definitely have limits. This type of study should not be used to either justify or overturn the decision of a statistical test.

However, this type of power study is useful for planning purposes. If you don’t like the term “retrospective study”, then think of it as a prospective study for the next experiment. We’re simply taking what we learned from the most recent study and using that to adjust how we conduct the next study. I’ll illustrate how this works.

The value of retrospective power studies is for tests that produce insignificant results, such as this case. The Mythbusters probably initially expected a larger difference than what they ultimately observed. Using this new, smaller estimate of the difference in a power analysis provides useful guidelines about how large the sample size should be for the next study.

In planning for the next study, the researchers would presumably factor in both the size and general region of the previous study’s estimates, along with information about the smallest difference that they’re interested in detecting. Again, we don’t know the population proportions for sure (we never will) but the previous study provided the best estimates available. Let’s also assume that the researchers were not interested in detecting such a small difference as 4%, but rather wanted to detect a difference of 10% or greater.

Based on this, they enter in the Power and Sample Size for 2 Proportions dialog boxes:

- Sample sizes: (blank)
- Comparison proportions (p1): 0.35
- Power values: 0.8
- Baseline proportion (p2): 0.25

Under Options, choose Greater than (p1 > p2).

Minitab calculates that a sample size of 259 for each group (518 total) is required to achieve a power of 0.8. This is very useful information for the researchers. It tells them that they’ll likely need a sample size that is 10 times greater than their initial study.

They could run other scenarios as well. Perhaps the largest sample they can afford is 50 per group (100 total). With this sample size, what’s the smallest difference that they can detect with 0.8 power? Minitab can estimate the answer. It’s best that the researchers know all of this BEFORE they proceed!
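As a rough illustration of that last scenario, the sketch below searches for the smallest comparison proportion that reaches 0.8 power with 50 subjects per group and a baseline of 0.25. It relies on a normal approximation and a simple bisection search, both my own choices for this example, so Minitab's answer may differ slightly. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

NORM = NormalDist()


def power_one_sided(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a one-sided two-proportions test (p1 > p2)."""
    z_alpha = NORM.inv_cdf(1 - alpha)
    p_bar = (p1 + p2) / 2
    null_sd = sqrt(2 * p_bar * (1 - p_bar))
    alt_sd = sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    z = ((p1 - p2) * sqrt(n_per_group) - z_alpha * null_sd) / alt_sd
    return NORM.cdf(z)


def smallest_detectable_p1(p2, n_per_group, target_power=0.8, alpha=0.05):
    """Bisection search for the smallest p1 > p2 that reaches target_power."""
    lo, hi = p2, 0.999   # power is about alpha at lo and close to 1 at hi
    for _ in range(60):
        mid = (lo + hi) / 2
        if power_one_sided(mid, p2, n_per_group, alpha) < target_power:
            lo = mid
        else:
            hi = mid
    return hi


p1_min = smallest_detectable_p1(0.25, 50)
print(f"Smallest detectable comparison proportion: {p1_min:.3f}")
print(f"Smallest detectable difference: {p1_min - 0.25:.3f}")
```

Running several scenarios like this before collecting data makes the trade-off between budget and detectable effect explicit.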

I’ll agree that retrospective power analysis has its limits but it does provide some very useful information if used properly.

Sincerely,

Jim

Name: Tom Ryan • Thursday, July 19, 2012

Jim,

I agree with some of your points but my main argument is simply that the term "retrospective power" is not a valid statistical term. This is not just my opinion as retrospective power has been debunked in the articles that I cited.

The term "observed power" is also not a valid statistical term and is a misuse of statistics.

On the other hand, "conditional power", as proposed by Lan and Wittes (1988) is valid because it is prospective rather than retrospective in that it is essentially a conditional probability of rejecting the null hypothesis in favor of the alternative hypothesis at the end of a study period, conditional on the data that have been accumulated up to the point in time at which the conditional power is computed. It is applicable when data are slowly accruing from a nonsequentially designed trial.

Sincerely,

Tom

Name: Jim Frost • Thursday, July 19, 2012

Tom,

It is valid to take the results of a study and use them to fine tune any followup studies, and that's all I'm proposing here. If there was a followup study in this case, the power analysis suggests that they'd need a much larger sample size.

Jim

Name: Doug • Monday, July 23, 2012

Jim,

Great article. I shared it with some coworkers, since it is very easy to assume there are "differences" when they are actually not significant.

As far as "intuition" of significance, one thing I will do is to see how much the number of events can change before providing conflicting results.

In the Mythbuster data, having 1 more "yawner" in the "non-yawn" group changes the ratio to 5/16 (31%), already greater than the 10/34 of the "yawn" group.

To me that is a clear indication that the result is not significant (either the difference is not significant or the sample size is too small to detect).

What is your take on that?

Thanks,

Doug

Name: Jim Frost • Thursday, July 26, 2012

Hi Doug, thanks and I'm glad you enjoyed it!

I think your method is a great red flag. And, it's also an easy way to demonstrate it to someone who may not be familiar with statistics.

I do have one caution about relying on that approach: it doesn't always work. For example, let's assume that the Mythbusters collected 200 subjects per group and that the observed proportions for both groups remained the same. So, in the yawn group they had 58 yawn (29%) and in the non-yawn group they had 50 yawn (25%). Those numbers are still not significant because a one-tailed test gives us a p-value of 0.184.

If you add 1 to the non-yawn group, the number increases from 50 to 51 and the percentage increases from 25% to 25.5%. This might not trip a red flag using your approach.

Perhaps the best way to look at this is with the confidence interval, or the lower bound in this case because we're conducting a one-tailed test.

If you look at the results in the blog post, you'll see that the "95% lower bound for difference" is -0.175487. A negative difference would mean that there are more yawners in the group that was *not* exposed to yawns. This lower bound indicates that if we reran the study with the same sample size and conditions over and over, we would not be surprised to see results where the non-yawn group actually had nearly 18% more people who yawned. This clearly indicates that you can't be certain about the results!

If you take the example with the sample size of 200 per group and look at the lower bound, you'd see that it is about -0.03. You still couldn't conclude that the yawn group had more yawners. However, because the lower bound is -3%, even if you ran the study multiple times you wouldn't expect to see the non-yawn group having many more yawns. In other words, while the large sample size wouldn't have produced significant results in the hypothetical example, it does give us more information if we look at the lower bound.

I think your approach is great when it works (for very small samples) and easily demonstrates the nature of the problem. But, using confidence intervals (or bounds) will show the same thing more accurately.
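Here is a small sketch of the lower-bound comparison for anyone who wants to try it. It assumes the usual unpooled normal approximation for the difference of two proportions, and the function name `one_sided_summary` is made up for this example.

```python
from math import sqrt
from statistics import NormalDist

NORM = NormalDist()


def one_sided_summary(x1, n1, x2, n2, confidence=0.95):
    """One-sided P-value and lower confidence bound for p1 - p2.

    Uses the unpooled normal approximation (an assumption of this sketch).
    """
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

    p_value = 1 - NORM.cdf(diff / se)              # H1: p1 > p2
    lower = diff - NORM.inv_cdf(confidence) * se   # one-sided lower bound
    return p_value, lower


# Actual study: 10 of 34 (yawn group) versus 4 of 16 (non-yawn group).
_, lower_actual = one_sided_summary(10, 34, 4, 16)

# Hypothetical study: 58 of 200 versus 50 of 200.
p_hyp, lower_hyp = one_sided_summary(58, 200, 50, 200)

print(f"Actual study lower bound:       {lower_actual:.4f}")
print(f"Hypothetical study P-value:     {p_hyp:.3f}")
print(f"Hypothetical study lower bound: {lower_hyp:.3f}")
```

The P-value for the 16-versus-34 study is deliberately ignored here, because the normal approximation is unreliable for groups that small.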

Jim

Name: B S Warrier • Saturday, February 1, 2014

We may have to repeat the exercise with a much larger population, if we should arrive at a reliable conclusion. So also different trials with different age groups and backgrounds. We see in life that people yawn contagiously. Yes, there is a serious doubt. Why should man's precious hours be wasted in wild goose chases like these? Why not divert human energy into some productive endeavor?

Regards

B S Warrier