The interpretation of P values would seem to be fairly standard across studies. Even if two hypothesis tests study entirely different subject matter, we tend to assume that a P value of 0.03 can be interpreted the same way for both tests. A P value is a P value, right?
Not so fast! While Minitab statistical software can correctly calculate all P values, it can’t factor in the larger context of the study. You and your common sense need to do that!
In this post, I’ll demonstrate that P values tell us very different things depending on the larger context.
In my previous post, I showed the correct way to interpret P values. Keep in mind the big caution: P values are not the error rate, or the likelihood of making a mistake by rejecting a true null hypothesis (Type I error).
You can equate this error rate to the false positive rate for a hypothesis test. A false positive happens when the sample is unusual due to chance alone and it produces a low P value. However, despite the low P value, the alternative hypothesis is not true. There is no effect at the population level.
Sellke et al. estimated that a P value of 0.05 corresponds to a false positive rate of “at least 23% (and typically close to 50%).”
Why is there a range of values for the error rate? To understand that, you need to understand the factors involved. David Colquhoun, a professor of biostatistics, lays them out here.
Whereas Sellke et al. use a Bayesian approach, Colquhoun uses a non-Bayesian approach but derives similar estimates. For example, Colquhoun estimates P values between 0.045 and 0.05 have a false positive rate of at least 26%.
The factors that affect the false positive rate are:

- The prevalence of real effects, P(real): the proportion of studies like yours in which the effect truly exists (higher is good)
- The power of the test (higher is good)
- The P value that the test produces (lower is good)
“Good” means that the test is less likely to produce a false positive. The 26% error rate assumes a prevalence of real effects of 0.5 and a power of 0.8. If you decrease the prevalence to 0.1, suddenly the false positive rate shoots up to 76%. Yikes!
Power is related to false positives because when a study has a lower probability of detecting a true effect, a higher proportion of the positives will be false positives.
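To see where numbers like 26% and 76% come from, here is a rough Python sketch of the kind of simulation Colquhoun describes. It is my own illustration, not his code: it runs many two-sample t-tests in which only a given fraction of studies have a real effect, then asks how often a P value that lands just under 0.05 came from a study where the null was actually true. The sample size of 16 per group and the one-standard-deviation effect (giving roughly 0.8 power) are assumptions chosen to mirror his setup.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def false_positive_rate(prevalence, n_sims=200_000, n=16, effect=1.0):
    """Fraction of P values landing in (0.045, 0.05] that come from true nulls."""
    # Which simulated studies have a real effect (prevalence = P(real))
    real = rng.random(n_sims) < prevalence
    group1 = rng.normal(0.0, 1.0, (n_sims, n))
    group2 = rng.normal(np.where(real, effect, 0.0)[:, None], 1.0, (n_sims, n))
    # Two-sample t-test for each simulated study
    p = stats.ttest_ind(group1, group2, axis=1).pvalue
    # Keep only studies whose P value falls just under 0.05
    in_band = (p > 0.045) & (p <= 0.05)
    # Share of those "significant" results where the null was actually true
    return np.mean(~real[in_band])

print(false_positive_rate(prevalence=0.5))  # roughly 0.26 with ~0.8 power
print(false_positive_rate(prevalence=0.1))  # roughly 0.76 when only 10% of effects are real
```

The exact results will wobble a bit from run to run because it is a simulation, but they land close to the 26% and 76% figures quoted above.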
Now, let’s dig into a very interesting factor: the prevalence of real effects. As we saw, this factor can hugely impact the error rate!
If the alternative hypothesis is farfetched, or has a poor track record, P(real) is low. For example, a prevalence of 0.1 indicates that 10% of similar alternative hypotheses have turned out to be true while 90% of the time the null was true. Perhaps the alternative hypothesis is unusual, untested, or otherwise implausible.
If the alternative hypothesis fits current theory, has an identified mechanism for the effect, and previous studies have already shown significant results, P(real) is higher. For example, a prevalence of 0.90 indicates that the alternative is true 90% of the time, and the null only 10% of the time.
If the prevalence is 0.5, there is a 50/50 chance that either the null or alternative hypothesis is true at the outset of the study.
You may not always know this probability, but theory and a previous track record can be guides. For our purposes, we’ll use this principle to see how it impacts our interpretation of P values. Specifically, we’ll focus on the probability of the null being true (1 – P(real)) at the beginning of the study.
Hypothesis tests begin with differing probabilities that the null hypothesis is true depending on the specific hypotheses being tested. This prior probability influences the probability that the null is true at the conclusion of the test, the posterior probability.
If P(real) = 0.9, there is only a 10% chance that the null hypothesis is true at the outset. Consequently, the probability of rejecting a true null at the conclusion of the test must be less than 10%. However, if you start with a 90% chance of the null being true, the chance of rejecting a true null increases because there are far more true nulls to reject by mistake.
| Initial Probability of a True Null | P Value Obtained | Final Minimum Probability of a True Null |
|---|---|---|
| 0.5 | 0.05 | 0.289 |
| 0.5 | 0.01 | 0.110 |
| 0.5 | 0.001 | 0.018 |
| 0.33 | 0.05 | 0.12 |
| 0.9 | 0.05 | 0.76 |
The table is based on calculations by Colquhoun and Sellke et al. It shows that the size of the drop from the initial probability of a true null to the final probability depends on the P value you obtain. Power is also a factor, but it is not shown in the table.
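For the rows that start at a probability of 0.5, you can get very close to the table's numbers with the Sellke et al. bound on the Bayes factor, which is -e * p * ln(p). The Python sketch below is my own illustration of that calculation, not the authors' code; the function name and the default prior of 0.5 are just choices for the example. Plugging in other priors gives ballpark figures, though the table's 0.33 and 0.9 rows draw on Colquhoun's estimates and differ somewhat from this simple bound.

```python
import math

def min_prob_true_null(p_value, prior_prob_null=0.5):
    """Lower bound on the probability the null is still true after seeing the P value."""
    # Sellke et al. bound on the Bayes factor in favor of the null (valid for p < 1/e)
    bayes_factor_bound = -math.e * p_value * math.log(p_value)
    # Convert the initial probability of a true null into odds, update, convert back
    prior_odds = prior_prob_null / (1 - prior_prob_null)
    posterior_odds = prior_odds * bayes_factor_bound
    return posterior_odds / (1 + posterior_odds)

for p in (0.05, 0.01, 0.001):
    print(f"P = {p}: final minimum probability of a true null ~ {min_prob_true_null(p):.3f}")
# Prints roughly 0.289, 0.111, and 0.018, matching the 0.5 rows of the table
```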
Just remember two big takeaways:

- The same P value can correspond to very different false positive rates, depending on the larger context of the study, particularly the prevalence of real effects and the power of the test.
- The less plausible the alternative hypothesis is at the outset, the more skeptical you should be of a significant P value.
The second point is epitomized by a quote that was popularized by Carl Sagan: “Extraordinary claims require extraordinary evidence.”
A surprising new study may have a significant P value, but you shouldn't trust the alternative hypothesis until the results are replicated by additional studies. As the table shows, a significant result for an implausible alternative hypothesis can come with a false positive rate of 76%!
Don’t fret! There are simple recommendations based on the principles above that can help you navigate P values and use them correctly. I’ll cover five guidelines for using P values in my next post.