Not All P Values are Created Equal

The interpretation of P values would seem to be fairly standard between different studies. Even if two hypothesis tests study different subject matter, we tend to assume that you can interpret a P value of 0.03 the same way for both tests. A P value is a P value, right?

Not so fast! While Minitab statistical software can correctly calculate all P values, it can’t factor in the larger context of the study. You and your common sense need to do that!

In this post, I’ll demonstrate that P values tell us very different things depending on the larger context.

Recap: P Values Are Not the Probability of Making a Mistake

In my previous post, I showed the correct way to interpret P values. Keep in mind the big caution: P values are not the error rate, or the likelihood of making a mistake by rejecting a true null hypothesis (Type I error).

You can equate this error rate to the false positive rate for a hypothesis test. A false positive happens when the sample is unusual due to chance alone and it produces a low P value. However, despite the low P value, the alternative hypothesis is not true. There is no effect at the population level.

Sellke et al. estimated that a P value of 0.05 corresponds to a false positive rate of “at least 23% (and typically close to 50%).”

What Affects the Error Rate?

Why is there a range of values for the error rate? To understand that, you need to understand the factors involved. David Colquhoun, a professor in biostatistics, lays them out here.

Whereas Sellke et al. use a Bayesian approach, Colquhoun uses a non-Bayesian approach but derives similar estimates. For example, Colquhoun estimates P values between 0.045 and 0.05 have a false positive rate of at least 26%.

The factors that affect the false positive rate are:

Prevalence of real effects (higher is good)
Power (higher is good)
P value (lower is good)

“Good” means that the test is less likely to produce a false positive. The 26% error rate assumes a prevalence of real effects of 0.5 and a power of 0.8. If you decrease the prevalence to 0.1, suddenly the false positive rate shoots up to 76%. Yikes!

Power is related to false positives because when a study has a lower probability of detecting a true effect, a higher proportion of the positives will be false positives.

Now, let’s dig into a very interesting factor: the prevalence of real effects. As we saw, this factor can hugely impact the error rate!

P Values and the Prevalence of Real Effects

Joke: I once asked a statistician out. She failed to reject me! What Colquhoun calls the prevalence of real effects (denoted as P(real)), the Bayesian approach calls the prior probability. It is the proportion of hypothesis tests in which the alternative hypothesis is true at the outset. It can be thought of as the long-term probability, or track record, of similar types of studies. It’s the plausibility of the alternative hypothesis.

If the alternative hypothesis is farfetched, or has a poor track record, P(real) is low. For example, a prevalence of 0.1 indicates that 10% of similar alternative hypotheses have turned out to be true while 90% of the time the null was true. Perhaps the alternative hypothesis is unusual, untested, or otherwise implausible.

If the alternative hypothesis fits current theory, has an identified mechanism for the effect, and previous studies have already shown significant results, P(real) is higher. For example, a prevalence of 0.90 indicates that the alternative is true 90% of the time, and the null only 10% of the time.

If the prevalence is 0.5, there is a 50/50 chance that either the null or alternative hypothesis is true at the outset of the study.

You may not always know this probability, but theory and a previous track record can be guides. For our purposes, we’ll use this principle to see how it impacts our interpretation of P values. Specifically, we’ll focus on the probability of the null being true (1 – P(real)) at the beginning of the study.

Hypothesis Tests Are Journeys from the Prior Probability to Posterior Probability

Hypothesis tests begin with differing probabilities that the null hypothesis is true depending on the specific hypotheses being tested. This prior probability influences the probability that the null is true at the conclusion of the test, the posterior probability.

If P(real) = 0.9, there is only a 10% chance that the null hypothesis is true at the outset. Consequently, the probability of rejecting a true null at the conclusion of the test must be less than 10%. However, if you start with a 90% chance of the null being true, the odds of rejecting a true null increases because there are more true nulls.

Initial Probability of true null (1 – P(real))	P value obtained	Final Minimum Probability of true null
0.5	0.05	0.289
0.5	0.01	0.110
0.5	0.001	0.018
0.33	0.05	0.12
0.9	0.05	0.76

The table is based on calculations by Colquhoun and Sellke et al. It shows that the decrease from the initial probability to the final probability of a true null depends on the P value. Power is also a factor but not shown in the table.

Where Do We Go with P values from Here?

wooden block P There are many combinations of conditions that affect the probability of rejecting a true null. However, don't try to remember every combination and the error rate, especially because you may only have a vague sense of the true P(real) value!

Just remember two big takeaways:

A single statistically significant hypothesis test often provides insufficient evidence to confidently discard the null hypothesis. This is particularly true when the P value is closer to 0.05.
P values from different hypothesis tests can have the same value, but correspond to very different false positive rates. You need to understand their context to be able to interpret them correctly.

The second point is epitomized by a quote that was popularized by Carl Sagan: “Extraordinary claims require extraordinary evidence.”

A surprising new study may have a significant P value, but you shouldn't trust the alternative hypothesis until the results are replicated by additional studies. As shown in the table, a significant but unusual alternative hypothesis can have an error rate of 76%!

Don’t fret! There are simple recommendations based on the principles above that can help you navigate P values and use them correctly. I’ll cover five guidelines for using P values in my next post.