I’ve written a fair bit about P values: how to correctly interpret P values, a graphical representation of how they work, guidelines for using P values, and why the P value ban in one journal is a mistake. Along the way, I’ve received many questions about P values, but the questions from one reader stand out.
This reader asked: why is it so easy to interpret P values incorrectly? Why is the common misinterpretation so pervasive? And what can be done about it? He wasn’t sure whether these were fair questions, but I think they are. Let’s answer them!
First, to make sure we’re on the same page, here’s the correct definition of P values.
The P value is the probability of obtaining an effect at least as extreme as the one in your sample data, assuming the truth of the null hypothesis. In other words, if the null hypothesis is true, the P value is the probability of obtaining results at least as extreme as your sample data. It answers the question, are your sample data unusual if the null hypothesis is true?
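To make that definition concrete, here’s a minimal simulation sketch (my own illustration, with made-up numbers, not part of the original discussion): assume a null hypothesis that the population mean is 100, generate many samples in a world where that null is true, and count how often the sample mean lands at least as far from 100 as the mean we actually observed.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up example: the null hypothesis says the population mean is 100,
# the population standard deviation is 15, and the sample size is 25.
null_mean, sigma, n = 100, 15, 25
observed_mean = 106  # the sample mean we actually obtained

# Generate many samples in a world where the null hypothesis is true.
sim_means = rng.normal(null_mean, sigma, size=(100_000, n)).mean(axis=1)

# P value: the proportion of simulated sample means at least as far
# from the null value (in either direction) as the observed mean.
p_value = np.mean(np.abs(sim_means - null_mean) >= abs(observed_mean - null_mean))
print(f"Simulated two-tailed P value: {p_value:.3f}")
```

The number this prints (about 0.046 for these made-up inputs) is the probability of data this extreme in a world where the null hypothesis is true. It is not the probability that the null hypothesis is true.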
If you’re thinking that the P value is the probability that the null hypothesis is true, the probability that you’re making a mistake if you reject the null, or anything else along these lines, that’s the most common misunderstanding. You should click the links above to learn how to correctly interpret P values.
This problem is nearly a century old and goes back to two very antagonistic camps from the early days of hypothesis testing: Fisher's measures of evidence approach (P values) and the Neyman-Pearson error rate approach (alpha). Fisher believed in inductive reasoning, which is the idea that we can use sample data to learn about a population. On the other side, the Neyman-Pearson methodology does not allow analysts to learn from individual studies. Instead, the results only apply to a long series of tests.
Courses and textbooks have mushed these disparate approaches together into the standard hypothesis-testing procedure that is known and taught today. This procedure seems like a seamless whole, but it's really a muddled, Frankenstein's-monster combination of sometimes-contradictory methods that has promoted the confusion. The end result of this fusion is that P values are incorrectly entangled with the Type I error rate. Fisher tried to clarify this misunderstanding for decades, but to no avail.
For more information about the merging of these two schools of thought, I recommend this excellent article.
The common misconception describes exactly what we'd really like to know. We’d loooove to know the probability that a hypothesis is correct, or the probability that we’re making a mistake. What we get instead is the probability of our observation, which just isn’t as useful.
It would be great if we could take evidence solely from a sample and determine the probability that we’re drawing the wrong conclusion from it. Unfortunately, when you think about it, that's logically impossible. Without outside information, a sample can’t tell you whether it’s representative of the population.
P values are based exclusively on information contained within a sample. Consequently, P values can't answer the question that we most want answered, yet there seems to be an irresistible temptation to interpret them that way.
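If you want to see why in symbols, Bayes' theorem (sketched below in my own notation, not anything from the posts above) shows what's missing: turning the probability of the data under the null into the probability of the null given the data requires the prior probabilities of the hypotheses and the probability of the data under an alternative, and none of that lives inside the sample.

```latex
P(H_0 \mid \text{data}) =
  \frac{P(\text{data} \mid H_0)\, P(H_0)}
       {P(\text{data} \mid H_0)\, P(H_0) + P(\text{data} \mid H_1)\, P(H_1)}
```

The P value is a statement about $P(\text{data} \mid H_0)$ (more precisely, a tail probability computed under $H_0$); the other pieces have to come from outside the sample.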
The correct definition of a P value is fairly convoluted. The definition is based on the probability of observing what you actually did observe (huh?), but in a hypothetical context (a true null hypothesis), and it includes strange wording about results that are at least as extreme as what you observed. It's hard to understand all of that without a lot of study. It's just not intuitive.
Unfortunately, there is no simple and accurate definition that can help counteract the pressures to believe in the common misinterpretation. In fact, the incorrect definition sounds so much simpler than the correct definition. Shoot, not even scientists can explain P values! And, so the misconceptions live on.
Historical circumstances have conspired to confuse the issue. We have a natural tendency to want P values to mean something else. And, there is no simple yet correct definition for P values that can counteract the common misunderstandings. No wonder this has been a problem for a long time!
Fisher tried to correct this misinterpretation for decades but didn't have much luck. As for myself, I hope to point out that what may seem like a merely semantic difference between the correct and incorrect definitions actually amounts to a huge practical difference.
Using the incorrect definition is likely to come back to bite you! If you think a P value of 0.05 equates to a 5% chance of a mistake, boy, are you in for a big surprise—because it’s often around 26%! Instead, based on middle-of-the-road assumptions, you’ll need a P value around 0.0027 to achieve an error rate of about 5%. However, not all P values are created equal in terms of the error rate.
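For readers who want to see how numbers in that ballpark can arise, here is one widely cited calculation (the Sellke–Bayarri–Berger lower bound on the false positive rate, assuming 50/50 prior odds between the null and the alternative; I'm not claiming it's the exact calculation behind the figures above):

```python
import math

def false_positive_bound(p):
    """Sellke-Bayarri-Berger lower bound on the false positive rate for an
    observed P value p (valid for p < 1/e), assuming 50/50 prior odds
    between the null and the alternative hypotheses."""
    min_bayes_factor = -math.e * p * math.log(p)  # smallest Bayes factor favoring the null
    return min_bayes_factor / (1 + min_bayes_factor)

for p in (0.05, 0.0027):
    print(f"P value {p}: false positive rate of at least {false_positive_bound(p):.1%}")
```

Under these assumptions, it reports a false positive rate of at least roughly 29% for a P value of 0.05 and about 4% for 0.0027. Change the prior odds or the method and the exact numbers shift, which is precisely why the error rate isn't a fixed property of the P value.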
I also think that P values are easier for most people to understand graphically than through the tricky definition and the math. So, I wrote a series of blog posts that graphically show why we need hypothesis testing and how it works.
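In that spirit, here's a rough sketch (my own, not one of the posts referenced above) that draws the sampling distribution of the mean under the null hypothesis and shades the tail areas whose combined area is the P value, using the same made-up numbers as the simulation earlier.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Same made-up numbers as the simulation above: null mean 100,
# standard error 3 (sigma 15, n = 25), observed sample mean 106.
null_mean, se, observed_mean = 100, 3, 106

x = np.linspace(null_mean - 4 * se, null_mean + 4 * se, 500)
density = stats.norm.pdf(x, null_mean, se)

fig, ax = plt.subplots()
ax.plot(x, density)

# Shade both tails that are at least as extreme as the observed mean;
# the total shaded area equals the two-tailed P value.
diff = abs(observed_mean - null_mean)
ax.fill_between(x, density, where=np.abs(x - null_mean) >= diff, alpha=0.4)

p_value = 2 * stats.norm.sf(diff / se)
ax.set_title(f"Sampling distribution if the null is true (shaded area = P value ≈ {p_value:.3f})")
ax.set_xlabel("Sample mean")
plt.show()
```

Seeing the P value as a shaded area under a curve that was drawn assuming the null hypothesis is true makes it much harder to mistake it for the probability that the null hypothesis itself is true.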
I have no reason to expect that I'll have any more impact than Fisher did himself, but it's an attempt!