In my previous post, I wrote about the hypothesis testing ban in the Journal of Basic and Applied Social Psychology. I showed how P values and confidence intervals convey important information that descriptive statistics alone cannot. In this post, I'll cover the editors’ concerns about hypothesis testing and how to avoid the problems they describe.
The editors describe hypothesis testing as “invalid” and the significance level of 0.05 as a “crutch” for weak data. They claim that it is a bar that is “too easy to pass and sometimes serves as an excuse for lower quality research.” They also bemoan the fact that an initial study sometimes obtains a significant P value while follow-up replication studies fail to obtain significant results.
Ouch, right?
Their arguments against hypothesis testing focus on two issues: a significance level of 0.05 is a bar that is too easy to pass, and initial significant findings often fail to replicate in follow-up studies.
These issues are nothing new and aren't showstoppers for hypothesis testing. In fact, I believe using them to justify a ban on null hypothesis testing represents a basic misunderstanding of both how to correctly use hypothesis test results and how the scientific process works.
P values are not "invalid," but they answer a different question than many readers realize. There is a common misconception that the P value represents the probability that the null hypothesis is true. Under this mistaken understanding, a P value of 0.04 would indicate there is a 4% probability of a false positive when you reject the null hypothesis. This is WRONG!
The question that a P value actually answers is: If the null hypothesis is true, are my data unusual?
The correct interpretation of a P value of 0.04 is this: if the null hypothesis is true, you would obtain the observed effect, or a more extreme one, in 4% of studies due to random sampling error alone. In other words, the observed sample results are unlikely if there truly is no effect in the population.
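To make this concrete, here's a small simulation sketch in Python. The two-group design, sample sizes, and observed t-statistic are all hypothetical. It asks exactly the question above: when there truly is no effect, how often does random sampling alone produce data at least as extreme as a result with a P value of about 0.04?

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# A hypothetical observed study: two groups of 30, t = 2.10, p ~ 0.04.
# The P value answers: if the null is true, how often would random sampling
# alone produce a result at least this extreme?
n, n_sims = 30, 50_000
observed_t = 2.10  # roughly a two-sided p of 0.04 at 58 degrees of freedom

# Simulate many studies in which the null really is true: both groups are
# drawn from the same population, so any observed difference is pure noise.
group_a = rng.normal(0.0, 1.0, size=(n_sims, n))
group_b = rng.normal(0.0, 1.0, size=(n_sims, n))
t_stats, _ = stats.ttest_ind(group_a, group_b, axis=1)

share_extreme = np.mean(np.abs(t_stats) >= observed_t)
print(f"Share of true-null studies at least this extreme: {share_extreme:.3f}")
# Prints a value near 0.04: the P value describes how unusual the data are
# given a true null -- not the probability that the null itself is true.
```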
The actual false positive rate associated with a P value of 0.04 depends on a variety of factors but it is typically at least 23%. Unfortunately, the common misconception creates the illusion of substantially more evidence against the null hypothesis than is justified. You actually need a P value around 0.0027 to achieve an error rate of around 4.5%, which is close to the rate that many mistakenly attribute to a P value of 0.05.
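To see where that higher rate comes from, here's a rough simulation sketch. Everything about the scenario is an assumption for illustration: a 0.5 standard deviation effect when an effect exists, 50 subjects per group, and 50:50 prior odds that the null is true. Change those assumptions and the percentage shifts, which is exactly why the false positive rate "depends on a variety of factors."

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Assumed scenario: half of all studies test a true null, half test a real
# effect of 0.5 standard deviations, with 50 subjects per group. Among studies
# that happen to land near p = 0.04, what share actually had no effect?
n, n_sims, effect = 50, 100_000, 0.5

null_is_true = rng.random(n_sims) < 0.5            # 50:50 prior odds
shifts = np.where(null_is_true, 0.0, effect)
group_a = rng.normal(0.0, 1.0, size=(n_sims, n))
group_b = rng.normal(shifts[:, None], 1.0, size=(n_sims, n))
_, p_values = stats.ttest_ind(group_a, group_b, axis=1)

near_004 = (p_values >= 0.035) & (p_values <= 0.045)
false_positive_share = np.mean(null_is_true[near_004])
print(f"Share of 'p ~ 0.04' results with a true null: {false_positive_share:.0%}")
# Far higher than the 4% the misreading suggests; the exact share moves with
# the assumed effect size, sample size, and prior odds.
```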
The higher-than-expected false positive rate is the basis behind the editors’ criticisms that P values near 0.05 are a “crutch” and “too easy to pass.” However, this is due to misinterpretation rather than a problem with P values. The answer isn’t to ban P values, but to learn how to correctly interpret and use the results.
The common illusion described above ties into the second issue: studies that fail to replicate significant findings. If the false positive rate is higher than expected, it makes sense that the number of follow-up studies that can’t replicate the previously significant results will also be higher than expected.
Another related common misunderstanding is that once you obtain a significant P value, you have a proven effect. Trafimow claims in an earlier editorial that once a significant effect is published, "it becomes sacred." This claim misrepresents the scientific method because there is no magic significance level that distinguishes between the studies that have a true effect and those that don’t with 100% accuracy.
A P value near 0.05 simply indicates that the result is worth another look, but it’s nothing you can hang your hat on by itself. Instead, it’s all about repeated testing to lower the error rate to an acceptable level.
You always need repeated testing to prove the truth of an effect!
How does replication work with hypothesis tests and the false positive rate? Simulation studies show that the lower the P value, the greater the reduction in the probability that the null hypothesis is true from the beginning of the experiment to the end.
With this in mind, think of hypothesis tests as a filter that allows you to progressively lower the probability that the null hypothesis is true each time you obtain significant results. With repeated testing, we can filter out the false positives, as I illustrate below.
We generally don’t know the probability that a null hypothesis is true, but I’ll run through a hypothetical scenario based on the simulation studies. Let’s assume that initially there is a 50% chance that the null hypothesis is true. You perform the first experiment and obtain significant results. Let’s say this reduces the probability that the null is true down to 25%. Another study tests the same hypothesis, obtains significant results, and lowers the probability of a true null hypothesis even further to 10%.
Wash, rinse, and repeat! Eventually the probability that the null is true becomes a tiny value. This shows why significant results need to be replicated in order to become trustworthy findings.
The actual rate of reduction can be faster or slower than the example above. It depends on various factors including the initial probability of a true null hypothesis and the exact P value of each experiment. I used conservative P values near 0.05.
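Here's a minimal sketch of that filtering process using Bayes' rule on the odds that the null hypothesis is true. The factor-of-three reduction in the odds per significant study is an assumption chosen to match the hypothetical numbers above, not a property of any particular test or P value.

```python
# A sketch of the "filter" idea: update the odds of a true null after each
# significant study. Assumption (for illustration only): each significant
# result near p = 0.05 multiplies the odds in favor of the null by about 1/3,
# consistent with the hypothetical 50% -> 25% -> 10% sequence above.

def updated_null_probability(prior_prob, bayes_factor_for_null):
    """Apply one significant study to the probability that the null is true."""
    prior_odds = prior_prob / (1.0 - prior_prob)       # odds of a true null
    posterior_odds = prior_odds * bayes_factor_for_null
    return posterior_odds / (1.0 + posterior_odds)

prob_null = 0.50                  # starting point: a coin flip
for study in range(1, 6):
    prob_null = updated_null_probability(prob_null, bayes_factor_for_null=1/3)
    print(f"After study {study}: P(null is true) = {prob_null:.1%}")
# 25.0%, 10.0%, 3.6%, 1.2%, 0.4% -- repeated significant results progressively
# filter out the false positives.
```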
Of course, there’s always the possibility that the initial significant finding won’t be replicated. This is a normal part of the scientific process and not a problem. You won’t know for sure until a subsequent study tries to replicate a significant result!
Reality is complex and we’re trying to model it with samples. Conclusively proving a hypothesis with a single study is unlikely. So, don’t expect it!
"A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance."
—Sir Ronald A. Fisher, original developer of P values.
You can’t look at a P value to determine the quality level of a study. The overall quality depends on many factors that occur well before the P value is calculated. A P value is just the end result of a long process.
The factors that affect the quality of a study include the following: theoretical considerations, experimental design, variables measured, sampling technique, sample size, measurement precision and accuracy, data cleaning, and the modeling method.
Any of these factors can doom a study before a P value is even calculated!
The blame that the editors place on P values for low quality research appearing in their journal is misdirected. This is a peer-reviewed journal and it’s the reviewers’ job to assess the quality of each study and publish only those with merit.
Hypothesis tests and statistical output such as P values and confidence intervals are powerful tools. Like any tool, you need to use them correctly to obtain good results. Don't ban the tools. Instead, change the bad practices that surround them. Please follow these links for more details and references.
How to Correctly Interpret P Values: Just as the title says, this post helps you correctly interpret P values and avoid the mistakes associated with incorrect interpretations.
Understanding Hypothesis Tests: The graphical approach in this series of three posts provides a more intuitive understanding of how hypothesis testing works and what statistical significance truly means.
Not all P Values are Created Equal: If you want to better understand the false positive rate associated with different P values and the factors that affect it, this post is for you! It also shows how lower P values reduce the probability of a true null hypothesis.
Five Guidelines for Using P Values: The journal editors raise issues about how P values can be abused. These are real issues when P values are used incorrectly. However, there’s no need to banish them! This post provides simple guidelines for how to navigate these issues and avoid common problems.