3 Common (and Dangerous!) Statistical Misconceptions
Have you ever been a victim of a statistical misconception that’s affected how you’ve interpreted your analysis? Like any field of study, statistics has some common misconceptions that can trip up even experienced statisticians. Here are a few common misconceptions to watch out for as you complete your analyses and interpret the results.
Mistake #1: Misinterpreting Overlapping Confidence Intervals
When comparing multiple means, statistical practitioners are sometimes advised to check whether the confidence intervals for those means overlap. When 95% confidence intervals for the means of two independent populations don’t overlap, there will indeed be a statistically significant difference between the means (at the 0.05 level of significance). However, the converse is not necessarily true. CIs may overlap even when the difference between the means is statistically significant.
Take this example:
Two means whose 95% confidence intervals overlap may still be significantly different at the 0.05 level of significance.
What does the t-test P-value tell us? In this case it is just below 0.05 (0.049 < 0.05), which indicates a statistically significant difference between the means, even though the CIs overlap considerably.
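To see why overlap and significance can disagree, here is a minimal sketch in Python. It is not the example from the figure above: the means and standard errors are hypothetical, and it uses a large-sample normal (z) approximation rather than a t-test so that it runs with only the standard library. The function name compare_two_means is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def compare_two_means(mean1, se1, mean2, se2, alpha=0.05):
    """Return both CIs, whether they overlap, and the two-sided p-value.

    Assumes independent samples large enough for a normal (z)
    approximation; this is a simplification, not a t-test.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05

    ci1 = (mean1 - z_crit * se1, mean1 + z_crit * se1)
    ci2 = (mean2 - z_crit * se2, mean2 + z_crit * se2)
    overlap = ci1[0] <= ci2[1] and ci2[0] <= ci1[1]

    # The test is based on the standard error of the difference,
    # sqrt(se1^2 + se2^2), which is never larger than se1 + se2.
    se_diff = sqrt(se1 ** 2 + se2 ** 2)
    z = (mean2 - mean1) / se_diff
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    return ci1, ci2, overlap, p_value


# Hypothetical summary statistics (not the article's data).
ci1, ci2, overlap, p_value = compare_two_means(10.0, 1.0, 13.0, 1.0)
print(f"CI 1: ({ci1[0]:.2f}, {ci1[1]:.2f})")
print(f"CI 2: ({ci2[0]:.2f}, {ci2[1]:.2f})")
print(f"Intervals overlap: {overlap}")
print(f"Two-sided p-value: {p_value:.4f}")
```

Under these assumptions the intervals overlap, yet the p-value falls below 0.05. The two intervals stop overlapping only when the means differ by more than about 1.96 × (se1 + se2), while the test needs a difference of only about 1.96 × sqrt(se1² + se2²), which is a lower bar.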
Mistake #2: Making Incorrect Inferences about the Population
With statistics, we can analyze a small sample to make inferences about the entire population. But there are a few situations where you should avoid making inferences about a population that the sample does not represent:
- In capability analysis, data from a single day is sometimes inappropriately used to estimate the capability of the entire manufacturing process.
- In acceptance sampling, samples from one section of the lot are selected for the entire analysis.
- In reliability analysis, a common and severe case occurs when only the units that failed are analyzed, even though the population is all units produced.
To avoid these situations, define the population before sampling and take a sample that truly represents the population.
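The reliability case can be illustrated with a small simulation. This is only a sketch under stated assumptions: lifetimes are drawn from an exponential distribution with a true mean of 1,000 hours, units are observed for 500 hours, and every number and name here (including simulate_failed_only_bias) is hypothetical rather than taken from real data.

```python
import random
from statistics import mean


def simulate_failed_only_bias(n_units=10_000, true_mean_life=1000.0,
                              observation_window=500.0, seed=1):
    """Compare mean life over all units with mean life over failed units only.

    Assumes exponentially distributed lifetimes (an illustrative choice).
    """
    rng = random.Random(seed)
    # expovariate takes the rate, which is 1 / mean.
    lifetimes = [rng.expovariate(1.0 / true_mean_life) for _ in range(n_units)]

    # Units that failed inside the observation window are the only ones
    # a "failures only" analysis would ever see.
    failed = [t for t in lifetimes if t <= observation_window]

    return mean(lifetimes), mean(failed), len(failed) / n_units


mean_all, mean_failed, fraction_failed = simulate_failed_only_bias()
print(f"Mean life, all units produced: {mean_all:.0f} hours")
print(f"Mean life, failed units only:  {mean_failed:.0f} hours")
print(f"Fraction of units that failed: {fraction_failed:.2f}")
```

Because every failed unit in this sketch has a lifetime shorter than the observation window, the failures-only average must come out below 500 hours, far below the true mean for all units produced. The sample of failures simply does not represent the population.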
Mistake #3: Assuming Correlation = Causation
It’s sometimes overused, but “correlation does not imply causation” is a good reminder when you’re dealing with statistics. Correlation between two variables does not mean that one variable causes a change in the other, especially if correlation statistics are the only statistics you are using in your data analysis.
For example, data analysis has shown a strong positive correlation between shirt size and shoe size. As shirt size goes up, so does shoe size. Does this mean that wearing big shirts causes you to wear bigger shoes? Of course not! There could be other “hidden” factors at work here, such as height. (Tall people tend to wear bigger clothes and shoes.)
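The shirt and shoe example can be sketched with simulated data. In the code below, height drives both sizes and neither size affects the other. The coefficients, units, and noise levels are invented for illustration, and the helper names pearson and partial_corr are hypothetical. The partial correlation uses the standard first-order formula for controlling a single variable.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(0)
# Hypothetical data: height (cm) is the hidden factor behind both sizes.
height = [rng.gauss(170, 10) for _ in range(5000)]
shirt = [0.5 * h + rng.gauss(0, 2) for h in height]    # arbitrary size units
shoe = [0.3 * h + rng.gauss(0, 1.5) for h in height]   # arbitrary size units

print(f"Shirt vs. shoe correlation:         {pearson(shirt, shoe):.2f}")
print(f"Same, after controlling for height: {partial_corr(shirt, shoe, height):.2f}")
```

In this simulation the raw correlation is strong, but it nearly vanishes once height is held fixed, which is what you would expect when a hidden factor rather than a causal link produces the relationship.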
Take a look at this scatterplot that shows that HIV antibody false negative rates are correlated with patient age:
Does this show that the HIV antibody test does not work as well on older patients? Well, maybe …
But you can’t stop there and assume that age itself is the factor causing older patients to receive a false negative test result. (A false negative occurs when a patient tests negative but is confirmed to have the disease.)
Dig a little deeper! Here you see that patient age and days elapsed between at-risk exposure and test are correlated:
Older patients got tested sooner after exposure … before the HIV antibodies had fully developed and could produce a positive test result.
Keep the idea that “correlation does not imply causation” in your mind when reading some of the many studies publicized in the media. Intentionally or not, the media frequently imply that a study has revealed some cause-and-effect relationship, even when the study's authors detail precisely the limitations of their research.