# 3 Common (and Dangerous!) Statistical Misconceptions

Have you ever been a victim of a statistical misconception that’s affected how you’ve interpreted your analysis? Like any field of study, statistics has some common misconceptions that can trip up even experienced statisticians. Here are a few common misconceptions to watch out for as you complete your analyses and interpret the results.

### Mistake #1: Misinterpreting Overlapping Confidence Intervals

When comparing multiple means, statistical practitioners are sometimes advised to compare the confidence intervals and determine whether they overlap. When 95% confidence intervals for the means of two independent populations don’t overlap, there will indeed be a statistically significant difference between the means (at the 0.05 level of significance). However, the converse is not necessarily true. The CIs may overlap even though the difference between the means is statistically significant, because the standard error of the difference is smaller than the sum of the two individual standard errors.

Take this example:

Two means whose 95% confidence intervals overlap may still be significantly different at the 0.05 level of significance.

**What’s the significance of the t-test P-value?** The P-value in this case is less than 0.05 (0.049 < 0.05), telling us that there is a statistically significant difference between the means, even though the CIs overlap considerably.
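The overlap effect can be reproduced with a few lines of arithmetic. The sketch below uses made-up summary statistics (the means and standard errors are assumptions, not the data behind the example above) and a large-sample z-test in place of the t-test, so it only illustrates the idea.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical summary statistics (not taken from the article's example).
mean_a, se_a = 10.0, 0.2
mean_b, se_b = 10.6, 0.2

# Critical value for a 95% interval under the normal approximation (about 1.96).
z_crit = NormalDist().inv_cdf(0.975)


def confidence_interval(mean, se):
    """Large-sample 95% confidence interval for a single mean."""
    return (mean - z_crit * se, mean + z_crit * se)


ci_a = confidence_interval(mean_a, se_a)
ci_b = confidence_interval(mean_b, se_b)

# Two intervals overlap when each one starts before the other ends.
intervals_overlap = ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

# Two-sided z-test for the difference between two independent means.
# The standard error of the difference combines the two in quadrature,
# so it is smaller than se_a + se_b.
se_diff = sqrt(se_a**2 + se_b**2)
z_stat = (mean_b - mean_a) / se_diff
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

print(f"CI A: ({ci_a[0]:.3f}, {ci_a[1]:.3f})")
print(f"CI B: ({ci_b[0]:.3f}, {ci_b[1]:.3f})")
print(f"Intervals overlap: {intervals_overlap}")
print(f"Two-sided p-value: {p_value:.4f}")
```

With these assumed numbers the two intervals share part of their range, yet the p-value for the difference falls below 0.05.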

### Mistake #2: Making Incorrect Inferences about the Population

With statistics, we can analyze a small sample to make inferences about the entire population. But those inferences hold only when the sample represents the population. Here are a few situations where it does not:

- In capability analysis, data from a single day is sometimes inappropriately used to estimate the capability of the entire manufacturing process.
- In acceptance sampling, samples are sometimes drawn from only one section of the lot and then used to judge the entire lot.
- In reliability analysis, a common and severe case occurs when only the units that failed are analyzed, even though the population is all units produced.

To avoid these situations, define the population before sampling and take a sample that truly represents the population.
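To see how much a single-day sample can mislead a capability estimate, here is a small simulation. Everything in it is assumed for illustration: the drifting process, the specification limits `LSL` and `USL`, and the simplified index (specification width divided by six sample standard deviations) are not taken from any real process.

```python
import random
from statistics import stdev

random.seed(0)  # fixed seed so the sketch is repeatable

DAYS, UNITS_PER_DAY = 20, 200
LSL, USL = 90.0, 110.0  # hypothetical specification limits

# Hypothetical process: the mean drifts steadily from day to day,
# while the within-day standard deviation stays at 1.0.
population = []
for day in range(DAYS):
    day_mean = 100.0 + 0.5 * (day - (DAYS - 1) / 2)
    population.append(
        [random.gauss(day_mean, 1.0) for _ in range(UNITS_PER_DAY)]
    )


def capability_index(sample):
    """Simplified index: spec width over six sample standard deviations."""
    return (USL - LSL) / (6 * stdev(sample))


# Sample 1: every unit from the first day only.
single_day_sample = population[0]

# Sample 2: the same number of units drawn at random from all days.
all_units = [x for day_units in population for x in day_units]
representative_sample = random.sample(all_units, UNITS_PER_DAY)

index_single_day = capability_index(single_day_sample)
index_representative = capability_index(representative_sample)

print(f"Index from one day:   {index_single_day:.2f}")
print(f"Index from all days:  {index_representative:.2f}")
```

The single-day sample never sees the day-to-day drift, so it reports a much more capable process than the sample drawn across the whole production period.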

### Mistake #3: Assuming Correlation = Causation

It’s sometimes overused, but “correlation does not imply causation” is a good reminder when you’re dealing with statistics. Correlation between two variables does not mean that one variable causes a change in the other, especially if correlation statistics are the only statistics you are using in your data analysis.

For example, data analysis has shown a strong positive correlation between shirt size and shoe size. As shirt size goes up, so does shoe size. Does this mean that wearing big shirts causes you to wear bigger shoes? Of course not! There could be other “hidden” factors at work here, such as height. (Tall people tend to wear bigger clothes and shoes.)
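A hidden factor like this is easy to simulate. In the sketch below, height drives both sizes and the sizes never influence each other. All numbers and units are invented for illustration. The partial correlation, computed with the standard first-order formula, shows what is left of the shirt-shoe relationship once height is held fixed.

```python
import random
from math import sqrt

random.seed(1)  # fixed seed so the sketch is repeatable


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: each size depends on height plus independent noise.
heights = [random.gauss(170, 10) for _ in range(2000)]
shirt_sizes = [0.1 * h + random.gauss(0, 0.5) for h in heights]
shoe_sizes = [0.1 * h + random.gauss(0, 0.5) for h in heights]

r_shirt_shoe = pearson(shirt_sizes, shoe_sizes)
r_shirt_height = pearson(shirt_sizes, heights)
r_shoe_height = pearson(shoe_sizes, heights)

# Partial correlation of shirt and shoe size, controlling for height.
partial_r = (r_shirt_shoe - r_shirt_height * r_shoe_height) / sqrt(
    (1 - r_shirt_height**2) * (1 - r_shoe_height**2)
)

print(f"Correlation, shirt vs. shoe:          {r_shirt_shoe:.2f}")
print(f"Partial correlation given height:     {partial_r:.2f}")
```

The raw correlation is strong, but it shrinks to roughly zero after accounting for height, which is what you would expect when a third variable is responsible for the association.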

Take a look at this scatterplot that shows that HIV antibody false negative rates are correlated with patient age:

Does this show that the HIV antibody test does not work as well on older patients? Well, maybe …

But you can’t just stop there and assume that just because patients are older, age is the factor that is causing them to receive a false negative test result (a false negative occurs when a patient tests negative but is confirmed to have the disease).

**Dig a little deeper!** Here you see that patient age and days elapsed between at-risk exposure and test are correlated:

Older patients got tested faster … before the HIV antibodies were able to fully develop and show a positive test result.

Keep the idea that “correlation does not imply causation” in your mind when reading some of the many studies publicized in the media. Intentionally or not, the media frequently imply that a study has revealed some cause-and-effect relationship, even when the study's authors detail precisely the limitations of their research.

Name: Alex • Wednesday, February 20, 2013

Greetings! What about a low value of R-sq in a regression model? Does it mean that your model makes no sense at all?

Name: Carly Barry • Wednesday, February 20, 2013

Hi Alex - Thanks for your comment. The interpretation of R-squared would make a great 4th statistical misconception!

To answer your question: The higher the R-sq percentage, the better the model fits the data. I wouldn’t say that a low R-sq value means that the model doesn’t “make sense”; it just means that less of the variation in the response is explained by the equation. Thus, the model could potentially benefit from including other variables that could help to increase R-sq and explain more of the variability in the response.

Thanks for reading!

Carly

Name: Carly Barry • Thursday, February 21, 2013

Hi Alex - Great answer! This is a good topic that I'm planning to address in a future post.

Best,

Carly