Unwanted Male Pregnancy: How Error Begets Error

The demands of modern life can make us very distracted.  

We should all do our best to slow down and not make silly mistakes. But it can be tough.

With information coming at us from all directions, it's easy to get side-tracked and lose your …um, whatever.

But it's critical to prevent careless errors from creeping into your data. Because if you're not careful, a lot of innocent men may wind up getting pregnant by accident.

At least that's what happened to thousands of British men, who supposedly received gynecological, obstetric, and other prenatal services over a two-year period, according to a recent letter in the British Medical Journal.

But those men can relax...without Lamaze. Many were experiencing false pregnancies caused by incorrect diagnosis codes that were misentered into the National Health Service database.

Statistical Significance and the Slip of a Finger

Unfortunately, data entry errors are not nearly as uncommon as pregnant men. Yet even a single typo can invalidate the results of your statistical analysis.

Consider two sets of data that each contain 700 total cholesterol values.

The data sets have 699 identical values. However, in one data set, 280.0 has been misentered as 2800.0 in row 688 of the worksheet.  

Suppose you perform a 1-sample t-test to evaluate whether the mean total cholesterol of the population is greater than 200.
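The original analysis was done in Minitab, but the same test is easy to sketch in Python. The cholesterol values below are simulated, not the real worksheet data, so the exact numbers won't match the ones in the post; the mean and spread are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# 700 synthetic total cholesterol values with a true mean a bit above 200
cholesterol = rng.normal(loc=201.5, scale=7, size=700)
cholesterol[687] = 280.0          # the legitimate value in row 688 (index 687)

# The slip of a finger: 280.0 entered as 2800.0
with_typo = cholesterol.copy()
with_typo[687] = 2800.0

# One-sided 1-sample t-test of H0: mu = 200 vs. H1: mu > 200
t_stat, p_value = stats.ttest_1samp(with_typo, popmean=200,
                                    alternative='greater')
print(f"typo data: t = {t_stat:.2f}, p = {p_value:.4f}")
```

With this simulated data, the single wild value inflates the sample standard deviation so much that the test fails to reach significance at the 0.05 level.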

Here's what you get when you use the data set with the typo:

Because the p-value is not less than the alpha level of 0.05, you announce to the world that there is no statistically significant evidence that total cholesterol levels are above 200 for this population.

Then, one day, your dog starts acting strange--sniffing, whining, and pawing clumsily at row 688 of your Minitab worksheet.

Uh-oh. A cholesterol reading of 2800? Impossible! You'd have to have arteries stuffed with Pecorino cheese. Why didn't anyone notice that before? 

"Good catch, Bongo!" You give your dog a biscuit and quickly correct the typo.

Hopefully, one misplaced decimal point in a single value in such a large data set won't make much difference. You rerun the t-test just to make sure:
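Rerunning the sketch on the corrected data tells a different story. Again, these are the same simulated values as before, assumed for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Same 700 synthetic cholesterol values, with row 688 corrected
cholesterol = rng.normal(loc=201.5, scale=7, size=700)
cholesterol[687] = 280.0          # corrected: 2800.0 changed back to 280.0

# One-sided 1-sample t-test of H0: mu = 200 vs. H1: mu > 200
t_stat, p_value = stats.ttest_1samp(cholesterol, popmean=200,
                                    alternative='greater')
print(f"corrected data: t = {t_stat:.2f}, p = {p_value:.4f}")
```

Fixing one digit shrinks the standard deviation back to a sensible size, and the test now comfortably rejects the null hypothesis.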

Now the p-value is less than 0.05. You can reject the null hypothesis and conclude that the total cholesterol level of the population is greater than 200.

Your first conclusion was dead wrong, thanks to one tiny slip of the finger. And not even your large data set with hundreds of values could save you.

You are forced to publish a retraction: "Whoops...sorry about that folks...I'm blushing from ear to ear..."

The Spanish version of your retraction reads: "Lo siento. Estoy muy embarazado..." ("I'm sorry. I'm very pregnant...")

Stretch Your Brain, Not Your Belly

Next time, we'll look at some useful methods in Minitab to help catch data entry errors before you perform an analysis.

Until then, see if you can figure out this seeming paradox:

The 1-sample t test above tested for evidence that the population mean is greater than 200. Notice that the sample with the typo (Total Cholesterol_1) actually had a higher mean (205.27) than the sample without the typo (Total Cholesterol_2), which had a mean of 202.09.

You'd think the sample with the higher mean would produce the statistically significant result. But it was the other way around: the sample with the smaller mean, closer to 200, provided stronger evidence that the population mean was greater than 200.
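Here's a hint, with made-up numbers. The post reports the two sample means but not the two standard deviations, so the spreads below are purely illustrative assumptions.

```python
import math

n = 700  # sample size from the post

# Without the typo: smaller mean, but a modest spread (sd assumed)
mean_clean, sd_clean = 202.09, 20.0
t_clean = (mean_clean - 200) / (sd_clean / math.sqrt(n))

# With the typo: larger mean, but one wild value inflates the spread (sd assumed)
mean_typo, sd_typo = 205.27, 100.0
t_typo = (mean_typo - 200) / (sd_typo / math.sqrt(n))

print(f"t without typo = {t_clean:.2f}")
print(f"t with typo    = {t_typo:.2f}")
```

Under these assumed spreads, the sample with the smaller mean yields the larger t-statistic, because the denominator matters as much as the numerator.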




Name: Max D • Monday, May 7, 2012

Lower standard error means that the data is more closely packed, and this mean is almost 4 times more of an "outlier" in the distribution of sample means, given how low the new SE is.

Name: Patrick Runkel • Wednesday, May 9, 2012

You got it, Max! The 1-sample t test is a signal-to-noise ratio that divides the difference between the sample mean and the hypothesized mean (the signal) by the variability in the sample (the noise). Although the typo increases the difference in the numerator, the increased signal is more than offset by the higher variability in the denominator, as you suggest by pointing out the whopping increase in the SE. Thanks for reading and for making the astute comment! Cheers, Patrick
