Large Samples: Too Much of a Good Thing?

Minitab Blog Editor 04 June, 2012

The other day I stopped at a fast food joint for the first time in a while.

After I ordered my food, the cashier asked me if I wanted to upgrade to the “Super-Smiley-Savings-Meal” or something like that, and I said, “Sure.”  

When it came, I was astounded to see the gargantuan soda cup. I don’t know how many ounces it was, but you could have bathed a dachshund in it.

If I drank all the Coke that honkin' cup could hold, the megadose of sugar and caffeine would launch me into permanent orbit around Earth.

That huge cup made me think of sample size.

Generally, having more data is a good thing. But if your sample is extremely large, it can create a few potential issues.

Here are a few things to watch out for—and some tips for how to deal with them.    

Warning 1: Huge Samples Can Make the Insignificant...Significant

Look at the 2-sample t-test results shown below. Notice that in both Examples 1 and 2 the means and the difference between them are nearly identical. The variability (StDev) is also fairly similar. But the sample sizes and the p-values differ dramatically: 

(A sample size of 1 million is preposterous, of course. I use it just to illustrate a point—and to give an inkling of how much data the Minitab worksheet can chug.)

With a large sample, the t-test has so much power that even a minuscule difference like 0.009 is flagged as statistically significant.

Is a difference of 0.009 important? Depends on the unit of measurement and the application. For the width of a miniaturized part in a medical implant, it could be. For the diameter of a soda cup—and for many other applications—probably not.  

With large samples, it’s critical to evaluate the practical implications of a statistically significant difference.
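To see how sample size alone can drive the p-value, here is a rough sketch in Python. It uses the normal approximation to the 2-sample t-test rather than Minitab's own calculation. The difference of 0.009 comes from the example above, but the standard deviation of 1.0 and the small sample size of 100 are assumptions made up for illustration.

```python
import math

def two_sample_p_value(mean_diff, sd1, sd2, n1, n2):
    """Two-sided p-value for a difference in means, using the normal
    approximation to the 2-sample t-test (very close when n is large)."""
    std_err = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    z = mean_diff / std_err
    # P(|Z| > |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))

DIFF = 0.009  # difference between means, as in the example above
SD = 1.0      # assumed standard deviation (hypothetical)

p_small = two_sample_p_value(DIFF, SD, SD, 100, 100)              # hypothetical small samples
p_large = two_sample_p_value(DIFF, SD, SD, 1_000_000, 1_000_000)  # 1 million per group

print(f"n = 100 per group:       p = {p_small:.3f}")
print(f"n = 1,000,000 per group: p = {p_large:.2e}")
```

Under these assumptions, the same 0.009 difference is nowhere near significant with 100 observations per group and overwhelmingly significant with a million. Nothing about the difference changed; only the standard error did.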

Warning 2: Huge Samples Can Give Your Data an Identity Crisis

Minitab Technical Support frequently gets calls from users who perform an Individual Distribution Identification on a large data set and, based solely on the p-values, think that none of the distributions fit their data.

The problem? With a large data set, the goodness-of-fit test becomes sensitive to even very small, inconsequential departures from a distribution. Here's an example:

What happens if you test the fit of the normal distribution for these two data sets? 


For the large sample (shown in the histogram of C1), the p-value (0.047) is less than alpha (0.05), so you reject the null hypothesis that the data are from a normally distributed population. You conclude that the data are not normal.


For the small sample (shown in the histogram of C3), the p-value (0.676) is greater than alpha (0.05), so there is not enough evidence to reject the null hypothesis that the data are from a normally distributed population. You assume the data are normal.


Huh? Come again? Those test results don't seem to jibe with the data in the histograms, do they?

If it does not fit, you must (not always) acquit...

If you have an extremely large data set, don't rely solely on the p-value from a goodness-of-fit test to evaluate the distribution fit. Examine the graphs and the probability plot as well.

Here are the probability plots for the large and small sample:

Do the points seem to follow a straight line? Could you cover them with a pleasingly plump pencil? If so, the distribution is probably a good fit.

Here, the probability plot for the large data set shows the points following a fairly thin, straight line, which indicates that you can assume that the normal distribution is a good fit.

The plots also shed light on why the normal distribution was rejected for the large sample.

Compare the blue confidence bands in each plot. Notice how "tight" they are in the plot with many observations? The large sample makes the margin of error so narrow that even the slightest departures (bottom and top of line) cause the p-value to signal a lack of fit.  
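The same tightening can be sketched with a few lines of Python. This is not the test behind the p-values above. It uses a different goodness-of-fit test, the one-sample Kolmogorov-Smirnov test with a fully specified distribution, only because its rejection threshold has a simple large-sample formula. The threshold is the largest gap between the data's empirical distribution and the fitted distribution that the test will tolerate before rejecting at alpha = 0.05. The sample sizes are arbitrary illustrations.

```python
import math

def ks_critical_value(n, alpha=0.05):
    """Asymptotic critical value of the one-sample Kolmogorov-Smirnov
    statistic (null distribution fully specified, not estimated)."""
    return math.sqrt(-math.log(alpha / 2) / 2) / math.sqrt(n)

# Illustrative sample sizes (hypothetical, not the ones in the plots)
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}: reject if max gap exceeds {ks_critical_value(n):.4f}")
```

The tolerated gap shrinks in proportion to 1 over the square root of n, so a departure far too small to matter in practice can still trip the test once the sample is big enough.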

Warning 3: Huge Samples Can Make Your Control Charts "Cry Wolf"

The P charts below track the proportion of defective items. The two charts have nearly identical overall mean proportion and similar variability across subgroups. But one chart has subgroups of size 600 while the other chart has subgroups of size 60.

Can you guess which chart has the huge subgroups?

You chose P Chart 1, I hope!

Assuming a given amount of variation, the larger your subgroups, the narrower your control limits will be on a P or U chart. With huge subgroups, artificially tight control limits can cause the points on the chart to look like they’re out of control even when they're not. The chart tends to show many false alarms, especially if you also have a lot of dispersion in your data.

In those cases, consider using a Laney P' chart or Laney U' chart, which are specially designed to avoid false alarms in these situations.
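The narrowing of ordinary P chart limits is easy to check by hand. Here is a minimal sketch using the standard 3-sigma P chart formula, with the two subgroup sizes from the charts above. The overall proportion defective of 0.10 is an assumption chosen for illustration, not a value taken from the charts.

```python
import math

def p_chart_limits(p_bar, n):
    """Standard 3-sigma control limits for a P chart with subgroup size n."""
    sigma = math.sqrt(p_bar * (1 - p_bar) / n)
    lcl = max(0.0, p_bar - 3 * sigma)  # a proportion cannot fall below 0
    ucl = min(1.0, p_bar + 3 * sigma)  # or rise above 1
    return lcl, ucl

P_BAR = 0.10  # assumed overall proportion defective (hypothetical)

for n in (60, 600):
    lcl, ucl = p_chart_limits(P_BAR, n)
    print(f"subgroup size {n:>3}: LCL = {lcl:.4f}, UCL = {ucl:.4f}")
```

With the same overall proportion, multiplying the subgroup size by 10 shrinks the distance from the center line to each limit by a factor of about 3.16 (the square root of 10). Any extra subgroup-to-subgroup variation then has much less room before it lands outside the limits.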

What?!! Don't Look at Me Like That!

See the pattern in all of these examples? When you biggie-size your sample, your statistical analysis can get a wee bit on the touchy side. I can totally relate. When I drink a bucket of caffeinated soda, I get nittery, jittery, and hypersensitive to even the slightest input.

Hmmm. Maybe that explains why my coworkers run the other way when they see me walking down the hall with a Big Gulp.