Large Samples: Too Much of a Good Thing?

The other day I stopped at a fast food joint for the first time in a while.

After I ordered my food, the cashier asked me if I wanted to upgrade to the “Super-Smiley-Savings-Meal” or something like that, and I said, “Sure.”

When it came, I was astounded to see the gargantuan soda cup. I don’t know how many ounces it was, but you could have bathed a dachshund in it.

If I drank all the Coke that honkin' cup could hold, the megadose of sugar and caffeine would launch me into permanent orbit around Earth.

That huge cup made me think of sample size.

Generally, having more data is a good thing. But if your sample is extremely large, it can create a few potential issues.

Here are a few things to watch out for—and some tips for how to deal with them.

Warning 1: Huge Samples Can Make the Insignificant...Significant

Look at the 2-sample t-test results shown below. Notice that in both Examples 1 and 2 the means and the difference between them are nearly identical. The variability (StDev) is also fairly similar. But the sample sizes and the p-values differ dramatically:

(A sample size of 1 million is preposterous, of course. I use it just to illustrate a point—and to give an inkling of how much data the Minitab worksheet can chug.)

With a large sample, the t-test has so much power that even a minuscule difference like 0.009 is flagged as statistically significant.

Is a difference of 0.009 important? Depends on the unit of measurement and the application. For the width of a miniaturized part in a medical implant, it could be. For the diameter of a soda cup—and for many other applications—probably not.

With large samples, it’s critical to evaluate the practical implications of a statistically significant difference.
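To see the effect in numbers, here is a minimal sketch in Python. The original test output is not reproduced here, so the standard deviation of 1.0 and the small sample size of 100 per group are assumptions chosen only for illustration; the difference of 0.009 and the sample size of 1 million come from the example above. The sketch also uses a normal (z) approximation in place of the exact t distribution, which matters little at these sample sizes.

```python
import math
from statistics import NormalDist

def two_sample_z(diff, sd, n_per_group):
    """Approximate two-sided test for a difference in means.

    Assumes both groups share the same standard deviation and size,
    and uses the normal distribution as a stand-in for the t distribution.
    """
    se = sd * math.sqrt(2.0 / n_per_group)    # standard error of the difference
    z = diff / se
    p = 2.0 * NormalDist().cdf(-abs(z))       # two-sided p-value
    return z, p

DIFF = 0.009   # difference in means, from the example above
SD = 1.0       # assumed standard deviation (hypothetical)

z_small, p_small = two_sample_z(DIFF, SD, n_per_group=100)         # assumed size
z_large, p_large = two_sample_z(DIFF, SD, n_per_group=1_000_000)

print(f"n = 100 per group:       z = {z_small:.3f}, p = {p_small:.3f}")
print(f"n = 1,000,000 per group: z = {z_large:.3f}, p = {p_large:.2e}")
```

The difference and the variability are identical in both calls. Only the sample size changes, and that alone moves the p-value from far above 0.05 to far below it.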

Warning 2: Huge Samples Can Give Your Data an Identity Crisis

Minitab Technical Support frequently gets calls from users who perform an Individual Distribution Identification on a large data set and, based solely on the p-values, think that none of the distributions fit their data.

The problem? With a large data set, the goodness-of-fit test becomes sensitive to even very small, inconsequential departures from a distribution. Here's an example:

What happens if you test the fit of the normal distribution for these two data sets?

-------------------------------------------------------------------------------------------

For the large sample (shown in the histogram of C1), the p-value (0.047) is less than alpha (0.05), so you reject the null hypothesis that the data are from a normally distributed population. You conclude that the data are not normal.

-------------------------------------------------------------------------------------------

For the small sample (shown in the histogram of C3), the p-value (0.676) is greater than alpha (0.05), so there is not enough evidence to reject the null hypothesis that the data are from a normally distributed population. You assume the data are normal.

-------------------------------------------------------------------------------------------

Huh? Come again? Those test results don't seem to jibe with the data in the histograms, do they?

If it does not fit, you must (not always) acquit...

If you have an extremely large data set, don't rely solely on the p-value from a goodness-of-fit test to evaluate the distribution fit. Examine the graphs and the probability plot as well.

Here are the probability plots for the large and small sample:

Do the points seem to follow a straight line? Could you cover them with a pleasingly plump pencil? If so, the distribution is probably a good fit.

Here, the probability plot for the large data set shows the points following a fairly thin, straight line, which indicates that you can assume that the normal distribution is a good fit.

The plots also shed insight into why the normal distribution was rejected for the large sample.

Compare the blue confidence bands in each plot. Notice how "tight" they are in the plot with many observations? The large sample makes the margin of error so narrow that even the slightest departures (bottom and top of line) cause the p-value to signal a lack of fit.
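The same sensitivity can be reproduced with simulated data. The sketch below is an illustration under stated assumptions, not the analysis behind the plots above: it generates data that are almost normal (95% standard normal values mixed with 5% wider ones, a made-up recipe), and it swaps in a hand-rolled Kolmogorov-Smirnov test against a fully specified normal distribution, which is not necessarily the goodness-of-fit test Minitab reports. The p-value uses the asymptotic Kolmogorov formula, so it is rough for the small sample.

```python
import math
import random
from statistics import NormalDist

def ks_statistic(data, dist):
    """Largest gap between the empirical CDF and the hypothesized CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = dist.cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def ks_pvalue(d, n, terms=100):
    """Asymptotic Kolmogorov p-value (an approximation, rough for small n)."""
    lam = math.sqrt(n) * d
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))

def nearly_normal(n, rng, contamination=0.05, wide_sd=3.0):
    """Mostly N(0, 1), with a small share of wider N(0, wide_sd) values."""
    return [rng.gauss(0.0, wide_sd if rng.random() < contamination else 1.0)
            for _ in range(n)]

rng = random.Random(2012)  # arbitrary seed, for repeatability only
results = {}
for n in (200, 200_000):
    d = ks_statistic(nearly_normal(n, rng), NormalDist(0.0, 1.0))
    results[n] = (d, ks_pvalue(d, n))
    print(f"n={n:>7}: max CDF gap={d:.4f}, p-value={results[n][1]:.4g}")
```

For the large sample, the biggest gap between the data and the normal curve is only a percentage point or two, yet the p-value is essentially zero. A small sample drawn from the same recipe will usually, though not always, fail to reject normality, because the test statistic is the gap multiplied by the square root of the sample size.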

Warning 3: Huge Samples Can Make Your Control Charts "Cry Wolf"

The P charts below track the proportion of defective items. The two charts have nearly identical overall mean proportion and similar variability across subgroups. But one chart has subgroups of size 600 while the other chart has subgroups of size 60.

Can you guess which chart has the huge subgroups?

You chose P Chart 1, I hope!

Assuming a given amount of variation, the larger your subgroups, the narrower your control limits will be on a P or U chart. With huge subgroups, artificially tight control limits can cause the points on the chart to look like they’re out of control even when they're not. The chart tends to show many false alarms, especially if you also have a lot of dispersion in your data.

In those cases, consider using a Laney P' chart or Laney U' chart, which are specially designed to avoid false alarms in these situations.
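The narrowing of the limits follows directly from the usual 3-sigma formula for a P chart, in which the limits sit at the mean proportion plus or minus 3 times sqrt(p(1 - p) / n). Here is a short sketch of that formula for the two subgroup sizes mentioned above. The mean proportion of 0.10 is an assumed value for illustration, because the charts' actual mean is not stated in the text.

```python
import math

def p_chart_limits(p_bar, subgroup_size):
    """Standard 3-sigma control limits for a P chart, clamped to [0, 1]."""
    half_width = 3.0 * math.sqrt(p_bar * (1.0 - p_bar) / subgroup_size)
    lcl = max(0.0, p_bar - half_width)
    ucl = min(1.0, p_bar + half_width)
    return lcl, ucl

P_BAR = 0.10  # assumed overall proportion defective (hypothetical)

lcl_60, ucl_60 = p_chart_limits(P_BAR, 60)
lcl_600, ucl_600 = p_chart_limits(P_BAR, 600)

print(f"subgroups of  60: LCL = {lcl_60:.4f}, UCL = {ucl_60:.4f}")
print(f"subgroups of 600: LCL = {lcl_600:.4f}, UCL = {ucl_600:.4f}")
```

Multiplying the subgroup size by 10 shrinks the distance from the center line to each limit by a factor of about 3.16 (the square root of 10). If the real subgroup-to-subgroup variation does not shrink along with it, more points land outside the limits.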

What?!! Don't Look at Me Like That!

See the pattern in all of these examples? When you biggie-size your sample, your statistical analysis can get a wee bit on the touchy side. I can totally relate. When I drink a bucket of caffeinated soda, I get nittery, jittery, and hypersensitive to even the slightest input.

Hmmm. Maybe that explains why my coworkers run the other way when they see me walking down the hall with a Big Gulp.


Name: tamoghna • Monday, June 4, 2012

Patrick, I LOVE your posts. They are so much fun to read and learn.

Name: Patrick • Wednesday, June 6, 2012

Thank you for your kind feedback, Tamoghna. It keeps us going!

I hope that you receive the same positive encouragement in your work that you so generously provide to others.

Name: sammy • Monday, June 18, 2012

Really enjoy your informal way of sharing stats- thanks for doing a great job.

Name: Andrew Mullen • Monday, June 18, 2012

I agree, this is a great article. Funnily enough, my company handles vast amounts of data. An MBB asked what I would do with 1 million pieces of transactional data. My response was to use the sampling options in Minitab to pull a representative sample. I would be interested in your approach.

Name: Patrick Runkel • Tuesday, June 19, 2012

Andrew, you raise a really good question about how to handle the huge data sets that often result from tracking transactional data, such as website trends or customer transactions. There are many different approaches out there. Which one is most appropriate depends on the type of data you’re tracking and the overall objectives of your analysis.

Some graphs and analyses can handle a million data values. When data reduction is necessary, representative sampling, multivariate analysis (such as clustering), and aggregating data into time or spatial dimensions (such as on a time series plot), are just a few possibilities. If I were you, I’d explore different approaches and then weigh the pros and cons in relation to what you’re trying to learn from your data.
And of course, asking the MBB what he/she would do is also a good idea!
Best, Patrick

Name: Ella • Wednesday, June 20, 2012

Hi.
Thanks for the nice post.
I have a question: Shouldn't I look at the residuals when looking for normality? Not the individual distribution?
When I look at the residuals I don't get a p-value as when I'm looking at the individual plot. How do I determine if the data is really normally distributed? Where do I draw the line between normal distribution and not? Also, I have a very small sample size (n=28, 12 in each group), so the histogram of residuals doesn't look normally distributed, but the normal probability plot does. However, some dots are a bit away from the line. When is it too far away from the line?

Name: Patrick • Wednesday, June 20, 2012

Hi Ella,
Excellent questions. I can’t comment on the particulars of your specific data set or analysis but I can try to answer your questions in a general way.

You can use an Individual Distribution Identification to evaluate the distribution of your data in many different situations. For example, you might simply want to characterize the distribution of your data as a first step in understanding its basic properties. Or, you might need to clearly identify the distribution before you perform an analysis such as Capability Analysis or Reliability Analysis, because obtaining accurate results for those analyses critically depends on first correctly identifying the distribution.
I’m guessing that you’re performing a regression analysis. In that case, you’re right, typically it’s the residuals (the model errors—that is, how much the sample data values deviate from the model) that you examine for normality.

To see some specific examples of how to interpret the histogram and normal probability plot of residuals, including some specific examples of what’s normal and what’s not normal, take a look at StatGuide in Minitab.
Choose Help > Stat Guide. Click Regression. Under Regression, here are some topics that might be helpful:
-Histogram of residuals
-Normal probability plot of residuals
Under More, here are some other topics that might help answer your questions:
-Histograms and normality
-Patterns in normal probability plots
-What to do if you see a nonnormal pattern

If you’re not comfortable, based on the graphs, that the residuals are fairly normal, you can store or display the residuals and then perform a normality test on them: Stat > Basic Statistics > Normality Test.
One important caveat. With a small data set, it can be difficult to definitively arrive at conclusions about your model and its assumptions. Without sufficient data, the analysis may not be able to identify the true relationship between the variables and is more sensitive to the effects of nonnormal residuals. With a small data set, a goodness-of-fit test may not have enough power to detect a departure from normality.
If you’re performing a simple regression analysis, you may be able to use the Regression command in the Minitab Assistant to automatically check the assumptions of your analysis—including the amount of data. That would be the easiest solution!
Choose Assistant > Regression, run the analysis, and then look at the results in the Report Card.
Good luck!

Name: tamoghna • Monday, June 25, 2012

May I ask an off-topic question? In scientific journals I often see that error bars are represented as SD, SE, or 95% CI. I get confused about which is the best way of showing scatter in my data set. Would love to hear your comments on this.

Name: Patrick • Monday, June 25, 2012

This is a great question. You've given me a nice idea for an upcoming blog. Since it's a different topic and deserves its own space and discussion, and because I think there are probably other readers out there who could benefit from understanding the concepts you raise, I'll cover this topic in an upcoming blog. I'll let you know when it's published. Thanks for the idea!

Name: tamoghna • Tuesday, June 26, 2012

Thank you so much, Patrick!! We will be looking forward to the blog post.

Name: Iain • Monday, January 14, 2013

Thanks for explaining this issue in such a succinct and entertaining way. Warning 2, in particular, solved a problem I was having with the university report I am writing. I was wondering if you could help me further: could you recommend any texts I can use as a reference for what you've explained in Warning 2?

Many thanks!

Name: Patrick • Thursday, January 17, 2013

Thanks for your kind comment, Iain. I'm glad to hear this post helped you out with your report.

Right offhand, I can't think of a text that specifically references Warning 2. But in many statistical texts, you should be able to find a reference to the general problem of how extremely large samples can lead to rejecting the null hypothesis for differences that may be inconsequential. For example: Six Sigma and Beyond: Statistics and Probability, DH Stamatis, Vol 3, page 64.

In some texts, this issue may be indexed under "power"--describing what can happen when a hypothesis test has very high power (often caused by a very large sample).

In addition, here's an online reference to the specific issue of how large samples can affect distribution goodness-of-fit (gof) tests:

http://www.epa.gov/scipoly/sap/meetings/1998/march/attach3.pdf

On page 20 of the document, you'll find this passage:

"The GoF tests indicate whether the hypothesized distribution can be reasonably rejected as improbable. It is important to recognize that failure to reject H0 is not the same as accepting H0 as true. These tests, taken alone, are not very powerful for small to moderate sample sizes (i.e., subtle but systematic disagreements between the data and the hypothesized distribution may not be detected); conversely, the tests can be too sensitive for large numbers of data points -- that is, for data sets with a large number of points, H0 will almost always be rejected."

Best, p

Name: r • Monday, June 3, 2013

Nice article. Now we know that sample size has an effect on p-values and that a large sample size often produces a (misleading) significant result. Can you please explain WHY or HOW this happens? I did some Google searching but didn't find anything that made me understand it fully with a simple explanation. I would very much appreciate it if you could explain it.
thanks.

R

Name: Patrick • Tuesday, June 4, 2013

Great question.

In short, it happens because a large sample size greatly increases the power of an analysis to detect a difference. And THAT occurs because the test statistic that determines statistical significance is based on a formula that, in effect, gets multiplied by a value based on the sample size (such as the square root of the sample size). So, assuming a given difference and a given standard deviation for the data, increasing the value of sample size (n) increases the test statistic and results in an increased likelihood of statistical significance. Therefore, a humongous sample size can "override" a difference of very small size, inflating the test statistic, and result in statistical significance, even though the difference is so small that it has no practical consequence.
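Here is a rough sketch of that multiplication for a 1-sample test, where the test statistic is sqrt(n) times the difference divided by the standard deviation. The difference of 0.009 is borrowed from the post; the standard deviation of 1.0 is an assumed value, and the cutoff uses the normal critical value as an approximation to the t critical value, which is reasonable only for large n.

```python
import math
from statistics import NormalDist

def t_statistic(diff, sd, n):
    """1-sample test statistic: the standardized difference scaled by sqrt(n)."""
    return math.sqrt(n) * diff / sd

def smallest_significant_n(diff, sd, alpha=0.05):
    """Approximate sample size at which a fixed difference reaches significance.

    Uses the normal critical value in place of the t critical value.
    """
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.ceil((z_crit * sd / diff) ** 2)

DIFF = 0.009   # difference from the post
SD = 1.0       # assumed standard deviation (hypothetical)

for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}: test statistic = {t_statistic(DIFF, SD, n):.3f}")

n_needed = smallest_significant_n(DIFF, SD)
print(f"Approximate n at which 0.009 becomes significant: {n_needed:,}")
```

The difference and standard deviation never change in the loop. Each 100-fold increase in n multiplies the test statistic by 10, so any nonzero difference eventually crosses the significance cutoff if the sample gets big enough.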

In my next post (to be published Monday, June 10), I'll show an example of these relationships--sample size, difference, and variation--using the 1-sample t test as an example. I think (or at least hope) that it will answer your question more clearly.

Name: Edward • Sunday, June 16, 2013

Nice article Patrick.
I am an honors student in Bioinformatics. Obviously genomic data can yield very large samples so this topic is of particular interest to me.

I was performing a chi-squared test on a sample size of ~500,000 observed results. I have 9 groups by which the results could be categorized. Initially, I saw my p-value for the chi-squared test and thought "great, I have a significant result", and then thought "hmm, that p-value looks too good to be true". Based on the histogram, it looks like the distributions of the observed and "expected" results are indeed different. I'm just wondering if there is a standard way of dealing with this issue to get a meaningful p-value. As I am not testing for normality or comparing to a known distribution, I'm not sure if I can do probability plots. In one of your other responses you mention data reduction. Can you elaborate on this? Are there any standard procedures for this?

Thanks in advance for any help.
Edward.

Name: Tom • Monday, June 24, 2013

Iain, Don Wheeler writes about Warning 2. I don't have a specific reference handy, but you could start with his articles at SPC Press.

Name: Patrick • Monday, June 24, 2013

Hi Edward,

Thanks for reading and commenting. Unfortunately, I can't offer guidance on specific applications such as your particular analysis of genomic data.

A couple of thoughts though. The issue you seem to be encountering with the chi-square analysis using an extremely large sample is that the difference between expected and observed will always be statistically significant, no matter how slight the differences are. Therefore, you still need to determine whether the actual differences between expected and observed have any practical significance. For example if one category contains 33% and the other 35%, does that 2% have any ramifications in your particular application?
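To put numbers on that, here is a small sketch of a chi-square goodness-of-fit calculation in which the expected share of one category is 33% and the observed share is 35%. It is simplified to two categories (that category versus everything else) so the p-value for 1 degree of freedom can be computed with the standard library alone. The sample size of 500 is an assumed value for contrast; 500,000 matches the size you mention.

```python
import math

def chi_square_two_categories(n, observed_share, expected_share):
    """Chi-square goodness-of-fit statistic and p-value for two categories.

    With two categories there is 1 degree of freedom, so the p-value
    can be written with the complementary error function.
    """
    chi2 = 0.0
    for obs, exp in ((observed_share, expected_share),
                     (1.0 - observed_share, 1.0 - expected_share)):
        chi2 += (n * obs - n * exp) ** 2 / (n * exp)
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p

chi2_small, p_small = chi_square_two_categories(500, 0.35, 0.33)      # assumed n
chi2_large, p_large = chi_square_two_categories(500_000, 0.35, 0.33)

print(f"n =     500: chi-square = {chi2_small:.3f}, p = {p_small:.3f}")
print(f"n = 500,000: chi-square = {chi2_large:.1f}, p = {p_large:.3g}")
```

The same 2-percentage-point gap is nowhere near significant at n = 500 and overwhelmingly significant at n = 500,000, because the chi-square statistic grows in direct proportion to the sample size. The p-value says nothing about whether 2 points matter in your application.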

As far as data reduction, that depends on the objectives of your study and the protocols in your field. Some analysts who have large data samples for a chi-square analysis may decide to look at specific subsets of the data, to avoid the statistical significance problem caused by extremely large samples.

The link below shows an example where a researcher in the social sciences breaks down an original sample of 28,000 by year to examine instead a subset of about 1800 people, which falls within his stated "ideal" sample size range of between 100 and 2500 for a chi-square analysis. http://socquest.net/Q1/ResExQ1_2/ResExQ12.html

Consider whether there is some meaningful way for you to subset your data to analyze it. (Of course be mindful of the multiple comparisons issue--you may need to lower the level of significance if you plan to evaluate multiple subsets of your data.)

As for data reduction techniques--that is a whole field in itself! The following paper will give you an idea of how involved this can be--if I were you, I'd research whether any of these (or other) reduction techniques are commonly applied in the field of genomics.

http://www.stanford.edu/~thairu/07_184.Guest.1sts.pdf

Good luck--I'm sure it will be an interesting journey!

Name: Hugh • Tuesday, June 25, 2013

Interesting read. I often find myself handling data with 100000+ samples so this was useful for me. Example 2 particularly got me thinking but in my opinion when you look at example 2 I think that your eyes/brain are biased when it comes to judging goodness of fit and so 1 distribution appears closer to normal than it is. You may also have some bias in that you want the p value to be low so you may tend to say to yourself "That difference seems insignificant, I'll just ignore it".
As well as this, when you get a large p value and fail to reject the null hypothesis this does not mean you can accept the null hypothesis as fact, you must first check the power of the test.

Name: Patrick • Tuesday, June 25, 2013

What I love about blogs is that astute readers like you can fill in the blanks in the comments section on important related topics that couldn't be covered in the original post due to space and time limitations.

You're absolutely right, Hugh. And neglecting to consider the power of a test that evaluates the assumptions for an analysis (such as a distribution fit analysis, or an equal variance test) is a very common oversight. As you implied, failing to reject the null hypothesis in these cases means only that there is not sufficient evidence to conclude that the particular assumption is violated (i.e., the distribution does not fit, or the variances are not equal). That's very different from being able to conclude that the distribution DOES fit--or that the variances ARE equal.

To gauge how confident you can be in your results when you fail to reject the null, you need to know the power of the test. And of course, if your sample size is small, the power may not have been adequate to detect a deviance from a distribution, or a difference in variance (or whatever difference you're trying to detect).

Thanks for pointing that out.

Name: Michael • Thursday, October 24, 2013

I'm not sure why the large sample size comes with a warning just b/c things become statistically significant. From a common sense standpoint, it makes sense to me: 2 separate samples of N=10 with a mean diff of .009 would seem to be NOT statistically significant b/c N=10 allows outliers to be overweighted, etc. However, when N=1million for 2 samples, it makes sense that a .009 diff in mean is significant because insignificant differences would be explained only in a much, much tighter distance from the mean with a 1 million sample. Am I missing something? Your data tells me that the larger the samples, the better representation the means will be and therefore, the more credence the p-value will hold, but your premise is that large samples put out some kind of false significance. Thanks for taking the time to construct this for everyone.

Name: Michael • Thursday, October 24, 2013

I just submitted a comment but it seems that you answered something similarly on 6/4/13. Thanks anyway.

Name: Steve • Monday, December 9, 2013

I am stuck with sample size determination. I have the population mean and standard deviation. Population size is 15,000.
Population mean = 15. Population standard deviation is 2. A subgroup of the population is showing a mean of 15.84 with standard deviation 1.18. Size of the subgroup is 1270. I want to know if the difference in mean is significant. Obviously if I choose sample size 1270 then it shows the difference is significant. If I choose a smaller sample size then the difference in mean is insignificant. I am really stuck on how to choose a sample size. A large sample size seems to reject the null hypothesis whereas a small sample size accepts the null. Is there a method to choose the correct sample size?

Name: Patrick • Monday, December 9, 2013

I'm not sure what you mean by "choosing the correct sample size." For a given amount of variation in the data, a larger sample will give the test more power to detect a significant difference in means, if one exists. Remember, what you're really asking or finding out when you perform a t-test on the difference in means is: "Do I have ENOUGH EVIDENCE to conclude that the difference between these two means is statistically significant?" So all things being equal, a larger sample will be more likely to provide that evidence than a smaller sample. The difference between the means should be roughly the same with a larger or smaller sample (ASSUMING that your sample is representative of the population in either case). So really, the t-test is just telling you whether you have enough evidence to make an inference about the entire population. A larger sample will be more likely to allow you to do that. It's not a bad thing to reject the null with a large sample. You just need to go one step further and evaluate whether the statistically significant difference itself has any practical ramifications--and that doesn't depend on your sample size (or your p-value result). So in your case, the question is, based on your application, does a mean difference of 0.84 (15 vs. 15.84 in the subgroup) really matter, given that it's statistically significant with a sample size of 1270? Hope this helps. Thanks for reading.
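One way to picture this is with confidence intervals around the subgroup mean. The sketch below uses the numbers from your comment (subgroup mean 15.84, standard deviation 1.18, reference mean 15) and treats the sample sizes of 5 and 50 as hypothetical alternatives to 1270. It relies on a normal approximation, which is crude for a sample as small as 5, so read it as an illustration only.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(mean, sd, n, confidence=0.95):
    """Approximate confidence interval for a mean (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

REFERENCE_MEAN = 15.0    # population mean from the comment
SUBGROUP_MEAN = 15.84    # subgroup mean from the comment
SUBGROUP_SD = 1.18       # subgroup standard deviation from the comment

intervals = {}
for n in (5, 50, 1270):  # 5 and 50 are hypothetical sample sizes
    low, high = mean_confidence_interval(SUBGROUP_MEAN, SUBGROUP_SD, n)
    intervals[n] = (low, high)
    verdict = "includes" if low <= REFERENCE_MEAN <= high else "excludes"
    print(f"n = {n:>4}: 95% CI = ({low:.3f}, {high:.3f}), {verdict} 15")
```

The estimated difference of 0.84 is the same in every row. What changes is how tightly the data pin it down, which is why only the smallest sample leaves 15 inside the interval.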

Name: Alex • Tuesday, February 25, 2014

Hi, I was wondering what your thoughts were on the idea that "the more sampling that is done to check for positives, the more positives are found". The basis behind this is that, let's say, a farm checks 5 pigs a month for carrying a type of bacteria. If they up that to 20 pigs a month, would they find more infected pigs than they did before? I looked around for published research on this and found nothing. Thanks for any help!

Name: Patrick • Friday, February 28, 2014

Hi Alex, does increasing the sample size lead to finding more events of interest? Well, literally, yes, because say that 20% of pigs are infected. If you sample 20 pigs as opposed to 5 pigs the count of infected pigs will be 4 times greater. But will the rate of infection go up with a larger sample? Assuming you're taking a random, representative sample, it shouldn't. Although the larger the sample, the more likely the rate you find will be a more precise estimate of the actual rate in the entire population.
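As a quick sketch of that arithmetic, using the 20% infection rate and the sample sizes of 5 and 20 from the example (and assuming simple random sampling from a large herd, so a binomial model applies):

```python
import math

def expected_positives(rate, n):
    """Expected number of positives in a random sample of size n."""
    return rate * n

def rate_standard_error(rate, n):
    """Standard error of the estimated rate under a binomial model."""
    return math.sqrt(rate * (1.0 - rate) / n)

RATE = 0.20  # infection rate from the example

summary = {}
for n in (5, 20):
    summary[n] = (expected_positives(RATE, n), rate_standard_error(RATE, n))
    print(f"n = {n:>2}: expected positives = {summary[n][0]:.1f}, "
          f"standard error of the rate = {summary[n][1]:.3f}")
```

The expected count quadruples, but the expected rate stays at 20%. The only thing that changes about the rate is its precision: the standard error with 20 pigs is half what it is with 5.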

The issue you raise reminds me of what's known as the Observer Effect. For example, it's indisputable that the rate of children diagnosed with autism has increased substantially in the last few decades. Some say this is due to environmental triggers that didn't exist before. Others argue that the increased number of "positives," as you put it, is due to increased awareness and surveillance of the condition, which is similar to the concept you raise. Another example--suppose researchers report that the rate of a certain type of cancer has dropped 35% over the last 10 years. Many people would assume that this means that fewer people in the population are getting that type of cancer. But what if the AMA had stopped recommending routine surveillance for this type of cancer in the general population over this same time period? This is an example of "less checking for positives," which could lead to a decreased rate of positives. It is a very important consideration when evaluating incidence rates! But it's not related to sample size, per se.

Thanks for the interesting question!

Name: Carl • Saturday, July 19, 2014

Great article. Determining an appropriate sample size - or put another way, determining the ramifications of a sample size on the strengths and limitations of a study - is maybe the most overlooked aspect of analyses of observational data.

Most people can come up with a "low" number which their study should surpass to have adequate power, or can at least identify situations when they are lacking sample size (if they haven't performed the calculation a priori - "Hmm, those confidence intervals are pretty large...I bet small sample size is why my results are not significant").

But how can a researcher determine his "high" end number? In other words, how big of a sample is too big? With secondary data analysis this is often an issue (or should be), especially with national level data with lots of observations.

You suggested earlier considering the practical significance of the difference in question. I wonder if a researcher anticipating a too-large sample size problem could do this but in the opposite direction that is typical. "What size difference do I NOT want to be able to detect?" In other words, how small of a difference would not be practically significant, and therefore what should the upper limit of my sample size be to guarantee that I will not find such a difference to be significant?

I think this would be a sound strategy, except that researchers are usually TRYING to find a difference, so keeping themselves from finding a significant difference, no matter how small that difference is, may be unreasonable to expect.

Name: Patrick • Thursday, July 24, 2014

Very perceptive comments. The strategy you describe sounds very much like what is done when one defines the zone of equivalence in an equivalence test. See this post: http://blog.minitab.com/blog/statistics-and-quality-data-analysis/equivalence-testing-for-quality-analysis-part-ii-what-difference-does-the-difference-make

But you're right, most of the time researchers are trying to detect a significant difference (rather than prove equivalence).

But one could define the size of the difference in the a priori power analysis and then make sure the sample did not exceed the size required to achieve a given level of power--say 80% or 90%. However, it's been my experience that many researchers, when performing the power analysis, often do not know what the minimum value of a practically significant difference is. Hopefully discussions like this will get more people thinking about that.
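As a rough sketch of that calculation, the textbook normal-approximation formula for a two-sample comparison gives the sample size per group as 2 * ((z_alpha/2 + z_power) * sd / difference)^2. The difference of 0.5 and the standard deviation of 1.0 below are made-up values for illustration, and a t-based calculation in statistical software would give a slightly larger answer.

```python
import math
from statistics import NormalDist

def n_per_group(min_difference, sd, power=0.80, alpha=0.05):
    """Approximate per-group sample size for a two-sided, two-sample test.

    Normal-approximation formula; assumes equal group sizes and a
    common standard deviation.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_power) * sd / min_difference) ** 2)

MIN_DIFFERENCE = 0.5  # smallest difference of practical importance (hypothetical)
SD = 1.0              # assumed standard deviation (hypothetical)

n_80 = n_per_group(MIN_DIFFERENCE, SD, power=0.80)
n_90 = n_per_group(MIN_DIFFERENCE, SD, power=0.90)
print(f"80% power: about {n_80} per group")
print(f"90% power: about {n_90} per group")
```

Treating the result as a ceiling rather than a floor is the twist Carl describes. It only works if the minimum practically important difference has been decided before the data are collected.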