With apologies to Charles Dickens, I'd like to begin this post by summing up the Anderson-Darling statistic this way:

*It was the best of fits, it was the worst of fits, it was the test of normality, it was the test for non-normality, it was the plot of belief, it was the plot of incredulity, it was the p-value of Light, it was the p-value of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us...*

I read and participate in discussions about a broad range of statistical topics daily, and few elicit as much misinformation combined with as many strong opinions as the issue of testing data for normality. So I'd like to provide some guidance on the issue by answering two key questions:

- Does my data need to be normal?
- How do I know if my data is normal?

So let's start with #1...

## Does My Data Need to be Normal?

Most of us learned in various courses that normality is an assumption of many statistical tests. However, it is worth considering what "assumption" means in most of these cases, and you may be surprised. When developing a statistical test, statisticians start with some basic assumptions that seem reasonable in the real world (for example, suppose I have samples from two independent, normally-distributed populations) and derive from them a formula for testing a hypothesis (e.g., does the mean of the first population differ from that of the second?), from which known probabilities can be calculated.

The example I've laid out describes a 2-sample t-test, and under the assumptions given, a p-value can be calculated based on the t-distribution. Without those initial assumptions, there would be little from which to derive that test; additionally, by making those assumptions we are able to make a much more powerful test.

So from that explanation, we say that normality is an assumption of the 2-sample t-test. HOWEVER (and pay close attention here): although an assumption was made in order to develop that test, we do not automatically know whether failing to meet that assumption will make the test inaccurate! Under the assumptions of the 2-sample t-test, the results *are* accurate. But even in the absence of one or more assumptions, the test may still be accurate, or at least accurate enough. To determine this, statisticians can use a variety of tools, including simulation, to evaluate how the test behaves under conditions that do not match the assumptions. (For extensive details on some examples, check out our papers detailing the methods used in Minitab's Assistant Menu.)
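To give a flavor of what such a simulation looks like, here is a minimal sketch in Python (using scipy; the exponential distribution and sample sizes are my own illustrative choices, not Minitab's methodology). Both samples come from the same non-normal population, so the null hypothesis is true, and a well-behaved test should reject about 5% of the time:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulate the 2-sample t-test when the normality assumption is violated:
# both samples come from the same (exponential) population, so the null
# hypothesis is true and we would like to reject about 5% of the time.
trials = 5_000
rejections = 0
for _ in range(trials):
    a = rng.exponential(size=30)
    b = rng.exponential(size=30)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        rejections += 1

print(f"Observed type I error rate: {rejections / trials:.3f}")
```

Despite the strongly skewed data, the observed error rate lands close to the nominal 5%, which is exactly the kind of robustness the table below summarizes.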

For now, we will concern ourselves only with the assumption of normality, and will outline when it is and is not important for the accuracy of a test. The table below lists many commonly-used tools for which normality is either an assumption, or is commonly believed to be an assumption, and groups them by their sensitivity to the data actually being normal:

| Sensitivity to normality | Tools |
| --- | --- |
| Normality very important | Capability Analysis (Normal) |
| Data should be generally normal | Residuals on most common linear models like Regression, GLM, DOE, etc. |
| Very robust to non-normal data | T-tests, ANOVA, control charts (unless data is extremely skewed and/or bounded at zero with many points near zero) |

## How Do I Know If My Data Is Normal?

Here we return to the Anderson-Darling statistic, and the many contradictory statements I made at the top of this post, which are somehow all supposed to be true simultaneously. Most data analysts have encountered a situation where one of the following paradoxes is occurring:

- The data looks completely non-normal, but the p-value on the Anderson-Darling test is greater than 0.05.
- The data looks perfectly normal and we have plenty of data, yet it still fails the Anderson-Darling test.

First, a high-level overview of what the Anderson-Darling test is and some things to keep in mind. It is a statistical test that looks for departures from normality and indicates a significant lack of normality with a small p-value. Like any statistical test, it requires a certain amount of data to detect non-normality, and situation #1 above typically happens when there is very little data. Take 4 or 5 data points from even an extremely non-normal distribution, and you have a decent chance of passing the test anyway.
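This small-sample behavior is easy to simulate. A rough illustration in Python (scipy's `anderson` reports the statistic and critical values rather than a p-value; the distribution and sample size here are my own assumptions, not from the original figure):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Draw many tiny samples (n = 5) from an extremely non-normal
# (exponential) distribution and count how often the Anderson-Darling
# test still fails to flag them at the 5% level.
trials = 1_000
passed = 0
for _ in range(trials):
    x = rng.exponential(size=5)
    result = stats.anderson(x, dist='norm')
    # critical_values[2] is the critical value for the 5% level
    if result.statistic < result.critical_values[2]:
        passed += 1

print(f"Fraction of non-normal samples that 'pass': {passed / trials:.2f}")
```

Most of these blatantly non-normal samples sail through the test; with so few points, it simply lacks the power to notice.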

Similarly, like other statistical tests, the Anderson-Darling test becomes more and more powerful as you gather more data, so it is important to consider not just statistical significance but also practical significance. Will a tiny departure from normality really affect your results? Let's consider two samples, shown below with overlaid histograms as well as on probability plots:

One of these samples has an Anderson-Darling p-value of 0.504 (not at all significant); the other's is 0.015 (highly significant). Which one passed and which one failed?

If you're wondering, then you have not yet grasped where I'm going with this.

Using the table in the first section, you may already have looked at this data and disregarded the need for normality for all but Capability Analysis (Normal). But should you even be concerned about your capability output? Here is a Capability Analysis of the two samples:

Not much difference in those statistics—certainly not enough to be concerned with. In other words, even with the most sensitive commonly-used tool, there comes a point where the Anderson-Darling test is too sensitive and you should trust your instincts.

With very little data, you may have difficulty getting a statistically significant result from the Anderson-Darling test, and with a large amount of data you are likely to get statistically significant results that aren't practically significant. So while Anderson-Darling *is* a useful test, it should only be used in conjunction with your instincts as well as your knowledge of whether the normality "assumption" is an important one for the test you are performing.
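The large-sample side of this can also be seen concretely. As a hedged sketch in Python (my own illustration, not from the post): a t-distribution with 6 degrees of freedom looks nearly normal in a histogram, yet with thousands of points the Anderson-Darling test reliably flags its slightly heavy tails:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# A t-distribution with 6 degrees of freedom looks nearly normal in a
# histogram, but its tails are slightly heavier than the normal's.
small = rng.standard_t(df=6, size=50)
large = rng.standard_t(df=6, size=5_000)

res_small = stats.anderson(small, dist='norm')
res_large = stats.anderson(large, dist='norm')

# critical_values[2] is the 5% critical value for the normality test
print("n = 50   rejects normality?",
      res_small.statistic > res_small.critical_values[2])
print("n = 5000 rejects normality?",
      res_large.statistic > res_large.critical_values[2])
```

With 50 points the slight departure usually slips through; with 5,000 points the test almost always rejects, even though for most practical purposes the data would serve perfectly well as "normal."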

As for using our eyes on a histogram or probability plot in conjunction with the Anderson-Darling test to make a decision...

*It is a far, far better plot that I view, than I have ever done; it is a far, far better test that I go to than I have ever known.*

Time: Friday, November 30, 2012

Dear Joel,

thanks for a very exhaustive and informative post!

I am currently creating probability distribution plots and calculating goodness of fit for all the possible distributions included in Minitab 16 for a large number of variables in a model that I want to test for global sensitivity and uncertainty. Minitab provides me with the probability plots as well as the AD and p-values. From a post written by your colleague Jim Frost, I learned that I'm looking for low AD values and high p-values to identify the distributions that best fit my datasets. The problem arises when the plots seem very good to me, but the p-values are extremely low, and hence suggest I should reject the H0 of the datasets fitting the distribution. But I do have a very high number of points for each variable I'm testing (between 1,000 and 1,800 usually). Should I trust the exceedingly powerful tests or my eyes?

All the best, and thanks to you and your colleagues for this super-useful blog!

Tom

Time: Friday, November 30, 2012

Tom - thanks for your comments! I say go with your gut - if no obvious patterns exist and the only thing pointing you to a poor fit is the p-value, then you're likely suffering from the large-sample issue. The more formal test for this is the "squinty eye" test... if you squint your eyes and the points look like a straight line, then you pass the test!

Time: Friday, February 1, 2013

Dear Joel,

I am measuring a typical variable and have collected 20 consecutive samples. The histogram approximates a normal distribution but shows some shift to the high side of the distribution. My Anderson-Darling Normality Test data is as follows:

A-squared: 0.35

P-Value: .426

Is the AD test indicating normality or not?

Thanks.

Time: Monday, February 4, 2013

Jeff-

I think you're finding yourself in "fail to reject" territory... the test does not indicate that there is evidence that the data are non-normal. As described above, this could be because they come from the normal distribution, or it could be that the difference is not large and you don't have enough data to make the A-D test powerful enough to detect the difference. Before sweating about it, I think it's worth first considering what tools you want to use the data with - if you're doing a control chart, t-test, etc., I'm guessing you're fine. If it's capability, you may want to be sure you know the distribution first.

Time: Monday, May 13, 2013

Thank you!!! I just ran into this situation this morning. I have 8 points which do not look normal at all; in fact, they could possibly be from two different populations. Yet the AD test was saying the distribution is normal. I went online, and your post was the second link I investigated. Thanks again. Ed

Time: Tuesday, May 14, 2013

Ed-

So glad to hear that this was able to help you! Hopefully you'll find other blog posts or information on the site that help you out in other areas as well.

- Joel

Time: Sunday, July 14, 2013

Hi, I have some sets of observations which are not random samples from any population. I want to test whether they have a normal distribution. Q-Q plots suggest that they are normally distributed. The Shapiro-Wilk test also does so, but it assumes that the data come from random samples. Can the A-D test be used instead? Does it require independent observations? I want to be able to distinguish between a control group (with a normal distribution) and a treated group (deviation from normality). Sample sizes are typically 20-100.

Thanks. Michael

Time: Tuesday, July 16, 2013

Michael-

The answer depends on what you mean by "not random" and what you hope to infer. The math itself has nothing to do with randomness, but the interpretation of results would. From your description, it sounds like when you say "not random" you mean that the treated group was different from the original sample (because it had some treatment applied). In that case the test is fine, although I'm not sure testing for normality is the important thing to determine. Wouldn't it be more informative to test for different means or variations? Or if the mean and variation were the same, is it truly important to know that the shape of the distribution is different?

Otherwise, if by "not random" you mean that your sampling strategy involved samples that were correlated to one another more strongly than to the general population or somehow different from the general population in another manner, then you're not likely to learn much from almost any statistic. The results of most tests you could do could not be used to infer anything about the greater population, in which case they may be fairly useless.

Time: Tuesday, August 6, 2013

Dear Joel,

Many thanks for your post which I found very useful for me!

However, I still wonder if I HAVE TO conduct a normality test when I have a very large random sample?

I have read in some sources (for example, Sirkin: Statistics for Social Sciences) that when the sample size n is large enough, we can relax the normality assumption thanks to the Central Limit Theorem.

Do I correctly understand this statement to mean that we can check the normality assumption by evaluating: (i) randomness of the sample; (ii) size of the sample; and (iii) independence of the observations? And, if the sampling is random, the sample size is more than 30, and the observations are independent, can we safely perform inferential parametric tests regardless of the findings from a normality test?

I also learned that some statisticians suggest the normality test is worthless, or that it is just a waste of time. Can you comment on that?

Many thanks in advance!

Time: Tuesday, August 6, 2013

Thanks for your questions and comments!

First off, I should clarify the central limit theorem. With sufficient sample size, the distribution of the sample MEAN approaches normality...the original data, if from a different distribution, will never be normal no matter how many samples we take. Khan Academy has a good explanation at http://www.khanacademy.org/math/probability/statistics-inferential/sampling_distribution/v/central-limit-theorem. In any event, this allows us to make inference about the mean using statistics based on the normal distribution, but not the entire distribution.
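To make this distinction concrete, here is a small Python illustration (my own sketch, not from the comment thread): the skewness of raw exponential data stays near its theoretical value of 2 however much we collect, while the skewness of sample means shrinks toward the normal distribution's 0.

```python
import numpy as np

rng = np.random.default_rng(0)

def skewness(x):
    """Sample skewness: near 0 for symmetric (e.g. normal) data."""
    x = np.asarray(x)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

# Raw exponential data stays heavily skewed no matter how much we collect...
raw = rng.exponential(size=100_000)

# ...but the distribution of SAMPLE MEANS (here, means of n = 100)
# pulls in toward normality, as the central limit theorem promises.
means = rng.exponential(size=(10_000, 100)).mean(axis=1)

print(f"Skewness of raw data:     {skewness(raw):.2f}")
print(f"Skewness of sample means: {skewness(means):.2f}")
```

The raw data's skewness hovers near 2 regardless of sample size; the means' skewness is already close to zero, which is why inference about the mean can lean on normal theory while inference about the whole distribution (e.g., capability) cannot.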

For example, no matter how many samples you take you should not assume normality when doing capability analysis of non-normal data.

I would not go so far as to say the normality test is worthless, but it should be combined with knowledge about how important it is for your data to be exactly normal and the power of the test. From the table in part 1 of this post, you can see which tests are sensitive and which are not...so with a large sample size - which can detect even slight departures from normality - ignore the test and use your judgment. But with small sample sizes and a sensitive tool like capability analysis, you may give a little more weight to the test.

I hope this helps!

Time: Wednesday, August 14, 2013

Dear Joel,

I've been sweating over this for a few days now - after implementing the AD test in Excel and calculating the p-values.

I'm wondering if the AD value can be used as a measure of the degree of non-normality, irrespective of its statistical significance. If so, then a high AD value with a low p-value (< 0.05) could be used to rule out normality, a high or low AD value with a high p-value (>0.05) would mean more data is required, and a low AD value with a low p-value (

Time: Wednesday, August 14, 2013

Greg-

Unfortunately, there is no good interpretation of the magnitude of an Anderson-Darling value, so what is considered "high" and what is considered "low" is very specific to the dataset at hand.

Your head is in the right place though! There's just no good way to know what is high and what is low.

Joel

Time: Thursday, August 15, 2013

Hi Joel,

I see my post got cropped. The missing part follows:

.... and a low AD value low p value (