Working as a technical support specialist at Minitab Ltd, I regularly come across customers who run into unusual-looking normal probability plots. I have to say, my initial reaction when I was first presented with these unfriendly creatures was: “Uh oh, the pattern does look very strange. What do I say now?”
Luckily for me, a colleague had seen many of these in his long and illustrious career. So I thought I would share his findings with you. His insights really helped me in understanding and dealing with the issue and I've never been scared of it since. I hope you will find this useful, too!
To me, no other statistical tool is abused or misused as much as the poor probability plot. A lot of emphasis is placed on checking your data for normality. Read the p-value: is it normal? If it is, we can breathe a sigh of relief and go on our merry way. But if it is not, tragedy strikes... how can we analyse our data now? We are in paralysis.
I joke (a little), but a probability plot can tell us a lot about our data, whether it's normal or not.
Before we go into this, a few words of advice…
Data doesn’t have to be normal. It can be perfectly natural for our data to not be normal. Indeed, in several statistical tools the original data need not be normal at all, because it is the residuals that we check for normality. (More on that another time…)
By the same token, just because our data is normally distributed does not mean our sample is good, in control, or capable. Take a look at this histogram, for example:
The histogram looks pretty reasonable. But let’s see what the humble probability plot can tell us.
Now isn’t that interesting? Our p-value is below 0.05. The null hypothesis for the normality test is that the data follow a normal distribution; the alternative is that they do not. "The p is low so the null must go," as they say.
Doesn’t that seem very strange to you? The histogram looked almost perfect. So how can the data not be normal?
The clue is in the probability plot. Notice the vertical lines; now take a look at a sample of the data.
What can we notice here? Well, the data is all discrete: we have only whole numbers. That shows up quite clearly in the probability plot, where the repeated values stack up as vertical lines. The Anderson-Darling test is quite sensitive to discrete data and correctly says the data is not normal, because it isn’t continuous.
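If you would like to see where those vertical lines come from, here is a minimal sketch in Python. It assumes median rank (Benard) plotting positions, (i - 0.3) / (n + 0.4), which is my assumption for illustration; any other plotting-position formula stacks the points in the same way. The function name tie_stacks and the small data set are made up for the example.

```python
def tie_stacks(values):
    """Group tied values and report the span of plotting positions each covers.

    Returns {value: (count, lowest_position, highest_position)}.
    Positions use the median rank (Benard) formula, an assumption here.
    """
    ordered = sorted(values)
    n = len(ordered)
    stacks = {}
    for i, v in enumerate(ordered, start=1):
        p = (i - 0.3) / (n + 0.4)  # plotting position of the i-th smallest value
        if v in stacks:
            count, low, _ = stacks[v]
            stacks[v] = (count + 1, low, p)  # extend the stack upwards
        else:
            stacks[v] = (1, p, p)  # first point seen at this value
    return stacks


# Hypothetical whole-number measurements with plenty of ties.
sample = [48, 49, 49, 50, 50, 50, 51, 51, 52]

for value, (count, low, high) in tie_stacks(sample).items():
    print(f"value {value}: {count} point(s), positions {low:.3f} to {high:.3f}")
```

Every repeated value shares one position on the data axis but covers a range of positions on the percent axis, so the more ties you have, the taller the vertical lines become.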
Now I am sure at this point some of you are asking, "Quick, what test would be more appropriate now? If the AD test can spot this as not normal, should we use a different test?"
Why should we? Do we need another tool to assure us that it could be normal, when we know already it isn’t continuous? Look at the way the points follow the normal distribution line of the probability plot. It is fairly close. It will generally be okay to use this data as if it were normally distributed, as long as we have enough distinct values.
The important part is the information we can get from the chart. Why are the values discrete? Well, for the data here it is because I generated the values from a normal distribution and then rounded them to the nearest integer. You may see patterns like this when you are at the limit of a measurement system, where it cannot resolve smaller differences.
Perhaps you have even rounded data yourself; a very common example of this is recording time measurements to the nearest minute.
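If you want to try the rounding experiment yourself, here is a rough sketch in Python using only the standard library. It is not how Minitab computes its results. The mean, standard deviation, sample size and seed are arbitrary choices of mine, and the 5% critical value of 0.752 and the small-sample adjustment are the commonly tabulated ones for the case where the mean and standard deviation are estimated from the data, so treat them as assumptions.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def anderson_darling_normal(values):
    """Adjusted Anderson-Darling statistic for normality.

    The mean and standard deviation are estimated from the sample.
    """
    n = len(values)
    fitted = NormalDist(mean(values), stdev(values))
    cdf = [fitted.cdf(v) for v in sorted(values)]
    # Computing formula: pair the i-th smallest with the i-th largest value.
    total = sum(
        (2 * i - 1) * (math.log(cdf[i - 1]) + math.log(1.0 - cdf[n - i]))
        for i in range(1, n + 1)
    )
    a2 = -n - total / n
    # Small-sample adjustment for estimated parameters (assumed, see lead-in).
    return a2 * (1 + 0.75 / n + 2.25 / n ** 2)


CRITICAL_5_PERCENT = 0.752  # assumed tabulated value for the adjusted statistic

rng = random.Random(1)  # arbitrary seed so the run is repeatable
raw = [rng.gauss(50, 2) for _ in range(1000)]  # continuous normal data
rounded = [round(v) for v in raw]  # the same data, nearest integer

a2_raw = anderson_darling_normal(raw)
a2_rounded = anderson_darling_normal(rounded)

print(f"distinct values: raw {len(set(raw))}, rounded {len(set(rounded))}")
print(f"AD statistic: raw {a2_raw:.3f}, rounded {a2_rounded:.3f}")
print(f"rounded data rejected at 5%: {a2_rounded > CRITICAL_5_PERCENT}")
```

The underlying distribution is the same in both cases; only the resolution of the recorded values changes, and that alone is enough to push the statistic for the rounded data well past the critical value.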
Thank you for reading.