
I love all data, whether it’s normally distributed or downright bizarre. However, many people are more comfortable with the symmetric, bell-shaped curve of a normal distribution. It is not as intuitive to understand a Gamma distribution, with its shape and scale parameters, as it is to understand the familiar Normal distribution with its mean and standard deviation.

However, it's a fact of life that not all data follow the Normal distribution. Hey, a lot of stuff is just abnormal...er...non-normally distributed. How to understand and present the practical implications of your non-normal distribution in an easy-to-understand manner is an ongoing challenge for analysts.

This is particularly true for quality process improvement analysts, because a lot of their data is skewed (non-symmetric). The outputs of many processes have natural limits on one side of the distribution. Natural limits include things like purity, which can’t exceed 100%, or drill hole sizes, which can’t be smaller than the drill bit. These natural limits produce skewed distributions that extend away from the natural limit. So, non-normal data is actually typical in some areas.

Fear not: shining a light on something and identifying it makes it less scary. I will show you how to identify the distribution of your data.

To illustrate this process, I’ll look at the body fat percentage data from my previous post about using regression analysis for prediction.  You can download this data here if you want to follow along. 

Going with Raw Sample Data

We could simply plot the raw, sample data in a histogram like this one:

Histogram of body fat percentage

This histogram does show us the shape of the sample data and it is a good starting point. We can see that this distribution is skewed to the right and probably non-normal. However, this graph only tells us about the data from this specific sample. On its own, it doesn’t let you make inferences about the larger population.

What can be done to increase the usefulness of these data? First, identify the distribution that your data follow. Once you do that, you can learn things about the population—and you can create some cool-looking graphs!

How to Identify the Distribution of Your Data

To identify the distribution, we’ll go to Stat > Quality Tools > Individual Distribution Identification in Minitab. This handy tool allows you to easily compare how well your data fit 16 different distributions. It produces a lot of output both in the Session window and graphs, but don't be intimidated. Before we walk through the output, there are 3 measures you need to know.

Anderson-Darling statistic (AD): Lower AD values indicate a better fit. It’s generally valid to compare AD values between distributions and go with the lowest.

P-value: You want a high p-value. A low p-value (e.g., < 0.05) indicates that the data don’t follow that distribution. For some 3-parameter distributions, the p-value is impossible to calculate and is represented by asterisks.

LRT P: For 3-parameter distributions only, a low value indicates that adding the third parameter is a significant improvement over the 2-Parameter version. A higher value suggests that you may want to stick with the 2-Parameter version.
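To make the AD statistic less of a black box, here is a minimal Python sketch of the standard Anderson-Darling formula, applied to a Normal distribution fitted from the sample mean and standard deviation. It is only an illustration, not Minitab's implementation: the function name `anderson_darling` and both samples are made up for the example, and Minitab may estimate parameters or adjust the statistic differently.

```python
import math
from statistics import NormalDist, mean, stdev


def anderson_darling(sample, cdf):
    """Anderson-Darling statistic of `sample` against an already fitted CDF."""
    x = sorted(sample)
    n = len(x)
    total = 0.0
    for i in range(1, n + 1):
        f_low = cdf(x[i - 1])   # fitted CDF at the i-th smallest value
        f_high = cdf(x[n - i])  # fitted CDF at the i-th largest value
        total += (2 * i - 1) * (math.log(f_low) + math.log(1.0 - f_high))
    return -n - total / n


# Two hypothetical samples of 40 values, built from evenly spaced quantiles
# so the example is deterministic: one bell-shaped, one right-skewed.
probs = [(i - 0.5) / 40 for i in range(1, 41)]
near_normal = [NormalDist(25, 5).inv_cdf(p) for p in probs]
skewed = [16 + 10 * math.exp(NormalDist().inv_cdf(p)) for p in probs]

# Fit a Normal distribution to each sample, then measure the fit.
ad_near_normal = anderson_darling(
    near_normal, NormalDist(mean(near_normal), stdev(near_normal)).cdf
)
ad_skewed = anderson_darling(skewed, NormalDist(mean(skewed), stdev(skewed)).cdf)

print(f"AD, bell-shaped sample vs Normal: {ad_near_normal:.3f}")
print(f"AD, right-skewed sample vs Normal: {ad_skewed:.3f}")
```

The bell-shaped sample produces a much lower AD value than the skewed one, which is the "lower is better" behavior described above.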

So, for my data, I’ll fill out the main dialog like this:

Individual Distribution Identification dialog box in Minitab

Let’s dive into the output. We’ll start with the Goodness of Fit Test table below.

Goodness of Fit Test table in Minitab's output

The very first line shows our data are definitely not normally distributed, because the p-value for Normal is less than 0.005! 

We'll skip the two transformations (Box-Cox and Johnson) because we want to identify the native distribution rather than transform it. 

A good place to start is to skim through the AD values and look for the lowest. The lowest AD is for 3-Parameter Weibull. However, the AD values for 3-Parameter Lognormal, Largest Extreme Value, and 3-Parameter Gamma are all close. For the 3-Parameter Weibull, the LRT P is significant (0.000), which means that the third parameter significantly improves the fit. The LRT P is not significant for the other 3-Parameter candidate distributions. 

Given the lower AD value and the significant LRT P value, we can pick the 3-Parameter Weibull distribution as the best fit for our data. We identified this distribution by looking at the table in the Session window, but Minitab also creates a series of graphs that provide most of the same information along with probability plots. You can see 3-Parameter Weibull in the graph below. The data points follow a fairly straight line, which indicates a good fit.
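If you want to see the "fit several candidates, then compare AD values" idea outside Minitab, the sketch below does it in plain Python for three distributions. Treat everything in it as an assumption for illustration: the sample is hypothetical (not the body fat data), the candidate list is far shorter than Minitab's, and the simple parameter estimates used here are not necessarily the ones Minitab computes.

```python
import math
from statistics import NormalDist, mean, stdev


def anderson_darling(sample, cdf):
    """Anderson-Darling statistic of `sample` against an already fitted CDF."""
    x = sorted(sample)
    n = len(x)
    total = sum(
        (2 * i - 1) * (math.log(cdf(x[i - 1])) + math.log(1.0 - cdf(x[n - i])))
        for i in range(1, n + 1)
    )
    return -n - total / n


def fit_candidates(sample):
    """Return {name: fitted CDF} using simple sample-based estimates."""
    logs = [math.log(v) for v in sample]
    normal = NormalDist(mean(sample), stdev(sample))
    log_normal = NormalDist(mean(logs), stdev(logs))  # Normal fitted to the logs
    exp_scale = mean(sample)                          # exponential scale = mean
    return {
        "Normal": normal.cdf,
        "Lognormal": lambda v: log_normal.cdf(math.log(v)),
        "Exponential": lambda v: 1.0 - math.exp(-v / exp_scale),
    }


# Hypothetical right-skewed sample (40 evenly spaced lognormal quantiles).
sample = [
    math.exp(3.2 + 0.3 * NormalDist().inv_cdf((i - 0.5) / 40))
    for i in range(1, 41)
]

ad_by_name = {
    name: anderson_darling(sample, cdf)
    for name, cdf in fit_candidates(sample).items()
}
best_fit = min(ad_by_name, key=ad_by_name.get)

for name, ad in sorted(ad_by_name.items(), key=lambda item: item[1]):
    print(f"{name:12s} AD = {ad:.3f}")
print("Lowest AD:", best_fit)
```

As in the Minitab table, the lowest AD is only a starting point; you would still want to look at the probability plot before settling on a distribution.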

Probability Plot with the 3-Parameter Weibull distribution

Now we know what the distribution is—but what are the distribution's parameter values? For those, look at the next table down in the Minitab Session window output:

Distribution parameters table

How Does Identifying the Distribution of Data Help with Analysis?

All right. Now we know that the body fat percentage data follow a 3-Parameter Weibull distribution with a shape of 1.85718, a scale of 14.07043, and a threshold of 16.06038.

At this point you may be wondering, "How does that help us?"   The answer: with this information about the distribution, we can go beyond the raw sample data and make statistical inferences about the larger population.
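As a small taste of what that looks like, here is a hedged Python sketch that plugs the three fitted parameter values into the standard 3-parameter Weibull formulas. The function names and the 35% cutoff are hypothetical choices for illustration, and this is not how Minitab produces its output.

```python
import math

# Parameter values reported above for the body fat percentage data.
SHAPE = 1.85718
SCALE = 14.07043
THRESHOLD = 16.06038


def weibull3_cdf(x):
    """Proportion of the fitted population at or below x."""
    if x <= THRESHOLD:
        return 0.0
    return 1.0 - math.exp(-(((x - THRESHOLD) / SCALE) ** SHAPE))


def weibull3_quantile(p):
    """Value below which a proportion p of the fitted population falls."""
    return THRESHOLD + SCALE * (-math.log(1.0 - p)) ** (1.0 / SHAPE)


median = weibull3_quantile(0.5)
# 35 is an arbitrary cutoff chosen only to show the calculation.
share_above_35 = 1.0 - weibull3_cdf(35.0)

print(f"Fitted median body fat percentage: {median:.2f}")
print(f"Fitted share of population above 35%: {share_above_35:.3f}")
```

These numbers describe the fitted distribution, so they are only as trustworthy as the fit itself.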

In my next post, I'll show you how to use powerful tools in Minitab to gain deeper insights into your research area and present your results more effectively.


Comments for How to Identify the Distribution of Your Data using Minitab

Name: Eric
Time: Thursday, March 8, 2012

Where can I find the raw data to follow along with your examples? Love the blog.

Name: Eston @ Minitab
Time: Thursday, March 8, 2012

Hi Eric, and thanks for the kind words about the blog. You can find the data discussed in this post here:

http://www.scribd.com/doc/84506538/Body-Fat-Data-for-Identifying-Distribution-in-Minitab

I've also added a link to the data in the body of Jim's post. Thanks again!

Eston Martz
Blog Editor

Name: Jonathan D. Cryer
Time: Monday, March 19, 2012

It is rather poor statistics to display these parameter values to umpteen decimals. Secondly, after you fit 16 or more distributions, any p-values are quite suspect. If you confirm this distribution with some new data, that would be very helpful. "Now we know that the body fat percentage data follow a 3-Parameter Weibull distribution with a shape of 1.85718, a scale of 14.07043, and a threshold of 16.06038."

Name: Jim Frost
Time: Tuesday, March 20, 2012

Hi Jonathan,

Thanks for writing. You raise several good points.

I agree that the 5 decimal places can give a false sense of precision. However, I wanted to stick with the default output so it would be easier for readers to follow along and match values, particularly because there is no reason to suspect that these values are less accurate than more truncated values.

I'm not quite clear on why you think the p-values are suspect. Perhaps you're thinking of multiple comparisons where you adjust the alpha level downwards based on the number of comparisons. Here, we are more focused on the AD values. We also look at the p-values, but we actually want high values. If you're concerned about fitting a distribution by chance, see below.

Great point about the importance of replicating results. That is standard for inferential statistics, where you take a sample and make inferences about a larger population. This sample is the best evidence we have right now about the properties of the population. But, it's always a good idea to see if new data supports it.

Cheers,
Jim Frost

Name: Barb
Time: Friday, March 30, 2012

What do you do if none of the distributions fit?

Normal 108.709

Name: Jim Frost
Time: Friday, April 6, 2012

Hi Barb,

You must have some interesting data! That's a good question. The answer depends on what you want to do with your data. Fortunately, you have several options!

One thing I'd check for is whether your data set might combine different populations. If you see multiple peaks, or a stretched out peak, your data might contain multiple populations. If that's the case, try separating and then identifying the distribution for the subpopulations. For example, you wouldn't want to identify the combined distribution for the heights of men and women. Instead, you'd separate the data.

If you want to report the middle and spread of your data, you can use the empirical median and quartiles. These statistics are based on your sample rather than a fitted distribution. The advantage is that you don't need to identify the distribution. You can do this in Minitab if you go to Stat > Basic Statistics > Display Descriptive Statistics.
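For readers working outside Minitab, the same empirical summary can be sketched with Python's standard library. The data values below are hypothetical, and quartile conventions differ between packages, so small differences from other software are expected.

```python
from statistics import median, quantiles

# Hypothetical right-skewed measurements; swap in your own column of data.
measurements = [2, 4, 4, 5, 7, 9, 12, 15, 21, 40]

middle = median(measurements)
# quantiles(..., n=4) returns the three cut points Q1, Q2 and Q3.
q1, q2, q3 = quantiles(measurements, n=4)
iqr = q3 - q1  # interquartile range: spread of the middle half of the data

print(f"Median = {middle}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```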

If you want to compare your data to a target value or compare groups within your data, you can use Nonparametric analyses. This set of analyses has fewer assumptions than parametric analyses and, generally, you don't need to identify your distribution. You can find them in Minitab if you go to: Stat > Nonparametrics.

I hope this helps!
Jim

Name: Shree Nanguneri
Time: Wednesday, April 18, 2012

Hi Mr. Frost,

Thanks for the treatise here and these are great examples of how one should deal with non-normal distribution based data.

I have a series of articles coming out this summer on the same topic in an academic environment, where the natural limit is 100% for a student's score in a given high school.

It would be very interesting to see how your predicted scores match up with the actual ones, to give us a mathematical model that is valid.

Great work and keep it coming!

Sincerely,

Shree

Name: Stanley Alekman
Time: Saturday, April 21, 2012

How does one test count data for binomial and hypergeometric, geometric and other noncontinuous distributions?

Name: Jim Frost
Time: Tuesday, April 24, 2012

Hi Shree,

Thanks for reading and sharing your own work! Please let me know when your articles are out so I can read them!

Sincerely,
Jim

Name: Jim Frost
Time: Tuesday, April 24, 2012

Hi Stanley,

You can test for these distributions in Minitab but it'll take a little more work. First, you'll need to tally the observed counts for each value in your sample. Second, you'll need to calculate the theoretical proportions for each value in your sample based on the distribution that you want to test.

After that, you can use Stat > Tables > Chi-Square Goodness-of-Fit Test (One Variable) in Minitab. In this dialog, you'll see where to enter the column of observed counts. Under Test, you want to choose Specific proportions and enter your column of theoretically expected proportions.

Minitab will then check to see if your observed proportions differ from the proportions you'd expect for the distribution. A low p-value suggests that your data do not follow that distribution.
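The same three steps (tally, theoretical proportions, chi-square comparison) can be sketched in plain Python. Everything here is an assumption for illustration: the defect counts are made up, the Poisson mean of 1.3 is a hypothesized value rather than an estimate, and the sketch stops at the test statistic instead of computing a p-value.

```python
import math
from collections import Counter

# Hypothetical raw data: number of defects found on each of 100 units.
defects = [0] * 27 + [1] * 36 + [2] * 24 + [3] * 9 + [4] * 3 + [5] * 1

HYPOTHESIZED_MEAN = 1.3  # Poisson mean being tested (assumed, not estimated)
TOP_CATEGORY = 3         # pool counts of 3 or more so expected counts stay large

# Step 1: tally the observed counts for each category (0, 1, 2, 3+).
tally = Counter(min(k, TOP_CATEGORY) for k in defects)
observed_counts = [tally.get(k, 0) for k in range(TOP_CATEGORY + 1)]

# Step 2: theoretical proportions under the hypothesized Poisson distribution.
expected_props = [
    math.exp(-HYPOTHESIZED_MEAN) * HYPOTHESIZED_MEAN**k / math.factorial(k)
    for k in range(TOP_CATEGORY)
]
expected_props.append(1.0 - sum(expected_props))  # everything in the 3+ bin

# Step 3: chi-square goodness-of-fit statistic.
n = len(defects)
chi_square = sum(
    (obs - n * p) ** 2 / (n * p) for obs, p in zip(observed_counts, expected_props)
)
degrees_of_freedom = len(observed_counts) - 1

# 7.815 is the 5% critical value of the chi-square distribution with 3 df.
CRITICAL_VALUE_DF3 = 7.815
print(f"Chi-square = {chi_square:.3f} on {degrees_of_freedom} df")
print("Reject Poisson fit at 5%:", chi_square > CRITICAL_VALUE_DF3)
```

If the Poisson mean were estimated from the same data, the degrees of freedom would drop by one and a different critical value would apply.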

I hope this helps!
Jim Frost

Name: Reed Martin
Time: Thursday, May 31, 2012

Piggybacking on the Q/A from Barb. What to do if no distributions fit and you want to do a process capability analysis? I'm comparing solution temperature with that of a control. Both of those samples are normally distributed. However, I want to assess capability of the difference between the two and that distribution is unusual. Thanks!

Name: Jim Frost
Time: Friday, June 1, 2012

Hi Reed,

My first question would be, can you transform the differences so they follow a normal distribution? If so, that's the preferred approach. You can transform your data (Box-Cox or Johnson) and do a capability analysis all in one shot: Stat > Quality Tools > Capability Analysis > Normal.

If the differences don't follow any distribution and can't be transformed, that gets a bit more difficult. One possibility would be to create a Tolerance Interval: Stat > Quality Tools > Tolerance Intervals.

This analysis will assess your data and calculate an interval that is likely to contain 95% of all future differences. If you prefer, you can change the percentage and/or specify a one-sided bound. Then, compare the interval or bound to your requirements to see if the results are satisfactory.

For tolerance intervals, your data aren't required to follow any distribution. Just be sure to look at the results that use the nonparametric method.
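To give a feel for why a distribution-free interval is possible at all, here is a short Python sketch of one classic result for continuous data: the confidence that the range from the sample minimum to the sample maximum covers a given share of the population. The function names are hypothetical, and Minitab's nonparametric method may choose its interval differently, so treat this as background rather than a description of the software.

```python
def minmax_coverage_confidence(n, coverage):
    """Confidence that [sample min, sample max] from n observations
    contains at least `coverage` of a continuous population."""
    return 1.0 - n * coverage ** (n - 1) + (n - 1) * coverage**n


def smallest_sample_size(coverage, confidence):
    """Smallest n for which the min-max interval reaches the confidence."""
    n = 2
    while minmax_coverage_confidence(n, coverage) < confidence:
        n += 1
    return n


n_needed = smallest_sample_size(coverage=0.95, confidence=0.95)
print("Observations needed for a 95% / 95% min-max interval:", n_needed)
print(f"Confidence with only 30 observations: "
      f"{minmax_coverage_confidence(30, 0.95):.3f}")
```

The practical takeaway is that nonparametric intervals need a fairly large sample before they can promise both high coverage and high confidence.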

I hope this helps!
Jim

Name: Tracey Oram
Time: Monday, August 20, 2012

Hi Jim,

Firstly, please can I just say THANK YOU for such a descriptive summary of how to do this. I managed to follow it with ease and have now managed to compile a capability study with non-normal data.

I was originally stumped. I wanted to compile something to show to our directors that was relatively straightforward to explain, but when I did the original histograms they were extremely skewed, and quite frankly looked a bit messy.

So I identified that my data was 3-parameter loglogistic and have now managed to do some nice-looking capability curves and also calculate the process sigma level.

Thanks again for this, it has been extremely useful!

Name: Jim Frost
Time: Friday, September 7, 2012

Hi Tracey,

I'm so glad that you found the blog helpful. It's pretty amazing how easily you can convey complex information with a nice graph!

Thanks so much for writing!
Jim

Name: Andrew
Time: Monday, November 12, 2012

Hi Jim,

This is really useful. However, I am slightly confused about distributions. For instance, the literature says most examples of my data follow negative binomial distributions, but isn't count data usually a Poisson distribution? How do I test for these distributions? Aren't they pretty common? I'm using generalized linear models for my data, where I need to specify the distribution of my data.

Many thanks, great blog.

Andrew

Name: Jim Frost
Time: Tuesday, November 13, 2012

Hi Andrew,

Thanks for reading and for your question. Your comment is timely! My next blog will be about identifying a discrete distribution. In the meantime, if you look higher up in the comments for my reply to Stanley, you'll see a general procedure for how to do this.

It's true that both negative binomial and Poisson distributions are counts. However, they are counts of very different things that you would use for different situations.

The Poisson distribution describes a count of a characteristic (e.g. defects) over a finite observation space, such as the number of scratches on a windshield.

The negative binomial distribution describes a count of the number of trials necessary to produce a specified number of a certain outcome given a constant event probability. For example, it can model the number of windshields produced until you reach 100 defective units.

Jim

Name: Tommaso Locatelli
Time: Friday, November 30, 2012

Dear Jim,
Thanks for a very exhaustive and informative post!
I am currently creating probability distribution plots and calculating goodness of fit for all the possible distributions included in Minitab 16, for a high number of variables of a model that I want to test for global sensitivity and uncertainty. The problem arises when the plots seem very good to me, but the p-values are extremely low, and hence suggest I should reject the H0 of the datasets fitting the distribution. But I do have a very high number of points for each variable I'm testing (between 1,000 and 1,800 usually). Should I trust the exceedingly powerful tests or my eyes?
All the best, and thanks to you and your colleagues for this super-useful blog!
Tom

Name: Jim Frost
Time: Friday, November 30, 2012

Hi Tom,

Thanks for reading and for the nice comments!

As you mention, very large sample sizes give the distribution tests very high power. So, even a small deviation is flagged as statistically significant. However, statistically significant doesn't mean that it is a practically significant deviation. In this case, I would trust your eyes!

My colleague, Patrick Runkel, wrote a great blog post about this. You can find it here. Just look for the "Warning 2" section!

http://blog.minitab.com/blog/statistics-and-quality-data-analysis/large-samples-too-much-of-a-good-thing

Jim

Name: Lokesh Jasrotia
Time: Tuesday, December 4, 2012

Hi Jim,
Many thanks for sharing the info.
All the queries in turn helped me understand this topic in detail.
Thanks to all.

regards
Lokesh

Name: Niloo
Time: Thursday, December 13, 2012

Hi,
I have some left-censored datasets for which I want to find the distribution type. I use the Distribution ID Plot subdialog box under Distribution Analysis (Arbitrary Censoring). The problem is that Minitab does not consider the highest observation in drawing the probability plot. If I add an artificial value (larger than the largest observation), the missing value is then shown. However, the decision about the distribution type (based on A-D or correlation coefficient) changes depending on whether the artificial value is added. The question is: should I add the artificial value or not?
Thanks

Name: RV
Time: Tuesday, April 2, 2013

Thanks, this is a great help! What if you have a pretty normal distribution (AD .16, p =.93) and another distribution or transformation is slightly better (AD .11, p=.991) - Can you simply choose the normal, or do you have to go with the slightly better distribution/transformed data?

Name: Jim Frost
Time: Tuesday, April 2, 2013

Hi RV,

I would go with the normal distribution, rather than the transformed data. They're so close that the difference is not important.

Generally, if you have two close distributions, it's probably easier to see any differences in a probability plot than through the numbers.

Thanks for reading!
Jim

Name: TP
Time: Tuesday, May 7, 2013

Hi Jim,
Thanks for the outstanding explanation above. I have the data set below:
113462
91038
52448
68147
78161
118206
122366
100877
105233
109658
110314
101057
I performed Individual Distribution Identification and got the following result:


Distribution ID Plot for PBT


Descriptive Statistics

N   N*  Mean     StDev    Median  Minimum  Maximum  Skewness  Kurtosis
12  0   97630.6  21255.3  103145  52448    122366   -1.04395  0.348496


Box-Cox transformation: Lambda = 3


Goodness of Fit Test

Distribution             AD     P       LRT P
Normal                   0.505  0.162
Box-Cox Transformation   0.217  0.794
Lognormal                0.759  0.035
3-Parameter Lognormal    0.528  *       0.087
Exponential              3.513  0.500   0.257
Smallest Extreme Value   0.259  >0.250
Largest Extreme Value    0.798  0.032
Gamma                    0.691  0.076
3-Parameter Gamma        4.921  *       1.000
Logistic                 0.447  0.217
Loglogistic              0.631  0.058
3-Parameter Loglogistic  0.447  *       0.136


Here the p-value for Normal is 0.162, which is greater than 0.05. So it suggests that the null hypothesis is not rejected and the data are consistent with a normal distribution.
But when I calculate Anderson-Darling manually with the standard AD formula,
AD = Σ [(1 - 2i)/n] [ln F(Z_i) + ln(1 - F(Z_(n+1-i)))] - n, summed over i = 1 to n,
I get a value of 7.13, which is way higher than 0.752. Also, the AD value shown by Minitab is 0.505.
I am unable to understand what is going wrong. Can you help me out?

TIA

Name: Vanessa
Time: Tuesday, June 4, 2013

I normalized the data using the Box-Cox transformation and asked for the transformed data to be displayed in a specific column, but all I get is a bunch of zeros. The transformed data looks OK in the Probability Plot, but I can't get the transformed data in a column. Please help. Thanks!

Name: prasshanth bharadwaj
Time: Sunday, October 13, 2013

Thank you for this wonderful topic. I am working on a project and am unable to find the distribution pattern of my data; all values are less than 0.05.
Output as below:

Distribution Identification for AHT_1

* NOTE * Fail to select a Johnson transformation function with P-Value > 0.1.
No transformation is made.

2-Parameter Exponential

Distribution ID Plot for AHT_1

Descriptive Statistics

N    N*  Mean     StDev    Median  Minimum  Maximum  Skewness  Kurtosis
545  0   405.653  273.417  336     2        992      0.473922  -0.983937

Box-Cox transformation: Lambda = 0.5

Goodness of Fit Test

Distribution            AD      P       LRT P
Normal                  12.802  <0.005
Box-Cox Transformation  5.484

Name: Kangmin
Time: Wednesday, November 6, 2013

Hello Jim, I have a question for you. I submitted it before, but I am not sure that it was sent to you.
I have 12 years of wind speed data. As you know, wind speeds follow a Weibull distribution. My goodness-of-fit results show a p-value of less than 0.01 for all distributions, but the Weibull distribution has the lowest AD value. The 3-parameter Weibull has an LRT P value of 0.000.
Can this be used to support the claim that the wind data follow a Weibull distribution? If not, what should I do?

Regards,

Kangmin.

Name: jerita
Time: Tuesday, November 19, 2013

Hi, I have data for about 200,000 teachers showing their survival times, and I want to model the data using a Weibull distribution. Can such data be modelled using Weibull?

Name: Jim Frost
Time: Thursday, December 5, 2013

Hi Jerita,

I'd guess that the Weibull distribution would be a good candidate because it's a very flexible distribution and is often used to model right-skewed distributions, which I'd guess that yours would be. However, you can never know for sure until you actually compare it to the other alternatives. It's possible that a different distribution could provide a better fit.

A couple of caveats specific to your case.

It sounds like you might have life data that could be censored. Any teachers that did not stop teaching before the study ended are censored, meaning their exact survival time is unknown. If this is the case with your data, you'll want to use Minitab's Distribution ID Plot (Right Censoring), which you can find here: Stat > Reliability/Survival > Distribution Analysis (Right Censoring) > Distribution ID Plot. (If the study followed all 200,000 teachers until they stopped teaching, you don't have censored data.)

Also, you have a tremendous amount of data. A distribution test is like any other hypothesis test. More data gives the test more power to detect deviations from any given distribution. With so much data, the distribution tests can detect trivial, practically insignificant departures from the distribution. So, don't be surprised if you get low p-values with so much data!

You should focus on comparing the Anderson-Darling values for the distributions and looking at the probability plots for data points that follow a fairly straight line.

Jim

Name: maria
Time: Sunday, March 9, 2014

Hi Jim.
I have a question regarding the p-value of a parameter coefficient. The output gives the p-value to 3 decimal places. How do I know whether a p-value of 0.000 indicates that the p-value is exactly zero or just a very small value that displays as 0.000?
Thanks a lot

Name: Jim Frost
Time: Monday, March 24, 2014

Hi Maria,

In Minitab, when the output indicates that the p-value is 0.000, this indicates that the p-value is less than 0.001.

Also, as long as you are working with a sample rather than a population, there will always be some chance of incorrectly rejecting the null hypothesis, which is what the p-value indicates. So, you can't have a p-value that truly equals zero.

Jim
