# The Graphical Benefits of Identifying the Distribution of Your Data

In my previous post, we identified the distribution of the body fat data. Today, we're going to explore several benefits of knowing the distribution, with a special emphasis on creating informative graphs! After all, if you are not sure what a specific distribution with such and such parameters looks like, a graph gives you the picture!

## Using the Distribution Information

So far, we have identified the distribution and the parameter values for the body fat data from 14-year-old girls.

3-Parameter Weibull Distribution:

- Shape = 1.85718
- Scale = 14.07043
- Threshold = 16.06038

How does that help us? What does this even look like? And where do important health ranges fall within this distribution? You can't tell just by looking at the parameter values. However, I'll answer all of these questions with just one cool graph!

It is always a good practice to know the distribution of your data before analyzing them. Certain analyses require certain distributions. For example, it could be a costly mistake to use an analysis that strictly requires a normal distribution with nonnormal data. However, I'm not going to focus on choosing alternative analyses. Instead, I'll focus on the graphs and what we can do just by knowing the distribution.

Because we have identified the best-fitting distribution, we are no longer limited to graphing the raw sample data, like we did with the histogram. We can now make inferences about the population. We can graph the best estimate for what the entire population looks like and calculate probabilities for values that fall in certain ranges. So, let’s do that.

## Probability Distribution Plot

To answer all of our questions, we'll use Minitab's Probability Distribution Plot. I’m a huge fan of these plots. If you want to show your boss what an unusual distribution with inscrutable parameter names actually looks like, use this graph. You can highlight the effect of changing distributions and parameter values, show where target values fall in a distribution, and view the proportions that are associated with important regions. These simple plots clearly and easily communicate these advanced concepts to a non-statistical audience.

Probability Distribution Plots don't use any data. Instead, you specify the distribution and enter the parameter values. You can also specify regions of interest to you.
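If you want to see the same idea outside of Minitab, here is a minimal Python sketch, not Minitab's own implementation. It assumes the standard 3-parameter Weibull density and plugs in the parameter values listed earlier. The function name `weibull3_pdf` and the 15% to 55% grid are my own choices for illustration. Notice that no sample data are involved, only the distribution and its parameters.

```python
import math

# Parameter values of the fitted 3-parameter Weibull, copied from this post.
SHAPE = 1.85718
SCALE = 14.07043
THRESHOLD = 16.06038

def weibull3_pdf(x, shape=SHAPE, scale=SCALE, threshold=THRESHOLD):
    """Density of a 3-parameter Weibull; zero at or below the threshold."""
    if x <= threshold:
        return 0.0
    z = (x - threshold) / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))

# Evaluate the curve on a grid of body fat percentages from 15% to 55%.
grid = [15 + 0.1 * i for i in range(401)]
density = [weibull3_pdf(x) for x in grid]

# The grid point with the highest density approximates the curve's peak.
peak_x = grid[density.index(max(density))]
print(f"Approximate peak of the curve: {peak_x:.1f}% body fat")
```

The highest point on this grid lands a little above 25% body fat, which gives a rough location for the curve's peak.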

We'll use the population parameters that we've already identified. For our region of interest, I found a Web site that recommends that girls aged 14 to 19 should have a body fat percentage between 20% and 24% for health reasons. That range sounds very tight to me, but let’s see where it falls in our population distribution for 14-year-old girls.

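In Minitab, I’ll go to **Graph > Probability Distribution Plot > View Probability** and enter our distribution information in the main dialog: the 3-parameter Weibull distribution with the shape, scale, and threshold values listed above.

Then, I’ll click the **Shaded Area** tab and define the region of interest as the X values from 20 to 24.

After we click **OK**, Minitab displays a graph of the distribution curve with that region shaded.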

All in one shot you can see both the shape of the distribution and how a range of interest fits within it. I’m no health expert, but I can see that the Web site's range for ideal body fat percentages doesn’t reflect where most of the girls fall. Only about 20% of the girls fall within the ideal range, and the range sits below the curve's peak. Already, we know something interesting is going on.
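The shaded proportion can be checked by hand, because the Weibull cumulative distribution function has a closed form. The sketch below is an illustration under that assumption, using the parameter values from this post. `weibull3_cdf` is a hypothetical helper name, not a Minitab function.

```python
import math

# Parameter values of the fitted 3-parameter Weibull, copied from this post.
SHAPE, SCALE, THRESHOLD = 1.85718, 14.07043, 16.06038

def weibull3_cdf(x, shape=SHAPE, scale=SCALE, threshold=THRESHOLD):
    """Proportion of the population at or below x."""
    if x <= threshold:
        return 0.0
    return 1.0 - math.exp(-(((x - threshold) / scale) ** shape))

# Area under the curve between the lower and upper limits of the ideal range.
in_range = weibull3_cdf(24.0) - weibull3_cdf(20.0)
print(f"Proportion between 20% and 24% body fat: {in_range:.3f}")
```

The result is roughly 0.20, in line with the 20% reported for the shaded region.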

## Probability Plots to Calculate Percentiles

Probability Plots have a similar name to Probability Distribution Plots. They are related, but Probability Plots are particularly good at determining whether the data fit a distribution (check, we did that already) and calculating percentiles based on that distribution. In general, the nth percentile has n% of the population below it, and (100-n)% of the population above it.

Percentiles are extra important for nonnormal distributions because you use them to find the center and spread of your distribution. Here's why.

Intuitively, we think of the mean and standard deviation as the center and spread for a normal distribution. Further, a good rule of thumb for normal distributions is that roughly two-thirds (about 68%) of the population falls symmetrically within 1 standard deviation of the mean. About 95% falls within 2 standard deviations.
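As a quick check on that rule of thumb, the proportion of a normal population within k standard deviations of the mean equals erf(k/√2). Here is a minimal Python sketch. The helper name `normal_within` is just for illustration.

```python
import math

def normal_within(k):
    """Proportion of a normal population within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2):
    print(f"Within {k} standard deviation(s): {normal_within(k):.1%}")
```

This gives about 68.3% for 1 standard deviation and about 95.4% for 2.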

However, *none* of this is true for non-symmetric distributions. The mean is not at the center and the general rule of thumb for the spread no longer works. Fortunately, once you identify your distribution, you can calculate percentiles in order to find the center and spread of the population.
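To see the first point with our fitted distribution, we can compare its mean to its median (the 50th percentile). This sketch assumes the textbook formulas for a 3-parameter Weibull, mean = threshold + scale × Γ(1 + 1/shape) and median = threshold + scale × (ln 2)^(1/shape), along with the parameter values from this post. The mean is computed here for illustration and does not come from the Minitab output.

```python
import math

# Parameter values of the fitted 3-parameter Weibull, copied from this post.
SHAPE, SCALE, THRESHOLD = 1.85718, 14.07043, 16.06038

# Textbook closed-form expressions for a 3-parameter Weibull.
mean = THRESHOLD + SCALE * math.gamma(1 + 1 / SHAPE)
median = THRESHOLD + SCALE * math.log(2) ** (1 / SHAPE)

print(f"Mean:   {mean:.1f}% body fat")
print(f"Median: {median:.1f}% body fat")
```

The mean comes out roughly one percentage point of body fat above the median, pulled upward by the long right tail.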

For example, if you want to find the middle value (median) and the range in which the middle 95% of a nonnormal population falls, calculate the 2.5th, 50th, and 97.5th percentiles (97.5 - 2.5 = 95). The median is the 50th percentile; half of the population is above the median and half is below.

We'll calculate the body fat percentages that correspond to the 2.5th, 50th, and 97.5th percentiles. Also, let's see what percentile corresponds to the upper limit of the supposed ideal body fat range: 24%.

To do this, you'll need to open the data, which you can find here.

- In Minitab, go to **Graph > Probability Plot > Single**.
- In the main dialog, enter %Fat as the **Graph Variable**.
- Click the **Distribution** button and choose 3-parameter Weibull. Click **OK**.
- Click the **Scale** button, and uncheck the **Adjust x-scale for threshold . . .** checkbox. This produces a curved distribution fit line but allows the percentiles to be read straight off the graph.
- Still under **Scale**, click the **Percentile Lines** tab, and enter the percentiles 2.5, 50, and 97.5, plus the data value 24, to produce our desired percentile lines. Click **OK** in all dialogs.

We get the following graph:

We already knew from the previous post that these data follow this distribution, and the output reconfirms it. The data points follow the center line, and the p-value in the legend is greater than 0.500, which exceeds any common alpha value. Hence, the 3-parameter Weibull distribution is a good fit for these data.

In the graph, the data values are on the X-axis and the percentiles are on the Y-axis. For this population, the 50th percentile (the median) corresponds to a body fat percentage of 27.6%. 95% of the population should fall between the 2.5th and 97.5th percentiles, which correspond to 18.0% and 44.5% body fat. Because of the non-symmetric shape of the distribution, the median (27.6) is closer to the low value than the high value.
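These percentiles can also be reproduced from the parameter values alone by inverting the Weibull cumulative distribution function. The following is a sketch under that assumption. `weibull3_percentile` is an illustrative name rather than anything Minitab exposes.

```python
import math

# Parameter values of the fitted 3-parameter Weibull, copied from this post.
SHAPE, SCALE, THRESHOLD = 1.85718, 14.07043, 16.06038

def weibull3_percentile(p, shape=SHAPE, scale=SCALE, threshold=THRESHOLD):
    """Value below which p percent of the population falls (0 < p < 100)."""
    return threshold + scale * (-math.log(1 - p / 100)) ** (1 / shape)

percentiles = {p: weibull3_percentile(p) for p in (2.5, 50, 97.5)}
for p, value in percentiles.items():
    print(f"{p}th percentile: {value:.1f}% body fat")
```

The three values round to 18.0, 27.6, and 44.5, matching the percentiles read off the probability plot.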

24% body fat corresponds to the 29th percentile. 24% is the top end of the ideal range recommended by the Web site, but it is a fairly low percentile for this population. Said another way, 71% of the population exceeds the upper limit of the range. Yikes!
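Going the other direction, from a data value to its percentile, uses the cumulative distribution function directly. Again, this is only a sketch that assumes the standard 3-parameter Weibull formula and this post's parameter values. The helper name is hypothetical.

```python
import math

# Parameter values of the fitted 3-parameter Weibull, copied from this post.
SHAPE, SCALE, THRESHOLD = 1.85718, 14.07043, 16.06038

def weibull3_cdf(x, shape=SHAPE, scale=SCALE, threshold=THRESHOLD):
    """Proportion of the population at or below x."""
    if x <= threshold:
        return 0.0
    return 1.0 - math.exp(-(((x - threshold) / scale) ** shape))

below = weibull3_cdf(24.0)   # proportion at or below the upper limit
above = 1.0 - below          # proportion exceeding the upper limit
print(f"At or below 24% body fat: {below:.0%}")
print(f"Above 24% body fat:       {above:.0%}")
```

That works out to about 29% at or below 24% body fat and about 71% above it.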

## Closing Thoughts

For the issue relating to the ideal body fat range, it's fairly clear that something is going on here. I'm not a health expert, so I don't know the answer. However, it appears that either the range is incorrect or a large majority (71%) of 14-year-old girls exceed the recommended range. Only 20% actually fall within the range. Either way, with a few simple tools in Minitab, we have brought the implications of these data to life! Just as important, we can present these results to others in an easy-to-understand manner.

I hope after reading this you're more comfortable with nonnormal distributions and can see the advantages of identifying your data's distribution. I’ve shown how you can transcend your raw sample data and make useful inferences about the larger population that your data represent. You can safely embrace your nonnormal data!

Name: Gail Daleiden • Monday, October 15, 2012

Is this possible to do in Minitab 14? When I go to Probability Plot, I do not have "Shaded Area" as an option. Minitab 14 was sold as part of my course bundle so I assumed it was the version we were supposed to use.

Thanks for any assistance.

Name: Naveen • Tuesday, March 12, 2013

This post by Jim Frost is fantastic. It has been so helpful. I am also looking forward to seeing explanations of the other distributions, with some examples, in the near future, as you have stated in this blog. This post was more focused on the 3-parameter Weibull. It would be very useful if you could provide explanations for more complicated distributions. Looking forward to your explanation.

Name: Tony • Sunday, April 21, 2013

Great post. Very helpful. Thanks.

Name: Abhijeet • Sunday, October 13, 2013

Hello Jim,

I am doing a Six Sigma project in which I am dealing with some non-normal data, so I was searching for some help on the web and read your article. You have explained it in such a clear manner that this concept has bedded in my mind and I will never forget it. I have been in some Six Sigma training, but no instructor had unleashed the power of Minitab as much as you have explained here. By chance, my data fit the 3-parameter Weibull distribution as well, so it makes more sense to me to grasp the subject. I am a huge fan of your work. Can I please request that you add me to your newsletter or post updates, as I want to learn more? You made me enjoy Minitab. Thanks a lot, Jim.

Name: Jim Frost • Tuesday, October 15, 2013

Hi Abhijeet,

Thanks so much for reading and for your very kind words! I'm glad that you find our blog helpful!

If you follow this link, it'll take you to a page where you can sign up for our newsletter.

http://www.minitab.com/company/newsletter/subscribe.aspx

Cheers!

Jim

Name: prasshanth Bharadwaj • Sunday, January 12, 2014

Hi,

Thank you for such valuable information. I just wanted to know one thing: why, to calculate the 95% range, did you subtract 97.5 - 2.5? Can't it be calculated straight away? Also, what should I do if I want to calculate 99.73% of my data? Thank you and regards, Prasshanth Bharadwaj

Name: Jim Frost • Monday, January 13, 2014

Hi Prasshanth,

The reason I calculate the 97.5th and 2.5th percentiles is because for that example I'm calculating the middle 95% of the data. And, 95% of the data falls in between these two percentiles (97.5 - 2.5). In other words, I'm excluding 2.5% of the data on both the upper and lower tails of the distribution (for a total of 5%) and just including the middle 95%.

If you just want to calculate the 95th percentile, you would do that directly as you suggest. And, if you just want a 99.73 percentile for your data, rather than the middle 99.73% of your data, you'd calculate that directly as well.
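The arithmetic behind the "middle" ranges can be sketched in a couple of lines of Python. The function name `middle_interval_percentiles` is made up for this illustration. It simply splits the excluded percentage evenly between the two tails.

```python
def middle_interval_percentiles(coverage):
    """Lower and upper percentiles that bound the middle `coverage` percent."""
    tail = (100 - coverage) / 2   # percentage excluded in each tail
    return tail, 100 - tail

for coverage in (95, 99.73):
    lower, upper = middle_interval_percentiles(coverage)
    print(f"Middle {coverage}%: percentiles {lower:.3f} and {upper:.3f}")
```

For the middle 95% this returns the 2.5th and 97.5th percentiles, and for the middle 99.73% it returns approximately the 0.135th and 99.865th percentiles.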

Thanks for reading!

Jim

Name: prasshanth • Friday, January 24, 2014

Thank you so much, got it.

Name: Hadas • Tuesday, January 28, 2014

Hi Jim,

Thank you for the post.

I identified the distribution of my data as you explained in the previous post, and I got a 2-parameter exponential distribution. I want to perform something like a t-test between two populations, but I cannot use a t-test since the data are not normally distributed. How can I do it using Minitab?

Thanks

Name: Jim Frost • Tuesday, January 28, 2014

Hi Hadas,

The 2-sample t test is robust to departures from normality, which means that the test can be accurate even with nonnormal data. Our standard advice is that if you have more than 15 samples per group, normality isn't an issue. However, because the exponential distribution can be very nonnormal, you may want more than that to be sure.

Be aware that you are testing whether the mean is different, but the exponential distribution is generated by two other parameters, the threshold and scale. Be sure to examine the distribution for each group (rather than the combined distribution for both) to understand how both parameters work together to produce your test results.

Even though it may be totally valid from a statistical standpoint for you to use the 2-sample t test, you may want to choose a nonparametric test instead. It depends on what you want to learn.

For example, you could use the nonparametric Mann-Whitney test to compare the medians for 2 groups. In the Minitab menu: Stat > Nonparametrics > Mann-Whitney.

In your case, if your 2 sample distributions have long tails, which wouldn't be uncommon for the exponential distribution, they could unduly influence the mean even if the 2 distributions aren't that different. However, the median wouldn't be influenced as much by changes far out in the tail.
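A tiny illustration of that point, using made-up numbers rather than real data: the two hypothetical groups below are identical except for one value far out in the tail.

```python
import statistics

# Hypothetical groups, invented for illustration only.
group_a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
group_b = [1, 2, 3, 4, 5, 6, 7, 8, 90]   # same as group_a, but with one extreme value

mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
median_a, median_b = statistics.median(group_a), statistics.median(group_b)

print(f"Means:   {mean_a} vs {mean_b}")
print(f"Medians: {median_a} vs {median_b}")
```

The single extreme value moves the mean from 5 to 14, while the median stays at 5 in both groups.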

For a good blog post that illustrates this, read Redouane's excellent post:

http://blog.minitab.com/blog/statistics-for-lean-six-sigma/the-non-parametric-economy-what-does-average-actually-mean

Thanks for reading!

Jim