Get a Head Start: Understand Your Data Before You Analyze It


We humans do have a tendency to succumb to gold rush fever.

And this can happen even in the left-brained, rational field of statistics.

After we collect our data, it’s difficult to resist the urge to desperately dash for p-values, as if they were 70% off at Macy’s the day after Thanksgiving.

But no matter how well-versed you are in statistics, it’s good practice to get into the habit of intuitively “knowing” your data well before you dive into a sea of complex calculations. It helps ensure you’ll fully understand your analysis results, avoid careless errors, and draw sound conclusions.

Minitab’s graphical summary helps you quickly understand the main characteristics of your data.

A great tool for getting a quick, bird's-eye view of your measurement data is Minitab’s Graphical Summary (Stat > Basic Statistics > Graphical Summary). Just as lean sigma tools can quickly show you the bottom-line performance of your process, Graphical Summary quickly gives you the bottom line on your data.

Here’s Minitab's Graphical Summary for data that tracks the number of days with precipitation each month:


As you look at a graphical summary, ask yourself these questions:

How are my data distributed? Are your data fairly normal (bell-shaped) or skewed in one direction? A skewed distribution can affect some analyses, such as t-tests, especially if your sample is small.

How much do my data vary? Look at how your data are spread across each graph. Are they spread evenly over high and low values? How about the confidence intervals? Are they wide? Is the standard deviation high? More variability means less certainty and less precision in your estimates.

Do I have outliers? Check out any extreme values in your data set. They can have a big impact on your results. If they’re mistakes, correct or remove them from your data before you perform an analysis.
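If you'd like to ask the same three questions outside of Minitab, here is a minimal Python sketch. It is not Minitab's implementation, and the `days` values are made up for illustration (they are not the data set behind the summary above). The `summarize` helper is a hypothetical name; it reports center and spread, a sample skewness value, and any points beyond the usual 1.5 × IQR boxplot fences.

```python
import statistics

# Hypothetical data: days with precipitation per month.
# These values are invented for illustration; they are NOT the article's data.
days = [8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 14, 24, 27]


def summarize(values):
    """Return basic shape, spread, and outlier information for a sample."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)

    # Quartiles using the statistics module's default "exclusive" method;
    # other software may use slightly different quartile definitions.
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1

    # Adjusted sample skewness: positive means a longer right tail.
    skewness = (n / ((n - 1) * (n - 2))) * sum(((x - mean) / sd) ** 3 for x in values)

    # Common boxplot convention: flag points more than 1.5 * IQR beyond the quartiles.
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in values if x < low_fence or x > high_fence]

    return {
        "n": n,
        "mean": mean,
        "stdev": sd,
        "q1": q1,
        "median": median,
        "q3": q3,
        "skewness": skewness,
        "outliers": outliers,
    }


summary = summarize(days)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With this made-up sample, the two largest values fall outside the upper fence and the skewness is positive, which is the kind of pattern that should make you pause before reaching for a t-test.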

For example, suppose you want to determine the average number of days per month with precipitation using the sample data shown above. The data set is small (N = 13) and nonnormal (the normality test's p-value is < 0.005), and it contains outliers (two asterisks in the boxplot). So you may want to use a nonparametric test to evaluate the median rather than a t-test to evaluate the mean, because a t-test assumes that the data come from a normal distribution.

In fact, the confidence interval for the median (bottom left corner) is narrower than the confidence interval for the mean, so the estimate of the median is more precise. Another option is to transform the data before you analyze it with a t-test.
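To see how the two intervals can differ, here is a rough Python sketch that compares a t-based confidence interval for the mean with a distribution-free interval for the median built from order statistics. Several things here are assumptions: the `days` values are invented (not the article's data), `mean_ci` and `median_ci` are hypothetical helper names, and the t critical value for 12 degrees of freedom is copied from a standard t table. The median interval is conservative (its actual coverage is at least 95%), so it will not necessarily match the figures Minitab reports.

```python
import math
import statistics

# Hypothetical data, invented for illustration; NOT the article's data set.
days = [8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 14, 24, 27]

# Two-sided 95% critical value of Student's t with 12 degrees of freedom,
# taken from a standard t table (valid only for samples of size 13).
T_CRIT_DF12 = 2.179


def mean_ci(values, t_crit):
    """t-based confidence interval for the mean, given a t critical value."""
    n = len(values)
    mean = statistics.mean(values)
    half_width = t_crit * statistics.stdev(values) / math.sqrt(n)
    return mean - half_width, mean + half_width


def median_ci(values, conf=0.95):
    """Conservative, distribution-free CI for the median from order statistics.

    Returns (lower, upper, achieved_confidence). The interval runs from the
    d-th smallest to the d-th largest value, where d is the largest rank whose
    Binomial(n, 0.5) tail probability does not exceed alpha / 2.
    """
    xs = sorted(values)
    n = len(xs)
    alpha = 1 - conf
    tail = 0.0  # running P(B <= d - 1) for B ~ Binomial(n, 0.5)
    d = 0
    for k in range(n // 2):
        p = tail + math.comb(n, k) / 2 ** n
        if p > alpha / 2:
            break
        tail, d = p, k + 1
    if d == 0:
        raise ValueError("sample too small for the requested confidence level")
    return xs[d - 1], xs[n - d], 1 - 2 * tail


mean_low, mean_high = mean_ci(days, T_CRIT_DF12)
med_low, med_high, achieved = median_ci(days)

print(f"Mean CI:   ({mean_low:.2f}, {mean_high:.2f})")
print(f"Median CI: ({med_low}, {med_high}), achieved confidence {achieved:.3f}")
```

For this made-up sample, the two large values inflate the standard deviation and stretch the interval for the mean, while the interval for the median depends only on ranks and stays tighter.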

See how a quick exploratory data analysis can save you time and rework? A “moment’s reflection” is good practice, not only before you analyze your data but before you measure your data as well.

Note: For more information on how to interpret the Graphical Summary results in this example, choose Help > Stat Guide in Minitab. Click the Index tab, then click Graphical Summary and click the arrow to see the interpretation of all the results.
