Using the Mean in Data Analysis: It’s Not Always a Slam Dunk
We always hear about the "average" of this and the "average" of that…the average temperature, the average price of gasoline, the average number of children per household, etc. In fact, I just saw an article on average student math scores by country.
If you're a college grad, take a minute to recall when you were choosing your major. For those of us who aspired to make big bucks, becoming a doctor, lawyer, or CEO may have been among the more lucrative career paths that came to mind.
Well, what if I told you that back in the mid-1980s at the University of North Carolina, the average starting salary of geography students was well over $100,000? Knowing that, would you have considered a change of major?
But what if I also told you that basketball great Michael Jordan, formerly the world’s highest-paid athlete, graduated from UNC with a degree in geography? Now do you believe me?
Maybe the mean isn't always a slam dunk.
The Mean Can Mislead
In the case of Michael Jordan and fellow UNC geography graduates, the average is not a good representation of the true center of the data. Jordan's earnings from his athletic career raise the "average" salary for geography graduates in a way that doesn't accurately convey what graduates are likely to earn. By almost any measure, Jordan's earnings would be an outlier.
How could we have identified this anomaly, and potentially avoided wishing we had chosen a different career path? (Geography, that is, not NBA superstar.)
Rule #1: ALWAYS graph your data.
Graphs that are useful for evaluating the distribution of your data include:
- Individual Value Plot
Any one of these graphs would have quickly alerted us to the Michael Jordan outlier. If we were to graph geography graduates’ salaries using a histogram, Jordan's salary would sit in a bar of its own, far to the right of everyone else's.
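For a rough, text-only version of such a histogram, here is a minimal Python sketch. The salary figures are invented for illustration (they are not actual UNC data), and the helper name text_histogram and the $5,000 bin width are arbitrary choices of mine.

```python
from collections import Counter

# Hypothetical starting salaries in dollars, invented for illustration only.
# Nine "typical" graduates plus one extreme value standing in for a superstar.
salaries = [22_000, 23_000, 24_000, 25_000, 25_000,
            26_000, 27_000, 28_000, 30_000, 1_000_000]

def text_histogram(values, bin_width):
    """Return sorted (bin_start, count) pairs for the non-empty bins."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return sorted(counts.items())

bin_width = 5_000
bins = text_histogram(salaries, bin_width)

# One row per non-empty bin; the jump in the bin labels exposes the outlier.
for start, count in bins:
    print(f"{start:>9,} to {start + bin_width - 1:>9,} | {'#' * count}")
```

Empty bins are skipped, so the gap shows up as a jump in the labels from the 30,000 bin straight to the 1,000,000 bin rather than as a long run of blank rows.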
Rule #2: Compute some basic statistics, including both the mean and median.
If you order all observations from smallest to largest and pick the middle value, you’ll have your median. If you have an even number of observations, then the median is the average of the two middle values.
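The sort-and-pick-the-middle recipe can be written out in a few lines of Python. This is only a sketch; the function name median_by_hand and the sample values are mine, and the result is compared against the standard library's statistics.median as a sanity check.

```python
import statistics

def median_by_hand(values):
    """Sort the observations and return the middle one (or the mean of the two middle ones)."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_sample = [7, 1, 5]        # sorted: 1, 5, 7
even_sample = [7, 1, 5, 3]    # sorted: 1, 3, 5, 7

print(median_by_hand(odd_sample), statistics.median(odd_sample))
print(median_by_hand(even_sample), statistics.median(even_sample))
```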
So how will the median tell you whether or not you can rely on the mean?
- If the mean is noticeably greater than the median, your data are probably skewed to the right, like we see in the case of Mr. Jordan.
- If the mean is noticeably less than the median, then the data are probably skewed left.
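To see how the comparison plays out, here is a small Python sketch using invented salary figures (nine ordinary salaries and one extreme one; none of this is real UNC data). The helper skew_hint is a hypothetical name, and it encodes only the rule of thumb above, not a formal skewness test.

```python
import statistics

# Hypothetical starting salaries in dollars, invented for illustration only.
salaries = [22_000, 23_000, 24_000, 25_000, 25_000,
            26_000, 27_000, 28_000, 30_000, 1_000_000]

def skew_hint(mean, median):
    """Rule-of-thumb skew direction from comparing the mean with the median."""
    if mean > median:
        return "right"
    if mean < median:
        return "left"
    return "no clear skew"

mean_all = statistics.mean(salaries)
median_all = statistics.median(salaries)

# The same data with the single largest value removed.
without_outlier = sorted(salaries)[:-1]
mean_trimmed = statistics.mean(without_outlier)
median_trimmed = statistics.median(without_outlier)

print(f"All ten graduates: mean = {mean_all:,.0f}, median = {median_all:,.0f}")
print(f"Without the outlier: mean = {mean_trimmed:,.0f}, median = {median_trimmed:,.0f}")
print(f"Skew hint for all ten: {skew_hint(mean_all, median_all)}")
```

With these made-up numbers, dropping one observation moves the mean from 123,000 to roughly 25,556, while the median only shifts from 25,500 to 25,000. That stability is why the median is worth reporting alongside the mean.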
Rule #3: Run a normality test.
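If process knowledge tells you that your data should follow a normal distribution, then run a normality test to check. If your Anderson-Darling Normality Test p-value is larger than, say, an alpha level of 0.05, then you have no significant evidence against normality, and it is reasonable to treat the data as normal and the mean as an adequate measure of central tendency. A large p-value does not prove the data are normal, especially with a small sample, but a p-value below alpha is a clear signal to look beyond the mean.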
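For readers who want to see what sits behind such a p-value, here is a minimal standard-library Python sketch of an Anderson-Darling normality check. Several things here are assumptions on my part: the function name, the invented salary data, the cutoff of 8 observations, and the p-value formulas, which are a widely quoted approximation (attributed to D'Agostino and Stephens) and may not match exactly what Minitab computes.

```python
import math
from statistics import NormalDist, mean, stdev

def anderson_darling_normal(values):
    """Return (adjusted A-squared, approximate p-value) for a normality check."""
    x = sorted(values)
    n = len(x)
    if n < 8:
        # Assumed cutoff: the p-value approximation is meant for larger samples.
        raise ValueError("need at least 8 observations")
    mu, sigma = mean(x), stdev(x)  # mean and standard deviation estimated from the sample
    if sigma == 0:
        raise ValueError("all observations are identical")
    std_normal = NormalDist()
    eps = 1e-12  # keep CDF values away from 0 and 1 so log() stays finite
    cdf = [min(max(std_normal.cdf((v - mu) / sigma), eps), 1 - eps) for v in x]
    # A^2 = -n - (1/n) * sum_{i=1..n} (2i - 1) * [ln F(x_i) + ln(1 - F(x_{n+1-i}))]
    total = sum((2 * i + 1) * (math.log(cdf[i]) + math.log(1 - cdf[n - 1 - i]))
                for i in range(n))
    a2 = -n - total / n
    a = a2 * (1 + 0.75 / n + 2.25 / n ** 2)  # small-sample adjustment
    # Piecewise approximation of the p-value from the adjusted statistic.
    if a >= 0.6:
        p = math.exp(1.2937 - 5.709 * a + 0.0186 * a ** 2)
    elif a > 0.34:
        p = math.exp(0.9177 - 4.279 * a - 1.38 * a ** 2)
    elif a > 0.2:
        p = 1 - math.exp(-8.318 + 42.796 * a - 59.938 * a ** 2)
    else:
        p = 1 - math.exp(-13.436 + 101.14 * a - 223.73 * a ** 2)
    return a, p

# Hypothetical salaries in dollars, invented for illustration only.
salaries = [22_000, 23_000, 24_000, 25_000, 25_000,
            26_000, 27_000, 28_000, 30_000, 1_000_000]

alpha = 0.05
for label, data in [("with outlier", salaries), ("without outlier", sorted(salaries)[:-1])]:
    a_adj, p_value = anderson_darling_normal(data)
    verdict = "reject normality" if p_value < alpha else "no evidence against normality"
    print(f"{label}: adjusted A-squared = {a_adj:.3f}, p = {p_value:.4f} ({verdict})")
```

With the single extreme salary included, the p-value falls far below 0.05. With it removed, the test no longer rejects normality, which is a weaker statement than confirming it.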
Am I trying to imply that your data need to follow a normal distribution for every analysis? Absolutely not. Some analyses are robust to departures from normality, while others are not as forgiving. In other words, non-normal data are fine for some statistical methods, but you need to make sure the data satisfy the assumptions of your particular analysis. And for certain processes, the data will never be normally distributed. My goal here is simply to remind us why we need to be wary of statistics that are reported in the form of the mean alone.
A Quick Graphical Summary of Your Data
Suppose you have some data and ask yourself, “Self, should I use the mean to convey my results?” To quickly graph your data, calculate some descriptive statistics, and run a normality test all at once, you can use Minitab Statistical Software’s Graphical Summary (located under Stat > Basic Statistics or Assistant > Graphical Analysis) to get your answer.
And the next time someone tries to use the average to prove a point and convince you of this or that, remember that, unlike Michael Jordan soaring towards the hoop with basketball firmly clutched in hand, using the mean is not always a slam dunk.