What Does the Shape of Your Data Indicate?

Written by Cody Steele | Feb 27, 2024 4:38:18 PM

When it comes to data, one of the most important things to understand is what values are common and what values are rare. One of the most common summaries of data is the arithmetic mean, which we often call the average. You might be used to hearing about average rainfall, average delivery times, or the average price of fuel. There are times, however, when the mean doesn’t do a good job of expressing what’s common. This is the perfect opportunity to plot out the data in a histogram.

Consider the lap times from driver 44 at the 2021 French Grand Prix, sorted from fastest to slowest (from the FastF1 Python library).

We can see that most of the laps are between 90 and 92 seconds, which is what’s most common. The mean of the data set is approximately 109 seconds, which is not near any value in the data set, fast or slow. For data like this, the mean is a terrible way to know what’s common. Because the mean describes what’s common in some data sets but not in others, it helps to have tools that quickly show which case you’re in.
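To see how a few slow laps can drag the mean away from what’s common, here is a minimal sketch in Python. The lap times below are hypothetical values chosen for illustration, not the actual FastF1 data, so the numbers will not match the 109-second mean quoted above.

```python
from statistics import mean, median

# Hypothetical lap times in seconds (NOT the real FastF1 data):
# most laps are near 91 s, plus two much slower laps.
lap_times = [90.4, 90.7, 90.9, 91.0, 91.2, 91.3, 91.5, 91.8, 140.0, 250.0]

lap_mean = mean(lap_times)      # pulled upward by the two slow laps
lap_median = median(lap_times)  # middle value, stays with the bulk of the laps

# Distance from the mean to the closest lap actually driven
closest_gap = min(abs(t - lap_mean) for t in lap_times)

print(f"mean:   {lap_mean:.2f} s")
print(f"median: {lap_median:.2f} s")
print(f"closest lap to the mean is {closest_gap:.2f} s away")
```

In this made-up sample, the median stays with the common laps while the mean lands in a range where no lap was driven.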

What’s common

A quick way to see what’s common is to plot the data with a histogram. A histogram divides sample values into many intervals and represents the frequency of data values in each interval with a bar. Here’s a histogram of the lap times:

When a histogram has a gap between the tallest bar that shows what’s common and the most extreme data, then the mean is usually a poor representation of what’s common.
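To make the binning step concrete, here is a small sketch that counts values per interval using only the Python standard library. The lap times are hypothetical, and the 2-second interval width is an arbitrary choice for illustration.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count how many values fall in each interval [left, left + bin_width)."""
    counts = Counter(int(v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical lap times in seconds (not the real FastF1 data)
lap_times = [90.4, 90.7, 90.9, 91.0, 91.2, 91.3, 91.5, 91.8, 140.0, 250.0]

counts = histogram_counts(lap_times, bin_width=2)

# Print a rough text histogram: one '#' per value in the interval
for left, n in counts.items():
    print(f"{left:>4} to {left + 2:<4} {'#' * n}")
```

Only intervals that contain data appear in the result. The jump from the interval starting at 90 to the ones starting at 140 and 250 is the kind of gap that signals the mean is a poor summary.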

Bell-shaped data

When most values are close to the mean and values further away from the mean in either direction are increasingly rare, the histogram shows the shape of a bell. The mean is a good description of what’s common when the histogram shows a bell shape.
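One way to put numbers on "increasingly rare" is to use the normal distribution as the model for a bell shape. That choice of model is our assumption for this sketch. The code computes the share of values expected within one, two, and three standard deviations of the mean.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1
standard = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Share of values within k standard deviations of the mean."""
    return standard.cdf(k) - standard.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.2%}")
```

Under this model, roughly 68% of values fall within one standard deviation of the mean, about 95% within two, and about 99.7% within three.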

The following histogram shows a sample of birth weights from healthy babies in the United States from the first part of 2022 (from the National Bureau of Economic Research). Most babies are close to the common value of 3,300 grams. Weights further from the mean are increasingly rare in each direction.

Right-skewed data

Another common shape arises when most of the data sit close together but some values can be much larger. We’ll call this shape “right-skewed.” Variables that have a lower boundary but no upper boundary, like income and strength, often follow a right-skewed distribution. For right-skewed data, the mean is often far from the tallest bars of the histogram, making the mean a poor indicator of what’s common. We’ll usually use an alternative statistic like the median to show what is common for right-skewed data.

The following histogram shows a sample of incomes of new mortgage holders in the United States (from FHFA.gov). The median is more representative of what’s common in the data set than the mean is.
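The gap between the mean and the median in right-skewed data can be sketched with simulated values. The numbers below are drawn from a lognormal distribution, which is our assumption for a right-skewed shape. They are not the FHFA incomes, and the parameters are arbitrary.

```python
import random
from statistics import mean, median

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical right-skewed sample: lognormal values have a lower boundary
# of 0 and no upper boundary. Simulated numbers, not the FHFA data.
sample = [random.lognormvariate(11.0, 0.8) for _ in range(10_000)]

sample_mean = mean(sample)
sample_median = median(sample)
share_below_mean = sum(x < sample_mean for x in sample) / len(sample)

print(f"median: {sample_median:,.0f}")
print(f"mean:   {sample_mean:,.0f}")
print(f"share of values below the mean: {share_below_mean:.0%}")
```

In a simulation like this, the mean sits above the median and well over half of the values fall below the mean, which is why the median is usually the better description of a typical value.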

What’s rare

When we think about what data are common, we think of the tall bars in the histogram. What’s rare lies far from those bars. A common need in practice is to estimate the number of products that will be within customer specifications. Products outside of specifications should be rare, so estimating how many there are requires understanding the values that are far away from what’s common.

The shape of the data is crucial when we want to use a relatively small sample of data to describe what’s rare. If we want to take only a few dozen measurements, we won’t necessarily see data that occurs less than 1% of the time in the sample, but a customer who’s buying thousands of products from us will. In that case, we’ll use the shape of the data as a model so that we can infer what the rare data are like.
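A quick calculation shows why a small sample can miss rare values entirely. Assuming independent measurements, the chance that a value occurring at a given rate never shows up in a sample of size n is (1 - rate) raised to the power n. The sample size of 30 and the order size of 5,000 below are hypothetical stand-ins for "a few dozen" and "thousands."

```python
def chance_of_missing(rate, sample_size):
    """Probability that an event with the given rate never appears in the
    sample, assuming independent measurements."""
    return (1.0 - rate) ** sample_size

rate = 0.01          # a value that occurs 1% of the time
small_sample = 30    # hypothetical "few dozen" measurements
large_order = 5_000  # hypothetical customer order size

p_missing = chance_of_missing(rate, small_sample)
expected_in_order = rate * large_order

print(f"chance the sample shows none: {p_missing:.1%}")
print(f"expected count in the order:  {expected_in_order:.0f}")
```

With these assumed numbers, the sample has roughly a 74% chance of containing no such value, while the customer should expect to see about 50 of them.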

Suppose that we take measurements on the length of a small valve before we pronounce a batch ready to ship. To meet our tolerances as best we can, we produce the valves a bit large, then trim them as precisely as possible. Valves that are too short are discarded before trimming, so we never have any short valves to measure when inspecting a batch for shipment. That process produces right-skewed data.

If we use a bell shape to model those data, we would estimate many more short valves than can exist in real life. If we overlay a bell-shaped curve on a histogram of these right-skewed data, the curve extends over an empty area to the left of the bars, which shows that the curve doesn’t fit the data.
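The mismatch can be sketched with simulated data. The valve lengths below are hypothetical: a made-up lower limit of 20 mm plus a lognormal excess, both our assumptions rather than anything measured. We fit a normal (bell-shaped) curve to the sample and ask what share of valves it places below the limit.

```python
import random
from statistics import NormalDist

random.seed(1)  # fixed seed so the run is repeatable

LOWER_LIMIT = 20.0  # hypothetical minimum valve length in mm

# Hypothetical right-skewed lengths: no valve can be below LOWER_LIMIT
lengths = [LOWER_LIMIT + random.lognormvariate(-1.0, 0.6) for _ in range(50)]

# Bell-shaped model fitted from the sample mean and standard deviation
bell = NormalDist.from_samples(lengths)

modeled_short = bell.cdf(LOWER_LIMIT)  # share the bell curve puts below the limit
observed_short = sum(x < LOWER_LIMIT for x in lengths) / len(lengths)

print(f"bell curve estimate of short valves: {modeled_short:.1%}")
print(f"short valves actually in the sample: {observed_short:.1%}")
```

The bell curve assigns a nonzero share to lengths below the limit even though the process, as simulated here, cannot produce any.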

If we add a right-skewed curve instead, the curve can give us a good idea of what’s rare in our data even from a relatively small sample.
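As one possible illustration, the sketch below fits a shifted lognormal curve to simulated valve lengths and uses it to estimate how often a valve would exceed an upper limit. The lognormal family, the 20 mm lower limit, and the 21.5 mm upper limit are all assumptions for this example. The article does not say which right-skewed curve was used.

```python
import math
import random
from statistics import NormalDist

random.seed(1)  # fixed seed so the run is repeatable

LOWER_LIMIT = 20.0  # hypothetical minimum valve length in mm
UPPER_LIMIT = 21.5  # hypothetical upper specification limit in mm

# Hypothetical right-skewed sample of 50 valve lengths
lengths = [LOWER_LIMIT + random.lognormvariate(-1.0, 0.6) for _ in range(50)]

# Shifted lognormal fit: the log of (length - lower limit) is modeled as normal
log_excess = [math.log(x - LOWER_LIMIT) for x in lengths]
log_model = NormalDist.from_samples(log_excess)

# Estimated share of valves longer than the upper limit
skewed_tail = 1.0 - log_model.cdf(math.log(UPPER_LIMIT - LOWER_LIMIT))

print(f"estimated share above {UPPER_LIMIT} mm: {skewed_tail:.2%}")
```

By construction, this model places no probability below the lower limit, so all of the estimated rare values are on the long side, where they can actually occur.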

Use the shape of your data

The use of a relatively small sample to estimate what’s going to happen in a much larger population is a common application of quality statistics. Using a histogram to understand the shape of your data lets you quickly determine whether or not the mean is a good representation of what’s common in the data.

*The image of the Formula 1 car is from Wikimedia Commons and is licensed under this Creative Commons license.

*The image of the fuel injectors is from Flickr and is licensed under this Creative Commons license.