Surround yourself with quality data

Cody Steele | 2/28/2018

"They were careless people, Tom and Daisy—they smashed up things and creatures and then retreated back into their money or their vast carelessness, or whatever it was that kept them together, and let other people clean up the mess they had made."
— F. Scott Fitzgerald

Stutz wrecked - Indianapolis (LOC)

As Nick learns in The Great Gatsby, it's important to verify the quality of the people who surround you. The last thing you want is to be riding in a car with your lover who hits someone else's wife. It gets even worse if you're later shot to death because the dead wife's husband thinks you were having an affair with her. Especially if the dead wife was really having an affair with your lover's husband.

When you’re doing process analysis, it’s the data that you have to verify. The last thing that you want is to be careless and get misleading results. To have confidence in your statistical analyses, you’ll have to be confident in the quality of your data. That means that you need to check the data, and clean it if it’s messy.

Graphs are a good way to check your data. If you find anything unusual, investigate those data points before you move on. In some cases, the analysis of the unusual data can be the most informative step in process analysis.

Let’s graph some data:

  1. If you haven’t already, get the number of characters per line in I Wandered Lonely as a Cloud by Wordsworth.
  2. Choose Graph > Histogram.
  3. In the gallery of possible histograms, choose Simple.
  4. In Graph Variables, enter the column that contains the number of characters for each line of the poem. Click OK.
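
If you'd rather check the same idea outside Minitab, here is a minimal Python sketch of the steps above. It counts the characters on each line and tallies them into bins. The function names and the bin width of 5 are my own assumptions for illustration, and Minitab chooses its own bins, so the shape won't match the graph exactly.

```python
from collections import Counter


def chars_per_line(text):
    """Return the number of characters on each line, blank lines included."""
    return [len(line) for line in text.splitlines()]


def bin_counts(lengths, width=5):
    """Tally lengths into bins of the given width, keyed by each bin's lower edge."""
    tally = Counter((n // width) * width for n in lengths)
    return dict(sorted(tally.items()))


def text_histogram(lengths, width=5):
    """Draw one row of '#' marks per bin, a rough stand-in for a real histogram."""
    rows = []
    for start, count in bin_counts(lengths, width).items():
        rows.append(f"{start:>3}-{start + width - 1:<3} {'#' * count}")
    return "\n".join(rows)


# Two lines of the poem with a blank line between them, just to try it out.
sample = "I wandered lonely as a cloud\n\nThat floats on high o'er vales and hills,"
print(text_histogram(chars_per_line(sample)))
```

Blank lines are counted on purpose here, so that they show up in the picture the same way they do in the worksheet.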

If you’ve left the data the way I did, you’ll see a histogram a bit like this:
Histogram of Characters per line in "I Wandered Lonely as a Cloud"
 
This graph reveals two features that are unusual:

  • 3 lines with 0 characters
  • 2 lines with over 55 characters

A glance at the worksheet shows that the zeroes are line breaks between stanzas, but what about the long lines? Did Wordsworth have more to say in those lines? 

It turns out that the longest lines aren’t where Wordsworth waxes extra-poetical. They’re lines where Bartleby.com has extra spaces to align the line numbers. Once you’ve identified what’s going on in messy data, you can make intelligent decisions about what to do so that the data reflect what you really care about.
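
One possible way to act on those findings, sketched in Python: flag the suspicious lines first, then clean them and count again. The 55-character cutoff comes from the histogram above. The pattern for the aligned line numbers (two or more spaces followed by digits at the end of a line) is an assumption about how the copied text looks, so check it against your own copy before trusting it.

```python
import re


def flag_unusual(lengths, max_len=55):
    """Return the positions of lines that are empty or longer than max_len."""
    return [i for i, n in enumerate(lengths) if n == 0 or n > max_len]


def clean_line(line):
    """Drop an aligned trailing line number and any extra spaces."""
    # Assumption: a line number appears as 2+ spaces and then digits at the end.
    line = re.sub(r"\s{2,}\d+\s*$", "", line)
    # Collapse remaining runs of whitespace and trim both ends.
    return " ".join(line.split())


def clean_lengths(text):
    """Character counts per line after cleaning, with stanza breaks left out."""
    cleaned = [clean_line(line) for line in text.splitlines()]
    return [len(line) for line in cleaned if line]
```

Whether to drop the stanza breaks depends on what you care about. If the question is how long Wordsworth's lines are, the zeroes are noise. If the question is how the page is laid out, they're data.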

In life, data is messy. Sometimes, messier than the love affairs in The Great Gatsby. If you’re careless, you'll create a huge mess that someone has to clean up. Graphs like the histogram are a great way to check the quality of your data. What kinds of graphs are your favorites?
 
Have a few more minutes? You can check out some tips for editing your graphs in Minitab, like I did with the red bars in the histogram.