What The Great Gatsby Teaches Us about Messy Data
They were careless people, Tom and Daisy—they smashed up things and creatures and then retreated back into their money or their vast carelessness, or whatever it was that kept them together, and let other people clean up the mess they had made.
— F. Scott Fitzgerald
As Nick learns in The Great Gatsby, you can't be careless about your friends or you'll create a big mess. You don't want to be riding in a car with your lover when she hits and kills someone else's wife, and then get shot to death because the dead wife's husband thinks you were having an affair with her. Especially if the dead wife was really having an affair with your lover's husband.
Of course, great literature contains lessons that apply to many different areas of life. So while I apologize to Fitzgerald for abusing his work this way, let's look at an analytical lesson.
When you’re doing process analysis, you can't be careless about your data or you'll create a big mess. To have confidence in your statistical analyses, you have to be confident in your data. That means that you need to check the data, and clean it if it’s messy.
Graphs are a basic way to check your data. If you find anything messy in a graph you make with Minitab Statistical Software, investigate those data points before you move on. Follow these steps to try it for yourself:
- If you haven’t already, get the number of characters per line in I Wandered Lonely as a Cloud by Wordsworth.
- Choose Graph > Histogram.
- In the gallery of possible histograms, choose Simple. It's the one that looks like the thumbnail there on the right:
- In Graph Variables, enter the column that contains the number of characters for each line of the poem. Click OK.
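If you don't have Minitab handy, here's a rough Python sketch of the same check: count the characters on each line, then tally the counts into bins. This is a stand-in, not what Minitab does internally. The names `line_lengths`, `text_histogram`, and `BIN_WIDTH` are my own, and `sample_text` is just a few hand-typed lines for illustration, not the full poem as downloaded.

```python
from collections import Counter

BIN_WIDTH = 5  # assumed bin width; pick whatever suits your data

# Illustrative sample only: a few lines typed by hand, with one blank line.
sample_text = """I wandered lonely as a cloud
That floats on high o'er vales and hills,

When all at once I saw a crowd,"""


def line_lengths(text):
    """Return the number of characters on each line, blank lines included."""
    return [len(line) for line in text.split("\n")]


def text_histogram(lengths, bin_width=BIN_WIDTH):
    """Group lengths into bins and return {bin_start: count}, sorted by bin."""
    bins = Counter((n // bin_width) * bin_width for n in lengths)
    return dict(sorted(bins.items()))


lengths = line_lengths(sample_text)
for start, count in text_histogram(lengths).items():
    # One '#' per line of the poem that falls in this bin.
    print(f"{start:>3}-{start + BIN_WIDTH - 1:<3} {'#' * count}")
```

A bar sitting by itself at either end of the output is the text-mode equivalent of a stray bar in the histogram.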
If you’ve left the data the way I did, you’ll see a histogram a bit like this:
The messy data are:
- 3 lines with 0 characters
- 2 lines with over 55 characters
The zeroes are line breaks between stanzas, but what about the long lines? Did Wordsworth have more to say in those lines?
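Once the histogram points at odd values, it helps to know which rows they come from. Here's a minimal sketch, assuming the two rules of thumb from this example: a length of 0, or a length over 55. The function name `flag_suspicious` and the numbers in `example_lengths` are made up for illustration.

```python
def flag_suspicious(lengths, max_expected=55):
    """Return (line_number, length) pairs for empty or unusually long lines."""
    return [
        (line_number, n)
        for line_number, n in enumerate(lengths, start=1)
        if n == 0 or n > max_expected
    ]


# Hypothetical lengths, not the real counts from the poem.
example_lengths = [28, 41, 0, 31, 60]

for line_number, n in flag_suspicious(example_lengths):
    print(f"line {line_number}: {n} characters")
```

The threshold of 55 fits this poem only; for your own process data you'd choose a limit from what you know about the process.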
When we check the data (in this case, those lines of the poem), it turns out that the longest lines aren’t where Wordsworth waxes extra-poetical. They’re lines where Bartleby.com inserted extra spaces to align the line numbers. Once you’ve identified what’s going on in messy data, you can decide what to do so that the data show what you really care about. I'll show you some easy ways to clean these data in my next entry.
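Before any cleaning, one way to confirm the extra-spaces explanation is to compare each line's raw length with its length after runs of whitespace are collapsed. This is only a sketch of a diagnostic, not a cleaning method. `padding_report` is a hypothetical helper, and the padded line below is made up to mimic the layout rather than copied from Bartleby.com.

```python
import re


def padding_report(line):
    """Return (raw_length, collapsed_length) for one line of text.

    collapsed_length is measured after trimming the ends and replacing
    every run of whitespace with a single space.
    """
    collapsed = re.sub(r"\s+", " ", line.strip())
    return len(line), len(collapsed)


# Made-up example: a line of verse, padding, then a line number.
padded_line = "Beside the lake, beneath the trees," + " " * 25 + "5"

raw, collapsed = padding_report(padded_line)
print(f"raw: {raw}, collapsed: {collapsed}, padding: {raw - collapsed}")
```

A large gap between the two numbers suggests the length comes from layout, not from the words.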
In life, data is messy. Sometimes, it's even messier than the love affairs in The Great Gatsby. If you’re careless, you'll create a huge mess. Graphs like the histogram are a basic way to check your data. What kinds of graphs are your favorites?
(Have a few more minutes? You can check out some tips for editing your graphs in Minitab, like I did with the red bars in the histogram.)