Fond Memories of Clean Data

Cody Steele 06 September, 2011



"I gazed--and gazed--but little thought
What wealth the show to me had brought:
For oft, when on my couch I lie
In vacant or in pensive mood,
They flash upon that inward eye
Which is the bliss of solitude;
And then my heart with pleasure fills,
And dances with the daffodils."
— William Wordsworth

The ending of Wordsworth’s poem is about pleasant memories, the kind that I want you to have of statistics. With a few pleasant memories stored up, I hope your confidence in your process analysis skills will increase and you’ll feel ready to do more on your own.

In this post I'm going to show you how Minitab Statistical Software can help you clean up text data. Once again I'm going to use Wordsworth's I Wandered Lonely as a Cloud.  If you want to follow along, you can see how to copy the data set I created in a previous post. 

Of course, with a small data set like the Wordsworth one, you could easily clean the data without Minitab. But the larger the data set -- and the more important your time -- the more critical it is to make Minitab clean the data for you.

Here's what our data set looks like:
  Example of the<br />
worksheet that contains the<br />
number of characters per line in<br />
"I Wandered Lonely as a<br />
First, let’s remove the empty rows. Minitab can make a new worksheet of rows that meet a certain criteria:



  1. Choose Data > Subset Worksheet.
  2. In Specify Which Rows to Include, click Condition.
  3. In Condition, enter 'length of line'>0. Click all the OK buttons.

Now we have a worksheet without empty rows, but we still have rows with extra spaces. To clean these rows, we’ll use a few Minitab calculator functions: if, trim, and left.
Histogram<br />
of uncleaned data

If lets us get different results for different rows. The histogram on the left shows that the unusually long lines all had more than 45 characters. We’ll use if to make sure that we only shorten these longest lines.

Left gives us a certain number of characters, starting from the left of the observation. In combination with our existing column of lengths, we can use left to get all but the last two characters on every line.

Trim removes all spaces except for single spaces between words. We’ll use trim to get rid of the spaces at the end.

  1. Choose Calc > Calculator.
  2. Enter a column to store the new poem lines.
  3. In Expression, enter if('length of line'>45,trim(left('I Wandered Lonely as a Cloud','length of line'-2)),'I Wandered Lonely as a Cloud') . Click OK.

Now that you have the new poem, you need to get the lengths again, just as we did in the first entry where we copied the data. That gives you not just a set of data, but a clean set of data, ready for you to analyze.

Next time, we’ll switch from literature to candy so that you can work with collecting data yourself. Statistics are a lot of fun with candy.