Three Ways to Get More Out of Your Text Data

Text data can be a challenge to analyze. Even the word "data" usually makes me think about numbers, but a great deal of the data statisticians and quality professionals need to analyze is text.

Now, I majored in English as an undergraduate, so I find it very interesting to think about literature in terms of the data it contains. For instance, I'd love to treat Thomas Pynchon's The Crying of Lot 49 as a data set just to see what I could discover about the relative frequency of certain words and phrases -- but that's a project for another day.

So let's talk instead about the kinds of text data you might encounter in the course of a Six Sigma or similar quality improvement project. This might include long ID codes that include a letter, like "AB12345." It could include names, or dates. And depending on where your data are coming from, a lack of consistency and quality could be an issue.

The good news is that Minitab’s Calculator is just as useful with text as it is with numbers. You can use the Calculator to quickly change a column of text data to be all capital letters, or all lower-case letters. If you have a list of first names, you can use the calculator to make sure the first letter of each name is upper-case.

In fact, if you pull up Calc > Calculator... in Minitab Statistical Software and select "Text Functions" in the Functions drop-down menu, you'll find a list of 22 different operations you can use to alter, edit, correct, augment, transform and otherwise manipulate your text data so you can do what you want with it.

Calculator to Manipulate Text Data

Here are three Text functions available in the Minitab calculator that I've found particularly useful.

1. LENGTH

As a literature geek, I find the possibilities offered by this tool endless. For example, have you ever wondered which sentence in all of William Faulkner's novels is the longest? (Um...yes, I have.) The LENGTH function counts the number of characters in a string of text, and can store it in a separate column. So if the word "antidisestablishmentarianism" appears as a data point in your column of text, the calculator will return "28."

Now I just need someone to enter every sentence Faulkner wrote into a data sheet...

2. ITEM or WORD

The ITEM function extracts the nth word from a string of text. Let's say you want to analyze sales per county for a given region, and you get sales records that display customer names and counties like this:

Quentin Compson, Yoknapatawpha County

You could use the ITEM function to extract the 3rd word in each line of data, and Minitab's calculator would give you a list of county names. By default, one or more spaces define where each word begins and ends. You can specify other criteria for the separation between words, such as a comma, using an optional third argument, 'delimiters.'

ITEM is very similar to another function, WORD, but ITEM extracts empty text that occurs between repeated separators (like ,,) while WORD ignores the empty string and extracts the text that follows consecutive separators.

3. LEFT

The LEFT function returns the specified number of characters from the beginning of a string of text. If you have a column of text data that isn't entirely consistent -- for instance, if defectives are listed as both "defect" and "defective" -- the LEFT function will let you extract just three letters, giving you a consistent value.

Manipulating Text Data - Left Function

For text, you select the column of text values you want to extract characters from. For num_chars, enter the number of characters from the left you want to keep. So, if c1 contains both "Defective" and "Defect", entering LEFT (c1,3) will give you a new column of data that contains a consistent value: "Def".