It’s not easy to get data ready for analysis. Sometimes, data that include all the details we want aren’t clean enough for analysis. Even stranger, sometimes the exact opposite can be true: Data that are convenient to collect often don’t include the details that we want when we analyze them.
Let’s say that you’re looking at the documentation for the National Health and Nutrition Examination Survey (NHANES) from 2001-2002. By convention, the data set uses a symbol for missing values, but some variables have additional numeric codes for data that are missing for a specific reason. For example, one data set records hearing measurements (Audiometry). One variable in this data set is the middle ear pressure in the right ear, which has values from -282 to 180, but also includes these codes:
Knowing how often each of these situations occurs can be important in some cases, but to analyze the numeric data, you have to change these codes from numbers to something that won’t be analyzed. After all, the smallest code, 555, is more than three times the largest real measurement of 180, so leaving the codes in would seriously distort the mean of the data set.
In Minitab, try this:
Lower endpoint | Upper endpoint | Recoded value
555 | 556 | *
777 | 778 | *
888 | 889 | *
The resulting column has missing values instead of the coded values. And that means the statistics that you calculate will now have the correct values.
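For readers who want to see the same idea outside Minitab, here is a minimal Python sketch of recoding ranges to missing values. The sample readings are invented for illustration, the function names are hypothetical, and the sketch assumes each range includes its lower endpoint but not its upper endpoint; check which endpoint setting you chose in Minitab before relying on that.

```python
from statistics import mean

# Hypothetical right-ear pressure readings (not real NHANES data).
# 555, 777 and 888 stand in for the special "missing for a reason" codes.
pressure = [-12, 5, 555, -30, 777, 18, 888, 0]

# Ranges to recode, assumed half-open: lower endpoint included, upper excluded.
MISSING_RANGES = [(555, 556), (777, 778), (888, 889)]


def recode_to_missing(values, ranges):
    """Return a copy of values with every value inside a range replaced by None."""
    return [
        None if any(lower <= v < upper for lower, upper in ranges) else v
        for v in values
    ]


def mean_ignoring_missing(values):
    """Average only the values that are present (not None)."""
    present = [v for v in values if v is not None]
    return mean(present)


recoded = recode_to_missing(pressure, MISSING_RANGES)

print("Mean with codes left in:", mean(pressure))
print("Mean after recoding:    ", mean_ignoring_missing(recoded))
```

With these made-up readings, the mean with the codes left in is far above 180, the largest value a real measurement can take, while the recoded mean falls back inside the plausible range.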
Recoding can let you prepare data with numeric measurements for correct analysis, but the CDC data sets also often use numeric codes to represent categories. For example, one variable records these codes for the status of an audio exam:
Another reason to recode your data before analyzing it is to make it descriptive, both in the worksheet itself and in the category labels that later appear in output and on graphs. You can recode these numeric codes to text in a similar fashion. Try this:
Current value | Recoded value
1 | Complete
2 | Partial
3 | Not done
The resulting column has the text labels instead of the numeric codes. When you create graphs, the labels will be descriptive.
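The same code-to-label recoding can be sketched in Python with a lookup table. The list of exam statuses below is invented for illustration and the function name is hypothetical; only the three code-to-label pairs come from the values above. Codes without a label are left missing rather than guessed.

```python
from collections import Counter

# Code-to-label pairs taken from the recoding values above.
STATUS_LABELS = {1: "Complete", 2: "Partial", 3: "Not done"}


def recode_to_text(codes, labels):
    """Replace each numeric code with its label; unknown codes become None."""
    return [labels.get(code) for code in codes]


# Hypothetical exam status column (not real NHANES data).
exam_status = [1, 1, 3, 2, 1, 3]

labeled = recode_to_text(exam_status, STATUS_LABELS)

# Tallies like these are what would feed a bar chart with descriptive labels.
counts = Counter(labeled)
for label, n in counts.items():
    print(f"{label}: {n}")
```

Keeping the mapping in one dictionary means the labels are defined once and reused for every table or graph built from the column.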
Sometimes, data that are good to collect differ from data that are good to analyze. Sometimes we need more detail in the data that we collect than in the data that we analyze, such as when we record the reason that data are missing. Sometimes we need to record data faster than a descriptive format allows, so we use abbreviations or codes that are less descriptive than we’d like when we analyze the data.
Fortunately, Minitab helps you balance those needs with data manipulation features like recoding. Ready for more? Check out some of the ways that Minitab makes it easy to merge different worksheets together.