What to Do When Your Data's a Mess, part 1
Isn't it great when you get a set of data and it's perfectly organized and ready for you to analyze? I love it when the people who collect the data take special care to make sure to format it consistently, arrange it correctly, and eliminate the junk, clutter, and useless information I don't need.
You've never received a data set in such perfect condition, you say?
Yeah, me neither. But I can dream, right?
The truth is, when other people give me data, it's typically not ready to analyze. It's frequently messy, disorganized, and inconsistent. I get big headaches if I try to analyze it without doing a little clean-up work first.
I've talked with many people who've shared similar experiences, so I'm writing a series of posts on how to get your data in usable condition. In this first post, I'll talk about some basic methods you can use to make your data easier to work with.
Preparing Data Is a Little Like Preparing Food
I'm not complaining about the people who give me data. In most cases, they aren't statisticians and they have many higher priorities than giving me data in exactly the form I want.
The end result is that getting data is a little bit like getting food: it's not always going to be ready to eat when you pick it up. You don't eat raw chicken, and usually you can't analyze raw data, either. In both cases, you need to prepare it first or the results aren't going to be pretty.
Here are a couple of very basic things to look for when you get a messy data set, and how to handle them.
Kitchen-Sink Data and Information Overload
Frequently I get a data set that includes a lot of information that I don't need for my analysis. I also get data sets that combine or group information in ways that make analyzing it more difficult.
For example, let's say I needed to analyze data about different types of events that take place at a local theater. Here's my raw data sheet:
With each type of event jammed into a single worksheet, it's a challenge to analyze just one event category. What would work better? A separate worksheet for each type of occasion. In Minitab Statistical Software, I can go to Data > Split Worksheet... and choose the Event column:
And Minitab will create new worksheets that include only the data for each type of event.
Minitab also lets you merge worksheets to combine items provided in separate data files.
Let's say the data set you've been given contains a lot of columns that you don't need: irrelevant factors, redundant information, and the like. Those items just clutter up your data set, and getting rid of them will make it easier to identify and access the columns of data you actually need. You can delete rows and columns you don't need, or use the Data > Erase Variables tool to make your worksheet more manageable.
I Can't See You Right Now...Maybe Later
What if you don't want to actually delete any data, but you only want to see the columns you intend to use? For instance, in the data below, I don't need the Date, Manager, or Duration columns now, but I may have use for them in the future:
I can select and right-click those columns, then use Column > Hide Selected Columns to make them disappear.
Voila! They're gone from my sight. Note how the displayed columns jump from C1 to C5, indicating that some columns are hidden:
It's just as easy to bring those columns back in the limelight. When I want them to reappear, I select the C1 and C5 columns, right-click, and choose "Unhide Selected Columns."
Data may arrive in a disorganized and messy state, but you don't need to keep it that way. Getting rid of extraneous information and choosing the elements that are visible can make your work much easier. But that's just the tip of the iceberg. In my next post, I'll cover some more ways to make unruly data behave.