What to Do When Your Data's a Mess, part 2

Minitab Blog Editor 22 March, 2017

In my last post, I wrote about making a cluttered data set easier to work with by removing unneeded columns entirely, and by displaying just those columns you want to work with now. But too much unneeded data isn't always the problem.

What can you do when someone gives you data that isn't organized the way you need it to be?  

That happens for a variety of reasons, but most often it's because the simplest way for people to collect data is with a format that might make it difficult to assess in a worksheet. Most statistical software will accept a wide range of data layouts, but just because a layout is readable doesn't mean it will be easy to analyze.

You may not be in control of how your data were collected, but you can use tools like sorting, stacking, and ordering to put your data into a format that makes sense and is easy for you to use. 

Decide How You Want to Organize Your Data

Depending on how its arranged, the same data can be easier to work with, simpler to understand, and can even yield deeper and more sophisticated insights. I can't tell you the best way to organize your specific data set, because that will depend on the types of analysis you want to perform, and the nature of the data you're working with. However, I can show you some easy ways to rearrange your data into the form that you select. 

Unstack Data to Make Multiple Columns

The data below show concession sales for different types of events held at a local theater. 

stacked data

If we wanted to perform an analysis that requires each type of event to be in its own column, we can choose Data > Unstack Columns... and complete the dialog box as shown:

unstack columns dialog 

Minitab creates a new worksheet that contains a separate column of Concessions sales data for each type of event:

Unstacked Data

Stack Data to Form a Single Column (with Grouping Variable)

A similar tool will help you put data from separate columns into a single column for the type of analysis required. The data below show sales figures for four employees: 

Select Data > Stack > Columns... and select the columns you wish to combine. Checking the "Use variable names in subscript column" will create a second column that identifies the person who made each sale. 

Stack columns dialog

When you press OK, the sales data are stacked into a single column of measurements and ready for analysis, with Employee available as a grouping variable: 

stacked columns

Sort Data to Make It More Manageable

The following data appear in the worksheet in the order in which individual stores in a chain sent them into the central accounting system.

When the data appear in this uncontrolled order, finding an observation for any particular item, or from any specific store, would entail reviewing the entire list. We can fix that problem by selecting Data > Sort... and reordering the data by either store or item. 

sorted data by item    sorted data by store

Merge Multiple Worksheets

What if you need to analyze information about the same items, but that were recorded on separate worksheets?  For instance, if one group was gathering historic data about all of a corporation's manufacturing operations, while another was working on strategic planning, and your analysis required data from each? 

two worksheets

You can use Data > Merge Worksheets to bring the data together into a single worksheet, using the Division column to match the observations:

merging worksheets

You can also choose whether or not multiple, missing, or unmatched observations will be included in the merged worksheet.  

Reorganizing Data for Ease of Use and Clarity

Making changes to the layout of your worksheet does entail a small investment of time, but it can bring big returns in making analyses quicker and easier to perform. The next time you're confronted with raw data that isn't ready to play nice, try some of these approaches to get it under control. 

In my next post, I'll share some tips and tricks that can help you get more information out of your data.