Are You Putting the Data Cart Before the Horse? Best Practices for Prepping Data for Analysis, Part 2
Do you recall my “putting the cart before the horse” analogy in part 1 of this blog series? The comparison is simple.
We all, at times, put the cart before the horse in relatively innocuous ways, such as eating your dessert before you’ve eaten your dinner, or deciding what to wear before you’ve been invited to the party. But performing some tasks in the wrong order, such as running a statistical analysis before you’ve prepared your data, might result in more serious consequences.
Eating your dessert first might merely spoil your appetite for dinner, but performing a statistical analysis on dirty data could have much more serious repercussions—including misleading results, mistaken decisions, or, if you’re lucky enough to catch your mistake before it's too late, costly rework.
Spending quality time with your data up front can prevent you from wasting time and energy on an analysis that either can’t work or can’t be trusted. We began exploring this idea in Part 1 of this best practices series, where I offered some tips for cleaning your data before you import it into Minitab. The biggest takeaway from Part 1 is that cleaning your data before you begin an analysis can save time by preventing rework, such as reformatting data or correcting data entry errors, after you’ve already begun the analysis.
So, once our data is clean, what comes next?
Use formatting and highlighting tools to explore and visualize your data
You can use Minitab’s worksheet visualization tools to explore your data. Conditional formatting in particular brings color to your worksheet and can be used to highlight aspects of your data that you’d like to call attention to quickly.
In our data set, recall that we’ve recorded the amount of time a machine was out of operation, the reason for the machine being down, the shift number during which the machine went down, and the speed of the machine when it went down. Suppose you wish to identify frequently occurring values or points that are out-of-spec or out-of-control. You can use formatting rules to do just that!
In this example, I’ve used one of the statistical rules available in Minitab’s conditional formatting to identify values that are not within spec. Highlighting these values may indicate either a data entry error or be valid cause for investigation, and can help you better understand where to focus your exploration and visualization efforts moving forward.
With a simple right-click directly in the Minitab worksheet, you can identify out-of-spec values you may wish to investigate before you begin your analysis.
You can also use Cell Properties (available by right-clicking in the worksheet) to highlight individual cells or rows, and add cell comments to draw attention to data that need further investigation, such as out-of-control points, unusual observations, or other data of interest. Rather than removing questionable data right away, you can take note of the data, perhaps by commenting on the cell as a reminder to follow-up. Doing this will keep you from committing the statistically unsound practice of cherry-picking data, and will ensure you handle the data correctly when it comes time to analyze it.
In the Minitab worksheet, you can highlight an entire row to easily visualize all variables associated with particular data, or add a cell comment to an out-of-control point for future reference.
Use subsets to uncover insights prior to your analysis
Finally, data subsets are a good way to visualize only the data that is relevant to answering your questions. Minitab makes it really easy to subset by right-clicking in the worksheet, and allows you to create subsets based on the data you’ve explored and highlighted with conditional formatting.
For example, suppose you want to understand why machines are experiencing downtime so you can address productivity problems. You can use conditional formatting to identify the most frequent reason for a machine’s downtime, and then subset your data based on those formatted rows to understand the relationship the most frequent cause of machine downtime has with other variables.
It’s easy to subset your data in Minitab by right-clicking directly within the worksheet.
All of the data cleaning and exploration you’ve seen in the worksheet is just the beginning—but consider how much insight you’ve drawn from your data before you’ve visualized it graphically or formally analyzed it!
Taking the time to clean and explore your data before you begin an analysis is well worth the investment. Doing so will help you better understand and answer key questions about your process, lead to a more efficient analysis as you tackle only the most relevant data for answering your questions, and ultimately yield results you can trust.