Trimming Decision Trees to Make Paper: A Hands-on Machine Learning and Root Cause Analysis Exercise
Are you headed to Lean and Six Sigma World in Las Vegas April 3-4? In Beyond the Buzzwords: Application of Machine Learning in Lean Six Sigma, Minitab statistician Charles Harrison and I will share even more information on modern-day machine learning techniques. Learn more about Minitab at Lean and Six Sigma World
As we collect more and more observational data from our processes, we may need new tools to provide meaningful insights into this information. You can add modern-day machine learning techniques alongside traditional statistical tools to analyze, improve and control your processes.
Don’t worry if you’re not familiar with machine learning and Classification and Regression Trees (CART). I’ll walk you through an example below and then give you step-by-step instructions.
Finding the Root Cause of Excessive Variation in a Pulp Bleaching Process
You can quickly detect the root cause of an out-of-control or out-of-specification process condition using the tree-based machine learning methods in Minitab’s Salford Predictive Modeler (SPM) alongside traditional control charts.
Consider a paper manufacturer that needs to use current process data to determine which factors are contributing to excessive variation in its pulp bleaching process. An Individuals Chart created in Minitab shows that the process is unstable, which in turn results in an unacceptable defect rate.
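To make the out-of-control logic concrete, here is a minimal sketch of how an Individuals Chart flags points, written in plain Python. This is not Minitab's implementation; the data are invented for illustration, and the limits use the standard individuals-chart formula (mean ± 2.66 × average moving range, where 2.66 = 3/d2 with d2 = 1.128 for subgroups of size 2):

```python
# Minimal individuals (I) chart check: compute control limits from the
# moving range and flag any observation below the lower control limit.
# Synthetic pulp-brightness data, for illustration only.

def i_chart_limits(x):
    """Return (LCL, center line, UCL) for an individuals chart."""
    mean = sum(x) / len(x)
    moving_ranges = [abs(b - a) for a, b in zip(x, x[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    return mean - 2.66 * mr_bar, mean, mean + 2.66 * mr_bar

brightness = [84.2, 84.5, 84.1, 84.4, 84.3, 84.6, 84.2,
              84.4, 76.0, 84.3, 84.5, 84.1, 84.4]
lcl, center, ucl = i_chart_limits(brightness)

# Binary response for a follow-up regression or CART analysis:
# 1 if the point falls below the LCL, 0 otherwise.
response = [1 if v < lcl else 0 for v in brightness]
print(f"LCL = {lcl:.2f}, center = {center:.2f}, UCL = {ucl:.2f}")
print("out-of-control flags:", response)
```

Turning the chart signal into a 0/1 response like this is exactly what sets up the regression and tree analyses that follow.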
To investigate the root cause of the excessive variability in this process, you might begin with a Binary Logistic Regression in Minitab, where the response variable is 1 if the point falls below the lower control limit and 0 otherwise. Unfortunately, for these data, the erratic patterns in the residual plots below indicate that the binary logistic regression model may not be adequate.
The CART Approach
CART is a decision tree algorithm that works by creating a set of yes/no rules that split the response (Y) variable into partitions based on the predictor (X) settings. Using the CART feature in SPM, I see that one of my predictor variables – Production – is a large contributor to a point falling outside the lower control limit.
If production rate <= 91.76, then the estimated probability of the process being out of control is relatively high (33%). If production rate > 91.76, then the process is likely in statistical control.
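Under the hood, that rule comes from an exhaustive search: CART tries every candidate threshold on every predictor and keeps the split that best purifies the two resulting partitions (lowest weighted Gini impurity). Here is a bare-bones, single-split sketch in pure Python. The data are made up, so the estimated probability in the low partition differs from the 33% found in the real process data:

```python
# One CART-style binary split: scan candidate thresholds on a numeric
# predictor and keep the one minimizing weighted Gini impurity.
# Toy data loosely echoing the production-rate example.

def gini(labels):
    """Gini impurity of a list of 0/1 labels: 2 * p * (1 - p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(x, y):
    """Return (threshold, weighted_gini) of the best split x <= threshold."""
    best = (None, float("inf"))
    for cut in sorted(set(x))[:-1]:  # splitting above the max is pointless
        left = [yi for xi, yi in zip(x, y) if xi <= cut]
        right = [yi for xi, yi in zip(x, y) if xi > cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best[1]:
            best = (cut, score)
    return best

production = [88.0, 90.5, 91.0, 91.76, 93.2, 94.0, 95.5, 96.1, 97.0]
out_of_control = [1, 0, 1, 1, 0, 0, 0, 0, 0]  # 1 = point below the LCL

threshold, score = best_split(production, out_of_control)
left = [yi for xi, yi in zip(production, out_of_control) if xi <= threshold]
print(f"split at production <= {threshold}: "
      f"P(out of control) = {sum(left) / len(left):.0%} in the low partition")
```

Growing a full tree simply repeats this search recursively inside each partition, which is how CART uncovers additional causes beyond the first split.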
The Minitab graph below explains why this rule works. The CART model finds the production-rate value (a vertical line on the graph) that best separates the Response = 0 (in-control) group from the Response = 1 (out-of-control) group.
I can continue growing the CART tree to find more causes of the excessive variability in this process. Once I've narrowed the problem down to the vital few X's, I can put controls in place to reduce the chance of the process drifting out of control, resulting in the process improvement shown in the Minitab Individuals Chart with Stages below.
The focus on analytics and data-driven decisions within most organizations should not be seen as a threat. It's a great opportunity for all of us. Experience in data analysis is not just valuable in your job; it's becoming crucial. Consider augmenting your current analytics tool kit with some additional machine learning tools developed specifically for the problems that arise with large, observational data sets.
Ready to try for yourself?