Our guest blogger Dr. James Kulich holds a Ph.D. in mathematics from Northwestern University and currently serves as program director and professor of mathematics at Elmhurst University. He has extensive experience in applying quantitative methods and analytical tools to produce useful and actionable information from widely ranging data. His expertise includes the full range of modern statistical and data modeling methods.
When the data science buzz began around a decade ago, creating a predictive model was complicated work that only skilled programmers could accomplish. Today, new tools and new techniques are shifting the focus from programming details to building models that are robust, scalable and focused on creating business value.
In this blog, we’ll provide a framework for how machine learning works and show how we are now entering a third generation of machine learning capabilities that you can tap into.
What is machine learning? At its core, machine learning is nothing more than a collection of algorithms that allow you to make predictions about something that is unknown based on data that are known. In their book Prediction Machines, economists Ajay Agrawal, Joshua Gans, and Avi Goldfarb make the point that artificial intelligence is not about creating intelligence, but about the ability of machines to make predictions, the central input for decision making.
As the authors note, machines and humans have distinct strengths and weaknesses when it comes to making predictions. Machines are better able to handle complex interactions, especially in situations rich with data, whereas humans do a better job when it is important to understand the process which led to the data.
I join many others in a strong belief that the best results are obtained when the strengths of machines and humans are combined. This is the approach my colleagues and I take in the Master's in Data Science program at Elmhurst University, of which I am the founding director.
Today’s approach to machine learning has its roots in statistics. Linear regression, which has been around for more than a century, remains an important form of machine learning. Over the past couple of decades, new forms of machine learning have become practical. I categorize modern machine learning approaches into three generations.
The first generation consists of a set of baseline modeling techniques that are often sufficient to make useful predictions. They come in a few flavors, beginning with decision trees.
Let’s say you are trying to predict whether a prospective customer will make a purchase based on past buying history. Many highly intertwined factors come into play. A typical BI analysis is simply not capable of seeing through the complexity. A decision tree systematically determines, at each step, which of the available variables can most quickly separate prospects who purchase from prospects who don’t.
The result is a roadmap you can follow to guide a decision. In the example from banking shown below, generated using Minitab’s CART capability, the most important variable is the duration of the last contact with the prospect. If the contact lasted less than 249 seconds, there was little chance of a purchase, but chances improved with longer contacts. From there, follow the tree.
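For readers who want to experiment with the same idea outside of Minitab, here is a minimal sketch of fitting a decision tree in Python with scikit-learn. The file name, column names, and outcome variable are hypothetical stand-ins for the banking data described above, not the actual dataset behind the blog's CART tree.

```python
# Minimal sketch of a decision tree classifier (hypothetical banking data).
# The blog's tree was built with Minitab's CART; this is only an illustration.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumed file and column names; 'purchased' is the outcome to predict.
data = pd.read_csv("bank_prospects.csv")
X = pd.get_dummies(data[["duration", "employment", "monthly_income"]])
y = data["purchased"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# Print the splits the tree chose; in the blog's example the first split
# falls near a contact duration of 249 seconds.
print(export_text(tree, feature_names=list(X.columns)))
```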
Another important first generation machine learning model is logistic regression, which develops a formula for the predicted outcome as a weighted combination of the input variables. This approach is easy to implement in many IT environments and is also easy to understand.
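To make the "weighted combination" idea concrete, here is a short sketch, continuing from the hypothetical X and y prepared above, that fits a logistic regression in scikit-learn and prints its weights. The model estimates the purchase probability as the logistic function applied to an intercept plus a weighted sum of the inputs.

```python
# Illustrative sketch of logistic regression on the same hypothetical data:
#   p(purchase) = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))
from sklearn.linear_model import LogisticRegression

logit = LogisticRegression(max_iter=1000)
logit.fit(X, y)  # X, y as prepared in the earlier sketch

# The fitted weights are easy to inspect and to re-implement elsewhere.
for name, coef in zip(X.columns, logit.coef_[0]):
    print(f"{name}: {coef:+.3f}")
print("intercept:", logit.intercept_[0])
```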
First generation models are often all that you need, but they do have limitations. Sometimes they miss important nuances in the data, which can result in either a model that is not sensitive enough or a model that is too sensitive, mistaking randomness in your data for real patterns. Second generation models address some of these issues and sometimes provide additional useful information.
Random forest algorithms are a good example of a second-generation model. Instead of working with a simple decision tree, random forests build many trees using only parts of the data in each pass. They reach a final answer by essentially averaging results. This process tends to eliminate some noise in the system and can be more robust than a simple decision tree.
Random forests also produce, for free, good estimates of variable importance. Random forest algorithms, as well as some other more complex machine learning and predictive analytics tools, can be generated in software like Minitab's Salford Predictive Modeler (SPM). In our banking example, the length of the last contact, employment status, and monthly income rise to the top. Note that this result is consistent with our CART tree and goes one step further, as seen in this screenshot from SPM Random Forests®:
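The same idea can be sketched in scikit-learn: fit many trees on bootstrap samples and read the variable importances off the fitted ensemble. This continues the hypothetical X and y from the earlier sketches and is only an illustration of the technique, not a reproduction of the SPM Random Forests output shown above.

```python
# Illustrative sketch: a random forest and its variable importances.
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X, y)  # X, y from the earlier sketch

# Importances come "for free" from the fitted ensemble.
for name, score in sorted(zip(X.columns, forest.feature_importances_),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {score:.3f}")
```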
Other second-generation models include multivariate adaptive regression splines and regularized regression models, which aim to keep model complexity issues under control.
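As one small illustration of how regularization keeps complexity in check, the sketch below, again using the hypothetical X and y from earlier, fits an L1-penalized logistic regression, which shrinks weak coefficients toward zero and drops unhelpful variables entirely. It is a generic example of the idea, not the specific regularized models referenced above.

```python
# Illustrative sketch: L1-regularized logistic regression to control complexity.
from sklearn.linear_model import LogisticRegression

lasso_logit = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso_logit.fit(X, y)  # X, y from the earlier sketch

# Coefficients driven to zero correspond to variables the model drops.
print(dict(zip(X.columns, lasso_logit.coef_[0].round(3))))
```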
A primary goal of any machine learning effort is to produce useful business guidance. Third generation machine learning models both extend the reach of their earlier counterparts and provide new levels of available guidance.
Let’s take one last look at our banking example.
The graph shown below is called a one-variable partial dependence plot. It shows how the impact of contact duration on the likelihood to make a purchase changes across various possible values for the length of the contact. This sounds like a simple idea, but it is actually hard to do, because the effect of a variable like duration is highly intertwined with the effects of many other variables.
Our partial dependence plot tells us that the likelihood of closing the deal continues to grow for contacts that last up to about 1,000 seconds and then holds steady. This is specific guidance that you can provide to front-line personnel that goes beyond what earlier generation models offer.
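If you want to produce a comparable plot yourself, scikit-learn has partial dependence tooling built in. The sketch below uses the hypothetical forest and X from the earlier sketches; the plot in the blog itself comes from Minitab's tools.

```python
# Illustrative sketch: a one-variable partial dependence plot for duration.
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# Partial dependence of the predicted purchase probability on contact duration.
PartialDependenceDisplay.from_estimator(forest, X, features=["duration"])
plt.show()
```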
One other area of focus for third-generation models is the ability to handle text and image data. These kinds of data tend to have, once converted to numbers, many more columns than rows. This is usually a nightmare for earlier modeling techniques. Third-generation models like GPS and MARS® have the necessary capabilities built in. So-called unstructured data like text and images are becoming increasingly important, sometimes carrying the bulk of a model's predictive power, and hence its ability to generate business value.
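To see why text data become so wide, consider this small, self-contained sketch that converts a few made-up call notes into a numeric matrix. Even three short documents produce one column per distinct word, so real text collections routinely end up with far more columns than rows.

```python
# Illustrative sketch: turning free-text notes into numbers with TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer

notes = [
    "customer asked about mortgage rates and closing costs",
    "short call, prospect not interested at this time",
    "follow-up requested on retirement savings products",
]
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(notes)

print(matrix.shape)  # 3 rows, one column per distinct word
```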
Models must be stable. Models must be deployable. Today’s machine learning tools are making both of these necessary outcomes readily possible. The challenges, from a business perspective, become building awareness of what is now possible and building capacity to wisely use the powerful results of first, second, and third-generation machine learning models.