The potential benefit of the data stored on servers is huge. Banks, insurance companies, telecom companies, manufacturers — in fact organizations from across all industries need to make good use of the data they own to improve their operations, better understand their customers and find competitive advantage.
With the advent of Industry 4.0 and the Industrial Internet of Things (IIoT), information comes streaming in from many sources: equipment on production lines, sensors in products, sales data and much more. Being able to collect, collate and analyze this information is becoming ever more critical as companies use it to find insights and improve efficiency and effectiveness.
This situation represents a massive amount of new opportunity, but it also brings some significant challenges.
The large amounts of data produced from these modern-day systems present unique challenges that are unseen with smaller data sets. These data sets may contain a large number of predictors, a large number of rows, or both. It is also common for observational data like this to be more complex than what you might find with data obtained from carefully designed experiments.
This article describes how these issues affect data analysis.
Large number of predictors
Traditional statistical modeling tools, such as regression and logistic regression, rely on p-values to detect significant effects. Specifically, we often claim that a predictor with a p-value less than 0.05 is statistically significant. However, this 0.05 benchmark means that we are accepting a 5% false positive rate: on average, about one in twenty predictors that have no real effect will still appear significant just by chance. With many predictors, relying on p-values can lead to modeling random noise.
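The arithmetic behind the "one in twenty" statement can be sketched in a few lines. This is a minimal illustration, not part of the original analysis. It assumes the tests are independent and that none of the predictors has a real effect, and the function names are made up for this example.

```python
def expected_false_positives(n_predictors, alpha=0.05):
    """Average number of null predictors flagged as significant by chance."""
    return alpha * n_predictors


def prob_at_least_one(n_predictors, alpha=0.05):
    """Chance that at least one null predictor is flagged.

    Assumes the tests are independent, which is a simplification.
    """
    return 1 - (1 - alpha) ** n_predictors


for k in (1, 20, 99):
    print(k, expected_false_positives(k), round(prob_at_least_one(k), 4))
```

Under these assumptions, twenty unrelated predictors produce one false positive on average, and with 99 unrelated predictors at least one false positive is almost certain.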
To illustrate this, I randomly simulated 100 normally distributed columns with 15 observations each. A stepwise regression shows that no fewer than 13 of the 99 candidate predictors appear to have a statistically significant effect on the last column (p-values very close to 0), and the R-squared value is extremely high (100%). Obviously, all of these are spurious effects, caused exclusively by random fluctuations (see the results below).
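A simplified version of this experiment can be reproduced with the Python standard library. The sketch below is not the stepwise regression described above. It swaps in a simpler screen that tests each random predictor on its own with a Pearson correlation, and it counts how many pass the 5% threshold. The critical value 0.514 is the two-sided 5% cutoff for |r| with 15 observations (13 degrees of freedom, t = 2.160). The function names and the seed are assumptions made for this example.

```python
import math
import random

# Two-sided 5% critical value of |r| for n = 15 (13 degrees of freedom),
# derived from the t-table value t = 2.160: r = t / sqrt(t^2 + df).
R_CRITICAL = 0.514


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def count_spurious(n_predictors=99, n_obs=15, seed=1):
    """Count pure-noise predictors that look significant at the 5% level."""
    rng = random.Random(seed)
    response = [rng.gauss(0, 1) for _ in range(n_obs)]
    hits = 0
    for _ in range(n_predictors):
        predictor = [rng.gauss(0, 1) for _ in range(n_obs)]
        if abs(pearson_r(predictor, response)) > R_CRITICAL:
            hits += 1
    return hits


print(count_spurious(), "of 99 noise predictors pass the 0.05 threshold")
```

On average about 5 of the 99 noise columns pass this one-at-a-time screen. A stepwise procedure, which repeatedly searches for the best remaining predictor with only 15 observations, can be expected to overfit more severely than that.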
Large number of observations: Power vs. Practical significance
Very large sample sizes improve power, that is, the ability to detect statistically significant terms even when their effects are very small. However, such statistically significant effects do not necessarily imply practical significance. With large datasets, the p-value may become over-sensitive to small but real effects, leading to a very complex final model that contains most of the initial potential predictors.
Even though these terms may be statistically significant, most of them would actually have little practical significance.
To illustrate this, I simulated 6 columns with 100,000 observations each. I built the data so that the 5 input columns each have a real but tiny effect on the last column. The R-squared value is, understandably, extremely poor (close to 0%), but the p-values indicate that the effects are highly significant from a statistical perspective (see the results below).
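The same pattern can be sketched with the Python standard library. This is an illustration under stated assumptions, not the model fitted above: the slope of 0.03 per input, the seed and the function names are invented for the example, each input is tested separately with a Pearson correlation rather than in one multiple regression, and the p-value uses a normal approximation to the t distribution, which is reasonable at this sample size.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def tiny_effects(n_obs=100_000, n_inputs=5, slope=0.03, seed=1):
    """Return (R-squared, p-value) for each input with a real but tiny effect."""
    rng = random.Random(seed)
    inputs = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_inputs)]
    # The response is mostly noise plus a very small contribution per input.
    response = [
        slope * sum(col[i] for col in inputs) + rng.gauss(0, 1)
        for i in range(n_obs)
    ]
    results = []
    for col in inputs:
        r = pearson_r(col, response)
        t = r * math.sqrt((n_obs - 2) / (1 - r * r))
        # Two-sided p-value, normal approximation to the t distribution.
        p_value = math.erfc(abs(t) / math.sqrt(2))
        results.append((r * r, p_value))
    return results


for r_squared, p_value in tiny_effects():
    print(f"R-squared = {r_squared:.4f}, p-value = {p_value:.2e}")
```

Each input explains well under 1% of the variation in the response, yet every p-value falls far below 0.05. The p-value answers whether an effect exists, not whether it is large enough to matter.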
Complexity due to non-linear effects as well as outliers and missing values
Over a wider range and an extended period of time, variables are likely to follow non-linear patterns. Missing values and outliers are also more likely to be present in large datasets, and they reduce the efficiency of statistical modeling tools. For example, a single missing value can cause the entire row of observations to be dropped from the analysis.
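The cost of dropping every row that contains a missing value grows quickly with the number of columns. The sketch below is a hypothetical illustration: it assumes each value is missing independently with the same probability, and the 2% rate, the 50 columns and the function names are made up for the example.

```python
import random


def complete_row_fraction(missing_rate, n_columns):
    """Expected share of rows with no missing value in any column.

    Assumes values go missing independently with the same probability.
    """
    return (1 - missing_rate) ** n_columns


def simulate_complete_rows(n_rows, n_columns, missing_rate, seed=1):
    """Simulate a dataset and return the share of fully observed rows."""
    rng = random.Random(seed)
    complete = 0
    for _ in range(n_rows):
        row = [
            None if rng.random() < missing_rate else rng.gauss(0, 1)
            for _ in range(n_columns)
        ]
        if all(value is not None for value in row):
            complete += 1
    return complete / n_rows


print(round(complete_row_fraction(0.02, 50), 3))
print(simulate_complete_rows(5000, 50, 0.02))
```

Under these assumptions, with only 2% of values missing in each of 50 columns, roughly a third of the rows remain usable for a method that requires complete rows.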
Typically, a bank attempting to identify fraudulent transactions will need to analyze a very large number of predictors and observations, with many complex non-linear effects, missing values and outliers. The same is true for a manufacturing site aiming to identify yield detractors, or for a company monitoring equipment maintenance records to prevent failures.
Data Mining and Predictive Analytics
Powerful machine learning tools such as CART, Random Forests, TreeNet Gradient Boosting, and Multivariate Adaptive Regression Splines (MARS) are a useful addition to any practitioner’s toolkit, particularly when faced with larger data sets. These rule-based techniques are less affected by the limitations described above, since they do not rely on p-value significance thresholds. They are based instead on decision trees with IF, AND, OR rules that will also isolate outliers and 'impute' missing values.
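To give a feel for how a tree arrives at an IF rule, the sketch below finds a single split on one predictor by minimizing the sum of squared errors on each side, which is the usual criterion for a regression tree. It is a bare-bones illustration of one split, not the CART or TreeNet implementation, and it does not cover missing-value handling. The function names and the toy data are invented for the example.

```python
def sum_squared_error(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return the threshold t for the rule 'IF x <= t' that minimizes
    the total squared error of the two resulting groups."""
    pairs = sorted(zip(xs, ys))
    best_cost = float("inf")
    best_threshold = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sum_squared_error(left) + sum_squared_error(right)
        if cost < best_cost:
            best_cost = cost
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_threshold


ys = [0, 0, 0, 0, 5, 5, 5, 5]
print(best_split([1, 2, 3, 4, 10, 11, 12, 13], ys))    # 7.0
print(best_split([1, 2, 3, 4, 10, 11, 12, 1000], ys))  # 7.0
```

The second call replaces the largest x value with an extreme outlier, and the chosen rule does not move. The split depends only on the ordering of the predictor values, which is one reason tree-based methods are less sensitive to outliers in the predictors than a fitted straight line.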
Of course, the days of small and medium data are far from over. Smart statistical tools such as design of experiments and other statistical modeling tools will remain popular among Process, R&D, Quality or Validation engineers who need to optimize tools or processes. Black Belts and Master Black Belts will continue deploying Six Sigma data analysis tools for root cause analysis and for quality and efficiency improvement at every level, companywide, using p-values to identify predictors that are statistically significant.
As we move into a future which requires organizations to extract insight from an ever-increasing amount of data, it becomes even more important to ensure the right tools are chosen to be able to analyze data of varying size and complexity.
The modern machine learning tools such as CART, TreeNet, Random Forests and MARS provide a great choice for large data and/or more complex relationships, while more traditional modeling techniques such as regression will continue to be the tools of choice when model assumptions are met and the goal is to find a simple, interpretable equation.
The approaches of both statistics and machine learning play a critical role in finding actionable intelligence, and it will be the collaboration and communication between these two data-driven disciplines which will enable organizations to make better decisions and gain a competitive edge.
For customers on a journey to handling large sets of complex data, the Minitab product offering has evolved, integrating a fast, highly accurate platform for data mining and predictive analytics including CART, TreeNet Gradient Boosting, MARS and other methodologies. Register for our Machine Learning: the next step in Manufacturing Performance Webinar scheduled for Thursday, August 9 to learn more.