Minitab | Minitab: Blog posts and articles about using Minitab software in quality improvement projects, research, and more.
http://blog.minitab.com/blog/minitab/rss
Sun, 23 Oct 2016 20:16:30 +0000 | FeedCreator 1.7.3
Improving Cash Flow and Cutting Costs at Bank Branch Offices
http://blog.minitab.com/blog/understanding-statistics/improving-cash-flow-and-cutting-costs-at-bank-branch-offices
<p>Every day, thousands of people withdraw extra cash for daily expenses. Each transaction may be small, but the total amount of cash disbursed over hundreds or thousands of daily transactions can be very high. Meanwhile, each bank branch keeps a fixed amount of cash on hand, which must be set without knowing what customers will need on a given day. This creates a challenge for financial institutions: customers expect their local bank office to have adequate cash available, so how can a bank confidently ensure each branch has enough funds to handle transactions without keeping too much in reserve?</p>
<p><img alt="Grupo Mutual" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/b2366c2da44cd861775ebab6c6d07e55/grupo_mutual_logo_200w_1_.png" style="width: 200px; height: 95px; margin: 10px 15px; float: right;" />A quality project team led by Jean Carlos Zamora and Francisco Aguilar tackled that problem at Grupo Mutual, a financial entity in Costa Rica.</p>
<p>When the project began, each of Grupo Mutual's 55 branches kept additional cash in a vault to avoid having insufficient funds. But without a clear understanding of daily needs, some branches often ran out of cash anyway, while others had significant unused reserves.</p>
<p>When a branch ran short, it created high costs for the company and gave customers three undesirable options: receive the funds as an electronic transfer, wait 1–3 days for consignment, or travel to the main branch to withdraw their cash. Having the right amount of cash in each branch vault would reduce costs and maintain customer satisfaction.</p>
<p>Using <a href="http://www.minitab.com/products/minitab">Minitab Statistical Software</a> and Lean Six Sigma methods, the team set out to determine the optimal amount of currency to store at each branch to avoid both a negative cash flow and idle funds. The team followed the five-phase <a href="http://blog.minitab.com/blog/real-world-quality-improvement/dmaic-vs-dmadv-vs-dfss">DMAIC (Define, Measure, Analyze, Improve, and Control)</a> method. In the Define phase, they set the goal: creating an efficient process that transferred cash from idle vaults to branches that needed it most.</p>
<p>In the Measure phase, the team analyzed two years' worth of cash-flow data from the 55 branches. “Managing the databases and analyzing about 2,000 data points from each of the 55 branches was our biggest challenge,” says Jean-Carlos Zamora Mora, project leader and improvement specialist at Grupo Mutual. “Minitab played a very important part in addressing this issue. It reduced the analysis time by helping us identify where to focus our efforts to improve our process.” </p>
<p>The Analyze phase began with an analysis of variance (ANOVA) to explore how the branches’ cash flow varied from month to month. The team used Minitab to identify which months differed from one another, and grouped similar months together to streamline the analysis. </p>
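<p>The one-way ANOVA behind this kind of comparison reduces to a single F-statistic. The sketch below shows the computation with invented monthly cash-flow figures, since the team's actual branch data are not public:</p>

```python
# Sketch: one-way ANOVA F-statistic for monthly cash-flow groups.
# All figures below are hypothetical, for illustration only.

def anova_f(groups):
    """Return the one-way ANOVA F-statistic for a list of samples."""
    k = len(groups)                      # number of groups (months)
    n = sum(len(g) for g in groups)      # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: how far each month's mean sits from the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: day-to-day variation inside each month
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

january  = [110, 120, 115, 130]   # hypothetical daily cash flow (thousands)
june     = [118, 122, 117, 125]
december = [160, 172, 168, 175]   # holiday months often behave differently

f_stat = anova_f([january, june, december])
```

<p>A large F-statistic (here driven by December's higher flows) signals that at least one month differs, which is exactly the cue for grouping similar months together.</p>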
<p>The team next used control charts to graph the data over time and assess whether or not the process was stable, in preparation for conducting capability analysis. To choose the right control chart and create comprehensive summaries of the results, the team used the Minitab Assistant.</p>
<p style="margin-left: 40px;"><img alt="grupo mutual i-mr chart" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/2d9ac9b2597c592e5be5b779bae85076/grupo_mutual_i_mr_chart_1_.png" style="width: 585px; height: 432px;" /></p>
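<p>For readers curious about the arithmetic behind an I-MR chart like the one above, here is a minimal sketch. The constants 2.66 and 3.267 are the standard factors for moving ranges of size 2; the daily figures are invented:</p>

```python
# Sketch: control limits for an Individuals (I-MR) chart.
# 2.66 and 3.267 are the standard d2-based factors for a moving range of size 2.

def imr_limits(data):
    moving_ranges = [abs(b - a) for a, b in zip(data, data[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)   # average moving range
    x_bar = sum(data) / len(data)                      # process center line
    return {
        "I_UCL": x_bar + 2.66 * mr_bar,    # individuals chart upper limit
        "I_LCL": x_bar - 2.66 * mr_bar,    # individuals chart lower limit
        "MR_UCL": 3.267 * mr_bar,          # moving-range chart upper limit
    }

daily_cash = [102, 98, 105, 101, 97, 103, 99, 104]   # hypothetical daily values
limits = imr_limits(daily_cash)
```

<p>Points falling outside these limits suggest the process is not stable, which is why the team checked stability before moving on to capability analysis.</p>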
<p>The team then performed a capability analysis of each group’s current cash flow to determine whether customer transactions matched the services provided, and to establish the percentage of cash used at each branch.</p>
<p style="margin-left: 40px;"><img alt="grupo mutual capability analysis" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f0e25ef8282111550e8fe8733eb889de/grupo_mutual_capability_analysis_1_.png" style="width: 586px; height: 439px;" /></p>
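<p>Capability analysis boils down to comparing process spread and centering against specification limits. A rough sketch of the Cp and Cpk indices, with invented data and spec limits (Minitab's within/overall sigma distinction is omitted for brevity):</p>

```python
# Sketch: process capability indices Cp and Cpk.
# Data, LSL, and USL are hypothetical; a real study distinguishes
# within-subgroup from overall standard deviation.
import statistics

def capability(data, lsl, usl):
    mean = statistics.mean(data)
    sigma = statistics.stdev(data)               # sample standard deviation
    cp = (usl - lsl) / (6 * sigma)               # potential capability (spread only)
    cpk = min(usl - mean, mean - lsl) / (3 * sigma)  # actual capability (centering too)
    return cp, cpk

cash_used_pct = [62, 58, 65, 60, 63, 59, 61, 64]   # hypothetical % of cash used
cp, cpk = capability(cash_used_pct, lsl=40, usl=80)
```

<p>Cp at least as large as Cpk always holds; a large gap between them indicates the process is off-center relative to the specs.</p>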
<p>The analysis revealed that, in total, the vaults held more cash than the branches needed to operate effectively, but uneven circulation of that money caused some branches to overdraw their vaults while others stored cash that went unused. </p>
<p>“We found a positive cash balance at 95% of the branches,” says Zamora Mora. “The analysis showed the cash on hand to meet customer needs exceeded the requirements by over 200%, so we suddenly had lots of money to invest.” </p>
<p>The analysis gave the team the confidence to move forward with the Improve phase: implementing real-time control charts that enabled management to check each branch’s cash balance throughout the day. Managers could now quickly move cash from branches with excess cash to those needing additional funds, and make more strategic cash flow decisions.</p>
<p>The team found that being able to answer objections with data helped secure buy-in from skeptical stakeholders. “Throughout this project, we encountered questions and situations that could have jeopardized our team’s credibility and our likelihood of success,” recalls Zamora Mora. “But the accuracy and reliability of our data analysis with Minitab was overpowering.” </p>
<p>The changes made during the project increased cash usage by 40% and slashed remittance costs by 60%. The new process also cut insurance costs and reduced the risks associated with storing and transporting cash. Overall, the project increased revenue by $1.1 million. </p>
<p>To read a more detailed account of this project, <a href="https://www.minitab.com/Case-Studies/Grupo-Mutual/">click here</a>. </p>
Capability Analysis | Lean Six Sigma | Quality Improvement | Fri, 21 Oct 2016 12:00:00 +0000 | http://blog.minitab.com/blog/understanding-statistics/improving-cash-flow-and-cutting-costs-at-bank-branch-offices | Eston Martz
Problems Using Data Mining to Build Regression Models, Part Two
http://blog.minitab.com/blog/adventures-in-statistics/problems-using-data-mining-to-build-regression-models-part-two
<p>Data mining can be helpful in the exploratory phase of an analysis. If you're in the early stages and you're just figuring out which predictors are potentially correlated with your response variable, data mining can help you identify candidates. However, there are problems associated with using data mining to select variables.</p>
<p>In my <a href="http://blog.minitab.com/blog/adventures-in-statistics/problems-using-data-mining-to-build-regression-models" target="_blank">previous post</a>, we used data mining to settle on the following model and graphed one of the relationships between the response (C1) and a predictor (C7). It all looks great! The only problem is that all of these data are randomly generated! No true relationships are present. </p>
<p style="margin-left: 40px;"><img alt="Regression output for data mining example" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/24e98167e2dfd848b346292af371acf3/regression_swo.png" style="width: 364px; height: 278px;" /></p>
<p style="margin-left: 40px;"><img alt="Scatter plot for data mining example" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/6e4dfb991b33031738756d4b2d1c77e4/scatterplot.png" style="width: 576px; height: 384px;" /></p>
<p>If you didn't already know there was no true relationship between these variables, these results could lead you to a very inaccurate conclusion.</p>
<p>Let's explore how these problems happen, and how to avoid them.</p>
Why <em>Do </em>These Problems Occur with Data Mining?
<p>The problem with data mining is that you fit many different models, trying lots of different variables, and you pick your final model based mainly on statistical significance, rather than being guided by theory.</p>
<p>What's wrong with that approach? The problem is that every statistical test you perform has a chance of a false positive. A false positive in this context means that the <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" target="_blank">p-value</a> is statistically significant but there really is no relationship between the variables at the population level. If you set the <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests:-significance-levels-alpha-and-p-values-in-statistics" target="_blank">significance level at 0.05</a>, you can expect that in 5% of the cases where the null hypothesis is true, you'll have a false positive.</p>
<p>Because of this false positive rate, if you analyze many different models with many different variables you will inevitably find false positives. And if you're guided mainly by statistical significance, you'll leave the false positives in your model. If you keep going with this approach, you'll fill your model with these false positives. That’s exactly what happened in our example. We had 100 candidate predictor variables and the stepwise procedure literally dredged through hundreds and hundreds of potential models to arrive at our final model.</p>
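<p>This accumulation of false positives is easy to demonstrate. The sketch below tests 100 purely random predictors against a purely random response; with a 0.05 significance level, roughly five should appear "significant" by chance alone (the critical value 1.98 approximates the two-sided t threshold for 98 degrees of freedom):</p>

```python
# Sketch: how dredging through noise yields "significant" predictors.
# Everything here is random -- no true relationships exist.
import math
import random

random.seed(1)
n = 100
response = [random.gauss(0, 1) for _ in range(n)]

def is_significant(x, y, t_crit=1.98):
    """Approximate two-sided t-test of the correlation between x and y."""
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    return abs(t) > t_crit

# Test 100 random candidate predictors against the random response
false_positives = sum(
    is_significant([random.gauss(0, 1) for _ in range(n)], response)
    for _ in range(100)
)
```

<p>Every predictor this count flags is a pure chance correlation, yet a significance-driven selection procedure would happily keep them all in the model.</p>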
<p>As we’ve seen, data mining problems can be hard to detect. The numeric results and graph all look great. However, these results don’t represent true relationships but instead are chance correlations that are bound to occur with enough opportunities.</p>
<p>If I had to name my favorite R-squared, it would be <a href="http://blog.minitab.com/blog/adventures-in-statistics/multiple-regession-analysis-use-adjusted-r-squared-and-predicted-r-squared-to-include-the-correct-number-of-variables" target="_blank">predicted R-squared</a>, without a doubt. However, even predicted R-squared can't detect all problems. Ultimately, even though the predicted R-squared is moderate for our model, the ability of this model to predict accurately for an entirely new data set is practically zero.</p>
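<p>A quick simulation makes the point concrete: cherry-pick the best of 100 noise predictors and its R-squared looks respectable on the data it was mined from, while a fresh sample shows essentially nothing. All data below are random, and the simple-regression R-squared is computed from scratch:</p>

```python
# Sketch: a predictor mined from noise "fits" the sample it was selected
# on but carries no signal for new data. All values are random.
import random

random.seed(2)
n = 50

def r_squared(x, y):
    """R-squared of the least-squares line fitted to (x, y)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx                 # fitted slope
    b0 = my - b1 * mx              # fitted intercept
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

candidates = [[random.gauss(0, 1) for _ in range(n)] for _ in range(100)]
y_old = [random.gauss(0, 1) for _ in range(n)]

best = max(candidates, key=lambda x: r_squared(x, y_old))   # data mining step
in_sample = r_squared(best, y_old)        # inflated by the selection

y_new = [random.gauss(0, 1) for _ in range(n)]
out_sample = r_squared(best, y_new)       # near zero: no real relationship
```

<p>Because `best` was chosen as the maximum over 100 tries, its in-sample R-squared is systematically inflated, while against new responses it reverts to the noise floor.</p>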
Theory, the Alternative to Data Mining
<p>Data mining can have a role in the exploratory stages of an analysis. However, for all variables that you identify through data mining, you should perform a confirmation study using newly collected data to verify the relationships in the new sample. Failure to do so can be very costly. Just imagine if we had made decisions based on the model above!</p>
<p>An alternative to data mining is to use theory as a guide in terms of both the models you fit and the evaluation of your results. Look at what others have done and incorporate those findings when building your model. Before beginning the regression analysis, develop an idea of what the important variables are, along with their expected relationships, coefficient signs, and effect magnitudes.</p>
<p>Building on the results of others makes it easier both to collect the correct data and to specify the best regression model without the need for data mining. The difference is the process by which you fit and evaluate the models. When you’re guided by theory, you reduce the number of models you fit and you assess properties beyond just statistical significance.</p>
<p>Rather than judging your model on statistical measures alone, weigh the results against theory:</p>
<ul>
<li>Compare the coefficient signs to theory. If any of the signs contradict theory, investigate and either change your model or explain the inconsistency.</li>
<li>Use <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">Minitab statistical software</a> to create factorial plots based on your model to see if all the effects match theory.</li>
<li>Compare the <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit" target="_blank">R-squared</a> for your study to those of similar studies. If your R-squared is very different from those in similar studies, it's a sign that your model may have a problem.</li>
</ul>
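<p>The first check, comparing coefficient signs to theory, can even be automated. In this sketch the variable names, expected signs, and fitted coefficients are all invented for illustration:</p>

```python
# Sketch: flag fitted coefficients whose signs contradict theory.
# Variable names, expected signs, and coefficients are hypothetical.
expected_signs = {"price": -1, "advertising": +1, "income": +1}
fitted = {"price": -2.31, "advertising": 0.87, "income": -0.05}

# A coefficient conflicts with theory when its sign disagrees with expectations
conflicts = [name for name, coef in fitted.items()
             if coef * expected_signs[name] < 0]
# Here 'income' contradicts theory -> investigate before trusting the model
```

<p>Any name landing in <code>conflicts</code> is a prompt to investigate: either the model is misspecified or the inconsistency needs an explanation.</p>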
<p>If you’re interested in learning more about these issues, read my post about <a href="http://blog.minitab.com/blog/adventures-in-statistics/beware-of-phantom-degrees-of-freedom-that-haunt-your-regression-models">how using too many <em>phantom</em> degrees of freedom is related to data mining problems</a>.</p>
Data Analysis | Hypothesis Testing | Learning | Regression Analysis | Statistics | Statistics Help | Wed, 19 Oct 2016 12:00:00 +0000 | http://blog.minitab.com/blog/adventures-in-statistics/problems-using-data-mining-to-build-regression-models-part-two | Jim Frost
Minitab 17 and Minitab Express: A Comparison of Software Features
http://blog.minitab.com/blog/marilyn-wheatleys-blog/minitab-17-and-minitab-express-a-comparison-of-software-features
<p><span style="line-height: 1.6;">Since the release of Minitab Express in 2014, we’ve often received questions in technical support about the differences between Express and Minitab 17. In this post, I’ll attempt to provide a comparison between these two Minitab products.</span></p>
What Is Minitab 17?
<p>Minitab 17 is an all-in-one graphical and statistical analysis package that includes basic analysis tools such as hypothesis testing, regression, and ANOVA. Additionally, Minitab 17 includes more advanced features such as reliability analysis, multivariate tools, design of experiments (DOE), and quality tools such as gage R&R and capability analysis. A full list of the features included in Minitab 17 is available on this <a href="http://www.minitab.com/en-us/products/minitab/features-list/">page</a>. </p>
What Is Minitab Express?
<p>Minitab Express is a more basic all-in-one software package for graphical and statistical analysis, designed for students and professors teaching introductory statistics courses. Minitab Express includes statistical analysis options such as hypothesis testing, regression, and ANOVA, but does not include many of the other advanced features that are available in Minitab 17. A full list of the features that are included in Minitab Express is available <a href="http://www.minitab.com/en-us/products/express/features-list/">here</a>.</p>
Key Differences
<strong><em>Supported Operating Systems</em></strong>
<p>One main difference between the two packages is that Minitab 17 is a Windows-only application (however, Minitab 17 can be installed on Mac OS X using one of the options described <a href="http://support.minitab.com/en-us/installation/frequently-asked-questions/other/minitab-companion-on-mac/">here</a>). System requirements for Minitab 17 are available <a href="http://www.minitab.com/en-us/products/minitab/system-requirements/">here</a>. </p>
<p>Minitab Express is available for both Windows and Mac OS X. The system requirements for Minitab Express are available <a href="http://www.minitab.com/en-us/products/express/system-requirements/">here</a>.</p>
<strong><em>The Interface</em></strong>
<p>While both versions place the menu options at the top and the worksheet/data window below, there are several differences in the interface. The first screen shot below is for Minitab 17, while the next two screen shots are for Minitab Express:</p>
<p style="margin-left: 40px;"><br />
<strong>Minitab 17:</strong><img alt="Minitab 17 Interface" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f054ba83a85abb6245445502feb2ce86/minitab17interface.png" style="width: 800px; height: 481px;" /></p>
<p style="margin-left: 40px;"><strong>Minitab Express for Windows:</strong><img alt="Express for Windows" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/280aa535dde18d42aaf42eb517fbb9fe/expressforwindowsinterface.png" style="width: 800px; height: 571px;" /></p>
<p style="margin-left: 40px;"><strong>Minitab Express for OS X:</strong><img alt="Minitab Express for OS X Interface" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/177920cdc081cd8d77458ccf3318d192/expressforosxinterface.png" style="width: 800px; height: 529px;" /></p>
<em><strong>Comparison of Commonly Used Features</strong></em>
<p>Beyond the cosmetic differences, the table below compares which commonly used features are available in each version:</p>
<table align="center" border="1" cellpadding="5">
<tr><th>Feature</th><th>Minitab 17<br />(Windows)</th><th>Minitab Express<br />(Windows &amp; Mac OS X)</th></tr>
<tr><td>Assistant menu</td><td align="center">✓</td><td align="center"> </td></tr>
<tr><td>Graphs</td><td align="center">✓</td><td align="center">✓</td></tr>
<tr><td>Probability distributions</td><td align="center">✓</td><td align="center">✓</td></tr>
<tr><td>Summary statistics</td><td align="center">✓</td><td align="center">✓</td></tr>
<tr><td>Hypothesis tests</td><td align="center">✓</td><td align="center">✓</td></tr>
<tr><td>One-Way ANOVA</td><td align="center">✓</td><td align="center">✓</td></tr>
<tr><td>Two-Way ANOVA</td><td align="center">✓</td><td align="center">✓</td></tr>
<tr><td>ANOVA with &gt; 2 factors</td><td align="center">✓</td><td align="center"> </td></tr>
<tr><td>Linear regression</td><td align="center">✓</td><td align="center">✓</td></tr>
<tr><td>Logistic regression</td><td align="center">✓</td><td align="center">✓</td></tr>
<tr><td>Nonlinear regression</td><td align="center">✓</td><td align="center"> </td></tr>
<tr><td>Design of experiments</td><td align="center">✓</td><td align="center"> </td></tr>
<tr><td>Control charts</td><td align="center">✓</td><td align="center">✓</td></tr>
<tr><td>Gage R&amp;R</td><td align="center">✓</td><td align="center"> </td></tr>
<tr><td>Capability analysis</td><td align="center">✓</td><td align="center"> </td></tr>
<tr><td>Reliability</td><td align="center">✓</td><td align="center"> </td></tr>
<tr><td>Multivariate</td><td align="center">✓</td><td align="center"> </td></tr>
<tr><td>Time series</td><td align="center">✓</td><td align="center"> </td></tr>
<tr><td>Nonparametric tests</td><td align="center">✓</td><td align="center">✓</td></tr>
<tr><td>Equivalence tests</td><td align="center">✓</td><td align="center"> </td></tr>
<tr><td>Power and sample size</td><td align="center">✓</td><td align="center"> </td></tr>
</table>
<p>Although many of the same features are available in both packages, Minitab 17 has many graph editing options that are not available in Minitab Express. For many of the tests that are available in both packages, Minitab 17 allows more control over the results and offers more options than Minitab Express. You can see a more detailed comparison <a href="http://www.minitab.com/academic/comparison/">here</a>. </p>
<p>I hope this post is useful in evaluating the two versions of Minitab. For any questions about either software package, we are more than happy to help here in <a href="http://www.minitab.com/en-us/support/">technical support</a>.</p>
Statistics | Stats | Mon, 17 Oct 2016 12:00:00 +0000 | http://blog.minitab.com/blog/marilyn-wheatleys-blog/minitab-17-and-minitab-express-a-comparison-of-software-features | Marilyn Wheatley
Why You Should Celebrate Healthcare Quality Week
http://blog.minitab.com/blog/real-world-quality-improvement/ways-to-celebrate-healthcare-quality-week
<p>October 16–22 is National Healthcare Quality Week, started by the National Association for Healthcare Quality to increase awareness of healthcare quality programs and to highlight the work of healthcare quality professionals and their influence on improved patient care outcomes.</p>
<img alt="healthcare quality week logo" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ccb8f6d6-3464-4afb-a432-56c623a7b437/Image/71359d037d4643f7c534b6b2e17a074e/hqw.jpg" style="width: 250px; height: 70px; float: right; margin: 10px 15px;" />
<p>This event deserves your attention because the quality of healthcare affects every one of us, and so does the cost of that care. Whether it's as a patient, a quality practitioner, or a health care provider, we all have a stake in learning what people are doing to improve the quality of care and, at the same time, working to make it more efficient and ultimately affordable.</p>
<div>In honor of the celebration, I wanted to point you to a few resources we have on hand that not only acknowledge the great work of healthcare quality professionals around the world, but also show off the tactics and tools they use to keep patients safe and care affordable.</div>
<p>Kudos to not only those in the field of healthcare quality, but all who work in healthcare to improve the experience of patients everywhere (thank you!).</p>
Q&As with Healthcare Quality Professionals
<p><a href="https://www.minitab.com/News/Getting-Better-All-The-Time/" target="_blank">Getting Better All the Time</a></p>
<p>As the corporate director of process excellence at Citrus Valley Health Partners, Denise Ronquillo plays a key role in improving quality and ensuring that patients receive excellent and safe care. Over the past two years, she and her colleagues have achieved substantial successes while overcoming resistance and skepticism, and are beginning to see a new culture of quality emerge in their organization.</p>
<p><a href="https://www.minitab.com/News/This-Isn-t-a-Game-We-re-Playing/" target="_blank">This Isn’t a Game We’re Playing</a></p>
<p>Quality improvement is something healthcare providers <em>have</em> to do, says Dr. Sandy Fogel, surgical quality officer at Carilion Clinic.</p>
<p><a href="https://www.minitab.com/en-us/News/Healthcare-Quality--Making-a-Difference-with-Data--A-Conversation-with-Dr--William-H--Woodall/" target="_blank">Healthcare Quality: Making a Difference with Data</a></p>
<p>How can statistics and data analysis help improve outcomes in healthcare? William H. Woodall, professor of statistics at Virginia Tech, has been focused on that question for over ten years.</p>
Blog Posts about Quality Improvement in Health Care
<p><a href="http://blog.minitab.com/blog/understanding-statistics/a-six-sigma-healthcare-project-part-1-examining-factors-with-a-pareto-chart" target="_blank">A Six Sigma Healthcare Project</a></p>
<p>Follow along with a series of blog posts on the application of binary logistic regression in a healthcare Six Sigma project, which had a goal of attracting and retaining more patients in a hospital's cardiac rehabilitation program.</p>
<p><a href="http://blog.minitab.com/blog/michelle-paret/monitoring-rare-events-with-g-charts" target="_blank">Monitoring Rare Events with G and T Charts</a></p>
<p>These charts make it easy to assess the stability of processes that involve rare events and have low defect rates.</p>
<p><a href="http://blog.minitab.com/blog/meredith-griffith/exploring-healthcare-data-part-1" target="_blank">Exploring Healthcare Data</a></p>
<p>Learn several tips for exploring and visualizing your healthcare data in a way that will prepare you for a formal analysis.</p>
Case Studies about Health Care Quality Improvement Projects
<p><a href="https://www.minitab.com/en-us/Case-Studies/Cathay-General-Hospital/" target="_blank">Cathay General Hospital</a></p>
<p>During an assessment of its angioplasty process for patients suffering from heart attacks, Cathay General Hospital in Taipei, Taiwan used Minitab to analyze data to help them introduce new treatment options that led to a decrease in the patients’ hospital stay and an increased savings in medical resources.</p>
<p><a href="https://www.minitab.com/Case-Studies/Riverview-Hospital-Association/" target="_blank">Riverview Hospital Association</a></p>
<p>The Riverview Hospital Association Lean Six Sigma team performed data analysis to identify patient groups who were scoring lower on patient satisfaction survey questions. This allowed the team to target process improvement efforts to specific patient populations.</p>
<p><a href="https://www.minitab.com/en-us/Case-Studies/Franciscan-Hospital-for-Children/" target="_blank">Franciscan Children’s Hospital</a></p>
<p>With the help of Lean Six Sigma and Minitab software, Franciscan Hospital for Children was able to analyze information about its processes and make data-driven decisions that increased dental operating room efficiency and enabled doctors to see more kids.</p>
<p><em>For more on how data analysis and Minitab can be used in healthcare, visit <a href="http://www.minitab.com/healthcare" target="_blank">www.minitab.com/healthcare</a>. </em></p>
Health Care Quality Improvement | Fri, 14 Oct 2016 12:00:00 +0000 | http://blog.minitab.com/blog/real-world-quality-improvement/ways-to-celebrate-healthcare-quality-week | Carly Barry
Pareto Charts Revisited: The Full Truth about the Bars
http://blog.minitab.com/blog/understanding-statistics/pareto-charts-revisited%3A-the-full-truth-about-the-bars
<p>A reader asked a great question in response to a post I wrote <a href="http://blog.minitab.com/blog/understanding-statistics/when-to-use-a-pareto-chart">about Pareto charts</a>. Our readers typically do ask great questions, but this one turned out to be more difficult to answer than it first seemed.</p>
<p>My correspondent wrote: </p>
<blockquote>My understanding is that when you have count data, a bar chart is the way to go. The gaps between the bars emphasize that the data are not measured on a continuous scale. The Pareto chart puts the bars in decreasing size going from left to right. However, the bars now touch, even though the data scale has not changed. I'm just looking for some history or explanation as to why the bars in the Pareto chart touch, which seems to violate basic rules of effective graphing.</blockquote>
<p>In case you're not familiar with all of this, here's a quick, mostly visual recap. A bar chart displays counts of categorical variables. Separating the bars emphasizes the data's categorical nature: </p>
<p style="line-height: 20.8px; margin-left: 40px;"><img alt="bar chart" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/90e6067d7f0a1f4f738462290a05f439/bar_chart.png" style="width: 576px; height: 384px;" /></p>
<p style="line-height: 20.8px;"><span style="line-height: 20.8px;">Assume that for some measurable aspect of our business, we classify measurements from 1-10 as Critical, from 11-20 as Very Important, from 21-30 as Important, and so on. A histogram of integer data corresponding to the counts in the bar chart above looks like this:</span></p>
<p style="line-height: 20.8px; margin-left: 40px;"><img alt="Histogram of Raw Data" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/aee07143f85a2d0a8601ed3530956d24/histogram_of_raw_data.png" style="width: 576px; height: 384px;" /></p>
<p style="line-height: 20.8px;"><span style="line-height: 1.6;">The bars of the histogram touch because they represent continuous data. It makes sense that the bars abut each other, since there's no categorical "gap" between, say, 1 and 2.</span></p>
<p style="line-height: 20.8px;"><span style="line-height: 1.6;">Which brings us to the Pareto chart, whose bins show counts or frequencies of defects—categorical data. And yet, when you produce a Pareto chart in Minitab and most other packages, the bars touch...</span></p>
<p style="line-height: 20.8px; margin-left: 40px;"><span style="line-height: 1.6;"><img alt="Pareto Chart of Category" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/bf0be8506cc30954165e854f24f0ed7d/pareto.png" style="width: 576px; height: 384px;" /></span></p>
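<p>Whether or not the bars touch, the arithmetic behind a Pareto chart is just sorting and accumulation: bars are defect counts in descending order, with a cumulative-percentage line on top. A sketch with invented defect counts:</p>

```python
# Sketch: the data behind a Pareto chart -- counts sorted descending,
# plus each bar's cumulative share of the total. Counts are invented.
defects = {"Scratch": 12, "Dent": 45, "Crack": 8, "Stain": 25, "Chip": 10}

ordered = sorted(defects.items(), key=lambda kv: kv[1], reverse=True)
total = sum(defects.values())

cumulative, running = [], 0
for name, count in ordered:
    running += count
    # (category, count, cumulative percent of all defects)
    cumulative.append((name, count, round(100 * running / total, 1)))
```

<p>The cumulative percentages are what make the "vital few" stand out: the first one or two bars typically account for most of the total.</p>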
Why <em>Do </em>the Bars on the Pareto Chart Touch?
<p>The question had me scratching my head. I checked Minitab's built-in statistical glossary and help files, our web site, and then expanded my search to some reliable statistics resources on the Web. No answer. So I went to my colleague-next-door's office and asked <em>her</em> why the bars on the Pareto chart touch, even though they represent counts or frequencies of categories.</p>
<p>She didn't know, either, but she had a good idea who <em>would </em>know: Dr. Terry Ziemer, who as a senior statistician at Minitab during the 1990s directed development work in the area of industrial statistics. He later became a principal at the Six Sigma Academy and then founded <a href="http://www.sixsi.com/" target="_blank">Six Sigma Intelligence</a>. <span style="line-height: 1.6;">I e-mailed Terry to find out why the bars on Minitab's Pareto chart touch. He quickly replied: </span></p>
<blockquote>I wish there was some big technical answer I could give you, but it was simply a design choice. At the time when I did program this, most of the example Pareto charts I looked at had the bars touching, and I agreed that (at least in my opinion) the chart looks better that way than it does when there are gaps between the bars. Since a Pareto chart is sort of a distribution graph for defect types, it did seem to make sense to make it more like a histogram, another distribution graph where the bars touch, than to make it look like a standard bar chart where you have the gaps.</blockquote>
Bar Charts and Histograms and Paretos, Oh My!
<p><span style="line-height: 20.8px;">So that's why the bars on a Pareto chart in Minitab touch: it was an aesthetic choice, and one that makes perfect sense if we see the Pareto chart as similar to a histogram, in that it shows you the distribution of defect types. </span></p>
<p>If you're saying "But that's not a definitive answer," you're right. Unfortunately, there doesn't seem to <em>be </em>a definitive answer. Looking through the literature revealed advocates both for and against having the Pareto bars touch, but not much in the way of detailed rationales.</p>
<p>For example, <a href="https://books.google.com/books?id=Crqm2AmECD0C&dq=why+do+pareto+bars+touch&source=gbs_navlinks_s" target="_blank"><em>The Practitioner's Guide to Statistics and Lean Six Sigma for Process Improvement</em> by Mikel Harry et al.</a> states on page 171, "The bars in a Pareto chart are arranged side-by-side (touching) in descending order from the left." <em>Why </em>the bars should touch, however, is left unexplained. Joiner Associates' <a href="https://books.google.com/books?id=Mubz8xTERqEC&dq=why+do+pareto+bars+touch&source=gbs_navlinks_s" target="_blank"><em>Pareto Charts: Plain & Simple</em></a> suggests "Having the bars touch makes it easier to judge the relative size or impact of the different parts of the problem." A design choice, again. </p>
<p>On the other hand, page 112 of <a href="https://www.amazon.com/dp/0321817621/" target="_blank"><em>Statistical Reasoning for Everyday Life</em> by Jeffrey Bennett et al.</a>, states "To make the Pareto chart, we put the bars in descending order of size...Because the categories are nominal, the bars should not touch." Clearly, there's still some debate among academics, and<span style="line-height: 1.6;"> if you prefer your Pareto charts with space between bars, you'll find some support—but the touching bars, as implemented in Minitab, do appear to be the more popular option. </span></p>
Project ToolsQuality ImprovementSix SigmaStatisticsWed, 12 Oct 2016 12:00:00 +0000http://blog.minitab.com/blog/understanding-statistics/pareto-charts-revisited%3A-the-full-truth-about-the-barsEston MartzWhen It’s Easier to Open Data in Minitab than in Excel
http://blog.minitab.com/blog/statistics-and-quality-improvement/when-it%E2%80%99s-easier-to-open-data-in-minitab-than-in-excel
<p>On the Minitab Blog, we’ve often discussed getting data into Minitab from Excel. Here's a small sampling, in case you currently have data in Excel:</p>
<ul>
<li><a href="http://blog.minitab.com/blog/the-statistics-of-science/minitab-and-excel-making-the-data-connection">Minitab and Excel: Making the (Data) Connection</a></li>
<li><a href="http://blog.minitab.com/blog/marilyn-wheatleys-blog/linking-minitab-to-excel-to-get-fast-answers">Linking Minitab to Excel to Get Answers Fast</a></li>
<li><a href="http://blog.minitab.com/blog/michelle-paret/3-tips-for-importing-excel-data-into-minitab">3 Tips for Importing Excel Data into Minitab</a></li>
</ul>
<p>But if your data is not in Excel to begin with, moving it into Excel just to prepare it for Minitab isn’t necessarily the best step. Minitab makes it easy to work with data in formats like txt, dat, and csv, so you don’t have to use more than one program.</p>
<p>For example, take the results of a recent study about using a wearable sensor to measure how well human subjects could move from a sitting to a standing position (Lummel et al., 2016). The data are <a href="https://easy.dans.knaw.nl/ui/datasets/id/easy-dataset:65075/tab/2">publicly available in csv format</a>, but with an interesting twist. Although csv traditionally stands for “comma-separated values,” this data set uses semicolons as the separator between values in different variables.</p>
<p>If you open the data set in Excel, you’ll get something like this:</p>
<p style="margin-left: 40px;"><img alt="The Excel file loads all of the data into Column A." src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/463130863a3d9898b6b5323b9a4700af/excel_file2crop.jpg" style="width: 657px; height: 413px;" /></p>
<p>Because all of the data are opened in column A, they’re not ready to analyze. You could change all of the semi-colons to commas and reopen the data set...but it wouldn’t help. You could write formulas to separate the columns out yourself, but that’s at least 1 formula per variable, and there are 30 variables in this data set.</p>
<p>But if you’re planning to analyze the data in Minitab, fixing the data in Excel is unnecessary. In Minitab, all you have to do is indicate that the data have column names and identify the value separator:</p>
<p>In this case, Minitab offers more flexibility in the data that you work with than Excel does. And once all of your data are in their correct columns, you can begin the analysis that will help you to make better decisions. Minitab makes sure that your data’s ready in just a few seconds.</p>
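If you ever do need to handle such a file programmatically, the separator issue is just as easy to work around in code. Here is a minimal sketch using Python's standard <code>csv</code> module; the column names and values are invented for illustration, not taken from the study's data set:

```python
import csv
import io

# A small stand-in for a semicolon-separated "csv" file
# (hypothetical columns, not the real iSTS variables).
raw = "subject;age;stand_time\n1;74;2.1\n2;81;3.4\n"

# csv.reader splits each row correctly once the delimiter is declared.
reader = csv.reader(io.StringIO(raw), delimiter=";")
header = next(reader)
rows = list(reader)

print(header)   # ['subject', 'age', 'stand_time']
print(rows[0])  # ['1', '74', '2.1']
```

The whole fix is the single `delimiter=";"` argument, which plays the same role as identifying the value separator in Minitab's import dialog.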
<p>For general overviews of some other ways to get data into Minitab, check out <a href="http://support.minitab.com/en-us/minitab/17/topic-library/minitab-environment/input-output/open-files-and-import-data/import-data-into-minitab/">Import data into Minitab</a> from the Minitab Support Center.</p>
<p><strong>References</strong></p>
<p>Lummel, MSc R.C. van (McRoberts BV); Walgaard, MSc S. (McRoberts BV); Ainsworth, E. (McRoberts BV) (2016): <em>The Instrumented Sit-to-Stand Test (iSTS) has greater clinical relevance than the Manually Recorded Sit-to-Stand Test in Older Adults</em>. DANS. http://dx.doi.org/10.17026/dans-zcp-kuj7</p>
<p>Rob C van Lummel, Walgaard, S., Maier, A. B., Ainsworth, E., Beek, P. J., & Jaap H van Dieën. (2016). The instrumented sit-to-stand test (iSTS) has greater clinical relevance than the manually recorded sit-to-stand test in older adults.<em> PLoS One, 11</em>(7), e0157968. doi:10.1371/journal.pone.0157968</p>
Data AnalysisQuality ImprovementStatistics HelpMon, 10 Oct 2016 12:00:00 +0000http://blog.minitab.com/blog/statistics-and-quality-improvement/when-it%E2%80%99s-easier-to-open-data-in-minitab-than-in-excelCody SteeleDo You Know the Truth about Gage Repeatability and Reproducibility?
http://blog.minitab.com/blog/michelle-paret/do-you-know-the-truth-about-gage-repeatability-and-reproducibility
<p>The ultimate goal of most quality improvement projects is clear: reducing the number of defects, improving a response, or making a change that benefits your customers.</p>
<p>We often want to jump right in and start gathering and analyzing data so we can solve the problems. Checking your measurement systems first, with methods like attribute agreement analysis or Gage R&R, may seem like a needless waste of time. </p>
<p>But the truth is that a Gage R&R Study is a critical step in <em>any </em>statistical analysis involving continuous data. That's because it allows you to determine if your measurement system for that data is adequate or not. If your measurement system isn’t capable of producing reliable measurements, then any analysis you conduct with those measurements is likely meaningless.</p>
<p>So let’s get to the “R&R” part of <span><a href="http://blog.minitab.com/blog/meredith-griffith/fundamentals-of-gage-rr">Gage R&R</a></span>—Repeatability and Reproducibility.</p>
<p>Suppose we’re measuring pencils with a ruler (which is an excellent hands-on activity you can use to teach Gage R&R). We want to determine if our measurement system can adequately measure the length of these pencils. To conduct a Gage R&R Study, we randomly select 10 pencils and 3 people—Abe, Brenda, and Charlie. Each person measures each pencil 2 times, using the same ruler. This gives us a total of 10 x 3 x 2 = 60 measurements.</p>
<p><img alt="parts and operators" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/c7685645ea8140d6ba67b1496ba57624/parts_and_ops.png" style="width: 548px; height: 293px;" /></p>
Repeatability
<p>Repeatability represents the variation observed when the same operator measures the same part multiple times with the same device. In other words, when Abe repeatedly measures the same pencil with the same ruler, will his measurements be consistent? If he measures 16.8 cm the first time, is he going to measure 16.8 cm the next time he measures that same pencil?</p>
Reproducibility
<p>Reproducibility represents the variation observed when DIFFERENT operators measure the same part multiple times with the same device. In other words, if Abe measures a pencil at 16.8 cm in length, will Brenda also measure 16.8 cm for that same pencil? And what about Charlie?</p>
<p><strong>Helpful Hint: </strong>To remember the difference between repeatability and reproducibility, note that reproducibility includes an ‘o’ – think ‘<strong>o</strong>’ for the variability across “<strong>o</strong>perators.”</p>
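To make the two concepts concrete, here is a deliberately simplified sketch with invented measurements of a single pencil. A real Gage R&R study estimates these variance components with ANOVA across all parts and operators, which this toy calculation does not attempt:

```python
from statistics import mean, pvariance

# Invented readings (cm) of ONE pencil: each operator measures it twice.
measurements = {
    "Abe":     [16.8, 16.9],
    "Brenda":  [17.1, 17.0],
    "Charlie": [16.7, 16.8],
}

# Repeatability: variation within each operator's own repeated
# readings, pooled across the three operators.
repeatability_var = mean(pvariance(vals) for vals in measurements.values())

# Reproducibility: variation between the operators' average readings.
operator_means = [mean(vals) for vals in measurements.values()]
reproducibility_var = pvariance(operator_means)

print(repeatability_var, reproducibility_var)
```

In this made-up data, each operator repeats consistently (small repeatability variance) but the operators disagree with one another (larger reproducibility variance), which is exactly the pattern a Gage R&R study is designed to reveal.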
Answering Important Questions
<p>Gage R&R can help you answer questions such as:</p>
<ul>
<li>Is my measurement system capable of discriminating between parts?</li>
<li>Is the variability in my measurement system small compared with the manufacturing process variability?</li>
<li>How much of the variability in my measurement system is caused by differences between operators?</li>
</ul>
<p>And if your measurement system isn't great, you can also use Gage R&R to determine where the weaknesses are. For example, perhaps a study reveals that while repeatability is good, the reproducibility is poor. You can use Gage R&R to dig deeper and figure out why different operators reported different readings.</p>
<p>To easily set up your Gage R&R data collection plan and analyze the corresponding data to assess your measurement system, check out <a href="http://www.minitab.com/products/minitab">Minitab Statistical Software</a> and its <strong>Stat > Quality Tools > Gage Study</strong> and <strong>Assistant > Measurement Systems Analysis</strong> features.</p>
Data AnalysisFun StatisticsLean Six SigmaQuality ImprovementSix SigmaStatisticsStatistics HelpStatsFri, 07 Oct 2016 12:00:00 +0000http://blog.minitab.com/blog/michelle-paret/do-you-know-the-truth-about-gage-repeatability-and-reproducibilityMichelle Paret5 More Powerful Insights from Noted Quality Leaders
http://blog.minitab.com/blog/understanding-statistics/5-more-powerful-insights-from-noted-quality-leaders
<p>We hosted our first-ever Minitab Insights conference in September, and if you were among the attendees, you already know the caliber of the speakers and the value of the information they shared. Experts from a wide range of industries offered a lot of great lessons about how they use data analysis to improve business practices and solve a variety of problems.<img alt="tips from Minitab Insights 2016" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/394dfef193debd958deb2011edaaac16/insights_takeaways1.gif" style="width: 354px; height: 250px; margin: 10px 15px; float: right;" /></p>
<p>I blogged earlier about <a href="http://blog.minitab.com/blog/understanding-statistics/5-powerful-insights-from-noted-quality-leaders">five key takeaways</a> gleaned from the sessions at the Minitab Insights 2016 conference. But that was just the tip of the iceberg, and participants learned many more helpful things that are well worth sharing. So here are five <em>more </em>helpful, challenging, and thought-provoking ideas and suggestions that we heard during the event.</p>
Improve Your Skills while Improving Yourself!
<p>Everyone has personal goals they'd like to achieve, such as getting fit, changing a habit, or writing a book. Rod Toro, deployment leader at <a href="http://www.minitab.com/en-us/Case-Studies/Edward-Jones/?cta=6675">Edward Jones</a>, explained how challenging himself and his team to apply Lean and Six Sigma tools to their personal goals has helped them better understand the underlying principles of quality improvement, gain deeper insights through personalized learning, and expand their ability to apply quality methods in a variety of circumstances and situations. </p>
We Can't Claim the Null Hypothesis Is True.
<p>Minitab technical training specialist Scott Kowalski reminded us that when we test a hypothesis with statistics, "<span><a href="http://blog.minitab.com/blog/understanding-statistics/things-statisticians-say-failure-to-reject-the-null-hypothesis">failing to reject the null</a></span>" does not prove that the null hypothesis <em>is </em>true. It only means we don't have enough evidence to reject it. We need to keep this in mind when we interpret our results, and to be careful how we explain our findings to others. We also need to be sure our hypotheses are clearly stated, and that we've selected the appropriate test for our task!</p>
Outliers Won't Just Be Ignored, So You'd Better Investigate Them.
<p>We've all seen them in our data: those <a href="http://blog.minitab.com/blog/michelle-paret/how-to-identify-outliers-and-get-rid-of-them">troublesome observations</a> that just don't want to belong, lurking off in the margins, maybe with one or two other loners. It can be tempting to ignore or just delete those observations, but Larry Bartkus, senior distinguished engineer at Edwards Lifesciences, provided vivid illustrations of the drastic impact outliers can have on the results of an analysis. He also reminded us of the value of slowing down, questioning our assumptions, looking at the data in several ways, and trying to understand <em>why </em>our data is the way it is. </p>
Attribute Agreement Analysis Is Just One Option.
<p>When we need to assess how well an attribute measurement system performs, attribute agreement analysis is the go-to method—but Thomas Rust, reliability engineer at Autoliv, demonstrated that many more options are available. In encouraging quality practitioners to "break the attribute paradigm," Rust detailed four innovative ways to assess an attribute measurement system: measure an underlying variable; attribute measurement of a variable product; variable measurement of an attribute product; and attribute measurement of an attribute product.</p>
Minitab Users Do Great Things.
<p>More than anything else, what we took away from Minitab Insights 2016 was an even greater appreciation for the people who are using our software in innovative ways—to increase the quality of the products we use every day, to raise the level of service we receive from businesses and organizations, to increase the efficiency and safety of our healthcare providers, and so much more.</p>
<p>Watch for more stories and ideas from the Minitab Insights conference in future issues of Minitab News, and on the Minitab Blog.</p>
Data AnalysisInsightsLean Six SigmaProject ToolsQuality ImprovementSix SigmaStatisticsWed, 05 Oct 2016 12:00:00 +0000http://blog.minitab.com/blog/understanding-statistics/5-more-powerful-insights-from-noted-quality-leadersEston MartzWhy Shrewd Experts "Fail to Reject the Null" Every Time
http://blog.minitab.com/blog/understanding-statistics/why-shrewd-experts-fail-to-reject-the-null-every-time
<p><img alt="nulls angels: the toughest statisticians around!" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/d2c0571a-acbd-48c7-84f4-222276c293fe/Image/509959f8406d59b3bb31f686aeb3b6b0/nulls_angels.jpg" style="margin: 10px 15px; float: right; width: 175px; height: 198px;" />I watched an old <a href="https://en.wikipedia.org/wiki/The_Wild_Angels" target="_blank">motorcycle flick from the 1960s</a> the other night, and I was struck by the bikers' slang. They had a language all their own. Just like statisticians, whose manner of speaking often confounds those who aren't hep to the lingo of data analysis.</p>
<p>It got me thinking...what if there were an all-statistician biker gang? Call them the Nulls Angels. Imagine them in their colors, tearing across the countryside, analyzing data and asking the people they encounter on the road about whether they "fail to reject the null hypothesis."</p>
<p>If you point out how strange that phrase sounds, the Nulls Angels will <em>know</em> you're not cool...and not very aware of statistics.</p>
<p>Speaking purely as an editor, I acknowledge that "failing to reject the null hypothesis" <em>is</em> cringe-worthy. "Failing to reject" seems like an overly complicated equivalent to <em>accept</em>. At minimum, it's clunky phrasing.</p>
<p>But it turns out those rough-and-ready statisticians in the Nulls Angels have good reason to talk like that. From a <em>statistical</em> perspective, it's undeniably accurate—and replacing "failure to reject" with "accept" would just be wrong.</p>
What <em>Is </em>the Null Hypothesis, Anyway?
<p>Hypothesis tests include one- and two-sample t-tests, tests for association, tests for normality, and many more. (All of these tests are available under the <strong>Stat</strong><span> menu in Minitab <a href="http://www.minitab.com">statistical software</a>. Or, if you want a little more <a href="http://www.minitab.com/en-us/products/minitab/assistant">statistical guidance</a>, the Assistant can lead you through common hypothesis tests step-by-step.)</span></p>
<p>A hypothesis test examines two propositions: the null hypothesis (or H0 for short), and the alternative (H1). The <em>alternative </em>hypothesis is what we hope to support. We presume that the null hypothesis is true, unless the data provide sufficient evidence that it is not.</p>
<p>You've heard the phrase "Innocent until proven guilty." That means innocence is assumed until guilt is proven. In statistics, the null hypothesis is taken for granted until the alternative is proven true.</p>
So Why Do We "Fail to Reject" the Null Hypothesis?
<p>That brings up the issue of "proof."</p>
<p>The degree of statistical evidence we need in order to “prove” the alternative hypothesis is the <a href="http://blog.minitab.com/blog/michelle-paret/alphas-p-values-confidence-intervals-oh-my">confidence level</a>. The confidence level is 1 minus our risk of committing a Type I error, which occurs when you incorrectly reject the null hypothesis when it's true. Statisticians call this risk alpha, and also refer to it as the significance level. The typical alpha of 0.05 corresponds to a 95% confidence level: we're accepting a 5% chance of rejecting the null even if it is true. (In life-or-death matters, we might <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/alpha-male-vs-alpha-female">lower the risk of a Type I error to 1% or less</a>.)</p>
<p>Regardless of the alpha level we choose, any hypothesis test has only two possible outcomes:</p>
<ol>
<li><strong>Reject the null hypothesis</strong> and conclude that the alternative hypothesis is true at the 95% confidence level (or whatever level you've selected).<br />
</li>
<li><strong>Fail to reject the null hypothesis</strong> and conclude that <em>not</em> enough evidence is available to suggest the null is false at the 95% confidence level.</li>
</ol>
<p>We often use a <a href="http://blog.minitab.com/blog/understanding-statistics/three-things-the-p-value-cant-tell-you-about-your-hypothesis-test">p-value</a> to decide if the data support the null hypothesis or not. If the test's p-value is less than our selected alpha level, we reject the null. Or, as statisticians say "When the p-value's low, the null must go."</p>
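The two possible outcomes boil down to a single comparison. A trivial sketch of the decision rule:

```python
def hypothesis_decision(p_value: float, alpha: float = 0.05) -> str:
    """Return the only two conclusions a hypothesis test can reach."""
    if p_value < alpha:
        return "reject the null hypothesis"
    # Note: this branch is NOT the same as accepting the null;
    # it only says the evidence was insufficient to reject it.
    return "fail to reject the null hypothesis"

print(hypothesis_decision(0.03))  # p below alpha: reject
print(hypothesis_decision(0.20))  # p above alpha: fail to reject
```

Notice that "accept the null hypothesis" never appears as a possible return value: no p-value, however large, can produce that conclusion.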
<p>This still doesn't explain <em>why</em> a statistician won't "accept the null hypothesis." Here's the bottom line: failing to reject the null hypothesis does not mean the null hypothesis <em>is</em> true. That's because a hypothesis test does not determine <em>which</em> hypothesis is true, or even which is most likely: it <em>only</em> assesses whether evidence exists to reject the null hypothesis.</p>
<img alt="&quot;My hypothesis is Null until proven Alternative, sir!&quot;" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/d2c0571a-acbd-48c7-84f4-222276c293fe/Image/a07b85370986a3dd126ac4d021775d13/trial.jpg" style="border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 300px; height: 200px;" />"Null Until Proven Alternative"
<p>Hark back to "innocent until proven guilty." As the data analyst, you are the judge. The hypothesis test is the trial, and the null hypothesis is the defendant. The alternative hypothesis is the prosecution, which needs to make its case <em>beyond a reasonable doubt</em> (say, with 95% certainty).</p>
<p>If the trial evidence does not show the defendant is guilty, neither has it proved that the defendant <em>is</em> innocent. However, based on the available evidence, you can't reject that <em>possibility</em>. So how would you announce your verdict?</p>
<p>"Not guilty."</p>
<p>That phrase is perfect: "Not guilty" doesn't say the defendant <em>is</em> innocent, because that has not been proven. It just says the prosecution couldn't convince the judge to abandon the assumption of innocence.</p>
<p>So "failure to reject the null" is the statistical equivalent of "not guilty." In a trial, the burden of proof falls to the prosecution. When analyzing data, the entire burden of proof falls to your sample data. "Not guilty" does not mean "innocent," and "failing to reject" the null hypothesis is quite distinct from "accepting" it. </p>
<p>So if a group of marauding statisticians in their Nulls Angels leathers ever asks, keep yourself in their good graces, and show that you know "failing to reject the null" is not "accepting the null."</p>
Fun StatisticsHypothesis TestingStatisticsStatistics HelpMon, 03 Oct 2016 12:00:00 +0000http://blog.minitab.com/blog/understanding-statistics/why-shrewd-experts-fail-to-reject-the-null-every-timeEston Martz5 Powerful Insights from Noted Quality Leaders
http://blog.minitab.com/blog/understanding-statistics/5-powerful-insights-from-noted-quality-leaders
<p>If you were among the 300 people who attended the first-ever Minitab Insights conference in September, you already know how powerful it was. Attendees learned how practitioners from a wide range of industries use data analysis to address a variety of problems, find solutions, and improve business practices.<img alt="Minitab Insights 2016" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/63fae4d971c6146398481577788e5ee7/insights.gif" style="width: 317px; height: 232px; margin: 10px 15px; float: right;" /></p>
<p>In the coming weeks and months, we will share more of the great insights and guidance shared by our speakers and attendees. But here are five helpful, challenging, and thought-provoking ideas and suggestions that we heard during the event.</p>
You Can Get More Information from VOC Data.
<p>Joel Smith of the Dr. Pepper Snapple Group used the assessment of different beers to show how applying the tools in Minitab can help a business move from raw Voice of the Customer (VOC) data to actionable insights. His presentation showed how to use graphical analysis and descriptive statistics to clean observational VOC data, and then how to use <a href="http://blog.minitab.com/blog/quality-data-analysis-and-statistics/cluster-analysis-tips">cluster analysis</a>, <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/multivariate/principal-components-and-factor-analysis/what-is-pca/">principal component analysis</a>, and <a href="http://blog.minitab.com/blog/adventures-in-statistics/a-tribute-to-regression-analysis">regression analysis</a> to make informed decisions about how to create a better product. </p>
Consider Multiple Ways to Show Results.
<p><span style="line-height: 20.8px;">Graphs are often part of </span>a Minitab analysis, but a graph may not be the <em>only </em>way to visualize your results. Think about your audience and your communication goals when choosing and customizing your graphs, suggested Rip Stauffer, senior consultant at Management Science and Innovation. He showed examples of how the same information comes across very differently when presented in various charts, and when colors, thicknesses, and styles are selected carefully. Along the way, he also illustrated Minitab's flexibility in tailoring the appearance of a graph to fit your needs. </p>
Quality Methods Make Great Sales Tools.
<p>We hear all the time about the impact of quality improvement methods on manufacturing. But what about using statistical analysis to boost sales? Andrew Mohler from global chemical company Buckman explained how training technical sales associates to use data analysis and Minitab has transformed the company's business. <a href="http://www.minitab.com/Case-Studies/Buckman/">Empowering the sales team to help <em>customers </em>improve their processes</a> has enabled the company to provide more value and to drive sales—boosting the bottom line.</p>
Data-Driven Cultures Have Risks, Too.
<p>In the quality improvement world, we tend to think that transforming an organization's culture so everyone understands the value of data analysis only brings benefits. But Richard Titus, a consultant and adjunct instructor at Lehigh University who has worked with <a href="http://www.minitab.com/crayola/">Crayola</a>, Ingersoll-Rand, and many other organizations, highlighted potential traps for organizations with a high level of statistical knowledge. These include trying to find data to fit favored answer(s); working as a "lone ranger" independent of a team; failing to map and measure processes; not selecting a primary metric to measure success; searching for a "silver bullet;" and trying to outsmart the process. </p>
When Subgroup Sizes Are Large, Use P' Charts.
<p>T. C. Simpson and M. E. Rusak from Air Products illustrated how using a traditional P chart to monitor a transactional process can lead to problems if you have a large subgroup size. False alarms or failure to detect special-cause variation can result from overdispersion or underdispersion in your data when your subgroup sizes are large. You can avoid these risks with a Laney P' control chart, which uses calculations that account for large subgroups. <a href="http://blog.minitab.com/blog/understanding-statistics/ready-for-prime-time%3A-use-p-and-u-charts-to-avoid-false-alarms">Learn more about the Laney P' char</a>t. </p>
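The adjustment behind the Laney P' chart can be sketched in a few lines. This is a simplified illustration with invented counts; Minitab's implementation computes the same Sigma Z adjustment but handles the per-subgroup limits and plotting for you:

```python
from math import sqrt

# Invented defective counts and (large) subgroup sizes.
defectives = [48, 55, 61, 50, 57]
sizes = [1000, 1100, 1050, 980, 1020]

p = [d / n for d, n in zip(defectives, sizes)]
p_bar = sum(defectives) / sum(sizes)

# Standardize each proportion, as a traditional P chart would.
sigma_p = [sqrt(p_bar * (1 - p_bar) / n) for n in sizes]
z = [(pi - s_i and (pi - p_bar) / s_i) for pi, s_i in zip(p, sigma_p)]
z = [(pi - p_bar) / s_i for pi, s_i in zip(p, sigma_p)]

# Sigma Z: short-term variation of the z-values, estimated from the
# average moving range (1.128 is the d2 constant for spans of 2).
moving_ranges = [abs(z[i] - z[i - 1]) for i in range(1, len(z))]
sigma_z = (sum(moving_ranges) / len(moving_ranges)) / 1.128

# P' limits scale the binomial limits by Sigma Z, so overdispersed
# data get wider limits and fewer false alarms.
limits = [(p_bar - 3 * s * sigma_z, p_bar + 3 * s * sigma_z)
          for s in sigma_p]
print(sigma_z)
```

When the data show no overdispersion, Sigma Z is close to 1 and the P' limits are essentially the ordinary P chart limits; with large, overdispersed subgroups, Sigma Z grows and the limits widen accordingly.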
<p>Watch for more stories, tips, and ideas from the Minitab Insights conference in future issues of <a href="http://www.minitab.com/news/">Minitab News</a>, and on the Minitab Blog.</p>
Data AnalysisInsightsLean Six SigmaQuality ImprovementSix SigmaStatisticsFri, 30 Sep 2016 12:00:00 +0000http://blog.minitab.com/blog/understanding-statistics/5-powerful-insights-from-noted-quality-leadersEston MartzHow to Save a Failing Regression with PLS
http://blog.minitab.com/blog/statistics-and-quality-improvement/fix-problems-in-regression-analysis-with-partial-least-squares
<p>Face it, you love regression analysis as much as I do. Regression is one of the most satisfying analyses in <a href="http://www.minitab.com/en-US/products/minitab/free-trial/">Minitab</a>: get some predictors that should have a relationship to a response, go through a model selection process, interpret fit statistics like adjusted R2 and predicted R2, and make predictions. Yes, regression really is quite wonderful.</p>
<p>Except when it’s not. Dark, seedy corners of the data world exist, lying in wait to make regression confusing or impossible. Good old ordinary least squares regression, to be specific.</p>
<p>For instance, sometimes you have a lot of <em>detail</em> in your data, but not a lot of data. Want to see what I mean?</p>
<ol>
<li>In Minitab, choose <strong>Help > Sample Data...</strong></li>
<li>Open Soybean.mtw.</li>
</ol>
<p><img alt="Soybeans" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/e9bae86907cd8194ecf16b7622cf98bb/edamame_by_zesmerelda_in_chicago.jpg" style="float: right; width: 200px; height: 133px; border-width: 1px; border-style: solid; margin: 10px 15px;" />The data set has 88 variables about soybeans, the results of near-infrared (NIR) spectroscopy at different wavelengths. But it contains only 60 measurements, and the data are arranged to set aside 6 of them for validation runs.</p>
A Limit on Coefficients
<p>With ordinary least squares regression, you can estimate at most as many coefficients as the data have samples. Thus the traditional method, satisfactory in most cases, would let you estimate only 53 coefficients for variables plus a constant coefficient, because only 54 rows are available for fitting.</p>
<p>This could leave you wondering whether any of the other possible terms contain information that you need.</p>
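The limit is easy to see in miniature: with two data points, a model with two coefficients (slope plus constant) already passes through the data exactly, leaving nothing with which to estimate anything else. A toy sketch:

```python
# Two observations can support at most two coefficients.
points = [(1.0, 2.0), (3.0, 8.0)]

(x1, y1), (x2, y2) = points
slope = (y2 - y1) / (x2 - x1)
intercept = y1 - slope * x1

# The fit is exact: every residual is zero, so no degrees of freedom
# remain to estimate error or any additional term.
residuals = [y - (intercept + slope * x) for x, y in points]
print(slope, intercept, residuals)  # 3.0 -1.0 [0.0, 0.0]
```

Scale the same arithmetic up to 54 usable rows and you hit the 53-coefficients-plus-constant ceiling described above, with no way to ask about the remaining variables.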
Multicollinearity
<p>The NIR measurements are also highly collinear with each other. This <a href="http://blog.minitab.com/blog/understanding-statistics/handling-multicollinearity-in-regression-analysis">multicollinearity</a> complicates using statistical significance to choose among the variables to include in the model.</p>
<p>When the data have more variables than samples, especially when the predictor variables are highly collinear, it’s a good time to consider partial least squares regression.</p>
How to Perform Partial Least Squares Regression
<p>Try these steps if you want to follow along in Minitab Statistical Software using the soybean data:</p>
<ol>
<li>Choose <strong>Stat > Regression > Partial Least Squares</strong>.</li>
<li>In <strong>Responses</strong>, enter <em>Fat</em>.</li>
<li>In <strong>Model</strong>, enter <em>‘1’-‘88’</em>.</li>
<li>Click <strong>Options</strong>.</li>
<li>Under <strong>Cross-Validation</strong>, select <strong>Leave-one-out</strong>. Click OK.</li>
<li>Click <strong>Results</strong>.</li>
<li>Check <strong>Coefficients</strong>. Click <strong>OK </strong>twice.</li>
</ol>
<p>One of the great things about partial least squares regression is that it forms components and then does ordinary least squares regression with them. Thus the results include statistics that are familiar. For example, <a href="http://blog.minitab.com/blog/adventures-in-statistics/multiple-regession-analysis-use-adjusted-r-squared-and-predicted-r-squared-to-include-the-correct-number-of-variables">predicted R2</a> is the criterion that Minitab uses to choose the number of components.</p>
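Predicted R² itself has a simple definition: refit the model with each observation left out, predict that observation, and compare the accumulated prediction error (PRESS) to the total variation. Here is a pure-Python sketch for a one-predictor least squares model, using invented data; Minitab performs the analogous calculation for each candidate number of PLS components:

```python
def fit_line(xs, ys):
    """Ordinary least squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def predicted_r2(xs, ys):
    """1 - PRESS / SStot, with PRESS from leave-one-out refits."""
    press = 0.0
    for i in range(len(xs)):
        x_tr = xs[:i] + xs[i + 1:]
        y_tr = ys[:i] + ys[i + 1:]
        b, a = fit_line(x_tr, y_tr)
        press += (ys[i] - (a + b * xs[i])) ** 2
    my = sum(ys) / len(ys)
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - press / ss_tot

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
print(predicted_r2(xs, ys))
```

Because each point is predicted by a model that never saw it, predicted R² punishes overfitting in a way ordinary R² cannot, which is why it makes a sensible criterion for choosing the number of components.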
<p style="margin-left: 40px;"><br />
<img alt="Minitab selects the model with the highest predicted R-squared." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/12f2493e350eb84a657035b915a5f45f/model_selection.gif" style="width: 476px; height: 194px;" /></p>
<p>Each of the 9 components in the model that maximizes the predicted R2 value is a complex linear combination of all 88 of the variables. So although the ANOVA table shows that you’re using only 9 degrees of freedom for the regression, the analysis uses information from all of the data.</p>
<p style="margin-left: 40px;"><img alt="The regression uses 9 degrees of freedom." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/ce90634261a6cd8994f8e72682473d74/anova.gif" style="width: 381px; height: 113px;" /></p>
<p> The full list of standardized coefficients shows the relative importance of each predictor in the model. (I’m only showing a portion here because the table is 88 rows long.)</p>
<p style="margin-left: 40px;"><br />
<img alt="Each variable has a standardized coefficient." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/b881dc2c5a4b26fa7330a0dbd9e70c8a/coefficients.gif" style="width: 255px; height: 284px;" /></p>
<p>Ordinary least squares regression is a great tool that’s allowed people to make lots of good decisions over the years. But there are times when it’s not satisfying. Got too much detail in your data? Partial least squares regression could be the answer.</p>
<p>Want more partial least squares regression now? Check out how <a href="http://www.minitab.com/en-US/Case-Studies/Unifi-Manufacturing-Inc/">Unifi used partial least squares to improve their processes faster</a>.</p>
<span style="color:#a9a9a9;">The image of the soybeans is by Tammy Green </span><span style="color:#a9a9a9;">and is licensed for reuse under this</span> <a href="http://creativecommons.org/licenses/by-sa/2.0/deed.en">Creative Commons License</a>.
Data AnalysisRegression AnalysisStatisticsWed, 28 Sep 2016 12:00:00 +0000http://blog.minitab.com/blog/statistics-and-quality-improvement/fix-problems-in-regression-analysis-with-partial-least-squaresCody SteeleValidating Process Changes with Design of Experiments (DOE)
http://blog.minitab.com/blog/real-world-quality-improvement/validating-process-changes-with-design-of-experiments-doe
<p>We’ve got a plethora of <a href="https://www.minitab.com/en-us/company/case-studies/" target="_blank">case studies</a> showing how businesses from different industries solve problems and implement solutions with data analysis. Take a look for ideas about how you can use data analysis to ensure excellence at your business!</p>
<p>Boston Scientific, one of the world’s leading developers of medical devices, is just one organization that has shared its story. A team at its Heredia, Costa Rica facility was able to assess and validate a packaging process, which resulted in a streamlined process and a cost-saving redesign of the packaging.</p>
<p>Below is a brief look at how they did it, but you can also take a look at the full case study at <a href="https://www.minitab.com/Case-Studies/Boston-Scientific/" target="_blank">https://www.minitab.com/Case-Studies/Boston-Scientific/</a>.</p>
Their Challenge
<p><img alt="guidewires in pouch" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ccb8f6d6-3464-4afb-a432-56c623a7b437/Image/03b6326dcb90a56ca905abbc2526f38c/guidewires.jpg" style="width: 233px; height: 174px; float: right;" />Boston Scientific Heredia evaluates its operations regularly, to maintain process efficiency and contribute to affordable healthcare by reducing costs. At this facility, one packaging engineer led an effort to streamline packaging for guidewires—which are used during procedures such as catheter placement or endoscopic diagnoses—with the introduction of a new, smaller plastic pouch.</p>
<p>Using smaller and different packaging materials for their guidewires would substantially reduce material costs, but the company needed to prove that the new pouches would work with their sealing process, which creates a barrier that keeps the guidewires sterile.</p>
How Data Analysis Helped
<p>To ensure that the seal strength for the smaller pouches met or exceeded standards, they evaluated the process and identified several important factors, such as the temperature of the sealing system. They then used a statistical method called <a href="http://blog.minitab.com/blog/doe" target="_blank">Design of Experiments (DOE)</a> to determine how each of the variables affected the quality of the pouch seal.</p>
<p>The DOE revealed which factors were most critical. Below is a Minitab <a href="http://blog.minitab.com/blog/understanding-statistics/when-to-use-a-pareto-chart" target="_blank">Pareto Chart</a> that identified the factors that significantly affect seal strength: front temperature, rear temperature, and their respective two-way interaction.</p>
<p><img alt="https://www.minitab.com/uploadedImages/Content/Case_Studies/EffectsParetoforAveragePull.jpg" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ccb8f6d6-3464-4afb-a432-56c623a7b437/Image/abd4d05d00cf48c8b22ecc37e1264e93/pareto_chart.jpg" style="border-width: 0px; border-style: solid; width: 600px; height: 400px;" /></p>
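<p>An analysis of this kind can be sketched in Python with statsmodels. The data below are invented (the factor names mirror the case study, but the effect sizes are assumptions); the absolute t-statistics play the role of the standardized effects that a Pareto chart of effects would rank.</p>

```python
# Sketch: replicated 2x2 full factorial DOE fit by OLS with an interaction term.
# Synthetic seal-strength data; effect sizes are invented for illustration.
import itertools
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
# full factorial in coded units (-1/+1), replicated 4 times
runs = pd.DataFrame(
    [fr for fr in itertools.product([-1, 1], [-1, 1])] * 4,
    columns=["front", "rear"],
)
# assumed true model: both temperatures and their interaction affect seal strength
runs["seal_strength"] = (
    10 + 2.0 * runs.front + 1.5 * runs.rear + 0.8 * runs.front * runs.rear
    + rng.normal(scale=0.5, size=len(runs))
)

model = smf.ols("seal_strength ~ front * rear", data=runs).fit()
# |t| values are the standardized effects a Pareto chart would display
print(model.tvalues.abs().sort_values(ascending=False))
print(model.pvalues)
```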
<p>Armed with this knowledge, the team devised optimal process settings to ensure the new pouches had strong seals. To verify the effectiveness of the improved process, they used a statistical tool called capability analysis, which demonstrates whether or not a process meets specifications and can produce good results:</p>
<p><img alt="https://www.minitab.com/uploadedImages/Content/Case_Studies/ProcessCapabilityofHighSettings-SealStrength.jpg" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ccb8f6d6-3464-4afb-a432-56c623a7b437/Image/c4d94b5ee153c1d3e38757565a5d24c2/process_capa.jpg" style="border-width: 0px; border-style: solid; width: 600px; height: 400px;" /></p>
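<p>The core of a capability analysis is comparing process spread to the specification limits. Here is a minimal sketch of the Cp and Cpk calculations; the measurements and spec limits are hypothetical, not Boston Scientific's actual values.</p>

```python
# Sketch: process capability indices from sample data.
# Data and spec limits are invented for illustration.
import numpy as np

rng = np.random.default_rng(2)
seal_strength = rng.normal(loc=12.0, scale=0.5, size=100)  # hypothetical pull-test data
lsl, usl = 10.0, 14.0                                      # hypothetical spec limits

mean, sd = seal_strength.mean(), seal_strength.std(ddof=1)
cp = (usl - lsl) / (6 * sd)                    # potential capability (spread only)
cpk = min(usl - mean, mean - lsl) / (3 * sd)   # actual capability (spread + centering)
print(f"Cp = {cp:.2f}, Cpk = {cpk:.2f}")
```

A common rule of thumb treats values of 1.33 or higher as capable; Cpk is always at or below Cp, and the gap between them reflects how far the process is off center.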
Results
<p>The analysis showed that guidewires packaged using the new, optimal process settings met, and even exceeded, the minimum seal strength requirements.</p>
<p>With the new pouches, Boston Scientific has saved more than $330,000. “At the end of the day,” a key team member noted, “the more money we save, the more additional savings we can pass on to the people we serve.”</p>
<p><em>For another example of how Boston Scientific uses data analysis to ensure the safety and reliability of its products, read <a href="https://www.minitab.com/Case-Studies/Boston-Scientific-Heredia/" target="_blank">Pulling Its Weight: Tensile Testing Challenge Speeds Regulatory Approval for Boston Scientific</a>, a story about how the company used Minitab Statistical Software to confirm the equivalency of its catheter’s pull-wire strength to previous testing results, and eliminate the need to perform test method validation by leveraging its existing tension testing standard.</em></p>
Data AnalysisDesign of ExperimentsQuality ImprovementStatisticsStatsMon, 26 Sep 2016 12:00:00 +0000http://blog.minitab.com/blog/real-world-quality-improvement/validating-process-changes-with-design-of-experiments-doeCarly BarryDescriptive vs. Inferential Statistics: When Is a P-value Superfluous?
http://blog.minitab.com/blog/statistics-and-quality-data-analysis/descriptive-vs-inferential-statistics-when-is-a-p-value-superfluous
<p>True or false: When comparing a parameter for two sets of measurements, you should always use a hypothesis test to determine whether the difference is statistically significant.</p>
<p>The answer? (<em>drumroll...</em>) True!</p>
<p>...and False!</p>
<p>To understand this paradoxical answer, you need to keep in mind the difference between samples and populations, and between descriptive and inferential statistics. </p>
Descriptive Statistics and Populations
<p>Consider the fictional countries of Glumpland and Dolmania.</p>
<p style="text-align: center;"><img alt="Welcome to Glumpland!" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c1f88e0e6d3e4e55684392ec5a8069e8/glumpland.jpg" style="width: 350px; height: 232px;" /></p>
<img alt="wkshet" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/47e5470dd8123218763ac3666f64bbdd/glumpland_dolmania_wkshet.jpg" style="line-height: 20.8px; width: 222px; height: 579px; float: right;" />
<p>The population of Glumpland is 8,442,012. The population of Dolmania is 6,977,201. For each country, the age of every citizen (to the nearest tenth), <a href="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/080981611ba11403dc8fde411e81d150/glumpland_and_dolmania_ages.mpj">is recorded in a cell of a Minitab worksheet</a>. </p>
<p>Using <strong>Stat > Basic Statistics > Display Descriptive Statistics</strong>, we can quickly calculate the mean age of each country.</p>
<p style="margin-left: 40px;"><img alt="desc stats" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/1a791dd23ba85673193f20c2c9971fa4/mean_age_glump_and_dol.jpg" style="width: 316px; height: 96px;" /></p>
<p>It looks like Dolmanians are, on average, more youthful than Glumplanders. But is this difference in means statistically significant?</p>
<p>To find out, we might be tempted to evaluate these data using a <span><a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-t-tests%3A-1-sample%2C-2-sample%2C-and-paired-t-tests">2-sample t-test</a></span>.</p>
<p>Except for one thing: there's absolutely no point in doing that.</p>
<p>That's because these calculated means <em>are</em> the means of the entire populations. So we already know that the population means differ.</p>
<p>Another example. Suppose a baseball player gets 213 hits in 680 at bats in 2015, and 178 hits in 532 at bats in 2016.</p>
<p>Would you need a 2-proportions test to determine whether the difference in batting averages (.313 vs .335) is statistically significant? Of course not.</p>
<p>You've already calculated the proportions using all the data for the entire two seasons. There's nothing more to extrapolate. And yet you often see a hypothesis test applied in this type of situation, in the mistaken belief that if there's no p-value, the results aren't "solid" or "statistical" enough.</p>
<p>But if you've collected every possible piece of data for a population, that's about as solid as you can get!</p>
Inferential Statistics and Random Samples
<p>Now suppose that draconian budget cuts have made it infeasible to track and record the age of every resident in Glumpland and Dolmania. <span style="line-height: 1.6;">What can they do? </span></p>
<p><span style="line-height: 1.6;">Quite a lot, actually. They can apply inferential statistics, which is based on random sampling, to make reliable estimates without those millions of data values they don't have.</span></p>
<p>To see how it works, use <strong>Calc > Random Data > Sample from columns</strong> in Minitab. Randomly sample 50 values from the 8,442,012 values in column C1, which contains the ages of the entire population of Glumpland. Then use descriptive statistics to calculate the mean of the sample.</p>
<p>Here are the results for one random sample of 50:</p>
<p style="margin-left: 40px;"><strong>Descriptive Statistics: GPLND (50)</strong><br />
<span style="line-height: 1.6;">Variable Mean</span><br />
<span style="line-height: 1.6;">GPLND(50) 52.37</span></p>
<p>The sample mean, 52.37, is slightly less than the true mean age of 53 for the entire population of Glumpland. What about another random sample of 50?</p>
<p style="margin-left: 40px;"><strong>Descriptive Statistics: GPLND (50) </strong><br />
<span style="line-height: 1.6;">Variable Mean</span><br />
<span style="line-height: 1.6;">GPLND(50) 54.11</span></p>
<p>Hmm. This sample mean of 54.11 slightly <em>overshoots</em> the true population mean of 53.</p>
<p>Even though the sample estimates are in the ballpark of the true population mean, we're seeing some variation. <span style="line-height: 1.6;">How much variation can we expect? Using descriptive statistics alone, we have no inkling of how "close" a sample estimate might be to the truth. </span></p>
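<p>You can simulate this variation yourself. The sketch below uses a synthetic population of ages (standing in for the Glumpland worksheet) and draws 1,000 random samples of 50; the sample means scatter around the population mean by roughly &sigma;/&radic;n.</p>

```python
# Sketch: sampling variability of the mean.
# Synthetic population; parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
# ~1 million synthetic ages centered at 53
population = rng.normal(loc=53, scale=15, size=1_000_000).clip(0, 110)
true_mean = population.mean()

sample_means = [rng.choice(population, size=50, replace=False).mean()
                for _ in range(1000)]
print("population mean:", round(true_mean, 2))
print("SD of the 1000 sample means:", round(np.std(sample_means), 2))
# that SD is close to the theoretical sigma / sqrt(n) = 15 / sqrt(50), about 2.1
```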
Enter...the Confidence Interval
<p>To quantify the precision of a sample estimate for the population, we can use a powerful tool in inferential statistics: the confidence interval.</p>
<p>Suppose you take random samples of size 5, 10, 20, 50, and 100 from Glumpland and Dolmania using <strong>Calc > Random Data > Sample from columns</strong>. Then use <strong>Graph > Interval Plot > Multiple Ys</strong> to display the 95% confidence intervals for the mean of each sample.</p>
<p>Here's what the interval plots look like for the random samples in my worksheet.</p>
<p style="margin-left: 40px;"><img alt="interval plot Glumpland" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/262031cc398ee9d48031fe1f43b38bdf/interval_plot_of_glumpland.jpg" style="line-height: 20.8px; width: 576px; height: 384px;" /></p>
<p style="margin-left: 40px;"><img alt="Interval plot Dolmania" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/75440d94eaff64a63e338b480029945b/interval_plot_of_dolmania.jpg" style="width: 576px; height: 384px;" /></p>
<p>Your plots will look different based on your random samples, but you should notice a similar pattern: The sample mean estimates (the blue dots) tend to vary more from the population mean as the sample sizes decrease. To compensate for this, the intervals "stretch out" more and more, to ensure the same 95% overall probability of "capturing" the true population mean.</p>
<p>The larger samples produce narrower intervals. In fact, using only 50-100 data values, we can closely estimate the mean of over 8.4 million values, and get a general sense of how precise the estimate is likely to be. That's the incredible power of random sampling and inferential statistics!</p>
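<p>The same widening-and-narrowing pattern shows up in a few lines of Python. This sketch computes t-based 95% confidence intervals for the mean at each sample size, using a synthetic population in place of the Glumpland worksheet.</p>

```python
# Sketch: 95% confidence intervals for the mean at several sample sizes.
# Synthetic population; not the Glumpland data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
population = rng.normal(loc=53, scale=15, size=1_000_000)  # synthetic ages

widths = {}
for n in (5, 10, 20, 50, 100):
    sample = rng.choice(population, size=n, replace=False)
    lo, hi = stats.t.interval(0.95, n - 1,
                              loc=sample.mean(), scale=stats.sem(sample))
    widths[n] = hi - lo
    print(f"n={n:3d}  mean={sample.mean():6.2f}  95% CI = ({lo:6.2f}, {hi:6.2f})")
```

Notice how the intervals from the smallest samples "stretch out," just as in the interval plots above.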
<p>To display side-by-side confidence intervals of the mean estimates for Glumpland and Dolmania, you can use an interval plot with groups.</p>
<p style="margin-left: 40px;"><img alt="interval plot side by side" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/9e6348c87befdaf6434dbe80e8257516/interval_plot_of_age_side_by_side.jpg" style="width: 576px; height: 384px;" /></p>
<p>Now, you might be tempted to use these results to infer whether there's a statistically significant difference in the mean age of the populations of Glumpland and Dolmania. But don't. Confidence intervals can be misleading for that purpose.</p>
<p>For that, we need another powerful tool of inferential statistics...</p>
Enter...the hypothesis test and p-value
<p>The 2-sample t-test is used to determine whether there is a statistically significant difference in the means of the populations from which the two random samples were drawn. The following table shows the t-test results for each pair of same-sized samples from Glumpland and Dolmania. As the sample size increases, notice what happens to the p-value and the confidence interval for the difference between the population means.</p>
<p style="margin-left: 40px;"><img alt="t tests" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/7c1bf45756a7fb621094086e5350fef9/2_sample_t_test.jpg" style="width: 526px; height: 757px;" /></p>
<p>Again, the confidence intervals tend to get wider as the samples get smaller. With smaller samples, we're less certain of the precision of the estimate for the difference.</p>
<p>In fact, only for the two largest random samples (N=50 and N=100) is the p-value less than a 0.05 level of significance, allowing us to conclude that the mean ages of Glumplanders and Dolmanians are statistically different. For the three smallest samples (N=20, N=10, N=5), the p-value is greater than 0.05, and the confidence interval for each of these small samples includes 0. Therefore, we cannot conclude that there is a difference in the population means.</p>
<p>But remember, we already know that the true population means actually <em>do</em> differ by 5.4 years. We just can't statistically "prove" it with the small samples. That's why statisticians bristle when someone says, "The p-value is not less than 0.05. Therefore, there's no significant difference between the groups." There might very well be. So it's safer to say, especially with small samples, "<em>we don't have enough evidence </em>to conclude that there's a significant difference between the groups."</p>
<p>It's not just a matter of nit-picky semantics. It's simply the truth, as you can see when you take random samples of various sizes from the same known populations and test them for a difference.</p>
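<p>Here is a quick way to run that experiment yourself: two synthetic populations whose means truly differ by 5.4 years, tested with a 2-sample t-test (Welch's, which does not assume equal variances) at several sample sizes.</p>

```python
# Sketch: 2-sample t-tests on samples of various sizes from populations
# whose means truly differ by 5.4. Synthetic stand-ins for the two countries.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
glumpland = rng.normal(loc=53.0, scale=15, size=500_000)
dolmania = rng.normal(loc=47.6, scale=15, size=500_000)   # true difference: 5.4

pvals = {}
for n in (5, 10, 20, 50, 100):
    a = rng.choice(glumpland, size=n, replace=False)
    b = rng.choice(dolmania, size=n, replace=False)
    t, p = stats.ttest_ind(a, b, equal_var=False)   # Welch's 2-sample t-test
    pvals[n] = p
    print(f"n={n:3d}  p-value = {p:.3f}")
# With small samples, the test often fails to detect a difference that
# really exists -- absence of evidence, not evidence of absence.
```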
Wrap-up
<p>If you have a random sample, you should accompany estimates of statistical parameters with a confidence interval and a p-value whenever possible. Without them, there's no way to know whether you can safely extrapolate to the entire population. But if you already know every value of the population, you're good to go. You don't need a p-value, a t-test, or a CI—any more than you need a clue to determine what's inside a box, if you already know what's in it.</p>
Data AnalysisHypothesis TestingLearningStatisticsFri, 23 Sep 2016 12:08:00 +0000http://blog.minitab.com/blog/statistics-and-quality-data-analysis/descriptive-vs-inferential-statistics-when-is-a-p-value-superfluousPatrick RunkelProblems Using Data Mining to Build Regression Models
http://blog.minitab.com/blog/adventures-in-statistics/problems-using-data-mining-to-build-regression-models
<p><img alt="Picture of mining truck filled with numbers" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/644d98694f1e6fec63d4f1db6b61a074/data_mining_crop.jpg" style="width: 250px; height: 171px; float: right; margin: 10px 15px;" />Data mining uses algorithms to explore correlations in data sets. An automated procedure sorts through large numbers of variables and includes them in the model based on statistical significance alone. No thought is given to whether the variables and the signs and magnitudes of their coefficients make theoretical sense.</p>
<p>We tend to think of data mining in the context of big data, with its huge databases and servers stuffed with information. However, it can also occur on the smaller scale of a research study.</p>
<p>The comment below is a real one that illustrates this point.</p>
<blockquote>“Then, I moved to the Regression menu and there I could add all the terms I wanted and more. Just for fun, I added many terms and performed backward elimination. Surprisingly, some terms appeared significant and my R-squared Predicted shot up. To me, your concerns are all taken care of with R-squared Predicted. If the model can still predict without the data point, then that's good.”</blockquote>
<p>Comments like this are common and emphasize the temptation to select regression models by trying as many different combinations of variables as possible and seeing which model produces the best-looking statistics. The overall gist of this type of comment is, "What could possibly be wrong with using data mining to build a regression model if the end results are that all the p-values are significant and the various types of R-squared values are all high?"</p>
<p>In this blog post, I’ll illustrate the problems associated with using data mining to build a regression model in the context of a smaller-scale analysis.</p>
An Example of Using Data Mining to Build a Regression Model
<p>My first order of business is to prove to you that data mining can have severe problems. I really want to bring the problems to life so you'll be leery of using this approach. Fortunately, this is simple to accomplish because I can use data mining to make it appear that a set of randomly generated <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/regression-models/what-are-response-and-predictor-variables/" target="_blank">predictor variables</a> explains most of the changes in a randomly generated <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/regression-models/what-are-response-and-predictor-variables/" target="_blank">response variable</a>!</p>
<p>To do this, I’ll create a worksheet in Minitab statistical software that has 100 columns, each of which contains 30 rows of entirely random data. In Minitab, you can use <strong>Calc > Random Data > Normal</strong> to create your own worksheet with random data, or you can use <a href="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/File/c740effad4cc27dc6580093ea6c070fd/randomdata.mtw">this worksheet</a> that I created for the data mining example below. (If you don’t have Minitab and want to try this out, <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">get the free 30 day trial!</a>)</p>
<p>Next, I’ll perform <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-smackdown-stepwise-versus-best-subsets" target="_blank">stepwise regression</a> using column 1 as the response variable and the other 99 columns as the potential predictor variables. This scenario produces a situation where stepwise regression is forced to dredge through 99 variables to see what sticks, which is a key characteristic of data mining.</p>
<p>When I perform stepwise regression, the procedure adds 28 variables that explain 100% of the variance! Because we only have 30 observations, we’re clearly overfitting the model. Overfitting the model is a different problem that also inflates R-squared, which you can read about in my post about <a href="http://blog.minitab.com/blog/adventures-in-statistics/the-danger-of-overfitting-regression-models" target="_blank">the dangers of overfitting models</a>.</p>
<p>I’m specifically addressing the problems of data mining in this post, so I don’t want a model that is also overfit. To avoid an overfit model, a good rule of thumb is to include no more than one term for each 10 observations. We have 30 observations, so I’ll include only the first three variables that the stepwise procedure adds to the model: C7, C77, and C95. The output for the first three steps is below.</p>
<p style="margin-left: 40px;"><img alt="Stepwise regression output" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/e4fb01237dd0c8b34496dde3cc28b517/stepwise_swo.png" style="width: 498px; height: 251px;" /></p>
<p>Under step 3, we can see that all of the <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients" target="_blank">coefficient p-values</a> are statistically significant. The <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit" target="_blank">R-squared</a> value of 67.54% can either be good or mediocre depending on your field of study. In a real study, there are likely to be some real effects mixed in that would boost the R-squared even higher. We can also look at <a href="http://blog.minitab.com/blog/adventures-in-statistics/multiple-regession-analysis-use-adjusted-r-squared-and-predicted-r-squared-to-include-the-correct-number-of-variables" target="_blank">the adjusted and predicted R-squared values</a> and neither one suggests a problem.</p>
<p>If we look at the model building process of steps 1 - 3, we see that at each step all of the R-squared values increase. That’s what we like to see. For good measure, let’s graph the relationship between the predictor (C7) and the response (C1). After all, seeing is believing, right?</p>
<p style="margin-left: 40px;"><img alt="Scatterplot of two variables in regression model" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/6e4dfb991b33031738756d4b2d1c77e4/scatterplot.png" style="width: 576px; height: 384px;" /></p>
<p>This graph looks good too! It sure appears that as C7 increases, C1 tends to increase, which agrees with the positive <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients" target="_blank">regression coefficient</a> in the output. If we didn’t know better, we’d think that we have a good model!</p>
<p>This example answers the question posed at the beginning: what could possibly be wrong with this approach? Data mining can produce deceptive results. The statistics and graph all look good but these results are based on entirely random data with absolutely no real effects. Our regression model suggests that random data explain other random data even though that's impossible. Everything looks great but we have a lousy model.</p>
The problems associated with using data mining are real, but how the heck do they happen? And, how do you avoid them? <a href="http://blog.minitab.com/blog/adventures-in-statistics/problems-using-data-mining-to-build-regression-models-part-two">Read my next post</a> to learn the answers to these questions!ANOVAData AnalysisRegression AnalysisStatisticsStatistics HelpStatsWed, 21 Sep 2016 12:00:00 +0000http://blog.minitab.com/blog/adventures-in-statistics/problems-using-data-mining-to-build-regression-modelsJim FrostWhatever Happened to…the Ozone Hole?
http://blog.minitab.com/blog/statistics-and-quality-data-analysis/whatever-happened-to%E2%80%A6the-ozone-hole
<p><img alt="ozone hole" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/36ffdad1772934f71f8550dc13d4deca/ozone_hole.jpg" style="width: 300px; height: 279px; float: right; margin: 10px 15px;" />Today, September 16, is <a href="https://en.wikipedia.org/wiki/International_Day_for_the_Preservation_of_the_Ozone_Layer" target="_blank">World Ozone Day</a>. You don't hear much about the ozone layer any more.</p>
<p>In fact, if you’re under 30, you might think this is just another trivial, obscure observance, along the lines of <a href="https://www.daysoftheyear.com/days/international-dot-day/" target="_blank">International Dot Day</a> (yesterday) or <a href="http://www.nationaldaycalendar.com/national-apple-dumpling-day-september-17/" target="_blank">National Apple Dumpling Day</a> (tomorrow).</p>
<p>But there’s a good reason that, almost 30 years ago, the United Nations designated today as a day to raise awareness of the ozone layer: unlike dots and apple dumplings, this fragile shield of gas in the stratosphere, which acts as a natural sunscreen against dangerous levels of UV radiation, is critical to sustaining life on our planet. </p>
<p>In this post, we'll join the efforts of educators around the globe who organize special activities on this day, by using Minitab to statistically analyze ozone-related data. You can follow along using the data in <a href="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/b268a053e038bb53a94a5b38360899be/world_ozone_day.mpj" target="_blank">this Minitab project</a>. If you don't already have it, you can <a href="https://www.minitab.com/products/minitab/free-trial/">download Minitab here and use it free for 30 days</a>.</p>
Orthogonal Regression: Can You Trust Your Data?
<p><img alt="NIST data" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/385e642e6552abb2aa033b9194f0ae08/nist_data_worksheet.jpg" style="width: 158px; height: 298px; float: right; margin: 10px 15px;" />Before you analyze data, it's important to verify that your measuring system is accurate. Orthogonal regression, also known as Deming regression, is a tool used to evaluate whether two instruments or methods provide comparable measurements.</p>
<p>The following sample data is from the <a href="http://www.itl.nist.gov/div898/strd/lls/data/Norris.shtml" target="_blank">National Institute of Standards (NIST) web site</a>. The predictor variable <span style="line-height: 20.8px;">(x)</span> is the NIST measurement of ozone concentration. The response variable (y) is the measurement of ozone concentration using a customer's measuring device.</p>
<p>In Minitab, choose <strong>Stat > Regression > Orthogonal Regression</strong>. Enter <em>C1</em> as the <strong>Response (Y)</strong> and <em>NIST</em> as the <strong>Predictor (X)</strong>. Enter 1.5 as the <strong>Error Variance ratio (Y/X)</strong> and click <strong>OK</strong>.</p>
<p><em><strong>Note</strong>: The error variance ratio is based on historic data, not the sample data. Because the ratio is not available for these data, we'll use 1.5 purely for illustrative purposes. To learn more about this ratio, and how to estimate it, see the comments following <a href="http://blog.minitab.com/blog/real-world-quality-improvement/orthogonal-regression-testing-the-equivalence-of-instruments" target="_blank">this Minitab blog post</a>. </em></p>
<em><strong><span style="line-height: 1.6;">Orthogonal Regression Analysis: Device versus NIST </span></strong></em>
<span style="line-height: 20.8px; font-size: 13px;">The fitted line plot shows the two sets of measurements appear almost identical. That's about as good as it gets:</span><strong><span style="line-height: 1.6;"><img alt="fitted plot" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/480385f6576f9121e8d90788f8ee8c9e/nist_plot_with_fitted_line.jpg" style="width: 576px; height: 384px; margin: 10px 15px;" /></span></strong>
<p><span style="line-height: 20.8px;">Now look at the numerical output. If there's perfect correlation, and no bias, you'd expect to see a constant value of 0 and a slope of 1 in the regression equation. </span></p>
<p style="margin-left: 40px;">Error Variance Ratio (Device/NIST): 1.5</p>
<p style="margin-left: 40px;">Regression Equation<br />
Device = -<strong><span style="color:#0000FF;"> 0.263</span></strong> + <strong><span style="color:#FF0000;">1.002</span></strong> NIST</p>
<p style="margin-left: 40px;">Coefficients</p>
<p style="margin-left: 40px;">Predictor Coef SE Coef Z P Approx 95% CI<br />
<strong><span style="color:#0000FF;">Constant</span></strong> -0.26338 0.232819 -1.1313 0.258 <strong><span style="color:#0000FF;">(-0.71969, 0.19294)</span></strong><br />
<strong><span style="color:#FF0000;">NIST </span></strong> 1.00212 0.000430 2331.6058 0.000 <strong><span style="color:#FF0000;">( 1.00128, 1.00296)</span></strong></p>
<p><span style="line-height: 1.6;">To assess this, look at the 95% confidence intervals for the coefficients. The confidence interval for the constant includes 0. The confidence interval for the predictor variable (NIST) is extremely close to 1, but does not include 1. Technically, there is some bias, although it may be too small to be relevant. In cases like this, rely on your practical knowledge in the field to determine whether the amount of bias is important. </span></p>
<p><span style="line-height: 1.6;">I'm no ozone expert, but given the sample measurements</span><span style="line-height: 1.6;">, I'd speculate</span><span style="line-height: 1.6;"> that this tiny amount of bias is not critical. </span></p>
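<p>Deming (orthogonal) regression is simple enough to compute directly. The sketch below implements the closed-form slope with an assumed error variance ratio of 1.5, mirroring the Minitab analysis above; the data are synthetic, not the NIST ozone measurements.</p>

```python
# Sketch: Deming (orthogonal) regression with a known error variance ratio.
# Synthetic instrument-comparison data; variable names echo the example above.
import numpy as np

def deming(x, y, var_ratio=1.5):
    """Deming regression; var_ratio = var(y errors) / var(x errors)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxx, syy = np.var(x, ddof=1), np.var(y, ddof=1)
    sxy = np.cov(x, y, ddof=1)[0, 1]
    d = syy - var_ratio * sxx
    slope = (d + np.sqrt(d ** 2 + 4 * var_ratio * sxy ** 2)) / (2 * sxy)
    return y.mean() - slope * x.mean(), slope   # (intercept, slope)

rng = np.random.default_rng(7)
truth = rng.uniform(0, 500, size=36)                      # true concentrations
nist = truth + rng.normal(scale=2.0, size=truth.size)     # reference instrument
device = truth + rng.normal(scale=2.0 * np.sqrt(1.5), size=truth.size)

b0, b1 = deming(nist, device, var_ratio=1.5)
print(f"Device = {b0:.3f} + {b1:.4f} NIST")
```

For an unbiased device, the intercept should be near 0 and the slope near 1; unlike ordinary least squares, Deming regression accounts for measurement error in both variables.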
Plotting the Size of the Ozone Hole
<p>Usually holes just get bigger over time. Like the holes in my socks and sweaters. </p>
<p><span style="line-height: 1.6;">But what about the size of the hole in the ozone layer above Antarctica? </span></p>
<p>As part of the Ozone Hole Watch project, NASA scientists have been tracking the size of the ozone hole of the Southern Hemisphere for years. I copied the data into a Minitab project, and then used <strong>Graph > Time Series Plot > Multiple</strong> to plot both the mean ozone hole area and the maximum daily ozone hole area, by year. </p>
<p style="margin-left: 40px;"><img alt="Time series plot " src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/30ea9b376e2fe868bd2af2b71996ba78/time_series_plot_of_ozone_hole_size.jpg" style="width: 576px; height: 384px;" /></p>
<p>The plot shows why the ozone hole was such a big deal back in the 1980's. The size of the hole was increasing at extremely high rates, trending toward a potential environmental crisis. No wonder, then, that on September 16, 1987, the United Nations adopted the <a href="https://en.wikipedia.org/wiki/Montreal_Protocol" target="_blank" title="Montreal Protocol">Montreal Protocol</a>, an international agreement to reduce ozone-depleting substances such as chlorofluorocarbons. That agreement, <span style="line-height: 20.8px;">eventually signed by nearly 200 nations, is credited with stabilizing the size of the ozone hole at the end of the 20th century, </span>according to <a href="http://research.noaa.gov/News/NewsArchive/LatestNews/TabId/684/ArtMID/1768/ArticleID/10741/Report-telltale-signs-that-ozone-layer-is-recovering-.aspx" target="_blank">NASA</a> and the <a href="http://ozone.unep.org/Assessment_Panels/SAP/SAP2014_Assessment_for_Decision-Makers.pdf" target="_blank">World Meteorological Organization</a>. </p>
One-Way ANOVA: Seasonal Changes in the Ozone Layer
<p>The ozone layer is not static; it varies by latitude, season, and stratospheric conditions. The "typical" thickness of the ozone layer is about 300 Dobson units (DU). </p>
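<p>A Dobson unit has a concrete physical meaning: 1 DU corresponds to a layer of pure ozone 0.01 mm thick at standard temperature and pressure. A quick sketch of the conversion:</p>

```python
# Convert a total ozone column in Dobson units (DU) to millimeters of
# pure ozone at standard temperature and pressure (1 DU = 0.01 mm).
MM_PER_DU = 0.01

def du_to_mm(dobson_units):
    """Thickness of the ozone column, in mm of pure ozone at STP."""
    return dobson_units * MM_PER_DU

print(du_to_mm(300))  # the "typical" 300 DU column is only about 3 mm of pure ozone
```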
<p>The Lauder Ozone worksheet in the Minitab project linked above contains random samples of <a href="http://data.mfe.govt.nz/" target="_blank">total ozone column measurements taken in Lauder, New Zealand in 2013</a>. For this analysis, the seasons are defined as Summer = Dec-Feb, Fall = Mar-May, Winter = Jun-Aug, and Spring = Sep-Nov. </p>
<p>To evaluate whether there are statistically significant differences in mean ozone by season using Minitab, choose <strong>Stat > ANOVA > One-Way...</strong> In the dialog box, select <strong>Response data are in a separate column for each factor level</strong>. In <strong>Responses</strong>, enter <em>Summer</em>, <em>Fall</em>, <em>Winter</em>, and <em>Spring</em>. Click <strong>Options</strong> and uncheck <strong>Assume equal variances</strong>. Click <strong>Comparisons</strong> and check <strong>Games-Howell</strong>. After you click <strong>OK</strong> in each dialog box, Minitab returns the following output.</p>
<p style="margin-left: 40px;"><img alt="interval plot ozone" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/7399a0d7c95eb4250406327dfd1d0a52/interval_plot_of_ozone.jpg" style="width: 576px; height: 384px;" /></p>
<p style="margin-left: 40px;"><img alt="ozone session window" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/8eba7cc1fb6fa6b4c70cbe436924ad92/ozone_session_window.jpg" style="width: 433px; height: 552px;" /></p>
<p>At the 0.05 level of significance, the p-value (reported as 0.000, meaning less than 0.0005) is less than alpha. Thus, we can conclude that there is a statistically significant difference in mean ozone thickness by season. The plot shows that mean ozone is lowest in Summer and Fall, and highest in Spring. </p>
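<p>For readers without Minitab, the test behind this output — Welch's one-way ANOVA, which Minitab runs when <strong>Assume equal variances</strong> is unchecked — can be sketched from scratch. The seasonal samples below are hypothetical stand-ins, not the Lauder data:</p>

```python
# Welch's one-way ANOVA: the F statistic and degrees of freedom computed
# directly from Welch's (1951) formulas. Comparing the F statistic against
# an F(df1, df2) distribution yields the p-value.
from statistics import fmean, variance

def welch_anova(groups):
    """Return Welch's F statistic and its two degrees of freedom."""
    k = len(groups)
    n = [len(g) for g in groups]
    m = [fmean(g) for g in groups]
    v = [variance(g) for g in groups]            # sample variances
    w = [ni / vi for ni, vi in zip(n, v)]        # precision weights n_i / s_i^2
    big_w = sum(w)
    grand = sum(wi * mi for wi, mi in zip(w, m)) / big_w
    num = sum(wi * (mi - grand) ** 2 for wi, mi in zip(w, m)) / (k - 1)
    lam = sum((1 - wi / big_w) ** 2 / (ni - 1) for wi, ni in zip(w, n))
    denom = 1 + 2 * (k - 2) * lam / (k ** 2 - 1)
    df2 = (k ** 2 - 1) / (3 * lam)
    return num / denom, k - 1, df2

# Hypothetical ozone samples (DU) by season -- NOT the Lauder measurements.
summer = [288, 290, 285, 292, 287]
fall   = [286, 289, 284, 291, 288]
winter = [305, 310, 300, 308, 303]
spring = [325, 330, 320, 328, 322]

f_stat, df1, df2 = welch_anova([summer, fall, winter, spring])
print(f"Welch's F = {f_stat:.1f} with df = ({df1}, {df2:.1f})")
```

<p>A large F relative to an F(df1, df2) distribution indicates that at least one seasonal mean differs; Minitab's Games-Howell comparisons then identify which specific pairs of seasons differ.</p>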
<p><span style="line-height: 1.6;">Look at the 95% confidence intervals (CI). Are any seasons likely to have a mean ozone thickness less than 300 DU? Greater than 300 DU? Based on the pairwise comparisons chart, for which seasons does the mean ozone layer significantly differ?</span></p>
<p><span style="line-height: 1.6;">The ozone layer is just one factor in the myriad complex relationships between human activity and the global environment. So these analyses are just the tip of the iceberg</span>—one that's<span style="line-height: 1.6;"> melting as we speak.</span></p>
Data AnalysisLearningStatisticsStatistics in the NewsFri, 16 Sep 2016 12:00:00 +0000http://blog.minitab.com/blog/statistics-and-quality-data-analysis/whatever-happened-to%E2%80%A6the-ozone-holePatrick Runkel