Data Analysis Software | Minitab
Blog posts and articles with tips for using statistical software to analyze data for quality improvement.
http://blog.minitab.com/blog/data-analysis-software/rss
Mon, 26 Sep 2016 01:54:15 +0000
FeedCreator 1.7.3
Problems Using Data Mining to Build Regression Models
http://blog.minitab.com/blog/adventures-in-statistics/problems-using-data-mining-to-build-regression-models
<p><img alt="Picture of mining truck filled with numbers" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/644d98694f1e6fec63d4f1db6b61a074/data_mining_crop.jpg" style="width: 250px; height: 171px; float: right; margin: 10px 15px;" />Data mining uses algorithms to explore correlations in data sets. An automated procedure sorts through large numbers of variables and includes them in the model based on statistical significance alone. No thought is given to whether the variables and the signs and magnitudes of their coefficients make theoretical sense.</p>
<p>We tend to think of data mining in the context of big data, with its huge databases and servers stuffed with information. However, it can also occur on the smaller scale of a research study.</p>
<p>The comment below is a real one that illustrates this point.</p>
<blockquote>“Then, I moved to the Regression menu and there I could add all the terms I wanted and more. Just for fun, I added many terms and performed backward elimination. Surprisingly, some terms appeared significant and my R-squared Predicted shot up. To me, your concerns are all taken care of with R-squared Predicted. If the model can still predict without the data point, then that's good.”</blockquote>
<p>Comments like this are common and emphasize the temptation to select regression models by trying as many different combinations of variables as possible and seeing which model produces the best-looking statistics. The overall gist of this type of comment is, "What could possibly be wrong with using data mining to build a regression model if the end results are that all the p-values are significant and the various types of R-squared values are all high?"</p>
<p>In this blog post, I’ll illustrate the problems associated with using data mining to build a regression model in the context of a smaller-scale analysis.</p>
An Example of Using Data Mining to Build a Regression Model
<p>My first order of business is to prove to you that data mining can have severe problems. I really want to bring the problems to life so you'll be leery of using this approach. Fortunately, this is simple to accomplish because I can use data mining to make it appear that a set of randomly generated <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/regression-models/what-are-response-and-predictor-variables/" target="_blank">predictor variables</a> explains most of the changes in a randomly generated <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/regression-models/what-are-response-and-predictor-variables/" target="_blank">response variable</a>!</p>
<p>To do this, I’ll create a worksheet in Minitab statistical software that has 100 columns, each of which contains 30 rows of entirely random data. In Minitab, you can use <strong>Calc > Random Data > Normal</strong> to create your own worksheet with random data, or you can use <a href="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/File/c740effad4cc27dc6580093ea6c070fd/randomdata.mtw">this worksheet</a> that I created for the data mining example below. (If you don’t have Minitab and want to try this out, <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">get the free 30 day trial!</a>)</p>
<p>Next, I’ll perform <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-smackdown-stepwise-versus-best-subsets" target="_blank">stepwise regression</a> using column 1 as the response variable and the other 99 columns as the potential predictor variables. This scenario produces a situation where stepwise regression is forced to dredge through 99 variables to see what sticks, which is a key characteristic of data mining.</p>
<p>When I perform stepwise regression, the procedure adds 28 variables that explain 100% of the variance! Because we only have 30 observations, we’re clearly overfitting the model. Overfitting the model is a different problem that also inflates R-squared, which you can read about in my post about <a href="http://blog.minitab.com/blog/adventures-in-statistics/the-danger-of-overfitting-regression-models" target="_blank">the dangers of overfitting models</a>.</p>
<p>I’m specifically addressing the problems of data mining in this post, so I don’t want a model that is also overfit. To avoid an overfit model, a good rule of thumb is to include no more than one term for each 10 observations. We have 30 observations, so I’ll include only the first three variables that the stepwise procedure adds to the model: C7, C77, and C95. The output for the first three steps is below.</p>
<p style="margin-left: 40px;"><img alt="Stepwise regression output" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/e4fb01237dd0c8b34496dde3cc28b517/stepwise_swo.png" style="width: 498px; height: 251px;" /></p>
<p>Under step 3, we can see that all of the <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients" target="_blank">coefficient p-values</a> are statistically significant. The <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit" target="_blank">R-squared</a> value of 67.54% can either be good or mediocre depending on your field of study. In a real study, there are likely to be some real effects mixed in that would boost the R-squared even higher. We can also look at <a href="http://blog.minitab.com/blog/adventures-in-statistics/multiple-regession-analysis-use-adjusted-r-squared-and-predicted-r-squared-to-include-the-correct-number-of-variables" target="_blank">the adjusted and predicted R-squared values</a> and neither one suggests a problem.</p>
<p>If we look at the model building process of steps 1 - 3, we see that at each step all of the R-squared values increase. That’s what we like to see. For good measure, let’s graph the relationship between the predictor (C7) and the response (C1). After all, seeing is believing, right?</p>
<p style="margin-left: 40px;"><img alt="Scatterplot of two variables in regression model" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/6e4dfb991b33031738756d4b2d1c77e4/scatterplot.png" style="width: 576px; height: 384px;" /></p>
<p>This graph looks good too! It sure appears that as C7 increases, C1 tends to increase, which agrees with the positive <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients" target="_blank">regression coefficient</a> in the output. If we didn’t know better, we’d think that we have a good model!</p>
<p>This example answers the question posed at the beginning: what could possibly be wrong with this approach? Data mining can produce deceptive results. The statistics and graph all look good but these results are based on entirely random data with absolutely no real effects. Our regression model suggests that random data explain other random data even though that's impossible. Everything looks great but we have a lousy model.</p>
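To see the mechanics outside of Minitab, here is a rough, self-contained Python sketch of the same experiment. This is not the author's workflow, and greedy residual-based selection is only an approximation of true stepwise regression, but it reproduces the trap: three columns of pure noise end up "explaining" a substantial share of another noise column.

```python
import random

random.seed(1)
n_rows, n_cols = 30, 100

# A worksheet of pure noise: column 0 plays the response, columns 1..99
# are the candidate predictors (mimicking C1..C100 in the example).
data = [[random.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_cols)]
y = data[0]

def simple_fit_residuals(resp, x):
    """Residuals from a simple linear regression of resp on x."""
    mx = sum(x) / len(x)
    my = sum(resp) / len(resp)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, resp))
    b = sxy / sxx
    return [yi - (my + b * (xi - mx)) for xi, yi in zip(x, resp)]

def r_squared(resp, resid):
    """1 - SSE/SST for a response and its model residuals."""
    my = sum(resp) / len(resp)
    sst = sum((yi - my) ** 2 for yi in resp)
    return 1 - sum(r * r for r in resid) / sst

# Greedy forward selection: at each of three steps, keep whichever unused
# column shrinks the current residuals the most.
selected, resid = [], list(y)
for _ in range(3):
    best = max((c for c in range(1, n_cols) if c not in selected),
               key=lambda c: r_squared(resid, simple_fit_residuals(resid, data[c])))
    resid = simple_fit_residuals(resid, data[best])
    selected.append(best)

print("selected columns:", selected)
print("R-squared:", round(r_squared(y, resid), 3))
```

Even though every column is independent noise, dredging through 99 candidates reliably yields a three-variable model with a respectable-looking R-squared.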
The problems associated with using data mining are real, but how the heck do they happen? And how do you avoid them? Read my next post to learn the answers to these questions!
ANOVA | Data Analysis | Regression Analysis | Statistics | Statistics Help | Stats
Wed, 21 Sep 2016 12:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/problems-using-data-mining-to-build-regression-models
Jim Frost
When to Use a Pareto Chart
http://blog.minitab.com/blog/understanding-statistics/when-to-use-a-pareto-chart
<p>I confess: I'm not a natural-born decision-maker. Some people—my wife, for example—can assess even very complex situations, consider the options, and confidently choose a way forward. Me? I get anxious about deciding what to eat for lunch. So you can imagine what it used to <span style="line-height: 1.6;">be like when I needed to confront a really big decision or problem. My approach, to paraphrase the Byrds, was "Re: everything, churn, churn, churn."<img alt="question to answer" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/1b29ab96a420030f3551f71a26773259/question.jpg" style="width: 250px; height: 181px; margin: 10px 15px; float: right;" /></span></p>
<p>Thank heavens for Pareto charts.</p>
What Is a Pareto Chart, and How Do You Use It?
<p>A Pareto chart is a basic quality tool that helps you identify the most frequent defects, complaints, or any other factor you can <strong>count </strong>and <strong>categorize</strong>. The chart takes its name from Vilfredo Pareto, originator of the "80/20 rule," which postulates that, roughly speaking, 20 percent of the people own 80 percent of the wealth. Or, in quality terms, 80 percent of the losses come from 20 percent of the causes.</p>
<p><span style="line-height: 20.8px;">You can use a Pareto chart any time you have data that are broken down into categories, and you can count how often each category occurs. As children, most of us learned how to use this kind of data to make a bar chart:</span></p>
<p style="margin-left: 40px;"><img alt="bar chart" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/90e6067d7f0a1f4f738462290a05f439/bar_chart.png" style="width: 576px; height: 384px;" /></p>
<p>A Pareto chart is just a bar chart that arranges the bars (counts) from largest to smallest, from left to right. The categories or factors symbolized by the bigger bars on the left are more important than those on the right.</p>
<p style="margin-left: 40px;"><img alt="Pareto Chart" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/bf0be8506cc30954165e854f24f0ed7d/pareto.png" style="width: 576px; height: 384px;" /></p>
<p>By ordering the bars from largest to smallest, a Pareto chart helps you visualize which factors comprise the 20 percent that are most critical—the "vital few"—and which are the "trivial many."</p>
<p>A cumulative percentage line helps you judge the added contribution of each category. If a Pareto effect exists, the cumulative line rises steeply for the first few defect types and then levels off. In cases where the bars are approximately the same height, the cumulative percentage line makes it easier to compare categories.</p>
<p>It's common sense to focus on the ‘vital few’ factors. In the quality improvement arena, Pareto charts help teams direct their efforts where they can make the biggest impact. By taking a big problem and breaking it down into smaller pieces, a Pareto chart reveals where our efforts will create the most improvement.</p>
<p>If a Pareto chart seems rather basic, well, it is. But like a simple machine, its very simplicity makes the Pareto chart applicable to a very wide range of situations, both within and beyond quality improvement.</p>
Use a Pareto Chart Early in Your Quality Improvement Process
<p>At the leadership or management level, Pareto charts can be used at the start of a new round of quality improvement to figure out what business problems are responsible for the most complaints or losses, and dedicate improvement resources to those. Collecting and examining data like that can often result in surprises and upend an organization's "conventional wisdom." For example, leaders at one company believed that the majority of customer complaints involved product defects. But when they saw the complaint data in a Pareto chart, it showed that many more people complained about shipping delays. Perhaps the impression that defects caused the most complaints arose because the relatively few people who received defective products tended to complain very loudly—but since more customers were affected by shipping delays, the company's energy was better devoted to solving that problem.</p>
Use a Pareto Chart Later in Your Quality Improvement Process
<p>Once a project has been identified, and a team assembled to improve the problem, a Pareto chart can help the team select the appropriate areas to focus on. This is important because most business problems are big and multifaceted. For instance, shipping delays may occur for a wide variety of reasons, from mechanical breakdowns and accidents to data-entry mistakes and supplier issues. If there are many possible causes a team could focus on, it's smart to collect data about which categories account for the biggest number of incidents. That way, the team can choose a direction based on the numbers and not the team's "gut feeling."</p>
Use a Pareto Chart to Build Consensus
<p>Pareto charts also can be very helpful in resolving conflicts, particularly if a project involves many moving parts or crosses over many different units or work functions. Team members may have sharp disagreements about how to proceed, either because they wish to defend their own departments or because they honestly believe they <em>know </em>where the problem lies. For example, a hospital project improvement team was stymied in reducing operating room delays because the anesthesiologists blamed the surgeons, while the surgeons blamed the anesthesiologists. When the project team collected data and displayed it in a Pareto chart, it turned out that neither group accounted for a large proportion of the delays, and the team was able to stop finger-pointing. Even if the chart had indicated that one group or the other was involved in a significantly greater proportion of incidents, helping the team members see which types of delays were most 'vital' could be used to build consensus.</p>
Use Pareto Charts Outside of Quality Improvement Projects
<p>Their simplicity also makes <span><a href="http://blog.minitab.com/blog/real-world-quality-improvement/pareto-chart-power">Pareto charts</a> a valuable tool for making decisions beyond the world of quality improvement. By helping you visualize the relative importance of various categories, you can use them to prioritize customer needs, opportunities for training or investment—even your choices for lunch.</span></p>
How to Create a Pareto Chart
<p>Creating a Pareto chart is not difficult, even without statistical software. Of course, if you're using <a href="http://www.minitab.com/products/minitab/">Minitab</a>, the software will do all this for you automatically—create a Pareto chart by selecting <strong style="line-height: 1.6;">Stat > Quality Tools > Pareto Chart...</strong> or by selecting <strong style="line-height: 1.6;">Assistant > Graphical Analysis > Pareto Chart</strong>. You can collect raw data, in which each observation is recorded in a separate row of your worksheet, or summary data, in which you tally observation counts for each category.</p>
<p><strong>1. Gather Raw Data about Your Problem</strong></p>
<p>Be sure you collect a random sample that fully represents your process. For example, if you are counting the number of items returned to an electronics store in a given month, and you have multiple locations, you should not gather data from just one store and use it to make decisions about all locations. (If you want to compare the most important defects for different stores, you can show separate charts for each one side-by-side.)</p>
<p><strong>2. Tally Your Data</strong></p>
<p>Add up the observations in each of your categories.</p>
<p><strong>3. Label your horizontal and vertical axes.</strong></p>
<p>Make the widths of all your horizontal bars the same and label the categories in order from largest to smallest. On the vertical axis, use round numbers that slightly exceed your top category count, and include your measurement unit.</p>
<p><strong>4. Draw your category bars.</strong></p>
<p>Using your vertical axis, draw bars for each category that correspond to their respective counts. Keep the width of each bar the same.</p>
<p><strong>5. Add cumulative counts and lines.</strong></p>
<p>As a final step, you can list the cumulative counts along the horizontal axis and make a cumulative line over the top of your bars. Each category's cumulative count is the count for that category PLUS the total count of the preceding categories. If you want to add a line, draw a right axis and label it from 0 to 100%, lined up with the grand total on the left axis. Above the right edge of each category, mark a point at the cumulative total, then connect the points.</p>
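The tally-and-order steps above can be sketched in a few lines of Python (the category names and counts here are made up for illustration):

```python
# Defect tallies by category (hypothetical counts for illustration).
defects = {"Scratches": 110, "Dents": 45, "Misaligned": 25,
           "Discolored": 12, "Other": 8}

# Steps 3-4: order categories from largest to smallest count.
ordered = sorted(defects.items(), key=lambda kv: kv[1], reverse=True)

# Step 5: cumulative counts and the cumulative percentage line.
total = sum(defects.values())
running = 0
for category, count in ordered:
    running += count
    print(f"{category:<11} {count:>4} {100 * running / total:5.1f}%")
```

The last column is what the right-hand percentage axis of the chart plots; by construction it reaches 100% at the final category.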
Data Analysis | Lean Six Sigma | Project Tools | Quality Improvement | Statistics
Wed, 14 Sep 2016 12:00:00 +0000
http://blog.minitab.com/blog/understanding-statistics/when-to-use-a-pareto-chart
Eston Martz
Control Chart Tutorials and Examples
http://blog.minitab.com/blog/understanding-statistics/control-chart-tutorials-and-examples
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/3989007af54bf1e996aeee86c8cec497/control_chart_wow.jpg" style="border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 288px; height: 173px;" />The other day I was talking with a friend about control charts, and I wanted to share an example one of my colleagues wrote on the Minitab Blog. Looking back through the index for "control charts" reminded me just how much material we've published on this topic.</p>
<p>Whether you're just getting started with control charts, or you're an old hand at statistical process control, you'll find some valuable information and food for thought in our control-chart related posts. </p>
Different Types of Control Charts
<p>One of the first things you learn in statistics is that when it comes to data, there's no one-size-fits-all approach. To get the most useful and reliable information from your analysis, you need to select the type of method that best suits the type of data you have.</p>
<p>The same is true with control charts. While there are a few charts that are used very frequently, a wide range of options is available, and selecting the right chart can make the difference between actionable information and false (or missed) alarms.</p>
<p><a href="http://blog.minitab.com/blog/understanding-statistics/what-control-chart-should-i-use">What Control Chart Should I Use?</a> offers a brief overview of the most common charts and a discussion of how to use the Assistant to help you choose the right one for your situation. And if you're a control chart neophyte and you want more background on why we use them, check out <a href="http://blog.minitab.com/blog/understanding-statistics/control-charts-show-you-variation-that-matters" itemprop="url">Control Charts Show You Variation that Matters.</a></p>
<p itemprop="headline">We extol the virtues of a less commonly used chart in <a href="http://blog.minitab.com/blog/fun-with-statistics/an-ode-to-the-ewma-control-chart" itemprop="url">Beyond the "Regular Guy" Control Charts: An Ode to the EWMA Chart</a>, and explain how to use control charts to track rare events in <a href="http://blog.minitab.com/blog/data-analysis-and-quality-improvement-and-stuff/using-g-whiz-charts-to-track-elusive-affirmations-from-almost-adolescents" itemprop="url">Using G-Whiz Charts to Track Elusive Affirmations from Almost Adolescents</a>.</p>
<p itemprop="headline">In <a href="http://blog.minitab.com/blog/adventures-in-software-development/the-laney-p-chart-and-minitab-software-development" itemprop="url">Using the Laney P' Control Chart in Minitab Software Development</a>, Dawn Keller discusses the distinction between P' charts and their cousins, described by Tammy Serensits in <a href="http://blog.minitab.com/blog/the-statistics-of-science/p-and-u-charts-and-limburger-cheese-a-smelly-combination" itemprop="url">P and U Charts and Limburger Cheese: A Smelly Combination</a>.</p>
<p itemprop="headline">And it's good to remember that things aren't always as complicated as they seem, and sometimes a simple solution can be just as effective as a more complicated approach. See why in <a href="http://blog.minitab.com/blog/understanding-statistics/take-it-easy-create-a-run-chart" itemprop="url">Take It Easy: Create a Run Chart. </a></p>
Control Chart Tutorials
<p itemprop="headline">Many of our Minitab bloggers have talked about the process of choosing, creating, and interpreting control charts under specific conditions. If you have data that can't be collected in subgroups, you may want to learn about <a href="http://blog.minitab.com/blog/understanding-statistics/how-create-and-read-an-i-mr-control-chart" itemprop="url">How to Create and Read an I-MR Control Chart</a>. </p>
<p itemprop="headline">If you do have data collected in subgroups, you'll want to understand why, when it comes to <a href="http://blog.minitab.com/blog/michelle-paret/control-charts-subgroup-size-matters" itemprop="url">Control Charts, Subgroup Size Matters</a>.</p>
<p itemprop="headline">It's often useful to look at control chart data in calendar-based increments, and taking the monthly approach is discussed in the series <a href="http://blog.minitab.com/blog/understanding-statistics/creating-a-chart-to-compare-month-to-month-change" itemprop="url">Creating a Chart to Compare Month-to-Month Change</a> and <a href="http://blog.minitab.com/blog/understanding-statistics/creating-charts-to-compare-month-to-month-change-part-2" itemprop="url">Creating Charts to Compare Month-to-Month Change, part 2</a>.</p>
<p itemprop="headline">If you want to see the difference your process improvements have made, check out <a href="http://blog.minitab.com/blog/real-world-quality-improvement/analyzing-a-process-before-and-after-improvement-historical-control-charts-with-stages" itemprop="url">Analyzing a Process Before and After Improvement: Historical Control Charts with Stages</a> and <a href="http://blog.minitab.com/blog/starting-out-with-statistical-software/setting-the-stage-accounting-for-process-changes-in-a-control-chart" itemprop="url">Setting the Stage: Accounting for Process Changes in a Control Chart</a>. </p>
<p itemprop="headline">While the basic idea of control charting is very simple, interpreting real-world control charts can be a little tricky. If you're using <a href="http://www.minitab.com/products/minitab">Minitab 17</a>, be sure to check out this post about a great new feature in the Assistant: <a href="http://blog.minitab.com/blog/statistics-and-quality-improvement/the-stability-report-for-control-charts-in-minitab-17-includes-example-patterns" itemprop="url">The Stability Report for Control Charts in Minitab 17 includes Example Patterns.</a></p>
<p itemprop="headline">Finally, one of our expert statistical trainers offers his suggestions about <a href="http://blog.minitab.com/blog/applying-statistics-in-quality-projects/five-ways-to-make-your-control-charts-more-effective" itemprop="url">Five Ways to Make Your Control Charts More Effective</a>.</p>
Control Chart Examples
<p itemprop="headline">Control charts are most frequently used for quality improvement and assurance, but they can be applied to almost any situation that involves variation.</p>
<p itemprop="headline">My favorite example of applying the lessons of quality improvement in business to your personal life involves Bill Howell, who applied his Six Sigma expertise to the (successful) management of his diabetes. Find out how he uses <a href="http://blog.minitab.com/blog/real-world-quality-improvement/control-charts-keep-blood-sugar-in-check" itemprop="url">Control Charts to Keep Blood Sugar in Check</a>.</p>
<p itemprop="headline">Some of our bloggers have applied control charts to their personal passions, including holiday candies in <a href="http://blog.minitab.com/blog/real-world-quality-improvement/control-charts-rational-subgrouping-and-marshmallow-peeps" itemprop="url">Control Charts: Rational Subgrouping and Marshmallow Peeps!</a> and bicycling in <a href="http://blog.minitab.com/blog/statistics-for-lean-six-sigma/the-problem-with-p-charts-out-of-control-cycle-laneys" itemprop="url">The Problem With P-Charts: Out-of-control Cycle LaneYs!</a>.</p>
<p itemprop="headline">If you're into sports, see how control charts can reveal <a href="http://blog.minitab.com/blog/the-statistical-mentor/when-should-nhl-goalies-get-pulled" itemprop="url">When NHL Goalies </a><a href="http://blog.minitab.com/blog/the-statistical-mentor/when-should-nhl-goalies-get-pulled" itemprop="url">Should </a><a href="http://blog.minitab.com/blog/the-statistical-mentor/when-should-nhl-goalies-get-pulled" itemprop="url">Get Pulled.</a> Or look to the cosmos to consider <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/signal-to-noise-detecting-extraterrestrials-and-special-causes" itemprop="url">Signal to Noise: Detecting Extraterrestrials and Special Causes</a>. And finally, compulsive readers like myself might be interested to see how relevant control charts are to literature, too, as Cody Steele illustrates in <a href="http://blog.minitab.com/blog/statistics-and-quality-improvement/laney-p-prime-charts-show-how-poe-creates-intensity-in-the-fall-of-the-house-of-usher" itemprop="url">Laney P' Charts Show How Poe Creates Intensity in "The Fall of the House of Usher."</a></p>
<p itemprop="headline">How are <em>you </em>using control charts?</p>
Quality Improvement
Mon, 12 Sep 2016 12:00:00 +0000
http://blog.minitab.com/blog/understanding-statistics/control-chart-tutorials-and-examples
Eston Martz
Creating Value from Your Data
http://blog.minitab.com/blog/applying-statistics-in-quality-projects/creating-value-from-your-data
<p>There may be huge potential benefits waiting in the data on your servers. These data can serve many different purposes. Better data enables better decisions, of course. Banks, insurance firms, and telecom companies already own large amounts of data about their customers, and these resources are useful for building a more personal relationship with each customer.</p>
<p>Some organizations already use data from agricultural fields to build complex and customized models based on a very extensive number of input variables (soil characteristics, weather, plant types, etc.) in order to improve crop yields. Airline companies and large hotel chains use dynamic pricing models to improve their yield management. Data is increasingly being referred to as the new “gold mine” of the 21st century.</p>
<p>Several factors underlie the rising prominence of data (and, therefore, data analysis):</p>
<p><img alt="Afficher l'image d'origine" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/File/de034e63187d191e1666721fa12a8880/de034e63187d191e1666721fa12a8880.png" style="width: 283px; height: 212px; margin: 10px 15px; float: right;" /></p>
Huge volumes of data
<p><span style="line-height: 1.6;">Data acquisition has never been easier (sensors in manufacturing plants and in connected objects, data from internet usage and web clicks, from credit cards, loyalty cards, Customer Relationship Management databases, satellite images, etc.), and data can be stored at costs that are lower than ever before, thanks to the huge storage capacity now available in the cloud and elsewhere. The amount of data being collected is not only huge, it is growing very fast—exponentially, in fact.</span></p>
Unprecedented velocity
<p>Connected devices, like our smartphones, provide data in almost real time, and it can be processed very quickly. It is now possible to react to any change almost immediately.</p>
Incredible variety
<p>The data collected is not restricted to billing information; every source of data is potentially valuable to a business. Not only is numeric data being collected on a massive scale, but so is unstructured data such as videos and pictures, in a wide variety of situations.</p>
<p>But the explosion of data available to us is prompting every business to wrestle with an extremely complicated problem:</p>
How can we create value from these resources?
<p>Very simple methods, such as counting the words used in queries submitted to company web sites, provide good insight into the general mood of your customers and how it evolves. Simple statistical correlations are often used by web vendors to suggest an additional purchase immediately after a customer buys a product. Very simple descriptive statistics are also useful.</p>
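The word-counting idea needs almost no machinery. A minimal Python sketch, using made-up customer queries for illustration:

```python
from collections import Counter

# Hypothetical customer queries pulled from a support form.
queries = [
    "refund for late delivery",
    "late delivery again",
    "how to request a refund",
    "delivery tracking not working",
]

# Tally every word across all queries to surface recurring themes.
words = Counter(w for q in queries for w in q.lower().split())
print(words.most_common(3))
```

Even this crude tally immediately flags "delivery" as the dominant theme of the complaints.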
<p>Just imagine what could be achieved with advanced regression models or powerful multivariate statistical techniques, which can be applied easily with <a href="http://www.minitab.com/products/minitab/">statistical software packages like Minitab</a>.</p>
A simple example of the benefits of analyzing an enormous database
<p>Let's consider an example of how one company benefited from analyzing a very large database.</p>
<p>Many steps are needed (security and safety checks, cleaning the cabin, etc.) before a plane can depart. Since delays negatively impact customer perceptions and also affect productivity, airline companies routinely collect a very large amount of data related to flight delays and the times required to perform tasks before departure. Some times are automatically collected; others are manually recorded.</p>
<p>A major worldwide airline company intended to use this data to identify the crucial milestones among a very large number of preparation steps, and which ones often triggered delays in departure times. The company used Minitab's <span><a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-smackdown-stepwise-versus-best-subsets">stepwise regression analysis</a></span> to quickly focus on the few variables that played a major role among a large number of potential inputs. Many variables turned out to be statistically significant, but two among them clearly seemed to make a major contribution (X6 and X10).</p>
<p style="margin-left: 40px;">Analysis of Variance</p>
<pre style="margin-left: 40px;">Source  DF  Seq SS  Contribution   Adj SS   Adj MS  F-Value  P-Value
X6       1  337394        53.54%     2512   2512.2    29.21    0.000
X10      1  112911        17.92%    66357  66357.1   771.46    0.000</pre>
<p>When huge databases are used, statistical analyses may become overly sensitive and <a href="http://blog.minitab.com/blog/the-stats-cat/sample-size-statistical-power-and-the-revenge-of-the-zombie-salmon-the-stats-cat">detect even very small differences</a> (due to the large sample and power of the analysis). P values often tend to be quite small (p < 0.05) for a large number of predictors.</p>
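A quick simulation illustrates this sensitivity (a hedged sketch with invented numbers, not the airline's data): with 50,000 observations per group, even a practically negligible 0.05-unit difference in means produces an extremely small p-value.

```python
import math
import random
from statistics import fmean, variance

random.seed(0)
n = 50_000
# Two processes whose true means differ by a practically trivial 0.05 units.
a = [random.gauss(0.00, 1) for _ in range(n)]
b = [random.gauss(0.05, 1) for _ in range(n)]

# Two-sample z test (the normal approximation is fine at this sample size).
se = math.sqrt(variance(a) / n + variance(b) / n)
z = (fmean(b) - fmean(a)) / se
p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

print(f"difference = {fmean(b) - fmean(a):.3f}, z = {z:.1f}, p = {p:.2e}")
```

With samples this large, statistical significance says little about practical importance, which is why measures like each variable's contribution to the variability matter.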
<p>However, in Minitab, if you click Results in the regression dialog box and select Expanded tables, the contribution from each variable is displayed. X6 and X10, considered together, contributed more than 70% of the overall variability and had by far the largest F-values; the contributions from the remaining factors were much smaller. The airline then ran a residual analysis to cross-validate the final model. </p>
<p>In addition, a Principal Component Analysis (<a href="http://blog.minitab.com/blog/applying-statistics-in-quality-projects/use-statistics-to-better-understand-your-customers">PCA, a multivariate technique</a>) was performed in Minitab to describe the relations between the most important predictors and the response. Milestones were expected to be strongly correlated to the subsequent steps.</p>
<p style="margin-left: 40px;"><img src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/File/c023d71140ea4ee2b5b22480712a55a4/c023d71140ea4ee2b5b22480712a55a4.png" /></p>
<p>The graph above is a loading plot from the principal component analysis. Lines that point in the same direction and lie close to one another represent variables that are strongly correlated, showing at a glance how the variables may be grouped.</p>
<p>A group of nine variables turned out to be strongly correlated to the most important inputs (X6 and X10) and to the final delay times (Y). Delays at the X6 stage obviously affected the X7 and X8 stages (subsequent operations), and delays from X10 affected the subsequent X11 and X12 operations.</p>
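If you'd like to experiment with loadings outside of Minitab, here is a minimal Python sketch (using scikit-learn, with simulated data — the variable names merely echo the airline example and are not its actual data): two correlated milestone variables load together on the first component, while an unrelated one does not.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n = 200
x6 = rng.normal(size=n)             # delay at one milestone
x7 = x6 + 0.1 * rng.normal(size=n)  # subsequent step, strongly tied to x6
x10 = rng.normal(size=n)            # an unrelated milestone
X = np.column_stack([x6, x7, x10])

# Standardize, then fit PCA; loadings are the correlations between the
# original variables and the components. Variables whose loading vectors
# point the same way on a loading plot are correlated with each other.
Xs = (X - X.mean(0)) / X.std(0)
pca = PCA(n_components=2).fit(Xs)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
print(np.round(loadings, 2))
```

Running this, the first two rows (the correlated pair) get nearly identical first-component loadings, while the third row's first-component loading stays near zero.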
Conclusion
<p>This analysis provided simple rules that this airline's crews can follow in order to avoid delays, making passengers' next flight more pleasant. </p>
<p>The airline can repeat this analysis periodically to search for the next most important causes of delays. Such an approach can propel innovation and help organizations replace traditional and intuitive decision-making methods with data-driven ones.</p>
<p>What's more, the use of data to make things better is not restricted to the corporate world. More and more public administrations and non-governmental organizations are making large, open databases easily accessible to communities and to virtually anyone. </p>
ANOVA, Data Analysis, Hypothesis Testing, Regression Analysis, Statistics, Statistics in the News
Tue, 06 Sep 2016 13:19:00 +0000
http://blog.minitab.com/blog/applying-statistics-in-quality-projects/creating-value-from-your-data
Bruno Scibilia

Is Alabama Going Undefeated this Year? Creating Simulations in Minitab
http://blog.minitab.com/blog/the-statistics-game/is-alabama-going-undefeated-this-year-creating-simulations-in-minitab
<p>The college football season is here, and this raises a very important question:</p>
<p>Is Alabama going to be undefeated when they win the national championship, or will they lose a regular-season game along the way?</p>
<img alt="Alabama" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/c353125e8df62efcc49bb6cc042e2006/alabama_crimson_tide.jpg" style="line-height: 20.8px; width: 250px; height: 250px; float: right; margin: 10px 15px;" />
<p>Okay, so it's not a <em>given </em>that Alabama is going to win the championship this year, but when you've won 4 of the last 7 you're definitely the odds-on favorite.</p>
<p>However, what if we wanted to take a quantitative look at Alabama's chances of going undefeated instead of just giving hot takes like the one above? How could we determine a probability of Alabama winning a specific number of games this year?</p>
<p>The answer is easy: a Monte Carlo Simulation.</p>
<p>Monte Carlo <a href="http://blog.minitab.com/blog/understanding-statistics/monte-carlo-is-not-as-difficult-as-you-think">simulations use repeated random sampling</a> to simulate data for a given mathematical model and evaluate the outcome. Sounds like the perfect situation for <a href="http://www.minitab.com/en-us/products/minitab/?WT.srch=1&WT.mc_id=SE3994&gclid=CMquxcr4684CFVBbhgod8sECMQ" target="_blank">Minitab Statistical Software</a>. We're going to use a Monte Carlo simulation to have Alabama play their schedule 100,000 times! But we need to establish a few things before we get started.</p>
The Transfer Equation
<p>First, we need a model to use in our simulation. This can be a known formula from your specific area of expertise, or it could be a model created from a designed experiment (DOE) or regression analysis. In our situation, we already know the transfer equation. It's just the summation of the number of games that Alabama wins during the season: </p>
<p style="margin-left: 40px;">Game1 + Game2 + Game3 ... + Game12</p>
The Variables
<p>Next, we need to define the distribution and parameters for the variables in our equation. We have 12 variables, one for each game Alabama will play.</p>
<p>For each game, Alabama can either win or lose. So each variable comes from the binomial distribution because there are only two outcomes.</p>
<p>Now we just need to determine the probability Alabama has of winning each game. For that, I'll turn to <a href="http://www.footballoutsiders.com/stats/ncaa2015" target="_blank">Bill Connelly's S&P+ rankings</a>. These rankings use play-by-play and drive data from every game to rank college football teams. But most importantly, these rankings can be used to generate win probabilities for individual games. And that's where the probability for our 12 binomial variables will come from.</p>
<p style="margin-left: 40px;"><img alt="Alabama probabilities" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/973494dbdb5a661a96ad4d746a48a50f/alabama_probabilities.jpg" style="width: 711px; height: 375px;" /></p>
Generate the Random Data
<p>Now that we have our variables, it's time to generate the random data for each one. We'll start with Alabama's opening game against USC, which is a binomial random variable with a probability of 0.71. To generate this data in Minitab, go to <strong>Calc > Random Data > Binomial</strong>. Then complete the dialog as follows.</p>
<p style="margin-left: 40px;"><img alt="Binomial Distribution" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/6591ac6257608ba23acf1f9d3386c89e/usc_dialog.jpg" style="width: 419px; height: 326px;" /></p>
<p>We're going to simulate this game 100,000 times, so that is the number of rows of data we want to generate. We want each row to represent a single game, so the number of trials is 1. And lastly, Alabama has a 71% chance of winning, so the event probability is 0.71. </p>
<p>After we repeat this for the other 11 games, we'll have simulated Alabama's regular season 100,000 times! Now all that's left to do is to analyze the results!</p>
<p><strong>Note:</strong> The probability for Alabama beating Chattanooga is 100%, but the probability for the binomial distribution has to be less than 1. So I used a value of 0.9999. Out of 100,000 games Chattanooga actually won twice! Hey, it's sports, anything can happen!</p>
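Outside of Minitab, the whole simulation can be sketched in a few lines of Python with NumPy. Only the USC (0.71) and Chattanooga (0.9999) probabilities below come from this post — the other ten are illustrative placeholders, so the results won't match the percentages reported later:

```python
import numpy as np

rng = np.random.default_rng(0)
n_seasons = 100_000

# Win probabilities for the 12 games. Only 0.71 (USC) and 0.9999
# (Chattanooga) are from the post; the rest are made-up stand-ins.
p_win = [0.71, 0.9999, 0.85, 0.80, 0.75, 0.90,
         0.70, 0.85, 0.80, 0.75, 0.90, 0.65]

# One row per simulated season, one column per game (1 = win, 0 = loss).
games = rng.binomial(n=1, p=p_win, size=(n_seasons, len(p_win)))

# The "transfer equation": Game1 + Game2 + ... + Game12.
wins = games.sum(axis=1)

print("P(undefeated) ≈", (wins == 12).mean())
print("Most common win total:", np.bincount(wins).argmax())
```

Each column of `games` plays the corresponding binomial variable 100,000 times, exactly as the **Calc > Random Data > Binomial** dialog does one column at a time.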
Analyze the Simulation
<p>Remember that transfer equation we came up with at the beginning? Now that we have the data for all of our variables, it's time to use it! Go to <strong>Calc > Calculator</strong>, and set up the equation to store the results in a new column.</p>
<p style="margin-left: 40px;"><img alt="Calculator" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/5a590180fc44d9418329a2f86a1db2cc/alabama_wins.jpg" style="width: 443px; height: 393px;" /></p>
<p>I created a new column called "Alabama Wins" and entered the sum of the individual game columns in the expression. This will give me the number of wins Alabama will have for 100,000 different seasons! We can use a histogram to view the results.</p>
<p style="margin-left: 40px;"><img alt="Histogram" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/2c9882fcee727df721dc4ade21fc2455/histogram_of_alabama_wins.jpg" style="width: 576px; height: 384px;" /></p>
<p>The most common outcome was a 10-win season, which Alabama achieved in approximately 29.6% of the simulated seasons. And the simulation suggests an undefeated season is unlikely: it happened in only 4.6% of the simulations. In fact, Alabama was more likely to win 7 games than all 12! A 7-5 Alabama team sounds impossible. But this is sports, and as our simulation has just shown, anything can happen!</p>
<p>Monte Carlo simulations can be applied to a wide variety of areas outside of sports too. If you want more, <a href="https://www.minitab.com/en-us/Published-Articles/Doing-Monte-Carlo-Simulation-in-Minitab-Statistical-Software/">check out this article</a> that illustrates how to use Minitab for Monte Carlo simulations using both a known engineering formula and a DOE equation.</p>
<p> </p>
Fun Statistics, Monte Carlo, Monte Carlo Simulation, Statistics, Statistics in the News
Fri, 02 Sep 2016 12:00:00 +0000
http://blog.minitab.com/blog/the-statistics-game/is-alabama-going-undefeated-this-year-creating-simulations-in-minitab
Kevin Rudy

How to Pick the Right Statistical Software
http://blog.minitab.com/blog/real-world-quality-improvement/how-to-pick-the-right-statistical-software
<p>If you’re in the market for statistical software, there are many considerations and more than a few options for you to evaluate.</p>
<img alt="questions to ask" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ccb8f6d6-3464-4afb-a432-56c623a7b437/Image/795f924e2aad164e93ba4654f3c012ac/photo_1458419948946_19fb2cc296af.jpg" style="line-height: 20.8px; width: 300px; height: 200px; border-width: 1px; border-style: solid; float: right; margin: 10px 15px;" />
<p>Check out these seven questions to ask yourself before choosing statistical software—your answers should help guide you towards the best solution for your needs!</p>
1. Who uses statistical software in your organization?
<p>Are they expert statisticians, novices, or a mix of both? Will they be analyzing data day-in, day-out, or will some be doing statistics on a less frequent basis? Is data analysis a core part of their jobs, or is it just one of many different hats some users have to wear? What's their relationship with technology—do they like computers, or just use them because they have to? </p>
<p>Figuring out who needs to use the software will help you match the options to their needs, so you can avoid choosing a package that does too much or too little.</p>
<p>If your users span a range of cultures and nationalities, be sure to see if the package you're considering is <a href="http://support.minitab.com/en-us/minitab/17/topic-library/minitab-environment/interface/customize-the-minitab-interface/change-the-language/" target="_blank">available in multiple languages</a>.</p>
2. What types of statistical analysis will they be doing?
<p>The specific types of analysis you need to do could play a big part in determining the right statistical software for your organization. The American Statistical Association's software page lists highly specialized programs for econometrics, spatial statistics, data mining, statistical genetics, risk modeling, and more. However, if your company has employees who specialize in the finer points of these kinds of analyses, chances are good they have already identified and have access to the right software for their needs.</p>
<p>Most users will want a general statistical software package that offers the power and flexibility to do all of the most commonly used types of analysis, including regression, ANOVA, hypothesis testing, design of experiments, capability analysis, control charts, and more. If you're considering a general statistical software package, check its features list to make sure it does the kinds of analysis you need. <a href="http://www.minitab.com/products/minitab/features-list/" target="_blank">Here is the complete feature list for Minitab Statistical Software.</a> </p>
3. How easy is it to use the statistical software?
<p>Data analysis is not simple or easy, and many statistical software packages don’t even try to make it any easier. This is not necessarily a bad thing, because "ease of use" is different for different users.</p>
<p>An expert statistician will know how to set up data correctly and will be comfortable entering statistical equations in a command-line interface—in fact, they may even feel slowed down by using a menu-based interface. On the other hand, a less experienced user may be intimidated or overwhelmed by a statistical software package designed primarily for use by experts. </p>
<p>Since ease of use varies widely, look into what kinds of <a href="http://support.minitab.com/en-us/minitab/17/" target="_blank">built-in guidance statistical software packages offer</a> to see which would be easiest for the majority of your users.</p>
4. What kind of support is offered?
<p>If people in your organization will need help using statistical software to analyze their data, how will they get it? Does your company have expert statisticians who can provide assistance when it's needed, or is access to that kind of expertise limited? </p>
<p>If you think people in your organization are going to contact the software's support team for assistance, it's smart to check around and see what kinds of assistance different software companies offer. Do they offer help with analysis problems, or only with installation and IT issues? Do they charge for it?</p>
<p>Look around in online user and customer forums to see what people say about the customer service they've received for different types of statistical software. <a href="http://www.minitab.com/Support/" target="_blank">Some software packages offer free technical support from experts in statistics and IT</a>; others provide more limited, fee-based customer support; and some packages provide no support at all.</p>
5. Where will the software be used?
<p>Will you be doing data analysis in your office? At home? On the road? All of the above? Will people in your organization be using the software at different locations across the country, or even the world? What are the license requirements for software packages in that situation? Does each machine need a separate copy of the software, or are shared licenses available?</p>
<p>Check on the options available for the packages you're considering. A good software provider will seek to understand your organization's unique needs and work with you to find the most cost-effective solution.</p>
6. Are there special considerations for your industry?
<p>Some professions have specialized data analysis needs due to regulations, industry requirements, or the unique nature of their business. For example, the pharmaceutical and medical device industries need to meet FDA recommendations for testing, which may involve statistical techniques such as Design of Experiments.</p>
<p>Depending on the needs of your business, one or more of these highly specialized software packages may be appropriate. However, general statistical software packages with a full range of tools may provide the functionality your industry requires, so be sure to investigate and compare these packages with the more specialized, and often more expensive, programs used in some industries.</p>
7. What do statistical software packages cost?
<p>Last but not least, you will need to consider the cost of the software package, which can range from $0 for some open-source programs to many thousands of dollars per license for more specialized offerings.</p>
<p>It’s important to compare not just the unit-copy price of a software package (i.e., what it costs to install a single copy of the software on a single machine), but to find out <a href="http://www.minitab.com/en-us/News/Minitab-Pricing-and-Licensing--Frequently-Asked-Questions/" target="_blank">what licensing options for statistical software</a> are available for your situation. </p>
Have more questions?
<p>If you have questions about data analysis software, please <a href="http://www.minitab.com/contact-us/" target="_blank">contact Minitab</a> to discuss your unique situation in detail. We are happy to help you identify the needs of your organization and find a solution that will best fit them!</p>
Statistics, Statistics Help
Mon, 29 Aug 2016 18:10:00 +0000
http://blog.minitab.com/blog/real-world-quality-improvement/how-to-pick-the-right-statistical-software
Carly Barry

Data Not Normal? Try Letting It Be, with a Nonparametric Hypothesis Test
http://blog.minitab.com/blog/understanding-statistics/data-not-normal-try-letting-it-be-with-a-nonparametric-hypothesis-test
<p>So the data you nurtured, that you worked so hard to format and make useful, failed the normality test.</p>
<img alt="not-normal" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c6e92e8046f3fcee28e7cf505fb77005/data_freak_flag_300.jpg" style="line-height: 20.8px; width: 300px; height: 293px; margin: 10px 15px; float: right;" />
<p>Time to face the truth: despite your best efforts, that data set is <em>never </em>going to measure up to the assumption you may have been trained to fervently look for.</p>
<p>Your data's lack of normality seems to make it poorly suited for analysis. Now what?</p>
<p>Take it easy. Don't get uptight. Just let your data be what they are, go to the <strong>Stat </strong>menu in Minitab Statistical Software, and choose "Nonparametrics."</p>
<p style="margin-left: 40px;"><img alt="nonparametrics menu" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/fbebf763ac6bd92b40c0d241b7c4029c/nonparametrics_menu.png" style="width: 367px; height: 309px;" /></p>
<p>If you're stymied by your data's lack of normality, nonparametric statistics might help you find answers. And if the word "nonparametric" looks like five syllables' worth of trouble, don't be intimidated—it's just a big word that usually refers to "tests that don't assume your data follow a normal distribution."</p>
<p>In fact, nonparametric statistics don't assume your data follow <em>any distribution at all</em>. The following table lists common parametric tests, their equivalent nonparametric tests, and the main characteristics of each.</p>
<p style="margin-left: 40px;"><img alt="correspondence table for parametric and nonparametric tests" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/4a69043809861f5187be271de67f8161/parametric_correspondence_table.png" style="width: 661px; height: 488px;" /></p>
<p>Nonparametric analyses free your data from the straitjacket of the <span style="line-height: 20.8px;">normality </span><span style="line-height: 1.6;">assumption. So choosing a nonparametric analysis is sort of like removing your data from a stifling, </span><a href="https://www.verywell.com/the-asch-conformity-experiments-2794996" style="line-height: 1.6;" target="_blank">conformist environment</a><span style="line-height: 1.6;">, and putting it into </span><a href="https://en.wikipedia.org/wiki/Utopia" style="line-height: 1.6;" target="_blank">a judgment-free, groovy idyll</a><span style="line-height: 1.6;">, where your data set can just be what it is, with no hassles about its unique and beautiful shape. How cool is </span><em style="line-height: 1.6;">that</em><span style="line-height: 1.6;">, man? Can you dig it?</span></p>
<p>Of course, it's not <em>quite </em>that carefree. Just like the 1960s encompassed both <a href="https://en.wikipedia.org/wiki/Woodstock" target="_blank">Woodstock</a> and <a href="https://en.wikipedia.org/wiki/Altamont_Free_Concert" target="_blank">Altamont</a>, so nonparametric tests offer both compelling advantages and serious limitations.</p>
Advantages of Nonparametric Tests
<p>Both parametric and nonparametric tests draw inferences about populations based on samples, but parametric tests focus on sample parameters like the mean and the standard deviation, and make various assumptions about your data—for example, that it follows a normal distribution, and that samples include a minimum number of data points.</p>
<p>In contrast, nonparametric tests are unaffected by the distribution of your data. Nonparametric tests also accommodate many conditions that parametric tests do not handle, including small sample sizes, ordered outcomes, and outliers.</p>
<p>Consequently, they can be used in a wider range of situations and with more types of data than traditional parametric tests. Many people also feel that nonparametric analyses are more intuitive.</p>
Drawbacks of Nonparametric Tests
<p><span style="line-height: 20.8px;">But nonparametric tests are not </span><em style="line-height: 20.8px;">completely </em><span style="line-height: 20.8px;">free from assumptions—they do require data to be an independent random sample, for example.</span></p>
<p>And nonparametric tests aren't a cure-all. For starters, they typically have less <a href="http://blog.minitab.com/blog/starting-out-with-statistical-software/how-powerful-am-i-power-and-sample-size-in-minitab">statistical power</a> than parametric equivalents. Power is the probability that you will correctly reject the null hypothesis when it is false. That means you have an increased chance of making a Type II error with these tests.</p>
<p>In practical terms, that means nonparametric tests are <em>less </em>likely to detect an effect or association when one really exists.</p>
<p>So if you want to draw conclusions with the same confidence level you'd get using an equivalent parametric test, you will need larger sample sizes. </p>
<p>Nonparametric tests are not a one-size-fits-all solution for non-normal data, but they can yield good answers in situations where parametric statistics just won't work.</p>
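To see the power gap concretely, here is a small Python simulation (not from the original post; the sample size and shift are arbitrary choices) comparing the 2-sample t-test with its nonparametric counterpart, the Mann-Whitney test, on normally distributed data:

```python
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

rng = np.random.default_rng(42)
n_sims, n, shift = 2000, 20, 0.8  # illustrative simulation settings

# Count how often each test detects a real difference between the groups.
t_hits = u_hits = 0
for _ in range(n_sims):
    a = rng.normal(0, 1, n)
    b = rng.normal(shift, 1, n)
    t_hits += ttest_ind(a, b).pvalue < 0.05
    u_hits += mannwhitneyu(a, b).pvalue < 0.05

print(f"t-test power ≈ {t_hits / n_sims:.2f}")
print(f"Mann-Whitney power ≈ {u_hits / n_sims:.2f}")
```

When the data really are normal, the t-test detects the difference slightly more often; the gap is modest here, but it is why the nonparametric route typically asks for a somewhat larger sample.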
Is Parametric or Nonparametric the Right Choice for You?
<p>I've briefly outlined differences between parametric and nonparametric hypothesis tests, looked at which tests are equivalent, and considered some of their advantages and disadvantages. If you're waiting for me to tell you which direction you should choose...well, all I can say is, "It depends..." But I can give you some established rules of thumb to consider when you're looking at the specifics of your situation.</p>
<p>Keep in mind that <strong>nonnormal data does not immediately disqualify your data for a parametric test</strong>. What's your sample size? <span style="line-height: 20.8px;">As long as a certain minimum sample size is met, most parametric tests will be </span><a href="http://blog.minitab.com/blog/fun-with-statistics/forget-statistical-assumptions-just-check-the-requirements" style="line-height: 20.8px;">robust to the normality assumption</a><span style="line-height: 20.8px;">. </span><span style="line-height: 1.6;">For example, the Assistant in Minitab (which uses Welch's t-test) points out that </span><span style="line-height: 1.6;">while the 2-sample t-test is based on the assumption that the data are normally distributed, this assumption is not critical when the sample sizes are at least 15. And Bonett's 2-sample standard deviation test performs well for nonnormal data even when sample sizes are as small as 20. </span></p>
<p><span style="line-height: 1.6;">In addition, while they may not require normal data, many nonparametric tests have other assumptions that you can’t disregard.</span> For example, t<span style="line-height: 20.8px;">he Kruskal-Wallis test assumes your samples come from populations that have similar shapes and equal variances. </span><span style="line-height: 1.6;">And the 1-sample Wilcoxon test does not assume a particular population distribution, but it does assume the distribution is symmetrical. </span></p>
<p><span style="line-height: 1.6;">In most cases, your choice between parametric and nonparametric tests ultimately comes down to sample size, and whether the center of your data's distribution is better reflected by the mean or the median.</span></p>
<ul>
<li>If the mean accurately represents the center of your distribution and your sample size is large enough, a parametric test offers you better accuracy and more power. </li>
<li>If your sample size is small, you'll likely need to go with a nonparametric test. But if the median better represents the center of your distribution, a nonparametric test may be a better option even for a large sample.</li>
</ul>
<p> </p>
Data Analysis, Hypothesis Testing, Statistics, Statistics Help
Mon, 22 Aug 2016 12:00:00 +0000
http://blog.minitab.com/blog/understanding-statistics/data-not-normal-try-letting-it-be-with-a-nonparametric-hypothesis-test
Eston Martz

How to Calculate BX Life, Part 2b: Handling Triangular Matrix Data
http://blog.minitab.com/blog/meredith-griffith/how-to-calculate-bx-life-handling-triangular-matrix-data
<span style="font-size: 13px; line-height: 1.6;">I thought 3 posts would capture all the thoughts I had about B10 Life. That is, until this question appeared on the Minitab LinkedIn group:</span>
<p style="margin-left: 40px;"><img alt="pic1" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/dae6c7b7-fc22-4616-9d65-f04909c20ab1/Image/f06ea25c49405cfc937bbade2c19275c/pic1.jpg" style="width: 572px; height: 103px;" /></p>
<p>In case you missed it, my first post, <a href="http://blog.minitab.com/blog/meredith-griffith/how-to-calculate-b10-life-with-statistical-software">How to Calculate B10 Life with Statistical Software</a>, explains what B10 life is and how Minitab calculates this value. My second post, <a href="http://blog.minitab.com/blog/meredith-griffith/how-to-calculate-bx-life-part-2">How to Calculate BX Life, Part 2</a>, shows how to compute any BX life in Minitab. But before I round out my BX life blog series with rationale for why BX life is one of the best measures for reliability, I thought I’d take this opportunity to address the LinkedIn question—as you might wonder the same thing.</p>
B10 Life and Warranty Analysis
<p>BX Life can be a useful metric for establishing warranty periods for products. Why? Because it indicates the time at which X% of items in a population will fail. So a manufacturer might set a warranty period that ends before a product’s B10 life, for instance, with the goal of minimizing the number of customers who will <a href="http://blog.minitab.com/blog/understanding-statistics/how-to-predict-warranty-claims">take advantage of the warranty</a> should the product they purchase fail within the warranty period. Naturally, someone doing warranty analysis in Minitab will want to compute this value too! But looking at raw reliability field data, which are recorded in the form of a triangular matrix, it’s not obvious how to compute B10 life!</p>
Warranty Input in Triangular Matrices
<p>It’s common to keep track of reliability field data in the form of number of items shipped and number of items returned from a particular shipment over time. And when several shipments are made at different dates and their corresponding returns noted, the recorded data are in the form of a triangular matrix.</p>
<p>Minitab has a tool that helps you convert shipping and warranty return data from matrix form into a standard reliability data form of failures. </p>
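For readers curious about the mechanics, here is a rough Python sketch of the kind of conversion such a tool performs (the function and the tiny data set are hypothetical illustrations, not Minitab's actual algorithm): each return becomes an interval-censored failure, and units never returned are right-censored at their current age.

```python
def preprocess_warranty(shipments, returns, periods_observed):
    """Convert a triangular shipment/return matrix to interval-censored rows.

    shipments[i]   -- units shipped in period i
    returns[i][j]  -- units from shipment i returned j+1 periods later
    Returns (start_time, end_time, frequency) tuples; end_time of None
    marks a right-censored row (units still in service).
    """
    rows = []
    for i, shipped in enumerate(shipments):
        ages = periods_observed - i  # how long this cohort has been watched
        returned = 0
        for j in range(ages):
            freq = returns[i][j]
            if freq:
                rows.append((j, j + 1, freq))  # failed between age j and j+1
                returned += freq
        survivors = shipped - returned
        if survivors:
            rows.append((ages, None, survivors))  # still in service
    return rows

rows = preprocess_warranty(shipments=[50, 40],
                           returns=[[2, 3], [1]],
                           periods_observed=2)
print(rows)
```

The resulting rows play the same role as the *Start time*, *End time*, and *Frequencies* columns that Minitab's Pre-Process Warranty Data tool creates.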
Convert your data from a matrix form for easy analysis!
<p>To demonstrate, let’s start with a new example and new data. If you’d like to follow along and you’re using <a href="http://www.minitab.com/products/minitab/whats-new/">Minitab 17.3</a>, navigate to <strong>Help > Sample Data</strong> and select the Compressor.MTW file.</p>
<p style="margin-left: 40px;"><img alt="pic1" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/dae6c7b7-fc22-4616-9d65-f04909c20ab1/Image/5f379e4e6795ad8d466926a882826e28/pic1.jpg" style="width: 423px; height: 130px;" /></p>
<p><span style="line-height: 1.6;">Here is what the data looks like:</span></p>
<p style="margin-left: 40px;"><img alt="pic3" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/dae6c7b7-fc22-4616-9d65-f04909c20ab1/Image/18a3643f21a50455704b0d8ad52e6fcc/pic3.jpg" style="width: 807px; height: 308px;" /></p>
<p>From here, you can use Minitab’s Pre-Process Warranty Data to reshape your data from triangular matrix format into interval censoring format. Select <strong>Stat > Reliability/Survival > Warranty Analysis > Pre-Process Warranty Data. </strong>For “Shipment (sale) column,” enter <em>Ship. </em>For “Return (failure) columns,” enter <em>Month1-Month12.</em> Click <strong>OK</strong>.</p>
<p style="margin-left: 40px;"><img alt="pic4" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/dae6c7b7-fc22-4616-9d65-f04909c20ab1/Image/4a00473b37bbb068fa2bce70a904c9f9/pic4.jpg" style="width: 543px; height: 363px;" /></p>
<p><span style="line-height: 1.6;">The Pre-Process step creates </span><em style="line-height: 1.6;">Start time, End time, </em><span style="line-height: 1.6;">and </span><em style="line-height: 1.6;">Frequencies </em><span style="line-height: 1.6;">columns in your worksheet! </span></p>
<p style="margin-left: 40px;"><img alt="pic5" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/dae6c7b7-fc22-4616-9d65-f04909c20ab1/Image/29bf052fa69b8cb01c4b2f3e3f348920/pic5.jpg" style="width: 207px; height: 572px;" /></p>
<p><span style="line-height: 1.6;">You can now use these columns to obtain BX life using </span><strong style="line-height: 1.6;">Stat > Reliability/Survival > Distribution Analysis (Arbitrary Censoring) > Parametric Distribution Analysis</strong><span style="line-height: 1.6;">. Enter </span><em style="line-height: 1.6;">Start time</em><span style="line-height: 1.6;"> in “Start variables,” </span><em style="line-height: 1.6;">End time </em><span style="line-height: 1.6;">in “End variables,” and </span><em style="line-height: 1.6;">Frequencies</em><span style="line-height: 1.6;"> in “Frequency columns (optional).” </span><span style="line-height: 1.6;">Also, make sure you have the appropriate assumed distribution selected. We’ll assume the Weibull distribution fits our data.</span></p>
<p style="margin-left: 40px;"><img alt="pic6" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/dae6c7b7-fc22-4616-9d65-f04909c20ab1/Image/dd0362a3f2b69e280eaa9fcd06532ac9/pic6.jpg" style="width: 497px; height: 358px;" /></p>
<p>Click the Estimate button to enter percents to be estimated in addition to what’s provided in the default output (In our case, let’s ask for B15 Life—so enter a 15 in “Estimate percentiles for these additional percents”). </p>
<p style="margin-left: 40px;"><img alt="pic7" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/dae6c7b7-fc22-4616-9d65-f04909c20ab1/Image/6f2426e01509a75b4d39ae9b1780a4aa/pic7.jpg" style="width: 508px; height: 447px;" /></p>
<p><span style="line-height: 1.6;">When we <strong>OK </strong>out of these dialogs, Minitab performs the analysis. Among the output Minitab provides is our handy Table of Percentiles, including our value for B15 life—or the time at which 15% of the items in our population will fail.</span></p>
<p style="margin-left: 40px;"><img alt="pic8" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/dae6c7b7-fc22-4616-9d65-f04909c20ab1/Image/f104d016a2b21c8aec8d9b54f0ac7338/pic8.jpg" style="width: 487px; height: 400px;" /></p>
<p><span style="line-height: 1.6;">And there you have it!</span></p>
<p>Collecting warranty data and doing <a href="http://blog.minitab.com/blog/understanding-statistics/how-to-predict-warranty-claims">warranty analysis</a> in Minitab shouldn’t prevent you from using reliability tools and metrics, such as BX life. In fact, letting Minitab reshape your data through the Pre-Process Warranty Data tool only makes your life easier when you dive into your reliability analysis!</p>
<p>Now, I promise, we’re well on our way to rounding out this series of posts, and in my next installment we'll look at the reasons BX life is a good metric to have in your reliability tool belt.</p>
Data AnalysisQuality ImprovementReliability AnalysisStatisticsWed, 17 Aug 2016 12:00:00 +0000http://blog.minitab.com/blog/meredith-griffith/how-to-calculate-bx-life-handling-triangular-matrix-dataMeredith GriffithAnalyzing the History of Olympic Events with Time Series
http://blog.minitab.com/blog/the-statistics-game/analyzing-the-history-of-olympic-events-with-time-series
<p><img alt="Olympics" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/a8081eeca606f0a3351825d49270d062/olympics.jpg" style="width: 320px; height: 226px; float: right;" />The Olympic games are about to begin in Rio de Janeiro. Over the next 16 days, more than 11,000 athletes from 206 countries will be competing in 306 different events. That's the most events ever in any Olympic games. It's almost twice as many events as there were 50 years ago, and exactly three times as many as there were 100 years ago.</p>
<p>Since the number of Olympic events has changed over time, this makes it a great data set for a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/time-series/basics/what-is-a-time-series/" target="_blank">time series analysis</a>.</p>
<p>A time series is a sequence of observations over regularly spaced intervals of time. The first step when analyzing time series data is to create a time series plot to look for trends and seasonality. A trend is a long-term tendency of a series to increase or decrease. Seasonality is the periodic fluctuation in the time series within a certain period—for example, sales for a store might increase every year in November and December. Here is a time series plot of the number of Olympic events since 1896.</p>
<p style="margin-left: 40px;"><img alt="Time Series Plot" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/51ef7f79b2b625a967eaa72a402c437a/histogram_of_olympic_events.jpg" style="width: 576px; height: 384px;" /></p>
<p>There is clearly an upward trend, but no seasonal pattern. The data is also a little choppy at the beginning. Part of the explanation is that the data points are not evenly spaced. Most Olympic games are 4 years apart, but a few of them are just 2 years apart, and during World War I and World War II there were 8-year and 12-year gaps, respectively. Since <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/time-series-plots-theres-gold-in-them-thar-hills">time series data</a> should be evenly spaced over time, we'll only look at data from 1948 on, when the Olympics started being held every 4 years without any interruptions.</p>
<p style="margin-left: 40px;"><img alt="Time Series Plot" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/f77b40bf156b5fc850fc6b96881d8808/time_series_plot_of_olympic_events.jpg" style="width: 576px; height: 384px;" /></p>
<p>Now that we have an evenly spaced series that clearly exhibits a trend, we can use a trend analysis in Minitab Statistical Software to model the data. With a trend analysis, you can use four different types of models: linear, quadratic, exponential growth, and s-curve. We'll analyze our data using both the linear and s-curve models. An additional time series analysis you can use when your data exhibit a trend is double exponential smoothing, so we'll use that method too. </p>
<p style="margin-left: 40px;"><img alt="Trend Analysis" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/fcf72a68cbdcce5a5a6284f5e5581a48/trend_analysis_linear.jpg" style="width: 576px; height: 384px;" /></p>
<p style="margin-left: 40px;"><img alt="Trend Analysis" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/dc6786e4cf82be102d53d42eba48c2fd/trend_analysis_s_curve.jpg" style="width: 576px; height: 384px;" /></p>
<p style="margin-left: 40px;"><img alt="Double Exponential Smoothing" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/a1f20460303b913fcb75a67b41b05eea/double_exponential_smoothing_plot_for_olympic_events.jpg" style="width: 576px; height: 384px;" /></p>
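<p>Double exponential smoothing (Holt's method) keeps running estimates of both the level and the trend of the series. A minimal sketch of the idea—using made-up event counts and arbitrary smoothing weights, not the actual Olympic series or Minitab's optimized weights:</p>

```python
def double_exponential_smoothing(y, alpha=0.2, gamma=0.2):
    """Holt's double exponential smoothing: returns one-step-ahead
    fitted values plus a forecast for the next period."""
    level, trend = y[0], y[1] - y[0]  # simple initial estimates
    fits = []
    for obs in y:
        fits.append(level + trend)  # forecast made before seeing obs
        prev_level = level
        level = alpha * obs + (1 - alpha) * (level + trend)
        trend = gamma * (level - prev_level) + (1 - gamma) * trend
    return fits, level + trend  # fitted values, next-period forecast

# Hypothetical event counts for illustration (not the real data)
events = [136, 149, 151, 150, 163, 172, 195, 198, 203, 221]
fitted, forecast = double_exponential_smoothing(events)
```

<p>Because the trend estimate is itself smoothed, the method adapts when the series stops climbing—which is exactly why it handles the recent flattening of the Olympic data better than a fixed trend curve.</p>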
<p>You can use the accuracy measures (MAPE, MAD, MSD) to compare the fits of different time series models. For all three of these statistics, smaller values usually indicate a better-fitting model. If a single model does not have the lowest values for all three statistics, MAPE is usually the preferred measurement.</p>
<p>For the time series of Olympic event data, the s-curve model has the lowest values of MAPE and MAD, while the double exponential smoothing method has the lowest value for MSD. Based on the "MAPE breaks all ties" guideline, it appears that the s-curve model is the one we want to use.</p>
<p>However, accuracy measures shouldn't be the sole criterion you use to select a model. It's also important to examine the fit of the model, especially at the end of the series. And if the last 5 Olympics are any indication, it appears that the trend of adding large quantities of events to the Olympic Games is coming to an end. In the last 16 years, only 6 events have been added.</p>
<p>The double exponential smoothing model appears to have adjusted for this change, whereas the two trend analysis models have not. Given this additional consideration, the double exponential smoothing model is the one we should pick, especially if we want to use it to forecast future observations.</p>
<p>And now that we've settled on a model, we can sit back, relax, and watch all 918 medals be won. Let the games begin!</p>
Fun StatisticsStatisticsStatistics in the NewsFri, 05 Aug 2016 12:00:00 +0000http://blog.minitab.com/blog/the-statistics-game/analyzing-the-history-of-olympic-events-with-time-seriesKevin RudyOn Paying Bills, Marriage, and Alert Systems
http://blog.minitab.com/blog/meredith-griffith/on-paying-bills-marriage-and-alert-systems
<p>When I blogged about <a href="http://blog.minitab.com/blog/meredith-griffith/what-a-trip-to-the-dentist-taught-us-about-automation">automation</a> back in March, I made my husband out to be an automation guru. Well, he certainly is. But what you don’t know about my husband is that while he loves to automate everything in his life, sometimes he drops the ball. He’s human; even I have to cut him a break every now and then.</p>
<p>On the other hand, instances of hypocrisy in his behavior tend to make for a good story. So here we are again.</p>
<span style="line-height: 1.2;">On Paying Bills</span>
<p>When we married 5 years ago and began combining our bank accounts, I learned a few things about my husband. Nothing that I haven’t already shared with you. Because he loves automation, it came as no surprise to me that all his accounts resided in a single online repository (mint.com) where he could view his net worth—assets such as his home and car value, and debts including the loan left on his home and bills and credit card expenses that needed to be paid. He’d also made sure to automate the payment of all loans, utility bills, and credit cards—and the respective account would notify him when a payment was made.</p>
<p>This mint.com account served as one dashboard view of all possible accounts he would otherwise have to access independently to see statements and make payments. It was genius! </p>
<p><img alt="mint" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/dae6c7b7-fc22-4616-9d65-f04909c20ab1/Image/299516c1c0685e413532648e7a185d6e/mint.jpg" style="width: 1000px; height: 563px;" /></p>
<p>He could set up savings goals, budgets, email alerts for credit card payment reminders and notification of payment, suspicious account activity, and just about any other miscellaneous charge or activity or change in spending habits. It really did make life easier.</p>
<p>Until I entered the picture.</p>
<span style="line-height: 1.2;">On Marriage</span>
<p>We married, I synced my bank accounts, and we combined cash. I scoured his historical data to observe spending habits—areas where we could save money (Taco Bell topped the ‘high spending’ for the Food/Dining category). As I began poking around his accounts, I noticed a monthly fee his Chase Freedom Visa credit card was charging him. I asked him about the fee; he pleaded ignorance. When I investigated further, I discovered that he’d been charged this fee for <em>years</em>, since he first got the credit card.</p>
<p>I researched online and discovered that other cardholders had complained of being erroneously enrolled in a protection program when they first got their Chase Freedom card, and were being charged a similar fee of varying amounts monthly. Turns out this monthly fee was a percentage of monthly spending—and the Chase Freedom Visa credit card incentivized a cardholder to make all his purchases with that card, given its offer of 5% cash back on all purchases at the time.</p>
<p>Needless to say, I wanted that money back. No less than a few minutes later, we were on the phone with Chase disputing the program enrollment and monthly charges. They acknowledged their error and refunded us the money lost over a span of several years.</p>
<p>The lesson in all of this? Marry someone who’s not afraid to dig through your historical data.</p>
On Alert Systems
<p>More seriously, automating processes or workflows is incredibly helpful, but without the proper attention and alert systems in place, you may still encounter holes in the story. Automation and alerts must go hand-in-hand to be effective—and as a consumer of the information you’re automating, you still must be invested enough to look at the big picture.</p>
<p>For my husband, the beauty in automating his bill payments and aggregating all his accounts on mint.com was to save time he'd otherwise spend paying bills separately and checking cash flows in multiple different accounts. But he failed to set up alerts about important aspects of the process he was automating, and he failed to check in on his process from time to time. Mint.com provides an incredibly useful dashboard to give you the big picture overview of your accounts and your net worth; it also provides a plethora of alert options that save a consumer time from digging for red flags <em>after</em> the undesirable event has become a regular occurrence in the process (like I did). But without checking the status of the system or using its full automation potential, the system is only as good as its inputs until you revisit it or tweak it.</p>
<p>This is just one piece of the puzzle. Alert systems offer so much more!</p>
<ol>
<li><strong>Awareness</strong>—setting alerts through mint.com with regard to miscellaneous fees would have offered insight about the credit card program my husband had been erroneously enrolled in.</li>
<li><strong>Immediate Feedback</strong>—the first time a fee was charged, he would have been able to take immediate action rather than waiting years later for his wife to discover the charge (manually, mind you).</li>
<li><strong>Time Saver</strong>—aside from automating bill pay and combining all accounts into a single repository for a big picture view of one’s financial status (which is certainly a time-saver in reviewing accounts and paying bills in various locations), an alert system would have saved me a lot of time in digging through my husband’s financial data to understand the origin of the fee Chase was charging him.</li>
<li><strong>Money Saver</strong>—while we <em>were </em>refunded all the money charged in monthly fees by Chase, clearly an alert system would have been a more foolproof way to save money in the first place. Alerts are also effective in ensuring bill pay occurs on time, notifying you when a statement has been prepared, when the bill is due, and when the bill has been paid.</li>
</ol>
<p>As process engineers or quality managers in the manufacturing world, you are very close to your process and its inputs. You want to know when something goes wrong, right when it happens. You don’t want a consumer to discover a flaw in a part or product you manufactured and sold years before, only to be faced with product recalls, customer reimbursements, time and money invested to re-manufacture and replace the defective product for unhappy customers, and in some cases, lawsuits. The stakes are high.</p>
<p>Minitab offers a solution to this pain point in its Real-Time SPC dashboard. The dashboard is completely powered by Minitab Statistical Software, taking the graphs and output you know and love and placing them on customized dashboard views that show the current state of your processes. The dashboard gives you a big picture view of your processes across all your production sites, for instance, and highlights where improvements can be made. You can incorporate any graph or analysis you want—such as histograms, control charts, or process capability analysis. You can automatically generate quality reports about your processes, and set up any alert that will help you respond to defects faster.</p>
<p><img alt="qualityDashboard" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/dae6c7b7-fc22-4616-9d65-f04909c20ab1/Image/c9c6bb0f36670d640bf29072a830b9d5/qualitydashboard.jpg" style="width: 900px; height: 651px;" /></p>
<p><img alt="spcDashboard" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/dae6c7b7-fc22-4616-9d65-f04909c20ab1/Image/27347695ab637e3931fe251860d12079/spcdashboard.jpg" style="line-height: 1.6; width: 900px; height: 665px;" /></p>
<p><span style="line-height: 1.6;">In the case of my marriage, alert systems are certainly practical from a financial standpoint. But in the world of manufacturing, ensuring alerts are set up around your automated processes has far-reaching implications, as the time- and money-saving elements of alert systems greatly impact a company’s bottom line. To learn more about how Minitab can help you, contact us at </span><a href="mailto:sales@minitab.com" style="line-height: 1.6;">Sales@minitab.com</a><span style="line-height: 1.6;">.</span></p>
<p>And if you’ve ever thought twice about whether or not you should marry, let this story be an encouragement to you—you may actually find a spouse who can make you richer.</p>
AutomationData AnalysisQuality ImprovementSix SigmaMon, 25 Jul 2016 12:00:00 +0000http://blog.minitab.com/blog/meredith-griffith/on-paying-bills-marriage-and-alert-systemsMeredith GriffithCan Regression and Statistical Software Help You Find a Great Deal on a Used Car?
http://blog.minitab.com/blog/understanding-statistics/can-regression-and-statistical-software-help-you-find-a-great-deal-on-a-used-car
<p>You need to consider many factors when you’re buying a used car. Once you narrow your choice down to a particular car model, you can get a wealth of information about individual cars on the market through the Internet. How do you navigate through it all to find the best deal? By analyzing the data you have available. </p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/710ce579b4120727bf67e8b48f5965e8/240_used_car_kovacs.jpg" style="line-height: 20.7999992370605px; border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 240px; height: 240px;" /></p>
<p>Let's look at how this works using <a href="http://blog.minitab.com/blog/understanding-statistics/we-just-got-rid-of-five-reasons-to-fear-data-analysis">the Assistant</a> in Minitab 17. With the Assistant, you can use regression analysis to calculate the expected price of a vehicle based on variables such as year, mileage, whether or not the technology package is included, and whether or not a free Carfax report is included.</p>
<p>And it's probably a lot easier than you think. </p>
<p>A search of a leading Internet auto sales site yielded data about 988 vehicles of a specific make and model. After putting the data into Minitab, we choose <strong>Assistant > Regression…</strong></p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/9e87de993a0daa39e6643b8c6d3aed9c/regression_dialog.png" style="width: 395px; height: 247px;" /></p>
<p>At this point, if you aren’t very comfortable with regression, <a href="http://www.minitab.com/products/minitab/assistant/">the Assistant makes it easy to select the right option for your analysis</a>.</p>
A Decision Tree for Selecting the Right Analysis
<p>We want to explore the relationships between the price of the vehicle and four factors, or X variables. Since we have more than one X variable, and since we're not looking to optimize a response, we want to choose Multiple Regression.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/bc802d35bfb57ca3b86e061da4fa4b09/regression_decision_tree_w640.png" style="width: 640px; height: 502px;" /></p>
<p>This <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/9ecb2280228deb621ee2db7f6fbe300e/used_cars.MTW">data set</a> includes five columns: mileage, the age of the car in years, whether or not it has a technology package, whether or not it includes a free CARFAX report, and, finally, the price of the car. <span style="line-height: 1.6;">We don’t know which of these factors may have a significant relationship to the cost of the vehicle, but we don’t need to. Just fill out the dialog box as shown. </span></p>
<p><img alt="multiple regression in the Assistant" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/0655a116f60bdad98c97f1b460a72d50/multiple_regression.png" style="width: 542px; height: 392px;" /></p>
<p>Press OK and the Assistant assesses each potential model and selects the best-fitting one. It also provides a comprehensive set of reports, including a Model Building Report that details how the final model was selected and a Report Card that alerts you to potential problems with the analysis, if there are any.</p>
Interpreting Regression Results in Plain Language
<p>The Summary Report tells us in plain language that there is a significant relationship between the Y and X variables in this analysis, and that the factors in the final model explain 89.8 percent of the observed variation in price. It confirms that all of the variables we looked at are significant. </p>
<p><img alt="multiple regression output" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/80aaf633a64b8078afdaa6479aa0b17e/regression_output.png" style="width: 733px; height: 548px;" /></p>
<p>The Model Equations Report contains the final regression models, which can be used to predict the price of a used vehicle. The Assistant provides 2 equations, one for vehicles that include a free CARFAX report, and one for vehicles that do not.</p>
<p><img alt="regression equations" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/b06fd24876a41c7b38803ea054de3db9/regression_equations.png" style="width: 723px; height: 161px;" /></p>
<p>We can see several interesting things about the price of this vehicle model by reading the equations. First, the constant for cars with a free CARFAX report is 27,799, while the constant for a paid report is 27,358. This tells us that, all other things being equal, vehicles with a free report cost on average about $441 more than vehicles with a paid report. This could be because these cars probably have a clean report (if not, the sellers probably wouldn’t provide it for free).</p>
<p>Second, each additional mile added to the car decreases its expected price by roughly 6 cents, while each year added to the car's age decreases the expected price by $1,310. <span style="line-height: 1.6;">The technology package adds, on average, $1,044 to the price of vehicles. </span></p>
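<p>Putting those pieces together, a prediction from the model is just the appropriate constant plus the coefficient-weighted terms. The sketch below uses the rounded figures quoted above (about 6 cents per mile, $1,310 per year of age, $1,044 for the technology package); the exact coefficients in the Assistant's Model Equations Report would differ slightly:</p>

```python
def predicted_price(miles, age_years, tech_package, free_carfax):
    """Approximate the two Assistant model equations using the
    rounded coefficients quoted in the text (illustrative only)."""
    constant = 27799 if free_carfax else 27358  # free vs. paid CARFAX report
    price = constant - 0.06 * miles - 1310 * age_years
    if tech_package:
        price += 1044
    return price

# Example: a 3-year-old car with 30,000 miles, tech package, free report
p = predicted_price(miles=30000, age_years=3, tech_package=True, free_carfax=True)
```

<p>With these rounded coefficients, that hypothetical car would be expected to list for about $23,113.</p>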
Residuals versus Fitted Values
<p>While these findings are interesting, our goal is to find the car that offers the best value. In other words, we want to find the car that has the largest difference between the asking price and the expected asking price predicted by the regression analysis.</p>
<p>For that, we can look at the Assistant’s Diagnostic Report. The report presents a chart of Residuals vs. Fitted Values. If we see obvious patterns in this chart, it can indicate problems with the analysis. In that respect, this chart of Residuals vs. Fitted Values looks fine, but now we’re going to use the chart to identify the best value on the market.</p>
<p><img alt="residuals" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/407b5bae854a6dfab51b6cb37bf774e8/residuals.png" style="width: 737px; height: 548px;" /></p>
<p>In this analysis, the “Fitted Values” are the prices predicted by the regression model. “Residuals” are what you get when you subtract the predicted asking price from the actual asking price—exactly the information you’re looking for! The Assistant marks large residuals in red, making them very easy to find. And three of those residuals—which appear in light blue above because we’ve selected them—are very far below the asking price predicted by the regression analysis.</p>
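<p>The same screening can be done by hand: compute each car's residual and sort. A minimal sketch with hypothetical asking and predicted prices (the most negative residuals are the candidate bargains):</p>

```python
def best_values(asking, predicted, k=3):
    """Residual = actual minus fitted price; the most negative
    residuals point at cars priced furthest below the model's estimate."""
    residuals = [a - p for a, p in zip(asking, predicted)]
    order = sorted(range(len(residuals)), key=lambda i: residuals[i])
    return order[:k], residuals

# Hypothetical asking vs. model-predicted prices for five cars
asking    = [21000, 24500, 18900, 26000, 19500]
predicted = [22000, 24000, 23400, 26500, 23800]
bargains, residuals = best_values(asking, predicted)
# bargains → [2, 4, 0]: cars 3 and 5 are priced $4,500 and $4,300 under model
```

<p>As with the Assistant's chart, a large negative residual is only a lead—the listing still needs a manual check for collision damage or other problems that explain the low price.</p>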
<p>Selecting these data points on the graph reveals that these are vehicles whose data appears in rows 357, 359, and 934 of the data sheet. Now we can revisit those vehicles online to see if one of them is the right vehicle to purchase, or if there’s something undesirable that explains the low asking price. </p>
<p>Sure enough, the records for those vehicles reveal that two of them have severe collision damage.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/5dbbf5aa405d4b2d53ec720657a09556/vehicles.jpg" style="width: 320px; height: 356px;" /></p>
<p>But the remaining vehicle appears to be in pristine condition, and is several thousand dollars less than the price you’d expect to pay, based on this analysis!</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/640bd720a3d1f8b04713aa0ec321a570/nice_car.png" style="width: 254px; height: 189px;" /></p>
<p>With the power of regression analysis and the Assistant, we’ve found a great used car—at a price you know is a real bargain.</p>
Data AnalysisFun StatisticsRegression AnalysisStatisticsStatistics HelpFri, 22 Jul 2016 10:00:00 +0000http://blog.minitab.com/blog/understanding-statistics/can-regression-and-statistical-software-help-you-find-a-great-deal-on-a-used-carEston MartzHigh Cpk and a Funny-Looking Histogram: Is My Process Really that Amazing?
http://blog.minitab.com/blog/marilyn-wheatleys-blog/high-cpk-and-a-funny-looking-histogram-is-my-process-really-that-amazing
<p>Here is a scenario involving process capability that we’ve seen from time to time in Minitab's technical support department. I’m sharing the details in this post so that you’ll know where to look if you encounter a similar situation.</p>
<p>You need to run a capability analysis. You generate the output using <a href="http://www.minitab.com/en-us/products/minitab/">Minitab Statistical Software</a>. When you look at the results, the Cpk is huge and the histogram in the output looks strange:</p>
<p style="margin-left: 40px;"><img border="0" height="468" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/File/9549037dc2e0a30a77ab36737edeeb09/9549037dc2e0a30a77ab36737edeeb09.png" width="624" /></p>
<p>What’s going on here? The Cpk seems unrealistic at 42.68, the "within" fit line is tall and narrow, and the bars on the histogram are all smashed down. Yet if we use the exact same data to make a histogram using the Graph menu, we see that things don’t look so bad:</p>
<p style="margin-left: 40px;"><img border="0" height="384" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/File/d111d612a239ac72e49fe7d3fccab0f5/d111d612a239ac72e49fe7d3fccab0f5.png" width="576" /></p>
<p><span style="line-height: 1.6;">So what explains the odd </span><span style="line-height: 20.8px;">output for the </span><span style="line-height: 1.6;">capability analysis?</span></p>
<p>Notice that the ‘within subgroup’ variation in the capability output is represented by the tall dashed line in the middle of the histogram. This is the StDev (Within) shown on the left side of the graph. The within subgroup variation of 0.0777 is very small relative to the overall standard deviation. </p>
<p>So what is causing the within subgroup variation to be so small? Another graph in Minitab can give us the answer: The Capability Sixpack. In the case above, the subgroup size was 1 and Minitab’s Capability Sixpack in <strong>Stat</strong> > <strong>Quality Tools</strong> > <strong>Capability Sixpack</strong> > <strong>Normal</strong> will plot the data on a control chart for individual observations, an I-chart:</p>
<p style="margin-left: 40px;"><img border="0" height="468" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/File/46352209a7f75bbfba20794c925fc897/46352209a7f75bbfba20794c925fc897.png" width="624" /></p>
<p>Hmmm...this could be why, in <a href="http://www.minitab.com/en-us/services/training/">Minitab training</a>, our instructors recommend using the Capability Sixpack first.</p>
<p>In the Capability Sixpack above, we can see that the individually plotted values on the I-chart show an upward trend, and it appears that the process is <em>not </em>stable and in control (as <span><a href="http://blog.minitab.com/blog/understanding-statistics/i-think-i-can-i-know-i-can-a-high-level-overview-of-process-capability-analysis">it should be for data used in a capability analysis</a></span>). A closer look at the data in the worksheet clearly reveals that the data was sorted in ascending order:</p>
<p style="margin-left: 40px;"><img border="0" height="265" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/File/44945e3490bf95cfc2c796618195b75e/44945e3490bf95cfc2c796618195b75e.png" width="103" /></p>
<p>Because the within-subgroup variation for data not collected in subgroups is estimated from the <a href="http://blog.minitab.com/blog/marilyn-wheatleys-blog/whats-a-moving-range-and-how-is-it-calculated">moving ranges</a> (the average distance between consecutive points), sorting the data causes the within-subgroup variation to be very small. With so little within-subgroup variation, the fit line representing it becomes very tall and narrow, ‘smashing down’ the bars on the histogram. We can see this by creating a histogram in the Graph menu and forcing Minitab to use a very small standard deviation (by default, this graph uses the overall standard deviation—the one used when calculating Ppk): choose <strong>Graph</strong> > <strong>Histogram </strong>> <strong>Simple</strong>, enter the data, click <strong>Data View</strong>, choose the <strong>Distribution </strong>tab, check <strong>Fit distribution</strong>, enter 0.0777 for the Historical StDev, then click <strong>OK</strong>:</p>
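<p>You can see numerically why sorting collapses the within estimate. The moving-range estimate divides the average moving range by the unbiasing constant d2 = 1.128; sorting shrinks every consecutive gap, so the estimate—and with it the denominator of Cpk—collapses. A sketch on simulated data (not the data from this example):</p>

```python
import random

def within_sigma(data):
    """Estimate within-subgroup standard deviation from the average
    moving range (d2 = 1.128 for moving ranges of length 2)."""
    mr = [abs(b - a) for a, b in zip(data, data[1:])]
    return (sum(mr) / len(mr)) / 1.128

random.seed(1)
process = [random.gauss(10, 1) for _ in range(200)]  # simulated measurements

sigma_collected = within_sigma(process)          # data in collection order
sigma_sorted = within_sigma(sorted(process))     # same data, sorted ascending
# sigma_collected is near the true value of 1; sigma_sorted is tiny,
# which inflates Cpk = min(USL - mean, mean - LSL) / (3 * sigma_within)
```

<p>Same numbers, wildly different capability estimates—which is the whole mystery in one calculation.</p>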
<p style="margin-left: 40px;"><img src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/File/a20807e5892fdb0823bccf0828b5f585/a20807e5892fdb0823bccf0828b5f585.png" /></p>
<p style="margin-left: 40px;"><img src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/File/421afe5e10ba17256da42ac98bf11192/421afe5e10ba17256da42ac98bf11192.png" /></p>
<p>Mystery solved! And if you still don’t believe me, we can get a better looking capability histogram by randomizing the data first (<strong>Calc</strong> > <strong>Random Data</strong> > <strong>Sample From Columns</strong>):</p>
<p style="margin-left: 40px;"><img border="0" height="312" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/File/7f92652636a67fd8f17396bcb52e960c/7f92652636a67fd8f17396bcb52e960c.png" width="397" /></p>
<p>Now if we run the capability analysis using the randomized data in C2 we see:</p>
<p style="margin-left: 40px;"><img border="0" height="468" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/File/5c4c93d603abfb79a1f49b4c412c689b/5c4c93d603abfb79a1f49b4c412c689b.png" width="624" /></p>
<p>A note of caution: I’m <strong><em>not </em></strong>suggesting that the data for a capability analysis should be randomized. The moral of the story is that the data in the worksheet should be entered in the order it was collected so that it is representative of the normal variation in the process (i.e., the data should not be <em>sorted</em>). </p>
<p>Too bad our Cpk doesn’t look as amazing as it did before…now it's time to get to <a href="http://blog.minitab.com/blog/michelle-paret/how-to-improve-cpk">work with Minitab to improve our Cpk</a>!</p>
Capability AnalysisData AnalysisLean Six SigmaQuality ImprovementReliability AnalysisWed, 20 Jul 2016 12:00:00 +0000http://blog.minitab.com/blog/marilyn-wheatleys-blog/high-cpk-and-a-funny-looking-histogram-is-my-process-really-that-amazingMarilyn WheatleyDOE Center Points: What They Are & Why They're Useful
http://blog.minitab.com/blog/michelle-paret/doe-center-points-what-they-are-why-theyre-useful
<p><a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/design-of-experiment-doe:-searching-for-a-selfie-fountain-of-youth">Design of Experiments</a> (DOE) is the perfect tool to efficiently determine if key inputs are related to key outputs. Behind the scenes, DOE is simply a regression analysis. What’s not simple, however, is all of the choices you have to make when planning your experiment. What X’s should you test? What ranges should you select for your X’s? How many replicates should you use? Do you need center points? And so on. <span style="line-height: 1.6;">So let’s talk about center points.</span></p>
What Are Center Points?
<p>Center points are simply experimental runs where your X’s are set halfway between (i.e., in the center of) the low and high settings. For example, suppose your DOE includes these X’s:</p>
<p><img alt="TimeAndTemp" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/a353ac7271581a7dadcf8dac48e33d3f/timeandtemp.jpg" style="width: 300px; height: 80px;" /></p>
<p>The center point would then be set midway at a Temperature of <strong>150 °C</strong> and a Time of <strong>20 seconds</strong>.</p>
<p>And your data collection plan in <a href="http://www.minitab.com/products/minitab/">Minitab Statistical Software</a> might look something like this, with the center points shown in blue:</p>
<p><img alt="Minitab Worksheet" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/32105497dbf355948fd66f57cde703bf/minitabworksheet.jpg" style="width: 361px; height: 320px;" /></p>
<p>You can have just 1 center point, or you can collect data at the center point multiple times. This particular design includes 2 experimental runs at the center point. Why pick 2, you may be asking? We’ll talk about that in just a moment.</p>
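<p>If you'd rather see the layout in code than in a worksheet, here's a quick sketch. The low/high settings below are hypothetical, chosen only so the center point works out to 150 °C and 20 seconds as in the example above:</p>

```python
from itertools import product

# Hypothetical factor ranges (illustrative values only, chosen so the
# center point lands at 150 degrees C and 20 seconds).
factors = {"Temperature": (100, 200), "Time": (10, 30)}

# Corner runs of a 2^2 full factorial: every low/high combination.
corner_runs = [dict(zip(factors, combo))
               for combo in product(*factors.values())]

# Center point: each factor set midway between its low and high setting.
center = {name: (low + high) / 2 for name, (low, high) in factors.items()}

# Add 2 center-point runs, as in the worksheet shown above.
design = corner_runs + [center, center]

for run in design:
    print(run)
```

<p>In a real experiment you would also randomize the run order, which Minitab does for you when it builds the worksheet.</p>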
Why Should You Use Center Points in Your Designed Experiment?
<p>Including center points in a DOE offers many advantages:</p>
<strong><em>1. Is Y versus X linear?</em></strong>
<p>Factorial designs assume there’s a linear relationship between each X and Y. Therefore, if the relationship between any X and Y exhibits curvature, you shouldn’t use a factorial design because the results may mislead you.</p>
<p>So how do you statistically determine if the relationship is linear or not? With center points! If the center point p-value is significant (i.e., less than alpha), then you can conclude that curvature exists and use response surface DOE—such as a central composite design—to analyze your data. While factorial designs can <em>detect </em>curvature, you have to use a response surface design to <em>model</em> (build an equation for) the curvature.</p>
<p><img alt="Bad Fit Factorial Design" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/53c32bd3909d45e8354cb646226163c8/bad_fit.jpg" style="width: 300px; height: 200px; margin-left: 5px; margin-right: 5px;" /><img alt="Good Fit Response Surface Design" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/253453eaca36868d557cc80e23c8e4de/good_fit.jpg" style="width: 300px; height: 200px;" /></p>
<p>And the good news is that curvature often indicates that your X settings are near an optimum Y, and you've discovered insightful results!</p>
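<p>To get a feel for how center points reveal curvature, here's a bare-bones sketch of the idea. The response values are made up for illustration, and Minitab's actual test turns this gap into a p-value using a t-statistic; here we just compute the gap itself:</p>

```python
from statistics import mean

# Hypothetical responses from a 2^2 design (4 corner runs) plus
# 2 replicated center-point runs -- illustrative numbers only.
corner_y = [45.0, 52.0, 49.0, 56.0]   # responses at the low/high corners
center_y = [62.0, 61.0]               # responses at the center point

# If Y is linear in the X's, the center-point mean should land near the
# average of the corner responses. A large gap signals curvature.
curvature_effect = mean(center_y) - mean(corner_y)
print(f"curvature effect: {curvature_effect:+.1f}")
```

<p>Here the center of the design sits well above the plane through the corners, which is exactly the situation where a response surface design is worth the extra runs.</p>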
<strong><em>2. Did you collect enough data?</em></strong>
<p>If you don’t collect enough data, you aren’t going to detect significant X’s even if they truly exist. One way to increase the number of data points in a DOE is to use replicates. However, replicating an entire DOE can be expensive and time-consuming. For example, if you have 3 X’s and want to replicate the design, then you have to increase the number of experimental runs from 8 to 16!</p>
<p>Fortunately, using replicates is just one way to increase power. An alternative way to increase power is to use center points. By adding just a few center points to your design, you can increase the probability of detecting significant X’s, and estimate the variability (or pure error, statistically speaking).</p>
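<p>Here's a minimal sketch, with made-up numbers, of how replicated center points estimate pure error. Because the X settings are identical across the repeats, any spread in the responses is experimental noise:</p>

```python
from statistics import variance

# Hypothetical repeated runs at the same center-point settings.
center_y = [61.8, 62.4, 61.1, 62.7]

# The X's haven't changed, so this spread is pure experimental noise --
# the "pure error" the DOE analysis uses to judge the other effects.
pure_error_variance = variance(center_y)  # sample variance, df = n - 1
print(f"pure error variance: {pure_error_variance:.3f}")
```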
Learn More about DOE
<p><span style="line-height: 1.6;">DOE is a great tool. It tells you a lot about your inputs and outputs and can help you optimize process settings. But it’s only a great tool if you use it the right way. If you want to learn more about DOE, check out our e-learning course <a href="http://www.minitab.com/products/quality-trainer/">Quality Trainer</a> for $30 US. Or, you can participate in a full-day Factorial Designs course at one of our <a href="http://www.minitab.com/services/training/schedule/">instructor-led training sessions</a>.</span></p>
Data Analysis | Design of Experiments | Lean Six Sigma | Quality Improvement | Six Sigma | Statistics | Fri, 15 Jul 2016 12:00:00 +0000 | http://blog.minitab.com/blog/michelle-paret/doe-center-points-what-they-are-why-theyre-useful | Michelle Paret

Does Major League Baseball Really Need the Second Half of the Season?
http://blog.minitab.com/blog/the-statistics-game/does-major-league-baseball-really-need-the-second-half-of-the-season
<p><img alt="MLB Logo" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/8fe78a1febf88c009d5cf2943615c4a2/mlb_logo.gif" style="width: 250px; height: 129px; float: right; margin: 10px 15px;" />When you perform a statistical analysis, you want to make sure you collect enough data that your results are reliable. But you also want to avoid wasting time and money collecting more data than you need. So it's important to find an appropriate middle ground when determining your sample size.</p>
<p>Now, technically, the Major League Baseball regular season isn't a statistical analysis. But it does kind of work like one, since the goal of the regular season is to "determine who the best teams are." The National Football League uses a 16-game regular season to determine who the best teams are. Hockey and basketball use 82 games. </p>
<p>Baseball uses 162 games.</p>
<p>So is baseball wasting time collecting more data than it needs? Right now the MLB regular season is about halfway over. So could they just end the regular season now? Will playing another 81 games really have a significant effect on the standings? Let's find out.</p>
How much do MLB standings change in the 2nd half of the season?
<p>I went back through five years of records and recorded where each MLB team ranked in their league (American League and National League) on July 8, and then again at the end of the season. We can use this data to look at concordant and discordant pairs. A pair is concordant if the observations are in the same direction. A pair is discordant if the observations are in opposite directions. This will let us compare teams to each other two at a time.</p>
<p>For example, let's compare the Astros and Angels from 2015. On July 8th, the Astros were ranked 2nd in the AL and the Angels were ranked 3rd. At the end of the season, Houston was ranked 5th and the Angels were ranked 6th. This pair is concordant since in both cases the Astros were ranked higher than the Angels. But if you compare the Astros and the Yankees, you'll see the Astros were ranked higher on July 8th, but the Yankees were ranked higher at the end of the season. That pair is discordant.</p>
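<p>If you want to see the concordant/discordant logic in code, here's a small sketch built on the example above. The Yankees' exact ranks aren't given, so the 4th-to-3rd values below are made up to match the direction described:</p>

```python
from itertools import combinations

# (July 8th rank, final rank) for three teams. Astros and Angels ranks
# are from the 2015 example above; the Yankees' ranks are hypothetical.
ranks = {"Astros": (2, 5), "Angels": (3, 6), "Yankees": (4, 3)}

concordant = discordant = 0
for (july_a, final_a), (july_b, final_b) in combinations(ranks.values(), 2):
    # Concordant: the same team is ranked higher at both points in time
    # (the rank differences have the same sign).
    if (july_a - july_b) * (final_a - final_b) > 0:
        concordant += 1
    elif (july_a - july_b) * (final_a - final_b) < 0:
        discordant += 1

print(concordant, discordant)
```

<p>Astros vs. Angels comes out concordant; both pairs involving the Yankees come out discordant, since they leapfrogged the other two teams.</p>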
<p>When we compare every team, we end up with 11,175 pairs. How many of those are concordant? <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">Minitab Statistical Software</a> has the answer.</p>
<p style="margin-left: 40px;"><img alt="Measures of Concordance" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/29a56ecd2f92d8adf4e17f8dd54c9765/measures_of_concordance.jpg" style="width: 461px; height: 150px;" /></p>
<p>There are 8,307 concordant pairs, which is just over 74% of the data. So most of the time, if a team is higher in the standings as of July 8th, they will finish higher in the final standings too. We can also use Spearman's rho and Pearson's r to assess the association between standings on July 8th and the final standings. These two values give us a coefficient that can range from -1 to +1. The larger the absolute value, the stronger the relationship between the variables. A value of 0 indicates the absence of a relationship. </p>
<p style="margin-left: 40px;"><img alt="Pearsons r and Spearmans rho" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/24cba34f8af1f68e9a6ef5647d876695/meaures_of_association.jpg" style="width: 180px; height: 52px;" /></p>
<p>Both values are high and positive, once again indicating that teams ranked higher than other teams on July 8th usually stay that way by the end of the season. So did we do it? Did we show that baseball doesn't really need the 2nd half of their season?</p>
<p>Not quite.</p>
<p>Consider that each league has 15 teams. So a lot of our pairs are comparing teams that aren't that close together, like the 1st team to the 15th, the 1st team to the 14th, the 2nd team to the 15th, and so on. It's not very surprising that those pairs are going to be concordant. So let's dig a little deeper and compare each individual team's ranking in July to its ranking at the end of the season. The following <a href="http://blog.minitab.com/blog/michelle-paret/3-things-a-histogram-can-tell-you">histogram</a> shows the difference in a team's rank. Positive values mean the team moved up in the standings, negative values mean they fell.</p>
<p style="margin-left: 40px;"><img alt="Histogram" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/1612d472c3a617bfee1098bc29e00f27/histogram_of_difference.jpg" style="width: 576px; height: 384px;" /></p>
<p>The most common outcome is that a team doesn't move up or down in the standings, as 34 of our observations have a difference of 0. However, there are 150 total observations, so most of the time a team does move up or down. In fact, 55 times a team moved up or down in the standings by 3 or more spots. That's over a third of the time! And there are multiple instances of a team moving 6, 7, or even 8 spots! That doesn't seem to imply that the 2nd half of the season doesn't matter. So what if we narrow the scope of our analysis?</p>
Looking at the Playoff Teams
<p>We previously noted that the regular season is supposed to determine the best teams. So let's focus on the top of the MLB standings. I took the top 5 teams in each league (since the top 5 teams make the playoffs) on July 8th, and recorded whether they were still a top 5 team (and in the playoffs) at the end of the season. The following pie chart shows the results.</p>
<p style="margin-left: 40px;"><img alt="Pie Chart" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/be2891f00cde3cab6eadb050a0abeadb/pie_chart_of_playoffs_end.jpg" style="width: 576px; height: 384px;" /></p>
<p>Twenty-eight percent of the time, a team that was in the playoffs in July fell far enough in the standings to drop out. So over a quarter of your playoff teams would be different if the season ended around 82 games. That sounds like a significant effect to me. And finally, let's return to our concordant and discordant pairs. Except this time, we'll just look at the top half of the standings (top 8 teams). </p>
<p style="margin-left: 40px;"><img alt="Measures of Concordance" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/4ed88280224f9eba7726e54b50450f4c/measures_of_concordance_2.jpg" style="width: 468px; height: 194px;" /></p>
<p>This time our percentage of concordant pairs has dropped to 59%, and the values for Spearman's rho and Pearson's r show a weaker association. Teams ranked higher in the 1st half of the season are usually still ranked higher at the end of the season. But there is clearly enough shuffling among the top teams to warrant the 2nd half of the season. So don't worry baseball fans, your regular season will continue to extend to September.</p>
<p>Because, you know, Major League Baseball <em>totally </em>would have shortened the season if this statistical analysis suggested doing so!</p>
<p>And if you're looking to determine the appropriate sample size for your own analysis, Minitab offers a wide variety of <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/power-and-sample-size/power-and-sample-size-analyses-in-minitab/">power and sample size analyses</a> that can help you out.</p>
Data Analysis | Fun Statistics | Statistics | Statistics in the News | Fri, 08 Jul 2016 12:00:00 +0000 | http://blog.minitab.com/blog/the-statistics-game/does-major-league-baseball-really-need-the-second-half-of-the-season | Kevin Rudy

Using Marginal Plots, aka "Stuffed-Crust Charts"
http://blog.minitab.com/blog/data-analysis-and-quality-improvement-and-stuff/using-marginal-plots-aka-stuffed-crust-charts
<p><span style="line-height: 1.6;">In <a href="http://blog.minitab.com/blog/data-analysis-and-quality-improvement-and-stuff/the-matrix-its-a-complex-plot" target="_blank">my last post</a>, we took the red pill and dove deep into the unarguably fascinating and uncompromisingly compelling world of the matrix plot. I've stuffed this post with information about a topic of marginal interest...the marginal plot.</span></p>
<p>Margins are important. Back in my English composition days, I recall that margins were particularly prized for the inverse linear relationship they maintained with the number of words that one had to string together to complete an assignment. Mathematically, that relationship looks something like this:</p>
<p style="margin-left: 40px;">Bigger margins = fewer words</p>
<p><img alt="stuffed crust" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/62b6ee0d191245cf8e077f414a1e1d2d/stuffed_crust.jpg" style="width: 250px; height: 213px; margin: 10px 15px; float: right;" />In stark contrast to my concept of margins as information-free zones, the marginal plot actually utilizes the margins of a scatterplot to provide timely and important information about your data. Think of the marginal plot as the stuffed-crust pizza of the graph world. Only, instead of extra cheese, you get to bite into extra data. And instead of filling your stomach with carbs and cholesterol, you're filling your brain with data and knowledge. And instead of arriving late and cold because the delivery driver stopped off to canoodle with his girlfriend on his way to your house (<em>even though he's just not sure if the relationship is <span style="line-height: 20.8px;">really </span>working out: she seems distant lately and he's not sure if it's the constant cologne of consumables about him, or the ever-present film of pizza <span style="line-height: 1.6;">grease on his car seats, on his clothes, in his ears?)</span></em></p>
<p><span style="line-height: 1.6;">...anyway, unlike a cold, late pizza, marginal plots are always fresh and hot, because you bake them yourself, in </span><a href="http://www.minitab.com/en-us/products/minitab/" style="line-height: 1.6;" target="_blank">Minitab Statistical Software</a><span style="line-height: 1.6;">.</span></p>
<p>I tossed some randomly-generated data around and came up with this half-baked example. Like the pepperonis on a hastily prepared pie, the points on this plot are mostly piled in the middle, with only a few slices venturing to the edges. In fact, some of those points might be outliers. </p>
<p style="margin-left: 40px;"><img alt="Scatterplot of C1 vs C2" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/1f1e94af3820cd7138eda393fa0b0859/scatterplot_of_c1_vs_c2.jpg" style="width: 360px; height: 240px;" /></p>
<p><span style="line-height: 20.8px;">If only there were an easy, interesting, and integrated way to assess the data for outliers when we make a scatterplot. </span></p>
<p><span style="line-height: 20.8px;">Boxplots are a useful way to look for outliers. You could make separate boxplots of each variable, like so:</span></p>
<p style="margin-left: 40px;"><img alt="Boxplot of C1" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/d542dd348a0c357f5e5dc0476bc5ea9f/boxplot_of_c1.jpg" style="width: 360px; height: 240px;" /> <img alt="Boxplot of C2" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/b89d246d7a4e9951d4e0a49be2ad7eaf/boxplot_of_c2.jpg" style="line-height: 1.6; width: 360px; height: 240px;" /></p>
<p><span style="line-height: 20.8px;">It's fairly easy to relate the boxplot of C1 to the values plotted on the y-axis of the scatterplot. But it's a little harder to relate the boxplot of C2 to the scatterplot, because the y-axis on the boxplot corresponds to the x-axis on the scatterplot. You can transpose the scales on the boxplot to make the comparison a little easier. Just </span><span style="line-height: 20.8px;">double-click one of the axes and select <strong>Transpose value and category scales</strong>:</span></p>
<p style="margin-left: 40px;"><img alt="Boxplot of C2, Transposed" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/6c4db0ef1ee268a3c6f38400fd9e1f1c/boxplot_of_c2__transposed.jpg" style="width: 360px; height: 240px;" /></p>
<p>That's a little better. The only thing that would be <em>even better</em> is if you could put each boxplot right up against the scatterplot...if you could stuff the crust of the scatterplot with boxplots, so to speak. Well, guess what? You can! Just choose <strong>Graph > Marginal Plot > With Boxplots</strong>, enter the variables and click <strong>OK</strong>: </p>
<p style="margin-left: 40px;"><img alt="Marginal Plot of C1 vs C2" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/69fdd93c28ebcfd93071ad22af62f407/marginal_plot_of_c1_vs_c2.jpg" style="width: 360px; height: 240px;" /></p>
<p>Not only are the boxplots nestled right up next to the scatterplot, but they also share the same axes as the scatterplot. For example, the outlier (asterisk) on the boxplot of C2 corresponds to the point directly below it on the scatterplot. Looks like that point could be an outlier, so you might want to investigate further. </p>
<p><span style="line-height: 1.6;">Marginal plots can also help alert you to other important complexities in your data. Here's another half-baked example. Unlike our pizza delivery guy's relationship with his girlfriend, it looks like the relationship between the fake response and the fake predictor represented in this scatterplot really is working out:</span><span style="line-height: 20.8px;"> </span></p>
<p style="line-height: 20.8px; margin-left: 40px;"><img alt="Scatterplot of Fake Response vs Fake Predictor" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/a8a2fa08a7a9a917b7130e740c69514d/scatterplot_of_fake_response_vs_fake_predictor.jpg" style="line-height: 20.8px; width: 360px; height: 240px;" /> </p>
<p style="line-height: 20.8px;"><span style="line-height: 1.6;">In fact, i</span><span style="line-height: 20.8px;">f you use <strong>Stat > Regression > Fitted Line Plot</strong>, the fitted line appears to fit the data nicely. And the regression analysis is highly significant:</span></p>
<p style="line-height: 20.8px; margin-left: 40px;"><img alt="Fitted Line_ Fake Response versus Fake Predictor" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/aea45ae01a9481d7bf2553e1780521c5/fitted_line__fake_response_versus_fake_predictor.jpg" style="width: 360px; height: 240px;" /></p>
<strong>Regression Analysis: Fake Response versus Fake Predictor </strong>
<pre>
The regression equation is
Fake Response = 2.151 + 0.7723 Fake Predictor

S = 2.12304   R-Sq = 50.3%   R-Sq(adj) = 49.7%

Analysis of Variance

Source      DF       SS       MS      F      P
Regression   1  356.402  356.402  79.07  0.000
Error       78  351.568    4.507
Total       79  707.970
</pre>
<p><span style="line-height: 1.6;">But wait. If you create a marginal plot instead, you can augment your exploration of these data with histograms and/or dotplots, as I have done below. Looks like there's trouble in </span>paradise:</p>
<p style="margin-left: 40px;"><img alt="Marginal Plot of Fake Response vs Fake Predictor, with Histograms" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/107dfda7466c6690e33aee6e7f3918b6/marginal_plot_of_fake_response_vs_fake_predictor__with_histograms.jpg" style="width: 360px; height: 240px;" /> <img alt="Marginal Plot of Fake Response vs Fake Predictor, with Dotplots" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/8e3044a69af337cfcc15d4aaacd88f9f/marginal_plot_of_fake_response_vs_fake_predictor__with_dotplots.jpg" style="width: 360px; height: 240px;" /></p>
<p><span style="line-height: 20.8px;">Like the poorly made pepperoni pizza, the points on our plot are distributed unevenly. There appear to be two clumps of points. The distribution of values for the fake predictor is bimodal: that is, it has two distinct peaks. The distribution of values for the response may also be bimodal.</span></p>
<p>Why is this important? Because the <span style="line-height: 20.8px;">two </span>clumps of toppings may suggest that you have more than one metaphorical cook in the metaphorical pizza kitchen. For example, it could be that Wendy, who is left handed, started placing the pepperonis carefully on the pie and then got called away, leaving Jimmy, who is right handed, to quickly and carelessly complete the covering of cured meats. In other words, it could be that the <span style="line-height: 20.8px;">two </span>clumps of points represent <span style="line-height: 20.8px;">two </span>very different populations. </p>
<p>When I tossed and stretched the data for this example, I took random samples from two different populations. I used 40 random observations from a normal distribution with a mean of 8 and a standard deviation of 1.5, and 40 random observations from a normal distribution with a mean of 13 and a standard deviation of 1.75. The two clumps of data are truly from <span style="line-height: 20.8px;">two </span>different populations. To illustrate, I separated the <span style="line-height: 20.8px;">two </span>populations into two different groups in this scatterplot: </p>
<p style="margin-left: 40px;"> <img alt="Scatterplot with Groups" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/924cfc845dd807e6e2fd57cbbcc0abcb/scatterplot_of_fake_response_vs_fake_predictor_with_groups.jpg" style="width: 360px; height: 240px;" /></p>
<p>This is a classic conundrum that can occur when you do a <span><a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit">regression analysis</a></span>. The regression line tries to pass through the center of the data. And because there are two clumps of data, the line tries to pass through the center of each clump. This <em>looks </em>like a relationship between the response and the predictor, but it's just an illusion. If you separate the clumps and analyze each population separately, you discover that there is no relationship at all: </p>
<p style="margin-left: 40px;"><img alt="Fitted Line_ Fake Response 1 versus Fake Predictor 1" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/e5981c9b2604bf841a525926d8282d8f/fitted_line__fake_response_1_versus_fake_predictor_1.jpg" style="width: 360px; height: 240px;" /></p>
<strong>Regression Analysis: Fake Response 1 versus Fake Predictor 1 </strong>
<pre>
The regression equation is
Fake Response 1 = 9.067 - 0.1600 Fake Predictor 1

S = 1.64688   R-Sq = 1.5%   R-Sq(adj) = 0.0%

Analysis of Variance

Source      DF       SS       MS     F      P
Regression   1    1.609  1.60881  0.59  0.446
Error       38  103.064  2.71221
Total       39  104.673
</pre>
<p style="margin-left: 40px;"><img alt="Fitted Line_ Fake Response 2 versus Fake Predictor 2" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/e7db8af8c22bd83b72ad559ca5aece86/fitted_line__fake_response_2_versus_fake_predictor_2.jpg" style="width: 360px; height: 240px;" /></p>
<strong>Regression Analysis: Fake Response 2 versus Fake Predictor 2</strong>
<pre>
The regression equation is
Fake Response 2 = 12.09 + 0.0532 Fake Predictor 2

S = 1.62074   R-Sq = 0.3%   R-Sq(adj) = 0.0%

Analysis of Variance

Source      DF      SS       MS     F      P
Regression   1   0.291  0.29111  0.11  0.741
Error       38  99.818  2.62679
Total       39  100.109
</pre>
<p>If only our unfortunate pizza delivery technician could somehow use a marginal plot to help him assess the state of his own relationship. But alas, I don't think a marginal plot is going to help with that particular analysis. Where is that guy anyway? I'm getting hungry. </p>
Fun Statistics | Project Tools | Regression Analysis | Statistics | Wed, 06 Jul 2016 12:27:00 +0000 | http://blog.minitab.com/blog/data-analysis-and-quality-improvement-and-stuff/using-marginal-plots-aka-stuffed-crust-charts | Greg Fox