Minitab | Minitab
Blog posts and articles about using Minitab software in quality improvement projects, research, and more.
http://blog.minitab.com/blog/minitab/rss
Sun, 05 Jul 2015 09:25:40 +0000
Applying DOE for Great Grilling, part 2
http://blog.minitab.com/blog/understanding-statistics/applying-doe-for-great-grilling-part-2
<p><img alt="grill" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/111e4a65160cf20662dfb13013408f1f/grill.jpg" style="margin: 10px 15px; width: 202px; height: 202px; line-height: 18.9px; float: right;" /></p>
<p style="line-height: 18.9px;"><span style="line-height: 18.9px;">Design of Experiments is an extremely powerful statistical method, so we added a DOE tool to the Assistant in Minitab 17 to make it accessible to more people.</span></p>
<p style="line-height: 18.9px;"><span style="line-height: 18.9px;">Since it's summer here, I'm applying the Assistant's DOE tool to outdoor cooking.</span><span style="line-height: 18.9px;"> </span>Earlier, I showed you <a href="http://blog.minitab.com/blog/understanding-statistics/applying-doe-for-great-grilling-part-1">how to set up a designed experiment</a> that will let you optimize how you grill steaks. </p>
<p>If you're not already using it and you want to play along, you can download the <a href="http://it.minitab.com/products/minitab/free-trial.aspx">free 30-day trial version</a> of Minitab Statistical Software.</p>
<p style="line-height: 18.9px;">Perhaps you are following along, and you've already grilled your steaks according to the experimental plan and recorded the results of your experimental runs. Otherwise, feel free to download our data <a href="//cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/a0d8f12f27ee5a981619c2c3af59d524/steaks___asst_doe.MTW">here</a> for the next step: analyzing the results of our experiment. </p>
Analyzing the Results of the Steak Grilling Experiment
<p style="line-height: 18.9px;">After collecting your data and entering it into Minitab, you should have an experimental worksheet that looks like this: </p>
<p style="line-height: 18.9px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ed29e3c1fb41872df6529e91786215f2/grill_doe_worksheet.png" style="width: 500px; height: 320px;" /></p>
<p style="line-height: 18.9px;">With your results entered in the worksheet, select <strong>Assistant > DOE > Analyze and Interpret</strong>. As you can see below, the only button you can click is "Fit Linear Model." </p>
<p style="line-height: 18.9px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/1ce7cb7744e6fb78c4f5cb74d1903cf6/grill_doe_analyze.png" style="width: 500px; height: 375px;" /></p>
<p style="line-height: 18.9px;">As you might gather from the flowchart, when it analyzes your data, the Assistant first checks to see if the response exhibits curvature. If it does, the Assistant will prompt you to gather more data so it can fit a quadratic model. Otherwise, the Assistant will fit the linear model and provide the following output. </p>
<p style="line-height: 18.9px;">When you click the "Fit Linear Model" button, the Assistant automatically identifies your response variable.</p>
<p style="line-height: 18.9px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a851201ceabf727ba38e53ef383d6091/grill_doe_analyze2.png" style="width: 435px; height: 260px;" /></p>
<p style="line-height: 18.9px;">All you need to do is confirm your response goal—maximizing flavor, in this case—and press OK. The Assistant performs the analysis, and provides you the results in a series of easy-to-interpret reports. </p>
Understanding the DOE Results
<p style="line-height: 18.9px;">First, the Assistant offers a summary report that gives you the bottom-line results of the analysis. The Pareto Chart of Effects in the top left shows that Turns, Grill type, and Seasoning are all statistically significant, and there's a significant interaction between Turns and Grill type, too. </p>
<p style="line-height: 18.9px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/9ac1a8e009efec8b90fdbb32cfebd1df/grill_doe_results_summary.png" style="width: 751px; height: 563px;" /></p>
<p style="line-height: 18.9px;">The summary report also shows that the model explains a very high proportion of the variation in flavor, with an R<sup>2</sup> value of 95.75 percent. And the "Comments" window in the lower right corner puts things in plain language: "You can conclude that there is a relationship between Flavor and the factors in the model..."</p>
<p style="line-height: 18.9px;">The Assistant's Effects report, shown below, tells you more about the nature of the relationship between the factors in the model and Flavor, with both Interaction Plots and Main Effects plots that illustrate how different experimental settings affect the Flavor response. </p>
<p style="line-height: 18.9px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/4a9d3a9939ad51a9326bed0fbd061048/grill_doe_results_effects.png" style="width: 751px; height: 563px;" /></p>
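<p style="line-height: 18.9px;">If you're curious what a main effect is numerically, here is a minimal Python sketch. It is not the Assistant's algorithm, and the flavor ratings are hypothetical values invented for illustration, not the data from this experiment. It only shows the basic idea behind a main effects plot: compare the mean response at a factor's high setting with the mean response at its low setting.</p>

```python
from statistics import mean

# HYPOTHETICAL runs: (turns, grill, seasoning, flavor rating).
# These ratings are made up for illustration; they are not the article's data.
runs = [
    (1, "Gas", "Salt-Pepper", 5.0),
    (3, "Gas", "Salt-Pepper", 4.0),
    (1, "Charcoal", "Salt-Pepper", 7.0),
    (3, "Charcoal", "Salt-Pepper", 4.5),
    (1, "Gas", "Montreal", 6.0),
    (3, "Gas", "Montreal", 5.0),
    (1, "Charcoal", "Montreal", 8.0),
    (3, "Charcoal", "Montreal", 5.5),
]

# Position of each factor inside a run tuple.
FACTOR_INDEX = {"Turns": 0, "Grill": 1, "Seasoning": 2}


def main_effect(runs, factor, low, high):
    """Mean response at the high setting minus mean response at the low setting."""
    col = FACTOR_INDEX[factor]
    high_mean = mean(r[3] for r in runs if r[col] == high)
    low_mean = mean(r[3] for r in runs if r[col] == low)
    return high_mean - low_mean


turns_effect = main_effect(runs, "Turns", low=1, high=3)
grill_effect = main_effect(runs, "Grill", low="Gas", high="Charcoal")
seasoning_effect = main_effect(runs, "Seasoning", low="Salt-Pepper", high="Montreal")

print("Turns:", turns_effect)
print("Grill:", grill_effect)
print("Seasoning:", seasoning_effect)
```

<p style="line-height: 18.9px;">A negative effect for Turns in this made-up data would mean that flavor drops as the number of turns goes up. The Assistant goes further than this sketch by testing each effect for statistical significance and by examining interactions.</p>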
<p style="line-height: 18.9px;">And if we're looking to make some changes as a result of our experimental results—like selecting an optimal method for grilling steaks in the future—the Prediction and Optimization report gives us the optimal solution (1 turn on a charcoal grill, with Montreal seasoning) and its predicted Flavor response (8.425). </p>
<p style="line-height: 18.9px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f9189c39c79160de4b9c5dbf8f4523ab/grill_doe_results_optimization.png" style="width: 751px; height: 563px;" /></p>
<p style="line-height: 18.9px;"><span style="line-height: 1.6;">It also gives us the Top 5 alternative solutions, shown in the bottom right corner, so if there's some reason we can't implement the optimal solution—for instance, if we only have a gas grill—we can still choose the best solution that suits our circumstances. </span></p>
<p style="line-height: 18.9px;">I hope this example illustrates how easy a designed experiment can be when you use the Assistant to create and analyze it, and that designed experiments can be very useful not just in industry or the lab, but also in your everyday life. </p>
<p style="line-height: 18.9px;">Where could you benefit from analyzing process data to optimize your results? </p>
Design of Experiments, Fun Statistics, Statistics Help
Thu, 02 Jul 2015 12:00:00 +0000
http://blog.minitab.com/blog/understanding-statistics/applying-doe-for-great-grilling-part-2
Eston Martz
Applying DOE for Great Grilling, part 1
http://blog.minitab.com/blog/understanding-statistics/applying-doe-for-great-grilling-part-1
<p>Design of Experiments (DOE) has a reputation for difficulty, and to an extent, this statistical method <em>deserves </em>that reputation. While it's easy to grasp the basic idea—<em>acquire the maximum amount of information from the fewest number of experimental runs</em>—practical application of this tool can quickly become very confusing. </p>
<p><img alt="steaks" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/33d85058b493aff4240dfb9d78aff673/steaks.jpg" style="margin: 10px 15px; width: 250px; height: 250px; float: right;" />Even if you're a long-time user of designed experiments, it's still easy to feel uncertain if it's been a while since you last looked at split-plot designs or needed to choose the appropriate resolution for a fractional factorial design.</p>
<p>But DOE <em>is</em> an extremely powerful and useful tool, so when we launched Minitab 17, we added a DOE tool to the Assistant to make designed experiments more accessible to more people.</p>
<p>Since summer is here at Minitab's world headquarters, I'm going to illustrate how you can use the Assistant's DOE tool to optimize your grilling method. </p>
<p>If you're not already using it and you want to play along, you can download the free 30-day <a href="http://it.minitab.com/products/minitab/free-trial.aspx">trial version of Minitab Statistical Software</a>.</p>
Two Types of Designed Experiments: Screening and Optimizing
<p>To create a designed experiment using the Assistant, open Minitab and select <strong>Assistant > DOE > Plan and Create</strong>. You'll be presented with a decision tree that helps you take a sequential approach to the experimentation process by offering a choice between a screening design and a modeling design.</p>
<p><img alt="DOE Assistant" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/5b585531f6031882fb7880a49700f52c/grill_doe_1.png" style="width: 487px; height: 366px;" /></p>
<p>A <strong>screening design</strong> is important if <span><a href="http://blog.minitab.com/blog/understanding-statistics/why-is-the-office-coffee-so-bad-a-screening-experiment-narrows-down-the-critical-factors">you have a lot of potential factors to consider</a></span> and you want to figure out which ones matter. The Assistant guides you through the process of testing and analyzing the main effects of 6 to 15 factors, and identifies the factors that have the greatest influence on the response.</p>
<p>Once you've identified the critical factors, you can use the <strong>modeling design.</strong> Select this option, and the Assistant guides you through testing and analyzing 2 to 5 critical factors and helps you find optimal settings for your process.</p>
<p>Even if you're an old hand at analyzing designed experiments, you may want to use the Assistant to create designs since the Assistant lets you print out easy-to-use data collection forms for each experimental run. After you've collected and entered your data, the designs created in the Assistant can also be analyzed using <span style="line-height: 18.9px;">Minitab's </span><span style="line-height: 1.6;">core DOE tools available through the <strong>Stat > DOE</strong> menu.</span></p>
<span style="line-height: 1.6;">Creating a DOE to Optimize How We Grill Steaks</span>
<p>For grilling steaks, there aren't that many variables to consider, so we'll use the Assistant to <span style="line-height: 1.6;">plan and create a <strong>modeling design</strong> that will optimize our grilling process. Select <strong>Assistant > DOE > Plan and Create</strong>, then click the "Create Modeling Design" button. </span></p>
<p><span style="line-height: 1.6;">Minitab brings up an easy-to-follow dialog box; all we need to do is fill it in. </span></p>
<p><span style="line-height: 1.6;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/eb90fd8499ab96a579aa6dd63fa325d2/grill_doe_dialog_1.png" style="width: 461px; height: 500px;" /></span></p>
<p>First we enter the name of our Response and the goal of the experiment. Our response is "Flavor," and the goal is "Maximize the response." Next, we enter our factors. We'll look at three critical variables:</p>
<ul>
<li>Number of turns, a continuous variable with a low value of 1 and high value of 3.</li>
<li>Type of grill, a categorical variable with Gas or Charcoal as options. </li>
<li>Type of seasoning, a categorical variable with Salt-Pepper or Montreal steak seasoning as options. </li>
</ul>
<p>If we wanted to, we could select more than 1 replicate of the experiment. A replicate is simply a complete set of experimental runs, so if we did 3 replicates, we would repeat the full experiment three times. But since this experiment has 16 runs, and neither our budget nor our stomachs are limitless, we'll stick with a single replicate. </p>
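<p>To make the idea of runs and replicates concrete, here is a minimal Python sketch that enumerates every combination of the three factors' low and high settings. The function name <em>full_factorial</em> is just an illustrative choice. This is not how the Assistant builds its design: a plain two-level factorial for three factors has 8 runs per replicate, so the Assistant's 16-run design evidently includes additional runs that this sketch does not try to reproduce.</p>

```python
from itertools import product

# Factor settings taken from the list above.
factors = {
    "Turns": [1, 3],
    "Grill": ["Gas", "Charcoal"],
    "Seasoning": ["Salt-Pepper", "Montreal"],
}


def full_factorial(factors, replicates=1):
    """Return one dict per run, covering every combination of factor settings.

    A replicate is a complete repeat of the whole set of combinations.
    """
    names = list(factors)
    runs = []
    for _ in range(replicates):
        for combo in product(*factors.values()):
            runs.append(dict(zip(names, combo)))
    return runs


design = full_factorial(factors)
for number, run in enumerate(design, start=1):
    print(number, run)

# Three replicates repeat the full set of combinations three times.
print(len(full_factorial(factors, replicates=3)))
```

<p>In a real experiment you would also randomize the run order, which the Assistant's worksheet and data collection forms handle for you.</p>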
<p>When we click OK, the Assistant first asks if we want to print out data collection forms for this experiment: </p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c4c63c4b5af7a4c6e4f3c4caa327f523/grill_doe_collection_form1.png" style="width: 445px; height: 207px;" /></p>
<p>Choose Yes, and you can print a form that lists each run, the variables and settings, and a space to fill in the response:</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/06ed8ad486f3a243c4aea352c9738b2c/grill_doe_collection_form2.png" style="border-width: 1px; border-style: solid; width: 500px; height: 313px;" /></p>
<p>Alternatively, you can just record the results of each run in the worksheet the Assistant creates, which you'll need to do anyway. But having the printed data collection forms can make it much easier to keep track of where you are in the experiment, and exactly what your factor settings should be for each run. </p>
<p>If you've used the Assistant in Minitab for other methods, you know that it seeks to demystify your analysis and make it easy to understand. When you create your experiment, the Assistant gives you a Report Card and Summary Report that explain the steps of the DOE and important considerations, and a summary of your goals and what your analysis will show. </p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a767257b9db465b81d6bd1456e5eb508/grill_doe_2_w1024.png" style="width: 650px; height: 439px;" /></p>
<p>Now it's time to cook some steaks, and rate the flavor of each. If you want to do this for real and collect your own data, please do so! <a href="http://blog.minitab.com/blog/understanding-statistics/applying-doe-for-great-grilling-part-2">Tomorrow's post</a> will show how to analyze your data with the Assistant. </p>
Wed, 01 Jul 2015 12:00:00 +0000
http://blog.minitab.com/blog/understanding-statistics/applying-doe-for-great-grilling-part-1
Eston Martz
How MLB's Understanding of Line Drive Data Fails to Protect Pitchers
http://blog.minitab.com/blog/fun-with-statistics/how-mlbs-understanding-of-line-drive-data-fails-to-protect-pitchers
<p>Last month the ESPN series <em>Outside the Lines</em> reported on major league pitchers suffering serious injuries from being struck in the head by line drives, and on the efforts MLB is making to have protective gear developed for pitchers.</p>
<p>A couple of things jump out at me from the clip:</p>
<ol>
<li>The overwhelming majority of pitchers are not interested in wearing protective gear if it is either visually obvious or noticeable to the pitcher himself, who fears it will affect his ability to pitch well.</li>
<li>The standard set by Major League Baseball is that approved headgear must be able to protect against a ball travelling at 83 mph, the average speed at which line drives are travelling when they reach the pitcher's mound.</li>
</ol>
<p><img alt="Torres in protective headgear" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/2556886c7605e12a663ffcd69b419cdf/torres_gear.jpg" style="width: 170px; height: 170px; border-width: 1px; border-style: solid; margin: 10px 15px; float: right;" />Upon watching the report, I knew immediately it would have little if any impact on pitcher safety, and that pitchers will continue suffering severe injuries or even death from line drives until a stronger standard is set and pitchers are forced to wear approved devices. A <a href="http://blog.minitab.com/blog/understanding-statistics/three-dangerous-statistical-mistakes">faulty understanding of statistics</a> has led to the current standard, and I will outline three reasons why.</p>
<ol>
<li><strong>The standard was set as the average.</strong> First, I would like to say it is commendable that MLB collected data on the ball speeds in order to set the standard rather than just making some intuitive guess. However, that data was then turned into a single value, as is unfortunately so common in the world: <em>the average</em>. I think the problem is that what statisticians call the mean, most people refer to as the "average." When most people hear the term "average" they associate it with a meaning somewhat like "common" or "typical," but to know what is common or typical we must also know about the variation in the data. Assuming line drive speeds are symmetrically distributed, half of them will exceed 83 mph and half will not. Very few will actually <em>be </em>83 mph, so that value is really not common or typical at all. In selecting this value as a standard, <em>baseball's governing body has determined that head gear does not need to protect against half of line drives.</em></li>
<li><strong>The standard ignores the relationship between speed and likelihood of striking the pitcher.</strong> The standard begins by ignoring half of all line drives. But it's actually worse than that. From #1 you might assume that while the average was not the best choice, cutting the rate of line drives hitting pitchers in the head and injuring them in half is a pretty good first step. But that would assume all line drive speeds are equally likely to hit the pitcher in the head, and that is certainly not the case. A pitcher has twice as long to react to a 60 mph drive as he does a 120 mph drive, which, of course, is more likely to actually hit him. Their analysis assumes the distribution of line drive speeds hitting pitchers in the head would match the distribution of all line drive speeds, whereas almost every instance of a head strike involves the ball travelling faster than the average speed. So the rule protects against line drives that are, for the most part, <em>not </em>actually hitting the pitcher in the head.</li>
<li><strong>The standard ignores the relationship between speed and severity of injury.</strong> Aside from a pitcher being much more likely to react to a slower ball and avoid the hit in the first place, that slower ball was likely to be much less damaging if contact was made. The balls travelling very fast—which we've just stated were more likely to hit the pitcher—are also considerably more damaging and most needing of protection.</li>
</ol>
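<p>Two of the claims above are easy to check with a few lines of Python. This is a rough sketch under simplifying assumptions that are mine, not MLB's: the ball travels at constant speed, the pitcher is 60.5 feet from the plate (the regulation rubber-to-plate distance; after his follow-through he is actually closer), and line drive speeds follow a normal distribution centered on 83 mph with a hypothetical standard deviation of 10 mph.</p>

```python
from statistics import NormalDist

MOUND_DISTANCE_FT = 60.5          # regulation distance from rubber to home plate
FT_PER_S_PER_MPH = 5280 / 3600    # 1 mph expressed in feet per second


def reaction_time_s(speed_mph, distance_ft=MOUND_DISTANCE_FT):
    """Seconds for a ball at constant speed to cover the distance (simplified)."""
    return distance_ft / (speed_mph * FT_PER_S_PER_MPH)


slow = reaction_time_s(60)
fast = reaction_time_s(120)
print(f"60 mph: {slow:.3f} s, 120 mph: {fast:.3f} s, ratio: {slow / fast:.1f}")

# For any symmetric distribution, half of the values lie above the mean.
# The standard deviation of 10 mph is a HYPOTHETICAL value; the share above
# the mean does not depend on it.
speeds = NormalDist(mu=83, sigma=10)
share_above_standard = 1 - speeds.cdf(83)
print(f"Share of line drives faster than 83 mph: {share_above_standard:.0%}")
```

<p>Under these assumptions the pitcher has roughly 0.69 seconds to react to a 60 mph drive but only about 0.34 seconds for a 120 mph drive.</p>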
<p>So to summarize, setting the standard at the average speed has the effect of protecting pitchers against line drives that are unlikely to hit them and will cause much less damage if they do so. Given that pitchers already don't want to wear protection and will quickly catch on to these facts intuitively (even if they don't think in statistical terms), it's hard to imagine many pitchers adopting the gear if not required, or truly being more protected in any meaningful way if they do.</p>
<p>As <a href="http://www.minitab.com/services/training/">Minitab trainer</a> Paul Sheehy was telling me recently, giving someone powerful tools like statistics and not properly training them in how to use them is letting them "run with scissors." Unfortunately in this case it is major league pitchers who stand to get hurt, and not the people carrying the scissors... </p>
<p style="font-size:9px;"><em>Photograph of Alex Torres by <a href="https://commons.wikimedia.org/wiki/File:Alex_Torres_on_April_23,_2015.jpg" target="_blank">UCinternational</a>, used under Creative Commons 2.0. </em></p>
Data Analysis, Statistics, Statistics Help, Statistics in the News, Stats
Mon, 29 Jun 2015 12:50:00 +0000
http://blog.minitab.com/blog/fun-with-statistics/how-mlbs-understanding-of-line-drive-data-fails-to-protect-pitchers
Joel Smith
3 Ways to Clean Up Data So You Can Promote Public Dialog
http://blog.minitab.com/blog/statistics-and-quality-improvement/3-ways-to-clean-up-data-so-you-can-promote-public-dialog
<p><em><img alt="A Philadelphia Police Department car" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/ca0d9fad2f6a379c414c42f134e34260/1280px_philadelphia_police___cruiser_on_ben_franklin_parkway_w1024.jpeg" style="float: right; width: 275px; height: 206px; border-width: 1px; border-style: solid; margin: 10px 15px;" />"By publishing the historical data, public dialogue that results from the data release can be more productive because you’ll be able to discuss changes over time."</em> — <a href="http://www.codeforamerica.org/blog/2015/05/17/5-ways-to-jumpstart-the-release-of-open-data-on-policing/" target="_blank">Denice Ross, 5/17/2015</a></p>
<p>Last month, President Obama launched the <a href="https://www.whitehouse.gov/blog/2015/05/18/launching-police-data-initiative" target="_blank">Police Data Initiative</a>. A key goal of the initiative was to make data about police departments more accessible to the public. Twenty-one communities decided to participate in the initial round, including Philadelphia.</p>
<p>Among <span style="line-height: 18.9090900421143px;">Code for America's</span><span style="line-height: 1.6;"> recommendations to help police departments get started was the suggestion to open historical records. On June 19</span><span style="line-height: 1.6;">, the data set "Philadelphia Police Advisory Commission Complaints" was made available via </span><a href="https://www.opendataphilly.org/dataset/activity/philadelphia-police-advisory-commission-complaints" style="line-height: 1.6;" target="_blank">opendataphilly.org</a><span style="line-height: 1.6;">. The data set includes several variables about complaints made against police officers between 2009 and 2012, and gives us the chance to explore some steps you can take to clean up your data for analysis, using features in Minitab.</span></p>
Proper
<p>One thing to look for is redundant categories and labels. If you download the data and take a look at the actions that resulted from the complaints, you’ll find these values with these frequencies.</p>
<p style="margin-left: 40px;"><strong>Tally for Discrete Variables: ACTION </strong></p>
<p style="margin-left: 40px;"><span style="font-family: courier new; font-size:9pt"> ACTION Count<br />
Accept 26<br />
ACCEPT 161<br />
Audit 11<br />
AUDIT 64<br />
NAR 16<br />
NoJurisdiction 2<br />
NON-JURISDICTIONAL 1<br />
Reject 23<br />
REJECT 142<br />
Rejected 1<br />
WITHDRAWN 3<br />
N= 450<br />
*= 5</span></p>
<p>It’s easy to see that the values “Accept” and “ACCEPT” should be the same. If you're using Minitab, it can change those values for you. (If you're not using Minitab, you can get a <a href="http://it.minitab.com/products/minitab/free-trial.aspx">free 30-day trial</a>.) Try this:</p>
<ol>
<li>Choose <strong>Calc > Calculator</strong>.</li>
<li>In <strong>Store result in variable</strong>, enter <em>‘Action taken’</em>.</li>
<li>In <strong>Expression</strong> enter <em>Proper(ACTION)</em>. Click <strong>OK</strong>.</li>
</ol>
<p>Now there’s a column with these values and frequencies:</p>
<p style="margin-left: 40px;"><strong>Tally for Discrete Variables: Action taken </strong></p>
<p style="margin-left: 40px;"><span style="font-family: courier new; font-size:9pt"> Action taken Count<br />
Accept 187<br />
Audit 75<br />
Nar 16<br />
Nojurisdiction 2<br />
Non-jurisdictional 1<br />
Reject 165<br />
Rejected 1<br />
Withdrawn 3<br />
N= 450<br />
*= 5</span></p>
<p>Instead of having to make 62 corrections in the data, you have to make only 2. Prefer a different format? You could substitute LOWER or UPPER for PROPER to get all lowercase or all uppercase letters.</p>
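<p>If you ever need the same cleanup outside Minitab, the idea is easy to sketch in Python. The <em>proper</em> helper below is my own illustrative stand-in for Minitab's Proper function, not its actual implementation; it capitalizes the first letter of each space-separated word and lowercases the rest. The counts come from the first tally above.</p>

```python
from collections import Counter

# Counts copied from the "Tally for Discrete Variables: ACTION" output above.
action_counts = {
    "Accept": 26, "ACCEPT": 161,
    "Audit": 11, "AUDIT": 64,
    "NAR": 16,
    "NoJurisdiction": 2, "NON-JURISDICTIONAL": 1,
    "Reject": 23, "REJECT": 142, "Rejected": 1,
    "WITHDRAWN": 3,
}


def proper(text):
    """Illustrative stand-in for Proper(): 'ACCEPT' -> 'Accept'."""
    return " ".join(word.capitalize() for word in text.split(" "))


tally = Counter()
for value, count in action_counts.items():
    tally[proper(value)] += count

for value in sorted(tally):
    print(f"{value:20s}{tally[value]:5d}")
```

<p>Swapping <em>str.lower</em> or <em>str.upper</em> in for the helper gives the all-lowercase or all-uppercase variants.</p>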
Left
<p>The Philadelphia data set includes a variable for the date and time of the incident, but none of the times are recorded. Including the unused values for time yields data like these:</p>
<p style="margin-left: 40px;"> <img alt="The time values are all 0." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/a6f1ea2e38c65d366fb987fe1e575370/rawdates.png" style="width: 173px; height: 121px;" /></p>
<p>To get the usable "date" portion of the data, you can use the calculator. Try this:</p>
<ol>
<li>Choose <strong>Calc > Calculator</strong>.</li>
<li>In <strong>Store result in variable</strong>, enter '<em>Text date'</em>.</li>
<li>In <strong>Expression</strong>, enter <em>Left(DATE_, 10)</em>. Click <strong>OK</strong>.</li>
</ol>
<p>The column that results is still formatted as text. To do an analysis where you can sort by date, you can quickly <a href="http://blog.minitab.com/blog/statistics-and-quality-improvement/3-new-things-you-can-do-by-right-clicking-in-minitab-172">change the date format</a>. Select a cell in the column, right-click, and select <strong>Format Column</strong>. When you pick <strong>Date</strong> from the list of types, Minitab recognizes the format for you.</p>
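<p>The same two steps, keeping the first 10 characters and then converting the text to a real date, look like this in Python. The sample values and their year-month-day layout are assumptions made for illustration; check the layout of the DATE_ column in your own download and adjust the format string to match.</p>

```python
from datetime import datetime

# HYPOTHETICAL raw values: a date followed by an unused, all-zero time.
raw_values = [
    "2012-11-03 00:00:00",
    "2009-01-15 00:00:00",
    "2010-06-21 00:00:00",
]

# Equivalent of Left(DATE_, 10): keep only the first 10 characters.
text_dates = [value[:10] for value in raw_values]

# Equivalent of changing the column type to Date: parse the text.
# "%Y-%m-%d" is an assumption about how the dates are written.
dates = [datetime.strptime(text, "%Y-%m-%d").date() for text in text_dates]

# Real dates sort chronologically.
for d in sorted(dates):
    print(d.isoformat())
```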
Code
<p>If you dig a bit deeper into the data, you’ll notice an oddity that’s not readily apparent. The current web site for the police in Philadelphia lists 21 districts, but the data include 23 units. That's because the 23rd District has been incorporated into the 22nd District, and the 4th District into the 3rd. If you want to count complaints about officers from those former districts under their new districts, you can recode the districts. Try this:</p>
<ol>
<li>Choose <strong>Data > Code > To Text</strong>.</li>
<li>In <strong>Code values in the following columns</strong>, enter <em>Unit</em>.</li>
<li>In <strong>Method</strong>, select <strong>Code Individual Values</strong>.</li>
<li>For District 4, change the <strong>Coded value</strong> to <em>District 3</em>.</li>
<li>For District 23, change the <strong>Coded value</strong> to <em>District 22</em>.</li>
<li>Click <strong>OK.</strong></li>
</ol>
<p><strong>Code </strong></p>
<p><span style="font-family: courier new; font-size:9pt">Summary</span></p>
<p><span style="font-family: courier new; font-size:9pt"> Number<br />
Original Value Recoded Value of Rows<br />
District 4 District 3 2<br />
District 23 District 22 7</span></p>
<p><br />
<span style="font-family: courier new; font-size:9pt">Source data column UNIT<br />
Recoded data column Coded UNIT</span></p>
<p><span style="font-family: courier new; font-size:9pt">Number of unchanged rows: 446</span></p>
<p>Minitab shows you a summary table so you can see how the values were recoded and you’re ready to go!</p>
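<p>Recoding is a simple lookup, which can be sketched in Python with a dictionary. The mapping comes from the steps above; the short list of units is a hypothetical sample, not the real column, and <em>recode_units</em> is an illustrative name.</p>

```python
# Mapping from the steps above: merged districts point to their new district.
RECODE = {
    "District 4": "District 3",
    "District 23": "District 22",
}


def recode_units(units, mapping=RECODE):
    """Return the recoded list plus the number of rows that changed."""
    recoded = [mapping.get(unit, unit) for unit in units]
    changed = sum(1 for old, new in zip(units, recoded) if old != new)
    return recoded, changed


# HYPOTHETICAL sample of the UNIT column.
sample_units = ["District 4", "District 12", "District 23", "District 22", "District 23"]
coded_units, rows_changed = recode_units(sample_units)

print(coded_units)
print("Rows changed:", rows_changed)
```

<p>Values that are not in the mapping pass through untouched, which corresponds to the "unchanged rows" line in Minitab's summary.</p>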
Wrap up
<p>Whether you have data about police complaints or <a href="http://www.minitab.com/Case-Studies/Via-Christi-Health/">patient throughput times</a>, you’re likely to need to do a little bit of work for your data to be ready to analyze. Fortunately, Minitab makes it easy to make common adjustments like getting the case of letters to match across entries. The faster your data is ready to analyze, the faster you can do the analysis to make better decisions.</p>
<p><em>The photo of the police car is by <a href="https://en.wikipedia.org/wiki/Philadelphia_Police_Department#/media/File:Philadelphia_Police_-_cruiser_on_Ben_Franklin_Parkway.jpeg">Zuzu</a> and is licensed under this <a href="https://creativecommons.org/licenses/by-sa/3.0/">Creative Commons License</a>.</em></p>
Learning, Statistics in the News
Wed, 24 Jun 2015 13:20:43 +0000
http://blog.minitab.com/blog/statistics-and-quality-improvement/3-ways-to-clean-up-data-so-you-can-promote-public-dialog
Cody Steele
Are the Chicago Blackhawks Currently the Luckiest Team in Sports?
http://blog.minitab.com/blog/the-statistics-game/are-the-chicago-blackhawks-currently-the-luckiest-team-in-sports
<p><img alt="Blackhawks" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/085c8f0c81d6b3d2422f3d3f0fbed19c/chicagoblackhawkslogo_svg_w1024.png" style="float: right; width: 220px; height: 224px; margin: 10px 15px;" />With their victory in game 6 over the Tampa Bay Lightning, the Chicago Blackhawks won their 3rd Stanley Cup Championship in the last 6 years. This is an incredible feat that no doubt means the Blackhawks have been a very talented hockey team over that stretch. But just like <a href="http://blog.minitab.com/blog/understanding-statistics/control-charts-show-you-variation-that-matters">random variation</a> can play a part in quality processes, luck can play a part in sporting outcomes. So how lucky has Chicago been?</p>
Probability of Winning 3 Out of 7 Stanley Cup Championships
<p>The Blackhawks have won 3 of the last 6 Stanley Cups, but their run really began the year before they won the first cup, which was 2009. That was the first year they had made the playoffs in 7 years, so I’ll start collecting the data from there. For each year, I took the odds that the Blackhawks would win the championship at the start of the playoffs and turned that into a probability. Rows in bold represent years the Hawks won.</p>
<table style="text-align: center;">
<tr><th>Year</th><th>Odds</th><th>Percentage</th></tr>
<tr><td><strong>2015</strong></td><td><strong>8 to 1</strong></td><td><strong>11%</strong></td></tr>
<tr><td>2014</td><td>8 to 1</td><td>11%</td></tr>
<tr><td><strong>2013</strong></td><td><strong>7 to 2</strong></td><td><strong>22%</strong></td></tr>
<tr><td>2012</td><td>15 to 1</td><td>6%</td></tr>
<tr><td>2011</td><td>60 to 1</td><td>2%</td></tr>
<tr><td><strong>2010</strong></td><td><strong>8 to 1</strong></td><td><strong>11%</strong></td></tr>
<tr><td>2009</td><td>11 to 1</td><td>8%</td></tr>
<tr><td>Average</td><td>9 to 1</td><td>10%</td></tr>
</table>
<p>The only year Chicago was actually the favorite to win the cup was 2013, and even then they had a less than 1 in 4 chance of winning. So they have overcome some pretty long odds.</p>
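<p>Turning odds into a probability is simple arithmetic: odds of "A to B" against a team imply a win probability of B / (A + B). Here is a short Python sketch of that conversion using the odds from the table. Treating posted odds as true probabilities is a simplification, since it ignores the bookmaker's margin, and the function name is my own.</p>

```python
from fractions import Fraction

# Pre-playoff odds against the Blackhawks, as (against, in favor) pairs.
odds_by_year = {
    2015: (8, 1),
    2014: (8, 1),
    2013: (7, 2),
    2012: (15, 1),
    2011: (60, 1),
    2010: (8, 1),
    2009: (11, 1),
}


def implied_probability(against, in_favor):
    """Odds of 'against to in_favor' imply in_favor / (against + in_favor)."""
    return Fraction(in_favor, against + in_favor)


probabilities = {
    year: implied_probability(*odds) for year, odds in odds_by_year.items()
}
average = sum(probabilities.values()) / len(probabilities)

for year, p in probabilities.items():
    print(year, f"{float(p):.0%}")
print("Average:", f"{float(average):.0%}")
```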
<p>To calculate their overall odds of winning 3 championships in 7 years, I’m going to use their average percentage of 10%. Our number won’t be perfect, but it will give us a pretty good idea of how unlikely the Blackhawks' run has been. The following <a href="http://www.minitab.com/products/minitab">Minitab</a> probability distribution plot shows the probability of the Blackhawks winning 3 or more championships in the last 7 years.</p>
<p><img alt="Binomial Distribution Plot" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/edcd1d25e026ac9cfdc54eff62f3bfeb/distribution_plot.jpg" style="width: 576px; height: 384px;" /></p>
<p>There is only a 2.6% (approximately 1 in 39) chance that the Blackhawks would have won 3 or more championships in the last 7 years! There is no doubt that skill and talent are integral parts of Chicago’s success. But to win as often as they have, you need to have some luck too. And speaking of luck, had you bet $100 on the Blackhawks at the start of the playoffs each of the last 7 years, you would be up $1,650! So if you think all these championships are something that could easily have been predicted (Kane! Toews! Hossa! Of course they won!), Las Vegas begs to differ.</p>
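<p>The 2.6% figure can be reproduced from the binomial distribution with a few lines of Python. Like the plot, this sketch assumes each of the 7 seasons is an independent trial with the same 10% chance of a title, which is a simplification. The function names are illustrative.</p>

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_least(k, n, p):
    """Probability of k or more successes in n independent trials."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


# 7 playoff appearances, 10% average chance of winning the Cup each year.
p_three_or_more = prob_at_least(3, n=7, p=0.10)
print(f"P(3 or more titles in 7 years) = {p_three_or_more:.4f}")
```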
<p>Now, what would this graph have looked like if Chicago was the favorite at the start of each Stanley Cup Playoffs? I found that for each year, the favored NHL team has about a 20% chance of winning it all. So let’s look at another binomial distribution plot.</p>
<p><img alt="Binomial Distribution Plot" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/7a839988b09f9d8e4832dfed7a83440f/distribution_plot_2.jpg" style="width: 576px; height: 384px;" /></p>
<p>Even the NHL favorite winning 3 championships in 7 years is unlikely, happening only about 15% of the time. We see that the most likely outcome is that the favorite would win one championship. And wouldn’t you know it, the only NHL favorite to win the Stanley Cup the last 7 years was the 2013 Chicago Blackhawks.</p>
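<p>The same kind of check works for the favorite. This sketch lists the whole binomial distribution for 7 seasons at a 20% chance per season (again assuming independent seasons) and picks out the most likely number of titles; the names are illustrative.</p>

```python
from math import comb


def binomial_pmf(n, p):
    """Probabilities of 0, 1, ..., n successes in n independent trials."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


pmf = binomial_pmf(n=7, p=0.20)

# Number of titles with the highest probability
most_likely = max(range(len(pmf)), key=lambda k: pmf[k])

# Probability of 3 or more titles
three_or_more = sum(pmf[3:])

print(f"Most likely number of titles: {most_likely}")
print(f"P(3 or more titles) = {three_or_more:.2f}")
```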
Can We Find a Luckier Team?
<p>Believe it or not, there <em>is</em> one team that has overcome even greater odds to recently win 3 championships. That would be the San Francisco Giants, who won the World Series in 2010, 2012, and 2014. That’s 3 titles in 6 years! Their probabilities of winning in those years were 11%, 12%, and 7%, respectively. The other 3 years they didn’t even make the playoffs. It’s feast or famine with San Francisco! So the odds of winning 3 World Series in the only 3 years you make the playoffs are:</p>
<p align="center">0.11*0.12*0.07 = 0.000924 = approximately 1 in 1,082</p>
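<p>For anyone who wants to reproduce that arithmetic, here is a short Python sketch. Multiplying the three probabilities treats the three seasons as independent, which is the same simplification the calculation above makes.</p>

```python
from math import prod

# Pre-playoff title probabilities quoted for 2010, 2012, and 2014
title_odds = [0.11, 0.12, 0.07]

joint = prod(title_odds)
print(f"{joint:.6f}, or about 1 in {1 / joint:,.0f}")
```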
<p>Sorry, Chicago. Your run was impressive, but the Giants have proven Lady Luck is on their side even more. Want to overtake them? Well, how does 4 Stanley Cups in 8 years sound?</p>
Fri, 19 Jun 2015 12:39:00 +0000
http://blog.minitab.com/blog/the-statistics-game/are-the-chicago-blackhawks-currently-the-luckiest-team-in-sports
Kevin Rudy
Using Quality Tools Like FMEA in Pathogen Testing
http://blog.minitab.com/blog/understanding-statistics/using-quality-tools-like-fmea-in-pathogen-testing
<p>Before I joined Minitab, I worked for many years in Penn State's College of Agricultural Sciences as a writer and editor. I frequently wrote about food science and particularly food safety, as I regularly needed to report on the research being conducted by Penn State's food safety experts, and also edited course materials and bulletins for professionals and consumers about ensuring they had safe food. </p>
<p><img alt="culture dish" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/18d6a3d63b7c0f1b80cb19461732c349/culture_dish.jpg" style="margin: 10px 15px; float: right; width: 200px; height: 200px;" />After I joined Minitab and became better acquainted with data-driven quality methods like Six Sigma, I was surprised at how infrequently some of the powerful quality tools common in many industries are used in food safety work. </p>
<p>So I was interested to see <a href="http://www.foodsafetytech.com/FoodSafetyTech/News/How-to-Use-FMEA-to-Risk-Assess-Pathogen-Testing-Me-2440.aspx">a recent article on the Food Safety Tech web site</a> about an application of the tool called FMEA in pathogen testing.</p>
What <em>Is</em> an FMEA?
<p style="line-height: 18.9090900421143px;">The acronym FMEA is short for "<span><a href="http://blog.minitab.com/blog/statistics-in-the-field/for-want-of-an-fmea-the-empire-fell">Failure Modes and Effects Analysis</a></span>." What the tool really does is help you look very carefully and systematically at <em>exactly </em>how and why things can go wrong, so you can do your best to prevent that from happening.</p>
<p>In the article, Maureen Harte, a consultant and Lean Six Sigma black belt, talks about the need to identify, quantify, and assess risks of the different pathogen detection methods used to create a Certificate of Analysis (COA)—a document companies obtain to verify product quality and purity.</p>
<p>Too often, Harte says, companies accept COA results blindly:</p>
<p style="margin-left: 40px;"><em>[They] lack the background information to really understand what goes into a COA, and they trust that what is coming to them is the highest quality. </em></p>
<p>Harte then proceeds to explain how doing an FMEA can make the COA more meaningful and useful. </p>
<p style="line-height: 18.9090900421143px; margin-left: 40px;"><em>FMEA helps us understand the differences between testing methods by individually identifying the risks associated with each method on its own. For each process step [in a test method], we ask: Where could it go wrong, and where could an error or failure mode occur? Then we put it down on paper and understand each failure mode. </em></p>
Completing an FMEA
<p>Doing an FMEA typically involves these steps:</p>
<ul>
<li>Identify potential failure types, or "modes," for each step of your process.</li>
<li>List the effects that result when those failures occur.</li>
<li>Identify potential causes for each failure mode.</li>
<li>List existing controls that are in place to keep these failures from happening.</li>
<li>Rate the Severity of the effect, the likelihood of Occurrence, and the odds of Detecting the failure mode before it causes harm.</li>
<li>Multiply the values for severity, occurrence, and detection to get a risk priority number (RPN).</li>
<li>Improve items with a high RPN, record the actions you've taken, then revise the RPN.</li>
<li>Maintain as a living document.</li>
</ul>
<p>You can <span>do an FMEA </span><span style="line-height: 1.6;">with just a pencil and paper, although Minitab's <a href="http://www.minitab.com/products/quality-companion">Quality Companion</a> and <a href="http://www.minitab.com/products/qeystone">Qeystone Tools</a> process improvement software include forms that make it easy to complete the FMEA and even share data from process maps and other forms you may be using. </span></p>
<p><span style="line-height: 1.6;">Here's an example of a completed Quality Companion FMEA tool: </span></p>
<p><img alt="FMEA" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/d46282c5c0b55efeb25a146269263b97/pathogen_fmea.png" style="width: 750px; height: 406px; border-width: 1px; border-style: solid;" /></p>
FMEA Steps
<p>1) In Process Map - Activity, enter each process step, feature or type of activity. In the example above, it's preparation of growth culture and incubation. We also list the key components or inputs of each step.</p>
<p>2) In Potential Failure Mode, we note the ways the process can fail for each activity. There may be many ways it could fail. In the example, we've identified contamination of growth medium and incubating cultures at the wrong temperature as potential failure modes. </p>
<p>3) In Potential Failure Effects, we detail the possible fallout of each type of failure. There may be multiple failure effects.<span style="line-height: 18.9090900421143px;"> In the example above, contaminated growth culture could lead to the waste of perfectly good raw materials. An improperly performed incubation might lead to undetected pathogens, and possibly unsafe products. </span></p>
<p>4) In SEV (Severity Rating), we assign severity to each failure effect on a 1 to 10 scale, where 10 is high and 1 low. This is a relative assignment. In the food world, wasting some good materials is undesirable, but having pathogens reach the market is obviously much worse, hence the ranking of 6 and 9, respectively.</p>
<p>5) In OCC (Occurrence Rating), estimate the probability of occurrence of the cause. Use a 1 to 10 scale, where 10 signifies high frequency (guaranteed ongoing problem) and 1 signifies low frequency (extremely unlikely to occur). </p>
<p>6) In Current Control, enter the manner in which the failure causes/modes are detected or controlled. </p>
<p>7) In DET (Detection Rating), gauge the ability of each control to detect or control the failure cause/mode. Use a 1 to 10 scale, where 10 signifies poor detection/control and 1 signifies high detection/control (you're almost certain to catch the problem before it causes failure). </p>
<p>8) RPN (Risk Priority Number) is the product of the SEV, OCC, and DET scores. The higher the RPN, the more severe, more frequent, or less controlled a potential problem is, indicating a greater need for immediate attention. Above, the RPN of 81 for potential incubation error indicates that this type of failure should get higher priority than contaminated cultures. </p>
<p>9) If you're doing FMEA as part of an improvement project, you can use it to prioritize corrective actions. Once you've implemented improvements, enter the revised SEV, OCC, and DET values to calculate a current RPN. </p>
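<p>To make the RPN arithmetic concrete, here is a small Python sketch. The severity ratings of 6 and 9 and the RPN of 81 come from the example above; the individual OCC and DET scores are <strong>assumptions chosen for illustration</strong> (the incubation scores are simply one combination that multiplies to 81), and the <em>FailureMode</em> class is a hypothetical helper, not part of any Minitab product.</p>

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    sev: int  # severity: 1 (low) to 10 (high)
    occ: int  # occurrence: 1 (extremely unlikely) to 10 (ongoing problem)
    det: int  # detection: 1 (almost certain to catch) to 10 (poor detection)

    def __post_init__(self):
        for label, score in (("SEV", self.sev), ("OCC", self.occ), ("DET", self.det)):
            if not 1 <= score <= 10:
                raise ValueError(f"{label} must be between 1 and 10, got {score}")

    @property
    def rpn(self):
        """Risk priority number: severity x occurrence x detection."""
        return self.sev * self.occ * self.det


modes = [
    # OCC and DET here are hypothetical placeholders
    FailureMode("Contaminated growth medium", sev=6, occ=2, det=2),
    # OCC and DET are assumed; 9 x 3 x 3 reproduces the RPN of 81
    FailureMode("Incubation at wrong temperature", sev=9, occ=3, det=3),
]

# Highest RPN first: these are the items to improve first
ranked = sorted(modes, key=lambda m: m.rpn, reverse=True)
for mode in ranked:
    print(f"{mode.name}: RPN = {mode.rpn}")
```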
The Benefits of an FMEA
<p>When you've completed the FMEA, you'll have the answers to these questions:</p>
<p>What are the potential failure modes at each step of a process?<br />
What is the potential effect of each failure mode on the process output, and how severe is it?<br />
What are the potential causes of each failure mode, and how often do they occur?<br />
How well can you detect a cause before it creates a failure mode and effect?<br />
How can you assign a risk value to a process step that factors in the frequency of the cause, the severity of failure, and the capability of detecting it in advance?<br />
What part of the process should an improvement project focus on?<br />
Which inputs are vital to the process, and which aren't? <br />
How can reaction plans be documented as part of process control?</p>
<p>And if your understanding of the steps that underlie your Certificate of Analysis is that thorough, you will be able to stand behind it with much more confidence. </p>
<p>Where could you apply an FMEA in your organization? </p>
Lean Six Sigma, Project Tools, Quality Improvement, Statistics in the News
Wed, 17 Jun 2015 12:00:00 +0000
http://blog.minitab.com/blog/understanding-statistics/using-quality-tools-like-fmea-in-pathogen-testing
Eston Martz
How Is Cpk Calculated When the Subgroup Size Is 1?
http://blog.minitab.com/blog/marilyn-wheatleys-blog/how-is-cpk-calculated-when-the-subgroup-size-is-1
<p>When data are collected in subgroups, it’s easy to understand how the variation can be calculated within each of the subgroups based on the subgroup range or the subgroup standard deviation.</p>
<p>When data are not collected in subgroups (so the subgroup size is 1), it may be a little less intuitive to understand how within-subgroup standard deviation is calculated. How does Minitab <a href="http://www.minitab.com/products/minitab/">Statistical Software</a> calculate within-subgroup variation if there is only one data point in each subgroup? How does this affect Cpk? This blog post will discuss how within-subgroup variation and Cpk are calculated when the subgroup size is 1.</p>
<p>For this post, the data linked <a href="//cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/1e4503550f8501467a66417162dda53a/capability2.mtw">here</a> will be used along with a lower spec of 10 and an upper spec of 20 (sorry, no back story to this data). We will also accept Minitab’s default method for calculating within-subgroup variation when the subgroup size is 1, which is the average moving range.</p>
<p><img height="324" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/7cb39b337c7585b13f4c62bfcf8a5035/7cb39b337c7585b13f4c62bfcf8a5035.png" width="768" /><br />
<span style="line-height: 1.6;">The normal capability results below show that for this dataset, the within-subgroup standard deviation is 1.85172 and the Cpk is 0.89:</span></p>
<p><img height="393" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/b9e91df65e45e5cf4efd530f9718b906/b9e91df65e45e5cf4efd530f9718b906.png" width="524" /></p>
<p><span style="line-height: 1.6;">To find the formulas Minitab uses to calculate the average moving range, we navigate the following menu path in Minitab: </span><strong style="line-height: 1.6;">Help</strong><span style="line-height: 1.6;"> > </span><strong style="line-height: 1.6;">Methods and Formulas</strong><span style="line-height: 1.6;"> > </span><strong style="line-height: 1.6;">Process capability</strong><span style="line-height: 1.6;"> > </span><strong style="line-height: 1.6;">Process capability (Normal)</strong><span style="line-height: 1.6;">. The section titled </span><strong style="line-height: 1.6;">Estimating standard deviation</strong><span style="line-height: 1.6;"> shows the formula for the average moving range:</span></p>
<p><img height="189" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/e206ec2740eb85dadd6e7d15dbe009c2/e206ec2740eb85dadd6e7d15dbe009c2.png" width="485" /></p>
<p><span style="line-height: 1.6;">We’ll use the formula above (and link to the table of unbiasing constants) to replicate Minitab’s Cpk output for a normal capability with a subgroup size of 1.</span></p>
<p>First, we calculate Rbar. To do that, we’ll get the average of the moving ranges by calculating the difference from the data point in row 1 to row 2, row 2 to row 3, and so forth. An easy way to do that in Minitab is to use the Lag function in the Time Series menu. We choose <strong>Stat</strong> > <strong>Time Series</strong> > <strong>Lag</strong>, and then complete the dialog box as shown below and click OK:</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/fffef8272e4a2ed0365c2ec9d21caaf5/capture.PNG" style="width: 605px; height: 340px;" /></p>
<p>The lag function shifts every row down by the number of rows we type in the Lag field above.</p>
<p>Now we can use <strong>Calc </strong>> <strong>Calculator</strong> to subtract C2 from C1 and store the differences in a new column. Because the formula tells us to take the Max minus the Min values and we don’t want to rearrange the data, we can just use the ABS function in the calculator to get the absolute values of the differences:</p>
<p><img height="345" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/4c877489b5d9d214e2782133f01112ab/4c877489b5d9d214e2782133f01112ab.png" width="385" /></p>
<p><span style="line-height: 1.6;">Next we can use </span><strong style="line-height: 1.6;">Stat</strong><span style="line-height: 1.6;"> > </span><strong style="line-height: 1.6;">Basic Statistics</strong><span style="line-height: 1.6;"> > </span><strong style="line-height: 1.6;">Store Descriptive Statistics</strong><span style="line-height: 1.6;"> to store the </span><strong style="line-height: 1.6;">Sum</strong><span style="line-height: 1.6;"> of the differences that we calculated in the previous step:</span></p>
<p><img height="303" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/5f807545727359d87af8f3bf5353d267/5f807545727359d87af8f3bf5353d267.png" width="443" /></p>
<p><span style="line-height: 1.6;">The value stored in the worksheet, 206.785, is the numerator for our R-bar calculation. Now we can plug that number into the formula from Methods and Formulas:</span></p>
<p style="margin-left: 40px;">Rbar<strong> = </strong>(Rw + ... + Rn) / (n - w + 1)</p>
<p>w = The number of observations used in the moving range. The default is w = 2</p>
<p style="margin-left: 40px;">Rbar = 206.785 / (100 - 2 + 1) = <strong>2.08874</strong></p>
<p>Finally, we can find the value of the unbiasing constant (d2) using the table linked in Methods and Formulas. In this example, w = 2, and d2(w) = 1.128:</p>
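<p>Outside of Minitab, the average moving range can be sketched in a few lines of Python. The function below follows the Rbar formula quoted from Methods and Formulas (the sum of the moving ranges divided by n - w + 1); the five data values are made up for illustration and are not from the worksheet used in this post.</p>

```python
def average_moving_range(values, w=2):
    """Mean of the ranges (max minus min) of each run of w consecutive points."""
    ranges = [
        max(values[i:i + w]) - min(values[i:i + w])
        for i in range(len(values) - w + 1)  # n - w + 1 moving ranges
    ]
    return sum(ranges) / len(ranges)


sample = [14.2, 16.1, 15.0, 13.4, 17.3]  # hypothetical measurements
rbar = average_moving_range(sample)
print(f"Rbar = {rbar:.3f}")
```

<p>With w = 2, each moving range is just the absolute difference between neighboring rows, which is what the Lag and ABS steps produce in the worksheet.</p>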
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/28b612710d01fb755ef8ec53c9258947/capture.PNG" style="width: 342px; height: 116px;" /> </p>
<p>To calculate sigma x-bar, we use the formula from Methods and Formulas, dividing our Rbar estimate by the d2 value from the table (I used Minitab’s calculator again to get the answer):</p>
<p>Sigma x-bar = 2.08874/1.128 = <strong>1.85172</strong>. That matches Minitab’s capability output, so we’re almost there!</p>
<p>Now we can calculate Cpk, which is the lesser of CPU and CPL. Once again Methods and Formulas tells us how to calculate CPU and CPL:</p>
<img height="232" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/466e956f6a1b14c4c556c99744359053/466e956f6a1b14c4c556c99744359053.png" style="line-height: 18.9090900421143px;" width="354" />
<img height="233" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/3c96466152681e5a67e3fce21cc02311/3c96466152681e5a67e3fce21cc02311.png" style="line-height: 18.9090900421143px;" width="348" />
<p>We can get the sample mean, X-bar, from Minitab’s capability output or by using <strong>Stat</strong> > <strong>Basic Statistics</strong> > <strong>Store Descriptive Statistics</strong>. That X-bar value, along with the other values we’ve calculated, is plugged into the formulas above:</p>
<p>CPU = (20-15.063)/(3*1.85172) = <strong>0.89</strong></p>
<p>CPL = (15.063-10)/(3*1.85172) = <strong>0.91</strong></p>
<p>Since Cpk is the lesser of CPU and CPL, then Cpk = <strong>0.89</strong>, just like Minitab said!</p>
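<p>Here is the whole hand calculation gathered into one short Python sketch. Every input (the sum of moving ranges, n, w, d2, the mean, and the spec limits) is a value quoted in this post; the variable names are my own.</p>

```python
D2_FOR_W2 = 1.128            # unbiasing constant d2 for w = 2
sum_moving_ranges = 206.785  # stored Sum of the absolute differences
n, w = 100, 2                # observations, moving range length
lsl, usl = 10, 20            # lower and upper spec limits
mean = 15.063                # sample mean (X-bar)

rbar = sum_moving_ranges / (n - w + 1)   # average moving range
sigma_within = rbar / D2_FOR_W2          # within-subgroup standard deviation

cpu = (usl - mean) / (3 * sigma_within)
cpl = (mean - lsl) / (3 * sigma_within)
cpk = min(cpu, cpl)                      # Cpk is the lesser of CPU and CPL

print(f"Rbar = {rbar:.5f}, sigma = {sigma_within:.5f}")
print(f"CPU = {cpu:.2f}, CPL = {cpl:.2f}, Cpk = {cpk:.2f}")
```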
<p>I hope this post on calculating Cpk when the size of the subgroup is 1 was helpful. You may also be interested in learning <a href="http://blog.minitab.com/blog/marilyn-wheatleys-blog/how-cpk-and-ppk-are-calculated2c-part-2">how Minitab calculates Cpk when the subgroup size is greater than 1</a>.</p>
Data Analysis, Lean Six Sigma, Learning, Quality Improvement, Statistics
Mon, 15 Jun 2015 12:00:00 +0000
http://blog.minitab.com/blog/marilyn-wheatleys-blog/how-is-cpk-calculated-when-the-subgroup-size-is-1
Marilyn Wheatley
What Is the F-test of Overall Significance in Regression Analysis?
http://blog.minitab.com/blog/adventures-in-statistics/what-is-the-f-test-of-overall-significance-in-regression-analysis
<p>Previously, I’ve written about <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients">how to interpret regression coefficients and their individual P values</a>.</p>
<p>I’ve also written about <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit">how to interpret R-squared</a> to assess the strength of the relationship between your model and the response variable.</p>
<p>Recently I've been asked, how does the F-test of the overall significance and its P value fit in with these other statistics? That’s the topic of this post!</p>
<p>In general, an F-test in regression compares the fits of different linear models. Unlike t-tests that can assess only one regression coefficient at a time, the F-test can assess multiple coefficients simultaneously.</p>
<p>The F-test of the overall significance is a specific form of the F-test. It compares a model with no predictors to the model that you specify. A regression model that contains no predictors is also known as an intercept-only model.</p>
<p>The hypotheses for the F-test of the overall significance are as follows:</p>
<ul>
<li><strong>Null hypothesis</strong>: The fit of the intercept-only model and your model are equal.</li>
<li><strong>Alternative hypothesis</strong>: The fit of the intercept-only model is significantly reduced compared to your model.</li>
</ul>
<p>In <a href="http://www.minitab.com/en-us/products/minitab/features/" target="_blank">Minitab statistical software</a>, you'll find the F-test for overall significance in the Analysis of Variance table.</p>
<p><img alt="Analysis of variance table with the F-test of overall significance" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/ea9b30b0fda316d2b561081a9e094c3e/ftest_anova_table.png" style="width: 368px; height: 144px;" /></p>
<p>If the P value for the F-test of overall significance is less than your significance level, you can reject the null hypothesis and conclude that your model provides a better fit than the intercept-only model.</p>
<p>Great! That set of terms you included in your model improved the fit!</p>
<p>Typically, if you don't have any significant P values for the individual coefficients in your model, the overall F-test won't be significant either. However, in a few cases, the tests could yield different results. For example, a significant overall F-test could determine that the coefficients are <em>jointly</em> not all equal to zero while the tests for individual coefficients could each fail to show that any coefficient is <em>individually</em> different from zero.</p>
<p>There are a couple of additional conclusions you can draw from a significant overall F-test.</p>
<p>In the intercept-only model, all of the fitted values equal the mean of the response variable. Therefore, if the P value of the overall F-test is significant, your regression model predicts the response variable better than the mean of the response.</p>
<p>While R-squared provides an estimate of the strength of the relationship between your model and the response variable, it does not provide a formal hypothesis test for this relationship. The overall F-test determines whether this relationship is statistically significant. If the P value for the overall F-test is less than your significance level, you can conclude that the R-squared value is significantly different from zero.</p>
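<p>To see how the overall F-statistic relates to R-squared, here is a minimal Python sketch that fits a one-predictor least-squares line by hand and compares it with the intercept-only model. The five data points are invented for illustration. Turning F into a P value requires the F distribution (for example, <em>scipy.stats.f.sf</em>), which I've left out to keep the sketch dependency-free.</p>

```python
def overall_f(y, fitted, n_predictors):
    """Overall F-statistic and R-squared for a least-squares fit with an intercept."""
    n = len(y)
    mean_y = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # error sum of squares
    sst = sum((yi - mean_y) ** 2 for yi in y)  # SSE of the intercept-only model
    ssr = sst - sse                            # improvement from the predictors
    df_model = n_predictors
    df_error = n - n_predictors - 1
    f_stat = (ssr / df_model) / (sse / df_error)
    r_squared = ssr / sst
    return f_stat, r_squared


# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Simple least-squares line
mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x
fitted = [intercept + slope * xi for xi in x]

f_stat, r_squared = overall_f(y, fitted, n_predictors=1)
print(f"F = {f_stat:.1f}, R-squared = {r_squared:.4f}")
```

<p>Because F can be rewritten as (R² / k) / ((1 - R²) / (n - k - 1)), a test of the overall F is also a test of whether R-squared differs from zero.</p>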
<p>If your entire model is statistically significant, that's great news! However, be sure to <a href="http://blog.minitab.com/blog/adventures-in-statistics/why-you-need-to-check-your-residual-plots-for-regression-analysis">check the residual plots</a> so you can trust the results!</p>
<p>If you're learning about regression, read my <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-tutorial-and-examples">regression tutorial</a>!</p>
Regression Analysis, Statistics Help
Thu, 11 Jun 2015 12:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/what-is-the-f-test-of-overall-significance-in-regression-analysis
Jim Frost
3 Features to Make You Glad You're You When You Have to Clean Data in Minitab
http://blog.minitab.com/blog/statistics-and-quality-improvement/done-cancel
<p>When someone gives you data to analyze, you can gauge how your life is going by what you've received. Get a Minitab file, or even comma-separated values, and everything feels fine. Get a PDF file, and you start to think maybe you’re cursed because of your no-good-dirty-rotten-pig-stealing-great-great-grandfather and wish that you were someone else. For those of you who might be in such dire straits today, here are 3 helpful things you can do in Minitab Statistical Software: change data type, code and remove missing values, and recode variables.</p>
<p>For the purposes of having an example, I’m going to use some data from the Centers for Medicare and Medicaid Services. <a href="http://www.cms.gov/Medicare/Quality-Initiatives-Patient-Assessment-Instruments/HospitalQualityInits/Downloads/HospitalTop50PercentYear6.zip" target="_blank">The data are from October 2008 to September 2009 and track the quality of a hospital’s response to a patient with pneumonia</a>. The data in the PDF file look like this:</p>
<p><img alt="The PDF file has header text and a nicely formatted table." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/eb685364783267efcf57c40d30999633/worksheets1_w1024.jpeg" style="border-width: 0px; border-style: solid; width: 1024px; height: 559px;" /></p>
<p>If you copy and paste it into Minitab, hoping for nicely organized tables like those in the document, you get a single column that contains everything:</p>
<p><img alt="The header text and the table content are all in one column." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/8c484f2c2d789faaeced739e945b9c5a/worksheets2.JPG" style="border-width: 0px; border-style: solid; width: 555px; height: 759px;" /></p>
<p>Don’t despair. Instead, look at the capabilities that are at your fingertips.</p>
Change Data Type
<p>What we’re really after for analysis are the numbers inside the table, so a good first step is to get the numbers.</p>
<ol>
<li>Choose <strong>Data > Change Data Type > Text to Numeric</strong>.</li>
<li>In <strong>Change text columns</strong>, enter <em>C1</em>.</li>
<li>In <strong>Store Numeric Columns in</strong>, enter <em>C2</em>.</li>
<li>Click <strong>OK</strong>. In the Error box, click <strong>Cancel</strong>.</li>
</ol>
<p>When you look at the worksheet, the cells that had text values after the paste are now missing value symbols and the numbers that were in the tables remain. You might be a bit unnerved that the percentages of patients who received treatments are all 1, but that’s only a result of the column formatting. (Want to see? <a href="http://support.minitab.com/en-us/minitab/17/topic-library/minitab-environment/data-and-data-manipulation/numeric-data-and-formats/numeric-data-and-formats/#change-the-numeric-data-display-format">Change the numeric display format</a>.)</p>
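<p>If you think in code, the text-to-numeric step behaves roughly like the Python sketch below: anything that parses as a number is kept, and everything else becomes missing. This is an analogy, not Minitab's implementation, and the pasted cells are hypothetical stand-ins for the PDF content.</p>

```python
def text_to_numeric(cells):
    """Convert text cells to floats; cells that are not numbers become None."""
    numbers = []
    for cell in cells:
        try:
            numbers.append(float(cell))
        except ValueError:
            numbers.append(None)  # plays the role of Minitab's missing value symbol
    return numbers


# Hypothetical pasted column: header text mixed with table values
pasted = ["Hospital Name", "0.97", "Low Sample (10 or less)", "1", "0.885"]
numeric = text_to_numeric(pasted)
print(numeric)
```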
Remove missing values
<p>You can easily get rid of the missing values in these data so that they don’t interfere with further analysis, but there’s an additional complication here. While most of the missing values are column headers that we don’t want in the data, the table itself contains some missing values. Anytime a hospital gave a treatment to 10 or fewer patients, the table contains the value “Low Sample (10 or less).” To preserve these missing values while eliminating the others, we want to use different values to represent the different cases in the data.</p>
<ol>
<li>Choose <strong>Calc > Calculator</strong>.</li>
<li>In <strong>Store Result in Variable</strong>, enter <em>C3</em>.</li>
<li>In <strong>Expression</strong>, enter <em>If(Left(C1,3)="Low", 99999999, C2)</em>.</li>
<li>Click <strong>OK</strong>.</li>
</ol>
<p>Now that you have two kinds of missing value, you can start cleaning them up. First, get rid of the ones that don’t represent values in the table.</p>
<ol>
<li>Choose <strong>Data > Copy > Columns to Columns</strong>.</li>
<li>In <strong>Copy from columns</strong>, enter <em>C3</em>.</li>
<li>In <strong>Store Copied Data in Columns</strong>, select <strong>In current worksheet, in columns</strong> and enter <em>C4</em>.</li>
<li>Click <strong>Subset the Data</strong>.</li>
<li>In <strong>Specify Which Rows to Include</strong>, select <strong>Rows that match</strong> and click <strong>Condition</strong>.</li>
<li>In <strong>Condition</strong>, enter <em>C3 <> '*'</em>.</li>
<li>Click <strong>OK</strong> in all of the dialog boxes.</li>
</ol>
<p>Now that we’ve gotten rid of the missing values that weren’t numbers in the table, we can change the missing values that we kept back to a form Minitab recognizes.</p>
<ol>
<li>Choose <strong>Calc > Calculator</strong>.</li>
<li>In <strong>Store result in variable</strong>, enter <em>C5</em>.</li>
<li>In <strong>Expression</strong>, enter <em>If(c4 = 99999999, '*', c4)</em>.</li>
<li>Click <strong>OK</strong>.</li>
</ol>
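<p>The three steps above (mark the table's own missing values with a placeholder, drop the other missing values, then turn the placeholder back into a missing value) can be sketched in Python like this. The sample columns are hypothetical, and <em>None</em> stands in for Minitab's missing value symbol.</p>

```python
SENTINEL = 99999999  # placeholder for "Low Sample" cells, as in the Calculator step


def clean(text_col, numeric_col):
    # Step 1: flag cells whose original text starts with "Low"
    marked = [
        SENTINEL if text.startswith("Low") else value
        for text, value in zip(text_col, numeric_col)
    ]
    # Step 2: drop the remaining missing values (header text and the like)
    kept = [value for value in marked if value is not None]
    # Step 3: turn the placeholder back into a missing value
    return [None if value == SENTINEL else value for value in kept]


# Hypothetical pasted text column and its numeric conversion
text_col = ["Hospital Name", "0.97", "Low Sample (10 or less)", "1", "0.885"]
numeric_col = [None, 0.97, None, 1.0, 0.885]

cleaned = clean(text_col, numeric_col)
print(cleaned)
```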
Recode the data
<p>For analysis, we want one row for each hospital. To do this, we’ll create a table in the worksheet that shows how to identify the variables for analysis, then <a href="http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-datas-a-mess2c-part-2">unstack the variables</a>.</p>
<p>Because we kept the missing values from the table, every hospital has 9 variables. We make a table in the worksheet that shows the numbers 1 to 9 and a name for each variable:</p>
<p><img alt="A table with number codes and labels that you want for the variables." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/c92210568ca74f8438f661d29cb990ad/worksheets3.JPG" style="border-width: 0px; border-style: solid; width: 535px; height: 271px;" /></p>
<p>To associate the variable names with all 1,944 rows of data, we’ll make patterned data.</p>
<ol>
<li>Choose <strong>Calc > Make Patterned Data > Simple Set of Numbers</strong>.</li>
<li>In <strong>Store patterned data in</strong>, enter <em>C8</em>.</li>
<li>In <strong>From first value</strong>, enter <em>1</em>.</li>
<li>In <strong>To last value</strong>, enter <em>9</em>.</li>
<li>In <strong>Number of times to list sequence</strong>, enter <em>216</em>.</li>
<li>Click <strong>OK</strong>.</li>
</ol>
<p>To convert the number codes to the text variable descriptions, we’ll recode the data.</p>
<ol>
<li>Choose <strong>Data > Code > Use Conversion Table</strong>.</li>
<li>In <strong>Code values in the following column</strong>, enter <em>C8</em>.</li>
<li>In <strong>Current values</strong>, enter <em>C6</em>.</li>
<li>In <strong>Coded values</strong>, enter <em>C7</em>.</li>
<li>Click <strong>OK</strong>.</li>
</ol>
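<p>As a rough Python analogy for the patterned data and conversion table steps: repeat the codes 1 to 9 once per hospital, then look each code up in a table of labels. The labels below are <strong>placeholders</strong>, since the real variable names live in the worksheet shown above.</p>

```python
# Conversion table: code -> label (placeholder labels, not the real names)
conversion = {code: f"Measure {code}" for code in range(1, 10)}

# Patterned data: the sequence 1..9 listed 216 times (one block per hospital)
codes = list(range(1, 10)) * 216

# Recode each number to its text label
labels = [conversion[code] for code in codes]

print(len(labels), labels[:3])
```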
<p>Now that you have a column that says which number belongs to each variable, unstack the data.</p>
<ol>
<li>Choose <strong>Data > Unstack Columns</strong>.</li>
<li>In <strong>Unstack the data in</strong>, enter <em>C5</em>.</li>
<li>In <strong>Using subscripts in</strong>, enter <em>C9</em>.</li>
<li>Click <strong>OK</strong>.</li>
</ol>
<p>Now, you have a new worksheet where each hospital is identified by its unique CCN and the variables are the proportions of pneumonia patients who got each treatment from that hospital.</p>
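<p>Unstacking can be pictured with a few lines of Python: each value goes into the column named by its subscript, so that row <em>i</em> across the new columns belongs to the <em>i</em>-th hospital. The values and the three short labels are made up to keep the example small.</p>

```python
from collections import defaultdict


def unstack(values, subscripts):
    """Split one stacked column into one column per subscript label."""
    columns = defaultdict(list)
    for value, label in zip(values, subscripts):
        columns[label].append(value)
    return dict(columns)


# Two hypothetical hospitals with three measures each, stacked in one column
values = [0.97, None, 0.88, 0.91, 0.95, 0.80]
subscripts = ["A", "B", "C", "A", "B", "C"]

wide = unstack(values, subscripts)
print(wide)
```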
<p>Once the data are in a traditional format for analysis, you can start to get the answers that you want quickly. For example, a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/quality-tools/control-charts/understanding-attributes-control-charts/what-is-a-laney-p-chart/">Laney P’ chart</a> might suggest whether some hospitals had a higher proportion of unvaccinated pneumonia patients than you would expect from the variation in the data.</p>
<p><img alt="8 facilities have higher proportions for the year than you would expect from a random sample from a stable process." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/72f45bae753978f916b3dbf9974c1c6b/laney_p____chart_of_unvaccinated.png" style="border-width: 0px; border-style: solid; width: 576px; height: 384px;" /></p>
<p>Fortunately, being able to change data types, remove missing values, and recode data lets you get data ready to analyze in Minitab as fast as possible. That way, you’re ready to give the answers that your fearless data analysis justifies.</p>
Data Analysis, Learning
Wed, 10 Jun 2015 12:00:00 +0000
http://blog.minitab.com/blog/statistics-and-quality-improvement/done-cancel
Cody Steele
A Closer Look at Probability and Survival Plots
http://blog.minitab.com/blog/quality-data-analysis-and-statistics/a-closer-look-at-probability-and-survival-plots
<p>I recently fielded an interesting question about the probability and survival plots in Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a>'s Reliability/Survival menus:</p>
<p style="margin-left: 40px;"><em>Is there a one-to-one match between the confidence interval points on a probability plot and the confidence interval points on survival plot at a specific percentile?</em></p>
<p>Now, this may seem like an easy question, given that the probabilities on a survival plot are simply 1 minus the failure probabilities on a probability plot at a specific time t or stressor (in the case of Probit Analysis, used for our example below).</p>
<p>This can be seen here, at the 10th percentile:</p>
<p style="text-align: center;"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/0b5346729cfc39b1fbd5829bfb4cb58e/pic1.png" style="width: 350px; height: 234px;" /><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/eb499a6f799045d441a03d98f1a15732/pic2.png" style="width: 350px; height: 234px;" /></p>
<p>The probability plot is saying that at a voltage of 113.25, 10% of your items are failing. Conversely, the survival plot will show that 90% of your items will survive at that same voltage.</p>
<p>How do the graphs compare when adding confidence intervals to both graphs? Before we get our hands dirty with this, let’s first review some terms and methods to get us comfortable enough to proceed further.</p>
Reliability/Survival Analysis
<p>This is the overarching classification of tools within Minitab that help with modeling life data. Distribution Analysis, Repairable Systems Analysis, and Probit Analysis fall within this category.</p>
Probit Analysis
<p>This analysis will be used as our example today. Probit analysis is used when you want to estimate percentiles and survival probabilities of an item in the presence of a stress. The response is required to be binomial in nature (go/no go, pass/fail). One example of a probit analysis could be testing light bulb life at different voltages.</p>
<p>Since the response data is binomial, you’d have to specify what would be considered an event for that light bulb at a certain voltage. Let’s say the event is a light bulb blowing out before 800 hours.</p>
<p>Excerpt of data set</p>
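<table>
<thead>
<tr><th>Blows (The Event)</th><th>Trials</th><th>Volts</th></tr>
</thead>
<tbody>
<tr><td>2</td><td>50</td><td>108</td></tr>
<tr><td>6</td><td>50</td><td>114</td></tr>
<tr><td>11</td><td>50</td><td>120</td></tr>
<tr><td>45</td><td>50</td><td>132</td></tr>
</tbody>
</table>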
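<p>As a rough sketch of how such data can be read (an illustration in Python, not Minitab output), the snippet below turns each row of the excerpt into an observed failure proportion, its complement (the observed survival proportion), and the probit, which is the standard normal quantile of the failure proportion. The function name <code>summarize</code> is hypothetical, and a real probit analysis fits one model across all voltages rather than treating each row on its own.</p>

```python
from statistics import NormalDist

# (blows, trials, volts) rows copied from the data excerpt
ROWS = [(2, 50, 108), (6, 50, 114), (11, 50, 120), (45, 50, 132)]


def summarize(rows):
    """Return observed failure/survival proportions and probits per voltage."""
    std_normal = NormalDist()
    summary = []
    for blows, trials, volts in rows:
        p_fail = blows / trials  # observed proportion of bulbs that blew
        summary.append({
            "volts": volts,
            "p_fail": p_fail,
            "p_survive": 1 - p_fail,  # survival is the complement of failure
            "probit": std_normal.inv_cdf(p_fail),  # standard normal quantile
        })
    return summary


for row in summarize(ROWS):
    print(row)
```

<p>The probits rise with voltage, which is the pattern a probit model describes with a fitted line.</p>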
Probability Plot
<p>This graph plots each value against the percentage of values in the sample that are less than or equal to it, along a fitted distribution line (middle line). In probit analysis, it helps determine what percentage of bulbs fail before 800 hours at certain voltages.</p>
Survival Plot
<p>This graph displays a plot of the survival probabilities versus time. Each plot point represents the proportion of units surviving at time t. In probit analysis, it helps determine what percentage of bulbs survive beyond 800 hours at certain voltages.</p>
Back to the original question…
<p>Can we take a value along the CI of a probability plot and find its corresponding value on the CI of a survival plot at a specific percentile? Here are the confidence interval values for the percentile at 113.246:</p>
<p style="margin-left: 40px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/8e52bb9e7bb7d7792646f5ccd35b13d5/sessionpic1.png" style="line-height: 1.6; width: 338px; height: 112px;" /></p>
<p> </p>
<p style="text-align: center;"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/fb12e57862d4f6c963a9ad468cfa6e50/pic3.png" style="width: 576px; height: 384px;" /></p>
<p>If we add the above confidence interval values for the 10th percentile to the survival plot, you'll see that they don’t quite equal what’s shown at 90%. They’re a <em>little</em> off:</p>
<p style="text-align: center;"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/ad209081968164d93813fe0209898afd/pic4_w1024.png" style="line-height: 1.6; width: 624px; height: 343px;" /></p>
The Reason
<p>In our probability plot, the confidence interval is calculated with the parameter of interest being the percentile. Let’s look at the 10th percentile again:</p>
<p style="margin-left: 40px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/8e52bb9e7bb7d7792646f5ccd35b13d5/sessionpic1.png" style="width: 338px; height: 112px;" /> </p>
<p>Our 95% CI (111.302 to 114.779) is around the value of 113.246 volts. In our survival plot, the confidence interval is calculated around the probability of survival. You can see this in the session window under the Table of Survival Probabilities. The 95% CI around the survival probability of 0.90 for a voltage of 113.246:</p>
<p style="margin-left: 40px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/5a31728b7793a1bc82344d192d8a12bb/session_pic2.PNG" style="width: 339px; height: 106px;" /></p>
<p>Here’s another look at our survival plot with our aforementioned survival probabilities added:</p>
<p style="text-align: center;"> </p>
<p style="text-align: center;"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/4c0ca3d6c165ca31ce9139a7e710d5aa/pic5.png" style="width: 624px; height: 279px;" /></p>
<p>They all nicely fit on one straight vertical line at voltage = 113.246. </p>
<p>This all being said, you <em>can </em>convert the lower bound or upper bound of a percentile to a point on a survival plot. Let’s say we look at the lower bound for 113.246 (which is 111.302). We’d first have to find the survival probability for that value:</p>
<p style="margin-left: 40px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/ba4559dda5a9d3ed2107ac6bc8bb37f3/sessionpic3.PNG" style="width: 322px; height: 105px;" /></p>
<p>Now let’s look at that table of survival probabilities for 0.90 again:</p>
<p style="margin-left: 40px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/5a31728b7793a1bc82344d192d8a12bb/session_pic2.PNG" style="width: 339px; height: 106px;" /></p>
<p>Notice that the survival probability for the lower CI of 113.246 ends up being the upper bound of the survival probability of 0.90. Given that the survival probabilities are one minus the failure probabilities, it makes sense that you'd have to look at the upper bound of a survival plot when analyzing the lower bound of a probability plot. </p>
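<p>The direction of that swap can be sketched in a few lines of Python. This is only an illustration: the post does not give the fitted model parameters, so the spread <code>ASSUMED_SIGMA</code> is a hypothetical value, and the location is chosen so that the failure probability at 113.246 volts is exactly 0.10. The resulting numbers will not match the session window output. They only show that a lower voltage maps to a higher survival probability.</p>

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

# Values quoted in the post
P10_VOLTS = 113.246                      # estimated 10th percentile
P10_LOWER, P10_UPPER = 111.302, 114.779  # 95% CI for that percentile

# Hypothetical spread for the fitted normal curve (NOT from the post)
ASSUMED_SIGMA = 8.0
# Location chosen so the failure probability at P10_VOLTS is exactly 0.10
ASSUMED_MU = P10_VOLTS - ASSUMED_SIGMA * STD_NORMAL.inv_cdf(0.10)


def survival(volts):
    """Survival probability = 1 - failure probability at the given voltage."""
    failure = STD_NORMAL.cdf((volts - ASSUMED_MU) / ASSUMED_SIGMA)
    return 1 - failure


for label, volts in [("lower", P10_LOWER), ("estimate", P10_VOLTS), ("upper", P10_UPPER)]:
    print(f"{label:>8}: {volts:.3f} V -> survival {survival(volts):.4f}")
```

<p>Because survival falls as voltage rises, the lower percentile bound lands above 0.90 and the upper percentile bound lands below it.</p>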
<p>I hope this post helps you develop a deeper understanding of the relationship between our probability and survival plots—and I hope it wasn't <em>too </em>technical!</p>
<p>Please check out these other posts on Reliability/Survival:</p>
<p><a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/probit-analysis-down-goes-the-meathouse">Probit Analysis: Down Goes the Meathouse!</a></p>
<p><a href="http://blog.minitab.com/blog/the-statistics-of-science/reliability-statistics-and-the-care-and-feeding-of-capital-equipment">The Care and Feeding of Capital Equipment (with Reliability Statistics)</a></p>
Data Analysis, Quality Improvement, Reliability Analysis, Six Sigma
Mon, 08 Jun 2015 12:00:00 +0000
http://blog.minitab.com/blog/quality-data-analysis-and-statistics/a-closer-look-at-probability-and-survival-plots
Andy Cheshire
How to Explore Interactions with Line Plots
http://blog.minitab.com/blog/understanding-statistics/how-to-explore-interactions-with-line-plots
<p><span style="line-height: 1.6;">The line plot is an incredibly agile but frequently overlooked tool in the quest to better understand your processes.</span></p>
<p>In any process, whether it's baking a cake or processing loan forms, many factors have the potential to affect the outcome. Changing<span style="line-height: 1.6;"> the source of raw materials could affect the strength of plywood a factory produces. Similarly, one method of gluing this plywood might be better or worse than another.</span></p>
<p>But what is even more complicated to consider is how these factors might interact. In this case, plywood made with materials obtained from supplier “A” might be strongest when glued with one adhesive, while plywood that uses material from supplier “B” might be strongest when glued with a different adhesive.</p>
<p>Understanding these kinds of interactions can help you maintain quality when conditions change. But where do you begin? Try starting with a line plot.</p>
The Line Plot Has Two Faces
<p>Line plots created with Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a> are flexible enough to help you find interactions and response patterns whether you have 2 factors or 20. But while the graph is always created the same way, such changes in scale produce two seemingly distinct types of graph.</p>
<p><strong style="line-height: 18.9090900421143px;">With just a few groups…</strong><span style="line-height: 18.9090900421143px;">the focus is on <a href="http://blog.minitab.com/blog/michelle-paret/evaluating-statistical-interactions-with-ketchup-and-soy-sauce">interaction effects</a>. In the graph below, a paint company that wants to improve the performance of its products has created a line plot that finds a strong interaction between spray paint formulation and the pressure at which it’s applied.</span><br />
<img alt="Line Plot 1" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/13435f41f412013464ec6bf1b94ed8c4/line_plot_of_mean__flaws__.png" style="width: 576px; height: 384px;" /></p>
<p>An interaction is present where the lines are not parallel.</p>
<p><strong style="line-height: 18.9090900421143px;">With many groups…</strong><span style="line-height: 18.9090900421143px;">the focus is on deviations from an expected response profile. (That's why in the chemical industry this is sometimes called a profile graph.) The line plot below shows a comparison of chemical profiles of a drug from three different manufacturing lines.</span></p>
<p><img alt="Many Groups" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/aedb3873f1dc4d3605a5355604ee35bb/line_plot_of_drug_profiles.png" style="width: 576px; height: 384px;" /></p>
<p>Any profile that deviates from the established pattern could suggest quality problems with that production line, but these three profiles look quite similar.</p>
More Possibilities to Explore
<p>If you’re an experienced Minitab user, these examples may seem familiar. In its various incarnations, the line plot is similar to the interaction plot, to "Calculated X" plots used in PLS, and even to time series plots that appear with more advanced analyses. But the line plot gives you many more options for exploring your data. Here’s another example.</p>
<p><img alt="explore the mean" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/59276a51a780cd785048a1ad91bda670/line_plot_of_mean__sales__.png" style="width: 576px; height: 384px;" /></p>
<p>A line plot of the mean sales from a call center shows little interaction between the call script and whether the operators received sales training because the lines are parallel.</p>
<p><img alt="explore standard deviation" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ae08bf56a7653c3a1e81cd361713e5e7/line_plot_of_sum__stdev_sales___.png" style="width: 576px; height: 384px;" /></p>
<p>But because the line plot allows us to examine functions other than the mean, we can see that there is, in fact, an interaction effect in terms of standard deviation. The lines are not parallel. For some reason, the variability in sales seems to be affected by the combination of script and training.</p>
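<p>To make the idea of "parallel lines" concrete, here is a minimal Python sketch of what the line plot summarizes. The records are hypothetical (they are not the call center data behind the graphs), and the helper names <code>cell_statistic</code> and <code>interaction_contrast</code> are made up for this illustration. Each plotted point is a function (mean, standard deviation, and so on) of the responses in one combination of the two factors, and for a 2x2 layout the lines are parallel when the difference of differences is zero.</p>

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical call-center records: (script, training, sales)
RECORDS = [
    ("A", "No", 10), ("A", "No", 12), ("A", "No", 14),
    ("A", "Yes", 15), ("A", "Yes", 17), ("A", "Yes", 19),
    ("B", "No", 11), ("B", "No", 14), ("B", "No", 17),
    ("B", "Yes", 8), ("B", "Yes", 19), ("B", "Yes", 30),
]


def cell_statistic(records, func):
    """Apply func to the responses in each (x_group, legend_group) cell."""
    cells = defaultdict(list)
    for x_group, legend_group, value in records:
        cells[(x_group, legend_group)].append(value)
    return {key: func(values) for key, values in cells.items()}


def interaction_contrast(cells):
    """Difference of differences for a 2x2 layout; 0 means parallel lines."""
    x1, x2 = sorted({x for x, _ in cells})
    g1, g2 = sorted({g for _, g in cells})
    return (cells[(x2, g2)] - cells[(x2, g1)]) - (cells[(x1, g2)] - cells[(x1, g1)])


means = cell_statistic(RECORDS, mean)
spreads = cell_statistic(RECORDS, stdev)
print("mean contrast:", interaction_contrast(means))
print("stdev contrast:", interaction_contrast(spreads))
```

<p>With these made-up numbers the mean contrast is zero (parallel lines) while the standard deviation contrast is not, which mirrors the pattern described for the two graphs.</p>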
How to create a line plot in Minitab
<p>Creating a line plot in Minitab is simple. For example, suppose that your company makes pipes. You’re concerned about the mean diameter of pipes that are produced on three manufacturing lines with raw materials from two suppliers.</p>
<p><img alt="Example with Symbols" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/45cd6933975f00505182e2ec2b34ab5b/line_plot_dialog.png" style="width: 369px; height: 348px;" /></p>
<p>Because you’re examining only two factors (line and supplier), a With Symbols option is appropriate. Use a Without Symbols option when you have many groups to consider, because symbols may clutter the graph. Within these categories, you have your choice of data arrangement.</p>
<p>Choose <strong>Graph > Line Plot > With Symbols, One Y</strong>.<br />
Click OK.</p>
<p><img alt="example variables" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/2d2259668fda4d2c61b63da97369e786/line_plot_dialog_2.png" style="width: 521px; height: 381px;" /></p>
<p>Now, enter the variables to graph. Note that Line Plot allows you to graph a number of different functions apart from the mean.</p>
<p>In Graph variables, enter 'Diameter'.<br />
In Categorical variable for X-scale grouping, enter Line.<br />
In Categorical variable for legend grouping, enter Supplier.<br />
Click OK.</p>
<p><img alt="Line Plot of Diameter" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/3cb757cc063bc6ad638706233f0e5be6/line_plot_of_mean__diameter__.png" style="width: 576px; height: 384px;" /></p>
<p>The line plot shows a clear interaction between the supplier and the line that manufactures the pipe. </p>
Putting line plots to use
<p style="line-height: 18.9090900421143px;"><span style="line-height: 18.9090900421143px;">The line plot is an ideal way to get a first glimpse into the data behind your processes.</span><span style="line-height: 18.9090900421143px;"> </span>The line plot resembles a number of graphs, particularly the interaction plots used with DOE or ANOVA analyses. But, while the function of line plots may be similar, their simplicity makes them an especially appropriate starting point.</p>
<p>It can highlight the variables and the interactions that are worth exploration. Its powerful graphing features also allow you to analyze subsets of your data or to graph different functions of your measurement variable, like standard deviation or count.</p>
Data Analysis, Quality Improvement, Statistics, Statistics Help
Wed, 03 Jun 2015 12:00:00 +0000
http://blog.minitab.com/blog/understanding-statistics/how-to-explore-interactions-with-line-plots
Eston Martz
Operational Definitions: The First Step in a Statistical Analysis (Even after the Apocalypse)
http://blog.minitab.com/blog/statistics-in-the-field/operational-definitions%3A-the-first-step-in-a-statistical-analysis-even-after-the-apocalypse
<p><em><span style="line-height: 1.6;">By Matthew Barsalou, guest blogger. </span></em></p>
<p>Minitab <a href="http://www.minitab.com/products/minitab/">Statistical Software</a> can assist us in our analysis of data, but we must make judgments when selecting the data for an analysis. A good operational definition can be invaluable for ensuring the data we collect can be effectively analyzed using software.</p>
<p>Dr. W. Edwards Deming explains in <em>Out of the Crisis</em> (<a href="http://www.amazon.com/Out-Crisis-W-Edwards-Deming/dp/0262541157/ref=sr_1_1?ie=UTF8&qid=1432151883&sr=8-1&keywords=out+of+the+crisis" target="_blank">1989</a>), “An operational definition of safe, round, reliable, or any other quality must be communicable, with the same meaning to vendor as to purchaser, same meaning yesterday and today to the production worker.” Deming goes on to tell us that an operational definition requires a specific test, a judgment criterion, and a decision criterion to determine whether something met the criteria.<a href="http://www.madmaxmovie.com/" target="_blank"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/84bcea34f0da721bcc361d3bc1d39859/mmax_lo_res_fair_use.jpg" style="border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 169px; height: 261px;" /></a></p>
<p>The concept of operational definitions crossed my mind when I read Todd VanDerWerff’s review of <a href="http://www.madmaxmovie.com/" target="_blank"><em>Mad Max: </em></a><em><a href="http://www.imdb.com/title/tt1392190/">Fury Road</a></em> at <a href="http://www.vox.com/2015/5/15/8612481/mad-max-review-fury-road" target="_blank">Vox</a>.</p>
<p>VanDerWerff presented an illustration of the percent of time individual Mad Max movies contained a chase scene based on data from the Internet Movie Database. I have recreated the illustration below as a bar chart using Minitab.</p>
<p>I first typed the data into a Minitab worksheet as shown below:</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/0a2344f7b96a3a551a7d09a885f046e4/operational_definitions_1.png" style="width: 500px; height: 124px;" /></p>
<p>I then stacked the data by going to <strong>Data > Stack > Columns…</strong> and selecting columns C1-C4. Next, I relabeled column C1-T as “Film” and column C2 as “% Chase.”</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/bf0dcd0118f17cf2e169fa71e4376319/operational_definitions_2.png" style="width: 500px; height: 203px;" /></p>
<p><span style="line-height: 1.6;">Then I went to <strong>Graph > Bar Chart</strong> and selected “Values from a table” and a “Simple” bar chart. The graph variables were % Chase and the categorical variable was Film. I clicked on the resulting bar chart and then right clicked and selected <strong>Add > Data labels</strong>. The resulting bar chart is shown below:</span></p>
<p><img alt="Chart of Percent Chase" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a330153602cccee83f722be4213dfc9d/operational_definitions_3.png" style="width: 500px; height: 334px;" /></p>
<p><span style="line-height: 1.6;">As a connoisseur of the Mad Max series, I was rather shocked to see that <em>Mad Max: Fury </em></span><em style="line-height: 18.9090900421143px;">Road </em><span style="line-height: 1.6;">consisted of only 32% chase scenes. I would have estimated 95-95% chase scenes! VanDerWerff explains, “We're skewing toward the conservative side here and only counting scenes where the characters are in the thick of a really contentious chase, where either side might prevail.” Obviously, we are using different criteria to identify a chase scene. VanDerWerff is close to an operational definition; however, “where either side might prevail” could still be open to interpretation and is therefore inadequate as an operational definition. </span></p>
<p>In <em>Twenty Things You Need to Know</em> (<a href="http://www.amazon.com/Twenty-Things-You-Need-Know/dp/094532068X" target="_blank">2009</a>), Wheeler lists three questions that can serve as a framework for an operational definition:</p>
<ol>
<li>What do you want to accomplish?</li>
<li>By what method will you accomplish your objective?</li>
<li>How will you know you have accomplished your objective?</li>
</ol>
<p>Answering Wheeler’s three questions can help us to create an operational definition for chase scenes in the latest Mad Max movie: We want to identify chase scenes in <em>Mad Max: Fury Road</em>. We will use a calibrated stopwatch capable of differentiating down to 1/100th of a second to identify the start and stop times of a chase, where a chase is defined as “the time from when a chasing party first appears on screen at a range of 1,800 <a href="http://en.wikipedia.org/wiki/M2_Browning" target="_blank">meters</a> or less away from the chased party and the time will stop at the point where the chasing party is seen to be more than 1,800 meters away from the chased party or the last scene in which the chasing party appears.” The total chase time is to be divided by the total length of the movie and multiplied by 100. The objective will be accomplished after the last credit appears on the screen at the end of the movie.</p>
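<p>The arithmetic in that definition (total chase time divided by the total length of the movie, multiplied by 100) can be sketched in Python as follows. The function name <code>percent_chase</code>, the stopwatch readings, and the running time used in the example are all hypothetical placeholders, not measurements from the film.</p>

```python
def percent_chase(intervals, movie_length_s):
    """Total chase time divided by movie length, times 100.

    intervals: (start_s, stop_s) stopwatch readings in seconds, assumed
    not to overlap one another.
    """
    total = 0.0
    for start, stop in intervals:
        if stop < start:
            raise ValueError("stop time precedes start time")
        total += stop - start
    return 100 * total / movie_length_s


# Hypothetical readings, recorded to 1/100th of a second
EXAMPLE_INTERVALS = [(300.00, 1200.50), (2400.25, 3000.25)]
EXAMPLE_LENGTH_S = 7200.0  # placeholder running time in seconds

print(f"{percent_chase(EXAMPLE_INTERVALS, EXAMPLE_LENGTH_S):.1f}% chase")
```

<p>Two evaluators who apply the same start and stop rules should feed the same intervals into a calculation like this one and arrive at the same percentage.</p>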
<p>Such a simple operational definition makes it clear what should be considered a chase scene. Notice that the operational definition refers to “chased parties” and not “chased vehicles.” This operational definition would include foot chases as chase time. Without an operational definition, one evaluator may include foot chases while another ignores them.</p>
<p>Tina Turner tells us, “We don’t need another <a href="http://www.lastfm.de/music/Tina+Turner/_/We+Don%27t+Need+Another+Hero" target="_blank">hero</a>.” Perhaps, but what we do need is a good operational definition if we want to correctly collect data for a statistical analysis.</p>
<p> </p>
<p><strong>About the Guest Blogger</strong></p>
<p><em><a href="https://www.linkedin.com/pub/matthew-barsalou/5b/539/198" target="_blank">Matthew Barsalou</a> is a statistical problem resolution Master Black Belt at <a href="http://www.3k-warner.de/" target="_blank">BorgWarner</a> Turbo Systems Engineering GmbH. He is a Smarter Solutions certified Lean Six Sigma Master Black Belt, ASQ-certified Six Sigma Black Belt, quality engineer, and quality technician, and a TÜV-certified quality manager, quality management representative, and auditor. He has a bachelor of science in industrial sciences, a master of liberal studies with emphasis in international business, and a master of science in business administration and engineering from the Wilhelm Büchner Hochschule in Darmstadt, Germany. He is the author of the books <a href="http://www.amazon.com/Root-Cause-Analysis-Step---Step/dp/148225879X/ref=sr_1_1?ie=UTF8&qid=1416937278&sr=8-1&keywords=Root+Cause+Analysis%3A+A+Step-By-Step+Guide+to+Using+the+Right+Tool+at+the+Right+Time" target="_blank">Root Cause Analysis: A Step-By-Step Guide to Using the Right Tool at the Right Time</a>, <a href="http://asq.org/quality-press/display-item/index.html?item=H1472" target="_blank">Statistics for Six Sigma Black Belts</a> and <a href="http://asq.org/quality-press/display-item/index.html?item=H1473&xvl=76115763" target="_blank">The ASQ Pocket Guide to Statistics for Six Sigma Black Belts</a>.<br />
</em></p>
<p><em> </em></p>
<p style="font-size:10px;"><em>Low-resolution poster image displayed under <a href="http://en.wikipedia.org/wiki/Fair_use" target="_blank">fair use</a>. Copyright is believed to belong to the distributor of the item promoted, <a href="http://www.warnerbros.com/" target="_blank">Warner Bros. Pictures</a>. </em></p>
Data Analysis, Fun Statistics, Project Tools, Statistics, Statistics in the News
Mon, 01 Jun 2015 12:00:00 +0000
http://blog.minitab.com/blog/statistics-in-the-field/operational-definitions%3A-the-first-step-in-a-statistical-analysis-even-after-the-apocalypse
Guest Blogger
Graphing Distributions with Probability Distribution Plots
http://blog.minitab.com/blog/adventures-in-statistics/graphing-distributions-with-probability-distribution-plots
<p>Scientists who use the Hubble Space Telescope to explore the galaxy receive a stream of digitized images in the form of binary code. In this state, the information is essentially worthless; these 1s and 0s must first be converted into pictures before the scientists can learn anything from them.</p>
<p>The same is true of statistical distributions and <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/basic-concepts/parameter-esimates/" target="_blank">parameters</a> that are used to describe <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/basic-concepts/sample-and-population/" target="_blank">sample data</a>. They offer important information, but the numbers can be meaningless without an illustration to help you interpret them. For instance, what does it mean if your data follow a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/probability-distributions-and-random-data/distributions/gamma-distribution/" target="_blank">gamma distribution</a> with a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/probability-distributions-and-random-data/parameters/scale/" target="_blank">scale</a> of 8 and a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/probability-distributions-and-random-data/parameters/shape/" target="_blank">shape</a> of 7? If the distribution shifts to a shape of 10, is that good or bad? And how would you explain all of this to an audience that is more interested in outcomes than in statistics?</p>
<p><a href="http://www.minitab.com/en-us/products/minitab/features/" target="_blank">Minitab’s</a> probability distribution plots create the pictures that bring the numbers to life. Even novice users can reap the benefits that come from understanding their data’s distribution. Here are a few examples.</p>
See what you’ve been missing
<p><img alt="Estimates of Distribution Parameters output" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/fc0dfeb2904310214b6b49b1de040a7b/dist_para_mle.png" style="width: 294px; height: 90px;" /></p>
<p>A building materials manufacturer develops a new process to increase the strength of its I-beams. The output shows that the old process fit a gamma distribution with a scale of 8 and a shape of 7, whereas the new process has a shape of 10. The manufacturer does not know what this change in the shape parameter means.</p>
<p><img alt="Probability distribution plots that compare Gamma distributions" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/c50ff5d1491305e9ed0a14791f50821d/gamma_compare_large.png" style="width: 576px; height: 384px;" /></p>
<p>Minitab’s probability distribution plots show that the subtle shape change increases the percentage of acceptable beams from 91.4% to 99.5%, an improvement of 8.1 percentage points. Additionally, the right tail appears to be much thicker, which indicates many more unusually strong units. Perhaps these could lead to a premium line of products.</p>
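<p>For readers who want to check figures like these outside Minitab, here is a hedged Python sketch. Two things are assumptions rather than facts from the text: the scale is taken to stay at 8 for the new process, and the lower acceptance limit of 30 is not stated anywhere in the post (it was picked because it reproduces the quoted percentages). The helper <code>gamma_survival</code> is a hypothetical name, and it relies on the fact that a gamma distribution with an integer shape is an Erlang distribution whose upper tail equals a Poisson cumulative probability.</p>

```python
from math import exp, factorial


def gamma_survival(x, shape, scale):
    """P(X > x) for a gamma distribution with a positive integer shape.

    For integer shapes the gamma is an Erlang distribution, whose upper
    tail equals a Poisson cumulative probability with mean x / scale.
    """
    if shape != int(shape) or shape < 1:
        raise ValueError("this closed form needs a positive integer shape")
    ratio = x / scale
    return exp(-ratio) * sum(ratio ** k / factorial(k) for k in range(int(shape)))


# Assumed lower limit: NOT given in the post, chosen to match its percentages
ASSUMED_LOWER_LIMIT = 30

old = gamma_survival(ASSUMED_LOWER_LIMIT, shape=7, scale=8)
new = gamma_survival(ASSUMED_LOWER_LIMIT, shape=10, scale=8)
print(f"old process: {old:.1%}  new process: {new:.1%}")
print(f"difference: {100 * (new - old):.1f} percentage points")
```

<p>Under those assumptions the two proportions come out at roughly 91.4% and 99.5%, in line with the plot.</p>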
Communicate your results
<p>A quality improvement specialist at a grocery store chain wants to implement a new but expensive program to reduce discrepancies between an item’s shelf price and the amount that is charged at the register. No difference in prices is ideal, but any difference within the range of ± 0.5% is considered acceptable.</p>
<p><img alt="Descriptive statistics of grocery program results before and after" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/9b5d652feed6a2c336a0a02fbfb19b70/desc_stats.png" style="width: 266px; height: 92px;" /></p>
<p>In the pilot study, the mean improvement is tiny and the president doesn’t see the benefits of the smaller standard deviation. Therefore, the president is reluctant to approve the costly program.</p>
<p><img alt="Probability distribution plots that compare the before and after for the pilot study" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/ad5c78a30cad82ebf651126ef8fd08bb/grocery_program_large.png" style="width: 576px; height: 384px;" /></p>
<p>The specialist knows that the tighter distribution is key to the program’s success. To illustrate this, she creates this plot to show that the differences are clustered much closer to zero and most are in the acceptable range. Now the president can see the improvement.</p>
Compare distributions
<p>The fabrication department of a farm equipment manufacturer counts the number of tractor chassis that are completed per hour. A <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/probability-distributions-and-random-data/distributions/poisson-distribution/" target="_blank">Poisson distribution</a> with a mean of 3.2 best describes the sample data. However, the test lab prefers to use an analysis that requires a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/probability-distributions-and-random-data/distributions/normal-distribution/" target="_blank">normal distribution</a> and wants to know if it is appropriate. If the normal distribution does not approximate the Poisson distribution, then the test results are invalid.</p>
<p><img alt="Probability distribution plot that compares a Poisson distribution to a Normal distribution" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/f9d5d879bdc549efd7e5a826964aaced/poisson_large.png" style="width: 576px; height: 385px;" /></p>
<p>The distribution plot can easily compare the known distribution with a normal distribution. In this case, lab workers can clearly see that the normal distribution, as well as the analyses that require it, won’t be a good fit.</p>
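<p>A numeric version of the same comparison can be sketched in Python. This is an illustration under one assumption that the post does not spell out: the normal curve is given the same mean (3.2) and variance (3.2) as the Poisson distribution. The helper names are hypothetical. The sketch compares the Poisson probability of each count with the normal probability of the unit-wide bin centered on that count, and it reports how much probability the normal curve places below zero, where counts cannot occur.</p>

```python
from math import exp, factorial, sqrt
from statistics import NormalDist

MEAN = 3.2
# Assumption: the normal curve shares the Poisson mean and variance
NORMAL = NormalDist(mu=MEAN, sigma=sqrt(MEAN))


def poisson_pmf(k, mean=MEAN):
    """Probability of exactly k events under a Poisson distribution."""
    return exp(-mean) * mean ** k / factorial(k)


def normal_bin(k):
    """Normal probability of the unit-wide bin centered on k."""
    return NORMAL.cdf(k + 0.5) - NORMAL.cdf(k - 0.5)


for k in range(9):
    print(f"{k}: Poisson {poisson_pmf(k):.3f}  normal {normal_bin(k):.3f}")

print(f"normal probability below -0.5: {NORMAL.cdf(-0.5):.3f}")
```

<p>The mismatch in the left tail and the right skew of the Poisson probabilities are the kinds of differences the plot makes visible at a glance.</p>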
How to create probability distribution plots in Minitab
<p>You can easily create a probability distribution plot to visualize and to compare distributions and even to scrutinize an area of interest. For example, an analyst wants to interview customers who have customer satisfaction scores that are between 115 and 135. Minitab’s <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-identify-the-distribution-of-your-data-using-minitab" target="_blank">Individual Distribution Identification</a> feature shows that these scores are normally distributed with a mean of 100 and a standard deviation of 15. However, the analyst can’t visualize where his subjects fall within the range of scores or their proportion of the entire distribution.</p>
<p><img alt="Dialog box to create probability distribution plot" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/97c605080f66a9ded4649e3512d52545/view_prob_dialog.png" style="width: 294px; height: 265px;" /></p>
<ol>
<li>Choose <strong>Graph > Probability Distribution Plot > View Probability</strong>.</li>
<li>Click <strong>OK</strong>.</li>
<li>From <strong>Distribution</strong>, choose <strong>Normal</strong>.</li>
<li>In <strong>Mean</strong>, type <em>100</em>.</li>
<li>In <strong>Standard deviation</strong>, type <em>15</em>.</li>
</ol>
<p><img alt="Shade area dialog for creating probability distribution plots" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/b4ec8b22148db66201fc9048a4909226/shade_area_dialog.png" style="width: 296px; height: 267px;" /></p>
<ol>
<li>Click the <strong>Shaded Area</strong> tab.</li>
<li>In <strong>Define Shaded Area By</strong>, choose <strong>X Value</strong>.</li>
<li>Click <strong>Middle</strong>.</li>
<li>In <strong>X value 1</strong>, type <em>115</em>.</li>
<li>In <strong>X value 2</strong>, type <em>135</em>.</li>
<li>Click <strong>OK</strong>.</li>
</ol>
<p><img alt="Probability distribution plot that shows the probability of IQ scores from 115 to 135" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/a651de293cfc0e2a677e096f62a8d505/iq_plot_large.png" style="width: 576px; height: 384px;" /></p>
<p>The scores in the region of interest (115-135) represent 14.9% of the population. This somewhat small percentage suggests that the analyst may have to expend extra effort to find a sufficient number of qualified subjects.</p>
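<p>The shaded area can be cross-checked with a short Python sketch (an illustration using the standard library, not part of the Minitab workflow). It evaluates the normal cumulative distribution function at both ends of the region and takes the difference.</p>

```python
from statistics import NormalDist

# Normal distribution with the mean and standard deviation from the example
scores = NormalDist(mu=100, sigma=15)

# Area between the two x values = CDF at the upper value minus CDF at the lower
middle = scores.cdf(135) - scores.cdf(115)
print(f"P(115 <= score <= 135) = {middle:.3f}")
```

<p>The result, about 0.149, agrees with the 14.9% shown on the plot.</p>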
Putting probability distribution plots to use
<p>Probability distribution plots provide valuable insight because they reveal the deeper meaning of your distributions. Use these graphs to highlight the effect of changing distributions and parameter values, to show where target values fall in a distribution, and to view the proportions that are associated with shaded areas. These simple plots also clearly and easily communicate these advanced concepts to a non-statistical audience.</p>
<p>Don’t let your audience be confused by hard-to-understand concepts and numbers. Instead, use Minitab to illustrate what your data are telling you.</p>
Thu, 28 May 2015 12:00:00 +0000http://blog.minitab.com/blog/adventures-in-statistics/graphing-distributions-with-probability-distribution-plotsJim FrostDo Criminals Pay the Cost of Their Crimes?
http://blog.minitab.com/blog/statistics-and-quality-improvement/do-criminals-pay-the-cost-of-their-crimes
<p>In Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a>, putting a regression line on a scatterplot is as easy as choosing a picture with a regression line on a scatterplot:</p>
<p><img alt="With regression" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/6a7dc7563e65aac0b3a6e94a9950dc0b/graph_gallery.png" style="width: 370px; height: 326px;" /></p>
<p>A neat trick is that you can also add calculated lines onto a scatterplot for comparison or other communication purposes. Here’s a demonstration.</p>
United States Sentencing Guidelines
<p><img alt="justice" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/df671fcba4bb4e98146a21d9a5fdc842/cudrefin_justice_w1024.jpeg" style="width: 150px; height: 225px; float: right; border-width: 1px; border-style: solid; margin: 10px 15px;" />The United States Sentencing Guidelines say how people who are convicted of crimes should be punished. Sentences can deviate from the guidelines, and how often they deviate, whether toward more or less severe punishment, is among the statistics that the <a href="http://www.ussc.gov/" target="_blank">United States Sentencing Commission</a> keeps. Viewed simply, we might expect the amount of money a defendant pays in fines and restitution to be related to the measured monetary loss that results from a crime, at least in cases where the recorded statistics include a monetary loss.</p>
<p>The raw data from the United States Sentencing Commission for 2013, the most recent year on their website as of 2/16/2015, has 80,035 cases. Cut that data set down to cases where a specific, nonzero amount was recorded for a monetary loss and a specific amount was recorded for the total of fines, restitution, and cost of supervision and you get a data set with 9,619 cases. Here’s what the scatterplot with a regression line for that data set looks like:</p>
<p><img alt="The entire data set, with a regression line" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/b0a8dd370207276ce7114e5196a44c1b/all_data_with_regression_line.png" style="width: 576px; height: 384px;" /></p>
<p>If there’s a relationship between the cost and the loss, we might hypothesize that a fair solution would be for cost and loss to be approximately equal, Y = X. Here are the steps for drawing a new line on the scatterplot:</p>
<ol>
<li>In the worksheet, name an empty column <em>X</em>.</li>
<li>Enter starting and ending x-values in the first two rows of column X. (Because I’m going to show only a portion of the data for now, I’m going to enter 0 in the first row and 400 million in the second row.)</li>
<li>Choose <strong>Calc > Calculator</strong>.</li>
<li>In <strong>Store Result in Variable</strong>, enter <em>Y</em>.</li>
<li>In <strong>Expression</strong>, enter the formula for the calculated line. (In this case, because I’m interested in whether cost and loss are approximately equal, I’m going to enter ‘X’.)</li>
<li>Click <strong>OK</strong>.</li>
<li>Right-click the scatterplot. Choose <strong>Add > Calculated Line</strong>.</li>
<li>In <strong>Y Column</strong>, enter <em>Y</em>.</li>
<li>In <strong>X Column</strong>, enter <em>X</em>.</li>
<li>Click <strong>OK</strong>.</li>
</ol>
<p>The single case where the loss was $5.9 billion and no restitution or fines were part of the sentence, together with the 5 other cases where the loss exceeded $500 million, squishes the main portion of the data considerably, so I <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/graph-options/graph-framework-elements/modifying-graph-scales/#modify-the-range-of-a-continuous-scale">edited the x-axis</a> to extend only to 400 million.</p>
<p><img alt="The slope of the calculated line is much steeper than the regression line." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/54d914b1612cf2977c119939a31b4a98/with_calculated_line.png" style="width: 576px; height: 384px;" /></p>
<p>The regression fit is well below the calculated line, which suggests that the costs tend to be less than the loss. However, the r-squared value for the regression line is only 3.3%. What the data really indicate is that there's very little linear relationship between the loss and the costs a criminal is asked to pay.</p>
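<p>For readers who want to see the arithmetic behind this comparison, here is a minimal sketch of an ordinary least-squares fit, its r-squared value, and the endpoints of a calculated line Y = X. The <code>fit_line</code> helper and the five (loss, cost) pairs are invented for illustration; they are not the Sentencing Commission data, and this is not a description of how Minitab computes its fit.</p>

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope, intercept, and R-squared for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy ** 2) / (sxx * syy)
    return slope, intercept, r_squared

# Made-up (loss, cost) pairs for illustration only; NOT the real sentencing data.
loss = [10.0, 20.0, 30.0, 40.0, 50.0]
cost = [4.0, 12.0, 11.0, 22.0, 21.0]

slope, intercept, r_squared = fit_line(loss, cost)

# Endpoints of the calculated line Y = X, mirroring the two-row X and Y columns.
x_ends = [min(loss), max(loss)]
y_ends = list(x_ends)

print(f"slope={slope:.2f}, intercept={intercept:.2f}, R-sq={r_squared:.1%}")
print("calculated line endpoints:", list(zip(x_ends, y_ends)))
```

<p>A fitted slope below 1 with a small intercept puts the regression line under Y = X, and the r-squared value shows how much weight that comparison can bear.</p>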
<p>Of course, we know that the regression line fitting all of the data is heavily influenced by the most extreme case where the loss was $5.9 billion and there was no cost. Actually, the cost and the loss are identical in about 34% of the cases in the data. If we consider only cases where the costs a criminal paid were nonzero and the loss was less than $500 million, the r-squared value increases to 73.3% and the regression line looks much closer to the line Y = X:</p>
<p><img alt="The regression line for a subset of the data is closer to Y = X." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/f471dc83509ea4e95c4ccc4f1381cdd0/subset.png" style="width: 576px; height: 384px;" /></p>
<p>The United States Sentencing Commission recorded over 18,000 variables about the sentences that defendants received in 2013. Coming up with what’s fair is clearly a complicated matter.</p>
Bonus
<p>You can add calculated lines to all kinds of graphs in Minitab. If you’re ready for more, see how you can <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/graph-options/preparing-graphs-for-presentation/adding-reference-lines-to-graphs/#place-a-reference-line-in-front-of-the-data-display">use a calculated line to put a line in front of the bars on a histogram</a>.</p>
<p><em>The image of the Fontaine de la Justice in Cudrefin, Switzerland, is by <a href="http://commons.wikimedia.org/wiki/User:Roland_Zumbuehl">Roland Zumbuehl</a> and is licensed under this <a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">Creative Commons License</a>.</em></p>
Fun StatisticsStatistics in the NewsStatsWed, 27 May 2015 12:00:00 +0000http://blog.minitab.com/blog/statistics-and-quality-improvement/do-criminals-pay-the-cost-of-their-crimesCody SteeleDesign of Experiment (DOE): Searching for a Selfie Fountain of Youth
http://blog.minitab.com/blog/statistics-and-quality-data-analysis/design-of-experiment-doe%3A-searching-for-a-selfie-fountain-of-youth
<p>I've never understood the fascination with selfies.</p>
<p>Maybe it's because I'm over 50. After surviving the slings and arrows of a half a century on Earth, the minute or two I spend in front of the bathroom mirror each morning is <em>more</em> than enough selfie time for me.</p>
<p>Still, when I heard that Microsoft had <a href="http://how-old.net/#" target="_blank">an online app that estimates the age</a> of any face on a photo, I was intrigued.</p>
<p>How would the app quantify the cracks, fissures, and crevices of my 56-year-old mug?</p>
The Pre-Experiment Phase, aka PlayTime
<p>At first, I just goofed around with the app, taking some selfies with my iPad and observing the estimates.</p>
<p>It didn’t take long to notice some whopping variability in the estimates. Variability that made me alternately smile or weep:</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/d19f1a7b130b11e49f12c011f60a9dfc/how_old_duo.jpg" style="width: 516px; height: 230px;" /></p>
<p>But I soon tired of my subjective responses to the age estimates. They just caused my face to scrunch up and accelerate the aging process, anyway.</p>
<p>It was time to take a step back and approach the problem more objectively.</p>
<p>If the <em>How Old Do I Look</em> app were a process, and its age estimates its “product,” what factors might affect its variability?</p>
<p>And, by identifying the optimal settings for these factors, could I uncover a strategy (sans plastic surgery) for taking the most age-defying selfie possible?</p>
Creating a Full Factorial Design
<p>After informally experimenting with selfies taken with an iPad, I came up with 5 factors that might affect the age estimates produced by the app.</p>
<ul>
<li><strong>Light source</strong>: Indicates whether main light source was in front of me or behind me</li>
<li><strong>Angle</strong>: Indicates whether the iPad was held straight in front of me (0°), above my face (+45°), or below my face (-45°)</li>
<li><strong>Smile</strong>: Indicates whether I smiled, frowned, or stared blankly like a zombie</li>
<li><strong>Distance</strong>: Indicates how far the iPad was held from my face (0.5, 1, or 2 ft)</li>
<li><strong>Shave</strong>: Indicates whether I had shaved before the photo was taken</li>
</ul>
<p>Using Minitab (<strong>Stat > DOE > Factorial > Create Factorial Design</strong>), I created a general full factorial design with these 5 factors, entering the levels for each factor as shown below:</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/887c000fb23fd51812e7cf6bf60915d0/factors_dialog.jpg" style="width: 509px; height: 216px;" /></p>
<p>Using that information, Minitab created a randomized worksheet that detailed the factor settings I should use to take each selfie for the experiment. I added the Age column to record the age estimate given by the app.</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/e9a3af1dbcbbf201625ea99216d7c329/doe_worksheet.jpg" style="width: 717px; height: 122px;" /></p>
<p>So, for the first selfie (row 1), I needed to take the photo with the light source in back of me, with the iPad below my face pointing upward (-45), at a distance of 1 foot, when I was smiling and unshaven.</p>
<p>The full factorial design required 108 runs—that is, 108 selfies. Brutal, yes. But a necessary sacrifice for the advancement of selfie science.</p>
<p><strong>Note</strong>: I used a full factorial design because collecting data for this experiment was quick, easy, and free. If collecting data for an experiment requires a significant amount of time and money, and you have limited resources, you might opt instead for a <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/doggy-doe-part-i-design-on-a-dime" target="_blank">fractional factorial design</a>.</p>
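<p>To see where the 108 runs come from, the general full factorial design can be sketched as the Cartesian product of the factor levels, followed by a shuffle of the run order. This is an illustration rather than Minitab's own procedure; the level labels below are assumptions based on the factor descriptions above, and the random seed is arbitrary.</p>

```python
import itertools
import random

# Level labels are assumptions based on the factor descriptions in the post.
factors = {
    "Light source": ["Front", "Back"],
    "Angle": [-45, 0, 45],
    "Smile": ["Smile", "Frown", "Blank"],
    "Distance": [0.5, 1, 2],
    "Shave": ["Yes", "No"],
}

# Every combination of factor levels is one run of the full factorial design.
runs = [dict(zip(factors, combo)) for combo in itertools.product(*factors.values())]

# Randomize the run order; the seed only makes this shuffle repeatable.
random.Random(1).shuffle(runs)

print(len(runs))  # 2 * 3 * 3 * 3 * 2 = 108
print(runs[0])    # factor settings for the first selfie in this shuffled order
```

<p>Adding one more three-level factor would triple the run count, which is why fractional designs become attractive when each run is expensive.</p>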
Evaluating the Main Effects
<p>When you analyze a designed experiment, you can display a main effects plot to examine differences among the means across the factor levels. For this experiment, the plot shows the differences in the mean age estimate at each "setting" used to take the selfie.</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/4f0ee8fcdd1a8c1a7d5bcea24e76e426/main_effect_plot_cropped_w1024.jpeg" style="width: 1024px; height: 385px;" /></p>
<p>To get the lowest age estimate from the app, I should take the selfie with the light source in front of me, from above my face (+45 degrees), at a distance of 2 feet, while clean-shaven and wearing a blank, zombie-like expression. The mean age estimate was highest when I smiled.</p>
<p>But before I make a pointed effort to mimic the expression of the walking dead, or ask everyone to stand at least 2 feet away from me, there are a couple of other things that are important to consider.</p>
<ul>
<li>Are any of these main effects statistically significant?</li>
<li>Are there significant interactions between the factors that could make these main effects misleading?</li>
</ul>
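<p>The points on a main effects plot are simply the mean response at each level of a factor, averaged over every other factor. A minimal sketch of that calculation follows; the <code>main_effect_means</code> helper and the four age estimates in the tiny 2 x 2 design are invented for illustration and are not the data from this experiment.</p>

```python
from collections import defaultdict

def main_effect_means(runs, factor, response="Age"):
    """Mean response at each level of one factor, averaged over all other factors."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for run in runs:
        totals[run[factor]] += run[response]
        counts[run[factor]] += 1
    return {level: totals[level] / counts[level] for level in totals}

# Made-up age estimates for a tiny 2 x 2 design; NOT the experiment's data.
runs = [
    {"Light source": "Front", "Shave": "Yes", "Age": 50},
    {"Light source": "Front", "Shave": "No", "Age": 54},
    {"Light source": "Back", "Shave": "Yes", "Age": 56},
    {"Light source": "Back", "Shave": "No", "Age": 60},
]

for factor in ("Light source", "Shave"):
    print(factor, main_effect_means(runs, factor))
```

<p>Level means alone say nothing about significance or interactions, which is why the two questions above still need answers.</p>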
Determining the Final Model
<p>For this experiment, I limited the analysis to main effects and two-way interactions between factors. Using stepwise selection and a significance level of 0.15, Minitab determined the following final model:</p>
<p style="margin-left: 40px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/e6d7bd0d0c25fa79c7ee5a7afabc238a/sw_output_how_old.jpg" style="width: 442px; height: 346px;" /></p>
<p>Of the main effects, only Angle, Smile, and Shave are statistically significant (P-value &lt; 0.15). Distance (P-value = 0.186) and Light source (P-value = 0.210) are not statistically significant. However, both of these factors are part of at least one significant 2-way interaction, so they're included in the final model.</p>
<p>The adjusted <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/r-squared-sometimes-a-square-is-just-a-square" target="_blank">R-squared value</a> (51.27%) shows that these factors and their interactions explain over 50% of the variation in the app's age estimates of the selfies!</p>
Evaluating the Interactions
<p>To clearly see how these significant 2-way interactions affect the response, I displayed an interaction plot.</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/9fc8daa998a7692d943b9082b1101fc6/interaction_plot_for_age.jpg" style="width: 576px; height: 384px;" /></p>
<p>When you eyeball the interaction plot, look for lines that aren't parallel. They indicate interactions.</p>
<p>For example, consider the Light source*Angle plot in the upper left. It shows that when the selfie is taken straight-on (0 degrees) or from below (-45), a light source from the front results in a slightly higher mean age estimate. However, when the selfie is taken from above (+45), the effect of the light source is reversed: having the light source in front significantly reduces the mean age estimate. That puts a whole new spin on the original interpretation of the light source as a main effect.</p>
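<p>One way to put a number on "lines that aren't parallel" is to compare the effect of one factor at two levels of another. The sketch below computes the front-minus-back difference in mean age estimate at two angles and then the difference between those differences; parallel lines would give a contrast of zero. The cell values are invented to mimic the pattern described above and are not the experiment's data, and the helper names are hypothetical.</p>

```python
def cell_mean(runs, a, a_level, b, b_level, response="Age"):
    """Mean response for runs where factor a and factor b are at the given levels."""
    values = [r[response] for r in runs if r[a] == a_level and r[b] == b_level]
    return sum(values) / len(values)

# Made-up cell values that mimic the described pattern; NOT the experiment's data.
runs = [
    {"Light source": "Front", "Angle": 0, "Age": 58},
    {"Light source": "Back", "Angle": 0, "Age": 56},
    {"Light source": "Front", "Angle": 45, "Age": 50},
    {"Light source": "Back", "Angle": 45, "Age": 57},
]

def light_effect(angle):
    """Front-minus-back difference in mean age estimate at one camera angle."""
    front = cell_mean(runs, "Light source", "Front", "Angle", angle)
    back = cell_mean(runs, "Light source", "Back", "Angle", angle)
    return front - back

effect_straight = light_effect(0)
effect_above = light_effect(45)

# Zero means parallel lines (no interaction); a sign change means the effect reverses.
interaction_contrast = effect_above - effect_straight
print(effect_straight, effect_above, interaction_contrast)
```

<p>A contrast far from zero is only a descriptive hint; whether the interaction is statistically significant still depends on the model's P-values.</p>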
Concluding Comments
<p>I don't know what algorithms the "How Old Do I Look" app uses to make its age estimates. But by analyzing a designed experiment of selfies taken with an iPad, I identified some factors and interactions that are significantly associated with the variability of the age estimates.</p>
<p>As for the selfie craze itself…I remain as baffled as ever. A lot of new research is being done to delve deeper into the phenomenon. For example, one recent study found that men who post a lot of selfies online <a href="https://news.osu.edu/news/2015/01/06/hey-guys-posting-a-lot-of-selfies-doesn%E2%80%99t-send-a-good-message/" target="_blank">score higher on measures of anti-social psychopathy</a>.</p>
<p>After taking 108 selfies for this experiment and posting the results online, I'm feeling very relieved about one thing: Correlation does <em>not</em> equal causation.</p>
Design of ExperimentsFun StatisticsTue, 26 May 2015 13:06:00 +0000http://blog.minitab.com/blog/statistics-and-quality-data-analysis/design-of-experiment-doe%3A-searching-for-a-selfie-fountain-of-youthPatrick Runkel