Minitab | Minitab
Blog posts and articles about using Minitab software in quality improvement projects, research, and more.
http://blog.minitab.com/blog/minitab/rss
Tue, 09 Feb 2016 05:30:36 +0000
Imprisoned by Statistics: How Poor Data Collection and Analysis Sent an Innocent Nurse to Jail
http://blog.minitab.com/blog/understanding-statistics/imprisoned-by-statistics%3A-how-poor-data-collection-and-analysis-convicted-an-innocent-nurse
<p>If you want to convince someone that at least a <em>basic </em>understanding of statistics is an essential life skill, bring up the case of Lucia de Berk. Hers is a story that's too awful to be true—except that it is completely true.</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/2b47d769b8ffc2eafac2a6729e69e719/jail.jpg" style="margin: 10px 15px; float: right; width: 252px; height: 194px;" />A flawed analysis irrevocably altered de Berk's life and kept her behind bars for a full decade, and the fact that this analysis targeted and harmed just one person makes it more frightening. When tragedy befalls many people, aggregating the harmed individuals into a faceless mass helps us cope with the horror. <span style="line-height: 1.6;">You can't play the same trick on yourself when you consider a single innocent woman, sentenced to life in prison, thanks to an <span><a href="http://blog.minitab.com/blog/real-world-quality-improvement/3-common-and-dangerous-statistical-misconceptions">erroneous analysis</a></span>. </span></p>
The Case Against Lucia
<p>It started with an infant's unexpected death at a children's hospital in The Hague. Administrators subsequently reviewed earlier deaths and near-death incidents, and identified 9 other incidents in the previous year they believed were medically suspicious. Dutch prosecutors proceeded to press charges against pediatric nurse Lucia de Berk, who had been responsible for patient care and medication at the time of all of those incidents. <span style="line-height: 1.6;">In 2003, de Berk was sentenced to life in prison for the murder of four patients and the attempted murder of three. </span></p>
<p><span style="line-height: 1.6;">The guilty verdict, rendered </span><span style="line-height: 20.8px;">despite a glaring lack of physical or even circumstantial evidence, was based </span><span style="line-height: 1.6;">(at least in part) on a prosecution calculation that only a 1-in-342-million chance existed that a nurse's shifts would coincide with so many suspicious incidents. "In the Lucia de B. case statistical evidence has been of enormous importance," a Dutch criminologist said at the time. "I do not see how one could have come to a conviction without it." The guilty verdict was upheld on appeal, and de Berk spent the next 10 years in prison.</span></p>
One in 342 Million...?
<p>If an expert states that the probability of something happening by random chance is just 1 in 342 million, and you're not a statistician, perhaps you'd be convinced those incidents did <em>not </em>happen by random chance. </p>
<p>But if you are statistically inclined, perhaps you'd wonder how experts reached this conclusion. That's exactly what statisticians Richard Gill and Piet Groeneboom, among others, began asking. They soon realized that the prosecution's 1-in-342-million figure was very, very wrong.</p>
<p>Here's where the case began to fall apart—and not because the situation was complicated. In fact, the problems should have been readily apparent to anyone with a solid grounding in statistics. </p>
What Prosecutors Failed to Ask
<p>The first question in any analysis should be, "<a href="http://blog.minitab.com/blog/understanding-statistics/the-single-most-important-question-in-every-statistical-analysis">Can you trust your data?</a>" In de Berk's case, it seems nobody bothered to ask. </p>
<p>Richard Gill graciously attributes this to a kind of culture clash between criminal and scientific investigation. Criminal investigation begins with the assumption a crime occurred, and proceeds to seek out evidence that identifies a suspect. A scientific approach begins by asking whether a crime was even committed.</p>
<p>In Lucia's case, investigators took a decidedly non-scientific approach. In gathering data from the hospitals where she worked, they omitted incidents that <em>didn't </em>involve Lucia from their totals (cherry-picking), and made arbitrary and inconsistent classifications of other incidents. Incredibly, events de Berk <em>could not have been involved in</em> were nonetheless attributed to her. Confirmation and selection bias were hard at work on the prosecution's behalf. </p>
<p><span style="line-height: 20.8px;">Further, much of the "data" about events was based on individuals' memories, which are notoriously unreliable. In a criminal investigation where witnesses know what's being sought and may have opinions about a suspect's guilt, relying on memories of events that happened weeks or months earlier is particularly dubious. Nonetheless, the prosecution's statistical experts deemed the data gathered under such circumstances trustworthy.</span></p>
<p>As Gill, one of the few heroes in this sordid and sorry mess, <a href="http://www.math.leidenuniv.nl/~gill/elfferscorrected.pdf">points out</a>, "<span style="line-height: 20.8px;">The statistician has to question all his clients’ assumptions and certainly not to jump to the conclusions which the client is aiming for." Clearly, that did not happen here. </span></p>
Even If the Data <em>Had </em>Been Reliable...
<p>So the data used against de Berk didn't pass the smell test for several reasons. But even if the data had been collected in a defensible manner, the prosecution's statement about 1-in-342-million odds was <em>still</em> wrong. To arrive at that figure, the prosecution's statistical expert multiplied p-values from three separate analyses. However, in combining those p-values the expert failed to perform necessary statistical corrections, resulting in a p-value that was far, far lower than it should have been. You can read the details about these calculations <a href="http://arxiv.org/pdf/math/0607340v1.pdf" target="_blank">in this paper</a>. </p>
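<p>To see why simply multiplying p-values overstates the evidence, here is a minimal Python sketch. It uses Fisher's method, one standard way to combine independent p-values, purely as an illustration; it is not necessarily the correction applied in the paper linked above, and the three p-values are invented for the example, not taken from the case.</p>

```python
import math


def naive_product(p_values):
    """Multiply the p-values together (the uncorrected shortcut)."""
    return math.prod(p_values)


def fisher_combined(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom under the null hypothesis. For even degrees of
    freedom the upper tail has a closed form, so no extra library is needed.
    """
    if not p_values or any(not 0 < p <= 1 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    half_stat = -sum(math.log(p) for p in p_values)  # half the chi-square statistic
    k = len(p_values)
    tail_sum = sum(half_stat ** i / math.factorial(i) for i in range(k))
    return math.exp(-half_stat) * tail_sum


# Invented p-values, for illustration only.
example = [0.01, 0.02, 0.05]
print("Naive product:  ", naive_product(example))
print("Fisher combined:", fisher_combined(example))
```

<p>With a single p-value the two functions agree. As soon as several are combined, the properly combined value is larger than the raw product, which is the direction of the error described above.</p>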
<p>In fact, when statisticians, including Gill, analyzed the prosecution's data using the proper formulas and corrected numbers, they found the odds that a nurse could experience the pattern of events exhibited in the data could have been as low as 1 in 25.</p>
Justice Prevails at Last (Sort Of)
<p>Even though de Berk had exhausted her appeals, thanks to the efforts of Gill and others, the courts finally re-evaluated her case in light of the revised analyses. The nurse, now declared innocent of all charges, was released from prison (and quietly given an undisclosed settlement by the Dutch government). But for an innocent defendant, justice remained blind to the statistical problems in this case across 10 years and multiple appeals, during which de Berk experienced a stress-induced stroke. It's well worth learning <a href="http://www.math.leidenuniv.nl/~gill/lucia.html">more about the role of statistics in her experience</a> if you're interested in the impact data analysis can have on one person's life. </p>
<p>At a minimum, what happened to Lucia de Berk should be more than enough evidence that a better understanding of statistics could set you free.</p>
<p>Literally. </p>
Data Analysis, Health Care Quality Improvement, Statistics
Mon, 08 Feb 2016 13:00:00 +0000
http://blog.minitab.com/blog/understanding-statistics/imprisoned-by-statistics%3A-how-poor-data-collection-and-analysis-convicted-an-innocent-nurse
Eston Martz
How to Calculate BX Life, Part 2
http://blog.minitab.com/blog/meredith-griffith/how-to-calculate-bx-life-part-2
<p><span style="line-height: 1.6;">When I wrote <a href="http://blog.minitab.com/blog/meredith-griffith/how-to-calculate-b10-life-with-statistical-software">How to Calculate B10 Life with Statistical Software</a></span><span style="line-height: 1.6;">, I promised a follow-up blog post that would describe how to compute any “BX” lifetime. In this post I’ll follow through on that promise, and in a third blog post in this series, I will explain why BX life is one of the best measures you can use in your reliability analysis.</span></p>
<p>As a refresher, B10 life is the time by which 10% of the population has failed or, to put it another way, the time at which the population's reliability is 90%. Let’s revisit our pacemaker battery example from part 1 of this blog series. Here's <a href="//cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/126181c5eca45c380dfed332ee3c3c7d/pacemakerbatterylife.MTW">the data</a>.</p>
<p><img alt="Data" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/dae6c7b7-fc22-4616-9d65-f04909c20ab1/Image/87f5a262f6fa042461f047c74c26a72a/table1.jpg" style="width: 177px; height: 243px;" /></p>
<p>Recall that we found the B10 life of pacemaker batteries to be 6.36 years. Another way to interpret this value is to say that 6.36 years is the time at which 10% of the population of pacemaker batteries will fail. This information is useful in establishing a realistic warranty period for a product so that customers are covered through a product’s 90% reliability period, and so the manufacturer won’t have to incur extra cost by replacing an excess of the product during the warranty period.</p>
<p>But perhaps a particular product has additional reliability requirements a manufacturer wishes to monitor, such as B15 life. Or perhaps we would like to know when half of the population will fail—its B50 life. Both B10 and B50 life are industry standards for measuring the life expectancy of an automotive engine, for instance. This is where BX life calculations become even more useful—and Minitab makes it incredibly easy to compute and interpret those values. (If you don't already have Minitab and you'd like to follow along, <a href="http://www.minitab.com/products/minitab/free-trial/">download the free trial</a>.)</p>
Calculating BX Life
<p>Navigate to Minitab’s <strong>Stat > Reliability/Survival > Distribution Analysis (Right Censoring) > Parametric Distribution Analysis</strong> menu and set up the main dialog and the 'Censor' subdialog the same way we did in <a href="http://blog.minitab.com/blog/meredith-griffith/how-to-calculate-b10-life-with-statistical-software">Part 1</a>:</p>
<p><img alt="Parametric Distribution Analysis - Main Dialog" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/dae6c7b7-fc22-4616-9d65-f04909c20ab1/Image/d0669f86f85b236ba2a3adcef520a994/dialog1.jpg" style="width: 507px; height: 345px;" /></p>
<p>Press the "Censor" button and fill out the subdialog as follows: </p>
<p><img alt="Censor Subdialog" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/dae6c7b7-fc22-4616-9d65-f04909c20ab1/Image/d319e5438d544d051aa997adeb14f271/dialog3.jpg" style="width: 426px; height: 313px;" /></p>
<p>When you press OK, Minitab analyzes the distribution of your data and by default will display a Table of Percentiles in the session window. We can take advantage of this table for measures such as B50 life, because the table produces output for a variety of percentiles by default. The 50th percentile, which is the B50 life, is included in that default output.</p>
<p><img alt="Table of Percentiles for B50 Life" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/dae6c7b7-fc22-4616-9d65-f04909c20ab1/Image/1e6e0c6abf084578a9c2993e6c09f530/table2.jpg" style="width: 536px; height: 482px;" /></p>
<p>We see that 50% of the population of pacemaker batteries will fail by 9.735 years. But what if we want to compute B15 life? This percentile does not display by default in the Table of Percentiles.</p>
<p>Revisiting the Parametric Distribution Analysis dialog (pressing CTRL-E is a Minitab shortcut that will bring up your most recently completed dialog), we can click the ‘Estimate’ button to specify what “BX” life we want. In the section titled ‘Estimate percentiles for these additional percents,’ entering the number 15 will give us the B15 life for pacemaker batteries.</p>
<p><img alt="Estimate Subdialog" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/dae6c7b7-fc22-4616-9d65-f04909c20ab1/Image/c84a41ca235867a883d886d754e6fc5d/dialog2.jpg" style="width: 508px; height: 447px;" /></p>
<p>Click OK through the dialogs, and we see that a row of output for the 15th percentile is now included in the Table of Percentiles.</p>
<p><img alt="Table of Percentiles for B15 Life" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/dae6c7b7-fc22-4616-9d65-f04909c20ab1/Image/1ac702357a0ce4437048ec6aa470ba1f/table3.jpg" style="width: 313px; height: 47px;" /></p>
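<p>If you're curious about the arithmetic behind a row of that table, here is a minimal Python sketch. It assumes the fitted distribution is a Weibull with known shape and scale, and the parameter values are placeholders I made up, not the estimates Minitab produces for the pacemaker data. Under that assumption, the BX life is the time t at which the fraction failed, 1 - exp(-(t/scale)^shape), equals X%.</p>

```python
import math


def bx_life(x_percent, shape, scale):
    """Return the time by which x_percent of units fail under a Weibull model."""
    if not 0 < x_percent < 100:
        raise ValueError("x_percent must be strictly between 0 and 100")
    if shape <= 0 or scale <= 0:
        raise ValueError("shape and scale must be positive")
    fraction_failed = x_percent / 100
    # Invert the Weibull CDF: F(t) = 1 - exp(-(t / scale) ** shape)
    return scale * (-math.log(1 - fraction_failed)) ** (1 / shape)


# Hypothetical Weibull parameters, for illustration only.
shape, scale = 4.0, 10.0
for x in (10, 15, 50):
    print(f"B{x} life: {bx_life(x, shape, scale):.2f} years")
```

<p>The same function covers B10, B15, B50, or any other percent, which mirrors what the 'Estimate' subdialog does when you request additional percentiles.</p>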
<p>It’s as simple as that!</p>
<p>If you’ve never used BX life as a reliability metric, and you’re wondering just how and why these can be some of the best measures of reliability, stay tuned for my final post in this series!</p>
Quality Improvement, Reliability Analysis, Six Sigma
Fri, 05 Feb 2016 13:00:00 +0000
http://blog.minitab.com/blog/meredith-griffith/how-to-calculate-bx-life-part-2
Meredith Griffith
ANOVA: Data Means and Fitted Means, Balanced and Unbalanced Designs
http://blog.minitab.com/blog/marilyn-wheatleys-blog/anova-data-means-and-fitted-means-balanced-and-unbalanced-designs
<p>In this post, I’ll address some common questions we’ve received in <a href="http://support.minitab.com">technical support</a> about the difference between fitted and data means, where to find each option within Minitab, and how Minitab calculates each.</p>
<p><img alt="Cat Meme" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/11acae6b1240abc7bc47f09fd5d2470f/capture.PNG" style="margin: 10px 15px; float: right; width: 250px; height: 187px;" />First, let’s look at some definitions. It’s useful to have an example, so I’ll be using the Light Output data set from Minitab’s Data Set Library, which includes a description of the sample data <a href="http://support.minitab.com/en-us/datasets/anova-data-sets/light-output-data/">here</a>. This same data set is available within Minitab by choosing <strong>File</strong> > <strong>Open Worksheet</strong>, clicking the <strong>Look in Minitab Sample Data folder</strong> button at the bottom, and then opening the file titled <strong>LightOutput_model.MTW</strong>.</p>
Calculating Data Means
<p>In an ANOVA, data means are the raw response variable means for each factor/level combination.</p>
<p>For the LightOutput data set, we can calculate the data means for Temperature by choosing <strong>Stat</strong> > <strong>Basic Statistics</strong> > <strong>Display Descriptive Statistics</strong>, and then completing the dialog box as shown below:</p>
<p><img border="0" height="351" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/e7f9c3df74e6bf359bd7b2c9d6d5b2fb/e7f9c3df74e6bf359bd7b2c9d6d5b2fb.png" width="334" /></p>
<p>Click the <strong>Statistics</strong> button and make sure only <strong>Mean</strong> is selected, then click <strong>OK</strong> in each dialog. Repeat the above steps, and replace Temperature with GlassType to calculate the data means for that second factor. The session window will display these results:</p>
<p><img border="0" height="229" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/2018d9d24b3d4958e1a4abaa87c935b5/2018d9d24b3d4958e1a4abaa87c935b5.png" width="276" /></p>
<p>The means calculated directly from the data shown above are the values that would be plotted in a Main Effects plot. To create that plot in Minitab, use <strong>Stat</strong> > <strong>ANOVA</strong> > <strong>Main Effects Plot</strong> and complete the dialog box as shown below:</p>
<p><img border="0" height="268" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/10798c96262e71dc5f2ac1faf970265d/10798c96262e71dc5f2ac1faf970265d.png" width="454" /></p>
<p>Click <strong>OK</strong> to display the graph, which will show the same mean values for each level of the two factors (I’ve added data labels to the graph below):</p>
<p><img border="0" height="309" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/07a1919493ac7320ed88511665791f3e/07a1919493ac7320ed88511665791f3e.png" width="467" /></p>
<p>So, <strong>data means</strong> are the <em>raw response variable means</em> for each factor/level combination. On the other hand, <strong>fitted means</strong> use least squares regression to predict the mean response values of a balanced design, in which your data has the same number of observations for every combination of factor levels. The two types of means are identical for balanced designs but can be different for unbalanced designs.</p>
Balanced Designs
<p>As I mentioned above, in ANOVA a balanced design has an equal number of observations for all possible combinations of factor levels, whereas an unbalanced design has an unequal number of observations. </p>
<p>If you’re not sure whether your design is balanced or not, Minitab makes it easy to find out. For the LightOutput data set, we can see that the design is balanced by choosing <strong>Stat</strong> > <strong>Tables</strong> > <strong>Cross Tabulation and Chi-Square</strong>, and then completing the dialog as shown below:</p>
<p><img alt="1" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/257cdc7599caad0b5d651415227e69e0/capture.PNG" style="width: 718px; height: 346px;" /></p>
<p>Because there are 3 observations for every combination of Temperature and GlassType, this design is balanced.</p>
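<p>Outside Minitab, the same balance check amounts to counting observations per combination of factor levels. Here is a minimal Python sketch; the rows below only mimic the layout of the design (3 temperatures, 3 glass types, 3 observations each) and are not read from the actual worksheet.</p>

```python
from collections import Counter
from itertools import product


def is_balanced(rows):
    """Return True if every combination of factor levels has the same count.

    Each row is a (temperature, glass_type) pair. Combinations that never
    occur are counted as zero, so a missing cell makes the design unbalanced.
    """
    temperatures = sorted({t for t, _ in rows})
    glass_types = sorted({g for _, g in rows})
    counts = Counter(rows)
    cell_sizes = {counts[(t, g)] for t, g in product(temperatures, glass_types)}
    return len(cell_sizes) == 1


# Layout only: 3 observations for each of the 9 factor-level combinations.
balanced = [(t, g) for t in (100, 125, 150) for g in (1, 2, 3) for _ in range(3)]

# Recode one observation from 125 to 100 to break the balance.
unbalanced = list(balanced)
unbalanced[unbalanced.index((125, 1))] = (100, 1)

print(is_balanced(balanced))
print(is_balanced(unbalanced))
```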
<p>We can fit a model to this data by choosing <strong>Stat</strong> > <strong>ANOVA</strong> > <strong>General Linear Model</strong> > <strong>Fit General Linear Model</strong>, and then completing that dialog box as shown below and clicking <strong>OK</strong>:</p>
<p><img border="0" height="313" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/eafcbac4acb154ce356c5776af0d4d31/eafcbac4acb154ce356c5776af0d4d31.png" width="411" /></p>
<p>Now that we have a model for this data, we can obtain a main effect plot based on the least-squares model by choosing <strong>Stat</strong> > <strong>ANOVA </strong>> <strong>General Linear Model</strong> > <strong>Factorial Plots</strong> (NOTE: The Factorial Plots option will not be available until a model is fit, because these graphs are based on the model). Click <strong>OK</strong> in the dialog box below to accept the defaults and generate the main effects plot:</p>
<p><img alt="2" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/4a0e60d8b48b00c90fbc06e84903d7c7/capture.PNG" style="width: 766px; height: 350px;" /></p>
Calculating Main Effects for Balanced Designs
<p>Again, the fitted means in the main effects plot above are the same as the previous data means plot because this is a balanced design. In this case, the answer is the same, but Minitab obtained these results by finding the fitted value for every possible combination of factor levels. The following steps illustrate what Minitab is doing automatically, behind the scenes:</p>
<ol>
<li>
<p>To obtain these fitted values, after the model has already been fit to the data, type all possible combinations of factor levels into the worksheet as shown below, and then use <strong>Stat</strong> > <strong>ANOVA</strong> > <strong>General Linear Model</strong> > <strong>Predict</strong>, and enter the two columns with all possible combinations:</p>
<p><img border="0" height="358" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/defee6b9e6ca9abe3913e934ff651ec1/defee6b9e6ca9abe3913e934ff651ec1.png" width="626" /></p>
</li>
<li>
<p>Click <strong>OK</strong> in the dialog box above to store the results in the worksheet.</p>
</li>
<li>
<p>Now use <strong>Stat</strong> > <strong>Basic Statistics</strong> > <strong>Store Descriptive Statistics</strong> twice: once to get the means of the fits calculated in step 2 for Temperature, and a second time to get the means of the fits for GlassType:</p>
</li>
</ol>
<p><img alt="3" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/87550eb6425b78d325f17cf2d4f5edf4/capture.PNG" style="width: 731px; height: 328px;" /></p>
<p>The results show the same means calculated in the fitted means main effects plot:</p>
<p><img border="0" height="115" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/cc634194e4ae1a338fa9a0d488a23d94/cc634194e4ae1a338fa9a0d488a23d94.png" width="262" /></p>
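<p>The steps above can be sketched in a few lines of Python. This is an illustration under two assumptions of mine: the model includes the interaction term, so the fit for each combination of factor levels is simply that cell's mean, and the response values are invented rather than taken from the LightOutput worksheet.</p>

```python
from statistics import mean


def data_means(observations, factor_index):
    """Raw mean of the response at each level of one factor."""
    groups = {}
    for levels, response in observations:
        groups.setdefault(levels[factor_index], []).append(response)
    return {level: mean(values) for level, values in groups.items()}


def fitted_means(observations, factor_index):
    """Average the cell fits (cell means here) at each level of one factor."""
    cells = {}
    for levels, response in observations:
        cells.setdefault(levels, []).append(response)
    cell_fits = {levels: mean(values) for levels, values in cells.items()}
    groups = {}
    for levels, fit in cell_fits.items():
        groups.setdefault(levels[factor_index], []).append(fit)
    return {level: mean(fits) for level, fits in groups.items()}


# Invented data: ((temperature, glass_type), response), 2 observations per cell.
balanced = [
    ((100, 1), 10.0), ((100, 1), 12.0), ((100, 2), 20.0), ((100, 2), 22.0),
    ((125, 1), 30.0), ((125, 1), 32.0), ((125, 2), 40.0), ((125, 2), 44.0),
]
# Drop one observation so the cells no longer have equal counts.
unbalanced = [row for row in balanced if row != ((100, 2), 22.0)]

print(data_means(balanced, 0), fitted_means(balanced, 0))
print(data_means(unbalanced, 0), fitted_means(unbalanced, 0))
```

<p>With equal cell counts the two kinds of means agree. Once a cell loses an observation, the data mean weights each observation equally while the fitted mean weights each cell equally, so they drift apart.</p>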
Unbalanced Designs
<p>Now let’s take a look at what happens in an unbalanced design, where there are an unequal number of observations per factor/level combination.</p>
<p>First, we’ll need to modify the worksheet to make the design unbalanced. Recall that this data set includes 3 observations per combination of factor levels. To make the design unbalanced, I’m changing the second row of data in the Temperature column. The original value there was 125, and I’ve changed that to 100:</p>
<p><img border="0" height="123" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/d311fe6fb278847ba8497bd12ab93083/d311fe6fb278847ba8497bd12ab93083.png" width="286" /></p>
<p>With the data modified as shown above, we can use <strong>Stat</strong> > <strong>Tables</strong> > <strong>Cross Tabulation and Chi-Square</strong> again to see that the design is unbalanced:</p>
<p><img border="0" height="191" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/078deb10e154741f9c08a2962985da49/078deb10e154741f9c08a2962985da49.png" width="350" /></p>
Calculating Main Effects for Unbalanced Designs
<p>Now let’s fit a model to this data using <strong>Stat</strong> > <strong>ANOVA</strong> > <strong>General Linear Model</strong> > <strong>Fit General Linear Model</strong>. This time, click the <strong>Results</strong> button and use the drop-down list next to <strong>Coefficients</strong> to select <strong>Full set of coefficients</strong>, then click <strong>OK</strong> in each dialog. Our results are different. If we generate new factorial plots using the new model, we can see that some of these fitted means are different than those in the balanced model:</p>
<p><img border="0" height="286" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/a8c6b31bbd9947b9e37c7b6aea4a9281/a8c6b31bbd9947b9e37c7b6aea4a9281.png" width="438" /></p>
<p>We can calculate the fitted means of the main effects in the same way as we calculated them for the balanced case, or we can see the same results by looking at the full table of coefficients:</p>
<p><img border="0" height="196" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/c6f927ab025d5da0aa0c3639568eef46/c6f927ab025d5da0aa0c3639568eef46.png" width="386" /></p>
<p>The fitted mean in the main effects plot for temperature at 100 is calculated by adding the coefficient for temperature at 100 to the constant: 957.3 + (-349.5) = 607.8 (rounded). For temperature at 125, we add 957.3 + 111.5 = 1068.8, and so forth.</p>
<p>If you’ve enjoyed this post and would like to learn more, check out our other blog posts related to <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis?blog_id=079e7c3c-16b0-49f1-b495-d019505642f3&search_terms=anova&button-submit.x=0&button-submit.y=0">ANOVA</a>.</p>
Statistics
Wed, 03 Feb 2016 13:00:00 +0000
http://blog.minitab.com/blog/marilyn-wheatleys-blog/anova-data-means-and-fitted-means-balanced-and-unbalanced-designs
Marilyn Wheatley
How to Analyze Like a Citizen Data Scientist in Flint
http://blog.minitab.com/blog/statistics-and-quality-improvement/how-to-analyze-like-a-citizen-data-scientist-in-flint
<p><img alt="The Citizen's Bank Weather Ball in Flint, Michigan" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/f0a4660c3136750443aede3c2be41c52/6109589699_98d685d0d5_z.jpg" style="width: 200px; height: 133px; float: right; border-width: 1px; border-style: solid; margin: 10px 15px;" />If you follow the news in the United States then you’ve heard that there’s a water crisis in Flint, Michigan. Although there’s going to continue to be debate about how much ethics played a role in the data collection practices, it’s worthwhile to at least be ready to perform the correct analysis on the data when you have it. Here’s how you can use Minitab to be like a citizen data scientist in Flint, and see for yourself what the data indicate.</p>
<p>Let’s start with the Environmental Protection Agency’s (EPA) <a href="http://www.epa.gov/dwreginfo/lead-and-copper-rule">Lead and Copper Rule</a>. The EPA says that a water system needs to act when “lead concentrations exceed an action level of 15 ppb” in more than 10% of samples. The statistic that identifies the highest 10% of the samples is called the 90th percentile.</p>
<p><a href="http://www.ecfr.gov/cgi-bin/text-idx?SID=531617f923c3de2cbf5d12ae4663f56d&mc=true&node=sp40.23.141.i&rgn=div6#se40.23.141_186">The applicable Code of Federal Regulations</a> (CFR) does not prescribe a random sample to characterize the entire water system. Instead, the CFR suggests that those who administer the water system should select sampling sites based on the likelihood of contamination. In particular, those who administer the system should prefer sampling sites that meet either or both of these criteria:</p>
<p style="margin-left:.5in;">(i) Contain copper pipes with lead solder installed after 1982 or contain lead pipes; and/or</p>
<p style="margin-left:.5in;">(ii) Are served by a lead service line.</p>
<p>Clearly, we are not dealing with a random sample—that's because the goal is not to characterize the entire system, but to better understand the worst contamination risks. In this context we're characterizing only the sites that we sample, which we suspect contain the highest lead results in the system. The CFR suggests taking samples from at least 60 sites for a system the size of Flint’s.</p>
<p>The <a href="http://flintwaterstudy.org/2015/12/complete-dataset-lead-results-in-tap-water-for-271-flint-samples/" target="_blank">data we’ll work with</a> was collected through an effort organized by <a href="http://flintwaterstudy.org/about-page/about-us/" target="_blank">an independent research team at Virginia Tech</a>. The data contain 271 samples from 269 different locations, which exceeds the minimum recommended sample size. Because we’re looking for the 90th percentile, what we do isn’t very different from counting down 271/10 ≈ 27 data points from the maximum. The CFR references the use of “first draw” tap samples, so we’ll pay attention to that column in the Virginia Tech data.</p>
A Quick Calculation of the 90th Percentile
<p>Once the data’s in <a href="http://www.minitab.com/products/minitab">Minitab Statistical Software</a>, the fastest way to calculate the 90th percentile is with Minitab’s calculator. Try this:</p>
<ol>
<li>Choose <strong>Calc > Calculator</strong>.</li>
<li>In <strong>Store result in variable</strong>, enter <em>90th percentile</em>.</li>
<li>In <strong>Expression</strong>, enter <em>percentile ('PB Bottle 1 (ppb) – First Draw', 0.9)</em>. Click <strong>OK.</strong></li>
</ol>
<p>Minitab stores the value 26.944. Because this value is greater than 15, you are now ready to make <a href="http://flintwaterstudy.org/information-for-flint-residents/results-for-citizen-testing-for-lead-300-kits/" target="_blank">strongly-worded statements urging people to take measures to protect themselves from lead exposure</a>.</p>
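<p>For anyone without Minitab at hand, the calculator steps above can be approximated in Python. This sketch assumes the percentile sits at position p(n + 1) in the sorted data with linear interpolation between neighbors; that is one common definition, and I'm treating its match to the calculator's rule as an assumption. The lead values below are made up and are not the Flint data.</p>

```python
def percentile(values, p):
    """Percentile at position p * (n + 1) in the sorted data, interpolated."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    ordered = sorted(values)
    n = len(ordered)
    position = p * (n + 1)      # 1-based position in the sorted data
    lower = int(position)       # integer part of the position
    fraction = position - lower
    if lower < 1:
        return ordered[0]
    if lower >= n:
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


def exceeds_action_level(values, action_level=15.0, p=0.9):
    """True if the 90th percentile is above the action level (15 ppb by default)."""
    return percentile(values, p) > action_level


# Made-up first-draw lead results in ppb, for illustration only.
sample_ppb = [1.2, 2.0, 2.5, 3.1, 4.0, 5.5, 6.3, 8.0, 11.0, 14.0,
              16.5, 21.0, 27.0, 40.0, 2.2, 3.3, 1.0, 0.8, 7.7]
print(percentile(sample_ppb, 0.9))
print(exceeds_action_level(sample_ppb))
```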
Communicating the 90th Percentile on a Graph
<p>But if you’re really going to communicate your results, it’s nice to have a graph available. A simple bar chart might do:</p>
<p><img alt="Bar chart of the actual 90th percentile and the action limit." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/44a161f0fc39a3b030b9895a11313c1f/bar_chart.png" style="border-width: 0px; border-style: solid; width: 576px; height: 384px;" /></p>
<p>However, you can show the data in more detail with a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/graphs/graphs-of-distributions/histograms/histogram/">histogram</a>.</p>
<ol>
<li>Choose <strong>Graph > Histogram</strong>.</li>
<li>Select <strong>Simple</strong>. Click <strong>OK</strong>.</li>
<li>In <strong>Graph variables</strong>, enter <em>'PB Bottle 1 (ppb) – First Draw'</em>.</li>
<li>Click <strong>Scale</strong>.</li>
<li>Select the <strong>Reference Lines</strong> tab.</li>
<li>In <strong>Show reference lines at data values</strong>, enter <em>15 26.9</em>. Click <strong>OK</strong> twice.</li>
</ol>
<p><img alt="Histogram showing the 90th percentile exceeds the action limit of 15 parts per billion." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/a6d9b14bf5031621ac62f922b0d68466/histogram.png" style="border-width: 0px; border-style: solid; width: 576px; height: 384px;" /></p>
<p>Histograms divide the sample values into intervals called bins. The height of each bar represents the number of observations that fall in that bin: the taller the bar, the more observations in that interval. The reference lines on the graph show the action limit for the 90th percentile and the actual value of the 90th percentile. This graph shows that the action limit is exceeded.</p>
Gather Your Data
<p>In April of 2015, then-mayor of Flint Dayne Walling reported that he and his family “drink and use the Flint water everyday, at home, work, and schools.” It’s easy for me to believe that the mayor’s personal experience with water that was not dangerous affected his judgment about the situation. The zip code for the mayor’s office in Flint is 48502. The news bureau for WNEM TV 5, <a href="http://www.wnem.com/story/29511581/flints-mayor-drinks-water-from-tap" target="_blank">one place where Mayor Walling drank tap water on TV</a>, is in the same zip code. The citizen data scientists who analyzed the Flint data knew that the geographically-limited sample being shown on TV and Twitter wasn't good enough. Instead, they collected data from 269 different locations around Flint and found that lead was a serious problem.</p>
<p>Of course, collecting that data was no small task: the data scientists estimate that gathering, preparing, and analyzing water samples ended up costing about $180,000, not including volunteer labor. If you’d like to donate towards offsetting the costs and future efforts, check out the <a href="http://flintwaterstudy.org/2016/01/the-flintwaterstudy-research-support-fundraiser/" target="_blank">Flint Water Study Research Support Fundraiser</a>.</p>
<p>If you’d like to support residents in Flint, consider volunteering for or contributing to the <a href="http://www.unitedwaygenesee.org/civicrm/contribute/transact?reset=1&id=5" target="_blank">United Way of Genesee County’s Flint Water Fund</a> which “has sourced more than 11,000 filters systems and 5,000 replacement filters, ongoing sources of bottled water to the Food Bank of Eastern Michigan and also supports a dedicated driver for daily distribution.”</p>
<p>The attention brought to Flint <a href="http://www.theguardian.com/environment/2016/jan/22/water-lead-content-tests-us-authorities-distorting-flint-crisis" target="_blank">has called into question the water testing done in other municipalities in the United States</a>. If you’re concerned about the potential for lead in your own water, the EPA notes that <a href="http://www.epa.gov/lead/protect-your-family#testdw" target="_blank">lead testing kits are available in home improvement stores</a> that can be sent to laboratories for analysis.</p>
<p>The citation for the referenced data set is: FlintWaterStudy.org (2015)<strong> “Lead Results from Tap Water Sampling in Flint, MI during the Flint Water Crisis.”</strong> This link provides the data as a Minitab worksheet: <a href="https://app.compendium.com/api/post_attachments/3d9b8ce9-c0ce-45ed-a759-3da70816d238/view">lead_results_from_tap_water_sampling_in_flint__mi_during_the_flint_water_crisis.MTW</a></p>
<p><em>The image of the Citizen's Bank Weather Ball is by the <a href="https://www.flickr.com/photos/michigancommunities/6109589699">Michigan Municipal League</a> and is licensed under <a href="https://creativecommons.org/licenses/by-nd/2.0/">this Creative Commons License</a></em>.</p>
Statistics in the NewsMon, 01 Feb 2016 13:00:00 +0000http://blog.minitab.com/blog/statistics-and-quality-improvement/how-to-analyze-like-a-citizen-data-scientist-in-flintCody SteeleWhy Does Cpk Change When I Sort my Data?
http://blog.minitab.com/blog/michelle-paret/why-does-cpk-change-when-i-sort-my-data
<p style="line-height: 20.8px;">If you need to assess process performance relative to some specification limit(s), then <a href="http://blog.minitab.com/blog/statistics-in-the-field/learning-process-capability-analysis-with-a-catapult-part-1">process capability</a> is the tool to use. You collect some accurate data from a stable process, enter those measurements in Minitab, and then choose <strong>Stat > Quality Tools > Capability Analysis/Sixpack</strong> or <strong>Assistant > Capability Analysis</strong>.</p>
<p style="line-height: 20.8px;">Now, what about sorting the data? I’ve been asked “why does Cpk change when I sort my data?” many times during my years at Minitab, so if you’ve wondered the same thing, here’s your answer.</p>
<p style="line-height: 20.8px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/a07bd7ec690a1381437a5f51e344b5fb/data_plus_sorted.jpg" style="line-height: 20.8px; float: right; width: 250px; height: 256px; margin: 10px 15px;" /></p>
From Soap to Standard Deviations
<p style="line-height: 20.8px;">Suppose you work for a company that manufactures bars of soap. Each bar should weigh between 3.2 and 5.2 ounces. To conduct the study, you randomly select 5 bars of soap eve<span style="line-height: 1.6;">ry hour from the production line and weigh them.</span></p>
<p style="line-height: 20.8px;"><span style="line-height: 1.6;">You can see from the spreadsheet that at 9 a.m. on February 1</span><span style="line-height: 1.6;">, the 5 bars weighed in at 3.743, 4.447, 4.009, 4.252 and 3.973 ounces. These 5 measurements make up our first </span><a href="http://blog.minitab.com/blog/quality-data-analysis-and-statistics/a-rational-look-at-subgrouping" style="line-height: 1.6;">subgroup</a><span style="line-height: 1.6;">. For our second </span><span style="line-height: 1.6;">subgroup, we have the 5 bar weights corresponding to 10 a.m., and 11 a.m. data for the third </span><span style="line-height: 1.6;">subgroup, and so on.</span></p>
<p style="line-height: 20.8px;">To calculate Cpk, Minitab first computes the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/standard-deviation-variance-and-the-normal-distribution/pooled-sd/">pooled standard deviation</a>. Without getting into the specific mathematics of the pooled standard deviation formula, you can basically think of it as the <em>average of all of the subgroup standard deviations</em>. In other words, if we calculate the standard deviation for rows 1-5 (subgroup #1), then the standard deviation for rows 6-10 (subgroup #2), then the standard deviation for rows 11-15 (subgroup #3), etc., and then calculate the average of those standard deviations, we (more or less) arrive at the pooled standard deviation.</p>
<p style="line-height: 20.8px;">Therefore, the pooled standard deviation only accounts for the variability <em>within</em> subgroups—it does not include the shift and drift <em>between </em>them. If you want to account for all of the variability across all of the data, then you should look at the overall standard deviation, and use <a href="http://blog.minitab.com/blog/michelle-paret/process-capability-statistics-cpk-vs-ppk">Ppk rather than Cpk</a>.</p>
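<p>For readers who want to see the difference between the two standard deviations, the following Python sketch computes a pooled (within-subgroup) standard deviation and an overall standard deviation, then plugs each into the usual capability formula. Only the first subgroup comes from the example above; the second and third subgroups are invented to show a shift between subgroups. The pooled estimate here is the plain degrees-of-freedom-weighted version, so Minitab's exact figure may differ (for example, if an unbiasing constant is applied).</p>

```python
import math
import statistics

LSL, USL = 3.2, 5.2  # specification limits from the soap example

subgroups = [
    [3.743, 4.447, 4.009, 4.252, 3.973],  # 9 a.m. subgroup, from the text
    [4.6, 4.9, 4.5, 4.8, 4.7],            # hypothetical 10 a.m. subgroup
    [3.5, 3.4, 3.7, 3.6, 3.3],            # hypothetical 11 a.m. subgroup
]


def pooled_sd(groups):
    """Square root of the degrees-of-freedom-weighted average of subgroup variances."""
    numerator = sum((len(g) - 1) * statistics.variance(g) for g in groups)
    denominator = sum(len(g) - 1 for g in groups)
    return math.sqrt(numerator / denominator)


def capability(mean, sd, lsl, usl):
    """Distance from the mean to the nearer spec limit, in units of 3 standard deviations."""
    return min(usl - mean, mean - lsl) / (3 * sd)


all_values = [v for g in subgroups for v in g]
mean = statistics.fmean(all_values)

within_sd = pooled_sd(subgroups)            # ignores shifts between subgroups
overall_sd = statistics.stdev(all_values)   # includes shifts between subgroups

cpk = capability(mean, within_sd, LSL, USL)
ppk = capability(mean, overall_sd, LSL, USL)
print(f"within SD={within_sd:.3f}  overall SD={overall_sd:.3f}  Cpk={cpk:.2f}  Ppk={ppk:.2f}")
```

<p>Because the invented subgroups drift up and then down, the overall standard deviation is larger than the pooled one, and Ppk comes out lower than Cpk.</p>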
The Sordid Details
<p style="line-height: 20.8px;">Now let’s sort this data from smallest to largest and see what happens to the pooled standard deviation and Cpk. If we calculate the subgroup standard deviations for the <em>sorted</em> rows 1-5, then 6-10, then 11-15, etc., we’re going to arrive at much smaller values than the original subgroup standard deviations because we’ve minimized the variability within each subgroup. And the smaller the subgroup standard deviations, the smaller the pooled standard deviation, and thus the larger the Cpk statistic.</p>
<p style="line-height: 20.8px;">If we look at the original, unsorted soap weights and run capability analysis, we get a pooled (also known as “within”) standard deviation of <strong>0.352</strong> and a Cpk of <strong>0.80</strong>. And if we re-run the analysis on the sorted soap weights, we arrive at a pooled standard deviation of <strong>0.033</strong> and a Cpk of <strong>8.61</strong>. Those are two completely different sets of results! The original (and accurate) Cpk is below the 1.33 rule of thumb, while the Cpk from the sorted data is far larger than 1.33.</p>
<p style="line-height: 20.8px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/f6e050b3dc004716cb575cec1251b4a7/individual_value_plot_of_weights_w1024.jpeg" style="width: 400px; height: 263px;" /><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/b302a36e9ce7e7d6ee412f01aee35755/individual_value_plot_of_sorted_weights_w1024.jpeg" style="margin-left: 20px; margin-right: 20px; width: 400px; height: 263px;" /></p>
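<p>The same effect is easy to reproduce outside Minitab. This Python sketch simulates soap weights (the mean of 4.2 and standard deviation of 0.35 are assumptions chosen to resemble the example, not the actual data set), forms subgroups of 5 from consecutive rows, and computes a Cpk-style statistic before and after sorting. It uses a simple pooled standard deviation without unbiasing constants, so treat the numbers as illustrative only.</p>

```python
import math
import random
import statistics

LSL, USL, SUBGROUP_SIZE = 3.2, 5.2, 5


def pooled_sd(groups):
    """Pooled SD: sqrt of the degrees-of-freedom-weighted mean of subgroup variances."""
    numerator = sum((len(g) - 1) * statistics.variance(g) for g in groups)
    denominator = sum(len(g) - 1 for g in groups)
    return math.sqrt(numerator / denominator)


def within_sd_and_cpk(values):
    """Split consecutive rows into subgroups, then compute the pooled SD and Cpk."""
    groups = [values[i:i + SUBGROUP_SIZE] for i in range(0, len(values), SUBGROUP_SIZE)]
    sd = pooled_sd(groups)
    mean = statistics.fmean(values)
    return sd, min(USL - mean, mean - LSL) / (3 * sd)


random.seed(1)  # fixed seed so the run is repeatable
weights = [random.gauss(4.2, 0.35) for _ in range(100)]  # simulated, not real data

original_sd, cpk_original = within_sd_and_cpk(weights)
sorted_sd, cpk_sorted = within_sd_and_cpk(sorted(weights))

print(f"original: within SD={original_sd:.3f}, Cpk={cpk_original:.2f}")
print(f"sorted:   within SD={sorted_sd:.3f}, Cpk={cpk_sorted:.2f}")
```

<p>Sorting puts neighboring values into the same subgroup, so each subgroup's spread shrinks, the pooled standard deviation shrinks with it, and the Cpk grows even though the process itself has not changed.</p>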
The Moral of the Subgroup Story
<p style="line-height: 20.8px;">I hope it's now clear why we should <em>not </em>sort our data when running capability analysis. Subgroups are intended to provide information regarding the natural variability of a process at a given point in time. By sorting the data, we are looking at an inaccurate picture of the true subgroup variability, and thereby inflating Cpk to an unrealistic value.</p>
Data AnalysisLean Six SigmaQuality ImprovementSix SigmaStatisticsStatistics HelpStatsFri, 29 Jan 2016 13:00:00 +0000http://blog.minitab.com/blog/michelle-paret/why-does-cpk-change-when-i-sort-my-dataMichelle ParetWhen Should You Fit a Non-Hierarchical Regression Model?
http://blog.minitab.com/blog/adventures-in-statistics/when-should-you-fit-a-non-hierarchical-regression-model
<p>In the world of linear models, a hierarchical model contains all lower-order terms that comprise the higher-order terms that also appear in the model. For example, a model that includes the interaction term A*B*C is hierarchical if it includes these terms: A, B, C, A*B, A*C, and B*C.</p>
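<p>The definition can be expressed as a small check. The Python sketch below lists the lower-order terms implied by an interaction term and tests whether a model contains all of them. The function names are hypothetical, and the sketch assumes that terms are written with <code>*</code> between factors and that factors always appear in the same order (so <code>A*B</code> is never also written as <code>B*A</code>).</p>

```python
from itertools import combinations


def required_lower_order_terms(term):
    """List every lower-order term built from the factors of `term`."""
    factors = term.split("*")
    return [
        "*".join(combo)
        for order in range(1, len(factors))
        for combo in combinations(factors, order)
    ]


def is_hierarchical(model_terms):
    """True if every higher-order term has all of its lower-order terms in the model."""
    present = set(model_terms)
    return all(
        lower in present
        for term in model_terms
        for lower in required_lower_order_terms(term)
    )


print(required_lower_order_terms("A*B*C"))
print(is_hierarchical(["A", "B", "C", "A*B", "A*C", "B*C", "A*B*C"]))
print(is_hierarchical(["A", "B", "C", "A*C", "B*C", "A*B*C"]))  # A*B is missing
```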
<p><img alt="Minitab dialog box that asks about a non-hierarchical regression model" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/ef6ab57fd80017380b648e2246a278ff/nonhierarchical_dialog.png" style="line-height: 20.8px; float: right; width: 300px; height: 246px; margin: 10px 15px;" /></p>
<p><a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-choose-the-best-regression-model" target="_blank">Fitting the correct regression model</a> can be as much of an art as it is a science. Consequently, there's not always a best model that everyone agrees on. This uncertainty carries over to hierarchical models because statisticians disagree on their importance. Some think that you should <em>always</em> fit a hierarchical model whereas others will say it's okay to leave out insignificant lower-order terms in specific cases.</p>
<p>Beginning with <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">Minitab 17</a>, you have the flexibility to specify either a hierarchical or a non-hierarchical linear model for a variety of analyses in regression, ANOVA, and designed experiments (DOE). In the example above, if A*B is not statistically significant, why would you include it in the model? Or, perhaps you’ve specified a non-hierarchical model, have seen this dialog box, and you aren’t sure what to do?</p>
<p>In this blog post, I’ll help you decide between fitting a hierarchical or a non-hierarchical regression model.</p>
Practical Reasons to Fit a Hierarchical Linear Model
<p><strong>Reason 1: The terms are all statistically significant or theoretically important</strong></p>
<p>This one is a no-brainer—if all the terms necessary to produce a hierarchical model are statistically significant, you should probably include all of them in the regression model. However, even when a lower-order term is not statistically significant, theoretical considerations and subject area knowledge can suggest that it is a relevant variable. In this case, you should probably still include that term and fit a hierarchical model.</p>
<p>If the interaction term A*B is statistically significant, it can be hard to imagine that the main effect of A is not theoretically relevant at all even if it is not statistically significant. Use your subject area knowledge to decide!</p>
<p><strong>Reason 2: You standardized your continuous predictors or have a DOE model</strong></p>
<p>If you standardize your continuous predictors, you should fit a hierarchical model so that Minitab can produce a regression equation in <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/doe/basics/coded-units-and-uncoded-units/" target="_blank">uncoded (or natural) units</a>. When the equation is in natural units, it’s much easier to <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients" target="_blank">interpret the regression coefficients</a>.</p>
<p>If you standardize the predictors and fit a non-hierarchical model, Minitab can only display the regression equation in coded units. For an equation in coded units, the coefficients reflect the coded values of the data rather than the natural values, which makes the interpretation more difficult.</p>
<p>You should always consider a hierarchical model for DOE models because they always use standardized predictors. Starting with Minitab 17, standardizing the continuous predictors is an option for <a href="http://blog.minitab.com/blog/adventures-in-statistics/unleash-the-power-of-linear-models-with-minitab-17" target="_blank">other linear models</a>.</p>
<p>Even if you aren’t using a DOE model, this reason probably applies to you more often than you realize in the context of hierarchical models. When your model contains <a href="http://blog.minitab.com/blog/michelle-paret/evaluating-statistical-interactions-with-ketchup-and-soy-sauce" target="_blank">interaction terms</a> or <a href="http://blog.minitab.com/blog/adventures-in-statistics/curve-fitting-with-linear-and-nonlinear-regression" target="_blank">polynomial terms</a>, you have a great reason to standardize your predictors. These higher-order terms often cause high levels of multicollinearity, which can produce poorly estimated coefficients, cause the coefficients to switch signs, and sap the statistical power of the analysis. Standardizing the continuous predictors can reduce the multicollinearity and related problems that are caused by higher-order terms.</p>
<p><a href="http://blog.minitab.com/blog/adventures-in-statistics/what-are-the-effects-of-multicollinearity-and-when-can-i-ignore-them" target="_blank">Read my blog post about multicollinearity, VIFs, and standardizing the continuous predictors.</a></p>
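<p>A tiny numerical illustration of why standardizing helps with higher-order terms: in the Python sketch below, a predictor and its square are almost perfectly correlated until the predictor is centered (subtracting the mean, which is one common form of standardization). The predictor values 1 through 10 are arbitrary, and the VIF formula used, 1 / (1 - r²), applies only to the special case of a model with exactly two predictors.</p>

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(r):
    """VIF when the model has exactly two predictors with correlation r."""
    return 1 / (1 - r ** 2)


x = list(range(1, 11))                     # arbitrary predictor values
raw_r = pearson(x, [v ** 2 for v in x])    # x versus x squared

mean_x = sum(x) / len(x)
centered = [v - mean_x for v in x]         # subtract the mean
centered_r = pearson(centered, [v ** 2 for v in centered])

print(f"raw:      r={raw_r:.3f}, VIF={vif_two_predictors(raw_r):.1f}")
print(f"centered: r={centered_r:.3f}, VIF={vif_two_predictors(centered_r):.1f}")
```

<p>With the raw values, the VIF is well above the threshold of 5 mentioned in this post; after centering, the linear and squared terms are essentially uncorrelated.</p>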
Why You Might <em>Not </em>Want to Fit a Hierarchical Linear Model
<p>Models that contain too many terms can be relatively imprecise and can have a lessened ability to predict the values of new observations.</p>
<p>Consequently, if the reasons to fit a hierarchical model do not apply to your scenario, you can consider removing lower-order terms if they are not statistically significant.</p>
Discussion
<p>In my view, the best time to fit a non-hierarchical regression model is when a hierarchical model forces you to include many terms that are not statistically significant. Your model might be more precise without these extra terms.</p>
<p>However, keep an eye on the VIFs to assess multicollinearity. VIFs greater than 5 indicate that multicollinearity might be causing problems. If the VIFs are high, you may want to standardize the predictors, which can tip the balance towards fitting a hierarchical model. On the other hand, removing the interaction terms that are not significant can also reduce the multicollinearity.</p>
<p style="margin-left: 40px;"><img alt="Minitab output that shows the VIFs" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/65965977d6a269db735ad90aa7d5a079/vif_illustration.png" style="width: 386px; height: 152px;" /></p>
<p>You can fit the hierarchical model with standardization first to determine which terms are significant. Then, fit a non-hierarchical model without standardization and check the VIFs to see if you can trust the coefficients and p-values. You should also <a href="http://blog.minitab.com/blog/adventures-in-statistics/why-you-need-to-check-your-residual-plots-for-regression-analysis" target="_blank">check the residual plots</a> to be sure that you aren't introducing a bias by removing the terms.</p>
<p>Keep in mind that some statisticians believe you should always fit a hierarchical model. Their rationale, as I understand it, is that a lower-order term provides more basic information about the shape of the response function and a higher-order term simply refines it. This approach has more of a theoretical basis than a mathematical basis. It is not problematic as long as you don’t include too many terms that are not statistically significant.</p>
<p>Unfortunately, there is not always a clear-cut answer to the question of whether you should fit a hierarchical model. I hope this post at least helps you sort through the relevant issues.</p>
Design of ExperimentsRegression AnalysisStatistics HelpWed, 27 Jan 2016 13:00:00 +0000http://blog.minitab.com/blog/adventures-in-statistics/when-should-you-fit-a-non-hierarchical-regression-modelJim FrostArea Graphs: An Underutilized Tool
http://blog.minitab.com/blog/starting-out-with-statistical-software/area-graphs-an-underutilized-tool
<p>In my time at Minitab, I’ve gotten a good understanding of what types of graphs users create. Everyone knows about histograms, bar charts, and time series plots. Even relatively less familiar plots like the interval plot and <span><a href="http://blog.minitab.com/blog/understanding-statistics/trouble-starting-an-analysis-graph-your-data-with-an-individual-value-plot">individual value plot</a></span> are still used quite often. However, one of the most underutilized graphs we have available is the area graph. If you’re not familiar with an Area Graph, here’s the example from the Minitab help menu of what it looks like:</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/732ead34-1005-4470-b034-d7f8b87fabcf/Image/67c9fc3399dc4a8a5c72d2ace452db62/areagraph1.png" style="width: 366px; height: 245px;" /></p>
<p>As you can see, an area graph is a great way to view multiple time series trends in one plot, especially if those series form parts of one whole. Anytime you are interested in multiple series that make up a whole, an area graph can do the job. You could use it to show enrollment rates by gender, precipitation rates by county, population totals by city, etc.</p>
<p>I’m going to show you how to go about creating one in <a href="http://www.minitab.com/products/minitab">Minitab</a>. First, we need to put our data in our worksheet. For this graph, we need each of the series, or sections, in a separate column. An additional constraint on this graph is that we need all of the columns to be of equal length, so be sure that’s the case. In our example we will use <a href="//cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/8ad6bf30e2b5eed510c2bd1e19f52e1c/areagraphblogdata.mtw">sales data</a> from different regional branches, and show that an area graph can be an improvement over a simple time series plot.</p>
<p>Once it’s in your worksheet, we can go to <strong>Graph > Time Series Plot</strong>, and look at the data in a basic time series plot. As you can see, there are a few challenges with interpreting this plot. </p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/732ead34-1005-4470-b034-d7f8b87fabcf/Image/dc1239167766d00e1eeea7a8e1227414/areagraph2.png" style="width: 577px; height: 385px;" /></p>
<p>First, the plot looks extremely messy. While it gives a good look at the sales from the individual branches, it is very hard to track an individual branch through time. And it’s not much better to look at 4 (or more) separate individual plots, because it then makes it harder to compare. Additionally, when you make separate plots, an important piece of information is lost: total sales. For example, in August, Philadelphia, London, and Seattle had a total sales increase, while New York had its worst month of the year. Was this an overall gain or overall loss? We can’t really tell from individual plots. </p>
<p>Instead, let’s look at an Area Graph. You can find this by going to <strong>Graph > Area Graph</strong>, and entering the series the same way as we did the time series plot. Take a look at our output below:</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/732ead34-1005-4470-b034-d7f8b87fabcf/Image/f5f5d94367127534af22a16b9dc364d1/areagraph3.png" style="width: 577px; height: 385px;" /></p>
<p>For starters, it looks much cleaner. We are able to see clear trends in the overall pattern. We can see that overall sales spiked in August, answering our question from above. We can use this to evaluate trends in multiple series, <em>as well as</em> the contribution of each series to the total quantity. We get all the information about total sales month-to-month, as well as the individual series for each location, in one plot, instead of in the messy, hard-to-read Time Series plot we created first.</p>
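<p>Under the hood, a stacked area graph is just a set of running totals: each band sits on top of the bands below it, and the top edge of the last band is the overall total. The Python sketch below shows that calculation. The branch names match the example, but the sales figures are invented for illustration and are not the values in the worksheet.</p>

```python
# Hypothetical monthly sales for three consecutive months (NOT the worksheet data).
sales = {
    "Philadelphia": [10, 12, 15],
    "London": [8, 9, 11],
    "Seattle": [5, 5, 7],
    "New York": [20, 18, 12],
}


def stack_series(series_by_name):
    """Return the top edge of each band: a running total across the series."""
    lengths = {len(values) for values in series_by_name.values()}
    if len(lengths) != 1:
        raise ValueError("all series must have the same length")
    running = [0] * lengths.pop()
    boundaries = {}
    for name, values in series_by_name.items():
        running = [total + value for total, value in zip(running, values)]
        boundaries[name] = running
    return boundaries


boundaries = stack_series(sales)
totals = boundaries["New York"]  # last band, so its top edge is the overall total

for name, edge in boundaries.items():
    print(f"{name:<13} top edge: {edge}")
print("total sales per month:", totals)
```

<p>In this made-up data, New York's sales fall while the other branches rise, yet the top edge still climbs each month, which is exactly the kind of overall gain or loss that separate plots hide.</p>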
<p>Next time you need to evaluate multiple series together, consider taking a look at the Area Graph to get a cleaner picture of your data!</p>
Data AnalysisStatisticsStatsMon, 25 Jan 2016 13:00:00 +0000http://blog.minitab.com/blog/starting-out-with-statistical-software/area-graphs-an-underutilized-toolEric Heckman3-Point Shooting in the NBA: Long-Term Trend or Fad?
http://blog.minitab.com/blog/the-statistics-game/3-point-shooting-in-the-nba%3A-long-term-trend-or-fad
<p><span style="line-height: 20.8px;">Any time you see a process changing, it's important to determine why. Is it indicative of a long term trend, or is it a fad that you can ignore since it will be gone shortly? </span></p>
<p>For example, in the 2014 NBA Finals, the San Antonio Spurs beat the two-time defending champion Miami Heat by attempting more 3-pointers (23.6 per game) than any championship team in league history. In the 2015 regular season, the Golden State Warriors <em>made </em>more 3-pointers than any NBA team even <em>attempted </em>from 1980-1988. And this season Steph Curry, <em>by himself</em>, has attempted more 3-pointers than the average NBA team attempted from 1980-1994.</p>
<p><span style="line-height: 1.6;">As I said, when you see a process changing, it's important to determine why. Are you seeing a </span><a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/time-series/time-series-models/trend-models-in-time-series-analysis/" style="line-height: 1.6;">long-term trend</a><span style="line-height: 1.6;">, or is it a soon-to-fade fad? If it's the former, you don't want to be left behind as everybody else moves on without you. But if it's the latter, you don't want to waste time and money changing your entire process for something that won't help you in the long run.</span></p>
<p>Of course, this applies outside the world of sports, too. Whether you're trying to remove defects from your process, determine how the market for your product is changing, or develop the best strategy for your basketball team, it's always good to know all the details on the changes going on around you. So let's see if the increased use of the 3-pointer in the NBA is here to stay, or if it is a fad that might fade away once Steph Curry leaves the league.</p>
The History of the NBA 3-Pointer
<p>The NBA introduced the 3-pointer in the 1979-80 season. At first it was considered a "gimmick" and wasn't heavily used. But as time went on, teams became more and more reliant on the 3-point shot. In fact, the number of 3-point attempts per game has increased from 2.8 in 1980 to 23.7 in 2016!</p>
<p><img alt="Time Series Plot" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/463dd1f06544f1739e3e96a24dd7590e/time_series_plot_3pt_att_per_game.jpg" style="width: 576px; height: 384px;" /></p>
<p>This increase in the 3-point shot isn't some new fad. It's actually been going on since the 3-pointer was introduced to the league! (The bump you see from 1995-97 resulted from the NBA shortening the 3-point line before reverting the line to its original distance in 1998.) Now, with the success Golden State has had in implementing a strategy that emphasizes 3-point shots, it's likely that other teams will follow suit and increase the number of 3-pointers per game even further in the coming years.</p>
<p>So what is the driving force behind this increase? Well, it's just simple math! Since 1980, teams have pretty consistently made about 48% of their 2-point shots. That means when you shoot a 2-point shot, your expected points are 0.48 * 2 = 0.96. Now, this number is actually a little higher, since it doesn't include times you're fouled shooting a 2-point shot (which happens much more often than being fouled shooting a 3-pointer), and you get to shoot resulting free throws. So let's just call the number of expected points "1" to make the math easy.</p>
<p>If you can expect to score 1 point every time you shoot a 2-pointer, you would need to make at least 33% of your 3-pointers to have the same expected value. So do NBA shooters consistently shoot above 33% on their 3-pointers? I used Minitab Statistical Software to create the following <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/time-series-plots-theres-gold-in-them-thar-hills">time series plot</a> of the data:</p>
<p><img alt="Time Series Plot" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/d37c297c18e594d8d5ab8c8ad6cef64b/time_series_plot_of_3p_.jpg" style="width: 576px; height: 384px;" /></p>
<p>We see that it took a while for NBA players to consistently make more than 33% of their 3-point shots. Coaches were actually correct in not using the 3-pointer too frequently in the 80s and early 90s. But since 1995, the NBA has averaged a percentage that warrants an increased use of the 3-point line. And if you want to explain the reason behind the amazing start that Golden State has been off to this season, look no further than the number of 3-pointers they attempt and the percentage they make. They average almost 30 attempts per game, and they make a ridiculous 42.4% of their 3-point attempts! You would have to make 63.6% of your 2-pointers to have the same expected number of points as the Warriors' 3-point shots! For some perspective on how hard that is, in his <em>best</em> season (2013-14), LeBron James made only 62.2% of his 2-point shots.</p>
<p>So 3-point shooting has been steadily increasing from the start, NBA players have consistently made over 33% of their 3-point shots since 1995, and now 3-point shooting has Golden State on track to have the best record in the history of the NBA. Add it all up, and there is only one conclusion:</p>
<p>3-point shooting isn't going away anytime soon.</p>
Data AnalysisFun StatisticsQuality ImprovementStatistics in the NewsFri, 22 Jan 2016 13:01:00 +0000http://blog.minitab.com/blog/the-statistics-game/3-point-shooting-in-the-nba%3A-long-term-trend-or-fadKevin RudyDavid Bowie: Look Back in Quality
http://blog.minitab.com/blog/data-analysis-and-quality-improvement-and-stuff/david-bowie%3A-look-back-in-quality
<p><span style="line-height: 1.6;">Unless you live under a black country rock, you’ve no doubt heard that the world recently lost one of the greatest artists of our time, David Bowie. My memories of the Thin White Duke go all the way back to my formative years. I recall his music echoing through the halls of our house as I crooned along whilst doing the chores. Then as now, Bowie’s creativity and energy inspired me and helped me do what I do.</span></p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/28a5b39a8cf4f8a39ab4ac69bc382204/i_bowie2.jpg" style="line-height: 20.8px; float: right; width: 279px; height: 186px; margin: 10px 15px;" /></p>
<p>Since his death, I’ve been reflecting on the many prophetic works that this prolific and visionary artist contributed to the world. In the old days, songs were released in collections called “albums.” This <span style="line-height: 1.6;">was an artifact of an inefficient and technologically unsophisticated delivery system that relied on large, unwieldy disk</span><span style="line-height: 1.6;">s that were prone to scratches, warping, and other defect modalities. But I digress. Like a true artist, Bowie ofte</span><span style="line-height: 1.6;">n used the media at hand as a vehicle for his art. </span></p>
<p><span style="line-height: 1.6;">In addition, his albums often told stories, which many different audiences have interpreted in many different ways. When I listen to Bowie, I hear</span><span style="line-height: 1.6;"> stories about life, love...and pro</span><span style="line-height: 1.6;">cess quality control.</span></p>
<p>You might be surprised to discover that David Bowie was a proponent of quality process improvement. For example, you may be familiar with one of David’s earlier classics, “The Man Who Sold the World.” But did you know that David’s original title for the album was <em>The Man Who Sold the World on the Benefits of Continuous Quality Improvement</em>? Of course, that's never been publicly acknowledged. Unfortunately, cigar-chomping executives at the record company forced him to shorten the title because, in their words, “Kids don’t dig quality improvement.” Fools.</p>
<p>Bowie’s subsequent album, <em>Hunky Dory</em>, was an ode to the happy state of affairs that can be achieved if one practices continuous quality improvement. Don’t believe me? Then I challenge you to explain why I hear these lines from the song “Changes”:</p>
<p style="margin-left: 40px;"><em>I watch the ripples change their size<br />
b<span style="line-height: 1.6;">ut never leave the value stream of warm impermanence</span></em></p>
<p>For decades I’ve struggled to understand these inscrutable lyrics, but now I realize that they are about <span><a href="http://blog.minitab.com/blog/understanding-statistics/control-chart-tutorials-and-examples">control charts</a></span>. Of course! You see, by <em>ripples</em>, David refers to the random fluctuations of varying sizes that occur naturally in any process. And he asserts that if the process is in control, then the ripples don’t wander outside of the control limits (a.k.a. the <em>stream</em>). Whilst acknowledging that such control makes us feel <em>warm </em>and fuzzy, David also reminds us that process stability is <em>impermanent </em>unless one is dedicated to continuous process improvement and control.</p>
<p>If <em>Hunky Dory</em> is an homage to quality utopia, then <em>Diamond Dogs</em> surely represents the dysphoric chronicles of a harrowing dystopia in which the pursuit of quality has been abandoned. (Fun fact: some claim the original album title was <em>Your Business Is a Diamond in the Rough; Don’t Let Quality Go to the Dogs</em>.) Perhaps jarred by the panic in Detroit, David warned us to pay careful attention to issues of quality in our economic and social institutions. And he warned of an Orwellian future in which individuals are unable to pursue and maintain quality in their organizations because they are stifled by an authoritative ‘big brother’ who gives them neither the attention nor the resources to do so effectively.</p>
<p>By the time his album <em>Station to Station</em> was released, David appeared to be feeling cautiously optimistic about improvements in the quality of quality improvements, as I am reminded every time I hear these lyrics from the song “Golden Years”:</p>
<p style="margin-left: 40px;"><em>Some of these days, and it won't be long<br />
<span style="line-height: 1.6;">Gonna’ drive back down where you once belonged</span><br />
<span style="line-height: 1.6;">In the back of a dream car twenty foot long</span><br />
<span style="line-height: 1.6;">Don't cry my sweet, don't break my heart</span><br />
<span style="line-height: 1.6;">Doing all right, but you gotta work smart</span><br />
<span style="line-height: 1.6;">Shift upon, shift upon, day upon day, I believe oh Lord</span><br />
<span style="line-height: 1.6;">I believe Six Sigma is the way</span></em></p>
<p>Some might question Bowie’s insistence on Six Sigma methodology, but I believe none would question his assertion that we must “work smart,” and that dedication to quality is absolutely essential.</p>
<p>As one final piece of evidence, I present the following quote from Bowie's song, "Starman." I personally believe this song is about a quality analyst from an advanced civilization in another galaxy. Gifted songwriter that he was, David realized that "<span style="line-height: 20.8px;">quality analyst from an advanced civilization in another galaxy</span>" was too many syllables to belt out on stage, so he used the "starman" as a metaphor. I've taken the liberty of making the substitution below; I think you'll agree, the veracity of my interpretation is inescapable. </p>
<p style="margin-left: 40px;"><em>There's a [</em>quality analyst from an advanced civilization in another galaxy<em>] waiting in the sky<br />
<span style="line-height: 1.6;">He'd like to come and meet us</span><br />
<span style="line-height: 1.6;">But he thinks he'd blow our minds<br />
There's a</span><span style="line-height: 20.8px;"> [</span></em><span style="line-height: 20.8px;">quality analyst from an advanced civilization in another galaxy</span><em><span style="line-height: 20.8px;">]</span><span style="line-height: 1.6;"> waiting in the sky<br />
He's told us not to blow it</span><br />
<span style="line-height: 1.6;">'Cause he knows it's all worthwhile</span></em></p>
<p>So, so obvious when you know what you're looking for. Kind of gives you goosebumps.</p>
<p><span style="line-height: 1.6;">I took a few moments with fellow Minitab blogger and Bowie fan, <a href="http://blog.minitab.com/blog/understanding-statistics">Eston Martz</a>, </span><span style="line-height: 1.6;">to brainstorm about what made Bowie such a monumental and influential artist. I collected our notes and created this fishbone diagram </span><span style="line-height: 20.8px;">in <a href="https://www.minitab.com/en-us/">Minitab Statistical Software</a>. This is only a partial listing </span><span style="line-height: 1.6;">of Bowie's albums, musical collaborators, personas, and topics that he covered in his music. It would take many more fish with many more bones to cover all of his artistic collaborations, movie roles, and other artistic endeavors. Thanks for the music, David, and thanks for the inspiration, past, present, and future.</span></p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/33ce1f44e4f49b825c835527527f928f/bowiefishbone.jpg" style="line-height: 1.6; width: 751px; height: 501px;" /><span style="line-height: 1.6;"> </span></p>
Fun Statistics
Wed, 20 Jan 2016 13:00:00 +0000
http://blog.minitab.com/blog/data-analysis-and-quality-improvement-and-stuff/david-bowie%3A-look-back-in-quality
Greg Fox
Bottled Water Stats & Time Series Plots
http://blog.minitab.com/blog/real-world-quality-improvement/bottled-water-stats-time-series-plots
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ccb8f6d6-3464-4afb-a432-56c623a7b437/Image/d05ec0703d25c51b9852ec72f182caa0/featured_image_bottled_water.jpg" style="float: right; width: 200px; height: 194px; margin-left: 5px; margin-right: 5px;" />Ahh, bottled water. Refreshing, convenient...and sometimes pricey. Or in my case, I should say <em>usually </em>pricey. Confession: I’m a sucker for water that comes in the “pretty” plastic bottles, and my experience is that the pretty-bottle brands are usually the pricier ones. Does bottled water cost increase with the fanciness of the bottle? Well, that could be an analysis for a different day …</p>
<p>My colleague recently shared some interesting stats about the buying and disposing of plastic bottled water containers. (Maybe she’s noticed my excessive use of “pretty” bottled waters …?)</p>
<p>According to the organization <a href="https://www.banthebottle.net/bottled-water-facts/" target="_blank">Ban the Bottle</a>, making bottles to meet America’s demand for bottled water requires more than 17 million barrels of oil annually, which is enough to fuel 1.3 million cars for a year. The organization also reports that Americans consume more than 48 billion bottles of water annually, enough bottles to circle the earth 230 times!</p>
<p>While these bottled water facts are certainly enough to convince me to scale back on my “pretty” bottled water habit, a visual representation of data in the form of a Minitab graph is also compelling.</p>
<p>Check out the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/graphs/graphs-of-time-series/time-series-plots/create-a-plot-of-overlaid-time-series/#dctm_Chron09000457802052c8/" target="_blank">overlaid time series plot</a> below that shows data published by the <a href="http://www.container-recycling.org/" target="_blank">Container Recycling Institute</a> on the number of plastic bottled water containers sold, recycled, and wasted for 1991-2013:</p>
<p><strong><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ccb8f6d6-3464-4afb-a432-56c623a7b437/Image/dec17beaaf1d62088e1e3c663e57bea1/time_series_blog.png" style="border-width: 0px; border-style: solid; width: 576px; height: 384px;" /></strong></p>
<p>You can see that bottled water sales have risen for most of the past 20+ years, although they appear to have been affected by the economic downturn in 2008. And while recycling has seen a gentle increase, it has not come even close to the volume of bottles wasted (that is, not recycled) over the years.</p>
<p>This is certainly food for thought when considering whether or not you should fork over a buck or two for a bottle of water—or in my case, $4 or $5 for the “pretty” bottled water—and whether or not you should throw those bottles in the garbage can!</p>
<p>For more on creating time series plots in Minitab, visit <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/graphs/graphs-of-time-series/time-series-plots/time-series-plot/" target="_blank">this article</a> from Minitab Support.</p>
Data Analysis
Fun Statistics
Statistics
Statistics in the News
Mon, 18 Jan 2016 13:00:00 +0000
http://blog.minitab.com/blog/real-world-quality-improvement/bottled-water-stats-time-series-plots
Carly Barry
The Minitab Blog Quiz: Test Your Stat-Smarts!
http://blog.minitab.com/blog/statistics-and-quality-data-analysis/the-minitab-blog-quiz%3A-test-your-stat-smarts
<p>How deeply has statistical content from Minitab blog posts (or other sources) seeped into your brain tissue? Rather than submit a biopsy specimen from your temporal lobe for analysis, take this short quiz to find out. <em>Each question may have more than one correct answer</em>. Good luck!</p>
<ol>
<li><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/b24ea7bc615838432626799d3551e449/figure_skating.jpg" style="float: right; width: 250px; height: 198px; border-width: 1px; border-style: solid; margin: 10px 15px;" /><strong>Which of the following are famous figure skating pairs, and which are methods for testing whether your data follow a normal distribution?</strong><br />
<br />
a. Belousova-Protopopov<br />
b. Anderson-Darling<br />
c. Kolmogorov-Smirnov<br />
d. Shen-Zhao<br />
e. Shapiro-Wilk<br />
f. Salé-Pelletier<br />
g. Ryan-Joiner<br />
<br />
<span style="color:#cc6666;">Figure skaters are <strong>a</strong>, <strong>d</strong>, and <strong>f</strong>. Methods for testing normality are <strong>b</strong>, <strong>c</strong>, <strong>e</strong>, and <strong>g</strong>. To learn about the different methods for testing normality in Minitab, click </span><a href="http://blog.minitab.com/blog/the-statistical-mentor/anderson-darling-ryan-joiner-or-kolmogorov-smirnov-which-normality-test-is-the-best" target="_blank">here</a><span style="color:#cc6666;">.</span>
<br />
<br />
</li>
<li><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/058bc3b6b11dc8493c311329f410a284/tea_leaves.jpg" style="float: right; width: 175px; height: 148px; border-width: 1px; border-style: solid; margin: 10px 15px;" /><strong>A t-value is so-named because...</strong><br />
<br />
a. Its value lies midway between the standard deviation (s) and the u-value coefficient (u).<br />
b. It was first calculated in Fisher’s famous “Lady Tasting Tea” experiment.<br />
c. It comes from a t-distribution.<br />
d. It’s the first letter of the last name of the statistician who first defined it.<br />
e. It was originally estimated by reading tea leaves.<br />
<br />
<span style="color:#cc6666;">The correct answer is <strong>c</strong>. To find out what the t-value means, read </span><a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-are-t-values-and-p-values-in-statistics" target="_blank">this post</a><span style="color:#cc6666;">.</span>
<p><br />
</p>
</li>
<li><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/5781cd0af35f27ab8315a02bc13f3f1a/sheep.jpg" style="width: 200px; height: 143px; float: right; margin: 10px 15px;" /><strong>How do you pronounce µ, the mean of the population, in English?</strong><br />
<br />
a. The way a cow sounds<br />
b. The way a kitten sounds<br />
c. The way a chicken sounds<br />
d. The way a sheep sounds<br />
e. The way a bullfrog sounds<br />
<br />
<span style="color:#cc6666;">The correct answer is <strong>b</strong>. For the English pronunciation of <strong>µ </strong>and, more importantly, to understand how the population mean differs from the sample mean, read </span><a href="http://blog.minitab.com/blog/statistics-tips-from-a-technical-trainer/tip-1-every-sample-statistic-is-a-at-least-little-bit-wrong">this post</a><span style="color:#cc6666;">.</span>
<br />
<br />
<img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/cf1a74382d85042e2a457a5bf7818935/liberty_bell_w1024.jpeg" style="float: right; width: 110px; height: 120px; margin: 10px 15px; border-width: 1px; border-style: solid;" /></li>
<li><strong>What does it mean when we say a statistical test is “robust” to the assumption of normality?</strong><br />
<br />
a. The test strongly depends on having data that follow a normal distribution.<br />
b. The test can perform well even when the data do not strictly follow a normal distribution.<br />
c. The test cannot be used with data that follow a normal distribution.<br />
d. The test will never produce normal results.<br />
<br />
<span style="color:#cc6666;">The correct answer is <strong>b</strong>. To find out which commonly used statistical tests are robust to the assumption of normality, see </span><a href="http://blog.minitab.com/blog/understanding-statistics-and-its-application/what-should-i-do-if-my-data-is-not-normal-v2">this post</a><span style="color:#cc6666;">.</span>
<br />
<br />
</li>
<li><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/a3c6c22bc315f7ce64e79ae9bff76740/unicorn.jpg" style="width: 250px; height: 180px; float: right; margin: 10px 15px;" /><strong>A Multi-Vari chart is used to...</strong><br />
<br />
a. Study patterns of variation from many possible causes.<br />
b. Display positional or cyclical variations in processes.<br />
c. Study variations within a subgroup, and between subgroups.<br />
d. Obtain an overall view of the factor effects.<br />
e. All of the above.<br />
f. Ha! There’s no such thing as a “Multi-Vari chart!”<br />
<br />
<span style="color:#cc6666;">The correct answer is <strong>e </strong>(or, equivalently, <strong>a</strong>, <strong>b</strong>, <strong>c</strong>, and <strong>d</strong>). To learn how you can use a Multi-Vari chart, see </span><a href="http://blog.minitab.com/blog/applying-statistics-in-quality-projects/using-multi-vari-charts-to-analyze-families-of-variations">this post</a><span style="color:#cc6666;">.</span>
<br />
<br />
</li>
<li><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/43e436e3fa82dad5ebe14c06ef998567/shhh.jpg" style="float: right; width: 225px; height: 182px; margin: 10px 15px;" /><strong>How can you identify a discrete distribution?</strong><br />
<br />
a. Determine whether the probabilities of all outcomes sum to 1.<br />
b. Perform the Kelly-Banga Discreteness Test.<br />
c. Assess the kurtosis value for the distribution.<br />
d. You can’t—that’s why it’s discrete.<br />
<br />
<br />
<span style="color:#cc6666;">The correct answer is <strong>a</strong>. To learn how to identify and use discrete distributions, see </span><a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-and-using-discrete-distributions" target="_blank">this post</a><span style="color:#cc6666;">. For a general description of different data types, click </span><a href="http://blog.minitab.com/blog/understanding-statistics/understanding-qualitative-quantitative-attribute-discrete-and-continuous-data-types" target="_blank">here</a><span style="color:#cc6666;">. If you incorrectly answered c, see <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/why-kurtosis-is-like-liposuction-and-why-it-matters" target="_blank">this post</a>.</span>
<p><br />
</p>
</li>
<li><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/e02048d6542056f8440b4c958e505d3a/zombie.jpg" style="float: right; width: 128px; height: 150px;" /><strong>Which of these events can be modeled by a Poisson process?</strong><br />
<br />
a. Getting pooped on by a bird<br />
b. Dying from a horse kick while serving in the Prussian army<br />
c. Tracking the location of an escaped zombie<br />
d. Blinks of a human eye over a 24-hour period<br />
e. None of the above.<br />
<br />
<span style="color:#cc6666;">The correct answers are <strong>a</strong>, <strong>b</strong>, and <strong>c</strong>. To understand how the Poisson process is used to model rare events, see the following posts on </span><a href="http://blog.minitab.com/blog/fun-with-statistics/poisson-processes-and-probability-of-poop" target="_blank">Poisson and bird pooping</a><span style="color:#cc6666;">, </span><a href="http://blog.minitab.com/blog/quality-data-analysis-and-statistics/poisson-rates-and-the-undead" target="_blank">Poisson and escaped zombies</a><span style="color:#cc6666;">, and </span><a href="http://blog.minitab.com/blog/quality-data-analysis-and-statistics/no-horsing-around-with-the-poisson-distribution-troops" target="_blank">Poisson and horse kicks</a><span style="color:#cc6666;">.</span>
<br />
<br />
</li>
<li><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/8e79c08b36239547f5705f62f203d254/grumpy_cats.jpg" style="float: right; margin: 10px 15px; width: 269px; height: 121px;" /><strong>Why should you examine a Residuals vs. Order Plot when you perform a regression analysis?</strong>
<p>a. To identify non-random error, such as a time effect.<br />
b. To verify that the order of the residuals matches the order of data in the worksheet.<br />
c. Because a grumpy, finicky statistician said you have to.<br />
d. To verify that the residuals have constant variance.<br />
<br />
</p>
<span style="color:#cc6666;">The correct answer is <strong>a</strong>. For examples of how to interpret the Residuals vs Order plot in regression, see the following posts on </span><a href="http://blog.minitab.com/blog/the-statistics-game/snakes-alcohol-and-checking-the-residuals-vs-order-plot-in-regression" target="_blank">snakes and alcohol</a><span style="color:#cc6666;">, </span><a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/violations-of-the-assumptions-for-linear-regression-day-2-independence-of-the-residuals" target="_blank">independence of the residuals</a><span style="color:#cc6666;">, and </span><a href="http://blog.minitab.com/blog/statistics-for-lean-six-sigma/residual-revelations" target="_blank">residuals in DOE</a>.
<p><br />
</p>
</li>
<li>
<p><strong>The Central Limit Theorem says that...</strong></p>
<p>a. If you take a large number of independent, random samples from a population, the distribution of the samples approaches a normal distribution.<br />
b. If you take a large number of independent, random samples from a population, the sample means will fall between well-defined confidence limits.<br />
c. If you take a large number of independent, random samples from a population, the distribution of the sample means<strong> </strong>approaches a normal distribution.<br />
d. If you take a large number of independent, random samples from a population, you must put them back immediately.<br />
</p>
<span style="color:#cc6666;">The correct answer is c, although it is frequently misinterpreted as a. To better understand the central limit theorem, see this </span><a href="http://blog.minitab.com/blog/understanding-statistics/how-the-central-limit-theorem-works" target="_blank">brief, introductory post</a><span style="color:#cc6666;"> on how it works, or </span><a href="http://blog.minitab.com/blog/michelle-paret/explaining-the-central-limit-theorem-with-bunnies-and-dragons-v2" target="_blank">this post</a><span style="color:#cc6666;"> that explains it with bunnies and dragons.</span>
<p> </p>
<p><br />
<img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/d2a6b2bf5d36df22031d50dc782feec0/fear_w640.jpeg" style="width: 159px; height: 200px; float: right;" /></p>
</li>
<li><strong>You notice an extreme outlier in your data. What do you do?</strong><br />
<br />
a. Scream. Then try to hit it with a broom.<br />
b. Highlight the row in the worksheet and press [Delete]<br />
c. Multiply the outlier by e<sup>-1</sup><br />
d. Try to figure out what’s going on<br />
e. Change the value to the sample mean<br />
f. Nothing. You’ve got bigger problems in life.<br />
<p><span style="line-height: 1.6; color: rgb(204, 102, 102);">The correct answer is <strong>d</strong>. Unfortunately, <strong>a</strong>, <strong>b</strong>, and <strong>f </strong>are common responses in practice. To see how to use brushing in Minitab graphs to investigate outliers, see </span><a href="http://blog.minitab.com/blog/statistics-and-quality-improvement/how-to-use-brushing-to-investigate-outliers-on-a-graph" style="line-height: 1.6;" target="_blank">this post</a><span style="line-height: 1.6; color: rgb(204, 102, 102);">. To see how to handle extreme outliers in a capability analysis, click </span><a href="http://blog.minitab.com/blog/the-statistical-mentor/how-to-handle-extreme-outliers-in-capability-analysis-v2" style="line-height: 1.6;" target="_blank">here</a><span style="line-height: 1.6; color: rgb(204, 102, 102);">. To read about when it is and isn't appropriate to delete data values, see </span><a href="http://blog.minitab.com/blog/understanding-statistics/can-i-just-delete-some-values-to-reduce-the-standard-variation-in-my-anova" style="line-height: 1.6;" target="_blank">this post</a><span style="line-height: 1.6; color: rgb(204, 102, 102);">. To see what it feels like, statistically and personally, to be an outlier, click </span><a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/my-life-as-an-outlier" style="line-height: 1.6;" target="_blank">here</a><span style="line-height: 1.6; color: rgb(204, 102, 102);">.</span></p>
<p><br />
</p>
</li>
<li><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/5d0e5b13f08e315078b9c6205fcf9cd3/hypercube.jpg" style="float: right; width: 117px; height: 107px;" /><strong>Which of the following are true statements about the Box-Cox transformation?</strong><br />
<br />
a. The Box-Cox transformation can be used with regression analysis.<br />
b. You can only use the Box-Cox transformation with positive data.<br />
c. The Box-Cox transformation is not as powerful as the Johnson transformation.<br />
d. The Box-Cox transformation transforms data into 3-dimensional cube space.<br />
<span style="color:#cc6666;"><strong>a</strong>, <strong>b</strong>, and <strong>c </strong>are true statements. To see how the Box-Cox transformation uses a power function (a logarithm when lambda is 0) to transform non-normal data, see </span><a href="http://blog.minitab.com/blog/applying-statistics-in-quality-projects/how-could-you-benefit-from-a-box-cox-transformation" target="_blank">this post</a><span style="color:#cc6666;">. For an example of how to use the Box-Cox transformation when performing a regression analysis, see </span><a href="http://blog.minitab.com/blog/statistics-and-quality-improvement/see-how-easily-you-can-do-a-box-cox-transformation-in-regression" target="_blank">this post</a><span style="color:#cc6666;">. For a comparison of the Box-Cox and Johnson transformations, see </span><a href="http://blog.minitab.com/blog/quality-data-analysis-and-statistics/transformers-normal-data-in-disguise" target="_blank">this post</a><span style="color:#cc6666;">.</span>
<p><br />
</p>
</li>
<li>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/337ab2a21bc5695218d5cee0a758dc49/duck_pairs.jpg" style="float: right; width: 225px; height: 150px; margin: 10px 15px;" /><strong>When would you use a paired t-test instead of a 2-sample t-test?</strong><br />
<br />
a. When you don’t get significant results using a 2-sample t-test.<br />
b. When you have dependent pairs of observations.<br />
c. When you want to compare data in adjacent columns of the worksheet.<br />
d. When you want to analyze the courtship behavior of exotic animals.<br />
</p>
<span style="color:#cc6666;">The correct answer is <strong>b</strong>. For an explanation of the difference between a paired t-test and a 2-sample t-test, click </span><a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/t-for-2-should-i-use-a-paired-t-or-a-2-sample-t" target="_blank">here</a><span style="color:#cc6666;">.</span>
<p><br />
</p>
</li>
<li>
<p><strong>Which of these are common pitfalls to avoid when interpreting regression results?</strong><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/864a22e8598adc181ae1e032aee31a4e/piifall.jpg" style="float: right; width: 250px; height: 193px;" /><br />
<br />
a. Extrapolating predictions beyond the range of values in the sample data.<br />
b. Confusing correlation with causation.<br />
c. Using uncooked spaghetti to model linear trends.<br />
d. Adding too much jitter to points on the scatterplot.<br />
e. Assuming the R-squared value must always be high.<br />
f. Treating the residuals as model errors.<br />
g. Holding the graph upside-down.<br />
</p>
<span style="color:#cc6666;">The correct answers are <strong>a</strong>, <strong>b</strong>, and <strong>e</strong>. To see an amusing example of extrapolating beyond the range of sample data values, click </span><a href="http://blog.minitab.com/blog/data-analysis-and-quality-improvement-and-stuff/it-was-the-best-of-times-it-was-the-end-of-times" target="_blank">here</a><span style="color:#cc6666;">. To understand why correlation doesn't imply causation, see </span><a href="http://blog.minitab.com/blog/understanding-statistics/no-matter-how-strong-correlation-still-doesnt-imply-causation" target="_blank">this post</a><span style="color:#cc6666;">. For another example, using NFL data, click </span><a href="http://blog.minitab.com/blog/the-statistics-game/correlation-is-not-causation-why-running-the-football-doesnt-cause-you-to-win-games-in-the-nfl" target="_blank">here</a><span style="color:#cc6666;">, and for yet another, using NBA data, click </span><a href="http://blog.minitab.com/blog/statistics-and-quality-improvement/correlation-causation-and-remorse-for-my-nba-finals-prediction" target="_blank">here</a><span style="color:#cc6666;">. To understand what R-squared is, see </span><a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/r-squared-sometimes-a-square-is-just-a-square" target="_blank">this post</a><span style="color:#cc6666;">. To learn why a high R-squared is not <em>always</em> good, and a low R-squared is not <em>always</em> bad, see </span><a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit" target="_blank">this post</a><span style="color:#cc6666;">.</span>
<p><br />
</p>
</li>
<li>
<p><strong>Which of the following are terms associated with DOE (design of experiment), and which are terms associated with a BUCK? </strong><br />
<img alt="" height="242" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/32bad6c4a4a6c78a9c6950bba07c4d6a/buck_w1024.jpeg" style="float: right;" width="326" /><br />
a. Center point<br />
b. Crown tine<br />
c. Main effect<br />
d. Corner point<br />
e. Pedicle<br />
f. Split plot<br />
g. Block<br />
h. Burr<br />
i. Main beam<br />
j. Run<br />
</p>
<span style="color:#cc6666;">The design of experiment (DOE) terms are <strong>a</strong>, <strong>c</strong>, <strong>d</strong>, <strong>f</strong>, <strong>g</strong>, and <strong>j</strong>. The parts of a buck's antlers are b, e, h, and i. The Minitab blog contains many great posts on DOE, including several step-by-step examples that provide a clear, easy-to-understand synopsis of the process to follow when you create and analyze a designed experiment in Minitab. Click </span><a href="http://blog.minitab.com/blog/design-of-experiments-2" target="_blank">here</a> <span style="color:#cc6666;">to see a complete compilation of these DOE posts.</span>
<p><br />
</p>
</li>
<li>
<p><strong>Which of these are frequently cited as common statistical errors?</strong></p>
<p>a. Assuming that a small amount of random error is OK.<br />
b. Assuming that you've proven the null hypothesis when the p-value is greater than 0.05.<br />
c. Assuming that correlation implies causation.<br />
d. Assuming that statistical significance implies practical significance.<br />
e. Assuming that inferential statistics is a method of estimation.<br />
f. Assuming that statisticians are always right.<br />
</p>
<span style="color:#cc6666;">The correct answers are <strong>b</strong>, <strong>c</strong>, and <strong>d</strong>. To see common statistical mistakes you should avoid, click </span><a href="http://blog.minitab.com/blog/real-world-quality-improvement/common-statistical-mistakes-you-should-avoid" target="_blank">here</a><span style="color:#cc6666;">. And </span><a href="http://blog.minitab.com/blog/understanding-statistics/three-dangerous-statistical-mistakes" target="_blank">here</a><span style="color:#cc6666;">.</span>
</li>
</ol>
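<p>The distinction in the Central Limit Theorem question (the distribution of the <em>samples</em> versus the distribution of the <em>sample means</em>) can be checked with a small simulation. What follows is only a sketch, not something from the original quiz: the exponential population, the sample size of 50, the 5,000 repetitions, and the helper names <code>skewness</code> and <code>simulate</code> are all assumptions chosen for illustration. It uses skewness (0 for a symmetric distribution) as a rough gauge of how far each distribution is from normal.</p>

```python
import random
import statistics


def skewness(values):
    """Skewness as the mean cubed z-score (0 for a symmetric distribution)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def simulate(sample_size=50, n_samples=5000, seed=1):
    """Draw many samples from a right-skewed population.

    Returns (skewness of all individual values, skewness of the sample means).
    The exponential population is an illustrative assumption.
    """
    rng = random.Random(seed)
    raw, means = [], []
    for _ in range(n_samples):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        raw.extend(sample)                      # individual observations
        means.append(statistics.fmean(sample))  # one mean per sample
    return skewness(raw), skewness(means)


raw_skew, mean_skew = simulate()
print(f"skewness of individual values: {raw_skew:.2f}")
print(f"skewness of sample means:      {mean_skew:.2f}")
```

<p>Under these assumptions, the individual values should stay strongly right-skewed (the theoretical skewness of an exponential distribution is 2) no matter how many samples are drawn, while the sample means should be much closer to symmetric. That is the behavior answer c describes and answer a gets wrong.</p>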
Looking for more information? Try the online Minitab Topic Library
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/671c3d5f6639ac42de46dc489f0ad3f4/library.jpg" style="width: 350px; height: 263px; float: right; margin: 10px 15px;" /></p>
<p>For more information on the concepts covered in this quiz—as well as many other statistical concepts—check out the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/topic-library-overview/" target="_blank">Minitab Topic Library</a>.</p>
<p>On the Topic Library Overview page, click <strong>Menu</strong> to access the topic of your choice.<br />
For example, for more information on interpreting residual plots in regression analysis, click <strong>Modeling Statistics > Regression and correlation > Residuals and residual plots.</strong></p>
Fun Statistics
Learning
Statistics
Fri, 15 Jan 2016 13:00:00 +0000
http://blog.minitab.com/blog/statistics-and-quality-data-analysis/the-minitab-blog-quiz%3A-test-your-stat-smarts
Patrick Runkel
How to Compare Regression Slopes
http://blog.minitab.com/blog/adventures-in-statistics/how-to-compare-regression-lines-between-different-models
<p>If you perform linear regression analysis, you might need to compare different regression lines to see if their constants and slope coefficients are different. Imagine there is an established relationship between X and Y. Now, suppose you want to determine whether that relationship has changed. Perhaps there is a new context, process, or some other qualitative change, and you want to determine whether that affects the relationship between X and Y.</p>
<p>For example, you might want to assess whether the relationship between the height and weight of football players is significantly different from the same relationship in the general population.</p>
<p>You can graph the regression lines to visually compare the slope coefficients and constants. However, you should also statistically test the differences. <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-why-we-need-to-use-hypothesis-tests-in-statistics" target="_blank">Hypothesis testing</a> helps separate the true differences from the random differences caused by sampling error so you can have more confidence in your findings.</p>
<p>In this blog post, I’ll show you how to compare a relationship between different regression models and determine whether the differences are statistically significant. Fortunately, these tests are easy to do using <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">Minitab statistical software</a>.</p>
<p>In the example I’ll use throughout this post, there is an input variable and an output variable for a hypothetical process. We want to compare the relationship between these two variables under two different conditions. Here is the <a href="//cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/File/569a0e7d067944f6f9147434794efcd6/comparingregressionmodels.MPJ">Minitab project file</a> with the data.</p>
Comparing Constants in Regression Analysis
<p>When the <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-to-interpret-the-constant-y-intercept" target="_blank">constants</a> (or y intercepts) in two different regression equations are different, this indicates that the two regression lines are shifted up or down on the Y axis. In the scatterplot below, you can see that the Output from Condition B is consistently higher than Condition A for any given Input value. We want to determine whether this vertical shift is statistically significant.</p>
<p><img alt="Scatterplot with two regression lines that have different constants." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/2ed27f4204515bac9d9674c16fa0c0f7/scatter_constant_dift.png" style="width: 576px; height: 384px;" /></p>
<p>To test the difference between the constants, we just need to include a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/data-concepts/cat-quan-variable/" target="_blank">categorical variable</a> that identifies the qualitative attribute of interest in the model. For our example, I have created a variable for the condition (A or B) associated with each observation.</p>
<p>To fit the model in Minitab, I’ll use: <strong>Stat > Regression > Regression > Fit Regression Model</strong>. I’ll include <em>Output</em> as the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/regression-models/what-are-response-and-predictor-variables/" target="_blank">response variable</a>, <em>Input</em> as the continuous <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/regression-models/what-are-response-and-predictor-variables/" target="_blank">predictor</a>, and <em>Condition</em> as the categorical predictor.</p>
<p>In the regression analysis output, we’ll first check the coefficients table.</p>
<p style="margin-left: 40px;"><img alt="Coefficients table that shows that the constants are different" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/23657868f2cf893d216d05d3400ab9e6/coeff_constant_dift.png" style="width: 369px; height: 117px;" /></p>
<p>This table shows us that the relationship between Input and Output is statistically significant because the p-value for Input is 0.000.</p>
<p>The <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients" target="_blank">coefficient</a> for Condition is 10 and its <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients" target="_blank">p-value</a> is significant (0.000). The coefficient tells us that the vertical distance between the two regression lines in the scatterplot is 10 units of Output. The p-value tells us that this difference is statistically significant—you can reject the null hypothesis that the distance between the two constants is zero. You can also see the difference between the two constants in the regression equation table below.</p>
<p style="margin-left: 40px;"><img alt="Regression equation table that shows constants that are different" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/a879996e37ebb05a297721e695a71943/equ_constant_dift.png" style="width: 305px; height: 113px;" /></p>
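<p>If you'd like to see the same idea outside of Minitab, here is a minimal sketch in plain Python. The data are made up for illustration (two conditions that share a slope of 2, with Condition B shifted up by 10 units), and the helper names <em>solve</em> and <em>fit_ols</em> are my own, not part of any library. The sketch only recovers the coefficients; it does not compute the standard errors or p-values that Minitab reports.</p>

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


# Made-up data: (Input, Condition indicator, Output); indicator is 0 for A, 1 for B.
inputs = [1, 2, 3, 4, 5, 6]
data = [(x, 0, 5 + 2 * x) for x in inputs] + [(x, 1, 15 + 2 * x) for x in inputs]

design = [[1.0, x, d] for x, d, _ in data]  # constant, Input, Condition
output = [y for _, _, y in data]

constant, slope, condition_shift = fit_ols(design, output)
print(round(constant, 3), round(slope, 3), round(condition_shift, 3))
```

<p>With 0/1 coding for the categorical variable, the coefficient on the indicator is the vertical distance between the two lines, which is the quantity the Condition coefficient measures in the table above.</p>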
Comparing Coefficients in Regression Analysis
<p>When two slope coefficients are different, a one-unit change in a predictor is associated with different mean changes in the response. In the scatterplot below, it appears that a one-unit increase in Input is associated with a greater increase in Output in Condition B than in Condition A. We can <em>see</em> that the slopes look different, but we want to be sure this difference is statistically significant.</p>
<p><img alt="Scatterplot that shows two slopes that are different" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/200c12087fdf7eecd9b773d9ce213020/scatter_slope_dift.png" style="width: 576px; height: 384px;" /></p>
<p>How do you statistically test the difference between regression coefficients? It sounds like it might be complicated, but it is actually very simple. We can even use the same Condition variable that we did for testing the constants.</p>
<p>We need to determine whether the coefficient for Input depends on the Condition. In statistics, when we say that the effect of one variable depends on another variable, that’s an interaction effect. All we need to do is include the interaction term for Input*Condition!</p>
<p>In Minitab, you can specify interaction terms by clicking the <strong>Model</strong> button in the main regression dialog box. After I fit the regression model with the interaction term, we obtain the following coefficients table:</p>
<p style="margin-left: 40px;"><img alt="Coefficients table that shows different slopes" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/f06eff56f2266d0ff7e3919aa1292285/coeff_slope_dift.png" style="width: 410px; height: 154px;" /></p>
<p>The table shows us that the interaction term (Input*Condition) is statistically significant (p = 0.000). Consequently, we reject the null hypothesis and conclude that the difference between the two coefficients for Input (below, 1.5359 and 2.0050) does not equal zero. We also see that the main effect of Condition is not significant (p = 0.093), which indicates that the difference between the two constants is not statistically significant.</p>
<p style="margin-left: 40px;"><img alt="Regression equation table that shows different slopes" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/d5e5142c0ff13645d1dacc3e2c0bee27/equ_coeff_dift.png" style="width: 295px; height: 105px;" /></p>
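<p>One way to build intuition for the interaction term: if you fit a separate line to each condition, the difference between the two slopes is what the Input*Condition coefficient estimates (assuming 0/1 coding for Condition), and the difference between the two intercepts is what the Condition coefficient estimates. The sketch below uses the two slopes from the equations above, but the intercepts (4.0 and 4.5) and the Input values are hypothetical, chosen only so the example runs.</p>

```python
def simple_fit(x, y):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


inputs = [1, 2, 3, 4, 5, 6, 7, 8]             # hypothetical Input values
out_a = [4.0 + 1.5359 * x for x in inputs]    # hypothetical intercept for Condition A
out_b = [4.5 + 2.0050 * x for x in inputs]    # hypothetical intercept for Condition B

int_a, slope_a = simple_fit(inputs, out_a)
int_b, slope_b = simple_fit(inputs, out_b)

interaction_coef = slope_b - slope_a   # what Input*Condition estimates
condition_coef = int_b - int_a         # what the Condition main effect estimates
print(f"slope difference: {interaction_coef:.4f}")
print(f"intercept difference: {condition_coef:.4f}")
```

<p>The advantage of fitting one model with the interaction term, rather than two separate lines, is that the single model gives you a p-value for each difference.</p>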
<p>It is easy to compare and test the differences between the constants and coefficients in regression models by including a categorical variable. These tests are useful when you can see differences between regression models and you want to defend your conclusions with p-values.</p>
<p>If you're learning about regression, read my <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-tutorial-and-examples">regression tutorial</a>!</p>
Data Analysis, Hypothesis Testing, Regression Analysis, Statistics Help
Wed, 13 Jan 2016 13:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/how-to-compare-regression-lines-between-different-models
Jim Frost
Exploratory Graphs of 2016 Qualified Health Plans
http://blog.minitab.com/blog/statistics-and-quality-improvement/exploratory-graphs-of-2016-qualified-health-plans
<p>At the start of a new year, I like to look for data that’s labeled 2016. While it’s not necessarily new for 2016, one of the first data sets I found was <a href="https://www.healthcare.gov/health-plan-information-2016/">healthcare.gov’s data about qualified health and stand-alone dental plans</a> offered through their site.</p>
<p>Now, there’s lots of fun stuff to poke around in a data set this size—there are over 90,000 records on more than 140 variables. But to start out I used Minitab to do some exploratory graphical analysis.</p>
<p>One statistic you might be interested in is the mean cost of the plans available. Minitab makes this easy because <a href="http://blog.minitab.com/blog/quality-data-analysis-and-statistics/bar-charts-decoded">Minitab’s bar chart automatically computes the means</a>, and other statistics, to plot them. This is a chart of the mean premiums by state for 21-year-old adults. I colored Utah in red because it’s going to do something none of the other states do.</p>
<p><img alt="Mean premiums by state" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/7e761935cbc455502cb91387ccb844f1/individual_age_21.png" style="width: 576px; height: 384px;" /></p>
<p>Here’s a bar chart of the means for a couple with 2 children, aged 40:</p>
<p><img alt="Mean premiums by state for a couple with 2 children, age 40" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/03283ebd11a5bdfdb500f5b1a492fca3/couples_2_children_age_40.png" style="width: 576px; height: 384px;" /></p>
<p>See how Utah moved? For 21-year-olds, Utah was the second-cheapest. For the category of couple+2 children, age 40, Utah’s not radically different in price from many other states, but its rank changed. In fact, of all the states, Utah is the only one that changed position relative to any others.</p>
<p>We’re not talking about large differences in the means, but what makes the change seem really odd is this: Utah is the only state where the mean price for a couple+2 children, age 40, is not completely determined by the price for adults at the age of 21.</p>
<p>Here’s a scatterplot of the means of all plan premiums in each state for the two example groups from the dataset. Utah is the red dot:</p>
<p><img alt="Scatterplot with Utah" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/2a74022f73d798e431faa3926e3473c3/scatterplot_with_utah.png" style="width: 576px; height: 384px;" /></p>
<p>If you remove Utah from the data set (<a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/graph-options/exploring-data-and-revising-graphs/using-brushing-to-investigate-data-points/#graph-a-subset-of-your-data-based-on-brushed-points">Minitab makes excluding points easy</a>), the <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-high-should-r-squared-be-in-regression-analysis">R2 value</a> is 100%.</p>
<p><img alt="Scatterplot without Utah" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/89abc02ae2cee0f0f57bbb783769a784/scatterplot_without_utah.png" style="width: 576px; height: 384px;" /></p>
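<p>To make the "R2 is 100% without one point" idea concrete, here is a small sketch in plain Python. The premiums are invented for illustration and are not taken from the healthcare.gov data: five hypothetical states follow an exact (and equally hypothetical) 3.3-to-1 ratio between the family premium and the age-21 premium, and one outlier state does not.</p>

```python
def r_squared(x, y):
    """R-squared of a simple linear regression of y on x (squared correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


# Hypothetical mean premiums: (label, age-21 mean, couple+2 children mean).
states = [(name, p21, 3.3 * p21) for name, p21 in
          [("A", 180.0), ("B", 210.0), ("C", 240.0), ("D", 275.0), ("E", 310.0)]]
states.append(("Outlier", 190.0, 700.0))  # the one state that breaks the pattern

x_all = [s[1] for s in states]
y_all = [s[2] for s in states]
x_rest = [s[1] for s in states if s[0] != "Outlier"]
y_rest = [s[2] for s in states if s[0] != "Outlier"]

r2_with = r_squared(x_all, y_all)
r2_without = r_squared(x_rest, y_rest)
print(f"R-squared with the outlier:    {r2_with:.4f}")
print(f"R-squared without the outlier: {r2_without:.4f}")
```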
<p>Does the difference have to do with the plans? Do the providers in Utah do something different? Is this simply a quirk of how the data are recorded? Does it have to do with Utah’s history of providing a healthcare exchange before the Affordable Care Act? It’s hard to say without looking a little deeper. But Minitab’s easy exploratory graphs make it simple to find the points in a data set that show the need for further investigation.</p>
<p>I’ll do my own follow-up, because my natural curiosity can’t be satisfied otherwise. If you have your own hypothesis, feel free to share it in the comments section.</p>
Data Analysis, Statistics in the News
Mon, 11 Jan 2016 13:04:00 +0000
http://blog.minitab.com/blog/statistics-and-quality-improvement/exploratory-graphs-of-2016-qualified-health-plans
Cody Steele
How to Perform Acceptance Sampling by Attributes
http://blog.minitab.com/blog/understanding-statistics/how-to-perform-acceptance-sampling-by-attributes
<p>In an earlier post, I shared an <a href="http://blog.minitab.com/blog/understanding-statistics/what-is-acceptance-sampling">overview of acceptance sampling</a>, a method that lets you evaluate a sample of items from a larger batch of products (for instance, electronics components you've sourced from a new supplier) and use that sample to decide whether or not you should accept or reject the entire shipment. </p>
<p>There are two approaches to acceptance sampling. If you do it by attributes, you count the number of defects or defective items in the sample, and base your decision about the entire lot on that. The alternative approach is acceptance sampling by variables, in which you use a measurable characteristic to evaluate the sampled items. Doing it by attributes is easier, but sampling by variables requires smaller sample sizes. </p>
<p>In this post, we'll do acceptance sampling by attributes using Minitab Statistical Software. If you're not already using it and you'd like to follow along, you can get our <a href="http://it.minitab.com/products/minitab/free-trial.aspx">free 30-day trial version</a>. </p>
Getting Started with Acceptance Sampling by Attributes
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ca5bf6a240b3402a5d4c41c358d2fb85/capacitors2.jpg" style="line-height: 20.8px; margin: 10px 15px; float: right; width: 250px; height: 200px; border-width: 1px; border-style: solid;" /></p>
<p>You manage components for a large consumer electronics firm. In that role, you're responsible for sourcing the transistors, resistors, integrated circuits, and other components your company uses in its finished products. You're also responsible for making sure your vendors are supplying high-quality products, and rejecting any batches that don't meet your standards.</p>
<p>Recently, you've been hearing from the assembly managers about problems with one of your suppliers of capacitors. You order these components in batches of 1,000, and it's just not feasible to inspect every individual item coming in. When the next batch of capacitors arrives from this supplier, you decide to use sampling so you can make a data-driven decision to either accept or reject the entire lot.</p>
<p>Before you can devise your sampling plan, you need to know what constitutes an acceptable quality level (AQL) for a batch of capacitors, and what is a rejectable quality level (RQL). As you might surmise, these are figures that need to be discussed with and agreed to by your supplier. You'll also need to settle on levels of the "producer's risk," which is the probability of incorrectly rejecting a lot that should have been accepted, and the "consumer's risk," which is the probability that a batch which should have been rejected is accepted. <span style="line-height: 20.8px;">In many cases, the consumer's risk is set at a higher level than the producer's risk.</span></p>
<p>Your agreement with the supplier is that the AQL is 1%, and the RQL is 8%. The producer's risk has been set at 5%, which means that about 95% of the time, you'll correctly accept a lot with a defect level of 1% or lower. You've agreed to accept a consumer's risk level of 10%, which means that about 90% of the time you would correctly reject a lot that has a defect level of 8% or higher. </p>
Creating Your Plan for Acceptance Sampling by Attributes
<p>Now we can use Minitab to determine an appropriate sampling plan. <img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/5a4e272f8f8ed1d53d3ad52ef75ec154/acceptance_sampling_attributes_dialog_complete.jpg" style="line-height: 20.8px; width: 300px; height: 245px; margin: 10px 15px; float: right;" /></p>
<ol>
<li>Choose <strong>Stat > Quality Tools > Acceptance Sampling by Attributes</strong>.</li>
<li>Choose <em>Create a sampling plan</em>.</li>
<li>In <em>Measurement type</em>, choose Go / no go (defective).</li>
<li>In <em>Units for quality levels</em>, choose Percent defective.</li>
<li>In <em>Acceptable quality level (AQL)</em>, enter 1. In <em>Rejectable quality level (RQL or LTPD)</em>, enter 8.</li>
<li>In <em>Producer's risk (Alpha)</em>, enter 0.05. In <em>Consumer's risk (Beta)</em>, enter 0.1.</li>
<li>In <em>Lot size</em>, enter 1000.</li>
<li>Click <strong>OK</strong>.</li>
</ol>
<p>Minitab produces the following output in the Session Window: </p>
<p style="margin-left: 40px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/b42ea93107b5f0250a63e90f2e81bbba/acceptance_sampling_attributes_output.gif" style="width: 546px; height: 228px;" /></p>
Interpreting the Acceptance Sampling by Attributes Plan
<p>For each lot of 1,000 capacitors, you need to randomly select and inspect 65. If you find more than 2 defectives among these 65 capacitors, you should reject the entire lot. If you find 2 or fewer defective items, accept the entire lot.</p>
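<p>If you're curious where a plan like "sample 65, accept on 2 or fewer" can come from, the sketch below runs a common textbook search: for each acceptance number, find the smallest sample size that meets the consumer's risk at the RQL, then check whether that plan also meets the producer's risk at the AQL. This is an illustration under a binomial assumption that ignores the lot size; I'm not claiming it is the algorithm Minitab uses. The function names are my own.</p>

```python
from math import comb


def prob_accept(n, c, p):
    """Binomial probability of finding c or fewer defectives in a sample of n."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c + 1))


def find_plan(aql, rql, alpha, beta, max_c=20):
    """Return the first (sample size, acceptance number) meeting both risks."""
    for c in range(max_c + 1):
        n = c + 1
        # Grow the sample until a lot at the RQL is accepted at most beta of the time.
        while prob_accept(n, c, rql) > beta:
            n += 1
        # Keep the plan only if a lot at the AQL is still accepted often enough.
        if prob_accept(n, c, aql) >= 1 - alpha:
            return n, c
    return None


plan = find_plan(aql=0.01, rql=0.08, alpha=0.05, beta=0.10)
print(plan)
```

<p>For the AQL, RQL, and risk levels agreed with the supplier in this example, the search lands on the same sample size and acceptance number as the plan above.</p>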
<p><span style="line-height: 20.8px;">Minitab plots an Operating Characteristic Curve to show you the probability of accepting lots at various incoming quality levels. </span>In this case, the probability of acceptance at the AQL (1%) is 0.972, and the probability of rejecting is 0.028. When the sampling plan was set up, you and your supplier agreed that lots of 1% defective would be accepted approximately 95% of the time to protect the producer. </p>
<p><img alt="Operating Characteristic (OC) Curve" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f2788766c0a6a34b56861526187d8774/oc_curve.png" style="width: 576px; height: 384px;" /></p>
<p>The probability of accepting a batch of capacitors at the RQL (8%) is 0.099 and the probability of rejecting is 0.901. The consumer and supplier agreed that lots of 8% defective would be rejected most of the time to protect the consumer.</p>
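<p>The acceptance probabilities on the OC curve can be checked by hand. A minimal sketch, assuming the number of defectives in the sample follows a binomial distribution (an assumption on my part, though it reproduces the figures quoted here):</p>

```python
from math import comb


def prob_accept(p, n=65, c=2):
    """Binomial probability of finding c or fewer defectives in a sample of n."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c + 1))


p_accept_aql = prob_accept(0.01)  # lot at the AQL (1% defective)
p_accept_rql = prob_accept(0.08)  # lot at the RQL (8% defective)

print(f"AQL 1%: accept {p_accept_aql:.3f}, reject {1 - p_accept_aql:.3f}")
print(f"RQL 8%: accept {p_accept_rql:.3f}, reject {1 - p_accept_rql:.3f}")
```

<p>Evaluating <em>prob_accept</em> over a range of defect rates traces out the rest of the OC curve.</p>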
What Happens If a Lot Gets Rejected?
<p>When the next batch of capacitors arrives at the dock, you pick out 65 at random and test them. <span style="line-height: 1.6;">Five of the 65 sampled capacitors are defective. </span></p>
<p>Based on your plan, you reject the lot. Now what? <span style="line-height: 1.6;">Typically, the supplier will need to take some corrective action, such as inspecting all units and reworking or replacing any that are defective.</span></p>
<p>Minitab produces two graphs that can tell you more. If we assume that rejected lots will be 100% inspected and all defects rectified, the Average Outgoing Quality (AOQ) plot represents the relationship <span style="line-height: 20.8px;">between the quality of incoming and outgoing materials. </span><span style="line-height: 1.6;">The Average Total Inspection (ATI) plot shows the relationship between the quality of incoming materials and the number of items that need to be inspected.</span></p>
<p><span style="line-height: 20.8px;">When incoming lots are very good or very bad, the outgoing quality will be good because poor lots get reinspected and fixed, and good lots are already good.</span><span style="line-height: 20.8px;"> In the graph below, </span>the AOQ level is about 0.9% at the AQL and about 0.7% at the RQL. But when incoming quality is neither very good nor very bad, the number of bad parts that get through rises, so outgoing quality gets worse. The maximum % defective level for outgoing quality is called the Average Outgoing Quality Limit (AOQL). This figure is included in the session window output above, and you can see it in the graph below: at about 3.45% defective, the AOQL is 1.968%, the worst-case outgoing quality level.</p>
<p><span style="color:#FF0000;"><img alt="Average Outgoing Quality (AOQ) Curve" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/e6860ff5a5c71686703a5b0dc5abe1b6/aoq_curve.png" style="width: 576px; height: 384px;" /></span></p>
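<p>The AOQ curve can be sketched with the usual rectifying-inspection formula, AOQ = Pa × p × (N − n) / N, where Pa is the probability of accepting a lot with incoming defect rate p, N is the lot size, and n is the sample size. The code below assumes that formula and a binomial Pa; both are my assumptions rather than something stated in the output, but the AOQL they produce agrees with the 1.968 reported at about 3.45% defective. The same formula gives an AOQ of roughly 0.91% at the AQL and 0.74% at the RQL.</p>

```python
from math import comb

LOT_SIZE, SAMPLE_SIZE, ACCEPT_NUM = 1000, 65, 2


def prob_accept(p):
    """Binomial probability of ACCEPT_NUM or fewer defectives in the sample."""
    return sum(comb(SAMPLE_SIZE, k) * p**k * (1 - p)**(SAMPLE_SIZE - k)
               for k in range(ACCEPT_NUM + 1))


def aoq(pct_defective):
    """Average outgoing quality (in percent) when rejected lots are fully rectified."""
    p = pct_defective / 100
    return prob_accept(p) * pct_defective * (LOT_SIZE - SAMPLE_SIZE) / LOT_SIZE


# Scan incoming quality from 0.01% to 15% defective to locate the worst case.
grid = [i / 100 for i in range(1, 1501)]
worst_incoming = max(grid, key=aoq)
aoql = aoq(worst_incoming)

print(f"AOQ at AQL (1%): {aoq(1):.3f}%")
print(f"AOQ at RQL (8%): {aoq(8):.3f}%")
print(f"AOQL: {aoql:.3f}% at about {worst_incoming:.2f}% incoming defective")
```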
<p>The ATI per lot represents the average number of capacitors you will need to inspect at a particular quality level. </p>
<p><span style="color:#FF0000;"><img alt="Average Total Inspection Curve" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/cc5aa396e04889959ff97cdf2d2c9f5e/ati_curve.png" style="width: 576px; height: 384px;" /></span></p>
<p><span style="line-height: 20.8px;">In the graph above, you can see that if the lot's actual % defective is 2%, the average total number of capacitors inspected per lot will approach 200 (including re-inspections after the supplier has rectified a rejected lot). If the quality level is 8% defective (the RQL), the average total number of capacitors inspected per lot is 907.3.</span></p>
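<p>The ATI figures follow from a simple expectation: you always inspect the n sampled items, and when a lot is rejected you inspect the remaining N − n as well, so ATI = n + (1 − Pa) × (N − n). Here is a minimal sketch assuming that formula and a binomial Pa (my assumptions). Under them, a lot at 2% defective gives an ATI of about 197, and the 907.3 figure corresponds to a lot at 8% defective, the RQL.</p>

```python
from math import comb

LOT_SIZE, SAMPLE_SIZE, ACCEPT_NUM = 1000, 65, 2


def prob_accept(p):
    """Binomial probability of ACCEPT_NUM or fewer defectives in the sample."""
    return sum(comb(SAMPLE_SIZE, k) * p**k * (1 - p)**(SAMPLE_SIZE - k)
               for k in range(ACCEPT_NUM + 1))


def ati(p):
    """Average total inspection per lot when rejected lots are 100% inspected."""
    return SAMPLE_SIZE + (1 - prob_accept(p)) * (LOT_SIZE - SAMPLE_SIZE)


ati_2pct = ati(0.02)
ati_rql = ati(0.08)
print(f"ATI at 2% defective: {ati_2pct:.1f}")
print(f"ATI at 8% defective: {ati_rql:.1f}")
```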
<p>Check out my earlier posts for a walk through of <a href="http://blog.minitab.com/blog/understanding-statistics/how-to-perform-acceptance-sampling-by-variables%2C-part-1">performing acceptance sampling by variables</a>. </p>
Data Analysis, Quality Improvement, Statistics, Stats
Fri, 08 Jan 2016 13:00:00 +0000
http://blog.minitab.com/blog/understanding-statistics/how-to-perform-acceptance-sampling-by-attributes
Eston Martz
How to Perform Acceptance Sampling by Variables, part 3
http://blog.minitab.com/blog/understanding-statistics/how-to-perform-acceptance-sampling-by-variables-part-3
<p>Now that we've seen how easy it is to <a href="http://blog.minitab.com/blog/understanding-statistics/how-to-perform-acceptance-sampling-by-variables%2C-part-1">create plans for acceptance sampling by variables</a>, and to <a href="http://blog.minitab.com/blog/understanding-statistics/how-to-perform-acceptance-sampling-by-variables%2C-part-2">compare different sampling plans</a>, it's time to see how to actually analyze the data you collect when you follow the sampling plan. </p>
<p>If you'd like to follow along and you're not already using Minitab, please download <a href="http://it.minitab.com/products/minitab/free-trial.aspx">the free 30-day trial</a>. </p>
Collecting the Data for Acceptance Sampling by Variable
<p>If you'll recall from the previous post, after comparing several different sampling plans, you decided that sampling 50 items from your next incoming lot of 1,500 LEDs would be the best option to satisfy your supervisor's desire to sample as few items as possible while at the same time providing sufficient protection to you and your supplier. That protection stems from an acceptable probability that lots will not be accepted or rejected in error. Under this plan, you have just a 7% chance of rejecting a good lot, and an 87% chance of rejecting a poor lot. </p>
<p>So, on the day your next shipment of LEDs arrives, you select 50 of them and carefully measure the soldering leads. To make sure the sampling process will be effective, you're diligent about taking samples from throughout the entire lot, at random. You record your measurements and place the data into a Minitab worksheet. </p>
Analyzing Acceptance Sampling by Variable Data
<p>This time, when you go to <strong>Stat > Quality Tools > Acceptance Sampling by Variables,</strong> choose the <em>Accept/Reject Lot...</em> option. </p>
<p><img alt="" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/105c2f1b467e455ed234546325e435d2/acceptances_sampling_variables_menu_choice.gif" style="width: 450px; height: 144px;" /></p>
<p>The goal of this analysis is to determine whether you should accept or reject this latest batch of LEDs, based on your sample data. <span style="line-height: 1.6;">If the calculated Z value is greater than the critical distance (3.5132), you will accept the entire lot. Otherwise, the lot goes back to your supplier for rework and correction.</span></p>
<p><span style="line-height: 1.6;">In <em>Measurement data</em>, enter 'Lead Length'. </span><span style="line-height: 1.6;">In <em>Critical distance (k value)</em>, enter 3.5132. </span><span style="line-height: 1.6;">In <em>Lower spec</em>, enter 2. Finally, for </span><span style="line-height: 1.6;"><em>Historical standard deviation</em>, enter 0.145. Your dialog box will look like this: </span></p>
<p style="margin-left: 40px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c8b331860bb90dc1b1d478fcc4d865bf/acceptance_by_variables_analysis_dialog.gif" style="width: 487px; height: 373px;" /></p>
<p>When you click <strong>OK</strong>, the Session Window provides the following output: </p>
<p style="margin-left: 40px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/db07e9bb50e187ceb278ebe1a4e7882e/acceptance_by_variables_analysis_output.gif" style="width: 442px; height: 209px;" /></p>
Interpreting the Acceptance Sampling Output
<p>From the measurements of the 50 LEDs that you sampled, the mean length of the solder leads is 2.52254 centimeters, and the historical standard deviation is 0.145 centimeters. The lower specification for the lead length is 2 centimeters. </p>
<p><span style="line-height: 20.8px;">When you created the sampling plan, t</span>he critical distance was determined to be 3.5132. Because this is smaller than the calculated Z.LSL (3.60375), you will accept the lot of 1,500 LEDs.</p>
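<p>The decision rule is easy to reproduce by hand: Z.LSL is the distance from the sample mean to the lower spec, measured in historical standard deviations. The sketch below uses the rounded summary values quoted above rather than the 50 raw measurements (which aren't listed here), so the last digits differ slightly from Minitab's 3.60375.</p>

```python
mean_length = 2.52254        # sample mean of the 50 lead lengths
lower_spec = 2.0             # lower specification limit
historical_sd = 0.145        # historical standard deviation
critical_distance = 3.5132   # k value from the sampling plan

# Distance from the mean to the lower spec, in standard deviations.
z_lsl = (mean_length - lower_spec) / historical_sd
accept_lot = z_lsl > critical_distance

print(f"Z.LSL = {z_lsl:.4f}")
print("Accept the lot" if accept_lot else "Reject the lot")
```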
Data Analysis, Quality Improvement
Thu, 07 Jan 2016 13:00:00 +0000
http://blog.minitab.com/blog/understanding-statistics/how-to-perform-acceptance-sampling-by-variables-part-3
Eston Martz