Data Analysis Software | Minitab
Blog posts and articles with tips for using statistical software to analyze data for quality improvement.
http://blog.minitab.com/blog/data-analysis-software/rss
Sat, 23 May 2015 11:50:13 +0000
FeedCreator 1.7.3
Making Data Analysis Easier with Coding Schemes, Part 2
http://blog.minitab.com/blog/statistics-support/http%3Ablogminitabcomblogstatistics-supportmaking-data-analysis-easier-with-coding-schemes-part-2
<p><span style="line-height: 1.6;">In my previous post, I showed you that the <a href="http://blog.minitab.com/blog/statistics-support/making-data-analysis-easier-with-coding-schemes-part-1">coefficients are different when choosing (-1,0,1) vs (1,0) coding schemes</a> for General Linear Model (or Regression). </span></p>
<p><span style="line-height: 1.6;">We used the two different equations to calculate the same fitted values. Here I will focus on showing what the different coefficients represent. </span></p>
<p>Let's use the data and models from the last blog post:</p>
<p style="margin-left: 40px;"><img alt="General Linear Model ouput" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/50b29d318d90483641720b8f8e98473f/coding_2_1.gif" style="width: 378px; height: 342px;" /></p>
<p><span style="line-height: 1.6;">We can display the means for each level by choosing <strong>Stat > ANOVA > General Linear Model > Fit General Linear Model > Options</strong>. These means are called fitted means, or least squares means. Note that when the design is not balanced, the fitted means will differ from the data means.</span></p>
<p style="margin-left: 40px;"><img alt="Fitted Means or Least Squares Means" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/20e567c59f9bc5a4cbc52e304ca63568/coding_2_2.gif" style="width: 222px; height: 107px;" /></p>
<p><span style="line-height: 1.6;">And we could calculate the overall mean using <strong>Stat > Basic Statistics > Display Descriptive Stats</strong>:</span></p>
<p style="margin-left: 40px;"><img alt="Descriptive Stats" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/5af702b2696c80d495cd3dfe6385d637/coding_2_3.gif" style="width: 139px; height: 54px;" /></p>
<p><span style="line-height: 1.6;">Using the (-1,0,1) coding scheme, the coefficients represent the <em>difference between each level mean and the <strong>overall </strong>mean</em>.</span></p>
<p>Using the (1,0) coding scheme, the coefficients represent the <em>difference between each level mean and the <strong>baseline level</strong>’s mean</em>.</p>
(-1,0,1) Coding
<p style="text-align: center;"><strong>Coeff = Mean for level </strong><strong style="line-height: 18.9090900421143px; text-align: center;">– </strong><strong>Overall Mean</strong></p>
<p style="text-align: center;"> </p>
<p style="text-align: center;">Coeffa = Mean for a – Overall Mean</p>
<p style="text-align: center;">-4.54 = 207.97 - 212.51</p>
<p style="text-align: center;">Coeffb = Mean for b – Overall Mean</p>
<p style="text-align: center;">1.46 = 213.97 - 212.51</p>
<p style="text-align: center;">Coeffc = Mean for c – Overall Mean</p>
<p style="text-align: center;">3.08 = 215.59 - 212.51</p>
(1,0) Coding
<p style="text-align: center;"><strong>Coeff = Mean for level – Mean for baseline level</strong></p>
<p style="text-align: center;"> </p>
<p style="text-align: center;">Coeffa = Mean for a – Mean for a</p>
<p style="text-align: center;">0.0 = 207.97 - 207.97</p>
<p style="text-align: center;">Coeffb = Mean for b –Mean for a</p>
<p style="text-align: center;">6.0 = 213.97 - 207.97</p>
<p style="text-align: center;">Coeffc = Mean for c –Mean for a</p>
<p style="text-align: center;">7.62 = 215.59 - 209.97</p>
<p><span style="line-height: 1.6;">After reading this two-part blog, I hope you see that you could get the same results when running General Linear Model and Regression on the same data set. </span></p>
<p><span style="line-height: 1.6;">When changing the coding scheme you can still expect the results to match, except for the coefficients and single equations. Although the coefficients are different, you'll get the same fitted values. And finally, I hope you see what the coefficients mean for each coding scheme.</span></p>
<p><span style="line-height: 1.6;">And remember, if you're using Minitab Statistical Software to analyze your data, our <a href="http://support.minitab.com">technical support team</a> is always ready to help you! </span></p>
Data Analysis | Regression Analysis | Statistics | Stats
Wed, 20 May 2015 12:00:00 +0000
http://blog.minitab.com/blog/statistics-support/http%3Ablogminitabcomblogstatistics-supportmaking-data-analysis-easier-with-coding-schemes-part-2
Michelle Shemo
Making Data Analysis Easier with Coding Schemes, Part 1
http://blog.minitab.com/blog/statistics-support/making-data-analysis-easier-with-coding-schemes-part-1
<p>Since Minitab 17 <a href="http://www.minitab.com/products/minitab">Statistical Software</a> launched in February 2014, we've gotten great feedback from the many people who have been using the General Linear Model and Regression tools.</p>
<p>But in speaking with people as part of Minitab's Technical Support team, I've found many are noticing that there are two coding schemes available with each. We frequently get calls asking how the coding scheme you choose affects your results. I'll show you here.</p>
General Linear Model vs. Regression
<p>First, let’s review Minitab’s General Linear Model (GLM) and Regression tools.</p>
<p>GLM uses a (-1,0,1) coding scheme by default. Regression uses (1,0) by default. If you make them match in the Coding sub-dialog box, you will get the same results.</p>
<p>Suppose you have a continuous dependent variable (Y), one categorical variable (Factor, with 3 levels, 1, 2, and 3) and 2 <a href="http://blog.minitab.com/blog/understanding-statistics/why-is-continuous-data-better-than-categorical-or-discrete-data">continuous variables</a> (X1 and X2), and you use the same coding scheme (-1,0,1) to analyze your data using both GLM and Regression. </p>
<p>Here's the dialog box you'll see when you select <strong>Stat > ANOVA > General Linear Model > Fit General Linear Model...</strong> and select the "Coding" options button. </p>
<p><img alt="Coding Dialog Box" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/6f3c12cf2f300637ec01bb217f571faf/coding_1_1.png" style="width: 412px; height: 272px; border-width: 1px; border-style: solid;" /></p>
<p><span style="line-height: 18.9090900421143px;">And here's the dialog box you'll see when you select </span><strong style="line-height: 18.9090900421143px;">Stat > Regression > Regression > Fit Regression Model...</strong><span style="line-height: 18.9090900421143px;"> and select the "Coding" options button. </span></p>
<p><img alt="Coding Dialog Box" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a939102f7c927e051fd91df2ca00d2d6/coding_1_2.png" style="width: 417px; height: 275px; border-width: 1px; border-style: solid;" /></p>
<p>And here are the results of the analyses:</p>
<p style="margin-left: 40px;"><img alt="General Linear Model output" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/b754345b80e2e17e120873606dd69c3b/coding_1_output1.gif" style="width: 618px; height: 372px;" /></p>
<p style="margin-left: 40px;"><img alt="Regression Analysis ouput" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/9f5c6f9ef86610bead6f4d8c40d2ccba/coding_1_output2.gif" style="width: 614px; height: 368px;" /></p>
<p><span style="line-height: 1.6;">Notice that aside from Regression having an additional line in its ANOVA table, and having a different subtitle ("Factor coding" instead of "Categorical predictor coding"), you get the same results.</span></p>
(-1,0,1) Coding Scheme vs. (1,0) Coding Scheme
<p>So what if you don’t make the coding schemes match and keep the default coding scheme for each?</p>
<p>Here is the output using GLM with the (1,0) coding scheme. (Note that the results would be the same if we ran Regression with the (1,0) scheme.)</p>
<p><span style="line-height: 1.6;">How do the results from (1,0) scheme differ from the results from the (-1,0,1) scheme above? </span></p>
<p style="margin-left: 40px;"><img alt="General Linear Model Output" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/4bfd5601e72cff38a19cdca7155caaa9/coding_1_output3.gif" style="width: 622px; height: 389px;" /></p>
<p><span style="line-height: 1.6;">Compare them and you'll see that coefficients and the equations are different. So what if you want to use the equations to calculate predicted values (i.e., “plug into the equation”)? How do you work with these two different equations?</span></p>
(-1,0,1) Coding Scheme
<p>Let’s return to the (-1,0,1) coding scheme. Here is the equation:</p>
<p style="margin-left: 40px;">Y = 205.44 + 1.158 X1 + 0.2416 X2 - 4.54 Factor_a + 1.46 Factor_b + 3.08 Factor_c</p>
<p>Use the actual values for the continuous factors, X1 and X2.</p>
<ul>
<li>To predict for Factor=a, plug in Factor_a=1, Factor_b=0, Factor_c=0</li>
<li>To predict for Factor=b, plug in Factor_a=0, Factor_b=1, Factor_c=0</li>
<li>To predict for Factor=c, plug in Factor_a=0, Factor_b=0, Factor_c=1</li>
</ul>
<p>Let’s try it! Let’s predict for X1=3.5, X2=6.0, Factor=a</p>
<p style="margin-left: 40px;"><img alt="Regression Equation" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/9f54179c17ebec3041cfcfdb07f2fb8b/coding_1_output4.gif" style="line-height: 18.9090900421143px; width: 604px; height: 57px;" /></p>
<p><span style="line-height: 1.6;">Now let’s predict for X1=6.5, X2=-9.9, Factor=c</span></p>
<p style="margin-left: 40px;"><span style="line-height: 1.6;"><img alt="Regression Equation" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/b538ba3fd14ea7ba8c46cdcb42071dd8/coding_1_output5.gif" style="width: 601px; height: 46px;" /></span></p>
<p>At this point, you may be wondering why this coding scheme is called (-1,0,1) if you only ever plug in 1 or 0.</p>
<p>With this coding scheme, only k-1 coefficients are needed to represent all k groups: the coefficient for the last level is implied by the others.</p>
<p>You sometimes even see the equation for the (-1,0,1) coding scheme written without the last level. The equation above could be written as:</p>
<p style="margin-left: 40px;">Y = 205.44 + 1.158 X1 + 0.2416 X2 - 4.54 Factor_a + 1.46 Factor_b</p>
<p>In this case:</p>
<ul>
<li>To predict for Factor=a, plug in Factor_a=1, Factor_b=0</li>
<li>To predict for Factor=b, plug in Factor_a=0, Factor_b=1</li>
<li>To predict for Factor=c, plug in Factor_a=-1, Factor_b=-1</li>
</ul>
<p>You get the same result when predicting for X1=6.5, X2=-9.9, Factor=c:</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/94115256d0bd362792ff26c3173021f1/coding_1_output7.gif" style="width: 457px; height: 31px;" /></p>
<p><span style="line-height: 1.6;">Note that (-4.54)*(-1)+1.46*(-1) = 3.08, which is the coefficient above for c.</span></p>
(1,0) Coding Scheme
<p>Now let’s switch to the (1,0) coding scheme. Here is the equation:</p>
<p style="margin-left: 40px;">Y = 200.90 + 1.158 X1 + 0.2416 X2 + 0.0 Factor_a + 6.00 Factor_b + 7.62 Factor_c</p>
<p>Use the actual values for the continuous factors, X1 and X2.</p>
<ul>
<li>To predict for Factor=a, plug in Factor_a=1, Factor_b=0, Factor_c=0</li>
<li>To predict for Factor=b, plug in Factor_a=0, Factor_b=1, Factor_c=0</li>
<li>To predict for Factor=c, plug in Factor_a=0, Factor_b=0, Factor_c=1</li>
</ul>
<p>Let’s predict for the same observations we did above.</p>
<p style="margin-left: 40px;"><strong>X1=3.5, X2=6.0, Factor=a</strong></p>
<p style="margin-left: 40px;"><img alt="Regression Equation" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a5a6fc93e59ffb65e4f823b450f4cfd0/coding_1_output9.gif" style="width: 589px; height: 44px;" /></p>
<p style="margin-left: 40px;"><strong><span style="line-height: 1.6;">X1=6.5, X2=-9.9, Factor=c</span></strong></p>
<p style="margin-left: 40px;"><img alt="Regression Equation" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c37d77f1c40e63e9a7b45d14ada4077d/coding_1_output10.gif" style="width: 582px; height: 50px;" /></p>
<p><span style="line-height: 1.6;">Notice both of these predictions (i.e., fitted values) are the same as those for the (-1,0,1) coding scheme.</span></p>
Single vs. Separate Equations
<p>In Minitab 17, you can display a single equation which contains the last level, or separate equations (using the Results sub-dialog box). When you display separate equations, the <em>coefficients in the table</em> will differ for the two coding schemes (as we saw above), but the equations, as well as the rest of the output, will match for the two coding schemes.</p>
<p>Factor</p>
<p style="margin-left: 40px;"><strong>a </strong>Y = 200.90 + 1.158 X1 + 0.2416 X2<br />
<span style="line-height: 1.6;"><strong>b </strong>Y = 206.90 + 1.158 X1 + 0.2416 X2</span><br />
<span style="line-height: 1.6;"><strong>c </strong>Y = 208.53 + 1.158 X1 + 0.2416 X2</span></p>
<p>In my next post, I’ll focus on showing what the different coefficients represent.</p>
<p> </p>
Data Analysis | Regression Analysis | Statistics | Stats
Tue, 19 May 2015 12:00:00 +0000
http://blog.minitab.com/blog/statistics-support/making-data-analysis-easier-with-coding-schemes-part-1
Michelle Shemo
Why Is Continuous Data "Better" than Categorical or Discrete Data?
http://blog.minitab.com/blog/understanding-statistics/why-is-continuous-data-better-than-categorical-or-discrete-data
<p>Earlier, I wrote about the <a href="http://blog.minitab.com/blog/understanding-statistics/understanding-qualitative-quantitative-attribute-discrete-and-continuous-data-types">different types of data</a> statisticians typically encounter. In this post, we're going to look at why, when given a choice in the matter, we prefer to analyze continuous data rather than categorical/attribute or discrete data. </p>
<p>As a reminder, when we assign something to a group or give it a name, we have created <strong>attribute </strong>or <strong>categorical </strong>data. If we count something, like defects, we have gathered <strong>discrete </strong>data. And if we can measure something to a (theoretically) infinite degree, we have <strong>continuous </strong>data.</p>
<p>Or, to put in bullet points: </p>
<ul>
<li><strong>Categorical </strong>= naming or grouping data</li>
<li><strong>Discrete </strong>= count data</li>
<li><strong>Continuous</strong> = measurement data</li>
</ul>
<p>A <a href="http://www.minitab.com/products/minitab" style="font-size: 13px; line-height: 18.9090900421143px;">statistical software package</a><span style="font-size: 13px; line-height: 18.9090900421143px;"> like Minitab is extremely powerful and can tell us many valuable things</span><span style="font-size: 13px; line-height: 18.9090900421143px;">—as long as we're able to feed it good numbers. Without numbers, we have no analyses or graphs. Even categorical or</span><span style="font-size: 13px; line-height: 18.9090900421143px;"> attribute data needs to be converted into numeric form by counting before we can analyze it. </span></p>
What Makes Numeric Data Discrete or Continuous?
<p>At this point, you may be thinking, "Wait a minute—we can't <em>really </em>measure <em>anything </em>infinitely, so isn't measurement data actually discrete, too?" That's a fair question. </p>
<p>If you're a strict literalist, the answer is "yes"—when we measure a property that's continuous, like height or distance, we are <i>de facto </i>making a discrete assessment. When we collect a lot of those discrete measurements, it's the amount of detail they contain that will dictate whether we can treat the collection as discrete or continuous.</p>
<p>I like to think of it as a question of scale. Say <span style="line-height: 1.6;">I want to measure the weight of 16-ounce cereal boxes coming off a production line, and I want to be sure that the weight of each box is at least 16 ounces, but no more than 1/2 ounce over that. </span></p>
<p><span style="line-height: 1.6;">With a scale calibrated to whole pounds, all I can do is put every box into one of three categories: less than a pound, 1 pound, or more than a pound. </span></p>
<p>With a scale that can distinguish ounces, I will be able to measure with a bit more accuracy just how close to a pound the individual boxes are. I'm getting nearer to continuous data, but there are still only 16 gradations within each pound. </p>
<p>But if I measure with a scale capable of distinguishing 1/1000th of an ounce, I will have quite a wide scale—a <em>continuum</em>—of potential values between pounds. The individual boxes could have any value between 0.000 and 1.999 pounds. The scale of these measurements is fine enough to be analyzed with powerful statistical tools made for continuous data. </p>
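<p>The effect of scale resolution can be sketched by rounding a "true" weight to different tick sizes. The function and the sample weight below are hypothetical, purely to illustrate the idea.</p>

```python
def measure(true_weight_lb, ticks_per_pound):
    """Simulate a scale that reads in increments of 1/ticks_per_pound."""
    return round(true_weight_lb * ticks_per_pound) / ticks_per_pound

w = 1.0137  # a hypothetical true box weight, in pounds

print(measure(w, 1))      # whole pounds: the box just reads 1.0
print(measure(w, 16))     # ounces: still reads 1.0 lb (16 oz)
print(measure(w, 16000))  # 1/1000 oz: 1.0136875 lb, nearly continuous
```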
What Can I Do with Continuous Data that I Can't Do with Discrete?
<p>Not all data points are equally valuable, and you can glean a lot more insight from 100 points of continuous data than you can from 100 points of attribute or count data. <span style="line-height: 18.9090900421143px;">How does this finer degree of detail affect what we can learn from a set of data?</span><span style="line-height: 18.9090900421143px;"> It's easy to see. </span></p>
<p>Let's start with the simplest kind of data, attribute data that rates the weight of a cereal box as good or bad. For 100 boxes of cereal, any that are under 1 pound are classified as bad, so each box can have one of only two values.</p>
<p>We can create a bar chart or a pie chart to visualize this data, and that's about it:</p>
<p><img alt="Attribute Data Bar Chart" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/9a3aaad00a1a5858433f17bfd121f465/attribute_data_bar_chart.png" style="width: 576px; height: 384px;" /></p>
<p>If we bump up the precision of our scale to differentiate between boxes that are over and under 1 pound, we can put each box of cereal into one of three categories. Here's what that looks like in a pie chart:</p>
<p><img alt="pie chart of count data" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ae87a08eae95accccbc82b97fe3f0ced/pie_chart_of_count_data.png" style="width: 576px; height: 384px;" /></p>
<p>This gives us a little bit more insight—we now see that we are overfilling more boxes than we are underfilling—but there is still a very limited amount of information we can extract from the data. </p>
<p>If we measure each box to the nearest ounce, we open the door to using methods for continuous data, and get a still better picture of what's going on. We can see that, on average, the boxes weigh 1 pound. But there's high variability, with a standard deviation of 0.9. There's also a wide range in our data, with observed values from 12 to 20 ounces: </p>
<p><img alt="graphical summary of ounce data" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/26b4e51027b7afa154d0e6e3f14ab8e9/summary_statistics_for_ounces.png" style="width: 575px; height: 431px;" /></p>
<p>If I measure the boxes with a scale capable of differentiating thousandths of an ounce, more options for analysis open up. For example, now that the data are fine enough to distinguish half-ounces (and then some), I can perform a capability analysis to see if my process is even capable of consistently delivering boxes that fall between 16 and 16.5 ounces. I'll use the Assistant in Minitab to do it, selecting <strong>Assistant > Capability Analysis</strong>: </p>
<p><img alt="capability analysis for thousandths" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/0b0a37d1515c25b2e1d8d633b09da447/capability_analysis_for_thousandths___summary_report.png" style="width: 575px; height: 431px;" /></p>
<p>The analysis has revealed that my process isn't capable of meeting specifications. Looks like I have some work to do...but the Assistant also gives me an I-MR control chart, which reveals where and when my process is going out of spec, so I can start looking for root causes.</p>
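<p>A capability index like Cpk summarizes how the process spread sits relative to the spec limits: the distance from the mean to the nearest limit, over three standard deviations. The sketch below uses that textbook formula with hypothetical box weights; it is not the Assistant's exact computation, which also checks things like stability and normality.</p>

```python
import statistics

def cpk(data, lsl, usl):
    """Capability index: distance to the nearest spec limit over 3 sigma."""
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    return min(usl - mu, mu - lsl) / (3 * sigma)

# Hypothetical box weights in ounces, against the 16-16.5 oz spec
weights = [16.12, 16.31, 16.05, 16.44, 16.58, 15.97, 16.22]
print(round(cpk(weights, lsl=16.0, usl=16.5), 2))
```

A Cpk well below 1 (as here) signals a process that cannot consistently stay within spec.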
<p><img alt="IMR Chart" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/df4a5f568e1d931ddcb96404fd888547/imr_chart.png" style="width: 575px; height: 224px;" /></p>
<p>If I were only looking at attribute data, I might think my process was just fine. Continuous data has allowed me to see that I can make the process better, and given me a rough idea where to start. <span style="line-height: 1.6;">By making changes and collecting additional continuous data, I'll be able to conduct hypothesis tests, analyze sources of variances, and more. </span></p>
Some Final Advantages of Continuous Over Discrete Data
<p>Does this mean discrete data is no good at all? Of course not—we are concerned with many things that can't be measured effectively except through discrete data, such as opinions and demographics. But when you can get it, continuous data is the better option. The table below lays out the reasons why. </p>
<table>
<tr><th>Continuous Data</th><th>Discrete Data</th></tr>
<tr><td>Inferences can be made with few data points—valid analysis can be performed with small samples.</td><td>More data points (a larger sample) needed to make an equivalent inference.</td></tr>
<tr><td>Smaller samples are usually less expensive to gather.</td><td>Larger samples are usually more expensive to gather.</td></tr>
<tr><td>High sensitivity (how close to or far from a target).</td><td>Low sensitivity (good/bad, pass/fail).</td></tr>
<tr><td>Variety of analysis options that can offer insight into the sources of variation.</td><td>Limited options for analysis, with little indication of sources of variation.</td></tr>
</table>
<p>I hope this very basic overview has effectively illustrated why you should opt for continuous data over discrete data whenever you can get it. </p>
Data Analysis | Statistics | Statistics Help
Mon, 18 May 2015 12:00:00 +0000
http://blog.minitab.com/blog/understanding-statistics/why-is-continuous-data-better-than-categorical-or-discrete-data
Eston Martz
Looking at the Avengers with Conditional Formatting
http://blog.minitab.com/blog/quality-data-analysis-and-statistics/looking-at-the-avengers-with-conditional-formatting
<p style="line-height: 20.7999992370605px;">The first summer blockbuster of 2015 was released two weeks ago—<em>The Avengers: Age of Ultron</em>. The first Avengers film featured a pretty well known cast of superheroes (if, of course, you’re a superhero fan). However, in the 40-year run of the Avengers comic book, that team has evolved to keep the material fresh and to allow some characters to go their solo ways.</p>
<p style="line-height: 20.7999992370605px;"><a href="http://marvel.com/comics/issue/6953/avengers_1963_10" target="_blank"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/341a4ed2fcbb444e80672d3c65ed7412/avngrs_150.jpg" style="margin: 10px 15px; float: right; width: 152px; height: 228px;" /></a>I want to use Minitab's <a href="http://www.minitab.com/products/minitab">statistical software</a> to look at some characteristics of the ever-evolving Avengers roster, as well as compare the first roster to the one we see on film. We’ll only be focusing on Volume 1 of the comic book series, which ran from 1963 to 1966, for 407 issues.</p>
<p style="line-height: 20.7999992370605px;">On GitHub, you can find <a href="https://github.com/fivethirtyeight/data/blob/master/comic-characters/marvel-wikia-data.csv" target="_blank">a list of all superheroes/villains ever created for the Marvel Universe</a> along with these characteristics:</p>
<ul style="line-height: 20.7999992370605px;">
<li>Is the character good, bad or neutral?</li>
<li>Eye color</li>
<li>Hair color</li>
<li>Sex of the character</li>
<li>Sexual orientation</li>
<li>Alive or deceased</li>
<li># of Total Appearances as of Sep 2, 2014</li>
<li>The month and year of character's first appearance, if available</li>
</ul>
Creating a Subset of Data about <em>The Avengers</em>
<p style="line-height: 20.7999992370605px;">I started my data analysis with a new tool added to our most recent release of Minitab, conditional data formatting. I right-clicked on the name column and selected <strong>Conditional Formatting > Highlight Cell > Match from List</strong>. I then looked for the characters from the team in the 2012 film <em>The Avengers</em>:</p>
<ul style="line-height: 20.7999992370605px;">
<li>Iron Man</li>
<li>Captain America</li>
<li>Thor</li>
<li>Black Widow-Natalia Romanova</li>
<li>Hawkeye</li>
<li>The Hulk</li>
</ul>
<p style="text-align: center;"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/18ad5c84cda040cc896b42fa2c4cd6c0/pic1.png" style="line-height: 20.7999992370605px; border-width: 0px; border-style: solid; width: 598px; height: 450px;" /></p>
<p style="line-height: 20.7999992370605px; text-align: center;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/2d0b5698c0a3ffa5d580851f0ad031f2/pic2.png" style="width: 316px; height: 392px;" /></p>
<p style="line-height: 20.7999992370605px;">Next, I right-clicked on my Name column again and chose <strong>Subset Worksheet > Include Rows with Formatted Cells.</strong></p>
<p style="line-height: 20.7999992370605px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/77888a5a463185d9235a724f6710eba5/pic3_w1024.png" /></p>
<p style="line-height: 20.7999992370605px;">Interestingly, all the characters on this team have gone public with their identity. Well, Thor is listed as “No Dual Identity,” but wouldn’t that mean he’s still public?</p>
<p style="line-height: 20.7999992370605px;">Another fascinating tidbit is that the <em>youngest </em>Avenger on screen, Black Widow, is 51 years old. We’re watching characters that were created over 50 years ago!</p>
Which Marvel Superhero Is Most Likely to Show Up?
<p style="line-height: 20.7999992370605px;">Among the movie Avengers, Captain America has the highest number of in appearances in comic books at 3,360, but how does he rank against the entire Marvel universe? Returning to the original worksheet, I highlighted the Appearances column, right-clicked and navigated to <strong>Conditional Formatting > High/Low > Highest Values</strong>.</p>
<p style="line-height: 20.7999992370605px; text-align: center;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/23002d854556023acff07b062aa08dcb/pic4.png" style="width: 312px; height: 136px;" /></p>
<p style="line-height: 20.7999992370605px;">After pressing OK, I right clicked on the column again and went to <strong>Sort > Entire Worksheet > Formatted Cells at the Top.</strong></p>
<p style="line-height: 20.7999992370605px; text-align: center;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/6deed47c584cd984ce2660b7a93b5fc4/pic5.png" style="width: 553px; height: 249px;" /></p>
<p style="line-height: 20.7999992370605px;">Four of the six characters from the Avengers film appear in the above list. Reed Richards, Ben Grimm, and Jonathan Storm make up part of the superhero group The Fantastic Four. Scott Summers, also known as Cyclops, and Wolverine are part of the <span style="line-height: 20.7999992370605px;">X-Men. Not surprisingly, the very popular Spider-man is at the top. Reed Richards, Ben Grimm, and Spidey were part of the Avengers at one point as well.</span></p>
<p style="line-height: 20.7999992370605px;">The Avengers #1 (Sept 1963) did not feature Captain America, Black Widow, nor Hawkeye. (Although Cap did replace the Hulk in issue 2.) Here are the founders:</p>
<ul style="line-height: 20.7999992370605px;">
<li>Iron Man</li>
<li>Thor</li>
<li>Henry Pym (Ant-Man, Giant-Man)</li>
<li>Janet Van Dyne Pym (Wasp)</li>
<li>Hulk</li>
</ul>
<p style="line-height: 20.7999992370605px;">Wasp has 1,120 appearances in comics and Ant-Man has 1,237. That may not seem like so many compared to the Hulk’s appearances, but they’re right up there with popular Marvel characters such as Charles Xavier(Professor X) and Nick Fury.</p>
<p style="line-height: 20.7999992370605px; text-align: center;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/53d02bc27c8f2c62a70f2c3edc44209f/pic6.png" style="width: 555px; height: 264px;" /></p>
Identity, Gender and Survival Among the Avengers
<p style="line-height: 20.7999992370605px;">I then went ahead tagged all characters in the Marvel Universe who at one point or another were part of the Avengers team. Using Minitab’s <a href="http://blog.minitab.com/blog/applying-statistics-in-quality-projects/analyzing-qualitative-data-part-1-pareto-pie-and-stacked-bar-charts">Pie Chart</a>, I looked at a few more characteristics from the data set mentioned earlier. (Keep in mind I am only looking at rosters for the Volume 1 comic series) Here is a chart of the percentage of Avengers who have chosen to reveal their identity to the public:</p>
<p style="line-height: 20.7999992370605px; text-align: center;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/8d34d01723f8fe78e7b7e2c1d9d15fd1/pic7.png" style="width: 577px; height: 385px;" /></p>
<p style="line-height: 20.7999992370605px;"><span style="line-height: 20.7999992370605px;">I wasn’t expecting a split like this. I began wondering if, at any given time, the rosters mostly fit into just one specific category. Or was it always a sort of “mix and match”, where half an Avengers' roster would keep their identities secret?</span></p>
<p style="line-height: 20.7999992370605px;">How about Gender?</p>
<p style="line-height: 20.7999992370605px; text-align: center;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/e9553d17d4366625f6ca51ce58cf5969/pic8.png" style="width: 577px; height: 385px;" /></p>
<p style="line-height: 20.7999992370605px;">I frankly thought the percentage for men was going to be higher than 68.1 percent. Unfortunately, this graph doesn’t indicate how long each character stayed on the team. It would be interesting to look at longevity and how long, on average, female characters stayed on the roster compared to the men.</p>
<p style="line-height: 20.7999992370605px;">Our last pie chart takes a look at what percentage are still alive. Sadly, it’s not all fun and games at the Avengers Mansion.</p>
<p style="line-height: 20.7999992370605px; text-align: center;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/ec99463efc9526647194b70fbdff9d11/pic9.png" style="width: 577px; height: 385px;" /></p>
<p style="line-height: 20.7999992370605px;">However, if you split the above chart by gender:</p>
<p style="line-height: 20.7999992370605px; text-align: center;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/364a2ab655ee34032ce7f6da3745a013/pic10.png" style="width: 577px; height: 385px;" /></p>
<p style="line-height: 20.7999992370605px;">If you’re a female superhero, joining the Avengers is the safest decision you could make!</p>
<p style="line-height: 20.7999992370605px;">I hope you enjoyed this diversion into the world of comics and superhero fandom! </p>
Fun StatisticsMon, 11 May 2015 14:06:00 +0000http://blog.minitab.com/blog/quality-data-analysis-and-statistics/looking-at-the-avengers-with-conditional-formattingAndy CheshireImproving Recycling Processes at Rose-Hulman, Part III
http://blog.minitab.com/blog/real-world-quality-improvement/improving-recycling-processes-at-rose-hulman-part-iii
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ccb8f6d6-3464-4afb-a432-56c623a7b437/Image/fa7a4559e547be217d5fa38f61c978c1/landfill.jpg" style="float: right; width: 350px; height: 253px; margin: 10px 15px;" />In previous posts, I discussed the results of a recycling project done by Six Sigma students at Rose-Hulman Institute of Technology last spring. (If you’re playing catch up, you can read <a href="http://blog.minitab.com/blog/real-world-quality-improvement/a-little-trash-talk3a-improving-recycling-processes-at-rose-hulman" target="_blank">Part I</a> and <a href="http://blog.minitab.com/blog/real-world-quality-improvement/a-little-trash-talk%3A-improving-recycling-processes-at-rose-hulman%2C-part-ii" target="_blank">Part II</a>.)</p>
<p>The students did an awesome job reducing the amount of recycling that was thrown into the normal trash cans across all of the institution’s academic buildings. At the beginning of the spring quarter (2014), 36% of the trash (by weight) consisted of recyclable items; by the end of that quarter, the figure was down to 24%, so you can see that they were very successful in reducing this percentage!</p>
<p>The fall quarter (2015) brought a new set of Six Sigma students to Rose-Hulman who were just as dedicated to reducing the amount of recycling thrown into normal trash cans, and I want to cover their success in this post, as well as some of the neat statistical methods they used when completing their project.</p>
Fall 2015 goals
<p>This time around, the students wanted to at least maintain, if not improve on, the percentage the spring quarter (2014) students were able to achieve. They set out with a specific goal to reduce the amount of recycling in the trash to 20% by weight.</p>
<p>In order to further reduce the recyclables in the academic buildings in fall 2015, the standard “Define, Measure, Analyze, Improve, Control” (DMAIC) methodology of Six Sigma was once again implemented. The main project goal focused on standardizing the recycling process within the buildings, and their plan to reduce the amount of recyclables focused on optimizing the operating procedure for collecting recyclables in all academic building areas (excluding classrooms) where trash and recycling are collected.</p>
<p>Many of the same DMAIC tools that were used by spring 2014 students were also used here, including—<a href="http://support.minitab.com/quality-companion/3/help-and-how-to/run-projects/brainstorming/ct-tree/" target="_blank">Critical to Quality Diagrams</a>, <a href="http://support.minitab.com/quality-companion/3/help-and-how-to/run-projects/maps/process-map/" target="_blank">Process Maps</a>, <a href="http://blog.minitab.com/blog/real-world-quality-improvement/spicy-statistics-and-attribute-agreement-analysis" target="_blank">Attribute Agreement Analysis</a>, <a href="http://blog.minitab.com/blog/marilyn-wheatleys-blog/evaluating-a-gage-study-with-one-part-v2" target="_blank">Gage R&R</a>, Statistical Plots, <a href="http://blog.minitab.com/blog/adventures-in-software-development/risk-based-testing-at-minitab-using-quality-companions-fmea" target="_blank">FMEA</a>, <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-tutorial-and-examples" target="_blank">Regression</a>—among many others.</p>
Making and measuring improvements
<p>The spring 2014 initiative added recycling bins to every classroom, which created a measurable improvement. The fall 2015 effort focused on improvement through <em>standardization of operation</em>. For example, many areas in the academic buildings suffer from random placement and arrangement of trash cans and recycling bins. The students thought standardization of bin areas (one trash, one plastic/aluminum recycling, and one paper recycling) would lessen the confusion of recycling, and clear signage and stickers on identically shaped trash cans and recycling bins would be better visual cues of where to place waste of both kinds.</p>
<p>For fall 2015, there were seven teams, and they were assigned different academic building floors (not including classrooms) and common areas. Unlike the spring 2014 data collection, the teams did not combine the trash from their assigned areas. They treated each recycling station as a unique data point.</p>
<p>After implementing the improvements to standardize the bins, the teams collected data for four days across twenty-nine total stations. Thus, there were a total of 116 fall 2015 improvement percentages. The fall 2015 students used the post-improvement percentage of recyclables in the trash from spring 2014 (24%) as their baseline for determining improvement in fall 2015.</p>
<p>The descriptive statistics for the percentage of recyclables (by weight) in the trash were as follows:</p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ccb8f6d6-3464-4afb-a432-56c623a7b437/Image/5c77690aaaff21d0b33eb5083f82074e/descriptive_stats.jpg" style="border-width: 0px; border-style: solid; width: 550px; height: 67px;" /></p>
<p>Below, the students put together a histogram and a boxplot of the data using <a href="http://www.minitab.com/products/minitab/features/" target="_blank">Minitab Statistical Software</a>. Over half of the stations (61 out of 116) had less than 5% of recyclables in the trash. Forty-six of the 116 recycling stations had no recyclables. The value of the third quartile (16.6%) meant that 75% of the stations had less than 16.6% recyclables. The descriptive statistics above showed that the sample mean was much larger than the sample median. The graphs confirmed that this must be the case because of the strong positively skewed shape of the data.</p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ccb8f6d6-3464-4afb-a432-56c623a7b437/Image/4e730181a9288e531ff9caf69a347dd0/histogram.jpg" style="border-width: 0px; border-style: solid; width: 624px; height: 206px;" /></p>
<p>Even though the 116 data points didn’t follow a normal distribution and there was a large mound of 0’s as part of the distribution from collection spots that had no recyclables, the students trusted that the <a href="http://blog.minitab.com/blog/understanding-statistics/how-the-central-limit-theorem-works" target="_blank">Central Limit Theorem</a> with a sample size of 116 would generate a sampling distribution of the means that was normally distributed. Because of the large sample size and unknown standard deviation, they used a <em>t</em> distribution to create a 95% confidence interval for the true mean percentage of recyclables in the trash for fall 2015.</p>
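<p>The Central Limit Theorem behavior the students relied on is easy to demonstrate outside Minitab. The Python sketch below uses a made-up zero-inflated, right-skewed population of station percentages (an assumption for illustration, not the students' data): even though the population is far from normal, the means of repeated samples of size 116 form a nearly symmetric distribution.</p>

```python
import random
import statistics

random.seed(42)

# Hypothetical population of station percentages: ~40% of stations contain
# no recyclables at all, the rest follow a strongly right-skewed distribution.
population = [0.0 if random.random() < 0.4 else random.expovariate(1 / 20)
              for _ in range(100_000)]

# Draw many samples of n = 116 and record each sample mean.
means = [statistics.fmean(random.sample(population, 116)) for _ in range(2000)]

# The population is heavily skewed, but the sampling distribution of the
# mean is nearly symmetric: its mean and median land close together.
print(round(statistics.fmean(means), 2), round(statistics.median(means), 2))
```

<p>With the skew largely washed out at n = 116, a t-based interval for the mean is reasonable, which is exactly what the students used next.</p>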
<p>Also using Minitab, they constructed the 95% confidence interval:</p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ccb8f6d6-3464-4afb-a432-56c623a7b437/Image/2ccf17f68f0055c32282c2020f2c9108/one_sample_t.jpg" style="border-width: 0px; border-style: solid; width: 423px; height: 48px;" /></p>
<p>The 95% confidence interval meant that the students were 95% certain that the interval [9.94, 18.22] contains the true mean percentage of recyclables in the trash for fall 2015. At an alpha level of 0.025, they were able to reject the null hypothesis, where H0: μ = 24% versus Ha: μ < 24%, because 24% fell above the upper limit of the two-sided 95% confidence interval. (Remember that 24% was the mean percentage of recyclables in trash after the spring 2014 improvement phase.) The null hypothesis for H0: μ = 20% versus Ha: μ < 20% was rejected for the same reason, since 20% also lies above the interval. This meant that they had met their goal to reduce the percentage of recyclables in the trash to below 20% for this project!</p>
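<p>For readers who want to reproduce this interval and test outside Minitab, here is a rough Python sketch using scipy. The students' raw data aren't given in the post, so the sample standard deviation below is back-calculated from the reported interval; treat it as an assumption, not their actual statistic.</p>

```python
import math
from scipy import stats

# Approximate summary statistics (the SD is inferred from the reported
# interval [9.94, 18.22] and is an assumption, not the students' data).
n = 116
mean = 14.08   # midpoint of the reported 95% confidence interval
sd = 22.5      # assumed sample standard deviation

se = sd / math.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)          # two-sided 95% critical value
ci = (mean - t_crit * se, mean + t_crit * se)  # close to [9.94, 18.22]

# One-sided test of H0: mu = 20 versus Ha: mu < 20
t_stat = (mean - 20) / se
p_value = stats.t.cdf(t_stat, df=n - 1)
print(ci, p_value)
```

<p>Because the upper confidence limit sits below 20, the one-sided p-value comes out well under the 0.025 alpha level, matching the students' conclusion.</p>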
Continuing to analyze the data
<p>The students also subgrouped their data by collection day. Each day consisted of data from 29 recycling stations. The comparative boxplots and individual value plots below show the percentage of recyclables in the trash across the four collection dates. (The horizontal dotted line in the boxplot is the mean from spring 2014’s post-improvement data.)</p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ccb8f6d6-3464-4afb-a432-56c623a7b437/Image/664e8bf0f443d278376e71a70817e727/ivp.jpg" style="border-width: 0px; border-style: solid; width: 624px; height: 207px;" /></p>
<p>Though all four collection days have sample means less than 24%, the boxplots show that the first three collection days are clearly below 24%, and the medians from all four days are less than 11%. The individual value plots reveal the large number of 0’s on each day, which represented collection spots that had no recyclables. Both graphs display the positively skewed nature of the data; because of that skewness, each day’s mean is much larger than its median.</p>
How capable was the process?
<p>Next, the students ran a <a href="http://blog.minitab.com/blog/real-world-quality-improvement/using-statistics-to-show-your-boss-process-improvements" target="_blank">process capability analysis</a> for the seven areas where trash was collected over four days:</p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ccb8f6d6-3464-4afb-a432-56c623a7b437/Image/8f9b85a55164f9e957809a8be1eef1c0/process_cap.jpg" style="border-width: 0px; border-style: solid; width: 465px; height: 347px;" /></p>
<p>The process capability indices were Pp = 0.48 and Ppk = 0.42. (The Pp value corresponds to a 1.44 Sigma Level, while the Ppk value corresponds to a 1.26 Sigma Level.) Recall that the previous Ppk value after improvements in <a href="http://blog.minitab.com/blog/real-world-quality-improvement/a-little-trash-talk%3A-improving-recycling-processes-at-rose-hulman%2C-part-ii" target="_blank">spring 2014</a> was 0.22. The fall 2015 index is almost double that value!</p>
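<p>The sigma levels quoted above follow directly from the indices: a capability index corresponds to three times as many process sigmas. A minimal sketch of the conversion:</p>

```python
def sigma_level(index: float) -> float:
    """Convert a capability index (Pp, Ppk, Cp, or Cpk) to a sigma level:
    each unit of the index corresponds to 3 process sigmas."""
    return round(3 * index, 2)

# The values reported in the post:
print(sigma_level(0.48))  # Pp  -> 1.44
print(sigma_level(0.42))  # Ppk -> 1.26
```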
<p>The students knew that they still needed to account for the total weight of the trash and recyclables by calculating the percentage of recyclables per station. Some collection stations with the highest percentage of recyclables had the lowest total weight, while some stations with the lowest percentage of recyclables had the highest total weight. Instead of strictly using a capability index to indicate their improvement, they incorporated a <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-tutorial-and-examples" target="_blank">regression</a> model for the trash weight versus the total weight of trash and recyclables to show that the percentage of recyclables in the trash was less than 20%.</p>
<p>The 95% confidence interval for the true mean slope of the regression line was [0.856, 0.954]. The students were 95% certain that the trash weight was somewhere between 0.856 and 0.954 of the total weight of the collection. Hence, the recycling weight was between 0.046 and 0.144 of the total weight, which is clearly below 20% with 95% confidence! From this, they were able to state through yet another type of analysis that there was a statistically significant improvement over the spring 2014 recycling project, and that they met their goal of reducing the percentage of recyclables in the trash to below 20%. Compared to the spring 2014 project where 24% of the trash was recyclables, the fall 2015 students saved <em>at least</em> 4% more recyclables from ending up in the local landfill!</p>
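<p>The slope-based argument can be sketched in plain Python. The students' raw weights aren't published, so the data below are synthetic with an assumed true slope of about 0.9; the mechanics (least-squares slope, its standard error, and a t-based interval) mirror the analysis described above.</p>

```python
import math
import random

random.seed(7)

# Synthetic stand-in for the station data: total collection weight versus
# trash-only weight, with an assumed true slope of 0.9.
total = [random.uniform(5, 20) for _ in range(29)]
trash = [0.9 * t + random.gauss(0, 0.5) for t in total]

n = len(total)
mx, my = sum(total) / n, sum(trash) / n
sxx = sum((x - mx) ** 2 for x in total)
sxy = sum((x - mx) * (y - my) for x, y in zip(total, trash))

slope = sxy / sxx
intercept = my - slope * mx
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(total, trash))
se_slope = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)

t_crit = 2.052  # t critical value for 95% confidence, df = 27
ci = (slope - t_crit * se_slope, slope + t_crit * se_slope)
print(ci)  # 1 - slope estimates the recyclable share of total weight
```

<p>If the whole interval sits well below 1, the trash fraction of the total weight is bounded away from 100%, which is how the students bounded the recyclable share below 20%.</p>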
<p>For even more on this topic, be sure to check out Rose-Hulman student Peter Olejnik’s blog posts on how he and the recycling project team at the school used regression to evaluate project results:</p>
<p><a href="http://blog.minitab.com/blog/statistics-in-the-field/using-regression-to-evaluate-project-results%2C-part-1" target="_blank">Using Regression to Evaluate Project Results, part 1</a></p>
<p><a href="http://blog.minitab.com/blog/statistics-in-the-field/using-regression-to-evaluate-project-results%2C-part-2" target="_blank">Using Regression to Evaluate Project Results, part 2</a></p>
<p><em>Many thanks to Dr. Diane Evans for her contributions to this post!</em></p>
Data AnalysisFun StatisticsHypothesis TestingLean Six SigmaLearningSix SigmaStatisticsStatsFri, 08 May 2015 12:00:00 +0000http://blog.minitab.com/blog/real-world-quality-improvement/improving-recycling-processes-at-rose-hulman-part-iiiCarly BarryA Field Guide to Statistical Distributions
http://blog.minitab.com/blog/statistics-in-the-field/a-field-guide-to-statistical-distributions
<p><em><span style="line-height: 1.6;">by Matthew Barsalou, guest blogger. </span></em></p>
<p>The old saying “if it walks like a duck, quacks like a duck and looks like a duck, then it must be a duck” may be appropriate in bird watching; however, the same idea can’t be applied when observing a statistical distribution. The dedicated ornithologist is often armed with binoculars and a field guide to the local birds, and this should be sufficient. A statologist (I just made that word up; feel free to use it), on the other hand, is ill-equipped for the visual identification of his or her targets.</p>
Normal, Student's t, Chi-Square, and F Distributions
<p>Notice the upper two distributions in figure 1. The <span><a href="http://blog.minitab.com/blog/fun-with-statistics/normal-the-kevin-bacon-of-distributions">normal distribution</a></span> and Student's t distribution may appear similar. However, the shape of the <a href="http://blog.minitab.com/blog/michelle-paret/guinness-t-tests-and-proving-a-pint-really-does-taste-better-in-ireland">Student's t distribution</a> depends on its degrees of freedom (n − 1), and when n is small it has noticeably heavier tails than the normal distribution. Student's t distribution approaches the normal distribution as the sample size increases, but it never truly matches the shape of the normal distribution.</p>
<p>Observe the Chi-square and F distribution in the lower half of figure 1. The shapes of the distributions can vary and even the most astute observer will not be able to differentiate between them by eye. Many distributions can be sneaky like that. It is a part of their nature that we must accept as we can’t change it.</p>
<p align="center"><img alt="Distribution Field Guide Figure 1" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/b5c12365f066b6ca3d255bcd458314e1/distribution_field_guide_1.gif" style="width: 605px; height: 352px;" /><em><span style="line-height: 1.6;">Figure 1</span></em></p>
Binomial, Hypergeometric, Poisson, and Laplace Distributions
<p>Notice the distributions illustrated in figure 2. A bird watcher may suddenly encounter four birds sitting in a tree; a quick check of a reference book may help to determine that they are all of a different species. The same can’t always be said for statistical distributions. <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-and-using-discrete-distributions">Observe the binomial distribution, hypergeometric distribution and Poisson distribution</a>. We can’t even be sure the three are not the same distribution. If they are together with a Laplace distribution, an observer may conclude “one of these does not appear to be the same as the others.” But they <em>are </em>all different, which our eyes alone may fail to tell us.</p>
<p align="center"><img alt="Distribution Field Guide Figure 2" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/b9011bf86767f49c3e7ec47c76d20631/distribution_field_guide_2.gif" style="width: 605px; height: 352px;" /><em><span style="line-height: 1.6;">Figure 2</span></em></p>
Weibull, Cauchy, Loglogistic, and Logistic Distributions
<p>Suppose we observe the four distributions in figure 3. What are they? Could you tell if they were not labeled? We must identify them correctly before we can do anything with them. One is a Weibull distribution, but all four could conceivably be various Weibull distributions. The shape of the Weibull distribution varies based upon the shape parameter (κ) and scale parameter (λ). The Weibull distribution is a useful, but potentially devious distribution that can be much like the double-barred finch, which may be mistaken for an owl at first glance.</p>
<p align="center"><img alt="Distribution Field Guide Figure 3" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/2b606d88ff9ae159f94dcac04748c3e2/distribution_field_guide_3.gif" style="width: 605px; height: 351px;" /><em><span style="line-height: 1.6;">Figure 3</span></em></p>
<p>Attempting to visually identify a statistical distribution can be very risky. Many distributions, such as the Chi-Square and F distributions, change shape drastically based on the number of degrees of freedom. Figure 4 shows various shapes for the Chi-Square, F, and Weibull distributions. Figure 4 also compares a standard normal distribution to a t distribution with 27 degrees of freedom; notice how the shapes overlap to the point where it is no longer possible to tell the two distributions apart.</p>
<p>Although there is no definitive Field Guide to Statistical Distributions to guide us, there are formulas available to correctly identify statistical distributions. We can also use <a href="http://www.minitab.com/products/minitab">Minitab Statistical Software</a> to identify our distribution.</p>
<p align="center"><img alt="Distribution Field Guide Figure 4" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/aa4be49733e980c8c7e26395c5e8262a/distribution_field_guide_4.gif" style="width: 605px; height: 351px;" /><em style="line-height: 1.6;">Figure 4</em></p>
<p>Go to <strong>Stat > Quality Tools > Individual Distribution Identification...</strong> and enter the column containing the data and the subgroup size. The results can be observed in either the session window (figure 5) or the graphical outputs shown in figures 6 through 9.</p>
<p>In this case, the p-value of 0.364 indicates that the data are consistent with a 3-parameter Weibull distribution: we fail to reject the hypothesis that this distribution fits.</p>
<p align="center"><img alt="Distribution Field Guide Figure 5" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/29448180c3ff01cae81cfaf250a60115/distribution_field_guide_5.gif" style="width: 547px; height: 739px;" /></p>
<p align="center"><em>Figure 5</em></p>
<p style="text-align: center;"><img alt="Distribution Field Guide Figure 6" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/781c7a83b14261ae062c63a07479b10d/distribution_field_guide_6.png" style="width: 576px; height: 384px;" /><em style="line-height: 1.6;">Figure 6</em></p>
<p style="text-align: center;"><img alt="Distribution Field Guide Figure 7" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/fcf5a7b56b859e6861ae8d96e8273fe1/distribution_field_guide_7.png" style="width: 576px; height: 384px;" /><em><span style="line-height: 1.6;">Figure 7</span></em></p>
<p style="text-align: center;"><em><img alt="Distribution Field Guide Figure 8" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a13530fb7ec7ee8e3fe90143772eefbc/distribution_field_guide_8.png" style="width: 576px; height: 384px;" /><span style="line-height: 1.6;">Figure 8</span></em></p>
<p style="text-align: center;"><em><img alt="Distribution Field Guide Figure " src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/6f28cb199afaee379ccc2244a955557f/distribution_field_guide_9.png" style="width: 576px; height: 384px;" /><span style="line-height: 1.6;">Figure 9</span></em></p>
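<p>Readers without Minitab can approximate Individual Distribution Identification with scipy: fit each candidate family by maximum likelihood and compare goodness-of-fit results. This is only a rough analogue (Minitab uses Anderson-Darling statistics, and Kolmogorov-Smirnov p-values are approximate when the parameters are estimated from the same data), and the data here are synthetic.</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic sample from a known 3-parameter Weibull (shape, location, scale).
data = stats.weibull_min.rvs(1.5, loc=2.0, scale=10.0, size=200,
                             random_state=rng)

# Fit several candidate families and compare Kolmogorov-Smirnov results;
# the true family usually shows the smallest KS statistic.
for dist in (stats.weibull_min, stats.norm, stats.expon):
    params = dist.fit(data)
    stat, p = stats.kstest(data, dist.name, args=params)
    print(f"{dist.name:12s} KS stat = {stat:.3f}, p = {p:.3f}")
```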
<div>
<p style="line-height: 20.7999992370605px;"><strong>About the Guest Blogger</strong></p>
<p style="line-height: 20.7999992370605px;"><em><a href="https://www.linkedin.com/pub/matthew-barsalou/5b/539/198" target="_blank">Matthew Barsalou</a> is a statistical problem resolution Master Black Belt at <a href="http://www.3k-warner.de/" target="_blank">BorgWarner</a> Turbo Systems Engineering GmbH. He is a Smarter Solutions certified Lean Six Sigma Master Black Belt, ASQ-certified Six Sigma Black Belt, quality engineer, and quality technician, and a TÜV-certified quality manager, quality management representative, and auditor. He has a bachelor of science in industrial sciences, a master of liberal studies with emphasis in international business, and a master of science in business administration and engineering from the Wilhelm Büchner Hochschule in Darmstadt, Germany. He is the author of the books <a href="http://www.amazon.com/Root-Cause-Analysis-Step---Step/dp/148225879X/ref=sr_1_1?ie=UTF8&qid=1416937278&sr=8-1&keywords=Root+Cause+Analysis%3A+A+Step-By-Step+Guide+to+Using+the+Right+Tool+at+the+Right+Time" target="_blank">Root Cause Analysis: A Step-By-Step Guide to Using the Right Tool at the Right Time</a>, <a href="http://asq.org/quality-press/display-item/index.html?item=H1472" target="_blank">Statistics for Six Sigma Black Belts</a> and <a href="http://asq.org/quality-press/display-item/index.html?item=H1473&xvl=76115763" target="_blank">The ASQ Pocket Guide to Statistics for Six Sigma Black Belts</a>.</em></p>
</div>
Fun StatisticsStatisticsStatistics HelpStatsTue, 05 May 2015 11:00:00 +0000http://blog.minitab.com/blog/statistics-in-the-field/a-field-guide-to-statistical-distributionsGuest BloggerSeeing Quality in Full Color with Crayola's Quality Team
http://blog.minitab.com/blog/understanding-statistics/seeing-quality-in-full-color-with-crayolas-quality-team
<p>This week I'm at the American Society for Quality's World Conference on Quality and Improvement in Nashville, TN. The ASQ conference is a great opportunity to see how quality professionals are tackling problems in every industry, from beverage distribution to banking services. </p>
<p>Given my statistical bent, I like to see how companies apply tools like ANOVA, regression, and especially designed experiments—particularly if they happen to be using the <a href="http://www.minitab.com/products/minitab">statistical software</a> I like best. </p>
<p><img alt="Crayola crayons" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/2f1f8af921b44fe474ee3c69f4699469/crayons.jpg" style="border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 250px; height: 203px;" />One of the most popular sessions involved a company whose products are instantly recognizable to almost everyone who's ever had (or been) a child: <a href="http://www.crayola.com" target="_blank">Crayola</a>.</p>
<p>There's something about using crayons that brings out the imaginative kid in all of us, and as this session started I saw lots of smiles and even overheard some wistful recollections about "new crayon smell" from the row behind me. </p>
<p>I also heard comments about the quality of Crayola's crayons compared to other brands, and I flashed back to my own childhood experiences: other crayons' tips weren't as strong, and if you pressed really hard they were much more prone to snapping in two. But Crayolas were always the best: if you wanted to break a Crayola, you needed to <em>work</em> at it!</p>
<p>The conference room was packed with people who'd had similar experiences. We'd seen the results of Crayola's efforts to make the highest-quality crayons available, and now we wanted to learn more about how they did it. We weren't disappointed. </p>
Improving Inventory with DOE and Simulation
<p>Speaking for Crayola were Bonnie Hall, the company's vice president for global quality and continuous improvement, and Rich Titus, a Lean Six Sigma Master Black Belt who has consulted with Crayola for several years. </p>
<p>They talked about the history of Crayola, and the fact that they've been making crayons for more than 100 years from their headquarters and facilities in Bethlehem, Pennsylvania. They also shared a brief history of the company's Lean and Six Sigma initiatives, which kicked off in 2001 and have yielded great benefits. </p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/542850daee02cea6117763d86e791669/summary_statistics.gif" style="margin: 10px 15px; float: right; width: 376px; height: 251px;" />Then they walked participants through several examples of how Crayola has used data analysis and statistics to reduce waste, cut costs, and most important, maximize the quality of the crayons, markers, modeling materials and other art supplies they make. </p>
<p>They talked about how Crayola followed a systematic process to improve the accuracy of its inventory system, following the DMAIC roadmap and applying tools like graphical analysis, ANOVA, <a href="http://blog.minitab.com/blog/real-world-quality-improvement/leveraging-designed-experiments-doe-for-success">Design of Experiments</a>, and <a href="http://blog.minitab.com/blog/understanding-statistics/i-think-i-can-i-know-i-can-a-high-level-overview-of-process-capability-analysis">capability analysis</a>. </p>
<p>It's pretty cool to see and hear how statistics helps Crayola make sure they're delivering items that kids can count on, and it's really gratifying to know they trust Minitab's software to make the data analysis as easy and straightforward as possible. </p>
What Makes Crayola's Quality Program So Successful?
<p>In their presentation today, Bonnie Hall and Rich Titus cited a couple of key attributes that they believe have made Crayola's quality program successful. </p>
<ul>
<li>Minitab and data-driven problem solving are now the norm at Crayola</li>
<li>Lean and Six Sigma are now part of Crayola's culture</li>
<li>Senior-level managers are fully trained green and black belts, and do their own projects</li>
<li>Executives and managers conduct regular project reviews for current projects</li>
</ul>
<p>Crayola's quality improvement efforts have been a tremendous success, and the company's leaders were gracious enough to spend time with me and some other Minitab folks earlier this year to tell us more about how data analysis has helped them compete and improve.</p>
<p>You can visit our web site to learn more about <a href="http://www.minitab.com/crayola">how Crayola is using statistics and data</a> to maintain and enhance the quality of their products. </p>
Fun StatisticsLean Six SigmaSix SigmaStatistics in the NewsMon, 04 May 2015 17:00:00 +0000http://blog.minitab.com/blog/understanding-statistics/seeing-quality-in-full-color-with-crayolas-quality-teamEston MartzCp and Cpk: Two Process Perspectives, One Process Reality
http://blog.minitab.com/blog/statistics-and-quality-data-analysis/cp-and-cpk-two-process-perspectives-one-process-reality
<p>It’s usually not a good idea to rely solely on a single statistic to draw conclusions about your process. Do that, and you could fall into the clutches of the “duck-rabbit” illusion shown here:</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/4ee77630a518a133bd8146e5e96b3e28/cpk_cp_cropped.jpg" style="line-height: 18.9090900421143px; margin: 10px 15px; width: 353px; height: 183px;" /></p>
<p>If you fix your eyes solely on the duck, you’ll miss the rabbit—and vice-versa.</p>
<p>If you're using <a href="http://www.minitab.com/products/minitab">Minitab Statistical Software</a> for capability analysis, the capability indices Cp and Cpk are good examples of this. If you focus on only one measure and ignore the other, you might miss seeing something critical about the performance of your process. </p>
Cp: A Tale of Two Tails
<p>Cp is a ratio of the specification spread to the process spread. The process spread is often defined as the 6-sigma spread of the process (that is, 6 times the within-subgroup standard deviation). Higher Cp values indicate a more capable process.</p>
<p>When the specification spread is considerably greater than the process spread, Cp is high.</p>
<p style="margin-left: 40px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/73367467a6c16919e9bc030f1a63c913/cp_high.jpg" style="width: 328px; height: 213px;" /></p>
<p>When the specification spread is less than the process spread, Cp is low.</p>
<p style="margin-left: 40px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/2818c4502396e8b64627502918ddd08f/cp_low.jpg" style="width: 319px; height: 217px;" /></p>
<p>By using the 6-sigma process spread, Cp incorporates information about both tails of the process data. But there’s something Cp doesn’t do—it doesn’t tell you anything about the location of the process data.</p>
<p>For example, the following two processes have about the same Cp value (≈ 3):</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/228e5e4d4b24aa4eb760c3db8f36a5a0/cp_same.jpg" style="width: 582px; height: 210px;" /></p>
<p>Obviously, Process B has a serious issue with its location in relation to the spec limits that Cp just can't "see."</p>
Cpk: Location, Location, Location!
<p>Like Cp, Cpk is also a ratio of the specification spread to the process spread. But unlike Cp, Cpk compares the distance from the process mean to the closest specification limit with about half the spread of the process (often, the 3-sigma spread).</p>
<p>When the distance from the mean to the nearest specification limit is considerably greater than the one-sided process spread, Cpk is high.</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/34c73f8708aaf7af89d37ac2e38ee8cf/cpk_high.jpg" style="width: 318px; height: 216px;" /></p>
<p>When the distance from the mean to the nearest specification limit is less than the one-sided process spread, Cpk is low.</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/4b0494897f6986f25d7f6c638eb01191/cpk_low.jpg" style="width: 326px; height: 214px;" /></p>
<p>Notice how the location of the process <em>does</em> affect the Cpk value—by virtue of its being calculated using the process mean.</p>
<p>Yet there's something important that Cpk doesn't do. Because it's a "worst-case" estimate that uses only the nearest specification limit, Cpk can't "see" how the process is performing on the other side.</p>
<p>For example, the following two processes have about the same Cpk value (≈ 0.9):</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/7ef5c1fb343200b07f0914b4db5d813e/cpk_same.jpg" style="width: 554px; height: 208px;" /><br />
Notice that Process X has nonconforming parts in relation to both spec limits, while Process Y has nonconforming parts in relation to only the upper spec limit (USL). But Cpk can't "see" any difference between these two processes.</p>
<p>To get the two-sided picture of each process, in relation to both spec limits, you can look at Cp, which would be higher for Process Y than for Process X.</p>
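<p>The worst-case calculation described above can be sketched in a few lines of Python. As before, this is an illustration rather than Minitab's estimator: plain sample statistics stand in for the within-subgroup estimates, and the data and limits are made up.</p>

```python
# Simplified sketch of Cpk: distance from the mean to the nearest
# spec limit, over the one-sided 3-sigma spread.
import statistics

def cpk(data, lsl, usl):
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    return min((usl - mu) / (3 * sigma), (mu - lsl) / (3 * sigma))
```

<p>Because only the nearer limit survives the min(), a process centered between its limits and a process drifting toward one limit can report very different Cpk values even when their spreads are identical.</p>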
Summing Up: Look for Ducks, Rabbits, and Other Critters as Well
<p>Avoid getting too fixated on any single statistic. If you have both a lower and upper specification limit for your process, Cp and Cpk each might “know” something about your process that the other one doesn’t. That “something” could be critical to fully understanding how your process is performing.</p>
<p>To see a concrete example of how Cp and Cpk work together, using real data from the National Renewable Energy Laboratory, see <a href="http://blog.minitab.com/blog/statistics-and-quality-improvement/process-capability-statistics-cp-and-cpk-working-together" target="_blank">this post by Cody Steele</a>.</p>
<p>By the way, the potential "blind spot" for Cp and Cpk also applies to Pp and Ppk. The only difference is that the process spread for those indices is calculated using the overall standard deviation, instead of the within-subgroup standard deviation. For more on that distinction, see <a href="http://blog.minitab.com/blog/michelle-paret/process-capability-statistics-cpk-vs-ppk" target="_blank">this post by Michelle Paret</a>.</p>
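<p>The within-versus-overall distinction can be made concrete with a short sketch. A pooled subgroup standard deviation is used here as a simple stand-in for Minitab's within-subgroup estimator, and the subgroup data are hypothetical:</p>

```python
import math
import statistics

def overall_sigma(subgroups):
    # Overall standard deviation: pool every measurement together,
    # as used for Pp and Ppk.
    return statistics.stdev([x for sg in subgroups for x in sg])

def within_sigma(subgroups):
    # Within-subgroup spread: average the subgroup variances
    # (equal subgroup sizes assumed), a rough analogue of the
    # estimate behind Cp and Cpk.
    return math.sqrt(statistics.mean([statistics.variance(sg) for sg in subgroups]))
```

<p>When subgroup means drift over time, the overall sigma grows while the within-subgroup sigma does not, which is why Ppk can fall well below Cpk for an unstable process.</p>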
<p>And if you’re interested in other optical and statistical illusions, check out <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/optical-illusions-zen-koans-and-simpsons-paradox" target="_blank">this post on Simpson's paradox</a>.</p>
Quality ImprovementStatisticsTue, 28 Apr 2015 12:03:00 +0000http://blog.minitab.com/blog/statistics-and-quality-data-analysis/cp-and-cpk-two-process-perspectives-one-process-realityPatrick RunkelItem Analysis with Cronbach's Alpha for Reliable Surveys
http://blog.minitab.com/blog/meredith-griffith/item-analysis-with-cronbachs-alpha-for-reliable-surveys
<p><span style="line-height: 1.6;">Many of the things you need to monitor can be measured in a concrete, objective way, such as an item's weight or length. But, many important characteristics are more subjective, such as the collaborative culture of the workplace, or an individual's political outlook.</span></p>
<p>A survey is an excellent way to measure these kinds of characteristics. To better understand a characteristic, a researcher asks multiple questions about it. For example, rather than simply ask diners whether they are satisfied, a researcher may ask:</p>
<ul>
<li>How satisfied are you with our services?</li>
<li>How likely are you to visit our restaurant again?</li>
<li>How likely are you to recommend our restaurant?</li>
</ul>
<p>Collectively, these questions give the researcher a deeper, more nuanced <span><a href="http://blog.minitab.com/blog/applying-statistics-in-quality-projects/use-statistics-to-better-understand-your-customers">understanding of customer satisfaction</a></span> than a single question.</p>
<p>The challenge is to ask questions that vary enough to measure the different facets of the characteristic, yet still relate to the same characteristic. If you ask questions that don't measure the same characteristic, your survey will produce misleading data, which can lead you to make poor, and potentially costly, decisions. So, how do you know whether different questions all measure the same characteristic?</p>
<p>Item Analysis with Cronbach's alpha can help, and it's easy to do in Minitab's <a href="http://www.minitab.com/products/minitab/">statistical software</a>.</p>
What Is Item Analysis?
<p>Item Analysis tells you how well a set of questions (or items) measures one characteristic (or construct) and helps to identify questions that are problematic.</p>
<p>For example, suppose two questions measure different aspects of quality on a Likert scale (1 is worst, 5 is best). For the most part, respondents who rated Question 1 high also rated Question 2 high. And, those who rated Question 1 low tended to rate Question 2 low. This correlation suggests the questions measure the same characteristic and so make for a reliable survey.</p>
<p><img alt="Scatterplot of Question 1 vs Question 2" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/26609466f0192b049b5b0f49c7d62b0d/scatterplot_of_question_2_vs_question_1.jpg" style="width: 384px; height: 256px;" /></p>
<p>However, for Question 1 and Question 4, respondents gave markedly different ratings. This lack of a correlation indicates that the items do not measure the same characteristic.</p>
<p><img alt="scatterplot of question 1 vs question 4" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/d0aa4747b2051c9e960948eb03d86fcc/scatterplot_of_question_4_vs_question_1.jpg" style="width: 384px; height: 256px;" /></p>
<span style="line-height: 1.2;">Cronbach's alpha and other key statistics</span>
<p>Item Analysis helps you to evaluate the correlation of related survey items with only a few statistics. Most important is Cronbach's alpha, a single number that tells you how well a set of items measures a single characteristic. It is an overall measure of item correlation whose values typically range between 0 and 1; values above 0.7 are often considered acceptable.</p>
<p>To identify problematic items, look at the Omitted Item Statistics section of the output. This section tells you how removing any one item from the analysis improves or worsens Cronbach's alpha. This information allows you to fine-tune your survey, keeping the good questions while replacing the bad.</p>
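<p>For readers who want to see the mechanics, Cronbach's alpha and the omitted-item check can be sketched directly from the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total score). The response data in the test below are invented; Minitab's output includes additional adjusted statistics this sketch omits.</p>

```python
import statistics

def cronbach_alpha(items):
    # items: one list of responses per question, all the same length.
    k = len(items)
    totals = [sum(resps) for resps in zip(*items)]
    item_var = sum(statistics.variance(col) for col in items)
    return k / (k - 1) * (1 - item_var / statistics.variance(totals))

def omitted_item_alphas(items):
    # Alpha recomputed with each question left out, mirroring the
    # Omitted Item Statistics table.
    return [cronbach_alpha(items[:i] + items[i + 1:]) for i in range(len(items))]
```

<p>If dropping a question raises alpha noticeably, that question is a candidate for removal or replacement.</p>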
Trust your data
<p>Suppose a bank surveys customers to assess customer satisfaction.</p>
<p><img alt="Cronbach's alpha" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/162d253cce9fa2b2e38091561eba847c/session_output_1.jpg" style="width: 431px; height: 184px;" /></p>
<p>Analysts use Item Analysis to determine how well all of the questions measure customer satisfaction. The results show that Cronbach's alpha is quite high: 0.9550. The bank can trust that the three questions in the survey reliably assess the same construct, customer satisfaction.</p>
Reveal an unreliable survey
<p>Now suppose a medical group surveys patients who are in physical rehabilitation to assess their degree of mobility.</p>
<p><img alt="Cronbach's alpha" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/dcd1eb44ff6c5c1a441e28b81ce435f4/session_output_2.jpg" style="width: 432px; height: 197px;" /></p>
<p>Analysts use Item Analysis to determine whether all of the questions measure mobility. The results show that Cronbach's alpha is quite low: 0.5191. This value suggests the questions do not all measure mobility.</p>
Identify a problematic question
<p>Item Analysis provides more than just a passing or failing grade; it also helps you identify problematic questions.</p>
<p>Suppose a manufacturing company surveys its employees to assess the strength of the safety culture in its factories. The survey asks the respondents to indicate the strength of their agreement with statements such as the following:</p>
<ul>
<li>When a safety mistake is made but no one is harmed, the mistake is usually reported.</li>
<li>My supervisor wants us to work faster, but not by taking shortcuts on safety.</li>
<li>Our procedures and systems are good at preventing errors.</li>
<li>I feel that I am safe at work.</li>
</ul>
<p><img alt="Cronbach's alpha" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/9b7562d6c4d09b50d27203a23c731c71/session_output_3.jpg" style="width: 431px; height: 184px;" /></p>
<p>Cronbach's alpha is above 0.7, which is promising. However, looking at the Omitted Item Statistics output shows us that Cronbach's alpha increases from 0.7853 to 0.9217 when Minitab removes Question 4 from the analysis.</p>
<p>Collectively, the results suggest that Questions 1, 2, and 3 are the best indicators of the safety culture. The manager should remove Question 4 from the analysis and possibly replace it in future surveys.</p>
Conducting an Item Analysis in Minitab
<p>Analyzing your own survey data is easy.</p>
<ol>
<li>Choose <strong>Stat > Multivariate > Item Analysis</strong>.</li>
<li>In <strong>Variables</strong>, enter all items which measure the same construct.</li>
<li>If your items are measured on different scales, check <strong>Standardize variables</strong>.</li>
<li>Click <strong>OK</strong>.</li>
</ol>
<p><img alt="Item Analysis" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/127fa671b2b576347d08e2f28764585b/item_analysis_dialog.jpg" style="width: 432px; height: 281px;" /></p>
Putting Item Analysis to Use
<p>Surveys and tests are like any other measurement tool—you first need to assess whether your data are reliable. Minitab's Item Analysis evaluates your survey responses so you can trust your data and be confident in the decisions you make as a result.</p>
Data AnalysisStatisticsTue, 21 Apr 2015 12:00:00 +0000http://blog.minitab.com/blog/meredith-griffith/item-analysis-with-cronbachs-alpha-for-reliable-surveysMeredith GriffithThe Easiest Way to Do Capability Analysis
http://blog.minitab.com/blog/understanding-statistics/the-easiest-way-to-do-capability-analysis
<p>A while back, I offered an <span><a href="http://blog.minitab.com/blog/understanding-statistics/i-think-i-can-i-know-i-can-a-high-level-overview-of-process-capability-analysis">overview of process capability analysis</a></span> that emphasized the importance of matching your analysis to the distribution of your data.</p>
<p>If you're already familiar with different types of distributions, Minitab makes it easy to identify what type of data you're working with, or to transform your data to approximate the normal distribution.</p>
<p>But what if you're <em>not</em> so great with probability distributions, or you're not sure about how or even <em>if</em> you should transform your data? You can still do capability analysis with the <a href="http://www.minitab.com/products/minitab/assistant/">Assistant in Minitab Statistical Software</a>. Even if you're a stats whiz, the Assistant's easy-to-follow output can make the task of explaining your results much easier to people who don't share your expertise. </p>
<p>Let's walk through an example of capability analysis with non-normal data, using the Assistant.</p>
The Easy Way to Do Capability Analysis on Non-normal Data
<p>For this example, we'll use a data set that's included with Minitab Statistical Software. (If you're not already using Minitab, <a href="http://it.minitab.com/products/minitab/free-trial.aspx">download the free trial</a> and follow along.) Click <strong>File > Open Worksheet</strong>, and then click the button labeled "Look in Minitab Sample Data folder." Open the dataset named <em>Tiles</em>. </p>
<p>This is data from a manufacturer of floor tiles. The company is concerned about the flexibility of the tiles, and the data set contains data collected on 10 tiles produced on each of 10 consecutive working days.</p>
<p>Select <strong>Assistant > Capability Analysis</strong> in Minitab:</p>
<p><img alt="Capability Analysis " src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/1dd468ac3be6a5b50f54f44335c06a97/capability_analysis_assistant.jpg" style="width: 284px; height: 249px;" /></p>
<p>The Assistant presents you with a simple decision tree that will guide you to the right kind of capability analysis:</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/af60e5da8c327c8922c9cd7fe54ac703/capability_analysis_decision_tree.jpg" style="width: 500px; height: 353px;" /></p>
<p>The first decision we need to make is what type of data we've collected—Continuous or Attribute. If you're not sure what the difference is, you can just click the "Data Type" diamond to see a straightforward explanation.</p>
<p>Attribute data involves counts and characteristics, while Continuous data involves measurements of factors such as height, length, weight, and so on, so it's pretty easy to recognize that the measurements of tile flexibility are continuous data. With that question settled, the Assistant leads us to the "Capability Analysis" button:</p>
<p><img alt="capability analysis option" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/e33b9eea580b04d253a0821b23232f4b/capability_analysis_assistant_click.jpg" style="width: 139px; height: 137px;" /></p>
<p>Clicking that button brings up the dialog shown below. <span style="line-height: 18.9090900421143px;">Our data are all in the "Warping" column of the worksheet. The subgroup size is "10", since we measured 10 samples on each day. Enter "8" as the upper spec limit, because that's the customer's guideline.</span></p>
<p><img alt="capability dialog" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/3b076036799aec62891417383f0e5f8a/capability_dialog.jpg" style="width: 500px; height: 414px;" /></p>
<p>Then press OK.</p>
Transforming Non-normal Data
<p>Uh-oh—the Assistant immediately gives us a warning. Our data don't meet the assumption of normality:</p>
<p><img alt="normality test" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/1fd577f8166cb8e4b7e90fd6c7f111ae/normality_test.jpg" style="width: 497px; height: 274px;" /></p>
<p>When you click "Yes," the Assistant transforms the data automatically (using the Box-Cox transformation) and continues the analysis. Once the analysis is complete, you'll get a Report Card that alerts you to any potential issues with your analysis, a Diagnostic Report that assesses the stability of your process and the normality of your data, a detailed Process Performance Report, and a Summary Report that captures the bottom-line results of your analysis and presents them in plain language.</p>
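<p>The Assistant chooses the transformation's lambda for you, but the transform itself is simple. Here is a minimal sketch of the Box-Cox family for positive data; the lambda values in the test are illustrative only, since Minitab estimates the optimal one from the data.</p>

```python
import math

def box_cox(y, lam):
    # Box-Cox transform for positive data:
    # (y**lam - 1) / lam, or ln(y) when lam == 0.
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1) / lam for v in y]
```

<p>Right-skewed data are typically pulled toward symmetry by lambdas below 1; for example, lam = 0.5 behaves like a square-root transform and lam = 0 like a log.</p>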
<p><img alt="capability analysis summary report" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/76571303a05b3b76f251357b55879fea/capability_assistant_summary_report.jpg" style="width: 600px; height: 450px;" /></p>
<p>The Ppk of 0.75 is below the typical industry acceptability benchmark of 1.33, so this process is not capable. Looks like we have some opportunities to improve the quality of our process!</p>
<span style="line-height: 1.6;">Comparing Before and After Capability Analysis Results</span>
<p>Once we've made adjustments to the process, we can also use the Assistant to see how much of an impact those changes have had. The Assistant's Before/After Capability Analysis is just what we need:</p>
<p><img alt="Before/After Capability Analysis" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/7c596638d158180fa78fccc3550d45a9/before_after_capa_asst.jpg" style="width: 428px; height: 297px;" /></p>
<p>The dialog box for this analysis is very similar to that for the first capability analysis we performed, but this time we can select a column of data from before we made improvements (Baseline process data), and a column of data collected after our improvements were implemented: </p>
<p><img alt="before-after capability analysis dialog box" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/b47e6fa1ac9e3b9174651ac90639f319/before_after_dialog.jpg" style="width: 500px; height: 454px;" /></p>
<p>Press OK and the Assistant will again check if you want to transform your data for normality before it proceeds with the analysis. Then it presents us with a series of reports that make it easy to see the impact of our changes. The summary report gives you the bottom line quickly. </p>
<p><img alt="before/after capability analysis summary report" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/da02b638ce3481035abdd01474b25f35/before_after_capability_analysis_summary_report.gif" style="width: 600px; height: 450px;" /></p>
<p>The changes did affect the process variability, and this process now has a Ppk of 1.94, a vast improvement over the original value of 0.75, and well above the 1.33 benchmark for acceptability. </p>
<p>I hope this post helps you see how the Assistant can make performing capability analyses easier, and that you'll be able to get more value from your process data as a result. </p>
<p> </p>
Quality ImprovementMon, 20 Apr 2015 12:00:00 +0000http://blog.minitab.com/blog/understanding-statistics/the-easiest-way-to-do-capability-analysisEston MartzThinking about Predictors in Regression, an Example
http://blog.minitab.com/blog/statistics-and-quality-improvement/thinking-about-predictors-in-regression-an-example
<p>A few times a year, the <a href="http://www.bls.gov/spotlight/2015/long-term-unemployment/home.htm">Bureau of Labor Statistics (BLS)</a> publishes a Spotlight on Statistics Article. The first such article of 2015 recently arrived, providing analysis of trends in long-term unemployment. </p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/048c63ca9af692c560ed58a0105091df/hired.jpg" style="line-height: 18.9090900421143px; margin: 10px 15px; float: right; width: 353px; height: 266px;" /></p>
<p>Certainly an interesting read on its own, but some of the included data gives us a good opportunity to look at how thought can improve your regression analysis. Fortunately, <a href="http://it.minitab.com/en-us/products/minitab/free-trial.aspx">Minitab Statistical Software</a> includes 3-D graphs and Regression Diagnostics that can help you spot opportunities for improvement.</p>
<p>The first chart in the report highlights how high the share of the unemployed who have been out of work for a long time is compared to historical levels. That chart looks a bit like this:</p>
<p><img alt="Percent of total unemployed in each category tend to follow each other." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/2c134f899638833af4277a85e9816557/time_series_unemployment.png" style="border-width: 0px; border-style: solid; width: 576px; height: 384px;" /></p>
<p>The discussion points out an interesting relationship. The authors note that the record high for those unemployed 27 weeks or longer occurred in the second quarter of 2010. The record high for those unemployed 52 weeks or longer occurred in the second quarter of 2011. The record high for those unemployed 99 weeks or longer occurred in the fourth quarter of 2011. That is, the highest proportion of unemployed in each category happens earlier for shorter terms.</p>
<p>This relationship is where we can see how to put some thought into regression variables. Let’s say that we want to predict the percentage of unemployed who will have been unemployed for 99 weeks or longer, using the other two figures. The most natural setup for the data is for all of the figures to be in the same row by date, like this:</p>
<p><img alt="In this worksheet, each column starts in row 1." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/e224887ab5cbfa18b531cb83018045db/natural_worksheet.png" style="border-width: 0px; border-style: solid; width: 425px; height: 174px;" /></p>
<p>When your data are set up like this, it’s natural to analyze them as-is, and the relationship you get is strong. If you looked only at the R-squared statistics, you might stop there.</p>
<p><span style="font-family: courier new">Model Summary</span></p>
<p><span style="font-family: courier"> S R-sq R-sq(adj) R-sq(pred)<br />
0.963437 94.69% 94.56% 93.96%</span></p>
<p>But if you look a little deeper, you might find that there are some unsatisfactory aspects with the variables this way. Here's what the relationship looks like when you plot all 3 variables on a 3-D graph:</p>
<p><img alt="The relationship between the variables is weaker as the values increase." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/819be48ca8fb3317ccf269cc1f116bde/3_d_synchronized_variables.png" style="border-width: 0px; border-style: solid; width: 576px; height: 384px;" /></p>
<p>I’ve marked the points on this graph that have unusual predictor values. In the diagnostic report for the model, we can see that these points are followed by large standardized residuals. That is, the lag that the article pointed out in the maximums shows up in the regression relationship as well.</p>
<p><span style="font-family: courier new">Fits and Diagnostics for Unusual Observations</span></p>
<p><span style="font-family: courier new"> 99 weeks<br />
Percent<br />
Obs unemployed Fit Resid Std Resid<br />
63 4.500 3.219 1.281 1.54 X<br />
64 5.800 6.793 -0.993 -1.11 X<br />
65 6.500 8.323 -1.823 -2.03 R X<br />
66 9.500 13.152 -3.652 -3.92 R<br />
67 9.600 12.786 -3.186 -3.40 R<br />
68 10.700 14.019 -3.319 -3.57 R<br />
75 14.300 12.387 1.913 2.04 R</span></p>
<p><span style="font-family: courier new">R Large residual<br />
X Unusual X</span></p>
<p>If you think about the predictor variables, this makes perfect sense. The BLS report notes that finding a job is less likely the longer you are unemployed. People unemployed for more than 27 weeks can become people who are unemployed for longer than 52 weeks. People who are unemployed for more than 52 weeks can become people who are unemployed longer than 99 weeks.</p>
<p>So what are the right predictors to use for the percentage of the unemployed for longer than 99 weeks? The closest we can get with the terms provided is probably that people who are unemployed for over 27 weeks can become people who are unemployed for over 99 weeks about 4 quarters later. Similarly, people who are unemployed for over 52 weeks can become people who are unemployed for over 99 weeks about 2 quarters later.</p>
<p>To get these variables in Minitab, use the Time Series menu.</p>
<ol>
<li>Choose<strong> Stat > Time Series > Lag</strong>.</li>
<li>In<strong> Series</strong>, enter '<em>Over 27 Weeks'</em>.</li>
<li>In <strong>Store lags in</strong>, enter<em> ‘Over 27 Lag 4’</em>.</li>
<li>In <strong>Lag</strong>, enter <strong>4</strong>.</li>
<li>Press CTRL + E.</li>
<li>In <strong>Series</strong>, enter <em>'Over 52 Weeks'</em>.</li>
<li>In <strong>Store lags in</strong>, enter<em> ‘Over 52 Lag 2’</em>.</li>
<li>In<strong> Lag</strong>, enter <em>2</em>.</li>
</ol>
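<p>In plain terms, the lag operation in the steps above shifts a column down by k rows and pads the top with missing values. A sketch, with None standing in for Minitab's missing-value symbol:</p>

```python
def lag(series, k):
    # Shift down by k positions; the first k entries become missing.
    return [None] * k + series[:len(series) - k]
```

<p>With 'Over 27 Weeks' lagged 4 quarters and 'Over 52 Weeks' lagged 2, each row then pairs the response with predictor values from the appropriate earlier quarters.</p>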
<p>The resulting worksheet looks like this:</p>
<p><img alt="New variables are in this worksheet that line up the rows at more logical intervals." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/fafd01f883d7613f0f100de45b0d66cc/lagged_worksheet.png" style="border-width: 0px; border-style: solid; width: 532px; height: 157px;" /></p>
<p>Now, the value for the percentage unemployed over 27 weeks from the first quarter of 1994 lines up with the percentage of unemployed over 52 weeks from the third quarter of 1994 and the percentage unemployed over 99 weeks from the first quarter of 1995. Plot these data and the relationship looks stronger than before:</p>
<p><img alt="The relationship between the response and the lagged predictors looks stronger." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/2db1ee6b4f828eda2a556bcb2fa78a01/3_d_lagged_variables.png" style="border-width: 0px; border-style: solid; width: 576px; height: 384px;" /></p>
<p>With the same 3 points from the first graph highlighted in red, the points don’t seem unusual at all. In fact, these points don’t appear in the diagnostic report anymore. One point still has a large standardized residual, and it is preceded by an unusual X value. But the regression that compares appropriate time frames explains more variation in the data than the regression that compares simultaneous ones.</p>
<p><span style="font-family: courier new">Model Summary</span></p>
<p><span style="font-family: courier new"> S R-sq R-sq(adj) R-sq(pred)<br />
0.676735 97.50% 97.43% 97.04%</span></p>
<p><span style="font-family: courier new">Fits and Diagnostics for Unusual Observations</span></p>
<p><span style="font-family: courier new"> 99 weeks<br />
Percent<br />
Obs unemployed Fit Resid Std Resid<br />
66 9.500 8.866 0.634 1.01 X<br />
68 10.700 13.357 -2.657 -4.40 R X</span></p>
<p><span style="font-family: courier new">R Large residual<br />
X Unusual X</span></p>
<p>Minitab Statistical Software provides a number of ways for you to evaluate your regression model. If your diagnostics reveal model inadequacies, then you have a lot of easy ways to make improvements. I used lag to create appropriate variables. If you’re ready for more, check out how <a href="http://blog.minitab.com/blog/applying-statistics-in-quality-projects/re-analyzing-wine-tastes-with-minitab-17">Bruno Scibilia includes interactions in his model for wine tasting</a> or <a href="http://blog.minitab.com/blog/applying-statistics-in-quality-projects/how-could-you-benefit-from-a-box-cox-transformation">explains the benefits of a Box-Cox transformation</a>.</p>
Wed, 15 Apr 2015 17:02:00 +0000http://blog.minitab.com/blog/statistics-and-quality-improvement/thinking-about-predictors-in-regression-an-exampleCody SteeleIdentifying the Distribution of Your Data
http://blog.minitab.com/blog/meredith-griffith/identifying-the-distribution-of-your-data
<p><span style="font-size: 13px; line-height: 1.6;">To choose the right statistical analysis, you need to know the distribution of your data. Suppose you want to assess the capability of your process. If you conduct an analysis that assumes the data follow a normal distribution when, in fact, the data are nonnormal, your results will be inaccurate. To avoid this costly error, you must determine the distribution of your data.</span></p>
<p>So, how do you determine the distribution? Minitab’s Individual Distribution Identification is a simple way to find the distribution of your data so you can choose the appropriate statistical analysis. You can use it to:</p>
<ul>
<li>Determine whether a distribution you used previously is still valid for the current data</li>
<li>Choose the right distribution when you’re not sure which distribution to use</li>
<li>Transform your data to follow a normal distribution</li>
</ul>
<p>Let's take a closer look at three ways you can use the Individual Distribution Identification tool in our <a href="http://www.minitab.com/products/minitab">statistical software</a>. </p>
Confirm a Certain Distribution Fits Your Data
<p>In most cases, your process knowledge helps you identify the distribution of your data. In these situations, you can use Minitab’s Individual Distribution Identification to confirm the known distribution fits the current data.</p>
<p>Suppose you want to perform a capability analysis to ensure that the weight of ice cream containers from your production line meets specifications. In the past, ice cream container weights have been normally distributed, but you want to confirm normality. Here’s how you use Individual Distribution Identification to quickly assess the fit.</p>
<ol>
<li>Choose <strong>Stat > Quality Tools > Individual Distribution Identification</strong>.</li>
<li>Specify the column of data to analyze and the distribution to check it against.</li>
<li>Click <strong>OK</strong>.</li>
</ol>
<p><img alt="Probability Plot for Weight" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/809b9e485f550984b027a447a547eb76/distribution_id_plot_for_weight_graph1.jpg" style="width: 450px; height: 300px;" /></p>
<p>A given distribution is a good fit if:</p>
<ul>
<li>The data points roughly follow a straight line</li>
<li>The p-value is greater than 0.05</li>
</ul>
<p>In this case, the ice cream weight data appear to follow a normal distribution, so you can justify using normal capability analysis.</p>
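<p>If you want a numerical sanity check on the straight-line criterion, one informal stand-in is the correlation between the sorted data and standard normal quantiles: values near 1 mean the points hug the line. This sketch is not Minitab's goodness-of-fit test (Minitab reports a formal p-value), and the data in the test are invented.</p>

```python
import math
from statistics import NormalDist, mean

def normal_plot_corr(data):
    # Correlate sorted data with standard normal quantiles; a value
    # close to 1 means the probability plot is nearly a straight line.
    n = len(data)
    xs = sorted(data)
    ys = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den
```

<p>Symmetric, bell-like samples score close to 1, while strongly skewed samples score noticeably lower.</p>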
Determine Which Distribution Best Fits Your Data
<p>Perhaps you have successfully used more than one distribution in the past. You can use Individual Distribution Identification to help you decide which distribution best fits your current data. For example, you want to assess whether a particular weld strength meets customers’ requirements, but several distributions have been used to model this data historically. Here’s how you use Individual Distribution Identification to choose the distribution that best fits your current data.</p>
<ol>
<li>Choose <strong>Stat > Quality Tools > Individual Distribution Identification</strong>.</li>
<li>Specify the column of data to analyze and the distributions to check it against.</li>
<li>Click <strong>OK</strong>.</li>
</ol>
<p><img alt="Determine Which Distribution Best Fits Your Data" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/93bb72b716b2b3636510a50921044c55/distribution_id_plot_for_strength_graph2_w1024.jpeg" style="width: 450px; height: 296px;" /></p>
<p>Choose the distribution with data points that roughly follow a straight line and the highest p-value. In this case, the Weibull distribution fits the data best.</p>
<p><strong>Note</strong></p>
<p>When you fit your data with both a 2-parameter distribution and its 3-parameter counterpart, the latter often appears to be a better fit. However, you should use a 3-parameter distribution only if it is significantly better. See Minitab Help for information about <a href="http://support.minitab.com/en-us/minitab/17/topic-library/quality-tools/capability-analyses/distributions-and-transformations-for-nonnormal-data/p-value-for-a-goodness-of-fit-test/#choose-between-a-3-parameter-and-a-2-parameter-distribution">choosing between a 2-parameter distribution and a 3-parameter distribution</a>.</p>
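<p>The compare-several-candidates step can be sketched in Python as well. This is a rough stand-in for Individual Distribution Identification, assuming simulated Weibull strength data and using a Kolmogorov-Smirnov p-value as the goodness-of-fit score (Minitab uses Anderson-Darling; KS with fitted parameters is optimistic but serves to illustrate the ranking idea):</p>

```python
# Hypothetical sketch: fit several candidate distributions and rank them
# by a goodness-of-fit p-value, keeping the best-fitting one.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
strength = stats.weibull_min.rvs(c=2.0, scale=10.0, size=80, random_state=rng)

candidates = {
    "normal": stats.norm,
    "lognormal": stats.lognorm,
    "weibull": stats.weibull_min,
}

results = {}
for name, dist in candidates.items():
    params = dist.fit(strength)  # maximum likelihood fit
    # KS p-value as a rough score; note it is inflated when parameters
    # are estimated from the same data.
    results[name] = stats.kstest(strength, dist.cdf, args=params).pvalue

best = max(results, key=results.get)
for name, p in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} p-value = {p:.3f}")
print(f"Best-fitting candidate: {best}")
```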
Use a Normal Statistical Analysis on Nonnormal Data
<p>While Minitab offers various options for analysis of nonnormal data, many users prefer to use the broader palette of normal statistical analyses. Minitab’s Individual Distribution Identification can <a href="http://blog.minitab.com/blog/applying-statistics-in-quality-projects/how-could-you-benefit-from-a-box-cox-transformation">transform your nonnormal data using the Box-Cox method</a> so that it follows a normal distribution. You can then use the transformed data with any analysis that assumes the data follow a normal distribution.</p>
<ol>
<li>Choose <strong>Stat > Quality Tools > Individual Distribution Identification</strong>.</li>
<li>Specify the column of data to analyze.</li>
<li>From the Distribution drop-down menu in the main dialog, choose <em>Box-Cox transformation</em>, and select any other distributions to compare it with.</li>
<li>Click <strong>OK</strong> in each dialog box.</li>
</ol>
<p><img alt="USE A NORMAL STATISTICAL ANALYSIS ON NONNORMAL DATA" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/e15daf80da959004a1e44134d2bab016/distribution_id_plot_for_strength_graph3.jpg" style="width: 384px; height: 256px;" /></p>
<p>For the transformed data, check for data points that roughly follow a straight line and a p-value greater than 0.05.</p>
<p>In this case, the probability plot and p-value suggest the transformed data follow a normal distribution. You can now use the transformed data for further analysis.</p>
<p><strong>Note</strong></p>
<p>Data transformations will not always produce normal data. You must check the probability plot and p-value to assess whether the normal distribution fits the transformed data well.</p>
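<p>The Box-Cox step, including the follow-up normality check the note calls for, can be sketched in Python. The skewed sample below is a simulated stand-in for nonnormal process data:</p>

```python
# Hypothetical sketch: Box-Cox transform right-skewed data, then verify
# that the transformed data are consistent with a normal distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
skewed = rng.lognormal(mean=1.0, sigma=0.6, size=100)  # positive, nonnormal

transformed, lam = stats.boxcox(skewed)  # lam is the estimated lambda
print(f"Estimated Box-Cox lambda: {lam:.3f}")

# The transformation is not guaranteed to produce normal data, so check
# the result, e.g. with a Shapiro-Wilk test.
stat, p = stats.shapiro(transformed)
print(f"Shapiro-Wilk p-value after transform: {p:.3f}")
if p > 0.05:
    print("Transformed data are consistent with a normal distribution.")
```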
Putting Individual Distribution Identification to Use
<p>It is always good practice to know the distribution of your data before choosing a statistical analysis. Minitab’s Individual Distribution Identification is an easy-to-use tool that can help you identify the distribution of your data as well as eliminate errors and wasted time that result from an inappropriate analysis.</p>
<p>You can use this feature to check the fit of a single distribution, or to compare the fits of several distributions and select the one that best fits your data. If you prefer to work with normal data, you can even use Minitab’s Individual Distribution Identification to transform your nonnormal data to see if they follow a normal distribution.</p>
Data Analysis | Statistics | Statistics Help | Tue, 31 Mar 2015 12:00:00 +0000 | Meredith Griffith
http://blog.minitab.com/blog/meredith-griffith/identifying-the-distribution-of-your-data

Making Better Estimates of Project Duration Using Monte Carlo Analysis
http://blog.minitab.com/blog/statistics-in-the-field/making-better-estimates-of-project-duration-using-monte-carlo-analysis
<p><em style="box-sizing: border-box; font-family: 'Segoe UI', Frutiger, 'Frutiger Linotype', 'Dejavu Sans', 'Helvetica Neue', Tahoma, Arial, sans-serif; line-height: 21px; color: rgb(77, 79, 81); font-size: 14px;">by Lion "Ari" Ondiappan Arivazhagan, guest blogger. </em></p>
<p>Predicting project completion times is one of the major challenges project managers face. Project schedule overruns are quite common due to the high uncertainty in estimating the amount of time activities require, a lack of historical data about project completion, organizational culture, inadequate skills, the complex and elaborate nature of projects, and many other factors.</p>
<p>PMI’s Pulse of the Profession™ research, which is consistent with other studies, shows that "fewer than two-thirds of projects meet their goals and business intent (success rates have been falling since 2008), and about 17 percent fail outright. Failed projects waste an organization’s money: for every US$1 billion spent on a failed project, US$135 million is lost forever…unrecoverable."</p>
<p>In another report on infrastructure project schedule and cost overruns, released in 2013 by PMI-KPMG, 79 percent of the survey respondents agreed that the infrastructure sector in India faces a shortage of project managers with the prerequisite skill set, a paucity that results in time and schedule overruns and inefficient project delivery.</p>
<p>Yet predicting an achievable project completion time is more important today than ever before, due to the high liquidated damages (LD) or penalty charges for late completion and growing dissatisfaction among clients and the public.</p>
The Drawbacks of Traditional CPM Technique
<p>Deterministic, single-point estimates of project activities are highly risky, as it is impossible to complete all the project activities in exactly their estimated single-point durations. Moreover, most estimators tend to produce activity durations that are closer to optimistic estimates than to pessimistic ones. The most likely estimates are modal estimates, and the traditional Critical Path Method (CPM) assumes activity durations are normally distributed. In a normal distribution, a modal estimate has only a 50% chance of the activity being completed within the estimated duration, and hence so does the critical path duration. In other words, we typically start with an estimated project completion time that has a 50% chance of being EXCEEDED from the second the project begins.</p>
Why Probabilistic Method (PERT)
<p>Models that use three-point estimates, such as the PERT model, reduce uncertainty in project completion estimates by taking into account the Optimistic (To), Most-likely (Tml), and Pessimistic (Tp) durations. The width of the range (Tp - To) indicates the degree of risk in each activity duration. While probabilistic estimates can give us three different project completion times based on To, Tml, or Tp, we generally calculate the project completion time from an equivalent single-point expected duration by assigning appropriate weights to each of the three durations. For example, the PERT model, which assumes a Beta distribution, uses the following formula to calculate the expected duration, Te.</p>
<p style="margin-left: 40px;"><img alt="beta distribution for activity duration estimate" class="center" data-loading-tracked="true" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/b18855de35b95b9147053d80dfb639c2/aaeaaqaaaaaaaagyaaaajgvjowy1ntfhltazzdutndhjni04nju1lwexywewmjmxztrmyw_1_.jpg" style="width: 300px; height: 198px;" /></p>
<p>Expected duration, Te = (To + 4Tml + Tp) / 6</p>
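<p>The PERT weighting is easy to sketch in code: the most-likely estimate counts four times as heavily as either extreme. The three-point values passed in below are illustrative assumptions:</p>

```python
# Minimal sketch of the PERT expected-duration formula above.
def pert_expected(to: float, tml: float, tp: float) -> float:
    """Expected duration Te = (To + 4*Tml + Tp) / 6."""
    return (to + 4 * tml + tp) / 6

# Illustrative three-point estimate, in weeks.
print(pert_expected(4, 6, 8))  # → 6.0 (symmetric case: Te equals Tml)
```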
<p><img alt="activity table" class="center" data-loading-tracked="true" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/9cdb231b009720d35f759fbfbb6cc89a/aaeaaqaaaaaaaahsaaaajdk1yziwmzhhltringitndg1oc04zwnllwflmgu0ndy5n2mwza_1_.gif" style="width: 423px; height: 181px;" /></p>
<p>Using the PERT's 3-point estimates of activities whose durations are in weeks, we get the following PERT network diagram to calculate the critical path. The expected durations so calculated are then used as single-point durations in the traditional CPM method to arrive at the critical path duration. Please note that the Te values have been used as fixed, known activity durations (as in CPM), and the critical path is found in the traditional CPM way, using forward and backward passes to calculate the total float of each activity. The critical path is shown below in red.</p>
<p><img alt="flowchart" class="center" data-loading-tracked="true" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/72902a470515f19854e9bb7c81516817/aaeaaqaaaaaaaaf5aaaajdc5ntvinmu2ltjmntmtndg2my1inwnhlwjjnjbjymfmotqxmw_1_.png" style="width: 645px; height: 393px;" /></p>
<p>The Critical Path Duration, T = A + E + H + I + J = 6 + 3 + 4 + 2 + 2 = 17 weeks<br />
<br />
Unfortunately, this PERT project duration, found by adding the critical activities, <em>also </em>enjoys a mere 50% chance of on-time completion. The project completion time, regardless of the distribution shapes of the critical activities, tends to follow an approximately normal distribution if there is a sufficiently large number of activities (say, >30) in the critical path, according to the <a href="http://blog.minitab.com/blog/understanding-statistics/how-the-central-limit-theorem-works">Central Limit Theorem (CLT)</a>. Hence, our problem is still not solved, as the PERT-based project completion time is nothing but a glorified CPM-based completion time. </p>
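<p>The forward pass that produces the 17-week figure can be sketched generically: the critical path length is the longest path through the activity network. The dependency chain below is a hypothetical stand-in for the diagram above; only the critical-chain durations (A=6, E=3, H=4, I=2, J=2 weeks) are taken from the post.</p>

```python
# Sketch of a CPM forward pass over an activity network, using Python's
# standard-library topological sorter. The predecessor structure here is
# assumed, not the post's full network.
from graphlib import TopologicalSorter

durations = {"A": 6, "E": 3, "H": 4, "I": 2, "J": 2}
predecessors = {"A": [], "E": ["A"], "H": ["E"], "I": ["H"], "J": ["I"]}

earliest_finish = {}
for act in TopologicalSorter(predecessors).static_order():
    # An activity starts when its latest-finishing predecessor is done.
    start = max((earliest_finish[p] for p in predecessors[act]), default=0)
    earliest_finish[act] = start + durations[act]

print(f"Critical path duration: {max(earliest_finish.values())} weeks")  # 17
```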
<span style="line-height: 1.6;">Going to Monte Carlo</span>
<p><span style="line-height: 1.6;">This is where simulation techniques, such as Monte Carlo, come in handy. We can use simulation to estimate various project completion times along with their probability of completion so that we can plan contingency reserves (CR) to ensure at least a 90-95% probability of completion (as opposed to 50% by CPM or PERT methods) during the risk management planning stage itself.</span></p>
<p>In <span><a href="http://blog.minitab.com/blog/understanding-statistics/monte-carlo-is-not-as-difficult-as-you-think">Monte Carlo simulation</a></span>, the durations of the critical path activities are simulated to take on random values between their low and high limits, according to the distributions assumed, using a random number generator, until the specified number of simulations—say, 5,000—is exhausted. Each simulation yields one project completion time; when all 5,000 simulations are done, the resulting distribution of 5,000 completion times gives the probability of finishing within any target duration. </p>
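<p>A hedged sketch of that simulation loop in NumPy, standing in for Devize: the three-point estimates below are illustrative assumptions (chosen so each activity's mean matches its Te), so the resulting probabilities will differ from the Devize outputs shown later.</p>

```python
# Sketch: Monte Carlo simulation of the critical-chain completion time.
# Activities A, E, H, I, J are drawn from triangular distributions with
# assumed (To, Tml, Tp) values; the post's actual distributions are unknown.
import numpy as np

rng = np.random.default_rng(0)
n_sims = 5000

three_point = {  # (To, Tml, Tp) in weeks, hypothetical
    "A": (4, 6, 8), "E": (2, 3, 4), "H": (3, 4, 5),
    "I": (1, 2, 3), "J": (1, 2, 3),
}

# Draw every activity n_sims times and sum along the critical path.
totals = sum(
    rng.triangular(lo, ml, hi, size=n_sims)
    for lo, ml, hi in three_point.values()
)

for target in (16, 17, 21):
    p = (totals <= target).mean()
    print(f"P(complete within {target} weeks) = {p:.1%}")
```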
<p>Monte Carlo Simulation outputs (with 5,000 simulations using the software <a href="http://www.minitab.com/products/devize/video/">Devize</a>® from Minitab) for various target project completion times are given below. These simulated outputs help determine the Contingency Reserves (CR) needed in terms of the project completion time for better planning and completion assurance to clients.</p>
<p><img alt="Devize output" class="center" data-loading-tracked="true" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/50f4f5b024835128212824dde202cd7b/aaeaaqaaaaaaaad7aaaajdi1ytiwmgywlteyn2itndg4ns1iyjzmltu2ytmzzgq1nzqzma_1_.jpg" style="width: 700px; height: 311px;" /><br />
The 5,000-simulation output above predicts that the single-point Critical Path duration of 17 weeks has only a 25.9% chance of completion, or a 74.1% chance of failure (exceeding the estimated duration).</p>
<p>The simulation shown below estimates the probability of completing the project ahead of schedule by 1 month, possibly by fast-tracking. It shows that the chances of completing the project in 16 weeks (as opposed to the baseline duration of 17 weeks from CPM) are only 13.14%. Such predictions are very helpful to project managers in effective planning and deployment of project resources.</p>
<p><img alt="Devize output" class="center" data-loading-tracked="true" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/1596804c3c295c585de2c3e3104c58ae/aaeaaqaaaaaaaahwaaaajge2zmvjzmqyltnhotmtnda0my1iyte0ltkymmzmmza2ztkzmg_1_.jpg" style="width: 700px; height: 311px;" /><br />
If the client wants to know or predict the project completion duration that has at least an 85% chance of success, we can easily do that using <span><a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-monte-carlo-simulation-with-an-example">simulations performed in Devize</a></span>. In the output below, we can see that the target completion duration of 21 weeks (USL = 21 weeks) has an 86.58% chance of being completed on time.</p>
<p><img alt="monte carlo simulation software output" class="center" data-loading-tracked="true" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f3a64acc67eb0a963675b3a553060160/aaeaaqaaaaaaaagyaaaajdexzjg0mdhjlwfhmzqtndcyyi04mwq2lwiznmfknte1zwnkna_1_.jpg" style="width: 700px; height: 311px;" /><br />
If the project manager wants to submit a completion time that has at least an 85% chance of completion with all the duration combinations of the critical path activities taken into account, it will be wiser to commit to a completion time of 21 weeks, as opposed to the contractual completion time of 16 weeks, which had only a 13.14% chance of success.</p>
<strong>Monte Carlo Simulation for Project Managers</strong>
<p>Monte Carlo simulation is a boon to project managers in general—and to risk managers in particular—for simulating various possible combinations of the predictor variables within their ranges of values. Project managers can use Monte Carlo simulations to make more informed decisions and, as a result, complete more projects within the agreed time. Software packages such as Devize make the analysis simpler and more intuitive, which in turn makes it easier to mitigate the overall project schedule risks to an acceptable threshold.</p>
<p> </p>
<p class="left"><strong>References </strong></p>
<p class="left">1. <em>An Introduction to Management Science: Quantitative Approaches to Decision Making</em>, by Anderson et al.</p>
<p class="left">2. <em>The PMBOK® Guide</em>, 5th edition, Project Management Institute (PMI).</p>
<p class="left">3. Devize®, Simulation and Optimization software from Minitab® Inc.</p>
<p class="left">4. PMI’s Pulse of the Profession™ -The High Cost of Low Performance. 2013.</p>
<p class="left">5. PMI-KPMG Study on Project Schedule and Cost Overruns - Expedite Infrastructure Projects. 2013.</p>
<p> </p>
<p><strong>About the Guest Blogger: </strong></p>
<p><em>The author, Ondiappan Arivazhagan, "Ari", is an Honors graduate in Civil/Structural Engineering from the University of Madras. He is a certified PMP, PMI-SP, and PMI-RMP from PMI, USA. He is also a Master Black Belt in Lean Six Sigma and has studied Business Analytics at IIM, Bangalore. He has 30 years of global project management experience in various countries around the world and almost 14 years of teaching/training experience in project management, analytics, risk management, and Lean Six Sigma. He is the Founder-CEO of the International Institute of Project Management (IIPM), Chennai, and can be reached at <a target="_blank">askari@iipmchennai.com</a>.</em></p>
<p><em>An earlier version of the article appeared on LinkedIn. </em></p>
Monte Carlo | Monte Carlo Simulation | Quality Improvement | Statistics | Thu, 26 Mar 2015 12:00:00 +0000 | Guest Blogger
http://blog.minitab.com/blog/statistics-in-the-field/making-better-estimates-of-project-duration-using-monte-carlo-analysis

My Life as an Outlier
http://blog.minitab.com/blog/statistics-and-quality-data-analysis/my-life-as-an-outlier
<p>I always knew I was different. Even as a kid.</p>
<p>“Is that me? Way out there in left field?” I asked the doc.</p>
<p>“Yes,” he nodded, as he looked at my chart. “I used brushing to identify you on the graph.”</p>
<p>I wasn’t sure I liked getting brushed. It felt like my true identity was being detected and displayed in a window for all to see.</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/60cd600d9a51989ba25a081e3a6ae3b5/boxplot_outlier_3.jpg" style="width: 866px; height: 293px; border-width: 1px; border-style: solid;" /></p>
<p>The doctor must have sensed my discomfort.</p>
<p>“It’s not uncommon—even for those from a normal population—to appear as outliers,” he said, doing his best to put a good spin on it.</p>
<p>“For example, based on diagnostic criteria that define an outlier as a value that lies beyond the quartile 1 value minus 1.5 times the inter-quartile range, or beyond the quartile 3 value plus 1.5 the times the inter-quartile range, we’d expect <a href="http://www.amstat.org/publications/jse/v19n2/dawson.pdf" target="_blank">0.8% of observations</a> to appear as outliers, even when they come from a perfectly normal population.”</p>
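<p>The fence rule the doctor quotes can be sketched in Python with NumPy; the sample below is a simulated stand-in, not anyone's actual chart:</p>

```python
# Sketch of the 1.5*IQR outlier rule: flag values below Q1 - 1.5*IQR or
# above Q3 + 1.5*IQR.
import numpy as np

def iqr_outliers(data):
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return data[(data < lo) | (data > hi)]

rng = np.random.default_rng(1)
sample = rng.normal(size=1000)  # a perfectly normal population
flagged = iqr_outliers(sample)
# Roughly 0.8% of normal observations are expected outside the fences.
print(f"{len(flagged)} of {len(sample)} flagged ({len(flagged)/len(sample):.1%})")
```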
<p>I wondered where he’d learned to speak <a href="http://omniglot.com/conscripts/vulcan.htm" target="_blank">Golic Vulcan</a> so well. Seeing the blank look on my face, he called in <a href="http://www.minitab.com/products/minitab/assistant">the Assistant</a> to explain in clearer, simpler terms.</p>
<p>“For every 1000 observations,” the Assistant said, “roughly 8 are going to be labelled as funny little stars on this chart—<em>even when they’re perfectly normal</em>. In fact, that’s a very conservative estimate.”</p>
<p>“So maybe I’m just like everybody else?” I asked, hopefully.</p>
<p>“I’d like to do some follow-up tests,” the doctor replied, cautiously.</p>
The Results Come In, and I'm Out
<p>It took about 9 seconds, using Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a>, to get my Dixon’s r22 Ratio Test results back. It seemed like forever. </p>
<p style="margin-left: 40px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/187b1e83536d4580e3660f55b47ce1e5/outlier_plot_of_c1.jpg" style="width: 576px; height: 384px;" /></p>
<p style="margin-left: 40px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/b065073d289bf50d009685ed90c886db/outlier_test_output.jpg" style="width: 582px; height: 324px;" /></p>
<p>“The p-value for the outlier test is less than the significance level of 0.05,” the doctor began. “So we must reject the null hypothesis that you come from the same normal population as others.”</p>
<p>He paused to take a deep drag on a Lucky Strike that he held between the two thumbs of his left hand. Then he droned on in measured tones, summarizing each and every analysis that seemed to confirm my diagnosis as a delinquent datum. </p>
<p>But I didn’t hear a word he said. I was already a million miles away, wondering how my parents would react.</p>
Mom and Dad Try to Interpret Their Outlier
<p>When my dad saw the individuals chart, he hit the roof.</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/ff0ccd4d1d3f7c00378c6a739535c60a/i_chart_of_c1.jpg" style="width: 576px; height: 384px;" /></p>
<p>“He’s out of control!!” Dad exploded.</p>
<p>“I’m sure there must be some special cause for it,” my mother reasoned.</p>
<p>“He’s never learned to respect limits,” he said.</p>
<p>“Let’s not overreact, dear. This might be a false alarm.”</p>
<p>“False alarm, huh?” my dad sneered. “Then what about this stem-and-leaf I found in his bedroom?”</p>
<p style="margin-left: 40px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/d79134a457caf04f34782a5037fd564b/stem_leaf_outlier.jpg" style="line-height: 20.7999992370605px; width: 250px; height: 302px; border-width: 1px; border-style: solid;" /></p>
<p>“What were you doing in my bedroom?!” I protested. “Did you brush me?”</p>
<p>“Maybe it belongs to one of his friends…” my mother said, with the same vague, speculative tone you’d use to say, “Maybe there’s life on other planets…"</p>
Rebel without a Special Cause
<p>When you treat someone a certain way, they begin living up (or down) to your expectations.</p>
<p>Once the world pegged me as an outlier, my attitude quickly changed. If I was going to be treated as an outcast, by god, I’d be an extreme one. <em>Then</em> they’d find out just how problematic a single, aberrant datum could be!</p>
<p>At first, I started messing with simple parametric statistics, like the mean. They were so sensitive and easy to push around, especially when they weren’t part of a large crowd.</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/a871fd11ff8ca8f5211363cc7ebde099/summary_report_for_without_me.jpg" style="width: 600px; height: 450px;" /></p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/8c69ccda6bb11a6fe8f866eca5043059/summary_report_for_with_me.jpg" style="width: 600px; height: 450px;" /></p>
<p>Man, what a power trip! Single-handedly I could drag down an arithmetic average. Or blow a variance sky-high, until it reached over 50 times its original magnitude. Sweet! </p>
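<p>The variance boast checks out: a single extreme value can inflate a small sample's variance well beyond 50 times its original magnitude. The numbers below are an illustrative construction, not the post's data:</p>

```python
# Demo: one outlier joining a tight six-point sample blows up the variance.
import statistics

crowd = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0]
with_me = crowd + [25.0]  # one outlier joins the sample

var_before = statistics.variance(crowd)
var_after = statistics.variance(with_me)
print(f"variance without outlier: {var_before:.4f}")
print(f"variance with outlier:    {var_after:.4f}")
print(f"inflation factor: {var_after / var_before:.0f}x")
```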
<p>See the data huddle fearfully inside a single histogram bin when I’m around? Heh heh heh... they’re so afraid they even begin to question their <em>own</em> normality!</p>
<p>As time went on, my insatiable craving for deviation made me move on to bigger things. That's when I started going out at night to wreck models.</p>
<p>I loved to ruin a clean, shiny model and instantly make it a disjointed, insignificant mess.</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/182788196ed789fea4a40208886b9f75/wreck_model_1.jpg" style="width: 661px; height: 425px;" /><br />
When I was feeling even more insidious, I'd use sleight of hand to make an insignificant relationship <em>appear</em> significant, to the unsuspecting. Little did they know, as soon as I walked away their perfect little model would crumble into a million little unrelated pieces. Ha ha ha!</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/df349145fc41c1567c57efe49ff832d9/wreck_model_2.jpg" style="width: 645px; height: 415px;" /></p>
<p>Ah, those were the days. The grand vicissitudes of youth! My pointy, pixelated head was either soaring high in the clouds, or spiraling down to the bottom of a subterranean sinkhole.</p>
<p>Then I got busted.</p>
<p>Remember that Assistant in the doctor's office? The one who could cogently explain the maximum likelihood function to a group of rodeo clowns? Turns out he's also a part-time policeman who conducts routine data checks.</p>
<p>One day he flagged me running a red light. Then the jig was up. My deviance was exposed for all to see.</p>
What To Do with Me?
<p>Once I'd been apprehended and booked, the debate began. How should the world deal with the error of my ways?<img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/e3c1174814b71c8d5dbc2cda160b43f1/debate_1.jpg" style="float: right; width: 300px; height: 225px;" /></p>
<p>Some wished I'd never existed in the first place. They believed I wasn't fit to live with other normal data. I upset the natural balance.</p>
<p>"How simple and peaceful and wonderful the results would be," they argued, "If we could just delete this errant value."</p>
<p>Others argued it was ethically wrong to expunge me. They believed that, with the right transformation, I could be successfully reformed.</p>
<p>To curb my extremist tendencies, some statistical shrinks recommended that I undergo the rigors of a square root or logarithmic transformation. A few even advised shipping me off to the Box-Cox Boarding School for the Delinquent Datum.</p>
<p>"It can work wonders on reforming outliers like your son," the headmaster told my parents.</p>
<p>Yet others felt the reformist approach was just a charade. A sneaky scaling maneuver with smoke and mirrors--one that really didn't change the true nature of my underlying character. They argued against treating me as an aberration.</p>
<p>"There's nothing really wrong with him," they said. "He doesn't need changing. He's just crying out for attention."</p>
<p>These people recognized a simple, basic truth about me. </p>
<p>All I'd ever really wanted, was to be understood.</p>
Data Analysis | Fun Statistics | Learning | Tue, 24 Mar 2015 12:31:00 +0000 | Patrick Runkel
http://blog.minitab.com/blog/statistics-and-quality-data-analysis/my-life-as-an-outlier

Planning a Trip to Disney World: Using Statistics to Keep It in the Green
http://blog.minitab.com/blog/cpammer/planning-a-trip-to-disney-world%3A-using-statistics-to-keep-it-in-the-green
<p>Our vacation planning has begun. My daughter has requested a trip to <a href="http://disneyworld.disney.go.com/" target="_blank">Disney World</a> as her high school graduation present. For most people, trip planning might mean a simple phone call to the local travel agent or an even simpler do-it-yourself online booking.</p>
<p>Not for me.</p>
<p>As a statistician, a request like this means I’ve got a lot of data analysis ahead. So many travel questions require (in my world, anyway) data-driven decisions. What is the best time to book tickets? What is the best flight/airline to use given the probability of cancellation and/or missed connections traveling from our small airport? How do we schedule our in-park time so we aren’t waiting in line most of the day?</p>
<p style="font-face:Arial,Tahoma,Sans-serif; font-style:italic; font-size:12px; line-height:14px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/883b3a04d5b6300523fd74b68de36f75/castle.JPG" style="width: 250px;" /><br />
The statistician and the graduate-to-be during a previous visit to Disneyland Paris. </p>
<p>My list of questions goes on and on, and will keep me very busy in the weeks to come. But to keep this at a reasonable length for a blog post, let’s just focus on the last one. Specifically, how do we minimize queue time and maximize fun? There are many valid approaches to looking at a question like this, but to keep things simple and use available data, I’m going to take advantage of some <a href="http://www.minitab.com/en-us/products/minitab/whats-new/#">new Data window features</a> available in Minitab Statistical Software 17.2.</p>
<p>Disney queue time data is available on several websites with varying levels of sophistication, but I chose to use <a href="http://www.easywdw.com/cheatsheets/mk_cheatsheet.pdf" target="_blank">a very simple set of average wait times</a>. It’s well known that park attendance is highly seasonal, so I chose to only look at data that matches the predicted crowd level for the days we will be there. We’re also going to focus this particular analysis on my family’s seven must-see attractions at The Magic Kingdom.</p>
<div>
<div id="_com_3">
<p>If you want to follow along in Minitab 17.2, please download my <a href="//cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/ba72cfe9eb69927859bedf6d1a023674/disneywaittimes.mtw">data sheet</a>. </p>
<p>My primary variable of interest is wait time in minutes. I want to investigate wait time by time of day using the specific ride as a grouping variable. To get a quick overview of the data, I started with a <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/time-series-plots-theres-gold-in-them-thar-hills">time series plot</a> (<strong>Graph > Time Series Plot > Simple</strong>). You can overlay multiple graphs (in this case, our seven must-see rides) on a single plot like this by selecting <strong>Overlaid on the same graph</strong> under the <strong>Multiple Graphs</strong> button.</p>
<div>
<div id="_com_2">
<p><img alt="time series plot of Magic Kingdom rides" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/818aa748ebfad35c2c7b0c64a981b398/magic_kingdom_time_series_plot.png" style="width: 577px; height: 385px;" /></p>
<p>From this graph, I see that the new Seven Dwarfs Mine Train—the yellow line at or near the top for every hour—is going to be a tough one to ride without a substantial wait, but with the right timing, we can get through It’s a Small World pretty quickly. Everything else falls somewhere in between.</p>
<p>This is certainly good to know. What would be more useful, though, is to look at the actual data in our Minitab spreadsheet and set up some rules based on what we believe are acceptable wait times. My personal wait time tolerance can be roughly described as:</p>
<ul>
<li>Under 20 minutes: totally Happy.</li>
<li>20 to 35 minutes: may get a little Sleepy.</li>
<li>More than 35 minutes: this better be the best ride ever or I’ll become very Grumpy.</li>
</ul>
<p>Wouldn’t it be great if I could use this information to visualize my data in the Data window? Fortunately, with Minitab 17.2, I can use <a href="http://support.minitab.com/minitab/17/topic-library/minitab-environment/data-and-data-manipulation/conditional-formatting/conditional-formatting-overview/">conditional formatting</a> to do this. Simply click in the Data window, and either right-click or choose <strong>Editor > Conditional Formatting</strong>. I used the options under <strong>Highlight Cell</strong> to set three rules: Less than 20, Between 20 and 35, and Greater than 35. I can now use the resulting Green, Yellow, and Red formatting to plan my day at The Magic Kingdom.</p>
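<p>A rough equivalent of those three highlighting rules, sketched with pandas instead of the Minitab Data window; the ride names and wait times below are made-up stand-ins for the downloadable worksheet:</p>

```python
# Sketch: band wait times into the same three zones used for the
# conditional-formatting rules (<20, 20-35, >35 minutes).
import pandas as pd

waits = pd.DataFrame({
    "ride": ["Seven Dwarfs Mine Train", "Space Mountain", "It's a Small World"],
    "wait_min": [55, 32, 12],
})

# Bin edges follow pandas' right-closed convention: (0, 20], (20, 35], (35, inf).
waits["mood"] = pd.cut(
    waits["wait_min"],
    bins=[0, 20, 35, float("inf")],
    labels=["Happy", "Sleepy", "Grumpy"],
)
print(waits)
```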
<p><img alt="conditional formatting of data for magic kingdom rides" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f88d04106f2ba1fa85d32e2abf2f0dca/conditional_formatting_w1024.png" style="width: 800px; height: 342px;" /></p>
<p>Although Seven Dwarfs Mine Train never makes it into the green zone, putting the wait at the very end of our day—when our feet will be tired from walking anyway—may be reasonable. To avoid red as much as possible, we might want to hit either Space Mountain or Big Thunder Mountain first and move around from there.</p>
<p>I still need to collect information about the distance between rides to complete our plan, but I think we’re off to a promising start!</p>
</div>
</div>
</div>
</div>
Data Analysis | Fun Statistics | Statistics | Stats | Tue, 17 Mar 2015 14:00:00 +0000 | Cheryl Pammer
http://blog.minitab.com/blog/cpammer/planning-a-trip-to-disney-world%3A-using-statistics-to-keep-it-in-the-green