Hypothesis Testing | Minitab

Blog posts and articles about hypothesis testing, especially in the course of Lean Six Sigma quality improvement projects.
http://blog.minitab.com/blog/hypothesis-testing-2/rss
Sat, 28 Nov 2015 04:02:53 +0000

What Is ANOVA? And Who Drinks the Most Beer?
http://blog.minitab.com/blog/michelle-paret/what-is-anova-and-who-drinks-the-most-beer
<p><span style="line-height: 1.6;">Back when I was an undergrad in statistics, I unfortunately spent an entire semester of my life taking a class, diligently crunching numbers with my TI-82, before realizing 1) that I was actually in an Analysis of Variance (ANOVA) class, 2) why I would want to use such a tool in the first place, and 3) that ANOVA doesn’t necessarily tell you a thing about variances.</span></p>
<p>Fortunately, I've had a lot more real-world experience to draw from since then, which makes it much easier to understand today. TI-82 not required.</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f000c555e0ba8a9d9a78461b7230073c/beer.jpg" style="line-height: 20.8px; margin: 10px 15px; float: right; width: 220px; height: 220px;" /></p>
Why Conduct an ANOVA?
<p>In its simplest form—specifically, a 1-way ANOVA—you take 1 continuous (“response”) variable and 1 categorical (“factor”) variable and test the null hypothesis that all group means for the categorical variable are equal. Typically, we’re talking about at least 3 groups, because if you only have 2 groups (samples), then you can use a <span><a href="http://blog.minitab.com/blog/adventures-in-statistics/using-hypothesis-tests-to-bust-myths-about-the-battle-of-the-sexes">2-sample t-test</a></span> and skip ANOVA altogether.</p>
<div>
<p>As an example, let’s look at the <a href="https://en.wikipedia.org/wiki/List_of_countries_by_beer_consumption_per_capita">average annual per capita beer consumption</a> across 3 regions of the world: Asia, Europe, and America. Here are the null and alternative hypotheses:</p>
<p style="margin-left: 40px;">H0: All regions drink the same average amount of beer (μAsia = μEurope = μAmerica)</p>
<p style="margin-left: 40px;">H1: Not all regions drink the same average amount of beer</p>
<p>Any guess on who consumes the most beer?</p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/371609298620629565add1ee54234c37/individual_value_plot_of_volume_consumed__liters__w1024.jpeg" style="border-width: 0px; border-style: solid; width: 600px; height: 394px;" /></p>
<p><span style="line-height: 1.6;">According to the individual value plot created using </span><a href="http://www.minitab.com/products/minitab/" style="line-height: 1.6;">Minitab 17</a><span style="line-height: 1.6;">, Europe consumes the most beer on average and Asia consumes the least. However, are these differences statistically significant? Or are these differences simply due to random variation?</span></p>
How ANOVA Works
<p>The basic logic behind ANOVA is that the within-group variation is due only to random error. Therefore:</p>
<ul>
<li><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/2c9466a316c75e4c17cd250f0aff45bd/comparing_variation.jpg" style="line-height: 20.8px; margin-left: 12px; margin-right: 12px; float: right; width: 112px; height: 200px;" />If the between-group variation is similar to the within-group variation, then the group means are likely to differ only due to random error. (Figure 1)</li>
<li>If the between-group variation is large relative to the within-group variation, then there are likely differences between the group means. (Figure 2)</li>
</ul>
<p>Say what?</p>
<p>In our example, the between-group variation represents the variation <em>between</em> the 3 different regions. And the within-group variation represents the beer consumption variability <em>within</em> a given region. Take Europe, for instance, where we have the Czech Republic. It appears to be the thirstiest country, consuming the most beer at 148.6 liters. But Europe also contains Italy, whose population drinks the least at only 29 liters (perhaps the Italians are passing up the Peroni for some vino and Limoncello?). So you can see that there is variability within the Europe group. There’s also variability within the Asia group, and within the America group.</p>
<p>With ANOVA, we compare the between-group variation (i.e., Asia vs. Europe vs. America) to the within-group variation (i.e., within each of those regions). The higher this ratio, the smaller the p-value. So the term ANOVA refers to the fact that we're using information about the variances to draw conclusions about the means.</p>
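<p>To make the between-group versus within-group comparison concrete, here is a minimal Python sketch of the 1-way ANOVA F statistic. The consumption figures are made up for illustration; they are not the actual Wikipedia data used in the post:</p>

```python
import numpy as np

def one_way_anova_f(*groups):
    """F statistic and degrees of freedom for a 1-way ANOVA."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)                       # number of groups
    n = sum(len(g) for g in groups)       # total observations
    grand_mean = np.concatenate(groups).mean()
    # Between-group variation: how far each group mean sits from the grand mean
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-group variation: spread of observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within, (k - 1, n - k)

# Illustrative per-capita figures (liters) -- not the actual data set
asia    = [20, 31, 42, 38, 25]
europe  = [104, 81, 68, 99, 148]
america = [75, 68, 85, 72, 60]
f_stat, df = one_way_anova_f(asia, europe, america)
```

<p>The F statistic is exactly the ratio described above: the mean square between groups divided by the mean square within groups. The larger F is, the smaller the p-value.</p>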
The Analysis
<p>If we run a 1-way ANOVA using this beer data, Minitab Statistical Software provides the following output in the Session Window:</p>
<p style="margin-left:.5in;"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/956e35d59d68a844fdf00986397dbc9d/oneway_anova_for_beer.jpg" style="border-width: 0px; border-style: solid; width: 460px; height: 286px;" /></p>
<p><span style="line-height: 1.6;">Our </span><a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" style="line-height: 1.6;">p-value</a><span style="line-height: 1.6;"> is displayed as 0.000, which means it is less than 0.0005 and therefore statistically significant at any common alpha level. We can reject the null hypothesis that all regions drink the same average amount of beer. </span></p>
<p><span style="line-height: 1.6;">This leads us to our next question: Which regions differ? Let’s use Tukey multiple comparisons to find out.</span></p>
<p align="center"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/b499b8c87a401c007713a0d5179a5e56/oneway_anova_interval_plot_for_beer_w1024.jpeg" style="border-width: 0px; border-style: solid; width: 600px; height: 394px;" /></p>
<p>Per the footnote on the Tukey comparisons graph, “If an interval does not contain zero, the corresponding means are significantly different.” Therefore, the intervals shown in red tell us where the differences are. Specifically, we can conclude that the average beer consumption for Europe is significantly higher than that of Asia. We can also conclude that America consumes significantly more than Asia. However, there is not sufficient evidence to conclude that the average beer consumption for Europe is different than for America.</p>
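<p>Minitab’s Tukey intervals are built from the studentized range distribution. As a rougher, standard-library-only sketch of the same “does the interval contain zero?” logic, here are Bonferroni-adjusted pairwise intervals using a normal approximation. The data are illustrative, not the blog’s data set:</p>

```python
from itertools import combinations
from statistics import NormalDist, mean, stdev

# Illustrative data -- not the blog's data set
groups = {
    "Asia":    [20, 31, 42, 38, 25],
    "Europe":  [104, 81, 68, 99, 148],
    "America": [75, 68, 85, 72, 60],
}

pairs = list(combinations(groups, 2))
alpha = 0.05 / len(pairs)                 # Bonferroni adjustment for 3 comparisons
z = NormalDist().inv_cdf(1 - alpha / 2)

results = {}
for a, b in pairs:
    xa, xb = groups[a], groups[b]
    diff = mean(xa) - mean(xb)
    se = (stdev(xa) ** 2 / len(xa) + stdev(xb) ** 2 / len(xb)) ** 0.5
    lo, hi = diff - z * se, diff + z * se
    # If the interval excludes zero, the two means differ significantly
    results[(a, b)] = (lo, hi, lo > 0 or hi < 0)
```

<p>With these made-up numbers the Europe-Asia and America-Asia intervals exclude zero while Europe-America does not, mirroring the qualitative conclusion above. The normal approximation is a simplification; Tukey’s method controls the family-wise error rate more precisely.</p>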
The Last Sip
<p>Although it’s unlikely that you’re analyzing beer data in your professional career, I do hope this provides a little insight into ANOVA and how you can utilize it to test averages between 3 or more groups.</p>
<p> </p>
</div>
Data Analysis, Fun Statistics, Hypothesis Testing, Statistics, Statistics Help
Thu, 19 Nov 2015 13:00:00 +0000
http://blog.minitab.com/blog/michelle-paret/what-is-anova-and-who-drinks-the-most-beer
Michelle Paret

Control Charts - Not Just for Statistical Process Control (SPC) Anymore!
http://blog.minitab.com/blog/adventures-in-statistics/control-charts-not-just-for-statistical-process-control-spc-anymore
<p>Control charts are a fantastic tool. These charts plot your process data to identify common cause and special cause variation. By identifying the different causes of variation, you can take action on your process without over-controlling it.</p>
<p>Assessing the stability of a process can help you determine whether there is a problem and identify the source of the problem. Is the mean too high, too low, or unstable? Is variability a problem? If so, is the variability inherent in the process or attributable to specific sources? Control charts answer these questions, which can guide your corrective efforts.</p>
<p>Determining that your process is stable is good information all by itself, but it is also a prerequisite for further analysis, such as <a href="http://blog.minitab.com/blog/understanding-statistics/i-think-i-can-i-know-i-can-a-high-level-overview-of-process-capability-analysis" target="_blank">capability analysis</a>. Before assessing process capability, you must be sure that your process is stable. An unstable process is unpredictable. If your process is stable, you can predict future performance and improve its capability.</p>
<p>While we associate control charts with business processes, I’ll argue in this post that control charts provide the same great benefits in other areas beyond statistical process control (SPC) and Six Sigma. In fact, you’ll see several examples where control charts find answers that you’d be hard pressed to uncover using different methods.</p>
The Importance of Assessing Whether Other Types of Processes Are In Control
<p>I want you to expand your mental concept of a process to include processes outside the business environment. After all, unstable process levels and excessive variability can be problems in many different settings. For example:</p>
<ul>
<li>A teacher has a process that helps students learn the material as measured by test scores.</li>
<li><a href="http://blog.minitab.com/blog/real-world-quality-improvement/control-charts-keep-blood-sugar-in-check" target="_blank">A diabetic has a process for keeping blood sugar in control</a>.</li>
<li>A researcher has a process that causes subjects to experience an impact of 6 times their body weight.</li>
</ul>
<p>All of these processes can be stable or unstable, have a certain amount of inherent variability, and can also have special causes of variability. Understanding these issues can help improve all of them.</p>
<p>The third bullet relates to a <a href="http://blog.minitab.com/blog/adventures-in-statistics/quality-improvement-controlling-variability-more-difficult-than-the-mean" target="_blank">research study</a> that I was involved with. Our research goal was to have middle school subjects jump from 24-inch steps, 30 times, every other school day to determine whether it would increase their bone density. We defined our treatment as the subjects experiencing an impact of 6 body weights. However, we weren’t quite hitting the mark.</p>
<p>To guide our corrective efforts, I conducted a pilot study and graphed the results in the Xbar-S chart below.</p>
<p><img alt="Xbar-S chart of ground reaction forces for pilot study" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/e721bd172aa55d5ec9976e81990f1293/xbars_grf_w1024.jpeg" style="width: 576px; height: 384px;" /></p>
<p>The in-control S chart (bottom) shows that each subject has a consistent landing style that produces impacts of a consistent magnitude—the variability is in control. However, the out-of-control Xbar chart (top) indicates that, while the overall mean (6.141) exceeds our target, different subjects have very different means. Collectively, the chart shows that some subjects are consistently hard landers while others are consistently soft landers. The control chart suggests that the variability is not inherent in the process (common cause variation) but rather assignable to differences between subjects (special cause variation).</p>
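<p>For readers curious where Xbar-S limits come from, here is a sketch that computes them from scratch using the standard SPC formulas for the unbiasing constant c4 and the A3, B3, B4 factors. The subgroup data are synthetic stand-ins, since the jump-study measurements aren’t published here:</p>

```python
import math
import numpy as np

def xbar_s_limits(subgroups):
    """Control limits for an Xbar-S chart from equal-size subgroups."""
    data = np.asarray(subgroups, dtype=float)
    n = data.shape[1]                                  # subgroup size
    xbar = data.mean(axis=1)                           # subgroup means
    s = data.std(axis=1, ddof=1)                       # subgroup std devs
    # c4 corrects the bias of the sample standard deviation
    c4 = math.sqrt(2 / (n - 1)) * math.gamma(n / 2) / math.gamma((n - 1) / 2)
    a3 = 3 / (c4 * math.sqrt(n))
    b3 = max(0.0, 1 - 3 * math.sqrt(1 - c4 ** 2) / c4)
    b4 = 1 + 3 * math.sqrt(1 - c4 ** 2) / c4
    xbarbar, sbar = xbar.mean(), s.mean()
    return {
        "xbar": (xbarbar - a3 * sbar, xbarbar, xbarbar + a3 * sbar),
        "s":    (b3 * sbar, sbar, b4 * sbar),
    }

# Synthetic stand-in: 10 subjects x 5 jumps, impacts near 6 body weights
rng = np.random.default_rng(0)
subgroups = rng.normal(6.0, 0.5, size=(10, 5))
limits = xbar_s_limits(subgroups)
```

<p>A subgroup mean outside the Xbar limits, like the hard and soft landers above, signals special cause variation rather than the common cause variation the limits are built from.</p>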
<p>Based on this information, we decided to train the subjects how to land and to have a nurse observe all of the jumping sessions. This ongoing training and corrective action reduced the variability enough so that the impacts were consistently greater than 6 body weights.</p>
Control Charts as a Prerequisite for Statistical Hypothesis Tests
<p>As I mentioned, control charts are also important because they can verify the assumption that a process is stable, which is required to produce a valid capability analysis. We don’t often think of using control charts to test the assumptions for <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-why-we-need-to-use-hypothesis-tests-in-statistics" target="_blank">hypothesis tests</a> in a similar fashion, but they are very useful for that as well.</p>
<p>The assumption that the measurements used in a hypothesis test are stable is often overlooked. As with any process, if the measurements are not stable, you can’t make inferences about whatever you are measuring.</p>
<p>Let’s assume that we’re comparing test scores between group A and group B. We’ll use this <a href="//cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/File/6053477fc294de59d5b3837389daab3a/groupcomparison.MTW">data set</a> to perform a 2-sample t-test as shown below.</p>
<p style="margin-left: 40px;"><img alt="two sample t-test results" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/d60543dc8eb46afc282b9776b0517a5e/2samplet.png" style="width: 522px; height: 210px;" /></p>
<p>The results appear to show that group A has the higher mean and that the difference is statistically significant. Group B has a marginally higher standard deviation, but we’re not assuming equal variances, so that’s not a problem. If you conduct normality tests, you’ll see that the data for both groups are normally distributed—although we have a sufficient number of observations per group that we don’t have to worry about normality. All is good, right?</p>
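<p>As a quick sketch of the arithmetic behind such a result, here is Welch’s 2-sample t statistic (the unequal-variances form, since we’re not assuming equal variances). The scores are illustrative, not the linked data set:</p>

```python
from statistics import mean, stdev

def welch_t(x, y):
    """Welch's t statistic and approximate degrees of freedom (unequal variances)."""
    vx, vy = stdev(x) ** 2 / len(x), stdev(y) ** 2 / len(y)
    t = (mean(x) - mean(y)) / (vx + vy) ** 0.5
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

# Illustrative scores -- not the data set linked above
group_a = [88, 92, 85, 91, 87, 90, 86, 93]
group_b = [80, 84, 78, 83, 79, 82, 77, 85]
t, df = welch_t(group_a, group_b)
```

<p>A large t relative to the degrees of freedom yields a small p-value, but as the rest of this section shows, a tidy t-test result says nothing about whether the underlying measurements were stable.</p>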
<p>The I-MR charts below suggest otherwise!</p>
<p><img alt="I-MR chart for group A" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/cef240bbb760bb6760ddcbc33e446be9/imr_a.png" style="width: 576px; height: 384px;" /></p>
<p><img alt="I-MR chart of group B" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/e4bd53da7831826959be94540b7ab0a2/imr_b.png" style="width: 576px; height: 384px;" /></p>
<p>The chart for group A shows that these scores are stable. However, in group B, the multiple out-of-control points indicate that the scores are unstable. Clearly, there is a negative trend. Comparing a stable group to an unstable group is not a valid comparison even though the data satisfy the other assumptions.</p>
<p>This I-MR chart illustrates just one type of problem that control charts can detect. Control charts can also test for a variety of patterns in the data and for out-of-control variability. As these data show, you can miss problems using other methods.</p>
Using the Different Types of Control Charts
<p>The I-MR chart assesses the stability of the mean and standard deviation when you don’t have subgroups, while the Xbar-S chart shown earlier assesses the same parameters but <em>with </em>subgroups.</p>
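<p>The I-MR limits are simple enough to sketch directly. The constants 2.66 (3/d2) and 3.267 (D4) are the standard factors for a moving range of span 2; the scores below are illustrative, stable data:</p>

```python
import numpy as np

def imr_limits(x):
    """Control limits for an I-MR chart of individual observations."""
    x = np.asarray(x, dtype=float)
    mr = np.abs(np.diff(x))                # moving ranges of consecutive points
    mr_bar = mr.mean()
    center = x.mean()
    # 2.66 = 3/d2 and 3.267 = D4 for a moving range of span 2
    i_limits  = (center - 2.66 * mr_bar, center, center + 2.66 * mr_bar)
    mr_limits = (0.0, mr_bar, 3.267 * mr_bar)
    return i_limits, mr_limits

scores = [75, 78, 74, 77, 76, 79, 73, 78, 75, 77]   # illustrative, stable data
(i_lcl, i_center, i_ucl), (mr_lcl, mr_bar_val, mr_ucl) = imr_limits(scores)
```

<p>Points outside the I limits, or runs and trends within them, flag the kind of instability the group B chart revealed above.</p>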
<p>You can also use other control charts to test other types of data. In Minitab, the U Chart and Laney U’ Chart are control charts that use the Poisson distribution. You can use these charts in conjunction with the 1-Sample and 2-Sample Poisson Rate tests. The P Chart and Laney P’ Chart are control charts that use the binomial distribution. Use these charts with the 1 Proportion and 2 Proportions tests.</p>
<p>If you're using <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank" title="Minitab 16 Statistical Software">Minitab Statistical Software</a>, you can choose <strong>Assistant > Control Charts</strong> and get step-by-step guidance through the process of creating a control chart, from determining what type of data you have, to making sure that your data meets necessary assumptions, to interpreting the results of your chart.</p>
<p>Additionally, check out the great <a href="http://blog.minitab.com/blog/understanding-statistics/control-chart-tutorials-and-examples">control charts tutorial</a> put together by my colleague, Eston Martz.</p>
Data Analysis, Hypothesis Testing, Learning, Quality Improvement, Six Sigma, Statistics, Statistics Help
Thu, 12 Nov 2015 13:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/control-charts-not-just-for-statistical-process-control-spc-anymore
Jim Frost

Terry Bradshaw Might be the Best Super Bowl Quarterback Ever
http://blog.minitab.com/blog/statistics-and-quality-improvement/terry-bradshaw-might-be-the-best-super-bowl-quarterback-ever
<p><img alt="By U.S. Navy photo by Chief Photographer's Mate Chris Desmond. [Public domain], via Wikimedia Commons" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/fddd60d1ca304b5397c16d0bc050910b/bradshaw.jpg" style="width: 350px; height: 228px; float: right; margin: 10px 15px;" />Last time I touched on the subject of <a href="http://blog.minitab.com/blog/statistics-and-quality-improvement/troy-aikman-or-joe-montana-might-be-the-best-super-bowl-quarterback-ever">the greatest Super Bowl quarterback</a>, I promised a multivariate analysis considering several different statistics. Let’s get right to a factor analysis.</p>
Getting Ready for Factor Analysis
<p>One purpose of factor analysis is to identify underlying factors that you can’t measure directly. These factors explain the variation of many different variables in fewer dimensions. Here are the variables we’re going to consider:</p>
<ul>
<li>Margin of victory</li>
<li>Difference between Super Bowl winner’s passer rating and the playoff passer rating allowed by the opposing team—PR Diff (Winner – Allowed)</li>
<li>Point spread—Spread</li>
<li>Adjusted career rating of the losing quarterback—Adjusted Career PR Loser</li>
<li>The difference between the winning and losing quarterback’s ratings—PR Difference (Winner – Loser)</li>
<li>Winning quarterback’s rating—Passer Rating Winner</li>
<li>Losing quarterback’s rating—Passer Rating Loser</li>
</ul>
Determining the number of factors
<p>To begin the factor analysis, you usually <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/multivariate/principal-components-and-factor-analysis/number-of-principal-components/">determine the number of factors to use</a>, much as you would determine the number of principal components: look for eigenvalues greater than 1, for the number of factors that explain about 80% of the variation, and for factors that explain large amounts of variation relative to the other factors. A scree plot of the eigenvalues looks like this:</p>
<p><img alt="The first two factors have eigenvalues greater than 1. The third eigenvalue is close to 1." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/1ba3b721712e20c94bc9a75f1bd2c000/eigenvalue_scree_plot.png" style="border-width: 0px; border-style: solid; width: 576px; height: 384px;" /></p>
<p>Two factors have eigenvalues greater than 1, and the third factor is close. The 3 factors explain about 80% of the variation in the data, so 3 factors seems like a reasonable number to explore.</p>
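<p>The eigenvalue-based decision can be sketched in a few lines. The data here are synthetic (seven variables driven by three latent factors), since the Super Bowl data set isn’t reproduced in the post. The eigenvalues of a correlation matrix always sum to the number of variables, and the cumulative share shows how many factors are needed to reach 80%:</p>

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-in: rows of observations on 7 variables driven by 3 latent factors
latent = rng.normal(size=(49, 3))
loadings = rng.normal(size=(3, 7))
data = latent @ loadings + 0.5 * rng.normal(size=(49, 7))

corr = np.corrcoef(data, rowvar=False)          # 7 x 7 correlation matrix
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
cum_share = np.cumsum(eigenvalues) / eigenvalues.sum()
# Smallest number of factors whose eigenvalues cover 80% of the variation
n_factors = int(np.searchsorted(cum_share, 0.80) + 1)
```

<p>In practice you would weigh this 80% rule together with the eigenvalue-greater-than-1 rule and the shape of the scree plot, exactly as the paragraph above does.</p>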
Factor rotation
<p>Once we determine the number of factors, we want to see if we can find a rotation that produces underlying factors that make sense. In general, rotation of the factors makes them load on fewer variables so that the factors are simpler. For example, the Minitab output from the varimax rotation shows the unrotated and rotated factor loadings:</p>
<p style="margin-left: 40px;"><span style="font-family: courier new; font-size:9pt">Unrotated Factor Loadings and Communalities<br />
Variable Factor1 Factor2 Factor3 Communality<br />
Passer Rating Loser -0.644 0.502 0.420 0.843<br />
Passer Rating Winner 0.723 0.563 0.127 0.857<br />
PR Difference (Winner - Loser) 0.953 0.027 -0.212 0.955<br />
Adjusted Career PR Loser -0.181 0.748 -0.400 0.752<br />
Spread -0.536 0.189 -0.687 0.794<br />
PR Diff (Winner - Allowed) 0.620 0.570 0.145 0.731<br />
Margin of victory 0.700 -0.324 -0.214 0.640<br />
Variance 3.0410 1.5952 0.9359 5.5721<br />
% Var 0.434 0.228 0.134 0.796<br />
<br />
Rotated Factor Loadings and Communalities<br />
Varimax Rotation<br />
Variable Factor1 Factor2 Factor3 Communality<br />
Passer Rating Loser 0.032 -0.912 -0.106 0.843<br />
Passer Rating Winner 0.909 0.171 0.030 0.857<br />
PR Difference (Winner - Loser) 0.598 0.767 0.096 0.955<br />
Adjusted Career PR Loser 0.336 -0.256 -0.757 0.752<br />
Spread -0.363 -0.085 -0.810 0.794<br />
PR Diff (Winner - Allowed) 0.850 0.086 0.011 0.731<br />
Margin of victory 0.178 0.754 0.198 0.640<br />
Variance 2.1852 2.0973 1.2896 5.5721<br />
% Var 0.312 0.300 0.184 0.796</span></p>
<p style="margin-left: 40px;"> </p>
<p>In this output, the unrotated first factor has 5 variables where the absolute value of the loading is 0.6 or higher. The rotated first factor has 2 variables with loadings of 0.6 or higher, so the rotated factor should be easier to interpret.</p>
<p>We’re lucky, in this case, because the different rotation methods available in Minitab all produce factors that load on the same variables. When different methods agree, you feel more certain about the results.</p>
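<p>Varimax rotation itself fits in a dozen lines. Below is a common SVD-based implementation of Kaiser’s varimax criterion, not Minitab’s own algorithm, applied to the unrotated loadings printed above. Because the rotation is orthogonal, each variable’s communality is unchanged, though signs and column order may differ from Minitab’s output:</p>

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Rotate a loading matrix with Kaiser's varimax criterion (orthogonal)."""
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    R = np.eye(k)
    var = 0.0
    for _ in range(max_iter):
        LR = L @ R
        # Gradient of the varimax criterion, advanced via an SVD step
        u, s, vt = np.linalg.svd(
            L.T @ (LR ** 3 - LR @ np.diag((LR ** 2).sum(axis=0)) / p)
        )
        R = u @ vt
        var, old = s.sum(), var
        if old != 0 and var / old <= 1 + tol:
            break
    return L @ R

# Unrotated loadings from the Minitab output above
unrotated = np.array([
    [-0.644,  0.502,  0.420],   # Passer Rating Loser
    [ 0.723,  0.563,  0.127],   # Passer Rating Winner
    [ 0.953,  0.027, -0.212],   # PR Difference (Winner - Loser)
    [-0.181,  0.748, -0.400],   # Adjusted Career PR Loser
    [-0.536,  0.189, -0.687],   # Spread
    [ 0.620,  0.570,  0.145],   # PR Diff (Winner - Allowed)
    [ 0.700, -0.324, -0.214],   # Margin of victory
])
rotated = varimax(unrotated)
```

<p>The rotation only redistributes each variable’s explained variance across the factors to make the loadings simpler; it cannot create or destroy communality.</p>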
Interpreting the factors
<p>The first factor, which loads highly on the winning quarterback’s passer rating and the difference between that passer rating and what the opposing team allowed in the playoffs, looks like a measure of how well the winning quarterback played. Higher values of this factor indicate better performance.</p>
<p>The second factor is the most difficult to interpret because of the signs of the different variables with high loadings. You get a higher value of the second factor by having a higher margin of victory, by having a higher difference between the ratings of the winning and losing quarterbacks, and by having a lower passer rating by the losing quarterback. I would think that the first two are values you would want to be high, but that you would also want the losing quarterback’s passer rating to be high, which is the opposite of how it enters this factor.</p>
<p>It looks like the variation in the data suggests that a losing team is much more likely to lose by a lot of points if the opposing quarterback plays poorly. In <a href="http://blog.minitab.com/blog/statistics-and-quality-improvement/tom-brady-is-the-best-super-bowl-quarterback-ever">my first post about the best Super Bowl quarterback</a>, I made the judgement that winning a competitive Super Bowl was more impressive than winning a noncompetitive match. Thus, I’m going to tend to think that lower values of the second component, caused by high passer ratings of the opposing quarterback, small differences, and smaller margins of victory are more impressive; but I’ll conduct the final comparisons both ways to see how it affects the conclusion.</p>
<p>The third factor loads on two variables: the point spread and the losing quarterback’s adjusted career passer rating. This factor is about the quality of the victory. The more positive the point spread, the more unexpected the victory was. Also, the better the opposing quarterback, the better the victory was. Because both of these loadings are negative, more negative values of the third factor indicate better performance.</p>
Conclusion?
<p>In addition to the decision of what to do with the second component, there are still some other considerations for how to determine the best Super Bowl quarterback. For example, should we compare the candidate quarterbacks to the average performance or to the best performance? Should we look at the mean performance of the best quarterbacks or the median performance? With so many options available for the remaining analysis, we’ll have to wait for next time to review them all. For now, here are some initial impressions of the three factors.</p>
<p><img alt="Different points identify Montana, Bradshaw, Aikman, and Brady" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/f666023c739116e02a0b2884c172eda7/3d_graph_legend.jpg" style="width: 207px; height: 154px;" /></p>
<p><img alt="Factor scores for all Super Bowl victors" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/0da0045e07c54f01e67fa5816b1a9a9f/3d_graph.jpg" style="width: 576px; height: 378px;" /></p>
<p>In terms of a quarterback playing well, especially in light of the opposing team, Terry Bradshaw’s first victory over the Dallas Cowboys, in Super Bowl X, takes the prize among our candidate quarterbacks. A factor score of 1.62 is not quite as good as Jim Plunkett’s 1.77, but pretty good for a guy throwing against Hall of Fame cornerback Mel Renfro. Among the candidates, Bradshaw also has the second-place score for his second victory over the Cowboys in Super Bowl XIII.</p>
<p>We’ll explore the best of the second factor in more detail, but the extremes make quarterbacks look good in both directions. Among the candidates, Tom Brady has the minimum score from his victory over the Carolina Panthers in Super Bowl XXXVIII. Brady overcame an incredible effort by Jake Delhomme that resulted in a 113.6 passer rating, the highest rating by a losing quarterback in a Super Bowl. Brady’s effort is also the overall minimum for factor 2.</p>
<p>On the maximum side of factor 2 lies another candidate, Montana’s victory over the Broncos in Super Bowl XXIV. The 45-point victory is the only Super Bowl in our data set where the winning quarterback’s passer rating exceeded the losing quarterback’s by over 100 points.</p>
<p>With respect to the third component, no victory was more unexpected than Brady’s overcoming of the Kurt Warner-led Rams in Super Bowl XXXVI. The 14-point underdog did enough that day to fend off the fourth quarter charge of the Greatest Show on Turf in what was, at the time, the first Super Bowl to be decided by a score on the final play of the game.</p>
<p>So, will it come down to Bradshaw or Brady defying the odds, or Montana’s domination? We’ll evaluate all three factors next time!</p>
<p>Ready to try out your own factor analysis? Check out the overview in <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/multivariate/principal-components-and-factor-analysis/perform-a-factor-analysis/">Perform a Factor Analysis</a> on the Minitab Support Center.</p>
The photograph of Terry Bradshaw and Lieutenant Commander Heather Pouncey is by Chief Photographer's Mate Chris Desmond, whose work deserves attribution this Veterans Day even though it's in the public domain.
Data Analysis, Fun Statistics, Hypothesis Testing
Wed, 11 Nov 2015 15:17:00 +0000
http://blog.minitab.com/blog/statistics-and-quality-improvement/terry-bradshaw-might-be-the-best-super-bowl-quarterback-ever
Cody Steele

Practical Statistical Problem Solving Using Minitab to Explore the Problem
http://blog.minitab.com/blog/statistics-in-the-field/practical-statistical-problem-solving-using-minitab-to-explore-the-problem
<p><em style="line-height: 1.6;">By Matthew Barsalou, guest blogger</em></p>
<p>A problem must be understood before it can be properly addressed. A thorough understanding of the problem is critical when performing a <a href="http://blog.minitab.com/blog/understanding-statistics/root-cause-analysis-and-process-improvement-for-patient-safety">root cause analysis (RCA)</a>, and an RCA is necessary if an organization wants to implement corrective actions that truly address the root cause of the problem. An RCA may also be necessary for process improvement projects; it is necessary to understand the cause of the current level of performance before attempts are made to improve the performance.</p>
<p>Many <span style="line-height: 20.8px;">statistical tests</span><span style="line-height: 20.8px;"> related to </span><span style="line-height: 1.6;">problem-solving can be performed using <a href="http://www.minitab.com/products/minitab">Minitab Statistical Software</a>. However, the actual test you select should be based upon the type of data you have and what needs to be understood. The figure below shows various statistical options structured in a cause-and-effect diagram with the main branches based on characteristics that describe what the tests and methods are used for.</span></p>
<p align="center"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/153fae1037161a10541f14572c58ce02/statistical_problem_solving_1_w1024.png" style="width: 700px; height: 310px;" /></p>
<p>The main branch labeled “differences” is split into two high-level sub-branches: hypothesis tests that have an assumption of normality, and non-parametric tests of medians. The <a href="http://blog.minitab.com/blog/understanding-statistics/what-statistical-hypothesis-test-should-i-use">hypothesis tests</a> assume data is normally distributed and can be used to compare means, variances, or proportions to either a given value or to the value of a second sample. An ANOVA can be performed to compare the means of two or more samples.</p>
<p>The non-parametric tests listed in the cause-and-effect diagram are used to compare medians, either to a specified value, or two or more medians, depending upon which test is selected. The non-parametric tests provide an option when data is too skewed to use other options, such as a Z-test.</p>
<p>Time may also be of interest when exploring a problem. If your data are recorded in order of occurrence, a <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/time-series-plots-theres-gold-in-them-thar-hills">time series plot</a> can be created to show each value at the time it was produced; this may give insights into potential changes in a process.</p>
<p style="margin-left: 40px;"><img alt="" src="https://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/1d760d455a55b6a76d9e7fe25a20764e/time_series_gold_2.gif" style="width: 300px; height: 203px;" /></p>
<p>A <a href="http://blog.minitab.com/blog/real-world-quality-improvement/trend-analysis-super-bowl-ticket-prices">trend analysis</a> looks much like the time series plot; however, Minitab also tests for potential trends in the data such as increasing or decreasing values over time. Exponential smoothing options are available to assign exponentially decreasing weights to the values over time when attempting to predict future outcomes.</p>
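<p>The simplest form of a trend test starts from the least-squares slope of the values against time; the sign and size of the slope indicate whether the process is drifting. A minimal sketch with illustrative readings:</p>

```python
import numpy as np

def linear_trend(series):
    """Fit value = intercept + slope * t by least squares; the slope's sign shows the trend."""
    t = np.arange(len(series))
    slope, intercept = np.polyfit(t, series, 1)
    return slope, intercept

# Illustrative process readings drifting upward over time
readings = [10.1, 10.3, 10.2, 10.6, 10.5, 10.9, 11.0, 11.2]
slope, intercept = linear_trend(readings)
```

<p>Minitab’s trend analysis goes further, fitting and comparing several trend models, but a clearly nonzero slope is the basic signal all of them look for.</p>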
<p>Relationships can be explored using various types of <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-tutorial-and-examples">regression analysis</a> to identify potential correlations in the data such as the relationship between the hardness of steel and the quenching time of the steel. This can be helpful when attempting to identify the factors that influence a process. Another option for understanding relationships is <a href="http://blog.minitab.com/blog/understanding-statistics/getting-started-with-factorial-design-of-experiments-doe">Design of Experiments (DoE)</a>, where experiments are planned specifically to economically explore the effects and interactions between multiple factors and a response variable.</p>
<p>Another main branch is for capability and stability assessments. There are two main sub-branches here; one is for<a href="http://blog.minitab.com/blog/understanding-statistics/i-think-i-can-i-know-i-can-a-high-level-overview-of-process-capability-analysis"> measures of process capability and performance</a> and the other is for Statistical Process Control (SPC), which can assess the stability of a process.</p>
<p>The measures of process performance and capability can be useful for establishing the baseline performance of a process; this can be helpful in determining whether process improvement activities have actually improved the process. The SPC sub-branch is split into three lower-level sub-branches: control charts for attribute data, such as the number of defective units; control charts for continuous data, such as diameters; and time-weighted charts that don’t give all values equal weight.</p>
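<p>As an illustration of the capability measures, here is a sketch of Cp and Cpk computed from a sample and its specification limits. One caveat: because this uses the overall standard deviation, it corresponds to what Minitab labels Pp and Ppk; short-term capability (Cp, Cpk proper) uses a within-subgroup estimate instead. The diameters and limits are made up:</p>

```python
from statistics import mean, stdev

def cp_cpk(data, lsl, usl):
    """Process capability indices from a sample and spec limits (overall sigma)."""
    mu, sigma = mean(data), stdev(data)
    cp = (usl - lsl) / (6 * sigma)                    # potential capability
    cpk = min(usl - mu, mu - lsl) / (3 * sigma)       # penalizes off-center processes
    return cp, cpk

# Illustrative diameters with spec limits 9.0 - 11.0
diameters = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 10.0, 9.9, 10.1, 10.0]
cp, cpk = cp_cpk(diameters, lsl=9.0, usl=11.0)
```

<p>Cpk is always at most Cp, and the gap between them shows how much capability is lost to poor centering, which is why a baseline computed before improvement work is so informative.</p>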
<p><a href="http://blog.minitab.com/blog/understanding-statistics/what-control-chart-should-i-use">Control charts</a> can be used both for assessing the current performance of a process, such as by using an individuals chart to determine if the process is in a state of statistical control, and for monitoring the performance of a process, such as after improvements have been implemented.</p>
<p><img alt="" src="https://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/d2c0571a-acbd-48c7-84f4-222276c293fe/Image/e36f985ab12401b70318197b3b8a1c77/control_chart_components.jpg" style="width: 400px; height: 103px;" /></p>
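<p>The control limits pictured above can be sketched numerically. This is a minimal individuals-chart calculation, assuming the usual moving-range estimate of sigma (limits at the mean ± 2.66 × MR-bar); the diameter data are hypothetical.</p>

```python
# A rough sketch of how an individuals (I) chart sets its control limits
# from the average moving range. The diameters below are made up.

def i_chart_limits(values):
    """Center line and 3-sigma limits for an individuals chart.
    Sigma is estimated from the moving range, giving the familiar
    center +/- 2.66 * MR-bar limits."""
    center = sum(values) / len(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    return center - 2.66 * mr_bar, center, center + 2.66 * mr_bar

diameters = [10.1, 9.9, 10.2, 10.0, 9.8, 10.1, 10.0]
lcl, center, ucl = i_chart_limits(diameters)
out_of_control = [d for d in diameters if not lcl <= d <= ucl]
print(f"LCL={lcl:.3f}  CL={center:.3f}  UCL={ucl:.3f}  flagged={out_of_control}")
```

<p>A point landing in the flagged list is the signal of special-cause variation that the chart components in the image represent.</p>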
<p><a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/get-a-head-start-understand-your-data-before-you-analyze-it">Exploratory data analysis (EDA)</a> can be useful for gaining insights into the problem using graphical methods. An individual value plot is useful for simply observing the position of each value relative to the other values in a data set; a box plot can be helpful when comparing the means, medians, and spread of data from multiple processes. The purpose of EDA is not to form conclusions, but to gain insights that can be helpful in forming tentative hypotheses or in deciding which type of statistical test to perform.</p>
<p>The tests and methods presented here do not cover all available statistical tests and methods in Minitab; however, they do provide a large selection of basic options to choose from.</p>
<p>These tools and methods are helpful when exploring a problem, but their use should not be limited to problem exploration. They can also be helpful for planning and verifying improvements. For example, an individual value plot may indicate that one process performs better than a comparable process, and this can then be confirmed using a two-sample t-test. Or, the settings of the better process can be used to plan a DoE to identify the optimal settings for the two processes, and the improvements can be monitored using an Xbar-S chart for each process. </p>
<p> </p>
<p><strong>About the Guest Blogger</strong></p>
<p><em><a href="https://www.linkedin.com/pub/matthew-barsalou/5b/539/198" target="_blank">Matthew Barsalou</a> is a statistical problem resolution Master Black Belt at <a href="http://www.3k-warner.de/" target="_blank">BorgWarner</a> Turbo Systems Engineering GmbH. He is a Smarter Solutions certified Lean Six Sigma Master Black Belt, ASQ-certified Six Sigma Black Belt, quality engineer, and quality technician, and a TÜV-certified quality manager, quality management representative, and auditor. He has a bachelor of science in industrial sciences, a master of liberal studies with emphasis in international business, and has a master of science in business administration and engineering from the Wilhelm Büchner Hochschule in Darmstadt, Germany. He is author of the books <a href="http://www.amazon.com/Root-Cause-Analysis-Step---Step/dp/148225879X/ref=sr_1_1?ie=UTF8&qid=1416937278&sr=8-1&keywords=Root+Cause+Analysis%3A+A+Step-By-Step+Guide+to+Using+the+Right+Tool+at+the+Right+Time" target="_blank">Root Cause Analysis: A Step-By-Step Guide to Using the Right Tool at the Right Time</a>, <a href="http://asq.org/quality-press/display-item/index.html?item=H1472" target="_blank">Statistics for Six Sigma Black Belts</a> and <a href="http://asq.org/quality-press/display-item/index.html?item=H1473&xvl=76115763" target="_blank">The ASQ Pocket Guide to Statistics for Six Sigma Black Belts</a>.</em></p>
Data AnalysisDesign of ExperimentsHypothesis TestingLean Six SigmaQuality ImprovementRegression AnalysisSix SigmaStatisticsFri, 06 Nov 2015 13:00:00 +0000http://blog.minitab.com/blog/statistics-in-the-field/practical-statistical-problem-solving-using-minitab-to-explore-the-problemGuest BloggerBeware of Phantom Degrees of Freedom that Haunt Your Regression Models!
http://blog.minitab.com/blog/adventures-in-statistics/beware-of-phantom-degrees-of-freedom-that-haunt-your-regression-models
<p><img alt="Demon" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/357fb1b145dc3e177860509a7791e6b6/demon1.gif" style="float: right; width: 275px; height: 308px;" />As Halloween approaches, you are probably taking the necessary steps to protect yourself from the various <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-be-a-ghost-hunter-with-a-statistical-mindset" target="_blank">ghosts</a>, goblins, and witches that are prowling around. Monsters of all sorts are out to get you, unless they’re sufficiently bribed with candy offerings!</p>
<p>I’m here to warn you about a ghoul that all statisticians and data scientists need to be aware of: phantom degrees of freedom. These phantoms are really sneaky. You can be out, fitting a regression model, looking at your output, and thinking everything is fine. Then, whammo, these phantoms get you! They suck the explanatory and predictive power right out of your regression model but, deviously, leave all of the output looking just fine. Now that’s truly spooky!</p>
<p>In this blog post, I’ll show you how these phantoms work and how to avoid their dastardly deeds!</p>
What Are Normal Degrees of Freedom in Regression Models?
<p>I’ve written previously about the <a href="http://blog.minitab.com/blog/adventures-in-statistics/the-danger-of-overfitting-regression-models" target="_blank">dangers of overfitting your regression model</a>. An overfit model is one that is too complicated for your data set.</p>
<p>You can learn only so much from a data set of a given size. A degree of freedom is a measure of how much you’ve learned. Your model uses these degrees of freedom with every parameter that it estimates. If you use too many, you’re overfitting the model. The end result is that the <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients">regression coefficients, p-values</a>, and <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit">R-squared</a> can all be misleading.</p>
<p>You can detect overfit models by looking at the number of observations per parameter estimate and assessing the <a href="http://blog.minitab.com/blog/adventures-in-statistics/multiple-regession-analysis-use-adjusted-r-squared-and-predicted-r-squared-to-include-the-correct-number-of-variables" target="_blank">predicted R-squared</a>. However, these methods won’t necessarily detect the misbegotten effects of summoning an excessive number of <em>phantom </em>degrees of freedom!</p>
<p>In the degrees of freedom (DF) column in the ANOVA table below, you can see that this regression model uses 3 degrees of freedom out of a total of 28. It appears that this model is fine. Or is it? <em>&lt;Cue evil laugh!&gt;</em></p>
<p style="margin-left: 40px;"><img alt="Analysis of variance table for a regression model" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/3aad670bc46412ddfe1ab642089349d1/anova_table.png" style="width: 371px; height: 151px;" /></p>
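<p>The DF bookkeeping behind a table like this one is simple arithmetic: with n = 29 observations, the total DF is n − 1 = 28; a model that estimates 3 terms (besides the constant) uses 3 regression DF, leaving 25 for error. A tiny helper makes the split explicit:</p>

```python
# Degrees-of-freedom accounting for a regression ANOVA table,
# matching the 3 / 25 / 28 split described in the text.

def regression_df(n_observations, n_model_terms):
    """Split total degrees of freedom between model and error."""
    total = n_observations - 1
    error = total - n_model_terms
    return n_model_terms, error, total

model_df, error_df, total_df = regression_df(29, 3)
print(model_df, error_df, total_df)  # 3 25 28
```

<p>Every extra term you estimate moves one DF from the error row to the model row, which is exactly how a model "spends" what it has learned from the data.</p>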
What Are Phantom Degrees of Freedom?
<p>Phantom degrees of freedom are devilish because they latch onto you through the manner in which you settle on the final model. They are not detectable in the output for the final model even as they haunt your regression models.</p>
<div style="float: right; width: 275px; margin: 15px 0px 15px 15px; line-height: 1;"><img alt="Guy surrounded by demons" height="369" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/949386ee39624c2cb187e3dc0d9cb630/demons.jpg" width="275" /><br />
<em style="font-size: x-small; line-height: 1;">The dangers of invoking too many phantom degrees of freedom!</em></div>
<p>Every time your incantation adds or removes predictors from a model based on a statistical test, you invoke a phantom degree of freedom because you’re learning something from your data set. However, even when you summon many phantom degrees of freedom during the model selection process, they are not evident in <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">Minitab’s</a> output for the final model. That is what makes them phantoms.</p>
<p>When you invoke too many phantoms, your regression model becomes haunted. This occurs because you’re performing many statistical tests, and every statistical test has a false positive rate. When you try many different models, you're bound to find variables that appear to be significant but are correlated only by chance. These relationships are nothing more than ghostly apparitions!</p>
<p>To protect yourself from this type of bewitching, you need to understand the environment that these phantoms inhabit. Phantom degrees of freedom have the strongest powers when you have a small-to-moderate sample size, many potential predictors, correlated predictors, and when the light of knowledge does not illuminate your conception of the true model.</p>
<p>In this scenario, you are likely to fit many possible models, adding and removing different predictors, and testing curvature and interaction terms in an attempt to conjure an answer out of the darkness. Perhaps you use an automatic incantation procedure like <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-smackdown-stepwise-versus-best-subsets" target="_blank">stepwise or best subsets regression</a>. If you have <a href="http://blog.minitab.com/blog/adventures-in-statistics/what-are-the-effects-of-multicollinearity-and-when-can-i-ignore-them" target="_blank">multicollinearity</a>, the parameter estimates are particularly unhinged.</p>
<p>The ANOVA table we saw above appears to be perfectly normal, but it could be haunted. To divine the truth, you must understand the entire ritual that incited the final model to materialize. If you start out with 20 variables, a sample size of 29, and fit many models to see what works, you could conjure a possessed model beguiling you to accept false conclusions.</p>
<p>In fact, this method of dredging through data to see what sticks casts such a diabolical spell that it can manifest a statistically significant regression model with a high R-squared <em><a href="http://blog.minitab.com/blog/adventures-in-statistics/four-tips-on-how-to-perform-a-regression-analysis-that-avoids-common-problems" target="_blank">from completely random data</a></em>! Beware—this is the environment that the phantoms inhabit!</p>
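<p>One way to see why dredging through models is so dangerous is simply to count the tests. Under the simplifying assumption that each selection test is independent with a 5% false-positive rate (real selection tests are correlated, so this is only illustrative), the chance of at least one spurious "finding" grows quickly:</p>

```python
# How the false-positive risk compounds as you invoke more phantom
# degrees of freedom (i.e., run more selection tests).

def family_wise_error(alpha, n_tests):
    """P(at least one false positive) across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

for k in (1, 5, 20):
    p = family_wise_error(0.05, k)
    print(f"{k:2d} tests at alpha=0.05 -> {p:.0%} chance of a phantom 'finding'")
```

<p>With 20 candidate variables and many fitted models, the odds of conjuring at least one ghostly apparition are well over one in two.</p>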
How to Protect Yourself from the Phantom Degrees of Freedom
<p>To protect yourself from phantom degrees of freedom, information and advance planning are your best talismans. Use the following rites to shine the light of truth on your research and to guide yourself out of the darkness:</p>
<ul>
<li>Conduct prior research about the important variables and their relationships to help you specify the best regression model without the need for data mining.</li>
<li>Collect a large enough sample size to support the level of model complexity that you will need.</li>
<li>Avoid data mining and keep track of how many phantom degrees of freedom that you raise before arriving at your final model.</li>
</ul>
<p>For more information about avoiding haunted models, read my post about <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-choose-the-best-regression-model">How to Choose the Best Regression Model</a>.</p>
<p>Happy Halloween!</p>
<p> </p>
<p style="font-size:10px;"><em>"Buer." Licensed under Public Domain via <a href="https://en.wikipedia.org/wiki/Buer_(demon)#/media/File:Buer.gif" target="_blank">Commons.</a></em></p>
Data AnalysisHypothesis TestingRegression AnalysisStatisticsStatistics HelpThu, 29 Oct 2015 12:00:00 +0000http://blog.minitab.com/blog/adventures-in-statistics/beware-of-phantom-degrees-of-freedom-that-haunt-your-regression-modelsJim FrostP Values and the Replication of Experiments
http://blog.minitab.com/blog/adventures-in-statistics/p-values-and-the-replication-of-experiments
<p>An exciting new study sheds light on the relationship between P values and the replication of experimental results. This study highlights issues that I've emphasized repeatedly—it is crucial to <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" target="_blank">interpret P values correctly</a>, and significant results must be replicated to be trustworthy.<img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/2762a55291d134b8185ba9da47ea6f83/p_fancy.gif" style="margin: 10px 15px; float: right; width: 150px; height: 194px;" /></p>
<p>The study also supports my disagreement with the decision by the <em>Journal of Basic and Applied Social Psychology </em>to ban P values and confidence intervals. About six months ago, I laid out my case that <a href="http://blog.minitab.com/blog/adventures-in-statistics/banned-p-values-and-confidence-intervals-a-rebuttal-part-1">P values and confidence intervals provide important information</a>.</p>
<p>The authors of the August 2015 study, <em><a href="http://www.sciencemag.org/content/349/6251/aac4716" target="_blank">Estimating the reproducibility of psychological science</a></em>, set out to assess the rate and predictors of reproducibility in the field of psychology. Unfortunately, there is a shortage of replication studies available for this study to analyze. The shortage exists because, sadly, it’s generally easier for authors to publish the results of new studies than replicate studies.</p>
<p>To get the reproducibility study off the ground, the group of 300 researchers associated with the project had to conduct their own replication studies first! These researchers conducted replications of 100 psychology studies that had already obtained statistically significant results and had been accepted for publication by three respected psychology journals.</p>
<p>Overall, the study found that only 36% of the replication studies were themselves statistically significant. This low rate reaffirms the importance of replicating the results before accepting a finding as being experimentally established!</p>
<p>Scientific progress is not neat and tidy. After all, we’re trying to model a complex reality using samples. False positives and negatives are an inherent part of the process. These issues are why I oppose the "one and done" approach of accepting a single significant study as the truth. Replication studies are as important as the original study.</p>
<p>The study also assessed whether various factors can predict the likelihood that a replication study will be statistically significant. The authors looked at factors such as the characteristics of the investigators, hypotheses, analytical methods, as well as indicators of the strength of the original evidence, such as the P value.</p>
<p>Most factors did not predict reproducibility. However, the study found that the P value did a pretty good job! The graph shows how lower P values in the original studies are associated with a higher rate of statistically significant results in the follow-up studies.</p>
<p><img alt="Bar chart that shows replication rate by original P-value" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/01bec95ec63634b9062de57edde1ecf7/replicationbypvalue.png" style="width: 475px; height: 317px;" /></p>
<p>Right now it’s not looking like such a good idea to ban P values! Clearly, P values provide important information about which studies warrant more skepticism.</p>
<p>The study results are consistent with what I wrote in <a href="http://blog.minitab.com/blog/adventures-in-statistics/five-guidelines-for-using-p-values">Five Guidelines for Using P Values</a>:</p>
<ul>
<li>The exact P value matters—not just whether a result is significant or not.</li>
<li>A P value near 0.05 isn’t worth much by itself.</li>
<li>Replication is crucial.</li>
</ul>
<p>It’s important to note that while the replication rate in psychology is probably different than other fields of study, the general principles should apply elsewhere.</p>
Hypothesis TestingStatisticsStatistics HelpStatistics in the NewsThu, 01 Oct 2015 12:00:00 +0000http://blog.minitab.com/blog/adventures-in-statistics/p-values-and-the-replication-of-experimentsJim FrostChi-Square Analysis: Powerful, Versatile, Statistically Objective
http://blog.minitab.com/blog/michelle-paret/chi-square-analysis-powerful-versatile-statistically-objective
<p style="line-height: 20.7999992370605px;">To make objective decisions about the processes that are critical to your organization, you often need to examine categorical data. You may know how to use a t-test or ANOVA when you’re comparing measurement data (like weight, length, <span style="line-height: 1.6;">revenue, </span><span style="line-height: 1.6;">and so on), but do you know how to compare attribute or counts data? It’s easy to do with <a href="http://www.minitab.com/products/minitab">statistical software</a> like Minitab. </span></p>
<p style="line-height: 20.7999992370605px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/60bfd1eb8d2c2c3689bce89ea55453ab/chisquare_onevariable_w1024.jpeg" style="line-height: 20.7999992370605px; width: 350px; height: 230px; float: right; margin: 10px 15px;" /></p>
<p style="line-height: 20.7999992370605px;">One person may look at this bar chart and decide that each production line had the same <span style="line-height: 1.6;">proportion of defects. But another person may focus on the small difference between the bars and decide that one of the lines has outperformed the others. Without an appropriate statistical analysis, how can you know which person is right?</span></p>
<p style="line-height: 20.7999992370605px;">When time, money, and quality depend on your answers, you can’t rely on subjective visual assessments alone. To answer questions like these with statistical objectivity, you can use a Chi-Square analysis.</p>
Which Analysis Is Right for Me?
<p style="line-height: 20.7999992370605px;">Minitab offers three Chi-Square tests. The appropriate analysis depends on the number of variables that you want to examine. And for all three options, the data can be formatted either as raw data or summarized counts.</p>
<strong>Chi-Square Goodness-of-Fit Test – 1 Variable</strong>
<p style="line-height: 20.7999992370605px;">Use Minitab’s <strong>Stat > Tables > Chi-Square Goodness-of-Fit Test (One Variable)</strong> when you have just one variable.</p>
<p style="line-height: 20.7999992370605px;">The Chi-Square Goodness-of-Fit Test can test if the proportions for all groups are equal. It can also be used to test if the proportions for groups are equal to specific values. For example:</p>
<ul style="line-height: 20.7999992370605px;">
<li>A bottle cap manufacturer operates three production lines and records the number of defective caps for each line. The manufacturer uses the <strong>Chi-Square Goodness-of-Fit Test</strong> to determine if the proportion of defects is equal across all three lines.</li>
<li>A bottle cap manufacturer operates three production lines and records the number of defective caps for each line. One line runs at high speed and produces twice as many caps as the other two lines, which run at a slower speed. The manufacturer uses the <strong>Chi-Square Goodness-of-Fit Test</strong> to determine if the number of defects for each line is proportional to the volume of caps it produces.</li>
</ul>
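<p>As a rough sketch of what the goodness-of-fit calculation does under the hood, here is the second bottle-cap example worked by hand. The defect counts are hypothetical, and the expected proportions reflect the high-speed line's doubled volume (2:1:1); in practice you would use the Minitab menu command above.</p>

```python
# Hand-rolled chi-square goodness-of-fit for the bottle-cap example.
# Observed counts are hypothetical.

def chi_square_gof(observed, proportions):
    """Chi-square statistic for observed counts vs. expected proportions."""
    total = sum(observed)
    expected = [p * total for p in proportions]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

defects = [60, 25, 35]  # lines A (high speed), B, C
stat = chi_square_gof(defects, [0.5, 0.25, 0.25])
# With 3 groups there are 2 degrees of freedom; the 5% critical value
# of the chi-square distribution with df=2 is about 5.991.
print(f"chi-square = {stat:.2f}; reject volume-proportional hypothesis: {stat > 5.991}")
```

<p>Here the statistic stays well below the critical value, so the defect counts are consistent with being proportional to production volume.</p>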
<strong>Chi-Square Test for Association – 2 Variables</strong>
<p style="line-height: 20.7999992370605px;">Use Minitab’s <strong>Stat > Tables > Chi-Square Test for Association</strong> when you have two variables.</p>
<p style="line-height: 20.7999992370605px;">The Chi-Square Test for Association can tell you if there’s an association between two variables. In other words, it can test whether two variables are independent or not. For example:</p>
<ul style="line-height: 20.7999992370605px;">
<li>A paint manufacturer operates two production lines across three shifts and records the number of defective units per line per shift. The manufacturer uses the <strong>Chi-Square Test for Association</strong> to determine if the defect rates are similar across all shifts and production lines. Or, are certain lines during certain shifts more prone to defects?</li>
<li>A credit card billing center records the type of billing error that is made, as well as the type of form that is used. The billing center uses a Chi-Square Test to determine whether certain types of errors are related to certain forms.</li>
</ul>
<p style="line-height: 20.7999992370605px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/7af9e9b2ee624e7d912393d7debe7f1b/chisquare_twovariables_w1024.jpeg" style="width: 500px; height: 329px;" /></p>
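<p>The mechanics of the association test can be sketched the same way: expected counts come from the row and column totals, and the statistic sums the scaled squared differences between observed and expected. The 2-lines-by-3-shifts defect counts below are invented for illustration.</p>

```python
# Hand-rolled chi-square test for association (independence) on a
# two-way table; counts are hypothetical.

def chi_square_association(table):
    """Chi-square statistic for independence in a two-way count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: production lines 1-2; columns: shifts 1-3 (hypothetical counts).
defects = [[20, 25, 45],
           [22, 24, 24]]
stat = chi_square_association(defects)
# df = (rows-1)*(cols-1) = 2; the 5% critical value is about 5.991.
print(f"chi-square = {stat:.2f}; association suggested: {stat > 5.991}")
```

<p>For this made-up table the statistic falls short of the critical value, so line and shift look independent, which is what "no association" means here.</p>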
<strong>Cross Tabulation and Chi-Square – 2 or more variables</strong>
<p style="line-height: 20.7999992370605px;">Use Minitab’s <strong>Stat > Tables > Cross Tabulation and Chi-Square </strong>when you have two or more variables.</p>
<p style="line-height: 20.7999992370605px;">If you simply want to test for associations between two variables, you can use either <strong>Cross Tabulation and Chi-Square</strong> or <strong>Chi-Square Test for Association</strong>. However, <span><a href="http://blog.minitab.com/blog/understanding-statistics/using-cross-tabulation-and-chi-square-the-survey-says">Cross Tabulation and Chi-Square</a></span> also lets you control for the effect of additional variables. Here’s an example:</p>
<ul style="line-height: 20.7999992370605px;">
<li>A dairy processing plant records information about each defective milk carton that it produces. The plant uses a Cross Tabulation and Chi-Square analysis to look for dependencies between the defect types and the machine that produces the carton, while controlling for any shift effect. Perhaps a particular filling machine is prone to a certain type of defect, but only during the first shift.</li>
</ul>
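<p>A bare-bones way to picture "controlling for" a third variable is to run the same independence test separately within each level of that variable. Minitab's Cross Tabulation handles this through its layering options; the sketch below, with entirely hypothetical counts, only illustrates the idea.</p>

```python
# Stratified independence check: the same chi-square association test,
# run within each shift separately. All counts are hypothetical.

def chi_square_association(table):
    """Chi-square statistic for independence in a two-way count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return sum((table[i][j] - row_totals[i] * col_totals[j] / grand) ** 2
               / (row_totals[i] * col_totals[j] / grand)
               for i in range(len(table)) for j in range(len(table[0])))

# Filling machine (rows) x defect type (columns), one table per shift.
by_shift = {
    "shift 1": [[30, 5], [10, 15]],   # machine A leans toward defect type 1
    "shift 2": [[20, 18], [19, 20]],  # roughly independent
}
for shift, table in by_shift.items():
    stat = chi_square_association(table)
    # df = 1 for a 2x2 table; the 5% critical value is about 3.841.
    print(f"{shift}: chi-square = {stat:.2f} (dependence: {stat > 3.841})")
```

<p>In this toy data the dependence shows up only in the first shift, mirroring the milk-carton scenario where a machine is prone to a defect type only during one shift.</p>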
<p style="line-height: 20.7999992370605px;">This analysis also offers advanced options. For example, if your categories are ordinal (good, better, best; or small, medium, large) you can include a special test for concordance.</p>
Conducting a Chi-Square Analysis in Minitab
<p style="line-height: 20.7999992370605px;">Each of these analyses is easy to run in Minitab. For more examples that include step-by-step instructions, just navigate to the Chi-Square menu of your choice and then click Help > Example.</p>
<p style="line-height: 20.7999992370605px;">It can be tempting to make subjective assessments about a given set of data, their makeup, and possible interdependencies, but why risk an error in judgment when you can be sure with a Chi-Square test?</p>
<p style="line-height: 20.7999992370605px;">Whether you’re interested in one variable, two variables, or more, a Chi-Square analysis can help you make a clear, statistically sound assessment.</p>
Data AnalysisHypothesis TestingLean Six SigmaQuality ImprovementSix SigmaStatisticsStatistics HelpThu, 27 Aug 2015 12:33:39 +0000http://blog.minitab.com/blog/michelle-paret/chi-square-analysis-powerful-versatile-statistically-objectiveMichelle ParetThe Null Hypothesis: Always “Busy Doing Nothing”
http://blog.minitab.com/blog/using-data-and-statistics/the-null-hypothesis-always-busy-doing-nothing
<p>The 1949 film <a href="http://www.imdb.com/title/tt0041259/" target="_blank"><em>A Connecticut Yankee in King Arthur's Court</em></a> includes the song “Busy Doing Nothing,” and this could be written about the <a href="http://blog.minitab.com/blog/understanding-statistics/things-statisticians-say-failure-to-reject-the-null-hypothesis">Null Hypothesis</a> as it is used in statistical analyses. </p>
<p></p>
<p>The words to the song go:</p>
<p style="margin-left: 40px;"><em>We're busy doin' nothin'<br />
<span style="line-height: 1.6;">Workin' the whole day through<br />
Tryin' to find lots of things not to do </span></em></p>
<p><span style="line-height: 1.6;">And that summarises the role of the Null Hypothesis perfectly. Let me explain why.</span></p>
<span style="line-height: 1.6;">What's the Question?</span>
<p>Before doing any statistical analysis—in fact even before we collect any data—we need to define what problem and/or question we need to answer. Once we have this, we can then work on defining our Null and Alternative Hypotheses.</p>
<p>The null hypothesis is always the option that maintains the status quo and results in the least amount of disruption, hence it is “Busy Doin’ Nothin'”. </p>
<p>When the data are highly improbable under the Null Hypothesis and we reject the Null Hypothesis, then we will have to take some action and we will no longer be “Doin Nothin'”.</p>
<p>Let’s have a look at how this works in practice with some common examples.</p>
<table style="margin-left: 40px;">
<tr>
<th>Question</th>
<th>Null Hypothesis</th>
</tr>
<tr>
<td>Do the chocolate bars I am selling weigh 100g?</td>
<td>Chocolate Weight = 100g<br />
<br />
If I am giving my customers the right size chocolate bars, I don’t need to make changes to my chocolate packing process.</td>
</tr>
<tr>
<td>Are the diameters of my bolts normally distributed?</td>
<td>Bolt diameters are normally distributed.<br />
<br />
If my bolt diameters are normally distributed, I can use any statistical techniques that use the standard normal approach.</td>
</tr>
<tr>
<td>Does the weather affect how my strawberries grow?</td>
<td>Number of hours of sunshine has no effect on strawberry yield.<br />
<br />
Amount of rain has no effect on strawberry yield.<br />
<br />
Temperature has no effect on strawberry yield.</td>
</tr>
</table>
<p>Note that the last instance in the table, investigating if weather affects the growth of my strawberries, is a bit more complicated. That's because I needed to define some metrics to measure the weather. Once I decided that the weather was a combination of sunshine, rain, and temperature, I established my null hypotheses. These all assume that none of these factors affects the strawberry yield. I only need to control the sunshine, temperature, and rain if the probability of seeing data like mine when they have no effect is very small.</p>
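<p>The chocolate-bar row of the table maps directly onto a one-sample t-test of H0: mean weight = 100g. Here is a from-scratch sketch with invented weights; the 2.365 critical value is the standard two-sided 5% point of the t distribution with 7 degrees of freedom.</p>

```python
# One-sample t-test for the chocolate-bar null hypothesis; the weights
# below are hypothetical.
import math

def one_sample_t(values, mu0):
    """t statistic for H0: population mean equals mu0."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return (mean - mu0) / math.sqrt(var / n)

weights = [99.2, 100.4, 99.8, 100.1, 99.5, 100.0, 99.7, 100.3]
t = one_sample_t(weights, 100.0)
# Two-sided 5% critical value for df = 7 is about 2.365; |t| below that
# means we keep "Doin' Nothin'" - no evidence the mean differs from 100g.
print(f"t = {t:.3f}; reject H0: {abs(t) > 2.365}")
```

<p>With this sample the statistic stays inside the critical bounds, so the null hypothesis survives and the packing process is left alone.</p>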
Is Your Null Hypothesis Suitably Inactive?
<p><span style="line-height: 1.6;">So in conclusion, in order to be “Busy Doin’ Nothin’”, your Null Hypothesis has to be as follows:</span></p>
<ul>
<li>Based on a logical question.</li>
<li>Focused on one objective.</li>
<li>Requires action only if <a href="http://blog.minitab.com/blog/michelle-paret/alphas-p-values-confidence-intervals-oh-my">the p-value of the test</a> is low (typically below 5%).</li>
</ul>
Hypothesis TestingStatisticsWed, 12 Aug 2015 12:00:00 +0000http://blog.minitab.com/blog/using-data-and-statistics/the-null-hypothesis-always-busy-doing-nothingGillian GroomLessons from a Statistical Analysis Gone Wrong, part 1
http://blog.minitab.com/blog/understanding-statistics/lessons-from-a-statistical-analysis-gone-wrong-part-3-v2
<p style="line-height: 18.9090900421143px;">I don't like the taste of crow. That's a shame, because I'm about to eat a huge helping of it. </p>
<p style="line-height: 18.9090900421143px;">I'm going to tell you how I messed up an analysis. But in the process, I learned some new lessons and was reminded of some older ones I should remember to apply more carefully. </p>
This Failure Starts in a Victory
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/3e3a70cd6b6094eda21615f6eee14c0f/pharoah.jpg" style="line-height: 18.9090900421143px; border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 280px; height: 296px;" /></p>
<p style="line-height: 18.9090900421143px;"><span style="line-height: 18.9090900421143px;">My mistake originated in the 2015 Triple Crown victory of American Pharoah. I'm no racing enthusiast, but I knew this horse had ended almost four decades of Triple Crown disappointments, and that was exciting. </span><span style="line-height: 18.9090900421143px;">I'd never seen a </span><a href="http://blog.minitab.com/blog/the-statistics-game/triple-crown-odds-ill-have-another" style="line-height: 18.9090900421143px;">Triple Crown</a><span style="line-height: 18.9090900421143px;"> won before. It hadn't happened since 1978. </span></p>
<p style="line-height: 18.9090900421143px;">So when an acquaintance asked to contribute a guest post to the Minitab Blog that compared American Pharoah with previous Triple Crown contenders, including the record-shattering Secretariat, who took the Triple Crown in 1973, I eagerly accepted. </p>
<p style="line-height: 18.9090900421143px;">In reviewing the post, I checked and replicated the contributor's analysis. It was a fun post, and I was excited about publishing it. But a few days after it went live, I had to remove it: the analysis was not acceptable. </p>
<p style="line-height: 18.9090900421143px;">To explain how I made my mistake, I'll need to review that analysis. </p>
Comparing American Pharoah and Secretariat
<p style="line-height: 18.9090900421143px;"><span style="line-height: 18.9090900421143px;">In the post, we used Minitab's </span><a href="http://www.minitab.com/products/minitab/" style="line-height: 18.9090900421143px;">statistical software</a><span style="line-height: 18.9090900421143px;"> to compare Secretariat's performance to other winners of Triple Crown races. </span></p>
<p style="line-height: 18.9090900421143px;">Since 1926, the Belmont Stakes has been the longest of the three races at 1.5 miles. The analysis began by charting 89 years of winning horse times<span style="line-height: 1.6;">:</span><span style="line-height: 18.9090900421143px;"> </span></p>
<p style="line-height: 18.9090900421143px;"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ad64da996c235ee5ff8cb4c3cef66292/belmont1.png" style="width: 500px; height: 334px;" /></p>
<p style="line-height: 18.9090900421143px;"><span style="line-height: 1.6;">Only two data points were outside of the I-chart's control limits:</span></p>
<ul style="line-height: 18.9090900421143px;">
<li>The fastest winner, Secretariat's 1973 time of 144 seconds</li>
<li>The slowest winner, High Echelon's 1970 time of 154 seconds</li>
</ul>
<p style="line-height: 18.9090900421143px;">The average winning time was 148.81 seconds, which Secretariat beat by more than 4 seconds. </p>
Applying a Capability Approach to the Race Data
<p style="line-height: 18.9090900421143px;">Next, the analysis approached the data from a capability perspective: Secretariat's time was used as a lower spec limit, and the analysis sought to assess the probability of another horse beating that time. </p>
<p style="line-height: 18.9090900421143px;">The way you assess capability depends on the distribution of your data, and a normality test in Minitab showed this data to be nonnormal<span style="line-height: 18.9090900421143px;">. </span></p>
<p style="line-height: 18.9090900421143px;"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/89d338659c8cace002fe777633a238cf/belmont2.png" style="width: 500px; height: 334px;" /></p>
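<p>Minitab's default normality test is Anderson-Darling. As a from-scratch sketch of that statistic (Stephens' version for estimated mean and standard deviation), with hypothetical winning times standing in for the real Belmont data:</p>

```python
# Anderson-Darling normality test, hand-rolled. The times below are
# hypothetical stand-ins, not the actual Belmont winning times.
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def anderson_darling(values):
    """Adjusted A-D statistic A*^2 vs. a normal with estimated parameters.
    Values above roughly 0.752 suggest non-normality at the 5% level."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    u = sorted(normal_cdf((v - mean) / sd) for v in values)
    a2 = -n - sum((2 * i + 1) * (math.log(u[i]) + math.log(1 - u[n - 1 - i]))
                  for i in range(n)) / n
    return a2 * (1 + 0.75 / n + 2.25 / n ** 2)  # small-sample adjustment

times = [146, 147, 147, 148, 148, 149, 149, 150, 151, 154]  # hypothetical
stat = anderson_darling(times)
print(f"A*^2 = {stat:.3f}; evidence of non-normality: {stat > 0.752}")
```

<p>A statistic above the critical value is what sends you down the transformation route described next.</p>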
<p style="line-height: 18.9090900421143px;"><span style="line-height: 18.9090900421143px;">When you run Minitab's normal capability analysis, you can elect to apply the Johnson transformation, which can automatically transform many nonnormal distributions before the capability analysis is performed. This is an extremely convenient feature, but here's where I made my mistake. </span></p>
<p style="line-height: 18.9090900421143px;">Running the capability analysis with Johnson transformation, using Secretariat's 144-second time as a lower spec limit, produced the following output:</p>
<p style="line-height: 18.9090900421143px;"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/0f3d2967b87743714821fa47e6bd999d/belmont4.png" style="width: 500px; height: 375px;" /></p>
<p style="line-height: 18.9090900421143px;">The analysis found a 0.36% chance of any horse beating Secretariat's time, making it very unlikely indeed. </p>
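<p>To make the calculation concrete: once the data have been transformed to approximate normality, the chance of beating a lower spec limit is just a normal tail area. The sketch below assumes already-transformed, normally distributed data and uses made-up times; it is not Minitab's Johnson-fit procedure, only the final tail-probability step.</p>

```python
import math
import statistics

def prob_below_lsl(data, lsl):
    """P(observation < lower spec limit), assuming the data are normal."""
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    z = (lsl - mu) / sigma
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF at z

# hypothetical winning times (seconds); 144 is Secretariat's record
times = [148.1, 149.3, 147.6, 150.2, 148.9, 149.8, 147.2, 151.0]
p_beat = prob_below_lsl(times, lsl=144)  # a small tail probability
```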
<p>The same method was applied to Kentucky Derby and Preakness data. </p>
<p style="line-height: 18.9090900421143px;"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/6268cce550e1f97de81d0889da797814/belmont5.png" style="width: 500px; height: 375px;" /></p>
<p style="line-height: 18.9090900421143px;">We found a 5.54% chance of a horse beating Secretariat's Kentucky Derby time.</p>
<p style="line-height: 18.9090900421143px;"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/21fda483b790f76e051ddb22359cbfe2/belmont6.png" style="width: 500px; height: 375px;" /></p>
<p style="line-height: 18.9090900421143px;">We found a 3.5% probability of a horse beating Secretariat's Preakness time.</p>
<p style="line-height: 18.9090900421143px;">Despite the billions of dollars and countless time and effort spent trying to make thoroughbred horses faster over the past 43 years, no one has yet beaten “Big Red,” as Secretariat was known. So the analysis indicated that American Pharoah may be a great horse, but he is no Secretariat. </p>
<p style="line-height: 18.9090900421143px;"><span style="line-height: 1.6;">That conclusion may well be true...but it turns out we can't use <em>this</em> analysis to make that assertion. </span></p>
My Mistake Is Discovered, and the Analysis Unravels
<p style="line-height: 18.9090900421143px;">Here's where I start chewing those crow feathers. A day or so after sharing the post about American Pharoah, a reader sent the following comment: </p>
<p style="line-height: 18.9090900421143px; margin-left: 40px;"><em>Why does Minitab allow a Johnson Transformation on this data when using <strong>Quality Tools > Capability Analysis > Normal > Transform</strong>, but does not allow a transformation when using <strong>Quality Tools > Johnson Transformation</strong>? Or could I be doing something wrong? </em></p>
<p style="line-height: 18.9090900421143px;">Interesting question. Honestly, it hadn't even occurred to me to try running the Johnson transformation on the data by itself. </p>
<p style="line-height: 18.9090900421143px;"><span style="line-height: 18.9090900421143px;">But if the Johnson Transformation worked when performed as part of the capability analysis, it ought to work when applied outside of that analysis, too. </span></p>
<p style="line-height: 18.9090900421143px;">I suspected the person who asked this question might have just checked a wrong option in the dialog box. So I tried running the Johnson Transformation on the data by itself.</p>
<p style="line-height: 18.9090900421143px;">The following <span style="line-height: 18.9090900421143px;">note appeared in Minitab's session window: </span></p>
<p style="line-height: 18.9090900421143px;"><img alt="no transformation is made" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/2892f3daa1549df56defbdd4fe9dc48a/no_transformation.gif" style="line-height: 18.9090900421143px; width: 500px; height: 55px;" /></p>
<p style="line-height: 18.9090900421143px;">Uh oh. </p>
<p style="line-height: 18.9090900421143px;">Our reader <em>hadn't</em> done anything wrong, but it was looking like I had made an error somewhere. But where?</p>
<p style="line-height: 18.9090900421143px;">I'll show you exactly where I made my mistake in <a href="http://blog.minitab.com/blog/understanding-statistics/lessons-from-a-statistical-analysis-gone-wrong-part-2">my next post.</a> </p>
<p style="font-size: 9px;">Photo of American Pharoah used under Creative Commons license 2.0. Source: Maryland GovPics <a href="https://www.flickr.com/people/64018555@N03" target="_blank">https://www.flickr.com/people/64018555@N03</a> </p>
Data AnalysisFun StatisticsHypothesis TestingStatisticsStatistics in the NewsTue, 14 Jul 2015 12:00:00 +0000http://blog.minitab.com/blog/understanding-statistics/lessons-from-a-statistical-analysis-gone-wrong-part-3-v2Eston MartzTime of Game: Are MLB Games Getting Any Shorter?
http://blog.minitab.com/blog/starting-out-with-statistical-software/time-of-game-are-mlb-games-getting-any-shorter
<p>Over the past few years, the average length of an MLB game has been steadily increasing. We can create a quick time series plot in Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a> to display this:</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/732ead34-1005-4470-b034-d7f8b87fabcf/Image/5d6c7b2edfd1611a6daddbf93f4deb76/lengthofgame.jpg" style="width: 576px; height: 384px;" /></p>
<p><span style="line-height: 1.6;">As games have grown longer, many fans have come to see this as a negative: games seemed to drag on, with a lot of unnecessary stoppages and breaks. </span></p>
<p><span style="line-height: 1.6;"><img alt="game lasts into the night" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/354e61083ab5a1abf9eff587c412ccea/diamond.jpg" style="margin: 10px 15px; float: right; width: 300px; height: 230px;" />To combat this trend, and to try to speed up games to make them more accessible to casual fans, a few different rules have gone into effect this year to help increase the pace of games. First, the batter is now required to keep one foot in the batter's box at all times (with a few exceptions). Additionally, there is a clock that runs between innings and pitching changes to make sure that the game restarts in a timely manner.</span></p>
<p>But are these rules having an effect at all? We can look at the time of game data for games played in the first month of the season, and see if the games have been any shorter. We can use a 1-sample t-test within Minitab to determine if the average game length is less than <span><a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-significance-levels-alpha-and-p-values-in-statistics">a certain hypothesized value</a></span>; in our case, we can look and see if it's less than last year's average. </p>
<p>I have created a data set that has game time for every game played so far in 2015 (up through April 29). In Minitab, we can go to <strong>Stat > Basic Statistics > 1-Sample t...</strong> and fill out the dialog box as follows:</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/732ead34-1005-4470-b034-d7f8b87fabcf/Image/5be1455e1a6e1ab7452329871f801a00/dialog.png" /></p>
<p>We check the box to perform our hypothesis test. The hypothesized mean we're testing against is the average time of game (in minutes) from 2014, which was 187.8.</p>
<p>Now we want to click 'Options' and change our hypothesis to "less than." Why? A one-tailed test allots all of our alpha to detecting an effect in one specific direction, ignoring the possibility of an effect in the other direction. Statistically, by <em>not </em>looking for an effect in one direction, we gain more power to detect an effect in the other. In this case, we are ignoring the possibility that the mean time of games may be greater than last year's. </p>
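<p>For readers without Minitab handy, the same one-sided test can be sketched in a few lines of Python. The game times below are invented for illustration (the real analysis used every 2015 game through April 29), and the cutoff is the one-sided t critical value for 9 degrees of freedom at alpha = 0.05.</p>

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """t statistic for testing H0: mu = mu0 against Ha: mu < mu0."""
    n = len(sample)
    xbar = statistics.fmean(sample)
    s = statistics.stdev(sample)
    return (xbar - mu0) / (s / math.sqrt(n))

# hypothetical game lengths in minutes
games = [172, 181, 169, 175, 190, 165, 178, 183, 171, 176]
t_stat = one_sample_t(games, mu0=187.8)  # 187.8 = 2014 average
reject_h0 = t_stat < -1.833              # one-sided t(0.05, 9 df)
```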
<p>Here are our results:</p>
<p style="margin-left: 40px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/732ead34-1005-4470-b034-d7f8b87fabcf/Image/783aab154c24fd0d5dcc63be5ee7e007/ttest.PNG" style="width: 531px; height: 130px;" /></p>
<p>Look at the mean, and the upper bound. The mean time of games played so far is about 177 minutes, almost a full 10 minutes shorter! The upper bound indicates that we are 95% confident that the true mean is less than 180 minutes, clocking in at under 3 hours.</p>
<p>Based on early season results, it appears that the new rules are serving their intended purpose. </p>
<p>What hypotheses—sports-related or otherwise<span style="line-height: 18.9090900421143px;">—could you use a 1-sample t-test to examine?</span></p>
Data AnalysisFun StatisticsHypothesis TestingStatistics in the NewsFri, 15 May 2015 12:00:00 +0000http://blog.minitab.com/blog/starting-out-with-statistical-software/time-of-game-are-mlb-games-getting-any-shorterEric HeckmanBanned: P Values and Confidence Intervals! A Rebuttal, Part 2
http://blog.minitab.com/blog/adventures-in-statistics/banned-p-values-and-confidence-intervals-a-rebuttal-part-2
<p>In <a href="http://blog.minitab.com/blog/adventures-in-statistics/banned-p-values-and-confidence-intervals-a-rebuttal-part-1">my previous post</a>, I wrote about the hypothesis testing ban in the <em>Journal of Basic and Applied Social Psychology.</em> I showed how P values and confidence intervals provide important information that descriptive statistics alone don’t provide. In this post, I'll cover the editors’ concerns about hypothesis testing and how to avoid the problems they describe.</p>
<p>The editors describe hypothesis testing as "invalid" and the significance level of 0.05 as a “crutch” for weak data. They claim that it is a bar that is “too easy to pass and sometimes serves as an excuse for lower quality research.” They also bemoan the fact that sometimes the initial study obtains a significant P value but follow-up replication studies can fail to obtain significant results.</p>
<p>Ouch, right?</p>
<p>Their arguments against hypothesis testing focus on the following:</p>
<ol>
<li>You can’t determine the probability that either the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/basics/null-and-alternative-hypotheses/" target="_blank">null hypothesis or the alternative hypothesis</a> is true.</li>
<li>Studies that attempt to replicate previous significant findings do not always obtain significant results.</li>
</ol>
<p>These issues are nothing new, and they aren't showstoppers for hypothesis testing. In fact, I believe using them to ban null hypothesis testing represents a basic misunderstanding of both how to correctly use hypothesis test results and how the scientific process works.</p>
P Values Are Frequently Misinterpreted and This Leads to Problems
<p>P values are not "invalid" but they do answer a different question than what many readers realize. There is a common misconception that the P value represents the probability that the null hypothesis is true. Under this mistaken understanding, a P value of 0.04 would indicate there is a 4% probability of a false positive when you reject the null hypothesis. This is <strong>WRONG</strong>!</p>
<p>The question that a P value <em>actually</em> answers is: <em>If </em>the null hypothesis is true, are my data unusual?</p>
<p>The correct interpretation for a P value of 0.04 is that <em>if the null hypothesis is true</em>, you would obtain the observed effect or a larger one in 4% of studies due to random sampling error. In other words, the observed sample results are unlikely if there truly is no effect in the population.</p>
<p>The actual false positive rate associated with a P value of 0.04 depends on a variety of factors but it is typically at least 23%. Unfortunately, the common misconception creates the illusion of substantially more evidence against the null hypothesis than is justified. You actually need a P value around 0.0027 to achieve an error rate of around 4.5%, which is close to the rate that many mistakenly attribute to a P value of 0.05.</p>
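<p>Where do numbers like these come from? One standard way to get a lower bound on the false positive rate is the Sellke-Bayarri-Berger bound on the Bayes factor, combined with a 50/50 prior on the null. The sketch below uses that bound; it is my illustration, not necessarily the exact calculation behind the figures above.</p>

```python
import math

def min_false_positive_rate(p, prior_null=0.5):
    """Lower bound on the false positive rate for a given P value, using
    the -e * p * ln(p) bound on the Bayes factor (valid for p < 1/e)."""
    bf = -math.e * p * math.log(p)
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = prior_odds * bf
    return posterior_odds / (1 + posterior_odds)

rate_04 = min_false_positive_rate(0.04)      # roughly 0.26
rate_0027 = min_false_positive_rate(0.0027)  # roughly 0.04
```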
<p>The higher-than-expected false positive rate is the basis behind the editors’ criticisms that P values near 0.05 are a “crutch” and “too easy to pass.” However, this is due to misinterpretation rather than a problem with P values. The answer isn’t to ban P values, but to learn how to correctly interpret and use the results.</p>
Failure to Replicate
<p>The common illusion described above ties into the second issue: studies that fail to replicate significant findings. If the false positive rate is higher than expected, it makes sense that the number of follow-up studies that can’t replicate the previously significant results will also be higher than expected.</p>
<p>Another related common misunderstanding is that once you obtain a significant P value, you have a proven effect. Trafimow claims in an earlier editorial that once a significant effect is published, "it becomes sacred." This claim misrepresents the scientific method because there is no magic significance level that distinguishes between the studies that have a true effect and those that don’t with 100% accuracy.</p>
<p>A P value near 0.05 simply indicates that the result is worth another look, but it’s nothing you can hang your hat on by itself. Instead, it’s all about repeated testing to lower the error rate to an acceptable level.</p>
<p>You <em>always</em> need repeated testing to prove the truth of an effect!</p>
How to Use Hypothesis Tests Correctly
<div style="float: right; width: 250px; margin: 25px 25px;">
<p><img alt="water filter" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/f9f46589d93374e78dc6c9dcbfdcade5/soma_carafe_w1024.jpeg" style="float: right; width: 250px; height: 177px;" /> <em>Keep filtering until the results are clean and clear!</em></p>
</div>
<p>How does replication work with hypothesis tests and the false positive rate? Simulation studies show that the lower the P value, the greater the reduction in the probability that the null hypothesis is true from the beginning of the experiment to the end.</p>
<p>With this in mind, think of hypothesis tests as a filter that allows you to progressively lower the probability that the null hypothesis is true each time you obtain significant results. With repeated testing, we can filter out the false positives, as I illustrate below.</p>
<p>We generally don’t know the probability that a null hypothesis is true, but I’ll run through a hypothetical scenario based on the simulation studies. Let’s assume that initially there is a 50% chance that the null hypothesis is true. You perform the first experiment and obtain significant results. Let’s say this reduces the probability that the null is true down to 25%. Another study tests the same hypothesis, obtains significant results, and lowers the probability of a true null hypothesis even further to 10%.</p>
<p>Wash, rinse, and repeat! Eventually the probability that the null is true becomes a tiny value. This shows why significant results need to be replicated in order to become trustworthy findings.</p>
<p>The actual rate of reduction can be faster or slower than the example above. It depends on various factors including the initial probability of a true null hypothesis and the exact P value of each experiment. I used conservative P values near 0.05.</p>
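<p>The hypothetical filtering above is easy to express as repeated Bayesian updating. In the sketch below, each significant study is assumed to carry a Bayes factor of 1/3 in favor of the alternative; that value is my assumption, chosen so the first steps match the 50% to 25% to 10% sequence described earlier.</p>

```python
def update_prob_null(prob_null, bayes_factor):
    """One pass through the 'filter': update P(H0 is true) after a study,
    where bayes_factor = P(data | H0) / P(data | H1)."""
    odds = (prob_null / (1 - prob_null)) * bayes_factor
    return odds / (1 + odds)

p_null = 0.5                 # assumed starting probability the null is true
history = [p_null]
for _ in range(4):           # four successive significant replications
    p_null = update_prob_null(p_null, bayes_factor=1 / 3)
    history.append(p_null)
# history: 0.5 -> 0.25 -> 0.1 -> about 0.036 -> about 0.012
```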
<p>Of course, there’s always the possibility that the initial significant finding won’t be replicated. <em>This is a normal part of the scientific process and not a problem. </em>You won’t know for sure until a subsequent study tries to replicate a significant result!</p>
<p>Reality is complex and we’re trying to model it with samples. Conclusively proving a hypothesis with a single study is unlikely. So, don’t expect it!</p>
<p style="margin-left: 40px;">"A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance." <br />
<em><span style="line-height: 1.6;">—Sir Ronald A. Fisher, original developer of P values.</span></em></p>
Don't Blame the P Values for Poor Quality Studies
<p>You can’t look at a P value to determine the quality level of a study. The overall quality depends on <em>many </em>factors that occur well before the P value is calculated. A P value is just the end result of a long process.</p>
<p>The factors that affect the quality of a study include the following: theoretical considerations, experimental design, variables measured, sampling technique, sample size, measurement precision and accuracy, data cleaning, and the modeling method.</p>
<p>Any of these factors can doom a study before a P value is even calculated!</p>
<p>The blame that the editors place on P values for low quality research appearing in their journal is misdirected. This is a peer-reviewed journal and it’s the reviewers’ job to assess the quality of each study and publish only those with merit.</p>
Four Key Points!
<ol>
<li>Hypothesis test results such as P values and confidence intervals provide important information in addition to descriptive statistics.</li>
<li>But you need to interpret them correctly.</li>
<li>Significant results must be replicated to be trustworthy.</li>
<li>To evaluate the quality of a study, you must assess the entire process rather than the P value.</li>
</ol>
How to Avoid Common Problems with Hypothesis Test Results
<p>Hypothesis tests and statistical output such as P values and confidence intervals are powerful tools. Like any tool, you need to use them correctly to obtain good results. Don't ban the tools. Instead, change the bad practices that surround them. <span style="line-height: 1.6;">Please follow these links for more details and references.</span></p>
<p><a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" target="_blank">How to Correctly Interpret P Values</a>: Just as the title says, this post helps you to correctly interpret P values and avoid the mistakes associated with the incorrect interpretations.</p>
<p><a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-why-we-need-to-use-hypothesis-tests-in-statistics" target="_blank">Understanding Hypothesis Tests</a>: The graphical approach in this series of three posts provides a more intuitive understanding of how hypothesis testing works and what statistical significance truly means.</p>
<p><a href="http://blog.minitab.com/blog/adventures-in-statistics/not-all-p-values-are-created-equal" target="_blank">Not all P Values are Created Equal</a>: If you want to better understand the false positive rate associated with different P values and the factors that affect it, this post is for you! This post also shows you how lower P values reduce the probability of a true null hypothesis.</p>
<p><a href="http://blog.minitab.com/blog/adventures-in-statistics/five-guidelines-for-using-p-values" target="_blank">Five Guidelines for Using P Values</a>: The journal editors raise issues about how P values can be abused. These are real issues when P values are used incorrectly. However, there’s no need to banish them! This post provides simple guidelines for how to navigate these issues and avoid common problems.</p>
<p><em>The photo of the water filter is by the Wikimedia user TheMadBullDog and used under this <a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en" target="_blank">Creative Commons license</a>.</em></p>
Data AnalysisHypothesis TestingLearningStatisticsStatistics HelpStatsThu, 14 May 2015 12:00:00 +0000http://blog.minitab.com/blog/adventures-in-statistics/banned-p-values-and-confidence-intervals-a-rebuttal-part-2Jim FrostImproving Recycling Processes at Rose-Hulman, Part III
http://blog.minitab.com/blog/real-world-quality-improvement/improving-recycling-processes-at-rose-hulman-part-iii
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ccb8f6d6-3464-4afb-a432-56c623a7b437/Image/fa7a4559e547be217d5fa38f61c978c1/landfill.jpg" style="float: right; width: 350px; height: 253px; margin: 10px 15px;" />In previous posts, I discussed the results of a recycling project done by Six Sigma students at Rose-Hulman Institute of Technology last spring. (If you’re playing catch up, you can read <a href="http://blog.minitab.com/blog/real-world-quality-improvement/a-little-trash-talk3a-improving-recycling-processes-at-rose-hulman" target="_blank">Part I</a> and <a href="http://blog.minitab.com/blog/real-world-quality-improvement/a-little-trash-talk%3A-improving-recycling-processes-at-rose-hulman%2C-part-ii" target="_blank">Part II</a>.)</p>
<p>The students did an awesome job reducing the amount of recycling thrown into the normal trash cans across all of the institution’s academic buildings. At the end of the spring quarter (2014), 24% of the trash (by weight) consisted of recyclable items; at the beginning of that quarter, the figure was 36%, so you can see that they were very successful in reducing this percentage!</p>
<p>The fall quarter (2015) brought a new set of Six Sigma students to Rose-Hulman who were just as dedicated to reducing the amount of recycling thrown into normal trash cans, and I want to cover their success in this post, as well as some of the neat statistical methods they used when completing their project.</p>
Fall 2015 goals
<p>This time around, the students wanted to at least maintain, and ideally improve on, the percentage the spring quarter (2014) students had achieved. They set out with a specific goal of reducing the amount of recycling in the trash to 20% by weight.</p>
<p>In order to further reduce the recyclables in the academic buildings in fall 2015, the standard “Define, Measure, Analyze, Improve, Control” (DMAIC) methodology of Six Sigma was once again implemented. The main project goal focused on standardizing the recycling process within the buildings, and their plan to reduce the amount of recyclables focused on optimizing the operating procedure for collecting recyclables in all academic building areas (excluding classrooms) where trash and recycling are collected.</p>
<p>Many of the same DMAIC tools that were used by spring 2014 students were also used here, including—<a href="http://support.minitab.com/quality-companion/3/help-and-how-to/run-projects/brainstorming/ct-tree/" target="_blank">Critical to Quality Diagrams</a>, <a href="http://support.minitab.com/quality-companion/3/help-and-how-to/run-projects/maps/process-map/" target="_blank">Process Maps</a>, <a href="http://blog.minitab.com/blog/real-world-quality-improvement/spicy-statistics-and-attribute-agreement-analysis" target="_blank">Attribute Agreement Analysis</a>, <a href="http://blog.minitab.com/blog/marilyn-wheatleys-blog/evaluating-a-gage-study-with-one-part-v2" target="_blank">Gage R&R</a>, Statistical Plots, <a href="http://blog.minitab.com/blog/adventures-in-software-development/risk-based-testing-at-minitab-using-quality-companions-fmea" target="_blank">FMEA</a>, <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-tutorial-and-examples" target="_blank">Regression</a>—among many others.</p>
Making and measuring improvements
<p>The spring 2014 initiative added recycling bins to every classroom, which created a measurable improvement. The fall 2015 effort focused on improvement through <em>standardization of operation</em>. For example, many areas in the academic buildings suffer from random placement and arrangement of trash cans and recycling bins. The students thought standardization of bin areas (one trash, one plastic/aluminum recycling, and one paper recycling) would lessen the confusion of recycling, and clear signage and stickers on identically shaped trash cans and recycling bins would be better visual cues of where to place waste of both kinds.</p>
<p>For fall 2015, there were seven teams, and they were assigned different academic building floors (not including classrooms) and common areas. Unlike the spring 2014 data collection, the teams did not combine the trash from their assigned areas. They treated each recycling station as a unique data point.</p>
<p>After implementing the improvements to standardize the bins, the teams collected data for four days across twenty-nine total stations. Thus, there were a total of 116 fall 2015 improvement percentages. The fall 2015 students used the post-improvement percentage of recyclables in the trash from spring 2014 (24%) as their baseline for determining improvement in fall 2015.</p>
<p>The descriptive statistics for the percentage of recyclables (by weight) in the trash were as follows:</p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ccb8f6d6-3464-4afb-a432-56c623a7b437/Image/5c77690aaaff21d0b33eb5083f82074e/descriptive_stats.jpg" style="border-width: 0px; border-style: solid; width: 550px; height: 67px;" /></p>
<p>Below, the students put together a histogram and a boxplot of the data using <a href="http://www.minitab.com/products/minitab/features/" target="_blank">Minitab Statistical Software</a>. Over half of the stations (61 out of 116) had less than 5% recyclables in the trash. Forty-six of the 116 recycling stations had no recyclables at all. The value of the third quartile (16.6%) meant that 75% of the stations had less than 16.6% recyclables. The descriptive statistics above showed that the sample mean was much larger than the sample median, and the graphs confirmed why: the data are strongly positively skewed.</p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ccb8f6d6-3464-4afb-a432-56c623a7b437/Image/4e730181a9288e531ff9caf69a347dd0/histogram.jpg" style="border-width: 0px; border-style: solid; width: 624px; height: 206px;" /></p>
<p>Even though the 116 data points didn’t follow a normal distribution and there was a large mound of 0’s as part of the distribution from collection spots that had no recyclables, the students trusted that the <a href="http://blog.minitab.com/blog/understanding-statistics/how-the-central-limit-theorem-works" target="_blank">Central Limit Theorem</a> with a sample size of 116 would generate a sampling distribution of the means that was normally distributed. Because of the large sample size and unknown standard deviation, they used a <em>t</em> distribution to create a 95% confidence interval for the true mean percentage of recyclables in the trash for fall 2015.</p>
<p>Also using Minitab, they constructed the 95% confidence interval:</p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ccb8f6d6-3464-4afb-a432-56c623a7b437/Image/2ccf17f68f0055c32282c2020f2c9108/one_sample_t.jpg" style="border-width: 0px; border-style: solid; width: 423px; height: 48px;" /></p>
<p>The 95% confidence interval meant that the students were 95% certain that the interval [9.94, 18.22] contains the true mean percentage of recyclables in the trash for fall 2015. At an alpha level of 0.025, they were able to reject the null hypothesis in the test of H0: μ = 24% versus Ha: μ < 24%, because 24% was not contained in the two-sided 95% confidence interval. (Remember that 24% was the mean percentage of recyclables in trash after the spring 2014 improvement phase.) The null hypothesis in the test of H0: μ = 20% versus Ha: μ < 20% was rejected as well, since 20% also falls above the interval. This meant that they had met their goal of reducing the percentage of recyclables in the trash to below 20% for this project!</p>
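<p>The interval itself comes from the familiar t-based formula: the sample mean plus or minus t times s over the square root of n. The sketch below runs that computation on a small set of made-up station percentages (the real data set had n = 116, so its two-sided critical t value was about 1.98).</p>

```python
import math
import statistics

def t_confidence_interval(sample, t_crit):
    """Two-sided CI for the mean: xbar +/- t_crit * s / sqrt(n)."""
    n = len(sample)
    xbar = statistics.fmean(sample)
    half_width = t_crit * statistics.stdev(sample) / math.sqrt(n)
    return xbar - half_width, xbar + half_width

# hypothetical recyclable percentages for 12 stations (note the zeros)
pcts = [0, 0, 3.2, 18.5, 0, 7.1, 25.0, 0, 12.4, 40.2, 0, 5.5]
lo, hi = t_confidence_interval(pcts, t_crit=2.201)  # t(0.025, 11 df)
rejects_24 = hi < 24  # H0: mu = 24 is rejected when 24 lies above the CI
```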
Continuing to analyze the data
<p>The students also subgrouped their data by collection day. Each day consisted of data from 29 recycling stations. The comparative boxplots and individual value plots below show the percentage of recyclables in the trash across the four collection dates. (The horizontal dotted line in the boxplot is the mean from spring 2014’s post-improvement data.)</p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ccb8f6d6-3464-4afb-a432-56c623a7b437/Image/664e8bf0f443d278376e71a70817e727/ivp.jpg" style="border-width: 0px; border-style: solid; width: 624px; height: 207px;" /></p>
<p>Though all four collection days have sample means less than 24%, the boxplots show that the first three days are clearly below 24%, and the medians from all four days are less than 11%. The individual value plots reveal the large number of 0’s on each day, representing collection spots that had no recyclables. Both graphs display the positively skewed nature of the data; because of this skewness, each day’s mean is much larger than its median.</p>
How capable was the process?
<p>Next, the students ran a <a href="http://blog.minitab.com/blog/real-world-quality-improvement/using-statistics-to-show-your-boss-process-improvements" target="_blank">process capability analysis</a> for the seven areas where trash was collected over four days:</p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ccb8f6d6-3464-4afb-a432-56c623a7b437/Image/8f9b85a55164f9e957809a8be1eef1c0/process_cap.jpg" style="border-width: 0px; border-style: solid; width: 465px; height: 347px;" /></p>
<p>The process capability indices were Pp = 0.48 and Ppk = 0.42. (The Pp value corresponds to a 1.44 Sigma Level, while the Ppk value corresponds to a 1.26 Sigma Level.) Recall that the previous Ppk value after improvements in <a href="http://blog.minitab.com/blog/real-world-quality-improvement/a-little-trash-talk%3A-improving-recycling-processes-at-rose-hulman%2C-part-ii" target="_blank">spring 2014</a> was 0.22. The fall 2015 index is almost double that value!</p>
<p>The students knew that they still needed to account for the total weight of the trash and recyclables by calculating the percentage of recyclables per station. Some collection stations with the highest percentage of recyclables had the lowest total weight, while some stations with the lowest percentage of recyclables had the highest total weight. Instead of strictly using a capability index to indicate their improvement, they incorporated a <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-tutorial-and-examples" target="_blank">regression</a> model for the trash weight versus the total weight of trash and recyclables to show that the percentage of recyclables in the trash was less than 20%.</p>
<p>The 95% confidence interval for the true mean slope of the regression line was [0.856, 0.954]. In other words, the students were 95% certain that the trash weight made up somewhere between 0.86 and 0.95 of the total weight of the collection; hence, the recycling weight was between 0.046 and 0.144 of the total weight, which is clearly below 20% with 95% confidence! From this, they were able to state through yet another type of analysis that there was a statistically significant improvement over the spring 2014 recycling project, and that they met their goal of reducing the percentage of recyclables in the trash to below 20%. Compared to the spring 2014 project, where 24% of the trash was recyclables, the fall 2015 students saved <em>at least</em> 4% more recyclables from ending up in the local landfill!</p>
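<p>A slope interval like the one the students used can be reproduced with ordinary least squares. The sketch below fits trash weight against total weight on made-up station data and builds the t-based interval for the slope; the real analysis, of course, used the 116 actual observations.</p>

```python
import math
import statistics

def slope_with_ci(x, y, t_crit):
    """OLS slope of y on x, with a two-sided t-based confidence interval."""
    n = len(x)
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(sse / (n - 2) / sxx)  # standard error of the slope
    return slope - t_crit * se, slope, slope + t_crit * se

# hypothetical (total weight, trash weight) pairs in pounds
total = [10, 14, 8, 20, 12, 16, 9, 18]
trash = [9.1, 12.4, 7.3, 18.2, 10.6, 14.5, 8.0, 16.3]
lo, slope, hi = slope_with_ci(total, trash, t_crit=2.447)  # t(0.025, 6 df)
```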
<p>For even more on this topic, be sure to check out Rose-Hulman student Peter Olejnik’s blog posts on how he and the recycling project team at the school used regression to evaluate project results:</p>
<p><a href="http://blog.minitab.com/blog/statistics-in-the-field/using-regression-to-evaluate-project-results%2C-part-1" target="_blank">Using Regression to Evaluate Project Results, part 1</a></p>
<p><a href="http://blog.minitab.com/blog/statistics-in-the-field/using-regression-to-evaluate-project-results%2C-part-2" target="_blank">Using Regression to Evaluate Project Results, part 2</a></p>
<p><em>Many thanks to Dr. Diane Evans for her contributions to this post!</em></p>
Data AnalysisFun StatisticsHypothesis TestingLean Six SigmaLearningSix SigmaStatisticsStatsFri, 08 May 2015 12:00:00 +0000http://blog.minitab.com/blog/real-world-quality-improvement/improving-recycling-processes-at-rose-hulman-part-iiiCarly BarryBanned: P Values and Confidence Intervals! A Rebuttal, Part 1
http://blog.minitab.com/blog/adventures-in-statistics/banned-p-values-and-confidence-intervals-a-rebuttal-part-1
<p>Banned! In February 2015, editor David Trafimow and associate editor Michael Marks of the <em>Journal of Basic and Applied Social Psychology</em> <a href="http://www.tandfonline.com/doi/full/10.1080/01973533.2015.1012991#abstract" target="_blank">declared</a> that the null hypothesis statistical testing procedure is invalid. They promptly banned P values, confidence intervals, and hypothesis testing from the journal.</p>
<p>The journal now requires descriptive statistics and effect sizes. The editors also encourage large sample sizes, but they don’t require them.</p>
<p>This is the first of two posts in which I focus on the ban. In this post, I’ll start by showing how hypothesis testing provides crucial information that descriptive statistics alone just can't convey. In my next post, I’ll explain the editors' rationale for the ban—and why I disagree with them.</p>
P Values and Confidence Intervals Are Valuable!
<p>It’s really easy to show how P values and confidence intervals are valuable. Take a look at the graph below and determine which study found a true treatment effect and which one didn’t. The difference between the treatment group and the control group is the effect size, which is what the editors want authors to focus on.</p>
<p><img alt="Bar chart that compares the effect size of two studies" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/94164d874cc69ffe3763cf5cee64d47b/banned_pvalues.png" style="width: 576px; height: 384px;" /></p>
<p>Can you tell? The truth is that the results from both of these studies could represent either a true treatment effect or a random fluctuation due to sampling error.</p>
<p>So, how do you know? There are three factors at play.</p>
<ul>
<li><strong>Effect size</strong>: The larger the effect size, the less likely it is to be a random fluctuation. Clearly, Study A has a larger effect size. The large effect seems significant, but it’s not enough by itself.</li>
<li><strong>Sample size</strong>: A larger sample size allows you to detect smaller effects. If the sample size for Study B is large enough, its smaller treatment effect may very well be real.</li>
<li><strong>Variability in the data</strong>: The greater the variability, the more likely you’ll see large differences between the experimental groups due to random sampling error. If the variability in Study A is large enough, its larger difference may be attributable to random error rather than a treatment effect.</li>
</ul>
<p>The effect size from either study could be meaningful, or not, depending on the other factors. As you can see, there are scenarios where the larger effect size in Study A can be random error while the smaller effect size in Study B can be a true treatment effect.</p>
<p>Presumably, these statistics will all be reported under the journal's new focus on effect size and descriptive statistics. However, assessing different combinations of effect sizes, sample sizes, and variability gets fairly complicated. The ban forces journal readers to use a subjective eyeball approach to determine whether the difference is a true effect. And this is just for <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/tests-of-means/why-use-2-sample-t/" target="_blank">comparing two means</a>, which is about as simple as it can get! (How the heck would you even perform multiple regression analysis with only descriptive statistics?!)</p>
<p>Wouldn’t it be nice if there was some sort of statistic that incorporated all of these factors and rolled them into one objective number?</p>
<p>Hold on . . . that’s the P value! The <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" target="_blank">P value</a> provides an objective standard for everyone assessing the results from a study.</p>
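<p>That rolling-together is easy to demonstrate with simulated data. In the sketch below (hypothetical numbers, not the two studies in the graph), both studies share exactly the same effect size, yet their P values differ dramatically because the variability differs:</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two hypothetical studies with the SAME effect size (a shift of 10 units)
# but very different variability in the data
noisy_control = rng.normal(100, 40, size=20)   # Study A-style: sd ~ 40
quiet_control = rng.normal(100, 5, size=20)    # Study B-style: sd ~ 5
noisy_treated = noisy_control + 10             # identical 10-unit effect
quiet_treated = quiet_control + 10

_, p_noisy = stats.ttest_ind(noisy_treated, noisy_control)
_, p_quiet = stats.ttest_ind(quiet_treated, quiet_control)

# The P value folds effect size, sample size, and variability into one
# number: the low-variability study yields a far smaller P value for the
# exact same 10-unit difference
```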
<p>Now, let’s consider two different experiments that have studied the same treatment and have come up with the following two estimates of the effect size.</p>
<table align="center">
<tbody>
<tr>
<th><strong>Effect Size Study C</strong></th>
<th><strong>Effect Size Study D</strong></th>
</tr>
<tr>
<td>10</td>
<td>10</td>
</tr>
</tbody>
</table>
<p>Which estimate is better? It is pretty hard to say which 10 is better, right? Wouldn’t it be nice if there was a procedure that incorporated the effect size, sample size, and variability to provide a range of probable values <em>and</em> indicate the precision of the estimate?</p>
<p>Oh wait . . . that’s the confidence interval!</p>
<p>If we create the <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-confidence-intervals-and-confidence-levels" target="_blank">confidence intervals</a> for Study C [-5 25] and Study D [8 12], we gain some very valuable information. The confidence interval for Study C is both very wide and contains 0. This estimate is imprecise, and we can't rule out the possibility of no treatment effect. We're not learning anything from this study. On the other hand, the estimate from Study D is both very precise and statistically significant.</p>
<p>The two studies produced the same point estimate of the effect size, but the confidence interval shows that they're actually very different.</p>
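<p>The same contrast can be constructed numerically. This sketch uses simulated samples (hypothetical values) whose point estimates are both forced to exactly 10, mirroring Studies C and D:</p>

```python
import numpy as np
from scipy import stats

def mean_ci(sample, conf=0.95):
    """Two-sided t-interval for a sample mean."""
    return stats.t.interval(conf, df=len(sample) - 1,
                            loc=np.mean(sample), scale=stats.sem(sample))

rng = np.random.default_rng(3)
study_c = rng.normal(10, 35, size=25)   # high variability
study_d = rng.normal(10, 5, size=25)    # low variability

# Force both point estimates to exactly 10, as in the table above
study_c += 10 - study_c.mean()
study_d += 10 - study_d.mean()

ci_c = mean_ci(study_c)   # wide interval: this "10" tells us little
ci_d = mean_ci(study_d)   # narrow interval: precise, well away from zero
```

<p>Both studies report the same 10, but only the narrow interval lets you say anything useful about the treatment effect.</p>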
<p>Focusing solely on effect sizes and descriptive statistics is inadequate. P values and confidence intervals contribute truly important information that descriptive statistics alone can’t provide. That's why banning them is a mistake.</p>
<p><a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-why-we-need-to-use-hypothesis-tests-in-statistics">See a graphical explanation of how hypothesis tests work</a>.</p>
<p>If you'd like to see some fun examples of hypothesis tests in action, check out my posts about the Mythbusters!</p>
<ul>
<li><a href="http://blog.minitab.com/blog/adventures-in-statistics/busting-the-mythbusters-are-yawns-contagious">Busting the Mythbusters with Statistics: Are Yawns Contagious?</a></li>
<li><a href="http://blog.minitab.com/blog/adventures-in-statistics/using-hypothesis-tests-to-bust-myths-about-the-battle-of-the-sexes">Using Hypothesis Tests to Bust Myths about the Battle of the Sexes</a></li>
</ul>
<p>The editors do raise some legitimate concerns about the hypothesis testing process. In <a href="http://blog.minitab.com/blog/adventures-in-statistics/banned-p-values-and-confidence-intervals-a-rebuttal-part-2">part two</a>, I assess their arguments and explain why I believe a ban still is not justified.</p>
Data AnalysisHypothesis TestingLearningStatisticsStatistics HelpStatsThu, 30 Apr 2015 12:00:00 +0000http://blog.minitab.com/blog/adventures-in-statistics/banned-p-values-and-confidence-intervals-a-rebuttal-part-1Jim FrostNo Horsing Around with the Poisson Distribution, Troops
http://blog.minitab.com/blog/quality-data-analysis-and-statistics/no-horsing-around-with-the-poisson-distribution-troops
<p>In 1898, Russian economist Ladislaus Bortkiewicz published his first statistics book, entitled <em>Das Gesetz der kleinen Zahlen</em> (<em>The Law of Small Numbers</em>), in which he included an example that eventually became famous for illustrating the Poisson distribution. <img alt="horses" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/d76b2523d4819d498f66c0a10250df7b/horses.jpg" style="margin: 10px 15px; float: right; width: 250px; height: 154px;" /></p>
<p><span style="line-height: 18.9090900421143px;">Bortkiewicz </span>researched the annual deaths by horse kick in the Prussian Army from 1875–1894. Data was recorded from 14 different army corps, one of them being the Guard Corps. (According to one Wikipedia article on the subject, the Guard Corps may have comprised Prussia’s elite Guard units.) Let's take a closer look at his data and see what Minitab has to say using a Poisson goodness-of-fit test.</p>
<p>Here's the data set (thank you, <a href="http://www.math.uah.edu/stat/data/HorseKicks.html" target="_blank">University of Alabama in Huntsville</a>):</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/823a75a6dcf75897e05eaaa468d7350c/data_set.PNG" style="width: 997px; height: 466px;" /><br />
</p>
What Is the Poisson Distribution?
<p>As a review, the Poisson distribution is a discrete probability distribution for the counts of events that occur randomly in a given interval of time or space. The Poisson distribution has just one parameter, lambda, which is also its mean. To divert your attention just a little bit before we run our goodness-of-fit test, let’s look at how the distribution changes with different values of lambda. <span style="line-height: 1.6;">Go to </span><strong style="line-height: 1.6;">Graph > </strong><strong style="line-height: 1.6;">Probability Distribution Plot > </strong><strong style="line-height: 1.6;">View Single</strong><span style="line-height: 1.6;">. Select <em>Poisson </em>from the Distribution drop-down and enter <em>0.5</em> for the mean, then click <em>OK</em>:</span></p>
<p align="center"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/770927a9991955cf769dc73e508af2f4/pic1.png" style="width: 415px; height: 334px;" /></p>
<p>After I created my first plot, I created 3 more probability distribution plots with lambda set to 2, 4, and 10. I then used Minitab’s Layout Tool under the <a href="http://blog.minitab.com/blog/marilyn-wheatleys-blog/getting-the-most-out-of-your-text-data-part-iii">Editor Menu</a> to combine the four graphs.</p>
<p>As lambda increases, the graphs begin to resemble a normally distributed curve:</p>
<p style="text-align: center;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/3e77aef5baac0edd39721bdf9a57a0de/pic2.png" style="width: 577px; height: 385px;" /></p>
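<p>If you'd rather compute these probabilities than plot them, the same pmf values are available outside Minitab; here's a quick sketch using SciPy (assuming scipy is installed):</p>

```python
import numpy as np
from scipy import stats

# P(X = k) for the four lambda values plotted above; for a Poisson
# distribution, the mean and the variance are both equal to lambda
for lam in (0.5, 2, 4, 10):
    k = np.arange(25)
    pmf = stats.poisson.pmf(k, lam)
    print(f"lambda={lam}: P(X=0)={pmf[0]:.4f}, "
          f"most likely count={k[pmf.argmax()]}")
```

<p>As lambda grows, the mass spreads out and the peak shifts right, which is why the plots start to look bell-shaped.</p>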
Does This Data Follow a Poisson Distribution?
<p><span style="line-height: 1.6;">Interesting, right? But let's get back on track and test if the overall data obtained by Bortkiewicz follows a Poisson distribution. </span></p>
<p><span style="line-height: 1.6;">I first had to stack the data from 14 columns into one column. This is done via </span><strong style="line-height: 1.6;">Data > Stack > Columns…</strong></p>
<p style="text-align: center;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/7a29a555220937423f326e10d1fe2475/pic3.png" style="width: 461px; height: 335px;" /></p>
<p><span style="line-height: 1.6;">With the data stacked, I went to</span><strong style="line-height: 1.6;"> Stat > Basic Statistics > Goodness-of-Fit for Poisson…, </strong><span style="line-height: 1.6;">filling out the dialog as shown below:</span></p>
<p align="center"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/b8dd1fc71f1cccbc277b3b35770b037e/pic4.png" style="width: 434px; height: 334px;" /></p>
<p>After I clicked OK, Minitab delivered the following results:</p>
<p style="text-align: center;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/b6e13edc14af23b6c5f1b27b1616e066/results.PNG" style="width: 339px; height: 216px;" /></p>
<p><span style="line-height: 1.6;">The Poisson mean, or lambda, is 0.70. This means that we can expect, on average, 0.70 deaths per corps per year. If I had known these statistics and served in an army corps at that time, I would have treated my horse like gold. Anything my horse wants, it gets.</span></p>
<p>Further down you’ll see a table showing the observed and expected counts for the number of deaths by horse kick. The expected counts mirror the observed counts quite well. To further validate the claim that this data can be modeled by a Poisson distribution, we can use the p-value for the goodness-of-fit test in the last section of the output.</p>
<p>The hypothesis for the Chi-Square Goodness-of-Fit test for Poisson is:</p>
<p style="margin-left: 40px;">Ho: The data follow a Poisson distribution</p>
<p style="margin-left: 40px;">H1: The data do not follow a Poisson distribution</p>
<p>We are going to use an alpha level of 0.05. Since our p-value is greater than our alpha, we do not have enough evidence to reject the null hypothesis that the horse-kick deaths per year follow a Poisson distribution.</p>
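<p>You can also reproduce this test by hand. The sketch below uses the classic published tabulation of Bortkiewicz's 280 corps-years, which yields the same lambda of 0.70 as the Minitab output (note: the counts come from the commonly cited version of the data set, and the category binning may differ slightly from Minitab's):</p>

```python
import numpy as np
from scipy import stats

# Classic horse-kick tabulation: 14 corps x 20 years = 280 corps-years
deaths = np.array([0, 1, 2, 3, 4])
observed = np.array([144, 91, 32, 11, 2])
n = observed.sum()                      # 280 observations
lam = (deaths * observed).sum() / n     # sample mean = 0.70

# Expected counts under Poisson(lam); the last cell absorbs the upper tail
probs = stats.poisson.pmf(deaths, lam)
probs[-1] += stats.poisson.sf(deaths[-1], lam)
expected = n * probs

# Chi-square statistic; 1 df lost for the total, 1 for estimating lambda
chi2 = ((observed - expected) ** 2 / expected).sum()
p_value = stats.chi2.sf(chi2, df=len(deaths) - 2)
```

<p>The p-value comes out well above 0.05, so we fail to reject the null hypothesis that the counts follow a Poisson distribution, just as the Minitab output shows.</p>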
<p>The chart below shows how close the expected and observed values for deaths are to each other. </p>
<p style="text-align: center;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/8200b5c574e8824a563089db792909e8/pic5.png" style="width: 577px; height: 385px;" /></p>
<p><span style="line-height: 1.6;">I've been thinking about what other data could have been collected to serve as potential predictors if we wanted to run a Poisson regression. We could then see whether there were any significant relationships between our horse-kick death counts and some factor of interest. Maybe corps location or horse breed could have been documented? Given that the unit of time is one year, that location or breed would have to stay constant for the entire year. For example, Corps 14 in 1893 must have remained entirely in “Location A” during that year, or every horse in a particular corps must be of the same breed for a particular year.</span></p>
<p>According to <a href="http://equusmagazine.com/article/whyhorseskick_012307-8294">equusmagazine.com</a>, horses kick for six reasons:</p>
<ul>
<li>"I feel threatened."</li>
<li>"I feel good."</li>
<li>"I hurt."</li>
<li>"I feel frustrated."</li>
<li>"Back off."</li>
<li>"I'm the boss around here."</li>
</ul>
<p>Wouldn’t this have made for a great categorical variable?</p>
Data AnalysisFun StatisticsHypothesis TestingStatisticsStatistics HelpTue, 14 Apr 2015 12:00:00 +0000http://blog.minitab.com/blog/quality-data-analysis-and-statistics/no-horsing-around-with-the-poisson-distribution-troopsAndy CheshireUnderstanding Hypothesis Tests: Confidence Intervals and Confidence Levels
http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-confidence-intervals-and-confidence-levels
<p>In this series of posts, I show how hypothesis tests and confidence intervals work by focusing on concepts and graphs rather than equations and numbers. </p>
<p>Previously, I used graphs to <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-significance-levels-alpha-and-p-values-in-statistics" target="_blank">show what statistical significance really means</a>. In this post, I’ll explain both confidence intervals and confidence levels, and how they’re closely related to P values and significance levels.</p>
How to Correctly Interpret Confidence Intervals and Confidence Levels
<p><img alt="Illustration of confidence levels" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/a9bd1376510c8289a0daf15f5bcd376f/ci.gif" style="float: right; width: 327px; height: 224px;" />A confidence interval is a range of values that is likely to contain an unknown population parameter. If you draw a random sample many times, a certain percentage of the confidence intervals will contain the population mean. This percentage is the confidence level.</p>
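<p>You can see this definition in action with a small simulation; the population below is hypothetical, chosen only so the true mean is known:</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_mean, sigma, n, reps = 50, 10, 30, 2000

hits = 0
for _ in range(reps):
    # Draw a fresh sample and build a 95% t-interval for its mean
    sample = rng.normal(true_mean, sigma, size=n)
    lo, hi = stats.t.interval(0.95, df=n - 1,
                              loc=sample.mean(), scale=stats.sem(sample))
    hits += lo <= true_mean <= hi

coverage = hits / reps   # close to 0.95: this is the confidence level
```

<p>Roughly 95% of the 2,000 intervals capture the true mean; the confidence level describes the procedure across many samples, not any single interval.</p>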
<p>Most frequently, you’ll use confidence intervals to bound the mean or standard deviation, but you can also obtain them for regression coefficients, proportions, rates of occurrence (Poisson), and for the differences between populations.</p>
<p>Just as there is a common <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" target="_blank">misconception of how to interpret P values</a>, there’s a common misconception of how to interpret confidence intervals. In this case, the confidence level is<em> <strong>not </strong></em>the probability that a specific confidence interval contains the population parameter.</p>
<p>The confidence level represents the theoretical ability of the analysis to produce accurate intervals if you are able to assess <em>many intervals</em> and you know the value of the population parameter. For a <em>specific</em> confidence interval from one study, the interval either contains the population value or it does not—there’s no room for probabilities other than 0 or 1. And you can't choose between these two possibilities because you don’t know the value of the population parameter.</p>
<p style="margin-left: 40px;">"The parameter is an unknown constant and no probability statement concerning its value may be made." <br />
<em><span style="line-height: 1.6;">—Jerzy Neyman, original developer of confidence intervals.</span></em></p>
<p>This will be easier to understand after we discuss the graph below . . .</p>
<p>With this in mind, how <em>do</em> you interpret confidence intervals?</p>
<p>Confidence intervals serve as good estimates of the population parameter because the procedure tends to produce intervals that contain the parameter. Confidence intervals consist of the point estimate (the most likely value) and a margin of error around that point estimate. The margin of error indicates the amount of uncertainty that surrounds the sample estimate of the population parameter.</p>
<p>In this vein, you can use confidence intervals to assess the precision of the sample estimate. For a specific variable, a narrower confidence interval [90 110] suggests a more precise estimate of the population parameter than a wider confidence interval [50 150].</p>
Confidence Intervals and the Margin of Error
<p>Let’s move on to see how confidence intervals account for that margin of error. To do this, we’ll use the same tools that we’ve been using to understand hypothesis tests. I’ll create a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/basic-concepts/sampling-distribution/" target="_blank">sampling distribution</a> using <a href="http://blog.minitab.com/blog/adventures-in-statistics/graphing-distributions-with-probability-distribution-plots" target="_blank">probability distribution plots</a>, the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/probability-distributions-and-random-data/distributions/t-distribution/" target="_blank">t-distribution</a>, and the variability in our data. We'll base our confidence interval on the <a href="http://support.minitab.com/datasets/FamilyEnergyCost.MTW">energy cost data set</a> that we've been using.</p>
<p>When we looked at <a href="http://blog.minitab.com/blog/adventures-in-statistics/when-should-i-use-confidence-intervals-prediction-intervals-and-tolerance-intervals" target="_blank">significance levels</a>, the graphs displayed a sampling distribution centered on the null hypothesis value, and the outer 5% of the distribution was shaded. For confidence intervals, we need to shift the sampling distribution so that it is centered on the sample mean and shade the middle 95%.</p>
<p><img alt="Probability distribution plot that illustrates how a confidence interval works" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/80de5f2397507752d74ffff86fbd94ea/ci_sample_mean.png" style="width: 576px; height: 384px;" /></p>
<p>The shaded area shows the range of sample means that you’d obtain 95% of the time using our sample mean as the point estimate of the population mean. This range [267 394] is our 95% confidence interval.</p>
<p>Using the graph, it’s easier to understand how a specific confidence interval represents the margin of error, or the amount of uncertainty, around the point estimate. The sample mean is the most likely value for the population mean given the information that we have. However, the graph shows it would not be unusual at all for other random samples drawn from the same population to obtain different sample means within the shaded area. These other likely sample means all suggest different values for the population mean. Hence, the interval represents the inherent uncertainty that comes with using sample data.</p>
<p>You can use these graphs to calculate probabilities for specific values. However, notice that you can’t place the population mean on the graph because that value is unknown. Consequently, you can’t calculate probabilities for the population mean, just as Neyman said!</p>
Why P Values and Confidence Intervals Always Agree About Statistical Significance
<p>You can use either P values or confidence intervals to determine whether your results are statistically significant. If a hypothesis test produces both, these results will agree.</p>
<p>The confidence level is equivalent to 1 – the alpha level. So, if your significance level is 0.05, the corresponding confidence level is 95%.</p>
<ul>
<li>If the P value is less than your significance (alpha) level, the hypothesis test is statistically significant.</li>
<li>If the confidence interval does not contain the null hypothesis value, the results are statistically significant.</li>
<li>If the P value is less than alpha, the confidence interval will not contain the null hypothesis value.</li>
</ul>
<p>For our example, the P value (0.031) is less than the significance level (0.05), which indicates that our results are statistically significant. Similarly, our 95% confidence interval [267 394] does not include the null hypothesis mean of 260 and we draw the same conclusion.</p>
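<p>Because the two criteria use the same machinery, the agreement can be checked directly with simulated data (hypothetical values, not the article's energy cost sample):</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(330, 100, size=25)   # hypothetical energy costs
null_mean = 260

# One-sample t-test against the null value
_, p_value = stats.ttest_1samp(sample, null_mean)

# 95% confidence interval for the same sample mean
lo, hi = stats.t.interval(0.95, df=len(sample) - 1,
                          loc=sample.mean(), scale=stats.sem(sample))

# The P value is below alpha exactly when the CI excludes the null value
agree = (p_value < 0.05) == (not (lo <= null_mean <= hi))
```

<p>Whatever sample you draw, <code>agree</code> is always true, because both criteria measure the same distance with the same t-distribution.</p>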
<p>To understand why the results always agree, let’s recall how both the significance level and confidence level work.</p>
<ul>
<li>The significance level defines the distance the sample mean must be from the null hypothesis to be considered statistically significant.</li>
<li>The confidence level defines the distance between the confidence limits and the sample mean.</li>
</ul>
<p>Both the significance level and the confidence level define a distance from a limit to a mean. Guess what? The distances in both cases are exactly the same!</p>
<p>The distance equals the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/basics/what-is-a-critical-value/" target="_blank">critical t-value</a> * <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/tests-of-means/what-is-the-standard-error-of-the-mean/" target="_blank">standard error of the mean</a>. For our energy cost example data, the distance works out to be $63.57.</p>
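<p>That shared distance is a single line of code. The sketch below uses made-up monthly costs, not the article's data set, so the numbers won't reproduce $63.57:</p>

```python
import numpy as np
from scipy import stats

# Hypothetical monthly energy costs, in dollars
data = np.array([310, 295, 420, 270, 365, 330, 290, 385, 305, 345], float)
n = len(data)
se = data.std(ddof=1) / np.sqrt(n)       # standard error of the mean

# The one distance both procedures use: critical t-value * standard error
t_crit = stats.t.ppf(0.975, df=n - 1)    # two-sided, alpha = 0.05
distance = t_crit * se

ci = (data.mean() - distance, data.mean() + distance)
# A null mean more than `distance` from the sample mean is statistically
# significant, and equivalently falls outside this confidence interval
```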
<p>Imagine this discussion between the null hypothesis mean and the sample mean:</p>
<p><strong>Null hypothesis mean, hypothesis test representative</strong>: Hey buddy! I’ve found that you’re statistically significant because you’re more than $63.57 away from me!</p>
<p><strong>Sample mean, confidence interval representative</strong>: Actually, I’m significant because <em>you’re</em> more than $63.57 away from <em>me</em>!</p>
<p>Very agreeable aren’t they? And, they always will agree as long as you compare the correct pairs of P values and confidence intervals. If you compare the incorrect pair, you can get conflicting results, as shown by common mistake #1 in this <a href="http://blog.minitab.com/blog/real-world-quality-improvement/3-common-and-dangerous-statistical-misconceptions" target="_blank">post</a>.</p>
Closing Thoughts
<p>In statistical analyses, there tends to be a greater focus on P values and simply detecting a significant effect or difference. However, a statistically significant effect is not necessarily meaningful in the real world. For instance, the effect might be too small to be of any practical value.</p>
<p>It’s important to pay attention to both the magnitude and the precision of the estimated effect. That’s why I'm rather fond of confidence intervals. They allow you to assess these important characteristics along with the statistical significance. You'd like to see a narrow confidence interval where the entire range represents an effect that is meaningful in the real world.</p>
<p>For more about confidence intervals, read my post where I <a href="http://blog.minitab.com/blog/adventures-in-statistics/when-should-i-use-confidence-intervals-prediction-intervals-and-tolerance-intervals">compare them to tolerance intervals and prediction intervals</a>.</p>
<p>If you'd like to see how I made the probability distribution plot, please read: <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-create-a-graphical-version-of-the-1-sample-t-test-in-minitab">How to Create a Graphical Version of the 1-sample t-Test</a>.</p>
Data AnalysisHypothesis TestingLearningStatisticsStatistics HelpStatsThu, 02 Apr 2015 12:00:00 +0000http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-confidence-intervals-and-confidence-levelsJim Frost