Hypothesis Testing | Minitab
Blog posts and articles about hypothesis testing, especially in the course of Lean Six Sigma quality improvement projects.
http://blog.minitab.com/blog/hypothesis-testing-2/rss
Wed, 04 Mar 2015 08:29:14 +0000
FeedCreator 1.7.3

Choosing Between a Nonparametric Test and a Parametric Test
http://blog.minitab.com/blog/adventures-in-statistics/choosing-between-a-nonparametric-test-and-a-parametric-test
<p>It’s safe to say that most people who use statistics are more familiar with parametric analyses than nonparametric analyses. Nonparametric tests are also called distribution-free tests because they don’t assume that your data follow a specific distribution.</p>
<p>You may have heard that you should use nonparametric tests when your data don’t meet the assumptions of the parametric test, especially the assumption about normally distributed data. That sounds like a nice and straightforward way to choose, but there are additional considerations.</p>
<p>In this post, I’ll help you determine when you should use a:</p>
<ul>
<li>Parametric analysis to test group means.</li>
<li>Nonparametric analysis to test group medians.</li>
</ul>
<p>In particular, I'll focus on an important reason to use nonparametric tests that I don’t think gets mentioned often enough!</p>
Hypothesis Tests of the Mean and Median
<p>Nonparametric tests are like a parallel universe to parametric tests. The table shows related pairs of <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/basics/hypothesis-tests-in-minitab/" target="_blank">hypothesis tests</a> that <a href="http://www.minitab.com/en-us/products/minitab/features/" target="_blank">Minitab statistical software</a> offers.</p>
<table style="margin-left: auto; margin-right: auto;">
<tr><th>Parametric tests (means)</th><th>Nonparametric tests (medians)</th></tr>
<tr><td>1-sample t test</td><td>1-sample Sign, 1-sample Wilcoxon</td></tr>
<tr><td>2-sample t test</td><td>Mann-Whitney test</td></tr>
<tr><td>One-Way ANOVA</td><td>Kruskal-Wallis, Mood’s median test</td></tr>
<tr><td>Factorial DOE with one factor and one blocking variable</td><td>Friedman test</td></tr>
</table>
Reasons to Use Parametric Tests
<p><strong>Reason 1: Parametric tests can perform well with skewed and nonnormal distributions</strong></p>
<p>This may come as a surprise, but parametric tests can perform well with continuous data that are nonnormal if you satisfy these sample size guidelines.</p>
<table style="margin-left: auto; margin-right: auto;">
<tr><th>Parametric analyses</th><th>Sample size guidelines for nonnormal data</th></tr>
<tr><td>1-sample t test</td><td>Greater than 20</td></tr>
<tr><td>2-sample t test</td><td>Each group should be greater than 15</td></tr>
<tr><td>One-Way ANOVA</td><td>If you have 2-9 groups, each group should be greater than 15.<br />If you have 10-12 groups, each group should be greater than 20.</td></tr>
</table>
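<p>As a quick sanity check, the guidelines above can be encoded in a few lines of Python. This is only a sketch; the cutoffs come straight from the table, but the function name and interface are our own:</p>

```python
def meets_guideline(test, group_sizes):
    """Check the sample-size guidelines (from the table above) for using
    a parametric test with nonnormal continuous data."""
    if test == "1-sample t":
        return group_sizes[0] > 20
    if test == "2-sample t":
        return all(n > 15 for n in group_sizes)
    if test == "one-way ANOVA":
        k = len(group_sizes)
        if 2 <= k <= 9:
            return all(n > 15 for n in group_sizes)
        if 10 <= k <= 12:
            return all(n > 20 for n in group_sizes)
    raise ValueError(f"no guideline for {test!r} with {len(group_sizes)} groups")

print(meets_guideline("2-sample t", [18, 22]))  # True: both groups exceed 15
```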
<p><strong>Reason 2: Parametric tests can perform well when the spread of each group is different</strong></p>
<p>While nonparametric tests don’t assume that your data follow a normal distribution, they do have other assumptions that can be hard to meet. For nonparametric tests that compare groups, a common assumption is that the data for all groups must have the same spread (dispersion). If your groups have a different spread, the nonparametric tests might not provide valid results.</p>
<p>On the other hand, if you use the 2-sample t test or One-Way ANOVA, you can simply go to the <strong>Options</strong> subdialog and uncheck <em>Assume equal variances</em>. Voilà, you’re good to go even when the groups have different spreads!</p>
<p><strong>Reason 3: Statistical power</strong></p>
<p>Parametric tests usually have more <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/power-and-sample-size/what-is-power/" target="_blank">statistical power</a> than nonparametric tests. Thus, you are more likely to detect a significant effect when one truly exists.</p>
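<p>You can see this power gap for yourself with a small simulation. The sketch below (plain Python; 2.045 is the usual two-sided 5% t critical value for n = 30) draws normal samples with a true shift of half a standard deviation and counts how often a 1-sample t test and a 1-sample sign test each detect it:</p>

```python
import math
import random
import statistics

random.seed(42)

def t_rejects(sample, mu0=0.0, crit=2.045):
    # two-sided one-sample t test; 2.045 is the 5% critical value for n = 30
    n = len(sample)
    t = (statistics.mean(sample) - mu0) / (statistics.stdev(sample) / math.sqrt(n))
    return abs(t) > crit

def sign_rejects(sample, mu0=0.0, alpha=0.05):
    # two-sided sign test with an exact binomial p-value
    n = len(sample)
    k = min(sum(x > mu0 for x in sample), sum(x < mu0 for x in sample))
    p = min(1.0, 2 * sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n)
    return p < alpha

n, shift, trials = 30, 0.5, 2000   # true mean is 0.5 standard deviations above mu0
t_power = sum(t_rejects([random.gauss(shift, 1) for _ in range(n)]) for _ in range(trials)) / trials
sign_power = sum(sign_rejects([random.gauss(shift, 1) for _ in range(n)]) for _ in range(trials)) / trials
print(t_power, sign_power)  # the t test detects the real shift more often
```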
Reasons to Use Nonparametric Tests
<p><strong>Reason 1: Your area of study is better represented by the median</strong></p>
<p><img alt="Comparing two skewed distributions" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/7223b01bc095dbd652bd863be5288cfe/mean_or_median.png" style="float: right; width: 200px; height: 181px; margin: 10px 15px;" />This is my favorite reason to use a nonparametric test and the one that isn’t mentioned often enough! The fact that you <em>can</em> perform a parametric test with nonnormal data doesn’t imply that the mean is the best <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/summary-statistics/measures-of-central-tendency/" target="_blank">measure of the central tendency</a> for your data.</p>
<p>For example, the center of a skewed distribution, like income, can be better measured by the median where 50% are above the median and 50% are below. If you add a few billionaires to a sample, the mathematical mean increases greatly even though the income for the typical person doesn’t change.</p>
<p>When your distribution is skewed enough, the mean is strongly affected by changes far out in the distribution’s tail whereas the median continues to more closely reflect the center of the distribution. For these two distributions, a random sample of 100 from each distribution produces means that are significantly different, but medians that are not significantly different.</p>
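<p>The billionaire effect is easy to demonstrate. In this sketch (the income figures are made up purely for illustration), adding one extreme value drags the mean far from the typical income while the median barely moves:</p>

```python
import statistics

incomes = [28_000, 34_000, 41_000, 52_000, 63_000, 75_000, 90_000, 110_000]
mean_before, median_before = statistics.mean(incomes), statistics.median(incomes)

# add a single billionaire to the sample
incomes.append(1_000_000_000)
mean_after, median_after = statistics.mean(incomes), statistics.median(incomes)

print(mean_before, median_before)  # 61625 57500: the two measures roughly agree
print(mean_after, median_after)    # mean explodes past 111 million; median barely moves
```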
<p>Two of my colleagues have written excellent blog posts that illustrate this point:</p>
<ul>
<li>Michelle Paret: <a href="http://blog.minitab.com/blog/michelle-paret/using-the-mean-its-not-always-a-slam-dunk" target="_blank">Using the Mean in Data Analysis: It’s Not Always a Slam-Dunk</a></li>
<li>Redouane Kouiden: <a href="http://blog.minitab.com/blog/statistics-for-lean-six-sigma/the-non-parametric-economy-what-does-average-actually-mean" target="_blank">The Non-parametric Economy: What Does Average Actually Mean?</a></li>
</ul>
<p><strong>Reason 2: You have a very small sample size</strong></p>
<p>If you don’t meet the sample size guidelines for the parametric tests and you are not confident that you have normally distributed data, you should use a nonparametric test. When you have a really small sample, you might not even be able to ascertain the distribution of your data because the distribution tests will lack sufficient power to provide meaningful results.</p>
<p>In this scenario, you’re in a tough spot with no valid alternative. Nonparametric tests have less power to begin with and it’s a double whammy when you add a small sample size on top of that!</p>
<p><strong>Reason 3: You have ordinal data, ranked data, or outliers that you can’t remove</strong></p>
<p>Typical parametric tests can only assess continuous data, and their results can be significantly affected by outliers. Conversely, some nonparametric tests can handle ordinal data and ranked data without being seriously affected by outliers. Be sure to check the assumptions for the nonparametric test, because each one has its own data requirements.</p>
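<p>Rank-based tests get this robustness because they only look at order, not magnitude. A minimal, hand-rolled sketch of the Mann-Whitney U statistic (Minitab computes this for you; the data here are invented) shows that replacing the largest observation with an absurd outlier changes the mean dramatically but leaves U untouched:</p>

```python
import statistics

def mann_whitney_u(x, y):
    # U counts how many (x, y) pairs have x above y; ties count one half
    return sum((xi > yi) + 0.5 * (xi == yi) for xi in x for yi in y)

x = [5, 7, 8, 9, 12]
y = [1, 2, 3, 4, 6]
print(mann_whitney_u(x, y))

x_outlier = [5, 7, 8, 9, 1_000_000]  # same data, largest value is now a wild outlier
print(statistics.mean(x), statistics.mean(x_outlier))  # means differ hugely
print(mann_whitney_u(x_outlier, y))  # U is unchanged: only the ordering matters
```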
Closing Thoughts
<p>It’s commonly thought that the need to choose between a parametric and nonparametric test occurs when your data fail to meet an assumption of the parametric test. This can be the case when you have both a small sample size and nonnormal data. However, other considerations often play a role because parametric tests can often handle nonnormal data. Conversely, nonparametric tests have strict assumptions that you can’t disregard.</p>
<p>The decision often depends on whether the mean or median more accurately represents the center of your data’s distribution.</p>
<ul>
<li>If the mean accurately represents the center of your distribution and your sample size is large enough, consider a parametric test, because parametric tests are more powerful.</li>
<li>If the median better represents the center of your distribution, consider the nonparametric test even when you have a large sample.</li>
</ul>
<p>Finally, if you have a very small sample size, you might be stuck using a nonparametric test. Please, collect more data next time if it is at all possible! As you can see, the sample size guidelines aren’t really that large. Your chance of detecting a significant effect when one exists can be very small when you have both a small sample size and you need to use a less efficient nonparametric test!</p>
Hypothesis Testing, Statistics, Statistics Help
Thu, 19 Feb 2015 13:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/choosing-between-a-nonparametric-test-and-a-parametric-test
Jim Frost

What’s the Probability that Your Favorite Football Team Will Win?
http://blog.minitab.com/blog/customized-data-analysis/what%E2%80%99s-the-probability-that-your-favorite-football-team-will-win
<div>
<p>If you wanted to figure out the probability that your favorite football team will win their next game, how would you do it? My colleague <a href="http://blog.minitab.com/blog/understanding-statistics-and-its-application">Eduardo Santiago</a> and I recently looked at this question, and in this post we'll share how we approached the solution. Let’s start by breaking down this problem:<img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/8954fcace8f66a536aca06fad36a4c5a/boy_football_200.png" style="margin: 10px 15px; float: right; width: 200px; height: 200px;" /></p>
<ol>
<li>There are only two possible outcomes: your favorite team wins, or they lose. Ties are a possibility, but they're very rare, so to simplify things a bit we’ll assume they are unlikely enough to be disregarded in this analysis.</li>
<li>There are numerous factors to consider.
<ol style="list-style-type:lower-alpha;">
<li>What will the playing conditions be?</li>
<li>Are key players injured?</li>
<li>Do they match up well with their opponent?</li>
<li>Do they have home-field advantage?</li>
<li>And the list goes on...</li>
</ol>
</li>
</ol>
<p>First, since we assumed the outcome is binary, we can put together a <a href="http://blog.minitab.com/blog/real-world-quality-improvement/using-binary-logistic-regression-to-investigate-high-employee-turnover">Binary Logistic Regression</a> model to predict the probability of a win occurring. Next, we need to find which predictors would be best to include. After <a href="http://www.thepredictiontracker.com/ncaaresults.php" target="_blank">a little research</a>, we found that the betting markets seem to take all of this information into account. Basically, we are utilizing the wisdom of the masses to find out what they believe will happen. So we decided to look at the probability of a win, given the spread of an NCAA football game.</p>
Data Collection
<p>If you are not convinced of how accurate the spreads can be in determining the outcome of a game (win or loss), consider this: we collected data for every college football game played between 2000 and 2014. The structure of the data is illustrated below. The third column has the spread (or line) provided by the casinos in Vegas, and the last column displayed is the actual score differential (vscore – hscore).</p>
<p><strong><em>Note</em></strong><em>: In betting lines, a negative spread indicates how many points you are favored over the opponent. In short, you are giving the opponent a certain number of points. </em></p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/52aaa628ea28b55523232a9b2da6b623/table1.png" style="width: 600px; height: 352px;" /></p>
<p><span style="line-height: 1.6;">The original win-or-lose question can be rephrased then as follows: Is the difference between the spreads and actual score differentials statistically significant?</span></p>
<p>Since we have two populations that are dependent, we would compare them via a paired t test. In other words, both the <em>Spread</em> and <em>scoreDiffer</em> are observations (a priori and a posteriori) for the same game, and they reflect the relative strength of the home team <em>i</em> versus the road team <em>j</em>.</p>
<p>Using <strong>Stat > Basic Statistics > Paired t </strong>in Minitab Statistical Software, we get the output below.</p>
<p style="margin-left: 40px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/5531697c176006a4057f4ab7b6fda7dc/t_test_output.png" style="width: 500px; height: 189px;" /></p>
<p>Since the p-value is larger than 0.05, we can conclude from the 15 years of data that the average difference between Las Vegas spreads and actual score differentials is not significantly different from zero. With this we are saying that the bias that could exist between both measures of relative strength for teams is not different from zero, which in lay terms means that <em>on average</em> the error that exists between Vegas and actual outcomes is negligible.</p>
<p>It is worth noting that the results above were obtained with a sample size of 10,476 games! So we hope you'll excuse our not including <a href="http://blog.minitab.com/blog/understanding-statistics/how-much-data-do-you-really-need-check-power-and-sample-size">power calculations</a> here.</p>
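<p>If you want to see the mechanics on a toy scale, the paired t statistic is easy to compute by hand: difference each pair, then divide the mean difference by its standard error. The games below are made up for illustration (we obviously can't reproduce the 10,476-game dataset here), but the calculation is the same:</p>

```python
import math
import statistics

# hypothetical (spread, actual score differential) pairs -- illustrative only
games = [(-7, -10), (3, 6), (-3, 1), (10, 14), (-14, -9), (6, 3), (-1, -4), (4, 10)]

diffs = [spread - score for spread, score in games]
n = len(diffs)
t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
print(round(t, 3))  # well inside the +/-2.365 critical values for 7 degrees of freedom
```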
<p>As a final remark on spreads, the histogram of the differences below shows a couple of interesting things:</p>
<ul>
<li>The average difference between the spreads and score differentials seems to be very close to zero. So don’t get too excited yet, as the spreads cannot be used to predict the exact score differential for a game. Nevertheless, on average the spread tracks the score differential with no systematic bias.</li>
<li>The standard deviation, however, is 15.5 points. That means that if a game shows a spread for your favorite team of -3 points, the outcome could be with high confidence within plus or minus 2 standard deviations of the point estimate, which is -3 ± 31 points in this case. So your favorite team could win by 34 points, or lose by 28!</li>
</ul>
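<p>That prediction interval is a one-liner once you model the score differential as normal around the spread. The sketch below uses the figures quoted above (spread of -3, standard deviation 15.5); treat the upset probability as illustrative, since the real differences are only approximately normal:</p>

```python
from statistics import NormalDist

spread, sd = -3, 15.5          # favored by 3; negative margins mean the favorite wins
margin = NormalDist(mu=spread, sigma=sd)

lo, hi = spread - 2 * sd, spread + 2 * sd
print(lo, hi)                  # -34.0 28.0: win by up to 34 or lose by up to 28

upset = 1 - margin.cdf(0)      # probability the 3-point favorite loses anyway
print(round(upset, 2))
```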
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f8f64fe200b85bcd5753a62737735de3/histogram1.png" style="width: 577px; height: 385px;" /></p>
<p align="center"><em>Figure 1 - Distribution of the differences between scores and spreads</em></p>
The Binary Logistic Regression Model
<p>By this point, we hope you are convinced about how good these spread values could be. To make the output more readable we summarized the data as follows:</p>
</div>
<p style="margin-left: 40px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/218aa990c975fa2292d84926ba0002f0/table2.png" style="width: 250px; height: 405px;" /></p>
Creating our Binary Logistic Regression Model
<p>After summarizing the data, we used the Binary Fitted Line Plot (new in Minitab 17) to come up with our model. </p>
<p>If you are following along, here are the steps:</p>
<ol>
<li>Go to <strong>Stat > Regression > Binary Fitted Line Plot</strong></li>
<li>Fill out the dialog box as shown below and click <strong>OK</strong>.</li>
</ol>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ae06bb10e129b9c527b07700a57a7e2f/dialog1.png" style="width: 600px; height: 457px;" /></p>
<p><span style="line-height: 1.6;">The steps will produce the following graph:</span></p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/e2acef300ee9d1cd605ff089c217c8d5/binary_fitted_line_plot_w1024.png" style="width: 600px; height: 400px;" /></p>
Interpreting the Plot
<p>If your team is favored to win by 25 points or more, you have a very good chance of winning the game, but what if the spread is much closer?</p>
<p>For the 2014 National Championship, Ohio State was a 6-point underdog to Oregon. Looking at the Binary Fitted Line Plot, the probability that a 6-point underdog wins the game is close to 31% in college football. </p>
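<p>We can't reproduce the fitted model exactly here, but the shape of a binary logistic curve is simple to sketch. The coefficients below are our own back-of-the-envelope choices, picked only so that a 6-point underdog lands near the 31% figure above; they are not the model Minitab estimated:</p>

```python
import math

def win_prob(spread, b0=0.0, b1=-0.133):
    """Logistic model of win probability; positive spread = underdog.
    Coefficients are illustrative guesses, not the fitted Minitab model."""
    return 1 / (1 + math.exp(-(b0 + b1 * spread)))

print(round(win_prob(6), 2))    # a 6-point underdog: about 0.31
print(round(win_prob(-25), 2))  # a 25-point favorite: well above 0.9
```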
<p>Ohio State ended up beating Oregon by 22 points. Given that the differences described in Figure 1 are normally distributed around zero, and treating the spread as given (or known), we can compute the probability of the national championship game outcome being as extreme as, or more extreme than, it turned out.</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/977aafa9be7a8e2736b1340bca0b3b62/distribution_plot.png" style="width: 576px; height: 384px;" /></p>
<p>With Ohio State a 6-point underdog, and a standard deviation of 15.53, we can run a Probability Distribution Plot to show that Ohio State would win by 22 points or more only 3.6% of the time.</p>
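<p>That 3.6% figure is reproducible in a couple of lines: model Ohio State's margin of victory as normal with mean -6 (the spread) and standard deviation 15.53, then ask for the upper tail at +22:</p>

```python
from statistics import NormalDist

margin = NormalDist(mu=-6, sigma=15.53)  # 6-point underdog, sd from 15 years of games
p = 1 - margin.cdf(22)                   # chance of winning by 22 points or more
print(round(100 * p, 1))                 # 3.6 (percent)
```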
<p>Eduardo Santiago and I will be giving a talk on using statistics to rank college football teams at the upcoming <a href="http://www.amstat.org/meetings/csp/2015/" target="_blank">Conference on Statistical Practice</a> in New Orleans. Our talk is February 21 at 2 p.m., and we would love to have you join us. </p>
Fun Statistics, Hypothesis Testing, Regression Analysis
Thu, 12 Feb 2015 13:00:00 +0000
http://blog.minitab.com/blog/customized-data-analysis/what%E2%80%99s-the-probability-that-your-favorite-football-team-will-win
Daniel Griffith

Statistics: Another Weapon in the Galactic Patrol’s Arsenal
http://blog.minitab.com/blog/statistics-in-the-field/statistics-another-weapon-in-the-galactic-patrol%E2%80%99s-arsenal
<p><em><span style="line-height: 1.6;">by Matthew Barsalou, guest blogger. </span></em></p>
<p>E. E. “Doc” <a href="http://en.wikipedia.org/wiki/E._E._Smith" target="_blank">Smith</a>, one of the greatest authors ever, wrote many classic books such as <a href="http://en.wikipedia.org/wiki/Skylark_%28series%29" target="_blank">The Skylark of Space</a> and his <a href="http://en.wikipedia.org/wiki/Lensman_series" target="_blank">Lensman</a> series. Doc Smith’s imagination knew no limits; his <a href="http://en.wikipedia.org/wiki/Galactic_Patrol" target="_blank">Galactic Patrol</a> had millions of combat fleets under its command and possessed planets turned into movable, armored weapons platforms. Some of the Galactic Patrol’s weapons may be well known. For example, there is the sunbeam, which concentrated the entire output of a sun’s energy into one beam.</p>
<p><img alt="amazing stories featuring E. E. 'Doc' Smith" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/0d1ef573ea1b75bd2e6364f219ec6a19/docsmithcover.png" style="border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 296px; height: 400px;" />The Galactic Patrol also created the negasphere, a planet-sized dark matter/dark energy bomb that could eat through anything. I’ll go out on a limb and assume that they first created a container that could hold such a substance, at least briefly.</p>
<p>When I read about such technology, I always have to wonder “How did they test it?” I can see where Minitab Statistical Software could be very helpful to the Galactic Patrol. How could the Galactic Patrol evaluate smaller, torpedo-sized units of negasphere? Suppose negasphere was created at the time of firing in a space torpedo and needed to be contained for the first 30 seconds after being fired, lest it break containment early and damage the ship that is firing it or rupture the torpedo before it reaches a space pirate.</p>
<p>The table below shows data collected from fifteen samples each of two materials that could be used for negasphere containment. Material 1 has a mean containment time of 33.951 seconds and Material 2 has a mean of 32.018 seconds. But is this difference statistically significant? Does it even matter?</p>
<table style="margin-left: auto; margin-right: auto;">
<tr><th>Material 1</th><th>Material 2</th></tr>
<tr><td>34.5207</td><td>32.1227</td></tr>
<tr><td>33.0061</td><td>31.9836</td></tr>
<tr><td>32.9733</td><td>31.9975</td></tr>
<tr><td>32.4381</td><td>31.9997</td></tr>
<tr><td>34.1364</td><td>31.9414</td></tr>
<tr><td>36.1568</td><td>32.0403</td></tr>
<tr><td>34.6487</td><td>32.1153</td></tr>
<tr><td>36.6436</td><td>31.9661</td></tr>
<tr><td>35.3177</td><td>32.0670</td></tr>
<tr><td>32.4043</td><td>31.9610</td></tr>
<tr><td>31.3107</td><td>32.0303</td></tr>
<tr><td>34.0913</td><td>32.0146</td></tr>
<tr><td>33.2040</td><td>31.9865</td></tr>
<tr><td>32.5601</td><td>32.0079</td></tr>
<tr><td>35.8556</td><td>32.0328</td></tr>
</table>
<p>The questions we're asking and the type and distribution of the data we have should determine the types of statistical test we perform. Many statistical tests for continuous data require an assumption of normality, and this can easily be tested in our <a href="http://www.minitab.com/products/minitab">statistical software</a> by going to <strong>Graph > Probability Plot…</strong> and entering the columns containing the data.</p>
<p><span style="line-height: 1.6;"><img alt="probability plot of material 1" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ebd5796caf013f0204dbddc33c06df56/probability_plot1.png" style="width: 581px; height: 388px;" /></span></p>
<p><span style="line-height: 1.6;"><img alt="probability plot of material 2" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a8464e0302753942334c4e11d31482e5/probability_plot2.png" style="width: 580px; height: 388px;" /></span></p>
<p>The null hypothesis is “the data are normally distributed,” and the resulting P-values are greater than 0.05, so we <a href="http://blog.minitab.com/blog/understanding-statistics/things-statisticians-say-failure-to-reject-the-null-hypothesis">fail to reject the null hypothesis</a>. That means we can evaluate the data using tests that require the data to be normally distributed.</p>
<p>To determine if the mean of Material 1 is indeed greater than the mean of Material 2, we perform a two sample t-test: go to <strong>Stat > Basic Statistics > 2 Sample t…</strong> and select “Each sample in its own column.” We then choose “Options…” and select “Difference > hypothesized difference.”</p>
<p><img alt="two-sample t-test and ci output" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/3e270a93cceb77f6818345bcb41c9110/2_sample_t_test_output.png" style="width: 546px; height: 226px;" /></p>
<p><span style="line-height: 1.6;">The P-value for the two sample t-test is less than 0.05, so we can conclude there is a statistically significant difference between the materials. But the two sample t-test does not give us a complete picture of the situation, so we should look at the data by going to <strong>Graph > Individual Value Plot...</strong> and selecting a simple graph for multiple Y’s.</span></p>
<p><span style="line-height: 1.6;"><img alt="individual value plot " src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/96d713980aefd4912a402dc156802788/individual_value_plot1.png" style="width: 583px; height: 391px;" /></span></p>
<p>The mean of Material 1 may be higher, but our biggest concern is identifying a material that does not fail in 30 seconds or less. Material 2 appears to have far less variation, and we can assess this by performing an F-test: go to <strong>Stat > Basic Statistics > 2 Variances…</strong> and select “Each sample in its own column.” Then choose “Options…” and select “Ratio > hypothesized ratio.” The data are normally distributed, so put a checkmark next to “Use test and confidence intervals based on normal distribution.”</p>
<p><img alt="two variances test output" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/28aa53d2ec2582e3b41e29fb5f55331f/two_variances_test_output.png" style="width: 482px; height: 563px;" /></p>
<p>The P-value is less than 0.05, so we can conclude the evidence supports the alternative hypothesis that the variance of the first material is greater than the variance of the second material. Having already looked at a graph of the data, this should come as no surprise.</p>
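<p>If you want to check these results outside Minitab, both statistics fall out of the raw containment times in a few lines of plain Python, using the fifteen samples per material from the table above (the t statistic here is the Welch form, which matches the no-equal-variances setting):</p>

```python
import math
import statistics

material1 = [34.5207, 33.0061, 32.9733, 32.4381, 34.1364, 36.1568, 34.6487,
             36.6436, 35.3177, 32.4043, 31.3107, 34.0913, 33.2040, 32.5601, 35.8556]
material2 = [32.1227, 31.9836, 31.9975, 31.9997, 31.9414, 32.0403, 32.1153,
             31.9661, 32.0670, 31.9610, 32.0303, 32.0146, 31.9865, 32.0079, 32.0328]

mean1, mean2 = statistics.mean(material1), statistics.mean(material2)
var1, var2 = statistics.variance(material1), statistics.variance(material2)
print(round(mean1, 3), round(mean2, 3))  # 33.951 32.018, matching the post

# Welch two-sample t statistic (no equal-variance assumption)
t = (mean1 - mean2) / math.sqrt(var1 / 15 + var2 / 15)
print(round(t, 2))                       # comfortably past the 5% cutoff

f = var1 / var2                          # the F-test's variance ratio
print(round(f, 1))                       # Material 1 is vastly more variable
```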
<p>No statistical software program can tell us which material to choose, but Minitab can provide us with the information needed to make an informed decision. The objective is to exceed a lower specification limit of 30 seconds and the lower variability of Material 2 will achieve this better than the higher mean value for Material 1. Material 2 looks good, but the penalty for a wrong decision could be lost space ships if the negasphere breaches its containment too soon, so we must be certain.</p>
<p>The Galactic Patrol has millions of ships, so a failure rate of even one per million would be unacceptably high. We should therefore perform a capability study by going to <strong>Stat > Quality Tools > Capability Analysis > Normal…</strong> Enter the column containing the data for Material 1, use the same column for the subgroup size, and then enter a lower specification of 30. This would then be repeated for Material 2.</p>
<p><img alt="process capability for material 1" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/9c10d14f155707770eb3688aec834ca2/process_capability_report1.png" style="width: 635px; height: 476px;" /></p>
<p><img alt="Process Capability for Material 2" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/eaf12e0874c393730037cf504a90fa8f/process_capability_report2.png" style="width: 638px; height: 476px;" /></p>
<p>Looking at the Minitab-generated capability studies, we can see that Material 1 can be expected to fail thousands of times per million uses, but Material 2 is not expected to fail at all. In spite of Material 1’s higher mean, the Galactic Patrol should use Material 2 for the negasphere torpedoes. </p>
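<p>The "thousands per million" claim can be sanity-checked directly: fit a normal model to each sample's mean and standard deviation and ask what fraction falls below the 30-second lower spec. This is the essence of the expected-performance figure a capability study reports (assuming normality):</p>

```python
import statistics
from statistics import NormalDist

material1 = [34.5207, 33.0061, 32.9733, 32.4381, 34.1364, 36.1568, 34.6487,
             36.6436, 35.3177, 32.4043, 31.3107, 34.0913, 33.2040, 32.5601, 35.8556]
material2 = [32.1227, 31.9836, 31.9975, 31.9997, 31.9414, 32.0403, 32.1153,
             31.9661, 32.0670, 31.9610, 32.0303, 32.0146, 31.9865, 32.0079, 32.0328]

def expected_ppm_below(data, lsl=30.0):
    # expected parts-per-million below the lower spec, assuming normality
    model = NormalDist(statistics.mean(data), statistics.stdev(data))
    return 1e6 * model.cdf(lsl)

print(round(expected_ppm_below(material1)))  # thousands of early failures per million
print(round(expected_ppm_below(material2)))  # 0 -- effectively never fails early
```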
<div>
<p style="line-height: 20.7999992370605px;"><strong>About the Guest Blogger</strong></p>
<p style="line-height: 20.7999992370605px;"><em><a href="https://www.linkedin.com/pub/matthew-barsalou/5b/539/198" target="_blank">Matthew Barsalou</a> is a statistical problem resolution Master Black Belt at <a href="http://www.3k-warner.de/" target="_blank">BorgWarner</a> Turbo Systems Engineering GmbH. He is a Smarter Solutions certified Lean Six Sigma Master Black Belt, ASQ-certified Six Sigma Black Belt, quality engineer, and quality technician, and a TÜV-certified quality manager, quality management representative, and auditor. He has a bachelor of science in industrial sciences, a master of liberal studies with emphasis in international business, and a master of science in business administration and engineering from the Wilhelm Büchner Hochschule in Darmstadt, Germany. He is author of the books <a href="http://www.amazon.com/Root-Cause-Analysis-Step---Step/dp/148225879X/ref=sr_1_1?ie=UTF8&qid=1416937278&sr=8-1&keywords=Root+Cause+Analysis%3A+A+Step-By-Step+Guide+to+Using+the+Right+Tool+at+the+Right+Time" target="_blank">Root Cause Analysis: A Step-By-Step Guide to Using the Right Tool at the Right Time</a>, <a href="http://asq.org/quality-press/display-item/index.html?item=H1472" target="_blank">Statistics for Six Sigma Black Belts</a> and <a href="http://asq.org/quality-press/display-item/index.html?item=H1473&xvl=76115763" target="_blank">The ASQ Pocket Guide to Statistics for Six Sigma Black Belts</a>.</em></p>
</div>
Data Analysis, Hypothesis Testing, Statistics
Tue, 03 Feb 2015 13:00:00 +0000
http://blog.minitab.com/blog/statistics-in-the-field/statistics-another-weapon-in-the-galactic-patrol%E2%80%99s-arsenal
Guest Blogger

Analyzing Qualitative Data, part 1: Pareto, Pie, and Stacked Bar Charts
http://blog.minitab.com/blog/applying-statistics-in-quality-projects/analyzing-qualitative-data-part-1-pareto-pie-and-stacked-bar-charts
<p>In several previous blogs, I have discussed the use of statistics for <a href="http://blog.minitab.com/blog/applying-statistics-in-quality-projects/using-nonparametric-analysis-to-visually-manage-durations-in-service-processes">quality improvement in the service sector</a>. Understandably, services account for a very large part of the economy. Lately, when meeting with several people from financial companies, I realized that one of the problems they faced was that they were collecting large amounts of "qualitative" data: types of product, customer profiles, different subsidiaries, several customer requirements, etc.</p>
<p>There are several ways to process such qualitative data. Qualitative data points may still be counted, and once they have been counted they may be quantitatively (numerically) analyzed using statistical methods.</p>
<p>I will focus on the analysis of qualitative data using a simple and obvious example. In this case, we would like to analyze mistakes on invoices made during a period of several weeks by three employees (anonymously identified).</p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/545c0823fc7368e795585c38424891d9/quali1.jpg" style="width: 288px; height: 273px;" /></p>
<p>I will present three different ways to analyze such qualitative data (counts). In this post, I will cover:</p>
<ol>
<li>A very simple graphical approach based on bar charts to display counts (stacked and clustered bars), Pareto diagrams, and pie charts.</li>
</ol>
<p>Then, in my next post, I will demonstrate: </p>
<ol start="2">
<li> A more complex approach for testing statistical significance using a Chi-square test.<br />
</li>
<li> An even more complex multivariate approach (using correspondence analysis).</li>
</ol>
<p>Again, the main purpose of this example is to show several ways to analyze qualitative data. Quantitative data represent numeric values such as the number of grams, dollars, newtons, etc., whereas qualitative data may represent text values such as different colours, types of defects or different employees.</p>
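<p>The counting step described above, turning qualitative values into analyzable numbers, can be sketched in a few lines of Python. The records below are hypothetical stand-ins for the invoice data in this post:</p>

```python
from collections import Counter

# Hypothetical (employee, mistake type) records standing in for the
# invoice data in the post.
records = [
    ("A", "Product"), ("A", "Product"), ("A", "Address"),
    ("B", "Price"), ("B", "Address"),
    ("C", "Address"), ("C", "Address"), ("C", "Product"),
]

pair_counts = Counter(records)                      # counts per (employee, type)
employee_counts = Counter(emp for emp, _ in records)

# A small contingency table: rows = employees, columns = mistake types
employees = sorted({emp for emp, _ in records})
mistake_types = sorted({m for _, m in records})
table = {e: {m: pair_counts[(e, m)] for m in mistake_types} for e in employees}
```

<p>Counts like these are exactly what the pie, bar, and Pareto charts in this post visualize.</p>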
<p>The <a href="http://www.minitab.com/en-us/products/minitab/assistant/">Assistant</a> in Minitab 17 provides a great breakdown of two main data types: </p>
<p><img alt="" spellcheck="true" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/2fd46235529df11ab90d53efa677b706/quali2.jpg" style="width: 586px; height: 316px; border-width: 1px; border-style: solid;" /></p>
Charts and Diagrams with Qualitative Data
<p>I first created a pie chart using the Minitab Assistant (<strong>Assistant > Graphical Analysis</strong>) as well as a stacked bar chart on counts (from the graph menu of Minitab, select <strong>Graph > Bar Charts</strong>) to describe the proportion of each type of mistake by day of the week.</p>
<p><img alt="" spellcheck="true" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/15ec9831d178df8fc0cbaddab0975c89/pie_chart_of_mistake_by_day___summary_report.jpg" style="width: 478px; height: 358px; border-width: 1px; border-style: solid;" /></p>
<p>In the pie charts above, the proportion of mistake types seems to be fairly similar across the different days of the week.</p>
<p> <img alt="" spellcheck="true" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/4b92a1293aff3f424d5a6f751653fb17/quali3.jpg" style="width: 403px; height: 302px; border-width: 1px; border-style: solid;" /></p>
<p>The stacked bar chart above shows that the number of mistakes also seems to be quite stable and uniform across the days of the week.</p>
<p><img alt="" spellcheck="true" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/c23dcf3e01cedf8aaad5bad176437ed2/quali4.jpg" style="width: 426px; height: 330px;" /></p>
<p>Now let's create a stacked bar chart on counts to analyze mistakes by employees. In this second graph, shown above, large variations in the number of errors occur across employees. The distribution of errors also seems to be very different, with more “Product” errors associated with employee A.</p>
Qualitative Data in a Pareto Chart
<p><img alt="" spellcheck="true" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/30893b16e7ab4a75024498b7c3cf9fdf/pareto_chart_of_mistake_by_person___diagnostic_report.jpg" style="width: 768px; height: 547px;" /></p>
<p>Above we see <a href="http://blog.minitab.com/blog/understanding-statistics/explaining-quality-statistics-so-your-boss-will-understand-pareto-charts">Pareto charts</a> created using the Minitab Assistant: an overall Pareto chart and additional Pareto diagrams, one for each employee. Again, it's easy to identify the large number of “product” mistakes (red columns) for employee A.</p>
Stacked Bar Charts of Qualitative Data
<p><img alt="" spellcheck="true" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/79589c080171780e682cbd69d3353a0e/quali6.jpg" style="width: 426px; height: 347px;" /></p>
<p>Mistake counts are represented as percentages in the stacked bar chart above. For each employee, the error types sum to 100% (within each employee's column). This provides a clearer understanding of how each employee's mistakes are distributed. Again, the high percentage of “Product” errors (in yellow) for employee A is very noticeable, but also note the proportionately high percentage of “Address” mistakes (blue areas) for employee C.</p>
<p><img alt="" spellcheck="true" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/9da688410bcb56f516061a4a26e64dfe/quali7.jpg" style="width: 434px; height: 346px;" /></p>
<p>The stacked bar chart above displays changes in the number of errors and in error types by week (time trends). Notice that in the last three weeks of the period, only product and address issues occurred: error types appear to shift toward the “product” and “address” categories over time.</p>
Different Views of the Data Give a More Complete Picture
<p>These diagrams do provide a clear picture of mistake occurrences according to employees, error types and weeks. However, as you've seen, it takes several graphs to provide a good understanding of the issue.</p>
<p>This is still a subjective approach, though: several people seated around the same table, looking at the same graphs, might interpret them differently, and in some cases this could result in endless discussions.</p>
<p>Therefore we would also like to use a more scientific and rigorous approach: the Chi-square test. <a href="http://blog.minitab.com/blog/applying-statistics-in-quality-projects/analyzing-qualitative-data-part-2-chi-square-and-multivariate-analysis">We'll cover that in my next post</a>. </p>
Data AnalysisHypothesis TestingQuality ImprovementSix SigmaStatisticsStatsWed, 28 Jan 2015 13:00:00 +0000http://blog.minitab.com/blog/applying-statistics-in-quality-projects/analyzing-qualitative-data-part-1-pareto-pie-and-stacked-bar-chartsBruno ScibiliaWhat Are T Values and P Values in Statistics?
http://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-are-t-values-and-p-values-in-statistics
<p>If you’re not a statistician, looking through statistical output can sometimes make you feel a bit like <em>Alice in Wonderland</em>. Suddenly, you step into a fantastical world where strange and mysterious phantasms appear out of nowhere. </p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/6f4053a89257952fef0b9998547dffe2/tweedle_tweedledum.jpg" style="line-height: 20.7999992370605px; float: right; width: 248px; height: 255px; margin: 10px 15px;" /></p>
<p>For example, consider the T and P in your t-test results.</p>
<p>“Curiouser and curiouser!” you might exclaim, like Alice, as you gaze at your output.</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/1e5a4c064f43f19169121222402e4560/t_test_results_one_sided.jpg" style="width: 467px; height: 121px;" /></p>
<p>What are these values, really? Where do they come from? Even if you’ve used the p-value to interpret the statistical significance of your results umpteen times, its actual origin may remain murky to you.</p>
T & P: The Tweedledee and Tweedledum of a T-test
<p>T and P are inextricably linked. They go arm in arm, like Tweedledee and Tweedledum. Here's why.</p>
<p>When you perform a t-test, you're usually trying to find evidence of a significant difference between population means (2-sample t) or between the population mean and a hypothesized value (1-sample t). <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-is-a-t-test-and-why-is-it-like-telling-a-kid-to-clean-up-that-mess-in-the-kitchen">The t-value measures the size of the difference relative to the variation in your sample data</a>. Put another way, T is simply the calculated difference represented in units of standard error. The greater the magnitude of T (it can be either positive or negative), the greater the evidence <em>against </em>the null hypothesis that there is no significant difference. The closer T is to 0, the more likely there isn't a significant difference.</p>
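<p>That description translates directly into the 1-sample t formula: the sample mean minus the hypothesized mean, divided by the standard error. A quick sketch, using hypothetical sample values:</p>

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample of 10 observations; hypothesized population mean is 5
sample = [5.8, 6.1, 4.9, 6.4, 5.5, 6.0, 5.2, 6.3, 5.7, 6.1]
mu0 = 5.0

standard_error = stdev(sample) / sqrt(len(sample))
t = (mean(sample) - mu0) / standard_error  # the difference, in standard-error units
```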
<p>Remember, the t-value in your output is calculated from only one sample from the entire population. If you took repeated random samples of data from the same population, you'd get slightly different t-values each time, due to random sampling error (which is really not a mistake of any kind–it's just the random variation expected in the data).</p>
<p>How different could you expect the t-values from many random samples from the same population to be? And how does the t-value from your sample data compare to those expected t-values?</p>
<p>You can use a t-distribution to find out.</p>
Using a t-distribution to calculate probability
<p>For the sake of illustration, assume that you're using a 1-sample t-test to determine whether the population mean is greater than a hypothesized value, such as 5, based on a sample of 20 observations, as shown in the above t-test output.</p>
<ol>
<li>In Minitab, choose <strong>Graph > Probability Distribution Plot</strong>.</li>
<li>Select <strong>View Probability</strong>, then click <strong>OK</strong>.</li>
<li>From <strong>Distribution</strong>, select <strong>t</strong>.</li>
<li>In <strong>Degrees of freedom</strong>, enter <em>19</em>. (For a 1-sample t test, the degrees of freedom equals the sample size minus 1).</li>
<li>Click <strong>Shaded Area</strong>. Select <strong>X Value</strong>. Select <strong>Right Tail</strong>.</li>
<li> In <strong>X Value</strong>, enter 2.8 (the t-value), then click <strong>OK</strong>.</li>
</ol>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/bc5183a42a169d45632fd4f6c0b153b3/distribution_plot_t_2.8" style="width: 576px; height: 384px;" /></p>
<p>The highest part (peak) of the distribution curve shows you where you can expect most of the t-values to fall. Most of the time, you’d expect to get t-values close to 0. That makes sense, right? Because if you randomly select representative samples from a population, the mean of most of those random samples from the population should be close to the overall population mean, making their differences (and thus the calculated t-values) close to 0.</p>
T values, P values, and poker hands
<p>T values of larger magnitudes (either negative or positive) are less likely. The far left and right "tails" of the distribution curve represent instances of obtaining extreme values of t, far from 0. For example, the shaded region represents the probability of obtaining a t-value of 2.8 or greater. Imagine a magical dart that could be thrown to land randomly anywhere under the distribution curve. What's the chance it would land in the shaded region? The calculated probability is 0.005712.....which rounds to 0.006...which is...the p-value obtained in the t-test results! <img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/5633b267494c2017d6d7c7544247d57d/poker_picture.jpg" style="float: right; width: 200px; height: 164px; margin: 10px 15px;" /></p>
<p>In other words, the probability of obtaining a t-value of 2.8 or higher, when sampling from the same population (here, a population with a hypothesized mean of 5), is approximately 0.006.</p>
<p>How likely is that? Not very! For comparison, the probability of being dealt 3-of-a-kind in a 5-card poker hand is over three times as high (≈ 0.021).</p>
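<p>The poker figure quoted above is easy to verify with a direct combinatorial count:</p>

```python
from math import comb

# Three-of-a-kind: pick the triple's rank and 3 of its 4 suits, then
# two distinct other ranks (so the hand isn't a full house or four of
# a kind) and one of 4 suits for each of those two cards.
hands = 13 * comb(4, 3) * comb(12, 2) * 4 * 4
p_three_of_a_kind = hands / comb(52, 5)  # ≈ 0.0211
```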
<p>Given that the probability of obtaining a t-value this high or higher when sampling from this population is so low, what’s more likely? It’s more likely this sample doesn’t come from this population (with the hypothesized mean of 5). It's much more likely that this sample comes from a different population, one with a mean greater than 5.</p>
<p>To wit: Because the p-value is very low (< alpha level), you reject the null hypothesis and conclude that there's a statistically significant difference.</p>
<p>In this way, T and P are inextricably linked. Consider them simply different ways to quantify the "extremeness" of your results under the null hypothesis. You can’t change the value of one without changing the other.</p>
<p>The larger the absolute value of the t-value, the smaller the p-value, and the greater the evidence against the null hypothesis. (You can verify this by entering lower and higher t values for the t-distribution in step 6 above).</p>
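<p>The same right-tail lookup can be done in code; here is a sketch using SciPy's t distribution as an alternative to the Minitab steps above:</p>

```python
from scipy import stats

t_value, df = 2.8, 19  # t from the output; df = sample size 20 minus 1

# sf() is the survival function: the area in the right tail beyond t
p_one_tailed = stats.t.sf(t_value, df)  # ≈ 0.0057, the p-value above
```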
Try this two-tailed follow-up...
<p>The t-distribution example shown above is based on a one-tailed t-test to determine whether the mean of the population is greater than a hypothesized value. Therefore the t-distribution example shows the probability associated with the t-value of 2.8 only in one direction (the right tail of the distribution).</p>
<p>How would you use the t-distribution to find the p-value associated with a t-value of 2.8 for a two-tailed t-test (in both directions)?</p>
<p><strong>Hint:</strong> In Minitab, adjust the options in step 5 to find the probability for both tails. If you don't have a copy of Minitab, download a free <a href="http://it.minitab.com/en-us/products/minitab/free-trial.aspx" target="_blank">30-day trial version</a>.</p>
Hypothesis TestingTue, 27 Jan 2015 13:10:00 +0000http://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-are-t-values-and-p-values-in-statisticsPatrick RunkelA Minitab Holiday Tale: Featuring the Two Sample t-Test
http://blog.minitab.com/blog/statistics-in-the-field/a-minitab-holiday-tale-featuring-the-two-sample-t-test
<p><em><span style="line-height: 1.6;">by Matthew Barsalou, guest blogger</span></em></p>
<p>Aaron and Billy are two very competitive—and not always well-behaved—eight-year-old twin brothers. They constantly strive to outdo each other, no matter what the subject. If the boys are given a piece of pie for dessert, they each automatically want to make sure that their own piece of pie is bigger than the other’s piece of pie. This causes much exasperation, aggravation and annoyance for their parents. Especially when it happens in a restaurant (although the restaurant situation has improved, since they have been asked not to return to most local restaurants).</p>
<p><img alt="A bag of coal" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/d2ccbe9f7c8e887281272ae49854893f/bag_of_coal.jpg" style="border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 200px; height: 200px;" />Sending the boys to their rooms never helped. The two would just compete to see who could stay in their room longer. This Christmas their parents were at wits' ends, and they decided the boys needed to be taught a lesson so they could grow up to be upstanding citizens. Instead of the new bicycles the boys were going to get—and probably just race till they crashed anyway—their parents decided to give them each a bag of coal.</p>
<p>An astute reader might ask, “But what does this have to do with <a href="http://www.minitab.com/products/minitab">Minitab</a>?” Well, dear reader, the boys need to figure out who got the most coal. Immediately upon opening their packages, the boys carefully weighed each piece of coal and entered the data into Minitab.</p>
<p><span style="line-height: 1.6;">Then they selected <strong>Stat > Basic Statistics > Display Descriptive Statistics</strong> and used the "Statistics" options dialog to select the metrics they wanted, including the sum of the weights they'd entered:</span></p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/dacaebac62e3cc4c2e29329d0a779720/descriptivestatistics.png" style="width: 600px; height: 208px;" /></p>
<p><span style="line-height: 1.6;">Billy quickly saw that he had the most coal, and yelled, “I have 279.383 ounces and you only have 272.896 ounces, and the mean of my pieces of coal is more than the mean of yours. Mine weigh more, so our parents must love me more.” </span></p>
<p><span style="line-height: 1.6;">“Not so fast,” said Aaron. “You may have a higher mean value, but is the difference statistically significant?” There was only one thing left for the boys to do: perform a <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/t-for-2-should-i-use-a-paired-t-or-a-2-sample-t">two sample t-test</a>.</span></p>
<p><span style="line-height: 1.6;">In Minitab, Aaron selected </span><strong><span style="line-height: 1.6;">Stat > Basic Statistics > 2-Sample t…</span></strong></p>
<p>The boys left the default values at a confidence level of 95.0 and a hypothesized difference of 0. The alternative hypothesis was “Difference ≠ hypothesized difference” because the only question they were asking was “Is there a statistically significant difference?” between the two data sets.</p>
<p>The two troublemakers also selected “Graphs” and checked the options to display an individual value plot and a boxplot. They knew they should look at their data. Having the graphs available would also make it easier for them to communicate their results to higher authorities, in this case, their poor parents.</p>
<p><img alt="Individual Value Plot of Coal" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/bf541d8df2461a8edff9060789394b00/individual_value_plot_of_coal.png" style="width: 577px; height: 385px;" /></p>
<p><img alt="Boxplot of Coal" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/8945d7a038de654d008f68dc0a8886d3/boxplot_of_coal.png" style="width: 577px; height: 385px;" /></p>
<p>Both the individual value plots and boxplots showed that Aaron's bag of coal had pieces with the highest individual weights. But he also had the pieces with the least weight. So the values for his Christmas coal were scattered across a wider range than the values for Billy's Christmas coal. But was there really a difference?</p>
<p>Billy went running for his tables of Student's t-scores so he could interpret the resulting t-value of -0.71. Aaron simply looked at the resulting p-value of 0.481. The p-value was greater than 0.05, so the boys could not conclude there was a true difference in the weight of their Christmas "presents."</p>
<p><img alt="600" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/549762a9cb277536a76baedba32617d3/2_sample_t_test_coal.png" style="width: 683px; height: 305px;" /></p>
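<p>Outside Minitab, the same comparison can be sketched with SciPy; the weights below are hypothetical stand-ins, not the boys' actual data:</p>

```python
from scipy import stats

# Hypothetical coal-piece weights in ounces for each brother
billy = [27.9, 28.1, 27.5, 28.4, 27.8, 28.0, 27.7, 28.2, 27.6, 28.1]
aaron = [27.2, 28.6, 26.9, 28.8, 27.1, 28.3, 26.8, 28.5, 27.0, 28.7]

# Pooled two-sample t-test: two-sided, hypothesized difference of 0
t_stat, p_val = stats.ttest_ind(billy, aaron)

# The T-P link from the previous post: the two-sided p-value is twice
# the right-tail area beyond |t| with n1 + n2 - 2 degrees of freedom.
df = len(billy) + len(aaron) - 2
p_check = 2 * stats.t.sf(abs(t_stat), df)
```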
<p><span style="line-height: 1.6;">The boys dutifully reported the results, with illustrative graphs, each demanding that they get a little more to best the other. Clearly, receiving coal for Christmas had done nothing to reduce their level of competitiveness. Their parents realized the boys were probably not going to grow up to be upstanding citizens, but they may at least become good statisticians.</span></p>
<p>Happy Holidays.</p>
<p style="line-height: 20.7999992370605px;"><strong>About the Guest Blogger</strong></p>
<p style="line-height: 20.7999992370605px;"><em><a href="https://www.linkedin.com/pub/matthew-barsalou/5b/539/198" target="_blank">Matthew Barsalou</a> is a statistical problem resolution Master Black Belt at <a href="http://www.3k-warner.de/" target="_blank">BorgWarner</a> Turbo Systems Engineering GmbH. He is a Smarter Solutions certified Lean Six Sigma Master Black Belt, ASQ-certified Six Sigma Black Belt, quality engineer, and quality technician, and a TÜV-certified quality manager, quality management representative, and auditor. He has a bachelor of science in industrial sciences, a master of liberal studies with emphasis in international business, and a master of science in business administration and engineering from the Wilhelm Büchner Hochschule in Darmstadt, Germany. He is author of the books <a href="http://www.amazon.com/Root-Cause-Analysis-Step---Step/dp/148225879X/ref=sr_1_1?ie=UTF8&qid=1416937278&sr=8-1&keywords=Root+Cause+Analysis%3A+A+Step-By-Step+Guide+to+Using+the+Right+Tool+at+the+Right+Time" target="_blank">Root Cause Analysis: A Step-By-Step Guide to Using the Right Tool at the Right Time</a>, <a href="http://asq.org/quality-press/display-item/index.html?item=H1472" target="_blank">Statistics for Six Sigma Black Belts</a> and <a href="http://asq.org/quality-press/display-item/index.html?item=H1473&xvl=76115763" target="_blank">The ASQ Pocket Guide to Statistics for Six Sigma Black Belts</a>.</em></p>
Fun StatisticsHypothesis TestingStatisticsTue, 23 Dec 2014 13:00:00 +0000http://blog.minitab.com/blog/statistics-in-the-field/a-minitab-holiday-tale-featuring-the-two-sample-t-testGuest BloggerAre Preseason Football or Basketball Rankings More Accurate?
http://blog.minitab.com/blog/the-statistics-game/are-preseason-football-or-basketball-rankings-more-accurate
<p>College basketball season tips off today, and for the second straight season Kentucky is the #1 ranked preseason team in the AP poll. Last year Kentucky did not live up to that ranking in the regular season, going 24-10 and earning a lowly 8 seed in the NCAA tournament. But then, in the tournament, they overachieved and made a run all the way to the championship game...before losing to Connecticut.</p>
<p>In football, Florida State was the AP poll preseason #1 football team. While they are currently still undefeated, they aren't quite playing like the #1 team in the country. So this made me wonder, which preseason rankings are more accurate, football or basketball?</p>
<p>I gathered <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/1d3961db92c5ba14bc90b2b8323b95f8/preseason_basketball_vs__football_rankings.MTW">data</a> from the last 10 seasons, and recorded the top 10 teams in the preseason AP poll for both football and basketball. Then I recorded the difference between their preseason ranking and their final ranking. Both sports had 10 teams that weren’t ranked or receiving votes in the final poll, so I gave all of those teams a final ranking of 40.</p>
Creating a Histogram to Compare Two Distributions
<p>Let’s start with a histogram to look at the distributions of the differences. (It's always a good idea to look at the distribution of your data when you're starting an analysis, whether you're looking at quality improvement data for work or sports data for yourself.) </p>
<p>You can create this graph in Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a> by selecting <strong>Graph > Histograms</strong>, choosing "With Groups" in the dialog box, and using the Basketball Difference and Football Difference columns as the graph variables:</p>
<p><img alt="Histogram" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/53055c57978dbfa85d28688cc816c98a/histogram_of_basketball_difference__football_difference.jpg" style="width: 720px; height: 480px;" /></p>
<p>The differences in the rankings appear to be pretty similar. Most of the data is towards the left side of this histogram, meaning that in most cases the difference between the preseason and final ranking is pretty small.</p>
Conducting a Mann-Whitney Hypothesis Test on Two Medians
<p>We can further investigate the data by performing a hypothesis test. Because the data is heavily skewed, I’ll use <a href="http://blog.minitab.com/blog/the-statistics-game/do-the-data-really-say-female-named-hurricanes-are-more-deadly">a Mann-Whitney test</a>. This compares the medians of two samples with similarly-shaped distributions, as opposed to a <a href="http://blog.minitab.com/blog/understanding-statistics/guidelines-and-how-tos-for-the-2-sample-t-test">2-sample t test</a>, which compares the means. The median is the middle value of the data: half the observations are less than or equal to it, and half are greater than or equal to it.</p>
<p>To perform this test in our statistical software, we select <strong>Stat > Nonparametrics > Mann-Whitney</strong>, then choose the appropriate columns for our first and second sample: </p>
<p><img alt="Mann-Whitney Test" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/1a1f239841b82e60170e6ecbc8077d4b/mann_whitney.jpg" style="width: 689px; height: 241px;" /></p>
<p>The basketball rankings have a smaller median difference than the football rankings. However, when we examine the <a href="http://blog.minitab.com/blog/understanding-statistics/three-things-the-p-value-cant-tell-you-about-your-hypothesis-test">p-value</a> we see that this difference is not statistically significant. There is not enough evidence to conclude that one preseason poll is more accurate than the other.</p>
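<p>A sketch of the same test in SciPy, with hypothetical rank-difference data in place of the recorded values:</p>

```python
from scipy import stats

# Hypothetical |preseason rank - final rank| differences per sport,
# standing in for the recorded data in the post
basketball_diff = [1, 2, 2, 3, 5, 6, 8, 12, 19, 33]
football_diff = [1, 1, 3, 4, 6, 9, 11, 15, 27, 35]

# Mann-Whitney: a nonparametric, two-sided comparison that is robust
# to the heavy right skew seen in the histogram
u_stat, p_val = stats.mannwhitneyu(basketball_diff, football_diff,
                                   alternative="two-sided")
```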
<p>But what about the best teams? I grouped each of the top 3 ranked teams and looked at the median difference between their preseason and final rank.</p>
<p><img alt="Bar Chart" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/692a3db40dd5d3b4c20d539f92395629/bar_chart.jpg" style="width: 720px; height: 480px;" /></p>
<p>The preseason AP basketball poll has a smaller difference for the #1 and #3 ranked teams. But the football poll is better for the #2 team, having an impressive median value of 1. Overall, both polls are relatively good, as neither has a median value greater than 6. And the differences are close enough that we can’t conclude that one is more accurate than the other.</p>
What Does It Mean for the Teams?
<p>While the odds are against both Kentucky and Florida State to finish the season ranked #1 in their respective polls, previous seasons indicate that they’re still likely to finish as one of the top teams. This is better news for Kentucky, as being one of the top teams means they’ll easily make the NCAA basketball tournament and get a high seed. However, Florida State must finish as one of the top 4 teams, or else they’ll miss out on the football postseason completely.</p>
<p>So while we can’t conclude one poll is better than the other, teams at the top of the AP basketball poll are clearly much more likely to reach the postseason than teams at the top of the football poll.</p>
Data AnalysisFun StatisticsHypothesis TestingStatistics in the NewsFri, 14 Nov 2014 15:03:33 +0000http://blog.minitab.com/blog/the-statistics-game/are-preseason-football-or-basketball-rankings-more-accurateKevin RudyComparing the College Football Playoff Top 25 and the Preseason AP Poll
http://blog.minitab.com/blog/the-statistics-game/comparing-the-college-football-playoff-top-25-and-the-preseason-ap-poll
<p>The college football playoff committee waited until the end of October to release their first top 25 rankings. One of the reasons for waiting so far into the season was that the committee would rank the teams off of actual games and wouldn’t be influenced by preseason rankings.</p>
<p>At least, that was the idea.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/8ac74acf42052d068b6cd0eeec32f609/cfb_playoff.jpg" style="line-height: 20.7999992370605px; float: right; width: 300px; height: 187px;" /></p>
<p>Earlier this year, I found that the <a href="http://blog.minitab.com/blog/the-statistics-game/has-the-college-football-playoff-already-been-decided">final AP poll was correlated with the preseason AP poll</a>. That is, if team A was ranked ahead of team B in the preseason and they had the same number of losses, team A was still usually ranked ahead of team B. The biggest exception was SEC teams, who were able to regularly jump ahead of teams (with the same number of losses) ranked ahead of them in the preseason.</p>
<p>If the final AP poll can be influenced by preseason expectations, could the college football playoff committee be influenced, too? Let’s compare their first set of rankings to the preseason AP poll to find out.</p>
Comparing the Ranks
<p>There are currently 17 different teams in the committee’s top 25 that have just one loss. I <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/26e7c8d8d8eee4fe2dfa26dc3d6e3c54/preseason_ap_vs__cfb_playoff_rankings.MTW">recorded the order</a> they are ranked in the committee’s poll and their order in the AP preseason poll. Below is an individual value plot of the data that shows each team’s preseason rank versus their current rank.</p>
<p><img alt="IVP" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/4098bab194a586865d3861f854d65627/ivp.jpg" style="width: 600px; height: 400px;" /></p>
<p>Teams on the diagonal line haven’t moved up or down since the preseason. Although Notre Dame is the only team to fall directly on the line, most teams aren’t too far off.</p>
<p>Teams below the line have jumped teams that were ranked ahead of them in the preseason. The biggest winner is actually not an SEC team, it’s TCU. Before the season, 13 of the current one-loss teams were ranked ahead of TCU, but now there are only 4. On the surface TCU seems to counter the idea that only SEC teams can drastically move up from their preseason ranking. However, of the 9 teams TCU jumped, only one (Georgia) is from the SEC. And the only other team to jump up more than 5 spots is Mississippi—who of course is from the SEC. So I wouldn’t conclude that the CFB playoff committee rankings behave differently than the AP poll quite yet.</p>
<p>Teams above the line have been passed by teams that had been ranked behind them in the preseason. Ohio State is the biggest loser, having had 9 different teams pass over them. Part of this can be explained by the fact that they have the worst loss (a home loss to a Virginia Tech team that is now 4-4). But another factor is that the preseason AP poll was released before anybody knew Buckeye quarterback Braxton Miller would miss the entire season. Had voters known that, Ohio State probably wouldn’t have been ranked so high to begin with. </p>
<p>Overall, 10 teams have moved up or down from their preseason spot by 3 spots or less. The correlation between the two polls is 0.571, which indicates a positive association between the preseason AP poll and the current CFB playoff rankings. That is, teams ranked higher in the preseason poll tend to be ranked higher in the playoff rankings.</p>
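<p>The 0.571 figure is a Pearson correlation computed on the two rank columns (applied to ranks, this is equivalent to Spearman's rank correlation). A self-contained sketch, with hypothetical rank lists in place of the actual data:</p>

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical preseason vs. current ranks for a few teams
preseason = [1, 2, 3, 4, 5, 6]
current = [2, 1, 3, 6, 4, 5]
r = pearson(preseason, current)  # positive: higher preseason, higher current
```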
Concordant and Discordant Pairs
<p>We can take this analysis a step further by looking at the concordant and discordant pairs. A pair is concordant if the observations are in the same direction. A pair is discordant if the observations are in opposite directions. This will let us compare teams to each other two at a time.</p>
<p>For example, let’s compare Auburn and Mississippi. In the preseason, Auburn was ranked 3 (out of the 17 one-loss teams) and Mississippi was ranked 10. In the playoff rankings, Auburn is ranked 1 and Mississippi is ranked 2. This pair is concordant, since in both cases Auburn is ranked higher than Mississippi. But if you compare Alabama and Mississippi, you’ll see Alabama was ranked higher in the preseason, but Mississippi is ranked higher in the playoff rankings. That pair is discordant.</p>
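<p>Counting concordant and discordant pairs is straightforward to sketch in code (the function below is illustrative, not the post's actual computation):</p>

```python
from itertools import combinations
from math import comb

def count_pairs(pre, cur):
    """Count concordant/discordant pairs between two rankings.

    pre[i] and cur[i] are team i's preseason and current ranks; a pair
    is concordant when both rankings order the two teams the same way
    (ties are not handled here).
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(pre)), 2):
        if (pre[i] - pre[j]) * (cur[i] - cur[j]) > 0:
            concordant += 1
        else:
            discordant += 1
    return concordant, discordant

# 17 one-loss teams give comb(17, 2) = 136 pairs, as in the post
total_pairs = comb(17, 2)
```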
<p>When we compare every team, we end up with 136 pairs. How many of those are concordant? Our <a href="http://www.minitab.com/products/minitab">favorite statistical software</a> has the answer: </p>
<p><img alt="Measures of Concordance" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/5f281abfa1e06d5cda492e17b3f9746b/concordance.jpg" style="width: 663px; height: 176px;" /></p>
<p>There are 96 concordant pairs, which is just over 70%. So most of the time, if a team ranked higher in the preseason poll, they are ranked higher in the playoff rankings. And consider this: of the one-loss teams, the top 4 ranked preseason teams were Alabama, Oregon, Auburn, and Michigan St. Currently, the top 4 one-loss teams are Auburn, Mississippi, Oregon, and Alabama. That’s only one new team—which just so happens to be from the SEC.</p>
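<p>For readers who want to see the arithmetic, concordant and discordant pairs are easy to count by brute force. Here's a minimal Python sketch (not Minitab output) using the four ranks quoted above for Auburn, Mississippi, Alabama, and Oregon among the 17 one-loss teams:</p>

```python
from itertools import combinations

def concordance(pre, post):
    """Count concordant and discordant pairs between two rankings."""
    conc = disc = 0
    for i, j in combinations(range(len(pre)), 2):
        s = (pre[i] - pre[j]) * (post[i] - post[j])
        if s > 0:
            conc += 1      # same order in both rankings
        elif s < 0:
            disc += 1      # order reversed between rankings
    return conc, disc

# Preseason and playoff ranks (among the 17 one-loss teams) for
# Auburn, Mississippi, Alabama, and Oregon, as quoted in the post
pre = [3, 10, 1, 2]
post = [1, 2, 4, 3]
print(concordance(pre, post))  # → (1, 5)
```

<p>Run over all 17 teams, the same loop would evaluate all 136 pairs.</p>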
<p>That’s bad news for non-SEC teams that started the season ranked low, like Arizona, Notre Dame, Nebraska, and Kansas State. It's going to be hard for them to jump teams with the same record, especially if those teams are from the SEC. Just look at Alabama’s résumé so far. Their best win is over West Virginia and they lost to #4 Mississippi. Is that <em>really </em>better than Kansas State, who lost to #3 Auburn and beat Oklahoma <em>on the road</em>? If you simply changed the name on Alabama’s uniform to Utah and had them unranked to start the season, would they still be ranked three spots higher than Kansas State? I doubt it.</p>
<p>The good news is that there are still many games left to play. Most of these one-loss teams will lose at least one more game. But with 4 teams making the playoff this year, odds are we'll see multiple teams with the same record vying for the last playoff spot. And if this college football playoff ranking is any indication, unless you're in the SEC, the teams that were highly thought of in the preseason will have the edge.</p>
Fun StatisticsHypothesis TestingFri, 31 Oct 2014 13:04:57 +0000http://blog.minitab.com/blog/the-statistics-game/comparing-the-college-football-playoff-top-25-and-the-preseason-ap-pollKevin RudyUsing Data Analysis to Maximize Webinar Attendance
http://blog.minitab.com/blog/michelle-paret/using-data-analysis-to-maximize-webinar-attendance
<p>We like to host webinars, and our customers and prospects like to attend them. But when our webinar vendor moved from a pay-per-person pricing model to a pay-per-webinar pricing model, we wanted to find out how to maximize registrations and thereby minimize our costs.<img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/8a6733d3b0516b7f1c7ad80ea753d430/mtbnewspromos_w640.jpeg" style="width: 400px; height: 273px; float: right; border-width: 1px; border-style: solid; margin: 10px 15px;" /></p>
<p>We collected webinar data on the following variables:</p>
<ul>
<li>Webinar topic</li>
<li>Day of week</li>
<li>Time of day – 11 a.m. or 2 p.m.</li>
<li>Newsletter promotion – no promotion, newsletter article, newsletter sidebar</li>
<li>Number of registrants</li>
<li>Number of attendees</li>
</ul>
<p>Once we'd collected our data, it was time to analyze it and answer some key questions using <a href="http://www.minitab.com/products/minitab/">Minitab Statistical Software</a>.</p>
Should we use registrant or attendee counts for the analysis?
<strong><span style="line-height: 16.8666667938232px; font-family: Calibri, sans-serif; font-size: 11pt;"><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/4d9fa1e3c73606627d2ca1ec34b620e2/scatterplot_w640.jpeg" style="width: 300px; height: 197px; margin: 10px 15px; float: left;" /></span></strong>
<p>First we needed to decide what we would use to measure our results: the number of people who signed up, or the number of people who actually attended the webinar. This really boils down to answering the question, “Can I trust my data?”</p>
<p>Our data collection system for webinar registrants is much more accurate than our data collection system for webinar attendees. This is due to customer behavior and their willingness to share contact information, in addition to the automated database processes that connect our webinar vendor data with our own database. So, for a period of time, I manually collected the attendee data directly from our webinar vendor to see how it correlated with the easily-accessible and accurate registration data. The scatterplot above shows the results.</p>
<p>With a <a href="http://blog.minitab.com/blog/understanding-statistics/no-matter-how-strong-correlation-still-doesnt-imply-causation">correlation coefficient </a>of 0.929 and a p-value of 0.000, there was a strong positive linear relationship between the registrations and attendee counts. If registrations are high, then attendance is also high. If registrations are low, then attendance is also low. I concluded that I could use the registration data—which is both easily accessible and extremely reliable—to conduct my analysis.</p>
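<p>As a sanity check on what that correlation coefficient measures, here is a minimal pure-Python sketch. The registrant and attendee counts below are made up for illustration, not our actual webinar data:</p>

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up registrant and attendee counts for six webinars
registrants = [120, 80, 200, 150, 60, 170]
attendees = [55, 35, 90, 70, 30, 80]
r = pearson_r(registrants, attendees)  # close to 1: strong positive relationship
```

<p>A value near +1, like the 0.929 we observed, means the two counts rise and fall together, which is what justified using the more reliable registration data.</p>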
Should we consider data for the last 6 years?
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/5e73f48b852c7afc17762f28bf8887cf/i_mr_chart_of_registrants_w640.jpeg" style="width: 400px; height: 263px; margin: 10px 15px; float: left;" />We’ve been collecting webinar data for 6 years, but that doesn’t mean we can treat the last 6 years of data as one homogeneous population.</p>
<p>A lot can change in a 6-year time period. Perhaps there was a change in the webinar process that affected registrations. To determine whether or not I should use all of the data, I used an Individuals and Moving Range (I-MR, also referred to as X-MR) <a href="http://blog.minitab.com/blog/understanding-statistics/how-create-and-read-an-i-mr-control-chart">control chart</a> to evaluate the process stability of webinar registrations over time.</p>
<p>The graph revealed a single point on the MR chart that flagged as out-of-control. I looked more closely at this point and verified that the data was accurate and that this webinar belonged with the larger population. Based on this information, I decided to proceed with analyzing all 6 years of data together. (Note there is some clustering of points due to promotions, but again the goal here was to determine if we could use data over a 6-year time period.)</p>
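<p>For the curious, the limits behind an I-MR chart can be sketched in a few lines of code. The registration counts below are hypothetical, and the constants 2.66 and 3.267 are the standard chart factors for moving ranges of size 2:</p>

```python
def imr_limits(data):
    """Control limits for an Individuals and Moving Range (I-MR) chart."""
    mr = [abs(b - a) for a, b in zip(data, data[1:])]  # moving ranges of size 2
    mr_bar = sum(mr) / len(mr)
    x_bar = sum(data) / len(data)
    # 2.66 = 3/d2 and 3.267 = D4 for moving ranges of size 2
    i_limits = (x_bar - 2.66 * mr_bar, x_bar + 2.66 * mr_bar)
    mr_ucl = 3.267 * mr_bar
    return i_limits, mr_ucl

# Hypothetical weekly registration counts
(i_lcl, i_ucl), mr_ucl = imr_limits([110, 95, 130, 120, 105, 140, 115, 100])
# A point outside (i_lcl, i_ucl), or a moving range above mr_ucl, flags as out of control
```
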
What variables impact registrations?
<p>I performed an ANOVA using Minitab's General Linear Model tool to find out which factors—topic, day of week, time of day, or newsletter promotion—significantly affect webinar registrations.<img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/3758d3d03a604bab9921ad9f94663dc8/main_effects_plot_for_registrants_w640.jpeg" style="width: 400px; height: 263px; float: right; margin: 10px 15px;" /></p>
<p>The ANOVA results revealed that the day of week, time of day, and webinar topic <em>do not</em> affect webinar registrations, but the newsletter promotion type <em>does</em> (p-value = 0.000).</p>
<p>So which webinar promotion type maximizes webinar registrations?</p>
<p>Using Minitab to conduct <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/keep-that-special-someone-happy-when-you-perform-multiple-comparisons">Tukey comparisons</a>, we can see that registrations for webinars promoted in the newsletter sidebar space were not significantly different from webinars that weren't promoted at all.</p>
<p>However, webinars that were promoted in the newsletter <em>article </em>space resulted in significantly more registrations than both the sidebar promotions and no promotions.</p>
<p>From this analysis, we concluded that we still had the flexibility to offer webinars at various times and days of the week, and we could continue to vary webinar topics based on customer demand and other factors. To maximize webinar attendance and minimize webinar cost, we needed to focus our efforts on promoting the webinars in our newsletter, utilizing the article space.</p>
<p>But over the past year, we’ve started to actively promote our webinars via other channels as well, so next up is some more data analysis—using Minitab—to figure out what marketing channels provide the best results…</p>
Data AnalysisHypothesis TestingRegression AnalysisStatisticsFri, 17 Oct 2014 12:00:00 +0000http://blog.minitab.com/blog/michelle-paret/using-data-analysis-to-maximize-webinar-attendanceMichelle ParetWith the Assistant, You Won't Have to Stop and Get Directions about Directional Hypotheses
http://blog.minitab.com/blog/statistics-and-quality-improvement/with-the-assistant-you-wont-have-to-stop-and-get-directions-about-directional-hypotheses
<p>I got lost a lot as a child. I got lost at malls, at museums, Christmas markets, and everywhere else you could think of. Had it been in fashion to tether children to their parents at the time, I'm sure my mother would have. As an adult, I've gotten used to using a GPS device to keep me from getting lost.</p>
<p><span style="line-height: 20.7999992370605px;">The Assistant in Minitab is like your GPS for statistics. The Assistant is there to provide you with directions so that you don't get lost. One particular area where it's easy to get lost is with directional hypotheses.</span><img alt="Wait... is my hypothesis the other direction?" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/25dd42362071d2aafc3bfc85f78f5f22/hypothesis_bubble_w640.jpeg" style="line-height: 20.7999992370605px; width: 480px; height: 350px; border-width: 1px; border-style: solid; margin: 10px 15px;" /></p>
What Is a Directional Hypothesis?
<p>When you do a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/basics/what-is-a-hypothesis-test/">statistical hypothesis test</a>, you have a null hypothesis and an alternative hypothesis. Directional hypotheses are two of the three alternative hypotheses that you can usually choose from:</p>
<ul>
<li>The value that you want to test is greater than a target.</li>
<li>The value that you want to test is different from a target.</li>
<li>The value that you want to test is less than a target.</li>
</ul>
<p>If you select an alternative hypothesis with "greater than" or "less than" in it, then you've chosen a directional hypothesis. When you choose a directional hypothesis, you get a one-sided test.</p>
<p>What does it look like to choose a one-sided test, and why would you? Let's consider an example.</p>
Choosing Whether to Use a One-sided Test or a Two-sided Test
<p>Suppose a factory installs new production equipment that should increase the rate of production for electrical panels. Concern exists that the change could increase the percentage of electrical panels that require rework before shipping. A quality team prepares to conduct a hypothesis test to determine whether statistical evidence supports this concern. The historical rework rate is 1%.</p>
<p>At this point, you would usually choose an alternative hypothesis. Maybe you remember hearing that you should think about whether to use a one-sided test or a two-sided test, or you may not even know how a test can have a side.</p>
<p>To keep from getting lost, you use your GPS. To keep from getting confused about statistics, you can use the Assistant. The Assistant uses clear and simple language. The Assistant doesn't ask you about "directional hypotheses" or "one-sided tests." Instead, the Assistant asks the question, "What do you want to determine?"</p>
<p><img alt="Is the % defective of Panels greater than .01? Is the % defective of Panels less than .01? Is the % defective of Panels different from .01?" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/b090980e5b08184e7b70b96b9cb05489/test_setup_in_assistant.png" style="width: 573px; height: 198px;" /></p>
<p>In this scenario, it's easy to see why the team would want to determine whether the percentage is greater than 1%. By performing the one-sided test for whether the percentage is greater than 1%, the team can determine if there is enough statistical evidence to conclude that the percentage increased. If the percentage increased, then the concern is justified.</p>
<p>In practical terms, you should consider what it means to limit your decision to whether there is evidence for an increase. A one-sided test of whether the percentage increased will never show a statistically significant decrease in the percentage of panels that require rework. Evidence of a decrease in the number of defectives might guide the quality team to investigate the reasons for the unforeseen benefit.</p>
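<p>Behind the Assistant's question sits a one-sided test of a proportion. As a rough illustration of the idea, here is an exact binomial version in Python; the sample of 500 panels with 9 reworks is hypothetical, not data from the Assistant example:</p>

```python
from math import comb

def binom_p_greater(n, defects, p0):
    """Exact one-sided p-value for H1: proportion > p0.

    Probability of observing `defects` or more under the null rate p0.
    """
    return sum(comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
               for k in range(defects, n + 1))

# Hypothetical inspection: 500 panels, 9 need rework; historical rate is 1%
p = binom_p_greater(500, 9, 0.01)
# Compare p to the significance level to decide whether the rework rate increased
```

<p>A small p-value here would support the concern that the rate rose above 1%; a decrease, no matter how large, could never make this one-sided p-value small.</p>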
Why Use a One-sided Test?
<p>Given this possible concern about whether a one-sided test excludes important information from the result, why would you ever use one? The best answer is that you use a one-sided test when the one-sided test tells you everything that you need to know.</p>
<p>In the example about the electrical panels, the quality team might feel completely secure in assuming that the new equipment will not result in a decrease in the percentage of panels that require rework. If so, then a test that checks for a decrease answers a question the team never needs to ask. The team only needs to determine whether there is a problem with increased defectives to solve.</p>
The Assistant Gets Even Better
<p>While a p-value for a one-sided test can be useful, more analysis can help you make better decisions. For example, in the electrical panel example, if the team finds a statistically significant increase, it will be important to know what the percentage increase is. <a href="http://www.minitab.com/en-us/products/minitab/assistant/">The Assistant</a> produces several reports with your hypothesis tests that help you get as much information as you can from your data. The report card verifies your analysis by providing assumption checks and identifying any concerns that you should be aware of. The diagnostic report helps you further understand your analysis by providing additional detail. The summary report helps you to draw the correct conclusions and explain those conclusions to others. The series of reports includes a variety of other statistics and analyses. That way, you have everything that you need to interpret your results with confidence.</p>
<p><img alt="The % defective of Panels is not significantly greater than the target (p > 0.05)" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/75f280df482574a3aee75ee65741b5c4/1_sample___defective_test_for_panels___summary_report_w640.png" style="width: 480px; height: 360px;" /></p>
<p>The image of the face in the crowd without the thought bubble is by <a href="https://www.flickr.com/photos/akbarsyah/">_Imaji_</a> and is licensed under <a href="https://creativecommons.org/licenses/by/2.0/">this creative commons license</a>.</p>
Hypothesis TestingWed, 15 Oct 2014 18:52:23 +0000http://blog.minitab.com/blog/statistics-and-quality-improvement/with-the-assistant-you-wont-have-to-stop-and-get-directions-about-directional-hypothesesCody SteeleHow Politicians and Governments Could Benefit from Statistical Analyses
http://blog.minitab.com/blog/applying-statistics-in-quality-projects/how-politicians-and-governments-could-benefit-from-statistical-analyses
<p>Using <a href="http://blog.minitab.com/blog/applying-statistics-in-quality-projects/a-doe-in-a-manufacturing-environment-part-1">statistical techniques to optimize manufacturing processes</a> is quite common now, but using the same approach on social topics is still innovative. For example, if our objective is to improve students' academic performance, should we increase teachers' wages, or would it be better to reduce the number of students in a class?</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/4b07ae989e35a7dfd8b6fdb313a5561b/ballot.jpg" style="float: right; width: 250px; height: 250px;" />Many social topics (the effect of increasing the minimum wage on employment, etc.) generate long and passionate discussions in the media and in politics. People express very different and subjective points of views according to political/ideological opinions and varied ways of thinking.</p>
Hypothesis Testing in the Policy Realm
<p><span style="line-height: 20.7999992370605px;">Social experimentation and data analysis can provide a firmer ground on which we can base more objective decisions.</span></p>
<p>The objective is to investigate the effects of a policy intervention and to <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/basics/example-of-a-hypothesis-test/">test specific hypotheses</a>. In these social experiments “randomization” is a key element. If one policy option is tested in, say, the Netherlands, and another policy option is tested in France, the experimenter will never be in a position to fully understand whether a difference in outcomes is due to the intervention itself or to the many other differences between these two countries.</p>
<p>It would clearly be preferable to test the two approaches in different regions of France and of the Netherlands, for example, and assign the policy intervention in a random way to a “treatment” group (individuals who receive it) and a “comparison” group (individuals who do not receive it).</p>
<p>At the beginning of the study, the “treatment” and the “control” groups should be as similar as possible to prevent any systematic previous bias. The objective is not to “observe” differences but to identify the actual causal effects.</p>
Designed Experiment Techniques
<p>Other techniques that are often used in <a href="http://blog.minitab.com/blog/understanding-statistics/getting-started-with-factorial-design-of-experiments-doe">designed experiments (DOEs)</a> may also be useful in this context, such as blocking and balancing. In my example, France and the Netherlands might be considered as a blocking factor (a known external factor that the experimenter cannot set, but can account for in the design), and the tests should be “balanced” across blocks so that the treatment effect estimates are not biased and the blocking effects of the countries are neutralized. Other potential blocking factors in policy studies might be urban versus rural regions, or females versus males.</p>
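<p>As an illustration of blocked randomization, assuming equal-sized blocks and hypothetical region names, the assignment might be sketched like this:</p>

```python
import random

def assign_within_blocks(units, block_of, seed=42):
    """Randomly assign half of each block to treatment, half to control."""
    rng = random.Random(seed)
    blocks = {}
    for u in units:
        blocks.setdefault(block_of[u], []).append(u)
    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)  # randomize within the block
        half = len(members) // 2
        for u in members[:half]:
            assignment[u] = "treatment"
        for u in members[half:]:
            assignment[u] = "control"
    return assignment

# Hypothetical regions, blocked by country
units = [f"region{i}" for i in range(8)]
block_of = {u: ("France" if i < 4 else "Netherlands") for i, u in enumerate(units)}
print(assign_within_blocks(units, block_of))
```

<p>Because each country contributes equally to both groups, country effects cannot be confounded with the treatment effect.</p>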
Examples of Policy Experiments
<p>Data analysis and statistics have been used to inform several important policy debates around the world over the past few years. Here are a few examples:</p>
<p>- In Kenya, a social experiment showed that neither hiring extra teachers to reduce class sizes in schools nor providing more textbooks to pupils had much effect on academic performance. A surprising finding of this study was that deworming programs (treating intestinal worms) were very effective in decreasing child absenteeism.</p>
<p>- In the U.S., a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/doe/factorial-designs/choose-a-factorial-design/">full factorial design (DOE)</a> was used to assess the effectiveness of commitment contracts. The objective of these contracts was to encourage individuals to exercise more in order to reduce health risks and prevent obesity. The effects of factors such as the duration of the physical exercises, their frequency, and the financial stakes were studied. The outcome was the likelihood of accepting such a contract.</p>
<p>- Different strategies to quit smoking based on commitment contracts have been tested using a randomized experimental approach.</p>
<p>- In France, a social experiment was conducted to compare different job-counselling strategies for placing young unemployed people. The studied outcome was the probability of finding a job.</p>
Conclusion
<p>Experiments make it possible to vary one factor at a time, but a more effective approach is to modify several factors for each test using proper designs of experiments. Expertise in setting up randomized field experiments to test economic hypotheses is clearly a key factor.</p>
<p>Experimental results are often surprising; therefore, experimentation and data analysis are potentially new and powerful tools in the arsenal of politicians and governments.</p>
<p>Here are sources of more information about the examples I've mentioned:</p>
<p>Miguel, Edward and Michael Kremer (2004). “Worms: Identifying Impacts on Education and Health in the Presence of Treatment Externalities,” Econometrica, Volume 72 (1), pp. 159-217.</p>
<p>Gine, Xavier, Dean Karlan and Jonathan Zinman (2008). “Put Your Money Where Your Butt Is: A Commitment Savings Account for Smoking Cessation,” MIMEO, Yale University.</p>
<p><a href="http://www.voxeu.org/article/job-placement-and-displacement-evidence-randomised-experiment">http://www.voxeu.org/article/job-placement-and-displacement-evidence-randomised-experiment</a></p>
<p>Using Nudges in Exercise Commitment Contracts : <a href="http://www.nber.org/bah/2011no1/w16624.html">http://www.nber.org/bah/2011no1/w16624.html</a></p>
<p> </p>
Data AnalysisDesign of ExperimentsHypothesis TestingStatisticsStatistics in the NewsStatsMon, 22 Sep 2014 12:00:00 +0000http://blog.minitab.com/blog/applying-statistics-in-quality-projects/how-politicians-and-governments-could-benefit-from-statistical-analysesBruno ScibiliaA Fun ANOVA: Does Milk Affect the Fluffiness of Pancakes?
http://blog.minitab.com/blog/statistics-in-the-field/a-fun-anova3a-does-milk-affect-the-fluffiness-of-pancakes
<p><em>by Iván Alfonso, guest blogger</em></p>
<p><img alt="hotcakes" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/7bd460fa71f6d12672a2ac5d9f754762/pancakes.jpg" style="border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 300px; height: 223px;" />I'm a huge fan of hot cakes—they are my favorite dessert ever. I’ve been cooking them for over 15 years, and over that time I’ve noticed many variations in texture, flavor, and thickness. Personally, I like fluffy pancakes.</p>
<p>There are many brands of hotcake mix on the market, all with very similar formulations. So I decided to investigate which ingredients and inputs may influence the fluffiness of my pancakes.</p>
<p>Potential factors could include the type of mix used, the type of milk used, the use of margarine or butter (of many brands), the amount of mixing time, the origin of the eggs, and the skill of the person who prepares the pancakes.</p>
<p>Instead of looking at <em>all </em>of these factors, I focused on the type of milk used in the pancakes. I had four types of milk available: whole milk, light, low fat, and low protein.</p>
<p>My goal was to determine if these different milk formulations influence fluffiness (thickness). Is whole milk the best for fluffy hotcakes? Does skim milk work the same way as whole milk? Can I be sure that using light milk will result in hot cakes that are less fluffy?</p>
Gathering Data
<p>I sorted the four formulations as shown in the diagram below:</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/643f9f4f94be78a5b1c012e49c400772/milk_factor.jpg" style="width: 144px; height: 200px;" /></p>
<p>I used the same amounts of milk, flour (one brand), salt, and margarine for each batch of hotcakes I cooked.</p>
<p>The response variable was the thickness of the cooked pancakes. I prepared 6 pancakes for each type of milk, which gave me a total of 24 pancakes. I randomized the cooking order to minimize bias. I also prepared each batch by myself—if my sister or mother had helped with some batches, it would be a potential source of variation.</p>
<p>To measure the fluffiness, I inserted a stick into the center of each hotcake down to the bottom, marked the stick with a pencil, then measured the distance to the mark in millimeters with a ruler.</p>
<p>After a couple of hours of cooking hotcakes, making measurements, and recording the data on a worksheet, I started to analyze my data with Minitab.</p>
Analysis of Variance (ANOVA)
<p>My goal was to assess the variation in thickness or fluffiness between different batches of hot cakes, so the most appropriate statistical technique was <a href="http://blog.minitab.com/blog/statistics-in-the-field/understanding-anova-by-looking-at-your-household-budget">analysis of variance, or ANOVA</a>. With this analysis I could visualize and compare the formulations based on my response variable, the thickness in millimeters, and see if there were statistically significant differences between them. I used a <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/alpha-male-vs-alpha-female">0.05 significance value</a>.</p>
<p>As soon as I had my data in a Minitab worksheet, I started to check it against the assumptions of ANOVA. First, I needed to see if the data followed a normal distribution, so I went straight to <strong>Stat > Basic Statistics > Normality Test</strong>. Minitab produced the following graph:</p>
<p><img alt="Graph of probability of thickness" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/58599d2e2d8572e700893e2e8000dce9/probability_of_thickness.jpg" style="width: 500px; height: 304px;" /></p>
<p>My data passed both the Kolmogorov-Smirnov and Anderson-Darling normality tests. This was a relief—since my data had a normal distribution, I didn’t need to worry about ANOVA’s assumption of normality.</p>
<p>Traditional ANOVA also has an assumption of equal variances; however, I knew that even if my data didn’t meet this assumption, I could proceed using the method called <a href="http://blog.minitab.com/blog/adventures-in-statistics/did-welchs-anova-make-fishers-classic-one-way-anova-obsolete">Welch’s ANOVA</a>, which accommodates unequal variances. But when I ran Bartlett’s test for equal variances, and even the more stringent Levene test, my data passed. </p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/5600f02a4a7a9faa8b82c3bbe1458784/test_for_equality_of_variances.jpg" style="width: 500px; height: 307px;" /></p>
<p>With confirmation that my data met the assumptions, I proceeded to perform the ANOVA and create box-and-whisker graphs.</p>
ANOVA Results
<p>Here's the Minitab output for the ANOVA:</p>
<p style="margin-left: 40px;"><img alt="one-way anova output" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/5817e0a9b2d961942f7101bc8eb2eced/one_way_anova.gif" style="width: 400px; height: 133px;" /></p>
<p>The ANOVA revealed that there were indeed statistically significant differences (p = 0.009) among my four batches of hotcakes.</p>
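<p>The F statistic in a table like this comes from comparing between-group to within-group variation. Here is a bare-bones sketch of that calculation, using toy data rather than the pancake measurements:</p>

```python
def one_way_anova(groups):
    """F statistic and degrees of freedom for a one-way ANOVA."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean vs. within-group scatter
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    df_b, df_w = k - 1, n - k
    f = (ss_between / df_b) / (ss_within / df_w)
    return f, df_b, df_w

# Toy data: three groups of three measurements each
f, df_b, df_w = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

<p>A large F (relative to the F distribution with those degrees of freedom) gives a small p-value, like the 0.009 here.</p>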
<p>Minitab’s output also included grouping information using Tukey’s method of multiple comparisons for 95% confidence intervals:</p>
<p style="margin-left: 40px;"><img alt="Tukey Method" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c9194c1dda604ad87e4e7985ec8261c1/tukey_method.gif" style="width: 400px; height: 151px;" /></p>
<p>The Tukey analysis shows that the low-fat milk and light milk batches do not show a significant difference in fluffiness. However, the batches made with whole milk and low protein milk did significantly differ from each other.</p>
<p>The box-and-whisker diagram makes the results of the analysis easier to visualize:</p>
<p><img alt="Boxplot of thickness" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/8ca740917c33fddd8953433d67488ac8/boxplot_of_thickness.gif" style="width: 500px; height: 338px;" /></p>
<p>It is clear from the graph that hotcakes produced with whole milk had the most fluffiness, and those made with low protein milk had the least fluffiness. There was not a big difference between the fluffiness of hotcakes made with light milk and lowfat milk.</p>
Which Milk Should You Use for Fluffy Pancakes?
<p>Based on this analysis, I recommend using whole milk for fluffier hotcakes. If you want to avoid fats and sugars in milk, low fat milk is a good choice.</p>
<p>I always use lowfat milk, but the analysis indicates that light milk offers a good alternative for people following a strict no-fat diet.</p>
<p>It’s important to note that for this analysis, I only compared formulations that used the same brand of pancake mix and the same amounts of salt and butter. But there are other factors to consider! My next pancake experiment will use design of experiments (DOE) to compare milk types, different brands of flour, and margarine with and without salt, to see how all of these factors together affect the fluffiness of pancakes.</p>
<p> </p>
<p><strong>About the Guest Blogger:</strong></p>
<p><em>Iván Alfonso is a biochemical engineer and statistics professor at the Autonomous University of Campeche, Mexico. Alfonso holds a master's degree in marine chemistry and has worked extensively in data analysis and design of experiments in basic and advanced sciences like chemistry and epidemiology.</em></p>
<p> </p>
<p><strong>Would you like to publish a guest post on the Minitab Blog? Contact <a href="mailto:publicrelations@minitab.com?subject=Guest%20Blogger">publicrelations@minitab.com</a>.</strong></p>
<p> </p>
Data AnalysisFun StatisticsHypothesis TestingStatisticsTue, 05 Aug 2014 12:00:00 +0000http://blog.minitab.com/blog/statistics-in-the-field/a-fun-anova3a-does-milk-affect-the-fluffiness-of-pancakesGuest BloggerDo the Data Really Say Female-Named Hurricanes Are More Deadly?
http://blog.minitab.com/blog/the-statistics-game/do-the-data-really-say-female-named-hurricanes-are-more-deadly
<p><img alt="Hurricane" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/61165559035556ba8f784164d74a7f96/hurricane_w640.jpeg" style="float: right; width: 250px; height: 188px; border-width: 1px; border-style: solid; margin: 10px 15px;" />A recent study has indicated that <a href="http://www.washingtonpost.com/blogs/capital-weather-gang/wp/2014/06/02/female-named-hurricanes-kill-more-than-male-because-people-dont-respect-them-study-finds/" target="_blank">female-named hurricanes kill more people than male hurricanes</a>. Of course, the title of that article (and other articles like it) is a bit misleading. The study found a significant <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/anova-models/what-is-an-interaction/">interaction</a> between the damage caused by the storm and the perceived masculinity or femininity of the hurricane names. So don’t be confused by stories that suggest all female-named hurricanes are deadlier than male-named hurricanes. The study actually found no effect of masculinity/femininity for less severe storms. It was the more severe storms where the gender of the name had a significant relationship with the number of deaths.</p>
<p>The study looked at every hurricane since 1950, with the exception of Katrina and Audrey (those two are outliers that would skew the results). Many critics of the study believe that it is biased, since almost all of the 38 hurricanes before 1979 had female names (there were two male names in the early 50s). It’s possible that our ability to forecast hurricanes has vastly improved since the 50s and 60s. So, these critics say, the difference is simply because more people died in hurricanes back when they all had a female name.</p>
<p>Let’s perform a data analysis to see if that is true. We will use pre- and post-1979 to distinguish between the predominantly female-name hurricane era and the era of mixed hurricane names. I’ll use the exact same data set that was used in the study, which you can get <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/ad7c966669da36643b8060c74038e6d6/hurricane.MTW">here</a>.</p>
Hurricanes Before and After 1979
<p>For the 92 hurricanes in the study, the number of deaths and the normalized damage were recorded. The study showed that these two variables are highly correlated, so it’s important to consider both factors. If we find there were more deaths in hurricanes before 1979, we need to make sure the reason isn’t simply because those hurricanes caused more damage (implying they were bigger storms).</p>
<p>We can start by using a scatterplot to plot the two variables against each other, using whether the hurricane came before or after 1979 as a grouping variable. Hurricanes that occurred <em>during </em>1979 were put in the After group.</p>
<p><img alt="Scatterplot" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/72ef8a172f250267d3b03cccd6ff8399/scatterplot_of_deaths_vs_normalized_damage_w640.jpeg" style="width: 640px; height: 427px;" /></p>
<p>We see that the two deadliest hurricanes (Camille and Diane) both occurred before 1979. If you look below them, you’ll see that many hurricanes in both eras have caused the same amount of damage, yet resulted in far fewer deaths.</p>
<p>Meanwhile, the two most damaging hurricanes (Sandy and Andrew) both occurred <em>after </em>1979. These hurricanes caused more than three times the damage of Camille and Diane, yet resulted in fewer deaths. This gives some credibility to the idea that our improvement in being able to predict hurricanes has resulted in fewer deaths. However, Hurricane Donna supports the opposite idea: five post-1979 hurricanes resulted in more deaths than Donna, despite causing significantly less damage. It’s hard to draw conclusions from the scatterplot.</p>
<p>Of course, the hurricanes labeled in the plot above are pretty rare. Most of the 92 hurricanes had normalized damage less than $30 billion and fewer than 100 deaths. The descriptive statistics below show just how much of an impact those big storms can have on an analysis.</p>
<p><img alt="Describe" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/ac70541e09a25b227de847363d10e9c0/describe_deaths_ndam_by_year_group.jpg" style="width: 503px; height: 177px;" /></p>
<p>If we look at the mean, everything becomes clear! On average, hurricanes before 1979 had 11 more deaths despite causing half a billion <em>fewer</em> dollars in damages. But when we look at the median, which isn’t sensitive to extreme data values, the values are almost the same. </p>
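The mean's sensitivity to extreme values is easy to demonstrate for yourself. The sketch below uses a small made-up series of death counts (illustrative only, not the study's hurricane data) to show how one severe storm drags the mean far from the median:

```python
import statistics

# Hypothetical death counts for a handful of storms (made up for
# illustration, not the hurricane data set). Most are small; one is severe.
deaths = [5, 12, 20, 26, 46, 62, 256]

mean = statistics.mean(deaths)      # pulled upward by the 256-death storm
median = statistics.median(deaths)  # middle value; unaffected by how extreme the tail is

print(f"mean = {mean:.1f}, median = {median}")  # mean = 61.0, median = 26
```

A single extreme storm more than doubles the mean relative to the median, which is exactly the pattern in the descriptive statistics above.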
<p>Part of the problem is that so many smaller storms are included. The study already concluded that the name doesn’t matter for smaller storms. So let’s just focus on the big storms. The median normalized damage for all 92 storms is $1.65 billion. I took only the storms that have caused at least that much damage (there were 47 of them) and looked at the descriptive statistics again.</p>
<p><img alt="Describe" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/06fc8707283704922858ce000d05fde2/describe_deaths_ndam_by_year_group_big_storm.jpg" style="width: 500px; height: 175px;" /></p>
<p>Once again, the mean and median paint different pictures. The mean shows that a much higher number of deaths occurred in big storms before 1979, even though those storms caused the same amount of damage. However, this is because hurricanes Camille, Diane, and Agnes are heavily influencing the mean for deaths before 1979, pulling it up much higher than the After-1979 group. And hurricanes Sandy and Andrew influence the mean for normalized damage after 1979, pulling it up to equal the damage before 1979.</p>
<p>With data this skewed, the medians are a more accurate representation of the middle of the data. The median for deaths shows that there were slightly more deaths in big storms prior to 1979. However, those storms also caused more damage, implying <em>that </em>could be the reason for the larger number of deaths.</p>
<p>And even if we ignore the fact that the hurricanes before 1979 caused more damage, a <a href="http://blog.minitab.com/blog/statistics-for-lean-six-sigma/the-non-parametric-economy-what-does-average-actually-mean">Mann-Whitney test</a> (which compares 2 medians, as opposed to a 2-sample t test which compares 2 means) shows that the difference in deaths is not statistically significant.</p>
<p><img alt="Mann-Whitney" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/a8f1ef8922a9238ba0414caef236a05d/mann_whitney_w640.jpeg" style="width: 640px; height: 230px;" /></p>
<p>The p-value is 0.1393, which is greater than 0.05. There isn’t enough evidence to conclude that hurricanes caused more deaths before 1979.</p>
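For readers who want to reproduce this kind of comparison outside Minitab, here is a minimal sketch of the Mann-Whitney test using the large-sample normal approximation (no tie correction). The death-count vectors are made up for illustration; they are not the hurricane data set, and for real analyses you would use Minitab's Stat > Nonparametrics > Mann-Whitney or a library routine such as scipy.stats.mannwhitneyu.

```python
import math

def mann_whitney(x, y):
    """Two-sided Mann-Whitney test via the normal approximation.
    A sketch only: ties get midranks, but no tie correction is applied
    to the variance, so treat the p value as approximate."""
    combined = sorted((v, g) for g, vals in ((0, x), (1, y)) for v in vals)
    n = len(combined)
    ranks = [0.0] * n
    i = 0
    while i < n:                        # assign midranks to tied values
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        for k in range(i, j):
            ranks[k] = (i + 1 + j) / 2  # average of 1-based ranks i+1..j
        i = j
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for (v, g), r in zip(combined, ranks) if g == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2               # U statistic for x
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))             # two-sided p value
    return u, p

# Hypothetical death counts (illustrative only, not the study data)
before = [15, 75, 21, 46, 200, 62, 117]
after = [9, 26, 5, 62, 12, 20, 3]
u, p = mann_whitney(before, after)
print(f"U = {u}, p = {p:.4f}")
```

With these toy numbers the ranks separate cleanly and the test comes out significant; with the real hurricane data, as shown above, it does not.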
Can We Really Conclude that Female-Named Hurricanes Cause More Deaths?
<p>The lack of conclusive evidence from our data analysis certainly makes the idea that hurricanes with female names cause more deaths plausible. But there are other issues to consider. For example, the gender of the hurricane name was not treated as a binary variable, which would group each hurricane as either male or female. Instead, nine independent coders rated the masculinity vs. femininity of historical hurricane names on two items (1 = very masculine, 11 = very feminine, and 1 = very man-like, 11 = very woman-like), which were averaged to compute a masculinity-femininity index (MFI).</p>
<p>Do these 9 coders represent how most Americans would rate the femininity of names? Would you rate Barbara as more feminine than Carol or Betsy? The coders did, giving Barbara a 9.8 while Carol and Betsy were 8.1 and 8.3 respectively. And the MFI is important, since it was found to be the gender variable that had a significant interaction with normalized damage. When gender name was treated as a binary variable, there was no interaction.</p>
<p>But masculinity-femininity index aside, the study did have some very interesting findings. I’m sure additional research will be done in the years to come to see if the findings hold true. Let's hope that by then we’ll know for sure whether or not people underestimate female-named hurricanes.</p>
<p>Until then, if a hurricane is bearing down on your neighborhood, I would make sure to board up the windows and buy out the supermarket's bread and milk, regardless of the storm's name.</p>
Hypothesis TestingStatisticsStatistics in the NewsFri, 06 Jun 2014 13:17:00 +0000http://blog.minitab.com/blog/the-statistics-game/do-the-data-really-say-female-named-hurricanes-are-more-deadlyKevin RudyHypothesis Testing and P Values
http://blog.minitab.com/blog/statistics-in-the-field/hypothesis-testing-and-p-values
<p><em>by Matthew Barsalou, guest blogger</em></p>
<p>Programs such as the <a href="http://www.minitab.com/products/minitab/">Minitab Statistical Software</a> make hypothesis testing easier; but no program can think for the experimenter. Anybody performing a statistical hypothesis test must understand what p values mean with regard to their statistical results, as well as the potential limitations of statistical hypothesis testing.</p>
<p>A p value of 0.05 is frequently used during statistical hypothesis testing. This p value indicates that if there is no effect (or if the null hypothesis is true), you’d obtain the observed difference or a larger one in 5% of studies due to random sampling error. However, <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values">performing multiple hypothesis tests, each at the 0.05 significance level, increases the chance of a false positive</a>.</p>
<p>This is well illustrated by the online comic <a href="http://xkcd.com/882/">XKCD</a>, which depicted somebody stating that jelly beans cause acne.</p>
<p><img alt="Significant" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/08b29e9eec884bee99602335f1f9c893/xkcd.png" style="border-width: 0px; border-style: solid; width: 310px; height: 859px;" /></p>
<p>Scientists investigated and found no link, so the person claimed that only a certain color of jelly bean causes acne. The scientists then tested 20 different colors of jelly beans, each at the 0.05 significance level. Only the green jelly beans produced a p value less than 0.05.</p>
<p>The comic ends with a newspaper reporting a link between green jelly beans and acne. The newspaper points out there is 95% confidence with only a 5% chance of coincidence. What is wrong with the conclusion?</p>
<p>We can determine the chance of drawing no false conclusions across all 20 tests by using the binomial formula.</p>
<p><img alt="binomial formula" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/b962df0ea487d69594aea4975ae69225/equation1.gif" style="width: 500px; height: 87px;" /></p>
<p>This means that we have a 35.8% chance of performing 20 hypothesis tests without getting a false positive (or, as statisticians refer to it, the <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/multiple-comparisons-beware-of-individual-errors-that-multiply">family error rate</a>) when using an alpha level of 0.05. We can also calculate the probability that we have at least one incorrect result due to random chance.</p>
<p><img src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/6a80807434e2c2678163dbcc710d13a0/equation2.gif" style="width: 345px; height: 73px;" /></p>
<p>The chance that at least one result will be a false positive when performing 20 hypothesis tests using an alpha level of 0.05 is 64.2%.</p>
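Both figures are quick to verify; this sketch just evaluates the binomial expressions for 20 independent tests at an alpha of 0.05:

```python
alpha = 0.05   # per-test significance level
tests = 20     # jelly bean colors tested

# P(no false positives in 20 independent tests) = (1 - 0.05) ** 20
p_all_clean = (1 - alpha) ** tests
# P(at least one false positive) -- the family error rate
p_any_false = 1 - p_all_clean

print(f"no false positives: {p_all_clean:.3f}")  # 0.358
print(f"at least one:       {p_any_false:.3f}")  # 0.642
```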
<p>So the press release in the XKCD comic may have been a bit premature.</p>
<p>Suppose I had a sample of 14 observations with a mean of 87.2, and I wanted to know whether the true mean is actually 85.2. I performed a one-sample t-test in Minitab by going to <strong>Stat > Basic Statistics > 1-Sample t...</strong> and entering the summarized data. I checked the “Perform hypothesis test” box, then selected “Options...” and kept the default confidence level of 95.0, which corresponds to an alpha of 0.05.</p>
<p style="margin-left: 40px;"><img alt="One-Sample T test output" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/55e90b93ae38e8612ce3adb4ea0c4f00/output1.png" style="border-width: 0px; border-style: solid; width: 425px; height: 130px;" /></p>
<p>I performed the test and the resulting p value was 0.049, which is close to but still below 0.05, so I can reject my null hypothesis. But if this test had been just one of many repeated tests, as in the XKCD example, a result this close to 0.05 could easily have been a false positive, because the 5% false-positive probability compounds across additional tests.</p>
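The t statistic behind that result is simple arithmetic. The post gives the sample size and the two means but not the sample standard deviation, so the sketch below assumes a hypothetical s = 3.46, chosen so the numbers roughly match the reported p value of 0.049; the p value itself would come from the t distribution with n - 1 = 13 degrees of freedom (e.g., from Minitab or scipy.stats.t.sf).

```python
import math

n = 14       # sample size (from the post)
xbar = 87.2  # sample mean (from the post)
mu0 = 85.2   # hypothesized mean (from the post)
s = 3.46     # sample standard deviation -- NOT given in the post; hypothetical

# One-sample t statistic: t = (xbar - mu0) / (s / sqrt(n))
t = (xbar - mu0) / (s / math.sqrt(n))
print(f"t = {t:.2f}")  # compare with the critical value t(0.025, 13) = 2.160
```

The statistic lands just past the 0.05 critical value for 13 degrees of freedom, matching the borderline p value reported above.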
<p>There are alternatives to statistical hypothesis testing; for example, Bayesian inference could be used in place of hypothesis testing with p values. But alternative methods have their own weaknesses, and they may be difficult for non-statisticians to use.</p>
<p>Instead of avoiding the use of hypothesis testing, we should account for its limitations. For example, by realizing that each repeat of the test increases the chance of a false positive, as illustrated by XKCD's jelly bean example.</p>
<p>We can’t simply retest over and over at the same significance level and then conclude that we have results with statistical significance. For situations such as the one in the XKCD example, Simmons, Nelson, and Simonsohn recommend disclosing the total number of tests that were <a href="http://people.psych.cornell.edu/~jec7/pcd%20pubs/simmonsetal11.pdf">performed</a>. Had we known that 20 tests had been performed, each at the 0.05 level, we would realize that we may not need to avoid green jelly beans after all.</p>
<div><strong>About the Guest Blogger: </strong></div>
<div><em>Matthew Barsalou is an engineering quality expert in BorgWarner Turbo Systems Engineering GmbH’s Global Engineering Excellence department. He has previously worked as a quality manager at an automotive component supplier and as a contract quality engineer at Ford in Germany and Belgium. He possesses a bachelor of science in industrial sciences, a master of liberal studies and a master of science in business administration and engineering from the Wilhelm Büchner Hochschule in Darmstadt, Germany.</em></div>
<p>xkcd.com comic from <a href="http://xkcd.com/882/">http://xkcd.com/882/</a> used under Creative Commons Attribution-NonCommercial 2.5 License. <a href="http://xkcd.com/license.html">http://xkcd.com/license.html</a></p>
Fun StatisticsHypothesis TestingMon, 02 Jun 2014 12:00:00 +0000http://blog.minitab.com/blog/statistics-in-the-field/hypothesis-testing-and-p-valuesGuest BloggerFive Guidelines for Using P values
http://blog.minitab.com/blog/adventures-in-statistics/five-guidelines-for-using-p-values
<p>There is high pressure to find low P values. Obtaining a low P value for a hypothesis test is make or break because it can lead to funding, articles, and prestige. Statistical significance is everything!</p>
<p>My two previous posts looked at several issues related to P values:</p>
<ul>
<li><a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" target="_blank">P values have a higher than expected false positive rate.</a></li>
<li><a href="http://blog.minitab.com/blog/adventures-in-statistics/not-all-p-values-are-created-equal" target="_blank">The same P value from different studies can correspond to different false positive rates.</a></li>
</ul>
<p>In this post, I’ll look at whether P values are still helpful and provide guidelines on how to use them with these issues in mind.</p>
<div style="float: right; width: 200px; margin: 25px 25px;">
<p><img alt="Ronald Fisher" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/f7eb953015180df73edfa6f073f234c6/r__a__fisher.jpg" style="float: right; width: 200px; height: 243px; border-width: 1px; border-style: solid;" /> <em>Sir Ronald A Fisher</em></p>
</div>
Are P Values Still Valuable?
<p>Given the issues about P values, are they still helpful? A higher than expected rate of false positives can be a problem because if you implement the “findings” from a false positive study, you won’t get the expected benefits.</p>
<p>In my view, P values are a great tool. Ronald Fisher introduced P values in the 1920s because he wanted an objective method for comparing data to the null hypothesis, rather than the informal eyeball approach: "My data <em>look </em>different than the null hypothesis."</p>
<p>P value calculations incorporate the effect size, sample size, and variability of the data into a single number that objectively tells you how consistent your data are with the null hypothesis. Pretty nifty!</p>
<p>Unfortunately, the high pressure to find low P values, combined with a common misunderstanding of <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" target="_blank">how to correctly interpret P values</a>, has distorted the interpretation of significant results. However, these issues can be resolved.</p>
<p>So, let’s get to the guidelines! Their overall theme is that you should evaluate P values as part of a larger context where other factors matter.</p>
Guideline 1: The Exact P Value Matters
<div style="float: right; width: 90px; margin: 25px 25px;">
<p style="line-height: 11px; text-align: center;"><img alt="Small wooden P" height="75px" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/c408562ea4a40eedae9ae78c1d3ca027/p_wooden.jpg" width="75px" /><br />
<em>Tiny Ps are<br />
great!</em></p>
</div>
<p>With the high pressure to find low P values, there’s a tendency to view studies as either significant or not. Did a study produce a P value less than 0.05? If so, it’s golden! However, there is no magic significance level that distinguishes between the studies that have a true effect and those that don’t with 100% accuracy. Instead, it’s all about lowering the error rate to an acceptable level.</p>
<p>The lower the P value, the lower the error rate. For example, a P value near 0.05 has an error rate of 25-50%. In contrast, a P value of 0.0027 corresponds to an error rate of at least 4.5%, which is close to the rate that many mistakenly attribute to a P value of 0.05.</p>
<p>A lower P value thus suggests stronger evidence for rejecting the null hypothesis. A P value near 0.05 simply indicates that the result is worth another look, but it’s nothing you can hang your hat on by itself. It’s not until you get down near 0.001 that you have a fairly low chance of a false positive.</p>
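One common way to turn a P value into a lower bound on the false positive rate is the Sellke-Bayarri-Berger bound, -e * p * ln(p), combined with equal prior odds for the null. It is not necessarily the exact calculation behind the figures cited above (the post's numbers come from related analyses), but it produces figures in the same ballpark:

```python
import math

def min_false_positive_rate(p):
    """Lower bound on the false positive rate for an observed P value,
    using the -e * p * ln(p) bound on the Bayes factor and assuming
    1:1 prior odds for the null. Valid for p < 1/e."""
    bayes_factor_bound = -math.e * p * math.log(p)
    return bayes_factor_bound / (1 + bayes_factor_bound)

for p in (0.05, 0.01, 0.0027, 0.001):
    rate = min_false_positive_rate(p)
    print(f"p = {p:<7} min false positive rate = {rate:.1%}")
```

Under this bound a P value of 0.05 carries a false positive rate of at least roughly 29%, while 0.0027 drops it to around 4%, which is why the exact P value, and not just "significant or not," matters.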
Guideline 2: Replication Matters
<p>Today, P values are everything. However, Fisher intended P values to be just one part of a process that incorporates experimentation, statistical analysis and replication to lead to scientific conclusions.</p>
<p>According to Fisher, “A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.”</p>
<p>The false positive rates associated with P values that we saw in my last post definitely support this view. A single study, especially if the P value is near 0.05, is unlikely to reduce the false positive rate down to an acceptable level. Repeated experimentation may be required to arrive at a point where the error rate is low enough to meet your objectives.</p>
<p>For example, if you have two independent studies that each produced a P value of 0.05, you can multiply the P values to obtain a probability of 0.0025 for both studies. However, you must include both the significant and nonsignificant studies in a series of similar studies, and not cherry-pick only the significant studies.</p>
<p><img alt="Replicate study results" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/d1f27fc3889672c11ac23b1ffa9bfac9/p_rep.gif" style="width: 403px; height: 136px;" /></p>
<p>Conclusively proving a hypothesis with a single study is unlikely. So, don’t expect it!</p>
Guideline 3: The Effect Size Matters
<p>With all the focus on P values, attention to the effect size can be lost. Just because an effect is statistically significant doesn't necessarily make it meaningful in the real world. Nor does a P value indicate the precision of the estimated effect size.</p>
<p>If you want to move from just detecting an effect to assessing its magnitude and precision, use <a href="http://blog.minitab.com/blog/adventures-in-statistics/when-should-i-use-confidence-intervals-prediction-intervals-and-tolerance-intervals" target="_blank">confidence intervals</a>. In this context, a confidence interval is a range of values that is likely to contain the effect size.</p>
<p>For example, an AIDS vaccine <a href="http://news.sciencemag.org/health/2009/09/massive-aids-vaccine-study-modest-success" target="_blank">study</a> in Thailand obtained a P value of 0.039. Great! This was the first time that an AIDS vaccine had positive results. However, the confidence interval for effectiveness ranged from 1% to 52%. That’s not so impressive: the vaccine may work virtually none of the time, or up to half the time. The effectiveness is both low and imprecisely estimated.</p>
<p>Avoid thinking about studies only in terms of whether they are significant or not. Ask yourself: Is the effect size precisely estimated and large enough to be important?</p>
Guideline 4: The Alternative Hypothesis Matters
<p>We tend to think of equivalent P values from different studies as providing the same support for the alternative hypothesis. However, <a href="http://blog.minitab.com/blog/adventures-in-statistics/not-all-p-values-are-created-equal" target="_blank">not all P values are created equal</a>.</p>
<p>Research shows that the plausibility of the alternative hypothesis greatly affects the false positive rate. For example, a highly plausible alternative hypothesis and a P value of 0.05 are associated with an error rate of at least 12%, while an implausible alternative is associated with a rate of at least 76%!</p>
<p>For example, given the track record for AIDS vaccines where the alternative hypothesis has never been true in previous studies, it's highly unlikely to be true at the outset of the Thai study. This situation tends to produce high false positive rates—often around 75%!</p>
<p>When you hear about a surprising new study that finds an unprecedented result, don’t fall for that first significant P value. Wait until the study has been well replicated before buying into the results!</p>
Guideline 5: Subject Area Knowledge Matters
<p>Applying subject area expertise to all aspects of hypothesis testing is crucial. Researchers need to apply their scientific judgment about the plausibility of the hypotheses, results of similar studies, proposed mechanisms, proper experimental design, and so on. Expert knowledge transforms statistics from numbers into meaningful, trustworthy findings.</p>
Hypothesis TestingStatisticsStatistics HelpThu, 15 May 2014 11:00:00 +0000http://blog.minitab.com/blog/adventures-in-statistics/five-guidelines-for-using-p-valuesJim Frost