Hypothesis Testing | Minitab
Hypothesis testing and Lean Six Sigma quality improvement projects.
http://blog.minitab.com/blog/hypothesis-testing-2/rss
Tue, 16 Sep 2014 13:24:50 +0000
A Fun ANOVA: Does Milk Affect the Fluffiness of Pancakes?
http://blog.minitab.com/blog/statistics-in-the-field/a-fun-anova3a-does-milk-affect-the-fluffiness-of-pancakes
<p><em>by Iván Alfonso, guest blogger</em></p>
<p><img alt="hotcakes" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/7bd460fa71f6d12672a2ac5d9f754762/pancakes.jpg" style="border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 300px; height: 223px;" />I'm a huge fan of hot cakes; they are my favorite dessert ever. I’ve been cooking them for over 15 years, and over that time I’ve noticed many variations in texture, flavor, and thickness. Personally, I like fluffy pancakes.</p>
<p>There are many brands of hotcake mix on the market, all with very similar formulations. So I decided to investigate which ingredients and inputs may influence the fluffiness of my pancakes.</p>
<p>Potential factors could include the type of mix used, the type of milk used, the use of margarine or butter (of many brands), the amount of mixing time, the origin of the eggs, and the skill of the person who prepares the pancakes.</p>
<p>Instead of looking at <em>all </em>of these factors, I focused on the type of milk used in the pancakes. I had four types of milk available: whole milk, light, low fat, and low protein.</p>
<p>My goal was to determine if these different milk formulations influence fluffiness (thickness). Is whole milk the best for fluffy hotcakes? Does skim milk work the same way as whole milk? Can I be sure that the use of light milk will result in hot cakes that are less smooth?</p>
Gathering Data
<p>I sorted the four formulations as shown in the diagram below:</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/643f9f4f94be78a5b1c012e49c400772/milk_factor.jpg" style="width: 144px; height: 200px;" /></p>
<p>I used the same amounts of milk, flour (one brand), salt and margarine for each batch of hotcakes I cooked.</p>
<p>The response variable was the thickness of the cooked pancakes. I prepared 6 pancakes for each type of milk, which gave me a total of 24 pancakes. I randomized the cooking order to minimize bias. I also prepared each batch by myself; if my sister or mother had helped with some lots, that would have been a potential source of variation.</p>
<p>To measure the fluffiness, I inserted a stick into the center of each hotcake until it reached the bottom, marked the stick with a pencil, then measured the distance to the mark in millimeters with a ruler.</p>
<p>After a couple of hours of cooking hotcakes, making measurements, and recording the data on a worksheet, I started to analyze my data with Minitab.</p>
Analysis of Variance (ANOVA)
<p>My goal was to assess the variation in thickness or fluffiness between different batches of hot cakes, so the most appropriate statistical technique was <a href="http://blog.minitab.com/blog/statistics-in-the-field/understanding-anova-by-looking-at-your-household-budget">analysis of variance, or ANOVA</a>. With this analysis I could visualize and compare the formulations based on my response variable, the thickness in millimeters, and see if there were statistically significant differences between them. I used a <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/alpha-male-vs-alpha-female">0.05 significance value</a>.</p>
<p>As soon as I had my data in a Minitab worksheet, I started to check it for the assumptions of ANOVA. First, I needed to see if the data followed a normal distribution, so I went straight to <strong>Stat > Basic Statistics > Normality Test</strong>. Minitab produced the following graph:</p>
<p><img alt="Graph of probability of thickness" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/58599d2e2d8572e700893e2e8000dce9/probability_of_thickness.jpg" style="width: 500px; height: 304px;" /></p>
<p>My data passed both the Kolmogorov-Smirnov and Anderson-Darling normality tests. This was a relief—since my data had a normal distribution, I didn’t need to worry about ANOVA’s assumptions of normality.</p>
<p>Traditional ANOVA also has an assumption of equal variances; however, I knew that even if my data didn’t meet this assumption, I could proceed using the method called <a href="http://blog.minitab.com/blog/adventures-in-statistics/did-welchs-anova-make-fishers-classic-one-way-anova-obsolete">Welch’s ANOVA</a>, which accommodates unequal variances. But when I ran Bartlett’s test for equal variances, and even the more stringent Levene test, my data passed. </p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/5600f02a4a7a9faa8b82c3bbe1458784/test_for_equality_of_variances.jpg" style="width: 500px; height: 307px;" /></p>
<p>With confirmation that my data met the assumptions, I proceeded to perform the ANOVA and create box-and-whisker graphs.</p>
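<p>For readers who want to see what a one-way ANOVA computes, here is a minimal Python sketch. It is not the Minitab procedure used in this post, and the thickness values in it are made up purely for illustration (the real measurements are not listed here). The function name <code>one_way_anova_f</code> is hypothetical. The sketch computes only the F statistic; the critical value of about 3.10 for 3 and 20 degrees of freedom at the 0.05 level is taken from standard F tables.</p>

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)                      # number of groups (milk types)
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = mean(v for g in groups for v in g)

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# HYPOTHETICAL thickness data in millimeters, 6 pancakes per milk type.
# These numbers are invented for illustration only.
thickness_mm = {
    "whole": [14, 15, 13, 16, 15, 14],
    "light": [12, 13, 12, 14, 13, 12],
    "low fat": [13, 12, 13, 12, 14, 13],
    "low protein": [11, 12, 10, 12, 11, 11],
}

f_stat, df_b, df_w = one_way_anova_f(list(thickness_mm.values()))
F_CRITICAL = 3.10  # approximate table value for F(0.05; 3, 20)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}; significant at 0.05: {f_stat > F_CRITICAL}")
```

<p>An F statistic larger than the critical value means the differences between the group means are large relative to the variation within the groups, which is the comparison that the ANOVA table summarizes.</p>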
ANOVA Results
<p>Here's the Minitab output for the ANOVA:</p>
<p style="margin-left: 40px;"><img alt="one-way anova output" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/5817e0a9b2d961942f7101bc8eb2eced/one_way_anova.gif" style="width: 400px; height: 133px;" /></p>
<p>The ANOVA revealed that there were indeed statistically significant differences (p = 0.009) among my four batches of hotcakes.</p>
<p>Minitab’s output also included grouping information using Tukey’s method of multiple comparisons for 95% confidence intervals:</p>
<p style="margin-left: 40px;"><img alt="Tukey Method" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c9194c1dda604ad87e4e7985ec8261c1/tukey_method.gif" style="width: 400px; height: 151px;" /></p>
<p>The Tukey analysis shows that the batches made with low-fat milk and light milk do not show a significant difference in fluffiness. However, the batches made with whole milk and low-protein milk did differ significantly from each other.</p>
<p>The box-and-whisker diagram makes the results of the analysis easier to visualize:</p>
<p><img alt="Boxplot of thickness" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/8ca740917c33fddd8953433d67488ac8/boxplot_of_thickness.gif" style="width: 500px; height: 338px;" /></p>
<p>It is clear from the graph that hotcakes produced with whole milk were the fluffiest, and those made with low-protein milk were the least fluffy. There was not a big difference between the fluffiness of hotcakes made with light milk and low-fat milk.</p>
Which Milk Should You Use for Fluffy Pancakes?
<p>Based on this analysis, I recommend using whole milk for fluffier hotcakes. If you want to avoid fats and sugars in milk, low-fat milk is a good choice.</p>
<p>I always use low-fat milk, but the analysis indicates that light milk offers a good alternative for people following a strict no-fat diet.</p>
<p>It’s important to note that for this analysis, I only compared formulations that used the same brand of pancake mix and the same amounts of salt and margarine. But there are other factors to consider! My next pancake experiment will use design of experiments (DOE) to compare milk types, different brands of flour, and margarine with and without salt, to see how all of these factors together affect the fluffiness of pancakes.</p>
<p> </p>
<p><strong>About the Guest Blogger:</strong></p>
<p><em>Iván Alfonso is a biochemical engineer and statistics professor at the Autonomous University of Campeche, Mexico. Alfonso holds a master's degree in marine chemistry and has worked extensively in data analysis and design of experiments in basic and advanced sciences like chemistry and epidemiology.</em></p>
<p> </p>
<p><strong>Would you like to publish a guest post on the Minitab Blog? Contact <a href="mailto:publicrelations@minitab.com?subject=Guest%20Blogger">publicrelations@minitab.com</a>.</strong></p>
<p> </p>
Data Analysis, Fun Statistics, Hypothesis Testing, Statistics
Tue, 05 Aug 2014 12:00:00 +0000
Guest Blogger
Do the Data Really Say Female-Named Hurricanes Are More Deadly?
http://blog.minitab.com/blog/the-statistics-game/do-the-data-really-say-female-named-hurricanes-are-more-deadly
<p><img alt="Hurricane" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/61165559035556ba8f784164d74a7f96/hurricane_w640.jpeg" style="float: right; width: 250px; height: 188px; border-width: 1px; border-style: solid; margin: 10px 15px;" />A recent study has indicated that <a href="http://www.washingtonpost.com/blogs/capital-weather-gang/wp/2014/06/02/female-named-hurricanes-kill-more-than-male-because-people-dont-respect-them-study-finds/" target="_blank">female-named hurricanes kill more people than male hurricanes</a>. Of course, the title of that article (and other articles like it) is a bit misleading. The study found a significant <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/anova-models/what-is-an-interaction/">interaction</a> between the damage caused by the storm and the perceived masculinity or femininity of the hurricane names. So don’t be confused by stories that suggest all female-named hurricanes are deadlier than male-named hurricanes. The study actually found no effect of masculinity/femininity for less severe storms. It was the more severe storms where the gender of the name had a significant relationship with the number of deaths.</p>
<p>The study looked at every hurricane since 1950, with the exception of Katrina and Audrey (those two are outliers that would skew the results). Many critics of the study believe that it is biased, since almost all of the 38 hurricanes before 1979 had female names (there were two male names in the early 50s). It’s possible that our ability to forecast hurricanes has vastly improved since the 50s and 60s. So, these critics say, the difference is simply because more people died in hurricanes back when they all had a female name.</p>
<p>Let’s perform a data analysis to see if that is true. We will use pre- and post-1979 to distinguish between the predominantly female-name hurricane era and the era of mixed hurricane names. I’ll use the exact same data set that was used in the study, which you can get <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/ad7c966669da36643b8060c74038e6d6/hurricane.MTW">here</a>.</p>
Hurricanes Before and After 1979
<p>For the 92 hurricanes in the study, the number of deaths and the normalized damage were recorded. The study showed that these two variables are highly correlated, so it’s important to consider both factors. If we find there were more deaths in hurricanes before 1979, we need to make sure the reason isn’t simply that those hurricanes caused more damage (implying they were bigger storms).</p>
<p>We can start by using a scatterplot to plot the two variables against each other, using whether the hurricane came before or after 1979 as a grouping variable. Hurricanes that occurred <em>during </em>1979 were put in the After group.</p>
<p><img alt="Scatterplot" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/72ef8a172f250267d3b03cccd6ff8399/scatterplot_of_deaths_vs_normalized_damage_w640.jpeg" style="width: 640px; height: 427px;" /></p>
<p>We see that the two deadliest hurricanes (Camille and Diane) both occurred before 1979. If you look below them, you’ll see that many hurricanes in both eras have caused the same amount of damage, yet resulted in far fewer deaths.</p>
<p>Meanwhile, the two most damaging hurricanes (Sandy and Andrew) both occurred <em>after </em>1979. These hurricanes caused more than three times the damage of Camille and Diane, yet resulted in fewer deaths. This gives some credibility to the idea that our improvement in being able to predict hurricanes has resulted in fewer deaths. However, Hurricane Donna supports the opposite idea: five post-1979 hurricanes resulted in more deaths than Donna, despite causing significantly less damage. It’s hard to draw conclusions from the scatterplot.</p>
<p>Of course, the hurricanes labeled in the plot above are pretty rare. Most of the 92 hurricanes had normalized damage less than $30 billion and fewer than 100 deaths. The descriptive statistics below show just how much of an impact those big storms can have on an analysis.</p>
<p><img alt="Describe" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/ac70541e09a25b227de847363d10e9c0/describe_deaths_ndam_by_year_group.jpg" style="width: 503px; height: 177px;" /></p>
<p>If we look at the mean, everything becomes clear! On average, hurricanes before 1979 had 11 more deaths despite causing half a billion <em>fewer</em> dollars in damages. But when we look at the median, which isn’t sensitive to extreme data values, the values are almost the same. </p>
<p>Part of the problem is that so many smaller storms are included. The study already concluded that the name doesn’t matter for smaller storms. So let’s just focus on the big storms. The median normalized damage for all 92 storms is $1.65 billion. I took only the storms that have caused at least that much damage (there were 47 of them) and looked at the descriptive statistics again.</p>
<p><img alt="Describe" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/06fc8707283704922858ce000d05fde2/describe_deaths_ndam_by_year_group_big_storm.jpg" style="width: 500px; height: 175px;" /></p>
<p>Once again, the mean and median paint different pictures. The mean shows that a much higher number of deaths occurred in big storms before 1979, even though those storms caused the same amount of damage. However, this is because hurricanes Camille, Diane, and Agnes are heavily influencing the mean for deaths before 1979, pulling it up much higher than the After-1979 group. And hurricanes Sandy and Andrew influence the mean for normalized damage after 1979, pulling it up to equal the damage before 1979.</p>
<p>With data this skewed, the medians are a more accurate representation of the middle of the data. The median for deaths shows that there were slightly more deaths in big storms prior to 1979. However, those storms also caused more damage, implying <em>that </em>could be the reason for the larger number of deaths.</p>
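<p>The effect of a few extreme storms on the mean, and their lack of effect on the median, is easy to see in a small Python sketch. The death counts below are invented for illustration and are not taken from the hurricane data set.</p>

```python
from statistics import mean, median

# HYPOTHETICAL death counts: five small storms and one extreme storm.
deaths = [1, 2, 3, 5, 8, 200]

print("mean with extreme storm:     ", mean(deaths))        # pulled up by 200
print("median with extreme storm:   ", median(deaths))      # barely affected

# Drop the extreme storm and compare again.
without_extreme = deaths[:-1]
print("mean without extreme storm:  ", mean(without_extreme))
print("median without extreme storm:", median(without_extreme))
```

<p>In this made-up example, a single extreme value moves the mean by a large amount while the median changes very little, which is why the median is the safer summary for data this skewed.</p>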
<p>And even if we ignore the fact that the hurricanes before 1979 caused more damage, a <a href="http://blog.minitab.com/blog/statistics-for-lean-six-sigma/the-non-parametric-economy-what-does-average-actually-mean">Mann-Whitney test</a> (which compares 2 medians, as opposed to a 2-sample t test which compares 2 means) shows that the difference in deaths is not statistically significant.</p>
<p><img alt="Mann-Whitney" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/a8f1ef8922a9238ba0414caef236a05d/mann_whitney_w640.jpeg" style="width: 640px; height: 230px;" /></p>
<p>The p-value is 0.1393, which is greater than 0.05. There isn’t enough evidence to conclude that hurricanes caused more deaths before 1979.</p>
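<p>To give a feel for what a Mann-Whitney test does, here is a rough Python sketch using only the standard library. It ranks the pooled observations, computes the U statistic for the first sample, and uses a normal approximation (with a tie correction but no continuity correction) for a two-sided p-value. This is an assumption-laden illustration, not Minitab's implementation, so its p-values can differ slightly from Minitab's, and the approximation is poor for very small samples. The function name <code>mann_whitney_u</code> and the data are hypothetical.</p>

```python
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided p-value via normal approximation)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = len(pooled)
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1       # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2

    mu = n1 * n2 / 2
    sigma = sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    z = (u - mu) / sigma                     # fails if every value is tied
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, p


# HYPOTHETICAL death counts for two groups of storms (invented values).
before = [3, 5, 8, 21, 50, 256]
after = [1, 2, 5, 6, 15, 62]
u, p = mann_whitney_u(before, after)
print(f"U = {u}, two-sided p = {p:.3f}")
```

<p>Because the test works on ranks rather than raw values, one storm with hundreds of deaths counts only as "the largest value", which is what makes the test suitable for skewed data like this.</p>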
Can We Really Conclude that Female-Named Hurricanes Cause More Deaths?
<p>The lack of conclusive evidence from our data analysis certainly makes the idea that hurricanes with female names cause more deaths plausible. But there are other issues to consider. For example, the gender of the hurricane name was not treated as a binary variable, which would group each hurricane as either male or female. Instead, nine independent coders rated the masculinity vs. femininity of historical hurricane names on two items (1 = very masculine, 11 = very feminine, and 1 = very man-like, 11 = very woman-like), which were averaged to compute a masculinity-femininity index (MFI).</p>
<p>Do these 9 coders represent how most Americans would rate the femininity of names? Would you rate Barbara as more feminine than Carol or Betsy? The coders did, giving Barbara a 9.8 while Carol and Betsy were 8.1 and 8.3 respectively. And the MFI is important, since it was found to be the gender variable that had a significant interaction with normalized damage. When gender name was treated as a binary variable, there was no interaction.</p>
<p>But masculinity-femininity index aside, the study did have some very interesting findings. I’m sure additional research will be done in the years to come to see if the findings hold true. Let's hope that then we’ll be able to know for sure whether people underestimate female-named hurricanes or not.</p>
<p>Until then, if a hurricane is bearing down on your neighborhood, I would make sure to board up the windows and buy out the supermarket's bread and milk, regardless of the storm's name.</p>
Hypothesis Testing, Statistics, Statistics in the News
Fri, 06 Jun 2014 13:17:00 +0000
Kevin Rudy
Hypothesis Testing and P Values
http://blog.minitab.com/blog/statistics-in-the-field/hypothesis-testing-and-p-values
<p><em>by Matthew Barsalou, guest blogger</em></p>
<p>Programs such as the <a href="http://www.minitab.com/products/minitab/">Minitab Statistical Software</a> make hypothesis testing easier, but no program can think for the experimenter. Anybody performing a statistical hypothesis test must understand what p values mean with regard to their statistical results, as well as the potential limitations of statistical hypothesis testing.</p>
<p>A significance level (alpha) of 0.05 is frequently used during statistical hypothesis testing. This alpha indicates that if there is no effect (that is, if the null hypothesis is true), you’d obtain a statistically significant result in 5% of studies due to random sampling error alone. However, <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values">performing multiple hypothesis tests at a 0.05 significance level increases the chance of a false positive</a>.</p>
<p>This is well illustrated by the online comic <a href="http://xkcd.com/882/">XKCD</a>, which depicted somebody stating that jelly beans cause acne.</p>
<p><img alt="Significant" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/08b29e9eec884bee99602335f1f9c893/xkcd.png" style="border-width: 0px; border-style: solid; width: 310px; height: 859px;" /></p>
<p>Scientists investigated and found no link, so the person then claimed that only a certain color of jelly bean causes acne. The scientists then tested 20 different colors of jelly beans at a significance level of 0.05. Only the green jelly bean had a p value less than 0.05.</p>
<p>The comic ends with a newspaper reporting a link between green jelly beans and acne. The newspaper points out there is 95% confidence with only a 5% chance of coincidence. What is wrong with the conclusion?</p>
<p>We can determine the chance that there will be no false conclusions by using the binomial formula.</p>
<p><img alt="binomial formula" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/b962df0ea487d69594aea4975ae69225/equation1.gif" style="width: 500px; height: 87px;" /></p>
<p>This means that we have a 35.8% chance of performing 20 hypothesis tests without getting a false positive when using an alpha level of 0.05. We can also calculate the probability that we have at least one incorrect result due to random chance (what statisticians refer to as the <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/multiple-comparisons-beware-of-individual-errors-that-multiply">family error rate</a>).</p>
<p><img src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/6a80807434e2c2678163dbcc710d13a0/equation2.gif" style="width: 345px; height: 73px;" /></p>
<p>The chance that at least one result will be a false positive when performing 20 hypothesis tests using an alpha level of 0.05 is 64.2%.</p>
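<p>The two figures above can be checked with a few lines of Python. This sketch assumes the 20 tests are independent and that the null hypothesis is true for every one of them; the function name <code>prob_false_positives</code> is hypothetical.</p>

```python
from math import comb


def prob_false_positives(n_tests, alpha, k):
    """Binomial probability of exactly k false positives in n_tests
    independent tests when every null hypothesis is true."""
    return comb(n_tests, k) * alpha ** k * (1 - alpha) ** (n_tests - k)


p_none = prob_false_positives(20, 0.05, 0)   # no false positives at all
p_at_least_one = 1 - p_none                  # one or more false positives

print(f"P(no false positive in 20 tests)    = {p_none:.3f}")          # 0.358
print(f"P(at least one false positive)      = {p_at_least_one:.3f}")  # 0.642
```

<p>Changing <code>n_tests</code> shows how quickly the chance of at least one false positive grows as more tests are run at the same alpha level.</p>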
<p>So the press release in the XKCD comic may have been a bit premature.</p>
<p>Suppose I had a sample of 14 observations with a mean of 87.2 and I wanted to know if the mean is actually 85.2. I performed a one-sample t-test using Minitab by going to <strong>Stat > Basic Statistics > 1 Sample t…</strong> and entering the summarized data. I checked the “Perform hypothesis test” box, then selected “Options…” and used the default confidence level of 95.0. This corresponds to an alpha of 0.05.</p>
<p style="margin-left: 40px;"><img alt="One-Sample T test output" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/55e90b93ae38e8612ce3adb4ea0c4f00/output1.png" style="border-width: 0px; border-style: solid; width: 425px; height: 130px;" /></p>
<p>I performed the test and the resulting p value was 0.049, which is close to but still below 0.05, so I can reject my null hypothesis. If I performed the test repeatedly, as in the XKCD example, I might have failed to reject the null hypothesis, because the 5% probability of a false positive accumulates across additional tests.</p>
<p>There are alternatives to statistical hypothesis testing; for example, Bayesian inference could be used in place of hypothesis testing with p values. But alternative methods have their own weaknesses, and they may be difficult for non-statisticians to use.</p>
<p>Instead of avoiding the use of hypothesis testing, we should account for its limitations, for example by recognizing that each repeat of the test increases the chance of a false positive, as illustrated by XKCD's jelly bean example.</p>
<p>We can’t simply retest over and over using the same significance level and then conclude that we have results with statistical significance. For situations such as in the XKCD example, Simmons, Nelson and Simonsohn recommend disclosing the total number of tests that were <a href="http://people.psych.cornell.edu/~jec7/pcd%20pubs/simmonsetal11.pdf">performed</a>. Had we known that 20 tests had been performed at a significance level of 0.05, we could realize that we may not need to avoid green jellybeans after all.</p>
<p> </p>
<div><strong>About the Guest Blogger: </strong></div>
<div><em>Matthew Barsalou is an engineering quality expert in BorgWarner Turbo Systems Engineering GmbH’s Global Engineering Excellence department. He has previously worked as a quality manager at an automotive component supplier and as a contract quality engineer at Ford in Germany and Belgium. He possesses a bachelor of science in industrial sciences, a master of liberal studies and a master of science in business administration and engineering from the Wilhelm Büchner Hochschule in Darmstadt, Germany.</em></div>
<div> </div>
<p>xkcd.com comic from <a href="http://xkcd.com/882/">http://xkcd.com/882/</a> used under Creative Commons Attribution-NonCommercial 2.5 License. <a href="http://xkcd.com/license.html">http://xkcd.com/license.html</a></p>
<p> </p>
Fun Statistics, Hypothesis Testing
Mon, 02 Jun 2014 12:00:00 +0000
Guest Blogger
Five Guidelines for Using P values
http://blog.minitab.com/blog/adventures-in-statistics/five-guidelines-for-using-p-values
<p>There is high pressure to find low P values. Obtaining a low P value for a hypothesis test is make or break because it can lead to funding, articles, and prestige. Statistical significance is everything!</p>
<p>My two previous posts looked at several issues related to P values:</p>
<ul>
<li><a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" target="_blank">P values have a higher than expected false positive rate.</a></li>
<li><a href="http://blog.minitab.com/blog/adventures-in-statistics/not-all-p-values-are-created-equal" target="_blank">The same P value from different studies can correspond to different false positive rates.</a></li>
</ul>
<p>In this post, I’ll look at whether P values are still helpful and provide guidelines on how to use them with these issues in mind.</p>
<div style="float: right; width: 200px; margin: 25px 25px;">
<p><img alt="Ronald Fisher" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/f7eb953015180df73edfa6f073f234c6/r__a__fisher.jpg" style="float: right; width: 200px; height: 243px; border-width: 1px; border-style: solid;" /> <em>Sir Ronald A Fisher</em></p>
</div>
Are P Values Still Valuable?
<p>Given the issues about P values, are they still helpful? A higher than expected rate of false positives can be a problem because if you implement the “findings” from a false positive study, you won’t get the expected benefits.</p>
<p>In my view, P values are a great tool. Ronald Fisher introduced P values in the 1920s because he wanted an objective method for comparing data to the null hypothesis, rather than the informal eyeball approach: "My data <em>look </em>different than the null hypothesis."</p>
<p>P value calculations incorporate the effect size, sample size, and variability of the data into a single number that objectively tells you how consistent your data are with the null hypothesis. Pretty nifty!</p>
<p>Unfortunately, the high pressure to find low P values, combined with a common misunderstanding of <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" target="_blank">how to correctly interpret P values</a>, has distorted the interpretation of significant results. However, these issues can be resolved.</p>
<p>So, let’s get to the guidelines! Their overall theme is that you should evaluate P values as part of a larger context where other factors matter.</p>
Guideline 1: The Exact P Value Matters
<div style="float: right; width: 90px; margin: 25px 25px;">
<p style="line-height: 11px; text-align: center;"><img alt="Small wooden P" height="75px" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/c408562ea4a40eedae9ae78c1d3ca027/p_wooden.jpg" width="75px" /><br />
<em>Tiny Ps are<br />
great!</em></p>
</div>
<p>With the high pressure to find low P values, there’s a tendency to view studies as either significant or not. Did a study produce a P value less than 0.05? If so, it’s golden! However, there is no magic significance level that distinguishes between the studies that have a true effect and those that don’t with 100% accuracy. Instead, it’s all about lowering the error rate to an acceptable level.</p>
<p>The lower the P value, the lower the error rate. For example, a P value near 0.05 has an error rate of 25-50%. However, a P value of 0.0027 corresponds to an error rate of at least 4.5%, which is close to the rate that many mistakenly attribute to a P value of 0.05.</p>
<p>A lower P value thus suggests stronger evidence for rejecting the null hypothesis. A P value near 0.05 simply indicates that the result is worth another look, but it’s nothing you can hang your hat on by itself. It’s not until you get down near 0.001 that you have a fairly low chance of a false positive.</p>
Guideline 2: Replication Matters
<p>Today, P values are everything. However, Fisher intended P values to be just one part of a process that incorporates experimentation, statistical analysis and replication to lead to scientific conclusions.</p>
<p>According to Fisher, “A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.”</p>
<p>The false positive rates associated with P values that we saw in my last post definitely support this view. A single study, especially if the P value is near 0.05, is unlikely to reduce the false positive rate down to an acceptable level. Repeated experimentation may be required to finish at a point where the error rate is low enough to meet your objectives.</p>
<p>For example, if you have two independent studies that each produced a P value of 0.05, you can multiply the P values to obtain a probability of 0.0025 for both studies. However, you must include both the significant and insignificant studies in a series of similar studies, and not cherry pick only the significant studies.</p>
<p><img alt="Replicate study results" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/d1f27fc3889672c11ac23b1ffa9bfac9/p_rep.gif" style="width: 403px; height: 136px;" /></p>
<p>Conclusively proving a hypothesis with a single study is unlikely. So, don’t expect it!</p>
Guideline 3: The Effect Size Matters
<p>With all the focus on P values, attention to the effect size can be lost. Just because an effect is statistically significant doesn't necessarily make it meaningful in the real world. Nor does a P value indicate the precision of the estimated effect size.</p>
<p>If you want to move from just detecting an effect to assessing its magnitude and precision, use <a href="http://blog.minitab.com/blog/adventures-in-statistics/when-should-i-use-confidence-intervals-prediction-intervals-and-tolerance-intervals" target="_blank">confidence intervals</a>. In this context, a confidence interval is a range of values that is likely to contain the effect size.</p>
<p>For example, an AIDS vaccine <a href="http://news.sciencemag.org/health/2009/09/massive-aids-vaccine-study-modest-success" target="_blank">study</a> in Thailand obtained a P value of 0.039. Great! This was the first time that an AIDS vaccine had positive results. However, the confidence interval for effectiveness ranged from 1% to 52%. That’s not so impressive...the vaccine may work virtually none of the time up to half the time. The effectiveness is both low and imprecisely estimated.</p>
<p>Avoid thinking about studies only in terms of whether they are significant or not. Ask yourself: is the effect size precisely estimated and large enough to be important?</p>
Guideline 4: The Alternative Hypothesis Matters
<p>We tend to think of equivalent P values from different studies as providing the same support for the alternative hypothesis. However, <a href="http://blog.minitab.com/blog/adventures-in-statistics/not-all-p-values-are-created-equal" target="_blank">not all P values are created equal</a>.</p>
<p>Research shows that the plausibility of the alternative hypothesis greatly affects the false positive rate. For example, a highly plausible alternative hypothesis and a P value of 0.05 are associated with an error rate of at least 12%, while an implausible alternative is associated with a rate of at least 76%!</p>
<p>For example, given the track record for AIDS vaccines where the alternative hypothesis has never been true in previous studies, it's highly unlikely to be true at the outset of the Thai study. This situation tends to produce high false positive rates—often around 75%!</p>
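<p>The direction of this effect can be illustrated with a simple application of Bayes' rule. The sketch below is a simplified model, not the method behind the figures quoted above, and it will not reproduce the 12% and 76% values. It assumes every study is run at a fixed alpha and a fixed power (0.8 is an arbitrary assumption), and asks what share of significant results are false positives for a given prior probability that the alternative hypothesis is true. The function name <code>false_positive_share</code> is hypothetical.</p>

```python
def false_positive_share(prior_true, alpha=0.05, power=0.8):
    """Share of statistically significant results that are false positives.

    prior_true: assumed probability that the alternative hypothesis is true
    alpha:      significance level (chance of a significant result under the null)
    power:      assumed chance of a significant result when the effect is real
    """
    false_positives = alpha * (1 - prior_true)
    true_positives = power * prior_true
    return false_positives / (false_positives + true_positives)


# Compare a plausible, a 50/50, and an implausible alternative hypothesis.
for prior in (0.9, 0.5, 0.1):
    share = false_positive_share(prior)
    print(f"prior = {prior:.1f} -> false positive share = {share:.1%}")
```

<p>Even in this simplified model, the same significance threshold yields a much higher share of false positives when the alternative hypothesis is implausible at the outset, which is the point of this guideline.</p>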
<p>When you hear about a surprising new study that finds an unprecedented result, don’t fall for that first significant P value. Wait until the study has been well replicated before buying into the results!</p>
Guideline 5: Subject Area Knowledge Matters
<p>Applying subject area expertise to all aspects of hypothesis testing is crucial. Researchers need to apply their scientific judgment about the plausibility of the hypotheses, results of similar studies, proposed mechanisms, proper experimental design, and so on. Expert knowledge transforms statistics from numbers into meaningful, trustworthy findings.</p>
Hypothesis Testing, Statistics, Statistics Help
Thu, 15 May 2014 11:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/five-guidelines-for-using-p-values
Jim Frost
Not All P Values are Created Equal
http://blog.minitab.com/blog/adventures-in-statistics/not-all-p-values-are-created-equal
<p><img alt="Fancy P" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/2762a55291d134b8185ba9da47ea6f83/p_fancy.gif" style="float: right; width: 150px; height: 194px; margin: 10px 15px;" />The interpretation of P values would seem to be fairly standard between different studies. Even if two hypothesis tests study different subject matter, we tend to assume that you can interpret a P value of 0.03 the same way for both tests. A P value is a P value, right?</p>
<p>Not so fast! While Minitab <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">statistical software</a> can correctly calculate all P values, it can’t factor in the larger context of the study. You and your common sense need to do that!</p>
<p>In this post, I’ll demonstrate that P values tell us very different things depending on the larger context.</p>
Recap: P Values Are Not the Probability of Making a Mistake
<p>In my previous post, I showed the <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" target="_blank">correct way to interpret P values</a>. Keep in mind the big caution: P values are <em>not</em> the error rate, or the likelihood of making a mistake by rejecting a true null hypothesis (Type I error).</p>
<p>You can equate this error rate to the false positive rate for a hypothesis test. A false positive happens when the sample is unusual due to chance alone and it produces a low P value. However, despite the low P value, the alternative hypothesis is not true. There is no effect at the population level.</p>
<p>Sellke <em>et al</em>. estimated that a P value of 0.05 corresponds to a false positive rate of “at least 23% (and typically close to 50%).”</p>
What Affects the Error Rate?
<p>Why is there a range of values for the error rate? To understand that, you need to understand the factors involved. David Colquhoun, a professor in biostatistics, lays them out <a href="http://www.dcscience.net/?p=6518" target="_blank">here</a>.</p>
<p>Whereas Sellke <em>et al.</em> use a Bayesian approach, Colquhoun uses a non-Bayesian approach but derives similar estimates. For example, Colquhoun estimates that P values between 0.045 and 0.05 have a false positive rate of at least 26%.</p>
<p>The factors that affect the false positive rate are:</p>
<ul>
<li>Prevalence of real effects (higher is good)</li>
<li>Power (higher is good)</li>
<li>P value (lower is good)</li>
</ul>
<p>“Good” means that the test is less likely to produce a false positive. The 26% error rate assumes a prevalence of real effects of 0.5 and a power of 0.8. If you decrease the prevalence to 0.1, suddenly the false positive rate shoots up to 76%. Yikes!</p>
<p>Power is related to false positives because when a study has a lower probability of detecting a true effect, a higher proportion of the positives will be false positives.</p>
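<p>To see how the three factors interact, here is a minimal Python sketch. It is not Colquhoun's own calculation: it counts every test with a P value at or below the significance level, so it gives lower numbers than the 26% and 76% quoted above, which refer to P values just under 0.05. The function name and the example inputs are illustrative assumptions.</p>

```python
def false_positive_rate(prevalence, power, alpha=0.05):
    """Share of significant results (P <= alpha) that are false positives.

    prevalence: proportion of tests in which a real effect exists, P(real)
    power:      probability of detecting a real effect
    alpha:      significance level
    """
    false_positives = alpha * (1.0 - prevalence)  # true nulls that reach significance
    true_positives = power * prevalence           # real effects that are detected
    return false_positives / (false_positives + true_positives)


rate_even = false_positive_rate(prevalence=0.5, power=0.8)
rate_rare = false_positive_rate(prevalence=0.1, power=0.8)
rate_low_power = false_positive_rate(prevalence=0.1, power=0.4)

print(f"prevalence 0.5, power 0.8: {rate_even:.3f}")       # 0.059
print(f"prevalence 0.1, power 0.8: {rate_rare:.3f}")       # 0.360
print(f"prevalence 0.1, power 0.4: {rate_low_power:.3f}")  # 0.529
```

<p>The direction matches the list above: lowering the prevalence of real effects or the power raises the share of positives that are false.</p>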
<p>Now, let’s dig into a very interesting factor: the prevalence of real effects. As we saw, this factor can hugely impact the error rate!</p>
P Values and the Prevalence of Real Effects
<p><img alt="Joke: I once asked a statistician out. She failed to reject me!" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/f9d1ea1b51185c0631ae8fadb0145f8f/fail_reject_joke.gif" style="float: right; width: 275px; height: 313px; margin: 10px 15px;" />What Colquhoun calls the prevalence of real effects (denoted as P(real)), the Bayesian approach calls the prior probability. It is the proportion of hypothesis tests in which the alternative hypothesis is true at the outset. It can be thought of as the long-term probability, or track record, of similar types of studies. It’s the plausibility of the alternative hypothesis.</p>
<p>If the alternative hypothesis is farfetched, or has a poor track record, P(real) is low. For example, a prevalence of 0.1 indicates that 10% of similar alternative hypotheses have turned out to be true while 90% of the time the null was true. Perhaps the alternative hypothesis is unusual, untested, or otherwise implausible.</p>
<p>If the alternative hypothesis fits current theory, has an identified mechanism for the effect, and previous studies have already shown significant results, P(real) is higher. For example, a prevalence of 0.90 indicates that the alternative is true 90% of the time, and the null only 10% of the time.</p>
<p>If the prevalence is 0.5, there is a 50/50 chance that either the null or alternative hypothesis is true at the outset of the study.</p>
<p>You may not always know this probability, but theory and a previous track record can be guides. For our purposes, we’ll use this principle to see how it impacts our interpretation of P values. Specifically, we’ll focus on the probability of the null being true (1 – P(real)) at the beginning of the study.</p>
Hypothesis Tests Are Journeys from the Prior Probability to Posterior Probability
<p><a href="http://blog.minitab.com/blog/understanding-statistics/what-statistical-hypothesis-test-should-i-use" target="_blank">Hypothesis tests</a> begin with differing probabilities that the null hypothesis is true depending on the specific hypotheses being tested. This prior probability influences the probability that the null is true at the conclusion of the test, the posterior probability.</p>
<p>If P(real) = 0.9, there is only a 10% chance that the null hypothesis is true at the outset. Consequently, the probability of rejecting a true null at the conclusion of the test must be less than 10%. However, if you start with a 90% chance of the null being true, the odds of rejecting a true null increase because there are more true nulls.</p>
<table>
<tr><th style="text-align: center;">Initial Probability of<br />true null (1 – P(real))</th><th style="text-align: center;">P value obtained</th><th style="text-align: center;">Final Minimum Probability<br />of true null</th></tr>
<tr><td style="text-align: center;">0.5</td><td style="text-align: center;">0.05</td><td style="text-align: center;">0.289</td></tr>
<tr><td style="text-align: center;">0.5</td><td style="text-align: center;">0.01</td><td style="text-align: center;">0.110</td></tr>
<tr><td style="text-align: center;">0.5</td><td style="text-align: center;">0.001</td><td style="text-align: center;">0.018</td></tr>
<tr><td style="text-align: center;">0.33</td><td style="text-align: center;">0.05</td><td style="text-align: center;">0.12</td></tr>
<tr><td style="text-align: center;">0.9</td><td style="text-align: center;">0.05</td><td style="text-align: center;">0.76</td></tr>
</table>
<p>The table is based on calculations by Colquhoun and Sellke <em>et al.</em> It shows that the decrease from the initial probability to the final probability of a true null depends on the P value. Power is also a factor but not shown in the table.</p>
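<p>The rows with an initial probability of 0.5 can be approximated with the calibration from Sellke <em>et al.</em>, in which the smallest Bayes factor in favor of the null is −e·p·ln(p) for p below 1/e. The Python sketch below is an illustration under that assumption; the function name is hypothetical. The rows with initial probabilities of 0.33 and 0.9 appear to come from other calculations, so this bound does not reproduce them exactly.</p>

```python
import math


def min_posterior_null(p_value, prior_null=0.5):
    """Lower bound on the probability that the null is true after seeing p_value.

    Uses the -e * p * ln(p) calibration, which only applies for 0 < p < 1/e.
    prior_null is the initial probability of a true null, 1 - P(real).
    """
    if not 0.0 < p_value < 1.0 / math.e:
        raise ValueError("the calibration requires 0 < p_value < 1/e")
    min_bayes_factor = -math.e * p_value * math.log(p_value)
    prior_odds = prior_null / (1.0 - prior_null)
    posterior_odds = prior_odds * min_bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


for p in (0.05, 0.01, 0.001):
    print(f"P value {p}: final probability of a true null >= {min_posterior_null(p):.3f}")
```

<p>With a prior of 0.5, this gives roughly 0.289, 0.111, and 0.018, in line with the first three rows of the table.</p>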
Where Do We Go with P values from Here?
<p><img alt="wooden block P" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/c408562ea4a40eedae9ae78c1d3ca027/p_wooden.jpg" style="float: right; width: 150px; height: 150px;" />There are many combinations of conditions that affect the probability of rejecting a true null. However, don't try to remember every combination and the error rate, especially because you may only have a vague sense of the true P(real) value!</p>
<p>Just remember two big takeaways:</p>
<ol>
<li>A single statistically significant hypothesis test often provides insufficient evidence to confidently discard the null hypothesis. This is particularly true when the P value is closer to 0.05.</li>
<li>P values from different hypothesis tests can have the same value, but correspond to very different false positive rates. You need to understand their context to be able to interpret them correctly.</li>
</ol>
<p>The second point is epitomized by a quote that was popularized by Carl Sagan: “Extraordinary claims require extraordinary evidence.”</p>
<p>A surprising new study may have a significant P value, but you shouldn't trust the alternative hypothesis until the results are replicated by additional studies. As shown in the table, a significant but unusual alternative hypothesis can have an error rate of 76%!</p>
<p>Don’t fret! There are simple recommendations based on the principles above that can help you navigate P values and use them correctly. I’ll cover <a href="http://blog.minitab.com/blog/adventures-in-statistics/five-guidelines-for-using-p-values">five guidelines for using P values</a> in my next post.</p>
Hypothesis Testing
Thu, 01 May 2014 11:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/not-all-p-values-are-created-equal
Jim Frost
How to Correctly Interpret P Values
http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values
<p><img alt="P value" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/d95f756ee6f6a4cec607017c8edea52a/bigp.gif" style="margin: 4px; float: right; width: 110px; height: 125px;" />The P value is used all over statistics, from <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/t-for-2-should-i-use-a-paired-t-or-a-2-sample-t" target="_blank">t-tests</a> to <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients" target="_blank">regression analysis</a>. Everyone knows that you use P values to determine statistical significance in a hypothesis test. In fact, P values often determine what studies get published and what projects get funding.</p>
<p>Despite being so important, the P value is a slippery concept that people often interpret incorrectly. How <em>do</em> you interpret P values?</p>
<p>In this post, I'll help you to understand P values in a more intuitive way and to avoid a very common misinterpretation that can cost you money and credibility.</p>
What Is the Null Hypothesis in Hypothesis Testing?
<p><img alt="Scientist performing an experiment" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/3407070c72311249854712c526aceb59/scientist_w640.jpeg" style="margin: 10px 15px; float: right; width: 300px; height: 200px; border-width: 1px; border-style: solid;" />In order to understand P values, you must first understand the null hypothesis.</p>
<p>In every experiment, there is an effect or difference between groups that the researchers are testing. It could be the effectiveness of a new drug, building material, or other intervention that has benefits. Unfortunately for the researchers, there is always the possibility that there is no effect, that is, that there is no difference between the groups. This lack of a difference is called the null hypothesis, which is essentially the position a devil’s advocate would take when evaluating the results of an experiment.</p>
<p>To see why, let’s imagine an experiment for a drug that we know is totally ineffective. The null hypothesis is true: there is no difference between the experimental groups at the population level.</p>
<p>Despite the null being true, it’s entirely possible that there will be an effect in the sample data due to random sampling error. In fact, it is extremely unlikely that the sample groups will ever exactly equal the null hypothesis value. Consequently, the devil’s advocate position is that the observed difference in the sample does not reflect a true difference between populations.</p>
What Are P Values?
<p><img alt="Joke" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/81242ed4497d1961eb264c3d7c65cc66/null_joke.gif" style="margin: 4px; float: right; width: 250px; height: 206px;" />P values evaluate how well the sample data support the devil’s advocate argument that the null hypothesis is true. They measure how compatible your data are with the null hypothesis. How likely is the effect observed in your sample data if the null hypothesis is true?</p>
<ul>
<li>High P values: your data are likely with a true null.</li>
<li>Low P values: your data are unlikely with a true null.</li>
</ul>
<p>A low P value suggests that your sample provides enough evidence that you can reject the null hypothesis for the entire population.</p>
How Do You Interpret P Values?
<p><img alt="Vaccine" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/179970708b13904b2993033a5cc2e71d/vaccination_w640.jpeg" style="margin: 4px; float: right; width: 300px; height: 160px;" />In technical terms, a P value is the probability of obtaining an effect at least as extreme as the one in your sample data, assuming the truth of the null hypothesis.</p>
<p>For example, suppose that a vaccine study produced a P value of 0.04. This P value indicates that if the vaccine had no effect, you’d obtain the observed difference or more in 4% of studies due to random sampling error.</p>
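<p>A small simulation can make this interpretation concrete. The Python sketch below is an illustration with assumed settings (two groups of 20 drawn from the same normal population with a known standard deviation of 1, and a two-sided z-test), not a model of any real vaccine trial. Because the null is true in every simulated study, about 4% of the studies should produce a P value of 0.04 or less.</p>

```python
import math
import random


def two_sided_p(z):
    """Two-sided P value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def simulate_null_studies(n_studies=5000, n_per_group=20, seed=1):
    """Simulate studies in which both groups come from the same population."""
    rng = random.Random(seed)
    standard_error = math.sqrt(2.0 / n_per_group)  # known sigma = 1 in both groups
    p_values = []
    for _ in range(n_studies):
        treated = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        difference = sum(treated) / n_per_group - sum(control) / n_per_group
        p_values.append(two_sided_p(difference / standard_error))
    return p_values


p_values = simulate_null_studies()
share_below = sum(p <= 0.04 for p in p_values) / len(p_values)
print(f"Share of null studies with P <= 0.04: {share_below:.3f}")
```

<p>The printed share should land close to 0.04. It says nothing about how often the null itself is true, which is the point of the next section.</p>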
<p>P values address only one question: how likely are your data, assuming a true null hypothesis? They do not measure support for the alternative hypothesis. This limitation leads us into the next section, which covers a very common misinterpretation of P values.</p>
P Values Are <em>NOT</em> the Probability of Making a Mistake
<p>Incorrect interpretations of P values are very common. The most common mistake is to interpret a P value as the probability of making a mistake by rejecting a true null hypothesis (a Type I error).</p>
<p>There are several reasons why P values can’t be the error rate.</p>
<p>First, P values are calculated based on the assumptions that the null is true for the population and that the difference in the sample is caused entirely by random chance. Consequently, P values can’t tell you the probability that the null is true or false because it is 100% true from the perspective of the calculations.</p>
<p>Second, while a low P value indicates that your data are unlikely assuming a true null, it can’t evaluate which of two competing cases is more likely:</p>
<ul>
<li>The null is true but your sample was unusual.</li>
<li>The null is false.</li>
</ul>
<p>Determining which case is more likely requires subject area knowledge and replicate studies.</p>
<p>Let’s go back to the vaccine study and compare the correct and incorrect way to interpret the P value of 0.04:</p>
<ul>
<li><strong>Correct:</strong> Assuming that the vaccine had no effect, you’d obtain the observed difference or more in 4% of studies due to random sampling error.</li>
<li><strong>Incorrect:</strong> If you reject the null hypothesis, there’s a 4% chance that you’re making a mistake.</li>
</ul>
What Is the True Error Rate?
<p><img alt="Caution sign" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/41ad875b2a88a19ab5bdfa5e47ed790b/caution_sign_w640.jpeg" style="margin: 4px; float: right; width: 250px; height: 250px;" />Think that this interpretation difference is simply a matter of semantics, and only important to picky statisticians? Think again. It’s important to you.</p>
<p>If a P value is not the error rate, what the heck is the error rate? (Can you guess which way this is heading now?)</p>
<p>Sellke et al.* have estimated the error rates associated with different P values. While the precise error rate depends on various assumptions (which I'll talk about in my next post), the table summarizes them for middle-of-the-road assumptions.</p>
<table>
<tr><th style="text-align: center;">P value</th><th style="text-align: center;">Probability of incorrectly rejecting a true null hypothesis</th></tr>
<tr><td style="text-align: center;">0.05</td><td style="text-align: center;">At least 23% (and typically close to 50%)</td></tr>
<tr><td style="text-align: center;">0.01</td><td style="text-align: center;">At least 7% (and typically close to 15%)</td></tr>
</table>
<p>Do the higher error rates in this table surprise you? Unfortunately, the common misinterpretation of P values as the error rate creates the illusion of substantially more evidence against the null hypothesis than is justified. As you can see, if you base a decision on a single study with a P value near 0.05, the difference observed in the sample may not exist at the population level. That can be costly!</p>
<p>Read my <a href="http://blog.minitab.com/blog/adventures-in-statistics/not-all-p-values-are-created-equal" target="_blank">next post</a> to understand the factors that affect the true error rate. Or, read my <a href="http://blog.minitab.com/blog/adventures-in-statistics/five-guidelines-for-using-p-values">five guidelines for how to use P values and avoid incorrect decisions</a>.</p>
<p>*Thomas Sellke, M. J. Bayarri, and James O. Berger, “Calibration of p Values for Testing Precise Null Hypotheses,” <em>The American Statistician</em>, Vol. 55, No. 1, February 2001.</p>
Hypothesis Testing
Thu, 17 Apr 2014 11:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values
Jim Frost
Re-analyzing Wine Tastes with Minitab 17
http://blog.minitab.com/blog/applying-statistics-in-quality-projects/re-analyzing-wine-tastes-with-minitab-17
<p>In April 2012, I wrote a short paper on <a href="http://www.minitab.com/en-us/Published-Articles/Wine-Tasting-by-Numbers--Using-Binary-Logistic-Regression-to-Reveal-the-Preferences-of-Experts/">binary logistic regression</a> to analyze wine tasting data. At that time, François Hollande was about to get elected as French president and in the U.S., Mitt Romney was winning the Republican primaries. That seems like a long time ago…</p>
<p>Now, in 2014, Minitab 17 <a href="http://www.minitab.com/products/minitab/">Statistical Software</a> has just been released. Had Minitab 17 been available in 2012, would I have conducted my analysis in a different way? Would the results still look similar? I decided to re-analyze my April 2012 data with Minitab 17 and assess the differences, if there are any.</p>
<p>There were no fewer than 12 parameters to analyze with a binary response. Eleven of them were continuous variables and one was a discrete factor (white versus red wine, a qualitative variable), so the number of two-factor interactions that could be studied was huge: 66 were potentially available.</p>
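<p>The figure of 66 is simply the number of ways to pick 2 variables out of 12. A quick check in Python (the variable names are illustrative):</p>

```python
import math

n_variables = 12  # 11 continuous variables plus the wine type
n_two_factor_interactions = math.comb(n_variables, 2)  # 12 * 11 / 2

print(n_two_factor_interactions)  # 66
```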
<p>The parameters to be studied:</p>
<table>
<tr><th style="text-align: center;">Variable</th><th style="text-align: center;">Details</th><th style="text-align: center;">Units</th></tr>
<tr><td style="text-align: center;">Type</td><td style="text-align: center;">red or white</td><td style="text-align: center;">N/A</td></tr>
<tr><td style="text-align: center;">pH</td><td style="text-align: center;">acidity (below 7) or alkalinity (over 7)</td><td style="text-align: center;">N/A</td></tr>
<tr><td style="text-align: center;">Density</td><td style="text-align: center;">density</td><td style="text-align: center;">grams/cubic centimeter</td></tr>
<tr><td style="text-align: center;">Sulphates</td><td style="text-align: center;">potassium sulfate</td><td style="text-align: center;">grams/liter</td></tr>
<tr><td style="text-align: center;">Alcohol</td><td style="text-align: center;">percentage alcohol</td><td style="text-align: center;">% volume</td></tr>
<tr><td style="text-align: center;">Residual sugar</td><td style="text-align: center;">residual sugar</td><td style="text-align: center;">grams/liter</td></tr>
<tr><td style="text-align: center;">Chlorides</td><td style="text-align: center;">sodium chloride</td><td style="text-align: center;">grams/liter</td></tr>
<tr><td style="text-align: center;">Free SO2</td><td style="text-align: center;">free sulphur dioxide</td><td style="text-align: center;">milligrams/liter</td></tr>
<tr><td style="text-align: center;">Total SO2</td><td style="text-align: center;">total sulphur dioxide</td><td style="text-align: center;">milligrams/liter</td></tr>
<tr><td style="text-align: center;">Fixed acidity</td><td style="text-align: center;">tartaric acid</td><td style="text-align: center;">grams/liter</td></tr>
<tr><td style="text-align: center;">Volatile acidity</td><td style="text-align: center;">acetic acid</td><td style="text-align: center;">grams/liter</td></tr>
<tr><td style="text-align: center;">Citric acid</td><td style="text-align: center;">citric acid</td><td style="text-align: center;">grams/liter</td></tr>
</table>
Restricting Analysis to the Main Effects
<p>In 2012, due to the very large number of potential two-factor interactions, I restricted my analysis to the main effects (not considering the interactions between continuous variables).</p>
<p>This was a very lengthy process, because the individual parameters had to be eliminated one at a time according to their p values: the term with the highest p value is removed, the model is refit, and the step is repeated until all the parameters and interactions that remain in the model have p values lower than 0.05.</p>
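<p>The manual procedure described above can be sketched in a few lines of Python. This is only an outline of the general backward-elimination idea, not what Minitab does internally. The <code>fit_p_values</code> callable is a hypothetical stand-in for refitting the logistic model and reading off the p values, and the toy p values are made up for illustration (in a real analysis they change after every refit).</p>

```python
def backward_eliminate(terms, fit_p_values, alpha=0.05):
    """Drop the term with the highest p value until all remaining p values are below alpha.

    fit_p_values is a hypothetical callable: it takes the list of terms currently
    in the model and returns a dict mapping each term to its p value.
    """
    remaining = list(terms)
    while remaining:
        p_values = fit_p_values(remaining)  # refit with the current terms
        worst = max(remaining, key=lambda term: p_values[term])
        if p_values[worst] < alpha:
            break  # every remaining term is significant
        remaining.remove(worst)
    return remaining


# Made-up p values, held fixed only to keep the example short.
toy_p = {"Alcohol": 0.001, "Fixed acidity": 0.010, "Density": 0.400, "Chlorides": 0.200}
kept = backward_eliminate(list(toy_p), lambda terms: {t: toy_p[t] for t in terms})

print(kept)  # ['Alcohol', 'Fixed acidity']
```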
<p>To avoid obtaining an excessively complex final model, I eventually decided to analyze white and red wines separately (one model for the white wines, another for the red wines), since the effects of some of the variables seemed to differ according to the type of wine.</p>
Including 2-Way Interactions in the Analysis
<p>Using Minitab 17 makes a substantial difference in this respect. All 2-way interactions can be easily selected to generate an initial model:</p>
<p><img alt="interactions" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/47940b6e8427b9c44afdf56f511b0d44/interactions_logistic_binary.JPG" style="width: 516px; height: 540px;" /></p>
<p>With Minitab 17, you can use stepwise binary logistic regression to quickly build a final model and identify the significant effects. In 2012, I used a descending approach, starting with all the variables and manually eliminating one variable at a time.</p>
<p>This lengthy and tedious process takes just a single click in Minitab 17:</p>
<p><img alt="stepwise" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/8fe5aafde53273ba3b7d16da305b5e4d/stepwise_binary.JPG" style="width: 486px; height: 539px;" /></p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/fc6b2c0fe2c083e439f4c66e0e446ddd/deviance_table_w640.gif" style="width: 640px; height: 168px;" /></p>
<p>The results above show that Alcohol and Acidity (both fixed and volatile) seem to play a major role.</p>
<p>The Residual sugar by Type of wine interaction is only marginally significant: its p value (0.087) is larger than 0.05 but smaller than 0.1.</p>
<p>The R-squared value (R-Sq) is also available in Minitab 17 to assess the proportion of the total variability that the model explains. The larger the R-squared value, the more complete the model: a large R-squared means that the model captures most of the variability in the response, while a low R-squared means that it explains only a small part. In this example, the R-squared is relatively low (28%), leaving 72% of the total variability unexplained by the model.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/4a5a9df6a33e7e05cfaf880bcc2cc3d8/model_summary.png" style="width: 278px; height: 94px;" /></p>
<p>In 2012, the final result consisted of two equations that could be used to understand which variables were significant for each type of wine in order to improve their taste.</p>
Optimizing the Response
<p>In Minitab 17, I can go one step further and use the optimization tool to identify the ideal settings and help the experimenter make the right decision.</p>
<p><img alt="regression equation" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/513cc8d599ba7948f4f288db12356435/regression_equation.png" style="width: 580px; height: 174px;" /></p>
<p><img alt="Optimize" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/9e42868dc619334da72100ec138b00c4/optimize_binary_w640.jpeg" style="width: 640px; height: 184px;" /></p>
<p>The optimization tool shows that tasters tend to prefer wines with a large amount of alcohol and both high fixed acidity <em>and </em>high volatile acidity.</p>
<p>Finally, showing graphs is important to convince colleagues and managers that the right decision has been made. A visual representation is also very useful for understanding the factor effects. In Minitab 17, contour plots and response surface plots that describe the variable effects are available in the binary logistic regression sub-menu.</p>
<p>The contour plot below shows that tasters either prefer wines with high fixed acidity <em>and </em>high volatile acidity or with low fixed acidity <em>but also </em>low volatile acidity. The balance between the two types of acidity seems to be crucial.</p>
<p><img alt="Contour" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/2705aa13f0f80f9830408616028428a0/contour_plot_of_quality_vs_volatile_acidity__fixed.jpg" style="width: 576px; height: 384px;" /></p>
<p><img alt="Surface" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/2f593a6e6b89cb70b9439c85e8345477/surface_plot_of_quality_vs_volatile_acidity__fixed.jpg" style="width: 576px; height: 384px;" /></p>
<p>The models I arrived at in April 2012 are different from the one I found with Minitab 17. The two types of Acidity (Fixed and Volatile) were significant in the model for white wines, and Alcohol and Fixed Acidity had been selected in the final model for red wines.</p>
<p>But the main difference is that the Fixed Acidity by Volatile Acidity interaction had not been considered in 2012. In April 2012, the two-factor interactions were not on my radar, and I instead focused only on the individual main effects and their impact on wine tastes.</p>
<p>Fortunately, with Minitab 17 it is a lot easier to build an initial model, even a complex one with 66 potential two-factor interactions, and stepwise regression allows you to consider a much larger number of potential effects in the initial full model.</p>
Conclusion
<p>Ultimately, this study shows that the methods you use definitely impact your conclusion and statistical analysis. I got a simpler model using the tools available in Minitab 17, and therefore I did not need to study white and red wines separately. The optimization tool as well as the graphs were very useful to better understand the effects of the variables that are significant.</p>
Data Analysis, Fun Statistics, Hypothesis Testing, Quality Improvement, Regression Analysis, Statistics, Statistics Help, Stats
Tue, 15 Apr 2014 12:00:00 +0000
http://blog.minitab.com/blog/applying-statistics-in-quality-projects/re-analyzing-wine-tastes-with-minitab-17
Bruno Scibilia
Equivalence Testing for Quality Analysis (Part II): What Difference Does the Difference Make?
http://blog.minitab.com/blog/statistics-and-quality-data-analysis/equivalence-testing-for-quality-analysis-part-ii-what-difference-does-the-difference-make
<p><img alt="magnifying glass" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/1d9c8453dd19544a3f73fd787189619b/equivalence_test_difference.jpg" style="width: 250px; height: 250px; float: right; border-width: 1px; border-style: solid; margin: 10px 15px;" />My <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/equivalence-testing-for-quality-analysis-part-i-what-are-you-trying-to-prove" target="_blank">previous post</a> examined how an equivalence test can shift the burden of proof when you perform a hypothesis test of means. This allows you to more rigorously test whether the process mean is equivalent to a target or to another mean.</p>
<p>Here’s another key difference: To perform the analysis, an equivalence test requires that you first define, upfront, the size of a <em>practically important</em> difference between the mean and the target, or between two means.</p>
<p>Truth be told, even when performing a standard hypothesis test, you should know the value of this difference. Because you can’t really evaluate whether your analysis will have adequate power without knowing it. Nor can you evaluate whether a statistically significant difference in your test results has significant meaning in the real world, outside of probability distribution theory.</p>
<p>But since a standard t-test doesn’t <em>require</em> you to define this difference, people often run the analysis with a fuzzy idea, at best, of what they’re actually looking for. It’s not an error, really. It’s more like using a radon measuring device without knowing what levels of radon are potentially harmful. </p>
Defining Equivalence Limits: Your Call
<p>How close does the mean have to be to the target value or to another mean for you to consider them, for all practical purposes, “equivalent”? </p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/6ec96ebe3b82e0b4828a79c8a74ba862/zone_of_equivalence_w640.jpeg" style="width: 640px; height: 178px;" /></p>
<p>The zone of equivalence is defined by a lower equivalence limit and/or an upper equivalence limit. The lower equivalence limit (LEL) defines your lower limit of acceptability for the difference. The upper equivalence limit (UEL) defines your upper limit of acceptability for the difference. Any difference between the mean and the target (or between two means) that falls within this zone is considered unimportant.</p>
<p>In some fields, such as the pharmaceutical industry, equivalence limits are set by regulatory guidelines. If there aren’t guidelines for your application, you’ll need to define the zone of equivalence using knowledge of your product or process.</p>
<p>Here’s the bad news: There isn’t a statistician on Earth who can help you define those limits. Because it isn’t a question of statistics. It’s a question of what size of a difference produces tangible ramifications for you or your customer.</p>
<p>A difference of 0.005 mg from the mean target value? A 10% shift in the process mean? Obviously, the criteria aren't going to be the same for the diameter of a stent and the diameter of a soda can.</p>
Equivalence Test in Practice
<p>Here's a quick example of a 1-sample equivalence test, adapted from Minitab 17 Help. To follow along, you can download the revised <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/File/2d3f51f5ad82ccea78f46bcacf3c1af8/equivalence_test_data.MPJ" target="_blank">data here</a>. If you don't have Minitab 17, download <a href="http://it.minitab.com/products/minitab/free-trial.aspx?WT.ac=BlogMtbAd" target="_blank">a free trial version here.</a></p>
<p>Suppose a packaging company wants to ensure that the force needed to open its snack food bags is within 10% of the target value of 4.2N (Newtons). From previous testing, they know that a force lower than 10% below the target causes the bags to open too easily and reduces product freshness.A force above 10% of the target makes the bags too difficult to open. They randomly sample 100 bags and measure the force required to open each one.</p>
<p>To test whether the mean force is equivalent to the target, they choose <strong>Stat > Equivalence Tests > 1-Sample</strong> and fill in the dialog box as shown below:</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/295557153e0f5f5d7f3b52e2f8d4c4fe/equivalence_dialog.jpg" style="width: 560px; height: 390px;" /></p>
<p><strong>Tip</strong>: Use the <strong>Multiply by Target</strong> box when you want to define the equivalence limits for a difference in terms of a percentage of the target. In this case, the lower limit is 10% less than the target. The upper limit is 10% higher than the target. If you want to represent the equivalence limits in absolute terms, rather than as percentages, simply enter the actual values for your equivalence limits and don't check the <strong>Multiply by Target</strong> box.</p>
<p>When you click <strong>OK</strong>, Minitab displays the following results:</p>
<p style="margin-left: 40px;"><strong>One-Sample Equivalence Test: Force</strong></p>
<p style="margin-left: 40px;">Difference: Mean(Force) - Target</p>
<p style="margin-left: 40px;">Difference SE 95% CI Equivalence Interval<br />
0.14270 0.067559 (0, 0.25487) (-0.42, 0.42)</p>
<p><span style="color:#FF0000;">CI is within the equivalence interval. Can claim equivalence.</span></p>
<p style="margin-left: 40px;">Test<br />
Null hypothesis: Difference ≤ -0.42 or Difference ≥ 0.42<br />
Alternative hypothesis: -0.42 < Difference < 0.42<br />
α level: 0.05</p>
<p style="margin-left: 40px;">Null Hypothesis DF T-Value P-Value<br />
Difference ≤ -0.42 99 8.3290 0.000<br />
Difference ≥ 0.42 99 -4.1046 0.000</p>
<p><span style="color:#FF0000;">The greater of the two P-Values is 0.000. Can claim equivalence.</span></p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/eef6c9313790b2590fa6178d96a5c1cb/equivalence_plot.jpg" style="width: 576px; height: 384px;" /></p>
<p>Because the confidence interval for the difference falls completely within the equivalence limits, you can reject the null hypothesis that the difference between the mean and the target lies outside the equivalence interval. You can claim that the mean and the target are equivalent.</p>
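<p>To see where the numbers in that output come from, here is a minimal Python sketch that rebuilds the two t-values and the interval from the summary values printed above. It is not Minitab's own code. The critical value 1.6604 is an assumption taken from a t table (one-sided, α = 0.05, 99 degrees of freedom), and widening the interval to include zero is an assumption made to mirror the interval shown in the output.</p>

```python
# Summary values copied from the Minitab output above.
target = 4.2
diff = 0.14270        # Mean(Force) - Target
se = 0.067559         # standard error of the difference
df = 99               # n - 1 for the 100 sampled bags

# Equivalence limits: 10% of the target on either side of zero difference.
lel = -0.10 * target
uel = 0.10 * target

# Two one-sided t statistics, one against each equivalence limit.
t_lower = (diff - lel) / se   # tests H0: Difference <= LEL
t_upper = (diff - uel) / se   # tests H0: Difference >= UEL

# Assumption: one-sided critical value t(0.05, 99) read from a t table.
T_CRIT = 1.6604

# Assumption: the interval is widened to include 0, as the output suggests.
ci_low = min(0.0, diff - T_CRIT * se)
ci_high = max(0.0, diff + T_CRIT * se)

# Equivalence is claimed only if the whole interval sits inside the limits.
equivalent = lel < ci_low and ci_high < uel

print(f"t statistics: {t_lower:.4f}, {t_upper:.4f}")
print(f"interval: ({ci_low:.5f}, {ci_high:.5f})")
print("Can claim equivalence" if equivalent else "Cannot claim equivalence")
```

<p>With these inputs the t statistics and the upper end of the interval agree with the output above to the printed precision, which is a useful sanity check that the equivalence limits really are ±0.42.</p>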
<p>Notice that if you had used a standard 1-sample t-test to analyze these data, the output would show a statistically significant difference between the mean and the target (at a significance level of 0.05):</p>
<p style="margin-left: 40px;"><strong>One-Sample T: Force</strong></p>
<p style="margin-left: 40px;">Test of μ = 4.2 vs ≠ 4.2<br />
<strong>Variable N Mean StDev SE Mean 95% CI T P</strong><br />
Force 100 4.3427 0.6756 0.0676 (4.2086, 4.4768) 2.11 <span style="color:#FF0000;"> 0.037</span></p>
<p>These two sets of results aren't really contradictory, though.</p>
<p>The equivalence test has simply defined "equality" between the mean and the target in broader terms, using the values you entered for the equivalence zone. The standard t-test has no knowledge of what "practically significant" means. So it can only evaluate the difference from the target in terms of statistical significance.</p>
<p>In this way, an equivalence test is "naturally smarter" than a standard t-test. But it's your knowledge of the process or product that allows an equivalence test to evaluate the practical significance of a difference, in addition to its statistical significance.</p>
<strong>Learn More about Equivalence Testing</strong>
<p>There are four types of equivalence tests newly available in Minitab 17. To learn more about each test, choose <strong>Help > Help</strong>. Click the <strong>Index</strong> tab, scroll down to <strong>Equivalence testing</strong>, and click <strong>Overview</strong>.</p>
Hypothesis Testing
Tue, 01 Apr 2014 12:31:24 +0000
http://blog.minitab.com/blog/statistics-and-quality-data-analysis/equivalence-testing-for-quality-analysis-part-ii-what-difference-does-the-difference-make
Patrick Runkel
Equivalence Testing for Quality Analysis (Part I): What are You Trying to Prove?
http://blog.minitab.com/blog/statistics-and-quality-data-analysis/equivalence-testing-for-quality-analysis-part-i-what-are-you-trying-to-prove
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/b922d8ae294ef7be358b1b2abdc06eab/scales.jpg" style="float: right; border-width: 1px; border-style: solid; margin: 10px 15px; width: 250px; height: 244px;" />With more options, come more decisions.</p>
<p>With equivalence testing added to Minitab 17, you now have more statistical tools to test a sample mean against a target value or another sample mean.</p>
<p>Equivalence testing is extensively used in the biomedical field. Pharmaceutical manufacturers often need to test whether the biological activity of a generic drug is equivalent to that of a brand name drug that has already been through the regulatory approval process.</p>
<p>But in the field of quality improvement, why might you want to use an equivalence test instead of a standard t-test?</p>
Interpreting Hypothesis Tests: A Common Pitfall
<p>Suppose a manufacturer finds a new supplier that offers a less expensive material that could be substituted for a costly material currently used in the production process. This new material is <em>supposed to be</em> just as good as the material currently used. It should not make the product too pliable nor too rigid.</p>
<p>To make sure the substitution doesn’t negatively impact quality, an analyst collects two random samples from the production process (which is stable): one using the new material and one using the current material.</p>
<p>The analyst then uses a standard 2-sample t-test (<strong>Stat > Basic Statistics > 2-Sample t </strong>in Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a>) to assess whether the mean pliability of the product is the same using both materials:</p>
<p style="margin-left: 40px;">________________________________________</p>
<p style="margin-left: 40px;"><strong>Two-Sample T-Test and CI: Current, New </strong></p>
<p style="margin-left: 40px;">Two-sample T for Current vs New<br />
N Mean StDev SE Mean<br />
Current 9 34.092 0.261 0.087<br />
New 10 33.971 0.581 0.18</p>
<p style="margin-left: 40px;">Difference = μ (Current) - μ (New)<br />
Estimate for difference: 0.121<br />
95% CI for difference: (-0.322, 0.564)<br />
T-Test of difference = 0 (vs ≠): T-Value = 0.60 <strong><span style="color:#FF0000;">P-Value = 0.562</span></strong> DF = 12<br />
________________________________________</p>
<p>Because the p-value is not less than the <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/alpha-male-vs-alpha-female">alpha level</a> (0.05), the analyst concludes that the means do not differ. Based on these results, the company switches suppliers for the material, confident that statistical analysis has proven that they can save money with the new material without compromising the quality of their product.</p>
<p>The test results make everyone happy. High-fives. Group hugs. Popping champagne corks. There’s only one minor problem.</p>
<p>Their statistical analysis didn’t really <em>prove</em> that the means are the same.</p>
Consider Where to Place the Burden of Proof
<p>In hypothesis testing, H1 is the alternative hypothesis that requires the burden of proof. Usually, the alternative hypothesis is what you’re hoping to prove or demonstrate. When you perform a standard 2-sample t-test, you’re really asking: “Do I have enough evidence to <em>prove</em>, beyond a reasonable doubt (your alpha level), that the population means are different?”</p>
<p>To do that, the hypotheses are set up as follows:</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/102c226ce0c8cffe1a37cb7e43a969eb/2samplet_w640.jpeg" style="width: 640px; height: 271px;" /></p>
<p>If the p-value is less than alpha, you conclude that the means significantly differ. But if the p-value is not less than alpha, you haven’t <em>proven</em> that the means are equal. You just don’t have enough evidence to prove that they’re not equal.</p>
<p>The absence of evidence for a statement is not proof of its converse. If you don’t have sufficient evidence to claim that A is true, you haven’t <em>proven</em> that A is false.</p>
<p>Equivalence tests were specifically developed to address this issue. In a 2-sample equivalence test, the null and alternative hypotheses are reversed from a standard 2-sample t-test.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/1d72f62da34d5c7d7d7719973c3927ac/2smple_equiv_image_w640.jpeg" style="width: 640px; height: 271px;" /></p>
<p>This switches the burden of proof for the test. It also reverses the consequences of incorrectly assuming that the null hypothesis (H0) is true.</p>
Case in Point: The Presumption of Innocence vs. Guilt
<p>This rough analogy may help illustrate the concept.</p>
<p>In a court of law, the burden of proof rests on proving guilt. The suspect is presumed innocent (H0), until proven guilty (H1). In the news media, the burden of proof is often reversed: The suspect is presumed guilty (H0), until proven innocent (H1).</p>
<p>Shifting the burden of proof can yield different conclusions. That’s why the news media often express outrage when a suspect who is presumed to be guilty is let go because there was not sufficient evidence to prove the suspect’s guilt in the courtroom. As long as news media and the courtroom reverse their null and alternative hypotheses, they’ll sometimes draw different conclusions based on the same evidence.</p>
<p>Why do they set up their hypotheses differently in the first place? Because each seems to have a different idea of what’s a worse error to make. The judicial system believes the worse error is to convict an innocent person, rather than let a guilty person go free. The news media seem to believe the contrary. (Maybe because the presumption of guilt sells more papers than presumption of innocence?)</p>
When the Burden of Proof Shifts, the Conclusion May Change
<p>Back to our quality analyst in the first example. To avoid losing customers, the company would rather err by assuming that the quality was not the same using the cheaper material--when it actually was--than err by assuming it was the same, when it actually was not.</p>
<p>To more rigorously demonstrate that the means are the same, the analyst performs a 2-sample equivalence test (<strong>Stat > Equivalence Tests > 2-Sample</strong>).</p>
<p style="margin-left: 40px;">________________________________________</p>
<p style="margin-left: 40px;"><strong>Equivalence Test: Mean(New) - Mean(Current) </strong></p>
<p style="margin-left: 40px;">Test<br />
Null hypothesis: Difference ≤ -0.4 or Difference ≥ 0.4<br />
Alternative hypothesis: -0.4 < Difference < 0.4<br />
α level: 0.05</p>
<p style="margin-left: 40px;">Null Hypothesis DF T-Value P-Value<br />
Difference ≤ -0.4 12 1.3717 0.098<br />
Difference ≥ 0.4 12 -2.5646 0.012</p>
<p style="margin-left: 40px;"><strong><span style="color:#FF0000;">The greater of the two P-Values is 0.098. Cannot claim equivalence.</span></strong><br />
________________________________________</p>
<p>Using the equivalence test on the same data, the results now indicate that there <em>isn't</em> sufficient evidence to claim that the means are the same. The company <em>cannot</em> be confident that product quality will not suffer if they substitute the less expensive material. By using an equivalence test, the company has raised the bar for evaluating a possible shift in the process mean.</p>
<p><strong>Note:</strong> If you look at the above output, you'll see another way that the equivalence test differs from a standard t-test. Two one-sided t-tests are used to test the null hypothesis. In addition, the test uses a zone of equivalence that defines what size difference between the means you consider to be practically insignificant. <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/equivalence-testing-for-quality-analysis-part-ii-what-difference-does-the-difference-make">We’ll look at that in more detail in my next post</a>.</p>
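<p>For readers who want to see the two one-sided tests spelled out, here is a rough Python sketch that works from the rounded summary statistics in the 2-sample t-test output. It is an illustration under stated assumptions, not a reproduction of Minitab's algorithm: it assumes unequal variances (Welch) with the degrees of freedom truncated to an integer to match the DF = 12 shown, and the helper <code>t_sf</code> is a hypothetical stand-in that integrates the t density numerically. Because the inputs are rounded, the results match the output only approximately.</p>

```python
import math

def t_sf(t, df, steps=2000):
    """Upper-tail probability P(T > t) for Student's t (Simpson's rule)."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    a = abs(t)
    h = a / steps                       # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central = total * h / 3             # P(0 < T < |t|)
    tail = max(0.0, 0.5 - central)      # P(T > |t|)
    return tail if t >= 0 else 1.0 - tail

# Summary statistics as printed in the 2-sample t-test output (rounded).
n_cur, mean_cur, sd_cur = 9, 34.092, 0.261
n_new, mean_new, sd_new = 10, 33.971, 0.581
lel, uel = -0.4, 0.4                    # equivalence limits from the output
alpha = 0.05

diff = mean_new - mean_cur              # Mean(New) - Mean(Current)
v_cur = sd_cur**2 / n_cur
v_new = sd_new**2 / n_new
se = math.sqrt(v_cur + v_new)

# Welch-Satterthwaite degrees of freedom, truncated (assumption, see lead-in).
df = math.floor((v_cur + v_new)**2
                / (v_cur**2 / (n_cur - 1) + v_new**2 / (n_new - 1)))

t_lower = (diff - lel) / se             # tests H0: Difference <= LEL
t_upper = (diff - uel) / se             # tests H0: Difference >= UEL
p_lower = t_sf(t_lower, df)             # evidence that diff is above LEL
p_upper = 1.0 - t_sf(t_upper, df)       # evidence that diff is below UEL

p_max = max(p_lower, p_upper)
print(f"DF = {df}, t = {t_lower:.3f} and {t_upper:.3f}")
print(f"p = {p_lower:.3f} and {p_upper:.3f}")
print("Can claim equivalence" if p_max < alpha else "Cannot claim equivalence")
```

<p>Equivalence is claimed only when both one-sided p-values fall below alpha, which is why the larger of the two drives the conclusion.</p>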
Quick Summary
<p>To choose between an equivalence test and a standard t-test, consider what you hope to prove or demonstrate. Whatever you hope to prove true should be set up as the alternative hypothesis for the test and require the burden of proof. Whatever you deem to be the less harmful incorrect assumption to make should be the null hypothesis. If you’re trying to rigorously prove that two means are equal, or that a mean equals a target value, you may want to use an equivalence test rather than a standard t-test.</p>
Hypothesis Testing, Quality Improvement
Mon, 31 Mar 2014 12:39:00 +0000
http://blog.minitab.com/blog/statistics-and-quality-data-analysis/equivalence-testing-for-quality-analysis-part-i-what-are-you-trying-to-prove
Patrick Runkel
Is the "Madden Curse" Real?
http://blog.minitab.com/blog/the-statistics-game/is-the-madden-curse-real
<p><img alt="Madden" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/d3d5cdc0287be4fb0ae870f30a5e8a8c/madden.JPG" style="float: right; width: 180px; height: 211px; border-width: 1px; border-style: solid; margin: 10px 15px;" />If you like football and you like video games, you must certainly be aware of the “Madden Curse.” Each year, EA Sports releases a new version of Madden, a video game based on the National Football League. Each version of Madden features a different NFL player on the cover of the game. And it seems that each year, the player featured on the cover gets hurt or has a terrible season. Thus, the “Madden Curse” was born.</p>
<p>As a statistician, I’m always skeptical of these things. When people make judgments based on their own perception and not on data, it can be easy to think you see trends that aren’t really there. We tend to remember the cases that support our point of view (Michael Vick breaking his leg after being on the cover) and forget cases that don’t support our argument (Calvin Johnson setting the single season receiving record after being on the cover).</p>
<p>But I’ll humor the Madden curse theorists and perform a data analysis to see if a curse might indeed be real. And if it does exist, perhaps we can even figure out why!</p>
Are Madden-Featured Players Getting Hurt More Often?
<p>We already mentioned that Michael Vick broke his leg the season after he appeared on the cover of Madden. He missed 11 regular season games that year. So could the curse be that the featured players get injured and miss a lot of games the next season?</p>
<p>For all 16 players who have been on the cover since 1999, I gathered the number of games they played the season <em>before</em> being on the cover, and the number of games they played after. You can follow along by getting the data <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/8496aa39b421f1c2512d539d22ce3c3f/maddencurse.MTW">here</a>. Don't already have Minitab? You can get a <a href="http://it.minitab.com/en-us/products/minitab/free-trial.aspx">30-day trial version</a>.</p>
<p>First, I ran a Paired t-test on the two groups.</p>
<p><img alt="Paired t test" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/0b3a97cc7c870b2597f44eafaa1e67bc/paired_t_injuries_w640.gif" style="width: 640px; height: 203px;" /></p>
<p>We see that on average, players played about 2 fewer games the season after being on the cover of Madden. This difference is statistically significant at the α = 0.10 level. So is the curse real?</p>
Statistically Significant and Practically Significant
<p>Just because a difference is <a href="http://blog.minitab.com/blog/the-stats-cat/sample-size-statistical-power-and-the-revenge-of-the-zombie-salmon-the-stats-cat">significant doesn’t mean that difference is practical</a> to your situation. So when it comes to the Madden curse, we have to ask ourselves “Is a difference of 2 games <em>really</em> a curse?” Sure, Vick’s injury was bad, but it wasn’t typical. The only other players besides Vick to play in fewer than 10 games after being on the cover were Troy Polamalu (5 games in 2009) and Donovan McNabb (9 games in 2004). On the flip side, 10 of the 16 featured players played in at least 14 games the next season. That doesn’t sound like much of a curse to me.</p>
<p>And keep in mind that EA is not going to put anybody on the cover of Madden who was injured the <em>previous</em> season. The outliers of Vick, McNabb, and Polamalu pull down the average of the entire “appeared-on-the-cover” group. That doesn’t happen the season before you’re on the cover. Fifteen players played in at least 13 games the season before being on the cover, and 10 of them played all 16. This means the “before” group is artificially inflated.</p>
<p>So yes, players <em>are </em>playing fewer games the season after being on the cover. But on average, it’s only 2 fewer games. Featured players aren’t experiencing season-ending injuries year after year. When you consider the <em>practical difference</em>, I would say 2 games is so small that there isn’t any curse going on.</p>
Are Players Performing Poorly?
<p>So if injuries don’t appear to be the curse, maybe players have a worse season the year after being on the cover. Assessing this becomes a little tricky because the players on the cover play a variety of positions, including two defensive players. So we need a statistic that can represent the value of players at different positions. Pro-Football-Reference has a statistic called <a href="http://www.sports-reference.com/blog/approximate-value-methodology/">Approximate Value</a> (AV). It’s a metric that puts a single numerical value on any player’s season, at any position, from any year. I know it’s not well known, but I’m not aware of any other statistic that can represent the value of Ray Lewis, Eddie George, <em>and</em> Michael Vick. For our purposes, it’ll do just fine.</p>
<p>I took each player's AV the season before they were on the cover and the season after. Then I did another paired t-test.</p>
<p><img alt="Paired t test" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/537b93e10b6b50ffcc1ff031072bf180/paired_t_1_season_w640.gif" style="width: 640px; height: 201px;" /></p>
<p>There is a difference of almost 5, and this difference is statistically significant since the p-value is less than 0.05. To give you some perspective on AV, during his record-setting year in 2012, Calvin Johnson had an AV of 14. So the average of 15 that the “Before” group has is pretty darned good (in 2007 Tom Brady had a 24, so I'm guessing that's about the max). In comparison, Bills receiver Steve Johnson had a 9 last year, and 49ers receiver Michael Crabtree had a 10. I think it’s safe to say a difference between 10 and 15 is practically significant.</p>
<p>This means there <em>is </em>a curse! Players on the cover of Madden perform worse the season after being on the cover. There is proof that the curse exists!</p>
<p>Or is there?</p>
<p>Let’s look at how players perform <em>two </em>seasons before they are on the cover of Madden. We’ll compare that to how they perform the season after being on the cover.</p>
<p><img alt="Pairted t test" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/83b608654e754bc946758ef9dae78484/paired_t_2_season_w640.gif" style="width: 640px; height: 210px;" /></p>
<p>There is pretty much no difference in a player’s performance two seasons before being on the cover, and the season after being on the cover.</p>
<p>Okay, what if we go back three years before they were on the cover?</p>
<p>(If you've noticed the different sample sizes in these paired t-tests, it’s because a few of the players were so young that they were not even in the NFL 2 or 3 seasons before being on the cover of Madden.)</p>
<p><img alt="Paired t test" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/b1764251e4cd6ac3cbc1fa9324a25156/paired_t_3_season_w640.gif" style="width: 640px; height: 200px;" /></p>
<p>Almost the exact same thing! There is really no difference in a player’s performance 3 seasons before being on the cover, and the season after. The only season that stands out is the one directly before they were chosen to be on the cover of Madden.</p>
<p>Now the curse makes sense. It’s a simple case of <a href="http://blog.minitab.com/blog/fun-with-statistics/fantasy-studs-and-regression-to-the-mean">regression to the mean</a>!</p>
What Is Regression to the Mean?
<p>Think of a roulette wheel where half of the spaces are black, and the other half are red. And now imagine a set of 16 spins where red comes up 75% of the time. In the next set of 16 spins, we would expect the average to regress back to 50% red and 50% black. This is regression to the mean.</p>
<p>Note that regression to the mean does <em>not</em> mean we would expect a set of 16 spins where we had 75% black to “even out” the previous set that had more red. We would just expect the results to return to the average, which is 8 red and 8 black.</p>
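<p>If you'd like to convince yourself, a small simulation makes the point. The Python sketch below is purely illustrative: the seed, the number of trials, and the choice to count any set with at least 12 reds (75% or more) as a "hot" set are all assumptions of mine. It spins pairs of 16-spin sets, keeps only the pairs whose first set was hot, and looks at what happened in the second set.</p>

```python
import random

rng = random.Random(2013)   # fixed seed so the run is repeatable
SPINS = 16                  # spins per set, as in the example above
TRIALS = 20_000             # number of back-to-back pairs of sets

def reds_in_set():
    """Number of reds in one set of 16 fair red/black spins."""
    return sum(rng.random() < 0.5 for _ in range(SPINS))

next_after_hot = []
for _ in range(TRIALS):
    first = reds_in_set()
    second = reds_in_set()
    if first >= 12:                     # a "hot" set: at least 75% red
        next_after_hot.append(second)

avg_next = sum(next_after_hot) / len(next_after_hot)
print(f"hot sets found: {len(next_after_hot)}")
print(f"average reds in the following set: {avg_next:.2f} of {SPINS}")
```

<p>The set that follows a hot set should average close to 8 reds, not fewer. The wheel doesn't compensate for the streak; it simply goes back to behaving like itself.</p>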
<p>Now let’s apply this thinking to the Madden curse. We see that 3 seasons before being on the cover, the players as a group have an AV of about 11. It stays in the 10-11 range the next season, too. Then all of a sudden it jumps up to almost 15 the season before they make the Madden cover, only to return to the “average” the following season.</p>
<p>It doesn’t take a Six Sigma Black Belt to see what is going on here.</p>
<p>Madden is selecting players who had outstanding seasons the previous year. But just like a roulette wheel might have a run where it comes up red 75% of the time, the outstanding performance by the players who appear on the cover is not sustainable. So the year after they're featured they don’t perform as well as they did the year before, and it looks like they’re cursed. In reality, they’re simply playing back at the same level they were before their outstanding season. They’re just regressing to the mean, and it would have happened whether they appeared on the cover of Madden or not.</p>
<p>So before you start believing in curses, try a statistical analysis first. Odds are you’ll find a perfectly reasonable explanation!</p>
Fun Statistics, Hypothesis Testing, Learning, Regression Analysis
Fri, 11 Oct 2013 13:08:00 +0000
http://blog.minitab.com/blog/the-statistics-game/is-the-madden-curse-real
Kevin Rudy
Sam Ficken and the Danger of Small Sample Sizes
http://blog.minitab.com/blog/the-statistics-game/sam-ficken-and-the-danger-of-small-sample-sizes
<p>
<img alt="Sam Ficken" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/c046562d831b68c2637b30f783c28089/ficken_normal.jpg" style="float: right; width: 250px; height: 155px; border-width: 1px; border-style: solid; margin: 10px 15px;" />On Saturday, September 8, 2012, Penn State football player Sam Ficken had a kicker’s worst nightmare. Playing against Virginia, he missed 4 field goals, including the potential game-winner as the game ended. To add injury to insult, he also had an extra point blocked.</p>
<p>
Penn State lost the game by a single point.</p>
<p>
	At that point in his career, Ficken had made 2 out of his 7 field goal attempts. That equals about a 29% success rate, which is terrible for kickers. Many called for Ficken to be benched. He was harassed on Twitter (to put it mildly). And a Penn State soccer player even made a YouTube video of himself kicking field goals at the Nittany Lion practice facility, prompting many fans to suggest that Coach Bill O’Brien give him a tryout. It was pretty apparent Ficken just wasn’t a good kicker.</p>
<p>
Or was it?</p>
<p>
	Flip a coin 7 times. If tails comes up only twice, are you going to conclude that the coin is “biased” towards heads? Of course not. You simply had an unlikely outcome (the coin coming up heads 71% of the time) because 7 tosses is a very small sample size. Now, kicking field goals is a lot different than flipping a coin, but the same idea applies. So let’s do a data analysis on Ficken’s field goal percentage.</p>
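<p>Just how unlikely is that outcome? A quick calculation, sketched here in Python with variable names of my own choosing, adds up the chances of getting 0, 1, or 2 tails in 7 flips of a fair coin.</p>

```python
from math import comb

n = 7  # number of coin flips

# P(tails <= 2) for a fair coin: count the outcomes with 0, 1, or 2 tails
# and divide by the 2**7 equally likely sequences.
p_two_or_fewer = sum(comb(n, k) for k in range(3)) / 2**n

print(f"P(2 or fewer tails in {n} flips) = {p_two_or_fewer:.3f}")
```

<p>That works out to 29 chances in 128, or roughly 23%. A perfectly fair coin will do this about one time in four or five, which is nowhere near rare enough to call the coin biased.</p>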
<p>
	<strong>NOTE:</strong> I’m going to use a 1 Proportion analysis, which assumes the probability of each observation is the same. Obviously this isn’t true for field goals. Distance, weather conditions, and altitude all affect the probability of the kicker making the goal. Even the opponent can affect the probability: your odds aren’t as good if <a href="http://youtu.be/83w4SRzSg7c?t=3m49s">LaVar Arrington circa 1999</a> is lining up to block the kick! But I’m really just trying to illustrate the amount of variation that exists in small samples, not trying to accurately gauge Ficken’s true field goal percentage. So for purely illustrative purposes, I’m going to use the 1 Proportion analysis anyway...just take the statistics with a grain of salt if you were hoping for a comprehensive review that includes all possible factors.</p>
How Confident Can We Be in Ficken’s Field Goal Percentage?
<p>
	After the Virginia game Ficken was 2 for 7 (29%) on field goal attempts. Using these numbers, we can use Minitab’s 1 Proportion analysis to create a confidence interval. This confidence interval will give us a range of likely values for the percentage of kicks that Ficken will make going forward. That is, it gives us an idea of how confident we can be that these 7 kicks represent Ficken’s true kicking percentage.</p>
<p>
<img alt="Minitab's One Proportion Analysis" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/070b0042d03e59b4c894a8438c7017db/1_propotion_2_for_7.gif" style="width: 514px; height: 166px;" /></p>
<p>
	The confidence interval tells us that we can be 95% confident that Ficken’s true field goal percentage is between 3.7% and 71%. That range is so large that it’s pretty much worthless! So anybody trying to make an accurate assessment of Ficken’s ability based on those 7 kicks is doing nothing other than guessing. Moreover, the range actually increases if you look at only the 5 kicks in the Virginia game (which many people did)!</p>
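<p>If you don't have Minitab handy, an interval like this can be approximated with a few lines of Python. The sketch below assumes the interval is the exact (Clopper-Pearson) one, which is my reading of the output rather than something stated in this post, and the function names are hypothetical. It finds each bound by bisection on the binomial distribution, using only the standard library.</p>

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def solve(f, lo=0.0, hi=1.0, iters=60):
    """Bisection for a root of a decreasing f with f(lo) > 0 > f(hi)."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_ci(successes, n, conf=0.95):
    """Clopper-Pearson (exact) confidence interval for a proportion."""
    alpha = 1 - conf
    # Lower bound: the p at which P(X >= successes) equals alpha / 2.
    lower = 0.0 if successes == 0 else solve(
        lambda p: alpha / 2 - (1 - binom_cdf(successes - 1, n, p)))
    # Upper bound: the p at which P(X <= successes) equals alpha / 2.
    upper = 1.0 if successes == n else solve(
        lambda p: binom_cdf(successes, n, p) - alpha / 2)
    return lower, upper

# Ficken after the Virginia game (2 of 7) and over his career so far (22 of 31).
for made, tried in [(2, 7), (22, 31)]:
    low, high = exact_ci(made, tried)
    print(f"{made} for {tried}: {low:.1%} to {high:.1%}")
```

<p>For 2 makes in 7 attempts, the result should land close to the 3.7% and 71% reported above. Trying other counts is a quick way to see how slowly the interval tightens as attempts accumulate.</p>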
Ficken’s Career Since the Virginia Game
<p>
If there's one person who could accurately judge Ficken, it's Penn State Coach Bill O’Brien. He'd seen plenty of Ficken kicks in practice, and had a lot more than 7 observations to make his decision on. And he decided to stick with Ficken as his kicker.</p>
<p>
Boy, has that decision paid off.</p>
<p>
Since the Virginia game, Ficken has made 20 of 24 field goal attempts. He hit a Penn State record of 15 field goals in a row, and also made a 54-yard field goal--a Penn State home record--<em>in the rain</em>. In his career, Ficken is now 22 for 31 on field goal attempts, good for 71%.</p>
<p>
And wouldn’t you know it, that equals the upper bound from the 95% confidence interval we created earlier! </p>
<p>
	Clearly, Ficken is a better kicker than his first few attempts showed. And considering where he must have been mentally after the Virginia game, it’s a great story to see him bounce back and perform so well. But, statistically speaking, how good can we <em>really</em> claim he is? Since we now have another 24 observations, let’s combine them with the original 7 and calculate an updated 95% confidence interval for Ficken’s field goal percentage.</p>
<p>
<img alt="Minitab's One Proportion Analysis" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/67b6a57fa13e63a1298b7c2a9f190cf6/1_propotion_22_for_31.gif" style="width: 523px; height: 159px;" /></p>
<p>
Now that we have more observations, we can narrow down Ficken’s true ability much better. The new lower bound for the interval (52%) is nowhere close to the 29% that Ficken made in his first 7 attempts.</p>
<p>
But the confidence interval is still pretty wide, with a range of about 34%. There is a chance his true field goal percentage is in the 50% range, which would put him among the worst kickers in the country!</p>
<p>
How big of a sample size do we need in order to really be confident in Ficken’s abilities?</p>
How Many Kicks Do We Need?
<p>
To answer that question, first we need to decide how “narrow” we want our confidence interval to be. This is the same thing as determining the margin of error. For example, let’s use Ficken’s current field goal percentage of 71%. If the margin of error were 5%, our confidence interval would range from 66% to 76%.</p>
<p>
	But instead of picking just one, let’s use a few margins of error to compare the different sample sizes needed for each one. We’ll use margins of error of 10%, 5%, and 1%. Then we can use Minitab’s Sample Size for Estimation analysis to get the sample sizes.</p>
<p>
<img alt="Sample Size for Estimation" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/0984a4d9e48004d90c9311a3bfe5b77a/sample_size.gif" style="width: 305px; height: 328px;" /></p>
<p>
To obtain a margin of error of 10% (which is still pretty wide) we would need 99 kicks. It skyrockets to 359 for 5%, and becomes an unattainable 8,129 kicks for a 1% margin of error! To put that in perspective, former Penn State kicker Kevin Kelly was the starter at Penn State for 4 years, and attempted only 107 field goals. And Sebastian Janikowski is in his 14th year of kicking in the NFL, and has only 409 attempts.</p>
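<p>For a back-of-the-envelope check, the textbook normal-approximation formula n = z² p(1 − p) / E² can be coded in a few lines. This is a swapped-in approximation, not the calculation Minitab performs, and the function name is hypothetical. With p = 0.71 it gives about 80, 317, and 7,910 kicks, which is smaller than the figures above, presumably because Minitab's analysis uses a different, more conservative interval method.</p>

```python
from math import ceil
from statistics import NormalDist

def kicks_needed(p, margin, conf=0.95):
    """Sample size so a normal-approximation CI has the given half-width."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # about 1.96 for 95%
    return ceil(z**2 * p * (1 - p) / margin**2)

# Planning value of 71%, Ficken's career percentage from the text.
for margin in (0.10, 0.05, 0.01):
    print(f"margin {margin:.0%}: about {kicks_needed(0.71, margin)} kicks")
```

<p>Either way, the pattern is the same: because the margin of error is squared in the denominator, cutting it in half roughly quadruples the number of kicks you need.</p>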
<p>
Your average college kicker will get between 20 and 30 field goal attempts per year. And unless you’re a four-year starter, you’re not getting close to 99 kicks for your career. That means for a college kicker, even if every field goal attempt has the same probability of being made (which it doesn’t), we still have a pretty wide margin of error when determining just how accurate the kicker is.</p>
<p>
So when you want to make claims based on statistics, make sure you have a sufficiently large sample. And that’s not just in the world of sports. Sample sizes are important for everything from determining the net weight of the cereal in packages to <a href="http://blog.minitab.com/blog/adventures-in-statistics/using-hypothesis-tests-to-bust-myths-about-the-battle-of-the-sexes">Mythbusters determining whether women are better at multitasking</a>. If you don’t have a large enough sample, your conclusions might be meaningless. To find proof, you need only look at Sam Ficken.</p>
<p>
<em style="border: 0px; margin: 0px; padding: 0px; font-size: 10px; color: rgb(90, 90, 90); font-family: Verdana; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 20px; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255);">Photo by PennStateNews. Licensed under Creative Commons Attribution ShareAlike 2.0.</em></p>
Hypothesis TestingStatisticsStatistics HelpFri, 27 Sep 2013 13:59:00 +0000http://blog.minitab.com/blog/the-statistics-game/sam-ficken-and-the-danger-of-small-sample-sizesKevin RudyUsing Hypothesis Tests to Bust Myths about the Battle of the Sexes
http://blog.minitab.com/blog/adventures-in-statistics/using-hypothesis-tests-to-bust-myths-about-the-battle-of-the-sexes
<p><img alt="Mythbusters title screen" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/95c8e44114a7378203ad3b70a9fac3a5/mythbusters_title_screen.jpg" style="border-width: 1px; border-style: solid; margin: 10px 15px; width: 300px; height: 169px; float: right;" />In my home, we’re huge fans of <a href="http://dsc.discovery.com/tv-shows/mythbusters" target="_blank">Mythbusters</a>, the show on Discovery Channel. This fun show mixes science and experiments to prove or disprove various myths, urban legends, and popular beliefs. It’s a great show because it brings the scientific method to life. I’ve written about Mythbusters <a href="http://blog.minitab.com/blog/adventures-in-statistics/busting-the-mythbusters-are-yawns-contagious">before</a> to show how, without proper statistical analysis, it’s difficult to know when a result is statistically significant. How much data do you need to collect and how large does the difference need to be?</p>
<p>For this blog, let's look at a more recent Mythbusters episode, “Battle of the Sexes – Round Two.” I want to see how they’ve progressed with handling sample size. There are some encouraging signs: during the show, Adam Savage, one of the hosts, explains, “Sample size is everything in science; the more you have, the better your results.”</p>
<p>To paraphrase the show, here at Minitab, we don’t just talk about the hypotheses; we put them to the test. We’ll use two different hypothesis tests and <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/c4c2c86825987c371555b86c69216829/battlesexes.MTW">this worksheet</a> to determine whether:</p>
<ul>
<li>Women are better at multitasking</li>
<li>Men are better at parallel parking</li>
</ul>
Are Women Better Multitaskers?
<p>The Mythbusters wanted to determine whether women are better multitaskers than men. To test this, they had 10 men and 10 women perform a set of tasks that required multitasking in order to have sufficient time to complete all of the tasks. They use a scoring system that produces scores between 0 and 100.</p>
<p>The women end up with an average of 72, while the men average 64. The Mythbusters conclude that this 8 point difference confirms the myth that women are better multitaskers. Does statistical analysis agree?</p>
The statistical perspective
<p>The average scores are based on samples rather than the entire population of men and women. Samples contain error because they are a subset of the entire population. Consequently, a sample mean and the corresponding population mean are likely to be different. It’s possible that if we reran the experiment, the sample results could change.</p>
<p>We want to be reasonably sure that the observed difference between samples actually represents a <em>true </em>difference between the entire population of men and women. This is where hypothesis tests play a role.</p>
Choosing the correct hypothesis test
<p>Because we want to compare the means between two groups, you might think that we’ll use the 2-Sample t test. However, based on a Normality Test, these data appear to be nonnormal.</p>
<p>The 2-Sample t test is robust to nonnormal data when each sample has at least 15 subjects (30 total). However, our sample sizes are too small for this test to handle nonnormal data. Therefore, we can’t trust the p-value calculated by the 2-Sample t test for these data.</p>
<p>Instead, we’ll use the nonparametric Mann-Whitney test, which compares the medians. Nonparametric tests have fewer requirements and are particularly useful when your data are nonnormal and you have small sample sizes. We’ll use a one-tailed test to determine whether the median multitasking score for women is greater than the median men’s score.</p>
<p>To run the test in Minitab statistical software, go to: <strong>Stat > Nonparametrics > Mann-Whitney</strong></p>
The Mann-Whitney test results
<p style="margin-left: 40px;"><img alt="Mann-Whitney test results" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/84b08eafb0b4c868443466396060b3a6/mann_whitney.gif" style="width: 443px; height: 196px;" /></p>
<p>The p-value of 0.1271 is greater than 0.05, which indicates that the women’s median is not significantly greater than the men’s median. Further, the 95% confidence interval suggests that the median pairwise difference is likely between -9.99 and 30.01. Because the confidence interval includes both positive and negative values, it would not be surprising to repeat the experiment and find that <em>men </em>had the higher median!</p>
<p>The Mythbusters looked at the sample means and “Confirmed” the myth. However, the data do not support the conclusion that women have a higher median score than men.</p>
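<p>If you don't have Minitab handy, the logic behind the Mann-Whitney test can be sketched in a few lines of Python. This is only a rough illustration: the scores below are hypothetical placeholders (not the Mythbusters data from the worksheet), the function name is made up for this example, and the p-value uses a normal approximation with no tie correction, so it will not exactly match Minitab's output.</p>

```python
from math import sqrt
from statistics import NormalDist


def mann_whitney_greater(x, y):
    """One-sided Mann-Whitney test of whether x tends to be larger than y.

    Uses the normal approximation with a continuity correction and no tie
    correction, so the p-value is approximate for small samples.
    """
    n1, n2 = len(x), len(y)
    # U counts the (x, y) pairs in which x wins; a tie counts as half a win.
    u = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u - 0.5) / sd_u
    p_value = 1 - NormalDist().cdf(z)  # upper tail only
    return u, p_value


# Hypothetical scores, for illustration only.
women = [58, 62, 65, 70, 72, 75, 78, 80, 84, 90]
men = [50, 55, 60, 61, 64, 66, 68, 70, 73, 77]

u, p = mann_whitney_greater(women, men)
print(f"U = {u}, one-sided p = {p:.4f}")
```

<p>Whatever this toy data set shows says nothing about the real experiment; the point is only that the test works with how often one group's scores beat the other's, not with the means.</p>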
Power analysis to determine sample size
<p>If the Mythbusters were to perform this experiment again, how many subjects should they recruit? For a start, if they recruit at least 15 subjects per group, they can use the more powerful 2-Sample t test.</p>
<p>I’ll perform a <a href="http://www.minitab.com/en-us/Support/Tutorials/Using-Power-and-Sample-Size-Tools-with-Power-Curves/" target="_blank">power analysis</a> for a 2-sample t test to estimate a good sample size based on the following:</p>
<ul>
<li>I’ll assume that the difference must be at least 10 points to be practically meaningful.</li>
<li>I want to have an 80% chance of detecting a meaningful difference if it exists.</li>
<li>I’ll use the sample standard deviation.</li>
</ul>
<p>In Minitab, go to <strong>Stat > Power and Sample Size > 2-Sample t</strong> and fill in the dialog as follows:</p>
<p><img alt="Power and sample size for 2-sample t dialog" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/3f2600ae2a0ee6cfca95350164ff7dfc/pss_dialog.gif" style="width: 334px; height: 256px;" /></p>
<p>Under Options, choose <strong>Greater than</strong>, and click <strong>OK</strong> in all dialogs.</p>
<p style="margin-left: 40px;"><img alt="Power and sample size results for 2-sample t test" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/11f631412f296a365c93ed2e94c57f05/pss2t.gif" style="width: 360px; height: 225px;" /></p>
<p>The output shows that we need 29 subjects per group, for a total of 58, to have a reasonable chance of detecting a meaningful difference, if that difference actually exists between the two populations.</p>
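<p>For a rough cross-check of a sample size like this, the textbook normal-approximation formula can be coded directly. Treat this as a sketch under assumptions: the standard deviation of 15 below is a hypothetical placeholder (the actual sample standard deviation is in the dialog above), and Minitab's calculation is based on the t distribution, so it tends to ask for slightly more subjects than this approximation does.</p>

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a one-sided 2-sample comparison.

    Normal approximation: n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)  # one-sided test
    z_beta = std_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


# delta = 10 points and power = 0.80 come from the list above;
# sigma = 15 is a hypothetical value chosen only for illustration.
print(n_per_group(delta=10, sigma=15))
```

<p>Halving the difference you want to detect roughly quadruples the required sample size, which is why deciding on a practically meaningful difference up front matters so much.</p>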
Are Men Better at Parallel Parking?
<p>The Mythbusters also wanted to determine whether men are better at parallel parking than women. They devised a test that produces scores between 0 and 100. At first glance, this appears to be a scenario similar to the multitasking myth, where we’d compare means or medians. However, the means and medians are virtually identical and are not significantly different according to any test.</p>
<p style="margin-left: 40px;"><img alt="Descriptive statistics for parallel parking by gender" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/8466a72e71eafb6b6f8bfb2f7b3c12fe/parking_desc.gif" style="width: 288px; height: 86px;" /></p>
<p>There’s a different story behind this myth. During the parking test, the hosts notice that the women’s scores seem more variable than the men’s. The women are either really good or really bad, while men are somewhere in between, as you can see below.</p>
<p><img alt="Individual value plot of parallel parking scores by gender" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/568a85406da8584ce314ab4fb3ba4f3b/ivp_parking.gif" style="width: 400px; height: 267px;" /></p>
<p>We want to be reasonably sure that the observed difference in variability actually represents a true difference between the populations. We need to use the correct hypothesis test, which is Two Variances (<strong>Stat > Basic Statistics > 2 Variances</strong>). The test results are below:</p>
<p style="margin-left: 40px;"><img alt="Two variances test results for parallel parking by gender" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/3f1bcb84beaf871e73567d45c6dd5068/2var_parking.gif" style="width: 431px; height: 104px;" /></p>
<p>The null hypothesis is that the variability in both groups is equal. Because the p-value (0.000) is less than 0.05, we can reject the null hypothesis and conclude that women’s scores for parallel parking are more variable than men’s scores.</p>
<p>The Mythbusters correctly busted this myth because the means and medians are essentially equal. We can't conclude that one gender is better at parallel parking than the other.</p>
<p>However, we <em>can</em> conclude that men are more <em>consistent </em>at parallel parking than women.</p>
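<p>To see how a comparison of variability can work without any special software, here is a sketch of a permutation test on the ratio of two sample variances. Note that this is a swapped-in technique for illustration, not the method Minitab's 2 Variances test uses, and the scores are hypothetical values invented to mimic the pattern described above (one tight group, one group split between low and high scores).</p>

```python
import random
from statistics import mean, variance


def variance_permutation_test(a, b, n_perm=5000, seed=1):
    """Two-sided permutation test for a difference in variability.

    Each group is centered on its own mean first, so only spread (not
    location) is compared. Assumes neither shuffled group is constant.
    """
    rng = random.Random(seed)
    mean_a, mean_b = mean(a), mean(b)
    dev_a = [v - mean_a for v in a]
    dev_b = [v - mean_b for v in b]

    def ratio(x, y):
        vx, vy = variance(x), variance(y)
        return max(vx, vy) / min(vx, vy)

    observed = ratio(dev_a, dev_b)
    pooled = dev_a + dev_b
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ratio(pooled[:len(a)], pooled[len(a):]) >= observed:
            extreme += 1
    # The add-one correction keeps the estimate away from exactly zero.
    return observed, (extreme + 1) / (n_perm + 1)


# Hypothetical parking scores, for illustration only.
men = [55, 58, 60, 62, 63, 65, 66, 68, 70, 72]
women = [20, 25, 30, 35, 40, 88, 90, 95, 98, 100]

observed_ratio, p_value = variance_permutation_test(men, women)
print(f"variance ratio = {observed_ratio:.1f}, p = {p_value:.4f}")
```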
Closing Thoughts
<p>In one of <a href="http://dsc.discovery.com/tv-shows/mythbusters/videos/m5-aftershows.htm" target="_blank">their online videos</a>, Adam and Jamie explain that they understand the importance of sample size. However, Adam states that the Mythbusters put more effort into the methodology of collecting good data. It’s true, they are great at reducing sources of variation, obtaining accurate measurements, etc. He goes on to explain that they just don’t have the resources to obtain larger sample sizes. Fair enough—for a television show.</p>
<p>However, if you’re in science or Six Sigma, you don’t have this luxury. You must:</p>
<ul>
<li>Have a good methodology for collecting data</li>
<li>Have a sufficient sample size</li>
<li>Use the correct statistical analysis</li>
</ul>
<p>Without all of the above, you risk drawing incorrect conclusions.</p>
Fun StatisticsHypothesis TestingLearningThu, 05 Sep 2013 11:00:00 +0000http://blog.minitab.com/blog/adventures-in-statistics/using-hypothesis-tests-to-bust-myths-about-the-battle-of-the-sexesJim FrostA correspondence table for non parametric and parametric tests
http://blog.minitab.com/blog/applying-statistics-in-quality-projects/a-correspondence-table-for-non-parametric-and-parametric-tests
<p>Much of the data that one can collect and analyze follows, at least approximately, a normal distribution (the famous bell-shaped curve). In fact, the formulae and calculations used in many analyses simply take it for granted that our data follow this distribution; statisticians call this the "assumption of normality."</p>
<p>For example, our data need to meet the normality assumption before we can accept the results of a one- or two-sample t (Student) or z test. Therefore, it is generally good practice to run a normality test before performing the hypothesis test.</p>
<p>But wait...according to the <a href="http://blog.minitab.com/blog/understanding-statistics/how-the-central-limit-theorem-works">Central Limit Theorem</a>, when the sample size is at least 30, normality is not a crucial prerequisite for a standard t (Student) or z hypothesis test: even though the individual values within a sample might follow an unknown, non-normal distribution, the sample <em>means </em>will follow an approximately normal distribution.</p>
<p> <img alt="Central Limit Theorem" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/e6c8ecfc6a97487a48390e09c37b6997/central_limit_theorem.JPG" style="width: 579px; height: 204px" /></p>
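<p>A quick simulation makes the Central Limit Theorem more concrete. The sketch below draws strongly right-skewed (exponential) values, then compares the skewness of the raw values with the skewness of means of samples of 30. The distribution, seed, and sample counts are arbitrary choices made for this illustration.</p>

```python
import random
from statistics import mean, pstdev


def skewness(values):
    """Sample skewness: about 0 for a symmetric, bell-shaped distribution."""
    m = mean(values)
    s = pstdev(values)
    return mean(((v - m) / s) ** 3 for v in values)


rng = random.Random(42)  # fixed seed only so the run is reproducible

# Individual values from a strongly right-skewed distribution.
raw = [rng.expovariate(1.0) for _ in range(20_000)]

# Means of 5,000 samples, each containing 30 such values.
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(30)]) for _ in range(5_000)
]

print(f"skewness of raw values:   {skewness(raw):.2f}")
print(f"skewness of sample means: {skewness(sample_means):.2f}")
```

<p>The sample means are far more symmetric than the individual values, though not perfectly so; with very skewed data, a sample size of 30 gets you close to normal rather than all the way there.</p>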
<p> </p>
<p>Moreover, some tests are more robust to departures from normality. For example, if you use the <a href="http://blog.minitab.com/blog/real-world-quality-improvement/using-minitab-to-curb-supplier-defects">Minitab Assistant</a>, a 2-sample t test requires only 15 values per sample. If the sample size is at least 15, normality is not an issue and the test is accurate even with non-normal data. Again, in the Minitab Assistant, a one-sample t test requires only 20 values in the sample. The reason for this is that the tests that are available in the Minitab Assistant have been modified in order to make them more robust to departures from normality.</p>
<p>What can you do when your sample sizes are smaller than these thresholds and your data are not normally distributed? The remaining option is a nonparametric test, which does not assume that the data follow a particular theoretical distribution. So when all other options are exhausted, you can still fall back on a nonparametric test.</p>
<p>In the service sector, for example, durations are often analyzed to improve processes (reduce waiting times, queuing times, lead times, payment times, faster replies to customer requests…). How long we wait for something is an important aspect of the customer experience, and ultimately influences customer satisfaction. Typically, duration times will not follow a normal distribution.</p>
<p><img alt="Non Normal distribution" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/ed1298a9d0684a8548e92dfe06eb9db7/non_normal_distribution.JPG" style="width: 575px; height: 385px; border-width: 1px; border-style: solid;" /></p>
<p>The P value in the probability plot above is smaller than 0.05, indicating that the data points do not follow a normal distribution. We can see a very significant curvature in the normal probability plot, and the points clearly do not follow the normal probability line. The histogram shows that the distribution is highly skewed to the right; also, the sample size is quite small (14).</p>
<p>This data set is an ideal candidate for a nonparametric approach.</p>
<p>But which nonparametric test do we need to use in this situation? The correspondence table below shows how each nonparametric test (in Minitab, choose <strong>Stat > Nonparametrics</strong>) is related to a parametric test. This table provides a guideline for choosing the most appropriate nonparametric test in each case, along with the main characteristics of each nonparametric test.</p>
<p><img alt="Correspondence table" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/8adb7db42f9c0baf84fb7d139f8bd6ed/correspondence_table_for_parametric_tests_and_non_parametric_tests_w640.gif" style="width: 640px; height: 480px;" /></p>
<p> </p>
Data AnalysisHypothesis TestingQuality ImprovementStatisticsStatistics HelpStatsTue, 27 Aug 2013 12:00:00 +0000http://blog.minitab.com/blog/applying-statistics-in-quality-projects/a-correspondence-table-for-non-parametric-and-parametric-testsBruno ScibiliaThe Gentleman Tasting Coffee: A Variation on Fisher’s Famous Experiment
http://blog.minitab.com/blog/statistics-in-the-field/the-gentleman-tasting-coffee-a-variation-on-fishers-famous-experiment
<p><em>by Matthew Barsalou, guest blogger</em></p>
<p><img alt="coffee" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/d7ff51df-2032-4f24-b2b4-eb0eb76d0b01/Image/9f879a3a0ec229096d04f517f01ac3e9/coffee.gif" style="border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 200px; height: 200px;" />In the 1935 book <em><a href="http://en.wikipedia.org/wiki/The_Design_of_Experiments" target="_blank">The Design of Experiments</a>,</em> Ronald A. Fisher used the example of a lady tasting tea to demonstrate basic principles of statistical experiments. In Fisher’s example, a lady made the claim that she could taste whether milk or tea was poured first into her cup, so Fisher did what any good statistician would do—he performed an experiment.</p>
<p>The lady in question was given eight random combinations of cups of tea with either the tea poured first or the milk poured first. She was required to divide the cups into two groups based on whether the milk or tea was poured in first. Fisher’s presentation of the experiment was not about the tasting of tea; rather, it was a method to explain the proper use of statistical methods.</p>
<p>Understanding how to properly perform a statistical experiment is critical, whether you're using a data analysis tool such as Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a> or performing the calculations by hand.</p>
The Experiment
<p>A poorly performed experiment can do worse than just provide bad data; it could lead to misleading statistical results and incorrect conclusions. A variation on Fisher’s experiment could be used for illustrating how to properly perform a statistical experiment. Statistical experiments require more than just an understanding of statistics. An experimenter must also know how to plan and carry out an experiment.</p>
<p>A possible variation on Fisher’s original experiment could be performed using a man tasting coffee made with or without the addition of sugar. The objective is not actually to determine if the hypothetical test subject could indeed determine if there is sugar in the coffee, but to present the statistical experiment in a way that is both practical and easy to understand. Having decided half of the cups of coffee would be prepared with sugar and half would be prepared without sugar, the next step is to determine the required sample size. The formula for sample size when using a proportion is</p>
<p style="margin-left: 40px;"><img alt="sample size when using proportions formula" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/d7ff51df-2032-4f24-b2b4-eb0eb76d0b01/Image/4f92c583c64275770f15468342898580/sample_size_when_using_proportion_formula.gif" style="width: 140px; height: 70px;" /></p>
<p>In this equation the <em>n </em>is the sample size, <em>p </em>is the probability something will occur and <em>q </em>is the probability it will not occur. <em>Z </em>is the Z score for the selected confidence level and <em>E </em>is the margin of error. We use 0.50 for both <em>p </em>and <em>q </em>because there will be a 50/50 chance of randomly selecting the correct cup. The <em>Z </em>score is based on the alpha (α) level we select for the confidence level; in this case we choose an alpha of 0.05, so that there is a 5% chance of rejecting the null hypothesis when it is actually true. We will use 15% for <em>E</em>. This means the sample size would be:</p>
<p style="margin-left: 40px;"><img alt="sample size calculation" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/d7ff51df-2032-4f24-b2b4-eb0eb76d0b01/Image/f60f9ae7ebeefbda97f6a93694e3e8c6/sample_size_calculation.gif" style="width: 204px; height: 39px;" /></p>
<p>The calculation gives about 42.68 trials. We can’t perform 0.68 of a test, so we round up to the next even whole number, which would mean 44 trials. We would need 22 cups of coffee with sugar and 22 cups of coffee without sugar. That is a lot of coffee, so each cup will hold only 10 ml. There is a risk that different pots of coffee will not be the same as each other due to differences such as the amount of ground coffee used or the cooling of the coffee over time. To counter this, the experimenter would brew one large pot of coffee and then separate it into two containers; one container would receive the sugar.</p>
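<p>The sample size arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function name is made up for this example, and the Z score is computed from the standard normal distribution rather than read from a table, so it is 1.95996... instead of the rounded 1.96.</p>

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(p, margin, confidence=0.95):
    """n = Z^2 * p * q / E^2, returned before rounding."""
    q = 1 - p
    # Two-sided Z score for the confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z ** 2 * p * q / margin ** 2


n_raw = sample_size_for_proportion(p=0.50, margin=0.15)

# Round up, then bump to an even number so that half the cups
# can get sugar and half can go without.
n = ceil(n_raw)
if n % 2:
    n += 1

print(round(n_raw, 2), n)  # 42.68 44
```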
<p>A table is then created to plan the experiment and record the results. The first 22 samples would contain sugar and the next 22 would not. Simply providing the test subject with the cups in the order they are listed would risk the subject realizing that the sugar is in the first half, so randomization is required to ensure the test subject is unaware of which cups contain sugar. Fisher, in <em>The Design of Experiments</em>, referred to randomization as an “essential safeguard.” A random sequence generator can be used to assign the run order to the samples.</p>
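<p>Any random sequence generator will do for the run order. As one possible sketch, Python's built-in <code>random</code> module can shuffle the 44 planned samples; the labels and the seed value here are arbitrary choices for the example (the seed just makes the run order reproducible so it can be written into the planning table).</p>

```python
import random

# Planned samples: the first 22 contain sugar, the next 22 do not.
cups = ["sugar"] * 22 + ["no sugar"] * 22

rng = random.Random(2013)  # arbitrary seed, used only for reproducibility
run_order = cups[:]        # copy, so the planning list stays untouched
rng.shuffle(run_order)

# Show the first few cups in the order they would be served.
for position, contents in enumerate(run_order[:5], start=1):
    print(position, contents)
```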
<p>The accuracy of the results could be increased by using blinding. The experimenter may subconsciously give the test subject signals that could indicate the actual condition of the coffee. This could be avoided by having the cups of coffee delivered by a second person who is unaware of the status of the cups. The use of blinding adds an additional layer of protection to the experiment.</p>
<strong>The Analysis</strong> (by Hand)
<p>Suppose the test subject correctly identified 38 out of 44 samples, which results in a proportion of 0.86. This could have been the result of random chance rather than a real ability to identify the samples, so a one-sample proportion test can be used to evaluate the results. A one-sample proportion test has several requirements that must be met:</p>
<ol>
<li style="margin-left: 0.5in;">The sample size times the probability of an occurrence must be greater than or equal to 5, so: np ≥ 5. We have 44 samples and the chance of a random occurrence is 0.5, so 44 x 0.5 = 22.<br />
</li>
<li style="margin-left: 0.5in;">The sample size times the probability that something will not occur must be greater than or equal to 5, so: nq ≥ 5. We have 44 samples and the chance of a non-occurrence is 0.5, so 44 x 0.5 = 22.<br />
</li>
<li style="margin-left: 0.5in;">The sample size must be large; generally, there should be 30 or more samples. We have 44.</li>
</ol>
<p>All requirements have been met so we can use the one sample hypothesis test to analyze the results. The test statistic is:</p>
<p align="center" style="margin-left: 40px;"><img alt="Z test statistic" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/d7ff51df-2032-4f24-b2b4-eb0eb76d0b01/Image/e4a4178b7de86d7df2a2d1802b4d0131/test_statistic_z.gif" style="width: 121px; height: 72px; float: left;" /></p>
<p align="center"> </p>
<p> </p>
<p>The P represents the actual proportion and P0 represents the hypothesized proportion of the results if they had been random.</p>
<p>We need a null hypothesis and an alternative hypothesis to evaluate. The null hypothesis states that nothing happened, so P = P0. The alternative could be that the two values are not equal; however, this could lead to rejecting the null hypothesis if the gentleman tasting the coffee guessed incorrectly more often than should have happened through chance alone. So we would use P > P0, which means we are using a one-tailed, upper-tail hypothesis test. The resulting hypothesis test would be:</p>
<p style="margin-left: 40px;">Null Hypothesis (H0): P = P0</p>
<p style="margin-left: 40px;">Alternative Hypothesis (Ha): P > P0</p>
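<p>We want a 95% confidence level, so we check a Z score table and determine that the Z score to use is 1.96. (Strictly speaking, 1.96 is the two-tailed critical value; for a one-tailed test at α = 0.05 the critical value is 1.645, so using 1.96 is the more conservative choice.) The null hypothesis would be rejected if the calculated Z value is higher than 1.96. The formula is:</p>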
<p style="margin-left: 40px;"><img alt="formula" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/d7ff51df-2032-4f24-b2b4-eb0eb76d0b01/Image/26f0a300a550ee01b2f6613c9d1e8c68/formula.gif" style="width: 287px; height: 84px;" /></p>
<p>The resulting Z score is greater than 1.96 so we reject the null hypothesis. The rejection region for this test is the red area of the distribution depicted in figure 1. Had the resulting Z score been less than 1.96, we would have failed to reject the null hypothesis when using an α of 0.05.</p>
<p style="margin-left: 40px;"><img alt="rejection region" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/d7ff51df-2032-4f24-b2b4-eb0eb76d0b01/Image/f1216ab5fafc84081aaffc3a3e6354b8/rejection_region.gif" style="width: 500px; height: 252px;" /></p>
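<p>The hand calculation can be double-checked in Python. This sketch uses the unrounded proportion 38/44 rather than 0.86, which is why it gives a slightly different Z value than a calculation with rounded inputs; the function name is hypothetical.</p>

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z(successes, n, p0):
    """Upper-tail one-sample proportion test using the normal approximation."""
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    p_value = 1 - NormalDist().cdf(z)  # P(Z >= z) under the null hypothesis
    return z, p_value


z, p_value = one_proportion_z(successes=38, n=44, p0=0.5)
print(f"z = {z:.2f}, p-value = {p_value:.7f}")
```

<p>Because the Z value is far above both 1.645 and 1.96, the conclusion does not depend on which critical value you prefer.</p>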
The Analysis (Using Statistical Software)
<p>We can also perform this analysis using statistical software. In Minitab, select <strong>Stat > Basic Statistics > 1 Proportion...</strong> and fill out the dialog box as shown below:</p>
<p style="margin-left: 40px;"><img alt="1 Proportion Test" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/771aaa298305dce17f1b24fcf812a224/one_proportion_dialog.gif" style="width: 409px; height: 316px;" /></p>
<p>Then click on the "Options" button to select "greater than" as the alternative hypothesis, and check the box that tells the software to use the normal distribution in its calculations:</p>
<p style="margin-left: 40px;"><img alt="1 Proportion Test options" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/34627c6c715490824415e996d0540edb/1_proportion_test_options.gif" style="width: 308px; height: 213px;" /></p>
<p>Minitab gives the following output:</p>
<p style="margin-left: 40px;"><img alt="1 Proportion Test output" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/4f8586f239734ff67e6e236cbc57ab9c/one_proportion_test_output.gif" style="width: 394px; height: 133px;" /></p>
<p>The z-value of 4.82 differs slightly from our hand-calculated value since Minitab used more decimals than we did, but the practical result is the same: the z-value is greater than 1.96, so we reject the null hypothesis. Minitab also gives us a p-value, which in this case rounds to 0. And as a wise statistician once said, "If the P-value's low, the null must go."</p>
<p>It is important to note that <a href="http://blog.minitab.com/blog/understanding-statistics/things-statisticians-say-failure-to-reject-the-null-hypothesis">rejecting the null hypothesis does not automatically mean we accept the alternative hypothesis</a>. Accepting the alternative hypothesis is a strong conclusion; strictly, we can only conclude that the data provide sufficient evidence against the null hypothesis, which serves only as a point of comparison for the alternative hypothesis. Fisher himself, in <em>The Design of Experiments, </em>tells us “the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation.”</p>
<strong>Fisher’s Results</strong>
<p>As for the original experiment, Fisher’s son-in-law, the statistician George E.P. Box, informs us in the <a href="http://www.jstor.org/discover/10.2307/2286841?uid=3737864&uid=2&uid=4&sid=21101771509911"><em>Journal of the American Statistical Association</em></a> that the lady in question was Dr. Muriel Bristol, and that her future husband reported she got almost all choices correct. In <a href="http://en.wikipedia.org/wiki/The_Lady_Tasting_Tea"><em>The Lady Tasting Tea</em></a>, David Salsburg also confirms that the lady in question could indeed taste the difference; he was so informed by Professor Hugh Smith, who was present while the lady tasted her tea.</p>
<p>Fisher never actually reported the results; however, what mattered in Fisher’s tale is not whether or not somebody could taste a difference in a drink, but using the proper methodology when performing a statistical experiment.</p>
<p> </p>
<div style="border: 0px; margin: 0px; padding: 0px; color: rgb(90, 90, 90); font-family: Verdana; line-height: 20px;">
<div style="border: 0px; margin: 0px; padding: 0px;"><strong style="border: 0px; margin: 0px; padding: 0px;">About the Guest Blogger: </strong>
<div><em>Matthew Barsalou is an engineering quality expert in BorgWarner Turbo Systems Engineering GmbH’s Global Engineering Excellence department. He has previously worked as a quality manager at an automotive component supplier and as a contract quality engineer at Ford in Germany and Belgium. He possesses a bachelor of science in industrial sciences, a master of liberal studies and a master of science in business administration and engineering from the Wilhelm Büchner Hochschule in Darmstadt, Germany</em><em>.</em></div>
</div>
<div style="border: 0px; margin: 0px; padding: 0px;"><em style="border: 0px; margin: 0px; padding: 0px;"> </em></div>
</div>
<p style="border: 0px; margin: 0px 0px 20px; padding: 0px; color: rgb(90, 90, 90); font-family: Verdana; line-height: 20px;"><em style="border: 0px; margin: 0px; padding: 0px;"><strong style="border: 0px; margin: 0px; padding: 0px;">Would you like to publish a guest post on the Minitab Blog? Contact <a href="mailto:publicrelations@minitab.com?subject=I%20Would%20Like%20to%20Be%20a%20Guest%20Blogger" style="border-width: 0px 0px 0.1em; border-bottom-style: dotted; border-bottom-color: rgb(0, 47, 97); margin: 0px; padding: 0px; color: rgb(0, 47, 97); text-decoration: none;">publicrelations@minitab.com</a>. </strong></em></p>
<p style="border: 0px; margin: 0px 0px 20px; padding: 0px; color: rgb(90, 90, 90); font-family: Verdana; line-height: 20px;"> </p>
Hypothesis TestingLearningStatisticsTue, 30 Jul 2013 14:10:00 +0000http://blog.minitab.com/blog/statistics-in-the-field/the-gentleman-tasting-coffee-a-variation-on-fishers-famous-experimentGuest BloggerT for 2. Should I Use a Paired t or a 2-sample t?
http://blog.minitab.com/blog/statistics-and-quality-data-analysis/t-for-2-should-i-use-a-paired-t-or-a-2-sample-t
<p><img alt="tea" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/7cdf055fba930ed2f98fef2dae6d6674/tea_staggered.jpg" style="width: 213px; float: right; height: 217px" />Boxers or briefs.</p>
<p>Republican or Democrat.</p>
<p>Yin or yang.</p>
<p>Why is it that life often seems to boil down to two choices?</p>
<p>Heck, it even happens when you open the <strong>Basic Stats</strong> menu in Minitab. You’ll see a choice between a 2-sample t-test and a paired t-test:</p>
<p><img alt="t menu" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/216d33d2ba91e63016fe3452372975e0/t_menu_2.jpg" style="width: 416px; height: 312px" /></p>
<p>Which test should you choose? And what’s at stake?</p>
<p>Ask a statistician, and you might get this response: <em>"Elementary, my dear Watson. Choose the 2-sample t-test to test the difference in two means:</em> H0: µ1 – µ2 = 0 <em>Choose the paired t-test to test the mean of pairwise differences</em> H0: µd = 0."</p>
<p>You gaze at your two sets of data values, mystified. Do you have to master the Greek alphabet to choose the right test?</p>
<strong>όχι</strong> !
<p>(That's Greek for “no”)</p>
Base Your Decision on How the Data Is Collected
<p><strong>Dependent samples</strong>: If you collect two measurements on each item, each person, or each experimental unit, then each pair of observations is closely related, or matched.</p>
<p>For example, suppose you test the computer skills of participants before and after they complete a computer training course. That produces a set of paired observations (Before and After test scores) for each participant.</p>
<p><img alt="paired data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/2fa17a380df47710a28a38443b1652f0/paired_observations2.jpg" style="width: 387px; height: 301px" /></p>
<p>In that case, you should use the paired t-test to test the mean difference between these dependent observations. (See the orange guy at left shaped like a lopsided peanut? That’s me. Training failed to improve my computer graphics skills. Hence the cheesy 1990s ClipArt). </p>
<p>Paired observations can also arise when you measure two different items subject to the same unique condition. </p>
<p>For example, suppose you measure tread wear on two different types of bike tires by putting both tires on the same bicycle. Then each bike is ridden by a different rider. To compare 20 pairs of tires, you use 20 different bicycles/riders.</p>
<p>Because each bicycle is ridden different distances in different conditions, measuring the tread wear for the two tires on each bike produces a set of paired (dependent) measurements. To account for the unique conditions that each bike was subject to, you’d use a paired t-test to evaluate the differences in mean tread wear.</p>
<p><strong>Independent samples</strong>: If you randomly sample each set of items separately, under different conditions, the samples are independent. The measurements in one sample have no bearing on the measurements in the other sample.</p>
<p>Suppose you randomly sample two different groups of people and test their computer skills. You take one random sample from people who have not taken a computer training course and record their test scores. You take a second random sample from another group of people who<em> have</em> completed the computer training course and record their test scores.</p>
<p><img alt="unpaired" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/c5dd779c4d41eb86f99135044d910280/unpaired_observations2.jpg" style="width: 389px; height: 270px" /></p>
<p>Because the two samples are independent, you must use the 2-sample t test to compare the difference in the means.</p>
<p>If you use the paired t test for these data, Minitab assumes that the before and after scores are paired: The 47 score before training is associated with a 53 score after training. A 92 score before training is associated with a 71 score after training, etc. </p>
<p>You could end up pairing Mark Zuckerberg’s <em>Before</em> test score with Lloyd Christmas’ <em>After</em> test score.</p>
<p>Invalid pairings like that can lead to very erroneous conclusions.</p>
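<p>To see why row alignment matters, here is a minimal sketch in Python using only the standard library. The scores are hypothetical, made up for illustration (they are not the data from this post), and <code>paired_t</code> is just a helper name chosen here. The same After values are analyzed twice: once lined up with the correct person, and once scrambled.</p>

```python
import math
import statistics


def paired_t(before, after):
    """t statistic for H0: the mean of the (before - after) differences is 0."""
    diffs = [b - a for b, a in zip(before, after)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / se


# Hypothetical scores, invented for illustration only.
before = [47, 92, 60, 75, 81]
after = [53, 95, 66, 79, 88]        # after[i] belongs to the same person as before[i]
mispaired = list(reversed(after))   # same values, but rows no longer line up

t_correct = paired_t(before, after)
t_mispaired = paired_t(before, mispaired)

print(f"correct pairing:   t = {t_correct:.2f}")
print(f"scrambled pairing: t = {t_mispaired:.2f}")
```

<p>The mean difference is identical in both runs, because scrambling does not change either column's mean. What changes is the spread of the differences, and with it the t-value.</p>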
Paired vs 2-Sample Designs
<p>If you’re planning your study and haven’t collected data yet, be aware of the possible ramifications of using 2-sample vs a paired design. The difference in the designs could drastically affect the amount of data you'll need to collect.</p>
<p>For example, suppose you design your study to measure the test scores of the same 15 participants before and after they complete a computer training course. The paired t-test gives the following results:</p>
<p><b>Paired T-Test and CI: Before, After </b><br />
Paired T for Before - After</p>
<p> N Mean StDev SE Mean<br />
Before 15 97.07 26.88 6.94<br />
After 15 101.60 27.16 7.01<br />
Difference 15 -4.533 3.720 0.960</p>
<p>95% CI for mean difference: (-6.593, -2.473)<br />
T-Test of mean difference = 0 (vs not = 0): T-Value = -4.72<span style="color: #ff0000"> <strong>P-Value = 0.000</strong></span></p>
<p>Because the p-value (0.000) is less than alpha (0.05), you conclude that the mean difference between the Before and After test scores is statistically significant.</p>
<p>Now suppose instead you had designed a study to collect two independent samples: 1) the test scores from 15 people who had not completed computer training (Before) and 2) the test scores from 15 different people who had completed the computer training (After).</p>
<p>For the sake of argument, let's suppose you wind up with exactly the same data values for the Before and After scores that you did with the paired design. Here's what you obtain when you analyze the data using the 2-sample t-test.</p>
<p><b>Two-Sample T-Test and CI: Before, After </b><br />
Two-sample T for Before vs After</p>
<p> N Mean StDev SE Mean<br />
Before 15 97.1 26.9 6.9<br />
After 15 101.6 27.2 7.0</p>
<p>Difference = mu (Before) - mu (After)<br />
Estimate for difference: -4.53<br />
95% CI for difference: (-24.78, 15.71)<br />
T-Test of difference = 0 (vs not =): T-Value = -0.46 <strong><span style="color: #ff0000">P-Value = 0.650 </span></strong>DF = 27</p>
<p>The sample size, the standard deviation, and the estimated difference between the means are exactly the same for both tests. But note the whopping difference in p-values—0.000 for the paired t-test and 0.650 for the 2-sample t-test.</p>
<p><span face="">Even though the 2-sample design required twice as many subjects (30) as the paired design (15), you can’t conclude there’s a statistically significant difference between the means of the Before and After test scores. </span></p>
<p>What’s going on? Why the huge disparity in results?</p>
A Paired Design Reduces Experimental Error
<p>By accounting for the variability caused by different items, subjects, or conditions, and thereby reducing experimental error, the paired design tends to increase <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-is-a-t-test-and-why-is-it-like-telling-a-kid-to-clean-up-that-mess-in-the-kitchen">the signal-to-noise ratio that determines statistical significance</a>. This can result in a more efficient design that requires fewer resources to detect a significant difference, if one exists.</p>
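<p>The two t-values reported above can be reproduced from the summary statistics alone, which makes the signal-to-noise idea concrete. The sketch below uses only Python's standard library. The variable names are my own, and the 2-sample standard error uses the unpooled form, which is an assumption about the settings used (with equal group sizes, the pooled form gives the same value).</p>

```python
import math

n = 15
mean_diff = -4.533                   # Before - After: the "signal" in both tests
sd_diff = 3.720                      # StDev of the 15 paired differences
sd_before, sd_after = 26.88, 27.16   # StDev of each column on its own

# Noise for the paired test: spread of the within-person differences.
se_paired = sd_diff / math.sqrt(n)

# Noise for the 2-sample test: spread of each group, person-to-person
# variability included (unpooled standard error).
se_two_sample = math.sqrt(sd_before**2 / n + sd_after**2 / n)

t_paired = mean_diff / se_paired
t_two_sample = mean_diff / se_two_sample

print(f"paired:   SE = {se_paired:.2f}, t = {t_paired:.2f}")
print(f"2-sample: SE = {se_two_sample:.2f}, t = {t_two_sample:.2f}")
```

<p>The signal is the same in both cases. Only the noise term in the denominator differs, and it is roughly ten times larger for the 2-sample test.</p>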
<p>Because the 2-sample design doesn’t control for the variability of the experimental unit, a much larger sample is needed to achieve statistical significance for a given difference and variability in the data, as shown below:</p>
<p><b>Two-Sample T-Test and CI: Before, After </b></p>
<p>Two-sample T for Before vs After</p>
<p> <span style="color: #ff0000"><strong> N</strong></span> Mean StDev SE Mean<br />
Before <strong><span style="color: #ff0000">270</span></strong> 97.1 26.0 1.6<br />
After <strong><span style="color: #ff0000">270</span></strong> 101.6 26.3 1.6</p>
<p>Difference = mu (Before) - mu (After)<br />
Estimate for difference: -4.53<br />
95% CI for difference: (-8.95, -0.11)<br />
T-Test of difference = 0 (vs not =): T-Value = -2.01 <strong><span style="color: #ff0000">P-Value = 0.045</span></strong> DF = 537</p>
<p>Remember, these are independent samples. So this translates to 270 + 270 = 540 subjects for the study—compared to only 15 subjects in the paired design!</p>
<p>That gain in efficiency comes from controlling for person-to-person variability -- a good thing to do because that variability is not a primary objective of this study. It's a <em>nuisance factor, </em>something that creates “extraneous noise” that gets in the way of “hearing” the main effect that you're most interested in.</p>
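<p>The effect of sample size on the 2-sample result can be sketched the same way. The snippet below takes the estimated difference and standard deviations from the 270-per-group output above and holds them fixed while the group size changes. That is a simplification, and <code>two_sample_t</code> is a helper name chosen here, not a Minitab function.</p>

```python
import math


def two_sample_t(diff, sd1, sd2, n):
    """Unpooled 2-sample t statistic for two groups of equal size n."""
    se = math.sqrt(sd1**2 / n + sd2**2 / n)
    return diff / se


# Summary values from the 270-per-group output, held fixed for illustration.
diff, sd_before, sd_after = -4.53, 26.0, 26.3

for n in (15, 30, 90, 270):
    t = two_sample_t(diff, sd_before, sd_after, n)
    print(f"n per group = {n:3d}  t = {t:.2f}")

t_270 = two_sample_t(diff, sd_before, sd_after, 270)
```

<p>Because the standard error shrinks only with the square root of n, the t-value grows slowly. It takes hundreds of subjects per group to reach the T-Value of -2.01 shown in the output.</p>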
<p>So next time you’re planning T for 2, give it a hard think.</p>
<p><em>If </em>it’s possible to satisfy your objectives using a paired t design rather than a 2-sample t design, you may be wise to do so.</p>
Hypothesis TestingLearningMon, 08 Jul 2013 16:45:00 +0000http://blog.minitab.com/blog/statistics-and-quality-data-analysis/t-for-2-should-i-use-a-paired-t-or-a-2-sample-tPatrick Runkel