Hypothesis Testing | Minitab
Blog posts and articles about hypothesis testing, especially in the course of Lean Six Sigma quality improvement projects.
http://blog.minitab.com/blog/hypothesis-testing-2/rss
Mon, 20 Oct 2014 09:39:41 +0000
Using Data Analysis to Maximize Webinar Attendance
http://blog.minitab.com/blog/michelle-paret/using-data-analysis-to-maximize-webinar-attendance
<p>We like to host webinars, and our customers and prospects like to attend them. But when our webinar vendor moved from a pay-per-person pricing model to a pay-per-webinar pricing model, we wanted to find out how to maximize registrations and thereby minimize our costs.<img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/8a6733d3b0516b7f1c7ad80ea753d430/mtbnewspromos_w640.jpeg" style="width: 400px; height: 273px; float: right; border-width: 1px; border-style: solid; margin: 10px 15px;" /></p>
<p>We collected webinar data on the following variables:</p>
<ul>
<li>Webinar topic</li>
<li>Day of week</li>
<li>Time of day – 11 a.m. or 2 p.m.</li>
<li>Newsletter promotion – no promotion, newsletter article, newsletter sidebar</li>
<li>Number of registrants</li>
<li>Number of attendees</li>
</ul>
<p>Once we'd collected our data, it was time to analyze it and answer some key questions using <a href="http://www.minitab.com/products/minitab/">Minitab Statistical Software</a>.</p>
Should we use registrant or attendee counts for the analysis?
<strong><span style="line-height: 16.8666667938232px; font-family: Calibri, sans-serif; font-size: 11pt;"><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/4d9fa1e3c73606627d2ca1ec34b620e2/scatterplot_w640.jpeg" style="width: 300px; height: 197px; margin: 10px 15px; float: left;" /></span></strong>
<p>First we needed to decide what we would use to measure our results: the number of people who signed up, or the number of people who actually attended the webinar. This question really boils down to answering the question, “Can I trust my data?”</p>
<p>Our data collection system for webinar registrants is much more accurate than our data collection system for webinar attendees. This is due to customer behavior and their willingness to share contact information, in addition to the automated database processes that connect our webinar vendor data with our own database. So, for a period of time, I manually collected the attendee data directly from our webinar vendor to see how it correlated with the easily-accessible and accurate registration data. The scatterplot above shows the results.</p>
<p>With a <a href="http://blog.minitab.com/blog/understanding-statistics/no-matter-how-strong-correlation-still-doesnt-imply-causation">correlation coefficient</a> of 0.929 and a p-value of 0.000, there was a strong positive linear relationship between the registration and attendee counts. If registrations are high, then attendance is also high. If registrations are low, then attendance is also low. I concluded that I could use the registration data, which is both easily accessible and extremely reliable, to conduct my analysis.</p>
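<p>For readers who want to see what the correlation coefficient measures, here is a minimal sketch of the Pearson calculation in plain Python. The counts are hypothetical and chosen only for illustration; they are not the webinar data, and the function name <code>pearson_r</code> is an assumption of this sketch, not a Minitab command.</p>

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical counts for illustration only (not the webinar data).
registrants = [120, 85, 200, 150, 60, 175]
attendees = [50, 40, 95, 60, 30, 80]

r = pearson_r(registrants, attendees)
print(f"r = {r:.3f}")
```

<p>A value of r close to 1 means the points fall near an upward-sloping line, which is the pattern the scatterplot shows.</p>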
Should we consider data for the last 6 years?
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/5e73f48b852c7afc17762f28bf8887cf/i_mr_chart_of_registrants_w640.jpeg" style="width: 400px; height: 263px; margin: 10px 15px; float: left;" />We’ve been collecting webinar data for 6 years, but that doesn’t mean we can treat the last 6 years of data as one homogeneous population.</p>
<p>A lot can change in a 6-year time period. Perhaps there was a change in the webinar process that affected registrations. To determine whether or not I should use all of the data, I used an Individuals and Moving Range (I-MR, also referred to as X-MR) <a href="http://blog.minitab.com/blog/understanding-statistics/how-create-and-read-an-i-mr-control-chart">control chart</a> to evaluate the process stability of webinar registrations over time.</p>
<p>The graph revealed a single point on the MR chart that flagged as out-of-control. I looked more closely at this point and verified that the data was accurate and that this webinar belonged with the larger population. Based on this information, I decided to proceed with analyzing all 6 years of data together. (Note there is some clustering of points due to promotions, but again the goal here was to determine if we could use data over a 6-year time period.)</p>
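<p>As a rough illustration of how an I-MR chart flags a point, the sketch below computes the usual control limits from moving ranges of length 2. The constants 2.66 and 3.267 are the standard factors for that case. The registration counts are hypothetical, and the function <code>imr_limits</code> is a name made up for this sketch; Minitab's own options and tests may differ.</p>

```python
def imr_limits(values):
    """Compute I-MR control limits using moving ranges of length 2."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    center = sum(values) / len(values)
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    mr_ucl = 3.267 * mr_bar  # D4 for subgroups of size 2
    return {
        "i_center": center,
        "i_lcl": center - 2.66 * mr_bar,  # 2.66 = 3 / d2, with d2 = 1.128
        "i_ucl": center + 2.66 * mr_bar,
        "mr_center": mr_bar,
        "mr_ucl": mr_ucl,
        # 0-based index of the observation that closes each flagged range.
        "mr_out_of_control": [
            i + 1 for i, mr in enumerate(moving_ranges) if mr > mr_ucl
        ],
    }


# Hypothetical registration counts (not the webinar data).
counts = [100, 104, 98, 102, 101, 99, 103, 100, 140, 101, 99, 102]
limits = imr_limits(counts)
print(limits["mr_ucl"], limits["mr_out_of_control"])
```

<p>A flagged moving range says only that two consecutive values differ by more than expected. Whether the point belongs with the rest of the data is still a judgment call, as described above.</p>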
What variables impact registrations?
<p>I performed an ANOVA using Minitab's General Linear Model tool to find out which factors—topic, day of week, time of day, or newsletter promotion—significantly affect webinar registrations.<img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/3758d3d03a604bab9921ad9f94663dc8/main_effects_plot_for_registrants_w640.jpeg" style="width: 400px; height: 263px; float: right; margin: 10px 15px;" /></p>
<p>The ANOVA results revealed that the day of week, time of day, and webinar topic <em>do not</em> affect webinar registrations, but the newsletter promotion type <em>does</em> (p-value = 0.000).</p>
<p>So which webinar promotion type maximizes webinar registrations?</p>
<p>Using Minitab to conduct <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/keep-that-special-someone-happy-when-you-perform-multiple-comparisons">Tukey comparisons</a>, we can see that registrations for webinars promoted in the newsletter sidebar space were not significantly different from webinars that weren't promoted at all.</p>
<p>However, webinars that were promoted in the newsletter <em>article </em>space resulted in significantly more registrations than both the sidebar promotions and no promotions.</p>
<p>From this analysis, we concluded that we still had the flexibility to offer webinars at various times and days of the week, and we could continue to vary webinar topics based on customer demand and other factors. To maximize webinar attendance and minimize webinar cost, we needed to focus our efforts on promoting the webinars in our newsletter, utilizing the article space.</p>
<p>But over the past year, we’ve started to actively promote our webinars via other channels as well, so next up is some more data analysis—using Minitab—to figure out what marketing channels provide the best results…</p>
Data Analysis, Hypothesis Testing, Regression Analysis, Statistics
Fri, 17 Oct 2014 12:00:00 +0000
Michelle Paret
With the Assistant, You Won't Have to Stop and Get Directions about Directional Hypotheses
http://blog.minitab.com/blog/statistics-and-quality-improvement/with-the-assistant-you-wont-have-to-stop-and-get-directions-about-directional-hypotheses
<p>I got lost a lot as a child. I got lost at malls, at museums, Christmas markets, and everywhere else you could think of. Had it been in fashion to tether children to their parents at the time, I'm sure my mother would have. As an adult, I've gotten used to using a GPS device to keep me from getting lost.</p>
<p><span style="line-height: 20.7999992370605px;">The Assistant in Minitab is like your GPS for statistics. The Assistant is there to provide you with directions so that you don't get lost. One particular area where it's easy to get lost is with directional hypotheses.</span><img alt="Wait... is my hypothesis the other direction?" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/25dd42362071d2aafc3bfc85f78f5f22/hypothesis_bubble_w640.jpeg" style="line-height: 20.7999992370605px; width: 480px; height: 350px; border-width: 1px; border-style: solid; margin: 10px 15px;" /></p>
What Is a Directional Hypothesis?
<p>When you do a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/basics/what-is-a-hypothesis-test/">statistical hypothesis test</a>, you have a null hypothesis and an alternative hypothesis. Directional hypotheses refer to two types of alternative hypotheses that you can usually choose. The common alternative hypotheses are these three:</p>
<ul>
<li>The value that you want to test is greater than a target.</li>
<li>The value that you want to test is different from a target.</li>
<li>The value that you want to test is less than a target.</li>
</ul>
<p>If you select an alternative hypothesis with "greater than" or "less than" in it, then you've chosen a directional hypothesis. When you choose a directional hypothesis, you get a one-sided test.</p>
<p>What does it look like to choose a one-sided test, and why would you? Let's consider an example.</p>
Choosing Whether to Use a One-sided Test or a Two-sided Test
<p>Suppose new production equipment is installed at a factory that should increase the rate of production for electrical panels. Concern exists that the change could increase the percentage of electrical panels that require rework before shipping. A quality team prepares to conduct a hypothesis test to determine whether statistical evidence supports this concern. The historical rework rate is 1%.</p>
<p>At this point, you would usually choose an alternative hypothesis. Maybe you remember hearing that you should think about whether to use a one-sided test or a two-sided test, or you may not even know how a test can have a side.</p>
<p>To keep from getting lost, you use your GPS. To keep from getting confused about statistics, you can use the Assistant. The Assistant uses clear and simple language. The Assistant doesn't ask you about "directional hypotheses" or "one-sided tests." Instead, the Assistant asks the question, "What do you want to determine?"</p>
<p><img alt="Is the % defective of Panels greater than .01? Is the % defective of Panels less than .01? Is the % defective of Panels different from .01?" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/b090980e5b08184e7b70b96b9cb05489/test_setup_in_assistant.png" style="width: 573px; height: 198px;" /></p>
<p>In this scenario, it's easy to see why the team would want to determine whether the percent is greater than 1. By performing the one-sided test for whether the percentage is greater than 1, the team can determine if there is enough statistical evidence to conclude that the percentage increased. If the percentage increased, then the concern is justified.</p>
<p>In practical terms, you should consider what it means to limit your decision to whether there is evidence for an increase. A one-sided test of whether the percentage increased will never show a statistically significant decrease in the percentage of panels that require rework. Evidence of a decrease in the number of defectives might guide the quality team to investigate the reasons for the unforeseen benefit.</p>
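<p>To make the point about direction concrete, here is a small sketch of exact binomial p-values for the two one-sided alternatives. The sample (1 panel needing rework out of 500) is hypothetical, and the helper names are assumptions of this sketch; the Assistant's own calculation method is not shown in this post.</p>

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k defectives in n panels when the true rate is p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def p_value_greater(defectives, n, p0):
    """One-sided p-value for the alternative 'rate is greater than p0'."""
    return sum(binom_pmf(k, n, p0) for k in range(defectives, n + 1))


def p_value_less(defectives, n, p0):
    """One-sided p-value for the alternative 'rate is less than p0'."""
    return sum(binom_pmf(k, n, p0) for k in range(0, defectives + 1))


# Hypothetical sample: 1 reworked panel out of 500, tested against 1%.
p_greater = p_value_greater(1, 500, 0.01)
p_less = p_value_less(1, 500, 0.01)
print(f"greater-than test: p = {p_greater:.4f}")
print(f"less-than test:    p = {p_less:.4f}")
```

<p>With this made-up sample, the "greater than" test gives a large p-value, while the "less than" test falls below 0.05. A team that ran only the "greater than" test would never see the evidence of a decrease.</p>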
Why Use a One-sided Test?
<p>Given this possible concern about whether a one-sided test excludes important information from the result, why would you ever use one? The best answer is that you use a one-sided test when the one-sided test tells you everything that you need to know.</p>
<p>In the example about the electrical panels, the quality team might feel completely secure in assuming that the new equipment will not result in a decrease in the percentage of panels that require rework. If so, then a test that checks for a decrease is unnecessary. The team needs only to determine whether to solve a problem with increased defectives or not.</p>
The Assistant Gets Even Better
<p>While a p-value for a one-sided test can be useful, more analysis can help you make better decisions. For example, in the electrical panel example, if the team finds a statistically significant increase, it will be important to know what the percentage increase is. <a href="http://www.minitab.com/en-us/products/minitab/assistant/">The Assistant</a> produces several reports with your hypothesis tests that help you get as much information as you can from your data. The report card verifies your analysis by providing assumption checks and identifying any concerns that you should be aware of. The diagnostic report helps you further understand your analysis by providing additional detail. The summary report helps you to draw the correct conclusions and explain those conclusions to others. The series of reports includes a variety of other statistics and analyses. That way, you have everything that you need to interpret your results with confidence.</p>
<p><img alt="The % defective of Panels is not significantly greater than the target (p > 0.05)" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/75f280df482574a3aee75ee65741b5c4/1_sample___defective_test_for_panels___summary_report_w640.png" style="width: 480px; height: 360px;" /></p>
<p>The image of the face in the crowd without the thought bubble is by <a href="https://www.flickr.com/photos/akbarsyah/">_Imaji_</a> and is licensed under <a href="https://creativecommons.org/licenses/by/2.0/">this creative commons license</a>.</p>
Hypothesis Testing
Wed, 15 Oct 2014 18:52:23 +0000
Cody Steele
How Politicians and Governments Could Benefit from Statistical Analyses
http://blog.minitab.com/blog/applying-statistics-in-quality-projects/how-politicians-and-governments-could-benefit-from-statistical-analyses
<p>Using <a href="http://blog.minitab.com/blog/applying-statistics-in-quality-projects/a-doe-in-a-manufacturing-environment-part-1">statistical techniques to optimize manufacturing processes</a> is quite common now, but applying the same techniques to social topics is still innovative. For example, if our objective is to improve students' academic performance, should we increase teachers' wages, or would it be better to reduce the number of students in a class?</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/4b07ae989e35a7dfd8b6fdb313a5561b/ballot.jpg" style="float: right; width: 250px; height: 250px;" />Many social topics (the effect of increasing the minimum wage on employment, etc.) generate long and passionate discussions in the media and in politics. People express very different and subjective points of view according to political/ideological opinions and varied ways of thinking.</p>
Hypothesis Testing in the Policy Realm
<p><span style="line-height: 20.7999992370605px;">Social experimentation and data analysis can provide a firmer ground on which we can base more objective decisions.</span></p>
<p>The objective is to investigate the effects of a policy intervention and to <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/basics/example-of-a-hypothesis-test/">test specific hypotheses</a>. In these social experiments “randomization” is a key element. If one policy option is tested in, say, the Netherlands, and another policy option is tested in France, the experimenter will never be in a position to fully understand whether a difference in outcomes is due to the intervention itself or to the many other differences between these two countries.</p>
<p>It would clearly be preferable to test the two approaches in different regions of France and of the Netherlands, for example, and assign the policy intervention in a random way to a “treatment” group (individuals who receive it) and a “comparison” group (individuals who do not receive it).</p>
<p>At the beginning of the study, the “treatment” and the “comparison” groups should be as similar as possible to prevent any pre-existing systematic bias. The objective is not to “observe” differences but to identify the actual causal effects.</p>
Designed Experiment Techniques
<p>Other techniques that are often used in <a href="http://blog.minitab.com/blog/understanding-statistics/getting-started-with-factorial-design-of-experiments-doe">designed experiments (DOEs)</a> may also be useful in this context, such as blocking and balancing. In my example, France and the Netherlands might be considered as a blocking factor (an external extra factor which the experimenter cannot control), and the tests should be “balanced” across blocks so that the treatment effect estimates are not biased and the blocking effects of the countries are neutralized. Other potential blocking factors in policy studies might be urban versus rural regions, or females versus males.</p>
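<p>The ideas of randomization, blocking, and balancing can be sketched in a few lines of Python. In this illustration each country is a block, and the regions within each block are split at random into equal-sized “treatment” and “comparison” groups. The region labels and the function <code>assign_within_blocks</code> are hypothetical and serve only to show the mechanics.</p>

```python
import random


def assign_within_blocks(units_by_block, seed=0):
    """Randomly split the units of each block into two balanced groups."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    assignment = {}
    for block, units in units_by_block.items():
        shuffled = list(units)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        for unit in shuffled[:half]:
            assignment[unit] = (block, "treatment")
        for unit in shuffled[half:]:
            assignment[unit] = (block, "comparison")
    return assignment


# Hypothetical regions, four per country (block).
regions = {
    "France": ["FR-1", "FR-2", "FR-3", "FR-4"],
    "Netherlands": ["NL-1", "NL-2", "NL-3", "NL-4"],
}

plan = assign_within_blocks(regions)
for region, (country, group) in sorted(plan.items()):
    print(region, country, group)
```

<p>Because every block contributes the same number of regions to each group, differences between the countries cannot masquerade as a treatment effect.</p>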
Examples of Policy Experiments
<p>Data analysis and statistics have been used to inform several important policy debates around the world over the past few years. Here are a few examples:</p>
<p>- In Kenya, a social experiment showed that neither hiring extra teachers to reduce class sizes in schools nor providing more textbooks to pupils had much effect on academic performance. A surprising finding of this study was that deworming programs (to treat intestinal worms) were very effective in decreasing child absenteeism.</p>
<p>- In the U.S., a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/doe/factorial-designs/choose-a-factorial-design/">full factorial design (DOE)</a> was used to assess the effectiveness of commitment contracts. The objective of these contracts was to encourage individuals to exercise more in order to reduce health risks and prevent obesity. The effects of different factors, such as the duration of the physical exercises, their frequency, and the financial stakes, were studied. The outcome was the likelihood of accepting such a contract.</p>
<p>- Different strategies to quit smoking based on commitment contracts have been tested using a randomized experimental approach.</p>
<p>- In France, a social experiment was conducted to compare different job-counselling strategies for placing young unemployed people. The studied outcome was the probability of finding a job.</p>
Conclusion
<p>Experiments make it possible to vary one factor at a time, but a more effective approach is to modify several factors for each test using a properly designed experiment. Expertise in setting up randomized field experiments to test economic hypotheses is clearly a key factor.</p>
<p>Experimental results are often surprising. Experimentation and data analysis are therefore potentially new and powerful tools in the arsenal of politicians and governments.</p>
<p>Here are sources of more information about the examples I've mentioned:</p>
<p>Miguel, Edward and Michael Kremer (2004). “Worms: Identifying Impacts on Education and Health in the Presence of Treatment Externalities,” Econometrica, Volume 72 (1), pp. 159-217.</p>
<p>Gine, Xavier, Dean Karlan and Jonathan Zinman (2008). “Put Your Money Where Your Butt Is: A Commitment Savings Account for Smoking Cessation,” MIMEO, Yale University.</p>
<p><a href="http://www.voxeu.org/article/job-placement-and-displacement-evidence-randomised-experiment">http://www.voxeu.org/article/job-placement-and-displacement-evidence-randomised-experiment</a></p>
<p>Using Nudges in Exercise Commitment Contracts: <a href="http://www.nber.org/bah/2011no1/w16624.html">http://www.nber.org/bah/2011no1/w16624.html</a></p>
<p> </p>
Data Analysis, Design of Experiments, Hypothesis Testing, Statistics, Statistics in the News, Stats
Mon, 22 Sep 2014 12:00:00 +0000
Bruno Scibilia
A Fun ANOVA: Does Milk Affect the Fluffiness of Pancakes?
http://blog.minitab.com/blog/statistics-in-the-field/a-fun-anova3a-does-milk-affect-the-fluffiness-of-pancakes
<p><em>by Iván Alfonso, guest blogger</em></p>
<p><img alt="hotcakes" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/7bd460fa71f6d12672a2ac5d9f754762/pancakes.jpg" style="border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 300px; height: 223px;" />I'm a huge fan of hot cakes; they are my favorite dessert ever. I’ve been cooking them for over 15 years, and over that time I’ve noticed many variations in texture, flavor, and thickness. Personally, I like fluffy pancakes.</p>
<p>There are many brands of hotcake mix on the market, all with very similar formulations. So I decided to investigate which ingredients and inputs may influence the fluffiness of my pancakes.</p>
<p>Potential factors could include the type of mix used, the type of milk used, the use of margarine or butter (of many brands), the amount of mixing time, the origin of the eggs, and the skill of the person who prepares the pancakes.</p>
<p>Instead of looking at <em>all </em>of these factors, I focused on the type of milk used in the pancakes. I had four types of milk available: whole milk, light, low fat, and low protein.</p>
<p>My goal was to determine if these different milk formulations influence fluffiness (thickness). Is whole milk the best for fluffy hotcakes? Does skim milk work the same way as whole milk? Can I be sure that the use of light milk will result in hot cakes that are less smooth?</p>
Gathering Data
<p>I sorted the four formulations as shown in the diagram below:</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/643f9f4f94be78a5b1c012e49c400772/milk_factor.jpg" style="width: 144px; height: 200px;" /></p>
<p>I used the same amounts of milk, flour (one brand), salt, and margarine for each batch of hotcakes I cooked.</p>
<p>The response variable was the thickness of the cooked pancakes. I prepared 6 pancakes for each type of milk, which gave me a total of 24 pancakes. I randomized the cooking order to minimize bias. I also prepared each batch by myself; if my sister or mother had helped with some lots, it would have been a potential source of variation.</p>
<p>To measure the fluffiness, I inserted a stick into the center of each hotcake until the bottom, marked the stick with a pencil, then measured the distance to the mark in millimeters with a ruler.</p>
<p>After a couple of hours of cooking hotcakes, making measurements, and recording the data on a worksheet, I started to analyze my data with Minitab.</p>
Analysis of Variance (ANOVA)
<p>My goal was to assess the variation in thickness or fluffiness between different batches of hot cakes, so the most appropriate statistical technique was <a href="http://blog.minitab.com/blog/statistics-in-the-field/understanding-anova-by-looking-at-your-household-budget">analysis of variance, or ANOVA</a>. With this analysis I could visualize and compare the formulations based on my response variable, the thickness in millimeters, and see if there were statistically significant differences between them. I used a <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/alpha-male-vs-alpha-female">0.05 significance level</a>.</p>
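<p>For anyone curious about what a one-way ANOVA computes, the sketch below calculates the F statistic by comparing the variation between group means with the variation inside each group. The thickness values are hypothetical, invented only to show the calculation; they are not my measurements, and <code>one_way_anova</code> is a name made up for this sketch.</p>

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a dict mapping group -> values."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        # Variation of the group mean around the grand mean.
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        # Variation of individual values around their own group mean.
        ss_within += sum((v - group_mean) ** 2 for v in values)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical thickness values in millimeters (not the measured data).
thickness = {
    "whole": [14, 15, 13, 15, 14, 16],
    "light": [12, 13, 12, 11, 13, 12],
    "low fat": [12, 12, 13, 11, 12, 13],
    "low protein": [10, 11, 10, 9, 11, 10],
}

f_stat, df_b, df_w = one_way_anova(thickness)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

<p>A large F means the group means differ by more than the within-group scatter would explain. Statistical software then converts F into a p-value using the F distribution, a step this sketch leaves out.</p>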
<p>As soon as I had my data in a Minitab worksheet, I started to check it for the assumptions of ANOVA. First, I needed to see if the data followed a normal distribution, so I went straight to <strong>Statistics > Basic Statistics > Normality Test</strong>. Minitab produced the following graph:</p>
<p><img alt="Graph of probability of thickness" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/58599d2e2d8572e700893e2e8000dce9/probability_of_thickness.jpg" style="width: 500px; height: 304px;" /></p>
<p>My data passed both the Kolmogorov-Smirnov and Anderson-Darling normality tests. This was a relief—since my data had a normal distribution, I didn’t need to worry about ANOVA’s assumptions of normality.</p>
<p>Traditional ANOVA also has an assumption of equal variances; however, I knew that even if my data didn’t meet this assumption, I could proceed using the method called <a href="http://blog.minitab.com/blog/adventures-in-statistics/did-welchs-anova-make-fishers-classic-one-way-anova-obsolete">Welch’s ANOVA</a>, which accommodates unequal variances. But when I ran Bartlett’s test for equal variances, and even the more stringent Levene test, my data passed. </p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/5600f02a4a7a9faa8b82c3bbe1458784/test_for_equality_of_variances.jpg" style="width: 500px; height: 307px;" /></p>
<p>With confirmation that my data met the assumptions, I proceeded to perform the ANOVA and create box-and-whisker graphs.</p>
ANOVA Results
<p>Here's the Minitab output for the ANOVA:</p>
<p style="margin-left: 40px;"><img alt="one-way anova output" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/5817e0a9b2d961942f7101bc8eb2eced/one_way_anova.gif" style="width: 400px; height: 133px;" /></p>
<p>The ANOVA revealed that there were indeed statistically significant differences (p = 0.009) among my four batches of hotcakes.</p>
<p>Minitab’s output also included grouping information using Tukey’s method of multiple comparisons for 95% confidence intervals:</p>
<p style="margin-left: 40px;"><img alt="Tukey Method" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c9194c1dda604ad87e4e7985ec8261c1/tukey_method.gif" style="width: 400px; height: 151px;" /></p>
<p>The Tukey analysis shows that the batches made with low-fat milk and light milk do not show a significant difference in fluffiness. However, the batches made with whole milk and low-protein milk did significantly differ from each other.</p>
<p>The box-and-whisker diagram makes the results of the analysis easier to visualize:</p>
<p><img alt="Boxplot of thickness" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/8ca740917c33fddd8953433d67488ac8/boxplot_of_thickness.gif" style="width: 500px; height: 338px;" /></p>
<p>It is clear from the graph that hotcakes produced with whole milk had the most fluffiness, and those made with low-protein milk had the least fluffiness. There was not a big difference between the fluffiness of hotcakes made with light milk and low-fat milk.</p>
Which Milk Should You Use for Fluffy Pancakes?
<p>Based on this analysis, I recommend using whole milk for fluffier hotcakes. If you want to avoid fats and sugars in milk, low-fat milk is a good choice.</p>
<p>I always use low-fat milk, but the analysis indicates that light milk offers a good alternative for people following a strict no-fat diet.</p>
<p>It’s important to note that for this analysis, I only compared formulations that used the same brand of pancake mix and the same amounts of salt and margarine. But there are other factors to consider! My next pancake experiment will use design of experiments (DOE) to compare milk types, different brands of flour, and margarine with and without salt, to see how all of these factors together affect the fluffiness of pancakes.</p>
<p> </p>
<p><strong>About the Guest Blogger:</strong></p>
<p><em>Iván Alfonso is a biochemical engineer and statistics professor at the Autonomous University of Campeche, Mexico. Alfonso holds a master's degree in marine chemistry and has worked extensively in data analysis and design of experiments in basic and advanced sciences like chemistry and epidemiology.</em></p>
<p> </p>
<p><strong>Would you like to publish a guest post on the Minitab Blog? Contact <a href="mailto:publicrelations@minitab.com?subject=Guest%20Blogger">publicrelations@minitab.com</a>.</strong></p>
<p> </p>
Data Analysis, Fun Statistics, Hypothesis Testing, Statistics
Tue, 05 Aug 2014 12:00:00 +0000
Guest Blogger
Do the Data Really Say Female-Named Hurricanes Are More Deadly?
http://blog.minitab.com/blog/the-statistics-game/do-the-data-really-say-female-named-hurricanes-are-more-deadly
<p><img alt="Hurricane" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/61165559035556ba8f784164d74a7f96/hurricane_w640.jpeg" style="float: right; width: 250px; height: 188px; border-width: 1px; border-style: solid; margin: 10px 15px;" />A recent study has indicated that <a href="http://www.washingtonpost.com/blogs/capital-weather-gang/wp/2014/06/02/female-named-hurricanes-kill-more-than-male-because-people-dont-respect-them-study-finds/" target="_blank">female-named hurricanes kill more people than male hurricanes</a>. Of course, the title of that article (and other articles like it) is a bit misleading. The study found a significant <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/anova-models/what-is-an-interaction/">interaction</a> between the damage caused by the storm and the perceived masculinity or femininity of the hurricane names. So don’t be confused by stories that suggest all female-named hurricanes are deadlier than male-named hurricanes. The study actually found no effect of masculinity/femininity for less severe storms. It was the more severe storms where the gender of the name had a significant relationship with the number of deaths.</p>
<p>The study looked at every hurricane since 1950, with the exception of Katrina and Audrey (those two are outliers that would skew the results). Many critics of the study believe that it is biased, since almost all of the 38 hurricanes before 1979 had female names (there were two male names in the early 50s). It’s possible that our ability to forecast hurricanes has vastly improved since the 50s and 60s. So, these critics say, the difference is simply because more people died in hurricanes back when they all had a female name.</p>
<p>Let’s perform a data analysis to see if that is true. We will use pre- and post-1979 to distinguish between the predominantly female-name hurricane era and the era of mixed hurricane names. I’ll use the exact same data set that was used in the study, which you can get <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/ad7c966669da36643b8060c74038e6d6/hurricane.MTW">here</a>.</p>
Hurricanes Before and After 1979
<p>For the 92 hurricanes in the study, the number of deaths and the normalized damage were recorded. The study showed that these two variables are highly correlated, so it’s important to consider both factors. If we find there were more deaths in hurricanes before 1979, we need to make sure the reason isn’t simply that those hurricanes caused more damage (implying they were bigger storms).</p>
<p>We can start by using a scatterplot to plot the two variables against each other, using whether the hurricane came before or after 1979 as a grouping variable. Hurricanes that occurred <em>during </em>1979 were put in the After group.</p>
<p><img alt="Scatterplot" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/72ef8a172f250267d3b03cccd6ff8399/scatterplot_of_deaths_vs_normalized_damage_w640.jpeg" style="width: 640px; height: 427px;" /></p>
<p>We see that the two deadliest hurricanes (Camille and Diane) both occurred before 1979. If you look below them, you’ll see that many hurricanes in both eras have caused the same amount of damage, yet resulted in far fewer deaths.</p>
<p>Meanwhile, the two most damaging hurricanes (Sandy and Andrew) both occurred <em>after </em>1979. These hurricanes caused more than three times the damage of Camille and Diane, yet resulted in fewer deaths. This gives some credibility to the idea that our improvement in being able to predict hurricanes has resulted in fewer deaths. However, Hurricane Donna supports the opposite idea: five post-1979 hurricanes resulted in more deaths than Donna, despite causing significantly less damage. It’s hard to draw conclusions from the scatterplot.</p>
<p>Of course, the hurricanes labeled in the plot above are pretty rare. Most of the 92 hurricanes had normalized damage less than $30 billion and fewer than 100 deaths. The descriptive statistics below show just how much of an impact those big storms can have on an analysis.</p>
<p><img alt="Describe" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/ac70541e09a25b227de847363d10e9c0/describe_deaths_ndam_by_year_group.jpg" style="width: 503px; height: 177px;" /></p>
<p>If we look at the mean, everything becomes clear! On average, hurricanes before 1979 had 11 more deaths despite causing half a billion <em>fewer</em> dollars in damages. But when we look at the median, which isn’t sensitive to extreme data values, the values are almost the same. </p>
<p>Part of the problem is that so many smaller storms are included. The study already concluded that the name doesn’t matter for smaller storms. So let’s just focus on the big storms. The median normalized damage for all 92 storms is $1.65 billion. I took only the storms that have caused at least that much damage (there were 47 of them) and looked at the descriptive statistics again.</p>
<p><img alt="Describe" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/06fc8707283704922858ce000d05fde2/describe_deaths_ndam_by_year_group_big_storm.jpg" style="width: 500px; height: 175px;" /></p>
<p>Once again, the mean and median paint different pictures. The mean shows that a much higher number of deaths occurred in big storms before 1979, even though those storms caused the same amount of damage. However, this is because hurricanes Camille, Diane, and Agnes are heavily influencing the mean for deaths before 1979, pulling it up much higher than the After-1979 group. And hurricanes Sandy and Andrew influence the mean for normalized damage after 1979, pulling it up to equal the damage before 1979.</p>
<p>With data this skewed, the medians are a more accurate representation of the middle of the data. The median for deaths shows that there were slightly more deaths in big storms prior to 1979. However, those storms also caused more damage, implying <em>that </em>could be the reason for the larger number of deaths.</p>
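<p>To see why a few extreme storms move the mean so much more than the median, here is a minimal Python sketch. The death counts are made up purely for illustration and are not taken from the hurricane data set; the variable names are likewise hypothetical.</p>

```python
from statistics import mean, median

# Made-up death counts for illustration only; these are NOT the study's data.
typical_storms = [3, 5, 8, 10, 12, 15, 20]

# The same storms plus one hypothetical extreme storm.
with_extreme_storm = typical_storms + [250]

# The mean jumps when the extreme value is added; the median barely moves.
print("mean:  ", mean(typical_storms), "->", mean(with_extreme_storm))
print("median:", median(typical_storms), "->", median(with_extreme_storm))
```

<p>In this toy example a single extreme value roughly quadruples the mean, while the median shifts only from 10 to 11. That is the same pattern the descriptive statistics show for the real storms.</p>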
<p>And even if we ignore the fact that the hurricanes before 1979 caused more damage, a <a href="http://blog.minitab.com/blog/statistics-for-lean-six-sigma/the-non-parametric-economy-what-does-average-actually-mean">Mann-Whitney test</a> (which compares 2 medians, as opposed to a 2-sample t test which compares 2 means) shows that the difference in deaths is not statistically significant.</p>
<p><img alt="Mann-Whitney" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/a8f1ef8922a9238ba0414caef236a05d/mann_whitney_w640.jpeg" style="width: 640px; height: 230px;" /></p>
<p>The p-value is 0.1393, which is greater than 0.05. There isn’t enough evidence to conclude that hurricanes caused more deaths before 1979.</p>
Can We Really Conclude that Female-Named Hurricanes Cause More Deaths?
<p>Our data analysis found no conclusive evidence that the pre-1979 era explains the difference, so the idea that hurricanes with female names cause more deaths remains plausible. But there are other issues to consider. For example, the gender of the hurricane name was not treated as a binary variable, which would group each hurricane as either male or female. Instead, nine independent coders rated the masculinity vs. femininity of historical hurricane names on two items (1 = very masculine, 11 = very feminine, and 1 = very man-like, 11 = very woman-like), which were averaged to compute a masculinity-femininity index (MFI).</p>
<p>Do these 9 coders represent how most Americans would rate the femininity of names? Would you rate Barbara as more feminine than Carol or Betsy? The coders did, giving Barbara a 9.8 while Carol and Betsy were 8.1 and 8.3 respectively. And the MFI is important, since it was found to be the gender variable that had a significant interaction with normalized damage. When gender name was treated as a binary variable, there was no interaction.</p>
<p>But masculinity-femininity index aside, the study did have some very interesting findings. I’m sure additional research will be done in the years to come to see if the findings hold true. Let's hope that then we’ll be able to know for sure whether people underestimate female-named hurricanes or not.</p>
<p>Until then, if a hurricane is bearing down on your neighborhood, I would make sure to board up the windows and buy out the supermarket's bread and milk, regardless of the storm's name.</p>
Hypothesis Testing, Statistics, Statistics in the News
Fri, 06 Jun 2014 13:17:00 +0000
http://blog.minitab.com/blog/the-statistics-game/do-the-data-really-say-female-named-hurricanes-are-more-deadly
Kevin Rudy
Hypothesis Testing and P Values
http://blog.minitab.com/blog/statistics-in-the-field/hypothesis-testing-and-p-values
<p><em>by Matthew Barsalou, guest blogger</em></p>
<p>Programs such as the <a href="http://www.minitab.com/products/minitab/">Minitab Statistical Software</a> make hypothesis testing easier, but no program can think for the experimenter. Anybody performing a statistical hypothesis test must understand what p values mean with regard to their statistical results, as well as the potential limitations of statistical hypothesis testing.</p>
<p>A significance level (alpha) of 0.05 is frequently used as the cutoff in statistical hypothesis testing. A p value of 0.05 indicates that if there is no effect (or if the null hypothesis is true), you’d obtain the observed difference or more in 5% of studies due to random sampling error. However, <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values">performing multiple hypothesis tests at an alpha of 0.05 increases the chance of a false positive</a>.</p>
<p>This is well illustrated by the online comic <a href="http://xkcd.com/882/">XKCD</a>, which depicted somebody stating that jelly beans cause acne.</p>
<p><img alt="Significant" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/08b29e9eec884bee99602335f1f9c893/xkcd.png" style="border-width: 0px; border-style: solid; width: 310px; height: 859px;" /></p>
<p>Scientists investigated and found no link, so the person claimed that only a certain color of jelly bean causes acne. The scientists then tested 20 different colors of jelly beans, each at a significance level of 0.05. Only the green jelly bean had a p value less than 0.05.</p>
<p>The comic ends with a newspaper reporting a link between green jelly beans and acne. The newspaper points out there is 95% confidence with only a 5% chance of coincidence. What is wrong with the conclusion?</p>
<p>We can determine the chance that there will be no false conclusions by using the binomial formula.</p>
<p><img alt="binomial formula" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/b962df0ea487d69594aea4975ae69225/equation1.gif" style="width: 500px; height: 87px;" /></p>
<p>This means that we have a 35.8% chance of performing 20 hypothesis tests without getting a false positive when using an alpha level of 0.05. We can also calculate the probability that we have at least one incorrect result due to random chance, which statisticians refer to as the <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/multiple-comparisons-beware-of-individual-errors-that-multiply">family error rate</a>.</p>
<p><img src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/6a80807434e2c2678163dbcc710d13a0/equation2.gif" style="width: 345px; height: 73px;" /></p>
<p>The chance that at least one result will be a false positive when performing 20 hypothesis tests using an alpha level of 0.05 is 64.2%.</p>
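<p>For readers who want to check these two figures, here is a minimal Python sketch of the binomial calculation. It assumes the 20 tests are independent and that every null hypothesis is true; the function name is my own and purely illustrative.</p>

```python
from math import comb

def prob_false_positives(k, n_tests, alpha):
    """Binomial probability of exactly k false positives in n_tests
    independent tests when every null hypothesis is true."""
    return comb(n_tests, k) * alpha**k * (1 - alpha) ** (n_tests - k)

# Chance of zero false positives in 20 tests at alpha = 0.05.
p_none = prob_false_positives(0, 20, 0.05)

# Chance of at least one false positive is the complement.
p_at_least_one = 1 - p_none

print(f"no false positives:    {p_none:.3f}")          # 0.358
print(f"at least one positive: {p_at_least_one:.3f}")  # 0.642
```

<p>Changing the number of tests shows how quickly the risk grows: the more tests you run at the same alpha, the more likely it is that at least one of them is significant by chance alone.</p>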
<p>So the press release in the XKCD comic may have been a bit premature.</p>
<p>Suppose I had a sample of 14 observations with a mean of 87.2 and I wanted to know if the mean is actually 85.2. I performed a one-sample t-test in Minitab by going to <strong>Stat > Basic Statistics > 1 Sample t…</strong> and entering the summarized data. I checked the “perform hypothesis test” box, then selected “Options…” and used the default confidence level of 95.0. This corresponds to an alpha of 0.05.</p>
<p style="margin-left: 40px;"><img alt="One-Sample T test output" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/55e90b93ae38e8612ce3adb4ea0c4f00/output1.png" style="border-width: 0px; border-style: solid; width: 425px; height: 130px;" /></p>
<p>I performed the test and the resulting p value was 0.049, which is close to but still below 0.05, so I can reject my null hypothesis. But a p value this close to the cutoff deserves caution. If I repeated the test on new samples, I might well fail to reject the null hypothesis. And if I kept testing until I found a significant result, as in the XKCD example, the 5% chance of a false positive would accumulate with each additional test.</p>
<p>There are alternatives to statistical hypothesis testing; for example, Bayesian inference could be used in place of hypothesis testing with p values. But alternative methods have their own weaknesses, and they may be difficult for non-statisticians to use.</p>
<p>Instead of avoiding the use of hypothesis testing, we should account for its limitations, for example by realizing that each repeat of the test increases the chance of a false positive, as illustrated by XKCD's jelly bean example.</p>
<p>We can’t simply retest over and over using the same significance level and then conclude that we have results with statistical significance. For situations such as in the XKCD example, Simmons, Nelson and Simonsohn recommend disclosing the total number of tests that were <a href="http://people.psych.cornell.edu/~jec7/pcd%20pubs/simmonsetal11.pdf">performed</a>. Had we known that 20 tests had been performed at an alpha of 0.05, we could have realized that we may not need to avoid green jelly beans after all.</p>
<p> </p>
<div><strong>About the Guest Blogger: </strong></div>
<div><em>Matthew Barsalou is an engineering quality expert in BorgWarner Turbo Systems Engineering GmbH’s Global Engineering Excellence department. He has previously worked as a quality manager at an automotive component supplier and as a contract quality engineer at Ford in Germany and Belgium. He possesses a bachelor of science in industrial sciences, a master of liberal studies and a master of science in business administration and engineering from the Wilhelm Büchner Hochschule in Darmstadt, Germany.</em></div>
<div> </div>
<p>xkcd.com comic from <a href="http://xkcd.com/882/">http://xkcd.com/882/</a> used under Creative Commons Attribution-NonCommercial 2.5 License. <a href="http://xkcd.com/license.html">http://xkcd.com/license.html</a></p>
<p> </p>
Fun Statistics, Hypothesis Testing
Mon, 02 Jun 2014 12:00:00 +0000
http://blog.minitab.com/blog/statistics-in-the-field/hypothesis-testing-and-p-values
Guest Blogger
Five Guidelines for Using P values
http://blog.minitab.com/blog/adventures-in-statistics/five-guidelines-for-using-p-values
<p>There is high pressure to find low P values. Obtaining a low P value for a hypothesis test is make or break because it can lead to funding, articles, and prestige. Statistical significance is everything!</p>
<p>My two previous posts looked at several issues related to P values:</p>
<ul>
<li><a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" target="_blank">P values have a higher than expected false positive rate.</a></li>
<li><a href="http://blog.minitab.com/blog/adventures-in-statistics/not-all-p-values-are-created-equal" target="_blank">The same P value from different studies can correspond to different false positive rates.</a></li>
</ul>
<p>In this post, I’ll look at whether P values are still helpful and provide guidelines on how to use them with these issues in mind.</p>
<div style="float: right; width: 200px; margin: 25px 25px;">
<p><img alt="Ronald Fisher" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/f7eb953015180df73edfa6f073f234c6/r__a__fisher.jpg" style="float: right; width: 200px; height: 243px; border-width: 1px; border-style: solid;" /> <em>Sir Ronald A Fisher</em></p>
</div>
Are P Values Still Valuable?
<p>Given the issues about P values, are they still helpful? A higher than expected rate of false positives can be a problem because if you implement the “findings” from a false positive study, you won’t get the expected benefits.</p>
<p>In my view, P values are a great tool. Ronald Fisher introduced P values in the 1920s because he wanted an objective method for comparing data to the null hypothesis, rather than the informal eyeball approach: "My data <em>look </em>different than the null hypothesis."</p>
<p>P value calculations incorporate the effect size, sample size, and variability of the data into a single number that objectively tells you how consistent your data are with the null hypothesis. Pretty nifty!</p>
<p>Unfortunately, the high pressure to find low P values, combined with a common misunderstanding of <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" target="_blank">how to correctly interpret P values</a>, has distorted the interpretation of significant results. However, these issues can be resolved.</p>
<p>So, let’s get to the guidelines! Their overall theme is that you should evaluate P values as part of a larger context where other factors matter.</p>
Guideline 1: The Exact P Value Matters
<div style="float: right; width: 90px; margin: 25px 25px;">
<p style="line-height: 11px; text-align: center;"><img alt="Small wooden P" height="75px" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/c408562ea4a40eedae9ae78c1d3ca027/p_wooden.jpg" width="75px" /><br />
<em>Tiny Ps are<br />
great!</em></p>
</div>
<p>With the high pressure to find low P values, there’s a tendency to view studies as either significant or not. Did a study produce a P value less than 0.05? If so, it’s golden! However, there is no magic significance level that distinguishes between the studies that have a true effect and those that don’t with 100% accuracy. Instead, it’s all about lowering the error rate to an acceptable level.</p>
<p>The lower the P value, the lower the error rate. For example, a P value near 0.05 has an error rate of 25-50%. However, a P value of 0.0027 corresponds to an error rate of at least 4.5%, which is close to the rate that many mistakenly attribute to a P value of 0.05.</p>
<p>A lower P value thus suggests stronger evidence for rejecting the null hypothesis. A P value near 0.05 simply indicates that the result is worth another look, but it’s nothing you can hang your hat on by itself. It’s not until you get down near 0.001 that you have a fairly low chance of a false positive.</p>
Guideline 2: Replication Matters
<p>Today, P values are everything. However, Fisher intended P values to be just one part of a process that incorporates experimentation, statistical analysis and replication to lead to scientific conclusions.</p>
<p>According to Fisher, “A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.”</p>
<p>The false positive rates associated with P values that we saw in my last post definitely support this view. A single study, especially if the P value is near 0.05, is unlikely to reduce the false positive rate down to an acceptable level. Repeated experimentation may be required to finish at a point where the error rate is low enough to meet your objectives.</p>
<p>For example, if you have two independent studies that each produced a P value of 0.05, you can multiply the P values to obtain a probability of 0.0025 for both studies. However, you must include both the significant and insignificant studies in a series of similar studies, and not cherry pick only the significant studies.</p>
<p><img alt="Replicate study results" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/d1f27fc3889672c11ac23b1ffa9bfac9/p_rep.gif" style="width: 403px; height: 136px;" /></p>
<p>Conclusively proving a hypothesis with a single study is unlikely. So, don’t expect it!</p>
Guideline 3: The Effect Size Matters
<p>With all the focus on P values, attention to the effect size can be lost. Just because an effect is statistically significant doesn't necessarily make it meaningful in the real world. Nor does a P value indicate the precision of the estimated effect size.</p>
<p>If you want to move from just detecting an effect to assessing its magnitude and precision, use <a href="http://blog.minitab.com/blog/adventures-in-statistics/when-should-i-use-confidence-intervals-prediction-intervals-and-tolerance-intervals" target="_blank">confidence intervals</a>. In this context, a confidence interval is a range of values that is likely to contain the effect size.</p>
<p>For example, an AIDS vaccine <a href="http://news.sciencemag.org/health/2009/09/massive-aids-vaccine-study-modest-success" target="_blank">study</a> in Thailand obtained a P value of 0.039. Great! This was the first time that an AIDS vaccine had positive results. However, the confidence interval for effectiveness ranged from 1% to 52%. That’s not so impressive...the vaccine may work virtually none of the time up to half the time. The effectiveness is both low and imprecisely estimated.</p>
<p>Avoid thinking about studies only in terms of whether they are significant or not. Ask yourself: is the effect size precisely estimated and large enough to be important?</p>
Guideline 4: The Alternative Hypothesis Matters
<p>We tend to think of equivalent P values from different studies as providing the same support for the alternative hypothesis. However, <a href="http://blog.minitab.com/blog/adventures-in-statistics/not-all-p-values-are-created-equal" target="_blank">not all P values are created equal</a>.</p>
<p>Research shows that the plausibility of the alternative hypothesis greatly affects the false positive rate. For example, a highly plausible alternative hypothesis and a P value of 0.05 are associated with an error rate of at least 12%, while an implausible alternative is associated with a rate of at least 76%!</p>
<p>For example, given the track record for AIDS vaccines where the alternative hypothesis has never been true in previous studies, it's highly unlikely to be true at the outset of the Thai study. This situation tends to produce high false positive rates—often around 75%!</p>
<p>When you hear about a surprising new study that finds an unprecedented result, don’t fall for that first significant P value. Wait until the study has been well replicated before buying into the results!</p>
Guideline 5: Subject Area Knowledge Matters
<p>Applying subject area expertise to all aspects of hypothesis testing is crucial. Researchers need to apply their scientific judgment about the plausibility of the hypotheses, results of similar studies, proposed mechanisms, proper experimental design, and so on. Expert knowledge transforms statistics from numbers into meaningful, trustworthy findings.</p>
Hypothesis Testing, Statistics, Statistics Help
Thu, 15 May 2014 11:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/five-guidelines-for-using-p-values
Jim Frost
Not All P Values are Created Equal
http://blog.minitab.com/blog/adventures-in-statistics/not-all-p-values-are-created-equal
<p><img alt="Fancy P" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/2762a55291d134b8185ba9da47ea6f83/p_fancy.gif" style="float: right; width: 150px; height: 194px; margin: 10px 15px;" />The interpretation of P values would seem to be fairly standard between different studies. Even if two hypothesis tests study different subject matter, we tend to assume that you can interpret a P value of 0.03 the same way for both tests. A P value is a P value, right?</p>
<p>Not so fast! While Minitab <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">statistical software</a> can correctly calculate all P values, it can’t factor in the larger context of the study. You and your common sense need to do that!</p>
<p>In this post, I’ll demonstrate that P values tell us very different things depending on the larger context.</p>
Recap: P Values Are Not the Probability of Making a Mistake
<p>In my previous post, I showed the <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" target="_blank">correct way to interpret P values</a>. Keep in mind the big caution: P values are<em> not</em> the error rate, or the likelihood of making a mistake by rejecting a true null hypothesis (Type I error).</p>
<p>You can equate this error rate to the false positive rate for a hypothesis test. A false positive happens when the sample is unusual due to chance alone and it produces a low P value. However, despite the low P value, the alternative hypothesis is not true. There is no effect at the population level.</p>
<p>Sellke <em>et al</em>. estimated that a P value of 0.05 corresponds to a false positive rate of “at least 23% (and typically close to 50%).”</p>
What Affects the Error Rate?
<p>Why is there a range of values for the error rate? To understand that, you need to understand the factors involved. David Colquhoun, a professor in biostatistics, lays them out <a href="http://www.dcscience.net/?p=6518" target="_blank">here</a>.</p>
<p>Whereas Sellke<em> et al.</em> use a Bayesian approach, Colquhoun uses a non-Bayesian approach but derives similar estimates. For example, Colquhoun estimates P values between 0.045 and 0.05 have a false positive rate of at least 26%.</p>
<p>The factors that affect the false positive rate are:</p>
<ul>
<li>Prevalence of real effects (higher is good)</li>
<li>Power (higher is good)</li>
<li>P value (lower is good)</li>
</ul>
<p>“Good” means that the test is less likely to produce a false positive. The 26% error rate assumes a prevalence of real effects of 0.5 and a power of 0.8. If you decrease the prevalence to 0.1, suddenly the false positive rate shoots up to 76%. Yikes!</p>
<p>Power is related to false positives because when a study has a lower probability of detecting a true effect, a higher proportion of the positives will be false positives.</p>
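<p>A rough way to see the link between power and false positives is to count, out of many hypothetical studies, how many significant results come from true nulls. The Python sketch below does that. Note the assumption: it counts every result with P below alpha, which is a simpler calculation than the one behind the 26% and 76% figures above (those concern P values close to 0.05), so its numbers come out lower. The function name is my own.</p>

```python
def false_positive_share(prevalence, power, alpha=0.05):
    """Fraction of significant results (P < alpha) that come from true nulls.

    prevalence: proportion of tested hypotheses where a real effect exists
    power:      probability that a test detects a real effect
    alpha:      significance level used for every test
    """
    false_positives = (1 - prevalence) * alpha  # true nulls that reach significance
    true_positives = prevalence * power         # real effects that are detected
    return false_positives / (false_positives + true_positives)

# Hold prevalence at 0.1 and lower the power: the share of false positives rises.
for power in (0.8, 0.5, 0.2):
    share = false_positive_share(prevalence=0.1, power=power)
    print(f"power = {power}: {share:.0%} of significant results are false positives")
```

<p>With fewer real effects detected, the fixed trickle of false positives from true nulls makes up a larger part of all significant results.</p>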
<p>Now, let’s dig into a very interesting factor: the prevalence of real effects. As we saw, this factor can hugely impact the error rate!</p>
P Values and the Prevalence of Real Effects
<p><img alt="Joke: I once asked a statistician out. She failed to reject me!" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/f9d1ea1b51185c0631ae8fadb0145f8f/fail_reject_joke.gif" style="float: right; width: 275px; height: 313px; margin: 10px 15px;" />What Colquhoun calls the prevalence of real effects (denoted as P(real)), the Bayesian approach calls the prior probability. It is the proportion of hypothesis tests in which the alternative hypothesis is true at the outset. It can be thought of as the long-term probability, or track record, of similar types of studies. It’s the plausibility of the alternative hypothesis.</p>
<p>If the alternative hypothesis is farfetched, or has a poor track record, P(real) is low. For example, a prevalence of 0.1 indicates that 10% of similar alternative hypotheses have turned out to be true while 90% of the time the null was true. Perhaps the alternative hypothesis is unusual, untested, or otherwise implausible.</p>
<p>If the alternative hypothesis fits current theory, has an identified mechanism for the effect, and previous studies have already shown significant results, P(real) is higher. For example, a prevalence of 0.90 indicates that the alternative is true 90% of the time, and the null only 10% of the time.</p>
<p>If the prevalence is 0.5, there is a 50/50 chance that either the null or alternative hypothesis is true at the outset of the study.</p>
<p>You may not always know this probability, but theory and a previous track record can be guides. For our purposes, we’ll use this principle to see how it impacts our interpretation of P values. Specifically, we’ll focus on the probability of the null being true (1 – P(real)) at the beginning of the study.</p>
Hypothesis Tests Are Journeys from the Prior Probability to Posterior Probability
<p><a href="http://blog.minitab.com/blog/understanding-statistics/what-statistical-hypothesis-test-should-i-use" target="_blank">Hypothesis tests</a> begin with differing probabilities that the null hypothesis is true depending on the specific hypotheses being tested. This prior probability influences the probability that the null is true at the conclusion of the test, the posterior probability.</p>
<p>If P(real) = 0.9, there is only a 10% chance that the null hypothesis is true at the outset. Consequently, the probability of rejecting a true null at the conclusion of the test must be less than 10%. However, if you start with a 90% chance of the null being true, the odds of rejecting a true null increase because there are more true nulls.</p>
<table style="text-align: center;">
<tr>
<th>Initial Probability of true null (1 – P(real))</th>
<th>P value obtained</th>
<th>Final Minimum Probability of true null</th>
</tr>
<tr><td>0.5</td><td>0.05</td><td>0.289</td></tr>
<tr><td>0.5</td><td>0.01</td><td>0.110</td></tr>
<tr><td>0.5</td><td>0.001</td><td>0.018</td></tr>
<tr><td>0.33</td><td>0.05</td><td>0.12</td></tr>
<tr><td>0.9</td><td>0.05</td><td>0.76</td></tr>
</table>
<p>The table is based on calculations by Colquhoun and Sellke <em>et al.</em> It shows that the decrease from the initial probability to the final probability of a true null depends on the P value. Power is also a factor but not shown in the table.</p>
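<p>The entries for an initial probability of 0.5 can be approximated with the calibration published by Sellke <em>et al.</em>, which, as I understand it, bounds the Bayes factor in favor of the null at –e × p × ln(p) for P values below 1/e. The Python sketch below applies that bound; the function name is hypothetical. It lands close to the 0.289, 0.110, and 0.018 entries, but it does not reproduce the last two entries exactly, which presumably rest on Colquhoun's separate calculation.</p>

```python
from math import e, log

def min_posterior_null(p_value, prior_null):
    """Lower bound on the probability that the null is true after seeing
    p_value, using the -e * p * ln(p) calibration (valid for 0 < p < 1/e)."""
    if not 0 < p_value < 1 / e:
        raise ValueError("the calibration applies only for 0 < p < 1/e")
    bayes_factor = -e * p_value * log(p_value)   # smallest Bayes factor for the null
    prior_odds = prior_null / (1 - prior_null)   # odds of a true null at the outset
    posterior_odds = prior_odds * bayes_factor   # odds of a true null after the test
    return posterior_odds / (1 + posterior_odds)

# Start from a 50/50 chance that the null is true.
for p in (0.05, 0.01, 0.001):
    print(p, round(min_posterior_null(p, prior_null=0.5), 3))
```

<p>Because this is a lower bound, the actual probability of a true null can be higher, which is why the table labels the column a minimum.</p>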
Where Do We Go with P values from Here?
<p><img alt="wooden block P" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/c408562ea4a40eedae9ae78c1d3ca027/p_wooden.jpg" style="float: right; width: 150px; height: 150px;" />There are many combinations of conditions that affect the probability of rejecting a true null. However, don't try to remember every combination and the error rate, especially because you may only have a vague sense of the true P(real) value!</p>
<p>Just remember two big takeaways:</p>
<ol>
<li>A single statistically significant hypothesis test often provides insufficient evidence to confidently discard the null hypothesis. This is particularly true when the P value is closer to 0.05.</li>
<li>P values from different hypothesis tests can have the same value, but correspond to very different false positive rates. You need to understand their context to be able to interpret them correctly.</li>
</ol>
<p>The second point is epitomized by a quote that was popularized by Carl Sagan: “Extraordinary claims require extraordinary evidence.”</p>
<p>A surprising new study may have a significant P value, but you shouldn't trust the alternative hypothesis until the results are replicated by additional studies. As shown in the table, a significant but unusual alternative hypothesis can have an error rate of 76%!</p>
<p>Don’t fret! There are simple recommendations based on the principles above that can help you navigate P values and use them correctly. I’ll cover <a href="http://blog.minitab.com/blog/adventures-in-statistics/five-guidelines-for-using-p-values">five guidelines for using P values</a> in my next post.</p>
Hypothesis Testing
Thu, 01 May 2014 11:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/not-all-p-values-are-created-equal
Jim Frost
How to Correctly Interpret P Values
http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values
<p><img alt="P value" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/d95f756ee6f6a4cec607017c8edea52a/bigp.gif" style="margin: 4px; float: right; width: 110px; height: 125px;" />The P value is used all over statistics, from <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/t-for-2-should-i-use-a-paired-t-or-a-2-sample-t" target="_blank">t-tests</a> to <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients" target="_blank">regression analysis</a>. Everyone knows that you use P values to determine statistical significance in a hypothesis test. In fact, P values often determine what studies get published and what projects get funding.</p>
<p>Despite being so important, the P value is a slippery concept that people often interpret incorrectly. How <em>do</em> you interpret P values?</p>
<p>In this post, I'll help you to understand P values in a more intuitive way and to avoid a very common misinterpretation that can cost you money and credibility.</p>
What Is the Null Hypothesis in Hypothesis Testing?
<p><img alt="Scientist performing an experiment" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/3407070c72311249854712c526aceb59/scientist_w640.jpeg" style="margin: 10px 15px; float: right; width: 300px; height: 200px; border-width: 1px; border-style: solid;" />In order to understand P values, you must first understand the null hypothesis.</p>
<p>In every experiment, there is an effect or difference between groups that the researchers are testing. It could be the effectiveness of a new drug, building material, or other intervention that has benefits. Unfortunately for the researchers, there is always the possibility that there is no effect, that is, that there is no difference between the groups. This lack of a difference is called the null hypothesis, which is essentially the position a devil’s advocate would take when evaluating the results of an experiment.</p>
<p>To see why, let’s imagine an experiment for a drug that we know is totally ineffective. The null hypothesis is true: there is no difference between the experimental groups at the population level.</p>
<p>Despite the null being true, it’s entirely possible that there will be an effect in the sample data due to random sampling error. In fact, it is extremely unlikely that the sample groups will ever exactly equal the null hypothesis value. Consequently, the devil’s advocate position is that the observed difference in the sample does not reflect a true difference between populations.</p>
What Are P Values?
<p><img alt="Joke" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/81242ed4497d1961eb264c3d7c65cc66/null_joke.gif" style="margin: 4px; float: right; width: 250px; height: 206px;" />P values evaluate how well the sample data support the devil’s advocate argument that the null hypothesis is true. They measure how compatible your data are with the null hypothesis. How likely is the effect observed in your sample data if the null hypothesis is true?</p>
<ul>
<li>High P values: your data are likely with a true null.</li>
<li>Low P values: your data are unlikely with a true null.</li>
</ul>
<p>A low P value suggests that your sample provides enough evidence that you can reject the null hypothesis for the entire population.</p>
How Do You Interpret P Values?
<p><img alt="Vaccine" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/179970708b13904b2993033a5cc2e71d/vaccination_w640.jpeg" style="margin: 4px; float: right; width: 300px; height: 160px;" />In technical terms, a P value is the probability of obtaining an effect at least as extreme as the one in your sample data, assuming the truth of the null hypothesis.</p>
<p>For example, suppose that a vaccine study produced a P value of 0.04. This P value indicates that if the vaccine had no effect, you’d obtain the observed difference or more in 4% of studies due to random sampling error.</p>
<p>P values address only one question: how likely are your data, assuming a true null hypothesis? They do not measure support for the alternative hypothesis. This limitation leads to the next section, which covers a very common misinterpretation of P values.</p>
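<p>One way to see the definition at work is a permutation test, sketched below. This is a swapped-in illustration, not the method used in the vaccine example: it shuffles the group labels to enforce a true null, then counts how often the shuffled difference is at least as extreme as the observed one. The function name and the two small data sets are hypothetical.</p>

```python
import random

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Two-sided P value: the share of label shuffles whose difference in
    means is at least as extreme as the observed difference."""
    rng = random.Random(seed)
    mean = lambda xs: sum(xs) / len(xs)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # under the null, the labels are interchangeable
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_shuffles

# Hypothetical measurements, for illustration only.
placebo = [12.1, 11.4, 13.0, 12.7, 11.9, 12.3]
treated = [11.2, 10.9, 12.0, 11.1, 11.6, 10.8]
p = permutation_p_value(placebo, treated)
print(f"P value: {p:.3f}")
```

<p>Note what the calculation never uses: any information about how plausible the alternative hypothesis is. It only describes the data under the assumption that the null is true.</p>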
P Values Are <em>NOT </em>the Probability of Making a Mistake
<p>Incorrect interpretations of P values are very common. The most common mistake is to interpret a P value as the probability of making a mistake by rejecting a true null hypothesis (a Type I error).</p>
<p>There are several reasons why P values can’t be the error rate.</p>
<p>First, P values are calculated based on the assumptions that the null is true for the population and that the difference in the sample is caused entirely by random chance. Consequently, P values can’t tell you the probability that the null is true or false because it is 100% true from the perspective of the calculations.</p>
<p>Second, while a low P value indicates that your data are unlikely assuming a true null, it can’t evaluate which of two competing cases is more likely:</p>
<ul>
<li>The null is true but your sample was unusual.</li>
<li>The null is false.</li>
</ul>
<p>Determining which case is more likely requires subject area knowledge and replicate studies.</p>
<p>Let’s go back to the vaccine study and compare the correct and incorrect way to interpret the P value of 0.04:</p>
<ul>
<li><strong>Correct: </strong>Assuming that the vaccine had no effect, you’d obtain the observed difference or more in 4% of studies due to random sampling error.<br />
</li>
<li><strong>Incorrect:</strong> If you reject the null hypothesis, there’s a 4% chance that you’re making a mistake.</li>
</ul>
What Is the True Error Rate?
<p><img alt="Caution sign" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/41ad875b2a88a19ab5bdfa5e47ed790b/caution_sign_w640.jpeg" style="margin: 4px; float: right; width: 250px; height: 250px;" />Think that this interpretation difference is simply a matter of semantics, and only important to picky statisticians? Think again. It’s important to you.</p>
<p>If a P value is not the error rate, what the heck is the error rate? (Can you guess which way this is heading now?)</p>
<p>Sellke et al.* have estimated the error rates associated with different P values. While the precise error rate depends on various assumptions (which I'll talk about in my next post), the table summarizes them for middle-of-the-road assumptions.</p>
<table>
<tbody>
<tr>
<td style="text-align: center;"><strong>P value</strong></td>
<td style="text-align: center;"><strong>Probability of incorrectly rejecting a true null hypothesis</strong></td>
</tr>
<tr>
<td style="text-align: center;">0.05</td>
<td style="text-align: center;">At least 23% (and typically close to 50%)</td>
</tr>
<tr>
<td style="text-align: center;">0.01</td>
<td style="text-align: center;">At least 7% (and typically close to 15%)</td>
</tr>
</tbody>
</table>
<p>Do the higher error rates in this table surprise you? Unfortunately, the common misinterpretation of P values as the error rate creates the illusion of substantially more evidence against the null hypothesis than is justified. As you can see, if you base a decision on a single study with a P value near 0.05, the difference observed in the sample may not exist at the population level. That can be costly!</p>
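<p>The gap between a P value and the error rate can be explored with a simulation. The sketch below is not the calculation from Sellke et al.; it rests on assumptions chosen purely for illustration: half of all studies test a true null, every real effect has the same size (2.5 standard errors), and each study reports a two-sided z-test. It then asks what share of the studies with a P value just under 0.05 actually had a true null.</p>

```python
import math
import random

def two_sided_p(z):
    """Two-sided P value for a z statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def share_of_true_nulls(p_low, p_high, n_studies=200_000,
                        prior_null=0.5, effect_in_se_units=2.5, seed=1):
    """Among simulated studies whose P value lands in [p_low, p_high],
    return the fraction for which the null hypothesis was really true."""
    rng = random.Random(seed)
    null_hits = 0
    total_hits = 0
    for _ in range(n_studies):
        null_is_true = rng.random() < prior_null
        true_mean = 0.0 if null_is_true else effect_in_se_units
        z = rng.gauss(true_mean, 1.0)  # observed test statistic
        if p_low <= two_sided_p(z) <= p_high:
            total_hits += 1
            if null_is_true:
                null_hits += 1
    if total_hits == 0:
        raise ValueError("no simulated study fell in the requested P value band")
    return null_hits / total_hits

share = share_of_true_nulls(0.04, 0.05)
print(f"true nulls among studies with 0.04 <= P <= 0.05: {share:.1%}")
```

<p>Changing <code>prior_null</code> or <code>effect_in_se_units</code> changes the answer, which is the point: the error rate depends on assumptions that the P value alone does not capture.</p>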
<p>Read my <a href="http://blog.minitab.com/blog/adventures-in-statistics/not-all-p-values-are-created-equal" target="_blank">next post</a> to understand the factors that affect the true error rate. Or, read my <a href="http://blog.minitab.com/blog/adventures-in-statistics/five-guidelines-for-using-p-values">five guidelines for how to use P values and avoid incorrect decisions</a>.</p>
<p>*Thomas Sellke, M. J. Bayarri, and James O. Berger, "Calibration of p Values for Testing Precise Null Hypotheses," <em>The American Statistician</em>, February 2001, Vol. 55, No. 1</p>
Hypothesis Testing
Thu, 17 Apr 2014 11:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values
Jim Frost
Re-analyzing Wine Tastes with Minitab 17
http://blog.minitab.com/blog/applying-statistics-in-quality-projects/re-analyzing-wine-tastes-with-minitab-17
<p>In April 2012, I wrote a short paper on <a href="http://www.minitab.com/en-us/Published-Articles/Wine-Tasting-by-Numbers--Using-Binary-Logistic-Regression-to-Reveal-the-Preferences-of-Experts/">binary logistic regression</a> to analyze wine tasting data. At that time, François Hollande was about to get elected as French president and in the U.S., Mitt Romney was winning the Republican primaries. That seems like a long time ago…</p>
<p>Now, in 2014, Minitab 17 <a href="http://www.minitab.com/products/minitab/">Statistical Software</a> has just been released. Had Minitab 17 been available in 2012, would I have conducted my analysis in a different way? Would the results still look similar? I decided to re-analyze my April 2012 data with Minitab 17 and assess the differences, if there are any.</p>
<p>There were no fewer than 12 parameters to analyze with a binary response. Eleven of them were continuous variables and one was a categorical factor (wine type: white or red), and the number of two-factor interactions that could be studied was huge (66 two-factor interactions were potentially available).</p>
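<p>The count of 66 is simply the number of unordered pairs that can be formed from 12 parameters, as this short check shows (the variable names are illustrative):</p>

```python
import math

n_parameters = 12
# Each two-factor interaction is an unordered pair of distinct parameters.
two_factor_interactions = math.comb(n_parameters, 2)
# A full initial model holds every main effect plus every pairwise interaction.
initial_model_terms = n_parameters + two_factor_interactions
print(two_factor_interactions, initial_model_terms)
```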
<p>The parameters to be studied:</p>
<table>
<tbody>
<tr>
<td style="text-align: center;"><strong>Variable</strong></td>
<td style="text-align: center;"><strong>Details</strong></td>
<td style="text-align: center;"><strong>Units</strong></td>
</tr>
<tr>
<td style="text-align: center;">Type</td>
<td style="text-align: center;">red or white</td>
<td style="text-align: center;">N/A</td>
</tr>
<tr>
<td style="text-align: center;">pH</td>
<td style="text-align: center;">acidity (below 7) or alkalinity (over 7)</td>
<td style="text-align: center;">N/A</td>
</tr>
<tr>
<td style="text-align: center;">Density</td>
<td style="text-align: center;">density</td>
<td style="text-align: center;">grams/cubic centimeter</td>
</tr>
<tr>
<td style="text-align: center;">Sulphates</td>
<td style="text-align: center;">potassium sulfate</td>
<td style="text-align: center;">grams/liter</td>
</tr>
<tr>
<td style="text-align: center;">Alcohol</td>
<td style="text-align: center;">percentage alcohol</td>
<td style="text-align: center;">% volume</td>
</tr>
<tr>
<td style="text-align: center;">Residual sugar</td>
<td style="text-align: center;">residual sugar</td>
<td style="text-align: center;">grams/liter</td>
</tr>
<tr>
<td style="text-align: center;">Chlorides</td>
<td style="text-align: center;">sodium chloride</td>
<td style="text-align: center;">grams/liter</td>
</tr>
<tr>
<td style="text-align: center;">Free SO2</td>
<td style="text-align: center;">free sulphur dioxide</td>
<td style="text-align: center;">milligrams/liter</td>
</tr>
<tr>
<td style="text-align: center;">Total SO2</td>
<td style="text-align: center;">total sulphur dioxide</td>
<td style="text-align: center;">milligrams/liter</td>
</tr>
<tr>
<td style="text-align: center;">Fixed acidity</td>
<td style="text-align: center;">tartaric acid</td>
<td style="text-align: center;">grams/liter</td>
</tr>
<tr>
<td style="text-align: center;">Volatile acidity</td>
<td style="text-align: center;">acetic acid</td>
<td style="text-align: center;">grams/liter</td>
</tr>
<tr>
<td style="text-align: center;">Citric acid</td>
<td style="text-align: center;">citric acid</td>
<td style="text-align: center;">grams/liter</td>
</tr>
</tbody>
</table>
Restricting Analysis to the Main Effects
<p>In 2012, due to the very large number of potential two-factor interactions, I restricted my analysis to the main effects (not considering the interactions between continuous variables).</p>
<p>Because the individual parameters had to be eliminated manually, one at a time, according to their p values (the term with the highest p value is removed and the model is refit, until every parameter and interaction remaining in the model has a p value lower than 0.05), this was a very lengthy process.</p>
<p>To avoid obtaining an excessively complex final model, I eventually decided to analyze white and red wines separately (a model for the white wines, another model for the red wines), suggesting that the effects of some of the variables were different according to the type of wine.</p>
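<p>The manual procedure described above can be sketched as a short loop. This is a generic illustration, not Minitab's implementation: <code>fit_p_values</code> is a hypothetical callback that refits the model on the current terms and returns a p value for each one, and the stand-in <code>fake_fit</code> with its made-up p values exists only so the sketch runs. In a real refit, the p values would change after every removal.</p>

```python
from typing import Callable, Dict, List

def backward_eliminate(terms: List[str],
                       fit_p_values: Callable[[List[str]], Dict[str, float]],
                       alpha: float = 0.05) -> List[str]:
    """Remove the term with the highest p value, refit, and repeat until
    every remaining term has a p value below alpha."""
    remaining = list(terms)
    while remaining:
        p_values = fit_p_values(remaining)  # refit with the current terms
        worst = max(remaining, key=lambda term: p_values[term])
        if p_values[worst] < alpha:
            break  # everything left is significant
        remaining.remove(worst)
    return remaining

# Made-up p values for illustration only; a real fit would recompute them.
MADE_UP = {"Alcohol": 0.001, "pH": 0.40, "Density": 0.22, "Fixed acidity": 0.01}

def fake_fit(terms: List[str]) -> Dict[str, float]:
    return {term: MADE_UP[term] for term in terms}

kept = backward_eliminate(list(MADE_UP), fake_fit)
print(kept)
```

<p>The sketch ignores model hierarchy (keeping a main effect whenever one of its interactions stays in the model), which a real analysis with interactions would need to respect.</p>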
Including 2-Way Interactions in the Analysis
<p>Using Minitab 17 makes a substantial difference in this respect. All 2-way interactions can be easily selected to generate an initial model:</p>
<p><img alt="interactions" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/47940b6e8427b9c44afdf56f511b0d44/interactions_logistic_binary.JPG" style="width: 516px; height: 540px;" /></p>
<p>With Minitab 17, you can use stepwise binary logistic regression to quickly build a final model and identify the significant effects. In 2012, I used a backward elimination approach, starting with all the variables and manually eliminating one variable at a time.</p>
<p>This lengthy and tedious process takes just a single click in Minitab 17:</p>
<p><img alt="stepwise" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/8fe5aafde53273ba3b7d16da305b5e4d/stepwise_binary.JPG" style="width: 486px; height: 539px;" /></p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/fc6b2c0fe2c083e439f4c66e0e446ddd/deviance_table_w640.gif" style="width: 640px; height: 168px;" /></p>
<p>The results above show that Alcohol and Acidity (both fixed and volatile) seem to play a major role.</p>
<p>The Residual sugar by Type of wine interaction is only marginally significant: its p value (0.087) is larger than 0.05 but smaller than 0.1.</p>
<p>The R-squared value (R-Sq) is also available in Minitab 17 to assess the proportion of the total variability that is explained by the model. The larger the R-squared value, the more complete our model is: a large R-squared means that we have the full picture of our process, while a low R-squared means that our model explains only a small part of the variability in the response. In this example, the R-squared is relatively low (28%), leaving 72% of the total variability unexplained by the model.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/4a5a9df6a33e7e05cfaf880bcc2cc3d8/model_summary.png" style="width: 278px; height: 94px;" /></p>
<p>In 2012, the final result consisted of two equations that could be used to understand which variables were significant for each type of wine in order to improve their taste.</p>
Optimizing the Response
<p>In Minitab 17, I can go one step further and use the optimization tool to identify the ideal settings and help the experimenter make the right decision.</p>
<p><img alt="regression equation" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/513cc8d599ba7948f4f288db12356435/regression_equation.png" style="width: 580px; height: 174px;" /></p>
<p><img alt="Optimize" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/9e42868dc619334da72100ec138b00c4/optimize_binary_w640.jpeg" style="width: 640px; height: 184px;" /></p>
<p>The optimization tool shows that tasters tend to prefer wines with a large amount of alcohol and both high fixed acidity <em>and </em>high volatile acidity.</p>
<p>Finally, showing graphs is important to convince colleagues and managers that the right decision has been taken. A visual representation is also very useful to better understand the factor effects. In Minitab 17, contour plots and response surface diagrams are available in the binary logistic regression sub-menu to describe the variable effects.</p>
<p>The contour plot below shows that tasters either prefer wines with high fixed acidity <em>and </em>high volatile acidity or with low fixed acidity <em>but also </em>low volatile acidity. The balance between the two types of acidity seems to be crucial.</p>
<p><img alt="Contour" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/2705aa13f0f80f9830408616028428a0/contour_plot_of_quality_vs_volatile_acidity__fixed.jpg" style="width: 576px; height: 384px;" /></p>
<p><img alt="Surface" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/2f593a6e6b89cb70b9439c85e8345477/surface_plot_of_quality_vs_volatile_acidity__fixed.jpg" style="width: 576px; height: 384px;" /></p>
<p>The models I arrived at in April 2012 are different from the one I found with Minitab 17. The two types of Acidity (Fixed and Volatile) were significant in the model for white wines, and Alcohol and Fixed Acidity had been selected in the final model for red wines.</p>
<p>But the main difference is that the Fixed Acidity by Volatile Acidity interaction had not been considered in 2012. In April 2012, the two-factor interactions were not on my radar, and I instead focused only on the individual main effects and their impact on wine tastes.</p>
<p>Fortunately, with Minitab 17 it is a lot easier to build an initial model, even a complex one with 66 potential two-factor interactions, and stepwise regression allows you to consider a much larger number of potential effects in the initial full model.</p>
Conclusion
<p>Ultimately, this study shows that the methods you use definitely impact your conclusion and statistical analysis. I got a simpler model using the tools available in Minitab 17, and therefore I did not need to study white and red wines separately. The optimization tool as well as the graphs were very useful to better understand the effects of the variables that are significant.</p>
Data Analysis, Fun Statistics, Hypothesis Testing, Quality Improvement, Regression Analysis, Statistics, Statistics Help, Stats
Tue, 15 Apr 2014 12:00:00 +0000
http://blog.minitab.com/blog/applying-statistics-in-quality-projects/re-analyzing-wine-tastes-with-minitab-17
Bruno Scibilia
Equivalence Testing for Quality Analysis (Part II): What Difference Does the Difference Make?
http://blog.minitab.com/blog/statistics-and-quality-data-analysis/equivalence-testing-for-quality-analysis-part-ii-what-difference-does-the-difference-make
<p><img alt="magnifying glass" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/1d9c8453dd19544a3f73fd787189619b/equivalence_test_difference.jpg" style="border-width: 1px; border-style: solid; margin: 10px 15px; width: 250px; height: 250px; float: right;" />My <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/equivalence-testing-for-quality-analysis-part-i-what-are-you-trying-to-prove" target="_blank">previous post</a> examined how an equivalence test can shift the burden of proof when you perform a hypothesis test of the means. This allows you to more rigorously test whether the process mean is equivalent to a target or to another mean.</p>
<p>Here’s another key difference: To perform the analysis, an equivalence test requires that you first define, upfront, the size of a <em>practically important</em> difference between the mean and the target, or between two means.</p>
<p>Truth be told, even when performing a standard hypothesis test, you should know the value of this difference. Because you can’t really evaluate whether your analysis will have adequate power without knowing it. Nor can you evaluate whether a statistically significant difference in your test results has significant meaning in the real world, outside of probability distribution theory.</p>
<p>But since a standard t-test doesn’t <em>require</em> you to define this difference, people often run the analysis with a fuzzy idea, at best, of what they’re actually looking for. It’s not an error, really. It’s more like using a radon measuring device without knowing what levels of radon are potentially harmful. </p>
Defining Equivalence Limits: Your Call
<p>How close does the mean have to be to the target value or to another mean for you to consider them, for all practical purposes, “equivalent”? </p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/6ec96ebe3b82e0b4828a79c8a74ba862/zone_of_equivalence_w640.jpeg" style="width: 640px; height: 178px;" /></p>
<p>The zone of equivalence is defined by a lower equivalence limit and/or an upper equivalence limit. The lower equivalence limit (LEL) defines your lower limit of acceptability for the difference. The upper equivalence limit (UEL) defines your upper limit of acceptability for the difference. Any difference from the mean that falls within this zone is considered unimportant.</p>
<p>In some fields, such as the pharmaceutical industry, equivalence limits are set by regulatory guidelines. If there aren’t guidelines for your application, you’ll need to define the zone of equivalence using knowledge of your product or process.</p>
<p>Here’s the bad news: There isn’t a statistician on Earth who can help you define those limits. Because it isn’t a question of statistics. It’s a question of what size of a difference produces tangible ramifications for you or your customer.</p>
<p>A difference of 0.005 mg from the mean target value? A 10% shift in the process mean? Obviously, the criteria aren't going to be the same for the diameter of a stent and the diameter of a soda can.</p>
Equivalence Test in Practice
<p>Here's a quick example of a 1-sample equivalence test, adapted from Minitab 17 Help. To follow along, you can download the revised <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/File/2d3f51f5ad82ccea78f46bcacf3c1af8/equivalence_test_data.MPJ" target="_blank">data here</a>. If you don't have Minitab 17, download <a href="http://it.minitab.com/products/minitab/free-trial.aspx?WT.ac=BlogMtbAd" target="_blank">a free trial version here.</a></p>
<p>Suppose a packaging company wants to ensure that the force needed to open its snack food bags is within 10% of the target value of 4.2 N (newtons). From previous testing, they know that a force more than 10% below the target causes the bags to open too easily and reduces product freshness. A force more than 10% above the target makes the bags too difficult to open. They randomly sample 100 bags and measure the force required to open each one.</p>
<p>To test whether the mean force is equivalent to the target, they choose <strong>Stat > Equivalence Tests > 1-Sample</strong> and fill in the dialog box as shown below:</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/295557153e0f5f5d7f3b52e2f8d4c4fe/equivalence_dialog.jpg" style="width: 560px; height: 390px;" /></p>
<p><strong>Tip</strong>: Use the <strong>Multiply by Target</strong> box when you want to define the equivalence limits for a difference in terms of a percentage of the target. In this case, the lower limit is 10% less than the target. The upper limit is 10% higher than the target. If you want to represent the equivalence limits in absolute terms, rather than as percentages, simply enter the actual values for your equivalence limits and don't check the <strong>Multiply by Target</strong> box.</p>
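<p>The arithmetic behind percentage-based limits is a single multiplication. The sketch below uses hypothetical helper names and assumes the limits are expressed as multipliers of the target (-0.1 and 0.1 for 10% below and above); it shows where the ±0.42 limits for a 4.2 N target come from.</p>

```python
def equivalence_limits(target, lower_multiplier, upper_multiplier):
    """Return (LEL, UEL) for the difference, expressed as multiples of the target."""
    return lower_multiplier * target, upper_multiplier * target

def in_equivalence_zone(difference, lel, uel):
    """True when a difference lies strictly inside the zone of equivalence."""
    return lel < difference < uel

lel, uel = equivalence_limits(4.2, -0.1, 0.1)
print(f"LEL = {lel:.2f}, UEL = {uel:.2f}")
```

<p>Keep in mind that a point estimate falling inside the zone is not enough on its own to claim equivalence; the test has to account for sampling error as well.</p>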
<p>When you click <strong>OK</strong>, Minitab displays the following results:</p>
<p style="margin-left: 40px;"><strong>One-Sample Equivalence Test: Force</strong></p>
<p style="margin-left: 40px;">Difference: Mean(Force) - Target</p>
<p style="margin-left: 40px;">Difference SE 95% CI Equivalence Interval<br />
0.14270 0.067559 (0, 0.25487) (-0.42, 0.42)</p>
<p><span style="color: rgb(255, 0, 0);">CI is within the equivalence interval. Can claim equivalence.</span></p>
<p style="margin-left: 40px;">Test<br />
Null hypothesis: Difference ≤ -0.42 or Difference ≥ 0.42<br />
Alternative hypothesis: -0.42 < Difference < 0.42<br />
α level: 0.05</p>
<p style="margin-left: 40px;">Null Hypothesis DF T-Value P-Value<br />
Difference ≤ -0.42 99 8.3290 0.000<br />
Difference ≥ 0.42 99 -4.1046 0.000</p>
<p><span style="color: rgb(255, 0, 0);">The greater of the two P-Values is 0.000. Can claim equivalence.</span></p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/eef6c9313790b2590fa6178d96a5c1cb/equivalence_plot.jpg" style="width: 576px; height: 384px;" /></p>
<p>Because the confidence interval for the difference falls completely within the equivalence limits, you can reject the null hypothesis that the mean differs from the target by 0.42 N or more. You can claim that the mean and the target are equivalent.</p>
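<p>The two T-Values in the output above can be reproduced from the reported difference, standard error, and equivalence limits. The sketch below uses hypothetical function names and takes the confidence interval straight from the output rather than recomputing it.</p>

```python
def tost_t_values(difference, standard_error, lel, uel):
    """T statistics for the two one-sided tests of an equivalence test."""
    t_lower = (difference - lel) / standard_error  # tests H0: difference <= LEL
    t_upper = (difference - uel) / standard_error  # tests H0: difference >= UEL
    return t_lower, t_upper

def ci_within_limits(ci, lel, uel):
    """True when the whole confidence interval lies inside the equivalence interval."""
    low, high = ci
    return lel < low and high < uel

# Values copied from the Minitab output above.
t_lower, t_upper = tost_t_values(0.14270, 0.067559, -0.42, 0.42)
can_claim = ci_within_limits((0.0, 0.25487), -0.42, 0.42)
print(f"T (lower limit) = {t_lower:.4f}, T (upper limit) = {t_upper:.4f}")
print("Can claim equivalence" if can_claim else "Cannot claim equivalence")
```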
<p>Notice that if you had used a standard 1-sample t-test to analyze these data, the output would show a statistically significant difference between the mean and the target (at a significance level of 0.05):</p>
<p style="margin-left: 40px;"><strong>One-Sample T: Force</strong></p>
<p style="margin-left: 40px;">Test of μ = 4.2 vs ≠ 4.2<br />
<strong>Variable N Mean StDev SE Mean 95% CI T P</strong><br />
Force 100 4.3427 0.6756 0.0676 (4.2086, 4.4768) 2.11 <span style="color: rgb(255, 0, 0);"> 0.037</span></p>
<p>These two sets of results aren't really contradictory, though.</p>
<p>The equivalence test has simply defined "equality" between the mean and the target in broader terms, using the values you entered for the equivalence zone. The standard t-test has no knowledge of what "practically significant" means. So it can only evaluate the difference from the target in terms of statistical significance.</p>
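<p>To see the contrast in numbers, the standard T statistic can be recomputed from the summary statistics in the One-Sample T output, and the observed difference can be set beside the 0.42 margin. The function name below is hypothetical; the inputs are the values reported above.</p>

```python
import math

def one_sample_t(mean, target, stdev, n):
    """T statistic for a standard 1-sample t-test of mean = target."""
    standard_error = stdev / math.sqrt(n)
    return (mean - target) / standard_error

t_stat = one_sample_t(4.3427, 4.2, 0.6756, 100)
observed_difference = 4.3427 - 4.2
margin = 0.1 * 4.2  # 10% of the target, the equivalence margin

print(f"T = {t_stat:.2f}")
print(f"observed difference = {observed_difference:.4f}, margin = {margin:.2f}")
```

<p>The T statistic measures the difference in units of sampling error, while the margin measures it in units that matter to the customer. The difference is large by the first yardstick and small by the second.</p>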
<p>In this way, an equivalence test is "naturally smarter" than a standard t-test. But it's your knowledge of the process or product that allows an equivalence test to evaluate the practical significance of a difference, in addition to its statistical significance.</p>
<strong>Learn More about Equivalence Testing</strong>
<p>There are four types of equivalence tests newly available in Minitab 17. To learn more about each test, choose <strong>Help > Help</strong>. Click the<strong> Index</strong> tab, scroll down to <strong>Equivalence testing</strong>, and click <strong>Overview</strong>.</p>
Hypothesis Testing
Tue, 01 Apr 2014 12:31:00 +0000
http://blog.minitab.com/blog/statistics-and-quality-data-analysis/equivalence-testing-for-quality-analysis-part-ii-what-difference-does-the-difference-make
Patrick Runkel
Equivalence Testing for Quality Analysis (Part I): What are You Trying to Prove?
http://blog.minitab.com/blog/statistics-and-quality-data-analysis/equivalence-testing-for-quality-analysis-part-i-what-are-you-trying-to-prove
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/b922d8ae294ef7be358b1b2abdc06eab/scales.jpg" style="float: right; border-width: 1px; border-style: solid; margin: 10px 15px; width: 250px; height: 244px;" />With more options come more decisions.</p>
<p>With equivalence testing added to Minitab 17, you now have more statistical tools to test a sample mean against a target value or another sample mean.</p>
<p>Equivalence testing is extensively used in the biomedical field. Pharmaceutical manufacturers often need to test whether the biological activity of a generic drug is equivalent to that of a brand name drug that has already been through the regulatory approval process.</p>
<p>But in the field of quality improvement, why might you want to use an equivalence test instead of a standard t-test?</p>
Interpreting Hypothesis Tests: A Common Pitfall
<p>Suppose a manufacturer finds a new supplier that offers a less expensive material that could be substituted for a costly material currently used in the production process. This new material is <em>supposed to be</em> just as good as the material currently used. It should not make the product too pliable nor too rigid.</p>
<p>To make sure the substitution doesn’t negatively impact quality, an analyst collects two random samples from the production process (which is stable): one using the new material and one using the current material.</p>
<p>The analyst then uses a standard 2-sample t-test (<strong>Stat > Basic Statistics > 2-Sample t </strong>in Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a>) to assess whether the mean pliability of the product is the same using both materials:</p>
<p style="margin-left: 40px;">________________________________________</p>
<p style="margin-left: 40px;"><strong>Two-Sample T-Test and CI: Current, New </strong></p>
<p style="margin-left: 40px;">Two-sample T for Current vs New<br />
N Mean StDev SE Mean<br />
Current 9 34.092 0.261 0.087<br />
New 10 33.971 0.581 0.18</p>
<p style="margin-left: 40px;">Difference = μ (Current) - μ (New)<br />
Estimate for difference: 0.121<br />
95% CI for difference: (-0.322, 0.564)<br />
T-Test of difference = 0 (vs ≠): T-Value = 0.60 <strong><span style="color:#FF0000;">P-Value = 0.562</span></strong> DF = 12<br />
________________________________________</p>
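<p>The T-Value and DF in this output can be reproduced from the summary statistics. The sketch below assumes the test does not pool the variances and that the degrees of freedom are truncated to a whole number; that assumption is consistent with the reported DF of 12 (a pooled test would have 17). The function name is hypothetical.</p>

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """T statistic, degrees of freedom, and standard error for a
    2-sample t-test that does not assume equal variances."""
    v1 = sd1 ** 2 / n1
    v2 = sd2 ** 2 / n2
    standard_error = math.sqrt(v1 + v2)
    t = (mean1 - mean2) / standard_error
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df, standard_error

# Summary statistics copied from the output above (Current, then New).
t_value, df_exact, se_diff = welch_t(34.092, 0.261, 9, 33.971, 0.581, 10)
print(f"T = {t_value:.2f}, DF = {math.floor(df_exact)}, SE of difference = {se_diff:.4f}")
```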
<p>Because the p-value is not less than the <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/alpha-male-vs-alpha-female">alpha level</a> (0.05), the analyst concludes that the means do not differ. Based on these results, the company switches suppliers for the material, confident that statistical analysis has proven that they can save money with the new material without compromising the quality of their product.</p>
<p>The test results make everyone happy. High-fives. Group hugs. Popping champagne corks. There’s only one minor problem.</p>
<p>Their statistical analysis didn’t really <em>prove</em> that the means are the same.</p>
Consider Where to Place the Burden of Proof
<p>In hypothesis testing, H1 is the alternative hypothesis that bears the burden of proof. Usually, the alternative hypothesis is what you’re hoping to prove or demonstrate. When you perform a standard 2-sample t-test, you’re really asking: “Do I have enough evidence to <em>prove</em>, beyond a reasonable doubt (your alpha level), that the population means are different?”</p>
<p>To do that, the hypotheses are set up as follows:</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/102c226ce0c8cffe1a37cb7e43a969eb/2samplet_w640.jpeg" style="width: 640px; height: 271px;" /></p>
<p>If the p-value is less than alpha, you conclude that the means significantly differ. But if the p-value is not less than alpha, you haven’t <em>proven</em> that the means are equal. You just don’t have enough evidence to prove that they’re not equal.</p>
<p>The absence of evidence for a statement is not proof of its negation. If you don’t have sufficient evidence to claim that A is true, you haven’t <em>proven</em> that A is false.</p>
<p>Equivalence tests were specifically developed to address this issue. In a 2-sample equivalence test, the null and alternative hypotheses are reversed from a standard 2-sample t test.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/1d72f62da34d5c7d7d7719973c3927ac/2smple_equiv_image_w640.jpeg" style="width: 640px; height: 271px;" /></p>
<p>This switches the burden of proof for the test. It also reverses the consequences of incorrectly assuming that the null hypothesis (H0) is true.</p>
Case in Point: The Presumption of Innocence vs. Guilt
<p>This rough analogy may help illustrate the concept.</p>
<p>In a court of law, the burden of proof rests on proving guilt. The suspect is presumed innocent (H0), until proven guilty (H1). In the news media, the burden of proof is often reversed: The suspect is presumed guilty (H0), until proven innocent (H1).</p>
<p>Shifting the burden of proof can yield different conclusions. That’s why the news media often express outrage when a suspect who is presumed to be guilty is let go because there was not sufficient evidence to prove the suspect’s guilt in the courtroom. As long as news media and the courtroom reverse their null and alternative hypotheses, they’ll sometimes draw different conclusions based on the same evidence.</p>
<p>Why do they set up their hypotheses differently in the first place? Because each seems to have a different idea of what’s a worse error to make. The judicial system believes the worse error is to convict an innocent person, rather than let a guilty person go free. The news media seem to believe the contrary. (Maybe because the presumption of guilt sells more papers than presumption of innocence?)</p>
When the Burden of Proof Shifts, the Conclusion May Change
<p>Back to our quality analyst in the first example. To avoid losing customers, the company would rather err by assuming that the quality was not the same using the cheaper material (when it actually was) than err by assuming it was the same when it actually was not.</p>
<p>To more rigorously demonstrate that the means are the same, the analyst performs a 2-sample equivalence test (<strong>Stat > Equivalence Tests > 2-Sample</strong>).</p>
<p style="margin-left: 40px;">________________________________________</p>
<p style="margin-left: 40px;"><strong>Equivalence Test: Mean(New) - Mean(Current) </strong></p>
<p style="margin-left: 40px;">Test<br />
Null hypothesis: Difference ≤ -0.4 or Difference ≥ 0.4<br />
Alternative hypothesis: -0.4 < Difference < 0.4<br />
α level: 0.05</p>
<p style="margin-left: 40px;">Null Hypothesis DF T-Value P-Value<br />
Difference ≤ -0.4 12 1.3717 0.098<br />
Difference ≥ 0.4 12 -2.5646 0.012</p>
<p style="margin-left: 40px;"><strong><span style="color:#FF0000;">The greater of the two P-Values is 0.098. Cannot claim equivalence.</span></strong><br />
________________________________________</p>
<p>Using the equivalence test on the same data, the results now indicate that there <em>isn't</em> sufficient evidence to claim that the means are the same. The company <em>cannot</em> be confident that product quality will not suffer if they substitute the less expensive material. By using an equivalence test, the company has raised the bar for evaluating a possible shift in the process mean.</p>
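<p>The decision rule in the output above can be sketched with the reported T-Values and DF. Because the Python standard library has no t distribution, the sketch integrates the t density numerically with Simpson's rule; the helper names are hypothetical, and the T-Values are copied from the output rather than recomputed from raw data.</p>

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t, df, steps=2000):
    """P(T >= t), using Simpson's rule on [0, |t|]; steps must be even."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    tail = 0.5 - total * h / 3  # area beyond |t| on one side
    return tail if t >= 0 else 1.0 - tail

df, alpha = 12, 0.05
p_lower = t_upper_tail(1.3717, df)   # H0: Difference <= -0.4, rejected for large T
p_upper = t_upper_tail(2.5646, df)   # H0: Difference >= 0.4, lower tail of T = -2.5646
can_claim = max(p_lower, p_upper) < alpha  # both one-sided tests must reject

print(f"P-Values: {p_lower:.3f} and {p_upper:.3f}")
print("Can claim equivalence" if can_claim else "Cannot claim equivalence")
```

<p>Equivalence is claimed only when the larger of the two P-Values falls below alpha, which is why a single non-significant one-sided test is enough to block the claim.</p>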
<p><strong>Note:</strong> If you look at the above output, you'll see another way that the equivalence test differs from a standard t-test. Two one-sided t-tests are used to test the null hypothesis. In addition, the test uses a zone of equivalence that defines what size difference between the means you consider to be practically insignificant. <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/equivalence-testing-for-quality-analysis-part-ii-what-difference-does-the-difference-make">We’ll look at that in more detail in my next post</a>.</p>
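<p>For readers who want to check the arithmetic, here is a minimal Python sketch of the "two one-sided tests" logic. It takes the DF and T-Values reported in the output above as given and recomputes the one-sided p-values with a hand-rolled t-distribution CDF (numerical integration, standard library only). The function names are my own and are not part of Minitab.</p>

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x), via Simpson's rule on [0, |x|] plus symmetry (steps must be even)."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

ALPHA = 0.05
DF = 12  # degrees of freedom reported in the output

# H0: Difference <= -0.4 is rejected when T is large (upper-tail p-value)
p_lower = 1 - t_cdf(1.3717, DF)
# H0: Difference >= 0.4 is rejected when T is small (lower-tail p-value)
p_upper = t_cdf(-2.5646, DF)

# Equivalence can be claimed only if BOTH null hypotheses are rejected,
# so the larger of the two p-values decides the outcome.
p_max = max(p_lower, p_upper)
can_claim_equivalence = p_max < ALPHA

print(f"p_lower = {p_lower:.3f}, p_upper = {p_upper:.3f}")
print("Can claim equivalence:", can_claim_equivalence)
```

<p>Because the larger p-value is above α, only one of the two one-sided nulls is rejected, which is why the test cannot claim equivalence.</p>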
Quick Summary
<p>To choose between an equivalence test and a standard t-test, consider what you hope to prove or demonstrate. Whatever you hope to prove true should be set up as the alternative hypothesis for the test and require the burden of proof. Whatever you deem to be the less harmful incorrect assumption to make should be the null hypothesis. If you’re trying to rigorously prove that two means are equal, or that a mean equals a target value, you may want to use an equivalence test rather than a standard t-test.</p>
Hypothesis Testing, Quality Improvement
Mon, 31 Mar 2014 12:39:00 +0000
http://blog.minitab.com/blog/statistics-and-quality-data-analysis/equivalence-testing-for-quality-analysis-part-i-what-are-you-trying-to-prove
Patrick Runkel
Is the "Madden Curse" Real?
http://blog.minitab.com/blog/the-statistics-game/is-the-madden-curse-real
<p><img alt="Madden" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/d3d5cdc0287be4fb0ae870f30a5e8a8c/madden.JPG" style="border-width: 1px; border-style: solid; margin: 10px 15px; width: 180px; height: 211px; float: right;" />If you like football and you like video games, you must certainly be aware of the “Madden Curse.” Each year, EA Sports releases a new version of Madden, a video game based on the National Football League. Each version of Madden features a different NFL player on the cover of the game. And it seems that each year, the player featured on the cover gets hurt or has a terrible season. Thus, the “Madden Curse” was born.</p>
<p>As a statistician, I’m always skeptical of these things. When people make judgments based on their own perception and not on data, it can be easy to think you see trends that aren’t really there. We tend to remember the cases that support our point of view (Michael Vick breaking his leg after being on the cover) and forget cases that don’t support our argument (Calvin Johnson setting the single season receiving record after being on the cover).</p>
<p>But I’ll humor the Madden curse theorists and perform a data analysis to see if a curse might indeed be real. And if it does exist, perhaps we can even figure out why!</p>
Are Madden-Featured Players Getting Hurt More Often?
<p>We already mentioned that Michael Vick broke his leg the season after he appeared on the cover of Madden. He missed 11 regular season games that year. So could the curse be that the featured players get injured and miss a lot of games the next season?</p>
<p>For all 16 players who have been on the cover since 1999, I gathered the number of games they played the season <em>before</em> being on the cover, and the number of games they played after. You can follow along by getting the data <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/8496aa39b421f1c2512d539d22ce3c3f/maddencurse.MTW">here</a>. Don't already have Minitab? You can get a <a href="http://it.minitab.com/en-us/products/minitab/free-trial.aspx">30-day trial version</a>.</p>
<p>First, I ran a Paired t-test on the two groups.</p>
<p><img alt="Paired t test" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/0b3a97cc7c870b2597f44eafaa1e67bc/paired_t_injuries_w640.gif" style="width: 640px; height: 203px;" /></p>
<p>We see that on average, players played about 2 fewer games the season after being on the cover of Madden. This difference is statistically significant at the α = 0.10 level. So is the curse true?</p>
Statistically Significant and Practically Significant
<p>Just because a difference is <a href="http://blog.minitab.com/blog/the-stats-cat/sample-size-statistical-power-and-the-revenge-of-the-zombie-salmon-the-stats-cat">significant doesn’t mean that difference is practical</a> to your situation. So when it comes to the Madden curse, we have to ask ourselves “Is a difference of 2 games <em>really</em> a curse?” Sure, Vick’s injury was bad, but it wasn’t typical. The only other players besides Vick to play in fewer than 10 games after being on the cover were Troy Polamalu (5 games in 2009) and Donovan McNabb (9 games in 2004). On the flip side, 10 of the 16 featured players played in at least 14 games the next season. That doesn’t sound like much of a curse to me.</p>
<p>And keep in mind that EA is not going to put anybody on the cover of Madden who was injured the <em>previous</em> season. The outliers of Vick, McNabb, and Polamalu pull down the average of the entire “appeared-on-the-cover” group. That doesn’t happen the season before you’re on the cover. Fifteen players played in at least 13 games the season before being on the cover, and 10 of them played all 16. This means the “before” group is artificially inflated.</p>
<p>So yes, players <em>are </em>playing fewer games the season after being on the cover. But on average, it’s only 2 fewer games. Featured players aren’t experiencing season-ending injuries year after year. When you consider the <em>practical difference</em>, I would say 2 games is so small that there isn’t any curse going on.</p>
Are Players Performing Poorly?
<p>So if injuries don’t appear to be the curse, maybe players have a worse season the year after being on the cover. Assessing this becomes a little tricky because the players on the cover play a variety of positions, including two defensive players. So we need a statistic that can represent the value of players at different positions. Pro-Football-Reference has a statistic called <a href="http://www.sports-reference.com/blog/approximate-value-methodology/">Approximate Value</a> (AV). It’s a metric that puts a single numerical value on any player’s season, at any position, from any year. I know it’s not well known, but I’m not aware of any other statistic that can represent the value of Ray Lewis, Eddie George, <em>and</em> Michael Vick. For our purposes, it’ll do just fine.</p>
<p>I took each player's AV the season before they were on the cover and the season after. Then I did another paired t-test.</p>
<p><img alt="Paired t test" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/537b93e10b6b50ffcc1ff031072bf180/paired_t_1_season_w640.gif" style="width: 640px; height: 201px;" /></p>
<p>There is a difference of almost 5, and this difference is statistically significant since the p-value is less than 0.05. To give you some perspective on AV, during his record-setting year in 2012, Calvin Johnson had an AV of 14. So the average of 15 that the “Before” group has is pretty darned good (in 2007 Tom Brady had a 24, so I'm guessing that's about the max). In comparison, Bills receiver Steve Johnson had a 9 last year, and 49ers receiver Michael Crabtree had a 10. I think it’s safe to say a difference between 10 and 15 is practically significant.</p>
<p>This means there <em>is </em>a curse! Players on the cover of Madden perform worse the season after being on the cover. There is proof that the curse exists!</p>
<p>Or is there?</p>
<p>Let’s look at how players perform <em>two </em>seasons before they are on the cover of Madden. We’ll compare that to how they perform the season after being on the cover.</p>
<p><img alt="Paired t test" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/83b608654e754bc946758ef9dae78484/paired_t_2_season_w640.gif" style="width: 640px; height: 210px;" /></p>
<p>There is pretty much no difference in a player’s performance two seasons before being on the cover, and the season after being on the cover.</p>
<p>Okay, what if we go back three years before they were on the cover?</p>
<p>(If you've noticed the different sample sizes in these paired t-tests, it’s because a few of the players were so young that they were not even in the NFL 2 or 3 seasons before being on the cover of Madden.)</p>
<p><img alt="Paired t test" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/b1764251e4cd6ac3cbc1fa9324a25156/paired_t_3_season_w640.gif" style="width: 640px; height: 200px;" /></p>
<p>Almost the exact same thing! There is really no difference in a player’s performance 3 seasons before being on the cover, and the season after. The only season that stands out is the one directly before they were chosen to be on the cover of Madden.</p>
<p>Now the curse makes sense. It’s a simple case of <a href="http://blog.minitab.com/blog/fun-with-statistics/fantasy-studs-and-regression-to-the-mean">regression to the mean</a>!</p>
What Is Regression to the Mean?
<p>Think of a roulette wheel where half of the spaces are black, and the other half are red. And now imagine a set of 16 spins where red comes up 75% of the time. In the next set of 16 spins, we would expect the average to regress back to 50% red and 50% black. This is regression to the mean.</p>
<p>Note that regression to the mean does <em>not</em> mean we would expect a set of 16 spins where we had 75% black to “even out” the previous set that had more red. We would just expect the results to return to the average, which is 8 red and 8 black.</p>
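<p>If you'd rather see this than take my word for it, here is a small simulation sketch in Python. It assumes the idealized wheel described above (half red, half black, no green), and the function names, trial count, and seed are arbitrary choices of mine. It looks only at sets of 16 spins that came up red exactly 75% of the time (12 reds) and averages the number of reds in the set that follows.</p>

```python
import random

def reds_in_set(rng, spins=16):
    """Number of reds in one set of spins on a wheel that is half red, half black."""
    return sum(rng.random() < 0.5 for _ in range(spins))

def average_followup(trials=100_000, seed=1):
    """Average reds in the set that follows a set with exactly 12 reds (75%)."""
    rng = random.Random(seed)
    followups = []
    for _ in range(trials):
        first = reds_in_set(rng)
        second = reds_in_set(rng)
        if first == 12:
            followups.append(second)
    return sum(followups) / len(followups)

avg = average_followup()
print(f"Average reds after a 12-red set: {avg:.2f}")
```

<p>The follow-up sets should average close to 8 reds, not 4. The wheel doesn't "make up" for the hot streak; it just goes back to behaving the way it always does.</p>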
<p>Now let’s apply this thinking to the Madden curse. We see that 3 seasons before being on the cover, the players as a group have an AV of about 11. It stays in the 10-11 range the next season, too. Then all of a sudden it jumps up to almost 15 the season before they make the Madden cover, only to return to the “average” the following season.</p>
<p>It doesn’t take a Six Sigma Black Belt to see what is going on here.</p>
<p>Madden is selecting players who had outstanding seasons the previous year. But just like a roulette wheel might have a run where it comes up red 75% of the time, the outstanding performance by the players who appear on the cover is not sustainable. So the year after they're featured they don’t perform as well as they did the year before, and it looks like they’re cursed. In reality, they’re simply playing back at the same level they were before their outstanding season. They’re just regressing to the mean, and it would have happened whether they appeared on the cover of Madden or not.</p>
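<p>The same selection effect is easy to reproduce with made-up players. The sketch below is purely illustrative: the skill and luck distributions, the league size, and the "top 5%" selection rule are all hypothetical numbers I picked, not anything estimated from the Madden data. Each simulated player has a fixed skill level plus independent season-to-season luck, and we "put on the cover" the players with the best middle season.</p>

```python
import random
from statistics import mean

def simulate(n_players=20_000, top_fraction=0.05, seed=7):
    """Mean season value of 'cover' players two seasons before, one season
    before (the season used for selection), and the season after."""
    rng = random.Random(seed)
    players = []
    for _ in range(n_players):
        skill = rng.gauss(10.5, 2.0)  # hypothetical true ability
        # Three seasons: same skill, fresh luck each year
        seasons = [skill + rng.gauss(0.0, 4.0) for _ in range(3)]
        players.append(seasons)
    # Select on the middle season only, the way a cover pick rewards a great year
    players.sort(key=lambda s: s[1], reverse=True)
    chosen = players[: int(n_players * top_fraction)]
    return tuple(mean(s[i] for s in chosen) for i in range(3))

two_before, one_before, after = simulate()
print(f"Two seasons before: {two_before:.1f}")
print(f"Season before:      {one_before:.1f}")
print(f"Season after:       {after:.1f}")
```

<p>The selected group looks spectacular in the season used to pick them and ordinary on either side of it, even though nothing in the simulation knows about a cover or a curse.</p>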
<p>So before you start believing in curses, try a statistical analysis first. Odds are you’ll find a perfectly reasonable explanation!</p>
Fun Statistics, Hypothesis Testing, Learning, Regression Analysis
Fri, 11 Oct 2013 13:08:00 +0000
http://blog.minitab.com/blog/the-statistics-game/is-the-madden-curse-real
Kevin Rudy
Sam Ficken and the Danger of Small Sample Sizes
http://blog.minitab.com/blog/the-statistics-game/sam-ficken-and-the-danger-of-small-sample-sizes
<p>
<img alt="Sam Ficken" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/c046562d831b68c2637b30f783c28089/ficken_normal.jpg" style="float: right; width: 250px; height: 155px; border-width: 1px; border-style: solid; margin: 10px 15px;" />On Saturday, September 8, 2012, Penn State football player Sam Ficken had a kicker’s worst nightmare. Playing against Virginia, he missed 4 field goals, including the potential game-winner as the game ended. To add injury to insult, he also had an extra point blocked.</p>
<p>
Penn State lost the game by a single point.</p>
<p>
At that point in his career, Ficken had made 2 out of his 7 field goal attempts. That equals about a 29% success rate, which is terrible for kickers. Many called for Ficken to be benched. He was harassed on Twitter (to put it mildly). And a Penn State soccer player even made a YouTube video of himself kicking field goals at the Nittany Lion practice facility, prompting many fans to suggest that Coach Bill O’Brien give him a tryout. It was pretty apparent Ficken just wasn’t a good kicker.</p>
<p>
Or was it?</p>
<p>
Flip a coin 7 times. If tails comes up only twice, are you going to conclude that the coin is “biased” towards heads? Of course not. You simply had an unlikely outcome (the coin coming up heads 71% of the time) because 7 tosses is a very small sample size. Now, kicking field goals is a lot different than flipping a coin, but the same idea applies. So let’s do a data analysis on Ficken’s field goal percentage.</p>
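<p>To put a number on "unlikely," here is a quick sketch of the binomial arithmetic for a fair coin. The helper name is my own; the calculation is just the standard binomial formula.</p>

```python
from math import comb

def prob_at_most(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Chance of seeing 2 or fewer tails in 7 flips of a fair coin
p_two_or_fewer_tails = prob_at_most(2, 7)
print(f"{p_two_or_fewer_tails:.3f}")
```

<p>That works out to 29/128, or roughly 23%. A fair coin does something this lopsided in 7 flips nearly a quarter of the time.</p>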
<p>
<strong>NOTE:</strong> I’m going to use a 1 Proportion analysis, which assumes the probability of each observation is the same. Obviously this isn’t true for field goals. Distance, weather conditions, and altitude all affect the probability of the kicker making the goal. Even the opponent can affect the probability: your odds aren’t as good if <a href="http://youtu.be/83w4SRzSg7c?t=3m49s">LaVar Arrington circa 1999</a> is lining up to block the kick! But I’m really just trying to illustrate the amount of variation that exists in small samples, not trying to accurately gage Ficken’s true field goal percentage. So for purely illustrative purposes, I’m going to use the 1 Proportion analysis anyway...just take the statistics with a grain of salt if you were hoping for a comprehensive review that includes all possible factors.</p>
How Confident Can We Be in Ficken’s Field Goal Percentage?
<p>
After the Virginia game Ficken was 2 for 7 (29%) on field goal attempts. Using these numbers, we can use Minitab’s 1 Proportion analysis to create a confidence interval. This confidence interval will give us a range of likely values for the percentage of kicks that Ficken will make going forward. That is, it gives us an idea of how confident we can be that these 7 kicks represent Ficken’s true kicking percentage.</p>
<p>
<img alt="Minitab's One Proportion Analysis" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/070b0042d03e59b4c894a8438c7017db/1_propotion_2_for_7.gif" style="width: 514px; height: 166px;" /></p>
<p>
The confidence interval tells us that we can be 95% confident that Ficken’s true field goal percentage is between 3.7% and 71%. That range is so large that it’s pretty much worthless! So anybody trying to make an accurate assessment of Ficken’s ability based on those 7 kicks is doing nothing other than guessing. Moreover, the range actually increases if you look at only the 5 kicks in the Virginia game (which many people did)!</p>
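<p>If you don't have Minitab handy, an interval like this can be sketched in a few lines of Python. I'm assuming here that the output uses the exact (Clopper-Pearson) method, which is consistent with the 3.7% and 71% reported above; the function names are mine. The sketch finds each bound by bisection on the binomial tail probability, then repeats the calculation for the Virginia game alone (1 make in 5 attempts).</p>

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def bisect(f, target, increasing, iters=60):
    """Solve f(p) = target on [0, 1] for a monotone function f."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_ci(x, n, conf=0.95):
    """Clopper-Pearson interval for x successes in n trials."""
    alpha = 1 - conf
    # Lower bound: the p at which seeing x or more successes has probability alpha/2
    lower = 0.0 if x == 0 else bisect(
        lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2, increasing=True)
    # Upper bound: the p at which seeing x or fewer successes has probability alpha/2
    upper = 1.0 if x == n else bisect(
        lambda p: binom_cdf(x, n, p), alpha / 2, increasing=False)
    return lower, upper

career_lo, career_hi = exact_ci(2, 7)   # first 7 career attempts
game_lo, game_hi = exact_ci(1, 5)       # Virginia game only
print(f"2 for 7: {career_lo:.1%} to {career_hi:.1%}")
print(f"1 for 5: {game_lo:.1%} to {game_hi:.1%}")
```

<p>The 1-for-5 interval comes out even wider than the 2-for-7 interval, which is the point: fewer kicks, less information.</p>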
Ficken’s Career Since the Virginia Game
<p>
If there's one person who could accurately judge Ficken, it's Penn State Coach Bill O’Brien. He'd seen plenty of Ficken kicks in practice, and had a lot more than 7 observations to make his decision on. And he decided to stick with Ficken as his kicker.</p>
<p>
Boy, has that decision paid off.</p>
<p>
Since the Virginia game, Ficken has made 20 of 24 field goal attempts. He hit a Penn State record of 15 field goals in a row, and also made a 54-yard field goal--a Penn State home record--<em>in the rain</em>. In his career, Ficken is now 22 for 31 on field goal attempts, good for 71%.</p>
<p>
And wouldn’t you know it, that equals the upper bound from the 95% confidence interval we created earlier! </p>
<p>
Clearly, Ficken is a better kicker than his first few attempts showed. And considering where he had to be mentally after the Virginia game, it’s a great story to see him bounce back and perform so well. But, statistically speaking, how good can we <em>really</em> claim he is? Since we now have another 24 observations, let’s combine them with the original 7 and calculate an updated 95% confidence interval for Ficken’s field goal percentage.</p>
<p>
<img alt="Minitab's One Proportion Analysis" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/67b6a57fa13e63a1298b7c2a9f190cf6/1_propotion_22_for_31.gif" style="width: 523px; height: 159px;" /></p>
<p>
Now that we have more observations, we can narrow down Ficken’s true ability much better. The new lower bound for the interval (52%) is nowhere close to the 29% that Ficken made in his first 7 attempts.</p>
<p>
But the confidence interval is still pretty wide, with a range of about 34%. There is a chance his true field goal percentage is in the 50% range, which would put him among the worst kickers in the country!</p>
<p>
How big of a sample size do we need in order to really be confident in Ficken’s abilities?</p>
How Many Kicks Do We Need?
<p>
To answer that question, first we need to decide how “narrow” we want our confidence interval to be. This is the same thing as determining the margin of error. For example, let’s use Ficken’s current field goal percentage of 71%. If the margin of error were 5%, our confidence interval would range from 66% to 76%.</p>
<p>
But instead of picking just one, let’s use a few margins of error to compare the different sample sizes needed for each one. We’ll use margins of error of 10%, 5%, and 1%. Then we can use Minitab’s Sample Size for Estimation analysis to get the sample sizes.</p>
<p>
<img alt="Sample Size for Estimation" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/0984a4d9e48004d90c9311a3bfe5b77a/sample_size.gif" style="width: 305px; height: 328px;" /></p>
<p>
To obtain a margin of error of 10% (which is still pretty wide) we would need 99 kicks. It skyrockets to 359 for 5%, and becomes an unattainable 8,129 kicks for a 1% margin of error! To put that in perspective, former Penn State kicker Kevin Kelly was the starter at Penn State for 4 years, and attempted only 107 field goals. And Sebastian Janikowski is in his 14th year of kicking in the NFL, and has only 409 attempts.</p>
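<p>For a rough cross-check, here is a sketch using the textbook normal-approximation formula, n = z² · p(1 − p) / E². This is not the calculation Minitab performs, and it gives smaller numbers (roughly 80, 317, and 7,910) than the 99, 359, and 8,129 above. My assumption is that Minitab's figures come from a more conservative exact method. Either way, the pattern is the same: cutting the margin of error by a factor of 10 costs about 100 times as many kicks.</p>

```python
from math import ceil
from statistics import NormalDist

def n_for_margin(p, margin, conf=0.95):
    """Approximate sample size so a proportion's CI has the given half-width.

    Uses the normal approximation, so it will not match an exact method.
    """
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return ceil(z**2 * p * (1 - p) / margin**2)

# Planning value of 71% is Ficken's career percentage from the text
sizes = {m: n_for_margin(0.71, m) for m in (0.10, 0.05, 0.01)}
for margin, n in sizes.items():
    print(f"margin of error {margin:.0%}: about {n} kicks")
```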
<p>
Your average college kicker will get between 20 and 30 field goal attempts per year. And unless you’re a four-year starter, you’re not getting close to 99 kicks for your career. That means for a college kicker, even if every field goal attempt has the same probability of being made (which it doesn’t), we still have a pretty wide margin of error when determining just how accurate the kicker is.</p>
<p>
So when you want to make claims based on statistics, make sure you have a sufficiently large sample. And that’s not just in the world of sports. Sample sizes are important for everything from determining the net weight of the cereal in packages to <a href="http://blog.minitab.com/blog/adventures-in-statistics/using-hypothesis-tests-to-bust-myths-about-the-battle-of-the-sexes">Mythbusters determining whether women are better at multitasking</a>. If you don’t have a large enough sample, your conclusions might be meaningless. To find proof, you need only look at Sam Ficken.</p>
<p>
<em style="border: 0px; margin: 0px; padding: 0px; font-size: 10px; color: rgb(90, 90, 90); font-family: Verdana; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 20px; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255);">Photo by PennStateNews. Licensed under Creative Commons Attribution ShareAlike 2.0.</em></p>
Hypothesis Testing, Statistics, Statistics Help
Fri, 27 Sep 2013 13:59:00 +0000
http://blog.minitab.com/blog/the-statistics-game/sam-ficken-and-the-danger-of-small-sample-sizes
Kevin Rudy
Using Hypothesis Tests to Bust Myths about the Battle of the Sexes
http://blog.minitab.com/blog/adventures-in-statistics/using-hypothesis-tests-to-bust-myths-about-the-battle-of-the-sexes
<p><img alt="Mythbusters title screen" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/95c8e44114a7378203ad3b70a9fac3a5/mythbusters_title_screen.jpg" style="border-width: 1px; border-style: solid; margin: 10px 15px; width: 300px; height: 169px; float: right;" />In my home, we’re huge fans of <a href="http://dsc.discovery.com/tv-shows/mythbusters" target="_blank">Mythbusters</a>, the show on Discovery Channel. This fun show mixes science and experiments to prove or disprove various myths, urban legends, and popular beliefs. It’s a great show because it brings the scientific method to life. I’ve written about Mythbusters <a href="http://blog.minitab.com/blog/adventures-in-statistics/busting-the-mythbusters-are-yawns-contagious">before</a> to show how, without proper statistical analysis, it’s difficult to know when a result is statistically significant. How much data do you need to collect and how large does the difference need to be?</p>
<p>For this blog, let's look at a more recent Mythbusters episode, “Battle of the Sexes – Round Two.” I want to see how they’ve progressed with handling sample size. There are some encouraging signs: during the show, Adam Savage, one of the hosts, explains, “Sample size is everything in science; the more you have, the better your results.”</p>
<p>To paraphrase the show, here at Minitab, we don’t just talk about the hypotheses; we put them to the test. We’ll use two different hypothesis tests and <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/c4c2c86825987c371555b86c69216829/battlesexes.MTW">this worksheet</a> to determine whether:</p>
<ul>
<li>Women are better at multitasking</li>
<li>Men are better at parallel parking</li>
</ul>
Are Women Better Multitaskers?
<p>The Mythbusters wanted to determine whether women are better multitaskers than men. To test this, they had 10 men and 10 women perform a set of tasks that required multitasking in order to have sufficient time to complete all of the tasks. They use a scoring system that produces scores between 0 and 100.</p>
<p>The women end up with an average of 72, while the men average 64. The Mythbusters conclude that this 8 point difference confirms the myth that women are better multitaskers. Does statistical analysis agree?</p>
The statistical perspective
<p>The average scores are based on samples rather than the entire population of men and women. Samples contain error because they are a subset of the entire population. Consequently, a sample mean and the corresponding population mean are likely to be different. It’s possible that if we reran the experiment, the sample results could change.</p>
<p>We want to be reasonably sure that the observed difference between samples actually represents a <em>true </em>difference between the entire population of men and women. This is where hypothesis tests play a role.</p>
Choosing the correct hypothesis test
<p>Because we want to compare the means between two groups, you might think that we’ll use the 2-Sample t test. However, based on a Normality Test, these data appear to be nonnormal.</p>
<p>The 2-Sample t test is robust to nonnormal data when each sample has at least 15 subjects (30 total). However, our sample sizes are too small for this test to handle nonnormal data. Therefore, we can’t trust the p-value calculated by the 2-Sample t test for these data.</p>
<p>Instead, we’ll use the nonparametric Mann-Whitney test, which compares the medians. Nonparametric tests have fewer requirements and are particularly useful when your data are nonnormal and you have small sample sizes. We’ll use a one-tailed test to determine whether the median multitasking score for women is greater than the median men’s score.</p>
<p>To run the test in Minitab statistical software, go to: <strong>Stat > Nonparametrics > Mann-Whitney</strong></p>
The Mann-Whitney test results
<p style="margin-left: 40px;"><img alt="Mann-Whitney test results" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/84b08eafb0b4c868443466396060b3a6/mann_whitney.gif" style="width: 443px; height: 196px;" /></p>
<p>The p-value of 0.1271 is greater than 0.05, which indicates that the women’s median is not significantly greater than the men’s median. Further, the 95% confidence interval suggests that the median pairwise difference is likely between -9.99 and 30.01. Because the confidence interval includes both positive and negative values, it would not be surprising to repeat the experiment and find that <em>men </em>had the higher median!</p>
<p>The Mythbusters looked at the sample means and “Confirmed” the myth. However, the data do not support the conclusion that women have a higher median score than men.</p>
Power analysis to determine sample size
<p>If the Mythbusters were to perform this experiment again, how many subjects should they recruit? For a start, if they collect at least 15 subjects per group, they can use the more powerful 2-Sample t test.</p>
<p>I’ll perform a <a href="http://www.minitab.com/en-us/Support/Tutorials/Using-Power-and-Sample-Size-Tools-with-Power-Curves/" target="_blank">power analysis</a> for a 2-sample t test to estimate a good sample size based on the following:</p>
<ul>
<li>I’ll assume that the difference must be at least 10 points to be practically meaningful.</li>
<li>I want to have an 80% chance of detecting a meaningful difference if it exists.</li>
<li>I’ll use the sample standard deviation.</li>
</ul>
<p>In Minitab, go to <strong>Stat > Power and Sample Size > 2-Sample t</strong> and fill in the dialog as follows:</p>
<p><img alt="Power and sample size for 2-sample t dialog" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/3f2600ae2a0ee6cfca95350164ff7dfc/pss_dialog.gif" style="width: 334px; height: 256px;" /></p>
<p>Under Options, choose <strong>Greater than</strong>, and click <strong>OK</strong> in all dialogs.</p>
<p style="margin-left: 40px;"><img alt="Power and sample size results for 2-sample t test" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/11f631412f296a365c93ed2e94c57f05/pss2t.gif" style="width: 360px; height: 225px;" /></p>
<p>The output shows that we need 29 subjects per group, for a total of 58, to have a reasonable chance of detecting a meaningful difference, if that difference actually exists between the two populations.</p>
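<p>To see roughly where a number like that comes from, here is a sketch of the standard normal-approximation formula for a one-sided, two-sample comparison: n per group ≈ 2 · ((z<sub>α</sub> + z<sub>β</sub>) · σ / δ)². The standard deviation of 15 below is a hypothetical stand-in, because the actual sample standard deviation appears only in the dialog image. The function name is mine, and the normal approximation tends to come out slightly lower than a t-based power calculation.</p>

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate subjects per group for a one-sided 2-sample comparison."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)  # one-sided test ("Greater than")
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

# delta = 10 points and power = 80% come from the text; sd = 15 is hypothetical
n = n_per_group(delta=10, sd=15)
print(f"About {n} subjects per group")
```

<p>Notice how sensitive the answer is to the inputs: halving the difference you want to detect roughly quadruples the required sample size.</p>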
Are Men Better at Parallel Parking?
<p>The Mythbusters also wanted to determine whether men are better at parallel parking than women. They devised a test that produces scores between 0 and 100. At first glance, this appears to be a similar scenario to the multitasking myth, where we’d compare means or medians. However, the means and medians are virtually identical and are not significantly different according to any test.</p>
<p style="margin-left: 40px;"><img alt="Descriptive statistics for parallel parking by gender" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/8466a72e71eafb6b6f8bfb2f7b3c12fe/parking_desc.gif" style="width: 288px; height: 86px;" /></p>
<p>There’s a different story behind this myth. During the parking test, the hosts notice that the women’s scores seem more variable than the men’s. The women are either really good or really bad, while men are somewhere in between, as you can see below.</p>
<p><img alt="Individual value plot of parallel parking scores by gender" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/568a85406da8584ce314ab4fb3ba4f3b/ivp_parking.gif" style="width: 400px; height: 267px;" /></p>
<p>We want to be reasonably sure that the observed difference in variability actually represents a true difference between the populations. We need to use the correct hypothesis test, which is Two Variances (<strong>Stat > Basic Statistics > 2 Variances</strong>). The test results are below:</p>
<p style="margin-left: 40px;"><img alt="Two variances test results for parallel parking by gender" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/3f1bcb84beaf871e73567d45c6dd5068/2var_parking.gif" style="width: 431px; height: 104px;" /></p>
<p>The null hypothesis is that the variability in both groups is equal. Because the p-value (0.000) is less than 0.05, we can reject the null hypothesis and conclude that women’s scores for parallel parking are more variable than men’s scores.</p>
<p>The Mythbusters correctly busted this myth because the means and medians are essentially equal. We can't conclude that one gender is better at parallel parking than the other.</p>
<p>However, we <em>can</em> conclude that men are more <em>consistent </em>at parallel parking than women.</p>
Closing Thoughts
<p>In one of <a href="http://dsc.discovery.com/tv-shows/mythbusters/videos/m5-aftershows.htm" target="_blank">their online videos</a>, Adam and Jamie explain that they understand the importance of sample size. However, Adam states that the Mythbusters put more effort into the methodology of collecting good data. It’s true, they are great at reducing sources of variation, obtaining accurate measurements, etc. He goes on to explain that they just don’t have the resources to obtain larger sample sizes. Fair enough—for a television show.</p>
<p>However, if you’re in science or Six Sigma, you don’t have this luxury. You must:</p>
<ul>
<li>Have a good methodology for collecting data</li>
<li>Have a sufficient sample size</li>
<li>Use the correct statistical analysis</li>
</ul>
<p>Without all of the above, you risk drawing incorrect conclusions.</p>
Fun Statistics, Hypothesis Testing, Learning
Thu, 05 Sep 2013 11:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/using-hypothesis-tests-to-bust-myths-about-the-battle-of-the-sexes
Jim Frost