Hypothesis Testing | Minitab

Blog posts and articles about hypothesis testing, especially in the course of Lean Six Sigma quality improvement projects.
http://blog.minitab.com/blog/hypothesis-testing-2/rss
Fri, 30 Jan 2015 02:32:42 +0000

Analyzing Qualitative Data, part 1: Pareto, Pie, and Stacked Bar Charts
http://blog.minitab.com/blog/applying-statistics-in-quality-projects/analyzing-qualitative-data-part-1-pareto-pie-and-stacked-bar-charts
<p>In several previous blogs, I have discussed the use of statistics for <a href="http://blog.minitab.com/blog/applying-statistics-in-quality-projects/using-nonparametric-analysis-to-visually-manage-durations-in-service-processes">quality improvement in the service sector</a>. Understandably, services account for a very large part of the economy. Lately, when meeting with several people from financial companies, I realized that one of the problems they faced was that they were collecting large amounts of "qualitative" data: types of product, customer profiles, different subsidiaries, several customer requirements, etc.</p>
<p>There are several ways to process such qualitative data. Qualitative data points may still be counted, and once they have been counted they may be quantitatively (numerically) analyzed using statistical methods.</p>
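<p>Minitab tallies categorical data for you, but the counting step itself is easy to sketch outside the software. Here is a minimal Python illustration (the invoice records and category names below are invented for illustration):</p>

```python
from collections import Counter

# Invented records: each entry is the type of mistake found on one invoice
mistakes = ["Product", "Address", "Product", "Price", "Product",
            "Address", "Quantity", "Product", "Price", "Address"]

# Counting turns qualitative (text) values into quantitative data
counts = Counter(mistakes)
print(counts.most_common())
# → [('Product', 4), ('Address', 3), ('Price', 2), ('Quantity', 1)]
```

Once the categories are counted like this, the counts can be charted and tested numerically, which is exactly what the rest of this post does in Minitab.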
<p>I will focus on the analysis of qualitative data using a simple and obvious example. In this case, we would like to analyze mistakes on invoices made during a period of several weeks by three employees (anonymously identified).</p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/545c0823fc7368e795585c38424891d9/quali1.jpg" style="width: 288px; height: 273px;" /></p>
<p>I will present three different ways to analyze such qualitative data (counts). In this post, I will cover:</p>
<ol>
<li>A very simple graphical approach based on bar charts to display counts (stacked and clustered bars), Pareto diagrams and Pie charts.</li>
</ol>
<p>Then, in my next post, I will demonstrate: </p>
<ol start="2">
<li> A more complex approach for testing statistical significance using a Chi-square test.<br />
</li>
<li> An even more complex multivariate approach (using correspondence analysis).</li>
</ol>
<p>Again, the main purpose of this example is to show several ways to analyze qualitative data. Quantitative data represent numeric values such as the number of grams, dollars, newtons, etc., whereas qualitative data may represent text values such as different colours, types of defects or different employees.</p>
<p>The <a href="http://www.minitab.com/en-us/products/minitab/assistant/">Assistant</a> in Minitab 17 provides a great breakdown of two main data types: </p>
<p><img alt="" spellcheck="true" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/2fd46235529df11ab90d53efa677b706/quali2.jpg" style="width: 586px; height: 316px; border-width: 1px; border-style: solid;" /></p>
Charts and Diagrams with Qualitative Data
<p>I first created a pie chart using the Minitab Assistant (<strong>Assistant > Graphical Analysis</strong>) as well as a stacked bar chart on counts (from Minitab's graph menu, select <strong>Graph > Bar Charts</strong>) to describe the proportion of each type of mistake according to the day of the week.</p>
<p><img alt="" spellcheck="true" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/15ec9831d178df8fc0cbaddab0975c89/pie_chart_of_mistake_by_day___summary_report.jpg" style="width: 478px; height: 358px; border-width: 1px; border-style: solid;" /></p>
<p>In the pie charts above, the proportion of mistake types seems to be fairly similar across the different days of the week.</p>
<p> <img alt="" spellcheck="true" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/4b92a1293aff3f424d5a6f751653fb17/quali3.jpg" style="width: 403px; height: 302px; border-width: 1px; border-style: solid;" /></p>
<p>The stacked bar chart above shows that the number of mistakes is also very stable and uniform across the days of the week.</p>
<p><img alt="" spellcheck="true" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/c23dcf3e01cedf8aaad5bad176437ed2/quali4.jpg" style="width: 426px; height: 330px;" /></p>
<p>Now let's create a stacked bar chart on counts to analyze mistakes by employee. In this second graph, shown above, large variations in the number of errors occur from one employee to another. The distribution of errors also looks very different, with more “Product” errors associated with employee A.</p>
Qualitative Data in a Pareto Chart
<p><img alt="" spellcheck="true" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/30893b16e7ab4a75024498b7c3cf9fdf/pareto_chart_of_mistake_by_person___diagnostic_report.jpg" style="width: 768px; height: 547px;" /></p>
<p>Above we see <a href="http://blog.minitab.com/blog/understanding-statistics/explaining-quality-statistics-so-your-boss-will-understand-pareto-charts">Pareto charts</a> created using the Minitab Assistant: an overall Pareto and additional Pareto diagrams, one for each employee. Again, it's easy to identify the large number of “product” mistakes (red columns) for employee A.</p>
<span style="line-height: 1.6;">Stacked Bar Charts of Qualitative Data</span>
<p><img alt="" spellcheck="true" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/79589c080171780e682cbd69d3353a0e/quali6.jpg" style="width: 426px; height: 347px;" /></p>
<p>Mistake counts are represented as percentages in the stacked bar chart above. For each employee, the error types sum to 100% within that employee's column. This provides a clearer understanding of how each employee's mistakes are distributed. Again, the high percentage of “Product” errors (in yellow) for employee A is very noticeable, but also note the proportionately high percentage of “Address” mistakes (blue areas) for employee C.</p>
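<p>The within-column percentages in a 100% stacked bar chart are simply each employee's counts divided by that employee's total. A minimal sketch of the computation (the counts below are invented, not the blog's data):</p>

```python
# Invented counts of mistakes per employee, by error type
counts = {
    "A": {"Product": 30, "Address": 5, "Price": 10},
    "B": {"Product": 8, "Address": 7, "Price": 9},
}

# Normalize each employee's counts so the column sums to 100%
percentages = {
    emp: {err: 100 * n / sum(by_type.values()) for err, n in by_type.items()}
    for emp, by_type in counts.items()
}
print(percentages["A"])  # each employee's percentages sum to 100
```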
<p><img alt="" spellcheck="true" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/9da688410bcb56f516061a4a26e64dfe/quali7.jpg" style="width: 434px; height: 346px;" /></p>
<p>The stacked bar chart above displays changes in the number of errors and in error types by week (time trends). Notice that in the last three weeks of the period, only product and address issues occurred: error types appear to shift towards “product” and “address” errors at the end of the period.</p>
Different Views of the Data Give a More Complete Picture
<p>These diagrams do provide a clear picture of mistake occurrences according to employees, error types and weeks. However, as you've seen, it takes several graphs to provide a good understanding of the issue.</p>
<p>This is still a subjective approach, though: several people seated around the same table, looking at these same graphs, might interpret them differently, and in some cases this could result in endless discussions.</p>
<p>Therefore we would also like to use a more scientific and rigorous approach: the Chi-square test. <a href="http://blog.minitab.com/blog/applying-statistics-in-quality-projects/analyzing-qualitative-data-part-2-chi-square-and-multivariate-analysis">We'll cover that in my next post</a>. </p>
<p> </p>
Data Analysis, Hypothesis Testing, Quality Improvement, Six Sigma, Statistics, Stats
Wed, 28 Jan 2015 13:00:00 +0000
http://blog.minitab.com/blog/applying-statistics-in-quality-projects/analyzing-qualitative-data-part-1-pareto-pie-and-stacked-bar-charts
Bruno Scibilia

What Are T Values and P Values in Statistics?
http://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-are-t-values-and-p-values-in-statistics
<p>If you’re not a statistician, looking through statistical output can sometimes make you feel a bit like <em>Alice in</em> <em>Wonderland. </em>Suddenly, you step into a fantastical world where strange and mysterious phantasms appear out of nowhere. </p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/6f4053a89257952fef0b9998547dffe2/tweedle_tweedledum.jpg" style="line-height: 20.7999992370605px; float: right; width: 248px; height: 255px; margin: 10px 15px;" /></p>
<p>For example, consider the T and P in your t-test results.</p>
<p>“Curiouser and curiouser!” you might exclaim, like Alice, as you gaze at your output.</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/1e5a4c064f43f19169121222402e4560/t_test_results_one_sided.jpg" style="width: 467px; height: 121px;" /></p>
<p>What are these values, really? Where do they come from? Even if you’ve used the p-value to interpret the statistical significance of your results umpteen times, its actual origin may remain murky to you.</p>
T & P: The Tweedledee and Tweedledum of a T-test
<p>T and P are inextricably linked. They go arm in arm, like Tweedledee and Tweedledum. Here's why.</p>
<p>When you perform a t-test, you're usually trying to find evidence of a significant difference between population means (2-sample t) or between the population mean and a hypothesized value (1-sample t). <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-is-a-t-test-and-why-is-it-like-telling-a-kid-to-clean-up-that-mess-in-the-kitchen">The t-value measures the size of the difference relative to the variation in your sample data</a>. Put another way, T is simply the calculated difference represented in units of standard error. The greater the magnitude of T (it can be either positive or negative), the greater the evidence <em>against </em>the null hypothesis that there is no significant difference. The closer T is to 0, the more likely there isn't a significant difference.</p>
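<p>You can see this definition at work directly in code. The sketch below (sample data invented) computes a 1-sample t-value the long way, as the difference divided by the standard error, and compares it with scipy's built-in calculation; this is a Python stand-in for what Minitab does internally:</p>

```python
import math
from scipy import stats

sample = [5.3, 5.9, 4.8, 6.1, 5.5, 6.4, 5.0, 5.8]  # invented sample data
mu0 = 5                                             # hypothesized mean

n = len(sample)
mean = sum(sample) / n
s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))  # sample std dev
standard_error = s / math.sqrt(n)

t_by_hand = (mean - mu0) / standard_error  # the difference, in units of standard error
t_scipy, p_value = stats.ttest_1samp(sample, mu0)

print(t_by_hand, t_scipy)  # the two agree
```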
<p>Remember, the t-value in your output is calculated from only one sample from the entire population. If you took repeated random samples of data from the same population, you'd get slightly different t-values each time, due to random sampling error (which is really not a mistake of any kind–it's just the random variation expected in the data).</p>
<p>How different could you expect the t-values from many random samples from the same population to be? And how does the t-value from your sample data compare to those expected t-values?</p>
<p>You can use a t-distribution to find out.</p>
Using a t-distribution to calculate probability
<p>For the sake of illustration, assume that you're using a 1-sample t-test to determine whether the population mean is greater than a hypothesized value, such as 5, based on a sample of 20 observations, as shown in the above t-test output.</p>
<ol>
<li>In Minitab, choose <strong>Graph > Probability Distribution Plot</strong>.</li>
<li>Select <strong>View Probability</strong>, then click <strong>OK</strong>.</li>
<li>From <strong>Distribution</strong>, select <strong>t</strong>.</li>
<li>In <strong>Degrees of freedom</strong>, enter <em>19</em>. (For a 1-sample t test, the degrees of freedom equals the sample size minus 1).</li>
<li>Click <strong>Shaded Area</strong>. Select <strong>X Value</strong>. Select <strong>Right Tail</strong>.</li>
<li> In <strong>X Value</strong>, enter 2.8 (the t-value), then click <strong>OK</strong>.</li>
</ol>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/bc5183a42a169d45632fd4f6c0b153b3/distribution_plot_t_2.8" style="width: 576px; height: 384px;" /></p>
<p>The highest part (peak) of the distribution curve shows you where you can expect most of the t-values to fall. Most of the time, you’d expect to get t-values close to 0. That makes sense, right? Because if you randomly select representative samples from a population, the mean of most of those random samples from the population should be close to the overall population mean, making their differences (and thus the calculated t-values) close to 0.</p>
T values, P values, and poker hands
<p>T values of larger magnitudes (either negative or positive) are less likely. The far left and right "tails" of the distribution curve represent instances of obtaining extreme values of t, far from 0. For example, the shaded region represents the probability of obtaining a t-value of 2.8 or greater. Imagine a magical dart that could be thrown to land randomly anywhere under the distribution curve. What's the chance it would land in the shaded region? The calculated probability is 0.005712.....which rounds to 0.006...which is...the p-value obtained in the t-test results! <img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/5633b267494c2017d6d7c7544247d57d/poker_picture.jpg" style="float: right; width: 200px; height: 164px; margin: 10px 15px;" /></p>
<p>In other words, the probability of obtaining a t-value of 2.8 or higher, when sampling from the same population (here, a population with a hypothesized mean of 5), is approximately 0.006.</p>
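<p>If you'd rather verify this tail probability in code than with a probability plot, the t-distribution's survival function in Python's scipy library gives the same answer as the shaded region above:</p>

```python
from scipy import stats

# Right-tail probability of t >= 2.8 on 19 degrees of freedom
p = stats.t.sf(2.8, df=19)
print(round(p, 6))  # ≈ 0.005712, which rounds to 0.006
```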
<p>How likely is that? Not very! For comparison, the probability of being dealt 3-of-a-kind in a 5-card poker hand is over three times as high (≈ 0.021).</p>
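<p>That poker figure is easy to confirm with a direct count of hands (a standard 52-card deck assumed):</p>

```python
from math import comb

# Three of a kind: choose the rank for the triple, 3 of its 4 suits,
# then two different side-card ranks, each in any of 4 suits
hands = comb(13, 1) * comb(4, 3) * comb(12, 2) * 4 * 4
total = comb(52, 5)
p = hands / total
print(hands, round(p, 4))  # → 54912 0.0211
```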
<p>Given that the probability of obtaining a t-value this high or higher when sampling from this population is so low, what’s more likely? It’s more likely that this sample doesn’t come from this population (with the hypothesized mean of 5). It's much more likely that this sample comes from a different population, one with a mean greater than 5.</p>
<p>To wit: Because the p-value is very low (< alpha level), you reject the null hypothesis and conclude that there's a statistically significant difference.</p>
<p>In this way, T and P are inextricably linked. Consider them simply different ways to quantify the "extremeness" of your results under the null hypothesis. You can’t change the value of one without changing the other.</p>
<p>The larger the absolute value of the t-value, the smaller the p-value, and the greater the evidence against the null hypothesis. (You can verify this by entering lower and higher t-values for the t-distribution in step 6 above.)</p>
Try this two-tailed follow up...
<p>The t-distribution example shown above is based on a one-tailed t-test to determine whether the mean of the population is greater than a hypothesized value. Therefore the t-distribution example shows the probability associated with the t-value of 2.8 only in one direction (the right tail of the distribution).</p>
<p>How would you use the t-distribution to find the p-value associated with a t-value of 2.8 for a two-tailed t-test (in both directions)?</p>
<p><strong>Hint:</strong> In Minitab, adjust the options in step 5 to find the probability for both tails. If you don't have a copy of Minitab, download a free <a href="http://it.minitab.com/en-us/products/minitab/free-trial.aspx" target="_blank">30-day trial version</a>.</p>
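<p>If you'd like to check your two-tailed answer afterwards in code rather than in Minitab, the symmetry of the t-distribution means you can simply double the right-tail probability (scipy used here as a stand-in for the probability plot):</p>

```python
from scipy import stats

one_tail = stats.t.sf(2.8, df=19)
two_tail = 2 * one_tail  # both tails, by symmetry of the t-distribution
print(round(two_tail, 3))
```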
Hypothesis Testing
Tue, 27 Jan 2015 13:10:00 +0000
http://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-are-t-values-and-p-values-in-statistics
Patrick Runkel

A Minitab Holiday Tale: Featuring the Two Sample t-Test
http://blog.minitab.com/blog/statistics-in-the-field/a-minitab-holiday-tale-featuring-the-two-sample-t-test
<p><em><span style="line-height: 1.6;">by Matthew Barsalou, guest blogger</span></em></p>
<p>Aaron and Billy are two very competitive—and not always well-behaved—eight-year-old twin brothers. They constantly strive to outdo each other, no matter what the subject. If the boys are given a piece of pie for dessert, they each automatically want to make sure that their own piece of pie is bigger than the other’s piece of pie. This causes much exasperation, aggravation and annoyance for their parents. Especially when it happens in a restaurant (although the restaurant situation has improved, since they have been asked not to return to most local restaurants).</p>
<p><img alt="A bag of coal" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/d2ccbe9f7c8e887281272ae49854893f/bag_of_coal.jpg" style="border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 200px; height: 200px;" />Sending the boys to their rooms never helped. The two would just compete to see who could stay in their room longer. This Christmas their parents were at their wits' end, and they decided the boys needed to be taught a lesson so they could grow up to be upstanding citizens. Instead of the new bicycles the boys were going to get—and probably just race till they crashed anyway—their parents decided to give them each a bag of coal.</p>
<p>An astute reader might ask, “But what does this have to do with <a href="http://www.minitab.com/products/minitab">Minitab</a>?” Well, dear reader, the boys need to figure out who got the most coal. Immediately upon opening their packages, the boys carefully weighed each piece of coal and entered the data into Minitab.</p>
<p><span style="line-height: 1.6;">Then they selected <strong>Stat > Basic Statistics > Display Descriptive Statistics</strong> and used the "Statistics" options dialog to select the metrics they wanted, including the sum of the weights they'd entered:</span></p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/dacaebac62e3cc4c2e29329d0a779720/descriptivestatistics.png" style="width: 600px; height: 208px;" /></p>
<p><span style="line-height: 1.6;">Billy quickly saw that he had the most coal, and yelled, “I have 279.383 ounces and you only have 272.896 ounces, and the mean of my pieces of coal is more than the mean of yours. Mine weigh more, so our parents must love me more.” </span></p>
<p><span style="line-height: 1.6;">“Not so fast,” said Aaron. “You may have a higher mean value, but is the difference statistically significant?” There was only one thing left for the boys to do: perform a <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/t-for-2-should-i-use-a-paired-t-or-a-2-sample-t">two sample t-test</a>.</span></p>
<p><span style="line-height: 1.6;">In Minitab, Aaron selected </span><strong><span style="line-height: 1.6;">Stat > Basic Statistics > 2-Sample t…</span></strong></p>
<p>The boys left the default values at a confidence level of 95.0 and a hypothesized difference of 0. The alternative hypothesis was “Difference ≠ hypothesized difference” because the only question they were asking was “Is there a statistically significant difference?” between the two data sets.</p>
<p>The two troublemakers also selected “Graphs” and checked the options to display an individual value plot and a boxplot. They knew they should look at their data. Having the graphs available would also make it easier for them to communicate their results to higher authorities, in this case, their poor parents.</p>
<p><img alt="Individual Value Plot of Coal" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/bf541d8df2461a8edff9060789394b00/individual_value_plot_of_coal.png" style="width: 577px; height: 385px;" /></p>
<p><img alt="Boxplot of Coal" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/8945d7a038de654d008f68dc0a8886d3/boxplot_of_coal.png" style="width: 577px; height: 385px;" /></p>
<p>Both the individual value plots and boxplots showed that Aaron's bag of coal had pieces with the highest individual weights. But he also had the pieces with the least weight. So the values for his Christmas coal were scattered across a wider range than the values for Billy’s Christmas coal. But was there really a difference?</p>
<p>Billy went running for his tables of Student’s t-scores so he could interpret the resulting t-value of -0.71. Aaron simply looked at the resulting p-value of 0.481. The p-value was greater than 0.05, so the boys could not conclude there was a true difference in the weight of their Christmas "presents."</p>
<p><img alt="600" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/549762a9cb277536a76baedba32617d3/2_sample_t_test_coal.png" style="width: 683px; height: 305px;" /></p>
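<p>The boys' exact weight data isn't given in the post, but a 2-sample t-test like theirs can be sketched in Python with invented data (so the t- and p-values here will not match the Minitab output above; Welch's unpooled version is used, matching Minitab's default of not assuming equal variances):</p>

```python
from scipy import stats

# Invented coal-piece weights in ounces for each brother
billy = [9.4, 8.8, 9.9, 9.1, 9.7, 9.3, 9.6, 9.2]
aaron = [9.0, 9.8, 8.5, 10.2, 9.4, 8.7, 10.0, 8.9]

# equal_var=False gives Welch's t-test (variances not pooled)
t, p = stats.ttest_ind(billy, aaron, equal_var=False)
if p > 0.05:
    print("No statistically significant difference in coal weights")
```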
<p><span style="line-height: 1.6;">The boys dutifully reported the results, with illustrative graphs, each demanding that they get a little more to best the other. Clearly, receiving coal for Christmas had done nothing to reduce their level of competitiveness. Their parents realized the boys were probably not going to grow up to be upstanding citizens, but they may at least become good statisticians.</span></p>
<p>Happy Holidays.</p>
<p> </p>
<p><strong>About the Guest Blogger</strong></p>
<p><em><a href="https://www.linkedin.com/pub/matthew-barsalou/5b/539/198" target="_blank">Matthew Barsalou</a> is an engineering quality expert in <a href="http://www.3k-warner.de/" target="_blank">BorgWarner</a> Turbo Systems Engineering GmbH’s global engineering excellence department. He is a Smarter Solutions certified Lean Six Sigma Master Black Belt, ASQ-certified Six Sigma Black Belt, quality engineer, and quality technician, and a TÜV-certified quality manager, quality management representative, and auditor. He has a bachelor of science in industrial sciences, a master of liberal studies with emphasis in international business, and has a master of science in business administration and engineering from the Wilhelm Büchner Hochschule in Darmstadt, Germany. He is author of the books <a href="http://www.amazon.com/Root-Cause-Analysis-Step---Step/dp/148225879X/ref=sr_1_1?ie=UTF8&qid=1416937278&sr=8-1&keywords=Root+Cause+Analysis%3A+A+Step-By-Step+Guide+to+Using+the+Right+Tool+at+the+Right+Time" target="_blank">Root Cause Analysis: A Step-By-Step Guide to Using the Right Tool at the Right Time</a>, <a href="http://asq.org/quality-press/display-item/index.html?item=H1472" target="_blank">Statistics for Six Sigma Black Belts</a> and <a href="http://asq.org/quality-press/display-item/index.html?item=H1473&xvl=76115763" target="_blank">The ASQ Pocket Guide to Statistics for Six Sigma Black Belts</a>.</em></p>
Fun Statistics, Hypothesis Testing, Statistics
Tue, 23 Dec 2014 13:00:00 +0000
http://blog.minitab.com/blog/statistics-in-the-field/a-minitab-holiday-tale-featuring-the-two-sample-t-test
Guest Blogger

Are Preseason Football or Basketball Rankings More Accurate?
http://blog.minitab.com/blog/the-statistics-game/are-preseason-football-or-basketball-rankings-more-accurate
<p>College basketball season tips off today, and for the second straight season Kentucky is the #1 ranked preseason team in the AP poll. Last year Kentucky did not live up to that ranking in the regular season, going 24-10 and earning a lowly 8 seed in the NCAA tournament. But then, in the tournament, they overachieved and made a run all the way to the championship game...before losing to Connecticut.</p>
<p>In football, Florida State was the AP poll preseason #1 football team. While they are currently still undefeated, they aren't quite playing like the #1 team in the country. So this made me wonder, which preseason rankings are more accurate, football or basketball?</p>
<p>I gathered <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/1d3961db92c5ba14bc90b2b8323b95f8/preseason_basketball_vs__football_rankings.MTW">data</a> from the last 10 seasons, and recorded the top 10 teams in the preseason AP poll for both football and basketball. Then I recorded the difference between their preseason ranking and their final ranking. Both sports had 10 teams that weren’t ranked or receiving votes in the final poll, so I gave all of those teams a final ranking of 40.</p>
Creating a Histogram to Compare Two Distributions
<p>Let’s start with a histogram to look at the distributions of the differences. (It's always a good idea to look at the distribution of your data when you're starting an analysis, whether you're looking at quality improvement data work or sports data for yourself.) </p>
<p>You can create this graph in Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a> by selecting <strong>Graph > Histograms</strong>, choosing "With Groups" in the dialog box, and using the Basketball Difference and Football Difference columns as the graph variables:</p>
<p><img alt="Histogram" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/53055c57978dbfa85d28688cc816c98a/histogram_of_basketball_difference__football_difference.jpg" style="width: 720px; height: 480px;" /></p>
<p>The differences in the rankings appear to be pretty similar. Most of the data is towards the left side of this histogram, meaning for most cases the difference between the preseason and final ranking is pretty small.</p>
Conducting a Mann-Whitney Hypothesis Test on Two Medians
<p>We can further investigate the data by performing a hypothesis test. Because the data is heavily skewed, I’ll use <a href="http://blog.minitab.com/blog/the-statistics-game/do-the-data-really-say-female-named-hurricanes-are-more-deadly">a Mann-Whitney test</a>. This compares the medians of two samples with similarly-shaped distributions, as opposed to a <a href="http://blog.minitab.com/blog/understanding-statistics/guidelines-and-how-tos-for-the-2-sample-t-test">2-sample t test</a>, which compares the means. <span style="line-height: 20.7999992370605px;">The median is the middle value of the data. Half the observations are less than or equal to it, and half the observations are greater than or equal to it.</span><span style="line-height: 20.7999992370605px;"> </span></p>
<p>To perform this test in our statistical software, we select <strong>Stat > Nonparametrics > Mann-Whitney</strong>, then choose the appropriate columns for our first and second sample: </p>
<p><img alt="Mann-Whitney Test" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/1a1f239841b82e60170e6ecbc8077d4b/mann_whitney.jpg" style="width: 689px; height: 241px;" /></p>
<p>The basketball rankings have a smaller median difference than the football rankings. However, when we examine the <a href="http://blog.minitab.com/blog/understanding-statistics/three-things-the-p-value-cant-tell-you-about-your-hypothesis-test">p-value</a> we see that this difference is not statistically significant. There is not enough evidence to conclude that one preseason poll is more accurate than the other.</p>
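<p>As a sketch of the same test outside Minitab (the rank-difference data below are invented, not the actual poll data), scipy offers an equivalent Mann-Whitney test:</p>

```python
from scipy import stats

# Invented |preseason rank - final rank| differences, right-skewed like the real data
basketball = [0, 1, 1, 2, 2, 3, 4, 5, 8, 12, 20, 30]
football   = [0, 1, 2, 2, 3, 4, 6, 7, 10, 15, 25, 30]

u, p = stats.mannwhitneyu(basketball, football, alternative="two-sided")
if p > 0.05:
    print("No significant difference between the two polls' accuracy")
```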
<p>But what about the best teams? I grouped each of the top 3 ranked teams and looked at the median difference between their preseason and final rank.</p>
<p><img alt="Bar Chart" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/692a3db40dd5d3b4c20d539f92395629/bar_chart.jpg" style="width: 720px; height: 480px;" /></p>
<p>The preseason AP basketball poll has a smaller difference for the #1 and #3 ranked teams. But the football poll is better for the #2 team, having an impressive median value of 1. Overall, both polls are relatively good, as neither has a median value greater than 6. And the differences are close enough that we can’t conclude that one is more accurate than the other.</p>
What Does It Mean for the Teams?
<p>While the odds are against both Kentucky and Florida State to finish the season ranked #1 in their respective polls, previous seasons indicate that they’re still likely to finish as one of the top teams. This is better news for Kentucky, as being one of the top teams means they’ll easily make the NCAA basketball tournament and get a high seed. However, Florida State must finish as one of the top 4 teams, or else they’ll miss out on the football postseason completely.</p>
<p>So while we can’t conclude that one poll is better than the other, teams at the top of the AP basketball poll are clearly much more likely to reach the postseason than teams at the top of the football poll.</p>
Data Analysis, Fun Statistics, Hypothesis Testing, Statistics in the News
Fri, 14 Nov 2014 15:03:33 +0000
http://blog.minitab.com/blog/the-statistics-game/are-preseason-football-or-basketball-rankings-more-accurate
Kevin Rudy

Comparing the College Football Playoff Top 25 and the Preseason AP Poll
http://blog.minitab.com/blog/the-statistics-game/comparing-the-college-football-playoff-top-25-and-the-preseason-ap-poll
<p>The college football playoff committee waited until the end of October to release their first top 25 rankings. One of the reasons for waiting so far into the season was that the committee would rank the teams off of actual games and wouldn’t be influenced by preseason rankings.</p>
<p>At least, that was the idea.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/8ac74acf42052d068b6cd0eeec32f609/cfb_playoff.jpg" style="line-height: 20.7999992370605px; float: right; width: 300px; height: 187px;" /></p>
<p>Earlier this year, I found that the <a href="http://blog.minitab.com/blog/the-statistics-game/has-the-college-football-playoff-already-been-decided">final AP poll was correlated with the preseason AP poll</a>. That is, if team A was ranked ahead of team B in the preseason and they had the same number of losses, team A was still usually ranked ahead of team B. The biggest exception was SEC teams, who were able to regularly jump ahead of teams (with the same number of losses) ranked ahead of them in the preseason.</p>
<p>If the final AP poll can be influenced by preseason expectations, could the college football playoff committee be influenced, too? Let’s compare their first set of rankings to the preseason AP poll to find out.</p>
Comparing the Ranks
<p>There are currently 17 different teams in the committee’s top 25 that have just one loss. I <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/26e7c8d8d8eee4fe2dfa26dc3d6e3c54/preseason_ap_vs__cfb_playoff_rankings.MTW">recorded the order</a> they are ranked in the committee’s poll and their order in the AP preseason poll. Below is an individual value plot of the data that shows each team’s preseason rank versus their current rank.</p>
<p><img alt="IVP" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/4098bab194a586865d3861f854d65627/ivp.jpg" style="width: 600px; height: 400px;" /></p>
<p>Teams on the diagonal line haven’t moved up or down since the preseason. Although Notre Dame is the only team to fall directly on the line, most teams aren’t too far off.</p>
<p>Teams below the line have jumped teams that were ranked ahead of them in the preseason. The biggest winner is actually not an SEC team, it’s TCU. Before the season, 13 of the current one-loss teams were ranked ahead of TCU, but now there are only 4. On the surface TCU seems to counter the idea that only SEC teams can drastically move up from their preseason ranking. However, of the 9 teams TCU jumped, only one (Georgia) is from the SEC. And the only other team to jump up more than 5 spots is Mississippi—who of course is from the SEC. So I wouldn’t conclude that the CFB playoff committee rankings behave differently than the AP poll quite yet.</p>
<p>Teams above the line have been passed by teams that had been ranked behind them in the preseason. Ohio State is the biggest loser, having had 9 different teams pass over them. Part of this can be explained by the fact that they have the worst loss (a home loss to a 4-4 Virginia Tech team). But another factor is that the preseason AP poll was released before anybody knew Buckeye quarterback Braxton Miller would miss the entire season. Had voters known that, Ohio State probably wouldn’t have been ranked so high to begin with. </p>
<p>Overall, 10 teams have moved up or down from their preseason spot by 3 spots or less. The correlation between the two polls is 0.571, which indicates a positive association between the preseason AP poll and the current CFB playoff rankings. That is, teams ranked higher in the preseason poll tend to be ranked higher in the playoff rankings.</p>
Concordant and Discordant Pairs
<p>We can take this analysis a step further by looking at the concordant and discordant pairs. A pair is concordant if the observations are in the same direction. A pair is discordant if the observations are in opposite directions. This will let us compare teams to each other two at a time.</p>
<p>For example, let’s compare Auburn and Mississippi. In the preseason, Auburn was ranked 3 (out of the 17 one-loss teams) and Mississippi was ranked 10. In the playoff rankings, Auburn is ranked 1 and Mississippi is ranked 2. This pair is concordant, since in both cases Auburn is ranked higher than Mississippi. But if you compare Alabama and Mississippi, you’ll see Alabama was ranked higher in the preseason, but Mississippi is ranked higher in the playoff rankings. That pair is discordant.</p>
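This pair-counting logic is simple enough to replicate in a few lines of Python. The sketch below uses the ranks quoted above for four of the teams (Auburn, Mississippi, Alabama, Oregon); the full analysis would use all 17 one-loss teams.

```python
from itertools import combinations

def concordance(rank_a, rank_b):
    """Count concordant and discordant pairs between two rankings.

    rank_a[i] and rank_b[i] are the two ranks of item i. A pair (i, j)
    is concordant when both rankings order i and j the same way, and
    discordant when they disagree.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        a_says_i_first = rank_a[i] < rank_a[j]
        b_says_i_first = rank_b[i] < rank_b[j]
        if a_says_i_first == b_says_i_first:
            concordant += 1
        else:
            discordant += 1
    return concordant, discordant

# Ranks for Auburn, Mississippi, Alabama, Oregon (in that order),
# taken from the numbers quoted in the post.
preseason = [3, 10, 1, 2]
playoff   = [1, 2, 4, 3]

print(concordance(preseason, playoff))
# -> (1, 5): among these four teams, only the Auburn/Mississippi
#    pair is concordant.
```

With all 17 teams there are 17 × 16 / 2 = 136 pairs, which is where the count in the Minitab output below comes from.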
<p>When we compare every team, we end up with 136 pairs. How many of those are concordant? Our <a href="http://www.minitab.com/products/minitab">favorite statistical software</a> has the answer: </p>
<p><img alt="Measures of Concordance" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/5f281abfa1e06d5cda492e17b3f9746b/concordance.jpg" style="width: 663px; height: 176px;" /></p>
<p>There are 96 concordant pairs, which is just over 70%. So most of the time, if a team ranked higher in the preseason poll, they are ranked higher in the playoff rankings. And consider this: of the one-loss teams, the top 4 ranked preseason teams were Alabama, Oregon, Auburn, and Michigan St. Currently, the top 4 one-loss teams are Auburn, Mississippi, Oregon, and Alabama. That’s only one new team—which just so happens to be from the SEC.</p>
<p>That’s bad news for non-SEC teams that started the season ranked low, like Arizona, Notre Dame, Nebraska, and Kansas State. It's going to be hard for them to jump teams with the same record, especially if those teams are from the SEC. Just look at Alabama’s résumé so far. Their best win is over West Virginia and they lost to #4 Mississippi. Is that <em>really </em>better than Kansas State, who lost to #3 Auburn and beat Oklahoma <em>on the road</em>? If you simply changed the name on Alabama’s uniform to Utah and had them unranked to start the season, would they still be ranked three spots higher than Kansas State? I doubt it.</p>
<p>The good news is that there are still many games left to play. Most of these one-loss teams will lose at least one more game. But with 4 teams making the playoff this year, odds are we'll see multiple teams with the same record vying for the last playoff spot. And if this college football playoff ranking is any indication, the edge will go to teams that were highly thought of in the preseason—and to teams from the SEC.</p>
Fun StatisticsHypothesis TestingFri, 31 Oct 2014 13:04:57 +0000http://blog.minitab.com/blog/the-statistics-game/comparing-the-college-football-playoff-top-25-and-the-preseason-ap-pollKevin RudyUsing Data Analysis to Maximize Webinar Attendance
http://blog.minitab.com/blog/michelle-paret/using-data-analysis-to-maximize-webinar-attendance
<p>We like to host webinars, and our customers and prospects like to attend them. But when our webinar vendor moved from a pay-per-person pricing model to a pay-per-webinar pricing model, we wanted to find out how to maximize registrations and thereby minimize our costs.<img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/8a6733d3b0516b7f1c7ad80ea753d430/mtbnewspromos_w640.jpeg" style="width: 400px; height: 273px; float: right; border-width: 1px; border-style: solid; margin: 10px 15px;" /></p>
<p>We collected webinar data on the following variables:</p>
<ul>
<li>Webinar topic</li>
<li>Day of week</li>
<li>Time of day – 11 a.m. or 2 p.m.</li>
<li>Newsletter promotion – no promotion, newsletter article, newsletter sidebar</li>
<li>Number of registrants</li>
<li>Number of attendees</li>
</ul>
<p>Once we'd collected our data, it was time to analyze it and answer some key questions using <a href="http://www.minitab.com/products/minitab/">Minitab Statistical Software</a>.</p>
Should we use registrant or attendee counts for the analysis?
<strong><span style="line-height: 16.8666667938232px; font-family: Calibri, sans-serif; font-size: 11pt;"><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/4d9fa1e3c73606627d2ca1ec34b620e2/scatterplot_w640.jpeg" style="width: 300px; height: 197px; margin: 10px 15px; float: left;" /></span></strong>
<p>First we needed to decide what we would use to measure our results: the number of people who signed up, or the number of people who actually attended the webinar. This really boils down to answering the question, “Can I trust my data?”</p>
<p>Our data collection system for webinar registrants is much more accurate than our data collection system for webinar attendees. This is due to customer behavior and their willingness to share contact information, in addition to the automated database processes that connect our webinar vendor data with our own database. So, for a period of time, I manually collected the attendee data directly from our webinar vendor to see how it correlated with the easily-accessible and accurate registration data. The scatterplot above shows the results.</p>
<p>With a <a href="http://blog.minitab.com/blog/understanding-statistics/no-matter-how-strong-correlation-still-doesnt-imply-causation">correlation coefficient </a>of 0.929 and a p-value of 0.000, there was a strong positive linear relationship between the registrations and attendee counts. If registrations are high, then attendance is also high. If registrations are low, then attendance is also low. I concluded that I could use the registration data—which is both easily accessible and extremely reliable—to conduct my analysis.</p>
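If you prefer scripting, the same correlation check is a one-liner with SciPy. The registration and attendee counts below are made-up placeholders, not our actual webinar data, so the numbers printed will differ from the 0.929 reported above.

```python
from scipy.stats import pearsonr

# Illustrative (made-up) counts for eight webinars -- the real data
# lives in the marketing database, not here.
registrants = [120, 250, 90, 310, 180, 400, 150, 275]
attendees   = [45, 110, 30, 140, 70, 180, 60, 120]

# Pearson correlation plus the p-value for the test that the true
# correlation is zero.
r, p = pearsonr(registrants, attendees)
print(f"r = {r:.3f}, p = {p:.4f}")
```

A high r with a small p-value supports using the more reliable registration counts as a proxy for attendance.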
Should we consider data for the last 6 years?
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/5e73f48b852c7afc17762f28bf8887cf/i_mr_chart_of_registrants_w640.jpeg" style="width: 400px; height: 263px; margin: 10px 15px; float: left;" />We’ve been collecting webinar data for 6 years, but that doesn’t mean we can treat the last 6 years of data as one homogeneous population.</p>
<p>A lot can change in a 6-year time period. Perhaps there was a change in the webinar process that affected registrations. To determine whether or not I should use all of the data, I used an Individuals and Moving Range (I-MR, also referred to as X-MR) <a href="http://blog.minitab.com/blog/understanding-statistics/how-create-and-read-an-i-mr-control-chart">control chart</a> to evaluate the process stability of webinar registrations over time.</p>
<p>The graph revealed a single point on the MR chart that was flagged as out of control. I looked more closely at this point and verified that the data was accurate and that this webinar belonged with the larger population. Based on this information, I decided to proceed with analyzing all 6 years of data together. (Note there is some clustering of points due to promotions, but again the goal here was to determine if we could use data over a 6-year time period.)</p>
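The I-MR limits themselves are easy to reproduce by hand. Here is a rough Python sketch using the standard control chart constants for moving ranges of size 2 (2.66 = 3/d2 with d2 = 1.128, and D4 = 3.267); the registration counts are invented for illustration, not our actual data.

```python
def imr_limits(x):
    """Compute I-MR control chart limits and flag out-of-control points.

    Uses the standard constants for moving ranges of span 2:
    individuals limits = xbar +/- 2.66 * MRbar, MR upper limit = 3.267 * MRbar.
    """
    mr = [abs(b - a) for a, b in zip(x, x[1:])]
    xbar = sum(x) / len(x)
    mrbar = sum(mr) / len(mr)
    i_limits = (xbar - 2.66 * mrbar, xbar + 2.66 * mrbar)
    mr_limits = (0.0, 3.267 * mrbar)
    out_i = [v for v in x if not i_limits[0] <= v <= i_limits[1]]
    out_mr = [v for v in mr if v > mr_limits[1]]
    return i_limits, mr_limits, out_i, out_mr

# Invented registration counts, with one unusually large webinar.
registrations = [210, 195, 220, 205, 230, 215, 420, 208, 199, 225]
print(imr_limits(registrations))
```

A point outside the individuals limits, or a moving range above its upper limit, is what the chart would flag for investigation.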
What variables impact registrations?
<p>I performed an ANOVA using Minitab's General Linear Model tool to find out which factors—topic, day of week, time of day, or newsletter promotion—significantly affect webinar registrations.<img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/3758d3d03a604bab9921ad9f94663dc8/main_effects_plot_for_registrants_w640.jpeg" style="width: 400px; height: 263px; float: right; margin: 10px 15px;" /></p>
<p>The ANOVA results revealed that the day of week, time of day, and webinar topic <em>do not</em> affect webinar registrations, but the newsletter promotion type <em>does</em> (p-value = 0.000).</p>
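Outside Minitab, the same kind of general linear model can be sketched with the statsmodels library. The tiny data set below is fabricated to mimic the finding (promotion type matters, time of day does not); it is not our real webinar data, and the factor levels are simplified to two of the four factors.

```python
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Fabricated, balanced mini data set: 12 webinars crossing promotion
# type with time of day.
df = pd.DataFrame({
    "promo": ["none", "sidebar", "article"] * 4,
    "time":  ["11am", "2pm"] * 6,
    "registrants": [200, 212, 395, 198, 215, 398,
                    202, 210, 400, 196, 213, 402],
})

# Fit a general linear model with both categorical factors, then run
# an ANOVA to see which factors significantly affect registrations.
model = ols("registrants ~ C(promo) + C(time)", data=df).fit()
table = anova_lm(model, typ=2)
print(table)
```

In this fabricated data, the `C(promo)` row gets a tiny p-value while `C(time)` does not, mirroring the pattern the real analysis found.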
<p>So which webinar promotion type maximizes webinar registrations?</p>
<p>Using Minitab to conduct <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/keep-that-special-someone-happy-when-you-perform-multiple-comparisons">Tukey comparisons</a>, we can see that registrations for webinars promoted in the newsletter sidebar space were not significantly different from webinars that weren't promoted at all.</p>
<p>However, webinars that were promoted in the newsletter <em>article </em>space resulted in significantly more registrations than both the sidebar promotions and no promotions.</p>
<p>From this analysis, we concluded that we still had the flexibility to offer webinars at various times and days of the week, and we could continue to vary webinar topics based on customer demand and other factors. To maximize webinar attendance and minimize webinar cost, we needed to focus our efforts on promoting the webinars in our newsletter, utilizing the article space.</p>
<p>But over the past year, we’ve started to actively promote our webinars via other channels as well, so next up is some more data analysis—using Minitab—to figure out what marketing channels provide the best results…</p>
Data AnalysisHypothesis TestingRegression AnalysisStatisticsFri, 17 Oct 2014 12:00:00 +0000http://blog.minitab.com/blog/michelle-paret/using-data-analysis-to-maximize-webinar-attendanceMichelle ParetWith the Assistant, You Won't Have to Stop and Get Directions about Directional Hypotheses
http://blog.minitab.com/blog/statistics-and-quality-improvement/with-the-assistant-you-wont-have-to-stop-and-get-directions-about-directional-hypotheses
<p>I got lost a lot as a child. I got lost at malls, at museums, Christmas markets, and everywhere else you could think of. Had it been in fashion to tether children to their parents at the time, I'm sure my mother would have. As an adult, I've gotten used to using a GPS device to keep me from getting lost.</p>
<p><span style="line-height: 20.7999992370605px;">The Assistant in Minitab is like your GPS for statistics. The Assistant is there to provide you with directions so that you don't get lost. One particular area where it's easy to get lost is with directional hypotheses.</span><img alt="Wait... is my hypothesis the other direction?" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/25dd42362071d2aafc3bfc85f78f5f22/hypothesis_bubble_w640.jpeg" style="line-height: 20.7999992370605px; width: 480px; height: 350px; border-width: 1px; border-style: solid; margin: 10px 15px;" /></p>
What Is a Directional Hypothesis?
<p>When you do a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/basics/what-is-a-hypothesis-test/">statistical hypothesis test</a>, you have a null hypothesis and an alternative hypothesis. Directional hypotheses refer to two types of alternative hypotheses that you can usually choose. The common alternative hypotheses are these three:</p>
<ul>
<li>The value that you want to test is greater than a target.</li>
<li>The value that you want to test is different from a target.</li>
<li>The value that you want to test is less than a target.</li>
</ul>
<p>If you select an alternative hypothesis with "greater than" or "less than" in it, then you've chosen a directional hypothesis. When you choose a directional hypothesis, you get a one-sided test.</p>
<p>What does it look like to choose a one-sided test, and why would you? Let's consider an example.</p>
Choosing Whether to Use a One-sided Test or a Two-sided Test
<p>Suppose new production equipment is installed at a factory that should increase the rate of production for electrical panels. Concern exists that the change could increase the percentage of electrical panels that require rework before shipping. A quality team prepares to conduct a hypothesis test to determine whether statistical evidence supports this concern. The historical rework rate is 1%.</p>
<p>At this point, you would usually choose an alternative hypothesis. Maybe you remember hearing that you should think about whether to use a one-sided test or a two-sided test, or you may not even know how a test can have a side.</p>
<p>To keep from getting lost, you use your GPS. To keep from getting confused about statistics, you can use the Assistant. The Assistant uses clear and simple language. The Assistant doesn't ask you about "directional hypotheses" or "one-sided tests." Instead, the Assistant asks the question, "What do you want to determine?"</p>
<p><img alt="Is the % defective of Panels greater than .01? Is the % defective of Panels less than .01? Is the % defective of Panels different from .01?" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/b090980e5b08184e7b70b96b9cb05489/test_setup_in_assistant.png" style="width: 573px; height: 198px;" /></p>
<p>In this scenario, it's easy to see why the team would want to determine whether the percentage is greater than 1%. By performing the one-sided test for whether the percentage is greater than 1%, the team can determine if there is enough statistical evidence to conclude that the percentage increased. If the percentage increased, then the concern is justified.</p>
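To see the one-sided versus two-sided distinction in code, a binomial test does the job. The rework counts below are hypothetical, since the post doesn't give raw data; only the 1% historical rate comes from the scenario.

```python
from scipy.stats import binomtest

# Hypothetical inspection data: 18 panels needing rework out of 1200
# produced. Only the 1% target comes from the scenario above.
reworked, produced = 18, 1200

# One-sided test: is the rework rate GREATER than the historical 1%?
one_sided = binomtest(reworked, produced, p=0.01, alternative="greater")

# Two-sided test for comparison: is the rate DIFFERENT from 1%?
two_sided = binomtest(reworked, produced, p=0.01, alternative="two-sided")

print(f"one-sided p = {one_sided.pvalue:.4f}")
print(f"two-sided p = {two_sided.pvalue:.4f}")
```

For the same data, the one-sided p-value is smaller, which is exactly why a one-sided test has more power to detect an increase when an increase is the only question you care about.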
<p>In practical terms, you should consider what it means to limit your decision to whether there is evidence for an increase. A one-sided test of whether the percentage increased will never show a statistically significant decrease in the percentage of panels that require rework. Evidence of a decrease in the number of defectives might guide the quality team to investigate the reasons for the unforeseen benefit.</p>
Why Use a One-sided Test?
<p>Given this possible concern about whether a one-sided test excludes important information from the result, why would you ever use one? The best answer is that you use a one-sided test when the one-sided test tells you everything that you need to know.</p>
<p>In the example about the electrical panels, the quality team might feel completely secure in assuming that the new equipment will not result in a decrease in the percentage of panels that require rework. If so, then a test that also checks for a decrease answers a question the team never asked. The team needs only to determine whether to solve a problem with increased defectives or not.</p>
The Assistant Gets Even Better
<p>While a p-value for a one-sided test can be useful, more analysis can help you make better decisions. For example, in the electrical panel example, if the team finds a statistically significant increase, it will be important to know what the percentage increase is. <a href="http://www.minitab.com/en-us/products/minitab/assistant/">The Assistant</a> produces several reports with your hypothesis tests that help you get as much information as you can from your data. The report card verifies your analysis by providing assumption checks and identifying any concerns that you should be aware of. The diagnostic report helps you further understand your analysis by providing additional detail. The summary report helps you to draw the correct conclusions and explain those conclusions to others. The series of reports includes a variety of other statistics and analyses. That way, you have everything that you need to interpret your results with confidence.</p>
<p><img alt="The % defective of Panels is not significantly greater than the target (p > 0.05)" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/75f280df482574a3aee75ee65741b5c4/1_sample___defective_test_for_panels___summary_report_w640.png" style="width: 480px; height: 360px;" /></p>
<p>The image of the face in the crowd without the thought bubble is by <a href="https://www.flickr.com/photos/akbarsyah/">_Imaji_</a> and is licensed under <a href="https://creativecommons.org/licenses/by/2.0/">this creative commons license</a>.</p>
Hypothesis TestingWed, 15 Oct 2014 18:52:23 +0000http://blog.minitab.com/blog/statistics-and-quality-improvement/with-the-assistant-you-wont-have-to-stop-and-get-directions-about-directional-hypothesesCody SteeleHow Politicians and Governments Could Benefit from Statistical Analyses
http://blog.minitab.com/blog/applying-statistics-in-quality-projects/how-politicians-and-governments-could-benefit-from-statistical-analyses
<p>Using <a href="http://blog.minitab.com/blog/applying-statistics-in-quality-projects/a-doe-in-a-manufacturing-environment-part-1">statistical techniques to optimize manufacturing processes</a> is quite common now, but applying the same methods to social topics is still innovative. For example, if our objective is to improve student academic performance, should we increase teachers' wages, or would it be better to reduce the number of students in a class?</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/4b07ae989e35a7dfd8b6fdb313a5561b/ballot.jpg" style="float: right; width: 250px; height: 250px;" />Many social topics (the effect of increasing the minimum wage on employment, etc.) generate long and passionate discussions in the media and in politics. People express very different and subjective points of views according to political/ideological opinions and varied ways of thinking.</p>
Hypothesis Testing in the Policy Realm
<p><span style="line-height: 20.7999992370605px;">Social experimentation and data analysis can provide a firmer ground on which we can base more objective decisions.</span></p>
<p>The objective is to investigate the effects of a policy intervention and to <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/basics/example-of-a-hypothesis-test/">test specific hypotheses</a>. In these social experiments “randomization” is a key element. If one policy option is tested in, say, the Netherlands, and another policy option is tested in France, the experimenter will never be in a position to fully understand whether a difference in outcomes is due to the intervention itself or to the many other differences between these two countries.</p>
<p>It would clearly be preferable to test the two approaches in different regions of France and of the Netherlands, for example, and assign the policy intervention in a random way to a “treatment” group (individuals who receive it) and a “comparison” group (individuals who do not receive it).</p>
<p>At the beginning of the study, the “treatment” and the “control” groups should be as similar as possible to prevent any pre-existing systematic bias. The objective is not to “observe” differences but to identify the actual causal effects.</p>
Designed Experiment Techniques
<p>Other techniques that are often used in <a href="http://blog.minitab.com/blog/understanding-statistics/getting-started-with-factorial-design-of-experiments-doe">designed experiments (DOEs)</a> may also be useful in this context, such as blocking and balancing. In my example, France and the Netherlands might be considered as a blocking factor (an external extra factor which the experimenter cannot control), and the tests should be “balanced” across blocks so that the treatment effect estimates are not biased and the blocking effects of the countries are neutralized. Other potential blocking factors in policy studies might be urban versus rural regions, or females versus males.</p>
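A balanced, blocked random assignment like the one described above is straightforward to script. This Python sketch assumes hypothetical regions blocked by country; it simply splits each block at random into equal treatment and control halves, so that each block contributes equally to both arms.

```python
import random

def blocked_random_assignment(units, block_of, seed=0):
    """Randomly assign units to 'treatment'/'control', balanced within
    each block (e.g., country, urban vs. rural, sex)."""
    rng = random.Random(seed)
    assignment = {}
    blocks = {}
    for u in units:
        blocks.setdefault(block_of(u), []).append(u)
    for members in blocks.values():
        rng.shuffle(members)
        half = len(members) // 2
        for u in members[:half]:
            assignment[u] = "treatment"
        for u in members[half:]:
            assignment[u] = "control"
    return assignment

# Hypothetical study units: ten regions per country, with the country
# serving as the blocking factor.
regions = [("FR", i) for i in range(10)] + [("NL", i) for i in range(10)]
groups = blocked_random_assignment(regions, block_of=lambda u: u[0])

for country in ("FR", "NL"):
    n_treat = sum(1 for u, g in groups.items()
                  if u[0] == country and g == "treatment")
    print(country, n_treat, "of 10 regions in treatment")
```

Because assignment is randomized within each country, any France-versus-Netherlands difference is neutralized rather than confounded with the treatment effect.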
Examples of Policy Experiments
<p>Data analysis and statistics have been used to inform several important policy debates around the world over the past few years. Here are a few examples:</p>
<p>- In Kenya, a social experiment showed that neither hiring extra teachers to reduce class sizes in schools nor providing more textbooks to pupils had much effect on academic performance. A surprising finding of this study was that deworming programs (treating intestinal worms) were very effective in decreasing child absenteeism.</p>
<p>- In the U.S, a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/doe/factorial-designs/choose-a-factorial-design/">full factorial design (DOE)</a> was used to assess the effectiveness of commitment contracts. The objective of these contracts was to encourage individuals to exercise more in order to reduce health risks and prevent obesity. The effects of different factors such as duration of the physical exercises, their frequency and financial stakes were studied. The outcome was the likelihood of accepting such a contract.</p>
<p>- Different strategies to quit smoking based on commitment contracts have been tested using a randomized experimental approach.</p>
<p>- In France, a social experiment was conducted to compare different job-counselling strategies for placing young unemployed people. The outcome studied was the probability of finding a job.</p>
Conclusion
<p>Experiments make it possible to vary one factor at a time, but a more efficient approach is to vary several factors in each test using a proper design of experiments. Expertise in setting up randomized field experiments to test economic hypotheses is clearly a key factor.</p>
<p>Experimental results are often surprising, therefore experimentation and data analysis are potentially new and powerful tools in the arsenal of politicians and governments.</p>
<p>Here are sources of more information about the examples I've mentioned :</p>
<p>Miguel, Edward and Michael Kremer (2004). “Worms: Identifying Impacts on Education and Health in the Presence of Treatment Externalities,” Econometrica, Volume 72(1), pp. 159-217.</p>
<p>Gine, Xavier, Dean Karlan and Jonathan Zinman (2008). “Put Your Money Where Your Butt Is: A Commitment Savings Account for Smoking Cessation,” MIMEO, Yale University.</p>
<p><a href="http://www.voxeu.org/article/job-placement-and-displacement-evidence-randomised-experiment">http://www.voxeu.org/article/job-placement-and-displacement-evidence-randomised-experiment</a></p>
<p>Using Nudges in Exercise Commitment Contracts : <a href="http://www.nber.org/bah/2011no1/w16624.html">http://www.nber.org/bah/2011no1/w16624.html</a></p>
<p> </p>
Data AnalysisDesign of ExperimentsHypothesis TestingStatisticsStatistics in the NewsStatsMon, 22 Sep 2014 12:00:00 +0000http://blog.minitab.com/blog/applying-statistics-in-quality-projects/how-politicians-and-governments-could-benefit-from-statistical-analysesBruno ScibiliaA Fun ANOVA: Does Milk Affect the Fluffiness of Pancakes?
http://blog.minitab.com/blog/statistics-in-the-field/a-fun-anova3a-does-milk-affect-the-fluffiness-of-pancakes
<p><em>by Iván Alfonso, guest blogger</em></p>
<p><img alt="hotcakes" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/7bd460fa71f6d12672a2ac5d9f754762/pancakes.jpg" style="border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 300px; height: 223px;" />I'm a huge fan of hot cakes—they are my favorite dessert ever. I’ve been cooking them for over 15 years, and over that time I’ve noticed many variations in texture, flavor, and thickness. Personally, I like fluffy pancakes.</p>
<p>There are many brands of hotcake mix on the market, all with very similar formulations. So I decided to investigate which ingredients and inputs may influence the fluffiness of my pancakes.</p>
<p>Potential factors could include the type of mix used, the type of milk used, the use of margarine or butter (of many brands), the amount of mixing time, the origin of the eggs, and the skill of the person who prepares the pancakes.</p>
<p>Instead of looking at <em>all </em>of these factors, I focused on the type of milk used in the pancakes. I had four types of milk available: whole milk, light, low fat, and low protein.</p>
<p>My goal was to determine if these different milk formulations influence fluffiness (thickness). Is whole milk the best for fluffy hotcakes? Does skim milk work the same way as whole milk? Can I be sure that using light milk will result in hot cakes that are less fluffy?</p>
Gathering Data
<p>I sorted the four formulations as shown in the diagram below:</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/643f9f4f94be78a5b1c012e49c400772/milk_factor.jpg" style="width: 144px; height: 200px;" /></p>
<p>I used the same amounts of milk, flour (one brand), salt, and margarine for each batch of hotcakes I cooked.</p>
<p>The response variable was the thickness of the cooked pancakes. I prepared 6 pancakes for each type of milk, which gives me a total of 24 pancakes. I randomized the cooking order to minimize bias. I also prepared each batch by myself—if my sister or mother had helped with some lots, it would be a potential source of variation.</p>
<p>To measure the fluffiness, I inserted a stick into the center of each hotcake until it touched the bottom, marked the stick with a pencil, then measured the distance to the mark in millimeters with a ruler.</p>
<p>After a couple of hours of cooking hotcakes, making measurements, and recording the data on a worksheet, I started to analyze my data with Minitab.</p>
Analysis of Variance (ANOVA)
<p>My goal was to assess the variation in thickness or fluffiness between different batches of hot cakes, so the most appropriate statistical technique was <a href="http://blog.minitab.com/blog/statistics-in-the-field/understanding-anova-by-looking-at-your-household-budget">analysis of variance, or ANOVA</a>. With this analysis I could visualize and compare the formulations based on my response variable, the thickness in millimeters, and see if there were statistically significant differences between them. I used a <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/alpha-male-vs-alpha-female">0.05 significance value</a>.</p>
<p>As soon as I had my data in a Minitab worksheet, I started to check it for the assumptions of ANOVA. First, I needed to see if the data followed a normal distribution, so I went straight to <strong>Stat > Basic Statistics > Normality Test</strong>. Minitab produced the following graph:</p>
<p><img alt="Graph of probability of thickness" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/58599d2e2d8572e700893e2e8000dce9/probability_of_thickness.jpg" style="width: 500px; height: 304px;" /></p>
<p>My data passed both the Kolmogorov-Smirnov and Anderson-Darling normality tests. This was a relief—since my data had a normal distribution, I didn’t need to worry about ANOVA’s assumptions of normality.</p>
<p>Traditional ANOVA also has an assumption of equal variances; however, I knew that even if my data didn’t meet this assumption, I could proceed using the method called <a href="http://blog.minitab.com/blog/adventures-in-statistics/did-welchs-anova-make-fishers-classic-one-way-anova-obsolete">Welch’s ANOVA</a>, which accommodates unequal variances. But when I ran Bartlett’s test for equal variances, and even the more stringent Levene test, my data passed. </p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/5600f02a4a7a9faa8b82c3bbe1458784/test_for_equality_of_variances.jpg" style="width: 500px; height: 307px;" /></p>
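These assumption checks have rough analogues in SciPy. Note that the sketch below substitutes the Shapiro-Wilk test for the normality checks (SciPy doesn't reproduce Minitab's Anderson-Darling/Kolmogorov-Smirnov output directly), and the thickness measurements are invented, not my actual data.

```python
from scipy import stats

# Invented thickness measurements (mm), six pancakes per milk type.
whole   = [19.5, 20.1, 18.9, 20.4, 19.8, 20.0]
light   = [17.2, 16.8, 17.5, 16.9, 17.1, 17.4]
lowfat  = [17.0, 17.3, 16.6, 17.2, 16.9, 17.1]
lowprot = [14.8, 15.2, 14.5, 15.0, 14.7, 15.1]
groups = [whole, light, lowfat, lowprot]

# Normality check per group (Shapiro-Wilk here, as a stand-in for the
# Anderson-Darling and Kolmogorov-Smirnov tests used in the post).
for g in groups:
    stat, p = stats.shapiro(g)
    print(f"Shapiro-Wilk p = {p:.3f}")

# Equal-variance checks: Bartlett's test, then the more robust Levene test.
print("Bartlett p =", round(stats.bartlett(*groups).pvalue, 3))
print("Levene p   =", round(stats.levene(*groups).pvalue, 3))
```

Large p-values on the variance tests mean there is no evidence against the equal-variance assumption, so a classical ANOVA (rather than Welch's) is reasonable.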
<p>With confirmation that my data met the assumptions, I proceeded to perform the ANOVA and create box-and-whisker graphs.</p>
ANOVA Results
<p>Here's the Minitab output for the ANOVA:</p>
<p style="margin-left: 40px;"><img alt="one-way anova output" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/5817e0a9b2d961942f7101bc8eb2eced/one_way_anova.gif" style="width: 400px; height: 133px;" /></p>
<p>The ANOVA revealed that there were indeed statistically significant differences (p = 0.009) among my four batches of hotcakes.</p>
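A one-way ANOVA like this can also be sketched with SciPy's `f_oneway`. The thickness values below are invented for illustration, so the F and p values will not match the Minitab output above.

```python
from scipy.stats import f_oneway

# Invented thickness data (mm), six pancakes per milk type.
whole   = [19.5, 20.1, 18.9, 20.4, 19.8, 20.0]
light   = [17.2, 16.8, 17.5, 16.9, 17.1, 17.4]
lowfat  = [17.0, 17.3, 16.6, 17.2, 16.9, 17.1]
lowprot = [14.8, 15.2, 14.5, 15.0, 14.7, 15.1]

# One-way ANOVA: do the four milk types share a common mean thickness?
f_stat, p_value = f_oneway(whole, light, lowfat, lowprot)
print(f"F = {f_stat:.1f}, p = {p_value:.4f}")
```

A p-value below the 0.05 significance level says at least one milk type differs, which is what motivates the multiple comparisons that follow.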
<p>Minitab’s output also included grouping information using Tukey’s method of multiple comparisons for 95% confidence intervals:</p>
<p style="margin-left: 40px;"><img alt="Tukey Method" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c9194c1dda604ad87e4e7985ec8261c1/tukey_method.gif" style="width: 400px; height: 151px;" /></p>
<p>The Tukey analysis shows that the low-fat milk and light milk batches do not differ significantly in fluffiness. However, the batches made with whole milk and with low-protein milk each differed significantly from the other groups.</p>
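Recent versions of SciPy offer comparable Tukey pairwise comparisons through `tukey_hsd`. Using the same invented thickness data as above (not my real measurements):

```python
from scipy.stats import tukey_hsd

# Invented thickness data (mm) by milk type, six pancakes each.
whole   = [19.5, 20.1, 18.9, 20.4, 19.8, 20.0]
light   = [17.2, 16.8, 17.5, 16.9, 17.1, 17.4]
lowfat  = [17.0, 17.3, 16.6, 17.2, 16.9, 17.1]
lowprot = [14.8, 15.2, 14.5, 15.0, 14.7, 15.1]

# Tukey's HSD on all pairwise comparisons; result.pvalue[i][j] is the
# adjusted p-value for groups i and j (0=whole, 1=light, 2=lowfat,
# 3=lowprot).
result = tukey_hsd(whole, light, lowfat, lowprot)
print(result)
```

In this invented data, light versus low-fat comes out non-significant while whole versus low-protein is strongly significant, mirroring the grouping pattern in the Minitab output.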
<p>The box-and-whisker diagram makes the results of the analysis easier to visualize:</p>
<p><img alt="Boxplot of thickness" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/8ca740917c33fddd8953433d67488ac8/boxplot_of_thickness.gif" style="width: 500px; height: 338px;" /></p>
<p>It is clear from the graph that hotcakes produced with whole milk had the most fluffiness, and those made with low protein milk had the least fluffiness. There was not a big difference between the fluffiness of hotcakes made with light milk and lowfat milk.</p>
Which Milk Should You Use for Fluffy Pancakes?
<p>Based on this analysis, I recommend using whole milk for fluffier hotcakes. If you want to avoid fats and sugars in milk, low fat milk is a good choice.</p>
<p>I always use lowfat milk, but the analysis indicates that light milk offers a good alternative for people following a strict no-fat diet.</p>
<p>It’s important to note that for this analysis, I only compared formulations that used the same brand of pancake mix and the same amounts of salt and butter. But there are other factors to consider! My next pancake experiment will use design of experiments (DOE) to compare milk types, different brands of flour, and margarine with and without salt, to see how all of these factors together affect the fluffiness of pancakes.</p>
<p> </p>
<p><strong>About the Guest Blogger:</strong></p>
<p><em>Iván Alfonso is a biochemical engineer and statistics professor at the Autonomous University of Campeche, Mexico. Alfonso holds a master's degree in marine chemistry and has worked extensively in data analysis and design of experiments in basic and advanced sciences like chemistry and epidemiology.</em></p>
<p> </p>
<p><strong>Would you like to publish a guest post on the Minitab Blog? Contact <a href="mailto:publicrelations@minitab.com?subject=Guest%20Blogger">publicrelations@minitab.com</a>.</strong></p>
<p> </p>
Data AnalysisFun StatisticsHypothesis TestingStatisticsTue, 05 Aug 2014 12:00:00 +0000http://blog.minitab.com/blog/statistics-in-the-field/a-fun-anova3a-does-milk-affect-the-fluffiness-of-pancakesGuest BloggerDo the Data Really Say Female-Named Hurricanes Are More Deadly?
http://blog.minitab.com/blog/the-statistics-game/do-the-data-really-say-female-named-hurricanes-are-more-deadly
<p><img alt="Hurricane" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/61165559035556ba8f784164d74a7f96/hurricane_w640.jpeg" style="float: right; width: 250px; height: 188px; border-width: 1px; border-style: solid; margin: 10px 15px;" />A recent study has indicated that <a href="http://www.washingtonpost.com/blogs/capital-weather-gang/wp/2014/06/02/female-named-hurricanes-kill-more-than-male-because-people-dont-respect-them-study-finds/" target="_blank">female-named hurricanes kill more people than male hurricanes</a>. Of course, the title of that article (and other articles like it) is a bit misleading. The study found a significant <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/anova-models/what-is-an-interaction/">interaction</a> between the damage caused by the storm and the perceived masculinity or femininity of the hurricane names. So don’t be confused by stories that suggest all female-named hurricanes are deadlier than male-named hurricanes. The study actually found no effect of masculinity/femininity for less severe storms. It was the more severe storms where the gender of the name had a significant relationship with the number of deaths.</p>
<p>The study looked at every hurricane since 1950, with the exception of Katrina and Audrey (those two are outliers that would skew the results). Many critics of the study believe that it is biased, since almost all of the 38 hurricanes before 1979 had female names (there were two male names in the early 50s). It’s possible that our ability to forecast hurricanes has vastly improved since the 50s and 60s. So, these critics say, the difference is simply because more people died in hurricanes back when they all had a female name.</p>
<p>Let’s perform a data analysis to see if that is true. We will use pre- and post-1979 to distinguish between the predominantly female-name hurricane era and the era of mixed hurricane names. I’ll use the exact same data set that was used in the study, which you can get <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/ad7c966669da36643b8060c74038e6d6/hurricane.MTW">here</a>.</p>
Hurricanes Before and After 1979
<p>For the 92 hurricanes in the study, the number of deaths and the normalized damage was recorded. The study showed that these two variables are highly correlated, so it’s important to consider both factors. If we find there were more deaths in hurricanes before 1979, we need to make sure the reason isn’t simply because those hurricanes caused more damage (implying they were bigger storms).</p>
<p>We can start by using a scatterplot to plot the two variables against each other, using whether the hurricane came before or after 1979 as a grouping variable. Hurricanes that occurred <em>during </em>1979 were put in the After group.</p>
<p><img alt="Scatterplot" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/72ef8a172f250267d3b03cccd6ff8399/scatterplot_of_deaths_vs_normalized_damage_w640.jpeg" style="width: 640px; height: 427px;" /></p>
<p>We see that the two deadliest hurricanes (Camille and Diane) both occurred before 1979. If you look below them, you’ll see that many hurricanes in both eras have caused the same amount of damage, yet resulted in far fewer deaths.</p>
<p>Meanwhile, the two most damaging hurricanes (Sandy and Andrew) both occurred <em>after </em>1979. These hurricanes caused more than three times the damage of Camille and Diane, yet resulted in fewer deaths. This gives some credibility to the idea that our improvement in being able to predict hurricanes has resulted in fewer deaths. However, Hurricane Donna supports the opposite idea: five post-1979 hurricanes resulted in more deaths than Donna, despite causing significantly less damage. It’s hard to draw conclusions from the scatterplot.</p>
<p>Of course, the hurricanes labeled in the plot above are pretty rare. Most of the 92 hurricanes had normalized damage less than $30 billion and fewer than 100 deaths. The descriptive statistics below show just how much of an impact those big storms can have on an analysis.</p>
<p><img alt="Describe" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/ac70541e09a25b227de847363d10e9c0/describe_deaths_ndam_by_year_group.jpg" style="width: 503px; height: 177px;" /></p>
<p>If we look at the mean, everything becomes clear! On average, hurricanes before 1979 had 11 more deaths despite causing half a billion <em>fewer</em> dollars in damages. But when we look at the median, which isn’t sensitive to extreme data values, the values are almost the same. </p>
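The pattern of an outlier-inflated mean alongside a stable median is easy to reproduce. The death counts below are invented to mimic a skewed distribution with a couple of extreme storms; they are not the study's data:

```python
from statistics import mean, median

# Hypothetical per-storm death counts: mostly small values plus two
# extreme storms, mimicking the skew in the hurricane data.
deaths = [5, 12, 3, 21, 9, 15, 2, 7, 256, 200]

print(f"mean:   {mean(deaths):.1f}")    # 53.0 -- pulled up by the two big storms
print(f"median: {median(deaths):.1f}")  # 10.5 -- barely affected by them
```

Here the two extreme values drag the mean up to 53.0 while the median stays at 10.5, which is exactly why the mean and median tell such different stories above.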
<p>Part of the problem is that so many smaller storms are included. The study already concluded that the name doesn’t matter for smaller storms. So let’s just focus on the big storms. The median normalized damage for all 92 storms is $1.65 billion. I took only the storms that have caused at least that much damage (there were 47 of them) and looked at the descriptive statistics again.</p>
<p><img alt="Describe" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/06fc8707283704922858ce000d05fde2/describe_deaths_ndam_by_year_group_big_storm.jpg" style="width: 500px; height: 175px;" /></p>
<p>Once again, the mean and median paint different pictures. The mean shows that a much higher number of deaths occurred in big storms before 1979, even though those storms caused the same amount of damage. However, this is because hurricanes Camille, Diane, and Agnes are heavily influencing the mean for deaths before 1979, pulling it up much higher than the After-1979 group. And hurricanes Sandy and Andrew influence the mean for normalized damage after 1979, pulling it up to equal the damage before 1979.</p>
<p>With data this skewed, the medians are a more accurate representation of the middle of the data. The median for deaths shows that there were slightly more deaths in big storms prior to 1979. However, those storms also caused more damage, implying <em>that </em>could be the reason for the larger number of deaths.</p>
<p>And even if we ignore the fact that the hurricanes before 1979 caused more damage, a <a href="http://blog.minitab.com/blog/statistics-for-lean-six-sigma/the-non-parametric-economy-what-does-average-actually-mean">Mann-Whitney test</a> (which compares 2 medians, as opposed to a 2-sample t test which compares 2 means) shows that the difference in deaths is not statistically significant.</p>
<p><img alt="Mann-Whitney" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/a8f1ef8922a9238ba0414caef236a05d/mann_whitney_w640.jpeg" style="width: 640px; height: 230px;" /></p>
<p>The p-value is 0.1393, which is greater than 0.05. There isn’t enough evidence to conclude that hurricanes caused more deaths before 1979.</p>
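For readers who want to try a Mann-Whitney test outside Minitab, here is a sketch in Python using SciPy. The death counts below are made up for illustration; the actual data set used in the study is linked earlier in the post.

```python
from scipy.stats import mannwhitneyu

# Illustrative death counts for big storms before and after 1979 --
# invented values with overlapping distributions, not the study's data.
before_1979 = [75, 50, 15, 200, 46, 60, 3, 20]
after_1979  = [62, 10, 21, 5, 84, 40, 12, 9]

# Two-sided Mann-Whitney test: do the two groups differ in location?
result = mannwhitneyu(before_1979, after_1979, alternative="two-sided")
print(f"U = {result.statistic}, p = {result.pvalue:.4f}")
```

As in the post, a p-value above 0.05 would mean there isn't enough evidence to conclude the two eras differ.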
Can We Really Conclude that Female-Named Hurricanes Cause More Deaths?
<p>The lack of conclusive evidence from our data analysis certainly keeps the idea that hurricanes with female names cause more deaths plausible. But there are other issues to consider. For example, the gender of the hurricane name was not treated as a binary variable, which would group each hurricane as either male or female. Instead, nine independent coders rated the masculinity vs. femininity of historical hurricane names on two items (1 = very masculine, 11 = very feminine, and 1 = very man-like, 11 = very woman-like), which were averaged to compute a masculinity-femininity index (MFI).</p>
<p>Do these 9 coders represent how most Americans would rate the femininity of names? Would you rate Barbara as more feminine than Carol or Betsy? The coders did, giving Barbara a 9.8 while Carol and Betsy were 8.1 and 8.3 respectively. And the MFI is important, since it was found to be the gender variable that had a significant interaction with normalized damage. When gender name was treated as a binary variable, there was no interaction.</p>
<p>But masculinity-femininity index aside, the study did have some very interesting findings. I’m sure additional research will be done in the years to come to see if the findings hold true. Let's hope that then we’ll be able to know for sure whether people underestimate female-named hurricanes or not.</p>
<p>Until then, if a hurricane is bearing down on your neighborhood, I would make sure to board up the windows and buy out the supermarket's bread and milk, regardless of the storm's name.</p>
Hypothesis Testing, Statistics, Statistics in the News
Fri, 06 Jun 2014 13:17:00 +0000
http://blog.minitab.com/blog/the-statistics-game/do-the-data-really-say-female-named-hurricanes-are-more-deadly
Kevin Rudy

Hypothesis Testing and P Values
http://blog.minitab.com/blog/statistics-in-the-field/hypothesis-testing-and-p-values
<p><em>by Matthew Barsalou, guest blogger</em></p>
<p>Programs such as the <a href="http://www.minitab.com/products/minitab/">Minitab Statistical Software</a> make hypothesis testing easier; but no program can think for the experimenter. Anybody performing a statistical hypothesis test must understand what p values mean in regard to their statistical results, as well as the potential limitations of statistical hypothesis testing.</p>
<p>A p value of 0.05 is frequently used during statistical hypothesis testing. This p value indicates that if there is no effect (or if the null hypothesis is true), you’d obtain the observed difference or more in 5% of studies due to random sampling error. However, <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values">performing multiple hypothesis tests, each at the 0.05 significance level, increases the chance of a false positive</a>.</p>
<p>This is well illustrated by the online comic <a href="http://xkcd.com/882/">XKCD</a>, which depicted somebody stating that jelly beans cause acne.</p>
<p><img alt="Significant" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/08b29e9eec884bee99602335f1f9c893/xkcd.png" style="border-width: 0px; border-style: solid; width: 310px; height: 859px;" /></p>
<p>Scientists investigated and found no link, so the person claimed that only a certain color of jelly bean caused acne. The scientists then tested 20 different colors of jelly beans, each at the 0.05 significance level. Only the green jelly bean had a p value less than 0.05.</p>
<p>The comic ends with a newspaper reporting a link between green jelly beans and acne. The newspaper points out there is 95% confidence with only a 5% chance of coincidence. What is wrong with the conclusion?</p>
<p>We can determine the chance that there will be no false conclusions by using the binomial formula.</p>
<p><img alt="binomial formula" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/b962df0ea487d69594aea4975ae69225/equation1.gif" style="width: 500px; height: 87px;" /></p>
<p>This means that we have a 35.8% chance of performing 20 hypothesis tests without getting a false positive when using an alpha level of 0.05. We can also calculate the probability that at least one result is a false positive due to random chance, which statisticians call the <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/multiple-comparisons-beware-of-individual-errors-that-multiply">family error rate</a>.</p>
<p><img src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/6a80807434e2c2678163dbcc710d13a0/equation2.gif" style="width: 345px; height: 73px;" /></p>
<p>The chance that at least one result will be a false positive when performing 20 hypothesis tests using an alpha level of 0.05 is 64.2%.</p>
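Both probabilities follow directly from the binomial formula shown above; here is the arithmetic in a few lines of Python:

```python
alpha = 0.05
n_tests = 20

# P(no false positives in 20 independent tests at alpha = 0.05)
p_all_correct = (1 - alpha) ** n_tests

# P(at least one false positive) -- the family error rate
p_at_least_one = 1 - p_all_correct

print(f"P(no false positives):          {p_all_correct:.3f}")   # 0.358
print(f"P(at least one false positive): {p_at_least_one:.3f}")  # 0.642
```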
<p>So the press release in the XKCD comic may have been a bit premature.</p>
<p>Suppose I had 14 samples with a mean of 87.2 and I wanted to know whether the true mean could actually be 85.2. I performed a one-sample t-test in Minitab by going to <strong>Stat > Basic Statistics > 1-Sample t...</strong> and entering the summarized data. I checked the “Perform hypothesis test” box, then selected “Options...” and used the default confidence level of 95.0, which corresponds to an alpha of 0.05.</p>
<p style="margin-left: 40px;"><img alt="One-Sample T test output" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/55e90b93ae38e8612ce3adb4ea0c4f00/output1.png" style="border-width: 0px; border-style: solid; width: 425px; height: 130px;" /></p>
<p>I performed the test and the resulting p value was 0.049, which is close to, but still below, 0.05, so I can reject my null hypothesis. But keep the XKCD example in mind: if I performed tests like this repeatedly, the 5% error rate would compound, and sooner or later I would reject a null hypothesis that was actually true.</p>
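The same one-sample t-test can be reproduced from summary statistics. One caveat: the post doesn't report the sample standard deviation, so the value below was back-calculated to give a p value near 0.049 and should be treated purely as an assumption:

```python
from math import sqrt
from scipy.stats import t

n, xbar, mu0 = 14, 87.2, 85.2
s = 3.46  # assumed sample standard deviation -- not reported in the post

# One-sample t-test from summary statistics
t_stat = (xbar - mu0) / (s / sqrt(n))   # test statistic with n - 1 = 13 df
p_value = 2 * t.sf(abs(t_stat), n - 1)  # two-sided p value

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

With the assumed standard deviation, the p value comes out just under 0.05, matching the borderline result described above.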
<p>There are alternatives to statistical hypothesis testing; for example, Bayesian inference could be used in place of hypothesis testing with p values. But alternative methods have their own weaknesses, and they may be difficult for non-statisticians to use.</p>
<p>Instead of avoiding the use of hypothesis testing, we should account for its limitations. For example, by realizing that each repeat of the test increases the chance of a false positive, as illustrated by XKCD's jelly bean example.</p>
<p>We can’t simply retest over and over using the same p value and then conclude that we have statistically significant results. For situations such as the XKCD example, Simmons, Nelson and Simonsohn recommend disclosing the total number of tests that were <a href="http://people.psych.cornell.edu/~jec7/pcd%20pubs/simmonsetal11.pdf">performed</a>. Had we known that 20 tests had been performed at the 0.05 significance level, we would have realized that we may not need to avoid green jelly beans after all.</p>
<p> </p>
<div><strong>About the Guest Blogger: </strong></div>
<div><em>Matthew Barsalou is an engineering quality expert in BorgWarner Turbo Systems Engineering GmbH’s Global Engineering Excellence department. He has previously worked as a quality manager at an automotive component supplier and as a contract quality engineer at Ford in Germany and Belgium. He possesses a bachelor of science in industrial sciences, a master of liberal studies and a master of science in business administration and engineering from the Wilhelm Büchner Hochschule in Darmstadt, Germany.</em></div>
<div> </div>
<p>xkcd.com comic from <a href="http://xkcd.com/882/">http://xkcd.com/882/</a> used under Creative Commons Attribution- NonCommercial 2.5 License. <a href="http://xkcd.com/license.html">http://xkcd.com/license.html</a></p>
<p> </p>
Fun Statistics, Hypothesis Testing
Mon, 02 Jun 2014 12:00:00 +0000
http://blog.minitab.com/blog/statistics-in-the-field/hypothesis-testing-and-p-values
Guest Blogger

Five Guidelines for Using P values
http://blog.minitab.com/blog/adventures-in-statistics/five-guidelines-for-using-p-values
<p>There is high pressure to find low P values. Obtaining a low P value for a hypothesis test is make or break because it can lead to funding, articles, and prestige. Statistical significance is everything!</p>
<p>My two previous posts looked at several issues related to P values:</p>
<ul>
<li><a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" target="_blank">P values have a higher than expected false positive rate.</a></li>
<li><a href="http://blog.minitab.com/blog/adventures-in-statistics/not-all-p-values-are-created-equal" target="_blank">The same P value from different studies can correspond to different false positive rates.</a></li>
</ul>
<p>In this post, I’ll look at whether P values are still helpful and provide guidelines on how to use them with these issues in mind.</p>
<div style="float: right; width: 200px; margin: 25px 25px;">
<p><img alt="Ronald Fisher" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/f7eb953015180df73edfa6f073f234c6/r__a__fisher.jpg" style="float: right; width: 200px; height: 243px; border-width: 1px; border-style: solid;" /> <em>Sir Ronald A Fisher</em></p>
</div>
Are P Values Still Valuable?
<p>Given the issues about P values, are they still helpful? A higher than expected rate of false positives can be a problem because if you implement the “findings” from a false positive study, you won’t get the expected benefits.</p>
<p>In my view, P values are a great tool. Ronald Fisher introduced P values in the 1920s because he wanted an objective method for comparing data to the null hypothesis, rather than the informal eyeball approach: "My data <em>look </em>different than the null hypothesis."</p>
<p>P value calculations incorporate the effect size, sample size, and variability of the data into a single number that objectively tells you how consistent your data are with the null hypothesis. Pretty nifty!</p>
<p>Unfortunately, the high pressure to find low P values, combined with a common misunderstanding of <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" target="_blank">how to correctly interpret P values</a>, has distorted the interpretation of significant results. However, these issues can be resolved.</p>
<p>So, let’s get to the guidelines! Their overall theme is that you should evaluate P values as part of a larger context where other factors matter.</p>
Guideline 1: The Exact P Value Matters
<div style="float: right; width: 90px; margin: 25px 25px;">
<p style="line-height: 11px; text-align: center;"><img alt="Small wooden P" height="75px" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/c408562ea4a40eedae9ae78c1d3ca027/p_wooden.jpg" width="75px" /><br />
<em>Tiny Ps are<br />
great!</em></p>
</div>
<p>With the high pressure to find low P values, there’s a tendency to view studies as either significant or not. Did a study produce a P value less than 0.05? If so, it’s golden! However, there is no magic significance level that distinguishes between the studies that have a true effect and those that don’t with 100% accuracy. Instead, it’s all about lowering the error rate to an acceptable level.</p>
<p>The lower the P value, the lower the error rate. For example, a P value near 0.05 has an error rate of 25-50%. However, a P value of 0.0027 corresponds to an error rate of at least 4.5%, which is close to the rate that many mistakenly attribute to a P value of 0.05.</p>
<p>A lower P value thus suggests stronger evidence for rejecting the null hypothesis. A P value near 0.05 simply indicates that the result is worth another look, but it’s nothing you can hang your hat on by itself. It’s not until you get down near 0.001 that you have a fairly low chance of a false positive.</p>
Guideline 2: Replication Matters
<p>Today, P values are everything. However, Fisher intended P values to be just one part of a process that incorporates experimentation, statistical analysis and replication to lead to scientific conclusions.</p>
<p>According to Fisher, “A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.”</p>
<p>The false positive rates associated with P values that we saw in my last post definitely support this view. A single study, especially if the P value is near 0.05, is unlikely to reduce the false positive rate to an acceptable level. Repeated experimentation may be required to arrive at a point where the error rate is low enough to meet your objectives.</p>
<p>For example, if you have two independent studies that each produced a P value of 0.05, you can multiply the P values to obtain a probability of 0.0025 for both studies. However, you must include both the significant and insignificant studies in a series of similar studies, and not cherry pick only the significant studies.</p>
<p><img alt="Replicate study results" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/d1f27fc3889672c11ac23b1ffa9bfac9/p_rep.gif" style="width: 403px; height: 136px;" /></p>
<p>Conclusively proving a hypothesis with a single study is unlikely. So, don’t expect it!</p>
Guideline 3: The Effect Size Matters
<p>With all the focus on P values, attention to the effect size can be lost. Just because an effect is statistically significant doesn't necessarily make it meaningful in the real world. Nor does a P value indicate the precision of the estimated effect size.</p>
<p>If you want to move from just detecting an effect to assessing its magnitude and precision, use <a href="http://blog.minitab.com/blog/adventures-in-statistics/when-should-i-use-confidence-intervals-prediction-intervals-and-tolerance-intervals" target="_blank">confidence intervals</a>. In this context, a confidence interval is a range of values that is likely to contain the effect size.</p>
<p>For example, an AIDS vaccine <a href="http://news.sciencemag.org/health/2009/09/massive-aids-vaccine-study-modest-success" target="_blank">study</a> in Thailand obtained a P value of 0.039. Great! This was the first time that an AIDS vaccine had positive results. However, the confidence interval for effectiveness ranged from 1% to 52%. That’s not so impressive...the vaccine may work virtually none of the time up to half the time. The effectiveness is both low and imprecisely estimated.</p>
<p>Avoid thinking about studies only in terms of whether they are significant or not. Ask yourself: is the effect size precisely estimated, and is it large enough to be important?</p>
Guideline 4: The Alternative Hypothesis Matters
<p>We tend to think of equivalent P values from different studies as providing the same support for the alternative hypothesis. However, <a href="http://blog.minitab.com/blog/adventures-in-statistics/not-all-p-values-are-created-equal" target="_blank">not all P values are created equal</a>.</p>
<p>Research shows that the plausibility of the alternative hypothesis greatly affects the false positive rate. For example, a highly plausible alternative hypothesis and a P value of 0.05 are associated with an error rate of at least 12%, while an implausible alternative is associated with a rate of at least 76%!</p>
<p>For example, given the track record for AIDS vaccines where the alternative hypothesis has never been true in previous studies, it's highly unlikely to be true at the outset of the Thai study. This situation tends to produce high false positive rates—often around 75%!</p>
<p>When you hear about a surprising new study that finds an unprecedented result, don’t fall for that first significant P value. Wait until the study has been well replicated before buying into the results!</p>
Guideline 5: Subject Area Knowledge Matters
<p>Applying subject area expertise to all aspects of hypothesis testing is crucial. Researchers need to apply their scientific judgment about the plausibility of the hypotheses, results of similar studies, proposed mechanisms, proper experimental design, and so on. Expert knowledge transforms statistics from numbers into meaningful, trustworthy findings.</p>
Hypothesis Testing, Statistics, Statistics Help
Thu, 15 May 2014 11:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/five-guidelines-for-using-p-values
Jim Frost

Not All P Values are Created Equal
http://blog.minitab.com/blog/adventures-in-statistics/not-all-p-values-are-created-equal
<p><img alt="Fancy P" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/2762a55291d134b8185ba9da47ea6f83/p_fancy.gif" style="float: right; width: 150px; height: 194px; margin: 10px 15px;" />The interpretation of P values would seem to be fairly standard between different studies. Even if two hypothesis tests study different subject matter, we tend to assume that you can interpret a P value of 0.03 the same way for both tests. A P value is a P value, right?</p>
<p>Not so fast! While Minitab <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">statistical software</a> can correctly calculate all P values, it can’t factor in the larger context of the study. You and your common sense need to do that!</p>
<p>In this post, I’ll demonstrate that P values tell us very different things depending on the larger context.</p>
Recap: P Values Are Not the Probability of Making a Mistake
<p>In my previous post, I showed the <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" target="_blank">correct way to interpret P values</a>. Keep in mind the big caution: P values are<em> not</em> the error rate, or the likelihood of making a mistake by rejecting a true null hypothesis (Type I error).</p>
<p>You can equate this error rate to the false positive rate for a hypothesis test. A false positive happens when the sample is unusual due to chance alone and it produces a low P value. However, despite the low P value, the alternative hypothesis is not true. There is no effect at the population level.</p>
<p>Sellke <em>et al</em>. estimated that a P value of 0.05 corresponds to a false positive rate of “at least 23% (and typically close to 50%).”</p>
What Affects the Error Rate?
<p>Why is there a range of values for the error rate? To understand that, you need to understand the factors involved. David Colquhoun, a professor in biostatistics, lays them out <a href="http://www.dcscience.net/?p=6518" target="_blank">here</a>.</p>
<p>Whereas Sellke<em> et al.</em> use a Bayesian approach, Colquhoun uses a non-Bayesian approach but derives similar estimates. For example, Colquhoun estimates P values between 0.045 and 0.05 have a false positive rate of at least 26%.</p>
<p>The factors that affect the false positive rate are:</p>
<ul>
<li>Prevalence of real effects (higher is good)</li>
<li>Power (higher is good)</li>
<li>P value (lower is good)</li>
</ul>
<p>“Good” means that the test is less likely to produce a false positive. The 26% error rate assumes a prevalence of real effects of 0.5 and a power of 0.8. If you decrease the prevalence to 0.1, suddenly the false positive rate shoots up to 76%. Yikes!</p>
<p>Power is related to false positives because when a study has a lower probability of detecting a true effect, a higher proportion of the positives will be false positives.</p>
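One way to see how prevalence and power drive the false positive rate is the standard screening calculation: of all the tests that come out significant, what fraction come from true nulls? Note that this simple version counts every result with p ≤ α, so it produces lower rates than Colquhoun's stricter calculation (which conditions on p falling just below 0.05), but the effect of each factor points the same way:

```python
def false_discovery_rate(prevalence, power, alpha):
    """Fraction of significant results that are false positives,
    assuming every test uses significance level alpha."""
    false_pos = (1 - prevalence) * alpha  # true nulls reaching p <= alpha
    true_pos = prevalence * power         # real effects that are detected
    return false_pos / (false_pos + true_pos)

# Higher prevalence of real effects -> fewer false discoveries
print(f"{false_discovery_rate(prevalence=0.5, power=0.8, alpha=0.05):.3f}")  # 0.059
print(f"{false_discovery_rate(prevalence=0.1, power=0.8, alpha=0.05):.3f}")  # 0.360
```

Dropping the prevalence from 0.5 to 0.1 multiplies the false discovery rate several times over, which is the same direction of effect described in the post.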
<p>Now, let’s dig into a very interesting factor: the prevalence of real effects. As we saw, this factor can hugely impact the error rate!</p>
P Values and the Prevalence of Real Effects
<p><img alt="Joke: I once asked a statistician out. She failed to reject me!" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/f9d1ea1b51185c0631ae8fadb0145f8f/fail_reject_joke.gif" style="float: right; width: 275px; height: 313px; margin: 10px 15px;" />What Colquhoun calls the prevalence of real effects (denoted as P(real)), the Bayesian approach calls the prior probability. It is the proportion of hypothesis tests in which the alternative hypothesis is true at the outset. It can be thought of as the long-term probability, or track record, of similar types of studies. It’s the plausibility of the alternative hypothesis.</p>
<p>If the alternative hypothesis is farfetched, or has a poor track record, P(real) is low. For example, a prevalence of 0.1 indicates that 10% of similar alternative hypotheses have turned out to be true while 90% of the time the null was true. Perhaps the alternative hypothesis is unusual, untested, or otherwise implausible.</p>
<p>If the alternative hypothesis fits current theory, has an identified mechanism for the effect, and previous studies have already shown significant results, P(real) is higher. For example, a prevalence of 0.90 indicates that the alternative is true 90% of the time, and the null only 10% of the time.</p>
<p>If the prevalence is 0.5, there is a 50/50 chance that either the null or alternative hypothesis is true at the outset of the study.</p>
<p>You may not always know this probability, but theory and a previous track record can be guides. For our purposes, we’ll use this principle to see how it impacts our interpretation of P values. Specifically, we’ll focus on the probability of the null being true (1 – P(real)) at the beginning of the study.</p>
Hypothesis Tests Are Journeys from the Prior Probability to Posterior Probability
<p><a href="http://blog.minitab.com/blog/understanding-statistics/what-statistical-hypothesis-test-should-i-use" target="_blank">Hypothesis tests</a> begin with differing probabilities that the null hypothesis is true depending on the specific hypotheses being tested. This prior probability influences the probability that the null is true at the conclusion of the test, the posterior probability.</p>
<p>If P(real) = 0.9, there is only a 10% chance that the null hypothesis is true at the outset. Consequently, the probability of rejecting a true null at the conclusion of the test must be less than 10%. However, if you start with a 90% chance of the null being true, the odds of rejecting a true null increase, because there are more true nulls to reject.</p>
<table style="margin-left: auto; margin-right: auto; text-align: center;">
<tr>
<th>Initial Probability of<br />true null (1 – P(real))</th>
<th>P value obtained</th>
<th>Final Minimum Probability<br />of true null</th>
</tr>
<tr><td>0.5</td><td>0.05</td><td>0.289</td></tr>
<tr><td>0.5</td><td>0.01</td><td>0.110</td></tr>
<tr><td>0.5</td><td>0.001</td><td>0.018</td></tr>
<tr><td>0.33</td><td>0.05</td><td>0.12</td></tr>
<tr><td>0.9</td><td>0.05</td><td>0.76</td></tr>
</table>
<p>The table is based on calculations by Colquhoun and Sellke <em>et al.</em> It shows that the decrease from the initial probability to the final probability of a true null depends on the P value. Power is also a factor but not shown in the table.</p>
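The 0.5-prior rows of the table can be closely reproduced with the Sellke et al. lower bound on the Bayes factor, B(p) = -e·p·ln(p), valid for p < 1/e. (The other rows also draw on Colquhoun's calculations, so they differ slightly.) A minimal sketch:

```python
from math import e, log

def min_prob_true_null(prior_null, p_value):
    """Minimum posterior probability that the null is true, using the
    Sellke-Bayarri-Berger bound B(p) = -e * p * ln(p), for p < 1/e."""
    bayes_factor = -e * p_value * log(p_value)  # bound on BF(null vs. alt)
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)

# Closely reproduces the first and third rows of the table above:
print(round(min_prob_true_null(0.5, 0.05), 3))   # 0.289
print(round(min_prob_true_null(0.5, 0.001), 3))  # 0.018
```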
Where Do We Go with P values from Here?
<p><img alt="wooden block P" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/c408562ea4a40eedae9ae78c1d3ca027/p_wooden.jpg" style="float: right; width: 150px; height: 150px;" />There are many combinations of conditions that affect the probability of rejecting a true null. However, don't try to remember every combination and the error rate, especially because you may only have a vague sense of the true P(real) value!</p>
<p>Just remember two big takeaways:</p>
<ol>
<li>A single statistically significant hypothesis test often provides insufficient evidence to confidently discard the null hypothesis. This is particularly true when the P value is closer to 0.05.</li>
<li>P values from different hypothesis tests can have the same value, but correspond to very different false positive rates. You need to understand their context to be able to interpret them correctly.</li>
</ol>
<p>The second point is epitomized by a quote that was popularized by Carl Sagan: “Extraordinary claims require extraordinary evidence.”</p>
<p>A surprising new study may have a significant P value, but you shouldn't trust the alternative hypothesis until the results are replicated by additional studies. As shown in the table, a significant but unusual alternative hypothesis can have an error rate of 76%!</p>
<p>Don’t fret! There are simple recommendations based on the principles above that can help you navigate P values and use them correctly. I’ll cover <a href="http://blog.minitab.com/blog/adventures-in-statistics/five-guidelines-for-using-p-values">five guidelines for using P values</a> in my next post.</p>
Hypothesis Testing | Thu, 01 May 2014 11:00:00 +0000 | http://blog.minitab.com/blog/adventures-in-statistics/not-all-p-values-are-created-equal | Jim Frost

How to Correctly Interpret P Values
http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values
<p><img alt="P value" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/d95f756ee6f6a4cec607017c8edea52a/bigp.gif" style="margin: 4px; float: right; width: 110px; height: 125px;" />The P value is used all over statistics, from <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/t-for-2-should-i-use-a-paired-t-or-a-2-sample-t" target="_blank">t-tests</a> to <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients" target="_blank">regression analysis</a>. Everyone knows that you use P values to determine statistical significance in a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/basics/what-is-a-hypothesis-test/" target="_blank">hypothesis test</a>. In fact, P values often determine what studies get published and what projects get funding.</p>
<p>Despite being so important, the P value is a slippery concept that people often interpret incorrectly. How <em>do</em> you interpret P values?</p>
<p>In this post, I'll help you to understand P values in a more intuitive way and to avoid a very common misinterpretation that can cost you money and credibility.</p>
What Is the Null Hypothesis in Hypothesis Testing?
<p><img alt="Scientist performing an experiment" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/3407070c72311249854712c526aceb59/scientist_w640.jpeg" style="margin: 10px 15px; float: right; width: 300px; height: 200px; border-width: 1px; border-style: solid;" />In order to understand P values, you must first understand the null hypothesis.</p>
<p>In every experiment, there is an effect or difference between groups that the researchers are testing. It could be the effectiveness of a new drug, building material, or other intervention that has benefits. Unfortunately for the researchers, there is always the possibility that there is no effect, that is, that there is no difference between the groups. This lack of a difference is called the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/basics/null-and-alternative-hypotheses/" target="_blank">null hypothesis</a>, which is essentially the position a devil’s advocate would take when evaluating the results of an experiment.</p>
<p>To see why, let’s imagine an experiment for a drug that we know is totally ineffective. The null hypothesis is true: there is no difference between the experimental groups at the population level.</p>
<p>Despite the null being true, it’s entirely possible that there will be an effect in the sample data due to random sampling error. In fact, it is extremely unlikely that the sample groups will ever exactly equal the null hypothesis value. Consequently, the devil’s advocate position is that the observed difference in the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/basic-concepts/sample-and-population/" target="_blank">sample</a> does not reflect a true difference between <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/basic-concepts/sample-and-population/" target="_blank">populations</a>.</p>
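<p>The sampling-error point is easy to demonstrate. In the hypothetical sketch below (NumPy, with made-up population values), both groups are drawn from the very same population, yet the observed difference between the sample means is not zero:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Two groups drawn from the SAME population: the null hypothesis is true.
group_a = rng.normal(loc=100, scale=15, size=30)
group_b = rng.normal(loc=100, scale=15, size=30)

# Random sampling error alone produces a nonzero observed difference.
observed_diff = group_a.mean() - group_b.mean()
print(f"Observed difference with no true effect: {observed_diff:.2f}")
```

<p>Rerunning with different seeds gives a different nonzero difference every time, which is exactly the devil's advocate position: an observed difference, by itself, does not prove a population-level effect.</p>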
What Are P Values?
<p><img alt="Joke" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/81242ed4497d1961eb264c3d7c65cc66/null_joke.gif" style="margin: 4px; float: right; width: 250px; height: 206px;" />P values evaluate how well the sample data support the devil’s advocate argument that the null hypothesis is true. They measure how compatible your data are with the null hypothesis: how likely is the effect observed in your sample data if the null hypothesis is true?</p>
<ul>
<li>High P values: your data are likely with a true null.</li>
<li>Low P values: your data are unlikely with a true null.</li>
</ul>
<p>A low P value suggests that your sample provides enough evidence that you can reject the null hypothesis for the entire population.</p>
How Do You Interpret P Values?
<p><img alt="Vaccine" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/179970708b13904b2993033a5cc2e71d/vaccination_w640.jpeg" style="margin: 4px; float: right; width: 300px; height: 160px;" />In technical terms, a P value is the probability of obtaining an effect at least as extreme as the one in your sample data, assuming the truth of the null hypothesis.</p>
<p>For example, suppose that a vaccine study produced a P value of 0.04. This P value indicates that if the vaccine had no effect, you’d obtain the observed difference or more in 4% of studies due to random sampling error.</p>
<p>P values address only one question: how likely are your data, assuming a true null hypothesis? They do not measure support for the alternative hypothesis. This limitation leads us to the next section, which covers a very common misinterpretation of P values.</p>
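<p>To make the definition concrete, here is a hypothetical two-sample example in Python with SciPy; the data, group sizes, and effect size are all invented for illustration and are not the actual vaccine study:</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical trial: some measured response in treated vs. control subjects.
control = rng.normal(loc=50, scale=10, size=40)
treated = rng.normal(loc=56, scale=10, size=40)

# The two-sample t-test's P value answers one question only:
# how likely is a difference at least this extreme IF the null is true?
t_stat, p_value = stats.ttest_ind(treated, control)
print(f"t = {t_stat:.2f}, P = {p_value:.4f}")
```

<p>Note what the P value is computed from: the null hypothesis of no difference is taken as true throughout the calculation, which is why it cannot tell you the probability that the null itself is true.</p>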
P Values Are <em>NOT </em>the Probability of Making a Mistake
<p>Incorrect interpretations of P values are very common. The most common mistake is to interpret a P value as the probability of making a mistake by rejecting a true null hypothesis (a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/basics/type-i-and-type-ii-error/" target="_blank">Type I error</a>).</p>
<p>There are several reasons why P values can’t be the error rate.</p>
<p>First, P values are calculated based on the assumptions that the null is true for the population and that the difference in the sample is caused entirely by random chance. Consequently, P values can’t tell you the probability that the null is true or false, because from the perspective of the calculations the null is assumed to be 100% true.</p>
<p>Second, while a low P value indicates that your data are unlikely assuming a true null, it can’t evaluate which of two competing cases is more likely:</p>
<ul>
<li>The null is true but your sample was unusual.</li>
<li>The null is false.</li>
</ul>
<p>Determining which case is more likely requires subject area knowledge and replicate studies.</p>
<p>Let’s go back to the vaccine study and compare the correct and incorrect way to interpret the P value of 0.04:</p>
<ul>
<li><strong>Correct: </strong>Assuming that the vaccine had no effect, you’d obtain the observed difference or more in 4% of studies due to random sampling error.</li>
<li><strong>Incorrect:</strong> If you reject the null hypothesis, there’s a 4% chance that you’re making a mistake.</li>
</ul>
What Is the True Error Rate?
<p><img alt="Caution sign" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/41ad875b2a88a19ab5bdfa5e47ed790b/caution_sign_w640.jpeg" style="margin: 4px; float: right; width: 250px; height: 250px;" />Think that this interpretation difference is simply a matter of semantics, and only important to picky statisticians? Think again. It’s important to you.</p>
<p>If a P value is not the error rate, what the heck is the error rate? (Can you guess which way this is heading now?)</p>
<p>Sellke et al.* have estimated the error rates associated with different P values. While the precise error rate depends on various assumptions (which I discuss <a href="http://blog.minitab.com/blog/adventures-in-statistics/not-all-p-values-are-created-equal" target="_blank">here</a>), the table summarizes them for middle-of-the-road assumptions.</p>
<table align="center" border="1" cellpadding="6">
<tr><th>P value</th><th>Probability of incorrectly rejecting a true null hypothesis</th></tr>
<tr style="text-align: center;"><td>0.05</td><td>At least 23% (and typically close to 50%)</td></tr>
<tr style="text-align: center;"><td>0.01</td><td>At least 7% (and typically close to 15%)</td></tr>
</table>
<p>Do the higher error rates in this table surprise you? Unfortunately, the common misinterpretation of P values as the error rate creates the illusion of substantially more evidence against the null hypothesis than is justified. As you can see, if you base a decision on a single study with a P value near 0.05, the difference observed in the sample may not exist at the population level. That can be costly!</p>
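<p>You can see where numbers like these come from with a simulation in the spirit of the calculations behind the table. The assumptions below (a real effect exists in half of all studies, 80% power, and we look only at "significant" P values that land near 0.05) are illustrative choices, not the exact setup used by Sellke <em>et al.</em>:</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_studies, n = 20_000, 50

in_band = {"false": 0, "true": 0}
for _ in range(n_studies):
    null_is_true = rng.random() < 0.5      # a real effect exists in half the studies
    effect = 0.0 if null_is_true else 0.57  # ~80% power at alpha = 0.05, n = 50/group
    p = stats.ttest_ind(rng.normal(0, 1, n),
                        rng.normal(effect, 1, n)).pvalue
    if 0.04 <= p < 0.05:                    # "significant" P values near 0.05
        in_band["false" if null_is_true else "true"] += 1

share = in_band["false"] / (in_band["false"] + in_band["true"])
print(f"False positives among P values near 0.05: {share:.0%}")
```

<p>Under these assumptions, roughly a third of the P values falling just under 0.05 come from studies where the null is actually true: far higher than the 4–5% that the common misinterpretation would suggest.</p>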
<p>Now that you know how to interpret P values, read my <a href="http://blog.minitab.com/blog/adventures-in-statistics/five-guidelines-for-using-p-values">five guidelines for how to use P values and avoid mistakes</a>.</p>
<p>*Thomas SELLKE, M. J. BAYARRI, and James O. BERGER, Calibration of p Values for Testing Precise Null Hypotheses, The American Statistician, February 2001, Vol. 55, No. 1</p>
Hypothesis Testing | Thu, 17 Apr 2014 11:00:00 +0000 | http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values | Jim Frost

Re-analyzing Wine Tastes with Minitab 17
http://blog.minitab.com/blog/applying-statistics-in-quality-projects/re-analyzing-wine-tastes-with-minitab-17
<p>In April 2012, I wrote a short paper on <a href="http://www.minitab.com/en-us/Published-Articles/Wine-Tasting-by-Numbers--Using-Binary-Logistic-Regression-to-Reveal-the-Preferences-of-Experts/">binary logistic regression</a> to analyze wine tasting data. At that time, François Hollande was about to get elected as French president and in the U.S., Mitt Romney was winning the Republican primaries. That seems like a long time ago…</p>
<p>Now, in 2014, Minitab 17 <a href="http://www.minitab.com/products/minitab/">Statistical Software</a> has just been released. Had Minitab 17 been available in 2012, would I have conducted my analysis in a different way? Would the results still look similar? I decided to re-analyze my April 2012 data with Minitab 17 and assess the differences, if any.</p>
<p>There were no fewer than 12 parameters to analyze with a binary response. Among them, 11 were continuous variables and one factor was discrete in nature (white and red wines: a qualitative variable), and the number of two-factor interactions that could be studied was huge (66 were potentially available).</p>
<p>The parameters to be studied:</p>
<table align="center" border="1" cellpadding="6">
<tr><th>Variable</th><th>Details</th><th>Units</th></tr>
<tr><td>Type</td><td>red or white</td><td>N/A</td></tr>
<tr><td>pH</td><td>acidity (below 7) or alkalinity (over 7)</td><td>N/A</td></tr>
<tr><td>Density</td><td>density</td><td>grams/cubic centimeter</td></tr>
<tr><td>Sulphates</td><td>potassium sulfate</td><td>grams/liter</td></tr>
<tr><td>Alcohol</td><td>percentage alcohol</td><td>% volume</td></tr>
<tr><td>Residual sugar</td><td>residual sugar</td><td>grams/liter</td></tr>
<tr><td>Chlorides</td><td>sodium chloride</td><td>grams/liter</td></tr>
<tr><td>Free SO2</td><td>free sulphur dioxide</td><td>milligrams/liter</td></tr>
<tr><td>Total SO2</td><td>total sulphur dioxide</td><td>milligrams/liter</td></tr>
<tr><td>Fixed acidity</td><td>tartaric acid</td><td>grams/liter</td></tr>
<tr><td>Volatile acidity</td><td>acetic acid</td><td>grams/liter</td></tr>
<tr><td>Citric acid</td><td>citric acid</td><td>grams/liter</td></tr>
</table>
Restricting Analysis to the Main Effects
<p>In 2012, due to the very large number of potential two-factor interactions, I restricted my analysis to the main effects (not considering the interactions between continuous variables).</p>
<p>Because the individual parameters had to be eliminated one at a time according to their p values (the parameter with the highest p value is removed, the model is refitted, and the process repeats until every remaining parameter and interaction has a p value lower than 0.05), this was a very lengthy process.</p>
<p>To avoid an excessively complex final model, I eventually decided to analyze white and red wines separately (one model for the white wines, another for the red wines), since the effects of some of the variables appeared to differ according to the type of wine.</p>
Including 2-Way Interactions in the Analysis
<p>Using Minitab 17 makes a substantial difference in this respect. All 2-way interactions can be easily selected to generate an initial model:</p>
<p><img alt="interactions" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/47940b6e8427b9c44afdf56f511b0d44/interactions_logistic_binary.JPG" style="width: 516px; height: 540px;" /></p>
<p>With Minitab 17, you can use stepwise logistic binary regression to quickly build a final model and identify the significant effects. In 2012, I used a descending approach considering all variables first and eliminating one variable at a time manually.</p>
<p>This lengthy and tedious process takes just a single click in Minitab 17:</p>
<p><img alt="stepwise" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/8fe5aafde53273ba3b7d16da305b5e4d/stepwise_binary.JPG" style="width: 486px; height: 539px;" /></p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/fc6b2c0fe2c083e439f4c66e0e446ddd/deviance_table_w640.gif" style="width: 640px; height: 168px;" /></p>
<p>The results above show that Alcohol and Acidity (both fixed and volatile) seem to play a major role.</p>
<p>The Residual sugar by Type of wine interaction is only marginally significant, with a p value (0.087) larger than 0.05 but smaller than 0.1.</p>
<p>The R squared value (R-Sq) is also available in Minitab 17 to assess the proportion of the total variability that is explained by the model. The larger the R squared value, the more completely our model describes the process; a low R squared means that our model explains only a small part of the variability in the response. In this example, the R squared is relatively low (28%), leaving 72% of the total variability unexplained by the model.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/4a5a9df6a33e7e05cfaf880bcc2cc3d8/model_summary.png" style="width: 278px; height: 94px;" /></p>
<p>In 2012, the final result consisted of two equations that could be used to understand which variables were significant for each type of wine in order to improve their taste.</p>
Optimizing the Response
<p>In Minitab 17, I can go one step further and use the optimization tool to identify the ideal settings and help the experimenter make the right decision.</p>
<p><img alt="regression equation" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/513cc8d599ba7948f4f288db12356435/regression_equation.png" style="width: 580px; height: 174px;" /></p>
<p><img alt="Optimize" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/9e42868dc619334da72100ec138b00c4/optimize_binary_w640.jpeg" style="width: 640px; height: 184px;" /></p>
<p>The optimization tool shows that tasters tend to prefer wines with a large amount of alcohol and both high fixed acidity <em>and </em>high volatile acidity.</p>
<p>Finally, showing graphs is important to convince colleagues and managers that the right decision has been taken. A visual representation is also very useful to better understand the factor effects. In Minitab 17, contour plots and response surface diagrams are available to describe the variable effects in the logistic binary regression sub-menu.</p>
<p>The contour plot below shows that tasters either prefer wines with high fixed acidity <em>and </em>high volatile acidity or with low fixed acidity <em>but also </em>low volatile acidity. The balance between the two types of acidity seems to be crucial.</p>
<p><img alt="Contour" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/2705aa13f0f80f9830408616028428a0/contour_plot_of_quality_vs_volatile_acidity__fixed.jpg" style="width: 576px; height: 384px;" /></p>
<p><img alt="Surface" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/2f593a6e6b89cb70b9439c85e8345477/surface_plot_of_quality_vs_volatile_acidity__fixed.jpg" style="width: 576px; height: 384px;" /></p>
<p>The models I arrived at in April 2012 are different from the one I found with Minitab 17. In 2012, the two types of Acidity (Fixed and Volatile) were significant in the model for white wines, while Alcohol and Fixed Acidity had been selected in the final model for red wines.</p>
<p>But the main difference is that the Fixed Acidity by Volatile Acidity interaction had not been considered in 2012. In April 2012, the two-factor interactions were not on my radar, and I instead focused only on the individual main effects and their impact on wine tastes.</p>
<p>Fortunately, with Minitab 17 it is a lot easier to build an initial model—even a complex one with 66 two-factor potential interactions—and stepwise regression allows you to consider a much larger number of potential effects in the initial full model.</p>
Conclusion
<p>Ultimately, this study shows that the methods you use definitely impact your statistical analysis and conclusions. I got a simpler model using the tools available in Minitab 17, and therefore did not need to study white and red wines separately. The optimization tool as well as the graphs were very useful for better understanding the effects of the significant variables.</p>
Data Analysis | Fun Statistics | Hypothesis Testing | Quality Improvement | Regression Analysis | Statistics | Statistics Help | Stats | Tue, 15 Apr 2014 12:00:00 +0000 | http://blog.minitab.com/blog/applying-statistics-in-quality-projects/re-analyzing-wine-tastes-with-minitab-17 | Bruno Scibilia