Hypothesis Testing | Minitab
Blog posts and articles about hypothesis testing, especially in the course of Lean Six Sigma quality improvement projects.
http://blog.minitab.com/blog/hypothesis-testing-2/rss
Tue, 26 Jul 2016 21:47:14 +0000
Understanding Analysis of Variance (ANOVA) and the F-test
http://blog.minitab.com/blog/adventures-in-statistics/understanding-analysis-of-variance-anova-and-the-f-test
<p>Analysis of variance (ANOVA) can determine whether the means of three or more groups are different. ANOVA uses F-tests to statistically test the equality of means. In this post, I’ll show you how ANOVA and F-tests work using a one-way ANOVA example.</p>
<p>But wait a minute...have you ever stopped to wonder why you’d use an analysis of <em>variance</em> to determine whether <em>means</em> are different? I'll also show how variances provide information about means.</p>
<p>As in my posts about <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-t-tests:-1-sample,-2-sample,-and-paired-t-tests" target="_blank">understanding t-tests</a>, I’ll focus on concepts and graphs rather than equations to explain ANOVA F-tests.</p>
What are F-statistics and the F-test?
<p>F-tests are named after their test statistic, F, which was named in honor of Sir Ronald Fisher. The F-statistic is simply a ratio of two variances. Variances are a measure of dispersion, or how far the data are scattered from the mean. Larger values represent greater dispersion.</p>
<img alt="F is for F-test" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/2176eecdb5dee3586bf90f5dc2ca0007/f.gif" style="line-height: 20.8px; margin: 10px 15px; float: right; width: 200px; height: 221px;" />
<p>Variance is the square of the standard deviation. For us humans, standard deviations are easier to understand than variances because they’re in the same units as the data rather than squared units. However, many analyses actually use variances in the calculations.</p>
<p>F-statistics are based on the ratio of mean squares. The term “<a href="http://support.minitab.com/minitab/17/topic-library/modeling-statistics/anova/anova-statistics/understanding-mean-squares/" target="_blank">mean squares</a>” may sound confusing but it is simply an estimate of population variance that accounts for the <a href="http://support.minitab.com/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/basic-concepts/df/" target="_blank">degrees of freedom (DF)</a> used to calculate that estimate.</p>
<p>Despite being a ratio of variances, you can use F-tests in a wide variety of situations. Unsurprisingly, the F-test can assess the equality of variances. However, by changing the variances that are included in the ratio, the F-test becomes a very flexible test. For example, you can use F-statistics and F-tests to <a href="http://blog.minitab.com/blog/adventures-in-statistics/what-is-the-f-test-of-overall-significance-in-regression-analysis" target="_blank">test the overall significance for a regression model</a>, to compare the fits of different models, to test specific regression terms, and to test the equality of means.</p>
Using the F-test in One-Way ANOVA
<p>To use the F-test to determine whether group means are equal, it’s just a matter of including the correct variances in the ratio. In one-way ANOVA, the F-statistic is this ratio:</p>
<p style="margin-left: 40px;"><strong>F = variation between sample means / variation within the samples</strong></p>
<p>The best way to understand this ratio is to walk through a one-way ANOVA example.</p>
<p>We’ll analyze four samples of plastic to determine whether they have different mean strengths. You can download the <a href="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/File/a8a9c678090ccac0f3be61be91cf8012/plasticstrength.mtw">sample data</a> if you want to follow along. (If you don't have Minitab, you can download a <a href="http://www.minitab.com/en-us/products/minitab/free-trial/" target="_blank">free 30-day trial</a>.) I'll refer back to the one-way ANOVA output as I explain the concepts.</p>
<p>In Minitab, choose <strong>Stat > ANOVA > One-Way ANOVA...</strong> In the dialog box, choose "Strength" as the response, and "Sample" as the factor. Press OK, and Minitab's Session Window displays the following output: </p>
<p style="margin-left: 40px;"><img alt="Output for Minitab's one-way ANOVA" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/42587221b52ed940d53478106c134ebc/1way_swo.png" style="width: 315px; height: 322px;" /></p>
Numerator: Variation Between Sample Means
<p>One-way ANOVA has calculated a mean for each of the four samples of plastic. The group means are: 11.203, 8.938, 10.683, and 8.838. These group means are distributed around the overall mean for all 40 observations, which is 9.915. If the group means are clustered close to the overall mean, their variance is low. However, if the group means are spread out further from the overall mean, their variance is higher.</p>
<p>Clearly, if we want to show that the group means are different, it helps if the means are further apart from each other. In other words, we want higher variability among the means.</p>
<p>Imagine that we perform two different one-way ANOVAs where each analysis has four groups. The graph below shows the spread of the means. Each dot represents the mean of an entire group. The further the dots are spread out, the higher the value of the variability in the numerator of the F-statistic.</p>
<p style="margin-left: 40px;"><img alt="Dot plot that shows high and low variability between group means" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/f9a100946675098ca09c4440a7907230/group_means_dot_plot.png" style="width: 576px; height: 86px;" /></p>
<p>What value do we use to measure the variance between sample means for the plastic strength example? In the one-way ANOVA output, we’ll use the adjusted mean square (Adj MS) for Factor, which is 14.540. Don’t try to interpret this number because it won’t make sense. It’s the sum of the squared deviations divided by the factor DF. Just keep in mind that the further apart the group means are, the larger this number becomes.</p>
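<p>To see where a number like 14.540 comes from, here is a minimal Python sketch that rebuilds the between-group mean square from the four group means quoted above. It assumes each sample has 10 observations (40 observations split evenly across four groups), which the post does not state outright, and the variable names are mine rather than Minitab's.</p>

```python
# Group means reported in the one-way ANOVA output above.
group_means = [11.203, 8.938, 10.683, 8.838]

# Assumption: a balanced design with 10 observations per group (40 / 4).
group_size = 10

# With equal group sizes, the overall mean is the average of the group means.
grand_mean = sum(group_means) / len(group_means)

# Sum of squared deviations of the group means from the overall mean,
# weighted by the number of observations in each group.
ss_between = sum(group_size * (m - grand_mean) ** 2 for m in group_means)

# Factor DF = number of groups - 1.
df_factor = len(group_means) - 1

ms_between = ss_between / df_factor
print(round(ms_between, 3))
```

<p>Under that assumption, the result lands within rounding error of the Adj MS for Factor in the output. The small gap is expected because the group means above are rounded to three decimals.</p>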
Denominator: Variation Within the Samples
<p>We also need an estimate of the variability within each sample. To calculate this variance, we need to calculate how far each observation is from its group mean for all 40 observations. Technically, it is the sum of the squared deviations of each observation from its group mean divided by the error DF.</p>
<p>If the observations for each group are close to the group mean, the variance within the samples is low. However, if the observations for each group are further from the group mean, the variance within the samples is higher.</p>
<p style="margin-left: 40px;"><img alt="Plot that shows high and low variability within groups" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/9ef2eae1cf6bba97ccb1b664356d0d0a/within_group_dplot.png" style="width: 576px; height: 384px;" /></p>
<p>In the graph, the panel on the left shows low variation in the samples while the panel on the right shows high variation. The more spread out the observations are from their group mean, the higher the value in the denominator of the F-statistic.</p>
<p>If we’re hoping to show that the means are different, it's good when the within-group variance is low. You can think of the within-group variance as the background noise that can obscure a difference between means.</p>
<p>For this one-way ANOVA example, the value that we’ll use for the variance within samples is the Adj MS for Error, which is 4.402. It is considered “error” because it is the variability that is not explained by the factor.</p>
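<p>The within-sample calculation needs the raw observations, so the sketch below uses a tiny made-up data set rather than the plastic strength data. It only illustrates the recipe described above: squared deviations from each group's own mean, summed, then divided by the error DF.</p>

```python
# Hypothetical data: three groups of three observations (not the plastic data).
groups = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0],
]

def within_group_mean_square(groups):
    """Sum of squared deviations from each group's mean, divided by error DF."""
    ss_within = 0.0
    for group in groups:
        group_mean = sum(group) / len(group)
        ss_within += sum((x - group_mean) ** 2 for x in group)
    n_total = sum(len(group) for group in groups)
    df_error = n_total - len(groups)  # total observations - number of groups
    return ss_within / df_error

ms_within = within_group_mean_square(groups)
print(ms_within)
```

<p>Notice that shifting an entire group up or down changes nothing here. Only the scatter around each group's own mean counts, which is why this term plays the role of background noise.</p>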
The F-Statistic: Variation Between Sample Means / Variation Within the Samples
<p>The F-statistic is the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/basics/what-is-a-test-statistic/" target="_blank">test statistic</a> for F-tests. In general, an F-statistic is a ratio of two quantities that are expected to be roughly equal under the null hypothesis, which produces an F-statistic of approximately 1.</p>
<p>The F-statistic incorporates both measures of variability discussed above. Let's take a look at how these measures can work together to produce low and high F-values. Look at the graphs below and compare the width of the spread of the group means to the width of the spread within each group.</p>
<img alt="Graph that shows sample data that produce a low F-value" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/a8faab4bb32bf1a1f5864d34d96e8d56/low_f_dplot.png" style="width: 350px; height: 233px;" />
<img alt="Graph that shows sample data that produce a high F-value" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/054b86eb1e48803baba2cff9c78028ab/high_f_dplot.png" style="width: 350px; height: 233px;" />
<p>The low F-value graph shows a case where the group means are close together (low variability) relative to the variability within each group. The high F-value graph shows a case where the variability of group means is large relative to the within-group variability. In order to reject the null hypothesis that the group means are equal, we need a high F-value.</p>
<p>For our plastic strength example, we'll use the Factor Adj MS for the numerator (14.540) and the Error Adj MS for the denominator (4.402), which gives us an F-value of 3.30.</p>
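<p>Here is a short Python sketch of the whole ratio. The function and the small data set are illustrative assumptions of mine, not Minitab's implementation. The last lines simply divide the two mean squares reported in the output.</p>

```python
def f_statistic(groups):
    """One-way ANOVA F: between-group mean square / within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Numerator: variation between sample means.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    ms_between = ss_between / (k - 1)

    # Denominator: variation within the samples.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    ms_within = ss_within / (n_total - k)

    return ms_between / ms_within

# Hypothetical data, chosen so the arithmetic is easy to check by hand.
example_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
example_f = f_statistic(example_groups)

# The plastic strength example, using the mean squares from the output.
plastic_f = 14.540 / 4.402
print(round(plastic_f, 2))
```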
<p>Is our F-value high enough? A single F-value is hard to interpret on its own. We need to place our F-value into a larger context before we can interpret it. To do that, we’ll use the F-distribution to calculate probabilities.</p>
F-distributions and Hypothesis Testing
<p>For one-way ANOVA, the ratio of the between-group variability to the within-group variability follows an <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/probability-distributions-and-random-data/distributions/f-distribution/" target="_blank">F-distribution</a> when the null hypothesis is true.</p>
<p>When you perform a one-way ANOVA for a single study, you obtain a single F-value. However, if we drew multiple random samples of the same size from the same population and performed the same one-way ANOVA, we would obtain many F-values and we could plot a distribution of all of them. This type of distribution is known as a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/basic-concepts/sampling-distribution/" target="_blank">sampling distribution</a>.</p>
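<p>You can mimic that thought experiment with a small simulation. The sketch below is an illustration under stated assumptions: every group is drawn from the same normal population (so the null hypothesis is true by construction), the design matches our example with four groups of 10, and the seed and number of repetitions are arbitrary choices of mine.</p>

```python
import random

def f_statistic(groups):
    """One-way ANOVA F: between-group mean square / within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

def simulate_null_f_values(n_sims=5000, n_groups=4, group_size=10, seed=1):
    """Repeat the study many times with all groups drawn from one population."""
    rng = random.Random(seed)
    f_values = []
    for _ in range(n_sims):
        groups = [
            [rng.gauss(0.0, 1.0) for _ in range(group_size)]
            for _ in range(n_groups)
        ]
        f_values.append(f_statistic(groups))
    return f_values

f_values = simulate_null_f_values()

# Share of simulated studies with an F-value at least as large as ours (3.30).
share_at_least_observed = sum(f >= 3.30 for f in f_values) / len(f_values)
print(share_at_least_observed)
```

<p>A histogram of <code>f_values</code> approximates the sampling distribution, and the printed share is a rough, simulation-based estimate of how often an F-value of 3.30 or more turns up when the null hypothesis is true.</p>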
<p>Because the F-distribution assumes that the null hypothesis is true, we can place the F-value from our study in the F-distribution to determine how consistent our results are with the null hypothesis and to calculate probabilities.</p>
<p>The probability that we want to calculate is the probability of observing an F-statistic that is at least as high as the value that our study obtained. That probability allows us to determine how common or rare our F-value is under the assumption that the null hypothesis is true. If the probability is low enough, we can conclude that our data are inconsistent with the null hypothesis. The evidence in the sample data is strong enough to reject the null hypothesis for the entire population.</p>
<p>This probability that we’re calculating is also known as the p-value!</p>
<p>To plot the F-distribution for our plastic strength example, I’ll use Minitab’s <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/graphs/graphs-of-distributions/probability-distribution-plots/probability-distribution-plot/" target="_blank">probability distribution plots</a>. In order to graph the F-distribution that is appropriate for our specific design and sample size, we'll need to specify the correct number of DF. Looking at our one-way ANOVA output, we can see that we have 3 DF for the numerator and 36 DF for the denominator.</p>
<p><img alt="Probability distribution plot for an F-distribution with a probability" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/6303a2314437d8fcf2f72d9a56b1293a/f_distribution_probability.png" style="width: 576px; height: 384px;" /></p>
<p>The graph displays the distribution of F-values that we'd obtain if the null hypothesis is true and we repeat our study many times. The shaded area represents the probability of observing an F-value that is at least as large as the F-value our study obtained. F-values fall within this shaded region about 3.1% of the time when the null hypothesis is true. This probability is low enough to reject the null hypothesis using the common <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests:-significance-levels-alpha-and-p-values-in-statistics" target="_blank">significance level</a> of 0.05. We can conclude that not all the group means are equal.</p>
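<p>If you'd like to check the shaded area without Minitab, the sketch below integrates the F-distribution's density numerically using only Python's standard library. The midpoint rule and the step count are my own choices for illustration. A statistics package would normally use a dedicated routine instead.</p>

```python
import math

def f_pdf(x, d1, d2):
    """Density of the F-distribution with d1 (numerator) and d2 (denominator) DF."""
    log_beta = (
        math.lgamma(d1 / 2) + math.lgamma(d2 / 2) - math.lgamma((d1 + d2) / 2)
    )
    log_density = (
        (d1 / 2) * math.log(d1 / d2)
        + (d1 / 2 - 1) * math.log(x)
        - ((d1 + d2) / 2) * math.log(1 + d1 * x / d2)
        - log_beta
    )
    return math.exp(log_density)

def f_upper_tail(f_value, d1, d2, steps=20000):
    """P(F >= f_value): 1 minus the midpoint-rule area under the density on [0, f_value]."""
    width = f_value / steps
    area = sum(f_pdf((i + 0.5) * width, d1, d2) for i in range(steps)) * width
    return 1.0 - area

# Our example: F = 3.30 with 3 numerator DF and 36 denominator DF.
p_value = f_upper_tail(3.30, 3, 36)
print(round(p_value, 3))
```

<p>The result should fall below the 0.05 significance level, in line with the shaded area in the plot.</p>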
<p><a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" target="_blank">Learn how to correctly interpret the p-value.</a></p>
Assessing Means by Analyzing Variation
<p>ANOVA uses the F-test to determine whether the variability between group means is larger than the variability of the observations within the groups. If that ratio is sufficiently large, you can conclude that not all the means are equal.</p>
<p><span style="line-height: 20.8px;">This brings us back to why we analyze variation to make judgments about means. </span>Think about the question: "Are the group means different?" You are implicitly asking about the variability of the means. After all, if the group means <em>don't </em>vary, or don't vary by more than random chance allows, then you can't say the means are different. And that's why you use analysis of variance to test the means.</p>
ANOVA, Data Analysis, Hypothesis Testing, Learning, Statistics Help
Wed, 18 May 2016 12:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/understanding-analysis-of-variance-anova-and-the-f-test
Jim Frost
An Overview of Discriminant Analysis
http://blog.minitab.com/blog/starting-out-with-statistical-software/an-overview-of-discriminant-analysis
<p>Among the most underutilized statistical tools in Minitab, and I think in general, are multivariate tools. Minitab offers a number of different multivariate tools, including principal component analysis, factor analysis, <span><a href="http://blog.minitab.com/blog/quality-data-analysis-and-statistics/cluster-analysis-tips-part-2">clustering</a></span>, and more. In this post, my goal is to give you a better understanding of the multivariate tool called discriminant analysis, and how it can be used.</p>
<p>Discriminant analysis is used to classify observations into two or more groups if you have a sample with known groups. Essentially, it's a way to handle a classification problem, where two or more groups, clusters, or populations are known up front, and one or more new observations are placed into one of these known classifications based on the measured characteristics. Discriminant analysis can also be used to investigate how variables contribute to group separation.</p>
<p>An area where this is especially useful is species classification. We'll use that as an example to explore how this all works. If you want to follow along and you don't already have Minitab, you can get it <a href="http://www.minitab.com/products/minitab/free-trial/">free for 30 days</a>. </p>
Discriminant Analysis in Action
<img alt="Arctic wolf" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/43484b551c0cc2eacb1b848678d666be/wolf.jpg" style="line-height: 20.8px; margin: 10px 15px; float: right; width: 241px; height: 300px;" />
<div>
<p>I have a <a href="//cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/9429cbd678e906f6bbbda0793aa859f6/discrimdata.mtw">data set</a> with variables containing data on both Rocky Mountain and Arctic wolves. We already know which species each observation belongs to; the main goal of this analysis is to find out how the data we have contribute to the groupings, and then to use this information to help us classify new individuals. </p>
<p>In Minitab, we set up our worksheet to be column-based, as usual. We have a column denoting the species of wolf, as well as 9 other columns containing measurements for each individual on a number of different features.</p>
<p>Once we have our continuous predictors and a group identifier column in our worksheet, we can go to <strong>Stat > Multivariate > Discriminant Analysis</strong>. Here's how we'd fill out the dialog:</p>
<p style="margin-left: 40px;"><img alt="dialog" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/732ead34-1005-4470-b034-d7f8b87fabcf/Image/bbfff731ce2f30923c064a73324dba1e/discrimdia.png" style="width: 448px; height: 336px;" /></p>
<p>'Groups' is where you would enter the column that contains the data on which group the observation falls into. In this case, "Location" is the species ID column. Our predictors, in my case X1-X9, represent the measurements of the individual wolves for each of 9 categories; we'll use these to find out which characteristics drive the groupings.</p>
<p>Some notes before we click OK. First, we're using a Linear discriminant function for simplicity. This makes the assumption that the covariance matrices are equal for all groups. This is something we can verify using Bartlett's Test (also available in Minitab). Once we have our dialog filled out, we can click OK and see our results.</p>
Using the Linear Discriminant Function to Classify New Observations
<p>One of the most important parts of the output we get is called the Linear Discriminant Function. In our example, it looks like this:</p>
<p style="margin-left: 40px;"><img alt="function" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/732ead34-1005-4470-b034-d7f8b87fabcf/Image/a3f3b5199c25010c69d3b19843c31b0e/function.PNG" style="width: 303px; height: 208px;" /></p>
<p>This is the function we will use to classify new observations into groups. Its coefficients let us determine which group provides the best fit for a new individual's measurements. Minitab can do this in the "Options" subdialog. For example, let's say we had a new observation with a certain vector of measurements (X1,...,X9). If we enter those measurements there, we get output like this:</p>
<p style="margin-left: 40px;"><img alt="pred" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/732ead34-1005-4470-b034-d7f8b87fabcf/Image/49873dcbc94d8aa1ae75a45474aaf147/predic.PNG" style="width: 421px; height: 119px;" /></p>
<p>This will give us the probability that a particular new observation falls into either of our groups. In our case, it was an easy one. The probability that it belongs to the AR species was 1. We're reasonably sure, based on the data, that this is the case. In some cases, you may get probabilities much closer to each other, meaning it isn't as clear cut.</p>
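<p>To make the mechanics concrete, here is a rough Python sketch of how a linear discriminant function can classify a new observation. Everything numeric in it is hypothetical: the coefficients are made up (the real ones are in the output above), it uses two predictors instead of nine, and the group label "RM" is my shorthand for the second species. It also assumes the scores are on a log scale with equal prior probabilities, so that exponentiating and normalizing them gives group membership probabilities.</p>

```python
import math

# Hypothetical discriminant functions: one constant and one coefficient per
# predictor for each group. These values are invented for illustration.
discriminant_functions = {
    "AR": {"constant": -10.0, "coefficients": [2.0, 1.0]},
    "RM": {"constant": -12.0, "coefficients": [1.0, 2.5]},
}

def classify(observation, functions):
    """Score the observation for each group and convert scores to probabilities."""
    scores = {
        group: f["constant"]
        + sum(c * x for c, x in zip(f["coefficients"], observation))
        for group, f in functions.items()
    }
    # Subtract the largest score before exponentiating for numerical stability.
    top = max(scores.values())
    weights = {group: math.exp(s - top) for group, s in scores.items()}
    total = sum(weights.values())
    probabilities = {group: w / total for group, w in weights.items()}
    best_group = max(scores, key=scores.get)
    return best_group, probabilities

# A hypothetical new individual measured on the two predictors.
predicted_group, probabilities = classify([6.0, 2.0], discriminant_functions)
print(predicted_group, probabilities)
```

<p>The group with the highest score wins. When the scores are far apart, one probability approaches 1, as in our wolf example. When they are close, the probabilities are close too.</p>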
<p>I hope this gives you some idea of the usefulness of discriminant analysis, and how you can use it in Minitab to make decisions.</p>
</div>
Data Analysis, Hypothesis Testing, Statistics
Mon, 16 May 2016 12:00:00 +0000
http://blog.minitab.com/blog/starting-out-with-statistical-software/an-overview-of-discriminant-analysis
Eric Heckman
Tests of 2 Standard Deviations? Side Effects May Include Paradoxical Dissociations
http://blog.minitab.com/blog/data-analysis-and-quality-improvement-and-stuff/tests-of-2-standard-deviations-side-effects-may-include-paradoxical-dissociations
<p>Once upon a time, when people wanted to compare the standard deviations of two samples, they had two handy tests available, the F-test and Levene's test.</p>
<p>Statistical lore has it that the F-test is so named because <a href="##footnote">it so frequently fails you.1</a> Although the F-test is suitable for data that are normally distributed, its sensitivity to departures from <span><a href="http://blog.minitab.com/blog/the-statistical-mentor/anderson-darling-ryan-joiner-or-kolmogorov-smirnov-which-normality-test-is-the-best">normality</a></span> limits when and where it can be used.</p>
<p><a name="#back"></a>Levene’s test was developed as an antidote to the F-test's extreme sensitivity to nonnormality. However, Levene's test<span style="line-height: 1.6;"> is sometimes accompanied by a troubling side effect: paradoxical </span>dissociations<span style="line-height: 1.6;">. To see what I mean, take a look at these results from an </span><span style="line-height: 1.6;">actual </span><span style="line-height: 1.6;">test of 2 standard deviations that I actually ran in Minitab 16 using actual data that I actually made up:</span></p>
<p style="margin-left: 40px;"><img alt="Ratio of the standard deviations in Release 16" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/313db9f57725eeb074002df423c4415e/16_ratio.jpg" style="width: 286px; height: 99px;" /></p>
<p>Nothing surprising so far. The ratio of the standard deviations from samples 1 and 2 (s1/s2) is <span style="line-height: 20.8px;">1.414 / 1.575 = 0.898. This ratio is </span>our best "point estimate" for the ratio of the standard deviations from populations 1 and 2 (Ps1/Ps2).</p>
<p>Note that the ratio is less than 1, which suggests that Ps2 is greater than Ps1. </p>
<p>Now, let's have a look at the confidence interval (CI) for the population ratio. The CI gives us a range of likely values for the ratio of Ps1/Ps2. The CI <span style="line-height: 20.8px;">below</span><span style="line-height: 1.6;"> labeled "Continuous" is the one calculated using Levene's method:</span></p>
<p style="margin-left: 40px;"><img alt="Confidence interval for the ratio in Release 16" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/aee886880d52d5aed7150abd242b5d61/16_ci.jpg" style="width: 338px; height: 114px;" /></p>
<p><span style="line-height: 1.6;">What in Gauss' name is going on here?!? The range of likely values for Ps1/Ps2—1.046 to 1.566—doesn't include the point estimate of 0.898?!? In fact, the CI suggests that Ps1/Ps2 is </span><em style="line-height: 1.6;">greater </em><span style="line-height: 1.6;">than 1. Which suggests that Ps1 is actually </span><em style="line-height: 20.8px;">greater </em><span style="line-height: 1.6;">than Ps2. </span></p>
<p><span style="line-height: 1.6;">But the point estimate suggests the exact opposite! Which suggests that </span><span style="line-height: 20.8px;">something odd is going on here. Or that</span><span style="line-height: 1.6;"> I might be losing my mind (which wouldn't be that odd). Or both.</span></p>
<p>As it turns out, the very elements that make Levene's test robust to departures from normality also leave the test susceptible to paradoxical dissociations like this one. You see, Levene's test isn't <em>actually </em>based on the standard deviation. Instead, the test is based on a statistic called the <em>mean absolute deviation from the median</em>, or MADM. The MADM is much less affected by nonnormality and outliers than is the standard deviation. And even though the MADM and the <span style="line-height: 20.8px;">standard deviation of a sample </span>can be very different, the <em>ratio </em>of MADM1/MADM2 is nevertheless a good approximation for the <em>ratio </em>of Ps1/Ps2. </p>
<p><span style="line-height: 1.6;">However, in extreme cases, outliers can affect the sample standard deviations so much that s1/s2 can fall completely outside of Levene's CI. And that's when you're left with an awkward and confusing case of paradoxical dissociation. </span></p>
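<p>Here's a small Python sketch of how the two ratios can point in opposite directions. The data are made up by me (much like the data above), and the code does not reproduce Levene's confidence interval. It only compares the ratio of sample standard deviations with the ratio of MADMs when one sample contains an outlier.</p>

```python
import statistics

def madm(sample):
    """Mean absolute deviation from the median."""
    center = statistics.median(sample)
    return sum(abs(x - center) for x in sample) / len(sample)

# Hypothetical samples: sample_1 is evenly spread out, while sample_2 is
# tightly packed except for one extreme outlier.
sample_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
sample_2 = [4, 5, 5, 5, 5, 5, 5, 6, 20]

sd_ratio = statistics.stdev(sample_1) / statistics.stdev(sample_2)
madm_ratio = madm(sample_1) / madm(sample_2)

print(round(sd_ratio, 3), round(madm_ratio, 3))
```

<p>The outlier inflates the standard deviation of the second sample far more than its MADM, so the standard deviation ratio falls below 1 while the MADM ratio stays above 1.</p>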
<p><span style="line-height: 1.6;">Fortunately (and this may be the first and last time that you'll ever hear this next phrase), our </span><span style="line-height: 1.6;">statisticians have made things a lot less awkward. </span><span style="line-height: 1.6;">One of the brave folks in Minitab's R&D department toiled against all odds, and at considerable personal peril, to solve this enigma. The result, which has been incorporated into Minitab 17, is an effective, elegant, and </span>non-enigmatic<span style="line-height: 1.6;"> test that we call Bonett's test. </span></p>
<p style="margin-left: 40px;"><span style="line-height: 1.6;"><img alt="Confidence interval in Release 17" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/3c014cdea970a3f1f6a540119ef3b533/bonnet_results.jpg" style="width: 310px; height: 170px;" /></span></p>
<p>Like Levene's test, Bonett's test can be used with nonnormal data. But <em>unlike </em>Levene's test, Bonett's test is actually based on the actual standard deviations of the actual samples. Which means that Bonett's test is not subject to the same awkward and confusing paradoxical dissociations that can accompany Levene's test. And I don't know about you, but I try to avoid paradoxical dissociations whenever I can. (Especially as I get older, ... I just don't bounce back the way I used to.) </p>
<p><span style="line-height: 20.8px;">When you compare two standard deviations in Minitab 17, you get a handy graphical report </span><span style="line-height: 20.8px;">that quickly and clearly summarizes the results of your test, including the point estimate and the CI from Bonett's test. Which means n</span><span style="line-height: 20.8px;">o more awkward and confusing paradoxical dissociations. </span></p>
<p style="margin-left: 40px;"><img alt="Summary plot in Release 17" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/b785749b3292df1aa6d32abe4e430b63/17_summary_plot.jpg" style="width: 578px; height: 386px;" /></p>
<p><span style="line-height: 1.6;">------------------------------------------------------------</span></p>
<p><a name="#footnote"> </a></p>
<p>1 So, that bit about the name of the F-test—I kind of made that up. Fortunately, there is a better source of information for the genuinely curious. Our white paper, <a href="http://support.minitab.com/en-us/minitab/17/bonetts_method_two_variances.pdf">Bonett's Method</a>, includes all kinds of details about these tests and comparisons between the CIs calculated with each. Enjoy.</p>
<p> <br />
<em><a href="##back">return to text of post</a></em></p>
Hypothesis Testing, Statistics, Stats
Wed, 11 May 2016 12:00:00 +0000
http://blog.minitab.com/blog/data-analysis-and-quality-improvement-and-stuff/tests-of-2-standard-deviations-side-effects-may-include-paradoxical-dissociations
Greg Fox
Understanding t-Tests: 1-sample, 2-sample, and Paired t-Tests
http://blog.minitab.com/blog/adventures-in-statistics/understanding-t-tests%3A-1-sample%2C-2-sample%2C-and-paired-t-tests
<p>In statistics, t-tests are a type of hypothesis test that allows you to compare means. They are called t-tests because each t-test boils your sample data down to one number, the t-value. If you understand how t-tests calculate t-values, you’re well on your way to understanding how these tests work.</p>
<p>In this series of posts, I'm focusing on concepts rather than equations to show how t-tests work. However, this post includes two simple equations that I’ll work through using the analogy of a signal-to-noise ratio.</p>
<p><a href="http://www.minitab.com/products/minitab/" target="_blank">Minitab statistical software</a> offers the 1-sample t-test, paired t-test, and the 2-sample t-test. Let's look at how each of these t-tests reduces your sample data down to the t-value.</p>
How 1-Sample t-Tests Calculate t-Values
<p>Understanding this process is crucial to understanding how t-tests work. I'll show you the formula first, and then I’ll explain how it works.</p>
<p style="margin-left: 40px;"><img alt="formula to calculate t for a 1-sample t-test" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/dbbda42fec926eef96a56c22ed462458/formula_1t.png" style="width: 142px; height: 88px;" /></p>
<p>Please notice that the formula is a ratio. A common analogy is that the t-value is the signal-to-noise ratio.</p>
<strong>Signal (a.k.a. the effect size)</strong>
<p>The numerator is the signal. You simply take the sample mean and subtract the null hypothesis value. If your sample mean is 10 and the null hypothesis value is 6, the difference, or signal, is 4.</p>
<p>If there is no difference between the sample mean and null value, the signal in the numerator, as well as the value of the entire ratio, equals zero. For instance, if your sample mean is 6 and the null value is 6, the difference is zero.</p>
<p>As the difference between the sample mean and the null hypothesis mean increases in either the positive or negative direction, the strength of the signal increases.</p>
<div style="float: right; width: 325px; margin: 15px 0px 15px 15px;"><img alt="Photo of a packed stadium to illustrate high background noise" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/695f063e8d38c2bc9c5fa61637ef6327/crowd.jpg" style="width: 325px; height: 244px; margin-bottom:5px;" /><br />
<em>Lots of noise can overwhelm the signal.</em></div>
<strong>Noise</strong>
<p>The denominator is the noise. The equation in the denominator is a measure of variability known as the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/tests-of-means/what-is-the-standard-error-of-the-mean/" target="_blank">standard error of the mean</a>. This statistic indicates how accurately your sample estimates the mean of the population. A larger number indicates that your sample estimate is less precise because it has more random error.</p>
<p>This random error is the “noise.” When there is more noise, you expect to see larger differences between the sample mean and the null hypothesis value <em>even when the null hypothesis is true</em>. We include the noise factor in the denominator because we must determine whether the signal is large enough to stand out from it.</p>
<strong>Signal-to-Noise ratio</strong>
<p>Both the signal and noise values are in the units of your data. If your signal is 6 and the noise is 2, your t-value is 3. This t-value indicates that the difference is 3 times the size of the standard error. However, if there is a difference of the same size but your data have more variability (6), your t-value is only 1. The signal is at the same scale as the noise.</p>
<p>In this manner, t-values allow you to see how distinguishable your signal is from the noise. Relatively large signals and low levels of noise produce larger t-values. If the signal does not stand out from the noise, it’s likely that the observed difference between the sample estimate and the null hypothesis value is due to random error in the sample rather than a true difference at the population level.</p>
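<p>The signal-to-noise idea translates directly into a few lines of Python. This is a minimal sketch with a made-up sample, chosen so that the sample mean is 10 and the null value is 6, as in the signal example above.</p>

```python
import math
import statistics

def one_sample_t(sample, null_value):
    """t-value = signal / noise for a 1-sample t-test."""
    # Signal: sample mean minus the null hypothesis value.
    signal = statistics.mean(sample) - null_value
    # Noise: standard error of the mean (sample standard deviation / sqrt(n)).
    noise = statistics.stdev(sample) / math.sqrt(len(sample))
    return signal / noise

# Hypothetical sample with a mean of 10, tested against a null value of 6.
sample = [8, 10, 12, 9, 11]
t_value = one_sample_t(sample, 6)
print(round(t_value, 3))
```

<p>Try spreading the sample values further apart while keeping the mean at 10. The signal stays at 4, the noise grows, and the t-value shrinks.</p>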
A Paired t-test Is Just A 1-Sample t-Test
<p>Many people are confused about when to use a paired t-test and how it works. I’ll let you in on a little secret. The paired t-test and the 1-sample t-test are actually the same test in disguise! As we saw above, a 1-sample t-test compares one sample mean to a null hypothesis value. A paired t-test simply calculates the difference between paired observations (e.g., before and after) and then performs a 1-sample t-test on the differences.</p>
<p>You can test this with <a href="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/File/946c3f4725847e714e7fcc9664ae67b2/paired_t_test.mtw">this data set</a> to see how all of the results are identical, including the mean difference, t-value, p-value, and confidence interval of the difference.</p>
<p style="margin-left: 40px;"><img alt="Minitab worksheet with paired t-test example" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/02fbcdbbf62fec3823123fbcc818b11f/paired_t_worksheet.png" style="width: 229px; height: 223px;" /><img alt="paired t-test output" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/170d6d4fa1fbbb1bf4f5aa56b1783b5f/paired_t_swo.png" style="width: 518px; height: 196px;" /></p>
<p style="margin-left: 40px;"><img alt="1-sample t-test output" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/08d652fb45599fc1ac247181a935c471/1t_difc_swo.png" style="width: 504px; height: 115px;" /></p>
<p>Understanding that the paired t-test simply performs a 1-sample t-test on the paired differences can really help you understand how the paired t-test works and when to use it. You just need to figure out whether it makes sense to calculate the difference between each pair of observations.</p>
<p>For example, let’s assume that “before” and “after” represent test scores, and there was an intervention in between them. If the before and after scores in each row of the example worksheet represent the same subject, it makes sense to calculate the difference between the scores in this fashion—the paired t-test is appropriate. However, if the scores in each row are for different subjects, it doesn’t make sense to calculate the difference. In this case, you’d need to use another test, such as the 2-sample t-test, which I discuss below.</p>
<p>Using the paired t-test simply saves you the step of having to calculate the differences before performing the t-test. You just need to be sure that the paired differences make sense!</p>
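<p>The same idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Minitab's implementation: the before and after scores are hypothetical, and <code>paired_t</code> does nothing more than compute the differences and hand them to a 1-sample t calculation with a null value of 0.</p>

```python
import math
import statistics


def one_sample_t(sample, null_value=0.0):
    """t-value for a 1-sample t-test: (mean - null) / standard error."""
    n = len(sample)
    return (statistics.mean(sample) - null_value) / (statistics.stdev(sample) / math.sqrt(n))


def paired_t(before, after):
    """Paired t-value: a 1-sample t-test on the row-by-row differences."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    differences = [a - b for b, a in zip(before, after)]
    return one_sample_t(differences, null_value=0.0)


# Hypothetical test scores for five subjects (each row is the same subject).
before = [70, 65, 80, 75, 60]
after = [74, 68, 85, 76, 67]
print(f"paired t = {paired_t(before, after):.3f}")
```

<p>If the rows did not represent the same subjects, the differences fed into <code>one_sample_t</code> would be meaningless, which is exactly why the pairing has to make sense before you use this test.</p>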
<p>When it is appropriate to use a paired t-test, it can be more powerful than a 2-sample t-test. For more information, go to <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/tests-of-means/why-use-paired-t/" target="_blank">Why should I use a paired t-test?</a></p>
How Two-Sample T-tests Calculate T-Values
<p>The 2-sample t-test takes your sample data from two groups and boils it down to the t-value. The process is very similar to the 1-sample t-test, and you can still use the analogy of the signal-to-noise ratio. Unlike the paired t-test, the 2-sample t-test requires independent groups for each sample.</p>
<p>The formula is shown below, followed by some discussion.</p>
<p style="margin-left: 40px;"><img alt="formula to calculate t for a 2-sample t-test" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/276994cf179b4997ce6097d1f4462363/formula_2t.png" style="width: 102px; height: 54px;" /></p>
<p>For the 2-sample t-test, the numerator is again the signal, which is the difference between the means of the two samples. For example, if the mean of group 1 is 10, and the mean of group 2 is 4, the difference is 6.</p>
<p>The default null hypothesis for a 2-sample t-test is that the means of the two groups are equal. You can see in the equation that when the two group means are equal, the difference (and the entire ratio) equals zero. As the difference between the two group means grows in either a positive or negative direction, the signal becomes stronger.</p>
<p>In a 2-sample t-test, the denominator is still the noise, but Minitab can use two different values. You can either assume that the variability in both groups is equal or not equal, and Minitab uses the corresponding estimate of the variability. Either way, the principle remains the same: you are comparing your signal to the noise to see how much the signal stands out.</p>
<p>Just like with the 1-sample t-test, for any given difference in the numerator, as you increase the noise value in the denominator, the t-value becomes smaller. To determine that the groups are different, you need a t-value that is large.</p>
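<p>To make the two choices for the denominator concrete, here is a small Python sketch. It assumes the standard textbook formulas (a pooled standard error when the variances are assumed equal, and the unpooled, Welch-style standard error otherwise); the exact details of Minitab's calculations are not shown in this post, and the data are hypothetical, chosen so that the group means are 10 and 4.</p>

```python
import math
import statistics


def two_sample_t(group1, group2, equal_variances=False):
    """t-value for a 2-sample t-test: difference in means over its standard error."""
    n1, n2 = len(group1), len(group2)
    signal = statistics.mean(group1) - statistics.mean(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    if equal_variances:
        # Pooled estimate: one shared variance for both groups.
        pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        noise = math.sqrt(pooled * (1 / n1 + 1 / n2))
    else:
        # Unpooled estimate: each group keeps its own variance.
        noise = math.sqrt(v1 / n1 + v2 / n2)
    return signal / noise


# Hypothetical data: group means are 10 and 4, so the signal is 6.
group1 = [8, 10, 12, 9, 11]
group2 = [1, 4, 7, 2, 6, 4, 4]
t_unpooled = two_sample_t(group1, group2, equal_variances=False)
t_pooled = two_sample_t(group1, group2, equal_variances=True)
print(f"unequal variances: t = {t_unpooled:.3f}")
print(f"equal variances:   t = {t_pooled:.3f}")
```

<p>The signal is identical in both cases. Only the estimate of the noise changes, which is why the two t-values differ slightly.</p>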
What Do t-Values Mean?
<p>Each type of t-test uses a procedure to boil all of your sample data down to one value, the t-value. The calculations compare your sample mean(s) to the null hypothesis and incorporate both the sample size and the variability in the data. A t-value of 0 indicates that the sample results exactly equal the null hypothesis. In statistics, we call the difference between the sample estimate and the null hypothesis the effect size. As this difference increases, the absolute value of the t-value increases.</p>
<p>That’s all nice, but what does a t-value of, say, 2 really mean? From the discussion above, we know that a t-value of 2 indicates that the observed difference is twice the size of the noise (the standard error) in your data. However, we use t-tests to evaluate hypotheses rather than just figuring out the signal-to-noise ratio. We want to determine whether the effect size is statistically significant.</p>
<p>To see how we get from t-values to assessing hypotheses and determining statistical significance, read the other post in this series, <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-t-tests-t-values-and-t-distributions">Understanding t-Tests: t-values and t-distributions</a>.</p>
Data Analysis, Hypothesis Testing, Learning, Statistics Help
Wed, 04 May 2016 12:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/understanding-t-tests%3A-1-sample%2C-2-sample%2C-and-paired-t-tests
Jim Frost
Understanding t-Tests: t-values and t-distributions
http://blog.minitab.com/blog/adventures-in-statistics/understanding-t-tests-t-values-and-t-distributions
<p>T-tests are handy <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-why-we-need-to-use-hypothesis-tests-in-statistics" target="_blank">hypothesis tests</a> in statistics when you want to compare means. You can compare a sample mean to a hypothesized or target value using a one-sample t-test. You can compare the means of two groups with a two-sample t-test. If you have two groups with paired observations (e.g., before and after measurements), use the paired t-test.</p>
<img alt="Output that shows a t-value" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/efd51d69e3947d70197143b735e0c51d/t_value_swo.png" style="line-height: 20.8px; float: right; width: 400px; height: 57px; margin: 10px 15px; border-width: 1px; border-style: solid;" />
<p>How do t-tests work? How do t-values fit in? In this series of posts, I’ll answer these questions by focusing on concepts and graphs rather than equations and numbers. After all, a key reason to use <a href="http://www.minitab.com/products/minitab">statistical software like </a><a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">Minitab</a> is so you don’t get bogged down in the calculations and can instead focus on understanding your results.</p>
<p>In this post, I will explain t-values, t-distributions, and how t-tests use them to calculate probabilities and assess hypotheses.</p>
What Are t-Values?
<p>T-tests are called t-tests because the test results are all based on t-values. T-values are an example of what statisticians call test statistics. A test statistic is a standardized value that is calculated from sample data during a hypothesis test. The procedure that calculates the test statistic compares your data to what is expected under the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/basics/null-and-alternative-hypotheses/" target="_blank">null hypothesis</a>.</p>
<p>Each type of t-test uses a specific procedure to boil all of your sample data down to one value, the t-value. The calculations behind t-values compare your sample mean(s) to the null hypothesis and incorporate both the sample size and the variability in the data. A t-value of 0 indicates that the sample results exactly equal the null hypothesis. As the difference between the sample data and the null hypothesis increases, the absolute value of the t-value increases.</p>
<p>Assume that we perform a t-test and it calculates a t-value of 2 for our sample data. What does that even mean? I might as well have told you that our data equal 2 fizbins! We don’t know if that’s common or rare when the null hypothesis is true.</p>
<p>By itself, a t-value of 2 doesn’t really tell us anything. T-values are not in the units of the original data, or anything else we’d be familiar with. We need a larger context in which we can place individual t-values before we can interpret them. This is where t-distributions come in.</p>
What Are t-Distributions?
<p>When you perform a t-test for a single study, you obtain a single t-value. However, if we drew multiple random samples of the same size from the same population and performed the same t-test, we would obtain many t-values and we could plot a distribution of all of them. This type of distribution is known as a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/basic-concepts/sampling-distribution/" target="_blank">sampling distribution</a>.</p>
<p>Fortunately, the properties of t-distributions are well understood in statistics, so we can plot them without having to collect many samples! A specific t-distribution is defined by its <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/basic-concepts/df/" target="_blank">degrees of freedom (DF)</a>, a value closely related to sample size. Therefore, a different t-distribution exists for each sample size. You can graph t-distributions using Minitab’s <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/graphs/graphs-of-distributions/probability-distribution-plots/probability-distribution-plot/" target="_blank">probability distribution plots</a>.</p>
<p>T-distributions assume that you draw repeated random samples from a population where the null hypothesis is true. You place the t-value from your study in the t-distribution to determine how consistent your results are with the null hypothesis.</p>
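<p>If you are curious what “drawing repeated random samples” looks like in practice, the thought experiment can be simulated. The Python sketch below is only an illustration: the population (normal, mean 100, standard deviation 15), the sample size of 21, and the number of simulated studies are all assumptions made up for this example. The null hypothesis value is set equal to the true population mean, so the null hypothesis is true in every simulated study.</p>

```python
import math
import random


def one_sample_t(sample, null_value):
    """t-value for a 1-sample t-test: (mean - null) / standard error."""
    n = len(sample)
    mean = sum(sample) / n
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return (mean - null_value) / math.sqrt(variance / n)


random.seed(1)  # fixed seed so the run is repeatable
NULL_MEAN, SIGMA, N, STUDIES = 100.0, 15.0, 21, 5000  # hypothetical settings

# One t-value per simulated study; the null hypothesis is true in all of them.
t_values = sorted(
    one_sample_t([random.gauss(NULL_MEAN, SIGMA) for _ in range(N)], NULL_MEAN)
    for _ in range(STUDIES)
)

share_within_2 = sum(-2 <= t <= 2 for t in t_values) / STUDIES
median_t = t_values[STUDIES // 2]
print(f"median t-value: {median_t:.3f}")
print(f"share of t-values between -2 and +2: {share_within_2:.3f}")
```

<p>A histogram of <code>t_values</code> approximates the t-distribution with 20 degrees of freedom: centered on zero, with most values falling between -2 and +2.</p>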
<p style="margin-left: 40px;"><img alt="Plot of t-distribution" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/d628e56f0380e0edcf575502a670ed31/t_dist_20_df.png" style="width: 576px; height: 384px;" /></p>
<p>The graph above shows a t-distribution that has 20 degrees of freedom, which corresponds to a sample size of 21 in a one-sample t-test. It is a symmetric, bell-shaped distribution that is similar to the normal distribution, but with thicker tails. This graph plots the probability density function (PDF), which describes the likelihood of each t-value.</p>
<p>The peak of the graph is right at zero, which indicates that obtaining a sample value close to the null hypothesis is the most likely. That makes sense because t-distributions assume that the null hypothesis is true. T-values become less likely as you get further away from zero in either direction. In other words, when the null hypothesis is true, you are less likely to obtain a sample that is very different from the null hypothesis.</p>
<p>Our t-value of 2 indicates a positive difference between our sample data and the null hypothesis. The graph shows that there is a reasonable probability of obtaining a t-value from -2 to +2 when the null hypothesis is true. Our t-value of 2 is an unusual value, but we don’t know exactly <em>how </em>unusual. Our ultimate goal is to determine whether our t-value is unusual enough to warrant rejecting the null hypothesis. To do that, we'll need to calculate the probability.</p>
Using t-Values and t-Distributions to Calculate Probabilities
<p>The foundation behind any hypothesis test is being able to take the test statistic from a specific sample and place it within the context of a known probability distribution. For t-tests, if you take a t-value and place it in the context of the correct t-distribution, you can calculate the probabilities associated with that t-value.</p>
<p>A probability allows us to determine how common or rare our t-value is under the assumption that the null hypothesis is true. If the probability is low enough, we can conclude that the effect observed in our sample is inconsistent with the null hypothesis. The evidence in the sample data is strong enough to reject the null hypothesis for the entire population.</p>
<p>Before we calculate the probability associated with our t-value of 2, there are two important details to address.</p>
<p>First, we’ll actually use the t-values of +2 and -2 because we’ll perform a two-tailed test. A two-tailed test is one that can test for differences in both directions. For example, a two-tailed 2-sample t-test can determine whether the difference between group 1 and group 2 is statistically significant in either the positive or negative direction. A one-tailed test can only assess one of those directions.</p>
<p>Second, we can only calculate a non-zero probability for a range of t-values. As you’ll see in the graph below, a range of t-values corresponds to a proportion of the total area under the distribution curve, which is the probability. The probability for any specific point value is zero because it does not produce an area under the curve.</p>
<p>With these points in mind, we’ll shade the area of the curve that has t-values greater than 2 and t-values less than -2.</p>
<p><img alt="T-distribution with a shaded area that represents a probability" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/5e124a2c8139681afec706799ebabcec/t_dist_prob.png" style="width: 576px; height: 384px;" /></p>
<p>The graph displays the probability for observing a difference from the null hypothesis that is at least as extreme as the difference present in our sample data while assuming that the null hypothesis is actually true. Each of the shaded regions has a probability of 0.02963, which sums to a total probability of 0.05926. When the null hypothesis is true, the t-value falls within these regions nearly 6% of the time.</p>
<p>This probability has a name that you might have heard of—it’s called the p-value! While the probability of our t-value falling within these regions is fairly low, it’s not low enough to reject the null hypothesis using the common <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-significance-levels-alpha-and-p-values-in-statistics" target="_blank">significance level</a> of 0.05.</p>
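<p>The shaded area can also be computed directly. The Python sketch below uses only the standard library: it evaluates the t-distribution’s probability density function and integrates it numerically with Simpson’s rule. Statistical software uses more refined methods, so treat this as a way to check the idea rather than as how Minitab does it; the function names are chosen here for illustration.</p>

```python
import math


def t_pdf(t, df):
    """Probability density of the t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))


def two_tailed_p(t_value, df, steps=2000):
    """Area beyond -|t| and +|t|, using Simpson's rule (steps must be even)."""
    upper = abs(t_value)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3       # area between 0 and |t|
    return 1.0 - 2.0 * area_zero_to_t    # the curve is symmetric and has total area 1


p_value = two_tailed_p(2.0, df=20)
print(f"two-tailed p-value for t = 2 with 20 DF: {p_value:.5f}")
```

<p>The result agrees with the total shaded probability in the graph, about 0.059.</p>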
<p><a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" target="_blank">Learn how to correctly interpret the p-value.</a></p>
t-Distributions and Sample Size
<p>As mentioned above, t-distributions are defined by the DF, which are closely associated with sample size. As the DF increases, the probability density in the tails decreases and the distribution becomes more tightly clustered around the central value. The graph below depicts t-distributions with 5 and 30 degrees of freedom.</p>
<p><img alt="Comparison of t-distributions with different degrees of freedom" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/5220dc6347611a230e89b70de904b034/t_dist_comp_df.png" style="width: 576px; height: 384px;" /></p>
<p>The t-distribution with fewer degrees of freedom has thicker tails. This occurs because the t-distribution is designed to reflect the added uncertainty associated with analyzing small samples. In other words, if you have a small sample, the probability that the sample statistic will be further away from the null hypothesis is greater even when the null hypothesis is true.</p>
<p>Small samples are more likely to be unusual. This affects the probability associated with any given t-value. For 5 and 30 degrees of freedom, a t-value of 2 in a two-tailed test has p-values of 10.2% and 5.4%, respectively. Large samples are better!</p>
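<p>To see the thicker tails in numbers rather than in a graph, the Python sketch below evaluates the probability density at a few t-values for 5 and 30 degrees of freedom, with the standard normal density added as a reference point. It relies only on the textbook density formulas, and the function names are chosen here for illustration.</p>

```python
import math


def t_pdf(t, df):
    """Probability density of the t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))


def normal_pdf(z):
    """Probability density of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)


for t in (0.0, 2.0, 3.0):
    print(f"t = {t:.0f}   5 DF: {t_pdf(t, 5):.4f}   "
          f"30 DF: {t_pdf(t, 30):.4f}   normal: {normal_pdf(t):.4f}")
```

<p>At the center, the distribution with 5 DF has the lowest density. Out in the tails the order flips, which is why the same t-value is less surprising when it comes from a small sample.</p>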
<p>I’ve shown how t-values and t-distributions work together to produce probabilities. To see how each type of t-test works and actually calculates the t-values, read the other post in this series, <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-t-tests:-1-sample,-2-sample,-and-paired-t-tests">Understanding t-Tests: 1-sample, 2-sample, and Paired t-Tests</a>.</p>
<p>If you'd like to learn how the ANOVA F-test works, read my post, <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-analysis-of-variance-anova-and-the-f-test">Understanding Analysis of Variance (ANOVA) and the F-test</a>.</p>
Data Analysis, Hypothesis Testing, Learning, Statistics Help
Wed, 20 Apr 2016 12:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/understanding-t-tests-t-values-and-t-distributions
Jim Frost
Best Way to Analyze Likert Item Data: Two Sample T-Test versus Mann-Whitney
http://blog.minitab.com/blog/adventures-in-statistics/best-way-to-analyze-likert-item-data%3A-two-sample-t-test-versus-mann-whitney
<p><img alt="Worksheet that shows Likert data" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/6b1cf78b969699ed58febb026d32051d/likert_worksheet.png" style="float: right; width: 162px; height: 265px; margin: 10px 15px;" />Five-point Likert scales are commonly associated with surveys and are used in a wide variety of settings. You’ve run into the Likert scale if you’ve ever been asked whether you strongly agree, agree, neither agree nor disagree, disagree, or strongly disagree about something. The worksheet to the right shows what five-point Likert data look like when you have two groups.</p>
<p>Because Likert item data are discrete, ordinal, and have a limited range, there’s been a longstanding dispute about the most valid way to analyze Likert data. The basic choice is between <a href="http://blog.minitab.com/blog/adventures-in-statistics/choosing-between-a-nonparametric-test-and-a-parametric-test" target="_blank">a parametric test and a nonparametric test</a>. The pros and cons for each type of test are generally described as the following:</p>
<ul>
<li>Parametric tests, such as the 2-sample t-test, assume a normal, continuous distribution. However, with a sufficient sample size, t-tests are robust to departures from normality.</li>
<li>Nonparametric tests, such as the Mann-Whitney test, do not assume a normal or a continuous distribution. However, there are concerns about a lower ability to detect a difference when one truly exists.</li>
</ul>
<p>What’s the better choice? This is a real-world decision that users of <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">statistical software</a> have to make when they want to analyze Likert data.</p>
<p>Over the years, a number of studies have tried to answer this question. However, they’ve tended to look at a limited number of potential distributions for the Likert data, which causes the generalizability of the results to suffer. Thanks to increases in computing power, simulation studies can now thoroughly assess a wide range of distributions.</p>
<p>In this blog post, I highlight a simulation study conducted by de Winter and Dodou* that compares the capabilities of the two-sample t-test and the Mann-Whitney test to analyze five-point Likert items for two groups. Is it better to use one analysis or the other?</p>
<p>The researchers identified a diverse set of 14 distributions that are representative of actual Likert data. The computer program drew independent pairs of samples to test all possible combinations of the 14 distributions. All in all, 10,000 random samples were generated for each of the 98 distribution combinations! The pairs of samples were analyzed using both the two-sample t-test and the Mann-Whitney test to compare how well each test performs. The study also assessed different sample sizes.</p>
<p>The results show that for all pairs of distributions the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/basics/type-i-and-type-ii-error/" target="_blank">Type I (false positive) error rates</a> are very close to the target amounts. In other words, if you use either analysis and your results are statistically significant, you don’t need to be overly concerned about a false positive.</p>
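<p>To give a feel for how such a simulation checks the Type I error rate, here is a heavily simplified Python sketch. It is not the researchers’ code and does not use their distributions. Everything specific in it is an assumption made for illustration: a single made-up Likert distribution shared by both groups (so the null hypothesis is true), 30 responses per group, a pooled two-sample t-test judged against an approximate hard-coded critical value for 58 degrees of freedom, and a Mann-Whitney test based on the normal approximation with a tie correction and no continuity correction.</p>

```python
import math
import random

LIKERT = [1, 2, 3, 4, 5]
WEIGHTS = [0.10, 0.20, 0.40, 0.20, 0.10]  # hypothetical response distribution
N = 30            # responses per group
T_CRIT = 2.0017   # approx. two-sided 5% critical t-value for 58 DF (30 + 30 - 2)
Z_CRIT = 1.96     # two-sided 5% critical value of the standard normal


def pooled_t(x, y):
    """Two-sample t-value using the pooled variance estimate."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    ss = sum((v - m1) ** 2 for v in x) + sum((v - m2) ** 2 for v in y)
    pooled_var = ss / (n1 + n2 - 2)
    if pooled_var == 0:
        return 0.0
    return (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


def mann_whitney_z(x, y):
    """Normal approximation to the Mann-Whitney U statistic, corrected for ties."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = sorted(x + y)
    rank_of, tie_term, i = {}, 0, 0
    while i < n:
        j = i
        while j < n and combined[j] == combined[i]:
            j += 1
        rank_of[combined[i]] = (i + 1 + j) / 2   # average of ranks i+1 .. j
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    u = sum(rank_of[v] for v in x) - n1 * (n1 + 1) / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 0.0
    return (u - n1 * n2 / 2) / math.sqrt(var_u)


random.seed(7)  # fixed seed so the run is repeatable
SIMS = 2000
t_rejects = mw_rejects = 0
for _ in range(SIMS):
    a = random.choices(LIKERT, WEIGHTS, k=N)
    b = random.choices(LIKERT, WEIGHTS, k=N)
    t_rejects += abs(pooled_t(a, b)) > T_CRIT
    mw_rejects += abs(mann_whitney_z(a, b)) > Z_CRIT

t_rate, mw_rate = t_rejects / SIMS, mw_rejects / SIMS
print(f"t-test false positive rate:       {t_rate:.3f}")
print(f"Mann-Whitney false positive rate: {mw_rate:.3f}")
```

<p>Because both groups come from the same distribution, every significant result is a false positive, and both rates should land near the 0.05 target. The published study repeats this kind of check across many pairs of distributions and several sample sizes.</p>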
<p>The results also show that for most pairs of distributions, the difference between the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/power-and-sample-size/what-is-power/" target="_blank">statistical power</a> of the two tests is trivial. In other words, if a difference truly exists at the population level, the two analyses are about equally likely to detect it. The concerns about the Mann-Whitney test having less power in this context appear to be unfounded.</p>
<p>I do have one caveat. There are a few pairs of specific distributions where there is a power difference between the two tests. If you perform both tests on the same data and they disagree (one is significant and the other is not), you can look at a table in the article to help you determine whether a difference in statistical power might be an issue. This power difference affects only a small minority of the cases.</p>
<p>Generally speaking, the choice between the two analyses is a tie. If you need to compare two groups of five-point Likert data, it usually doesn’t matter which analysis you use. Both tests almost always provide the same protection against false negatives and always provide the same protection against false positives. These patterns hold true for sample sizes of 10, 30, and 200 per group.</p>
<p>*de Winter, J.C.F. and D. Dodou (2010), Five-Point Likert Items: t test versus Mann-Whitney-Wilcoxon, <em>Practical Assessment, Research and Evaluation</em>, 15(11).</p>
Data Analysis, Hypothesis Testing, Statistics, Statistics Help
Wed, 06 Apr 2016 12:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/best-way-to-analyze-likert-item-data%3A-two-sample-t-test-versus-mann-whitney
Jim Frost
The American Statistical Association's Statement on the Use of P Values
http://blog.minitab.com/blog/adventures-in-statistics/the-american-statistical-associations-statement-on-the-use-of-p-values
<p>P values have been around for nearly a century and they’ve been the subject of criticism since their origins. In recent years, the debate over P values has risen to a fever pitch. In particular, there are serious fears that P values are misused to such an extent that the misuse has actually damaged science.</p>
<p>In March 2016, spurred on by the growing concerns, the American Statistical Association (ASA) did something that it had never done before: it took an official position on a statistical practice, namely how to use P values. The ASA tapped a group of 20 experts who discussed this over the course of many months. Despite facing complex issues and many heated disagreements, this group managed to reach a consensus on specific points and produce the <a href="http://amstat.tandfonline.com/doi/pdf/10.1080/00031305.2016.1154108" target="_blank">ASA Statement on Statistical Significance and P-values</a>.</p>
<p>I’ve written previously about my concerns over how P values have been misused and misinterpreted. My opinion is that P values are powerful tools but they need to be used and interpreted correctly. P value calculations incorporate the effect size, sample size, and variability of the data into a single number that objectively tells you how consistent your data are with the null hypothesis. You can read my case for the power of P values in my <a href="http://blog.minitab.com/blog/adventures-in-statistics/banned-p-values-and-confidence-intervals-a-rebuttal-part-1" target="_blank">rebuttal to a journal that banned them</a>.</p>
<p>The ASA statement contains the following six principles on how to use P values, which are remarkably aligned with my own views. Let’s take a look at what they came up with.</p>
<ol>
<li>P-values can indicate how incompatible the data are with a specified statistical model.</li>
<li>P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.</li>
</ol>
<p>I discuss these ideas in my post <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" target="_blank">How to Correctly Interpret P Values</a>. It turns out that the common misconception stated in principle #2 creates the illusion of substantially more evidence against the null hypothesis than is justified. There are a number of reasons <a href="http://blog.minitab.com/blog/adventures-in-statistics/why-are-p-value-misunderstandings-so-common" target="_blank">why this type of P value misunderstanding is so common</a>. In reality, a P value is a probability about your sample data and not about the truth of a hypothesis.</p>
<ol>
<li value="3">Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.</li>
</ol>
<p>In statistics, we’re working with samples to describe a complex reality. Attempting to discover the truth based on an oversimplified process of comparing a single P value to an arbitrary significance level is destined to have problems. False positives, false negatives, and otherwise fluky results are bound to happen.</p>
<p>Using P values in conjunction with a significance level to decide when to reject the null hypothesis increases your chance of making the correct decision. However, there is no magical threshold that distinguishes between the studies that have a true effect and those that don’t with 100% accuracy. You can see a graphical representation of why this is the case in my post <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-why-we-need-to-use-hypothesis-tests-in-statistics" target="_blank">Why We Need to Use Hypothesis Tests</a>.</p>
<p>When Sir Ronald Fisher introduced P values, he never intended for them to be the deciding factor in such a rigid process. Instead, Fisher considered them to be just one part of a process that incorporates scientific reasoning, experimentation, statistical analysis and replication to lead to scientific conclusions.</p>
<p>According to Fisher, “A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.”</p>
<p>In other words, don’t expect a <em>single</em> study to provide a definitive answer. No single P value can divine the truth about reality by itself.</p>
<ol>
<li value="4">Proper inference requires full reporting and transparency.</li>
</ol>
<p>If you don’t know the full context of a study, you can’t properly interpret a carefully selected subset of the results. Data dredging, cherry picking, significance chasing, data manipulation, and other forms of p-hacking can make it impossible to draw the proper conclusions from selectively reported findings. You must know the full details about all data collection choices, how many and which analyses were performed, and all P values.</p>
<p><img alt="Comic about jelly beans causing acne with selective reporting of the results" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/22099bc252d3630a4876f579c1b83778/jelly_bean_comic.png" style="line-height: 20.8px; width: 500px; height: 1387px; margin: 10px 15px;" /></p>
<div><span style="line-height: 1.6;">In the </span><a href="http://xkcd.com/882/" style="line-height: 1.6;" target="_blank">XKCD comic</a><span style="line-height: 1.6;"> about jelly beans, if you didn’t know about the post hoc decision to subdivide the data and the 20 insignificant test results, you’d be pretty convinced that green jelly beans cause acne!</span></div>
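<p>The arithmetic behind the comic is easy to check. The Python sketch below assumes that the 20 tests are independent, that the null hypothesis is true in every one of them, and that each uses a significance level of 0.05. Under those assumptions it computes the chance that at least one test comes out significant by luck alone. The function name is chosen here for illustration.</p>

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability of at least one false positive across independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** num_tests


for k in (1, 5, 20):
    print(f"{k:>2} tests: {chance_of_false_positive(k):.3f}")
```

<p>With 20 tests, the chance of at least one false positive is roughly 64%, which is why a single reported “significant” result means little unless you know how many tests were run.</p>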
<ol>
<li value="5">A p-value, or statistical significance, does not measure the size of an effect or the importance of an effect.</li>
<li value="6">By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.</li>
</ol>
<p>I cover these ideas, and more, in my <a href="http://blog.minitab.com/blog/adventures-in-statistics/five-guidelines-for-using-p-values">Five Guidelines for Using P Values</a>. P-values don’t tell you the size or importance of the effect. An effect can be statistically significant but trivial in the real world. This is the difference between <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/p-value-and-significance-level/practical-significance/" target="_blank">statistical significance and practical significance</a>. The analyst should supplement P values with other statistics, such as effect sizes and confidence intervals, to convey the importance of the effect.</p>
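<p>A quick numerical illustration shows how a trivial effect can become statistically significant. The Python sketch below holds a hypothetical effect fixed (a difference of 0.1 units from the null value, with a standard deviation of 10) and changes only the sample size in a 1-sample t calculation. The numbers are made up for illustration.</p>

```python
import math


def t_for(difference, sd, n):
    """t-value for a 1-sample test: difference / (sd / sqrt(n))."""
    return difference / (sd / math.sqrt(n))


# Hypothetical effect: 0.1 units, tiny compared with a standard deviation of 10.
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}: t = {t_for(0.1, 10, n):.1f}")
```

<p>The effect never changes, yet the t-value climbs from 0.1 to 10 as the sample grows. The P value shrinks accordingly, but the effect is just as small in practical terms as it was before.</p>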
<p>Researchers need to apply their scientific judgment about the plausibility of the hypotheses, results of similar studies, proposed mechanisms, proper experimental design, and so on. Expert knowledge transforms statistics from numbers into meaningful, trustworthy findings.</p>
Data Analysis, Hypothesis Testing, Learning, Statistics, Statistics in the News
Wed, 23 Mar 2016 12:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/the-american-statistical-associations-statement-on-the-use-of-p-values
Jim Frost
Do Actors Wait Longer than Actresses for Oscars? A Comparison Between Academy Award Winners
http://blog.minitab.com/blog/statistics-and-more/do-actors-wait-longer-than-actresses-for-oscars-a-comparison-between-academy-award-winners
<p><span style="line-height: 1.6;">I am a bit of an Oscar fanatic. Every year after the ceremony, I religiously go online to find out who won the awards and listen to their acceptance speeches. This year, I was <em>so </em>chuffed to learn that Leonardo DiCaprio won his first Oscar for his performance in <em>The Revenant</em> at the 88th Academy Awards, after five nominations in previous ceremonies. As a longtime DiCaprio fan, I still remember going to the cinema when <em>Titanic </em>was released, and returning four more times. Every time, I could not hold back my tears and used up all the tissues I'd brought with me!<img alt="this year's winner..." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a51cc79cd412237ef2f241d69e7e83ec/dicaprio.png" style="margin: 10px 15px; float: right; width: 190px; height: 250px;" /></span></p>
<p>Compared to his <em>Titanic </em>costar Kate Winslet, who won the Best Actress award in 2009 (aged 33), Leonardo waited 7 more years (20 years since his first nomination) before his turn came. I can name several actresses—Gwyneth Paltrow, Hilary Swank, and Jennifer Lawrence come immediately to mind—who obtained the award at younger ages. However, it appears that few young actors have received the Academy Award in recent years. This makes me wonder whether Oscar-winning actors tend to be older than Oscar-winning actresses.</p>
<p>To investigate, I collected the dates of past Academy Awards ceremonies and the birthdays of the winning actors and actresses. From these, I calculated the age of each winner on their Oscar-winning night. Below is a screenshot of some of the data.</p>
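<p>As a rough sketch of that calculation, the function below counts completed years and adds the days since the last birthday divided by 365. This convention is an assumption, not something stated in the post, although it does appear to reproduce the 80.8000 listed later for Jessica Tandy. The birth and ceremony dates are taken from public records, and the function name <code>age_on</code> is hypothetical.</p>

```python
from datetime import date


def age_on(birth: date, event: date) -> float:
    """Age in years: completed years plus (days since last birthday) / 365."""

    def birthday_in(year: int) -> date:
        try:
            return birth.replace(year=year)
        except ValueError:
            # Born on 29 February: use 1 March in non-leap years (assumption).
            return date(year, 3, 1)

    last_birthday = birthday_in(event.year)
    if last_birthday > event:
        last_birthday = birthday_in(event.year - 1)
    completed_years = last_birthday.year - birth.year
    return completed_years + (event - last_birthday).days / 365


# Dates assumed from public records, not taken from the post itself.
tandy = age_on(date(1909, 6, 7), date(1990, 3, 26))
print(round(tandy, 4))  # 80.8
```

<p>Dividing by 365 ignores leap days, so a different convention (for example, total days divided by 365.25) would shift the decimals slightly.</p>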
<p><img alt="oscars data" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ef494be723f2d7d3bb55d8f055124ad1/oscar1.png" style="width: 564px; height: 390px;" /></p>
<p>I used <a href="http://www.minitab.com/products/minitab">Minitab Statistical Software</a> to create a time series plot of the data, shown below.</p>
<p><img alt="time series plot" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/2d5d4f84cb67fbe6fd41aee118f40c6a/oscar2.png" style="width: 550px; height: 367px;" /></p>
<p><span style="line-height: 1.6;">The plot suggests that there is usually a substantial age difference between the Best Actor and Best Actress winners. There are more years when the Best Actor winner is much older than the Best Actress winner (blue dots above red dots) than years when the winning actress is older. Some examples:</span></p>
<p style="margin-left: 40px;">1992: Anthony Hopkins (54.2466), Jodie Foster (29.3616)</p>
<p style="margin-left: 40px;">1987: Paul Newman (62.1726), Marlee Matlin (21.5973)</p>
<p style="margin-left: 40px;">1989: Dustin Hoffman (51.6329), Jodie Foster (26.3507)</p>
<p style="margin-left: 40px;">1990: Daniel Day-Lewis (32.9068), Jessica Tandy (80.8000)</p>
<p style="margin-left: 40px;">1998: Jack Nicholson (60.9178), Helen Hunt (34.7699)</p>
<p style="margin-left: 40px;">2011: Colin Firth (50.4658), Natalie Portman (29.7205)</p>
<p style="margin-left: 40px;">2013: Daniel Day-Lewis (55.8247), Jennifer Lawrence (22.5288)</p>
<p><span style="line-height: 1.6;">There are not many occasions when the Best Actor and Best Actress winners fall in the same age bracket (both in their 30s, both in their 40s, and so on).</span></p>
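<p>The gaps in the examples listed above can be tabulated with a few lines of Python. This is only a sketch over the seven listed years, not the full data set, and the variable names are hypothetical.</p>

```python
# (year, actor age, actress age) copied from the examples listed above.
examples = [
    (1992, 54.2466, 29.3616),
    (1987, 62.1726, 21.5973),
    (1989, 51.6329, 26.3507),
    (1990, 32.9068, 80.8000),
    (1998, 60.9178, 34.7699),
    (2011, 50.4658, 29.7205),
    (2013, 55.8247, 22.5288),
]

# Positive gap: the actor was older. Negative gap: the actress was older.
gaps = {year: round(actor - actress, 4) for year, actor, actress in examples}
actor_older = [year for year, gap in gaps.items() if gap > 0]

for year in sorted(gaps):
    print(year, gaps[year])
print(len(actor_older), "of", len(gaps), "examples have the older actor")
```

<p>In this small selection, 1990 is the only year in which the actress was the older winner.</p>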
<p><a href="http://blog.minitab.com/blog/cpammer/planning-a-trip-to-disney-world%3A-using-statistics-to-keep-it-in-the-green">Conditional formatting</a> was introduced with the release of Minitab 17.2 and this is what I am going to use to identify any repeats in the data. </p>
<p style="margin-left: 40px;"><img alt="conditional formatting" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/9c70ddd1b5378dd004dac75a6dafaf31/oscar3.png" style="width: 505px; height: 213px;" /></p>
<p>Minitab applies the following conditional formatting to the data set:</p>
<p style="margin-left: 40px;"><img alt="conditional formatting" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/bfff88d02adb86b588b2ee63fc9d41a4/oscar4.png" style="width: 547px; height: 570px;" /></p>
<p>For the Best Actor award, Daniel Day-Lewis received the award on three occasions, while <span style="line-height: 20.8px;">Marlon Brando, Gary Cooper, Tom Hanks, Dustin Hoffman, Fredric March, Jack Nicholson, </span><span style="line-height: 20.8px;">Sean Penn, and Spencer Tracy each</span><span style="line-height: 1.6;"> won the award twice.</span></p>
<p>For the Best Actress category, Katharine Hepburn won four times. <span style="line-height: 20.8px;">Ingrid Bergman, Bette Davis, Olivia de Havilland, Sally Field, Jane Fonda, Jodie Foster, </span><span style="line-height: 20.8px;">Glenda Jackson, Vivien Leigh, Luise Rainer, Meryl Streep, Hilary Swank, and Elizabeth Taylor each</span><span style="line-height: 1.6;"> received the award twice.</span></p>
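<p>Outside Minitab, the same kind of repeat check can be sketched with a counter. The list below is a short, invented sequence that reuses a few names from the post; it is not the actual worksheet, and the counts only illustrate the mechanics.</p>

```python
from collections import Counter

# Hypothetical excerpt of a winners column (order and length invented).
winners = [
    "Daniel Day-Lewis",
    "Tom Hanks",
    "Tom Hanks",
    "Daniel Day-Lewis",
    "Colin Firth",
    "Daniel Day-Lewis",
]

counts = Counter(winners)
# Keep only the names that appear more than once.
repeats = {name: n for name, n in counts.items() if n > 1}
print(repeats)
```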
<p>Winners below the age of 30 could be regarded as obtaining the award at an early stage of their careers. Using conditional formatting again, I can quickly identify the actors and actresses in the data who are in this group.</p>
<p><img alt="conditional formatting" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a1397cfe27b867b0fae4b1da7271a945/oscar5.png" style="width: 496px; height: 210px;" /></p>
<p><span style="line-height: 1.6;">As shown below, a lot more actresses than actors obtain the award before the age of 30.</span></p>
<p><img alt="conditional formatted data" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/2dd1d117449fee623b47ce7c0062bb7d/oscar5a.png" style="width: 649px; height: 432px;" /></p>
<p><span style="line-height: 1.6;">To get a better comparison, I am going to remove the repeats (with the help of the highlighted cells) for actors and actresses who won more than once and only take into account their age at first win. This gives data from 79 Best Actor and 74 Best Actress winners. I am going to use <a href="http://www.minitab.com/products/minitab/assistant/">the Assistant</a> to carry out a comparison using the 2-sample t-test.</span></p>
<p><img alt="Assistant 2-sample t test" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/eab880dbb38bb33bbfb452693513610f/oscar6.png" style="width: 641px; height: 495px;" /></p>
<p><span style="line-height: 1.6;">Apart from generating easy-to-interpret output, the Assistant has the advantage of using Welch's approach for the t-test, which does not assume equal variances and remains reliable with unequal sample sizes.</span></p>
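<p>For readers curious about what the Welch approach computes, here is a minimal sketch of the test statistic and the Welch-Satterthwaite degrees of freedom. The ages are invented for illustration and are not the Oscar data, and the function name <code>welch_t</code> is hypothetical. Turning t and df into a p-value needs a t distribution, which is left to statistical software.</p>

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    # Squared standard error contributed by each sample.
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical ages, invented for illustration (not the Oscar data).
actors = [38.0, 41.5, 44.0, 47.2, 52.3, 60.1]
actresses = [24.5, 27.0, 29.8, 33.1, 35.6]
t, df = welch_t(actors, actresses)
print(f"t = {t:.3f}, df = {df:.1f}")
```

<p>Because each sample keeps its own variance, the two groups do not need equal spreads or equal sizes.</p>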
<p><img alt="Report Card" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ac73bee979d25f49f4221170e9769956/oscar7.png" style="line-height: 1.6; width: 650px; height: 488px;" /></p>
<p><span style="line-height: 1.6;">The Report Card indicates that we have sufficient data and the assumptions of the t-test are fulfilled. However, Minitab also detects some unusual data, which I will look into further.</span></p>
<p><img alt="2 sample t test diagnostic report" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/384df9dc72ef6868ed050f65d663e1bc/oscar8.png" style="width: 650px; height: 487px;" /></p>
<p><span style="line-height: 1.6;">Using the brush, the following unusual data are identified.</span></p>
<p style="margin-left: 40px;"><strong>Best Actor: </strong><br />
<span style="line-height: 1.6;">John Wayne (62.8658)</span><br />
<span style="line-height: 1.6;">Henry Fonda (76.8685)</span></p>
<p>These winners were considerably older than most: the majority of the Best Actor winners were in their 40s and 50s.</p>
<p style="margin-left: 40px;"><strong>Best Actress:</strong><br />
<span style="line-height: 1.6;">Marie Dressler (63.0027)</span><br />
<span style="line-height: 1.6;">Geraldine Page (61.3342)</span><br />
<span style="line-height: 1.6;">Jessica Tandy (80.8000)</span><br />
<span style="line-height: 1.6;">Helen Mirren (61.5863)</span></p>
<p>These winners were considerably older than most: the majority of the Best Actress winners were in their late 30s and 40s.</p>
<p><img alt="2-sample t test summary report" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/8d4c39828bcad9ffdc0e151771078244/oscar9.png" style="width: 650px; height: 488px;" /></p>
<p><span style="line-height: 1.6;">The Summary Report provides the key output of the t-test. The mean age of Best Actor winners is 43.746, while the mean age of Best Actress winners is 35. The p-value of the test is very small (&lt;0.001). This means that we have enough evidence to suggest that, on average, the Best Actor winner is older than the Best Actress winner.</span></p>
<p><span style="line-height: 1.6;">I will leave it to others to speculate (and perhaps even use data to explore) why this apparent age gap exists. However, whatever their ages, we all enjoy seeing these Oscar winners' amazing performances on the big screen!</span></p>
<p><span style="font-size: 8px; line-height: 1.6;">Photograph of Leonardo DiCaprio by <a href="https://www.flickr.com/photos/phototoday2008/11933209533/" target="_blank">See Li</a>, used under Creative Commons 2.0. </span></p>
Fun Statistics | Hypothesis Testing | Statistics in the News
Mon, 07 Mar 2016 13:00:00 +0000
http://blog.minitab.com/blog/statistics-and-more/do-actors-wait-longer-than-actresses-for-oscars-a-comparison-between-academy-award-winners
Eugenie Chung
How to Compare Regression Slopes
http://blog.minitab.com/blog/adventures-in-statistics/how-to-compare-regression-lines-between-different-models
<p>If you perform linear regression analysis, you might need to compare different regression lines to see if their constants and slope coefficients are different. Imagine there is an established relationship between X and Y. Now, suppose you want to determine whether that relationship has changed. Perhaps there is a new context, process, or some other qualitative change, and you want to determine whether that affects the relationship between X and Y.</p>
<p>For example, you might want to assess whether the relationship between the height and weight of football players is significantly different than the same relationship in the general population.</p>
<p>You can graph the regression lines to visually compare the slope coefficients and constants. However, you should also statistically test the differences. <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-why-we-need-to-use-hypothesis-tests-in-statistics" target="_blank">Hypothesis testing</a> helps separate the true differences from the random differences caused by sampling error so you can have more confidence in your findings.</p>
<p>In this blog post, I’ll show you how to compare a relationship between different regression models and determine whether the differences are statistically significant. Fortunately, these tests are easy to do using <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">Minitab statistical software</a>.</p>
<p>In the example I’ll use throughout this post, there is an input variable and an output variable for a hypothetical process. We want to compare the relationship between these two variables under two different conditions. Here is the <a href="//cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/File/569a0e7d067944f6f9147434794efcd6/comparingregressionmodels.MPJ">Minitab project file</a> with the data.</p>
Comparing Constants in Regression Analysis
<p>When the <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-to-interpret-the-constant-y-intercept" target="_blank">constants</a> (or y intercepts) in two different regression equations are different, this indicates that the two regression lines are shifted up or down on the Y axis. In the scatterplot below, you can see that the Output from Condition B is consistently higher than Condition A for any given Input value. We want to determine whether this vertical shift is statistically significant.</p>
<p><img alt="Scatterplot with two regression lines that have different constants." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/2ed27f4204515bac9d9674c16fa0c0f7/scatter_constant_dift.png" style="width: 576px; height: 384px;" /></p>
<p>To test the difference between the constants, we just need to include a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/data-concepts/cat-quan-variable/" target="_blank">categorical variable</a> that identifies the qualitative attribute of interest in the model. For our example, I have created a variable for the condition (A or B) associated with each observation.</p>
<p>To fit the model in Minitab, I’ll use: <strong>Stat > Regression > Regression > Fit Regression Model</strong>. I’ll include <em>Output</em> as the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/regression-models/what-are-response-and-predictor-variables/" target="_blank">response variable</a>, <em>Input</em> as the continuous <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/regression-models/what-are-response-and-predictor-variables/" target="_blank">predictor</a>, and <em>Condition</em> as the categorical predictor.</p>
<p>In the regression analysis output, we’ll first check the coefficients table.</p>
<p style="margin-left: 40px;"><img alt="Coefficients table that shows that the constants are different" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/23657868f2cf893d216d05d3400ab9e6/coeff_constant_dift.png" style="width: 369px; height: 117px;" /></p>
<p>This table shows us that the relationship between Input and Output is statistically significant because the p-value for Input is 0.000.</p>
<p>The <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients" target="_blank">coefficient</a> for Condition is 10 and its <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients" target="_blank">p-value</a> is significant (0.000). The coefficient tells us that the vertical distance between the two regression lines in the scatterplot is 10 units of Output. The p-value tells us that this difference is statistically significant—you can reject the null hypothesis that the distance between the two constants is zero. You can also see the difference between the two constants in the regression equation table below.</p>
<p style="margin-left: 40px;"><img alt="Regression equation table that shows constants that are different" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/a879996e37ebb05a297721e695a71943/equ_constant_dift.png" style="width: 305px; height: 113px;" /></p>
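<p>To see where the Condition coefficient comes from, the sketch below fits the same kind of model (one shared slope plus a 0/1 indicator for Condition B) by least squares on invented, noise-free data. The data, the variable names, and the 0/1 coding are assumptions for illustration; this is not the project file from the post, and it does not compute p-values.</p>

```python
from statistics import mean

# Hypothetical, noise-free data: both conditions share a slope of 2,
# and Condition B sits 10 units above Condition A.
inputs = [1.0, 2.0, 3.0, 4.0, 5.0]
data = {
    "A": [(x, 5.0 + 2.0 * x) for x in inputs],
    "B": [(x, 15.0 + 2.0 * x) for x in inputs],
}

# Least-squares fit of Output = b0 + b1*Input + b2*(Condition == B).
# The common slope b1 pools the within-condition deviations.
sxy = 0.0
sxx = 0.0
centers = {}
for cond, points in data.items():
    x_bar = mean(x for x, _ in points)
    y_bar = mean(y for _, y in points)
    centers[cond] = (x_bar, y_bar)
    sxy += sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx += sum((x - x_bar) ** 2 for x, _ in points)

slope = sxy / sxx
intercepts = {c: y_bar - slope * x_bar for c, (x_bar, y_bar) in centers.items()}
# The Condition coefficient is the vertical distance between the two lines.
condition_coef = intercepts["B"] - intercepts["A"]
print(slope, intercepts["A"], condition_coef)
```

<p>With real, noisy data the estimate would differ from the true shift, which is why the p-value for the Condition term matters.</p>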
Comparing Coefficients in Regression Analysis
<p>When two slope coefficients are different, a one-unit change in a predictor is associated with different mean changes in the response. In the scatterplot below, it appears that a one-unit increase in Input is associated with a greater increase in Output in Condition B than in Condition A. We can <em>see</em> that the slopes look different, but we want to be sure this difference is statistically significant.</p>
<p><img alt="Scatterplot that shows two slopes that are different" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/200c12087fdf7eecd9b773d9ce213020/scatter_slope_dift.png" style="width: 576px; height: 384px;" /></p>
<p>How do you statistically test the difference between regression coefficients? It sounds like it might be complicated, but it is actually very simple. We can even use the same Condition variable that we did for testing the constants.</p>
<p>We need to determine whether the coefficient for Input depends on the Condition. In statistics, when we say that the effect of one variable depends on another variable, that’s an interaction effect. All we need to do is include the interaction term for Input*Condition!</p>
<p>In Minitab, you can specify interaction terms by clicking the <strong>Model</strong> button in the main regression dialog box. After I fit the regression model with the interaction term, we obtain the following coefficients table:</p>
<p style="margin-left: 40px;"><img alt="Coefficients table that shows different slopes" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/f06eff56f2266d0ff7e3919aa1292285/coeff_slope_dift.png" style="width: 410px; height: 154px;" /></p>
<p>The table shows us that the interaction term (Input*Condition) is statistically significant (p = 0.000). Consequently, we reject the null hypothesis and conclude that the difference between the two coefficients for Input (below, 1.5359 and 2.0050) does not equal zero. We also see that the main effect of Condition is not significant (p = 0.093), which indicates that the difference between the two constants is not statistically significant.</p>
<p style="margin-left: 40px;"><img alt="Regression equation table that shows different slopes" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/d5e5142c0ff13645d1dacc3e2c0bee27/equ_coeff_dift.png" style="width: 295px; height: 105px;" /></p>
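<p>One way to read the interaction coefficient: when the model includes Input, Condition, and Input*Condition, the fitted lines are the same as those from fitting each condition separately, so the interaction coefficient equals the difference between the two slopes. The sketch below shows this on invented, noise-free data (assuming a 0/1 coding for Condition); the helper <code>fit_line</code> is hypothetical, and no p-values are computed.</p>

```python
from statistics import mean


def fit_line(points):
    """Ordinary least-squares (intercept, slope) for a list of (x, y) pairs."""
    x_bar = mean(x for x, _ in points)
    y_bar = mean(y for _, y in points)
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical, noise-free data: same constant, different slopes.
inputs = [0.0, 2.0, 4.0, 6.0, 8.0]
cond_a = [(x, 1.0 + 1.5 * x) for x in inputs]
cond_b = [(x, 1.0 + 2.0 * x) for x in inputs]

const_a, slope_a = fit_line(cond_a)
const_b, slope_b = fit_line(cond_b)

interaction_coef = slope_b - slope_a  # what the Input*Condition term estimates
condition_coef = const_b - const_a    # what the Condition main effect estimates
print(interaction_coef, condition_coef)
```

<p>The separate fits give the same coefficients as the single interaction model, but the single model is what supplies a hypothesis test for the slope difference.</p>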
<p>It is easy to compare and test the differences between the constants and coefficients in regression models by including a categorical variable. These tests are useful when you can see differences between regression models and you want to defend your conclusions with p-values.</p>
<p>If you're learning about regression, read my <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-tutorial-and-examples">regression tutorial</a>!</p>
Data Analysis | Hypothesis Testing | Regression Analysis | Statistics Help
Wed, 13 Jan 2016 13:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/how-to-compare-regression-lines-between-different-models
Jim Frost
Checking the “Naughty” or “Nice” Assessment with Attribute Agreement Analysis
http://blog.minitab.com/blog/using-data-and-statistics/checking-the-naughty-or-nice-assessment-with-attribute-agreement-analysis
<p><span style="line-height: 1.6;">Each year Santa’s Elves have to take all the information provided by family, friends and teachers to determine if all the children of the world have been “Naughty” or “Nice.” This is no small task, as according to the website </span><a href="http://www.santafaqs.com/" style="line-height: 1.6;">www.santafaqs.com</a><span style="line-height: 1.6;"> Santa delivers over 5 billion presents per year. </span></p>
<p><span style="line-height: 1.6;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/0b9beb7f6ce672b36e9141b1bc3d3826/elf_classifying.png" style="margin: 10px 15px; float: right; width: 200px; height: 194px;" />Not only is it a large task in terms of size, but it is critical that the Elves have a consistent approach to this assessment. Santa does not want to give presents to naughty children, but he is adamant that he would rather mistakenly give a present to a naughty child than run the risk of <em>not</em> giving a present to a nice child. </span></p>
<p>For this reason, every summer Santa trains all his staff on separating people into the “Naughty” and “Nice” categories, and then he gives them a final test on a set of characters whose behaviour category is already known. For each of these 50 characters, Santa gives the Elves details of their behaviour as reported by their family, friends and work colleagues, and the Elves give each character a Naughty or Nice grade. To set up the test and analyse his new Elf recruits' performance, Santa uses an <span><a href="http://blog.minitab.com/blog/understanding-statistics/got-good-judgment-prove-it-with-attribute-agreement-analysis">Attribute Agreement Analysis</a></span>.</p>
<p>The full list of characters and their grades can be seen in this Minitab project file: <a href="//cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/edccea06e97c99500398e5f26bf71e23/elf_test.MPJ">elf-test.mpj</a>. If you don't already have Minitab and you'd like to give Attribute Agreement Analysis a try with this data set, you can <a href="http://www.minitab.com/en-us/products/minitab/free-trial/">download the free 30-day trial</a>. </p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/006aec47ff21e0f2c53ff20bb3c8aaf7/naughty_1.png" style="border-width: 0px; border-style: solid; width: 282px; height: 330px;" /></p>
<p><span style="line-height: 1.6;">The first thing Santa has to do is create an Attribute Agreement Worksheet. This ensures that each Elf evaluates all the characters in a random order, and it creates a Minitab worksheet that includes the expected category (Naughty or Nice) for each person so that Santa or one of his helpers can quickly enter the Elves' assessments. </span></p>
<p>To avoid any pre-judgement the Elves do not see the name of the person they are assessing—only their Sample No and the information from family and friends.</p>
<p>The steps he follows are:</p>
<ol>
<li><strong>File > Open Project > Elf-Test.mpj</strong></li>
<li><strong>Assistant > Measurement System Analysis (MSA) > Attribute Agreement Worksheet</strong></li>
</ol>
<p>Santa completes the dialog box as follows and clicks OK. He then prints off the data collection sheets and gets the new Elves to assess the information for each of the people on the list and categorise them as Naughty or Nice. Once he has this information, he enters it into the Minitab worksheet.</p>
<p><img alt="Attribute Agreement Analysis worksheet" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/367d2c66939af738133c5e34ef72dabf/naughty_2.png" style="border-width: 0px; border-style: solid; width: 585px; height: 418px;" /></p>
<p>Once Santa has collected all this data, he runs the Attribute Agreement Analysis in the Assistant and gets the following results:</p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/058a59ecbcb6d9b30f6bb90f896e7f9e/naughty_3.png" style="border-width: 0px; border-style: solid; width: 496px; height: 137px;" /></p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/cd34f968224601cfacac3a635117c05b/naughty_4.png" style="border-width: 0px; border-style: solid; width: 567px; height: 575px;" /></p>
<p>Santa is happy with the overall error rate. However, he is very concerned that the percentage of Nice people being rated as Naughty is higher than the overall error rate. This means that there are some good people that may not get presents. This is not acceptable, so he uses another report produced by Minitab to investigate which people are being mis-classified.</p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a16a40bd7ae48ca184dc665b6c3727f3/naughty_5.png" style="border-width: 0px; border-style: solid; width: 549px; height: 318px;" /></p>
<p>This chart shows which samples were misclassified as Naughty.</p>
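<p>The error rates in these reports are simple proportions, and a minimal sketch shows how they could be computed. The rows below are invented for illustration and are not the elf_test data; the variable names are hypothetical.</p>

```python
from collections import Counter

# Hypothetical assessments: (sample number, standard, elf's rating).
# These rows are invented for illustration; they are not the elf_test data.
ratings = [
    (1, "Nice", "Nice"), (1, "Nice", "Nice"),
    (2, "Naughty", "Naughty"), (2, "Naughty", "Nice"),
    (3, "Nice", "Naughty"), (3, "Nice", "Naughty"),
    (4, "Naughty", "Naughty"), (4, "Naughty", "Naughty"),
    (5, "Nice", "Nice"), (5, "Nice", "Naughty"),
]


def pct(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


errors = [(s, std, got) for s, std, got in ratings if std != got]
overall_error = pct(len(errors), len(ratings))

nice_total = sum(1 for _, std, _ in ratings if std == "Nice")
naughty_total = len(ratings) - nice_total
nice_as_naughty = pct(sum(1 for _, std, _ in errors if std == "Nice"), nice_total)
naughty_as_nice = pct(sum(1 for _, std, _ in errors if std == "Naughty"), naughty_total)

# Which samples were wrongly rated Naughty, and how many times?
misrated_naughty = Counter(s for s, std, got in errors if got == "Naughty")

print(overall_error, nice_as_naughty, naughty_as_nice)
print(misrated_naughty.most_common())
```

<p>Splitting the errors by direction matters here because Santa treats the two kinds of mistake very differently.</p>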
<p>Santa is worried because every Elf said person 26 was Naughty when the standard was Nice. When Santa looks at the Elf-Test Worksheet, he can see that person 26 was Sherlock Holmes. Santa checks the information on him and can see why the Elves think he is naughty: he smokes and the neighbours have complained that he plays his violin (badly) at all hours of the day and night. Santa provides extra training to the Elves to help them realise that musicians only improve if they practise regularly, so the neighbours will have to suffer.</p>
<p>Characters 24, 40 and 49 (Little Red Riding Hood, Stuart Little and Shrek, respectively) were each misclassified only once, so Santa wants to find out which Elves made the wrong decision in these cases. Again, he uses one of the reports that the Assistant produces as standard.</p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/e960ad692f6856d825475bfc1373de29/naughty_6.png" style="border-width: 0px; border-style: solid; width: 545px; height: 344px;" /></p>
<p>From this report, Santa can see that Berry is the strictest Elf and the one who has made the most mistakes classifying Nice people as Naughty. For this reason, Santa decides to reassign Berry to the reindeer welfare department.</p>
<p>Jingle and Sparkle are now full-time Niceness monitors, and Santa is sure, thanks to his training program and the Attribute Agreement Analysis completed in Minitab, that <em>everyone</em> will get the presents they deserve this year.</p>
<p>If, like Santa, you have to make qualitative assessments of your products or services, an Attribute Agreement Analysis is a good way to verify and improve the performance of your assessors.</p>
Fun Statistics | Hypothesis Testing | Quality Improvement
Wed, 23 Dec 2015 13:00:00 +0000
http://blog.minitab.com/blog/using-data-and-statistics/checking-the-naughty-or-nice-assessment-with-attribute-agreement-analysis
Gillian Groom
Why Are P Value Misunderstandings So Common?
http://blog.minitab.com/blog/adventures-in-statistics/why-are-p-value-misunderstandings-so-common
<p><img alt="Danger thin ice sign" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/694cbaccbcb94c40ba77ec6a967994d7/thin_ice_sign.jpg" style="float: right; width: 225px; height: 300px; margin: 15px 10px;" />I’ve written a fair bit about P values: <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" target="_blank">how to correctly interpret P values</a>, <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-significance-levels-alpha-and-p-values-in-statistics" target="_blank">a graphical representation of how they work</a>, <a href="http://blog.minitab.com/blog/adventures-in-statistics/five-guidelines-for-using-p-values" target="_blank">guidelines for using P values</a>, and why the <a href="http://blog.minitab.com/blog/adventures-in-statistics/banned-p-values-and-confidence-intervals-a-rebuttal-part-1" target="_blank">P value ban in one journal is a mistake</a>. Along the way, I’ve received many questions about P values, but the questions from one reader stand out.</p>
<p>This reader asked, <em>why</em> is it so easy to interpret P values incorrectly? Why is the common misinterpretation <em>so</em> pervasive? And, what can be done about it? He wasn’t sure if these were fair questions, but I think they are. Let’s answer them!</p>
The Correct Way to Interpret P Values
<p>First, to make sure we’re on the same page, here’s the correct definition of P values.</p>
<p>The P value is the probability of obtaining an effect at least as extreme as the one in your sample data, assuming the truth of the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/basics/null-and-alternative-hypotheses/" target="_blank">null hypothesis</a>. In other words, if the null hypothesis is true, the P value is the probability of obtaining sample data at least as extreme as yours. It answers the question, are your sample data unusual if the null hypothesis is true?</p>
<p>If you’re thinking that the P value is the probability that the null hypothesis is true, the probability that you’re making a mistake if you reject the null, or anything else along these lines, that’s the most common misunderstanding. You should click the links above to learn how to correctly interpret P values.</p>
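<p>A small worked example may make the definition concrete. Suppose, hypothetically, that a coin lands heads 15 times in 20 flips and the null hypothesis is that the coin is fair. The sketch below adds up the null probabilities of every outcome at least as far from 10 heads as the one observed. The scenario and the function name are assumptions made for illustration.</p>

```python
from math import comb


def two_sided_binomial_p(heads: int, flips: int) -> float:
    """Exact two-sided P value under a fair-coin null hypothesis."""
    expected = flips / 2
    distance = abs(heads - expected)
    # Count the equally likely sequences whose head count is at least as
    # far from the expected count as the observed one.
    extreme = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - expected) >= distance
    )
    return extreme / 2 ** flips


print(round(two_sided_binomial_p(15, 20), 4))  # 0.0414
```

<p>The result says that a fair coin would produce a result this lopsided about 4% of the time. It does not say there is a 4% chance that the coin is fair.</p>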
Historical Circumstances Helped Make P Values Confusing
<p>This problem is nearly a century old and goes back to two very antagonistic camps from the early days of hypothesis testing: Fisher's measures of evidence approach (P values) and the Neyman-Pearson error rate approach (alpha). Fisher believed in inductive reasoning, which is the idea that we can use sample data to learn about a population. On the other side, the Neyman-Pearson methodology does not allow analysts to learn from individual studies. Instead, the results only apply to a long series of tests.</p>
<p>Courses and textbooks have mushed these disparate approaches together into the standard hypothesis-testing procedure that is known and taught today. This procedure <em>seems </em>like a seamless combination but it's really a muddled, Frankenstein's-monster combination of sometimes-contradictory methods that has promoted the confusion. The end result of this fusion is that P values are incorrectly entangled with the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/basics/type-i-and-type-ii-error/" target="_blank">Type I error rate</a>. Fisher tried to clarify this misunderstanding for decades, but to no avail.</p>
P Values Aren’t What We <em>Really </em>Want to Know
<p>The common misconception is what we'd <em>really</em> like to know. We’d <em>loooove</em> to know the probability that a hypothesis is correct, or the probability that we’re making a mistake. What we get instead is the probability of our <em>observation</em>, which just isn’t as useful.</p>
<p>It would be great if we could take evidence solely from a sample and determine the probability that the sample is wrong. Unfortunately, that's not possible—for logical reasons when you think about it. Without outside information, a sample can’t tell you whether it’s representative of the population.</p>
<p>P values are based exclusively on information contained within a sample. Consequently, P values can't answer the question that we most want answered, but there seems to be an irresistible temptation to interpret them that way.</p>
P Values Have a Convoluted Definition
<p>The correct definition of a P value is fairly convoluted. The definition is based on the probability of observing what you actually did observe (huh?), but in a hypothetical context (a true null hypothesis), and it includes strange wording about results that are at least as extreme as what you observed. It's hard to understand all of that without a lot of study. It's just not intuitive.</p>
<p>Unfortunately, there is no simple <em>and</em> accurate definition that can help counteract the pressures to believe in the common misinterpretation. In fact, the incorrect definition <em>sounds</em> so much simpler than the correct definition. Shoot, <a href="http://fivethirtyeight.com/features/not-even-scientists-can-easily-explain-p-values/" target="_blank">not even scientists can explain P values</a>! And, so the misconceptions live on.</p>
What Can Be Done?
<p>Historical circumstances have conspired to confuse the issue. We have a natural tendency to want P values to mean something else. And, there is no simple yet correct definition for P values that can counteract the common misunderstandings. No wonder this has been a problem for a long time!</p>
<p>Fisher tried in vain to correct this misinterpretation but didn't have much luck. As for myself, I hope to point out that what may seem like a semantic difference between the correct and incorrect definitions actually equates to a huge difference.</p>
<p>Using the incorrect definition is likely to come back to bite you! If you think a P value of 0.05 equates to a 5% chance of a mistake, boy, are you in for a big surprise—because it’s often around 26%! Instead, based on middle-of-the-road assumptions, you’ll need a P value around 0.0027 to achieve an error rate of about 5%. However, <a href="http://blog.minitab.com/blog/adventures-in-statistics/not-all-p-values-are-created-equal" target="_blank">not all P values are created equal</a> in terms of the error rate.</p>
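<p>To get a feel for why the gap is so large, here is a sketch based on one published calibration, the −e·p·ln(p) lower bound on the Bayes factor from Sellke, Bayarri, and Berger. This is not necessarily the calculation behind the figures quoted above, and it assumes a 50/50 prior chance that the null hypothesis is true, so the results land in the same region rather than matching exactly: roughly 29% for P = 0.05 and roughly 4% for P = 0.0027. The function name is hypothetical.</p>

```python
from math import e, log


def min_false_positive_risk(p: float, prior_null: float = 0.5) -> float:
    """Lower bound on P(null is true | P value), via the -e*p*ln(p) bound."""
    if not 0 < p < 1 / e:
        raise ValueError("the bound applies only for 0 < p < 1/e")
    # Smallest Bayes factor (null versus alternative) the bound allows.
    bayes_factor = -e * p * log(p)
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)


for p_value in (0.05, 0.0027):
    print(p_value, round(min_false_positive_risk(p_value), 3))
```

<p>Because this is a lower bound, the actual error rate under a given set of assumptions can be higher.</p>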
<p>I also think that P values are easier for most people to understand graphically than through the tricky definition and the math. So, I wrote a series of blog posts that graphically show <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-why-we-need-to-use-hypothesis-tests-in-statistics" target="_blank">why we need hypothesis testing and how it works</a>.</p>
<p>I have no reason to expect that I'll have any more impact than Fisher did himself, but it's an attempt!</p>
Hypothesis TestingLearningStatisticsThu, 10 Dec 2015 13:00:00 +0000http://blog.minitab.com/blog/adventures-in-statistics/why-are-p-value-misunderstandings-so-commonJim FrostWhy You Should Use Non-parametric Tests when Analyzing Data with Outliers
http://blog.minitab.com/blog/applying-statistics-in-quality-projects/why-you-should-use-non-parametric-tests-when-analyzing-data-with-outliers
<p>There are many reasons why a distribution might not be normal/Gaussian. A non-normal pattern might be caused by several distributions being mixed together, a drift over time, one or more outliers, asymmetrical behavior, or out-of-control points.</p>
<p>I recently collected the scores of three different teams (the Blue team, the Yellow team and the Pink team) after a laser tag game session one Saturday afternoon. The three teams represented three different groups of friends wishing to spend their afternoon tagging players from competing teams. Gengiz Khan turned out to be the best player, followed by Tarantula and Desert Fox.</p>
One-Way ANOVA
<p>In this post, I will focus on team performances, not on single individuals. I decided to compare the average scores of each team. The best tool I could possibly think of was a one-way ANOVA using the Minitab <a href="http://www.minitab.com/products/minitab/assistant/">Assistant</a> (with a continuous Y response and three sample means to compare).</p>
<p>To assess statistical significance, the differences <em>between </em>team averages are compared to the <em>within </em>(team) variability. A large between-team variability compared to a small within-team variability (the error term) means that the differences between teams are statistically significant.</p>
<p>In this comparison (see the output from the Assistant below), the <a href="http://blog.minitab.com/blog/understanding-statistics/what-can-you-say-when-your-p-value-is-greater-than-005">P value was 0.053, just above the 0.05</a> threshold that is usually applied. The P value is the probability of observing differences in means at least this large if the teams truly had the same mean and only random causes were at work. A p-value above 0.05, therefore, indicates that differences of this size would not be particularly unusual under random variation alone. Because of that, the differences are not considered to be statistically significant (there is "not enough evidence that there are significant differences," according to the comments in Minitab Assistant). But the result remains somewhat ambiguous since the p-value is still very close to the significance limit (0.05).</p>
<p>Note that the variability within the Blue team seems to be much larger (see the confidence interval plot in the means comparison chart below) than for the other two groups. This is not a cause for concern in this case, since the Minitab Assistant uses the <a href="http://blog.minitab.com/blog/adventures-in-statistics/did-welchs-anova-make-fishers-classic-one-way-anova-obsolete">Welch method of ANOVA</a>, which does not require or assume variances within groups to be equal.</p>
<p><img height="468" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/8457900262f468b76f8d2a4f28027c2d/8457900262f468b76f8d2a4f28027c2d.png" width="624" /></p>
Outliers and Normality
<p>When looking at the distribution of individual data (below), one point seems to be an outlier or at least a suspect, extreme value (marked in red). This is Gengiz Khan, the best player. In my worksheet, the scores have been entered from the best to the worst (not in time order). This is why we can see a downward trend in the chart on the right side of the diagnostic report (see below).</p>
<p><img height="468" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/7a7f0889d207c5dab409c4f32fb33d85/7a7f0889d207c5dab409c4f32fb33d85.png" width="624" /></p>
<p>The Report Card (see below) from the Minitab Assistant shows that Normality might be an issue (the yellow triangle is a warning sign) because the sample sizes are quite small. We need to check normality within each team. The second warning sign is due to the unusual / extreme data (score in row 1) which may bias our analysis.</p>
<p><img height="500" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/c1b6e895a14806b48ce1389d3b1d283b/c1b6e895a14806b48ce1389d3b1d283b.png" width="666" /></p>
<p><span style="line-height: 20.8px;">Following the suggestion from the warning signal in the Minitab Assistant Report Card, </span>I decided to run a normality test. I performed a separate normality test for each team in order not to mix different distributions together.</p>
<p>A low P value in the normal probability plot (see below) signals a significant departure from normality. This p-value is below 0.05 for the Blue team. The points located along the normal probability plot line represent “normal,” common, random variations. The points at the upper or lower extreme, which are distant from the line, represent unusual values or outliers. The non-normal behavior in the probability plot of the Blue team is clearly due to the outlier on the right side of the normal probability plot line.</p>
<p><img height="384" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/0a2a24b82115014a0e981e63b0628f5b/0a2a24b82115014a0e981e63b0628f5b.png" width="576" /></p>
<p>Should we remove this value (Gengiz Khan’s score) in the Blue group and rerun the analysis without him?</p>
<p>Even though Gengiz Khan is more experienced and talented than the other team members, there are no particular reasons why he should be removed—he is certainly part of the Blue team. There are probably many other talented laser game players around. If another additional laser game session takes place in the future, there will probably still be a large difference between Gengiz Khan and the rest of his team.</p>
<p>The problem is that this extreme value inflates the within-group variability. Because the within-team variability for the Blue team is so much larger, the differences <em>between </em>groups do not appear significant when compared to the residual (within-group) variability, which pushes the p-value just above the significance threshold.</p>
A Non-parametric Solution
<p>One possible solution is to use a non-parametric approach. Non-parametric techniques are based on ranks, or medians. Ranks represent the relative position of an individual in comparison to others, but are not affected by extreme values (whereas a mean is sensitive to outlier values). Ranks and medians are more “robust” to outliers.</p>
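<p>To see why ranks shrug off extreme values, here is a minimal sketch using made-up scores (they are not the laser tag data). The <code>ranks</code> helper is a hypothetical function written for this illustration; it assigns tied values their average rank.</p>

```python
from statistics import mean, median


def ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    ordered = sorted(values)
    # A value first found at 0-based index i that occurs c times occupies
    # 1-based positions i+1 .. i+c, whose average is (2*i + c + 1) / 2.
    return [(2 * ordered.index(v) + ordered.count(v) + 1) / 2 for v in values]


# Hypothetical scores: identical except that the top score is made extreme.
scores = [300, 450, 500, 620, 700]
with_outlier = [300, 450, 500, 620, 7000]

print(mean(scores), mean(with_outlier))      # the mean jumps from 514 to 1774
print(median(scores), median(with_outlier))  # the median stays at 500
print(ranks(scores) == ranks(with_outlier))  # True: the ranks do not change
```

<p>The best player is still ranked first whether the winning score is 700 or 7000, so a rank-based test sees the same data in both cases.</p>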
<p>I used the Kruskal-Wallis test (see the correspondence table between parametric and non-parametric tests below). The p-value (see the output below) is now significant (less than 0.05), and the conclusion is completely different. We can consider that the differences are significant.</p>
<p style="margin-left: 40px;"><strong>Kruskal-Wallis Test: Score versus Team </strong></p>
<p style="margin-left: 40px;">Kruskal-Wallis Test on Score</p>
<p style="margin-left: 40px;">Team N Median Ave Rank Z</p>
<p style="margin-left: 40px;">Blue 9 2550.0 23.7 2.72</p>
<p style="margin-left: 40px;">Pink 13 -450.0 11.6 -2.44</p>
<p style="margin-left: 40px;">Yellow 10 975.0 16.4 -0.06</p>
<p style="margin-left: 40px;">Overall 32 16.5</p>
<p style="margin-left: 40px;">H = 8.86 DF = 2 <strong>P = 0.012</strong></p>
<p style="margin-left: 40px;">H = 8.87 DF = 2 <strong>P = 0.012</strong> (adjusted for ties)</p>
<p style="margin-left: 40px;"><img height="384" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/ed3231b9ceab5d16a6a5d5bb0ce43973/ed3231b9ceab5d16a6a5d5bb0ce43973.png" width="576" /></p>
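<p>As a rough check on the output above, the H statistic can be approximately rebuilt from the sample sizes and average ranks alone. This is a sketch of the textbook formula, not a description of Minitab's internal code, and it assumes the usual chi-square approximation for the p-value. Because the average ranks in the output are rounded to one decimal, the result only approximately matches the reported H.</p>

```python
from math import exp

# Sample sizes and average ranks copied from the Kruskal-Wallis output above.
teams = {"Blue": (9, 23.7), "Pink": (13, 11.6), "Yellow": (10, 16.4)}

total_n = sum(n for n, _ in teams.values())  # 32 players in total
overall_mean_rank = (total_n + 1) / 2        # 16.5, the "Overall" average rank

# H = 12 / (N (N + 1)) * sum over teams of n_i * (avg rank_i - overall avg rank)^2
h_statistic = 12 / (total_n * (total_n + 1)) * sum(
    n * (avg_rank - overall_mean_rank) ** 2 for n, avg_rank in teams.values()
)

# With 3 groups, H is compared to a chi-square distribution with 2 degrees of
# freedom, whose upper-tail probability has the closed form exp(-H / 2).
p_value = exp(-h_statistic / 2)

print(round(h_statistic, 2), round(p_value, 3))
```

<p>The Blue team's average rank sits well above the overall average rank and the Pink team's sits well below it, and those two gaps account for nearly all of H.</p>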
<p>The correspondence table for parametric and non-parametric tests is shown below:</p>
<p><img height="457" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/3616777847f46f514dc4dfdc51397e3d/3616777847f46f514dc4dfdc51397e3d.png" width="694" /></p>
Conclusion
<p>Outliers do happen and removing them is not always straightforward. One nice thing about non-parametric tests is that they are more robust to such outliers. However, this does not mean that non-parametric tests should be used in every circumstance. When there are no outliers and the distribution is normal, standard parametric tests (T tests or ANOVA) are more powerful.</p>
Data AnalysisHypothesis TestingLearningStatisticsStatsMon, 07 Dec 2015 13:02:00 +0000http://blog.minitab.com/blog/applying-statistics-in-quality-projects/why-you-should-use-non-parametric-tests-when-analyzing-data-with-outliersBruno ScibiliaWhat Can You Say When Your P-Value is Greater Than 0.05?
http://blog.minitab.com/blog/understanding-statistics/what-can-you-say-when-your-p-value-is-greater-than-005
<p>P-values are frequently misinterpreted, which causes many problems. I won't rehash <a href="http://blog.minitab.com/blog/adventures-in-statistics/banned-p-values-and-confidence-intervals-a-rebuttal-part-1">those problems here</a> since my colleague Jim Frost has detailed the issues involved at some length, but the fact remains that the p-value will continue to be one of the most frequently used tools for deciding if a result is statistically significant. </p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f5a45a8de99994c6fd16e3fd018776b1/shoveling.png" style="line-height: 20.8px; margin: 10px 15px; float: right; width: 250px; height: 221px;" /></p>
<p>You know the old saw about "Lies, damned lies, and statistics," right? It rings true because statistics really is as much about interpretation and presentation as it is mathematics. That means we human beings who are analyzing data, with all our foibles and failings, have the opportunity to shade and shadow the way results get reported.</p>
<p>While I generally like to believe that people<span style="line-height: 20.8px;"> <em>want</em> to be honest and objective</span><span style="line-height: 1.6;">—especially smart people who do research and analyze data that may affect other people's lives</span><span style="line-height: 1.6;">—<a href="https://mchankins.wordpress.com/2013/04/21/still-not-significant-2/">here are 500 pieces of evidence that fly in the face of that belief</a>. </span></p>
<p><span style="line-height: 1.6;">We'll get back to that in a minute. But first, a quick review...</span></p>
<span style="line-height: 1.6;">What's a P-Value, and How Do I Interpret It?</span>
<p>Most of us first encounter p-values when we conduct simple hypothesis tests, although they also are integral to many more sophisticated methods. Let's use Minitab 17 to do a quick review of how they work (if you want to follow along and don't have Minitab, the <a href="http://it.minitab.com/products/minitab/free-trial.aspx">full package is available free for 30 days</a>). We're going to compare fuel consumption for two different kinds of furnaces to see if there's a difference between their means. </p>
<p>Go to <strong>File > Open Worksheet</strong>, and click the "Look in Minitab Sample Data Folder" button. Open the sample data set named <em>Furnace.mtw</em>, and choose <strong>Stat > Basic Statistics > 2 Sample t...</strong> from the menu. In the dialog box, enter "BTU.In" for Samples, and enter "Damper" for Sample IDs.</p>
<p>Press <strong>OK</strong> and Minitab returns the following output, in which I've highlighted the p-value. </p>
<p style="margin-left: 40px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/26076d0dd37249748f1541a2313036b6/p_values_output.png" style="width: 547px; height: 172px;" /></p>
<p>In the majority of analyses, an alpha of 0.05 is used as the cutoff for significance. If the p-value is less than 0.05, we reject the <a href="http://blog.minitab.com/blog/understanding-statistics/things-statisticians-say-failure-to-reject-the-null-hypothesis">null hypothesis</a> that there's no difference between the means and conclude that a significant difference does exist. If the p-value is larger than 0.05, we <em>cannot</em> conclude that a significant difference exists. </p>
<p>That's pretty straightforward, right? Below 0.05, significant. Over 0.05, <em>not</em> significant. </p>
"Missed It By <em>That</em> Much!"
<p>In the example above, the result is clear: a p-value of 0.7 is so much higher than 0.05 that you can't apply any wishful thinking to the results. <span style="line-height: 1.6;">But what if your p-value is really, <em>really</em> close to 0.05? </span></p>
<p><span style="line-height: 1.6;"><em>Like, what if you had a p-value of 0.06? </em></span></p>
<p>That's not significant. </p>
<p><em>Oh. Okay, what about 0.055?</em></p>
<p>Not significant. </p>
<p><em>How about 0.051?</em></p>
<p>It's <em>still</em> not statistically significant, and data analysts should not try to pretend otherwise. <span style="line-height: 1.6;">A p-value is not a negotiation: if p > 0.05, the results are not significant. </span><em style="line-height: 1.6;">Period.</em></p>
<p><em>So, what </em>should<em> I say when I get a p-value that's higher than 0.05? </em></p>
<p>How about saying this? "The results were not statistically significant." If that's what the data tell you, there is nothing wrong with saying so. </p>
No Matter How Thin You Slice It, It's Still Baloney.
<p>Which brings me back to the <a href="https://mchankins.wordpress.com/2013/04/21/still-not-significant-2/">blog post</a> I referenced at the beginning. Do give it a read, but the bottom line is that the author cataloged 500 <em>different</em> ways that contributors to scientific journals have used language to obscure their results (or lack thereof). </p>
<p>As a student of language, I confess I find the list fascinating...but also upsetting. It's <em>not right</em>: These contributors are educated people who certainly understand A) what a p-value higher than 0.05 signifies, and B) that manipulating words to soften that result is deliberately deceptive. Or, to put it in words that are less soft, it's a damned lie.</p>
<p>Nonetheless, it happens frequently. </p>
<p>Here are just a few of my favorites of the 500 different ways people have reported results that were not significant, accompanied by the p-values to which these creative interpretations applied: </p>
<ul>
<li>a certain trend toward significance (p=0.08)</li>
<li>approached the borderline of significance (p=0.07)</li>
<li>at the margin of statistical significance (p<0.07)</li>
<li>close to being statistically significant (p=0.055)</li>
<li>fell just short of statistical significance (p=0.12)</li>
<li>just very slightly missed the significance level (p=0.086)</li>
<li>near-marginal significance (p=0.18)</li>
<li>only slightly non-significant (p=0.0738)</li>
<li>provisionally significant (p=0.073)</li>
</ul>
<p>and my very favorite:</p>
<ul>
<li>quasi-significant (p=0.09)</li>
</ul>
<p>I'm not sure what "quasi-significant" is even supposed to mean, but it <em>sounds</em> quasi-important, as long as you don't think about it too hard. But there's still no getting around the fact that a p-value of 0.09 is not a statistically significant result. </p>
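<p>For the record, here is how those creative phrases fare against the plain decision rule. The sketch assumes an alpha of 0.05 and the common convention of rejecting the null hypothesis when p is at or below alpha; the function name is made up for this example.</p>

```python
ALPHA = 0.05  # significance level, chosen before looking at the data


def is_significant(p_value: float, alpha: float = ALPHA) -> bool:
    """Assumed convention: significant only when p is at or below alpha."""
    return p_value <= alpha


# P-values quoted in the list above. The "p<0.07" entry is omitted because
# it reports a bound rather than a value.
reported = {
    "a certain trend toward significance": 0.08,
    "approached the borderline of significance": 0.07,
    "close to being statistically significant": 0.055,
    "fell just short of statistical significance": 0.12,
    "just very slightly missed the significance level": 0.086,
    "near-marginal significance": 0.18,
    "only slightly non-significant": 0.0738,
    "provisionally significant": 0.073,
    "quasi-significant": 0.09,
}

verdicts = {phrase: is_significant(p) for phrase, p in reported.items()}
for phrase, significant in verdicts.items():
    print(f"{phrase}: {'significant' if significant else 'not significant'}")
```

<p>Every line prints "not significant," no matter how the result was worded.</p>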
<p>The blogger does not address the question of whether the opposite situation occurs. Do contributors ever write that a p-value of, say, 0.049999 is:</p>
<ul>
<li>quasi-insignificant</li>
<li>only slightly significant</li>
<li>provisionally insignificant</li>
<li>just on the verge of being non-significant</li>
<li>at the margin of statistical non-significance</li>
</ul>
<p>I'll go out on a limb and posit that describing a p-value just under 0.05 in ways that diminish its statistical significance <em>just</em> <em>doesn't happen</em>. However, downplaying statistical non-significance would appear to be almost endemic. </p>
<p>That's why I find the above-referenced post so disheartening. It's distressing that you can so easily gather so many examples of bad behavior by data analysts <em>who almost certainly know better</em>.</p>
<p><em>You</em> would never use language to try to obscure the outcome of your analysis, would you?</p>
Hypothesis TestingStatisticsThu, 03 Dec 2015 13:00:00 +0000http://blog.minitab.com/blog/understanding-statistics/what-can-you-say-when-your-p-value-is-greater-than-005Eston MartzWhat Is ANOVA? And Who Drinks the Most Beer?
http://blog.minitab.com/blog/michelle-paret/what-is-anova-and-who-drinks-the-most-beer
<p><span style="line-height: 1.6;">Back when I was an undergrad in statistics, I unfortunately spent an entire semester of my life taking a class, diligently crunching numbers with my TI-82, before realizing 1) that I was actually in an Analysis of Variance (ANOVA) class, 2) why I would want to use such a tool in the first place, and 3) that ANOVA doesn’t necessarily tell you a thing about variances.</span></p>
<p>Fortunately, I've had a lot more real-world experience to draw from since then, which makes it much easier to understand today. TI-82 not required.</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f000c555e0ba8a9d9a78461b7230073c/beer.jpg" style="line-height: 20.8px; margin: 10px 15px; float: right; width: 220px; height: 220px;" /></p>
Why Conduct an ANOVA?
<p>In its simplest form (specifically, a 1-way ANOVA), you take 1 continuous (“response”) variable and 1 categorical (“factor”) variable and test the null hypothesis that all group means for the categorical variable are equal. Typically, we’re talking about at least 3 groups, because if you only have 2 groups (samples), then you can use a <span><a href="http://blog.minitab.com/blog/adventures-in-statistics/using-hypothesis-tests-to-bust-myths-about-the-battle-of-the-sexes">2-sample t-test</a></span> and skip ANOVA altogether.</p>
<div>
<p>As an example, let’s look at the <a href="https://en.wikipedia.org/wiki/List_of_countries_by_beer_consumption_per_capita">average annual per capita beer consumption</a> across 3 regions of the world: Asia, Europe, and America. Here are the null and alternative hypotheses:</p>
<p style="margin-left: 40px;">H<sub>0</sub>: All regions drink the same average amount of beer (μ<sub>Asia</sub> = μ<sub>Europe</sub> = μ<sub>America</sub>)</p>
<p style="margin-left: 40px;">H<sub>1</sub>: Not all regions drink the same average amount of beer</p>
<p>Any guess on who consumes the most beer?</p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/371609298620629565add1ee54234c37/individual_value_plot_of_volume_consumed__liters__w1024.jpeg" style="border-width: 0px; border-style: solid; width: 600px; height: 394px;" /></p>
<p><span style="line-height: 1.6;">According to the individual value plot created using </span><a href="http://www.minitab.com/products/minitab/" style="line-height: 1.6;">Minitab 17</a><span style="line-height: 1.6;">, Europe consumes the most beer on average and Asia consumes the least. However, are these differences statistically significant? Or are these differences simply due to random variation?</span></p>
How ANOVA Works
<p>The basic logic behind ANOVA is that the within-group variation is due only to random error. Therefore:</p>
<ul>
<li><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/2c9466a316c75e4c17cd250f0aff45bd/comparing_variation.jpg" style="line-height: 20.8px; margin-left: 12px; margin-right: 12px; float: right; width: 112px; height: 200px;" />If the between-group variation is similar to the within-group variation, then the group means are likely to differ only due to random error. (Figure 1)</li>
<li>If the between-group variation is large relative to the within-group variation, then there are likely differences between the group means. (Figure 2)</li>
</ul>
<p>Say what?</p>
<p>In our example, the between-group variation represents the variation <em>between</em> the 3 different regions. And the within-group variation represents the beer consumption variability <em>within</em> a given region. Take Europe, for instance, where we have the Czech Republic. It appears to be the thirstiest country, consuming the most beer at 148.6 liters. But Europe also contains Italy, whose population drinks the least at only 29 liters (perhaps the Italians are passing up the Peroni for some vino and Limoncello?). So you can see that there is variability within the Europe group. There’s also variability within the Asia group, and within the America group.</p>
<p>With ANOVA, we compare the between-group variation (i.e., Asia vs. Europe vs. America) to the within-group variation (i.e., within each of those regions). The higher this ratio, the smaller the p-value. So the term ANOVA refers to the fact that we're using information about the variances to draw conclusions about the means.</p>
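<p>The ratio described above can be sketched in a few lines. This is a minimal, textbook one-way ANOVA F calculation on made-up numbers (not the beer data), and <code>one_way_f</code> is a hypothetical helper name. Converting F to a p-value requires the F distribution, which is left to statistical software here.</p>

```python
from statistics import mean


def one_way_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n_total = len(groups), len(all_values)

    # Between-group variation: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group variation: how far each observation sits from its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between, df_within = k - 1, n_total - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Hypothetical groups with the same spread but different separation of means.
overlapping = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
separated = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0], [21.0, 22.0, 23.0]]

print(one_way_f(overlapping))  # F = 3.0 with 2 and 6 degrees of freedom
print(one_way_f(separated))    # F = 300.0 with 2 and 6 degrees of freedom
```

<p>Both data sets have identical within-group variation. Only the distance between the group means changes, and that alone moves F from 3 to 300.</p>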
The Analysis
<p>If we run a 1-way ANOVA using this beer data, Minitab Statistical Software provides the following output in the Session Window:</p>
<p style="margin-left:.5in;"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/956e35d59d68a844fdf00986397dbc9d/oneway_anova_for_beer.jpg" style="border-width: 0px; border-style: solid; width: 460px; height: 286px;" /></p>
<p><span style="line-height: 1.6;">Our </span><a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" style="line-height: 1.6;">p-value</a><span style="line-height: 1.6;"> is displayed as 0.000 (that is, it rounds to zero at three decimal places), so it is statistically significant. We can therefore reject the null hypothesis that all regions drink the same average amount of beer. </span></p>
<p><span style="line-height: 1.6;">This leads us to our next question: Which regions differ? Let’s use Tukey multiple comparisons to find out.</span></p>
<p align="center"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/b499b8c87a401c007713a0d5179a5e56/oneway_anova_interval_plot_for_beer_w1024.jpeg" style="border-width: 0px; border-style: solid; width: 600px; height: 394px;" /></p>
<p>Per the footnote on the Tukey comparisons graph, “If an interval does not contain zero, the corresponding means are significantly different.” Therefore, the intervals shown in red tell us where the differences are. Specifically, we can conclude that the average beer consumption for Europe is significantly higher than that of Asia. We can also conclude that America consumes significantly more than Asia. However, there is not sufficient evidence to conclude that the average beer consumption for Europe is different than for America.</p>
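<p>The footnote's rule is simple enough to write down directly. In the sketch below, the interval endpoints are invented purely for illustration (the real values are in the graph above), and <code>differs_significantly</code> is a hypothetical helper name.</p>

```python
def differs_significantly(lower: float, upper: float) -> bool:
    """True when the interval for a difference in means excludes zero."""
    return lower > 0 or upper < 0


# HYPOTHETICAL interval endpoints in liters, made up for illustration only.
intervals = {
    "Europe minus Asia": (25.0, 70.0),
    "America minus Asia": (5.0, 55.0),
    "Europe minus America": (-10.0, 40.0),
}

results = {
    pair: differs_significantly(lower, upper)
    for pair, (lower, upper) in intervals.items()
}
for pair, differs in results.items():
    print(f"{pair}: {'different' if differs else 'no evidence of a difference'}")
```

<p>An interval that straddles zero means the true difference could plausibly be positive, negative, or nothing at all, which is why it does not count as evidence of a difference.</p>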
The Last Sip
<p>Although it’s unlikely that you’re analyzing beer data in your professional career, I do hope this provides a little insight into ANOVA and how you can utilize it to test averages between 3 or more groups.</p>
</div>
Data AnalysisFun StatisticsHypothesis TestingStatisticsStatistics HelpThu, 19 Nov 2015 13:00:00 +0000http://blog.minitab.com/blog/michelle-paret/what-is-anova-and-who-drinks-the-most-beerMichelle ParetControl Charts - Not Just for Statistical Process Control (SPC) Anymore!
http://blog.minitab.com/blog/adventures-in-statistics/control-charts-not-just-for-statistical-process-control-spc-anymore
<p>Control charts are a fantastic tool. These charts plot your process data to identify common cause and special cause variation. By identifying the different causes of variation, you can take action on your process without over-controlling it.</p>
<p>Assessing the stability of a process can help you determine whether there is a problem and identify the source of the problem. Is the mean too high, too low, or unstable? Is variability a problem? If so, is the variability inherent in the process or attributable to specific sources? Control charts answer these questions, which can guide your corrective efforts.</p>
<p>Determining that your process is stable is good information all by itself, but it is also a prerequisite for further analysis, such as <a href="http://blog.minitab.com/blog/understanding-statistics/i-think-i-can-i-know-i-can-a-high-level-overview-of-process-capability-analysis" target="_blank">capability analysis</a>. Before assessing process capability, you must be sure that your process is stable. An unstable process is unpredictable. If your process is stable, you can predict future performance and improve its capability.</p>
<p>While we associate control charts with business processes, I’ll argue in this post that control charts provide the same great benefits in other areas beyond statistical process control (SPC) and Six Sigma. In fact, you’ll see several examples where control charts find answers that you’d be hard pressed to uncover using different methods.</p>
The Importance of Assessing Whether Other Types of Processes Are In Control
<p>I want you to expand your mental concept of a process to include processes outside the business environment. After all, unstable process levels and excessive variability can be problems in many different settings. For example:</p>
<ul>
<li>A teacher has a process that helps students learn the material as measured by test scores.</li>
<li><a href="http://blog.minitab.com/blog/real-world-quality-improvement/control-charts-keep-blood-sugar-in-check" target="_blank">A diabetic has a process for keeping blood sugar in control</a>.</li>
<li>A researcher has a process that causes subjects to experience an impact of 6 times their body weight.</li>
</ul>
<p>All of these processes can be stable or unstable, have a certain amount of inherent variability, and can also have special causes of variability. Understanding these issues can help improve all of them.</p>
<p>The third bullet relates to a <a href="http://blog.minitab.com/blog/adventures-in-statistics/quality-improvement-controlling-variability-more-difficult-than-the-mean" target="_blank">research study</a> that I was involved with. Our research goal was to have middle school subjects jump from 24-inch steps, 30 times, every other school day to determine whether it would increase their bone density. We defined our treatment as the subjects experiencing an impact of 6 body weights. However, we weren’t quite hitting the mark.</p>
<p>To guide our corrective efforts, I conducted a pilot study and graphed the results in the Xbar-S chart below.</p>
<p><img alt="Xbar-S chart of ground reaction forces for pilot study" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/e721bd172aa55d5ec9976e81990f1293/xbars_grf_w1024.jpeg" style="width: 576px; height: 384px;" /></p>
<p>The in-control S chart (bottom) shows that each subject has a consistent landing style that produces impacts of a consistent magnitude—the variability is in control. However, the out-of-control Xbar chart (top) indicates that, while the overall mean (6.141) exceeds our target, different subjects have very different means. Collectively, the chart shows that some subjects are consistently hard landers while others are consistently soft landers. The control chart suggests that the variability is not inherent in the process (common cause variation) but rather assignable to differences between subjects (special cause variation).</p>
<p>Based on this information, we decided to train the subjects how to land and to have a nurse observe all of the jumping sessions. This ongoing training and corrective action reduced the variability enough so that the impacts were consistently greater than 6 body weights.</p>
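<p>For readers who want to see the arithmetic behind an Xbar-S chart, here is a minimal sketch using made-up impacts for four hypothetical subjects with five jumps each. It uses the textbook average-standard-deviation method with the standard constants for subgroups of size 5; statistical software may estimate the process standard deviation differently by default, so its limits can differ slightly.</p>

```python
from statistics import mean, stdev

# Standard control chart constants for subgroups of size 5.
A3, B3, B4 = 1.427, 0.0, 2.089

# HYPOTHETICAL impacts (in body weights): one subgroup per subject, 5 jumps each.
subjects = [
    [5.0, 5.2, 4.8, 5.1, 4.9],  # consistently soft lander
    [6.0, 6.2, 5.8, 6.1, 5.9],
    [7.0, 7.2, 6.8, 7.1, 6.9],  # consistently hard lander
    [6.0, 6.2, 5.8, 6.1, 5.9],
]

means = [mean(s) for s in subjects]
sds = [stdev(s) for s in subjects]
grand_mean, s_bar = mean(means), mean(sds)

# Xbar chart limits: grand mean plus or minus A3 times the average subgroup SD.
xbar_lcl, xbar_ucl = grand_mean - A3 * s_bar, grand_mean + A3 * s_bar
# S chart limits: B3 and B4 times the average subgroup SD.
s_lcl, s_ucl = B3 * s_bar, B4 * s_bar

xbar_out = [i for i, m in enumerate(means) if not xbar_lcl <= m <= xbar_ucl]
s_out = [i for i, s in enumerate(sds) if not s_lcl <= s <= s_ucl]

print("Subjects outside Xbar limits:", xbar_out)  # [0, 2]
print("Subjects outside S limits:", s_out)        # []
```

<p>The made-up data reproduce the pattern in the pilot study: every subject lands consistently (the S chart is in control), but the subjects differ from one another (the Xbar chart is not).</p>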
Control Charts as a Prerequisite for Statistical Hypothesis Tests
<p>As I mentioned, control charts are also important because they can verify the assumption that a process is stable, which is required to produce a valid capability analysis. We don’t often think of using control charts to test the assumptions for <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-why-we-need-to-use-hypothesis-tests-in-statistics" target="_blank">hypothesis tests</a> in a similar fashion, but they are very useful for that as well.</p>
<p>The assumption that the measurements used in a hypothesis test are stable is often overlooked. As with any process, if the measurements are not stable, you can’t make inferences about whatever you are measuring.</p>
<p>Let’s assume that we’re comparing test scores between group A and group B. We’ll use this <a href="//cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/File/6053477fc294de59d5b3837389daab3a/groupcomparison.MTW">data set</a> to perform a 2-sample t-test as shown below.</p>
<p style="margin-left: 40px;"><img alt="two sample t-test results" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/d60543dc8eb46afc282b9776b0517a5e/2samplet.png" style="width: 522px; height: 210px;" /></p>
<p>The results appear to show that group A has the higher mean and that the difference is statistically significant. Group B has a marginally higher standard deviation, but we’re not assuming equal variances, so that’s not a problem. If you conduct normality tests, you’ll see that the data for both groups are normally distributed—although we have a sufficient number of observations per group that we don’t have to worry about normality. All is good, right?</p>
<p>The I-MR charts below suggest otherwise!</p>
<p><img alt="I-MR chart for group A" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/cef240bbb760bb6760ddcbc33e446be9/imr_a.png" style="width: 576px; height: 384px;" /></p>
<p><img alt="I-MR chart of group B" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/e4bd53da7831826959be94540b7ab0a2/imr_b.png" style="width: 576px; height: 384px;" /></p>
<p>The chart for group A shows that these scores are stable. However, in group B, the multiple out-of-control points indicate that the scores are unstable. Clearly, there is a negative trend. Comparing a stable group to an unstable group is not a valid comparison even though the data satisfy the other assumptions.</p>
<p>This I-MR chart illustrates just one type of problem that control charts can detect. Control charts can also test for a variety of patterns in the data and for out-of-control variability. As these data show, you can miss problems using other methods.</p>
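<p>The basic I-MR calculation is short. The sketch below uses made-up test scores, not the data set linked above, and applies only the simplest check (a point beyond the control limits). It assumes the textbook constants 2.66 and 3.267 for moving ranges of two consecutive points; the additional pattern tests mentioned above are not implemented.</p>

```python
from statistics import mean

# HYPOTHETICAL test scores in time order; the last one is unusual.
scores = [70, 72, 71, 73, 72, 90]

# Moving range: absolute difference between each pair of consecutive scores.
moving_ranges = [abs(b - a) for a, b in zip(scores, scores[1:])]
center, mr_bar = mean(scores), mean(moving_ranges)

# Individuals chart limits: center line plus or minus 2.66 times the average moving range.
i_lcl, i_ucl = center - 2.66 * mr_bar, center + 2.66 * mr_bar
# Moving range chart upper limit: 3.267 times the average moving range.
mr_ucl = 3.267 * mr_bar

out_individuals = [i for i, x in enumerate(scores) if not i_lcl <= x <= i_ucl]
out_ranges = [i for i, r in enumerate(moving_ranges) if r > mr_ucl]

print("Scores outside the I chart limits:", out_individuals)      # [5]
print("Moving ranges above the MR chart limit:", out_ranges)      # [4]
```

<p>The jump to 90 is flagged on both charts: the score itself falls above the upper limit, and the change from the previous score exceeds the moving range limit.</p>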
Using the Different Types of Control Charts
<p>The I-MR chart assesses the stability of the mean and standard deviation when you don’t have subgroups, while the XBar-S chart shown earlier assesses the same parameters but <em>with </em>subgroups.</p>
<p>You can also use other control charts to test other types of data. In Minitab, the U Chart and Laney U’ Chart are control charts that use the Poisson distribution. You can use these charts in conjunction with the 1-Sample and 2-Sample Poisson Rate tests. The P Chart and Laney P’ Chart are control charts that use the binomial distribution. Use these charts with the 1 Proportion and 2 Proportions tests.</p>
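<p>As one example of a chart built on the binomial distribution, here is a minimal sketch of P chart limits for equal-sized samples. The counts are made up, <code>p_chart_limits</code> is a hypothetical helper name, and the limits are the textbook three-sigma limits clamped to the range 0 to 1.</p>

```python
from math import sqrt


def p_chart_limits(defectives, sample_size):
    """Return (LCL, center line, UCL) for a P chart with equal sample sizes."""
    p_bar = sum(defectives) / (len(defectives) * sample_size)
    # Binomial standard deviation of a sample proportion.
    sigma = sqrt(p_bar * (1 - p_bar) / sample_size)
    return max(0.0, p_bar - 3 * sigma), p_bar, min(1.0, p_bar + 3 * sigma)


# HYPOTHETICAL data: defective items found in five samples of 100 items each.
defectives = [5, 4, 6, 5, 20]
sample_size = 100

lcl, p_bar, ucl = p_chart_limits(defectives, sample_size)
flagged = [
    i for i, d in enumerate(defectives)
    if not lcl <= d / sample_size <= ucl
]

print(round(lcl, 4), round(p_bar, 4), round(ucl, 4))
print("Samples outside the limits:", flagged)  # [4]
```

<p>A U chart follows the same pattern, with the Poisson standard deviation of a rate in place of the binomial one.</p>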
<p>If you're using <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank" title="Minitab 16 Statistical Software">Minitab Statistical Software</a>, you can choose <strong>Assistant > Control Charts</strong> and get step-by-step guidance through the process of creating a control chart, from determining what type of data you have, to making sure that your data meets necessary assumptions, to interpreting the results of your chart.</p>
<p>Additionally, check out the great <a href="http://blog.minitab.com/blog/understanding-statistics/control-chart-tutorials-and-examples">control charts tutorial</a> put together by my colleague, Eston Martz.</p>
Data AnalysisHypothesis TestingLearningQuality ImprovementSix SigmaStatisticsStatistics HelpThu, 12 Nov 2015 13:00:00 +0000http://blog.minitab.com/blog/adventures-in-statistics/control-charts-not-just-for-statistical-process-control-spc-anymoreJim Frost