Hypothesis Testing | Minitab
Blog posts and articles about hypothesis testing, especially in the course of Lean Six Sigma quality improvement projects.
http://blog.minitab.com/blog/hypothesis-testing-2/rss
Fri, 06 May 2016 07:19:12 +0000

Understanding t-Tests: 1-sample, 2-sample, and Paired t-Tests
http://blog.minitab.com/blog/adventures-in-statistics/understanding-t-tests%3A-1-sample%2C-2-sample%2C-and-paired-t-tests
<p>In statistics, t-tests are a type of hypothesis test that allows you to compare means. They are called t-tests because each t-test boils your sample data down to one number, the t-value. If you understand how t-tests calculate t-values, you’re well on your way to understanding how these tests work.</p>
<p>In this series of posts, I'm focusing on concepts rather than equations to show how t-tests work. However, this post includes two simple equations that I’ll work through using the analogy of a signal-to-noise ratio.</p>
<p><a href="http://www.minitab.com/products/minitab/" target="_blank">Minitab statistical software</a> offers the 1-sample t-test, paired t-test, and the 2-sample t-test. Let's look at how each of these t-tests reduces your sample data down to the t-value.</p>
How 1-Sample t-Tests Calculate t-Values
<p>Understanding this process is crucial to understanding how t-tests work. I'll show you the formula first, and then I’ll explain how it works.</p>
<p style="margin-left: 40px;"><img alt="formula to calculate t for a 1-sample t-test" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/dbbda42fec926eef96a56c22ed462458/formula_1t.png" style="width: 142px; height: 88px;" /></p>
<p>Please notice that the formula is a ratio. A common analogy is that the t-value is the signal-to-noise ratio.</p>
<strong>Signal (a.k.a. the effect size)</strong>
<p>The numerator is the signal. You simply take the sample mean and subtract the null hypothesis value. If your sample mean is 10 and the null hypothesis is 6, the difference, or signal, is 4.</p>
<p>If there is no difference between the sample mean and null value, the signal in the numerator, as well as the value of the entire ratio, equals zero. For instance, if your sample mean is 6 and the null value is 6, the difference is zero.</p>
<p>As the difference between the sample mean and the null hypothesis mean increases in either the positive or negative direction, the strength of the signal increases.</p>
<div style="float: right; width: 325px; margin: 15px 0px 15px 15px;"><img alt="Photo of a packed stadium to illustrate high background noise" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/695f063e8d38c2bc9c5fa61637ef6327/crowd.jpg" style="width: 325px; height: 244px; margin-bottom:5px;" /><br />
<em>Lots of noise can overwhelm the signal.</em></div>
<strong>Noise</strong>
<p>The denominator is the noise. The equation in the denominator is a measure of variability known as the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/tests-of-means/what-is-the-standard-error-of-the-mean/" target="_blank">standard error of the mean</a>. This statistic indicates how accurately your sample estimates the mean of the population. A larger number indicates that your sample estimate is less precise because it has more random error.</p>
<p>This random error is the “noise.” When there is more noise, you expect to see larger differences between the sample mean and the null hypothesis value <em>even when the null hypothesis is true</em>. We include the noise factor in the denominator because we must determine whether the signal is large enough to stand out from it.</p>
<strong>Signal-to-Noise ratio</strong>
<p>Both the signal and noise values are in the units of your data. If your signal is 6 and the noise is 2, your t-value is 3. This t-value indicates that the difference is 3 times the size of the standard error. However, if there is a difference of the same size but your data have more variability (a noise value of 6), your t-value is only 1: the signal is the same size as the noise.</p>
<p>In this manner, t-values allow you to see how distinguishable your signal is from the noise. Relatively large signals and low levels of noise produce larger t-values. If the signal does not stand out from the noise, it’s likely that the observed difference between the sample estimate and the null hypothesis value is due to random error in the sample rather than a true difference at the population level.</p>
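<p>As a quick sketch of the ratio described above (using Python's scipy rather than Minitab, with invented sample values), the t-value is just the signal divided by the noise:</p>

```python
# A sketch of the 1-sample t-value as signal divided by noise.
# The sample values below are invented for illustration.
import numpy as np
from scipy import stats

sample = np.array([8.2, 11.5, 9.7, 12.1, 10.4, 9.9, 11.0, 10.8])
null_value = 9.0

signal = sample.mean() - null_value                 # numerator: the effect size
noise = sample.std(ddof=1) / np.sqrt(len(sample))   # denominator: standard error of the mean
t_manual = signal / noise

# scipy's built-in 1-sample t-test produces the same t-value
t_scipy, p = stats.ttest_1samp(sample, null_value)
print(f"manual t = {t_manual:.4f}, scipy t = {t_scipy:.4f}")
```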
A Paired t-Test Is Just a 1-Sample t-Test
<p>Many people are confused about when to use a paired t-test and how it works. I’ll let you in on a little secret. The paired t-test and the 1-sample t-test are actually the same test in disguise! As we saw above, a 1-sample t-test compares one sample mean to a null hypothesis value. A paired t-test simply calculates the difference between paired observations (e.g., before and after) and then performs a 1-sample t-test on the differences.</p>
<p>You can test this with <a href="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/File/946c3f4725847e714e7fcc9664ae67b2/paired_t_test.mtw">this data set</a> to see how all of the results are identical, including the mean difference, t-value, p-value, and confidence interval of the difference.</p>
<p style="margin-left: 40px;"><img alt="Minitab worksheet with paired t-test example" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/02fbcdbbf62fec3823123fbcc818b11f/paired_t_worksheet.png" style="width: 229px; height: 223px;" /><img alt="paired t-test output" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/170d6d4fa1fbbb1bf4f5aa56b1783b5f/paired_t_swo.png" style="width: 518px; height: 196px;" /></p>
<p style="margin-left: 40px;"><img alt="1-sample t-test output" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/08d652fb45599fc1ac247181a935c471/1t_difc_swo.png" style="width: 504px; height: 115px;" /></p>
<p>Understanding that the paired t-test simply performs a 1-sample t-test on the paired differences can really help you understand how the paired t-test works and when to use it. You just need to figure out whether it makes sense to calculate the difference between each pair of observations.</p>
<p>For example, let’s assume that “before” and “after” represent test scores, and there was an intervention in between them. If the before and after scores in each row of the example worksheet represent the same subject, it makes sense to calculate the difference between the scores in this fashion—the paired t-test is appropriate. However, if the scores in each row are for different subjects, it doesn’t make sense to calculate the difference. In this case, you’d need to use another test, such as the 2-sample t-test, which I discuss below.</p>
<p>Using the paired t-test simply saves you the step of having to calculate the differences before performing the t-test. You just need to be sure that the paired differences make sense!</p>
<p>When it is appropriate to use a paired t-test, it can be more powerful than a 2-sample t-test. For more information, go to <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/tests-of-means/why-use-paired-t/" target="_blank">Why should I use a paired t-test?</a></p>
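<p>You can verify the "same test in disguise" claim outside of Minitab as well. This sketch uses scipy with invented before/after scores:</p>

```python
# Verifying that a paired t-test is a 1-sample t-test on the differences.
# The before/after scores are invented for illustration.
import numpy as np
from scipy import stats

before = np.array([72.0, 68.0, 75.0, 80.0, 64.0, 77.0, 70.0, 69.0])
after = np.array([78.0, 71.0, 80.0, 84.0, 66.0, 79.0, 75.0, 72.0])

# Paired t-test on the two columns
t_paired, p_paired = stats.ttest_rel(after, before)
# 1-sample t-test on the paired differences, against a null value of 0
t_diff, p_diff = stats.ttest_1samp(after - before, 0)

print(f"paired t-test:            t = {t_paired:.4f}, p = {p_paired:.4f}")
print(f"1-sample t on differences: t = {t_diff:.4f}, p = {p_diff:.4f}")
```

Both lines print identical results, mirroring the identical Minitab output shown above.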
How 2-Sample t-Tests Calculate t-Values
<p>The 2-sample t-test takes your sample data from two groups and boils it down to the t-value. The process is very similar to the 1-sample t-test, and you can still use the analogy of the signal-to-noise ratio. Unlike the paired t-test, the 2-sample t-test requires independent groups for each sample.</p>
<p>Here’s the formula, followed by some discussion.</p>
<p style="margin-left: 40px;"><img alt="formula to calculate t for a 2-sample t-test" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/276994cf179b4997ce6097d1f4462363/formula_2t.png" style="width: 102px; height: 54px;" /></p>
<p>For the 2-sample t-test, the numerator is again the signal, which is the difference between the means of the two samples. For example, if the mean of group 1 is 10, and the mean of group 2 is 4, the difference is 6.</p>
<p>The default null hypothesis for a 2-sample t-test is that the two groups are equal. You can see in the equation that when the two groups are equal, the difference (and the entire ratio) also equals zero. As the difference between the two groups grows in either a positive or negative direction, the signal becomes stronger.</p>
<p>In a 2-sample t-test, the denominator is still the noise, but Minitab can use two different values. You can either assume that the variability in both groups is equal or not equal, and Minitab uses the corresponding estimate of the variability. Either way, the principle remains the same: you are comparing your signal to the noise to see how much the signal stands out.</p>
<p>Just like with the 1-sample t-test, for any given difference in the numerator, as you increase the noise value in the denominator, the t-value becomes smaller. To determine that the groups are different, you need a t-value that is large.</p>
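<p>The two denominator choices can be sketched in scipy, where the <code>equal_var</code> parameter switches between the pooled and unpooled (Welch) estimates of the noise. The group values here are invented for illustration:</p>

```python
# The two noise estimates for a 2-sample t-test, sketched with scipy.
# Group values are invented for illustration.
import numpy as np
from scipy import stats

group1 = np.array([10.2, 9.8, 11.1, 10.5, 9.9, 10.7, 10.0, 10.9])
group2 = np.array([8.9, 9.5, 8.7, 9.8, 9.1, 8.8, 9.6, 9.0])

# Pooled standard error: assumes the two groups have equal variability
t_pooled, p_pooled = stats.ttest_ind(group1, group2, equal_var=True)
# Welch's t-test: makes no equal-variability assumption
t_welch, p_welch = stats.ttest_ind(group1, group2, equal_var=False)

print(f"pooled: t = {t_pooled:.3f}, p = {p_pooled:.4f}")
print(f"Welch:  t = {t_welch:.3f}, p = {p_welch:.4f}")
```

With equal group sizes, the two t-values coincide; the choice of noise estimate shows up mainly through the degrees of freedom used for the p-value.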
What Do t-Values Mean?
<p>Each type of t-test uses a procedure to boil all of your sample data down to one value, the t-value. The calculations compare your sample mean(s) to the null hypothesis and incorporate both the sample size and the variability in the data. A t-value of 0 indicates that the sample results exactly equal the null hypothesis. In statistics, we call the difference between the sample estimate and the null hypothesis the effect size. As this difference increases, the absolute value of the t-value increases.</p>
<p>That’s all nice, but what does a t-value of, say, 2 really mean? From the discussion above, we know that a t-value of 2 indicates that the observed difference is twice the size of the variability in your data. However, we use t-tests to evaluate hypotheses rather than just figuring out the signal-to-noise ratio. We want to determine whether the effect size is statistically significant.</p>
<p>To see how we get from t-values to assessing hypotheses and determining statistical significance, read the other post in this series, <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-t-tests-t-values-and-t-distributions">Understanding t-Tests: t-values and t-distributions</a>.</p>
Data Analysis | Hypothesis Testing | Learning | Statistics Help
Wed, 04 May 2016 12:00:00 +0000 | Jim Frost
http://blog.minitab.com/blog/adventures-in-statistics/understanding-t-tests%3A-1-sample%2C-2-sample%2C-and-paired-t-tests

Understanding t-Tests: t-values and t-distributions
http://blog.minitab.com/blog/adventures-in-statistics/understanding-t-tests-t-values-and-t-distributions
<p>T-tests are handy <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-why-we-need-to-use-hypothesis-tests-in-statistics" target="_blank">hypothesis tests</a> in statistics when you want to compare means. You can compare a sample mean to a hypothesized or target value using a one-sample t-test. You can compare the means of two groups with a two-sample t-test. If you have two groups with paired observations (e.g., before and after measurements), use the paired t-test.</p>
<img alt="Output that shows a t-value" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/efd51d69e3947d70197143b735e0c51d/t_value_swo.png" style="line-height: 20.8px; float: right; width: 400px; height: 57px; margin: 10px 15px; border-width: 1px; border-style: solid;" />
<p>How do t-tests work? How do t-values fit in? In this series of posts, I’ll answer these questions by focusing on concepts and graphs rather than equations and numbers. After all, a key reason to use <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">statistical software like Minitab</a> is so you don’t get bogged down in the calculations and can instead focus on understanding your results.</p>
<p>In this post, I will explain t-values, t-distributions, and how t-tests use them to calculate probabilities and assess hypotheses.</p>
What Are t-Values?
<p>T-tests are called t-tests because the test results are all based on t-values. T-values are an example of what statisticians call test statistics. A test statistic is a standardized value that is calculated from sample data during a hypothesis test. The procedure that calculates the test statistic compares your data to what is expected under the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/basics/null-and-alternative-hypotheses/" target="_blank">null hypothesis</a>.</p>
<p>Each type of t-test uses a specific procedure to boil all of your sample data down to one value, the t-value. The calculations behind t-values compare your sample mean(s) to the null hypothesis and incorporate both the sample size and the variability in the data. A t-value of 0 indicates that the sample results exactly equal the null hypothesis. As the difference between the sample data and the null hypothesis increases, the absolute value of the t-value increases.</p>
<p>Assume that we perform a t-test and it calculates a t-value of 2 for our sample data. What does that even mean? I might as well have told you that our data equal 2 fizbins! We don’t know if that’s common or rare when the null hypothesis is true.</p>
<p>By itself, a t-value of 2 doesn’t really tell us anything. T-values are not in the units of the original data, or anything else we’d be familiar with. We need a larger context in which we can place individual t-values before we can interpret them. This is where t-distributions come in.</p>
What Are t-Distributions?
<p>When you perform a t-test for a single study, you obtain a single t-value. However, if we drew multiple random samples of the same size from the same population and performed the same t-test, we would obtain many t-values and we could plot a distribution of all of them. This type of distribution is known as a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/basic-concepts/sampling-distribution/" target="_blank">sampling distribution</a>.</p>
<p>Fortunately, the properties of t-distributions are well understood in statistics, so we can plot them without having to collect many samples! A specific t-distribution is defined by its <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/basic-concepts/df/" target="_blank">degrees of freedom (DF)</a>, a value closely related to sample size. Therefore, different t-distributions exist for every sample size. You can graph t-distributions using Minitab’s <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/graphs/graphs-of-distributions/probability-distribution-plots/probability-distribution-plot/" target="_blank">probability distribution plots</a>.</p>
<p>T-distributions assume that you draw repeated random samples from a population where the null hypothesis is true. You place the t-value from your study in the t-distribution to determine how consistent your results are with the null hypothesis.</p>
<p style="margin-left: 40px;"><img alt="Plot of t-distribution" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/d628e56f0380e0edcf575502a670ed31/t_dist_20_df.png" style="width: 576px; height: 384px;" /></p>
<p>The graph above shows a t-distribution that has 20 degrees of freedom, which corresponds to a sample size of 21 in a one-sample t-test. It is a symmetric, bell-shaped distribution that is similar to the normal distribution, but with thicker tails. This graph plots the probability density function (PDF), which describes the likelihood of each t-value.</p>
<p>The peak of the graph is right at zero, which indicates that obtaining a sample value close to the null hypothesis is most likely. That makes sense because t-distributions assume that the null hypothesis is true. T-values become less likely as you get further away from zero in either direction. In other words, when the null hypothesis is true, you are less likely to obtain a sample that is very different from the null hypothesis.</p>
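<p>These shape properties are easy to confirm numerically. Here is a sketch using scipy's t-distribution functions (the post itself uses Minitab's probability distribution plots):</p>

```python
# Numeric check of the t-distribution's shape (20 degrees of freedom):
# peaked at zero, symmetric, and thicker-tailed than the standard normal.
from scipy import stats

df = 20
print(stats.t.pdf(0, df) > stats.t.pdf(2, df))        # True: densest near zero
print(abs(stats.t.pdf(-2, df) - stats.t.pdf(2, df)))  # ~0: symmetric about zero
print(stats.t.sf(2, df) > stats.norm.sf(2))           # True: heavier tails than the normal
```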
<p>Our t-value of 2 indicates a positive difference between our sample data and the null hypothesis. The graph shows that there is a reasonable probability of obtaining a t-value from -2 to +2 when the null hypothesis is true. Our t-value of 2 is an unusual value, but we don’t know exactly <em>how </em>unusual. Our ultimate goal is to determine whether our t-value is unusual enough to warrant rejecting the null hypothesis. To do that, we'll need to calculate the probability.</p>
Using t-Values and t-Distributions to Calculate Probabilities
<p>The foundation behind any hypothesis test is being able to take the test statistic from a specific sample and place it within the context of a known probability distribution. For t-tests, if you take a t-value and place it in the context of the correct t-distribution, you can calculate the probabilities associated with that t-value.</p>
<p>A probability allows us to determine how common or rare our t-value is under the assumption that the null hypothesis is true. If the probability is low enough, we can conclude that the effect observed in our sample is inconsistent with the null hypothesis. The evidence in the sample data is strong enough to reject the null hypothesis for the entire population.</p>
<p>Before we calculate the probability associated with our t-value of 2, there are two important details to address.</p>
<p>First, we’ll actually use the t-values of +2 and -2 because we’ll perform a two-tailed test. A two-tailed test is one that can test for differences in both directions. For example, a two-tailed 2-sample t-test can determine whether the difference between group 1 and group 2 is statistically significant in either the positive or negative direction. A one-tailed test can only assess one of those directions.</p>
<p>Second, we can only calculate a non-zero probability for a range of t-values. As you’ll see in the graph below, a range of t-values corresponds to a proportion of the total area under the distribution curve, which is the probability. The probability for any specific point value is zero because it does not produce an area under the curve.</p>
<p>With these points in mind, we’ll shade the area of the curve that has t-values greater than 2 and t-values less than -2.</p>
<p><img alt="T-distribution with a shaded area that represents a probability" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/5e124a2c8139681afec706799ebabcec/t_dist_prob.png" style="width: 576px; height: 384px;" /></p>
<p>The graph displays the probability for observing a difference from the null hypothesis that is at least as extreme as the difference present in our sample data while assuming that the null hypothesis is actually true. Each of the shaded regions has a probability of 0.02963, which sums to a total probability of 0.05926. When the null hypothesis is true, the t-value falls within these regions nearly 6% of the time.</p>
<p>This probability has a name that you might have heard of—it’s called the p-value! While the probability of our t-value falling within these regions is fairly low, it’s not low enough to reject the null hypothesis using the common <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-significance-levels-alpha-and-p-values-in-statistics" target="_blank">significance level</a> of 0.05.</p>
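<p>The shaded-area probabilities above (0.02963 in each tail, 0.05926 total) can be reproduced directly from the t-distribution's survival function. This is a sketch in scipy; the post used Minitab's probability distribution plots:</p>

```python
# Reproducing the two-tailed p-value for t = 2 with 20 degrees of freedom.
from scipy import stats

df = 20
one_tail = stats.t.sf(2, df)   # area under the curve to the right of +2
p_value = 2 * one_tail         # two-tailed: add the mirror-image area below -2
print(f"each tail: {one_tail:.5f}, p-value: {p_value:.5f}")
```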
<p><a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" target="_blank">Learn how to correctly interpret the p-value.</a></p>
t-Distributions and Sample Size
<p>As mentioned above, t-distributions are defined by the DF, which are closely associated with sample size. As the DF increases, the probability density in the tails decreases and the distribution becomes more tightly clustered around the central value. The graph below depicts t-distributions with 5 and 30 degrees of freedom.</p>
<p><img alt="Comparison of t-distributions with different degrees of freedom" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/5220dc6347611a230e89b70de904b034/t_dist_comp_df.png" style="width: 576px; height: 384px;" /></p>
<p>The t-distribution with fewer degrees of freedom has thicker tails. This occurs because the t-distribution is designed to reflect the added uncertainty associated with analyzing small samples. In other words, if you have a small sample, the probability that the sample statistic will be further away from the null hypothesis is greater even when the null hypothesis is true.</p>
<p>Small samples are more likely to be unusual. This affects the probability associated with any given t-value. For 5 and 30 degrees of freedom, a t-value of 2 in a two-tailed test has p-values of 10.2% and 5.4%, respectively. Large samples are better!</p>
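<p>The 10.2% and 5.4% figures above follow directly from the two t-distributions. A quick sketch with scipy:</p>

```python
# Two-tailed p-values for t = 2 at different degrees of freedom,
# showing how smaller samples (fewer DF) give larger p-values.
from scipy import stats

p_values = {df: 2 * stats.t.sf(2, df) for df in (5, 30)}
for df, p in p_values.items():
    print(f"df = {df}: p = {p:.3f}")
```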
<p>I’ve shown how t-values and t-distributions work together to produce probabilities. To see how each type of t-test works and actually calculates the t-values, read the other post in this series, <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-t-tests:-1-sample,-2-sample,-and-paired-t-tests">Understanding t-Tests: 1-sample, 2-sample, and Paired t-Tests</a>.</p>
Data Analysis | Hypothesis Testing | Learning | Statistics Help
Wed, 20 Apr 2016 12:00:00 +0000 | Jim Frost
http://blog.minitab.com/blog/adventures-in-statistics/understanding-t-tests-t-values-and-t-distributions

Best Way to Analyze Likert Item Data: Two Sample T-Test versus Mann-Whitney
http://blog.minitab.com/blog/adventures-in-statistics/best-way-to-analyze-likert-item-data%3A-two-sample-t-test-versus-mann-whitney
<p><img alt="Worksheet that shows Likert data" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/6b1cf78b969699ed58febb026d32051d/likert_worksheet.png" style="float: right; width: 162px; height: 265px; margin: 10px 15px;" />Five-point Likert scales are commonly associated with surveys and are used in a wide variety of settings. You’ve run into the Likert scale if you’ve ever been asked whether you strongly agree, agree, neither agree nor disagree, disagree, or strongly disagree about something. The worksheet to the right shows what five-point Likert data look like when you have two groups.</p>
<p>Because Likert item data are discrete, ordinal, and have a limited range, there’s been a longstanding dispute about the most valid way to analyze Likert data. The basic choice is between <a href="http://blog.minitab.com/blog/adventures-in-statistics/choosing-between-a-nonparametric-test-and-a-parametric-test" target="_blank">a parametric test and a nonparametric test</a>. The pros and cons for each type of test are generally described as the following:</p>
<ul>
<li>Parametric tests, such as the 2-sample t-test, assume a normal, continuous distribution. However, with a sufficient sample size, t-tests are robust to departures from normality.</li>
<li>Nonparametric tests, such as the Mann-Whitney test, do not assume a normal or a continuous distribution. However, there are concerns about a lower ability to detect a difference when one truly exists.</li>
</ul>
<p>What’s the better choice? This is a real-world decision that users of <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">statistical software</a> have to make when they want to analyze Likert data.</p>
<p>Over the years, a number of studies have tried to answer this question. However, they’ve tended to look at a limited number of potential distributions for the Likert data, which causes the generalizability of the results to suffer. Thanks to increases in computing power, simulation studies can now thoroughly assess a wide range of distributions.</p>
<p>In this blog post, I highlight a simulation study conducted by de Winter and Dodou* that compares the capabilities of the two sample t-test and the Mann-Whitney test to analyze five-point Likert items for two groups. Is it better to use one analysis or the other?</p>
<p>The researchers identified a diverse set of 14 distributions that are representative of actual Likert data. The computer program drew independent pairs of samples to test all possible combinations of the 14 distributions. All in all, 10,000 random samples were generated for each of the 98 distribution combinations! The pairs of samples were analyzed using both the two sample t-test and the Mann-Whitney test to compare how well each test performed. The study also assessed different sample sizes.</p>
<p>The results show that for all pairs of distributions the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/basics/type-i-and-type-ii-error/" target="_blank">Type I (false positive) error rates</a> are very close to the target amounts. In other words, if you use either analysis and your results are statistically significant, you don’t need to be overly concerned about a false positive.</p>
<p>The results also show that for most pairs of distributions, the difference between the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/power-and-sample-size/what-is-power/" target="_blank">statistical power</a> of the two tests is trivial. In other words, if a difference truly exists at the population level, either analysis is equally likely to detect it. The concerns about the Mann-Whitney test having less power in this context appear to be unfounded.</p>
<p>I do have one caveat. There are a few pairs of specific distributions where there is a power difference between the two tests. If you perform both tests on the same data and they disagree (one is significant and the other is not), you can look at a table in the article to help you determine whether a difference in statistical power might be an issue. This power difference affects only a small minority of the cases.</p>
<p>Generally speaking, the choice between the two analyses is a tie. If you need to compare two groups of five-point Likert data, it usually doesn’t matter which analysis you use. Both tests almost always provide the same protection against false negatives and always provide the same protection against false positives. These patterns hold true for sample sizes of 10, 30, and 200 per group.</p>
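<p>A miniature version of this kind of simulation is easy to run yourself. The sketch below draws many pairs of five-point Likert samples and counts how often each test detects the difference; the two distributions are invented for illustration and are not taken from the de Winter and Dodou study:</p>

```python
# Miniature power simulation: 5-point Likert data, two groups,
# two sample t-test versus Mann-Whitney. Distributions are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
levels = np.arange(1, 6)
p1 = [0.10, 0.20, 0.40, 0.20, 0.10]   # group 1: centered on "neutral"
p2 = [0.05, 0.10, 0.30, 0.35, 0.20]   # group 2: shifted toward "agree"

n, trials = 30, 2000
t_hits = mw_hits = 0
for _ in range(trials):
    a = rng.choice(levels, size=n, p=p1)
    b = rng.choice(levels, size=n, p=p2)
    t_hits += stats.ttest_ind(a, b).pvalue < 0.05
    mw_hits += stats.mannwhitneyu(a, b).pvalue < 0.05

print(f"t-test power: {t_hits / trials:.2f}, Mann-Whitney power: {mw_hits / trials:.2f}")
```

For distribution pairs like this one, the two power estimates come out close to each other, consistent with the study's conclusion.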
<p>*de Winter, J.C.F. and D. Dodou (2010), Five-Point Likert Items: t test versus Mann-Whitney-Wilcoxon, <em>Practical Assessment, Research and Evaluation</em>, 15(11).</p>
Data Analysis | Hypothesis Testing | Statistics | Statistics Help
Wed, 06 Apr 2016 12:00:00 +0000 | Jim Frost
http://blog.minitab.com/blog/adventures-in-statistics/best-way-to-analyze-likert-item-data%3A-two-sample-t-test-versus-mann-whitney

The American Statistical Association's Statement on the Use of P Values
http://blog.minitab.com/blog/adventures-in-statistics/the-american-statistical-associations-statement-on-the-use-of-p-values
<p>P values have been around for nearly a century and they’ve been the subject of criticism since their origins. In recent years, the debate over P values has risen to a fever pitch. In particular, there are serious fears that P values are misused to such an extent that it has actually damaged science.</p>
<p>In March 2016, spurred on by the growing concerns, the American Statistical Association (ASA) did something that it has never done before and took an official position on a statistical practice—how to use P values. The ASA tapped a group of 20 experts who discussed this over the course of many months. Despite facing complex issues and many heated disagreements, this group managed to reach a consensus on specific points and produce the <a href="http://amstat.tandfonline.com/doi/pdf/10.1080/00031305.2016.1154108" target="_blank">ASA Statement on Statistical Significance and P-values</a>.</p>
<p>I’ve written previously about my concerns over how P values have been misused and misinterpreted. My opinion is that P values are powerful tools but they need to be used and interpreted correctly. P value calculations incorporate the effect size, sample size, and variability of the data into a single number that objectively tells you how consistent your data are with the null hypothesis. You can read my case for the power of P values in my <a href="http://blog.minitab.com/blog/adventures-in-statistics/banned-p-values-and-confidence-intervals-a-rebuttal-part-1" target="_blank">rebuttal to a journal that banned them</a>.</p>
<p>The ASA statement contains the following six principles on how to use P values, which are remarkably aligned with my own. Let’s take a look at what they came up with.</p>
<ol>
<li>P-values can indicate how incompatible the data are with a specified statistical model.</li>
<li>P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.</li>
</ol>
<p>I discuss these ideas in my post <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" target="_blank">How to Correctly Interpret P Values</a>. It turns out that the common misconception stated in principle #2 creates the illusion of substantially more evidence against the null hypothesis than is justified. There are a number of reasons <a href="http://blog.minitab.com/blog/adventures-in-statistics/why-are-p-value-misunderstandings-so-common" target="_blank">why this type of P value misunderstanding is so common</a>. In reality, a P value is a probability about your sample data and not about the truth of a hypothesis.</p>
<ol>
<li value="3">Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.</li>
</ol>
<p>In statistics, we’re working with samples to describe a complex reality. Attempting to discover the truth based on an oversimplified process of comparing a single P value to an arbitrary significance level is destined to have problems. False positives, false negatives, and otherwise fluky results are bound to happen.</p>
<p>Using P values in conjunction with a significance level to decide when to reject the null hypothesis increases your chance of making the correct decision. However, there is no magical threshold that distinguishes between the studies that have a true effect and those that don’t with 100% accuracy. You can see a graphical representation of why this is the case in my post <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-why-we-need-to-use-hypothesis-tests-in-statistics" target="_blank">Why We Need to Use Hypothesis Tests</a>.</p>
<p>When Sir Ronald Fisher introduced P values, he never intended for them to be the deciding factor in such a rigid process. Instead, Fisher considered them to be just one part of a process that incorporates scientific reasoning, experimentation, statistical analysis and replication to lead to scientific conclusions.</p>
<p>According to Fisher, “A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.”</p>
<p>In other words, don’t expect a <em>single</em> study to provide a definitive answer. No single P value can divine the truth about reality by itself.</p>
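<p>Fisher's point is easy to demonstrate with a quick simulation: when the null hypothesis is actually true, about 5% of studies will still clear the 0.05 threshold purely by chance. Here is a minimal sketch in Python using scipy (the simulation and its numbers are illustrative, not from the post):</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulate 1,000 studies in which the null hypothesis is TRUE:
# both groups come from the same N(0, 1) population.
n_studies, n = 1000, 30
false_positives = 0
for _ in range(n_studies):
    a = rng.normal(0, 1, n)
    b = rng.normal(0, 1, n)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

# Roughly 5% of these null studies clear the threshold purely by chance.
print(f"False positive rate: {false_positives / n_studies:.3f}")
```

<p>No real effect exists in any of these simulated studies, yet dozens of them look "significant" in isolation, which is why replication matters.</p>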
<ol>
<li value="4">Proper inference requires full reporting and transparency.</li>
</ol>
<p>If you don’t know the full context of a study, you can’t properly interpret a carefully selected subset of the results. Data dredging, cherry picking, significance chasing, data manipulation, and other forms of p-hacking can make it impossible to draw the proper conclusions from selectively reported findings. You must know the full details about all data collection choices, how many and which analyses were performed, and all P values.</p>
<p><img alt="Comic about jelly beans causing acne with selective reporting of the results" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/22099bc252d3630a4876f579c1b83778/jelly_bean_comic.png" style="line-height: 20.8px; width: 500px; height: 1387px; margin: 10px 15px;" /></p>
<div><span style="line-height: 1.6;">In the </span><a href="http://xkcd.com/882/" style="line-height: 1.6;" target="_blank">XKCD comic</a><span style="line-height: 1.6;"> about jelly beans, if you didn’t know about the post hoc decision to subdivide the data and the 20 insignificant test results, you’d be pretty convinced that green jelly beans cause acne!</span></div>
<ol>
<li value="5">A p-value, or statistical significance, does not measure the size of an effect or the importance of an effect.</li>
<li value="6">By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.</li>
</ol>
<p>I cover these ideas, and more, in my <a href="http://blog.minitab.com/blog/adventures-in-statistics/five-guidelines-for-using-p-values">Five Guidelines for Using P Values</a>. P-values don’t tell you the size or importance of the effect. An effect can be statistically significant but trivial in the real world. This is the difference between <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/p-value-and-significance-level/practical-significance/" target="_blank">statistical significance and practical significance</a>. The analyst should supplement P values with other statistics, such as effect sizes and confidence intervals, to convey the importance of the effect.</p>
<p>Researchers need to apply their scientific judgment about the plausibility of the hypotheses, results of similar studies, proposed mechanisms, proper experimental design, and so on. Expert knowledge transforms statistics from numbers into meaningful, trustworthy findings.</p>
Data AnalysisHypothesis TestingLearningStatisticsStatistics in the NewsWed, 23 Mar 2016 12:00:00 +0000http://blog.minitab.com/blog/adventures-in-statistics/the-american-statistical-associations-statement-on-the-use-of-p-valuesJim FrostDo Actors Wait Longer than Actresses for Oscars? A Comparison Between Academy Award Winners
http://blog.minitab.com/blog/statistics-and-more/do-actors-wait-longer-than-actresses-for-oscars-a-comparison-between-academy-award-winners
<p><span style="line-height: 1.6;">I am a bit of an Oscar fanatic. Every year after the ceremony, I religiously go online to find out who won the awards and listen to their acceptance speeches. This year, I was <em>so </em>chuffed to learn that Leonardo DiCaprio won his first Oscar for his performance in <em>The Revenant</em> at the 88</span>th<span style="line-height: 1.6;"> Academy Awards—after five nominations in previous ceremonies. As a longtime DiCaprio fan, I still remember going to the cinema when <em>Titanic </em>was released, and returning four more times. Every time, I could not hold back my tears and used up all the tissues I'd brought with me!<img alt="this year's winner..." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a51cc79cd412237ef2f241d69e7e83ec/dicaprio.png" style="margin: 10px 15px; float: right; width: 190px; height: 250px;" /></span></p>
<p>Compared to his <em>Titanic </em>costar Kate Winslet, who won the Best Actress award in 2009 (aged 33), Leonardo waited 7 more years (20 years since his first nomination) before his turn came. I can name several actresses—Gwyneth Paltrow, Hilary Swank, and Jennifer Lawrence come immediately to mind—who obtained the award at younger ages. However, it appears that few young actors have received the Academy Award in recent years. This makes me wonder whether Oscar-winning actors tend to be older than Oscar-winning actresses.</p>
<p>To investigate, I collected data of the dates of past Academy Awards ceremonies and the birthdays of the winning actors and actresses. From these, I calculated the age of the winners on their Oscar-winning night. Below is a screenshot of some of the data.</p>
<p><img alt="oscars data" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ef494be723f2d7d3bb55d8f055124ad1/oscar1.png" style="width: 564px; height: 390px;" /></p>
<p>I used <a href="http://www.minitab.com/products/minitab">Minitab Statistical Software</a> to create a time series plot of the data, shown below.</p>
<p><img alt="time series plot" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/2d5d4f84cb67fbe6fd41aee118f40c6a/oscar2.png" style="width: 550px; height: 367px;" /></p>
<p><span style="line-height: 1.6;">The plot suggests that there is usually a substantial age difference between the Best Actor and Best Actress winners. There are more years when the Best Actor winner is much older than the Best Actress winner (blue dots above red dots) than years when the winning actress is older. Some examples:</span></p>
<p style="margin-left: 40px;">1987: Paul Newman (62.1726), Marlee Matlin (21.5973)</p>
<p style="margin-left: 40px;">1989: Dustin Hoffman (51.6329), Jodie Foster (26.3507)</p>
<p style="margin-left: 40px;">1990: Daniel Day-Lewis (32.9068), Jessica Tandy (80.8000)</p>
<p style="margin-left: 40px;">1992: Anthony Hopkins (54.2466), Jodie Foster (29.3616)</p>
<p style="margin-left: 40px;">1998: Jack Nicholson (60.9178), Helen Hunt (34.7699)</p>
<p style="margin-left: 40px;">2011: Colin Firth (50.4658), Natalie Portman (29.7205)</p>
<p style="margin-left: 40px;">2013: Daniel Day-Lewis (55.8247), Jennifer Lawrence (22.5288)</p>
<p><span style="line-height: 1.6;">There are not many occasions when the Best Actor and Best Actress winners fall in the same decade of life (both in their 30s, 40s, 50s, and so on).</span></p>
<p><a href="http://blog.minitab.com/blog/cpammer/planning-a-trip-to-disney-world%3A-using-statistics-to-keep-it-in-the-green">Conditional formatting</a>, introduced in Minitab 17.2, is what I am going to use to identify any repeat winners in the data.</p>
<p style="margin-left: 40px;"><img alt="conditional formatting" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/9c70ddd1b5378dd004dac75a6dafaf31/oscar3.png" style="width: 505px; height: 213px;" /></p>
<p>Minitab applies the following conditional formatting to the data set:</p>
<p style="margin-left: 40px;"><img alt="conditional formatting" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/bfff88d02adb86b588b2ee63fc9d41a4/oscar4.png" style="width: 547px; height: 570px;" /></p>
<p>For the Best Actor award, Daniel Day-Lewis received the award on three occasions, while <span style="line-height: 20.8px;">Marlon Brando, Gary Cooper, Tom Hanks, Dustin Hoffman, Fredric March, Jack Nicholson, </span><span style="line-height: 20.8px;">Sean Penn, and Spencer Tracy each</span><span style="line-height: 1.6;"> won the award twice.</span></p>
<p>For the Best Actress category, Katharine Hepburn won four times. <span style="line-height: 20.8px;">Ingrid Bergman, Bette Davis, Olivia de Havilland, Sally Field, Jane Fonda, Jodie Foster, </span><span style="line-height: 20.8px;">Glenda Jackson, Vivien Leigh, Luise Rainer, Meryl Streep, Hilary Swank, and Elizabeth Taylor each</span><span style="line-height: 1.6;"> received the award twice.</span></p>
<p>Winners below the age of 30 could be regarded as obtaining the award at an early stage of their careers. Using conditional formatting again, I can quickly identify the actors and actresses in the data who are in this group.</p>
<p><img alt="conditional formatting" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a1397cfe27b867b0fae4b1da7271a945/oscar5.png" style="width: 496px; height: 210px;" /></p>
<p><span style="line-height: 1.6;">As shown below, a lot more actresses than actors obtain the award before the age of 30.</span></p>
<p><img alt="conditional formatted data" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/2dd1d117449fee623b47ce7c0062bb7d/oscar5a.png" style="width: 649px; height: 432px;" /></p>
<p><span style="line-height: 1.6;">To get a better comparison, I am going to remove the repeats (with the help of the highlighted cells) for actors and actresses who won more than once and only take into account their age at first win. This gives data from 79 Best Actor and 74 Best Actress winners. I am going to use <a href="http://www.minitab.com/products/minitab/assistant/">the Assistant</a> to carry out a comparison using the 2-sample t-test.</span></p>
<p><img alt="Assistant 2-sample t test" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/eab880dbb38bb33bbfb452693513610f/oscar6.png" style="width: 641px; height: 495px;" /></p>
<p><span style="line-height: 1.6;">Apart from generating easy-to-interpret output, the Assistant also has the advantage of carrying out a powerful t-test even with unequal sample sizes using the Welch approach.</span></p>
<p><img alt="Report Card" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ac73bee979d25f49f4221170e9769956/oscar7.png" style="line-height: 1.6; width: 650px; height: 488px;" /></p>
<p><span style="line-height: 1.6;">The Report Card indicates that we have sufficient data and the assumptions of the t-test are fulfilled. However, Minitab also detects some unusual data, which I will look into further.</span></p>
<p><img alt="2 sample t test diagnostic report" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/384df9dc72ef6868ed050f65d663e1bc/oscar8.png" style="width: 650px; height: 487px;" /></p>
<p><span style="line-height: 1.6;">Using the brush, the following unusual data are identified.</span></p>
<p style="margin-left: 40px;"><strong>Best Actor: </strong><br />
<span style="line-height: 1.6;">John Wayne (62.8658)</span><br />
<span style="line-height: 1.6;">Henry Fonda (76.8685)</span></p>
<p>These winners were considerably older, as the majority of the actor winners are in their 40s and 50s.</p>
<p style="margin-left: 40px;"><strong>Best Actress:</strong><br />
<span style="line-height: 1.6;">Marie Dressler (63.0027)</span><br />
<span style="line-height: 1.6;">Geraldine Page (61.3342)</span><br />
<span style="line-height: 1.6;">Jessica Tandy (80.8000)</span><br />
<span style="line-height: 1.6;">Helen Mirren (61.5863)</span></p>
<p>These winners were considerably older, as the majority of the actress winners were in their late 30s and 40s.</p>
<p><img alt="2-sample t test summary report" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/8d4c39828bcad9ffdc0e151771078244/oscar9.png" style="width: 650px; height: 488px;" /></p>
<p><span style="line-height: 1.6;">The Summary Report provides the key output of the t-test. The mean age of the Best Actor winners is 43.746, while the mean age of the Best Actress winners is 35. The p-value of the test is very small (<0.001). This means that we have enough evidence to suggest that, on average, the Best Actor winner is older than the Best Actress winner.</span></p>
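<p>For readers without Minitab, the same Welch-style comparison can be sketched in Python with scipy. The ages below are simulated stand-ins with roughly the group sizes and means reported above, not the real Oscar data:</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simulated stand-ins: 79 Best Actor and 74 Best Actress first-win ages
# (sizes and rough means from the post; the values are not the real data).
actors = rng.normal(44, 9, 79)
actresses = rng.normal(35, 8, 74)

# equal_var=False requests Welch's t-test, which does not assume equal
# variances and copes with unequal sample sizes.
t, p = stats.ttest_ind(actors, actresses, equal_var=False)
print(f"t = {t:.2f}, p = {p:.2g}")
```

<p>With a mean gap this large relative to the spread, the simulated test also returns a p-value far below 0.001.</p>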
<p><span style="line-height: 1.6;">I will leave it to others to speculate (and perhaps even use data to explore) why this apparent age gap exists. However, whatever their ages, we all enjoy seeing these Oscar winners' amazing performances on the big screen!</span></p>
<p><span style="font-size: 8px; line-height: 1.6;">Photograph of Leonardo DiCaprio by <a href="https://www.flickr.com/photos/phototoday2008/11933209533/" target="_blank">See Li</a>, used under Creative Commons 2.0. </span></p>
Fun StatisticsHypothesis TestingStatistics in the NewsMon, 07 Mar 2016 13:00:00 +0000http://blog.minitab.com/blog/statistics-and-more/do-actors-wait-longer-than-actresses-for-oscars-a-comparison-between-academy-award-winnersEugenie ChungHow to Compare Regression Slopes
http://blog.minitab.com/blog/adventures-in-statistics/how-to-compare-regression-lines-between-different-models
<p>If you perform linear regression analysis, you might need to compare different regression lines to see if their constants and slope coefficients are different. Imagine there is an established relationship between X and Y. Now, suppose you want to determine whether that relationship has changed. Perhaps there is a new context, process, or some other qualitative change, and you want to determine whether that affects the relationship between X and Y.</p>
<p>For example, you might want to assess whether the relationship between the height and weight of football players is significantly different than the same relationship in the general population.</p>
<p>You can graph the regression lines to visually compare the slope coefficients and constants. However, you should also statistically test the differences. <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-why-we-need-to-use-hypothesis-tests-in-statistics" target="_blank">Hypothesis testing</a> helps separate the true differences from the random differences caused by sampling error so you can have more confidence in your findings.</p>
<p>In this blog post, I’ll show you how to compare a relationship between different regression models and determine whether the differences are statistically significant. Fortunately, these tests are easy to do using <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">Minitab statistical software</a>.</p>
<p>In the example I’ll use throughout this post, there is an input variable and an output variable for a hypothetical process. We want to compare the relationship between these two variables under two different conditions. Here is the <a href="//cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/File/569a0e7d067944f6f9147434794efcd6/comparingregressionmodels.MPJ">Minitab project file</a> with the data.</p>
Comparing Constants in Regression Analysis
<p>When the <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-to-interpret-the-constant-y-intercept" target="_blank">constants</a> (or y intercepts) in two different regression equations are different, this indicates that the two regression lines are shifted up or down on the Y axis. In the scatterplot below, you can see that the Output from Condition B is consistently higher than Condition A for any given Input value. We want to determine whether this vertical shift is statistically significant.</p>
<p><img alt="Scatterplot with two regression lines that have different constants." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/2ed27f4204515bac9d9674c16fa0c0f7/scatter_constant_dift.png" style="width: 576px; height: 384px;" /></p>
<p>To test the difference between the constants, we just need to include a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/data-concepts/cat-quan-variable/" target="_blank">categorical variable</a> that identifies the qualitative attribute of interest in the model. For our example, I have created a variable for the condition (A or B) associated with each observation.</p>
<p>To fit the model in Minitab, I’ll use: <strong>Stat > Regression > Regression > Fit Regression Model</strong>. I’ll include <em>Output</em> as the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/regression-models/what-are-response-and-predictor-variables/" target="_blank">response variable</a>, <em>Input</em> as the continuous <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/regression-models/what-are-response-and-predictor-variables/" target="_blank">predictor</a>, and <em>Condition</em> as the categorical predictor.</p>
<p>In the regression analysis output, we’ll first check the coefficients table.</p>
<p style="margin-left: 40px;"><img alt="Coefficients table that shows that the constants are different" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/23657868f2cf893d216d05d3400ab9e6/coeff_constant_dift.png" style="width: 369px; height: 117px;" /></p>
<p>This table shows us that the relationship between Input and Output is statistically significant because the p-value for Input is 0.000.</p>
<p>The <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients" target="_blank">coefficient</a> for Condition is 10 and its <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients" target="_blank">p-value</a> is significant (0.000). The coefficient tells us that the vertical distance between the two regression lines in the scatterplot is 10 units of Output. The p-value tells us that this difference is statistically significant—you can reject the null hypothesis that the distance between the two constants is zero. You can also see the difference between the two constants in the regression equation table below.</p>
<p style="margin-left: 40px;"><img alt="Regression equation table that shows constants that are different" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/a879996e37ebb05a297721e695a71943/equ_constant_dift.png" style="width: 305px; height: 113px;" /></p>
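<p>The same test of the constants can be sketched outside Minitab. The snippet below, using statsmodels on simulated data built with a vertical shift of 10, recovers that shift as the coefficient on the Condition dummy (the data and variable names are hypothetical stand-ins for the post's project file):</p>

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)

# Simulated stand-in for the post's data: Condition B is shifted up by
# 10 Output units relative to Condition A, with the same slope of 1.5.
n = 50
x = np.tile(np.linspace(0, 20, n), 2)
cond = np.repeat(["A", "B"], n)
y = 5 + 1.5 * x + np.where(cond == "B", 10.0, 0.0) + rng.normal(0, 2, 2 * n)

df = pd.DataFrame({"Output": y, "Input": x, "Condition": cond})

# The coefficient on the Condition dummy estimates the vertical distance
# between the two regression lines; its p-value tests that distance.
model = smf.ols("Output ~ Input + C(Condition)", data=df).fit()
print(model.params)
print(model.pvalues["C(Condition)[T.B]"])
```

<p>The estimated dummy coefficient comes out close to the true shift of 10, with a p-value small enough to reject a zero difference between the constants.</p>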
Comparing Coefficients in Regression Analysis
<p>When two slope coefficients are different, a one-unit change in a predictor is associated with different mean changes in the response. In the scatterplot below, it appears that a one-unit increase in Input is associated with a greater increase in Output in Condition B than in Condition A. We can <em>see</em> that the slopes look different, but we want to be sure this difference is statistically significant.</p>
<p><img alt="Scatterplot that shows two slopes that are different" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/200c12087fdf7eecd9b773d9ce213020/scatter_slope_dift.png" style="width: 576px; height: 384px;" /></p>
<p>How do you statistically test the difference between regression coefficients? It sounds like it might be complicated, but it is actually very simple. We can even use the same Condition variable that we did for testing the constants.</p>
<p>We need to determine whether the coefficient for Input depends on the Condition. In statistics, when we say that the effect of one variable depends on another variable, that’s an interaction effect. All we need to do is include the interaction term for Input*Condition!</p>
<p>In Minitab, you can specify interaction terms by clicking the <strong>Model</strong> button in the main regression dialog box. After I fit the regression model with the interaction term, we obtain the following coefficients table:</p>
<p style="margin-left: 40px;"><img alt="Coefficients table that shows different slopes" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/f06eff56f2266d0ff7e3919aa1292285/coeff_slope_dift.png" style="width: 410px; height: 154px;" /></p>
<p>The table shows us that the interaction term (Input*Condition) is statistically significant (p = 0.000). Consequently, we reject the null hypothesis and conclude that the difference between the two coefficients for Input (below, 1.5359 and 2.0050) does not equal zero. We also see that the main effect of Condition is not significant (p = 0.093), which indicates that the difference between the two constants is not statistically significant.</p>
<p style="margin-left: 40px;"><img alt="Regression equation table that shows different slopes" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/d5e5142c0ff13645d1dacc3e2c0bee27/equ_coeff_dift.png" style="width: 295px; height: 105px;" /></p>
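<p>A parallel sketch for the slopes: in a statsmodels formula, <code>Input * C(Condition)</code> expands to both main effects plus the interaction, and the interaction coefficient estimates the difference between the two slopes (simulated data; the slopes 1.5 and 2.0 merely echo the post's fitted values):</p>

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)

# Simulated data with genuinely different slopes: 1.5 under Condition A
# and 2.0 under Condition B.
n = 50
x = np.tile(np.linspace(0, 20, n), 2)
cond = np.repeat(["A", "B"], n)
slope = np.where(cond == "B", 2.0, 1.5)
y = 5 + slope * x + rng.normal(0, 2, 2 * n)

df = pd.DataFrame({"Output": y, "Input": x, "Condition": cond})

# The interaction term's coefficient estimates the slope difference;
# its p-value tests whether the slopes differ.
model = smf.ols("Output ~ Input * C(Condition)", data=df).fit()
print(model.params["Input:C(Condition)[T.B]"])
print(model.pvalues["Input:C(Condition)[T.B]"])
```

<p>The interaction estimate lands near the true slope difference of 0.5, and its small p-value mirrors the significant Input*Condition term in the Minitab output.</p>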
<p>It is easy to compare and test the differences between the constants and coefficients in regression models by including a categorical variable. These tests are useful when you can see differences between regression models and you want to defend your conclusions with p-values.</p>
<p>If you're learning about regression, read my <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-tutorial-and-examples">regression tutorial</a>!</p>
Data AnalysisHypothesis TestingRegression AnalysisStatistics HelpWed, 13 Jan 2016 13:00:00 +0000http://blog.minitab.com/blog/adventures-in-statistics/how-to-compare-regression-lines-between-different-modelsJim FrostChecking the “Naughty” or “Nice” Assessment with Attribute Agreement Analysis
http://blog.minitab.com/blog/using-data-and-statistics/checking-the-naughty-or-nice-assessment-with-attribute-agreement-analysis
<p><span style="line-height: 1.6;">Each year Santa’s Elves have to take all the information provided by family, friends and teachers to determine if all the children of the world have been “Naughty” or “Nice.” This is no small task, as according to the website </span><a href="http://www.santafaqs.com/" style="line-height: 1.6;">www.santafaqs.com</a><span style="line-height: 1.6;"> Santa delivers over 5 billion presents per year. </span></p>
<p><span style="line-height: 1.6;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/0b9beb7f6ce672b36e9141b1bc3d3826/elf_classifying.png" style="margin: 10px 15px; float: right; width: 200px; height: 194px;" />Not only is it a large task in terms of size, but it is critical that the Elves have a consistent approach to this assessment. Santa does not want to give presents to naughty children, but he is adamant that he would rather mistakenly give a present to a naughty child than run the risk of <em>not</em> giving a present to a nice child. </span></p>
<p>For this reason, every summer Santa trains all his staff on separating people into the “Naughty” and “Nice” categories, and then he gives them a final test on a set of characters where their behaviour category is already known. For each of these 50 characters, Santa gives the Elves details of their behaviour as reported by their family, friends and work colleagues, and they give them a Naughty or Nice grade. To set up and analyse his new Elf recruits' performance, Santa uses an <span><a href="http://blog.minitab.com/blog/understanding-statistics/got-good-judgment-prove-it-with-attribute-agreement-analysis">Attribute Agreement Analysis</a></span>.</p>
<p>The full list of characters and their grades can be seen in this Minitab project file: <a href="//cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/edccea06e97c99500398e5f26bf71e23/elf_test.MPJ">elf-test.mpj</a>. If you don't already have Minitab and you'd like to give Attribute Agreement Analysis a try with this data set, you can <a href="http://www.minitab.com/en-us/products/minitab/free-trial/">download the free 30-day trial</a>. </p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/006aec47ff21e0f2c53ff20bb3c8aaf7/naughty_1.png" style="border-width: 0px; border-style: solid; width: 282px; height: 330px;" /></p>
<p><span style="line-height: 1.6;">The first thing Santa has to do is create an Attribute Agreement Worksheet, which ensures that each Elf evaluates all the characters in a random order and creates a Minitab worksheet that includes the expected category (Naughty or Nice) for each person, so that Santa or one of his helpers can quickly enter the Elves' assessments. </span></p>
<p>To avoid any pre-judgement, the Elves do not see the name of the person they are assessing—only their Sample No and the information from family and friends.</p>
<p>The steps he follows are:</p>
<ol>
<li><strong>File > Open Project > Elf-Test.mpj</strong></li>
<li><strong>Assistant > Measurement System Analysis (MSA) > Attribute Agreement Worksheet</strong></li>
</ol>
<p>Santa completes the dialog box as follows and clicks OK. He then prints off the collection datasheets and has the new Elves assess the information for each of the people on the list and categorise them as Naughty or Nice. Once he has this information, it is entered into the Minitab worksheet.</p>
<p><img alt="Attribute Agreement Analysis worksheet" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/367d2c66939af738133c5e34ef72dabf/naughty_2.png" style="border-width: 0px; border-style: solid; width: 585px; height: 418px;" /></p>
<p>Once Santa has collected all this data, he runs the Attribute Agreement Analysis in the Assistant and gets the following results:</p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/058a59ecbcb6d9b30f6bb90f896e7f9e/naughty_3.png" style="border-width: 0px; border-style: solid; width: 496px; height: 137px;" /></p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/cd34f968224601cfacac3a635117c05b/naughty_4.png" style="border-width: 0px; border-style: solid; width: 567px; height: 575px;" /></p>
<p>Santa is happy with the overall error rate. However, he is very concerned that the percentage of Nice people being rated as Naughty is higher than the overall error rate. This means that there are some good people that may not get presents. This is not acceptable, so he uses another report produced by Minitab to investigate which people are being mis-classified.</p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a16a40bd7ae48ca184dc665b6c3727f3/naughty_5.png" style="border-width: 0px; border-style: solid; width: 549px; height: 318px;" /></p>
<p>This chart shows which samples were misclassified as Naughty.</p>
<p>Santa is worried because every Elf said person 26 was Naughty when the standard was Nice. When Santa looks at the Elf-Test Worksheet, he can see that person 26 was Sherlock Holmes. Santa checks the information on him and can see why the Elves think he is naughty: he smokes and the neighbours have complained that he plays his violin (badly) at all hours of the day and night. Santa provides extra training to the Elves to help them realise that musicians only improve if they practise regularly, so the neighbours will have to suffer.</p>
<p>Characters 24, 40 and 49 (Little Red Riding Hood, Stuart Little and Shrek, respectively) were only misclassified once apiece, so Santa wants to investigate which Elves made the wrong decision in these cases. Again, he uses one of the reports the Assistant produces as standard.</p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/e960ad692f6856d825475bfc1373de29/naughty_6.png" style="border-width: 0px; border-style: solid; width: 545px; height: 344px;" /></p>
<p>From this report, Santa can see that Berry is the strictest elf—and the one who has made the most mistakes classifying Nice people as Naughty. For this reason, Santa decides to reassign Berry to the reindeer welfare department.</p>
<p>Jingle and Sparkle are now full time Niceness monitors, and Santa is sure—thanks to his training program and the Attribute Agreement Assessment Analysis completed in Minitab—that <em>everyone</em> will get the presents they deserve this year.</p>
<p>If, like Santa, you have to make qualitative assessments on your products or services, an Attribute Agreement Analysis is a good way to verify and improve the performance of your assessors.</p>
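<p>The core calculations behind an attribute agreement analysis against a known standard are simple to sketch. The toy table below (hypothetical appraisals, not the elf-test data) computes the overall error rate and the "Nice rated Naughty" rate Santa worries about most:</p>

```python
import pandas as pd

# Toy appraisal table: each row is one Elf's call on a sample whose true
# ("standard") category is known. Names and values are hypothetical.
data = pd.DataFrame({
    "standard":  ["Nice", "Nice", "Nice", "Naughty", "Naughty", "Nice"],
    "appraisal": ["Nice", "Naughty", "Nice", "Naughty", "Nice", "Nice"],
})

errors = data["standard"] != data["appraisal"]
overall = errors.mean()

# The rate Santa cares most about: Nice samples rated Naughty.
nice = data["standard"] == "Nice"
nice_as_naughty = (errors & nice).sum() / nice.sum()

print(f"Overall error rate: {overall:.2f}")          # 2 of 6 appraisals wrong
print(f"Nice rated Naughty: {nice_as_naughty:.2f}")  # 1 of 4 Nice samples
```

<p>A fuller analysis, like Minitab's Assistant report, would also break these rates down by appraiser and add confidence intervals, but the per-category error rates above are the quantities being compared.</p>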
Fun StatisticsHypothesis TestingQuality ImprovementWed, 23 Dec 2015 13:00:00 +0000http://blog.minitab.com/blog/using-data-and-statistics/checking-the-naughty-or-nice-assessment-with-attribute-agreement-analysisGillian GroomWhy Are P Value Misunderstandings So Common?
http://blog.minitab.com/blog/adventures-in-statistics/why-are-p-value-misunderstandings-so-common
<p><img alt="Danger thin ice sign" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/694cbaccbcb94c40ba77ec6a967994d7/thin_ice_sign.jpg" style="float: right; width: 225px; height: 300px; margin: 15px 10px;" />I’ve written a fair bit about P values: <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" target="_blank">how to correctly interpret P values</a>, <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-significance-levels-alpha-and-p-values-in-statistics" target="_blank">a graphical representation of how they work</a>, <a href="http://blog.minitab.com/blog/adventures-in-statistics/five-guidelines-for-using-p-values" target="_blank">guidelines for using P values</a>, and why the <a href="http://blog.minitab.com/blog/adventures-in-statistics/banned-p-values-and-confidence-intervals-a-rebuttal-part-1" target="_blank">P value ban in one journal is a mistake</a>. Along the way, I’ve received many questions about P values, but the questions from one reader stand out.</p>
<p>This reader asked, <em>why</em> is it so easy to interpret P values incorrectly? Why is the common misinterpretation <em>so</em> pervasive? And what can be done about it? He wasn’t sure if these were fair questions, but I think they are. Let’s answer them!</p>
The Correct Way to Interpret P Values
<p>First, to make sure we’re on the same page, here’s the correct definition of P values.</p>
<p>The P value is the probability of obtaining an effect at least as extreme as the one in your sample data, assuming the truth of the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/basics/null-and-alternative-hypotheses/" target="_blank">null hypothesis</a>. In other words, if the null hypothesis is true, the P value is the probability of obtaining sample data at least as extreme as yours. It answers the question: are your sample data unusual if the null hypothesis is true?</p>
<p>If you’re thinking that the P value is the probability that the null hypothesis is true, the probability that you’re making a mistake if you reject the null, or anything else along these lines, that’s the most common misunderstanding. You should click the links above to learn how to correctly interpret P values.</p>
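That definition can be made concrete with a short simulation. The sketch below is illustrative only (a hypothetical normal population, a null value of 6, and an observed sample mean of 10, echoing the signal example in the t-test post): it pretends the null hypothesis is true, draws many samples from that world, and counts how often a sample mean lands at least as far from the null value as the one observed.

```python
import numpy as np

rng = np.random.default_rng(42)

null_mean = 6        # hypothesized value (illustrative, not from a real study)
observed_mean = 10   # the sample mean we actually got
sigma, n = 8, 25     # assumed population standard deviation and sample size

# Pretend the null hypothesis is true and draw many samples from that world.
sims = rng.normal(null_mean, sigma, size=(100_000, n)).mean(axis=1)

# The p-value is the fraction of those samples whose mean is at least as
# far from the null value as the mean we observed (two-sided).
p_sim = np.mean(np.abs(sims - null_mean) >= abs(observed_mean - null_mean))
print(round(p_sim, 3))
```

Note what the simulation does and does not tell you: it gives the probability of data like yours under the null, not the probability that the null is true.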
Historical Circumstances Helped Make P Values Confusing
<p>This problem is nearly a century old and goes back to two very antagonistic camps from the early days of hypothesis testing: Fisher's measures of evidence approach (P values) and the Neyman-Pearson error rate approach (alpha). Fisher believed in inductive reasoning, which is the idea that we can use sample data to learn about a population. On the other side, the Neyman-Pearson methodology does not allow analysts to learn from individual studies. Instead, the results only apply to a long series of tests.</p>
<p>Courses and textbooks have mushed these disparate approaches together into the standard hypothesis-testing procedure that is known and taught today. This procedure <em>seems </em>like a seamless combination but it's really a muddled, Frankenstein's-monster combination of sometimes-contradictory methods that has promoted the confusion. The end result of this fusion is that P values are incorrectly entangled with the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/basics/type-i-and-type-ii-error/" target="_blank">Type I error rate</a>. Fisher tried to clarify this misunderstanding for decades, but to no avail.</p>
P Values Aren’t What We <em>Really </em>Want to Know
<p>The common misconception is what we'd <em>really</em> like to know. We’d <em>loooove</em> to know the probability that a hypothesis is correct, or the probability that we’re making a mistake. What we get instead is the probability of our <em>observation</em>, which just isn’t as useful.</p>
<p>It would be great if we could take evidence solely from a sample and determine the probability that the sample is wrong. Unfortunately, that's not possible—for logical reasons when you think about it. Without outside information, a sample can’t tell you whether it’s representative of the population.</p>
<p>P values are based exclusively on information contained within a sample. Consequently, P values can't answer the question that we most want answered, but there seems to be an irresistible temptation towards interpreting them that way.</p>
P Values Have a Convoluted Definition
<p>The correct definition of a P value is fairly convoluted. The definition is based on the probability of observing what you actually did observe (huh?), but in a hypothetical context (a true null hypothesis), and it includes strange wording about results that are at least as extreme as what you observed. It's hard to understand all of that without a lot of study. It's just not intuitive.</p>
<p>Unfortunately, there is no simple <em>and</em> accurate definition that can help counteract the pressures to believe in the common misinterpretation. In fact, the incorrect definition <em>sounds</em> so much simpler than the correct definition. Shoot, <a href="http://fivethirtyeight.com/features/not-even-scientists-can-easily-explain-p-values/" target="_blank">not even scientists can explain P values</a>! And, so the misconceptions live on.</p>
What Can Be Done?
<p>Historical circumstances have conspired to confuse the issue. We have a natural tendency to want P values to mean something else. And, there is no simple yet correct definition for P values that can counteract the common misunderstandings. No wonder this has been a problem for a long time!</p>
<p>Fisher tried to correct this misinterpretation but didn't have much luck. As for myself, I hope to point out that what may seem like a semantic difference between the correct and incorrect definitions actually equates to a huge difference.</p>
<p>Using the incorrect definition is likely to come back to bite you! If you think a P value of 0.05 equates to a 5% chance of a mistake, boy, are you in for a big surprise—because it’s often around 26%! Instead, based on middle-of-the-road assumptions, you’ll need a P value around 0.0027 to achieve an error rate of about 5%. However, <a href="http://blog.minitab.com/blog/adventures-in-statistics/not-all-p-values-are-created-equal" target="_blank">not all P values are created equal</a> in terms of the error rate.</p>
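The size of that gap can be illustrated with a quick calculation. The function below implements the Sellke-Berger lower bound on the false-positive rate, a different calculation from the one behind the 26% figure above but in the same spirit; it assumes 50:50 prior odds that the null is true, which is one "middle-of-the-road" assumption.

```python
import math

def min_false_positive_rate(p):
    """Sellke-Berger lower bound on the chance that a result with this
    p-value is a false positive, assuming 50:50 prior odds on the null.
    Valid for p < 1/e."""
    bound = -math.e * p * math.log(p)   # bound on the Bayes factor
    return bound / (1 + bound)

print(round(min_false_positive_rate(0.05), 2))    # about 0.29
print(round(min_false_positive_rate(0.0027), 2))  # about 0.04
```

So even under favorable assumptions, a p-value of 0.05 carries a false-positive risk far above 5%, while a p-value near 0.0027 brings that risk down to roughly the level people mistakenly attribute to 0.05.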
<p>I also think that P values are easier for most people to understand graphically than through the tricky definition and the math. So, I wrote a series of blog posts that graphically show <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-why-we-need-to-use-hypothesis-tests-in-statistics" target="_blank">why we need hypothesis testing and how it works</a>.</p>
<p>I have no reason to expect that I'll have any more impact than Fisher did himself, but it's an attempt!</p>
Hypothesis TestingLearningStatisticsThu, 10 Dec 2015 13:00:00 +0000http://blog.minitab.com/blog/adventures-in-statistics/why-are-p-value-misunderstandings-so-commonJim FrostWhy You Should Use Non-parametric Tests when Analyzing Data with Outliers
http://blog.minitab.com/blog/applying-statistics-in-quality-projects/why-you-should-use-non-parametric-tests-when-analyzing-data-with-outliers
<p>There are many reasons why a distribution might not be normal/Gaussian. A non-normal pattern might be caused by several distributions being mixed together, by a drift over time, by one or several outliers, by asymmetrical behavior, by out-of-control points, and so on.</p>
<p>I recently collected the scores of three different teams (the Blue team, the Yellow team and the Pink team) after a laser tag game session one Saturday afternoon. The three teams represented three different groups of friends wishing to spend their afternoon tagging players from competing teams. Gengiz Khan turned out to be the best player, followed by Tarantula and Desert Fox.</p>
One-Way ANOVA
<p>In this post, I will focus on team performances, not on single individuals. I decided to compare the average scores of each team. The best tool I could possibly think of was a one-way ANOVA using the Minitab <a href="http://www.minitab.com/products/minitab/assistant/">Assistant</a> (with a continuous Y response and three sample means to compare).</p>
<p>To assess statistical significance, the differences <em>between </em>team averages are compared to the <em>within </em>(team) variability. A large between-team variability compared to a small within-team variability (the error term) means that the differences between teams are statistically significant.</p>
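This between/within comparison can be made concrete with a small Python sketch. The team scores below are invented (the post's raw data aren't published); the point is that computing the F ratio by hand from the between-group and within-group sums of squares matches what a library routine reports.

```python
import numpy as np
from scipy import stats

# Illustrative team scores only -- the post's actual data are not given.
groups = [
    np.array([12.0, 15.0, 14.0, 10.0, 13.0]),   # "Blue"
    np.array([9.0, 11.0, 10.0, 12.0, 8.0]),     # "Yellow"
    np.array([14.0, 16.0, 15.0, 13.0, 17.0]),   # "Pink"
]

all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()
k, n = len(groups), len(all_obs)

# Between-group sum of squares: how far each team mean sits from the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: scatter of players around their own team mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
f_scipy, p = stats.f_oneway(*groups)
print(round(f_manual, 3), round(f_scipy, 3))  # the two F statistics agree
```

Note that `scipy.stats.f_oneway` is the classic equal-variance ANOVA; the Minitab Assistant's Welch variant, mentioned below, relaxes that assumption.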
<p>In this comparison (see the output from the Assistant below), the <a href="http://blog.minitab.com/blog/understanding-statistics/what-can-you-say-when-your-p-value-is-greater-than-005">P value was 0.053, just above the usual 0.05</a> threshold. The P value is the probability of observing differences between means at least this large if only random causes were at play. A p-value above 0.05 therefore indicates that such differences could plausibly arise from random variation alone. Because of that, the differences are not considered statistically significant (there is "not enough evidence that there are significant differences," according to the comments in the Minitab Assistant). But the result remains somewhat ambiguous, since the p-value is very close to the significance limit (0.05).</p>
<p>Note that the variability within the Blue team seems to be much larger (see the confidence interval plot in the means comparison chart below) than for the other two groups. This is not a cause for concern in this case, since the Minitab Assistant uses the <a href="http://blog.minitab.com/blog/adventures-in-statistics/did-welchs-anova-make-fishers-classic-one-way-anova-obsolete">Welch method of ANOVA</a>, which does not require or assume variances within groups to be equal.</p>
<p><img height="468" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/8457900262f468b76f8d2a4f28027c2d/8457900262f468b76f8d2a4f28027c2d.png" width="624" /></p>
Outliers and Normality
<p>When looking at the distribution of individual data (below), one point seems to be an outlier, or at least a suspect extreme value (marked in red). This is Gengiz Khan, the best player. In my worksheet, the scores have been entered from the best to the worst (not in time order). This is why we can see a downward trend in the chart on the right side of the diagnostic report (see below).</p>
<p><img height="468" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/7a7f0889d207c5dab409c4f32fb33d85/7a7f0889d207c5dab409c4f32fb33d85.png" width="624" /></p>
<p>The Report Card (see below) from the Minitab Assistant shows that Normality might be an issue (the yellow triangle is a warning sign) because the sample sizes are quite small. We need to check normality within each team. The second warning sign is due to the unusual / extreme data (score in row 1) which may bias our analysis.</p>
<p><img height="500" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/c1b6e895a14806b48ce1389d3b1d283b/c1b6e895a14806b48ce1389d3b1d283b.png" width="666" /></p>
<p><span style="line-height: 20.8px;">Following the suggestion from the warning signal in the Minitab Assistant Report Card, </span>I decided to run a normality test. I performed a separate normality test for each team in order not to mix different distributions together.</p>
<p>A low P value in the normal probability plot (see below) signals a significant departure from normality. This p-value is below 0.05 for the Blue team. The points located along the normal probability plot line represent “normal,” common, random variations. The points at the upper or lower extreme, which are distant from the line, represent unusual values or outliers. The non-normal behavior in the probability plot of the blue team is clearly due to the outlier on the right side of the normal probability plot line.</p>
<p><img height="384" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/0a2a24b82115014a0e981e63b0628f5b/0a2a24b82115014a0e981e63b0628f5b.png" width="576" /></p>
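A normality test of this kind can also be reproduced outside Minitab. The Assistant uses the Anderson-Darling test; the sketch below uses the Shapiro-Wilk test as a stand-in, on made-up scores containing one extreme top value, to show how a single outlier can drive the p-value below 0.05.

```python
from scipy import stats

# Illustrative "Blue team" scores: eight ordinary players plus one
# extreme top score (hypothetical numbers, not the post's data).
scores = [2550, 2100, 1900, 1750, 1600, 1400, 1200, 900, 9000]

stat, p = stats.shapiro(scores)
print(p < 0.05)  # the single extreme value is enough to reject normality
```

Dropping the extreme value and rerunning the test would typically remove the violation, which is exactly the dilemma discussed next.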
<p>Should we remove this value (Gengiz Khan’s score) from the Blue group and rerun the analysis without him?</p>
<p>Even though Gengiz Khan is more experienced and talented than the other team members, there are no particular reasons why he should be removed—he is certainly part of the Blue team. There are probably many other talented laser game players around. If another additional laser game session takes place in the future, there will probably still be a large difference between Gengiz Khan and the rest of his team.</p>
<p>The problem is that this extreme value tends to inflate the within-group variability. Because there is a much larger within-team variability for the blue team, differences <em>between </em>groups when they are compared to the residual / within variability do not appear to be significant, causing the p-value to move just above the significance threshold.</p>
A Non-parametric Solution
<p>One possible solution is to use a non-parametric approach. Non-parametric techniques are based on ranks, or medians. Ranks represent the relative position of an individual in comparison to others, but are not affected by extreme values (whereas a mean is sensitive to outlier values). Ranks and medians are more “robust” to outliers.</p>
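A two-line Python illustration of that robustness, using hypothetical scores: adding one extreme player drags the mean far upward, while the median barely moves.

```python
import numpy as np

scores = np.array([2550, 2100, 1900, 1750, 1600])
with_star = np.append(scores, 25000)   # add one extreme player

# The mean is dragged far upward by the single extreme value...
print(scores.mean(), with_star.mean())           # 1980.0 vs about 5817
# ...while the median barely moves; in rank terms the star is simply
# "largest", however large the score actually is.
print(np.median(scores), np.median(with_star))   # 1900.0 vs 2000.0
```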
<p>I used the Kruskal-Wallis test (see the correspondence table between parametric and non-parametric tests below). The p-value (see the output below) is now significant (less than 0.05), and the conclusion is completely different. We can consider the differences significant.</p>
<p style="margin-left: 40px;"><strong>Kruskal-Wallis Test: Score versus Team </strong></p>
<p style="margin-left: 40px;">Kruskal-Wallis Test on Score</p>
<p style="margin-left: 40px;">Team N Median Ave Rank Z</p>
<p style="margin-left: 40px;">Blue 9 2550.0 23.7 2.72</p>
<p style="margin-left: 40px;">Pink 13 -450.0 11.6 -2.44</p>
<p style="margin-left: 40px;">Yellow 10 975.0 16.4 -0.06</p>
<p style="margin-left: 40px;">Overall 32 16.5</p>
<p style="margin-left: 40px;">H = 8.86 DF = 2 <strong>P = 0.012</strong></p>
<p style="margin-left: 40px;">H = 8.87 DF = 2 <strong>P = 0.012</strong> (adjusted for ties)</p>
<p style="margin-left: 40px;"><img height="384" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/ed3231b9ceab5d16a6a5d5bb0ce43973/ed3231b9ceab5d16a6a5d5bb0ce43973.png" width="576" /></p>
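For readers who want to reproduce this kind of analysis outside Minitab, SciPy's Kruskal-Wallis test follows the same rank-based logic. The scores below are invented for illustration (the post's H = 8.86, p = 0.012 comes from its own data), but they share the same flavor: the Blue team ranks highest overall and the Pink team lowest.

```python
from scipy import stats

# Hypothetical team scores, not the post's data.
blue   = [2550, 2400, 2300, 2200, 2100, 1900, 1500, 1000, 800]
yellow = [1600, 1400, 1200, 975, 900, 700, 500, 300, -50]
pink   = [300, 100, 0, -100, -200, -300, -450, -500, -600,
          -700, -800, -900, -1000]

h_stat, p = stats.kruskal(blue, yellow, pink)
print(p < 0.05)  # with clearly separated ranks, the test is significant
```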
<p>The correspondence table between parametric and non-parametric tests is shown below:</p>
<p><img height="457" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/3616777847f46f514dc4dfdc51397e3d/3616777847f46f514dc4dfdc51397e3d.png" width="694" /></p>
Conclusion
<p>Outliers do happen and removing them is not always straightforward. One nice thing about non-parametric tests is that they are more robust to such outliers. However, this does not mean that non-parametric tests should be used in every circumstance. When there are no outliers and the distribution is normal, standard parametric tests (t-tests or ANOVA) are more powerful. </p>
Data AnalysisHypothesis TestingLearningStatisticsStatsMon, 07 Dec 2015 13:02:00 +0000http://blog.minitab.com/blog/applying-statistics-in-quality-projects/why-you-should-use-non-parametric-tests-when-analyzing-data-with-outliersBruno ScibiliaWhat Can You Say When Your P-Value is Greater Than 0.05?
http://blog.minitab.com/blog/understanding-statistics/what-can-you-say-when-your-p-value-is-greater-than-005
<p>P-values are frequently misinterpreted, which causes many problems. I won't rehash <a href="http://blog.minitab.com/blog/adventures-in-statistics/banned-p-values-and-confidence-intervals-a-rebuttal-part-1">those problems here</a>, since my colleague Jim Frost has detailed the issues involved at some length, but the fact remains that the p-value will continue to be one of the most frequently used tools for deciding if a result is statistically significant. </p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f5a45a8de99994c6fd16e3fd018776b1/shoveling.png" style="line-height: 20.8px; margin: 10px 15px; float: right; width: 250px; height: 221px;" /></p>
<p>You know the old saw about "Lies, damned lies, and statistics," right? It rings true because statistics really is as much about interpretation and presentation as it is mathematics. That means we human beings who are analyzing data, with all our foibles and failings, have the opportunity to shade and shadow the way results get reported.</p>
<p>While I generally like to believe that people<span style="line-height: 20.8px;"> <em>want</em> to be honest and objective</span><span style="line-height: 1.6;">—especially smart people who do research and analyze data that may affect other people's lives</span><span style="line-height: 1.6;">—<a href="https://mchankins.wordpress.com/2013/04/21/still-not-significant-2/">here are 500 pieces of evidence that fly in the face of that belief</a>. </span></p>
<p><span style="line-height: 1.6;">We'll get back to that in a minute. But first, a quick review...</span></p>
<span style="line-height: 1.6;">What's a P-Value, and How Do I Interpret It?</span>
<p>Most of us first encounter p-values when we conduct simple hypothesis tests, although they also are integral to many more sophisticated methods. Let's use Minitab 17 to do a quick review of how they work (if you want to follow along and don't have Minitab, the <a href="http://it.minitab.com/products/minitab/free-trial.aspx">full package is available free for 30 days</a>). We're going to compare fuel consumption for two different kinds of furnaces to see if there's a difference between their means. </p>
<p>Go to <strong>File > Open Worksheet</strong>, and click the "Look in Minitab Sample Data Folder" button. Open the sample data set named <em>Furnace.mtw</em>, and choose <strong>Stat > Basic Statistics > 2 Sample t...</strong> from the menu. In the dialog box, enter "BTU.In" for Samples, and enter "Damper" for Sample IDs.</p>
<p>Press <strong>OK</strong> and Minitab returns the following output, in which I've highlighted the p-value. </p>
<p style="margin-left: 40px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/26076d0dd37249748f1541a2313036b6/p_values_output.png" style="width: 547px; height: 172px;" /></p>
<p>In the majority of analyses, an alpha of 0.05 is used as the cutoff for significance. If the p-value is less than 0.05, we reject the <a href="http://blog.minitab.com/blog/understanding-statistics/things-statisticians-say-failure-to-reject-the-null-hypothesis">null hypothesis</a> that there's no difference between the means and conclude that a significant difference does exist. If the p-value is larger than 0.05, we <em>cannot</em> conclude that a significant difference exists. </p>
<p>That's pretty straightforward, right? Below 0.05, significant. Over 0.05, <em>not</em> significant. </p>
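That decision rule is the same wherever you run the test. As a sketch in Python (the readings below are simulated stand-ins; the real Furnace.mtw sample data ship with Minitab and aren't reproduced here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Simulated fuel-consumption (BTU.In) readings for two damper types --
# hypothetical stand-ins for the Furnace.mtw sample data.
damper_1 = rng.normal(9.9, 3.0, size=40)
damper_2 = rng.normal(10.1, 3.0, size=50)

# Welch's two-sample t-test (does not assume equal variances).
t_stat, p = stats.ttest_ind(damper_1, damper_2, equal_var=False)
alpha = 0.05
print("significant" if p < alpha else "not significant")
```

The p-value either clears the pre-chosen alpha or it doesn't; the rest of this post is about what happens when people refuse to accept that.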
"Missed It By <em>That</em> Much!"
<p>In the example above, the result is clear: a p-value of 0.7 is so much higher than 0.05 that you can't apply any wishful thinking to the results. <span style="line-height: 1.6;">But what if your p-value is really, <em>really</em> close to 0.05? </span></p>
<p><span style="line-height: 1.6;"><em>Like, what if you had a p-value of 0.06? </em></span></p>
<p>That's not significant. </p>
<p><em>Oh. Okay, what about 0.055?</em></p>
<p>Not significant. </p>
<p><em>How about 0.051?</em></p>
<p>It's <em>still</em> not statistically significant, and data analysts should not try to pretend otherwise. <span style="line-height: 1.6;">A p-value is not a negotiation: if p > 0.05, the results are not significant. </span><em style="line-height: 1.6;">Period.</em></p>
<p><em>So, what </em>should<em> I say when I get a p-value that's higher than 0.05? </em></p>
<p>How about saying this? "The results were not statistically significant." If that's what the data tell you, there is nothing wrong with saying so. </p>
No Matter How Thin You Slice It, It's Still Baloney.
<p>Which brings me back to the <a href="https://mchankins.wordpress.com/2013/04/21/still-not-significant-2/">blog post</a> I referenced at the beginning. Do give it a read, but the bottom line is that the author cataloged 500 <em>different</em> ways that contributors to scientific journals have used language to obscure their results (or lack thereof). </p>
<p>As a student of language, I confess I find the list fascinating...but also upsetting. It's <em>not right</em>: These contributors are educated people who certainly understand A) what a p-value higher than 0.05 signifies, and B) that manipulating words to soften that result is deliberately deceptive. Or, to put it in words that are less soft, it's a damned lie.</p>
<p>Nonetheless, it happens frequently. </p>
<p>Here are just a few of my favorites of the 500 different ways people have reported results that were not significant, accompanied by the p-values to which these creative interpretations applied: </p>
<ul>
<li>a certain trend toward significance (p=0.08)</li>
<li>approached the borderline of significance (p=0.07)</li>
<li>at the margin of statistical significance (p<0.07)</li>
<li>close to being statistically significant (p=0.055)</li>
<li>fell just short of statistical significance (p=0.12)</li>
<li>just very slightly missed the significance level (p=0.086)</li>
<li>near-marginal significance (p=0.18)</li>
<li>only slightly non-significant (p=0.0738)</li>
<li>provisionally significant (p=0.073)</li>
</ul>
<p>and my very favorite:</p>
<ul>
<li>quasi-significant (p=0.09)</li>
</ul>
<p>I'm not sure what "quasi-significant" is even supposed to mean, but it <em>sounds</em> quasi-important, as long as you don't think about it too hard. But there's still no getting around the fact that a p-value of 0.09 is not a statistically significant result. </p>
<p>The blogger does not address the question of whether the opposite situation occurs. Do contributors ever write that a p-value of, say, 0.049999 is:</p>
<ul>
<li>quasi-insignificant</li>
<li>only slightly significant</li>
<li>provisionally insignificant</li>
<li>just on the verge of being non-significant</li>
<li>at the margin of statistical non-significance</li>
</ul>
<p>I'll go out on a limb and posit that describing a p-value just under 0.05 in ways that diminish its statistical significance <em>just</em> <em>doesn't happen</em>. However, downplaying statistical non-significance would appear to be almost endemic. </p>
<p>That's why I find the above-referenced post so disheartening. It's distressing that you can so easily gather so many examples of bad behavior by data analysts <em>who almost certainly know better</em>.</p>
<p><em>You</em> would never use language to try to obscure the outcome of your analysis, would you?</p>
<p> </p>
Hypothesis TestingStatisticsThu, 03 Dec 2015 13:00:00 +0000http://blog.minitab.com/blog/understanding-statistics/what-can-you-say-when-your-p-value-is-greater-than-005Eston MartzWhat Is ANOVA? And Who Drinks the Most Beer?
http://blog.minitab.com/blog/michelle-paret/what-is-anova-and-who-drinks-the-most-beer
<p><span style="line-height: 1.6;">Back when I was an undergrad in statistics, I unfortunately spent an entire semester of my life taking a class, diligently crunching numbers with my TI-82, before realizing 1) that I was actually in an Analysis of Variance (ANOVA) class, 2) why I would want to use such a tool in the first place, and 3) that ANOVA doesn’t necessarily tell you a thing about variances.</span></p>
<p>Fortunately, I've had a lot more real-world experience to draw from since then, which makes it much easier to understand today. TI-82 not required.</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f000c555e0ba8a9d9a78461b7230073c/beer.jpg" style="line-height: 20.8px; margin: 10px 15px; float: right; width: 220px; height: 220px;" /></p>
Why Conduct an ANOVA?
<p>In its simplest form—specifically, a 1-way ANOVA—you take 1 continuous (“response”) variable and 1 categorical (“factor”) variable and test the null hypothesis that all group means for the categorical variable are equal. Typically, we’re talking about at least 3 groups, because if you only have 2 groups (samples), then you can use a <span><a href="http://blog.minitab.com/blog/adventures-in-statistics/using-hypothesis-tests-to-bust-myths-about-the-battle-of-the-sexes">2-sample t-test</a></span> and skip ANOVA altogether.</p>
<div>
<p>As an example, let’s look at the <a href="https://en.wikipedia.org/wiki/List_of_countries_by_beer_consumption_per_capita">average annual per capita beer consumption</a> across 3 regions of the world: Asia, Europe, and America. Here’s the null and alternative hypothesis:</p>
<p style="margin-left: 40px;">H0: All regions drink the same average amount of beer (μAsia = μEurope = μAmerica)</p>
<p style="margin-left: 40px;">H1: Not all regions drink the same average amount of beer</p>
<p>Any guess on who consumes the most beer?</p>
<p><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/371609298620629565add1ee54234c37/individual_value_plot_of_volume_consumed__liters__w1024.jpeg" style="border-width: 0px; border-style: solid; width: 600px; height: 394px;" /></p>
<p><span style="line-height: 1.6;">According to the individual value plot created using </span><a href="http://www.minitab.com/products/minitab/" style="line-height: 1.6;">Minitab 17</a><span style="line-height: 1.6;">, Europe consumes the most beer on average and Asia consumes the least. However, are these differences statistically significant? Or are these differences simply due to random variation?</span></p>
How ANOVA Works
<p>The basic logic behind ANOVA is that the within-group variation is due only to random error. Therefore:</p>
<ul>
<li><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/2c9466a316c75e4c17cd250f0aff45bd/comparing_variation.jpg" style="line-height: 20.8px; margin-left: 12px; margin-right: 12px; float: right; width: 112px; height: 200px;" />If the between-group variation is similar to the within-group variation, then the group means are likely to differ only due to random error. (Figure 1)</li>
<li>If the between-group variation is large relative to the within-group variation, then there are likely differences between the group means. (Figure 2)</li>
</ul>
<p>Say what?</p>
<p>In our example, the between-group variation represents the variation <em>between</em> the 3 different regions. And the within-group variation represents the beer consumption variability <em>within</em> a given region. Take Europe, for instance, where we have the Czech Republic. It appears to be the thirstiest country, consuming the most beer at 148.6 liters. But Europe also contains Italy, whose population drinks the least at only 29 liters (perhaps the Italians are passing up the Peroni for some vino and Limoncello?). So you can see that there is variability within the Europe group. There’s also variability within the Asia group, and within the America group.</p>
<p>With ANOVA, we compare the between-group variation (i.e., Asia vs. Europe vs. America) to the within-group variation (i.e., within each of those regions). The higher this ratio, the smaller the p-value. So the term ANOVA refers to the fact that we're using information about the variances to draw conclusions about the means.</p>
The Analysis
<p>If we run a 1-way ANOVA using this beer data, Minitab Statistical Software provides the following output in the Session Window:</p>
<p style="margin-left:.5in;"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/956e35d59d68a844fdf00986397dbc9d/oneway_anova_for_beer.jpg" style="border-width: 0px; border-style: solid; width: 460px; height: 286px;" /></p>
<p><span style="line-height: 1.6;">Our </span><a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" style="line-height: 1.6;">p-value</a><span style="line-height: 1.6;"> is statistically significant at 0.000. Therefore, we can reject the null hypothesis that all regions drink the same average amount of beer. </span></p>
<p><span style="line-height: 1.6;">This leads us to our next question: Which countries differ? Let’s use Tukey multiple comparisons to find out.</span></p>
<p align="center"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/b499b8c87a401c007713a0d5179a5e56/oneway_anova_interval_plot_for_beer_w1024.jpeg" style="border-width: 0px; border-style: solid; width: 600px; height: 394px;" /></p>
<p>Per the footnote on the Tukey comparisons graph, “If an interval does not contain zero, the corresponding means are significantly different.” Therefore, the intervals shown in red tell us where the differences are. Specifically, we can conclude that the average beer consumption for Europe is significantly higher than that of Asia. We can also conclude that America consumes significantly more than Asia. However, there is not sufficient evidence to conclude that the average beer consumption for Europe is different than for America.</p>
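A comparable Tukey analysis can be run with recent versions of SciPy. The per-capita values below are invented for illustration, except for the Czech (148.6 liters) and Italian (29 liters) figures mentioned earlier in the post.

```python
from scipy import stats

# Illustrative liters-per-capita values; only the Czech figure (148.6)
# and the Italian figure (29.0) come from the post.
europe  = [148.6, 104.7, 99.0, 81.4, 29.0]
america = [75.8, 68.3, 60.0, 55.0]
asia    = [38.2, 30.0, 25.0, 20.0, 12.0]

res = stats.tukey_hsd(europe, america, asia)
# res.pvalue[i, j] is the adjusted p-value for the pair of groups (i, j);
# a confidence interval that excludes zero marks a significant difference,
# mirroring the footnote on Minitab's comparison chart.
print(res.pvalue.round(3))
```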
The Last Sip
<p>Although it’s unlikely that you’re analyzing beer data in your professional career, I do hope this provides a little insight into ANOVA and how you can utilize it to test averages between 3 or more groups.</p>
<p> </p>
</div>
Data AnalysisFun StatisticsHypothesis TestingStatisticsStatistics HelpThu, 19 Nov 2015 13:00:00 +0000http://blog.minitab.com/blog/michelle-paret/what-is-anova-and-who-drinks-the-most-beerMichelle ParetControl Charts - Not Just for Statistical Process Control (SPC) Anymore!
http://blog.minitab.com/blog/adventures-in-statistics/control-charts-not-just-for-statistical-process-control-spc-anymore
<p>Control charts are a fantastic tool. These charts plot your process data to identify common cause and special cause variation. By identifying the different causes of variation, you can take action on your process without over-controlling it.</p>
<p>Assessing the stability of a process can help you determine whether there is a problem and identify the source of the problem. Is the mean too high, too low, or unstable? Is variability a problem? If so, is the variability inherent in the process or attributable to specific sources? Control charts answer these questions, which can guide your corrective efforts.</p>
<p>Determining that your process is stable is good information all by itself, but it is also a prerequisite for further analysis, such as <a href="http://blog.minitab.com/blog/understanding-statistics/i-think-i-can-i-know-i-can-a-high-level-overview-of-process-capability-analysis" target="_blank">capability analysis</a>. Before assessing process capability, you must be sure that your process is stable. An unstable process is unpredictable. If your process is stable, you can predict future performance and improve its capability.</p>
<p>While we associate control charts with business processes, I’ll argue in this post that control charts provide the same great benefits in other areas beyond statistical process control (SPC) and Six Sigma. In fact, you’ll see several examples where control charts find answers that you’d be hard pressed to uncover using different methods.</p>
The Importance of Assessing Whether Other Types of Processes Are In Control
<p>I want you to expand your mental concept of a process to include processes outside the business environment. After all, unstable process levels and excessive variability can be problems in many different settings. For example:</p>
<ul>
<li>A teacher has a process that helps students learn the material as measured by test scores.</li>
<li><a href="http://blog.minitab.com/blog/real-world-quality-improvement/control-charts-keep-blood-sugar-in-check" target="_blank">A diabetic has a process for keeping blood sugar in control</a>.</li>
<li>A researcher has a process that causes subjects to experience an impact of 6 times their body weight.</li>
</ul>
<p>All of these processes can be stable or unstable, have a certain amount of inherent variability, and can also have special causes of variability. Understanding these issues can help improve all of them.</p>
<p>The third bullet relates to a <a href="http://blog.minitab.com/blog/adventures-in-statistics/quality-improvement-controlling-variability-more-difficult-than-the-mean" target="_blank">research study</a> that I was involved with. Our research goal was to have middle school subjects jump from 24-inch steps, 30 times, every other school day to determine whether it would increase their bone density. We defined our treatment as the subjects experiencing an impact of 6 body weights. However, we weren’t quite hitting the mark.</p>
<p>To guide our corrective efforts, I conducted a pilot study and graphed the results in the Xbar-S chart below.</p>
<p><img alt="Xbar-S chart of ground reaction forces for pilot study" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/e721bd172aa55d5ec9976e81990f1293/xbars_grf_w1024.jpeg" style="width: 576px; height: 384px;" /></p>
<p>The in-control S chart (bottom) shows that each subject has a consistent landing style that produces impacts of a consistent magnitude—the variability is in control. However, the out-of-control Xbar chart (top) indicates that, while the overall mean (6.141) exceeds our target, different subjects have very different means. Collectively, the chart shows that some subjects are consistently hard landers while others are consistently soft landers. The control chart suggests that the variability is not inherent in the process (common cause variation) but rather assignable to differences between subjects (special cause variation).</p>
<p>Based on this information, we decided to train the subjects how to land and to have a nurse observe all of the jumping sessions. This ongoing training and corrective action reduced the variability enough so that the impacts were consistently greater than 6 body weights.</p>
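Minitab builds the Xbar-S chart for you, but as a rough illustration of where the Xbar chart's limits come from, here is a sketch of the textbook formulas (assuming equal subgroup sizes; this is not Minitab's implementation, which also handles unequal subgroups and other estimation options):

```python
import math

def c4_constant(n):
    """Unbiasing constant c4 for the sample standard deviation (subgroup size n)."""
    return math.sqrt(2.0 / (n - 1)) * math.gamma(n / 2) / math.gamma((n - 1) / 2)

def xbar_s_limits(subgroups):
    """Center line and 3-sigma limits for the Xbar chart of an Xbar-S pair,
    using A3 = 3 / (c4 * sqrt(n)) from standard SPC constant tables."""
    n = len(subgroups[0])
    means = [sum(g) / n for g in subgroups]
    sds = [math.sqrt(sum((x - sum(g) / n) ** 2 for x in g) / (n - 1)) for g in subgroups]
    xbarbar = sum(means) / len(means)      # grand mean (center line)
    sbar = sum(sds) / len(sds)             # average subgroup standard deviation
    a3 = 3.0 / (c4_constant(n) * math.sqrt(n))
    return xbarbar - a3 * sbar, xbarbar, xbarbar + a3 * sbar
```

Subgroup means that fall outside these limits, like the hard and soft landers above, point to between-subject (special cause) variation.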
Control Charts as a Prerequisite for Statistical Hypothesis Tests
<p>As I mentioned, control charts are also important because they can verify the assumption that a process is stable, which is required to produce a valid capability analysis. We don’t often think of using control charts to test the assumptions for <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-why-we-need-to-use-hypothesis-tests-in-statistics" target="_blank">hypothesis tests</a> in a similar fashion, but they are very useful for that as well.</p>
<p>The assumption that the measurements used in a hypothesis test are stable is often overlooked. As with any process, if the measurements are not stable, you can’t make inferences about whatever you are measuring.</p>
<p>Let’s assume that we’re comparing test scores between group A and group B. We’ll use this <a href="//cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/File/6053477fc294de59d5b3837389daab3a/groupcomparison.MTW">data set</a> to perform a 2-sample t-test as shown below.</p>
<p style="margin-left: 40px;"><img alt="two sample t-test results" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/d60543dc8eb46afc282b9776b0517a5e/2samplet.png" style="width: 522px; height: 210px;" /></p>
<p>The results appear to show that group A has the higher mean and that the difference is statistically significant. Group B has a marginally higher standard deviation, but we’re not assuming equal variances, so that’s not a problem. If you conduct normality tests, you’ll see that the data for both groups are normally distributed, although we have enough observations per group that we don’t need to worry about normality anyway. All is good, right?</p>
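If you want to reproduce this kind of comparison outside Minitab, a 2-sample t-test that, like the analysis above, does not assume equal variances (Welch's test) can be sketched with scipy; the scores below are made up for illustration, not the blog's data set:

```python
from scipy import stats

# Hypothetical test scores for two groups (not the blog's data set)
group_a = [88, 92, 85, 91, 89, 90, 87, 93]
group_b = [81, 84, 79, 86, 80, 83, 78, 85]

# equal_var=False requests Welch's t-test, which does not pool the variances
result = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```

As the rest of this section argues, a small p-value here still means little if either group's measurements are unstable over time.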
<p>The I-MR charts below suggest otherwise!</p>
<p><img alt="I-MR chart for group A" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/cef240bbb760bb6760ddcbc33e446be9/imr_a.png" style="width: 576px; height: 384px;" /></p>
<p><img alt="I-MR chart of group B" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/e4bd53da7831826959be94540b7ab0a2/imr_b.png" style="width: 576px; height: 384px;" /></p>
<p>The chart for group A shows that these scores are stable. However, in group B, the multiple out-of-control points indicate that the scores are unstable. Clearly, there is a negative trend. Comparing a stable group to an unstable group is not a valid comparison even though the data satisfy the other assumptions.</p>
<p>This I-MR chart illustrates just one type of problem that control charts can detect. Control charts can also test for a variety of patterns in the data and for out-of-control variability. As these data show, you can miss problems using other methods.</p>
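As a rough sketch of what the individuals chart checks, the textbook limits are the mean plus or minus 2.66 times the average moving range (2.66 = 3/d2, with d2 = 1.128 for moving ranges of span 2); this simplified version flags only points beyond the limits, whereas Minitab also applies trend and pattern tests:

```python
def i_chart_signals(x):
    """Flag points outside the individuals-chart control limits.
    Limits use the standard constant 2.66 = 3 / d2, d2 = 1.128 for n = 2."""
    center = sum(x) / len(x)
    moving_ranges = [abs(b - a) for a, b in zip(x, x[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    lcl, ucl = center - 2.66 * mr_bar, center + 2.66 * mr_bar
    return [i for i, v in enumerate(x) if v < lcl or v > ucl]

# A stable stretch followed by a sudden drop in scores
scores = [50, 51, 49, 50, 51, 49, 50, 51, 49, 30]
print(i_chart_signals(scores))   # [9]: the final point falls below the LCL
```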
Using the Different Types of Control Charts
<p>The I-MR chart assesses the stability of the mean and standard deviation when you don’t have subgroups, while the Xbar-S chart shown earlier assesses the same parameters but <em>with </em>subgroups.</p>
<p>You can also use other control charts to test other types of data. In Minitab, the U Chart and Laney U’ Chart are control charts that use the Poisson distribution. You can use these charts in conjunction with the 1-Sample and 2-Sample Poisson Rate tests. The P Chart and Laney P’ Chart are control charts that use the binomial distribution. Use these charts with the 1 Proportion and 2 Proportions tests.</p>
<p>If you're using <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank" title="Minitab 16 Statistical Software">Minitab Statistical Software</a>, you can choose <strong>Assistant > Control Charts</strong> and get step-by-step guidance through the process of creating a control chart, from determining what type of data you have, to making sure that your data meets necessary assumptions, to interpreting the results of your chart.</p>
<p>Additionally, check out the great <a href="http://blog.minitab.com/blog/understanding-statistics/control-chart-tutorials-and-examples">control charts tutorial</a> put together by my colleague, Eston Martz.</p>
Data Analysis | Hypothesis Testing | Learning | Quality Improvement | Six Sigma | Statistics | Statistics Help
Thu, 12 Nov 2015 13:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/control-charts-not-just-for-statistical-process-control-spc-anymore
Jim Frost

Terry Bradshaw Might be the Best Super Bowl Quarterback Ever
http://blog.minitab.com/blog/statistics-and-quality-improvement/terry-bradshaw-might-be-the-best-super-bowl-quarterback-ever
<p><img alt="By U.S. Navy photo by Chief Photographer's Mate Chris Desmond. [Public domain], via Wikimedia Commons" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/fddd60d1ca304b5397c16d0bc050910b/bradshaw.jpg" style="width: 350px; height: 228px; float: right; margin: 10px 15px;" />Last time I touched on the subject of <a href="http://blog.minitab.com/blog/statistics-and-quality-improvement/troy-aikman-or-joe-montana-might-be-the-best-super-bowl-quarterback-ever">the greatest Super Bowl quarterback</a>, I promised a multivariate analysis considering several different statistics. Let’s get right to a factor analysis.</p>
Getting Ready for Factor Analysis
<p>One purpose of factor analysis is to identify underlying factors that you can’t measure directly. These factors explain the variation of many different variables in fewer dimensions. Here are the variables we’re going to consider:</p>
<ul>
<li>Margin of victory</li>
<li>Difference between Super Bowl winner’s passer rating and the playoff passer rating allowed by the opposing team—PR Diff (Winner – Allowed)</li>
<li>Point spread—Spread</li>
<li>Adjusted career rating of the losing quarterback—Adjusted Career PR Loser</li>
<li>The difference between the winning and losing quarterback’s ratings—PR Difference (Winner – Loser)</li>
<li>Winning quarterback’s rating—Passer Rating Winner</li>
<li>Losing quarterback’s rating—Passer Rating Loser</li>
</ul>
Determining the number of factors
<p>To begin the factor analysis, you usually <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/multivariate/principal-components-and-factor-analysis/number-of-principal-components/">determine the number of factors to use</a>. The process is similar to determining the number of principal components: you look for eigenvalues greater than 1, for the number of factors that explain about 80% of the variation, and for factors that explain large amounts of variation relative to the others. A scree plot of the eigenvalues looks like this:</p>
<p><img alt="The first two factors have eigenvalues greater than 1. The third eigenvalue is close to 1." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/1ba3b721712e20c94bc9a75f1bd2c000/eigenvalue_scree_plot.png" style="border-width: 0px; border-style: solid; width: 576px; height: 384px;" /></p>
<p>Two factors have eigenvalues greater than 1, and the third factor is close. Together, the three factors explain about 80% of the variation in the data, so 3 factors seems like a reasonable number to explore.</p>
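The eigenvalue-greater-than-1 rule and the cumulative-variance check are easy to compute directly; here is a sketch with numpy, using a small made-up correlation matrix rather than the Super Bowl data:

```python
import numpy as np

def factor_count_diagnostics(corr):
    """Eigenvalues of a correlation matrix, sorted descending, with the
    Kaiser count (eigenvalues > 1) and the cumulative proportion of variance."""
    eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
    kaiser = int(np.sum(eigenvalues > 1))
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    return eigenvalues, kaiser, cumulative

# Three hypothetical variables with pairwise correlation 0.7
R = np.array([[1.0, 0.7, 0.7],
              [0.7, 1.0, 0.7],
              [0.7, 0.7, 1.0]])
vals, k, cum = factor_count_diagnostics(R)
print(vals, k)   # only the first eigenvalue exceeds 1 here
```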
Factor rotation
<p>Once we determine the number of factors, we want to see if we can find a rotation that produces underlying factors that make sense. In general, rotation of the factors makes them load on fewer variables so that the factors are simpler. For example, the Minitab output from the varimax rotation shows the unrotated and rotated factor loadings:</p>
<p style="margin-left: 40px;"><span style="font-family: courier new; font-size:9pt">Unrotated Factor Loadings and Communalities<br />
Variable Factor1 Factor2 Factor3 Communality<br />
Passer Rating Loser -0.644 0.502 0.420 0.843<br />
Passer Rating Winner 0.723 0.563 0.127 0.857<br />
PR Difference (Winner - Loser) 0.953 0.027 -0.212 0.955<br />
Adjusted Career PR Loser -0.181 0.748 -0.400 0.752<br />
Spread -0.536 0.189 -0.687 0.794<br />
PR Diff (Winner - Allowed) 0.620 0.570 0.145 0.731<br />
Margin of victory 0.700 -0.324 -0.214 0.640<br />
Variance 3.0410 1.5952 0.9359 5.5721<br />
% Var 0.434 0.228 0.134 0.796<br />
<br />
Rotated Factor Loadings and Communalities<br />
Varimax Rotation<br />
Variable Factor1 Factor2 Factor3 Communality<br />
Passer Rating Loser 0.032 -0.912 -0.106 0.843<br />
Passer Rating Winner 0.909 0.171 0.030 0.857<br />
PR Difference (Winner - Loser) 0.598 0.767 0.096 0.955<br />
Adjusted Career PR Loser 0.336 -0.256 -0.757 0.752<br />
Spread -0.363 -0.085 -0.810 0.794<br />
PR Diff (Winner - Allowed) 0.850 0.086 0.011 0.731<br />
Margin of victory 0.178 0.754 0.198 0.640<br />
Variance 2.1852 2.0973 1.2896 5.5721<br />
% Var 0.312 0.300 0.184 0.796</span></p>
<p style="margin-left: 40px;"> </p>
<p>In this output, the unrotated first factor has 5 variables where the absolute value of the loading is 0.6 or higher. The rotated first factor has 2 variables with loadings of 0.6 or higher, so the rotated factor should be easier to interpret.</p>
<p>We’re lucky, in this case, because the different rotation methods available in Minitab all produce factors that load on the same variables. When different methods agree, you feel more certain about the results.</p>
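For the curious, varimax rotation itself is compact enough to sketch. The following is the classic SVD-based iteration (not Minitab's implementation); because the rotation is orthogonal, it leaves each variable's communality unchanged, just as in the output above:

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Return a varimax-rotated copy of a (variables x factors) loading matrix.
    The rotation is orthogonal, so row sums of squares (communalities) are preserved."""
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        # Update the rotation from the SVD of the varimax criterion's gradient
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - (gamma / p) * Lr @ np.diag(np.sum(Lr ** 2, axis=0))))
        R = u @ vt
        d_new = np.sum(s)
        if d > 0 and d_new / d < 1 + tol:
            break
        d = d_new
    return L @ R

# Demo on a small hypothetical loading matrix
L = np.array([[0.7, 0.3], [0.6, 0.4], [0.2, 0.8], [0.1, 0.9]])
print(varimax(L))
```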
Interpreting the factors
<p>The first factor, which loads highly on the winning quarterback’s passer rating and the difference between that passer rating and what the opposing team allowed in the playoffs, looks like a measure of how well the winning quarterback played. Higher values of this factor indicate better performance.</p>
<p>The second factor is the most difficult to interpret because of the signs of the different variables with high loadings. You get a higher value of the second component by having a higher margin of victory, by having a higher difference between the ratings of the winning and losing quarterbacks, and by having a lower passer rating by the losing quarterback. Intuitively, you would want the first two of these values to be high, but you would also want the third value, the losing quarterback’s rating, to be high, which conflicts with its negative loading.</p>
<p>It looks like the variation in the data suggests that a losing team is much more likely to lose by a lot of points if the opposing quarterback plays poorly. In <a href="http://blog.minitab.com/blog/statistics-and-quality-improvement/tom-brady-is-the-best-super-bowl-quarterback-ever">my first post about the best Super Bowl quarterback</a>, I made the judgement that winning a competitive Super Bowl was more impressive than winning a noncompetitive match. Thus, I’m going to tend to think that lower values of the second component, caused by high passer ratings of the opposing quarterback, small differences, and smaller margins of victory are more impressive; but I’ll conduct the final comparisons both ways to see how it affects the conclusion.</p>
<p>The third factor loads on two variables: the point spread and the passer rating of the losing quarterback. This factor is about the quality of the victory. The more positive the point spread, the more unexpected the victory was. Also, the better the opposing quarterback played, the better the victory was. Because both of these loadings are negative, more negative values of the third factor indicate better performance.</p>
Conclusion?
<p>In addition to the decision of what to do with the second component, there are still some other considerations for how to determine the best Super Bowl quarterback. For example, should we compare the candidate quarterbacks to the average performance or to the best performance? Should we look at the mean performance of the best quarterbacks or the median performance? With so many options available for the remaining analysis, we’ll have to wait for next time to review them all. For now, here are some initial impressions of the three factors.</p>
<p><img alt="Different points identify Montana, Bradshaw, Aikman, and Brady" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/f666023c739116e02a0b2884c172eda7/3d_graph_legend.jpg" style="width: 207px; height: 154px;" /></p>
<p><img alt="Factor scores for all Super Bowl victors" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/0da0045e07c54f01e67fa5816b1a9a9f/3d_graph.jpg" style="width: 576px; height: 378px;" /></p>
<p>In terms of a quarterback playing well, especially in light of the opposing team, Terry Bradshaw’s first victory over the Dallas Cowboys, in Super Bowl X, takes the prize among our candidate quarterbacks. A factor score of 1.62 is not quite as good as Jim Plunkett’s 1.77, but pretty good for a guy throwing against Hall of Fame cornerback Mel Renfro. Among the candidates, Bradshaw also has the second-place score for his second victory over the Cowboys in Super Bowl XIII.</p>
<p>We’ll explore the best of the second factor in more detail, but the extremes make quarterbacks look good in both directions. Among the candidates, Tom Brady has the minimum score from his victory over the Carolina Panthers in Super Bowl XXXVIII. Brady overcame an incredible effort by Jake Delhomme that resulted in a 113.6 passer rating, the highest rating by a losing quarterback in a Super Bowl. Brady’s effort is also the overall minimum for factor 2.</p>
<p>On the maximum side of factor 2 lies another candidate, Montana’s victory over the Broncos in Super Bowl XXIV. The 45-point victory is the only Super Bowl in our data set where the winning quarterback’s passer rating exceeded the losing quarterback’s by over 100 points.</p>
<p>With respect to the third component, no victory was more unexpected than Brady’s upset of the Kurt Warner-led Rams in Super Bowl XXXVI. The 14-point underdog did enough that day to fend off the fourth-quarter charge of the Greatest Show on Turf in what was, at the time, the first Super Bowl to be decided by a score on the final play of the game.</p>
<p>So, will it come down to Bradshaw or Brady defying the odds, or Montana’s domination? We’ll evaluate all three factors next time!</p>
<p>Ready to try out your own factor analysis? Check out the overview in <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/multivariate/principal-components-and-factor-analysis/perform-a-factor-analysis/">Perform a Factor Analysis</a> on the Minitab Support Center.</p>
The photograph of Terry Bradshaw and Lieutenant Commander Heather Pouncey is by Chief Photographer's Mate Chris Desmond, whose work deserves attribution this Veterans Day even though it's in the public domain.
Data Analysis | Fun Statistics | Hypothesis Testing
Wed, 11 Nov 2015 15:17:00 +0000
http://blog.minitab.com/blog/statistics-and-quality-improvement/terry-bradshaw-might-be-the-best-super-bowl-quarterback-ever
Cody Steele

Practical Statistical Problem Solving Using Minitab to Explore the Problem
http://blog.minitab.com/blog/statistics-in-the-field/practical-statistical-problem-solving-using-minitab-to-explore-the-problem
<p><em style="line-height: 1.6;">By Matthew Barsalou, guest blogger</em></p>
<p>A problem must be understood before it can be properly addressed. A thorough understanding of the problem is critical when performing a <a href="http://blog.minitab.com/blog/understanding-statistics/root-cause-analysis-and-process-improvement-for-patient-safety">root cause analysis (RCA)</a>, and an RCA is necessary if an organization wants to implement corrective actions that truly address the root cause of the problem. An RCA may also be needed for process improvement projects; you must understand the causes of the current level of performance before attempting to improve it.</p>
<p>Many <span style="line-height: 20.8px;">statistical tests</span><span style="line-height: 20.8px;"> related to </span><span style="line-height: 1.6;">problem-solving can be performed using <a href="http://www.minitab.com/products/minitab">Minitab Statistical Software</a>. However, the actual test you select should be based upon the type of data you have and what needs to be understood. The figure below shows various statistical options structured in a cause-and-effect diagram with the main branches based on characteristics that describe what the tests and methods are used for.</span></p>
<p align="center"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/153fae1037161a10541f14572c58ce02/statistical_problem_solving_1_w1024.png" style="width: 700px; height: 310px;" /></p>
<p>The main branch labeled “differences” is split into two high-level sub-branches: hypothesis tests that have an assumption of normality, and non-parametric tests of medians. The <a href="http://blog.minitab.com/blog/understanding-statistics/what-statistical-hypothesis-test-should-i-use">hypothesis tests</a> assume data is normally distributed and can be used to compare means, variances, or proportions to either a given value or to the value of a second sample. An ANOVA can be performed to compare the means of two or more samples.</p>
<p>The non-parametric tests listed in the cause-and-effect diagram are used to compare medians: either one median to a specified value, or two or more medians to each other, depending upon which test is selected. The non-parametric tests provide an option when data is too skewed to use other options, such as a Z-test.</p>
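Assuming scipy is available, a nonparametric two-sample comparison of this kind (a Mann-Whitney test) can be sketched as follows; the measurements are invented, with a large value in each group to suggest skew:

```python
from scipy import stats

# Hypothetical skewed measurements from two processes (invented data)
before = [1.2, 1.4, 1.1, 1.3, 9.5, 1.2, 1.5]
after = [0.8, 0.9, 0.7, 1.0, 0.9, 6.2, 0.8]

# Mann-Whitney U test: a nonparametric comparison that doesn't assume normality
result = stats.mannwhitneyu(before, after, alternative="two-sided")
print(f"U = {result.statistic}, p = {result.pvalue:.4f}")
```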
<p>Time may also be of interest when exploring a problem. If your data are recorded in order of occurrence, a <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/time-series-plots-theres-gold-in-them-thar-hills">time series plot</a> can be created to show each value at the time it was produced; this may give insights into potential changes in a process.</p>
<p style="margin-left: 40px;"><img alt="" src="https://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/1d760d455a55b6a76d9e7fe25a20764e/time_series_gold_2.gif" style="width: 300px; height: 203px;" /></p>
<p>A <a href="http://blog.minitab.com/blog/real-world-quality-improvement/trend-analysis-super-bowl-ticket-prices">trend analysis</a> looks much like the time series plot; however, Minitab also tests for potential trends in the data such as increasing or decreasing values over time. Exponential smoothing options are available to assign exponentially decreasing weights to the values over time when attempting to predict future outcomes.</p>
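Single exponential smoothing, for example, applies those exponentially decreasing weights through a simple recursion; a minimal sketch follows (note that Minitab chooses the initial smoothed value and the optimal weight differently):

```python
def single_exponential_smoothing(x, alpha):
    """Smooth a series: each value mixes the newest observation with the
    previous smoothed value, so older points get exponentially smaller weights."""
    smoothed = [x[0]]                       # initialize with the first observation
    for value in x[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

print(single_exponential_smoothing([10, 20, 20, 20], alpha=0.5))
# [10, 15.0, 17.5, 18.75]
```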
<p>Relationships can be explored using various types of <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-tutorial-and-examples">regression analysis</a> to identify potential correlations in the data such as the relationship between the hardness of steel and the quenching time of the steel. This can be helpful when attempting to identify the factors that influence a process. Another option for understanding relationships is <a href="http://blog.minitab.com/blog/understanding-statistics/getting-started-with-factorial-design-of-experiments-doe">Design of Experiments (DoE)</a>, where experiments are planned specifically to economically explore the effects and interactions between multiple factors and a response variable.</p>
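A minimal version of the hardness-versus-quenching-time idea can be sketched as an ordinary least-squares fit; the numbers below are invented for illustration:

```python
import numpy as np

# Hypothetical data: quenching time (s) vs. steel hardness (HRC)
quench_time = np.array([10, 20, 30, 40, 50], dtype=float)
hardness = np.array([62, 58, 54, 50, 46], dtype=float)

# Least-squares fit of hardness = intercept + slope * time
slope, intercept = np.polyfit(quench_time, hardness, deg=1)
predicted = intercept + slope * quench_time
r_squared = 1 - np.sum((hardness - predicted) ** 2) / np.sum((hardness - hardness.mean()) ** 2)
print(f"slope = {slope:.2f}, intercept = {intercept:.1f}, R^2 = {r_squared:.3f}")
```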
<p>Another main branch is for capability and stability assessments. There are two main sub-branches here; one is for<a href="http://blog.minitab.com/blog/understanding-statistics/i-think-i-can-i-know-i-can-a-high-level-overview-of-process-capability-analysis"> measures of process capability and performance</a> and the other is for Statistical Process Control (SPC), which can assess the stability of a process.</p>
<p>The measures of process performance and capability can be useful for establishing the baseline performance of a process; this can be helpful in determining whether process improvement activities have actually improved the process. The SPC sub-branch is split into three lower-level sub-branches: control charts for attribute data, such as the number of defective units; control charts for continuous data, such as diameters; and time-weighted charts that don’t give all values equal weights.</p>
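For the capability sub-branch, the standard Cp and Cpk indices compare the specification spread to the process spread; here is a sketch using a single mean and standard deviation (Minitab additionally distinguishes within-subgroup from overall variation):

```python
def cp_cpk(mean, stdev, lsl, usl):
    """Process capability indices: Cp ignores centering, while Cpk penalizes
    a mean that sits closer to one specification limit than the other."""
    cp = (usl - lsl) / (6 * stdev)
    cpk = min(usl - mean, mean - lsl) / (3 * stdev)
    return cp, cpk

print(cp_cpk(mean=10.2, stdev=0.1, lsl=9.7, usl=10.3))   # Cp = 1.0, Cpk ≈ 0.33
```

The off-center mean in the example drags Cpk well below Cp, which is exactly the signal that a baseline capability study is meant to surface.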
<p><a href="http://blog.minitab.com/blog/understanding-statistics/what-control-chart-should-i-use">Control charts</a> can be used both for assessing the current performance of a process, such as by using an individuals chart to determine whether the process is in a state of statistical control, and for monitoring the performance of a process, such as after improvements have been implemented.</p>
<p><img alt="" src="https://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/d2c0571a-acbd-48c7-84f4-222276c293fe/Image/e36f985ab12401b70318197b3b8a1c77/control_chart_components.jpg" style="width: 400px; height: 103px;" /></p>
<p><a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/get-a-head-start-understand-your-data-before-you-analyze-it">Exploratory data analysis (EDA)</a> can be useful for gaining insights into the problem using graphical methods. The individual values plot is useful for simply observing the position of each value relative to the other values in a data set. A box plot, for example, can be helpful when comparing the means, medians, and spread of data from multiple processes. The purpose of EDA is not to form conclusions, but to gain insights that can be helpful in forming tentative hypotheses or in deciding which type of statistical test to perform.</p>
<p>The tests and methods presented here do not cover all available statistical tests and methods in Minitab; however, they do provide a large selection of basic options to choose from.</p>
<p>These tools and methods are helpful when exploring a problem, but their use should not be limited to problem exploration. They can also be helpful for planning and verifying improvements. For example, an individual value plot may indicate one process performs better than a comparable process, and this can then be confirmed using a two-sample t-test. Or, the settings of the better process can be used to plan a DoE to identify the optimal settings for the two processes, and the improvements can be monitored using an Xbar-S chart for the two processes.</p>
<p> </p>
<p><strong>About the Guest Blogger</strong></p>
<p><em><a href="https://www.linkedin.com/pub/matthew-barsalou/5b/539/198" target="_blank">Matthew Barsalou</a> is a statistical problem resolution Master Black Belt at <a href="http://www.3k-warner.de/" target="_blank">BorgWarner</a> Turbo Systems Engineering GmbH. He is a Smarter Solutions certified Lean Six Sigma Master Black Belt, ASQ-certified Six Sigma Black Belt, quality engineer, and quality technician, and a TÜV-certified quality manager, quality management representative, and auditor. He has a bachelor of science in industrial sciences, a master of liberal studies with emphasis in international business, and has a master of science in business administration and engineering from the Wilhelm Büchner Hochschule in Darmstadt, Germany. He is author of the books <a href="http://www.amazon.com/Root-Cause-Analysis-Step---Step/dp/148225879X/ref=sr_1_1?ie=UTF8&qid=1416937278&sr=8-1&keywords=Root+Cause+Analysis%3A+A+Step-By-Step+Guide+to+Using+the+Right+Tool+at+the+Right+Time" target="_blank">Root Cause Analysis: A Step-By-Step Guide to Using the Right Tool at the Right Time</a>, <a href="http://asq.org/quality-press/display-item/index.html?item=H1472" target="_blank">Statistics for Six Sigma Black Belts</a> and <a href="http://asq.org/quality-press/display-item/index.html?item=H1473&xvl=76115763" target="_blank">The ASQ Pocket Guide to Statistics for Six Sigma Black Belts</a>.</em></p>
Data Analysis | Design of Experiments | Hypothesis Testing | Lean Six Sigma | Quality Improvement | Regression Analysis | Six Sigma | Statistics
Fri, 06 Nov 2015 13:00:00 +0000
http://blog.minitab.com/blog/statistics-in-the-field/practical-statistical-problem-solving-using-minitab-to-explore-the-problem
Guest Blogger

Beware of Phantom Degrees of Freedom that Haunt Your Regression Models!
http://blog.minitab.com/blog/adventures-in-statistics/beware-of-phantom-degrees-of-freedom-that-haunt-your-regression-models
<p><img alt="Demon" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/357fb1b145dc3e177860509a7791e6b6/demon1.gif" style="float: right; width: 275px; height: 308px;" />As Halloween approaches, you are probably taking the necessary steps to protect yourself from the various <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-be-a-ghost-hunter-with-a-statistical-mindset" target="_blank">ghosts</a>, goblins, and witches that are prowling around. Monsters of all sorts are out to get you, unless they’re sufficiently bribed with candy offerings!</p>
<p>I’m here to warn you about a ghoul that all statisticians and data scientists need to be aware of: phantom degrees of freedom. These phantoms are really sneaky. You can be out, fitting a regression model, looking at your output, and thinking everything is fine. Then, whammo, these phantoms get you! They suck the explanatory and predictive power right out of your regression model but, deviously, leave all of the output looking just fine. Now that’s truly spooky!</p>
<p>In this blog post, I’ll show you how these phantoms work and how to avoid their dastardly deeds!</p>
What Are Normal Degrees of Freedom in Regression Models?
<p>I’ve written previously about the <a href="http://blog.minitab.com/blog/adventures-in-statistics/the-danger-of-overfitting-regression-models" target="_blank">dangers of overfitting your regression model</a>. An overfit model is one that is too complicated for your data set.</p>
<p>You can learn only so much from a data set of a given size. A degree of freedom is a measure of how much you’ve learned. Your model uses these degrees of freedom with every parameter that it estimates. If you use too many, you’re overfitting the model. The end result is that the <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients">regression coefficients, p-values</a>, and <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit">R-squared</a> can all be misleading.</p>
<p>You can detect overfit models by looking at the number of observations per parameter estimate and assessing the <a href="http://blog.minitab.com/blog/adventures-in-statistics/multiple-regession-analysis-use-adjusted-r-squared-and-predicted-r-squared-to-include-the-correct-number-of-variables" target="_blank">predicted R-squared</a>. However, these methods won’t necessarily detect the misbegotten effects of summoning an excessive number of <em>phantom </em>degrees of freedom!</p>
<p>In the degrees of freedom (DF) column in the ANOVA table below, you can see that this regression model uses 3 degrees of freedom out of a total of 28. It appears that this model is fine. Or is it? <em><Cue evil laugh!></em></p>
<p style="margin-left: 40px;"><img alt="Analysis of variance table for a regression model" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/3aad670bc46412ddfe1ab642089349d1/anova_table.png" style="width: 371px; height: 151px;" /></p>
What Are Phantom Degrees of Freedom?
<p>Phantom degrees of freedom are devilish because they latch onto you through the manner in which you settle on the final model. They are not detectable in the output for the final model even as they haunt your regression models.</p>
<div style="float: right; width: 275px; margin: 15px 0px 15px 15px; line-height: 1;"><img alt="Guy surrounded by demons" height="369px;" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/949386ee39624c2cb187e3dc0d9cb630/demons.jpg" width="275px;" /><br />
<em style="font-size: x-small; line-height: 1;">The dangers of invoking too many phantom degrees of freedom!</em></div>
<p>Every time your incantation adds or removes predictors from a model based on a statistical test, you invoke a phantom degree of freedom because you’re learning something from your data set. However, even when you summon many phantom degrees of freedom during the model selection process, they are not evident in <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">Minitab’s</a> output for the final model. That is what makes them phantoms.</p>
<p>When you invoke too many phantoms, your regression model becomes haunted. This occurs because you’re performing many statistical tests, and every statistical test has a false positive rate. When you try many different models, you're bound to find variables that appear to be significant but are correlated only by chance. These relationships are nothing more than ghostly apparitions!</p>
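The arithmetic behind this accumulation of false positives is simple: assuming independent tests at a 0.05 significance level, the chance of at least one false positive grows rapidly with the number of model-selection tests you perform:

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive across n independent tests
    when every null hypothesis is actually true."""
    return 1 - (1 - alpha) ** n_tests

for n in (1, 5, 20):
    print(n, round(familywise_error_rate(n), 3))
# prints: 1 0.05 / 5 0.226 / 20 0.642
```

So after trying 20 candidate terms, you have roughly a 64% chance of "discovering" at least one spurious predictor; real model-selection tests are correlated rather than independent, but the qualitative lesson holds.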
<p>To protect yourself from this type of bewitching, you need to understand the environment that these phantoms inhabit. Phantom degrees of freedom have the strongest powers when you have a small-to-moderate sample size, many potential predictors, correlated predictors, and when the light of knowledge does not illuminate your conception of the true model.</p>
<p>In this scenario, you are likely to fit many possible models, adding and removing different predictors, and testing curvature and interaction terms in an attempt to conjure an answer out of the darkness. Perhaps you use an automatic incantation procedure like <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-smackdown-stepwise-versus-best-subsets" target="_blank">stepwise or best subsets regression</a>. If you have <a href="http://blog.minitab.com/blog/adventures-in-statistics/what-are-the-effects-of-multicollinearity-and-when-can-i-ignore-them" target="_blank">multicollinearity</a>, the parameter estimates are particularly unhinged.</p>
<p>The ANOVA table we saw above appears to be perfectly normal, but it could be haunted. To divine the truth, you must understand the entire ritual that incited the final model to materialize. If you start out with 20 variables, a sample size of 29, and fit many models to see what works, you could conjure a possessed model beguiling you to accept false conclusions.</p>
<p>In fact, this method of dredging through data to see what sticks casts such a diabolical spell that it can manifest a statistically significant regression model with a high R-squared <em><a href="http://blog.minitab.com/blog/adventures-in-statistics/four-tips-on-how-to-perform-a-regression-analysis-that-avoids-common-problems" target="_blank">from completely random data</a></em>! Beware—this is the environment that the phantoms inhabit!</p>
How to Protect Yourself from the Phantom Degrees of Freedom
<p>To protect yourself from phantom degrees of freedom, information and advance planning are your best talismans. Use the following rites to shine the light of truth on your research and to guide yourself out of the darkness:</p>
<ul>
<li>Conduct prior research about the important variables and their relationships to help you specify the best regression model without the need for data mining.</li>
<li>Collect a large enough sample size to support the level of model complexity that you will need.</li>
<li>Avoid data mining and keep track of how many phantom degrees of freedom that you raise before arriving at your final model.</li>
</ul>
<p>For more information about avoiding haunted models, read my post about <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-choose-the-best-regression-model">How to Choose the Best Regression Model</a>.</p>
<p>Happy Halloween!</p>
<p> </p>
<p style="font-size:10px;"><em>"Buer." Licensed under Public Domain via <a href="https://en.wikipedia.org/wiki/Buer_(demon)#/media/File:Buer.gif" target="_blank">Commons.</a></em></p>
Data Analysis | Hypothesis Testing | Regression Analysis | Statistics | Statistics Help
Thu, 29 Oct 2015 12:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/beware-of-phantom-degrees-of-freedom-that-haunt-your-regression-models
Jim Frost