Data Analysis Software | Minitab
Blog posts and articles with tips for using statistical software to analyze data for quality improvement.
http://blog.minitab.com/blog/data-analysis-software/rss
Thu, 03 Sep 2015 17:07:40 +0000
The Danger of Overfitting Regression Models
http://blog.minitab.com/blog/adventures-in-statistics/the-danger-of-overfitting-regression-models
<p><img alt="Example of an overfit regression model" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/a284ba0ea6c3bf8f6dcec4e7a9d5f1f2/overfitlineplotnoequ.gif" style="float: right; width: 300px; height: 200px;" />In regression analysis, overfitting a model is a real problem. An overfit model can cause the <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients" target="_blank">regression coefficients, p-values</a>, and <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit" target="_blank">R-squared</a> to be misleading. In this post, I explain what an overfit model is and how to detect and avoid this problem.</p>
<p>An overfit model is one that is too complicated for your data set. When this happens, the regression model becomes tailored to fit the quirks and random noise in your specific sample rather than reflecting the overall population. If you drew another sample, it would have its own quirks, and your original overfit model would not likely fit the new data.</p>
<p>Instead, we want our model to approximate the true model for the entire population. Our model should not only fit the current sample, but new samples too.</p>
<p>The fitted line plot illustrates the dangers of an overfit model. The model appears to explain a lot of variation in the response variable. However, the model is too complex for the sample data. In the overall population, there is no real relationship between the predictor and the response. You can read about the model <a href="http://blog.minitab.com/blog/adventures-in-statistics/multiple-regession-analysis-use-adjusted-r-squared-and-predicted-r-squared-to-include-the-correct-number-of-variables" target="_blank">here.</a></p>
Fundamentals of Inferential Statistics
<p>To understand how overfitting causes these problems, we need to go back to the basics for inferential statistics.</p>
<p>The overall goal of inferential statistics is to draw conclusions about a larger population from a random sample. Inferential statistics uses the sample data to provide the following:</p>
<ul>
<li>Unbiased estimates of properties and relationships within the population.</li>
<li><a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-why-we-need-to-use-hypothesis-tests-in-statistics" target="_blank">Hypothesis tests</a> that assess statements about the entire population.</li>
</ul>
<p>An important concept in inferential statistics is that the amount of information you can learn about a population is limited by the sample size. The more you want to learn, the larger your sample size must be.</p>
<p>You probably understand this concept intuitively, but here’s an example. If you have a sample size of 20 and want to estimate a single population mean, you’re probably in good shape. However, if you want to estimate two population means using the same total sample size, it suddenly looks iffier. If you increase it to three population means and more, it starts to look pretty bad.</p>
<p>The quality of the results worsens when you try to learn too much from a sample. As the number of observations per parameter decreases in the example above (20, 10, 6.7, etc.), the estimates become more erratic and a new sample is less likely to reproduce them.</p>
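<p>To make the arithmetic in this example concrete, here is a minimal Python sketch. It assumes the 20 observations are split evenly across the means and uses an assumed population standard deviation of 1, which is not a value from this post. The helper name <code>per_parameter</code> is hypothetical.</p>

```python
import math

TOTAL_N = 20
SIGMA = 1.0  # assumed population standard deviation, for illustration only


def per_parameter(total_n, n_means):
    """Return (observations per mean, standard error of each mean)."""
    n_each = total_n / n_means
    return n_each, SIGMA / math.sqrt(n_each)


for k in (1, 2, 3):
    n_each, se = per_parameter(TOTAL_N, k)
    print(f"{k} mean(s): {n_each:.1f} observations each, standard error {se:.3f}")
```

<p>The standard error of each mean grows as the same sample is asked to support more estimates, which is the sense in which the estimates become more erratic.</p>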
Applying These Concepts to Overfitting Regression Models
<p>In a similar fashion, overfitting a regression model occurs when you attempt to estimate too many parameters from a sample that is too small. Regression analysis uses one sample to estimate the values of the coefficients for <em>all</em> of the terms in the equation. The sample size limits the number of terms that you can safely include before you begin to overfit the model. The number of terms in the model includes all of the predictors, interaction effects, and polynomial terms (<a href="http://blog.minitab.com/blog/adventures-in-statistics/curve-fitting-with-linear-and-nonlinear-regression" target="_blank">to model curvature</a>).</p>
<p>Larger sample sizes allow you to specify more complex models. For trustworthy results, your sample size must be large enough to support the level of complexity that is required by your research question. If your sample size isn’t large enough, you won’t be able to fit a model that adequately approximates the true model for your response variable. You won’t be able to trust the results.</p>
<p>Just like the example with multiple means, you must have a sufficient number of observations for each term in a regression model. Simulation studies show that a good rule of thumb is to have 10-15 observations per term in multiple linear regression.</p>
<p>For example, if your model contains two predictors and the interaction term, you’ll need 30-45 observations. However, if the effect size is small or there is high multicollinearity, you may need more observations per term.</p>
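<p>The rule of thumb is easy to turn into a quick check. The sketch below is only an illustration of the 10-15 observations per term guideline. The function names are hypothetical, and it does not adjust for small effect sizes or multicollinearity.</p>

```python
def terms_in_model(n_predictors, n_interactions=0, n_polynomial=0):
    """Count every term that needs its own coefficient (excluding the constant)."""
    return n_predictors + n_interactions + n_polynomial


def recommended_sample_size(n_terms, low=10, high=15):
    """Return the (minimum, maximum) of the 10-15 observations-per-term guideline."""
    return n_terms * low, n_terms * high


# Two predictors plus their interaction, as in the example above.
n_terms = terms_in_model(n_predictors=2, n_interactions=1)
print(recommended_sample_size(n_terms))  # (30, 45)
```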
How to Detect and Avoid Overfit Models
<p>Cross-validation can detect overfit models by determining how well your model generalizes to other data sets by partitioning your data. This process helps you assess how well the model fits new observations that weren't used in the model estimation process.</p>
<p><a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">Minitab statistical software</a> provides a great cross-validation solution for linear models by calculating <a href="http://blog.minitab.com/blog/adventures-in-statistics/multiple-regession-analysis-use-adjusted-r-squared-and-predicted-r-squared-to-include-the-correct-number-of-variables" target="_blank">predicted R-squared</a>. This statistic is a form of cross-validation that doesn't require you to collect a separate sample. Instead, Minitab calculates predicted R-squared by systematically removing each observation from the data set, estimating the regression equation, and determining how well the model predicts the removed observation.</p>
<p>If the model does a poor job at predicting the removed observations, this indicates that the model is probably tailored to the specific data points that are included in the sample and not generalizable outside the sample.</p>
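<p>The leave-one-out idea can be sketched in a few lines of Python. This is not Minitab's implementation. It is a minimal illustration for a single predictor, using made-up data and the common definition of predicted R-squared as 1 minus PRESS divided by the total sum of squares.</p>

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def total_sum_of_squares(ys):
    mean_y = sum(ys) / len(ys)
    return sum((y - mean_y) ** 2 for y in ys)


def r_squared(xs, ys):
    b0, b1 = fit_line(xs, ys)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - sse / total_sum_of_squares(ys)


def predicted_r_squared(xs, ys):
    """Remove each observation, refit, and predict the removed point."""
    press = 0.0
    for i in range(len(xs)):
        b0, b1 = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (b0 + b1 * xs[i])) ** 2
    return 1 - press / total_sum_of_squares(ys)


# Hypothetical data, not taken from the post.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
print(f"R-squared: {r_squared(x, y):.4f}")
print(f"Predicted R-squared: {predicted_r_squared(x, y):.4f}")
```

<p>A predicted R-squared that falls well below the regular R-squared is the warning sign described above.</p>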
<p>To avoid overfitting your model in the first place, collect a sample that is large enough so you can safely include all of the predictors, interaction effects, and polynomial terms that your response variable requires. The scientific process involves plenty of research before you even begin to collect data. You should identify the important variables, the model that you are likely to specify, and use that information to estimate a good sample size.</p>
<p>For more about the model selection process, read my blog post, <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-choose-the-best-regression-model">How to Choose the Best Regression Model</a>.</p>
Thu, 03 Sep 2015 12:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/the-danger-of-overfitting-regression-models
Jim Frost
Chi-Square Analysis: Powerful, Versatile, Statistically Objective
http://blog.minitab.com/blog/michelle-paret/chi-square-analysis-powerful-versatile-statistically-objective
<p style="line-height: 20.7999992370605px;">To make objective decisions about the processes that are critical to your organization, you often need to examine categorical data. You may know how to use a t-test or ANOVA when you’re comparing measurement data (like weight, length, <span style="line-height: 1.6;">revenue, </span><span style="line-height: 1.6;">and so on), but do you know how to compare attribute or counts data? It's easy to do with <a href="http://www.minitab.com/products/minitab">statistical software</a> like Minitab. </span></p>
<p style="line-height: 20.7999992370605px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/60bfd1eb8d2c2c3689bce89ea55453ab/chisquare_onevariable_w1024.jpeg" style="line-height: 20.7999992370605px; width: 350px; height: 230px; float: right; margin: 10px 15px;" /></p>
<p style="line-height: 20.7999992370605px;">One person may look at this bar chart and decide that each production line had the same <span style="line-height: 1.6;">proportion of defects. But another person may focus on the small difference between the bars and decide that one of the lines has outperformed the others. Without an appropriate statistical analysis, how can you know which person is right?</span></p>
<p style="line-height: 20.7999992370605px;">When time, money, and quality depend on your answers, you can’t rely on subjective visual assessments alone. To answer questions like these with statistical objectivity, you can use a Chi-Square analysis.</p>
Which Analysis Is Right for Me?
<p style="line-height: 20.7999992370605px;">Minitab offers three Chi-Square tests. The appropriate analysis depends on the number of variables that you want to examine. And for all three options, the data can be formatted either as raw data or summarized counts.</p>
<strong>Chi-Square Goodness-of-Fit Test – 1 Variable</strong>
<p style="line-height: 20.7999992370605px;">Use Minitab’s <strong>Stat > Tables > Chi-Square Goodness-of-Fit Test (One Variable)</strong> when you have just one variable.</p>
<p style="line-height: 20.7999992370605px;">The Chi-Square Goodness-of-Fit Test can test if the proportions for all groups are equal. It can also be used to test if the proportions for groups are equal to specific values. For example:</p>
<ul style="line-height: 20.7999992370605px;">
<li>A bottle cap manufacturer operates three production lines and records the number of defective caps for each line. The manufacturer uses the <strong>Chi-Square Goodness-of-Fit Test</strong> to determine if the proportion of defects is equal across all three lines.</li>
<li>A bottle cap manufacturer operates three production lines and records the number of defective caps for each line. One line runs at high speed and produces twice as many caps as the other two lines that run at a slower speed. The manufacturer uses the <strong>Chi-Square Goodness-of-Fit Test</strong> to determine if the number of defects for each line is proportional to the volume of caps it produces.</li>
</ul>
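<p style="line-height: 20.7999992370605px;">If you want to see the calculation behind both bottle cap examples, here is a rough Python sketch. The defect counts are hypothetical, the function names are my own, and the p-value helper uses a closed form that only works when the degrees of freedom are even (three lines give 2 degrees of freedom).</p>

```python
import math


def chi_square_statistic(observed, expected):
    """Pearson chi-square: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_p_value_even_df(stat, df):
    """Upper-tail p-value; this closed form is valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this shortcut requires a positive, even df")
    half = stat / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


defects = [30, 25, 35]  # hypothetical defect counts for lines 1-3
total = sum(defects)
df = len(defects) - 1

# Example 1: are defects spread equally across the three lines?
equal_expected = [total / 3] * 3
stat_equal = chi_square_statistic(defects, equal_expected)
p_equal = chi_square_p_value_even_df(stat_equal, df)

# Example 2: line 1 makes twice as many caps, so volume shares are 2:1:1.
volume_shares = [0.5, 0.25, 0.25]
volume_expected = [total * share for share in volume_shares]
stat_volume = chi_square_statistic(defects, volume_expected)
p_volume = chi_square_p_value_even_df(stat_volume, df)

print(f"Equal proportions:   chi-square = {stat_equal:.3f}, p = {p_equal:.4f}")
print(f"Proportional to volume: chi-square = {stat_volume:.3f}, p = {p_volume:.4f}")
```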
<strong>Chi-Square Test for Association – 2 Variables</strong>
<p style="line-height: 20.7999992370605px;">Use Minitab’s <strong>Stat > Tables > Chi-Square Test for Association</strong> when you have two variables.</p>
<p style="line-height: 20.7999992370605px;">The Chi-Square Test for Association can tell you if there’s an association between two variables. In other words, it can test if two variables are independent or not. For example:</p>
<ul style="line-height: 20.7999992370605px;">
<li>A paint manufacturer operates two production lines across three shifts and records the number of defective units per line per shift. The manufacturer uses the <strong>Chi-Square Test for Association</strong> to determine if the defect rates are similar across all shifts and production lines. Or, are certain lines during certain shifts more prone to defects?</li>
<li>A credit card billing center records the type of billing error that is made, as well as the type of form that is used. The billing center uses a Chi-Square Test to determine whether certain types of errors are related to certain forms.</li>
</ul>
<p style="line-height: 20.7999992370605px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/7af9e9b2ee624e7d912393d7debe7f1b/chisquare_twovariables_w1024.jpeg" style="width: 500px; height: 329px;" /></p>
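<p style="line-height: 20.7999992370605px;">For the paint example, the test compares each observed count with the count you would expect if line and shift were independent. The Python sketch below shows that calculation with a hypothetical 2 x 3 table of defect counts. The function names are my own, and the p-value shortcut is exact only because this table has 2 degrees of freedom.</p>

```python
import math


def expected_counts(table):
    """Expected count for each cell under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_association(table):
    """Return (chi-square statistic, degrees of freedom) for a two-way table."""
    expected = expected_counts(table)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Hypothetical defect counts: rows are lines 1-2, columns are shifts 1-3.
defect_table = [
    [20, 30, 25],
    [22, 28, 50],
]
stat, df = chi_square_association(defect_table)
if df == 2:
    p_value = math.exp(-stat / 2)  # exact upper-tail probability only when df is 2
    print(f"chi-square = {stat:.3f}, df = {df}, p = {p_value:.4f}")
```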
<strong>Cross Tabulation and Chi-Square – 2 or more variables</strong>
<p style="line-height: 20.7999992370605px;">Use Minitab’s <strong>Stat > Tables > Cross Tabulation and Chi-Square </strong>when you have two or more variables.</p>
<p style="line-height: 20.7999992370605px;">If you simply want to test for associations between two variables, you can use either <strong>Cross Tabulation and Chi-Square</strong> or <strong>Chi-Square Test for Association</strong>. However, <span><a href="http://blog.minitab.com/blog/understanding-statistics/using-cross-tabulation-and-chi-square-the-survey-says">Cross Tabulation and Chi-Square</a></span> also lets you control for the effect of additional variables. Here’s an example:</p>
<ul style="line-height: 20.7999992370605px;">
<li>A dairy processing plant records information about each defective milk carton that it produces. The plant uses a Cross Tabulation and Chi-Square analysis to look for dependencies between the defect types and the machine that produces the carton, while controlling for any shift effect. Perhaps a particular filling machine is prone to a certain type of defect, but only during the first shift.</li>
</ul>
<p style="line-height: 20.7999992370605px;">This analysis also offers advanced options. For example, if your categories are ordinal (good, better, best or small, medium, large) you can include a special test for concordance.</p>
Conducting a Chi-Square Analysis in Minitab
<p style="line-height: 20.7999992370605px;">Each of these analyses is easy to run in Minitab. For more examples that include step-by-step instructions, just navigate to the Chi-Square menu of your choice and then click Help > example.</p>
<p style="line-height: 20.7999992370605px;">It can be tempting to make subjective assessments about a given set of data, their makeup, and possible interdependencies, but why risk an error in judgment when you can be sure with a Chi-Square test?</p>
<p style="line-height: 20.7999992370605px;">Whether you’re interested in one variable, two variables, or more, a Chi-Square analysis can help you make a clear, statistically sound assessment.</p>
Data Analysis, Hypothesis Testing, Lean Six Sigma, Quality Improvement, Six Sigma, Statistics, Statistics Help
Thu, 27 Aug 2015 12:33:39 +0000
http://blog.minitab.com/blog/michelle-paret/chi-square-analysis-powerful-versatile-statistically-objective
Michelle Paret
Using Probability Distribution Plots to See Data Clearly
http://blog.minitab.com/blog/understanding-statistics/using-probability-distribution-plots-to-see-data-clearly
<p><span style="line-height: 1.6;">When we take pictures with a digital camera or smartphone, what the device <em>really</em> does is capture information in the form of binary code. At the most basic level, our precious photos are really just a bunch of 1s and 0s, but if we were to look at them that way, they'd be pretty unexciting. </span></p>
<p><span style="line-height: 1.6;">In its raw state, all that information the camera records is worthless. The 1s and 0s need to be converted into pictures before we can actually see what we've photographed.<img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/8ad3420e0401f47cadb9bd3e9723fc32/camera_with_plot.jpg" style="margin: 10px 15px; float: right; width: 250px; height: 163px;" /></span></p>
<p>We encounter a similar situation when we try to use <span><a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-identify-the-distribution-of-your-data-using-minitab">statistical distributions and parameters</a></span> to describe data. There's important information there, but it can seem like a bunch of meaningless numbers without an illustration that makes them easier to interpret.</p>
<p>For instance, if you have data that follows a gamma distribution with a scale of 8 and a shape of 7, what does that really mean? If the distribution shifts to a shape of 10, is that good or bad? And even if <em>you</em> understand it, how easy would it be to explain to people who are more interested in outcomes than statistics?</p>
Enter the Probability Distribution Plot
<p>That's where the probability distribution plot comes in. Making a probability distribution plot using Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a> will create a picture that helps bring the numbers to life. Even novices can benefit from understanding their data’s distribution.</p>
<p>Let's take a look at a few examples.</p>
Changing Shape
<p><span style="line-height: 20.7999992370605px;">A building materials manufacturer develops a new process to increase the strength of its I-beams. The old process fit a gamma distribution with a scale of 8 and a shape of 7, whereas the new process has a shape of 10.</span><span style="line-height: 20.7999992370605px;"> </span></p>
<p><img alt="estimates" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/3e60c8f30d8ec67ec28c548ef44cbbc2/probability_distribution_plots_1_en_us_1_.gif" style="width: 294px; height: 90px;" /></p>
<p>The manufacturer does not know what this change in the shape parameter means, and the numbers alone don't tell the story. </p>
<p>But if we go in Minitab to <strong>Graph > Probability Distribution Plot</strong>, select the "View Probability" option, and enter the information about these distributions, the impact of the change will be revealed.</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/662165bbd950bcab66ddc18702437d4c/probability_plot_dialog.png" style="width: 369px; height: 207px;" /></p>
<p>Here's the original process, with the shape of 7:</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/e786e9d35416d0eef6b18f4019ef7a18/distributionplot1.png" style="width: 576px; height: 384px;" /></p>
<p>And here is the plot for the new process, with a shape of 10: </p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ab74192ac9cbacaa4bafb661cf07bb63/distributionplot2.png" style="width: 576px; height: 384px;" /></p>
<p>The probability distribution plots make it easy to see that the shape change increases the proportion of acceptable beams from 91.4% to 99.5%, an improvement of 8.1 percentage points. What's more, the right tail appears to be much thicker in the second graph, which indicates the new process creates many more unusually strong units. Hmmm...maybe the new process could ultimately lead to a premium line of products.</p>
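<p>You can approximate the shaded areas without the plot. The Python sketch below uses the closed-form upper tail of a gamma distribution with a whole-number shape. The lower strength limit of 30 is my assumption, since the post only shows the limit in the plot images. With that assumed limit, the results land close to the percentages quoted above.</p>

```python
import math


def gamma_upper_tail(x, shape, scale):
    """P(X > x) for a gamma distribution whose shape is a positive integer."""
    if shape < 1 or int(shape) != shape:
        raise ValueError("this closed form needs a positive integer shape")
    z = x / scale
    return math.exp(-z) * sum(z ** k / math.factorial(k) for k in range(int(shape)))


LOWER_LIMIT = 30  # assumed minimum acceptable strength; not stated in the text
SCALE = 8

old_process = gamma_upper_tail(LOWER_LIMIT, shape=7, scale=SCALE)
new_process = gamma_upper_tail(LOWER_LIMIT, shape=10, scale=SCALE)

print(f"Old process (shape 7):  {old_process:.1%} acceptable")
print(f"New process (shape 10): {new_process:.1%} acceptable")
print(f"Difference: {(new_process - old_process) * 100:.1f} percentage points")
```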
Communicating Results
<p>Suppose a chain of department stores is considering a new program to reduce discrepancies between an item’s tagged price and the amount that is charged at the register. <span style="line-height: 20.7999992370605px;">Ideally, the system would eliminate any discrepancies, but a </span><span style="line-height: 20.7999992370605px;">± 0.5% </span><span style="line-height: 20.7999992370605px;">difference is considered acceptable. </span><span style="line-height: 1.6;">However, implementing the program will be extremely expensive, so the company runs a pilot test in a single store. </span></p>
<p><span style="line-height: 20.7999992370605px;">In the pilot study, the mean improvement is small, and so is the standard deviation. When the company's board looks at the numbers, they don't see the benefits of approving the program, given its cost. </span></p>
<p><img alt="communicate results data" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/93fc723781cb6d5b48cce1e442c73ecb/probability_distribution_plots_3_en_us_1_.gif" style="width: 266px; height: 92px;" /></p>
<p>The store's quality specialist thinks the numbers aren't telling the story, and decides to show the board the pilot test data in a probability distribution plot instead: </p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/d5707e270065bc777cd9e8eeda66393a/distributionplot3.png" style="width: 576px; height: 384px;" /></p>
<p>By overlaying the before and after distributions, the specialist makes it very easy to see that price differences using the new system are clustered much closer to zero, and most are in the ± 0.5% acceptable range. Now the board can see the impact of adopting the new system. </p>
Comparing Distributions
<p>An electronics manufacturer counts the number of printed circuit boards that are completed per hour. The sample data is best described by a Poisson distribution with a mean of 3.2. However, the company's test lab prefers to use <span><a href="http://blog.minitab.com/blog/quality-data-analysis-and-statistics/assumptions-and-normality">an analysis that requires a normal distribution</a></span> and wants to know if it is appropriate.</p>
<p><span style="line-height: 20.7999992370605px;">The manufacturer can easily compare the known distribution with a normal distribution using the probability distribution plot. </span>If the normal distribution does not approximate the Poisson distribution, then the lab's test results will be invalid.</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ca8efa0a9ccabe926553c29b7f42d8e6/distributionplot4.png" style="width: 576px; height: 384px;" /></p>
<p><span style="line-height: 1.6;">As the graph indicates, the normal distribution—and the analyses that require it—won’t be a good fit for data that follow a Poisson distribution with a mean of 3.2.</span></p>
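<p>A numeric version of this comparison is sketched below in Python. It assumes the normal curve is given the same mean and variance as the Poisson distribution (mean 3.2, standard deviation equal to the square root of 3.2). The post does not state which normal parameters were used in the plot, so treat that choice as an assumption.</p>

```python
import math
from statistics import NormalDist

MEAN = 3.2
# Assumed matching normal: same mean, and variance equal to the Poisson mean.
normal = NormalDist(mu=MEAN, sigma=math.sqrt(MEAN))


def poisson_pmf(k, mean):
    """Probability of exactly k events for a Poisson distribution."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


print("boards/hour  Poisson  normal")
for k in range(0, 11):
    # Normal probability assigned to the interval around each whole number.
    normal_mass = normal.cdf(k + 0.5) - normal.cdf(k - 0.5)
    print(f"{k:11d}  {poisson_pmf(k, MEAN):.4f}   {normal_mass:.4f}")

# The normal curve also puts probability on impossible negative counts.
below_zero = normal.cdf(0)
print(f"Normal probability below zero: {below_zero:.4f}")
```

<p>The mismatch shows up in two places: the Poisson distribution is skewed to the right, and the normal curve assigns some probability to counts below zero, which cannot occur.</p>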
Creating Probability Distribution Plots in Minitab
<p>It's easy to use Minitab to create plots to visualize and to compare distributions and even to scrutinize an area of interest.</p>
<p>Let's say a market researcher wants to interview customers with satisfaction scores between 115 and 135. Minitab’s Individual Distribution Identification feature shows that these scores are normally distributed with a mean of 100 and a standard deviation of 15. However, the researcher can’t visualize where these subjects fall within the range of scores or their proportion of the entire distribution.</p>
<p>Choose <strong>Graph > Probability Distribution Plot > View Probability</strong>.<br />
Click <strong>OK</strong>.</p>
<p><img alt="dialog box" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c394e6d6080f4749f93345c431265b6c/probability_plot_dialog_2.png" style="line-height: 20.7999992370605px; width: 431px; height: 386px;" /></p>
<p style="margin-left: 40px;">From Distribution, choose Normal.<br />
In Mean, type 100.<br />
In Standard deviation, type 15.<br />
Click on the "Shaded Area" tab. </p>
<p><img alt="distribution plot dialog box 2" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/1fcc60e71a09408880bf2cff8d8fc545/probability_plot_dialog_3.png" style="width: 431px; height: 386px;" /></p>
<p style="margin-left: 40px;">In Define Shaded Area By, choose X Value.<br />
Click Middle.<br />
In X value 1, type 115.<br />
In X value 2, type 135.<br />
Click OK.</p>
<p>Minitab creates the following plot: </p>
<p><img alt="distribution plot " src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/2f25241390c52c6eb8efbc623bf77e7a/distributionplot5.png" style="width: 576px; height: 384px;" /></p>
<p>About 15% of sampled customers had scores in the region of interest (115-135). This is not a very large percentage, so the researcher may face challenges in finding qualified subjects.</p>
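<p>If you want to double-check the shaded area outside Minitab, the same probability can be computed with Python's standard library. This is only a cross-check of the number above, using the mean of 100 and standard deviation of 15 from the example.</p>

```python
from statistics import NormalDist

scores = NormalDist(mu=100, sigma=15)

# Area under the normal curve between the two X values from the dialog box.
share_in_range = scores.cdf(135) - scores.cdf(115)
print(f"Share of customers scoring 115-135: {share_in_range:.1%}")
```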
Using Probability Distribution Plots
<p>Just like your camera when it assembles 1s and 0s into pictures, probability distribution plots let you see the deeper meaning of the numbers that describe your distributions. You can use these graphs to highlight the impact of changing distributions and parameter values, to show where target values fall in a distribution, and to view the proportions that are associated with shaded areas. These simple plots also clearly and easily communicate these advanced concepts to a non-statistical audience that might be confused by hard-to-understand concepts and numbers. </p>
Data Analysis, Statistics
Thu, 20 Aug 2015 12:00:00 +0000
http://blog.minitab.com/blog/understanding-statistics/using-probability-distribution-plots-to-see-data-clearly
Eston Martz
Big Ten 4th Down Calculator: Creating a Model for Expected Points
http://blog.minitab.com/blog/the-statistics-game/big-ten-4th-down-calculator%3A-creating-a-model-for-expected-points
<p><img alt="4th Down" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/388eafa0f8b00f893356dbce728c1677/4th_down.jpg" style="margin: 10px 15px; width: 206px; height: 290px; float: right;" />If you want to use data to predict the impact of different variables, whether it's for business or some personal interest, you need to create a model based on the best information you have at your disposal. In this post and subsequent posts throughout the football season, I'm going to share how I've been developing and applying a model for predicting the outcomes of 4th down decisions in Big Ten games. I hope sharing my experiences will help you, whether the questions you want to answer are about football or business logistics. </p>
<p>Here are some questions I was looking to answer when I began thinking about creating a 4th down calculator. If you have a 1st and 10 at your opponent’s 20-yard line, on average you’ll score more points than if you have the ball at your own 20-yard line. But how many more? And how does that number change as you move to different positions on the field? And what if you’re playing on the road as opposed to playing at home?</p>
<p>If you’re trying to use analytics to determine what the best decision is on 4th down, you need to know how many points you (or your opponent) would be expected to score on the ensuing 1st down. So my first step in creating a <a href="http://blog.minitab.com/blog/the-statistics-game/coming-soon%3A-the-big-ten-4th-down-calculator">Big Ten 4th down calculator</a> was to use Minitab Statistical Software to model a team’s expected points on 1st and 10 from anywhere on the field.</p>
The Data
<p>I went through every Big Ten conference game the last two seasons. For each instance a team had 1st and 10, I recorded the field position and the next score. If your opponent was the next team to score, then the value for the next score was negative. If nobody scored before halftime or the end of the game (depending on which half they were in) the value was 0.</p>
<p>I only included conference games because many non-conference games are one-sided (I’m looking at you, Ohio State vs. Kent State in <span style="line-height: 20.79px;">2014</span><span style="line-height: 1.6;">). I also didn’t include the conference championship game, since I want to account for home field advantage and that game is played at a neutral site. Finally, I did my best to exclude drives that ended prematurely because of halftime and drives in the 4th quarter of blowouts. </span></p>
<p><span style="line-height: 1.6;">I ended up with 5,496 drives over the two seasons. You can get both the raw and summarized data </span><a href="//cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/5de24a0ccc06069f2b2c02beb2d9e281/expected_points_data.MTW" style="line-height: 1.6;">here</a><span style="line-height: 1.6;">.</span></p>
<p>A bar chart can give us a quick glance at what the most common score is.</p>
<p><img alt="Bar Chart" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/c329e0cb4ca28ed32321e050458fa636/bar_chart_of_next_score.jpg" style="width: 576px; height: 384px;" /></p>
<p>The most common outcome when you have possession of the ball is that you score a touchdown. No revelation there. But surprisingly, it was actually more common for your opponent to get the ball back and score a touchdown than it was for you to kick a field goal. I wouldn’t have expected that.</p>
<p>So now let’s see what happens when we account for the field position and home field advantage.</p>
A Model for Expected Points
<p>I grouped the field positions into 5-yard intervals. Then for each group, I took the average of the next score. So first, let’s look at a fitted line plot of the data, <em>without </em>accounting for home field advantage.</p>
<p><img alt="Fitted Line Plot" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/1fc2f2ec3c73acd2c561a0e6fb2c1b92/fitted_line_plot_without_hf_advantage.jpg" style="width: 576px; height: 384px;" /></p>
<p>The regression model fits the data very well. The R-squared value indicates that 96.4% of the variation in Expected Points can be explained by the number of yards to the end zone. That’s fantastic! I added a reference line at the point where the expected value is 0. It crosses our regression line at a distance to the end zone of approximately 85 yards. That suggests you have to be inside your own 15-yard line before the team on defense is more likely to be the next team to score.</p>
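<p>For anyone who wants to repeat the grouping step outside Minitab, here is a small Python sketch. The drives listed are made up for illustration, and the bin edges (1-5, 6-10, and so on) are my assumption about how the 5-yard groups could be defined. The function name is hypothetical.</p>

```python
from collections import defaultdict


def average_next_score_by_group(drives, width=5):
    """Average the next score within each yards-to-end-zone group.

    Groups are labeled by their lower edge: 1 means 1-5 yards, 6 means 6-10, etc.
    """
    totals = defaultdict(lambda: [0, 0])  # group start -> [sum of scores, count]
    for yards_to_end_zone, next_score in drives:
        group_start = ((yards_to_end_zone - 1) // width) * width + 1
        totals[group_start][0] += next_score
        totals[group_start][1] += 1
    return {start: s / n for start, (s, n) in sorted(totals.items())}


# Hypothetical drives: (yards to the end zone, next score).
# A negative score means the opponent scored next; 0 means nobody scored.
drives = [(78, 7), (80, -7), (76, 0), (22, 7), (25, 3), (21, 7)]

expected_points = average_next_score_by_group(drives)
for group_start, avg in expected_points.items():
    print(f"{group_start}-{group_start + 4} yards: {avg:.2f} expected points")
```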
<p>Now let’s factor in home field advantage. We’ll start by examining a scatterplot that will show the difference in expected points for home and away teams at each yard line group.</p>
<p><img alt="Scatterplot" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/97951f90653235b48ef016f78eaaf0ec/scatterplot_1.jpg" style="width: 576px; height: 384px;" /></p>
<p>In 17 of the 20 groups, the home team has a higher number of expected points than the away team. And in the 3 cases where the away team is higher, the two values are very close. This gives strong evidence that we need to account for home field advantage. I ran a regression analysis to confirm that we should include that game location in our model.</p>
<p><img alt="Regression" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/8360657dd1ab5ce5f7bf22ad79595849/regression_with_no_interaction.jpg" style="width: 495px; height: 336px;" /></p>
<p>The p-value for location is less than 0.05, and the R-squared value remains very high. I can now use these two equations (one for home games, one for away games) to predict how many points a team with a first down will score from anywhere on the field.</p>
Testing the Interaction Between Home Field Advantage and Yards to the End Zone
<p>There is one last thing I want to look into. Is there an interaction between our two terms? Think about it this way: Say you have 1st and goal inside your opponent’s 10-yard line. You’re so close to the end zone, it seems like it might not matter whether you’re at home or on the road.</p>
<p>Now imagine you have a 1st and 10 inside your own 10-yard line. It seems like a much more daunting task to drive the length of the field on the road with the hostile crowd roaring than it would be with the cheers of a friendly home crowd.</p>
<p>In other words, does the effect of home field advantage increase the further a team is from the end zone? Intuitively, it seems like it <em>should</em>. But we should run a regression analysis to see if the data supports that notion.</p>
<p><img alt="Regression" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/290b2ad01ddba5340b3604b1834fa0d8/regression_with_interaction.jpg" style="width: 534px; height: 158px;" /></p>
<p>The data does not support my intuition. The p-value for the interaction term is much higher than 0.05, indicating that it is not a significant term, and thus that we should not include it in our model. To visualize why, let’s revisit the previous scatterplot, but this time I'll add regression lines to each group.</p>
<p><img alt="Scatterplot" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/98828cf26b5dd282a08d2d7aaeeff6d7/scatterplot_2.jpg" style="width: 576px; height: 384px;" /></p>
<p>If there were an interaction between our two terms, we would expect the two lines to be close together at small distances to the end zone. Then they should move farther apart as the yards to the end zone increase. But you can see here that the lines are nearly parallel to each other. So we can safely remove the interaction term from our model.</p>
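<p>To make the "parallel lines" idea concrete, here is a minimal sketch that fits a separate line to each group and compares the slopes. The numbers are made up for illustration and are not the Big Ten data used in this post; <code>fit_line</code> and <code>home_edge</code> are hypothetical helper names.</p>

```python
from statistics import fmean


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up illustration values, NOT the data from the post.
yards = [5, 15, 25, 35, 45, 55, 65, 75, 85, 95]
home = [6.1, 5.2, 4.6, 3.8, 3.0, 2.3, 1.5, 0.8, 0.1, -0.7]
away = [5.4, 4.7, 3.9, 3.2, 2.4, 1.6, 0.9, 0.2, -0.6, -1.3]

home_intercept, home_slope = fit_line(yards, home)
away_intercept, away_slope = fit_line(yards, away)


def home_edge(distance):
    """Fitted home-minus-away gap in expected points at a given distance."""
    home_fit = home_intercept + home_slope * distance
    away_fit = away_intercept + away_slope * distance
    return home_fit - away_fit


# With no interaction, the slopes match and the gap is the same everywhere.
slope_gap = home_slope - away_slope
print(f"slope difference:      {slope_gap:.4f}")
print(f"home edge at 5 yards:  {home_edge(5):.2f}")
print(f"home edge at 95 yards: {home_edge(95):.2f}")
```

<p>If an interaction were present, the slope difference would be clearly nonzero and the home edge would grow (or shrink) with distance instead of staying roughly constant.</p>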
The Final Model
<p>Let’s take a final look at the model created by this regression analysis.</p>
<p><img alt="Model Equations" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/362b968414eaa5a21f46de82b8b88f57/model_equations.jpg" style="width: 492px; height: 120px;" /></p>
<p>The equations indicate that if you start a drive on the road, you’ll be expected to score approximately 0.6 fewer points than you would if you were playing at home. Because there is no interaction term, the slopes are the same for both equations. The value of -0.075 means that for every yard you move away from the end zone, your expected points decrease by 0.075. So if you decide to punt the football away and get a net of 40 yards (the average in the Big Ten last year), this model indicates you’ll have saved yourself about 3 points on average.</p>
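<p>The punt arithmetic can be sketched in a few lines. Only the slope (-0.075) and the approximate home/away gap (0.6) come from the equations above; the function name is hypothetical, and the intercepts are left out on purpose because they cancel when you compare two field positions.</p>

```python
SLOPE = -0.075         # expected points per yard of distance to the end zone (from the post)
HOME_FIELD_EDGE = 0.6  # approximate home-minus-away gap in expected points (from the post)


def expected_points_change(yards_before, yards_after, slope=SLOPE):
    """Change in expected points when the distance to the end zone changes.

    The intercept cancels out, so the result is the same for home and away teams.
    """
    return slope * (yards_after - yards_before)


# A punt that nets 40 yards leaves the opponent 40 yards farther from the end zone.
# The starting distance (30) is arbitrary: only the 40-yard difference matters.
change_for_opponent = expected_points_change(yards_before=30, yards_after=70)
print(f"opponent's expected points change by {change_for_opponent:.2f}")
```

<p>Because the model is linear with no interaction term, the same 40-yard swing is worth the same number of points anywhere on the field, at home or on the road.</p>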
<p>Of course, that 3-point saving assumes the alternative to punting is turning the ball over on downs. But a third option exists: successfully converting on 4th down.</p>
<p>Will the reward of a successful conversion outweigh the risk of losing those 3 points you would gain by punting? That all depends on the probability of successfully converting on 4th down. And that’s exactly what I'll look at in my next post. Once we can determine the probability of converting on 4th down, we’ll be able to get some data-driven insights into what the correct decision is on 4th down. Stay tuned!</p>
Data Analysis, Fun Statistics, Regression Analysis, Statistics
Fri, 14 Aug 2015 14:21:00 +0000
http://blog.minitab.com/blog/the-statistics-game/big-ten-4th-down-calculator%3A-creating-a-model-for-expected-points
Kevin Rudy
High School Researchers: What Do We Do with All of this Data?
http://blog.minitab.com/blog/statistics-in-the-field/high-school-researchers-what-do-we-do-with-all-of-this-data
<p><em>by Colin Courchesne, guest blogger, representing his Governor's School research team. </em></p>
<p>High-level research opportunities for high school students are rare; however, that was just what the New Jersey Governor’s School of Engineering and Technology provided. </p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/64c1ec6f5eb944a2866e5a1c9b2e80e9/cup_149682_640.png" style="line-height: 20.7999992370605px; margin: 10px 15px; float: right; width: 200px; height: 212px;" /></p>
<p>Bringing together the best and brightest rising seniors from across the state, the Governor’s School, or GSET for short, tasks teams of students with completing a research project chosen from a myriad of engineering fields, ranging from biomedical engineering to, in our team's case, industrial engineering.</p>
<p>Tasked with analyzing, comparing, and simulating queue processes at Dunkin’ Donuts and Starbucks, our team of GSET scholars spent five days tirelessly collecting roughly 250 data points on each restaurant. Our data included how much time people spent waiting in line, what type of drinks customers ordered, and how much time they spent waiting for their drinks after ordering.</p>
<p><img alt="data collection interface" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ec87ea6790ddc0e8ddf576c7efd1e6e9/gov_school_1.png" style="line-height: 20.7999992370605px; width: 600px;" /><br />
<em>The students used a computerized interface to collect data about customers in two different coffee shops.</em></p>
<p>But once the data collection was over, we reached a sort of brick wall. What do we <em>do </em>with all this data? <span style="line-height: 1.6;">As research debutantes not well versed in the realm of statistics and data analysis, we had no idea how to proceed. </span></p>
<p><span style="line-height: 1.6;">Thankfully, the helping hand of our project mentor, engineer Brandon Theiss, guided us towards <a href="http://www.minitab.com/products/minitab">Minitab</a>.</span></p>
Getting Meaning Out of Our Data
<p>Our original, raw data told us nothing. In order to compare data between stores and create accurate process simulations, we needed a way to sort the data, determine descriptive statistics, and assign distributions; it is these very tools that Minitab offered. Getting started was both easy and intuitive.</p>
<p>First, we all managed to download Minitab 17 (thanks to <a href="http://it.minitab.com/products/minitab/free-trial.aspx">the 30-day trial</a>). Our team then went on to learn the ins and outs of Minitab through both <a href="http://www.minitab.com/support/videos/">instructional videos on YouTube</a> and <a href="http://support.minitab.com/minitab/17/">helpful written guides</a>, all of which are provided by Minitab. Less than an hour later, we were able to navigate the program with ease.</p>
<p>The nature of the simulations our team intended to create called for us to identify the arrival process for each store, the distribution of customer wait times in line at each restaurant, and the distributions of drink preparation time, broken down by restaurant and drink type. To input this information into our simulation, we also needed the parameters of each distribution. Such parameters ranged from alpha and beta values for Gamma distributions to means and standard deviations for Normal distributions.</p>
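<p>For readers curious about what such parameters look like, here is a rough sketch that estimates them by the method of moments. This is a simple stand-in, not necessarily the estimation method Minitab uses. It assumes alpha is the Gamma shape and beta is the scale (some texts define beta as a rate instead), and the preparation times are invented for illustration.</p>

```python
from statistics import fmean, stdev, variance


def gamma_moments(samples):
    """Method-of-moments estimates for a Gamma distribution.

    Assumes alpha = shape and beta = scale, so mean = alpha * beta
    and variance = alpha * beta ** 2.
    """
    m = fmean(samples)
    v = variance(samples)
    return m * m / v, v / m


def normal_moments(samples):
    """Sample mean and standard deviation for a Normal distribution."""
    return fmean(samples), stdev(samples)


# Invented drink preparation times in seconds, NOT the team's data.
prep_times = [62, 75, 48, 90, 55, 120, 67, 83, 71, 58, 102, 64]

alpha, beta = gamma_moments(prep_times)
mu, sigma = normal_moments(prep_times)
print(f"Gamma:  alpha (shape) = {alpha:.2f}, beta (scale) = {beta:.2f}")
print(f"Normal: mean = {mu:.2f}, standard deviation = {sigma:.2f}")
```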
<p>Thankfully, running the necessary hypothesis tests and calculating each of these parameters was simple. We first used the “<a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/overfit-those-skintight-jeans-fit-perfect-when-you-bought-them-but">Goodness of fit for Poisson</a>” test in order to analyze our arrival rates.</p>
All Necessary Information
<p>Rather than having to fiddle with equations and arrange cells like in Excel, Minitab quickly provided us with all necessary information, including our P-value to determine whether the distribution fit the data as well as parameters for shape and scale.</p>
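<p>As a rough idea of what a goodness-of-fit check for Poisson arrivals computes, here is a sketch of a chi-square statistic built by hand. It is an illustration under assumptions, not a description of Minitab's exact procedure: the arrival counts are invented, <code>poisson_gof</code> is a hypothetical name, and turning the statistic into a P-value still requires a chi-square table or a statistics package.</p>

```python
from collections import Counter
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k arrivals when the mean rate is lam."""
    return exp(-lam) * lam ** k / factorial(k)


def poisson_gof(counts, max_k):
    """Chi-square statistic comparing observed counts with a fitted Poisson.

    Categories are 0, 1, ..., max_k - 1 and a pooled "max_k or more" bin.
    """
    n = len(counts)
    lam = sum(counts) / n  # the Poisson mean is estimated from the data
    observed = Counter(min(c, max_k) for c in counts)
    chi_sq = 0.0
    for k in range(max_k + 1):
        if k < max_k:
            p = poisson_pmf(k, lam)
        else:
            p = 1.0 - sum(poisson_pmf(j, lam) for j in range(max_k))
        expected = n * p
        chi_sq += (observed.get(k, 0) - expected) ** 2 / expected
    df = (max_k + 1) - 1 - 1  # categories, minus 1, minus 1 estimated parameter
    return lam, chi_sq, df


# Invented arrivals per minute, NOT the team's data.
arrivals = [0, 1, 2, 1, 3, 2, 1, 0, 2, 4, 1, 2, 3, 1, 0, 2, 1, 1, 2, 3]
lam, chi_sq, df = poisson_gof(arrivals, max_k=4)
print(f"estimated rate = {lam:.2f}, chi-square = {chi_sq:.3f}, df = {df}")
```

<p>A small chi-square value relative to its degrees of freedom means the observed counts sit close to what a Poisson process would produce. With a sample this small, the expected counts in some bins are low, so a real analysis would want more data or wider bins.</p>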
<p>As for distributions for individual drink preparation times, the process was similarly simple. Using the “<a href="http://blog.minitab.com/blog/meredith-griffith/identifying-the-distribution-of-your-data">Individual Distribution Identification</a>” tool, Minitab ran a series of hypothesis tests, comparing our data against a total of 16 possible distributions. The software output graphs along with P-values and Anderson-Darling values for each distribution, allowing us to graphically and empirically determine the appropriateness of fit. </p>
<p><img alt="Probability Plot for Latte S" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/621113f0e0af63fcb51a67e84498e0a9/gov_school_2.png" style="width: 576px;" /></p>
<p>Within 3 hours, we had sorted and analyzed all of our data.</p>
<p>Not only was Minitab a fantastic tool for our analysis purposes, but the software also provided us with a graphical platform, a means by which to produce most of the graphs used in our research paper and presentation. Once we determined which distribution to use with what data, we used Minitab to output histograms with fitted data distributions for each set of data points. The ease of use for this feature served to save us time, as a series of simple clicks allowed us to output all 10 of our required histograms at the same time.</p>
<p><img alt="Histogram of Line Time S" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/d696fa8296af19f5efec2ef9815249fa/gov_school_3.png" style="width: 580px;" /></p>
<p>The same tools first used to analyze our data were then used to analyze the success of our simulations; we ran a Kolmogorov-Smirnov test to determine whether two sets of data (in this case, our observed data and the data output by our simulation) share a common distribution. Like most other features in Minitab, it was extremely easy to use and provided clear and immediate feedback as to the results of the test, both graphically and through the requisite critical and KS values.</p>
<img alt="Empirical CDF of Iced" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/687a0c85c34e919d3f793ffb9d278b2c/gov_school_4.png" style="float: left; width: 400px; height: 267px;" />
<img alt="Simulated vs Actual " src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/4f7ad45d38e9646b3d585b3907be83df/gov_school_5.png" style="width: 400px; height: 267px;" />
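<p>For a sense of what the KS and critical values represent, here is a minimal two-sample sketch. The KS value is the largest gap between the two empirical CDFs, and the critical value uses the common large-sample approximation. The data are invented, and the function names are hypothetical rather than anything from Minitab.</p>

```python
from bisect import bisect_right
from math import log, sqrt


def ks_two_sample(a, b):
    """Largest vertical gap between the empirical CDFs of two samples."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)
    d = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / n
        cdf_b = bisect_right(b_sorted, x) / m
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_critical(n, m, alpha=0.05):
    """Large-sample approximation to the two-sample critical value."""
    c = sqrt(-log(alpha / 2) / 2)
    return c * sqrt((n + m) / (n * m))


# Invented wait times in seconds, NOT the team's observed or simulated data.
observed = [35, 42, 58, 61, 47, 73, 55, 39, 66, 50]
simulated = [38, 45, 52, 64, 49, 70, 57, 41, 60, 53]

d = ks_two_sample(observed, simulated)
crit = ks_critical(len(observed), len(simulated))
print(f"KS value = {d:.3f}, critical value = {crit:.3f}")
print("distributions differ" if d > crit else "no evidence of a difference")
```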
<p><span style="line-height: 1.6;">Research isn’t always fun. It’s often long, tedious, and amounts to nothing. Thankfully, that wasn’t our case. Using Minitab, our entire analysis process was simple and painless. The software was easy to learn and was able to run any test quickly and efficiently, providing us with both empirical and graphical evidence of the results as well as high-quality graphs which were used throughout our project. It really was a pleasure to work with.</span></p>
<p><img alt="GSET Coff(IE) Team" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/13f3ba56ecd0b8f0c6db49e5f8f08a5c/gov_school_6_w1024.png" style="width: 500px; " /></p>
<p><em>—The GSET COFF[IE] Team, whose members were Kenneth Acquah, Colin Courchesne, Sheela Hanagal, Kenneth Li, and Caroline Potts. The team was mentored by Juilee Malavade and Brandon Theiss, </em>PE<em>. Photo courtesy Colin Courchesne. </em></p>
<p><strong>About the Guest Blogger:</strong></p>
<p><i>Colin Courchesne was a scholar in the 2015 New Jersey Governor's School of Engineering and Technology, </i><em>a summer program for high-achieving high school students. Students in the program complete a set of challenging courses while working in small groups on real-world research and design projects that relate to the field of engineering. Governor’s School students are mentored by professional engineers as well as Rutgers University honors students and professors, and they often work with companies and organizations to solve real engineering problems.</em></p>
<p> </p>
<p><strong>Would you like to publish a guest post on the Minitab Blog? Contact <a href="mailto:publicrelations@minitab.com?subject=Guest%20Blogger">publicrelations@minitab.com</a>.</strong></p>
Data Analysis, Fun Statistics, Statistics in the News
Wed, 05 Aug 2015 12:00:00 +0000
http://blog.minitab.com/blog/statistics-in-the-field/high-school-researchers-what-do-we-do-with-all-of-this-data
Guest Blogger
10 Statistical Terms Designed to Confuse Non-Statisticians
http://blog.minitab.com/blog/understanding-statistics/10-statistical-terms-designed-to-confuse-non-statisticians
<p>Statisticians say the darndest things. At least, that's how it can seem if you're not well-versed in statistics. </p>
<p>When I began studying statistics, I approached it as a language. I quickly noticed that compared to other disciplines, statistics has some unique problems with terminology, problems that don't affect most scientific and academic specialties. </p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/3e1d06ea09dcdb91141d89604019b361/confused_blockhead.jpg" style="margin: 10px 15px; float: right; width: 300px; height: 242px;" />For example, dairy science has a highly specialized vocabulary, which I picked up when I was an editor for <a href="http://www.cas.psu.edu" target="_blank">Penn State's College of Agricultural Sciences</a>. I found the jargon fascinating, but not particularly confusing to learn. Why? Because words like <a href="https://en.wikivet.net/Ruminant_Stomach_-_Anatomy_%26_Physiology" target="_blank">"rumen" and "abomasum" and "omasum"</a> simply don't turn up in common parlance. They have very specific meaning, and there's little chance of misinterpreting them. </p>
<p>Now open up a statistics text and flip to the glossary. There are plenty of statistics-specific terms, but you're going to see a lot of very common words as well. The problem is that in statistics, these common words don't necessarily mean what they do <em>outside </em>statistics.</p>
<p>And that means that if you're not well versed in statistics, or even if it's just been a while since you thought about it, understanding statistical results—whether it's a research report on the news or an analysis done by a co-worker—can be a real challenge. Sometimes it seems like the language of statistics was <em>designed </em>to be confusing. </p>
<p>That's one of the reasons we incorporated <a href="http://www.minitab.com/products/minitab/assistant">The Assistant</a> into Minitab Statistical Software. This interactive menu guides you through your analysis and presents your results without ambiguity, making them easy to interpret if you aren't a statistician, and making them easy to share if you are one. </p>
<p>Here are 10 common words that are also routinely used in statistics. Those of us who are practicing data analysis and sharing the results with others need to keep in mind the differences between what these words mean to statisticians, and what they mean to everyone else. </p>
1. <a href="http://blog.minitab.com/blog/the-stats-cat/sample-size-statistical-power-and-the-revenge-of-the-zombie-salmon-the-stats-cat"><span style="line-height: 1.6;">Significant </span></a>
<p>When most people say something is "significant," they mean it's important and worth your attention. But for statisticians, a significant result is one that would be unlikely to arise from chance alone. Statisticians know that on a practical level, significant results often have no importance at all. This distinction between practical and statistical significance is easy for people to overlook.</p>
2. <a href="http://blog.minitab.com/blog/quality-data-analysis-and-statistics/assumptions-and-normality">Normal</a>
<p>Normally, people who say something is "normal" mean that it's ordinary or commonplace. We call a temperature of 98.6 degrees Fahrenheit "normal." What's more, when something isn't "normal," it often carries negative connotations: "That knocking from my car's engine isn't normal." But to statisticians, data is “normal” when it follows the familiar bell-shaped curve, and <a href="http://blog.minitab.com/blog/understanding-statistics-and-its-application/what-should-i-do-if-my-data-is-not-normal-v2">there's nothing <em>wrong </em>with data that isn't normal</a>. Still, it's easy for the uninitiated to conflate "nonnormal data" with "bad data." </p>
3. <span><a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-tutorial-and-examples">Regression</a></span>
<p>In everyday usage, regression means shrinkage or backwards movement. When the dog you're training has a bad day after a few positive ones, you might say his behavior regressed. Unless you're a statistician, you wouldn't immediately think "regression" refers to predicting an output variable based on input variables.</p>
4. <a href="http://blog.minitab.com/blog/statistics-for-lean-six-sigma/the-non-parametric-economy-what-does-average-actually-mean">Average</a>
<p>In statistics, the arithmetic average (or mean) is the sum of the observations divided by the number of observations. When most people hear and say the word "average," they're not thinking about a mathematical value but rather a qualitative judgment, meaning “so-so,” "normal" or "fair." </p>
5. <a href="http://blog.minitab.com/blog/the-stats-cat/understanding-type-1-and-type-2-errors-from-the-feline-perspective-all-mistakes-are-not-equal">Error</a>
<p>Error is a measure of an estimate’s precision—if you're a statistician. To everyone else, errors are just mistakes. </p>
6. <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/gage-linearity-and-bias%3A-wake-up-and-smell-your-measuring-system">Bias</a>
<p>In statistics, bias refers to the accuracy of measurements taken by a particular tool or gauge compared to a reference value. In everyday usage, however, bias refers to preconceptions and prejudices that affect a person's view of the world. </p>
7. <a href="http://blog.minitab.com/blog/the-statistics-game/checking-the-assumption-of-constant-variance-in-regression-analyses">Residual</a>
<p>For most people who aren't statisticians, residuals is a fancy word for leftovers, not the difference between observed and fitted values.</p>
8. <a href="http://blog.minitab.com/blog/starting-out-with-statistical-software/how-powerful-am-i-power-and-sample-size-in-minitab">Power</a>
<p>Usually we talk about power in terms of impact and control. <em>Influence</em>. So the fact that a statistical test can be powerful but not influential seems contradictory, unless you already know it refers to the probability of finding a...um...<em>significant </em>effect when one truly exists. </p>
9. <a href="http://blog.minitab.com/blog/michelle-paret/evaluating-statistical-interactions-with-ketchup-and-soy-sauce">Interaction</a>
<p>People use this word to talk about their communications with others. For statisticians, it means the effects of one factor are dependent on another.</p>
10. <a href="http://blog.minitab.com/blog/michelle-paret/alphas-p-values-confidence-intervals-oh-my">Confidence </a>
<p>In statistics, the confidence interval is a range of values, derived from a sample, that is likely to hold the true value of a population parameter. The confidence <em>level </em>is the percentage of such intervals that would contain the population parameter if you sampled the population many times.</p>
<p>Outside of its technical meaning in statistics, the word "confidence" carries an emotional charge that can instantly create unintended implications. All too often, people interpret statistical confidence as meaning the researchers really believe in their results. </p>
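<p>The technical meaning of confidence level can be checked with a small simulation. The sketch below draws many samples from a population whose mean we know, builds a 95% interval from each, and counts how often the interval captures the true mean. The population values, sample size, and the simplifying assumption of a known standard deviation are all choices made for illustration.</p>

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def coverage(true_mean=50.0, true_sd=10.0, n=30, trials=2000, level=0.95, seed=1):
    """Fraction of simulated confidence intervals that contain the true mean.

    Assumes the population standard deviation is known, so a z-interval applies.
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * true_sd / sqrt(n)
    hits = 0
    for _ in range(trials):
        sample_mean = fmean(rng.gauss(true_mean, true_sd) for _ in range(n))
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / trials


rate = coverage()
print(f"share of intervals containing the true mean: {rate:.3f}")
```

<p>The share lands near 0.95, which is all a 95% confidence level promises. It says nothing about how strongly the researchers believe in any one result.</p>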
<p>These 10 terms are <span style="line-height: 1.6;">just a few of the most confusing double-entendres found in the statistical world. Terms like sample, assumptions, stability, capability, success, failure, risk, representative, and uncertainty can all mean different things to the world outside our small statistical circle. </span></p>
<p><span style="line-height: 1.6;">Making an effort to help the people we communicate with appreciate the technical meanings of these terms as we use them would be an easy way to begin promoting higher levels of statistical literacy. </span></p>
<p><span style="line-height: 1.6;">What do you think the most confusing terms in statistics are? </span></p>
Data Analysis, Statistics Help, Stats
Wed, 29 Jul 2015 12:00:00 +0000
http://blog.minitab.com/blog/understanding-statistics/10-statistical-terms-designed-to-confuse-non-statisticians
Eston Martz
Lessons from a Statistical Analysis Gone Wrong, Part 2
http://blog.minitab.com/blog/understanding-statistics/lessons-from-a-statistical-analysis-gone-wrong-part-2
<p><a href="http://blog.minitab.com/blog/understanding-statistics/lessons-from-a-statistical-analysis-gone-wrong-part-1">Last time</a>, I told you how I had double-checked the analysis in a post that involved running the Johnson transformation on a set of data before doing normal capability analysis on it. <span style="line-height: 1.6;">A reader asked why the transformation didn't work on the data when you applied it outside of the capability analysis. <img alt="DOH!" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ec60d428b5c671da6d672dbdf0812007/facepalm.jpg" style="margin: 10px 15px; float: right; width: 200px; height: 200px;" /></span></p>
<p>I hadn't tried transforming the data that way, but if the transformation worked when performed as part of the capability analysis, it should work when applied outside of that analysis, too.</p>
<p><span style="line-height: 1.6;">But the reader was correct. The transformation failed when applied by itself. </span></p>
What Happened?
<p><span style="line-height: 1.6;">When I'd performed the capability analysis with the Johnson transformation option selected, the analysis seemed fine to me. It </span><em style="line-height: 1.6;">had </em><span style="line-height: 1.6;">been a while since I'd done a capability analysis, but the graph looked okay. </span></p>
<p>Then I remembered one of my first Minitab instructors, who told us "Always look at the session window." So I did. <span style="line-height: 1.6;">And there it was: </span></p>
<p style="margin-left: 40px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/456cb0d7fca6ccb5be1c18043bb0b987/no_transform_capa.gif" style="width: 500px; height: 108px;" /></p>
<p>Yes, the process capability analysis had been performed...on data that hadn't been transformed. I missed it. And it wasn't until a reader tried running the analysis a different way that my oversight was revealed. </p>
Missing the First Warning
<p>While Minitab <em>does</em> warn you that the transformation failed, you need to check the session window to see it. <span style="line-height: 1.6;">I've used Minitab and other statistical software packages for some time now, and I </span><em style="line-height: 1.6;">know</em><span style="line-height: 1.6;"> that it's important to look at </span><em style="line-height: 1.6;">all</em><span style="line-height: 1.6;"> of the output. </span></p>
<p><span style="line-height: 1.6;">In this case, I only looked at the graph. </span><span style="line-height: 1.6;">Graphs tell you a lot, but you shouldn't rely on graphs alone. I knew this, and I usually </span><em style="line-height: 1.6;">do</em><span style="line-height: 1.6;"> check Minitab's session window...but in this case, I didn't. </span></p>
Knowing What to Look For
<p>While I <em>should</em> have checked the session window, there's another reason I missed the fact that the transformation hadn't occurred: when it comes to capability analysis, I was out of practice. </p>
<p>Like most people who use Minitab, I have a wide range of responsibilities. Some involve statistics and data analysis, and many do not. I do some types of analysis far more frequently than others. Capability is one that I hadn't performed in a while. </p>
<p>Given the time that had passed since I last did a capability analysis with transformed data, I should have been more thorough in reviewing the output, shown here: </p>
<p><img alt="" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/6268cce550e1f97de81d0889da797814/belmont5.png" style="width: 400px; height: 300px;" /></p>
<p><span style="line-height: 18.9090900421143px;">My mistake seems obvious now: </span>this graph contains a huge warning that the transformation failed. However, the warning lies not in what you see above, but instead in what this graph does <em>not </em>show.</p>
<p>For comparison, here's a capability report that involves a <em>successful</em> transformation: </p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/3ed297f37062b709b4fcd96b3f332af8/successful_transform_graph.jpg" style="width: 463px; height: 351px;" /></p>
<p>Yeah...when you see the transformation equation in the subhead of this graph, not to mention the words "After Transformation" in the data table, their absence in the earlier graph is very conspicuous. </p>
<p>Thus, I missed my <em>second </em>opportunity to realize that the transformation had failed. <span style="line-height: 1.6;">Unfortunately, that meant that the analysis of the Triple Crown data wasn't valid. I felt like a fool for missing something that seems </span><em style="line-height: 1.6;">so obvious </em><span style="line-height: 1.6;">in hindsight. </span></p>
<p><span style="line-height: 1.6;">You can bet that I'll remember to check the session window more vigilantly, and that I'll be quite a bit more cautious when performing analyses that I haven't done in a while. </span></p>
<p>In fact, after realizing my mistake, I tried doing this analysis using the capability tools in <a href="http://www.minitab.com/products/minitab/assistant">the Assistant</a>, which duly notified me that the analysis was suspect. Would that I had thought to use the Assistant, at least to double-check my results, in the first place!</p>
Owning Up
<p>I removed the post about American Pharoah from the blog. <span style="line-height: 1.6;">Then I wrote to the person who had caught my error, and expressed my gratitude—and chagrin—that he had noticed it. </span></p>
<p>But it turned out I had even <a href="http://blog.minitab.com/blog/understanding-statistics/lessons-from-a-statistical-analysis-gone-wrong-part-3">more lessons to learn from this failed analysis</a>. </p>
<p style="font-size:9px;"><em>Photo by <a href="https://commons.wikimedia.org/wiki/File:Paris_Tuileries_Garden_Facepalm_statue.jpg">Alex E. Proimos</a>, used under Creative Commons 2.0. </em></p>
Data Analysis, Statistics, Statistics Help
Wed, 15 Jul 2015 12:00:00 +0000
http://blog.minitab.com/blog/understanding-statistics/lessons-from-a-statistical-analysis-gone-wrong-part-2
Eston Martz
Lessons from a Statistical Analysis Gone Wrong, part 1
http://blog.minitab.com/blog/understanding-statistics/lessons-from-a-statistical-analysis-gone-wrong-part-3-v2
<p style="line-height: 18.9090900421143px;">I don't like the taste of crow. That's a shame, because I'm about to eat a huge helping of it. </p>
<p style="line-height: 18.9090900421143px;">I'm going to tell you how I messed up an analysis. But in the process, I learned some new lessons and was reminded of some older ones I should remember to apply more carefully. </p>
This Failure Starts in a Victory
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/3e3a70cd6b6094eda21615f6eee14c0f/pharoah.jpg" style="line-height: 18.9090900421143px; border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 280px; height: 296px;" /></p>
<p style="line-height: 18.9090900421143px;"><span style="line-height: 18.9090900421143px;">My mistake originated in the 2015 Triple Crown victory of American Pharoah. I'm no racing enthusiast, but I knew this horse had ended almost four decades of Triple Crown disappointments, and that was exciting. </span><span style="line-height: 18.9090900421143px;">I'd never seen a </span><a href="http://blog.minitab.com/blog/the-statistics-game/triple-crown-odds-ill-have-another" style="line-height: 18.9090900421143px;">Triple Crown</a><span style="line-height: 18.9090900421143px;"> won before. It hadn't happened since 1978. </span></p>
<p style="line-height: 18.9090900421143px;">So when an acquaintance asked to contribute a guest post to the Minitab Blog that compared American Pharoah with previous Triple Crown contenders, including the record-shattering Secretariat, who took the Triple Crown in 1973, I eagerly accepted. </p>
<p style="line-height: 18.9090900421143px;">In reviewing the post, I checked and replicated the contributor's analysis. It was a fun post, and I was excited about publishing it. But a few days after it went live, I had to remove it: the analysis was not acceptable. </p>
<p style="line-height: 18.9090900421143px;">To explain how I made my mistake, I'll need to review that analysis. </p>
Comparing American Pharoah and Secretariat
<p style="line-height: 18.9090900421143px;"><span style="line-height: 18.9090900421143px;">In the post, we used Minitab's </span><a href="http://www.minitab.com/products/minitab/" style="line-height: 18.9090900421143px;">statistical software</a><span style="line-height: 18.9090900421143px;"> to compare Secretariat's performance to other winners of Triple Crown races. </span></p>
<p style="line-height: 18.9090900421143px;">Since 1926, the Belmont Stakes has been the longest of the three races at 1.5 miles. The analysis began by charting 89 years of winning horse times:</p>
<p style="line-height: 18.9090900421143px;"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ad64da996c235ee5ff8cb4c3cef66292/belmont1.png" style="width: 500px; height: 334px;" /></p>
<p style="line-height: 18.9090900421143px;"><span style="line-height: 1.6;">Only two data points were outside of the I-chart's control limits:</span></p>
<ul style="line-height: 18.9090900421143px;">
<li>The fastest winner, Secretariat's 1973 time of 144 seconds</li>
<li>The slowest winner, High Echelon's 1970 time of 154 seconds</li>
</ul>
<p style="line-height: 18.9090900421143px;">The average winning time was 148.81 seconds, which Secretariat beat by more than 4 seconds. </p>
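<p style="line-height: 18.9090900421143px;">For readers unfamiliar with I-charts, the control limits are conventionally set three estimated standard deviations from the center line, with the standard deviation estimated from the average moving range. The sketch below shows that standard calculation on invented times. These are not the Belmont data, and Minitab offers other estimation options that this sketch ignores.</p>

```python
from statistics import fmean


def i_chart_limits(values):
    """Center line and 3-sigma control limits for an individuals chart."""
    center = fmean(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    sigma_hat = fmean(moving_ranges) / 1.128  # d2 constant for ranges of 2 points
    return center - 3 * sigma_hat, center, center + 3 * sigma_hat


# Invented winning times in seconds, NOT the Belmont Stakes data.
times = [148.6, 149.0, 148.8, 149.2, 148.4, 148.9,
         144.0, 149.1, 148.7, 148.9, 149.3, 148.5]

lcl, center, ucl = i_chart_limits(times)
out_of_control = [t for t in times if t < lcl or t > ucl]
print(f"LCL = {lcl:.2f}, center = {center:.2f}, UCL = {ucl:.2f}")
print("points outside the limits:", out_of_control)
```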
Applying a Capability Approach to the Race Data
<p style="line-height: 18.9090900421143px;">Next, the analysis approached the data from a capability perspective: Secretariat's time was used as a lower spec limit, and the analysis sought to assess the probability of another horse beating that time. </p>
<p style="line-height: 18.9090900421143px;">The way you assess capability depends on the distribution of your data, and a normality test in Minitab showed this data to be nonnormal.</p>
<p style="line-height: 18.9090900421143px;"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/89d338659c8cace002fe777633a238cf/belmont2.png" style="width: 500px; height: 334px;" /></p>
<p style="line-height: 18.9090900421143px;"><span style="line-height: 18.9090900421143px;">When you run Minitab's normal capability analysis, you can elect to apply the Johnson transformation, which can automatically transform many nonnormal distributions before the capability analysis is performed. This is an extremely convenient feature, but here's where I made my mistake. </span></p>
<p style="line-height: 18.9090900421143px;">Running the capability analysis with Johnson transformation, using Secretariat's 144-second time as a lower spec limit, produced the following output:</p>
<p style="line-height: 18.9090900421143px;"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/0f3d2967b87743714821fa47e6bd999d/belmont4.png" style="width: 500px; height: 375px;" /></p>
<p style="line-height: 18.9090900421143px;">The analysis found a 0.36% chance of any horse beating Secretariat's time, making it very unlikely indeed. </p>
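<p>Once data are on a scale where a normal model is reasonable, the chance of falling below a lower spec limit is simply the normal cumulative probability at that limit. The sketch below shows only that last step, under a plain normal model used purely for illustration. The mean is the 148.81 seconds quoted above, but the standard deviation of 1.8 is an invented value, so the printed figure is not the result from the Minitab output.</p>

```python
from statistics import NormalDist

def prob_below_lower_spec(mean, std_dev, lower_spec):
    """P(X < lower_spec) when X is modeled as Normal(mean, std_dev)."""
    return NormalDist(mu=mean, sigma=std_dev).cdf(lower_spec)

# std_dev=1.8 is a hypothetical value chosen only to make the example run.
p = prob_below_lower_spec(mean=148.81, std_dev=1.8, lower_spec=144.0)
print(f"Estimated chance of a time below the lower spec limit: {p:.2%}")
```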
<p>The same method was applied to Kentucky Derby and Preakness data. </p>
<p style="line-height: 18.9090900421143px;"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/6268cce550e1f97de81d0889da797814/belmont5.png" style="width: 500px; height: 375px;" /></p>
<p style="line-height: 18.9090900421143px;">We found a 5.54% chance of a horse beating Secretariat's Kentucky Derby time.</p>
<p style="line-height: 18.9090900421143px;"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/21fda483b790f76e051ddb22359cbfe2/belmont6.png" style="width: 500px; height: 375px;" /></p>
<p style="line-height: 18.9090900421143px;">We found a 3.5% probability of a horse beating Secretariat's Preakness time.</p>
<p style="line-height: 18.9090900421143px;">Despite the b<span style="line-height: 18.9090900421143px;">illions of dollars and countless hours of effort spent trying to make thoroughbred horses faster over the past 43 years,</span><span style="line-height: 18.9090900421143px;"> no one has yet beaten “Big Red,” as Secretariat was known. So the analysis indicated that American Pharoah may be a great horse, but he</span><span style="line-height: 1.6;"> is no Secretariat. </span></p>
<p style="line-height: 18.9090900421143px;"><span style="line-height: 1.6;">That conclusion may well be true...but it turns out we can't use <em>this</em> analysis to make that assertion. </span></p>
My Mistake Is Discovered, and the Analysis Unravels
<p style="line-height: 18.9090900421143px;">Here's where I start chewing those crow feathers. A day or so after sharing the post about American Pharoah, a reader sent the following comment: </p>
<p style="line-height: 18.9090900421143px; margin-left: 40px;"><em>Why does Minitab allow a Johnson Transformation on this data when using <strong>Quality Tools > Capability Analysis > Normal > Transform</strong>, but does not allow a transformation when using <strong>Quality Tools > Johnson Transformation</strong>? Or could I be doing something wrong? </em></p>
<p style="line-height: 18.9090900421143px;">Interesting question. Honestly, i<span style="line-height: 18.9090900421143px;">t hadn't even occurred to me to try to run the Johnson transformation on the data by itself. </span></p>
<p style="line-height: 18.9090900421143px;"><span style="line-height: 18.9090900421143px;">But if the Johnson Transformation worked when performed as part of the capability analysis, it ought to work when applied outside of that analysis, too. </span></p>
<p style="line-height: 18.9090900421143px;"><span style="line-height: 18.9090900421143px;">I suspected the person who asked th</span><span style="line-height: 18.9090900421143px;">is question might have just checked a wrong option in the dialog box. </span>So I tried running the Johnson Transformation on the data by itself.</p>
<p style="line-height: 18.9090900421143px;">The following <span style="line-height: 18.9090900421143px;">note appeared in Minitab's session window: </span></p>
<p style="line-height: 18.9090900421143px;"><img alt="no transformation is made" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/2892f3daa1549df56defbdd4fe9dc48a/no_transformation.gif" style="line-height: 18.9090900421143px; width: 500px; height: 55px;" /></p>
<p style="line-height: 18.9090900421143px;">Uh oh. </p>
<p style="line-height: 18.9090900421143px;">Our reader <em>hadn't</em> done anything wrong, but it was looking like I made an error somewhere. But where?</p>
<p style="line-height: 18.9090900421143px;">I'll show you exactly where I made my mistake in <a href="http://blog.minitab.com/blog/understanding-statistics/lessons-from-a-statistical-analysis-gone-wrong-part-2">my next post.</a> </p>
<p style="font-size: 9px;">Photo of American Pharoah used under Creative Commons license 2.0. Source: Maryland GovPics <a href="https://www.flickr.com/people/64018555@N03" target="_blank">https://www.flickr.com/people/64018555@N03</a> </p>
Data Analysis, Fun Statistics, Hypothesis Testing, Statistics, Statistics in the News
Tue, 14 Jul 2015 12:00:00 +0000
http://blog.minitab.com/blog/understanding-statistics/lessons-from-a-statistical-analysis-gone-wrong-part-3-v2
Eston Martz
T-tests for Speed Tests: How Fast Is Internet Speed?
http://blog.minitab.com/blog/quality-data-analysis-and-statistics/t-tests-for-speed-tests%3A-how-fast-is-internet-speed
<p><span style="line-height: 1.6;">Every now and then I’ll test my Internet speed at home using such sites as </span><a href="http://speedtest.comcast.net/" style="line-height: 1.6;" target="_blank">http://speedtest.comcast.net</a><span style="line-height: 1.6;"> or </span><a href="http://www.att.com/speedtest/" style="line-height: 1.6;" target="_blank">http://www.att.com/speedtest/</a><span style="line-height: 1.6;">. My need to perform these tests could stem from the cool-looking interfaces they employ on their site, as they display the results using analog speedometers and RPM meters. They could also stem from the validation that I need in "getting what I am paying for," although I realize that there are other factors that determine what Internet speed you ultimately end up with when you browse the Web.<img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/9d50fc6279b0660301b966bc0214296b/speed.jpg" style="margin: 10px 15px; width: 250px; height: 250px; float: right;" /></span></p>
<p>Recently I started thinking about the distribution of these speeds. If I were to run enough tests, would these speeds be <a href="http://blog.minitab.com/blog/statistics-in-the-field/pencils-and-plots%3A-assessing-the-normality-of-data">normally distributed</a>?</p>
<p>When performing an Internet speed test, you are given an estimated download and upload speed. The download speed is the rate at which data travels from the Internet to your device, and the upload speed is the rate at which data travels from your device to the Internet. I was also curious as to whether the population means of these speeds were statistically different.</p>
Is the Data Normally Distributed?
<p>I ran 30 speed tests from my office at Minitab and recorded the download and upload data in a <a href="http://www.minitab.com/products/minitab">Minitab Statistical Software</a> worksheet. Here is a sample of the data:</p>
<p style="text-align: center;"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/74e1267bfffe19349b587a71e30aff82/capture.PNG" style="border-width: 0px; border-style: solid; width: 183px; height: 432px;" /></p>
<p>I went to <strong>Stat > Basic Statistics > Normality Test</strong>. Here are the probability plots for download and upload speed.</p>
<p style="text-align: center;"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/b2d366826e2a3e1952da44adb0e7074b/pic1.jpg" style="border-width: 0px; border-style: solid; width: 400px; height: 233px;" /><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/c514ed9dd880d9d9f272cd2c9d8ba8d4/pic2.jpg" style="border-width: 0px; border-style: solid; width: 400px; height: 233px; line-height: 20.79px;" /></p>
<p>I’ll compare each p-value to an alpha level of 0.05. Both probability plots show p-values greater than alpha, so we do not have enough evidence to reject the null hypothesis, which is that the data follow a normal distribution. We can therefore proceed under the assumption of normality.</p>
Is There a Difference Between Upload and Download Speed?
<p>Let’s find out if there was a statistical difference between the download speed and the upload speed.</p>
<p>Go to <strong>Stat > Basic Statistics > 2-Sample t</strong>:</p>
<p style="text-align: center;"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/8c9ae281e80618ed8586faf8559b4e85/pic3.jpg" style="border-width: 0px; border-style: solid; width: 424px; height: 296px;" /></p>
<p>I chose “Each Sample is in its own column” under the dropdown, and entered in the column for download speed for <strong>Sample 1</strong> and upload speed for <strong>Sample 2</strong>.</p>
<p>If you click on Options you’ll see a checkbox for "Assume Equal Variances." Checking this box will result in a slightly more powerful 2-Sample t test. But how do I know if the variances are equal or not? By using a quick test in Minitab! </p>
<p>I cancelled out of the 2-Sample t dialog window and quickly ran an Equal Variances test (<strong>Stat > Basic Statistics > 2 Variances</strong>) and received these results:</p>
<p style="text-align: center; margin-left: 40px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/bf476aa49893cf093c966f74362f2767/pic6.PNG" style="width: 475px; height: 389px;" /></p>
<p>Given that our p-value is greater than an alpha of 0.05, we don’t have enough evidence to say that the two variances are statistically different. Therefore, we can go back to the 2-Sample t test and check the box for "Assume Equal Variances."</p>
<p style="text-align: center;"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/8f6dfdaa8f0840b36fa3f0baebd2c04c/pic4.png" style="border-width: 0px; border-style: solid; width: 370px; height: 242px;" /></p>
<p>Here's the output from my 2-Sample t-test:</p>
<p style="text-align: center; margin-left: 40px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/a0647a7251accef8e0875769f5730630/pic5.PNG" style="width: 526px; height: 196px;" /></p>
<p><span style="line-height: 1.6;">Since our p-value is less than 0.05, we can reject the null hypothesis (that both means are the same) and say that the population means for download and upload speed are statistically different.</span></p>
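<p>To show what the equal-variances 2-sample t-test computes, here is a minimal sketch of the pooled t statistic. The eight speed values are hypothetical, not the 30 measurements from my worksheet, and the critical value of 2.447 is the two-sided 0.05 value for 6 degrees of freedom taken from a standard t table.</p>

```python
from math import sqrt
from statistics import mean, variance

def pooled_two_sample_t(sample1, sample2):
    """Return (t statistic, degrees of freedom) assuming equal variances."""
    n1, n2 = len(sample1), len(sample2)
    df = n1 + n2 - 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / df
    std_err = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(sample1) - mean(sample2)) / std_err, df

# Hypothetical speeds in Mbps (illustration only).
download = [10.0, 12.0, 11.0, 13.0]
upload = [14.0, 15.0, 16.0, 15.0]

t_stat, df = pooled_two_sample_t(download, upload)
T_CRITICAL_DF6 = 2.447  # two-sided, alpha = 0.05, 6 degrees of freedom
reject_null = abs(t_stat) > T_CRITICAL_DF6
print(f"t = {t_stat:.3f}, df = {df}, reject H0: {reject_null}")
```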
Vrrrrooooooooom!
<p>I was curious as to why the upload speeds were higher than the download speeds during my testing. Whenever I’ve tested speeds at my house, I’ve always seen the reverse.</p>
<p>I asked someone here at Minitab who is well versed in network setup, and he said that there could have been more bandwidth consumption from my coworkers than normal at the time of data collection. This extra consumption can push the download speeds below the upload speeds. He also said that the nature of how the Internet is configured at a company can be a contributing factor as well. </p>
<p><span style="line-height: 1.6;">If you were given an expected download rate by your cable company, you could add to this experiment by performing a <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-create-a-graphical-version-of-the-1-sample-t-test-in-minitab">1-Sample t-test</a>. The expected download rate would serve as your hypothesized mean. You would then be able to perform a hypothesis test to see if your mean is statistically different from your hypothesized mean. </span></p>
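<p>A 1-sample t-test follows the same pattern with a single column of data. In this minimal sketch, both the measured speeds and the advertised rate of 55 Mbps are invented for illustration, and 2.776 is the two-sided 0.05 critical value for 4 degrees of freedom from a standard t table.</p>

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, hypothesized_mean):
    """Return (t statistic, degrees of freedom) for a 1-sample t-test."""
    n = len(sample)
    std_err = stdev(sample) / sqrt(n)
    return (mean(sample) - hypothesized_mean) / std_err, n - 1

# Hypothetical download speeds in Mbps and a hypothetical advertised rate.
speeds = [48.0, 52.0, 50.0, 47.0, 53.0]
advertised_rate = 55.0

t_one, df_one = one_sample_t(speeds, advertised_rate)
differs_from_advertised = abs(t_one) > 2.776  # two-sided, alpha = 0.05, 4 df
print(f"t = {t_one:.3f}, df = {df_one}, differs: {differs_from_advertised}")
```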
<p><span style="line-height: 1.6;">If you find that you're not getting the speeds you wanted, I wouldn't start running around with pitchforks just yet. According to </span><a href="http://www.cnet.com/how-to/how-to-find-a-reliable-network-speed-test/" target="_blank">http://www.cnet.com/how-to/how-to-find-a-reliable-network-speed-test/ </a>, accuracy and consistency in speeds may depend on what online speed test you are using. <span style="line-height: 1.6;">But comparing the different speed testing tools is an analysis for another day! </span></p>
Data Analysis, Fun Statistics, Statistics, Statistics Help
Mon, 13 Jul 2015 12:00:00 +0000
http://blog.minitab.com/blog/quality-data-analysis-and-statistics/t-tests-for-speed-tests%3A-how-fast-is-internet-speed
Andy Cheshire
Pencils and Plots: Assessing the Normality of Data
http://blog.minitab.com/blog/statistics-in-the-field/pencils-and-plots%3A-assessing-the-normality-of-data
<p><em><span style="line-height: 1.6;">By Matthew Barsalou, guest blogger. </span></em></p>
<p>Many statistical tests assume the data being tested came from a normal distribution. Violating the assumption of normality can result in incorrect conclusions. For example, a Z test may indicate a new process is more efficient than an older process when this is not true. This could result in a capital investment for equipment that actually results in higher costs in the long run.</p>
<p>Statistical Process Control (SPC) requires either normally distributed data or data that have been transformed to approximate normality. It would be very risky to monitor a process with SPC charts created with data that violated the assumption of normality.</p>
<p>What can we do if the assumption of normality is critical to so many statistical methods? We can construct a probability plot to test this assumption.</p>
<p>Those of us who are a bit old-fashioned can construct a probability plot by hand, by plotting the ordered values (X<sub>j</sub>) against the observed cumulative frequency, (j – 0.5)/n. Using the numbers 16, 21, 20, 19, 18 and 15, we would construct a normal probability plot by first creating the table shown below.</p>
<table align="center">
<tr><th>j</th><th>X<sub>j</sub></th><th>(j – 0.5)/6</th></tr>
<tr><td align="center">1</td><td align="center">15</td><td align="center">0.083</td></tr>
<tr><td align="center">2</td><td align="center">16</td><td align="center">0.250</td></tr>
<tr><td align="center">3</td><td align="center">18</td><td align="center">0.417</td></tr>
<tr><td align="center">4</td><td align="center">19</td><td align="center">0.583</td></tr>
<tr><td align="center">5</td><td align="center">20</td><td align="center">0.750</td></tr>
<tr><td align="center">6</td><td align="center">21</td><td align="center">0.917</td></tr>
</table>
<p>We then plot the results as shown in the figure below.</p>
<p align="center"><img alt="normal probability plot " src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/3d807838b34b785ed5ee7bdcd7134927/normality_1.png" style="width: 359px; height: 354px;" /></p>
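<p>The hand calculation is easy to script. This minimal sketch sorts the six numbers, computes the cumulative frequency (j – 0.5)/n for each, and adds the matching standard normal quantile, which is what the ordered values line up against on a normal probability plot. The function name is only illustrative.</p>

```python
from statistics import NormalDist

def plotting_positions(data):
    """Return (j, ordered value, (j - 0.5)/n) for each observation."""
    ordered = sorted(data)
    n = len(ordered)
    return [(j, x, (j - 0.5) / n) for j, x in enumerate(ordered, start=1)]

rows = plotting_positions([16, 21, 20, 19, 18, 15])
standard_normal = NormalDist()
for j, x, freq in rows:
    # inv_cdf converts the cumulative frequency into a z-score.
    z = standard_normal.inv_cdf(freq)
    print(f"j={j}  x={x}  cumulative frequency={freq:.3f}  z={z:+.3f}")
```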
<p>That's fine for a small data set, but nobody wants to plot hundreds or thousands of data points by hand. Fortunately, we can also use <a href="http://www.minitab.com/products/minitab/">Minitab Statistical Software</a> to assess the normality of data. Minitab uses the Anderson-Darling test, which compares the actual distribution to a theoretical normal distribution. The Anderson-Darling test’s <a href="http://blog.minitab.com/blog/understanding-statistics/things-statisticians-say-failure-to-reject-the-null-hypothesis">null hypothesis</a> is “The distribution is normal.”</p>
<p><strong>Anderson-Darling test:</strong></p>
<p>H<sub>0</sub>: The data follow a normal distribution.</p>
<p>H<sub>a</sub>: The data don’t follow a normal distribution.</p>
<p>Test statistic: A<sup>2</sup> = –N – S, where <img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/867049b30d5a27a01bcfc121e60655b4/anderson_darling_equation.gif" style="width: 320px; height: 33px;" /> and F is the cumulative distribution function of the specified distribution. We can assess the results by looking at the resulting p-value.</p>
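<p>As a rough illustration of the statistic, the sketch below computes A<sup>2</sup> for a normal distribution fitted with the sample mean and standard deviation. It takes S to be the usual sum of (2i – 1)/N × [ln F(Y<sub>i</sub>) + ln(1 – F(Y<sub>N+1–i</sub>))] over the ordered data. The p-value comes from a commonly cited small-sample approximation (attributed to D'Agostino and Stephens); that choice is an assumption of this sketch, and the result is not guaranteed to match Minitab's output digit for digit.</p>

```python
from math import exp, log
from statistics import NormalDist, mean, stdev

def anderson_darling_normal(data):
    """Return (A-squared, approximate p-value) for a normality test."""
    y = sorted(data)
    n = len(y)
    fitted = NormalDist(mean(y), stdev(y))
    f = [fitted.cdf(v) for v in y]
    # S pairs the i-th smallest value with the i-th largest value.
    s = sum((2 * i - 1) / n * (log(f[i - 1]) + log(1 - f[n - i]))
            for i in range(1, n + 1))
    a2 = -n - s
    # Small-sample adjustment, then a piecewise p-value approximation.
    a2_adj = a2 * (1 + 0.75 / n + 2.25 / n ** 2)
    if a2_adj >= 0.6:
        p = exp(1.2937 - 5.709 * a2_adj + 0.0186 * a2_adj ** 2)
    elif a2_adj > 0.34:
        p = exp(0.9177 - 4.279 * a2_adj - 1.38 * a2_adj ** 2)
    elif a2_adj > 0.2:
        p = 1 - exp(-8.318 + 42.796 * a2_adj - 59.938 * a2_adj ** 2)
    else:
        p = 1 - exp(-13.436 + 101.14 * a2_adj - 223.73 * a2_adj ** 2)
    return a2, p

a2, p_value = anderson_darling_normal([16, 21, 20, 19, 18, 15])
print(f"A-squared = {a2:.3f}, approximate p-value = {p_value:.3f}")
print("Reject normality at alpha = 0.05:", p_value < 0.05)
```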
<p>The figure below shows a normal distribution with a sample size of 27. The same data is shown in a histogram, probability plot, dot plot, and box plot.</p>
<p style="text-align: center;"><img alt="Probability plot" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c68d8374151ac9964737e17fa5916c01/pencil_test_sample_size_27.png" style="width: 400px; height: 565px;" /></p>
<p>The next figure shows a normal distribution with a sample size of 208. Notice how the data is concentrated in the center of the histogram, probability plot, dot plot, and box plot.</p>
<p align="center"><img alt="normal probability plot" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/594e91dd9c2a08e2b14f04f73d5faaa4/pencil_test_sample_size_208.png" style="width: 400px; height: 561px;" /></p>
<p>A Laplace distribution with a sample size of 208 is shown below. Visually, this data almost resembles a normal distribution; however, the Minitab-generated P value of &lt; 0.05 tells us that this data is not normally distributed. </p>
<p align="center"><img alt="normality plot" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/9cbb95d47f49e0bb60ded0895aecae48/pencil_test_sample_size_laplace.png" style="width: 400px; height: 615px;" /></p>
<p>The figure below shows a uniform distribution with a sample size of 270. Even without looking at the P value we can quickly see that the data is not normally distributed.</p>
<p align="center"><img alt="normal probability distribution assumption plot" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/104d927db7729c8be55fed29a48af4ef/pencil_test_sample_size_uniform.png" style="width: 400px; height: 541px;" /></p>
<p>Back in the days of hand-drawn probability plots, the “fat pencil test” was often used to evaluate normality. The data was plotted and the distribution was considered normal if all of the data points could be covered by a thick pencil. The fat pencil test was quick and easy. Unfortunately, it is not as accurate as the Anderson-Darling test and is not a substitute for an actual test.</p>
<p align="center"><img alt="probability plot of normal" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/52ba1470b2a78ffad5e0e50294591dab/normality_7.png" style="width: 604px; height: 391px;" /><br />
<em><span style="line-height: 1.6;">Fat pencil test with normally distributed data</span></em></p>
<p align="center"><img alt="Probability Plot of Non-Normal" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ac1a56f263bf046d4b00d43a18c07312/normality_8.png" style="width: 605px; height: 386px;" /><br />
<em><span style="line-height: 1.6;">Fat pencil test with non-normally distributed data</span></em></p>
<p>The proper identification of a statistical distribution is critical for properly performing many types of hypothesis tests or for control charting. Fortunately, we can now assess our data without having to rely on hand-drawn tests and a large-diameter pencil.</p>
<p>To test for normality go to the <strong>Graph </strong>menu in Minitab, and select <strong>Probability Plot</strong>.</p>
<p><img alt="Selecting the probability plot" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a87dccee30473dc0ce693d06cfa49b25/normality_9.png" style="width: 456px; height: 420px;" /></p>
<p>Select Single if you are looking at only one column of data, then click OK.</p>
<p><img alt="probability plot selection" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/144c44a267edf131888c2dce5d272204/normality_10.png" style="width: 377px; height: 201px;" /></p>
<p>Select your column of data and then click OK.</p>
<p><img alt="Single Probability Plot" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/baee62b7ed756e2e71a3968a9f63e6f4/normality_11.png" style="width: 537px; height: 365px;" /></p>
<p>Minitab will generate a probability plot of your data. Notice the P-value below is 0.829. Because this is greater than an alpha level of 0.05 (95% confidence), we fail to reject the null hypothesis that our data follow a normal distribution.</p>
<p><img alt="Probability Plot of C1" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/51d4a3ccff36f75196dabb490b0fb68d/normality_12.png" style="width: 438px; height: 312px;" /></p>
<p>Using Minitab to test data for normality is far more reliable than a fat pencil test and generally quicker and easier. However, the fat pencil test may still be a viable option if you absolutely must analyze your data during a power outage. </p>
<p><strong>About the Guest Blogger</strong></p>
<p><em><a href="https://www.linkedin.com/pub/matthew-barsalou/5b/539/198" target="_blank">Matthew Barsalou</a> is a statistical problem resolution Master Black Belt at <a href="http://www.3k-warner.de/" target="_blank">BorgWarner</a> Turbo Systems Engineering GmbH. He is a Smarter Solutions certified Lean Six Sigma Master Black Belt, ASQ-certified Six Sigma Black Belt, quality engineer, and quality technician, and a TÜV-certified quality manager, quality management representative, and auditor. He has a bachelor of science in industrial sciences, a master of liberal studies with emphasis in international business, and a master of science in business administration and engineering from the Wilhelm Büchner Hochschule in Darmstadt, Germany. He is author of the books <a href="http://www.amazon.com/Root-Cause-Analysis-Step---Step/dp/148225879X/ref=sr_1_1?ie=UTF8&qid=1416937278&sr=8-1&keywords=Root+Cause+Analysis%3A+A+Step-By-Step+Guide+to+Using+the+Right+Tool+at+the+Right+Time" target="_blank">Root Cause Analysis: A Step-By-Step Guide to Using the Right Tool at the Right Time</a>, <a href="http://asq.org/quality-press/display-item/index.html?item=H1472" target="_blank">Statistics for Six Sigma Black Belts</a> and <a href="http://asq.org/quality-press/display-item/index.html?item=H1473&xvl=76115763" target="_blank">The ASQ Pocket Guide to Statistics for Six Sigma Black Belts</a>.</em></p>
Data Analysis, Statistics
Tue, 07 Jul 2015 12:00:00 +0000
http://blog.minitab.com/blog/statistics-in-the-field/pencils-and-plots%3A-assessing-the-normality-of-data
Guest Blogger
Applying DOE for Great Grilling, part 2
http://blog.minitab.com/blog/understanding-statistics/applying-doe-for-great-grilling-part-2
<p><img alt="grill" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/111e4a65160cf20662dfb13013408f1f/grill.jpg" style="margin: 10px 15px; width: 202px; height: 202px; line-height: 18.9px; float: right;" /></p>
<p style="line-height: 18.9px;"><span style="line-height: 18.9px;">Design of Experiments is an extremely powerful statistical method, and we added a DOE tool to the Assistant in Minitab 17 to make it more accessible to more people.</span></p>
<p style="line-height: 18.9px;"><span style="line-height: 18.9px;">Since it's summer here, I'm applying the Assistant's DOE tool to outdoor cooking.</span><span style="line-height: 18.9px;"> </span>Earlier, I showed you <a href="http://blog.minitab.com/blog/understanding-statistics/applying-doe-for-great-grilling-part-1">how to set up a designed experiment</a> that will let you optimize how you grill steaks. </p>
<p>If you're not already using it and you want to play along, you can download the <a href="http://it.minitab.com/products/minitab/free-trial.aspx">free 30-day trial version</a> of Minitab Statistical Software.</p>
<p style="line-height: 18.9px;">Perhaps you are following along, and you've already grilled your steaks according to the experimental plan and recorded the results of your experimental runs. Otherwise, feel free to download our data <a href="//cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/a0d8f12f27ee5a981619c2c3af59d524/steaks___asst_doe.MTW">here</a> for the next step: analyzing the results of our experiment. </p>
Analyzing the Results of the Steak Grilling Experiment
<p style="line-height: 18.9px;">After collecting your data and entering it into Minitab, you should have an experimental worksheet that looks like this: </p>
<p style="line-height: 18.9px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ed29e3c1fb41872df6529e91786215f2/grill_doe_worksheet.png" style="width: 500px; height: 320px;" /></p>
<p style="line-height: 18.9px;">With your results entered in the worksheet, select <strong>Assistant > DOE > Analyze and Interpret</strong>. As you can see below, the only button you can click is "Fit Linear Model." </p>
<p style="line-height: 18.9px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/1ce7cb7744e6fb78c4f5cb74d1903cf6/grill_doe_analyze.png" style="width: 500px; height: 375px;" /></p>
<p style="line-height: 18.9px;">As you might gather from the flowchart, when it analyzes your data, the Assistant first checks to see if the response exhibits curvature. If it does, the Assistant will prompt you to gather more data so it can fit a quadratic model. Otherwise, the Assistant will fit the linear model and provide the following output. </p>
<p style="line-height: 18.9px;">When you click the "Fit Linear Model" button, the Assistant automatically identifies your response variable.</p>
<p style="line-height: 18.9px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a851201ceabf727ba38e53ef383d6091/grill_doe_analyze2.png" style="width: 435px; height: 260px;" /></p>
<p style="line-height: 18.9px;">All you need to do is confirm your response goal—maximizing flavor, in this case—and press OK. The Assistant performs the analysis, and provides you the results in a series of easy-to-interpret reports. </p>
Understanding the DOE Results
<p style="line-height: 18.9px;">First, the Assistant offers a summary report that gives you the bottom-line results of the analysis. The Pareto Chart of Effects in the top left shows that Turns, Grill type, and Seasoning are all statistically significant, and there's a significant interaction between Turns and Grill type, too. </p>
<p style="line-height: 18.9px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/9ac1a8e009efec8b90fdbb32cfebd1df/grill_doe_results_summary.png" style="width: 751px; height: 563px;" /></p>
<p style="line-height: 18.9px;">The summary report also shows that the model explains a very high proportion of the variation in flavor, with an R<sup>2</sup> value of 95.75 percent. And the "Comments" window in the lower right corner puts things in plain language: "You can conclude that there is a relationship between Flavor and the factors in the model..."</p>
<p style="line-height: 18.9px;">The Assistant's Effects report, shown below, tells you more about the nature of the relationship between the factors in the model and Flavor, with both Interaction Plots and Main Effects plots that illustrate how different experimental settings affect the Flavor response. </p>
<p style="line-height: 18.9px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/4a9d3a9939ad51a9326bed0fbd061048/grill_doe_results_effects.png" style="width: 751px; height: 563px;" /></p>
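<p>If you are curious what sits behind main effects and interaction plots, here is a minimal sketch for a two-level design. Each effect is the average response at a factor's high setting minus the average at its low setting, and the interaction uses the product of the two coded columns. The flavor scores and the coding of each factor as -1 and +1 are hypothetical; they are not the values from the worksheet above.</p>

```python
from itertools import product
from statistics import mean

def effect(signs, responses):
    """Average response at the +1 setting minus the average at -1."""
    high = [y for s, y in zip(signs, responses) if s > 0]
    low = [y for s, y in zip(signs, responses) if s < 0]
    return mean(high) - mean(low)

# Coded 2x2x2 design: columns are (turns, grill, seasoning), each -1 or +1.
runs = list(product([-1, 1], repeat=3))
flavor = [5.0, 6.0, 7.0, 8.0, 4.0, 5.0, 4.0, 5.0]  # hypothetical scores

turns = [r[0] for r in runs]
grill = [r[1] for r in runs]
seasoning = [r[2] for r in runs]
turns_x_grill = [t * g for t, g in zip(turns, grill)]

effects = {
    "Turns": effect(turns, flavor),
    "Grill type": effect(grill, flavor),
    "Seasoning": effect(seasoning, flavor),
    "Turns*Grill type": effect(turns_x_grill, flavor),
}
for name, value in effects.items():
    print(f"{name}: {value:+.2f}")
```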
<p style="line-height: 18.9px;">And if we're looking to make some changes as a result of our experimental results—like selecting an optimal method for grilling steaks in the future—the Prediction and Optimization report gives us the optimal solution (1 turn on a charcoal grill, with Montreal seasoning) and its predicted Flavor response (8.425). </p>
<p style="line-height: 18.9px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f9189c39c79160de4b9c5dbf8f4523ab/grill_doe_results_optimization.png" style="width: 751px; height: 563px;" /></p>
<p style="line-height: 18.9px;"><span style="line-height: 1.6;">It also gives us the Top 5 alternative solutions, shown in the bottom right corner, so if there's some reason we can't implement the optimal solution—for instance, if we only have a gas grill—we can still choose the best solution that suits our circumstances. </span></p>
<p style="line-height: 18.9px;">I hope this example illustrates how easy a designed experiment can be when you use the Assistant to create and analyze it, and that designed experiments can be very useful not just in industry or the lab, but also in your everyday life. </p>
<p style="line-height: 18.9px;">Where could you benefit from analyzing process data to optimize your results? </p>
Design of Experiments, Fun Statistics, Statistics Help
Thu, 02 Jul 2015 12:00:00 +0000
http://blog.minitab.com/blog/understanding-statistics/applying-doe-for-great-grilling-part-2
Eston Martz
Applying DOE for Great Grilling, part 1
http://blog.minitab.com/blog/understanding-statistics/applying-doe-for-great-grilling-part-1
<p>Design of Experiments (DOE) has a reputation for difficulty, and to an extent, this statistical method <em>deserves </em>that reputation. While it's easy to grasp the basic idea—<em>acquire the maximum amount of information from the fewest number of experimental runs</em>—practical application of this tool can quickly become very confusing. </p>
<p><img alt="steaks" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/33d85058b493aff4240dfb9d78aff673/steaks.jpg" style="margin: 10px 15px; width: 250px; height: 250px; float: right;" />Even if you're a long-time user of designed experiments, it's still easy to feel uncertain if it's been a while since you last looked at split-plot designs or needed to choose the appropriate resolution for a fractional factorial design.</p>
<p>But DOE <em>is</em> an extremely powerful and useful tool, so when we launched Minitab 17, we added a DOE tool to the Assistant to make designed experiments more accessible to more people.</p>
<p>Since summer is here at Minitab's world headquarters, I'm going to illustrate how you can use the Assistant's DOE tool to optimize your grilling method. </p>
<p>If you're not already using it and you want to play along, you can download the free 30-day <a href="http://it.minitab.com/products/minitab/free-trial.aspx">trial version of Minitab Statistical Software</a>.</p>
Two Types of Designed Experiments: Screening and Optimizing
<p>To create a designed experiment using the Assistant, open Minitab and select <strong>Assistant > DOE > Plan and Create</strong>. You'll be presented with a decision tree that helps you take a sequential approach to the experimentation process by offering a choice between a screening design and a modeling design.</p>
<p><img alt="DOE Assistant" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/5b585531f6031882fb7880a49700f52c/grill_doe_1.png" style="width: 487px; height: 366px;" /></p>
<p>A <strong>screening design</strong> is important if <span><a href="http://blog.minitab.com/blog/understanding-statistics/why-is-the-office-coffee-so-bad-a-screening-experiment-narrows-down-the-critical-factors">you have a lot of potential factors to consider</a></span> and you want to figure out which ones are important. The Assistant guides you through the process of testing and analyzing the main effects of 6 to 15 factors, and identifies the factors that have the greatest influence on the response.</p>
<p>Once you've identified the critical factors, you can use the <strong>modeling design.</strong> Select this option, and the Assistant guides you through testing and analyzing 2 to 5 critical factors and helps you find optimal settings for your process.</p>
<p>Even if you're an old hand at analyzing designed experiments, you may want to use the Assistant to create designs since the Assistant lets you print out easy-to-use data collection forms for each experimental run. After you've collected and entered your data, the designs created in the Assistant can also be analyzed using <span style="line-height: 18.9px;">Minitab's </span><span style="line-height: 1.6;">core DOE tools available through the <strong>Stat > DOE</strong> menu.</span></p>
<span style="line-height: 1.6;">Creating a DOE to Optimize How We Grill Steaks</span>
<p>For grilling steaks, there aren't that many variables to consider, so we'll use the Assistant to plan <span style="line-height: 1.6;">and create a <strong>modeling design</strong> that will optimize our grilling process. Select <strong>Assistant > DOE > Plan and Create</strong>, then click the "Create Modeling Design" button. </span></p>
<p><span style="line-height: 1.6;">Minitab brings up an easy-to-follow dialog box; all we need to do is fill it in. </span></p>
<p><span style="line-height: 1.6;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/eb90fd8499ab96a579aa6dd63fa325d2/grill_doe_dialog_1.png" style="width: 461px; height: 500px;" /></span></p>
<p>First we enter the name of our Response and the goal of the experiment. Our response is "Flavor," and the goal is "Maximize the response." Next, we enter our factors. We'll look at three critical variables:</p>
<ul>
<li>Number of turns, a continuous variable with a low value of 1 and high value of 3.</li>
<li>Type of grill, a categorical variable with Gas or Charcoal as options. </li>
<li>Type of seasoning, a categorical variable with Salt-Pepper or Montreal steak seasoning as options. </li>
</ul>
<p>If we wanted to, we could select more than 1 replicate of the experiment. A replicate is simply a complete set of experimental runs, so if we did 3 replicates, we would repeat the full experiment three times. But since this experiment has 16 runs, and neither our budget nor our stomachs are limitless, we'll stick with a single replicate. </p>
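<p>To make the idea of factor combinations and replicates concrete, here is a minimal Python sketch. It only enumerates the distinct combinations of the three two-level factors listed above and repeats them once per replicate; it does not reproduce the 16-run design the Assistant builds, whose exact layout is not described here. The function name <code>full_factorial</code> and the seed are illustrative assumptions.</p>

```python
import itertools
import random

# Factor names and levels taken from the list above.
factors = {
    "Turns": [1, 3],
    "Grill": ["Gas", "Charcoal"],
    "Seasoning": ["Salt-Pepper", "Montreal"],
}

def full_factorial(factors, replicates=1, seed=None):
    """Return every factor-level combination, once per replicate, in random run order."""
    names = list(factors)
    combos = [dict(zip(names, levels)) for levels in itertools.product(*factors.values())]
    runs = [dict(combo) for _ in range(replicates) for combo in combos]
    random.Random(seed).shuffle(runs)  # randomize run order; the seed is arbitrary
    return runs

runs = full_factorial(factors, replicates=1, seed=17)
for number, run in enumerate(runs, start=1):
    print(number, run)
```

<p>Three two-level factors give 8 distinct combinations, and each extra replicate adds another complete set of them.</p>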
<p>When we click OK, the Assistant first asks if we want to print out data collection forms for this experiment: </p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c4c63c4b5af7a4c6e4f3c4caa327f523/grill_doe_collection_form1.png" style="width: 445px; height: 207px;" /></p>
<p>Choose Yes, and you can print a form that lists each run, the variables and settings, and a space to fill in the response:</p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/06ed8ad486f3a243c4aea352c9738b2c/grill_doe_collection_form2.png" style="border-width: 1px; border-style: solid; width: 500px; height: 313px;" /></p>
<p>Alternatively, you can just record the results of each run in the worksheet the Assistant creates, which you'll need to do anyway. But having the printed data collection forms can make it much easier to keep track of where you are in the experiment, and exactly what your factor settings should be for each run. </p>
<p>If you've used the Assistant in Minitab for other methods, you know that it seeks to demystify your analysis and make it easy to understand. When you create your experiment, the Assistant gives you a Report Card and Summary Report that explain the steps of the DOE and important considerations, and a summary of your goals and what your analysis will show. </p>
<p><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a767257b9db465b81d6bd1456e5eb508/grill_doe_2_w1024.png" style="width: 650px; height: 439px;" /></p>
<p>Now it's time to cook some steaks, and rate the flavor of each. If you want to do this for real and collect your own data, please do so! <a href="http://blog.minitab.com/blog/understanding-statistics/applying-doe-for-great-grilling-part-2">Tomorrow's post</a> will show how to analyze your data with the Assistant. </p>
Wed, 01 Jul 2015 12:00:00 +0000http://blog.minitab.com/blog/understanding-statistics/applying-doe-for-great-grilling-part-1Eston MartzUsing Quality Tools Like FMEA in Pathogen Testing
http://blog.minitab.com/blog/understanding-statistics/using-quality-tools-like-fmea-in-pathogen-testing
<p>Before I joined Minitab, I worked for many years in Penn State's College of Agricultural Sciences as a writer and editor. I frequently wrote about food science and particularly food safety, as I regularly needed to report on the research being conducted by Penn State's food safety experts, and also edited course materials and bulletins for professionals and consumers about ensuring they had safe food. </p>
<p><img alt="culture dish" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/18d6a3d63b7c0f1b80cb19461732c349/culture_dish.jpg" style="margin: 10px 15px; float: right; width: 200px; height: 200px;" />After I joined Minitab and became better acquainted with data-driven quality methods like Six Sigma, I was surprised at how infrequently some of the powerful quality tools common in many industries are used in food safety work. </p>
<p>So I was interested to see <a href="http://www.foodsafetytech.com/FoodSafetyTech/News/How-to-Use-FMEA-to-Risk-Assess-Pathogen-Testing-Me-2440.aspx">a recent article on the Food Safety Tech web site</a> about an application of the tool called FMEA in pathogen testing.</p>
What <em>Is </em>an FMEA?
<p style="line-height: 18.9090900421143px;">The acronym FMEA is short for "<span><a href="http://blog.minitab.com/blog/statistics-in-the-field/for-want-of-an-fmea-the-empire-fell">Failure Modes and Effects Analysis</a></span>." What the tool really does is help you look very carefully and systematically at <em>exactly </em>how and why things can go wrong, so you can do your best to prevent that from happening.</p>
<p>In the article, Maureen Harte, a consultant and Lean Six Sigma black belt, talks about the need to identify, quantify, and assess risks of the different pathogen detection methods used to create a Certificate of Analysis (COA)—a document companies obtain to verify product quality and purity.</p>
<p>Too often, Harte says, companies accept COA results blindly:</p>
<p style="margin-left: 40px;"><em>[They] lack the background information to really understand what goes into a COA, and they trust that what is coming to them is the highest quality. </em></p>
<p>Harte then proceeds to explain how doing an FMEA can make the COA more meaningful and useful. </p>
<p style="line-height: 18.9090900421143px; margin-left: 40px;"><em>FMEA helps us understand the differences between testing methods by individually identifying the risks associated with each method on its own. For each process step [in a test method], we ask: Where could it go wrong, and where could an error or failure mode occur? Then we put it down on paper and understand each failure mode. </em></p>
Completing an FMEA
<p>Doing an FMEA typically involves these steps:</p>
<ul>
<li>Identify potential failure types, or "modes," for each step of your process.</li>
<li>List the effects that result when those failures occur.</li>
<li>Identify potential causes for each failure mode.</li>
<li>List existing controls that are in place to keep these failures from happening.</li>
<li>Rate the Severity of the effect, the likelihood of Occurrence, and the odds of Detecting the failure mode before it causes harm.</li>
<li>Multiply the values for severity, occurrence, and detection to get a risk priority number (RPN).</li>
<li>Improve items with a high RPN, record the actions you've taken, then revise the RPN.</li>
<li>Maintain as a living document.</li>
</ul>
<p>You can <span>do an FMEA </span><span style="line-height: 1.6;">with just a pencil and paper, although Minitab's <a href="http://www.minitab.com/products/quality-companion">Quality Companion</a> and <a href="http://www.minitab.com/products/qeystone">Qeystone Tools</a> process improvement software include forms that make it easy to complete the FMEA, and even share data from process maps and other forms you may be using. </span></p>
<p><span style="line-height: 1.6;">Here's an example of a completed Quality Companion FMEA tool: </span></p>
<p><img alt="FMEA" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/d46282c5c0b55efeb25a146269263b97/pathogen_fmea.png" style="width: 750px; height: 406px; border-width: 1px; border-style: solid;" /></p>
FMEA Steps
<p>1) In Process Map - Activity, enter each process step, feature or type of activity. In the example above, it's preparation of growth culture and incubation. We also list the key components or inputs of each step.</p>
<p>2) In Potential Failure Mode, we note the ways the process can fail for each activity. There may be many ways it could fail. In the example, we've identified contamination of growth medium and incubating cultures at the wrong temperature as potential failure modes. </p>
<p>3) In Potential Failure Effects, we detail the possible fallout of each type of failure. There may be multiple failure effects.<span style="line-height: 18.9090900421143px;"> In the example above, contaminated growth culture could lead to the waste of perfectly good raw materials. An improperly performed incubation might lead to undetected pathogens, and possibly unsafe products. </span></p>
<p>4) In SEV (Severity Rating), we assign severity to each failure effect on a 1 to 10 scale, where 10 is high and 1 low. This is a relative assignment. In the food world, wasting some good materials is undesirable, but having pathogens reach the market is obviously much worse, hence the ranking of 6 and 9, respectively.</p>
<p>5) In OCC (Occurrence Rating), estimate the probability of occurrence of the cause. Use a 1 to 10 scale, where 10 signifies high frequency (guaranteed ongoing problem) and 1 signifies low frequency (extremely unlikely to occur). </p>
<p>6) In Current Control, enter the manner in which the failure causes/modes are detected or controlled. </p>
<p>7) In DET (Detection Rating), gauge the ability of each control to detect or control the failure cause/mode. Use a 1 to 10 scale, where 10 signifies poor detection/control and 1 signifies high detection/control (you're almost certain to catch the problem before it causes failure). </p>
<p>8) RPN (Risk Priority Number) is the product of the SEV, OCC, and DET scores. The higher the RPN, the more severe, more frequent, or less controlled a potential problem is, indicating a greater need for immediate attention. Above, the RPN of 81 for potential incubation error indicates that this type of failure should get higher priority than contaminated cultures. </p>
<p>9) If you're doing FMEA as part of an improvement project, you can use it to prioritize corrective actions. Once you've implemented improvements, enter the revised SEV, OCC, and DET values to calculate a current RPN. </p>
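<p>The scoring in steps 4 through 8 can be sketched in a few lines of Python. The severity scores (6 and 9) and the RPN of 81 come from the example above; the OCC and DET values are hypothetical, chosen only so the incubation RPN works out to 81, and the <code>FailureMode</code> class is an illustrative name, not part of any Minitab product.</p>

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    sev: int  # severity: 1 (low) to 10 (high)
    occ: int  # occurrence: 1 (extremely unlikely) to 10 (ongoing problem)
    det: int  # detection: 1 (almost certain to catch) to 10 (poor detection)

    def rpn(self) -> int:
        """Risk priority number: the product of the three scores."""
        for score in (self.sev, self.occ, self.det):
            if not 1 <= score <= 10:
                raise ValueError("scores must be between 1 and 10")
        return self.sev * self.occ * self.det

modes = [
    # OCC and DET below are hypothetical; only SEV is taken from the example.
    FailureMode("Contaminated growth medium", sev=6, occ=2, det=3),
    FailureMode("Incubation at wrong temperature", sev=9, occ=3, det=3),
]

# Highest RPN first: these are the items to address first.
ranked = sorted(modes, key=lambda mode: mode.rpn(), reverse=True)
for mode in ranked:
    print(f"{mode.name}: RPN = {mode.rpn()}")
```

<p>After an improvement, you would re-enter the revised scores and recompute the RPN the same way.</p>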
The Benefits of an FMEA
<p>When you've completed the FMEA, you'll have the answers to these questions:</p>
<p>What are the potential failure modes at each step of a process?<br />
What is the potential effect of each failure mode on the process output, and how severe is it?<br />
What are the potential causes of each failure mode, and how often do they occur?<br />
How well can you detect a cause before it creates a failure mode and effect?<br />
How can you assign a risk value to a process step that factors in the frequency of the cause, the severity of failure, and the capability of detecting it in advance?<br />
What part of the process should an improvement project focus on?<br />
Which inputs are vital to the process, and which aren't? <br />
How can reaction plans be documented as part of process control?</p>
<p>And if your understanding of the steps that underlie your Certificate of Analysis is that thorough, you will be able to stand behind it with much more confidence. </p>
<p>Where could you apply an FMEA in your organization? </p>
Lean Six SigmaProject ToolsQuality ImprovementStatistics in the NewsWed, 17 Jun 2015 12:00:00 +0000http://blog.minitab.com/blog/understanding-statistics/using-quality-tools-like-fmea-in-pathogen-testingEston Martz3 Features to Make You Glad You're You When You Have to Clean Data in Minitab
http://blog.minitab.com/blog/statistics-and-quality-improvement/done-cancel
<p>When someone gives you data to analyze, you can gauge how your life is going by what you've received. Get a Minitab file, or even comma-separated values, and everything feels fine. Get a PDF file, and you start to think maybe you’re cursed because of your no-good-dirty-rotten-pig-stealing-great-great-grandfather and wish that you were someone else. For those of you who might be in such dire straits today, here are 3 helpful things you can do in Minitab Statistical Software: change data type, code and remove missing values, and recode variables.</p>
<p>For the purposes of having an example, I’m going to use some data from the Centers for Medicare and Medicaid Services. <a href="http://www.cms.gov/Medicare/Quality-Initiatives-Patient-Assessment-Instruments/HospitalQualityInits/Downloads/HospitalTop50PercentYear6.zip" target="_blank">The data are from October 2008 to September 2009 and track the quality of a hospital’s response to a patient with pneumonia</a>. The data in the PDF file look like this:</p>
<p><img alt="The PDF file has header text and a nicely formatted table." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/eb685364783267efcf57c40d30999633/worksheets1_w1024.jpeg" style="border-width: 0px; border-style: solid; width: 1024px; height: 559px;" /></p>
<p>If you copy and paste it into Minitab, hoping for the nicely organized tables that appear in the document, you get a single column that contains everything:</p>
<p><img alt="The header text and the table content are all in one column." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/8c484f2c2d789faaeced739e945b9c5a/worksheets2.JPG" style="border-width: 0px; border-style: solid; width: 555px; height: 759px;" /></p>
<p>Don’t despair. Instead, look at the capabilities that are at your fingertips.</p>
Change Data Type
<p>What we’re really after for analysis are the numbers inside the table, so a good first step is to get the numbers.</p>
<ol>
<li>Choose <strong>Data > Change Data Type > Text to Numeric</strong>.</li>
<li>In <strong>Change text columns</strong>, enter <em>C1</em>.</li>
<li>In <strong>Store Numeric Columns in</strong>, enter <em>C2</em>.</li>
<li>Click <strong>OK</strong>. In the Error box, click <strong>Cancel</strong>.</li>
</ol>
<p>When you look at the worksheet, the cells that had text values after the paste are now missing value symbols and the numbers that were in the tables remain. You might be a bit unnerved that the percentages of patients who received treatments are all 1, but that’s only a result of the column formatting. (Want to see? <a href="http://support.minitab.com/en-us/minitab/17/topic-library/minitab-environment/data-and-data-manipulation/numeric-data-and-formats/numeric-data-and-formats/#change-the-numeric-data-display-format">Change the numeric display format</a>.)</p>
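<p>For readers who think in code, the text-to-numeric step behaves roughly like the Python sketch below: anything that parses as a number is kept, and everything else becomes a missing value. The sample cells are made up for illustration, and <code>None</code> stands in for Minitab's missing value symbol.</p>

```python
def text_to_numeric(cells):
    """Convert each text cell to a float; cells that are not numbers become None."""
    converted = []
    for cell in cells:
        try:
            converted.append(float(cell.strip()))
        except ValueError:
            converted.append(None)  # plays the role of the missing value symbol
    return converted

# Made-up cells resembling a pasted column of header text and table values.
c1 = ["Hospital Name", "0.97", "1", "Low Sample (10 or less)", "0.85"]
c2 = text_to_numeric(c1)
print(c2)
```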
Remove missing values
<p>You can easily get rid of the missing values in these data so that the missing values don’t interfere with further analysis, but there’s an additional complication here. While most of the missing values are column headers that we don’t want in the data, the table itself contains some missing values. Anytime a hospital gave a treatment to fewer than 10 patients, the table contains the value “Low Sample (10 or less).” To preserve these missing values while eliminating the others, we want to use different values to represent the different cases in the data.</p>
<ol>
<li>Choose <strong>Calc > Calculator</strong>.</li>
<li>In <strong>Store Result in Variable</strong>, enter <em>C3</em>.</li>
<li>In <strong>Expression</strong>, enter <em>If(Left(C1,3)="Low", 99999999, C2)</em>.</li>
<li>Click <strong>OK</strong>.</li>
</ol>
<p>Now that you have two kinds of missing value, you can start cleaning them up. First, get rid of the ones that don’t represent values in the table.</p>
<ol>
<li>Choose <strong>Data > Copy > Columns to Columns</strong>.</li>
<li>In <strong>Copy from columns</strong>, enter <em>C3</em>.</li>
<li>In <strong>Store Copied Data in Columns</strong>, select <strong>In current worksheet, in columns</strong> and enter <em>C4</em>.</li>
<li>Click <strong>Subset the Data</strong>.</li>
<li>In <strong>Specify Which Rows to Include</strong>, select <strong>Rows that match</strong> and click <strong>Condition</strong>.</li>
<li>In <strong>Condition</strong>, enter <em>C3 <> '*'</em>.</li>
<li>Click <strong>OK</strong> in all of the dialog boxes.</li>
</ol>
<p>Now that we’ve gotten rid of the missing values that weren’t numbers in the table, we can change the missing values that we kept back to a form Minitab recognizes.</p>
<ol>
<li>Choose <strong>Calc > Calculator</strong>.</li>
<li>In <strong>Store result in variable</strong>, enter <em>C5</em>.</li>
<li>In <strong>Expression</strong>, enter <em>If(c4 = 99999999, '*', c4)</em>.</li>
<li>Click <strong>OK</strong>.</li>
</ol>
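<p>The three steps above (mark the "Low Sample" cells with a placeholder, drop the other missing values, then turn the placeholder back into a missing value) can be sketched in Python as follows. The sample cells are made up, <code>None</code> stands in for Minitab's missing value symbol, and the helper names are illustrative assumptions.</p>

```python
SENTINEL = 99999999  # same placeholder value used in the Calculator expression

def to_number(cell):
    """Return the cell as a float, or None if it is not a number."""
    try:
        return float(cell)
    except ValueError:
        return None

def clean(cells):
    # Step 1: mark "Low Sample" cells with the sentinel; other text becomes None.
    marked = [SENTINEL if cell.startswith("Low") else to_number(cell) for cell in cells]
    # Step 2: drop the missing values that were only header text.
    kept = [value for value in marked if value is not None]
    # Step 3: turn the sentinel back into a missing value.
    return [None if value == SENTINEL else value for value in kept]

# Made-up cells: two headers, two numbers, one low-sample entry.
c1 = ["Pneumonia measures", "0.97", "Low Sample (10 or less)", "CCN", "0.85"]
cleaned = clean(c1)
print(cleaned)
```

<p>The placeholder only works if it can never occur as a real value in the table, which is why such a large number is used.</p>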
Recode the data
<p>For analysis, we want one row for each hospital. To do this, we’ll create a table in the worksheet that shows how to identify the variables for analysis, then <a href="http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-datas-a-mess2c-part-2">unstack the variables</a>.</p>
<p>Because we kept the missing values from the table, every hospital has 9 variables. We make a table in the worksheet that shows the numbers 1 to 9 and a name for each variable:</p>
<p><img alt="A table with number codes and labels that you want for the variables." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/c92210568ca74f8438f661d29cb990ad/worksheets3.JPG" style="border-width: 0px; border-style: solid; width: 535px; height: 271px;" /></p>
<p>To associate the variable names with all 1,944 rows of data, we’ll make patterned data.</p>
<ol>
<li>Choose <strong>Calc > Make Patterned Data > Simple Set of Numbers</strong>.</li>
<li>In <strong>Store patterned data in</strong>, enter <em>C8</em>.</li>
<li>In <strong>From first value</strong>, enter <em>1</em>.</li>
<li>In <strong>To last value</strong>, enter <em>9</em>.</li>
<li>In <strong>Number of times to list sequence</strong>, enter <em>216</em>.</li>
<li>Click <strong>OK</strong>.</li>
</ol>
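<p>The patterned column is just the sequence 1 through 9 listed 216 times. A one-line Python equivalent makes it easy to check that this matches the 1,944 rows of data:</p>

```python
# The sequence 1..9, listed 216 times, as in the dialog settings above.
codes = list(range(1, 10)) * 216
print(len(codes), codes[:12])
```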
<p>To convert the number codes to the text variable descriptions, we’ll recode the data.</p>
<ol>
<li>Choose <strong>Data > Code > Use Conversion Table</strong>.</li>
<li>In <strong>Code values in the following column</strong>, enter <em>C8</em>.</li>
<li>In <strong>Current values</strong>, enter <em>C6</em>.</li>
<li>In <strong>Coded values</strong>, enter <em>C7</em>.</li>
<li>Click <strong>OK</strong>.</li>
</ol>
<p>Now that you have a column that says which number belongs to each variable, unstack the data.</p>
<ol>
<li>Choose <strong>Data > Unstack Columns</strong>.</li>
<li>In <strong>Unstack the data in</strong>, enter <em>C5</em>.</li>
<li>In <strong>Using subscripts in</strong>, enter <em>C9</em>.</li>
<li>Click <strong>OK</strong>.</li>
</ol>
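<p>As a rough sketch of what recoding and unstacking do, the Python below maps number codes to labels and then splits one stacked column into one column per label. The labels ("Measure 1" and so on), the two-hospital size, and the values are all made up for illustration; the real labels are the ones in the worksheet table shown earlier.</p>

```python
def recode(codes, conversion_table):
    """Replace each number code with its label from the conversion table."""
    return [conversion_table[code] for code in codes]

def unstack(values, labels):
    """Split one stacked column into one column per label, preserving row order."""
    columns = {}
    for value, label in zip(values, labels):
        columns.setdefault(label, []).append(value)
    return columns

# Hypothetical conversion table standing in for the worksheet's codes and names.
conversion_table = {code: f"Measure {code}" for code in range(1, 10)}

n_hospitals = 2  # tiny made-up example instead of 216 hospitals
codes = list(range(1, 10)) * n_hospitals
values = [k / 100 for k in range(len(codes))]  # made-up stacked data

labels = recode(codes, conversion_table)
columns = unstack(values, labels)
for label, column in columns.items():
    print(label, column)
```

<p>Each resulting column has one entry per hospital, so row <em>i</em> across all columns describes the same hospital.</p>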
<p>Now, you have a new worksheet where each hospital is identified by its unique CCN and the variables are the proportions of pneumonia patients who got each treatment from that hospital.</p>
<p>Once the data are in a traditional format for analysis, you can start to get the answers that you want quickly. For example, a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/quality-tools/control-charts/understanding-attributes-control-charts/what-is-a-laney-p-chart/">Laney P’ chart</a> might suggest whether some hospitals had a higher proportion of unvaccinated pneumonia patients than you would expect from the variation in the data.</p>
<p><img alt="8 facilities have higher proportions for the year than you would expect from a random sample from a stable process." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/72f45bae753978f916b3dbf9974c1c6b/laney_p____chart_of_unvaccinated.png" style="border-width: 0px; border-style: solid; width: 576px; height: 384px;" /></p>
<p>Fortunately, being able to change data types, remove missing values, and recode data lets you get data ready to analyze in Minitab as fast as possible. That way, you’re ready to give the answers that your fearless data analysis justifies.</p>
Data AnalysisLearningWed, 10 Jun 2015 12:00:00 +0000http://blog.minitab.com/blog/statistics-and-quality-improvement/done-cancelCody SteeleA Closer Look at Probability and Survival Plots
http://blog.minitab.com/blog/quality-data-analysis-and-statistics/a-closer-look-at-probability-and-survival-plots
<p>I recently fielded an interesting question about the probability and survival plots in Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a>'s Reliability/Survival menus:</p>
<p style="margin-left: 40px;"><em>Is there a one-to-one match between the confidence interval points on a probability plot and the confidence interval points on survival plot at a specific percentile?</em></p>
<p>Now, this may seem like an easy question, given that the probabilities on a survival plot are simply 1 minus the failure probabilities on a probability plot at a specific time t or stressor (in the case of Probit Analysis, used for our example below).</p>
<p>This can be seen here, at the 10th percentile:</p>
<p style="text-align: center;"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/0b5346729cfc39b1fbd5829bfb4cb58e/pic1.png" style="width: 350px; height: 234px;" /><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/eb499a6f799045d441a03d98f1a15732/pic2.png" style="width: 350px; height: 234px;" /></p>
<p>The probability plot is saying that at a voltage of 113.25, 10% of your items are failing. Conversely, the survival plot will show that 90% of your items will survive at that same voltage.</p>
<p>How do the graphs compare when adding confidence intervals to both graphs? Before we get our hands dirty with this, let’s first review some terms and methods to get us comfortable enough to proceed further.</p>
Reliability/Survival Analysis
<p>This is the overarching classification of tools within Minitab that help with modeling life data. Distribution Analysis, Repairable Systems Analysis, and Probit Analysis fall within this category.</p>
Probit Analysis
<p>This analysis will be used as our example today. Probit analysis is used when you want to estimate percentiles and survival probabilities of an item in the presence of a stress. The response is required to be binomial in nature (go/no go, pass/fail). One example of a probit analysis could be testing light bulb life at different voltages.</p>
<p>Since the response data is binomial, you’d have to specify what would be considered an event for that light bulb at a certain voltage. Let’s say the event is a light bulb blowing out before 800 hours.</p>
<p>Excerpt of data set</p>
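<table>
<tr><th>Blows (The Event)</th><th>Trials</th><th>Volts</th></tr>
<tr><td>2</td><td>50</td><td>108</td></tr>
<tr><td>6</td><td>50</td><td>114</td></tr>
<tr><td>11</td><td>50</td><td>120</td></tr>
<tr><td>45</td><td>50</td><td>132</td></tr>
</table>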
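<p>Using only the four rows in the data excerpt above, a short Python sketch shows the raw relationship between failing and surviving at each voltage: the observed survival proportion is one minus the observed failure proportion. These are raw proportions, not the fitted probit estimates that Minitab plots.</p>

```python
# (blows, trials, volts) rows from the data excerpt above.
rows = [(2, 50, 108), (6, 50, 114), (11, 50, 120), (45, 50, 132)]

summary = []
for blows, trials, volts in rows:
    p_fail = blows / trials   # observed proportion blowing out before 800 hours
    p_survive = 1 - p_fail    # observed proportion surviving beyond 800 hours
    summary.append((volts, p_fail, p_survive))
    print(f"{volts} V: failed {p_fail:.2f}, survived {p_survive:.2f}")
```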
Probability Plot
<p>This graph plots each value against the percentage of values in the sample that are less than or equal to it, along a fitted distribution line (middle line). In probit analysis, it helps determine what percentage of bulbs fail before 800 hours at certain voltages.</p>
Survival Plot
<p>This graph displays a plot of the survival probabilities versus time. Each plot point represents the proportion of units surviving at time t. In probit analysis, it helps determine what percentage of bulbs survive beyond 800 hours at certain voltages.</p>
Back to the original question…
<p>Can we take a value along the CI of a probability plot and find its corresponding value on the CI of a survival plot at a specific percentile? Here are the confidence interval values for the percentile at 113.246:</p>
<p style="margin-left: 40px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/8e52bb9e7bb7d7792646f5ccd35b13d5/sessionpic1.png" style="line-height: 1.6; width: 338px; height: 112px;" /></p>
<p style="text-align: center;"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/fb12e57862d4f6c963a9ad468cfa6e50/pic3.png" style="width: 576px; height: 384px;" /></p>
<p>If we add the above confidence interval values for the 10th percentile to the survival plot, you'll see that they don’t quite equal what’s shown at 90%. They’re a <em>little</em> off:</p>
<p style="text-align: center;"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/ad209081968164d93813fe0209898afd/pic4_w1024.png" style="line-height: 1.6; width: 624px; height: 343px;" /></p>
The Reason
<p>In our probability plot, the confidence interval is calculated with the parameter of interest being the percentile. Let’s look at the 10th percentile again:</p>
<p style="margin-left: 40px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/8e52bb9e7bb7d7792646f5ccd35b13d5/sessionpic1.png" style="width: 338px; height: 112px;" /> </p>
<p>Our 95% CI (111.302 to 114.779) is around the value of 113.246 volts. In our survival plot, the confidence interval is calculated around the probability of survival. You can see this in the session window under the Table of Survival Probabilities. The 95% CI around the survival probability of 0.90 for a voltage of 113.246:</p>
<p style="margin-left: 40px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/5a31728b7793a1bc82344d192d8a12bb/session_pic2.PNG" style="width: 339px; height: 106px;" /></p>
<p>Here’s another look at our survival plot with our aforementioned survival probabilities added:</p>
<p style="text-align: center;"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/4c0ca3d6c165ca31ce9139a7e710d5aa/pic5.png" style="width: 624px; height: 279px;" /></p>
<p>They all nicely fit on one straight vertical line at voltage = 113.246. </p>
<p>This all being said, you <em>can </em>convert the lower bound or upper bound of a percentile to a point on a survival plot. Let’s say we look at the lower bound for 113.246 (which is 111.302). We’d first have to find the survival probability for that value:</p>
<p style="margin-left: 40px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/ba4559dda5a9d3ed2107ac6bc8bb37f3/sessionpic3.PNG" style="width: 322px; height: 105px;" /></p>
<p>Now let’s look at that table of survival probabilities for 0.90 again:</p>
<p style="margin-left: 40px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/5a31728b7793a1bc82344d192d8a12bb/session_pic2.PNG" style="width: 339px; height: 106px;" /></p>
<p>Notice that the survival probability for the lower CI of 113.246 ends up being the upper bound of the survival probability of 0.90. Given that the survival probabilities are one minus the failure probabilities, it makes sense that you'd have to look at the upper bound of a survival plot when analyzing the lower bound of a probability plot. </p>
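<p>The direction of that flip can be illustrated with a small Python sketch. It assumes a normal model on voltage with a hypothetical scale of 8 volts, and picks the location so that 10% of bulbs fail at 113.246 volts; neither parameter is taken from the actual fitted model, so the sketch shows only the ordering, not the table values.</p>

```python
from statistics import NormalDist

SIGMA = 8.0            # hypothetical scale in volts; not from the fitted model
P10_VOLTS = 113.246    # 10th percentile reported above
LOWER_VOLTS = 111.302  # lower 95% bound for that percentile, reported above

# Choose the location so that exactly 10% of bulbs fail at P10_VOLTS.
mu = P10_VOLTS - SIGMA * NormalDist().inv_cdf(0.10)
model = NormalDist(mu, SIGMA)

def survival(volts):
    """Survival probability at a voltage: one minus the failure probability."""
    return 1 - model.cdf(volts)

print(round(survival(P10_VOLTS), 3))
print(survival(LOWER_VOLTS) > survival(P10_VOLTS))
```

<p>Because survival falls as voltage rises, a lower voltage always maps to a higher survival probability, which is why the lower bound of the percentile lines up with the upper bound of the survival probability.</p>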
<p>I hope this post helps you develop a deeper understanding of the relationship between our probability and survival plots—and I hope it wasn't <em>too </em>technical!</p>
<p>Please check out these other posts on Reliability/Survival:</p>
<p><a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/probit-analysis-down-goes-the-meathouse">Probit Analysis: Down Goes the Meathouse!</a></p>
<p><a href="http://blog.minitab.com/blog/the-statistics-of-science/reliability-statistics-and-the-care-and-feeding-of-capital-equipment">The Care and Feeding of Capital Equipment (with Reliability Statistics)</a></p>
Data AnalysisQuality ImprovementReliability AnalysisSix SigmaMon, 08 Jun 2015 12:00:00 +0000http://blog.minitab.com/blog/quality-data-analysis-and-statistics/a-closer-look-at-probability-and-survival-plotsAndy Cheshire