Minitab | MinitabBlog posts and articles about using Minitab software in quality improvement projects, research, and more.
http://blog.minitab.com/blog/minitab/rss
Thu, 29 Sep 2016 06:41:15 +0000
FeedCreator 1.7.3
How to Save a Failing Regression with PLS
http://blog.minitab.com/blog/statistics-and-quality-improvement/fix-problems-in-regression-analysis-with-partial-least-squares
<p>Face it, you love regression analysis as much as I do. Regression is one of the most satisfying analyses in <a href="http://www.minitab.com/en-US/products/minitab/free-trial/">Minitab</a>: get some predictors that should have a relationship to a response, go through a model selection process, interpret fit statistics like adjusted R2 and predicted R2, and make predictions. Yes, regression really is quite wonderful.</p>
<p>Except when it’s not. Dark, seedy corners of the data world exist, lying in wait to make regression confusing or impossible. Good old ordinary least squares regression, to be specific.</p>
<p>For instance, sometimes you have a lot of <em>detail</em> in your data, but not a lot of data. Want to see what I mean?</p>
<ol>
<li>In Minitab, choose <strong>Help > Sample Data...</strong></li>
<li>Open Soybean.mtw.</li>
</ol>
<p><img alt="Soybeans" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/e9bae86907cd8194ecf16b7622cf98bb/edamame_by_zesmerelda_in_chicago.jpg" style="float: right; width: 200px; height: 133px; border-width: 1px; border-style: solid; margin: 10px 15px;" />The data set contains 88 variables about soybeans: the results of near-infrared (NIR) spectroscopy at different wavelengths. But it contains only 60 measurements, and 6 of those are set aside for validation runs.</p>
A Limit on Coefficients
<p>With ordinary least squares regression, you can estimate only as many coefficients as the data have samples. Thus the traditional method that’s satisfactory in most cases would let you estimate only 53 coefficients for variables plus a constant coefficient.</p>
<p>This could leave you wondering about whether any of the other possible terms might have information that you need.</p>
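<p>A quick way to see this limit for yourself, outside Minitab, is to check the rank of a design matrix that has more columns than rows. This sketch uses NumPy with made-up random data standing in for the soybean measurements:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# 54 usable measurements (60 minus 6 validation runs) but 88 predictors:
# the design matrix, including an intercept column, cannot have full column rank.
X = np.column_stack([np.ones(54), rng.normal(size=(54, 88))])

rank = np.linalg.matrix_rank(X)
print(rank)  # rank is capped at the number of rows, so at most
             # a constant plus 53 predictor coefficients are estimable
```

<p>Any ordinary least squares fit of such a matrix has infinitely many equally good solutions for the remaining coefficients, which is why the traditional approach stalls here.</p>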
Multicollinearity
<p>The NIR measurements are also highly collinear with each other. This <a href="http://blog.minitab.com/blog/understanding-statistics/handling-multicollinearity-in-regression-analysis">multicollinearity</a> complicates using statistical significance to choose among the variables to include in the model.</p>
<p>When the data have more variables than samples, especially when the predictor variables are highly collinear, it’s a good time to consider partial least squares regression.</p>
How to Perform Partial Least Squares Regression
<p>Try these steps if you want to follow along in Minitab Statistical Software using the soybean data:</p>
<ol>
<li>Choose <strong>Stat > Regression > Partial Least Squares</strong>.</li>
<li>In <strong>Responses</strong>, enter <em>Fat</em>.</li>
<li>In <strong>Model</strong>, enter <em>‘1’-‘88’</em>.</li>
<li>Click <strong>Options</strong>.</li>
<li>Under <strong>Cross-Validation</strong>, select <strong>Leave-one-out</strong>. Click OK.</li>
<li>Click <strong>Results</strong>.</li>
<li>Check <strong>Coefficients</strong>. Click <strong>OK </strong>twice.</li>
</ol>
<p>One of the great things about partial least squares regression is that it forms components and then does ordinary least squares regression with them. Thus the results include statistics that are familiar. For example, <a href="http://blog.minitab.com/blog/adventures-in-statistics/multiple-regession-analysis-use-adjusted-r-squared-and-predicted-r-squared-to-include-the-correct-number-of-variables">predicted R2</a> is the criterion that Minitab uses to choose the number of components.</p>
<p style="margin-left: 40px;"><br />
<img alt="Minitab selects the model with the highest predicted R-squared." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/12f2493e350eb84a657035b915a5f45f/model_selection.gif" style="width: 476px; height: 194px;" /></p>
<p>Each of the 9 components in the model that maximizes the predicted R2 value is a complex linear combination of all 88 of the variables. So although the ANOVA table shows that you’re using only 9 degrees of freedom for the regression, the analysis uses information from all of the data.</p>
<p style="margin-left: 40px;"><img alt="The regression uses 9 degrees of freedom." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/ce90634261a6cd8994f8e72682473d74/anova.gif" style="width: 381px; height: 113px;" /></p>
<p> The full list of standardized coefficients shows the relative importance of each predictor in the model. (I’m only showing a portion here because the table is 88 rows long.)</p>
<p style="margin-left: 40px;"><br />
<img alt="Each variable has a standardized coefficient." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/b881dc2c5a4b26fa7330a0dbd9e70c8a/coefficients.gif" style="width: 255px; height: 284px;" /></p>
<p>Ordinary least squares regression is a great tool that’s allowed people to make lots of good decisions over the years. But there are times when it’s not satisfying. Got too much detail in your data? Partial least squares regression could be the answer.</p>
<p>Want more partial least squares regression now? Check out how <a href="http://www.minitab.com/en-US/Case-Studies/Unifi-Manufacturing-Inc/">Unifi used partial least squares to improve their processes faster</a>.</p>
<span style="color:#a9a9a9;">The image of the soybeans is by Tammy Green </span><span style="color:#a9a9a9;">and is licensed for reuse under this</span> <a href="http://creativecommons.org/licenses/by-sa/2.0/deed.en">Creative Commons License</a>.
Data Analysis, Regression Analysis, Statistics
Wed, 28 Sep 2016 12:00:00 +0000
http://blog.minitab.com/blog/statistics-and-quality-improvement/fix-problems-in-regression-analysis-with-partial-least-squares
Cody Steele
Validating Process Changes with Design of Experiments (DOE)
http://blog.minitab.com/blog/real-world-quality-improvement/validating-process-changes-with-design-of-experiments-doe
<p>We’ve got a plethora of <a href="https://www.minitab.com/en-us/company/case-studies/" target="_blank">case studies</a> showing how businesses from different industries solve problems and implement solutions with data analysis. Take a look for ideas about how you can use data analysis to ensure excellence at your business!</p>
<p>Boston Scientific, one of the world’s leading developers of medical devices, is just one organization that has shared its story. A team at its Heredia, Costa Rica facility assessed and validated a packaging process, which resulted in a streamlined process and a cost-saving redesign of the packaging.</p>
<p>Below is a brief look at how they did it, but you can also take a look at the full case study at <a href="https://www.minitab.com/Case-Studies/Boston-Scientific/" target="_blank">https://www.minitab.com/Case-Studies/Boston-Scientific/</a>.</p>
Their Challenge
<p><img alt="guidewires in pouch" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ccb8f6d6-3464-4afb-a432-56c623a7b437/Image/03b6326dcb90a56ca905abbc2526f38c/guidewires.jpg" style="width: 233px; height: 174px; float: right;" />Boston Scientific Heredia evaluates its operations regularly, to maintain process efficiency and contribute to affordable healthcare by reducing costs. At this facility, one packaging engineer led an effort to streamline packaging for guidewires—which are used during procedures such as catheter placement or endoscopic diagnoses—with the introduction of a new, smaller plastic pouch.</p>
<p>Using smaller and different packaging materials for their guidewires would substantially reduce material costs, but the company needed to prove that the new pouches would work with their sealing process, which creates a barrier that keeps the guidewires sterile.</p>
How Data Analysis Helped
<p>To ensure that the seal strength for the smaller pouches met or exceeded standards, they evaluated the process and identified several important factors, such as the temperature of the sealing system. They then used a statistical method called <a href="http://blog.minitab.com/blog/doe" target="_blank">Design of Experiments (DOE)</a> to determine how each of the variables affected the quality of the pouch seal.</p>
<p>The DOE revealed which factors were most critical. Below is a Minitab <a href="http://blog.minitab.com/blog/understanding-statistics/when-to-use-a-pareto-chart" target="_blank">Pareto Chart</a> that identified the factors that significantly affect seal strength: front temperature, rear temperature, and their respective two-way interaction.</p>
<p><img alt="https://www.minitab.com/uploadedImages/Content/Case_Studies/EffectsParetoforAveragePull.jpg" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ccb8f6d6-3464-4afb-a432-56c623a7b437/Image/abd4d05d00cf48c8b22ecc37e1264e93/pareto_chart.jpg" style="border-width: 0px; border-style: solid; width: 600px; height: 400px;" /></p>
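<p>The case study's raw data isn't public, but the arithmetic behind a Pareto chart of effects is simple to sketch. In a two-level factorial design, each effect is the difference between the mean response at a factor's high and low settings; the seal-strength numbers below are entirely hypothetical:</p>

```python
import numpy as np

# Hypothetical replicated 2^2 factorial in coded units (-1, +1):
# A = front temperature, B = rear temperature; response = seal strength.
A = np.array([-1,  1, -1,  1, -1,  1, -1,  1])
B = np.array([-1, -1,  1,  1, -1, -1,  1,  1])
y = np.array([8.2, 11.9, 9.1, 14.8, 8.0, 12.1, 9.3, 15.0])

# An effect is the mean response at the high level minus the mean at the
# low level; the A*B interaction uses the elementwise product of the codes.
def effect(contrast):
    return y[contrast == 1].mean() - y[contrast == -1].mean()

print(effect(A), effect(B), effect(A * B))
```

<p>Ranking the absolute effects is exactly what the Pareto chart displays; a formal analysis would also attach standard errors estimated from the replicates.</p>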
<p>Armed with this knowledge, the team devised optimal process settings to ensure the new pouches had strong seals. To verify the effectiveness of the improved process, they used a statistical tool called capability analysis, which demonstrates whether or not a process meets specifications and can produce good results:</p>
<p><img alt="https://www.minitab.com/uploadedImages/Content/Case_Studies/ProcessCapabilityofHighSettings-SealStrength.jpg" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ccb8f6d6-3464-4afb-a432-56c623a7b437/Image/c4d94b5ee153c1d3e38757565a5d24c2/process_capa.jpg" style="border-width: 0px; border-style: solid; width: 600px; height: 400px;" /></p>
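<p>At its core, a capability analysis compares the process spread to the specification limits. Here is a minimal sketch with hypothetical seal-strength data and an invented lower spec limit only (a seal must be at least so strong):</p>

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical seal-strength measurements; the spec limit is invented too.
lsl = 1.0
strength = rng.normal(loc=2.0, scale=0.15, size=100)

# One-sided capability index CPL: distance from the process mean to the
# lower spec limit, in units of three standard deviations.
cpl = (strength.mean() - lsl) / (3 * strength.std(ddof=1))
print(round(cpl, 2))
```

<p>Values comfortably above the common 1.33 benchmark suggest the process can reliably meet the requirement; Minitab's capability output adds within/overall distinctions and confidence bounds on top of this basic ratio.</p>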
Results
<p>The analysis showed that guidewires packaged using the new, optimal process settings met, and even exceeded, the minimum seal strength requirements.</p>
<p>With the new pouches, Boston Scientific has saved more than $330,000. “At the end of the day,” a key team member noted, “the more money we save, the more additional savings we can pass on to the people we serve.”</p>
<p><em>For another example of how Boston Scientific uses data analysis to ensure the safety and reliability of its products, read <a href="https://www.minitab.com/Case-Studies/Boston-Scientific-Heredia/" target="_blank">Pulling Its Weight: Tensile Testing Challenge Speeds Regulatory Approval for Boston Scientific</a>, a story about how the company used Minitab Statistical Software to confirm the equivalency of its catheter’s pull-wire strength to previous testing results, and eliminate the need to perform test method validation by leveraging its existing tension testing standard.</em></p>
Data Analysis, Design of Experiments, Quality Improvement, Statistics, Stats
Mon, 26 Sep 2016 12:00:00 +0000
http://blog.minitab.com/blog/real-world-quality-improvement/validating-process-changes-with-design-of-experiments-doe
Carly Barry
Descriptive vs. Inferential Statistics: When Is a P-value Superfluous?
http://blog.minitab.com/blog/statistics-and-quality-data-analysis/descriptive-vs-inferential-statistics-when-is-a-p-value-superfluous
<p>True or false: When comparing a parameter for two sets of measurements, you should always use a hypothesis test to determine whether the difference is statistically significant.</p>
<p>The answer? (<em>drumroll...</em>) True!</p>
<p>...and False!</p>
<p>To understand this paradoxical answer, you need to keep in mind the difference between samples, populations, and descriptive and inferential statistics. </p>
Descriptive Statistics and Populations
<p>Consider the fictional countries of Glumpland and Dolmania.</p>
<p style="text-align: center;"><img alt="Welcome to Glumpland!" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c1f88e0e6d3e4e55684392ec5a8069e8/glumpland.jpg" style="width: 350px; height: 232px;" /></p>
<img alt="wkshet" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/47e5470dd8123218763ac3666f64bbdd/glumpland_dolmania_wkshet.jpg" style="line-height: 20.8px; width: 222px; height: 579px; float: right;" />
<p>The population of Glumpland is 8,442,012. The population of Dolmania is 6,977,201. For each country, the age of every citizen (to the nearest tenth) <a href="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/080981611ba11403dc8fde411e81d150/glumpland_and_dolmania_ages.mpj">is recorded in a cell of a Minitab worksheet</a>.</p>
<p>Using <strong>Stat > Basic Statistics > Display Descriptive Statistics</strong> we can quickly calculate the mean age of each country.</p>
<p style="margin-left: 40px;"><img alt="desc stats" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/1a791dd23ba85673193f20c2c9971fa4/mean_age_glump_and_dol.jpg" style="width: 316px; height: 96px;" /></p>
<p>It looks like Dolmanians are, on average, more youthful than Glumplanders. But is this difference in means statistically significant?</p>
<p>To find out, we might be tempted to evaluate these data using a <span><a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-t-tests%3A-1-sample%2C-2-sample%2C-and-paired-t-tests">2-sample t-test</a></span>.</p>
<p>Except for one thing: there's absolutely no point in doing that.</p>
<p>That's because these calculated means <em>are</em> the means of the entire populations. So we already know that the population means differ.</p>
<p>Another example. Suppose a baseball player gets 213 hits in 680 at bats in 2015, and 178 hits in 532 at bats in 2016.</p>
<p>Would you need a 2-proportions test to determine whether the difference in batting averages (.313 vs .335) is statistically significant? Of course not.</p>
<p>You've already calculated the proportions using all the data for the entire two seasons. There's nothing more to extrapolate. And yet you often see a hypothesis test applied in this type of situation, in the mistaken belief that if there's no p-value, the results aren't "solid" or "statistical" enough.</p>
<p>But if you've collected every possible piece of data for a population, that's about as solid as you can get!</p>
Inferential Statistics and Random Samples
<p>Now suppose that draconian budget cuts have made it infeasible to track and record the age of every resident in Glumpland and Dolmania. <span style="line-height: 1.6;">What can they do? </span></p>
<p><span style="line-height: 1.6;">Quite a lot, actually. They can apply inferential statistics, which is based on random sampling, to make reliable estimates without those millions of data values they don't have.</span></p>
<p>To see how it works, use <strong>Calc > Random Data > Sample from columns</strong> in Minitab. Randomly sample 50 values from the 8,442,012 values in column C1, which contains the ages of the entire population of Glumpland. Then use descriptive statistics to calculate the mean of the sample.</p>
<p>Here are the results for one random sample of 50:</p>
<p style="margin-left: 40px;"><strong>Descriptive Statistics: GPLND (50)</strong><br />
<span style="line-height: 1.6;">Variable Mean</span><br />
<span style="line-height: 1.6;">GPLND(50) 52.37</span></p>
<p>The sample mean, 52.37, is slightly less than the true mean age of 53 for the entire population of Glumpland. What about another random sample of 50?</p>
<p style="margin-left: 40px;"><strong>Descriptive Statistics: GPLND (50) </strong><br />
<span style="line-height: 1.6;">Variable Mean</span><br />
<span style="line-height: 1.6;">GPLND(50) 54.11</span></p>
<p>Hmm. This sample mean of 54.11 slightly <em>overshoots</em> the true population mean of 53.</p>
<p>Even though the sample estimates are in the ballpark of the true population mean, we're seeing some variation. <span style="line-height: 1.6;">How much variation can we expect? Using descriptive statistics alone, we have no inkling of how "close" a sample estimate might be to the truth. </span></p>
Enter...the Confidence Interval
<p>To quantify the precision of a sample estimate for the population, we can use a powerful tool in inferential statistics: the confidence interval.</p>
<p>Suppose you take random samples of size 5, 10, 20, 50, and 100 from Glumpland and Dolmania using <strong>Calc > Random Data > Sample from columns</strong>. Then use <strong>Graph > Interval Plot > Multiple Ys</strong> to display the 95% confidence intervals for the mean of each sample.</p>
<p>Here's what the interval plots look like for the random samples in my worksheet.</p>
<p style="margin-left: 40px;"><img alt="interval plot Glumpland" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/262031cc398ee9d48031fe1f43b38bdf/interval_plot_of_glumpland.jpg" style="line-height: 20.8px; width: 576px; height: 384px;" /></p>
<p style="margin-left: 40px;"><img alt="Interval plot Dolmania" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/75440d94eaff64a63e338b480029945b/interval_plot_of_dolmania.jpg" style="width: 576px; height: 384px;" /></p>
<p>Your plots will look different based on your random samples, but you should notice a similar pattern: The sample mean estimates (the blue dots) tend to vary more from the population mean as the sample sizes decrease. To compensate for this, the intervals "stretch out" more and more, to ensure the same 95% overall probability of "capturing" the true population mean.</p>
<p>The larger samples produce narrower intervals. In fact, using only 50-100 data values, we can closely estimate the mean of over 8.4 million values, and get a general sense of how precise the estimate is likely to be. That's the incredible power of random sampling and inferential statistics!</p>
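<p>You can reproduce this shrinking-interval pattern with SciPy's t-based interval on a synthetic "population" (a made-up stand-in for the Glumpland worksheet):</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# A synthetic stand-in for Glumpland: one million ages centered near 53.
population = rng.normal(loc=53.0, scale=15.0, size=1_000_000)

# 95% t-based confidence intervals for the mean at several sample sizes.
widths = {}
for n in (5, 10, 20, 50, 100):
    sample = rng.choice(population, size=n, replace=False)
    half_width = stats.t.ppf(0.975, df=n - 1) * stats.sem(sample)
    widths[n] = 2 * half_width
    print(n, round(sample.mean() - half_width, 1), round(sample.mean() + half_width, 1))
```

<p>The intervals narrow as n grows for the same two reasons Minitab's plots show: the standard error shrinks with the sample size, and the t multiplier shrinks as the degrees of freedom increase.</p>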
<p>To display side-by-side confidence intervals of the mean estimates for Glumpland and Dolmania, you can use an interval plot with groups.</p>
<p style="margin-left: 40px;"><img alt="interval plot side by side" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/9e6348c87befdaf6434dbe80e8257516/interval_plot_of_age_side_by_side.jpg" style="width: 576px; height: 384px;" /></p>
<p>Now, you might be tempted to use these results to infer whether there's a statistically significant difference in the mean age of the populations of Glumpland and Dolmania. But don't. Confidence intervals can be misleading for that purpose.</p>
<p>For that, we need another powerful tool of inferential statistics...</p>
Enter...the hypothesis test and p-value
<p>The 2-sample t-test is used to determine whether there is a statistically significant difference in the means of the populations from which the two random samples were drawn. The following table shows the t-test results for each pair of same-sized samples from Glumpland and Dolmania. As the sample size increases, notice what happens to the p-value and the confidence interval for the difference between the population means.</p>
<p style="margin-left: 40px;"><img alt="t tests" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/7c1bf45756a7fb621094086e5350fef9/2_sample_t_test.jpg" style="width: 526px; height: 757px;" /></p>
<p>Again, the confidence intervals tend to get wider as the samples get smaller. With smaller samples, we're less certain of the precision of the estimate for the difference.</p>
<p>In fact, only for the two largest random samples (N=50 and N=100) is the p-value less than a 0.05 level of significance, allowing us to conclude that the mean ages of Glumplanders and Dolmanians are statistically different. For the three smallest samples (N=20, N=10, N=5), the p-value is greater than 0.05, and the confidence interval for each of these small samples includes 0. Therefore, we cannot conclude that there is a difference in the population means.</p>
<p>But remember, we already know that the true population means actually <em>do</em> differ by 5.4 years. We just can't statistically "prove" it with the small samples. That's why statisticians bristle when someone says, "The p-value is not less than 0.05. Therefore, there's no significant difference between the groups." There might very well be. So it's safer to say, especially with small samples, "<em>we don't have enough evidence </em>to conclude that there's a significant difference between the groups."</p>
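<p>This is easy to verify in a quick simulation outside Minitab: build two synthetic populations whose means truly differ by 5.4 years, then run SciPy's 2-sample t-test on ever-smaller random samples. (The numbers here are invented stand-ins for the worksheet data.)</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Two synthetic populations whose means genuinely differ by 5.4 years.
glumpland = rng.normal(53.0, 15.0, size=500_000)
dolmania = rng.normal(47.6, 15.0, size=500_000)

pvals = {}
for n in (5, 20, 100, 2000):
    g = rng.choice(glumpland, size=n, replace=False)
    d = rng.choice(dolmania, size=n, replace=False)
    t_stat, pvals[n] = stats.ttest_ind(g, d)
    print(n, round(pvals[n], 4))
```

<p>Small samples routinely return p-values above 0.05 even though the population difference is real, which is exactly why "no significant difference" must be read as "not enough evidence of a difference", never as "no difference".</p>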
<p>It's not just a matter of nit-picky semantics. It's simply the truth, as you can see when you take random samples of various sizes from the same known populations and test them for a difference.</p>
Wrap-up
<p>If you have a random sample, you should always accompany estimates of statistical parameters with a confidence interval and p-value, whenever possible. Without them, there's no way to know whether you can safely extrapolate to the entire population. But if you already know every value of the population, you're good to go. You don't need a p-value, a t-test, or a CI—any more than you need a clue to determine what's inside a box, if you already know what's in it.</p>
Data Analysis, Hypothesis Testing, Learning, Statistics
Fri, 23 Sep 2016 12:08:00 +0000
http://blog.minitab.com/blog/statistics-and-quality-data-analysis/descriptive-vs-inferential-statistics-when-is-a-p-value-superfluous
Patrick Runkel
Problems Using Data Mining to Build Regression Models
http://blog.minitab.com/blog/adventures-in-statistics/problems-using-data-mining-to-build-regression-models
<p><img alt="Picture of mining truck filled with numbers" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/644d98694f1e6fec63d4f1db6b61a074/data_mining_crop.jpg" style="width: 250px; height: 171px; float: right; margin: 10px 15px;" />Data mining uses algorithms to explore correlations in data sets. An automated procedure sorts through large numbers of variables and includes them in the model based on statistical significance alone. No thought is given to whether the variables and the signs and magnitudes of their coefficients make theoretical sense.</p>
<p>We tend to think of data mining in the context of big data, with its huge databases and servers stuffed with information. However, it can also occur on the smaller scale of a research study.</p>
<p>The comment below is a real one that illustrates this point.</p>
<blockquote>“Then, I moved to the Regression menu and there I could add all the terms I wanted and more. Just for fun, I added many terms and performed backward elimination. Surprisingly, some terms appeared significant and my R-squared Predicted shot up. To me, your concerns are all taken care of with R-squared Predicted. If the model can still predict without the data point, then that's good.”</blockquote>
<p>Comments like this are common and emphasize the temptation to select regression models by trying as many different combinations of variables as possible and seeing which model produces the best-looking statistics. The overall gist of this type of comment is, "What could possibly be wrong with using data mining to build a regression model if the end results are that all the p-values are significant and the various types of R-squared values are all high?"</p>
<p>In this blog post, I’ll illustrate the problems associated with using data mining to build a regression model in the context of a smaller-scale analysis.</p>
An Example of Using Data Mining to Build a Regression Model
<p>My first order of business is to prove to you that data mining can have severe problems. I really want to bring the problems to life so you'll be leery of using this approach. Fortunately, this is simple to accomplish because I can use data mining to make it appear that a set of randomly generated <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/regression-models/what-are-response-and-predictor-variables/" target="_blank">predictor variables</a> explains most of the changes in a randomly generated <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/regression-models/what-are-response-and-predictor-variables/" target="_blank">response variable</a>!</p>
<p>To do this, I’ll create a worksheet in Minitab statistical software that has 100 columns, each of which contains 30 rows of entirely random data. In Minitab, you can use <strong>Calc > Random Data > Normal</strong> to create your own worksheet with random data, or you can use <a href="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/File/c740effad4cc27dc6580093ea6c070fd/randomdata.mtw">this worksheet</a> that I created for the data mining example below. (If you don’t have Minitab and want to try this out, <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">get the free 30 day trial!</a>)</p>
<p>Next, I’ll perform <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-smackdown-stepwise-versus-best-subsets" target="_blank">stepwise regression</a> using column 1 as the response variable and the other 99 columns as the potential predictor variables. This scenario produces a situation where stepwise regression is forced to dredge through 99 variables to see what sticks, which is a key characteristic of data mining.</p>
<p>When I perform stepwise regression, the procedure adds 28 variables that explain 100% of the variance! Because we only have 30 observations, we’re clearly overfitting the model. Overfitting the model is a different problem that also inflates R-squared, which you can read about in my post about <a href="http://blog.minitab.com/blog/adventures-in-statistics/the-danger-of-overfitting-regression-models" target="_blank">the dangers of overfitting models</a>.</p>
<p>I’m specifically addressing the problems of data mining in this post, so I don’t want a model that is also overfit. To avoid an overfit model, a good rule of thumb is to include no more than one term for each 10 observations. We have 30 observations, so I’ll include only the first three variables that the stepwise procedure adds to the model: C7, C77, and C95. The output for the first three steps is below.</p>
<p style="margin-left: 40px;"><img alt="Stepwise regression output" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/e4fb01237dd0c8b34496dde3cc28b517/stepwise_swo.png" style="width: 498px; height: 251px;" /></p>
<p>Under step 3, we can see that all of the <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients" target="_blank">coefficient p-values</a> are statistically significant. The <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit" target="_blank">R-squared</a> value of 67.54% can either be good or mediocre depending on your field of study. In a real study, there are likely to be some real effects mixed in that would boost the R-squared even higher. We can also look at <a href="http://blog.minitab.com/blog/adventures-in-statistics/multiple-regession-analysis-use-adjusted-r-squared-and-predicted-r-squared-to-include-the-correct-number-of-variables" target="_blank">the adjusted and predicted R-squared values</a> and neither one suggests a problem.</p>
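<p>You can sketch the heart of this dredging problem with nothing more than NumPy and SciPy: generate pure noise, then screen 99 noise "predictors" for the one most correlated with a noise "response", which is essentially what the first step of a stepwise search does:</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2016)

# 30 rows by 100 columns of pure noise: column 0 plays the response,
# columns 1-99 play the candidate predictors.
data = rng.normal(size=(30, 100))
y, candidates = data[:, 0], data[:, 1:]

# Pick the candidate most correlated with the response.
r = np.array([stats.pearsonr(candidates[:, j], y)[0] for j in range(99)])
winner = int(np.argmax(np.abs(r)))
best_r, best_p = stats.pearsonr(candidates[:, winner], y)
print(winner, round(best_r, 2), round(best_p, 4))
```

<p>Because the search quietly ran 99 comparisons, the winning column's p-value looks far more impressive than it deserves: the reported significance never accounts for all the candidates that were tried and discarded.</p>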
<p>If we look at the model building process of steps 1 - 3, we see that at each step all of the R-squared values increase. That’s what we like to see. For good measure, let’s graph the relationship between the predictor (C7) and the response (C1). After all, seeing is believing, right?</p>
<p style="margin-left: 40px;"><img alt="Scatterplot of two variables in regression model" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/6e4dfb991b33031738756d4b2d1c77e4/scatterplot.png" style="width: 576px; height: 384px;" /></p>
<p>This graph looks good too! It sure appears that as C7 increases, C1 tends to increase, which agrees with the positive <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients" target="_blank">regression coefficient</a> in the output. If we didn’t know better, we’d think that we have a good model!</p>
<p>This example answers the question posed at the beginning: what could possibly be wrong with this approach? Data mining can produce deceptive results. The statistics and graph all look good but these results are based on entirely random data with absolutely no real effects. Our regression model suggests that random data explain other random data even though that's impossible. Everything looks great but we have a lousy model.</p>
<p>The problems associated with using data mining are real, but how the heck do they happen? And how do you avoid them? Read my next post to learn the answers to these questions!</p>
ANOVA, Data Analysis, Regression Analysis, Statistics, Statistics Help, Stats
Wed, 21 Sep 2016 12:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/problems-using-data-mining-to-build-regression-models
Jim Frost
Whatever Happened to…the Ozone Hole?
http://blog.minitab.com/blog/statistics-and-quality-data-analysis/whatever-happened-to%E2%80%A6the-ozone-hole
<p><img alt="ozone hole" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/36ffdad1772934f71f8550dc13d4deca/ozone_hole.jpg" style="width: 300px; height: 279px; float: right; margin: 10px 15px;" />Today, September 16, is <a href="https://en.wikipedia.org/wiki/International_Day_for_the_Preservation_of_the_Ozone_Layer" target="_blank">World Ozone Day</a>. You don't hear much about the ozone layer any more.</p>
<p>In fact, if you’re under 30, you might think this is just another trivial, obscure observance, along the lines of <a href="https://www.daysoftheyear.com/days/international-dot-day/" target="_blank">International Dot Day</a> (yesterday) or <a href="http://www.nationaldaycalendar.com/national-apple-dumpling-day-september-17/" target="_blank">National Apple Dumpling Day</a> (tomorrow).</p>
<p>But there’s a good reason that, almost 30 years ago, the United Nations designated today as a day to raise awareness of the ozone layer: unlike dots and apple dumplings, this fragile shield of gas in the stratosphere, which acts as a natural sunscreen against dangerous levels of UV radiation, is critical to sustain life on our planet. </p>
<p>In this post, we'll join the efforts of educators around the globe who organize special activities on this day, by using Minitab to statistically analyze ozone-related data. You can follow along using the data in <a href="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/b268a053e038bb53a94a5b38360899be/world_ozone_day.mpj" target="_blank">this Minitab project</a>. If you don't already have it, you can <a href="https://www.minitab.com/products/minitab/free-trial/">download Minitab here and use it free for 30 days</a>.</p>
Orthogonal Regression: Can You Trust Your Data?
<p><img alt="NIST data" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/385e642e6552abb2aa033b9194f0ae08/nist_data_worksheet.jpg" style="width: 158px; height: 298px; float: right; margin: 10px 15px;" />Before you analyze data, it's important to verify that your measuring system is accurate. Orthogonal regression, also known as Deming regression, is a tool used to evaluate whether two instruments or methods provide comparable measurements.</p>
<p>The following sample data is from the <a href="http://www.itl.nist.gov/div898/strd/lls/data/Norris.shtml" target="_blank">National Institute of Standards (NIST) web site</a>. The predictor variable <span style="line-height: 20.8px;">(x)</span> is the NIST measurement of ozone concentration. The response variable (y) is the measurement of ozone concentration using a customer's measuring device.</p>
<p>In Minitab, choose <strong>Stat > Regression > Orthogonal Regression</strong>. Enter <em>C1</em> as the <strong>Response (Y)</strong> and <em>NIST</em> as the <strong>Predictor (X)</strong>. Enter 1.5 as the <strong>Error Variance ratio (Y/X)</strong> and click <strong>OK</strong>.</p>
<p><em><strong>Note</strong>: The error variance ratio is based on historic data, not the sample data. Because the ratio is not available for these data, we'll use 1.5 purely for illustrative purposes. To learn more about this ratio, and how to estimate it, see the comments following <a href="http://blog.minitab.com/blog/real-world-quality-improvement/orthogonal-regression-testing-the-equivalence-of-instruments" target="_blank">this Minitab blog post</a>. </em></p>
<em><strong><span style="line-height: 1.6;">Orthogonal Regression Analysis: Device versus NIST </span></strong></em>
<p>The fitted line plot shows that the two sets of measurements appear almost identical. That's about as good as it gets:</p><img alt="fitted plot" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/480385f6576f9121e8d90788f8ee8c9e/nist_plot_with_fitted_line.jpg" style="width: 576px; height: 384px; margin: 10px 15px;" />
<p><span style="line-height: 20.8px;">Now look at the numerical output. If there's perfect correlation, and no bias, you'd expect to see a constant value of 0 and a slope of 1 in the regression equation. </span></p>
<p style="margin-left: 40px;">Error Variance Ratio (Device/NIST): 1.5</p>
<p style="margin-left: 40px;">Regression Equation<br />
Device = <strong><span style="color:#0000FF;">-0.263</span></strong> + <strong><span style="color:#FF0000;">1.002</span></strong> NIST</p>
<p style="margin-left: 40px;">Coefficients</p>
<p style="margin-left: 40px;">Predictor Coef SE Coef Z P Approx 95% CI<br />
<strong><span style="color:#0000FF;">Constant</span></strong> -0.26338 0.232819 -1.1313 0.258 <strong><span style="color:#0000FF;">(-0.71969, 0.19294)</span></strong><br />
<strong><span style="color:#FF0000;">NIST </span></strong> 1.00212 0.000430 2331.6058 0.000 <strong><span style="color:#FF0000;">( 1.00128, 1.00296)</span></strong></p>
<p>To assess this, look at the 95% confidence intervals for the coefficients. The confidence interval for the constant includes 0. The confidence interval for the predictor variable (NIST) is extremely close to 1, but does not include 1. Technically, there is some bias, although it may be too small to be relevant. In cases like this, rely on your practical knowledge in the field to determine whether the amount of bias is important.</p>
<p>I'm no ozone expert, but given the sample measurements, I'd speculate that this tiny amount of bias is not critical.</p>
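<p><em>Minitab handles the computation through the dialog above, but the Deming fit itself has a simple closed form. The sketch below shows the mechanics in Python; the six measurement pairs are illustrative placeholders, not the NIST data, and because the two columns agree exactly, the fit recovers the "no bias" benchmark of intercept 0 and slope 1.</em></p>

```python
import numpy as np

def deming_fit(x, y, delta=1.5):
    """Orthogonal (Deming) regression intercept and slope.

    delta is the assumed error variance ratio Var(y errors)/Var(x errors),
    i.e., the dialog's "Error Variance ratio (Y/X)".
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxx = np.var(x, ddof=1)
    syy = np.var(y, ddof=1)
    sxy = np.cov(x, y, ddof=1)[0, 1]
    slope = (syy - delta * sxx
             + np.sqrt((syy - delta * sxx) ** 2 + 4 * delta * sxy ** 2)) / (2 * sxy)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

# Placeholder measurements: two instruments in perfect agreement.
nist = [0.2, 0.5, 1.1, 2.0, 3.4, 4.4]
device = list(nist)
b0, b1 = deming_fit(nist, device, delta=1.5)
```

<p><em>Unlike ordinary least squares, this fit allows for measurement error in both x and y, which is why it is the right tool for comparing two measuring devices.</em></p>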
Plotting the Size of the Ozone Hole
<p>Usually holes just get bigger over time. Like the holes in my socks and sweaters. </p>
<p><span style="line-height: 1.6;">But what about the size of the hole in the ozone layer above Antarctica? </span></p>
<p>As part of the Ozone Hole Watch project, NASA scientists have been tracking the size of the ozone hole of the Southern Hemisphere for years. I copied the data into a Minitab project, and then used <strong>Graph > Time Series Plot > Multiple</strong> to plot both the mean ozone hole area and the maximum daily ozone hole area, by year. </p>
<p style="margin-left: 40px;"><img alt="Time series plot " src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/30ea9b376e2fe868bd2af2b71996ba78/time_series_plot_of_ozone_hole_size.jpg" style="width: 576px; height: 384px;" /></p>
<p>The plot shows why the ozone hole was such a big deal back in the 1980s. The size of the hole was increasing rapidly, trending toward a potential environmental crisis. No wonder, then, that on September 16, 1987, the United Nations adopted the <a href="https://en.wikipedia.org/wiki/Montreal_Protocol" target="_blank" title="Montreal Protocol">Montreal Protocol</a>, an international agreement to reduce ozone-depleting substances such as chlorofluorocarbons. That agreement, eventually signed by nearly 200 nations, is credited with stabilizing the size of the ozone hole at the end of the 20th century, according to <a href="http://research.noaa.gov/News/NewsArchive/LatestNews/TabId/684/ArtMID/1768/ArticleID/10741/Report-telltale-signs-that-ozone-layer-is-recovering-.aspx" target="_blank">NOAA</a> and the <a href="http://ozone.unep.org/Assessment_Panels/SAP/SAP2014_Assessment_for_Decision-Makers.pdf" target="_blank">World Meteorological Organization</a>.</p>
One-Way ANOVA: Seasonal Changes in the Ozone Layer
<p>The ozone layer is not static, but varies by latitude, season, and stratospheric conditions. The "typical" thickness of the ozone layer is about 300 Dobson units (DU).</p>
<p>The Lauder Ozone worksheet in the Minitab project linked above contains random samples of <a href="http://data.mfe.govt.nz/" target="_blank">total ozone column measurements taken in Lauder, New Zealand in 2013</a>. For this analysis, the seasons are defined as Summer = Dec-Feb, Fall = Mar-May, Winter = Jun-Aug, and Spring = Sept-Nov.</p>
<p>To evaluate whether there are statistically significant differences in mean ozone by season using Minitab, choose <strong>Stat > ANOVA > One-Way...</strong> In the dialog box, select <strong>Response data are in a separate column for each factor level</strong>. As <strong>Responses</strong>, enter <em>Summer</em>, <em>Fall</em>, <em>Winter</em>, <em>Spring</em>. Click <strong>Options</strong>, and uncheck <strong>Assume equal variances</strong>. Click <strong>Comparisons</strong> and check <strong>Games-Howell</strong>. After you click <strong>OK</strong> in each dialog box, Minitab returns the following output.</p>
<p style="margin-left: 40px;"><img alt="interval plot ozone" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/7399a0d7c95eb4250406327dfd1d0a52/interval_plot_of_ozone.jpg" style="width: 576px; height: 384px;" /></p>
<p style="margin-left: 40px;"><img alt="ozone session window" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/8eba7cc1fb6fa6b4c70cbe436924ad92/ozone_session_window.jpg" style="width: 433px; height: 552px;" /></p>
<p>At a 0.05 level of significance, the p-value (≈ 0.000) is less than alpha. Thus, we can conclude that there is a statistically significant difference in mean ozone thickness by season. The plot shows that mean ozone is lowest in Summer and Fall, and highest in Spring.</p>
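<p><em>Unchecking <strong>Assume equal variances</strong> tells Minitab to use Welch's method. As a rough sketch of what that option computes (not a reproduction of Minitab's output), here is Welch's one-way ANOVA in Python; the three small samples are made-up numbers, not the Lauder measurements:</em></p>

```python
import numpy as np
from scipy import stats

def welch_anova(*groups):
    """Welch's one-way ANOVA; does not assume equal group variances."""
    k = len(groups)
    n = np.array([len(g) for g in groups])
    m = np.array([np.mean(g) for g in groups])
    v = np.array([np.var(g, ddof=1) for g in groups])
    w = n / v                              # precision weights
    grand = np.sum(w * m) / np.sum(w)      # weighted grand mean
    a = np.sum(w * (m - grand) ** 2) / (k - 1)
    tmp = np.sum((1 - w / np.sum(w)) ** 2 / (n - 1))
    b = 1 + 2 * (k - 2) / (k ** 2 - 1) * tmp
    f = a / b
    df2 = (k ** 2 - 1) / (3 * tmp)
    return f, stats.f.sf(f, k - 1, df2)

# Made-up "seasonal" samples with clearly different means:
f, p = welch_anova([10, 11, 9, 10], [20, 21, 19, 20], [30, 29, 31, 30])
```

<p><em>The Games-Howell comparisons Minitab reports afterward are a separate follow-up procedure for deciding which pairs of groups differ.</em></p>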
<p><span style="line-height: 1.6;">Look at the 95% confidence intervals (CI). Are any seasons likely to have a mean ozone thickness less than 300 DU? Greater than 300 DU? Based on the pairwise comparisons chart, for which seasons does the mean ozone layer significantly differ?</span></p>
<p><span style="line-height: 1.6;">The ozone layer is just one factor in the myriad complex relationships between human activity and the global environment. So these analyses are just the tip of the iceberg</span>—one that's<span style="line-height: 1.6;"> melting as we speak.</span></p>
Data AnalysisLearningStatisticsStatistics in the NewsFri, 16 Sep 2016 12:00:00 +0000http://blog.minitab.com/blog/statistics-and-quality-data-analysis/whatever-happened-to%E2%80%A6the-ozone-holePatrick RunkelWhen to Use a Pareto Chart
http://blog.minitab.com/blog/understanding-statistics/when-to-use-a-pareto-chart
<p>I confess: I'm not a natural-born decision-maker. Some people—my wife, for example—can assess even very complex situations, consider the options, and confidently choose a way forward. Me? I get anxious about deciding what to eat for lunch. So you can imagine what it used to <span style="line-height: 1.6;">be like when I needed to confront a really big decision or problem. My approach, to paraphrase the Byrds, was "Re: everything, churn, churn, churn."<img alt="question to answer" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/1b29ab96a420030f3551f71a26773259/question.jpg" style="width: 250px; height: 181px; margin: 10px 15px; float: right;" /></span></p>
<p>Thank heavens for Pareto charts.</p>
What Is a Pareto Chart, and How Do You Use It?
<p>A Pareto chart is a basic quality tool that helps you identify the most frequent defects, complaints, or any other factor you can <strong>count </strong>and <strong>categorize</strong>. The chart takes its name from Vilfredo Pareto, originator of the "80/20 rule," which postulates that, roughly speaking, 20 percent of the people own 80 percent of the wealth. Or, in quality terms, 80 percent of the losses come from 20 percent of the causes.</p>
<p><span style="line-height: 20.8px;">You can use a Pareto chart any time you have data that are broken down into categories, and you can count how often each category occurs. As children, most of us learned how to use this kind of data to make a bar chart:</span></p>
<p style="margin-left: 40px;"><img alt="bar chart" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/90e6067d7f0a1f4f738462290a05f439/bar_chart.png" style="width: 576px; height: 384px;" /></p>
<p>A Pareto chart is just a bar chart that arranges the bars (counts) from largest to smallest, from left to right. The categories or factors symbolized by the bigger bars on the left are more important than those on the right.</p>
<p style="margin-left: 40px;"><img alt="Pareto Chart" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/bf0be8506cc30954165e854f24f0ed7d/pareto.png" style="width: 576px; height: 384px;" /></p>
<p>By ordering the bars from largest to smallest, a Pareto chart helps you visualize which factors comprise the 20 percent that are most critical—the "vital few"—and which are the "trivial many."</p>
<p>A cumulative percentage line helps you judge the added contribution of each category. If a Pareto effect exists, the cumulative line rises steeply for the first few defect types and then levels off. In cases where the bars are approximately the same height, the cumulative percentage line makes it easier to compare categories.</p>
<p>It's common sense to focus on the "vital few" factors. In the quality improvement arena, Pareto charts help teams direct their efforts where they can make the biggest impact. By taking a big problem and breaking it down into smaller pieces, a Pareto chart reveals where our efforts will create the most improvement.</p>
<p>If a Pareto chart seems rather basic, well, it is. But like a simple machine, its very simplicity makes the Pareto chart applicable to a very wide range of situations, both within and beyond quality improvement.</p>
Use a Pareto Chart Early in Your Quality Improvement Process
<p>At the leadership or management level, Pareto charts can be used at the start of a new round of quality improvement to figure out what business problems are responsible for the most complaints or losses, and dedicate improvement resources to those. Collecting and examining data like that can often result in surprises and upend an organization's "conventional wisdom." For example, leaders at one company believed that the majority of customer complaints involved product defects. But when they saw the complaint data in a Pareto chart, it showed that many more people complained about shipping delays. Perhaps the impression that defects caused the most complaints arose because the relatively few people who received defective products tended to complain very loudly—but since more customers were affected by shipping delays, the company's energy was better devoted to solving that problem.</p>
Use a Pareto Chart Later in Your Quality Improvement Process
<p>Once a project has been identified, and a team assembled to improve the problem, a Pareto chart can help the team select the appropriate areas to focus on. This is important because most business problems are big and multifaceted. For instance, shipping delays may occur for a wide variety of reasons, from mechanical breakdowns and accidents to data-entry mistakes and supplier issues. If there are many possible causes a team could focus on, it's smart to collect data about which categories account for the biggest number of incidents. That way, the team can choose a direction based on the numbers and not the team's "gut feeling."</p>
Use a Pareto Chart to Build Consensus
<p>Pareto charts also can be very helpful in resolving conflicts, particularly if a project involves many moving parts or crosses over many different units or work functions. Team members may have sharp disagreements about how to proceed, either because they wish to defend their own departments or because they honestly believe they <em>know </em>where the problem lies. For example, a hospital project improvement team was stymied in reducing operating room delays because the anesthesiologists blamed the surgeons, while the surgeons blamed the anesthesiologists. When the project team collected data and displayed it in a Pareto chart, it turned out that neither group accounted for a large proportion of the delays, and the team was able to stop finger-pointing. Even if the chart had indicated that one group or the other was involved in a significantly greater proportion of incidents, helping the team members see which types of delays were most 'vital' could be used to build consensus.</p>
Use Pareto Charts Outside of Quality Improvement Projects
<p>Their simplicity also makes <span><a href="http://blog.minitab.com/blog/real-world-quality-improvement/pareto-chart-power">Pareto charts</a> a valuable tool for making decisions beyond the world of quality improvement. By helping you visualize the relative importance of various categories, you can use them to prioritize customer needs, opportunities for training or investment—even your choices for lunch.</span></p>
How to Create a Pareto Chart
<p>Creating a Pareto chart is not difficult, even without statistical software. Of course, if you're using <a href="http://www.minitab.com/products/minitab/">Minitab</a>, the software will do all this for you automatically—create a Pareto chart by selecting <strong style="line-height: 1.6;">Stat > Quality Tools > Pareto Chart...</strong> or by selecting <strong style="line-height: 1.6;">Assistant > Graphical Analysis > Pareto Chart</strong>. You can collect raw data, in which each observation is recorded in a separate row of your worksheet, or summary data, in which you tally observation counts for each category.</p>
<p><strong>1. Gather Raw Data about Your Problem</strong></p>
<p>Be sure you collect a random sample that fully represents your process. For example, if you are counting the number of items returned to an electronics store in a given month, and you have multiple locations, you should not gather data from just one store and use it to make decisions about all locations. (If you want to compare the most important defects for different stores, you can show separate charts for each one side-by-side.)</p>
<p><strong>2. Tally Your Data</strong></p>
<p>Add up the observations in each of your categories.</p>
<p><strong>3. Label Your Horizontal and Vertical Axes</strong></p>
<p>Make the widths of all your horizontal bars the same and label the categories in order from largest to smallest. On the vertical axis, use round numbers that slightly exceed your top category count, and include your measurement unit.</p>
<p><strong>4. Draw Your Category Bars</strong></p>
<p>Using your vertical axis, draw bars for each category that correspond to their respective counts. Keep the width of each bar the same.</p>
<p><strong>5. Add Cumulative Counts and Lines</strong></p>
<p>As a final step, you can list the cumulative counts along the horizontal axis and draw a cumulative line over the top of your bars. Each category's cumulative count is the count for that category PLUS the total count of the preceding categories. If you want to add a line, draw a right axis labeled from 0 to 100%, lined up with the grand total on the left axis. Above the right edge of each category, mark a point at the cumulative total, then connect the points.</p>
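<p><em>The tally-sort-accumulate arithmetic in steps 2 through 5 takes only a few lines of code. In this Python sketch the complaint categories and counts are hypothetical:</em></p>

```python
# Hypothetical tallies from step 2 (category -> count).
tally = {"Shipping delay": 45, "Defect": 20, "Billing error": 15,
         "Wrong item": 12, "Other": 8}

# Order the bars largest-first (step 3), then accumulate counts
# into the cumulative percentages behind the line in step 5.
ordered = sorted(tally.items(), key=lambda kv: kv[1], reverse=True)
total = sum(tally.values())
running = 0
cumulative_pct = []
for name, count in ordered:
    running += count
    cumulative_pct.append((name, round(100 * running / total, 1)))
```

<p><em>Here the first three categories already account for 80% of the complaints: a textbook Pareto effect.</em></p>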
Data AnalysisLean Six SigmaProject ToolsQuality ImprovementStatisticsWed, 14 Sep 2016 12:00:00 +0000http://blog.minitab.com/blog/understanding-statistics/when-to-use-a-pareto-chartEston MartzControl Chart Tutorials and Examples
http://blog.minitab.com/blog/understanding-statistics/control-chart-tutorials-and-examples
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/3989007af54bf1e996aeee86c8cec497/control_chart_wow.jpg" style="border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 288px; height: 173px;" />The other day I was talking with a friend about control charts, and I wanted to share an example one of my colleagues wrote on the Minitab Blog. Looking back through the index for "control charts" reminded me just how much material we've published on this topic.</p>
<p>Whether you're just getting started with control charts, or you're an old hand at statistical process control, you'll find some valuable information and food for thought in our control-chart related posts. </p>
Different Types of Control Charts
<p>One of the first things you learn in statistics is that when it comes to data, there's no one-size-fits-all approach. To get the most useful and reliable information from your analysis, you need to select the method that best suits the type of data you have.</p>
<p>The same is true with control charts. While there are a few charts that are used very frequently, a wide range of options is available, and selecting the right chart can make the difference between actionable information and false (or missed) alarms.</p>
<p><a href="http://blog.minitab.com/blog/understanding-statistics/what-control-chart-should-i-use">What Control Chart Should I Use?</a> offers a brief overview of the most common charts and a discussion of how to use the Assistant to help you choose the right one for your situation. And if you're a control chart neophyte and you want more background on why we use them, check out <a href="http://blog.minitab.com/blog/understanding-statistics/control-charts-show-you-variation-that-matters" itemprop="url">Control Charts Show You Variation that Matters.</a></p>
<p itemprop="headline">We extol the virtues of a less commonly used chart in <a href="http://blog.minitab.com/blog/fun-with-statistics/an-ode-to-the-ewma-control-chart" itemprop="url">Beyond the "Regular Guy" Control Charts: An Ode to the EWMA Chart</a>, and explain how to use control charts to track rare events in <a href="http://blog.minitab.com/blog/data-analysis-and-quality-improvement-and-stuff/using-g-whiz-charts-to-track-elusive-affirmations-from-almost-adolescents" itemprop="url">Using G-Whiz Charts to Track Elusive Affirmations from Almost Adolescents</a>.</p>
<p itemprop="headline">In <a href="http://blog.minitab.com/blog/adventures-in-software-development/the-laney-p-chart-and-minitab-software-development" itemprop="url">Using the Laney P' Control Chart in Minitab Software Development</a>, Dawn Keller discusses the distinction between P' charts and their cousins, described by Tammy Serensits in <a href="http://blog.minitab.com/blog/the-statistics-of-science/p-and-u-charts-and-limburger-cheese-a-smelly-combination" itemprop="url">P and U Charts and Limburger Cheese: A Smelly Combination</a>.</p>
<p itemprop="headline">And it's good to remember that things aren't always as complicated as they seem, and sometimes a simple solution can be just as effective as a more complicated approach. See why in <a href="http://blog.minitab.com/blog/understanding-statistics/take-it-easy-create-a-run-chart" itemprop="url">Take It Easy: Create a Run Chart. </a></p>
Control Chart Tutorials
<p itemprop="headline">Many of our Minitab bloggers have talked about the process of choosing, creating, and interpreting control charts under specific conditions. If you have data that can't be collected in subgroups, you may want to learn about <a href="http://blog.minitab.com/blog/understanding-statistics/how-create-and-read-an-i-mr-control-chart" itemprop="url">How to Create and Read an I-MR Control Chart</a>. </p>
<p itemprop="headline">If you do have data collected in subgroups, you'll want to understand why, when it comes to <a href="http://blog.minitab.com/blog/michelle-paret/control-charts-subgroup-size-matters" itemprop="url">Control Charts, Subgroup Size Matters</a>.</p>
<p itemprop="headline">It's often useful to look at control chart data in calendar-based increments, and taking the monthly approach is discussed in the series <a href="http://blog.minitab.com/blog/understanding-statistics/creating-a-chart-to-compare-month-to-month-change" itemprop="url">Creating a Chart to Compare Month-to-Month Change</a> and <a href="http://blog.minitab.com/blog/understanding-statistics/creating-charts-to-compare-month-to-month-change-part-2" itemprop="url">Creating Charts to Compare Month-to-Month Change, part 2</a>.</p>
<p itemprop="headline">If you want to see the difference your process improvements have made, check out <a href="http://blog.minitab.com/blog/real-world-quality-improvement/analyzing-a-process-before-and-after-improvement-historical-control-charts-with-stages" itemprop="url">Analyzing a Process Before and After Improvement: Historical Control Charts with Stages</a> and <a href="http://blog.minitab.com/blog/starting-out-with-statistical-software/setting-the-stage-accounting-for-process-changes-in-a-control-chart" itemprop="url">Setting the Stage: Accounting for Process Changes in a Control Chart</a>. </p>
<p itemprop="headline">While the basic idea of control charting is very simple, interpreting real-world control charts can be a little tricky. If you're using <a href="http://www.minitab.com/products/minitab">Minitab 17</a>, be sure to check out this post about a great new feature in the Assistant: <a href="http://blog.minitab.com/blog/statistics-and-quality-improvement/the-stability-report-for-control-charts-in-minitab-17-includes-example-patterns" itemprop="url">The Stability Report for Control Charts in Minitab 17 includes Example Patterns.</a></p>
<p itemprop="headline">Finally, one of our expert statistical trainers offers his suggestions about <a href="http://blog.minitab.com/blog/applying-statistics-in-quality-projects/five-ways-to-make-your-control-charts-more-effective" itemprop="url">Five Ways to Make Your Control Charts More Effective</a>.</p>
Control Chart Examples
<p itemprop="headline">Control charts are most frequently used for quality improvement and assurance, but they can be applied to almost any situation that involves variation.</p>
<p itemprop="headline">My favorite example of applying the lessons of quality improvement in business to your personal life involves Bill Howell, who applied his Six Sigma expertise to the (successful) management of his diabetes. Find out how he uses <a href="http://blog.minitab.com/blog/real-world-quality-improvement/control-charts-keep-blood-sugar-in-check" itemprop="url">Control Charts to Keep Blood Sugar in Check</a>.</p>
<p itemprop="headline">Some of our bloggers have applied control charts to their personal passions, including holiday candies in <a href="http://blog.minitab.com/blog/real-world-quality-improvement/control-charts-rational-subgrouping-and-marshmallow-peeps" itemprop="url">Control Charts: Rational Subgrouping and Marshmallow Peeps!</a> and bicycling in <a href="http://blog.minitab.com/blog/statistics-for-lean-six-sigma/the-problem-with-p-charts-out-of-control-cycle-laneys" itemprop="url">The Problem With P-Charts: Out-of-control Cycle LaneYs!</a>.</p>
<p itemprop="headline">If you're into sports, see how control charts can reveal <a href="http://blog.minitab.com/blog/the-statistical-mentor/when-should-nhl-goalies-get-pulled" itemprop="url">When NHL Goalies Should Get Pulled</a>. Or look to the cosmos to consider <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/signal-to-noise-detecting-extraterrestrials-and-special-causes" itemprop="url">Signal to Noise: Detecting Extraterrestrials and Special Causes</a>. And finally, compulsive readers like myself might be interested to see how relevant control charts are to literature, too, as Cody Steele illustrates in <a href="http://blog.minitab.com/blog/statistics-and-quality-improvement/laney-p-prime-charts-show-how-poe-creates-intensity-in-the-fall-of-the-house-of-usher" itemprop="url">Laney P' Charts Show How Poe Creates Intensity in "The Fall of the House of Usher."</a></p>
<p itemprop="headline">How are <em>you </em>using control charts?</p>
Quality ImprovementMon, 12 Sep 2016 12:00:00 +0000http://blog.minitab.com/blog/understanding-statistics/control-chart-tutorials-and-examplesEston MartzControl Charts and Capability Analysis: How to Setup Your Data
http://blog.minitab.com/blog/michelle-paret/control-charts-and-capability-analysis-how-to-setup-your-data
<p>To assess whether a process is stable and in statistical control, you can use a <a href="http://www.minitab.com/support/videos/?vid=mtbucc">control chart</a>. It lets you answer the question, "Is the process that you see today going to be similar to the process that you see tomorrow?" To assess and quantify how well your process falls within specification limits, you can use <a href="http://www.minitab.com/support/videos/?vid=mtbac">capability analysis</a>.</p>
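<p><em>As a reference point for what capability analysis quantifies, the two most common indices can be sketched in a few lines. This simplified Python version uses the overall sample standard deviation (Minitab also reports "within" capability based on a control-chart-derived sigma estimate), and the measurements and spec limits below are made up:</em></p>

```python
import statistics

def capability(data, lsl, usl):
    """Cp and Cpk from the overall sample standard deviation."""
    mean = statistics.fmean(data)
    sigma = statistics.stdev(data)
    cp = (usl - lsl) / (6 * sigma)                    # potential capability
    cpk = min(usl - mean, mean - lsl) / (3 * sigma)   # accounts for centering
    return cp, cpk

# Made-up measurements, perfectly centered between the spec limits:
cp, cpk = capability([9, 10, 11, 10], lsl=7, usl=13)
```

<p><em>When the process mean sits exactly midway between the limits, Cpk equals Cp; any off-center drift pulls Cpk below Cp.</em></p>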
<p>Both of these tools are easy to use in <a href="http://www.minitab.com/products/minitab/">Minitab</a>, but you first need to properly set up your data. Here’s how.</p>
Chronological Order
<p>Your data should be entered in the order in which it was collected. The first measurement you take should be recorded in row 1. Then the next measurement belongs in row 2, etc. Data should never be sorted (e.g., arranged from smallest to largest) when creating control charts or running capability analysis.</p>
<p align="center" style="margin-left: 40px;"><img alt="control chart data setup - 1" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/361feb37a1903d6ab8df5f0586a8c871/controlchartscapadatasetup_1.jpg" style="border-width: 0px; border-style: solid; width: 500px; height: 257px;" /></p>
Data Collected in Subgroups
<p>If you collect your data in <a href="http://blog.minitab.com/blog/real-world-quality-improvement/control-charts-rational-subgrouping-and-marshmallow-peeps">subgroups</a> – say you collect 5 parts every hour – then those 5 individual data points don’t necessarily need to be in chronological order. However, in your Minitab worksheet, the first set of 5 data points collected needs to fall before the next set of 5 data points collected, and so on. You can then enter ‘5’ for your subgroup size in the <strong>Stat > Control Charts</strong>, <strong>Stat > Quality Tools > Capability Analysis</strong>, and <strong>Assistant</strong> dialog boxes.</p>
<p align="center" style="margin-left: 40px;"><img alt="control chart data setup - data collected in subgroups" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/bd70d98b36104fd2771462b5297f0dd4/controlchartscapadatasetup_3.jpg" style="line-height: 20.8px; border-width: 0px; border-style: solid; width: 371px; height: 400px;" /></p>
Missing Data
<p>Suppose you intend to collect 5 data points every hour, but during one of the hours you collect only 4. If your sample size falls short like this, you can represent the missing data point(s) with an asterisk (*).</p>
<p style="margin-left: 40px;"><img alt="control chart data setup - subgroup with missing measurement" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/792f797abda1279407dc729429a1dd71/controlchartscapadatasetup_2.jpg" style="line-height: 20.8px; text-align: -webkit-center; border-width: 0px; border-style: solid; width: 488px; height: 400px;" /></p>
Subgroup Indicator
<p>Suppose you intend to collect 5 data points every hour, but during one of the hours you collect 6 data points. Rather than tossing out perfectly good data, you can create a subgroup indicator column to let Minitab know that the subgroup size varies. You can then enter this subgroup column in the <strong>Stat > Control Charts</strong> and <strong>Stat > Quality Tools > Capability Analysis</strong> dialog boxes.</p>
<p align="center" style="margin-left: 40px;"><img alt="control chart setup - subgroup indicator" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c98c483f40339555fb7be6d2f84cd354/controlchartscapadatasetup_4.jpg" style="border-width: 0px; border-style: solid; width: 477px; height: 400px;" /></p>
<p>Note that you can use a subgroup indicator column for any of the cases above; it’s required only when your subgroup size varies.</p>
<p>When creating control charts or running capability analysis, the order in which your data appears directly impacts the resulting charts and calculations. Therefore, it’s important to make sure you enter your data properly.</p>
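<p><em>The reason order and subgrouping matter is that the control limits are computed directly from them. As a rough sketch (not Minitab's full implementation), here is how an Xbar chart's limits fall out of rows entered in collection order and grouped in fives; the measurements are made up, and A2 = 0.577 is the standard Xbar-R chart constant for subgroups of 5:</em></p>

```python
import statistics

# Made-up measurements in collection order, 5 per hourly subgroup.
subgroups = [
    [10.1, 9.8, 10.0, 10.2, 9.9],
    [10.0, 10.3, 9.7, 10.1, 10.0],
    [9.9, 10.0, 10.2, 9.8, 10.1],
]

xbars = [statistics.fmean(g) for g in subgroups]              # subgroup means
rbar = statistics.fmean(max(g) - min(g) for g in subgroups)   # average range
grand = statistics.fmean(xbars)                               # center line

A2 = 0.577  # standard Xbar-R constant for subgroup size 5
ucl = grand + A2 * rbar
lcl = grand - A2 * rbar
```

<p><em>Sorting the worksheet would shuffle which values share a subgroup, changing the ranges, rbar, and therefore the limits, which is exactly why the data must stay in chronological order.</em></p>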
Capability AnalysisProject ToolsQuality ImprovementSix SigmaStatisticsFri, 09 Sep 2016 12:03:00 +0000http://blog.minitab.com/blog/michelle-paret/control-charts-and-capability-analysis-how-to-setup-your-dataMichelle ParetHow to Identify the Most Important Predictor Variables in Regression Models
http://blog.minitab.com/blog/adventures-in-statistics/how-to-identify-the-most-important-predictor-variables-in-regression-models
<p><img alt="Most important variable " src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/4789174457ef5eca4d2dcc1d274331e3/winning_trophy.jpg" style="width: 250px; height: 246px; float: right; margin: 10px 15px;" />You’ve performed multiple linear regression and have settled on a model which contains several predictor variables that are statistically significant. At this point, it’s common to ask, “Which variable is most important?”</p>
<p>This question is more complicated than it first appears. For one thing, how you define “most important” often depends on your subject area and goals. For another, how you collect and measure your sample data can influence the apparent importance of each variable.</p>
<p>With these issues in mind, I’ll help you answer this question. I’ll start by showing you statistics that <em>don’t</em> answer the question about importance, which may surprise you. Then, I’ll move on to both statistical and non-statistical methods for determining which variables are the most important in regression models.</p>
Don’t Compare Regular Regression Coefficients to Determine Variable Importance
<p>Regular <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients" target="_blank">regression coefficients</a> describe the relationship between each <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/regression-models/what-are-response-and-predictor-variables/" target="_blank">predictor variable</a> and the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/regression-models/what-are-response-and-predictor-variables/" target="_blank">response</a>. The coefficient value represents the mean change in the response given a one-unit increase in the predictor. Consequently, it’s easy to think that variables with larger coefficients are more important because they represent a larger change in the response.</p>
<p>However, the units vary between the different types of variables, which makes it impossible to compare them directly. For example, the meaning of a one-unit change is very different if you’re talking about temperature, weight, or chemical concentration.</p>
<p>This problem is further complicated by the fact that there are different units within each type of measurement. For example, weight can be measured in grams and kilograms. If you fit models for the same data set using grams in one model and kilograms in another, the coefficient for weight changes by a factor of a thousand even though the underlying fit of the model remains unchanged. The coefficient value changes greatly while the importance of the variable remains constant.</p>
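<p>The grams-versus-kilograms point is easy to verify with a small sketch (made-up data, plain Python): refitting after a unit conversion rescales the coefficient by a factor of 1,000, yet the fitted values are identical.</p>

```python
# Minimal illustration: the same simple regression fit in grams and in
# kilograms. Only the coefficient's scale changes; the fit does not.
def ols(x, y):
    """Slope and intercept of y on x by ordinary least squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
            sum((xi - mx) ** 2 for xi in x)
    return slope, my - slope * mx

weight_g = [500.0, 750.0, 1000.0, 1250.0, 1500.0]   # hypothetical weights
response = [12.1, 14.0, 16.2, 17.9, 20.1]           # hypothetical response

slope_g, b0_g = ols(weight_g, response)
weight_kg = [w / 1000.0 for w in weight_g]
slope_kg, b0_kg = ols(weight_kg, response)

# The coefficient changes by a factor of 1000; the predictions do not.
pred_g = [b0_g + slope_g * w for w in weight_g]
pred_kg = [b0_kg + slope_kg * w for w in weight_kg]
```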
<p><strong>Takeaway</strong>: Larger coefficients don’t necessarily identify more important predictor variables.</p>
Don’t Compare P-values to Determine Variable Importance
<p>The coefficient value doesn’t indicate the importance of a variable, but what about the variable’s p-value? After all, we look for low p-values to help determine whether the variable should be included in the model in the first place.</p>
<p>P-value calculations incorporate a variety of properties, but a measure of importance is not among them. A very low p-value can reflect properties other than importance, such as a very precise estimate and a large sample size.</p>
<p>Effects that are trivial in the real world can have very low p-values. <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/p-value-and-significance-level/practical-significance/" target="_blank">A statistically significant result may not be practically significant</a>.</p>
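<p>A quick numerical sketch makes the point (a one-sample z-test on made-up numbers, not anything from this post): the same trivially small 0.1-unit shift is nowhere near significant with 25 observations, yet produces a vanishingly small p-value with a million observations.</p>

```python
from math import sqrt
from statistics import NormalDist

def z_test_p(sample_mean, hypothesized_mean, sd, n):
    """Two-sided p-value for a one-sample z-test with known sd."""
    z = (sample_mean - hypothesized_mean) / (sd / sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))

# A 0.1-unit shift that is trivial in practice (sd is 10)...
effect, sd = 0.1, 10.0
p_small_n = z_test_p(100 + effect, 100, sd, n=25)          # far from significant
p_large_n = z_test_p(100 + effect, 100, sd, n=1_000_000)   # "highly significant"
```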
<p><strong>Takeaway</strong>: Low p-values don’t necessarily identify predictor variables that are practically important.</p>
<u>Do</u> Compare These Statistics To Help Determine Variable Importance
<p>We ruled out a couple of the more obvious statistics that can’t assess the importance of variables. Fortunately, there are several statistics that can help us determine which predictor variables are most important in regression models. These statistics might not agree because the manner in which each one defines "most important" is a bit different.</p>
<p><strong>Standardized regression coefficients</strong></p>
<p>I explained how regular regression coefficients use different scales and you can’t compare them directly. However, if you standardize the regression coefficients so they’re based on the same scale, you <em>can </em>compare them.</p>
<p>To obtain standardized coefficients, standardize the values for all of your continuous predictors. In <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">Minitab 17</a>, you can do this easily by clicking the <strong>Coding</strong> button in the main Regression dialog. Under <strong>Standardize continuous predictors</strong>, choose <em>Subtract the mean, then divide by the standard deviation</em>.</p>
<p>After you fit the regression model using your standardized predictors, look at the <em>coded coefficients</em>, which are the standardized coefficients. This coding puts the different predictors on the same scale and allows you to compare their coefficients directly. Standardized coefficients represent the mean change in the response given a one standard deviation change in the predictor.</p>
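<p>The arithmetic behind standardized coefficients can be sketched in plain Python (hypothetical data, not Minitab’s implementation): standardizing a predictor rescales its coefficient by the predictor’s standard deviation, which puts all the coefficients on a common, comparable scale.</p>

```python
from statistics import mean, stdev

def standardize(x):
    """Subtract the mean, then divide by the standard deviation."""
    m, s = mean(x), stdev(x)
    return [(v - m) / s for v in x]

def fit_two_predictors(x1, x2, y):
    """OLS slopes for y = b0 + b1*x1 + b2*x2 via the 2x2 normal
    equations on mean-centered data."""
    mx1, mx2, my = mean(x1), mean(x2), mean(y)
    c1 = [v - mx1 for v in x1]
    c2 = [v - mx2 for v in x2]
    cy = [v - my for v in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(a * a for a in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * b for a, b in zip(c1, cy))
    s2y = sum(a * b for a, b in zip(c2, cy))
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2

# Hypothetical data: temperature (deg C) and pressure (psi) predicting yield.
temp =     [20, 25, 30, 35, 40, 45, 50, 55]
pressure = [1.1, 1.3, 1.2, 1.5, 1.4, 1.6, 1.5, 1.8]
yield_ =   [52, 54, 55, 58, 59, 62, 63, 66]

raw_b1, raw_b2 = fit_two_predictors(temp, pressure, yield_)
std_b1, std_b2 = fit_two_predictors(standardize(temp), standardize(pressure), yield_)

# Standardized coefficient = raw coefficient * sd of that predictor.
sd_temp, sd_pressure = stdev(temp), stdev(pressure)
```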
<p><strong>Takeaway</strong>: Look for the predictor variable with the largest absolute value for the standardized coefficient.</p>
<p><strong>Change in R-squared when the variable is added to the model last</strong></p>
<p>Multiple regression in <a href="http://www.minitab.com/en-us/products/minitab/assistant/" target="_blank">Minitab's Assistant menu</a> includes a neat analysis. It calculates the increase in <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit" target="_blank">R-squared</a> that each variable produces when it is added to a model that already contains all of the other variables.</p>
<p>Because the change in R-squared analysis treats each variable as the last one entered into the model, the change represents the percentage of the variance a variable explains that the other variables in the model cannot explain. In other words, this change in R-squared represents the amount of <em>unique</em> variance that each variable explains above and beyond the other variables in the model.</p>
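<p>As a rough illustration of the idea (plain Python on made-up data, not the Assistant’s actual calculations): the incremental R-squared for a predictor is the R-squared of the full model minus the R-squared of the model that omits that predictor.</p>

```python
from statistics import mean

def r_squared(y, fitted):
    """Fraction of the variation in y explained by the fitted values."""
    my = mean(y)
    ss_res = sum((a - f) ** 2 for a, f in zip(y, fitted))
    ss_tot = sum((a - my) ** 2 for a in y)
    return 1 - ss_res / ss_tot

def fit_one(x, y):
    """Fitted values from a simple regression of y on x."""
    mx, my = mean(x), mean(y)
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / \
        sum((a - mx) ** 2 for a in x)
    return [my + b * (a - mx) for a in x]

def fit_two(x1, x2, y):
    """Fitted values from y = b0 + b1*x1 + b2*x2 (2x2 normal equations)."""
    mx1, mx2, my = mean(x1), mean(x2), mean(y)
    c1 = [v - mx1 for v in x1]
    c2 = [v - mx2 for v in x2]
    cy = [v - my for v in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(a * a for a in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * b for a, b in zip(c1, cy))
    s2y = sum(a * b for a, b in zip(c2, cy))
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return [my + b1 * a + b2 * b for a, b in zip(c1, c2)]

# Hypothetical data: two correlated predictors and a response.
x1 = [20, 25, 30, 35, 40, 45, 50, 55]
x2 = [1.1, 1.3, 1.2, 1.5, 1.4, 1.6, 1.5, 1.8]
y = [52, 54, 55, 58, 59, 62, 63, 66]

r2_x1_only = r_squared(y, fit_one(x1, y))
r2_full = r_squared(y, fit_two(x1, x2, y))

# Unique contribution of x2: what it adds when entered last.
incremental_r2_x2 = r2_full - r2_x1_only
```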
<p><strong>Takeaway</strong>: Look for the predictor variable that is associated with the greatest increase in R-squared.</p>
An Example of Using Statistics to Identify the Most Important Variables in a Regression Model
<p>The example output below shows a regression model that has three predictors. The text output is produced by the regular regression analysis in Minitab. I’ve standardized the continuous predictors using the <strong>Coding</strong> dialog so we can see the standardized coefficients, which are labeled as coded coefficients. You can find this analysis in the Minitab menu: <strong>Stat > Regression > Regression > Fit Regression Model</strong>.</p>
<p>The report with the graphs is produced by Multiple Regression in the Assistant menu. You can find this analysis in the Minitab menu: <strong>Assistant > Regression > Multiple Regression.</strong></p>
<p style="margin-left: 40px;"> <img alt="Coded coefficient table" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/689520402ea091dae01a76b543158001/coeff_table.png" style="width: 414px; height: 194px;" /></p>
<p style="margin-left: 40px;"><img alt="Minitab's Assistant menu output that displays the incremental impact of the variables" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/9e24a607677a07a8007522c72fde1b2f/incrementalrsq.png" style="width: 600px; height: 451px;" /></p>
<p>The standardized coefficients show that North has the standardized coefficient with the largest absolute value, followed by South and East. The Incremental Impact graph shows that North explains the greatest amount of the unique variance, followed by South and East. For our example, both statistics suggest that North is the most important variable in the regression model.</p>
Caveats for Using Statistics to Identify Important Variables
<p>Statistical measures can show the relative importance of the different predictor variables. However, these measures can't determine whether the variables are important in a practical sense. To determine practical importance, you'll need to use your subject area knowledge.</p>
<p>How you collect and measure your sample can bias the apparent importance of the variables in your sample compared to their true importance in the population.</p>
<p>If you randomly sample your observations, the variability of the predictor values in your sample likely reflects the variability in the population. In this case, the standardized coefficients and the change in R-squared values are likely to reflect their population values.</p>
<p>However, if you select a restricted range of predictor values for your sample, both statistics tend to underestimate the importance of that predictor. Conversely, if the sample variability for a predictor is greater than the variability in the population, the statistics tend to overestimate the importance of that predictor.</p>
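<p>A small simulation (entirely made-up data) shows the restricted-range effect: the true slope is 2 in both samples, but because the restricted sample has a smaller predictor standard deviation, its standardized coefficient comes out much smaller.</p>

```python
import random
from statistics import stdev

random.seed(42)

def simulate(x_values):
    """y = 2*x + noise -- the same underlying relationship for every sample."""
    return [2 * x + random.gauss(0, 1) for x in x_values]

def slope(x, y):
    """OLS slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
           sum((a - mx) ** 2 for a in x)

# Full-range sample vs. a sample restricted to a narrow band of x.
x_full = [random.uniform(0, 10) for _ in range(200)]
x_restricted = [random.uniform(4, 6) for _ in range(200)]
y_full = simulate(x_full)
y_restricted = simulate(x_restricted)

# Standardized coefficient = raw slope * sd of the predictor in the sample.
std_coef_full = slope(x_full, y_full) * stdev(x_full)
std_coef_restricted = slope(x_restricted, y_restricted) * stdev(x_restricted)
```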
<p>Also, consider the accuracy and precision of the measurements for your predictors because this can affect their apparent importance. For example, lower-quality measurements can cause a variable to appear less predictive than it truly is.</p>
<p>If your goal is to change the response mean, you should be confident that causal relationships exist between the predictors and the response rather than just a correlation. If there is an observed <a href="http://blog.minitab.com/blog/understanding-statistics/no-matter-how-strong-correlation-still-doesnt-imply-causation" target="_blank">correlation but no causation</a>, intentional changes in the predictor values won’t necessarily produce the desired change in the response regardless of the statistical measures of importance.</p>
<p>To determine that there is a causal relationship, you typically need to perform a <a href="http://blog.minitab.com/blog/adventures-in-statistics/use-random-assignment-in-experiments-to-combat-confounding-variables" target="_blank">designed experiment rather than an observational study</a>.</p>
Non-Statistical Considerations for Identifying Important Variables
<p>How you define “most important” often depends on your goals and subject area. While statistics can help you identify the most important variables in a regression model, applying subject area expertise to all aspects of statistical analysis is crucial. Real world issues are likely to influence which variable you identify as the most important in a regression model.</p>
<p>For example, if your goal is to change predictor values in order to change the response, use your expertise to determine which variables are the most feasible to change. There may be variables that are harder, or more expensive, to change. Some variables may be impossible to change. Sometimes a large change in one variable may be more practical than a small change in another variable.</p>
<p>“Most important” is a subjective, context-sensitive characteristic. You can use statistics to help identify candidates for the most important variable in a regression model, but you’ll likely need to use your subject area expertise as well.</p>
<p>If you're just learning about regression, read my <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-tutorial-and-examples">regression tutorial</a>!</p>
ANOVADesign of ExperimentsLearningRegression AnalysisStatistics HelpWed, 07 Sep 2016 12:00:00 +0000http://blog.minitab.com/blog/adventures-in-statistics/how-to-identify-the-most-important-predictor-variables-in-regression-modelsJim FrostCreating Value from Your Data
http://blog.minitab.com/blog/applying-statistics-in-quality-projects/creating-value-from-your-data
<p>Huge potential benefits may be waiting in the data on your servers, and these data can be put to many different purposes. Better data allows better decisions, of course. Banks, insurance firms, and telecom companies already own a large amount of data about their customers, a resource they can use to build a more personal relationship with each customer.</p>
<p>Some organizations already use data from agricultural fields to build complex and customized models based on a very extensive number of input variables (soil characteristics, weather, plant types, etc.) in order to improve crop yields. Airline companies and large hotel chains use dynamic pricing models to improve their yield management. Data is increasingly being referred to as the new “gold mine” of the 21st century.</p>
<p>A couple of factors underlie the rising prominence of data (and, therefore, data analysis):</p>
<p><img alt="Afficher l'image d'origine" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/File/de034e63187d191e1666721fa12a8880/de034e63187d191e1666721fa12a8880.png" style="width: 283px; height: 212px; margin: 10px 15px; float: right;" /></p>
Huge volumes of data
<p><span style="line-height: 1.6;">Data acquisition has never been easier (sensors in manufacturing plants and in connected objects, data from internet usage and web clicks, from credit cards, loyalty cards, Customer Relationship Management databases, satellite images, etc.), and data can be stored at costs that are lower than ever before, thanks to the huge storage capacity now available on the cloud and elsewhere. The amount of data being collected is not only huge, it is growing exponentially.</span></p>
Unprecedented velocity
<p>Connected devices, like our smart phones, provide data in almost real time and it can be processed very quickly. It is now possible to react to any change…almost immediately.</p>
Incredible variety
<p>The data collected is not restricted to billing information; every source of data is potentially valuable for a business. Not only is numeric data being collected in a massive way, but also unstructured data such as videos, pictures, etc., in a large variety of situations.</p>
<p>But the explosion of data available to us is prompting every business to wrestle with an extremely complicated problem:</p>
How can we create value from these resources?
<p>Very simple methods, such as counting the words used in queries submitted to company web sites, provide good insight into the general mood of your customers and how it evolves. Web vendors often use simple statistical correlations to suggest an additional purchase right after a customer buys a product. Very simple descriptive statistics are also useful.</p>
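<p>The word-counting idea really is this simple. Here is a minimal sketch in Python with hypothetical customer queries (the query text is invented for illustration):</p>

```python
from collections import Counter
import re

# Hypothetical queries submitted to a company web site.
queries = [
    "how do I return a damaged item",
    "return policy for damaged goods",
    "item arrived damaged, refund please",
    "love the new catalog",
    "refund status for my return",
]

# Count every word across all queries, ignoring case and punctuation.
words = Counter()
for q in queries:
    words.update(re.findall(r"[a-z]+", q.lower()))

top3 = words.most_common(3)   # the themes customers mention most
```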
<p>Just imagine what could be achieved with advanced regression models or powerful statistical multivariate techniques, which can be applied easily with <a href="http://www.minitab.com/products/minitab/">statistical software packages like Minitab</a>.</p>
A simple example of the benefits of analyzing an enormous database
<p>Let's consider an example of how one company benefited from analyzing a very large database.</p>
<p>Many steps are needed (security and safety checks, cleaning the cabin, etc.) before a plane can depart. Since delays negatively impact customer perceptions and also affect productivity, airline companies routinely collect a very large amount of data related to flight delays and the times required to perform tasks before departure. Some times are automatically collected; others are manually recorded.</p>
<p>A major worldwide airline company intended to use this data to identify the crucial milestones among a very large number of preparation steps, and which ones often triggered delays in departure times. The company used Minitab's <span><a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-smackdown-stepwise-versus-best-subsets">stepwise regression analysis</a></span> to quickly focus on the few variables that played a major role among a large number of potential inputs. Many variables turned out to be statistically significant, but two among them clearly seemed to make a major contribution (X6 and X10).</p>
<p style="margin-left: 40px;">Analysis of Variance</p>
<p style="margin-left: 40px;">Source DF Seq SS <strong><span style="color: rgb(0, 0, 128);">Contribution </span></strong> Adj SS Adj MS F-Value P-Value</p>
<p style="margin-left: 40px;"><span style="line-height: 1.6;"> X6 1 337394 </span><span style="line-height: 1.6; color: rgb(0, 0, 128);"><strong>53.54%</strong></span><span style="line-height: 1.6;"> 2512 2512.2 29.21 0.000</span></p>
<p style="margin-left: 40px;"><span style="line-height: 1.6;"> X10 1 112911 </span><strong style="line-height: 1.6;"><span style="color: rgb(0, 0, 128);"> 17.92%</span> </strong><span style="line-height: 1.6;"> 66357 66357.1 771.46 0.000</span></p>
<p>When huge databases are used, statistical analyses may become overly sensitive and <a href="http://blog.minitab.com/blog/the-stats-cat/sample-size-statistical-power-and-the-revenge-of-the-zombie-salmon-the-stats-cat">detect even very small differences</a> (due to the large sample and power of the analysis). P values often tend to be quite small (p < 0.05) for a large number of predictors.</p>
<p>However, in Minitab, if you click <strong>Results</strong> in the regression dialog box and select <em>Expanded tables</em>, the contribution from each variable is displayed. X6 and X10, considered together, contributed more than 70% of the overall variability (with the largest F values by far), while the contributions from the remaining factors were much smaller. The airline then ran a residual analysis to cross-validate the final model. </p>
<p>In addition, a Principal Component Analysis (<a href="http://blog.minitab.com/blog/applying-statistics-in-quality-projects/use-statistics-to-better-understand-your-customers">PCA, a multivariate technique</a>) was performed in Minitab to describe the relations between the most important predictors and the response. Milestones were expected to be strongly correlated to the subsequent steps.</p>
<p style="margin-left: 40px;"><img src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/File/c023d71140ea4ee2b5b22480712a55a4/c023d71140ea4ee2b5b22480712a55a4.png" /></p>
<p>The graph above is a Loading Plot from a principal component analysis. Lines that point in the same direction and lie close to one another indicate variables that are strongly correlated, and show how the variables may be grouped.</p>
<p>A group of nine variables turned out to be strongly correlated to the most important inputs (X6 and X10) and to the final delay times (Y). Delays at the X6 stage obviously affected the X7 and X8 stages (subsequent operations), and delays from X10 affected the subsequent X11 and X12 operations.</p>
Conclusion
<p>This analysis provided simple rules that this airline's crews can follow in order to avoid delays, making passengers' next flight more pleasant. </p>
<p>The airline can repeat this analysis periodically to search for the next most important causes of delays. Such an approach can propel innovation and help organizations replace traditional and intuitive decision-making methods with data-driven ones.</p>
<p>What's more, the use of data to make things better is not restricted to the corporate world. More and more public administrations and non-governmental organizations are making large, open databases easily accessible to communities and to virtually anyone. </p>
ANOVAData AnalysisHypothesis TestingRegression AnalysisStatisticsStatistics in the NewsTue, 06 Sep 2016 13:19:00 +0000http://blog.minitab.com/blog/applying-statistics-in-quality-projects/creating-value-from-your-dataBruno ScibiliaIs Alabama Going Undefeated this Year? Creating Simulations in Minitab
http://blog.minitab.com/blog/the-statistics-game/is-alabama-going-undefeated-this-year-creating-simulations-in-minitab
<p>The college football season is here, and this raises a very important question:</p>
<p>Is Alabama going to be undefeated when they win the national championship, or will they lose a regular-season game along the way?</p>
<img alt="Alabama" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/c353125e8df62efcc49bb6cc042e2006/alabama_crimson_tide.jpg" style="line-height: 20.8px; width: 250px; height: 250px; float: right; margin: 10px 15px;" />
<p>Okay, so it's not a <em>given </em>that Alabama is going to win the championship this year, but when you've won 4 of the last 7 national championships, you're definitely the odds-on favorite.</p>
<p>However, what if we wanted to take a quantitative look at Alabama's chances of going undefeated instead of just giving hot takes like the one above? How could we determine a probability of Alabama winning a specific number of games this year?</p>
<p>The answer is easy: a Monte Carlo Simulation.</p>
<p>Monte Carlo <a href="http://blog.minitab.com/blog/understanding-statistics/monte-carlo-is-not-as-difficult-as-you-think">simulations use repeated random sampling</a> to simulate data for a given mathematical model and evaluate the outcome. Sounds like the perfect situation for <a href="http://www.minitab.com/en-us/products/minitab/?WT.srch=1&WT.mc_id=SE3994&gclid=CMquxcr4684CFVBbhgod8sECMQ" target="_blank">Minitab Statistical Software</a>. We're going to use a Monte Carlo simulation to have Alabama play their schedule 100,000 times! But we need to establish a few things before we get started.</p>
The Transfer Equation
<p>First, we need a model to use in our simulation. This can be a known formula from your specific area of expertise, or it could be a model created from a designed experiment (DOE) or regression analysis. In our situation, we already know the transfer equation. It's just the summation of the number of games that Alabama wins during the season: </p>
<p style="margin-left: 40px;">Game1 + Game2 + Game3 ... + Game12</p>
The Variables
<p>Next, we need to define the distribution and parameters for the variables in our equation. We have 12 variables, one for each game Alabama will play.</p>
<p>For each game, Alabama can either win or lose. So each variable comes from the binomial distribution because there are only two outcomes.</p>
<p>Now we just need to determine the probability Alabama has of winning each game. For that, I'll turn to <a href="http://www.footballoutsiders.com/stats/ncaa2015" target="_blank">Bill Connelly's S&P+ rankings</a>. These rankings use play-by-play and drive data from every game to rank college football teams. But most importantly, these rankings can be used to generate win probabilities for individual games. And that's where the probability for our 12 binomial variables will come from.</p>
<p style="margin-left: 40px;"><img alt="Alabama probabilities" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/973494dbdb5a661a96ad4d746a48a50f/alabama_probabilities.jpg" style="width: 711px; height: 375px;" /></p>
Generate the Random Data
<p>Now that we have our variables, it's time to generate the random data for each one. We'll start with Alabama's opening game against USC, which is a binomial random variable with a probability of 0.71. To generate this data in Minitab, go to <strong>Calc > Random Data > Binomial</strong>. Then complete the dialog as follows.</p>
<p style="margin-left: 40px;"><img alt="Binomial Distribution" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/6591ac6257608ba23acf1f9d3386c89e/usc_dialog.jpg" style="width: 419px; height: 326px;" /></p>
<p>We're going to simulate this game 100,000 times, so that is the number of rows of data we want to generate. We want each row to represent a single game, so the number of trials is 1. And lastly, Alabama has a 71% chance of winning, so the event probability is 0.71. </p>
<p>After we repeat this for the other 11 games, we'll have simulated Alabama's regular season 100,000 times! Now all that's left to do is to analyze the results!</p>
<p><strong>Note:</strong> The probability for Alabama beating Chattanooga is 100%, but the probability for the binomial distribution has to be less than 1. So I used a value of 0.9999. Out of 100,000 games Chattanooga actually won twice! Hey, it's sports, anything can happen!</p>
Analyze the Simulation
<p>Remember that transfer equation we came up with at the beginning? Now that we have the data for all of our variables, it's time to use it! Go to <strong>Calc > Calculator</strong>, and set up the equation to store the results in a new column.</p>
<p style="margin-left: 40px;"><img alt="Calculator" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/5a590180fc44d9418329a2f86a1db2cc/alabama_wins.jpg" style="width: 443px; height: 393px;" /></p>
<p>I created a new column called "Alabama Wins" and entered the sum of the individual game columns in the expression. This will give me the number of wins Alabama will have for 100,000 different seasons! We can use a histogram to view the results.</p>
<p style="margin-left: 40px;"><img alt="Histogram" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/2c9882fcee727df721dc4ade21fc2455/histogram_of_alabama_wins.jpg" style="width: 576px; height: 384px;" /></p>
<p>The most common outcome was a 10-win season, which Alabama did approximately 29.6% of the time. And the simulation suggests it doesn't look good for Alabama going undefeated. That only happens in 4.6% of the simulations. In fact, there is a better chance that Alabama wins 7 games than all 12! A 7-5 Alabama team sounds impossible. But this is sports, and as our simulation has just shown, anything can happen!</p>
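<p>For readers who want to replicate the idea outside Minitab, the whole workflow above can be sketched in a few lines of Python. Note that only the USC probability (0.71) and the Chattanooga stand-in (0.9999) come from this post; the other win probabilities below are illustrative guesses, not Bill Connelly's S&P+ numbers.</p>

```python
import random
from collections import Counter

random.seed(1)

# Win probabilities for the 12 games. 0.71 (USC) and 0.9999 (Chattanooga)
# are from the post; the remaining ten values are made-up placeholders.
win_probs = [0.71, 0.9999, 0.85, 0.80, 0.75, 0.90,
             0.70, 0.85, 0.80, 0.75, 0.90, 0.65]

n_seasons = 100_000
wins_per_season = Counter()
for _ in range(n_seasons):
    # Each game is one binomial (Bernoulli) trial with its own probability.
    wins = sum(1 for p in win_probs if random.random() < p)
    wins_per_season[wins] += 1

# Estimated probability of an undefeated regular season.
p_undefeated = wins_per_season[12] / n_seasons
```

<p>With these placeholder probabilities, the chance of winning all 12 games is roughly the product of the individual win probabilities, so even a heavy favorite in every game rarely runs the table.</p>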
<p>Monte Carlo simulations can be applied to a wide variety of areas outside of sports too. If you want more, <a href="https://www.minitab.com/en-us/Published-Articles/Doing-Monte-Carlo-Simulation-in-Minitab-Statistical-Software/">check out this article</a> that illustrates how to use Minitab for Monte Carlo simulations using both a known engineering formula and a DOE equation.</p>
Fun StatisticsMonte CarloMonte Carlo SimulationStatisticsStatistics in the NewsFri, 02 Sep 2016 12:00:00 +0000http://blog.minitab.com/blog/the-statistics-game/is-alabama-going-undefeated-this-year-creating-simulations-in-minitabKevin RudyHow to Pick the Right Statistical Software
http://blog.minitab.com/blog/real-world-quality-improvement/how-to-pick-the-right-statistical-software
<p>If you’re in the market for statistical software, there are many considerations and more than a few options for you to evaluate.</p>
<img alt="questions to ask" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ccb8f6d6-3464-4afb-a432-56c623a7b437/Image/795f924e2aad164e93ba4654f3c012ac/photo_1458419948946_19fb2cc296af.jpg" style="line-height: 20.8px; width: 300px; height: 200px; border-width: 1px; border-style: solid; float: right; margin: 10px 15px;" />
<p>Check out these seven questions to ask yourself before choosing statistical software—your answers should help guide you towards the best solution for your needs!</p>
1. Who uses statistical software in your organization?
<p>Are they expert statisticians, novices, or a mix of both? Will they be analyzing data day-in, day-out, or will some be doing statistics on a less frequent basis? Is data analysis a core part of their jobs, or is it just one of many different hats some users have to wear? What's their relationship with technology—do they like computers, or just use them because they have to? </p>
<p>Figuring out who needs to use the software will help you match the options to their needs, so you can avoid choosing a package that does too much or too little.</p>
<p>If your users span a range of cultures and nationalities, be sure to see if the package you're considering is <a href="http://support.minitab.com/en-us/minitab/17/topic-library/minitab-environment/interface/customize-the-minitab-interface/change-the-language/" target="_blank">available in multiple languages</a>.</p>
2. What types of statistical analysis will they be doing?
<p>The specific types of analysis you need to do could play a big part in determining the right statistical software for your organization. The American Statistical Association's software page lists highly specialized programs for econometrics, spatial statistics, data mining, statistical genetics, risk modeling, and more. However, if your company has employees who specialize in the finer points of these kinds of analyses, chances are good that they have already identified, and have access to, the right software for their needs.</p>
<p>Most users will want a general statistical software package that offers the power and flexibility to do all of the most commonly used types of analysis, including regression, ANOVA, hypothesis testing, design of experiments, capability analysis, control charts, and more. If you're considering a general statistical software package, check its features list to make sure it does the kinds of analysis you need. <a href="http://www.minitab.com/products/minitab/features-list/" target="_blank">Here is the complete feature list for Minitab Statistical Software.</a> </p>
3. How easy is it to use the statistical software?
<p>Data analysis is not simple or easy, and many statistical software packages don’t even try to make it any easier. This is not necessarily a bad thing, because "ease of use" is different for different users.</p>
<p>An expert statistician will know how to set up data correctly and will be comfortable entering statistical equations in a command-line interface—in fact, they may even feel slowed down by using a menu-based interface. On the other hand, a less experienced user may be intimidated or overwhelmed by a statistical software package designed primarily for use by experts. </p>
<p>Since ease of use varies widely, look into what kinds of <a href="http://support.minitab.com/en-us/minitab/17/" target="_blank">built-in guidance statistical software packages offer</a> to see which would be easiest for the majority of your users.</p>
4. What kind of support is offered?
<p>If people in your organization will need help using statistical software to analyze their data, how will they get it? Does your company have expert statisticians who can provide assistance when it's needed, or is access to that kind of expertise limited? </p>
<p>If you think people in your organization are going to contact the software's support team for assistance, it's smart to check around and see what kinds of assistance different software companies offer. Do they offer help with analysis problems, or only with installation and IT issues? Do they charge for it?</p>
<p>Look around in online user and customer forums to see what people say about the customer service they've received for different types of statistical software. <a href="http://www.minitab.com/Support/" target="_blank">Some software packages offer free technical support from experts in statistics and IT</a>; others provide more limited, fee-based customer support; and some packages provide no support at all.</p>
5. Where will the software be used?
<p>Will you be doing data analysis in your office? At home? On the road? All of the above? Will people in your organization be using the software at different locations across the country, or even the world? What are the license requirements for software packages in that situation? Does each machine need a separate copy of the software, or are shared licenses available?</p>
<p>Check on the options available for the packages you're considering. A good software provider will seek to understand your organization's unique needs and work with you to find the most cost-effective solution.</p>
6. Are there special considerations for your industry?
<p>Some professions have specialized data analysis needs due to regulations, industry requirements, or the unique nature of their business. For example, the pharmaceutical and medical device industries need to meet FDA recommendations for testing, which may involve statistical techniques such as Design of Experiments.</p>
<p>Depending on the needs of your business, one or more of these highly specialized software packages may be appropriate. However, general statistical software packages with a full range of tools may provide the functionality your industry requires, so be sure to investigate and compare these packages with the more specialized, and often more expensive, programs used in some industries.</p>
7. What do statistical software packages cost?
<p>Last but not least, you will need to consider the cost of the software package, which can range from $0 for some open-source programs to many thousands of dollars per license for more specialized offerings.</p>
<p>It’s important to compare not just the unit-copy price of a software package (i.e., what it costs to install a single copy of the software on a single machine), but to find out <a href="http://www.minitab.com/en-us/News/Minitab-Pricing-and-Licensing--Frequently-Asked-Questions/" target="_blank">what licensing options for statistical software</a> are available for your situation. </p>
Have more questions?
<p>If you have questions about data analysis software, please <a href="http://www.minitab.com/contact-us/" target="_blank">contact Minitab</a> to discuss your unique situation in detail. We are happy to help you identify the needs of your organization and find a solution that will best fit them!</p>
Statistics, Statistics Help | Mon, 29 Aug 2016 18:10:00 +0000 | Carly Barry
http://blog.minitab.com/blog/real-world-quality-improvement/how-to-pick-the-right-statistical-software
Sunny Day for A Statistician vs. Dark Day for A Householder with Solar Panels
http://blog.minitab.com/blog/using-data-and-statistics/sunny-day-for-a-statistician-vs-dark-day-for-a-householder-with-solar-panels
<p>In 2011 we had solar panels fitted on our property. In the last few months we have noticed a few problems with the inverter (the equipment that converts the electricity generated by the panels from DC to AC, and manages the transfer of unused electricity to the power company). It was shutting down at various times throughout the day, typically when it was very sunny, resulting in no electricity being generated.<img alt="solar panels" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/0ee09d62f414b4bd79601d23995458bf/solar.jpg" style="width: 400px; height: 267px; margin: 10px 15px; float: right;" /></p>
<p>I contacted the inverter manufacturer for some help to diagnose the problem. They asked me to download their monitoring app, called Sunny Portal. I did this and started a communication process with the inverter via Bluetooth, which not only showed me the error code but also delivered a time series of the electricity generated by the hour since the panels were installed.</p>
<p>I thought I had gone to statistician heaven! By using this data, I could establish if this problem was significantly reducing the amount of electricity generated and, consequently, reducing the amount of cash I was being paid for generating electricity. </p>
<p>The Sunny Portal does have some basic bar charts to plot <span><a href="http://blog.minitab.com/blog/real-world-quality-improvement/3-ways-to-examine-data-over-time">time series</a></span> by month, day, and 5-minute interval; however, each chart automatically scales to its own data, so it is difficult to compare time periods. </p>
<div>
<p><strong>Top Minitab Tip</strong>: If you want to compare multiple charts measuring the same thing for different time periods or groups, make sure the Y-axis scales are the same. In many Minitab graphs and charts, if you select the Multiple Graphs button you will be given the option to select the same Y-axis scale.</p>
</div>
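<p>The same principle can be reproduced in other tools. As an illustration (not part of the original post), here is a minimal Python/matplotlib sketch where <code>sharey=True</code> forces two panels onto the same Y-axis scale; the data and panel labels are hypothetical:</p>

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Hypothetical daily generation (units/day) for two periods
period_a = [3.1, 8.4, 9.9, 7.2, 5.5]
period_b = [2.0, 4.1, 5.3, 3.8, 2.9]

# sharey=True gives both panels an identical Y-axis scale,
# so the two periods can be compared at a glance
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
ax1.bar(range(len(period_a)), period_a)
ax2.bar(range(len(period_b)), period_b)
ax1.set_title("2014 (hypothetical)")
ax2.set_title("2016 (hypothetical)")
fig.savefig("comparison.png")
```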
Getting the Data into Minitab
<p>I realized that I could output the data to text files, which meant I could use my statistical skills and Minitab to answer my questions. For each month between Sept 2011 and June 2016 I exported a file like the example shown below. For each day I have the date, the cumulative units generated since the inverter was commissioned, and the daily generation.</p>
<p style="margin-left: 40px;"><img src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/65ceccab-4e9a-4eba-8ce7-73b8b3d4d078/File/06a0cad69d2d8bd7cc169fb1ccb039fc/06a0cad69d2d8bd7cc169fb1ccb039fc.png" /></p>
<p>These were easily read into Minitab, using <strong>File > Open</strong>, specifying the first row of data as row 9, and changing the delimiter from comma to semicolon. </p>
<p style="margin-left: 40px;"><img src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/65ceccab-4e9a-4eba-8ce7-73b8b3d4d078/File/a533773b174b01e721e6bae8f3240cdb/a533773b174b01e721e6bae8f3240cdb.png" style="line-height: 20.8px;" /></p>
<p>I read all of these monthly files into individual Minitab worksheets and then used <strong>Data > Stack Worksheets</strong> to create a single worksheet that contained all the data. </p>
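<p>Outside Minitab, the same import-and-stack workflow can be sketched in Python with pandas (my own illustration; the file layout below is a stand-in for the real Sunny Portal export):</p>

```python
import io
import pandas as pd

# Stand-in for one monthly export: 8 metadata lines, then
# semicolon-delimited rows (date; cumulative units; daily units),
# with real data starting at row 9, as in the post
monthly_file = io.StringIO(
    "\n".join(f"metadata line {i}" for i in range(1, 9)) + "\n"
    "01.06.2016;12345.0;9.8\n"
    "02.06.2016;12350.5;5.5\n"
)

df = pd.read_csv(monthly_file, sep=";", skiprows=8,
                 names=["date", "cumulative_units", "daily_units"])

# With one DataFrame per monthly file, concatenating them
# mirrors Minitab's Data > Stack Worksheets
all_data = pd.concat([df], ignore_index=True)
print(all_data)
```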
Creating and Reviewing the Time Series Plots
<p>Using <strong>Graph > Time Series Plot, </strong>I created the following time series plots. To get each year in different colours, I double-clicked on an individual data point in the chart, chose the "Groups" tab in the Edit Symbols dialog box, and put Year as the grouping variable.</p>
<p style="margin-left: 40px;"><img src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/65ceccab-4e9a-4eba-8ce7-73b8b3d4d078/File/d8735a9c83b4d1fab3b48b1d850cab38/d8735a9c83b4d1fab3b48b1d850cab38.png" style="line-height: 20.8px;" /></p>
<p>Looking at this plot, it was clear that the most electricity is generated in the summer months and least in the winter months, but it was not easy to identify if the amount of electricity generated had been declining. I needed to consider another analytical approach.</p>
<p>Since I had only noticed this problem in the last six months (Jan to June 2016), I decided to compare the electricity generated in the first six months of the year for 2012–2016. I did this using <strong>Assistant > Hypothesis Tests > One-Way ANOVA</strong>. The descriptive results were as follows:</p>
<p style="margin-left: 40px;"><img src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/65ceccab-4e9a-4eba-8ce7-73b8b3d4d078/File/535915f4f060684ccbfb1bf1cf34475b/535915f4f060684ccbfb1bf1cf34475b.png" style="line-height: 1.6;" /></p>
<p>Just looking at the summary statistics, I can clearly see that the average number of units generated per day in the first six months of 2016, at 5.71, is much lower than in the previous years, which range between 8.15 in 2012 and 9.22 in 2014. However, the results of the one-way ANOVA tell me whether 2016 is <em>significantly</em> worse than previous years. </p>
<p>From this chart, you can see that the p-value is less than 0.001, so we can conclude that not all the group means are equal. The Means Comparison Chart, shown below, also shows that 2016 is significantly lower than all the other years.</p>
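<p>For readers working outside Minitab, this kind of comparison can be sketched with SciPy's one-way ANOVA. The samples below are hypothetical stand-ins for the daily-generation data, not the actual values from the post:</p>

```python
from scipy import stats

# Hypothetical daily generation (units/day) sampled from the
# first half of each year; the real analysis used far more rows
y2014 = [9.5, 8.8, 10.1, 9.0, 8.7, 9.4]
y2015 = [8.9, 9.3, 8.5, 9.1, 8.8, 9.0]
y2016 = [5.2, 6.1, 5.8, 5.5, 6.0, 5.6]

# One-way ANOVA: are all the group means equal?
f_stat, p_value = stats.f_oneway(y2014, y2015, y2016)

# A p-value below 0.05 indicates at least one mean differs
print(f"F = {f_stat:.2f}, p = {p_value:.6f}")
```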
<p style="margin-left: 40px;"><img src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/65ceccab-4e9a-4eba-8ce7-73b8b3d4d078/File/c3604a8dd269c552d10b231ad9e28f50/c3604a8dd269c552d10b231ad9e28f50.png" /></p>
<p>However, you might be thinking that the first six months of 2016 in England were darker than an average year, with significantly less UV light. That is a fair point, so to check it I looked at data produced by the UK Met Office (<a href="http://www.metoffice.gov.uk/climate/uk/summaries/anomalygraphs">www.metoffice.gov.uk/climate/uk/summaries/anomalygraphs</a>). These charts, called anomaly graphs, compare the sunshine levels by month for a particular year to the average sunshine levels for the previous decade.</p>
<p>The results for 2016 and 2012, the two worst years for average electricity generated per day, are as follows: </p>
<p style="margin-left: 40px;"><img src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/65ceccab-4e9a-4eba-8ce7-73b8b3d4d078/File/2a6f05e175bfc75a8fdf9ccb91037eef/2a6f05e175bfc75a8fdf9ccb91037eef.png" /></p>
<p style="margin-left: 40px;"><img src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/65ceccab-4e9a-4eba-8ce7-73b8b3d4d078/File/1803e9942cfdad869abf51aad522a874/1803e9942cfdad869abf51aad522a874.png" /></p>
<p>When I compare the Met Office data for the amount of sunshine in the first six months of 2016 in England (red bar) with 2012, the second-worst year according to my summary statistics, I can see that only January and March were better in 2012. It should also be noted that you generate more electricity when there are more daylight hours, so a bad June has a bigger influence on electricity generated than a bad January, and June 2012 was worse than June 2016.</p>
<p>Consequently, I can see that the English weather cannot be blamed for the lower electricity generation figures and the fault is with my inverter. The next steps are to determine when this problem with the inverter started, and estimate what it has cost. </p>
<p>After I shared my results, the helpdesk at the manufacturer identified the problem with the inverter: it had been set up with German power-grid settings, and apparently the UK grid has more voltage fluctuation. The settings were changed on 15th July, and I'm looking forward to collecting more data and analyzing it in Minitab to determine whether the problem has been solved.</p>
ANOVA, Data Analysis, Fun Statistics, Hypothesis Testing, Statistics | Fri, 26 Aug 2016 12:00:00 +0000 | Gillian Groom
http://blog.minitab.com/blog/using-data-and-statistics/sunny-day-for-a-statistician-vs-dark-day-for-a-householder-with-solar-panels
What the Heck Are Sums of Squares in Regression?
http://blog.minitab.com/blog/marilyn-wheatleys-blog/what-the-heck-are-sums-of-squares-in-regression
<p>In regression, "sums of squares" are used to represent variation. In this post, we’ll use some sample data to walk through these calculations.</p>
<p><img alt="squares" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/bc08f61dab265a6e4a481df9e66e54e8/squares.jpg" style="width: 250px; height: 250px; margin: 10px 15px; float: right;" />The sample data used in this post is available within <a href="http://www.minitab.com/en-us/products/minitab/">Minitab</a> by choosing <strong>Help</strong> > <strong>Sample Data</strong>, or <strong>File</strong> > <strong>Open Worksheet</strong> > <strong>Look in Minitab Sample Data folder</strong> (depending on your version of Minitab). The dataset is called <strong>ResearcherSalary.MTW</strong>, and contains data on salaries for researchers in a pharmaceutical company.</p>
<p>For this example we will use the data in C1, the salary, as Y (the response variable) and C4, the years of experience, as X (the predictor variable).</p>
<p>First, we can run our data through Minitab to see the results: <strong>Stat</strong> > <strong>Regression</strong> > <strong>Fitted Line Plot</strong>. The salary is the Y variable, and the years of experience is our X variable. The regression output will tell us about the relationship between years of experience and salary after we complete the dialog box as shown below, and then click <strong>OK</strong>:</p>
<p style="margin-left: 40px;"><img alt="fitted line plot dialog" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/3c5f16c5e876000a5fe3941e6622dec3/sum_of_squares1.png" style="border-width: 0px; border-style: solid; width: 437px; height: 232px;" /></p>
<p>In the window above, I’ve also clicked the <strong>Storage</strong> button and selected the box next to <strong>Coefficients</strong> to store the coefficients from the regression equation in the worksheet. When we click <strong>OK</strong>, Minitab gives us two pieces of output:</p>
<p style="margin-left: 40px;"><img alt="fitted line plot and output" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/52211666f2a49c79d4127965964934da/sum_of_squares2.png" style="border-width: 0px; border-style: solid; width: 624px; height: 282px;" /></p>
<p>On the left side above we see the regression equation and the ANOVA (Analysis of Variance) table, and on the right side we see a graph that shows us the relationship between years of experience on the horizontal axis and salary on the vertical axis. Both the right and left side of the output above are conveying the same information. We can clearly see from the graph that as the years of experience increase, the salary increases, too (so years of experience and salary are positively correlated). For this post, we’ll focus on the SS (Sums of Squares) column in the Analysis of Variance table.</p>
Calculating the Regression Sum of Squares
<p>We see an SS value of 5086.02 in the Regression row of the ANOVA table above. That value represents the amount of variation in the salary that is attributable to the number of years of experience, based on this sample. Here's where that number comes from. </p>
<ol>
<li>Calculate the average response value (the salary). In Minitab, I’m using <strong>Stat</strong> > <strong>Basic Statistics</strong> > <strong>Store Descriptive Statistics</strong>:</li>
</ol>
<p style="margin-left: 40px;"><img alt="dialog boxes" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/1d4099c8c098df54b6b8c9a4756e978c/sum_of_squares3.png" style="border-width: 0px; border-style: solid; width: 624px; height: 378px;" /></p>
<p>In addition to entering the Salary as the variable, I’ve clicked <strong>Statistics</strong> to make sure only <strong>Mean</strong> is selected, and I’ve also clicked <strong>Options</strong> and checked the box next to <strong>Store a row of output for each row of input</strong>. As a result, Minitab will store a value of 82.9514 (the average salary) in C5 35 times:</p>
<p style="margin-left: 40px;"><img alt="data" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/2cafda0a9754841e093f614ce337d0ec/sum_of_squares4.png" style="border-width: 0px; border-style: solid; width: 359px; height: 252px;" /></p>
<ol>
<li value="2">Next, we will use the regression equation that Minitab gave us to calculate the fitted values. The fitted values are the salaries that our regression equation would predict, given the number of years of experience. </li>
</ol>
<p style="margin-left: 40px;">Our regression equation is <strong>Salary = 60.70 + 2.169*Years</strong>, so for every year of experience, we expect the salary to increase by 2.169. </p>
<p style="margin-left: 40px;">The first row in the Years column in our sample data is 11, so if we use 11 in our equation we get 60.70 + 2.169*11 = 84.559. So with 11 years of experience our regression equation tells us the expected salary is about $84,000. </p>
<p style="margin-left: 40px;">Rather than calculating this for every row in our worksheet manually, we can use Minitab’s calculator: <strong>Calc</strong> > <strong>Calculator </strong>(I used the stored coefficients in the worksheet to include more decimals in the regression equation that I’ve typed into the calculator):</p>
<p style="margin-left: 40px;"><img alt="calculator" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/5c11d157d8fb68745a47a8d5bbf12bc7/sum_of_squares5.png" style="border-width: 0px; border-style: solid; width: 423px; height: 381px;" /></p>
<p style="margin-left: 40px;">After clicking <strong>OK</strong> in the window above, Minitab will store the predicted salary value for every year in column C6. <strong>NOTE</strong>: <em>In the regression graph we obtained, the red regression line represents the values we’ve just calculated in C6.</em></p>
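<p style="margin-left: 40px;">The fitted-value calculation is easy to verify by hand. Here is a short Python sketch (my own illustration) using the coefficients reported in the post and a few hypothetical rows from the Years column:</p>

```python
# Coefficients from the post's fitted line: Salary = 60.70 + 2.169*Years
b0, b1 = 60.70, 2.169

years = [11, 5, 20]  # hypothetical values from the Years column
fitted = [b0 + b1 * y for y in years]

# The first value reproduces the worked example:
# 60.70 + 2.169*11 = 84.559, i.e., about $84,000
print(fitted)
```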
<ol>
<li value="3">Now that we have the average salary in C5 and the predicted values from our equation in C6, we can calculate the Sums of Squares for the Regression (the 5086.02). We’ll use <strong>Calc</strong> > <strong>Calculator</strong> again, and this time we will subtract the average salary from the predicted values, square those differences, and then add all of those squared differences together:</li>
</ol>
<p style="margin-left: 40px;"><img alt="calculator" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/e43bec2e97d1e3552fa68b5babbc9162/sum_of_squares6.png" style="border-width: 0px; border-style: solid; width: 426px; height: 384px;" /></p>
<p style="margin-left: 40px;">We square all the values because some of the predicted values from our equation are lower than the average, so those predicted values would be negative. If we sum together both positive and negative values, they will cancel each other out. But because we square the values, all observations will be taken into account.</p>
<p style="margin-left: 40px;">We have just calculated the Sum of Squares for the regression by summing the squared values. Our results should match what we’d seen in the regression output previously:</p>
<p style="margin-left: 40px;"><img alt="output" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ad62e21671cf488d1c6be83420b92979/sum_of_squares7.png" style="border-width: 0px; border-style: solid; width: 624px; height: 134px;" /></p>
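<p style="margin-left: 40px;">In code, that subtract-square-sum step is one line. A sketch with hypothetical fitted values (for a least-squares fit, the mean of the fitted values equals the mean of the observed responses, so it can stand in for the stored mean):</p>

```python
# Hypothetical fitted values, standing in for column C6
fitted = [84.6, 71.5, 104.1, 78.0, 91.3]

# Mean response, standing in for column C5
mean_y = sum(fitted) / len(fitted)

# Regression SS: squared deviation of each fitted value from the mean
ss_regression = sum((f - mean_y) ** 2 for f in fitted)
print(round(ss_regression, 2))
```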
Calculating the Error Sum of Squares
<p>The Error Sum of Squares is the variation in the salary that is not explained by number of years of experience. For example, the additional variation in the salary could be due to the person’s gender, number of publications, or other variables that are not part of this model. Any variation that is not explained by the predictors in the model becomes part of the error term.</p>
<ol>
<li>To calculate the error sum of squares we will use the calculator (<strong>Calc </strong>> <strong>Calculator</strong>) again to subtract the fitted values (the salaries predicted by our regression equation) from the observed response (the actual salaries): </li>
</ol>
<p style="margin-left: 40px;"><img alt="calculator" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/dfcb4fce9d5b4ac05179749c02cd4199/sum_of_squares8.png" style="border-width: 0px; border-style: solid; width: 424px; height: 383px;" /></p>
<p style="margin-left: 0.5in;">In C9, Minitab will store the differences between the actual salaries and what our equation predicted.</p>
<ol>
<li value="2">Because we’re calculating sums of squares again, we’re going to square all the values we stored in C9, and then add them up to come up with the sum of squares for error:</li>
</ol>
<p style="margin-left: 40px;"><img alt="calculator" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/246ff836925c61a1e944a27f0964c213/sum_of_squares9.png" style="border-width: 0px; border-style: solid; width: 422px; height: 381px;" /></p>
<p style="margin-left: 0.5in;">When we click <strong>OK</strong> in the calculator window above, we see that our calculated sum of squares for error matches Minitab’s output:</p>
<p style="margin-left: 0.5in;"><img alt="output" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/be5bb46afeb785da264c08e2c4ce50f5/sum_of_squares10.png" style="border-width: 0px; border-style: solid; width: 621px; height: 128px;" /></p>
<p style="margin-left: 0.5in;">Finally, the Total Sum of Squares is calculated by adding the Regression and Error SS together: 5086.02 + 1022.61 = 6108.63.</p>
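<p>All three sums of squares, and the identity between them, can be checked end-to-end in a few lines of code. This sketch uses a small made-up (years, salary) dataset, not the ResearcherSalary data:</p>

```python
# Hypothetical (years of experience, salary in $1000s) pairs
x = [2, 5, 8, 11, 14]
y = [64.0, 72.5, 77.0, 85.5, 90.0]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n

# Least-squares slope and intercept
b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
      / sum((xi - mean_x) ** 2 for xi in x))
b0 = mean_y - b1 * mean_x

fitted = [b0 + b1 * xi for xi in x]

ss_reg = sum((f - mean_y) ** 2 for f in fitted)
ss_err = sum((yi - f) ** 2 for yi, f in zip(y, fitted))
ss_tot = sum((yi - mean_y) ** 2 for yi in y)

# The ANOVA identity from the post: Total SS = Regression SS + Error SS
print(ss_reg, ss_err, ss_tot)
```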
<p>I hope you’ve enjoyed this post, and that it helps demystify what sums of squares are. If you’d like to read more about regression, you may like some of <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-tutorial-and-examples">Jim Frost</a>’s <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-tutorial-and-examples">regression tutorials</a>.</p>
ANOVA, Data Analysis, Learning, Regression Analysis, Statistics, Statistics Help, Stats | Wed, 24 Aug 2016 12:02:00 +0000 | Marilyn Wheatley
http://blog.minitab.com/blog/marilyn-wheatleys-blog/what-the-heck-are-sums-of-squares-in-regression
Data Not Normal? Try Letting It Be, with a Nonparametric Hypothesis Test
http://blog.minitab.com/blog/understanding-statistics/data-not-normal-try-letting-it-be-with-a-nonparametric-hypothesis-test
<p>So the data you nurtured, that you worked so hard to format and make useful, failed the normality test.</p>
<img alt="not-normal" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c6e92e8046f3fcee28e7cf505fb77005/data_freak_flag_300.jpg" style="line-height: 20.8px; width: 300px; height: 293px; margin: 10px 15px; float: right;" />
<p>Time to face the truth: despite your best efforts, that data set is <em>never </em>going to measure up to the assumption you may have been trained to fervently look for.</p>
<p>Your data's lack of normality seems to make it poorly suited for analysis. Now what?</p>
<p>Take it easy. Don't get uptight. Just let your data be what they are, go to the <strong>Stat </strong>menu in Minitab Statistical Software, and choose "Nonparametrics."</p>
<p style="margin-left: 40px;"><img alt="nonparametrics menu" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/fbebf763ac6bd92b40c0d241b7c4029c/nonparametrics_menu.png" style="width: 367px; height: 309px;" /></p>
<p>If you're stymied by your data's lack of normality, nonparametric statistics might help you find answers. And if the word "nonparametric" looks like five syllables' worth of trouble, don't be intimidated—it's just a big word that usually refers to "tests that don't assume your data follow a normal distribution."</p>
<p>In fact, nonparametric statistics don't assume your data follow <em>any distribution at all</em>. The following table lists common parametric tests, their equivalent nonparametric tests, and the main characteristics of each.</p>
<p style="margin-left: 40px;"><img alt="correspondence table for parametric and nonparametric tests" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/4a69043809861f5187be271de67f8161/parametric_correspondence_table.png" style="width: 661px; height: 488px;" /></p>
<p>Nonparametric analyses free your data from the straitjacket of the normality assumption. So choosing a nonparametric analysis is sort of like removing your data from a stifling, <a href="https://www.verywell.com/the-asch-conformity-experiments-2794996" target="_blank">conformist environment</a>, and putting it into <a href="https://en.wikipedia.org/wiki/Utopia" target="_blank">a judgment-free, groovy idyll</a>, where your data set can just be what it is, with no hassles about its unique and beautiful shape. How cool is <em>that</em>, man? Can you dig it?</p>
<p>Of course, it's not <em>quite </em>that carefree. Just like the 1960s encompassed both <a href="https://en.wikipedia.org/wiki/Woodstock" target="_blank">Woodstock</a> and <a href="https://en.wikipedia.org/wiki/Altamont_Free_Concert" target="_blank">Altamont</a>, so nonparametric tests offer both compelling advantages and serious limitations.</p>
Advantages of Nonparametric Tests
<p>Both parametric and nonparametric tests draw inferences about populations based on samples, but parametric tests focus on sample parameters like the mean and the standard deviation, and make various assumptions about your data—for example, that it follows a normal distribution, and that samples include a minimum number of data points.</p>
<p>In contrast, nonparametric tests are unaffected by the distribution of your data. Nonparametric tests also accommodate many conditions that parametric tests do not handle, including small sample sizes, ordered outcomes, and outliers.</p>
<p>Consequently, they can be used in a wider range of situations and with more types of data than traditional parametric tests. Many people also feel that nonparametric analyses are more intuitive.</p>
Drawbacks of Nonparametric Tests
<p><span style="line-height: 20.8px;">But nonparametric tests are not </span><em style="line-height: 20.8px;">completely </em><span style="line-height: 20.8px;">free from assumptions—they do require data to be an independent random sample, for example.</span></p>
<p>And nonparametric tests aren't a cure-all. For starters, they typically have less <a href="http://blog.minitab.com/blog/starting-out-with-statistical-software/how-powerful-am-i-power-and-sample-size-in-minitab">statistical power</a> than their parametric equivalents. Power is the probability that you will correctly reject the null hypothesis when it is false. That means you have an increased chance of making a Type II error with these tests.</p>
<p>In practical terms, that means nonparametric tests are <em>less </em>likely to detect an effect or association when one really exists.</p>
<p>So if you want to draw conclusions with the same confidence level you'd get using an equivalent parametric test, you will need larger sample sizes. </p>
<p>Nonparametric tests are not a one-size-fits-all solution for non-normal data, but they can yield good answers in situations where parametric statistics just won't work.</p>
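<p>To make the parametric/nonparametric pairing concrete, here is a SciPy sketch (my own illustration) running a 2-sample t-test and its nonparametric counterpart, the Mann-Whitney test, on hypothetical right-skewed samples:</p>

```python
from scipy import stats

# Hypothetical right-skewed samples, e.g., repair times in hours
group_a = [1.1, 1.3, 1.6, 2.0, 2.4, 3.1, 9.8]
group_b = [2.2, 2.7, 3.0, 3.4, 4.1, 5.0, 12.5]

# Parametric: Welch's 2-sample t-test (compares means)
t_stat, t_p = stats.ttest_ind(group_a, group_b, equal_var=False)

# Nonparametric counterpart: Mann-Whitney test (compares ranks,
# with no assumption about the underlying distribution)
u_stat, u_p = stats.mannwhitneyu(group_a, group_b,
                                 alternative="two-sided")

print(f"t-test p = {t_p:.3f}, Mann-Whitney p = {u_p:.3f}")
```

With small skewed samples like these, the two tests can disagree, because the rank-based test is far less influenced by the outlying values in each group.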
Is Parametric or Nonparametric the Right Choice for You?
<p>I've briefly outlined differences between parametric and nonparametric hypothesis tests, looked at which tests are equivalent, and considered some of their advantages and disadvantages. If you're waiting for me to tell you which direction you should choose...well, all I can say is, "It depends..." But I can give you some established rules of thumb to consider when you're looking at the specifics of your situation.</p>
<p>Keep in mind that <strong>nonnormal data do not immediately disqualify your data for a parametric test</strong>. What's your sample size? As long as a certain minimum sample size is met, most parametric tests will be <a href="http://blog.minitab.com/blog/fun-with-statistics/forget-statistical-assumptions-just-check-the-requirements">robust to the normality assumption</a>. For example, the Assistant in Minitab (which uses Welch's t-test) points out that while the 2-sample t-test is based on the assumption that the data are normally distributed, this assumption is not critical when the sample sizes are at least 15. And Bonett's 2-sample standard deviation test performs well for nonnormal data even when sample sizes are as small as 20. </p>
<p>In addition, while they may not require normal data, many nonparametric tests have other assumptions that you can’t disregard. For example, the Kruskal-Wallis test assumes your samples come from populations that have similar shapes and equal variances. And the 1-sample Wilcoxon test does not assume a particular population distribution, but it does assume the distribution is symmetrical. </p>
<p><span style="line-height: 1.6;">In most cases, your choice between parametric and nonparametric tests ultimately comes down to sample size, and whether the center of your data's distribution is better reflected by the mean or the median.</span></p>
<ul>
<li>If the mean accurately represents the center of your distribution and your sample size is large enough, a parametric test offers you better accuracy and more power. </li>
<li>If your sample size is small, you'll likely need to go with a nonparametric test. But if the median better represents the center of your distribution, a nonparametric test may be a better option even for a large sample.</li>
</ul>
Data Analysis, Hypothesis Testing, Statistics, Statistics Help | Mon, 22 Aug 2016 12:00:00 +0000 | Eston Martz
http://blog.minitab.com/blog/understanding-statistics/data-not-normal-try-letting-it-be-with-a-nonparametric-hypothesis-test