Data Analysis Software | Minitab
Blog posts and articles with tips for using statistical software to analyze data for quality improvement.
http://blog.minitab.com/blog/data-analysis-software/rss
Thu, 29 Jan 2015 16:17:26 +0000
Analyzing Qualitative Data, part 2: Chi-Square and Multivariate Analysis
http://blog.minitab.com/blog/applying-statistics-in-quality-projects/analyzing-qualitative-data-part-2-chi-square-and-multivariate-analysis
<p><span style="color: rgb(77, 79, 81); font-family: 'Segoe UI', Frutiger, 'Frutiger Linotype', 'Dejavu Sans', 'Helvetica Neue', Tahoma, Arial, sans-serif; font-size: 14px; line-height: 21px;">In my recent meetings with people from various companies in the service industries, I realized that one of the problems they face is that they are collecting large amounts of "qualitative" data: types of product, customer profiles, different subsidiaries, several customer requirements, etc.</span></p>
<p>As I discussed in my previous post, one way to look at qualitative data is to use different types of charts, including <a href="http://blog.minitab.com/blog/applying-statistics-in-quality-projects/analyzing-qualitative-data-part-1-pareto-pie-and-stacked-bar-charts">pie charts, stacked bar charts, and Pareto charts</a>. In this post, we'll cover how to dig deeper into qualitative data with Chi-square analysis and multivariate analysis. </p>
A Chi-Square Test with Qualitative Data
<p style="line-height: 20.7999992370605px;">The table below shows which statistical methods can be used to analyze data, depending on whether the data are qualitative or numeric/quantitative. Even when both the output (Y) and the input (the predictor, X) are qualitative, at least one statistical method is relevant and can be used: the Chi-Square test.</p>
<table>
<tr><th>X \ Y</th><th align="center">Numeric/quantitative Output</th><th align="center"><span style="color: rgb(178, 34, 34);"><u>Qualitative Output</u></span></th></tr>
<tr><td><strong>Numeric/quantitative Input</strong></td><td align="center">Regression</td><td align="center">Logistic Regression</td></tr>
<tr><td><span style="color: rgb(178, 34, 34);"><strong><u>Qualitative Input</u></strong></span></td><td align="center">ANOVA<br />T tests</td><td align="center"><strong><span style="color: rgb(178, 34, 34);">Chi-Square</span></strong><br /><span style="color: rgb(178, 34, 34);">Proportion tests</span></td></tr>
</table>
<p style="line-height: 20.7999992370605px;">Let's perform the Chi-square test of statistical significance on the same qualitative mistakes data I used in my previous post:</p>
<p style="line-height: 20.7999992370605px;"><img alt="data" src="https://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/545c0823fc7368e795585c38424891d9/quali1.jpg" style="width: 375px; height: 378px;" /></p>
<p style="line-height: 20.7999992370605px;"><span style="line-height: 20.7999992370605px;">In Minitab Statistical Software, go to <strong>Stat > Tables > Cross Tabulation and Chi-square...</strong> In the output below, each Employee / Error type cell shows three values. The observed count comes first. Below it is the expected count (based on the assumption that the distribution of error types is strictly identical for each employee). Below the expected count is that combination's contribution to the overall Chi-Square statistic.</span></p>
<p style="line-height: 20.7999992370605px; margin-left: 40px;"><img alt="" spellcheck="true" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/c34202f08d378b92e7abc1b32d2aab2d/quali10.JPG" style="width: 557px; height: 526px;" /></p>
<p style="line-height: 20.7999992370605px;">A low p-value (p = 0.042, which is below 0.05), shown below the table, indicates a significant difference in the distribution of error types among the three employees.</p>
<p style="line-height: 20.7999992370605px;">We then need to consider the major contributions to the overall chi-square:</p>
<p style="line-height: 20.7999992370605px;"><strong>Largest contribution: </strong>3.79 for the combination of error type “Product” and employee A. In that particular cell, the number of observed “Product” errors (third row) for employee A (first column of the table) is much larger than the number of expected errors. Because of that difference, the contribution for that combination is large: 3.79.</p>
<p style="line-height: 20.7999992370605px;"><strong>Second largest contribution:</strong> 2.66 for the combination of error type “Address” and employee C. In that particular cell, the observed number of address errors for employee C is much larger than the expected number, so the contribution of 2.66 is also quite large.</p>
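<p>For readers who want to see the arithmetic behind the expected counts and the cell contributions, here is a minimal Python sketch. The counts in <code>observed</code> are hypothetical placeholders, not the data from the worksheet above, and the function name is my own. Each expected count is (row total x column total) / grand total, and each contribution is (observed - expected)² / expected.</p>

```python
# Hypothetical counts for illustration only (NOT the data shown in the post).
# Rows are error types; columns are employees A, B, C.
employees = ["A", "B", "C"]
observed = {
    "Address": [5, 8, 15],
    "Product": [16, 6, 7],
    "Quantity": [9, 11, 8],
}


def chi_square_contributions(table):
    """Return expected counts, per-cell contributions, and the chi-square total."""
    n_cols = len(next(iter(table.values())))
    row_totals = {row: sum(counts) for row, counts in table.items()}
    col_totals = [sum(counts[j] for counts in table.values()) for j in range(n_cols)]
    grand_total = sum(row_totals.values())

    expected = {}
    contributions = {}
    for row, counts in table.items():
        # Expected count under "same distribution of error types for everyone".
        expected[row] = [row_totals[row] * col_totals[j] / grand_total
                         for j in range(n_cols)]
        # Each cell's share of the overall chi-square statistic.
        contributions[row] = [(o - e) ** 2 / e
                              for o, e in zip(counts, expected[row])]

    chi_square = sum(sum(cells) for cells in contributions.values())
    return expected, contributions, chi_square


expected, contributions, chi_square = chi_square_contributions(observed)

# Find the cell that contributes most to the overall statistic.
row, col = max(
    ((r, j) for r in contributions for j in range(len(employees))),
    key=lambda cell: contributions[cell[0]][cell[1]],
)
print(f"Chi-square = {chi_square:.2f}")
print(f"Largest contribution: {row} / employee {employees[col]} "
      f"({contributions[row][col]:.2f})")
```

<p>Sorting the cells by contribution is what points you to the combinations worth investigating, exactly as we did by eye in the Minitab output.</p>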
Simple Correspondence (Multivariate) Analysis for Qualitative Data
<p><span style="line-height: 20.7999992370605px;">This third approach to analyzing qualitative data is more complex and computationally intensive, but it is also a very effective statistical tool because it presents the results graphically. In Minitab, go to <strong>Stat > Multivariate > Simple Correspondence Analysis...</strong></span></p>
<p style="line-height: 20.7999992370605px;">To do this analysis, I rearranged the data into a two-way contingency table, with an additional column for the employee names:</p>
<p style="line-height: 20.7999992370605px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/f48bee1f16926d5370aa5ca6e1d7d26e/quali8.jpg" style="width: 336px; height: 114px;" /></p>
<p style="line-height: 20.7999992370605px;">The simple correspondence symmetric plot below indicates that “Product” type errors are more likely to be associated with employee A (the two points are close to one another on the right side of the graph), whereas "Address" type errors are more likely to be associated with employee C (the two points are close to one another on the left side of the graph). This is the same conclusion we reached using the Chi-square test.</p>
<p style="line-height: 20.7999992370605px;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/31b80fb2-db66-4edf-a753-74d4c9804ab8/Image/7824d6d02af62b303240567beec1081f/quali9.jpg" style="width: 450px; height: 332px;" /></p>
How Can You Use Qualitative Data?
<p style="line-height: 20.7999992370605px;">Counts of qualitative data can provide relevant information to decision makers, process owners, quality professionals, and others, and several graphical and statistical tools are available for that in Minitab. Our <a href="http://www.minitab.com/products/minitab">statistical software</a> includes other tools for analyzing qualitative values that I didn't have space to present in this short post (for example, Kappa studies, Attribute sampling inspection, Nominal Logistic regression...). </p>
<p style="line-height: 20.7999992370605px;">Quantitative analysis and statistics could be used more extensively in the service sector to improve quality and customer satisfaction. Of course, analyses of qualitative data are also often performed in the manufacturing industry. If you're not already using it, please download our <a href="http://it.minitab.com/products/minitab/free-trial.aspx">free 30-day trial</a> and see what you can learn from your data!</p>
Data Analysis, Quality Improvement, Statistics, Stats
Wed, 28 Jan 2015 13:00:00 +0000
http://blog.minitab.com/blog/applying-statistics-in-quality-projects/analyzing-qualitative-data-part-2-chi-square-and-multivariate-analysis
Bruno Scibilia
How to Choose the Best Regression Model
http://blog.minitab.com/blog/adventures-in-statistics/how-to-choose-the-best-regression-model
<p><img alt="Rodin's statue, The Thinker" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/381a4964475703a0136b974f98c6c47f/rodin_the_thinker2.jpg" style="float: right; width: 275px; height: 367px; margin: 10px 15px;" />Choosing the correct linear regression model can be difficult. After all, the world and how it works is complex. Trying to model it with only a sample doesn’t make it any easier. In this post, I'll review some common statistical methods for selecting models, describe complications you may face, and provide some practical advice for choosing the best regression model.</p>
<p>It starts when a researcher wants to mathematically describe the relationship between some <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/regression-models/what-are-response-and-predictor-variables/" target="_blank">predictors</a> and the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/regression-models/what-are-response-and-predictor-variables/" target="_blank">response variable</a>. The research team tasked to investigate typically measures many variables but includes only some of them in the model. The analysts try to eliminate the variables that are not related and include only those with a true relationship. Along the way, the analysts consider many possible models.</p>
<p>They strive to achieve a Goldilocks balance with the number of predictors they include. </p>
<ul>
<li><strong>Too few</strong>: An underspecified model tends to produce biased estimates.</li>
<li><strong>Too many</strong>: An overspecified model tends to have less precise estimates.</li>
<li><strong>Just right</strong>: A model with the correct terms has no bias and the most precise estimates.</li>
</ul>
Statistical Methods for Finding the Best Regression Model
<p>For a good regression model, you want to include the variables that you are specifically testing along with other variables that affect the response in order to avoid biased results. <a href="http://www.minitab.com/en-us/products/minitab/features/" target="_blank">Minitab statistical software </a>offers statistical measures and procedures that help you specify your regression model. I’ll review the common methods, but please do follow the links to read my more detailed posts about each.</p>
<p><strong><a href="http://blog.minitab.com/blog/adventures-in-statistics/multiple-regession-analysis-use-adjusted-r-squared-and-predicted-r-squared-to-include-the-correct-number-of-variables" target="_blank">Adjusted R-squared and Predicted R-squared</a></strong>: Generally, you choose the models that have higher adjusted and predicted R-squared values. These statistics are designed to avoid a key problem with <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit" target="_blank">regular R-squared</a>—it increases <em>every</em> time you add a predictor and can trick you into specifying an overly complex model.</p>
<ul>
<li>The adjusted R-squared increases only if the new term improves the model more than would be expected by chance, and it can decrease when you add poor-quality predictors.</li>
<li>The predicted R-squared is a form of cross-validation and it can also decrease. Cross-validation determines how well your model generalizes to other data sets by partitioning your data.</li>
</ul>
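<p>To make the adjusted R-squared behavior concrete, here is a small Python sketch of the standard formula, 1 - (1 - R²)(n - 1) / (n - k - 1), where n is the number of observations and k is the number of predictors (not counting the constant). The R-squared values and sample size below are made-up numbers for illustration, and the function name is my own.</p>

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared for a model with a constant and n_predictors terms."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("Not enough observations for this many predictors.")
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical example: a fourth predictor nudges regular R-squared up slightly,
# but not by enough to offset the lost degree of freedom.
three_terms = adjusted_r_squared(0.800, n_obs=30, n_predictors=3)
four_terms = adjusted_r_squared(0.801, n_obs=30, n_predictors=4)

print(f"3 predictors: adjusted R-sq = {three_terms:.4f}")
print(f"4 predictors: adjusted R-sq = {four_terms:.4f}")
print("Adjusted R-sq dropped:", four_terms < three_terms)
```

<p>In this made-up case the regular R-squared goes up while the adjusted value goes down, which is the signal that the extra term probably isn't earning its place.</p>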
<p><strong><a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients" target="_blank">P-values for the predictors</a></strong>: In regression, low p-values indicate terms that are statistically significant. “Reducing the model” refers to the practice of including all candidate predictors in the model, and then systematically removing the term with the highest p-value one-by-one until you are left with only significant predictors.</p>
<p><strong><a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-smackdown-stepwise-versus-best-subsets" target="_blank">Stepwise regression and Best subsets regression</a></strong>: These are two automated procedures that can identify useful predictors during the exploratory stages of model building. With best subsets regression, Minitab provides Mallows’ Cp, which is a statistic specifically designed to help you manage the tradeoff between precision and bias.</p>
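<p>If you're curious how Mallows’ Cp is calculated, here is a minimal Python sketch of the usual textbook formula: Cp = SSE<sub>p</sub> / MSE<sub>full</sub> - n + 2p, where p counts the parameters in the candidate model including the constant. The sums of squares below are hypothetical values chosen only to show the calculation; in practice you would take them from your own regression output.</p>

```python
def mallows_cp(sse_subset, mse_full, n_obs, n_params):
    """Mallows' Cp for a candidate model.

    sse_subset: error sum of squares of the candidate (subset) model
    mse_full:   mean square error of the full model with all candidate predictors
    n_params:   number of parameters in the candidate model, constant included
    """
    return sse_subset / mse_full - n_obs + 2 * n_params


# Hypothetical numbers for illustration only.
n_obs = 30
mse_full = 4.0
candidates = {
    "2 predictors (p = 3)": (140.0, 3),
    "3 predictors (p = 4)": (106.0, 4),
}

for name, (sse, p) in candidates.items():
    cp = mallows_cp(sse, mse_full, n_obs, p)
    # A Cp close to p suggests little bias; a Cp well above p suggests
    # the model is missing important terms.
    print(f"{name}: Cp = {cp:.1f}")
```

<p>The general guideline is to look for models where Cp is small and close to the number of parameters.</p>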
Real World Complications
<p>Great, there are a variety of statistical methods to help us choose the best model. Unfortunately, there also are a number of potential complications. Don’t worry, I’ll provide some practical advice!</p>
<ul>
<li>The best model can be only as good as the variables measured by the study. The results for the variables you include in the analysis can be biased by the significant variables that you don’t include. <a href="http://blog.minitab.com/blog/adventures-in-statistics/collecting-good-data-its-a-messy-world-confound-it" target="_blank">Read about an example of omitted variable bias</a>.</li>
<li>Your sample might be unusual, either by chance or by data collection methodology. <a href="http://blog.minitab.com/blog/adventures-in-statistics/not-all-p-values-are-created-equal" target="_blank">False positives</a> and false negatives are part of the game when working with samples.</li>
<li>P-values can change based on the specific terms in the model. In particular, <a href="http://blog.minitab.com/blog/adventures-in-statistics/what-are-the-effects-of-multicollinearity-and-when-can-i-ignore-them" target="_blank">multicollinearity</a> can sap significance and make it difficult to determine the role of each predictor.</li>
<li>If you assess enough models, you <em>will</em> find variables that appear to be significant but are only correlated by chance. <a href="http://blog.minitab.com/blog/adventures-in-statistics/four-tips-on-how-to-perform-a-regression-analysis-that-avoids-common-problems" target="_blank">This form of data mining can make random data appear significant</a>. Checking whether the predicted R-squared is low is a good way to detect this problem.</li>
<li>P-values, predicted and adjusted R-squared, and Mallows’ Cp can suggest different models.</li>
<li>Stepwise regression and best subsets regression are great tools and can get you close to the correct model. However, studies have found that <a href="http://blog.minitab.com/blog/adventures-in-statistics/which-is-better%2C-stepwise-regression-or-best-subsets-regression" target="_blank">they generally don’t pick the correct model</a>.</li>
</ul>
Recommendations for Finding the Best Regression Model
<p>Choosing the correct regression model is as much a science as it is an art. Statistical methods can help point you in the right direction but ultimately you’ll need to incorporate other considerations.</p>
<p><strong>Theory</strong></p>
<p>Research what others have done and incorporate those findings into constructing your model. Before beginning the regression analysis, develop an idea of what the important variables are along with their relationships, coefficient signs, and effect magnitudes. Building on the results of others makes it easier both to collect the correct data and to specify the best regression model without the need for data mining.</p>
<p>Theoretical considerations should not be discarded based solely on statistical measures. After you fit your model, determine whether it aligns with theory and possibly make adjustments. For example, based on theory, you might include a predictor in the model even if its p-value is not significant. If any of the coefficient signs contradict theory, investigate and either change your model or explain the inconsistency.</p>
<p><strong>Complexity</strong></p>
<p>You might think that complex problems require complex models, but many studies show that <a href="http://blog.minitab.com/blog/adventures-in-statistics/four-tips-on-how-to-perform-a-regression-analysis-that-avoids-common-problems" target="_blank">simpler models generally produce more precise predictions</a>. Given several models with similar explanatory ability, the simplest is most likely to be the best choice. Start simple, and only make the model more complex as needed. The more complex you make your model, the more likely it is that you are tailoring the model to your dataset specifically, and generalizability suffers.</p>
<p>Verify that added complexity actually produces narrower <a href="http://blog.minitab.com/blog/adventures-in-statistics/applied-regression-analysis-how-to-present-and-use-the-results-to-avoid-costly-mistakes-part-2" target="_blank">prediction intervals</a>. Check the predicted R-squared and don’t mindlessly chase a high regular R-squared!</p>
<p><strong>Residual Plots</strong></p>
<p>As you evaluate models, <a href="http://blog.minitab.com/blog/adventures-in-statistics/why-you-need-to-check-your-residual-plots-for-regression-analysis" target="_blank">check the residual plots</a> because they can help you avoid inadequate models and help you adjust your model for better results. For example, the bias in underspecified models can show up as patterns in the residuals, such as the need to <a href="http://blog.minitab.com/blog/adventures-in-statistics/curve-fitting-with-linear-and-nonlinear-regression" target="_blank">model curvature</a>. The simplest model that produces random residuals is a good candidate for being a relatively precise and unbiased model.</p>
<p>In the end, no single measure can tell you which model is the best. Statistical methods don't understand the underlying process or subject-area. Your knowledge is a crucial part of the process!</p>
<p>If you're learning about regression, read my <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-tutorial-and-examples">regression tutorial</a>!</p>
<p><em>* The image of Rodin's </em>The Thinker <em>was taken by flickr user innoxius and licensed under <a href="https://creativecommons.org/licenses/by/2.0/" target="_blank"><span style="font-size: 12px; display: inline;">CC BY 2.0</span></a>.</em></p>
Regression Analysis
Thu, 22 Jan 2015 13:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/how-to-choose-the-best-regression-model
Jim Frost
Tom Brady and the Danger of Selective Endpoints
http://blog.minitab.com/blog/the-statistics-game/tom-brady-and-the-danger-of-selective-endpoints
<p>Last Friday I had an interesting tweet come across my Twitter feed.</p>
<p><img alt="Tweet" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/e9b3b34597f15a072ce760cdb7f90a1c/pats_tweet.jpg" style="width: 600px; height: 194px;" /></p>
<p>And that was <em>before</em> the Patriots failed to cover the spread in their first playoff game of 2015 against the Ravens. When you include that game, the record becomes 3-11, good for a winning percentage of only 21%! With the Patriots set to play another playoff game against the Colts, it seems like the smart thing to do is to bet on the Colts to cover. But wait, 14 games is a pretty small sample. We should do a <a href="http://blog.minitab.com/blog/understanding-statistics/what-statistical-hypothesis-test-should-i-use">hypothesis test</a> to determine whether this percentage is significantly less than 50%.</p>
<p><img alt="1 Proportion Test" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/e6e6f63ddde4905115625fc149a8e2fb/1_proportion.jpg" style="width: 516px; height: 172px;" /></p>
<p><a href="http://www.minitab.com/products/minitab">Minitab Statistical Software</a> returns a p-value of 0.029, which is less than the alpha value of 0.05, so at the 5% significance level we can conclude that the true percentage of playoff games in which the Patriots cover is less than 50%. Great! Now it’s time to get my ATM card and bet a mortgage payment on the Colts. Thank you, statistics!</p>
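<p>If you want to check that p-value by hand, here is a short Python sketch of a one-sided exact binomial test: the probability of 3 or fewer covers in 14 games if the true rate were 50%. The function name is my own, and I'm assuming the exact binomial method here; it reproduces the 0.029 reported above.</p>

```python
from math import comb


def binomial_lower_tail(successes, trials, p_null=0.5):
    """P(X <= successes) when X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes + 1)
    )


# 3 covers in 14 playoff games, tested against a 50% cover rate.
p_value = binomial_lower_tail(3, 14)
print(round(p_value, 3))  # prints 0.029
```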
<p>But wait, there is one more question I should probably ask pertaining to that tweet.</p>
<p>Why only the last 13 games?</p>
<table>
<tr><th style="text-align: center;">Date</th><th style="text-align: center;">Patriots Opponent</th><th style="text-align: center;">Spread</th><th style="text-align: center;">Score</th><th style="text-align: center;">Cover the Spread?</th></tr>
<tr><td style="text-align: center;">1/19/2014</td><td style="text-align: center;">@ Denver</td><td style="text-align: center;">+5</td><td style="text-align: center;">L 16-26</td><td style="text-align: center;">L</td></tr>
<tr><td style="text-align: center;">1/11/2014</td><td style="text-align: center;">Indianapolis</td><td style="text-align: center;">-7.5</td><td style="text-align: center;">W 43-22</td><td style="text-align: center;">W</td></tr>
<tr><td style="text-align: center;">1/20/2013</td><td style="text-align: center;">Baltimore</td><td style="text-align: center;">-8</td><td style="text-align: center;">L 13-28</td><td style="text-align: center;">L</td></tr>
<tr><td style="text-align: center;">1/13/2013</td><td style="text-align: center;">Houston</td><td style="text-align: center;">-9.5</td><td style="text-align: center;">W 41-28</td><td style="text-align: center;">W</td></tr>
<tr><td style="text-align: center;">2/5/2012</td><td style="text-align: center;">New York Giants</td><td style="text-align: center;">-3</td><td style="text-align: center;">L 17-21</td><td style="text-align: center;">L</td></tr>
<tr><td style="text-align: center;">1/22/2012</td><td style="text-align: center;">Baltimore</td><td style="text-align: center;">-7</td><td style="text-align: center;">W 23-20</td><td style="text-align: center;">L</td></tr>
<tr><td style="text-align: center;">1/14/2012</td><td style="text-align: center;">Denver</td><td style="text-align: center;">-14</td><td style="text-align: center;">W 45-10</td><td style="text-align: center;">W</td></tr>
<tr><td style="text-align: center;">1/16/2011</td><td style="text-align: center;">New York Jets</td><td style="text-align: center;">-9.5</td><td style="text-align: center;">L 21-28</td><td style="text-align: center;">L</td></tr>
<tr><td style="text-align: center;">1/10/2010</td><td style="text-align: center;">Baltimore</td><td style="text-align: center;">-3.5</td><td style="text-align: center;">L 14-33</td><td style="text-align: center;">L</td></tr>
<tr><td style="text-align: center;">2/3/2008</td><td style="text-align: center;">New York Giants</td><td style="text-align: center;">-12.5</td><td style="text-align: center;">L 14-17</td><td style="text-align: center;">L</td></tr>
<tr><td style="text-align: center;">1/20/2008</td><td style="text-align: center;">San Diego</td><td style="text-align: center;">-14</td><td style="text-align: center;">W 21-12</td><td style="text-align: center;">L</td></tr>
<tr><td style="text-align: center;">1/12/2008</td><td style="text-align: center;">Jacksonville</td><td style="text-align: center;">-13.5</td><td style="text-align: center;">W 31-20</td><td style="text-align: center;">L</td></tr>
<tr><td style="text-align: center;"><span style="color:#FF0000;">1/21/2007</span></td><td style="text-align: center;"><span style="color:#FF0000;">Indianapolis</span></td><td style="text-align: center;"><span style="color:#FF0000;">+3.5</span></td><td style="text-align: center;"><span style="color:#FF0000;">L 34-38</span></td><td style="text-align: center;"><span style="color:#FF0000;">L</span></td></tr>
<tr><td style="text-align: center;"><span style="color:#FF0000;">1/14/2007</span></td><td style="text-align: center;"><span style="color:#FF0000;">San Diego</span></td><td style="text-align: center;"><span style="color:#FF0000;">+5</span></td><td style="text-align: center;"><span style="color:#FF0000;">W 24-21</span></td><td style="text-align: center;"><span style="color:#FF0000;">W</span></td></tr>
<tr><td style="text-align: center;"><span style="color:#FF0000;">1/7/2007</span></td><td style="text-align: center;"><span style="color:#FF0000;">New York Jets</span></td><td style="text-align: center;"><span style="color:#FF0000;">-9.5</span></td><td style="text-align: center;"><span style="color:#FF0000;">W 37-16</span></td><td style="text-align: center;"><span style="color:#FF0000;">W</span></td></tr>
</table>
<p>Here are the last <em>fifteen</em> playoff games the Patriots played prior to the tweet (so, not including the most recent Baltimore game). I’ve highlighted the 13th, 14th, and 15th games in red. All three of these games were played 1 week apart, but the 13th game was included in the tweet, while the 14th and 15th games were conveniently left off.</p>
<p>Why? Because 3-10 against the spread sounds more impressive than 5-10.</p>
<p>This is using selective endpoints to manipulate statistics to help prove your point. It’s this kind of thing that leads people to say “There are three kinds of lies: lies, damned lies, and statistics.” The conclusions you can make from your statistical analysis are only as good as the data behind it. That’s why you should always make sure you collect a random, unbiased sample. And before you believe the conclusions made by others, ensure they collected the data correctly too!</p>
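<p>To see how much the choice of endpoint matters, here is a small Python sketch that recomputes the against-the-spread record for every possible cutoff, using the "Cover the Spread?" column from the fifteen games listed above (most recent first). The variable and function names are my own.</p>

```python
# "Cover the Spread?" results for the fifteen games in the table, most recent first.
covers = ["L", "W", "L", "W", "L", "L", "W", "L",
          "L", "L", "L", "L", "L", "W", "W"]


def record_last_n(results, n_games):
    """Return (wins, losses) against the spread over the most recent n_games."""
    window = results[:n_games]
    wins = window.count("W")
    return wins, len(window) - wins


# Show how the record changes as the cutoff moves further back in time.
for n_games in range(1, len(covers) + 1):
    wins, losses = record_last_n(covers, n_games)
    pct = 100 * wins / n_games
    print(f"Last {n_games:2d} games: {wins}-{losses} ({pct:.0f}%)")
```

<p>Stopping at 13 games gives the 3-10 record from the tweet, while going back just two more games gives 5-10. Same team, same games, different story.</p>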
<p><img alt="Tom Brady -- Keith Allison. Used under Creative Commons Attribution-ShareAlike 2.0" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c5c413bce99fe8f74543b520463a28e8/tombrady.jpg" style="border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 165px; height: 240px;" />In our Patriots situation, we could go back and look at every playoff game the Patriots have played in. But I don’t think their games in 1963 have any effect on their games this season. So instead, the best thing to do is to associate this Patriots team with Tom Brady. So we should sample <em>all</em> the playoff games that Tom Brady has played in. That includes the 16 previous games (in which he went 5-11 against the spread) and 11 games he played before 2007 (in which he went 6-4-1). This gives us a final record of 11-15-1, which is a winning percentage of 42%.</p>
<p>Once we obtain a legitimate sample of data, we see that Tom Brady and the Patriots' record against the spread in the playoffs isn’t nearly as bad as we were originally led to believe. While 42% is still less than 50%, it is no longer significantly different.</p>
<p><img alt="1 Proportion Test" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/207a813845e6ff925d8fac8c124d0a77/1_proportion_2.jpg" style="width: 529px; height: 172px;" /></p>
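<p>As a rough check on the larger sample, here is a Python sketch of the same kind of one-sided exact binomial calculation for 11 covers in 26 decided games. I'm assuming the push (the tie against the spread) is simply dropped, which is how you get 42%, and the function name is my own. Under those assumptions the p-value works out to about 0.28, nowhere near 0.05.</p>

```python
from math import comb


def lower_tail_p_value(covers, games, p_null=0.5):
    """One-sided exact binomial p-value: P(X <= covers) under the null rate."""
    return sum(
        comb(games, k) * p_null**k * (1 - p_null) ** (games - k)
        for k in range(covers + 1)
    )


# 11-15-1 against the spread; the single push is excluded (an assumption).
covers, losses = 11, 15
games = covers + losses

p_value = lower_tail_p_value(covers, games)
print(f"Cover rate: {covers / games:.0%}")
print(f"p-value: {p_value:.3f}")
print("Significant at alpha = 0.05:", p_value < 0.05)
```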
<p>So could the Patriots still fail to cover against the Colts this weekend? Of course. But I'm not going to go bet a mortgage payment on it. </p>
<p><em>Photo of Tom Brady by Keith Allison, used under Creative Commons 2.0.</em></p>
Fri, 16 Jan 2015 13:00:00 +0000
http://blog.minitab.com/blog/the-statistics-game/tom-brady-and-the-danger-of-selective-endpoints
Kevin Rudy
Birds Versus Statisticians: Testing the Gambler's Fallacy
http://blog.minitab.com/blog/statistics-in-the-field/birds-versus-statisticians%3A-testing-the-gamblers-fallacy
<p><em><span style="line-height: 1.6;">by Matthew Barsalou, guest blogger</span></em></p>
<p>Recently Minitab’s Joel Smith published a blog post about an incident in which he was pooped on by a bird. <a href="http://blog.minitab.com/blog/fun-with-statistics/poisson-processes-and-probability-of-poop">Twice</a>. I suspect many people would reason that the odds of it happening twice are very low, and would then incorrectly conclude that they are safer once such a rare event has happened.</p>
<p>I don’t have data on how often birds poop on one person, and I assume Joel is unwilling to stand under a flock of berry-fed birds waiting to collect data for me, so I’ll simply make up some numbers for illustration purposes only.</p>
<div><img alt="Joel, that bird's got that look in his eye again...." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/21770465cb23a69e5b7d99c4cc3351b9/bird1.jpg" style="line-height: 20.7999992370605px; border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 278px; height: 200px;" />
<p>Suppose there is a 5% chance of being pooped on by a bird during a vacation. That means the probability of being pooped on is 0.05. If the events are independent, the probability of being pooped on twice during the vacation is 0.0025 (0.05 x 0.05) or 0.25%, and the probability of being pooped on three times is 0.000125 (0.05 x 0.05 x 0.05).</p>
<p>Joel has already been pooped on twice. So what is the probability of our intrepid statistician being pooped on a <em>third </em>time?</p>
<p>The probability is 0.05. If you said 0.000125, then you may have made a mistake known as the <a href="http://en.wikipedia.org/wiki/Gambler%27s_fallacy" target="_blank">Gambler’s Fallacy</a> or the Monte Carlo Fallacy. This fallacy is the mistaken belief that things will average out in the short term. A gambler who has suffered repeated losses may incorrectly assume that the recent losses mean a win is due soon. Things <em>will </em>balance out in the long term, but the odds do not change after each event. Joel could correctly conclude that the probability of a bird pooping on him during his vacation is low and that the odds of being pooped on twice are much lower. But being pooped on one time does not affect the probability of it happening a second time.</p>
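<p>The distinction between the two numbers can be written out in a few lines of Python. This sketch uses the made-up 5% figure from above and assumes the hits are independent; the variable names are my own. The joint probability of three hits is tiny, but the conditional probability of a third hit, given that two have already happened, is still just 0.05.</p>

```python
from math import isclose

p_hit = 0.05  # made-up probability of a single hit, for illustration only

# Joint probabilities, assuming independent events.
p_two_hits = p_hit ** 2
p_three_hits = p_hit ** 3

# Conditional probability of a third hit given two hits already happened:
# P(three hits) / P(two hits), which collapses back to p_hit.
p_third_given_two = p_three_hits / p_two_hits

print(f"P(three hits)            = {p_three_hits:.6f}")
print(f"P(third hit | two hits)  = {p_third_given_two:.2f}")
print("Same as a single hit:", isclose(p_third_given_two, p_hit))
```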
<p>There is a caveat here. The probabilities only apply if the meeting of poop and Joel are random events. Perhaps birds, for reasons understood only by birds, have an inordinate fondness for Joel. Our probability calculations would no longer apply in such a situation. This would be like calculating the probabilities of a coin toss when there is some characteristic that causes the coin to land more on one side than on the other.</p>
<p>We can perform an experiment to determine whether Joel is just a victim of the odds or whether something makes the birds target him. Because the generally low occurrence rate would make it difficult to collect enough data in a reasonable amount of time, a designed experiment is the practical way to get it. We could send Joel to a bird sanctuary for two weeks and record the number of times he is pooped on. Somebody of approximately the same size and appearance as Joel could be used as a control. Both Joel and the control should be dressed the same to ensure that birds are not targeting a particular color or clothing brand. The table below shows the hypothetical results of our little experiment.</p>
<p align="center"><img src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/0f8accc5b70a622475b15c5e70c34aa0/table1.png" style="border-width: 0px; border-style: solid; width: 187px; height: 322px;" /></p>
<p>We can see that Joel was hit 99 times, while the control was only hit 80 times. But does this difference mean anything? To find out, we can use <a href="http://www.minitab.com/products/minitab">Minitab Statistical Software</a> to determine if there is a statistically significant difference between the number of times Joel was hit and the number of times the control was hit.</p>
<p>Enter the data into Minitab, go to <strong>Stat > Basic Statistics > 2-Sample Poisson Rate</strong>, and select “Each sample is in its own column.” Go to Options and select “Difference > hypothesized difference” as the alternative hypothesis to run a one-sided, upper-tailed test. The resulting p-value shown in the output below is 0.078. That's greater than the alpha of 0.05, so we <a href="http://blog.minitab.com/blog/understanding-statistics/things-statisticians-say-failure-to-reject-the-null-hypothesis">fail to reject the null hypothesis</a>. Although the occurrence rate was higher for Joel, we do not have enough evidence to conclude that birds are especially attracted to him.</p>
<p align="center"><img alt="Test and CI for two-sample Poisson rates output" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/565188c12e19c8bb56aed22dd0e48e9e/output.png" style="border-width: 0px; border-style: solid; width: 564px; height: 333px;" /></p>
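<p>For readers who want to check the arithmetic outside Minitab, here is a minimal sketch of an upper-tailed comparison of two Poisson counts. It assumes both people were observed for the same length of time and uses a normal approximation, which may not be the exact method Minitab applies; the function name is made up for illustration.</p>

```python
import math


def poisson_rate_upper_p(count_a: int, count_b: int) -> float:
    """One-sided p-value for H1: rate A > rate B.

    Assumes equal observation lengths for both samples and uses a
    normal approximation to the difference of two Poisson counts.
    """
    # Under H0 the difference has mean 0 and variance count_a + count_b.
    z = (count_a - count_b) / math.sqrt(count_a + count_b)
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))


# Counts quoted in the post: Joel was hit 99 times, the control 80 times.
p_value = poisson_rate_upper_p(99, 80)
print(round(p_value, 3))
```

<p>With the counts from the table, this approximation agrees with the P-value of 0.078 reported in the output above.</p>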
<p>Joel is well aware of the Gambler’s Fallacy, so we can be assured that he is not under a false sense of security. He must know the probability of him getting struck a third time has not changed. But has he considered that these may not be random events? The experiment described here was only hypothetical. Perhaps Joel should consider wearing a <a href="http://www.merriam-webster.com/dictionary/sou'wester" target="_blank">sou’wester</a> and raincoat the next time he takes a vacation in the sun.</p>
<p> </p>
<p><strong>About the Guest Blogger</strong></p>
<p><em><a href="https://www.linkedin.com/pub/matthew-barsalou/5b/539/198" target="_blank">Matthew Barsalou</a> is an engineering quality expert in <a href="http://www.3k-warner.de/" target="_blank">BorgWarner</a> Turbo Systems Engineering GmbH’s global engineering excellence department. He is a Smarter Solutions certified Lean Six Sigma Master Black Belt, ASQ-certified Six Sigma Black Belt, quality engineer, and quality technician, and a TÜV-certified quality manager, quality management representative, and auditor. He has a bachelor of science in industrial sciences, a master of liberal studies with emphasis in international business, and a master of science in business administration and engineering from the Wilhelm Büchner Hochschule in Darmstadt, Germany. He is the author of the books <a href="http://www.amazon.com/Root-Cause-Analysis-Step---Step/dp/148225879X/ref=sr_1_1?ie=UTF8&qid=1416937278&sr=8-1&keywords=Root+Cause+Analysis%3A+A+Step-By-Step+Guide+to+Using+the+Right+Tool+at+the+Right+Time" target="_blank">Root Cause Analysis: A Step-By-Step Guide to Using the Right Tool at the Right Time</a>, <a href="http://asq.org/quality-press/display-item/index.html?item=H1472" target="_blank">Statistics for Six Sigma Black Belts</a> and <a href="http://asq.org/quality-press/display-item/index.html?item=H1473&xvl=76115763" target="_blank">The ASQ Pocket Guide to Statistics for Six Sigma Black Belts</a>.</em></p>
</div>
Data Analysis, Fun Statistics
Tue, 13 Jan 2015 13:29:53 +0000
http://blog.minitab.com/blog/statistics-in-the-field/birds-versus-statisticians%3A-testing-the-gamblers-fallacy
Guest Blogger
Understanding Qualitative, Quantitative, Attribute, Discrete, and Continuous Data Types
http://blog.minitab.com/blog/understanding-statistics/understanding-qualitative-quantitative-attribute-discrete-and-continuous-data-types
<p>"Data! Data! Data! I can't make bricks without clay."<br />
— Sherlock Holmes, in Arthur Conan Doyle's <em>The Adventure of the Copper Beeches</em></p>
<p>Whether you're the world's greatest detective trying to crack a case or a person trying to solve a problem at work, you're going to need information. Facts. <em>Data</em>, as Sherlock Holmes says. </p>
<p><img alt="jujubes" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/96d7c87addccc11b6072d6dfa38d0039/jujubes.jpg" style="line-height: 20.7999992370605px; margin: 10px 15px; float: right; width: 200px; height: 200px;" /></p>
<p>But not all data is created equal, especially if you plan to analyze it as part of a quality improvement project.</p>
<p>If you're using Minitab Statistical Software, you can access the Assistant to <a href="http://www.minitab.com/products/minitab/assistant">guide you through your analysis step-by-step</a>, and help identify the type of data you have.</p>
<p>But it's still important to have at least a basic understanding of the different types of data, and the kinds of questions you can use them to answer. </p>
<p>In this post, I'll provide a basic overview of the types of data you're likely to encounter, and we'll use a box of my favorite candy—<a href="http://en.wikipedia.org/wiki/Jujube_(confectionery)" target="_blank">Jujubes</a>—to illustrate how we can gather these different kinds of data, and what types of analysis we might use it for. </p>
The Two Main Flavors of Data: Qualitative and Quantitative
<p>At the highest level, two kinds of data exist: <em><strong>quantitative</strong></em> and <em><strong>qualitative</strong></em>.</p>
<p><strong><em>Quantitative</em> </strong>data deals with numbers and things you can measure objectively: dimensions such as height, width, and length. Temperature and humidity. Prices. Area and volume.</p>
<p><strong><em>Qualitative </em></strong>data deals with characteristics and descriptors that can't be easily measured, but can be observed subjectively—such as smells, tastes, textures, attractiveness, and color. </p>
<p>Broadly speaking, when you measure something and give it a number value, you create quantitative data. When you classify or judge something, you create qualitative data. So far, so good. But this is just the highest level of data: there are also different types of quantitative and qualitative data.</p>
Quantitative Flavors: Continuous Data and Discrete Data
<p>There are two types of quantitative data, which is also referred to as numeric data: <em><strong>continuous </strong></em>and <em><strong>discrete</strong>. </em><span style="line-height: 20.7999992370605px;">As a general rule, </span><em style="line-height: 20.7999992370605px;">counts </em><span style="line-height: 20.7999992370605px;">are discrete and </span><em style="line-height: 20.7999992370605px;">measurements </em><span style="line-height: 20.7999992370605px;">are continuous.</span></p>
<p><strong><em>Discrete </em></strong>data is a count that can't be made more precise. Typically it involves integers. For instance, the number of children (or adults, or pets) in your family is discrete data, because you are counting whole, indivisible entities: you can't have 2.5 kids, or 1.3 pets.</p>
<p><strong><em>Continuous</em> </strong>data, on the other hand, could be divided and reduced to finer and finer levels. For example, you can measure the height of your kids at progressively more precise scales—meters, centimeters, millimeters, and beyond—so height is continuous data.</p>
<p>If I tally<span style="line-height: 1.6;"> the number of individual Jujubes in a box, that number is a piece of discrete data. </span></p>
<p style="margin-left: 40px;"><img alt="a count of jujubes is discrete data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f5e3c44269356903cf156c065b10746a/jujubes_count_tally.jpg" style="width: 200px; height: 200px;" /></p>
<p><span style="line-height: 1.6;">If I use a scale to measure the weight of each Jujube, or the weight of the entire box, that's continuous data. </span></p>
<p style="margin-left: 40px;"><span style="line-height: 1.6;"><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/d11051162c9e2375e531ac589fd5a20e/jujube_weight_continuous_data.jpg" style="width: 200px; height: 200px;" /></span></p>
<p>Continuous data can be used in many different kinds of <a href="http://blog.minitab.com/blog/understanding-statistics/what-statistical-hypothesis-test-should-i-use">hypothesis tests</a>. For example, to assess the accuracy of the weight printed on the Jujubes box, we could measure 30 boxes and perform a 1-sample t-test. </p>
<p>Some analyses use continuous and discrete quantitative data at the same time. For instance, we could perform a <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-tutorial-and-examples">regression analysis</a> to see if the weight of Jujube boxes (continuous data) is correlated with the number of Jujubes inside (discrete data). </p>
Qualitative Flavors: Binary Data, Nominal Data, and Ordinal Data
<p>When you classify or categorize something, you create <em>qualitative</em> or <em>attribute</em> data. There are three main kinds of qualitative data.</p>
<p><em><strong>Binary </strong></em>data places things in one of two mutually exclusive categories: right/wrong, true/false, or accept/reject. </p>
<p>Occasionally, I'll get a box of Jujubes that contains a couple of individual pieces that are either too hard or too dry. If I went through the box and classified each piece as "Good" or "Bad," that would be binary data. I could use this kind of data to develop a statistical model to predict how frequently I can expect to get a bad Jujube.</p>
<p>When collecting <em><strong>unordered </strong></em>or <em><strong>nominal </strong></em>data, we assign individual items to named categories that do not have an implicit or natural value or rank. If I went through a box of Jujubes and recorded the color of each in my worksheet, that would be nominal data. </p>
<p style="margin-left: 40px;"><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ce64d648ac395d5c8098985caabc754f/jujubes_sorted_nominal_data.jpg" style="width: 200px; height: 97px;" /></p>
<p>This kind of data can be used in many different ways. For instance, I could use <a href="http://blog.minitab.com/blog/understanding-statistics/chi-square-analysis-of-halloween-and-friday-the-13th-is-there-a-slasher-movie-gender-gap">chi-square analysis</a> to see if there are statistically significant differences in the amounts of each color in a box. </p>
<p>We also can have <strong><em>ordered </em></strong>or <em><strong>ordinal </strong></em>data, in which items are assigned to categories that do have some kind of implicit or natural order, such as "Short, Medium, or Tall." <span style="line-height: 1.6;">Another example is a survey question that asks us to rate an item on a 1 to 10 scale, with 10 being the best. This implies that 10 is better than 9, which is better than 8, and so on. </span></p>
<p>The use of ordered data is a matter of some debate among statisticians. Everyone agrees it's appropriate for creating bar charts, but beyond that the answer to the question "What should I do with my ordinal data?" is "It depends." Here's a post from another blog that offers an excellent summary of the <a href="http://learnandteachstatistics.wordpress.com/2013/07/08/ordinal/" target="_blank">considerations involved</a>. </p>
Additional Resources about Data and Distributions
<p>For more fun statistics you can do with candy, check out this article (PDF format): <a href="http://www.minitab.com/uploadedFiles/Content/Academic/sweetening_statistics.pdf">Statistical Concepts: What M&M's Can Teach Us.</a> </p>
<p>For a deeper exploration of the probability distributions that apply to different types of data, check out my colleague Jim Frost's posts about <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-and-using-discrete-distributions">understanding and using discrete distributions</a> and <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-identify-the-distribution-of-your-data-using-minitab">how to identify the distribution of your data</a>.</p>
Fun Statistics, Learning, Statistics Help
Fri, 19 Dec 2014 13:00:00 +0000
http://blog.minitab.com/blog/understanding-statistics/understanding-qualitative-quantitative-attribute-discrete-and-continuous-data-types
Eston Martz
How to Calculate B10 Life with Statistical Software
http://blog.minitab.com/blog/meredith-griffith/how-to-calculate-b10-life-with-statistical-software
<p><span style="line-height: 1.6;">Over the last year or so I’ve heard a lot of people asking, “How can I calculate B10 life in Minitab?” Despite being a statistician and industrial engineer (mind you, one who has never been </span><em style="line-height: 1.6;">in</em><span style="line-height: 1.6;"> the field like the customers asking this question) and having taken a reliability engineering course, I’d never heard of B10 life. So I did some research.</span></p>
<p>The B10 life metric originated in the ball and roller bearing industry, but has become a metric used across a variety of industries today. It’s particularly useful in establishing warranty periods for a product. The “BX” or “Bearing Life” nomenclature, which refers to the time at which X% of items in a population will fail, speaks to these roots.</p>
<p>So then, B10 life is the time at which 10% of units in a population will fail. Alternatively, you can think of it as the 90% reliability of a population at a specific point in its lifetime—or the point in time when an item has a 90% probability of survival. The B10 life metric became popular among ball and roller bearing makers due to the industry’s strict requirement that no more than 10% of bearings in a given batch fail by a specific time due to fatigue failure. </p>
<p>Now that I know what the term means, I can tell people who ask that <a href="http://blog.minitab.com/blog/fun-with-statistics/what-i-learned-from-treating-childbirth-as-a-failure">Minitab’s reliability analysis</a> can easily compute this metric. (In fact, our <a href="http://www.minitab.com/products/minitab">statistical software</a> can compute any “BX” lifetime—but we’ll save that for another blog post.) B10 life is also known as the 10th percentile and can be found in Minitab’s Table of Percentiles output, which is displayed in Minitab’s session window.</p>
<p><img alt="B10 Life - Table of Percentiles" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/1ac7bbfe20e1a18c284babde45ce84af/b10life_image1.png" style="width: 461px; height: 324px;" /></p>
<p>And unlike other reliability metrics, B10 life directly correlates the maximum allowable percentile of failures (or the minimum allowable reliability) with an application-specific life point in time.</p>
<p>So we can get the B10 life metric by looking at the Table of Percentiles in Minitab’s session window output. But you might still be asking two questions: how do I create this table, and how do I interpret it?</p>
<img alt="You can't just put one of these into a pacemaker, after all! " src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/6c60a30de1566a4cc65dbb03c730680e/batteries.jpg" style="border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 180px; height: 180px;" />
Finding B10 Life, Step by Step
<p>Suppose we have tracked and recorded the battery life times over a certain number of years for 1,970 pacemakers. The reliability of pacemakers is critical, because patients’ lives depend on these devices!</p>
<p>We observed exact failure times—defined as the time at which a low battery signal was detected—for 1,019 of those pacemakers. The remaining 951 pacemakers never warned of a low battery, so they “survived.”</p>
<p>Our data is organized as follows:</p>
<p><img alt="B10 Life - Data Organization" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/90b7e4084faeee5bbe82a2abd7ff7c7e/b10life_image2.png" style="width: 183px; height: 264px;" /></p>
<p>When we have both observed failures and units surviving beyond a given time, we call the data “right-censored.” And we know from process knowledge that the <a href="http://blog.minitab.com/blog/understanding-statistics/why-the-weibull-distribution-is-always-welcome">Weibull distribution</a> best describes the lifetime of these pacemaker batteries. Knowing this information will help us use Minitab’s reliability analysis correctly.</p>
Setting Up the Reliability Analysis
<p>Because we have right-censored data and we know our distribution, we are ready to access Minitab’s <strong>Stat > Reliability/Survival > Distribution Analysis (Right Censoring) > Parametric Distribution Analysis </strong>menu to compute the B10 life.</p>
<p>We want to know the batteries’ reliability—or probability of survival—at different times, so our variable of interest is the number of years a pacemaker battery has survived. In the Parametric Distribution Analysis dialog, you’ll notice the Weibull distribution is already selected as the assumed distribution. We’ll leave this default setting since we know the Weibull distribution best describes battery life times.</p>
<p><img alt="B10 Life Metric - Right Censoring" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/646dd8ee5563d9748c90545d8f0a9fa0/b_10_life_image_3.png" style="width: 507px; height: 345px;" /></p>
<p>We also know whether the number in the ‘Years’ column was an exact failure time or a censored time (beyond which the battery survived). We must account for the censored data. By clicking the button labeled ‘Censor’, we can include a censoring column that contains values indicating whether the pacemaker survived or failed at the recorded time. In our Minitab worksheet, “Failed or Survived” is the censoring column. Our censoring value is ‘S’, which stands for ‘Survived’, indicating no failure was observed during the pacemaker battery tracking period.</p>
<p><img alt="B10 Life - censoring column" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/6042f1f9401bec3fe3c11ac335dfb834/b_10_life_image_4.png" style="width: 426px; height: 313px;" /></p>
Interpreting the Table of Percentiles and B10 Life
<p>Once we click OK through all dialogs to carry out the analysis, Minitab outputs the Table of Percentiles, where we can find our B10 life:</p>
<p> <img alt="B10 Life - Corresponding Percentile" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/2394c488f49f9f0e2615ae743926cde9/b_10_life_image_5a.png" style="width: 191px; height: 240px;" /></p>
<p>Where the Percent column displays 10, the corresponding Percentile value tells us that the B10 life of pacemaker batteries is 6.36 years—or, to put it another way, 6.36 years is the time at which 10% of the population of pacemaker batteries will fail.</p>
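<p>To see where a percentile like this comes from, here is a minimal sketch of the Weibull percentile formula, t = scale × (−ln(1 − p))^(1/shape). The fitted shape and scale for the pacemaker data are not quoted in the text, so the values below are hypothetical placeholders and the function names are made up for illustration.</p>

```python
import math


def weibull_bx_life(shape: float, scale: float, x_percent: float) -> float:
    """Time by which x_percent of units are expected to fail (the BX life)."""
    fraction = x_percent / 100.0
    return scale * (-math.log(1.0 - fraction)) ** (1.0 / shape)


def weibull_cdf(t: float, shape: float, scale: float) -> float:
    """Probability that a unit has failed by time t."""
    return 1.0 - math.exp(-((t / scale) ** shape))


# Hypothetical parameters, NOT the values fitted in the post.
shape, scale = 2.0, 20.0
b10 = weibull_bx_life(shape, scale, 10)

# Round trip: the failure probability at the B10 life should be 10%.
print(round(b10, 2), round(weibull_cdf(b10, shape, scale), 4))
```

<p>Swapping in the shape and scale that Minitab estimates for your own data should reproduce the value in the Percent = 10 row of the Table of Percentiles, and changing the last argument gives any other “BX” life.</p>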
<p>There we have it! The next time you are looking to compute the B10 life of a product, and perhaps seeking to establish suitable warranty periods, you need look no further than Minitab’s reliability tools and the Table of Percentiles.</p>
Reliability Analysis, Statistics
Mon, 15 Dec 2014 13:00:00 +0000
http://blog.minitab.com/blog/meredith-griffith/how-to-calculate-b10-life-with-statistical-software
Meredith Griffith
The World-Famous Disappearing-Reappearing-Analysis-Settings Act
http://blog.minitab.com/blog/statistics-and-quality-improvement/the-world-famous-disappearing-reappearing-analysis-settings-act
<p>Sure, Minitab Statistical Software is powerful and easy to use, but did you know that it’s also magic? One of the illusions that Minitab can perform is the world famous disappearing-reappearing-analysis-settings act. Of course, as with many illusions, it’s not so hard once you know the trick. In this case, it’s downright easy once you know about Minitab project files.</p>
<p><img alt="The statue of liberty" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/145817ecdac35066aeeea5b9fb106d99/statue_present.png" style="line-height: 20.7999992370605px; float: right; width: 250px; height: 180px; border-width: 1px; border-style: solid; margin: 10px 15px;" /></p>
<p>If you’ve done any work in Minitab you may very well have saved a project file and been grateful that <span>your data, graphs, and statistical tables could all be saved together in a single file</span>. But, it’s just as amazing that Minitab can remember exactly how you did your analysis the last time.</p>
<p>Imagine that you routinely <a href="http://blog.minitab.com/blog/understanding-statistics/i-think-i-can-i-know-i-can-a-high-level-overview-of-process-capability-analysis">run a capability analysis</a> on the same process. The first time you did the analysis, you changed several of the options to get the output that you wanted. When you open Minitab the next time, you want to perform the same analysis on a new data set. Having a saved project makes it easy. Try it for yourself if you want, following the steps below. Begin by downloading <a href="http://it.minitab.com/products/minitab/free-trial.aspx">our free trial</a> if you don't already have our statistical software, then download worksheets <a href="http://support.minitab.com/en-us/datasets/Basil.MTW">Basil.MTW</a> and <a href="http://support.minitab.com/en-us/datasets/Basil2.MTW">Basil2.MTW</a>.</p>
Introduce your Assistant
<ol>
<li>Open the Basil.MTW worksheet.</li>
<li>Choose <strong>Stat > Quality Tools > Capability Analysis > Multiple Variables (Normal)</strong>.</li>
<li>In <strong>Variables</strong>, enter <em>T1H1 T1H2</em>.</li>
<li>In <strong>Subgroup sizes</strong>, enter <em>4</em>.</li>
<li>In <strong>Lower spec</strong>, enter <em>2</em>.</li>
<li>In <strong>Upper spec</strong>, enter <em>8</em>.</li>
<li>Click <strong>Graphs</strong>.</li>
<li>Uncheck <strong>Normal probability plot</strong>. Click <strong>OK</strong>.</li>
<li>Click <strong>Options</strong>.</li>
<li>Under <strong>Display</strong>, select <strong>Benchmark Z’s (σ level)</strong> and check <strong>Include confidence intervals</strong>.</li>
<li>Click <strong>OK </strong>twice.</li>
</ol>
<p>The capability analysis is in your project file.</p>
<img alt="Statue not visible" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/7640a1717783092e715bc8a9321145ed/david_copperfield_statue_gone.jpg" style="width: 250px; height: 188px; float: right; border-width: 1px; border-style: solid; margin: 10px 15px;" />
Presto, they’re gone!
<ol>
<li>Close Minitab. When asked if you want to save changes to the project, click <strong>Yes</strong>.</li>
<li>Name the file and click <strong>Save</strong>.</li>
</ol>
<p>Minitab Statistical Software is closed. The settings for your analysis are nowhere to be found.</p>
Abracadabra—they’re back!
<ol>
<li>Reopen the project file that you saved.</li>
<li>Open the Basil2.MTW worksheet.</li>
<li>Choose <strong>Stat > Quality Tools > Capability Analysis > Multiple Variables (Normal)</strong>.</li>
</ol>
<p>The settings from your previous analysis have reappeared! All you have to do to complete the capability analysis, with all of your customizations, is click <strong>OK</strong>.</p>
Bask in the applause from the audience
<p>Keeping all of the parts of your analysis in one place is a great feature of Minitab’s project files. For people who routinely repeat the same analysis, the fact that the project file also remembers the settings that you used for your analysis is a fantastic time saver.</p>
<p>Whether you repeat an analysis weekly, quarterly, or even annually, Minitab’s ready to pick up right where you left off. This might not be quite as astounding as David Copperfield making the Statue of Liberty disappear and reappear, but if you want to get your statistical results fast and easy, it’s the best kind of magic.</p>
<p>Ready for more? Project files and many other fundamental features of Minitab are explained in the online <a href="http://support.minitab.com/en-us/minitab/17/getting-started/">Getting Started Guide</a>.</p>
Project Tools, Statistics Help
Wed, 10 Dec 2014 15:42:00 +0000
http://blog.minitab.com/blog/statistics-and-quality-improvement/the-world-famous-disappearing-reappearing-analysis-settings-act
Cody Steele
How Cpk and Ppk Are Calculated, part 2
http://blog.minitab.com/blog/marilyn-wheatleys-blog/how-cpk-and-ppk-are-calculated2c-part-2
<p>Minitab's capability analysis output gives you estimates of the capability indices Ppk and Cpk, and we receive many questions about the difference between them. Some of my colleagues have taken other approaches to explain the difference between Ppk and Cpk, so I wanted to show you how they differ by detailing precisely how each one is calculated. </p>
<p><span style="line-height: 1.6;">When you're using <a href="http://www.minitab.com/products/minitab">statistical software</a> like Minitab, you don't need to do these calculations by hand, but I also want to lift the lid off the "black box" to show you what Minitab does behind the scenes to provide these figures. </span></p>
<p><span style="line-height: 1.6;">In my previous post, we saw <a href="http://blog.minitab.com/blog/marilyn-wheatleys-blog/how-cpk-and-ppk-are-calculated2c-part-1">how Ppk is calculated</a>. This time, we'll go through the calculation of Cpk, using the same sample data set in Minitab.</span><span style="line-height: 1.6;"> Go to <strong>File > Open Worksheet</strong>, click the "Look in Minitab Sample Data folder" button at the bottom, and open the dataset named CABLE.MTW.</span></p>
Calculating Within-Subgroup Standard Deviation
<p>Where Ppk uses the overall standard deviation, Cpk uses the within-subgroup standard deviation. Calculating Cpk is easy once we have an estimate of the within-subgroup standard deviation. The default method in Minitab for the within-subgroup calculation is the pooled standard deviation. The formula for this calculation from Methods and formulas is:</p>
<p><img alt="formula for pooled standard deviation" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/e40d989c5285189e341d5ab615b9bfe0/pooledsd.png" style="width: 642px; height: 376px;" /></p>
<p>This looks a little intimidating, but you’ll see it’s not so bad if we take it one step at a time.</p>
<p>First, we’ll calculate Sp. For this example, the subgroup size is fixed at 5. We’ll begin with a clean worksheet containing only the Diameter data in C1.</p>
<p>We need to estimate the mean of the data in each subgroup and store those values in the worksheet. To do that, we’ll create a column that defines our subgroups using <strong>Calc > Make Patterned Data > Simple Set of Numbers</strong>, and then completing the dialog box as shown below:</p>
<p><img alt="subgroups" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/98f4740c6516242f060b4f22b0fa43ca/subgroup.png" style="width: 638px; height: 360px;" /></p>
<p>With 100 data points and 5 points in each subgroup, we have 20 subgroups.</p>
<p>Now we can use our new column containing the subgroups to calculate the mean of each subgroup, using <strong>Stat > Basic Statistics > Store Descriptive Statistics</strong>. We complete the dialog box as in the example below, entering the <em>Diameter </em>column under Variables and the <em>Subgroup </em>column as the By variable:</p>
<p><img alt="descriptive statistics" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/9a50586cefdd8896b974097384cf37b4/descr_stats.png" style="width: 600px; height: 309px;" /></p>
<p><span style="line-height: 1.6;">We then click Options and choose <strong>Store a row of output for each row of input</strong>, uncheck <strong>Store distinct values of By variables</strong>, and then click OK in each dialog box. Now column C3 will show the average of each subgroup; the first 5 rows from C1 were used to calculate the mean of those first 5 rows, and that same mean value is displayed in the first 5 rows of C3.</span></p>
<p>We will now use these values to calculate the numerator for Sp using <strong>Calc > Calculator</strong>:</p>
<p><img alt="numerator for Sp" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/53aad02dfe93c8744f272a3ec3dabb76/calculator2.png" style="width: 609px; height: 400px;" /></p>
<p>We are summing the squared differences between each measurement and its subgroup mean. The Numerator column in the Minitab worksheet will show <strong>0.02735</strong> using the formula above.</p>
<p>Next, we calculate the denominator for Sp, which is the subgroup size minus 1, summed over all subgroups. Since we have a constant subgroup size of 5, and a total of 20 subgroups, an easy way to enter this in the calculator is:</p>
<p><img alt="denominator for Sp" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/22ec46ed79ede93c79af6e8288e8224b/calculator3.png" style="width: 593px; height: 396px;" /></p>
<p>Now with the numerator and denominator for Sp stored in the worksheet, we take the square root of Numerator/Denominator:</p>
<p><img alt="square root of numerator/denominator" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/29fe9f85f7fb535497948f3c5eb38451/calculator4.png" style="width: 590px; height: 394px;" /></p>
<p>Notice that the Sp value 0.0184899 is the estimate of the within-subgroup standard deviation that Minitab reports if we tell it NOT to use the unbiasing constant, C4, by clicking the Estimate button in the Normal Capability Analysis dialog box and then unchecking <strong>Use unbiasing constants</strong>. </p>
<p>Now to finish calculating the within-subgroup standard deviation using C4 (the default), we can look up C4 in the table that is linked in Methods and Formulas under the Methods heading.</p>
<p>The C4 value we need is C4 for (d + 1). As defined in Methods and formulas, d is the sum of (subgroup size – 1); in our case the subgroup size is fixed at 5, so 20*(5-1) = 80. If d = 80, we add 1 and get 81, so we look up N = 81 in the C4 column of unbiasing constants:</p>
<p><img alt="unbiasing constants" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/400dffc756ede4ae4f272cc10fb7a256/c4.png" style="width: 532px; height: 82px;" /></p>
<p><span style="line-height: 1.6;">We enter 0.996880 in column C7 in the worksheet and use it in the calculator to get the </span><span style="line-height: 20.7999992370605px;">pooled within-subgroup standard deviation</span><span style="line-height: 1.6;">:</span></p>
<p><span style="line-height: 1.6;"><img alt="within subgroup standard deviation" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/d22f12738d7eade1bef550a3bfb061c1/sdwithin.png" style="width: 595px; height: 400px;" /></span></p>
<p> We can see that this value matches the output from our initial capability analysis graph.</p>
<p><img alt="initial graph" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/6b18e5ea5e14c5f5992cc335766e505f/initial_graph.png" style="width: 171px; height: 139px;" /></p>
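<p>The worksheet steps above can also be sketched in a few lines of code. This is a minimal illustration rather than Minitab's implementation: the function names are made up, the data is passed as a list of subgroups, and C4 is computed from its standard gamma-function definition instead of being looked up in a table. The last lines redo the arithmetic using only the figures quoted in this post (numerator 0.02735, d = 80).</p>

```python
import math


def c4(n: int) -> float:
    """Unbiasing constant c4 for a sample of size n (n >= 2)."""
    # lgamma avoids overflow in the gamma ratio for large n.
    log_ratio = math.lgamma(n / 2.0) - math.lgamma((n - 1) / 2.0)
    return math.sqrt(2.0 / (n - 1)) * math.exp(log_ratio)


def pooled_within_sd(subgroups, unbias=True):
    """Pooled within-subgroup standard deviation.

    subgroups: list of lists of measurements, one inner list per subgroup.
    """
    numerator = 0.0  # squared deviations from each subgroup's own mean
    d = 0            # sum of (subgroup size - 1)
    for values in subgroups:
        mean = sum(values) / len(values)
        numerator += sum((x - mean) ** 2 for x in values)
        d += len(values) - 1
    sp = math.sqrt(numerator / d)
    # Divide by c4(d + 1) only when the unbiasing constant is requested.
    return sp / c4(d + 1) if unbias else sp


# Arithmetic check with the figures quoted in the post.
sp = math.sqrt(0.02735 / 80)    # Sp without the unbiasing constant
sigma_within = sp / c4(81)      # d + 1 = 81
print(round(sp, 7), round(c4(81), 6), round(sigma_within, 6))
```

<p>Passing <code>unbias=False</code> corresponds to unchecking <strong>Use unbiasing constants</strong> in the Estimate dialog.</p>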
Calculating Cpk
<p>Finally, we use our within-subgroup standard deviation to calculate CPU and CPL. <span style="line-height: 1.6;">Cpk is the lesser of CPU and CPL, and we find these two formulas in <strong>Methods and Formulas</strong>:</span></p>
<p><img alt="formula for CPL" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/6ecec9d3f48c1f1209985461f453017c/cpl.png" style="width: 391px; height: 178px;" /><img alt="formula for cpu" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/dc08335c4a2e31422ec35c4bbad332e7/cpu.png" style="width: 369px; height: 180px;" /></p>
<p><span style="line-height: 1.6;">We calculate CPL and CPU as shown below using the calculator and the mean of the data that we previously calculated:</span></p>
<p><img alt="calculate cpl and cpu" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/cef4860ce67f56b0eded93d66f2a9fbc/calculator5.png" style="width: 600px; height: 464px;" /></p>
<p><span style="line-height: 1.6;">Since Cpk is the lesser of the two resulting values, Cpk is 0.83. That matches the Cpk value in Minitab’s capability output:</span></p>
<p><img alt="process capability for diameter" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/fa1311e847201a8b17e8993d6d4cd889/capability_for_diameter.png" style="width: 600px; height: 363px;" /></p>
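<p>The same calculation can be sketched in Python. The specification limits, mean, and standard deviation below are made-up numbers for illustration, not the values from this example.</p>

```python
def capability_indices(mean, sd_within, lsl, usl):
    """Return (CPL, CPU, Cpk) using the within-subgroup standard deviation."""
    cpl = (mean - lsl) / (3 * sd_within)  # distance from mean to lower spec, in 3-sigma units
    cpu = (usl - mean) / (3 * sd_within)  # distance from mean to upper spec, in 3-sigma units
    return cpl, cpu, min(cpl, cpu)

# Hypothetical inputs: mean 10, within-subgroup SD 1, spec limits 7 and 14.
cpl, cpu, cpk = capability_indices(mean=10.0, sd_within=1.0, lsl=7.0, usl=14.0)
print(cpl, cpu, cpk)
```

<p>Because the hypothetical mean sits closer to the lower specification limit, CPL is the smaller value and becomes Cpk.</p>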
<p>As long as you're using Minitab, you won't need to calculate Ppk and Cpk by hand. But I hope seeing the calculations Minitab uses to get these capability indices provides some insight into the differences between them! </p>
Data Analysis, Quality Improvement, Statistics
Tue, 09 Dec 2014 13:00:00 +0000
http://blog.minitab.com/blog/marilyn-wheatleys-blog/how-cpk-and-ppk-are-calculated2c-part-2
Marilyn Wheatley
Lessons in Quality from Guadalajara and Mexico City
http://blog.minitab.com/blog/understanding-statistics-and-its-application/lessons-in-quality-from-guadalajara-and-mexico-city
<p><img alt="View of Mexico City" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/8e5ec9217bc8fbc2ca7a6784a1efcdfa/mexico_df_400w.jpg" style="border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 400px; height: 235px;" />Last week, thanks to the collective effort from many people, we held very successful events in Guadalajara and Mexico City, which gave us a unique opportunity to meet with over 300 Spanish-speaking Minitab users. They represented many different industries, including automotive, textile, pharmaceutical, medical devices, oil and gas, electronics, and mining, as well as academic institutions and consultants.</p>
<p>As I listened to my peers Jose Padilla and <a href="http://blog.minitab.com/blog/marilyn-wheatleys-blog">Marilyn Wheatley</a> deliver their presentations, it was interesting to see people's reactions as they learned more about our products and services. Several attendees were particularly pleased to learn more about Minitab's ease-of-use and <a href="http://www.minitab.com/products/minitab/assistant/">step-by-step help with analysis</a> offered by the Assistant menu. I saw others react to demonstrations of Minitab's comprehensive Help system, the use of executables for automation purposes, and several of the tips and tricks discussed throughout our presentations.</p>
<p>We also had multiple conversations on Minitab's flexible licensing options. Several attendees who spend a lot of time on the road were particularly glad to learn about our <a href="http://support.minitab.com/installation/frequently-asked-questions/license-fulfillment/borrow-a-license-of-minitab-companion/">borrowing functionality</a>, which lets you “check out” a license so you can use Minitab software without accessing your organization’s license server.</p>
Acceptance Sampling Plans
<p>There were plenty of technical discussions as well. One interesting question came from a user who asked how Minitab's Acceptance Sampling Plans compare to the <a href="http://asq.org/knowledge-center/ANSI_ASQZ1_4-2008/index.html">ANSI Z1.4</a> standard (a.k.a. MIL-STD 105E). The short answer is that the tables provided by ANSI Z1.4 are built for a specific AQL (Acceptable Quality Level) and implicitly assume a certain RQL (Rejectable Quality Level) based solely on the lot size. ANSI Z1.4 is an AQL-based system, while Minitab's acceptance sampling plans give you the flexibility to create a customized sampling scheme for a specific AQL, RQL, or lot size using either the binomial or the hypergeometric distribution.</p>
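<p>To make the idea of a customized plan concrete, here is a rough Python sketch that searches for the smallest sample size n and acceptance number c meeting a given AQL and RQL under the binomial model. It is not Minitab's algorithm; the function names, the default risks (alpha = 0.05, beta = 0.10), and the search limit are assumptions made for illustration, and the hypergeometric case is not covered.</p>

```python
from math import comb

def accept_prob(n, c, p):
    """P(accept the lot) = P(at most c defectives in a sample of n), binomial model."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c + 1))

def find_plan(aql, rql, alpha=0.05, beta=0.10, max_n=5000):
    """Smallest n (with its c) that accepts AQL lots with probability >= 1 - alpha
    and accepts RQL lots with probability <= beta. Returns None if none is found."""
    for n in range(1, max_n + 1):
        # Smallest acceptance number that protects the producer at the AQL.
        c = 0
        while accept_prob(n, c, aql) < 1 - alpha:
            c += 1
        # That c gives the consumer the most protection for this n.
        if accept_prob(n, c, rql) <= beta:
            return n, c
    return None

plan = find_plan(aql=0.01, rql=0.05)
print(plan)
```

<p>Tightening either risk, or moving the AQL and RQL closer together, pushes the required sample size up, which is the trade-off a customized plan lets you control.</p>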
Destructive Testing and Gage R&R
<p>Other users had questions about Gage R&R and destructive testing. Practitioners commonly assess a destructive test using Nested Gage R&R; however, this is not always necessary. The main problem with destructive testing is that every part tested is destroyed and thus can only be measured by a single operator. Since the purpose of this type of analysis is to measure the repeatability and reproducibility of the measurement system, one must identify parts that are as homogeneous as possible. Typically, instead of 10 parts, practitioners may use multiple parts from each of 10 batches. If the within-batch variation is small enough, then the parts from each batch can be considered to be "the same" and thus the readings measured by all the operators can be used to produce repeatability and reproducibility measures. The main trick is to have homogeneous units or batches that can give you enough samples to be tested by all operators for all replicates. If this is the case, you can analyze a destructive test with crossed gage R&R.</p>
Control Charts and Subgroup Size
<p>We also had an interesting discussion about the sensitivity of Shewhart <a href="http://blog.minitab.com/blog/understanding-statistics/control-chart-tutorials-and-examples">control charts</a> to the subgroup size. Specifically, one of the attendees asked which subgroup size we recommend: 4 or 5? </p>
<p>The answer to this intriguing question requires an understanding of the reason why subgroups are recommended. Control charts have limits that are constructed so that if the process is stable, the probability of observing points outside these control limits is very small; this probability is typically referred to as the false alarm rate, and it is usually set at 0.0027. This calculation assumes the process is normally distributed, so if we were plotting the individual data as in an Individuals chart, the control limits would be effective for detecting an out-of-control situation only if the data came from a normal distribution. To reduce the dependence on normality, Shewhart suggested collecting the data in subgroups, because if we plot the means instead of the individual data, the control limits become less and less sensitive to non-normality as the subgroup size increases. This is a result of the Central Limit Theorem (CLT), which states that, regardless of the underlying distribution of the data, if we take independent samples and compute the average (or sum) of the observations in each sample, the distribution of these sample means converges to a normal distribution as the sample size grows.</p>
<p>So going back to the original question, what is the recommended subgroup size for building control charts? The answer depends on how skewed the underlying distribution may be. For many distributions, a subgroup size of 5 is sufficient for the CLT to kick in, making our control charts robust to non-normality; however, for extremely skewed distributions like the exponential, the subgroup size may need to be much larger than 50. This topic was discussed in a paper by Schilling and Nelson titled "<a href="http://asq.org/qic/display-item/?item=5238">The Effect of Non-normality on the Control Limits of Xbar Charts</a>" published in JQT back in 1976.</p>
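<p>The 0.0027 figure is simply the probability that a normally distributed value falls more than three standard deviations from its mean. A short Python check, with a function name chosen here for illustration:</p>

```python
import math

def false_alarm_rate(k=3.0):
    """P(|Z| > k) for a standard normal Z: the chance that a stable,
    normally distributed process plots a point outside k-sigma limits."""
    return math.erfc(k / math.sqrt(2.0))

print(round(false_alarm_rate(3.0), 4))
```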
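<p>To get a feel for the exponential case, the sketch below computes the exact false alarm rate of an Xbar chart when the individual observations are exponential with mean 1. It assumes the control limits are set from the known mean and standard deviation (both equal to 1) rather than estimated from data, and it relies on the fact that the sum of n such observations follows an Erlang distribution. The function names are illustrative.</p>

```python
import math

def erlang_upper_tail(n, x):
    """P(S > x), where S is the sum of n independent exponential(1) values."""
    if x <= 0:
        return 1.0
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= x / k          # builds x**k / k! incrementally
        total += term
    return math.exp(-x) * total

def xbar_false_alarm_rate_exponential(n, k=3.0):
    """Chance that a subgroup mean of n exponential(1) values falls outside
    k-sigma limits computed from the true mean and standard deviation."""
    # The subgroup sum has mean n and standard deviation sqrt(n).
    upper = n + k * math.sqrt(n)
    lower = n - k * math.sqrt(n)
    above = erlang_upper_tail(n, upper)
    below = 1.0 - erlang_upper_tail(n, lower)  # zero when the lower limit is negative
    return above + below

for n in (5, 50):
    print(n, xbar_false_alarm_rate_exponential(n))
```

<p>Under these assumptions, the rate for subgroups of 5 comes out near 0.009, more than three times the nominal 0.0027, and it shrinks toward the nominal value only slowly as the subgroup size grows.</p>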
Analyzing Variability
<p>We also had a great discussion about modeling variability in a process. One of the attendees, working for McDonald's, was looking for statistical methods for reducing the variation of the weight of apple slices. An apple is cut into 10 slices, and the goal was to minimize the variation in weight so that exactly four slices could be placed in each bag without further rework. This gave me the opportunity to demonstrate how to use the <a href="http://blog.minitab.com/blog/adventures-in-statistics/assessing-variability-for-quality-improvement">Analyze Variability</a> command in Minitab, which happens to be one of the topics we cover in our <a href="http://www.minitab.com/training/courses/#doe-in-practice-manufacturing">DOE in Practice</a> course.</p>
We Love Your Questions
<p>For me and my fellow trainers, there’s nothing better than talking with people who are using Minitab software to solve problems. Sometimes we’re able to provide a quick, helpful answer. Sometimes a question provokes a great discussion about some quality challenge we all have in common. And sometimes a question will lead to a great idea that we’re able to share with our developers and engineers to make our software better. </p>
<p>If you have a question about Minitab, statistics, or quality improvement, please feel free to comment here. And if you use Minitab software, you can always contact our <a href="http://www.minitab.com/support/">customer support</a> team for direct assistance from specialists in IT, statistics, and quality improvement.</p>
Quality Improvement, Statistics, Statistics Help
Wed, 19 Nov 2014 13:57:02 +0000
http://blog.minitab.com/blog/understanding-statistics-and-its-application/lessons-in-quality-from-guadalajara-and-mexico-city
Eduardo Santiago
What to Do When Your Data's a Mess, part 3
http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-datas-a-mess2c-part-3
<p>Everyone who analyzes data regularly has the experience of getting a worksheet that just isn't ready to use. Previously I wrote about tools you can use to <a href="http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-data-is-a-mess-part-1">clean up and eliminate clutter in your data</a> and <a href="http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-datas-a-mess2c-part-2">reorganize your data</a>. </p>
<p><span style="line-height: 1.6;">In this post, I'm going to highlight tools that help you get the most out of messy data by altering its characteristics.</span></p>
Know Your Options
<p>Many problems with data don't become obvious until you begin to analyze it. A shortcut or abbreviation that seemed to make sense while the data was being collected, for instance, might turn out to be a time-waster in the end. What if abbreviated values in the data set only make sense to the person who collected it? Or a column of numeric data accidentally gets coded as text? You can solve those problems quickly with <a href="http://www.minitab.com/products/minitab">statistical software</a> packages.</p>
Change the Type of Data You Have
<p>Here's an instance where a data entry error resulted in a column of numbers being incorrectly classified as text data. This will severely limit the types of analysis that can be performed using the data.</p>
<p><img alt="misclassified data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c45b427d3e5e2b5eac4a505ed5c3b24f/misclassified_data.png" style="width: 200px; height: 156px;" /></p>
<p>To fix this, select <strong>Data > Change Data Type</strong> and use the dialog box to choose the column you want to change.</p>
<p><img alt="change data type menu" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/46ece127300500409098383a2e476a9b/text_to_numeric_data.png" style="width: 376px; height: 175px;" /></p>
<p>One click later, and the errant text data has been converted to the desired numeric format:</p>
<p><img alt="numeric data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f1b9df0211f9085e577a41b0e3661b45/numeric_data.png" style="width: 200px; height: 156px;" /></p>
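<p>If you ever need to do the same conversion in a script rather than through the menu, the idea looks roughly like this in Python. The column contents and the function name are hypothetical, and treating unparseable entries as missing is an assumption of this sketch.</p>

```python
def text_to_numeric(column):
    """Convert text entries to floats; blank or unparseable entries become None (missing)."""
    converted = []
    for entry in column:
        try:
            converted.append(float(entry.strip()))
        except ValueError:
            converted.append(None)
    return converted

readings = ["12.5", " 13.1", "", "14"]  # hypothetical column stored as text
print(text_to_numeric(readings))
```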
Make Data More Meaningful by Coding It
<p>When this company collected data on the performance of its different functions across all its locations, it used numbers to represent both locations and units. </p>
<p><img alt="uncoded data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/d22a57fe9e9e398bd948e86c0adafe34/uncoded_data.png" style="width: 135px; height: 158px;" /></p>
<p>That may have been a convenient way to record the data, but unless you've memorized what each set of numbers stands for, interpreting the results of your analysis will be a confusing chore. You can make the results easy to understand and communicate by coding the data. </p>
<p>In this case, we select <strong>Data > Code > Numeric to Text...</strong></p>
<p><img alt="code data menu" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c75e46cc190497fd41b0e6736518c0fe/code_data_menu.png" style="width: 384px; height: 255px;" /></p>
<p>And we complete the dialog box as follows, telling the software to replace the numbers with more meaningful information, like the town each facility is located in. </p>
<p><img alt="Code data dialog box" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/cd75c14324187806b8f3a74a3b8996b4/code_data_dialog.png" style="width: 400px; height: 345px;" /></p>
<p>Now you have data columns that can be understood by anyone. When you create graphs and figures, they will be clearly labelled. </p>
<p><img alt="Coded data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/7ff81bdb08170d6d8a4e8547623cf557/coded_data.png" style="width: 161px; height: 200px;" /></p>
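<p>Coding is essentially a lookup from each number to a label. A minimal Python sketch of that idea follows; the town names and the function name are invented for illustration, and values without a label are left blank.</p>

```python
def code_values(column, mapping, default=""):
    """Replace each value with its label; values with no label get the default."""
    return [mapping.get(value, default) for value in column]

# Hypothetical mapping from location codes to town names.
location_names = {1: "Springfield", 2: "Riverton", 3: "Lakeside"}
locations = [1, 3, 2, 1]
print(code_values(locations, location_names))
```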
Got the Time?
<p>Dates and times can be very important in looking at performance data and other indicators that might have a cyclical or time-sensitive effect. But the way the date is recorded in your data sheet might not be exactly what you need. </p>
<p>For example, if you wanted to see if the day of the week had an influence on the activities in certain divisions of your company, a list of dates in the MM/DD/YYYY format won't be very helpful. </p>
<p><img alt="date column" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f5b0dd178afbc0352f8dc2d9378e887b/date_column.png" style="width: 240px; height: 223px;" /></p>
<p>You can use <strong>Data > Date/Time > Extract to Text... </strong>to identify the day of the week for each date.</p>
<p><img alt="extract-date-to-text" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/7e6f7e8a87ee8291b9c6d51507092c19/extract_date_to_text.png" style="width: 351px; height: 132px;" /></p>
<p>Now you have a column that lists the day of the week, and you can easily use it in your analysis. </p>
<p><img alt="day column" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/dede93c9621917a0cfb54beef121d4e2/day_column.png" style="width: 249px; height: 205px;" /></p>
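<p>The same extraction can be sketched in Python with the standard <code>datetime</code> module. The function name is illustrative, and the dates are assumed to be text in MM/DD/YYYY format.</p>

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def day_of_week(date_text, fmt="%m/%d/%Y"):
    """Return the day name for a date written as MM/DD/YYYY."""
    # weekday() numbers the days from Monday = 0 to Sunday = 6.
    return DAY_NAMES[datetime.strptime(date_text, fmt).weekday()]

print(day_of_week("11/17/2014"))
```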
Manipulating for Meaning
<p>These tools are commonly seen as a way to correct data-entry errors, but as we've seen, you can use them to make your data sets more meaningful and easier to work with.</p>
<p>There are many other tools available in Minitab's Data menu, including an array of options for arranging, combining, dividing, fine-tuning, rounding, and otherwise massaging your data to make it easier to use. Next time you've got a column of data that isn't quite what you need, try using the Data menu to get it into shape.</p>
Data Analysis, Statistics, Stats
Mon, 17 Nov 2014 13:00:00 +0000
http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-datas-a-mess2c-part-3
Eston Martz
Are Preseason Football or Basketball Rankings More Accurate?
http://blog.minitab.com/blog/the-statistics-game/are-preseason-football-or-basketball-rankings-more-accurate
<p>College basketball season tips off today, and for the second straight season Kentucky is the #1 ranked preseason team in the AP poll. Last year Kentucky did not live up to that ranking in the regular season, going 24-10 and earning a lowly 8 seed in the NCAA tournament. But then, in the tournament, they overachieved and made a run all the way to the championship game...before losing to Connecticut.</p>
<p>In football, Florida State was the AP poll preseason #1 football team. While they are currently still undefeated, they aren't quite playing like the #1 team in the country. So this made me wonder, which preseason rankings are more accurate, football or basketball?</p>
<p>I gathered <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/1d3961db92c5ba14bc90b2b8323b95f8/preseason_basketball_vs__football_rankings.MTW">data</a> from the last 10 seasons, and recorded the top 10 teams in the preseason AP poll for both football and basketball. Then I recorded the difference between their preseason ranking and their final ranking. Both sports had 10 teams that weren’t ranked or receiving votes in the final poll, so I gave all of those teams a final ranking of 40.</p>
Creating a Histogram to Compare Two Distributions
<p>Let’s start with a histogram to look at the distributions of the differences. (It's always a good idea to look at the distribution of your data when you're starting an analysis, whether you're looking at quality improvement data for work or sports data for yourself.) </p>
<p>You can create this graph in Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a> by selecting <strong>Graph > Histograms</strong>, choosing "With Groups" in the dialog box, and using the Basketball Difference and Football Difference columns as the graph variables:</p>
<p><img alt="Histogram" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/53055c57978dbfa85d28688cc816c98a/histogram_of_basketball_difference__football_difference.jpg" style="width: 720px; height: 480px;" /></p>
<p>The differences in the rankings appear to be pretty similar. Most of the data is towards the left side of this histogram, meaning for most cases the difference between the preseason and final ranking is pretty small.</p>
Conducting a Mann-Whitney Hypothesis Test on Two Medians
<p>We can further investigate the data by performing a hypothesis test. Because the data is heavily skewed, I’ll use <a href="http://blog.minitab.com/blog/the-statistics-game/do-the-data-really-say-female-named-hurricanes-are-more-deadly">a Mann-Whitney test</a>. This compares the medians of two samples with similarly-shaped distributions, as opposed to a <a href="http://blog.minitab.com/blog/understanding-statistics/guidelines-and-how-tos-for-the-2-sample-t-test">2-sample t test</a>, which compares the means. <span style="line-height: 20.7999992370605px;">The median is the middle value of the data. Half the observations are less than or equal to it, and half the observations are greater than or equal to it.</span><span style="line-height: 20.7999992370605px;"> </span></p>
<p>To perform this test in our statistical software, we select <strong>Stat > Nonparametrics > Mann-Whitney</strong>, then choose the appropriate columns for our first and second sample: </p>
<p><img alt="Mann-Whitney Test" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/1a1f239841b82e60170e6ecbc8077d4b/mann_whitney.jpg" style="width: 689px; height: 241px;" /></p>
<p>The basketball rankings have a smaller median difference than the football rankings. However, when we examine the <a href="http://blog.minitab.com/blog/understanding-statistics/three-things-the-p-value-cant-tell-you-about-your-hypothesis-test">p-value</a> we see that this difference is not statistically significant. There is not enough evidence to conclude that one preseason poll is more accurate than the other.</p>
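<p>For the curious, here is a bare-bones Python sketch of what a Mann-Whitney test computes: it ranks the pooled data, sums the ranks for the first sample, and uses a normal approximation (with tie and continuity corrections) for a two-sided p-value. It is a simplified illustration and may not match Minitab's output digit for digit; the sample values are made up, not the ranking data from this post.</p>

```python
import math

def mann_whitney(sample1, sample2):
    """Return (U for sample1, two-sided p-value from the normal approximation)."""
    n1, n2 = len(sample1), len(sample2)
    total = n1 + n2
    pooled = sorted([(v, 0) for v in sample1] + [(v, 1) for v in sample2])
    rank_sum1 = 0.0
    tie_term = 0.0
    i = 0
    while i < total:
        j = i
        while j < total and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0   # average of ranks i+1 .. j for tied values
        ties = j - i
        tie_term += ties**3 - ties
        rank_sum1 += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u1 = rank_sum1 - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((total + 1) - tie_term / (total * (total - 1)))
    if var_u == 0:
        return u1, 1.0
    z = max((abs(u1 - mean_u) - 0.5) / math.sqrt(var_u), 0.0)
    return u1, math.erfc(z / math.sqrt(2.0))

# Hypothetical differences between preseason and final rank.
basketball = [0, 2, 3, 5, 8, 30]
football = [1, 4, 6, 7, 12, 33]
print(mann_whitney(basketball, football))
```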
<p>But what about the best teams? I grouped each of the top 3 ranked teams and looked at the median difference between their preseason and final rank.</p>
<p><img alt="Bar Chart" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/692a3db40dd5d3b4c20d539f92395629/bar_chart.jpg" style="width: 720px; height: 480px;" /></p>
<p>The preseason AP basketball poll has a smaller difference for the #1 and #3 ranked teams. But the football poll is better for the #2 team, having an impressive median value of 1. Overall, both polls are relatively good, as neither has a median value greater than 6. And the differences are close enough that we can’t conclude that one is more accurate than the other.</p>
What Does It Mean for the Teams?
<p>While the odds are against either Kentucky or Florida State finishing the season ranked #1 in their respective polls, previous seasons indicate that they’re still likely to finish as one of the top teams. This is better news for Kentucky, as being one of the top teams means they’ll easily make the NCAA basketball tournament and get a high seed. However, Florida State must finish as one of the top 4 teams, or else they’ll miss out on the football postseason completely.</p>
<p>So while we can’t conclude one poll is better than the other, teams at the top of the AP basketball poll are clearly much more likely to reach the postseason than teams at the top of the football poll.</p>
Data Analysis, Fun Statistics, Hypothesis Testing, Statistics in the News
Fri, 14 Nov 2014 15:03:33 +0000
http://blog.minitab.com/blog/the-statistics-game/are-preseason-football-or-basketball-rankings-more-accurate
Kevin Rudy
The Power of Multivariate ANOVA (MANOVA)
http://blog.minitab.com/blog/adventures-in-statistics/the-power-of-multivariate-anova-manova
<p><img alt="Willy Wonka" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/964d1b613c1569e983213d2544915ac5/willywonka.jpg" style="float: right; width: 225px; height: 225px; border-width: 1px; border-style: solid; margin: 10px 15px;" />Analysis of variance (ANOVA) is great when you want to compare the differences between group means. For example, you can use ANOVA to assess how three different alloys are related to the mean strength of a product. However, most ANOVA tests assess one response variable at a time, which can be a big problem in certain situations. Fortunately, <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">Minitab statistical software</a> offers a multivariate analysis of variance (MANOVA) test that allows you to assess multiple response variables simultaneously.</p>
<p>In this post, I’ll run through a MANOVA example, explain the benefits, and cover how to know when you should use MANOVA.</p>
Limitations of ANOVA
<p>Whether you’re using <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/basics/what-is-a-general-linear-model/" target="_blank">general linear model (GLM)</a> or <a href="http://blog.minitab.com/blog/adventures-in-statistics/did-welchs-anova-make-fishers-classic-one-way-anova-obsolete" target="_blank">one-way ANOVA</a>, most ANOVA procedures can only assess one response variable at a time. Even with GLM, where you can include many factors and covariates in the model, the analysis simply cannot detect multivariate patterns in the response variable.</p>
<p>This limitation can be a huge roadblock for some studies because it may be impossible to obtain significant results with a regular ANOVA test. You don’t want to miss out on any significant findings!</p>
Example That Compares MANOVA to ANOVA
<p>What the heck are multivariate patterns in the response variable? It sounds complicated, but it’s very easy to show the difference between how ANOVA and MANOVA test the data by using graphs.</p>
<p>Let’s assume that we are studying the relationship between three alloys and the strength and flexibility of our products. Here is the <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/File/3f3b6f58c70a646731a9db97bd7edfab/manova_example.MTW">dataset for the example</a>.</p>
<p>The two individual value plots below show how one-way ANOVA analyzes the data—one response variable at a time. In these graphs, alloy is the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/anova-models/factor-and-factor-levels/" target="_blank">factor</a> and strength and flexibility are the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/regression-models/what-are-response-and-predictor-variables/" target="_blank">response variables</a>.</p>
<img alt="Individual value plot of strength by alloy" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/3402fd3845c2226f555b4ebfe18a87f5/strength_ivp.png" style="width: 350px; height: 233px;" />
<img alt="Individual value plot of flexibility by alloy" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/c7fba5c5eda5e81e02db60b2aefb3327/flexibility_ivp.png" style="width: 350px; height: 233px;" />
<p>The two graphs seem to show that the type of alloy is not related to either the strength or flexibility of the product. When you perform the one-way ANOVA procedure for these graphs, the p-values for strength and flexibility are 0.254 and 0.923 respectively.</p>
<p>Drat! I guess Alloy isn't related to either Strength or Flexibility, right? Not so fast!</p>
<p>Now, let’s take a look at the multivariate response patterns. To do this, I’ll display the same data with a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/graphs/graphs-of-pairs-of-variables/scatterplots/scatterplot/" target="_blank">scatterplot</a> that plots Strength by Flexibility with Alloy as a categorical grouping variable.</p>
<p><img alt="Scatterplot of strength by flexibility grouped by alloy" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/86483284f76817ea95b3c1787e45e7d5/scatterplot.png" style="width: 576px; height: 384px;" /></p>
<p>The scatterplot shows a positive correlation between Strength and Flexibility. MANOVA is useful when you have correlated response variables like these. You can also see that for a given flexibility score, Alloy 3 generally has a higher strength score than Alloys 1 and 2. We can use MANOVA to statistically test for this response pattern to be sure that it’s not due to random chance.</p>
<p>To perform the MANOVA test in Minitab, go to: <strong>Stat > ANOVA > General MANOVA</strong>. Our response variables are Strength and Flexibility and the predictor is Alloy.</p>
<p>Whereas one-way ANOVA could not detect the effect, MANOVA finds it with ease. The p-values in the results are all very significant. You can conclude that Alloy influences the properties of the product by changing the relationship between the response variables.</p>
<p><img alt="MANOVA results" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/c96fe9a066011b31692765318c2f0d26/manova_swo.png" style="width: 391px; height: 155px;" /></p>
<p>For a more complete guide on how to interpret MANOVA results in Minitab, go to: <strong>Help > StatGuide > ANOVA > General MANOVA</strong>.</p>
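<p>One of the statistics a MANOVA reports is Wilks' lambda, the ratio of the within-group generalized variance to the total generalized variance. The Python sketch below computes it for two response variables and one factor, using made-up data in which the groups overlap on each response separately but differ in the relationship between them. The function names and data are illustrative only, and the sketch stops short of a p-value.</p>

```python
def mean2(points):
    """Mean of a list of (x, y) pairs."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def sscp(points, center):
    """Sums of squares and cross-products about a center: (s11, s12, s22)."""
    s11 = sum((x - center[0]) ** 2 for x, _ in points)
    s12 = sum((x - center[0]) * (y - center[1]) for x, y in points)
    s22 = sum((y - center[1]) ** 2 for _, y in points)
    return s11, s12, s22

def wilks_lambda(groups):
    """det(within-group SSCP) / det(total SSCP) for two response variables."""
    all_points = [p for pts in groups.values() for p in pts]
    t11, t12, t22 = sscp(all_points, mean2(all_points))
    e11 = e12 = e22 = 0.0
    for pts in groups.values():
        a, b, c = sscp(pts, mean2(pts))
        e11 += a
        e12 += b
        e22 += c
    return (e11 * e22 - e12**2) / (t11 * t22 - t12**2)

# Hypothetical (flexibility, strength) pairs for two alloys.
data = {
    "Alloy A": [(0.0, 0.0), (1.0, 1.0), (2.0, 2.5)],
    "Alloy B": [(0.0, 3.0), (1.0, 4.0), (2.0, 5.5)],
}
print(wilks_lambda(data))
```

<p>Values near 1 mean the groups look alike; values near 0 mean the factor explains most of the joint variation in the responses.</p>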
When and Why You Should Use MANOVA
<p>Use multivariate ANOVA when you have continuous response variables that are correlated. In addition to multiple responses, you can also include multiple <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/anova-models/factor-and-factor-levels/" target="_blank">factors</a>, <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/anova-models/adding-a-covariate-to-glm/" target="_blank">covariates</a>, and <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/anova-models/what-is-an-interaction/" target="_blank">interactions</a> in your model. MANOVA uses the additional information provided by the relationship between the responses to provide three key benefits.</p>
<ul>
<li><strong>Increased power</strong>: If the response variables are correlated, MANOVA can detect differences too small to be detected through individual ANOVAs.</li>
<li><strong>Detects multivariate response patterns</strong>: The factors may influence the relationship between responses rather than affecting a single response. Single-response ANOVAs can miss these multivariate patterns as illustrated in the MANOVA example.</li>
<li><strong>Controls the family error rate</strong>: Your chance of incorrectly rejecting the null hypothesis increases with each successive ANOVA. Running one MANOVA to test all response variables simultaneously keeps the family error rate equal to your alpha level.</li>
</ul>
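<p>The family error rate point is easy to check with a little arithmetic. Assuming the individual tests are independent (a simplification, since correlated responses will not give exactly independent tests), the chance of at least one false rejection across several ANOVAs is:</p>

```python
def family_error_rate(alpha, num_tests):
    """Chance of at least one false rejection across independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** num_tests

for k in (1, 2, 5):
    print(k, round(family_error_rate(0.05, k), 4))
```

<p>With two separate ANOVAs at alpha = 0.05, that rate is already close to 0.10 under this assumption.</p>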
Data Analysis, Statistics, Statistics Help
Thu, 13 Nov 2014 13:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/the-power-of-multivariate-anova-manova
Jim Frost
What to Do When Your Data's a Mess, part 2
http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-datas-a-mess2c-part-2
<p><span style="line-height: 1.6;">In my last post, I wrote about making a cluttered data set easier to work with by removing unneeded columns entirely, and by displaying just those columns you want to work with <em>now</em>. But <a href="http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-data-is-a-mess-part-1">too much unneeded data</a> isn't always the problem. </span></p>
<p><span style="line-height: 1.6;">What can you do when someone gives you data that isn't organized the way you need it to be? </span></p>
<p><span style="line-height: 1.6;">That happens for a variety of reasons, but most often it's because the simplest way for people to collect data is with a format that might make it difficult to assess in a worksheet. Most <a href="http://www.minitab.com/products/minitab">statistical software</a> will accept a wide range of data layouts, but just because a layout is readable doesn't mean it will be easy to analyze.</span></p>
<p><span style="line-height: 1.6;">You may not be in control of how your data were collected, but you can use tools like sorting, stacking, and ordering to put your data into a format that makes sense and is easy for you to use. </span></p>
Decide How You Want to Organize Your Data
<p>Depending on how it's arranged, the same data can be easier to work with, simpler to understand, and can even yield deeper and more sophisticated insights. I can't tell you the best way to organize your specific data set, because that will depend on the types of analysis you want to perform, and the nature of the data you're working with. However, I can show you some easy ways to rearrange your data into the form that you select. </p>
Unstack Data to Make Multiple Columns
<p>The data below show concession sales for different types of events held at a local theater. </p>
<p><img alt="stacked data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/8ea617d9de8138f26f2da0f3f95f4b88/stackedata.png" style="width: 202px; height: 188px;" /></p>
<p><span style="line-height: 20.7999992370605px;">If we want to perform an analysis that requires each type of event to be in its own column, we can choose <strong>Data > Unstack Columns...</strong> and complete the dialog box as shown: </span></p>
<p><span style="line-height: 20.7999992370605px;"><img alt="unstack columns dialog" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/fc098d3ddcbc21fe12602cb45336949c/unstack_columns.png" style="width: 350px; height: 263px;" /> </span></p>
<p>Minitab creates a new worksheet that contains a separate column of Concessions sales data for each type of event:</p>
<p><img alt="Unstacked Data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f24dd4ac29678e25069d299ccc13c535/unstacked_data.png" style="width: 400px; height: 150px;" /></p>
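<p>For readers who also work outside Minitab, the idea behind unstacking can be sketched in a few lines of Python. This is only an illustration of the reshaping, not how Minitab implements it; the event names, the sales figures, and the <code>unstack</code> helper are all hypothetical.</p>

```python
from collections import defaultdict

# Hypothetical stacked records: (event type, concession sales).
# These values are made up for illustration; they are not the worksheet data.
stacked = [
    ("Concert", 1200.0),
    ("Play", 450.0),
    ("Concert", 980.0),
    ("Film", 300.0),
    ("Play", 510.0),
]


def unstack(rows):
    """Group values into one list ("column") per label, keeping row order."""
    columns = defaultdict(list)
    for label, value in rows:
        columns[label].append(value)
    return dict(columns)


unstacked = unstack(stacked)
for event, sales in unstacked.items():
    print(event, sales)
```

<p>Each key in the result plays the role of one new column, so columns can end up with different lengths, just as they can in an unstacked worksheet.</p>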
Stack Data to Form a Single Column (with Grouping Variable)
<p>A similar tool will help you put data from separate columns into a single column for the type of analysis required. The data below show sales figures for four employees: </p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f546e2611e4fd6fe804de7c0aee3d230/stacked_data.png" style="width: 265px; height: 92px;" /></p>
<p>Choose <strong>Data > Stack > Columns...</strong> and select the columns you wish to combine. Checking the "Use variable names in subscript column" option creates a second column that identifies the person who made each sale. </p>
<p><img alt="Stack columns dialog" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a09dba196e68e5e75d0f248339a53e11/stack_data_dialog.jpg" style="width: 400px; height: 292px;" /></p>
<p>When you press OK, the sales data are stacked into a single column of measurements and ready for analysis, with Employee available as a grouping variable: </p>
<p><img alt="stacked columns" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c26bec8bec9447ab1df6b9ad669d9a1a/stacked_columns.jpg" style="width: 138px; height: 181px;" /></p>
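<p>Stacking is the reverse operation. The sketch below shows the idea in Python under stated assumptions: the employee names, the sales values, and the <code>stack</code> helper are hypothetical, and the second element of each pair stands in for the subscript column.</p>

```python
# Hypothetical wide layout: one column of sales per employee (made-up values).
wide = {
    "Ana": [120, 95],
    "Ben": [80, 110],
    "Cy": [100],
    "Dee": [70, 130],
}


def stack(columns):
    """Return (value, subscript) pairs: one measurement column plus a
    grouping column that records which original column each value came from."""
    return [(value, name) for name, values in columns.items() for value in values]


stacked = stack(wide)
for value, employee in stacked:
    print(value, employee)
```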
Sort Data to Make It More Manageable
<p>The following data appear in the worksheet in the order in which individual stores in a chain sent them into the central accounting system.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/431dcae640fa0855a8db03b14bad3998/unsorted_data.jpg" style="width: 200px; height: 228px;" /></p>
<p>When the data appear in this uncontrolled order, finding an observation for any particular item, or from any specific store, would entail reviewing the entire list. We can fix that problem by selecting <strong>Data > Sort...</strong> and reordering the data by either store or item. </p>
<p><img alt="sorted data by item" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/0c982bb11359a001c048cb6c39ab1f60/sorted_data_by_item.jpg" style="width: 221px; height: 246px;" /> <img alt="sorted data by store" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/53e9a3f22b4a959af11952995703d7d4/sorted_data_by_store.jpg" style="width: 209px; height: 248px;" /></p>
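<p>The same two orderings can be sketched in Python with the built-in <code>sorted</code> function. The store labels, item names, and sales counts below are hypothetical; the second sort key simply breaks ties within the first.</p>

```python
from operator import itemgetter

# Hypothetical records in arrival order (made-up values).
records = [
    {"store": "B", "item": "Gloves", "sales": 14},
    {"store": "A", "item": "Scarf", "sales": 9},
    {"store": "B", "item": "Hat", "sales": 21},
    {"store": "A", "item": "Gloves", "sales": 17},
]

# Sort by item first, then by store within each item.
by_item = sorted(records, key=itemgetter("item", "store"))

# Sort by store first, then by item within each store.
by_store = sorted(records, key=itemgetter("store", "item"))

for row in by_item:
    print(row)
```

<p>Because <code>sorted</code> returns a new list, the original arrival order stays available if it is needed later.</p>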
Merge Multiple Worksheets
<p>What if you need to analyze information about the same items, but it was recorded on separate worksheets? For instance, suppose one group was gathering historic data about all of a corporation's manufacturing operations while another was working on strategic planning, and your analysis requires data from each. </p>
<p><img alt="two worksheets" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f63ed557c91fb6136b28ab43001b48b4/two_worksheets.png" style="width: 350px; height: 327px;" /></p>
<p>You can use <strong>Data > Merge Worksheets</strong> to bring the data together into a single worksheet, using the Division column to match the observations:</p>
<p><img alt="merging worksheets" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/651d3d676a4099a71eb180344d2e8282/merge_worksheets.png" style="width: 393px; height: 363px;" /></p>
<p>You can also choose whether or not <span style="line-height: 20.7999992370605px;">multiple</span><span style="line-height: 1.6;">, missing, or unmatched observations will be included in the merged worksheet. </span></p>
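<p>As a rough sketch of what matching on a key column involves, here is a small Python version. It assumes each Division appears at most once per worksheet; the division names, the other column names, the values, and the <code>merge</code> helper with its <code>include_unmatched</code> flag are all hypothetical and are not Minitab's options.</p>

```python
# Hypothetical worksheets, each a list of rows keyed by "Division".
historic = [
    {"Division": "North", "Plants": 4},
    {"Division": "South", "Plants": 2},
    {"Division": "West", "Plants": 3},
]
planning = [
    {"Division": "South", "Target": 5},
    {"Division": "North", "Target": 6},
    {"Division": "East", "Target": 1},
]


def merge(left, right, key, include_unmatched=True):
    """Join two lists of row dicts on `key`.

    Assumes each key value appears at most once in each list.
    """
    right_by_key = {row[key]: row for row in right}
    merged = []
    matched = set()
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            merged.append({**row, **match})
            matched.add(row[key])
        elif include_unmatched:
            merged.append(dict(row))
    if include_unmatched:
        # Keep right-hand rows that found no partner on the left.
        merged.extend(dict(row) for row in right if row[key] not in matched)
    return merged


everything = merge(historic, planning, "Division")
matched_only = merge(historic, planning, "Division", include_unmatched=False)
print(everything)
print(matched_only)
```

<p>With unmatched rows included, West and East each appear with only the columns their own worksheet supplied; with them excluded, only North and South remain.</p>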
Reorganizing Data for Ease of Use and Clarity
<p>Making changes to the layout of your worksheet does entail a small investment of time, but it can bring big returns in making analyses quicker and easier to perform. The next time you're confronted with raw data that isn't ready to play nice, try some of these approaches to get it under control. </p>
<p>In my next post, I'll share some tips and tricks that can help you get more information out of your data.</p>
Data Analysis, Statistics, Stats
Tue, 11 Nov 2014 14:48:09 +0000
http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-datas-a-mess2c-part-2
Eston Martz
What to Do When Your Data's a Mess, part 1
http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-data-is-a-mess-part-1
<p>Isn't it great when you get a set of data and it's perfectly organized and ready for you to analyze? I love it when the people who collect the data take special care to make sure to format it consistently, arrange it correctly, and eliminate the junk, clutter, and useless information I don't need. </p>
<p><img alt="Messy Data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ad531bc1c0dc575e774b7ecef670b231/messydata.png" style="border-width: 1px; border-style: solid; margin: 10px 15px; width: 250px; height: 248px; float: right;" />You've never received a data set in such perfect condition, you say?</p>
<p>Yeah, me neither. But I can dream, right? </p>
<p><span style="line-height: 1.6;">The truth is, when other people give me data, it's typically not ready to analyze. It's frequently messy, disorganized, and inconsistent. I get big headaches if I try to analyze it without doing a little clean-up work first. </span></p>
<p>I've talked with many people who've shared similar experiences, so I'm writing a series of posts on how to get your data in usable condition. In this first post, I'll talk about some basic methods you can use to make your data easier to work with. </p>
Preparing Data Is a Little Like Preparing Food
<p>I'm not complaining about the people who give me data. In most cases, they aren't statisticians and they have many higher priorities than giving me data in exactly the form I want. </p>
<p>The end result is that getting data is a little bit like getting food: it's not always going to be ready to eat when you pick it up. You don't eat raw chicken, and usually you can't analyze raw data, either. <span style="line-height: 20.7999992370605px;"> </span><span style="line-height: 1.6;">In both cases, you need to prepare it first or the results aren't going to be pretty. </span></p>
<p><span style="line-height: 1.6;">Here are a couple of very basic things to look for when you get a messy data set, and how to handle them. </span></p>
<span style="line-height: 1.6;">Kitchen-Sink Data and Information Overload</span>
<p>Frequently I get a data set that includes a lot of information that I don't need for my analysis. I also get data sets that combine or group information in ways that make analyzing it more difficult. </p>
<p>For example, let's say I needed to analyze data about different types of events that take place at a local theater. Here's my raw data sheet: </p>
<p><img alt="April data sheet" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/14fe4e9930171f54848b589c0e8139d1/april_data_raw.png" style="width: 400px; height: 224px;" /></p>
<p>With each type of event jammed into a single worksheet, it's a challenge to analyze just one event category. What would work better? A separate worksheet for each type of occasion. In Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a>, I can go to <strong>Data > Split Worksheet...</strong> and choose the Event column: </p>
<p><img alt="split worksheet" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/69c63e422339f9871ada5a244222dcfc/split_worksheet.png" style="width: 300px; height: 309px;" /></p>
<p>And Minitab will create new worksheets that include only the data for each type of event. </p>
<p><img alt="separate worksheets by event type" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/8b97ea00ae39da8cb60e307ebe6140dc/separate_data_sheets.png" style="width: 300px; height: 243px;" /></p>
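<p>Conceptually, splitting a worksheet means grouping whole rows by the value in one column. A minimal Python sketch of that idea follows; the column names other than Event, the values, and the <code>split_by</code> helper are hypothetical.</p>

```python
from collections import defaultdict

# Hypothetical raw sheet with every event type mixed together (made-up values).
rows = [
    {"Event": "Concert", "Attendance": 310},
    {"Event": "Play", "Attendance": 120},
    {"Event": "Concert", "Attendance": 280},
    {"Event": "Film", "Attendance": 90},
]


def split_by(rows, column):
    """Return one list of rows ("worksheet") per distinct value in `column`."""
    sheets = defaultdict(list)
    for row in rows:
        sheets[row[column]].append(row)
    return dict(sheets)


sheets = split_by(rows, "Event")
for event, sheet in sheets.items():
    print(event, len(sheet), "rows")
```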
<p><span style="line-height: 20.7999992370605px;">Minitab also lets you merge worksheets to </span>combine items provided in separate data files. </p>
<p><span style="line-height: 1.6;">Let's say the data set you've been given contains a lot of columns that you don't need: irrelevant factors, redundant information, and the like. Those items just clutter up your data set, and getting rid of them will make it easier to identify and access the columns of data you actually need. </span><span style="line-height: 20.7999992370605px;">You can delete rows and columns you don't need, or use the</span><strong style="line-height: 20.7999992370605px;"> Data > Erase Variables</strong><span style="line-height: 20.7999992370605px;"> tool to make your worksheet more manageable. </span></p>
<span style="line-height: 1.6;">I Can't See You Right Now...Maybe Later</span>
<p>What if you don't want to actually <em>delete </em>any data, but you only want to see the columns you intend to use? For instance, in the data below, I don't need the Date, Manager, or Duration columns now, but I may have use for them in the future: </p>
<p><img alt="unwanted columns" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/99d785a0b5ff0cbac36f0c6af05b1cac/unwantedcolumns.png" style="width: 400px; height: 225px;" /></p>
<p>I can select and right-click those columns, then use <strong>Column > Hide Selected Columns</strong> to make them disappear. </p>
<p><img alt="hide selected columns" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/00defa2646d5e100873ef2961d374ff0/hideselectedcolumns.png" style="width: 400px; height: 308px;" /></p>
<p>Voila! They're gone from my sight. Note how the displayed columns jump from C1 to C5, indicating that some columns are hidden: </p>
<p><img alt="hidden columns" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a140bb6413744b431460e70f523e5a0b/hiddencolumns.png" style="width: 323px; height: 138px;" /></p>
<p>It's just as easy to bring those columns back into the limelight. When I want them to reappear, I select the C1 and C5 columns, right-click, and choose "Unhide Selected Columns." </p>
<p>Data may arrive in a disorganized and messy state, but you don't need to keep it that way. Getting rid of extraneous information and choosing the elements that are visible can make your work much easier. But that's just the tip of the iceberg. In my next post, I'll cover some more <a href="http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-datas-a-mess2c-part-2">ways to make unruly data behave</a>. </p>
Data Analysis, Statistics
Mon, 10 Nov 2014 15:52:00 +0000
http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-data-is-a-mess-part-1
Eston Martz
Creating and Reading Statistical Graphs: Trickier than You Think
http://blog.minitab.com/blog/understanding-statistics/creating-and-reading-statistical-graphs-trickier-than-you-think
<p>A few weeks ago my colleague Cody Steele illustrated <a href="http://blog.minitab.com/blog/statistics-and-quality-improvement/how-painful-does-the-income-gap-look-to-you">how the same set of data can appear to support two contradictory positions</a>. He showed how changing the scale of a graph that displays mean and median household income over time drastically alters the way it can be interpreted, even though there's no change in the data being presented.</p>
<p><img alt="Graph interpretation is tricky, especially if you're doing it quickly" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f594d20f8daa8e00e29380f68010b1cc/hunh.jpg" style="margin: 10px 15px; float: right; width: 200px; height: 200px;" /> When we analyze data, we need to present the results in an objective, honest, and fair way. That's the catch, of course. What's "fair" can be debated...and that leads us straight into "Lies, damned lies, and statistics" territory. </p>
<p><span style="line-height: 20.7999992370605px;">Cody's post got me thinking about the importance of statistical literacy, especially in a mediascape saturated with overhyped news reports about seemingly every new study, not to mention omnipresent "infographics" of frequently dubious origin and intent.</span></p>
<p><span style="line-height: 20.7999992370605px;">As consumers and providers of statistics, can we trust our own impressions of the information we're bombarded with on a daily basis? It's an increasing challenge, even for the statistics-savvy. </span></p>
So Much Data, So Many Graphs, So Little Time
<p>The increased amount of information available, combined with the acceleration of the news cycle to speeds that wouldn't have been dreamed of a decade or two ago, means we have less time available to absorb and evaluate individual items critically. </p>
<p>A half-hour television news broadcast might include several animations, charts, and figures based on the latest research, or polling numbers, or government data. They'll be presented for several seconds at most, then it's on to the next item. </p>
<p>Getting news online is even more rife with opportunities for split-second judgment calls. We scan through the headlines and eyeball the images, searching for stories interesting enough to click on. But with 25 interesting stories vying for our attention, and perhaps just a few minutes before the next appointment, we race through them very quickly. </p>
<p>But when we see graphs for a couple of seconds, do we really absorb their meaning completely and accurately? Or are we susceptible to misinterpretation? </p>
<p>Most of the graphs we see are very simple: bar charts and pie charts predominate. But <span style="line-height: 1.6;">as statistics educator Dr. Nic points out in </span><a href="http://learnandteachstatistics.wordpress.com/2012/07/16/tricky_graphs/" style="line-height: 1.6;">this blog post</a>,<span style="line-height: 1.6;"> </span><span style="line-height: 20.7999992370605px;">interpreting</span><span style="line-height: 20.7999992370605px;"> </span><span style="line-height: 1.6;">even simple bar charts can be a deceptively tricky business</span><span style="line-height: 1.6;">. I've adapted her example to demonstrate this below. </span></p>
Which Chart Shows Greater Variation?
<p>A city surveyed residents of two neighborhoods about the quality of service they get from local government. Respondents were asked to rate local services on a scale of 1 to 10. Their responses were charted using Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a>, as shown below. </p>
<p>Take a few seconds to scan the charts, then choose which neighborhood's responses exhibit more variation: Ferndale or Lawnwood?</p>
<p><img alt="Lawnwood Bar Chart" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f88262f2732bc43e8ac0b919d43139a5/lawnwoodbarchart.gif" style="width: 500px; height: 333px;" /></p>
<p><img alt="Ferndale Bar Chart" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/67ee1909a89236e3caac2d11a9d42795/ferndalebarchart.gif" style="width: 500px; height: 333px;" /></p>
<p>Seems pretty straightforward, right? Lawnwood's graph is quite spiky and disjointed, with sharp peaks and valleys. The graph of Ferndale's responses, on the other hand, looks nice and even. Each bar's roughly the same height. </p>
<p>It looks like Lawnwood's responses have more variation. But let's verify that impression with some basic descriptive statistics about each neighborhood's responses:</p>
<p style="margin-left: 40px;"><img alt="Descriptive Statistics for Ferndale and Lawnwood" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/1eeed755d2a0baea0939dc7ccecacaea/descriptive_statistics.gif" style="width: 369px; height: 105px;" /></p>
<p>Uh-oh. A glance at the graphs suggested that Lawnwood has more variation, but the analysis demonstrates that Ferndale's variation is, in fact, much higher. <span style="line-height: 20.7999992370605px;">How did we get this so wrong?</span></p>
Frequencies, Values, and Counterintuitive Graphs
<p><span style="line-height: 1.6;">The answer lies in how the data were presented. The charts above show frequencies, or counts, rather than individual responses. </span></p>
<p><span style="line-height: 1.6;">What if we graph the individual responses for each neighborhood? </span></p>
<p><img alt="Lawnwood Individuals Chart" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/d8e91ae6c007e8f5327c54ac3ec65604/lawnwoodindividualsbarchart.gif" style="width: 500px; height: 333px;" /></p>
<p><img alt="Ferndale Individuals Chart" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/4c01c68dbb96e2126a1fd313ee38e001/ferndaleindividualsbarchart.gif" style="width: 500px; height: 333px;" /></p>
<p>In <em>these </em>graphs, it's easy to see that the responses of Ferndale's citizens had much more variation than those of Lawnwood. But unless you appreciate the differences between values and frequencies—and paid careful attention to how the first set of graphs was labelled—a quick look at the earlier graphs could well leave you with the wrong conclusion. </p>
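<p>The difference between frequencies and values is easy to check numerically. The Python sketch below uses two made-up frequency tables, not the survey data: one "spiky" table whose counts vary a lot but whose ratings cluster between 4 and 7, and one "flat" table in which every rating from 1 to 10 is chosen equally often. Expanding each table back into individual responses and computing the sample standard deviation shows which set of ratings actually varies more. The <code>expand</code> helper and both tables are hypothetical.</p>

```python
from statistics import stdev

# Hypothetical frequency tables: rating -> number of respondents.
# Both tables describe 40 respondents; neither is the actual survey data.
spiky_counts = {4: 2, 5: 20, 6: 3, 7: 15}      # bar heights differ a lot
flat_counts = {r: 4 for r in range(1, 11)}     # every bar has the same height


def expand(counts):
    """Turn a frequency table into the list of individual responses."""
    return [rating for rating, n in sorted(counts.items()) for _ in range(n)]


spiky_sd = stdev(expand(spiky_counts))
flat_sd = stdev(expand(flat_counts))

print(f"spiky chart: standard deviation of ratings = {spiky_sd:.2f}")
print(f"flat chart:  standard deviation of ratings = {flat_sd:.2f}")
```

<p>In this sketch the flat chart has the larger standard deviation, because its responses are spread across the whole 1 to 10 scale. The uneven bar heights of the spiky chart describe variation in the counts, not in the ratings themselves.</p>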
Being Responsible
<p>Since you're reading this, you probably both create and consume data analysis. You may generate your own reports and charts at work, and see the results of other people's analyses on the news. We should approach both situations with a certain degree of responsibility. </p>
<p>When looking at graphs and charts produced by others, we need to avoid snap judgments. We need to pay attention to what the graphs really show, and take the time to draw the right conclusions based on how the data are presented. </p>
<p>When sharing our own analyses, we have a responsibility to communicate clearly. In the frequency charts above, the X and Y axes are labelled adequately—but couldn't they be more explicit? Instead of just "Rating," couldn't the label read "Count for Each Rating" or some other, more meaningful description? </p>
<p>Statistical concepts may seem like common knowledge if you've spent a lot of time working with them, but many people aren't clear on ideas like "correlation is not causation" and margins of error, let alone the nuances of statistical assumptions, distributions, and significance levels.</p>
<p>If your audience includes people without a thorough grounding in statistics, are you going the extra mile to make sure the results are understood? For example, many expert statisticians have told us they use <a href="http://www.minitab.com/products/minitab/assistant/">the Assistant</a> in Minitab 17 to present their results precisely because it's designed to communicate the outcome of analysis clearly, even for statistical novices. </p>
<p><span style="line-height: 20.7999992370605px;">If you're already doing everything you can to make statistics accessible to others, kudos to you. </span><span style="line-height: 20.7999992370605px;">And if you're not, why aren't you? </span></p>
Data Analysis, Statistics, Statistics in the News, Stats
Wed, 05 Nov 2014 14:25:00 +0000
http://blog.minitab.com/blog/understanding-statistics/creating-and-reading-statistical-graphs-trickier-than-you-think
Eston Martz