Data Analysis Software | Minitab
Blog posts and articles with tips for using statistical software to analyze data for quality improvement.
http://blog.minitab.com/blog/data-analysis-software/rss
Fri, 27 May 2016 16:04:58 +0000
Is Stephen Curry the Best NBA Point Guard Ever? Let's Check the Data
http://blog.minitab.com/blog/statistics-in-the-field/is-stephen-curry-the-best-nba-point-guard-ever-lets-check-the-data
<p><em>by Laerte de Araujo Lima, guest blogger </em></p>
<p>The NBA's 2015-16 season will be one for the history books. Not only was it the last season of <a href="http://www.nba.com/lakers/news/160413_kobepresser">Kobe Bryant</a>, who scored 60 points in his final game, but the Golden State Warriors set <a href="http://www.nba.com/news/2015-16-golden-state-warriors-chase-1995-96-chicago-bulls-all-time-wins-record/">a new wins record</a>, beating the previous record set by the 1995-96 Chicago Bulls.</p>
<p><img alt="stephen curry" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/25a3dc0f9e9c615fae1224259b7c0c6f/320px_stephen_curry_vs_washington_2016_1_.jpg" style="width: 320px; height: 216px; margin: 10px 15px; float: right;" />The Warriors seem likely to take this season's NBA title, in large part thanks to the performance of point guard <a href="http://www.nba.com/playerfile/stephen_curry/">Stephen Curry</a>. A lot of my friends are even saying Curry's skill and performance make him the best point guard in NBA history, but is it true? Curry’s performance is amazing, and he's the key element of the Warriors’ success, but it seems a little early to define him as the best NBA point guard <em>ever</em>. In the meantime, we can use data to answer another question:</p>
<p>Has any other point guard in NBA history matched Stephen Curry’s performance during their initial seven seasons?</p>
<p>As a fan of both basketball and Six Sigma, I set out to answer this question methodically, following these steps:</p>
1. Define the Sample of Point Guards for the Study
<p>ESPN recently published <a href="http://espn.go.com/nba/story/_/page/nbarankPGs/ranking-top-10-point-guards-ever">their list</a> of the 10 best NBA point guards, which puts Magic Johnson first and Curry fourth. ESPN considers both objective factors (NBA titles, MVP nominations, etc.) and subjective parameters (player vision, charisma, team engagement, etc.) to compare players. In keeping with Six Sigma, I want my analysis to be based on facts and figures; however, ESPN's list makes a good starting point. Here are their rankings:</p>
<ol>
<li>Magic Johnson</li>
<li>Oscar Robertson</li>
<li>John Stockton</li>
<li>Stephen Curry</li>
<li>Isiah Thomas</li>
<li>Chris Paul</li>
<li>Steve Nash</li>
<li>Jason Kidd</li>
<li>Walt Frazier</li>
<li>Bob Cousy</li>
</ol>
2. Define the Data Source
<p>This is the easiest part of the job. The NBA web site is a rich source of data, so we are going to use it to check the regular-season performances of each player in ESPN's list. Sticking to the regular season keeps the averages well balanced across players, because it gives us the same number of games per player per season.</p>
3. Define the Critical-to-Quality (CTQ) Factors
<p>In my opinion, the following CTQ factors (based on NBA standards criteria) best characterize point guard performance and how they add value to the team's main target—winning a game:</p>
<table>
<tr><th>CTQ</th><th>CTQ Definition</th><th>Rationale</th></tr>
<tr><td><strong>PTS</strong></td><td>Average points per game</td><td>Impact of the player on the overall score makes a positive contribution to winning the game.</td></tr>
<tr><td><strong>FG%</strong></td><td>Percentage of successful field goals</td><td>Player efficiency in shooting makes a positive contribution to winning the game.</td></tr>
<tr><td><strong>3P%</strong></td><td>Percentage of successful 3-point field goals</td><td>Player efficiency in 3-point shooting makes a positive contribution to winning the game.</td></tr>
<tr><td><strong>FT%</strong></td><td>Percentage of successful free throws</td><td>Player efficiency in free throws makes a positive contribution to winning the game.</td></tr>
<tr><td><strong>AST</strong></td><td>Average assists per game</td><td>Assisting teammates makes a positive contribution to winning the game.</td></tr>
<tr><td><strong>STL</strong></td><td>Average steals per game</td><td>New ball possession and counterattacks make a positive contribution to winning the game.</td></tr>
<tr><td><strong>MIN</strong></td><td>Average minutes played per game</td><td>Player's strategic importance to the team.<br />
Positive contribution to team strategy.</td></tr>
<tr><td><strong>GS</strong></td><td>Games per season where player is part of the initial 5.</td><td>Initial starts indicate importance in terms of strategy, as well as fewer injuries.</td></tr>
</table>
<p>With the players, critical factors, and the source of data defined, let's dig into the analysis.</p>
4. Ranking Criteria and Methodology
<p>When I opened Minitab Statistical Software to begin looking at each player's average for each CTQ factor, I faced the first challenge in the analysis. Some players did not have the same CTQ measurements in the NBA database. They had played in the NBA’s early years, and the statistics for all CTQ factors weren't available (for example, the 3-point shot didn't exist at the time some players were active). Consequently, I decided to exclude those players from the analysis to avoid discrepancy in the data. That leaves us with this short list:</p>
<ol>
<li>Magic Johnson</li>
<li>John Stockton</li>
<li>Stephen Curry</li>
<li>Isiah Thomas</li>
<li>Chris Paul</li>
<li>Steve Nash</li>
<li>Jason Kidd</li>
</ol>
To compare these players, I used the statistical tool called Analysis of Variance (ANOVA). ANOVA tests the hypothesis that the means of two or more populations are equal. An ANOVA evaluates the importance of one or more factors by comparing the response variable means at the different factor levels. The null hypothesis states that all means are equal, while the alternative hypothesis states that at least one is different.
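<p>To make the idea concrete, here is a minimal Python sketch of the F statistic behind a one-way ANOVA. It is not what the Assistant runs internally, and the three groups are made-up toy values (the names <code>player_a</code>, <code>player_b</code>, and <code>player_c</code> are hypothetical), not NBA data. A large F means the group means are spread out relative to the variation inside each group.</p>

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    groups is a list of lists, one list of observations per factor level.
    """
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)

    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy data only: three hypothetical players, three seasons each.
player_a = [1.0, 2.0, 3.0]
player_b = [2.0, 3.0, 4.0]
player_c = [5.0, 6.0, 7.0]

f_stat, df_between, df_within = one_way_anova_f([player_a, player_b, player_c])
print(round(f_stat, 3), df_between, df_within)
```

<p>The p-value reported by statistical software is the probability of seeing an F at least this large under the null hypothesis, taken from the F distribution with these two degrees of freedom.</p>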
<p>For this analysis, I used the <a href="http://www.minitab.com/products/minitab/assistant/">Assistant</a> in Minitab to perform One-Way ANOVA analysis. To access this tool, select <strong>Assistant > Hypothesis Tests...</strong> and choose One-Way ANOVA.</p>
<p style="margin-left: 80px;"><img alt="The Assistant in Minitab" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/42afcc329cd4cc74808be92ee49931d9/image001.jpg" style="width: 529px; height: 329px;" /></p>
<p>By performing one-way ANOVA for each of the factors, I can position the players based on the average values of their CTQ variables during each of their first seven seasons. After compiling all results, I deployed a <a href="http://asq.org/learn-about-quality/decision-making-tools/overview/decision-matrix.html">Decision Matrix</a> (another Six Sigma tool) to assess all the players, based on the ANOVA results. The ultimate goal is to determine if Curry’s average performance is superior, inferior, or equal to that of the other players.</p>
<p>Let's take a look at the ANOVA results for the individual CTQ factors.</p>
Average Points per Game (PPG)
<p style="margin-left: 40px;"><img alt="Average Points Per Game" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/bf37da89d5aca795b00a58125d00c9db/image002.gif" style="width: 624px; height: 468px;" /></p>
<p>The Assistant's output is designed to be very easy to understand. The blue bar at the top left answers the bottom-line question, "Do the means differ?" The p-value (0.001) is less than the 0.05 threshold, telling us that there is a statistically significant difference in means. The intervals displayed on the Means Comparison Chart indicate that Curry and Nash both had huge variation in their average points per game in the first 7 seasons. Statistically speaking, the only player with an average PPG performance that was significantly different from Curry’s is Kidd; all the others had similar performance in their first 7 seasons.</p>
Percentage of Field Goals per Game (FG%)
<p style="margin-left: 40px;"><img alt="FG% ANOVA Results" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/6ce64723b8efe750fcb163443aef7fed/image003.gif" style="width: 624px; height: 468px;" /></p>
<p>As in the previous analysis, the p-value (0.001) is less than the 0.05 threshold, telling us that there is a difference in means. However, the interpretation of the analysis is clearer. In terms of statistical significance, Curry’s performance is better than Kidd's (again), but not better than Magic's, and it is similar to that of all the other players.</p>
<p>Again, we see that Nash has tremendous variation in his field-goal percentage, and Kidd exhibits the worst average FG% among these players.</p>
Average Percentage of 3-point Field Goals per Game (3P%)
<p style="margin-left: 40px;"><img alt="3P% ANOVA" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/b997d1cc5da7397b9a7f3af002e44f89/image004.gif" style="width: 624px; height: 468px;" /></p>
<p>To my surprise, based on this comparison chart Magic has the <em>worst</em> performance, and the most variation, among the players for this factor. On the other hand, Curry has an extremely high average performance, with small variation, and this is what we see in the Warriors' games.</p>
<p>If we take a closer look at the three highest performers in this category, Nash, Stockton, and Curry, we see that Nash's and Curry’s performances are slightly different. Interestingly, the variation in Stockton's data prevents us from concluding that a statistically significant difference exists between his average and those of Curry <em>or </em>Nash.</p>
<p style="margin-left: 40px;"><img alt="3P% ANOVA for Curry, Nash, Stockton" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/93c4b44ac1c49bb7a84571a60567b798/image005.gif" style="width: 624px; height: 468px;" /></p>
<p>As happens in many Six Sigma projects, the results of this factor contradict conventional wisdom: how could Magic Johnson have the lowest average for this factor? I decided to dig a little bit deeper into Magic’s data using the Assistant's Diagnostic Report, which offers a better view of the data's distribution. There we can see an outlier in Magic's data. According to this analysis, he actually had a season in which he made 0% of his 3-point field goals!</p>
<p style="margin-left: 40px;"><img alt="3PT% Diagnostic Report" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/70ad9430dc5f4ee5e2c98a21eafb7d8b/image007.png" style="width: 623px; height: 467px;" /></p>
<p>I could not believe this, so I double-checked the data at the source. To my surprise, it was correct:</p>
<p style="margin-left: 40px;"><img alt="Magic 0.0" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/9d096d542a23ab383658316271a5af5b/image009.png" style="width: 624px; height: 346px; border-width: 1px; border-style: solid;" /></p>
Average Percentage of Free-Throw Field Goals per Game (FT%)
<p style="margin-left: 40px;"><img alt="FT% ANOVA Output" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/bd00d9293ba2bd6e45cae5307b7adea4/image010.gif" style="width: 624px; height: 468px;" /></p>
<p>In the free throw analysis, Curry's performance is similar to that of Nash and Paul, all of whom performed better than the other players. Once again, Kidd (whom I have nothing against!) has the worst performance.</p>
Average Assists per Game (AST)
<p style="margin-left: 40px;"><img alt="AST% ANOVA Output" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/100e59e07307222bc73255c030e00316/image011.gif" style="width: 624px; height: 468px;" /></p>
<p>For this factor, Nash and Curry both sit at the bottom of the ranking, with similar performance. It's also clear that while Stockton has both the highest average and small variation in his performance, he's still comparable with Isiah and Magic.</p>
Average Steals per Game (STL)
<p style="margin-left: 40px;"><img alt="STL ANOVA Output" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a1450e5f5b2883a49ad8aa5e7941e2c0/image012.gif" style="width: 624px; height: 468px;" /></p>
<p>Again, the p-value (0.001) is less than the 0.05 threshold, telling us that there is a statistically significant difference in means. It is clear that Nash is not a big “stealer” when compared with the other players. It's interesting to see that Curry’s mean performance is better than Nash's and worse than Paul's, but is not statistically significantly different from the mean performance of the remaining players.</p>
Minutes Played per Game (MIN)
<p style="margin-left: 40px;"><img alt="MIN ANOVA Output" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/33b90915031366f591dd73b52e092971/image013.gif" style="width: 624px; height: 468px;" /></p>
<p>For the first time, the ANOVA results have a p-value (0.075) greater than the 0.05 threshold, telling us that there is no statistically significant difference in means. It is clear that Nash's performance has huge variation, indicating that his contribution was very irregular in the first 7 seasons (perhaps due to injuries, adaptation, etc.). Curry shows the next-largest variation after Nash.</p>
Games Started in the Initial 5 per Season (GS)
<p style="margin-left: 40px;"><img alt="Initial 5 ANOVA Output" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/dcb0afca98d1ed138d3d33c07f2f0d7e/image014.gif" style="width: 624px; height: 468px;" /></p>
<p>For this final CTQ, we can see that the p-value (0.006) is less than the 0.05 threshold, indicating that the means are different. In this case, Stockton and Kidd's means differ. Curry’s presence in the initial 5 in the first 7 seasons is not statistically significantly different from that of any of the other players.</p>
<p>Let's take a look at the Diagnostic Report. We can see that Stockton's performance in this CTQ is incredible: he started every game of every season in the initial 5, showing his importance to the team.</p>
<p style="margin-left: 40px;"><img alt="Initial 5 ANOVA Diagnostic Report" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/767e1328424e2c3b332cd5c612d41924/image015.gif" style="width: 624px; height: 468px;" /></p>
Conclusion
<p>Based on the analyses of these criteria, we now have the final outlook based purely on the data. We can use Minitab's <a href="https://blog.minitab.com/blog/statistics-and-quality-improvement/automatically-update-your-conditional-formatting">conditional formatting</a> to highlight the differences between players for the different factors (<strong>></strong> means "better than", <strong><</strong> means "worse than", and <strong>=</strong> means "similar to").</p>
<p style="margin-left: 40px;"><img alt="Final Outlook - Condition Formatting" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/12c216328e4bf5214e2693f7195cd3c8/image015.png" style="width: 604px; height: 202px;" /></p>
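<p>One way to fill in a decision matrix like this is to compare each player's interval against Curry's and record a symbol. The sketch below assumes that two non-overlapping intervals indicate a significant difference, which mirrors how the Means Comparison Chart is read in this post; it is not Minitab's exact procedure. The interval values and the names "Player X", "Player Y", and "Player Z" are hypothetical.</p>

```python
def compare(reference, other):
    """Compare two (low, high) intervals where a higher value is better.

    Returns '>' if the reference interval lies entirely above the other,
    '<' if it lies entirely below, and '=' if the intervals overlap.
    """
    ref_low, ref_high = reference
    other_low, other_high = other
    if ref_low > other_high:
        return ">"
    if ref_high < other_low:
        return "<"
    return "="


# Hypothetical comparison intervals for a single CTQ factor.
intervals = {
    "Curry": (18.0, 24.0),
    "Player X": (10.0, 15.0),
    "Player Y": (25.0, 28.0),
    "Player Z": (20.0, 26.0),
}

# One row of the decision matrix: Curry versus each other player.
matrix_row = {
    name: compare(intervals["Curry"], interval)
    for name, interval in intervals.items()
    if name != "Curry"
}
print(matrix_row)
```

<p>Repeating this for every CTQ factor gives one row per factor, which is the shape of the matrix shown above.</p>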
From the analysis, we can conclude that:
<ul>
<li>Considering all of the CTQs, Curry’s overall performance is not superior across the board to that of the other point guards in the study, although he does stand out for some individual factors.</li>
<li>Curry’s PTS is superior only to Kidd's.</li>
<li>In terms of shot efficiency, Curry’s FG% is better than Kidd's but inferior to Magic's, and at the same level as all other players.</li>
<li>Curry’s 3-point performance is amazing, but this analysis shows Stockton’s at the same level.</li>
<li>On the other hand, Curry's FT% is better than that of all the other players, except Paul and Nash.</li>
<li>Curry’s assists per game are inferior to those of all other point guards, except Nash.</li>
<li>For steals, Curry’s mean performance is better than Nash's, worse than Paul's, and not statistically significantly different from the remaining players.</li>
<li>In terms of MIN and GS, Curry's performance is similar to that of the other players.</li>
<li>If we just compare points per game (PTS) and shot efficiency (FG%, FT%, 3P%) separately, Curry’s overall performance is better than Kidd's, for sure. But if we compare the other CTQ factors (AST, STL, MIN, GS) in the same way, Chris Paul has better performance than Curry.</li>
</ul>
<p>Based on this analysis, perhaps we need a few more seasons' worth of data to compare these players' overall performance and reach a more certain conclusion.</p>
<p> </p>
<p><strong>About the Guest Blogger: </strong></p>
<p><em>Laerte de Araujo Lima is a Supplier Development Manager for Airbus (France). He has previously worked as product quality engineer for Ford (Brazil), a Project Manager in MGI Coutier (Spain), and Quality Manager in IKF-Imerys (Spain). He earned a bachelor's degree in mechanical engineering from the University of Campina Grande (Brazil) and a master's degree in energy and sustainability from the Vigo University (Spain). He has 10 years of experience in applying Lean Six Sigma to product and process development/improvement. To get in touch with Laerte, please follow him on Twitter @laertelima or on</em> <a href="http://www.linkedin.com/pub/laerte-lima/7/46b/443" target="_blank"><strong><em>LinkedIn</em></strong></a><em>.</em></p>
<p> </p>
<p style="font-size:11px;"><em>Photo of Stephen Curry by <a href="https://www.flickr.com/people/27003603@N00">Keith Allison</a>, used under Creative Commons 2.0. </em></p>
Fun Statistics
Statistics in the News
Fri, 13 May 2016 12:00:00 +0000
Guest Blogger
What's a Moving Range, and How Is It Calculated?
http://blog.minitab.com/blog/marilyn-wheatleys-blog/whats-a-moving-range-and-how-is-it-calculated
<p>We often receive questions about moving ranges because they're used in various tools in our <a href="http://www.minitab.com/products/minitab">statistical software</a>, including control charts and capability analysis when data is not collected in subgroups. In this post, I'll explain what a moving range is, and how a moving range and average moving range are calculated.</p>
<p>A moving range measures how variation changes over time when data are collected as individual measurements rather than in subgroups.</p>
<p>If we collect individual measurements and need to plot the data on a control chart, or assess the capability of a process, we need a way to estimate the variation over time. But when we have individual observations, we cannot calculate the standard deviation for each subgroup. In such cases, the average moving range across all subgroups is an alternative way to estimate process variation.</p>
<p>Consider the 10 random data points plotted in the graph below:</p>
<p style="margin-left: 40px;"><img height="369" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/File/7b447a0adb4a6e3a23fee5a34ab07563/7b447a0adb4a6e3a23fee5a34ab07563.png" width="624" /></p>
<p>A moving range is the distance or difference between consecutive points. For example, MR1 in the graph below represents the first moving range, MR2 represents the second moving range, and so forth:</p>
<p style="margin-left: 40px;"><img height="414" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/File/041539e9131ddbfb6cae7517ec190ab8/041539e9131ddbfb6cae7517ec190ab8.png" width="624" /></p>
<p>The difference between the first and second points (MR1) is 0.704, and that’s a positive number since the first point has a lower value than the second. The second moving range, MR2, is the difference between the second point (21.0494) and the third (19.6375), and that’s a negative number (-1.4119), since the third point has a lower value than the second. If we continue that way, we’ll have 9 moving ranges for our 10 data points.</p>
<p>In Minitab, a moving range is easy to compute by "lagging" the data. Continuing the example with the 10 data points above, I can use <strong>Stat</strong> > <strong>Time Series</strong> > <strong>Lag</strong>, and then complete the dialog box as shown below:</p>
<p style="margin-left: 40px;"><img alt="a" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/2b125f53827fb9cc7aec8b2a300845a7/capture.PNG" style="width: 557px; height: 330px;" /></p>
<p>Clicking <strong>OK</strong> in the dialog above will shift the data in C1 down by one row and store the results in C4. Now we can use <strong>Calc</strong> > <strong>Calculator</strong> to subtract C4 from C1 and calculate all the moving ranges:</p>
<p style="margin-left: 40px;"><img alt="b" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/070834223bef3007c9621c940ff3a195/capture.PNG" style="width: 563px; height: 380px;" /></p>
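<p>The lag-and-subtract step can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Minitab's implementation. The post quotes only the second and third data points (21.0494 and 19.6375); the first value used here, 20.3454, is derived from the stated first moving range of 0.704, and the remaining seven points are not given, so they are left out.</p>

```python
def lag(values, k=1):
    """Shift values down by k rows, padding the top with None (missing)."""
    return [None] * k + values[:-k]


def signed_moving_ranges(values):
    """Subtract the lagged column from the original column, row by row."""
    lagged = lag(values)
    return [
        round(current - previous, 4)
        for current, previous in zip(values, lagged)
        if previous is not None  # the first row has nothing to compare with
    ]


# First three points only; 20.3454 is derived as 21.0494 - 0.704.
points = [20.3454, 21.0494, 19.6375]

signed_mr = signed_moving_ranges(points)
print(signed_mr)
```

<p>With n points, the first row has no lagged partner, so the result has n - 1 moving ranges, which is why 10 data points give 9 moving ranges.</p>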
<p>To calculate the average moving range, we need to use the absolute value of the moving ranges we calculated above. We’ll take a look at how to do that later. </p>
<p>When Minitab calculates the average of a moving range, the calculation also includes an <a href="http://support.minitab.com/en-us/minitab/17/topic-library/quality-tools/capability-analyses/data-and-data-assumptions/unbiasing-constants/">unbiasing constant</a>, d2, which the mean of the absolute moving ranges is divided by. The formula is:</p>
<p style="margin-left: 40px;"><img alt="equation" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/File/a5a46a4ff68b1425bbd155792d20a701/a5a46a4ff68b1425bbd155792d20a701.png" style="border-width: 0px; border-style: solid; width: 624px; height: 140px;" /></p>
<p>The table of unbiasing constants is available within Minitab and <a href="http://support.minitab.com/en-us/minitab-express/1/help-and-how-to/control-charts/how-to/variables-data-in-subgroups/xbar-r-chart/methods-and-formulas/unbiasing-constants-d2-d3-and-d4/">on this page</a>.</p>
<p>We’ve already done most of the work. To finish, we’ll find the right value of d2 in the table linked above, and use Minitab’s calculator to get the answer. We need the value of d2 that corresponds to a moving range of length 2 (that’s the number of points in each moving range calculation, but don’t worry, I’ll explain more about the length of the moving range later):</p>
<p style="margin-left: 40px;"><img border="0" height="179" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/File/2caa9e4eec046f281a834976260d3f8c/2caa9e4eec046f281a834976260d3f8c.png" width="173" /></p>
<p>Now back to Minitab, and we can use <strong>Calc</strong> > <strong>Calculator</strong> to get our answer:</p>
<p style="margin-left: 40px;"><img alt="c" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/f3eaf58a9007d6420c44b559206206eb/capture.PNG" style="width: 604px; height: 386px;" /></p>
<p>Using the formula above, we’re telling Minitab to use the absolute values (ABS calculator command) in C5 to calculate the mean, and then divide that by our unbiasing constant value of 1.128.</p>
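<p>The same calculation can be sketched in Python: take the absolute differences between consecutive points, average them, and divide by d2 = 1.128. The function name <code>mr_estimate</code> is hypothetical, and the data are only the first three points recoverable from the post (20.3454 is derived from the stated first moving range), so the result will not equal the value Minitab reports for all 10 points.</p>

```python
from statistics import mean

# Unbiasing constant for a moving range of length 2, as used in the post.
D2_LENGTH_2 = 1.128


def mr_estimate(values, d2=D2_LENGTH_2):
    """Mean of the absolute moving ranges (length 2), divided by d2."""
    absolute_mr = [abs(b - a) for a, b in zip(values, values[1:])]
    return mean(absolute_mr) / d2


# First three points only; the other seven are not listed in the post.
points = [20.3454, 21.0494, 19.6375]

estimate = mr_estimate(points)
print(round(estimate, 4))
```

<p>Taking absolute values first matters: without them, positive and negative differences would cancel and understate the variation.</p>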
<p>Now to check our results against Minitab, we can use <strong>Stat </strong>> <strong>Control Charts</strong> > <strong>Variables Charts for Individuals</strong> > <strong>I-MR</strong> and enter our original data column:</p>
<p style="margin-left: 40px;"><img border="0" height="334" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/File/0c80b992ef94f8d021aa1ebfc5bbc594/0c80b992ef94f8d021aa1ebfc5bbc594.png" width="507" /></p>
<p>Next, choose <strong>I-MR Options</strong> > <strong>Storage</strong>, and check the box next to <strong>Standard deviations</strong>, then click <strong>OK</strong> in each dialog box:</p>
<p style="margin-left: 40px;"><img alt="d" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/c4b545c37882980e3f690ad046f63626/capture.PNG" style="width: 582px; height: 440px;" /></p>
<p>The results show the same average moving range value we calculated, <strong>0.602627</strong>. </p>
<p>In this case, because we used a moving range of length 2, the average moving range gives us an estimate of the average distance between our consecutive individual data points. A moving range of length 2 is Minitab’s default, but that can be changed by clicking the <strong>I-MR Options</strong> button in the I-MR chart dialog, and then choosing the <strong>Estimate</strong> tab:</p>
<p style="margin-left: 40px;"><img border="0" height="438" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/File/3e03c57905dc63ff5be0971285a4d518/3e03c57905dc63ff5be0971285a4d518.png" width="442" /></p>
<p>Here we can type in a different value (let’s use 3 as an example), and Minitab will use that number of points to estimate the moving ranges. If we did that for the calculations above, we’d have to make two adjustments:</p>
<ol>
<li>
<p>We’d need to choose the correct value for the unbiasing constant, d2, that corresponds with a moving range length of 3:</p>
<p><img alt="t" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/94764a32eec04329f8dfdd4d73219214/capture.PNG" style="width: 173px; height: 182px;" /></p>
</li>
<li>We’d have to adjust the number of points used for our moving ranges from 2 to 3. Using the same random data as before:</li>
</ol>
<p style="margin-left: 40px;"><img border="0" height="248" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/File/bf32b968b0788bc21e03920397ccefe4/bf32b968b0788bc21e03920397ccefe4.png" width="71" /></p>
<p style="margin-left: 40px;">With three data points, we’ll use just the highest and the lowest values from the first 3 rows, so MR1 will be 21.0494 – 19.6375 = 1.4119.</p>
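<p>For a moving range of any length, each range is the highest value minus the lowest value in a sliding window of that many points. Here is a minimal Python sketch under the same assumptions as before: only the first three data points are known from the post, and the function name <code>windowed_ranges</code> is hypothetical. The d2 values shown are the standard unbiasing constants for lengths 2 and 3.</p>

```python
# Standard d2 unbiasing constants for moving ranges of length 2 and 3.
D2 = {2: 1.128, 3: 1.693}


def windowed_ranges(values, length=2):
    """Return max - min for every run of `length` consecutive values."""
    return [
        max(values[i:i + length]) - min(values[i:i + length])
        for i in range(len(values) - length + 1)
    ]


# First three points only; 20.3454 is derived from the first moving range.
points = [20.3454, 21.0494, 19.6375]

mr_length_3 = windowed_ranges(points, length=3)
mr_length_2 = windowed_ranges(points, length=2)
print([round(r, 4) for r in mr_length_3])
print([round(r, 4) for r in mr_length_2])
```

<p>With a length of 3, n points give n - 2 moving ranges, and their average would be divided by the d2 value for length 3 instead of 1.128.</p>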
<p><span style="line-height: 1.6;">If you’ve enjoyed this post, check out some of our other blog </span><a href="http://blog.minitab.com/blog/control-charts" style="line-height: 1.6;">posts about control charts</a><span style="line-height: 1.6;">.</span></p>
<p> </p>
Fri, 29 Apr 2016 12:00:00 +0000
Marilyn Wheatley
Manipulating Your Survey Data in Minitab
http://blog.minitab.com/blog/statistics-and-quality/manipulating-your-survey-data-in-minitab
<p>As a recent graduate from Arizona State University with a degree in Business Statistics, I had the opportunity to work with students from different areas of study and help analyze data from various projects for them.</p>
<p><img alt="survey symbold" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/3b2a7f4c85707a09177d3da12dbaa009/online_survey_icon_or_logo_svg.png" style="margin: 10px 15px; float: right; width: 300px; height: 300px;" />One particular group asked for help analyzing online survey data they had gathered from other students, and they wanted to see if their new student program was beneficial. I would describe this request as them giving us a "pile of data" and saying, "Tell us what you can find out." </p>
<p>There were numerous problems with this "pile of data" because it wasn't organized, in part because of the way the survey itself was set up. (Our statistics professor later told us that she asked this group to come in because she'd looked at their data before they presented it to us and she wanted to see how we would perform with a "real-world" situation.)</p>
<p>Unfortunately, the statistics department didn't have a time machine that would enable us to go back and set up the survey to produce better-organized data (I guess if we <em>did </em>have a time machine there would be no need for predictive analytics), but we did have <a href="http://www.minitab.com/products/minitab/">Minitab and its tools</a> to help with importing the data, reviewing it, and putting it in the format that is best for analysis. </p>
<p>So let’s assume you have a pile of survey data that is:</p>
<ul>
<li>Unbiased</li>
<li>Taken from a random sample</li>
<li>Taken from the appropriate audience</li>
<li>Drawn from enough respondents</li>
</ul>
<p><span style="line-height: 1.6;">Many online survey tools allow you to download your data to a .csv or Excel file, which would be perfect to <span>import into Minitab</span>. </span></p>
<p><span style="line-height: 1.6;">In fact, Minitab 17.3 recently added a new dialog box that shows you the data before it is opened, so you can modify the data type, include or exclude certain columns, and see how many rows the data contains. Within the options of that same dialog box, you can choose what is done with missing data points and missing data rows. All of these new functions let you bring a "pile of data" into Minitab a little cleaner and with less headache.</span></p>
<p style="margin-left: 40px;"><img alt="open survey data dialog" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/b51a0c86-e2dd-456e-878a-4196c7381c3a/File/c5319276614d905f12f38eca2f3a6343/c5319276614d905f12f38eca2f3a6343.png" style="width: 669px; height: 570px;" /> </p>
<p><span style="line-height: 1.6;">Once the data is in Minitab, reviewing it is essential to uncover any irregularities that may be hiding in the data before analysis. Within the Project Manager bar, the information icon lets you see each column's name, column ID, row count, number of missing data points, and data type. This lets you quickly scan the columns to make sure the online data was received correctly by checking the row count, any missing-data irregularities, and the data type. </span></p>
<p style="margin-left: 40px;"><img alt="data" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/b51a0c86-e2dd-456e-878a-4196c7381c3a/File/637ee7794419e3ad489f4a98c96cbc3c/637ee7794419e3ad489f4a98c96cbc3c.png" style="width: 396px; height: 342px;" /></p>
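<p>The same kind of column review can be sketched outside Minitab. The Python example below is only an illustration: the CSV text, the column names, and the <em>summarize_columns</em> helper are hypothetical, and the numeric/text rule is a simplifying assumption rather than Minitab's own logic.</p>

```python
import csv
import io

# Hypothetical survey export; empty cells stand for missing responses.
RAW = """respondent,age,satisfaction,comment
1,34,4,Good
2,,5,
3,41,,Too slow
4,29,3,Fine
"""

def is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def summarize_columns(csv_text):
    """Return {column: (row_count, missing_count, inferred_type)}."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for name in rows[0].keys():
        values = [row[name].strip() for row in rows]
        present = [v for v in values if v != ""]
        # Assumption: a column is numeric only if every non-blank cell parses.
        kind = "numeric" if present and all(is_number(v) for v in present) else "text"
        summary[name] = (len(values), len(values) - len(present), kind)
    return summary

for name, info in summarize_columns(RAW).items():
    print(name, info)
```

<p>A row count that differs between columns, or an unexpected "text" label on a column of ratings, is the kind of irregularity worth chasing before any analysis.</p>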
<p>Minitab also has numerous tools to format the data before analysis, including <a href="http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-datas-a-mess2c-part-3">coding, sorting and splitting worksheets</a>. </p>
<p>For example, occasionally survey data will use “0” in the place of a non-response. This can be a problem because any data analysis will make this a data point when it probably shouldn't be. Minitab can find those “0”s and replace them with missing data to remove them from your worksheet so they won't throw off your analysis (<strong>Editor > Find and Replace > Replace</strong>).</p>
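<p>To see why this matters, here is a minimal Python sketch with made-up ratings. It assumes a 1-to-5 scale on which "0" can only mean "no answer"; the function names are hypothetical.</p>

```python
def zeros_to_missing(responses):
    """Replace the placeholder 0 with None so it is not counted as a rating."""
    return [None if r == 0 else r for r in responses]

def mean_ignoring_missing(values):
    """Average only the values that are actually present."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)

ratings = [4, 0, 5, 3, 0, 4]          # hypothetical 1-5 ratings; 0 = no answer
cleaned = zeros_to_missing(ratings)

print(sum(ratings) / len(ratings))     # mean with zeros treated as data
print(mean_ignoring_missing(cleaned))  # mean after recoding zeros as missing
```

<p>The two non-responses pull the first mean well below the second, even though no respondent actually gave a rating of zero.</p>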
<p>Before analysis you can also sort your data (<strong>Data > Sort</strong>) by the column you choose, and you can create a new worksheet from the sorted data. I also really like the Split and Subset Worksheet options in the event you have a lot of data and it would be easier to look at smaller sections of it for analysis (<strong>Data > Split Worksheet</strong> and <strong>Data > Subset Worksheet</strong>).</p>
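<p>As a rough analogy in code, sorting and splitting amount to the two small operations below. The rows, column positions, and helper names are hypothetical; this sketches the idea and does not describe how Minitab implements it.</p>

```python
from collections import defaultdict

# Hypothetical survey rows: (respondent id, region, rating)
rows = [(3, "West", 4), (1, "East", 5), (2, "West", 2), (4, "East", 3)]

def sort_rows(data, column):
    """Return a new list sorted by the given column index (original untouched)."""
    return sorted(data, key=lambda row: row[column])

def split_rows(data, column):
    """Group rows into separate lists, one per distinct value in the column."""
    groups = defaultdict(list)
    for row in data:
        groups[row[column]].append(row)
    return dict(groups)

by_rating = sort_rows(rows, 2)   # like sorting into a new worksheet
by_region = split_rows(rows, 1)  # like one worksheet per region
print(by_rating)
print(by_region)
```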
<p>These are just a few tools that allow you to import data and then prepare the data without having to go back and forth between your spreadsheet software and statistical software. So when you have someone drop off a "pile of data," see how you can use your Minitab tools to shovel through and find the gems that are lying beneath the surface.</p>
Data Analysis, Statistics
Tue, 26 Apr 2016 12:00:00 +0000
http://blog.minitab.com/blog/statistics-and-quality/manipulating-your-survey-data-in-minitab
Joseph Hartsock
3 Tips for Importing Excel Data into Minitab
http://blog.minitab.com/blog/michelle-paret/3-tips-for-importing-excel-data-into-minitab
<p>Getting your data from Excel into <a href="http://www.minitab.com/products/minitab/">Minitab Statistical Software</a> for analysis is easy, especially if you keep the following tips in mind.</p>
Copy and Paste
<p><span style="line-height: 20.8px;">To paste into Minitab, you can either right-click in the worksheet and choose </span><strong style="line-height: 20.8px;">Paste Cells</strong><span style="line-height: 20.8px;"> or you can use </span><strong style="line-height: 20.8px;">Control-V</strong><span style="line-height: 20.8px;">. </span>Minitab allows for 1 row of column headers, so if you have a single row of column info (or no column header info), then you can quickly copy and paste an entire sheet at once. However, if you have multiple rows of descriptive text at the top of your Excel file, then use the following steps:</p>
<p><em> Step 1</em> - Choose a single row for your column headers and paste it into Minitab. </p>
<p><em> Step 2</em> - Go back to your Excel file to copy all of the actual data over.</p>
<p>And if you have any summary info at the end of your Excel file, you'll want to exclude that too, just like any extraneous column header info.</p>
<p><img alt="Excel to Minitab" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/951006fe8ebf8bfde86486660018fbe0/excel_to_mtb.jpg" style="width: 650px; height: 379px;" /></p>
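<p>If the same sheet is exported to a text file instead, the "one header row, no summary rows" rule can be applied in code. The Python sketch below assumes a hypothetical export layout and a hypothetical <em>read_table</em> helper; you would set the header position and footer count to match your own file.</p>

```python
import csv
import io

# Hypothetical export: two lines of descriptive text, one header row,
# the data, and a summary line at the bottom.
RAW = """Quarterly survey export
Generated by the survey tool
id,score
1,4
2,5
3,3
Average,4
"""

def read_table(text, header_index, footer_rows):
    """Keep one header row and the data; drop leading notes and trailing summaries."""
    lines = list(csv.reader(io.StringIO(text)))
    header = lines[header_index]
    end = len(lines) - footer_rows
    data = lines[header_index + 1:end]
    return header, data

header, data = read_table(RAW, header_index=2, footer_rows=1)
print(header)
print(data)
```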
Importing Lots of Data
<p><img alt="File Open dialog" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/75e6b833214b1e9cbda4e6056a2fde43/file_open_menu.jpg" style="line-height: 20.8px; width: 253px; height: 359px; margin: 10px 15px; float: right;" /></p>
<p>Copy/paste is ideal when you have only a few Excel sheets. But what if you have lots of <span style="line-height: 1.6;">sheets? In this case, try using </span><strong style="line-height: 1.6;">File > Open</strong><span style="line-height: 1.6;">. Another advantage of </span><strong style="line-height: 1.6;">File > Open</strong><span style="line-height: 1.6;"> is the additional import options, should you need them. For example, you can specify which sheets </span><span style="line-height: 1.6;">and rows to include. And there are even options to handle messy data issues, such as case mismatches and </span><a href="http://blog.minitab.com/blog/michelle-paret/how-to-remove-leading-or-trailing-spaces-from-a-data-set" style="line-height: 1.6;">leading and trailing spaces</a><span style="line-height: 1.6;">.</span></p>
<div>
Fixing Column Formats
<p>Minitab has 3 column formats: numeric, text, and date/time. Text columns are noted with a <strong>-T</strong> and date/time columns are noted with a <strong>-D</strong>, while numeric columns appear without such an indicator. Why does column format matter? It matters because certain graphs and analyses are only available for certain formats. For example, if you want to create a time series plot, Minitab will not allow you to use a text column. If you bring data over from Excel and the format does not reflect the type of data in a given column, just right-click in the column and choose <strong>Format Column</strong> to select the right type, such as <strong>Automatic numeric</strong>.</p>
<p><img alt="column formats" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/350de8d0fc91e01d485bc1f124a28148/column_format.jpg" style="width: 645px; height: 332px;" /></p>
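<p>The idea behind changing a column's format can be illustrated with a short Python sketch. It is not Minitab's conversion routine: the <em>to_numeric_column</em> helper and the sample values are hypothetical, and treating blanks as missing is an assumption.</p>

```python
def to_numeric_column(values):
    """Convert a text column to floats; blanks become None (missing).

    Raises ValueError if any non-blank entry is not a number, so genuinely
    textual columns are left for the caller to keep as text.
    """
    converted = []
    for value in values:
        text = value.strip()
        converted.append(None if text == "" else float(text))
    return converted

text_column = ["12.5", " 7", "", "3e2"]   # hypothetical values pasted as text
numeric_column = to_numeric_column(text_column)
print(numeric_column)
```

<p>A single stray label such as "n/a" is enough to make the conversion fail, which is often how a column ends up stored as text in the first place.</p>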
<p><span style="line-height: 1.6;">Once you import your data and it's properly formatted, you can then use the </span><strong style="line-height: 1.6;">Stat</strong><span style="line-height: 1.6;">, </span><strong style="line-height: 1.6;">Graph</strong><span style="line-height: 1.6;">, and </span><strong style="line-height: 1.6;">Assistant</strong><span style="line-height: 1.6;"> menus to start analyzing it. And if you need help running a particular analysis, just </span><a href="http://www.minitab.com/contact-us" style="line-height: 1.6;">contact Minitab Technical Support</a><span style="line-height: 1.6;">. This outstanding service is free and is staffed with statisticians, so don't hesitate to give them a call.</span></p>
</div>
Data Analysis
Fri, 22 Apr 2016 12:00:00 +0000
http://blog.minitab.com/blog/michelle-paret/3-tips-for-importing-excel-data-into-minitab
Michelle Paret
Understanding t-Tests: t-values and t-distributions
http://blog.minitab.com/blog/adventures-in-statistics/understanding-t-tests-t-values-and-t-distributions
<p>T-tests are handy <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-why-we-need-to-use-hypothesis-tests-in-statistics" target="_blank">hypothesis tests</a> in statistics when you want to compare means. You can compare a sample mean to a hypothesized or target value using a one-sample t-test. You can compare the means of two groups with a two-sample t-test. If you have two groups with paired observations (e.g., before and after measurements), use the paired t-test.</p>
<img alt="Output that shows a t-value" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/efd51d69e3947d70197143b735e0c51d/t_value_swo.png" style="line-height: 20.8px; float: right; width: 400px; height: 57px; margin: 10px 15px; border-width: 1px; border-style: solid;" />
<p>How do t-tests work? How do t-values fit in? In this series of posts, I’ll answer these questions by focusing on concepts and graphs rather than equations and numbers. After all, a key reason to use <a href="http://www.minitab.com/products/minitab">statistical software like </a><a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">Minitab</a> is so you don’t get bogged down in the calculations and can instead focus on understanding your results.</p>
<p>In this post, I will explain t-values, t-distributions, and how t-tests use them to calculate probabilities and assess hypotheses.</p>
What Are t-Values?
<p>T-tests are called t-tests because the test results are all based on t-values. T-values are an example of what statisticians call test statistics. A test statistic is a standardized value that is calculated from sample data during a hypothesis test. The procedure that calculates the test statistic compares your data to what is expected under the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/basics/null-and-alternative-hypotheses/" target="_blank">null hypothesis</a>.</p>
<p>Each type of t-test uses a specific procedure to boil all of your sample data down to one value, the t-value. The calculations behind t-values compare your sample mean(s) to the null hypothesis and incorporate both the sample size and the variability in the data. A t-value of 0 indicates that the sample results exactly equal the null hypothesis. As the difference between the sample data and the null hypothesis increases, the absolute value of the t-value increases.</p>
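<p>For readers who do want to peek at the arithmetic, here is a minimal Python sketch of the standard one-sample t-value: the difference between the sample mean and the hypothesized mean, divided by the standard error of the mean. The sample values are made up for illustration.</p>

```python
import math
import statistics

def one_sample_t(sample, null_mean):
    """t = (sample mean - hypothesized mean) / (s / sqrt(n))."""
    n = len(sample)
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return (mean - null_mean) / std_err

sample = [10.2, 9.8, 10.5, 10.1, 9.9, 10.4]   # hypothetical measurements
print(one_sample_t(sample, null_mean=10.0))
print(one_sample_t(sample, null_mean=9.0))    # larger difference, larger |t|
```

<p>Because the standard error shrinks as the sample grows and widens as the data become more variable, the same raw difference can produce very different t-values.</p>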
<p>Assume that we perform a t-test and it calculates a t-value of 2 for our sample data. What does that even mean? I might as well have told you that our data equal 2 fizbins! We don’t know if that’s common or rare when the null hypothesis is true.</p>
<p>By itself, a t-value of 2 doesn’t really tell us anything. T-values are not in the units of the original data, or anything else we’d be familiar with. We need a larger context in which we can place individual t-values before we can interpret them. This is where t-distributions come in.</p>
What Are t-Distributions?
<p>When you perform a t-test for a single study, you obtain a single t-value. However, if we drew multiple random samples of the same size from the same population and performed the same t-test, we would obtain many t-values and we could plot a distribution of all of them. This type of distribution is known as a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/basic-concepts/sampling-distribution/" target="_blank">sampling distribution</a>.</p>
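<p>That thought experiment is easy to imitate with a small simulation. The Python sketch below assumes a normal population with mean 0 (so the null hypothesis is true), a sample size of 21, and an arbitrary random seed; all of these are illustrative choices.</p>

```python
import math
import random

def simulate_t_values(n, repeats, seed=1):
    """Draw `repeats` samples of size n from a population where the null
    hypothesis (mean = 0) is true, and return the one-sample t-value of each."""
    rng = random.Random(seed)
    t_values = []
    for _ in range(repeats):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
        std_err = math.sqrt(variance / n)
        t_values.append(mean / std_err)
    return t_values

t_values = simulate_t_values(n=21, repeats=5000)
share_beyond_2 = sum(abs(t) > 2 for t in t_values) / len(t_values)
print(f"Share of simulated t-values beyond +/-2: {share_beyond_2:.3f}")
```

<p>A histogram of <em>t_values</em> approximates the sampling distribution: most values pile up near zero and only a small share land far out in either tail.</p>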
<p>Fortunately, the properties of t-distributions are well understood in statistics, so we can plot them without having to collect many samples! A specific t-distribution is defined by its <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/basic-concepts/df/" target="_blank">degrees of freedom (DF)</a>, a value closely related to sample size. Therefore, different t-distributions exist for every sample size. <span style="line-height: 20.8px;">You can graph t-distributions u</span><span style="line-height: 1.6;">sing Minitab’s </span><a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/graphs/graphs-of-distributions/probability-distribution-plots/probability-distribution-plot/" style="line-height: 1.6;" target="_blank">probability distribution plots</a><span style="line-height: 1.6;">.</span></p>
<p>T-distributions assume that you draw repeated random samples from a population where the null hypothesis is true. You place the t-value from your study in the t-distribution to determine how consistent your results are with the null hypothesis.</p>
<p style="margin-left: 40px;"><img alt="Plot of t-distribution" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/d628e56f0380e0edcf575502a670ed31/t_dist_20_df.png" style="width: 576px; height: 384px;" /></p>
<p>The graph above shows a t-distribution that has 20 degrees of freedom, which corresponds to a sample size of 21 in a one-sample t-test. It is a symmetric, bell-shaped distribution that is similar to the normal distribution, but with thicker tails. This graph plots the probability density function (PDF), which describes the likelihood of each t-value.</p>
<p>The peak of the graph is right at zero, which indicates that obtaining a sample value close to the null hypothesis is the most likely. That makes sense because t-distributions assume that the null hypothesis is true. T-values become less likely as you get further away from zero in either direction. In other words, when the null hypothesis is true, you are less likely to obtain a sample that is very different from the null hypothesis.</p>
<p>Our t-value of 2 indicates a positive difference between our sample data and the null hypothesis. The graph shows that there is a reasonable probability of obtaining a t-value from -2 to +2 when the null hypothesis is true. Our t-value of 2 is an unusual value, but we don’t know exactly <em>how </em>unusual. Our ultimate goal is to determine whether our t-value is unusual enough to warrant rejecting the null hypothesis. To do that, we'll need to calculate the probability.</p>
Using t-Values and t-Distributions to Calculate Probabilities
<p>The foundation behind any hypothesis test is being able to take the test statistic from a specific sample and place it within the context of a known probability distribution. For t-tests, if you take a t-value and place it in the context of the correct t-distribution, you can calculate the probabilities associated with that t-value.</p>
<p>A probability allows us to determine how common or rare our t-value is under the assumption that the null hypothesis is true. If the probability is low enough, we can conclude that the effect observed in our sample is inconsistent with the null hypothesis. The evidence in the sample data is strong enough to reject the null hypothesis for the entire population.</p>
<p>Before we calculate the probability associated with our t-value of 2, there are two important details to address.</p>
<p>First, we’ll actually use the t-values of +2 and -2 because we’ll perform a two-tailed test. A two-tailed test is one that can test for differences in both directions. For example, a two-tailed 2-sample t-test can determine whether the difference between group 1 and group 2 is statistically significant in either the positive or negative direction. A one-tailed test can only assess one of those directions.</p>
<p>Second, we can only calculate a non-zero probability for a range of t-values. As you’ll see in the graph below, a range of t-values corresponds to a proportion of the total area under the distribution curve, which is the probability. The probability for any specific point value is zero because it does not produce an area under the curve.</p>
<p>With these points in mind, we’ll shade the area of the curve that has t-values greater than 2 and t-values less than -2.</p>
<p><img alt="T-distribution with a shaded area that represents a probability" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/5e124a2c8139681afec706799ebabcec/t_dist_prob.png" style="width: 576px; height: 384px;" /></p>
<p>The graph displays the probability for observing a difference from the null hypothesis that is at least as extreme as the difference present in our sample data while assuming that the null hypothesis is actually true. Each of the shaded regions has a probability of 0.02963, which sums to a total probability of 0.05926. When the null hypothesis is true, the t-value falls within these regions nearly 6% of the time.</p>
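<p>For readers who like to see the symbols, the shaded area can be written as follows. The notation is my own shorthand rather than anything Minitab displays: ν is the degrees of freedom, f is the t-distribution's probability density function, and t<sub>obs</sub> is the t-value from the sample.</p>

```latex
p = P\left(|T| \ge |t_{\text{obs}}|\right)
  = 2\int_{|t_{\text{obs}}|}^{\infty} f(x;\nu)\,dx,
\qquad
f(x;\nu) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}
                {\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)}
           \left(1+\frac{x^{2}}{\nu}\right)^{-\frac{\nu+1}{2}}
```

<p>With t<sub>obs</sub> = 2 and ν = 20, each tail contributes the 0.02963 shown in the graph, and doubling it gives 0.05926.</p>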
<p>This probability has a name that you might have heard of—it’s called the p-value! While the probability of our t-value falling within these regions is fairly low, it’s not low enough to reject the null hypothesis using the common <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-significance-levels-alpha-and-p-values-in-statistics" target="_blank">significance level</a> of 0.05.</p>
<p><a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" target="_blank">Learn how to correctly interpret the p-value.</a></p>
t-Distributions and Sample Size
<p>As mentioned above, t-distributions are defined by the DF, which are closely associated with sample size. As the DF increases, the probability density in the tails decreases and the distribution becomes more tightly clustered around the central value. The graph below depicts t-distributions with 5 and 30 degrees of freedom.</p>
<p><img alt="Comparison of t-distributions with different degrees of freedom" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/5220dc6347611a230e89b70de904b034/t_dist_comp_df.png" style="width: 576px; height: 384px;" /></p>
<p>The t-distribution with fewer degrees of freedom has thicker tails. This occurs because the t-distribution is designed to reflect the added uncertainty associated with analyzing small samples. In other words, if you have a small sample, the probability that the sample statistic will be further away from the null hypothesis is greater even when the null hypothesis is true.</p>
<p>Small samples are more likely to be unusual. This affects the probability associated with any given t-value. For 5 and 30 degrees of freedom, a t-value of 2 in a two-tailed test has p-values of 10.2% and 5.4%, respectively. Large samples are better!</p>
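<p>These tail areas can be approximated without statistical software. The Python sketch below integrates the standard t-distribution density numerically with Simpson's rule; the function names and the step count are my own choices, and Minitab may well compute the values differently. The results should land close to the figures quoted in this post for 5, 20, and 30 degrees of freedom.</p>

```python
import math

def t_pdf(x, df):
    """Probability density of the t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Area beyond -|t| and +|t|, using Simpson's rule on [0, |t|].

    `steps` must be even for Simpson's rule.
    """
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    return 1 - 2 * area_zero_to_t

for df in (5, 20, 30):
    print(df, round(two_tailed_p(2, df), 4))
```

<p>The same t-value of 2 becomes steadily less surprising as the degrees of freedom drop, which is the thicker-tails effect expressed in numbers.</p>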
<p>I’ve shown how t-values and t-distributions work together to produce probabilities. To see how each type of t-test works and actually calculates the t-values, read the other post in this series, <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-t-tests:-1-sample,-2-sample,-and-paired-t-tests">Understanding t-Tests: 1-sample, 2-sample, and Paired t-Tests</a>.</p>
<p>If you'd like to learn how the ANOVA F-test works, read my post, <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-analysis-of-variance-anova-and-the-f-test">Understanding Analysis of Variance (ANOVA) and the F-test</a>.</p>
Data Analysis, Hypothesis Testing, Learning, Statistics Help
Wed, 20 Apr 2016 12:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/understanding-t-tests-t-values-and-t-distributions
Jim Frost
Best Way to Analyze Likert Item Data: Two Sample T-Test versus Mann-Whitney
http://blog.minitab.com/blog/adventures-in-statistics/best-way-to-analyze-likert-item-data%3A-two-sample-t-test-versus-mann-whitney
<p><img alt="Worksheet that shows Likert data" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/6b1cf78b969699ed58febb026d32051d/likert_worksheet.png" style="float: right; width: 162px; height: 265px; margin: 10px 15px;" />Five-point Likert scales are commonly associated with surveys and are used in a wide variety of settings. You’ve run into the Likert scale if you’ve ever been asked whether you strongly agree, agree, neither agree nor disagree, disagree, or strongly disagree about something. The worksheet to the right shows what five-point Likert data look like when you have two groups.</p>
<p>Because Likert item data are discrete, ordinal, and have a limited range, there’s been a longstanding dispute about the most valid way to analyze Likert data. The basic choice is between <a href="http://blog.minitab.com/blog/adventures-in-statistics/choosing-between-a-nonparametric-test-and-a-parametric-test" target="_blank">a parametric test and a nonparametric test</a>. The pros and cons for each type of test are generally described as the following:</p>
<ul>
<li>Parametric tests, such as the 2-sample t-test, assume a normal, continuous distribution. However, with a sufficient sample size, t-tests are robust to departures from normality.</li>
<li>Nonparametric tests, such as the Mann-Whitney test, do not assume a normal or a continuous distribution. However, there are concerns about a lower ability to detect a difference when one truly exists.</li>
</ul>
<p>What’s the better choice? This is a real-world decision that users of <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">statistical software</a> have to make when they want to analyze Likert data.</p>
<p>Over the years, a number of studies have tried to answer this question. However, they’ve tended to look at a limited number of potential distributions for the Likert data, which causes the generalizability of the results to suffer. Thanks to increases in computing power, simulation studies can now thoroughly assess a wide range of distributions.</p>
<p>In this blog post, I highlight a simulation study conducted by de Winter and Dodou* that compares the capabilities of the two sample t-test and the Mann-Whitney test to analyze five-point Likert items for two groups. Is it better to use one analysis or the other?</p>
<p>The researchers identified a diverse set of 14 distributions that are representative of actual Likert data. The computer program drew independent pairs of samples to test all possible combinations of the 14 distributions. All in all, 10,000 random samples were generated for each of the 98 distribution combinations! The pairs of samples were analyzed using both the two sample t-test and the Mann-Whitney test to compare how well each test performs. The study also assessed different sample sizes.</p>
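<p>To make the setup concrete, here is a rough Python sketch of a single iteration of such a simulation. It is not the researchers' code: the two response distributions are invented, the random seed is arbitrary, and I am assuming the pooled-variance form of the two sample t-test. A full study would repeat this step thousands of times and convert each statistic to a p-value.</p>

```python
import random
import statistics

SCALE = [1, 2, 3, 4, 5]

def draw_likert(weights, n, rng):
    """Draw n five-point responses with the given category probabilities."""
    return rng.choices(SCALE, weights=weights, k=n)

def pooled_t(a, b):
    """Two-sample t statistic with a pooled variance estimate."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / (pooled * (1 / na + 1 / nb)) ** 0.5

def mann_whitney_u(a, b):
    """U statistic for sample a, counting each tie as one half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

rng = random.Random(0)
group1 = draw_likert([0.1, 0.2, 0.4, 0.2, 0.1], 30, rng)     # hypothetical symmetric shape
group2 = draw_likert([0.05, 0.1, 0.2, 0.3, 0.35], 30, rng)   # hypothetical skewed shape
print(pooled_t(group1, group2), mann_whitney_u(group1, group2))
```

<p>Ties are everywhere in five-point data, which is why the U statistic here counts each tie as half a win instead of ignoring it.</p>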
<p>The results show that for all pairs of distributions the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/basics/type-i-and-type-ii-error/" target="_blank">Type I (false positive) error rates</a> are very close to the target amounts. In other words, if you use either analysis and your results are statistically significant, you don’t need to be overly concerned about a false positive.</p>
<p>The results also show that for most pairs of distributions, the difference between the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/power-and-sample-size/what-is-power/" target="_blank">statistical power</a> of the two tests is trivial. In other words, if a difference truly exists at the population level, either analysis is equally likely to detect it. The concerns about the Mann-Whitney test having less power in this context appear to be unfounded.</p>
<p>I do have one caveat. There are a few pairs of specific distributions where there is a power difference between the two tests. If you perform both tests on the same data and they disagree (one is significant and the other is not), you can look at a table in the article to help you determine whether a difference in statistical power might be an issue. This power difference affects only a small minority of the cases.</p>
<p>Generally speaking, the choice between the two analyses is a tie. If you need to compare two groups of five-point Likert data, it usually doesn’t matter which analysis you use. Both tests almost always provide the same protection against false negatives and always provide the same protection against false positives. These patterns hold true for sample sizes of 10, 30, and 200 per group.</p>
<p>*de Winter, J.C.F. and D. Dodou (2010), Five-Point Likert Items: t test versus Mann-Whitney-Wilcoxon, <em>Practical Assessment, Research and Evaluation</em>, 15(11).</p>
Data Analysis, Hypothesis Testing, Statistics, Statistics Help
Wed, 06 Apr 2016 12:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/best-way-to-analyze-likert-item-data%3A-two-sample-t-test-versus-mann-whitney
Jim Frost
Are You Putting the Data Cart Before the Horse? Best Practices for Prepping Data for Analysis, ...
http://blog.minitab.com/blog/meredith-griffith/are-you-putting-the-data-cart-before-the-horse-best-practices-for-prepping-data-for-analysis%2C-part-1
<p>Most of us have heard a backwards way of completing a task, or doing something in the conventionally wrong order, described as “putting the cart before the horse.” That’s because a horse pulling a cart is much more efficient than a horse pushing a cart.</p>
<p><img alt="cart before horse" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ec1fbea4785510ea0e0a9997c1669c68/cart_horse.png" style="margin: 10px 15px; float: right; width: 350px; height: 206px;" />This saying may be especially true in the world of statistics. Focusing on a statistical tool or analysis before checking out the condition of your data is one way you may be putting the cart before the horse. You may then find yourself trying to force your data to fit an analysis, particularly when the data has not been set up properly. It’s far more efficient to first make sure your <a href="http://blog.minitab.com/blog/understanding-statistics/the-single-most-important-question-in-every-statistical-analysis">data are reliable</a> and then allow your questions of interest to guide you to the right analysis.</p>
<p>Spending a little quality time with your data up front can save you from wasting a lot of time on an analysis that either can’t work—or can’t be trusted.</p>
<p>As a quality practitioner, you’re likely to be involved in many activities—establishing quality requirements for external suppliers, monitoring product quality, reviewing product specifications and ensuring they are met, improving process efficiency, and much more.</p>
<p>All of these tasks will involve data collection and statistical analysis with software such as Minitab. For example, suppose you need to perform a <a href="http://blog.minitab.com/blog/meredith-griffith/fundamentals-of-gage-rr">Gage R&R</a> study to verify your measurement systems are valid, or you need to understand how machine failures impact downtime.</p>
<p>Rather than jumping right into the analysis, you will be at an advantage if you take time to look at your data. Ask yourself questions such as:</p>
<ul>
<li>What problem am I trying to solve?</li>
<li>Is my data set up in a way that will be useful for answering my question?</li>
<li>Did I make any mistakes while recording my data?</li>
</ul>
<p>Utilizing process knowledge can also help you answer questions about your data and identify data entry errors. A focus on preparing and exploring your data prior to an analysis will not only save you time in the long run, but will help you obtain reliable results.</p>
<p>So then, where to begin with best practices for prepping data for an analysis? Let’s look no further than your data.</p>
Clean your data before you analyze it
<p>Let’s assume you already know what problem you’re trying to solve with your data. For instance, you are the area supervisor of a manufacturing facility, and you’ve been experiencing lower productivity than usual on the machines in your area and want to understand why. You have collected data on these machines, recording the amount of time a machine was out of operation, the reason for the machine being down, the shift number when the machine went down, and the speed of the machine when it went down.</p>
<p>The first step toward answering your question is to ensure your data are clean. Cleaning your data before you begin an analysis can save time by preventing rework, such as reformatting data or correcting data entry errors, after you’ve already begun the analysis. Data cleaning is also essential to ensure your analyses and results—and the decisions you make—are reliable.</p>
<p>With the <a href="https://www.minitab.com/en-us/support/minitab/minitab-17.3.1-update/" style="line-height: 20.8px;">latest update to Minitab 17</a><span style="line-height: 20.8px;">, an improved data import helps you identify and correct case mismatches, fix improperly formatted columns, represent missing data accurately and in a manner that is recognized by the software, remove blank rows and extra spaces, and more. When importing your data, you see a preview of your data as a reminder to ensure it’s in the best possible state before it finds its way into Minitab. This preview helps you spot mistakes you have made in your data collection, and automatically corrects mistakes you don’t notice or that are difficult to find in large data sets.</span></p>
<p><img alt="Data Import" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/dae6c7b7-fc22-4616-9d65-f04909c20ab1/Image/b1c679056c60ac2fa82f37e1f1de406b/data_import.jpg" style="width: 775px; height: 655px;" /></p>
<p><em>Minitab offers a data import dialog that helps you quickly clean and format your data before importing into the software, ensuring your data are trustworthy and allowing you to get to your analysis sooner.</em></p>
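<p>To give a feel for what this kind of cleanup involves, here is a small Python sketch that removes blank rows and reconciles case mismatches. It is an illustration only: the downtime log is made up, the <em>clean_rows</em> helper is hypothetical, and "keep the first spelling seen" is my assumption, not a description of how Minitab resolves mismatches.</p>

```python
def clean_rows(rows):
    """Drop fully blank rows and make text values that differ only by
    letter case (or surrounding spaces) share the first spelling seen."""
    canonical = {}              # lowercase key -> first spelling encountered
    cleaned = []
    for row in rows:
        if all(cell.strip() == "" for cell in row):
            continue            # skip blank rows
        new_row = []
        for cell in row:
            key = cell.strip().lower()
            new_row.append(canonical.setdefault(key, cell.strip()))
        cleaned.append(new_row)
    return cleaned

# Hypothetical downtime log: (reason, shift)
raw = [["Jam", "1"], ["jam ", "2"], ["", ""], ["JAM", "1"], ["Overheat", "3"]]
print(clean_rows(raw))
```

<p>Without this step, "Jam", "jam " and "JAM" would be counted as three different reasons for downtime.</p>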
<p><span style="line-height: 20.8px;">If you’d rather copy and paste your data from Excel, Minitab will ensure you paste your data in the right place. For instance, if your data have column names and you accidentally paste your data into the first row of the worksheet, your data will all be formatted as text—even when the data following your column names are numeric! With </span><a href="https://www.minitab.com/en-us/products/minitab/whats-new/" style="line-height: 20.8px;">Minitab 17.3</a><span style="line-height: 20.8px;">, you will receive an alert that your data is in the wrong place, and Minitab will automatically move your data where it belongs. This alert ensures your data are formatted properly, preventing you from running into the problem during an analysis and saving you time manually correcting every improperly formatted column.</span></p>
<p><img alt="Copy Paste Warning" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/dae6c7b7-fc22-4616-9d65-f04909c20ab1/Image/5df941ffaa491a0072261aef075a19d6/copy_paste_warning.jpg" style="width: 431px; height: 299px;" /></p>
<p><em>Pasting your Excel data in the first row of a Minitab worksheet will trigger this warning, which safeguards against improperly formatted columns.</em></p>
<p><span style="line-height: 1.6;">This is only the beginning! Minitab makes it quick and painless to begin exploring and visualizing your data, offering more insights and ease once you get to the analysis. If you’d like to learn additional best practices for prepping your data for any analysis, stay tuned for my next post where I’ll offer tips for exploring and drawing insights from your data!</span></p>
Data Analysis, Statistics
Wed, 30 Mar 2016 14:05:04 +0000
http://blog.minitab.com/blog/meredith-griffith/are-you-putting-the-data-cart-before-the-horse-best-practices-for-prepping-data-for-analysis%2C-part-1
Meredith Griffith
How to Remove Leading or Trailing Spaces from a Data Set
http://blog.minitab.com/blog/michelle-paret/how-to-remove-leading-or-trailing-spaces-from-a-data-set
<p>Leading and trailing spaces in a data set are like termites in your house. If you don’t realize they are there and you don’t get rid of them, they’re going to wreak havoc.</p>
<p><img alt="keyboard" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/cc148fcf427d6e92fba27f00bba3968c/keyboard.jpg" style="margin: 10px 15px; float: right; width: 274px; height: 184px;" />Here are a few easy ways to remove these pesky characters with <a href="http://www.minitab.com/products/minitab/">Minitab Statistical Software</a> prior to analysis.</p>
Data Import
<p>If you’re importing data from Excel, a text file, or some other file type:</p>
<ol>
<li>Choose <strong>File > Open</strong> and select your Excel file, text file, etc.</li>
<li>Click <strong>Options</strong> and select <em>Remove nonprintable characters and extra spaces</em>.</li>
<li>Click <strong>OK</strong>.</li>
</ol>
<p>Note: This feature was introduced in Minitab 17.3. If you have an older version of Minitab 17, use <strong>Help > Check for Updates</strong>. If you have Minitab 16 or earlier—or you don't have Minitab at all—you can download a <a href="http://www.minitab.com/products/minitab/free-trial/">free 30-day trial</a>.</p>
The Calculator
<p>Suppose you already have your data in Minitab, located in column C1:</p>
<ol>
<li>Choose <strong>Calc > Calculator</strong>.</li>
<li>In <strong>Store result in variable</strong>, enter a blank column (e.g. <em>C5</em>), or you can overwrite an existing column.</li>
<li>In <strong>Expression</strong>, enter <em>TRIM(C1).</em></li>
<li>Click <strong>OK</strong>.</li>
</ol>
<p>If you also want to remove all non-printable characters using the Calculator, <em>CLEAN</em> is available as well.</p>
<p><img alt="Calculator" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/bce689217238ff90e42bbde154308b74/calculator.jpg" style="width: 350px; height: 310px;" /></p>
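<p>If you'd like to see the idea outside of Minitab, here is a minimal Python sketch of the same two clean-up steps. The function names <em>trim</em> and <em>clean</em> and the sample supplier labels are made up for illustration; they mimic the general idea of the Calculator functions and are not Minitab's implementation.</p>

```python
def trim(value: str) -> str:
    """Remove leading and trailing whitespace from a text value."""
    return value.strip()


def clean(value: str) -> str:
    """Drop non-printable characters (control codes, tabs) from a text value."""
    return "".join(ch for ch in value if ch.isprintable())


# Hypothetical column of labels with stray spaces and one control character.
raw_column = ["  Supplier A", "Supplier A ", "Supplier\x07 B", " Supplier B  "]

# Clean first, then trim, so nothing is left hanging at either end.
tidy_column = [trim(clean(v)) for v in raw_column]

print(tidy_column)
print(len(set(raw_column)), "distinct labels before,", len(set(tidy_column)), "after")
```

<p>The last line shows why this matters: labels that look identical on screen are treated as different groups until the stray characters are gone.</p>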
<p><span style="line-height: 1.6;">And that’s all there is to it.</span></p>
Data Analysis
Fri, 25 Mar 2016 12:00:00 +0000
http://blog.minitab.com/blog/michelle-paret/how-to-remove-leading-or-trailing-spaces-from-a-data-set
Michelle Paret
Gage R&R Metrics: What Do They All Mean?
http://blog.minitab.com/blog/starting-out-with-statistical-software/gage-rr-metrics%3A-what-do-they-all-mean
<p>When you analyze a Gage R&R study in <a href="http://www.minitab.com/products/minitab/">statistical software</a>, your results can be overwhelming. There are a lot of statistics listed in Minitab's Session Window—what do they all mean, and are they telling you the same thing?</p>
<p>If you don't know where to start, it can be hard to figure out what the analysis is telling you, especially if your measurement system is giving you some numbers you'd think are good, and others that might not be. I'm going to focus on three different statistics that are often confused when <span><a href="http://blog.minitab.com/blog/meredith-griffith/fundamentals-of-gage-rr">reading Gage R&R output</a></span>. </p>
<p>The first thing to look at is the %Study Variation and the %Contribution.</p>
<p style="margin-left: 40px;"><img alt="gage r&R output" src="https://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f7e1af57-c25e-4ec3-a999-2166d525717e/Image/be2a9d9d311b9fad9b00eacdd73abff5/gage2.png" style="width: 618px; height: 404px;" /></p>
<p>You could look at either of them, as they are both telling you the same thing, just in a different way. By definition, the %Contribution for a source is 100 times the variance component for that source divided by the Total Variation variance component. This calculation has the benefit of making all of your sources of variability add up to 100%, which can make things easy to interpret.</p>
<p>The %Study Variation does not sum up to 100% like %Contribution, but it does have other benefits. %Contribution is based on the variance component that is specific to the values you observed in your study, not what the population of values might be. In contrast, the %Study Variation, by taking 6*standard deviation, extrapolates out over the entire population of values (based on the observed values, of course).</p>
<p>The bottom line is that both %Study Variation and %Contribution are telling you, in simple terms, about the percentage of variation in your process attributable to that particular source.</p>
<p>What about %Tolerance? What does <em>that </em>allow us to look at? While %StudyVar and %Contribution compare the variation from a particular source to the total variation, the %Tolerance compares the amount of variation from a source to a specified tolerance spread. This can lead to seemingly conflicting results, such as getting a low %StudyVar while having a high %Tolerance. In this case, your gage system may be introducing low levels of variability compared to other sources, but the amount of variation is still too much based on your spec limits. The %Tolerance column may be more important to you in this case, as it's more specific to your actual product and its spec limits. </p>
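<p>To make the three calculations concrete, here is a minimal Python sketch based on the definitions above. The function name <em>gage_metrics</em> and all of the numbers are hypothetical, and the sketch assumes the tolerance is the width between the upper and lower spec limits and that study variation is 6 times the standard deviation. It is an illustration, not a reproduction of Minitab's output.</p>

```python
import math


def gage_metrics(var_source: float, var_total: float, tolerance: float) -> dict:
    """Compute the three Gage R&R percentages for one source of variation.

    var_source: variance component for the source (e.g. total gage R&R)
    var_total:  variance component for total variation
    tolerance:  spec width, assumed here to be upper limit minus lower limit
    """
    sd_source = math.sqrt(var_source)
    sd_total = math.sqrt(var_total)
    return {
        # Ratio of variances, so all sources add up to 100%.
        "pct_contribution": 100 * var_source / var_total,
        # Ratio of study variations (6 * standard deviation).
        "pct_study_var": 100 * (6 * sd_source) / (6 * sd_total),
        # Study variation of the source compared to the spec width.
        "pct_tolerance": 100 * (6 * sd_source) / tolerance,
    }


# Hypothetical study: small gage variance, large part-to-part variance, tight specs.
gage = gage_metrics(var_source=0.04, var_total=4.0, tolerance=2.0)
for name, value in gage.items():
    print(f"{name}: {value:.1f}%")
```

<p>With these made-up values, the gage accounts for about 1% of the variance and about 10% of the study variation, yet it uses up about 60% of the tolerance. That is the "low %StudyVar, high %Tolerance" situation described above.</p>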
<p>So, a short summary:</p>
<p><strong>%Contribution: </strong>The percentage of variation due to the source compared to the total variation, but with the added benefit that all sources will sum to 100%.</p>
<p><strong>%StudyVar:</strong> The <span style="line-height: 20.8px;">percentage of variation due to the source compared to the total variation, but with the added benefit of extrapolating beyond your specific data values. </span></p>
<p><strong>%Tolerance:</strong> The percentage of variation due to the source compared to your specified tolerance range.</p>
<p>The %StudyVar is perhaps more reliant on having a good quality study and can be used when your goal is improving the measurement system. On the other hand, %Tolerance can be used when the focus is on the measurement system being able to do its job and classify parts as in or out of spec.</p>
<p>Each of these statistics provides valuable information, and how you weigh each of them largely depends on what you're looking to get out of your study.</p>
Lean Six Sigma, Project Tools, Quality Improvement
Mon, 21 Mar 2016 12:00:00 +0000
http://blog.minitab.com/blog/starting-out-with-statistical-software/gage-rr-metrics%3A-what-do-they-all-mean
Eric Heckman
What a Trip to the Dentist Taught Us about Automation
http://blog.minitab.com/blog/meredith-griffith/what-a-trip-to-the-dentist-taught-us-about-automation
<p>After my husband’s most recent visit to the dentist, he returned home cavity-free...and with a $150 electric toothbrush in hand. </p>
<p><span style="line-height: 1.6;">I wanted details.</span></p>
<p>It began innocently. His dreaded trip to the dentist ended in high praise for no cavities and only a warning to floss more. That prompted my programming-and-automation-obsessed husband, still in the chair, to exclaim, "I wish there was a way to automate this whole process—the brushing and the flossing."</p>
<p><img alt="Teeth" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/935fceb17af86287ddad7b098f1ab0cf/teeth.png" style="margin: 10px 15px; float: right; width: 296px; height: 273px;" />Next thing you know, he’s swiping the credit card (to "earn miles for our next flight," he says) and walking out with a nice Philips Sonicare DiamondClean Sonic Electric Toothbrush.</p>
<p>From this anecdote, you’d think I was sitting beside him as his teeth-cleaning proceeded. I merely received the story secondhand from our dental hygienist the very next day when I went in for my own visit. But I digress.</p>
<p>When my husband exclaimed his desire to automate a process that very few humans enjoy doing, our dentist was pleased to tell him that this toothbrush comes close. Granted, the toothbrush can’t <em>completely </em>automate these tasks: it still requires the user to be present. However, our dentist offered the following points to consider:</p>
<ol>
<li>The toothbrush does most of the brushing for you (with the exception of moving your hand so that you brush all your teeth).</li>
<li>The bristles automatically move, reaching crevices between teeth that no manual tooth-brushing ever could.</li>
<li>Because of point #2, plaque buildup will decrease and gum health will improve.</li>
<li>Because of point #3, flossing won’t be a strict daily requirement.</li>
</ol>
<p>Sold.</p>
<p>The dentist's points give us a nice framework for thinking about automation. An automated solution might not be perfect. But an automated solution should:</p>
<p style="margin-left: 40px;">a. make a task easier and more efficient (brushing hard-to-reach places more effectively)<br />
b. <span style="line-height: 1.6;">require less of your time (reduces the need to floss daily), and <br />
c. </span><span style="line-height: 1.6;">save you money (better tooth and gum health and fewer fillings equate to cost savings). </span></p>
<p><span style="line-height: 1.6;">Who wouldn’t buy into that idea?</span></p>
<p>Automated solutions can turn feelings of boredom over performing tedious tasks into feelings of excitement. Why? Because automation removes the need to perform repetitive tasks that we know how to do but might not particularly enjoy, helps us see results faster, and incites us to implement change sooner. This can translate into business efficiency and increased profit.</p>
<p>The mere <em>idea</em> of automating the task of brushing teeth and the results he might experience incited my husband to think about tooth-brushing differently, and prompted the decision to purchase this custom solution (the electric toothbrush) before even implementing it in his daily habits; imagine the changes and process improvements that might occur once the automated solution is in place. Perhaps a report of no cavities for several visits <em>in a row</em> and an extra lump of cash for him to spend on me!</p>
<p>Just as Philips (and other manufacturers) developed an electric toothbrush as a custom solution to automate difficult or tedious aspects of brushing teeth, Minitab has created custom statistical solutions and has automated processes for numerous customers in various industries, including manufacturing, pharmaceutical, medical devices, and healthcare.</p>
<p>Did you know that Minitab is not merely an out-of-the-box statistical software package? Behind the software interface is a powerful statistical and graphical engine that can integrate with a customer’s workflow and provide a unique solution tailored to that customer’s industry-specific problem. Minitab’s engine can communicate with a customer’s databases, applications, and other programs such as Excel, in order to automatically perform analyses and provide output relevant to the customer’s needs.</p>
<p>One interesting example that comes to mind is a project our custom development consultants tackled for a pharmaceutical company. This company was responding to an FDA warning letter and needed to assess the quality of hundreds of active ingredients in a particular drug. They needed to analyze data collected for each ingredient using Minitab’s <a href="http://blog.minitab.com/blog/starting-out-with-statistical-software/starting-out-with-capability-analysis">capability analysis tool</a>, and create a report detailing the result of the analysis in order to show the FDA that their drug was stable and safe for consumption—but they needed to perform the same analysis and create the same report hundreds of times over.</p>
<p>Our custom development consultants used Minitab’s engine to access the customer’s data in Excel, automatically perform capability analysis on each active ingredient in the drug, and create custom reports detailing the quality level of each ingredient and a few additional pieces of output that the FDA wanted to see. Automating this work saved a tremendous amount of time, energy, and money, and ultimately helped the pharmaceutical company respond to the FDA warning letter in a timely manner.</p>
<p>Of course, Minitab’s <a href="https://www.minitab.com/en-us/services/custom-development/">custom solutions</a> can take on many forms, including custom reports as mentioned in the pharmaceutical example above, real-time dashboard solutions, and alert systems (I’ll save details on that one for the second installment of this blog series, where we’ll hear about more of my husband’s shenanigans pertaining to online bill payments).</p>
<p><span style="line-height: 1.6;">We live in a world of innovation and creativity; automated solutions touch on both ideas. If we can automate aspects of brushing our teeth, then surely we can automate a business process or task to help you become more efficient, save time, reduce costs, and see results sooner. If you’d like to learn how Minitab can help you, contact us at </span><a href="mailto:customdev@minitab.com" style="line-height: 1.6;">customdev@minitab.com</a><span style="line-height: 1.6;">.</span></p>
<p>My hope is that after reading this blog post, you see the relevance and value of automation—whether brushing your teeth, performing the same statistical analyses, or creating custom reports. And the power of automation extends far beyond these simple examples! So if I’ve piqued your interest, stay tuned for Part 2 of this series to hear more lessons learned by my husband in his automation endeavors!</p>
Automation
Wed, 02 Mar 2016 13:00:00 +0000
http://blog.minitab.com/blog/meredith-griffith/what-a-trip-to-the-dentist-taught-us-about-automation
Meredith Griffith
Five Reasons Why Your R-squared Can Be Too High
http://blog.minitab.com/blog/adventures-in-statistics/five-reasons-why-your-r-squared-can-be-too-high
<p>I’ve written about R-squared before and I’ve concluded that it’s not as intuitive as it seems at first glance. It can be a misleading statistic because <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit" target="_blank">a high R-squared is not always good and a low R-squared is not always bad</a>. I’ve even said that <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-high-should-r-squared-be-in-regression-analysis" target="_blank">R-squared is overrated</a> and that <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-to-interpret-s-the-standard-error-of-the-regression" target="_blank">the standard error of the estimate (S)</a> can be more useful.</p>
<p>Even though I haven’t always been enthusiastic about R-squared, that’s not to say it isn’t useful at all. For instance, if you perform a study and notice that similar studies generally obtain a notably higher or lower R-squared, you should investigate why yours is different because there might be a problem.</p>
<p>In this blog post, I look at five reasons why your R-squared can be too high. This isn’t a comprehensive list, but it covers some of the more common reasons.</p>
Is A High R-squared Value a Problem?
<p><img alt="Very high R-squared" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/072d9c1d584683676849dd76d9802993/highr_sq.png" style="float: right; width: 216px; height: 80px;" />A very high R-squared value is not necessarily a problem. Some processes can have R-squared values that are in the high 90s. These are often physical processes where you can obtain precise measurements and there's low process noise.</p>
<p>You'll have to use your subject area knowledge to determine whether a high R-squared is problematic. Are you modeling something that is inherently predictable? Or, not so much? If you're measuring a physical process, an R-squared of 0.9 might not be surprising. However, if you're predicting human behavior, that's way too high!</p>
<p>Compare your study to similar studies to determine whether your R-squared is in the right ballpark. If your R-squared is too high, consider the following possibilities. To determine whether any apply to your model specifically, you'll have to use your subject area knowledge, information about how you fit the model, and data-specific details.</p>
Reason 1: R-squared is a biased estimate
<p><img alt="bathroom scale" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/410bbc59cf4450a2d8bbc20373c683a4/overweight_w1024.jpeg" style="float: right; width: 200px; height: 300px;" />The R-squared in your regression output is a biased estimate based on your sample—it tends to be too high. This bias is a reason why some practitioners don’t use R-squared at all but use adjusted R-squared instead.</p>
<p>R-squared is like a broken bathroom scale that tends to read too high. No one wants that! Researchers have long recognized that regression’s optimization process takes advantage of chance correlations in the sample data and inflates the R-squared.</p>
<p>Adjusted R-squared does what you’d do with that broken bathroom scale. If you knew the scale was consistently too high, you’d reduce it by an appropriate amount to produce a weight that is correct on average.</p>
<p>Adjusted R-squared does this by comparing the sample size to the number of terms in your regression model. Regression models that have many samples per term produce a better R-squared estimate and require less shrinkage. Conversely, models that have few samples per term require more shrinkage to correct the bias.</p>
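<p>Here is a minimal Python sketch of that shrinkage, using the standard textbook formula for adjusted R-squared. The function name and the example numbers are hypothetical, and the formula assumes the count of terms excludes the intercept.</p>

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_terms: int) -> float:
    """Shrink R-squared based on sample size and number of model terms.

    n_terms counts the predictor terms and excludes the intercept.
    """
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_terms - 1)


# Same R-squared of 0.90, very different amounts of data per term.
many_samples = adjusted_r_squared(0.90, n_obs=100, n_terms=3)
few_samples = adjusted_r_squared(0.90, n_obs=10, n_terms=6)

print(f"100 observations, 3 terms: {many_samples:.3f}")
print(f" 10 observations, 6 terms: {few_samples:.3f}")
```

<p>The model with plenty of observations per term barely changes, while the model with few observations per term is shrunk much more.</p>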
<p>For more information, read my posts about <a href="http://blog.minitab.com/blog/adventures-in-statistics/multiple-regession-analysis-use-adjusted-r-squared-and-predicted-r-squared-to-include-the-correct-number-of-variables" target="_blank">Adjusted R-squared</a> and <a href="http://blog.minitab.com/blog/adventures-in-statistics/r-squared-shrinkage-and-power-and-sample-size-guidelines-for-regression-analysis" target="_blank">R-squared shrinkage</a>.</p>
Reason 2: You might be overfitting your model
<p>An overfit model is one that is too complicated for your data set. You’ve included too many terms in your model compared to the number of observations. When this happens, the regression model becomes tailored to fit the quirks and random noise in your specific sample rather than reflecting the overall population. If you drew another sample, it would have its own quirks, and your original overfit model would not likely fit the new data.</p>
<p>Adjusted R-squared doesn't always catch this, but <a href="http://blog.minitab.com/blog/adventures-in-statistics/multiple-regession-analysis-use-adjusted-r-squared-and-predicted-r-squared-to-include-the-correct-number-of-variables" target="_blank">predicted R-squared</a> often does. Read my post about <a href="http://blog.minitab.com/blog/adventures-in-statistics/the-danger-of-overfitting-regression-models" target="_blank">the dangers of overfitting your model</a>.</p>
Reason 3: Data mining and chance correlations
<p>If you fit many models, you will find variables that appear to be significant but they are correlated only by chance. While your final model might not be too complex for the number of observations (Reason 2), problems occur when you fit many different models to arrive at the final model. Data mining can produce <a href="http://blog.minitab.com/blog/adventures-in-statistics/four-tips-on-how-to-perform-a-regression-analysis-that-avoids-common-problems" target="_blank">high R-squared values even with entirely random data</a>!</p>
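<p>You can watch this happen with a small simulation. The Python sketch below is only an illustration under made-up settings (15 observations, 200 candidate predictors, a fixed random seed): every variable is pure noise, yet searching through the candidates for the best one still turns up a respectable-looking R-squared.</p>

```python
import random


def r_squared(x, y):
    """R-squared of a simple linear regression of y on x (squared correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


rng = random.Random(1)  # fixed seed so the run is repeatable
n_obs, n_candidates = 15, 200

# The response and every candidate predictor are independent random noise.
response = [rng.gauss(0, 1) for _ in range(n_obs)]
candidates = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_candidates)]

r2_values = [r_squared(x, response) for x in candidates]
average_r2 = sum(r2_values) / len(r2_values)
best_r2 = max(r2_values)

print(f"Average R-squared across candidates: {average_r2:.3f}")
print(f"Best R-squared found by searching:   {best_r2:.3f}")
```

<p>The "best" predictor has no real relationship with the response. It simply won a contest among 200 random entries.</p>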
<p>Before performing regression analysis, you should already have an idea of what the important variables are along with their relationships, coefficient signs, and effect magnitudes based on previous research. Unfortunately, recent trends have moved away from this approach thanks to large, readily available databases and automated procedures that build regression models.</p>
<p>For more information, read my post about using <a href="http://blog.minitab.com/blog/adventures-in-statistics/beware-of-phantom-degrees-of-freedom-that-haunt-your-regression-models" target="_blank">too many phantom degrees of freedom</a>.</p>
Reason 4: Trends in Panel (Time Series) Data
<p>If you have time series data and your response variable and a predictor variable both have significant trends over time, this can produce very high R-squared values. You might try a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/time-series/basics/time-series-analyses-in-minitab/" target="_blank">time series analysis</a>, or including time related variables in your regression model, such as <a href="http://support.minitab.com/en-us/minitab/17/topic-library/minitab-environment/calculator-and-matrices/column-calculator-functions/lag-function/" target="_blank">lagged</a> and/or <a href="http://support.minitab.com/en-us/minitab/17/topic-library/minitab-environment/calculator-and-matrices/column-calculator-functions/differences-function/" target="_blank">differenced</a> variables. Conveniently, these analyses and functions are all available in <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">Minitab statistical software</a>.</p>
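<p>The effect of shared trends, and of differencing as one remedy, can be sketched in a few lines of Python. The two series below are hypothetical: each is a straight-line trend plus its own independent noise, so they are related only through time. The <em>difference</em> helper is a simple lag-1 difference written for this example, not Minitab's function.</p>

```python
import random


def r_squared(x, y):
    """R-squared of a simple linear regression of y on x (squared correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def difference(series):
    """Lag-1 differences: each value minus the one before it."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


rng = random.Random(7)  # fixed seed so the run is repeatable
n_periods = 50

# Two series that both trend upward but have unrelated noise.
y = [1.0 * t + rng.gauss(0, 1) for t in range(n_periods)]
x = [2.0 * t + rng.gauss(0, 1) for t in range(n_periods)]

r2_levels = r_squared(x, y)
r2_diffs = r_squared(difference(x), difference(y))

print(f"R-squared using the raw series:        {r2_levels:.3f}")
print(f"R-squared using differenced variables: {r2_diffs:.3f}")
```

<p>The raw series produce a very high R-squared purely because both climb over time. Once the trend is differenced away, what remains is the unrelated noise, and the R-squared drops accordingly.</p>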
Reason 5: Form of a Variable
<p>It's possible that you're including different forms of the same variable for both the response variable and a predictor variable. For example, if the response variable is temperature in Celsius and you include a predictor variable of temperature in some other scale, you'd get an R-squared of nearly 100%! That's an obvious example, but you can have the same thing happening more subtly.</p>
<p>For more information about regression models, read my post about <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-choose-the-best-regression-model">How to Choose the Best Regression Model</a>.</p>
Data Analysis, Regression Analysis, Statistics, Statistics Help
Wed, 24 Feb 2016 13:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/five-reasons-why-your-r-squared-can-be-too-high
Jim Frost
Mind the Gap
http://blog.minitab.com/blog/data-analysis-and-quality-improvement-and-stuff/mind-the-gap
<p><span style="line-height: 1.6;"><img alt="Mind the gap" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/686c659773c05351ff2e3d7608e17984/subwayip3.jpg" style="float: right; width: 336px; height: 168px; margin: 10px 15px;" />Mind the gap. It's an important concept to bear in mind whilst traveling on the Tube in London, the T in Boston, the Metro in Washington, D.C., etc. But how many of us remember to mind the gap when we create an interval plot in Minitab Statistical Software? Not too many of us, I'd wager. And it's a shame, too.</span></p>
<p>When you travel on the subway, minding the gap means giving thoughtful consideration to the space between the platform and the train. On the subway, minding the gap can make the difference between these two very different views of the subway station:</p>
<p><img alt="Bad view of subway" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/d66944dc44479e5de0cbefe80ec188cd/subwaybadview.jpg" style="line-height: 1.6; width: 146px; height: 220px;" /><span style="line-height: 1.6;"> </span><img alt="Nice view of subway" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/e0d2c8a78ea25db3522d762b37f3c9e2/subwayniceview_medium.jpg" style="line-height: 1.6; width: 331px; height: 220px;" /></p>
<p>When you make an interval plot in Minitab, minding the gap means giving thoughtful consideration to the space between groups on the x-axis. For interval plots, minding the gap can make the difference between these two very different views of your data:</p>
<p><img alt="Plain view of data" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/11d00d4165de5508c8b7cb9ae9fadb33/plainviewofdata.jpg" style="width: 260px; height: 174px;" /> <img alt="Awesome view of data" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/0f7fe05a6b132c3477aa7cc2ce526cbb/awesomeviewofdata.jpg" style="width: 260px; height: 174px;" /></p>
<p>Allow me to demonstrate with an example. If you like, you can download the data file, <a href="http://support.minitab.com/en-us/datasets/anova-data-sets/moisture-content/">PercentMoisture.MTW</a> from our data set library and follow along. (You can get the free 30-day trial of Minitab <a href="http://it.minitab.com/products/minitab/free-trial.aspx">here</a> if you don't already have the software.) Technicians at a food company collected these data to try to figure out the best combination of time and temperature to bake cereal grains to minimize their moisture content. </p>
<p>Interval plots are useful because they summarize your data and allow you to simultaneously compare the means (represented by the points or symbols) and the variability (represented by the interval bars) for each sample or group. (To see more interval plots in action, check out these other blog posts: <a href="http://blog.minitab.com/blog/understanding-statistics/seven-alternatives-to-pie-charts">Seven Alternatives to Pie Charts</a> and <a href="http://blog.minitab.com/blog/fun-with-statistics/when-even-cupid-isnt-accurate-enough-interval-plots-and-olympic-finals">When Even Cupid Isn't Accurate Enough</a>.) </p>
<p>Creating a basic interval plot in Minitab is simple. Just select <strong>Graph > Interval Plot</strong>. Then choose the <strong>One Y, With Groups</strong> option, enter the data as follows, and click <strong>OK</strong>. (For the sake of space in this article, I renamed the columns "Time" and "Temp".)</p>
<p><img alt="Creating the interval plot" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/403a5bb5dd1a85c5cdb6317fa50b043f/initialdb.jpg" style="width: 360px; height: 168px;" /> </p>
<p><img alt="Basic interval plot" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/00ef81c34648e57be92f9151a7938db6/initialip.jpg" style="width: 360px; height: 240px;" /></p>
<p>The nice thing about interval plots is that multiple levels of multiple factors can be represented by different positions on the x-axis. But the unfortunate thing about interval plots is that multiple levels of multiple factors are represented by different positions on the x-axis.</p>
<p>All the information is there, but it's hard to see how one group relates to the next. For example, to compare the results for the 130-degree oven temperature across the different oven times, you need to compare the 2nd interval bar to the 5th <span style="line-height: 20.8px;">interval bar </span>and the 8th <span style="line-height: 20.8px;">interval bar</span>. You end up going from one similar-looking bar to another and another, and that seldom ends well. </p>
<p>To make the different oven temperatures stand out more, you can add a little color. Just double-click one of the symbols to open the <em>Edit Mean Symbols</em> dialog box. Click the <em>Groups </em>tab, enter the temperature variable, and click <strong>OK</strong>. </p>
<p><img alt="Grouping the symbols" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/047d352b6f366120350b9883c9f4a118/groupsdb.jpg" style="width: 360px; height: 147px;" /> </p>
<p><img alt="Interval plot with grouped symbols" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/b4560e6f14caa1ef4c3a9f2efc38d197/ipwithgroups.jpg" style="width: 360px; height: 240px;" /></p>
<p>To help make the grouping even clearer, you can connect the dots. Right-click the graph and choose <strong>Add > Data Display</strong>, then select <strong>Mean connect line</strong> and click <strong>OK</strong>.</p>
<p><img alt="Adding mean connect lines" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/01ef5650cc8067423f804beff3d45992/dbconnectline.jpg" style="width: 360px; height: 207px;" /> </p>
<p><img alt="Interval plot with mean connect lines" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/a4db0ce40482f00310be78c21044907a/ipwithconnectlines.jpg" style="width: 360px; height: 240px;" /></p>
<p>Now it's much easier to identify and compare the results for the different oven temperatures. But here is where we really start to mind that gap. By which I mean that we start to give thoughtful consideration to the space between the <span style="line-height: 20.8px;">oven-</span>time groups on the x-axis. And by which I also mean that we mind these gaps because they are annoying and we want them to go away. But we need not worry, because that's one gap we can shrink easily.</p>
<p>Double-click the x-axis to open the <em>Edit Scale</em> dialog box. Notice the <strong>Gap within clusters</strong> setting. A setting of –1 means that the intervals for all levels of <span style="line-height: 20.8px;">oven </span>temperature at each level of <span style="line-height: 20.8px;">oven </span>time will be at the same location on the x-axis. Change the setting to –1 and the gap is closed. </p>
<p>And while we're at it, let's make the tick labels for temperature go away as well because they are redundant with the legend, and because the legend conveys the same information. And because if we don't, those labels would appear on top of each other, which looks pretty weird. </p>
<p><img alt="Removing the gap" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/a8bf53a30ab3e6f202cb1584578f5185/dbremovinggap.jpg" style="width: 275px; height: 263px;" /> <img alt="Removing labels" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/dc04c4f2dc9da480fec6e5f7d9ff5383/dbremovelabels.jpg" style="line-height: 1.6; width: 300px; height: 156px;" /><span style="line-height: 1.6;"> </span></p>
<p><img alt="Interval plot with no more gap!" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/c1b5587fbb5c69394f085cd590be70a7/ipsansgap.jpg" style="width: 360px; height: 240px;" /></p>
<p>Awesome! The plot looks much better without the big gaps. Although, perhaps a little gap would make it easier to see the individual intervals more clearly. If we change that gap to –0.85, then everything is groovy.</p>
<p><img alt="Interval plot with tasteful gap" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/06e3d9586005800f4eef141a3c6b4503/ipwithgroovygap.jpg" style="width: 360px; height: 240px;" /></p>
<p>Now that's a gap I don't mind at all! Now it's really easy to compare the results for different oven temperatures within and across the different oven times. The interval plot suggests that to minimize moisture content, we want to use the 90-minute oven time, but we don't want to use the 125-degree oven temperature. </p>
<p>As you can see, the interval plot is an easy and fast way to get a good idea of which differences could be important. But remember, the interval plot can’t tell us which effects or which differences are statistically significant or not. For that, we need to conduct an <a href="http://support.minitab.com/minitab/17/topic-library/modeling-statistics/anova/basics/what-is-anova/">analysis of variance (ANOVA)</a>.</p>
<p>Spoiler alert: I already ran an ANOVA on these data and it confirms what we gleaned from the interval plot. The main effects for both time and temperature are significant. (The interaction effect is not quite significant at the 0.05-level.) Tukey comparisons show that 90 minutes in the oven reduces moisture significantly better than either 30 minutes or 60 minutes in the oven. Tukey comparisons also show that a 125-degree oven is significantly worse at reducing moisture than either a 130-degree oven or a 135-degree oven. The effects of the 135-degree oven are not significantly different from the 130-degree oven, so we can probably save some energy and just use 130 degrees to desiccate our wild oats. </p>
<p><em style="box-sizing: border-box; font-family: 'Segoe UI', Frutiger, 'Frutiger Linotype', 'Dejavu Sans', 'Helvetica Neue', Tahoma, Arial, sans-serif; line-height: 15px; color: rgb(77, 79, 81); font-size: 10px;">Credit for the <a href="https://www.flickr.com/photos/thomasclaveirole/1414940422/">subway tunnel photo</a> goes to Thomas Claveirole.</em><em style="box-sizing: border-box; font-family: 'Segoe UI', Frutiger, 'Frutiger Linotype', 'Dejavu Sans', 'Helvetica Neue', Tahoma, Arial, sans-serif; line-height: 15px; color: rgb(77, 79, 81); font-size: 10px;"> </em><em style="box-sizing: border-box; font-family: 'Segoe UI', Frutiger, 'Frutiger Linotype', 'Dejavu Sans', 'Helvetica Neue', Tahoma, Arial, sans-serif; line-height: 15px; color: rgb(77, 79, 81); font-size: 10px;"> Credit for the <a href="https://www.flickr.com/photos/36217981@N02/14008362659/">subway station photo</a> goes to Tim Adams<span style="box-sizing: border-box;">. Both are available under Creative Commons License 2.0. </span></em></p>
Data Analysis, Statistics
Tue, 23 Feb 2016 13:00:00 +0000
http://blog.minitab.com/blog/data-analysis-and-quality-improvement-and-stuff/mind-the-gap
Greg Fox
How to Calculate BX Life, Part 2
http://blog.minitab.com/blog/meredith-griffith/how-to-calculate-bx-life-part-2
<p><span style="line-height: 1.6;">When I wrote <a href="http://blog.minitab.com/blog/meredith-griffith/how-to-calculate-b10-life-with-statistical-software">How to Calculate B10 Life with Statistical Software</a></span><span style="line-height: 1.6;">, I promised a follow-up blog post that would describe how to compute any “BX” lifetime. In this post I’ll follow through on that promise, and in a third blog post in this series, I will explain why BX life is one of the best measures you can use in your reliability analysis.</span></p>
<p>As a refresher, B10 life refers to the time at which 10% of the population has failed. To put it another way, it is the time at which the population's reliability is 90%. Let’s revisit our pacemaker battery example from part 1 of this blog series. Here's <a href="//cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/126181c5eca45c380dfed332ee3c3c7d/pacemakerbatterylife.MTW">the data</a>.</p>
<p><img alt="Data" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/dae6c7b7-fc22-4616-9d65-f04909c20ab1/Image/87f5a262f6fa042461f047c74c26a72a/table1.jpg" style="width: 177px; height: 243px;" /></p>
<p>Recall that we found the B10 life of pacemaker batteries to be 6.36 years. Another way to interpret this value is to say that 6.36 years is the time by which 10% of the population of pacemaker batteries will have failed. This information is useful in establishing a realistic warranty period for a product so that customers are covered through a product’s 90% reliability period, and so the manufacturer won’t have to incur extra cost by replacing an excess of the product during the warranty period.</p>
<p>But perhaps a particular product has additional reliability requirements a manufacturer wishes to monitor, such as B15 life. Or perhaps we would like to know when half of the population will fail—its B50 life. Both B10 and B50 life are industry standards for measuring the life expectancy of an automotive engine, for instance. This is where BX life calculations become even more useful—and Minitab makes it incredibly easy to compute and interpret those values. (If you don't already have Minitab and you'd like to follow along, <a href="http://www.minitab.com/products/minitab/free-trial/">download the free trial</a>.)</p>
Calculating BX Life
<p>Navigate to Minitab’s <strong>Stat > Reliability/Survival > Distribution Analysis (Right Censoring) > Parametric Distribution Analysis</strong> menu and set up the main dialog and the 'Censor' subdialog the same way we did in <a href="http://blog.minitab.com/blog/meredith-griffith/how-to-calculate-b10-life-with-statistical-software">Part 1</a>:</p>
<p><img alt="Parametric Distribution Analysis - Main Dialog" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/dae6c7b7-fc22-4616-9d65-f04909c20ab1/Image/d0669f86f85b236ba2a3adcef520a994/dialog1.jpg" style="width: 507px; height: 345px;" /></p>
<p>Press the "Censor" button and fill out the subdialog as follows: </p>
<p><img alt="Censor Subdialog" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/dae6c7b7-fc22-4616-9d65-f04909c20ab1/Image/d319e5438d544d051aa997adeb14f271/dialog3.jpg" style="width: 426px; height: 313px;" /></p>
<p>When you press OK, Minitab analyzes the distribution of your data and by default displays a Table of Percentiles in the session window. We can take advantage of this table for measures such as B50 life, because it lists a variety of percentiles by default. The 50th percentile, the time by which 50% of the population has failed, is one of them.</p>
<p><img alt="Table of Percentiles for B50 Life" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/dae6c7b7-fc22-4616-9d65-f04909c20ab1/Image/1e6e0c6abf084578a9c2993e6c09f530/table2.jpg" style="width: 536px; height: 482px;" /></p>
<p>We see that 50% of the population of pacemaker batteries will fail by 9.735 years. But what if we want to compute B15 life? This percentile does not display by default in the Table of Percentiles.</p>
<p>Revisiting the Parametric Distribution Analysis dialog (pressing CTRL-E is a Minitab shortcut that will bring up your most recently completed dialog), we can click the ‘Estimate’ button to specify what “BX” life we want. In the section titled ‘Estimate percentiles for these additional percents,’ entering the number 15 will give us the B15 life for pacemaker batteries.</p>
<p><img alt="Estimate Subdialog" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/dae6c7b7-fc22-4616-9d65-f04909c20ab1/Image/c84a41ca235867a883d886d754e6fc5d/dialog2.jpg" style="width: 508px; height: 447px;" /></p>
<p>Click OK through the dialogs, and we see that a row of output for the 15th percentile is now included in the Table of Percentiles.</p>
<p><img alt="Table of Percentiles for B15 Life" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/dae6c7b7-fc22-4616-9d65-f04909c20ab1/Image/1ac702357a0ce4437048ec6aa470ba1f/table3.jpg" style="width: 313px; height: 47px;" /></p>
<p>It’s as simple as that!</p>
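<p>If you're curious what a Table of Percentiles row represents, here is a minimal Python sketch. It assumes the lifetimes follow a Weibull distribution, and the shape and scale values are hypothetical placeholders, not the parameters Minitab estimated from the pacemaker data. The function names <code>bx_life</code> and <code>reliability</code> are made up for this illustration.</p>

```python
import math


def bx_life(x_percent, shape, scale):
    """Time by which x_percent of units have failed, assuming a Weibull model.

    Inverts the Weibull CDF F(t) = 1 - exp(-(t / scale) ** shape).
    """
    p = x_percent / 100.0
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)


def reliability(t, shape, scale):
    """Fraction of units still working at time t under the same model."""
    return math.exp(-((t / scale) ** shape))


# Hypothetical parameters for illustration only (NOT fitted to the post's data).
SHAPE = 3.0
SCALE = 10.0

for x in (10, 15, 50):
    t = bx_life(x, SHAPE, SCALE)
    print(f"B{x} life: {t:.3f} years, reliability there: {reliability(t, SHAPE, SCALE):.2f}")
```

<p>Under these assumptions, any BX life is just the X-th percentile of the fitted distribution, which is why entering 15 in the Estimate subdialog is all it takes to get B15.</p>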
<p>If you’ve never used BX life as a reliability metric, and you’re wondering just how and why these can be some of the best measures of reliability, stay tuned for my final post in this series!</p>
Quality Improvement, Reliability Analysis, Six Sigma
Fri, 05 Feb 2016 13:00:00 +0000
http://blog.minitab.com/blog/meredith-griffith/how-to-calculate-bx-life-part-2
Meredith Griffith
How to Analyze Like a Citizen Data Scientist in Flint
http://blog.minitab.com/blog/statistics-and-quality-improvement/how-to-analyze-like-a-citizen-data-scientist-in-flint
<p><img alt="The Citizen's Bank Weather Ball in Flint, Michigan" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/f0a4660c3136750443aede3c2be41c52/6109589699_98d685d0d5_z.jpg" style="width: 200px; height: 133px; float: right; border-width: 1px; border-style: solid; margin: 10px 15px;" />If you follow the news in the United States then you’ve heard that there’s a water crisis in Flint, Michigan. Although there’s going to continue to be debate about how much ethics played a role in the data collection practices, it’s worthwhile to at least be ready to perform the correct analysis on the data when you have it. Here’s how you can use Minitab to be like a citizen data scientist in Flint, and see for yourself what the data indicate.</p>
<p>Let’s start with the Environmental Protection Agency’s (EPA) <a href="http://www.epa.gov/dwreginfo/lead-and-copper-rule">Lead and Copper Rule</a>. The EPA says that a water system needs to act when “lead concentrations exceed an action level of 15 ppb” in more than 10% of samples. The statistic that identifies the highest 10% of the samples is called the 90th percentile.</p>
<p><a href="http://www.ecfr.gov/cgi-bin/text-idx?SID=531617f923c3de2cbf5d12ae4663f56d&mc=true&node=sp40.23.141.i&rgn=div6#se40.23.141_186">The applicable Code of Federal Regulations</a> (CFR) does not prescribe a random sample to characterize the entire water system. Instead, the CFR suggests that those who administer the water system should select sampling sites based on the likelihood of contamination. In particular, those who administer the system should prefer sampling sites that meet these two criteria:</p>
<p style="margin-left:.5in;">(i) Contain copper pipes with lead solder installed after 1982 or contain lead pipes; and/or</p>
<p style="margin-left:.5in;">(ii) Are served by a lead service line.</p>
<p>Clearly, we are not dealing with a random sample—that's because the goal is not to characterize the entire system, but to better understand the worst contamination risks. In this context we're characterizing only the sites that we sample, which we suspect contain the highest lead results in the system. The CFR suggests taking samples from at least 60 sites for a system the size of Flint’s.</p>
<p>The <a href="http://flintwaterstudy.org/2015/12/complete-dataset-lead-results-in-tap-water-for-271-flint-samples/" target="_blank">data we’ll work with</a> were collected through an effort organized by <a href="http://flintwaterstudy.org/about-page/about-us/" target="_blank">an independent research team at Virginia Tech</a>. The data contain 271 samples from 269 different locations, which exceeds the minimum recommended sample size. Because we’re looking for the 90th percentile, what we do isn’t very different from counting down 271/10 ≈ 27 data points from the maximum. The CFR references the use of “first draw” tap samples, so we’ll pay attention to that column in the Virginia Tech data.</p>
A Quick Calculation of the 90th Percentile
<p>Once the data’s in <a href="http://www.minitab.com/products/minitab">Minitab Statistical Software</a>, the fastest way to calculate the 90th percentile is with Minitab’s calculator. Try this:</p>
<ol>
<li>Choose <strong>Calc > Calculator</strong>.</li>
<li>In <strong>Store result in variable</strong>, enter <em>90th percentile</em>.</li>
<li>In <strong>Expression</strong>, enter <em>percentile('PB Bottle 1 (ppb) – First Draw', 0.9)</em>. Click <strong>OK.</strong></li>
</ol>
<p>Minitab stores the value 26.944. Because this value is greater than 15, you are now ready to make <a href="http://flintwaterstudy.org/information-for-flint-residents/results-for-citizen-testing-for-lead-300-kits/" target="_blank">strongly worded statements urging people to take measures to protect themselves from lead exposure</a>.</p>
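<p>For readers without Minitab, the same check can be sketched in Python. The readings below are invented for illustration and are not the Flint data. The <code>percentile</code> helper interpolates at position p(n + 1) in the sorted data, which is one common convention; treat it as an assumption rather than a statement of exactly how Minitab's calculator works.</p>

```python
def percentile(values, p):
    """Percentile by linear interpolation at the 1-based position p * (n + 1)."""
    xs = sorted(values)
    n = len(xs)
    position = p * (n + 1)
    k = int(position)          # rank of the observation at or below the position
    frac = position - k        # how far to move toward the next observation
    if k < 1:
        return xs[0]
    if k >= n:
        return xs[-1]
    return xs[k - 1] + frac * (xs[k] - xs[k - 1])


# Hypothetical first-draw lead readings in ppb (NOT the Virginia Tech data).
lead_ppb = [1.2, 0.8, 3.5, 28.1, 2.2, 5.9, 16.4, 0.5, 7.3, 42.0,
            2.9, 1.1, 9.8, 4.4, 0.9, 12.6, 3.1, 6.7, 1.8]

ACTION_LEVEL_PPB = 15  # EPA action level quoted above

p90 = percentile(lead_ppb, 0.9)
share_above = sum(v > ACTION_LEVEL_PPB for v in lead_ppb) / len(lead_ppb)
exceeds_action_level = p90 > ACTION_LEVEL_PPB

print(f"90th percentile: {p90:.1f} ppb")
print(f"Share of samples above {ACTION_LEVEL_PPB} ppb: {share_above:.0%}")
print("Action level exceeded" if exceeds_action_level else "Within action level")
```

<p>The two printed figures are two views of the same rule: when more than 10% of samples are above 15 ppb, the 90th percentile lands above 15 ppb as well.</p>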
Communicating the 90th Percentile on a Graph
<p>But if you’re really going to communicate your results, it’s nice to have a graph available. A simple bar chart might do:</p>
<p><img alt="Bar chart of the actual 90th percentile and the action limit." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/44a161f0fc39a3b030b9895a11313c1f/bar_chart.png" style="border-width: 0px; border-style: solid; width: 576px; height: 384px;" /></p>
<p>However, you can show the data in more detail with a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/graphs/graphs-of-distributions/histograms/histogram/">histogram</a>.</p>
<ol>
<li>Choose <strong>Graph > Histogram</strong>.</li>
<li>Select <strong>Simple</strong>. Click <strong>OK</strong>.</li>
<li>In <strong>Graph variables</strong>, enter <em>‘PB Bottle 1 (ppb) – First Draw’</em>.</li>
<li>Click <strong>Scale</strong>.</li>
<li>Select the <strong>Reference Lines</strong> tab.</li>
<li>In <strong>Show reference lines at data values</strong>, enter <em>15 26.9</em>. Click <strong>OK</strong> twice.</li>
</ol>
<p><img alt="Histogram showing the 90th percentile exceeds the action limit of 15 parts per billion." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/a6d9b14bf5031621ac62f922b0d68466/histogram.png" style="border-width: 0px; border-style: solid; width: 576px; height: 384px;" /></p>
<p>Histograms divide the sample values into intervals called bins. The height of each bar represents the number of observations in that bin: the taller the bar, the more observations in that interval. The reference lines on the graph show the action limit of 15 ppb and the actual value of the 90th percentile. Because the 90th percentile falls to the right of the action limit, the graph shows that the action limit is exceeded.</p>
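<p>To make the idea of bins concrete, here is a small Python sketch that counts observations per bin and prints a text histogram. The readings are invented for illustration (they are not the Flint data), and the fixed 5 ppb bin width is an arbitrary choice that will not match the bins Minitab picks for its graph.</p>

```python
def bin_counts(values, width):
    """Count observations in each bin [start, start + width), keyed by start."""
    counts = {}
    for v in values:
        start = int(v // width) * width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical first-draw lead readings in ppb (NOT the Virginia Tech data).
lead_ppb = [1.2, 0.8, 3.5, 28.1, 2.2, 5.9, 16.4, 0.5, 7.3, 42.0,
            2.9, 1.1, 9.8, 4.4, 0.9, 12.6, 3.1, 6.7, 1.8]

BIN_WIDTH = 5
ACTION_LEVEL_PPB = 15

counts = bin_counts(lead_ppb, BIN_WIDTH)
for start, count in counts.items():
    # Flag the bin that contains the action level, like a reference line would.
    in_bin = start <= ACTION_LEVEL_PPB < start + BIN_WIDTH
    marker = "  <- action level" if in_bin else ""
    print(f"{start:>3} to {start + BIN_WIDTH:<3} {'#' * count}{marker}")
```

<p>Bins with no observations are simply skipped in this sketch, whereas a real histogram would show them as gaps on the axis.</p>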
Gather Your Data
<p>In April of 2015, then-mayor of Flint Dayne Walling reported that he and his family “drink and use the Flint water everyday, at home, work, and schools.” It’s easy for me to believe that the mayor’s personal experience with water that was not dangerous affected his judgment about the situation. The zip code for the mayor’s office in Flint is 48502. The news bureau for WNEM TV 5, <a href="http://www.wnem.com/story/29511581/flints-mayor-drinks-water-from-tap" target="_blank">one place where Mayor Walling drank tap water on TV</a>, is in the same zip code. The citizen data scientists who analyzed the Flint data knew that the geographically limited sample being shown on TV and Twitter wasn't good enough. Instead, they collected data from 269 different locations around Flint and found that lead was a serious problem.</p>
<p>Of course, collecting that data was no small task: the data scientists estimate that gathering, preparing, and analyzing water samples ended up costing about $180,000, not including volunteer labor. If you’d like to donate towards offsetting the costs and future efforts, check out the <a href="http://flintwaterstudy.org/2016/01/the-flintwaterstudy-research-support-fundraiser/" target="_blank">Flint Water Study Research Support Fundraiser</a>.</p>
<p>If you’d like to support residents in Flint, consider volunteering for or contributing to the <a href="http://www.unitedwaygenesee.org/civicrm/contribute/transact?reset=1&id=5" target="_blank">United Way of Genesee County’s Flint Water Fund</a> which “has sourced more than 11,000 filters systems and 5,000 replacement filters, ongoing sources of bottled water to the Food Bank of Eastern Michigan and also supports a dedicated driver for daily distribution.”</p>
<p>The attention brought to Flint <a href="http://www.theguardian.com/environment/2016/jan/22/water-lead-content-tests-us-authorities-distorting-flint-crisis" target="_blank">has called into question the water testing done in other municipalities in the United States</a>. If you’re concerned about the potential for lead in your own water, the EPA notes that <a href="http://www.epa.gov/lead/protect-your-family#testdw" target="_blank">lead testing kits are available in home improvement stores</a> that can be sent to laboratories for analysis.</p>
<p>The citation for the referenced data set is: FlintWaterStudy.org (2015)<strong> “Lead Results from Tap Water Sampling in Flint, MI during the Flint Water Crisis.”</strong> This link provides the data as a Minitab worksheet: <a href="https://app.compendium.com/api/post_attachments/3d9b8ce9-c0ce-45ed-a759-3da70816d238/view">lead_results_from_tap_water_sampling_in_flint__mi_during_the_flint_water_crisis.MTW</a></p>
<p> </p>
<p><em>The image of the Citizen's Bank Weather Ball is by the <a href="https://www.flickr.com/photos/michigancommunities/6109589699">Michigan Municipal League</a> and is licensed under <a href="https://creativecommons.org/licenses/by-nd/2.0/">this Creative Commons License</a></em>.</p>
Statistics in the News
Mon, 01 Feb 2016 13:00:00 +0000
http://blog.minitab.com/blog/statistics-and-quality-improvement/how-to-analyze-like-a-citizen-data-scientist-in-flint
Cody Steele
How to Compare Regression Slopes
http://blog.minitab.com/blog/adventures-in-statistics/how-to-compare-regression-lines-between-different-models
<p>If you perform linear regression analysis, you might need to compare different regression lines to see if their constants and slope coefficients are different. Imagine there is an established relationship between X and Y. Now, suppose you want to determine whether that relationship has changed. Perhaps there is a new context, process, or some other qualitative change, and you want to determine whether that affects the relationship between X and Y.</p>
<p>For example, you might want to assess whether the relationship between the height and weight of football players is significantly different from the same relationship in the general population.</p>
<p>You can graph the regression lines to visually compare the slope coefficients and constants. However, you should also statistically test the differences. <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-why-we-need-to-use-hypothesis-tests-in-statistics" target="_blank">Hypothesis testing</a> helps separate the true differences from the random differences caused by sampling error so you can have more confidence in your findings.</p>
<p>In this blog post, I’ll show you how to compare a relationship between different regression models and determine whether the differences are statistically significant. Fortunately, these tests are easy to do using <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">Minitab statistical software</a>.</p>
<p>In the example I’ll use throughout this post, there is an input variable and an output variable for a hypothetical process. We want to compare the relationship between these two variables under two different conditions. Here is the <a href="//cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/File/569a0e7d067944f6f9147434794efcd6/comparingregressionmodels.MPJ">Minitab project file</a> with the data.</p>
Comparing Constants in Regression Analysis
<p>When the <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-to-interpret-the-constant-y-intercept" target="_blank">constants</a> (or y intercepts) in two different regression equations are different, this indicates that the two regression lines are shifted up or down on the Y axis. In the scatterplot below, you can see that the Output from Condition B is consistently higher than Condition A for any given Input value. We want to determine whether this vertical shift is statistically significant.</p>
<p><img alt="Scatterplot with two regression lines that have different constants." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/2ed27f4204515bac9d9674c16fa0c0f7/scatter_constant_dift.png" style="width: 576px; height: 384px;" /></p>
<p>To test the difference between the constants, we just need to include a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/data-concepts/cat-quan-variable/" target="_blank">categorical variable</a> that identifies the qualitative attribute of interest in the model. For our example, I have created a variable for the condition (A or B) associated with each observation.</p>
<p>To fit the model in Minitab, I’ll use: <strong>Stat > Regression > Regression > Fit Regression Model</strong>. I’ll include <em>Output</em> as the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/regression-models/what-are-response-and-predictor-variables/" target="_blank">response variable</a>, <em>Input</em> as the continuous <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/regression-models/what-are-response-and-predictor-variables/" target="_blank">predictor</a>, and <em>Condition</em> as the categorical predictor.</p>
<p>In the regression analysis output, we’ll first check the coefficients table.</p>
<p style="margin-left: 40px;"><img alt="Coefficients table that shows that the constants are different" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/23657868f2cf893d216d05d3400ab9e6/coeff_constant_dift.png" style="width: 369px; height: 117px;" /></p>
<p>This table shows us that the relationship between Input and Output is statistically significant because the p-value for Input is 0.000.</p>
<p>The <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients" target="_blank">coefficient</a> for Condition is 10 and its <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients" target="_blank">p-value</a> is significant (0.000). The coefficient tells us that the vertical distance between the two regression lines in the scatterplot is 10 units of Output. The p-value tells us that this difference is statistically significant—you can reject the null hypothesis that the distance between the two constants is zero. You can also see the difference between the two constants in the regression equation table below.</p>
<p style="margin-left: 40px;"><img alt="Regression equation table that shows constants that are different" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/a879996e37ebb05a297721e695a71943/equ_constant_dift.png" style="width: 305px; height: 113px;" /></p>
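<p>Here is a rough Python sketch of what including the categorical variable does to the model. It assumes Condition is coded as a 0/1 indicator (A = 0, B = 1) and uses invented data rather than the project file from this post. The helpers <code>solve</code> and <code>fit_ols</code> are written for this illustration only, and the sketch estimates coefficients without computing p-values.</p>

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical data: both conditions share a slope of 2, B is shifted up by 10.
inputs = list(range(1, 11))
noise = [0.4, -0.3, 0.1, -0.5, 0.2, 0.3, -0.2, 0.5, -0.4, -0.1]

rows, output = [], []
for is_b in (0, 1):
    for x, e in zip(inputs, noise):
        rows.append([1.0, float(x), float(is_b)])  # constant, Input, Condition
        output.append(5.0 + 2.0 * x + 10.0 * is_b + e)

constant, slope, condition_shift = fit_ols(rows, output)
print(f"Constant for A: {constant:.3f}")
print(f"Shared slope for Input: {slope:.3f}")
print(f"Condition coefficient (vertical shift of B): {condition_shift:.3f}")
```

<p>With this coding, the Condition coefficient is the gap between the two constants, which is why testing that single coefficient against zero tests whether the lines are shifted.</p>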
Comparing Coefficients in Regression Analysis
<p>When two slope coefficients are different, a one-unit change in a predictor is associated with different mean changes in the response. In the scatterplot below, it appears that a one-unit increase in Input is associated with a greater increase in Output in Condition B than in Condition A. We can <em>see</em> that the slopes look different, but we want to be sure this difference is statistically significant.</p>
<p><img alt="Scatterplot that shows two slopes that are different" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/200c12087fdf7eecd9b773d9ce213020/scatter_slope_dift.png" style="width: 576px; height: 384px;" /></p>
<p>How do you statistically test the difference between regression coefficients? It sounds like it might be complicated, but it is actually very simple. We can even use the same Condition variable that we did for testing the constants.</p>
<p>We need to determine whether the coefficient for Input depends on the Condition. In statistics, when we say that the effect of one variable depends on another variable, that’s an interaction effect. All we need to do is include the interaction term for Input*Condition!</p>
<p>In Minitab, you can specify interaction terms by clicking the <strong>Model</strong> button in the main regression dialog box. After I fit the regression model with the interaction term, we obtain the following coefficients table:</p>
<p style="margin-left: 40px;"><img alt="Coefficients table that shows different slopes" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/f06eff56f2266d0ff7e3919aa1292285/coeff_slope_dift.png" style="width: 410px; height: 154px;" /></p>
<p>The table shows us that the interaction term (Input*Condition) is statistically significant (p = 0.000). Consequently, we reject the null hypothesis and conclude that the difference between the two coefficients for Input (below, 1.5359 and 2.0050) does not equal zero. We also see that the main effect of Condition is not significant (p = 0.093), which indicates that the difference between the two constants is not statistically significant.</p>
<p style="margin-left: 40px;"><img alt="Regression equation table that shows different slopes" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/d5e5142c0ff13645d1dacc3e2c0bee27/equ_coeff_dift.png" style="width: 295px; height: 105px;" /></p>
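<p>One way to see what the interaction coefficient measures: in a model with Input, a 0/1 Condition indicator, and their product, the coefficient estimates match what you would get by fitting a separate line to each condition. The Python sketch below shows this with invented data (not the project file from this post). It only reproduces the estimates; the p-values still require the combined model.</p>

```python
def simple_fit(xs, ys):
    """Return (constant, slope) of the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data: same constant, slope 1.5 in Condition A and 2.0 in B.
inputs = list(range(1, 11))
noise = [0.4, -0.3, 0.1, -0.5, 0.2, 0.3, -0.2, 0.5, -0.4, -0.1]
output_a = [5.0 + 1.5 * x + e for x, e in zip(inputs, noise)]
output_b = [5.0 + 2.0 * x + e for x, e in zip(inputs, noise)]

const_a, slope_a = simple_fit(inputs, output_a)
const_b, slope_b = simple_fit(inputs, output_b)

# Under 0/1 coding these differences equal the Input*Condition and
# Condition coefficients of the combined model.
interaction = slope_b - slope_a
condition = const_b - const_a

print(f"Slope A: {slope_a:.4f}, slope B: {slope_b:.4f}")
print(f"Difference in slopes (interaction): {interaction:.4f}")
print(f"Difference in constants (Condition): {condition:.4f}")
```

<p>Fitting one combined model rather than two separate ones is what lets the software attach a standard error and a p-value to each of these differences.</p>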
<p>It is easy to compare and test the differences between the constants and coefficients in regression models by including a categorical variable. These tests are useful when you can see differences between regression models and you want to defend your conclusions with p-values.</p>
<p>If you're learning about regression, read my <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-tutorial-and-examples">regression tutorial</a>!</p>
Data Analysis, Hypothesis Testing, Regression Analysis, Statistics Help
Wed, 13 Jan 2016 13:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/how-to-compare-regression-lines-between-different-models
Jim Frost