Minitab | Minitab
Blog posts and articles about using Minitab software in quality improvement projects, research, and more.
http://blog.minitab.com/blog/minitab/rss
Sun, 01 May 2016 19:18:40 +0000
FeedCreator 1.7.3
What's a Moving Range, and How Is It Calculated?
http://blog.minitab.com/blog/marilyn-wheatleys-blog/whats-a-moving-range-and-how-is-it-calculated
<p>We often receive questions about moving ranges because they're used in various tools in our <a href="http://www.minitab.com/products/minitab">statistical software</a>, including control charts and capability analysis when data is not collected in subgroups. In this post, I'll explain what a moving range is, and how a moving range and average moving range are calculated.</p>
<p>A moving range measures how variation changes over time when data are collected as individual measurements rather than in subgroups.</p>
<p>If we collect individual measurements and need to plot the data on a control chart, or assess the capability of a process, we need a way to estimate the variation over time. But when we have individual observations, there are no subgroups for which we can calculate a standard deviation. In such cases, the average moving range across consecutive observations is an alternative way to estimate process variation.</p>
<p>Consider the 10 random data points plotted in the graph below:</p>
<p style="margin-left: 40px;"><img height="369" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/File/7b447a0adb4a6e3a23fee5a34ab07563/7b447a0adb4a6e3a23fee5a34ab07563.png" width="624" /></p>
<p>A moving range is the distance or difference between consecutive points. For example, MR1 in the graph below represents the first moving range, MR2 represents the second moving range, and so forth:</p>
<p style="margin-left: 40px;"><img height="414" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/File/041539e9131ddbfb6cae7517ec190ab8/041539e9131ddbfb6cae7517ec190ab8.png" width="624" /></p>
<p>The difference between the first and second points (MR1) is 0.704, and that’s a positive number since the first point has a lower value than the second. The second moving range, MR2, is the difference between the second point (21.0494) and the third (19.6375), and that’s a negative number (-1.4119), since the third point has a lower value than the second. If we continue that way, we’ll have 9 moving ranges for our 10 data points.</p>
<p>In Minitab, a moving range is easy to compute by "lagging" the data. Continuing the example with the 10 data points above, I can use <strong>Stat</strong> > <strong>Time Series</strong> > <strong>Lag</strong>, and then complete the dialog box as shown below:</p>
<p style="margin-left: 40px;"><img alt="a" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/2b125f53827fb9cc7aec8b2a300845a7/capture.PNG" style="width: 557px; height: 330px;" /></p>
<p>Clicking <strong>OK</strong> in the dialog above will shift the data in C1 down by one row and store the results in C4. Now we can use <strong>Calc</strong> > <strong>Calculator</strong> to subtract C4 from C1 and calculate all the moving ranges:</p>
<p style="margin-left: 40px;"><img alt="b" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/070834223bef3007c9621c940ff3a195/capture.PNG" style="width: 563px; height: 380px;" /></p>
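<p>For readers who want to see the same lag-and-subtract idea outside Minitab, here is a minimal Python sketch. It uses only the first three observations: the second and third values are quoted above, while the first value is an assumption, back-calculated from MR1 = 0.704.</p>

```python
# First three observations from the example. The second and third values are
# quoted in the post; the first is an assumption, back-calculated from MR1 = 0.704.
data = [20.3454, 21.0494, 19.6375]

# "Lag" the column by one row: the first row has no predecessor, so it is missing.
lagged = [None] + data[:-1]

# Subtract the lagged column from the original column, skipping the missing row.
moving_ranges = [
    round(current - previous, 4)
    for current, previous in zip(data, lagged)
    if previous is not None
]

print(moving_ranges)
```

<p>With all 10 data points in the list, the same code would produce the 9 signed moving ranges described earlier.</p>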
<p>To calculate the average moving range, we need to use the absolute value of the moving ranges we calculated above. We’ll take a look at how to do that later. </p>
<p>When Minitab calculates the average of a moving range, the calculation also includes an <a href="http://support.minitab.com/en-us/minitab/17/topic-library/quality-tools/capability-analyses/data-and-data-assumptions/unbiasing-constants/">unbiasing constant</a>. The formula used to calculate the moving range is:</p>
<p style="margin-left: 40px;"><img alt="equation" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/File/a5a46a4ff68b1425bbd155792d20a701/a5a46a4ff68b1425bbd155792d20a701.png" style="border-width: 0px; border-style: solid; width: 624px; height: 140px;" /></p>
<p>The table of unbiasing constants is available within Minitab and <a href="http://support.minitab.com/en-us/minitab-express/1/help-and-how-to/control-charts/how-to/variables-data-in-subgroups/xbar-r-chart/methods-and-formulas/unbiasing-constants-d2-d3-and-d4/">on this page</a>.</p>
<p>We’ve already done most of the work. To finish, we’ll find the right value of d2 in the table linked above, and use Minitab’s calculator to get the answer. We need the value of d2 that corresponds to a moving range of length 2 (that’s the number of points in each moving range calculation, but don’t worry, I’ll explain more about the length of the moving range later):</p>
<p style="margin-left: 40px;"><img border="0" height="179" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/File/2caa9e4eec046f281a834976260d3f8c/2caa9e4eec046f281a834976260d3f8c.png" width="173" /></p>
<p>Now back to Minitab, and we can use <strong>Calc</strong> > <strong>Calculator</strong> to get our answer:</p>
<p style="margin-left: 40px;"><img alt="c" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/f3eaf58a9007d6420c44b559206206eb/capture.PNG" style="width: 604px; height: 386px;" /></p>
<p>Using the formula above, we’re telling Minitab to use the absolute values (ABS calculator command) in C5 to calculate the mean, and then divide that by our unbiasing constant value of 1.128.</p>
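<p>The calculator expression can be mirrored in a few lines of Python. This is only a sketch of the arithmetic described above (mean of the absolute moving ranges, divided by d2 = 1.128). The function name is hypothetical, and because only the first two moving ranges are quoted in this post, the result will not match the 10-point value Minitab reports.</p>

```python
from statistics import mean

D2_LENGTH_2 = 1.128  # unbiasing constant d2 for a moving range of length 2 (from the post)

def scaled_average_moving_range(moving_ranges, d2=D2_LENGTH_2):
    """Mean of the absolute moving ranges, divided by the unbiasing constant."""
    return mean(abs(mr) for mr in moving_ranges) / d2

# Only the first two moving ranges are quoted in the post, so this result
# will not match the value Minitab reports for the full 10-point data set.
estimate = scaled_average_moving_range([0.704, -1.4119])
print(round(estimate, 4))
```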
<p>Now to check our results against Minitab, we can use <strong>Stat </strong>> <strong>Control Charts</strong> > <strong>Variables Charts for Individuals</strong> > <strong>I-MR</strong> and enter our original data column:</p>
<p style="margin-left: 40px;"><img border="0" height="334" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/File/0c80b992ef94f8d021aa1ebfc5bbc594/0c80b992ef94f8d021aa1ebfc5bbc594.png" width="507" /></p>
<p>Next, choose <strong>I-MR Options</strong> > <strong>Storage</strong>, and check the box next to <strong>Standard deviations</strong>, then click <strong>OK</strong> in each dialog box:</p>
<p style="margin-left: 40px;"><img alt="d" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/c4b545c37882980e3f690ad046f63626/capture.PNG" style="width: 582px; height: 440px;" /></p>
<p>The results show the same average moving range value we calculated, <strong>0.602627</strong>. </p>
<p>In this case, because we used a moving range of length 2, the average moving range gives us an estimate of the average distance between our consecutive individual data points. A moving range of length 2 is Minitab’s default, but that can be changed by clicking the <strong>I-MR Options</strong> button in the I-MR chart dialog, and then choosing the <strong>Estimate</strong> tab:</p>
<p style="margin-left: 40px;"><img border="0" height="438" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/File/3e03c57905dc63ff5be0971285a4d518/3e03c57905dc63ff5be0971285a4d518.png" width="442" /></p>
<p>Here we can type in a different value (let’s use 3 as an example), and Minitab will use that number of points to estimate the moving ranges. If we did that for the calculations above, we’d have to make two adjustments:</p>
<ol>
<li>
<p>We’d need to choose the correct value for the unbiasing constant, d2, that corresponds with a moving range length of 3:</p>
<p><img alt="t" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/94764a32eec04329f8dfdd4d73219214/capture.PNG" style="width: 173px; height: 182px;" /></p>
</li>
<li>We’d have to adjust the number of points used for our moving ranges from 2 to 3. Using the same random data as before:</li>
</ol>
<p style="margin-left: 40px;"><img border="0" height="248" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/File/bf32b968b0788bc21e03920397ccefe4/bf32b968b0788bc21e03920397ccefe4.png" width="71" /></p>
<p style="margin-left: 40px;">With three data points, we’ll use just the highest and the lowest values from the first 3 rows, so MR1 will be 21.0494 – 19.6375 = 1.4119.</p>
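<p>To make the effect of the length setting concrete, the sketch below computes moving ranges for any window length as the highest value minus the lowest value in each window. As before, the first data value is an assumption derived from MR1 = 0.704, and the function name is hypothetical. The d2 value of 1.693 for length 3 is the commonly tabulated constant; check it against the table linked above.</p>

```python
# Unbiasing constants d2 by moving range length. 1.128 is quoted in the post;
# 1.693 is the commonly tabulated value for length 3 (verify against the table).
D2 = {2: 1.128, 3: 1.693}

def moving_ranges(data, length=2):
    """Range (max - min) of each window of `length` consecutive points."""
    return [
        max(data[i:i + length]) - min(data[i:i + length])
        for i in range(len(data) - length + 1)
    ]

data = [20.3454, 21.0494, 19.6375]  # first value is an assumption

mr_length_2 = moving_ranges(data, length=2)  # two ranges from three points
mr_length_3 = moving_ranges(data, length=3)  # one range from three points
print(mr_length_2, mr_length_3)
```

<p>Note that a longer moving range yields fewer ranges: 10 points give 9 ranges of length 2 but only 8 ranges of length 3.</p>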
<p><span style="line-height: 1.6;">If you’ve enjoyed this post, check out some of our other blog </span><a href="http://blog.minitab.com/blog/control-charts" style="line-height: 1.6;">posts about control charts</a><span style="line-height: 1.6;">.</span></p>
<p> </p>
Fri, 29 Apr 2016 12:00:00 +0000
http://blog.minitab.com/blog/marilyn-wheatleys-blog/whats-a-moving-range-and-how-is-it-calculated
Marilyn Wheatley
Beware the Radar Chart!
http://blog.minitab.com/blog/fun-with-statistics/beware-the-radar-chart
<p>Along with the explosion of interest in visualizing data over the past few years has been an excessive focus on how attractive the graph is at the expense of how useful it is. Don't get me wrong...I believe that a colorful, modern graph comes across better than a black-and-white, pixelated one. Unfortunately, however, all the talk seems to be about the attractiveness and not the value of the information presented.</p>
<p>Although perhaps not the most egregious example, one that sticks out to me is the radar chart (also known as the spider chart). The web site <a href="http://www.mockdraftable.com" target="_blank">Mock Draftable</a> provides radar charts for every prospect in the NFL draft. For example, here is their radar chart for defensive end Dadi Nicolas of Virginia Tech:</p>
<p style="text-align: center;"><img alt="Mock Draftable Radar Chart for Dadi Nicolas" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/46889f0e-f0a5-4b4a-8a19-2d2b8dce6087/Image/3d2e8553c1e501ce1b1a59ec1e7c8ec3/md_dadi_radar_chart.PNG" style="width: 483px; height: 473px;" /></p>
<p>This chart uses Dadi's percentiles among other defensive-end prospects on some body measurements and physical tests completed at the combine. It attempts to convey:</p>
<ol>
<li>How well Dadi measures against the other prospects in each measurement, by providing a point on the axis pertaining to that measurement.</li>
<li>How good Dadi is overall, by connecting the dots and enclosing a polygon that has an area that increases as individual measurements increase.</li>
<li>How "well rounded" Dadi is by looking at how rounded the polygon is...more round indicates a more balanced player, and one with more peaks indicates a less balanced player.</li>
</ol>
<p>There is no question that what the eye is immediately drawn to is the area covered by the shaded polygon. This is a very misleading graph because of that and I'll explain why. For starters, the order of the categories as you read each axis on the chart is arbitrary. In this example it begins with physical attributes and continues through physical tests in no meaningful order. Allow me to provide four examples of radar charts for Dadi Nicolas that plot the exact same information but change the order of the categories:</p>
<p style="text-align: center;"><img alt="Radar Chart 1" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/46889f0e-f0a5-4b4a-8a19-2d2b8dce6087/Image/6189ae43bdda4ebc41c877d46c328dc5/radarchart1.png" style="width: 480px; height: 289px;" /></p>
<p style="text-align: center;"><img alt="Radar Chart 2" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/46889f0e-f0a5-4b4a-8a19-2d2b8dce6087/Image/1958a49ff6b9002b7dd00768f41d5641/radarchart2.png" style="width: 480px; height: 289px;" /></p>
<p style="text-align: center;"><img alt="Radar Chart 3" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/46889f0e-f0a5-4b4a-8a19-2d2b8dce6087/Image/08406a5b955ece2de3ba30add3d4b48c/radarchart3.png" style="width: 480px; height: 289px;" /></p>
<p style="text-align: center;"><img alt="Radar Chart 4" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/46889f0e-f0a5-4b4a-8a19-2d2b8dce6087/Image/f69b1dee4f912cf0fbc4bc9cc10cf917/radarchart4.png" style="width: 480px; height: 289px;" /></p>
<p>If I didn't tell you these were all the same player, you would have to carefully inspect the axes and specific numbers to figure it out. But more broadly, you could draw contradictory conclusions as you look through them:</p>
<ol>
<li>These certainly give different impressions of how well-rounded Dadi is. Charts 1 and 4 appear to show a player that is exceptional in some categories and not very good at all in others. Charts 2 and 3 appear to show a much more balanced player.</li>
<li>The area of the polygon varies wildly from chart to chart and gives a completely different impression of the overall skill of the player. Chart 4 covers 20% of the available area while chart 3 covers 40%...using the same information.</li>
</ol>
<p>I could go into the mathematical details on why the area differs so much but I think the pictures above are worth 1000 words.</p>
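<p>For anyone who does want a taste of the math, here is a short Python sketch. On a radar chart with equally spaced axes, each pair of neighboring axes forms a triangle whose area depends on the product of the two neighboring values, so the total area changes when the axes are reordered. The percentiles below are hypothetical, not Dadi Nicolas' actual numbers.</p>

```python
import math

def radar_area(values):
    """Area of the polygon formed by plotting values on equally spaced axes."""
    n = len(values)
    angle = 2 * math.pi / n
    # Each pair of neighbouring axes spans a triangle of area 0.5 * r1 * r2 * sin(angle).
    return 0.5 * math.sin(angle) * sum(
        values[i] * values[(i + 1) % n] for i in range(n)
    )

# Hypothetical percentiles, not Dadi Nicolas' actual numbers.
alternating = [90, 10, 80, 20, 70, 30]   # high and low values next to each other
grouped = sorted(alternating)            # same values, similar ones side by side

full = radar_area([100] * len(alternating))   # polygon when every value is 100
share_alternating = radar_area(alternating) / full
share_grouped = radar_area(grouped) / full
print(f"{share_alternating:.0%} vs {share_grouped:.0%}")
```

<p>The two orderings contain exactly the same values, yet the shaded share of the chart differs substantially.</p>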
<p>If I were asked to chart Dadi's statistics, I could quite easily use <a href="http://www.minitab.com/products/minitab">Minitab</a> to provide one that conveys the information in a better format. To start, I would use an Individual Value Plot so that I can assess where the player lies on <span><a href="http://blog.minitab.com/blog/adventures-in-statistics/the-graphical-benefits-of-identifying-the-distribution-of-your-data">the distribution of prospects, rather than looking at the percentile</a></span>. <span style="line-height: 1.6;">I would then create a grouping variable to highlight Nicolas' data on the graph. Then I would place the categories in order of importance. I'm obviously not an NFL scout, but I did a quick correlation on these stats for the 2015 prospects and their draft position to come up with a rough order. </span></p>
<p><span style="line-height: 1.6;">With more work I might come up with some even better ideas, but the point here is to illustrate how quickly a more informative graph could be produced. My graph looks like this (after some editing for looks...that still matters, after all!):</span></p>
<p style="text-align: center;"><img alt="Individual Value Plot of Dadi Nicolas" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/46889f0e-f0a5-4b4a-8a19-2d2b8dce6087/Image/03f1ea5e22e5e7b2f7c401fbd3bd1a5c/individual_value_plot_of_dadi_nicolas.png" style="width: 576px; height: 384px;" /></p>
<p>Now I can quickly make the following assessments without being misled:</p>
<ol>
<li>Dadi is roughly average when all characteristics are combined, but not impressive on two of the three most important categories—40-yard sprint and weight.</li>
<li>By plotting the raw values and not just percentile, we see that Dadi not only had the highest vertical jump, but was well above all others. In fact, the gap from Dadi to the 2nd-highest equals the gap from 2nd to 10th.</li>
<li>Nicolas is in fact unique and not balanced.</li>
</ol>
<p>Of course, instead of using an Individual Value Plot, you could also just watch a freshman Dadi Nicolas chase down future NFL wide receiver Brandon Coleman:</p>
<p style="text-align: center;"></p>
<p>Just don't use a radar chart!</p>
Data Analysis, Statistics, Statistics in the News, Stats
Wed, 27 Apr 2016 12:00:00 +0000
http://blog.minitab.com/blog/fun-with-statistics/beware-the-radar-chart
Joel Smith
Manipulating Your Survey Data in Minitab
http://blog.minitab.com/blog/statistics-and-quality/manipulating-your-survey-data-in-minitab
<p>As a recent graduate from Arizona State University with a degree in Business Statistics, I had the opportunity to work with students from different areas of study and help analyze data from various projects for them.</p>
<p><img alt="survey symbold" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/3b2a7f4c85707a09177d3da12dbaa009/online_survey_icon_or_logo_svg.png" style="margin: 10px 15px; float: right; width: 300px; height: 300px;" />One particular group asked for help analyzing online survey data they had gathered from other students, and they wanted to see if their new student program was beneficial. I would describe this request as them giving us a "pile of data" and saying, "Tell us what you can find out." </p>
<p>There were numerous problems with this "pile of data" because it wasn't organized, in part because of the way the survey itself was set up. (Our statistics professor later told us that she asked this group to come in because she'd looked at their data before they presented it to us and she wanted to see how we would perform with a "real-world" situation.)</p>
<p>Unfortunately, the statistics department didn't have a time machine that would enable us to go back and set up the survey to have better data that was more organized (I guess if we <em>did </em>have a time machine there would be no need for predictive analytics), but we did have <a href="http://www.minitab.com/products/minitab/">Minitab and its tools</a> to help with the importing of data, reviewing the data, and putting it in a format that is best for analyzing. </p>
<p>So let’s assume you have a pile of survey data that is:</p>
<ul>
<li>Unbiased</li>
<li>Taken from a random sample</li>
<li>Taken from the appropriate audience</li>
<li>Drawn from enough respondents</li>
</ul>
<p><span style="line-height: 1.6;">Many online survey tools allow you to download your data to a .csv or Excel file, which would be perfect to <span>import into Minitab</span>. </span></p>
<p><span style="line-height: 1.6;">In fact, Minitab 17.3 recently added a new dialog box that shows you the data before it is opened so you can modify the data type, include or exclude certain columns, and see how many rows the data contains. Within the options of that same dialog box, you can choose what is done with missing data points and missing data rows. All of these new functions give you the ability to bring a "pile of data" into Minitab a little cleaner and with less headache.</span></p>
<p style="margin-left: 40px;"><img alt="open survey data dialog" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/b51a0c86-e2dd-456e-878a-4196c7381c3a/File/c5319276614d905f12f38eca2f3a6343/c5319276614d905f12f38eca2f3a6343.png" style="width: 669px; height: 570px;" /> </p>
<p><span style="line-height: 1.6;">Once the data is in Minitab, reviewing it is essential to uncover any irregularities that may be hiding in the data before analysis. The information icon in the Project Manager bar shows each column's name, column ID, row count, number of missing data points, and data type. This lets you quickly scan the columns to make sure that the online data you received was imported correctly by checking the row count, any missing-data irregularities, and the data type. </span></p>
<p style="margin-left: 40px;"><img alt="data" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/b51a0c86-e2dd-456e-878a-4196c7381c3a/File/637ee7794419e3ad489f4a98c96cbc3c/637ee7794419e3ad489f4a98c96cbc3c.png" style="width: 396px; height: 342px;" /></p>
<p> </p>
<p>Minitab also has numerous tools to format the data before analysis, including <a href="http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-datas-a-mess2c-part-3">coding, sorting and splitting worksheets</a>. </p>
<p>For example, occasionally survey data will use “0” in the place of a non-response. This can be a problem because any data analysis will make this a data point when it probably shouldn't be. Minitab can find those “0”s and replace them with missing data to remove them from your worksheet so they won't throw off your analysis (<strong>Editor > Find and Replace > Replace</strong>).</p>
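<p>To see why this matters, consider the small Python sketch below. The ratings are hypothetical, and the recoding is done by hand here rather than with Minitab's Find and Replace, but the effect on the summary statistic is the same.</p>

```python
from statistics import mean

# Hypothetical 1-5 survey ratings in which 0 was recorded for "no response".
responses = [4, 0, 5, 3, 0]

# Recode 0 as missing (None), then drop missing values before summarizing.
recoded = [None if r == 0 else r for r in responses]
answered = [r for r in recoded if r is not None]

mean_with_zeros = mean(responses)   # treats non-responses as real ratings
mean_answered = mean(answered)      # uses only actual answers
print(mean_with_zeros, mean_answered)
```

<p>Leaving the zeros in place pulls the average down, even though nobody actually gave a rating of 0.</p>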
<p>Before analysis you can also sort your data (<strong>Data > Sort</strong>) by the column of your choice, and you can create a new worksheet from the sorted data. I also really like the Split and Subset Worksheet options in the event you have a lot of data and it would be easier to look at smaller sections of it for analysis (<strong>Data > Split Worksheet</strong> and <strong>Data > Subset Worksheet</strong>).</p>
<p>These are just a few tools that allow you to import data and then prepare the data without having to go back and forth between your spreadsheet software and statistical software. So when you have someone drop off a "pile of data," see how you can use your Minitab tools to shovel through and find the gems that are lying beneath the surface.</p>
Data Analysis, Statistics
Tue, 26 Apr 2016 12:00:00 +0000
http://blog.minitab.com/blog/statistics-and-quality/manipulating-your-survey-data-in-minitab
Joseph Hartsock
Merge All Your Data At Once
http://blog.minitab.com/blog/statistics-and-quality-improvement/merge-all-your-data-at-once
<p>Did you know about the <a href="http://www.linkedin.com/groups?gid=166220&trk=my_groups-b-grp-v">Minitab Network group on LinkedIn</a>? It’s the one managed by <a href="http://blog.minitab.com/blog/understanding-statistics">Eston Martz, who also edits the Minitab blog</a>. I like to see what the members are talking about, which recently got me into some discussions about Raman spectroscopy data.</p>
<p><img alt="An incredibly fine 5-carat emerald crystal, that has it all: bright grass-green color, glassy luster, a fine termination, and most of all, TOP gemminess." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/20e2bf0c5c35af33e68aa4f2c8f86130/beryl_130023.jpg" style="border-width: 1px; border-style: solid; margin: 10px 15px; width: 253px; height: 400px; float: right;" />Not having much experience with Raman spectroscopy data, I thought I’d learn more about it and found <a href="http://rruff.info/about/about_general.php" target="_blank">the RRUFF™ Project</a>.</p>
<p>The idea is that if you have a Raman device, you can analyze a mineral sample and compare your results to information in the database so that you can identify your mineral. Even though I don't have a Raman device, the site is still exciting to me because all of the RRUFF™ data are available in ZIP files that you can download and use to illustrate some neat things in Minitab.</p>
<p>So let’s say that you download <a href="http://rruff.info/repository/zipped_data_files/raman/excellent_oriented.zip">one of the ZIP files from the RRUFF™ Project</a>. The ZIP file contains a few thousand text files with intensity data for different minerals. Some minerals have a small number of files. Some minerals, like beryl, have many files.</p>
<p>Turns out beryl’s pretty cool. In its pure form, it’s colorless, but it comes in a variety of colors. In the presence of different ions, beryl can be aquamarine, maxixe, goshenite, heliodor, and emerald.</p>
<p>I extracted just the beryl files into a folder on my computer. Now, I want to analyze the files in Minitab. If I open the worksheet in Minitab without any adjustments, I get something like this:</p>
<p><img alt="This worksheet puts sample identification information with the measurements, so you can't analyze the data." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/0aa4efeec94e41cb5e48d57316785c73/bad_worksheet.png" style="border-width: 0px; border-style: solid; width: 375px; height: 236px;" /></p>
<p>While I could certainly rearrange this with formulas, I need only a few steps to open the file ready to analyze.</p>
<ol>
<li><img alt="Use the preview to find where the data begin." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/d4d4a0f6ad9866c760f22cdf6d83eef0/single_column_data.jpg" style="border-width: 0px; border-style: solid; margin: 10px 15px; width: 236px; height: 466px; float: right;" />Choose <strong>File > Open Worksheet</strong>.</li>
<li>Select the text file.</li>
<li>Click <strong>Open</strong>. Minitab automatically recognizes that you have a text file, opens common options, and lets you see a simple preview of your data.</li>
<li>Scroll down so that you can see the first row of numbers, in this case, row 13.</li>
<li>Uncheck <strong>Data has column names</strong>.</li>
<li>In <strong>First Row to import</strong>, enter the row that has the data. In this case, 13.</li>
</ol>
<p>Now you’ve solved the problem of including identifying information about the mineral in the worksheet. The other problem is that Minitab places all of the data in a single column unless you tell it how to divide the data. You can see the problem, even in the simple preview. Finish the import with these steps:</p>
<ol>
<li value="7">In <strong>Field Delimiter</strong>, select <strong>Comma</strong>.</li>
<li value="8">Click <strong>OK</strong>.</li>
</ol>
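<p>The import settings above (start at row 13, no column names, comma as the field delimiter) can be pictured with a short Python sketch. The file contents here are hypothetical placeholders, not real RRUFF™ data, and the function name is an assumption made for illustration.</p>

```python
import csv

def read_spectrum(text, first_data_row=13, delimiter=","):
    """Skip the descriptive rows, then split each remaining row into numeric fields."""
    lines = text.splitlines()[first_data_row - 1:]   # rows are 1-indexed, as in the dialog
    reader = csv.reader(lines, delimiter=delimiter)
    return [[float(field) for field in row] for row in reader if row]

# Hypothetical file contents: 12 descriptive rows followed by two-column data.
header = "\n".join(f"##INFO{i}=placeholder" for i in range(1, 13))
sample = header + "\n100.0, 12.5\n101.0, 13.0\n"

rows = read_spectrum(sample)
print(rows)
```

<p>Without the delimiter, each data row would stay a single text field, which is the single-column problem visible in the preview.</p>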
<p>Now your data is in a nice, analyzable format. But remember that there are more than 30 files with data on beryl. To analyze them together in Minitab, the data need to be in the same worksheet.</p>
<p>First, open the remaining worksheets with the correct import settings. Then, try these steps:</p>
<ol>
<li>Choose<strong> Data > Merge Worksheets > Side-by-Side</strong>.</li>
<li>Click <img alt="The double angle bracket button moves all of the worksheets." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/bdf45c3af74d5d079e8664d8efd7bd36/double_button.jpg" style="width: 51px; height: 37px;" /> to move all of the data from <strong>Available worksheets</strong> to <strong>Worksheets to merge</strong>.</li>
<li>Name the new worksheet.</li>
<li>Click <strong>OK</strong>.
<p><img alt="The double angle bracket buttons make it easy to get all of your data." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/087e2bdd4c46c847cc7d58cd2bb656b6/merge_dialog.jpg" style="width: 518px; height: 343px;" /></p>
</li>
</ol>
<p> </p>
<p>All of your data is ready to go in a single worksheet.</p>
<p><img alt="The new worksheet contains all of the data." src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/5fcdb9b22ec7b7e4b5f7081b4ec2de52/merged_worksheet.jpg" style="border-width: 0px; border-style: solid; width: 901px; height: 436px;" /></p>
<p>The options that Minitab provides for opening and merging data sources make it easy to get a wide variety of data ready for analysis. The data features are a good complement to the easy graphs and analyses that you can do in Minitab.</p>
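<p>As a rough picture of what a side-by-side merge does, the Python sketch below places the columns of several worksheets next to each other and pads shorter worksheets with missing values. The worksheets and the function name are hypothetical, and this is not a description of how Minitab implements the merge.</p>

```python
from itertools import zip_longest

def merge_side_by_side(worksheets):
    """Place the columns of each worksheet next to each other, row by row."""
    merged = []
    for rows in zip_longest(*worksheets, fillvalue=None):
        merged_row = []
        for sheet, row in zip(worksheets, rows):
            width = len(sheet[0]) if sheet else 0
            # Pad shorter worksheets with missing values so columns stay aligned.
            merged_row.extend(row if row is not None else [None] * width)
        merged.append(merged_row)
    return merged

# Two hypothetical worksheets of (shift, intensity) rows with different lengths.
sheet_a = [[100.0, 12.5], [101.0, 13.0]]
sheet_b = [[100.0, 9.1]]
merged = merge_side_by_side([sheet_a, sheet_b])
print(merged)
```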
<em><span style="color: rgb(169, 169, 169);">The image of the emerald</span><span style="color: rgb(169, 169, 169);"> is by <a href="http://www.irocks.com/"><b><span class="description en" lang="en"><span class="licensetpl_aut">Rob Lavinsky</span></span></b></a> and is licensed under this <a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">Creative Commons License</a>.</span></em>
Data Analysis, Statistics Help
Mon, 25 Apr 2016 12:00:00 +0000
http://blog.minitab.com/blog/statistics-and-quality-improvement/merge-all-your-data-at-once
Cody Steele
3 Tips for Importing Excel Data into Minitab
http://blog.minitab.com/blog/michelle-paret/3-tips-for-importing-excel-data-into-minitab
<p>Getting your data from Excel into <a href="http://www.minitab.com/products/minitab/">Minitab Statistical Software</a> for analysis is easy, especially if you keep the following tips in mind.</p>
Copy and Paste
<p><span style="line-height: 20.8px;">To paste into Minitab, you can either right-click in the worksheet and choose </span><strong style="line-height: 20.8px;">Paste Cells</strong><span style="line-height: 20.8px;"> or you can use </span><strong style="line-height: 20.8px;">Control-V</strong><span style="line-height: 20.8px;">. </span>Minitab allows for 1 row of column headers, so if you have a single row of column info (or no column header info), then you can quickly copy and paste an entire sheet at once. However, if you have multiple rows of descriptive text at the top of your Excel file, then use the following steps:</p>
<p><em> Step 1</em> - Choose a single row for your column headers and paste it into Minitab. </p>
<p><em> Step 2</em> - Go back to your Excel file to copy all of the actual data over.</p>
<p>And if you have any summary info at the end of your Excel file, you'll want to exclude that too, just like any extraneous column header info.</p>
<p><img alt="Excel to Minitab" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/951006fe8ebf8bfde86486660018fbe0/excel_to_mtb.jpg" style="width: 650px; height: 379px;" /></p>
<p> </p>
Importing Lots of Data
<p><img alt="File Open dialog" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/75e6b833214b1e9cbda4e6056a2fde43/file_open_menu.jpg" style="line-height: 20.8px; width: 253px; height: 359px; margin: 10px 15px; float: right;" /></p>
<p>Copy/paste is ideal when you have only a few Excel sheets. But what if you have lots of <span style="line-height: 1.6;">sheets? In this case, try using </span><strong style="line-height: 1.6;">File > Open</strong><span style="line-height: 1.6;">. Another advantage of </span><strong style="line-height: 1.6;">File > Open</strong><span style="line-height: 1.6;"> is the additional import options, should you need them. For example, you can specify which sheets </span><span style="line-height: 1.6;">and rows to include. And there are even options to handle messy data issues, such as case mismatches and </span><a href="http://blog.minitab.com/blog/michelle-paret/how-to-remove-leading-or-trailing-spaces-from-a-data-set" style="line-height: 1.6;">leading and trailing spaces</a><span style="line-height: 1.6;">.</span></p>
<div>
Fixing Column Formats
<p>Minitab has 3 column formats: numeric, text, and date/time. Text columns are noted with a <strong>-T</strong> and date/time columns are noted with a <strong>-D</strong>, while numeric columns appear without such an indicator. Why does column format matter? It matters because certain graphs and analyses are only available for certain formats. For example, if you want to create a time series plot, Minitab will not allow you to use a text column. If you bring data over from Excel and the format does not reflect the type of data in a given column, just right-click in the column and choose <strong>Format Column</strong> to select the right type, such as <strong>Automatic numeric</strong>.</p>
<p><img alt="column formats" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/350de8d0fc91e01d485bc1f124a28148/column_format.jpg" style="width: 645px; height: 332px;" /></p>
<p><span style="line-height: 1.6;">Once you import your data and it's properly formatted, you can then use the </span><strong style="line-height: 1.6;">Stat</strong><span style="line-height: 1.6;">, </span><strong style="line-height: 1.6;">Graph</strong><span style="line-height: 1.6;">, and </span><strong style="line-height: 1.6;">Assistant</strong><span style="line-height: 1.6;"> menus to start analyzing it. And if you need help running a particular analysis, just </span><a href="http://www.minitab.com/contact-us" style="line-height: 1.6;">contact Minitab Technical Support</a><span style="line-height: 1.6;">. This outstanding service is free and is staffed with statisticians, so don't hesitate to give them a call.</span></p>
</div>
Data Analysis
Fri, 22 Apr 2016 12:00:00 +0000
http://blog.minitab.com/blog/michelle-paret/3-tips-for-importing-excel-data-into-minitab
Michelle Paret
Understanding t-Tests: t-values and t-distributions
http://blog.minitab.com/blog/adventures-in-statistics/understanding-t-tests-t-values-and-t-distributions
<p>T-tests are handy <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-why-we-need-to-use-hypothesis-tests-in-statistics" target="_blank">hypothesis tests</a> in statistics when you want to compare means. You can compare a sample mean to a hypothesized or target value using a one-sample t-test. You can compare the means of two groups with a two-sample t-test. If you have two groups with paired observations (e.g., before and after measurements), use the paired t-test.</p>
<img alt="Output that shows a t-value" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/efd51d69e3947d70197143b735e0c51d/t_value_swo.png" style="line-height: 20.8px; float: right; width: 400px; height: 57px; margin: 10px 15px; border-width: 1px; border-style: solid;" />
<p>How do t-tests work? How do t-values fit in? In this series of posts, I’ll answer these questions by focusing on concepts and graphs rather than equations and numbers. After all, a key reason to use <a href="http://www.minitab.com/products/minitab">statistical software like </a><a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">Minitab</a> is so you don’t get bogged down in the calculations and can instead focus on understanding your results.</p>
<p>In this post, I will explain t-values, t-distributions, and how t-tests use them to calculate probabilities and assess hypotheses.</p>
What Are t-Values?
<p>T-tests are called t-tests because the test results are all based on t-values. T-values are an example of what statisticians call test statistics. A test statistic is a standardized value that is calculated from sample data during a hypothesis test. The procedure that calculates the test statistic compares your data to what is expected under the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/basics/null-and-alternative-hypotheses/" target="_blank">null hypothesis</a>.</p>
<p>Each type of t-test uses a specific procedure to boil all of your sample data down to one value, the t-value. The calculations behind t-values compare your sample mean(s) to the null hypothesis and incorporate both the sample size and the variability in the data. A t-value of 0 indicates that the sample results exactly equal the null hypothesis. As the difference between the sample data and the null hypothesis increases, the absolute value of the t-value increases.</p>
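<p>For readers who do want to peek at the arithmetic, here is a minimal sketch of how a one-sample t-value could be computed, assuming the standard textbook formula (sample mean minus the null value, divided by the standard error). The function name <code>one_sample_t</code> and the measurements are hypothetical, invented for illustration; this is not Minitab's implementation.</p>

```python
import math
import statistics


def one_sample_t(sample, null_mean):
    """Return the one-sample t-value of `sample` against `null_mean`.

    Assumes the textbook formula: (mean - null) / (s / sqrt(n)),
    where s is the sample standard deviation (n - 1 in the denominator).
    """
    n = len(sample)
    sample_mean = statistics.mean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(n)
    return (sample_mean - null_mean) / std_error


# Hypothetical measurements, made up for this example only.
data = [5.1, 4.9, 5.3, 5.0, 5.2]

t_value = one_sample_t(data, null_mean=5.0)
print(f"t-value against a null mean of 5.0: {t_value:.3f}")

# When the sample mean equals the null value, the t-value is (essentially) 0.
t_at_mean = one_sample_t(data, null_mean=statistics.mean(data))
print(f"t-value against the sample's own mean: {t_at_mean:.3f}")
```

<p>Note how both the sample size and the variability enter through the standard error: a larger sample or less scatter shrinks the denominator and pushes the t-value further from zero for the same difference.</p>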
<p>Assume that we perform a t-test and it calculates a t-value of 2 for our sample data. What does that even mean? I might as well have told you that our data equal 2 fizbins! We don’t know if that’s common or rare when the null hypothesis is true.</p>
<p>By itself, a t-value of 2 doesn’t really tell us anything. T-values are not in the units of the original data, or anything else we’d be familiar with. We need a larger context in which we can place individual t-values before we can interpret them. This is where t-distributions come in.</p>
What Are t-Distributions?
<p>When you perform a t-test for a single study, you obtain a single t-value. However, if we drew multiple random samples of the same size from the same population and performed the same t-test, we would obtain many t-values and we could plot a distribution of all of them. This type of distribution is known as a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/basic-concepts/sampling-distribution/" target="_blank">sampling distribution</a>.</p>
<p>Fortunately, the properties of t-distributions are well understood in statistics, so we can plot them without having to collect many samples! A specific t-distribution is defined by its <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/basic-concepts/df/" target="_blank">degrees of freedom (DF)</a>, a value closely related to sample size. Therefore, different t-distributions exist for every sample size. <span style="line-height: 20.8px;">You can graph t-distributions u</span><span style="line-height: 1.6;">sing Minitab’s </span><a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/graphs/graphs-of-distributions/probability-distribution-plots/probability-distribution-plot/" style="line-height: 1.6;" target="_blank">probability distribution plots</a><span style="line-height: 1.6;">.</span></p>
<p>T-distributions assume that you draw repeated random samples from a population where the null hypothesis is true. You place the t-value from your study in the t-distribution to determine how consistent your results are with the null hypothesis.</p>
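<p>The repeated-sampling idea can be sketched as a small simulation. This is only an illustration under stated assumptions: the population is taken to be normal with a mean equal to the null value, each sample has 21 observations (20 degrees of freedom), and the function name, seed, and number of samples are arbitrary choices made here.</p>

```python
import math
import random
import statistics


def simulate_t_values(n_samples, sample_size, null_mean=0.0, sd=1.0, seed=1):
    """Draw repeated random samples from a population where the null
    hypothesis is true and return the t-value of each sample."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    t_values = []
    for _ in range(n_samples):
        sample = [rng.gauss(null_mean, sd) for _ in range(sample_size)]
        std_error = statistics.stdev(sample) / math.sqrt(sample_size)
        t_values.append((statistics.mean(sample) - null_mean) / std_error)
    return t_values


# 5,000 samples of size 21 correspond to a t-distribution with 20 DF.
t_values = simulate_t_values(n_samples=5000, sample_size=21)

share_within_2 = sum(-2 <= t <= 2 for t in t_values) / len(t_values)
print(f"mean of simulated t-values: {statistics.mean(t_values):.3f}")
print(f"share of t-values between -2 and +2: {share_within_2:.3f}")
```

<p>A histogram of <code>t_values</code> should pile up around zero and thin out in both directions, which is the shape the theoretical t-distribution describes without any sampling at all.</p>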
<p style="margin-left: 40px;"><img alt="Plot of t-distribution" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/d628e56f0380e0edcf575502a670ed31/t_dist_20_df.png" style="width: 576px; height: 384px;" /></p>
<p>The graph above shows a t-distribution that has 20 degrees of freedom, which corresponds to a sample size of 21 in a one-sample t-test. It is a symmetric, bell-shaped distribution that is similar to the normal distribution, but with thicker tails. This graph plots the probability density function (PDF), which describes the likelihood of each t-value.</p>
<p>The peak of the graph is right at zero, which indicates that obtaining a sample value close to the null hypothesis is the most likely. That makes sense because t-distributions assume that the null hypothesis is true. T-values become less likely as you get further away from zero in either direction. In other words, when the null hypothesis is true, you are less likely to obtain a sample that is very different from the null hypothesis.</p>
<p>Our t-value of 2 indicates a positive difference between our sample data and the null hypothesis. The graph shows that there is a reasonable probability of obtaining a t-value from -2 to +2 when the null hypothesis is true. Our t-value of 2 is an unusual value, but we don’t know exactly <em>how </em>unusual. Our ultimate goal is to determine whether our t-value is unusual enough to warrant rejecting the null hypothesis. To do that, we'll need to calculate the probability.</p>
Using t-Values and t-Distributions to Calculate Probabilities
<p>The foundation behind any hypothesis test is being able to take the test statistic from a specific sample and place it within the context of a known probability distribution. For t-tests, if you take a t-value and place it in the context of the correct t-distribution, you can calculate the probabilities associated with that t-value.</p>
<p>A probability allows us to determine how common or rare our t-value is under the assumption that the null hypothesis is true. If the probability is low enough, we can conclude that the effect observed in our sample is inconsistent with the null hypothesis. The evidence in the sample data is strong enough to reject the null hypothesis for the entire population.</p>
<p>Before we calculate the probability associated with our t-value of 2, there are two important details to address.</p>
<p>First, we’ll actually use the t-values of +2 and -2 because we’ll perform a two-tailed test. A two-tailed test is one that can test for differences in both directions. For example, a two-tailed 2-sample t-test can determine whether the difference between group 1 and group 2 is statistically significant in either the positive or negative direction. A one-tailed test can only assess one of those directions.</p>
<p>Second, we can only calculate a non-zero probability for a range of t-values. As you’ll see in the graph below, a range of t-values corresponds to a proportion of the total area under the distribution curve, which is the probability. The probability for any specific point value is zero because it does not produce an area under the curve.</p>
<p>With these points in mind, we’ll shade the area of the curve that has t-values greater than 2 and t-values less than -2.</p>
<p><img alt="T-distribution with a shaded area that represents a probability" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/5e124a2c8139681afec706799ebabcec/t_dist_prob.png" style="width: 576px; height: 384px;" /></p>
<p>The graph displays the probability for observing a difference from the null hypothesis that is at least as extreme as the difference present in our sample data while assuming that the null hypothesis is actually true. Each of the shaded regions has a probability of 0.02963, which sums to a total probability of 0.05926. When the null hypothesis is true, the t-value falls within these regions nearly 6% of the time.</p>
<p>This probability has a name that you might have heard of—it’s called the p-value! While the probability of our t-value falling within these regions is fairly low, it’s not low enough to reject the null hypothesis using the common <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-hypothesis-tests%3A-significance-levels-alpha-and-p-values-in-statistics" target="_blank">significance level</a> of 0.05.</p>
<p><a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values" target="_blank">Learn how to correctly interpret the p-value.</a></p>
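<p>As a rough check on the shaded areas, the sketch below integrates the t-distribution's density numerically with Simpson's rule. The helper names <code>t_pdf</code> and <code>two_tailed_p</code> are made up for this example, and the approach is a plain-Python approximation rather than the method any statistical package actually uses.</p>

```python
import math


def t_pdf(t, df):
    """Probability density of the t-distribution with `df` degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def two_tailed_p(t_value, df, steps=2000):
    """Area under the curve beyond -|t| and +|t|.

    Integrates the density from 0 to |t| with Simpson's rule (steps must
    be even), doubles it by symmetry, and subtracts from the total area of 1.
    """
    upper = abs(t_value)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    return 1 - 2 * area_zero_to_t


p_value = two_tailed_p(2, df=20)
print(f"two-tailed p-value for t = 2 with 20 DF: {p_value:.5f}")
print(f"each tail: {p_value / 2:.5f}")
```

<p>The result can be compared with the two shaded regions in the graph: each tail should come out close to 0.02963, for a total just above the 0.05 significance level.</p>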
t-Distributions and Sample Size
<p>As mentioned above, t-distributions are defined by the DF, which are closely associated with sample size. As the DF increases, the probability density in the tails decreases and the distribution becomes more tightly clustered around the central value. The graph below depicts t-distributions with 5 and 30 degrees of freedom.</p>
<p><img alt="Comparison of t-distributions with different degrees of freedom" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/5220dc6347611a230e89b70de904b034/t_dist_comp_df.png" style="width: 576px; height: 384px;" /></p>
<p>The t-distribution with fewer degrees of freedom has thicker tails. This occurs because the t-distribution is designed to reflect the added uncertainty associated with analyzing small samples. In other words, if you have a small sample, the probability that the sample statistic will be further away from the null hypothesis is greater even when the null hypothesis is true.</p>
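<p>To put numbers on "thicker tails," this short sketch evaluates the t-distribution's density function at the center and out in the tail for 5 and 30 degrees of freedom. The helper name <code>t_pdf</code> and the choice of t = 3 as a "tail" point are assumptions made for illustration.</p>

```python
import math


def t_pdf(t, df):
    """Probability density of the t-distribution with `df` degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


# Compare the height of the curve at the peak (t = 0) and in the tail (t = 3).
for df in (5, 30):
    print(f"DF = {df:2d}: density at 0 = {t_pdf(0, df):.4f}, "
          f"density at 3 = {t_pdf(3, df):.4f}")

tail_ratio = t_pdf(3, 5) / t_pdf(3, 30)
print(f"tail density at t = 3 is {tail_ratio:.1f}x higher with 5 DF than with 30 DF")
```

<p>The curve with 5 degrees of freedom is lower at the peak and higher in the tail, so extreme t-values are less surprising when the sample is small.</p>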
<p>Small samples are more likely to be unusual. This affects the probability associated with any given t-value. For 5 and 30 degrees of freedom, a t-value of 2 in a two-tailed test has p-values of about 10.2% and 5.5%, respectively. Large samples are better!</p>
<p>I’ve explained how t-values and t-distributions work together to produce probabilities. In my next post, I’ll show how each type of t-test works.</p>
Data Analysis, Hypothesis Testing, Learning, Statistics Help
Wed, 20 Apr 2016 12:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/understanding-t-tests-t-values-and-t-distributions
Jim Frost
Have It Your Way: How to Create a Custom Toolbar in Minitab
http://blog.minitab.com/blog/starting-out-with-statistical-software/have-it-your-way-how-to-create-a-custom-toolbar-in-minitab
<p>Depending on how often and when you use <a href="http://www.minitab.com/products/minitab">statistical software like Minitab</a>, there may be specific tools or a group of tools you find yourself using over and over again. <span style="line-height: 20.8px;">You may have to do a monthly report, for instance, for which you use one tool in our Basic Statistics menu, another in Quality Tools, and a third in Regression. </span></p>
<p>But there are a lot of functions and capabilities in our software, and if you don't use Minitab every day, it might be hard to remember where specific tools are located. While the menus are easy to navigate, you might benefit from grouping all of those commands you use most often in one place. In Minitab, you can do this by creating a custom toolbar to fit your exact needs. </p>
<p>To add a toolbar, go to <strong>Tools</strong> > <strong>Customize</strong> and choose the Toolbars tab. Now click <strong>New…</strong> and enter a name. At this point, a blank box will appear, like this:</p>
<p style="margin-left: 40px;"><img alt="toolbar" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/732ead34-1005-4470-b034-d7f8b87fabcf/Image/7d39db69f2fe3e3ccceca6791e494e0e/toolbar.png" style="width: 40px; height: 58px;" /></p>
<p>You can leave it hovering, or you can dock it by dragging it next to an existing toolbar. From here, we can add menus and commands for different tasks. This is done by switching to the <strong>Commands </strong>tab and picking out the commands you want included.</p>
<p>For example, if you want a <a href="http://blog.minitab.com/blog/understanding-statistics/the-easiest-way-to-do-capability-analysis">Capability Analysis</a> to be included in your toolbar, you can choose the <strong>Stat </strong>category on the left and then find the Capability command on the right. You can then drag this command into your toolbar for easy use. </p>
<p style="margin-left: 40px;"><img alt="toolbar3" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/732ead34-1005-4470-b034-d7f8b87fabcf/Image/484e7f4c0d305041aa439582f71f8ca8/toolbar3.png" style="width: 447px; height: 381px;" /></p>
<p>You can add any command you wish that appears in Minitab's menu.</p>
<p>In addition to custom toolbars, there is a <strong>New Menu</strong> command, which can give you even more control over organization. The picture above illustrates how I accomplished this while building a DMAIC toolbar. </p>
<p>You can drag this into the toolbar, and then right-click and choose to rename it to anything you wish. This is helpful if you want to organize your toolbar into steps. Step 1 may be preliminary graphs, while Step 2 may be analysis and results. </p>
<p style="margin-left: 40px;"><img alt="toolbar" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/eb075a0a348470ddda32dbb2a9b3466c/toolbar2.png" style="width: 113px; height: 48px;" /></p>
<p>Once you have your toolbar built, it will become a part of your active profile. Anytime you make this profile active in your Minitab session, you will have access to this toolbar, which you can use to quickly navigate the commands you use most often. </p>
Automation
Mon, 18 Apr 2016 12:00:00 +0000
http://blog.minitab.com/blog/starting-out-with-statistical-software/have-it-your-way-how-to-create-a-custom-toolbar-in-minitab
Eric Heckman
What If Major League Baseball Had a 16-Game Season?
http://blog.minitab.com/blog/the-statistics-game/what-if-major-league-baseball-had-a-16-game-season
<p>When it comes to statistical analyses, collecting a large enough sample size is essential to obtaining quality results. If your sample size is too small, confidence intervals may be too wide to be useful, linear models may lack necessary precision, and <a href="http://blog.minitab.com/blog/statistics-support/control-your-control-chart">control charts may get so out of control</a> that they become self-aware and rise up against humankind.</p>
<img alt="NFL and MLB Logos" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/71352ecd9092619a3262811e64594f20/baseball_and_nfl_logo.jpg" style="line-height: 20.8px; float: right; width: 250px; height: 133px; margin: 10px 15px;" />
<p>Okay, that last point may have been exaggerated, but you get the idea. </p>
<p>However, sometimes collecting a large sample size is easier said than done. Financial or time constraints often limit the number of observations we can collect. And in the world of sports, there is no better example of this than the NFL.</p>
<p>Football is a violent sport, so the players need a week to rest and recover between games. This time constraint limits the regular season to only 16 games. This is very small compared to the other major American leagues—hockey, basketball, and baseball. The NHL and NBA both play an 82-game season, while MLB plays 162 games!</p>
<p>But we never consider the sample size when we consider the best and worst teams in the NFL. It's not uncommon to see teams with sub-par records come back and have a great record the following year, or vice versa. We'll often credit/blame coaches and quarterbacks, but did you ever hear a sports analyst just say "Hey, sometimes crazy things can happen over a 16-game sample"? And we're almost at the point in the MLB season where most teams have played 16 games.</p>
<p>That makes me wonder, what would baseball look like if they only played 16 games?</p>
Looking at Major League Baseball as a 16-Game Season
<p>I took the previous 10 seasons and recorded every MLB team's record in their first 16 games. I also looked at their final record to get a good estimate of their "true" winning percentage. <span style="line-height: 20.8px;">The fitted line plot below shows the relationship between a baseball team's winning percentage in their first 16 games and their final winning percentage.</span></p>
<p><img alt="Fitted Line Plot" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/ed92ab388a86bd225398ca38674759a8/fitted_line_plot.jpg" style="width: 576px; height: 384px;" /></p>
<p>The relationship isn't completely random, as a higher winning percentage in your first 16 means you're more likely to have a better final winning percentage. But it's not a very strong relationship, as only 20.2% of the variation in a team's final winning percentage is explained by their winning percentage in the first 16 games.</p>
<p>Observations toward the bottom right show teams that started off with a very strong record but ended up in the bottom of the league. You can see that the Colorado Rockies have a habit of doing this. But the more interesting teams are in the top left corner. These are teams that started out slow but ended up as one of the best teams in the league. In fact, there were 31 teams that started .500 or worse in their first 16 games and ended up making the playoffs. That's 35% of all playoff teams in the last 10 years! Four teams that stand out are the Rays, Rockies, Rangers, and Phillies. The Rays, Rockies and Rangers were all sub-.500 and in last place in their division after the first 16 games—and all three ended up making it to the World Series that same year. And the Phillies were 8-8 and in third place after 16 games of the 2008 season. That would have put them out of the playoffs. But with a larger sample, they finished first in their division and ended up winning the World Series!</p>
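<p>One way to see why 16 games says so little is to look at the standard error of a winning percentage. The sketch below is a deliberate simplification and not the analysis behind the fitted line plot: it assumes every game is an independent coin flip with a fixed win probability, and the function name <code>win_pct_std_error</code> is made up for this example.</p>

```python
import math


def win_pct_std_error(true_win_prob, games):
    """Standard error of an observed winning percentage, assuming each
    game is an independent trial with the same win probability."""
    return math.sqrt(true_win_prob * (1 - true_win_prob) / games)


# Season lengths mentioned in the post; a truly average (.500) team is assumed.
for league, games in [("NFL", 16), ("NBA/NHL", 82), ("MLB", 162)]:
    se = win_pct_std_error(0.5, games)
    print(f"{league:8s} {games:3d} games: winning pct = .500 +/- {se:.3f}")
```

<p>Under these assumptions, a true .500 team's record over 16 games carries roughly three times the noise of its record over 162 games, so a slow start is weak evidence about where a team will finish.</p>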
Can a Small Sample be Good?
<p>In the world of sports, a small sample size isn't necessarily a bad thing. Small samples definitely make things entertaining. For example, just compare the first round of the NCAA tournament (single elimination) to the first round of the NBA playoffs (7-game series). The former has upsets galore (Middle Tennessee over Michigan St, anyone?) that would be near impossible in a 7-game series. And the variance that can occur in an NFL regular season certainly contributes to it being more popular than the marathon that is the MLB regular season. Larger samples help determine who the better team is, but the unpredictability that we love in sports is greatly helped by smaller samples.</p>
<p>Of course, the quality world varies greatly from sports entertainment. Usually, we want all the observations we can get to improve the reliability of our results. Just make sure that you don't collect such a large sample that <a href="http://support.minitab.com/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/p-value-and-significance-level/practical-significance/" target="_blank">statistically significant results aren't practical to your situation</a>. Luckily, Minitab Statistical Software offers <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/power-and-sample-size/power-and-sample-size-analyses-in-minitab/" target="_blank">power and sample size analyses</a> to help you determine how much data to collect. You want enough data to ensure you'll get reliable results without spending extra time and money on unnecessary observations. </p>
<p>And remember, when that NFL team comes out of nowhere to win their division next year, it could be the coach, it could be the quarterback...or it could just be the sample size!</p>
Fun Statistics, Statistics in the News
Fri, 15 Apr 2016 12:00:00 +0000
http://blog.minitab.com/blog/the-statistics-game/what-if-major-league-baseball-had-a-16-game-season
Kevin Rudy
Better Home$ and Baseboards: Using Data Distributions to Set a List Price
http://blog.minitab.com/blog/data-analysis-and-quality-improvement-and-stuff/using-data-distributions-to-set-a-list-price
<p>People say that I overthink everything. I've given this assertion considerable thought, and I don't believe that it is true. After all, how can any one person possibly overthink every possible thing in just one lifetime?</p>
<p>For example, suppose I live 85 years. That's 2,680,560,000 seconds (85 years x 365 days per year x 24 hours per day x 60 min per hour x 60 seconds per minute). I'm asleep about a third of the time, so that leaves just 1,787,040,000 seconds to ponder a nearly infinite variety of things. This morning, I paused for about 2 seconds to ruminate about a gray hair. ("Hey, that hair wasn't gray yesterday.") At a rate of 1 cogitation every 2 seconds, I would have time in life to mull over only 893,520,000 items.</p>
<p>That's a plethora, for sure. But this number doesn't seem so big when you consider the large (though shrinking) number of not-yet-gray hairs on my head, or the vast number of <a href="http://blog.minitab.com/blog/quality-improvement-2">ways that you can use Minitab Statistical Software to improve quality in your organization</a>. So, to those who say that I overthink everything, after much deliberation I am confident that you are mistaken.</p>
<p><img alt="house" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/e83ea86a6b12c65a9fdaed3652ed4347/house.png" style="border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 300px; height: 205px;" />But I do overthink <em>some </em>things. Take my house...<em>please</em>. After much blood (literal), sweat (literal), and tears (not telling), I am finally ready to list my house for sale. (If you're in the market for a lovely 4-bedroom home, nestled in the heart of State College, Pennsylvania, I have just the house for you.)</p>
<p>The other day, a gaggle of realtors (I <em>think </em>that's the collective noun for realtors) inspected the property, and each submitted their personal estimate of what the house is worth. Being a numbers guy (and being anxious to know how much I could get for the <del>dump</del> ol' homestead), I was excited to see the results.</p>
<p>Imagine my disappointment when my realtor gave me only summary data!</p>
<p>Ladies and gentlemen, the data you are about to see are real; only the values have been transformed, to protect the innocent (a.k.a., the guy who bought <em>way </em>more house than he ever should have, or ever will again).</p>
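<table>
<tbody>
<tr><td>Highest valuation</td><td>$460,000</td></tr>
<tr><td>Lowest valuation</td><td>$425,000</td></tr>
<tr><td>Average</td><td>$450,000</td></tr>
<tr><td>Number of realtors</td><td>12</td></tr>
<tr><td>Written comments</td><td>"Great large rooms, bright, nice windows, loved the decks!"<br style="line-height: 20.8px;" />
"Nice, presents well!"<br style="line-height: 20.8px;" />
"Baseboards are blotchy/scratched and need to be painted."</td></tr>
</tbody>
</table>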
<p>Desperate questions crowded my frantic mind as I struggled to process the surprisingly sparse information:</p>
<ul>
<li>What happened to the <em>rest </em>of the data!?!</li>
<li>How many realtors thought the house is worth $460,000? Just one? I can't tell!</li>
<li>Is $425,000 an outlier? Did the other realtors take all the chocolate chip cookies that I baked and leave poor 425 grumpy and snackless?</li>
<li>Do I really need to paint the baseboards!?!</li>
</ul>
<p>I was deeply distraught by this dearth of data, this omission of observations, this not-enough-ness of numbers. So, I asked my realtor if I could see the raw data. Her response shocked me: <em>"My assistant threw out the individual responses. These valuations are just gut feelings. Don't overthink it."</em></p>
<p>What? <em>"Threw out the individual responses"</em>!?!?!</p>
<p><em>"Don't overthink it"</em>?!?!? </p>
<p><em>"Paint the baseboards"</em>?!?!?!?!</p>
<p>I had planned to use the realtor valuations to help me come up with a list price. I was concerned because the mean of different distributions can be the same, even if the shapes of the distributions are wildly different. For example, each sample in this histogram has a mean of 4. Obviously, the mean alone doesn't tell you anything about how the observations are distributed.</p>
<p style="margin-left: 40px;"><img alt="Differently shaped distributions, each with a mean of 4" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/102028490a955f7c2f22175ad57c7445/mean_4_norm_exp_bimo.jpg" style="width: 360px; height: 240px;" /></p>
<p>Also, the mean itself probably wouldn't be a good list price. I'm not trying to appeal to the <em>average </em>buyer; I'm trying to appeal to those special few buyers who actually like the house and are willing to bid more for it. On the other hand, if I pick a number that is too high, I could out-price even the high-bidders and I might get no offers.</p>
<p>What to do, what to do?</p>
<p>I had done a little reading about <a href="http://blog.minitab.com/blog/understanding-statistics/monte-carlo-is-not-as-difficult-as-you-think">Monte Carlo simulations</a>. And I recalled that data simulations were invaluable when we designed the new test and confidence interval for 2 variances in <a href="http://www.minitab.com/en-us/products/minitab/">Minitab Statistical Software</a>. (You can read more than most people ever want to know about those simulations <a href="http://support.minitab.com/en-us/minitab/17/Bonetts_Method_Two_Variances.pdf">here</a>.) So I decided to try some simple simulations to see what I could learn about possible sample distributions that fit the summary statistics I was given.</p>
<p style="">First, a quick note about my methodology. For simplicity, I assigned each observation 1 of 15 discrete values: $425,000, $427,500, $430,000, ..., up to $460,000. Each hypothetical distribution includes 12 observations and has a mean of $450,000 (within rounding error). Each distribution includes at least one observation at $425,000 (the reported minimum) and at least one observation at $460,000 (the reported maximum). Values on the graph are in units of $1,000 (for example, 425 = $425,000). Reference lines are included on the histograms to show the following statistics:</p>
<p style="margin-left: 40px;">Mn = the mean, which is always equal to $450,000<br />
Md = the median<br />
Mo = the mode<br />
Q3 = the 3rd quartile (also called the 75th percentile)</p>
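<p>The constraints above can be sketched in a few lines of code. The sample below is hypothetical: it was made up here to satisfy the reported summary statistics and is not one of the data sets behind the histograms. The function name <code>matches_summary</code> and the rounding tolerance are likewise assumptions, and the quartile comes from Python's default interpolation method, which may differ from the one Minitab uses.</p>

```python
import statistics

# The 15 allowed values, in units of $1,000: 425, 427.5, ..., 460.
ALLOWED = [425 + 2.5 * i for i in range(15)]


def matches_summary(sample, n=12, low=425, high=460, mean=450, tol=0.5):
    """Check a candidate sample against the reported summary data:
    12 realtors, minimum 425, maximum 460, mean 450 (within `tol`)."""
    return (
        len(sample) == n
        and all(value in ALLOWED for value in sample)
        and min(sample) == low
        and max(sample) == high
        and abs(statistics.mean(sample) - mean) <= tol
    )


# Hypothetical left-skewed sample, invented for illustration only.
sample = [425, 442.5, 445, 445, 447.5, 450, 452.5, 455, 457.5, 460, 460, 460]

print("fits the summary data:", matches_summary(sample))
print("Mn =", statistics.mean(sample))
print("Md =", statistics.median(sample))
print("Mo =", statistics.mode(sample))
print("Q3 =", statistics.quantiles(sample, n=4)[2])
```

<p>Many different samples pass this check, which is exactly the problem with summary data: the mean, minimum, and maximum pin down very little about the shape of the distribution.</p>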
Simulated Sample Data
<p>My first guess when I saw the summary data was that the distribution of the realtor evaluations was probably left-skewed, so I simulated that first.</p>
<p style="margin-left: 40px;"><img alt="Left skewed scenario" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/246f7a9760e3166691dc46065c046119/left_skewed.jpg" style="width: 260px; height: 174px;" /></p>
<p>In this scenario, most of the valuations are clustered at the high end, with fewer valuations in the middle, and even fewer valuations at the low end. This is my favorite scenario, because the most frequent response (the mode) is $460,000, which is the highest value in the sample. If the real distribution looked like this, I'd be comfortable choosing $460,000 as my list price because I'd know that 3 of the 12 realtors think the house is worth that price.</p>
<p>Next I wondered what it would look like if there was a major disagreement among the realtors. So I worked up this bimodal scenario.</p>
<p style="margin-left: 40px;"><img alt="Bimodal scenario" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/f389032fee8724d6445fc4989fd122d8/bimodal.jpg" style="width: 260px; height: 174px;" /></p>
<p>In order to maintain a mean of $450,000, I could not include very many observations on the low end of the spectrum. So most of the valuations in this scenario fall on the high end. But—and this is a <em>big</em> but—in this scenario, 3 different realtors actually gave the house the minimum valuation. I would definitely want to know why those realtors priced the house so differently from the others. It could be that they noticed something that the other realtors did not. In this scenario, I can't really come up with a reasonable list price until I find out why there are two distinct peaks.</p>
<p>Next, I wondered what the data might look like if the realtors were feeling blasé about the price. This flat-looking distribution is my statistical interpretation of realtor ennui.</p>
<p style="margin-left: 40px;"><img alt="Flat distribution scenario" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/b18e5e1f76baeffa35410b36c59b10c5/flat_low_outlier.jpg" style="width: 260px; height: 174px;" /></p>
<p>Again, in order to maintain a mean of $450,000, I could not put many observations on the low side. In fact, I included only the one minimum observation on the low side, which makes that observation an outlier. If I didn't already know that this outlier was just Mr. Blotchy Baseboards having a bad day, I'd need to investigate. The other valuations are distributed fairly evenly between $445,000 and $460,000. In a case like this, it seems like the 3rd quartile (Q3) might be a reasonable choice. By definition, at least 25% of the observations in a distribution are greater than or equal to Q3. If 25% of potential buyers think the house is worth $455,000, then I'll have a decent chance of getting an offer quickly at that price.</p>
<p>I also wondered what the data might look like if most of the realtors were in close agreement on the price.</p>
<p style="margin-left: 40px;"><img alt="Peaked distribution scenario" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/52b3b9024531dcd1a6fdb224b97711fd/peaked_low_high_outliers.jpg" style="width: 260px; height: 174px;" /></p>
<p>In this scenario, most of the valuations are grouped closely together near the mean. The minimum valuation is again an outlier. The maximum valuation also appears to be an outlier. I definitely would not base my list price on the maximum valuation because it does not seem representative. The mode is a disappointing $450,000, but Q3 is a little higher at $452,500.</p>
<p>Just for the heck of it, I tried one final scenario—a right-skewed distribution.</p>
<p style="margin-left: 40px;"><img alt="Right skewed scenario" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/8de770ba-a50a-4f6b-9144-9713c3b99f66/Image/b3ea184f3e2a7be3d7a5ae58f6d8038c/right_skew_low_outlier.jpg" style="width: 260px; height: 174px;" /></p>
<p>Again, Mr. Blotchy Baseboards is an outlier. This is another disappointing scenario because the mode is $450,000. But at least Q3 is $454,400, which is a little higher than Q3 for the peaked scenario.</p>
<p>Here is a recap of the list prices I would choose for each simulated data set (in decreasing order):</p>
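<table>
<tbody>
<tr><td><strong>Scenario</strong></td><td><strong>List Price</strong></td></tr>
<tr><td>Left Skewed</td><td>$460,000</td></tr>
<tr><td>Flat</td><td>$455,000</td></tr>
<tr><td>Right Skewed</td><td>$454,400</td></tr>
<tr><td>Peaked</td><td>$452,500</td></tr>
<tr><td>Bimodal</td><td>Undetermined</td></tr>
</tbody>
</table>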
<p>I'm still mad at my realtor for throwing away perfectly good data. But I am feeling better about choosing a list price for my house. I would like to think that the left-skewed scenario is closest to the truth. But even if it is not, the lowest list price that I came up with was $452,500, which isn't much different. The bimodal scenario is problematic, but since I don't know if the actual data were bimodal, I kind of have to ignore that one.</p>
<p>I will probably go with the second highest list price of $455,000. In the end, it's just gut feel, right? I don't want to overthink it.</p>
Fun Statistics, Statistics
Wed, 13 Apr 2016 19:49:00 +0000
http://blog.minitab.com/blog/data-analysis-and-quality-improvement-and-stuff/using-data-distributions-to-set-a-list-price
Greg Fox
3 Ways to Examine Data Over Time
http://blog.minitab.com/blog/real-world-quality-improvement/3-ways-to-examine-data-over-time
<p>Did you know that Minitab provides several tools you can use to view patterns in data over time? If you want to examine, say, monthly sales for your company, or even how the number of patients admitted to your hospital changes throughout the year, then these tools are for you!</p>
1. Time Series Plot
<p>Time series plots are often used to examine daily, weekly, seasonal or annual variations, or before-and-after effects of a process change. They’re especially useful for comparing data patterns of different groups. For example, you could examine monthly production for several plants for the previous year, or employment trends in different industries across several months.</p>
<p>Here’s an example of a time series plot that shows the monthly sales for two companies over two years:</p>
<p style="margin-left:.25in;"><img alt="http://support.minitab.com/en-us/minitab/17/time_series_plot_def.png" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ccb8f6d6-3464-4afb-a432-56c623a7b437/Image/bbbbd2d7dd47553916d144d09e3ba3ea/time_series_plot.png" style="width: 360px; height: 240px;" /></p>
<p>This simple plot reveals a lot about the sales of these two companies. Company A’s sales grew slowly but steadily over these two years. Company B’s sales started lower than company A’s, but shot up and surpassed company A’s by the second year.</p>
<p>It’s easy to create time series plots in Minitab – just choose <strong>Graph > Time Series Plot</strong>, and you’ll be off and running. Check out <a href="http://support.minitab.com/minitab/17/topic-library/basic-statistics-and-graphs/graphs/graphs-of-time-series/time-series-plots/creating-different-types-of-time-series-plots/" target="_blank">this article</a> to learn about the different types of time series plots you can create in Minitab, or <a href="http://blog.minitab.com/blog/real-world-quality-improvement/looking-at-past-weather-data-with-minitab-time-series-plots" target="_blank">this past blog post</a> I wrote on weather data and time series plots.</p>
2. Area Graph
<p>An area graph evaluates contributions to a total over time. It displays multiple time series stacked on the y-axis against equally spaced time intervals on the x-axis. Each line on the graph is the cumulative sum of the series at and below it, so you can see each series' contribution to the sum and how the composition of the sum changes over time.</p>
<p>For example, you could examine the quarterly sales of three different car models over two years, or employment trends in four different industries over several months.</p>
<p>And here’s an example of an area graph you could use to examine the number of cardiac inpatients and outpatients admitted over the past 12 months:</p>
<p style="margin-left:.5in;"><img alt="area graph" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ccb8f6d6-3464-4afb-a432-56c623a7b437/Image/8b66ad34035fc36689082121ef16172a/area_graph.png" style="border-width: 0px; border-style: solid; width: 376px; height: 250px;" /></p>
<p>The graph shows that both inpatients and outpatients follow a similar trend, and it also suggests a seasonal effect: the number of patients admitted to the hospital is higher in the winter months.</p>
<p>To create area graphs in Minitab, choose <strong>Graph > Area Graph</strong>. For step-by-step instructions, check out <a href="http://support.minitab.com/minitab/17/topic-library/basic-statistics-and-graphs/graphs/graphs-of-time-series/area-graphs/create-a-simple-area-graph/" target="_blank">this article</a>.</p>
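<p>To make the stacking idea concrete, here is a minimal Python sketch of how the boundary lines of an area graph can be computed as running sums across the series. The monthly counts and the function name <code>stacked_boundaries</code> are made up for illustration; this is not how Minitab itself is implemented.</p>

```python
def stacked_boundaries(series):
    """Return the boundary lines of a stacked area graph.

    series: list of equal-length lists, one per time series,
            ordered from the bottom of the stack to the top.
    Each returned line is the running (cumulative) sum of the
    series at and below it, time point by time point.
    """
    running = [0] * len(series[0])
    boundaries = []
    for values in series:
        running = [r + v for r, v in zip(running, values)]
        boundaries.append(running)
    return boundaries


# Hypothetical counts for three months (not the data behind the graph above).
inpatients = [30, 25, 20]
outpatients = [50, 40, 45]

lines = stacked_boundaries([inpatients, outpatients])
print(lines)  # [[30, 25, 20], [80, 65, 65]]
```

<p>The top line is always the total, and the gap between two adjacent lines is the contribution of a single series.</p>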
3. Scatterplot with a connect line
<p>You’ll want to create a scatterplot with a connect line if your data were collected at irregular intervals or are not in chronological order in the Minitab worksheet. For example, in this scatterplot, you can see that as the time in days increases, the weight of the fruit on the tree also increases:</p>
<p style="margin-left:.25in;"><img alt="scatterplot" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ccb8f6d6-3464-4afb-a432-56c623a7b437/Image/5690ad1998b307e6fa0b4b061e60eff6/scatterplot.png" style="border-width: 0px; border-style: solid; width: 403px; height: 269px;" /></p>
<p>You can create a scatterplot with a connect line in Minitab by choosing <strong>Graph > Scatterplot > With Connect Line</strong>. If you need a refresher on scatterplots, check out <a href="http://support.minitab.com/minitab-express/1/help-and-how-to/graphs/scatterplot/before-you-start/overview/" target="_blank">this article</a> from Minitab Support. </p>
Data Analysis, Statistics
Mon, 11 Apr 2016 12:00:00 +0000
http://blog.minitab.com/blog/real-world-quality-improvement/3-ways-to-examine-data-over-time
Carly Barry
What Are Degrees of Freedom in Statistics?
http://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-are-degrees-of-freedom-in-statistics
<p><img alt="lion" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/46e9f0d33e15792636e82357642661b9/the_wizard_of_oz_bert_lahr_1939.jpg" style="float: right; margin: 10px 15px; width: 235px; height: 266px;" />About a year ago, <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-are-t-values-and-p-values-in-statistics" target="_blank">a reader asked if I could try to explain <em>degrees of freedom</em> in statistics</a>. Since then, I’ve been circling around that request very cautiously, like it’s some kind of wild beast that I’m not sure I can safely wrestle to the ground.</p>
<p>Degrees of freedom aren’t easy to explain. They come up in many different contexts in statistics—some advanced and complicated. In mathematics, they're technically defined as the dimension of the domain of a random vector.</p>
<p>But we won't get into that. Because degrees of freedom are generally not something you <em>need</em> to understand to perform a statistical analysis—unless you’re a research statistician, or someone studying statistical theory.</p>
<p>And yet, enquiring minds want to know. So for the adventurous and the curious, here are some examples that provide a basic gist of their meaning in statistics.</p>
<strong>The Freedom to Vary</strong>
<p>First, forget about statistics. Imagine you’re a fun-loving person who loves to wear hats. You couldn't care less what a degree of freedom is. You believe that variety is the spice of life.</p>
<p>Unfortunately, you have constraints. You have only 7 hats. Yet you want to wear a different hat every day of the week.</p>
<p><img alt="7 hats" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/d05c904b7d0e90af30970c3c64ac11e9/hats.png" style="width: 628px; height: 85px; margin-left: 15px; margin-right: 15px;" /></p>
<p>On the first day, you can wear any of the 7 hats. On the second day, you can choose from the 6 remaining hats, on day 3 you can choose from 5 hats, and so on.</p>
<p>When day 6 rolls around, you still have a choice between 2 hats that you haven’t worn yet that week. But after you choose your hat for day 6, you have no choice for the hat that you wear on day 7. You <em>must</em> wear the one remaining hat. You had 7 - 1 = 6 days of “hat” freedom, in which the hat you wore could vary!</p>
<p>That’s kind of the idea behind degrees of freedom in statistics. Degrees of freedom are often broadly defined as the number of "observations" (pieces of information) in the data that are free to vary when estimating statistical parameters.</p>
<strong>Degrees of Freedom: 1-Sample t test</strong>
<p>Now imagine you're not into hats. You're into data analysis.</p>
<p>You have a data set with 10 values. If you’re not estimating anything, each value can take on any number, right? Each value is completely free to vary.</p>
<p>But suppose you want to test the population mean with a sample of 10 values, using a 1-sample t test. You now have a constraint—the estimation of the mean. What is that constraint, exactly? By definition of the mean, the following relationship must hold: The sum of all values in the data must equal <em>n</em> x mean, where <em>n </em>is the number of values in the data set.</p>
<p>So if a data set has 10 values, the sum of the 10 values <em>must</em> equal the mean x 10. If the mean of the 10 values is 3.5 (you could pick any number), this constraint requires that the sum of the 10 values must equal 10 x 3.5 = 35.</p>
<p>With that constraint, the first value in the data set is free to vary. Whatever value it is, it’s still possible for the sum of all 10 numbers to have a value of 35. The second value is also free to vary, because whatever value you choose, it still allows for the possibility that the sum of all the values is 35.</p>
<p>In fact, the first 9 values could be anything, including these two examples:</p>
<p style="margin-left: 40px;">34, -8.3, -555, -92, -1, 0, 1, -22, 99<br />
0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9</p>
<p>But to have all 10 values sum to 35, and have a mean of 3.5, the 10th value <em>cannot </em>vary. It must be a specific number:</p>
<p style="margin-left: 40px;">34, -8.3, -555, -92, -1, 0, 1, -22, 99 -----> <span style="color:#FF0000;">10TH value <em>must</em> be 579.3</span><br />
0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 ----> <span style="color:#FF0000;">10TH value <em>must</em> be 30.5</span></p>
<p>Therefore, you have 10 - 1 = 9 degrees of freedom. It doesn’t matter what sample size you use, or what mean value you use—the last value in the sample is not free to vary. You end up with <em>n </em>- 1 degrees of freedom, where <em>n </em>is the sample size.</p>
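<p>If you want to convince yourself of this, the constraint is easy to check with a few lines of Python. This is only an illustrative sketch: the function name <code>forced_last_value</code> and the nine sample values are my own choices, and any nine numbers would work just as well.</p>

```python
def forced_last_value(free_values, mean):
    """Given n - 1 freely chosen values and a required mean,
    return the one value the last observation must take."""
    n = len(free_values) + 1          # total sample size
    required_sum = n * mean           # sum of all values = n x mean
    return required_sum - sum(free_values)


nine_free = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # arbitrary illustrative values
last = forced_last_value(nine_free, 3.5)

print(last)                 # -10.0, because 10 x 3.5 - 45 = -10
print(len(nine_free))       # 9 values were free to vary: n - 1 degrees of freedom
```

<p>Change any of the nine free values and the tenth changes with it, but it never gets a choice of its own.</p>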
<p>Another way to say this is that the number of degrees of freedom equals the number of "observations" minus the number of required relations among the observations (e.g., the number of parameter estimates). For a 1-sample t-test, one degree of freedom is spent estimating the mean, and the remaining <em>n </em>- 1 degrees of freedom estimate variability.</p>
<p>The degrees of freedom then define the specific <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-are-t-values-and-p-values-in-statistics" target="_blank">t-distribution that’s used to calculate the p-values and t-values for the t-test</a>.</p>
<p style="margin-left: 40px;"><img alt="t dist" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/0ec9b126c69082faffa9e133d02c9cc8/t_distribution_df.jpg" style="width: 576px; height: 384px;" /></p>
<p>Notice that for small sample sizes (n), which correspond with smaller degrees of freedom (<em>n </em>- 1 for the 1-sample t test), the t-distribution has fatter tails. This is because the <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/beer-statistics-and-quality" target="_blank">t distribution was specially designed to provide more conservative test results when analyzing small samples (such as in the brewing industry</a>). As the sample size (n) increases, the number of degrees of freedom increases, and the t-distribution approaches a normal distribution.</p>
<strong>Degrees of Freedom: Chi-Square Test of Independence</strong>
<p>Let's look at another context. A chi-square test of independence is used to determine whether two categorical variables are dependent. For this test, the degrees of freedom are the number of cells in the two-way table of the categorical variables that can vary, given the constraints of the row and column marginal totals. So each "observation" in this case is a frequency in a cell.</p>
<p>Consider the simplest example: a 2 x 2 table, with two categories and two levels for each category:</p>
<table>
<tbody>
<tr>
<td><p style="text-align: center;">&nbsp;</p></td>
<td colspan="2"><p style="text-align: center;"><strong>Category A</strong></p></td>
<td><p style="text-align: center;"><strong>Total</strong></p></td>
</tr>
<tr>
<td rowspan="2"><p style="text-align: center;"><strong>Category B</strong></p></td>
<td><p style="text-align: center;"><span style="color:#FF8C00;"><strong>?</strong></span></p></td>
<td><p style="text-align: center;">&nbsp;</p></td>
<td><p style="text-align: center;">6</p></td>
</tr>
<tr>
<td><p style="text-align: center;">&nbsp;</p></td>
<td><p style="text-align: center;">&nbsp;</p></td>
<td><p style="text-align: center;">15</p></td>
</tr>
<tr>
<td><p style="text-align: center;"><strong>Total</strong></p></td>
<td><p style="text-align: center;">10</p></td>
<td><p style="text-align: center;">11</p></td>
<td><p style="text-align: center;">21</p></td>
</tr>
</tbody>
</table>
<p>It doesn't matter what values you use for the row and column marginal totals. Once those values are set, there's only one cell value that can vary (here, shown with the question mark—but it could be any one of the four cells). Once you enter a number for one cell, the numbers for all the other cells are predetermined by the row and column totals. They're not free to vary. So the chi-square test for independence has only 1 degree of freedom for a 2 x 2 table.</p>
<p>Similarly, a 3 x 2 table has 2 degrees of freedom, because only two of the cells can vary for a given set of marginal totals.</p>
<table>
<tbody>
<tr>
<td><p style="text-align: center;">&nbsp;</p></td>
<td colspan="3"><p style="text-align: center;"><strong>Category A</strong></p></td>
<td><p style="text-align: center;"><strong>Total</strong></p></td>
</tr>
<tr>
<td rowspan="2"><p style="text-align: center;"><strong>Category B</strong></p></td>
<td><p style="text-align: center;"><span style="color:#FF8C00;"><strong>?</strong></span></p></td>
<td><p style="text-align: center;"><span style="color:#FF8C00;"><strong>?</strong></span></p></td>
<td><p style="text-align: center;">&nbsp;</p></td>
<td><p style="text-align: center;">15</p></td>
</tr>
<tr>
<td><p style="text-align: center;">&nbsp;</p></td>
<td><p style="text-align: center;">&nbsp;</p></td>
<td><p style="text-align: center;">&nbsp;</p></td>
<td><p style="text-align: center;">15</p></td>
</tr>
<tr>
<td><p style="text-align: center;"><strong>Total</strong></p></td>
<td><p style="text-align: center;">10</p></td>
<td><p style="text-align: center;">11</p></td>
<td><p style="text-align: center;">9</p></td>
<td><p style="text-align: center;">30</p></td>
</tr>
</tbody>
</table>
<p>If you experimented with different sized tables, eventually you’d find a general pattern. For a table with <em>r </em>rows and <em>c </em>columns, the number of cells that can vary is (<em>r</em>-1)(<em>c</em>-1). And that’s the formula for the degrees of freedom for the chi-square test of independence!</p>
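<p>Here is a small Python sketch of the same idea, using the marginal totals from the 2 x 2 table above (rows 6 and 15, columns 10 and 11). The function names and the trial value of 4 for the free cell are my own, chosen only for illustration.</p>

```python
def chi_square_df(n_rows, n_cols):
    """Degrees of freedom for a chi-square test of independence."""
    return (n_rows - 1) * (n_cols - 1)


def complete_2x2(top_left, row_totals, col_totals):
    """Fill in a 2 x 2 table once a single cell has been chosen.

    With the marginal totals fixed, the other three cells follow
    by subtraction, so they are not free to vary.
    """
    a = top_left
    b = row_totals[0] - a      # rest of row 1
    c = col_totals[0] - a      # rest of column 1
    d = row_totals[1] - c      # rest of row 2
    return [[a, b], [c, d]]


# Choose 4 for the single free cell (an arbitrary illustrative value).
table = complete_2x2(4, row_totals=(6, 15), col_totals=(10, 11))
print(table)                 # [[4, 2], [6, 9]]
print(chi_square_df(2, 2))   # 1
print(chi_square_df(2, 3))   # 2
```

<p>Pick a different value for the free cell and the other three cells change with it, but you never get to choose more than one.</p>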
<p>The degrees of freedom then define the chi-square distribution used to evaluate independence for the test.</p>
<p style="margin-left: 40px;"><img alt="chi square" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/e99af33c511f72e0a8e9982a2299c78d/chi_square_dfs.jpg" style="width: 576px; height: 384px;" /></p>
<p>The chi-square distribution is positively skewed. As the degrees of freedom increase, it approaches the normal curve.</p>
<strong>Degrees of Freedom: Regression</strong>
<p>Degrees of freedom is more involved in the context of regression. Rather than risk losing the one remaining reader still reading this post (hi, Mom!), I'll cut to the chase. </p>
<p>Recall that degrees of freedom generally equals the number of observations (or pieces of information) minus the number of parameters estimated. When you perform regression, a parameter is estimated for every term in the model, and each one consumes a degree of freedom. Therefore, including excessive terms in a multiple regression model reduces the degrees of freedom available to estimate the parameters' variability. In fact, if the amount of data isn't sufficient for the number of terms in your model, there may not even be enough degrees of freedom (DF) for the error term, and no p-values or F-values can be calculated at all. You'll get output something like this:</p>
<p style="margin-left: 40px;"><img alt="regression output" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/ba6a552e-3bc0-4eed-9c9a-eae3ade49498/Image/98625baacb490cbde1c3fb60664fb310/regression_output_dof.jpg" style="width: 656px; height: 213px;" /></p>
<p><a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/doe/basics/f--and-p-values-that-are-shown-as-asterisks/" target="_blank">If this happens</a>, you either need to collect more data (to increase the degrees of freedom) or drop terms from your model (to reduce the number of degrees of freedom required). So degrees of freedom does have real, tangible effects on your data analysis, despite existing in the netherworld of the domain of a random vector.</p>
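<p>The bookkeeping can be sketched in a couple of lines of Python. This assumes an ordinary least squares model in which each term costs exactly one degree of freedom and the constant costs one more; the function name <code>error_df</code> and the numbers are hypothetical, not taken from the output shown above.</p>

```python
def error_df(n_observations, n_terms, has_constant=True):
    """Degrees of freedom left over for the error term.

    Assumes each model term consumes one degree of freedom,
    plus one more for the constant if the model has one.
    """
    n_parameters = n_terms + (1 if has_constant else 0)
    return n_observations - n_parameters


print(error_df(n_observations=20, n_terms=3))  # 16: plenty left to estimate error
print(error_df(n_observations=6, n_terms=5))   # 0: nothing left, so no F- or p-values
```

<p>Collecting more data raises the first argument, and dropping terms lowers the second. Either one moves the result back above zero.</p>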
Follow-up
<p>This post provides a basic, informal introduction to degrees of freedom in statistics. If you want to further your conceptual understanding of degrees of freedom, check out <a href="http://courses.ncssm.edu/math/Stat_Inst/PDFS/DFWalker.pdf" target="_blank">this classic paper in the <em>Journal of Educational Psychology</em></a> by Dr. Helen Walker, an associate professor of education at Columbia who was the first female president of the American Statistical Association. Another good general reference is by Pandey, S., and Bright, C. L., <em>Social Work Research</em>, Vol. 32, No. 2, June 2008, available <a href="http://www.montefiore.ulg.ac.be/~kvansteen/MATH0008-2/ac20122013/Class11Dec/Supplementary%20info_AppendixD_DegreesOfFreedom.pdf" target="_blank">here</a>.</p>
Data Analysis, Learning, Statistics, Stats
Fri, 08 Apr 2016 12:00:00 +0000
http://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-are-degrees-of-freedom-in-statistics
Patrick Runkel
Best Way to Analyze Likert Item Data: Two Sample T-Test versus Mann-Whitney
http://blog.minitab.com/blog/adventures-in-statistics/best-way-to-analyze-likert-item-data%3A-two-sample-t-test-versus-mann-whitney
<p><img alt="Worksheet that shows Likert data" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/6b1cf78b969699ed58febb026d32051d/likert_worksheet.png" style="float: right; width: 162px; height: 265px; margin: 10px 15px;" />Five-point Likert scales are commonly associated with surveys and are used in a wide variety of settings. You’ve run into the Likert scale if you’ve ever been asked whether you strongly agree, agree, neither agree nor disagree, disagree, or strongly disagree about something. The worksheet to the right shows what five-point Likert data look like when you have two groups.</p>
<p>Because Likert item data are discrete, ordinal, and have a limited range, there’s been a longstanding dispute about the most valid way to analyze Likert data. The basic choice is between <a href="http://blog.minitab.com/blog/adventures-in-statistics/choosing-between-a-nonparametric-test-and-a-parametric-test" target="_blank">a parametric test and a nonparametric test</a>. The pros and cons for each type of test are generally described as the following:</p>
<ul>
<li>Parametric tests, such as the 2-sample t-test, assume a normal, continuous distribution. However, with a sufficient sample size, t-tests are robust to departures from normality.</li>
<li>Nonparametric tests, such as the Mann-Whitney test, do not assume a normal or a continuous distribution. However, there are concerns about a lower ability to detect a difference when one truly exists.</li>
</ul>
<p>What’s the better choice? This is a real-world decision that users of <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">statistical software</a> have to make when they want to analyze Likert data.</p>
<p>Over the years, a number of studies have tried to answer this question. However, they’ve tended to look at a limited number of potential distributions for the Likert data, which limits the generalizability of the results. Thanks to increases in computing power, simulation studies can now thoroughly assess a wide range of distributions.</p>
<p>In this blog post, I highlight a simulation study conducted by de Winter and Dodou* that compares the capabilities of the two sample t-test and the Mann-Whitney test to analyze five-point Likert items for two groups. Is it better to use one analysis or the other?</p>
<p>The researchers identified a diverse set of 14 distributions that are representative of actual Likert data. The computer program drew independent pairs of samples to test all possible combinations of the 14 distributions. All in all, 10,000 random samples were generated for each of the 98 distribution combinations! The pairs of samples were analyzed using both the two sample t-test and the Mann-Whitney test to compare how well each test performs. The study also assessed different sample sizes.</p>
<p>The results show that for all pairs of distributions the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/basics/type-i-and-type-ii-error/" target="_blank">Type I (false positive) error rates</a> are very close to the target amounts. In other words, if you use either analysis and your results are statistically significant, you don’t need to be overly concerned about a false positive.</p>
<p>The results also show that for most pairs of distributions, the difference between the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/power-and-sample-size/what-is-power/" target="_blank">statistical power</a> of the two tests is trivial. In other words, if a difference truly exists at the population level, either analysis is equally likely to detect it. The concerns about the Mann-Whitney test having less power in this context appear to be unfounded.</p>
<p>I do have one caveat. There are a few pairs of specific distributions where there is a power difference between the two tests. If you perform both tests on the same data and they disagree (one is significant and the other is not), you can look at a table in the article to help you determine whether a difference in statistical power might be an issue. This power difference affects only a small minority of the cases.</p>
<p>Generally speaking, the choice between the two analyses is a tie. If you need to compare two groups of five-point Likert data, it usually doesn’t matter which analysis you use. Both tests almost always provide the same protection against false negatives and always provide the same protection against false positives. These patterns hold true for sample sizes of 10, 30, and 200 per group.</p>
<p>*de Winter, J.C.F. and D. Dodou (2010), Five-Point Likert Items: t test versus Mann-Whitney-Wilcoxon, <em>Practical Assessment, Research and Evaluation</em>, 15(11).</p>
Data Analysis, Hypothesis Testing, Statistics, Statistics Help
Wed, 06 Apr 2016 12:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/best-way-to-analyze-likert-item-data%3A-two-sample-t-test-versus-mann-whitney
Jim Frost
Linking Minitab to Excel to Get Fast Answers
http://blog.minitab.com/blog/marilyn-wheatleys-blog/linking-minitab-to-excel-to-get-fast-answers
<p style="line-height: 20.8px;">Since opening a new <a href="https://www.minitab.com/en-us/Press-Releases/Minitab-Inc--Expands-to-Phoenix,-Arizona/">office in Phoenix</a> to support our customers on the West Coast, some evenings in Minitab technical support feel busier than others. (By evenings, I mean after 5:30 p.m. Eastern time, when the members of our tech support team in Pennsylvania go home for the day, and I become an office of one.)</p>
<p style="line-height: 20.8px;">The variability in terms of days that felt extremely busy versus days that didn’t seemed unpredictable, so I decided to keep track of that information in an Excel spreadsheet, which I’ve been updating each evening:</p>
<p style="line-height: 20.8px; margin-left: 40px;"><img height="302" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/5fe16943903bc742a614cfce5f013619/5fe16943903bc742a614cfce5f013619.png" width="291" /></p>
<p style="line-height: 20.8px;">After gathering this information for several months, I used the data to make a few graphs in Minitab to see if any particular days were busiest. The graphs were fun, but not exactly what I needed. I wanted an easy way to make Minitab produce the graphs automatically each morning, so that they reflect the most up-to-date information.</p>
<p style="line-height: 20.8px;">In this post, I’ll show you the steps I took to link my Excel file to my Minitab worksheet and how I automated the generation of the graphs. You can do the same thing with any <span><a href="http://blog.minitab.com/blog/the-statistics-of-science/minitab-and-excel-making-the-data-connection">data you record regularly in Excel</a></span> spreadsheets. </p>
<span style="line-height: 20.8px;">Creating a DDE Link from Excel to Minitab</span>
<p style="line-height: 20.8px;">The first step was to create a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/minitab-environment/input-output/open-files-and-import-data/exchange-data-by-using-dde/">DDE link</a> from Excel to Minitab. To do that, I highlighted and copied a range of cells from Excel, beginning with the first row of data, and extending well beyond my last row of data (I went down to row 500):</p>
<p style="line-height: 20.8px; margin-left: 40px;"><img border="0" height="233" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/cfd5e8f132972de92b244ea2b8b5e46a/cfd5e8f132972de92b244ea2b8b5e46a.png" width="387" /></p>
<p style="line-height: 20.8px;">After copying the data from Excel, I navigated to my Minitab worksheet, clicked in the column where I wanted to link the data, and then used <strong>Edit</strong> > <strong>Paste Link</strong>:</p>
<p style="line-height: 20.8px; margin-left: 40px;"><img border="0" height="323" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/7f36e7c1e9bf84153b068ab0a31a4772/7f36e7c1e9bf84153b068ab0a31a4772.png" width="431" /></p>
<p style="line-height: 20.8px;">Once the link is created, the data is automatically imported from Excel into Minitab:</p>
<p style="line-height: 20.8px; margin-left: 40px;"><img border="0" height="218" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/9eabcee5d7459bb1e5464f02e14cd8c1/9eabcee5d7459bb1e5464f02e14cd8c1.png" width="162" /></p>
<p style="line-height: 20.8px;">Since I have three columns to link from Excel to Minitab, I repeated the copy/paste process for the two other columns, until all three columns were linked. I also added titles to the columns in my Minitab worksheet:</p>
<p style="line-height: 20.8px; margin-left: 40px;"><img border="0" height="202" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/eccb3c0f0cefcf511fb8d6d8e7aca28f/eccb3c0f0cefcf511fb8d6d8e7aca28f.png" width="443" /></p>
<p style="line-height: 20.8px;">Now with the links in place, any time I update my Excel file, the data is automatically updated in Minitab. Since the data is being transferred from Excel to Minitab, one important thing to remember is that for these links to continue working, Excel must be opened <strong>before</strong> opening Minitab each day.</p>
Adding a Macro to the DDE Links
<p style="line-height: 20.8px;">As a next step, I created a <a href="http://support.minitab.com/minitab/17/macro-library/macro-help/macro-basics/">Minitab macro</a> with the commands needed to manipulate the data that is imported and generate the graphs.</p>
<p style="line-height: 20.8px;">After saving the commands for the graphs I wanted to create in a GMACRO titled <strong>busydays.mac</strong>, I used the <strong>Edit</strong> menu shown below to add my macro to my DDE link:</p>
<p style="line-height: 20.8px; margin-left: 40px;"><img border="0" height="325" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/6df9160090573e29d26a506a298b570a/6df9160090573e29d26a506a298b570a.png" width="455" /></p>
<p style="line-height: 20.8px;">The <strong>Manage Links</strong> menu shows the links for each column in the order in which the columns were linked. First I linked C1, then C2, and then C3, so the last link listed corresponds to C3, which is the last column of data that is imported. Therefore, I’ll add my macro to that link so that my graphs will be generated after all the data is imported by highlighting that option and clicking the <strong>Change</strong> button:</p>
<p style="line-height: 20.8px; margin-left: 40px;"><img border="0" height="225" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/8eb6bd0cefbc354ac71c018e1078da45/8eb6bd0cefbc354ac71c018e1078da45.png" width="460" /></p>
<p style="line-height: 20.8px;">After opening the link, I just added the macro to the <strong>Commands </strong>field—the % symbol tells Minitab to look for the <strong>Busydays</strong> macro in my default macro location. Finally I clicked the <strong>Change</strong> button to save the change to the link:</p>
<p style="line-height: 20.8px; margin-left: 40px;"><img border="0" height="322" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/d95ba957dce186152b05aeee71be393e/d95ba957dce186152b05aeee71be393e.png" width="504" /></p>
<p style="line-height: 20.8px;">As a final step, I saved the Minitab project file with all the links that I added.</p>
<p style="line-height: 20.8px;">Now each morning when I come to the office, I open the Excel file first, then open my Minitab project file and I just watch the magic happen:</p>
<p style="line-height: 20.8px; margin-left: 40px;"><img border="0" height="338" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/5b7dc8015abf32b63ea6d7f8cd29a6ab/5b7dc8015abf32b63ea6d7f8cd29a6ab.png" width="600" /></p>
Data Analysis, Fun Statistics, Project Tools
Mon, 04 Apr 2016 12:31:00 +0000
http://blog.minitab.com/blog/marilyn-wheatleys-blog/linking-minitab-to-excel-to-get-fast-answers
Marilyn Wheatley
What Are the Odds? Chutes and Ladders
http://blog.minitab.com/blog/fun-with-statistics/what-are-the-odds-chutes-and-ladders
<p>Allow me to make a confession up front: I won't hesitate to beat my kids at a game.</p>
<p><img alt="playing-chutes" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/9f8d7c8a83739d64331c25aabd1fd859/playing_chutes.png" style="margin: 10px 15px; float: right; width: 250px; height: 208px;" />My kids are young enough that in pretty much any game that is predominantly determined by skill and not luck, I can beat them, and beat them easily. This isn't some macho thing where it makes me feel good, and I suppose it is only partially based in wanting them to handle both winning and losing well. It's just how I play, and in any event most kids' games are designed with enough luck that a young child has a chance to beat an adult. I've observed other parents who let their kids win at these games of skill, maybe not every time, but enough to make the game seem more fair than it really is.</p>
<p>For both me and my bigger-hearted friends, <a href="https://en.wikipedia.org/wiki/Snakes_and_Ladders" target="_blank">Chutes and Ladders</a> is a fantastic game. Why? It's not 10% luck, or 50%, or even 90% luck. Chutes and Ladders is<em> 100%</em> luck. You can't be good at it or bad at it. You can't try to win or lose. I can have no mercy and want to win as much as anything and it won't help me at all. And a parent who can't stand the idea of beating a kid at a game can (I hope) let that stress go, because they couldn't let their opponent win if they wanted to.</p>
<p>Modeling human decision-making with statistics can be very tough. If I wanted to make a simulation of three people playing Monopoly, for example, you can imagine the complexity in doing so accurately. But modeling a game like Chutes and Ladders is much easier. So much easier, in fact, that I went ahead and did it in <a href="http://www.minitab.com/products/minitab">Minitab Statistical Software</a> using a concept known as <a href="https://en.wikipedia.org/wiki/Markov_chain" target="_blank">Markov Chains</a>.</p>
<p>So let's learn a little bit about the odds associated with Chutes and Ladders...</p>
How the Odds Work in Chutes and Ladders
<p>Here is a standard Chutes and Ladders board:</p>
<p style="text-align: center;"><img alt="chutes" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/5d6cecd3ebc951a3d5888086a1c3173f/chutes.png" style="width: 604px; height: 613px;" /></p>
<p>The game is simple: you roll a die (or use a spinner) that we can assume gives you an equal probability of obtaining a 1, 2, 3, 4, 5, or 6, and that's the number of spaces you move from your current position. If you stop on a space with a ladder that extends up, you go up the ladder. If you stop on a space with a chute that goes down, you descend. Ultimately, you must land exactly on the 100 space to win the game.</p>
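<p>To make the Markov chain idea concrete, here is a minimal Python sketch of these rules (the analysis in this post was done in Minitab, not with this code). The chute and ladder positions are an assumption: they follow the commonly published Milton Bradley layout, so check them against the board pictured above. The sketch also assumes that a spin that would overshoot 100 leaves you where you are.</p>

```python
# Assumed board layout (commonly published Milton Bradley version); not taken from the post.
LADDERS = {1: 38, 4: 14, 9: 31, 21: 42, 28: 84, 36: 44, 51: 67, 71: 91, 80: 100}
CHUTES = {16: 6, 47: 26, 49: 11, 56: 53, 62: 19, 64: 60, 87: 24, 93: 73, 95: 75, 98: 78}
JUMPS = {**LADDERS, **CHUTES}
GOAL = 100


def step(dist):
    """Advance a probability distribution over spaces 0..100 by one spin."""
    new = [0.0] * (GOAL + 1)
    new[GOAL] = dist[GOAL]  # a player who has reached 100 stays there
    for space in range(GOAL):
        p = dist[space]
        if p == 0.0:
            continue
        for roll in range(1, 7):
            target = space + roll
            if target > GOAL:
                target = space  # assumed rule: overshooting 100 wastes the spin
            else:
                target = JUMPS.get(target, target)  # follow a chute or ladder
            new[target] += p / 6
    return new


def finish_cdf(max_turns=300):
    """cdf[t] = probability that a single player has reached 100 within t spins."""
    dist = [0.0] * (GOAL + 1)
    dist[0] = 1.0  # every player starts off the board
    cdf = [0.0]
    for _ in range(max_turns):
        dist = step(dist)
        cdf.append(dist[GOAL])
    return cdf


cdf = finish_cdf()
fastest = next(t for t, p in enumerate(cdf) if p > 0)
median = next(t for t, p in enumerate(cdf) if p >= 0.5)
mean = sum(1 - p for p in cdf)  # E[T] = sum of P(T > t), truncated at max_turns
print("fastest finish:", fastest, "with probability", cdf[fastest])
print("median:", median, "mean:", round(mean, 1))
```

<p>Each call to <code>step</code> is one multiplication by the transition matrix, written out as loops so no matrix library is needed. The truncation at 300 spins is a convenience; the probability of a longer single-player game is tiny.</p>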
The Fastest (and Slowest) Win
<p>The fewest turns you can possibly use to reach the 100 space is 7, and this only occurs in 0.151% of player-games, or about 1 in 662. Now those are the odds for <em>each </em>player. If you have multiple players in the game (and I hope no one plays Chutes and Ladders alone), then the odds are higher that at least one player finishes in the minimum 7 moves. Specifically, the odds are 1-(1-0.00151)<sup>n</sup>, where n is the number of players. For example, with three players the odds have jumped to a whopping 0.452%, or about 1 in 221 games.</p>
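<p>The "at least one player" calculation is easy to check by hand or with a few lines of Python. This is only an illustrative sketch; the function name is made up for the example, and the 0.00151 figure is the single-player value quoted above.</p>

```python
def at_least_one(p: float, players: int) -> float:
    """Probability that at least one of `players` independent players succeeds."""
    return 1 - (1 - p) ** players


p7 = 0.00151  # single-player chance of a 7-turn finish, as quoted in the post
for n in (1, 2, 3, 4):
    prob = at_least_one(p7, n)
    print(f"{n} player(s): {prob:.3%}, about 1 in {1 / prob:.0f}")
```

<p>The calculation assumes the players' spins are independent of one another, which is reasonable for a game with no interaction between players.</p>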
<p>Technically you could have an unlimited number of turns without ever reaching the 100 space, so there is no theoretical "most" turns it can take. But obviously as you take more turns, the odds of not winning keep decreasing. So instead let's just look at some cutoffs:</p>
<ul>
<li>90% of player-games will finish before turn 72</li>
<li>95% of player-games will finish before turn 89</li>
<li>99% of player-games will finish before turn 128</li>
</ul>
<p>Again, that's for just one player. For multiple players, here is a graph of the number of spins until there is a winner and the corresponding odds:</p>
<p style="text-align: center;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/46889f0e-f0a5-4b4a-8a19-2d2b8dce6087/Image/3e678517df9e8ffcf82fbef6005e1f2d/spins_to_win.png" style="width: 576px; height: 384px;" /></p>
<p>So if you're not looking for a drawn-out game, it helps immensely to recruit more players! With four players, the 99th percentile drops from 128 spins to just 44 spins.</p>
How Many Moves Are Expected?
<p>Now that the extreme cases have been covered, consider the more basic question of how many moves are expected before the game is won. First it's worth considering whether an expected value (the mean) is truly desired, or the median value. For one player, it is simple to answer both:</p>
<ul>
<li>The mean number of moves before a single player wins is about 39.</li>
<li>The median number of moves is 32.</li>
</ul>
<p>Based on this it can be seen that the distribution of moves is skewed right and the extremely long games are raising the mean number of moves well above the median.</p>
<p>But what about a game with more than one player?</p>
<ul>
<li>A game with two players would end in a median 23 moves and a mean 26.3 moves.</li>
<li>A game with three players would end in a median 20 moves and a mean 21.7 moves.</li>
<li>A game with four players would end in a median 18 moves and a mean 19.3 moves.</li>
</ul>
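<p>The multi-player figures follow from the single-player distribution. If F(t) is the chance that one player has finished within t moves, and the players are independent, the chance that a game with n players has a winner within t moves each is 1-(1-F(t))<sup>n</sup>. Here is a rough Python sketch of that step. The helper names are hypothetical, and the input is a stand-in distribution rather than the real Chutes and Ladders one, so the printed numbers will not match the lists above.</p>

```python
def game_cdf(single_cdf, players):
    """CDF of the move on which the first of `players` independent players finishes."""
    return [1 - (1 - f) ** players for f in single_cdf]


def percentile(cdf, q):
    """Smallest move count t with cdf[t] >= q."""
    return next(t for t, f in enumerate(cdf) if f >= q)


def mean_moves(cdf):
    """E[T] = sum of P(T > t) for t >= 0, truncated at the end of the list."""
    return sum(1 - f for f in cdf)


# Stand-in single-player CDF: a 5% chance to finish on each move (hypothetical, not the real game).
single = [1 - 0.95 ** t for t in range(2000)]
for n in (1, 2, 3, 4):
    cdf = game_cdf(single, n)
    print(n, "players: median", percentile(cdf, 0.5),
          "99th percentile", percentile(cdf, 0.99),
          "mean", round(mean_moves(cdf), 1))
```

<p>Feeding in the real single-player CDF in place of <code>single</code> would give the medians, means, and percentiles for two, three, or four players, counting moves per player.</p>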
<p>As the <a href="http://blog.minitab.com/blog/rkelly/weight-for-it-a-healthy-application-of-the-central-limit-theorem">central limit theorem</a> would predict, the distribution becomes less skewed as the number of players increases, and therefore the mean is closer to the median. So while increasing players decreases the median number of moves only a little bit, it greatly reduces the chances that a game will require a large number of turns.</p>
All Spaces Are Not Created Equally
<p>I once found myself maybe 20 spins into a game, and yet still on the bottom row. Space 6, to be exact. Not surprisingly, if you examine the board above, you find a chute that ends on space 6. As it turns out, not only is there a path that can get you all the way to the 100 spot in just a few spins, there is also a path that can be devastating. Specifically, you can take a spin while on space 97, and within a few spins find yourself back to space 6 thanks to an unfortunate series of chutes.</p>
<p>The chutes and ladders on the board mean some spaces are much more likely to have a player on them at any given time than others. Consider a ladder: there are multiple spaces you can spin from and land at the bottom of a particular ladder, but everyone who lands on that ladder ends up on the same spot. To illustrate the distribution of spaces occupied after a certain number of spins by a single player, I created a <a href="http://blog.minitab.com/blog/starting-out-with-statistical-software/introducing-the-bubble-plot">bubble plot</a> where the size of each bubble corresponds to the probability of a player being on that space after each of the first 40 spins:</p>
<p style="text-align: center;"><img alt="" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/46889f0e-f0a5-4b4a-8a19-2d2b8dce6087/Image/e1fe0c91ef05782dc77b2c2f52120358/bubble_plot_of_space_vs_spin.png" style="width: 576px; height: 384px;" /></p>
<p>From the plot it can be seen that although the odds are low, even after 40 spins you might find yourself still on space 6. Rows of bubbles larger than those around them correspond to the end points of the various chutes and ladders, giving them higher probability than other points.</p>
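<p>The probabilities behind a plot like this can be computed by pushing the starting position through the game's transitions one spin at a time. The Python sketch below is one way to do it, not the Minitab work behind the plot. It assumes the commonly published Milton Bradley chute and ladder positions (check them against the board image) and assumes a spin that overshoots 100 leaves the player in place.</p>

```python
# Assumed chute and ladder positions (commonly published layout); not taken from the post.
JUMPS = {1: 38, 4: 14, 9: 31, 21: 42, 28: 84, 36: 44, 51: 67, 71: 91, 80: 100,
         16: 6, 47: 26, 49: 11, 56: 53, 62: 19, 64: 60, 87: 24, 93: 73, 95: 75, 98: 78}


def occupancy(spins):
    """Probability of a single player being on each space 0..100 after `spins` spins."""
    dist = [0.0] * 101
    dist[0] = 1.0  # start off the board
    for _ in range(spins):
        new = [0.0] * 101
        new[100] = dist[100]  # finished players stay on 100
        for space in range(100):
            p = dist[space]
            if p == 0.0:
                continue
            for roll in range(1, 7):
                target = space + roll
                if target > 100:
                    target = space  # assumed rule: overshooting wastes the spin
                else:
                    target = JUMPS.get(target, target)
                new[target] += p / 6
        dist = new
    return dist


after_40 = occupancy(40)
print("P(on space 6 after 40 spins):", after_40[6])
print("P(finished within 40 spins):", after_40[100])
```

<p>Calling <code>occupancy(k)</code> for k from 1 to 40 gives one column of bubble sizes per spin. Spaces at the top of a chute or the bottom of a ladder always come out as zero, because nobody ever rests there.</p>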
A Statistician's Take on Chutes and Ladders
<p>While certain games like blackjack or poker might allow a player to improve their ability by understanding the odds, Chutes and Ladders is entirely based on luck and no such advantage can be gained. However, that doesn't mean there is nothing to learn by examining the odds. For example, if you really dislike the game but have a young child who <em>always </em>wants to play, you now know that encouraging another parent or a sibling to join in can really help prevent a never-ending game!</p>
<p style="font-size:10px;"><em>Game-play photo by <a href="https://www.flickr.com/photos/benhusmann/3120095949" target="_blank">Ben Husmann</a>, used under Creative Commons 2.0. </em></p>
Fun Statistics, Learning
Fri, 01 Apr 2016 14:00:00 +0000
http://blog.minitab.com/blog/fun-with-statistics/what-are-the-odds-chutes-and-ladders
Joel Smith
Are You Putting the Data Cart Before the Horse? Best Practices for Prepping Data for Analysis, ...
http://blog.minitab.com/blog/meredith-griffith/are-you-putting-the-data-cart-before-the-horse-best-practices-for-prepping-data-for-analysis%2C-part-1
<p>Most of us have heard a backwards way of completing a task, or doing something in the conventionally wrong order, described as “putting the cart before the horse.” That’s because a horse pulling a cart is much more efficient than a horse pushing a cart.</p>
<p><img alt="cart before horse" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ec1fbea4785510ea0e0a9997c1669c68/cart_horse.png" style="margin: 10px 15px; float: right; width: 350px; height: 206px;" />This saying may be especially true in the world of statistics. Focusing on a statistical tool or analysis before checking out the condition of your data is one way you may be putting the cart before the horse. You may then find yourself trying to force your data to fit an analysis, particularly when the data has not been set up properly. It’s far more efficient to first make sure your <a href="http://blog.minitab.com/blog/understanding-statistics/the-single-most-important-question-in-every-statistical-analysis">data are reliable</a> and then allow your questions of interest to guide you to the right analysis.</p>
<p>Spending a little quality time with your data up front can save you from wasting a lot of time on an analysis that either can’t work—or can’t be trusted.</p>
<p>As a quality practitioner, you’re likely to be involved in many activities—establishing quality requirements for external suppliers, monitoring product quality, reviewing product specifications and ensuring they are met, improving process efficiency, and much more.</p>
<p>All of these tasks will involve data collection and statistical analysis with software such as Minitab. For example, suppose you need to perform a <a href="http://blog.minitab.com/blog/meredith-griffith/fundamentals-of-gage-rr">Gage R&R</a> study to verify your measurement systems are valid, or you need to understand how machine failures impact downtime.</p>
<p>Rather than jumping right into the analysis, you will be at an advantage if you take time to look at your data. Ask yourself questions such as:</p>
<ul>
<li>What problem am I trying to solve?</li>
<li>Is my data set up in a way that will be useful for answering my question?</li>
<li>Did I make any mistakes while recording my data?</li>
</ul>
<p>Utilizing process knowledge can also help you answer questions about your data and identify data entry errors. A focus on preparing and exploring your data prior to an analysis will not only save you time in the long run, but will help you obtain reliable results.</p>
<p>So then, where to begin with best practices for prepping data for an analysis? Let’s look no further than your data.</p>
Clean your data before you analyze it
<p>Let’s assume you already know what problem you’re trying to solve with your data. For instance, you are the area supervisor of a manufacturing facility, and you’ve been experiencing lower productivity than usual on the machines in your area and want to understand why. You have collected data on these machines, recording the amount of time a machine was out of operation, the reason for the machine being down, the shift number when the machine went down, and the speed of the machine when it went down.</p>
<p>The first step toward answering your question is to ensure your data are clean. Cleaning your data before you begin an analysis can save time by preventing rework, such as reformatting data or correcting data entry errors, after you’ve already begun the analysis. Data cleaning is also essential to ensure your analyses and results—and the decisions you make—are reliable.</p>
<p>With the <a href="https://www.minitab.com/en-us/support/minitab/minitab-17.3.1-update/" style="line-height: 20.8px;">latest update to Minitab 17</a><span style="line-height: 20.8px;">, an improved data import helps you identify and correct case mismatches, fix improperly formatted columns, represent missing data accurately and in a manner that is recognized by the software, remove blank rows and extra spaces, and more. When importing your data, you see a preview of your data as a reminder to ensure it’s in the best possible state before it finds its way into Minitab. This preview helps you spot mistakes you have made in your data collection, and automatically corrects mistakes you don’t notice or that are difficult to find in large data sets.</span></p>
<p><img alt="Data Import" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/dae6c7b7-fc22-4616-9d65-f04909c20ab1/Image/b1c679056c60ac2fa82f37e1f1de406b/data_import.jpg" style="width: 775px; height: 655px;" /></p>
<p><em>Minitab offers a data import dialog that helps you quickly clean and format your data before importing into the software, ensuring your data are trustworthy and allowing you to get to your analysis sooner.</em></p>
<p><span style="line-height: 20.8px;">If you’d rather copy and paste your data from Excel, Minitab will ensure you paste your data in the right place. For instance, if your data have column names and you accidentally paste your data into the first row of the worksheet, your data will all be formatted as text—even when the data following your column names are numeric! With </span><a href="https://www.minitab.com/en-us/products/minitab/whats-new/" style="line-height: 20.8px;">Minitab 17.3</a><span style="line-height: 20.8px;">, you will receive an alert that your data is in the wrong place, and Minitab will automatically move your data where it belongs. This alert ensures your data are formatted properly, preventing you from running into the problem during an analysis and saving you time manually correcting every improperly formatted column.</span></p>
<p><img alt="Copy Paste Warning" src="http://cdn.app.compendium.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/dae6c7b7-fc22-4616-9d65-f04909c20ab1/Image/5df941ffaa491a0072261aef075a19d6/copy_paste_warning.jpg" style="width: 431px; height: 299px;" /></p>
<p><em>Pasting your Excel data in the first row of a Minitab worksheet will trigger this warning, which safeguards against improperly formatted columns.</em></p>
<p><span style="line-height: 1.6;">This is only the beginning! Minitab makes it quick and painless to begin exploring and visualizing your data, offering more insights and ease once you get to the analysis. If you’d like to learn additional best practices for prepping your data for any analysis, stay tuned for my next post where I’ll offer tips for exploring and drawing insights from your data!</span></p>
Data Analysis, Statistics
Wed, 30 Mar 2016 14:05:04 +0000
http://blog.minitab.com/blog/meredith-griffith/are-you-putting-the-data-cart-before-the-horse-best-practices-for-prepping-data-for-analysis%2C-part-1
Meredith Griffith