Data Analysis Software | Minitab
Understanding Qualitative, Quantitative, Attribute, Discrete, and Continuous Data Types
<p>"Data! Data! Data! I can't make bricks without clay."<br />
— Sherlock Holmes, in Arthur Conan Doyle's <em>The Adventure of the Copper Beeches</em></p>
<p>Whether you're the world's greatest detective trying to crack a case or a person trying to solve a problem at work, you're going to need information. Facts. <em>Data</em>, as Sherlock Holmes says. </p>
<p><img alt="jujubes" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/96d7c87addccc11b6072d6dfa38d0039/jujubes.jpg" style="line-height: 20.7999992370605px; margin: 10px 15px; float: right; width: 200px; height: 200px;" /></p>
<p>But not all data is created equal, especially if you plan to analyze it as part of a quality improvement project.</p>
<p>If you're using Minitab Statistical Software, you can access the Assistant to <a href="http://www.minitab.com/products/minitab/assistant">guide you through your analysis step-by-step</a>, and help identify the type of data you have.</p>
<p>But it's still important to have at least a basic understanding of the different types of data, and the kinds of questions you can use them to answer. </p>
<p>In this post, I'll provide a basic overview of the types of data you're likely to encounter, and we'll use a box of my favorite candy—<a href="http://en.wikipedia.org/wiki/Jujube_(confectionery)" target="_blank">Jujubes</a>—to illustrate how we can gather these different kinds of data, and what types of analysis we might use it for. </p>
The Two Main Flavors of Data: Qualitative and Quantitative
<p>At the highest level, two kinds of data exist: <em><strong>quantitative</strong></em> and <em><strong>qualitative</strong></em>.</p>
<p><strong><em>Quantitative</em> </strong>data deals with numbers and things you can measure objectively: dimensions such as height, width, and length. Temperature and humidity. Prices. Area and volume.</p>
<p><strong><em>Qualitative </em></strong>data deals with characteristics and descriptors that can't be easily measured, but can be observed subjectively—such as smells, tastes, textures, attractiveness, and color. </p>
<p>Broadly speaking, when you measure something and give it a number value, you create quantitative data. When you classify or judge something, you create qualitative data. So far, so good. But this is just the highest level of data: there are also different types of quantitative and qualitative data.</p>
Quantitative Flavors: Continuous Data and Discrete Data
<p>There are two types of quantitative data, which is also referred to as numeric data: <em><strong>continuous</strong></em> and <em><strong>discrete</strong></em>. As a general rule, <em>counts</em> are discrete and <em>measurements</em> are continuous.</p>
<p><strong><em>Discrete </em></strong>data is a count that can't be made more precise. Typically it involves integers. For instance, the number of children (or adults, or pets) in your family is discrete data, because you are counting whole, indivisible entities: you can't have 2.5 kids, or 1.3 pets.</p>
<p><strong><em>Continuous</em> </strong>data, on the other hand, could be divided and reduced to finer and finer levels. For example, you can measure the height of your kids at progressively more precise scales—meters, centimeters, millimeters, and beyond—so height is continuous data.</p>
<p>If I tally<span style="line-height: 1.6;"> the number of individual Jujubes in a box, that number is a piece of discrete data. </span></p>
<p style="margin-left: 40px;"><img alt="a count of jujubes is discrete data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f5e3c44269356903cf156c065b10746a/jujubes_count_tally.jpg" style="width: 200px; height: 200px;" /></p>
<p><span style="line-height: 1.6;">If I use a scale to measure the weight of each Jujube, or the weight of the entire box, that's continuous data. </span></p>
<p style="margin-left: 40px;"><span style="line-height: 1.6;"><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/d11051162c9e2375e531ac589fd5a20e/jujube_weight_continuous_data.jpg" style="width: 200px; height: 200px;" /></span></p>
<p>Continuous data can be used in many different kinds of <a href="http://blog.minitab.com/blog/understanding-statistics/what-statistical-hypothesis-test-should-i-use">hypothesis tests</a>. For example, to assess the accuracy of the weight printed on the Jujubes box, we could measure 30 boxes and perform a 1-sample t-test. </p>
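<p>As a rough illustration of the 1-sample t-test mentioned above, here is a minimal Python sketch that uses only the standard library. The 30 box weights and the labeled weight are hypothetical values invented for this example, and the helper name <code>one_sample_t</code> is ours rather than anything from Minitab. The sketch compares the t statistic with the standard two-sided 5% critical value for 29 degrees of freedom instead of computing a p-value.</p>

```python
import math
import statistics


def one_sample_t(sample, target):
    """Return (t statistic, degrees of freedom) for H0: population mean == target."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1 in the denominator)
    t_stat = (mean - target) / (sd / math.sqrt(n))
    return t_stat, n - 1


# Hypothetical weights (grams) of 30 boxes; not real measurements.
weights = [
    155.2, 156.1, 154.8, 157.0, 155.9, 156.4, 155.1, 154.6, 156.8, 155.5,
    156.2, 155.7, 154.9, 156.0, 155.3, 157.2, 155.8, 154.7, 156.5, 155.4,
    156.3, 155.0, 155.6, 156.7, 154.5, 156.9, 155.2, 155.9, 156.1, 155.4,
]
labeled_weight = 156.0  # hypothetical weight printed on the box

t_stat, df = one_sample_t(weights, labeled_weight)

# Two-sided critical value of Student's t at alpha = 0.05 with 29 degrees of freedom.
T_CRIT_DF29 = 2.045
print(f"t = {t_stat:.3f} with {df} degrees of freedom")
print("Reject H0" if abs(t_stat) > T_CRIT_DF29 else "Fail to reject H0")
```

<p>Statistical software reports an exact p-value; the fixed critical value here simply keeps the sketch free of third-party dependencies.</p>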
<p>Some analyses use continuous and discrete quantitative data at the same time. For instance, we could perform a <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-tutorial-and-examples">regression analysis</a> to see if the weight of Jujube boxes (continuous data) is correlated with the number of Jujubes inside (discrete data). </p>
Qualitative Flavors: Binary Data, Nominal Data, and Ordinal Data
<p>When you classify or categorize something, you create <em>qualitative</em> or <em>attribute</em> data. There are three main kinds of qualitative data.</p>
<p><em><strong>Binary </strong></em>data places things in one of two mutually exclusive categories: right/wrong, true/false, or accept/reject. </p>
<p>Occasionally, I'll get a box of Jujubes that contains a couple of individual pieces that are either too hard or too dry. If I went through the box and classified each piece as "Good" or "Bad," that would be binary data. I could use this kind of data to develop a statistical model to predict how frequently I can expect to get a bad Jujube.</p>
<p>When collecting <em><strong>unordered </strong></em>or <em><strong>nominal </strong></em>data, we assign individual items to named categories that do not have an implicit or natural value or rank. If I went through a box of Jujubes and recorded the color of each in my worksheet, that would be nominal data. </p>
<p style="margin-left: 40px;"><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ce64d648ac395d5c8098985caabc754f/jujubes_sorted_nominal_data.jpg" style="width: 200px; height: 97px;" /></p>
<p>This kind of data can be used in many different ways. For instance, I could use <a href="http://blog.minitab.com/blog/understanding-statistics/chi-square-analysis-of-halloween-and-friday-the-13th-is-there-a-slasher-movie-gender-gap">chi-square analysis</a> to see if there are statistically significant differences in the amounts of each color in a box. </p>
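<p>To make the chi-square idea concrete, here is a small Python sketch of a goodness-of-fit statistic that tests whether every color is equally likely. The color counts are hypothetical, the function name <code>chi_square_gof</code> is our own, and the "equal proportions" null hypothesis is an assumption chosen for illustration.</p>

```python
def chi_square_gof(observed):
    """Chi-square statistic and degrees of freedom for H0: all categories equally likely."""
    counts = list(observed.values())
    expected = sum(counts) / len(counts)  # same expected count in every category
    stat = sum((count - expected) ** 2 / expected for count in counts)
    return stat, len(counts) - 1


# Hypothetical color counts from one box; not real data.
color_counts = {"red": 14, "green": 9, "yellow": 12, "orange": 8, "violet": 7}

stat, df = chi_square_gof(color_counts)

# Upper 5% critical value of the chi-square distribution with 4 degrees of freedom.
CHI2_CRIT_DF4 = 9.488
print(f"chi-square = {stat:.2f} with {df} degrees of freedom")
print("Colors differ" if stat > CHI2_CRIT_DF4 else "No significant difference")
```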
<p>We also can have <strong><em>ordered </em></strong>or <em><strong>ordinal </strong></em>data, in which items are assigned to categories that do have some kind of implicit or natural order, such as "Short, Medium, or Tall." <span style="line-height: 1.6;">Another example is a survey question that asks us to rate an item on a 1 to 10 scale, with 10 being the best. This implies that 10 is better than 9, which is better than 8, and so on. </span></p>
<p>The appropriate use of ordered data is a matter of some debate among statisticians. Everyone agrees it's appropriate for creating bar charts, but beyond that the answer to the question "What should I do with my ordinal data?" is "It depends." Here's a post from another blog that offers an excellent summary of the <a href="http://learnandteachstatistics.wordpress.com/2013/07/08/ordinal/" target="_blank">considerations involved</a>. </p>
Additional Resources about Data and Distributions
<p>For more fun statistics you can do with candy, check out this article (PDF format): <a href="http://www.minitab.com/uploadedFiles/Content/Academic/sweetening_statistics.pdf">Statistical Concepts: What M&M's Can Teach Us.</a> </p>
<p>For a deeper exploration of the probability distributions that apply to different types of data, check out my colleague Jim Frost's posts about <a href="http://blog.minitab.com/blog/adventures-in-statistics/understanding-and-using-discrete-distributions">understanding and using discrete distributions</a> and <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-identify-the-distribution-of-your-data-using-minitab">how to identify the distribution of your data</a>.</p>
How to Calculate B10 Life with Statistical Software
<p><span style="line-height: 1.6;">Over the last year or so I’ve heard a lot of people asking, “How can I calculate B10 life in Minitab?” Despite being a statistician and industrial engineer (mind you, one who has never been </span><em style="line-height: 1.6;">in</em><span style="line-height: 1.6;"> the field like the customers asking this question) and having taken a reliability engineering course, I’d never heard of B10 life. So I did some research.</span></p>
<p>The B10 life metric originated in the ball and roller bearing industry, but has become a metric used across a variety of industries today. It’s particularly useful in establishing warranty periods for a product. The “BX” or “Bearing Life” nomenclature, which refers to the time at which X% of items in a population will fail, speaks to these roots.</p>
<p>So then, B10 life is the time at which 10% of units in a population will fail. Alternatively, you can think of it as the 90% reliability of a population at a specific point in its lifetime—or the point in time when an item has a 90% probability of survival. The B10 life metric became popular among ball and roller bearing makers due to the industry’s strict requirement that no more than 10% of bearings in a given batch fail by a specific time due to fatigue failure. </p>
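<p>The definition above can be sketched in a few lines of Python. This example assumes lifetimes follow a two-parameter Weibull distribution, for which the time by which a fraction p of units has failed is scale × (−ln(1 − p))<sup>1/shape</sup>. The shape and scale values are hypothetical, and the function names are ours rather than part of any statistical package.</p>

```python
import math


def weibull_bx_life(x_percent, shape, scale):
    """Time by which x_percent of units are expected to fail under a Weibull model."""
    p = x_percent / 100.0
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)


def weibull_reliability(t, shape, scale):
    """Probability that a unit survives beyond time t under the same Weibull model."""
    return math.exp(-((t / scale) ** shape))


# Hypothetical Weibull parameters; not estimated from any real data set.
shape, scale = 2.0, 15.0

b10 = weibull_bx_life(10, shape, scale)
print(f"B10 life: {b10:.2f} time units")
print(f"Reliability at B10: {weibull_reliability(b10, shape, scale):.2f}")
```

<p>Passing a different percentage to the first argument gives any other "BX" life under the same assumed model.</p>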
<p>Now that I know what the term means, I can tell people who ask that <a href="http://blog.minitab.com/blog/fun-with-statistics/what-i-learned-from-treating-childbirth-as-a-failure">Minitab’s reliability analysis</a> can easily compute this metric. (In fact, our <a href="http://www.minitab.com/products/minitab">statistical software</a> can compute any “BX” lifetime—but we’ll save that for another blog post.) B10 life is also known as the 10th percentile and can be found in Minitab’s Table of Percentiles output, which is displayed in Minitab’s session window.</p>
<p><img alt="B10 Life - Table of Percentiles" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/1ac7bbfe20e1a18c284babde45ce84af/b10life_image1.png" style="width: 461px; height: 324px;" /></p>
<p>And unlike other reliability metrics, B10 life directly correlates the maximum allowable percentile of failures (or the minimum allowable reliability) with an application-specific life point in time.</p>
<p>So we can get the B10 life metric by looking at the Table of Percentiles in Minitab’s session window output. But you might still be asking two questions: how do I create this table, and how do I interpret it?</p>
<img alt="You can't just put one of these into a pacemaker, after all! " src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/6c60a30de1566a4cc65dbb03c730680e/batteries.jpg" style="border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 180px; height: 180px;" />Finding B10 Life, Step by Step
<p>Suppose we have tracked and recorded the battery lifetimes over a certain number of years for 1,970 pacemakers. The reliability of pacemakers is critical, because patients’ lives depend on these devices!</p>
<p>We observed exact failure times—defined as the time at which a low battery signal was detected—for 1,019 of those pacemakers. The remaining 951 pacemakers never warned of a low battery, so they “survived.”</p>
<p>Our data is organized as follows:</p>
<p><img alt="B10 Life - Data Organization" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/90b7e4084faeee5bbe82a2abd7ff7c7e/b10life_image2.png" style="width: 183px; height: 264px;" /></p>
<p>When we have both observed failures and units surviving beyond a given time, we call the data “right-censored.” And we know from process knowledge that the <a href="http://blog.minitab.com/blog/understanding-statistics/why-the-weibull-distribution-is-always-welcome">Weibull distribution</a> best describes the lifetime of these pacemaker batteries. Knowing this information will help us use Minitab’s reliability analysis correctly.</p>
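<p>One common way to fit a distribution to right-censored data is maximum likelihood: an observed failure contributes the log of the density at its failure time, while a censored unit contributes the log of the survival probability at the time it was last observed. The Python sketch below writes out that log-likelihood for a Weibull model. The sample data are hypothetical, the 'S' censoring code mirrors the worksheet convention described in this post, and this is only an illustration of the idea, not a description of Minitab's internal algorithm.</p>

```python
import math


def weibull_censored_loglik(times, censored, shape, scale):
    """Log-likelihood of Weibull(shape, scale) for right-censored lifetime data.

    times    -- observed times (failure time, or last time seen working)
    censored -- True where the unit survived past the recorded time
    """
    total = 0.0
    for t, is_censored in zip(times, censored):
        z = (t / scale) ** shape
        if is_censored:
            total += -z  # log of the survival function S(t)
        else:
            # log of the density f(t) = (shape/scale) * (t/scale)^(shape-1) * exp(-z)
            total += math.log(shape / scale) + (shape - 1) * math.log(t / scale) - z
    return total


# Hypothetical rows: years in service and an 'F' (failed) or 'S' (survived) code.
years = [5.2, 7.9, 9.1, 10.0, 10.0]
status = ["F", "F", "F", "S", "S"]
is_censored = [code == "S" for code in status]

# Compare two hypothetical parameter guesses; a larger value means a better fit.
print(weibull_censored_loglik(years, is_censored, shape=2.0, scale=12.0))
print(weibull_censored_loglik(years, is_censored, shape=2.0, scale=30.0))
```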
Setting Up the Reliability Analysis
<p>Because we have right-censored data and we know our distribution, we are ready to access Minitab’s <strong>Stat > Reliability/Survival > Distribution Analysis (Right Censoring) > Parametric Distribution Analysis </strong>menu to compute the B10 life.</p>
<p>We want to know the batteries’ reliability (their probability of survival) at different times, so our variable of interest is the number of years a pacemaker battery has survived. In the Parametric Distribution Analysis dialog, you’ll notice the Weibull distribution is already selected as the assumed distribution. We’ll leave this default setting since we know the Weibull distribution best describes battery lifetimes.</p>
<p><img alt="B10 Life Metric - Right Censoring" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/646dd8ee5563d9748c90545d8f0a9fa0/b_10_life_image_3.png" style="width: 507px; height: 345px;" /></p>
<p>We also know whether the number in the ‘Years’ column was an exact failure time or a censored time (beyond which the battery survived). We must account for the censored data. By clicking the button labeled ‘Censor’, we can include a censoring column that contains values indicating whether the pacemaker survived or failed at the recorded time. In our Minitab worksheet, “Failed or Survived” is the censoring column. Our censoring value is ‘S’, which stands for ‘Survived’, indicating no failure was observed during the pacemaker battery tracking period.</p>
<p><img alt="B10 Life - censoring column" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/6042f1f9401bec3fe3c11ac335dfb834/b_10_life_image_4.png" style="width: 426px; height: 313px;" /></p>
Interpreting the Table of Percentiles and B10 Life
<p>Once we click OK through all dialogs to carry out the analysis, Minitab outputs the Table of Percentiles, where we can find our B10 life:</p>
<p> <img alt="B10 Life - Corresponding Percentile" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/2394c488f49f9f0e2615ae743926cde9/b_10_life_image_5a.png" style="width: 191px; height: 240px;" /></p>
<p>Where the Percent column displays 10, the corresponding Percentile value tells us that the B10 life of pacemaker batteries is 6.36 years—or, to put it another way, 6.36 years is the time at which 10% of the population of pacemaker batteries will fail.</p>
<p>There we have it! The next time you are looking to compute the B10 life of a product, and perhaps seeking to establish suitable warranty periods, you need look no further than Minitab’s reliability tools and the Table of Percentiles.</p>
The World-Famous Disappearing-Reappearing-Analysis-Settings Act
<p>Sure, Minitab Statistical Software is powerful and easy to use, but did you know that it’s also magic? One of the illusions that Minitab can perform is the world-famous disappearing-reappearing-analysis-settings act. Of course, as with many illusions, it’s not so hard once you know the trick. In this case, it’s downright easy once you know about Minitab project files.</p>
<p><img alt="The statue of liberty" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/145817ecdac35066aeeea5b9fb106d99/statue_present.png" style="line-height: 20.7999992370605px; float: right; width: 250px; height: 180px; border-width: 1px; border-style: solid; margin: 10px 15px;" /></p>
<p>If you’ve done any work in Minitab, you may very well have saved a project file and been grateful that your data, graphs, and statistical tables could all be saved together in a single file. But it’s just as amazing that Minitab can remember exactly how you did your analysis the last time.</p>
<p>Imagine that you routinely <a href="http://blog.minitab.com/blog/understanding-statistics/i-think-i-can-i-know-i-can-a-high-level-overview-of-process-capability-analysis">run a capability analysis</a> on the same process. The first time you did the analysis, you changed several of the options to get the output that you wanted. When you open Minitab the next time, you want to perform the same analysis on a new data set. Having a saved project makes it easy. Try it for yourself if you want, following the steps below. Begin by downloading <a href="http://it.minitab.com/products/minitab/free-trial.aspx">our free trial</a> if you don't already have our statistical software, then download worksheets <a href="http://support.minitab.com/en-us/datasets/Basil.MTW">Basil.MTW</a> and <a href="http://support.minitab.com/en-us/datasets/Basil2.MTW">Basil2.MTW</a>.</p>
Introduce your Assistant
<ol>
<li>Open the Basil.MTW worksheet.</li>
<li>Choose <strong>Stat > Quality Tools > Capability Analysis > Multiple Variables (Normal)</strong>.</li>
<li>In <strong>Variables</strong>, enter <em>T1H1 T1H2</em>.</li>
<li>In <strong>Subgroup sizes</strong>, enter <em>4</em>.</li>
<li>In <strong>Lower spec</strong>, enter 2.</li>
<li>In <strong>Upper spec</strong>, enter 8.</li>
<li>Click <strong>Graphs</strong>.</li>
<li>Uncheck <strong>Normal probability plot</strong>. Click <strong>OK</strong>.</li>
<li>Click <strong>Options</strong>.</li>
<li>Under <strong>Display</strong>, select <strong>Benchmark Z’s (σ level)</strong> and check <strong>Include confidence intervals</strong>.</li>
<li>Click <strong>OK </strong>twice.</li>
</ol>
<p>The capability analysis is in your project file.</p>
<img alt="Statue not visible" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/7640a1717783092e715bc8a9321145ed/david_copperfield_statue_gone.jpg" style="width: 250px; height: 188px; float: right; border-width: 1px; border-style: solid; margin: 10px 15px;" />Presto, they’re gone!
<ol>
<li>Close Minitab. When asked if you want to save changes to the project, click <strong>Yes</strong>.</li>
<li>Name the file and click <strong>Save</strong>.</li>
</ol>
<p>Minitab Statistical Software is closed. The settings for your analysis are nowhere to be found.</p>
Abracadabra—they’re back!
<ol>
<li>Reopen the project file that you saved.</li>
<li>Open the Basil2.MTW worksheet.</li>
<li>Choose <strong>Stat > Quality Tools > Capability Analysis > Multiple Variables (Normal)</strong>.</li>
</ol>
<p>The settings from your previous analysis have reappeared! All you have to do to complete the capability analysis, with all of your customizations, is click <strong>OK</strong>.</p>
Bask in the applause from the audience
<p>Keeping all of the parts of your analysis in one place is a great feature of Minitab’s project files. For people who routinely repeat the same analysis, the fact that the project file also remembers the settings that you used for your analysis is a fantastic time saver.</p>
<p>Whether you repeat an analysis weekly, quarterly, or even annually, Minitab’s ready to pick up right where you left off. This might not be quite as astounding as David Copperfield making the Statue of Liberty disappear and reappear, but if you want to get your statistical results fast and easy, it’s the best kind of magic.</p>
<p>Ready for more? Project files and many other fundamental features of Minitab are explained in the online <a href="http://support.minitab.com/en-us/minitab/17/getting-started/">Getting Started Guide</a>.</p>
How Cpk and Ppk Are Calculated, part 2
<p>Minitab's capability analysis output gives you estimates of the capability indices Ppk and Cpk, and we receive many questions about the difference between them. Some of my colleagues have taken other approaches to explain the difference between Ppk and Cpk, so I wanted to show you how they differ by detailing precisely how each one is calculated. </p>
<p><span style="line-height: 1.6;">When you're using <a href="http://www.minitab.com/products/minitab">statistical software</a> like Minitab, you don't need to do these calculations by hand, but I also want to lift the lid off the "black box" to show you what Minitab does behind the scenes to provide these figures. </span></p>
<p><span style="line-height: 1.6;">In my previous post, we saw <a href="http://blog.minitab.com/blog/marilyn-wheatleys-blog/how-cpk-and-ppk-are-calculated2c-part-1">how Ppk is calculated</a>. This time, we'll go through the calculation of Cpk, using the same sample data set in Minitab.</span><span style="line-height: 1.6;"> Go to <strong>File > Open Worksheet</strong>, click the "Look in Minitab Sample Data folder" button at the bottom, and open the dataset named CABLE.MTW.</span></p>
Calculating Within-Subgroup Standard Deviation
<p>Where Ppk uses the overall standard deviation, Cpk uses the within-subgroup standard deviation. Calculating Cpk is easy once we have an estimate of the within-subgroup standard deviation. The default method in Minitab for the within-subgroup calculation is the pooled standard deviation. The formula for this calculation from Methods and formulas is:</p>
<p><img alt="formula for pooled standard deviation" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/e40d989c5285189e341d5ab615b9bfe0/pooledsd.png" style="width: 642px; height: 376px;" /></p>
<p>This looks a little intimidating, but you’ll see it’s not so bad if we take it one step at a time.</p>
<p>First, we’ll calculate Sp. For this example, the subgroup size is fixed at 5. We’ll begin with a clean worksheet containing only the Diameter data in C1.</p>
<p>We need to estimate the mean of the data in each subgroup and store those values in the worksheet. To do that, we’ll create a column that defines our subgroups using <strong>Calc > Make Patterned Data > Simple Set of Numbers</strong>, and then completing the dialog box as shown below:</p>
<p><img alt="subgroups" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/98f4740c6516242f060b4f22b0fa43ca/subgroup.png" style="width: 638px; height: 360px;" /></p>
<p>With 100 data points and 5 points in each subgroup, we have 20 subgroups.</p>
<p>Now we can use our new column containing the subgroups to calculate the mean of each subgroup, using <strong>Stat > Basic Statistics > Store Descriptive Statistics</strong>. We complete the dialog box like in the example below, entering the <em>Diameter </em>column under Variables and the <em>Subgroup </em>column as the By variable:</p>
<p><img alt="descriptive statistics" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/9a50586cefdd8896b974097384cf37b4/descr_stats.png" style="width: 600px; height: 309px;" /></p>
<p><span style="line-height: 1.6;">We then click Options and choose <strong>Store a row of output for each row of input</strong>, uncheck <strong>Store distinct values of By variables</strong>, and then click OK in each dialog box. Now column C3 will show the average of each subgroup; the first 5 rows of C1 were used to calculate the mean of the first subgroup, and that same mean value is displayed in the first 5 rows of C3.</span></p>
<p>We will now use these values to calculate the numerator for Sp using <strong>Calc > Calculator</strong>:</p>
<p><img alt="numerator for Sp" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/53aad02dfe93c8744f272a3ec3dabb76/calculator2.png" style="width: 609px; height: 400px;" /></p>
<p>We are summing the squared differences between each measurement and its subgroup mean. The Numerator column in the Minitab worksheet will show <strong>0.02735</strong> using the formula above.</p>
<p>Next, we calculate the denominator for Sp, which is the subgroup size minus 1, summed over all subgroups. Since we have a constant subgroup size of 5, and a total of 20 subgroups, an easy way to enter this in the calculator is:</p>
<p><img alt="denominator for Sp" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/22ec46ed79ede93c79af6e8288e8224b/calculator3.png" style="width: 593px; height: 396px;" /></p>
<p>Now with the numerator and denominator for Sp stored in the worksheet, we take the square root of Numerator/Denominator:</p>
<p><img alt="square root of numerator/denominator" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/29fe9f85f7fb535497948f3c5eb38451/calculator4.png" style="width: 590px; height: 394px;" /></p>
<p>Notice that the Sp value 0.0184899 is the estimate of the subgroup standard deviation if we tell Minitab NOT to use the unbiasing constant, C4, by clicking the Estimate button in the Normal Capability Analysis dialog box and then unchecking <strong>Use unbiasing constants</strong>. </p>
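<p>The worksheet steps above can also be written as a short Python sketch. The function <code>pooled_sd</code> is our own helper: it sums the squared deviations of each measurement from its subgroup mean, divides by the sum of (subgroup size − 1), and takes the square root. The final lines reuse the numerator reported in this post (0.02735) with 20 subgroups of size 5; the small subgroup list is hypothetical data used only to exercise the function.</p>

```python
import math
import statistics


def pooled_sd(subgroups):
    """Pooled standard deviation Sp across subgroups (no unbiasing constant applied)."""
    numerator = 0.0
    denominator = 0
    for group in subgroups:
        mean = statistics.fmean(group)
        numerator += sum((x - mean) ** 2 for x in group)  # squared deviations from the subgroup mean
        denominator += len(group) - 1                     # subgroup size minus 1
    return math.sqrt(numerator / denominator)


# Hypothetical subgroups, only to show how the function is called.
example_subgroups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
print(pooled_sd(example_subgroups))

# Numerator reported in the post, with 20 subgroups of size 5.
sp_from_post = math.sqrt(0.02735 / (20 * (5 - 1)))
print(f"Sp = {sp_from_post:.7f}")
```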
<p>Now to finish calculating the within-subgroup standard deviation using C4 (the default), we can look up C4 in the table that is linked in Methods and Formulas under the Methods heading.</p>
<p>The C4 value we need is C4 for (d + 1). As defined in Methods and formulas, d is the sum of (subgroup size – 1); in our case the subgroup size is fixed at 5, so 20*(5-1) = 80. If d = 80, we add 1 and get 81, so we look up N = 81 in the C4 column of unbiasing constants:</p>
<p><img alt="unbiasing constants" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/400dffc756ede4ae4f272cc10fb7a256/c4.png" style="width: 532px; height: 82px;" /></p>
<p><span style="line-height: 1.6;">We enter 0.996880 in column C7 in the worksheet and use it in the calculator to get the </span><span style="line-height: 20.7999992370605px;">pooled within-subgroup standard deviation</span><span style="line-height: 1.6;">:</span></p>
<p><span style="line-height: 1.6;"><img alt="within subgroup standard deviation" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/d22f12738d7eade1bef550a3bfb061c1/sdwithin.png" style="width: 595px; height: 400px;" /></span></p>
<p> We can see that this value matches the output from our initial capability analysis graph.</p>
<p><img alt="initial graph" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/6b18e5ea5e14c5f5992cc335766e505f/initial_graph.png" style="width: 171px; height: 139px;" /></p>
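<p>If you would rather compute the unbiasing constant than look it up, the standard closed form is c4(n) = sqrt(2 / (n − 1)) × Γ(n / 2) / Γ((n − 1) / 2). The Python sketch below evaluates it with log-gamma for numerical stability and then divides the Sp value from this post by c4(d + 1). The function name <code>c4</code> is ours; the inputs 0.0184899 and d = 80 come from the steps above.</p>

```python
import math


def c4(n):
    """Unbiasing constant c4 for a standard deviation estimated from n observations."""
    log_ratio = math.lgamma(n / 2.0) - math.lgamma((n - 1) / 2.0)
    return math.sqrt(2.0 / (n - 1)) * math.exp(log_ratio)


sp = 0.0184899       # pooled standard deviation from the post
d = 20 * (5 - 1)     # sum of (subgroup size - 1) over 20 subgroups of size 5

sigma_within = sp / c4(d + 1)
print(f"c4({d + 1}) = {c4(d + 1):.6f}")
print(f"within-subgroup standard deviation = {sigma_within:.6f}")
```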
Calculating Cpk
<p>Finally, we use our within-subgroup standard deviation to calculate CPU and CPL. <span style="line-height: 1.6;">Cpk is the lesser of CPU and CPL, and we find these two formulas in <strong>Methods and Formulas</strong>:</span></p>
<p><img alt="formula for CPL" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/6ecec9d3f48c1f1209985461f453017c/cpl.png" style="width: 391px; height: 178px;" /><img alt="formula for cpu" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/dc08335c4a2e31422ec35c4bbad332e7/cpu.png" style="width: 369px; height: 180px;" /></p>
<p><span style="line-height: 1.6;">We calculate CPL and CPU as shown below using the calculator and the mean of the data that we previously calculated:</span></p>
<p><img alt="calculate cpl and cpu" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/cef4860ce67f56b0eded93d66f2a9fbc/calculator5.png" style="width: 600px; height: 464px;" /></p>
<p><span style="line-height: 1.6;">Since Cpk is the lesser of the two resulting values, Cpk is 0.83. That matches the Cpk value in Minitab’s capability output:</span></p>
<p><img alt="process capability for diameter" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/fa1311e847201a8b17e8993d6d4cd889/capability_for_diameter.png" style="width: 600px; height: 363px;" /></p>
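<p>For readers who want the arithmetic in one place, here is a minimal Python sketch of the standard formulas CPL = (mean − LSL) / (3σ) and CPU = (USL − mean) / (3σ), with Cpk as the lesser of the two. The mean, spec limits, and standard deviation below are placeholder values chosen for illustration; they are not the CABLE.MTW figures, which appear only in the screenshots.</p>

```python
def cpk(mean, sigma_within, lsl, usl):
    """Return (Cpk, CPL, CPU) using the within-subgroup standard deviation."""
    cpl = (mean - lsl) / (3 * sigma_within)  # capability relative to the lower spec limit
    cpu = (usl - mean) / (3 * sigma_within)  # capability relative to the upper spec limit
    return min(cpl, cpu), cpl, cpu


# Placeholder process values; not taken from the post's data set.
value, cpl, cpu = cpk(mean=10.0, sigma_within=0.2, lsl=9.4, usl=10.9)
print(f"CPL = {cpl:.2f}, CPU = {cpu:.2f}, Cpk = {value:.2f}")
```

<p>Swapping in the overall standard deviation for <code>sigma_within</code> gives PPL, PPU, and Ppk by the same arithmetic.</p>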
<p>As long as you're using Minitab, you won't need to calculate Ppk and Cpk by hand. But I hope seeing the calculations Minitab uses to get these capability indices provides some insight into the differences between them! </p>
Lessons in Quality from Guadalajara and Mexico City
http://blog.minitab.com/blog/understanding-statistics-and-its-application/lessons-in-quality-from-guadalajara-and-mexico-city
<p><img alt="View of Mexico City" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/8e5ec9217bc8fbc2ca7a6784a1efcdfa/mexico_df_400w.jpg" style="border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 400px; height: 235px;" />Last week, thanks to the collective effort from many people, we held very successful events in Guadalajara and Mexico City, which gave us a unique opportunity to meet with over 300 Spanish-speaking Minitab users. They represented many different industries, including automotive, textile, pharmaceutical, medical devices, oil and gas, electronics, and mining, as well as academic institutions and consultants.</p>
<p>As I listened to my peers Jose Padilla and <a href="http://blog.minitab.com/blog/marilyn-wheatleys-blog">Marilyn Wheatley</a> deliver their presentations, it was interesting to see people's reactions as they learned more about our products and services. Several attendees were particularly pleased to learn more about Minitab's ease-of-use and <a href="http://www.minitab.com/products/minitab/assistant/">step-by-step help with analysis</a> offered by the Assistant menu. I saw others react to demonstrations of Minitab's comprehensive Help system, the use of executables for automation purposes, and several of the tips and tricks discussed throughout our presentations.</p>
<p>We also had multiple conversations on Minitab's flexible licensing options. Several attendees who spend a lot of time on the road were particularly glad to learn about our <a href="http://support.minitab.com/installation/frequently-asked-questions/license-fulfillment/borrow-a-license-of-minitab-companion/">borrowing functionality</a>, which lets you “check out” a license so you can use Minitab software without accessing your organization’s license server.</p>
Acceptance Sampling Plans
<p>There were plenty of technical discussions as well. One interesting question came from a user who asked how Minitab's Acceptance Sampling Plans compare to the <a href="http://asq.org/knowledge-center/ANSI_ASQZ1_4-2008/index.html">ANSI Z1.4</a> standard (a.k.a. MIL-STD 105E). The short answer is that the tables provided by ANSI Z1.4 are for a specific AQL (Acceptable Quality Level), while implicitly assuming a certain RQL (Rejectable Quality Level) based solely on the lot size. ANSI Z1.4 is an AQL-based system, while Minitab's acceptance sampling plans give you the flexibility to create a customized sampling scheme for a specific AQL, RQL, or lot size using either the binomial or the hypergeometric distribution.</p>
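<p>To see how the two distributions enter an attribute sampling plan, consider a plan that inspects n items and accepts the lot when at most c are defective. The Python sketch below computes the probability of acceptance under a binomial model (a fixed defect rate) and under a hypergeometric model (a finite lot with a known number of defectives). The plan (n = 50, c = 1), the quality levels, and the lot size are hypothetical, and the function names are ours.</p>

```python
from math import comb


def accept_prob_binomial(n, c, p):
    """P(accept) when each sampled item is defective with probability p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(min(c, n) + 1))


def accept_prob_hypergeometric(lot_size, defectives, n, c):
    """P(accept) when sampling n items without replacement from a finite lot."""
    total = comb(lot_size, n)
    good = lot_size - defectives
    # comb() returns 0 when the lower argument exceeds the upper, so impossible cases drop out.
    return sum(comb(defectives, k) * comb(good, n - k) for k in range(min(c, n) + 1)) / total


# Hypothetical plan and quality levels.
n, c = 50, 1
aql, rql = 0.01, 0.10
print(f"P(accept) at AQL: {accept_prob_binomial(n, c, aql):.3f}")
print(f"P(accept) at RQL: {accept_prob_binomial(n, c, rql):.3f}")
print(f"P(accept), lot of 500 with 5 defectives: {accept_prob_hypergeometric(500, 5, n, c):.3f}")
```

<p>A workable plan keeps the first probability high and the second low; designing a plan amounts to searching for the n and c that meet both targets.</p>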
Destructive Testing and Gage R&R
<p>Other users had questions about Gage R&R and destructive testing. Practitioners commonly assess a destructive test using Nested Gage R&R; however, this is not always necessary. The main problem with destructive testing is that every part tested is destroyed and thus can only be measured by a single operator. Since the purpose of this type of analysis is to measure the repeatability and reproducibility of the measurement system, one must identify parts that are as homogeneous as possible. Typically, instead of 10 parts, practitioners may use multiple parts from each of 10 batches. If the within-batch variation is small enough, then the parts from each batch can be considered to be "the same," and thus the readings measured by all the operators can be used to produce repeatability and reproducibility measures. The main trick is to have homogeneous units or batches that can give you enough samples to be tested by all operators for all replicates. If this is the case, you can analyze a destructive test with crossed gage R&R.</p>
Control Charts and Subgroup Size
<p>We also had an interesting discussion about the sensitivity of Shewhart <a href="http://blog.minitab.com/blog/understanding-statistics/control-chart-tutorials-and-examples">control charts</a> to the subgroup size. Specifically, one of the attendees asked our recommendation for subgroup size: 4, or 5? </p>
<p>The answer to this intriguing question requires an understanding of the reason why subgroups are recommended. Control charts have limits that are constructed so that if the process is stable, the probability of observing points outside these control limits is very small; this probability is typically referred to as the false alarm rate and it is usually set at 0.0027. This calculation assumes the process is normally distributed, so if we were plotting the individual data as in an Individuals chart, the control limits would be effective to determine an out-of-control situation only if the data came from a normal distribution. To reduce the dependence on normality, Shewhart suggested collecting the data in subgroups, because if we plot the means instead of the individual data, the control limits become less and less sensitive to normality as the subgroup size increases. This is a result of the Central Limit Theorem (CLT), which states that, regardless of the underlying distribution of the data, if we take independent samples and compute the average (or sum) of the observations in each sample, the distribution of these sample means converges to a normal distribution as the sample size increases.</p>
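<p>The 0.0027 figure is the probability that a normally distributed point falls more than three standard deviations from the center line. A short Python sketch of that arithmetic, using only the standard library (the function names are mine, for illustration):</p>

```python
from math import erf, sqrt


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def false_alarm_rate(k=3.0):
    """Probability that a point from a stable, normal process falls
    outside control limits placed k standard deviations from the center."""
    return 2.0 * (1.0 - normal_cdf(k))


rate = false_alarm_rate(3.0)
print(f"False alarm rate with 3-sigma limits: {rate:.4f}")
print(f"Average points between false alarms: {1.0 / rate:.0f}")
```

<p>If the data are not normal, the true rate can differ from this value, which is exactly why the subgroup size matters.</p>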
<p>So going back to the original question, what is the recommended subgroup size for building control charts? The answer depends on how skewed the underlying distribution may be. For many distributions, a subgroup size of 5 is sufficient to have the CLT kick in, making our control charts robust to non-normality; however, for extremely skewed distributions like the exponential, the subgroup sizes may need to be much larger than 50. This topic was discussed in a paper by Schilling and Nelson titled "<a href="http://asq.org/qic/display-item/?item=5238">The Effect of Non-normality on the Control Limits of Xbar Charts</a>," published in JQT back in 1976.</p>
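<p>You can see the CLT at work with a small simulation. The sketch below draws subgroups from an exponential distribution and measures how skewed the subgroup means are for a few subgroup sizes. For exponential data, the skewness of the mean of n observations is 2/sqrt(n), so it shrinks slowly. The subgroup sizes, number of subgroups, and seed are arbitrary choices for illustration, and this is not the analysis from the paper cited above.</p>

```python
import random
from statistics import fmean, pstdev


def skewness(values):
    """Sample skewness: the average cubed standardized deviation."""
    m = fmean(values)
    s = pstdev(values)
    return fmean(((v - m) / s) ** 3 for v in values)


def subgroup_mean_skewness(subgroup_size, n_subgroups=10_000, seed=1):
    """Skewness of subgroup means when individual values are exponential."""
    rng = random.Random(seed)
    means = [
        fmean(rng.expovariate(1.0) for _ in range(subgroup_size))
        for _ in range(n_subgroups)
    ]
    return skewness(means)


skew_by_size = {n: subgroup_mean_skewness(n) for n in (1, 5, 50)}
for n, skew in skew_by_size.items():
    print(f"subgroup size {n:>2}: simulated skewness {skew:.2f}, "
          f"theoretical {2 / n ** 0.5:.2f}")
```

<p>Even at a subgroup size of 5, the means of exponential data remain noticeably skewed, which is consistent with the advice that heavily skewed processes need larger subgroups.</p>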
Analyzing Variability
<p>We also had a great discussion about modeling variability in a process. One of the attendees, working for McDonald's, was looking for statistical methods for reducing the variation of the weight of apple slices. An apple is cut into 10 slices, and the goal was to minimize the variation in weight so that exactly four slices could be placed in each bag without further rework. This gave me the opportunity to demonstrate how to use the <a href="http://blog.minitab.com/blog/adventures-in-statistics/assessing-variability-for-quality-improvement">Analyze Variability</a> command in Minitab, which happens to be one of the topics we cover in our <a href="http://www.minitab.com/training/courses/#doe-in-practice-manufacturing">DOE in Practice</a> course.</p>
We Love Your Questions
<p>For me and my fellow trainers, there’s nothing better than talking with people who are using Minitab software to solve problems. Sometimes we’re able to provide a quick, helpful answer. Sometimes a question provokes a great discussion about some quality challenge we all have in common. And sometimes a question will lead to a great idea that we’re able to share with our developers and engineers to make our software better. </p>
<p>If you have a question about Minitab, statistics, or quality improvement, please feel free to comment here. And if you use Minitab software, you can always contact our <a href="http://www.minitab.com/support/">customer support</a> team for direct assistance from specialists in IT, statistics, and quality improvement.</p>
What to Do When Your Data's a Mess, part 3
<p>Everyone who analyzes data regularly has the experience of getting a worksheet that just isn't ready to use. Previously I wrote about tools you can use to <a href="http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-data-is-a-mess-part-1">clean up and eliminate clutter in your data</a> and <a href="http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-datas-a-mess2c-part-2">reorganize your data</a>. </p>
<p><span style="line-height: 1.6;">In this post, I'm going to highlight tools that help you get the most out of messy data by altering its characteristics.</span></p>
Know Your Options
<p>Many problems with data don't become obvious until you begin to analyze it. A shortcut or abbreviation that seemed to make sense while the data was being collected, for instance, might turn out to be a time-waster in the end. What if abbreviated values in the data set only make sense to the person who collected it? Or a column of numeric data accidentally gets coded as text? You can solve those problems quickly with <a href="http://www.minitab.com/products/minitab">statistical software</a> packages.</p>
Change the Type of Data You Have
<p>Here's an instance where a data entry error resulted in a column of numbers being incorrectly classified as text data. This will severely limit the types of analysis that can be performed using the data.</p>
<p><img alt="misclassified data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c45b427d3e5e2b5eac4a505ed5c3b24f/misclassified_data.png" style="width: 200px; height: 156px;" /></p>
<p>To fix this, select <strong>Data > Change Data Type</strong> and use the dialog box to choose the column you want to change.</p>
<p><img alt="change data type menu" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/46ece127300500409098383a2e476a9b/text_to_numeric_data.png" style="width: 376px; height: 175px;" /></p>
<p>One click later, and the errant text data has been converted to the desired numeric format:</p>
<p><img alt="numeric data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f1b9df0211f9085e577a41b0e3661b45/numeric_data.png" style="width: 200px; height: 156px;" /></p>
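<p>If you ever need to make the same kind of correction outside of Minitab, the idea is simple to sketch in Python. The example below is only an illustration: the function name is hypothetical, and treating blanks and an asterisk as missing values is an assumption about how the column was recorded.</p>

```python
def to_numeric(column, missing="*"):
    """Convert a column of text values to floats.

    Blank cells and the `missing` marker become None (assumed convention).
    """
    converted = []
    for value in column:
        text = value.strip()
        if text == "" or text == missing:
            converted.append(None)
        else:
            converted.append(float(text))
    return converted


# Hypothetical column that was read in as text.
text_column = ["12.5", " 7", "*", "3.25"]
numeric_column = to_numeric(text_column)
print(numeric_column)
```

<p>A value that is not a valid number raises a <code>ValueError</code>, which is usually what you want: it points you to the data-entry error that caused the column to be classified as text in the first place.</p>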
Make Data More Meaningful by Coding It
<p>When this company collected data on the performance of its different functions across all its locations, it used numbers to represent both locations and units. </p>
<p><img alt="uncoded data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/d22a57fe9e9e398bd948e86c0adafe34/uncoded_data.png" style="width: 135px; height: 158px;" /></p>
<p>That may have been a convenient way to record the data, but unless you've memorized what each set of numbers stands for, interpreting the results of your analysis will be a confusing chore. You can make the results easy to understand and communicate by coding the data. </p>
<p>In this case, we select <strong>Data > Code > Numeric to Text...</strong></p>
<p><img alt="code data menu" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c75e46cc190497fd41b0e6736518c0fe/code_data_menu.png" style="width: 384px; height: 255px;" /></p>
<p>And we complete the dialog box as follows, telling the software to replace the numbers with more meaningful information, like the town each facility is located in. </p>
<p><img alt="Code data dialog box" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/cd75c14324187806b8f3a74a3b8996b4/code_data_dialog.png" style="width: 400px; height: 345px;" /></p>
<p>Now you have data columns that can be understood by anyone. When you create graphs and figures, they will be clearly labelled. </p>
<p><img alt="Coded data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/7ff81bdb08170d6d8a4e8547623cf557/coded_data.png" style="width: 161px; height: 200px;" /></p>
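<p>Conceptually, coding is just a lookup from each numeric code to a label. Here is a minimal Python sketch of the same idea; the codes and town names are made up for illustration and are not the values from the worksheet pictured above.</p>

```python
def code_values(column, mapping):
    """Replace each code with its label; leave unmapped codes unchanged."""
    return [mapping.get(value, value) for value in column]


# Hypothetical codes and labels.
location_labels = {1: "Springfield", 2: "Riverton", 3: "Lakeside"}
location_codes = [1, 3, 2, 1, 4]

coded = code_values(location_codes, location_labels)
print(coded)
```

<p>Leaving unmapped codes as they are (the 4 in this example) makes it easy to spot values that were never assigned a label.</p>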
Got the Time?
<p>Dates and times can be very important in looking at performance data and other indicators that might have a cyclical or time-sensitive effect. But the way the date is recorded in your data sheet might not be exactly what you need. </p>
<p>For example, if you wanted to see if the day of the week had an influence on the activities in certain divisions of your company, a list of dates in the MM/DD/YYYY format won't be very helpful. </p>
<p><img alt="date column" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f5b0dd178afbc0352f8dc2d9378e887b/date_column.png" style="width: 240px; height: 223px;" /></p>
<p>You can use <strong>Data > Date/Time > Extract to Text... </strong>to identify the day of the week for each date.</p>
<p><img alt="extract-date-to-text" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/7e6f7e8a87ee8291b9c6d51507092c19/extract_date_to_text.png" style="width: 351px; height: 132px;" /></p>
<p>Now you have a column that lists the day of the week, and you can easily use it in your analysis. </p>
<p><img alt="day column" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/dede93c9621917a0cfb54beef121d4e2/day_column.png" style="width: 249px; height: 205px;" /></p>
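<p>For readers who want to see the logic behind this kind of extraction, here is a small Python sketch that turns MM/DD/YYYY text into a day name. The function name and sample dates are illustrative only, and a fixed list of English day names is used so the result does not depend on system locale settings.</p>

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def day_of_week(date_text):
    """Return the day name for a date written as MM/DD/YYYY."""
    parsed = datetime.strptime(date_text, "%m/%d/%Y")
    return DAY_NAMES[parsed.weekday()]  # weekday(): Monday is 0


# Hypothetical date column.
dates = ["01/01/2014", "11/14/2014"]
days = [day_of_week(d) for d in dates]
print(days)
```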
Manipulating for Meaning
<p>These tools are commonly seen as a way to correct data-entry errors, but as we've seen, you can use them to make your data sets more meaningful and easier to work with.</p>
<p>There are many other tools available in Minitab's Data menu, including an array of options for arranging, combining, dividing, fine-tuning, rounding, and otherwise massaging your data to make it easier to use. Next time you've got a column of data that isn't quite what you need, try using the Data menu to get it into shape.</p>
Are Preseason Football or Basketball Rankings More Accurate?
<p>College basketball season tips off today, and for the second straight season Kentucky is the #1 ranked preseason team in the AP poll. Last year Kentucky did not live up to that ranking in the regular season, going 24-10 and earning a lowly 8 seed in the NCAA tournament. But then, in the tournament, they overachieved and made a run all the way to the championship game...before losing to Connecticut.</p>
<p>In football, Florida State was the AP poll preseason #1 football team. While they are currently still undefeated, they aren't quite playing like the #1 team in the country. So this made me wonder, which preseason rankings are more accurate, football or basketball?</p>
<p>I gathered <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/1d3961db92c5ba14bc90b2b8323b95f8/preseason_basketball_vs__football_rankings.MTW">data</a> from the last 10 seasons, and recorded the top 10 teams in the preseason AP poll for both football and basketball. Then I recorded the difference between their preseason ranking and their final ranking. Both sports had 10 teams that weren’t ranked or receiving votes in the final poll, so I gave all of those teams a final ranking of 40.</p>
Creating a Histogram to Compare Two Distributions
<p>Let’s start with a histogram to look at the distributions of the differences. (It's always a good idea to look at the distribution of your data when you're starting an analysis, whether you're looking at quality improvement data for work or sports data for yourself.) </p>
<p>You can create this graph in Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a> by selecting <strong>Graph > Histograms</strong>, choosing "With Groups" in the dialog box, and using the Basketball Difference and Football Difference columns as the graph variables:</p>
<p><img alt="Histogram" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/53055c57978dbfa85d28688cc816c98a/histogram_of_basketball_difference__football_difference.jpg" style="width: 720px; height: 480px;" /></p>
<p>The differences in the rankings appear to be pretty similar. Most of the data is towards the left side of this histogram, meaning for most cases the difference between the preseason and final ranking is pretty small.</p>
Conducting a Mann-Whitney Hypothesis Test on Two Medians
<p>We can further investigate the data by performing a hypothesis test. Because the data is heavily skewed, I’ll use <a href="http://blog.minitab.com/blog/the-statistics-game/do-the-data-really-say-female-named-hurricanes-are-more-deadly">a Mann-Whitney test</a>. This compares the medians of two samples with similarly shaped distributions, as opposed to a <a href="http://blog.minitab.com/blog/understanding-statistics/guidelines-and-how-tos-for-the-2-sample-t-test">2-sample t test</a>, which compares the means. The median is the middle value of the data. Half the observations are less than or equal to it, and half the observations are greater than or equal to it.</p>
<p>To perform this test in our statistical software, we select <strong>Stat > Nonparametrics > Mann-Whitney</strong>, then choose the appropriate columns for our first and second sample: </p>
<p><img alt="Mann-Whitney Test" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/1a1f239841b82e60170e6ecbc8077d4b/mann_whitney.jpg" style="width: 689px; height: 241px;" /></p>
<p>The basketball rankings have a smaller median difference than the football rankings. However, when we examine the <a href="http://blog.minitab.com/blog/understanding-statistics/three-things-the-p-value-cant-tell-you-about-your-hypothesis-test">p-value</a> we see that this difference is not statistically significant. There is not enough evidence to conclude that one preseason poll is more accurate than the other.</p>
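<p>If you're curious what a test like this does under the hood, the sketch below computes the Mann-Whitney U statistic from the pooled ranks and a two-sided p-value from the normal approximation (with a tie correction and continuity correction). It is a teaching sketch, not Minitab's implementation, its output format differs from Minitab's, and the two small samples are made up rather than taken from the rankings data.</p>

```python
from math import erf, sqrt


def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided p-value via normal approximation)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1                              # pooled[i:j] share one value
        t = j - i
        midrank = (i + 1 + j) / 2.0             # average of ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t**3 - t
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:                              # every value identical
        return u, 1.0
    z = max((abs(u - mean_u) - 0.5) / sqrt(var_u), 0.0)
    p_value = 1.0 - erf(z / sqrt(2.0))          # equals 2 * (1 - Phi(z))
    return u, p_value


# Hypothetical ranking differences for two small samples.
sample_a = [1, 2, 3]
sample_b = [4, 5, 6]
u_stat, p_val = mann_whitney_u(sample_a, sample_b)
print(f"U = {u_stat}, approximate p-value = {p_val:.3f}")
```

<p>The normal approximation is rough for samples this small; it is used here only to keep the example short.</p>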
<p>But what about the best teams? I grouped each of the top 3 ranked teams and looked at the median difference between their preseason and final rank.</p>
<p><img alt="Bar Chart" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/692a3db40dd5d3b4c20d539f92395629/bar_chart.jpg" style="width: 720px; height: 480px;" /></p>
<p>The preseason AP basketball poll has a smaller difference for the #1 and #3 ranked teams. But the football poll is better for the #2 team, having an impressive median value of 1. Overall, both polls are relatively good, as neither has a median value greater than 6. And the differences are close enough that we can’t conclude that one is more accurate than the other.</p>
What Does It Mean for the Teams?
<p>While the odds are against both Kentucky and Florida State to finish the season ranked #1 in their respective polls, previous seasons indicate that they’re still likely to finish as one of the top teams. This is better news for Kentucky, as being one of the top teams means they’ll easily make the NCAA basketball tournament and get a high seed. However, Florida State must finish as one of the top 4 teams, or else they’ll miss out on the football postseason completely.</p>
<p>So while we can’t conclude one poll is better than the other, teams at the top of the AP basketball poll are clearly much more likely to reach the postseason than teams at the top of the football poll.</p>
The Power of Multivariate ANOVA (MANOVA)
<p><img alt="Willy Wonka" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/964d1b613c1569e983213d2544915ac5/willywonka.jpg" style="float: right; width: 225px; height: 225px; border-width: 1px; border-style: solid; margin: 10px 15px;" />Analysis of variance (ANOVA) is great when you want to compare the differences between group means. For example, you can use ANOVA to assess how three different alloys are related to the mean strength of a product. However, most ANOVA tests assess one response variable at a time, which can be a big problem in certain situations. Fortunately, <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">Minitab statistical software</a> offers a multivariate analysis of variance (MANOVA) test that allows you to assess multiple response variables simultaneously.</p>
<p>In this post, I’ll run through a MANOVA example, explain the benefits, and cover how to know when you should use MANOVA.</p>
Limitations of ANOVA
<p>Whether you’re using <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/basics/what-is-a-general-linear-model/" target="_blank">general linear model (GLM)</a> or <a href="http://blog.minitab.com/blog/adventures-in-statistics/did-welchs-anova-make-fishers-classic-one-way-anova-obsolete" target="_blank">one-way ANOVA</a>, most ANOVA procedures can only assess one response variable at a time. Even with GLM, where you can include many factors and covariates in the model, the analysis simply cannot detect multivariate patterns in the response variables.</p>
<p>This limitation can be a huge roadblock for some studies because it may be impossible to obtain significant results with a regular ANOVA test. You don’t want to miss out on any significant findings!</p>
Example That Compares MANOVA to ANOVA
<p>What the heck are multivariate patterns in the response variables? It sounds complicated, but it’s very easy to show the difference between how ANOVA and MANOVA test the data by using graphs.</p>
<p>Let’s assume that we are studying the relationship between three alloys and the strength and flexibility of our products. Here is the <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/File/3f3b6f58c70a646731a9db97bd7edfab/manova_example.MTW">dataset for the example</a>.</p>
<p>The two individual value plots below show how one-way ANOVA analyzes the data—one response variable at a time. In these graphs, alloy is the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/anova-models/factor-and-factor-levels/" target="_blank">factor</a> and strength and flexibility are the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/regression-models/what-are-response-and-predictor-variables/" target="_blank">response variables</a>.</p>
<img alt="Individual value plot of strength by alloy" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/3402fd3845c2226f555b4ebfe18a87f5/strength_ivp.png" style="width: 350px; height: 233px;" />
<img alt="Individual value plot of flexibility by alloy" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/c7fba5c5eda5e81e02db60b2aefb3327/flexibility_ivp.png" style="width: 350px; height: 233px;" />
<p>The two graphs seem to show that the type of alloy is not related to either the strength or flexibility of the product. When you perform the one-way ANOVA procedure for these graphs, the p-values for strength and flexibility are 0.254 and 0.923 respectively.</p>
<p>Drat! I guess Alloy isn't related to either Strength or Flexibility, right? Not so fast!</p>
<p>Now, let’s take a look at the multivariate response patterns. To do this, I’ll display the same data with a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/graphs/graphs-of-pairs-of-variables/scatterplots/scatterplot/" target="_blank">scatterplot</a> that plots Strength by Flexibility with Alloy as a categorical grouping variable.</p>
<p><img alt="Scatterplot of strength by flexibility grouped by alloy" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/86483284f76817ea95b3c1787e45e7d5/scatterplot.png" style="width: 576px; height: 384px;" /></p>
<p>The scatterplot shows a positive correlation between Strength and Flexibility. MANOVA is useful when you have correlated response variables like these. You can also see that for a given flexibility score, Alloy 3 generally has a higher strength score than Alloys 1 and 2. We can use MANOVA to statistically test for this response pattern to be sure that it’s not due to random chance.</p>
<p>To perform the MANOVA test in Minitab, go to: <strong>Stat > ANOVA > General MANOVA</strong>. Our response variables are Strength and Flexibility and the predictor is Alloy.</p>
<p>Whereas one-way ANOVA could not detect the effect, MANOVA finds it with ease. The p-values in the results are all very significant. You can conclude that Alloy influences the properties of the product by changing the relationship between the response variables.</p>
<p><img alt="MANOVA results" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/c96fe9a066011b31692765318c2f0d26/manova_swo.png" style="width: 391px; height: 155px;" /></p>
<p>For a more complete guide on how to interpret MANOVA results in Minitab, go to: <strong>Help > StatGuide > ANOVA > General MANOVA</strong>.</p>
When and Why You Should Use MANOVA
<p>Use multivariate ANOVA when you have continuous response variables that are correlated. In addition to multiple responses, you can also include multiple <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/anova-models/factor-and-factor-levels/" target="_blank">factors</a>, <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/anova-models/adding-a-covariate-to-glm/" target="_blank">covariates</a>, and <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/anova-models/what-is-an-interaction/" target="_blank">interactions</a> in your model. MANOVA uses the additional information provided by the relationship between the responses to provide three key benefits.</p>
<ul>
<li><strong>Increased power</strong>: If the response variables are correlated, MANOVA can detect differences too small to be detected through individual ANOVAs.</li>
<li><strong>Detects multivariate response patterns</strong>: The factors may influence the relationship between responses rather than affecting a single response. Single-response ANOVAs can miss these multivariate patterns as illustrated in the MANOVA example.</li>
<li><strong>Controls the family error rate</strong>: Your chance of incorrectly rejecting the null hypothesis increases with each successive ANOVA. Running one MANOVA to test all response variables simultaneously keeps the family error rate equal to your alpha level.</li>
</ul>
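<p>The third benefit is easy to quantify. If you run k separate tests, each at significance level alpha, and assume the tests are independent, the chance of at least one false rejection is 1 - (1 - alpha)^k. The Python sketch below shows that arithmetic; the independence assumption is a simplification, since correlated responses will give a somewhat different rate.</p>

```python
def family_error_rate(alpha, n_tests):
    """Chance of at least one false rejection across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


# Two separate ANOVAs (strength and flexibility), each at alpha = 0.05.
two_tests = family_error_rate(0.05, 2)
print(f"Family error rate for 2 tests: {two_tests:.4f}")
print(f"Family error rate for 5 tests: {family_error_rate(0.05, 5):.4f}")
```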
What to Do When Your Data's a Mess, part 2
<p><span style="line-height: 1.6;">In my last post, I wrote about making a cluttered data set easier to work with by removing unneeded columns entirely, and by displaying just those columns you want to work with <em>now</em>. But <a href="http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-data-is-a-mess-part-1">too much unneeded data</a> isn't always the problem. </span></p>
<p><span style="line-height: 1.6;">What can you do when someone gives you data that isn't organized the way you need it to be? </span></p>
<p><span style="line-height: 1.6;">That happens for a variety of reasons, but most often it's because the simplest way for people to collect data is with a format that might make it difficult to assess in a worksheet. Most <a href="http://www.minitab.com/products/minitab">statistical software</a> will accept a wide range of data layouts, but just because a layout is readable doesn't mean it will be easy to analyze.</span></p>
<p><span style="line-height: 1.6;">You may not be in control of how your data were collected, but you can use tools like sorting, stacking, and ordering to put your data into a format that makes sense and is easy for you to use. </span></p>
Decide How You Want to Organize Your Data
<p>Depending on how it's arranged, the same data can be easier to work with, simpler to understand, and can even yield deeper and more sophisticated insights. I can't tell you the best way to organize your specific data set, because that will depend on the types of analysis you want to perform, and the nature of the data you're working with. However, I can show you some easy ways to rearrange your data into the form that you select. </p>
Unstack Data to Make Multiple Columns
<p>The data below show concession sales for different types of events held at a local theater. </p>
<p><img alt="stacked data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/8ea617d9de8138f26f2da0f3f95f4b88/stackedata.png" style="width: 202px; height: 188px;" /></p>
<p><span style="line-height: 20.7999992370605px;">If we wanted to perform an analysis that requires each type of event to be in its own column, we can choose <strong>Data > Unstack Columns...</strong> and complete the dialog box as shown: </span></p>
<p><span style="line-height: 20.7999992370605px;"><img alt="unstack columns dialog" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/fc098d3ddcbc21fe12602cb45336949c/unstack_columns.png" style="width: 350px; height: 263px;" /> </span></p>
<p>Minitab creates a new worksheet that contains a separate column of Concessions sales data for each type of event:</p>
<p><img alt="Unstacked Data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f24dd4ac29678e25069d299ccc13c535/unstacked_data.png" style="width: 400px; height: 150px;" /></p>
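<p>The logic of unstacking is to split one column of values into several columns according to a grouping column. A minimal Python sketch of that idea follows; the function name and the sales figures are invented for illustration.</p>

```python
from collections import defaultdict


def unstack(values, groups):
    """Split `values` into one list per distinct entry in `groups`."""
    columns = defaultdict(list)
    for value, group in zip(values, groups):
        columns[group].append(value)
    return dict(columns)


# Hypothetical stacked worksheet: one row per event.
concessions = [120, 85, 200, 95]
event_type = ["Concert", "Play", "Concert", "Play"]

unstacked = unstack(concessions, event_type)
print(unstacked)
```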
Stack Data to Form a Single Column (with Grouping Variable)
<p>A similar tool will help you put data from separate columns into a single column for the type of analysis required. The data below show sales figures for four employees: </p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f546e2611e4fd6fe804de7c0aee3d230/stacked_data.png" style="width: 265px; height: 92px;" /></p>
<p>Select <strong>Data > Stack > Columns...</strong> and select the columns you wish to combine. Checking the "Use variable names in subscript column" will create a second column that identifies the person who made each sale. </p>
<p><img alt="Stack columns dialog" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a09dba196e68e5e75d0f248339a53e11/stack_data_dialog.jpg" style="width: 400px; height: 292px;" /></p>
<p>When you press OK, the sales data are stacked into a single column of measurements and ready for analysis, with Employee available as a grouping variable: </p>
<p><img alt="stacked columns" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c26bec8bec9447ab1df6b9ad669d9a1a/stacked_columns.jpg" style="width: 138px; height: 181px;" /></p>
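<p>Stacking is the reverse operation: several columns become one column of values plus a second column that records where each value came from. Here is a small Python sketch, with hypothetical employee names and sales figures.</p>

```python
def stack(columns):
    """Combine named columns into (values, subscripts).

    `subscripts` holds the source column name for each stacked value.
    """
    values, subscripts = [], []
    for name, column in columns.items():
        for value in column:
            values.append(value)
            subscripts.append(name)
    return values, subscripts


# Hypothetical worksheet: one column of sales per employee.
sales_by_employee = {"Ana": [10, 12], "Ben": [7], "Cho": [9, 11]}

sales, employee = stack(sales_by_employee)
print(sales)
print(employee)
```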
Sort Data to Make It More Manageable
<p>The following data appear in the worksheet in the order in which individual stores in a chain sent them into the central accounting system.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/431dcae640fa0855a8db03b14bad3998/unsorted_data.jpg" style="width: 200px; height: 228px;" /></p>
<p>When the data appear in this uncontrolled order, finding an observation for any particular item, or from any specific store, would entail reviewing the entire list. We can fix that problem by selecting <strong>Data > Sort...</strong> and reordering the data by either store or item. </p>
<p><img alt="sorted data by item" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/0c982bb11359a001c048cb6c39ab1f60/sorted_data_by_item.jpg" style="width: 221px; height: 246px;" /> <img alt="sorted data by store" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/53e9a3f22b4a959af11952995703d7d4/sorted_data_by_store.jpg" style="width: 209px; height: 248px;" /></p>
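<p>Sorting by more than one column works the same way in most tools: rows are ordered by the first key, and ties are broken by the next. A short Python sketch, using made-up store and item records:</p>

```python
from operator import itemgetter

# Hypothetical records in the order they arrived.
rows = [
    {"Store": "B", "Item": "Pens", "Units": 4},
    {"Store": "A", "Item": "Tape", "Units": 9},
    {"Store": "B", "Item": "Ink", "Units": 2},
    {"Store": "A", "Item": "Ink", "Units": 5},
]

# Order by item first, then by store within each item.
by_item = sorted(rows, key=itemgetter("Item", "Store"))
# Order by store first, then by item within each store.
by_store = sorted(rows, key=itemgetter("Store", "Item"))

for row in by_item:
    print(row)
```

<p>Because <code>sorted</code> returns a new list, the original arrival order is still available in <code>rows</code> if you need it later.</p>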
Merge Multiple Worksheets
<p>What if you need to analyze information about the same items, but it was recorded on separate worksheets? For instance, what if one group was gathering historic data about all of a corporation's manufacturing operations, while another was working on strategic planning, and your analysis required data from each? </p>
<p><img alt="two worksheets" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f63ed557c91fb6136b28ab43001b48b4/two_worksheets.png" style="width: 350px; height: 327px;" /></p>
<p>You can use <strong>Data > Merge Worksheets</strong> to bring the data together into a single worksheet, using the Division column to match the observations:</p>
<p><img alt="merging worksheets" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/651d3d676a4099a71eb180344d2e8282/merge_worksheets.png" style="width: 393px; height: 363px;" /></p>
<p>You can also choose whether or not multiple, missing, or unmatched observations will be included in the merged worksheet. </p>
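<p>The matching logic behind a merge can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of how Minitab performs the merge: the function name, the <code>include_unmatched</code> option, and the sample rows are hypothetical, and the sketch assumes each Division appears at most once in each worksheet.</p>

```python
def merge_worksheets(left, right, key, include_unmatched=True):
    """Join two lists of row dictionaries on a shared key column.

    Assumes `key` values are unique within each worksheet.
    """
    right_by_key = {row[key]: row for row in right}
    merged, matched = [], set()
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            merged.append({**row, **match})
            matched.add(row[key])
        elif include_unmatched:
            merged.append(dict(row))
    if include_unmatched:
        # Keep rows that appear only in the second worksheet.
        merged.extend(dict(r) for r in right if r[key] not in matched)
    return merged


# Hypothetical worksheets that share a Division column.
history = [{"Division": "A", "Output": 10}, {"Division": "B", "Output": 20}]
planning = [{"Division": "B", "Budget": 5}, {"Division": "C", "Budget": 7}]

all_rows = merge_worksheets(history, planning, "Division")
matched_only = merge_worksheets(history, planning, "Division",
                                include_unmatched=False)
print(all_rows)
print(matched_only)
```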
Reorganizing Data for Ease of Use and Clarity
<p>Making changes to the layout of your worksheet does entail a small investment of time, but it can bring big returns in making analyses quicker and easier to perform. The next time you're confronted with raw data that isn't ready to play nice, try some of these approaches to get it under control. </p>
<p>In my next post, I'll share some tips and tricks that can help you get more information out of your data.</p>
What to Do When Your Data's a Mess, part 1
<p>Isn't it great when you get a set of data and it's perfectly organized and ready for you to analyze? I love it when the people who collect the data take special care to make sure to format it consistently, arrange it correctly, and eliminate the junk, clutter, and useless information I don't need. </p>
<p><img alt="Messy Data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ad531bc1c0dc575e774b7ecef670b231/messydata.png" style="border-width: 1px; border-style: solid; margin: 10px 15px; width: 250px; height: 248px; float: right;" />You've never received a data set in such perfect condition, you say?</p>
<p>Yeah, me neither. But I can dream, right? </p>
<p>The truth is, when other people give me data, it's typically not ready to analyze. It's frequently messy, disorganized, and inconsistent. I get big headaches if I try to analyze it without doing a little clean-up work first. </p>
<p>I've talked with many people who've shared similar experiences, so I'm writing a series of posts on how to get your data in usable condition. In this first post, I'll talk about some basic methods you can use to make your data easier to work with. </p>
Preparing Data Is a Little Like Preparing Food
<p>I'm not complaining about the people who give me data. In most cases, they aren't statisticians and they have many higher priorities than giving me data in exactly the form I want. </p>
<p>The end result is that getting data is a little bit like getting food: it's not always going to be ready to eat when you pick it up. You don't eat raw chicken, and usually you can't analyze raw data, either. In both cases, you need to prepare it first or the results aren't going to be pretty. </p>
<p>Here are a couple of very basic things to look for when you get a messy data set, and how to handle them. </p>
Kitchen-Sink Data and Information Overload
<p>Frequently I get a data set that includes a lot of information that I don't need for my analysis. I also get data sets that combine or group information in ways that make analyzing it more difficult. </p>
<p>For example, let's say I needed to analyze data about different types of events that take place at a local theater. Here's my raw data sheet: </p>
<p><img alt="April data sheet" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/14fe4e9930171f54848b589c0e8139d1/april_data_raw.png" style="width: 400px; height: 224px;" /></p>
<p>With each type of event jammed into a single worksheet, it's a challenge to analyze just one event category. What would work better? A separate worksheet for each type of occasion. In Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a>, I can go to <strong>Data > Split Worksheet...</strong> and choose the Event column: </p>
<p><img alt="split worksheet" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/69c63e422339f9871ada5a244222dcfc/split_worksheet.png" style="width: 300px; height: 309px;" /></p>
<p>And Minitab will create new worksheets that include only the data for each type of event. </p>
<p><img alt="separate worksheets by event type" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/8b97ea00ae39da8cb60e307ebe6140dc/separate_data_sheets.png" style="width: 300px; height: 243px;" /></p>
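<p>The same split can be sketched in a few lines of Python. The rows and the Attendance column below are hypothetical, and <code>split_by</code> is an illustrative helper rather than anything Minitab exposes; it simply groups rows by the value in one column.</p>

```python
from collections import defaultdict

# Hypothetical theater data; only the "Event" column name comes from the post.
events = [
    {"Event": "Concert", "Attendance": 410},
    {"Event": "Play", "Attendance": 230},
    {"Event": "Concert", "Attendance": 380},
    {"Event": "Film", "Attendance": 150},
]


def split_by(rows, column):
    """Return one list of rows per distinct value found in `column`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    return dict(groups)


worksheets = split_by(events, "Event")
for name, rows in worksheets.items():
    print(name, "->", len(rows), "row(s)")
```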
<p>Minitab also lets you merge worksheets to combine items provided in separate data files. </p>
<p>Let's say the data set you've been given contains a lot of columns that you don't need: irrelevant factors, redundant information, and the like. Those items just clutter up your data set, and getting rid of them will make it easier to identify and access the columns of data you actually need. You can delete rows and columns you don't need, or use the <strong>Data > Erase Variables</strong> tool to make your worksheet more manageable. </p>
I Can't See You Right Now...Maybe Later
<p>What if you don't want to actually <em>delete </em>any data, but you only want to see the columns you intend to use? For instance, in the data below, I don't need the Date, Manager, or Duration columns now, but I may have use for them in the future: </p>
<p><img alt="unwanted columns" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/99d785a0b5ff0cbac36f0c6af05b1cac/unwantedcolumns.png" style="width: 400px; height: 225px;" /></p>
<p>I can select and right-click those columns, then use <strong>Column > Hide Selected Columns</strong> to make them disappear. </p>
<p><img alt="hide selected columns" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/00defa2646d5e100873ef2961d374ff0/hideselectedcolumns.png" style="width: 400px; height: 308px;" /></p>
<p>Voila! They're gone from my sight. Note how the displayed columns jump from C1 to C5, indicating that some columns are hidden: </p>
<p><img alt="hidden columns" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a140bb6413744b431460e70f523e5a0b/hiddencolumns.png" style="width: 323px; height: 138px;" /></p>
<p>It's just as easy to bring those columns back in the limelight. When I want them to reappear, I select the C1 and C5 columns, right-click, and choose "Unhide Selected Columns." </p>
<p>Data may arrive in a disorganized and messy state, but you don't need to keep it that way. Getting rid of extraneous information and choosing the elements that are visible can make your work much easier. But that's just the tip of the iceberg. In my next post, I'll cover some more <a href="http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-datas-a-mess2c-part-2">ways to make unruly data behave</a>. </p>
Creating and Reading Statistical Graphs: Trickier than You Think
<p>A few weeks ago my colleague Cody Steele illustrated <a href="http://blog.minitab.com/blog/statistics-and-quality-improvement/how-painful-does-the-income-gap-look-to-you">how the same set of data can appear to support two contradictory positions</a>. He showed how changing the scale of a graph that displays mean and median household income over time drastically alters the way it can be interpreted, even though there's no change in the data being presented.</p>
<p><img alt="Graph interpretation is tricky, especially if you're doing it quickly" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f594d20f8daa8e00e29380f68010b1cc/hunh.jpg" style="margin: 10px 15px; float: right; width: 200px; height: 200px;" /> When we analyze data, we need to present the results in an objective, honest, and fair way. That's the catch, of course. What's "fair" can be debated...and that leads us straight into "Lies, damned lies, and statistics" territory. </p>
<p>Cody's post got me thinking about the importance of statistical literacy, especially in a mediascape saturated with overhyped news reports about seemingly every new study, not to mention omnipresent "infographics" of frequently dubious origin and intent.</p>
<p>As consumers and providers of statistics, can we trust our own impressions of the information we're bombarded with on a daily basis? It's an increasing challenge, even for the statistics-savvy. </p>
So Much Data, So Many Graphs, So Little Time
<p>The increased amount of information available, combined with the acceleration of the news cycle to speeds that wouldn't have been dreamed of a decade or two ago, means we have less time available to absorb and evaluate individual items critically. </p>
<p>A half-hour television news broadcast might include several animations, charts, and figures based on the latest research, or polling numbers, or government data. They'll be presented for several seconds at most, then it's on to the next item. </p>
<p>Getting news online is even more rife with opportunities for split-second judgment calls. We scan through the headlines and eyeball the images, searching for stories interesting enough to click on. But with 25 interesting stories vying for your attention, and perhaps just a few minutes before your next appointment, you race through them very quickly. </p>
<p>But when we see graphs for a couple of seconds, do we really absorb their meaning completely and accurately? Or are we susceptible to misinterpretation? </p>
<p>Most of the graphs we see are very simple: bar charts and pie charts predominate. But as statistics educator Dr. Nic points out in <a href="http://learnandteachstatistics.wordpress.com/2012/07/16/tricky_graphs/">this blog post</a>, interpreting even simple bar charts can be a deceptively tricky business. I've adapted her example to demonstrate this below. </p>
Which Chart Shows Greater Variation?
<p>A city surveyed residents of two neighborhoods about the quality of service they get from local government. Respondents were asked to rate local services on a scale of 1 to 10. Their responses were charted using Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a>, as shown below. </p>
<p>Take a few seconds to scan the charts, then choose which neighborhood's responses exhibit more variation: Ferndale or Lawnwood?</p>
<p><img alt="Lawnwood Bar Chart" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f88262f2732bc43e8ac0b919d43139a5/lawnwoodbarchart.gif" style="width: 500px; height: 333px;" /></p>
<p><img alt="Ferndale Bar Chart" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/67ee1909a89236e3caac2d11a9d42795/ferndalebarchart.gif" style="width: 500px; height: 333px;" /></p>
<p>Seems pretty straightforward, right? Lawnwood's graph is quite spiky and disjointed, with sharp peaks and valleys. The graph of Ferndale's responses, on the other hand, looks nice and even. Each bar's roughly the same height. </p>
<p>It looks like Lawnwood's responses have more variation. But let's verify that impression with some basic descriptive statistics about each neighborhood's responses:</p>
<p style="margin-left: 40px;"><img alt="Descriptive Statistics for Ferndale and Lawnwood" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/1eeed755d2a0baea0939dc7ccecacaea/descriptive_statistics.gif" style="width: 369px; height: 105px;" /></p>
<p>Uh-oh. A glance at the graphs suggested that Lawnwood has more variation, but the analysis demonstrates that Ferndale's variation is, in fact, much higher. How did we get this so wrong? </p>
Frequencies, Values, and Counterintuitive Graphs
<p>The answer lies in how the data were presented. The charts above show frequencies, or counts, rather than individual responses. </p>
<p>What if we graph the individual responses for each neighborhood? </p>
<p><img alt="Lawnwood Individuals Chart" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/d8e91ae6c007e8f5327c54ac3ec65604/lawnwoodindividualsbarchart.gif" style="width: 500px; height: 333px;" /></p>
<p><img alt="Ferndale Individuals Chart" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/4c01c68dbb96e2126a1fd313ee38e001/ferndaleindividualsbarchart.gif" style="width: 500px; height: 333px;" /></p>
<p>In <em>these </em>graphs, it's easy to see that the responses of Ferndale's citizens had much more variation than those of Lawnwood. But unless you appreciate the differences between values and frequencies—and paid careful attention to how the first set of graphs was labelled—a quick look at the earlier graphs could well leave you with the wrong conclusion. </p>
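<p>To make the distinction concrete, here is a small Python sketch that compares the spread of the bar heights (what the eye reads from a frequency chart) with the spread of the ratings themselves. The counts are made up for illustration and are not the survey data charted above.</p>

```python
from statistics import stdev

# Hypothetical frequency tables (rating -> number of respondents).
# Lawnwood: spiky bars, but every rating sits between 4 and 8.
# Ferndale: perfectly even bars, but ratings run from 1 to 10.
lawnwood_counts = {4: 2, 5: 9, 6: 3, 7: 10, 8: 1}
ferndale_counts = {rating: 5 for rating in range(1, 11)}


def expand(counts):
    """Turn a frequency table back into the list of individual responses."""
    return [rating for rating, n in counts.items() for _ in range(n)]


for name, counts in [("Lawnwood", lawnwood_counts), ("Ferndale", ferndale_counts)]:
    bar_heights = list(counts.values())
    responses = expand(counts)
    print(name,
          "| spread of bar heights:", round(stdev(bar_heights), 2),
          "| spread of ratings:", round(stdev(responses), 2))
```

<p>In this toy example the neighborhood with the flattest bars has the larger standard deviation of ratings, which is the same trap the frequency charts set.</p>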
Being Responsible
<p>Since you're reading this, you probably both create and consume data analysis. You may generate your own reports and charts at work, and see the results of other people's analyses on the news. We should approach both situations with a certain degree of responsibility. </p>
<p>When looking at graphs and charts produced by others, we need to avoid snap judgments. We need to pay attention to what the graphs really show, and take the time to draw the right conclusions based on how the data are presented. </p>
<p>When sharing our own analyses, we have a responsibility to communicate clearly. In the frequency charts above, the X and Y axes are labelled adequately—but couldn't they be more explicit? Instead of just "Rating," couldn't the label read "Count for Each Rating" or some other, more meaningful description? </p>
<p>Statistical concepts may seem like common knowledge if you've spent a lot of time working with them, but many people aren't clear on ideas like "correlation is not causation" and margins of error, let alone the nuances of statistical assumptions, distributions, and significance levels.</p>
<p>If your audience includes people without a thorough grounding in statistics, are you going the extra mile to make sure the results are understood? For example, many expert statisticians have told us they use <a href="http://www.minitab.com/products/minitab/assistant/">the Assistant</a> in Minitab 17 to present their results precisely because it's designed to communicate the outcome of analysis clearly, even for statistical novices. </p>
<p>If you're already doing everything you can to make statistics accessible to others, kudos to you. And if you're not, why aren't you? </p>
Comparing the College Football Playoff Top 25 and the Preseason AP Poll
<p>The college football playoff committee waited until the end of October to release their first top 25 rankings. One of the reasons for waiting so far into the season was that the committee would rank the teams off of actual games and wouldn’t be influenced by preseason rankings.</p>
<p>At least, that was the idea.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/8ac74acf42052d068b6cd0eeec32f609/cfb_playoff.jpg" style="line-height: 20.7999992370605px; float: right; width: 300px; height: 187px;" /></p>
<p>Earlier this year, I found that the <a href="http://blog.minitab.com/blog/the-statistics-game/has-the-college-football-playoff-already-been-decided">final AP poll was correlated with the preseason AP poll</a>. That is, if team A was ranked ahead of team B in the preseason and they had the same number of losses, team A was still usually ranked ahead of team B. The biggest exception was SEC teams, who were able to regularly jump ahead of teams (with the same number of losses) ranked ahead of them in the preseason.</p>
<p>If the final AP poll can be influenced by preseason expectations, could the college football playoff committee be influenced, too? Let’s compare their first set of rankings to the preseason AP poll to find out.</p>
Comparing the Ranks
<p>There are currently 17 different teams in the committee’s top 25 that have just one loss. I <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/26e7c8d8d8eee4fe2dfa26dc3d6e3c54/preseason_ap_vs__cfb_playoff_rankings.MTW">recorded the order</a> they are ranked in the committee’s poll and their order in the AP preseason poll. Below is an individual value plot of the data that shows each team’s preseason rank versus their current rank.</p>
<p><img alt="IVP" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/4098bab194a586865d3861f854d65627/ivp.jpg" style="width: 600px; height: 400px;" /></p>
<p>Teams on the diagonal line haven’t moved up or down since the preseason. Although Notre Dame is the only team to fall directly on the line, most teams aren’t too far off.</p>
<p>Teams below the line have jumped teams that were ranked ahead of them in the preseason. The biggest winner is actually not an SEC team; it’s TCU. Before the season, 13 of the current one-loss teams were ranked ahead of TCU, but now there are only 4. On the surface TCU seems to counter the idea that only SEC teams can drastically move up from their preseason ranking. However, of the 9 teams TCU jumped, only one (Georgia) is from the SEC. And the only other team to jump up more than 5 spots is Mississippi, who of course is from the SEC. So I wouldn’t conclude that the CFB playoff committee rankings behave differently than the AP poll quite yet.</p>
<p>Teams above the line have been passed by teams that had been ranked behind them in the preseason. Ohio State is the biggest loser, having had 9 different teams pass over them. Part of this can be explained by the fact that they have the worst loss (to a 4-4 Virginia Tech team at home). But another factor is that the preseason AP poll was released before anybody knew Buckeye quarterback Braxton Miller would miss the entire season. Had voters known that, Ohio State probably wouldn’t have been ranked so high to begin with. </p>
<p>Overall, 10 teams have moved up or down from their preseason spot by 3 spots or fewer. The correlation between the two polls is 0.571, which indicates a positive association between the preseason AP poll and the current CFB playoff rankings. That is, teams ranked higher in the preseason poll tend to be ranked higher in the playoff rankings.</p>
Concordant and Discordant Pairs
<p>We can take this analysis a step further by looking at the concordant and discordant pairs. A pair is concordant if the observations are in the same direction. A pair is discordant if the observations are in opposite directions. This will let us compare teams to each other two at a time.</p>
<p>For example, let’s compare Auburn and Mississippi. In the preseason, Auburn was ranked 3 (out of the 17 one-loss teams) and Mississippi was ranked 10. In the playoff rankings, Auburn is ranked 1 and Mississippi is ranked 2. This pair is concordant, since in both cases Auburn is ranked higher than Mississippi. But if you compare Alabama and Mississippi, you’ll see Alabama was ranked higher in the preseason, but Mississippi is ranked higher in the playoff rankings. That pair is discordant.</p>
<p>When we compare every team, we end up with 136 pairs. How many of those are concordant? Our <a href="http://www.minitab.com/products/minitab">favorite statistical software</a> has the answer: </p>
<p><img alt="Measures of Concordance" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/5f281abfa1e06d5cda492e17b3f9746b/concordance.jpg" style="width: 663px; height: 176px;" /></p>
<p>There are 96 concordant pairs, which is just over 70%. So most of the time, if a team ranked higher in the preseason poll, they are ranked higher in the playoff rankings. And consider this: of the one-loss teams, the top 4 ranked preseason teams were Alabama, Oregon, Auburn, and Michigan St. Currently, the top 4 one-loss teams are Auburn, Mississippi, Oregon, and Alabama. That’s only one new team, which just so happens to be from the SEC.</p>
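<p>The pair counting itself is simple enough to sketch in Python. The four teams below use the ranks quoted in this post, with one assumption: that the two "top 4" lists above are given in rank order. The <code>count_pairs</code> helper is illustrative; the full 17-team result comes from the Minitab output, not from this snippet.</p>

```python
from itertools import combinations
from math import comb

# (preseason rank, playoff rank) among the one-loss teams.
# Assumption: the "top 4" lists in the text are in rank order.
ranks = {
    "Alabama": (1, 4),
    "Oregon": (2, 3),
    "Auburn": (3, 1),
    "Mississippi": (10, 2),
}


def count_pairs(points):
    """Count concordant, discordant, and tied pairs of (x, y) observations."""
    concordant = discordant = tied = 0
    for (x1, y1), (x2, y2) in combinations(points, 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1      # both rankings order the pair the same way
        elif direction < 0:
            discordant += 1      # the rankings disagree about the pair
        else:
            tied += 1
    return concordant, discordant, tied


print(count_pairs(list(ranks.values())))   # this 4-team subset only
print(comb(17, 2))                         # number of pairs among 17 teams
print(round(96 / 136 * 100, 1))            # percent concordant in the full data
```

<p>Auburn and Mississippi form the one concordant pair in this subset, and Alabama and Mississippi are discordant, matching the examples above. The subset is dominated by teams that swapped places at the very top, so it is not representative of the full set of 136 pairs.</p>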
<p>That’s bad news for non-SEC teams that started the season ranked low, like Arizona, Notre Dame, Nebraska, and Kansas State. It's going to be hard for them to jump teams with the same record, especially if those teams are from the SEC. Just look at Alabama’s résumé so far. Their best win is over West Virginia and they lost to #4 Mississippi. Is that <em>really </em>better than Kansas State, who lost to #3 Auburn and beat Oklahoma <em>on the road</em>? If you simply changed the name on Alabama’s uniform to Utah and had them unranked to start the season, would they still be ranked three spots higher than Kansas State? I doubt it.</p>
<p>The good news is that there are still many games left to play. Most of these one-loss teams will lose at least one more game. But with 4 teams making the playoff this year, odds are we'll see multiple teams with the same record vying for the last playoff spot. And if this college football playoff ranking is any indication, SEC teams and teams that were highly thought of in the preseason will have an edge.</p>
Simulating Robust Processing with Design of Experiments, part 2
<p>by <a href="http://uk.linkedin.com/in/jasminwongym" target="_blank">Jasmin Wong</a>, guest blogger</p>
<p><em><a href="http://blog.minitab.com/blog/statistics-in-the-field/simulating-robust-processing2c-part-1">Part 1</a> of this two-part blog post discusses the issues and challenges in injection moulding and suggests using simulation software and the statistical method called Design of Experiments (DOE) to speed development and boost quality. This part presents a case study that illustrates this approach. </em></p>
Preliminary Fill and Designed Experiment
<p>This case study considers the example of a hand dispensing pump for a sanitiser bottle where the main areas of concern were warpage and the concentricity of the tube, as this had a critical impact on fit and functionality. </p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f6c68e56710c222c2a20dd002021287f/dispenser_top.png" style="line-height: 20.7999992370605px; margin: 10px 15px; float: right; width: 400px; height: 236px;" /></p>
<div>
<p>In this example, the first step was to carry out a preliminary fill, pack, cool and warp analysis to ensure that the part had no filling difficulties such as short shots or hesitation. DOE was then carried out and, since the areas of concern were warpage and concentricity, these were selected as the quality factor/responses.</p>
<div>
<p>Four control factors that affected warpage and concentricity were used to carry out the DOE: melt temperature, packing pressure, cooling time, and fill time. The factor levels are shown in the table below:</p>
<p><img alt="Taguchi DOE control factors" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/322b2d00c3b22d962ca76ac0485e437b/taguchi_doe_control_factors.png" style="width: 450px; height: 136px;" /></p>
<p>A Taguchi L9 DOE was then created using Minitab Statistical Software. It should be noted that a Taguchi DOE assumes no significant interaction between factors, but this may not necessarily be true. In this case, however, it was selected to determine the relationship between the factors and responses in the shortest simulation time.</p>
<p>The Minitab worksheet below shows the process settings for the nine runs using the Taguchi L9 Design.</p>
<p><img alt="Taguchi design worksheet" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/7cbc350e2fbe466708f4b5b4a2f58566/taguchi_doe_worksheet.png" style="width: 450px; height: 169px;" /></p>
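<p>As a rough illustration of why nine runs suffice, the Python sketch below writes out the standard L9 orthogonal array in coded levels (1, 2, 3) and checks its balance. The assignment of factors to columns is an assumption made for illustration; the actual settings are the ones in the worksheet above.</p>

```python
from itertools import combinations, product

# Assumed column order; the real worksheet may assign factors differently.
FACTORS = ["melt temperature", "packing pressure", "cooling time", "fill time"]

# Standard L9 orthogonal array: 9 runs, 4 factors, 3 coded levels each.
L9 = [
    (1, 1, 1, 1),
    (1, 2, 2, 2),
    (1, 3, 3, 3),
    (2, 1, 2, 3),
    (2, 2, 3, 1),
    (2, 3, 1, 2),
    (3, 1, 3, 2),
    (3, 2, 1, 3),
    (3, 3, 2, 1),
]


def is_orthogonal(array, levels=(1, 2, 3)):
    """True if every pair of columns contains each level combination exactly once."""
    expected = sorted(product(levels, repeat=2))
    n_cols = len(array[0])
    for a, b in combinations(range(n_cols), 2):
        if sorted((row[a], row[b]) for row in array) != expected:
            return False
    return True


print("runs in L9:", len(L9))
print("runs in a full factorial:", 3 ** len(FACTORS))
print("balanced across every pair of factors:", is_orthogonal(L9))
```

<p>The balance check is what lets main effects be estimated from so few runs; it says nothing about interactions, which is the limitation noted above.</p>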
<p>Moldex3D DOE was then used to perform the mathematical calculations based on the user’s specification (minimum warpage and linear shrinkage between nodes) to determine the optimum process setting.</p>
<p>From the nine different simulated runs, a main effect graph for warpage was plotted. </p>
<p><img alt="Main Effects Plot for Warpage" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/dbec7e75117c7763745e8260d78852fd/main_effects_warpage.png" style="width: 577px; height: 385px;" /></p>
<p>From this, it could be seen that by increasing the packing pressure and cooling time, warpage was reduced. Increasing melt temperature, on the other hand, led to higher warpage. Using a filling time of 0.2s or 0.3s seemed to give slightly less warpage than 0.1s. Hence, it was determined that to achieve lower warpage, the optimum process setting should be a melt temperature of 225°C, packing pressure of 15MPa, cooling time of 12s and filling time of 0.3s.</p>
<p style="line-height: 20.7999992370605px;">Taking the results obtained from Moldex3D, Minitab 17 statistical software was used to determine which of the four factors had the biggest influence on part warpage.</p>
<p style="line-height: 20.7999992370605px;"><img alt="response table for warpage" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/20e65680dd317de7add7a8559b1d50e3/response_table_warpage.png" style="width: 500px; height: 153px;" /></p>
<p style="line-height: 20.7999992370605px;">This data analysis showed that cooling time had the biggest impact on part warpage, followed by packing pressure, melt temperature and then filling time. An area graph of warpage showed a quick comparison of the nine different runs, indicating that run 3 gave the least warpage.</p>
<p><img alt="area graph of warpage" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/740d75c1b4424da02ee136a673e43780/area_graph_of_warpage.png" style="width: 500px; height: 333px;" /></p>
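<p>The ranking in a response table like the one above can be sketched in Python: average the response at each level of a factor, take the difference between the largest and smallest level mean (the delta), and rank factors by delta. The warpage numbers and the factor-to-column assignment below are hypothetical, chosen only to show the calculation; they are not the Moldex3D results.</p>

```python
from statistics import mean

# Assumed column order for the standard L9 array (coded levels 1-3).
FACTORS = ["melt temperature", "packing pressure", "cooling time", "fill time"]
L9 = [
    (1, 1, 1, 1), (1, 2, 2, 2), (1, 3, 3, 3),
    (2, 1, 2, 3), (2, 2, 3, 1), (2, 3, 1, 2),
    (3, 1, 3, 2), (3, 2, 1, 3), (3, 3, 2, 1),
]

# Hypothetical warpage for runs 1-9 (illustrative values only).
warpage = [0.67, 0.58, 0.50, 0.63, 0.56, 0.62, 0.60, 0.67, 0.60]


def response_table(design, response, factors):
    """Return {factor: (level means, delta)} for a main-effects response table."""
    table = {}
    for col, name in enumerate(factors):
        level_means = {}
        for level in sorted({row[col] for row in design}):
            values = [y for row, y in zip(design, response) if row[col] == level]
            level_means[level] = mean(values)
        delta = max(level_means.values()) - min(level_means.values())
        table[name] = (level_means, delta)
    return table


table = response_table(L9, warpage, FACTORS)
# Largest delta first: the factor whose levels move the response the most.
ranking = sorted(table, key=lambda name: table[name][1], reverse=True)
for rank, name in enumerate(ranking, start=1):
    print(rank, name, "delta =", round(table[name][1], 3))
```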
<p>Concentricity is difficult to measure, both in real life and in simulation. In real life, the distance between different points is measured using a coordinate-measuring machine (CMM). In the Moldex3D simulation, the linear shrinkage between different nodes was measured. Eight different nodes were identified. The linear shrinkage across the diameter of the tube was determined; the lower the linear shrinkage, the more circular the part and the better its concentricity.</p>
<p>The main effects plot below for shrinkage shows that to get better concentricity/linear shrinkage between the nodes, a lower melt temperature, cooling time and filling time with a high pack pressure were preferable.</p>
<p><img alt="Main Effects Plot for Shrinkage" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/3eb9b51b4bd8caeac5ead713a86ce90b/main_effects_shrinkage.png" style="width: 579px; height: 385px;" /></p>
<p>From this plot, it was established that to achieve lower linear shrinkage, the optimum process setting should be a melt temperature of 225°C, packing pressure of 15MPa, cooling time of 8s and filling time of 0.1s. However, a cooling time of 8s may not be practical, as the analysis of warpage shows it would give high warpage.</p>
<p>Minitab was also used to find out which of the four control factors resulted in the greatest impact on linear shrinkage.</p>
<p><img alt="Response Table for Shrinkage" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/9e0e2aca3064320d44a9860223665f48/response_table_shrinkage.png" style="width: 500px; height: 153px;" /></p>
<p>This showed that pack pressure is ranked first, followed by cooling time, melt temperature and lastly the filling time. Since the 8s cooling time would lead to high warpage, a compromise had to be made.</p>
<p>As mentioned earlier, for linear shrinkage the packing pressure was more of a contributing factor than the cooling time, so it makes sense to use 12s cooling time with 15MPa packing pressure. Comparing the nine different runs for linear shrinkage in an area graph showed that run six gave the lowest linear shrinkage.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/dfabcb5cb7861c6dc11cc0fdb25c2b2d/area_graph_of_shrinkage.png" style="width: 500px; height: 333px;" /></p>
<p>Based on the user specification, Moldex3D’s mathematical calculations obtained the optimised run. For this example, weighting for warpage was the same as for linear shrinkage. However, based on the DOE simulation results obtained, the optimum process setting for the lowest warpage was to have a cooling time of 12s and filling time of 0.3s. The optimum process for the lowest linear shrinkage, on the other hand, required a cooling time of 8s and fill time of 0.1s.</p>
Concluding thoughts
<p>Moldex3D simulation resulted in a compromise process setting (melt temperature of 225°C, packing pressure of 15MPa, cooling time of 12s and filling time of 0.1s), which was used as the optimum run. From the area graphs shown below, it can be seen that the optimised run 10 gives the lowest warpage compared to the other nine runs, while having low linear shrinkage.</p>
<p><img alt="optimized run - area chart" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/13c7a74c8d37f74f4acc152b676e53b6/optimized_run_area_graph_w640.png" style="width: 640px; height: 210px;" /></p>
<p>From the simulation in Moldex3D, shown below, it can be seen that part warpage and concentricity of the tube have been significantly improved (warpage has been improved by 20-30% while linear shrinkage has been kept to 0.6-0.7%).</p>
<p><img alt="Moldex 3D simulation" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a1b9270c0e645e9db3d7c4f626308aba/moldex_3d_sim.png" style="width: 500px; height: 179px;" /></p>
<p>It is important that designers and moulders understand that numerical results in a simulation such as this provide only a relative comparison and should not be treated as absolute. This is because there are various uncontrollable factors in the actual mould shop environment—‘noise’—which cannot be re-enacted in a simulation. However, running DOE using simulation can give the engineering team a head start on identifying which control factors to focus on and the relationship those factors have with part quality.</p>
<p><strong>About the guest blogger</strong></p>
<p><a href="http://uk.linkedin.com/in/jasminwongym">Jasmin Wong</a> is a project engineer at UK-based <a href="http://www.plazology.co.uk/" target="_blank">Plazology</a>, which provides product design optimisation, injection moulding flow simulation, mould design, mould procurement, and moulding process validation services to global manufacturing customers. She is an MSc graduate in polymer composite science and engineering and recently gained Moldex3D Analyst Certification.</p>
<p><em>A version of this article originally appeared in the <a href="http://content.yudu.com/htmlReader/A3572w/IWOct14/reader.html?page=26" target="_blank">October 2014 issue of Injection World</a> magazine.</em></p>
</div>
</div>
Can Regression and Statistical Software Help You Find a Great Deal on a Used Car?
<p>You need to consider many factors when you’re buying a used car. Once you narrow your choice down to a particular car model, you can get a wealth of information about individual cars on the market through the Internet. How do you navigate through it all to find the best deal? By analyzing the data you have available. </p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/710ce579b4120727bf67e8b48f5965e8/240_used_car_kovacs.jpg" style="line-height: 20.7999992370605px; border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 240px; height: 240px;" /></p>
<p>Let's look at how this works using <a href="http://blog.minitab.com/blog/understanding-statistics/we-just-got-rid-of-five-reasons-to-fear-data-analysis">the Assistant</a> in Minitab 17. With the Assistant, you can use regression analysis to calculate the expected price of a vehicle based on variables such as year, mileage, whether or not the technology package is included, and whether or not a free Carfax report is included.</p>
<p>And it's probably a lot easier than you think. </p>
<p>A search of a leading Internet auto sales site yielded data about 988 vehicles of a specific make and model. After putting the data into Minitab, we choose <strong>Assistant > Regression…</strong></p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/9e87de993a0daa39e6643b8c6d3aed9c/regression_dialog.png" style="width: 395px; height: 247px;" /></p>
<p>At this point, if you aren’t very comfortable with regression, <a href="http://www.minitab.com/products/minitab/assistant/">the Assistant makes it easy to select the right option for your analysis</a>.</p>
A Decision Tree for Selecting the Right Analysis
<p>We want to explore the relationships between the price of the vehicle and four factors, or X variables. Since we have more than one X variable, and since we're not looking to optimize a response, we want to choose Multiple Regression.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/bc802d35bfb57ca3b86e061da4fa4b09/regression_decision_tree_w640.png" style="width: 640px; height: 502px;" /></p>
<p>This <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/9ecb2280228deb621ee2db7f6fbe300e/used_cars.MTW">data set</a> includes five columns: mileage, the age of the car in years, whether or not it has a technology package, whether or not it includes a free CARFAX report, and, finally, the price of the car.</p>
<p>We don’t know which of these factors may have a significant relationship to the cost of the vehicle, and we don’t know whether there are significant two-way interactions between them, or if there are quadratic (nonlinear) terms we should include. We don’t need to: just fill out the dialog box as shown. </p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/b93a0a755e8e73dc7f681ea4b1965749/regression_dialog_box.png" style="width: 532px; height: 382px;" /></p>
<p>Press OK and the Assistant assesses each potential model and selects the best-fitting one. It also provides a comprehensive set of reports, including a Model Building Report that details how the final model was selected and a Report Card that alerts you to potential problems with the analysis, if there are any.</p>
Interpreting Regression Results in Plain Language
<p>The Summary Report tells us in plain language that there is a significant relationship between the Y and X variables in this analysis, and that the factors in the final model explain 91 percent of the observed variation in price. It confirms that all of the variables we looked at are significant, and that there are significant interactions between them. </p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/746574a27bba821ffab4f77ae1a2931b/multiple_regression_summary_report_w640.png" style="width: 640px; height: 480px;" /></p>
<p>The Model Equations Report contains the final regression models, which can be used to predict the price of a used vehicle. The Assistant provides 2 equations, one for vehicles that include a free CARFAX report, and one for vehicles that do not.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/58598060212558634d62d75a7045bf0b/regression_equation_w640.png" style="width: 640px; height: 186px;" /></p>
<p>We can tell several interesting things about the price of this vehicle model by reading the equations. First, the average cost for vehicles with a free CARFAX report is about $200 more than the average for vehicles with a paid report ($30,546 vs. $30,354). This could be because these cars probably have a clean report (if not, the sellers probably wouldn’t provide it for free).</p>
<p>Second, each additional mile added to the car decreases its expected price by roughly 8 cents, while each year added to the car’s age decreases the expected price by $2,357.</p>
<p>The technology package adds, on average, $1,105 to the price of vehicles that have a free CARFAX report, but the package adds $2,774 to vehicles with a paid CARFAX report. Perhaps the sellers of these vehicles hope to use the appeal of the technology package to compensate for some other influence on the asking price. </p>
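<p>To make the arithmetic concrete, here is a minimal Python sketch that plugs the figures quoted above into a simple linear form. It assumes the two dollar averages act as the constant terms of the two equations and ignores any additional terms the Assistant's actual equations may contain, so treat it as an illustration and not as the fitted model. The function name <code>expected_price</code> and the example car are hypothetical.</p>

```python
# Figures quoted in the text; the real equations may contain further terms.
CONSTANT = {"free": 30546.0, "paid": 30354.0}      # by CARFAX report type
TECH_PACKAGE = {"free": 1105.0, "paid": 2774.0}    # added if package present
PER_MILE = -0.08                                   # roughly 8 cents per mile
PER_YEAR = -2357.0                                 # per year of age


def expected_price(mileage, age_years, has_tech_package, carfax):
    """Approximate expected asking price under the simplified linear form."""
    price = CONSTANT[carfax] + PER_MILE * mileage + PER_YEAR * age_years
    if has_tech_package:
        price += TECH_PACKAGE[carfax]
    return price


# Hypothetical car: 20,000 miles, 2 years old, technology package, free report.
example = expected_price(20000, 2, True, "free")
print(round(example, 2))
```

<p>Under these assumptions, the same car with a paid report would come out higher, because the larger technology package term outweighs the lower constant.</p>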
Residuals versus Fitted Values
<p>While these findings are interesting, our goal is to find the car that offers the best value. In other words, we want to find the car that has the largest difference between the asking price and the expected asking price predicted by the regression analysis.</p>
<p>For that, we can look at the Assistant’s Diagnostic Report. The report presents a chart of Residuals vs. Fitted Values. If we see obvious patterns in this chart, it can indicate problems with the analysis. In that respect, this chart of Residuals vs. Fitted Values looks fine, but now we’re going to use the chart to identify the best value on the market.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/d55ae8720ba281bf37135b68b2069434/multiple_regression_diagnostic_report_w640.png" style="width: 640px; height: 480px;" /></p>
<p>In this analysis, the “Fitted Values” are the prices predicted by the regression model. “Residuals” are what you get when you subtract the predicted asking price from the actual asking price, which is exactly the information you’re looking for: a large negative residual means a car is listed well below its predicted price. The Assistant marks large residuals in red, making them very easy to find. And three of those residuals (which appear in light blue above because we’ve selected them) are very far below the asking price predicted by the regression analysis.</p>
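<p>Outside of the chart, the same search can be sketched in a few lines of Python. The sketch computes each residual as the actual asking price minus the fitted price and sorts the listings so the most underpriced cars come first. The listings, field names, and the function <code>rank_by_residual</code> are made up for illustration and are not taken from the data set.</p>

```python
def rank_by_residual(listings):
    """Return (row, residual) pairs, most negative residual first.

    residual = actual asking price - fitted (predicted) price
    """
    scored = [(item["row"], item["asking"] - item["fitted"]) for item in listings]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical listings, not values from the used-car worksheet.
listings = [
    {"row": 1, "asking": 24500.0, "fitted": 25000.0},
    {"row": 2, "asking": 19000.0, "fitted": 24000.0},
    {"row": 3, "asking": 26500.0, "fitted": 26000.0},
]

ranked = rank_by_residual(listings)
for row, residual in ranked:
    print(row, residual)
```

<p>The rows at the top of the ranking are only candidates. As with the chart, each one still needs a closer look before you conclude it is a bargain.</p>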
<p>Selecting these data points on the graph reveals that these are vehicles whose data appears in rows 357, 359, and 934 of the data sheet. Now we can revisit those vehicles online to see if one of them is the right vehicle to purchase, or if there’s something undesirable that explains the low asking price. </p>
<p>Sure enough, the records for those vehicles reveal that two of them have severe collision damage.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/5dbbf5aa405d4b2d53ec720657a09556/vehicles.jpg" style="width: 320px; height: 356px;" /></p>
<p>But the remaining vehicle appears to be in pristine condition, and is several thousand dollars less than the price you’d expect to pay, based on this analysis!</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/640bd720a3d1f8b04713aa0ec321a570/nice_car.png" style="width: 254px; height: 189px;" /></p>
<p>With the power of regression analysis and the Assistant, we’ve found a great used car—at a price you know is a real bargain.</p>
<p> </p>
Using Data Analysis to Maximize Webinar Attendance
<p>We like to host webinars, and our customers and prospects like to attend them. But when our webinar vendor moved from a pay-per-person pricing model to a pay-per-webinar pricing model, we wanted to find out how to maximize registrations and thereby minimize our costs.<img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/8a6733d3b0516b7f1c7ad80ea753d430/mtbnewspromos_w640.jpeg" style="width: 400px; height: 273px; float: right; border-width: 1px; border-style: solid; margin: 10px 15px;" /></p>
<p>We collected webinar data on the following variables:</p>
<ul>
<li>Webinar topic</li>
<li>Day of week</li>
<li>Time of day – 11 a.m. or 2 p.m.</li>
<li>Newsletter promotion – no promotion, newsletter article, newsletter sidebar</li>
<li>Number of registrants</li>
<li>Number of attendees</li>
</ul>
<p>Once we'd collected our data, it was time to analyze it and answer some key questions using <a href="http://www.minitab.com/products/minitab/">Minitab Statistical Software</a>.</p>
Should we use registrant or attendee counts for the analysis?
<strong><span style="line-height: 16.8666667938232px; font-family: Calibri, sans-serif; font-size: 11pt;"><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/4d9fa1e3c73606627d2ca1ec34b620e2/scatterplot_w640.jpeg" style="width: 300px; height: 197px; margin: 10px 15px; float: left;" /></span></strong>
<p>First we needed to decide what we would use to measure our results: the number of people who signed up, or the number of people who actually attended the webinar. This question really boils down to answering the question, “Can I trust my data?”</p>
<p>Our data collection system for webinar registrants is much more accurate than our data collection system for webinar attendees. This is due to customer behavior and their willingness to share contact information, in addition to the automated database processes that connect our webinar vendor data with our own database. So, for a period of time, I manually collected the attendee data directly from our webinar vendor to see how it correlated with the easily-accessible and accurate registration data. The scatterplot above shows the results.</p>
<p>With a <a href="http://blog.minitab.com/blog/understanding-statistics/no-matter-how-strong-correlation-still-doesnt-imply-causation">correlation coefficient</a> of 0.929 and a p-value reported as 0.000 (that is, below 0.0005), there was a strong positive linear relationship between the registration and attendee counts. If registrations are high, then attendance is also high. If registrations are low, then attendance is also low. I concluded that I could use the registration data, which is both easily accessible and extremely reliable, to conduct my analysis.</p>
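<p>For readers who want to see what sits behind the correlation coefficient, here is a small Python sketch of the Pearson formula. The function name <code>pearson_r</code> and the sample counts are hypothetical; the webinar data itself is not reproduced here, and the sketch does not compute a p-value.</p>

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical registrant and attendee counts for five webinars.
registrants = [120, 200, 150, 310, 90]
attendees = [50, 95, 60, 140, 45]
print(round(pearson_r(registrants, attendees), 3))
```

<p>A value close to 1 means the two counts rise and fall together, which is the pattern that justified using registrations as a stand-in for attendance.</p>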
Should we consider data for the last 6 years?
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/5e73f48b852c7afc17762f28bf8887cf/i_mr_chart_of_registrants_w640.jpeg" style="width: 400px; height: 263px; margin: 10px 15px; float: left;" />We’ve been collecting webinar data for 6 years, but that doesn’t mean we can treat the last 6 years of data as one homogeneous population.</p>
<p>A lot can change in a 6-year time period. Perhaps there was a change in the webinar process that affected registrations. To determine whether or not I should use all of the data, I used an Individuals and Moving Range (I-MR, also referred to as X-MR) <a href="http://blog.minitab.com/blog/understanding-statistics/how-create-and-read-an-i-mr-control-chart">control chart</a> to evaluate the process stability of webinar registrations over time.</p>
<p>The graph revealed a single point on the MR chart that flagged as out-of-control. I looked more closely at this point and verified that the data was accurate and that this webinar belonged with the larger population. Based on this information, I decided to proceed with analyzing all 6 years of data together. (Note there is some clustering of points due to promotions, but again the goal here was to determine if we could use data over a 6-year time period.)</p>
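<p>As a rough illustration of how an I-MR chart sets its limits, the Python sketch below uses the textbook constants for a moving range of length 2 (2.66 for the individuals chart and 3.267 for the moving range chart). It assumes those defaults and does not apply any additional tests for special causes, so its limits may differ from a particular Minitab setup. The function name <code>imr_limits</code> and the sample values are hypothetical.</p>

```python
def imr_limits(values):
    """Center lines and control limits for an I-MR chart (moving range of 2)."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    center = sum(values) / len(values)
    return {
        "i_center": center,
        "i_lcl": center - 2.66 * mr_bar,   # 2.66 = 3 / d2, with d2 = 1.128
        "i_ucl": center + 2.66 * mr_bar,
        "mr_center": mr_bar,
        "mr_ucl": 3.267 * mr_bar,          # D4 for subgroups of size 2
        "moving_ranges": moving_ranges,
    }


# Hypothetical registration counts in time order.
limits = imr_limits([10, 12, 11, 13])

# Positions in the moving-range series that exceed the MR upper limit.
flagged = [i for i, mr in enumerate(limits["moving_ranges"]) if mr > limits["mr_ucl"]]
print(limits["i_lcl"], limits["i_ucl"], limits["mr_ucl"], flagged)
```

<p>A point above the MR upper limit signals an unusually large jump between consecutive webinars, which is the kind of signal that prompted the closer look described above.</p>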
What variables impact registrations?
<p>I performed an ANOVA using Minitab's General Linear Model tool to find out which factors—topic, day of week, time of day, or newsletter promotion—significantly affect webinar registrations.<img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/3758d3d03a604bab9921ad9f94663dc8/main_effects_plot_for_registrants_w640.jpeg" style="width: 400px; height: 263px; float: right; margin: 10px 15px;" /></p>
<p>The ANOVA results revealed that the day of week, time of day, and webinar topic <em>do not</em> significantly affect webinar registrations, but the newsletter promotion type <em>does</em> (p-value = 0.000).</p>
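<p>The analysis above used a General Linear Model with four factors. As a simplified illustration of the underlying idea, the Python sketch below computes the F statistic for a one-way ANOVA on a single factor, such as promotion type. It does not compute a p-value, which would require the F distribution, and the function name <code>one_way_anova_f</code> and the registration counts are hypothetical.</p>

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA: between-group vs. within-group variance."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more observations than groups")
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Hypothetical registrations grouped by promotion type.
no_promotion = [1, 2, 3]
sidebar = [2, 3, 4]
article = [5, 6, 7]
f_stat = one_way_anova_f([no_promotion, sidebar, article])
print(f_stat)
```

<p>A large F statistic says the group means differ by more than the spread within the groups would explain. It does not say which groups differ, which is the job of a follow-up comparison.</p>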
<p>So which webinar promotion type maximizes webinar registrations?</p>
<p>Using Minitab to conduct <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/keep-that-special-someone-happy-when-you-perform-multiple-comparisons">Tukey comparisons</a>, we can see that registrations for webinars promoted in the newsletter sidebar space were not significantly different from webinars that weren't promoted at all.</p>
<p>However, webinars that were promoted in the newsletter <em>article </em>space resulted in significantly more registrations than both the sidebar promotions and no promotions.</p>
<p>From this analysis, we concluded that we still had the flexibility to offer webinars at various times and days of the week, and we could continue to vary webinar topics based on customer demand and other factors. To maximize webinar attendance and minimize webinar cost, we needed to focus our efforts on promoting the webinars in our newsletter, utilizing the article space.</p>
<p>But over the past year, we’ve started to actively promote our webinars via other channels as well, so next up is some more data analysis—using Minitab—to figure out what marketing channels provide the best results…</p>
Data Analysis | Hypothesis Testing | Regression Analysis | Statistics | Fri, 17 Oct 2014 12:00:00 +0000 | http://blog.minitab.com/blog/michelle-paret/using-data-analysis-to-maximize-webinar-attendance | Michelle Paret