Minitab | Minitab
Blog posts and articles about using Minitab software in quality improvement projects, research, and more.
http://blog.minitab.com/blog/minitab/rss
Wed, 26 Nov 2014 18:22:50 +0000
FeedCreator 1.7.3
Giving Thanks for Ways to Edit a Bar Chart of Pies
http://blog.minitab.com/blog/statistics-and-quality-improvement/giving-thanks-for-ways-to-edit-a-bar-chart-of-pies
<p>My siblings occasionally remind me that because I’m getting older, one day my metabolism is going to collapse. When that day comes, consuming mass quantities of food will surely lead to the collapse of my body, mind, and soul. But since that day is arriving slowly, on Thanksgiving I’m an every-pie kind of guy.</p>
<p>Now, I know what you’re thinking. It’s Thanksgiving. I’ve just mentioned pies. We’re going to look at pie charts of pies. If you really want to look at pie charts of pies, go ahead and get it out of your system:</p>
<p><a href="http://cf2s1.cbncdn.com/wp-content/blogs.dir/1/files/2012/11/pies_final.jpg">2012 survey by National Public Radio about pie preferences</a></p>
<p><a href="http://www.livescience.com/33111-favorite-pie-america.html">2008 survey by Schwan’s Consumer Brands North America</a></p>
<p><a href="http://thecreatorsproject.vice.com/blog/a-robot-that-puts-pie-charts-onto-actual-pies">A Robot that Puts Pie Charts onto Actual Pies</a></p>
<p>In this post, we’re going to do something a bit different: make a bar chart about pies, and then edit it.</p>
<p>At our house, we usually do three pies for Thanksgiving: Pumpkin, Chess, and Pecan. I’m going to use a chart of these to show you the things I’m most thankful you can do after you’ve made your bar chart in Minitab. Let’s say that we start with a chart of the calories per slice.</p>
<p><img alt="The default graph has all blue bars. In this case, the order of the bars is the order from the worksheet." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/0e3e9622c239cb3ac8587a5168c98f95/default_graph.png" style="border-width: 0px; border-style: solid; width: 576px; height: 384px;" /></p>
Reorder the bars
<p>These bars are presently in the order that they were listed in the worksheet. But I like to eat them in order of difficulty, starting with the pecan and easing towards the pumpkin. This tends to follow the order of the calories, so we can put the pies in descending order.</p>
<ol>
<li>Double-click the bars.</li>
<li>Select the <strong>Chart Options</strong> tab.</li>
<li>In <strong>Order Main X Groups By</strong>, select <strong>Decreasing Y</strong>. Click <strong>OK</strong>.</li>
</ol>
<p><img alt="The pecan pie is on the left because it has the most calories. Other pies follow in descending order." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/1f20a2cee7c03e5d894dd21791999e57/step_1_ordering.png" style="border-width: 0px; border-style: solid; width: 577px; height: 385px;" /></p>
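Minitab handles this in the dialog, but the underlying idea is just sorting the categories by their y-values. Here's a minimal Python sketch of the "Decreasing Y" ordering, using made-up calorie counts (the real values live in the Minitab worksheet):

```python
# Hypothetical calories per slice; the real numbers come from the worksheet
calories = {"Pumpkin": 316, "Chess": 460, "Pecan": 503}

# "Decreasing Y": order the categories by value, largest first
ordered = sorted(calories.items(), key=lambda kv: kv[1], reverse=True)
print([name for name, _ in ordered])  # ['Pecan', 'Chess', 'Pumpkin']
```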
Add labels that show the y-values
<p>Bar charts are great for making comparisons. Ordering them makes it even clearer which categories are greatest and which are least. But if you want to get precise numbers, you can easily add labels that show the values from the data.</p>
<ol>
<li>Right-click the graph.</li>
<li>Select <strong>Add > Data Labels</strong>. Click <strong>OK</strong>.</li>
</ol>
<p><img alt="The numbers above the bars give the exact number of calories per slice." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/4b92591995ee62865f6354f8d7ac6215/step_2_labeling.png" style="border-width: 0px; border-style: solid; width: 576px; height: 384px;" /></p>
Accumulate bars
<p>As an every-pie-kind-of-guy, one of the things I might want to know is how many calories I eat when I have a slice of each pie. That’s the kind of situation when it’s helpful to accumulate Y across X.</p>
<ol>
<li>Double-click the bars.</li>
<li>Select the <strong>Chart Options</strong> tab.</li>
<li>In <strong>Percent and Accumulate</strong>, check <strong>Accumulate Y across X</strong>. Click <strong>OK</strong>.</li>
</ol>
<p>The resulting graph shows the number of calories for a slice of pecan, for a slice of pecan plus a slice of chess, and for one slice of each of the three pies.</p>
<p><img alt="The right bar shows the number of calories if I eat one slice of each pie." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/fa96c4e43cfb45b871b91caa0265999e/step_3_accumulate.png" style="border-width: 0px; border-style: solid; width: 576px; height: 384px;" /></p>
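Conceptually, "Accumulate Y across X" is just a running total over the ordered categories. A quick Python sketch with the same hypothetical calorie values:

```python
from itertools import accumulate

# Hypothetical calories per slice, already in decreasing order
pies = ["Pecan", "Chess", "Pumpkin"]
calories = [503, 460, 316]

# "Accumulate Y across X": each bar shows the running total so far
running = list(accumulate(calories))
print(running)  # the last value is the cost of one slice of each pie
```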
Edit the fill patterns
<p>As when you’re making a graph about pies, it’s often helpful to use colorful bars that represent the categories in the data. In this case, all you have to do is follow these steps:</p>
<ol>
<li>Click the bars in the graph once to select all of them.</li>
<li>Click one of the bars in the graph once to select only one bar.</li>
<li>Double-click the selected bar to edit the bar.</li>
<li>In <strong>Fill Pattern</strong>, select <strong>Custom</strong>.</li>
<li>From <strong>Background color</strong>, select the color that represents your category. Click <strong>OK</strong>.</li>
</ol>
<p>For example, we could make the pecan bar “chestnut,” the chess bar “gold,” and the pumpkin bar “orange.”</p>
<p><img alt="Colors of the bars are the colors of the pies." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/8b4d20a4aff1a324e61f847949df3d89/step_4_coloring.png" style="border-width: 0px; border-style: solid; width: 576px; height: 384px;" /></p>
<p>It’s generally best to save this step for last, because some other editing steps, like changing the order of the bars, can change the bar colors.</p>
Wrap up
<p>Very often, editing a graph so that it presents the message that you want is easier once you’re able to see the graph. That makes it wonderful that it’s so easy to edit a graph after you’ve already made it in Minitab. To see even more about what you can do with different types of graphs, check out <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/graph-options/">the list of graph options</a>. And have a Happy Thanksgiving where you are!</p>
Fun Statistics
Wed, 26 Nov 2014 15:16:11 +0000
http://blog.minitab.com/blog/statistics-and-quality-improvement/giving-thanks-for-ways-to-edit-a-bar-chart-of-pies
Cody Steele
Lessons in Quality from Guadalajara and Mexico City
http://blog.minitab.com/blog/understanding-statistics-and-its-application/lessons-in-quality-from-guadalajara-and-mexico-city
<p><img alt="View of Mexico City" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/8e5ec9217bc8fbc2ca7a6784a1efcdfa/mexico_df_400w.jpg" style="border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 400px; height: 235px;" />Last week, thanks to the collective effort from many people, we held very successful events in Guadalajara and Mexico City, which gave us a unique opportunity to meet with over 300 Spanish-speaking Minitab users. They represented many different industries, including automotive, textile, pharmaceutical, medical devices, oil and gas, electronics, and mining, as well as academic institutions and consultants.</p>
<p>As I listened to my peers Jose Padilla and <a href="http://blog.minitab.com/blog/marilyn-wheatleys-blog">Marilyn Wheatley</a> deliver their presentations, it was interesting to see people's reactions as they learned more about our products and services. Several attendees were particularly pleased to learn more about Minitab's ease-of-use and <a href="http://www.minitab.com/products/minitab/assistant/">step-by-step help with analysis</a> offered by the Assistant menu. I saw others react to demonstrations of Minitab's comprehensive Help system, the use of executables for automation purposes, and several of the tips and tricks discussed throughout our presentations.</p>
<p>We also had multiple conversations on Minitab's flexible licensing options. Several attendees who spend a lot of time on the road were particularly glad to learn about our <a href="http://support.minitab.com/installation/frequently-asked-questions/license-fulfillment/borrow-a-license-of-minitab-companion/">borrowing functionality</a>, which lets you “check out” a license so you can use Minitab software without accessing your organization’s license server.</p>
Acceptance Sampling Plans
<p>There were plenty of technical discussions as well. One interesting question came from a user who asked how Minitab's Acceptance Sampling Plans compare to the <a href="http://asq.org/knowledge-center/ANSI_ASQZ1_4-2008/index.html">ANSI Z1.4</a> standard (a.k.a. MIL-STD 105E). The short answer is that the tables provided by ANSI Z1.4 are for a specific AQL (Acceptable Quality Level), while implicitly assuming a certain RQL (Rejectable Quality Level) based solely on the lot size. ANSI Z1.4 is an AQL-based system, while Minitab's acceptance sampling plans give you the flexibility to create a customized sampling scheme for a specific AQL, RQL, or lot size, using either the binomial or the hypergeometric distribution.</p>
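To make the AQL/RQL trade-off concrete, here's a rough Python sketch of the probability of accepting a lot under a single sampling plan, using the binomial model. The plan numbers (n = 125, c = 3) are illustrative only, not taken from any standard table:

```python
from math import comb

def accept_prob(n: int, c: int, p: float) -> float:
    """P(accept lot): the sample of n contains c or fewer defectives,
    assuming a binomial model (lot much larger than the sample)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c + 1))

# Hypothetical plan: sample n=125 items, accept on c=3 or fewer defectives.
# A good plan accepts almost always near the AQL and rarely near the RQL.
print(round(accept_prob(125, 3, 0.01), 3))  # near a 1% AQL
print(round(accept_prob(125, 3, 0.05), 3))  # near a 5% RQL
```

Sweeping `p` over a range of defect rates traces out the plan's operating characteristic (OC) curve, which is what Minitab plots for you.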
Destructive Testing and Gage R&R
<p>Other users had questions about Gage R&R and destructive testing. Practitioners commonly assess a destructive test using Nested Gage R&R; however, this is not always necessary. The main problem with destructive testing is that every part tested is destroyed, and thus each part can be measured by only a single operator. Since the purpose of this type of analysis is to measure the repeatability and reproducibility of the measurement system, you must identify parts that are as homogeneous as possible. Typically, instead of 10 parts, practitioners may use multiple parts from each of 10 batches. If the within-batch variation is small enough, the parts from each batch can be considered to be "the same," and thus the readings measured by all the operators can be used to produce repeatability and reproducibility measures. The main trick is to have homogeneous units or batches that give you enough samples to be tested by all operators for all replicates. If this is the case, you can analyze a destructive test with Crossed Gage R&R.</p>
Control Charts and Subgroup Size
<p>We also had an interesting discussion about the sensitivity of Shewhart <a href="http://blog.minitab.com/blog/understanding-statistics/control-chart-tutorials-and-examples">control charts</a> to the subgroup size. Specifically, one of the attendees asked our recommendation for subgroup size: 4, or 5? </p>
<p>The answer to this intriguing question requires an understanding of why subgroups are recommended. Control charts have limits constructed so that if the process is stable, the probability of observing points outside the control limits is very small; this probability is typically referred to as the false alarm rate, and it is usually set at 0.0027. This calculation assumes the process is normally distributed, so if we were plotting individual observations, as in an Individuals chart, the control limits would be effective for detecting an out-of-control situation only if the data came from a normal distribution. To reduce the dependence on normality, Shewhart suggested collecting the data in subgroups: if we plot the subgroup means instead of the individual data, the control limits become less and less sensitive to departures from normality as the subgroup size increases. This is a result of the Central Limit Theorem (CLT), which states that regardless of the underlying distribution of the data, if we take independent samples and compute the average (or sum) of the observations in each sample, the distribution of these sample means converges to a normal distribution.</p>
<p>So, going back to the original question: what is the recommended subgroup size for building control charts? The answer depends on how skewed the underlying distribution may be. For many distributions a subgroup size of 5 is sufficient for the CLT to kick in, making our control charts robust to departures from normality; however, for extremely skewed distributions like the exponential, the subgroup sizes may need to be much larger than 50. This topic was discussed in a paper by Schilling and Nelson titled "<a href="http://asq.org/qic/display-item/?item=5238">The Effect of Non-normality on the Control Limits of Xbar Charts</a>," published in JQT back in 1976.</p>
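A quick simulation illustrates the point. The sketch below (plain Python, with assumed parameters) draws subgroup means from an exponential process and estimates how often a mean falls above the upper 3-sigma limit; the rate starts far above the nominal one-sided value of about 0.00135 and shrinks as the subgroup size grows:

```python
import random

random.seed(42)

def false_alarm_rate(subgroup_size, n_subgroups=20000):
    """Simulate subgroup means from a skewed process (exponential, mean 1)
    and count how often a mean exceeds the upper 3-sigma limit.
    For exponential(1) data the subgroup mean has standard deviation
    1/sqrt(n), so the theoretical upper limit is 1 + 3/sqrt(n)."""
    limit = 1 + 3 / subgroup_size ** 0.5
    hits = 0
    for _ in range(n_subgroups):
        m = sum(random.expovariate(1.0)
                for _ in range(subgroup_size)) / subgroup_size
        if m > limit:
            hits += 1
    return hits / n_subgroups

# An Individuals chart (subgroup size 1) has a badly inflated upper-tail
# false alarm rate for skewed data; larger subgroups pull it back toward
# the normal-theory value.
for n in (1, 5, 50):
    print(n, false_alarm_rate(n))
```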
Analyzing Variability
<p>We also had a great discussion about modeling variability in a process. One of the attendees, who works for McDonald's, was looking for statistical methods to reduce the variation in the weight of apple slices. An apple is cut into 10 slices, and the goal was to minimize the variation in weight so that exactly four slices could be placed in each bag without further rework. This gave me the opportunity to demonstrate how to use the <a href="http://blog.minitab.com/blog/adventures-in-statistics/assessing-variability-for-quality-improvement">Analyze Variability</a> command in Minitab, which happens to be one of the topics we cover in our <a href="http://www.minitab.com/training/courses/#doe-in-practice-manufacturing">DOE in Practice</a> course.</p>
We Love Your Questions
<p>For me and my fellow trainers, there’s nothing better than talking with people who are using Minitab software to solve problems. Sometimes we’re able to provide a quick, helpful answer. Sometimes a question provokes a great discussion about some quality challenge we all have in common. And sometimes a question will lead to a great idea that we’re able to share with our developers and engineers to make our software better. </p>
<p>If you have a question about Minitab, statistics, or quality improvement, please feel free to comment here. And if you use Minitab software, you can always contact our <a href="http://www.minitab.com/support/">customer support</a> team for direct assistance from specialists in IT, statistics, and quality improvement.</p>
Quality Improvement
Statistics
Statistics Help
Wed, 19 Nov 2014 13:57:02 +0000
http://blog.minitab.com/blog/understanding-statistics-and-its-application/lessons-in-quality-from-guadalajara-and-mexico-city
Eduardo Santiago
What to Do When Your Data's a Mess, part 3
http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-datas-a-mess2c-part-3
<p>Everyone who analyzes data regularly has the experience of getting a worksheet that just isn't ready to use. Previously, I wrote about tools you can use to <a href="http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-data-is-a-mess-part-1">clean up and eliminate clutter in your data</a> and <a href="http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-datas-a-mess2c-part-2">reorganize your data</a>. </p>
<p>In this post, I'm going to highlight tools that help you get the most out of messy data by altering its characteristics.</p>
Know Your Options
<p>Many problems with data don't become obvious until you begin to analyze it. A shortcut or abbreviation that seemed to make sense while the data was being collected, for instance, might turn out to be a time-waster in the end. What if abbreviated values in the data set only make sense to the person who collected it? Or a column of numeric data accidentally gets coded as text? You can solve those problems quickly with <a href="http://www.minitab.com/products/minitab">statistical software</a> packages.</p>
Change the Type of Data You Have
<p>Here's an instance where a data entry error resulted in a column of numbers being incorrectly classified as text data. This will severely limit the types of analysis that can be performed using the data.</p>
<p><img alt="misclassified data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c45b427d3e5e2b5eac4a505ed5c3b24f/misclassified_data.png" style="width: 200px; height: 156px;" /></p>
<p>To fix this, select <strong>Data > Change Data Type</strong> and use the dialog box to choose the column you want to change.</p>
<p><img alt="change data type menu" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/46ece127300500409098383a2e476a9b/text_to_numeric_data.png" style="width: 376px; height: 175px;" /></p>
<p>One click later, and the errant text data has been converted to the desired numeric format:</p>
<p><img alt="numeric data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f1b9df0211f9085e577a41b0e3661b45/numeric_data.png" style="width: 200px; height: 156px;" /></p>
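If you work in a programming environment instead, the equivalent fix is a type conversion. A rough pandas analogue with hypothetical values (not the worksheet from the screenshots):

```python
import pandas as pd

# A column that arrived as text (note the quoted strings)
df = pd.DataFrame({"Calories": ["503", "460", "316"]})
print(df["Calories"].dtype)  # object (text)

# Convert the column, analogous to Data > Change Data Type
df["Calories"] = pd.to_numeric(df["Calories"])
print(df["Calories"].dtype)  # int64
```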
Make Data More Meaningful by Coding It
<p>When this company collected data on the performance of its different functions across all its locations, it used numbers to represent both locations and units. </p>
<p><img alt="uncoded data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/d22a57fe9e9e398bd948e86c0adafe34/uncoded_data.png" style="width: 135px; height: 158px;" /></p>
<p>That may have been a convenient way to record the data, but unless you've memorized what each set of numbers stands for, interpreting the results of your analysis will be a confusing chore. You can make the results easy to understand and communicate by coding the data. </p>
<p>In this case, we select <strong>Data > Code > Numeric to Text...</strong></p>
<p><img alt="code data menu" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c75e46cc190497fd41b0e6736518c0fe/code_data_menu.png" style="width: 384px; height: 255px;" /></p>
<p>And we complete the dialog box as follows, telling the software to replace the numbers with more meaningful information, like the town each facility is located in. </p>
<p><img alt="Code data dialog box" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/cd75c14324187806b8f3a74a3b8996b4/code_data_dialog.png" style="width: 400px; height: 345px;" /></p>
<p>Now you have data columns that can be understood by anyone. When you create graphs and figures, they will be clearly labeled. </p>
<p><img alt="Coded data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/7ff81bdb08170d6d8a4e8547623cf557/coded_data.png" style="width: 161px; height: 200px;" /></p>
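The same recoding idea in pandas, with a hypothetical code-to-town mapping (the town names here are invented for illustration):

```python
import pandas as pd

# Locations recorded as numeric codes
df = pd.DataFrame({"Location": [1, 2, 1, 3]})

# Hypothetical mapping, analogous to Data > Code > Numeric to Text
towns = {1: "Scranton", 2: "Altoona", 3: "Erie"}
df["Location"] = df["Location"].map(towns)
print(df["Location"].tolist())  # town names instead of codes
```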
Got the Time?
<p>Dates and times can be very important in looking at performance data and other indicators that might have a cyclical or time-sensitive effect. But the way the date is recorded in your data sheet might not be exactly what you need. </p>
<p>For example, if you wanted to see if the day of the week had an influence on the activities in certain divisions of your company, a list of dates in the MM/DD/YYYY format won't be very helpful. </p>
<p><img alt="date column" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f5b0dd178afbc0352f8dc2d9378e887b/date_column.png" style="width: 240px; height: 223px;" /></p>
<p>You can use <strong>Data > Date/Time > Extract to Text... </strong>to identify the day of the week for each date.</p>
<p><img alt="extract-date-to-text" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/7e6f7e8a87ee8291b9c6d51507092c19/extract_date_to_text.png" style="width: 351px; height: 132px;" /></p>
<p>Now you have a column that lists the day of the week, and you can easily use it in your analysis. </p>
<p><img alt="day column" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/dede93c9621917a0cfb54beef121d4e2/day_column.png" style="width: 249px; height: 205px;" /></p>
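A pandas analogue of the same extraction, assuming the dates are stored as MM/DD/YYYY text:

```python
import pandas as pd

df = pd.DataFrame({"Date": ["11/17/2014", "11/18/2014"]})

# Parse MM/DD/YYYY and pull out the weekday name,
# analogous to Data > Date/Time > Extract to Text
df["Day"] = pd.to_datetime(df["Date"], format="%m/%d/%Y").dt.day_name()
print(df["Day"].tolist())  # ['Monday', 'Tuesday']
```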
Manipulating for Meaning
<p>These tools are commonly seen as a way to correct data-entry errors, but as we've seen, you can use them to make your data sets more meaningful and easier to work with.</p>
<p>There are many other tools available in Minitab's Data menu, including an array of options for arranging, combining, dividing, fine-tuning, rounding, and otherwise massaging your data to make it easier to use. Next time you've got a column of data that isn't quite what you need, try using the Data menu to get it into shape.</p>
Data Analysis
Statistics
Stats
Mon, 17 Nov 2014 13:00:00 +0000
http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-datas-a-mess2c-part-3
Eston Martz
Are Preseason Football or Basketball Rankings More Accurate?
http://blog.minitab.com/blog/the-statistics-game/are-preseason-football-or-basketball-rankings-more-accurate
<p>College basketball season tips off today, and for the second straight season Kentucky is the #1 ranked preseason team in the AP poll. Last year Kentucky did not live up to that ranking in the regular season, going 24-10 and earning a lowly 8 seed in the NCAA tournament. But then, in the tournament, they overachieved and made a run all the way to the championship game...before losing to Connecticut.</p>
<p>In football, Florida State was the AP poll preseason #1 football team. While they are currently still undefeated, they aren't quite playing like the #1 team in the country. So this made me wonder, which preseason rankings are more accurate, football or basketball?</p>
<p>I gathered <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/1d3961db92c5ba14bc90b2b8323b95f8/preseason_basketball_vs__football_rankings.MTW">data</a> from the last 10 seasons, and recorded the top 10 teams in the preseason AP poll for both football and basketball. Then I recorded the difference between their preseason ranking and their final ranking. Both sports had 10 teams that weren’t ranked or receiving votes in the final poll, so I gave all of those teams a final ranking of 40.</p>
Creating a Histogram to Compare Two Distributions
<p>Let’s start with a histogram to look at the distributions of the differences. (It's always a good idea to look at the distribution of your data when you're starting an analysis, whether you're looking at quality improvement data work or sports data for yourself.) </p>
<p>You can create this graph in Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a> by selecting <strong>Graph > Histograms</strong>, choosing "With Groups" in the dialog box, and using the Basketball Difference and Football Difference columns as the graph variables:</p>
<p><img alt="Histogram" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/53055c57978dbfa85d28688cc816c98a/histogram_of_basketball_difference__football_difference.jpg" style="width: 720px; height: 480px;" /></p>
<p>The differences in the rankings appear to be pretty similar. Most of the data is towards the left side of this histogram, meaning for most cases the difference between the preseason and final ranking is pretty small.</p>
Conducting a Mann-Whitney Hypothesis Test on Two Medians
<p>We can further investigate the data by performing a hypothesis test. Because the data is heavily skewed, I’ll use <a href="http://blog.minitab.com/blog/the-statistics-game/do-the-data-really-say-female-named-hurricanes-are-more-deadly">a Mann-Whitney test</a>. This compares the medians of two samples with similarly-shaped distributions, as opposed to a <a href="http://blog.minitab.com/blog/understanding-statistics/guidelines-and-how-tos-for-the-2-sample-t-test">2-sample t-test</a>, which compares the means. The median is the middle value of the data: half the observations are less than or equal to it, and half are greater than or equal to it.</p>
<p>To perform this test in our statistical software, we select <strong>Stat > Nonparametrics > Mann-Whitney</strong>, then choose the appropriate columns for our first and second sample: </p>
<p><img alt="Mann-Whitney Test" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/1a1f239841b82e60170e6ecbc8077d4b/mann_whitney.jpg" style="width: 689px; height: 241px;" /></p>
<p>The basketball rankings have a smaller median difference than the football rankings. However, when we examine the <a href="http://blog.minitab.com/blog/understanding-statistics/three-things-the-p-value-cant-tell-you-about-your-hypothesis-test">p-value</a> we see that this difference is not statistically significant. There is not enough evidence to conclude that one preseason poll is more accurate than the other.</p>
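For reference, the same test is a one-liner in many statistics libraries. Here's a SciPy sketch with made-up rank differences for illustration only (the post's actual data is in the linked Minitab worksheet):

```python
from scipy.stats import mannwhitneyu

# Hypothetical |preseason - final| rank differences, one per top-10 team
basketball = [0, 2, 3, 5, 1, 8, 30, 4, 6, 2]
football   = [1, 4, 6, 7, 2, 10, 32, 5, 9, 3]

# Two-sided Mann-Whitney test compares the two samples' medians
stat, p = mannwhitneyu(basketball, football, alternative="two-sided")
print(round(p, 3))  # a large p-value means no significant difference
```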
<p>But what about the best teams? I grouped each of the top 3 ranked teams and looked at the median difference between their preseason and final rank.</p>
<p><img alt="Bar Chart" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/692a3db40dd5d3b4c20d539f92395629/bar_chart.jpg" style="width: 720px; height: 480px;" /></p>
<p>The preseason AP basketball poll has a smaller difference for the #1 and #3 ranked teams. But the football poll is better for the #2 team, having an impressive median value of 1. Overall, both polls are relatively good, as neither has a median value greater than 6. And the differences are close enough that we can’t conclude that one is more accurate than the other.</p>
What Does It Mean for the Teams?
<p>While the odds are against both Kentucky and Florida State to finish the season ranked #1 in their respective polls, previous seasons indicate that they’re still likely to finish as one of the top teams. This is better news for Kentucky, as being one of the top teams means they’ll easily make the NCAA basketball tournament and get a high seed. However, Florida State must finish as one of the top 4 teams, or else they’ll miss out on the football postseason completely.</p>
<p>So while we can’t conclude that one poll is better than the other, teams at the top of the AP basketball poll are clearly much more likely to reach the postseason than teams at the top of the football poll.</p>
Data Analysis
Fun Statistics
Hypothesis Testing
Statistics in the News
Fri, 14 Nov 2014 15:03:33 +0000
http://blog.minitab.com/blog/the-statistics-game/are-preseason-football-or-basketball-rankings-more-accurate
Kevin Rudy
The Power of Multivariate ANOVA (MANOVA)
http://blog.minitab.com/blog/adventures-in-statistics/the-power-of-multivariate-anova-manova
<p><img alt="Willy Wonka" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/964d1b613c1569e983213d2544915ac5/willywonka.jpg" style="float: right; width: 225px; height: 225px; border-width: 1px; border-style: solid; margin: 10px 15px;" />Analysis of variance (ANOVA) is great when you want to compare the differences between group means. For example, you can use ANOVA to assess how three different alloys are related to the mean strength of a product. However, most ANOVA tests assess one response variable at a time, which can be a big problem in certain situations. Fortunately, <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">Minitab statistical software</a> offers a multivariate analysis of variance (MANOVA) test that allows you to assess multiple response variables simultaneously.</p>
<p>In this post, I’ll run through a MANOVA example, explain the benefits, and cover how to know when you should use MANOVA.</p>
Limitations of ANOVA
<p>Whether you’re using a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/basics/what-is-a-general-linear-model/" target="_blank">general linear model (GLM)</a> or <a href="http://blog.minitab.com/blog/adventures-in-statistics/did-welchs-anova-make-fishers-classic-one-way-anova-obsolete" target="_blank">one-way ANOVA</a>, most ANOVA procedures can assess only one response variable at a time. Even with GLM, where you can include many factors and covariates in the model, the analysis simply cannot detect multivariate patterns across the response variables.</p>
<p>This limitation can be a huge roadblock for some studies because it may be impossible to obtain significant results with a regular ANOVA test. You don’t want to miss out on any significant findings!</p>
Example That Compares MANOVA to ANOVA
<p>What the heck are multivariate patterns in the response variables? It sounds complicated, but it’s very easy to show the difference between how ANOVA and MANOVA test the data by using graphs.</p>
<p>Let’s assume that we are studying the relationship between three alloys and the strength and flexibility of our products. Here is the <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/File/3f3b6f58c70a646731a9db97bd7edfab/manova_example.MTW">dataset for the example</a>.</p>
<p>The two individual value plots below show how one-way ANOVA analyzes the data—one response variable at a time. In these graphs, alloy is the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/anova-models/factor-and-factor-levels/" target="_blank">factor</a> and strength and flexibility are the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/regression-models/what-are-response-and-predictor-variables/" target="_blank">response variables</a>.</p>
<img alt="Individual value plot of strength by alloy" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/3402fd3845c2226f555b4ebfe18a87f5/strength_ivp.png" style="width: 350px; height: 233px;" />
<img alt="Individual value plot of flexibility by alloy" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/c7fba5c5eda5e81e02db60b2aefb3327/flexibility_ivp.png" style="width: 350px; height: 233px;" />
<p>The two graphs seem to show that the type of alloy is not related to either the strength or flexibility of the product. When you perform the one-way ANOVA procedure for these graphs, the p-values for strength and flexibility are 0.254 and 0.923 respectively.</p>
<p>Drat! I guess Alloy isn't related to either Strength or Flexibility, right? Not so fast!</p>
<p>Now, let’s take a look at the multivariate response patterns. To do this, I’ll display the same data with a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/graphs/graphs-of-pairs-of-variables/scatterplots/scatterplot/" target="_blank">scatterplot</a> that plots Strength by Flexibility with Alloy as a categorical grouping variable.</p>
<p><img alt="Scatterplot of strength by flexibility grouped by alloy" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/86483284f76817ea95b3c1787e45e7d5/scatterplot.png" style="width: 576px; height: 384px;" /></p>
<p>The scatterplot shows a positive correlation between Strength and Flexibility. MANOVA is useful when you have correlated response variables like these. You can also see that for a given flexibility score, Alloy 3 generally has a higher strength score than Alloys 1 and 2. We can use MANOVA to statistically test for this response pattern to be sure that it’s not due to random chance.</p>
<p>To perform the MANOVA test in Minitab, go to: <strong>Stat > ANOVA > General MANOVA</strong>. Our response variables are Strength and Flexibility and the predictor is Alloy.</p>
<p>Whereas one-way ANOVA could not detect the effect, MANOVA finds it with ease. The p-values in the results are all well below any common significance level. You can conclude that Alloy influences the properties of the product by changing the relationship between the response variables.</p>
<p><img alt="MANOVA results" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/c96fe9a066011b31692765318c2f0d26/manova_swo.png" style="width: 391px; height: 155px;" /></p>
<p>For a more complete guide on how to interpret MANOVA results in Minitab, go to: <strong>Help > StatGuide > ANOVA > General MANOVA</strong>.</p>
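<p>If you're curious what's behind that table, the sketch below computes Wilks' lambda by hand on simulated data built to have the same structure: Strength rises with Flexibility, and one alloy sits above the common trend. It uses Bartlett's chi-square approximation for the p-value, whereas Minitab reports F statistics for several test criteria, so treat this as an illustration of the idea rather than a reproduction of Minitab's output.</p>

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)

# Simulated data: Strength tracks Flexibility; Alloy 3 is shifted above
# the common trend for any given flexibility (the multivariate pattern).
n = 20
X, labels = [], []
for alloy, shift in enumerate([0.0, 0.0, 1.5], start=1):
    flexibility = rng.normal(10, 2, n)
    strength = 5 + 0.8 * flexibility + shift + rng.normal(0, 0.5, n)
    X.append(np.column_stack([strength, flexibility]))
    labels += [alloy] * n
X, labels = np.vstack(X), np.array(labels)

# Within-group (E) and between-group (H) sums of squares and cross-products.
grand = X.mean(axis=0)
E, H = np.zeros((2, 2)), np.zeros((2, 2))
for alloy in np.unique(labels):
    G = X[labels == alloy]
    d = G - G.mean(axis=0)
    E += d.T @ d
    m = G.mean(axis=0) - grand
    H += len(G) * np.outer(m, m)

wilks = np.linalg.det(E) / np.linalg.det(E + H)

# Bartlett's chi-square approximation for the Wilks' lambda test.
N, p, k = len(X), X.shape[1], len(np.unique(labels))
stat = -(N - 1 - (p + k) / 2) * np.log(wilks)
p_value = chi2.sf(stat, df=p * (k - 1))
print(f"Wilks' lambda = {wilks:.3f}, p = {p_value:.2g}")  # p is tiny
```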
When and Why You Should Use MANOVA
<p>Use multivariate ANOVA when you have continuous response variables that are correlated. In addition to multiple responses, you can also include multiple <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/anova-models/factor-and-factor-levels/" target="_blank">factors</a>, <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/anova-models/adding-a-covariate-to-glm/" target="_blank">covariates</a>, and <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/anova-models/what-is-an-interaction/" target="_blank">interactions</a> in your model. MANOVA uses the additional information provided by the relationship between the responses to provide three key benefits.</p>
<ul>
<li><strong>Increased power</strong>: If the response variables are correlated, MANOVA can detect differences too small to be detected through individual ANOVAs.</li>
<li><strong>Detects multivariate response patterns</strong>: The factors may influence the relationship between responses rather than affecting a single response. Single-response ANOVAs can miss these multivariate patterns as illustrated in the MANOVA example.</li>
<li><strong>Controls the family error rate</strong>: Your chance of incorrectly rejecting the null hypothesis increases with each successive ANOVA. Running one MANOVA to test all response variables simultaneously keeps the family error rate equal to your alpha level.</li>
</ul>
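<p>The family error rate in that last bullet is simple arithmetic: with independent tests each run at alpha, the chance of at least one false rejection is 1 - (1 - alpha)<sup>k</sup>.</p>

```python
# Chance of at least one false rejection across k independent tests at alpha.
alpha = 0.05
for k in (1, 2, 5):
    family_rate = 1 - (1 - alpha) ** k
    print(f"{k} test(s): family error rate = {family_rate:.4f}")
# With just the two responses in the alloy example, the rate is already
# about 0.0975; a single MANOVA keeps it at 0.05.
```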
Data AnalysisStatisticsStatistics HelpThu, 13 Nov 2014 13:00:00 +0000http://blog.minitab.com/blog/adventures-in-statistics/the-power-of-multivariate-anova-manovaJim FrostLeaving Out-of-control Points Out of Control Chart Calculations Looks Hard, but It Isn't
http://blog.minitab.com/blog/statistics-and-quality-improvement/leaving-out-of-control-points-out-of-control-chart-calculations-looks-hard2c-but-it-isnt
<p><img alt="Houston skyline" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/cc75ea27bfe3966a53c129479367301b/houston_skyline11.jpg" style="float: right; width: 250px; height: 188px; border-width: 1px; border-style: solid; margin: 10px 15px;" /><a href="http://blog.minitab.com/blog/understanding-statistics/control-chart-tutorials-and-examples">Control charts</a> are excellent tools for looking at data points that seem unusual and for deciding whether they're worthy of investigation. If you use control charts frequently, then you're used to the idea that if certain subgroups reflect temporary abnormalities, you can leave them out when you calculate your center line and control limits. If you include points that you already know are different because of an assignable cause, you reduce the sensitivity of your control chart to other, unknown causes that you would want to investigate. Fortunately, Minitab Statistical Software makes it fast and easy to leave points out when you calculate your center line and control limits. And because Minitab’s so powerful, you have the flexibility to decide if and how the omitted points appear on your chart.</p>
<p>Here’s an example with some environmental data taken from <a href="http://www.tceq.texas.gov/cgi-bin/compliance/monops/yearly_summary.pl">the Meyer Park ozone detector in Houston, Texas</a>. The data are the readings at midnight from January 1, 2014 to November 9, 2014. (My knowledge of ozone is too limited to properly chart these data, but they’re going to make a nice illustration. Please forgive my scientific deficiencies.) If you plot these on an individuals chart with all of the data, you get this:</p>
<p><img alt="The I-chart shows seven out-of-control points between May 3rd and May 17th." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/8d42d9d369c2808c62ede97e8ec9e8af/all_data.jpg" style="width: 450px; height: 300px;" /></p>
<p>Beginning on May 3, a two-week period contains 7 out of 14 days where the ozone measurements are higher than you would expect based on the amount that they normally vary. If we know the reason that these days have higher measurements, then we could exclude them from the calculation of the center line and control limits. Here are the three options for what to do with the points:</p>
<p><img alt="Three ways to show or hide omitted points" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/e70f52244b7960596243b25c87b91911/compare_all_three.jpg" style="width: 576px; height: 384px;" /></p>
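<p>For context on where those limits come from: the center line of an individuals chart is the mean of the observations, and the control limits sit 2.66 average moving ranges above and below it (2.66 is 3 divided by the unbiasing constant d2 for moving ranges of size 2). A minimal sketch with made-up readings, not the Meyer Park data:</p>

```python
import numpy as np

def i_chart_limits(x):
    """Center line and control limits for an individuals (I) chart."""
    x = np.asarray(x, dtype=float)
    mr_bar = np.mean(np.abs(np.diff(x)))  # average moving range of span 2
    center = x.mean()
    return center - 2.66 * mr_bar, center, center + 2.66 * mr_bar

readings = [31.0, 29.5, 33.2, 30.8, 32.1, 30.4, 31.5, 29.8,
            32.4, 30.9, 50.0, 31.1, 30.2, 31.8, 29.9]
lcl, center, ucl = i_chart_limits(readings)
out_of_control = [x for x in readings if x < lcl or x > ucl]
print(f"LCL={lcl:.2f}, center={center:.2f}, UCL={ucl:.2f}")
print(f"flagged: {out_of_control}")  # the 50.0 spike sits above the UCL
```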
Like it never happened
<p>One way to handle points that you don't want to use to calculate the center line and control limits is to act like they never happened. The points neither appear on the chart, nor are there gaps that show where omitted points were. The fastest way to do this is by <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/graph-options/exploring-data-and-revising-graphs/using-brushing-to-investigate-data-points/">brushing</a>:</p>
<ol>
<li>On the Graph Editing toolbar, click the paintbrush.</li>
</ol>
<p><img alt="The paintbrush is between the arrow and the crosshairs." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/fc33a185eaf5c95f7b6db85edb76a3e3/graph_editing_toolbar.png" style="width: 522px; height: 28px;" /></p>
<ol>
<li>Click and drag a square that surrounds the 7 out-of-control points.</li>
<li>Press CTRL + E to recall the Individuals chart dialog box.</li>
<li>Click <strong>Data Options</strong>.</li>
<li>Select <strong>Specify which rows to exclude</strong>.</li>
<li>Select <strong>Brushed Rows</strong>.</li>
<li>Click <strong>OK</strong> twice.</li>
</ol>
<p>On the resulting chart, the upper control limit changes from 41.94 parts per billion to 40.79 parts per billion. The new limits indicate that the April 11 measurement is also larger than expected based on the variation typical of the rest of the data. These two facts will be true on the control chart no matter how you treat the omitted points. What's special about this chart is that there's no suggestion that any other data exists. The focus of the chart is on the new out-of-control point:</p>
<p><img alt="The line between the data is unbroken, even though other data exists." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/813c8d531f82a0c6dd8eda2b5ae5078b/not_there.jpg" style="width: 450px; height: 300px;" /></p>
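<p>In code terms, brushing and excluding comes down to estimating the limits from a mask of retained rows; whether the excluded points then appear is purely a display decision. A sketch with hypothetical readings rather than the actual ozone series:</p>

```python
import numpy as np

readings = np.array([31.0, 29.5, 33.2, 48.0, 47.5,
                     30.8, 32.1, 30.4, 38.5, 31.5])
exclude = np.zeros(len(readings), dtype=bool)
exclude[3:5] = True  # rows with a known assignable cause ("brushed" rows)

# Estimate the center line and control limits from the retained rows only.
kept = readings[~exclude]
mr_bar = np.mean(np.abs(np.diff(kept)))
center = kept.mean()
lcl, ucl = center - 2.66 * mr_bar, center + 2.66 * mr_bar

# Every point, excluded or not, can still be judged against the limits.
flagged = np.where((readings < lcl) | (readings > ucl))[0]
print(f"center={center:.2f}, UCL={ucl:.2f}, flagged rows={flagged.tolist()}")
```

<p>Here only the excluded rows fall outside the tightened limits, but on real data the narrower limits can reveal additional points, just as April 11 surfaced above.</p>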
Guilty by omission
<p>A display that only shows the data used to calculate the center line and control limits might be exactly what you want, but you might also want to acknowledge that you didn't use all of the data in the data set. In this case, after step 6, you would check the box labeled <strong>Leave gaps for excluded points</strong>. The resulting gaps look like this:</p>
<p><img alt="Gaps in the control limits and data connect lines show where points were omitted." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/e1a93a320af74fb5aea264d7f3b7f106/gaps.jpg" style="width: 450px; height: 300px;" /></p>
<p>In this case, the spaces are most obvious in the control limit line, but the gaps also exist in the lines that connect the data points. The chart shows that some data were left out.</p>
Hide nothing
<p>In many cases, hiding the data that weren't used in the calculations for the center line and control limits is effective. However, we might want to show all of the points that were out of control in the original data. In this case, we would still brush the points, but not use the Data Options. Starting from the chart that calculated the center line and control limits from all of the data, these would be the steps:</p>
<ol>
<li>On the Graph Editing toolbar, click the paintbrush.</li>
</ol>
<p><img alt="The paintbrush is between the arrow and the crosshairs." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/fc33a185eaf5c95f7b6db85edb76a3e3/graph_editing_toolbar.png" style="width:522px;height:28px;" /></p>
<ol>
<li>Click and drag a square that surrounds the 7 out-of-control points.</li>
<li>Press CTRL + E to recall the Individuals chart dialog box. Arrange the dialog box so that you can see the list of brushed points.</li>
<li>Click <strong>I Chart Options</strong>.</li>
<li>Select the <strong>Estimate</strong> tab.</li>
<li>Under <strong>Omit the following subgroups when estimating parameters</strong>, enter the row numbers from the list of brushed points.</li>
<li>Click <strong>OK</strong> twice.</li>
</ol>
<p>This chart still shows the new center line, control limits, and out-of-control point, but also includes the points that were omitted from the calculations.</p>
<p><img alt="Points not in the calculations are still on the chart." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/4909f3f8902786e3169f649c490c0fb1/everything.jpg" style="width: 450px; height: 300px;" /></p>
Wrap up
<p>Control charts help you to identify when some of your data are different than the rest so that you can examine the cause more closely. Developing control limits that exclude data points with an assignable cause is easy in Minitab and you also have the flexibility to decide how to display these points to convey the most important information. The only thing better than getting the best information from your data? Getting the best information from your data faster!</p>
<p>Ready for more? Check out some more tips about <a href="http://www.minitab.com/en-us/Support/Tutorials/Optimize-the-Performance-of-Your-Control-Charts/">optimizing the performance of your control charts</a>!</p>
The image of the Houston skyline is from <a href="http://commons.wikimedia.org/wiki/File:Houston_Skyline11.jpg">Wikimedia commons</a> and is licensed under <a href="http://creativecommons.org/licenses/by-sa/3.0/">this creative commons license</a>.
Lean Six SigmaQuality ImprovementSix SigmaStatistics HelpWed, 12 Nov 2014 17:10:04 +0000http://blog.minitab.com/blog/statistics-and-quality-improvement/leaving-out-of-control-points-out-of-control-chart-calculations-looks-hard2c-but-it-isntCody SteeleWhat to Do When Your Data's a Mess, part 2
http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-datas-a-mess2c-part-2
<p><span style="line-height: 1.6;">In my last post, I wrote about making a cluttered data set easier to work with by removing unneeded columns entirely, and by displaying just those columns you want to work with <em>now</em>. But <a href="http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-data-is-a-mess-part-1">too much unneeded data</a> isn't always the problem. </span></p>
<p><span style="line-height: 1.6;">What can you do when someone gives you data that isn't organized the way you need it to be? </span></p>
<p><span style="line-height: 1.6;">That happens for a variety of reasons, but most often it's because the simplest way for people to collect data is with a format that might make it difficult to assess in a worksheet. Most <a href="http://www.minitab.com/products/minitab">statistical software</a> will accept a wide range of data layouts, but just because a layout is readable doesn't mean it will be easy to analyze.</span></p>
<p><span style="line-height: 1.6;">You may not be in control of how your data were collected, but you can use tools like sorting, stacking, and ordering to put your data into a format that makes sense and is easy for you to use. </span></p>
Decide How You Want to Organize Your Data
<p>Depending on how it's arranged, the same data can be easier to work with, simpler to understand, and can even yield deeper and more sophisticated insights. I can't tell you the best way to organize your specific data set, because that will depend on the types of analysis you want to perform, and the nature of the data you're working with. However, I can show you some easy ways to rearrange your data into the form you choose. </p>
Unstack Data to Make Multiple Columns
<p>The data below show concession sales for different types of events held at a local theater. </p>
<p><img alt="stacked data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/8ea617d9de8138f26f2da0f3f95f4b88/stackedata.png" style="width: 202px; height: 188px;" /></p>
<p><span style="line-height: 20.7999992370605px;">If we wanted to perform an analysis that requires each type of event to be in its own column, we can choose <strong>Data > Unstack Columns...</strong> and complete the dialog box as shown: </span></p>
<p><span style="line-height: 20.7999992370605px;"><img alt="unstack columns dialog" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/fc098d3ddcbc21fe12602cb45336949c/unstack_columns.png" style="width: 350px; height: 263px;" /> </span></p>
<p>Minitab creates a new worksheet that contains a separate column of Concessions sales data for each type of event:</p>
<p><img alt="Unstacked Data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f24dd4ac29678e25069d299ccc13c535/unstacked_data.png" style="width: 400px; height: 150px;" /></p>
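<p>If you work in a scripting environment as well, the same unstack can be sketched with pandas (the event names and figures below are invented): group by the category and lay the groups out side by side.</p>

```python
import pandas as pd

sales = pd.DataFrame({
    "Event":       ["Play", "Concert", "Play", "Lecture", "Concert"],
    "Concessions": [120, 340, 150, 80, 295],
})

# One column per event type, each group compacted to the top of its column.
wide = pd.DataFrame({
    event: grp["Concessions"].reset_index(drop=True)
    for event, grp in sales.groupby("Event")
})
print(wide)
```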
Stack Data to Form a Single Column (with Grouping Variable)
<p>A similar tool will help you put data from separate columns into a single column for the type of analysis required. The data below show sales figures for four employees: </p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f546e2611e4fd6fe804de7c0aee3d230/stacked_data.png" style="width: 265px; height: 92px;" /></p>
<p>Select <strong>Data > Stack > Columns...</strong> and select the columns you wish to combine. Checking the "Use variable names in subscript column" option will create a second column that identifies the person who made each sale. </p>
<p><img alt="Stack columns dialog" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a09dba196e68e5e75d0f248339a53e11/stack_data_dialog.jpg" style="width: 400px; height: 292px;" /></p>
<p>When you press OK, the sales data are stacked into a single column of measurements and ready for analysis, with Employee available as a grouping variable: </p>
<p><img alt="stacked columns" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c26bec8bec9447ab1df6b9ad669d9a1a/stacked_columns.jpg" style="width: 138px; height: 181px;" /></p>
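<p>The pandas counterpart of stacking with a subscript column is a melt; the employee names and sales below are invented for illustration.</p>

```python
import pandas as pd

sales = pd.DataFrame({
    "Dan":   [125, 180, 142],
    "Laura": [210, 195, 188],
})

# Stack the columns; the old column names become the grouping variable.
stacked = sales.melt(var_name="Employee", value_name="Sales")
print(stacked)
```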
Sort Data to Make It More Manageable
<p>The following data appear in the worksheet in the order in which individual stores in a chain sent them into the central accounting system.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/431dcae640fa0855a8db03b14bad3998/unsorted_data.jpg" style="width: 200px; height: 228px;" /></p>
<p>When the data appear in this uncontrolled order, finding an observation for any particular item, or from any specific store, would entail reviewing the entire list. We can fix that problem by selecting <strong>Data > Sort...</strong> and reordering the data by either store or item. </p>
<p><img alt="sorted data by item" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/0c982bb11359a001c048cb6c39ab1f60/sorted_data_by_item.jpg" style="width: 221px; height: 246px;" /> <img alt="sorted data by store" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/53e9a3f22b4a959af11952995703d7d4/sorted_data_by_store.jpg" style="width: 209px; height: 248px;" /></p>
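<p>The scripting version of the same sort, with invented stores and items, is a single call; sorting on two keys orders the rows by item and then by store within each item.</p>

```python
import pandas as pd

orders = pd.DataFrame({
    "Store": [3, 1, 2, 1, 3],
    "Item":  ["mugs", "plates", "mugs", "bowls", "bowls"],
    "Sales": [40, 25, 31, 18, 22],
})

# Order by item first, then by store within each item.
by_item = orders.sort_values(by=["Item", "Store"]).reset_index(drop=True)
print(by_item)
```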
Merge Multiple Worksheets
<p>What if you need to analyze information about the same items, but that were recorded on separate worksheets? For instance, if one group was gathering historic data about all of a corporation's manufacturing operations, while another was working on strategic planning, and your analysis required data from each? </p>
<p><img alt="two worksheets" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f63ed557c91fb6136b28ab43001b48b4/two_worksheets.png" style="width: 350px; height: 327px;" /></p>
<p>You can use <strong>Data > Merge Worksheets</strong> to bring the data together into a single worksheet, using the Division column to match the observations:</p>
<p><img alt="merging worksheets" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/651d3d676a4099a71eb180344d2e8282/merge_worksheets.png" style="width: 393px; height: 363px;" /></p>
<p>You can also choose whether multiple, missing, or unmatched observations will be included in the merged worksheet. </p>
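<p>The equivalent operation in pandas is a keyed merge, where the choice about unmatched observations is the <code>how</code> argument (the divisions and figures below are invented):</p>

```python
import pandas as pd

operations = pd.DataFrame({
    "Division": ["North", "South", "East"],
    "Output":   [1200, 950, 1100],
})
planning = pd.DataFrame({
    "Division": ["North", "South", "West"],
    "Target":   [1300, 1000, 900],
})

# how="outer" keeps unmatched divisions from both tables (with NaN gaps);
# how="inner" would keep only divisions present in both.
merged = operations.merge(planning, on="Division", how="outer")
print(merged)
```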
Reorganizing Data for Ease of Use and Clarity
<p>Making changes to the layout of your worksheet does entail a small investment of time, but it can bring big returns in making analyses quicker and easier to perform. The next time you're confronted with raw data that isn't ready to play nice, try some of these approaches to get it under control. </p>
<p>In my next post, I'll share some tips and tricks that can help you get more information out of your data.</p>
Data AnalysisStatisticsStatsTue, 11 Nov 2014 14:48:09 +0000http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-datas-a-mess2c-part-2Eston MartzWhat to Do When Your Data's a Mess, part 1
http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-data-is-a-mess-part-1
<p>Isn't it great when you get a set of data and it's perfectly organized and ready for you to analyze? I love it when the people who collect the data take special care to make sure to format it consistently, arrange it correctly, and eliminate the junk, clutter, and useless information I don't need. </p>
<p><img alt="Messy Data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ad531bc1c0dc575e774b7ecef670b231/messydata.png" style="border-width: 1px; border-style: solid; margin: 10px 15px; width: 250px; height: 248px; float: right;" />You've never received a data set in such perfect condition, you say?</p>
<p>Yeah, me neither. But I can dream, right? </p>
<p><span style="line-height: 1.6;">The truth is, when other people give me data, it's typically not ready to analyze. It's frequently messy, disorganized, and inconsistent. I get big headaches if I try to analyze it without doing a little clean-up work first. </span></p>
<p>I've talked with many people who've shared similar experiences, so I'm writing a series of posts on how to get your data in usable condition. In this first post, I'll talk about some basic methods you can use to make your data easier to work with. </p>
Preparing Data Is a Little Like Preparing Food
<p>I'm not complaining about the people who give me data. In most cases, they aren't statisticians and they have many higher priorities than giving me data in exactly the form I want. </p>
<p>The end result is that getting data is a little bit like getting food: it's not always going to be ready to eat when you pick it up. You don't eat raw chicken, and usually you can't analyze raw data, either. <span style="line-height: 20.7999992370605px;"> </span><span style="line-height: 1.6;">In both cases, you need to prepare it first or the results aren't going to be pretty. </span></p>
<p><span style="line-height: 1.6;">Here are a couple of very basic things to look for when you get a messy data set, and how to handle them. </span></p>
<span style="line-height: 1.6;">Kitchen-Sink Data and Information Overload</span>
<p>Frequently I get a data set that includes a lot of information that I don't need for my analysis. I also get data sets that combine or group information in ways that make analyzing it more difficult. </p>
<p>For example, let's say I needed to analyze data about different types of events that take place at a local theater. Here's my raw data sheet: </p>
<p><img alt="April data sheet" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/14fe4e9930171f54848b589c0e8139d1/april_data_raw.png" style="width: 400px; height: 224px;" /></p>
<p>With each type of event jammed into a single worksheet, it's a challenge to analyze just one event category. What would work better? A separate worksheet for each type of occasion. In Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a>, I can go to <strong>Data > Split Worksheet...</strong> and choose the Event column: </p>
<p><img alt="split worksheet" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/69c63e422339f9871ada5a244222dcfc/split_worksheet.png" style="width: 300px; height: 309px;" /></p>
<p>And Minitab will create new worksheets that include only the data for each type of event. </p>
<p><img alt="separate worksheets by event type" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/8b97ea00ae39da8cb60e307ebe6140dc/separate_data_sheets.png" style="width: 300px; height: 243px;" /></p>
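<p>For readers who also script their analyses: splitting a worksheet by a category corresponds to a groupby that yields one table per group. The schedule below is invented.</p>

```python
import pandas as pd

events = pd.DataFrame({
    "Event":   ["Play", "Concert", "Play", "Lecture"],
    "Tickets": [180, 420, 205, 95],
})

# One sub-table per event type, keyed by the category value.
sheets = {event: grp.reset_index(drop=True)
          for event, grp in events.groupby("Event")}
print(sheets["Play"])
```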
<p>Minitab also lets you merge worksheets to combine items provided in separate data files. </p>
<p>Let's say the data set you've been given contains a lot of columns that you don't need: irrelevant factors, redundant information, and the like. Those items just clutter up your data set, and getting rid of them will make it easier to identify and access the columns of data you actually need. You can delete rows and columns you don't need, or use the <strong>Data > Erase Variables</strong> tool to make your worksheet more manageable. </p>
<span style="line-height: 1.6;">I Can't See You Right Now...Maybe Later</span>
<p>What if you don't want to actually <em>delete </em>any data, but you only want to see the columns you intend to use? For instance, in the data below, I don't need the Date, Manager, or Duration columns now, but I may have use for them in the future: </p>
<p><img alt="unwanted columns" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/99d785a0b5ff0cbac36f0c6af05b1cac/unwantedcolumns.png" style="width: 400px; height: 225px;" /></p>
<p>I can select and right-click those columns, then use <strong>Column > Hide Selected Columns</strong> to make them disappear. </p>
<p><img alt="hide selected columns" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/00defa2646d5e100873ef2961d374ff0/hideselectedcolumns.png" style="width: 400px; height: 308px;" /></p>
<p>Voila! They're gone from my sight. Note how the displayed columns jump from C1 to C5, indicating that some columns are hidden: </p>
<p><img alt="hidden columns" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a140bb6413744b431460e70f523e5a0b/hiddencolumns.png" style="width: 323px; height: 138px;" /></p>
<p>It's just as easy to bring those columns back in the limelight. When I want them to reappear, I select the C1 and C5 columns, right-click, and choose "Unhide Selected Columns." </p>
<p>Data may arrive in a disorganized and messy state, but you don't need to keep it that way. Getting rid of extraneous information and choosing the elements that are visible can make your work much easier. But that's just the tip of the iceberg. In my next post, I'll cover some more <a href="http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-datas-a-mess2c-part-2">ways to make unruly data behave</a>. </p>
Data AnalysisStatisticsMon, 10 Nov 2014 15:52:00 +0000http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-data-is-a-mess-part-1Eston MartzCreating and Reading Statistical Graphs: Trickier than You Think
http://blog.minitab.com/blog/understanding-statistics/creating-and-reading-statistical-graphs-trickier-than-you-think
<p>A few weeks ago my colleague Cody Steele illustrated <a href="http://blog.minitab.com/blog/statistics-and-quality-improvement/how-painful-does-the-income-gap-look-to-you">how the same set of data can appear to support two contradictory positions</a>. He showed how changing the scale of a graph that displays mean and median household income over time drastically alters the way it can be interpreted, even though there's no change in the data being presented.</p>
<p><img alt="Graph interpretation is tricky, especially if you're doing it quickly" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f594d20f8daa8e00e29380f68010b1cc/hunh.jpg" style="margin: 10px 15px; float: right; width: 200px; height: 200px;" /> When we analyze data, we need to present the results in an objective, honest, and fair way. That's the catch, of course. What's "fair" can be debated...and that leads us straight into "Lies, damned lies, and statistics" territory. </p>
<p><span style="line-height: 20.7999992370605px;">Cody's post got me thinking about the importance of statistical literacy, especially in a mediascape saturated with overhyped news reports about seemingly every new study, not to mention omnipresent "infographics" of frequently dubious origin and intent.</span></p>
<p><span style="line-height: 20.7999992370605px;">As consumers and providers of statistics, can we trust our own impressions of the information we're bombarded with on a daily basis? It's an increasing challenge, even for the statistics-savvy. </span></p>
So Much Data, So Many Graphs, So Little Time
<p>The increased amount of information available, combined with the acceleration of the news cycle to speeds that wouldn't have been dreamed of a decade or two ago, means we have less time available to absorb and evaluate individual items critically. </p>
<p>A half-hour television news broadcast might include several animations, charts, and figures based on the latest research, or polling numbers, or government data. They'll be presented for several seconds at most, then it's on to the next item. </p>
<p>Getting news online is even more rife with opportunities for split-second judgment calls. We scan through the headlines and eyeball the images, searching for stories interesting enough to click on. But with 25 interesting stories vying for your attention, and perhaps just a few minutes before your next appointment, you race through them very quickly. </p>
<p>But when we see graphs for a couple of seconds, do we really absorb their meaning completely and accurately? Or are we susceptible to misinterpretation? </p>
<p>Most of the graphs we see are very simple: bar charts and pie charts predominate. But as statistics educator Dr. Nic points out in <a href="http://learnandteachstatistics.wordpress.com/2012/07/16/tricky_graphs/">this blog post</a>, interpreting even simple bar charts can be a deceptively tricky business. I've adapted her example to demonstrate this below. </p>
Which Chart Shows Greater Variation?
<p>A city surveyed residents of two neighborhoods about the quality of service they get from local government. Respondents were asked to rate local services on a scale of 1 to 10. Their responses were charted using Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a>, as shown below. </p>
<p>Take a few seconds to scan the charts, then choose which neighborhood's responses exhibit more variation: Ferndale or Lawnwood?</p>
<p><img alt="Lawnwood Bar Chart" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f88262f2732bc43e8ac0b919d43139a5/lawnwoodbarchart.gif" style="width: 500px; height: 333px;" /></p>
<p><img alt="Ferndale Bar Chart" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/67ee1909a89236e3caac2d11a9d42795/ferndalebarchart.gif" style="width: 500px; height: 333px;" /></p>
<p>Seems pretty straightforward, right? Lawnwood's graph is quite spiky and disjointed, with sharp peaks and valleys. The graph of Ferndale's responses, on the other hand, looks nice and even. Each bar's roughly the same height. </p>
<p>It looks like Lawnwood's responses have the most variation. But let's verify that impression with some basic descriptive statistics about each neighborhood's responses:</p>
<p style="margin-left: 40px;"><img alt="Descriptive Statistics for Ferndale and Lawnwood" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/1eeed755d2a0baea0939dc7ccecacaea/descriptive_statistics.gif" style="width: 369px; height: 105px;" /></p>
<p>Uh-oh. A glance at the graphs suggested that Lawnwood has more variation, but the analysis demonstrates that Ferndale's variation is, in fact, much higher. <span style="line-height: 20.7999992370605px;">How did we get this so wrong?</span><span style="line-height: 20.7999992370605px;"> </span><span style="line-height: 1.6;"> </span></p>
Frequencies, Values, and Counterintuitive Graphs
<p>The answer lies in how the data were presented. The charts above show frequencies, or counts, rather than individual responses.</p>
<p>What if we graph the individual responses for each neighborhood?</p>
<p><img alt="Lawndale Individuals Chart" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/d8e91ae6c007e8f5327c54ac3ec65604/lawnwoodindividualsbarchart.gif" style="width: 500px; height: 333px;" /></p>
<p><img alt="Ferndale Individuals Chart" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/4c01c68dbb96e2126a1fd313ee38e001/ferndaleindividualsbarchart.gif" style="width: 500px; height: 333px;" /></p>
<p>In <em>these </em>graphs, it's easy to see that the responses of Ferndale's citizens had much more variation than those of Lawnwood. But unless you appreciate the differences between values and frequencies—and paid careful attention to how the first set of graphs was labelled—a quick look at the earlier graphs could well leave you with the wrong conclusion. </p>
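<p>A quick sketch in Python shows why even frequency bars can coexist with a large spread of values. The data below are hypothetical, invented only to mimic the shapes of the two charts—they are not the actual survey responses:</p>

```python
import statistics

# Hypothetical responses shaped like the two frequency charts:
# "Ferndale": every rating 1-10 appears equally often -> flat, even bars
ferndale = [rating for rating in range(1, 11) for _ in range(10)]
# "Lawnwood": responses clustered around 5 and 6 -> spiky bars
lawnwood = [5] * 40 + [6] * 35 + [4] * 15 + [7] * 10

print(statistics.stdev(ferndale))  # ~2.89: even bars, but values spread widely
print(statistics.stdev(lawnwood))  # ~0.86: spiky bars, but values tightly clustered
```

<p>The "even" frequency chart corresponds to the <em>larger</em> standard deviation, exactly the trap described above.</p>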
Being Responsible
<p>Since you're reading this, you probably both create and consume data analysis. You may generate your own reports and charts at work, and see the results of other people's analyses on the news. We should approach both situations with a certain degree of responsibility. </p>
<p>When looking at graphs and charts produced by others, we need to avoid snap judgments. We need to pay attention to what the graphs really show, and take the time to draw the right conclusions based on how the data are presented. </p>
<p>When sharing our own analyses, we have a responsibility to communicate clearly. In the frequency charts above, the X and Y axes are labelled adequately—but couldn't they be more explicit? Instead of just "Rating," couldn't the label read "Count for Each Rating" or some other, more meaningful description? </p>
<p>Statistical concepts may seem like common knowledge if you've spent a lot of time working with them, but many people aren't clear on ideas like "correlation is not causation" and margins of error, let alone the nuances of statistical assumptions, distributions, and significance levels.</p>
<p>If your audience includes people without a thorough grounding in statistics, are you going the extra mile to make sure the results are understood? For example, many expert statisticians have told us they use <a href="http://www.minitab.com/products/minitab/assistant/">the Assistant</a> in Minitab 17 to present their results precisely because it's designed to communicate the outcome of analysis clearly, even for statistical novices. </p>
<p>If you're already doing everything you can to make statistics accessible to others, kudos to you. And if you're not, why aren't you? </p>
Data AnalysisStatisticsStatistics in the NewsStatsWed, 05 Nov 2014 14:25:00 +0000http://blog.minitab.com/blog/understanding-statistics/creating-and-reading-statistical-graphs-trickier-than-you-thinkEston MartzMethods and Formulas: How Are I-MR Chart Control Limits Calculated?
http://blog.minitab.com/blog/marilyn-wheatleys-blog/methods-and-formulas3a-how-are-i-mr-chart-control-limits-calculated
<p>Users often contact Minitab technical support to ask how the software calculates the control limits on control charts.</p>
<p>A frequently asked question is how the control limits are calculated on an <a href="http://blog.minitab.com/blog/understanding-statistics/how-create-and-read-an-i-mr-control-chart">I-MR Chart or Individuals Chart</a>. If Minitab plots the upper and lower control limits (UCL and LCL) three standard deviations above and below the mean, why are the limits plotted at values other than 3 times the standard deviation that I get using <strong>Stat > Basic Statistics</strong>? </p>
<p>That’s a valid question—if we’re plotting individual points on the I-Chart, it doesn’t seem unreasonable to try to calculate a simple standard deviation of the data points, multiply by 3 and expect the UCL and LCL to be the data mean plus or minus 3 standard deviations. This can be especially confusing because the Mean line on the Individuals chart IS the mean of the data!</p>
<p>However, the standard deviation that Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a> uses is not the simple standard deviation of the data. You can see the default method Minitab uses (and the option to change it) by clicking the I-MR Options button, and then choosing the Estimate tab:</p>
<p><img alt="" spellcheck="true" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/a1de7e505c9c75978c6f847c078eb6be/pic1.PNG" style="width: 845px; height: 392px;" /></p>
<p>There we can see that Minitab is using the <strong>Average moving range</strong> method with <strong>2</strong> as the <strong>length of moving range</strong> to estimate the standard deviation.</p>
<p>That’s all well and good, but exactly what the heck is an average moving range with length 2?!</p>
<p>Minitab’s <strong>Methods and Formulas</strong> section details the formulas used for these calculations. In fact, Methods and Formulas provides information on the formulas used for all the calculations available through the dialog boxes. This information can be accessed via the Help menu, by choosing <strong>Help</strong> > <strong>Methods and Formulas...</strong></p>
<p>To see the formulas for control chart calculations, we choose <strong>Control Charts</strong> > <strong>Variables Charts for Individuals</strong> as shown below:</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/18d67df6f4507de993dd2ab52ddbf7e3/pic2.PNG" style="width: 600px; height: 602px;" /></p>
<p>The next page shows the formulas organized by topic. By selecting the link <strong>Methods for estimating standard deviation</strong> we find the formula for the <strong>Average moving range</strong>:</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/f3cdc685d352daaa3c4895bdda0b92b6/pic3.PNG" style="width: 636px; height: 444px;" /></p>
<p>Looking at the formula, things become a bit clearer—the ‘length of the moving range’ is the number of data points used when we calculate the moving range (i.e., the difference from point 1 to point 2, 2 to 3, and so forth).</p>
<p>If we want to hand-calculate the control limits for a dataset, we can do that with a little help from Minitab!</p>
<p>The dataset I’ve used for this example is available <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/File/739bad1756df6628d4142b54711fb77c/data.MTW">HERE</a>.</p>
<p>First, we’ll need to get the values of the moving ranges. We’ll use the calculator by navigating to <strong>Calc</strong> > <strong>Calculator</strong>; in the example below, we’re storing the results in column C2 (an empty column) and we’re using the <strong>LAG</strong> function in the calculator. That will move each of our values in column C1 down by 1 row. Click <strong>OK</strong> to store the results in the worksheet.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/3806d3a2fa609587ecb514a834feff74/pic4.PNG" style="width: 439px; height: 391px;" /></p>
<p>Note: By choosing the <strong>Assign as a formula</strong> option at the bottom of the calculator, we can add a formula to column C2 which we can easily go back and edit if a mistake was made.</p>
<p>Now with the lags stored in C2, we use the calculator again: <strong>Calc</strong> > <strong>Calculator</strong> (here's a tip: press F3 on the keyboard to clear out the previous calculator entry), then subtract column C2 from column C1 as shown below, storing the results in C3. We use the <strong>ABS</strong> calculator command to get the absolute differences of each row:</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/f59426052c560720a5043223c5084609/pic5.PNG" style="width: 441px; height: 391px;" /></p>
<p>Next we calculate the sum of the absolute value of the moving ranges by using <strong>Calc</strong> > <strong>Calculator</strong> once again. We’ll store the sum in the next empty column, C4:</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/f526db4204eb75f8fb41e3470a033f12/pic6.PNG" style="width: 441px; height: 392px;" /></p>
<p>The value of this sum represents the numerator in the Rbar calculation:</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/a5ce2c1b1eb7b2ee900fef6308ab3aed/pic7.PNG" style="width: 560px; height: 110px;" /></p>
<p>To complete the Rbar calculation, we use the information from Methods and Formulas to come up with the denominator; n is the number of data points (in this example it’s 100), w’s default value is 2, and we add 1, so the denominator is <strong>100-2+1</strong>. In Minitab, we can once again use <strong>Calc</strong> > <strong>Calculator</strong> to store the results in C5:</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/ecee73ab86aec5be4307272cd5cfc5ee/pic8.PNG" style="width: 577px; height: 147px;" /></p>
<p>With Rbar calculated, we find the value of the unbiasing constant d2 from the table that is linked in Methods and Formulas:</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/bbc81ead705e0404900e350d7a7f8b30/pic9.PNG" style="width: 332px; height: 94px;" /></p>
<p>For a moving-range of length 2, the d2 value is 1.128, so we enter 1.128 in the first row in column C6, and use the calculator one more time to divide Rbar by d2 to get the standard deviation, which works out to be 2.02549:</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/41e1ff67d5031010a074efa585a5b75a/pic10.PNG" style="width: 634px; height: 149px;" /></p>
<p>We can check our results by using the original data to create an I-MR chart. We enter the <strong>data</strong> column in <strong>Variables</strong>, and then click <strong>I-MR Options</strong> and choose the <strong>Storage</strong> tab; here we can tell Minitab to store the standard deviation in the worksheet when we create the chart:</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/4e24faef9b7e4f9c34e7a624b5060f7f/pic11.PNG" style="width: 457px; height: 454px;" /></p>
<p>The stored standard deviation is shown in the new column titled STDE1, and it matches the value we hand-calculated. Notice also that the Rbar we calculated is the average of the moving ranges on the Moving-Range chart. Beautiful!</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/f6d0da32-ba1d-41d4-ace1-af34dcb51351/Image/4b0e62d3316d9610bf5eed6faa69250d/pic12.PNG" style="width: 609px; height: 505px;" /></p>
Data AnalysisQuality ImprovementStatisticsStatistics HelpTue, 04 Nov 2014 16:07:59 +0000http://blog.minitab.com/blog/marilyn-wheatleys-blog/methods-and-formulas3a-how-are-i-mr-chart-control-limits-calculatedMarilyn WheatleyComparing the College Football Playoff Top 25 and the Preseason AP Poll
http://blog.minitab.com/blog/the-statistics-game/comparing-the-college-football-playoff-top-25-and-the-preseason-ap-poll
<p>The college football playoff committee waited until the end of October to release their first top 25 rankings. One of the reasons for waiting so far into the season was so that the committee would rank the teams based on actual games and wouldn’t be influenced by preseason rankings.</p>
<p>At least, that was the idea.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/8ac74acf42052d068b6cd0eeec32f609/cfb_playoff.jpg" style="line-height: 20.7999992370605px; float: right; width: 300px; height: 187px;" /></p>
<p>Earlier this year, I found that the <a href="http://blog.minitab.com/blog/the-statistics-game/has-the-college-football-playoff-already-been-decided">final AP poll was correlated with the preseason AP poll</a>. That is, if team A was ranked ahead of team B in the preseason and they had the same number of losses, team A was still usually ranked ahead of team B. The biggest exception was SEC teams, who were able to regularly jump ahead of teams (with the same number of losses) ranked ahead of them in the preseason.</p>
<p>If the final AP poll can be influenced by preseason expectations, could the college football playoff committee be influenced, too? Let’s compare their first set of rankings to the preseason AP poll to find out.</p>
Comparing the Ranks
<p>There are currently 17 different teams in the committee’s top 25 that have just one loss. I <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/26e7c8d8d8eee4fe2dfa26dc3d6e3c54/preseason_ap_vs__cfb_playoff_rankings.MTW">recorded the order</a> they are ranked in the committee’s poll and their order in the AP preseason poll. Below is an individual value plot of the data that shows each team’s preseason rank versus their current rank.</p>
<p><img alt="IVP" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/4098bab194a586865d3861f854d65627/ivp.jpg" style="width: 600px; height: 400px;" /></p>
<p>Teams on the diagonal line haven’t moved up or down since the preseason. Although Notre Dame is the only team to fall directly on the line, most teams aren’t too far off.</p>
<p>Teams below the line have jumped teams that were ranked ahead of them in the preseason. The biggest winner is actually not an SEC team, it’s TCU. Before the season, 13 of the current one-loss teams were ranked ahead of TCU, but now there are only 4. On the surface TCU seems to counter the idea that only SEC teams can drastically move up from their preseason ranking. However, of the 9 teams TCU jumped, only one (Georgia) is from the SEC. And the only other team to jump up more than 5 spots is Mississippi—who of course is from the SEC. So I wouldn’t conclude that the CFB playoff committee rankings behave differently than the AP poll quite yet.</p>
<p>Teams above the line have been passed by teams that had been ranked behind them in the preseason. Ohio State is the biggest loser, having had 9 different teams pass over them. Part of this can be explained by the fact that they have the worst loss (a home loss to a Virginia Tech team that is now 4-4). But another factor is that the preseason AP poll was released before anybody knew Buckeye quarterback Braxton Miller would miss the entire season. Had voters known that, Ohio State probably wouldn’t have been ranked so high to begin with. </p>
<p>Overall, 10 teams have moved no more than 3 spots from their preseason position. The correlation between the two polls is 0.571, which indicates a positive association between the preseason AP poll and the current CFB playoff rankings. That is, teams ranked higher in the preseason poll tend to be ranked higher in the playoff rankings.</p>
Concordant and Discordant Pairs
<p>We can take this analysis a step further by looking at the concordant and discordant pairs. A pair is concordant if the observations are in the same direction. A pair is discordant if the observations are in opposite directions. This will let us compare teams to each other two at a time.</p>
<p>For example, let’s compare Auburn and Mississippi. In the preseason, Auburn was ranked 3 (out of the 17 one-loss teams) and Mississippi was ranked 10. In the playoff rankings, Auburn is ranked 1 and Mississippi is ranked 2. This pair is concordant, since in both cases Auburn is ranked higher than Mississippi. But if you compare Alabama and Mississippi, you’ll see Alabama was ranked higher in the preseason, but Mississippi is ranked higher in the playoff rankings. That pair is discordant.</p>
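<p>Counting concordant and discordant pairs is easy to script. Here's a small Python sketch (the function name is mine) that compares every pair of teams given their preseason and current ranks; with 17 teams it examines all 136 possible pairs:</p>

```python
from itertools import combinations

def concordant_discordant(pre, post):
    """Count concordant and discordant pairs between two rankings.

    pre[i] and post[i] are team i's preseason and current ranks.
    """
    conc = disc = 0
    for i, j in combinations(range(len(pre)), 2):
        sign = (pre[i] - pre[j]) * (post[i] - post[j])
        if sign > 0:       # same order in both rankings
            conc += 1
        elif sign < 0:     # order flipped between rankings
            disc += 1
    return conc, disc
```

<p>For instance, with preseason ranks [1, 2, 3] and current ranks [1, 3, 2], the function reports 2 concordant pairs and 1 discordant pair—the teams ranked 2nd and 3rd swapped places.</p>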
<p>When we compare every team, we end up with 136 pairs. How many of those are concordant? Our <a href="http://www.minitab.com/products/minitab">favorite statistical software</a> has the answer: </p>
<p><img alt="Measures of Concordance" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/5f281abfa1e06d5cda492e17b3f9746b/concordance.jpg" style="width: 663px; height: 176px;" /></p>
<p>There are 96 concordant pairs, which is just over 70%. So most of the time, if a team ranked higher in the preseason poll, they are ranked higher in the playoff rankings. And consider this: of the one-loss teams, the top 4 ranked preseason teams were Alabama, Oregon, Auburn, and Michigan St. Currently, the top 4 one-loss teams are Auburn, Mississippi, Oregon, and Alabama. That’s only one new team—which just so happens to be from the SEC.</p>
<p>That’s bad news for non-SEC teams that started the season ranked low, like Arizona, Notre Dame, Nebraska, and Kansas State. It's going to be hard for them to jump teams with the same record, especially if those teams are from the SEC. Just look at Alabama’s résumé so far. Their best win is over West Virginia and they lost to #4 Mississippi. Is that <em>really </em>better than Kansas State, who lost to #3 Auburn and beat Oklahoma <em>on the road</em>? If you simply changed the name on Alabama’s uniform to Utah and had them unranked to start the season, would they still be ranked three spots higher than Kansas State? I doubt it.</p>
<p>The good news is that there are still many games left to play. Most of these one-loss teams will lose at least one more game. But with 4 teams making the playoff this year, odds are we'll see multiple teams with the same record vying for the last playoff spot. And if this college football playoff ranking is any indication, teams that were highly thought of in the preseason, especially those from the SEC, will have the edge.</p>
Fun StatisticsHypothesis TestingFri, 31 Oct 2014 13:04:57 +0000http://blog.minitab.com/blog/the-statistics-game/comparing-the-college-football-playoff-top-25-and-the-preseason-ap-pollKevin RudyR-squared Shrinkage and Power and Sample Size Guidelines for Regression Analysis
http://blog.minitab.com/blog/adventures-in-statistics/r-squared-shrinkage-and-power-and-sample-size-guidelines-for-regression-analysis
<p>Using a sample to <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/basic-concepts/parameters/" target="_blank">estimate the properties of an entire population</a> is common practice in statistics. For example, the mean from a random sample estimates that <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/basic-concepts/parameter-esimates/" target="_blank">parameter</a> for an entire population. In linear regression analysis, we’re used to the idea that the <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-interpret-regression-analysis-results-p-values-and-coefficients" target="_blank">regression coefficients</a> are estimates of the true parameters. However, it’s easy to forget that R-squared (R2) is also an estimate. Unfortunately, it has a problem that many other estimates don’t have. R-squared is inherently biased!</p>
<p>In this post, I look at how to obtain an unbiased and reasonably precise estimate of the population R-squared. I also present power and sample size guidelines for regression analysis.</p>
R-squared as a Biased Estimate
<p>R-squared measures the strength of the relationship between the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/regression-models/what-are-response-and-predictor-variables/" target="_blank">predictors</a> and <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/regression-models/what-are-response-and-predictor-variables/" target="_blank">response</a>. The <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit" target="_blank">R-squared in your regression output</a> is a biased estimate based on your sample.</p>
<ul>
<li>An unbiased estimate is one that is just as likely to be too high as it is to be too low, and it is correct on average. If you collect a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/data-concepts/why-collect-random-sample/" target="_blank">random sample</a> correctly, the sample mean is an unbiased estimate of the population mean.</li>
<li>A biased estimate is systematically too high or too low, so it is wrong even on average. It’s like a bathroom scale that always indicates you are heavier than you really are. No one wants that!</li>
</ul>
<p>R-squared is like the broken bathroom scale: it is deceptively large. Researchers have long recognized that regression’s optimization process takes advantage of chance correlations in the sample data and inflates the R-squared.</p>
<p>This bias is a reason why some practitioners don’t use R-squared at all—it tends to be wrong.</p>
R-squared Shrinkage
<p>What should we do about this bias? Fortunately, there is a solution and you’re probably already familiar with it: adjusted R-squared. I’ve written about <a href="http://blog.minitab.com/blog/adventures-in-statistics/multiple-regession-analysis-use-adjusted-r-squared-and-predicted-r-squared-to-include-the-correct-number-of-variables" target="_blank">using the adjusted R-squared</a> to compare regression models with a different number of terms. Another use is that it is an unbiased estimator of the population R-squared.</p>
<p>Adjusted R-squared does what you’d do with that broken bathroom scale. If you knew the scale was consistently too high, you’d reduce it by an appropriate amount to produce an accurate weight. In statistics this is called shrinkage. (You <em>Seinfeld</em> fans are probably giggling now. Yes, George, we’re talking about shrinkage, but here it’s a good thing!)</p>
<p>We need to shrink the R-squared down so that it is not biased. Adjusted R-squared does this by comparing the sample size to the number of terms in your regression model.</p>
<p>Regression models that have many samples per term produce a better R-squared estimate and require less shrinkage. Conversely, models that have few samples per term require more shrinkage to correct the bias.</p>
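<p>The standard adjustment formula makes the shrinkage concrete. Here is a short Python sketch using the usual adjusted R-squared formula, where n is the number of observations and p the number of model terms (not counting the intercept):</p>

```python
def adjusted_r_squared(r2, n, p):
    """Shrink the sample R-squared for n observations and p model terms."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# More observations per term -> less shrinkage:
adjusted_r_squared(0.6, 15, 1)   # about 0.569
adjusted_r_squared(0.6, 40, 1)   # about 0.589
```

<p>With the same sample R-squared of 0.6, the one-term model loses less to shrinkage at 40 observations than at 15.</p>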
<p><img alt="Line plot showing R-squared shrinkage by sample size per term" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/c8687540a1adaecc534f746991ce52f0/rsq_shrinkage_w640.png" style="width: 640px; height: 427px;" /></p>
<p>The graph shows greater shrinkage when you have a smaller sample size per term and lower R-squared values.</p>
Precision of the Adjusted R-squared Estimate
<p>Now that we have an unbiased estimator, let's take a look at the precision.</p>
<p>Estimates in statistics have both a point estimate and a <a href="http://blog.minitab.com/blog/adventures-in-statistics/when-should-i-use-confidence-intervals-prediction-intervals-and-tolerance-intervals" target="_blank">confidence interval</a>. For example, the sample mean is the point estimate for the population mean. However, the population mean is unlikely to exactly equal the sample mean. A confidence interval provides a range of values that is likely to contain the population mean. Narrower confidence intervals indicate a more precise estimate of the parameter. Larger sample sizes help produce more precise estimates.</p>
<p>All of this is true with the adjusted R-squared as well because it is just another estimate. The adjusted R-squared value is the point estimate, but how precise is it and what’s a good sample size?</p>
<p>Rob Kelly, a senior statistician at Minitab, was asked to study this issue in order to develop power and sample size guidelines for regression in the <a href="http://www.minitab.com/en-us/products/minitab/assistant/" target="_blank">Assistant menu</a>. He simulated the distribution of adjusted R-squared values around different population values of R-squared for different sample sizes. This histogram shows the distribution of 10,000 simulated adjusted R-squared values for a true population value of 0.6 (rho-sq (adj)) for a simple regression model.</p>
<p><img alt="Histogram showing distribution of adjusted R-squared values around the population value" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/4405515ecfbe8605fdca8347d34dac5d/adjrsqprecision_w640.png" style="width: 640px; height: 427px;" /></p>
<p>With 15 observations, the adjusted R-squared varies widely around the population value. Increasing the sample size from 15 to 40 greatly reduces the likely magnitude of the difference. With a sample size of 40 observations for a simple regression model, the margin of error for a 90% confidence interval is +/- 20%. For multiple regression models, the sample size guidelines increase as you add terms to the model.</p>
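<p>You can mimic this kind of simulation yourself. The sketch below is my own simplified version, not Rob Kelly's actual study: it draws one sample from a simple regression model whose population R-squared is set to a chosen value, fits a least-squares line, and returns the adjusted R-squared. Repeating it many times at different sample sizes reproduces the spread shown in the histogram:</p>

```python
import random

def sim_adj_r2(n, rho2=0.6):
    """One simulated adjusted R-squared from a simple regression whose
    population R-squared equals rho2 (x and error are standard normal)."""
    beta = (rho2 / (1 - rho2)) ** 0.5     # slope that yields R-squared = rho2
    xs = [random.gauss(0, 1) for _ in range(n)]
    ys = [beta * x + random.gauss(0, 1) for x in xs]
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx                          # least-squares slope
    a = my - b * mx                        # least-squares intercept
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - my) ** 2 for y in ys)
    r2 = 1 - sse / sst
    return 1 - (1 - r2) * (n - 1) / (n - 2)   # shrink for one predictor
```

<p>Collecting, say, 10,000 values of <code>sim_adj_r2(15)</code> versus <code>sim_adj_r2(40)</code> shows the wider scatter at the smaller sample size.</p>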
Power and Sample Size Guidelines for Regression Analysis
<p>Satisfying these sample size guidelines helps ensure that you have sufficient power to detect a relationship and provides a reasonably precise estimate of the strength of that relationship. Specifically, if you follow these guidelines:</p>
<ul>
<li>The power of the overall F-test ranges from about 0.8 to 0.9 for a moderately weak relationship (0.25). Stronger relationships yield higher power.</li>
<li>You can be 90% confident that the adjusted R-squared in your output is within +/- 20% of the true population R-squared value. Stronger relationships (~0.9) produce more precise estimates.</li>
</ul>
<p style="text-align: center;"><strong>Terms</strong></p>
<p style="text-align: center;"><strong>Total sample size</strong></p>
<p style="text-align: center;">1-3</p>
<p style="text-align: center;">40</p>
<p style="text-align: center;">4-6</p>
<p style="text-align: center;">45</p>
<p style="text-align: center;">7-8</p>
<p style="text-align: center;">50</p>
<p style="text-align: center;">9-11</p>
<p style="text-align: center;">55</p>
<p style="text-align: center;">12-14</p>
<p style="text-align: center;">60</p>
<p style="text-align: center;">15-18</p>
<p style="text-align: center;">65</p>
<p style="text-align: center;">19-21</p>
<p style="text-align: center;">70</p>
<p>In closing, if you want to estimate the strength of the relationship in the population, assess the adjusted R-squared and consider the precision of the estimate.</p>
<p>Even when you meet the sample size guidelines for regression, the adjusted R-squared is a rough estimate. If the adjusted R-squared in your output is 60%, you can be 90% confident that the population value is between 40% and 80%.</p>
<p>If you're learning about regression, read my <a href="http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-tutorial-and-examples" target="_blank">regression tutorial</a>! For more histograms and the full guidelines table, see the <a href="http://support.minitab.com/en-us/minitab/17/Assistant_Simple_Regression.pdf" target="_blank">simple regression white paper</a> and <a href="http://support.minitab.com/en-us/minitab/17/Assistant_Multiple_Regression.pdf" target="_blank">multiple regression white paper</a>.</p>
Regression AnalysisThu, 30 Oct 2014 12:00:00 +0000http://blog.minitab.com/blog/adventures-in-statistics/r-squared-shrinkage-and-power-and-sample-size-guidelines-for-regression-analysisJim FrostWHO Cares about How Much Sugar You Eat on Halloween?
http://blog.minitab.com/blog/statistics-and-quality-improvement/who-cares-about-how-much-sugar-you-eat-on-halloween
<p>It’s almost Halloween, so there’s lots to do. If you haven’t picked out your costume, get ideas from the National Retail Federation’s list of <a href="https://nrf.com/media/press-releases/disneys-frozen-characters-teenage-mutant-ninja-turtles-top-childrens-costume" target="_blank">the most popular costumes</a> for 2014. Last-minute candy shopping? Check out kidzworld.com’s list of the <a href="http://www.kidzworld.com/article/27503-top-10-halloween-candy" target="_blank">top 10 candies</a> for Halloween. And of course, you have to plan your daily candy consumption to match the <a href="http://www.who.int/nutrition/sugars_public_consultation/en/" target="_blank">limits on free sugar</a> recommended by the World Health Organization (WHO) earlier this year.</p>
<p><img alt="Mixed candy" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/b89ff78e38c6875631f8d86c8eee308e/candy_w640.jpeg" style="line-height: 20.7999992370605px; float: right; width: 200px; height: 133px; border-width: 1px; border-style: solid; margin: 10px 15px;" /></p>
<p>What’s that you say? You didn’t plan your candy consumption yet? Well, the guideline says that no more than 10% of your calories should come from free sugars and that you can achieve increased health benefits by keeping the number below 5%. If you’re a good nutrition tracker, that should be no problem for you. For those of you looking for more general suggestions, we’re going to make a scatterplot in Minitab that should provide a helpful reference.</p>
<p>We like to show some fairly nifty graph features on the Minitab blog. For example, <a href="http://blog.minitab.com/blog/real-world-quality-improvement">Carly Barry</a>’s shown you how to <a href="http://blog.minitab.com/blog/real-world-quality-improvement/making-your-graphs-more-manageable">make your graphs more manageable with paneling</a>, <a href="http://blog.minitab.com/blog/adventures-in-statistics">Jim Frost</a>’s shown you how to <a href="http://blog.minitab.com/blog/adventures-in-statistics/world-travel-bumpy-roads-and-adjusting-your-graph-scales">adjust your scales</a> for travel bumps, and <a href="http://blog.minitab.com/blog/understanding-statistics">Eston Martz</a> adjusted <a href="http://blog.minitab.com/blog/understanding-statistics/studying-old-dogs-with-new-statistical-tricks-part-ii-contour-plots-and-cracking-bones">contour plots</a> while looking at data about hyena skulls. This time though, we’re going to see how our statistical software makes it easy to clarify a graph by taking something away.</p>
<p>The USDA last published their <a href="http://www.cnpp.usda.gov/sites/default/files/dietary_guidelines_for_americans/PolicyDoc.pdf">dietary guidelines</a> in 2010. Appendix 6 contains calorie estimates based on age, gender and activity level, rounded to the nearest 200 calories. Multiply those levels by 0.05 to get an estimate of your recommended sugar limit in calories. To change that into grams that you can find on candy labels, we’ll assume that sugar has 4 calories per gram.</p>
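<p>That arithmetic is simple enough to capture in a couple of lines of Python (the function name is mine; the 5% fraction and 4 calories per gram come from the guidelines above):</p>

```python
def sugar_limit_grams(daily_calories, fraction=0.05, calories_per_gram=4):
    """Free-sugar allowance in grams for a given daily calorie level."""
    return daily_calories * fraction / calories_per_gram

sugar_limit_grams(2000)                 # -> 25.0 grams at the stricter 5% level
sugar_limit_grams(2000, fraction=0.10)  # -> 50.0 grams at the 10% level
```

<p>So at 2,000 calories a day, the stricter WHO guideline works out to just 25 grams of sugar—not much room for fun-size candy bars.</p>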
<p>Now, if we create the default graph in Minitab we get something a bit like this. Note the symbols crammed together along each line:</p>
<p><img alt="Crowded symbols make the graph less clear." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/fe2bb52a63302df85b8b685d22241f54/image1.jpg" style="border-width: 0px; border-style: solid; width: 450px; height: 324px;" /></p>
<p>Let’s be honest, pushing all those symbols together to show a line with no variation looks a bit silly. But select those symbols and a clearer graph is only a right-click away:</p>
<p><img alt="Right-click the symbols and click Delete to make the graph clearer." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/2271a15debb7aeb4389609b6af61f082/grayend.gif" style="border-width: 0px; border-style: solid; width: 450px; height: 321px;" /></p>
<p>Without the symbols on the graph, the lines and the differences between them are clearer, especially when the lines are closest together during the early phase when people grow rapidly.</p>
<p><img alt="Without the symbols, the graph is clearer." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/9ba188145759ce1c2720e0e2004b6059/image5.jpg" style="width: 450px; height: 288px;" /></p>
<p>Much has been made of the fact that the 5% WHO guideline is less than the sugar in a can of soda, so Halloween can be a treacherous time for someone who wants to limit their sugar intake. After all, <a href="http://www.popsci.com/article/science/how-new-sugar-stats-will-kill-halloween-save-you" target="_blank">Popular Science</a> reports that the average trick-or-treater brings home over 600 grams. So what do you do if your ghost or goblin brings home more candy than you want? <a href="http://mommypoppins.com/halloween-candy-donation-candy-buy-back-donating-treats-operation-gratitude" target="_blank">Natalie Silverstein</a> offers some suggestions about how to make your candy do some good for others.</p>
<p> </p>
<p style="font-size:8px">The image of mixed candy is by <a href="https://www.flickr.com/photos/stevendepolo/" target="_blank">Steven Depolo</a> and appears under this <a href="https://creativecommons.org/licenses/by/2.0/">Creative Commons</a> license.</p>
Fun StatisticsStatistics in the NewsWed, 29 Oct 2014 12:21:15 +0000http://blog.minitab.com/blog/statistics-and-quality-improvement/who-cares-about-how-much-sugar-you-eat-on-halloweenCody SteeleSimulating Robust Processing with Design of Experiments, part 2
http://blog.minitab.com/blog/statistics-in-the-field/simulating-robust-processing-with-design-of-experiments2c-part-2
<p>by <a href="http://uk.linkedin.com/in/jasminwongym" target="_blank">Jasmin Wong</a>, guest blogger</p>
<p> </p>
<p><em><a href="http://blog.minitab.com/blog/statistics-in-the-field/simulating-robust-processing2c-part-1">Part 1</a> of this two-part blog post discusses the issues and challenges in injection moulding and suggests using simulation software and the statistical method called Design of Experiments (DOE) to speed development and boost quality. This part presents a case study that illustrates this approach. </em></p>
Preliminary Fill and Designed Experiment
<p>This case study considers the example of a hand dispensing pump for a sanitiser bottle where the main areas of concern were warpage and the concentricity of the tube, as these had a critical impact on fit and functionality.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f6c68e56710c222c2a20dd002021287f/dispenser_top.png" style="line-height: 20.7999992370605px; margin: 10px 15px; float: right; width: 400px; height: 236px;" /></p>
<div>
<p>In this example, the first step was to carry out a preliminary fill, pack, cool and warp analysis to ensure that the part had no filling difficulties such as short shots or hesitation. DOE was then carried out and, since the areas of concern were warpage and concentricity, these were selected as the quality factor/responses.</p>
<div>
<p>Four control factors that affected warpage and concentricity were used to carry out the DOE: melt temperature, packing pressure, cooling time, and fill time. The factor levels are shown in the table below:</p>
<p><img alt="Taguchi DOE control factors" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/322b2d00c3b22d962ca76ac0485e437b/taguchi_doe_control_factors.png" style="width: 450px; height: 136px;" /></p>
<p>A Taguchi L9 DOE was then created using Minitab Statistical Software. <span style="line-height: 1.6;">It should be noted that a Taguchi DOE assumes no significant interaction between factors, but this may not necessarily be true. In this case, however, it was selected to determine the relationship between the factors and responses in the shortest simulation time.</span></p>
<p>The Minitab worksheet below shows the process settings for the nine runs using the Taguchi L9 Design.</p>
<p><img alt="Taguchi design worksheet" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/7cbc350e2fbe466708f4b5b4a2f58566/taguchi_doe_worksheet.png" style="width: 450px; height: 169px;" /></p>
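As a sketch of what Minitab builds behind that worksheet, the L9(3^4) orthogonal array can be mapped onto the factor levels by hand. Only the values stated in this post are real (225°C melt temperature, 15MPa packing pressure, 8–12s cooling times, 0.1–0.3s fill times); the remaining level values below are placeholders:

```python
# Standard L9 orthogonal array: 9 runs, 4 factors, 3 levels each.
L9 = [
    (1, 1, 1, 1), (1, 2, 2, 2), (1, 3, 3, 3),
    (2, 1, 2, 3), (2, 2, 3, 1), (2, 3, 1, 2),
    (3, 1, 3, 2), (3, 2, 1, 3), (3, 3, 2, 1),
]

factors = ["MeltTemp_C", "PackPress_MPa", "CoolTime_s", "FillTime_s"]
levels = {
    "MeltTemp_C":    [225, 240, 255],   # only 225 appears in the post
    "PackPress_MPa": [5, 10, 15],       # only 15 appears in the post
    "CoolTime_s":    [8, 10, 12],       # 8 and 12 appear in the post
    "FillTime_s":    [0.1, 0.2, 0.3],   # all three appear in the post
}

# Translate level indices (1-3) into actual process settings per run.
runs = [{f: levels[f][i - 1] for f, i in zip(factors, row)} for row in L9]
print(len(runs))  # 9 runs, instead of 3**4 = 81 for a full factorial
```

Each level of each factor appears exactly three times across the nine runs, which is what lets the level means be compared fairly later on.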
<p>Moldex3D DOE was then used to perform the mathematical calculations based on the user’s specification (minimum warpage and linear shrinkage between nodes) to determine the optimum process setting.</p>
<p>From the nine different simulated runs, a main effect graph for warpage was plotted. </p>
<p><img alt="Main Effects Plot for Warpage" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/dbec7e75117c7763745e8260d78852fd/main_effects_warpage.png" style="width: 577px; height: 385px;" /></p>
<p><span style="line-height: 1.6;">From this, it could be seen that by increasing the packing pressure and cooling time, warpage was reduced. Increasing melt temperature, on the other hand, led to higher warpage. Using a filling time of 0.2s or 0.3s seemed to give slightly less warpage than 0.1s. Hence, it was determined that to achieve lower warpage, the optimum process setting should be a melt temperature of 225°C, packing pressure of 15MPa, cooling time of 12s and filling time of 0.3s.</span></p>
<p style="line-height: 20.7999992370605px;">Taking the results obtained from Moldex3D, Minitab 17 statistical software was used to determine which of the four factors had the biggest influence on part warpage.</p>
<p style="line-height: 20.7999992370605px;"><img alt="response table for warpage" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/20e65680dd317de7add7a8559b1d50e3/response_table_warpage.png" style="width: 500px; height: 153px;" /></p>
<p style="line-height: 20.7999992370605px;">This data analysis showed that cooling time had the biggest impact on part warpage, followed by packing pressure, melt temperature and then filling time. An area graph of warpage provided a quick comparison of the nine different runs, indicating that run 3 gave the least warpage.</p>
<p><img alt="area graph of warpage" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/740d75c1b4424da02ee136a673e43780/area_graph_of_warpage.png" style="width: 500px; height: 333px;" /></p>
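A Taguchi response table like the one above is straightforward to reproduce: average the response at each level of each factor, take the range (delta) of those level means, and rank factors by delta. The sketch below uses made-up warpage values, not the actual Moldex3D results:

```python
# Standard L9 orthogonal array used for the nine simulated runs.
L9 = [
    (1, 1, 1, 1), (1, 2, 2, 2), (1, 3, 3, 3),
    (2, 1, 2, 3), (2, 2, 3, 1), (2, 3, 1, 2),
    (3, 1, 3, 2), (3, 2, 1, 3), (3, 3, 2, 1),
]

def response_table(design, response):
    """Level means, delta (max - min of means), and rank for each factor."""
    n_factors = len(design[0])
    means, deltas = {}, {}
    for f in range(n_factors):
        means[f] = {
            lvl: (sum(y for row, y in zip(design, response) if row[f] == lvl)
                  / sum(1 for row in design if row[f] == lvl))
            for lvl in (1, 2, 3)
        }
        deltas[f] = max(means[f].values()) - min(means[f].values())
    # Rank 1 goes to the factor with the largest delta.
    order = sorted(deltas, key=deltas.get, reverse=True)
    ranks = {f: i + 1 for i, f in enumerate(order)}
    return means, deltas, ranks

warpage = [0.42, 0.35, 0.30, 0.40, 0.38, 0.33, 0.45, 0.37, 0.36]  # placeholders
means, deltas, ranks = response_table(L9, warpage)
```

The factor with rank 1 is the one whose level means spread the furthest apart, i.e., the one with the biggest influence on the response.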
<p>Concentricity is difficult to measure, both in real life and in simulation. In real life, the distance between different points is measured using a coordinate-measuring machine (CMM). In the Moldex3D simulation, the linear shrinkage between different nodes was measured instead. Eight nodes were identified, and the linear shrinkage across the diameter of the tube was determined: the lower the linear shrinkage, the more circular the part and the better its concentricity.</p>
<p>The main effects plot below for shrinkage shows that to get better concentricity/linear shrinkage between the nodes, a lower melt temperature, cooling time and filling time with a high pack pressure was preferable.</p>
<p><img alt="Main Effects Plot for Shrinkage" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/3eb9b51b4bd8caeac5ead713a86ce90b/main_effects_shrinkage.png" style="width: 579px; height: 385px;" /></p>
<p>It had already been established that to achieve lower linear shrinkage, the optimum process setting should be melt temperature of 225°C, packing pressure of 15MPa, cooling time of 8s and filling time of 0.1s. However, a cooling time of 8s may not be practical, as the analysis of warpage shows it would give high warpage.</p>
<p>Minitab was also used to find out which of the four control factors resulted in the greatest impact on linear shrinkage.</p>
<p><img alt="Response Table for Shrinkage" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/9e0e2aca3064320d44a9860223665f48/response_table_shrinkage.png" style="width: 500px; height: 153px;" /></p>
<p>This showed that pack pressure was ranked first, followed by cooling time, melt temperature and, lastly, filling time. Since the 8s cooling time would lead to high warpage, a compromise had to be made.</p>
<p>As mentioned earlier, for linear shrinkage the packing pressure was a bigger contributing factor than the cooling time, so it made sense to use a 12s cooling time with 15MPa packing pressure. Comparing the nine different runs for linear shrinkage in an area graph showed that run six gave the lowest linear shrinkage.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/dfabcb5cb7861c6dc11cc0fdb25c2b2d/area_graph_of_shrinkage.png" style="width: 500px; height: 333px;" /></p>
<p>Based on the user specification, Moldex3D’s mathematical calculations determined the optimised run<span style="line-height: 1.6;">. For this example, warpage was weighted the same as linear shrinkage. However, based on the DOE simulation results obtained, the optimum process setting for the lowest warpage required a cooling time of 12s and filling time of 0.3s, while the optimum process for the lowest linear shrinkage required a cooling time of 8s and fill time of 0.1s.</span></p>
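One way to picture the equal weighting of warpage and linear shrinkage is to normalise each response across the runs and minimise an equally weighted sum. The function and numbers below are an illustrative sketch of that idea, not Moldex3D's actual algorithm or data:

```python
# Equal-weight compromise across two responses: normalise each response to
# [0, 1] across the runs, then blend. All response values are placeholders.

def combined_score(warpage, shrinkage, w_warp=0.5, w_shrink=0.5):
    """Return one blended score per run; lower is better for both responses."""
    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) for x in xs]
    return [w_warp * a + w_shrink * b
            for a, b in zip(norm(warpage), norm(shrinkage))]

warpage   = [0.42, 0.35, 0.30, 0.40, 0.38, 0.33, 0.45, 0.37, 0.36]
shrinkage = [0.72, 0.68, 0.70, 0.66, 0.69, 0.60, 0.74, 0.71, 0.67]

scores = combined_score(warpage, shrinkage)
best = scores.index(min(scores))  # 0-based index of the best compromise run
```

With unequal weights (say, warpage mattering twice as much), the same function picks the run that trades a little shrinkage for a lot of warpage reduction.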
Concluding thoughts
<p>Moldex3D simulation resulted in a compromise process setting (melt temperature of 225°C, packing pressure of 15MPa, cooling time of 12s and filling time of 0.1s), which was used as the optimum run. From the area graphs shown below, it can be seen that the optimised run 10 gives the lowest warpage compared to the other nine runs, while having low linear shrinkage.</p>
<p><img alt="optimized run - area chart" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/13c7a74c8d37f74f4acc152b676e53b6/optimized_run_area_graph_w640.png" style="width: 640px; height: 210px;" /></p>
<p>From the simulation in Moldex 3D, shown below, it can be seen that part warpage and concentricity of the tube have been significantly improved (warpage has been improved by 20-30% while linear shrinkage has been kept to 0.6-0.7%).</p>
<p><img alt="Moldex 3D simulation" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a1b9270c0e645e9db3d7c4f626308aba/moldex_3d_sim.png" style="width: 500px; height: 179px;" /></p>
<p>It is important that designers and moulders understand that numerical results in a simulation such as this provide only a relative comparison and should not be treated as absolute. This is because there are various uncontrollable factors in the actual mould shop environment—‘noise’—which cannot be re-enacted in a simulation. However, running DOE using simulation can give the engineering team a head start on identifying which control factors to focus on and the relationship those factors have with part quality.</p>
<p> </p>
<p><strong>About the guest blogger</strong></p>
<p><a href="http://uk.linkedin.com/in/jasminwongym">Jasmin Wong</a> is project engineer at UK-based <a href="http://www.plazology.co.uk/" target="_blank">Plazology</a>, which provides product design optimisation, injection moulding flow simulation, mould design, mould procurement, and moulding process validation services to global manufacturing customers. She is an MSc graduate in polymer composite science and engineering and recently gained Moldex3D Analyst Certification.</p>
<p> </p>
<p> </p>
<p><em>A version of this article originally appeared in the <a href="http://content.yudu.com/htmlReader/A3572w/IWOct14/reader.html?page=26" target="_blank">October 2014 issue of Injection World</a> magazine.</em></p>
</div>
</div>
Design of ExperimentsMon, 27 Oct 2014 11:00:00 +0000http://blog.minitab.com/blog/statistics-in-the-field/simulating-robust-processing-with-design-of-experiments2c-part-2Guest BloggerSimulating Robust Processing with Design of Experiments, part 1
http://blog.minitab.com/blog/statistics-in-the-field/simulating-robust-processing-with-design-of-experiments2c-part-1
<p>by <a href="http://uk.linkedin.com/in/jasminwongym" target="_blank">Jasmin Wong</a>, guest blogger</p>
<p><em>The combination of statistical methods and injection moulding simulation software gives manufacturers a powerful way to predict moulding defects and to develop a robust moulding process at the part design phase. </em></p>
<p>CAE (computer-aided engineering) is widely used in the injection moulding industry today to improve product and mould designs as well as to resolve or troubleshoot engineering problems. But CAE can also be used to carry out in-depth processing simulations, allowing the critical process parameters that influence part quality to be identified and an appropriate, achievable process window to be determined at the earliest stage of the development process.</p>
<img alt="injection-molded dispenser pump" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f6c68e56710c222c2a20dd002021287f/dispenser_top.png" style="width: 400px; height: 236px;" />
<p style="text-align: center;">Warpage and tube concentricity were the key<br />
quality criteria in this injection-moulded hand dispenser pump.</p>
<p>In order to produce good quality injection mouldings with high consistency, a well-designed part and mould is critical, along with the selection of the right material and processing parameters. Changes to any of these four factors can have a significant effect on the moulded part.</p>
<p>With regard to defining process parameters, the injection moulding industry has depended on experienced process engineers using trial-and-error methods. Without insight into polymer behaviour inside the mould, engineers would more often than not ‘process the part dimensions in.’ Such an approach typically leads to a narrow process window, where just a slight change in processing conditions can cause part dimensions to fall outside the specification limits. This trial-and-error method is also laborious, expensive, and frequently ineffective, making it unsuitable for today’s fast-moving plastics processing industry.</p>
<p>Plastic injection moulding simulation software such as Moldex 3D from CoreTech System can help in the validation and optimisation of the part and/or mould design by identifying potential moulding defects before the tool is manufactured. The software can reduce the need for expensive prototypes, minimise the cost of tooling (since less rework needs to be done), and shorten validation time. When combined with the Design of Experiments techniques available in <a href="http://www.minitab.com/products/minitab">statistical software such as Minitab</a>, doing simulation <span style="line-height: 20.7999992370605px;">ahead of real world mould trials </span><span style="line-height: 1.6;">can also be used to speed mould approval. </span></p>
The Design of Experiments (DOE) Approach
<p><a href="http://blog.minitab.com/blog/real-world-quality-improvement/leveraging-designed-experiments-doe-for-success">Design of Experiments, or DOE</a>, involves performing a series of carefully planned, systematic tests while controlling the inputs and monitoring the outputs. In the context of injection moulding, the process parameters are usually referred to as the <em>factors </em>or <em>inputs</em>, while the customer requirements (part quality/dimensions or other part specifications) are referred to as <em>responses </em>or <em>outputs</em>. By analysing the results from these tests, moulders can characterise, optimise and/or troubleshoot the injection moulding process effectively and efficiently.</p>
<p>By applying DOE in an injection moulding simulation, designers and/or moulders can study the relationship between the moulding factors (inputs) and response (outputs) prior to the actual trial on the mould floor. This means that they can have a good understanding of which factors will affect the quality or certain part specifications as early as possible in the development process. Optimal moulding process conditions for the specific part design can then be identified so the focus can be directed to the conditions that have the biggest influence on the customer’s requirements. This can save time and increase productivity.</p>
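To see why a carefully planned series of tests matters, consider what exhaustively testing the four moulding factors would cost. The sketch below enumerates a full factorial in Python (factor names follow this post; the level values are placeholders):

```python
# Enumerate every combination of factor levels: a full factorial design.
# With 4 factors at 3 levels each, that is 3**4 = 81 runs, which is why
# fractional designs such as the Taguchi L9 (9 runs) are attractive when
# each run is an expensive simulation or mould trial.
from itertools import product

factors = {
    "melt_temp_C":       [225, 240, 255],
    "pack_pressure_MPa": [5, 10, 15],
    "cool_time_s":       [8, 10, 12],
    "fill_time_s":       [0.1, 0.2, 0.3],
}

full_factorial = list(product(*factors.values()))
print(len(full_factorial))  # 81
```

Each tuple in `full_factorial` is one candidate set of inputs whose outputs (part quality responses) would then be measured or simulated.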
When Should Simulation Be Performed?
<p>Ideally, CAE simulation should be carried out before the actual mould trial so potential mould defects—such as sink marks, weld lines, short shots, etc.—can be predicted and rectified in the original mould design.</p>
<p>The most challenging problem is often warpage. Due to temperature variations and differences in volumetric shrinkage, it is almost impossible to get a part which is exactly the same as the CAD model. It is, therefore, important to conduct a DOE to understand the impact certain processing parameters have and to define the <a href="http://blog.minitab.com/blog/statistics-in-the-field/optimizing-attribute-responses-using-design-of-experiments-doe-part-1">optimum processing settings</a>.</p>
<p>Before the DOE is conducted, however, it is important to carry out a preliminary simulation to understand the root cause of mould defects. Changes to the part are sometimes inevitable to avoid having too narrow a process window to work within. If the fill pattern is not balanced, for example, there is a high possibility of warpage occurring regardless of the process parameters.</p>
<p> </p>
<p><em><a href="http://blog.minitab.com/blog/statistics-in-the-field/simulating-robust-processing-with-design-of-experiments2c-part-2">The second half of this two-part post includes a detailed case study illustrating how moulding simulation software and design of experiments can be combined to speed part design and approval</a>. </em></p>
<p> </p>
<p><strong>About the guest blogger</strong></p>
<p><a href="http://uk.linkedin.com/in/jasminwongym">Jasmin Wong</a> is project engineer at UK-based <a href="http://www.plazology.co.uk/" target="_blank">Plazology</a>, which provides product design optimisation, injection moulding flow simulation, mould design, mould procurement, and moulding process validation services to global manufacturing customers. She is an MSc graduate in polymer composite science and engineering and recently gained Moldex3D Analyst Certification.</p>
<p> </p>
<div> </div>
<div><em>A version of this article originally appeared in the <a href="http://content.yudu.com/htmlReader/A3572w/IWOct14/reader.html?page=26" style="line-height: 20.7999992370605px;" target="_blank">October 2014 issue of Injection World</a> magazine.</em></div>
Design of ExperimentsFri, 24 Oct 2014 15:22:00 +0000http://blog.minitab.com/blog/statistics-in-the-field/simulating-robust-processing-with-design-of-experiments2c-part-1Guest Blogger