Data Analysis Software | Minitab
Blog posts and articles with tips for using statistical software to analyze data for quality improvement.
http://blog.minitab.com/blog/data-analysis-software/rss
Mon, 24 Nov 2014 02:30:18 +0000 | FeedCreator 1.7.3

Lessons in Quality from Guadalajara and Mexico City
http://blog.minitab.com/blog/understanding-statistics-and-its-application/lessons-in-quality-from-guadalajara-and-mexico-city
<p><img alt="View of Mexico City" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/8e5ec9217bc8fbc2ca7a6784a1efcdfa/mexico_df_400w.jpg" style="border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 400px; height: 235px;" />Last week, thanks to the collective effort from many people, we held very successful events in Guadalajara and Mexico City, which gave us a unique opportunity to meet with over 300 Spanish-speaking Minitab users. They represented many different industries, including automotive, textile, pharmaceutical, medical devices, oil and gas, electronics, and mining, as well as academic institutions and consultants.</p>
<p>As I listened to my peers Jose Padilla and <a href="http://blog.minitab.com/blog/marilyn-wheatleys-blog">Marilyn Wheatley</a> deliver their presentations, it was interesting to see people's reactions as they learned more about our products and services. Several attendees were particularly pleased to learn more about Minitab's ease-of-use and <a href="http://www.minitab.com/products/minitab/assistant/">step-by-step help with analysis</a> offered by the Assistant menu. I saw others react to demonstrations of Minitab's comprehensive Help system, the use of executables for automation purposes, and several of the tips and tricks discussed throughout our presentations.</p>
<p>We also had multiple conversations on Minitab's flexible licensing options. Several attendees who spend a lot of time on the road were particularly glad to learn about our <a href="http://support.minitab.com/installation/frequently-asked-questions/license-fulfillment/borrow-a-license-of-minitab-companion/">borrowing functionality</a>, which lets you “check out” a license so you can use Minitab software without accessing your organization’s license server.</p>
Acceptance Sampling Plans
<p>There were plenty of technical discussions as well. One interesting question came from a user who asked how Minitab's Acceptance Sampling Plans compare to the <a href="http://asq.org/knowledge-center/ANSI_ASQZ1_4-2008/index.html">ANSI Z1.4</a> standard (a.k.a. MIL-STD 105E). The short answer is that the tables provided by ANSI Z1.4 are for a specific AQL (Acceptable Quality Level), while implicitly assuming a certain RQL (Rejectable Quality Level) based solely on the lot size. ANSI Z1.4 is an AQL-based system, while Minitab's acceptance sampling plans give you the flexibility to create a customized sampling scheme for a specific AQL, RQL, or lot size using either the binomial or the hypergeometric distribution.</p>
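<p>Outside Minitab, the logic of a single attribute sampling plan can be sketched with the binomial distribution. The plan below (n = 125, c = 3) and the AQL/RQL values are purely illustrative and are not taken from the ANSI Z1.4 tables:</p>

```python
from scipy.stats import binom

def accept_prob(n, c, p):
    """Probability of accepting a lot with true defect rate p under a
    single sampling plan: inspect n items, accept if at most c are defective."""
    return binom.cdf(c, n, p)

# Illustrative plan and quality levels -- not from any standard table
n, c = 125, 3
aql, rql = 0.01, 0.05

print(f"P(accept) at AQL = {aql:.0%}: {accept_prob(n, c, aql):.3f}")  # producer's view
print(f"P(accept) at RQL = {rql:.0%}: {accept_prob(n, c, rql):.3f}")  # consumer's view
```

<p>Evaluating <code>accept_prob</code> over a grid of defect rates traces out the plan's operating characteristic (OC) curve, which is exactly what Minitab plots for a sampling plan.</p>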
Destructive Testing and Gage R&R
<p>Other users had questions about Gage R&R and destructive testing. Practitioners commonly assess a destructive test using Nested Gage R&R; however, this is not always necessary. The main problem with destructive testing is that every part tested is destroyed and thus can only be measured by a single operator. Since the purpose of this type of analysis is to measure the repeatability and reproducibility of the measurement system, one must identify parts that are as homogeneous as possible. Typically, instead of 10 parts, practitioners may use multiple parts from each of 10 batches. If the within-batch variation is small enough, the parts from each batch can be considered "the same," and the readings measured by all the operators can be used to produce repeatability and reproducibility measures. The key is to have homogeneous units or batches that provide enough samples to be tested by all operators for all replicates. If this is the case, you can analyze a destructive test with crossed Gage R&R.</p>
Control Charts and Subgroup Size
<p>We also had an interesting discussion about the sensitivity of Shewhart <a href="http://blog.minitab.com/blog/understanding-statistics/control-chart-tutorials-and-examples">control charts</a> to the subgroup size. Specifically, one of the attendees asked for our recommendation on subgroup size: 4 or 5? </p>
<p>The answer to this intriguing question requires an understanding of why subgroups are recommended in the first place. Control charts have limits constructed so that if the process is stable, the probability of observing points outside the control limits is very small; this probability is typically referred to as the false alarm rate, and it is usually set at 0.0027. This calculation assumes the process is normally distributed, so if we were plotting the individual data, as in an Individuals chart, the control limits would be effective for detecting an out-of-control situation only if the data came from a normal distribution. To reduce the dependence on normality, Shewhart suggested collecting the data in subgroups: if we plot the subgroup means instead of the individual data, the control limits become less and less sensitive to normality as the subgroup size increases. This is a result of the Central Limit Theorem (CLT), which states that, regardless of the underlying distribution of the data, if we take independent samples and compute the average (or sum) of the observations in each sample, the distribution of these sample means will converge to a normal distribution as the sample size grows.</p>
<p>So, going back to the original question: what is the recommended subgroup size for building control charts? The answer depends on how skewed the underlying distribution may be. For many distributions a subgroup size of 5 is sufficient for the CLT to kick in, making our control charts robust to departures from normality; however, for extremely skewed distributions like the exponential, the subgroup sizes may need to be much larger than 50. This topic was discussed in a paper by Schilling and Nelson titled "<a href="http://asq.org/qic/display-item/?item=5238">The Effect of Non-normality on the Control Limits of Xbar Charts</a>," published in JQT back in 1976.</p>
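<p>A quick simulation makes the point concrete. The sketch below draws subgroups from an exponential distribution and estimates the false-alarm rate of 3-sigma limits on the subgroup means; the subgroup sizes and simulation size are chosen purely for illustration:</p>

```python
import numpy as np

# Estimate the false-alarm rate of 3-sigma limits when the process
# follows an exponential(1) distribution instead of a normal one
rng = np.random.default_rng(7)
n_subgroups = 100_000
rates = {}

for size in (1, 5, 50):
    means = rng.exponential(scale=1.0, size=(n_subgroups, size)).mean(axis=1)
    mu, sigma = 1.0, 1.0 / np.sqrt(size)   # true mean and sd of subgroup means
    rates[size] = np.mean((means < mu - 3 * sigma) | (means > mu + 3 * sigma))
    print(f"subgroup size {size:>2}: false-alarm rate = {rates[size]:.4f} "
          f"(normal theory: 0.0027)")
```

<p>The rate shrinks toward the nominal 0.0027 as the subgroup size grows, but for this heavily skewed distribution it is still inflated even at size 50, consistent with the Schilling and Nelson finding.</p>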
Analyzing Variability
<p>We also had a great discussion about modeling variability in a process. One of the attendees, working for McDonald's, was looking for statistical methods for reducing the variation in the weight of apple slices. An apple is cut into 10 slices, and the goal was to minimize the variation in weight so that exactly four slices could be placed in each bag without further rework. This gave me the opportunity to demonstrate how to use the <a href="http://blog.minitab.com/blog/adventures-in-statistics/assessing-variability-for-quality-improvement">Analyze Variability</a> command in Minitab, which happens to be one of the topics we cover in our <a href="http://www.minitab.com/training/courses/#doe-in-practice-manufacturing">DOE in Practice</a> course.</p>
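<p>For readers who want to experiment outside Minitab, a dispersion analysis of this kind can be sketched by modeling the log of the within-run standard deviation against a process factor, which mirrors what Analyze Variability does with repeat measurements. The factor, settings, and weights below are entirely hypothetical:</p>

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical slicing experiment: 8 runs at each of 3 blade-speed settings,
# 10 slice weights per run; all names and numbers are invented
rng = np.random.default_rng(3)
rows = []
for speed in (1, 2, 3):
    for _ in range(8):
        weights = rng.normal(20, 0.4 * speed, size=10)   # sd grows with speed
        rows.append((speed, np.std(weights, ddof=1)))
df = pd.DataFrame(rows, columns=["Speed", "SliceSD"])

# Dispersion model: regress log(standard deviation) on the factor
fit = ols("np.log(SliceSD) ~ Speed", data=df).fit()
print(fit.params)
```

<p>A significantly positive coefficient on the factor would indicate that the setting increases weight variation, pointing to the settings that keep slice weights most consistent.</p>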
We Love Your Questions
<p>For me and my fellow trainers, there’s nothing better than talking with people who are using Minitab software to solve problems. Sometimes we’re able to provide a quick, helpful answer. Sometimes a question provokes a great discussion about some quality challenge we all have in common. And sometimes a question will lead to a great idea that we’re able to share with our developers and engineers to make our software better. </p>
<p>If you have a question about Minitab, statistics, or quality improvement, please feel free to comment here. And if you use Minitab software, you can always contact our <a href="http://www.minitab.com/support/">customer support</a> team for direct assistance from specialists in IT, statistics, and quality improvement.</p>
Quality Improvement | Statistics | Statistics Help
Wed, 19 Nov 2014 13:57:02 +0000
http://blog.minitab.com/blog/understanding-statistics-and-its-application/lessons-in-quality-from-guadalajara-and-mexico-city
Eduardo Santiago

What to Do When Your Data's a Mess, part 3
http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-datas-a-mess2c-part-3
<p>Everyone who analyzes data regularly has the experience of getting a worksheet that just isn't ready to use. Previously I wrote about tools you can use to <a href="http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-data-is-a-mess-part-1">clean up and eliminate clutter in your data</a> and <a href="http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-datas-a-mess2c-part-2">reorganize your data</a>. </p>
<p><span style="line-height: 1.6;">In this post, I'm going to highlight tools that help you get the most out of messy data by altering its characteristics.</span></p>
Know Your Options
<p>Many problems with data don't become obvious until you begin to analyze it. A shortcut or abbreviation that seemed to make sense while the data was being collected, for instance, might turn out to be a time-waster in the end. What if abbreviated values in the data set only make sense to the person who collected it? Or a column of numeric data accidentally gets coded as text? You can solve those problems quickly with <a href="http://www.minitab.com/products/minitab">statistical software</a> packages.</p>
Change the Type of Data You Have
<p>Here's an instance where a data entry error resulted in a column of numbers being incorrectly classified as text data. This will severely limit the types of analysis that can be performed using the data.</p>
<p><img alt="misclassified data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c45b427d3e5e2b5eac4a505ed5c3b24f/misclassified_data.png" style="width: 200px; height: 156px;" /></p>
<p>To fix this, select <strong>Data > Change Data Type</strong> and use the dialog box to choose the column you want to change.</p>
<p><img alt="change data type menu" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/46ece127300500409098383a2e476a9b/text_to_numeric_data.png" style="width: 376px; height: 175px;" /></p>
<p>One click later, and the errant text data has been converted to the desired numeric format:</p>
<p><img alt="numeric data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f1b9df0211f9085e577a41b0e3661b45/numeric_data.png" style="width: 200px; height: 156px;" /></p>
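<p>The same conversion can be sketched outside Minitab with pandas; the column name and values below are made up for the example:</p>

```python
import pandas as pd

# A column that came in as text (the same thing happens on a bad import)
df = pd.DataFrame({"Measure": ["1.2", "3.4", "5.6", "oops"]})
print(df["Measure"].dtype)            # object, i.e., text

# Convert to numeric; errors="coerce" turns unparseable entries into NaN
df["Measure"] = pd.to_numeric(df["Measure"], errors="coerce")
print(df["Measure"].dtype)            # float64
```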
Make Data More Meaningful by Coding It
<p>When this company collected data on the performance of its different functions across all its locations, it used numbers to represent both locations and units. </p>
<p><img alt="uncoded data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/d22a57fe9e9e398bd948e86c0adafe34/uncoded_data.png" style="width: 135px; height: 158px;" /></p>
<p>That may have been a convenient way to record the data, but unless you've memorized what each set of numbers stands for, interpreting the results of your analysis will be a confusing chore. You can make the results easy to understand and communicate by coding the data. </p>
<p>In this case, we select <strong>Data > Code > Numeric to Text...</strong></p>
<p><img alt="code data menu" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c75e46cc190497fd41b0e6736518c0fe/code_data_menu.png" style="width: 384px; height: 255px;" /></p>
<p>And we complete the dialog box as follows, telling the software to replace the numbers with more meaningful information, like the town each facility is located in. </p>
<p><img alt="Code data dialog box" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/cd75c14324187806b8f3a74a3b8996b4/code_data_dialog.png" style="width: 400px; height: 345px;" /></p>
<p>Now you have data columns that can be understood by anyone. When you create graphs and figures, they will be clearly labeled. </p>
<p><img alt="Coded data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/7ff81bdb08170d6d8a4e8547623cf557/coded_data.png" style="width: 161px; height: 200px;" /></p>
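<p>The equivalent recoding step can be sketched in pandas with a simple lookup; the location codes and town names below are invented for the example:</p>

```python
import pandas as pd

# Numeric location codes and the labels they stand for
df = pd.DataFrame({"Location": [1, 2, 3, 1, 2]})
labels = {1: "State College", 2: "Harrisburg", 3: "Pittsburgh"}

# Replace each code with its meaningful label
df["Location"] = df["Location"].map(labels)
print(df)
```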
Got the Time?
<p>Dates and times can be very important in looking at performance data and other indicators that might have a cyclical or time-sensitive effect. But the way the date is recorded in your data sheet might not be exactly what you need. </p>
<p>For example, if you wanted to see if the day of the week had an influence on the activities in certain divisions of your company, a list of dates in the MM/DD/YYYY format won't be very helpful. </p>
<p><img alt="date column" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f5b0dd178afbc0352f8dc2d9378e887b/date_column.png" style="width: 240px; height: 223px;" /></p>
<p>You can use <strong>Data > Date/Time > Extract to Text... </strong>to identify the day of the week for each date.</p>
<p><img alt="extract-date-to-text" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/7e6f7e8a87ee8291b9c6d51507092c19/extract_date_to_text.png" style="width: 351px; height: 132px;" /></p>
<p>Now you have a column that lists the day of the week, and you can easily use it in your analysis. </p>
<p><img alt="day column" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/dede93c9621917a0cfb54beef121d4e2/day_column.png" style="width: 249px; height: 205px;" /></p>
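<p>The same extraction can be sketched in pandas by parsing the text dates and asking for the weekday name; the sample dates are arbitrary:</p>

```python
import pandas as pd

# Dates recorded as MM/DD/YYYY text
df = pd.DataFrame({"Date": ["11/17/2014", "11/18/2014", "11/21/2014"]})

# Parse the text and extract the day of the week
df["Day"] = pd.to_datetime(df["Date"], format="%m/%d/%Y").dt.day_name()
print(df)
```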
Manipulating for Meaning
<p>These tools are commonly seen as a way to correct data-entry errors, but as we've seen, you can use them to make your data sets more meaningful and easier to work with.</p>
<p>There are many other tools available in Minitab's Data menu, including an array of options for arranging, combining, dividing, fine-tuning, rounding, and otherwise massaging your data to make it easier to use. Next time you've got a column of data that isn't quite what you need, try using the Data menu to get it into shape.</p>
Data Analysis | Statistics | Stats
Mon, 17 Nov 2014 13:00:00 +0000
http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-datas-a-mess2c-part-3
Eston Martz

Are Preseason Football or Basketball Rankings More Accurate?
http://blog.minitab.com/blog/the-statistics-game/are-preseason-football-or-basketball-rankings-more-accurate
<p>College basketball season tips off today, and for the second straight season Kentucky is the #1 ranked preseason team in the AP poll. Last year Kentucky did not live up to that ranking in the regular season, going 24-10 and earning a lowly 8 seed in the NCAA tournament. But then, in the tournament, they overachieved and made a run all the way to the championship game...before losing to Connecticut.</p>
<p>In football, Florida State was the AP poll preseason #1 football team. While they are currently still undefeated, they aren't quite playing like the #1 team in the country. So this made me wonder, which preseason rankings are more accurate, football or basketball?</p>
<p>I gathered <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/1d3961db92c5ba14bc90b2b8323b95f8/preseason_basketball_vs__football_rankings.MTW">data</a> from the last 10 seasons, and recorded the top 10 teams in the preseason AP poll for both football and basketball. Then I recorded the difference between their preseason ranking and their final ranking. Both sports had 10 teams that weren’t ranked or receiving votes in the final poll, so I gave all of those teams a final ranking of 40.</p>
Creating a Histogram to Compare Two Distributions
<p>Let’s start with a histogram to look at the distributions of the differences. (It's always a good idea to look at the distribution of your data when you're starting an analysis, whether you're looking at quality improvement data at work or sports data for yourself.) </p>
<p>You can create this graph in Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a> by selecting <strong>Graph > Histograms</strong>, choosing "With Groups" in the dialog box, and using the Basketball Difference and Football Difference columns as the graph variables:</p>
<p><img alt="Histogram" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/53055c57978dbfa85d28688cc816c98a/histogram_of_basketball_difference__football_difference.jpg" style="width: 720px; height: 480px;" /></p>
<p>The differences in the rankings appear to be pretty similar. Most of the data is towards the left side of this histogram, meaning for most cases the difference between the preseason and final ranking is pretty small.</p>
Conducting a Mann-Whitney Hypothesis Test on Two Medians
<p>We can further investigate the data by performing a hypothesis test. Because the data is heavily skewed, I’ll use <a href="http://blog.minitab.com/blog/the-statistics-game/do-the-data-really-say-female-named-hurricanes-are-more-deadly">a Mann-Whitney test</a>. This compares the medians of two samples with similarly-shaped distributions, as opposed to a <a href="http://blog.minitab.com/blog/understanding-statistics/guidelines-and-how-tos-for-the-2-sample-t-test">2-sample t test</a>, which compares the means. <span style="line-height: 20.7999992370605px;">The median is the middle value of the data. Half the observations are less than or equal to it, and half the observations are greater than or equal to it.</span><span style="line-height: 20.7999992370605px;"> </span></p>
<p>To perform this test in our statistical software, we select <strong>Stat > Nonparametrics > Mann-Whitney</strong>, then choose the appropriate columns for our first and second sample: </p>
<p><img alt="Mann-Whitney Test" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/1a1f239841b82e60170e6ecbc8077d4b/mann_whitney.jpg" style="width: 689px; height: 241px;" /></p>
<p>The basketball rankings have a smaller median difference than the football rankings. However, when we examine the <a href="http://blog.minitab.com/blog/understanding-statistics/three-things-the-p-value-cant-tell-you-about-your-hypothesis-test">p-value</a> we see that this difference is not statistically significant. There is not enough evidence to conclude that one preseason poll is more accurate than the other.</p>
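<p>For readers working outside Minitab, the same kind of comparison can be sketched with scipy; the rank differences below are made-up stand-ins, not the actual poll data:</p>

```python
from scipy.stats import mannwhitneyu

# Hypothetical preseason-vs-final rank differences for two polls
basketball = [0, 1, 2, 2, 3, 5, 8, 12, 30, 31]
football = [0, 2, 3, 4, 6, 7, 9, 15, 31, 33]

# Mann-Whitney test: compares the two samples without assuming normality
stat, p = mannwhitneyu(basketball, football, alternative="two-sided")
print(f"U = {stat}, p-value = {p:.3f}")
```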
<p>But what about the best teams? I grouped each of the top 3 ranked teams and looked at the median difference between their preseason and final rank.</p>
<p><img alt="Bar Chart" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/692a3db40dd5d3b4c20d539f92395629/bar_chart.jpg" style="width: 720px; height: 480px;" /></p>
<p>The preseason AP basketball poll has a smaller difference for the #1 and #3 ranked teams. But the football poll is better for the #2 team, having an impressive median value of 1. Overall, both polls are relatively good, as neither has a median value greater than 6. And the differences are close enough that we can’t conclude that one is more accurate than the other.</p>
What Does It Mean for the Teams?
<p>While the odds are against both Kentucky and Florida State to finish the season ranked #1 in their respective polls, previous seasons indicate that they’re still likely to finish as one of the top teams. This is better news for Kentucky, as being one of the top teams means they’ll easily make the NCAA basketball tournament and get a high seed. However, Florida State must finish as one of the top 4 teams, or else they’ll miss out on the football postseason completely.</p>
<p>So while we can’t conclude that one poll is better than the other, teams at the top of the AP basketball poll are clearly much more likely to reach the postseason than those at the top of the football poll.</p>
Data Analysis | Fun Statistics | Hypothesis Testing | Statistics in the News
Fri, 14 Nov 2014 15:03:33 +0000
http://blog.minitab.com/blog/the-statistics-game/are-preseason-football-or-basketball-rankings-more-accurate
Kevin Rudy

The Power of Multivariate ANOVA (MANOVA)
http://blog.minitab.com/blog/adventures-in-statistics/the-power-of-multivariate-anova-manova
<p><img alt="Willy Wonka" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/964d1b613c1569e983213d2544915ac5/willywonka.jpg" style="float: right; width: 225px; height: 225px; border-width: 1px; border-style: solid; margin: 10px 15px;" />Analysis of variance (ANOVA) is great when you want to compare the differences between group means. For example, you can use ANOVA to assess how three different alloys are related to the mean strength of a product. However, most ANOVA tests assess one response variable at a time, which can be a big problem in certain situations. Fortunately, <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">Minitab statistical software</a> offers a multivariate analysis of variance (MANOVA) test that allows you to assess multiple response variables simultaneously.</p>
<p>In this post, I’ll run through a MANOVA example, explain the benefits, and cover how to know when you should use MANOVA.</p>
Limitations of ANOVA
<p>Whether you’re using <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/basics/what-is-a-general-linear-model/" target="_blank">general linear model (GLM)</a> or <a href="http://blog.minitab.com/blog/adventures-in-statistics/did-welchs-anova-make-fishers-classic-one-way-anova-obsolete" target="_blank">one-way ANOVA</a>, most ANOVA procedures can assess only one response variable at a time. Even with GLM, where you can include many factors and covariates in the model, the analysis simply cannot detect multivariate patterns in the response variables.</p>
<p>This limitation can be a huge roadblock for some studies because it may be impossible to obtain significant results with a regular ANOVA test. You don’t want to miss out on any significant findings!</p>
Example That Compares MANOVA to ANOVA
<p>What the heck are multivariate patterns in the response variable? It sounds complicated, but it’s easy to show the difference between how ANOVA and MANOVA test the data by using graphs.</p>
<p>Let’s assume that we are studying the relationship between three alloys and the strength and flexibility of our products. Here is the <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/File/3f3b6f58c70a646731a9db97bd7edfab/manova_example.MTW">dataset for the example</a>.</p>
<p>The two individual value plots below show how one-way ANOVA analyzes the data—one response variable at a time. In these graphs, alloy is the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/anova-models/factor-and-factor-levels/" target="_blank">factor</a> and strength and flexibility are the <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/regression-and-correlation/regression-models/what-are-response-and-predictor-variables/" target="_blank">response variables</a>.</p>
<img alt="Individual value plot of strength by alloy" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/3402fd3845c2226f555b4ebfe18a87f5/strength_ivp.png" style="width: 350px; height: 233px;" />
<img alt="Individual value plot of flexibility by alloy" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/c7fba5c5eda5e81e02db60b2aefb3327/flexibility_ivp.png" style="width: 350px; height: 233px;" />
<p>The two graphs seem to show that the type of alloy is not related to either the strength or flexibility of the product. When you perform the one-way ANOVA procedure for these graphs, the p-values for strength and flexibility are 0.254 and 0.923 respectively.</p>
<p>Drat! I guess Alloy isn't related to either Strength or Flexibility, right? Not so fast!</p>
<p>Now, let’s take a look at the multivariate response patterns. To do this, I’ll display the same data with a <a href="http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/graphs/graphs-of-pairs-of-variables/scatterplots/scatterplot/" target="_blank">scatterplot</a> that plots Strength by Flexibility with Alloy as a categorical grouping variable.</p>
<p><img alt="Scatterplot of strength by flexibility grouped by alloy" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/86483284f76817ea95b3c1787e45e7d5/scatterplot.png" style="width: 576px; height: 384px;" /></p>
<p>The scatterplot shows a positive correlation between Strength and Flexibility. MANOVA is useful when you have correlated response variables like these. You can also see that for a given flexibility score, Alloy 3 generally has a higher strength score than Alloys 1 and 2. We can use MANOVA to statistically test for this response pattern to be sure that it’s not due to random chance.</p>
<p>To perform the MANOVA test in Minitab, go to: <strong>Stat > ANOVA > General MANOVA</strong>. Our response variables are Strength and Flexibility and the predictor is Alloy.</p>
<p>Whereas one-way ANOVA could not detect the effect, MANOVA finds it with ease. The p-values in the results are all very significant. You can conclude that Alloy influences the properties of the product by changing the relationship between the response variables.</p>
<p><img alt="MANOVA results" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/c96fe9a066011b31692765318c2f0d26/manova_swo.png" style="width: 391px; height: 155px;" /></p>
<p>For a more complete guide on how to interpret MANOVA results in Minitab, go to: <strong>Help > StatGuide > ANOVA > General MANOVA</strong>.</p>
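<p>An analogous multivariate test can be sketched in Python with statsmodels. The data below are simulated to show the pattern described above (correlated responses whose relationship shifts with the factor); they are not the blog post's dataset, and all values are invented:</p>

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Simulate correlated Strength/Flexibility responses where Alloy 3
# shifts the relationship between them
rng = np.random.default_rng(0)
frames = []
for alloy, shift in (("Alloy1", 0.0), ("Alloy2", 0.0), ("Alloy3", 2.0)):
    flex = rng.normal(10, 2, size=30)
    strength = 5 + flex + shift + rng.normal(0, 1, size=30)
    frames.append(pd.DataFrame({"Alloy": alloy, "Strength": strength,
                                "Flexibility": flex}))
df = pd.concat(frames, ignore_index=True)

# MANOVA tests both responses against the factor simultaneously
fit = MANOVA.from_formula("Strength + Flexibility ~ Alloy", data=df)
print(fit.mv_test())
```

<p>The printed table reports Wilks' lambda, Pillai's trace, and the other multivariate statistics, each with its own p-value for the Alloy effect.</p>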
When and Why You Should Use MANOVA
<p>Use multivariate ANOVA when you have continuous response variables that are correlated. In addition to multiple responses, you can also include multiple <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/anova-models/factor-and-factor-levels/" target="_blank">factors</a>, <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/anova-models/adding-a-covariate-to-glm/" target="_blank">covariates</a>, and <a href="http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/anova-models/what-is-an-interaction/" target="_blank">interactions</a> in your model. MANOVA uses the additional information provided by the relationship between the responses to provide three key benefits.</p>
<ul>
<li><strong>Increased power</strong>: If the response variables are correlated, MANOVA can detect differences too small to be detected through individual ANOVAs.</li>
<li><strong>Detects multivariate response patterns</strong>: The factors may influence the relationship between responses rather than affecting a single response. Single-response ANOVAs can miss these multivariate patterns as illustrated in the MANOVA example.</li>
<li><strong>Controls the family error rate</strong>: Your chance of incorrectly rejecting the null hypothesis increases with each successive ANOVA. Running one MANOVA to test all response variables simultaneously keeps the family error rate equal to your alpha level.</li>
</ul>
Data Analysis | Statistics | Statistics Help
Thu, 13 Nov 2014 13:00:00 +0000
http://blog.minitab.com/blog/adventures-in-statistics/the-power-of-multivariate-anova-manova
Jim Frost

What to Do When Your Data's a Mess, part 2
http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-datas-a-mess2c-part-2
<p><span style="line-height: 1.6;">In my last post, I wrote about making a cluttered data set easier to work with by removing unneeded columns entirely, and by displaying just those columns you want to work with <em>now</em>. But <a href="http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-data-is-a-mess-part-1">too much unneeded data</a> isn't always the problem. </span></p>
<p><span style="line-height: 1.6;">What can you do when someone gives you data that isn't organized the way you need it to be? </span></p>
<p><span style="line-height: 1.6;">That happens for a variety of reasons, but most often it's because the simplest format for collecting data can be a difficult one to analyze in a worksheet. Most <a href="http://www.minitab.com/products/minitab">statistical software</a> will accept a wide range of data layouts, but just because a layout is readable doesn't mean it will be easy to analyze.</span></p>
<p><span style="line-height: 1.6;">You may not be in control of how your data were collected, but you can use tools like sorting, stacking, and ordering to put your data into a format that makes sense and is easy for you to use. </span></p>
Decide How You Want to Organize Your Data
<p>Depending on how it's arranged, the same data can be easier to work with, simpler to understand, and can even yield deeper and more sophisticated insights. I can't tell you the best way to organize your specific data set, because that will depend on the types of analysis you want to perform, and the nature of the data you're working with. However, I can show you some easy ways to rearrange your data into the form that you select. </p>
Unstack Data to Make Multiple Columns
<p>The data below show concession sales for different types of events held at a local theater. </p>
<p><img alt="stacked data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/8ea617d9de8138f26f2da0f3f95f4b88/stackedata.png" style="width: 202px; height: 188px;" /></p>
<p><span style="line-height: 20.7999992370605px;">If we wanted to perform an analysis that requires each type of event to be in its own column, we can choose <strong>Data > Unstack Columns...</strong> and complete the dialog box as shown: </span></p>
<p><span style="line-height: 20.7999992370605px;"><img alt="unstack columns dialog" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/fc098d3ddcbc21fe12602cb45336949c/unstack_columns.png" style="width: 350px; height: 263px;" /> </span></p>
<p>Minitab creates a new worksheet that contains a separate column of Concessions sales data for each type of event:</p>
<p><img alt="Unstacked Data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f24dd4ac29678e25069d299ccc13c535/unstacked_data.png" style="width: 400px; height: 150px;" /></p>
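<p>The same unstacking operation can be sketched with pandas; the event names and dollar amounts below are made up for the example:</p>

```python
import pandas as pd

# Stacked layout: one sales column plus a column naming the event type
df = pd.DataFrame({"Event": ["Concert", "Play", "Concert", "Play"],
                   "Concessions": [120.50, 80.00, 95.25, 72.75]})

# Unstack: one column of sales per event type
unstacked = pd.DataFrame({event: grp["Concessions"].reset_index(drop=True)
                          for event, grp in df.groupby("Event")})
print(unstacked)
```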
Stack Data to Form a Single Column (with Grouping Variable)
<p>A similar tool will help you put data from separate columns into a single column for the type of analysis required. The data below show sales figures for four employees: </p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f546e2611e4fd6fe804de7c0aee3d230/stacked_data.png" style="width: 265px; height: 92px;" /></p>
<p>Choose <strong>Data > Stack > Columns...</strong> and select the columns you wish to combine. Checking the "Use variable names in subscript column" option will create a second column that identifies the person who made each sale. </p>
<p><img alt="Stack columns dialog" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a09dba196e68e5e75d0f248339a53e11/stack_data_dialog.jpg" style="width: 400px; height: 292px;" /></p>
<p>When you press OK, the sales data are stacked into a single column of measurements and ready for analysis, with Employee available as a grouping variable: </p>
<p><img alt="stacked columns" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/c26bec8bec9447ab1df6b9ad669d9a1a/stacked_columns.jpg" style="width: 138px; height: 181px;" /></p>
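<p>The reverse operation, stacking separate columns into a single column of values plus a subscript column, can be sketched the same way. The employee names and figures here are hypothetical:</p>

```python
# One column of sales per employee (hypothetical figures).
columns = {
    "Ann": [120, 135, 110],
    "Bob": [98, 140, 125],
    "Carla": [150, 160, 145],
}

# Stack: a single list of values, each paired with a "subscript"
# entry recording which employee the value came from.
stacked = [(name, value) for name, values in columns.items() for value in values]

print(stacked[0])    # ('Ann', 120)
print(len(stacked))  # 9
```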
Sort Data to Make It More Manageable
<p>The following data appear in the worksheet in the order in which individual stores in a chain sent them into the central accounting system.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/431dcae640fa0855a8db03b14bad3998/unsorted_data.jpg" style="width: 200px; height: 228px;" /></p>
<p>When the data appear in this uncontrolled order, finding an observation for any particular item, or from any specific store, would entail reviewing the entire list. We can fix that problem by selecting <strong>Data > Sort...</strong> and reordering the data by either store or item. </p>
<p><img alt="sorted data by item" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/0c982bb11359a001c048cb6c39ab1f60/sorted_data_by_item.jpg" style="width: 221px; height: 246px;" /> <img alt="sorted data by store" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/53e9a3f22b4a959af11952995703d7d4/sorted_data_by_store.jpg" style="width: 209px; height: 248px;" /></p>
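<p>As a rough illustration of the same reordering outside Minitab, Python's <code>sorted()</code> with a key function sorts records by store or by item. The store and item values below are made up:</p>

```python
# Records in the arbitrary order the stores sent them in (hypothetical).
records = [
    {"store": "South", "item": "Shirts", "sales": 25},
    {"store": "North", "item": "Mugs", "sales": 40},
    {"store": "South", "item": "Mugs", "sales": 18},
    {"store": "North", "item": "Shirts", "sales": 31},
]

# Sort by item (then store), or by store (then item), as needed.
by_item = sorted(records, key=lambda r: (r["item"], r["store"]))
by_store = sorted(records, key=lambda r: (r["store"], r["item"]))

print([r["item"] for r in by_item])    # ['Mugs', 'Mugs', 'Shirts', 'Shirts']
print([r["store"] for r in by_store])  # ['North', 'North', 'South', 'South']
```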
Merge Multiple Worksheets
<p>What if you need to analyze information about the same items that was recorded on separate worksheets? For instance, suppose one group gathered historic data about all of a corporation's manufacturing operations while another worked on strategic planning, and your analysis required data from both. </p>
<p><img alt="two worksheets" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f63ed557c91fb6136b28ab43001b48b4/two_worksheets.png" style="width: 350px; height: 327px;" /></p>
<p>You can use <strong>Data > Merge Worksheets</strong> to bring the data together into a single worksheet, using the Division column to match the observations:</p>
<p><img alt="merging worksheets" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/651d3d676a4099a71eb180344d2e8282/merge_worksheets.png" style="width: 393px; height: 363px;" /></p>
<p>You can also choose whether multiple, missing, or unmatched observations will be included in the merged worksheet. </p>
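<p>A minimal sketch of merging two tables on a shared key column, with hypothetical Division names and values; dropping the unmatched row here mirrors one of the options for handling unmatched observations:</p>

```python
# Two "worksheets" keyed by a shared Division column (hypothetical data).
manufacturing = {"East": 1200, "West": 950, "South": 1100}
planning = {"East": 300, "West": 275, "North": 180}  # North has no match

# Merge on Division, keeping only observations present in both sheets.
merged = {
    div: {"output": manufacturing[div], "budget": planning[div]}
    for div in manufacturing
    if div in planning
}

print(merged)
# {'East': {'output': 1200, 'budget': 300}, 'West': {'output': 950, 'budget': 275}}
```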
Reorganizing Data for Ease of Use and Clarity
<p>Making changes to the layout of your worksheet does entail a small investment of time, but it can bring big returns in making analyses quicker and easier to perform. The next time you're confronted with raw data that isn't ready to play nice, try some of these approaches to get it under control. </p>
<p>In my next post, I'll share some tips and tricks that can help you get more information out of your data.</p>
Data AnalysisStatisticsStatsTue, 11 Nov 2014 14:48:09 +0000http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-datas-a-mess2c-part-2Eston MartzWhat to Do When Your Data's a Mess, part 1
http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-data-is-a-mess-part-1
<p>Isn't it great when you get a set of data and it's perfectly organized and ready for you to analyze? I love it when the people who collect the data take special care to make sure to format it consistently, arrange it correctly, and eliminate the junk, clutter, and useless information I don't need. </p>
<p><img alt="Messy Data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ad531bc1c0dc575e774b7ecef670b231/messydata.png" style="border-width: 1px; border-style: solid; margin: 10px 15px; width: 250px; height: 248px; float: right;" />You've never received a data set in such perfect condition, you say?</p>
<p>Yeah, me neither. But I can dream, right? </p>
<p><span style="line-height: 1.6;">The truth is, when other people give me data, it's typically not ready to analyze. It's frequently messy, disorganized, and inconsistent. I get big headaches if I try to analyze it without doing a little clean-up work first. </span></p>
<p>I've talked with many people who've shared similar experiences, so I'm writing a series of posts on how to get your data in usable condition. In this first post, I'll talk about some basic methods you can use to make your data easier to work with. </p>
Preparing Data Is a Little Like Preparing Food
<p>I'm not complaining about the people who give me data. In most cases, they aren't statisticians and they have many higher priorities than giving me data in exactly the form I want. </p>
<p>The end result is that getting data is a little bit like getting food: it's not always going to be ready to eat when you pick it up. You don't eat raw chicken, and usually you can't analyze raw data, either. In both cases, you need to prepare it first or the results aren't going to be pretty. </p>
<p><span style="line-height: 1.6;">Here are a couple of very basic things to look for when you get a messy data set, and how to handle them. </span></p>
<span style="line-height: 1.6;">Kitchen-Sink Data and Information Overload</span>
<p>Frequently I get a data set that includes a lot of information that I don't need for my analysis. I also get data sets that combine or group information in ways that make analyzing it more difficult. </p>
<p>For example, let's say I needed to analyze data about different types of events that take place at a local theater. Here's my raw data sheet: </p>
<p><img alt="April data sheet" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/14fe4e9930171f54848b589c0e8139d1/april_data_raw.png" style="width: 400px; height: 224px;" /></p>
<p>With each type of event jammed into a single worksheet, it's a challenge to analyze just one event category. What would work better? A separate worksheet for each type of occasion. In Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a>, I can go to <strong>Data > Split Worksheet...</strong> and choose the Event column: </p>
<p><img alt="split worksheet" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/69c63e422339f9871ada5a244222dcfc/split_worksheet.png" style="width: 300px; height: 309px;" /></p>
<p>And Minitab will create new worksheets that include only the data for each type of event. </p>
<p><img alt="separate worksheets by event type" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/8b97ea00ae39da8cb60e307ebe6140dc/separate_data_sheets.png" style="width: 300px; height: 243px;" /></p>
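<p>A split-by-group operation like this can be sketched in a few lines of Python as well; the event rows below are hypothetical:</p>

```python
from collections import defaultdict

# One worksheet with every event type mixed together (hypothetical rows).
worksheet = [
    {"event": "Concert", "attendance": 320},
    {"event": "Play", "attendance": 150},
    {"event": "Concert", "attendance": 280},
    {"event": "Lecture", "attendance": 90},
]

# Split into one sub-worksheet per value of the Event column.
split = defaultdict(list)
for row in worksheet:
    split[row["event"]].append(row)

print(sorted(split))          # ['Concert', 'Lecture', 'Play']
print(len(split["Concert"]))  # 2
```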
<p><span style="line-height: 20.7999992370605px;">Minitab also lets you merge worksheets to </span>combine items provided in separate data files. </p>
<p>Let's say the data set you've been given contains a lot of columns that you don't need: irrelevant factors, redundant information, and the like. Those items just clutter up your data set, and getting rid of them will make it easier to identify and access the columns of data you actually need. You can delete rows and columns you don't need, or use the <strong>Data > Erase Variables</strong> tool to make your worksheet more manageable. </p>
<span style="line-height: 1.6;">I Can't See You Right Now...Maybe Later</span>
<p>What if you don't want to actually <em>delete </em>any data, but you only want to see the columns you intend to use? For instance, in the data below, I don't need the Date, Manager, or Duration columns now, but I may have use for them in the future: </p>
<p><img alt="unwanted columns" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/99d785a0b5ff0cbac36f0c6af05b1cac/unwantedcolumns.png" style="width: 400px; height: 225px;" /></p>
<p>I can select and right-click those columns, then use <strong>Column > Hide Selected Columns</strong> to make them disappear. </p>
<p><img alt="hide selected columns" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/00defa2646d5e100873ef2961d374ff0/hideselectedcolumns.png" style="width: 400px; height: 308px;" /></p>
<p>Voila! They're gone from my sight. Note how the displayed columns jump from C1 to C5, indicating that some columns are hidden: </p>
<p><img alt="hidden columns" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a140bb6413744b431460e70f523e5a0b/hiddencolumns.png" style="width: 323px; height: 138px;" /></p>
<p>It's just as easy to bring those columns back in the limelight. When I want them to reappear, I select the C1 and C5 columns, right-click, and choose "Unhide Selected Columns." </p>
<p>Data may arrive in a disorganized and messy state, but you don't need to keep it that way. Getting rid of extraneous information and choosing the elements that are visible can make your work much easier. But that's just the tip of the iceberg. In my next post, I'll cover some more <a href="http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-datas-a-mess2c-part-2">ways to make unruly data behave</a>. </p>
Data AnalysisStatisticsMon, 10 Nov 2014 15:52:00 +0000http://blog.minitab.com/blog/understanding-statistics/what-to-do-when-your-data-is-a-mess-part-1Eston MartzCreating and Reading Statistical Graphs: Trickier than You Think
http://blog.minitab.com/blog/understanding-statistics/creating-and-reading-statistical-graphs-trickier-than-you-think
<p>A few weeks ago my colleague Cody Steele illustrated <a href="http://blog.minitab.com/blog/statistics-and-quality-improvement/how-painful-does-the-income-gap-look-to-you">how the same set of data can appear to support two contradictory positions</a>. He showed how changing the scale of a graph that displays mean and median household income over time drastically alters the way it can be interpreted, even though there's no change in the data being presented.</p>
<p><img alt="Graph interpretation is tricky, especially if you're doing it quickly" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f594d20f8daa8e00e29380f68010b1cc/hunh.jpg" style="margin: 10px 15px; float: right; width: 200px; height: 200px;" /> When we analyze data, we need to present the results in an objective, honest, and fair way. That's the catch, of course. What's "fair" can be debated...and that leads us straight into "Lies, damned lies, and statistics" territory. </p>
<p><span style="line-height: 20.7999992370605px;">Cody's post got me thinking about the importance of statistical literacy, especially in a mediascape saturated with overhyped news reports about seemingly every new study, not to mention omnipresent "infographics" of frequently dubious origin and intent.</span></p>
<p><span style="line-height: 20.7999992370605px;">As consumers and providers of statistics, can we trust our own impressions of the information we're bombarded with on a daily basis? It's an increasing challenge, even for the statistics-savvy. </span></p>
So Much Data, So Many Graphs, So Little Time
<p>The increased amount of information available, combined with the acceleration of the news cycle to speeds that wouldn't have been dreamed of a decade or two ago, means we have less time available to absorb and evaluate individual items critically. </p>
<p>A half-hour television news broadcast might include several animations, charts, and figures based on the latest research, or polling numbers, or government data. They'll be presented for several seconds at most, then it's on to the next item. </p>
<p>Getting news online is even more rife with opportunities for split-second judgment calls. We scan through the headlines and eyeball the images, searching for stories interesting enough to click on. But with 25 interesting stories vying for your attention, and perhaps just a few minutes before your next appointment, you race through them very quickly. </p>
<p>But when we see graphs for a couple of seconds, do we really absorb their meaning completely and accurately? Or are we susceptible to misinterpretation? </p>
<p>Most of the graphs we see are very simple: bar charts and pie charts predominate. But as statistics educator Dr. Nic points out in <a href="http://learnandteachstatistics.wordpress.com/2012/07/16/tricky_graphs/">this blog post</a>, interpreting even simple bar charts can be a deceptively tricky business. I've adapted her example to demonstrate this below. </p>
Which Chart Shows Greater Variation?
<p>A city surveyed residents of two neighborhoods about the quality of service they get from local government. Respondents were asked to rate local services on a scale of 1 to 10. Their responses were charted using Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a>, as shown below. </p>
<p>Take a few seconds to scan the charts, then choose which neighborhood's responses exhibit the most variation, Ferndale or Lawnwood?</p>
<p><img alt="Lawnwood Bar Chart" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f88262f2732bc43e8ac0b919d43139a5/lawnwoodbarchart.gif" style="width: 500px; height: 333px;" /></p>
<p><img alt="Ferndale Bar Chart" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/67ee1909a89236e3caac2d11a9d42795/ferndalebarchart.gif" style="width: 500px; height: 333px;" /></p>
<p>Seems pretty straightforward, right? Lawnwood's graph is quite spiky and disjointed, with sharp peaks and valleys. The graph of Ferndale's responses, on the other hand, looks nice and even. Each bar's roughly the same height. </p>
<p>It looks like Lawnwood's responses have the most variation. But let's verify that impression with some basic descriptive statistics about each neighborhood's responses:</p>
<p style="margin-left: 40px;"><img alt="Descriptive Statistics for Ferndale and Lawnwood" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/1eeed755d2a0baea0939dc7ccecacaea/descriptive_statistics.gif" style="width: 369px; height: 105px;" /></p>
<p>Uh-oh. A glance at the graphs suggested that Lawnwood has more variation, but the analysis demonstrates that Ferndale's variation is, in fact, much higher. How did we get this so wrong?</p>
Frequencies, Values, and Counterintuitive Graphs
<p><span style="line-height: 1.6;">The answer lies in how the data were presented. The charts above show frequencies, or counts, rather than individual responses. </span></p>
<p><span style="line-height: 1.6;">What if we graph the individual responses for each neighborhood? </span></p>
<p><img alt="Lawnwood Individuals Chart" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/d8e91ae6c007e8f5327c54ac3ec65604/lawnwoodindividualsbarchart.gif" style="width: 500px; height: 333px;" /></p>
<p><img alt="Ferndale Individuals Chart" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/4c01c68dbb96e2126a1fd313ee38e001/ferndaleindividualsbarchart.gif" style="width: 500px; height: 333px;" /></p>
<p>In <em>these </em>graphs, it's easy to see that the responses of Ferndale's citizens had much more variation than those of Lawnwood. But unless you appreciate the differences between values and frequencies—and paid careful attention to how the first set of graphs was labelled—a quick look at the earlier graphs could well leave you with the wrong conclusion. </p>
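<p>The values-versus-frequencies distinction can be checked numerically. Here is a small Python sketch with made-up ratings, engineered so that the spiky frequency chart corresponds to the <em>smaller</em> spread in the underlying responses:</p>

```python
from collections import Counter
from statistics import stdev

# Hypothetical ratings: "Lawnwood" answers cluster tightly between 4 and 7
# (a spiky frequency chart), while "Ferndale" answers are spread evenly
# from 1 to 10 (flat, even-looking bars).
lawnwood = [4] * 20 + [5] * 5 + [6] * 22 + [7] * 3
ferndale = [r for r in range(1, 11) for _ in range(5)]

print(Counter(lawnwood))  # Counter({6: 22, 4: 20, 5: 5, 7: 3})

# The flat frequency chart hides the larger spread in the raw values.
print(stdev(lawnwood) < stdev(ferndale))  # True
```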
Being Responsible
<p>Since you're reading this, you probably both create and consume data analysis. You may generate your own reports and charts at work, and see the results of other peoples' analyses on the news. We should approach both situations with a certain degree of responsibility. </p>
<p>When looking at graphs and charts produced by others, we need to avoid snap judgments. We need to pay attention to what the graphs really show, and take the time to draw the right conclusions based on how the data are presented. </p>
<p>When sharing our own analyses, we have a responsibility to communicate clearly. In the frequency charts above, the X and Y axes are labelled adequately—but couldn't they be more explicit? Instead of just "Rating," couldn't the label read "Count for Each Rating" or some other, more meaningful description? </p>
<p>Statistical concepts may seem like common knowledge if you've spent a lot of time working with them, but many people aren't clear on ideas like "correlation is not causation" and margins of error, let alone the nuances of statistical assumptions, distributions, and significance levels.</p>
<p>If your audience includes people without a thorough grounding in statistics, are you going the extra mile to make sure the results are understood? For example, many expert statisticians have told us they use <a href="http://www.minitab.com/products/minitab/assistant/">the Assistant</a> in Minitab 17 to present their results precisely because it's designed to communicate the outcome of analysis clearly, even for statistical novices. </p>
<p><span style="line-height: 20.7999992370605px;">If you're already doing everything you can to make statistics accessible to others, kudos to you. </span><span style="line-height: 20.7999992370605px;">And if you're not, why aren't you? </span></p>
Data AnalysisStatisticsStatistics in the NewsStatsWed, 05 Nov 2014 14:25:00 +0000http://blog.minitab.com/blog/understanding-statistics/creating-and-reading-statistical-graphs-trickier-than-you-thinkEston MartzComparing the College Football Playoff Top 25 and the Preseason AP Poll
http://blog.minitab.com/blog/the-statistics-game/comparing-the-college-football-playoff-top-25-and-the-preseason-ap-poll
<p>The college football playoff committee waited until the end of October to release its first top 25 rankings. One of the reasons for waiting so far into the season was that the committee would rank the teams based on actual games and wouldn’t be influenced by preseason rankings.</p>
<p>At least, that was the idea.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/8ac74acf42052d068b6cd0eeec32f609/cfb_playoff.jpg" style="line-height: 20.7999992370605px; float: right; width: 300px; height: 187px;" /></p>
<p>Earlier this year, I found that the <a href="http://blog.minitab.com/blog/the-statistics-game/has-the-college-football-playoff-already-been-decided">final AP poll was correlated with the preseason AP poll</a>. That is, if team A was ranked ahead of team B in the preseason and they had the same number of losses, team A was still usually ranked ahead of team B. The biggest exception was SEC teams, who were able to regularly jump ahead of teams (with the same number of losses) ranked ahead of them in the preseason.</p>
<p>If the final AP poll can be influenced by preseason expectations, could the college football playoff committee be influenced, too? Let’s compare their first set of rankings to the preseason AP poll to find out.</p>
Comparing the Ranks
<p>There are currently 17 different teams in the committee’s top 25 that have just one loss. I <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/26e7c8d8d8eee4fe2dfa26dc3d6e3c54/preseason_ap_vs__cfb_playoff_rankings.MTW">recorded the order</a> they are ranked in the committee’s poll and their order in the AP preseason poll. Below is an individual value plot of the data that shows each team’s preseason rank versus their current rank.</p>
<p><img alt="IVP" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/4098bab194a586865d3861f854d65627/ivp.jpg" style="width: 600px; height: 400px;" /></p>
<p>Teams on the diagonal line haven’t moved up or down since the preseason. Although Notre Dame is the only team to fall directly on the line, most teams aren’t too far off.</p>
<p>Teams below the line have jumped teams that were ranked ahead of them in the preseason. The biggest winner is actually not an SEC team, it’s TCU. Before the season, 13 of the current one-loss teams were ranked ahead of TCU, but now there are only 4. On the surface TCU seems to counter the idea that only SEC teams can drastically move up from their preseason ranking. However, of the 9 teams TCU jumped, only one (Georgia) is from the SEC. And the only other team to jump up more than 5 spots is Mississippi—who of course is from the SEC. So I wouldn’t conclude that the CFB playoff committee rankings behave differently than the AP poll quite yet.</p>
<p>Teams above the line have been passed by teams that had been ranked behind them in the preseason. Ohio State is the biggest loser, having had 9 different teams pass over them. Part of this can be explained by the fact that they have the worst loss (at home, to a Virginia Tech team that is now 4-4). But another factor is that the preseason AP poll was released before anybody knew Buckeye quarterback Braxton Miller would miss the entire season. Had voters known that, Ohio State probably wouldn’t have been ranked so high to begin with. </p>
<p>Overall, 10 teams have moved up or down by no more than 3 spots from their preseason position. The correlation between the two polls is 0.571, which indicates a positive association between the preseason AP poll and the current CFB playoff rankings. That is, teams ranked higher in the preseason poll tend to be ranked higher in the playoff rankings.</p>
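<p>For illustration, the correlation between two sets of ranks can be computed in a few lines of Python. The six rank pairs below are hypothetical, not the article's 17-team data; note that Pearson correlation applied to ranks is Spearman's rho:</p>

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical preseason and playoff ranks for six teams.
preseason = [1, 2, 3, 4, 5, 6]
playoff = [2, 1, 3, 6, 4, 5]

print(round(pearson(preseason, playoff), 3))  # 0.771
```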
Concordant and Discordant Pairs
<p>We can take this analysis a step further by looking at the concordant and discordant pairs. A pair is concordant if the observations are in the same direction. A pair is discordant if the observations are in opposite directions. This will let us compare teams to each other two at a time.</p>
<p>For example, let’s compare Auburn and Mississippi. In the preseason, Auburn was ranked 3 (out of the 17 one-loss teams) and Mississippi was ranked 10. In the playoff rankings, Auburn is ranked 1 and Mississippi is ranked 2. This pair is concordant, since in both cases Auburn is ranked higher than Mississippi. But if you compare Alabama and Mississippi, you’ll see Alabama was ranked higher in the preseason, but Mississippi is ranked higher in the playoff rankings. That pair is discordant.</p>
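<p>Counting concordant and discordant pairs is mechanical enough to sketch in Python. The five teams' (preseason, playoff) ranks below are hypothetical; with all 17 one-loss teams, the same pairwise comparison yields the 136 pairs discussed below:</p>

```python
from itertools import combinations

# Hypothetical (preseason_rank, playoff_rank) pairs for five teams.
ranks = {
    "Auburn": (3, 1),
    "Mississippi": (10, 2),
    "Alabama": (1, 4),
    "Oregon": (2, 3),
    "TCU": (14, 5),
}

concordant = discordant = 0
for a, b in combinations(ranks.values(), 2):
    # Same direction in both polls -> concordant; opposite -> discordant.
    direction = (a[0] - b[0]) * (a[1] - b[1])
    if direction > 0:
        concordant += 1
    elif direction < 0:
        discordant += 1

print(concordant, discordant)  # 5 5 (out of C(5, 2) = 10 pairs)
```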
<p>When we compare every team, we end up with 136 pairs. How many of those are concordant? Our <a href="http://www.minitab.com/products/minitab">favorite statistical software</a> has the answer: </p>
<p><img alt="Measures of Concordance" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/5f281abfa1e06d5cda492e17b3f9746b/concordance.jpg" style="width: 663px; height: 176px;" /></p>
<p>There are 96 concordant pairs, which is just over 70%. So most of the time, if a team ranked higher in the preseason poll, they are ranked higher in the playoff rankings. And consider this: of the one-loss teams, the top 4 ranked preseason teams were Alabama, Oregon, Auburn, and Michigan St. Currently, the top 4 one-loss teams are Auburn, Mississippi, Oregon, and Alabama. That’s only one new team—which just so happens to be from the SEC.</p>
<p>That’s bad news for non-SEC teams that started the season ranked low, like Arizona, Notre Dame, Nebraska, and Kansas State. It's going to be hard for them to jump teams with the same record, especially if those teams are from the SEC. Just look at Alabama’s résumé so far. Their best win is over West Virginia and they lost to #4 Mississippi. Is that <em>really </em>better than Kansas State, who lost to #3 Auburn and beat Oklahoma <em>on the road</em>? If you simply changed the name on Alabama’s uniform to Utah and had them unranked to start the season, would they still be ranked three spots higher than Kansas State? I doubt it.</p>
<p>The good news is that there are still many games left to play. Most of these one-loss teams will lose at least one more game. But with 4 teams making the playoff this year, odds are we'll see multiple teams with the same record vying for the last playoff spot. And if this college football playoff ranking is any indication, unless you're in the SEC, teams that were highly thought of in the preseason will have the edge.</p>
Fun StatisticsHypothesis TestingFri, 31 Oct 2014 13:04:57 +0000http://blog.minitab.com/blog/the-statistics-game/comparing-the-college-football-playoff-top-25-and-the-preseason-ap-pollKevin RudySimulating Robust Processing with Design of Experiments, part 2
http://blog.minitab.com/blog/statistics-in-the-field/simulating-robust-processing-with-design-of-experiments2c-part-2
<p>by <a href="http://uk.linkedin.com/in/jasminwongym" target="_blank">Jasmin Wong</a>, guest blogger</p>
<p><em><a href="http://blog.minitab.com/blog/statistics-in-the-field/simulating-robust-processing2c-part-1">Part 1</a> of this two-part blog post discusses the issues and challenges in injection moulding and suggests using simulation software and the statistical method called Design of Experiments (DOE) to speed development and boost quality. This part presents a case study that illustrates this approach. </em></p>
Preliminary Fill and Designed Experiment
<p>This case study considers the example of a hand dispensing pump for a sanitiser bottle where the main areas of concern were warpage and the concentricity of the tube, as this had a critical impact on fit and functionality. </p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f6c68e56710c222c2a20dd002021287f/dispenser_top.png" style="line-height: 20.7999992370605px; margin: 10px 15px; float: right; width: 400px; height: 236px;" /></p>
<div>
<p>In this example, the first step was to carry out a preliminary fill, pack, cool and warp analysis to ensure that the part had no filling difficulties such as short shots or hesitation. DOE was then carried out and, since the areas of concern were warpage and concentricity, these were selected as the quality factor/responses.</p>
<div>
<p>Four control factors that affected warpage and concentricity were used to carry out the DOE: melt temperature, packing pressure, cooling time, and fill time. The factors levels are shown in the table below:</p>
<p><img alt="Taguchi DOE control factors" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/322b2d00c3b22d962ca76ac0485e437b/taguchi_doe_control_factors.png" style="width: 450px; height: 136px;" /></p>
<p>A Taguchi L9 DOE was then created using Minitab Statistical Software. It should be noted that a Taguchi DOE assumes no significant interaction between factors, which may not necessarily be true. In this case, however, it was selected to determine the relationship between the factors and responses in the shortest simulation time.</p>
<p>The Minitab worksheet below shows the process settings for the nine runs using the Taguchi L9 Design.</p>
<p><img alt="Taguchi design worksheet" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/7cbc350e2fbe466708f4b5b4a2f58566/taguchi_doe_worksheet.png" style="width: 450px; height: 169px;" /></p>
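<p>The L9 layout itself is a standard orthogonal array for up to four three-level factors, so a worksheet like the one above can be generated programmatically. In the sketch below, some level values (225°C, 15MPa, 12s, 0.1-0.3s) are taken from the text, while the rest are hypothetical placeholders for levels that appear only in the table image:</p>

```python
# Standard Taguchi L9 orthogonal array: 9 runs, four 3-level factors,
# written as level indices 0, 1, 2.
L9 = [
    (0, 0, 0, 0), (0, 1, 1, 1), (0, 2, 2, 2),
    (1, 0, 1, 2), (1, 1, 2, 0), (1, 2, 0, 1),
    (2, 0, 2, 1), (2, 1, 0, 2), (2, 2, 1, 0),
]

# Factor levels: partly from the article, partly hypothetical placeholders.
factors = {
    "melt_temp_C": [225, 240, 255],
    "pack_pressure_MPa": [5, 10, 15],
    "cool_time_s": [8, 10, 12],
    "fill_time_s": [0.1, 0.2, 0.3],
}

# Map each row of level indices to concrete process settings.
runs = [
    {name: levels[idx] for (name, levels), idx in zip(factors.items(), row)}
    for row in L9
]

print(runs[0])
# {'melt_temp_C': 225, 'pack_pressure_MPa': 5, 'cool_time_s': 8, 'fill_time_s': 0.1}
```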
<p>Moldex3D DOE was then used to perform the mathematical calculations based on the user’s specification (minimum warpage and linear shrinkage between nodes) to determine the optimum process setting.</p>
<p>From the nine different simulated runs, a main effect graph for warpage was plotted. </p>
<p><img alt="Main Effects Plor for Warpage" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/dbec7e75117c7763745e8260d78852fd/main_effects_warpage.png" style="width: 577px; height: 385px;" /></p>
<p>From this, it could be seen that increasing the packing pressure and cooling time reduced warpage, while increasing the melt temperature led to higher warpage. Using a filling time of 0.2s or 0.3s gave slightly less warpage than 0.1s. Hence, it was determined that to achieve lower warpage, the optimum process setting should be a melt temperature of 225°C, packing pressure of 15MPa, cooling time of 12s and filling time of 0.3s.</p>
<p style="line-height: 20.7999992370605px;">Taking the results obtained from Moldex3D, Minitab 17 statistical software was used to determine which of the four factors had the biggest influence on part warpage.</p>
<p style="line-height: 20.7999992370605px;"><img alt="response table for warpage" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/20e65680dd317de7add7a8559b1d50e3/response_table_warpage.png" style="width: 500px; height: 153px;" /></p>
<p style="line-height: 20.7999992370605px;">This data analysis showed that cool time had the biggest impact on part warpage, followed by packing pressure, melt temperature and then filling time. An area graph of warpage showed a quick comparison of the nine different runs, indicating that run 3 gave the least warpage.</p>
<p><img alt="area graph of warpage" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/740d75c1b4424da02ee136a673e43780/area_graph_of_warpage.png" style="width: 500px; height: 333px;" /></p>
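<p>A response table like the one above ranks factors by the range (delta) of the mean response across their levels. The sketch below reproduces that computation; the warpage numbers for the nine runs are hypothetical, constructed so the resulting ranking matches the one reported (cool time, packing pressure, melt temperature, filling time):</p>

```python
from statistics import mean

# Level indices (0, 1, 2) of the four factors in each of the 9 Taguchi
# runs, paired with a hypothetical warpage response for that run.
design = [
    ((0, 0, 0, 0), 0.41), ((0, 1, 1, 1), 0.30), ((0, 2, 2, 2), 0.20),
    ((1, 0, 1, 2), 0.36), ((1, 1, 2, 0), 0.27), ((1, 2, 0, 1), 0.34),
    ((2, 0, 2, 1), 0.32), ((2, 1, 0, 2), 0.40), ((2, 2, 1, 0), 0.31),
]
factor_names = ["melt_temp", "pack_pressure", "cool_time", "fill_time"]

# Main effect: mean response at each level of each factor;
# delta = max level mean - min level mean.
deltas = {}
for i, name in enumerate(factor_names):
    level_means = [
        mean(resp for levels, resp in design if levels[i] == lvl)
        for lvl in (0, 1, 2)
    ]
    deltas[name] = max(level_means) - min(level_means)

# Rank factors by delta, largest influence first.
ranking = sorted(deltas, key=deltas.get, reverse=True)
print(ranking)  # ['cool_time', 'pack_pressure', 'melt_temp', 'fill_time']
```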
<p>Concentricity is difficult to measure, both in real life and in simulation. In real life, the distance between different points is measured using a coordinate-measuring machine (CMM). In the Moldex3D simulation, the linear shrinkage between different nodes was measured. Eight different nodes were identified, and the linear shrinkage across the diameter of the tube was determined; the lower the linear shrinkage, the more circular the part and the better its concentricity.</p>
<p>The main effects plot below for shrinkage shows that to get better concentricity (lower linear shrinkage between the nodes), a lower melt temperature, cooling time and filling time combined with a high packing pressure were preferable.</p>
<p><img alt="Main Effects Plot for Shrinkage" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/3eb9b51b4bd8caeac5ead713a86ce90b/main_effects_shrinkage.png" style="width: 579px; height: 385px;" /></p>
<p>It had already been established that to achieve lower linear shrinkage, the optimum process setting should be melt temperature of 225°C, packing pressure of 15MPa, cooling time of 8s and filling time of 0.1s. However, a cooling time of 8s may not be practical, as the analysis of warpage shows it would give high warpage.</p>
<p>Minitab was also used to find out which of the four control factors resulted in the greatest impact on linear shrinkage.</p>
<p><img alt="Response Table for Shrinkage" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/9e0e2aca3064320d44a9860223665f48/response_table_shrinkage.png" style="width: 500px; height: 153px;" /></p>
<p>This showed that packing pressure ranked first, followed by cooling time, melt temperature and lastly filling time. Since the 8s cooling time would lead to high warpage, a compromise had to be made.</p>
<p>As mentioned earlier, for linear shrinkage the packing pressure was more of a contributing factor than the cooling time, so it makes sense to use 12s cooling time with 15MPa packing pressure. Comparing the nine different runs for linear shrinkage in an area graph showed that run six gave the lowest linear shrinkage.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/dfabcb5cb7861c6dc11cc0fdb25c2b2d/area_graph_of_shrinkage.png" style="width: 500px; height: 333px;" /></p>
<p>Based on the user specification, Moldex3D’s mathematical calculations obtained the optimised run<span style="line-height: 1.6;">. For this example, weighting for warpage was the same as for linear shrinkage. However, based on the DOE simulation results obtained, the optimum process setting for the lowest warpage was to have a cooling time of 12s and filling time of 0.3s. The optimum process for the lowest linear shrinkage, on the other hand, required a cooling time of 8s and fill time of 0.1s.</span></p>
Concluding thoughts
<p>Moldex3D simulation resulted in a compromise process setting (melt temperature of 225°C, packing pressure of 15MPa, cooling time of 12s and filling time of 0.1s), which was used as the optimum run. From the area graphs shown below, it can be seen that the optimised run 10 gives the lowest warpage compared to the other nine runs, while having low linear shrinkage.</p>
<p><img alt="optimized run - area chart" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/13c7a74c8d37f74f4acc152b676e53b6/optimized_run_area_graph_w640.png" style="width: 640px; height: 210px;" /></p>
<p>From the simulation in Moldex3D, shown below, it can be seen that part warpage and the concentricity of the tube have been significantly improved (warpage has been improved by 20-30% while linear shrinkage has been kept to 0.6-0.7%).</p>
<p><img alt="Moldex 3D simulation" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a1b9270c0e645e9db3d7c4f626308aba/moldex_3d_sim.png" style="width: 500px; height: 179px;" /></p>
<p>It is important that designers and moulders understand that numerical results in a simulation such as this provide only a relative comparison and should not be treated as absolute. This is because there are various uncontrollable factors in the actual mould shop environment—‘noise’—which cannot be re-enacted in a simulation. However, running DOE using simulation can give the engineering team a head start on identifying which control factors to focus on and the relationship those factors have with part quality.</p>
<p> </p>
<p><strong>About the guest blogger</strong></p>
<p><a href="http://uk.linkedin.com/in/jasminwongym">Jasmin Wong</a> is project engineer at UK-based <a href="http://www.plazology.co.uk/" target="_blank">Plazology</a>, which provides product design optimisation, injection moulding flow simulation, mould design, mould procurement, and moulding process validation services to global manufacturing customers. She is an MSc graduate in polymer composite science and engineering and recently gained Moldex3D Analyst Certification.</p>
<p> </p>
<p> </p>
<p><em>A version of this article originally appeared in the <a href="http://content.yudu.com/htmlReader/A3572w/IWOct14/reader.html?page=26" target="_blank">October 2014 issue of Injection World</a> magazine.</em></p>
</div>
</div>
Design of ExperimentsMon, 27 Oct 2014 11:00:00 +0000http://blog.minitab.com/blog/statistics-in-the-field/simulating-robust-processing-with-design-of-experiments2c-part-2Guest BloggerCan Regression and Statistical Software Help You Find a Great Deal on a Used Car?
http://blog.minitab.com/blog/understanding-statistics/can-regression-and-statistical-software-help-you-find-a-great-deal-on-a-used-car
<p>You need to consider many factors when you’re buying a used car. Once you narrow your choice down to a particular car model, you can get a wealth of information about individual cars on the market through the Internet. How do you navigate through it all to find the best deal? By analyzing the data you have available. </p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/710ce579b4120727bf67e8b48f5965e8/240_used_car_kovacs.jpg" style="line-height: 20.7999992370605px; border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 240px; height: 240px;" /></p>
<p>Let's look at how this works using <a href="http://blog.minitab.com/blog/understanding-statistics/we-just-got-rid-of-five-reasons-to-fear-data-analysis">the Assistant</a> in Minitab 17. With the Assistant, you can use regression analysis to calculate the expected price of a vehicle based on variables such as year, mileage, whether or not the technology package is included, and whether or not a free Carfax report is included.</p>
<p>And it's probably a lot easier than you think. </p>
<p>A search of a leading Internet auto sales site yielded data about 988 vehicles of a specific make and model. After putting the data into Minitab, we choose <strong>Assistant > Regression…</strong></p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/9e87de993a0daa39e6643b8c6d3aed9c/regression_dialog.png" style="width: 395px; height: 247px;" /></p>
<p>At this point, if you aren’t very comfortable with regression, <a href="http://www.minitab.com/products/minitab/assistant/">the Assistant makes it easy to select the right option for your analysis</a>.</p>
A Decision Tree for Selecting the Right Analysis
<p>We want to explore the relationships between the price of the vehicle and four factors, or X variables. Since we have more than one X variable, and since we're not looking to optimize a response, we want to choose Multiple Regression.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/bc802d35bfb57ca3b86e061da4fa4b09/regression_decision_tree_w640.png" style="width: 640px; height: 502px;" /></p>
<p>This <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/9ecb2280228deb621ee2db7f6fbe300e/used_cars.MTW">data set</a> includes five columns: mileage, the age of the car in years, whether or not it has a technology package, whether or not it includes a free CARFAX report, and, finally, the price of the car.</p>
<p>We don’t know which of these factors may have significant relationship to the cost of the vehicle, and we don’t know whether there are significant two-way interactions between them, or if there are quadratic (nonlinear) terms we should include—but we don’t need to. Just fill out the dialog box as shown. </p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/b93a0a755e8e73dc7f681ea4b1965749/regression_dialog_box.png" style="width: 532px; height: 382px;" /></p>
<p>Press OK and the Assistant assesses each potential model and selects the best-fitting one. It also provides a comprehensive set of reports, including a Model Building Report that details how the final model was selected and a Report Card that alerts you to potential problems with the analysis, if there are any.</p>
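<p>Under the hood, this kind of automated selection amounts to fitting a family of candidate models and comparing a fit criterion. The sketch below is a rough, simplified stand-in: it uses simulated data (not the 988-vehicle dataset) and adjusted R² as the criterion, which is not the Assistant's actual selection procedure:</p>

```python
import numpy as np

# Simulated stand-in for used-car data: price depends linearly on mileage,
# age, and a tech-package indicator. All names and numbers are invented.
rng = np.random.default_rng(1)
n = 200
mileage = rng.uniform(5_000, 80_000, n)
age = rng.integers(1, 8, n).astype(float)
tech = rng.integers(0, 2, n).astype(float)
price = (35_000 - 0.08 * mileage - 2_000 * age + 1_500 * tech
         + rng.normal(0, 800, n))

def adj_r2(X, y):
    """Fit OLS by least squares and return adjusted R-squared."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    r2 = 1 - resid.var() / y.var()
    return 1 - (1 - r2) * (len(y) - 1) / (len(y) - X1.shape[1])

# Candidate models, loosely mirroring a search over linear, interaction,
# and quadratic terms.
candidates = {
    "linear": np.column_stack([mileage, age, tech]),
    "linear + interaction": np.column_stack([mileage, age, tech, mileage * age]),
    "linear + quadratic": np.column_stack([mileage, age, tech, age ** 2]),
}
best = max(candidates, key=lambda name: adj_r2(candidates[name], price))
print(best)
```

<p>Real model-selection procedures use more careful criteria (stepwise tests, information criteria), but the shape of the search is the same: enumerate candidates, score each fit, keep the winner.</p>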
Interpreting Regression Results in Plain Language
<p>The Summary Report tells us in plain language that there is a significant relationship between the Y and X variables in this analysis, and that the factors in the final model explain 91 percent of the observed variation in price. It confirms that all of the variables we looked at are significant, and that there are significant interactions between them. </p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/746574a27bba821ffab4f77ae1a2931b/multiple_regression_summary_report_w640.png" style="width: 640px; height: 480px;" /></p>
<p>The Model Equations Report contains the final regression models, which can be used to predict the price of a used vehicle. The Assistant provides two equations: one for vehicles that include a free CARFAX report, and one for vehicles that do not.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/58598060212558634d62d75a7045bf0b/regression_equation_w640.png" style="width: 640px; height: 186px;" /></p>
<p>We can tell several interesting things about the price of this vehicle model by reading the equations. First, the average cost for vehicles with a free CARFAX report is about $200 more than the average for vehicles with a paid report ($30,546 vs. $30,354). This could be because these cars probably have a clean report (if not, the sellers probably wouldn’t provide it for free).</p>
<p>Second, each additional mile added to the car decreases its expected price by roughly 8 cents, while each year added to the car's age decreases the expected price by $2,357.</p>
<p>The technology package adds, on average, $1,105 to the price of vehicles that have a free CARFAX report, but the package adds $2,774 to vehicles with a paid CARFAX report. Perhaps the sellers of these vehicles hope to use the appeal of the technology package to compensate for some other influence on the asking price. </p>
Residuals versus Fitted Values
<p>While these findings are interesting, our goal is to find the car that offers the best value. In other words, we want to find the car that has the largest difference between the asking price and the expected asking price predicted by the regression analysis.</p>
<p>For that, we can look at the Assistant’s Diagnostic Report. The report presents a chart of Residuals vs. Fitted Values. If we see obvious patterns in this chart, it can indicate problems with the analysis. In that respect, this chart of Residuals vs. Fitted Values looks fine, but now we’re going to use the chart to identify the best value on the market.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/d55ae8720ba281bf37135b68b2069434/multiple_regression_diagnostic_report_w640.png" style="width: 640px; height: 480px;" /></p>
<p>In this analysis, the “Fitted Values” are the prices predicted by the regression model. “Residuals” are what you get when you subtract the predicted asking price from the actual asking price—exactly the information you’re looking for! The Assistant marks large residuals in red, making them very easy to find. And three of those residuals—which appear in light blue above because we’ve selected them—appear to be very far below the asking price predicted by the regression analysis.</p>
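<p>Finding the bargains is then a short calculation on the residuals: compute observed minus fitted and flag the points that fall far below zero. The asking and predicted prices below are invented for illustration:</p>

```python
import numpy as np

# Sketch: flag "bargain" cars whose asking price sits far below the price a
# fitted model predicts. All prices here are made up for illustration.
asking    = np.array([30_500, 28_900, 31_200, 24_100, 30_800, 23_500])
predicted = np.array([30_300, 29_100, 31_000, 29_800, 30_600, 29_900])

residuals = asking - predicted          # observed minus fitted
cutoff = residuals.std() * 1.5          # simple large-residual threshold
bargains = np.where(residuals < -cutoff)[0]
print(bargains, residuals[bargains])
```

<p>The flagged rows are the candidates worth a second look online, just as the rows highlighted on the Residuals vs. Fitted Values chart were.</p>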
<p>Selecting these data points on the graph reveals that these are vehicles whose data appears in rows 357, 359, and 934 of the data sheet. Now we can revisit those vehicles online to see if one of them is the right vehicle to purchase, or if there’s something undesirable that explains the low asking price. </p>
<p>Sure enough, the records for those vehicles reveal that two of them have severe collision damage.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/5dbbf5aa405d4b2d53ec720657a09556/vehicles.jpg" style="width: 320px; height: 356px;" /></p>
<p>But the remaining vehicle appears to be in pristine condition, and is several thousand dollars less than the price you’d expect to pay, based on this analysis!</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/640bd720a3d1f8b04713aa0ec321a570/nice_car.png" style="width: 254px; height: 189px;" /></p>
<p>With the power of regression analysis and the Assistant, we’ve found a great used car—at a price you know is a real bargain.</p>
<p> </p>
Fun StatisticsRegression AnalysisStatisticsStatistics HelpWed, 22 Oct 2014 12:00:00 +0000http://blog.minitab.com/blog/understanding-statistics/can-regression-and-statistical-software-help-you-find-a-great-deal-on-a-used-carEston MartzUsing Data Analysis to Maximize Webinar Attendance
http://blog.minitab.com/blog/michelle-paret/using-data-analysis-to-maximize-webinar-attendance
<p>We like to host webinars, and our customers and prospects like to attend them. But when our webinar vendor moved from a pay-per-person pricing model to a pay-per-webinar pricing model, we wanted to find out how to maximize registrations and thereby minimize our costs.<img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/8a6733d3b0516b7f1c7ad80ea753d430/mtbnewspromos_w640.jpeg" style="width: 400px; height: 273px; float: right; border-width: 1px; border-style: solid; margin: 10px 15px;" /></p>
<p>We collected webinar data on the following variables:</p>
<ul>
<li>Webinar topic</li>
<li>Day of week</li>
<li>Time of day – 11 a.m. or 2 p.m.</li>
<li>Newsletter promotion – no promotion, newsletter article, newsletter sidebar</li>
<li>Number of registrants</li>
<li>Number of attendees</li>
</ul>
<p>Once we'd collected our data, it was time to analyze it and answer some key questions using <a href="http://www.minitab.com/products/minitab/">Minitab Statistical Software</a>.</p>
Should we use registrant or attendee counts for the analysis?
<strong><span style="line-height: 16.8666667938232px; font-family: Calibri, sans-serif; font-size: 11pt;"><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/4d9fa1e3c73606627d2ca1ec34b620e2/scatterplot_w640.jpeg" style="width: 300px; height: 197px; margin: 10px 15px; float: left;" /></span></strong>
<p>First we needed to decide what we would use to measure our results: the number of people who signed up, or the number of people who actually attended the webinar. This question really boils down to answering the question, “Can I trust my data?”</p>
<p>Our data collection system for webinar registrants is much more accurate than our data collection system for webinar attendees. This is due to customers' varying willingness to share contact information, in addition to the automated database processes that connect our webinar vendor's data with our own database. So, for a period of time, I manually collected the attendee data directly from our webinar vendor to see how it correlated with the easily-accessible and accurate registration data. The scatterplot above shows the results.</p>
<p>With a <a href="http://blog.minitab.com/blog/understanding-statistics/no-matter-how-strong-correlation-still-doesnt-imply-causation">correlation coefficient </a>of 0.929 and a p-value of 0.000, there was a strong positive linear relationship between the registrations and attendee counts. If registrations are high, then attendance is also high. If registrations are low, then attendance is also low. I concluded that I could use the registration data—which is both easily accessible and extremely reliable—to conduct my analysis.</p>
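<p>That correlation check is straightforward to script. The registrant and attendee counts below are hypothetical, but the calculation is the same one behind the scatterplot:</p>

```python
import numpy as np

# Hypothetical registrant/attendee counts for a handful of webinars,
# illustrating the proxy-measure check described above.
registrants = np.array([120, 85, 200, 150, 60, 175, 95, 140])
attendees   = np.array([ 48, 30,  85,  61, 22,  70, 40,  55])

# Pearson correlation coefficient between the two counts.
r = np.corrcoef(registrants, attendees)[0, 1]
print(round(r, 3))
```

<p>A correlation this close to 1 is what justifies analyzing the reliable registration counts in place of the harder-to-collect attendance counts.</p>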
Should we consider data for the last 6 years?
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/5e73f48b852c7afc17762f28bf8887cf/i_mr_chart_of_registrants_w640.jpeg" style="width: 400px; height: 263px; margin: 10px 15px; float: left;" />We’ve been collecting webinar data for 6 years, but that doesn’t mean we can treat the last 6 years of data as one homogeneous population.</p>
<p>A lot can change in a 6-year time period. Perhaps there was a change in the webinar process that affected registrations. To determine whether or not I should use all of the data, I used an Individuals and Moving Range (I-MR, also referred to as X-MR) <a href="http://blog.minitab.com/blog/understanding-statistics/how-create-and-read-an-i-mr-control-chart">control chart</a> to evaluate the process stability of webinar registrations over time.</p>
<p>The graph revealed a single point on the MR chart that flagged as out-of-control. I looked more closely at this point and verified that the data was accurate and that this webinar belonged with the larger population. Based on this information, I decided to proceed with analyzing all 6 years of data together. (Note there is some clustering of points due to promotions, but again the goal here was to determine if we could use data over a 6-year time period.)</p>
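<p>The I-MR chart's limits are simple arithmetic: the individuals limits are x-bar ± 2.66 × MR-bar, and the moving-range UCL is 3.267 × MR-bar (the standard constants for moving ranges of length 2). A sketch with invented registration counts:</p>

```python
import numpy as np

# Minimal I-MR computation. The registration counts below are invented;
# the 2.66 and 3.267 constants are the usual values for n=2 moving ranges.
x = np.array([132.0, 128, 145, 139, 122, 150, 131, 127, 210, 138, 129, 141])

mr = np.abs(np.diff(x))                 # moving ranges of consecutive points
mr_bar = mr.mean()
i_ucl = x.mean() + 2.66 * mr_bar        # individuals chart limits
i_lcl = x.mean() - 2.66 * mr_bar
mr_ucl = 3.267 * mr_bar                 # moving-range chart upper limit

out_i = np.where((x > i_ucl) | (x < i_lcl))[0]
out_mr = np.where(mr > mr_ucl)[0]
print(out_i, out_mr)
```

<p>Any index returned by either check is an out-of-control signal worth investigating, just as the single flagged moving range was investigated above.</p>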
What variables impact registrations?
<p>I performed an ANOVA using Minitab's General Linear Model tool to find out which factors—topic, day of week, time of day, or newsletter promotion—significantly affect webinar registrations.<img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/3758d3d03a604bab9921ad9f94663dc8/main_effects_plot_for_registrants_w640.jpeg" style="width: 400px; height: 263px; float: right; margin: 10px 15px;" /></p>
<p>The ANOVA results revealed that the day of week, time of day, and webinar topic <em>do not</em> affect webinar registrations, but the newsletter promotion type <em>does</em> (p-value = 0.000).</p>
<p>So which webinar promotion type maximizes webinar registrations?</p>
<p>Using Minitab to conduct <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/keep-that-special-someone-happy-when-you-perform-multiple-comparisons">Tukey comparisons</a>, we can see that registrations for webinars promoted in the newsletter sidebar space were not significantly different from webinars that weren't promoted at all.</p>
<p>However, webinars that were promoted in the newsletter <em>article </em>space resulted in significantly more registrations than both the sidebar promotions and no promotions.</p>
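<p>The underlying one-way ANOVA can be worked by hand. The registration counts below are hypothetical, and this sketch isolates promotion type only, whereas the General Linear Model above fit all four factors at once:</p>

```python
# One-way ANOVA by hand on hypothetical registration counts for the three
# promotion types (invented numbers, shaped to echo the article's finding).
groups = {
    "none":    [95, 110, 88, 102, 97],
    "sidebar": [105, 99, 112, 94, 108],
    "article": [160, 172, 155, 168, 149],
}

k = len(groups)
n = sum(len(v) for v in groups.values())
grand = sum(sum(v) for v in groups.values()) / n

# Between-group and within-group sums of squares.
ss_between = sum(len(v) * (sum(v) / len(v) - grand) ** 2
                 for v in groups.values())
ss_within = sum(sum((x - sum(v) / len(v)) ** 2 for x in v)
                for v in groups.values())

# F statistic: between-group mean square over within-group mean square.
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print(round(f_stat, 1))
```

<p>A large F statistic relative to its critical value is what a small p-value like 0.000 reports; the Tukey step then asks which specific pairs of group means differ.</p>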
<p>From this analysis, we concluded that we still had the flexibility to offer webinars at various times and days of the week, and we could continue to vary webinar topics based on customer demand and other factors. To maximize webinar attendance and minimize webinar cost, we needed to focus our efforts on promoting the webinars in our newsletter, utilizing the article space.</p>
<p>But over the past year, we’ve started to actively promote our webinars via other channels as well, so next up is some more data analysis—using Minitab—to figure out what marketing channels provide the best results…</p>
Data AnalysisHypothesis TestingRegression AnalysisStatisticsFri, 17 Oct 2014 12:00:00 +0000http://blog.minitab.com/blog/michelle-paret/using-data-analysis-to-maximize-webinar-attendanceMichelle ParetHow Important Are Normal Residuals in Regression Analysis?
http://blog.minitab.com/blog/adventures-in-statistics/how-important-are-normal-residuals-in-regression-analysis
<p>I’ve written about the importance of <a href="http://blog.minitab.com/blog/adventures-in-statistics/why-you-need-to-check-your-residual-plots-for-regression-analysis" target="_blank">checking your residual plots</a> when performing linear regression analysis. If you don’t satisfy the assumptions for an analysis, you might not be able to trust the results. One of the assumptions for regression analysis is that the residuals are normally distributed. Typically, you assess this assumption using the normal probability plot of the residuals.</p>
<div style="float: right; width: 250px; margin: 15px 0px 15px 15px;"><img alt="Normal Probability Plot showing residuals that are not distributed normally" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/d84cbe3e157257e1ba07563dacdacbd7/nonnormal_residuals.png" title="Are these nonnormal residuals bad?" width="250" /> <em>Are these nonnormal residuals a problem?</em></div>
<p>If you have nonnormal residuals, can you trust the results of the regression analysis?</p>
<p>Answering this question highlights some of the research that Rob Kelly, a senior statistician here at Minitab, was tasked with in order to guide the development of our <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">statistical software</a>.</p>
Simulation Study Details
<p>The goals of the simulation study were to:</p>
<ul>
<li>determine whether nonnormal residuals affect the error rate of the F-tests for regression analysis</li>
<li>generate a safe, minimum sample size recommendation for nonnormal residuals</li>
</ul>
<p>For simple regression, the study assessed both the overall F-test (for both linear and quadratic models) and the F-test specifically for the highest-order term.</p>
<p>For multiple regression, the study assessed the overall F-test for three models that involved five continuous predictors:</p>
<ul>
<li>a linear model with all five X variables</li>
<li>all linear and square terms</li>
<li>all linear terms and seven of the 2-way interactions</li>
</ul>
<p>The residual distributions included skewed, heavy-tailed, and light-tailed distributions that depart substantially from the normal distribution.</p>
<p>There were 10,000 tests for each condition. The study determined whether the tests incorrectly rejected the null hypothesis more often or less often than expected for the different nonnormal distributions. If the test performs well, the Type I error rates should be very close to the target significance level.</p>
Results and Sample Size Guideline
<p>The study found that a sample size of at least 15 was important for both simple and multiple regression. If you meet this guideline, the test results are usually reliable for any of the nonnormal distributions.</p>
<p>In simple regression, the observed Type I error rates are all between 0.0380 and 0.0529, very close to the target significance level of 0.05.</p>
<p>In multiple regression, the Type I error rates are all between 0.08820 and 0.11850, close to the target of 0.10.</p>
Closing Thoughts
<p>The good news is that if you have at least 15 samples, the test results are reliable even when the residuals depart substantially from the normal distribution.</p>
<p>However, there is a caveat if you are using regression analysis to generate predictions. <a href="http://blog.minitab.com/blog/adventures-in-statistics/when-should-i-use-confidence-intervals-prediction-intervals-and-tolerance-intervals" target="_blank">Prediction intervals</a> are calculated based on the assumption that the residuals are normally distributed. If the residuals are nonnormal, the prediction intervals may be inaccurate.</p>
<p>This research guided the implementation of regression features in the <a href="http://www.minitab.com/en-us/products/minitab/assistant/" target="_blank">Assistant menu</a>. The Assistant is your interactive guide to choosing the right tool, analyzing data correctly, and interpreting the results. Because the regression tests perform well with relatively small samples, the Assistant does not test the residuals for normality. Instead, the Assistant checks the size of the sample and indicates when the sample is less than 15.</p>
<p><a href="http://blog.minitab.com/blog/adventures-in-statistics/multiple-regression-analysis-and-response-optimization-examples-using-the-assistant-in-minitab-17" target="_blank">See a multiple regression example that uses the Assistant.</a></p>
<p>You can read the full study results in the <a href="http://support.minitab.com/en-us/minitab/17/Assistant_Simple_Regression.pdf" target="_blank">simple regression white paper</a> and the <a href="http://support.minitab.com/en-us/minitab/17/Assistant_Multiple_Regression.pdf" target="_blank">multiple regression white paper</a>. You can also peruse all of our <a href="http://support.minitab.com/en-us/minitab/17/technical-papers/" target="_blank">technical white papers</a> to see the research we conduct to develop methodology throughout the Assistant and Minitab.</p>
Regression AnalysisThu, 16 Oct 2014 12:00:00 +0000http://blog.minitab.com/blog/adventures-in-statistics/how-important-are-normal-residuals-in-regression-analysisJim FrostThe Ghost Pattern: A Haunting Cautionary Tale about Moving Averages
http://blog.minitab.com/blog/understanding-statistics/the-ghost-pattern-a-haunting-cautionary-tale-about-moving-averages
<p>Halloween's right around the corner, so here's a scary thought for the statistically minded: That pattern in your time series plot? Maybe it's just a ghost. <em>It might not really be there at all.</em> </p>
<p>That's right. The trend that seems so evident might be a phantom. Or, if you don't believe in that sort of thing, chalk it up to the brain's desire to impose order on what we see, even when it doesn't exist. </p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/336bc5b657c980e1c2769192a4757fa9/ghosts.png" style="line-height: 20.7999992370605px; margin: 10px 15px; float: right; width: 200px; height: 200px;" /></p>
<p>I'm going to demonstrate this with Minitab Statistical Software (get the free 30-day <a href="http://it.minitab.com/products/minitab/free-trial.aspx">trial version</a> and play along, if you don't already use it). And if things get scary, just keep telling yourself "It's only a simulation. It's only a simulation."</p>
<p>But remember the ghost pattern when we're done. It's a great reminder of how important it is to make sure that you've interpreted your data properly, and looked at all the factors that might influence your analysis—including the quirks inherent in the statistical methods you used. </p>
Plotting Random Data from a 20-Sided Die
<p>We're going to need some random data, which we can get Minitab to generate for us. In many role-playing games, players use a 20-sided die to determine the outcome of battles with horrible monsters, so in keeping with the Halloween theme we'll simulate 500 consecutive rolls with a 20-sided die. Choose <strong>Calc > Random Data > Integer...</strong> and have Minitab generate 500 rows of random integers between 1 and 20. </p>
<p>Now go to <strong>Graph > Time Series Plot...</strong> and select the column of random integers. Minitab creates a graph that will look something like this: </p>
<p><img alt="Time Series Plot of 500 Twenty-Sided Die Rolls" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/bc2a4c9bf05e4a61103451fa6e6f8342/20_sided_die_time_series_plot.png" style="width: 577px; height: 386px;" /></p>
<p>It looks like there could be a pattern, one that looks a little bit like a sine wave...but it's hard to see, since there's a lot of variation in consecutive points. In this situation, many analysts will use a technique called the Moving Average to filter the data. The idea is to <span style="line-height: 20.7999992370605px;">smooth out the natural variation in the data </span><span style="line-height: 1.6;">by looking at the <em>average </em>of several consecutive data points, thus enabling a pattern to reveal itself. It's the statistical equivalent of applying a noise filter to eliminate hiss on an audio recording. </span></p>
<p>A moving average can be calculated based on the average of as few as 2 data points, but this depends on the size and nature of your data set. We're going to calculate the moving average of every 5 numbers. Choose <strong>Stat > Time Series > Moving Average...</strong> Enter the column of integers as the Variable, and enter 5 as the MA length. Then click "Storage" and have Minitab store the calculated averages in a new data column. </p>
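<p>If you prefer to script the experiment, the same two steps (random rolls, then a length-5 moving average) take a few lines of Python, with numpy standing in for Minitab's Random Data and Moving Average menus:</p>

```python
import numpy as np

# 500 simulated rolls of a 20-sided die, then a length-5 moving average.
rng = np.random.default_rng(13)
rolls = rng.integers(1, 21, size=500)   # integers 1 through 20

window = 5
ma = np.convolve(rolls, np.ones(window) / window, mode="valid")
print(len(ma), round(float(ma.mean()), 1))
```

<p>Plotting <code>ma</code> instead of <code>rolls</code> reproduces the smoothing shown below: the point-to-point noise shrinks, and apparent waves emerge.</p>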
<p>Now create a new time series plot using the moving averages:</p>
<p><img alt="moving average time series plot" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f93eff7bceb62bd5da113de356afcd8e/moving_average_time_series_plot.png" style="width: 576px; height: 384px;" /></p>
<p>You can see how some of the "noise" from point-to-point variation has been reduced, and it does look like there could, just possibly, be a pattern there.</p>
Can Moving Averages Predict the Future?
<p>Of course, a primary reason for doing a time series analysis is to forecast the next item (or several) in the series. Let's see if we might predict the next moving average of the die by knowing the current moving average. </p>
<p>Select <strong>Stat > Time Series > Lag</strong>. In the dialog box, choose the "moving averages" column as the series to lag. We'll use this dialog to create a new column of data that places each moving average down 1 row in the column and inserts missing value symbols, *, at the top of the column.</p>
<p>Now we can create a <a href="http://blog.minitab.com/blog/understanding-statistics/using-statistics-software-and-graphs-to-quickly-explore-relationships-between-variables">simple scatterplot</a> that will show if there's a correlation between the observed moving average and the next one. </p>
<p><img alt="Scatterplot of Current and Next Moving Averages" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/78607f90333600cdeb6eeba721c62ee7/scatterplot_of_moving_averages.png" style="width: 578px; height: 386px;" /></p>
<p>Clearly, there's a positive correlation between the current moving average and the next, which means we <em>can </em>use the current moving average to predict the next one. </p>
<p>But wait a minute...this is <em>random data!</em> By definition, you <em>can't</em> predict random data, so how can there be a correlation? This is getting kind of creepy...it's like there's some kind of ghost in this data. </p>
<p>Zoinks! What would Scooby Doo make of all this? </p>
Debunking the "Ghost" with the Slutsky-Yule Effect
<p>Don't panic&mdash;there's a perfectly rational explanation for what we're seeing here. It's called the Slutsky-Yule Effect, which says that a smoothed time series (like a moving average) can <em>look like </em>patterned data, even if there's no relationship among the underlying data points. </p>
<p>So there's no ghost in our random data; instead, we're seeing a sort of statistical illusion. Using the moving average can make it seem like a pattern or relationship exists, but that apparent pattern could be a side effect of the tool, and not an indication of a real pattern. </p>
<p>Does this mean you shouldn't use moving averages to look at your data? No! It's a very valuable and useful technique. However, using it carelessly could get you into trouble. And if you're basing a major decision solely on moving averages, you might want to try some alternate approaches, too. Mikel Harry, one of the originators of Six Sigma, has a <a href="http://drmikelharry.wordpress.com/2014/04/08/beware-the-moving-average/">great blog post</a> that presents a workplace example of how far apart reality and moving averages can be. </p>
<p>So just remember the Slutsky-Yule Effect when you're analyzing data in the dead of night, and your moving average chart shows something frightening. <span style="line-height: 20.7999992370605px;">Shed some more light on the subject with follow-up analysis and you might find there's nothing to fear at </span><span style="line-height: 1.6;">all. </span></p>
Data AnalysisFun StatisticsStatisticsStatsMon, 13 Oct 2014 12:00:00 +0000http://blog.minitab.com/blog/understanding-statistics/the-ghost-pattern-a-haunting-cautionary-tale-about-moving-averagesEston MartzUsing Before/After Control Charts to Assess a Car’s Gas Mileage
http://blog.minitab.com/blog/understanding-statistics/using-before-and-after-control-charts-to-assess-a-care28099s-gas-mileage
<p>Keeping your vehicle fueled up is expensive. Maximizing the miles you get per gallon of fuel saves money and helps the environment, too. </p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/05b215659b2ef9b8a0e478c92e2dd932/car_dash_200.jpg" style="line-height: 20.7999992370605px; border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 200px; height: 200px;" /></p>
<p>But knowing if you're getting good mileage requires some data analysis, which gives us a good opportunity to apply one of the common tools used in Six Sigma, the I-MR (individuals and moving range) control chart, to daily life. </p>
Finding Trends or Unusual Variation
<p>Looking at your vehicle’s MPG data lets you see if your mileage is holding steady, declining, or rising over time. This data can also reveal unusual variation that might indicate a problem you need to fix.</p>
<p>Here's a simulated <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/12e461add9f92cb704d405aec09dd4be/mileage.MTW">data set</a> that collects 3 years’ worth of gas mileage records for a car that should get an average of 20 miles per gallon, according to the manufacturer’s estimates. However, the owner didn’t do any vehicle maintenance for the first two years he owned the car. This year, though, he’s diligently performed recommended maintenance.</p>
<p>How does his mileage measure up? And has his attention to maintenance in the past 12 months affected his car’s fuel economy? Let’s find out with the Assistant in Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a>.</p>
Creating a Control Chart that Accounts for Process Changes
<p>To create the most meaningful chart, we need to recall that a major change in how the vehicle is handled took place during the time the data were collected. The owner bought the car three years ago, but he’s only done the recommended maintenance in the last year.</p>
<p>Since the data were collected both before and after this change, we want to account for it in the analysis.</p>
<p>The easiest way to handle this is to choose <strong>Assistant > Before/After Control Charts…</strong> to create a chart that makes it easy to see how the change affected both the mean and variance in the process.</p>
<p>If you're following along with Minitab, the Maint column in the worksheet notes which MPG measurements were taken before and after the owner started paying attention to maintenance. Complete the Before/After I-MR Chart dialog box as shown below:</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/b5500c9339bfbb45f5baa07cfd455943/before_after_i_mr_chart_dialog.png" style="width: 498px; height: 376px;" /></p>
Interpreting the Results of Your Data Analysis
<p>After you press OK, the Assistant produces a Diagnostic Report with detailed information about the analysis, as well as a Report Card, which provides guidance on how to interpret the results and flags potential problems. In this case, there are no concerns with the <a href="http://blog.minitab.com/blog/real-world-quality-improvement/quality-improvement-in-healthcare3a-showing-if-process-changes-actually-improve-the-patient-experience">process mean and variation</a>.</p>
<p>The Assistant's Summary Report gives you the bottom-line results of the analysis.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/fbe653d819f1baf7531202ab1ed32212/before_after_i_mr_chart_summary_report_w640.png" style="width: 640px; height: 473px;" /></p>
<p>The Moving Range chart in the lower portion of the graph shows that while the upper and lower control limits have shifted, the difference in variation before and after the change is not statistically significant. </p>
<p>However, the car’s mean mileage, which is shown in the Individual Value chart displayed at the top of the graph, <em>has </em>seen a statistically significant change, moving from 19.12 MPG to just under 21 MPG. </p>
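<p>The Assistant computes the limits and runs the formal hypothesis tests for you. As a rough illustration of what an individuals chart does, here is a sketch using hypothetical MPG values (not the worksheet data) and the standard I-MR constant 2.66 (3 divided by the d2 value of 1.128 for moving ranges of size 2):</p>

```python
import statistics

def imr_limits(data):
    """Individuals-chart center line and control limits:
    mean +/- 2.66 * average moving range."""
    moving_ranges = [abs(b - a) for a, b in zip(data, data[1:])]
    mr_bar = statistics.mean(moving_ranges)
    center = statistics.mean(data)
    return center - 2.66 * mr_bar, center, center + 2.66 * mr_bar

# Hypothetical MPG readings before and after regular maintenance
before = [19.4, 18.8, 19.1, 19.6, 18.5, 19.3]
after = [21.0, 20.7, 21.3, 20.5, 21.1, 20.9]

print("before:", imr_limits(before))
print("after:", imr_limits(after))
```

<p>Computing separate limits for the two stages mirrors what the Before/After chart displays, though the Assistant additionally tests whether the shifts in mean and variation are statistically significant.</p>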
Easy Creation of Control Charts
<p>Control charts have been used in statistical process control for decades, and they are among the most commonly used tools in statistical software packages. The Assistant makes it particularly easy for anyone to create a control chart, see whether or not a process is within control limits, confirm that observation statistically, and see whether a change in the process results in a change in the process outcome or variation.</p>
<p>As for the data we used in this example, whether or not a 2 mile-per-gallon increase in fuel economy is practically as well as statistically significant could be debated. But since the price of fuel rarely falls, we recommend that the owner of this vehicle continue to keep it tuned up!</p>
Data AnalysisFun StatisticsQuality ImprovementStatisticsFri, 26 Sep 2014 12:21:03 +0000http://blog.minitab.com/blog/understanding-statistics/using-before-and-after-control-charts-to-assess-a-care28099s-gas-mileageEston MartzNot Getting a No-Hitter? Statistically Speaking, the Best Bet Ever
http://blog.minitab.com/blog/the-statistics-game/not-getting-a-no-hitter-statistically-speaking2c-the-best-bet-ever
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/ca5dc4e25f623c98b4c0ab10d4eeba50/money_w640.png" style="width: 325px; height: 217px; float: right; margin: 10px 15px;" />The no-hitter is one of the most impressive feats in baseball. It’s no easy task to face at least 27 batters without letting one of them get a hit. So naturally, no-hitters don’t occur very often. In fact, since 1900 there have been an average of only about 2 no-hitters per year.</p>
<p>But what if you had the opportunity to bet that one <em>wouldn’t </em>occur?</p>
<p>That’s exactly what happened to sportswriter C. Trent Rosecrans. He had a friend who kept insisting the Reds would be no-hit this season. And with 24 games left in the season, the friend put his money where his mouth was, betting Mr. Rosecrans <a href="http://www.cincinnati.com/story/redsblog/2014/09/17/bar091714/15767373/">$5 that the Reds would be no-hit</a> by the end of the year.</p>
<p>Even if the Reds <em>do </em>have one of the worst hitting percentages in baseball, would you take the bet that in 24 games there won’t be an event that occurs only twice in an entire year?</p>
<p>Sounds like a no-brainer.</p>
Calculating the odds
<p>Back in 2012, I <a href="http://blog.minitab.com/blog/the-statistics-game/the-odds-of-throwing-a-perfect-game">calculated that the odds of throwing a no-hitter</a> were approximately 1 in 1,548. If you update that number to include all the games and no-hitters that have occurred since 2012, the odds become 1 in 1,562. The numbers are very similar, but we’ll use the latter since it incorporates more data.</p>
<p>So there is a 99.936% chance that a no-hitter does not occur in any single game. But the bet was that it wouldn’t occur in 24 games. What are Mr. Rosecrans' chances of winning the bet?</p>
<p align="center"><strong>24 games without a no-hitter</strong> = .99936^24 = .98475 = approximately <strong>98.475%</strong></p>
<p>I wish <em>I</em> could make bets with a winning percentage that was that high! For Mr. Rosecrans, 98.475% of the time he’ll win $5, and 1.525% of the time he’ll lose $5. For his friend, the opposite is true. We can use these numbers to calculate the expected value for each side of the bet.</p>
<p align="center">Reds don’t get no-hit: (0.98475*5) – (0.01525*5) = <strong>$4.85</strong></p>
<p align="center">Reds get no-hit: (.01525*5) – (0.98475*5) = <span style="color:#FF0000;"><strong>-$4.85</strong></span></p>
Making it a fair bet
<p>Obviously this was just a friendly wager and was not meant to be taken too seriously. If Mr. Rosecrans regularly made bets with expected values close to $5 with all of his friends, he probably wouldn’t have many left. But what if he wanted to be a <em>nice </em>friend? How much money should he have offered in return to make it a fair bet? We’ll simply set the expected value to 0 and solve for the amount of money he’d lose the 1.525% of the time the Reds were no-hit.</p>
<p align="center">0 = (0.98475*5) – (0.01525*X)</p>
<p align="center">0.01525*X = 4.92375</p>
<p align="center">X = $322.87</p>
<p>To make the bet fair, Mr. Rosecrans should offer to pay his friend $322.87 if the Reds get no-hit. And earlier this week the Reds didn’t get their first hit until the 8th inning. Imagine sweating out <em>that </em>game if you had over $300 on the line!</p>
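<p>The whole calculation fits in a few lines. A sketch (the 1-in-1,562 per-game odds come from the post; tiny rounding differences from the hand arithmetic above are expected):</p>

```python
p_no_hitter = 1 / 1562   # per-game chance of a no-hitter (from the post)

def p_clean_stretch(games):
    """Chance that no no-hitter occurs in a stretch of games."""
    return (1 - p_no_hitter) ** games

p_win = p_clean_stretch(24)        # Mr. Rosecrans wins if 24 games pass
ev = p_win * 5 - (1 - p_win) * 5   # expected value of the even-money $5 bet

# Payout that makes the bet fair: set the expected value to zero and solve
fair_payout = p_win * 5 / (1 - p_win)

print(round(p_win, 5), round(ev, 2), round(fair_payout, 2))
print(round(p_clean_stretch(162), 4))   # about 0.90 for a full season
```

<p>The same function handles the season-long version of the bet by plugging in 162 games instead of 24.</p>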
Adjusting for the Reds
<p>One of the reasons the friend bet on the Reds to be no-hit was that they are one of the worst-hitting teams in their league. Their batting average of 0.238 is ranked 28th in baseball. That means, on average, a Reds batter <em>won’t</em> get a hit 76.2% of the time. So if a pitcher wanted to no-hit the Reds, they would need to face at least 27 batters who didn’t get a hit.</p>
<p align="center"><strong>Probability of having 27 straight batters not have a hit</strong> = 0.762^27 = 0.00065 = <strong>approx. 1 in 1,539</strong></p>
<p>But remember, just because a batter doesn’t get a hit does not mean they’re out. They can get walked, hit by a pitch, or reach on an error. Unless they pitch a perfect game, the pitcher will face more than 27 batters. Let’s look at how the probability changes as we increase the number of Reds batters that the pitcher must face without allowing a hit.</p>
<p align="center"><strong>Probability of having 28 straight batters not have a hit</strong> = 0.762^28 = <strong>approx. 1 in 2,020</strong></p>
<p align="center"><strong>Probability of having 29 straight batters not have a hit</strong> = 0.762^29 = <strong>approx. 1 in 2,650</strong></p>
<p align="center"><strong>Probability of having 30 straight batters not have a hit</strong> = 0.762^30 = <strong>approx. 1 in 3,478</strong></p>
<p align="center"><strong>Probability of having 31 straight batters not have a hit</strong> = 0.762^31 = <strong>approx. 1 in 4,565</strong></p>
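<p>The streak probabilities above all follow one formula, p<sup>n</sup> with p = 0.762. A quick sketch that generates the whole table:</p>

```python
p_out = 1 - 0.238   # chance an average Reds batter does not get a hit

# Odds of a hitless streak for each plausible batter count
for batters in range(27, 32):
    streak = p_out ** batters           # probability of the full streak
    print(batters, "batters: 1 in", round(1 / streak))
```

<p>The printed odds lengthen quickly as the batter count grows; small rounding differences from the figures above are possible because the hand calculations round the probability first.</p>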
<p>This was <em>supposed</em> to show that because they are a poor-hitting team, the Reds have a better chance of being no-hit than the average used above. But as you can see, that’s not the case at all. Despite being one of the worst-hitting teams in the league, it appears that it’s <em>harder</em> to no-hit the Reds than the historical average.</p>
<p>Things get even odder when you consider that the average batting average (according to <a href="http://www.baseball-reference.com/leagues/MLB/bat.shtml">Baseball-Reference.com</a>) is 0.263. Using that number, the odds of having 27 straight batters not have a hit is 1 in 3,788. And those odds drop as you increase the number of batters the pitcher has to face. Applying this probability to the number of games played since 1900, we would expect there to be fewer than 100 no-hitters. And how many have there been? <em>241</em>!</p>
<p>This is the same conundrum I encountered when finding <a href="http://blog.minitab.com/blog/the-statistics-game/the-odds-of-throwing-a-perfect-game-part-ii">the odds of throwing a perfect game</a>. The number of perfect games and no-hitters that have occurred is <em>much higher</em> than what we would expect based on historical batting statistics. One explanation could be pitching from the wind-up vs. the stretch. With no runners on base (which is always the case in a perfect game and often the case in a no-hitter), the pitcher can always throw from the wind-up. Assuming pitchers are better when pitching from the wind-up, this would result in a lower batting average than normal, thus explaining the higher number of perfect games and no-hitters. This would make for a great analysis using Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a>, but since we can’t separate the data on hand into at bats facing pitchers throwing from the stretch vs. the wind-up, we can't test the theory.</p>
<p>Since the Reds have a batting average .025 points lower than the historical average, it’s probably safe to assume they do in fact have a greater chance of being no-hit. The problem is that it’s nearly impossible to quantify how much greater!</p>
Looking ahead to next year
<p>With the season almost over, it’s unlikely the Reds will be no-hit this year. But what if the two friends decided to make the bet again next year, this time at the start of the season? Let’s use our original probability of throwing a no-hitter (the one we’ve observed) and determine the odds that the Reds go 162 games getting at least one hit per game.</p>
<p align="center"><strong>162 games without a no-hitter</strong> = .99936^162 = .9015 = approximately <strong>90.15%</strong></p>
<p>The probability of the Reds getting no-hit is still pretty low, but it’s a lot better than the current bet. I just hope next year the friend gets some better odds than even money!</p>
Data AnalysisFun StatisticsStatistics in the NewsFri, 19 Sep 2014 13:35:15 +0000http://blog.minitab.com/blog/the-statistics-game/not-getting-a-no-hitter-statistically-speaking2c-the-best-bet-everKevin Rudy