Data Analysis Software | Minitab
Blog posts and articles with tips for using statistical software to analyze data for quality improvement.
http://blog.minitab.com/blog/data-analysis-software/rss
Sat, 01 Nov 2014 08:03:21 +0000
Comparing the College Football Playoff Top 25 and the Preseason AP Poll
http://blog.minitab.com/blog/the-statistics-game/comparing-the-college-football-playoff-top-25-and-the-preseason-ap-poll
<p>The college football playoff committee waited until the end of October to release their first top 25 rankings. One of the reasons for waiting so far into the season was that the committee would rank the teams off of actual games and wouldn’t be influenced by preseason rankings.</p>
<p>At least, that was the idea.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/8ac74acf42052d068b6cd0eeec32f609/cfb_playoff.jpg" style="line-height: 20.7999992370605px; float: right; width: 300px; height: 187px;" /></p>
<p>Earlier this year, I found that the <a href="http://blog.minitab.com/blog/the-statistics-game/has-the-college-football-playoff-already-been-decided">final AP poll was correlated with the preseason AP poll</a>. That is, if team A was ranked ahead of team B in the preseason and they had the same number of losses, team A was still usually ranked ahead of team B. The biggest exception was SEC teams, who were able to regularly jump ahead of teams (with the same number of losses) ranked ahead of them in the preseason.</p>
<p>If the final AP poll can be influenced by preseason expectations, could the college football playoff committee be influenced, too? Let’s compare their first set of rankings to the preseason AP poll to find out.</p>
Comparing the Ranks
<p>There are currently 17 different teams in the committee’s top 25 that have just one loss. I <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/26e7c8d8d8eee4fe2dfa26dc3d6e3c54/preseason_ap_vs__cfb_playoff_rankings.MTW">recorded the order</a> they are ranked in the committee’s poll and their order in the AP preseason poll. Below is an individual value plot of the data that shows each team’s preseason rank versus their current rank.</p>
<p><img alt="IVP" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/4098bab194a586865d3861f854d65627/ivp.jpg" style="width: 600px; height: 400px;" /></p>
<p>Teams on the diagonal line haven’t moved up or down since the preseason. Although Notre Dame is the only team to fall directly on the line, most teams aren’t too far off.</p>
<p>Teams below the line have jumped teams that were ranked ahead of them in the preseason. The biggest winner is actually not an SEC team; it’s TCU. Before the season, 13 of the current one-loss teams were ranked ahead of TCU, but now there are only 4. On the surface, TCU seems to counter the idea that only SEC teams can drastically move up from their preseason ranking. However, of the 9 teams TCU jumped, only one (Georgia) is from the SEC. And the only other team to jump up more than 5 spots is Mississippi, which of course is from the SEC. So I wouldn’t conclude that the CFB playoff committee rankings behave differently than the AP poll quite yet.</p>
<p>Teams above the line have been passed by teams that had been ranked behind them in the preseason. Ohio State is the biggest loser, having been passed by 9 different teams. Part of this can be explained by the fact that they have the worst loss (at home to a 4-4 Virginia Tech team). But another factor is that the preseason AP poll was released before anybody knew Buckeye quarterback Braxton Miller would miss the entire season. Had voters known that, Ohio State probably wouldn’t have been ranked so high to begin with.</p>
<p>Overall, 10 teams have moved up or down from their preseason spot by 3 spots or fewer. The correlation between the two polls is 0.571, which indicates a positive association between the preseason AP poll and the current CFB playoff rankings. That is, teams ranked higher in the preseason poll tend to be ranked higher in the playoff rankings.</p>
Concordant and Discordant Pairs
<p>We can take this analysis a step further by looking at the concordant and discordant pairs. A pair is concordant if the observations are in the same direction. A pair is discordant if the observations are in opposite directions. This will let us compare teams to each other two at a time.</p>
<p>For example, let’s compare Auburn and Mississippi. In the preseason, Auburn was ranked 3 (out of the 17 one-loss teams) and Mississippi was ranked 10. In the playoff rankings, Auburn is ranked 1 and Mississippi is ranked 2. This pair is concordant, since in both cases Auburn is ranked higher than Mississippi. But if you compare Alabama and Mississippi, you’ll see Alabama was ranked higher in the preseason, but Mississippi is ranked higher in the playoff rankings. That pair is discordant.</p>
<p>When we compare every team, we end up with 136 pairs. How many of those are concordant? Our <a href="http://www.minitab.com/products/minitab">favorite statistical software</a> has the answer: </p>
<p><img alt="Measures of Concordance" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/5f281abfa1e06d5cda492e17b3f9746b/concordance.jpg" style="width: 663px; height: 176px;" /></p>
<p>There are 96 concordant pairs out of 136, which is just over 70%. So most of the time, if a team was ranked higher in the preseason poll, it is ranked higher in the playoff rankings. And consider this: of the one-loss teams, the top 4 ranked preseason teams were Alabama, Oregon, Auburn, and Michigan St. Currently, the top 4 one-loss teams are Auburn, Mississippi, Oregon, and Alabama. That’s only one new team, which just so happens to be from the SEC.</p>
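<p>For readers who want to check the pair counts by hand, here is a minimal Python sketch of how concordant and discordant pairs are counted. The three-team example uses the ranks quoted in this post for Auburn and Mississippi; Alabama’s ranks (1 in the preseason, 4 now) are an assumption read off the order in which the top-4 lists are written. The function name <code>count_pairs</code> is purely illustrative and is not part of Minitab.</p>

```python
from itertools import combinations
from math import comb


def count_pairs(x, y):
    """Count (concordant, discordant, tied) pairs between two rankings."""
    concordant = discordant = tied = 0
    for i, j in combinations(range(len(x)), 2):
        # Positive product: both rankings order the two teams the same way.
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
        else:
            tied += 1
    return concordant, discordant, tied


# Number of distinct pairs among the 17 one-loss teams.
total_pairs = comb(17, 2)

# Share of pairs that are concordant, using the count reported in the post.
share_concordant = 96 / total_pairs

# Order: Auburn, Mississippi, Alabama (Alabama's ranks are assumed).
preseason = [3, 10, 1]
playoff = [1, 2, 4]

print(total_pairs, round(share_concordant, 3))
print(count_pairs(preseason, playoff))
```

<p>In the three-team example, Auburn and Mississippi form the only concordant pair; both pairs that involve Alabama are discordant, because Alabama started ahead of the other two and now sits behind them.</p>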
<p>That’s bad news for non-SEC teams that started the season ranked low, like Arizona, Notre Dame, Nebraska, and Kansas State. It's going to be hard for them to jump teams with the same record, especially if those teams are from the SEC. Just look at Alabama’s résumé so far. Their best win is over West Virginia and they lost to #4 Mississippi. Is that <em>really </em>better than Kansas State, who lost to #3 Auburn and beat Oklahoma <em>on the road</em>? If you simply changed the name on Alabama’s uniform to Utah and had them unranked to start the season, would they still be ranked three spots higher than Kansas State? I doubt it.</p>
<p>The good news is that there are still many games left to play. Most of these one-loss teams will lose at least one more game. But with 4 teams making the playoff this year, odds are we'll see multiple teams with the same record vying for the last playoff spot. And if this college football playoff ranking is any indication, then outside the SEC, the teams that were highly thought of in the preseason will have an edge.</p>
Fun Statistics, Hypothesis Testing
Fri, 31 Oct 2014 13:04:57 +0000
http://blog.minitab.com/blog/the-statistics-game/comparing-the-college-football-playoff-top-25-and-the-preseason-ap-poll
Kevin Rudy
Simulating Robust Processing with Design of Experiments, part 2
http://blog.minitab.com/blog/statistics-in-the-field/simulating-robust-processing-with-design-of-experiments2c-part-2
<p>by Jasmin Wong, guest blogger</p>
<p> </p>
<p><em><a href="http://blog.minitab.com/blog/statistics-in-the-field/simulating-robust-processing2c-part-1">Part 1</a> of this two-part blog post discusses the issues and challenges in injection moulding and suggests using simulation software and the statistical method called Design of Experiments (DOE) to speed development and boost quality. This part presents a case study that illustrates this approach. </em></p>
Preliminary Fill and Designed Experiment
<p>This case study considers the example of a hand dispensing pump for a sanitiser bottle where the main areas of concern were warpage and the concentricity of the tube, as this had a critical impact on fit and functionality. </p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f6c68e56710c222c2a20dd002021287f/dispenser_top.png" style="line-height: 20.7999992370605px; margin: 10px 15px; float: right; width: 400px; height: 236px;" /></p>
<div>
<p>In this example, the first step was to carry out a preliminary fill, pack, cool and warp analysis to ensure that the part had no filling difficulties such as short shots or hesitation. DOE was then carried out and, since the areas of concern were warpage and concentricity, these were selected as the quality factor/responses.</p>
<div>
<p>Four control factors that affected warpage and concentricity were used to carry out the DOE: melt temperature, packing pressure, cooling time, and fill time. The factor levels are shown in the table below:</p>
<p><img alt="Taguchi DOE control factors" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/322b2d00c3b22d962ca76ac0485e437b/taguchi_doe_control_factors.png" style="width: 450px; height: 136px;" /></p>
<p>A Taguchi L9 DOE was then created using Minitab Statistical Software. <span style="line-height: 1.6;">It should be noted that a Taguchi DOE assumes no significant interaction between factors, but this may not necessarily be true. In this case, however, it was selected to determine the relationship between the factors and responses in the shortest simulation time.</span></p>
<p>The Minitab worksheet below shows the process settings for the nine runs using the Taguchi L9 Design.</p>
<p><img alt="Taguchi design worksheet" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/7cbc350e2fbe466708f4b5b4a2f58566/taguchi_doe_worksheet.png" style="width: 450px; height: 169px;" /></p>
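<p>As a rough illustration of why nine runs are enough, the sketch below writes out the standard L9 orthogonal array for four three-level factors and checks its balance. It assumes the factors are assigned to columns in the order listed above; the actual column assignment in the worksheet may differ, and the rows hold level indices (1 to 3) rather than the real process settings.</p>

```python
from itertools import combinations

FACTORS = ["melt temperature", "packing pressure", "cooling time", "fill time"]

# Standard L9 orthogonal array: 9 runs, 4 factors, 3 levels each.
L9 = [
    (1, 1, 1, 1),
    (1, 2, 2, 2),
    (1, 3, 3, 3),
    (2, 1, 2, 3),
    (2, 2, 3, 1),
    (2, 3, 1, 2),
    (3, 1, 3, 2),
    (3, 2, 1, 3),
    (3, 3, 2, 1),
]


def is_orthogonal(design, levels=3):
    """True if every pair of columns holds each level combination equally often."""
    n_cols = len(design[0])
    expected = len(design) // (levels * levels)
    for a, b in combinations(range(n_cols), 2):
        counts = {}
        for row in design:
            key = (row[a], row[b])
            counts[key] = counts.get(key, 0) + 1
        if len(counts) != levels * levels or set(counts.values()) != {expected}:
            return False
    return True


full_factorial_runs = 3 ** len(FACTORS)  # every combination of levels

for run, row in enumerate(L9, start=1):
    print(run, dict(zip(FACTORS, row)))
print("orthogonal:", is_orthogonal(L9), "| full factorial runs:", full_factorial_runs)
```

<p>Because every pair of factors sees each combination of levels exactly once, the main effect of each factor can be estimated from nine simulations instead of the 81 a full factorial would need, at the cost of not being able to separate interactions.</p>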
<p>Moldex3D DOE was then used to perform the mathematical calculations based on the user’s specification (minimum warpage and linear shrinkage between nodes) to determine the optimum process setting.</p>
<p>From the nine different simulated runs, a main effect graph for warpage was plotted. </p>
<p><img alt="Main Effects Plot for Warpage" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/dbec7e75117c7763745e8260d78852fd/main_effects_warpage.png" style="width: 577px; height: 385px;" /></p>
<p><span style="line-height: 1.6;">From this, it could be seen that increasing the packing pressure and cooling time reduced warpage. Increasing melt temperature, on the other hand, led to higher warpage. Using a filling time of 0.2s or 0.3s seemed to give slightly less warpage than 0.1s. Hence, it was determined that to achieve lower warpage, the optimum process setting should be a melt temperature of 225°C, packing pressure of 15MPa, cooling time of 12s and filling time of 0.3s.</span></p>
<p style="line-height: 20.7999992370605px;">Taking the results obtained from Moldex3D, Minitab 17 statistical software was used to determine which of the four factors had the biggest influence on part warpage.</p>
<p style="line-height: 20.7999992370605px;"><img alt="response table for warpage" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/20e65680dd317de7add7a8559b1d50e3/response_table_warpage.png" style="width: 500px; height: 153px;" /></p>
<p style="line-height: 20.7999992370605px;">This data analysis showed that cooling time had the biggest impact on part warpage, followed by packing pressure, melt temperature and then filling time. An area graph of warpage showed a quick comparison of the nine different runs, indicating that run 3 gave the least warpage.</p>
<p><img alt="area graph of warpage" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/740d75c1b4424da02ee136a673e43780/area_graph_of_warpage.png" style="width: 500px; height: 333px;" /></p>
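<p>The ranking in a response table of this kind comes from a simple calculation: average the response at each level of each factor, take the difference between the largest and smallest average (the delta), and rank the factors by delta. The sketch below shows that calculation on the standard L9 array. The warpage values are made up for illustration and are not the simulation results from this case study.</p>

```python
FACTORS = ["melt temperature", "packing pressure", "cooling time", "fill time"]

# Standard L9 orthogonal array (level indices 1-3 for each factor).
L9 = [
    (1, 1, 1, 1), (1, 2, 2, 2), (1, 3, 3, 3),
    (2, 1, 2, 3), (2, 2, 3, 1), (2, 3, 1, 2),
    (3, 1, 3, 2), (3, 2, 1, 3), (3, 3, 2, 1),
]

# Hypothetical warpage values (arbitrary units), one per run.
warpage = [0.52, 0.47, 0.41, 0.50, 0.49, 0.46, 0.55, 0.48, 0.51]


def response_table(design, response, factors, levels=(1, 2, 3)):
    """Mean response per factor level, plus delta (max - min) and rank by delta."""
    table = {}
    for col, name in enumerate(factors):
        means = []
        for level in levels:
            values = [y for row, y in zip(design, response) if row[col] == level]
            means.append(sum(values) / len(values))
        table[name] = {"means": means, "delta": max(means) - min(means)}
    # Rank 1 goes to the factor with the largest delta.
    ranked = sorted(table, key=lambda name: table[name]["delta"], reverse=True)
    for rank, name in enumerate(ranked, start=1):
        table[name]["rank"] = rank
    return table


for name, row in response_table(L9, warpage, FACTORS).items():
    means = ", ".join(f"{m:.3f}" for m in row["means"])
    print(f"{name:17s} means: {means}  delta: {row['delta']:.3f}  rank: {row['rank']}")
```

<p>Plotting the three level means for each factor gives the main effects plot; the steeper the line, the larger the delta and the higher the factor’s rank.</p>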
<p>Concentricity is difficult to measure, both in real life and in simulation. In real life, the distance between different points is measured using a coordinate-measuring machine (CMM). In the Moldex3D simulation, the linear shrinkage between different nodes was measured. Eight different nodes were identified, and the linear shrinkage across the diameter of the tube was determined between them. The lower the linear shrinkage, the more circular the tube and the better the concentricity of the part.</p>
<p>The main effects plot below for shrinkage shows that to get better concentricity (lower linear shrinkage between the nodes), a lower melt temperature, cooling time and filling time, combined with a high pack pressure, were preferable.</p>
<p><img alt="Main Effects Plot for Shrinkage" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/3eb9b51b4bd8caeac5ead713a86ce90b/main_effects_shrinkage.png" style="width: 579px; height: 385px;" /></p>
<p>From this, it was determined that to achieve lower linear shrinkage, the optimum process setting should be a melt temperature of 225°C, packing pressure of 15MPa, cooling time of 8s and filling time of 0.1s. However, a cooling time of 8s may not be practical, as the analysis of warpage shows it would give high warpage.</p>
<p>Minitab was also used to find out which of the four control factors resulted in the greatest impact on linear shrinkage.</p>
<p><img alt="Response Table for Shrinkage" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/9e0e2aca3064320d44a9860223665f48/response_table_shrinkage.png" style="width: 500px; height: 153px;" /></p>
<p>This showed that pack pressure is ranked first, followed by cooling time, melt temperature and lastly the filling time. Since the 8s cooling time would lead to high warpage, a compromise had to be made.</p>
<p>As mentioned earlier, for linear shrinkage the packing pressure was more of a contributing factor than the cooling time, so it makes sense to use 12s cooling time with 15MPa packing pressure. Comparing the nine different runs for linear shrinkage in an area graph showed that run six gave the lowest linear shrinkage.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/dfabcb5cb7861c6dc11cc0fdb25c2b2d/area_graph_of_shrinkage.png" style="width: 500px; height: 333px;" /></p>
<p>Based on the user specification, Moldex3D’s mathematical calculations obtained the optimised run<span style="line-height: 1.6;">. For this example, weighting for warpage was the same as for linear shrinkage. However, based on the DOE simulation results obtained, the optimum process setting for the lowest warpage was to have a cooling time of 12s and filling time of 0.3s. The optimum process for the lowest linear shrinkage, on the other hand, required a cooling time of 8s and fill time of 0.1s.</span></p>
Concluding thoughts
<p>Moldex3D simulation resulted in a compromise process setting (melt temperature of 225°C, packing pressure of 15MPa, cooling time of 12s and filling time of 0.1s), which was used as the optimum run. From the area graphs shown below, it can be seen that the optimised run 10 gives the lowest warpage compared to the other nine runs, while having low linear shrinkage.</p>
<p><img alt="optimized run - area chart" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/13c7a74c8d37f74f4acc152b676e53b6/optimized_run_area_graph_w640.png" style="width: 640px; height: 210px;" /></p>
<p>From the simulation in Moldex 3D, shown below, it can be seen that part warpage and concentricity of the tube have been significantly improved (warpage has been improved by 20-30% while linear shrinkage has been kept to 0.6-0.7%).</p>
<p><img alt="Moldex 3D simulation" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a1b9270c0e645e9db3d7c4f626308aba/moldex_3d_sim.png" style="width: 500px; height: 179px;" /></p>
<p>It is important that designers and moulders understand that numerical results in a simulation such as this provide only a relative comparison and should not be treated as absolute. This is because there are various uncontrollable factors in the actual mould shop environment—‘noise’—which cannot be re-enacted in a simulation. However, running DOE using simulation can give the engineering team a head start on identifying which control factors to focus on and the relationship those factors have with part quality.</p>
<p> </p>
<p><strong>About the guest blogger</strong></p>
<p>Jasmin Wong is project engineer at UK-based <a href="http://www.plazology.co.uk/" target="_blank">Plazology</a>, which provides product design optimisation, injection moulding flow simulation, mould design, mould procurement, and moulding process validation services to global manufacturing customers. She is an MSc graduate in polymer composite science and engineering and recently gained Moldex3D Analyst Certification.</p>
<p> </p>
<p> </p>
<p><em>A version of this article originally appeared in the <a href="http://content.yudu.com/htmlReader/A3572w/IWOct14/reader.html?page=26" target="_blank">October 2014 issue of Injection World</a> magazine.</em></p>
</div>
</div>
Design of Experiments
Mon, 27 Oct 2014 12:00:00 +0000
http://blog.minitab.com/blog/statistics-in-the-field/simulating-robust-processing-with-design-of-experiments2c-part-2
Guest Blogger
Can Regression and Statistical Software Help You Find a Great Deal on a Used Car?
http://blog.minitab.com/blog/understanding-statistics/can-regression-and-statistical-software-help-you-find-a-great-deal-on-a-used-car
<p>You need to consider many factors when you’re buying a used car. Once you narrow your choice down to a particular car model, you can get a wealth of information about individual cars on the market through the Internet. How do you navigate through it all to find the best deal? By analyzing the data you have available. </p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/710ce579b4120727bf67e8b48f5965e8/240_used_car_kovacs.jpg" style="line-height: 20.7999992370605px; border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 240px; height: 240px;" /></p>
<p>Let's look at how this works using <a href="http://blog.minitab.com/blog/understanding-statistics/we-just-got-rid-of-five-reasons-to-fear-data-analysis">the Assistant</a> in Minitab 17. With the Assistant, you can use regression analysis to calculate the expected price of a vehicle based on variables such as year, mileage, whether or not the technology package is included, and whether or not a free Carfax report is included.</p>
<p>And it's probably a lot easier than you think. </p>
<p>A search of a leading Internet auto sales site yielded data about 988 vehicles of a specific make and model. After putting the data into Minitab, we choose <strong>Assistant > Regression…</strong></p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/9e87de993a0daa39e6643b8c6d3aed9c/regression_dialog.png" style="width: 395px; height: 247px;" /></p>
<p>At this point, if you aren’t very comfortable with regression, <a href="http://www.minitab.com/products/minitab/assistant/">the Assistant makes it easy to select the right option for your analysis</a>.</p>
A Decision Tree for Selecting the Right Analysis
<p>We want to explore the relationships between the price of the vehicle and four factors, or X variables. Since we have more than one X variable, and since we're not looking to optimize a response, we want to choose Multiple Regression.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/bc802d35bfb57ca3b86e061da4fa4b09/regression_decision_tree_w640.png" style="width: 640px; height: 502px;" /></p>
<p>This <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/9ecb2280228deb621ee2db7f6fbe300e/used_cars.MTW">data set</a> includes five columns: mileage, the age of the car in years, whether or not it has a technology package, whether or not it includes a free CARFAX report, and, finally, the price of the car.</p>
<p>We don’t know which of these factors may have a significant relationship to the cost of the vehicle, and we don’t know whether there are significant two-way interactions between them, or if there are quadratic (nonlinear) terms we should include, but we don’t need to. Just fill out the dialog box as shown. </p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/b93a0a755e8e73dc7f681ea4b1965749/regression_dialog_box.png" style="width: 532px; height: 382px;" /></p>
<p>Press OK and the Assistant assesses each potential model and selects the best-fitting one. It also provides a comprehensive set of reports, including a Model Building Report that details how the final model was selected and a Report Card that alerts you to potential problems with the analysis, if there are any.</p>
Interpreting Regression Results in Plain Language
<p>The Summary Report tells us in plain language that there is a significant relationship between the Y and X variables in this analysis, and that the factors in the final model explain 91 percent of the observed variation in price. It confirms that all of the variables we looked at are significant, and that there are significant interactions between them. </p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/746574a27bba821ffab4f77ae1a2931b/multiple_regression_summary_report_w640.png" style="width: 640px; height: 480px;" /></p>
<p>The Model Equations Report contains the final regression models, which can be used to predict the price of a used vehicle. The Assistant provides 2 equations, one for vehicles that include a free CARFAX report, and one for vehicles that do not.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/58598060212558634d62d75a7045bf0b/regression_equation_w640.png" style="width: 640px; height: 186px;" /></p>
<p>We can tell several interesting things about the price of this vehicle model by reading the equations. First, the average cost for vehicles with a free CARFAX report is about $200 more than the average for vehicles with a paid report ($30,546 vs. $30,354). This could be because these cars probably have a clean report (if not, the sellers probably wouldn’t provide it for free).</p>
<p>Second, each additional mile added to the car decreases its expected price by roughly 8 cents, while each year added to the car’s age decreases the expected price by $2,357.</p>
<p>The technology package adds, on average, $1,105 to the price of vehicles that have a free CARFAX report, but the package adds $2,774 to vehicles with a paid CARFAX report. Perhaps the sellers of these vehicles hope to use the appeal of the technology package to compensate for some other influence on the asking price. </p>
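<p>To make those numbers concrete, here is a simplified Python sketch that combines only the rounded figures quoted above into a price estimate. It is not the Assistant’s actual model: the fitted equations appear in the report image and may include additional terms, and treating the two dollar amounts as base prices at zero miles and zero years is an assumption. The function name <code>expected_price</code> is hypothetical.</p>

```python
def expected_price(miles, age_years, tech_package, free_carfax):
    """Rough expected asking price built from the rounded figures in the text."""
    if free_carfax:
        base, tech_premium = 30546.0, 1105.0   # free CARFAX report
    else:
        base, tech_premium = 30354.0, 2774.0   # paid CARFAX report
    # About 8 cents per mile and $2,357 per year of age, as quoted in the post.
    price = base - 0.08 * miles - 2357.0 * age_years
    if tech_package:
        price += tech_premium
    return price


# Example: a 2-year-old car with 10,000 miles, no technology package, free CARFAX.
print(round(expected_price(10_000, 2, tech_package=False, free_carfax=True), 2))
```

<p>Writing the estimate as a function makes it easy to compare a listing’s asking price against what the quoted coefficients would suggest for its mileage, age, and options.</p>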
Residuals versus Fitted Values
<p>While these findings are interesting, our goal is to find the car that offers the best value. In other words, we want to find the car that has the largest difference between the asking price and the expected asking price predicted by the regression analysis.</p>
<p>For that, we can look at the Assistant’s Diagnostic Report. The report presents a chart of Residuals vs. Fitted Values. If we see obvious patterns in this chart, it can indicate problems with the analysis. In that respect, this chart of Residuals vs. Fitted Values looks fine, but now we’re going to use the chart to identify the best value on the market.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/d55ae8720ba281bf37135b68b2069434/multiple_regression_diagnostic_report_w640.png" style="width: 640px; height: 480px;" /></p>
<p>In this analysis, the “Fitted Values” are the prices predicted by the regression model. “Residuals” are what you get when you subtract the predicted asking price from the actual asking price, which is exactly the information you’re looking for! A large negative residual means a car is listed for much less than the model expects. The Assistant marks large residuals in red, making them very easy to find. And three of those residuals (which appear in light blue above because we’ve selected them) are very far below the asking price predicted by the regression analysis.</p>
<p>Selecting these data points on the graph reveals that these are vehicles whose data appears in rows 357, 359, and 934 of the data sheet. Now we can revisit those vehicles online to see if one of them is the right vehicle to purchase, or if there’s something undesirable that explains the low asking price. </p>
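<p>The same search can be done without clicking on the chart. The sketch below uses the usual convention that a residual is the observed value minus the fitted value, then sorts the rows so the most underpriced listings come first. The asking and predicted prices are invented for illustration, and the helper names are hypothetical.</p>

```python
def residuals(actual, fitted):
    """Residual = actual asking price minus the price predicted by the model."""
    return [a - f for a, f in zip(actual, fitted)]


def best_bargains(actual, fitted, top=3):
    """Return (row number, residual) for the most negative residuals; rows are 1-indexed."""
    res = residuals(actual, fitted)
    ranked = sorted(enumerate(res, start=1), key=lambda pair: pair[1])
    return ranked[:top]


# Hypothetical listings: asking price and model-predicted price for five cars.
asking = [24500.0, 19900.0, 27250.0, 15000.0, 22100.0]
predicted = [24100.0, 23400.0, 26900.0, 21800.0, 22600.0]

for row, resid in best_bargains(asking, predicted, top=2):
    print(f"row {row}: listed {abs(resid):,.0f} below the predicted price")
```

<p>A large negative residual is only a lead, not a verdict: as with the chart, the next step is to look at the listing itself to see why the price is so low.</p>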
<p>Sure enough, the records for those vehicles reveal that two of them have severe collision damage.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/5dbbf5aa405d4b2d53ec720657a09556/vehicles.jpg" style="width: 320px; height: 356px;" /></p>
<p>But the remaining vehicle appears to be in pristine condition, and is several thousand dollars less than the price you’d expect to pay, based on this analysis!</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/640bd720a3d1f8b04713aa0ec321a570/nice_car.png" style="width: 254px; height: 189px;" /></p>
<p>With the power of regression analysis and the Assistant, we’ve found a great used car—at a price you know is a real bargain.</p>
<p> </p>
Fun Statistics, Regression Analysis, Statistics, Statistics Help
Wed, 22 Oct 2014 12:00:00 +0000
http://blog.minitab.com/blog/understanding-statistics/can-regression-and-statistical-software-help-you-find-a-great-deal-on-a-used-car
Eston Martz
Using Data Analysis to Maximize Webinar Attendance
http://blog.minitab.com/blog/michelle-paret/using-data-analysis-to-maximize-webinar-attendance
<p>We like to host webinars, and our customers and prospects like to attend them. But when our webinar vendor moved from a pay-per-person pricing model to a pay-per-webinar pricing model, we wanted to find out how to maximize registrations and thereby minimize our costs.<img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/8a6733d3b0516b7f1c7ad80ea753d430/mtbnewspromos_w640.jpeg" style="width: 400px; height: 273px; float: right; border-width: 1px; border-style: solid; margin: 10px 15px;" /></p>
<p>We collected webinar data on the following variables:</p>
<ul>
<li>Webinar topic</li>
<li>Day of week</li>
<li>Time of day – 11 a.m. or 2 p.m.</li>
<li>Newsletter promotion – no promotion, newsletter article, newsletter sidebar</li>
<li>Number of registrants</li>
<li>Number of attendees</li>
</ul>
<p>Once we'd collected our data, it was time to analyze it and answer some key questions using <a href="http://www.minitab.com/products/minitab/">Minitab Statistical Software</a>.</p>
Should we use registrant or attendee counts for the analysis?
<strong><span style="line-height: 16.8666667938232px; font-family: Calibri, sans-serif; font-size: 11pt;"><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/4d9fa1e3c73606627d2ca1ec34b620e2/scatterplot_w640.jpeg" style="width: 300px; height: 197px; margin: 10px 15px; float: left;" /></span></strong>
<p>First we needed to decide what we would use to measure our results: the number of people who signed up, or the number of people who actually attended the webinar. This question really boils down to answering the question, “Can I trust my data?”</p>
<p>Our data collection system for webinar registrants is much more accurate than our data collection system for webinar attendees. This is due to customer behavior and their willingness to share contact information, in addition to the automated database processes that connect our webinar vendor data with our own database. So, for a period of time, I manually collected the attendee data directly from our webinar vendor to see how it correlated with the easily-accessible and accurate registration data. The scatterplot above shows the results.</p>
<p>With a <a href="http://blog.minitab.com/blog/understanding-statistics/no-matter-how-strong-correlation-still-doesnt-imply-causation">correlation coefficient </a>of 0.929 and a p-value of 0.000, there was a strong positive linear relationship between the registrations and attendee counts. If registrations are high, then attendance is also high. If registrations are low, then attendance is also low. I concluded that I could use the registration data—which is both easily accessible and extremely reliable—to conduct my analysis.</p>
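<p>For reference, the correlation coefficient reported here is the Pearson correlation, which can be computed in a few lines. The sketch below is a plain-Python version; the registrant and attendee counts are invented for illustration and are not our webinar data.</p>

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical counts for six webinars.
registrants = [210, 340, 150, 420, 275, 390]
attendees = [95, 160, 70, 205, 120, 180]

print(round(pearson_r(registrants, attendees), 3))
```

<p>A value close to 1 means the two counts rise and fall together almost in a straight line, which is what justifies using one as a stand-in for the other.</p>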
Should we consider data for the last 6 years?
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/5e73f48b852c7afc17762f28bf8887cf/i_mr_chart_of_registrants_w640.jpeg" style="width: 400px; height: 263px; margin: 10px 15px; float: left;" />We’ve been collecting webinar data for 6 years, but that doesn’t mean we can treat the last 6 years of data as one homogeneous population.</p>
<p>A lot can change in a 6-year time period. Perhaps there was a change in the webinar process that affected registrations. To determine whether or not I should use all of the data, I used an Individuals and Moving Range (I-MR, also referred to as X-MR) <a href="http://blog.minitab.com/blog/understanding-statistics/how-create-and-read-an-i-mr-control-chart">control chart</a> to evaluate the process stability of webinar registrations over time.</p>
<p>The graph revealed a single point on the MR chart that flagged as out-of-control. I looked more closely at this point and verified that the data was accurate and that this webinar belonged with the larger population. Based on this information, I decided to proceed with analyzing all 6 years of data together. (Note there is some clustering of points due to promotions, but again the goal here was to determine if we could use data over a 6-year time period.)</p>
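<p>The limits on an I-MR chart follow from the average moving range. The sketch below uses the standard constants for moving ranges of two consecutive points (2.66 for the individuals chart and 3.267 for the moving range chart) and flags any moving range above its upper control limit. The registration counts are made up for illustration, and <code>imr_limits</code> is a hypothetical helper, not a Minitab function.</p>

```python
def imr_limits(values):
    """Center lines and control limits for an I-MR chart (moving range of length 2)."""
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    center = sum(values) / len(values)
    mr_ucl = 3.267 * mr_bar
    # Moving range i (0-based) compares observations i+1 and i+2 (1-based).
    flagged = [i + 2 for i, mr in enumerate(moving_ranges) if mr > mr_ucl]
    return {
        "i_center": center,
        "i_lcl": center - 2.66 * mr_bar,
        "i_ucl": center + 2.66 * mr_bar,
        "mr_center": mr_bar,
        "mr_ucl": mr_ucl,
        "flagged_observations": flagged,
    }


# Hypothetical registration counts for ten consecutive webinars.
registrations = [100, 104, 98, 102, 101, 160, 158, 163, 161, 159]

limits = imr_limits(registrations)
for key, value in limits.items():
    print(key, value)
```

<p>A flagged moving range marks an unusually large jump between two consecutive webinars; whether that point should be excluded is a judgment call that depends on what actually happened, as described above.</p>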
What variables impact registrations?
<p>I performed an ANOVA using Minitab's General Linear Model tool to find out which factors—topic, day of week, time of day, or newsletter promotion—significantly affect webinar registrations.<img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/6060c2db-f5d9-449b-abe2-68eade74814a/Image/3758d3d03a604bab9921ad9f94663dc8/main_effects_plot_for_registrants_w640.jpeg" style="width: 400px; height: 263px; float: right; margin: 10px 15px;" /></p>
<p>The ANOVA results revealed that the day of week, time of day, and webinar topic <em>do not</em> affect webinar registrations, but the newsletter promotion type <em>does</em> (p-value = 0.000).</p>
<p>So which webinar promotion type maximizes webinar registrations?</p>
<p>Using Minitab to conduct <a href="http://blog.minitab.com/blog/statistics-and-quality-data-analysis/keep-that-special-someone-happy-when-you-perform-multiple-comparisons">Tukey comparisons</a>, we can see that registrations for webinars promoted in the newsletter sidebar space were not significantly different from webinars that weren't promoted at all.</p>
<p>However, webinars that were promoted in the newsletter <em>article </em>space resulted in significantly more registrations than both the sidebar promotions and no promotions.</p>
<p>From this analysis, we concluded that we still had the flexibility to offer webinars at various times and days of the week, and we could continue to vary webinar topics based on customer demand and other factors. To maximize webinar attendance and minimize webinar cost, we needed to focus our efforts on promoting the webinars in our newsletter, utilizing the article space.</p>
<p>But over the past year, we’ve started to actively promote our webinars via other channels as well, so next up is some more data analysis—using Minitab—to figure out what marketing channels provide the best results…</p>
Data AnalysisHypothesis TestingRegression AnalysisStatisticsFri, 17 Oct 2014 12:00:00 +0000http://blog.minitab.com/blog/michelle-paret/using-data-analysis-to-maximize-webinar-attendanceMichelle ParetHow Important Are Normal Residuals in Regression Analysis?
http://blog.minitab.com/blog/adventures-in-statistics/how-important-are-normal-residuals-in-regression-analysis
<p>I’ve written about the importance of <a href="http://blog.minitab.com/blog/adventures-in-statistics/why-you-need-to-check-your-residual-plots-for-regression-analysis" target="_blank">checking your residual plots</a> when performing linear regression analysis. If you don’t satisfy the assumptions for an analysis, you might not be able to trust the results. One of the assumptions for regression analysis is that the residuals are normally distributed. Typically, you assess this assumption using the normal probability plot of the residuals.</p>
<div style="float: right; width: 250px; margin: 15px 0px 15px 15px;"><img alt="Normal Probability Plot showing residuals that are not distributed normally" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/742d7708-efd3-492c-abff-6044d78e3bbd/Image/d84cbe3e157257e1ba07563dacdacbd7/nonnormal_residuals.png" title="Are these nonnormal residuals bad?" width="250" /> <em>Are these nonnormal residuals a problem?</em></div>
<p>If you have nonnormal residuals, can you trust the results of the regression analysis?</p>
<p>Answering this question highlights some of the research that Rob Kelly, a senior statistician here at Minitab, was tasked with in order to guide the development of our <a href="http://www.minitab.com/en-us/products/minitab/" target="_blank">statistical software</a>.</p>
Simulation Study Details
<p>The goals of the simulation study were to:</p>
<ul>
<li>determine whether nonnormal residuals affect the error rate of the F-tests for regression analysis</li>
<li>generate a safe, minimum sample size recommendation for nonnormal residuals</li>
</ul>
<p>For simple regression, the study assessed both the overall F-test (for both linear and quadratic models) and the F-test specifically for the highest-order term.</p>
<p>For multiple regression, the study assessed the overall F-test for three models that involved five continuous predictors:</p>
<ul>
<li>a linear model with all five X variables</li>
<li>all linear and square terms</li>
<li>all linear terms and seven of the 2-way interactions</li>
</ul>
<p>The residual distributions included skewed, heavy-tailed, and light-tailed distributions that depart substantially from the normal distribution.</p>
<p>There were 10,000 tests for each condition. The study determined whether the tests incorrectly rejected the null hypothesis more often or less often than expected for the different nonnormal distributions. If the test performs well, the Type I error rates should be very close to the target significance level.</p>
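<p>To make the idea concrete, here is a rough Python sketch of one such condition. It is not the code used in the study: it assumes a simple regression with 15 observations, skewed (exponential) errors, a true slope of zero, and a hard-coded 5% critical value of 4.667 for F(1, 13) taken from standard F tables. The number of simulated tests and the seed are arbitrary choices.</p>

```python
import random

N = 15                # sample size per simulated data set
F_CRIT_1_13 = 4.667   # upper 5% point of F(1, N - 2); valid only for N = 15

def slope_f_statistic(x, y):
    """F statistic for the slope in a simple linear regression of y on x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    syy = sum((yi - y_mean) ** 2 for yi in y)
    ss_regression = sxy ** 2 / sxx
    ss_error = syy - ss_regression
    return ss_regression / (ss_error / (n - 2))

def type_i_error_rate(n_tests=4000, seed=1):
    """Fraction of tests that reject a true null when errors are skewed."""
    rng = random.Random(seed)
    x = list(range(N))
    rejections = 0
    for _ in range(n_tests):
        # The null hypothesis is true: y is pure exponential noise, unrelated to x.
        y = [rng.expovariate(1.0) for _ in range(N)]
        if slope_f_statistic(x, y) > F_CRIT_1_13:
            rejections += 1
    return rejections / n_tests

rate = type_i_error_rate()
print(f"Observed Type I error rate: {rate:.4f} (target 0.05)")
```

<p>If the F-test is robust to this kind of nonnormality, the observed rate should land near the 0.05 target.</p>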
Results and Sample Size Guideline
<p>The study found that a sample size of at least 15 was important for both simple and multiple regression. If you meet this guideline, the test results are usually reliable for any of the nonnormal distributions.</p>
<p>In simple regression, the observed Type I error rates are all between 0.0380 and 0.0529, very close to the target significance level of 0.05.</p>
<p>In multiple regression, the Type I error rates are all between 0.08820 and 0.11850, close to the target of 0.10.</p>
Closing Thoughts
<p>The good news is that if your sample size is at least 15, the test results are reliable even when the residuals depart substantially from the normal distribution.</p>
<p>However, there is a caveat if you are using regression analysis to generate predictions. <a href="http://blog.minitab.com/blog/adventures-in-statistics/when-should-i-use-confidence-intervals-prediction-intervals-and-tolerance-intervals" target="_blank">Prediction intervals</a> are calculated based on the assumption that the residuals are normally distributed. If the residuals are nonnormal, the prediction intervals may be inaccurate.</p>
<p>This research guided the implementation of regression features in the <a href="http://www.minitab.com/en-us/products/minitab/assistant/" target="_blank">Assistant menu</a>. The Assistant is your interactive guide to choosing the right tool, analyzing data correctly, and interpreting the results. Because the regression tests perform well with relatively small samples, the Assistant does not test the residuals for normality. Instead, the Assistant checks the size of the sample and indicates when the sample is less than 15.</p>
<p><a href="http://blog.minitab.com/blog/adventures-in-statistics/multiple-regression-analysis-and-response-optimization-examples-using-the-assistant-in-minitab-17" target="_blank">See a multiple regression example that uses the Assistant.</a></p>
<p>You can read the full study results in the <a href="http://support.minitab.com/en-us/minitab/17/Assistant_Simple_Regression.pdf" target="_blank">simple regression white paper</a> and the <a href="http://support.minitab.com/en-us/minitab/17/Assistant_Multiple_Regression.pdf" target="_blank">multiple regression white paper</a>. You can also peruse all of our <a href="http://support.minitab.com/en-us/minitab/17/technical-papers/" target="_blank">technical white papers</a> to see the research we conduct to develop methodology throughout the Assistant and Minitab.</p>
Regression AnalysisThu, 16 Oct 2014 12:00:00 +0000http://blog.minitab.com/blog/adventures-in-statistics/how-important-are-normal-residuals-in-regression-analysisJim FrostThe Ghost Pattern: A Haunting Cautionary Tale about Moving Averages
http://blog.minitab.com/blog/understanding-statistics/the-ghost-pattern-a-haunting-cautionary-tale-about-moving-averages
<p>Halloween's right around the corner, so here's a scary thought for the statistically minded: That pattern in your time series plot? Maybe it's just a ghost. <em>It might not really be there at all.</em> </p>
<p>That's right. The trend that seems so evident might be a phantom. Or, if you don't believe in that sort of thing, chalk it up to the brain's desire to impose order on what we see, even when it doesn't exist. </p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/336bc5b657c980e1c2769192a4757fa9/ghosts.png" style="line-height: 20.7999992370605px; margin: 10px 15px; float: right; width: 200px; height: 200px;" /></p>
<p>I'm going to demonstrate this with Minitab Statistical Software (get the free 30-day <a href="http://it.minitab.com/products/minitab/free-trial.aspx">trial version</a> and play along, if you don't already use it). And if things get scary, just keep telling yourself "It's only a simulation. It's only a simulation."</p>
<p>But remember the ghost pattern when we're done. It's a great reminder of how important it is to make sure that you've interpreted your data properly, and looked at all the factors that might influence your analysis—including the quirks inherent in the statistical methods you used. </p>
Plotting Random Data from a 20-Sided Die
<p>We're going to need some random data, which we can get Minitab to generate for us. In many role-playing games, players use a 20-sided die to determine the outcome of battles with horrible monsters, so in keeping with the Halloween theme we'll simulate 500 consecutive rolls with a 20-sided die. Choose <strong>Calc > Random Data > Integer...</strong> and have Minitab generate 500 rows of random integers between 1 and 20. </p>
<p>Now go to <strong>Graph > Time Series Plot...</strong> and select the column of random integers. Minitab creates a graph that will look something like this: </p>
<p><img alt="Time Series Plot of 200 Twenty-Sided Die Rolls" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/bc2a4c9bf05e4a61103451fa6e6f8342/20_sided_die_time_series_plot.png" style="width: 577px; height: 386px;" /></p>
<p>It looks like there could be a pattern, one that looks a little bit like a sine wave...but it's hard to see, since there's a lot of variation in consecutive points. In this situation, many analysts will use a technique called the Moving Average to filter the data. The idea is to smooth out the natural variation in the data by looking at the <em>average </em>of several consecutive data points, thus enabling a pattern to reveal itself. It's the statistical equivalent of applying a noise filter to eliminate hiss on an audio recording. </p>
<p>A moving average can be based on as few as 2 data points; the best length depends on the size and nature of your data set. We're going to calculate the moving average of every 5 numbers. Choose <strong>Stat > Time Series > Moving Average...</strong> Enter the column of integers as the Variable, and enter 5 as the MA length. Then click "Storage" and have Minitab store the calculated averages in a new data column. </p>
<p>Now create a new time series plot using the moving averages:</p>
<p><img alt="moving average time series plot" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/f93eff7bceb62bd5da113de356afcd8e/moving_average_time_series_plot.png" style="width: 576px; height: 384px;" /></p>
<p>You can see how some of the "noise" from point-to-point variation has been reduced, and it does look like there could, just possibly, be a pattern there.</p>
Can Moving Averages Predict the Future?
<p>Of course, a primary reason for doing a time series analysis is to forecast the next item (or several) in the series. Let's see if we might predict the next moving average of the die by knowing the current moving average. </p>
<p>Select <strong>Stat > Time Series > Lag</strong>. In the dialog box, choose the "moving averages" column as the series to lag. We'll use this dialog to create a new column of data that places each moving average down 1 row in the column and inserts missing value symbols, *, at the top of the column.</p>
<p>Now we can create a <a href="http://blog.minitab.com/blog/understanding-statistics/using-statistics-software-and-graphs-to-quickly-explore-relationships-between-variables">simple scatterplot</a> that will show if there's a correlation between the observed moving average and the next one. </p>
<p><img alt="Scatterplot of Current and Next Moving Averages" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/78607f90333600cdeb6eeba721c62ee7/scatterplot_of_moving_averages.png" style="width: 578px; height: 386px;" /></p>
<p>Clearly, there's a positive correlation between the current moving average and the next, which means we <em>can </em>use the current moving average to predict the next one. </p>
<p>But wait a minute...this is <em>random data!</em> By definition, you <em>can't </em>predict random data, so how can there be a correlation? This is getting kind of creepy...it's like there's some kind of ghost in this data. </p>
<p>Zoinks! What would Scooby Doo make of all this? </p>
Debunking the "Ghost" with the Slutsky-Yule Effect
<p>Don't panic: there's a perfectly rational explanation for what we're seeing here. It's called the Slutsky-Yule Effect, which simply says that an autocorrelated time series (like a moving average, where consecutive values share most of their data points) can <em>look like </em>patterned data, even if there's no relationship among the underlying data points. </p>
<p>So there's no ghost in our random data; instead, we're seeing a sort of statistical illusion. Using the moving average can make it seem like a pattern or relationship exists, but that apparent pattern could be a side effect of the tool, and not an indication of a real pattern. </p>
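<p>If you don't have Minitab handy, the same illusion can be reproduced with a short Python sketch. The seed and the helper names are arbitrary choices made for this illustration. It rolls a 20-sided die 500 times, computes 5-point moving averages, and compares the correlation between consecutive values before and after smoothing.</p>

```python
import random

def moving_average(values, length):
    """Average each run of `length` consecutive values."""
    return [
        sum(values[i:i + length]) / length
        for i in range(len(values) - length + 1)
    ]

def lag1_correlation(series):
    """Pearson correlation between each value and the one that follows it."""
    current, following = series[:-1], series[1:]
    n = len(current)
    mean_c = sum(current) / n
    mean_f = sum(following) / n
    cov = sum((c - mean_c) * (f - mean_f) for c, f in zip(current, following))
    var_c = sum((c - mean_c) ** 2 for c in current)
    var_f = sum((f - mean_f) ** 2 for f in following)
    return cov / (var_c * var_f) ** 0.5

rng = random.Random(2014)                       # arbitrary seed
rolls = [rng.randint(1, 20) for _ in range(500)]
smoothed = moving_average(rolls, 5)

raw_r = lag1_correlation(rolls)
smoothed_r = lag1_correlation(smoothed)
print(f"Lag-1 correlation of raw rolls:       {raw_r:.2f}")
print(f"Lag-1 correlation of moving averages: {smoothed_r:.2f}")
```

<p>The raw rolls should show a correlation near zero, while the moving averages should show a strong positive one. That is expected: each 5-point average shares 4 of its 5 rolls with the next average, so for independent rolls the theoretical lag-1 correlation is 4/5 = 0.8, even though the underlying data contain no pattern at all.</p>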
<p>Does this mean you shouldn't use moving averages to look at your data? No! It's a very valuable and useful technique. However, using it carelessly could get you into trouble. And if you're basing a major decision solely on moving averages, you might want to try some alternate approaches, too. Mikel Harry, one of the originators of Six Sigma, has a <a href="http://drmikelharry.wordpress.com/2014/04/08/beware-the-moving-average/">great blog post</a> that presents a workplace example of how far apart reality and moving averages can be. </p>
<p>So just remember the Slutsky-Yule Effect when you're analyzing data in the dead of night, and your moving average chart shows something frightening. Shed some more light on the subject with follow-up analysis and you might find there's nothing to fear at all. </p>
Data AnalysisFun StatisticsStatisticsStatsMon, 13 Oct 2014 12:00:00 +0000http://blog.minitab.com/blog/understanding-statistics/the-ghost-pattern-a-haunting-cautionary-tale-about-moving-averagesEston MartzUsing Before/After Control Charts to Assess a Car’s Gas Mileage
http://blog.minitab.com/blog/understanding-statistics/using-before-and-after-control-charts-to-assess-a-care28099s-gas-mileage
<p>Keeping your vehicle fueled up is expensive. Maximizing the miles you get per gallon of fuel saves money and helps the environment, too. </p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/05b215659b2ef9b8a0e478c92e2dd932/car_dash_200.jpg" style="line-height: 20.7999992370605px; border-width: 1px; border-style: solid; margin: 10px 15px; float: right; width: 200px; height: 200px;" /></p>
<p>But knowing if you're getting good mileage requires some data analysis, which gives us a good opportunity to apply one of the common tools used in Six Sigma, the I-MR (individuals and moving range) control chart, to daily life. </p>
Finding Trends or Unusual Variation
<p>Looking at your vehicle’s MPG data lets you see if your mileage is holding steady, declining, or rising over time. This data can also reveal unusual variation that might indicate a problem you need to fix.</p>
<p>Here's a simulated <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/12e461add9f92cb704d405aec09dd4be/mileage.MTW">data set</a> that collects 3 years’ worth of gas mileage records for a car that should get an average of 20 miles per gallon, according to the manufacturer’s estimates. However, the owner didn’t do any vehicle maintenance for the first two years he owned the car. This year, though, he’s diligently performed recommended maintenance.</p>
<p>How does his mileage measure up? And has his attention to maintenance in the past 12 months affected his car’s fuel economy? Let’s find out with the Assistant in Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a>.</p>
Creating a Control Chart that Accounts for Process Changes
<p>To create the most meaningful chart, we need to recall that a major change in how the vehicle is handled took place during the time the data were collected. The owner bought the car three years ago, but he’s only done the recommended maintenance in the last year.</p>
<p>Since the data were collected both before and after this change, we want to account for it in the analysis.</p>
<p>The easiest way to handle this is to choose <strong>Assistant > Before/After Control Charts…</strong> to create a chart that makes it easy to see how the change affected both the mean and variance in the process.</p>
<p>If you're following along with Minitab, the Maint column in the worksheet notes which MPG measurements were taken before and after the owner started paying attention to maintenance. Complete the Before/After I-MR Chart dialog box as shown below:</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/b5500c9339bfbb45f5baa07cfd455943/before_after_i_mr_chart_dialog.png" style="width: 498px; height: 376px;" /></p>
Interpreting the Results of Your Data Analysis
<p>After you press OK, the Assistant produces a Diagnostic Report with detailed information about the analysis, as well as a Report Card, which provides guidance on how to interpret the results and flags potential problems. In this case, there are no concerns with the <a href="http://blog.minitab.com/blog/real-world-quality-improvement/quality-improvement-in-healthcare3a-showing-if-process-changes-actually-improve-the-patient-experience">process mean and variation</a>.</p>
<p>The Assistant's Summary Report gives you the bottom-line results of the analysis.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/fbe653d819f1baf7531202ab1ed32212/before_after_i_mr_chart_summary_report_w640.png" style="width: 640px; height: 473px;" /></p>
<p>The Moving Range chart, shown in the lower portion of the graph, illustrates the moving range of the data. It shows that while the upper and lower control limits have shifted, the difference in variation before and after the change is not statistically significant. </p>
<p>However, the car’s mean mileage, which is shown in the Individual Value chart displayed at the top of the graph, <em>has </em>seen a statistically significant change, moving from 19.12 MPG to just under 21 MPG. </p>
<span style="line-height: 1.6;">Easy Creation of Control Charts</span>
<p>Control charts have been used in statistical process control for decades, and are among the most commonly accessed tools available in statistical software packages. The Assistant has made it particularly easy for anyone to create a control chart and see whether or not a process is within control limits, to confirm that observation statistically, and to see whether or not a change in the process results in a change in the process outcome or variation.</p>
<p>As for the data we used in this example, whether or not a 2 mile-per-gallon increase in fuel economy is practically as well as statistically significant could be debated. But since the price of fuel rarely falls, we recommend that the owner of this vehicle continue to keep it tuned up!</p>
Data AnalysisFun StatisticsQuality ImprovementStatisticsFri, 26 Sep 2014 12:21:03 +0000http://blog.minitab.com/blog/understanding-statistics/using-before-and-after-control-charts-to-assess-a-care28099s-gas-mileageEston MartzNot Getting a No-Hitter? Statistically Speaking, the Best Bet Ever
http://blog.minitab.com/blog/the-statistics-game/not-getting-a-no-hitter-statistically-speaking2c-the-best-bet-ever
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/ca5dc4e25f623c98b4c0ab10d4eeba50/money_w640.png" style="width: 325px; height: 217px; float: right; margin: 10px 15px;" />The no-hitter is one of the most impressive feats in baseball. It’s no easy task to face at least 27 batters without letting one of them get a hit. So naturally, no-hitters don’t occur very often. In fact, since 1900 there has been an average of only about 2 no-hitters per year.</p>
<p>But what if you had the opportunity to bet that one <em>wouldn’t </em>occur?</p>
<p>That’s exactly what happened to sportswriter C. Trent Rosecrans. He had a friend who kept insisting the Reds would be no-hit this season. And with 24 games left in the season, the friend put his money where his mouth is, betting Mr. Rosecrans <a href="http://www.cincinnati.com/story/redsblog/2014/09/17/bar091714/15767373/">$5 that the Reds would be no-hit</a> by the end of the year.</p>
<p>Even if the Reds <em>do </em>have one of the worst hitting percentages in baseball, would you take the bet that in 24 games there won’t be an event that occurs only twice in an entire year?</p>
<p>Sounds like a no-brainer.</p>
Calculating the odds
<p>Back in 2012, I <a href="http://blog.minitab.com/blog/the-statistics-game/the-odds-of-throwing-a-perfect-game">calculated that the odds of throwing a no-hitter</a> were approximately 1 in 1,548. If you update that number to include all the games and no-hitters that have occurred since 2012, the odds become 1 in 1,562. The numbers are very similar, but we’ll use the latter since it incorporates more data.</p>
<p>So there is a 99.936% chance that a no-hitter does not occur in any single game. But the bet was that it wouldn’t occur in 24 games. What are Mr. Rosecrans' chances of winning the bet?</p>
<p align="center"><strong>24 games without a no-hitter</strong> = .99936^24 = .98475 = approximately <strong>98.475%</strong></p>
<p>I wish <em>I</em> could make bets with a winning percentage that was that high! For Mr. Rosecrans, 98.475% of the time he’ll win $5, and 1.525% of the time he’ll lose $5. For his friend, the opposite is true. We can use these numbers to calculate the expected value for each side of the bet.</p>
<p align="center">Reds don’t get no-hit: (0.98475*5) – (0.01525*5) = <strong>$4.85</strong></p>
<p align="center">Reds get no-hit: (.01525*5) – (0.98475*5) = <span style="color:#FF0000;"><strong>-$4.85</strong></span></p>
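<p>The arithmetic above is easy to check in a few lines of Python. This sketch assumes, as the calculation does, that every game is independent and carries the same 1-in-1,562 chance of a no-hitter. The function and variable names are just for illustration.</p>

```python
P_NO_HITTER = 1 / 1562   # observed single-game odds quoted above
STAKE = 5.00             # dollars each side puts up

def prob_no_no_hitter(games, p_single=P_NO_HITTER):
    """Probability that no no-hitter occurs in `games` independent games."""
    return (1 - p_single) ** games

p_win = prob_no_no_hitter(24)    # Mr. Rosecrans wins if the Reds are never no-hit
p_lose = 1 - p_win

# Expected value: win $5 with probability p_win, lose $5 with probability p_lose.
ev_rosecrans = p_win * STAKE - p_lose * STAKE
ev_friend = -ev_rosecrans        # an even-money bet is zero-sum

print(f"P(no no-hitter in 24 games) = {p_win:.5f}")
print(f"Expected value for Mr. Rosecrans: ${ev_rosecrans:.2f}")
print(f"Expected value for his friend:    ${ev_friend:.2f}")
```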
Making it a fair bet
<p>Obviously this was just a friendly wager and was not meant to be taken too seriously. If Mr. Rosecrans regularly made bets with expected values close to $5 with all of his friends, he probably wouldn’t have many left. But what if he wanted to be a <em>nice </em>friend? How much money should he have offered in return to make it a fair bet? We’ll simply set the expected value to 0 and solve for the amount of money he’d lose the 1.525% of the time the Reds were no-hit.</p>
<p align="center">0 = (0.98475*5) – (0.01525*X)</p>
<p align="center">0.01525*X = 4.92375</p>
<p align="center">X = $322.87</p>
<p>To make the bet fair, Mr. Rosecrans should offer to pay his friend $322.87 if the Reds get no-hit. And earlier this week the Reds didn’t get their first hit until the 8th inning. Imagine sweating out <em>that </em>game if you had over $300 on the line!</p>
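<p>Here is the same break-even calculation as a small Python sketch. It uses the rounded probabilities from above so that it matches the figure in the text. The function name is made up for this example.</p>

```python
P_WIN = 0.98475    # rounded probability that the Reds are not no-hit
P_LOSE = 0.01525   # rounded probability that the Reds are no-hit
WIN_AMOUNT = 5.00  # dollars Mr. Rosecrans collects if he wins

def fair_payout(p_win, p_lose, win_amount):
    """Amount to risk so that the expected value of the bet is zero.

    Solves 0 = p_win * win_amount - p_lose * X for X.
    """
    return p_win * win_amount / p_lose

payout = fair_payout(P_WIN, P_LOSE, WIN_AMOUNT)
print(f"Fair payout if the Reds are no-hit: ${payout:.2f}")
```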
Adjusting for the Reds
<p>One of the reasons the friend bet on the Reds to be no-hit was that they are one of the worst-hitting teams in their league. Their batting average of 0.238 is ranked 28th in baseball. That means, on average, a Reds batter <em>won’t</em> get a hit 76.2% of the time. So if a pitcher wanted to no-hit the Reds, they would need to face at least 27 batters who didn’t get a hit.</p>
<p align="center"><strong>Probability of having 27 straight batters not have a hit</strong> = 0.762^27 = 0.00065 = <strong>approx. 1 in 1,539</strong></p>
<p>But remember, just because a batter doesn’t get a hit does not mean they’re out. They can get walked, hit by a pitch, or reach on an error. Unless they pitch a perfect game, the pitcher will face more than 27 batters. Let’s look at how the probability changes as we increase the number of Reds batters that the pitcher must face without allowing a hit.</p>
<p align="center"><strong>Probability of having 28 straight batters not have a hit</strong> = 0.762^28 = <strong>approx. 1 in 2,020</strong></p>
<p align="center"><strong>Probability of having 29 straight batters not have a hit</strong> = 0.762^29 = <strong>approx. 1 in 2,650</strong></p>
<p align="center"><strong>Probability of having 30 straight batters not have a hit</strong> = 0.762^30 = <strong>approx. 1 in 3,478</strong></p>
<p align="center"><strong>Probability of having 31 straight batters not have a hit</strong> = 0.762^31 = <strong>approx. 1 in 4,565</strong></p>
<p>This was <em>supposed</em> to show that because they are a poor-hitting team, the Reds have a better chance of being no-hit than the average used above. But as you can see, that’s not the case at all. Despite being one of the worst-hitting teams in the league, it appears that it’s <em>harder</em> to no-hit the Reds than the historical average.</p>
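<p>These figures can be reproduced with a short Python sketch. Like the calculation above, it treats the team batting average as the chance that any single batter gets a hit and assumes batters are independent, which is a simplification. The function name is only for illustration.</p>

```python
REDS_BATTING_AVERAGE = 0.238

def one_in(batting_average, batters):
    """Return N such that the odds of `batters` straight hitless batters are 1 in N."""
    p_all_hitless = (1 - batting_average) ** batters
    return 1 / p_all_hitless

for batters in range(27, 32):
    odds = one_in(REDS_BATTING_AVERAGE, batters)
    print(f"{batters} straight batters without a hit: approx. 1 in {odds:,.0f}")
```

<p>Each additional batter divides the probability by about 1.31 (that is, 1/0.762), which is why the odds lengthen so quickly once walks and errors push the total past 27.</p>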
<p>Things get even odder when you consider that the historical league batting average (according to <a href="http://www.baseball-reference.com/leagues/MLB/bat.shtml">Baseball-Reference.com</a>) is 0.263. Using that number, the odds of having 27 straight batters not have a hit are 1 in 3,788. And those odds get even longer as you increase the number of batters the pitcher has to face. Applying this probability to the number of games played since 1900, we would expect there to be fewer than 100 no-hitters. And how many have there been? <em>241</em>!</p>
<p>This is the same conundrum I encountered when finding <a href="http://blog.minitab.com/blog/the-statistics-game/the-odds-of-throwing-a-perfect-game-part-ii">the odds of throwing a perfect game</a>. The number of perfect games and no-hitters that have occurred is <em>much higher</em> than what we would expect based on historical batting statistics. One explanation could be pitching from the wind-up vs. the stretch. With no runners on base (which is always the case in a perfect game and often the case in a no-hitter), the pitcher can always throw from the wind-up. Assuming pitchers are better when pitching from the wind-up, this would result in a lower batting average than normal, thus explaining the higher number of perfect games and no-hitters. This would make for a great analysis using Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a>, but since we can’t separate the data on hand into at bats facing pitchers throwing from the stretch vs. the wind-up, we can't test the theory.</p>
<p>Since the Reds have a batting average 25 points (0.025) lower than the historical average, it’s probably safe to assume they do in fact have a greater chance of being no-hit. The problem is that it’s nearly impossible to quantify how much greater!</p>
Looking ahead to next year
<p>With the season almost over, it’s unlikely the Reds will be no-hit this year. But what if the two friends decided to do their bet again next year, only this time at the start of the season? Let’s use our original probability of throwing a no-hitter (the one we’ve observed) and determine what the odds are that the Reds go 162 games getting at least one hit per game.</p>
<p align="center"><strong>162 games without a no-hitter</strong> = .99936^162 = .9015 = approximately <strong>90.15%</strong></p>
<p>The probability of the Reds getting no-hit is still pretty low, but it’s a lot better than the current bet. I just hope next year the friend gets some better odds than even money!</p>
Data AnalysisFun StatisticsStatistics in the NewsFri, 19 Sep 2014 13:35:15 +0000http://blog.minitab.com/blog/the-statistics-game/not-getting-a-no-hitter-statistically-speaking2c-the-best-bet-everKevin RudySwitch the Inner and Outer Categories on a Bar Chart
http://blog.minitab.com/blog/statistics-and-quality-improvement/switch-the-inner-and-outer-categories-on-a-bar-chart
<p>Did you just go shopping for school supplies? If you did, you’ve participated in what’s become the second biggest spending season of the year in the United States, according to the National Retail Federation (NRF). <img alt="Kids running in backpacks" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/b8bd8f188623299c5be19c86c99c2433/backpacks.jpg" style="float: right; width: 300px; height: 170px; margin: 10px 15px;" /></p>
<p>The trends and analysis are so interesting to the NRF that they actually add questions about back-to-school shopping to two monthly consumer surveys. The two surveys have different questions, but there’s one case where the allowed responses are the same. In July, the survey asked, “Where will you purchase back-to-school items this year?” In August, the survey asked, “Where do you anticipate you will do the remainder of your Back-to-School shopping?”</p>
<p>Did people give the same answers both times? Let’s use <a href="http://www.minitab.com/en-us/products/minitab/features/">Minitab Statistical Software</a> to find out. Doing so will give us a chance to see how easy it is to change the focus of a chart by switching the inner and outer categories on a bar chart.</p>
<strong>Did people answer the same way in both surveys? Yes.</strong>
<p>Let’s say that your data are in the same layout as the original NRF reports. Each row contains the percentage for a different location. I put the dates in two different columns because the numbers came from two different PDF files (<a href="https://nrf.com/sites/default/files/BTS%207-09-14%20press.pdf">July</a> and <a href="https://nrf.com/sites/default/files/Documents/BTS%20Update%208-2014.pdf">August</a>).</p>
<p><img alt="Percentages of people who said that they would shop at each location." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/5b7d784b09b8b949c0ca3ebcfebaea72/data_window.png" style="width: 313px; height: 202px;" /></p>
<p>Making a bar chart in Minitab is easy, so follow along if you like:</p>
<ol>
<li>Choose <strong>Graph > Bar Chart</strong>.</li>
<li>In <strong>Bars represent</strong>, select <strong>Values from a table</strong>.</li>
<li>Under <strong>Two-way table</strong>, select <strong>Cluster</strong>. Click <strong>OK</strong>.</li>
<li>In <strong>Graph variables</strong>, enter <em>'7/1 to 7/8 2014' '8/5 to 8/12 2014'</em></li>
<li>In <strong>Row labels</strong>, enter <em>'Where will you purchase?'</em> Click <strong>OK</strong>.</li>
</ol>
<p>From this display, you can quickly determine that the order of the categories is the same in each survey. In both cases, consumers plan to shop the most at discount stores and the least from catalogs. In fact, the locations rank in the same order of popularity for where consumers planned to shop and where they planned to finish shopping.</p>
<p><img alt="With month outermost, you can see that the popularity of the categories is the same in both surveys." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/c79ca1158307cd23045fd40ebb404645/outermost_month.png" style="width: 576px; height: 384px;" /></p>
<strong>Did people answer the same way in both surveys? No.</strong>
<p>The order of popularity might not be all that you want to know from these data. Minitab makes it easy for you to get another view of the data. You can quickly switch which category is the inner category and which is the outer category.</p>
<ol>
<li>Press CTRL + E.</li>
<li>In <strong>Table arrangement</strong>, select <strong>Rows are outermost categories and columns are innermost</strong>. Click <strong>OK</strong>.</li>
<li>Double-click one of the bars in the graph.</li>
<li>Select the <strong>Groups</strong> tab. Check <strong>Assign attributes by graph variables</strong>. Click <strong>OK</strong>.</li>
<li>Double-click one of the category labels on the bottom of the graph.</li>
<li>Select the <strong>Show</strong> tab. In the <strong>Show Labels By Scale Level</strong> table, uncheck <strong>Tick labels</strong> for <strong>Graph variables</strong>. Click <strong>OK</strong>.</li>
</ol>
<p><img alt="With categories outermost, you can see which locations have the biggest change between the two surveys." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/e35c857c8b737666102ca6e14e7f3647/location_outermost_2.png" style="width: 576px; height: 384px;" /></p>
<p>In this display, you can easily see the change for each location between the two questions. For every location, the percentage of people who reported that they planned to shop there on the first survey is higher than the percentage who planned to finish shopping there on the second survey.</p>
<p>This result seems reasonable. One possible explanation is that people finished their shopping at some locations. In terms of the difference in the percentages, those who plan to shop for school items at clothing stores and electronics stores changed the most. Customers who finished shopping at a location seem to have finished at those types of locations the earliest.</p>
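<p>For readers who want to check the same comparison outside of Minitab, here is a minimal Python sketch of switching which category is outermost and then computing the change per location. The percentages are hypothetical placeholders, not the NRF figures, and the function names are my own.</p>

```python
# Hypothetical percentages for illustration only; the real values are in the NRF reports.
surveys = {
    "7/1 to 7/8 2014": {"Discount store": 60.0, "Clothing store": 50.0, "Catalog": 10.0},
    "8/5 to 8/12 2014": {"Discount store": 55.0, "Clothing store": 38.0, "Catalog": 8.0},
}


def by_location(table):
    """Regroup {survey: {location: pct}} as {location: {survey: pct}}."""
    regrouped = {}
    for survey, row in table.items():
        for location, pct in row.items():
            regrouped.setdefault(location, {})[survey] = pct
    return regrouped


def change_between(table, first, second):
    """Percentage-point drop from the first survey to the second, per location."""
    return {loc: vals[first] - vals[second] for loc, vals in by_location(table).items()}


changes = change_between(surveys, "7/1 to 7/8 2014", "8/5 to 8/12 2014")

# List locations from the largest drop to the smallest.
for location, drop in sorted(changes.items(), key=lambda item: item[1], reverse=True):
    print(f"{location}: {drop:.1f} points")
```

<p>Regrouping by location plays the same role as making the rows the outermost categories: the two survey values for each location end up side by side, so the differences are easy to read.</p>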
<strong>Wrap up</strong>
<p>When you’re looking at data, discovering what’s important often involves looking at the data from more than one perspective. Fortunately, Minitab’s bar chart makes it easy for you to change the focus of the categories so that you can dig deeper, faster. It’s nice to know that the information that you need is so readily available!</p>
<p><strong>Bonus</strong></p>
<p>I set up my data as values from a table today. Want to see what the other two options do? Check out <a href="http://blog.minitab.com/blog/quality-data-analysis-and-statistics/bar-charts-decoded">Choosing what the bars in your chart represent!</a></p>
The image of the children running in backpacks is from healthinhandkelowna.blogspot.com and is licensed under this <a href="https://creativecommons.org/licenses/by/2.0/">Creative Commons License</a>.
Statistics in the News
Fri, 05 Sep 2014 16:04:43 +0000
http://blog.minitab.com/blog/statistics-and-quality-improvement/switch-the-inner-and-outer-categories-on-a-bar-chart
Cody Steele
Analyzing NFL Ticket Prices: How Much Would You Pay to See the Green Bay Packers?
http://blog.minitab.com/blog/the-statistical-mentor/analyzing-nfl-ticket-prices3a-how-much-would-you-pay-to-see-the-green-bay-packers
<p><span style="line-height: 1.6;">The 2014-15 NFL season is only days away, and fans all over the country are planning their fall weekends accordingly. In this post, I'm going to use data analysis to answer some questions related to ticket prices, such as:</span></p>
<ul>
<li>Which team is the least/most expensive to watch at home? </li>
<li>Which team is the least/most expensive to watch on the road? </li>
<li>If you are thinking of a road trip, which stadiums offer the largest ticket discount for your team?<img alt="Football stadium crowd" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/5781b1d52907305361bd13535983580b/stadium.jpg" style="float: right; width: 350px; height: 269px; border-width: 1px; border-style: solid; margin: 10px 15px;" /></li>
</ul>
<p>For dedicated fans, this is far from a trivial matter. As we'll see, fans of one team can get an average 48% discount on road-game tickets, while fans of two other teams will pay, on average, more than double the cost to see their team on the road.</p>
Gathering and Preparing NFL Ticket Price Data
<p>The data I'm analyzing comes from StubHub, an online ticket marketplace owned by eBay. You'll find a summary of the number of StubHub tickets available and the minimum price on StubHub for each NFL game in 2014 on the ESPN website: <a href="http://espn.go.com/nfl/schedule/_/seasontype/2/week/1">http://espn.go.com/nfl/schedule/_/seasontype/2/week/1</a></p>
<p><img alt="snapshot of NFL data from ESPN.com" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/0844140f9ada17966cf7a63eda771c6a/nfl_data.jpg" style="width: 600px; height: 384px;" /></p>
<p>I did a quick copy-and-paste from ESPN into Excel to put each variable nicely into a column, and then another copy-and-paste to move the data into Minitab <a href="http://www.minitab.com/products/minitab">Statistical Software</a> to prepare it for analysis. I used the Left() and Right() functions in <a href="http://blog.minitab.com/blog/understanding-statistics/three-ways-to-get-more-out-of-your-text-data"><strong>Calc > Calculator</strong></a> to extract the minimum ticket price, the first few letters of the away team name, and the first few letters of the home team name. (Since the summary on ESPN.com only shows the minimum price, the analysis below is based only on the minimum ticket price available for each game.)</p>
Which Is the Most Expensive Team to See on the Road?
<p>The Bar Chart below shows that Green Bay is the most expensive road team to watch play with a 2014 average price of $145 per road game. This is noticeably higher than the other NFL teams. The next closest is San Francisco with an average price of $128 per road game. But catching a Jacksonville road game is a fraction of those costs, averaging $48. </p>
<p><img alt="Bar Chart of Average Minimum Price for Away Team 2014 NFL Season" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/b7a4e021aea84b3bf1b37ee619afc93c/avg_min_price_away_team_2014_season.jpg" style="width: 586px; height: 390px;" /></p>
Which Is the Most Expensive Team to See at Home?
<p>The Bar Chart below shows that Chicago is the most expensive team to watch play on their home turf, with a 2014 average price of $175 per home game. Seattle is a close second with an average price of $171 per home game. Seeing Dallas or St. Louis in a home game is a fraction of those costs, averaging just $35. </p>
<p><img alt="Bar Chart of Average Minimum Price for Home Team 2014 NFL Season" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/27f5c44c660d02e27573ca4cd4366632/avg_min_price_home_team_2014_season.jpg" style="width: 580px; height: 387px;" /></p>
Is It Cheaper to See Your Favorite Team on the Road?
<p>Finally, I compared the average home game ticket price to the average road game ticket price for each NFL team.</p>
<p>The road team discount award goes to the Seattle Seahawks. You'll save, on average, 48% watching their games on the road. But if you're a fan of Dallas or Miami, you'll be financially better off watching your team at home, because their average price increases by more than 110% when they're on the road. One factor that drives this result is the popularity of Dallas and Miami across the country: the higher demand supports their higher road-game price. Also, Dallas' enormous home stadium (AT&amp;T Stadium) offers cheap Party Pass seats (which aren't really seats at all, but rather a standing room section). </p>
<p><img alt="Is It Cheaper to See Your Favorite NFL Team on the Road? " src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/8063637683d8b39c1fcc4083965e7428/cheaper_on_the_road.jpg" style="width: 582px; height: 387px;" /></p>
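<p>The comparison behind this chart is a simple percent change from the average home price to the average road price. Here is a minimal Python sketch of that arithmetic. The function name and the sample prices are hypothetical and are not taken from the StubHub data.</p>

```python
def road_change_pct(avg_home, avg_road):
    """Percent change in average minimum price when a team plays on the road.

    Negative values mean a road discount; positive values mean the road costs more.
    """
    return (avg_road - avg_home) / avg_home * 100.0


# Hypothetical prices for illustration only (not from the StubHub data).
print(road_change_pct(100.0, 52.0))  # a road discount of about 48%
print(road_change_pct(40.0, 84.0))   # an increase of about 110%, more than double
```

<p>An increase of more than 100% is the same thing as the road price being more than double the home price.</p>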
<p>One drawback with this analysis is it doesn't take into account the opponent that each team faces. For example, Chicago may happen to be playing some very popular teams at home in 2014, which drives their home-game ticket prices up for this season.</p>
<p>In a future post, I'll discuss how to adjust for opponents and other variables such as game day and game time.</p>
Tue, 26 Aug 2014 12:00:00 +0000
http://blog.minitab.com/blog/the-statistical-mentor/analyzing-nfl-ticket-prices3a-how-much-would-you-pay-to-see-the-green-bay-packers
Jim Colton
Use a Line Plot to Show a Summary Statistic Over Time
http://blog.minitab.com/blog/statistics-and-quality-improvement/use-a-line-plot-to-show-a-summary-statistic-over-time
<p><img alt="Terrorist Attacks, 2013, Concentration and Intensity" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/2ecbcca8429152afb73d991b4a532f5a/start_globalterrorismdatabase_2013terroristattacksconcentrationintensitymap_w640.png" style="width: 500px; height: 216px;" /></p>
<p>If you’re already a strong user of Minitab Statistical Software, then you’re probably familiar with <a href="https://blog.minitab.com/blog/starting-out-with-statistical-software/investigating-starfighters-with-bar-charts3a-function-of-a-variable">how to use bar charts to show means</a>, medians, sums, and other statistics. Bar charts are excellent tools, but they are traditionally used when you want all of your categorical variables to have different sections on the chart. When you want to plot statistics with groups that flow directly from one category to the next, look no further than Minitab’s <a href="http://www.minitab.com/en-us/Support/Tutorials/Minitab-s-Line-Plots/">line plots</a>. I particularly like line plots when I want to use time as a category, because I prefer the connected-line display to separated bars.</p>
<p>I like to illustrate Minitab with data about pleasant subjects: <a href="https://blog.minitab.com/blog/statistics-and-quality-improvement/practicing-data-analysis-get-some-fun-data-into-minitab-v1">poetry</a>, <a href="https://blog.minitab.com/blog/statistics-and-quality-improvement/gummi-bear-measurement-systems-analysis-msa-the-gage-randr-study">candy</a>, and maybe even <a href="https://blog.minitab.com/blog/statistics-and-quality-improvement/process-capability-statistics-cp-and-cpk-working-together">the volume of ethanol in E85 fuel</a>. Data that are about unpleasant subjects also exist, and we can learn from that data too. We’re fortunate to have both the <a href="http://cpost.uchicago.edu/">Chicago Project on Security and Terrorism</a> (CPOST) and the <a href="http://www.start.umd.edu/">National Consortium for the Study of Terrorism and Responses to Terrorism</a> (START) working hard to produce publicly-accessible databases with information about terrorism.</p>
<p>START has been sharing <a href="http://www.start.umd.edu/news/majority-2013-terrorist-attacks-occurred-just-few-countries">analyses of its 2013 data</a> recently. The new data prompted staff from the two institutions to engage in an interesting debate on the Washington Post’s website about whether the Global Terrorism Database (GTD) that START maintains “<a href="http://www.washingtonpost.com/blogs/monkey-cage/wp/2014/08/15/global-terrorism-data-show-that-the-reach-of-terrorism-is-expanding/">exaggerates a recent increase in terrorist activities</a>.” For today, I’m just going to use the GTD to demonstrate a nice line plot in Minitab, which will give a tiny bit of insight into what that debate is about.</p>
<p>When you <a href="http://www.start.umd.edu/gtd/contact/">download the GTD data</a>, you can open one file that has all of the data except for the year 1993. Incident-level data for 1993 was lost, so that year is not included, although you can get country-level totals for numbers of attacks and casualties from the <a href="http://www.start.umd.edu/gtd/downloads/Codebook.pdf">GTD Codebook</a>. Those who maintain the GTD <a href="http://www.start.umd.edu/gtd/using-gtd/">recommend</a> “users should note that differences in levels of attacks and casualties before and after January 1, 1998, before and after April 1, 2008, and before and after January 1, 2012 are at least partially explained by differences in data collection” (START, downloaded August 18th, 2014).</p>
<p>The GTD is great for detail. One column it contains records a 1 if an event was a suicide attack and a 0 if it was not, which makes it easy to sum that column so that you can see the number of suicide attacks per year. Absent from the data is a column that references the changes in methodology, but we can easily add this column in Minitab. Without a methodology column, it’s easy to end up with the <a href="http://www.washingtonpost.com/blogs/monkey-cage/wp/2014/07/21/government-data-exaggerate-the-increase-in-terrorist-attacks/">recently-criticized</a> graph that started the debate between the staff at the two institutions. The graph shows all of the data in the GTD for <a href="http://warontherocks.com/2014/06/infographic-suicide-terrorism-past-and-present/">the number of suicide attacks for each year since 1970</a>. It looks a bit like this:</p>
<p><img alt="The number of suicide attacks increases dramatically in the past two years." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/87aeffb3b5c3391d19070bc9fbc6f717/all_gtd_data.jpg" style="width: 576px; height: 384px;" /></p>
<p>The message of this graph is that the number of suicide attacks has never been higher. The criticism about the absence of the different methodologies seems fair. So how would we capture the different methodologies in Minitab? With a calculator formula, of course. Try this, if you’re following along:</p>
<ol>
<li>Choose <strong>Calc > Calculator</strong>.</li>
<li>In <strong>Store result in variable</strong>, enter <em>Methodology</em>.</li>
<li>In <strong>Expression</strong>, enter:</li>
</ol>
<p><em>if(iyear < 1998, 1, iyear < 2009, 2, iyear=2009 and imonth < 4, 2, iyear < 2012, 3, 4)</em></p>
<ol>
<li value="4">Click <strong>OK</strong>.</li>
</ol>
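<p>If it helps to see the logic of that calculator expression spelled out, here is a minimal Python sketch that mirrors it condition by condition, exactly as written above. The function name is my own; <em>iyear</em> and <em>imonth</em> are the GTD column names used in the expression.</p>

```python
def methodology(iyear, imonth):
    """Mirror the calculator expression: the first condition that is true wins."""
    if iyear < 1998:
        return 1
    if iyear < 2009:
        return 2
    if iyear == 2009 and imonth < 4:
        return 2
    if iyear < 2012:
        return 3
    return 4


# A few spot checks around each boundary in the expression.
for year, month in [(1997, 12), (1998, 1), (2009, 3), (2009, 4), (2011, 12), (2012, 1)]:
    print(year, month, methodology(year, month))
```

<p>Because the conditions are checked in order, each later condition only applies to rows that failed all of the earlier ones, which is why the expression does not need explicit lower bounds.</p>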
<p>Notice that because the GTD uses 3 separate columns to record the dates, I’ve used two conditions to identify the second methodology. With the new column, you can easily divide the data series trends according to the method for counting events. This is where the line plot comes in. The line plot is the easiest way in Minitab to plot a summary statistic with time as a category. You can try it this way:</p>
<ol>
<li>Choose <strong>Graph > Line Plot</strong>.</li>
<li>Select <strong>With Symbols</strong>, <strong>One Y</strong>. Click <strong>OK</strong>.</li>
<li>In <strong>Function</strong>, select <strong>Sum</strong>.</li>
<li>In <strong>Graph variables</strong>, enter <em>suicide</em>.</li>
<li>In <strong>Categorical variable for X-scale grouping</strong>, enter <em>iyear</em>.</li>
<li>In <strong>Categorical variable for legend grouping</strong>, enter <em>Methodology</em>.</li>
</ol>
<p>You’ll get a graph that looks a bit like this, though I already edited some labels.</p>
<p><img alt="The last two years, which are dramatically higher in number, have a new methodology." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/22959d1fc87b72c4ed837c82ef3f1b7a/number_of_attacks_divided.jpg" style="width: 576px; height: 384px;" /></p>
<p>One interesting feature of this line plot is that there are two data points for 2009. Because we’re calling attention to the different methodologies, it’s important to consider that the first quarter and the last 3 quarters of 2009 use different methodologies. In this display, we can see the mixture of methodologies. The fact that the two highest points are from the newest methodology also lends some credence to the question of whether the numbers from 2012 and 2013 should be directly compared to numbers from earlier years. The amount of the increase due to better data collection is not clear.</p>
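<p>The summary that the line plot draws is a sum of the 0/1 <em>suicide</em> column within each combination of year and methodology. A rough Python sketch of that grouping, using a handful of hypothetical incident rows rather than the real GTD file, shows why a year that spans two methodologies produces two points.</p>

```python
from collections import defaultdict


def sum_by_year_and_method(rows):
    """Sum the 0/1 suicide column within each (iyear, Methodology) group."""
    totals = defaultdict(int)
    for iyear, method, suicide in rows:
        totals[(iyear, method)] += suicide
    return dict(totals)


# Hypothetical incident rows: (iyear, Methodology, suicide). Not real GTD records.
rows = [
    (2009, 2, 1),
    (2009, 2, 0),
    (2009, 3, 1),
    (2009, 3, 1),
    (2012, 4, 1),
]

totals = sum_by_year_and_method(rows)
for (iyear, method), count in sorted(totals.items()):
    print(iyear, method, count)
```

<p>The year 2009 appears under two group keys, one for each methodology, so it contributes one point to each line.</p>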
<p>Interestingly, a line plot that shows the proportion of suicide attacks out of all terrorist attacks presents a different picture about the increase related to the different methodologies. That’s what you get if you make a line plot of the means instead of the sums.</p>
<p><img alt="By proportion, the increase in suicide attacks in the last two years does not look as dramatic." src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/22791f44-517c-42aa-9f28-864c95cb4e27/Image/7cd0512a55b4876dac769278eba9d90c/proportion_of_attacks_divided.jpg" style="width: 576px; height: 384px;" /></p>
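<p>The reason the mean works here is that the mean of a 0/1 column is the proportion of rows coded 1. A small Python sketch with hypothetical indicator values (not GTD data) shows how the sum and the mean can tell different stories.</p>

```python
def proportion(indicator_values):
    """Mean of a 0/1 column, which is the share of rows coded 1."""
    return sum(indicator_values) / len(indicator_values)


# Hypothetical suicide indicators for two years, for illustration only.
fewer_attacks = [1, 0, 0, 0]               # 1 suicide attack out of 4 attacks
more_attacks = [1, 1, 0, 0, 0, 0, 0, 0]    # 2 suicide attacks out of 8 attacks

print(sum(fewer_attacks), proportion(fewer_attacks))
print(sum(more_attacks), proportion(more_attacks))
```

<p>In this made-up case the count doubles while the proportion stays the same, which is the kind of difference between the two line plots that is worth checking before interpreting a trend.</p>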
<p>Considering which statistics to compute and how to interpret them in conjunction with one another is an important task for people doing data analysis. In the final installment of the series on the Washington Post’s website, GTD staff members note that they do not “rely solely on global aggregate percent change statistics when assessing trends.” The flexibility of the line plot to show different statistics can make the work of considering the data from different perspectives much easier.</p>
<p>We do like to have fun at the Minitab Blog, but we know that there’s serious data in the world too. Whether your application is <a href="http://www.minitab.com/en-us/Case-Studies/Bridgestone/">making tires that keep people safe on the road</a> or <a href="http://www.minitab.com/en-us/Case-Studies/Northern-Sydney-Central-Coast-Health-Service/">helping people recover from wounds</a>, our goal is to give you the best possible tools to make your process improvement efforts successful.</p>
Statistics in the News
Wed, 20 Aug 2014 15:48:18 +0000
http://blog.minitab.com/blog/statistics-and-quality-improvement/use-a-line-plot-to-show-a-summary-statistic-over-time
Cody Steele
Using the G-Chart Control Chart for Rare Events to Predict Borewell Accidents
http://blog.minitab.com/blog/statistics-in-the-field/using-the-g-chart-control-chart-for-rare-events-to-predict-borewell-accidents
<p><em>by Lion "Ari" Ondiappan Arivazhagan, guest blogger</em></p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/ac11ba7bc8daa85327ad905ba5dc5f96/borewell_screencap.jpg" style="margin: 10px 15px; width: 400px; height: 283px; float: right;" />In India, we've seen this story far too many times in recent years:</p>
<p>Timmanna Hatti, a six-year-old boy, was trapped in a 160-foot borewell for more than 5 days in Sulikeri village of Bagalkot district in Karnataka after falling into the well. Perhaps the most heartbreaking aspect of the situation was the decision of the Bagalkot district administration to stop the rescue operation because the digging work, if continued further, might lead to the collapse of the vertical wall created by the side of the borewell within which Timmanna had struggled for his life.</p>
<p><a href="http://timesofindia.indiatimes.com/city/mysore/8-days-on-boys-body-pulled-out/articleshow/40082590.cms?" target="_blank">Timmanna's body was retrieved from the well 8 days after he fell in</a>. Sadly, this is just one of an alarming number of borewell accidents, especially involving little children, across India in the recent past.</p>
<p>This most recent event prompted me to conduct a preliminary study of borewell accidents across India in the last 8-9 years.</p>
Using Data to Assess Borewell Accidents
<p>My main objective was to find out the possible causes of such accidents and to assess the likelihood of such adverse events based on the data available to date.</p>
<p>This very preliminary study has heightened my awareness of a lot of uncomfortable and dismaying factors involved in these deadly incidents, including the pathetic circumstances of many rural children and carelessness on the part of many borewell contractors and farmers.</p>
<p>In this post, I'll lead you through my analysis, which concludes with the use of a G-chart for the possible prediction of the next such adverse event, based on Geometric distribution probabilities.</p>
Collecting Data on Borewell Accidents
<p>My search of newspaper articles and Google provided details about a total of 34 borewell incidents since 2006. The actual number of incidents may be higher, since many incidents go unreported. The table below shows the total number of borewell cases reported each year between 2006 and 2014.</p>
<p><img alt="Borewell Accident Summary Data" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/9e60f3b9c08b0125a38b30d717e1acb8/borewell_g_chart_table_2.jpg" style="width: 189px; height: 289px;" /></p>
Summary Analysis of the Borewell Accident Data
<p>First, I used Minitab to create a histogram of the data I'd collected, shown below.</p>
<p>A quick review of the histogram reveals that out of 34 reported cases, the highest number of accidents occurred in the years 2007 and 2014.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/23338847757384f399eb013afe81191f/borewell_histogram_of_accidents.jpg" style="width: 500px; height: 334px;" /></p>
<p>The ages of children trapped in the borewells ranged from 2 years to 9 years. More boys (21) than girls (13) were involved in these incidents.</p>
<p>What hurts most is that, in this modern India, more than 70% of the children did not survive the incident. They died either in the borewell itself or in the hospital after the rescue. Only about 20% of children (7 out of 34) have been rescued successfully. The ultimate status of 10% of the cases reported is not known.</p>
Pie Chart of Borewell Incidents by Indian State
<p>Analysis of a state-wise pie chart, shown below, indicates that Haryana, Gujarat, and Tamil Nadu top the list of the borewell accident states. These three states alone account for more than 50% of the borewell accidents since 2006.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/8466766e4788ea2d73b7d8672692be4d/borewell_pie_chart.jpg" style="width: 500px; height: 334px;" /></p>
Pareto Chart for Vital Causes of Borewell Accidents
<p>I used a <a href="http://blog.minitab.com/blog/michelle-paret/fast-food-and-identifying-the-vital-few">Pareto chart</a> to analyze the various causes of these borewell accidents, which revealed the top causes of these tragedies:</p>
<ol>
<li>Children accidentally falling into open borewell pits while playing in the fields.</li>
<li>Abandoned borewell pits not being properly closed / sealed.</li>
</ol>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/8012effc2a1aa662d5a276d487e55954/borewell_pareto_chart_w640.jpeg" style="width: 500px; height: 335px;" /></p>
Applying the Geometric Distribution to Rare Adverse Events
<p>There are many different types of control charts, but for rare events, we can use <a href="http://www.minitab.com/products/minitab">Minitab Statistical Software</a> and the G chart. Based on the geometric distribution, the G chart is designed specifically for monitoring rare events. In the geometric distribution, we count the number of opportunities before or until the defect (adverse event) occurs.</p>
<p>The figure below shows the geometric probability distribution of days between such rare events if the probability of the event is 0.01. As you can see, the odds of an event happening 50 or 100 days after the previous one are much higher than the odds of the next event happening 300 or 400 days later.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/1587c05dd9a8d77bcda5be87bb2a748b/borewell_distribution_plot.jpg" style="width: 500px; height: 333px;" /></p>
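<p>The shape of this distribution follows directly from the geometric probability formula. Here is a minimal Python sketch, assuming a constant daily event probability of 0.01 and counting the number of event-free days before the next event. The function name is my own, and this is an illustration of the formula, not the Minitab output.</p>

```python
def geometric_pmf(days, p=0.01):
    """P(exactly `days` event-free days, then an event), with constant daily probability p."""
    return (1.0 - p) ** days * p


# Compare short gaps with long gaps between events.
for days in (50, 100, 300, 400):
    print(days, round(geometric_pmf(days), 5))
```

<p>Each additional event-free day multiplies the probability by 0.99, so the probabilities shrink steadily as the gap gets longer.</p>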
<p>By using the geometric distribution to plot the number of <a href="http://www.minitab.com/support/tutorials/monitoring-rare-events-with-g-charts/">days between rare events</a>, such as borewell accidents, the G chart can reveal patterns or trends that might enable us to prevent such accidents in future. In this case, we count the number of days between reported borewell accidents. One key assumption, when counting the number of days between the events, is that the number of accidents per day was fairly constant.</p>
A G-Chart for Prediction of the Next Borewell Accident
<p>Next, I used Minitab to create a G-chart for the analysis of the borewell accident data I collected, shown below.</p>
<p>Although the observations fall within the upper and lower control limits (UCL and LCL), the G chart shows a cluster of observations below the center line (the mean) after the 28th observation and before the 34th observation (the latest event). Overall, the chart indicates an unusually high rate of adverse events (borewell accidents) over the past decade.</p>
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/7571156e97822d68efe18af3225902e5/borewell_g_chart_date_between_events.jpg" style="width: 500px; height: 332px; border-width: 1px; border-style: solid;" /></p>
<p>Descriptive statistics based on the Gaussian distribution for my data show 90.8 days as the mean "days between events." But the G-chart, which is based on the geometric distribution and is more apt for studying the distribution of adverse events, indicates a Mean (CL) of only 67.2 days as "days between events."</p>
Predicting Days Between Borewell Accidents with a Cumulative Probability Distribution
<p>I used Minitab to create a cumulative distribution function for the data, using the geometric distribution with probability set at 0.01. This gives us some additional detail about how many incident-free days we're likely to have until the next borewell tragedy strikes: </p>
<p style="margin-left: 40px;"><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/77a56196f91723fca7f7e7222a815573/borewell_output.jpg" style="width: 290px; height: 640px;" /></p>
<p>Based on the above, we can reasonably predict when the next borewell accident is most likely to occur in any of the states included in the data, especially in the states of Haryana, Tamil Nadu, Gujarat, Rajasthan, and Karnataka.</p>
<p>The probabilities are shown below, with the assumption that the sample size and the Gage R&R / Measurement errors of event data reported and collected are adequate and within the allowable limits.</p>
<p><strong>Probability of next borewell event happening in...</strong></p>
<ul>
<li>31 days or less = 0.275020 = 27.5% approx.</li>
<li>104 days or less = 0.651907 = 65% approx.</li>
<li>181 days or less = 0.839452 = 84% approx.</li>
<li>488 days or less = 0.992661 = 99% approx.</li>
</ul>
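<p>These cumulative probabilities can be checked by hand. Below is a minimal Python sketch, assuming the geometric model counts the number of event-free days before the next event with a daily probability of 0.01. Under that assumption the cumulative probability for <em>k</em> days is 1 - (1 - p)^(k + 1), which agrees with the figures in the list above to the precision shown. The function name is my own.</p>

```python
def geometric_cdf(days, p=0.01):
    """P(at most `days` event-free days before the next event)."""
    return 1.0 - (1.0 - p) ** (days + 1)


# The four horizons used in the list of probabilities.
for days in (31, 104, 181, 488):
    print(days, round(geometric_cdf(days), 6))
```

<p>If you instead count the day of the event itself as part of the wait, the exponent becomes <em>k</em> rather than <em>k</em> + 1 and the probabilities come out slightly lower, so it is worth confirming which convention your software uses.</p>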
<p>My purpose in preparing this study would be fulfilled if enough people take preventive action before the next such adverse event, which is likely to occur within the next 6 months (p > 80%). NGOs, government officials, and individuals all need to take preventive actions, like sealing all open borewells across India and especially in the above 5 states, to prevent many more innocent children from dying while playing.</p>
<p><strong>About the Guest Blogger:</strong></p>
<p><em>Ondiappan "Ari" Arivazhagan is an honors graduate in civil / structural engineering from the University of Madras. He is a certified PMP, PMI-SP, PMI-RMP from the Project Management Institute. He is also a Master Black Belt in Lean Six Sigma and has done Business Analytics from IIM, Bangalore. He has 30 years of professional global project management experience in various countries and has almost 14 years of teaching / training experience in project management and Lean Six Sigma. He is the Founder-CEO of International Institute of Project Management (IIPM), Chennai, and can be reached at <a href="mailto:askari@iipmchennai.com?subject=Minitab%20Blog%20Reader" target="_blank">askari@iipmchennai.com</a>.</em></p>
<p><em>An earlier version of this article was published on LinkedIn. </em></p>
Data Analysis
Statistics in the News
Tue, 19 Aug 2014 12:00:00 +0000
http://blog.minitab.com/blog/statistics-in-the-field/using-the-g-chart-control-chart-for-rare-events-to-predict-borewell-accidents
Guest Blogger
How Accurate are Fantasy Football Rankings? Part II
http://blog.minitab.com/blog/the-statistics-game/how-accurate-are-fantasy-football-rankings-part-ii
<p>Previously, we looked at how accurate fantasy football rankings were <a href="http://blog.minitab.com/blog/the-statistics-game/how-accurate-are-fantasy-football-rankings">for quarterbacks and tight ends</a>. We found out that rankings for quarterbacks were quite accurate, with most of the top-ranked quarterbacks in the preseason finishing in the top 5 at the end of the season. Tight end rankings had more variation, with 36% of the top 5 preseason tight ends (over the last 5 years) actually finishing outside the top 10!</p>
<p><img alt="Cheat Sheet" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/14edab962b5c1df587e395a75459439b/2014_fantasy_football_cheat_sheat.jpg" style="float: right; width: 275px; height: 157px;" />Now it’s time to move our attention to the running backs and wide receivers. Just like before, I went back over the previous 5 seasons and found ESPN’s preseason rankings. For each season I recorded where the top preseason players finished at the end of the season, and also where the top players at the end of the season were ranked before the season started.</p>
<p>With quarterbacks and tight ends, I only looked at the top 5 players. But since more running backs and receivers are drafted, I’ll look at the top 10 players. Now let's analyze the data using <a href="http://www.minitab.com/products/minitab/">Minitab Statistical Software</a>. </p>
How did the top-ranked preseason RBs and WRs finish the season ranked?
<p>Let’s start by looking at how the top-rated preseason players fared at the end of the season. I took the top 10 ranked preseason RBs and WRs for each season from 2009-2013 and recorded where they ranked to finish the season. </p>
<p><img alt="IVP" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/f2db50de83552939b23088e7cb196a7b/ivp_rbs_wrs_preseason_w640.jpeg" style="width: 640px; height: 427px;" /></p>
<p><img alt="Describe" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/85fe1626bed20238726f327e27bb78af/describe_presason_rb_wr_w640.jpeg" style="width: 640px; height: 109px;" /></p>
<p>At first glance, the individual value plots show that the spread for running backs and wide receivers appears to be about the same. But the descriptive statistics tell a different story. The 3rd quartile value (Q3) is the most telling. 75% of preseason top 10 running backs finish in the top 18, while the corresponding cutoff rises all the way to 28.75 for wide receivers! In fact, 32% of wide receivers ranked in the top 10 in the preseason finished the season outside the <em>top 20</em>, while the same was only true for 24% of running backs. Running backs do have the biggest outlier (when Ryan Grant had a season-ending injury in his first game of 2010 and finished as the 126th ranked running back), but injuries like that are random and impossible to predict. Overall, preseason ranks for running backs are more accurate than for wide receivers.</p>
How were the top-scoring RBs and WRs ranked in the preseason?
<p>Let’s shift our focus to later in the draft. How often can you draft a lower-ranked running back or wide receiver and still have them finish in the top 10?</p>
<p><img alt="IVP" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/6b08619fded42ff700cb050fa03f0033/ivp_wrs_rbs_postseason_w640.jpeg" style="width: 640px; height: 427px;" /></p>
<p><img alt="Describe" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/fe2c58f6-2410-4b6f-b687-d378929b1f9b/Image/c7bc6b2d68572eff76d8a22edfeda563/describe_postseason_rbs_wrs_w640.jpeg" style="width: 640px; height: 108px;" /></p>
<p>Wide receivers have had more players come out of nowhere to be top 10 scorers at the end of the season (Victor Cruz in 2011 and Brandon Lloyd and Stevie Johnson in 2010 were all ranked 87th or worse, yet finished in the top 10). But the descriptive statistics indicate a pretty even distribution otherwise. About half of the top 10 scoring RBs and WRs were <em>not</em> ranked in the top 10 to begin the season. And 25% of players were ranked outside the top 25, yet were still able to finish in the top 10. For both positions, there are frequently lower ranked players that exceed expectations and finish in the top 10.</p>
<p>But if you want one of the <em>best</em> players, say top 3...can you afford to wait or do you need to select a top ranked player early? The following table shows the 3 highest scoring players for each year, with their preseason rank in parentheses.</p>
<table>
<tr><th align="center">Year</th><th align="center">Top Scoring RB</th><th align="center">2nd Highest Scoring RB</th><th align="center">3rd Highest Scoring RB</th><th align="center">Top Scoring WR</th><th align="center">2nd Highest Scoring WR</th><th align="center">3rd Highest Scoring WR</th></tr>
<tr><td align="center">2013</td><td align="center">Jamaal Charles (6)</td><td align="center">LeSean McCoy (10)</td><td align="center">Matt Forte (12)</td><td align="center">Calvin Johnson (1)</td><td align="center">Josh Gordon (42)</td><td align="center">Demaryius Thomas (6)</td></tr>
<tr><td align="center">2012</td><td align="center">Adrian Peterson (10)</td><td align="center">Arian Foster (1)</td><td align="center">Doug Martin (27)</td><td align="center">Calvin Johnson (1)</td><td align="center">Brandon Marshall (12)</td><td align="center">Dez Bryant (15)</td></tr>
<tr><td align="center">2011</td><td align="center">LeSean McCoy (6)</td><td align="center">Ray Rice (5)</td><td align="center">Arian Foster (4)</td><td align="center">Calvin Johnson (5)</td><td align="center">Wes Welker (22)</td><td align="center">Victor Cruz (110)</td></tr>
<tr><td align="center">2010</td><td align="center">Arian Foster (23)</td><td align="center">Adrian Peterson (2)</td><td align="center">Peyton Hillis (63)</td><td align="center">Dwayne Bowe (20)</td><td align="center">Brandon Lloyd (123)</td><td align="center">Greg Jennings (11)</td></tr>
<tr><td align="center">2009</td><td align="center">Chris Johnson (7)</td><td align="center">Adrian Peterson (1)</td><td align="center">Maurice Jones-Drew (3)</td><td align="center">Andre Johnson (2)</td><td align="center">Randy Moss (4)</td><td align="center">Miles Austin (68)</td></tr>
</table>
<p>Since 2009, nine different receivers finished the season in the top 3 despite being ranked outside the preseason top 10. <em>That’s 60%</em>! And two of those players were ranked outside the top 100 in the preseason! But amongst all the inconsistency is Calvin Johnson. He’s the only wide receiver that is listed more than once. And he’s finished as the #1 ranked receiver 3 times in a row!</p>
<p>Meanwhile only 4 running backs (27%) were able to finish in the top 3 despite being ranked outside the preseason top 10. Right now in ESPN’s average draft position, the 10th running back is being drafted with the 19th overall pick. So before the 2nd round of the draft is even over, there is a good chance that the top 3 running backs have already been selected. Compare that to wide receivers, where the 10th receiver is being drafted with the 34th overall pick. So in the middle of the 4th round, a top 3 wide receiver (or even two) could still be on the board!</p>
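<p>The 60% and 27% figures can be reproduced directly from the table above. Here is a minimal Python sketch of that count; the variable name <code>top3_preseason_ranks</code> and the helper <code>share_outside_top_n</code> are my own labels for illustration, and the ranks are simply copied from the table.</p>

```python
# Preseason ranks of the top-3 scorers at each position, 2013 back to 2009,
# copied from the table above (three players per season).
top3_preseason_ranks = {
    "RB": [6, 10, 12, 10, 1, 27, 6, 5, 4, 23, 2, 63, 7, 1, 3],
    "WR": [1, 42, 6, 1, 12, 15, 5, 22, 110, 20, 123, 11, 2, 4, 68],
}

def share_outside_top_n(ranks, n=10):
    """Return (count, fraction) of players whose preseason rank was worse than n."""
    count = sum(1 for r in ranks if r > n)
    return count, count / len(ranks)

for position, ranks in top3_preseason_ranks.items():
    count, share = share_outside_top_n(ranks)
    print(f"{position}: {count} of {len(ranks)} ({share:.0%})")
# RB: 4 of 15 (27%)
# WR: 9 of 15 (60%)
```

<p>Note that a player ranked exactly 10th in the preseason (LeSean McCoy in 2013, Adrian Peterson in 2012) counts as inside the top 10 here.</p>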
<p>You can definitely wait to draft a wide receiver. The same can’t be said of running backs.</p>
<p>So how should you use this information in your fantasy football draft?</p>
Focus on Running Backs Early
<p>It’s not that the running back you pick is guaranteed to have a great season, but we just saw that, on average, 10 running backs are being selected before the end of the 2nd round! After that, your chances of picking a top running back start to diminish. At least one of your first two picks should be a running back, if not both!</p>
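<p>The round arithmetic is easy to check. The sketch below assumes a 10-team league in which every team picks once per round; the function name <code>draft_round</code> is hypothetical, and the 19th and 34th overall picks are the ESPN average draft positions quoted earlier for the 10th running back and 10th wide receiver.</p>

```python
import math

def draft_round(overall_pick, teams=10):
    """Round in which an overall pick falls, assuming one pick per team per round."""
    return math.ceil(overall_pick / teams)

print(draft_round(19))  # 10th running back off the board -> 2
print(draft_round(34))  # 10th wide receiver off the board -> 4
```

<p>If your league has a different number of teams, pass it as <code>teams</code> and the rounds shift accordingly.</p>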
<p>However, keep in mind that selecting RB/RB with your first two picks can be a high-variance strategy. Consider that last year, in a 10-team league you could have taken Jamaal Charles and Matt Forte with the 6th and 15th pick respectively. Those players finished as the #1 and #3 RB, and if you didn’t win your fantasy league you definitely made the playoffs. Of course, you could have just as easily picked C. J. Spiller and Stevan Ridley, who finished 31st and 26th. Unless you got really lucky with your later picks, you could say hello to the consolation bracket.</p>
<p>If you want to play it more conservatively, this data analysis pointed out a few other options. We know that quarterbacks are the most consistent position (Aaron Rodgers in 2013 aside), and this year Peyton Manning, Aaron Rodgers, and Drew Brees are the top 3 ranked quarterbacks. Spending an early pick on one of them should give you a consistent scorer who is much less likely to be a bust than an early running back.</p>
<p>Calvin Johnson and Jimmy Graham are also two very consistent players at two very inconsistent positions. Both players have finished in the top 3 at their position for the last 3 years (with Johnson finishing #1 all 3 years). You should feel just fine using your first two picks on one of these players and a running back. But use caution on selecting a different TE or WR with an early pick.</p>
Wait on Your Wide Receivers
<p>Wide receivers have the least accurate preseason rankings. Half of the preseason top 10 finish outside the top 12, and 25% finish <em>outside the top 28!</em> Because of this, there is value to be found later in the draft for wide receivers. Try to identify some wide receivers you like in later rounds, and focus your early picks on other positions.</p>
<p>This example is a bit extreme, but last year in a fantasy draft I spent 4 of my first 5 draft picks on running backs (with Jimmy Graham being the non-running back pick). I was able to do so because I was fine getting Eric Decker (preseason #20) and Antonio Brown (preseason #24) in the 6th and 7th rounds. They finished as the 8th and 6th ranked wide receivers. Obviously I got a little lucky that they were <em>that</em> <em>good</em>, but that’s kind of the point. I like to think of fantasy football picks as lottery tickets. You could hit the jackpot with some players, win a decent amount with others, and have some that are busts. After the first few rounds, wide receivers have a better chance of being winning lottery tickets than other positions.</p>
<p>Now, you don’t have to <em>completely</em> neglect the WR position before the 6th round like in the example above. Just know that you’re putting the odds in your favor by waiting to draft the bulk of your wide receivers.</p>
Who Needs a Backup QB?
<p>One last thing while we’re on the lottery ticket analogy. Let’s say you draft one of the top quarterbacks (Manning, Rodgers, or Brees). Don’t draft a backup quarterback! We already saw that quarterbacks have the most accurate preseason rankings. By the time you draft a backup, it’s unlikely that the lower-ranked player you choose will become a star that you will start each week or be able to use as trade bait. And on your QB’s bye week, you can easily pick somebody up off the waiver wire.</p>
<p>So why waste that pick on somebody with very little upside? Even if you’re picking in the 100s, there is still value to be had! Josh Gordon, Alshon Jeffery, Knowshon Moreno, and Julius Thomas were all ranked outside the preseason top 100 last year, and all turned into great fantasy players! </p>
<p>Want to take this idea to the (slightly crazy) extreme? If you have a late first round pick, try and use your first two picks on Jimmy Graham and one of Manning, Rodgers, or Brees. With your QB and TE position locked up, spend your next 12 picks on nothing but RBs and WRs. Then use your last two picks on a defense and kicker! I know this goes against the advice of focusing on running backs early, but I <em>did </em>say it was a slightly crazy and extreme strategy! If you can get lucky and find a winning lottery ticket with a lower-ranked running back or two (maybe Montee Ball, Ben Tate, Andre Ellington), it <em>could</em> even be a winning strategy. </p>
<p>If you decide to try that draft strategy, let me know how it goes! And whatever strategy you use, good luck with your 2014 fantasy football season!</p>
Fun Statistics
Fri, 15 Aug 2014 15:48:00 +0000
http://blog.minitab.com/blog/the-statistics-game/how-accurate-are-fantasy-football-rankings-part-ii
Kevin Rudy
“You’ve got a friend” in Minitab Support
http://blog.minitab.com/blog/real-world-quality-improvement/youve-got-a-friend-in-minitab-support
<p>I caught the end of Toy Story over the weekend, which is definitely one of my all-time favorite children’s movies. Now, unfortunately or fortunately, I can’t get Randy Newman's theme song, “You’ve Got a Friend in Me,” out of my head!</p>
<p>It's also got me thinking about the nature of friendship, and how "best friends forever" are supposed to always be there when you need them. And, not to get too maudlin about it, but just like Woody and Buzz eventually realize their friendship, all of us hope the professionals who use our software also realize that “you’ve got a friend” in Minitab.</p>
<p>Now what do I mean by all this “BFF” business? I’m talking about our <a href="http://www.minitab.com/support/" target="_blank">free technical support</a> services (online and by telephone), as well as the plethora of free documentation that’s available online for each of our products. <em>We’re here for you!</em></p>
<p>Be sure to visit the <a href="http://www.minitab.com/support/" target="_blank">Support</a> section of our website to browse the individual support sections that are available for each of our product offerings. From there, you can access the latest software downloads, documentation, and tutorials, and find the answers to all of your questions about software use, statistics, and quality improvement. In fact, there's a lot of great information there even if you're not using our software yet!</p>
<p>And for our latest and greatest release, Minitab 17 Statistical Software, we’ve expanded our online support offerings. Be sure to check out the following:</p>
<strong>1. <u><a href="http://support.minitab.com/minitab/17/getting-started/" target="_blank">Getting Started with Minitab 17</a></u></strong>
<p><em>Getting Started</em> is our user guide that introduces you to some of the most commonly used features and tasks in Minitab—including how to explore your data with graphs, conduct statistical analyses and interpret the results, assess quality using control charts and capability analysis, and design an experiment.</p>
<p>The guide also includes shortcuts and tips for customizing Minitab.</p>
<strong>2. <u><a href="http://support.minitab.com/minitab/17/topic-library/" target="_blank">Topic Library</a></u></strong>
<p>The Minitab 17 Topic Library is a compilation of content from Help, StatGuide™, and Glossary—all of which are also available within the software itself. The library is arranged by statistical area so that you can easily find relevant topics, such as Basic Statistics and Graphs, Quality Tools, and Modeling Statistics (ANOVA, regression, DOE, etc.).</p>
<strong>3. <u><a href="http://support.minitab.com/datasets/" target="_blank">Data Sets</a></u></strong>
<p>We took the best data sets from Minitab 17 Help and made them accessible online. We also made them even more realistic, so you can practice performing analyses and interpretation, explore alternate data layouts, and investigate statistical tools commonly used in your industry.</p>
<strong>4. <u><a href="http://support.minitab.com/minitab/17/macro-library/" target="_blank">Macros Library</a></u></strong>
<p>Our Macros Library includes many macros that allow you to <a href="http://blog.minitab.com/blog/customized-data-analysis/creating-a-custom-report-using-minitab-part-1">automate, customize and repeat an analysis</a> of your choice. You can download the .mac file for each macro we offer.</p>
<strong>5. <u><a href="http://support.minitab.com/minitab/17/technical-papers/" target="_blank">Technical Papers</a></u></strong>
<p>Access technical papers that describe the research conducted to develop the methods and data checks used in the <a href="http://www.minitab.com/products/minitab/assistant/">Assistant</a>, as well as the methodology and supporting research underlying two new analyses in Minitab 17.</p>
<strong>6. <a href="http://support.minitab.com/installation/" target="_blank">Installation</a> and <a href="http://www.minitab.com/support/licensing/" target="_blank">Licensing</a> FAQs</strong>
<p>Browse our troubleshooting solutions to the most common error messages, installation issues, and activation/licensing topics.</p>
The Personal Touch
<p>If you've checked the website and still need help, know that we’re here whenever you need us (with a real, live person, I might add!). <span style="line-height: 1.6;">Access unlimited phone or online support from experts in statistics, quality improvement, and computer systems by visiting </span><a href="http://www.minitab.com/support/" style="line-height: 1.6;" target="_blank">http://www.minitab.com/support/</a><span style="line-height: 1.6;">.</span></p>
<p>You really do have a friend in Minitab Support!</p>
Statistics Help
Fri, 15 Aug 2014 12:54:00 +0000
http://blog.minitab.com/blog/real-world-quality-improvement/youve-got-a-friend-in-minitab-support
Carly Barry
How Deadly Is this Ebola Outbreak?
http://blog.minitab.com/blog/the-statistical-mentor/how-deadly-is-this-ebola-outbreak
<p><img alt="" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/0b8e48b97aed2afbce026c9be82263d0/ebola_sign.png" style="border-width: 1px; border-style: solid; margin: 10px 15px; width: 350px; height: 212px; float: right;" />The current Ebola outbreak in Guinea, Liberia, and Sierra Leone is making headlines around the world, and rightfully so: it's a frightening disease, and last week the World Health Organization reported its spread is outpacing their response. Nearly 900 of the more than 1,600 people infected during this outbreak have died, including some leading medical professionals trying to stanch the outbreak's spread. And yesterday, one of the American doctors who contracted the disease arrived back in the U.S. for treatment.</p>
<p>Many sources state that Ebola virus outbreaks have a case fatality rate of up to 90%, but a look at the data shows that the death rate varies significantly with the Ebola species, case location, and year.</p>
Plotting Ebola Outbreaks Since 1976
<p>Infection with the Ebola virus causes a hemorrhagic fever. Symptoms most commonly appear 8 to 10 days after exposure, and include fever, headache, joint and muscle aches, and weakness. These symptoms quickly escalate to diarrhea, vomiting, stomach pain, lack of appetite, abnormal internal and external bleeding, and organ failure.</p>
<p>The disease first appeared in Africa in 1976, and since then sporadic outbreaks have occurred as indicated in graph 1, which depicts data from the World Health Organization web site. (You can download my Minitab project file, which includes all of the data used in this blog post, <a href="//cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/File/64eb13a3deb8e4b026e24bdefb846038/ebola2.MPJ">here</a>.)</p>
<p><img alt="ebola virus cases per year" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/b4f9d7d91f5bde98f7a162ec21b74457/ebola_cases_by_year.png" style="width: 500px; height: 333px;" /></p>
<p>According to the Centers for Disease Control, of the five known species of the Ebola virus, only three have resulted in large outbreaks. The current outbreak is associated with the species Zaire ebolavirus (EBOV). The two other species that have been associated with large outbreaks are Bundibugyo ebolavirus (BDBV) and Sudan ebolavirus (SUDV).</p>
<p>Graphing the outbreak death rate over time can help us understand the impact of species, location, and year. But plotting raw outbreak death rates, as I did above, is not ideal due to the difference in case numbers (sample size) across outbreaks. Let's try a different approach.</p>
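<p>To see why case numbers matter, here is a minimal Python sketch of the binomial standard error of an observed death rate. The sample sizes are hypothetical and chosen only to show the effect; they are not taken from the WHO data, and the function name is my own.</p>

```python
import math

def proportion_standard_error(p, n):
    """Binomial standard error of an observed proportion p based on n cases."""
    return math.sqrt(p * (1 - p) / n)

# HYPOTHETICAL outbreak sizes, used only to illustrate the effect of sample size.
for n in (10, 100, 1000):
    se = proportion_standard_error(0.5, n)
    print(f"n = {n:>4}: standard error = {se:.3f}")
# n =   10: standard error = 0.158
# n =  100: standard error = 0.050
# n = 1000: standard error = 0.016
```

<p>A raw death rate from a 10-case outbreak is roughly ten times noisier than one from a 1,000-case outbreak, so giving both the same visual weight on a plot can mislead.</p>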
Assessing Ebola Outbreaks with Binary Logistic Regression
<p>Fitting a model which accounts for the different sample sizes and <em>then </em>plotting the model predictions over time is more appropriate than simply graphing the raw fatality numbers.</p>
<p>I put the data into <a href="http://www.minitab.com/products/minitab">Minitab Statistical Software</a> and used binary logistic regression to fit a model with three predictors: year, Ebola virus species, and location of outbreak. I could not fit interactions among these factors because of the limited amount of data available.</p>
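<p>For readers who want to see the mechanics, the following is a rough Python sketch of a binary logistic regression on deaths-out-of-cases data. It is not the Minitab analysis: it uses a single numeric predictor instead of three, fits by plain Newton's method, and runs on made-up outbreak counts. All names (<code>outbreaks</code>, <code>predicted_rate</code>, <code>fit_logistic</code>) are hypothetical.</p>

```python
import math

# HYPOTHETICAL outbreaks: (decades since 1976, cases, deaths). Not WHO data.
outbreaks = [(0, 300, 270), (1, 50, 40), (2, 250, 175), (3, 100, 60)]

def predicted_rate(b0, b1, x):
    """Logistic model: predicted death rate at predictor value x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def fit_logistic(data, iterations=25):
    """Maximum-likelihood fit of deaths/cases on one predictor via Newton's method."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, cases, deaths in data:
            p = predicted_rate(b0, b1, x)
            resid = deaths - cases * p   # observed minus expected deaths
            w = cases * p * (1 - p)      # larger outbreaks carry more weight
            g0 += resid
            g1 += resid * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Solve the 2x2 Newton system H * delta = g by hand.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

b0, b1 = fit_logistic(outbreaks)
for x, cases, deaths in outbreaks:
    fitted = predicted_rate(b0, b1, x)
    print(f"decade {x}: raw rate {deaths / cases:.2f}, fitted rate {fitted:.2f}")
```

<p>The weight <code>cases * p * (1 - p)</code> is what lets a large outbreak pull the fitted curve harder than a small one, which is the reason for preferring model predictions over raw rates.</p>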
<p>All three predictors had <a href="http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values">p-values</a> below 0.001, indicating strong statistical significance:</p>
<p style="margin-left: 40px;"><img alt="ebola virus binary logistic regression analysis" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/39820ae9941558d84f39e4f897c588d7/ebola_binary_logistic_regression.gif" style="width: 410px; height: 128px;" /></p>
<p>I also created a scatterplot to illustrate the model's predicted death rates over time:</p>
<p><img alt="ebola scatterplot of predicted death rate vs year" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/a2738293bfffe3d097765a50eef2a602/ebola_predicted_death_rate_scatterplot.png" style="width: 500px; height: 333px;" /></p>
<p>We can draw the following conclusions from the binary logistic regression analysis and the graph above:</p>
<ol>
<li>The death rate from Ebola decreases over time.</li>
<li>The death rate is significantly different across species. After accounting for the effects of location and time, species SUDV and BDBV have lower death rates than EBOV. The current outbreak is EBOV.</li>
<li>The death rate is significantly different across locations. After accounting for the effects of species and time, Gabon, Sudan, and the current outbreak location (Guinea, Sierra Leone, and Liberia) appear to have lower death rates.</li>
</ol>
Assessing the Current EBOV Outbreak with Binary Logistic Regression
<p>The current outbreak has a low death rate relative to previous EBOV outbreaks. Since the current location has not appeared before, we cannot tell whether this decreased death rate is due to improvements in treatment over time, the quality of care available in the location of the outbreak, or some other factor, such as better immunity to the virus in the region.</p>
<p>The graph below shows the EBOV death rate predictions from a binary logistic regression model fit to the EBOV data only.</p>
<p><img alt="ebola scatterplot of predicted death rate vs year - EBOV only" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/477783387835e4af4c0c85742097c41c/ebola_binary_logistic_regression_scatterplot.png" style="width: 500px; height: 333px;" /></p>
<p>The current outbreak is severe in terms of number of cases, but the death rate is lower than expected based on past EBOV outbreaks in different locations.</p>
Seeing the Outbreak Day by Day
<p>One final graph shows the number of new cases per day by location for the current outbreak.</p>
<p><img alt="ebola scatterplot of new cases per day vs. date" src="http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/479b4fbd-f8c0-4011-9409-f4109cc4c745/Image/fcb8517f5fef67229aa4ff3250a2b994/ebola_scatterplot_of_new_cases_by_day_vs_date.png" style="width: 500px; height: 332px;" /></p>
<p>The number of cases per day has fluctuated widely in Guinea, while Liberia and Sierra Leone have both seen an extremely rapid rise in cases per day since mid-July.</p>
<p>This is one graph that will change greatly from day to day as the outbreak runs its course. Let's hope the data quickly return to 0 new cases per day for all locations.</p>
Statistics in the News
Wed, 06 Aug 2014 12:00:00 +0000
http://blog.minitab.com/blog/the-statistical-mentor/how-deadly-is-this-ebola-outbreak
Jim Colton