A few weeks ago, I used Minitab to calculate the odds of throwing a perfect game. The results were surprising. I found that the number of perfect games that have occurred since 1900 is vastly greater than the number we would have expected. And whether you're doing a Six Sigma project or a simple baseball data analysis, it's always good to go back and make sure you did everything correctly whenever you find surprising results.

To determine the number of perfect games we would have expected to occur since 1900, I calculated the probability of getting 27 outs in a row. To do this, I used the average on-base percentage (OBP) for the entire league since 1900. If we treat each plate appearance as independent, the probability of getting 27 outs in a row is (1-OBP)^27. And this is where I went wrong. You see, the average OBP is the percentage of the time a batter gets on base against every pitcher. That includes starters and relief pitchers. However, when a perfect game occurs, the starter pitches all 9 innings. Since starters are usually better than relief pitchers (that’s why they’re starters), batters should have a lower OBP against them. Instead of using the OBP against all pitchers, we need the OBP against just starters.

# Collecting the Data

Luckily, baseball-reference.com has the OBP of all the players who batted against an individual pitcher. For example, this year batters have an OBP of 0.255 when batting against Matt Cain. So I took the OBP against 4,397 individual pitchers from 1960 through 2012. Not counting any pitchers from this year (because they’ve played less than half a season), every pitcher in the sample pitched at least 100 innings in a season. They also pitched an average of 33 games a year and 6.2 innings per game. So we’ll assume that the vast majority of these pitchers are starters.

Why did I only go back to 1960? Because baseball reference says that for many detailed stats (like the one I was collecting), pre-1954 seasons have a substantial number of games that are missing play-by-play accounts and should be viewed as essentially incomplete. Why didn’t I go back to 1954 then? Well...because it took a long time to gather all that data and I got lazy.

My laziness aside, we have a problem. We can’t simply ignore 60 years of baseball! Now, from my last post, we have the league OBP against all pitchers for each year from 1900-2012. Baseball reference did not say there were any problems with those statistics. And one would imagine that as the OBP against all pitchers goes up and down from year to year, the OBP against just starters would, too. So I took the average OBP against starters for each year from 1960 on, and used a time series plot to compare it to the average OBP against all pitchers. (If you'd like to do the same, my data sheet is available here.)

First of all, we see that the OBP against starters is lower than the OBP against all pitchers. This supports the notion that starters are better than relief pitchers. We also see that both stats seem to go up and down similarly. So why don’t we run a regression analysis and obtain a model that can predict OBP against just starters?

Great! The R-squared value for this regression analysis is 94.3%! That means that 94.3% of the variation in the OBP against starting pitchers can be explained by the OBP against all pitchers. We can now use the model from this regression analysis to estimate the OBP against just the starting pitchers for each year before 1960.
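If you'd like to see the mechanics of that step outside Minitab, here is a minimal sketch of a simple linear regression in plain Python. The function name `fit_line` and the four data points are my own made-up placeholders for illustration only; they are not the data from this post, so the slope, intercept, and R-squared it produces will not match the 94.3% above.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared = share of the variation in y explained by the fitted line
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


# MADE-UP placeholder values, NOT the real yearly averages from the post.
league_obp = [0.310, 0.320, 0.330, 0.340]    # OBP against all pitchers
starter_obp = [0.300, 0.309, 0.322, 0.329]   # OBP against starters only

slope, intercept, r_squared = fit_line(league_obp, starter_obp)

# Use the fitted line to estimate starter OBP for a year where only the
# league-wide OBP is known (0.325 is again just an illustrative input).
predicted = intercept + slope * 0.325
print(f"slope={slope:.3f} intercept={intercept:.3f} R-squared={r_squared:.1%}")
print(f"predicted OBP against starters: {predicted:.3f}")
```

To apply the same idea to the real problem, you would fit on the 1960-2012 yearly averages and then feed in the league OBP for each earlier year.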

# A New Expected Value

Using the equation, I predicted that since 1900, the proportion of batters who get on base against starting pitchers is 0.317. In my previous post, I used a value of 0.329. So using our new number, we can calculate the following:

• 1 - 0.317 = 0.683, so a starting pitcher has a 68.3% probability of getting any one batter out
• 0.683^27 = 0.00003383, so a starting pitcher has about a 1 in 29,555 chance of throwing a perfect game
• 0.00003383 * 363,842 perfect game opportunities = 12.3 perfect games expected since 1900
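The arithmetic in those three steps is easy to check yourself. Here is a small sketch in Python; the function and constant names are mine, and it assumes (as the calculation above does) that every plate appearance is independent and that 363,842 is the number of perfect game opportunities since 1900.

```python
def perfect_game_probability(obp: float, batters: int = 27) -> float:
    """Chance of retiring `batters` hitters in a row.

    Assumes each plate appearance is independent with the same OBP.
    """
    return (1.0 - obp) ** batters


OBP_VS_STARTERS = 0.317   # predicted OBP against starters, from the post
OPPORTUNITIES = 363_842   # perfect game opportunities since 1900, from the post

p = perfect_game_probability(OBP_VS_STARTERS)
one_in = 1.0 / p
expected = p * OPPORTUNITIES

print(f"P(perfect game) = {p:.8f}")
print(f"About 1 in {one_in:,.0f}")
print(f"Expected perfect games since 1900 = {expected:.1f}")
```

Because the out probability is raised to the 27th power, even a small change in OBP moves the answer a lot, which is why switching from all pitchers to starters matters so much.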

In my previous post, I found an expected value of 7.8 perfect games. Because there have been 20 perfect games since 1900, it appears that our value using numbers for just starters is much more accurate. Still less than what we’ve observed, but not nearly as unlikely. A probability distribution plot can help show us just how unlikely it would be to have had 20 perfect games since 1900.

In my last post, 20 wasn’t even close to any gray bars, and I calculated odds of 1 in 5,780 of there being at least 20 perfect games. When we use only OBP against starting pitchers, we see that having 20 perfect games is much more likely. In fact, the odds improve to about 1 in 37. One other thing to keep in mind is that I wrote the article the day after a perfect game was thrown. There was some bias in the point where I stopped collecting the number of perfect game opportunities. Had I written the article the day before the perfect game, the odds of seeing at least 19 perfect games by June 12, 2012 would have been about 1 in 20.
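For readers without Minitab, here is a rough way to approximate that tail probability in Python. This is a sketch under an assumption of mine: I model the count of perfect games as a Poisson variable whose mean is the expected value computed earlier, which is a common stand-in for a binomial with many trials and a tiny success probability. It is not necessarily the exact distribution behind the plot, and the helper name `poisson_tail` is hypothetical.

```python
import math


def poisson_tail(k_min: int, mean: float) -> float:
    """P(X >= k_min) for X ~ Poisson(mean), via 1 - P(X <= k_min - 1)."""
    lower = 0.0
    term = math.exp(-mean)       # P(X = 0)
    for k in range(k_min):
        lower += term            # add P(X = k)
        term *= mean / (k + 1)   # step to P(X = k + 1)
    return 1.0 - lower


# Expected perfect games since 1900, using the post's numbers.
EXPECTED = 0.683 ** 27 * 363_842

p_20 = poisson_tail(20, EXPECTED)
print(f"P(at least 20 perfect games) = {p_20:.4f}")
print(f"Roughly 1 in {1 / p_20:.0f}")
```

Under this approximation the result should land close to the 1 in 37 figure quoted above. Small differences are expected, since the Poisson model and the rounded inputs are simplifications.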

It looks like we’ve been able to partially explain why the numbers were so far off before. We’ve still seen more perfect games than we would expect, but it’s a lot closer than before. My conspiracy theory that MLB has started rigging perfect games to increase interest is losing steam. Okay, I never really believed that, but there have been 5 perfect games in the last 4 years (and we were one Jim Joyce call away from 6). If we keep seeing a steady stream of perfect games, I might just have to go back and revisit that idea. Until then, I’m going to stick to the conclusion that throwing a perfect game is really, really hard. And with odds of 1 in 29,555, any pitcher who throws one is in a very exclusive club.