A few weeks ago, I used Minitab to calculate the odds of throwing a perfect game. The results were surprising. I found that the number of perfect games that have occurred since 1900 is vastly greater than the number we would have expected. And whether you're doing a Six Sigma project or a simple baseball data analysis, it's always good to go back and make sure you did everything correctly whenever you find surprising results.
To determine the number of perfect games we would have expected to occur since 1900, I calculated the probability of getting 27 outs in a row. To do this, I used the average on-base percentage (OBP) for the entire league since 1900. The probability of getting 27 outs in a row is (1-OBP)^27. And this is where I went wrong. You see, the average OBP is the percentage of the time a batter gets on base against every pitcher. That includes starters and relief pitchers. However, when a perfect game occurs, the starter pitches all 9 innings. Since starters are usually better than relief pitchers (that’s why they’re starters), batters should have a lower OBP against them. Instead of using the OBP against all pitchers, we need the OBP against just starters.
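To see why the choice of OBP matters so much, here is a minimal Python sketch of the (1-OBP)^27 calculation. It assumes every plate appearance is independent and has the same on-base probability, and the two OBP values are illustrative round numbers rather than figures from the data.

```python
def perfect_game_probability(obp, outs=27):
    """Probability of retiring `outs` batters in a row, assuming each
    plate appearance is independent with the same on-base probability."""
    return (1.0 - obp) ** outs

# Illustrative OBP values only (assumed, not taken from the data set).
higher_obp = 0.33
lower_obp = 0.32

p_higher = perfect_game_probability(higher_obp)
p_lower = perfect_game_probability(lower_obp)

# Raising a number to the 27th power magnifies small differences,
# so a 0.01 change in OBP shifts the result by a large factor.
ratio = p_lower / p_higher
print(f"OBP {higher_obp}: {p_higher:.8f}")
print(f"OBP {lower_obp}: {p_lower:.8f}")
print(f"Ratio: {ratio:.2f}")
```

Because the exponent is 27, even a small drop in OBP raises the perfect game probability noticeably, which is why switching from all pitchers to starters can change the expected count so much.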
Collecting the Data
Luckily, baseball-reference.com has the OBP of all the players who batted against an individual pitcher. For example, this year batters have an OBP of 0.255 when batting against Matt Cain. So I took the OBP for 4,397 individual pitchers from 1960 until 2012. Not counting any pitchers from this year (because they’ve played less than half the season) every pitcher has pitched at least 100 innings a season. They also have pitched an average of 33 games a year and 6.2 innings per game. So we’ll assume that the vast majority of these pitchers are starters.
Why did I only go back to 1960? Because Baseball Reference says that for many detailed stats (like the one I was collecting), pre-1954 seasons have a substantial number of games that are missing play-by-play accounts and should be viewed as essentially incomplete. Why didn’t I go back to 1954 then? Well...because it took a long time to gather all that data and I got lazy.
My laziness aside, we have a problem. We can’t simply ignore 60 years of baseball! Now, from my last post, we have the league OBP against all pitchers for each year from 1900-2012. Baseball Reference did not say there were any problems with those statistics. And one would imagine that as the OBP against all pitchers goes up and down from year to year, the OBP against just starters would, too. So I took the average OBP against starters for each year, and used a time series plot to compare it to the average OBP against all pitchers. (If you'd like to do the same, my data sheet is available here.)
First of all, we see that the OBP against starters is lower than the OBP against all pitchers. This supports the notion that starters are better than relief pitchers. We also see that both stats seem to go up and down similarly. So why don’t we run a regression analysis and obtain a model that can predict OBP against just starters?
Great! The R-squared value for this regression analysis is 94.3%! That means that 94.3% of the variation in the OBP against starting pitchers can be explained by the OBP against all pitchers. We can now use the model from this regression analysis to calculate the OBP against just the starting pitchers for each year before 1960.
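For anyone who wants to try this outside of Minitab, here is a rough Python sketch of a simple linear regression with an R-squared value. The function name `fit_line` and the numbers in the two lists are placeholders made up for illustration; they are not the values from the data sheet, and this is not necessarily how Minitab computes its fit internally.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x.
    Returns (slope, intercept, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # With a single predictor, R-squared is the squared correlation.
    r_squared = (sxy ** 2) / (sxx * syy)
    return slope, intercept, r_squared

# Placeholder yearly averages for illustration only (NOT the real data).
league_obp = [0.310, 0.320, 0.330, 0.340]
starter_obp = [0.301, 0.309, 0.320, 0.328]

slope, intercept, r_squared = fit_line(league_obp, starter_obp)

def predict_starter_obp(league_value):
    """Estimate OBP against starters from the league-wide OBP."""
    return intercept + slope * league_value

print(f"R-squared: {r_squared:.3f}")
print(f"Predicted OBP against starters: {predict_starter_obp(0.325):.3f}")
```

With the real yearly averages in place of the placeholders, `predict_starter_obp` would be applied to each pre-1960 season's league OBP.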
A New Expected Value
Using the equation, I predicted that since 1900, the proportion of batters who get on base against starting pitchers is 0.317. In my previous post, I used a value of 0.329. So using our new number, we can calculate the following:
- 1 - 0.317 = 0.683 = A probability of 68.3% of a starting pitcher getting a batter out
- 0.683^27 = 0.00003383 = Odds of 1 in 29,555 of a starting pitcher throwing a perfect game
- 0.00003383 * 363,842 = 12.3 = The number of perfect games we would have expected since 1900
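The three steps above can be reproduced with a few lines of Python. This sketch uses the rounded OBP of 0.317, so the last digits can differ slightly from the figures quoted in the post, and it treats 363,842 as the number of perfect game opportunities since 1900.

```python
obp_vs_starters = 0.317   # predicted OBP against starting pitchers
opportunities = 363_842   # perfect game opportunities since 1900

# Step 1: probability that a starter retires a single batter.
p_out = 1 - obp_vs_starters

# Step 2: probability of 27 outs in a row (independent batters assumed).
p_perfect = p_out ** 27

# Step 3: expected number of perfect games over all opportunities.
expected_games = p_perfect * opportunities

print(f"P(out) = {p_out:.3f}")
print(f"P(perfect game) = {p_perfect:.8f} (about 1 in {1 / p_perfect:,.0f})")
print(f"Expected perfect games = {expected_games:.1f}")
```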
In my previous post, I found an expected value of 7.8 perfect games. Because there have been 20 perfect games since 1900, it appears that our value using numbers for just starters is much more accurate. Still less than what we’ve observed, but not nearly as unlikely. A probability distribution plot can help show us just how unlikely it would be to have had 20 perfect games since 1900.
In my last post, 20 wasn’t even close to any gray bars, and I calculated odds of 1 in 5,780 of there being at least 20 perfect games. When we use only OBP against starting pitchers, we see that having 20 perfect games is much more likely. In fact, the odds improve to 1 in 37. One other thing to keep in mind is that I wrote the article the day after a perfect game was thrown. There was some bias in the point where I stopped collecting the number of perfect game opportunities. Had I written the article the day before the perfect game, the odds of seeing at least 19 perfect games by June 12, 2012 would have been 1 in 20.
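Here is one way to sketch the "at least 20" calculation in Python. It assumes the number of perfect games follows a binomial distribution with 363,842 opportunities and a per-opportunity probability of 0.00003383; the helper name `binomial_upper_tail` is made up for this example, and the result may differ a little from Minitab's because of rounding.

```python
import math

def binomial_upper_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p), computed as 1 - P(X <= k - 1).
    Each term is built in log space to avoid overflow for large n."""
    lower = 0.0
    for i in range(k):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p)
            + (n - i) * math.log1p(-p)
        )
        lower += math.exp(log_pmf)
    return 1.0 - lower

opportunities = 363_842
p_perfect = 0.00003383

tail = binomial_upper_tail(opportunities, p_perfect, 20)
print(f"P(at least 20 perfect games) = {tail:.4f} (about 1 in {1 / tail:.0f})")
```

The same function can be pointed at a different count or a different number of opportunities to see how sensitive the odds are to where the counting stops.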
It looks like we’ve been able to partially explain why the numbers were so far off before. We’ve still seen more perfect games than we would expect, but it’s a lot closer than before. My conspiracy theory that MLB has started rigging perfect games to increase interest is losing steam. Okay, I never really believed that, but there have been 5 perfect games in the last 4 years (and one Jim Joyce call from 6). If we keep seeing a steady stream of perfect games, I might just have to go back and revisit that idea. Until then, I’m going to stick to the conclusion that throwing a perfect game is really, really hard. And with odds of 1 in 29,555, any pitcher who throws one is in a very exclusive club.
Time: Saturday, July 7, 2012
This is fun analysis but I think you have to include the fact that those throwing perfect games are not just starters but also "aces" who have an above average probability of getting a batter out.
Time: Saturday, July 14, 2012
Very interesting experiment. I wonder if there is anything notable about those perfect game pitchers that would let us find who will be the next one to achieve it. Maybe, as Walker says, their probability of getting a batter out, their ERA, ...
Time: Wednesday, July 18, 2012
I agree that "aces" have an above average probability of getting a batter out, and thus a higher probability of throwing a perfect game. But most pitchers out there aren't aces. Most pitchers are 2nd through 5th in a team's rotation. The OBP batters have against them balances out the aces to get the 0.317 value I used.
As for predicting the next perfect game, I think it is so rare that it is really almost impossible to predict. Plus there is so much that the pitcher can't control. If they don't strike out a batter (which most of the time they don't) they have to hope the ball wasn't hit in a gap, and if it wasn't, they also have to hope their fielders get the out.
For an example, take Philip Humber. So far this season, he gets batters out only 33.8% of the time. That's a lot worse than the average. And yet he was able to throw a perfect game earlier this year. It's not just the great pitchers that throw perfect games. Below-average ones can do it too!
Time: Saturday, August 18, 2012
Hi there. I just came across your analysis (both parts). While the second part is clearly superior, I believe you are still omitting two very important factors from your data.
The first one is more obvious: since OBP does not include batters who get on base due to an error (or dropped third strike or catcher's interference -- if you want to get picky), you need to include those possibilities as well. Including errors is tricky since some errors don't actually result in the batter getting on base (consider a poor pick-off attempt or an outfielder booting a clean single).
Yet there is another consideration that is less obvious: pitching from the wind-up vs. pitching from the stretch. If you're going to make the (correct) argument that only OBP stats from starting pitchers' appearances should be counted because, after all, starting pitchers are typically better pitchers than relievers, then you cannot ignore that starting pitchers throw from the wind-up whenever possible -- presumably because they pitch better from the wind-up than from the stretch. (BTW, it's been shown that pitching from the wind-up does NOT increase velocity. So it's likely the pitcher's ability to hide the ball that makes him a better pitcher from the wind-up.)
So to further tighten your results, (1) consider errors that result in a batter reaching base and (2) only consider the OBP of batters when a starting pitcher is throwing with no runners on base (the pitcher will be throwing from the wind-up!).
The latter scenario is, after all, exactly what a pitcher faces during every batter of a perfect game.
Time: Monday, September 3, 2012
Tommy,
I completely agree with both of your factors. I omitted errors on purpose. There was just a lot of data to collect, so to save myself some time (they actually expect me to do other work here than write about sports) I didn't bother with errors. I figured that errors are so rare that they wouldn't affect the probability too much. If I get some more time in the future, hopefully I can go back and see how much they really do affect things.
However, I never even thought of your second factor. I could definitely see that affecting the probability. The problem is I have no idea how to separate the OBP of batters when hitting against a wind-up vs. the stretch. Do you know of any studies that have looked at this and broken down the difference? I'd be very interested to see it. But in the meantime, I'm not sure how to quantify the difference.
Thanks for the feedback!