Analyzing Baseball Park Factors: Highlighting Unrealistic Data
The last time I posted, I used Minitab to find an error in the baseball park factor data that ESPN provides on its website. The error was easy to spot because no other values for Chase Field were below 1.
As a reminder, here’s how ESPN reports it calculates park factors:
((homeRS + homeRA)/(homeG)) / ((roadRS + roadRA)/(roadG))
- homeRS is the number of runs the team scored at home.
- homeRA is the number of runs opposing teams scored against the team at home.
- roadRS is the number of runs a team scored on the road.
- roadRA is the number of runs opposing teams scored against the team on the road.
- homeG is the number of games the team played at home.
- roadG is the number of games the team played on the road.
Note that baseball teams play the same number of home and road games in most seasons, so this typically reduces to (homeRS + homeRA) / (roadRS + roadRA). But be careful if your analysis includes seasons like the strike-shortened 1994 season, when a team's home and road game counts can differ.
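To make the formula concrete, here is a minimal Python sketch of the calculation as ESPN describes it. The function name `park_factor` and the season totals are my own made-up illustrations, not real team data or anything ESPN publishes.

```python
def park_factor(home_rs, home_ra, home_g, road_rs, road_ra, road_g):
    """Runs per game at home divided by runs per game on the road."""
    if home_g <= 0 or road_g <= 0:
        raise ValueError("game counts must be positive")
    home_rate = (home_rs + home_ra) / home_g
    road_rate = (road_rs + road_ra) / road_g
    return home_rate / road_rate


# Made-up season totals, for illustration only.
equal_games = park_factor(400, 380, 81, 350, 370, 81)

# With equal game counts, the games cancel and only the run totals matter.
simplified = (400 + 380) / (350 + 370)

# With unequal game counts, the shortcut no longer gives the same answer.
unequal_games = park_factor(400, 380, 80, 350, 370, 82)
```

Keeping the game counts in the function means the same code still works for seasons where the home and road schedules are not balanced.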
Another way to use graphs to spot errors is to look for values that simply seem unrealistic, even if they are more like the other values a park produces. In the graphs below, I’ve added reference lines at 0.6 to show where I think clearly unrealistic values would fall. Could park design really be so extreme that a team's home games would produce less than 60% of the runs per game that its games produce across all of the various parks it played in on the road? The points I’ve edited mark the first 8 values I checked against ESPN’s formula, using data from baseball-reference.com. I had already corrected the point for Chase Field that I discussed last time.
Some of the values look as obviously wrong as the Chase Field value was. It seems unrealistic that Minute Maid Park and Turner Field – which typically play neutral – would produce values of 0.562, 0.651, and 0.655. Other low values fit the patterns of their parks better. The Oakland County Coliseum has not one, but two unusually low values. The minimum for Safeco Field – always a pitchers' park – is not so different from other years. The new Yankee Stadium shows one unusually high value, but the low value is not so far off from other years.
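If you wanted to apply the 0.6 reference line outside of a graph, a rough sketch in Python might look like the following. The list holds only the three suspicious values named above. The variable names are my own, and the 0.6 cutoff is just my judgment about what is clearly unrealistic, not an established rule.

```python
THRESHOLD = 0.6  # the reference line drawn on the graphs

# The three suspicious values mentioned for Minute Maid Park and Turner Field.
values = [0.562, 0.651, 0.655]

# Keep only the values that fall below the reference line.
below_line = [v for v in values if v < THRESHOLD]
```

Only 0.562 falls below the line. The other two values look wrong mainly because those parks typically play neutral, which is why a fixed cutoff works best alongside a graph of each park's history.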
When you check the values with the formula that ESPN provides, they're all wrong. With those 9 points corrected, the graphs look like this:
Minitab graphs make it easy to see when unrealistic values occur so that you can check them in your own data. It seems safe to say that if you calculate park factors according to the formula that ESPN reports, then a value below 0.6 is almost certainly an error. The minimum value I found as I was checking the data set turned out to be 0.6454 for the 2012 Seattle Mariners, which had been recorded as 0.687. Someone should probably warn Nate Silver.
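For anyone who would rather script the comparison of published values against recalculated ones, here is a small Python sketch. The first record uses the 2012 Mariners numbers quoted above. The second record is hypothetical, and the `mismatches` helper and its tolerance of 0.0005 (half a unit in the third decimal place) are my own assumptions.

```python
def mismatches(records, tolerance=0.0005):
    """Return records whose published value differs from the recomputed one."""
    return [
        r for r in records
        if abs(r["published"] - r["recomputed"]) > tolerance
    ]


records = [
    # From the post: published as 0.687, recomputed as 0.6454.
    {"team": "Seattle Mariners", "year": 2012,
     "published": 0.687, "recomputed": 0.6454},
    # Hypothetical record that agrees once rounded to three decimals.
    {"team": "Example Team", "year": 2012,
     "published": 1.012, "recomputed": 1.0123},
]

flagged = mismatches(records)
```

With these inputs, only the Mariners record is flagged, because its published value is off by much more than rounding could explain.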