When you perform a statistical analysis, you want to make sure you collect enough data that your results are reliable. But you also want to avoid wasting time and money collecting more data than you need. So it's important to find an appropriate middle ground when determining your sample size.
Now, technically, the Major League Baseball regular season isn't a statistical analysis. But it does kind of work like one, since the goal of the regular season is to "determine who the best teams are." The National Football League uses a 16-game regular season to do that. Hockey and basketball use 82 games.
Baseball uses 162 games.
So is baseball wasting time collecting more data than it needs? Right now the MLB regular season is about halfway over. So could they just end the regular season now? Will playing another 81 games really have a significant effect on the standings? Let's find out.
How much do MLB standings change in the 2nd half of the season?
I went back through five years of records and recorded where each MLB team ranked in its league (American League and National League) on July 8, and then again at the end of the season. We can use this data to look at concordant and discordant pairs, which lets us compare teams to each other two at a time. A pair is concordant if the team ranked higher on July 8 is also ranked higher at the end of the season. A pair is discordant if the order flips.
For example, let's compare the Astros and Angels from 2015. On July 8th, the Astros were ranked 2nd in the AL and the Angels were ranked 3rd. At the end of the season, the Astros were ranked 5th and the Angels were ranked 6th. This pair is concordant since in both cases the Astros were ranked higher than the Angels. But if you compare the Astros and the Yankees, you'll see the Astros were ranked higher on July 8th, but the Yankees were ranked higher at the end of the season. That pair is discordant.
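The pair comparison above can be sketched in a few lines of Python. This is only an illustration, not the Minitab procedure: the Astros and Angels ranks come from the 2015 example, while the Yankees' ranks are placeholders chosen to match the ordering described (below the Astros in July, above them at the end). The `classify` helper is a made-up name.

```python
from itertools import combinations

# (July 8 rank, final rank). Lower number = higher in the standings.
# Astros and Angels are from the 2015 example; the Yankees' ranks are
# placeholders that only preserve the ordering described in the text.
ranks = {
    "Astros": (2, 5),
    "Angels": (3, 6),
    "Yankees": (4, 4),
}

def classify(team_a, team_b):
    """Label a pair of (july, final) rank tuples."""
    july_diff = team_a[0] - team_b[0]
    final_diff = team_a[1] - team_b[1]
    product = july_diff * final_diff
    if product > 0:
        return "concordant"   # same team is ahead at both points
    if product < 0:
        return "discordant"   # the order flipped
    return "tied"             # equal rank at one of the two points

results = {
    (a, b): classify(ranks[a], ranks[b])
    for a, b in combinations(ranks, 2)
}

for pair, label in results.items():
    print(pair, label)
```

Running the same comparison over every pair of observations is what produces the concordant and discordant counts.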
When we compare every team to every other team across all 150 observations, we end up with 11,175 pairs (150 × 149 / 2). How many of those are concordant? Minitab Statistical Software has the answer.
There are 8,307 concordant pairs, which is just over 74% of the data. So most of the time, if a team is higher in the standings as of July 8th, they will finish higher in the final standings too. We can also use Spearman's rho and Pearson's r to assess the association between standings on July 8th and the final standings. Each of these gives us a coefficient that can range from -1 to +1. The larger the absolute value, the stronger the relationship between the variables. A value of 0 indicates the absence of a relationship.
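To make the coefficient a little more concrete, here is a minimal sketch of Spearman's rho using the textbook shortcut formula, which is only valid when there are no tied ranks. The five-team ranks are invented for illustration and are not the data behind the Minitab output; the pooled five-year data repeats rank values, so it would need the general tie-aware calculation instead.

```python
def spearman_rho(x_ranks, y_ranks):
    """Spearman's rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); assumes no ties."""
    n = len(x_ranks)
    d_squared = sum((x - y) ** 2 for x, y in zip(x_ranks, y_ranks))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical five-team league: rank on July 8 and rank at season's end.
july_ranks = [1, 2, 3, 4, 5]
final_ranks = [2, 1, 3, 5, 4]

rho = spearman_rho(july_ranks, final_ranks)
reversed_rho = spearman_rho(july_ranks, [5, 4, 3, 2, 1])

# Share of concordant pairs, using the counts reported in the text.
concordant_share = 8307 / 11175

print(f"rho for the made-up league: {rho:.2f}")
print(f"rho when the order fully reverses: {reversed_rho:.2f}")
print(f"concordant share: {concordant_share:.1%}")
```

Standings that barely change give a value near +1, and a complete reversal gives -1.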
Both values are high and positive, once again indicating that teams ranked higher than other teams on July 8th usually stay that way through the end of the season. So did we do it? Did we show that baseball doesn't really need the 2nd half of its season?
Consider that each league has 15 teams. So a lot of our pairs are comparing teams that aren't that close together, like the 1st team to the 15th, the 1st team to the 14th, the 2nd team to the 15th, and so on. It's not very surprising that those pairs are going to be concordant. So let's dig a little deeper and compare each team's ranking in July to its ranking at the end of the season. The following histogram shows the difference in a team's rank. Positive values mean the team moved up in the standings, negative values mean they fell.
The most common outcome is that a team doesn't move up or down in the standings, as 34 of our observations have a difference of 0. However, there are 150 total observations, so most of the time a team does move up or down. In fact, 55 times a team moved up or down in the standings by 3 or more spots. That's over a third of the time! And there are multiple instances of a team moving 6, 7, or even 8 spots! That doesn't seem to imply that the 2nd half of the season doesn't matter. So what if we narrow the scope of our analysis?
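The rank differences behind the histogram are simple to compute. In the sketch below, the five (July, final) rank pairs are hypothetical; only the counts 34, 55, and 150 come from the text. Subtracting the final rank from the July rank gives the sign convention used above, where a positive value means the team moved up.

```python
from collections import Counter

# Hypothetical (July 8 rank, final rank) pairs, for illustration only.
ranks = [(1, 1), (2, 5), (3, 2), (4, 4), (5, 3)]

# July rank minus final rank: positive means the team moved up.
changes = [july - final for july, final in ranks]
tally = Counter(changes)  # counts per difference, i.e. histogram bar heights
big_moves = sum(1 for change in changes if abs(change) >= 3)

# Shares based on the counts reported in the text.
total_observations = 150
no_change_share = 34 / total_observations
big_move_share = 55 / total_observations

print(sorted(tally.items()))
print(f"no change: {no_change_share:.1%}")
print(f"moved 3+ spots: {big_move_share:.1%}")
```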
Looking at the Playoff Teams
We previously noted that the regular season is supposed to determine the best teams. So let's focus on the top of the MLB standings. I took the top 5 teams in each league (since the top 5 teams make the playoffs) on July 8th, and recorded whether they were still a top 5 team (and in the playoffs) at the end of the season. The following pie chart shows the results.
Twenty-eight percent of the time, a team that was in the playoffs in July fell far enough in the standings to drop out. So over a quarter of your playoff teams would be different if the season ended around 81 games. That sounds like a significant effect to me. And last, let's return to our concordant and discordant pairs. Except this time, we'll just look at the top half of the standings (top 8 teams).
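The playoff check boils down to comparing two sets of teams. Here is a rough sketch for a single league in a single season, with placeholder team names "A" through "G" and a made-up helper called `playoff_turnover`; none of it is the actual data.

```python
def playoff_turnover(july_top, final_top):
    """Return the share of July playoff teams that dropped out, and who they were."""
    july_top, final_top = set(july_top), set(final_top)
    dropped = july_top - final_top
    return len(dropped) / len(july_top), dropped

# Placeholder teams: the top 5 on July 8 and the top 5 at season's end.
july_top5 = ["A", "B", "C", "D", "E"]
final_top5 = ["A", "B", "C", "F", "G"]

share, dropped = playoff_turnover(july_top5, final_top5)
print(f"{share:.0%} of July playoff teams dropped out: {sorted(dropped)}")
```

Repeating this for each league and each season, then pooling the results, gives the overall percentage shown in the pie chart.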
This time our percentage of concordant pairs has dropped to 59%, and the values for Spearman's rho and Pearson's r show a weaker association. Teams ranked higher in the 1st half of the season are usually still ranked higher at the end of the season. But there is clearly enough shuffling among the top teams to warrant the 2nd half of the season. So don't worry, baseball fans, your regular season will continue to extend to September.
Because, you know, Major League Baseball totally would have shortened the season if this statistical analysis suggested doing so!
And if you're looking to determine the appropriate sample size for your own analysis, Minitab offers a wide variety of power and sample size analyses that can help you out.