The Kentucky Conundrum: Creating a New Regression Model to Predict the NCAA Tournament
The NCAA Tournament is right around the corner, and you know what that means: It’s time to start thinking about how you’re going to fill out your bracket! For the last two years I’ve used the Sagarin Predictor Ratings to predict the tournament. However, there is a problem with that strategy this year. The old method uses a regression model that calculates the probability one team has of beating another based on where the team is ranked in the Sagarin Ratings. So from year to year, the #1 ranked team is going to have the same probability of beating, say, the #25 ranked team.
The problem this year, of course, is that Kentucky isn’t your average #1 team.
We can’t simply use the fact that Kentucky is the #1 ranked team to calculate their probability of beating other teams, because they’re so much better than the #1 teams that came before them! In fact, if you take the Pomeroy rating of every #1 ranked team since 2002, this Kentucky team not only has the highest rating, they’re a full 2 standard deviations above the historical average!
And Kentucky isn’t alone. The previous link goes on to show that 8 of the 10 teams in the Pomeroy top 10 have the highest rating of any similarly ranked team to come before them. This may be the best group of #1 and #2 seeds the NCAA tournament has ever had. So any predictive system based solely on a team’s ranking may be underestimating the chances of the top teams.
Luckily, Las Vegas can help us!
Using Vegas Spreads to Predict Games
Using the Sagarin ratings, we can calculate the margin of victory we’d expect the favored team to win by. That is, we can calculate the spread (according to the Sagarin ratings). Then we can use that spread to calculate the probability the favorite has of winning. How? That’s where Las Vegas comes in.
I took 3,126 college basketball games from this season, collected the spread (for the home team) and whether the home team won or not. (Note that at neutral site games, the “home” team is really just an arbitrary title given to one team. However, the spread already accounts for whether the home team is actually playing on their home court or not, so we can safely group actual home teams and home teams at neutral site games.)
We can use this data to create a binomial logistic regression model that can calculate the probability of the home team winning based on the spread:
You can see that our model does a very good job of predicting the probability the home team has of winning based on the spread. However, there does appear to be a big outlier in the upper right corner. This is the group of home teams that were favored by 24.5 points. In our data, there were 11 such teams, and two of them actually lost. Not only that, the teams were from the same state! Michigan lost to NJIT and Michigan State lost to Texas Southern. These were actually the only two home teams to lose a game in which they were favored by 18.5 points or more. I don’t think there is anything special about being favored by 24.5 points, so I think we can just chalk that outlier up to random variation and continue with the analysis.
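To make the modeling step concrete, here is a minimal pure-Python sketch of fitting that kind of binomial logistic regression. The data below are simulated stand-ins (the real 3,126-game dataset isn’t reproduced here), and the coefficients are illustrative assumptions rather than the values fitted to the actual Vegas spreads:

```python
import math
import random

# Simulated stand-in for the real 3,126-game dataset:
# (home spread, home win) pairs drawn from an assumed logistic model.
random.seed(0)
TRUE_B0, TRUE_B1 = 0.0, 0.15  # illustrative coefficients, not fitted values
games = []
for _ in range(2000):
    spread = random.uniform(-25, 25)  # points the home team is favored by
    p_true = 1 / (1 + math.exp(-(TRUE_B0 + TRUE_B1 * spread)))
    games.append((spread, 1 if random.random() < p_true else 0))

# Fit the binomial logistic regression by gradient ascent on the
# log-likelihood (a sketch; a stats package would do this for you).
b0, b1, lr = 0.0, 0.0, 0.001
for _ in range(1000):
    g0 = g1 = 0.0
    for x, y in games:
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))
        g0 += y - p
        g1 += (y - p) * x
    b0 += lr * g0 / len(games)
    b1 += lr * g1 / len(games)

def home_win_prob(spread):
    """Probability the home team wins, given how many points it is favored by."""
    return 1 / (1 + math.exp(-(b0 + b1 * spread)))
```

The key output is `home_win_prob`: feed it a spread and it returns the home team’s win probability, which is exactly the mapping the rest of this post relies on.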
Applying the Model to the Sagarin Ratings
Let’s go back to our hypothetical matchup between the #1 ranked team and the #25 ranked team to look at the difference between using the ranks and using the ratings. Right now those two teams are Kentucky and Davidson in the Sagarin predictor ratings. If we rely on where the teams are ranked, the probability that Kentucky wins a neutral site game is 84.6%. However, using the binomial logistic regression model on the spread implied by the Sagarin ratings, the probability that Kentucky wins is 91.4%. That’s quite a difference!
To win the entire tournament, Kentucky has to win 6 games. For simplicity, let’s just assume Kentucky has the same probability of winning all 6 games.
Probability of Kentucky winning the tournament based on ranks = .846^6 ≈ 36.7%
Probability of Kentucky winning the tournament based on ratings = .914^6 ≈ 58.3%
You can see that even a small difference in probabilities can become extreme when compounded over 6 games!
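The compounding arithmetic above is easy to verify; here is a quick sketch using the same constant-probability simplification:

```python
def title_prob(p_game, games=6):
    """Chance of winning all six tournament games, assuming the same
    win probability in each game (the simplification used above)."""
    return p_game ** games

print(f"Based on ranks:   {title_prob(0.846):.1%}")  # 0.846^6 -> 36.7%
print(f"Based on ratings: {title_prob(0.914):.1%}")  # 0.914^6 -> 58.3%
```

A 6.8-point gap per game turns into a 21.6-point gap in title probability, which is why getting the single-game probabilities right matters so much.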
Testing the Model
The last thing we should do is see how accurate our model is. For 594 games in February and March of this season, I obtained the probability the home team would win based on the Sagarin ratings. Then I put each game into a group based on the probability that the favorite had of winning. For example, if a team had a probability of winning of 76%, they would go in the “70 to 79” group. To test the accuracy of the model (and the ratings), we can look at the proportion of favorites that won in each group and compare that to the predicted probability. If these two numbers are close together, then the model and the ratings are accurate. The results for the Sagarin ratings are below.
| Group | Predicted Probability | Observed Probability | Difference | Number of Games |
|----------|-------|-------|------|-----|
| 50 to 59 | 55.2% | 58.9% | 3.7% | 141 |
| 60 to 69 | 64.3% | 61.2% | 3.1% | 147 |
| 70 to 79 | 74.8% | 73.3% | 1.5% | 120 |
| 80 to 89 | 84.3% | 86.6% | 2.3% | 119 |
| 90 to 99 | 93.6% | 94.0% | 0.4% | 67 |
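For readers who want to reproduce this check, the grouping step can be sketched as follows. The function name and inputs here are hypothetical; the real inputs would be the 594 games’ favorite win probabilities and outcomes:

```python
def calibration_table(predictions, outcomes):
    """Group games by the favorite's predicted win probability and
    compare each group's average prediction to its observed win rate.

    predictions: favorite's predicted win probability per game (0.5-1.0)
    outcomes:    1 if the favorite won that game, else 0
    """
    bins = {}
    for p, won in zip(predictions, outcomes):
        decade = int(p * 10) * 10  # e.g. 0.76 falls in the "70 to 79" group
        n, wins, psum = bins.get(decade, (0, 0, 0.0))
        bins[decade] = (n + 1, wins + won, psum + p)
    rows = []
    for decade in sorted(bins):
        n, wins, psum = bins[decade]
        rows.append((f"{decade} to {decade + 9}", psum / n, wins / n, n))
    return rows  # (group, mean predicted, observed win rate, games)
```

Each returned row mirrors a row of the table above: if the mean predicted probability and the observed win rate stay close in every group, the model and the ratings are well calibrated.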
For each group, the difference between the observed probability and the predicted probability is only a few percentage points. It looks like our model is good to go. So check back on Monday, when we break down the brackets!