If you wanted to figure out the probability that your favorite football team will win their next game, how would you do it? My colleague Eduardo Santiago and I recently looked at this question, and in this post we'll share how we approached the solution. Let’s start by breaking down this problem:
- There are only two possible outcomes: your favorite team wins, or they lose. Ties are a possibility, but they're very rare. So, to simplify things a bit, we’ll assume they are so unlikely that they can be disregarded in this analysis.
- There are numerous factors to consider:
  - What will the playing conditions be?
  - Are key players injured?
  - Do they match up well with their opponent?
  - Do they have home-field advantage?
  - And the list goes on...
First, since we assumed the outcome is binary, we can put together a Binary Logistic Regression model to predict the probability of a win. Next, we need to find which predictors would be best to include. After a little research, we found that the betting markets seem to take all of the factors above into account. Basically, we are utilizing the wisdom of the masses to find out what they believe will happen. For that reason, we decided to look at the probability of a win, given the spread of an NCAA football game.
In case you are not convinced of how accurately spreads can predict the outcome of a game (win or loss), we collected data for every college football game played between 2000 and 2014. The structure of the data is illustrated below. The third column has the spread (or line) provided by casinos in Las Vegas, and the last column displayed is the actual score differential (vscore – hscore).
Note: In betting lines, a negative spread indicates how many points you are favored over the opponent. In short, you are giving the opponent a certain number of points.
The original win-or-lose question can then be rephrased as follows: Is the difference between the spreads and the actual score differentials statistically significant?
Since we have two dependent populations, we compare them with a paired t test. In other words, both the Spread and scoreDiffer are observations (a priori and a posteriori) for the same game, and they reflect the relative strength of the home team i versus the road team j.
Using Stat > Basic Statistics > Paired t in Minitab Statistical Software, we get the output below.
Since the p-value is larger than 0.05, we cannot reject the hypothesis that the average difference between Las Vegas spreads and actual score differentials is zero: across 15 years of data, that average difference is not significantly different from zero. In other words, we find no evidence of bias between these two measures of relative team strength, which in lay terms means that on average the error between Vegas and actual outcomes is negligible.
It is worth noting that the results above were obtained with a sample size of 10,476 games! So we hope you'll excuse our not including power calculations here.
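For readers following along without Minitab, the paired comparison can be sketched in Python using only the standard library. This is a minimal sketch, not the analysis behind the output above: the function name `paired_t` is hypothetical, the spreads and score differentials are simulated placeholders rather than the real 2000 to 2014 data, and the p-value uses a normal approximation to the t distribution, which is reasonable only for large samples.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def paired_t(before, after):
    """Return (t statistic, approximate two-sided p-value) for paired samples."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    std_err = stdev(diffs) / math.sqrt(n)
    t_stat = mean(diffs) / std_err
    # Normal approximation to the t distribution (assumption: large n).
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value


# Simulated placeholder data, NOT the real games: each score differential
# is the spread plus random noise with a standard deviation of 15.5 points.
random.seed(1)
spreads = [random.randint(-30, 30) for _ in range(10_476)]
score_diffs = [s + random.gauss(0, 15.5) for s in spreads]

t_stat, p_value = paired_t(spreads, score_diffs)
print(f"t = {t_stat:.3f}, approximate p-value = {p_value:.3f}")
```

With real data, `spreads` and `score_diffs` would be the Spread and scoreDiffer columns, one pair per game.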
As a final remark on spreads, the histogram of the differences below shows a couple of interesting things:
- The average difference between the spreads and score differentials seems to be very close to zero. Don’t get too excited yet, though: the spread cannot be used to predict the exact score differential for a single game. It is only on average that the spread is very close to the score differential.
- The standard deviation, however, is 15.5 points. That means that if a game shows a spread of -3 points for your favorite team, we can say with high confidence that the outcome will fall within plus or minus 2 standard deviations of the point estimate, which is -3 ± 31 points in this case. So your favorite team could win by 34 points, or lose by 28!
Figure 1 - Distribution of the differences between scores and spreads
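The plus or minus 2 standard deviation arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are our own, and the coverage figure assumes the differences are roughly normally distributed.

```python
from statistics import NormalDist

spread = -3        # negative spread: your team is favored by 3 points
std_dev = 15.5     # standard deviation of the differences, in points

expected_margin = -spread                  # favored by 3 -> expected to win by 3
half_width = 2 * std_dev                   # plus or minus 2 standard deviations
best_case = expected_margin + half_width   # largest plausible win
worst_case = expected_margin - half_width  # a negative value means a loss

# Share of a normal distribution within 2 standard deviations of its mean
# (assumption: the differences are roughly normal).
coverage = NormalDist().cdf(2) - NormalDist().cdf(-2)

print(f"win by up to {best_case:.0f}, or lose by up to {-worst_case:.0f}")
print(f"coverage of the +/- 2 SD band: {coverage:.1%}")
```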
The Binary Logistic Regression Model
By this point, we hope you are convinced of how good these spread values can be. To make the output more readable, we summarized the data as follows:
Creating our Binary Logistic Regression Model
After summarizing the data, we used the Binary Fitted Line Plot to come up with our model.
If you are following along, here are the steps:
- Go to Stat > Regression > Binary Fitted Line Plot
- Fill out the dialog box as shown below and click OK.
The steps will produce the following graph:
Interpreting the Plot
If your team is favored to win by 25 points or more, you have a very good chance of winning the game, but what if the spread is much closer?
For the 2014 National Championship, Ohio State was a 6-point underdog to Oregon. According to the Binary Fitted Line Plot, the probability that a 6-point underdog wins the game is close to 31% in college football.
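A curve of this kind can also be sketched outside Minitab. The following is a minimal, hedged example of a one-predictor binary logistic regression fitted by Newton-Raphson in plain Python. The function names `fit_logistic` and `win_probability` are hypothetical, and the games are simulated placeholders (a normal error with the 15.53-point standard deviation around the spread), so the fitted probabilities will not match the value read from the plot.

```python
import math
import random


def win_probability(b0, b1, spread):
    """Logistic model: P(win) = 1 / (1 + exp(-(b0 + b1 * spread)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * spread)))


def fit_logistic(x, y, iterations=25):
    """Fit intercept b0 and slope b1 by Newton-Raphson on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = win_probability(b0, b1, xi)
            w = p * (1.0 - p)
            g0 += yi - p            # gradient with respect to b0
            g1 += (yi - p) * xi     # gradient with respect to b1
            h00 += w                # entries of the 2x2 information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system with the explicit inverse.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Simulated placeholder games, NOT the real data. A negative spread means
# the team is favored, so its expected winning margin is -spread.
random.seed(0)
spreads = [random.randint(-30, 30) for _ in range(5000)]
wins = [1 if -s + random.gauss(0, 15.53) > 0 else 0 for s in spreads]

b0, b1 = fit_logistic(spreads, wins)
p_underdog_6 = win_probability(b0, b1, 6)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"P(win) for a 6-point underdog (simulated data): {p_underdog_6:.1%}")
```

The slope comes out negative because a larger spread means a bigger underdog. Swapping in the real spreads and win/loss outcomes would give a curve comparable to the fitted line plot.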
Ohio State University ended up beating Oregon by 22 points. Since the differences described in Figure 1 are approximately normally distributed around zero, if we treat the spread as given (or known), we can compute the probability of a national championship game outcome as extreme as this one, or more so.
With Ohio State as 6-point underdogs and a standard deviation of 15.53, we can run a Probability Distribution Plot to show that Ohio State would win by 22 points or more about 3.6% of the time.
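The same tail probability can be reproduced with Python's standard library. This sketch assumes, as above, that the winning margin is normally distributed around the value implied by the spread; the variable names are our own.

```python
from statistics import NormalDist

spread = 6           # Ohio State was a 6-point underdog
std_dev = 15.53      # standard deviation of the differences, in points
actual_margin = 22   # Ohio State's actual margin of victory

# A 6-point underdog is expected to lose by 6, so the margin is centered at -6.
margin_dist = NormalDist(mu=-spread, sigma=std_dev)

# Probability of winning by 22 points or more (upper tail).
p_extreme = 1 - margin_dist.cdf(actual_margin)
print(f"P(win by {actual_margin} or more) = {p_extreme:.1%}")
```

The result is 28 points above the expected margin, or about 1.8 standard deviations, which is where the roughly 3.6% comes from.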
Eduardo Santiago and I will be giving a talk on using statistics to rank college football teams at the upcoming Conference on Statistical Practice in New Orleans. Our talk is February 21 at 2 p.m., and we would love to have you join us.