Imagine a multimillion-dollar company that released a product without knowing the probability that it would fail after a certain amount of time. “We offer a 2-year warranty, but we have no idea what percentage of our products fail before 2 years.” Crazy, right? Anybody who wanted to ensure the quality of their product would perform a statistical analysis to look at the reliability and survival of their product.
Now imagine a multimillion-dollar football organization that makes 4^{th} down decisions without knowing the probability that they will convert the 4^{th} down. “We punt on every 4^{th} and 1, but we have no idea what percentage of the time we would keep possession if we went for it.” That's just as crazy, except that seems to be what every football organization does.
But it doesn’t have to be this way. Just like businesses use statistics to improve the quality of their products, football teams should use statistics to improve their chances of winning. So I’m going to use Minitab’s binary logistic regression to create a model that will let us know the probability a team has of successfully converting on 4^{th} down.
The Data
We’re continuing our quest to make a Big Ten 4^{th} down calculator, so we’ll start with the same data that we used to create a model for expected points. For every 3^{rd} down in Big Ten conference games the last 2 seasons, I recorded the distance needed to convert, whether the team on offense was at home or away, and whether they converted. I used 3^{rd} down instead of 4^{th} down to increase the sample size. And since the goal on 3^{rd} down is the same as 4^{th} down (convert in one play), the probabilities should be the same.
Speaking of the probabilities, we can use a scatterplot to get an initial look at how distance affects the probability of converting.
The probability of converting decreases pretty consistently as the distance increases. The data does appear to level out a bit between 10 and 15 yards before decreasing again. And there are some outliers at the end of the data, but that is due to small sample sizes.
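The points on a scatterplot like this are just the share of attempts converted at each distance. Here is a minimal Python sketch of that tabulation. The function name and the sample plays are made up for illustration (they are not the Big Ten data), and it also reports the number of attempts so that small-sample outliers are easy to spot.

```python
from collections import defaultdict

def conversion_rate_by_distance(plays):
    """plays: iterable of (distance_in_yards, converted) pairs.

    Returns {distance: (conversion_rate, attempts)}, sorted by distance.
    """
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for distance, converted in plays:
        attempts[distance] += 1
        if converted:
            successes[distance] += 1
    return {d: (successes[d] / attempts[d], attempts[d]) for d in sorted(attempts)}

# Made-up plays, only to show the shape of the input (not real data).
sample_plays = [(1, True), (1, True), (1, False), (10, False), (10, True), (25, False)]
rates = conversion_rate_by_distance(sample_plays)

for distance, (rate, n) in rates.items():
    print(f"{distance:>2} yards: {rate:.0%} on {n} attempts")
```

A distance with only a handful of attempts can swing wildly, which is the small-sample problem behind the outliers at the long distances.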
Now, I do have a different data set with a much larger sample that we can use to eliminate the noise in the data, but first I want to show something with this first data set that we can’t show with the next one.
The Effect of Playing at Home or Away
In the model for expected points, the location of the game affected a team's expected points. Will we see the same effect on the probability of converting on 3^{rd} down? We’ll use binary logistic regression to determine whether Home or Away is a significant term in the model.
When it comes to the probability of converting on 3^{rd} down, there is no evidence that it matters whether the team is home or away. The p-value for the Home/Away term in the regression analysis is 0.994, which is much greater than the common significance level of 0.05. So why does it matter for expected points, but not here? My best guess is that home field advantage has such a small effect on a single play that it doesn’t show up in the 3^{rd} down conversions. But over the course of a multiple-play drive (like what we looked at in the expected points model), those small effects add up and the effect of home field advantage becomes noticeable.
So when it comes to a single play, we can ignore home field advantage.
The Data: Part II
To increase our sample size, fellow blogger Joel Smith was kind enough to share data he collected on every college football game from 2006–2012. Because this sample is so large, we can actually look at 4^{th} downs instead of 3^{rd} downs. Here is a scatterplot of the data:
We see a similar pattern as before. The conversion rate decreases until about 10 yards, where it levels out a bit before dropping practically to 0% after 20 yards. And that outlier? Teams were 1 for 3 on 4^{th} and 34. That one success came in the 4^{th} quarter when the team on offense was down by 21 points, so the defense probably no longer had their starters in. That means we should clean up the data to try and remove points like these.
To avoid games that were blowouts, I removed any 4^{th} downs where the score differential was greater than 4 touchdowns in the first 3 quarters, or greater than 16 points (3 scores) in the 4^{th} quarter. Finally, I removed any distance greater than 20 yards, since the probability basically drops to 0. This means the decision on anything longer than 4^{th} and 20 should be very easy: punt or kick a FG unless it’s late in the game and you absolutely need to score a touchdown. So we don't really need to worry about modeling that for our 4^{th} down calculator.
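For anyone who wants to apply the same cleanup to their own play-by-play data, here is a rough Python sketch of the filter. The function and argument names are hypothetical, and I'm assuming that "4 touchdowns" means 28 points and that anything after the 3^{rd} quarter is treated like the 4^{th}.

```python
def keep_play(quarter, score_diff, distance):
    """Return True if a 4th down survives the cleanup rules.

    quarter: 1-4 (anything above 3 is treated like the 4th quarter, an assumption)
    score_diff: offense score minus defense score (sign does not matter)
    distance: yards needed to convert
    """
    margin = abs(score_diff)
    if distance > 20:      # probability is basically 0 past 4th and 20
        return False
    if quarter <= 3:       # first 3 quarters: drop margins over 4 TDs (assumed 28 points)
        return margin <= 28
    return margin <= 16    # 4th quarter: drop margins over 16 points

# The 4th-and-34 outlier: 4th quarter, offense down by 21.
print(keep_play(quarter=4, score_diff=-21, distance=34))  # False
```

Note that the outlier is removed twice over: the distance is more than 20 yards, and the 21-point deficit in the 4^{th} quarter is over the 16-point limit.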
After removing these observations, we still have 11,623 4^{th} downs. Here's the data I used.
The Final Model
We already saw that it doesn’t matter whether you’re playing at home or on the road, but there is another factor we should take into account. When you get closer to the goal line, the defense has a smaller portion of the field to defend. This might make it harder to convert on 4^{th} down when you have to score a touchdown rather than simply get a first down. So I created a variable to determine whether it was 4^{th} and goal or not to include in the model.
There also appears to be some curvature in the data, so I included the 2^{nd} and 3^{rd} order terms for distance. And lastly, our integers for distance represent the midpoint of the actual distance. For example, on 4^{th} and 4 you could really have to gain anywhere from 3.5 to 4.5 yards. But on 4^{th} and 1, the range is really 0 yards to 1.5 yards. So instead of using the integer 1, I used 0.75.
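To make the predictors concrete, here is a small Python sketch of how the terms for a single play could be built. The function and key names are my own and not anything Minitab requires; the only values taken from the post are the 0.75 adjustment for 4^{th} and 1 and the choice of linear, squared, and cubed distance plus a goal-to-go indicator.

```python
def model_terms(distance, goal_to_go):
    """Build the predictor values for one 4th down.

    distance: yards to go as listed (an integer such as 1, 2, 3, ...)
    goal_to_go: True when converting means scoring a touchdown
    """
    # 4th and 1 covers roughly 0 to 1.5 yards, so use the midpoint 0.75.
    d = 0.75 if distance == 1 else float(distance)
    return {
        "distance": d,
        "distance_sq": d ** 2,   # 2nd order term for the curvature
        "distance_cu": d ** 3,   # 3rd order term for the curvature
        "goal_to_go": 1 if goal_to_go else 0,
    }

print(model_terms(1, goal_to_go=True))
print(model_terms(4, goal_to_go=False))
```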
Now let’s put our data into Minitab and see the results.
The p-values for all of our terms are less than 0.05, so we can conclude that they are all significant and keep them in the model. The Deviance R-squared value tells us that 97% of the deviance in the probability of converting on 4^{th} down can be explained by the model. We can now use the model to predict the probability of converting at different distances.
Distance | Probability when Goal to go | Probability when not Goal to go
1* | 61% | 70%
2 | 50% | 60%
3 | 43% | 53%
4 | 37% | 46%
5 | 32% | 41%
6 | 29% | 37%
7 | 26% | 34%
8 | 24% | 32%
9 | 22% | 30%
10 | 21% | 28%
*I used a value of 0.75 for the prediction
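If you're curious how a binary logistic regression turns its terms into a probability, here is a minimal Python sketch. The fitted coefficients are not listed in this post, so the function takes them as inputs and the numbers below are placeholders set to zero, not the values from the Minitab output.

```python
import math

def conversion_probability(terms, coefficients, intercept):
    """Logistic model: probability = 1 / (1 + exp(-log_odds))."""
    log_odds = intercept + sum(coefficients[name] * value for name, value in terms.items())
    return 1.0 / (1.0 + math.exp(-log_odds))

# Terms for 4th and 1 with goal to go (distance entered as 0.75).
terms = {"distance": 0.75, "distance_sq": 0.75 ** 2, "distance_cu": 0.75 ** 3, "goal_to_go": 1}

# Placeholder coefficients only: all zeros, NOT the fitted model.
zero_coefficients = {name: 0.0 for name in terms}

# With every coefficient at zero, the intercept alone sets the probability.
print(conversion_probability(terms, zero_coefficients, intercept=0.0))
print(conversion_probability(terms, zero_coefficients, intercept=math.log(0.7 / 0.3)))
```

Plugging in the real coefficients from the regression output would reproduce the probabilities in the table.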
We see that being at the goal line decreases your chances on 4^{th} down by roughly 7 to 10 percentage points. We also see what a drastic effect just a couple of yards makes. Imagine getting a false start penalty and having your 4^{th} and 1 go to 4^{th} and 6. You just cut your chances of converting nearly in half!
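Those comparisons are easy to check against the table. Here is a quick Python sketch that uses only the predicted percentages listed above; the variable names are my own.

```python
# Predicted conversion percentages from the table: distance -> (goal to go, not goal to go)
table = {
    1: (61, 70), 2: (50, 60), 3: (43, 53), 4: (37, 46), 5: (32, 41),
    6: (29, 37), 7: (26, 34), 8: (24, 32), 9: (22, 30), 10: (21, 28),
}

# Gap between the two columns at each distance, in percentage points.
gaps = {d: not_goal - goal for d, (goal, not_goal) in table.items()}
print(min(gaps.values()), max(gaps.values()))

# False start: 4th and 1 becomes 4th and 6 (not goal to go).
ratio = table[6][1] / table[1][1]
print(f"{ratio:.2f}")
```

The gap runs from 7 to 10 percentage points, and a 4^{th} and 6 is converted a little over half as often as a 4^{th} and 1.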
So let’s go back to that coach who punts on every 4^{th} and 1. Now that we have our data, we can analyze whether he is making the correct decision. Let’s say he has a 4^{th} and 1 at his own 10 yard line and is playing on the road. We can use our expected points model and our 4^{th} down model to see what the correct decision should be.
Decision | Expected Points Success | Expected Points Fail | Total Expected Points
Go for it | -0.64 | -5.9 | -2.2
Punt* | -2.9 | N/A | -2.9
* The average net punt in the Big Ten was about 40 yards, so that’s the value I used.
By this model, going for it on 4^{th} down increases the coach’s expected points by 0.7. That may not sound like much, but imagine making a similar decision 4 or 5 times a game. Those expected points add up to about a field goal. Think there is a coach out there who wouldn’t want an easy way to increase their score by 3 points?
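The total for going for it is just a weighted average of the two outcomes, using the 70% chance of converting a 4^{th} and 1 when it isn't goal to go. Here is a short Python sketch of the arithmetic. The function name is mine; the inputs are the numbers from the tables above.

```python
def expected_points_go(p_convert, ep_success, ep_fail):
    """Weight the success and failure outcomes by their probabilities."""
    return p_convert * ep_success + (1 - p_convert) * ep_fail

# 4th and 1 at your own 10, on the road, not goal to go.
go = expected_points_go(p_convert=0.70, ep_success=-0.64, ep_fail=-5.9)
punt = -2.9  # expected points after an average 40-yard net punt

print(round(go, 1))         # total expected points when going for it
print(round(go - punt, 1))  # gain from going for it instead of punting
```

Swapping in a different conversion probability or field position is a matter of changing the three inputs, which is all the 4^{th} down calculator really does.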
And keep in mind our numbers assume you only gain 1 yard on 4^{th} down. When you account for the fact that you can gain more than 1 yard, the case for going for it only strengthens. As Alabama found out against Ohio State last year, even a simple running play up the middle has the potential to go the distance.
So now we’re all set to track the 4^{th} down decisions in this upcoming Big Ten season. The first Big Ten conference game is September 19^{th}, when Rutgers takes on Penn State. And the Big Ten 4^{th} down calculator is ready and waiting.
Let the games begin!