More March Madness with Minitab and Nonlinear Regression
What, it’s still not March? Blasted February, why won't you just end already! Oh well, at least it gives us time for some more data analysis.
In my last post, I used Minitab’s Fitted Line Plot to create a regression model that predicted the probability of a home team winning a basketball game based on the difference in ranks between the two teams. This model had an r-squared value of 95.2%, which is great. But since it’s still February, let’s spend some time trying to improve on that number.
Improving the Regression Model
My last model used the difference in ranks between two teams. This assumes that a given gap in rank reflects the same gap in team strength anywhere in the rankings. For example, the difference between the 1st team and the 25th team is treated the same as the difference between the 100th and 125th teams. But does this make sense? You could argue that there is a bigger difference between Kentucky and Gonzaga (ranked 1st and 25th at the time of this writing) than between Villanova and Georgia (ranked 100th and 125th at the time of this writing).
So you might imagine that in college basketball you have a couple of elite teams, a couple of terrible teams, and a majority of teams that are pretty close together in the middle. In other words, team strength is approximately normally distributed.
There has been research done on this. Some statisticians suggest using a statistic based on the inverse normal cumulative distribution function (CDF) of team i’s ranking relative to the total number of teams. For example, if there were 345 teams and team i was ranked 21st, then the statistic would be the inverse normal CDF of (345-21)/345.
Let’s apply this to our model and see if it improves. To do this, we can first use Minitab’s calculator to convert each ranking to a proportion. The Pomeroy Ratings rank 345 basketball teams, so for each team we take 345, subtract their rank, then divide by 345.
Then we can use Minitab's Probability Distribution to get our statistics (select Calc > Probability Distributions > Normal...). In the dialog box below, the mean is 0 and the standard deviation is 1 because those are the values of the parameters in a standard normal distribution.
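If you want to check the conversion outside of Minitab, here is a minimal Python sketch of the same two steps, using only the standard library. The function name `rank_to_statistic` is made up for illustration. One wrinkle the formula has: the last-ranked team gets a proportion of 0, which has no finite inverse normal value, so the sketch returns negative infinity there. That choice is an assumption of this sketch, not something taken from the analysis above.

```python
from statistics import NormalDist


def rank_to_statistic(rank: int, n_teams: int = 345) -> float:
    """Inverse standard normal CDF of (n_teams - rank) / n_teams."""
    p = (n_teams - rank) / n_teams
    if p <= 0.0:
        # The last-ranked team gives p = 0, which has no finite inverse.
        # Returning -inf is this sketch's own choice (an assumption).
        return float("-inf")
    # Mean 0 and standard deviation 1: the standard normal distribution.
    return NormalDist(mu=0.0, sigma=1.0).inv_cdf(p)


# The example from the text: the 21st of 345 teams.
print(rank_to_statistic(21))

# The same 25-place gap in rank is a much bigger gap in the statistic
# at the top of the rankings than in the middle.
top_gap = rank_to_statistic(1) - rank_to_statistic(25)
mid_gap = rank_to_statistic(100) - rank_to_statistic(125)
print(top_gap, mid_gap)
```

The last two lines show why the conversion matters: the gap between the 1st and 25th teams comes out several times larger than the gap between the 100th and 125th teams.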
Now that the rankings are converted, I’ll run another fitted line plot using the difference in our new statistics instead of the difference in ranks.
Look at that—our model got even better! Our r-squared value is over 99%. Looks like we’re going to use these new statistics.
So we’re done, right? Well, not quite. Look at the ends of the fitted line, especially the top left corner. You see how it starts to curve back down? Remember that the better the team, the higher the statistic, so a negative value in the difference means that the home team is ranked higher than the away team. As the difference gets more negative, the advantage the home team has over the away team should increase, too. But right around a difference of -2, the fitted line turns around and the predicted probability starts to drop. In other words, past that point, the better the home team, the less likely the model says they are to win.
That’s not right at all.
So although it’s close, this cubic model doesn’t quite work. What we really want is for the line to asymptotically approach 100 on the left, and 0 on the right because the probability of a team winning has to be between 100% and 0%.
Well, it looks like we’re going to have to turn to nonlinear regression for this one. (Don’t worry, I’m scared too.)
Choosing the Right Nonlinear Regression Model
I haven’t the slightest clue what kind of nonlinear regression model to use, but Minitab Statistical Software can help with that. So when I select Stat > Regression > Nonlinear Regression..., I click "Use Catalog" to help me out.
I'm confronted with a list. Hmmm. Loglogistic, Weibull, Holliday...what to make of it all? And Gompertz Growth?!?!?! I’m pretty sure somebody just made that one up. How do I even start?
Oh wait...Minitab includes pictures to help you figure out which model might work best. And hey, what do you know? That bottom picture for Logistic Growth looks a lot like what our model should look like!
So I’ll select logistic growth, enter the “difference in statistics” as my predictor and the “probability of the home team winning” as the response. If you look at the picture, you’ll see that Theta 1 is the bottom asymptote, and Theta 2 is the top asymptote. We know those values are 0 and 100 because a team’s probability of winning has to be between 0% and 100%. Minitab can lock those values in place, so that’s just what we’ll have it do. We’ll start the other two parameters at 1 and let Minitab figure them out. So what do we get?
Hey, that looks just about perfect! I think we finally found our model!
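To see why this shape fixes the problem with the cubic, here is a rough Python sketch of a four-parameter logistic curve with the asymptotes pinned at 0 and 100. Treat the exact formula as an assumption: it is one common way to write a logistic curve, and Minitab's catalog may parameterize Theta 3 and Theta 4 differently. The names `midpoint` and `scale` are mine, and their defaults of 1 are just the starting values mentioned above, not the fitted estimates.

```python
import math


def logistic_curve(x, bottom=0.0, top=100.0, midpoint=1.0, scale=1.0):
    """Four-parameter logistic curve (assumed form, not Minitab's exact one).

    With scale > 0 the curve falls from `top` on the far left
    to `bottom` on the far right, and never leaves that range.
    """
    return bottom + (top - bottom) / (1.0 + math.exp((x - midpoint) / scale))


# Walk across a range of differences: the values only ever go down,
# flattening out near 100 on the left and near 0 on the right.
for diff in (-6, -2, 0, 2, 8):
    print(diff, round(logistic_curve(diff), 1))
```

Unlike the cubic, this curve can't turn back down on the left or dip below 0 on the right, no matter what values the two free parameters end up with (as long as the scale stays positive).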
The Right Regression Models for Quality Improvement
In some quality improvement situations, it may not be practical to take things to this level of complexity because the extra resources you spend may not be worth the added benefit you get. You don't always need a perfect model to identify the source of a quality problem! But it’s good to know that if you do have the time and resources, there are plenty of tools available in Minitab to help you find the best possible model for your process.
Predicting Winning Teams with the Nonlinear Regression Model
Now that the hard part is behind us, let’s use our model to make some predictions! The table below has some of the bigger games over the next couple of days and which team the model predicts to win. And yes, I realize that in the NCAA tournament all of the games are at neutral sites. But if this model works well for home/away teams, there is no reason we can’t do the same thing for teams that play on a neutral court! In the meantime, I’ll be tracking every college basketball game over the next few weeks. Then when it’s actually March, I’ll come back to see how it’s doing!
| Day | Home Team | Away Team | Predicted Winner (Win Probability) | Commentary |
|---|---|---|---|---|
| Wed | Notre Dame | West Virginia | Notre Dame - 56% | Notre Dame is 3rd in the Big East. West Virginia is 8th. Yet the Irish are only slight favorites at home. Interesting. |
| Wed | Minnesota | Michigan St | Michigan St - 76% | Odds of Minnesota getting back into the bubble talk with a big upset over Sparty........not good. |
| Wed | San Diego St | Wyoming | San Diego St - 67% | So the 24th ranked team only beats Wyoming 2 out of 3 times on their home court? Perhaps the Aztecs are overrated. |
| Wed | Kansas | Texas A&M | Kansas - 97% | Kansas is good at basketball. |
| Wed | Syracuse | South Florida | Syracuse - 91% | South Florida is 5th in the Big East, but their last 4 games are against upper half of the conference. Bubble may be bursting in 3.....2.....1.... |
| Thurs | Florida St | Duke | Florida St - 52% | Florida St with a legitimate shot at the season sweep of Duke. |
| Thurs | Cincinnati | Louisville | Cincinnati - 52% | Louisville's last 3 games have been decided by 1 possession or gone into overtime. Looks like another nail biter is on the way. |
| Thurs | Gonzaga | BYU | Gonzaga - 62% | Just because I wanted to type GONZAGA! |
| Thurs | Iowa | Wisconsin | Wisconsin - 79% | Why a single head to head game doesn't mean much. Iowa beat Wisconsin in the Kohl Center earlier this season, yet are underdogs at home. |
| Fri | West Virginia | Marquette | West Virginia - 52% | First Friday game involving a ranked BCS team since 2011. |
| Fri | Harvard | Princeton | Harvard - 85% | Harvard looks to put a stranglehold on the Ivy League title. |
| Sat | Virginia | North Carolina | North Carolina - 62% | North Carolina's toughest game until their rematch with Duke. |
| Sat | Kansas | Missouri | Kansas - 72% | And this number doesn't even factor in that Missouri lost at home Tuesday night. Also, have I mentioned that Kansas is good at basketball? |
| Sat | Connecticut | Syracuse | Syracuse - 65% | Connecticut with a realistic chance at saving their season with a win over Syracuse. |
| Sat | Kentucky | Vanderbilt | Kentucky - 91% | The scary part about this is that Vanderbilt is actually pretty good. |
| Sat | Penn St | Northwestern | Northwestern - 61% | Northwestern can't stumble in State College if they're going to make the tournament for the first time ever. |
| Sat | Michigan | Purdue | Michigan - 70% | Purdue could use a big win at Michigan to boost their seed in the tournament. |
| Sat | Mercer | Belmont | Belmont - 72% | Belmont in the Sweet Sixteen! Remember, you heard it here first! Unless they lose in the 1st round, then please forget where you heard it. |
| Sat | Michigan St | Nebraska | Michigan St - 97% | I included this game just so I could say "Sparty Time!" |
| Sat | Arizona | UCLA | Arizona - 78% | I included this game so I don't get accused of east coast bias. |
| Sun | Minnesota | Indiana | Indiana - 59% | Can the Hoosiers actually win on the road? |
| Sun | Ohio St | Wisconsin | Ohio St - 81% | The Buckeyes have lost 2 of their last 4, but should get a win over Wisconsin. |
*All probabilities use data from games through Monday, February 20th
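For anyone who wants to play along at home, here is a hedged Python sketch of how a prediction like the ones in the table could be computed from two rankings. Everything about it is illustrative: the function names are made up, the logistic formula is an assumed parameterization rather than Minitab's exact one, the difference is taken as away minus home (inferred from negative differences favoring the home team), and the `midpoint` and `scale` values in the usage lines are placeholders, not the estimates Minitab produced. It will not reproduce the table.

```python
import math
from statistics import NormalDist

N_TEAMS = 345  # number of ranked teams, from the post


def rank_to_statistic(rank, n_teams=N_TEAMS):
    """Inverse standard normal CDF of (n_teams - rank) / n_teams.

    Note: the last-ranked team gives a proportion of 0, and inv_cdf
    raises an error for it. This sketch does not handle that case.
    """
    return NormalDist().inv_cdf((n_teams - rank) / n_teams)


def home_win_probability(home_rank, away_rank, midpoint, scale):
    """Percent chance the home team wins, from a logistic curve
    with the bottom and top asymptotes fixed at 0 and 100."""
    # Away minus home: negative when the home team is the better team.
    diff = rank_to_statistic(away_rank) - rank_to_statistic(home_rank)
    return 100.0 / (1.0 + math.exp((diff - midpoint) / scale))


# Placeholder parameter values, NOT the fitted ones.
better_at_home = home_win_probability(10, 60, midpoint=0.0, scale=1.0)
worse_at_home = home_win_probability(60, 10, midpoint=0.0, scale=1.0)
print(round(better_at_home, 1), round(worse_at_home, 1))
```

With a real fit, any home-court advantage would show up in the fitted parameters, since they come from actual home and away results.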