March Madness... with Minitab

Kevin Rudy 17 February, 2012

Yes, it's the 1994 NCAA Tournament bracket. In 4th grade I caught pneumonia in early March and missed school for a week. My mom cut the bracket out of the newspaper to give me something to do. I've been hooked ever since.

I know, I know. It’s not March yet. But it’s never too early to start thinking about your bracket. Can Murray St. be this year’s Butler? Which elite team really is the best? Can some double-digit seed give us an upset celebration as good as Hampton in 2001? There are so many questions. Let’s see if we can use Minitab Statistical Software to answer them.

Now, there are many statisticians out there who are already using data analysis to rank college basketball teams. These rankings can easily be used to predict the winner of a game by just looking at which team is ranked higher. But which set of rankings should we use? Well, there is a group of people who have created the LRMC rankings, which they claim have correctly predicted the most NCAA tournament games in the last 9 years. Sweet, let’s use them!

Except there is one small problem: simply predicting the winner only answers part of the question!

For many games, it’s actually not that hard to predict the winner. For example, anybody could tell you that the favorite in a hypothetical matchup between Duke and Belmont would be Duke. But we all know that upsets happen, and the better team doesn’t always win. So what we really want to know is the percentage of time the favorite will win. What if Duke would only win 55% of the time? How great would it be to know that Belmont actually has a legitimate chance to pull the upset? And if you don’t think anything like that is realistic, find where Belmont is ranked in the LRMC rankings. Then look at the score of Duke’s first game this year. Then pencil Belmont into your Sweet 16. Wait, what was that last one? Let’s just get back to the problem at hand.

We need to find a way to turn the LRMC ranks of two teams into a probability of who would win a game between them. To do so, we’re going to turn to Ken Pomeroy’s system, the Pomeroy Ratings. The Pomeroy Ratings actually give the probability of the favorite winning for every college basketball game (although you have to subscribe to see them). The problem is that the probabilities aren’t calculated from where the teams are ranked. Instead, they are calculated from each team’s Pythagorean expected winning percentage.
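If you haven't seen a Pythagorean expectation before, here is a rough sketch of the idea in Python. To be clear, this is my own illustration and not the exact Pomeroy calculation: the function names, the exponent of 10.25, and the scoring numbers are all assumptions. The second function uses Bill James's log5 formula, which is one common way to turn two expected winning percentages into a head-to-head probability.

```python
def pythagorean_expectation(points_for, points_against, exponent=10.25):
    """Expected winning percentage from points scored and points allowed.

    The exponent is an assumed value for illustration only.
    """
    scored = points_for ** exponent
    allowed = points_against ** exponent
    return scored / (scored + allowed)


def log5(p_a, p_b):
    """Chance that team A beats team B, given each team's expected winning percentage."""
    return (p_a - p_a * p_b) / (p_a + p_b - 2 * p_a * p_b)


# Hypothetical per-game scoring averages, not real team data.
strong = pythagorean_expectation(80.0, 65.0)
average = pythagorean_expectation(70.0, 70.0)

print(f"Strong team expected winning percentage:  {strong:.3f}")
print(f"Average team expected winning percentage: {average:.3f}")
print(f"Chance the strong team beats the average team: {log5(strong, average):.3f}")
```

Notice that nothing in this calculation uses a team's rank. It needs points scored and allowed, which is exactly why it can't be applied directly to a system that only publishes rankings.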

This is where Minitab comes in. We want to find a way to calculate the probability based on where the two teams are ranked, not based on their expected winning percentage. Then we can apply that calculation to any ranking system that we want, like the LRMC rankings! So let’s (finally) analyze some data!

For 150 games, I collected where the Pomeroy Ratings ranked each team, and the probability of the home team winning. To start, I’ll create a scatterplot of the difference in the ranks (Away minus Home) versus the probability to see what the relationship between these two variables looks like.

Scatterplot of Probability of Home Team vs Difference in Ranks

Because the difference in ranks is calculated by taking the away team minus the home team, a positive difference means the home team is better than the away team. And as we see, the higher the difference, the better chance the home team has of winning. This makes sense, so we’re off to a good start.
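The sign convention is easy to flip by accident, so here it is spelled out in a few lines of Python. The function name and the ranks are made up for illustration; the only thing taken from the analysis is "Away minus Home", with rank 1 being the best team.

```python
def rank_difference(away_rank, home_rank):
    """Away rank minus home rank, where rank 1 is the best team."""
    return away_rank - home_rank


# Hypothetical ranks, not taken from the 150 games.
# Home team ranked 15th hosts a team ranked 120th: positive, home team is better.
print(rank_difference(away_rank=120, home_rank=15))   # 105

# Home team ranked 60th hosts a team ranked 8th: negative, away team is better.
print(rank_difference(away_rank=8, home_rank=60))     # -52
```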

Next, look at the shape of the scatterplot. It isn’t quite linear. I can see curvature at the bottom left corner and top right corner. So let’s use Fitted Line Plot to fit the data to a cubic model:

Fitted Line Plot of Probability of Home Team vs Difference in Ranks

Hey look at that, we got an R-squared value of 95.2%. That means 95.2% of the variability in the probability can be explained by the difference in the ranks! In the world of quality improvement, this would be great. An R-squared of 95.2% would be more than adequate in explaining enough variability to enable you to make some real improvements in your process or product. You could stop pouring time and resources into researching the problem, and start pouring them into fixing it!
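For anyone curious what a cubic fit and its R-squared look like outside of Minitab, here is a minimal sketch in plain Python. It is not what Minitab does internally, and the data is synthetic: I don't reproduce the 150 games here, so the points below come from a made-up S-shaped curve purely so the code runs. All of the function names are mine.

```python
import math


def fit_cubic(xs, ys):
    """Least-squares fit of y = b0 + b1*x + b2*x^2 + b3*x^3 via the normal equations."""
    n_terms = 4
    # Augmented matrix [X'X | X'y] for the four polynomial terms.
    rows = [[sum(x ** (i + j) for x in xs) for j in range(n_terms)]
            + [sum(y * x ** i for x, y in zip(xs, ys))]
            for i in range(n_terms)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n_terms):
        pivot = max(range(col, n_terms), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n_terms):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][n_terms] / rows[i][i] for i in range(n_terms)]


def predict(coefs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(b * x ** i for i, b in enumerate(coefs))


def r_squared(ys, fitted):
    """Share of the variability in ys explained by the fitted values."""
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_error = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return 1 - ss_error / ss_total


# Synthetic placeholder data: an assumed S-shaped curve, NOT the real games.
diffs = list(range(-300, 301, 20))
scaled = [d / 100 for d in diffs]  # rescale so the cubed terms stay small
probs = [1 / (1 + math.exp(-d / 60)) for d in diffs]

coefs = fit_cubic(scaled, probs)
fitted = [predict(coefs, x) for x in scaled]
print(f"R-squared: {r_squared(probs, fitted):.1%}")
```

The R-squared printed here says nothing about the real data. It only shows the arithmetic behind the number: one minus the leftover variation around the fitted curve divided by the total variation in the probabilities.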

But I’m a stat nerd who doesn’t quite live in the world of quality improvement. The tournament doesn’t start for 4 weeks, which gives me plenty of time to keep researching. Sure, an R-squared value of 95.2% is great, but what if I could do even better? In my next post, I’ll see if I can do just that.