Understanding Overfit in Statistics: Those Skintight Jeans Fit Perfect When You Bought Them, But…
Sometimes, statistical terms can seem like they were zapped down from outer space by sadistic, mealy-mouthed aliens: R-squared adjusted, heteroskedasticity, 3-parameter Weibull distribution.
But not all statistics terminology should leave you feeling woozy and glassy-eyed. Some terms actually make intuitive sense. Knowing those terms can help you get a handle on output that may seem fuzzy at first glance.
Take “overfitting the model”—something statisticians caution against when you model the relationship between a predictor and a response.
Overfitting a model to your data is like buying a pair of skintight jeans. Today, your model (just like your jeans) seems to “hug” your sample data perfectly. But you also want your jeans to fit a year or so down the road.
Likewise, you want your model to accurately predict the response for data you don’t have—that is, for other samples from the population—not just for the sample you have now.
To see this, let's say we have created a model for a small data set. The graph below shows that our model fits the sample data (red points) perfectly.
Ah, life is good.
But suppose new data is collected (the green points below).
Uh-oh. The "perfect" model predicted that values > 10 would fall roughly in the red dotted area.
The fit isn't very good for the new data.
The linear model below is a “looser fit” for the sample data (red dots) than the model above, but it comes much closer to the new data (green points):
Now, this doesn't mean that a good model fit for your sample data is a bad thing, per se. It's not. But a common pitfall is overloading a model with gobs of predictors and increasing its complexity, just for the sake of a "perfect" fit for the sample data.
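If you'd like to try the jeans on yourself, here is a minimal Python sketch of the same idea. Everything in it is an assumption made for illustration: the data values are invented, the helper name sum_squared_error is hypothetical, and a degree-5 polynomial stands in for a model overloaded with predictors. It fits a "skintight" curve and a straight line to six sample points, then checks both against new points at x > 10.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Invented sample data: a straight-line trend (2 + 1.5x) plus fixed "noise".
x_sample = np.array([1.0, 3.0, 5.0, 6.0, 8.0, 10.0])
noise = np.array([0.8, -1.1, 1.3, -0.9, 0.7, -0.6])
y_sample = 2.0 + 1.5 * x_sample + noise

# Invented new data, collected later, all at x > 10.
x_new = np.array([11.0, 12.0, 13.0])
y_new = 2.0 + 1.5 * x_new + np.array([0.3, -0.4, 0.5])


def sum_squared_error(model, x, y):
    """Sum of squared differences between observed y and the model's predictions."""
    return float(np.sum((y - model(x)) ** 2))


# Skintight fit: a degree-5 polynomial passes through all six sample points.
tight = Polynomial.fit(x_sample, y_sample, deg=5)
# Looser fit: a straight line.
loose = Polynomial.fit(x_sample, y_sample, deg=1)

sse_sample_tight = sum_squared_error(tight, x_sample, y_sample)
sse_sample_loose = sum_squared_error(loose, x_sample, y_sample)
sse_new_tight = sum_squared_error(tight, x_new, y_new)
sse_new_loose = sum_squared_error(loose, x_new, y_new)

print(f"Sample data: tight SSE = {sse_sample_tight:.4f}, loose SSE = {sse_sample_loose:.4f}")
print(f"New data:    tight SSE = {sse_new_tight:.1f}, loose SSE = {sse_new_loose:.1f}")
```

On the sample data the skintight curve wins easily, because it has been tailored to the noise as well as the trend. On the new data that tailoring works against it, and the plain straight line comes out far ahead.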
Luckily, Minitab provides two statistics in regression output that can help you evaluate both the model fit for your sample data and the predicted model fit for new data.
R² tells you how well your model fits your sample data. Predicted R², aka R² (pred), tells you how well your model predicts responses for new data. For both of these stats, 100% represents a perfect fit.
You might be better off with a model that has a slightly looser fit for your sample data (a slightly lower adjusted R²) if it has a markedly higher predicted R² than another model.
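To make the "current fit" versus "future fit" comparison concrete, here is a rough Python sketch. It assumes the usual definition of predicted R² as 1 - PRESS/SSTO, where PRESS is found by leaving out each observation in turn, refitting the model, and predicting the point that was left out. The data values and the function names r_squared and predicted_r_squared are made up for illustration, and polynomial degree is again a stand-in for model complexity. This is not Minitab's own code.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Invented sample data: a straight-line trend (2 + 1.5x) plus fixed "noise".
x = np.array([1.0, 3.0, 5.0, 6.0, 8.0, 10.0])
y = 2.0 + 1.5 * x + np.array([0.8, -1.1, 1.3, -0.9, 0.7, -0.6])


def r_squared(x, y, degree):
    """Ordinary R-squared: 1 - SSE/SSTO for a polynomial fit to all the data."""
    model = Polynomial.fit(x, y, deg=degree)
    sse = np.sum((y - model(x)) ** 2)
    ssto = np.sum((y - y.mean()) ** 2)
    return float(1.0 - sse / ssto)


def predicted_r_squared(x, y, degree):
    """Predicted R-squared: 1 - PRESS/SSTO, with PRESS from leave-one-out refits."""
    press = 0.0
    for i in range(len(x)):
        # Refit the model without observation i, then predict observation i.
        x_rest, y_rest = np.delete(x, i), np.delete(y, i)
        model = Polynomial.fit(x_rest, y_rest, deg=degree)
        press += (y[i] - model(x[i])) ** 2
    ssto = np.sum((y - y.mean()) ** 2)
    return float(1.0 - press / ssto)


r2_simple = r_squared(x, y, degree=1)
r2_complex = r_squared(x, y, degree=4)
pred_simple = predicted_r_squared(x, y, degree=1)
pred_complex = predicted_r_squared(x, y, degree=4)

print(f"Straight line: R-sq = {r2_simple:.1%}, R-sq(pred) = {pred_simple:.1%}")
print(f"Degree 4:      R-sq = {r2_complex:.1%}, R-sq(pred) = {pred_complex:.1%}")
```

With these made-up numbers, the more complex model has the higher R² but a far lower predicted R². The value this formula gives even drops below zero, which means the leave-one-out prediction errors are larger than the total variation in the response.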
So, as you don your Tyra Banks hat and search for a top model, don't get too catty about it. Consider the "current fit" and the “future fit” as well.
Now, if statisticians could just develop a statistic to predict whether those expensive new jeans will be cutting off your blood circulation a year from now…