Shakespeare and Best Subsets Regression

Cody Steele 09 January, 2013

[Image: Shakespeare's bust at the Temple of British Worthies]

  Orlando:  And wilt thou have me?

  Rosalind:  Ay, and twenty such. 

  Orlando:  What sayest thou? 

  Rosalind:  Are you not good? 

  Orlando:  I hope so. 

  Rosalind:  Why then, can one desire too much of a good thing?

William Shakespeare, As You Like It, Act IV, Scene I.


When you look at best subsets regression, Shakespeare’s question about whether one can desire too much of a good thing becomes immediately important. With the power of best subsets regression, you can quickly explore models with twenty such. For the gummi bear data, you could even try models with thirty such! And as you add terms to the model, you will find models that score better and better on summary statistics.

But in statistics, the answer to Rosalind’s question is “Yes, you can desire too much of a good thing.”

Large models often maximize summary statistics, but not without a cost. Consider the best subsets regression for the gummi bear data. If we chose adjusted R-squared as our sole criterion for picking a model from the results, we would use a very impressive 16-term model that pushes adjusted R-squared all the way to 87.2%! That’s fantastic, you might think to yourself.
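To make the idea of picking a model by adjusted R-squared concrete, here is a minimal sketch of a best subsets search in Python with NumPy. It is not the procedure or software used for the gummi bear analysis. The data are synthetic, and the names `adjusted_r2`, `best_subsets`, and `x1` through `x6` are assumptions made up for illustration.

```python
from itertools import combinations
import numpy as np

def adjusted_r2(X, y):
    """Fit ordinary least squares with an intercept; return adjusted R-squared."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])           # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sse = resid @ resid                            # residual sum of squares
    sst = ((y - y.mean()) ** 2).sum()              # total sum of squares
    p = A.shape[1]                                 # parameters, incl. intercept
    return 1.0 - (sse / (n - p)) / (sst / (n - 1))

def best_subsets(X, y, names):
    """Return {size: (adjusted R-squared, term names)} for the best subset of each size."""
    best = {}
    for k in range(1, X.shape[1] + 1):
        for cols in combinations(range(X.shape[1]), k):
            score = adjusted_r2(X[:, list(cols)], y)
            if k not in best or score > best[k][0]:
                best[k] = (score, [names[c] for c in cols])
    return best

# Synthetic data (an assumption): only x1 and x2 truly affect the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=40)
names = [f"x{i + 1}" for i in range(6)]

best = best_subsets(X, y, names)
for k, (score, terms) in best.items():
    print(k, round(score, 4), terms)
```

The search fits every combination of terms, so it is only practical for a modest number of candidates. Looking at the best model of each size side by side, rather than only the overall maximum, makes it easier to see when extra terms buy very little.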

Then, because you want to use the model to predict, you’d check the predicted R-squared and get even more excited. Predicted R-squared is 83.9% for this model. It looks great! But we added a lot of complexity to get those statistics that high. This model contains two 4-factor interactions, after all. And that complexity has a cost.
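Predicted R-squared is built from the PRESS statistic: each observation is predicted from a model fit without it, and the squared prediction errors are summed. The sketch below shows one common way to compute it, using the leverage shortcut so that the model is fit only once. The function name `predicted_r2` and the synthetic data are assumptions for illustration, not the gummi bear data.

```python
import numpy as np

def predicted_r2(A, y):
    """Predicted R-squared, 1 - PRESS / SST, for model matrix A (intercept included)."""
    H = A @ np.linalg.pinv(A)                  # hat matrix
    resid = y - H @ y                          # ordinary residuals
    deleted = resid / (1.0 - np.diag(H))       # leave-one-out prediction errors
    press = (deleted ** 2).sum()
    sst = ((y - y.mean()) ** 2).sum()
    return 1.0 - press / sst

# Synthetic data (an assumption): one real predictor plus nine columns of pure noise.
rng = np.random.default_rng(0)
n = 30
x = rng.normal(size=n)
noise_cols = rng.normal(size=(n, 9))
y = 2.0 * x + rng.normal(size=n)

A_small = np.column_stack([np.ones(n), x])
A_large = np.column_stack([A_small, noise_cols])

print("predicted R-sq, 1 term:  ", predicted_r2(A_small, y))
print("predicted R-sq, 10 terms:", predicted_r2(A_large, y))
```

Because every deleted residual is divided by one minus its leverage, points that the model leans on heavily are penalized more, which is why predicted R-squared is usually a stricter check than adjusted R-squared.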

We can't have as much confidence in the predictions from the 16-term model as we can in those from the 5-term model.

The graph shows the width of the confidence intervals for the predictions of 2 different models. The intervals are for the points in the designed experiment. The red lines show the widths for the 16-term model from the best subsets regression results. The blue line shows a model that uses only the 5 main effects.

Except for the center points, the confidence intervals for the 16-term model are always wider than the intervals for the model with only main effects. This main-effects model has a predicted R-squared of 77.48%, only 6.42 percentage points less than the model with 11 additional terms. But for the simpler model, the confidence interval for how far the gummi bears will go is, averaged across the points in the design, 2.4 inches narrower.
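The reason larger models give wider intervals can be sketched in a few lines. The half-width of a confidence interval for the mean response at a design point is a t critical value times the residual standard deviation times the square root of that point's leverage, and leverage grows as terms are added. The example below assumes a hypothetical 2^5 full factorial with no center points and a synthetic response. The larger model here (main effects plus all two-factor interactions) is a stand-in, not the 16-term model from the gummi bear results.

```python
from itertools import combinations, product
import numpy as np

# Hypothetical 2^5 full factorial in coded units (-1, +1): 32 runs, 5 factors.
design = np.array(list(product([-1.0, 1.0], repeat=5)))
n = len(design)

# Model matrices: intercept + 5 main effects, then the same plus all
# 10 two-factor interactions (a stand-in for a much larger model).
small = np.hstack([np.ones((n, 1)), design])
pairs = [design[:, i] * design[:, j] for i, j in combinations(range(5), 2)]
large = np.hstack([small, np.column_stack(pairs)])

def leverage(A):
    """Diagonal of the hat matrix A (A'A)^-1 A'."""
    return np.diag(A @ np.linalg.pinv(A))

def ci_halfwidth(A, y, t_crit):
    """Half-width of the confidence interval for the mean response at each run."""
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    s2 = resid @ resid / (A.shape[0] - A.shape[1])   # residual mean square
    return t_crit * np.sqrt(s2 * leverage(A))

# Synthetic response (an assumption): main effects only, plus noise.
rng = np.random.default_rng(1)
y = 50 + design @ np.array([4.0, -3.0, 2.0, 1.0, 0.5]) + rng.normal(0, 2, n)

# t critical values for 95% intervals, taken from a standard t table:
# 26 residual degrees of freedom for the small model, 16 for the large one.
w_small = ci_halfwidth(small, y, t_crit=2.056)
w_large = ci_halfwidth(large, y, t_crit=2.120)

print("leverage, small model:", leverage(small)[0])   # 6/32, up to rounding
print("leverage, large model:", leverage(large)[0])   # 16/32, up to rounding
print("mean half-width, small vs large:", w_small.mean(), w_large.mean())
```

In this balanced design every run has leverage equal to the number of model parameters divided by the number of runs, so adding terms raises the leverage everywhere. Unless the extra terms shrink the residual variation enough to compensate, the intervals get wider.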

This effect is a good justification for the principle of parsimony, which basically means that if two models perform about equally well, you should choose the simpler one. The more complex model does explain a little bit more variation, but it sacrifices a lot of precision in its predictions. That’s a terrible tradeoff when you want to predict.

Just as Orlando would be hurt if Rosalind took 20 suitors like him, you want to keep the terms in your model to a number that doesn’t hurt.

The photo of the bust of Shakespeare is © Philip Halling and licensed for reuse under a Creative Commons Licence.