If you're just learning about regression, read my regression tutorial!
Just how high should R² be in regression analysis? I hear this question asked quite frequently.
Previously, I showed how to interpret R-squared (R²). I also showed how it can be a misleading statistic because a low R-squared isn’t necessarily bad and a high R-squared isn’t necessarily good.
Clearly, the answer for “how high should R-squared be” is . . . it depends.
In this post, I’ll help you answer this question more precisely. However, bear with me, because my premise is that if you’re asking this question, you’re probably asking the wrong question. I’ll show you which questions you should actually ask, and how to answer them.
How high should R-squared be? There’s only one possible answer to this question. R² must equal the percentage of the response variable variation that is explained by a linear model, no more and no less.
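To make that definition concrete, here is a minimal Python sketch of the usual calculation for a least-squares model with an intercept: one minus the residual variation divided by the total variation. The function name `r_squared` and the data values are hypothetical, chosen only for illustration.

```python
def r_squared(observed, fitted):
    """Share of the variation in `observed` that the fitted values account for."""
    mean_obs = sum(observed) / len(observed)
    # Total variation of the response around its mean.
    ss_total = sum((y - mean_obs) ** 2 for y in observed)
    # Variation the model leaves unexplained (the residuals).
    ss_resid = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    return 1.0 - ss_resid / ss_total

# Hypothetical data, for illustration only.
output = [2.1, 3.9, 6.2, 7.8, 10.1]
fitted = [2.0, 4.0, 6.0, 8.0, 10.0]
print(f"R-squared: {r_squared(output, fitted):.1%}")
```

Whatever number this returns is simply a description of the fit. It is not a target to hit.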
When you ask this question, what you really want to know is whether your regression model can meet your objectives. Is the model adequate given your requirements?
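I’m going to help you ask and answer the correct questions. The questions depend on whether your major objective for the linear regression model is to describe the relationship between the predictors and the response variable, or to predict the response variable.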
This one is easy. If your main goal is to determine which predictors are statistically significant and how changes in the predictors relate to changes in the response variable, R-squared is almost totally irrelevant.
If you correctly specify a regression model, the R-squared value doesn’t affect how you interpret the relationship between the predictors and response variable one bit.
Suppose you model the relationship between Input and Output. You find that the p-value for Input is significant, its coefficient is 2, and the assumptions pass muster.
These results indicate that a one-unit increase in Input is associated with an average two-unit increase in Output. This interpretation is correct regardless of whether the R-squared value is 25% or 95%!
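If you'd like to check this for yourself, here is a small simulation sketch in Python. It assumes a true relationship of Output = 5 + 2 × Input (the intercept, sample size, and noise levels are my own made-up values) and fits the same model to a low-noise and a high-noise version of the data. The helper `fit_line` is a hypothetical name for a hand-rolled least-squares fit.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return intercept, slope, 1.0 - ss_resid / ss_total

rng = random.Random(1)  # fixed seed so the run is repeatable
inputs = [rng.uniform(0, 10) for _ in range(2000)]

# Same assumed relationship (Output = 5 + 2 * Input), two different noise levels.
results = {}
for label, noise_sd in [("little noise", 1.3), ("lots of noise", 10.0)]:
    outputs = [5 + 2 * xi + rng.gauss(0, noise_sd) for xi in inputs]
    _, slope, r2 = fit_line(inputs, outputs)
    results[label] = (slope, r2)
    print(f"{label}: slope = {slope:.2f}, R-squared = {r2:.0%}")
```

Under these assumptions, both fits should estimate a slope close to 2 even though their R-squared values are far apart. The noise changes how tightly the points hug the line, not what the line says.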
See a graphical illustration of why a low R-squared doesn't affect this interpretation.
Asking “how high should R-squared be?” doesn’t make sense in this context because it isn’t relevant. A low R-squared doesn’t negate a significant predictor or change the meaning of its coefficient. R-squared is simply whatever value it is, and it doesn’t need to be any particular value to allow for a valid interpretation.
In order to trust your interpretation, which questions should you ask instead?
If your main goal is to produce precise predictions, R-squared becomes a concern. Predictions aren’t as simple as a single predicted value because they include a margin of error; more precise predictions have less error.
R-squared enters the picture because a lower R-squared indicates that the model leaves more of the response variation unexplained. Thus, a low R-squared can warn of imprecise predictions. However, R-squared is a unitless percentage, so you can’t use it to determine whether the predictions are precise enough for your needs.
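Here is one way to see why, sketched in Python with made-up data. I fit the same line twice: once to the original response and once to the response multiplied by 10. The helper `fit_summary` is a hypothetical name; it returns R-squared along with S, the standard error of the regression, which measures the typical size of a residual in the units of the response.

```python
import math

def fit_summary(x, y):
    """Fit y = b0 + b1 * x by least squares; return (r_squared, s)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(ss_resid / (n - 2))  # residual standard error, in response units
    return 1.0 - ss_resid / ss_total, s

# Hypothetical data, for illustration only.
inputs = [1, 2, 3, 4, 5, 6, 7, 8]
outputs = [2.3, 3.8, 6.4, 7.9, 9.6, 12.5, 13.8, 16.3]
outputs_x10 = [10 * y for y in outputs]  # same pattern, ten times the scale

r2_small, s_small = fit_summary(inputs, outputs)
r2_large, s_large = fit_summary(inputs, outputs_x10)
print(f"original response: R-squared = {r2_small:.1%}, S = {s_small:.2f}")
print(f"response x 10:     R-squared = {r2_large:.1%}, S = {s_large:.2f}")
```

The two fits share the same R-squared, yet the typical prediction error in the second is ten times larger. R-squared alone can't tell you which of the two is precise enough.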
That’s why “How high should R-squared be?” is still not the correct question.
Which questions should you ask? In addition to the questions above, you should ask whether the prediction intervals are precise enough for your requirements.
Don’t worry, Minitab Statistical Software makes this easy to assess.
A prediction interval represents the range where a single new observation is likely to fall given specified settings of the predictors. These intervals account for the margin of error around the mean prediction. Narrower prediction intervals indicate more precise predictions.
You can use subject area knowledge, spec limits, client requirements, etc., to determine whether the prediction intervals are precise enough to suit your needs. This approach directly assesses the model’s precision, which is far better than choosing an arbitrary R-squared value as a cut-off point.
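If you don't have statistical software handy, the idea can be sketched in plain Python for a model with a single predictor. Treat this as an illustration under stated assumptions rather than a substitute for your software's output: the data are simulated, the function name `prediction_interval` and the requirement `required_half_width` are hypothetical, and I use a normal critical value in place of the exact Student's t value, which makes the interval slightly too narrow for small samples.

```python
import math
import random
from statistics import NormalDist

def prediction_interval(x, y, x_new, confidence=0.95):
    """Approximate prediction interval for one new observation at x_new."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(ss_resid / (n - 2))  # standard error of the regression
    predicted = intercept + slope * x_new
    # Error for a single new observation: residual scatter plus uncertainty in the line.
    se_new = s * math.sqrt(1 + 1 / n + (x_new - mean_x) ** 2 / sxx)
    # ASSUMPTION: normal critical value as a large-sample stand-in for Student's t.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return predicted - z * se_new, predicted + z * se_new

# Simulated data, for illustration only.
rng = random.Random(7)
inputs = [rng.uniform(0, 10) for _ in range(40)]
outputs = [5 + 2 * xi + rng.gauss(0, 3) for xi in inputs]

required_half_width = 4.0  # hypothetical requirement, in units of Output
low, high = prediction_interval(inputs, outputs, x_new=5.0)
half_width = (high - low) / 2
verdict = "precise enough" if half_width <= required_half_width else "too wide"
print(f"95% PI at Input = 5: ({low:.1f}, {high:.1f}); +/- {half_width:.1f} is {verdict}")
```

Notice that the comparison is made in the units of the response against a requirement that comes from outside the model. No R-squared cut-off is involved.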
For the body fat model, I’m guessing that the range is too wide to provide clinically meaningful information, but a doctor would know for sure.
Read about how to obtain and use prediction intervals.
When you ask, “How high should R-squared be?” it’s probably because you want to know whether your regression model can meet your requirements. I hope you see that there are better ways to answer this than through R-squared!
R-squared gets a lot of attention. I think that’s because it appears to be a simple and intuitive statistic. I’d argue that it’s neither; however, that’s not to say that R-squared isn’t useful at all. For instance, if you perform a study and notice that similar studies generally obtain a notably higher or lower R-squared, it would behoove you to investigate why yours is different.
In my next blog post, read how S, the standard error of the regression, is a different goodness-of-fit statistic that can be more helpful than R-squared.