Just how high should R^{2} be in regression analysis? I hear this question asked quite frequently.

Previously, I showed how to interpret R-squared (R^{2}). I also showed how it can be a misleading statistic because a low R-squared isn’t necessarily bad and a high R-squared isn’t necessarily good.

Clearly, the answer for “how high should R-squared be” is . . . it depends.

In this post, I’ll help you answer this question more precisely. However, bear with me, because my premise is that if you’re asking this question, you’re probably asking the wrong question. I’ll show you which questions you should actually ask, and how to answer them.

## Why It’s the Wrong Question

How high should R-squared be? There’s only one possible answer to this question. R^{2} must equal the percentage of the response variable variation that is explained by a linear model, no more and no less.
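For reference, that definition can be written out as a formula. The notation here is my own shorthand rather than anything defined earlier in the post: y_i are the observed responses, \hat{y}_i the fitted values, and \bar{y} the mean of the response.

$$
R^{2} = 1 - \frac{\sum_{i}\left(y_i - \hat{y}_i\right)^{2}}{\sum_{i}\left(y_i - \bar{y}\right)^{2}}
$$

The numerator is the variation the model leaves unexplained and the denominator is the total variation, so multiplying R^{2} by 100 gives the percentage of variation explained.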

When you ask this question, what you *really* want to know is whether your regression model can meet your objectives. Is the model adequate given *your* requirements?

I’m going to help you ask and answer the correct questions. The questions depend on whether your major objective for the linear regression model is:

- Describing the *relationship* between the predictors and response variable, or
- Predicting the response variable

## R-squared and the Relationship between the Predictors and Response Variable

This one is easy. If your main goal is to determine which predictors are statistically significant and how changes in the predictors relate to changes in the response variable, R-squared is almost totally irrelevant.

If you correctly specify a regression model, the R-squared value doesn’t affect how you interpret the relationship between the predictors and response variable one bit.

Suppose you model the relationship between Input and Output. You find that the p-value for Input is significant, its coefficient is 2, and the assumptions pass muster.

These results indicate that a one-unit increase in Input is associated with an average two-unit increase in Output. This interpretation is correct regardless of whether the R-squared value is 25% or 95%!
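To make this concrete, here is a minimal sketch in plain Python. It uses simulated data, not data from any real study: the true coefficient of 2, the intercept of 5, and the two noise levels are all assumptions chosen so that R-squared lands near 95% in one case and near 25% in the other.

```python
import random


def fit_simple_ols(x, y):
    """Return (intercept, slope, r_squared) for a one-predictor least-squares fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot


def simulate(noise_sd, n=2000, seed=0):
    """Simulated Input/Output data: Output = 5 + 2 * Input + random noise."""
    rng = random.Random(seed)
    x = [10 * i / (n - 1) for i in range(n)]
    y = [5 + 2 * xi + rng.gauss(0, noise_sd) for xi in x]
    return x, y


# Same underlying relationship, very different amounts of scatter.
_, slope_low_noise, r2_low_noise = fit_simple_ols(*simulate(noise_sd=1.3))
_, slope_high_noise, r2_high_noise = fit_simple_ols(*simulate(noise_sd=10.0))

print(f"low noise:  slope = {slope_low_noise:.2f}, R-squared = {r2_low_noise:.0%}")
print(f"high noise: slope = {slope_high_noise:.2f}, R-squared = {r2_high_noise:.0%}")
```

Both fits should recover a coefficient close to 2 even though the R-squared values are far apart. The scatter changes how tightly the points hug the line, not what the line means.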

See a graphical illustration of why a low R-squared doesn't affect this interpretation.

Asking “how high should R-squared be?” doesn’t make sense in this context because it isn’t relevant. A low R-squared doesn’t negate a significant predictor or change the meaning of its coefficient. R-squared is simply whatever value it is, and it doesn’t need to be any particular value to allow for a valid interpretation.

In order to trust your interpretation, which questions should you ask instead?

- Is there a sound rationale for my model and do the results fit theory?
- Can I trust my data?
- Do the residual plots and other assumptions look good?
- How do I interpret the p-values and regression coefficients?

## R-squared and Predicting the Response Variable

If your main goal is to produce precise predictions, R-squared becomes a concern. Predictions aren’t as simple as a single predicted value because they include a margin of error; more precise predictions have less error.

R-squared enters the picture because, for a given response variable, a lower R-squared indicates that the model leaves more of the variation unexplained. Thus, a low R-squared can warn of imprecise predictions. However, you can’t use R-squared to determine whether the predictions are precise enough for your needs.

That’s why “How high should R-squared be?” is *still* not the correct question.

Which questions *should* you ask? In addition to the questions above, you should ask:

- Are the prediction intervals precise enough for my requirements?

Don’t worry, Minitab Statistical Software makes this easy to assess.

## Prediction Intervals and Precision

A prediction interval represents the range where a single new observation is likely to fall given specified settings of the predictors. These intervals account for the margin of error around the mean prediction. Narrower prediction intervals indicate more precise predictions.
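For the simplest case, a model with a single predictor, the interval has a standard textbook form. The symbols are my own notation, not output from any particular software: \hat{y}_0 is the fitted value at the predictor setting x_0, S is the standard error of the regression, n is the sample size, and t is the critical value from the t-distribution with n - 2 degrees of freedom.

$$
\hat{y}_0 \pm t_{1-\alpha/2,\,n-2}\; S\,\sqrt{1 + \frac{1}{n} + \frac{\left(x_0 - \bar{x}\right)^{2}}{\sum_{i}\left(x_i - \bar{x}\right)^{2}}},
\qquad
S = \sqrt{\frac{\sum_{i}\left(y_i - \hat{y}_i\right)^{2}}{n - 2}}
$$

The leading 1 under the square root reflects the scatter of individual observations around the line, which is why a prediction interval is always wider than the confidence interval for the mean response.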

For example, in my post where I use BMI to predict body fat percentage, I find that a BMI of 18 produces a prediction interval of 16-30% body fat. We can be 95% confident that this range includes the value of the new observation.

You can use subject-area knowledge, spec limits, client requirements, etc. to determine whether the prediction intervals are precise enough to suit your needs. This approach directly assesses the model’s precision, which is far better than choosing an arbitrary R-squared value as a cut-off point.
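As a rough sketch of that check, the plain-Python snippet below computes an approximate 95% prediction interval for a one-predictor model and compares its width to a requirement. Everything in it is an assumption for illustration: the data are simulated, `max_acceptable_width` is a hypothetical requirement, and the normal quantile stands in for the t critical value, which is only reasonable with a large sample. Statistical software will give you the exact interval.

```python
import math
import random
from statistics import NormalDist


def prediction_interval(x, y, x0, confidence=0.95):
    """Approximate prediction interval for one new observation at x0."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(ss_res / (n - 2))  # standard error of the regression
    # Assumption: large-sample normal quantile used in place of the t quantile.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    fit = intercept + slope * x0
    half_width = z * s * math.sqrt(1 + 1 / n + (x0 - mean_x) ** 2 / sxx)
    return fit - half_width, fit + half_width


# Simulated data (not from a real study): Output = 5 + 2 * Input + noise.
rng = random.Random(1)
x = [10 * i / 499 for i in range(500)]
y = [5 + 2 * xi + rng.gauss(0, 3) for xi in x]

lower, upper = prediction_interval(x, y, x0=4.0)
width = upper - lower
max_acceptable_width = 10.0  # hypothetical requirement, in units of the response

print(f"95% prediction interval at x0 = 4: ({lower:.1f}, {upper:.1f})")
print("precise enough" if width <= max_acceptable_width else "too wide")
```

Notice that the decision compares the interval width to a requirement stated in the units of the response. R-squared never appears in the check.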

For the body fat model, I’m guessing that the range is too wide to provide clinically meaningful information, but a doctor would know for sure.

Read about how to obtain and use prediction intervals.

## R-squared Is Overrated!

When you ask, “How high should R-squared be?” it’s probably because you want to know whether your regression model can meet your requirements. I hope you see that there are better ways to answer this than through R-squared!

R-squared gets a lot of attention. I think that’s because it *appears* to be a simple and intuitive statistic. I’d argue that it’s neither; however, that’s not to say that R-squared isn’t useful at all. For instance, if you perform a study and notice that similar studies generally obtain a notably higher or lower R-squared, it would behoove you to investigate why yours is different.

In my next post, read how S, the standard error of the regression, is a different goodness-of-fit statistic that can be more helpful than R-squared.

If you're just learning about regression, read my regression tutorial!

Time: Thursday, July 17, 2014

Hi,

Can you please explain the relationship between R-squared and standard error (SE)? Can SE be greater than R-squared (e.g., SE = 95, R-squared = 65%)?

Thanks.

Time: Thursday, July 17, 2014

Hi John,

I'll assume that you're asking about the standard error of the regression, aka S.

You can't compare R-squared values to S because they measure different things and on different scales.

R-squared is the percentage of the explained response variability and must fall between 0 and 100%.

S is in the units of the response variable and is roughly the average distance that the data points fall from their fitted (predicted) values. S must be greater than or equal to 0. There's no upper limit on S because it depends on both the units of the response variable and how well your model fits the data.

There is a general rule for the relationship between the two when you're working with a specific response variable. As you improve the fit of your model, R-squared increases up to a maximum of 100% and S decreases down to a minimum of 0. (Both values are the theoretical best values but you wouldn't obtain 100% or 0 in practice.)
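Here is a small sketch of why the two numbers can't be compared directly. It uses simulated data and a hypothetical response measured first in meters and then in millimeters; the model and noise level are assumptions for illustration only.

```python
import math
import random


def fit_stats(x, y):
    """Return (r_squared, s) for a one-predictor least-squares fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    s = math.sqrt(ss_res / (n - 2))  # standard error of the regression
    return r_squared, s


# Simulated response in meters, then the same values converted to millimeters.
rng = random.Random(2)
x = [i / 10 for i in range(200)]
y_m = [1.5 + 0.8 * xi + rng.gauss(0, 2) for xi in x]
y_mm = [1000 * yi for yi in y_m]

r2_m, s_m = fit_stats(x, y_m)
r2_mm, s_mm = fit_stats(x, y_mm)

print(f"meters:      R-squared = {r2_m:.1%}, S = {s_m:.2f}")
print(f"millimeters: R-squared = {r2_mm:.1%}, S = {s_mm:.2f}")
```

Changing the units leaves R-squared untouched but multiplies S by 1000, so whether S is numerically larger than R-squared tells you nothing about the fit.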

I highly recommend that you follow the link near the end of this post to read about the standard error of the regression. I think that will clear things up!

Thanks for the great question!

Jim