
Regression Analysis: How to Interpret S, the Standard Error of the Regression

R-squared gets all of the attention when it comes to determining how well a linear model fits the data. However, I've stated previously that R-squared is overrated. Is there a different goodness-of-fit statistic that can be more helpful? You bet!

Today, I’ll highlight a sorely underappreciated regression statistic: S, or the standard error of the regression. S provides important information that R-squared does not.

What is the Standard Error of the Regression (S)?

[Illustration of residuals: S becomes smaller when the data points are closer to the line.]

In the regression output for Minitab statistical software, you can find S in the Summary of Model section, right next to R-squared. Both statistics provide an overall measure of how well the model fits the data. S is known both as the standard error of the regression and as the standard error of the estimate.

S represents the average distance that the observed values fall from the regression line. Conveniently, it tells you how wrong the regression model is on average, using the units of the response variable. Smaller values are better because they indicate that the observations are closer to the fitted line.
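
To make the calculation concrete, here's a minimal Python sketch (the data are made up for illustration; this isn't Minitab output) that computes S the way statistical packages typically do: take the sum of squared residuals, divide by the degrees of freedom, and take the square root.

```python
import numpy as np

# Hypothetical data for illustration only (not the post's BMI dataset)
x = np.array([18.0, 21.5, 24.0, 26.5, 29.0, 31.5, 34.0])  # predictor
y = np.array([14.2, 19.0, 22.5, 27.1, 28.9, 33.8, 36.0])  # response

# Fit a simple linear regression by ordinary least squares
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# S = sqrt(SSE / (n - 2)); n - 2 because two coefficients were estimated
sse = np.sum(residuals ** 2)
s = np.sqrt(sse / (len(y) - 2))
print(f"S = {s:.4f} (in the units of y)")
```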

[Fitted line plot of BMI versus body fat percentage.]

The fitted line plot shown above is from my post where I use BMI to predict body fat percentage. S is 3.53399, which tells us that the average distance of the data points from the fitted line is about 3.5% body fat.

Unlike R-squared, you can use the standard error of the regression to assess the precision of the predictions. Approximately 95% of the observations should fall within plus/minus 2*standard error of the regression from the regression line, which is also a quick approximation of a 95% prediction interval.

For the BMI example, about 95% of the observations should fall within plus/minus 7% of the fitted line, which is a close match for the prediction interval.
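
If you'd like to verify the 2*S rule of thumb for yourself, here's a small simulation sketch (simulated data with normal errors, since the BMI dataset isn't included here) that checks what fraction of observations fall within two standard errors of the fitted line:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate data with normal errors (the rule of thumb assumes roughly normal residuals)
n = 500
x = rng.uniform(15, 40, n)
y = 1.2 * x + 3.0 + rng.normal(0, 3.5, n)

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)
s = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# Fraction of observations within +/- 2*S of the fitted line
coverage = np.mean(np.abs(residuals) <= 2 * s)
print(f"S = {s:.3f}, coverage within 2*S = {coverage:.1%}")  # close to 95%
```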

Why I Like the Standard Error of the Regression (S)

In many cases, I prefer the standard error of the regression over R-squared. I love the practical intuitiveness of using the natural units of the response variable. And, if I need precise predictions, I can quickly check S to assess their precision.

Conversely, the unitless R-squared doesn't provide an intuitive feel for how close the predicted values are to the observed values. Further, as I detailed here, R-squared is relevant mainly when you need precise predictions. However, you can't use R-squared to assess that precision, which ultimately makes it unhelpful.

To illustrate this, let’s go back to the BMI example. The regression model produces an R-squared of 76.1% and S is 3.53399% body fat. Suppose our requirement is that the predictions must be within +/- 5% of the actual value.

Is the R-squared high enough to achieve this level of precision? There's no way of knowing. However, because the approximate 95% prediction interval extends about 2*S on either side of the fitted line, S must be <= 2.5 to produce a sufficiently narrow interval. At a glance, we can see that our model needs to be more precise. Thanks, S!
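
That check is simple enough to automate. A hypothetical helper (the function name is mine, not Minitab's) might look like:

```python
def meets_precision(s, tolerance):
    """Hypothetical helper: does +/- 2*S (a rough 95% prediction
    interval half-width) fit within the required +/- tolerance?"""
    return 2 * s <= tolerance

# BMI example: S = 3.53399, requirement is +/- 5% body fat
print(meets_precision(3.53399, 5.0))  # False -- 2*3.53399 ~= 7.07 > 5
```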

Read more about how to obtain and use prediction intervals as well as my regression tutorial.

Comments

Name: Mukundraj • Thursday, April 3, 2014

How do I assess the S value in the case of multiple regression? I could not use this graph. Please help.


Name: Jim Frost • Monday, April 7, 2014

Hi Mukundraj,

You can assess the S value in multiple regression without using the fitted line plot. I use the graph for simple regression because it's easier to illustrate the concept. However, with more than one predictor, it's not possible to graph the higher dimensions that are required!

In multiple regression output, just look in the Summary of Model table that also contains R-squared. You'll see S there.

You interpret S the same way for multiple regression as for simple regression. The S value is still the average distance that the data points fall from the fitted values. However, in multiple regression, the fitted values are calculated with a model that contains multiple terms.
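
For anyone who wants to see the calculation outside of Minitab, here's a minimal Python sketch (with made-up data) showing how S generalizes to multiple regression: divide the sum of squared residuals by n minus the number of estimated coefficients, then take the square root.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two predictors, one response
n = 50
X = rng.normal(size=(n, 2))
y = 4.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 2.0, n)

# Add an intercept column and fit by least squares
X_design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)

residuals = y - X_design @ coef
sse = np.sum(residuals ** 2)

# Divide by n minus the number of estimated coefficients (3 here)
s = np.sqrt(sse / (n - X_design.shape[1]))
print(f"S = {s:.3f}")  # should land near the true error SD of 2.0
```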

Thanks for the question!
Jim


Name: Nicholas Azzopardi • Wednesday, July 2, 2014

Dear Mr. Frost,
Can you kindly tell me what I can obtain from the information below?

Mini-slump
R2 = 0.98

        DF    SS         F value
Model   14    42070.4    20.8
Error    4      203.5
Total   20    42937.8


Name: Jim Frost • Thursday, July 3, 2014

Hi Nicholas,

It appears that you're overfitting your model, which means that you are including too many terms for the number of data points. This can artificially inflate the R-squared value.

From your table, it looks like you have 21 data points and are fitting 14 terms. That's too many! A good rule of thumb is a maximum of one term for every 10 data points.

I write more about how to include the correct number of terms in a different post. I think it should answer your questions.

http://blog.minitab.com/blog/adventures-in-statistics/multiple-regession-analysis-use-adjusted-r-squared-and-predicted-r-squared-to-include-the-correct-number-of-variables

I bet your predicted R-squared is extremely low.

Thanks for writing!
Jim


Name: Nicholas Azzopardi • Friday, July 4, 2014

Dear Jim,
Thank you for your answer.

But if it is assumed that everything is OK, what information can you obtain from that table?

Thank you once again.

Kind regards,

Nicholas


Name: Himanshu • Saturday, July 5, 2014

Hi Jim!

Thanks for the beautiful and enlightening blog posts. Is there a textbook you'd recommend to get the basics of regression right (with the math involved)? I was looking for something that would make my fundamentals crystal clear. I would really appreciate your thoughts and insights.

Best,
Himanshu


Name: Jim Frost • Monday, July 7, 2014

Hi Nicholas,

I'd say that you can't assume that everything is OK. Fitting so many terms to so few data points will artificially inflate the R-squared. That's probably why the R-squared is so high, 98%.

There's not much I can conclude without understanding the data and the specific terms in the model.

About all I can say is:
The model fits 14 terms to 21 data points, and it explains 98% of the variability of the response data around its mean. The model is probably overfit, which would produce an R-squared that is too high.
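
One thing you can recover directly from an ANOVA table like yours is S itself: it's the square root of the mean squared error (SSE divided by the error degrees of freedom). A quick sketch with your numbers (Python rather than Minitab, purely for illustration):

```python
import math

# Values from the ANOVA table in the question above
sse = 203.5    # error sum of squares
df_error = 4   # error degrees of freedom

# S is the square root of the mean squared error
s = math.sqrt(sse / df_error)
print(f"S = {s:.2f}")  # about 7.13, in the units of the response
```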

Was there something more specific you were wondering about?

Jim


Name: Jim Frost • Tuesday, July 8, 2014

Hi Himanshu,

Thanks so much for your kind comments!

I actually haven't read a textbook for a while. Being out of school for "a few years," I find that I tend to read scholarly articles to keep up with the latest developments. I did ask around Minitab to see which currently used textbooks would be recommended.

This textbook comes highly recommended:

Applied Linear Statistical Models by Michael Kutner, Christopher Nachtsheim, and William Li.

These authors apparently have a very similar textbook specifically for regression; it sounds like it contains the same material as the book above, but only the content related to regression models:
Applied Linear Regression Models

I hope this helps!
Jim


Name: Olivia • Saturday, September 6, 2014

Hi, this is such a great resource I have stumbled upon :) I have a question though: when comparing different models from the same data set (i.e., models including or excluding different variables or numbers of variables), why is S better than SSE?

