# Regression Analysis Tutorial and Examples

I’ve written a number of blog posts about regression analysis and I've collected them here to create a regression tutorial. I’ll supplement my own posts with some from my colleagues.

This tutorial covers many aspects of regression analysis including: choosing the type of regression analysis to use, specifying the model, interpreting the results, determining how well the model fits, making predictions, and checking the assumptions. At the end, I include examples of different types of regression analyses.

If you’re learning regression analysis right now, you might want to bookmark this tutorial!

## Why Choose Regression and the Hallmarks of a Good Regression Analysis

Before we begin the regression analysis tutorial, there are several important questions to answer.

Why should we choose regression at all? What are the common mistakes that even experts make when it comes to regression analysis? And, how do you distinguish a good regression analysis from a less rigorous regression analysis? Read these posts to find out:

- Tribute to Regression Analysis: See why regression is my favorite! Sure, regression generates an equation that describes the relationship between one or more predictor variables and the response variable. But, there’s much more to it than just that.
- Four Tips on How to Perform a Regression Analysis that Avoids Common Problems: Keep these tips in mind throughout all stages of this tutorial to ensure a top-quality regression analysis.
- Sample Size Guidelines: These guidelines help ensure that you have sufficient power to detect a relationship and provide a reasonably precise estimate of the strength of that relationship.

## Tutorial: How to Choose the Correct Type of Regression Analysis

Minitab statistical software provides a number of different types of regression analysis. Choosing the correct type depends on the characteristics of your data, as the following posts explain.

- Giving Thanks for the Regression Menu: Patrick Runkel goes through the regression choices using a yummy Thanksgiving context!
- Linear or Nonlinear Regression: How to determine when you should use one or the other.
- What is the Difference between Linear and Nonlinear Equations: Both types of equations can model curvature, so what is the difference between them?

## Tutorial: How to Specify Your Regression Model

Choosing the correct type of regression analysis is just the first step in this regression tutorial. Next, you need to specify the model. Model specification consists of determining which predictor variables to include in the model and whether you need to model curvature and interactions between predictor variables.

Specifying a regression model is an iterative process. The interpretation and assumption verification sections of this regression tutorial show you how to confirm that you’ve specified the model correctly and how to adjust your model based on the results.

- How to Choose the Best Regression Model: I review some common statistical methods, complications you may face, and provide some practical advice.
- Stepwise and Best Subsets Regression: Minitab provides two automatic tools that help identify useful predictors during the exploratory stages of model building.
- Curve Fitting with Linear and Nonlinear Regression: Sometimes your data just don’t follow a straight line and you need to fit a curved relationship.
- Interaction effects: Michelle Paret explains interactions using Ketchup and Soy Sauce.
- Proxy variables: Important variables can be difficult or impossible to measure but omitting them from the regression model can produce invalid results. A proxy variable is an easily measurable variable that is used in place of a difficult variable.
- Overfitting the model: Overly complex models can produce misleading results. Learn about overfit models and how to detect and avoid them.
- Hierarchical models: I review reasons to fit, or not fit, a hierarchical model. A hierarchical model contains all lower-order terms that comprise the higher-order terms that also appear in the model.
- Standardizing the variables: In certain cases, standardizing the variables in your regression model can reveal statistically significant findings that you might otherwise miss.
- Five reasons why your R-squared can be too high: If you specify the wrong regression model, or use the wrong model fitting process, the R-squared can be too high.
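
Since overfitting and inflated R-squared come up repeatedly in the posts above, here is a rough standard-library Python sketch of the core symptom (simulated data, not from any of the posts): a sixth-degree polynomial fits the same training points at least as well as a straight line, but that in-sample fit tells you nothing about how well it predicts held-out points, so compare the two test SSEs it prints.

```python
import random

def polyfit(xs, ys, deg):
    """Least-squares polynomial fit via normal equations + Gaussian elimination."""
    m = deg + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(m)]
    for col in range(m):                          # forward elimination, partial pivoting
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv], b[col], b[piv] = A[piv], A[col], b[piv], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * m
    for r in range(m - 1, -1, -1):                # back substitution
        coef[r] = (b[r] - sum(A[r][c] * coef[c] for c in range(r + 1, m))) / A[r][r]
    return coef

def sse(coef, xs, ys):
    return sum((y - sum(c * x ** i for i, c in enumerate(coef))) ** 2
               for x, y in zip(xs, ys))

random.seed(5)
train_x = [i / 7 for i in range(8)]               # 8 training points on [0, 1]
train_y = [1 + 2 * x + random.gauss(0, 0.5) for x in train_x]  # the truth is a line
test_x = [x + 1 / 14 for x in train_x]            # held-out points from the same line
test_y = [1 + 2 * x + random.gauss(0, 0.5) for x in test_x]

line = polyfit(train_x, train_y, 1)
wiggly = polyfit(train_x, train_y, 6)

# A model with more terms always fits the data it was built on at least as well...
print("train SSE:", sse(line, train_x, train_y), sse(wiggly, train_x, train_y))
# ...but the held-out SSEs show how well each one actually predicts.
print("test SSE: ", sse(line, test_x, test_y), sse(wiggly, test_x, test_y))
```

This is the reason adjusted and predicted R-squared exist: ordinary R-squared, like training SSE, can only improve as you add terms.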

## Tutorial: How to Interpret your Regression Results

So, you’ve chosen the correct type of regression and specified the model. Now, you want to interpret the results. The following topics in the regression tutorial show you how to interpret the results and effectively present them:

- Regression coefficients and p-values
- Regression Constant (Y intercept)
- How to statistically test the difference between regression slopes and constants
- R-squared and the goodness-of-fit
- How high should R-squared be?
- How to interpret a model with a low R-squared
- Adjusted R-squared and Predicted R-squared
- S, the standard error of the regression
- F-test of overall significance
- How to Compare Regression Slopes
- Present Your Regression Results to Avoid Costly Mistakes: Research shows that presentation affects the number of interpretation mistakes.
- Identify the Most Important Predictor Variables: After you've settled on a model, it’s common to ask, “Which variable is most important?”
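
As a rough companion to the goodness-of-fit items above, here is a small standard-library Python sketch (the data are made up for illustration) showing how R-squared, adjusted R-squared, and S all fall out of a simple linear fit's residuals:

```python
# R-squared, adjusted R-squared, and S for a simple linear fit (made-up data)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 4.3, 5.9, 8.2, 9.8, 12.4, 13.9, 16.3]

n, k = len(x), 1                     # k = number of predictors
xbar = sum(x) / n
ybar = sum(y) / n
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar

ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ss_tot = sum((yi - ybar) ** 2 for yi in y)

r2 = 1 - ss_res / ss_tot                              # fraction of variation explained
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)         # penalized for model size
s = (ss_res / (n - k - 1)) ** 0.5                     # standard error of the regression

print(f"R-sq = {r2:.4f}, adj R-sq = {adj_r2:.4f}, S = {s:.3f}")
```

Note that adjusted R-squared is never larger than R-squared, and S is in the units of the response, which is why it is often easier to explain than R-squared.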

## Tutorial: How to Use Regression to Make Predictions

In addition to determining how the response variable changes when you change the values of the predictor variables, the other key benefit of regression is the ability to make predictions. In this part of the regression tutorial, I cover how to do just this.

- How to Predict with Minitab: A prediction guide that uses BMI to predict body fat percentage.
- Predicted R-squared: This statistic indicates how well a regression model predicts responses for new observations rather than just the original data set.
- Prediction intervals: See how presenting prediction intervals is better than presenting only the regression equation and predicted values.
- Prediction intervals versus other intervals: I compare prediction intervals to confidence and tolerance intervals so you’ll know when to use each type of interval.
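
To make the prediction-interval idea concrete, here is a hedged standard-library Python sketch for simple linear regression on simulated data. It substitutes a normal quantile for the exact t quantile, which is a reasonable approximation at this sample size:

```python
import random
from statistics import NormalDist

random.seed(2)
n = 60
x = [random.uniform(0, 10) for _ in range(n)]
y = [2.0 + 1.5 * xi + random.gauss(0, 1.0) for xi in x]   # simulated data

xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s = (sum(r * r for r in resid) / (n - 2)) ** 0.5          # standard error of the regression

x0 = 5.0                                                  # new observation's x value
yhat0 = b0 + b1 * x0
z = NormalDist().inv_cdf(0.975)     # normal approximation to the t quantile (fine at n = 60)
half = z * s * (1 + 1 / n + (x0 - xbar) ** 2 / sxx) ** 0.5

print(f"predicted y at x0={x0}: {yhat0:.2f}, 95% PI: ({yhat0 - half:.2f}, {yhat0 + half:.2f})")
```

The `1` inside the square root is what makes this a prediction interval rather than a confidence interval for the mean response: it accounts for the scatter of individual observations around the regression line, not just uncertainty in the line itself.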

## Tutorial: How to Check the Regression Assumptions and Fix Problems

Like any statistical test, regression analysis has assumptions that you should satisfy, or the results can be invalid. In regression analysis, the main way to check the assumptions is to assess the residual plots. The following posts in the tutorial show you how to do this and offer suggestions for how to fix problems.

- Residual plots: What they should look like and reasons why they might not!
- How important are normal residuals: If you have a large enough sample, nonnormal residuals may not be a problem.
- Multicollinearity: Highly correlated predictors can be a problem, but not always!
- Heteroscedasticity: You want the residuals to have a constant variance (homoscedasticity), but what if they don’t?
- Box-Cox transformation: If you can’t resolve the underlying problem, Cody Steele shows how easy it can be to transform the problem away!
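
The Box-Cox post above is about transforming away nonconstant variance. Here is a hedged standard-library Python sketch (simulated data) of a crude version of both the check and the fix: with multiplicative error, residual spread grows with x on the raw scale but is roughly constant after a log transform (the lambda = 0 case of Box-Cox):

```python
import math
import random

random.seed(3)
n = 200
x = sorted(random.uniform(1, 10) for _ in range(n))
# Multiplicative error: the response is linear in x only on the log scale
y = [math.exp(0.5 + 0.3 * xi + random.gauss(0, 0.2)) for xi in x]

def residuals(xs, ys):
    m = len(xs)
    xb, yb = sum(xs) / m, sum(ys) / m
    b1 = (sum((a - xb) * (b - yb) for a, b in zip(xs, ys))
          / sum((a - xb) ** 2 for a in xs))
    b0 = yb - b1 * xb
    return [b - (b0 + b1 * a) for a, b in zip(xs, ys)]

def spread_ratio(xs, ys):
    """Crude constant-variance check: residual SD for the upper half of x vs the lower half."""
    res = residuals(xs, ys)
    half = len(res) // 2
    sd = lambda r: (sum(v * v for v in r) / len(r)) ** 0.5
    return sd(res[half:]) / sd(res[:half])

print(spread_ratio(x, y))                           # well above 1: spread grows with x
print(spread_ratio(x, [math.log(v) for v in y]))    # near 1 after the log transform
```

In practice you would look at the residual plots rather than a single ratio, but the ratio captures what your eye picks up from a fan-shaped residual plot.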

## Examples of Different Types of Regression Analyses

The final part of the regression tutorial contains examples of the different types of regression analysis that Minitab can perform. Many of these regression examples include the data sets so you can try it yourself!

- New Linear Model Features in Minitab 17
- Binary Logistic Regression: Predicts the winner of the 2012 U.S. Presidential election.
- Multiple regression with response optimization: Highlights new features added to the Assistant for Minitab 17.
- Linear Regression: Great Presidents by Patrick Runkel and my follow up, Great Presidents Revisited.
- Linear regression with a double-log transformation: Examines the relationship between the size of mammals and their metabolic rate with a fitted line plot.
- Nonlinear regression: Kevin Rudy uses nonlinear regression to predict winning basketball teams.
- Orthogonal regression: Carly Barry shows how orthogonal regression (a.k.a. Deming Regression) can test the equivalence of different instruments.
- Partial least squares (PLS) regression: Cody Steele uses PLS to successfully analyze a very small and highly multicollinear data set.

Name: James Jihulya • Thursday, January 9, 2014

I want to use Minitab to compare the length at first maturity of male and female tilapia using nonlinear regression by plotting two curves on the same graph. My problem is that Minitab allows only one response variable and one predictor variable. Can someone help?

Name: Jim Frost • Thursday, January 16, 2014

Hi James,

I faced a similar issue when I wanted to compare two curves fit to the same data set. Below is how to compare multiple models on a scatterplot.

First, I saved the fitted values from each model in the worksheet, which you do by checking "Fits" in the "Storage" dialog. Save the fits for all models that you want to compare, which I'll call Fits1 and Fits2 below.

Then you create a scatterplot and display the following pairs: the original response and predictor data, Fits1 and predictor, and Fits2 and predictor. Be sure to check "Overlaid on the same graph" in the Multiple Graphs dialog.

After you create the graph, you need to tweak it by adding connect lines.

Click on a data point in the scatterplot. In the menu, choose Editor > Add > Data Display and check "Connect line". This adds connect lines to all of your data sets. Unfortunately, it adds a connect line to the original data, which we don't want.

Next, we need to make the connect line for the original data go away.

Slowly double-click on the connect line for the original data set, which selects that specific connect line. (If you double-click too fast, it'll bring up the dialog to edit all connect lines.) Right click and choose "Edit Connect Line". Choose "Custom" and under "Type" scroll up and choose "None".

In a similar fashion, you may want to edit the connect lines for the fitted value data sets so that they are solid lines and perhaps increase their size.

This is not quite as good as what you're looking for but you can at least compare the models on one graph along with the original data. You can edit the legend to identify the different models.

Jim

Name: shasha • Saturday, February 15, 2014

Hi, I am currently learning factorial DOE with 5 variables and two levels. I am trying to create a regression model consisting of 5 independent variables with time as the response (dependent variable). However, my first variable is qualitative. How do I code it as quantitative so that it makes sense to the regression analysis? There are two levels for this factor. Please help. Thanks.

Name: Jim Frost • Friday, February 21, 2014

Hi Shasha,

If I understand your question correctly, you shouldn't have a problem using Minitab's DOE design generator to create a 2-level factorial design that meets your need to include a 2-level qualitative factor.

In Minitab, go to: Stat > DOE > Factorial > Create Factorial Design.

In Type of Design, choose 2-level factorial (default generators). Under Number of factors, choose 5.

Click Design and choose which type of 2-level design that you require. Click OK.

Click Factors, and you can provide names for your factors. It's totally OK to enter your qualitative factor here. Just enter the information the same way as you would for the others. If you have text names for the qualitative levels, choose Text under Type. The DOE model will work just fine!

Jim

Name: Shasha • Monday, February 24, 2014

Thank you for your response, Jim.

I have no problem using the DOE generator, but when I use the Stat > Regression > Regression function, the qualitative factors do not appear in the list of predictors. Why is this so? I have also read something about "Make Indicator Variables". Is that relevant to my case here?

Looking forward to your response soon!

Shasha

Name: Jim Frost • Tuesday, February 25, 2014

Shasha, there are multiple ways to go about this.

If you have existing data from a DOE design but didn't create the design in the worksheet using Minitab, you might still be able to analyze it using Analyze Factorial Design. In Minitab, go to Stat > DOE > Factorial > Define Custom Factorial Design.

However, if you want to analyze it using regression analysis, it depends on which version of Minitab you have.

If you're using Minitab 15, the only way to include qualitative/categorical variables is to create indicator variables (Calc > Make Indicator Variables). Click the Help button in the dialog box to see the details and for an example of how to do this for regression.

If you're using Minitab 16, use General Regression (Stat > Regression > General Regression). Simply enter your qualitative variable in the Categorical predictors field. You don't need to create indicator variables here. Minitab takes it from there and knows how to analyze it.

If you're using Minitab 17, the general process is the same as Minitab 16 except you need to use Fit Regression Model (Stat > Regression > Regression > Fit Regression Model).
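
For readers curious what indicator (dummy) coding actually does under the hood, here is a tiny Python sketch of the same idea, outside Minitab; the factor name and levels are made up for illustration:

```python
# A two-level qualitative factor coded as a single 0/1 indicator column
# (the factor name and levels here are hypothetical)
material = ["A", "B", "B", "A", "B"]
indicator = [1 if level == "B" else 0 for level in material]
print(indicator)  # [0, 1, 1, 0, 1]
```

The regression coefficient on such a column is then simply the estimated shift in the response between the two levels, holding the other predictors constant.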

Jim

Name: Andre • Tuesday, May 20, 2014

Hi, I was thinking about a regression problem: what happens if you divide your data set?

For example, if 50% of your data can explain "almost" the whole problem (though not the whole model), why use the whole data set?

What kind of indicator can I use to see whether this half of the data can explain the whole (or almost the whole) data set?

Thanks.

Name: Jim Frost • Tuesday, May 27, 2014

Hi Andre,

Some people do in fact split up their data sets. They might use part of the data to develop a model and part of the data to test the prediction quality of the model.

My colleague, Cody Steele, writes about this here:

http://blog.minitab.com/blog/statistics-and-quality-improvement/dividing-a-data-set-into-training-and-validation-samples

Jim

Name: Venu • Sunday, July 13, 2014

Hello, thank you for the nice article. Could you also let us know how to analyze using regression when there are two response variables (Ys)?

Name: Jim Frost • Monday, July 14, 2014

Hi Venu,

What you should do depends on the type of variables in your model and whether the response variables are correlated. There are two general cases:

1) If you have only continuous predictors and/or the response variables are not correlated, use Regression.

In Minitab, you can use Regression to analyze multiple response variables at the same time. You just enter the multiple response variables into the Responses field in the dialog box, and all the predictors as usual. This is the equivalent of performing separate regression analyses on each response variable using the same set of predictors. In other words, you'll get the same results as if you performed the analysis for each response variable separately. Consequently, you interpret each regression model in the usual manner.

If you want to include different predictors for each response variable, just perform each analysis one response variable at a time.

2) If you have at least one categorical predictor (and optional covariates) AND the response variables are correlated, use: Stat > ANOVA > General MANOVA.

MANOVA is a multivariate ANOVA. If your response variables are correlated, MANOVA has several important advantages over performing individual ANOVA tests, one response variable at a time.

The correlation structure among the response variables allows the MANOVA procedure to detect multivariate response patterns and smaller differences than are possible with separate tests. If the response variables are not correlated, there's no reason to use MANOVA.

To learn how to interpret MANOVA results in Minitab, go to: Help > StatGuide > ANOVA > General MANOVA.

I hope this helps! If you need any further clarifications, please don't hesitate to write again.

Jim

Name: fritzi • Tuesday, July 15, 2014

Hi! I'm a student currently doing research on the effect of smoking on white blood cell count. I have many variables, like age, years of smoking, number of cigarette sticks smoked per day, and the total and differential white blood cell counts. My professor told me to download Minitab (so this is my first time using it) and use regression analysis. The problem is I really don't know how to start, and I don't know what commands to click. Can you help me? :(

Name: Jim Frost • Wednesday, July 16, 2014

Hi Fritzi,

How to start is a big question! I'm not sure if your question is both about how to use Minitab and/or how to perform regression analysis.

But, to get a good overview of Minitab, we have a great Getting Started guide on our website that introduces you to the basics of using Minitab. You can either view it as web pages or download it in PDF format. You can find that here:

http://support.minitab.com/en-us/minitab/17/getting-started/

To get started in regression analysis specifically, a good first step would be to graph all the variables. A matrix plot creates a scatterplot of a set of variables. Matrix plots allow you to visually assess the relationships between many pairs of variables at once by creating an array of scatterplots. You'll also see if you'll need to model curvature in the data.

To find the matrix plot in Minitab, go to: Graph > Matrix Plot. Because you have definite X and Y variables, you'll probably want to choose "Each Y versus each X" in the gallery that appears. Then just enter your response variable in Y variables and your predictors in X variables.

In terms of performing regression, your best bet is probably to read through the posts in this tutorial. The topics will help you avoid common mistakes, pick the right analysis, specify the model, and interpret the results--all using Minitab.

Within Minitab, there is also a regression tutorial that highlights the uses, shows you how to enter your data in the worksheet, and a guided example. You can find this tutorial in Minitab:

Help > Tutorials > Regression

I hope this helps! Don't hesitate to write again as you progress!

Jim

Name: vasu • Wednesday, August 6, 2014

Hi,

The problem is that when I performed the multiple regression, the model fit and the significance values come out significant, but when I performed the correlation between the same 2 variables, the result comes out negative. I am confused about how I should interpret the results.

Name: Jim Frost • Wednesday, August 6, 2014

Hi Vasu,

I'll have to make a couple assumptions to answer your question. If these assumptions are incorrect, let me know and I'll address them!

First, I'll assume that your regression model uses adjusted sums of squares rather than sequential sums of squares. Adjusted sums of squares (adj SS) is the default, and the correct choice for most situations. You'd have to intentionally change it to get something else.

Second, I'll assume that when you say the result is "negative", you mean the correlation is insignificant, and not that the correlation is negative (which can be significant).

So, let's say you have a multiple regression model with the predictors of A and B, and the response is Y (Y = A + B).

The general reason why you are getting different results is that correlation and regression use different models for their tests.

Correlation is a univariate test that determines whether the correlation for A & Y is significant and, independently, whether the correlation for B & Y is significant. These are two independent tests with just one variable in each model. If you are studying a multivariate research question with univariate models, you can get misleading results due to confounding variables.

Regression uses a multivariate model to determine significance. For regression with adjusted sums of squares, significance is determined by the portion of the response variation that is explained by one predictor given that all of the other predictors are already in the model.

So, if you have a model with two predictors, A and B, the significance of A is calculated assuming that B is already in the model. A is significant if it explains a significant amount of response variability AFTER accounting for B. And vice versa for B.

Therefore, the general reason is that the different testing procedures can lead to different results. However, it is an indication that you likely have some interesting things going on in your data that you should explore.

For example, these two predictors may be correlated with each other and the response in such a way that the predictors cancel each other out and are not detectable in isolation but only in a multivariate analysis. The correlation tests are trying to assess a multivariate problem with univariate models, which can produce misleading results. Read my post where I cover an example of this and explain how it plays out:

http://blog.minitab.com/blog/adventures-in-statistics/confound-it-some-more-how-a-factor-that-wasnt-there-hampered-my-analysis

Or, it may be Simpson's Paradox at work if you are aggregating group data for your analysis. My colleague Patrick Runkel wrote a nice post about this:

http://blog.minitab.com/blog/statistics-and-quality-data-analysis/optical-illusions-zen-koans-and-simpsons-paradox

I can't tell you exactly what is going on with your data, but I'm pretty sure it involves some sort of interesting relationship that you should investigate.
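
If it helps to see this concretely, here is a standard-library Python sketch with simulated data (not your data, of course) in which each predictor barely correlates with the response on its own, yet the two-predictor regression explains it almost perfectly:

```python
import random

random.seed(1)
n = 500
z  = [random.gauss(0, 1) for _ in range(n)]     # shared component of A and B
e1 = [random.gauss(0, 0.1) for _ in range(n)]
e2 = [random.gauss(0, 0.1) for _ in range(n)]
a = [zi + e for zi, e in zip(z, e1)]            # predictor A
b = [zi + e for zi, e in zip(z, e2)]            # predictor B, highly correlated with A
y = [u - v for u, v in zip(e1, e2)]             # response depends only on A - B

def center(v):
    m = sum(v) / len(v)
    return [t - m for t in v]

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def corr(u, v):
    u, v = center(u), center(v)
    return dot(u, v) / (dot(u, u) * dot(v, v)) ** 0.5

# Each predictor alone barely correlates with the response...
print(corr(a, y), corr(b, y))

# ...but together they explain it almost perfectly (two-predictor OLS via Cramer's rule)
ac, bc, yc = center(a), center(b), center(y)
saa, sbb, sab = dot(ac, ac), dot(bc, bc), dot(ac, bc)
det = saa * sbb - sab ** 2
b1 = (dot(ac, yc) * sbb - dot(bc, yc) * sab) / det
b2 = (dot(bc, yc) * saa - dot(ac, yc) * sab) / det
resid = [yi - b1 * ai - b2 * bi for yi, ai, bi in zip(yc, ac, bc)]
r2 = 1 - dot(resid, resid) / dot(yc, yc)
print(r2)   # essentially 1
```

This is a textbook suppression pattern: the two predictors share a large common component that is unrelated to the response, so each one looks weak in isolation but their difference carries nearly all the signal.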

Thanks for writing! If you need any further clarifications, please don't hesitate to write again.

Jim

Name: Abbas • Sunday, August 17, 2014

Hi,

I used a Response Surface Design (Box-Behnken) for three factors, but I already have more than 15 runs of experimental data. Can I include the extra results by using Define Custom Response Surface Design, or is it better to use just the 15 runs that I got from Box-Behnken? If I can include them, which approach is more accurate, and why? Thanks.

Name: vasu • Monday, August 18, 2014

I want to know: in my data, the dependent variable is on a Likert scale, and I have to apply regression to it. I am comfortable with multiple regression but don't know much about ordinal regression. Is there any situation where you can still apply multiple regression when your dependent variable is ordinal? I ask because I have gone through a number of papers where people applied multiple regression even though their data were ordinal in nature.

Name: vasu • Wednesday, August 20, 2014

I used a paired t-test to determine the significant difference in the means in my data, but I was shocked to find that even though the mean difference is negligible, it still comes out significant. I want to know whether there is a problem with my data, because I thought that if the data are highly correlated, the significance value should come out higher.

Please help; it's urgent.

Name: Jim Frost • Wednesday, August 20, 2014

Hi Vasu,

I'll attempt to answer both of your questions in one reply!

To perform multiple regression when you have a dependent (response) variable that is measured on the Likert scale, you’ll need to use ordinal logistic regression. It’s a special form of multiple regression where you can use ordinal dependent variables, such as the Likert scale.

You can find it in Minitab at: Stat > Regression > Ordinal Logistic Regression

For the paired t-test question, it’s important to keep in mind the distinction between statistically significant and practically significant. It’s possible to find a statistically significant difference for a very small difference.

The statistical significance indicates that your sample data provides sufficient evidence to state that the population means are likely to be unequal. However, that does not necessarily indicate that the difference is large enough to be practical in the real world. That’s an entirely separate question. You'll need to apply your subject area knowledge to answer that.

Also, finding a very small difference that is statistically significant does not indicate that there is anything wrong with your data. In fact, quite the opposite, it may happen because you have very good data! Often you’ll find that very small differences are significant when you have a large sample size and/or the variability (noise) in your data is very low.

I’ve written several posts about how to use P-values. In particular, I recommend you read the post below and, in particular, look at Guideline 3:

http://blog.minitab.com/blog/adventures-in-statistics/five-guidelines-for-using-p-values
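
To see how a negligible difference can still be statistically significant, here is a simulated standard-library Python sketch of a paired t-test (using a normal approximation to the t quantile, which is fine at this sample size; the numbers are made up):

```python
import random
from statistics import NormalDist

random.seed(4)
n = 10000
# Paired differences with a tiny true mean (0.05) against unit-SD noise
d = [0.05 + random.gauss(0, 1) for _ in range(n)]

mean = sum(d) / n
sd = (sum((v - mean) ** 2 for v in d) / (n - 1)) ** 0.5
t = mean / (sd / n ** 0.5)
p = 2 * (1 - NormalDist().cdf(abs(t)))   # normal approximation to the t-test p-value

print(f"mean difference = {mean:.3f}, p = {p:.4f}")
```

The mean difference is a tiny fraction of the noise SD, yet the large sample size drives the standard error down far enough for the test to flag it. Whether a difference that small matters is a subject-area question, not a statistical one.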

Thanks for reading and writing!

Jim

Name: Ali • Sunday, August 24, 2014

Hi,

I need your help with how to solve an equation with more than one variable and more than one coefficient. For example:

y = (a + b*x1)*x2

where y, x1, and x2 are variables; x1 and x2 are known, but y is unknown. Then a and b are coefficients, and I need to solve for the values of a, b, and y.

Thanks

Name: nuna • Monday, August 25, 2014

What is the best way to analyze interaction effects on the response?

How can I know how each individual variable affects the response even though each observation combines the variables?

I have 3 independent variables and one dependent variable. Each independent variable has four levels.

Name: Jim Frost • Monday, August 25, 2014

Hi Nuna,

Your best bet is to create an Interaction Plot. This plot displays how the effect of one variable on the response depends on the value of a second variable.

Hopefully you're using Minitab 17 because we have some cool new functionality for interaction plots. In Minitab 17, if you're using GLM or Regression (among other analyses), you can create the plots using your stored model. You can use either categorical or continuous predictors for these plots.

It sounds like you're fitting an ANOVA model. So, fit your model using GLM. Then, in the Minitab menu, go to Stat > ANOVA > General Linear Model > Factorial Plots. This will create interaction plots using fitted values based on your model.

The new and improved factorial plots are just a small part of the new linear model functionality in Minitab 17. Read here for more details:

http://blog.minitab.com/blog/adventures-in-statistics/unleash-the-power-of-linear-models-with-minitab-17

If you're still using Minitab 16, you can still create interaction plots. These are based on your raw data rather than the fitted model. And, you can only use categorical factors, no continuous predictors. In Minitab 16, go to: Stat > ANOVA > Interaction Plot.
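
Numerically, an interaction just means the effect of one factor changes with the level of another; here is a tiny Python illustration with made-up cell means, which is exactly the pattern that crossing lines on an interaction plot represent:

```python
# Cell means for a hypothetical 2x2 layout: the effect of factor A
# reverses sign across the levels of factor B (a strong interaction)
means = {("low", "low"): 10, ("high", "low"): 14,
         ("low", "high"): 15, ("high", "high"): 11}

effect_of_a_at_b_low  = means[("high", "low")]  - means[("low", "low")]    # +4
effect_of_a_at_b_high = means[("high", "high")] - means[("low", "high")]   # -4
print(effect_of_a_at_b_low, effect_of_a_at_b_high)
```

If there were no interaction, those two effects would be (roughly) equal and the lines on the interaction plot would be parallel.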

I hope this helps!

Jim

Name: Kibu • Tuesday, August 26, 2014

Hi,

I have the following variables:

Independent variable: Rank of access to the forest by households (anywhere from 1 to 5)

Dependent variables:

Age 20-30 (household %)

Education- primary

education- secondary

education-graduate

Land tenure(yes/no)

area of land (ha)

equipments (yes/no)

motor cycle (yes/no)

bicycle (yes/no)

Cow (yes/no)

Neighbours are trustworthy (household %)

relatives support (rank 1 to 5)

engaged with state authority (household %)

feel very close(household %)

feel moderately close (household %)

feel very distant (household %)

very peaceful (household %)

moderately peaceful (household %)

very violent (household %)

people share thing (yes/no)

(A few more variables would be included.) Would you please tell me the maximum number of variables that is safe for generating a regression model? I would like to perform a regression analysis in Minitab. Could you please suggest which regression model I should use, and kindly provide the instructions for doing that in Minitab?

Name: Jim Frost • Tuesday, August 26, 2014

Hi Kibu,

Before I get into an answer, I wanted to check and see if you possibly got the independent and dependent variables switched.

Do you want to use the list of variables to predict the rank of access to the forest? If so, the long list of variables are the independent or predictor variables, and rank of access would be the dependent/response variable.

If you want to use rank of access to predict each of the variables in the list, you have them right, but you'll need a bunch of separate regression models. And you'd need to use a variety of types of regression, or perhaps chi-square instead of regression in some cases.

Jim