Thinking about Predictors in Regression, an Example

A few times a year, the Bureau of Labor Statistics (BLS) publishes a Spotlight on Statistics Article. The first such article of 2015 recently arrived, providing analysis of trends in long-term unemployment. 

It's certainly an interesting read on its own, but some of the included data give us a good opportunity to look at how a little thought about your predictors can improve your regression analysis. Fortunately, Minitab Statistical Software includes 3-D graphs and regression diagnostics that can help you spot opportunities for improvement.

The first chart in the report highlights how large the long-term unemployed share of total unemployment is compared to historical levels. That chart looks a bit like this:

The percentages of total unemployed in each category tend to follow each other.

The discussion points out an interesting relationship. The authors note that the record high for those unemployed 27 weeks or longer occurred in the second quarter of 2010. The record high for those unemployed 52 weeks or longer occurred in the second quarter of 2011. The record high for those unemployed 99 weeks or longer occurred in the fourth quarter of 2011. That is, the shorter the duration category, the earlier its share of the unemployed peaks.

This relationship is where we can see how to put some thought into regression variables. Let’s say that we want to predict the percentage of unemployed who will have been unemployed for 99 weeks or longer, using the other two figures. The most natural setup for the data is for all of the figures to be in the same row by date, like this:

In this worksheet, each column starts in row 1.

When your data are set up like this, it’s natural to want to analyze the data this way. The relationship that you get this way is strong. If you looked at the R-squared statistics, you might stop.

Model Summary

       S    R-sq  R-sq(adj)  R-sq(pred)
0.963437  94.69%     94.56%      93.96%

But if you look a little deeper, you might find some unsatisfactory aspects of setting up the variables this way. Here's what the relationship looks like when you plot all 3 variables on a 3-D graph:

The relationship between the variables is weaker as the values increase.

I’ve marked the points on this graph that have unusual predictor values. In the diagnostic report for the model, we can see that these points are followed by observations with large standardized residuals. That is, the lag that the article pointed out in the maximums shows up in the regression relationship as well.

Fits and Diagnostics for Unusual Observations

       99 weeks
        Percent
Obs  unemployed     Fit   Resid  Std Resid
 63       4.500   3.219   1.281       1.54     X
 64       5.800   6.793  -0.993      -1.11     X
 65       6.500   8.323  -1.823      -2.03  R  X
 66       9.500  13.152  -3.652      -3.92  R
 67       9.600  12.786  -3.186      -3.40  R
 68      10.700  14.019  -3.319      -3.57  R
 75      14.300  12.387   1.913       2.04  R

R  Large residual
X  Unusual X
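
The R flags in this report can be reproduced from the Std Resid column alone. The sketch below assumes the common cutoff of an absolute standardized residual greater than 2 for a "large residual"; that cutoff and the helper name `flag_large_residuals` are assumptions for illustration, not something stated in the article. The observation numbers and values are copied from the report above.

```python
# Standardized residuals copied from the diagnostic report above,
# keyed by observation number.
std_resid = {63: 1.54, 64: -1.11, 65: -2.03, 66: -3.92,
             67: -3.40, 68: -3.57, 75: 2.04}


def flag_large_residuals(std_resid, cutoff=2.0):
    """Return the observations whose |standardized residual| exceeds cutoff.

    The cutoff of 2 is an assumption (a common rule of thumb).
    """
    return sorted(obs for obs, r in std_resid.items() if abs(r) > cutoff)


large = flag_large_residuals(std_resid)
print(large)  # [65, 66, 67, 68, 75]
```

Under that assumed cutoff, the flagged observations match the rows marked R in the report.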

If you think about the predictor variables, this makes perfect sense. The BLS report notes that finding a job is less likely the longer you are unemployed. People unemployed for more than 27 weeks can become people who are unemployed for longer than 52 weeks. People who are unemployed for more than 52 weeks can become people who are unemployed longer than 99 weeks.

So what are the right predictors to use for the percentage of the unemployed for longer than 99 weeks? The closest we can get with the terms provided is probably that people who are unemployed for over 27 weeks can become people who are unemployed for over 99 weeks about 4 quarters later. Similarly, people who are unemployed for over 52 weeks can become people who are unemployed for over 99 weeks about 2 quarters later.

To get these variables in Minitab, use the Time Series menu.

  1. Choose Stat > Time Series > Lag.
  2. In Series, enter 'Over 27 Weeks'.
  3. In Store lags in, enter 'Over 27 Lag 4'.
  4. In Lag, enter 4.
  5. Press CTRL + E.
  6. In Series, enter 'Over 52 Weeks'.
  7. In Store lags in, enter 'Over 52 Lag 2'.
  8. In Lag, enter 2.
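
For readers working outside Minitab, the lag step can be sketched in plain Python. This is a minimal illustration, not Minitab's implementation: the `lag` helper and the quarter labels are hypothetical, and the sketch assumes that lagging moves each value down k rows and leaves the first k rows missing (`None` here).

```python
def lag(series, k):
    """Shift values down k rows; the first k rows become None (missing)."""
    if k < 0:
        raise ValueError("k must be non-negative")
    padded = [None] * k + list(series)
    return padded[:len(series)]


# Quarter labels stand in for the actual percentages so the alignment is visible.
quarters = ["1994Q1", "1994Q2", "1994Q3", "1994Q4", "1995Q1", "1995Q2"]

over_27_lag_4 = lag(quarters, 4)  # plays the role of 'Over 27 Lag 4'
over_52_lag_2 = lag(quarters, 2)  # plays the role of 'Over 52 Lag 2'

for row in zip(quarters, over_27_lag_4, over_52_lag_2):
    print(row)
# The 1995Q1 row prints as ('1995Q1', '1994Q1', '1994Q3').
```

Rows where a lagged column is still missing cannot be used in the fit, so a regression on the lagged columns would presumably use a few fewer rows than the original one.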

The resulting worksheet looks like this:

New variables are in this worksheet that line up the rows at more logical intervals.

Now, the percentage unemployed over 27 weeks from the first quarter of 1994 lines up with the percentage unemployed over 52 weeks from the third quarter of 1994 and the percentage unemployed over 99 weeks from the first quarter of 1995. Plot these data and the relationship looks stronger than before:

The relationship between the response and the lagged predictors looks stronger.

With the same 3 points from the first graph highlighted in red, the points don’t seem unusual at all. In fact, these points don’t appear in the diagnostic report anymore. One point still has a large standardized residual, and it is preceded by an observation with an unusual X value. But the regression that compares appropriate time frames explains more variation in the data than the regression that compares simultaneous ones.

Model Summary

       S    R-sq  R-sq(adj)  R-sq(pred)
0.676735  97.50%     97.43%      97.04%

Fits and Diagnostics for Unusual Observations

       99 weeks
        Percent
Obs  unemployed     Fit   Resid  Std Resid
 66       9.500   8.866   0.634       1.01     X
 68      10.700  13.357  -2.657      -4.40  R  X

R  Large residual
X  Unusual X
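
One rough way to compare the two Model Summary tables is to look at the share of variation each model leaves unexplained, 100% minus R-sq. The arithmetic below uses only the R-sq values reported in this post. The variable names are illustrative, and the comparison is approximate, assuming the lagged model is fit on slightly fewer rows because the first lagged values are missing.

```python
r_sq_same_quarter = 94.69  # R-sq (%) of the model with simultaneous predictors
r_sq_lagged = 97.50        # R-sq (%) of the model with lagged predictors

unexplained_same = 100 - r_sq_same_quarter
unexplained_lagged = 100 - r_sq_lagged

# Relative reduction in unexplained variation after lagging the predictors.
reduction = 1 - unexplained_lagged / unexplained_same

print(f"{unexplained_same:.2f}% -> {unexplained_lagged:.2f}% unexplained")
print(f"relative reduction: {reduction:.0%}")
```

By this measure, lagging the predictors cuts the unexplained variation roughly in half.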

Minitab Statistical Software provides a number of ways for you to evaluate your regression model. If your diagnostics reveal model inadequacies, then you have a lot of easy ways to make improvements. I used lags to create appropriate variables. If you’re ready for more, check out how Bruno Scibilia includes interactions in his model for wine tasting or explains the benefits of a Box-Cox transformation.
