# Proxy Variables: The Good Twin of Confounding Variables

A web of correlations can help or hurt you. Be the spider, not the fly.

In recent posts, I’ve shown how confounding variables can turn the results of your statistical analysis on their head. I’ve also shown how random assignment in experiments can protect you from them in some cases. You might wonder:

• If you can’t randomize, is there another way to avoid the dangers of confounding variables?
• And, is there a way to use their power for good?

Fortunately, the answer to each question is “yes!”

What you need are proxy variables. A proxy variable is an easily measurable variable that is used in place of a variable that cannot be measured or is difficult to measure. The proxy variable can be something that is not of any great interest itself, but has a close correlation with the variable of interest.

Confounding variables and proxy variables are essentially the same thing: correlated predictor variables. But there's a huge difference between them:

• Confounding variables affect your results in undesirable ways by not being included in the model. They are primarily a danger when you aren’t aware of them during the analysis.
• Proxy variables benefit your analysis. You know about and intentionally include them in the model to improve your results.

Wise data analysts can avoid getting burned by confounding variables and instead use proxy variables to their advantage. This is a case where knowledge truly is power: knowing your subject matter and the correlation structure among your variables lets you put those correlations to work.

## Prediction

Imagine that you are mostly interested in predicting something and that you don’t care so much about identifying true cause-and-effect relationships. Fortunately, prediction doesn’t always require a causal relationship between predictor and response. Instead, a proxy variable that is simply correlated with the response, and is easier to obtain than a causally connected variable, might well do the job.

For example, my colleague, Kevin, does an excellent job using regression analysis to assist those who play fantasy football. Recently, he used a model that included one predictor variable -- each player’s fantasy football points from the prior season -- to predict that player’s points for the subsequent season. Clearly, the points from one season are not causing the points for the next season. Rather, the prior-season points are a proxy variable for a host of other variables, such as each player’s skills and capabilities, those of their team, the teams they play against, etc. It’s impossible to measure all of these, so a proxy variable is essential. His model for choosing quarterbacks has an R-squared of 73.68%. In this case, the correlation from one year to the next is strong enough that he can use the model for prediction, even though we don’t know or measure the exact causal variables.
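To make the idea concrete, here is a minimal sketch in Python using made-up data, not Kevin’s actual model or numbers. It assumes an unmeasured "ability" drives each player’s points in both seasons, then fits a one-predictor regression of next-season points on prior-season points. The variable names, sample size, and noise levels are all illustrative assumptions.

```python
import numpy as np

# Synthetic data only: every number here is an illustrative assumption.
rng = np.random.default_rng(0)
n_players = 500

# Unmeasured "true ability" drives points in both seasons.
ability = rng.normal(250, 60, n_players)
prior_points = ability + rng.normal(0, 30, n_players)  # the proxy we can observe
next_points = ability + rng.normal(0, 30, n_players)   # what we want to predict

# Simple linear regression: next_points ~ prior_points.
# np.polyfit returns coefficients from highest degree to lowest.
slope, intercept = np.polyfit(prior_points, next_points, 1)
predicted = intercept + slope * prior_points

# R-squared = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((next_points - predicted) ** 2)
ss_tot = np.sum((next_points - next_points.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope = {slope:.2f}, R-squared = {r_squared:.1%}")
```

Prior-season points never cause next-season points in this simulation, yet the model still explains a good share of the variation because both seasons share the same hidden driver.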

## Produce unbiased results

Now, imagine that you are working on a research project where some of the variables are difficult, if not impossible, to measure. Remember, if you don’t include the intended variable in any form, your results could be badly biased, in the worst case pointing in the opposite direction from the true effect. Including an imperfect proxy of a hard-to-measure variable is often better than not including an important variable at all. So, if you can’t include the intended variable, look for a proxy!
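A small simulation can illustrate this. The sketch below uses synthetic data under assumptions I chose for the illustration: a hard-to-measure variable `z` affects the response and is correlated with the predictor of interest `x`, and `proxy` is a noisy stand-in for `z`. The true coefficient on `x` is set to 1.0, and the code compares the estimate when `z` is omitted, when the proxy is included, and when `z` itself is included.

```python
import numpy as np

# Synthetic data only: the effect sizes and noise levels are assumptions.
rng = np.random.default_rng(42)
n = 5000

z = rng.normal(size=n)                  # hard-to-measure variable
x = z + rng.normal(size=n)              # predictor of interest, correlated with z
proxy = z + 0.5 * rng.normal(size=n)    # imperfect, easy-to-measure stand-in for z
y = 1.0 * x - 3.0 * z + rng.normal(size=n)  # true coefficient on x is 1.0


def x_coefficient(*predictors):
    """Fit OLS with an intercept; return the coefficient on the first predictor."""
    design = np.column_stack([np.ones(n), *predictors])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs[1]


omitted = x_coefficient(x)             # z left out entirely
with_proxy = x_coefficient(x, proxy)   # proxy included in place of z
with_true = x_coefficient(x, z)        # z itself included (usually not possible)

print(f"z omitted:      {omitted:+.2f}")
print(f"proxy included: {with_proxy:+.2f}")
print(f"z included:     {with_true:+.2f}")
```

In this setup, leaving `z` out flips the sign of the estimate for `x`. Including the proxy restores the correct sign and moves the estimate closer to the true value, although some bias remains because the proxy is imperfect.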

## Examples of proxy variables

| Intended variable | Proxy variable |
| --- | --- |
| Historical environmental conditions | Widths of tree rings |
| Quality of life | Per-capita GDP |
| True body fat percentage | Body Mass Index (BMI) |
| Cognitive ability | Years of education and/or GPA |
| Depth that light penetrates into the ocean over large areas | Satellite images of ocean surface color |
| Hormone levels in blood | Changes in height over a fixed time |

Do you have examples of proxy variables that have helped you out in your analyses?

Name: akalu • Thursday, December 12, 2013

Thank you for sharing this important lesson. Do we need to include them in the regression analysis? And how do they differ from instrumental variables?

Name: Jim Frost • Thursday, December 12, 2013

Hi, thanks for reading! Those are great questions.

For proxy variables, you should include them if excluding them produces a lack of fit and if they're significant.

I wasn't too familiar with instrumental variables, but I did some research to see how they compare to proxy variables. Keep in mind that this is only a broad overview of instrumental variables.

Proxy variables and instrumental variables are similar and address a common problem. Namely, if an important predictor is not included in a model, the regression results can be biased.

The differences between the two are generally in the context of how they are used.

Instrumental variables are particularly applicable to applied studies, where you'll often have a randomized experimental design because you are interested in establishing causality. In these cases, you have a treatment variable that you think *causes* changes in the response variable. However, for whatever reason, you cannot directly include the treatment variable in the model.

An instrumental variable is correlated with this treatment variable, and you include the instrumental variable in the model rather than the treatment variable. The analysis process often has two stages because you want to determine the relationship between the treatment variable and the instrumental variable as well as the relationship between the instrumental variable and the response variable.

Because instrumental variables are used in the context of causality, there are often case-specific restrictions on their usage. However, one general restriction is that an instrumental variable must affect the outcome only through its effect on the treatment.
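As a rough illustration of the two-stage idea, here is a minimal sketch of two-stage least squares, which is one common way to use an instrument and is my assumption about what "two stages" refers to here. The data are synthetic: an unobserved confounder `u` affects both the treatment and the response, while the instrument `w` affects the response only through the treatment. All names and effect sizes are made up for the example.

```python
import numpy as np

# Synthetic data only: the setup and effect sizes are assumptions.
rng = np.random.default_rng(7)
n = 20000

u = rng.normal(size=n)                  # unobserved confounder
w = rng.normal(size=n)                  # instrument: independent of u
treatment = w + u + rng.normal(size=n)  # treatment depends on instrument and confounder
response = 2.0 * treatment + 1.5 * u + rng.normal(size=n)  # true effect is 2.0


def ols(predictor, target):
    """Return (intercept, slope) from a one-predictor least-squares fit."""
    design = np.column_stack([np.ones(len(predictor)), predictor])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefs[0], coefs[1]


# Naive regression of response on treatment is biased by the confounder.
_, naive_effect = ols(treatment, response)

# Stage 1: predict the treatment from the instrument.
a0, a1 = ols(w, treatment)
treatment_hat = a0 + a1 * w

# Stage 2: regress the response on the predicted treatment.
_, iv_effect = ols(treatment_hat, response)

print(f"naive OLS estimate: {naive_effect:.2f}")
print(f"two-stage estimate: {iv_effect:.2f}")
```

The second-stage slope is a point estimate only. Standard errors taken from a plain second-stage regression are not valid, so a dedicated instrumental-variables routine is the safer choice for real analyses.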

A proxy variable seems to be a more general case where you may not be interested in causality and influencing outcomes. Consequently, there are fewer restrictions on how they're used.

If anyone has additional information, perhaps based on personal experience, that would be great to hear about!

--Jim