In a previous post, I told you how omitting the subject's weight led to a surprising result in a preliminary regression analysis of the effects of physical activity on bone density. It's a good example of why you need to be very careful about the data you collect: factors you DO NOT include sometimes have a big influence on the ones you DO include!

Let's look at how this played out in my bio-mechanics study.

In this case, multiple variables influence the subjects' bone density, not just their activity. Another important variable is the subject’s weight. Theory states that the more a subject weighs, the higher that subject's bone density tends to be because, again, the bones adapt to higher forces. When I included weight alongside activity in the regression analysis, I found that both activity and weight were significantly and positively associated with bone density.

Somehow, leaving out weight completely masked the effect of activity on bone density in the initial analysis. Let’s dig a bit deeper to see how this happened by looking at the correlation structure of the variables.

While both activity and weight are related to bone density, they are also related to each other. Lower activity is related to a higher weight, a negative correlation. When a variable is correlated with both a predictor in the model and the response, as weight is here, it is known as a confounding variable. This is an appropriate name because it can confound your results!

The diagram of the variables and their correlation signs tells the story: activity and weight are each positively correlated with bone density, but negatively correlated with each other. It turns out that activity and weight canceled each other out. Study subjects who are more active tend to get a boost in bone density due to their physical activity, but, because they tend to be leaner, they get a bone density deduction for weighing less. When activity was the only variable in the model, the model was forced to reflect the counteracting effects of both variables within just the one variable. When we included all significant variables, the model could accurately give each predictor its own effect.
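
To make the canceling concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the data are made up rather than taken from my study, the coefficients (0.5 for activity, 0.625 for weight, -0.8 linking the two) were picked so that the effects cancel almost exactly, and `ols` is a small hypothetical helper built on NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000  # hypothetical sample size, much larger than a typical study

# Made-up data-generating process (NOT the study's data).
activity = rng.normal(size=n)
# More active subjects tend to weigh less: a negative correlation.
weight = -0.8 * activity + rng.normal(scale=0.6, size=n)
# Both predictors have a positive true effect on bone density.
density = 0.5 * activity + 0.625 * weight + rng.normal(scale=0.5, size=n)

def ols(y, *predictors):
    """Least-squares coefficients: intercept first, then one slope per predictor."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

activity_only = ols(density, activity)   # weight left out
both = ols(density, activity, weight)    # weight included

print("activity slope, weight omitted: ", activity_only[1])
print("activity slope, weight included:", both[1])
print("weight slope:                   ", both[2])
```

With these made-up numbers, the activity slope should land close to zero when weight is left out, and close to the 0.5 used to generate the data once weight is in the model. The masking is not a fluke of the sample; it follows from the signs of the three relationships.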

Confounding variables can hide a true relationship between a predictor and response variable (as happened in this case) or they can suggest a false relationship between them. You should be particularly wary of confounding variables in non-randomized studies.
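
Both outcomes fall out of the same piece of arithmetic, usually called omitted variable bias. The sketch below uses my own notation, not anything from the study's output: the betas are the true effects on bone density, delta is the slope you would get by regressing weight on activity, and the error terms are assumed to be unrelated to the predictors.

```latex
\begin{aligned}
\text{density} &= \beta_0 + \beta_A\,\text{activity} + \beta_W\,\text{weight} + \varepsilon \\
\text{weight}  &= \delta_0 + \delta\,\text{activity} + \nu \\
\mathbb{E}\big[\hat{\beta}_A \mid \text{weight omitted}\big] &= \beta_A + \beta_W\,\delta
\end{aligned}
```

In my study, the weight effect is positive and delta is negative, so the extra term is negative and can cancel a positive activity effect. Flip the situation around, with a true activity effect of zero and a nonzero extra term, and the activity-only model reports a relationship that isn't really there.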

Imagine the problem we would’ve had if we hadn’t collected this additional data! Thanks to advance planning, we had measured the subjects' weight, so it was a simple matter to include it in the model.

If you aren’t careful, these hidden minefields can completely change the results of your data analysis!

Here are some important considerations:

• Research the subject area before designing your study
• Identify and measure all of the important variables
• Understand the correlation structure of your variables
• Include all significant variables in a multiple regression model
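
For the correlation structure in particular, a quick check is to look at the full correlation matrix before fitting anything. Here is a minimal sketch with NumPy. The data are simulated with made-up coefficients purely for illustration, and the variable names are my own.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000  # hypothetical sample size

# Made-up data (NOT the study's data): activity and weight both raise
# bone density, but more active subjects tend to weigh less.
activity = rng.normal(size=n)
weight = -0.8 * activity + rng.normal(scale=0.6, size=n)
density = 0.5 * activity + 0.625 * weight + rng.normal(scale=0.5, size=n)

names = ["activity", "weight", "density"]
# np.corrcoef treats each row as one variable.
corr = np.corrcoef([activity, weight, density])

for name, row in zip(names, corr):
    print(f"{name:>8}", np.round(row, 2))
```

In this simulated example, activity and density look nearly uncorrelated on their own, while activity and weight are strongly negatively correlated. That second number is the warning sign: the predictors are entangled, so a model with only one of them can't be trusted to show that predictor's own effect.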

Here's one additional illustration of how the messy world interacts with research studies. If you look at the scientific literature about exercise and bone density, you’ll find that some studies show an effect, while others don’t. That's because different studies look at different types of exercise, over different durations, for different age groups. They also measure density at different bone sites and include different variables in the model. Most of these factors legitimately influence the results. However, a failure to measure, and thereby control for, a confounding variable is not a legitimate influence on results. Watch out!

In my next post, we'll look at a case where a confounding variable made another variable look significant when it wasn't.