# How to Predict with Minitab: Using BMI to Predict the Body Fat Percentage, Part 1

Wouldn’t it be nice to be able to predict something that is important to you? Sure, and it would be extra nice if you knew how accurate your predictions will be. Well, you don’t have to be a psychic to have these powers because Minitab Statistical Software gives them to you! You can find these predictive powers in regression, general linear model (ANOVA), and design of experiments (DOE).

## Prediction with Regression Analysis

We’ll explore prediction with regression analysis by using a person’s body mass index (BMI) to predict their percentage of body fat. For extra fun, we’ll compare Minitab’s predictions to those reported by body fat measuring scales that use bioelectrical impedance analysis (BIA).

Prediction in Minitab is different from psychic ability. For one thing, you are often not predicting the future, and this blog’s example is one such case. Prediction in regression refers to estimating the value of one variable using assumed values of other input variables that are related to it.

Unlike psychic ability, statistical predictions don’t just pop into your head. Instead, you need to collect data and develop a mathematical model that describes the relationship between one or more variables and the variable you want to predict. The accuracy of your predictions depends on how well your model fits the data. In exchange for this work, Minitab not only makes predictions but also gives you estimates of their accuracy!

The steps for accurate prediction are:

1. Do background research so you know what you’re doing!
2. Collect the data.
3. Fit and evaluate competing regression models.
4. Determine the best model.
5. Predict!

## Scenario

I’ve already collected the data from 92 middle-school-aged girls. The data includes their height, weight, and percentage of body fat as measured by a Hologic DXA whole-body system. DXA measurements are considered one of the best ways to measure the percentage of body fat. You can get the data here.

The idea here is that DXA measurements are more expensive and harder to obtain than your BMI. It would be great if you could use your BMI to predict your percentage of body fat. BMI is a general assessment of whether your weight is in an appropriate range for your height. However, BMI is an imperfect measure because it cannot distinguish between weight from muscle versus weight from fat. Therefore, a goal of this analysis is to see if BMI is good enough.

## Regression Model

For this blog, I’ve already reduced the model down to its final form. The process of reducing the model is the subject of another blog; here I will show you why the final model is a good one. During that process, I specifically compared including height, weight, and their squared terms in the model to including BMI and its squared term in the model. I included the squared terms in order to model the curvature. Both approaches produced nearly identical results. This isn’t surprising because you use height and weight to calculate BMI (weight in kilograms divided by the square of height in meters). I’ve decided to use BMI as the predictor because it’s easier to graph in the results below.
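As a quick illustration of the BMI formula, here is a minimal Python sketch. The function name `bmi` and the example height and weight are my own invention for illustration; they are not values from the study data.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in meters squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical example values, not taken from the study data.
example_bmi = bmi(45.0, 1.50)
print(round(example_bmi, 1))  # 20.0
```

Because BMI is built entirely from height and weight, a model on BMI and BMI squared carries much of the same information as a model on height, weight, and their squares.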

Because we have one predictor (BMI) and one response (Body Fat Percentage), we can use a fitted line plot to display the relationship.

You’ll note the curved relationship, which is why I included the squared term. The fitted line follows the data very nicely because the observations fall randomly around it for the entire range. The R-squared is 76.1%, which isn't fantastic but isn't terrible either. It reflects the imperfect nature of BMI as a measure. We'll assess the residual plots below to really check the model. Remember, if the model isn’t a good fit, our predictions won’t be valid.
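If you want to see what a quadratic fit and its R-squared involve outside of Minitab, the sketch below uses Python with NumPy. Everything in it is an assumption for illustration: the helper name `fit_quadratic` is hypothetical, and the data are synthetic values I generated, not the DXA measurements, so the printed numbers will not match the 76.1% above.

```python
import numpy as np


def fit_quadratic(x, y):
    """Least-squares fit of y = b0 + b1*x + b2*x**2.

    Returns (b0, b1, b2, r_squared).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b2, b1, b0 = np.polyfit(x, y, deg=2)  # polyfit returns the highest power first
    fitted = b0 + b1 * x + b2 * x**2
    ss_res = np.sum((y - fitted) ** 2)    # variation left unexplained by the model
    ss_tot = np.sum((y - y.mean()) ** 2)  # total variation in the response
    return b0, b1, b2, 1.0 - ss_res / ss_tot


# Synthetic illustration only: invented values, NOT the study data.
rng = np.random.default_rng(0)
bmi_values = np.linspace(15.0, 35.0, 92)
body_fat = -20.0 + 3.0 * bmi_values - 0.04 * bmi_values**2 + rng.normal(0.0, 3.0, bmi_values.size)

b0, b1, b2, r_squared = fit_quadratic(bmi_values, body_fat)
print(f"b0={b0:.2f}, b1={b1:.3f}, b2={b2:.4f}, R-squared={r_squared:.1%}")
```

R-squared here is one minus the ratio of residual variation to total variation, which is the usual definition for a least-squares model with an intercept.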

A side note about the implications of the curved relationship for the raw BMI scores of this population: We tend to think of BMI scores in a linear sense. That is, if you start at either 16 or 30 and increase BMI by 1, we assume that it represents the same increase in fat mass. The curved relationship shown above suggests that this is not true. The change in fat mass varies depending on the specific BMI value you start at. The overall effect is that raw BMI values tend to overestimate the fat mass for those with very low BMI scores and very high BMI scores. Middle-of-the-range BMI scores tend to underestimate fat mass. Regression analysis actually improves upon raw BMI scores for this population by correctly modeling the curvature. Yay Minitab!

The residuals in the Normal Probability Plot above follow a straight line, which indicates they are normally distributed. In the Versus Fits plot, the residuals appear to be randomly scattered about zero. These data were not recorded in time-order so we can ignore the Versus Order plot. The histogram can help detect outliers, but none are evident. (Don’t use histograms to assess normality, because they can be deceptive for that purpose!)
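To make the "follows a straight line" idea concrete, here is a rough Python sketch that measures how closely sorted residuals track theoretical normal quantiles. The function name `normal_plot_correlation` and the plotting positions `(i - 0.5) / n` are my assumptions; Minitab may use a different convention, so treat this as an illustration of the concept rather than a reproduction of its plot.

```python
from statistics import NormalDist

import numpy as np


def normal_plot_correlation(residuals):
    """Correlation between sorted residuals and theoretical normal quantiles.

    Values near 1 mean the points of a normal probability plot
    fall close to a straight line.
    """
    r = np.sort(np.asarray(residuals, dtype=float))
    n = r.size
    # Plotting positions (i - 0.5) / n; other conventions exist.
    probs = (np.arange(1, n + 1) - 0.5) / n
    quantiles = np.array([NormalDist().inv_cdf(p) for p in probs])
    return float(np.corrcoef(quantiles, r)[0, 1])


# Synthetic residuals for illustration only, not from the body fat model.
rng = np.random.default_rng(1)
print(normal_plot_correlation(rng.normal(0.0, 3.0, 92)))
```

A strongly skewed set of residuals would give a noticeably lower correlation, which corresponds to points bending away from the line in the plot.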

Because the model is a good fit, we can start predicting. In part 2 of this series, we’ll generate predictions, assess accuracy, and compare Minitab's predictions to those of the BIA scales!

Name: John Peipock • Tuesday, February 21, 2012

I would like to learn more on this subject and have better examples on using this feature.

Name: Jim Frost • Tuesday, February 21, 2012

Hi John, be sure to check out part 2 of this blog on February 23rd because it contains more information about using Prediction. Also, if there are additional issues in this subject that you would like covered in more depth, please let me know and I'll be happy to blog about them!

Kind regards,
Jim

Name: prasshanth bharadwaj • Monday, January 13, 2014

Hi, thank you for such valuable information. Kindly help me find the data used in conducting this test. Also, when you said that "predicted R squared" indicates how well the model predicts new observations, what does that mean? Thanks a lot.

regards
prasshant

Name: Jim Frost • Wednesday, January 15, 2014

Hi Prasshant,

I've added a link in the Scenario section that you can click to get the data set.

Predicted R-squared is calculated by systematically removing each observation from the data set, estimating the regression equation, and determining how well the model predicts the removed observation. It helps you determine how well your model can predict new observations rather than just fitting your existing data.
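The leave-one-out procedure just described can be sketched in Python with NumPy. This is only an illustration of the idea under my own assumptions: the function name `predicted_r_squared` is hypothetical, the model is a polynomial in a single predictor, the data are synthetic, and Minitab's internal calculation may take a different (faster) route to the same kind of statistic.

```python
import numpy as np


def predicted_r_squared(x, y, degree=2):
    """Leave-one-out predicted R-squared for a polynomial fit in one predictor."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    press = 0.0  # sum of squared prediction errors for the held-out points
    for i in range(x.size):
        keep = np.arange(x.size) != i                          # drop observation i
        coefs = np.polyfit(x[keep], y[keep], deg=degree)       # refit without it
        press += (y[i] - np.polyval(coefs, x[i])) ** 2         # error predicting it
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / ss_tot


# Synthetic illustration only: invented values, NOT the study data.
rng = np.random.default_rng(0)
x_demo = np.linspace(15.0, 35.0, 92)
y_demo = -20.0 + 3.0 * x_demo - 0.04 * x_demo**2 + rng.normal(0.0, 3.0, x_demo.size)
print(f"Predicted R-squared: {predicted_r_squared(x_demo, y_demo):.1%}")
```

Because each observation is predicted by a model that never saw it, this value is typically somewhat lower than the regular R-squared for the same model.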

I recommend that you read my post about predicted R-squared for more details.

Jim

Name: prasshanth bharadwaj • Wednesday, January 22, 2014

Hi,
Thank you for your reply. I just wanted to know: when you said "predict new observations", does that mean data outside our range? For example, if my Y is AHT for the months of January, February, and March only, then with the help of predicted R-squared, can I predict for the month of April as well?

Kindly revert.

Thank you
Prasshanth

Name: Jim Frost • Wednesday, January 22, 2014

Hi Prasshanth,

New observation simply means that it's not a data point in the data set that was used to estimate the model; it's the Y value that you want to predict using the X value(s) that you supply.

Predicted R-squared and predicting outside the range of your data are two separate issues.

If your predicted R-squared is notably lower than the R-squared, it means that your model can't predict new observations well, even within the range of your data. Your model is only good for explaining your data set, but not good for predicting new observations. In other words, with a low predicted R-squared, you should not have much faith in the predictions at all.

If you have a predicted R-squared that is close to your regular R-squared, your model both explains the data set *and* can predict new observations. However, even in this case, you generally don't want to predict outside the range of your data.

If you have other lines of research that suggest that the relationship between the response and predictors is the same in April as it is for the previous months, it might be OK to make predictions. In other words, if the same rules apply in April, you're not really outside the range. However, that decision really comes down to your subject area knowledge.

You might also consider using a time series analysis if your data is a time series collected at regular intervals and you want to predict the next interval.

Jim

Name: chwee • Wednesday, July 2, 2014

I used Fitted Line Plot to plot the number of tickets closed vs. week number. The p-value is high and the R-sq is very low. Can I still use the regression equation to draw conclusions about the trend in tickets closed? I'm not looking for precise predictions. Thanks.

Name: Jim Frost • Thursday, July 3, 2014

Hi Chwee,

Good news! You can conclude that the relationship is there. Just be sure that the model adequately fits the data by checking the residuals.

I wrote a post about how to interpret regression models that have a low R-squared and significant predictors. This post should answer your questions: