Statistical Tools for Predicting Group Membership

Minitab Blog Editor 11 August, 2011

Riddle: What two tools in Minitab can be used to perform the same analysis on your data? Well, there are probably a few pairs that can be mentioned, but I am going to focus on Discriminant Analysis and Binary Logistic Regression.

These tools can be used to predict group membership.  If we look at exh_mvar.mtw, located in Minitab’s sample data folder, we have the perfect data set to use. Here is a snapshot of the first 30 or so observations:

Salmon Data for Statistical Group Prediction

Fifty fish from each place of origin (Alaska, Canada) were caught and growth ring diameters of scales were measured for the time when they lived in freshwater and for the subsequent time when they lived in saltwater. The goal is to be able to identify newly-caught fish as being from Alaskan or Canadian stocks. Let’s analyze this data with both Discriminant Analysis and Binary Logistic Regression and predict for a newly caught fish with Freshwater ring diameter of 100, and Marine ring diameter of 300.

Let’s go to Stat->Multivariate->Discriminant Analysis. I fill out the dialogs as follows:
[Screenshots: Discriminant Analysis dialog; Analysis Options dialog]

I then go to the Session window to look at the predicted group membership for the new observation:

Based on a probability of 0.943, Discriminant Analysis has selected Canada as my predicted group. Let’s see what Binary Logistic Regression has to say. I go to Stat->Regression->Binary Logistic Regression, and fill out the dialogs as follows:

[Screenshots: Binary Logistic Regression dialog; logistic regression options dialog]
I look at my session window for the results on the group prediction:

Session Window - binary logistic regression

The event probability of 0.877 is the estimated probability that the event occurs. In this case, the ‘event’ in this logistic regression has been defined as ‘Canada’. (If you wanted to change the reference event to ‘Alaska’ instead, you could do so under the Options for Binary Logistic Regression.) So for observation 1 (Freshwater 100, Marine 300), the model estimates an 87.7% probability that the fish belongs to the Canada group.
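To make the event probability more concrete, here is a small Python sketch that converts the 0.877 reported above into odds and log-odds, and back again. The function names are my own and are not part of Minitab; the only input taken from the analysis is the event probability itself.

```python
import math

def odds_from_probability(p):
    """Odds in favor of the event: p / (1 - p)."""
    return p / (1 - p)

def probability_from_log_odds(log_odds):
    """Inverse of the logit link: turns log-odds back into a probability."""
    return 1 / (1 + math.exp(-log_odds))

p_canada = 0.877                 # event probability from the Session window
p_alaska = 1 - p_canada          # probability of the non-event group

odds_canada = odds_from_probability(p_canada)
log_odds_canada = math.log(odds_canada)

print(f"P(Alaska)        = {p_alaska:.3f}")
print(f"Odds of Canada   = {odds_canada:.2f} to 1")
print(f"Log-odds         = {log_odds_canada:.3f}")
print(f"Back to P(Canada) = {probability_from_log_odds(log_odds_canada):.3f}")
```

In other words, an event probability of 0.877 means the model considers Canada roughly seven times as likely as Alaska for this fish.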

Both tools selected Canada as their group of choice in this case. But does this mean that both tools will always agree with each other? And which analysis should we rely on in the future? Binary Logistic Regression makes fewer assumptions, so there are fewer to violate. Its independent variables needn't be normally distributed, linearly related, or have equal within-group variances. Binary Logistic Regression is also robust, handles categorical as well as continuous variables, and has coefficients that many find easier to interpret. If you have highly unequal group sizes, that is another reason to shy away from Discriminant Analysis. Discriminant Analysis was actually an earlier alternative to Binary Logistic Regression. However, Discriminant Analysis is preferred when those assumptions (normality, linearity, and equal within-group variances) are met, because it then offers more statistical power than logistic regression (less chance of Type II errors, that is, failing to reject the null hypothesis when it is false).
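To show what those assumptions buy you, here is a minimal Python sketch of how a two-group linear discriminant analysis turns normality and a shared (pooled) covariance matrix into a posterior probability. This is a textbook calculation, not Minitab's code, and the group means and pooled covariance below are made-up illustrative values, not numbers computed from exh_mvar.mtw, so the result will not match the 0.943 from the Session window.

```python
import math

def lda_posterior(x, mean_a, mean_b, pooled_cov, prior_a=0.5):
    """Posterior probability of group A for a two-group, two-predictor LDA."""
    (s11, s12), (s21, s22) = pooled_cov
    det = s11 * s22 - s12 * s21
    # Inverse of the 2x2 pooled covariance matrix.
    inv = ((s22 / det, -s12 / det), (-s21 / det, s11 / det))

    def score(mean, prior):
        # Linear discriminant score: x'S^-1 m - 0.5 m'S^-1 m + ln(prior)
        sm0 = inv[0][0] * mean[0] + inv[0][1] * mean[1]
        sm1 = inv[1][0] * mean[0] + inv[1][1] * mean[1]
        return (x[0] * sm0 + x[1] * sm1
                - 0.5 * (mean[0] * sm0 + mean[1] * sm1)
                + math.log(prior))

    # With two groups, the posterior is a logistic function of the score gap.
    gap = score(mean_a, prior_a) - score(mean_b, 1 - prior_a)
    return 1 / (1 + math.exp(-gap))

# HYPOTHETICAL summary statistics (Freshwater, Marine), for illustration only.
mean_alaska = (100.0, 430.0)
mean_canada = (140.0, 370.0)
pooled_cov = ((300.0, -100.0), (-100.0, 1200.0))

new_fish = (100.0, 300.0)
p_canada = lda_posterior(new_fish, mean_canada, mean_alaska, pooled_cov)
p_alaska = lda_posterior(new_fish, mean_alaska, mean_canada, pooled_cov)
print(f"P(Canada) = {p_canada:.3f}, P(Alaska) = {p_alaska:.3f}")
```

Notice that every observation would contribute equally to the means and the pooled covariance that feed this calculation, which is why the method leans so heavily on those quantities describing both groups well.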

For a more theoretical response to the matter, here is something that came from one of our lead designers:

“Both discriminant analysis and logistic regression (binary or nominal) model the relationship between predictor variables and a categorical response. However, the underlying assumptions and the estimation methods are very different, so the results will be different. Discriminant analysis assumes normal distribution and estimates the parameters minimizing the least squares criteria. Logistic regression calculates maximum likelihood estimates via iteratively reweighting the observations. In discriminant analysis all observations enter equally in the covariance calculation while in logistic regression the observation closer to the class boundaries get higher importance. Also, logistic regression handles a wider range of predictors (both continuous and categorical) in addition to various link functions. Unless the normality assumption is perfectly met, logistic regression has a better error rate than discriminant analysis because of its greater flexibility in finding the right model.”
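The "iteratively reweighting" idea in the quote can be sketched in a few lines of Python. This is a bare-bones illustration with one predictor and a tiny made-up data set (not the salmon data), and it is not how Minitab implements the fit. Each pass recomputes the fitted probabilities, gives every observation the weight p(1 - p), and takes a Newton step; the weights are largest where p is near 0.5, which is the "class boundary" the quote refers to.

```python
import math

def fit_logistic_irls(xs, ys, iterations=25):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by iteratively reweighted least squares."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        ps = [1 / (1 + math.exp(-(b0 + b1 * x))) for x in xs]
        ws = [p * (1 - p) for p in ps]          # observation weights
        # Entries of the weighted cross-product matrix X'WX (2x2, symmetric).
        a00 = sum(ws)
        a01 = sum(w * x for w, x in zip(ws, xs))
        a11 = sum(w * x * x for w, x in zip(ws, xs))
        # Score vector X'(y - p).
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum(x * (y - p) for x, y, p in zip(xs, ys, ps))
        # Newton step: solve the 2x2 system (X'WX) * delta = score.
        det = a00 * a11 - a01 * a01
        b0 += (a11 * g0 - a01 * g1) / det
        b1 += (a00 * g1 - a01 * g0) / det
    return b0, b1

# HYPOTHETICAL toy data: one predictor, overlapping groups (not separable).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic_irls(xs, ys)
ps = [1 / (1 + math.exp(-(b0 + b1 * x))) for x in xs]
ws = [p * (1 - p) for p in ps]
for x, p, w in zip(xs, ps, ws):
    print(f"x = {x:.0f}  fitted p = {p:.3f}  weight = {w:.3f}")
```

The printed weights peak for the observations in the middle of the range, where the two groups overlap, and shrink toward the extremes, where the fitted probability is close to 0 or 1.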

For more information on this topic, please refer to these two references:

S.E. Fienberg (1987). The Analysis of Cross-Classified Categorical Data. The MIT Press.
S.J. Press and S. Wilson (1978). "Choosing Between Logistic Regression and Discriminant Analysis," Journal of the American Statistical Association, 73, 699-705.

I hope this helped provide more clarity on the methods behind both tools, and when to use them!