What Is Complete Separation in Binary Logistic Regression?

Eric Heckman | 22 February, 2016

Topics: Regression Analysis

When running a binary logistic regression and many other analyses in Minitab, we estimate parameters for a specified model based on the sample data that has been collected. Most of the time, we use what is called Maximum Likelihood Estimation. However, depending on the specifics of your data, these estimation methods sometimes fail. What happens then?

Specifically, during binary logistic regression, an error comes up often enough that I want to explain what exactly it means, and offer some potential remedies for it. When you attempt to run your model, you may see the following error:

[Image: error message]

What's going on here? First, let's see what causes this error. Take a look at the following data set consisting of one response variable, Y, and one predictor variable, X.

X 1 2 3 4 4 5 5 6
Y 0 0 0 0 0 1 1 1

Note the key pattern. This data set can be simply described as follows:

If X <= 4, then Y = 0 without fail. Similarly, if X > 4, then Y = 1, again without fail. This is what is known as "separation."
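The pattern can also be checked mechanically. Below is a minimal Python sketch (not something Minitab runs or exposes) that tests whether a single cutoff on one predictor splits the two response values with no overlap. The function name `is_completely_separated` is hypothetical, chosen only for this illustration.

```python
def is_completely_separated(x, y):
    """Return True if one cutoff on x splits y = 0 from y = 1 with no overlap."""
    zeros = [xi for xi, yi in zip(x, y) if yi == 0]
    ones = [xi for xi, yi in zip(x, y) if yi == 1]
    if not zeros or not ones:
        # Only one response value is present, which is a different problem.
        return False
    # Separated if every x for one response lies strictly below every x for the other.
    return max(zeros) < min(ones) or max(ones) < min(zeros)


# The data set from the article.
X = [1, 2, 3, 4, 4, 5, 5, 6]
Y = [0, 0, 0, 0, 0, 1, 1, 1]

print(is_completely_separated(X, Y))  # True
```

This check only covers a single numeric predictor. With several predictors, separation can occur along a combination of them, which a check like this would not catch.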

This "perfect prediction" of the response is what causes the estimates, and thus your model, to fail.
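To see why the estimates fail, it can help to look at the likelihood itself. The following is a rough Python sketch, under the assumption that the model's 50% point sits at a cutoff of 4.5 (any value between 4 and 5 would behave the same way). The function `log_likelihood` and the `cutoff` parameter are illustrative names, not part of Minitab.

```python
import math

# The data set from the article.
X = [1, 2, 3, 4, 4, 5, 5, 6]
Y = [0, 0, 0, 0, 0, 1, 1, 1]


def log_likelihood(slope, cutoff=4.5):
    """Log-likelihood of a logistic model whose 50% point sits at `cutoff` (assumed)."""
    total = 0.0
    for x, y in zip(X, Y):
        z = slope * (x - cutoff)
        # log P(Y = y) equals -log(1 + exp(-z)) when y = 1 and -log(1 + exp(z)) when y = 0.
        signed = z if y == 1 else -z
        total += -math.log1p(math.exp(-signed))
    return total


# Print the log-likelihood for increasingly steep slopes.
for slope in (1, 2, 5, 10, 20):
    print(slope, log_likelihood(slope))
```

For these data, each steeper slope gives a higher log-likelihood, and the value creeps toward zero without ever reaching it. No finite slope is the best one, so a maximum likelihood search has nothing to converge to.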

Often, separation occurs when the data set is too small to observe events with low probabilities. In the example above, it may be possible to observe a Y value of 1 with an X of less than 4; however, with a small sample and a low probability of that event, we didn't observe any instances of it in our data collection. The more predictors are in the model, the more likely separation is to occur, because the individual groups in the data have smaller sample sizes.

Essentially, separation occurs when there is a category or range of a predictor with only one value of the response. We need variation in the response within each category or range to estimate the model.

So when separation happens, what can we do to proceed? With the data as is, there's no way to estimate those parameters; however, there are some things we can do to work around this issue.

1. Obtain more data. Collecting more data increases the probability that you will observe different values of your response, thus eliminating the separation. If it is possible, this is a good first step.

2. Consider an alternative model. The more terms the model contains, the more likely it is that separation occurs for at least one variable. When you select terms for the model, you can check whether excluding a term allows the maximum likelihood estimates to converge. If a useful model exists that does not use the term, you can continue the analysis with the new model.

3. Combine categories. Depending on the predictor variable in question, you may be able to regroup its levels so that every group contains both values of the response. For example, you may have a predictor in your model with groups for both "Oranges" and "Apples." With such specific groups, it may be possible to see separation. However, that separation may disappear if you can combine those two levels into one broader grouping, such as "Fruit."
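The regrouping idea in the third remedy can be sketched in Python as follows. The data here are made up purely for illustration (including the extra "Bread" level), and the helper `levels_with_one_outcome` is a hypothetical name. It lists the levels of a categorical predictor that show only one value of the response, before and after merging "Oranges" and "Apples" into "Fruit."

```python
from collections import defaultdict


def levels_with_one_outcome(groups, y):
    """Return the predictor levels that were observed with only one response value."""
    seen = defaultdict(set)
    for level, outcome in zip(groups, y):
        seen[level].add(outcome)
    return sorted(level for level, outcomes in seen.items() if len(outcomes) == 1)


# Hypothetical data: "Oranges" never shows the event, so it causes separation.
groups = ["Oranges", "Oranges", "Apples", "Apples", "Apples", "Bread", "Bread"]
y = [0, 0, 0, 1, 1, 0, 1]

print(levels_with_one_outcome(groups, y))  # ['Oranges']

# Merge the two specific levels into one broader level.
merged = ["Fruit" if g in ("Oranges", "Apples") else g for g in groups]

print(levels_with_one_outcome(merged, y))  # []
```

Whether a merge like this makes sense depends on the subject matter. The combined level should still be meaningful for the question the model is meant to answer.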

Seeing an error message like this can be frustrating, but it doesn't have to be the end of the line if you know some ways to work around it. Keep these steps in mind when analyzing a model, and you can overcome data issues such as this in the future.