Logistic Regression
What is logistic regression?
Logistic regression is used to model the relationship between a categorical response variable and a continuous predictor variable. The model is used to predict the probability of observing a given value of the response variable based on the value of the predictor variable.
How is logistic regression related to simple linear regression?
In simple linear regression, you regress a continuous response on a continuous predictor. In logistic regression, rather than directly regressing a categorical response on a continuous predictor, you regress a function of the response on the predictor. The function transforms the bounded probability of the categorical response into a continuous, unbounded quantity.
Example of logistic regression with a binary response
Suppose you are studying how the dose of penicillin relates to curing a disease. Fifty-four animal subjects infected with the disease were given varying dosages of penicillin (Dose). Some animals were cured; others died. You want to know how the probability of curing the disease changes with dosage. The response variable (Response) is a binary variable with two possible outcomes, or events: Cured and Died. The predictor variable is transformed to a log scale, which is denoted as ln(dose).
The logistic regression model predicts the probability of specific outcomes (Cured or Died) based on a predictor (ln(dose)). Probability is bounded on [0, 1] but the response in a linear model is unbounded (–$\infty$, +$\infty$). The relationship between the probability of a response and a predictor might not be linear, so we transform the probability to make the relationship linear. Logistic regression uses a two-step transformation. Let’s look at these steps and the logistic regression model before continuing with our example.
What are the odds?
Odds are a function of the probability of an event. The odds of an event are the ratio of the probability of the event happening to the probability of the event not happening. For example, if the probability that a bank customer will default on a loan is 20%, the odds of defaulting are 0.2 / (1 – 0.2) = 0.2 / 0.8 = 0.25 or 1:4. If the probability of rain today is 75%, the odds that it will rain are 0.75 / (1 – 0.75) = 0.75 / 0.25 = 3:1. In the logistic regression case, the outcome (Cured or Died) is the event.
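The two numeric examples above can be checked with a few lines of Python (a minimal sketch; the `odds` helper is just for illustration):

```python
def odds(p: float) -> float:
    """Return the odds of an event with probability p (0 <= p < 1)."""
    return p / (1.0 - p)

print(odds(0.20))  # loan default example: 0.25, i.e. 1:4
print(odds(0.75))  # rain example: 3.0, i.e. 3:1
```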
Transforming the probability of an event to the odds of the event is the first step in the logistic regression transformation.
What is the logit transformation?
The second step of the logistic regression transformation is the logit transformation. The logit function transforms a probability into the log of the odds. That is, if $\pi$ is the probability of an event happening, then
logit$(\pi) = \log\left(\frac{\pi}{1 - \pi}\right)$.
The logit function’s range is (–$\infty$, +$\infty$). The logit is calculated for each observed value of the response and will be used as the response variable in logistic regression. The assumption in logistic regression is that the logit transformation of probabilities results in a linear relationship with the predictor.
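A short sketch makes the logit's behavior concrete: it is zero at even odds and unbounded as the probability approaches 0 or 1 (the `logit` helper below is illustrative, not part of any particular library):

```python
import math

def logit(p: float) -> float:
    """Map a probability in (0, 1) to a log-odds value in (-inf, +inf)."""
    return math.log(p / (1.0 - p))

print(logit(0.5))   # 0.0: even odds
print(logit(0.99))  # large positive
print(logit(0.01))  # large negative, symmetric to logit(0.99)
```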
What is the logistic regression model?
The logistic regression model for one continuous predictor can be written as:
logit$(\pi(x)) = \log\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \beta_0 + \beta_1 x$
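Inverting the model equation gives the predicted probability for any value of the predictor. The sketch below uses made-up coefficients purely to show the shape of the inverse-logit curve; they are not fitted values from the example:

```python
import math

# Hypothetical coefficients, for illustration only (not fitted values).
b0, b1 = 0.5, 1.5

def predicted_prob(x: float) -> float:
    """Invert the logit: pi(x) = 1 / (1 + exp(-(b0 + b1*x)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

for x in (-2.0, 0.0, 2.0):
    print(x, round(predicted_prob(x), 3))  # probability rises with x when b1 > 0
```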
Example of logistic regression (continued)
Suppose the application of different doses of penicillin to the 54 diseased animals resulted in the following data:
| Dose | ln(dose) | Response | Number of Animals |
|------|----------|----------|-------------------|
| 0.125 | –2.07944 | Cured | 0 |
| 0.25 | –1.38629 | Cured | 3 |
| 0.5 | –0.69315 | Cured | 8 |
| 1 | 0 | Cured | 11 |
| 4 | 1.38629 | Cured | 7 |
| 0.125 | –2.07944 | Died | 11 |
| 0.25 | –1.38629 | Died | 9 |
| 0.5 | –0.69315 | Died | 4 |
| 1 | 0 | Died | 1 |
| 4 | 1.38629 | Died | 0 |
The best fit logistic regression model of Response vs. ln(dose) is shown below.
The y-axis represents the probability of Cured at the corresponding value of ln(dose). Points marked with a plus sign (+) are the cured animals; points marked with an open circle (○) are the animals that died.
The parameter estimates and hypothesis tests of significance for $\beta_0$ and $\beta_1$ in the logistic regression model are shown below.
You will notice the note “For log odds of Died/Cured.” That means that in this example, the probability of Died is being modeled as a function of ln(dose). If you wanted to model the probability of Cured, you would need to specify that as the target level of the response, or the outcome of interest.
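The fit itself can be sketched in plain Python with Newton-Raphson maximum likelihood on the grouped counts. This is an illustrative re-implementation, not the software output described above; it models the probability of Cured (modeling Died instead just flips the signs of the coefficients), and the ln(dose) values are recomputed directly from the doses:

```python
import math

# Grouped data: (ln(dose), number cured, total animals at that dose).
data = [(math.log(d), y, n) for d, y, n in
        [(0.125, 0, 11), (0.25, 3, 12), (0.5, 8, 12), (1, 11, 12), (4, 7, 7)]]

def fit_logistic(data, iters=25):
    """Maximum-likelihood fit of logit(pi) = b0 + b1*x by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # negative Hessian entries
        for x, y, n in data:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = n * p * (1.0 - p)
            g0 += y - n * p
            g1 += (y - n * p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det   # Newton step: solve H * delta = g
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

b0, b1 = fit_logistic(data)
print(b0, b1)  # b1 > 0: higher dose, higher probability of Cured
```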
Nominal logistic regression and ordinal logistic regression
The example above describes logistic regression with a binary response variable. Logistic regression also works when you have a response variable with more than two possible outcomes. For this type of response variable, transforming the probability of an outcome to a real number depends on whether the response categories have order or not.
A nominal variable is a discrete variable with no order to the categories. For example, the type of soda you purchase at the movies might be Cola, Lemon-lime, or Orange. An ordinal variable is a discrete variable with an order to the categories. For example, the size of the soda you purchase might be small, medium, or large.
Nominal logistic regression
For a nominal response variable with k levels, you can compute k – 1 generalized logits. The generalized logits are based on individual probabilities. The odds are for any given outcome relative to the last outcome. Since order is arbitrary for a nominal variable, you can simply impose an order that makes sense for your particular problem. For example, in a morbidity study, the last outcome might be the normal patient or the untreated subject.
Suppose your nominal response variable has three levels. In this case, there are two generalized logits:
$ \begin{aligned} \text{Logit}_1 &= \log \left( \frac{p_1}{p_3} \right) \\ \text{Logit}_2 &= \log \left( \frac{p_2}{p_3} \right) \end{aligned} $
Logit1 is the log odds for Level 1 occurring vs. Level 3 occurring. Logit2 is the log odds for Level 2 occurring vs. Level 3 occurring. There are only two generalized logits because the total probability must sum to 1. In the special case of a binary response, only one logit was necessary because p2 = 1 – p1. In this example, p3 = 1 – (p1 + p2).
The generalized logit model has a separate intercept and slope for each logit:
$ \text{Logit}_i = \beta_{0i} + \beta_{1i} x $.
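Because the generalized logits all share the last level as the reference, the k category probabilities can be recovered from the k – 1 fitted logits. The sketch below uses invented intercepts and slopes for a three-level nominal response, purely to show the inversion:

```python
import math

# Hypothetical (intercept, slope) pairs for a 3-level nominal response.
b = [(0.4, -0.8),   # Logit_1 = log(p1 / p3)
     (-0.2, 0.5)]   # Logit_2 = log(p2 / p3)

def nominal_probs(x: float):
    """Invert k-1 generalized logits into the k category probabilities."""
    e = [math.exp(b0 + b1 * x) for b0, b1 in b]  # exp(Logit_i) = p_i / p3
    p3 = 1.0 / (1.0 + sum(e))                    # probabilities must sum to 1
    return [v * p3 for v in e] + [p3]

probs = nominal_probs(1.0)
print(probs, sum(probs))
```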
Ordinal logistic regression
When the response is ordinal, using cumulative logits results in a model that contains fewer parameters than the corresponding nominal model.
The ordered nature of an ordinal response naturally lends itself to answering questions about cumulative outcomes such as, “What is the probability of achieving at least the third of five levels?” Because the logit transformation is equal to the log(odds), these outcomes can be represented by cumulative logits, which are based on cumulative probabilities. For example, the transformed response for the case of achieving at least the third level in five is:
$ \log \left( \frac{p_1 + p_2 + p_3}{1 - (p_1 + p_2 + p_3)} \right) = \log \left( \frac{p_1 + p_2 + p_3}{p_4 + p_5} \right) $.
When the response variable has three levels, the model computes two cumulative logit functions. If p1, p2, and p3 are the probabilities for the first, second, and third level, respectively, then there are two cumulative logits:
$ \begin{aligned} \text{Logit}_1 &= \log \left( \frac{p_1}{p_2 + p_3} \right) \\ \text{Logit}_2 &= \log \left( \frac{p_1 + p_2}{p_3} \right) \end{aligned} $.
Logit1 is the log odds for Level 1 occurring vs. Level 2 and Level 3 occurring. Logit2 is the log odds for Level 1 and Level 2 occurring vs. Level 3 occurring.
Using a model with cumulative logit functions assumes that a common slope parameter ($\beta_1$) is associated with the predictor variable, but each logit has its own intercept:
$ \text{Logit}_i = \beta_{0i} + \beta_{1} x $.
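The individual category probabilities fall out as differences of consecutive cumulative probabilities. The sketch below inverts the two cumulative logits for a three-level ordinal response using invented parameters (two intercepts, one common slope, per the proportional-odds constraint):

```python
import math

# Hypothetical parameters: intercepts must be increasing so the
# cumulative probabilities are properly ordered.
b0 = [-1.0, 0.5]   # intercepts for Logit_1 and Logit_2
b1 = 0.8           # common slope shared by both logits

def ordinal_probs(x: float):
    """Recover p1, p2, p3 from the two cumulative logits."""
    # cum[i] = P(level <= i+1), the inverse logit of b0[i] + b1*x
    cum = [1.0 / (1.0 + math.exp(-(a + b1 * x))) for a in b0]
    p1 = cum[0]
    p2 = cum[1] - cum[0]   # difference of consecutive cumulative probabilities
    p3 = 1.0 - cum[1]
    return p1, p2, p3

print(ordinal_probs(0.0))
```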
The proportional odds assumption means that the regression lines for the cumulative log odds are parallel, differing only in their intercepts. Equivalently, the odds ratios are constant (the effect of the predictor is the same) across the cumulative logits. If this assumption is not valid, you might want to consider fitting a nominal logistic regression model instead.