library(tidyverse)
library(tidymodels)
library(GGally)
library(knitr)
library(patchwork)
library(viridis)
library(ggfortify)
library(gridExtra)
Logistic Regression
BMLR Chapter 6
Setup
Learning goals
Identify Bernoulli and binomial random variables
Write GLM for binomial response variable
Interpret the coefficients for a logistic regression model
Visualizations for logistic regression
Interpret coefficients and results from an ordinal logistic regression model
Summarize GLMs for independent observations
Basics of logistic regression
Bernoulli + Binomial random variables (1/2)
Logistic regression is used to analyze data with two types of responses:
- Binary: These responses take on two values success \((Y = 1)\) or failure \((Y = 0)\), yes \((Y = 1)\) or no \((Y = 0)\), etc.
. . .
\[P(Y = y) = p^y(1-p)^{1-y} \hspace{10mm} y = 0, 1\]
Bernoulli + Binomial random variables (2/2)
Logistic regression is used to analyze data with two types of responses:
- Binomial: Number of successes in a Bernoulli process, \(n\) independent trials with a constant probability of success \(p\).
. . .
\[P(Y = y) = {n \choose y}p^{y}(1-p)^{n - y} \hspace{10mm} y = 0, 1, \ldots, n\]
. . .
In both instances, the goal is to model \(p\) the probability of success.
Binary vs. Binomial data
For each example, identify if the response is a Bernoulli or Binomial response
Use median age and unemployment rate in a county to predict the percent of Obama votes in the county in the 2008 presidential election.
Use GPA and MCAT scores to estimate the probability a student is accepted into medical school.
Use sex, age, and smoking history to estimate the probability an individual has lung cancer.
Use offensive and defensive statistics from the 2017-2018 NBA season to predict a team’s winning percentage.
Logistic regression model
\(\log\Big(\frac{p}{1-p}\Big) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_px_p\)
The response variable, \(\log\Big(\frac{p}{1-p}\Big)\), is the log(odds) of success, i.e. the logit
Use the model to calculate the probability of success \(\hat{p} = \frac{e^{\beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_px_p}}{1 + e^{\beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_px_p}}\)
When the response is a Bernoulli random variable, the probabilities can be used to classify each observation as a success or failure
Logistic vs linear regression model
set.seed(0)
<- tibble(x=runif(200, -5, 10),
dat p=exp(-2+1*x)/(1+exp(-2+1*x)),
y=rbinom(200, 1, p),
y2=.3408+.0901*x,
logit=log(p/(1-p)))
<- tibble(x = c(dat$x, dat$x),
dat2 y = c(dat$y2, dat$p),
`Regression model` = c(rep("linear", 200),
rep("logistic", 200)))
ggplot() +
geom_point(data = dat, aes(x, y)) +
geom_line(data = dat2, aes(x, y, linetype = `Regression model`,
color = `Regression model`)) +
labs(title = "Linear vs. logistic regression models for binary response data") +
scale_colour_manual(name = 'Regression model',
values = c('blue', 'red'),
labels = c('linear', 'logistic'), guide ='legend')
Logit link
Bernoulli and Binomial random variables can be written in one-parameter exponential family form, \(f(y;\theta) = e^{[a(y)b(\theta) + c(\theta) + d(y)]}\)
. . .
Bernoulli
\[f(y;p) = e^{y\log(\frac{p}{1-p}) + \log(1-p)}\]
. . .
Binomial
\[f(y;n,p) = e^{y\log(\frac{p}{1-p}) + n\log(1-p) + \log{n \choose y}}\]
. . .
They have the same canonical link \(b(p) = \log\big(\frac{p}{1-p}\big)\)
Assumptions for logistic regression
The following assumptions need to be satisfied to use logistic regression to make inferences
. . .
1️⃣ \(\hspace{0.5mm}\) Binary response: The response is dichotomous (has two possible outcomes) or is the sum of dichotomous responses
2️⃣ \(\hspace{0.5mm}\) Independence: The observations must be independent of one another
3️⃣ \(\hspace{0.5mm}\) Variance structure: Variance of a binomial random variable is \(np(1-p)\) \((n = 1 \text{ for Bernoulli})\) , so the variability is highest when \(p = 0.5\)
4️⃣ \(\hspace{0.5mm}\) Linearity: The log of the odds ratio, \(\log\big(\frac{p}{1-p}\big)\), must be a linear function of the predictors \(x_1, \ldots, x_p\)
COVID-19 infection prevention practices at food establishments
Researchers at Wollo Univeristy in Ethiopia conducted a study in July and August 2020 to understand factors associated with good COVID-19 infection prevention practices at food establishments. Their study is published in Andualem et al. (2022)
They were particularly interested in the understanding implementation of prevention practices at food establishments, given the workers’ increased risk due to daily contact with customers.
Andualem, A., Tegegne, B., Ademe, S., Natnael, T., Berihun, G., Abebe, M., … & Adane, M. (2022). COVID-19 infection prevention practices among a sample of food handlers of food and drink establishments in Ethiopia. PloS one, 17(1), e0259851.
The data
“An institution-based cross-sectional study was conducted among 422 food handlers in Dessie City and Kombolcha Town food and drink establishments in July and August 2020. The study participants were selected using a simple random sampling technique. Data were collected by trained data collectors using a pretested structured questionnaire and an on-the-spot observational checklist.”
Andualem, A., Tegegne, B., Ademe, S., Natnael, T., Berihun, G., Abebe, M., … & Adane, M. (2022). COVID-19 infection prevention practices among a sample of food handlers of food and drink establishments in Ethiopia. PloS one, 17(1), e0259851.
Response variable
“The outcome variable of this study was the good or poor practices of COVID-19 infection prevention among food handlers. Nine yes/no questions, one observational checklist and five multiple choice infection prevention practices questions were asked with a minimum score of 1 and maximum score of 25. Good infection prevention practice (the variable of interest) was determined for food handlers who scored 75% or above, whereas poor infection prevention practices refers to those food handlers who scored below 75% on the practice questions.”
Andualem, A., Tegegne, B., Ademe, S., Natnael, T., Berihun, G., Abebe, M., … & Adane, M. (2022). COVID-19 infection prevention practices among a sample of food handlers of food and drink establishments in Ethiopia. PloS one, 17(1), e0259851.
Results
Andualem, A., Tegegne, B., Ademe, S., Natnael, T., Berihun, G., Abebe, M., … & Adane, M. (2022). COVID-19 infection prevention practices among a sample of food handlers of food and drink establishments in Ethiopia. PloS one, 17(1), e0259851.
Interpreting the results
Is the response a Bernoulli or Binomial?
What is the strongest predictor of having good COVID-19 infection prevention practices?
- It’s often unreliable to look answer this question just based on the model output. Why are we able to answer this question based on the model output in this case?
Describe the effect (coefficient interpretation and inference) of having COVID-19 infection prevention policies available at the food establishment.
The intercept describes what group of food handlers?
Visualizations for logistic regression
Access to personal protective equipment
We will use the data from Andualem et al. (2022) to explore the association between age, sex, years of service, and whether someone works at a food establishment with access to personal protective equipment (PPE) as of August 2020. We will use access to PPE as a proxy for wearing PPE.
age | sex | years | ppe_access |
---|---|---|---|
34 | Male | 2 | 1 |
32 | Female | 3 | 1 |
32 | Female | 1 | 1 |
40 | Male | 4 | 1 |
32 | Male | 10 | 1 |
<- read_csv("data/covid-prevention-study.csv") |>
covid_df rename(age = "Age of food handlers",
years = "Years of service",
ppe_access = "Availability of PPEs") |>
mutate(sex = factor(if_else(Sex == 2, "Female", "Male"))) |>
select(age, sex, years, ppe_access)
|> slice(1:5) |> kable() covid_df
EDA for binary response (1/2)
library(Stat2Data)
par(mfrow = c(1, 2))
emplogitplot1(ppe_access ~ age, data = covid_df, ngroups = 10)
emplogitplot1(ppe_access ~ years, data = covid_df, ngroups = 5)
EDA for binary response (2/2)
library(viridis)
ggplot(data = covid_df, aes(x = sex, fill = factor(ppe_access))) +
geom_bar(position = "fill") +
labs(x = "Sex",
fill = "PPE Access",
title = "PPE Access by Sex") +
scale_fill_viridis_d()
Model results
<- glm(factor(ppe_access) ~ age + sex + years, data = covid_df,
ppe_model family = binomial)
tidy(ppe_model, conf.int = TRUE) |>
kable(digits = 3)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | -2.127 | 0.458 | -4.641 | 0.000 | -3.058 | -1.257 |
age | 0.056 | 0.017 | 3.210 | 0.001 | 0.023 | 0.091 |
sexMale | 0.341 | 0.224 | 1.524 | 0.128 | -0.098 | 0.780 |
years | 0.264 | 0.066 | 4.010 | 0.000 | 0.143 | 0.401 |
Visualizing coefficient estimates
<- tidy(ppe_model, exponentiate = TRUE, conf.int = TRUE) model_coef
ggplot(data = model_coef, aes(x = term, y = estimate)) +
geom_point() +
geom_hline(yintercept = 1, lty = 2) +
geom_pointrange(aes(ymin = conf.low, ymax = conf.high))+
labs(title = "Exponentiated model coefficients") +
coord_flip()
Logistic regression for binomial response variable
Data: Supporting railroads in the 1870s
The data set RR_Data_Hale.csv
contains information on support for referendums related to railroad subsidies for 11 communities in Alabama in the 1870s. The data were originally analyzed as part of a thesis project by a student at St. Olaf College. The variables in the data are
pctBlack
: percentage of Black residents in the countydistance
: distance the proposed railroad is from the community (in miles)YesVotes
: number of “yes” votes in favor of the proposed railroad lineNumVotes
: number of votes cast in the election
. . .
Primary question: Was voting on the railroad referendum related to the distance from the proposed railroad line, after adjusting for the racial composition of a community?
The data
<- read_csv("data/RR_Data_Hale.csv")
rr |> slice(1:5) |> kable() rr
County | popBlack | popWhite | popTotal | pctBlack | distance | YesVotes | NumVotes |
---|---|---|---|---|---|---|---|
Carthage | 841 | 599 | 1440 | 58.40 | 17 | 61 | 110 |
Cederville | 1774 | 146 | 1920 | 92.40 | 7 | 0 | 15 |
Five Mile Creek | 140 | 626 | 766 | 18.28 | 15 | 4 | 42 |
Greensboro | 1425 | 975 | 2400 | 59.38 | 0 | 1790 | 1804 |
Harrison | 443 | 355 | 798 | 55.51 | 7 | 0 | 15 |
EDA (1/3)
<- rr |>
rr mutate(pctYes = YesVotes/NumVotes,
emp_logit = log(pctYes / (1 - pctYes)))
|> head(5) rr
# A tibble: 5 × 10
County popBlack popWhite popTotal pctBlack distance YesVotes NumVotes pctYes emp_logit
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Carthage 841 599 1440 58.4 17 61 110 0.555 0.219
2 Cedervi… 1774 146 1920 92.4 7 0 15 0 -Inf
3 Five Mi… 140 626 766 18.3 15 4 42 0.0952 -2.25
4 Greensb… 1425 975 2400 59.4 0 1790 1804 0.992 4.85
5 Harrison 443 355 798 55.5 7 0 15 0 -Inf
EDA (2/3)
<- ggplot(data = rr, aes(x = distance, y = emp_logit)) +
p1 geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Distance to proposed railroad",
y = " ")
<- ggplot(data = rr, aes(x = pctBlack, y = emp_logit)) +
p2 geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "% Black residents",
y = "")
+ p2 + plot_annotation(title = "Log(odds yes vote) vs. predictor variables") p1
EDA (3/3)
<- rr |>
rr mutate(inFavor = if_else(pctYes > 0.5, "Yes", "No"))
ggplot(data = rr, aes(x = distance, y = pctBlack, color = inFavor)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, aes(lty = inFavor)) +
labs(x = "Distance to proposed railroad",
y = "% Black residents",
title = "% Black residents vs. distance",
subtitle = "Based on vote outcome") +
scale_color_viridis_d(end = 0.85)
. . .
Check for potential multicollinearity and interaction effect.
Model
Let \(p\) be the percent of yes votes in a county. We’ll start by fitting the following model:
\[\log\Big(\frac{p}{1-p}\Big) = \beta_0 + \beta_1 ~ dist + \beta_2 ~ pctBlack\]
. . .
Likelihood
\[\begin{aligned}L(p) &= \prod_{i=1}^{n} {m_i \choose y_i}p_i^{y_i}(1 - p_i)^{m_i - y_i} \\ &= \prod_{i=1}^{n} {m_i \choose y_i}\Big[\frac{e^{\beta_0 + \beta_1 ~ dist_i + \beta_2 ~ pctBlack_i}}{1 + e^{\beta_0 + \beta_1 ~ dist_i + \beta_2 ~ pctBlack_i}}\Big]^{y_i}\Big[\frac{1}{e^{\beta_0 + \beta_1 ~ dist_i + \beta_2 ~ pctBlack_i}}\Big]^{m_i - y_i} \\\end{aligned}\]
Use IWLS to find \(\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2\).
Model in R
<- glm(cbind(YesVotes, NumVotes - YesVotes) ~ distance + pctBlack,
rr_model data = rr, family = binomial)
tidy(rr_model, conf.int = TRUE) |>
kable(digits = 3)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 4.222 | 0.297 | 14.217 | 0.000 | 3.644 | 4.809 |
distance | -0.292 | 0.013 | -22.270 | 0.000 | -0.318 | -0.267 |
pctBlack | -0.013 | 0.004 | -3.394 | 0.001 | -0.021 | -0.006 |
. . .
\[\log\Big(\frac{\hat{p}}{1-\hat{p}}\Big) = 4.22 - 0.292 ~ dist - 0.013 ~ pctBlack\]
See Section 6.5 of Generalized Linear Models with Examples in R by Dunn and Smyth (available through Duke library) for details on estimating the standard errors.
Residuals
Similar to Poisson regression, there are two types of residuals: Pearson and deviance residuals
. . .
Pearson residuals
\[\text{Pearson residual}_i = \frac{\text{actual count} - \text{predicted count}}{\text{SD count}} = \frac{Y_i - m_i\hat{p}_i}{\sqrt{m_i\hat{p}_i(1 - \hat{p}_i)}}\]
. . .
Deviance residuals
\[d_i = \text{sign}(Y_i - m_i\hat{p}_i)\sqrt{2\Big[Y_i\log\Big(\frac{Y_i}{m_i\hat{p}_i}\Big) + (m_i - Y_i)\log\Big(\frac{m_i - Y_i}{m_i - m_i\hat{p}_i}\Big)\Big]}\]
Plot of deviance residuals
<- glm(cbind(YesVotes, NumVotes - YesVotes) ~ distance + pctBlack +
rr_int_model *pctBlack,
distancedata = rr, family = binomial)
<- augment(rr_int_model, type.predict = "response",
rr_int_aug type.residuals = "deviance")
|> slice(1:5) |> kable() rr_int_aug
.rownames | cbind(YesVotes, NumVotes - YesVotes) | distance | pctBlack | .fitted | .resid | .hat | .sigma | .cooksd | .std.resid |
---|---|---|---|---|---|---|---|---|---|
1 | 61 | 49 | 17 | 58.40 | 0.2075801 | 7.964510 | 0.4663943 | 5.0884957 | 32.9667672 |
2 | 0 | 15 | 7 | 92.40 | 0.6776101 | -5.827504 | 0.0492925 | 6.3049338 | 0.4298496 |
3 | 4 | 38 | 15 | 18.28 | 0.2024659 | -1.885115 | 0.6433983 | 6.6366201 | 3.7828247 |
4 | 1790 | 14 | 0 | 59.38 | 0.9760416 | 5.230746 | 0.8996698 | 0.5044966 | 452.2582760 |
5 | 0 | 15 | 7 | 55.51 | 0.8513123 | -7.561561 | 0.0240118 | 5.9951340 | 0.5412285 |
ggplot(data = rr_int_aug, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, color = "red") +
labs(x = "Fitted values",
y = "Deviance residuals",
title = "Deviance residuals vs. fitted",
subtitle = "for model with interaction term")
Goodness of fit
Similar to Poisson regression, the sum of the squared deviance residuals is used to assess goodness of fit.
\[\begin{aligned} &H_0: \text{ Model is a good fit} \\ &H_a: \text{ Model is not a good fit}\end{aligned}\]
- When \(m_i\) is large and the model is a good fit \((H_0 \text{ true})\) the residual deviance follows a \(\chi^2\) distribution with \(n - p\) degrees of freedom.
- Recall \(n - p\) is the residual degrees of freedom.
- If the model fits, we expect the residual deviance to be approximately what value?
Overdispersion
Adjusting for overdispersion (1/2)
Overdispersion occurs when there is extra-binomial variation, i.e. the variance is greater than what we would expect, \(np(1-p)\).
Similar to Poisson regression, we can adjust for overdispersion in the binomial regression model by using a dispersion parameter \[\hat{\phi} = \sum \frac{(\text{Pearson residuals})^2}{n-p}\]
- By multiplying by \(\hat{\phi}\), we are accounting for the reduction in information we would expect from independent observations.
Adjusting for overdispersion (2/2)
- We adjust for overdispersion using a quasibinomial model.
- “Quasi” reflects the fact we are no longer using a binomial model with true likelihood.
- The standard errors of the coefficients are \(SE_{Q}(\hat{\beta}_j) = \sqrt{\hat{\phi}} SE(\hat{\beta})\)
- Inference is done using the \(t\) distribution to account for extra variability
Predicting emergency department wait and treatment times
Predicting ED wait and treatment times
Ataman and Sariyer (2021) use ordinal logistic regression to predict patient wait and treatment times in an emergency department (ED). The goal is to identify relevant factors that can be used to inform recommendations for reducing wait and treatment times, thus improving the quality of care in the ED.
Data: Daily records for ED arrivals in August 2018 at a public hospital in Izmir, Turkey.
Response variable: Wait time, a categorical variable with three levels: - Patients who wait less than 10 minutes - Patients whose waiting time is in the range of 10-60 minutes - Patients who wait more than 60 minutes
Ataman, M. G., & Sarıyer, G. (2021). Predicting waiting and treatment times in emergency departments using ordinal logistic regression models. The American Journal of Emergency Medicine, 46, 45-50.
Ordinal logistic regression
Let \(Y\) be an ordinal response variable that takes levels \(1, 2, \ldots, J\) with associated probabilities \(p_1, p_2, \ldots, p_J\).
The proportional odds model can be written as the following:
\[\begin{aligned}&\log\Big(\frac{P(Y\leq 1)}{P(Y > 1)}\Big) = \beta_{01} + \beta_1x_1 + \dots + \beta_px_p \\ & \log\Big(\frac{P(Y\leq 2)}{P(Y > 2)}\Big) = \beta_{02} + \beta_1x_1 + \dots + \beta_px_p \\ & \dots \\ & \log\Big(\frac{P(Y\leq J-1)}{P(Y > J-1)}\Big) = \beta_{0{J-1}} + \beta_1x_1 + \dots + \beta_px_p\end{aligned}\]
Questions
- How is the proportional odds model similar to the multinomial logistic model?
- How is it different?
Effect of arrival mode
. . .
Question
The variable arrival mode
takes two categories: ambulance and walk-in. Describe the effect of arrival mode in this model. Note that the baseline level is “walk-in”.
Effect of triage level
. . .
Consider the full output with the ordinal logistic models for wait and treatment times.
- Question
. . .
Use the results from both models to describe the effect of triage level (red = urgent, green = non-urgent) on the wait and treatment times in the ED. Note that “red” is the baseline level. Is this what you expected?
Wrap up GLM for independent observations
Wrap up
Covered fitting, interpreting, and drawing conclusions from GLMs
- Looked at Poisson, Negative Binomial, and Logistic (binary, binomial, ordinal) in detail
Used Pearson and deviance residuals to assess model fit and determine if new variables should be added to the model
Addressed issues of overdispersion and zero-inflation
Used the properties of the one-parameter exponential family to identify the best link function for any GLM
. . .
Everything we’ve done thus far as been under the assumption that the observations are independent. Looking ahead we will consider models for data with dependent (correlated) observations.
Acknowledgements
These slides are based on content in
Acknowledgements
These slides are based on content in
Initial versions of the slides are by Dr. Maria Tackett, Duke University