Poisson Regression

BMLR Chapter 4

Tyler George

Cornell College
STA 363 Spring 2025 Block 8

Setup

library(tidyverse)
library(tidymodels)
library(GGally)
library(knitr)
library(patchwork)
library(viridis)
library(ggfortify)
library(gridExtra)

Learning goals (1/2)

  • Describe properties of the Poisson random variable

  • Write the Poisson regression model

  • Describe how the Poisson regression differs from least-squares regression

  • Interpret the coefficients for the Poisson regression model

  • Compare two Poisson regression models

  • Define and calculate residuals for the Poisson regression model

  • Use goodness-of-fit to assess model fit

Learning goals (2/2)

  • Identify overdispersion

  • Apply modeling approaches to deal with overdispersion

  • Explore properties of negative binomial versus Poisson response

  • Fit and interpret models with an offset to adjust for differences in sampling effort

  • Fit and interpret Zero-inflated Poisson models

  • Write likelihood for Poisson and Zero-inflated Poisson model

Scenarios to use Poisson regression

  • Does the number of employers conducting on-campus interviews during a year differ for public and private colleges?

  • Does the daily number of asthma-related visits to an Emergency Room differ depending on air pollution indices?

  • Does the number of paint defects per square foot of wall differ based on the years of experience of the painter?

Each response variable is a count per unit of time or space.

Poisson distribution

Let \(Y\) be the number of events in a given unit of time or space. Then \(Y\) can be modeled using a Poisson distribution with mean \(\lambda\)

\[P(Y=y) = \frac{e^{-\lambda}\lambda^y}{y!} \hspace{10mm} y=0,1,2,\ldots, \infty\]

Poisson Features

  • \(E(Y) = Var(Y) = \lambda\)
  • The distribution is typically skewed right, particularly if \(\lambda\) is small
  • The distribution becomes more symmetric as \(\lambda\) increases
    • If \(\lambda\) is sufficiently large, it can be approximated using a normal distribution

Poisson Graphs

Histograms of simulated Poisson samples with \(\lambda = 1\), \(5\), and \(50\); their simulated means and variances:

             Mean       Variance
lambda = 1    0.99351    0.9902178
lambda = 5    4.99367    4.9865798
lambda = 50  49.99288   49.8962683
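One way values like these might be simulated (a sketch; 100,000 draws per \(\lambda\) is an assumed sample size):

set.seed(1)
sapply(c(1, 5, 50), function(l) {
  y <- rpois(100000, l)            # simulate draws from Poisson(lambda = l)
  c(mean = mean(y), var = var(y))  # mean and variance should be close to l
})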

Example

The annual number of earthquakes registering at least 2.5 on the Richter Scale and having an epicenter within 40 miles of downtown Memphis follows a Poisson distribution with mean 6.5. What is the probability there will be 3 or fewer such earthquakes next year?

\[P(Y \leq 3) = P(Y = 0) + P(Y = 1) + P(Y = 2) + P(Y = 3)\]

\[ = \frac{e^{-6.5}6.5^0}{0!} + \frac{e^{-6.5}6.5^1}{1!} + \frac{e^{-6.5}6.5^2}{2!} + \frac{e^{-6.5}6.5^3}{3!}\]

\[ = 0.112\]

ppois(3, 6.5)
[1] 0.1118496
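Equivalently, we can sum the Poisson pmf directly with dpois():

sum(dpois(0:3, 6.5))
[1] 0.1118496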

Poisson regression


The data: Household size in the Philippines


The data fHH1.csv come from the 2015 Family Income and Expenditure Survey conducted by the Philippine Statistics Authority.

Goal: Understand the association between household size and various characteristics of the household

Response:

  • total: Number of people in the household other than the head

Predictors:

  • location: Where the house is located
  • age: Age of the head of household
  • roof: Type of roof on the residence (proxy for wealth)

Other variables:

  • numLT5: Number in the household under 5 years old

The data

hh_data <- read_csv("data/fHH1.csv")
hh_data |> slice(1:5) |> kable()
location age total numLT5 roof
CentralLuzon 65 0 0 Predominantly Strong Material
MetroManila 75 3 0 Predominantly Strong Material
DavaoRegion 54 4 0 Predominantly Strong Material
Visayas 49 3 0 Predominantly Strong Material
MetroManila 74 3 0 Predominantly Strong Material

Response variable

mean var
3.685 5.534

Why the least-squares model doesn’t work

The goal is to model \(\lambda\), the expected number of people in the household (other than the head), as a function of the predictors (covariates)

We might be tempted to try a linear model \[\lambda_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_px_{ip}\]

This model won’t work because…

  • It could produce negative values of \(\lambda\) for certain values of the predictors
  • The equal variance assumption required to conduct inference for linear regression is violated.

Poisson regression model

If \(Y_i \sim Poisson\) with \(\lambda = \lambda_i\) for the given values \(x_{i1}, \ldots, x_{ip}\), then

\[\log(\lambda_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}\]

  • Each observation can have a different value of \(\lambda\) based on its value of the predictors \(x_1, \ldots, x_p\)

  • \(\lambda\) determines the mean and variance, so we don’t need to estimate a separate error term

Poisson vs. multiple linear regression

Regression models: Linear regression (left) and Poisson regression (right).

Assumptions for Poisson regression

Poisson response: The response variable is a count per unit of time or space, described by a Poisson distribution, at each level of the predictor(s)

Independence: The observations must be independent of one another

Mean = Variance: The mean must equal the variance

Linearity: The log of the mean rate, \(\log(\lambda)\), must be a linear function of the predictor(s)

Model 1: Number in household vs. age

model1 <- glm(total ~ age, data = hh_data, family = poisson)

tidy(model1) |> 
  kable(digits = 4)
term estimate std.error statistic p.value
(Intercept) 1.5499 0.0503 30.8290 0
age -0.0047 0.0009 -5.0258 0

\[\log(\hat{\lambda}) = 1.5499 - 0.0047 ~ age\]

Question: The coefficient for age is -0.0047. Interpret this coefficient in context.

Answers

  • Each additional year older the head of household is, the estimated average log of the number of people in the household is 0.0047 lower.

  • Each additional year older the head of household is, the estimated average number of people in the household reduces by about 0.5%.

  • Each additional year older the head of household is, the estimated average number of people in the household changes by a factor of 0.995.

  • For every 10 years older the head of household is, the estimated average number of people in the household reduces by about 5%.

Understanding the interpretation

Let’s derive the change in predicted mean when we go from \(x\) to \(x+1\)

(see boardwork)
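A sketch of the derivation: holding all other predictors fixed, compare the predicted mean at \(x + 1\) and at \(x\)

\[\frac{\lambda_{x+1}}{\lambda_x} = \frac{e^{\beta_0 + \beta_1(x+1)}}{e^{\beta_0 + \beta_1 x}} = e^{\beta_1}\]

so a one-year increase in age multiplies the estimated mean household size by \(e^{-0.0047} \approx 0.995\), consistent with the interpretations above.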

Is the coefficient of age statistically significant?

term estimate std.error statistic p.value conf.low conf.high
(Intercept) 1.5499 0.0503 30.8290 0 1.4512 1.6482
age -0.0047 0.0009 -5.0258 0 -0.0065 -0.0029

\[H_0: \beta_1 = 0 \hspace{2mm} \text{ vs. } \hspace{2mm} H_a: \beta_1 \neq 0\]

Test statistic

\[Z = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} = \frac{-0.0047 - 0}{0.0009} = -5.026 \text{ (using exact values)}\]

P-value

\[P(|Z| > |-5.026|) = 5.01 \times 10^{-7} \approx 0\]
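This p-value can be computed from the standard normal distribution (a sketch):

2 * pnorm(abs(-5.0258), lower.tail = FALSE)
# two-sided tail probability, approximately 5.01e-07 as above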

What are plausible values for the coefficient of age?

term estimate std.error statistic p.value conf.low conf.high
(Intercept) 1.5499 0.0503 30.8290 0 1.4512 1.6482
age -0.0047 0.0009 -5.0258 0 -0.0065 -0.0029

95% confidence interval for the coefficient of age

\[\hat{\beta}_1 \pm Z^{*}\times SE(\hat{\beta}_1)\] \[-0.0047 \pm 1.96 \times 0.0009 = \mathbf{(-0.0065, -0.0029)}\]

Question: Interpret the interval in terms of the change in mean household size.
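A sketch of putting the interval on the multiplicative scale by exponentiating the endpoints:

exp(c(-0.0065, -0.0029))
# approximately (0.9935, 0.9971): each additional year is associated with
# a 0.3% to 0.6% decrease in the estimated mean household size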

Which can best help us determine whether Model 1 is a good fit?

p1 <- ggplot(data = hh_data, aes(x = age, y = total)) + 
  geom_point() + 
  labs(y = "Total household size", 
       title = "Plot A")

p2 <- hh_data |>
  group_by(age) |> 
  summarise(mean = mean(total)) |>
  ggplot(aes(x = age, y = mean))+ 
  geom_point() + 
  labs(y = "Empirical mean household size", 
       title = "Plot B")

p3 <- hh_data |>
  group_by(age) |> 
  summarise(log_mean = log(mean(total))) |>
  ggplot(aes(x = age, y = log_mean)) + 
  geom_point() + 
  labs(y = "Log empirical mean household size", 
       title = "Plot C")

p1 + p2 + p3 + plot_annotation(tag_levels = 'A')

Model 2: Add a quadratic effect for age

hh_data <- hh_data |> 
  mutate(age2 = age*age)

model2 <- glm(total ~ age + age2, data = hh_data, family = poisson)
tidy(model2, conf.int = TRUE) |> 
  kable(digits = 4)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -0.3325 0.1788 -1.8594 0.063 -0.6863 0.0148
age 0.0709 0.0069 10.2877 0.000 0.0575 0.0845
age2 -0.0007 0.0001 -11.0578 0.000 -0.0008 -0.0006

Model 2: Add a quadratic effect for age

term estimate std.error statistic p.value conf.low conf.high
(Intercept) -0.3325 0.1788 -1.8594 0.063 -0.6863 0.0148
age 0.0709 0.0069 10.2877 0.000 0.0575 0.0845
age2 -0.0007 0.0001 -11.0578 0.000 -0.0008 -0.0006

We can determine whether to keep \(age^2\) in the model in two ways:

1️⃣ Use the p-value (or confidence interval) for the coefficient (since we are adding a single term to the model)

2️⃣ Conduct a drop-in-deviance test

Deviance

A deviance is a way to measure how the observed data deviates from the model predictions.

  • It’s a measure of unexplained variability in the response variable (similar to SSE in linear regression)

  • Lower deviance means the model is a better fit to the data

We can calculate the “deviance residual” for each observation in the data (more on the formula later). Let \(\text{deviance residual}_i\) be the deviance residual for the \(i^{th}\) observation, then

\[\text{deviance} = \sum_{i=1}^{n}(\text{deviance residual}_i)^2\]

Note: Deviance is also known as the “residual deviance”

Drop-in-Deviance Test

We can use a drop-in-deviance test to compare two models. To conduct the test

1️⃣ Compute the deviance for each model

2️⃣ Calculate the drop in deviance

\[\text{drop-in-deviance} = \text{Deviance(reduced model)} - \text{Deviance(larger model)}\]

3️⃣ Given the reduced model is the true model \((H_0 \text{ true})\), then \[\text{drop-in-deviance} \sim \chi_d^2\]

where \(d\) is the difference in degrees of freedom between the two models (i.e., the difference in number of terms)

Drop-in-deviance - Model1 and Model2

anova(model1, model2, test = "Chisq") |>
  kable(digits = 3)
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1498 2337.089 NA NA NA
1497 2200.944 1 136.145 0
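As a check, the drop in deviance and its p-value can be computed by hand (a sketch):

model1$deviance - model2$deviance            # 2337.089 - 2200.944 = 136.145
pchisq(136.145, df = 1, lower.tail = FALSE)  # essentially 0, as reported above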

Questions:

  • Write the null and alternative hypotheses.
  • What does the value 2337.089 tell you?
  • What does the value 1 tell you?
  • What is your conclusion?

Add location to the model?

Suppose we want to add location to the model, so we compare the following models:

Model A: \(\log(\lambda_i) = \beta_0 + \beta_1 ~ age_i + \beta_2 ~ age_i^2\)

Model B: \(\log(\lambda_i) = \beta_0 + \beta_1 ~ age_i + \beta_2 ~ age_i^2 + \beta_3 ~ Loc1_i + \beta_4 ~ Loc2_i + \beta_5 ~ Loc3_i + \beta_6 ~ Loc4_i\)

Question

Which of the following are reliable ways to determine if location should be added to the model?

  • Drop-in-deviance test
  • Use the p-value for each coefficient
  • Likelihood ratio test
  • Nested F Test
  • BIC

Add location to the model?

model3 <- glm(total ~ age + age2 + location, data = hh_data, family = poisson)

Use a drop-in-deviance test to determine if Model 2 or Model 3 (with location) is a better fit for the data.

anova(model2, model3, test = "Chisq") |>
  kable(digits = 3)
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1497 2200.944 NA NA NA
1493 2187.800 4 13.144 0.011

The p-value is small (0.011 < 0.05), so we conclude that Model 3 is a better fit for the data.

Model 3

term estimate std.error statistic p.value conf.low conf.high
(Intercept) -0.3843 0.1821 -2.1107 0.0348 -0.7444 -0.0306
age 0.0704 0.0069 10.1900 0.0000 0.0569 0.0840
age2 -0.0007 0.0001 -10.9437 0.0000 -0.0008 -0.0006
locationDavaoRegion -0.0194 0.0538 -0.3605 0.7185 -0.1250 0.0859
locationIlocosRegion 0.0610 0.0527 1.1580 0.2468 -0.0423 0.1641
locationMetroManila 0.0545 0.0472 1.1542 0.2484 -0.0378 0.1473
locationVisayas 0.1121 0.0417 2.6853 0.0072 0.0308 0.1945

Does this model sufficiently explain the variability in the mean household size?

Goodness-of-fit

Pearson residuals

We can calculate two types of residuals for Poisson regression: Pearson residuals and deviance residuals

\[\text{Pearson residual}_i = \frac{\text{observed} - \text{predicted}}{\text{std. error}} = \frac{y_i - \hat{\lambda}_i}{\sqrt{\hat{\lambda}_i}}\]

  • Similar interpretation as standardized residuals from linear regression

  • Expect most to fall between -2 and 2

  • Used to calculate overdispersion parameter

Deviance residuals

The deviance residual indicates how much the observed data deviates from the fitted model

\[\text{deviance residual}_i = \text{sign}(y_i - \hat{\lambda}_i)\sqrt{2\Bigg[y_i\log\bigg(\frac{y_i}{\hat{\lambda}_i}\bigg) - (y_i - \hat{\lambda}_i)\Bigg]}\]

where

\[\text{sign}(y_i - \hat{\lambda}_i) = \begin{cases} 1 & \text{ if }(y_i - \hat{\lambda}_i) > 0 \\ -1 & \text{ if }(y_i - \hat{\lambda}_i) < 0 \\ 0 & \text{ if }(y_i - \hat{\lambda}_i) = 0 \end{cases}\]
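In R, both types of residuals are available from residuals() on a fitted glm (a sketch using Model 3, fit earlier):

head(residuals(model3, type = "pearson"))
head(residuals(model3, type = "deviance"))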

Model 3: Residual plots

model3_aug_pearson <- augment(model3, type.residuals = "pearson") 
model3_aug_deviance <- augment(model3, type.residuals = "deviance")

p1 <- ggplot(data = model3_aug_pearson, aes(x = .fitted, y = .resid)) + 
  geom_point()  + 
  geom_smooth() + 
  labs(x = "Fitted values", 
       y = "Pearson residuals", 
       title = "Pearson residuals vs. fitted")

p2 <-  ggplot(data = model3_aug_deviance, aes(x = .fitted, y = .resid)) + 
  geom_point()  + 
  geom_smooth() + 
  labs(x = "Fitted values", 
       y = "Deviance residuals", 
       title = "Deviance residuals vs. fitted")

p1 + p2

Goodness-of-fit

  • Goal: Use the (residual) deviance to assess how much the predicted values differ from the observed values. Recall \((\text{deviance}) = \sum_{i=1}^{n}(\text{deviance residual})_i^2\)

  • If the model sufficiently fits the data, then

\[\text{deviance} \sim \chi^2_{df}\]

where \(df\) is the model’s residual degrees of freedom

  • Question: What is the probability of observing a deviance larger than the one we’ve observed, given this model sufficiently fits the data?

\[P(\chi^2_{df} > \text{ deviance})\]

Model 3: Goodness-of-fit calculations

model3$deviance
[1] 2187.8
model3$df.residual
[1] 1493
pchisq(model3$deviance, model3$df.residual, lower.tail = FALSE)
[1] 3.153732e-29

The probability of observing a deviance greater than 2187.8 is \(\approx 0\), so there is significant evidence of lack-of-fit.

Lack-of-fit

There are a few potential reasons for lack-of-fit:

  • Missing important interactions or higher-order terms

  • Missing important variables (perhaps this means a more comprehensive data set is required)

  • There could be extreme observations causing the deviance to be larger than expected (assess based on the residual plots)

  • There could be a problem with the Poisson model

    • May need more flexibility in the model to handle overdispersion

Overdispersion

Overdispersion: There is more variability in the response than what is implied by the Poisson model

Overall

hh_data |>
  summarise(mean = mean(total), var = var(total)) |>
  kable(digits = 3)

mean var
3.685 5.534

by Location

hh_data |>
  group_by(location) |>
  summarise(mean = mean(total), var = var(total)) |>
  kable(digits = 3)

location mean var
CentralLuzon 3.402 4.152
DavaoRegion 3.390 4.723
IlocosRegion 3.586 5.402
MetroManila 3.707 4.863
Visayas 3.902 6.602

Why overdispersion matters

If there is overdispersion, then there is more variation in the response than what’s implied by a Poisson model. This means

❌ The standard errors of the model coefficients are artificially small

❌ The p-values are artificially small

❌ This could lead to models that are more complex than what is needed

We can take overdispersion into account by

  • inflating standard errors by multiplying them by a dispersion factor
  • using a negative-binomial regression model

Quasi-Poisson

Dispersion parameter

The dispersion parameter is represented by \(\phi\)

\[\hat{\phi} = \frac{\sum_{i=1}^{n}(\text{Pearson residual}_i)^2}{n - p} \approx \frac{\text{deviance}}{\text{residual df}}\]

where \(p\) is the number of terms in the model (including the intercept)

  • If there is no overdispersion \(\hat{\phi} = 1\)

  • If there is overdispersion \(\hat{\phi} > 1\)
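A sketch of estimating \(\hat{\phi}\) for Model 3 by hand:

sum(residuals(model3, type = "pearson")^2) / model3$df.residual
# a value noticeably greater than 1 suggests overdispersion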

Accounting for dispersion in the model

We inflate the standard errors of the coefficients by multiplying the variance by \(\hat{\phi}\)

\[SE_{Q}(\hat{\beta}) = \sqrt{\hat{\phi}} \times SE(\hat{\beta})\]

  • “Q” stands for quasi-Poisson, since this is an ad-hoc solution
  • The process for model building and model comparison is called quasilikelihood (similar to likelihood, but without exact underlying distributions)

Model 3: Quasi-Poisson model

model3_q <- glm(total ~ age + age2 + location, data = hh_data, 
                family = quasipoisson)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -0.3843 0.2166 -1.7744 0.0762 -0.8134 0.0358
age 0.0704 0.0082 8.5665 0.0000 0.0544 0.0866
age2 -0.0007 0.0001 -9.2000 0.0000 -0.0009 -0.0006
locationDavaoRegion -0.0194 0.0640 -0.3030 0.7619 -0.1451 0.1058
locationIlocosRegion 0.0610 0.0626 0.9735 0.3304 -0.0620 0.1837
locationMetroManila 0.0545 0.0561 0.9703 0.3320 -0.0552 0.1649
locationVisayas 0.1121 0.0497 2.2574 0.0241 0.0156 0.2103

Poisson vs. Q-Poisson

Poisson

term estimate std.error
(Intercept) -0.3843 0.1821
age 0.0704 0.0069
age2 -0.0007 0.0001
locationDavaoRegion -0.0194 0.0538
locationIlocosRegion 0.0610 0.0527
locationMetroManila 0.0545 0.0472
locationVisayas 0.1121 0.0417

Quasi-Poisson

term estimate std.error
(Intercept) -0.3843 0.2166
age 0.0704 0.0082
age2 -0.0007 0.0001
locationDavaoRegion -0.0194 0.0640
locationIlocosRegion 0.0610 0.0626
locationMetroManila 0.0545 0.0561
locationVisayas 0.1121 0.0497

Q-Poisson: Inference for coefficients

term estimate std.error
(Intercept) -0.3843 0.2166
age 0.0704 0.0082
age2 -0.0007 0.0001
locationDavaoRegion -0.0194 0.0640
locationIlocosRegion 0.0610 0.0626
locationMetroManila 0.0545 0.0561
locationVisayas 0.1121 0.0497

Test statistic \[t = \frac{\hat{\beta} - 0}{SE_{Q}(\hat{\beta})} \sim t_{n-p}\]

Negative binomial regression model


Another approach to handle overdispersion is to use a negative binomial regression model

  • This has more flexibility than the quasi-Poisson model, because there is a new parameter in addition to \(\lambda\)


Let \(Y\) be a negative binomial random variable, \(Y\sim NegBinom(r, p)\), then

\[P(Y = y_i) = {y_i + r - 1 \choose r - 1}(1-p)^{y_i}p^r \hspace{5mm} y_i = 0, 1, 2, \ldots, \infty\]

Negative binomial regression model

  • Main idea: Generate a \(\lambda\) for each observation (household) and generate a count using the Poisson random variable with parameter \(\lambda\)
    • Makes the counts more dispersed than with a single parameter
  • Think of it as a Poisson model such that \(\lambda\) is also random

\(\begin{aligned} &\text{If }Y|\lambda \sim Poisson(\lambda)\\ &\text{ and } \lambda \sim Gamma\bigg(r, \frac{1-p}{p}\bigg)\\ &\text{ then } Y \sim NegBinom(r, p)\end{aligned}\)
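We can check this mixture result by simulation. A minimal sketch, assuming \(r = 3\) and \(p = 0.4\) (arbitrary values) and reading the second Gamma parameter as the scale \((1-p)/p\):

set.seed(363)
r <- 3; p <- 0.4
lambda <- rgamma(100000, shape = r, scale = (1 - p) / p)  # lambda varies by observation
y <- rpois(100000, lambda)                                # Poisson count given lambda
c(mean = mean(y), var = var(y))    # variance exceeds the mean (overdispersion)
mean(y == 2)                       # empirical P(Y = 2)...
dnbinom(2, size = r, prob = p)     # ...matches the NegBinom(r, p) pmf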

Negative binomial regression in R

library(MASS)  # provides glm.nb(); note MASS masks dplyr::select()
model3_nb <- glm.nb(total ~ age + age2 + location, data = hh_data)
tidy(model3_nb) |> 
  kable(digits = 4)
term estimate std.error statistic p.value
(Intercept) -0.3753 0.2076 -1.8081 0.0706
age 0.0699 0.0079 8.8981 0.0000
age2 -0.0007 0.0001 -9.5756 0.0000
locationDavaoRegion -0.0219 0.0625 -0.3501 0.7262
locationIlocosRegion 0.0577 0.0615 0.9391 0.3477
locationMetroManila 0.0562 0.0551 1.0213 0.3071
locationVisayas 0.1104 0.0487 2.2654 0.0235

Offsets

Data: Airbnbs in NYC

The data set NYCairbnb-sample.csv contains information about a random sample of 1000 Airbnbs in New York City. It is a subset of the data on 40628 Airbnbs scraped by Awad et al. (2017).

Variables

  • number_of_reviews: Number of reviews for the unit on Airbnb (proxy for number of rentals)
  • price: price per night in US dollars
  • room_type: Entire home/apartment, private room, or shared room
  • days: Number of days the unit has been listed (date when info scraped - date when unit first listed on Airbnb)

Data: Airbnbs in NYC

# dplyr::select is written explicitly because MASS (loaded earlier) masks select()
airbnb <- read_csv("data/NYCairbnb-sample.csv") |>
  dplyr::select(id, number_of_reviews, days, room_type, price)
id number_of_reviews days room_type price
15756544 16 1144 Private room 120
14218251 15 471 Private room 89
21644 0 2600 Private room 89
13667835 1 283 Entire home/apt 150
265912 0 1970 Entire home/apt 89

Goal: Use the price and room type of Airbnbs to describe variation in the number of reviews (a proxy for number of rentals).

EDA

p1 <- ggplot(data = airbnb, aes(x = number_of_reviews)) + 
  geom_histogram() + 
   labs(x = "Number of reviews",
    title = "Distribution of number of reviews")

p2 <- airbnb |>
  filter(price <= 2000) |>
  group_by(price) |>
  summarise(log_mean = log(mean(number_of_reviews))) |>
  ggplot(aes(x = price, y = log_mean)) + 
  geom_point(alpha= 0.7) + 
  geom_smooth() + 
  labs(x = "Price in  US dollars",
    y = "Log(mean # reviews)", 
    title = "Log mean #  of reviews vs. price", 
    subtitle = "Airbnbs $2000 or less")

p3 <- airbnb |>
  filter(price <= 500) |>
  group_by(price) |>
  summarise(log_mean = log(mean(number_of_reviews))) |>
  ggplot(aes(x = price, y = log_mean)) + 
  geom_point(alpha= 0.7) + 
  geom_smooth() + 
  labs(x = "Price in  US dollars",
    y = "Log(mean # reviews)", 
    title = "Log mean # of reviews vs. price", 
    subtitle = "Airbnbs $500 or less")

p1  / (p2 + p3) 

EDA

Overall

mean var
15.916 765.969

by Room type

room_type mean var
Entire home/apt 16.283 760.348
Private room 15.608 786.399
Shared room 15.028 605.971

Considerations for modeling

We would like to fit the Poisson regression model

\[\log(\lambda) = \beta_0 + \beta_1 ~ price + \beta_2 ~ room\_type1 + \beta_3 ~ room\_type2\]

Questions:

  • Based on the EDA, what are some potential issues we may want to address in the model building?

  • Suppose any model fit issues are addressed. What are some potential limitations to the conclusions and interpretations from the model?

Offset

  • Sometimes counts are not directly comparable because the observations differ based on some characteristic directly related to the counts, i.e. the sampling effort.

  • An offset can be used to adjust for differences in sampling effort.

  • Let \(x_{offset}\) be the variable that accounts for differences in sampling effort, then \(\log(x_{offset})\) will be added to the model.

\(\log(\lambda_i) = \beta_0 + \beta_1 ~ x_{i1} + \beta_2 ~ x_{i2} + ... + \beta_p ~ x_{ip} + \log(x_{offset_i})\)

  • The offset is a term in the model with coefficient always equal to 1.

Adding an offset to the Airbnb model

We will add the offset \(\log(days)\) to the model. This accounts for the fact that we would expect Airbnbs that have been listed longer to have more reviews.

\[\log(\lambda_i) = \beta_0 + \beta_1 ~ price_i + \beta_2 ~ room\_type1_i + \beta_3 ~ room\_type2_i + \log(days_i)\]

Note: The response variable for the model is still \(\log(\lambda_i)\), the log mean number of reviews

Detail on the offset

We want to adjust for the number of days, so we are interested in \(\frac{reviews}{days}\).

Given \(\lambda\) is the mean number of reviews

\[\log\Big(\frac{\lambda}{days}\Big) = \beta_0 + \beta_1 ~ price + \beta_2 ~ room\_type1 + \beta_3 ~ room\_type2\]

\[\Rightarrow \log({\lambda}) - \log(days) = \beta_0 + \beta_1 ~ price + \beta_2 ~ room\_type1 + \beta_3 ~ room\_type2\]

\[\Rightarrow \log({\lambda}) = \beta_0 + \beta_1 ~ price + \beta_2 ~ room\_type1 + \beta_3 ~ room\_type2 + \log(days)\]

Airbnb model in R

airbnb_model <- glm(number_of_reviews ~ price + room_type, 
                    data = airbnb, family = poisson, 
                    offset = log(days))
term estimate std.error statistic p.value
(Intercept) -4.1351 0.0170 -243.1397 0
price -0.0005 0.0001 -7.0952 0
room_typePrivate room -0.0994 0.0174 -5.6986 0
room_typeShared room 0.2436 0.0452 5.3841 0

The coefficient for \(\log(days)\) is fixed at 1, so it is not in the model output.

Interpretations

term estimate std.error statistic p.value
(Intercept) -4.1351 0.0170 -243.1397 0
price -0.0005 0.0001 -7.0952 0
room_typePrivate room -0.0994 0.0174 -5.6986 0
room_typeShared room 0.2436 0.0452 5.3841 0


Question:

  • Interpret the coefficient of price.
  • Interpret the coefficient of room_typePrivate room.

Goodness-of-fit

\[\begin{aligned}&H_0: \text{ The model is a good fit for the data}\\ &H_a: \text{ There is significant lack of fit}\end{aligned}\]

pchisq(airbnb_model$deviance, airbnb_model$df.residual, lower.tail = FALSE)
[1] 0

There is evidence of significant lack of fit in the model. Therefore, more models would need to be explored that address the issues mentioned earlier.

In practice we would assess goodness-of-fit and finalize the model before any interpretations and conclusions.

Zero-inflated Poisson model

Data: Weekend drinking

The data weekend-drinks.csv contains information from a survey of 77 students in an introductory statistics course on a dry campus.

Variables

  • drinks: Number of drinks they had in the past weekend
  • off_campus: 1 - lives off campus, 0 otherwise
  • first_year: 1 - student is a first-year, 0 otherwise
  • sex: f - student identifies as female, m - student identifies as male

Goal: Explore factors related to drinking behavior on a dry campus.

EDA: Response variable

Observed vs. expected response

What does it mean to be a “zero” in this data?

Two types of zeros

There are two types of zeros

  • Those who happen to have a zero in the data set (people who drink but happened to not drink last weekend)
  • Those who will always report a value of zero (non-drinkers)
    • These are called true zeros

We introduce a new parameter \(\alpha\) for the proportion of true zeros, then fit a model that has two components:

1️⃣ The association between mean number of drinks and various characteristics among those who drink

2️⃣ The estimated proportion of non-drinkers

Zero-inflated Poisson model

Zero-inflated Poisson (ZIP) model has two parts

1️⃣ Association, among those who drink, between the mean number of drinks and predictors sex and off campus residence

\[\log(\lambda) = \beta_0 + \beta_1 ~ off\_campus + \beta_2 ~ sex\] where \(\lambda\) is the mean number of drinks among those who drink

2️⃣ Probability that a student does not drink

\[\text{logit}(\alpha) = \log\Big(\frac{\alpha}{1- \alpha}\Big) = \beta_{0\alpha} + \beta_{1\alpha} ~ first\_year\]

where \(\alpha\) is the proportion of non-drinkers

Note: The same variables can be used in each component

Details of the ZIP model

  • The ZIP model is a special case of a latent variable model
    • A type of mixture model, where observations from two or more groups occur together but the group membership is unknown
  • Zero-inflated models are a common type of mixture model; they apply beyond Poisson regression

ZIP model in R

Fit ZIP models using the zeroinfl function from the pscl R package.

library(pscl)

# read the survey data (assumes the same data/ folder as the earlier examples)
drinks <- read_csv("data/weekend-drinks.csv")

drinks_zip <- zeroinfl(drinks ~ off_campus + sex | first_year,
                data = drinks)
drinks_zip

Call:
zeroinfl(formula = drinks ~ off_campus + sex | first_year, data = drinks)

Count model coefficients (poisson with log link):
(Intercept)   off_campus         sexm  
     0.7543       0.4159       1.0209  

Zero-inflation model coefficients (binomial with logit link):
(Intercept)   first_year  
    -0.6036       1.1364  

Tidy output

Use the tidy function from the poissonreg package for tidy model output.

library(poissonreg)

Mean number of drinks among those who drink

tidy(drinks_zip, type = "count") %>% kable(digits = 3)
term type estimate std.error statistic p.value
(Intercept) count 0.754 0.144 5.238 0.000
off_campus count 0.416 0.206 2.021 0.043
sexm count 1.021 0.175 5.827 0.000

Tidy output

Proportion of non-drinkers

tidy(drinks_zip, type = "zero") %>% kable(digits = 3)
term type estimate std.error statistic p.value
(Intercept) zero -0.604 0.311 -1.938 0.053
first_year zero 1.136 0.610 1.864 0.062

Interpreting the model coefficients

term type estimate std.error statistic p.value
(Intercept) count 0.754 0.144 5.238 0.000
off_campus count 0.416 0.206 2.021 0.043
sexm count 1.021 0.175 5.827 0.000


Questions

  • Interpret the intercept.
  • Interpret the coefficients of off_campus and sexm.

Estimated proportion zeros

term type estimate std.error statistic p.value
(Intercept) zero -0.604 0.311 -1.938 0.053
first_year zero 1.136 0.610 1.864 0.062

Questions:

Based on the model…

  • What is the probability a first-year student is a non-drinker?
  • What is the probability an upperclass student (sophomore, junior, senior) is a non-drinker? (A sketch of the computation follows.)
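One way to compute these probabilities in R, using the estimates above (a sketch; plogis() is the inverse logit):

plogis(-0.604 + 1.136 * 1)   # estimated P(non-drinker) for a first-year student
plogis(-0.604 + 1.136 * 0)   # estimated P(non-drinker) for an upperclass student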

These are just a few of the many models…

  • Use the Vuong Test to compare the fit of the ZIP model to a regular Poisson model
    • Why can’t we use a drop-in-deviance test?
  • We’ve just discussed the ZIP model here, but there are other model applications beyond the standard Poisson regression model (e.g., hurdle models, zero-inflated negative binomial models, etc.)

Likelihoods for Poisson models

Estimating coefficients in Poisson model

  • Least squares estimation would not work because the normality and equal variance assumptions don’t hold for Poisson regression

  • Maximum likelihood estimation is used to estimate the coefficients of Poisson regression.

  • The likelihood is the product of the probabilities for the \(n\) independent observations in the data

Likelihood for regular Poisson model

Let’s go back to the example about household size in the Philippines. We will focus on the model that uses the main effect of age to understand variability in mean household size.

Suppose the first five observations have household sizes of 4, 2, 8, 6, and 1. Then the likelihood is

\(L = \frac{e^{-\lambda_1}\lambda_1^4}{4!} * \frac{e^{-\lambda_2}\lambda_2^2}{2!} * \frac{e^{-\lambda_3}\lambda_3^8}{8!} * \frac{e^{-\lambda_4}\lambda_4^6}{6!} * \frac{e^{-\lambda_5}\lambda_5^1}{1!}\)

Likelihood for regular Poisson model

We will use the log likelihood to make finding the MLE easier

\(\begin{aligned}\log(L) &= -\lambda_1 + 4\log(\lambda_1) - \lambda_2 + 2\log(\lambda_2) - \lambda_3 + 8\log(\lambda_3)\\ & -\lambda_4 + 6 \log(\lambda_4) - \lambda_5 + \log(\lambda_5) + C \end{aligned}\)

where

  • \(\lambda_i\) is the mean number in the household, which depends on \(x_i\)
  • \(C = -[\log(4!) + \log(2!) + \log(8!) + \log(1!)]\) minus \(\log(6!)\), i.e., \(C = -[\log(4!) + \log(2!) + \log(8!) + \log(6!) + \log(1!)]\)

Likelihood for regular Poisson model

Given the age of the head of the household, we fit the model

\[\log(\lambda_i) = \beta_0 + \beta_1~age_i\]

Then we replace each \(\lambda_i\) in \(\log(L)\) with \(e^{\beta_0 + \beta_1~age_i}\).

Suppose the first five observations have ages \(X = (32, 21, 55, 44, 28)\). Then

\[\begin{aligned} \log(L) &= [-e^{\beta_0 + 32\beta_1}+ 4(\beta_0 + 32\beta_1)] + [- e^{\beta_0 + 21\beta_1} + 2(\beta_0 + 21\beta_1)] \\ &+ [- e^{\beta_0 + 55\beta_1} + 8(\beta_0 + 55\beta_1)] + [-e^{\beta_0 + 44\beta_1} + 6(\beta_0 + 44\beta_1)] \\ &+ [-e^{\beta_0 + 28\beta_1} + (\beta_0 + 28\beta_1)] + C \end{aligned}\]

Use a numerical search algorithm (R’s glm() uses iteratively reweighted least squares) to find the values of \(\beta_0\) and \(\beta_1\) that maximize the above equation; a small sketch with optim() follows.
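A minimal numerical sketch of this idea, using optim() and the five observations above (in practice glm() does this for the full data):

y <- c(4, 2, 8, 6, 1)
x <- c(32, 21, 55, 44, 28)

# negative log-likelihood of the Poisson regression model
neg_log_lik <- function(beta) {
  lambda <- exp(beta[1] + beta[2] * x)
  -sum(dpois(y, lambda, log = TRUE))
}

optim(c(0, 0), neg_log_lik)$par   # numerical MLEs of beta_0 and beta_1
# compare: coef(glm(y ~ x, family = poisson))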

Probabilities under ZIP model

There are three different types of observations in the data:

  • Observed zero and will always be 0 (true zeros)
  • Observed 0 but will not always be 0
  • Observed non-zero count and will not always be 0

Probabilities under ZIP model

True zeros

\[P(0 | \text{true zero})= \alpha\]

Observed 0 but will not always be 0

\[P(0 | \text{not always zero}) = (1 - \alpha)\frac{e^{-\lambda}\lambda^0}{0!}\]

Observed a non-zero count (and will not always be 0)

\[P(y_i | \text{not always zero}) = (1 - \alpha)\frac{e^{-\lambda}\lambda^{y_i}}{y_i!}\]

Probabilities under ZIP model

Putting this all together, let \(y_i\) be an observed response; then

\[P(Y_i = y_i | x_i) = \begin{cases} \alpha + (1 - \alpha)e^{-\lambda_i} && \text{ if } y_i = 0 \\ (1 - \alpha)\frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!} && \text{ if } y_i > 0 \end{cases}\]

Recall from our example,

\[\lambda_i = e^{\beta_0 + \beta_1~off\_campus_i + \beta_2 ~ sex_i}\]

\[\alpha_i = \frac{e^{\beta_{0\alpha} + \beta_{1\alpha} ~ first\_year_i}}{1 + e^{\beta_{0\alpha} + \beta_{1\alpha} ~ first\_year_i}}\]

  • Plug \(\lambda_i\) and \(\alpha_i\) into the equation above to obtain the likelihood function, written out below
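Written out, the likelihood multiplies these probabilities over the \(n\) independent observations:

\[L = \prod_{i:\, y_i = 0}\Big[\alpha_i + (1 - \alpha_i)e^{-\lambda_i}\Big] \prod_{i:\, y_i > 0}(1 - \alpha_i)\frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!}\]

Maximizing \(\log(L)\) over the coefficients in both components gives the estimates reported by zeroinfl().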

Acknowledgements

These slides are based on content in BMLR: Chapter 4

Initial versions of the slides are by Dr. Maria Tackett, Duke University