Poisson Regression

BMLR Chapter 4

Author

Affiliation

Tyler George

Cornell College
STA 363 Spring 2025 Block 8

Setup

library(tidyverse)
library(tidymodels)
library(GGally)
library(knitr)
library(patchwork)
library(viridis)
library(ggfortify)
library(gridExtra)

Learning goals (1/2)

Describe properties of the Poisson random variable
Write the Poisson regression model
Describe how the Poisson regression differs from least-squares regression
Interpret the coefficients for the Poisson regression model
Compare two Poisson regression models
Define and calculate residuals for the Poisson regression model
Use Goodness-of-fit to assess model fit

Learning goals (2/2)

Identify overdispersion
Apply modeling approaches to deal with overdispersion
Explore properties of negative binomial versus Poisson response
Fit and interpret models with offset to adjust for differences in sampling effort
Fit and interpret Zero-inflated Poisson models
Write likelihood for Poisson and Zero-inflated Poisson model

Scenarios to use Poisson regression

Does the number of employers conducting on-campus interviews during a year differ for public and private colleges?
Does the daily number of asthma-related visits to an Emergency Room differ depending on air pollution indices?
Does the number of paint defects per square foot of wall differ based on the years of experience of the painter?

. . .

Each response variable is a count per a unit of time or space.

Poisson distribution

Let \(Y\) be the number of events in a given unit of time or space. Then \(Y\) can be modeled using a Poisson distribution

. . .

\[P(Y=y) = \frac{e^{-\lambda}\lambda^y}{y!} \hspace{10mm} y=0,1,2,\ldots, \infty\]

Poisson Features

\(E(Y) = Var(Y) = \lambda\)
The distribution is typically skewed right, particularly if \(\lambda\) is small
The distribution becomes more symmetric as \(\lambda\) increases
- If \(\lambda\) is sufficiently large, it can be approximated using a normal distribution (Click here for an example.)

	Mean	Variance
lambda = 1	0.99351	0.9902178
lambda = 5	4.99367	4.9865798
lambda = 50	49.99288	49.8962683

Example

The annual number of earthquakes registering at least 2.5 on the Richter Scale and having an epicenter within 40 miles of downtown Memphis follows a Poisson distribution with mean 6.5. What is the probability there will be at 3 or fewer such earthquakes next year?

. . .

\[P(Y \leq 3) = P(Y = 0) + P(Y = 1) + P(Y = 2) + P(Y = 3)\]

\[ = \frac{e^{-6.5}6.5^0}{0!} + \frac{e^{-6.5}6.5^1}{1!} + \frac{e^{-6.5}6.5^2}{2!} + \frac{e^{-6.5}6.5^3}{3!}\]

\[ = 0.112\]

. . .

ppois(3, 6.5)

[1] 0.1118496

Example adapted from Introduction to Probability Theory Example 28-2

Poisson regression

The data: Household size in the Philippines

The data fHH1.csv come from the 2015 Family Income and Expenditure Survey conducted by the Philippine Statistics Authority.

Goal: Understand the association between household size and various characteristics of the household

Response: - total: Number of people in the household other than the head

Predictors: - location: Where the house is located - age: Age of the head of household - roof: Type of roof on the residence (proxy for wealth)

Other variables: - numLT5: Number in the household under 5 years old

The data

hh_data <- read_csv("data/fHH1.csv")
hh_data |> slice(1:5) |> kable()

location	age	total	roof
CentralLuzon	65	0	Predominantly Strong Material
MetroManila	75	3	Predominantly Strong Material
DavaoRegion	54	4	Predominantly Strong Material
Visayas	49	3	Predominantly Strong Material
MetroManila	74	3	Predominantly Strong Material

Response variable

mean	var
3.685	5.534

Why the least-squares model doesn’t work

The goal is to model \(\lambda\), the expected number of people in the household (other than the head), as a function of the predictors (covariates)

. . .

We might be tempted to try a linear model \[\lambda_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_px_{ip}\]

. . .

This model won’t work because…

It could produce negative values of \(\lambda\) for certain values of the predictors
The equal variance assumption required to conduct inference for linear regression is violated.

Poisson regression model

If \(Y_i \sim Poisson\) with \(\lambda = \lambda_i\) for the given values \(x_{i1}, \ldots, x_{ip}\), then

\[\log(\lambda_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}\]

Each observation can have a different value of \(\lambda\) based on its value of the predictors \(x_1, \ldots, x_p\)
\(\lambda\) determines the mean and variance, so we don’t need to estimate a separate error term

Poisson vs. multiple linear regression

Regression models: Linear regression (left) and Poisson regression (right).

From BMLR Figure 4.1

Assumptions for Poisson regression

Poisson response: The response variable is a count per unit of time or space, described by a Poisson distribution, at each level of the predictor(s)

Independence: The observations must be independent of one another

Mean = Variance: The mean must equal the variance

Linearity: The log of the mean rate, \(\log(\lambda)\), must be a linear function of the predictor(s)

Model 1: Number in household vs. age

model1 <- glm(total ~ age, data = hh_data, family = poisson)

tidy(model1) |> 
  kable(digits = 4)

term	estimate	std.error	statistic	p.value
(Intercept)	1.5499	0.0503	30.8290	0
age	-0.0047	0.0009	-5.0258	0

\[\log(\hat{\lambda}) = 1.5499 - 0.0047 ~ age\]

. . .

Question: The coefficient for age is -0.0047. Interpret this coefficient in context.

Answers

Each additional year older the head of household is, the estimated average log of the number of people in the household is .0047 lower.
Each additional year older the head of household is, the estimated average number of people in the household reduces by 0.5%.
Each additional year older the head of household is the estimated average number of people in the household changes by a factor of .995.
For every 10 years older the head of household is, the estimated average number of people in the household reduces by 5%.

Understanding the interpretation

Let’s derive the change in predicted mean when we go from \(x\) to \(x+1\)

(see boardwork)

Is the coefficient of `age` statistically significant?

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	1.5499	0.0503	30.8290	0	1.4512	1.6482
age	-0.0047	0.0009	-5.0258	0	-0.0065	-0.0029

\[H_0: \beta_1 = 0 \hspace{2mm} \text{ vs. } \hspace{2mm} H_a: \beta_1 \neq 0\]

. . .

Test statistic

\[Z = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} = \frac{-0.0047 - 0}{0.0009} = -5.026 \text{ (using exact values)}\]

. . .

P-value

\[P(|Z| > |-5.026|) = 5.01 \times 10^{-7} \approx 0\]

What are plausible values for the coefficient of `age`?

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	1.5499	0.0503	30.8290	0	1.4512	1.6482
age	-0.0047	0.0009	-5.0258	0	-0.0065	-0.0029

95% confidence interval for the coefficient of age

\[\hat{\beta}_1 \pm Z^{*}\times SE(\hat{\beta}_1)\] \[-0.0047 \pm 1.96 \times 0.0009 = \mathbf{(-.0065, -0.0029)}\]

. . .

Question: Interpret the interval in terms of the change in mean household size.

Which can best help us determine whether Model 1 is a good fit?

Data set pulled from BMLR Section 4.11.3.

Case study in BMLR - Section 4.10

Plots
Code

p1 <- ggplot(data = hh_data, aes(x = age, y = total)) + 
  geom_point() + 
  labs(y = "Total household size", 
       title = "Plot A")

p2 <- hh_data |>
  group_by(age) |> 
  summarise(mean = mean(total)) |>
  ggplot(aes(x = age, y = mean))+ 
  geom_point() + 
  labs(y = "Empirical mean household size", 
       title = "Plot B")

p3 <- hh_data |>
  group_by(age) |> 
  summarise(log_mean = log(mean(total))) |>
  ggplot(aes(x = age, y = log_mean)) + 
  geom_point() + 
  labs(y = "Log empirical mean household size", 
       title = "Plot C")

p1 + p2 + p3 + plot_annotation(tag_levels = 'A')

Model 2: Add a quadratic effect for `age`

hh_data <- hh_data |> 
  mutate(age2 = age*age)

model2 <- glm(total ~ age + age2, data = hh_data, family = poisson)
tidy(model2, conf.int = T) |> 
  kable(digits = 4)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-0.3325	0.1788	-1.8594	0.063	-0.6863	0.0148
age	0.0709	0.0069	10.2877	0.000	0.0575	0.0845
age2	-0.0007	0.0001	-11.0578	0.000	-0.0008	-0.0006

Model 2: Add a quadratic effect for `age`

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-0.3325	0.1788	-1.8594	0.063	-0.6863	0.0148
age	0.0709	0.0069	10.2877	0.000	0.0575	0.0845
age2	-0.0007	0.0001	-11.0578	0.000	-0.0008	-0.0006

We can determine whether to keep \(age^2\) in the model in two ways:

1️⃣ Use the p-value (or confidence interval) for the coefficient (since we are adding a single term to the model)

2️⃣ Conduct a drop-in-deviance test

Deviance

A deviance is a way to measure how the observed data deviates from the model predictions.

It’s a measure unexplained variability in the response variable (similar to SSE in linear regression )
Lower deviance means the model is a better fit to the data

. . .

We can calculate the “deviance residual” for each observation in the data (more on the formula later). Let \((\text{deviance residual}_i\) be the deviance residual for the \(i^{th}\) observation, then

\[\text{deviance} = \sum(\text{deviance residual})_i^2\]

. . .

Note: Deviance is also known as the “residual deviance”

Drop-in-Deviance Test

We can use a drop-in-deviance test to compare two models. To conduct the test

1️⃣ Compute the deviance for each model

2️⃣ Calculate the drop in deviance

\[\text{drop-in-deviance = Deviance(reduced model) - Deviance(larger model)}\]

. . .

3️⃣ Given the reduced model is the true model \((H_0 \text{ true})\), then \[\text{drop-in-deviance} \sim \chi_d^2\]

where \(d\) is the difference in degrees of freedom between the two models (i.e., the difference in number of terms)

Drop-in-deviance - Model1 and Model2

anova(model1, model2, test = "Chisq") |>
  kable(digits = 3)

Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
1498	2337.089	NA	NA	NA
1497	2200.944	1	136.145	0

. . .

Questions:

Write the null and alternative hypotheses.
What does the value 2337.089 tell you?
What does the value 1 tell you?
What is your conclusion?

Add `location` to the model?

Suppose we want to add location to the model, so we compare the following models:

Model A: \(\lambda_i = \beta_0 + \beta_1 ~ age_i + \beta_2 ~ age_i^2\)

Model B: \(\lambda_i = \beta_0 + \beta_1 ~ age_i + \beta_2 ~ age_i^2 + \beta_3 ~ Loc1_i + \beta_4 ~ Loc2_i + \beta_5 ~ Loc3_i + \beta_6 ~ Loc4_i\)

Question

Which of the following are reliable ways to determine if location should be added to the model?

. . .

Drop-in-deviance test
Use the p-value for each coefficient
Likelihood ratio test
Nested F Test
BIC

Add `location` to the model?

model3 <- glm(total ~ age + age2 + location, data = hh_data, family = poisson)

. . .

Use a drop-in-deviance test to determine if Model 2 or Model 3 (with location) is a better fit for the data.

. . .

anova(model2, model3, test = "Chisq") |>
  kable(digits = 3)

Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
1497	2200.944	NA	NA	NA
1493	2187.800	4	13.144	0.011

The p-value is small (0.01 < 0.05), so we conclude that Model 3 is a better fit for the data.

Model 3

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-0.3843	0.1821	-2.1107	0.0348	-0.7444	-0.0306
age	0.0704	0.0069	10.1900	0.0000	0.0569	0.0840
age2	-0.0007	0.0001	-10.9437	0.0000	-0.0008	-0.0006
locationDavaoRegion	-0.0194	0.0538	-0.3605	0.7185	-0.1250	0.0859
locationIlocosRegion	0.0610	0.0527	1.1580	0.2468	-0.0423	0.1641
locationMetroManila	0.0545	0.0472	1.1542	0.2484	-0.0378	0.1473
locationVisayas	0.1121	0.0417	2.6853	0.0072	0.0308	0.1945

. . .

Does this model sufficiently explain the variability in the mean household size?

Goodness-of-fit

Pearson residuals

We can calculate two types of residuals for Poisson regression: Pearson residuals and deviance residuals

. . .

\[\text{Pearson residual}_i = \frac{\text{observed} - \text{predicted}}{\text{std. error}} = \frac{y_i - \hat{\lambda}_i}{\sqrt{\hat{\lambda}_i}}\]

. . .

Similar interpretation as standardized residuals from linear regression
Expect most to fall between -2 and 2
Used to calculate overdispersion parameter

Deviance residuals

The deviance residual indicates how much the observed data deviates from the fitted model

\[\text{deviance residual}_i = \text{sign}(y_i - \hat{\lambda}_i)\sqrt{2\Bigg[y_i\log\bigg(\frac{y_i}{\hat{\lambda}_i}\bigg) - (y_i - \hat{\lambda}_i)\Bigg]}\]

where

\[\text{sign}(y_i - \hat{\lambda}_i) = \begin{cases} 1 & \text{ if }(y_i - \hat{\lambda}_i) > 0 \\ -1 & \text{ if }(y_i - \hat{\lambda}_i) < 0 \\ 0 & \text{ if }(y_i - \hat{\lambda}_i) = 0 \end{cases}\]

Model 3: Residual plots

model3_aug_pearson <- augment(model3, type.residuals = "pearson") 
model3_aug_deviance <- augment(model3, type.residuals = "deviance")

Plots
Code

p1 <- ggplot(data = model3_aug_pearson, aes(x = .fitted, y = .resid)) + 
  geom_point()  + 
  geom_smooth() + 
  labs(x = "Fitted values", 
       y = "Pearson residuals", 
       title = "Pearson residuals vs. fitted")

p2 <-  ggplot(data = model3_aug_deviance, aes(x = .fitted, y = .resid)) + 
  geom_point()  + 
  geom_smooth() + 
  labs(x = "Fitted values", 
       y = "Deviance residuals", 
       title = "Deviance residuals vs. fitted")

p1 + p2

Goodness-of-fit

Goal: Use the (residual) deviance to assess how much the predicted values differ from the observed values. Recall \((\text{deviance}) = \sum_{i=1}^{n}(\text{deviance residual})_i^2\)
If the model sufficiently fits the data, then

\[\text{deviance} \sim \chi^2_{df}\]

where \(df\) is the model’s residual degrees of freedom

. . .

Question: What is the probability of observing a deviance larger than the one we’ve observed, given this model sufficiently fits the data?

. . .

\[P(\chi^2_{df} > \text{ deviance})\]

Model 3: Goodness-of-fit calculations

model3$deviance

[1] 2187.8

model3$df.residual

[1] 1493

pchisq(model3$deviance, model3$df.residual, lower.tail = FALSE)

[1] 3.153732e-29

The probability of observing a deviance greater than 2187.8 is \(\approx 0\), so there is significant evidence of lack-of-fit.

Lack-of-fit

There are a few potential reasons for lack-of-fit:

Missing important interactions or higher-order terms
Missing important variables (perhaps this means a more comprehensive data set is required)
There could be extreme observations causing the deviance to be larger than expected (assess based on the residual plots)
There could be a problem with the Poisson model
- May need more flexibility in the model to handle overdispersion

Overdispersion

Overdispersion: There is more variability in the response than what is implied by the Poisson model

Tables
Code

Overall

mean	var
3.685	5.534

by Location

location	mean	var
CentralLuzon	3.402	4.152
DavaoRegion	3.390	4.723
IlocosRegion	3.586	5.402
MetroManila	3.707	4.863
Visayas	3.902	6.602

hh_data |>
  summarise(mean = mean(total), var = var(total)) |>
  kable(digits = 3)

hh_data |>
  group_by(location) |>
  summarise(mean = mean(total), var = var(total)) |>
  kable(digits = 3)

Why overdispersion matters

If there is overdispersion, then there is more variation in the response than what’s implied by a Poisson model. This means

❌ The standard errors of the model coefficients are artificially small

❌ The p-values are artificially small

❌ This could lead to models that are more complex than what is needed

. . .

We can take overdispersion into account by

inflating standard errors by multiplying them by a dispersion factor
using a negative-binomial regression model

Quasi-poission

Dispersion parameter

The dispersion parameter is represented by \(\phi\)

\[\hat{\phi} =\frac{\text{deviance}}{\text{residual df}} = \frac{\sum_{i=1}^{n}(\text{Pearson residuals})^2}{n - p}\]

where \(p\) is the number of terms in the model (including the intercept)

If there is no overdispersion \(\hat{\phi} = 1\)
If there is overdispersion \(\hat{\phi} > 1\)

Accounting for dispersion in the model

We inflate the standard errors of the coefficient by multiplying the variance by \(\hat{\phi}\)

. . .

\[SE_{Q}(\hat{\beta}) = \sqrt{\hat{\phi}} * SE(\hat{\beta})\] - “Q” stands for quasi-Poisson, since this is an ad-hoc solution - The process for model building and model comparison is called quasilikelihood (similar to likelihood without exact underlying distributions)

Model 3: Quasi-Poisson model

model3_q <- glm(total ~ age + age2 + location, data = hh_data, 
                family = quasipoisson) #<<

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-0.3843	0.2166	-1.7744	0.0762	-0.8134	0.0358
age	0.0704	0.0082	8.5665	0.0000	0.0544	0.0866
age2	-0.0007	0.0001	-9.2000	0.0000	-0.0009	-0.0006
locationDavaoRegion	-0.0194	0.0640	-0.3030	0.7619	-0.1451	0.1058
locationIlocosRegion	0.0610	0.0626	0.9735	0.3304	-0.0620	0.1837
locationMetroManila	0.0545	0.0561	0.9703	0.3320	-0.0552	0.1649
locationVisayas	0.1121	0.0497	2.2574	0.0241	0.0156	0.2103

Poisson vs. Q-Poisson

Poisson

term	estimate	std.error
(Intercept)	-0.3843	0.1821
age	0.0704	0.0069
age2	-0.0007	0.0001
locationDavaoRegion	-0.0194	0.0538
locationIlocosRegion	0.0610	0.0527
locationMetroManila	0.0545	0.0472
locationVisayas	0.1121	0.0417

Quasi-Poisson

estimate	std.error
-0.3843	0.2166
0.0704	0.0082
-0.0007	0.0001
-0.0194	0.0640
0.0610	0.0626
0.0545	0.0561
0.1121	0.0497

Q-Poisson: Inference for coefficients

term	estimate	std.error
(Intercept)	-0.3843	0.2166
age	0.0704	0.0082
age2	-0.0007	0.0001
locationDavaoRegion	-0.0194	0.0640
locationIlocosRegion	0.0610	0.0626
locationMetroManila	0.0545	0.0561
locationVisayas	0.1121	0.0497

Test statistic \[t = \frac{\hat{\beta} - 0}{SE_{Q}(\hat{\beta})} \sim t_{n-p}\]

Negative binomial regression model

Another approach to handle overdispersion is to use a negative binomial regression model

This has more flexibility than the quasi-Poisson model, because there is a new parameter in addition to \(\lambda\)

. . .

Let \(Y\) be a negative binomial random variable, \(Y\sim NegBinom(r, p)\), then

\[P(Y = y_i) = {y_i + r - 1 \choose r - 1}(1-p)^{y_i}p^r \hspace{5mm} y_i = 0, 1, 2, \ldots, \infty\]

Negative binomial regression model

Main idea: Generate a \(\lambda\) for each observation (household) and generate a count using the Poisson random variable with parameter \(\lambda\)
- Makes the counts more dispersed than with a single parameter
Think of it as a Poisson model such that \(\lambda\) is also random

. . .

\(\begin{aligned} &\text{If }Y|\lambda \sim Poisson(\lambda)\\ &\text{ and } \lambda \sim Gamma\bigg(r, \frac{1-p}{p}\bigg)\\ &\text{ then } Y \sim NegBinom(r, p)\end{aligned}\)

Negative binomial regression in R

library(MASS)
model3_nb <- glm.nb(total ~ age + age2 + location, data = hh_data)
tidy(model3_nb) |> 
  kable(digits = 4)

term	estimate	std.error	statistic	p.value
(Intercept)	-0.3753	0.2076	-1.8081	0.0706
age	0.0699	0.0079	8.8981	0.0000
age2	-0.0007	0.0001	-9.5756	0.0000
locationDavaoRegion	-0.0219	0.0625	-0.3501	0.7262
locationIlocosRegion	0.0577	0.0615	0.9391	0.3477
locationMetroManila	0.0562	0.0551	1.0213	0.3071
locationVisayas	0.1104	0.0487	2.2654	0.0235

Offsets

Data: Airbnbs in NYC

The data set NYCairbnb-sample.csv contains information about a random sample of 1000 Airbnbs in New York City. It is a subset of the data on 40628 Airbnbs scraped by Awad et al. (2017).

Variables

number_of_reviews: Number of reviews for the unit on Airbnb (proxy for number of rentals)
price: price per night in US dollars
room_type: Entire home/apartment, private room, or shared room
days: Number of days the unit has been listed (date when info scraped - date when unit first listed on Airbnb)

Data: Airbnbs in NYC

airbnb <- read_csv("data/NYCairbnb-sample.csv") |>
  dplyr::select(id, number_of_reviews, days, room_type, price)

id	number_of_reviews	days	room_type	price
15756544	16	1144	Private room	120
14218251	15	471	Private room	89
21644	0	2600	Private room	89
13667835	1	283	Entire home/apt	150
265912	0	1970	Entire home/apt	89

Goal: Use the price and room type of Airbnbs to describe variation in the number of reviews (a proxy for number of rentals).

EDA

Plot
Code

p1 <- ggplot(data = airbnb, aes(x = number_of_reviews)) + 
  geom_histogram() + 
   labs(x = "Number of reviews",
    title = "Distribution of number of reviews")

p2 <- airbnb |>
  filter(price <= 2000) |>
  group_by(price) |>
  summarise(log_mean = log(mean(number_of_reviews))) |>
  ggplot(aes(x = price, y = log_mean)) + 
  geom_point(alpha= 0.7) + 
  geom_smooth() + 
  labs(x = "Price in  US dollars",
    y = "Log(mean # reviews)", 
    title = "Log mean #  of reviews vs. price", 
    subtitle = "Airbnbs $2000 or less")

p3 <- airbnb |>
  filter(price <= 500) |>
  group_by(price) |>
  summarise(log_mean = log(mean(number_of_reviews))) |>
  ggplot(aes(x = price, y = log_mean)) + 
  geom_point(alpha= 0.7) + 
  geom_smooth() + 
  labs(x = "Price in  US dollars",
    y = "Log(mean # reviews)", 
    title = "Log mean # of reviews vs. price", 
    subtitle = "Airbnbs $500 or less")

p1  / (p2 + p3)

EDA

Overall

mean	var
15.916	765.969

by Room type

room_type	mean	var
Entire home/apt	16.283	760.348
Private room	15.608	786.399
Shared room	15.028	605.971

Considerations for modeling

We would like to fit the Poisson regression model

\[\log(\lambda) = \beta_0 + \beta_1 ~ price + \beta_2 ~ room\_type1 + \beta_3 ~ room\_type2\]

. . .

Question: - Based on the EDA, what are some potential issues we may want to address in the model building?

Suppose any model fit issues are addressed. What are some potential limitations to the conclusions and interpretations from the model?

Offset

Sometimes counts are not directly comparable because the observations differ based on some characteristic directly related to the counts, i.e. the sampling effort.
An offset can be used to adjust for differences in sampling effort.

. . .

Let \(x_{offset}\) be the variable that accounts for differences in sampling effort, then \(\log(x_{offset})\) will be added to the model.

. . .

\(\log(\lambda_i) = \beta_0 + \beta_1 ~ x_{i1} + \beta_2 ~ x_{i2} + ... + \beta_p ~ x_{ip} + \log(x_{offset_i})\)

The offset is a term in the model with coefficient always equal to 1.

Adding an offset to the Airbnb model

We will add the offset \(\log(days)\) to the model. This accounts for the fact that we would expect Airbnbs that have been listed longer to have more reviews.

\[\log(\lambda_i) = \beta_0 + \beta_1 ~ price_i + \beta_2 ~ room\_type1_i + \beta_3 ~ room\_type2_i + \log(days_i)\]

Note: The response variable for the model is still \(\log(\lambda_i)\), the log mean number of reviews

Detail on the offset

We want to adjust for the number of days, so we are interested in \(\frac{reviews}{days}\).

. . .

Given \(\lambda\) is the mean number of reviews

\[\log\Big(\frac{\lambda}{days}\Big) = \beta_0 + \beta_1 ~ price + \beta_2 ~ room\_type1 + \beta_3 ~ room\_type2\]

. . .

\[\Rightarrow \log({\lambda}) - \log(days) = \beta_0 + \beta_1 ~ price + \beta_2 ~ room\_type1 + \beta_3 ~ room\_type2\]

. . .

\[\Rightarrow \log({\lambda}) = \beta_0 + \beta_1 ~ price + \beta_2 ~ room\_type1 + \beta_3 ~ room\_type2 + \log(days)\]

Airbnb model in R

airbnb_model <- glm(number_of_reviews ~ price + room_type, 
                    data = airbnb, family = poisson, 
                    offset = log(days)) #<<

term	estimate	std.error	statistic
(Intercept)	-4.1351	0.0170	-243.1397
price	-0.0005	0.0001	-7.0952
room_typePrivate room	-0.0994	0.0174	-5.6986
room_typeShared room	0.2436	0.0452	5.3841

. . .

The coefficient for \(\log(days)\) is fixed at 1, so it is not in the model output.

Interpretations

term	estimate	std.error	statistic
(Intercept)	-4.1351	0.0170	-243.1397
price	-0.0005	0.0001	-7.0952
room_typePrivate room	-0.0994	0.0174	-5.6986
room_typeShared room	0.2436	0.0452	5.3841

Question:

Interpret the coefficient of price.
Interpret the coefficient of room_typePrivate room

Goodness-of-fit

\[\begin{aligned}&H_0: \text{ The model is a good fit for the data}\\ &H_a: \text{ There is significant lack of fit}\end{aligned}\]

pchisq(airbnb_model$deviance, airbnb_model$df.residual, lower.tail = F)

[1] 0

. . .

There is evidence of significant lack of fit in the model. Therefore, more models would need to be explored that address the issues mentioned earlier.

. . .

In practice we would assess goodness-of-fit and finalize the model before any interpretations and conclusions.

Zero-inflated Poisson model

Data: Weekend drinking

The data weekend-drinks.csv contains information from a survey of 77 students in a introductory statistics course on a dry campus.

Variables

drinks: Number of drinks they had in the past weekend
off_campus: 1 - lives off campus, 0 otherwise
first_year: 1 - student is a first-year, 0 otherwise
sex: f - student identifies as female, m - student identifies as male

. . .

Goal: The goal is explore factors related to drinking behavior on a dry campus.

EDA: Response variable

Observed vs. expected response

. . .

What does it mean to be a “zero” in this data?

Two types of zeros

There are two types of zeros

Those who happen to have a zero in the data set (people who drink but happened to not drink last weekend)
Those who will always report a value of zero (non-drinkers)
- These are called true zeros

. . .

We introduce a new parameter \(\alpha\) for the proportion of true zeros, then fit a model that has two components:

. . .

1️⃣ The association between mean number of drinks and various characteristics among those who drink

2️⃣ The estimated proportion of non-drinkers

Zero-inflated Poisson model

Zero-inflated Poisson (ZIP) model has two parts

. . .

1️⃣ Association, among those who drink, between the mean number of drinks and predictors sex and off campus residence

. . .

\[\log(\lambda) = \beta_0 + \beta_1 ~ off\_campus + \beta_2 ~ sex\] where \(\lambda\) is the mean number of drinks among those who drink

. . .

2️⃣ Probability that a student does not drink

\[\text{logit}(\alpha) = \log\Big(\frac{\alpha}{1- \alpha}\Big) = \beta_0 + \beta_1 ~ first\_year\]

where \(\alpha\) is the proportion of non-drinkers

. . .

Note: The same variables can be used in each component

Details of the ZIP model

The ZIP model is a special case of a latent variable model
- A type of mixture model where observations for one or more groups occur together but the group membership unknown
Zero-inflated models are a common type of mixture model; they apply beyond Poisson regression

ZIP model in R

Fit ZIP models using the zeroinfl function from the pscl R package.

library(pscl)

drinks_zip <- zeroinfl(drinks ~ off_campus + sex | first_year,
                data = drinks)
drinks_zip


Call:
zeroinfl(formula = drinks ~ off_campus + sex | first_year, data = drinks)

Count model coefficients (poisson with log link):
(Intercept)   off_campus         sexm  
     0.7543       0.4159       1.0209  

Zero-inflation model coefficients (binomial with logit link):
(Intercept)   first_year  
    -0.6036       1.1364

Tidy output

Use the tidy function from the poissonreg package for tidy model output.

library(poissonreg)

. . .

Mean number of drinks among those who drink

tidy(drinks_zip, type = "count") %>% kable(digits = 3)

term	type	estimate	std.error	statistic	p.value
(Intercept)	count	0.754	0.144	5.238	0.000
off_campus	count	0.416	0.206	2.021	0.043
sexm	count	1.021	0.175	5.827	0.000

Tidy output

Proportion of non-drinkers

tidy(drinks_zip, type = "zero") %>% kable(digits = 3)

term	type	estimate	std.error	statistic	p.value
(Intercept)	zero	-0.604	0.311	-1.938	0.053
first_year	zero	1.136	0.610	1.864	0.062

Interpreting the model coefficients

term	type	estimate	std.error	statistic	p.value
(Intercept)	count	0.754	0.144	5.238	0.000
off_campus	count	0.416	0.206	2.021	0.043
sexm	count	1.021	0.175	5.827	0.000

Questions

Interpret the intercept.
Interpret the coefficients of off_campus and sexm.

Estimated proportion zeros

term	type	estimate	std.error	statistic	p.value
(Intercept)	zero	-0.604	0.311	-1.938	0.053
first_year	zero	1.136	0.610	1.864	0.062

Questions:

Based on the model…

What is the probability a first-year student is a non-drinker?
What is the probability a upperclass student (sophomore, junior, senior) is a non-drinker?

These are just a few of the many models…

Use the Vuong Test to compare the fit of the ZIP model to a regular Poisson model
- Why can’t we use a drop-in-deviance test?
We’ve just discussed the ZIP model here, but there are other model applications beyond the standard Poisson regression model (e.g., hurdle models, Zero-inflated Negative Binomial models, etc. )

Likelihoods for Poisson models

Estimating coefficients in Poisson model

Least squares estimation would not work because the normality and equal variance assumptions don’t hold for Poisson regression
Maximum likelihood estimation is used to estimate the coefficients of Poisson regression.
The likelihood is the product of the probabilities for the \(n\) independent observations in the data

Likelihood for regular Poisson model

Let’s go back to example about household size in the Philippines. We will focus on the model using the main effect of age to understand variability in mean household size.

. . .

Suppose the first five observations have household sizes of 4, 2, 8, 6, and 1. Then the likelihood is

. . .

\(L = \frac{e^{-\lambda_1}\lambda_1^4}{4!} * \frac{e^{-\lambda_2}\lambda_2^2}{2!} * \frac{e^{-\lambda_3}\lambda_3^8}{8!} * \frac{e^{-\lambda_4}\lambda_4^6}{6!} * \frac{e^{-\lambda_5}\lambda_5^1}{1!}\)

Likelihood for regular Poisson model

We will use the log likelihood to make finding the MLE easier

. . .

\(\begin{aligned}\log(L) &= -\lambda_1 + 4\log(\lambda_1) - \lambda_2 + 2\log(\lambda_2) - \lambda_3 + 8\log(\lambda_3)\\ & -\lambda_4 + 6 \log(\lambda_4) - \lambda_5 + \log(\lambda_5) + C \end{aligned}\)

where - \(\lambda\) is the mean number in household depending on \(x_i\) - \(C = -[\log(4!) + \log(2!) + \log(8!) + \log(6!)+ \log(1!)]\)

Likelihood for regular Poisson model

Given the age of the head of the household, we fit the model

\[\log(\lambda_i) = \beta_0 + \beta_1~age_i\]

Then we replace each \(\lambda_i\) in \(\log(L)\) with \(e^{\beta_0 + \beta_1~age_i}\).

. . .

Suppose the first five observations have ages \(X = (32, 21, 55, 44, 28)\). Then

. . .

\[\begin{aligned} \log(L) &= [-e^{\beta_0 + \beta_132}+ 4(\beta_0 + \beta_1 32)] + [ - e^{\beta_0 + \beta_121} + 2(\beta_0 + \beta_121)] \\ &+ [- e^{\beta_0 + \beta_155} + 8(\beta_0 + \beta_155)] + [-e^{\beta_0 + \beta_144} + 6(\beta_0 + \beta_144)] \\ &+ [-e^{\beta_0 + \beta_128}(\beta_0 + \beta_128)] + C \end{aligned}\]

. . .

Use search algorithm to find the values of \(\beta_0\) and \(\beta_1\) that maximize the above equation.

Probabilities under ZIP model

There are three different types of observations in the data:

Observed zero and will always be 0 (true zeros)
Observed 0 but will not always be 0
Observed non-zero count and will not always be 0

Probabilities under ZIP model

True zeros

\[P(0 | \text{true zero})= \alpha\]

. . .

Observed 0 but will not always be 0

\[P(0 | \text{not always zero}) = (1 - \alpha)\frac{e^{-\lambda}\lambda^0}{0!}\]

. . .

Did not observe 0 and will not always be 0

\[P(z_i | \text{not always zero}) = (1 - \alpha)\frac{e^{-\lambda}\lambda^{z_i}}{z_i!}\]

Probabilities under ZIP model

Putting this all together. Let \(y_i\) be an observed response then

\[P(Y_i = y_i | x_i) = \begin{cases} \alpha + (1 - \alpha)e^{-\lambda_i} && \text{ if } y_i = 0 \\ (1 - \alpha)\frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!} && \text{ if } y_i > 0 \end{cases}\]

. . .

Recall from our example,

\[\lambda_i = e^{\beta_0 + \beta_1~off\_campus_i + \beta_2 ~ sex_i}\]

\[\alpha_i = \frac{e^{\beta_{0\alpha} + \beta_{1\alpha} ~ first\_year_i}}{1 + e^{\beta_{0\alpha} + \beta_{1\alpha} ~ first\_year_i}}\]

Plug in \(\lambda_i\) and \(\alpha_i\) into the above equation obtain the likelihood function

Acknowledgements

These slides are based on content in BMLR: Chapter 4

Initial versions of the slides are by Dr. Maria Tackett, Duke University

Setup

Learning goals (1/2)

Learning goals (2/2)

Scenarios to use Poisson regression

Poisson distribution

Poisson Features

Poisson Graphs

Example

Poisson regression

Poisson regression

The data

Response variable

Why the least-squares model doesn’t work

Poisson regression model

Poisson vs. multiple linear regression

Assumptions for Poisson regression

Model 1: Number in household vs. age

Answers

Understanding the interpretation

Is the coefficient of age statistically significant?

What are plausible values for the coefficient of age?

Which can best help us determine whether Model 1 is a good fit?

Model 2: Add a quadratic effect for age

Model 2: Add a quadratic effect for age

Deviance

Drop-in-Deviance Test

Drop-in-deviance - Model1 and Model2

Add location to the model?

Question

Add location to the model?

Model 3

Goodness-of-fit

Pearson residuals

Deviance residuals

Model 3: Residual plots

Goodness-of-fit

Model 3: Goodness-of-fit calculations

Lack-of-fit

Overdispersion

Why overdispersion matters

Quasi-poission

Dispersion parameter

Accounting for dispersion in the model

Model 3: Quasi-Poisson model

Poisson vs. Q-Poisson

Q-Poisson: Inference for coefficients

Negative binomial regression model

Negative binomial regression model

Negative binomial regression model

Negative binomial regression in R

Offsets

Data: Airbnbs in NYC

Data: Airbnbs in NYC

EDA

EDA

Considerations for modeling

Offset

Adding an offset to the Airbnb model

Detail on the offset

Airbnb model in R

Interpretations

Goodness-of-fit

Zero-inflated Poisson model

Data: Weekend drinking

EDA: Response variable

Observed vs. expected response

Two types of zeros

Zero-inflated Poisson model

Details of the ZIP model

ZIP model in R

Tidy output

Tidy output

Interpreting the model coefficients

Estimated proportion zeros

These are just a few of the many models…

Likelihoods for Poisson models

Estimating coefficients in Poisson model

Likelihood for regular Poisson model

Likelihood for regular Poisson model

Likelihood for regular Poisson model

Is the coefficient of `age` statistically significant?

What are plausible values for the coefficient of `age`?

Model 2: Add a quadratic effect for `age`

Model 2: Add a quadratic effect for `age`

Add `location` to the model?

Add `location` to the model?