library(tidyverse)
library(tidymodels)
library(GGally)
library(knitr)
library(patchwork)
library(viridis)
library(ggfortify)
library(gridExtra)
Poisson Regression
BMLR Chapter 4
Setup
Learning goals (1/2)
Describe properties of the Poisson random variable
Write the Poisson regression model
Describe how the Poisson regression differs from least-squares regression
Interpret the coefficients for the Poisson regression model
Compare two Poisson regression models
Define and calculate residuals for the Poisson regression model
Use goodness-of-fit to assess model fit
Learning goals (2/2)
Identify overdispersion
Apply modeling approaches to deal with overdispersion
Explore properties of negative binomial versus Poisson response
Fit and interpret models with offset to adjust for differences in sampling effort
Fit and interpret Zero-inflated Poisson models
Write the likelihood for the Poisson and Zero-inflated Poisson models
Scenarios to use Poisson regression
Does the number of employers conducting on-campus interviews during a year differ for public and private colleges?
Does the daily number of asthma-related visits to an Emergency Room differ depending on air pollution indices?
Does the number of paint defects per square foot of wall differ based on the years of experience of the painter?
. . .
Each response variable is a count per unit of time or space.
Poisson distribution
Let \(Y\) be the number of events in a given unit of time or space. Then \(Y\) can be modeled using a Poisson distribution
. . .
\[P(Y=y) = \frac{e^{-\lambda}\lambda^y}{y!} \hspace{10mm} y=0,1,2,\ldots, \infty\]
Poisson Features
- \(E(Y) = Var(Y) = \lambda\)
- The distribution is typically skewed right, particularly if \(\lambda\) is small
- The distribution becomes more symmetric as \(\lambda\) increases
- If \(\lambda\) is sufficiently large, the distribution can be approximated by a normal distribution
Poisson Graphs
Distribution | Mean | Variance |
---|---|---|
lambda = 1 | 0.99351 | 0.9902178 |
lambda = 5 | 4.99367 | 4.9865798 |
lambda = 50 | 49.99288 | 49.8962683 |
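The sample means and variances above appear to come from simulated draws; a minimal sketch, assuming 100,000 draws per value of \(\lambda\) (the seed and sample size behind the table are not shown):

set.seed(1)
sapply(c(1, 5, 50), function(lambda) {
  y <- rpois(100000, lambda)        # simulate Poisson counts
  c(mean = mean(y), var = var(y))   # sample mean and variance both track lambda
})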
Example
The annual number of earthquakes registering at least 2.5 on the Richter Scale and having an epicenter within 40 miles of downtown Memphis follows a Poisson distribution with mean 6.5. What is the probability there will be 3 or fewer such earthquakes next year?
. . .
\[P(Y \leq 3) = P(Y = 0) + P(Y = 1) + P(Y = 2) + P(Y = 3)\]
\[ = \frac{e^{-6.5}6.5^0}{0!} + \frac{e^{-6.5}6.5^1}{1!} + \frac{e^{-6.5}6.5^2}{2!} + \frac{e^{-6.5}6.5^3}{3!}\]
\[ = 0.112\]
. . .
ppois(3, 6.5)
[1] 0.1118496
Example adapted from Introduction to Probability Theory Example 28-2
Poisson regression
Poisson regression
The data: Household size in the Philippines
The data fHH1.csv come from the 2015 Family Income and Expenditure Survey conducted by the Philippine Statistics Authority.
Goal: Understand the association between household size and various characteristics of the household
Response:

- `total`: Number of people in the household other than the head

Predictors:

- `location`: Where the house is located
- `age`: Age of the head of household
- `roof`: Type of roof on the residence (proxy for wealth)

Other variables:

- `numLT5`: Number in the household under 5 years old
The data
<- read_csv("data/fHH1.csv")
hh_data |> slice(1:5) |> kable() hh_data
location | age | total | numLT5 | roof |
---|---|---|---|---|
CentralLuzon | 65 | 0 | 0 | Predominantly Strong Material |
MetroManila | 75 | 3 | 0 | Predominantly Strong Material |
DavaoRegion | 54 | 4 | 0 | Predominantly Strong Material |
Visayas | 49 | 3 | 0 | Predominantly Strong Material |
MetroManila | 74 | 3 | 0 | Predominantly Strong Material |
Response variable
mean | var |
---|---|
3.685 | 5.534 |
Why the least-squares model doesn’t work
The goal is to model \(\lambda\), the expected number of people in the household (other than the head), as a function of the predictors (covariates)
. . .
We might be tempted to try a linear model \[\lambda_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_px_{ip}\]
. . .
This model won’t work because…
- It could produce negative values of \(\lambda\) for certain values of the predictors
- The equal variance assumption required to conduct inference for linear regression is violated.
Poisson regression model
If \(Y_i \sim Poisson\) with \(\lambda = \lambda_i\) for the given values \(x_{i1}, \ldots, x_{ip}\), then
\[\log(\lambda_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}\]
Each observation can have a different value of \(\lambda\) based on its value of the predictors \(x_1, \ldots, x_p\)
\(\lambda\) determines the mean and variance, so we don’t need to estimate a separate error term
Poisson vs. multiple linear regression
From BMLR Figure 4.1
Assumptions for Poisson regression
Poisson response: The response variable is a count per unit of time or space, described by a Poisson distribution, at each level of the predictor(s)
Independence: The observations must be independent of one another
Mean = Variance: The mean must equal the variance
Linearity: The log of the mean rate, \(\log(\lambda)\), must be a linear function of the predictor(s)
Model 1: Number in household vs. age
model1 <- glm(total ~ age, data = hh_data, family = poisson)

tidy(model1) |>
  kable(digits = 4)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 1.5499 | 0.0503 | 30.8290 | 0 |
age | -0.0047 | 0.0009 | -5.0258 | 0 |
\[\log(\hat{\lambda}) = 1.5499 - 0.0047 ~ age\]
. . .
Question: The coefficient for `age` is -0.0047. Interpret this coefficient in context.
Answers
Each additional year older the head of household is, the estimated average log of the number of people in the household is 0.0047 lower.
Each additional year older the head of household is, the estimated average number of people in the household reduces by 0.5%.
Each additional year older the head of household is, the estimated average number of people in the household changes by a factor of 0.995.
For every 10 years older the head of household is, the estimated average number of people in the household reduces by 5%.
Understanding the interpretation
Let’s derive the change in predicted mean when we go from \(x\) to \(x+1\)
(see boardwork)
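A sketch of the boardwork: since \(\log(\lambda(x)) = \beta_0 + \beta_1 x\), we have \(\lambda(x) = e^{\beta_0 + \beta_1 x}\), so

\[\frac{\lambda(x + 1)}{\lambda(x)} = \frac{e^{\beta_0 + \beta_1 (x + 1)}}{e^{\beta_0 + \beta_1 x}} = e^{\beta_1}\]

With \(\hat{\beta}_1 = -0.0047\), the predicted mean changes by a factor of \(e^{-0.0047} \approx 0.995\) for each additional year, consistent with the interpretations above.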
Is the coefficient of `age` statistically significant?
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 1.5499 | 0.0503 | 30.8290 | 0 | 1.4512 | 1.6482 |
age | -0.0047 | 0.0009 | -5.0258 | 0 | -0.0065 | -0.0029 |
\[H_0: \beta_1 = 0 \hspace{2mm} \text{ vs. } \hspace{2mm} H_a: \beta_1 \neq 0\]
. . .
Test statistic
\[Z = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} = \frac{-0.0047 - 0}{0.0009} = -5.026 \text{ (using exact values)}\]
. . .
P-value
\[P(|Z| > |-5.026|) = 5.01 \times 10^{-7} \approx 0\]
What are plausible values for the coefficient of `age`?
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 1.5499 | 0.0503 | 30.8290 | 0 | 1.4512 | 1.6482 |
age | -0.0047 | 0.0009 | -5.0258 | 0 | -0.0065 | -0.0029 |
95% confidence interval for the coefficient of `age`

\[\hat{\beta}_1 \pm Z^{*}\times SE(\hat{\beta}_1)\] \[-0.0047 \pm 1.96 \times 0.0009 = \mathbf{(-0.0065, -0.0029)}\]
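As a quick arithmetic check of the interval (a sketch using the rounded values above; `tidy(model1, conf.int = TRUE)` reports the interval directly):

-0.0047 + c(-1, 1) * 1.96 * 0.0009
[1] -0.006464 -0.002936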
. . .
Question: Interpret the interval in terms of the change in mean household size.
Which plot can best help us determine whether Model 1 is a good fit?
Data set pulled from BMLR Section 4.11.3.
Case study in BMLR - Section 4.10
p1 <- ggplot(data = hh_data, aes(x = age, y = total)) +
  geom_point() +
  labs(y = "Total household size",
       title = "Plot A")

p2 <- hh_data |>
  group_by(age) |>
  summarise(mean = mean(total)) |>
  ggplot(aes(x = age, y = mean)) +
  geom_point() +
  labs(y = "Empirical mean household size",
       title = "Plot B")

p3 <- hh_data |>
  group_by(age) |>
  summarise(log_mean = log(mean(total))) |>
  ggplot(aes(x = age, y = log_mean)) +
  geom_point() +
  labs(y = "Log empirical mean household size",
       title = "Plot C")

p1 + p2 + p3 + plot_annotation(tag_levels = 'A')
Model 2: Add a quadratic effect for age
hh_data <- hh_data |>
  mutate(age2 = age*age)

model2 <- glm(total ~ age + age2, data = hh_data, family = poisson)

tidy(model2, conf.int = TRUE) |>
  kable(digits = 4)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | -0.3325 | 0.1788 | -1.8594 | 0.063 | -0.6863 | 0.0148 |
age | 0.0709 | 0.0069 | 10.2877 | 0.000 | 0.0575 | 0.0845 |
age2 | -0.0007 | 0.0001 | -11.0578 | 0.000 | -0.0008 | -0.0006 |
Model 2: Add a quadratic effect for age
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | -0.3325 | 0.1788 | -1.8594 | 0.063 | -0.6863 | 0.0148 |
age | 0.0709 | 0.0069 | 10.2877 | 0.000 | 0.0575 | 0.0845 |
age2 | -0.0007 | 0.0001 | -11.0578 | 0.000 | -0.0008 | -0.0006 |
We can determine whether to keep \(age^2\) in the model in two ways:
1️⃣ Use the p-value (or confidence interval) for the coefficient (since we are adding a single term to the model)
2️⃣ Conduct a drop-in-deviance test
Deviance
The deviance is a measure of how much the observed data deviate from the model predictions.
It measures unexplained variability in the response variable (similar to SSE in linear regression)
Lower deviance means the model is a better fit to the data
. . .
We can calculate the “deviance residual” for each observation in the data (more on the formula later). Let \(\text{deviance residual}_i\) be the deviance residual for the \(i^{th}\) observation. Then

\[\text{deviance} = \sum_{i=1}^{n}(\text{deviance residual}_i)^2\]
. . .
Note: Deviance is also known as the “residual deviance”
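A quick numerical check of this identity for `model1` (a sketch; `residuals()` with `type = "deviance"` returns the deviance residuals):

sum(residuals(model1, type = "deviance")^2)  # sum of squared deviance residuals
model1$deviance                              # same value, stored by glm()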
Drop-in-Deviance Test
We can use a drop-in-deviance test to compare two models. To conduct the test
1️⃣ Compute the deviance for each model
2️⃣ Calculate the drop in deviance
\[\text{drop-in-deviance} = \text{deviance(reduced model)} - \text{deviance(larger model)}\]
. . .
3️⃣ If the reduced model is the true model \((H_0 \text{ is true})\), then \[\text{drop-in-deviance} \sim \chi_d^2\]
where \(d\) is the difference in degrees of freedom between the two models (i.e., the difference in number of terms)
Drop-in-deviance: Model 1 vs. Model 2
anova(model1, model2, test = "Chisq") |>
kable(digits = 3)
Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
---|---|---|---|---|
1498 | 2337.089 | NA | NA | NA |
1497 | 2200.944 | 1 | 136.145 | 0 |
. . .
Questions:
- Write the null and alternative hypotheses.
- What does the value 2337.089 tell you?
- What does the value 1 tell you?
- What is your conclusion?
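As a check on your conclusion, the test can be carried out by hand from the output above (a sketch using the reported deviances):

dev_drop <- 2337.089 - 2200.944               # deviance(model1) - deviance(model2)
pchisq(dev_drop, df = 1, lower.tail = FALSE)  # essentially zero, matching Pr(>Chi)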
Add `location` to the model?

Suppose we want to add `location` to the model, so we compare the following models:
Model A: \(\log(\lambda_i) = \beta_0 + \beta_1 ~ age_i + \beta_2 ~ age_i^2\)

Model B: \(\log(\lambda_i) = \beta_0 + \beta_1 ~ age_i + \beta_2 ~ age_i^2 + \beta_3 ~ Loc1_i + \beta_4 ~ Loc2_i + \beta_5 ~ Loc3_i + \beta_6 ~ Loc4_i\)
Question
Which of the following are reliable ways to determine if `location` should be added to the model?
. . .
- Drop-in-deviance test
- Use the p-value for each coefficient
- Likelihood ratio test
- Nested F Test
- BIC
Add `location` to the model?

model3 <- glm(total ~ age + age2 + location, data = hh_data, family = poisson)
. . .
Use a drop-in-deviance test to determine if Model 2 or Model 3 (with location) is a better fit for the data.
. . .
anova(model2, model3, test = "Chisq") |>
kable(digits = 3)
Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
---|---|---|---|---|
1497 | 2200.944 | NA | NA | NA |
1493 | 2187.800 | 4 | 13.144 | 0.011 |
The p-value is small (0.011 < 0.05), so we conclude that Model 3 is a better fit for the data.
Model 3
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | -0.3843 | 0.1821 | -2.1107 | 0.0348 | -0.7444 | -0.0306 |
age | 0.0704 | 0.0069 | 10.1900 | 0.0000 | 0.0569 | 0.0840 |
age2 | -0.0007 | 0.0001 | -10.9437 | 0.0000 | -0.0008 | -0.0006 |
locationDavaoRegion | -0.0194 | 0.0538 | -0.3605 | 0.7185 | -0.1250 | 0.0859 |
locationIlocosRegion | 0.0610 | 0.0527 | 1.1580 | 0.2468 | -0.0423 | 0.1641 |
locationMetroManila | 0.0545 | 0.0472 | 1.1542 | 0.2484 | -0.0378 | 0.1473 |
locationVisayas | 0.1121 | 0.0417 | 2.6853 | 0.0072 | 0.0308 | 0.1945 |
. . .
Does this model sufficiently explain the variability in the mean household size?
Goodness-of-fit
Pearson residuals
We can calculate two types of residuals for Poisson regression: Pearson residuals and deviance residuals
. . .
\[\text{Pearson residual}_i = \frac{\text{observed} - \text{predicted}}{\text{std. error}} = \frac{y_i - \hat{\lambda}_i}{\sqrt{\hat{\lambda}_i}}\]
. . .
- Similar interpretation as standardized residuals from linear regression
- Expect most to fall between -2 and 2
- Used to calculate the overdispersion parameter
Deviance residuals
The deviance residual indicates how much the observed data deviates from the fitted model
\[\text{deviance residual}_i = \text{sign}(y_i - \hat{\lambda}_i)\sqrt{2\Bigg[y_i\log\bigg(\frac{y_i}{\hat{\lambda}_i}\bigg) - (y_i - \hat{\lambda}_i)\Bigg]}\]
where
\[\text{sign}(y_i - \hat{\lambda}_i) = \begin{cases} 1 & \text{ if }(y_i - \hat{\lambda}_i) > 0 \\ -1 & \text{ if }(y_i - \hat{\lambda}_i) < 0 \\ 0 & \text{ if }(y_i - \hat{\lambda}_i) = 0 \end{cases}\]
Model 3: Residual plots
model3_aug_pearson <- augment(model3, type.residuals = "pearson")
model3_aug_deviance <- augment(model3, type.residuals = "deviance")
p1 <- ggplot(data = model3_aug_pearson, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Fitted values",
       y = "Pearson residuals",
       title = "Pearson residuals vs. fitted")

p2 <- ggplot(data = model3_aug_deviance, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Fitted values",
       y = "Deviance residuals",
       title = "Deviance residuals vs. fitted")

p1 + p2
Goodness-of-fit
Goal: Use the (residual) deviance to assess how much the predicted values differ from the observed values. Recall \(\text{deviance} = \sum_{i=1}^{n}(\text{deviance residual}_i)^2\)
If the model sufficiently fits the data, then
\[\text{deviance} \sim \chi^2_{df}\]
where \(df\) is the model’s residual degrees of freedom
. . .
- Question: What is the probability of observing a deviance larger than the one we’ve observed, given this model sufficiently fits the data?
. . .
\[P(\chi^2_{df} > \text{ deviance})\]
Model 3: Goodness-of-fit calculations
model3$deviance
[1] 2187.8

model3$df.residual
[1] 1493
pchisq(model3$deviance, model3$df.residual, lower.tail = FALSE)
[1] 3.153732e-29
The probability of observing a deviance greater than 2187.8 is \(\approx 0\), so there is significant evidence of lack-of-fit.
Lack-of-fit
There are a few potential reasons for lack-of-fit:
- Missing important interactions or higher-order terms
- Missing important variables (perhaps this means a more comprehensive data set is required)
- There could be extreme observations causing the deviance to be larger than expected (assess based on the residual plots)
- There could be a problem with the Poisson model
  - May need more flexibility in the model to handle overdispersion
Overdispersion
Overdispersion: There is more variability in the response than what is implied by the Poisson model
Overall
mean | var |
---|---|
3.685 | 5.534 |
by Location
location | mean | var |
---|---|---|
CentralLuzon | 3.402 | 4.152 |
DavaoRegion | 3.390 | 4.723 |
IlocosRegion | 3.586 | 5.402 |
MetroManila | 3.707 | 4.863 |
Visayas | 3.902 | 6.602 |
hh_data |>
  summarise(mean = mean(total), var = var(total)) |>
  kable(digits = 3)

hh_data |>
  group_by(location) |>
  summarise(mean = mean(total), var = var(total)) |>
  kable(digits = 3)
Why overdispersion matters
If there is overdispersion, then there is more variation in the response than what’s implied by a Poisson model. This means
❌ The standard errors of the model coefficients are artificially small
❌ The p-values are artificially small
❌ This could lead to models that are more complex than what is needed
. . .
We can take overdispersion into account by
- inflating standard errors by multiplying them by a dispersion factor
- using a negative-binomial regression model
Quasi-Poisson
Dispersion parameter
The dispersion parameter is represented by \(\phi\) and can be estimated by

\[\hat{\phi} = \frac{\sum_{i=1}^{n}(\text{Pearson residual}_i)^2}{n - p} \approx \frac{\text{deviance}}{\text{residual df}}\]

where \(p\) is the number of terms in the model (including the intercept)
If there is no overdispersion, \(\hat{\phi} \approx 1\)

If there is overdispersion, \(\hat{\phi} > 1\)
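A sketch of the calculation for `model3` (both common estimators shown; they typically won't match exactly):

sum(residuals(model3, type = "pearson")^2) / model3$df.residual  # Pearson-based
model3$deviance / model3$df.residual  # deviance-based: 2187.8 / 1493, about 1.47
# both are noticeably greater than 1 here, consistent with overdispersion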
Accounting for dispersion in the model
We inflate the standard errors of the coefficients by multiplying the variance by \(\hat{\phi}\)

. . .

\[SE_{Q}(\hat{\beta}) = \sqrt{\hat{\phi}} \times SE(\hat{\beta})\]

- “Q” stands for quasi-Poisson, since this is an ad-hoc solution
- The process for model building and model comparison is called quasilikelihood (similar to likelihood but without an exact underlying distribution)
Model 3: Quasi-Poisson model
model3_q <- glm(total ~ age + age2 + location, data = hh_data,
                family = quasipoisson)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | -0.3843 | 0.2166 | -1.7744 | 0.0762 | -0.8134 | 0.0358 |
age | 0.0704 | 0.0082 | 8.5665 | 0.0000 | 0.0544 | 0.0866 |
age2 | -0.0007 | 0.0001 | -9.2000 | 0.0000 | -0.0009 | -0.0006 |
locationDavaoRegion | -0.0194 | 0.0640 | -0.3030 | 0.7619 | -0.1451 | 0.1058 |
locationIlocosRegion | 0.0610 | 0.0626 | 0.9735 | 0.3304 | -0.0620 | 0.1837 |
locationMetroManila | 0.0545 | 0.0561 | 0.9703 | 0.3320 | -0.0552 | 0.1649 |
locationVisayas | 0.1121 | 0.0497 | 2.2574 | 0.0241 | 0.0156 | 0.2103 |
Poisson vs. Q-Poisson
Poisson
term | estimate | std.error |
---|---|---|
(Intercept) | -0.3843 | 0.1821 |
age | 0.0704 | 0.0069 |
age2 | -0.0007 | 0.0001 |
locationDavaoRegion | -0.0194 | 0.0538 |
locationIlocosRegion | 0.0610 | 0.0527 |
locationMetroManila | 0.0545 | 0.0472 |
locationVisayas | 0.1121 | 0.0417 |
Quasi-Poisson
term | estimate | std.error |
---|---|---|
(Intercept) | -0.3843 | 0.2166 |
age | 0.0704 | 0.0082 |
age2 | -0.0007 | 0.0001 |
locationDavaoRegion | -0.0194 | 0.0640 |
locationIlocosRegion | 0.0610 | 0.0626 |
locationMetroManila | 0.0545 | 0.0561 |
locationVisayas | 0.1121 | 0.0497 |
Q-Poisson: Inference for coefficients
term | estimate | std.error |
---|---|---|
(Intercept) | -0.3843 | 0.2166 |
age | 0.0704 | 0.0082 |
age2 | -0.0007 | 0.0001 |
locationDavaoRegion | -0.0194 | 0.0640 |
locationIlocosRegion | 0.0610 | 0.0626 |
locationMetroManila | 0.0545 | 0.0561 |
locationVisayas | 0.1121 | 0.0497 |
Test statistic \[t = \frac{\hat{\beta} - 0}{SE_{Q}(\hat{\beta})} \sim t_{n-p}\]
Negative binomial regression model
Negative binomial regression model
Another approach to handle overdispersion is to use a negative binomial regression model
- This has more flexibility than the quasi-Poisson model, because there is a new parameter in addition to \(\lambda\)
. . .
Let \(Y\) be a negative binomial random variable, \(Y\sim NegBinom(r, p)\), then
\[P(Y = y_i) = {y_i + r - 1 \choose r - 1}(1-p)^{y_i}p^r \hspace{5mm} y_i = 0, 1, 2, \ldots, \infty\]
Negative binomial regression model
- Main idea: Generate a \(\lambda\) for each observation (household) and generate a count using the Poisson random variable with parameter \(\lambda\)
- Makes the counts more dispersed than with a single parameter
- Think of it as a Poisson model such that \(\lambda\) is also random
. . .
\(\begin{aligned} &\text{If }Y|\lambda \sim Poisson(\lambda)\\ &\text{ and } \lambda \sim Gamma\bigg(r, \frac{1-p}{p}\bigg)\\ &\text{ then } Y \sim NegBinom(r, p)\end{aligned}\)
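A small simulation illustrating the mixture; a sketch with arbitrary values \(r = 3\) and \(p = 0.4\), where the Gamma has shape \(r\) and scale \((1-p)/p\):

set.seed(2)
r <- 3; p <- 0.4
lambda <- rgamma(100000, shape = r, scale = (1 - p) / p)  # a lambda per observation
y <- rpois(100000, lambda)                                # Poisson count given lambda
c(mean = mean(y), var = var(y))   # variance exceeds the mean (overdispersion)
c(mean = mean(rnbinom(100000, size = r, prob = p)),       # matches direct negative
  var = var(rnbinom(100000, size = r, prob = p)))         # binomial draws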
Negative binomial regression in R
library(MASS)

model3_nb <- glm.nb(total ~ age + age2 + location, data = hh_data)

tidy(model3_nb) |>
  kable(digits = 4)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.3753 | 0.2076 | -1.8081 | 0.0706 |
age | 0.0699 | 0.0079 | 8.8981 | 0.0000 |
age2 | -0.0007 | 0.0001 | -9.5756 | 0.0000 |
locationDavaoRegion | -0.0219 | 0.0625 | -0.3501 | 0.7262 |
locationIlocosRegion | 0.0577 | 0.0615 | 0.9391 | 0.3477 |
locationMetroManila | 0.0562 | 0.0551 | 1.0213 | 0.3071 |
locationVisayas | 0.1104 | 0.0487 | 2.2654 | 0.0235 |
Offsets
Data: Airbnbs in NYC
The data set NYCairbnb-sample.csv contains information about a random sample of 1000 Airbnbs in New York City. It is a subset of the data on 40628 Airbnbs scraped by Awad et al. (2017).
Variables:

- `number_of_reviews`: Number of reviews for the unit on Airbnb (proxy for number of rentals)
- `price`: Price per night in US dollars
- `room_type`: Entire home/apartment, private room, or shared room
- `days`: Number of days the unit has been listed (date when info scraped - date when unit first listed on Airbnb)
Data: Airbnbs in NYC
<- read_csv("data/NYCairbnb-sample.csv") |>
airbnb ::select(id, number_of_reviews, days, room_type, price) dplyr
id | number_of_reviews | days | room_type | price |
---|---|---|---|---|
15756544 | 16 | 1144 | Private room | 120 |
14218251 | 15 | 471 | Private room | 89 |
21644 | 0 | 2600 | Private room | 89 |
13667835 | 1 | 283 | Entire home/apt | 150 |
265912 | 0 | 1970 | Entire home/apt | 89 |
Goal: Use the price and room type of Airbnbs to describe variation in the number of reviews (a proxy for number of rentals).
EDA
p1 <- ggplot(data = airbnb, aes(x = number_of_reviews)) +
  geom_histogram() +
  labs(x = "Number of reviews",
       title = "Distribution of number of reviews")

p2 <- airbnb |>
  filter(price <= 2000) |>
  group_by(price) |>
  summarise(log_mean = log(mean(number_of_reviews))) |>
  ggplot(aes(x = price, y = log_mean)) +
  geom_point(alpha = 0.7) +
  geom_smooth() +
  labs(x = "Price in US dollars",
       y = "Log(mean # reviews)",
       title = "Log mean # of reviews vs. price",
       subtitle = "Airbnbs $2000 or less")

p3 <- airbnb |>
  filter(price <= 500) |>
  group_by(price) |>
  summarise(log_mean = log(mean(number_of_reviews))) |>
  ggplot(aes(x = price, y = log_mean)) +
  geom_point(alpha = 0.7) +
  geom_smooth() +
  labs(x = "Price in US dollars",
       y = "Log(mean # reviews)",
       title = "Log mean # of reviews vs. price",
       subtitle = "Airbnbs $500 or less")

p1 / (p2 + p3)
EDA
Overall
mean | var |
---|---|
15.916 | 765.969 |
by Room type
room_type | mean | var |
---|---|---|
Entire home/apt | 16.283 | 760.348 |
Private room | 15.608 | 786.399 |
Shared room | 15.028 | 605.971 |
Considerations for modeling
We would like to fit the Poisson regression model
\[\log(\lambda) = \beta_0 + \beta_1 ~ price + \beta_2 ~ room\_type1 + \beta_3 ~ room\_type2\]
. . .
Questions:

- Based on the EDA, what are some potential issues we may want to address in the model building?
- Suppose any model fit issues are addressed. What are some potential limitations to the conclusions and interpretations from the model?
Offset
Sometimes counts are not directly comparable because the observations differ based on some characteristic directly related to the counts, i.e. the sampling effort.
An offset can be used to adjust for differences in sampling effort.
. . .
- Let \(x_{offset}\) be the variable that accounts for differences in sampling effort, then \(\log(x_{offset})\) will be added to the model.
. . .
\(\log(\lambda_i) = \beta_0 + \beta_1 ~ x_{i1} + \beta_2 ~ x_{i2} + ... + \beta_p ~ x_{ip} + \log(x_{offset_i})\)
- The offset is a term in the model with coefficient always equal to 1.
Adding an offset to the Airbnb model
We will add the offset \(\log(days)\) to the model. This accounts for the fact that we would expect Airbnbs that have been listed longer to have more reviews.
\[\log(\lambda_i) = \beta_0 + \beta_1 ~ price_i + \beta_2 ~ room\_type1_i + \beta_3 ~ room\_type2_i + \log(days_i)\]
Note: The model still describes \(\log(\lambda_i)\), the log mean number of reviews
Detail on the offset
We want to adjust for the number of days, so we are interested in \(\frac{reviews}{days}\).
. . .
Given \(\lambda\) is the mean number of reviews
\[\log\Big(\frac{\lambda}{days}\Big) = \beta_0 + \beta_1 ~ price + \beta_2 ~ room\_type1 + \beta_3 ~ room\_type2\]
. . .
\[\Rightarrow \log({\lambda}) - \log(days) = \beta_0 + \beta_1 ~ price + \beta_2 ~ room\_type1 + \beta_3 ~ room\_type2\]
. . .
\[\Rightarrow \log({\lambda}) = \beta_0 + \beta_1 ~ price + \beta_2 ~ room\_type1 + \beta_3 ~ room\_type2 + \log(days)\]
Airbnb model in R
airbnb_model <- glm(number_of_reviews ~ price + room_type,
                    data = airbnb, family = poisson,
                    offset = log(days))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -4.1351 | 0.0170 | -243.1397 | 0 |
price | -0.0005 | 0.0001 | -7.0952 | 0 |
room_typePrivate room | -0.0994 | 0.0174 | -5.6986 | 0 |
room_typeShared room | 0.2436 | 0.0452 | 5.3841 | 0 |
. . .
The coefficient for \(\log(days)\) is fixed at 1, so it is not in the model output.
Interpretations
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -4.1351 | 0.0170 | -243.1397 | 0 |
price | -0.0005 | 0.0001 | -7.0952 | 0 |
room_typePrivate room | -0.0994 | 0.0174 | -5.6986 | 0 |
room_typeShared room | 0.2436 | 0.0452 | 5.3841 | 0 |
Questions:

- Interpret the coefficient of `price`.
- Interpret the coefficient of `room_typePrivate room`.
Goodness-of-fit
\[\begin{aligned}&H_0: \text{ The model is a good fit for the data}\\ &H_a: \text{ There is significant lack of fit}\end{aligned}\]
pchisq(airbnb_model$deviance, airbnb_model$df.residual, lower.tail = F)
[1] 0
. . .
There is evidence of significant lack of fit in the model. Therefore, more models would need to be explored that address the issues mentioned earlier.
. . .
In practice we would assess goodness-of-fit and finalize the model before any interpretations and conclusions.
Zero-inflated Poisson model
Data: Weekend drinking
The data weekend-drinks.csv contains information from a survey of 77 students in an introductory statistics course on a dry campus.
Variables:

- `drinks`: Number of drinks they had in the past weekend
- `off_campus`: 1 - lives off campus, 0 otherwise
- `first_year`: 1 - student is a first-year, 0 otherwise
- `sex`: f - student identifies as female, m - student identifies as male
. . .
Goal: Explore factors related to drinking behavior on a dry campus.
EDA: Response variable
Observed vs. expected response
. . .
What does it mean to be a “zero” in this data?
Two types of zeros
There are two types of zeros
- Those who happen to have a zero in the data set (people who drink but happened to not drink last weekend)
- Those who will always report a value of zero (non-drinkers)
- These are called true zeros
. . .
We introduce a new parameter \(\alpha\) for the proportion of true zeros, then fit a model that has two components:
. . .
1️⃣ The association between mean number of drinks and various characteristics among those who drink
2️⃣ The estimated proportion of non-drinkers
Zero-inflated Poisson model
Zero-inflated Poisson (ZIP) model has two parts
. . .
1️⃣ Association, among those who drink, between the mean number of drinks and predictors sex and off campus residence
. . .
\[\log(\lambda) = \beta_0 + \beta_1 ~ off\_campus + \beta_2 ~ sex\] where \(\lambda\) is the mean number of drinks among those who drink
. . .
2️⃣ Probability that a student does not drink
\[\text{logit}(\alpha) = \log\Big(\frac{\alpha}{1- \alpha}\Big) = \beta_0 + \beta_1 ~ first\_year\]
where \(\alpha\) is the proportion of non-drinkers
. . .
Note: The same variables can be used in each component
Details of the ZIP model
- The ZIP model is a special case of a latent variable model
- A type of mixture model where observations for one or more groups occur together but the group membership is unknown
- Zero-inflated models are a common type of mixture model; they apply beyond Poisson regression
ZIP model in R
Fit ZIP models using the `zeroinfl` function from the pscl R package.

library(pscl)

drinks_zip <- zeroinfl(drinks ~ off_campus + sex | first_year,
                       data = drinks)
drinks_zip
Call:
zeroinfl(formula = drinks ~ off_campus + sex | first_year, data = drinks)
Count model coefficients (poisson with log link):
(Intercept) off_campus sexm
0.7543 0.4159 1.0209
Zero-inflation model coefficients (binomial with logit link):
(Intercept) first_year
-0.6036 1.1364
Tidy output
Use the `tidy` function from the poissonreg package for tidy model output.
library(poissonreg)
. . .
Mean number of drinks among those who drink
tidy(drinks_zip, type = "count") %>% kable(digits = 3)
term | type | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|
(Intercept) | count | 0.754 | 0.144 | 5.238 | 0.000 |
off_campus | count | 0.416 | 0.206 | 2.021 | 0.043 |
sexm | count | 1.021 | 0.175 | 5.827 | 0.000 |
Tidy output
Proportion of non-drinkers
tidy(drinks_zip, type = "zero") %>% kable(digits = 3)
term | type | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|
(Intercept) | zero | -0.604 | 0.311 | -1.938 | 0.053 |
first_year | zero | 1.136 | 0.610 | 1.864 | 0.062 |
Interpreting the model coefficients
term | type | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|
(Intercept) | count | 0.754 | 0.144 | 5.238 | 0.000 |
off_campus | count | 0.416 | 0.206 | 2.021 | 0.043 |
sexm | count | 1.021 | 0.175 | 5.827 | 0.000 |
Questions:

- Interpret the intercept.
- Interpret the coefficients of `off_campus` and `sexm`.
Estimated proportion of zeros
term | type | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|
(Intercept) | zero | -0.604 | 0.311 | -1.938 | 0.053 |
first_year | zero | 1.136 | 0.610 | 1.864 | 0.062 |
Questions:
Based on the model…
- What is the probability a first-year student is a non-drinker?
- What is the probability an upperclass student (sophomore, junior, senior) is a non-drinker? (see the sketch below)
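A sketch of these computations using the estimates from the zero component above (`plogis()` is the inverse logit):

plogis(-0.604 + 1.136)  # first-year student: roughly 0.63
plogis(-0.604)          # upperclass student: roughly 0.35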
These are just a few of the many models…
- Use the Vuong Test to compare the fit of the ZIP model to a regular Poisson model
- Why can’t we use a drop-in-deviance test?
- We’ve just discussed the ZIP model here, but there are other model applications beyond the standard Poisson regression model (e.g., hurdle models, Zero-inflated Negative Binomial models, etc.)
Likelihoods for Poisson models
Estimating coefficients in Poisson model
Least squares estimation would not work because the normality and equal variance assumptions don’t hold for Poisson regression
Maximum likelihood estimation is used to estimate the coefficients of Poisson regression.
The likelihood is the product of the probabilities for the \(n\) independent observations in the data
Likelihood for regular Poisson model
Let’s go back to the example about household size in the Philippines. We will focus on the model using the main effect of age to understand variability in mean household size.
. . .
Suppose the first five observations have household sizes of 4, 2, 8, 6, and 1. Then the likelihood is
. . .
\(L = \frac{e^{-\lambda_1}\lambda_1^4}{4!} * \frac{e^{-\lambda_2}\lambda_2^2}{2!} * \frac{e^{-\lambda_3}\lambda_3^8}{8!} * \frac{e^{-\lambda_4}\lambda_4^6}{6!} * \frac{e^{-\lambda_5}\lambda_5^1}{1!}\)
Likelihood for regular Poisson model
We will use the log likelihood to make finding the MLE easier
. . .
\(\begin{aligned}\log(L) &= -\lambda_1 + 4\log(\lambda_1) - \lambda_2 + 2\log(\lambda_2) - \lambda_3 + 8\log(\lambda_3)\\ & -\lambda_4 + 6 \log(\lambda_4) - \lambda_5 + \log(\lambda_5) + C \end{aligned}\)
where

- \(\lambda_i\) is the mean number in the household, which depends on \(x_i\)
- \(C = -[\log(4!) + \log(2!) + \log(8!) + \log(1!)]- \log(6!)\) rewritten in full: \(C = -[\log(4!) + \log(2!) + \log(8!) + \log(6!) + \log(1!)]\)
Likelihood for regular Poisson model
Given the age of the head of the household, we fit the model
\[\log(\lambda_i) = \beta_0 + \beta_1~age_i\]
Then we replace each \(\lambda_i\) in \(\log(L)\) with \(e^{\beta_0 + \beta_1~age_i}\).
. . .
Suppose the first five observations have ages \(X = (32, 21, 55, 44, 28)\). Then
. . .
\[\begin{aligned} \log(L) &= \left[-e^{\beta_0 + 32\beta_1} + 4(\beta_0 + 32\beta_1)\right] + \left[-e^{\beta_0 + 21\beta_1} + 2(\beta_0 + 21\beta_1)\right] \\ &+ \left[-e^{\beta_0 + 55\beta_1} + 8(\beta_0 + 55\beta_1)\right] + \left[-e^{\beta_0 + 44\beta_1} + 6(\beta_0 + 44\beta_1)\right] \\ &+ \left[-e^{\beta_0 + 28\beta_1} + (\beta_0 + 28\beta_1)\right] + C \end{aligned}\]
. . .
Use a search algorithm to find the values of \(\beta_0\) and \(\beta_1\) that maximize the expression above.
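A minimal sketch of that search in R for the five observations above (`optim()` minimizes, so we pass the negative log-likelihood; `glm()` performs the same maximization via iteratively reweighted least squares):

y   <- c(4, 2, 8, 6, 1)
age <- c(32, 21, 55, 44, 28)

neg_loglik <- function(beta) {
  lambda <- exp(beta[1] + beta[2] * age)   # Poisson mean for each observation
  -sum(dpois(y, lambda, log = TRUE))       # negative log-likelihood
}

optim(c(0, 0), neg_loglik)$par             # numerical MLEs of beta0, beta1
coef(glm(y ~ age, family = poisson))       # matches (up to numerical error)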
Probabilities under ZIP model
There are three different types of observations in the data:
- Observed zero and will always be 0 (true zeros)
- Observed 0 but will not always be 0
- Observed non-zero count and will not always be 0
Probabilities under ZIP model
True zeros
\[P(0 | \text{true zero})= \alpha\]
. . .
Observed 0 but will not always be 0
\[P(0 | \text{not always zero}) = (1 - \alpha)\frac{e^{-\lambda}\lambda^0}{0!}\]
. . .
Did not observe 0 and will not always be 0

\[P(y_i | \text{not always zero}) = (1 - \alpha)\frac{e^{-\lambda}\lambda^{y_i}}{y_i!}\]
Probabilities under ZIP model
Putting this all together: let \(y_i\) be an observed response. Then
\[P(Y_i = y_i | x_i) = \begin{cases} \alpha + (1 - \alpha)e^{-\lambda_i} && \text{ if } y_i = 0 \\ (1 - \alpha)\frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!} && \text{ if } y_i > 0 \end{cases}\]
. . .
Recall from our example,
\[\lambda_i = e^{\beta_0 + \beta_1~off\_campus_i + \beta_2 ~ sex_i}\]
\[\alpha_i = \frac{e^{\beta_{0\alpha} + \beta_{1\alpha} ~ first\_year_i}}{1 + e^{\beta_{0\alpha} + \beta_{1\alpha} ~ first\_year_i}}\]
- Plug \(\lambda_i\) and \(\alpha_i\) into the equation above to obtain the likelihood function
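A sketch of this likelihood as an R function (the ordering of the parameter vector `theta` is a hypothetical choice here, not the pscl implementation; maximizing it with `optim()` should approximately recover the `zeroinfl()` estimates):

zip_loglik <- function(theta, y, off_campus, sexm, first_year) {
  lambda <- exp(theta[1] + theta[2] * off_campus + theta[3] * sexm)
  alpha  <- plogis(theta[4] + theta[5] * first_year)
  sum(ifelse(y == 0,
             log(alpha + (1 - alpha) * exp(-lambda)),        # observed zeros
             log(1 - alpha) + dpois(y, lambda, log = TRUE))) # positive counts
}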
Acknowledgements
These slides are based on content in BMLR: Chapter 4
Initial versions of the slides are by Dr. Maria Tackett, Duke University