BMLR Chapter 4
Cornell College
STA 363 Spring 2025 Block 8
Describe properties of the Poisson random variable
Write the Poisson regression model
Describe how the Poisson regression differs from least-squares regression
Interpret the coefficients for the Poisson regression model
Compare two Poisson regression models
Define and calculate residuals for the Poisson regression model
Use Goodness-of-fit to assess model fit
Identify overdispersion
Apply modeling approaches to deal with overdispersion
Explore properties of negative binomial versus Poisson response
Fit and interpret models with offset to adjust for differences in sampling effort
Fit and interpret Zero-inflated Poisson models
Write likelihood for Poisson and Zero-inflated Poisson model
Does the number of employers conducting on-campus interviews during a year differ for public and private colleges?
Does the daily number of asthma-related visits to an Emergency Room differ depending on air pollution indices?
Does the number of paint defects per square foot of wall differ based on the years of experience of the painter?
Each response variable is a count per unit of time or space.
Let \(Y\) be the number of events in a given unit of time or space. Then \(Y\) can be modeled using a Poisson distribution
\[P(Y=y) = \frac{e^{-\lambda}\lambda^y}{y!} \hspace{10mm} y=0,1,2,\ldots, \infty\]
The annual number of earthquakes registering at least 2.5 on the Richter Scale and having an epicenter within 40 miles of downtown Memphis follows a Poisson distribution with mean 6.5. What is the probability there will be 3 or fewer such earthquakes next year?
\[P(Y \leq 3) = P(Y = 0) + P(Y = 1) + P(Y = 2) + P(Y = 3)\]
\[ = \frac{e^{-6.5}6.5^0}{0!} + \frac{e^{-6.5}6.5^1}{1!} + \frac{e^{-6.5}6.5^2}{2!} + \frac{e^{-6.5}6.5^3}{3!}\]
\[ = 0.112\]
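As a quick sanity check of this arithmetic, the cumulative probability can be computed directly from the Poisson pmf; the sketch below uses only Python's standard library.

```python
import math

def poisson_pmf(y, lam):
    """P(Y = y) for a Poisson random variable with mean lam."""
    return math.exp(-lam) * lam**y / math.factorial(y)

# P(Y <= 3) when lam = 6.5: sum the pmf over y = 0, 1, 2, 3
p_at_most_3 = sum(poisson_pmf(y, 6.5) for y in range(4))
print(round(p_at_most_3, 3))  # -> 0.112
```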
The data: Household size in the Philippines
The data fHH1.csv come from the 2015 Family Income and Expenditure Survey conducted by the Philippine Statistics Authority.
Goal: Understand the association between household size and various characteristics of the household
Response:

- total: Number of people in the household other than the head

Predictors:

- location: Where the house is located
- age: Age of the head of household
- roof: Type of roof on the residence (proxy for wealth)

Other variables:

- numLT5: Number in the household under 5 years old
location | age | total | numLT5 | roof |
---|---|---|---|---|
CentralLuzon | 65 | 0 | 0 | Predominantly Strong Material |
MetroManila | 75 | 3 | 0 | Predominantly Strong Material |
DavaoRegion | 54 | 4 | 0 | Predominantly Strong Material |
Visayas | 49 | 3 | 0 | Predominantly Strong Material |
MetroManila | 74 | 3 | 0 | Predominantly Strong Material |
mean | var |
---|---|
3.685 | 5.534 |
The goal is to model \(\lambda\), the expected number of people in the household (other than the head), as a function of the predictors (covariates)
We might be tempted to try a linear model \[\lambda_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_px_{ip}\]
This model won’t work because a linear combination of the predictors can produce negative values, while the Poisson mean \(\lambda_i\) must be positive. Instead, we model the log of the mean:
If \(Y_i \sim Poisson\) with \(\lambda = \lambda_i\) for the given values \(x_{i1}, \ldots, x_{ip}\), then
\[\log(\lambda_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}\]
Each observation can have a different value of \(\lambda\) based on its value of the predictors \(x_1, \ldots, x_p\)
\(\lambda\) determines the mean and variance, so we don’t need to estimate a separate error term
Poisson response: The response variable is a count per unit of time or space, described by a Poisson distribution, at each level of the predictor(s)
Independence: The observations must be independent of one another
Mean = Variance: The mean must equal the variance
Linearity: The log of the mean rate, \(\log(\lambda)\), must be a linear function of the predictor(s)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 1.5499 | 0.0503 | 30.8290 | 0 |
age | -0.0047 | 0.0009 | -5.0258 | 0 |
\[\log(\hat{\lambda}) = 1.5499 - 0.0047 ~ age\]
Question: The coefficient for age is -0.0047. Interpret this coefficient in context.
Each additional year older the head of household is, the estimated average log of the number of people in the household is .0047 lower.
Each additional year older the head of household is, the estimated average number of people in the household reduces by 0.5%.
Each additional year older the head of household is, the estimated average number of people in the household changes by a factor of 0.995.
For every 10 years older the head of household is, the estimated average number of people in the household reduces by 5%.
Let’s derive the change in predicted mean when we go from \(x\) to \(x+1\)
(see boardwork)
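One way to see the result numerically: under the fitted model, \(\hat{\lambda}(x+1)/\hat{\lambda}(x) = e^{\hat{\beta}_1}\) no matter the value of \(x\). A small Python sketch using the fitted intercept and slope from the table above:

```python
import math

b0, b1 = 1.5499, -0.0047  # fitted intercept and slope for age

def lam_hat(age):
    """Predicted mean household size for a given age of head of household."""
    return math.exp(b0 + b1 * age)

# The ratio of predicted means one year apart equals e^{b1} at any age
ratio = lam_hat(51) / lam_hat(50)
print(round(ratio, 4))               # ~0.9953 per additional year
print(round(math.exp(10 * b1), 3))   # ~0.954 per 10 years (about a 5% reduction)
```

This is why the multiplicative interpretations above (a factor of 0.995 per year, roughly 5% per decade) are equivalent.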
Is the coefficient of age statistically significant?

term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 1.5499 | 0.0503 | 30.8290 | 0 | 1.4512 | 1.6482 |
age | -0.0047 | 0.0009 | -5.0258 | 0 | -0.0065 | -0.0029 |
\[H_0: \beta_1 = 0 \hspace{2mm} \text{ vs. } \hspace{2mm} H_a: \beta_1 \neq 0\]
Test statistic
\[Z = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} = \frac{-0.0047 - 0}{0.0009} = -5.026 \text{ (using exact values)}\]
P-value
\[P(|Z| > |-5.026|) = 5.01 \times 10^{-7} \approx 0\]
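The two-sided p-value can be reproduced from the test statistic with the standard library's complementary error function, since \(P(|Z| > z) = \operatorname{erfc}(z/\sqrt{2})\) for a standard normal:

```python
import math

z = 5.0258  # test statistic for age (sign dropped; the test is two-sided)

# Two-sided p-value under a standard normal: P(|Z| > z) = erfc(z / sqrt(2))
p_value = math.erfc(z / math.sqrt(2))
print(f"{p_value:.2e}")  # ~5.0e-07
```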
What is the 95% confidence interval for the coefficient of age?

term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 1.5499 | 0.0503 | 30.8290 | 0 | 1.4512 | 1.6482 |
age | -0.0047 | 0.0009 | -5.0258 | 0 | -0.0065 | -0.0029 |
95% confidence interval for the coefficient of age
\[\hat{\beta}_1 \pm Z^{*}\times SE(\hat{\beta}_1)\] \[-0.0047 \pm 1.96 \times 0.0009 = \mathbf{(-0.0065, -0.0029)}\]
Question: Interpret the interval in terms of the change in mean household size.
p1 <- ggplot(data = hh_data, aes(x = age, y = total)) +
geom_point() +
labs(y = "Total household size",
title = "Plot A")
p2 <- hh_data |>
group_by(age) |>
summarise(mean = mean(total)) |>
ggplot(aes(x = age, y = mean))+
geom_point() +
labs(y = "Empirical mean household size",
title = "Plot B")
p3 <- hh_data |>
group_by(age) |>
summarise(log_mean = log(mean(total))) |>
ggplot(aes(x = age, y = log_mean)) +
geom_point() +
labs(y = "Log empirical mean household size",
title = "Plot C")
p1 + p2 + p3 + plot_annotation(tag_levels = 'A')
Adding a quadratic term for age
hh_data <- hh_data |>
mutate(age2 = age*age)
model2 <- glm(total ~ age + age2, data = hh_data, family = poisson)
tidy(model2, conf.int = TRUE) |>
kable(digits = 4)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | -0.3325 | 0.1788 | -1.8594 | 0.063 | -0.6863 | 0.0148 |
age | 0.0709 | 0.0069 | 10.2877 | 0.000 | 0.0575 | 0.0845 |
age2 | -0.0007 | 0.0001 | -11.0578 | 0.000 | -0.0008 | -0.0006 |
Should we keep \(age^2\) in the model?
We can determine whether to keep \(age^2\) in the model in two ways:
1️⃣ Use the p-value (or confidence interval) for the coefficient (since we are adding a single term to the model)
2️⃣ Conduct a drop-in-deviance test
The deviance measures how much the observed data deviate from the model’s predictions.

It is a measure of the unexplained variability in the response variable (similar to SSE in linear regression).

Lower deviance means the model is a better fit to the data.

We can calculate the “deviance residual” for each observation in the data (more on the formula later). Let \(\text{deviance residual}_i\) be the deviance residual for the \(i^{th}\) observation; then

\[\text{deviance} = \sum_{i=1}^{n}(\text{deviance residual}_i)^2\]
Note: Deviance is also known as the “residual deviance”
We can use a drop-in-deviance test to compare two models. To conduct the test
1️⃣ Compute the deviance for each model
2️⃣ Calculate the drop in deviance
\[\text{drop-in-deviance = Deviance(reduced model) - Deviance(larger model)}\]
3️⃣ Given the reduced model is the true model \((H_0 \text{ true})\), then \[\text{drop-in-deviance} \sim \chi_d^2\]
where \(d\) is the difference in degrees of freedom between the two models (i.e., the difference in number of terms)
Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
---|---|---|---|---|
1498 | 2337.089 | NA | NA | NA |
1497 | 2200.944 | 1 | 136.145 | 0 |
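For this comparison the drop in deviance is 2337.089 - 2200.944 = 136.145 with \(d = 1\). For a single degree of freedom, a \(\chi^2\) random variable is a squared standard normal, so its tail probability can be computed with only the standard library:

```python
import math

dev_reduced, dev_larger = 2337.089, 2200.944  # residual deviances from the table
drop = dev_reduced - dev_larger               # drop-in-deviance statistic, df = 1

# For df = 1: P(chi2_1 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(drop / 2))
print(round(drop, 3), p_value)  # 136.145 and a p-value of essentially 0
```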
Should we add location to the model?

Suppose we want to add location to the model, so we compare the following models:
Model A: \(\log(\lambda_i) = \beta_0 + \beta_1 ~ age_i + \beta_2 ~ age_i^2\)

Model B: \(\log(\lambda_i) = \beta_0 + \beta_1 ~ age_i + \beta_2 ~ age_i^2 + \beta_3 ~ Loc1_i + \beta_4 ~ Loc2_i + \beta_5 ~ Loc3_i + \beta_6 ~ Loc4_i\)
Which of the following are reliable ways to determine if location should be added to the model?

Use a drop-in-deviance test to determine if Model A or Model B (with location) is a better fit for the data.
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | -0.3843 | 0.1821 | -2.1107 | 0.0348 | -0.7444 | -0.0306 |
age | 0.0704 | 0.0069 | 10.1900 | 0.0000 | 0.0569 | 0.0840 |
age2 | -0.0007 | 0.0001 | -10.9437 | 0.0000 | -0.0008 | -0.0006 |
locationDavaoRegion | -0.0194 | 0.0538 | -0.3605 | 0.7185 | -0.1250 | 0.0859 |
locationIlocosRegion | 0.0610 | 0.0527 | 1.1580 | 0.2468 | -0.0423 | 0.1641 |
locationMetroManila | 0.0545 | 0.0472 | 1.1542 | 0.2484 | -0.0378 | 0.1473 |
locationVisayas | 0.1121 | 0.0417 | 2.6853 | 0.0072 | 0.0308 | 0.1945 |
Does this model sufficiently explain the variability in the mean household size?
We can calculate two types of residuals for Poisson regression: Pearson residuals and deviance residuals
\[\text{Pearson residual}_i = \frac{\text{observed} - \text{predicted}}{\text{std. error}} = \frac{y_i - \hat{\lambda}_i}{\sqrt{\hat{\lambda}_i}}\]
Similar interpretation as standardized residuals from linear regression
Expect most to fall between -2 and 2
Used to calculate overdispersion parameter
The deviance residual indicates how much the observed data deviates from the fitted model
\[\text{deviance residual}_i = \text{sign}(y_i - \hat{\lambda}_i)\sqrt{2\Bigg[y_i\log\bigg(\frac{y_i}{\hat{\lambda}_i}\bigg) - (y_i - \hat{\lambda}_i)\Bigg]}\]
where
\[\text{sign}(y_i - \hat{\lambda}_i) = \begin{cases} 1 & \text{ if }(y_i - \hat{\lambda}_i) > 0 \\ -1 & \text{ if }(y_i - \hat{\lambda}_i) < 0 \\ 0 & \text{ if }(y_i - \hat{\lambda}_i) = 0 \end{cases}\]
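A minimal sketch of both residual types, using hypothetical observed counts and fitted means (by convention, the \(y\log(y/\hat{\lambda})\) term is taken to be 0 when \(y = 0\)):

```python
import math

def pearson_resid(y, lam):
    """(observed - predicted) / std. error, where Var(Y) = lam under Poisson."""
    return (y - lam) / math.sqrt(lam)

def deviance_resid(y, lam):
    """Signed square root of the observation's contribution to the deviance."""
    term = y * math.log(y / lam) if y > 0 else 0.0  # y*log(y/lam) -> 0 as y -> 0
    sign = (y > lam) - (y < lam)                    # 1, -1, or 0
    return sign * math.sqrt(2 * (term - (y - lam)))

print(round(pearson_resid(0, 2), 3))   # -> -1.414
print(round(deviance_resid(0, 2), 3))  # -> -2.0
print(deviance_resid(3, 3.0))          # 0.0 when observed equals predicted
```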
p1 <- ggplot(data = model3_aug_pearson, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_smooth() +
labs(x = "Fitted values",
y = "Pearson residuals",
title = "Pearson residuals vs. fitted")
p2 <- ggplot(data = model3_aug_deviance, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_smooth() +
labs(x = "Fitted values",
y = "Deviance residuals",
title = "Deviance residuals vs. fitted")
p1 + p2
Goal: Use the (residual) deviance to assess how much the predicted values differ from the observed values. Recall \(\text{deviance} = \sum_{i=1}^{n}(\text{deviance residual}_i)^2\)
If the model sufficiently fits the data, then
\[\text{deviance} \sim \chi^2_{df}\]
where \(df\) is the model’s residual degrees of freedom
\[P(\chi^2_{df} > \text{ deviance})\]
The probability of observing a deviance greater than 2187.8 is \(\approx 0\), so there is significant evidence of lack-of-fit.
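This tail probability can be approximated without special libraries using the Wilson–Hilferty normal approximation to the \(\chi^2\) distribution. The residual degrees of freedom are not shown on the slide; the value 1493 below is an assumption (1500 households minus 7 estimated coefficients, inferred from the earlier deviance table):

```python
import math

deviance = 2187.8
df = 1493  # assumed: n = 1500 observations, 7 coefficients in model3

# Wilson-Hilferty: (X/df)^(1/3) is approximately normal with
# mean 1 - 2/(9*df) and variance 2/(9*df)
z = ((deviance / df) ** (1 / 3) - (1 - 2 / (9 * df))) / math.sqrt(2 / (9 * df))
p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
print(f"z = {z:.1f}, p-value = {p_value:.1e}")  # p-value ~ 0
```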
There are a few potential reasons for lack-of-fit:
Missing important interactions or higher-order terms
Missing important variables (perhaps this means a more comprehensive data set is required)
There could be extreme observations causing the deviance to be larger than expected (assess based on the residual plots)
There could be a problem with the Poisson model
Overdispersion: There is more variability in the response than what is implied by the Poisson model
Overall
mean | var |
---|---|
3.685 | 5.534 |
by Location
location | mean | var |
---|---|---|
CentralLuzon | 3.402 | 4.152 |
DavaoRegion | 3.390 | 4.723 |
IlocosRegion | 3.586 | 5.402 |
MetroManila | 3.707 | 4.863 |
Visayas | 3.902 | 6.602 |
If there is overdispersion, then there is more variation in the response than what’s implied by a Poisson model. This means
❌ The standard errors of the model coefficients are artificially small
❌ The p-values are artificially small
❌ This could lead to models that are more complex than what is needed
We can take overdispersion into account by estimating a dispersion parameter (the quasi-Poisson approach) or by using a model with a more flexible variance, such as the negative binomial.
The dispersion parameter is represented by \(\phi\)
\[\hat{\phi} =\frac{\text{deviance}}{\text{residual df}} = \frac{\sum_{i=1}^{n}(\text{Pearson residuals})^2}{n - p}\]
where \(p\) is the number of terms in the model (including the intercept)
If there is no overdispersion \(\hat{\phi} = 1\)
If there is overdispersion \(\hat{\phi} > 1\)
We inflate the standard errors of the coefficient by multiplying the variance by \(\hat{\phi}\)
\[SE_{Q}(\hat{\beta}) = \sqrt{\hat{\phi}} \times SE(\hat{\beta})\]

- “Q” stands for quasi-Poisson, since this is an ad-hoc solution
- The process for model building and model comparison is called quasilikelihood (similar to likelihood without exact underlying distributions)
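Numerically, the quasi-Poisson standard errors in the output below can be recovered from the Poisson ones by a single common scaling. This sketch backs out \(\sqrt{\hat{\phi}}\) from the intercept's two reported standard errors and applies it to the rest (values are copied from the model output tables):

```python
import math

# Standard errors from the Poisson fit of model3 (intercept, age, age2, 4 locations)
se_poisson = [0.1821, 0.0069, 0.0001, 0.0538, 0.0527, 0.0472, 0.0417]

# Back out sqrt(phi-hat) from the intercept: 0.2166 is its quasi-Poisson SE
sqrt_phi = 0.2166 / 0.1821
phi_hat = sqrt_phi ** 2
print(round(phi_hat, 2))  # dispersion estimate, ~1.41

# Every quasi-Poisson SE is the Poisson SE inflated by the same factor
se_quasi = [round(sqrt_phi * se, 4) for se in se_poisson]
print(se_quasi)  # matches the quasi-Poisson output up to rounding
```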
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | -0.3843 | 0.2166 | -1.7744 | 0.0762 | -0.8134 | 0.0358 |
age | 0.0704 | 0.0082 | 8.5665 | 0.0000 | 0.0544 | 0.0866 |
age2 | -0.0007 | 0.0001 | -9.2000 | 0.0000 | -0.0009 | -0.0006 |
locationDavaoRegion | -0.0194 | 0.0640 | -0.3030 | 0.7619 | -0.1451 | 0.1058 |
locationIlocosRegion | 0.0610 | 0.0626 | 0.9735 | 0.3304 | -0.0620 | 0.1837 |
locationMetroManila | 0.0545 | 0.0561 | 0.9703 | 0.3320 | -0.0552 | 0.1649 |
locationVisayas | 0.1121 | 0.0497 | 2.2574 | 0.0241 | 0.0156 | 0.2103 |
Poisson
term | estimate | std.error |
---|---|---|
(Intercept) | -0.3843 | 0.1821 |
age | 0.0704 | 0.0069 |
age2 | -0.0007 | 0.0001 |
locationDavaoRegion | -0.0194 | 0.0538 |
locationIlocosRegion | 0.0610 | 0.0527 |
locationMetroManila | 0.0545 | 0.0472 |
locationVisayas | 0.1121 | 0.0417 |
Quasi-Poisson
estimate | std.error |
---|---|
-0.3843 | 0.2166 |
0.0704 | 0.0082 |
-0.0007 | 0.0001 |
-0.0194 | 0.0640 |
0.0610 | 0.0626 |
0.0545 | 0.0561 |
0.1121 | 0.0497 |
term | estimate | std.error |
---|---|---|
(Intercept) | -0.3843 | 0.2166 |
age | 0.0704 | 0.0082 |
age2 | -0.0007 | 0.0001 |
locationDavaoRegion | -0.0194 | 0.0640 |
locationIlocosRegion | 0.0610 | 0.0626 |
locationMetroManila | 0.0545 | 0.0561 |
locationVisayas | 0.1121 | 0.0497 |
Test statistic \[t = \frac{\hat{\beta} - 0}{SE_{Q}(\hat{\beta})} \sim t_{n-p}\]
Another approach to handle overdispersion is to use a negative binomial regression model
Let \(Y\) be a negative binomial random variable, \(Y\sim NegBinom(r, p)\), then
\[P(Y = y_i) = {y_i + r - 1 \choose r - 1}(1-p)^{y_i}p^r \hspace{5mm} y_i = 0, 1, 2, \ldots, \infty\]
\(\begin{aligned} &\text{If }Y|\lambda \sim Poisson(\lambda)\\ &\text{ and } \lambda \sim Gamma\bigg(r, \frac{1-p}{p}\bigg)\\ &\text{ then } Y \sim NegBinom(r, p)\end{aligned}\)
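This gamma–Poisson mixture can be checked by simulation: draw \(\lambda\) from the Gamma, then \(Y \mid \lambda\) from the Poisson, and compare the sample mean and variance with the negative binomial values \(E(Y) = r(1-p)/p\) and \(Var(Y) = r(1-p)/p^2\). A standard-library sketch with arbitrary illustrative values \(r = 3\), \(p = 0.5\):

```python
import math
import random

def rpois(lam, rng):
    """Poisson draw via Knuth's multiplication method (fine for modest lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, 1.0
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k - 1

rng = random.Random(42)
r, p = 3, 0.5  # arbitrary illustrative parameters

# Mixture: lam ~ Gamma(shape=r, scale=(1-p)/p), then Y | lam ~ Poisson(lam)
draws = [rpois(rng.gammavariate(r, (1 - p) / p), rng) for _ in range(20000)]

mean = sum(draws) / len(draws)
var = sum((y - mean) ** 2 for y in draws) / (len(draws) - 1)
print(round(mean, 2), round(var, 2))  # near E(Y) = 3 and Var(Y) = 6
```

Note the variance exceeds the mean, which is exactly the extra flexibility the negative binomial offers over the Poisson.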
library(MASS)
model3_nb <- glm.nb(total ~ age + age2 + location, data = hh_data)
tidy(model3_nb) |>
kable(digits = 4)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.3753 | 0.2076 | -1.8081 | 0.0706 |
age | 0.0699 | 0.0079 | 8.8981 | 0.0000 |
age2 | -0.0007 | 0.0001 | -9.5756 | 0.0000 |
locationDavaoRegion | -0.0219 | 0.0625 | -0.3501 | 0.7262 |
locationIlocosRegion | 0.0577 | 0.0615 | 0.9391 | 0.3477 |
locationMetroManila | 0.0562 | 0.0551 | 1.0213 | 0.3071 |
locationVisayas | 0.1104 | 0.0487 | 2.2654 | 0.0235 |
The data set NYCairbnb-sample.csv contains information about a random sample of 1000 Airbnbs in New York City. It is a subset of the data on 40628 Airbnbs scraped by Awad et al. (2017).
Variables:

- number_of_reviews: Number of reviews for the unit on Airbnb (proxy for number of rentals)
- price: Price per night in US dollars
- room_type: Entire home/apartment, private room, or shared room
- days: Number of days the unit has been listed (date when the info was scraped minus date when the unit was first listed on Airbnb)

id | number_of_reviews | days | room_type | price |
---|---|---|---|---|
15756544 | 16 | 1144 | Private room | 120 |
14218251 | 15 | 471 | Private room | 89 |
21644 | 0 | 2600 | Private room | 89 |
13667835 | 1 | 283 | Entire home/apt | 150 |
265912 | 0 | 1970 | Entire home/apt | 89 |
Goal: Use the price and room type of Airbnbs to describe variation in the number of reviews (a proxy for number of rentals).
p1 <- ggplot(data = airbnb, aes(x = number_of_reviews)) +
geom_histogram() +
labs(x = "Number of reviews",
title = "Distribution of number of reviews")
p2 <- airbnb |>
filter(price <= 2000) |>
group_by(price) |>
summarise(log_mean = log(mean(number_of_reviews))) |>
ggplot(aes(x = price, y = log_mean)) +
geom_point(alpha= 0.7) +
geom_smooth() +
labs(x = "Price in US dollars",
y = "Log(mean # reviews)",
title = "Log mean # of reviews vs. price",
subtitle = "Airbnbs $2000 or less")
p3 <- airbnb |>
filter(price <= 500) |>
group_by(price) |>
summarise(log_mean = log(mean(number_of_reviews))) |>
ggplot(aes(x = price, y = log_mean)) +
geom_point(alpha= 0.7) +
geom_smooth() +
labs(x = "Price in US dollars",
y = "Log(mean # reviews)",
title = "Log mean # of reviews vs. price",
subtitle = "Airbnbs $500 or less")
p1 / (p2 + p3)
Overall
mean | var |
---|---|
15.916 | 765.969 |
by Room type
room_type | mean | var |
---|---|---|
Entire home/apt | 16.283 | 760.348 |
Private room | 15.608 | 786.399 |
Shared room | 15.028 | 605.971 |
We would like to fit the Poisson regression model
\[\log(\lambda) = \beta_0 + \beta_1 ~ price + \beta_2 ~ room\_type1 + \beta_3 ~ room\_type2\]
Question: Based on the EDA, what are some potential issues we may want to address in the model building?
Sometimes counts are not directly comparable because the observations differ based on some characteristic directly related to the counts, i.e. the sampling effort.
An offset can be used to adjust for differences in sampling effort.
\(\log(\lambda_i) = \beta_0 + \beta_1 ~ x_{i1} + \beta_2 ~ x_{i2} + ... + \beta_p ~ x_{ip} + \log(x_{offset_i})\)
We will add the offset \(\log(days)\) to the model. This accounts for the fact that we would expect Airbnbs that have been listed longer to have more reviews.
\[\log(\lambda_i) = \beta_0 + \beta_1 ~ price_i + \beta_2 ~ room\_type1_i + \beta_3 ~ room\_type2_i + \log(days_i)\]
Note: The left-hand side of the model is still \(\log(\lambda_i)\), the log of the mean number of reviews; adding the offset does not change the response variable.
We want to adjust for the number of days, so we are interested in \(\frac{reviews}{days}\).
Given \(\lambda\) is the mean number of reviews
\[\log\Big(\frac{\lambda}{days}\Big) = \beta_0 + \beta_1 ~ price + \beta_2 ~ room\_type1 + \beta_3 ~ room\_type2\]
\[\Rightarrow \log({\lambda}) - \log(days) = \beta_0 + \beta_1 ~ price + \beta_2 ~ room\_type1 + \beta_3 ~ room\_type2\]
\[\Rightarrow \log({\lambda}) = \beta_0 + \beta_1 ~ price + \beta_2 ~ room\_type1 + \beta_3 ~ room\_type2 + \log(days)\]
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -4.1351 | 0.0170 | -243.1397 | 0 |
price | -0.0005 | 0.0001 | -7.0952 | 0 |
room_typePrivate room | -0.0994 | 0.0174 | -5.6986 | 0 |
room_typeShared room | 0.2436 | 0.0452 | 5.3841 | 0 |
The coefficient for \(\log(days)\) is fixed at 1, so it is not in the model output.
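To make a prediction on the count scale, the offset enters like a term with coefficient fixed at 1. A sketch using the fitted coefficients above for a hypothetical private room at $100 per night listed for 1000 days (the specific unit is made up for illustration):

```python
import math

# Fitted coefficients from the offset model output
b_intercept, b_price, b_private, b_shared = -4.1351, -0.0005, -0.0994, 0.2436

price, days = 100, 1000  # hypothetical listing: private room, $100/night, 1000 days

# log(lambda) = X*beta + log(days); the offset's coefficient is fixed at 1
log_lam = b_intercept + b_price * price + b_private + math.log(days)
lam = math.exp(log_lam)
print(round(lam, 1))  # expected number of reviews, ~13.8
```

Equivalently, \(\lambda = days \times e^{X\beta}\): an expected reviews-per-day rate scaled by the listing's exposure time.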
Question: Interpret the coefficients of price and room_typePrivate room.
\[\begin{aligned}&H_0: \text{ The model is a good fit for the data}\\ &H_a: \text{ There is significant lack of fit}\end{aligned}\]
There is evidence of significant lack of fit in the model. Therefore, more models would need to be explored that address the issues mentioned earlier.
In practice we would assess goodness-of-fit and finalize the model before any interpretations and conclusions.
The data weekend-drinks.csv contains information from a survey of 77 students in an introductory statistics course on a dry campus.

Variables:

- drinks: Number of drinks they had in the past weekend
- off_campus: 1 - lives off campus, 0 otherwise
- first_year: 1 - student is a first-year, 0 otherwise
- sex: f - student identifies as female, m - student identifies as male

Goal: Explore factors related to drinking behavior on a dry campus.
What does it mean to be a “zero” in this data?
There are two types of zeros
We introduce a new parameter \(\alpha\) for the proportion of true zeros, then fit a model that has two components:
1️⃣ The association between mean number of drinks and various characteristics among those who drink
2️⃣ The estimated proportion of non-drinkers
Zero-inflated Poisson (ZIP) model has two parts
1️⃣ Association, among those who drink, between the mean number of drinks and predictors sex and off campus residence
\[\log(\lambda) = \beta_0 + \beta_1 ~ off\_campus + \beta_2 ~ sex\] where \(\lambda\) is the mean number of drinks among those who drink
2️⃣ Probability that a student does not drink
\[\text{logit}(\alpha) = \log\Big(\frac{\alpha}{1- \alpha}\Big) = \beta_0 + \beta_1 ~ first\_year\]
where \(\alpha\) is the proportion of non-drinkers
Note: The same variables can be used in each component
Fit ZIP models using the zeroinfl function from the pscl R package.
library(pscl)
drinks_zip <- zeroinfl(drinks ~ off_campus + sex | first_year,
data = drinks)
drinks_zip
Call:
zeroinfl(formula = drinks ~ off_campus + sex | first_year, data = drinks)
Count model coefficients (poisson with log link):
(Intercept) off_campus sexm
0.7543 0.4159 1.0209
Zero-inflation model coefficients (binomial with logit link):
(Intercept) first_year
-0.6036 1.1364
Use the tidy function from the poissonreg package for tidy model output.
term | type | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|
(Intercept) | count | 0.754 | 0.144 | 5.238 | 0.000 |
off_campus | count | 0.416 | 0.206 | 2.021 | 0.043 |
sexm | count | 1.021 | 0.175 | 5.827 | 0.000 |
Questions: Interpret the coefficients for off_campus and sexm.

Proportion of non-drinkers

term | type | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|
(Intercept) | zero | -0.604 | 0.311 | -1.938 | 0.053 |
first_year | zero | 1.136 | 0.610 | 1.864 | 0.062 |
Questions:
Based on the model…
Least squares estimation would not work because the normality and equal variance assumptions don’t hold for Poisson regression
Maximum likelihood estimation is used to estimate the coefficients of Poisson regression.
The likelihood is the product of the probabilities for the \(n\) independent observations in the data
Let’s go back to the example about household size in the Philippines. We will focus on the model using the main effect of age to understand variability in mean household size.
Suppose the first five observations have household sizes of 4, 2, 8, 6, and 1. Then the likelihood is
\(L = \frac{e^{-\lambda_1}\lambda_1^4}{4!} * \frac{e^{-\lambda_2}\lambda_2^2}{2!} * \frac{e^{-\lambda_3}\lambda_3^8}{8!} * \frac{e^{-\lambda_4}\lambda_4^6}{6!} * \frac{e^{-\lambda_5}\lambda_5^1}{1!}\)
We will use the log likelihood to make finding the MLE easier
\(\begin{aligned}\log(L) &= -\lambda_1 + 4\log(\lambda_1) - \lambda_2 + 2\log(\lambda_2) - \lambda_3 + 8\log(\lambda_3)\\ & -\lambda_4 + 6 \log(\lambda_4) - \lambda_5 + \log(\lambda_5) + C \end{aligned}\)
where

- \(\lambda_i\) is the mean number in the household for observation \(i\), which depends on \(x_i\)
- \(C = -[\log(4!) + \log(2!) + \log(8!) + \log(6!) + \log(1!)]\)
Given the age of the head of the household, we fit the model
\[\log(\lambda_i) = \beta_0 + \beta_1~age_i\]
Then we replace each \(\lambda_i\) in \(\log(L)\) with \(e^{\beta_0 + \beta_1~age_i}\).
Suppose the first five observations have ages \(X = (32, 21, 55, 44, 28)\). Then
\[\begin{aligned} \log(L) &= [-e^{\beta_0 + 32\beta_1}+ 4(\beta_0 + 32\beta_1)] + [- e^{\beta_0 + 21\beta_1} + 2(\beta_0 + 21\beta_1)] \\ &+ [- e^{\beta_0 + 55\beta_1} + 8(\beta_0 + 55\beta_1)] + [-e^{\beta_0 + 44\beta_1} + 6(\beta_0 + 44\beta_1)] \\ &+ [-e^{\beta_0 + 28\beta_1} + (\beta_0 + 28\beta_1)] + C \end{aligned}\]
Use a numerical search algorithm to find the values of \(\beta_0\) and \(\beta_1\) that maximize the expression above.
There are three different types of observations in the data:
True zeros
\[P(0 | \text{true zero})= \alpha\]
Observed 0 but will not always be 0
\[P(0 | \text{not always zero}) = (1 - \alpha)\frac{e^{-\lambda}\lambda^0}{0!}\]
Did not observe 0 and will not always be 0
\[P(z_i | \text{not always zero}) = (1 - \alpha)\frac{e^{-\lambda}\lambda^{z_i}}{z_i!}\]
Putting this all together. Let \(y_i\) be an observed response then
\[P(Y_i = y_i | x_i) = \begin{cases} \alpha + (1 - \alpha)e^{-\lambda_i} && \text{ if } y_i = 0 \\ (1 - \alpha)\frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!} && \text{ if } y_i > 0 \end{cases}\]
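A direct implementation of this pmf; summing it over a wide range of counts should give 1, a useful check that the two pieces fit together (the \(\lambda\) and \(\alpha\) values below are arbitrary):

```python
import math

def zip_pmf(y, lam, alpha):
    """P(Y = y) under a zero-inflated Poisson with mean lam and true-zero prob alpha."""
    poisson = math.exp(-lam) * lam**y / math.factorial(y)
    if y == 0:
        return alpha + (1 - alpha) * poisson  # true zeros plus Poisson zeros
    return (1 - alpha) * poisson

lam, alpha = 2.1, 0.35  # arbitrary illustrative values
total = sum(zip_pmf(y, lam, alpha) for y in range(60))
print(round(total, 10))  # -> 1.0 (the pmf sums to 1)
```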
Recall from our example,
\[\lambda_i = e^{\beta_0 + \beta_1~off\_campus_i + \beta_2 ~ sex_i}\]
\[\alpha_i = \frac{e^{\beta_{0\alpha} + \beta_{1\alpha} ~ first\_year_i}}{1 + e^{\beta_{0\alpha} + \beta_{1\alpha} ~ first\_year_i}}\]
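Plugging the fitted coefficients from the zeroinfl output into these two expressions gives concrete estimates (inverse-logit for \(\alpha\), exponentiation for \(\lambda\)):

```python
import math

def inv_logit(x):
    """Map a log-odds value back to a probability."""
    return math.exp(x) / (1 + math.exp(x))

# Zero-inflation part: logit(alpha) = -0.6036 + 1.1364 * first_year
alpha_first_year = inv_logit(-0.6036 + 1.1364)
alpha_other = inv_logit(-0.6036)
print(round(alpha_first_year, 2), round(alpha_other, 2))  # ~0.63 vs ~0.35

# Count part: log(lambda) = 0.7543 + 0.4159 * off_campus + 1.0209 * sexm
lam_on_campus_female = math.exp(0.7543)
lam_off_campus_male = math.exp(0.7543 + 0.4159 + 1.0209)
print(round(lam_on_campus_female, 1), round(lam_off_campus_male, 1))  # ~2.1 vs ~8.9
```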
These slides are based on content in BMLR: Chapter 4
Initial versions of the slides are by Dr. Maria Tackett, Duke University