Homework 5

Ch 4: Poisson Regression Part 2

Setup

Each of your assignments will begin with the following steps.

Go to our RStudio Server at http://turing.cornellcollege.edu:8787/
Create a new R project inside your homework folder on the server (within the course folder you created), and give it a sensible name such as homework_5.
Create a new Quarto document and give it a sensible name such as hw5.
In the YAML header, add the following (add whatever you don’t already have). The embed-resources option makes your final rendered HTML file self-contained.
```
---
title: "Document title"
author: "my name"
format:
  html:
    embed-resources: true
---
```

Instructions

Be sure to include the relevant R code as well as full sentences answering each of the questions (i.e. if I ask for the average, you can output the answer in R but also write a full sentence with the answer). Be sure to frequently save your files!

Data for the homework will be in the STA363_inst_files -> data folder.

Exercises

All problems are from the main textbook, Beyond Multiple Linear Regression by Paul Roback and Julie Legler (abbreviated BMLR), Chapters 1-9. It is freely available online.

Use the numbering on the left. The codes are for instructor use (Ex: C1).

Exercise 1

(C15) Dating online. Researchers are interested in the number of dates respondents arranged online and whether the rates differ by age group. Questions which elicit responses similar to this can be found in the Pew Survey concerning dating online and relationships [@Duggan2013]. Each survey respondent was asked how many dates they have arranged online in the past 3 months as well as the typical amount of time, \(t\), in hours, they spend online weekly. Some rows of data appear in Table 1.
- Identify the response, predictor, and offset in this context. Does using an offset make sense?
- Write out a model for this data. As part of your model description, define the parameter, \(\lambda\).
- Consider a zero-inflated Poisson model for this data. Describe what the “true zeros” would be in this setting.

Table 1: Sample data for Exercise 1.

| Age | Time Online | Number of Dates Arranged Online |
|-----|-------------|---------------------------------|
| 19  | 35          | 3                               |
| 29  | 20          | 5                               |
| 38  | 15          | 0                               |
| 55  | 10          | 0                               |
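
As a rough illustration of how an offset is written in R, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration: the data frame `sim`, its columns `age`, `time`, and `dates`, and the coefficients used to generate the counts are hypothetical and are not the survey data or the expected answer to this exercise.

```r
# Simulated data only: these values are NOT the survey data from Table 1.
set.seed(363)
n <- 200
sim <- data.frame(
  age  = sample(18:70, n, replace = TRUE),
  time = runif(n, min = 5, max = 40)   # hypothetical weekly hours online
)

# Assumed true rate of dates per hour online, decreasing with age
rate <- exp(-1.5 - 0.03 * sim$age)
sim$dates <- rpois(n, lambda = rate * sim$time)

# offset(log(time)) enters the linear predictor with its coefficient fixed at 1,
# so the estimated coefficients describe the rate per hour online
fit <- glm(dates ~ age + offset(log(time)), family = poisson, data = sim)
coef(fit)
```

Note that no coefficient is reported for the offset term, because it is fixed rather than estimated.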

Exercise 2

(C16) Poisson approximation: rare events. For rare diseases, the probability of a case occurring, \(p\), in a very large population, \(n\), is small. With a small \(p\) and large \(n\), the random variable \(Y\) = the number of cases out of \(n\) people can be approximated using a Poisson random variable with \(\lambda = np\). If the count of those with the disease is observed in several different populations independently of one another, then \(Y_i\) represents the number of cases in the \(i^{th}\) population and can be approximated using a Poisson random variable with \(\lambda_i=n_ip_i\), where \(p_i\) is the probability of a case for the \(i^{th}\) population. Poisson regression can take into account the differences in the population sizes, \(n_i\), by using \(\log(n_i)\) as an offset, as well as differences in a population characteristic like \(x_i\). The coefficient of the offset is set at one; it is not estimated like the other coefficients. Thus the model statement has the form \(\log(\lambda_i) = \beta_0+\beta_1x_i + \log(n_i)\), where \(Y_i \sim\) Poisson(\(\lambda_i = n_i p_i\)). Note that \(\lambda_i\) depends on \(x_i\), which may differ for the different populations.
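
The approximation described above can be checked numerically. The following is a minimal sketch using base R only; the population size `n` and probability `p` are hypothetical values chosen for illustration and are not taken from Table 2.

```r
# Hypothetical population size and case probability (not taken from Table 2)
n <- 100000
p <- 0.0002
lambda <- n * p   # Poisson mean, approximately 20

# Compare the exact binomial probabilities with the Poisson approximation
y <- 0:60
binom_probs <- dbinom(y, size = n, prob = p)
pois_probs  <- dpois(y, lambda = lambda)

# Largest absolute difference between the two probability mass functions
max_diff <- max(abs(binom_probs - pois_probs))
max_diff
```

With a large `n` and small `p`, the largest difference should be very small, which is why the Poisson model is a reasonable stand-in for the binomial in this setting.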

Scotto et al. (1974) wondered if skin cancer rates by age group differ by city. Based on their data in Table 2, identify and describe the following quantities which appear in the description of the Poisson approximation for rare events:
- A case,
- The population size, \(n_i\),
- Probability, \(p_i\),
- Poisson parameter, \(\lambda_i\),
- Poisson random variables, \(Y_i\), and
- The predictors, \(X_i\).

Table 2: Data from Scotto et al. (1974) on the number of cases of non-melanoma skin cancer for women by age group in two metropolitan areas (Minneapolis-St. Paul and Dallas-Ft. Worth); the year is unknown.
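
| Number of Cases | Population | Age Group | City |
|-----------------|------------|-----------|------|
| 1               | 172675     | 15-24     | 1    |
| 16              | 123065     | 25-34     | 1    |
| ...             | ...        | ...       | ...  |
| 226             | 29007      | 75-84     | 2    |
| 65              | 7538       | 85+       | 2    |
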
The columns contain: number of cases, population size, age group, and city (1=Minneapolis-St. Paul, 2=Dallas-Ft. Worth).

Exercise 3

(G6) U.S. National Medical Expenditure Survey. The data set NMES1988 in the AER package contains a sample of individuals over 65 who are covered by Medicare in order to assess the demand for health care through physician office visits, outpatient visits, ER visits, hospital stays, etc. The data can be accessed by installing and loading the AER package and then running data(NMES1988). More background information and references about the NMES1988 data can be found in help pages for the AER package.

- Show through graphical means that there are more respondents with 0 visits than might be expected under a Poisson model.
- Fit a ZIP model for the number of physician office visits using chronic, health, and insurance as predictors for the Poisson count, and chronic and insurance as the predictors for the binary part of the model. Then, provide interpretations in context for the following model parameters:
  - chronic in the Poisson part of the model
  - poor health in the Poisson part of the model
  - the Intercept in the logistic part of the model
  - insurance in the logistic part of the model
- Is there significant evidence that the ZIP model is an improvement over a simple Poisson regression model?
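
One possible way to look for excess zeros is sketched below on simulated counts, using base R only. The vector `visits`, the 30% share of "always zero" respondents, and the Poisson mean of 4 are all assumptions made for illustration; this is not the NMES1988 data, and it does not fit a ZIP model.

```r
# Simulated zero-inflated counts; these are NOT the NMES1988 data.
set.seed(1988)
n <- 1000
always_zero <- rbinom(n, size = 1, prob = 0.3)   # assumed 30% "true zeros"
visits <- ifelse(always_zero == 1, 0, rpois(n, lambda = 4))

# Observed share of zeros versus the share a single Poisson
# distribution with the same mean would predict
observed_zero <- mean(visits == 0)
expected_zero <- dpois(0, lambda = mean(visits))
c(observed = observed_zero, expected = expected_zero)

# Graphical check: observed proportions as bars, Poisson probabilities as points
counts <- 0:max(visits)
obs_freq <- as.numeric(table(factor(visits, levels = counts))) / n
mids <- barplot(obs_freq, names.arg = counts,
                xlab = "Count", ylab = "Proportion")
points(mids, dpois(counts, lambda = mean(visits)), pch = 19)
```

If the bar at zero sits well above its point, the data contain more zeros than a plain Poisson model with the same mean would suggest.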

Submission

When you are finished with your homework, be sure to Render the final document. Once it has rendered, download and submit your file as follows:

Find the .html file in your Files pane (on the bottom right of the screen).
Click the check box next to the file.
Click the blue gear above and then click “Export” to download the file.
Submit your final HTML document to the respective assignment on Moodle.