Homework 6
Ch 6: Logistic Regression
Setup
Each of your assignments will begin with the following steps.
- Go to our RStudio Server at http://turing.cornellcollege.edu:8787/
- Create a new R project inside your homework folder on the server, within the course folder you created, and give it a sensible name such as homework_6.
- Create a new Quarto document and give it a sensible name such as hw6.
- In the YAML header, add the following (add what you don’t have). The embed-resources component will make your final rendered html self-contained.
---
title: "Document title"
author: "my name"
format:
  html:
    embed-resources: true
---
Instructions
Be sure to include the relevant R code as well as full sentences answering each of the questions (e.g., if I ask for the average, you can output the answer in R but should also write a full sentence stating the answer). Be sure to save your files frequently!
Data for the homework will be in the STA363_inst_files -> data folder.
Exercises
All problems are from the main textbook, Beyond Multiple Linear Regression by Paul Roback and Julie Legler (Chapters 1-9, abbreviated BMLR), which is freely available online.
Use the numbering on the left. The codes are for instructor use (Ex: C1).
Part 1: GLMs
Exercise 4
- (C1) For each distribution below,
  - Write the pmf or pdf in one-parameter exponential form, if possible.
  - Describe an example of a setting where this random variable might be used.
  - Identify the canonical link function, and
  - (For Fun) Compute \(\mu = -\frac{c'(\theta)}{b'(\theta)}\) and \(\sigma^2 = \frac{b''(\theta)c'(\theta)-c''(\theta)b'(\theta)}{[b'(\theta)]^3}\) and compare with known \(\operatorname{E}(Y)\) and \(\operatorname{Var}(Y)\).

  Distributions:
  - Binomial (for fixed \(n\)): Y = number of successes in \(n\) independent, identical trials
    \[P(Y=y;p)=\left(\begin{array} {c} n\\y \end{array}\right) p^y(1-p)^{(n-y)}\]
  - Gamma (for fixed \(r\)): Y = time spent waiting for the \(r^{th}\) event in a Poisson process with an average rate of \(\lambda\) events per unit of time
    \[f(y; \lambda) = \frac{\lambda^r}{\Gamma(r)} y^{r-1} e^{-\lambda y}\]
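As an illustration of the steps on a distribution that is not part of this exercise, here is a sketch for the Poisson. It assumes the one-parameter exponential form used in BMLR, with \(a(y)\), \(b(\theta)\), \(c(\theta)\), and \(d(y)\) as the component functions:

\[f(y;\theta) = e^{[a(y)b(\theta) + c(\theta) + d(y)]}\]

For \(Y \sim \textrm{Poisson}(\lambda)\),

\[P(Y=y;\lambda) = \frac{e^{-\lambda}\lambda^y}{y!} = e^{[y\log(\lambda) - \lambda - \log(y!)]}\]

so \(a(y) = y\), \(b(\lambda) = \log(\lambda)\), \(c(\lambda) = -\lambda\), and \(d(y) = -\log(y!)\). Because \(a(y) = y\), the canonical link is \(b(\lambda) = \log(\lambda)\). Applying the two formulas with \(b'(\lambda) = 1/\lambda\), \(b''(\lambda) = -1/\lambda^2\), \(c'(\lambda) = -1\), and \(c''(\lambda) = 0\):

\[\mu = -\frac{c'(\lambda)}{b'(\lambda)} = -\frac{-1}{1/\lambda} = \lambda\]

\[\sigma^2 = \frac{b''(\lambda)c'(\lambda)-c''(\lambda)b'(\lambda)}{[b'(\lambda)]^3} = \frac{(-1/\lambda^2)(-1) - (0)(1/\lambda)}{(1/\lambda)^3} = \lambda\]

Both agree with the known Poisson results \(\operatorname{E}(Y) = \operatorname{Var}(Y) = \lambda\).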
Part 2: Logistic Regression
Exercise 1
- (C2) Interpret the odds ratios in the following abstract.
Daycare Centers and Respiratory Health [@Nafstad1999]
Objective. To estimate the effects of the type of daycare on respiratory health in preschool children.
Methods. A population-based, cross-sectional study of Oslo children born in 1992 was conducted at the end of 1996. A self-administered questionnaire inquired about daycare arrangements, environmental conditions, and family characteristics (n = 3853; response rate, 79%).
Results. In a logistic regression controlling for confounding, children in daycare centers had more often nightly cough (adjusted odds ratio, 1.89; 95% confidence interval 1.34-2.67), and blocked or runny nose without common cold (1.55; 1.07-1.61) during the past 12 months compared with children in home care.
Exercise 2
- (C3) Construct a table and calculate the corresponding odds and odds ratios. Comment on the reported and calculated results in this New York Times article from @Kolata2009.
In November, the Centers for Disease Control and Prevention published a paper reporting that babies conceived with IVF, or with a technique in which sperm are injected directly into eggs, have a slightly increased risk of several birth defects, including a hole between the two chambers of the heart, a cleft lip or palate, an improperly developed esophagus and a malformed rectum. The study involved 9,584 babies with birth defects and 4,792 babies without. Among the mothers of babies without birth defects, 1.1% had used IVF or related methods, compared with 2.4% of mothers of babies with birth defects.
The findings are considered preliminary, and researchers say they believe IVF does not carry excessive risks. There is a 3% chance that any given baby will have a birth defect.
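As a sketch of the mechanics only, the R code below builds a two-way table from made-up counts and computes odds and an odds ratio. The counts are hypothetical and are not the figures from the article, and the object names (`tab`, `odds_exposure`, `odds_ratio`, `cross_product`) are chosen purely for illustration.

```r
# Hypothetical counts (NOT taken from the article): rows are outcome groups,
# columns are exposure status.
tab <- matrix(c(30, 70,
                10, 90),
              nrow = 2, byrow = TRUE,
              dimnames = list(outcome  = c("Defect", "No defect"),
                              exposure = c("Exposed", "Unexposed")))

# Odds of exposure within each outcome group: count exposed / count unexposed.
odds_exposure <- tab[, "Exposed"] / tab[, "Unexposed"]

# Odds ratio comparing the "Defect" group to the "No defect" group.
odds_ratio <- odds_exposure["Defect"] / odds_exposure["No defect"]

# Equivalent cross-product form: (a * d) / (b * c).
cross_product <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

print(tab)
print(odds_exposure)
print(unname(odds_ratio))
```

The odds ratio is the same whether the odds are computed within rows or within columns of the table, which is why the cross-product form agrees with the ratio of the two odds.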
Exercise 3
- (G4) Birdkeeping and lung cancer: a retrospective observational study. A 1972-1981 health survey in The Hague, Netherlands, discovered an association between keeping pet birds and increased risk of lung cancer. To investigate birdkeeping as a risk factor, researchers conducted a case-control study of patients in 1985 at four hospitals in The Hague. They identified 49 cases of lung cancer among patients who were registered with a general practice, who were age 65 or younger, and who had resided in the city since 1965. Each patient (case) with cancer was matched with two control subjects (without cancer) by age and sex. Further details can be found in @Holst1988.
Age, sex, and smoking history are all known to be associated with lung cancer incidence. Thus, researchers wished to determine whether, after age, sex, socioeconomic status, and smoking have been controlled for, an additional risk is associated with birdkeeping. The data [@Ramsey2002] is found in birdkeeping.csv, and the variables are listed below. In addition, R code at the end of the problem can be used to input the data and create additional useful variables.
- female = sex (1 = Female, 0 = Male)
- age = age, in years
- highstatus = socioeconomic status (1 = High, 0 = Low), determined by the occupation of the household’s primary wage earner
- yrsmoke = years of smoking prior to diagnosis or examination
- cigsday = average rate of smoking, in cigarettes per day
- bird = indicator of birdkeeping (1 = Yes, 0 = No), determined by whether or not there were caged birds in the home for more than 6 consecutive months from 5 to 14 years before diagnosis (cases) or examination (controls)
- cancer = indicator of lung cancer diagnosis (1 = Cancer, 0 = No Cancer)
a. Perform an exploratory data analysis to see how each explanatory variable is related to the response (cancer). Summarize each relationship in one sentence.
   - For quantitative explanatory variables (age, yrsmoke, cigsday), produce a cdplot, a boxplot, and summary statistics by cancer diagnosis.
   - For categorical explanatory variables (female or sex, highstatus or socioecon_status, bird or keep_bird), produce a segmented bar chart and an appropriate table of proportions showing the relationship with cancer diagnosis.
b. In (a), you should have found no relationship between whether or not a patient develops lung cancer and either their age or sex. Why might this be? What implications will this have on your modeling?
c. Based on a two-way table with keeping birds and developing lung cancer from (a), find an unadjusted odds ratio comparing birdkeepers to non-birdkeepers and interpret this odds ratio in context. (Note: an unadjusted odds ratio is found by not controlling for any other variables.) Also, find an analogous relative risk and interpret it in context as well.
d. Are the elogits reasonably linear relating number of years smoked to the estimated log odds of developing lung cancer? Demonstrate with an appropriate plot.
e. Does there appear to be an interaction between number of years smoked and whether the subject keeps a bird? Demonstrate with an interaction plot and a coded scatterplot with empirical logits on the y-axis.

Before answering the next questions, fit logistic regression models in R with cancer as the response and the following sets of explanatory variables:
- model1 = age, yrsmoke, cigsday, female, highstatus, bird
- model2 = yrsmoke, cigsday, highstatus, bird
- model4 = yrsmoke, bird
- model5 = the complete second order version of model4 (add squared terms and an interaction)
- model6 = yrsmoke, bird, yrsmoke:bird

f. Is there evidence that we can remove age and female from our model? Perform an appropriate test comparing model1 to model2; give a test statistic and p-value, and state a conclusion in context.
g. Is there evidence that the complete second order version of model4 improves its performance? Perform an appropriate test comparing model4 to model5; give a test statistic and p-value, and state a conclusion in context.
h. Carefully interpret each of the four model coefficients in model6 in context.
i. If you replaced yrsmoke everywhere it appears in model6 with a mean-centered version of yrsmoke, tell what would change among these elements: the 4 coefficients, the 4 p-values for coefficients, and the residual deviance.
j. model4 is a potential final model based on this set of explanatory variables. Find and carefully interpret 95% confidence intervals based on profile likelihoods for the coefficients of yrsmoke and bird.
k. How does the adjusted odds ratio for birdkeeping from model4 compare with the unadjusted odds ratio you found in (c)? Is birdkeeping associated with a significant increase in the odds of developing lung cancer, even after adjusting for other factors?
l. Use the categorical variable years_factor based on yrsmoke and replace yrsmoke in model4 with your new variable to create model4a. First, interpret the coefficient for years_factorOver 25 years in context. Then tell if you prefer model4 with years smoked as a numeric predictor or model4a with years smoked as a categorical predictor, and explain your reasoning.
m. Discuss the scope of inference in this study. Can we generalize our findings beyond the subjects in this study? Can we conclude that birdkeeping causes increased odds of developing lung cancer? Do you have other concerns with this study design or the analysis you carried out?
n. Read the article that appeared in the British Medical Journal. What similarities and differences do you see between their analyses and yours? What are a couple of things you learned from the article that weren’t apparent in the short summary at the beginning of the assignment?
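For the empirical logit questions above, one possible approach is to bin the quantitative predictor, count successes in each bin, and take the log odds with a small adjustment. The sketch below uses simulated data rather than the birdkeeping data. The variable names (`x`, `y`, `x_group`, `elogit_df`), the bin width of 10, and the 0.5 adjustment are assumptions made for illustration.

```r
set.seed(363)

# Simulated stand-in data (hypothetical, NOT the birdkeeping data):
# x is a quantitative predictor, y is a 0/1 response.
n <- 400
x <- round(runif(n, min = 0, max = 50))
y <- rbinom(n, size = 1, prob = plogis(-2 + 0.06 * x))

# Bin x so that each group holds enough observations to estimate a proportion.
x_group <- cut(x, breaks = seq(0, 50, by = 10), include.lowest = TRUE)

# Per-bin summaries: mean x, number of successes, number of observations.
elogit_df <- data.frame(
  mean_x    = as.numeric(tapply(x, x_group, mean)),
  successes = as.numeric(tapply(y, x_group, sum)),
  total     = as.numeric(tapply(y, x_group, length))
)

# Empirical logit with a 0.5 adjustment so that bins with 0% or 100%
# successes still give a finite value.
elogit_df$elogit <- with(elogit_df,
                         log((successes + 0.5) / (total - successes + 0.5)))

print(elogit_df)
```

Plotting `elogit` against `mean_x`, for example with `plot(elogit ~ mean_x, data = elogit_df)`, shows whether the pattern is roughly linear. Repeating the calculation separately within each level of a second variable gives the points for a coded scatterplot.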
library(tidyverse)  # provides read_csv(), mutate(), and %>%

# Read the data and add labeled versions of the 0/1 indicator variables
birds <- read_csv("data/birdkeeping.csv") %>%
  mutate(sex = ifelse(female == 1, "Female", "Male"),
         socioecon_status = ifelse(highstatus == 1,
                                   "High", "Low"),
         keep_bird = ifelse(bird == 1, "Keep Bird", "No Bird"),
         lung_cancer = ifelse(cancer == 1, "Cancer",
                              "No Cancer")) %>%
  # Group years of smoking into three categories
  mutate(years_factor = cut(yrsmoke,
                            breaks = c(-Inf, 0, 25, Inf),
                            labels = c("No smoking", "1-25 years",
                                       "Over 25 years")))
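The nested-model tests and profile-likelihood intervals asked for in this exercise follow a common pattern in R. The sketch below shows that pattern on simulated data, not the birdkeeping data. The data frame `sim`, its columns `x1`, `x2`, `x3`, and the model names `full` and `reduced` are hypothetical stand-ins.

```r
set.seed(2024)

# Simulated stand-in data (hypothetical, NOT the birdkeeping data).
n  <- 300
x1 <- rnorm(n)                         # predictor with a real effect
x2 <- rbinom(n, size = 1, prob = 0.4)  # binary predictor with a real effect
x3 <- rnorm(n)                         # noise predictor, unrelated to y
y  <- rbinom(n, size = 1, prob = plogis(-0.5 + 0.8 * x1 + 1.0 * x2))
sim <- data.frame(y, x1, x2, x3)

# Nested logistic regression models: `reduced` drops x3 from `full`.
full    <- glm(y ~ x1 + x2 + x3, family = binomial, data = sim)
reduced <- glm(y ~ x1 + x2,      family = binomial, data = sim)

# Drop-in-deviance (likelihood ratio) test for the dropped term.
lrt <- anova(reduced, full, test = "Chisq")
print(lrt)

# The same test statistic and p-value computed by hand.
drop_in_dev <- deviance(reduced) - deviance(full)
df_diff     <- df.residual(reduced) - df.residual(full)
p_value     <- pchisq(drop_in_dev, df = df_diff, lower.tail = FALSE)

# Profile-likelihood 95% confidence intervals, exponentiated to the
# odds-ratio scale.
ci_or <- exp(confint(reduced, level = 0.95))
print(ci_or)
```

A large p-value from the drop-in-deviance test suggests the smaller model is adequate. Exponentiating the interval endpoints puts them on the odds-ratio scale, where an interval that excludes 1 indicates a statistically significant association at the 5% level.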
Submission
When you are finished with your homework, be sure to Render the final document. Once rendered, you can download and submit your file as follows:
- Find the .html file in your Files pane (on the bottom right of the screen)
- Click the check box next to the file
- Click the blue gear above and then click “Export” to download
- Submit your final html document to the respective assignment on Moodle