Homework 7

Ch 7: Correlated Data and Ch 8: Multilevel Models

Setup

Each of your assignments will begin with the following steps.

1. Go to our RStudio Server at http://turing.cornellcollege.edu:8787/
2. Create a new R project inside your homework folder on the server (within the course folder you created), and give it a sensible name such as homework_7.
3. Create a new Quarto document and give it a sensible name such as hw7.
4. In the YAML, add the following (add whatever you don’t already have). The embed-resources option makes your final rendered html file self-contained.
```
---
title: "Document title"
author: "my name"
format:
  html:
    embed-resources: true
---
```

Instructions

Be sure to include the relevant R code as well as full sentences answering each of the questions (e.g., if I ask for the average, you can output the answer in R, but also write a full sentence stating the answer). Be sure to save your files frequently!

Data for the homework will be in the STA363_inst_files -> data folder.

Exercises

All problems are from the main textbook, Beyond Multiple Linear Regression by Paul Roback and Julie Legler (abbreviated BMLR), which is freely available online. The course uses Chapters 1-9.

Use the numbering on the left. The codes are for instructor use (Ex: C1).

Part 1: Chapter 7 - Correlated Data

Exercise 1

(C1) Examples with correlated data. For each of the following studies:
1. Identify the most basic observational units
2. Identify the grouping units (could be multiple levels of grouping)
3. State the response(s) measured and variable type (normal, binary, Poisson, etc.)
4. Write a sentence describing the within-group correlation.
5. Identify fixed and random effects

Epilepsy study. Researchers conducted a randomized controlled study where patients were randomly assigned to either an anti-epileptic drug or a placebo. For each patient, the number of seizures at baseline was measured over a 2-week period. At each of four consecutive visits, the number of seizures over the preceding 2-week period was determined. Patient age and sex along with visit number were recorded.
Prairie restoration. Researchers at a small Midwestern college decided to experimentally explore the underlying causes of variation in soil reconstruction projects in order to make future projects more effective. Introductory ecology classes were organized to collect weekly data on plants in pots containing soil samples. Data will be examined to compare:
- germination and growth of two species of prairie plants: leadplants (Amorpha canescens) and coneflowers (Ratibida pinnata).
- soil from a cultivated (agricultural) field, a natural prairie, and a restored (reconstructed) prairie.
- the effect of sterilization, since half of the sampled soil was sterilized to determine if rhizosphere differences were responsible for the observed variation.
Radon in Minnesota. Radon is a carcinogen – a naturally occurring radioactive gas whose decay products are also radioactive – known to cause lung cancer in high concentrations. The EPA sampled more than 80,000 homes across the U.S. Each house came from a randomly selected county and measurements were made on each level of each home. Uranium measurements at the county level were included to improve the radon estimates.

Part 2: Chapter 8 - Multilevel Models

Exercise 2

(C1,2,3) @Brown2004 describe “A Hierarchical Linear Model Approach for Assessing the Effects of House and Neighborhood Characteristics on Housing Prices”.

1. Based on the title of their paper: (a) give the observational units at Level One and Level Two, and (b) list potential explanatory variables at both Level One and Level Two.
2. Why can’t we assume that all houses in the data set are independent? What would be the potential implications for our analysis of assuming independence among houses?
3. For each of the following sets of predictors:

- write out the two-level model for predicting housing prices,
- write out the corresponding composite model, and
- determine how many model parameters (fixed effects and variance components) must be estimated.

- Predictor set 1: Square footage, number of bedrooms
- Predictor set 2: Median neighborhood income, rating of neighborhood schools
- Predictor set 3: Square footage, number of bedrooms, age of house, median neighborhood housing price
- Predictor set 4: Square footage, median neighborhood income, rating of neighborhood schools, median neighborhood housing price
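
As a notational reference only (not a solution to any of the predictor sets above), a generic two-level model with a single Level One predictor and a single Level Two predictor might be laid out as follows. The symbols \(x_{ij}\) (a covariate for house \(j\) in neighborhood \(i\)) and \(w_{i}\) (a covariate for neighborhood \(i\)) are placeholders assumed here for illustration; the Greek letters follow the style used elsewhere in this assignment.

\[
\begin{aligned}
\text{Level One:} \quad & Y_{ij} = a_{i} + b_{i} x_{ij} + \epsilon_{ij}, \qquad \epsilon_{ij} \sim N(0, \sigma^{2}) \\
\text{Level Two:} \quad & a_{i} = \alpha_{0} + \alpha_{1} w_{i} + u_{i} \\
                        & b_{i} = \beta_{0} + \beta_{1} w_{i} + v_{i} \\
\text{Composite:} \quad & Y_{ij} = \left[ \alpha_{0} + \alpha_{1} w_{i} + \beta_{0} x_{ij} + \beta_{1} w_{i} x_{ij} \right] + \left[ u_{i} + v_{i} x_{ij} + \epsilon_{ij} \right]
\end{aligned}
\]

Here \((u_{i}, v_{i})\) is assumed bivariate normal with mean zero, variances \(\sigma_{u}^{2}\) and \(\sigma_{v}^{2}\), and correlation \(\rho_{uv}\). Under those assumptions this generic model has 4 fixed effects (\(\alpha_{0}, \alpha_{1}, \beta_{0}, \beta_{1}\)) and 4 variance components (\(\sigma^{2}, \sigma_{u}^{2}, \sigma_{v}^{2}, \rho_{uv}\)), for 8 parameters in total. The count changes as predictors or random effects are added or removed.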

Exercise 3

(C6) Why is the contour plot for multivariate normal density in Figure @ref(fig:contour-boundary)(b) tilted from southwest to northeast, but the contour plot in Figure @ref(fig:contour-boundary)(a) is not tilted?

Exercise 4

(C8) Why is Model A (Section @ref(modela8) in 8.6.2) sometimes called the “unconditional means model”? Why is it also sometimes called the “random intercepts model”? Are these two labels consistent with each other?

Exercise 5

(C9) Consider adding an indicator variable in Model B (Section @ref(randomslopeandint)) for Small Ensemble performances.
- Write out the two-level model for performance anxiety,
- Write out the corresponding composite model,
- Determine how many model parameters (fixed effects and variance components) must be estimated, and
- Explain how the interpretation for the coefficient in front of Large Ensembles would change.

Exercise 6

(C10) Give a short rule in your own words describing when an interpretation of an estimated coefficient should “hold constant” another covariate or “set to 0” that covariate (see Section @ref(interp:modeld)).

Exercise 7

(C14) Interpret other estimated parameters from Model F beyond those interpreted in Section @ref(modelf): \(\hat{\alpha}_{0}\), \(\hat{\alpha}_{2}\), \(\hat{\alpha}_{3}\), \(\hat{\beta}_{0}\), \(\hat{\gamma}_{0}\), \(\hat{\zeta}_{0}\), \(\hat{\rho}_{wx}\), \(\hat{\sigma}^{2}\), \(\hat{\sigma}_{u}^{2}\), and \(\hat{\sigma}_{z}^{2}\).

Exercise 8

(O1) @Chapp2018 explored 2014 congressional candidates’ ambiguity on political issues in their paper, Going Vague: Ambiguity and Avoidance in Online Political Messaging. They hand-coded a random sample of 2012 congressional candidates’ websites, assigning an ambiguity score. A total of 870 websites from 2014 were then automatically scored using Wordscores, a program designed for political textual analysis. In their paper, they fit a multilevel model for candidates’ ambiguities with predictors at both the candidate and district levels. Some of their hypotheses include that:
- “when incumbents do hazard issue statements, these statements will be marked by a higher degree of clarity.” (Hypothesis 1b)
- “ideological distance [from district residents] will be associated with greater ambiguity.” (Hypothesis 2a)
- “controlling for ideological distance, ideological extremity [of the candidate] should correspond to less ambiguity.” (Hypothesis 2b)
- “more variance in attitudes [among district residents] will correspond to a higher degree of ambiguity in rhetoric” (Hypothesis 3a)
- “a more heterogeneous mix of subgroups [among district residents] will also correspond to a higher degree of ambiguity in rhetoric” (Hypothesis 3b)
Their data can be found in ambiguity.csv. Variables of interest include:
- ambiguity = assigned ambiguity score. Higher scores indicate greater clarity (less ambiguity)
- democrat = 1 if a Democrat, 0 otherwise (Republican)
- incumbent = 1 if an incumbent, 0 otherwise
- ideology = a measure of the candidate’s left-right orientation. Higher (positive) scores indicate more conservative candidates and lower (negative) scores indicate more liberal candidates.
- mismatch = the distance between the candidate’s ideology and the district’s ideology (candidate ideology scores were regressed against district ideology scores; mismatch values represent the absolute value of the residual associated with each candidate)
- distID = the congressional district’s unique ID
- distLean = the district’s political leaning. Higher scores imply more conservative districts.
- attHeterogeneity = a measure of the variability of ideologies within the district. Higher scores imply more attitudinal heterogeneity among voters.
- demHeterogeneity = a measure of the demographic variability within the district. Higher scores imply more demographic heterogeneity among voters.

With this in mind, fit your own models to address these hypotheses from @Chapp2018. Be sure to use a two-level structure to account for variables at both the candidate and district levels.

Hints: (1) Make sure to conduct an EDA. (2) Build appropriate model(s) that allow you to test these hypotheses using the significance of the predictors.
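
The general workflow in the hints can be sketched in R as below. This is a minimal illustration, not a solution: it assumes the lme4 package is installed on the server, it uses a small simulated data frame (`toy`, hypothetical) that only borrows three column names from the variable list above, and it fits a single random-intercepts model. For the homework itself, read in the real ambiguity.csv from the course data folder and choose predictors that match each hypothesis.

```r
library(lme4)  # provides lmer() for multilevel models (assumed to be installed)

# Simulated stand-in data so this sketch runs on its own.
# For the homework, replace this with the real file, e.g.
# read.csv("<path to data folder>/ambiguity.csv").
set.seed(363)
n_districts <- 40
toy <- data.frame(distID = rep(seq_len(n_districts), each = 3))
toy$incumbent <- rbinom(nrow(toy), size = 1, prob = 0.4)
district_effect <- rnorm(n_districts, mean = 0, sd = 0.5)
toy$ambiguity <- 1 + 0.3 * toy$incumbent +
  district_effect[toy$distID] + rnorm(nrow(toy), sd = 1)

# Brief EDA: distribution of the response and a comparison by incumbency
summary(toy$ambiguity)
boxplot(ambiguity ~ incumbent, data = toy)

# Two-level models: candidates (Level One) nested in districts (Level Two),
# with a random intercept for each district
fit_null <- lmer(ambiguity ~ 1 + (1 | distID), data = toy, REML = FALSE)
fit_inc  <- lmer(ambiguity ~ incumbent + (1 | distID), data = toy, REML = FALSE)

summary(fit_inc)           # estimated fixed effects and variance components
anova(fit_null, fit_inc)   # likelihood ratio test for the added predictor
```

Both models are fit with `REML = FALSE` so that the likelihood ratio test in `anova()` compares maximum likelihood fits. District-level predictors can be added to the fixed-effects part of the formula in the same way as candidate-level ones.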

Submission

When you are finished with your homework, be sure to Render the final document. Once rendered, you can download your file by:

1. Find the .html file in your Files pane (on the bottom right of the screen).
2. Click the check box next to the file.
3. Click the blue gear above and then click “Export” to download.
4. Submit your final html document to the respective assignment on Moodle.