Correlated Data

BMLR Chapter 7

Tyler George

Cornell College
STA 363 Spring 2025 Block 8

Setup

library(tidyverse)
library(tidymodels)
library(GGally)
library(knitr)
library(patchwork)
library(viridis)
library(ggfortify)
library(gridExtra)
library(kableExtra)

Learning goals

Recognize a potential for correlation in a data set
Identify observational units at varying levels
Understand issues correlated data may cause in modeling
Understand how random effects models can be used to take correlation into account

Correlated observations

Examples of correlated data

In an education study, scores for students from a particular teacher are typically more similar than scores of other students with a different teacher
In a study measuring depression indices weekly over a month, the four measures for the same patient tend to be more similar than depression indices from other patients
In political polling, opinions of members from the same household tend to be more similar than opinions of members from another household

Correlation among outcomes within the same group (teacher, patient, household) is called intraclass correlation

Multilevel data

We can think of correlated data as a multilevel structure
- Population elements are aggregated into groups
- There are observational units and measurements at each level
For now we will focus on data with two levels:
- Level one: Most basic level of observation
- Level two: Groups formed from aggregated level-one observations
Example: political polling
- Level one: individual members of household
- Level two: household

Two types of effects

Fixed effects: Effects that are of interest in the study
- Can think of these as effects whose interpretations would be included in a write up of the study
Random effects: Effects we’re not interested in studying but whose variability we want to understand
- Can think of these as effects whose interpretations would not necessarily be included in a write up of the study

Example

Researchers are interested in understanding the effect social media has on opinions about a proposed economic plan. They randomly select 1000 households. They ask each adult in the household how many minutes they spend on social media daily and whether they support the proposed economic plan.

daily minutes on social media is the fixed effect
household is the random effect

Practice

Researchers conducted a randomized controlled study where patients were randomly assigned to either an anti-epileptic drug or a placebo. For each patient, the number of seizures at baseline was measured over a 2-week period. For four consecutive visits the number of seizures were determined over the past 2-week period. Patient age and sex along with visit number were recorded.

Questions

What are the level one and level two observational units?
What is the response variable and what is its type (normal, Poisson, etc.)?
Describe the within-group variation.
What are the fixed effects? What are the random effects?

Teratogen and rat pups

Data: Teratogen and rat pups

Today’s data are simulated results of an experiment with 24 dams (mother rats) randomly divided into four groups that received different doses of teratogen, a substance that could potentially cause harm to developing fetuses. The four groups are

High dose (3 mg)
Medium dose (2 mg)
Low dose (1 mg)
No dose (Control)

Each dam produced 10 rat pups and the presence of a deformity was noted.

Goal: Understand the association between teratogen exposure and the probability a pup is born with a deformity.

Scenario 1: No dose effect

Assume dose has no effect on, \(p\), the probability of a pup born with a deformity.

Scenario 1a.: \(p = 0.5\) for each dam
Scenario 1b.: \(p \sim Beta(0.5, 0.5)\) (expected value = 0.5)

theoretical_pi <- tibble(x = 1:250000,
                         p1 = rbeta(x, 0.5, 0.5))

tibble(x = 1:24, pi_1b) |>
  ggplot() +
    geom_histogram(bins = 5, aes(x = pi_1b, y = ..density..),
                   color = "black", fill = "blue", alpha = 0.2) + 
    coord_cartesian(xlim = c(0,1)) +
    geom_density(data = theoretical_pi, aes(x = p1), 
                  linetype = 3, color = "blue", lwd = 2) +
    geom_vline(xintercept = 0.5, color = "red", lwd = 2) +
    labs(title = "Probability of deformity", 
         subtitle = "Red = Scenario 1a, Blue dashed line  = Scenario 1b", 
         x = "Probability of Deformity")

Questions

Would you expect the number of pups with a deformity for dams in Scenario 1a to follow a distribution similar to the binomial distribution with \(n=10\) and \(p=0.5\)? Why or why not?
Would you expect the number of pups with a deformity for dams in Scenario 1b to follow a distribution similar to the binomial distribution with \(n=10\) and \(p=0.5\)? Why or why not?
Which scenario do you think is more realistic - Scenario 1a or 1b?

Investigating

Output
Code

mean_1a	sd_1a	mean_1b	sd_1b
5.166667	1.493949	5.666667	4.103727

pi_1a <- rep(0.5, 24)
count_1a <- rbinom(24, 10, pi_1a)

pi_1b <- rbeta(24,.5,.5)  
count_1b <- rbinom(24, 10, pi_1b)

scenario_1 <- 
  tibble(pi_1a, count_1a, pi_1b, count_1b) |>
  mutate(phat_1a = count_1a / 10, 
         phat_1b = count_1b / 10)

hist_1a <- ggplot(data = scenario_1, aes(x = count_1a)) + 
  geom_histogram(bins = 5, color = "black", fill = "steelblue") +
  coord_cartesian(xlim = c(0, 10)) +
  labs(title = "Scenario 1a: Binomial, p = 0.5",
       x = "Count of deformed pups per dam")

hist_1b <- ggplot(data = scenario_1, aes(x = count_1b)) + 
  geom_histogram(bins = 5, color = "black", fill = "steelblue") +
  coord_cartesian(xlim = c(0, 10)) +
  labs(title = "Scenario 1b: Binomial, p ~ Beta(0.5, 0.5)",
       x = "Count of deformed pups per dam")

hist_1a / hist_1b

scenario_1 |> 
  summarise(mean_1a = mean(count_1a), sd_1a = sd(count_1a),
            mean_1b = mean(count_1b), sd_1b = sd(count_1b) ) |>
  kable()

Let’s take a look at a binomial and quasibinomial model for Scenarios 1a and 1b.

Scenario 2: Dose effect

Now we will consider the effect of the dose of teratogen on the probability of a pup born with a deformity. The 24 pups have been randomly divided into four groups:

High dose (dose = 3)
Medium dose (dose = 2)
Low dose (dose = 1)
No dose (dose = 0)

We will assume the true relationship between \(p\) and dose is the following:

\[\log\Big(\frac{p}{1-p}\Big) = -2 + 1.33 ~ dose\]

Scenario 2

Scenario 2a.

\[p = \frac{e^{-2 + 1.33 ~ dose}}{1 + e^{-2 + 1.33 ~ dose}}\]

Scenario 2b.:

\[p \sim Beta\Big(\frac{2p}{(1-p)}, 2\Big)\]

On average, dams who receive dose \(x\) have the same probability of deformed pup as dams with dose \(x\) under Scenario 2a.

Distributions under Scenario 2

Summary stats under Scen.2

Output
Code

mean_2a	sd_2a	mean_2b	sd_2b
4.791667	3.202976	4.666667	3.583375

Summary statistics of Scenario 2 by dose.
	Scenario 2a				Scenario 2b
Dosage	Mean p	SD p	Mean Count	SD Count	Mean p	SD p	Mean Count	SD Count
0	0.119	0	1.333	1.366	0.061	0.069	0.500	0.837
1	0.339	0	3.167	1.835	0.239	0.208	3.500	2.881
2	0.661	0	5.833	1.472	0.615	0.195	5.833	1.941
3	0.881	0	8.833	1.169	0.872	0.079	8.833	1.169

scenario_2 |> 
  summarise(mean_2a = mean(count_2a), sd_2a = sd(count_2a),
            mean_2b = mean(count_2b), sd_2b = sd(count_2b) )

scenario2Tab <- scenario_2 |>
                  group_by(dose) |>
                  summarise(mean_2a_pi = round(mean(pi_2a),3), sd_2a_pi = round(sd(pi_2a),3),
                            mean_2a_cnt = round(mean(count_2a),3), sd_2a_cnt = round(sd(count_2a),3),
                            mean_2b_pi = round(mean(pi_2b),3), sd_2b_pi = round(sd(pi_2b),3),
                            mean_2b_cnt = round(mean(count_2b),3), sd_2b_cnt = round(sd(count_2b),3)) |>
                  as.data.frame()
colnames(scenario2Tab) <- c("Dosage","Mean p", "SD p",
    "Mean Count", "SD Count", "Mean p", "SD p",
    "Mean Count", "SD Count")
kable(scenario2Tab, booktabs = T, 
    caption="Summary statistics of Scenario 2 by dose.") |>
    add_header_above(c(" " = 1, "Scenario 2a" = 4, 
                       "Scenario 2b" = 4)) |>
    kable_styling(latex_options = "scale_down") |>
    column_spec(c(4:5,8:9), width = "1cm")

Questions

In Scenario 2a, dams produced 4.79 deformed pups on average, with standard deviation 3.20. Scenario 2b saw an average of 4.67 with standard deviation 3.58. Why are comparisons by dose more meaningful than these overall comparisons?
We will use binomial and quasibinomial regression to model the relationship between dose and probability of pup born with a deformity. What can you say about the center and the width of the confidence intervals under Scenarios 2a and 2b?

Which will be similar and why?
Which will be different and how?

Scenario 2: Estimated odds ratio

The estimated effect of dose and the 95% CI from the binomial and quasibinomial models are below:

Scenario 2a

	Odds Ratio	95% CI
Binomial	3.536	(2.604, 4.958)
Quasibinomial	3.536	(2.512, 5.186)

Scenario 2b

	Odds Ratio	95% CI
Binomial	4.311	(3.086, 6.271)
Quasibinomial	4.311	(2.735, 7.352)

Questions

Describe how the quasibinomial analysis of Scenario 2b differs from the binomial analysis of the same simulated data. Do confidence intervals contain the true model parameters? Is this what you expected? Why?
Why are differences between quasibinomial and binomial models of Scenario 2a less noticeable than the differences in Scenario 2b?

Summary

The structure of the data set may imply correlation between observations.
Correlated observations provide less information than independent observations; we need to account for this reduction in information.
Failing to account for this reduction could result in underestimating standard error, thus resulting in overstating significance and the precision of the estimates.
We showed how we can account for this by incorporating the dispersion parameter or a random effect.

Acknowledgements

These slides are based on content in

BMLR: Chapter 7 - Logistic Regression
Initial versions of the slides are by Dr. Maria Tackett, Duke University