library(tidyverse)
library(tidymodels)
library(GGally)
library(knitr)
library(patchwork)
library(viridis)
library(ggfortify)
library(gridExtra)
library(kableExtra)
Correlated Data
BMLR Chapter 7
Setup
Learning goals
- Recognize a potential for correlation in a data set
- Identify observational units at varying levels
- Understand issues correlated data may cause in modeling
- Understand how random effects models can be used to take correlation into account
Teratogen and rat pups
Data: Teratogen and rat pups
Today’s data are simulated results of an experiment with 24 dams (mother rats) randomly divided into four groups that received different doses of teratogen, a substance that could potentially cause harm to developing fetuses. The four groups are
- High dose (3 mg)
- Medium dose (2 mg)
- Low dose (1 mg)
- No dose (Control)
. . .
Each dam produced 10 rat pups and the presence of a deformity was noted.
. . .
Goal: Understand the association between teratogen exposure and the probability a pup is born with a deformity.
Scenario 1: No dose effect
Scenario 1: No dose effect
Assume dose has no effect on, \(p\), the probability of a pup born with a deformity.
Scenario 1a.: \(p = 0.5\) for each dam
Scenario 1b.: \(p \sim Beta(0.5, 0.5)\) (expected value = 0.5)
. . .
Scenario 1: No dose effect
<- tibble(x = 1:250000,
theoretical_pi p1 = rbeta(x, 0.5, 0.5))
tibble(x = 1:24, pi_1b) |>
ggplot() +
geom_histogram(bins = 5, aes(x = pi_1b, y = ..density..),
color = "black", fill = "blue", alpha = 0.2) +
coord_cartesian(xlim = c(0,1)) +
geom_density(data = theoretical_pi, aes(x = p1),
linetype = 3, color = "blue", lwd = 2) +
geom_vline(xintercept = 0.5, color = "red", lwd = 2) +
labs(title = "Probability of deformity",
subtitle = "Red = Scenario 1a, Blue dashed line = Scenario 1b",
x = "Probability of Deformity")
From Figure 7.1 in BMLR
Questions
Would you expect the number of pups with a deformity for dams in Scenario 1a to follow a distribution similar to the binomial distribution with \(n=10\) and \(p=0.5\)? Why or why not?
Would you expect the number of pups with a deformity for dams in Scenario 1b to follow a distribution similar to the binomial distribution with \(n=10\) and \(p=0.5\)? Why or why not?
Which scenario do you think is more realistic - Scenario 1a or 1b?
Investigating
mean_1a | sd_1a | mean_1b | sd_1b |
---|---|---|---|
5.166667 | 1.493949 | 5.666667 | 4.103727 |
<- rep(0.5, 24)
pi_1a <- rbinom(24, 10, pi_1a)
count_1a
<- rbeta(24,.5,.5)
pi_1b <- rbinom(24, 10, pi_1b) count_1b
<-
scenario_1 tibble(pi_1a, count_1a, pi_1b, count_1b) |>
mutate(phat_1a = count_1a / 10,
phat_1b = count_1b / 10)
<- ggplot(data = scenario_1, aes(x = count_1a)) +
hist_1a geom_histogram(bins = 5, color = "black", fill = "steelblue") +
coord_cartesian(xlim = c(0, 10)) +
labs(title = "Scenario 1a: Binomial, p = 0.5",
x = "Count of deformed pups per dam")
<- ggplot(data = scenario_1, aes(x = count_1b)) +
hist_1b geom_histogram(bins = 5, color = "black", fill = "steelblue") +
coord_cartesian(xlim = c(0, 10)) +
labs(title = "Scenario 1b: Binomial, p ~ Beta(0.5, 0.5)",
x = "Count of deformed pups per dam")
/ hist_1b hist_1a
|>
scenario_1 summarise(mean_1a = mean(count_1a), sd_1a = sd(count_1a),
mean_1b = mean(count_1b), sd_1b = sd(count_1b) ) |>
kable()
. . .
Let’s take a look at a binomial and quasibinomial model for Scenarios 1a and 1b.
Scenario 2: Dose effect
Scenario 2: Dose effect
Now we will consider the effect of the dose of teratogen on the probability of a pup born with a deformity. The 24 pups have been randomly divided into four groups:
- High dose (
dose = 3
) - Medium dose (
dose = 2
) - Low dose (
dose = 1
) - No dose (
dose = 0
)
. . .
We will assume the true relationship between \(p\) and dose is the following:
\[\log\Big(\frac{p}{1-p}\Big) = -2 + 1.33 ~ dose\]
Scenario 2
Scenario 2a.
\[p = \frac{e^{-2 + 1.33 ~ dose}}{1 + e^{-2 + 1.33 ~ dose}}\]
. . .
Scenario 2b.:
\[p \sim Beta\Big(\frac{2p}{(1-p)}, 2\Big)\]
On average, dams who receive dose \(x\) have the same probability of deformed pup as dams with dose \(x\) under Scenario 2a.
Distributions under Scenario 2
Replicated from Figure 7.3 in BMLR
Summary stats under Scen.2
mean_2a | sd_2a | mean_2b | sd_2b |
---|---|---|---|
4.791667 | 3.202976 | 4.666667 | 3.583375 |
Dosage | Mean p | SD p | Mean Count | SD Count | Mean p | SD p | Mean Count | SD Count |
---|---|---|---|---|---|---|---|---|
0 | 0.119 | 0 | 1.333 | 1.366 | 0.061 | 0.069 | 0.500 | 0.837 |
1 | 0.339 | 0 | 3.167 | 1.835 | 0.239 | 0.208 | 3.500 | 2.881 |
2 | 0.661 | 0 | 5.833 | 1.472 | 0.615 | 0.195 | 5.833 | 1.941 |
3 | 0.881 | 0 | 8.833 | 1.169 | 0.872 | 0.079 | 8.833 | 1.169 |
|>
scenario_2 summarise(mean_2a = mean(count_2a), sd_2a = sd(count_2a),
mean_2b = mean(count_2b), sd_2b = sd(count_2b) )
<- scenario_2 |>
scenario2Tab group_by(dose) |>
summarise(mean_2a_pi = round(mean(pi_2a),3), sd_2a_pi = round(sd(pi_2a),3),
mean_2a_cnt = round(mean(count_2a),3), sd_2a_cnt = round(sd(count_2a),3),
mean_2b_pi = round(mean(pi_2b),3), sd_2b_pi = round(sd(pi_2b),3),
mean_2b_cnt = round(mean(count_2b),3), sd_2b_cnt = round(sd(count_2b),3)) |>
as.data.frame()
colnames(scenario2Tab) <- c("Dosage","Mean p", "SD p",
"Mean Count", "SD Count", "Mean p", "SD p",
"Mean Count", "SD Count")
kable(scenario2Tab, booktabs = T,
caption="Summary statistics of Scenario 2 by dose.") |>
add_header_above(c(" " = 1, "Scenario 2a" = 4,
"Scenario 2b" = 4)) |>
kable_styling(latex_options = "scale_down") |>
column_spec(c(4:5,8:9), width = "1cm")
From Table 7.2 in BMLR
Questions
In Scenario 2a, dams produced 4.79 deformed pups on average, with standard deviation 3.20. Scenario 2b saw an average of 4.67 with standard deviation 3.58. Why are comparisons by dose more meaningful than these overall comparisons?
We will use binomial and quasibinomial regression to model the relationship between dose and probability of pup born with a deformity. What can you say about the center and the width of the confidence intervals under Scenarios 2a and 2b?
- Which will be similar and why?
- Which will be different and how?
Scenario 2: Estimated odds ratio
The estimated effect of dose and the 95% CI from the binomial and quasibinomial models are below:
. . .
Scenario 2a
Odds Ratio | 95% CI | |
---|---|---|
Binomial | 3.536 | (2.604, 4.958) |
Quasibinomial | 3.536 | (2.512, 5.186) |
Scenario 2b
Odds Ratio | 95% CI | |
---|---|---|
Binomial | 4.311 | (3.086, 6.271) |
Quasibinomial | 4.311 | (2.735, 7.352) |
Questions
Describe how the quasibinomial analysis of Scenario 2b differs from the binomial analysis of the same simulated data. Do confidence intervals contain the true model parameters? Is this what you expected? Why?
Why are differences between quasibinomial and binomial models of Scenario 2a less noticeable than the differences in Scenario 2b?
Summary
The structure of the data set may imply correlation between observations.
Correlated observations provide less information than independent observations; we need to account for this reduction in information.
Failing to account for this reduction could result in underestimating standard error, thus resulting in overstating significance and the precision of the estimates.
We showed how we can account for this by incorporating the dispersion parameter or a random effect.
Acknowledgements
These slides are based on content in
Initial versions of the slides are by Dr. Maria Tackett, Duke University