BMLR Chapter 7
Cornell College
STA 363 Spring 2025 Block 8
In an education study, scores for students from a particular teacher are typically more similar than scores of other students with a different teacher
In a study measuring depression indices weekly over a month, the four measures for the same patient tend to be more similar than depression indices from other patients
In political polling, opinions of members from the same household tend to be more similar than opinions of members from another household
Correlation among outcomes within the same group (teacher, patient, household) is called intraclass correlation
We can think of correlated data as a multilevel structure
For now we will focus on data with two levels:
Example: political polling
Researchers are interested in understanding the effect social media has on opinions about a proposed economic plan. They randomly select 1000 households. They ask each adult in the household how many minutes they spend on social media daily and whether they support the proposed economic plan.
Researchers conducted a randomized controlled study where patients were randomly assigned to either an anti-epileptic drug or a placebo. For each patient, the number of seizures at baseline was measured over a 2-week period. For four consecutive visits the number of seizures were determined over the past 2-week period. Patient age and sex along with visit number were recorded.
Questions
Today’s data are simulated results of an experiment with 24 dams (mother rats) randomly divided into four groups that received different doses of teratogen, a substance that could potentially cause harm to developing fetuses. The four groups are
Each dam produced 10 rat pups and the presence of a deformity was noted.
Goal: Understand the association between teratogen exposure and the probability a pup is born with a deformity.
Assume dose has no effect on, \(p\), the probability of a pup born with a deformity.
Scenario 1a.: \(p = 0.5\) for each dam
Scenario 1b.: \(p \sim Beta(0.5, 0.5)\) (expected value = 0.5)
theoretical_pi <- tibble(x = 1:250000,
p1 = rbeta(x, 0.5, 0.5))
tibble(x = 1:24, pi_1b) |>
ggplot() +
geom_histogram(bins = 5, aes(x = pi_1b, y = ..density..),
color = "black", fill = "blue", alpha = 0.2) +
coord_cartesian(xlim = c(0,1)) +
geom_density(data = theoretical_pi, aes(x = p1),
linetype = 3, color = "blue", lwd = 2) +
geom_vline(xintercept = 0.5, color = "red", lwd = 2) +
labs(title = "Probability of deformity",
subtitle = "Red = Scenario 1a, Blue dashed line = Scenario 1b",
x = "Probability of Deformity")
Would you expect the number of pups with a deformity for dams in Scenario 1a to follow a distribution similar to the binomial distribution with \(n=10\) and \(p=0.5\)? Why or why not?
Would you expect the number of pups with a deformity for dams in Scenario 1b to follow a distribution similar to the binomial distribution with \(n=10\) and \(p=0.5\)? Why or why not?
Which scenario do you think is more realistic - Scenario 1a or 1b?
mean_1a | sd_1a | mean_1b | sd_1b |
---|---|---|---|
5.166667 | 1.493949 | 5.666667 | 4.103727 |
scenario_1 <-
tibble(pi_1a, count_1a, pi_1b, count_1b) |>
mutate(phat_1a = count_1a / 10,
phat_1b = count_1b / 10)
hist_1a <- ggplot(data = scenario_1, aes(x = count_1a)) +
geom_histogram(bins = 5, color = "black", fill = "steelblue") +
coord_cartesian(xlim = c(0, 10)) +
labs(title = "Scenario 1a: Binomial, p = 0.5",
x = "Count of deformed pups per dam")
hist_1b <- ggplot(data = scenario_1, aes(x = count_1b)) +
geom_histogram(bins = 5, color = "black", fill = "steelblue") +
coord_cartesian(xlim = c(0, 10)) +
labs(title = "Scenario 1b: Binomial, p ~ Beta(0.5, 0.5)",
x = "Count of deformed pups per dam")
hist_1a / hist_1b
Let’s take a look at a binomial and quasibinomial model for Scenarios 1a and 1b.
Now we will consider the effect of the dose of teratogen on the probability of a pup born with a deformity. The 24 pups have been randomly divided into four groups:
dose = 3
)dose = 2
)dose = 1
)dose = 0
)We will assume the true relationship between \(p\) and dose is the following:
\[\log\Big(\frac{p}{1-p}\Big) = -2 + 1.33 ~ dose\]
Scenario 2a.
\[p = \frac{e^{-2 + 1.33 ~ dose}}{1 + e^{-2 + 1.33 ~ dose}}\]
Scenario 2b.:
\[p \sim Beta\Big(\frac{2p}{(1-p)}, 2\Big)\]
On average, dams who receive dose \(x\) have the same probability of deformed pup as dams with dose \(x\) under Scenario 2a.
mean_2a | sd_2a | mean_2b | sd_2b |
---|---|---|---|
4.791667 | 3.202976 | 4.666667 | 3.583375 |
Scenario 2a
|
Scenario 2b
|
|||||||
---|---|---|---|---|---|---|---|---|
Dosage | Mean p | SD p | Mean Count | SD Count | Mean p | SD p | Mean Count | SD Count |
0 | 0.119 | 0 | 1.333 | 1.366 | 0.061 | 0.069 | 0.500 | 0.837 |
1 | 0.339 | 0 | 3.167 | 1.835 | 0.239 | 0.208 | 3.500 | 2.881 |
2 | 0.661 | 0 | 5.833 | 1.472 | 0.615 | 0.195 | 5.833 | 1.941 |
3 | 0.881 | 0 | 8.833 | 1.169 | 0.872 | 0.079 | 8.833 | 1.169 |
scenario2Tab <- scenario_2 |>
group_by(dose) |>
summarise(mean_2a_pi = round(mean(pi_2a),3), sd_2a_pi = round(sd(pi_2a),3),
mean_2a_cnt = round(mean(count_2a),3), sd_2a_cnt = round(sd(count_2a),3),
mean_2b_pi = round(mean(pi_2b),3), sd_2b_pi = round(sd(pi_2b),3),
mean_2b_cnt = round(mean(count_2b),3), sd_2b_cnt = round(sd(count_2b),3)) |>
as.data.frame()
colnames(scenario2Tab) <- c("Dosage","Mean p", "SD p",
"Mean Count", "SD Count", "Mean p", "SD p",
"Mean Count", "SD Count")
kable(scenario2Tab, booktabs = T,
caption="Summary statistics of Scenario 2 by dose.") |>
add_header_above(c(" " = 1, "Scenario 2a" = 4,
"Scenario 2b" = 4)) |>
kable_styling(latex_options = "scale_down") |>
column_spec(c(4:5,8:9), width = "1cm")
In Scenario 2a, dams produced 4.79 deformed pups on average, with standard deviation 3.20. Scenario 2b saw an average of 4.67 with standard deviation 3.58. Why are comparisons by dose more meaningful than these overall comparisons?
We will use binomial and quasibinomial regression to model the relationship between dose and probability of pup born with a deformity. What can you say about the center and the width of the confidence intervals under Scenarios 2a and 2b?
The estimated effect of dose and the 95% CI from the binomial and quasibinomial models are below:
Scenario 2a
Odds Ratio | 95% CI | |
---|---|---|
Binomial | 3.536 | (2.604, 4.958) |
Quasibinomial | 3.536 | (2.512, 5.186) |
Scenario 2b
Odds Ratio | 95% CI | |
---|---|---|
Binomial | 4.311 | (3.086, 6.271) |
Quasibinomial | 4.311 | (2.735, 7.352) |
Describe how the quasibinomial analysis of Scenario 2b differs from the binomial analysis of the same simulated data. Do confidence intervals contain the true model parameters? Is this what you expected? Why?
Why are differences between quasibinomial and binomial models of Scenario 2a less noticeable than the differences in Scenario 2b?
The structure of the data set may imply correlation between observations.
Correlated observations provide less information than independent observations; we need to account for this reduction in information.
Failing to account for this reduction could result in underestimating standard error, thus resulting in overstating significance and the precision of the estimates.
We showed how we can account for this by incorporating the dispersion parameter or a random effect.
These slides are based on content in
Initial versions of the slides are by Dr. Maria Tackett, Duke University