Chapter 2 - Likelihoods

Does having boys or girls run in the family?

Lab Setup

Copy the project lab folder located at Home -> STA363_inst_files-> labs. If you check the box next to the folder name, then click the small gear icon you can “copy to” and put a copy of the folder in your newly created folder.
Now, click File-> Open Project and navigate to the project file in the folder you just copied.
You can place your responses in the file qmd file included.

Introduction

Does having boys or girls run in the family? Using demographic data from the National Longitudinal Survey of Youth, can we identify biases in sex composition patterns of children? The data is found in Table 2 in the Rodgers and Doughty (2001) article, and relevant R code can be found under below.

Questions

Model 1 – Sex Unconditional Model. Each child is independent of the others, and there is a constant probability ( ) that a child is a boy.

Consider a small example with 3 families with compositions of children given by BBG, GBG, and GG.
Find the maximum likelihood estimator (MLE) for by:

Conducting a numerical search in R for the largest likelihood over a fine grid of values 0-1.
Conducting a numerical search in R for the largest log-likelihood between 0 and 1. Illustrate the process graphically, and report the maximum value of the likelihood and log-likelihood functions. Does it make sense that both methods would agree (and agree with the mathematical approach)?

Apply Model 1 to the NLSY data (families in Table 2.4 below who have 3 or fewer children). Find the MLE for by adapting the R code for (1). There are 5,626 families in the table.

#Family comp data table
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(kableExtra)


Attaching package: 'kableExtra'

The following object is masked from 'package:dplyr':

    group_rows

table5chp2 <- data.frame(Famcomp, Numfams, Numchild)

table5chp2 |> kbl(col.names = c("Family Composition",
    "Number of families", "Number of children"),
      caption = "Table 2.4 Number of families and children in families with given composition in NLSY data.")

Table 2.4 Number of families and children in families with given composition in NLSY data.
Family Composition	Number of families	Number of children
B	930	930
G	951	951
BB	582	1164
BG	666	1332
GB	666	1332
GG	530	1060
BBB	186	558
BBG	177	531
BGG	173	519
BGB	148	444
GBB	151	453
GGB	125	375
GBG	182	546
GGG	159	477

Model 2 – Sex Conditional Model. The probability of having a boy depends on whether you’ve had boys previously, so Model 2 will have three parameters:

= probability of a boy when previously have had an equal number of boys and girls (neutral)

= probability of a boy when previously have had more boys than girls (boy bias)

= probability of a boy when previously have had more girls than boys (girl bias)

Write out the likelihood function given Model 2 for the small set of data in (1). [You could also try BGG, GGB, BBB just for fun.]

Code

Model 1: Sex Unconditional Model

Assumes the sex for each birth is independent of other births

# Evaluate small set of possible values of pb
pb <- c(.3, .35, .375, .4, .45)  # possible values for prob a boy is born
lik <- pb^3 * (1 - pb)^5         # likelihood of getting observed data
cbind(pb, lik)                   # print table of results
plot(pb, lik, xlab = "possible values of pb", ylab = "Likelihood")
max(lik)                         # maximum likelihood over 5 values of pb
pb[lik == max(lik)]              # value of pb where likelihood maximized

# Evaluate finer grid of possible values of pb
pb <- seq(0, 1, length = 1001)   # possible values for prob a boy is born
lik <- pb^3 * (1 - pb)^5         # likelihood of getting observed data
plot(pb, lik, xlab = "possible values of pb", ylab = "Likelihood", type = "l")
max(lik)                         # maximum likelihood over 1001 values of pb
pb[lik == max(lik)]              # value of pb where likelihood maximized

loglik <- 3 * log(pb) + 5 * log(1 - pb)      # find log likelihoods
plot(pb, loglik, xlab = "possible values of pb", ylab = "Loglikelihood", 
     type = "l")
max(loglik)                    # maximum loglikelihood over 1001 values of pb
pb[loglik == max(loglik)]      # likelihood and loglikelihood max at same spot
mle_pb_m1_small <- pb[loglik == max(loglik)]
max_logL_m1_small <- max(loglik)

# Create ggplot of likelihood and log-likelihood functions
model1grid <- data.frame(pb = pb, lik1 = lik, loglik1 = loglik)
ggplot(data = model1grid, aes(x = pb, y = lik1)) +
  geom_line() +
  labs(x = "possible values of pb", y = "Likelihood")
ggplot(data = model1grid, aes(x = pb, y = loglik1)) +
  geom_line() +
  labs(x = "possible values of pb", y = "log Likelihood")

# Apply Model 1 to NLSY data (for families with 3 or fewer children)
pb <- seq(0, 1, length = 10001)   # possible values for prob a boy is born
lik <- pb^5416 * (1 - pb)^5256    # likelihood (too small)
max(lik)
summary(lik)  

# loglikelihood of getting observed data
loglik <- 5416 * log(pb) + 5256 * log(1 - pb)   
plot(pb, loglik, xlab = "possible values of pb", ylab = "Loglikelihood", 
     type = "l")
plot(pb[loglik > (-7500)], loglik[loglik > (-7500)],
  xlab = "possible values of pb", 
  ylab = "Loglikelihood", type = "l")  # zoom plot
max(loglik)                 # maximum loglikelihood over all values of pb
pb[loglik == max(loglik)]   # MLE of pb
mle_pb_m1_nlsy <- pb[loglik == max(loglik)]
max_logL_m1_nlsy <- max(loglik)

# Create ggplot of likelihood and log-likelihood functions
model1grid <- data.frame(pb = pb, lik1 = lik, loglik1 = loglik)
ggplot(data = model1grid, aes(x = pb, y = lik1)) +
  geom_line() +
  labs(x = "possible values of pb", y = "Likelihood")
ggplot(data = model1grid, aes(x = pb, y = loglik1)) +
  geom_line() +
  labs(x = "possible values of pb", y = "log Likelihood")
model1grid |>
  filter(loglik1 > (-7500)) |>
  ggplot(aes(x = pb, y = loglik1)) +
    geom_line() +
    labs(x = "possible values of pb with loglik above -7500", 
         y = "log Likelihood")

Model 2: Sex Conditional Model (small 3 famly data set)

Assumes probability of having a boy depends on whether you’ve had boys previously

# Find MLEs with 3-dimensional grid search
pbb <- seq(0, 1, length = 101)
pbg <- seq(0, 1, length = 101)
pbn <- seq(0, 1, length = 101)
model2_grid <- expand.grid(pbb = pbb, pbg = pbg, pbn = pbn)

lik <- model2_grid$pbn^1 * (1 - model2_grid$pbn)^3 * model2_grid$pbb^1 * 
  (1 - model2_grid$pbb)^1 * model2_grid$pbg^1 * (1 - model2_grid$pbg)^1
loglik <- 1 * log(model2_grid$pbn) + 3 * log(1 - model2_grid$pbn) + 
  1 * log(model2_grid$pbb) + 1 * log(1 - model2_grid$pbb) + 
  1 * log(model2_grid$pbg) + 1 * log(1 - model2_grid$pbg)

# Print results
max(lik)        # maximum likelihood over all combos of pbb, pbg, pbn
max(loglik)     # maximum loglikelihood over all combos of pbb, pbg, pbn

model2_grid$pbb[loglik == max(loglik)]   # MLE of pbb
model2_grid$pbg[loglik == max(loglik)]   # MLE of pbg
model2_grid$pbn[loglik == max(loglik)]   # MLE of pbn

# Store results
mle_pbb_m2_small <- model2_grid$pbb[loglik == max(loglik)]   # MLE of pbb
mle_pbg_m2_small <- model2_grid$pbg[loglik == max(loglik)]   # MLE of pbg
mle_pbn_m2_small <- model2_grid$pbn[loglik == max(loglik)]   # MLE of pbn

max_L_m2_small <- max(lik)
max_logL_m2_small <- max(loglik)

Model comparisons - Model 1 vs. Model 2

lrt <- 2 * (max_logL_m2_small - max_logL_m1_small)   
lrt                       # likelihood ratio test statistic
1 - pchisq(lrt, df = 2)   # p-value for testing Ho: no diff between Models 1&2

# AIC and BIC values for Models 1 and 2
aic1 <- -2 * max_logL_m1_small + 2 * 1
aic1
bic1 <- -2 * max_logL_m1_small + log(8) * 1
bic1
aic2 <- -2 * max_logL_m2_small + 2 * 3
aic2
bic2 <- -2 * max_logL_m2_small + log(8) * 3
bic2