Copy the project lab folder located at Home -> STA363_inst_files-> labs. If you check the box next to the folder name, then click the small gear icon you can “copy to” and put a copy of the folder in your newly created folder.
Now, click File-> Open Project and navigate to the project file in the folder you just copied.
You can place your responses in the file qmd file included.
Introduction
Does having boys or girls run in the family? Using demographic data from the National Longitudinal Survey of Youth, can we identify biases in sex composition patterns of children? The data is found in Table 2 in the Rodgers and Doughty (2001) article, and relevant R code can be found under below.
Questions
Model 1 – Sex Unconditional Model. Each child is independent of the others, and there is a constant probability ( ) that a child is a boy.
Consider a small example with 3 families with compositions of children given by BBG, GBG, and GG.
Find the maximum likelihood estimator (MLE) for by:
Conducting a numerical search in R for the largest likelihood over a fine grid of values 0-1.
Conducting a numerical search in R for the largest log-likelihood between 0 and 1. Illustrate the process graphically, and report the maximum value of the likelihood and log-likelihood functions. Does it make sense that both methods would agree (and agree with the mathematical approach)?
Apply Model 1 to the NLSY data (families in Table 2.4 below who have 3 or fewer children). Find the MLE for by adapting the R code for (1). There are 5,626 families in the table.
#Family comp data tablelibrary(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(kableExtra)
Attaching package: 'kableExtra'
The following object is masked from 'package:dplyr':
group_rows
table5chp2 <-data.frame(Famcomp, Numfams, Numchild)table5chp2 |>kbl(col.names =c("Family Composition","Number of families", "Number of children"),caption ="Table 2.4 Number of families and children in families with given composition in NLSY data.")
Table 2.4 Number of families and children in families with given composition in NLSY data.
Family Composition
Number of families
Number of children
B
930
930
G
951
951
BB
582
1164
BG
666
1332
GB
666
1332
GG
530
1060
BBB
186
558
BBG
177
531
BGG
173
519
BGB
148
444
GBB
151
453
GGB
125
375
GBG
182
546
GGG
159
477
Model 2 – Sex Conditional Model. The probability of having a boy depends on whether you’ve had boys previously, so Model 2 will have three parameters:
= probability of a boy when previously have had an equal number of boys and girls (neutral)
= probability of a boy when previously have had more boys than girls (boy bias)
= probability of a boy when previously have had more girls than boys (girl bias)
Write out the likelihood function given Model 2 for the small set of data in (1). [You could also try BGG, GGB, BBB just for fun.]
Code
Model 1: Sex Unconditional Model
Assumes the sex for each birth is independent of other births
# Evaluate small set of possible values of pbpb <-c(.3, .35, .375, .4, .45) # possible values for prob a boy is bornlik <- pb^3* (1- pb)^5# likelihood of getting observed datacbind(pb, lik) # print table of resultsplot(pb, lik, xlab ="possible values of pb", ylab ="Likelihood")max(lik) # maximum likelihood over 5 values of pbpb[lik ==max(lik)] # value of pb where likelihood maximized# Evaluate finer grid of possible values of pbpb <-seq(0, 1, length =1001) # possible values for prob a boy is bornlik <- pb^3* (1- pb)^5# likelihood of getting observed dataplot(pb, lik, xlab ="possible values of pb", ylab ="Likelihood", type ="l")max(lik) # maximum likelihood over 1001 values of pbpb[lik ==max(lik)] # value of pb where likelihood maximizedloglik <-3*log(pb) +5*log(1- pb) # find log likelihoodsplot(pb, loglik, xlab ="possible values of pb", ylab ="Loglikelihood", type ="l")max(loglik) # maximum loglikelihood over 1001 values of pbpb[loglik ==max(loglik)] # likelihood and loglikelihood max at same spotmle_pb_m1_small <- pb[loglik ==max(loglik)]max_logL_m1_small <-max(loglik)# Create ggplot of likelihood and log-likelihood functionsmodel1grid <-data.frame(pb = pb, lik1 = lik, loglik1 = loglik)ggplot(data = model1grid, aes(x = pb, y = lik1)) +geom_line() +labs(x ="possible values of pb", y ="Likelihood")ggplot(data = model1grid, aes(x = pb, y = loglik1)) +geom_line() +labs(x ="possible values of pb", y ="log Likelihood")
# Apply Model 1 to NLSY data (for families with 3 or fewer children)pb <-seq(0, 1, length =10001) # possible values for prob a boy is bornlik <- pb^5416* (1- pb)^5256# likelihood (too small)max(lik)summary(lik) # loglikelihood of getting observed dataloglik <-5416*log(pb) +5256*log(1- pb) plot(pb, loglik, xlab ="possible values of pb", ylab ="Loglikelihood", type ="l")plot(pb[loglik > (-7500)], loglik[loglik > (-7500)],xlab ="possible values of pb", ylab ="Loglikelihood", type ="l") # zoom plotmax(loglik) # maximum loglikelihood over all values of pbpb[loglik ==max(loglik)] # MLE of pbmle_pb_m1_nlsy <- pb[loglik ==max(loglik)]max_logL_m1_nlsy <-max(loglik)# Create ggplot of likelihood and log-likelihood functionsmodel1grid <-data.frame(pb = pb, lik1 = lik, loglik1 = loglik)ggplot(data = model1grid, aes(x = pb, y = lik1)) +geom_line() +labs(x ="possible values of pb", y ="Likelihood")ggplot(data = model1grid, aes(x = pb, y = loglik1)) +geom_line() +labs(x ="possible values of pb", y ="log Likelihood")model1grid |>filter(loglik1 > (-7500)) |>ggplot(aes(x = pb, y = loglik1)) +geom_line() +labs(x ="possible values of pb with loglik above -7500", y ="log Likelihood")
Model 2: Sex Conditional Model (small 3 famly data set)
Assumes probability of having a boy depends on whether you’ve had boys previously
# Find MLEs with 3-dimensional grid searchpbb <-seq(0, 1, length =101)pbg <-seq(0, 1, length =101)pbn <-seq(0, 1, length =101)model2_grid <-expand.grid(pbb = pbb, pbg = pbg, pbn = pbn)lik <- model2_grid$pbn^1* (1- model2_grid$pbn)^3* model2_grid$pbb^1* (1- model2_grid$pbb)^1* model2_grid$pbg^1* (1- model2_grid$pbg)^1loglik <-1*log(model2_grid$pbn) +3*log(1- model2_grid$pbn) +1*log(model2_grid$pbb) +1*log(1- model2_grid$pbb) +1*log(model2_grid$pbg) +1*log(1- model2_grid$pbg)# Print resultsmax(lik) # maximum likelihood over all combos of pbb, pbg, pbnmax(loglik) # maximum loglikelihood over all combos of pbb, pbg, pbnmodel2_grid$pbb[loglik ==max(loglik)] # MLE of pbbmodel2_grid$pbg[loglik ==max(loglik)] # MLE of pbgmodel2_grid$pbn[loglik ==max(loglik)] # MLE of pbn# Store resultsmle_pbb_m2_small <- model2_grid$pbb[loglik ==max(loglik)] # MLE of pbbmle_pbg_m2_small <- model2_grid$pbg[loglik ==max(loglik)] # MLE of pbgmle_pbn_m2_small <- model2_grid$pbn[loglik ==max(loglik)] # MLE of pbnmax_L_m2_small <-max(lik)max_logL_m2_small <-max(loglik)
Model comparisons - Model 1 vs. Model 2
lrt <-2* (max_logL_m2_small - max_logL_m1_small) lrt # likelihood ratio test statistic1-pchisq(lrt, df =2) # p-value for testing Ho: no diff between Models 1&2# AIC and BIC values for Models 1 and 2aic1 <--2* max_logL_m1_small +2*1aic1bic1 <--2* max_logL_m1_small +log(8) *1bic1aic2 <--2* max_logL_m2_small +2*3aic2bic2 <--2* max_logL_m2_small +log(8) *3bic2