library(tidyverse)
library(GGally)
library(knitr)
library(patchwork)
library(viridis)
library(ggfortify)
Beyond Least Squares: Using Likelihoods
BMLR Chapter 2
Setup
Learning goals
Describe the concept of a likelihood
Construct the likelihood for a simple model
Define the Maximum Likelihood Estimate (MLE) and use it to answer an analysis question
Identify three ways to calculate or approximate the MLE and apply these methods to find the MLE for a simple model
Use likelihoods to compare models (next week)
Likelihood
What is the likelihood?
A likelihood is a function that tells us how likely we are to observe our data for a given parameter value (or values).
Unlike Ordinary Least Squares (OLS), they do not require the responses be independent, identically distributed, and normal (iidN)
They are not the same as probability functions
Probability function: Fixed parameter value(s) + input possible outcomes \(\Rightarrow\) probability of seeing the different outcomes given the parameter value(s)
Likelihood: Fixed data + input possible parameter values \(\Rightarrow\) probability of seeing the fixed data for each parameter value
Fouls in college basketball games
The data set 04-refs.csv
includes 30 randomly selected NCAA men’s basketball games played in the 2009 - 2010 season.
We will focus on the variables foul1
, foul2
, and foul3
, which indicate which team had a foul called them for the 1st, 2nd, and 3rd fouls, respectively.
H
: Foul was called on the home teamV
: Foul was called on the visiting team
. . .
We are focusing on the first three fouls for this analysis, but this could easily be extended to include all fouls in a game.
Fouls in college basketball games
<- read_csv("data/04-refs.csv")
refs %>% slice(1:5) %>% kable() refs
game | date | visitor | hometeam | foul1 | foul2 | foul3 |
---|---|---|---|---|---|---|
166 | 20100126 | CLEM | BC | V | V | V |
224 | 20100224 | DEPAUL | CIN | H | H | V |
317 | 20100109 | MARQET | NOVA | H | H | H |
214 | 20100228 | MARQET | SETON | V | V | H |
278 | 20100128 | SETON | SFL | H | V | V |
We will treat the games as independent in this analysis.
Different likelihood models
Model 1 (Unconditional Model): What is the probability the referees call a foul on the home team, assuming foul calls within a game are independent?
Model 2 (Conditional Model):
Is there a tendency for the referees to call more fouls on the visiting team or home team?
Is there a tendency for referees to call a foul on the team that already has more fouls?
. . .
Ultimately we want to decide which model is better.
Exploratory data analysis
%>%
refs count(foul1, foul2, foul3) %>% kable()
foul1 | foul2 | foul3 | n |
---|---|---|---|
H | H | H | 3 |
H | H | V | 2 |
H | V | H | 3 |
H | V | V | 7 |
V | H | H | 7 |
V | H | V | 1 |
V | V | H | 5 |
V | V | V | 2 |
There are
46 total fouls on the home team
44 total fouls on the visiting team
Model 1: Unconditional model
What is the probability the referees call a foul on the home team, assuming foul calls within a game are independent?
Likelihood
Let \(p_H\) be the probability the referees call a foul on the home team.
The likelihood for a single observation
\[Lik(p_H) = p_H^{y_i}(1 - p_H)^{n_i - y_i}\]
Where \(y_i\) is the number of fouls called on the home team.
(In this example, we know \(n_i = 3\) for all observations.)
. . .
Example
For a single game where the first three fouls are \(H, H, V\), then
\[Lik(p_H) = p_H^{2}(1 - p_H)^{3 - 2} = p_H^{2}(1 - p_H)\]
Model 1: Likelihood contribution
Foul1 | Foul2 | Foul3 | n | Likelihood Contribution |
---|---|---|---|---|
H | H | H | 3 | \(p_H^3\) |
H | H | V | 2 | \(p_H^2(1 - p_H)\) |
H | V | H | 3 | \(p_H^2(1 - p_H)\) |
H | V | V | 7 | A |
V | H | H | 7 | B |
V | H | V | 1 | \(p_H(1 - p_H)^2\) |
V | V | H | 5 | \(p_H(1 - p_H)^2\) |
V | V | V | 2 | \((1 - p_H)^3\) |
. . .
Fill in A and B.
Model 1: Likelihood function
Because the observations (the games) are independent, the likelihood is
\[Lik(p_H) = \prod_{i=1}^{n}p_H^{y_i}(1 - p_H)^{3 - y_i}\]
We will use this function to find the maximum likelihood estimate (MLE). The MLE is the value between 0 and 1 where we are most likely to see the observed data.
Visualizing the likelihood
<- seq(0,1, length.out = 100) #sequence of 100 values between 0 and 100
p <- p^46 *(1 -p)^44
lik
<- tibble(p = p, lik = lik)
x ggplot(data = x, aes(x = p, y = lik)) +
geom_point() +
geom_line() +
labs(y = "Likelihood",
title = "Likelihood of p_H")
Q: What is your best guess for the MLE, \(\hat{p}_H\)?
A. 0.489
B. 0.500
C. 0.511
D. 0.556
Finding the maximum likelihood estimate
There are three primary ways to find the MLE
. . .
✅ Approximate using a graph
. . .
✅ Numerical approximation
. . .
✅ Using calculus
Approximate MLE from a graph
Find the MLE using numerical approximation
Specify a finite set of possible values the for \(p_H\) and calculate the likelihood for each value
# write an R function for the likelihood
<- function(ph) {
ref_lik ^46 *(1 - ph)^44
ph }
# use the optimize function to find the MLE
optimize(ref_lik, interval = c(0,1), maximum = TRUE)
$maximum
[1] 0.5111132
$objective
[1] 8.25947e-28
Find MLE using calculus
Find the MLE by taking the first derivative of the likelihood function.
This can be tricky because of the Product Rule, so we can maximize the log(Likelihood) instead. The same value maximizes the likelihood and log(Likelihood)
. . .
. . .
Since calculus is not a pre-req, we will forgo this quest.
Model 2: Conditional model
Is there a tendency for the referees to call more fouls on the visiting team or home team?
Is there a tendency for referees to call a foul on the team that already has more fouls?
Model 2: Likelihood contributions
- Now let’s assume fouls are not independent within each game. We will specify this dependence using conditional probabilities.
- Conditional probability: \(P(A|B) =\) Probability of \(A\) given \(B\) has occurred
. . .
Define new parameters:
\(p_{H|N}\): Probability referees call foul on home team given there are equal numbers of fouls on the home and visiting teams
\(p_{H|H Bias}\): Probability referees call foul on home team given there are more prior fouls on the home team
\(p_{H|V Bias}\): Probability referees call foul on home team given there are more prior fouls on the visiting team
Model 2: Likelihood contributions
Foul1 | Foul2 | Foul3 | n | Likelihood Contribution |
---|---|---|---|---|
H | H | H | 3 | \((p_{H\vert N})(p_{H\vert H Bias})(p_{H\vert H Bias}) = (p_{H\vert N})(p_{H\vert H Bias})^2\) |
H | H | V | 2 | \((p_{H\vert N})(p_{H\vert H Bias})(1 - p_{H\vert H Bias})\) |
H | V | H | 3 | \((p_{H\vert N})(1 - p_{H\vert H Bias})(p_{H\vert N}) = (p_{H\vert N})^2(1 - p_{H\vert H Bias})\) |
H | V | V | 7 | A |
V | H | H | 7 | B |
V | H | V | 1 | \((1 - p_{H\vert N})(p_{H\vert V Bias})(1 - p_{H\vert N}) = (1 - p_{H\vert N})^2(p_{H\vert V Bias})\) |
V | V | H | 5 | \((1 - p_{H\vert N})(1-p_{H\vert V Bias})(p_{H\vert V Bias})\) |
V | V | V | 2 | \(\begin{aligned}&(1 - p_{H\vert N})(1-p_{H\vert V Bias})(1-p_{H\vert V Bias})\\ &=(1 - p_{H\vert N})(1-p_{H\vert V Bias})^2\end{aligned}\) |
Fill in A and B
Likelihood function
\[\begin{aligned}Lik(p_{H| N}, p_{H|H Bias}, p_{H |V Bias}) &= [(p_{H| N})^{25}(1 - p_{H|N})^{23}(p_{H| H Bias})^8 \\ &(1 - p_{H| H Bias})^{12}(p_{H| V Bias})^{13}(1-p_{H|V Bias})^9]\end{aligned}\]
(Note: The exponents sum to 90, the total number of fouls in the data)
. . .
\[\begin{aligned}\log (Lik(p_{H| N}, p_{H|H Bias}, p_{H |V Bias})) &= 25 \log(p_{H| N}) + 23 \log(1 - p_{H|N}) \\ & + 8 \log(p_{H| H Bias}) + 12 \log(1 - p_{H| H Bias})\\ &+ 13 \log(p_{H| V Bias}) + 9 \log(1-p_{H|V Bias})\end{aligned}\]
Q: If fouls within a game are independent, how would you expect \(\hat{p}_H\), \(\hat{p}_{H\vert H Bias}\) and \(\hat{p}_{H\vert V Bias}\) to compare?
\(\hat{p}_H\) is greater than \(\hat{p}_{H\vert H Bias}\) and \(\hat{p}_{H \vert V Bias}\)
\(\hat{p}_{H\vert H Bias}\) is greater than \(\hat{p}_H\) and \(\hat{p}_{H \vert V Bias}\)
\(\hat{p}_{H\vert V Bias}\) is greater than \(\hat{p}_H\) and \(\hat{p}_{H \vert V Bias}\)
They are all approximately equal.
Q: If there is a tendency for referees to call a foul on the team that already has more fouls, how would you expect \(\hat{p}_H\) and \(\hat{p}_{H\vert H Bias}\) to compare?
\(\hat{p}_H\) is greater than \(\hat{p}_{H\vert H Bias}\)
\(\hat{p}_{H\vert H Bias}\) is greater than \(\hat{p}_H\)
They are approximately equal.
Likelihoods
Model 1 (Unconditional Model)
- \(p_H\): probability of a foul being called on the home team
. . .
Model 2 (Conditional Model)
- \(p_{H|N}\): Probability referees call foul on home team given there are equal numbers of fouls on the home and visiting teams
- \(p_{H|H Bias}\): Probability referees call foul on home team given there are more prior fouls on the home team
- \(p_{H|V Bias}\): Probability referees call foul on home team given there are more prior fouls on the visiting team
Likelihoods
Model 1 (Unconditional Model)
\[Lik(p_H) = p_H^{46}(1 - p_H)^{44}\]
. . .
Model 2 (Conditional Model)
\[\begin{aligned}Lik(p_{H| N}, p_{H|H Bias}, p_{H |V Bias}) &= [(p_{H| N})^{25}(1 - p_{H|N})^{23}(p_{H| H Bias})^8 \\ &(1 - p_{H| H Bias})^{12}(p_{H| V Bias})^{13}(1-p_{H|V Bias})^9]\end{aligned}\]
Maximum likelihood estimates
The maximum likelihood estimate (MLE) is the value between 0 and 1 where we are most likely to see the observed data.
. . .
Model 1 (Unconditional Model)
- \(\hat{p}_H = 46/90 = 0.511\)
Model 2 (Conditional Model)
- \(\hat{p}_{H|N} = 25 / 48 = 0.521\)
- \(\hat{p}_{H|H Bias} = 8 /20 = 0.4\)
- \(\hat{p}_{H|V Bias} = 13/ 22 = 0.591\)
- What is the probability the referees call a foul on the home team, assuming foul calls within a game are independent?
- Is there a tendency for the referees to call more fouls on the visiting team or home team?
- Is there a tendency for referees to call a foul on the team that already has more fouls?
Finding the MLEs for model 2
The likelihood is
\[\begin{aligned}Lik(p_{H| N}, p_{H|H Bias}, p_{H |V Bias}) &= [(p_{H| N})^{25}(1 - p_{H|N})^{23}(p_{H| H Bias})^8 \\ &(1 - p_{H| H Bias})^{12}(p_{H| V Bias})^{13}(1-p_{H|V Bias})^9]\end{aligned}\]
. . .
The log-likelihood is
\[\begin{aligned}\log (Lik(p_{H| N}, p_{H|H Bias}, p_{H |V Bias})) &= 25 \log(p_{H| N}) + 23 \log(1 - p_{H|N}) \\ & + 8 \log(p_{H| H Bias}) + 12 \log(1 - p_{H| H Bias})\\ &+ 13 \log(p_{H| V Bias}) + 9 \log(1-p_{H|V Bias})\end{aligned}\]
. . .
We would like to find the MLEs for \(p_{H| N}, p_{H|H Bias}, \text{ and }p_{H |V Bias}\).
Finding MLEs using graphs
- We need to find the MLEs for three parameters, therefore we would need to visualize a 4-dimensional object to find the MLEs from a graph. Given the difficulty of this task and the lack of precision in the estimates from this approach, we should rely on other approaches to find the MLEs in this instance.
. . .
- We also can’t use calculus… that leaves only 1 approach…. optimization via grid search or optim in R
Finding the MLEs using R (1/3)
We can write a function and do a grid search to find the values that maximize the log-likelihood.
. . .
<- function(nvals){
maxloglik#nvals specifies the number of values
<- seq(0, 1, length = nvals)
phn <- seq(0, 1, length = nvals)
phh <- seq(0, 1, length = nvals)
phv
<- expand.grid(phn, phh, phv)
loglik colnames(loglik) <- c("phn", "phh", "phv")
<- loglik %>%
loglik mutate(loglik = log(phn^25 * (1 - phn)^23 * phh^8 * (1 - phh)^12 *
^13 * (1 - phv)^9))
phv
%>%
loglik arrange(desc(loglik)) %>%
slice(1)
}
maxloglik(100)
phn phh phv loglik
1 0.5252525 0.4040404 0.5858586 -61.57691
Finding the MLEs using R (2/3)
- Depending on the number of parameters, it may be hard to conduct a granular enough search to find the exact values of the MLEs.
- Therefore, one could use the function above to conduct a crude search to find starting values for R’s
optim
function. - The function
optim
differs fromoptimize
in that it can optimize over multiple parameter values (Theoptimize
function can only optimize over a single parameter value).
. . .
# Function to calculate log-likelihood that will be used in the optim function
<- function(params){
loglik <- params[1]
phn <- params[2]
phh <- params[3]
phv
log(phn^25 * (1 - phn)^23 * phh^8 * (1 - phh)^12 *
^13 * (1 - phv)^9)
phv }
Finding the MLEs using R (3/3)
# use manual search to get starting values
<- maxloglik(50) %>% select(-loglik) start_vals
# Use optim function in R to find the values to maximize the log-likelihood
#set fnscale = -1 to maximize (the default is minimize)
optim(par = start_vals, fn = loglik, control=list(fnscale=-1))
$par
phn phh phv
0.5208272 0.4000361 0.5909793
$value
[1] -61.57319
$counts
function gradient
66 NA
$convergence
[1] 0
$message
NULL
Model Comparisons
Model comparisons
Nested models
Non-nested models
Nested Models (1/2)
Nested models: Models such that the parameters of the reduced model are a subset of the parameters for a larger model
Example:
\[\begin{aligned}&\text{Model A: }y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\\ &\text{Model B: }y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \epsilon\end{aligned}\]
. . .
Model A is nested in Model B. We could use likelihoods to test whether it is useful to add \(x_3\) and \(x_4\) to the model.
. . .
\[\begin{aligned}&H_0: \beta_3 = \beta_4 = 0 \\ &H_a: \text{ at least one }\beta_j \text{ is not equal to 0}\end{aligned}\]
Nested models (2/2)
Another way to think about nested models: Parameters in larger model can be equated to get the simpler model or if some parameters can be set to constants
Example:
\[\begin{aligned}&\text{Model 1: }p_H \\ &\text{Model 2: }p_{H| N}, p_{H| H Bias}, p_{H| V Bias}\end{aligned}\]
. . .
Model 1 is nested in Model 2. The parameters \(p_{H| N}\), \(p_{H|H Bias}\), and \(p_{H |V Bias}\) can be set equal to \(p_H\) to get Model 1.
. . .
\[\begin{aligned}&H_0: p_{H| N} = p_{H| H Bias} = p_{H| V Bias} = p_H \\ &H_a: \text{At least one of }p_{H| N}, p_{H| H Bias}, p_{H| V Bias} \text{ differs from the others}\end{aligned}\]
Steps to compare models
1️⃣ Find the MLEs for each model.
2️⃣ Plug the MLEs into the log-likelihood function for each model to get the maximum value of the log-likelihood for each model.
3️⃣ Find the difference in the maximum log-likelihoods
4️⃣ Use the Likelihood Ratio Test to determine if the difference is statistically significant
Steps 1 - 2
Find the MLEs for each model and plug them into the log-likelihood functions.
Model 1:
- \(\hat{p}_H = 46/90 = 0.511\)
<- function(ph){
loglik1 log(ph^46 * (1 - ph)^44)
}loglik1(46/90)
[1] -62.36102
. . .
Model 2
- \(\hat{p}_{H|N} = 25 / 48 = 0.521\)
- \(\hat{p}_{H|H Bias} = 8 /20 = 0.4\)
- \(\hat{p}_{H|V Bias} = 13/ 22 = 0.591\)
. . .
<- function(phn, phh, phv) {
loglik2 log(phn^25 * (1 - phn)^23 * phh^8 *
1 - phh)^12 * phv^13 * (1 - phv)^9)
(
}loglik2(25/48, 8/20, 13/22)
[1] -61.57319
Step 3
Find the difference in the log-likelihoods
<- loglik2(25/48, 8/20, 13/22) - loglik1(46/90)) (diff
[1] 0.7878318
. . .
Is the difference in the maximum log-likelihoods statistically significant?
Likelihood Ratio Test
Test statistic
\[\begin{aligned} LRT &= 2[\max\{\log(Lik(\text{larger model}))\} - \max\{\log(Lik(\text{reduced model}))\}]\\[10pt] &= 2\log\Bigg(\frac{\max\{(Lik(\text{larger model})\}}{\max\{(Lik(\text{reduced model})\}}\Bigg)\end{aligned}\]
. . .
LRT follows a \(\chi^2\) distribution where the degrees of freedom equal the difference in the number of parameters between the two models
Step 4
<- 2 * (loglik2(25/48, 8/20, 13/22) - loglik1(46/90))) (LRT
[1] 1.575664
. . .
The test statistic follows a \(\chi^2\) distribution with 2 degrees of freedom. Therefore, the p-value is \(P(\chi^2 > LRT)\).
pchisq(LRT, 2, lower.tail = FALSE)
[1] 0.4548299
. . .
The p-value is very large, so we fail to reject \(H_0\). We do not have convincing evidence that the conditional model is an improvement over the unconditional model. Therefore, we can stick with the unconditional model.
Comparing non-nested models
AIC = -2(max log-likelihood) + 2p
<- 2 * loglik1(46/90) + 2 * 1) (Model1_AIC
[1] -122.722
<-2 * loglik2(25/48, 8/20, 13/22) + 2 * 3) (Model2_AIC
[1] -117.1464
. . .
BIC = -2(max log-likelihood) + plog(n)
<- 2 * loglik1(46/90) + 1 * log(30)) (Model1_BIC
[1] -121.3208
<-2 * loglik2(25/48, 8/20, 13/22) + 3 * log(30)) (Model2_BIC
[1] -112.9428
Choose Model 1, the unconditional model, based on AIC and BIC
Looking ahead
Likelihoods help us answer the question of how likely we are to observe the data given different parameters
In this example, we did not consider covariates, so in practice the parameters we want to estimate will look more similar to this
. . .
\[p_H = \frac{e^{\beta_0 + \beta_1x_1 + \dots + \beta_px_p}}{1 + e^{\beta_0 + \beta_1x_1 + \dots + \beta_px_p}}\]
. . .
- Finding the MLE becomes much more complex and numerical methods may be required.
- We will primarily rely on software to find the MLE, but the conceptual ideas will be the same
Acknowledgements
These slides are based on content in BMLR: Chapter 1 - Review of Multiple Linear Regression
Initial versions of the slides are by Dr. Maria Tackett, Duke University