Beyond Least Squares: Using Likelihoods

BMLR Chapter 2

Tyler George

Cornell College
STA 363 Spring 2025 Block 8

Setup

library(tidyverse)
library(GGally)
library(knitr)
library(patchwork)
library(viridis)
library(ggfortify)

Learning goals

  • Describe the concept of a likelihood

  • Construct the likelihood for a simple model

  • Define the Maximum Likelihood Estimate (MLE) and use it to answer an analysis question

  • Identify three ways to calculate or approximate the MLE and apply these methods to find the MLE for a simple model

  • Use likelihoods to compare models (next week)

Likelihood

What is the likelihood?

A likelihood is a function that tells us how likely we are to observe our data for a given parameter value (or values).

  • Unlike Ordinary Least Squares (OLS), likelihood-based methods do not require the responses to be independent, identically distributed, and normal (iidN)

  • They are not the same as probability functions

    • Probability function: Fixed parameter value(s) + input possible outcomes \(\Rightarrow\) probability of seeing the different outcomes given the parameter value(s)

    • Likelihood: Fixed data + input possible parameter values \(\Rightarrow\) probability of seeing the fixed data for each parameter value
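
A quick R sketch of the distinction, using a binomial model with 3 trials (the numbers here are illustrative):

# Probability function: fix the parameter, vary the possible outcomes
dbinom(0:3, size = 3, prob = 0.5) # P(0, 1, 2, or 3 successes given p = 0.5)

# Likelihood: fix the data (say, 2 successes in 3 trials), vary the parameter
p <- c(0.3, 0.5, 0.7)
dbinom(2, size = 3, prob = p) # probability of the fixed data under each value of p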

Fouls in college basketball games

The data set 04-refs.csv includes 30 randomly selected NCAA men’s basketball games played in the 2009-2010 season.

We will focus on the variables foul1, foul2, and foul3, which indicate which team had a foul called on them for the 1st, 2nd, and 3rd fouls, respectively.

  • H: Foul was called on the home team
  • V: Foul was called on the visiting team

. . .

We are focusing on the first three fouls for this analysis, but this could easily be extended to include all fouls in a game.

Fouls in college basketball games

refs <- read_csv("data/04-refs.csv")
refs %>% slice(1:5) %>% kable()
game date visitor hometeam foul1 foul2 foul3
166 20100126 CLEM BC V V V
224 20100224 DEPAUL CIN H H V
317 20100109 MARQET NOVA H H H
214 20100228 MARQET SETON V V H
278 20100128 SETON SFL H V V

We will treat the games as independent in this analysis.

Different likelihood models

Model 1 (Unconditional Model): What is the probability the referees call a foul on the home team, assuming foul calls within a game are independent?


Model 2 (Conditional Model):

  • Is there a tendency for the referees to call more fouls on the visiting team or home team?

  • Is there a tendency for referees to call a foul on the team that already has more fouls?

. . .

Ultimately we want to decide which model is better.

Exploratory data analysis

refs %>%
count(foul1, foul2, foul3) %>% kable()
foul1 foul2 foul3 n
H H H 3
H H V 2
H V H 3
H V V 7
V H H 7
V H V 1
V V H 5
V V V 2

There are

  • 46 total fouls on the home team

  • 44 total fouls on the visiting team

Model 1: Unconditional model

What is the probability the referees call a foul on the home team, assuming foul calls within a game are independent?

Likelihood

Let \(p_H\) be the probability the referees call a foul on the home team.

The likelihood for a single observation

\[Lik(p_H) = p_H^{y_i}(1 - p_H)^{n_i - y_i}\]

where \(y_i\) is the number of fouls called on the home team in game \(i\).

(In this example, we know \(n_i = 3\) for all observations.)

. . .

Example

For a single game where the first three fouls are \(H, H, V\), then

\[Lik(p_H) = p_H^{2}(1 - p_H)^{3 - 2} = p_H^{2}(1 - p_H)\]

Model 1: Likelihood contribution

Foul1 Foul2 Foul3 n Likelihood Contribution
H H H 3 \(p_H^3\)
H H V 2 \(p_H^2(1 - p_H)\)
H V H 3 \(p_H^2(1 - p_H)\)
H V V 7 A
V H H 7 B
V H V 1 \(p_H(1 - p_H)^2\)
V V H 5 \(p_H(1 - p_H)^2\)
V V V 2 \((1 - p_H)^3\)

. . .

Fill in A and B.

Model 1: Likelihood function

Because the observations (the games) are independent, the likelihood is

\[Lik(p_H) = \prod_{i=1}^{n}p_H^{y_i}(1 - p_H)^{3 - y_i}\]

We will use this function to find the maximum likelihood estimate (MLE). The MLE is the value between 0 and 1 where we are most likely to see the observed data.

Visualizing the likelihood

p <- seq(0, 1, length.out = 100) # sequence of 100 values between 0 and 1
lik <- p^46 * (1 - p)^44         # likelihood at each value of p

x <- tibble(p = p, lik = lik)
ggplot(data = x, aes(x = p, y = lik)) + 
  geom_point() + 
  geom_line() +
  labs(y = "Likelihood",
       title = "Likelihood of p_H")

Q: What is your best guess for the MLE, \(\hat{p}_H\)?

A. 0.489

B. 0.500

C. 0.511

D. 0.556

Finding the maximum likelihood estimate

There are three primary ways to find the MLE

. . .

✅ Approximate using a graph

. . .

✅ Numerical approximation

. . .

✅ Using calculus

Approximate MLE from a graph

Find the MLE using numerical approximation

Specify a finite set of possible values for \(p_H\) and calculate the likelihood for each value, or let R’s optimize function search the interval for us.

# write an R function for the likelihood
ref_lik <- function(ph) {
  ph^46 *(1 - ph)^44
}
# use the optimize function to find the MLE
optimize(ref_lik, interval = c(0,1), maximum = TRUE)
$maximum
[1] 0.5111132

$objective
[1] 8.25947e-28
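
A direct grid search, the “finite set of values” idea described above, gives essentially the same answer (a sketch):

# evaluate the likelihood on a fine grid and keep the best value
p_grid <- seq(0, 1, length.out = 10001)
p_grid[which.max(ref_lik(p_grid))] # approximately 0.5111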

Find MLE using calculus

  • Find the MLE by taking the first derivative of the likelihood function.

  • This can be tricky because of the Product Rule, so we maximize the log(Likelihood) instead; the same parameter value maximizes both the likelihood and the log(Likelihood).

. . .
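
For the curious, the whole argument for Model 1 fits in a few lines (a sketch; calculus is not required for this course):

\[\begin{aligned}\log(Lik(p_H)) &= 46\log(p_H) + 44\log(1 - p_H)\\ \frac{d}{dp_H}\log(Lik(p_H)) &= \frac{46}{p_H} - \frac{44}{1 - p_H} = 0 \quad\Rightarrow\quad \hat{p}_H = \frac{46}{90} \approx 0.511\end{aligned}\]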

. . .

Since calculus is not a pre-req, we will take this quest no further.

Model 2: Conditional model

  • Is there a tendency for the referees to call more fouls on the visiting team or home team?

  • Is there a tendency for referees to call a foul on the team that already has more fouls?

Model 2: Likelihood contributions

  • Now let’s assume fouls are not independent within each game. We will specify this dependence using conditional probabilities.
    • Conditional probability: \(P(A|B) =\) Probability of \(A\) given \(B\) has occurred

. . .

Define new parameters:

  • \(p_{H|N}\): Probability referees call foul on home team given there are equal numbers of fouls on the home and visiting teams

  • \(p_{H|H Bias}\): Probability referees call foul on home team given there are more prior fouls on the home team

  • \(p_{H|V Bias}\): Probability referees call foul on home team given there are more prior fouls on the visiting team

Model 2: Likelihood contributions

Foul1 Foul2 Foul3 n Likelihood Contribution
H H H 3 \((p_{H\vert N})(p_{H\vert H Bias})(p_{H\vert H Bias}) = (p_{H\vert N})(p_{H\vert H Bias})^2\)
H H V 2 \((p_{H\vert N})(p_{H\vert H Bias})(1 - p_{H\vert H Bias})\)
H V H 3 \((p_{H\vert N})(1 - p_{H\vert H Bias})(p_{H\vert N}) = (p_{H\vert N})^2(1 - p_{H\vert H Bias})\)
H V V 7 A
V H H 7 B
V H V 1 \((1 - p_{H\vert N})(p_{H\vert V Bias})(1 - p_{H\vert N}) = (1 - p_{H\vert N})^2(p_{H\vert V Bias})\)
V V H 5 \((1 - p_{H\vert N})(1-p_{H\vert V Bias})(p_{H\vert V Bias})\)
V V V 2 \(\begin{aligned}&(1 - p_{H\vert N})(1-p_{H\vert V Bias})(1-p_{H\vert V Bias})\\ &=(1 - p_{H\vert N})(1-p_{H\vert V Bias})^2\end{aligned}\)

Fill in A and B

Likelihood function

\[\begin{aligned}Lik(p_{H| N}, p_{H|H Bias}, p_{H |V Bias}) &= [(p_{H| N})^{25}(1 - p_{H|N})^{23}(p_{H| H Bias})^8 \\ &(1 - p_{H| H Bias})^{12}(p_{H| V Bias})^{13}(1-p_{H|V Bias})^9]\end{aligned}\]

(Note: The exponents sum to 90, the total number of fouls in the data)
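
These exponents can be double-checked in R by tallying fouls by game state. Below is a sketch using the pattern counts from the exploratory data analysis (the column and state names are mine):

# one row per foul pattern, with the number of games showing that pattern
patterns <- tibble(
  f1 = c("H","H","H","H","V","V","V","V"),
  f2 = c("H","H","V","V","H","H","V","V"),
  f3 = c("H","V","H","V","H","V","H","V"),
  n  = c(3, 2, 3, 7, 7, 1, 5, 2)
)

patterns %>%
  mutate(s1 = "N",                                  # foul 1: teams always tied
         s2 = if_else(f1 == "H", "HBias", "VBias"), # foul 2: foul-1 team has more fouls
         s3 = case_when(f1 != f2 ~ "N",             # foul 3: tied again if fouls 1-2 split
                        f1 == "H" ~ "HBias",
                        TRUE ~ "VBias")) %>%
  pivot_longer(c(f1, f2, f3), names_to = "foul", values_to = "team") %>%
  mutate(state = case_when(foul == "f1" ~ s1, foul == "f2" ~ s2, TRUE ~ s3)) %>%
  count(state, team, wt = n) # recovers the six exponents: 25, 23, 8, 12, 13, 9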

. . .

\[\begin{aligned}\log (Lik(p_{H| N}, p_{H|H Bias}, p_{H |V Bias})) &= 25 \log(p_{H| N}) + 23 \log(1 - p_{H|N}) \\ & + 8 \log(p_{H| H Bias}) + 12 \log(1 - p_{H| H Bias})\\ &+ 13 \log(p_{H| V Bias}) + 9 \log(1-p_{H|V Bias})\end{aligned}\]

Q: If fouls within a game are independent, how would you expect \(\hat{p}_H\), \(\hat{p}_{H\vert H Bias}\) and \(\hat{p}_{H\vert V Bias}\) to compare?

  1. \(\hat{p}_H\) is greater than \(\hat{p}_{H\vert H Bias}\) and \(\hat{p}_{H \vert V Bias}\)

  2. \(\hat{p}_{H\vert H Bias}\) is greater than \(\hat{p}_H\) and \(\hat{p}_{H \vert V Bias}\)

  3. \(\hat{p}_{H\vert V Bias}\) is greater than \(\hat{p}_H\) and \(\hat{p}_{H \vert H Bias}\)

  4. They are all approximately equal.

Q: If there is a tendency for referees to call a foul on the team that already has more fouls, how would you expect \(\hat{p}_H\) and \(\hat{p}_{H\vert H Bias}\) to compare?

  1. \(\hat{p}_H\) is greater than \(\hat{p}_{H\vert H Bias}\)

  2. \(\hat{p}_{H\vert H Bias}\) is greater than \(\hat{p}_H\)

  3. They are approximately equal.

Likelihoods

Model 1 (Unconditional Model)

  • \(p_H\): probability of a foul being called on the home team

. . .

Model 2 (Conditional Model)

  • \(p_{H|N}\): Probability referees call foul on home team given there are equal numbers of fouls on the home and visiting teams
  • \(p_{H|H Bias}\): Probability referees call foul on home team given there are more prior fouls on the home team
  • \(p_{H|V Bias}\): Probability referees call foul on home team given there are more prior fouls on the visiting team

Likelihoods

Model 1 (Unconditional Model)

\[Lik(p_H) = p_H^{46}(1 - p_H)^{44}\]

. . .

Model 2 (Conditional Model)

\[\begin{aligned}Lik(p_{H| N}, p_{H|H Bias}, p_{H |V Bias}) &= [(p_{H| N})^{25}(1 - p_{H|N})^{23}(p_{H| H Bias})^8 \\ &(1 - p_{H| H Bias})^{12}(p_{H| V Bias})^{13}(1-p_{H|V Bias})^9]\end{aligned}\]

Maximum likelihood estimates

The maximum likelihood estimate (MLE) is the value between 0 and 1 where we are most likely to see the observed data.

. . .

Model 1 (Unconditional Model)

  • \(\hat{p}_H = 46/90 = 0.511\)

Model 2 (Conditional Model)

  • \(\hat{p}_{H|N} = 25 / 48 = 0.521\)
  • \(\hat{p}_{H|H Bias} = 8 /20 = 0.4\)
  • \(\hat{p}_{H|V Bias} = 13/ 22 = 0.591\)

Use these estimates to revisit the original questions:

  • What is the probability the referees call a foul on the home team, assuming foul calls within a game are independent?
  • Is there a tendency for the referees to call more fouls on the visiting team or home team?
  • Is there a tendency for referees to call a foul on the team that already has more fouls?

Finding the MLEs for model 2

The likelihood is

\[\begin{aligned}Lik(p_{H| N}, p_{H|H Bias}, p_{H |V Bias}) &= [(p_{H| N})^{25}(1 - p_{H|N})^{23}(p_{H| H Bias})^8 \\ &(1 - p_{H| H Bias})^{12}(p_{H| V Bias})^{13}(1-p_{H|V Bias})^9]\end{aligned}\]

. . .

The log-likelihood is

\[\begin{aligned}\log (Lik(p_{H| N}, p_{H|H Bias}, p_{H |V Bias})) &= 25 \log(p_{H| N}) + 23 \log(1 - p_{H|N}) \\ & + 8 \log(p_{H| H Bias}) + 12 \log(1 - p_{H| H Bias})\\ &+ 13 \log(p_{H| V Bias}) + 9 \log(1-p_{H|V Bias})\end{aligned}\]

. . .

We would like to find the MLEs for \(p_{H| N}, p_{H|H Bias}, \text{ and }p_{H |V Bias}\).

Finding MLEs using graphs

  • We need to find the MLEs for three parameters, so reading them off a graph would mean visualizing a 4-dimensional object. Given the difficulty of that task and the lack of precision in graphical estimates, we should rely on other approaches here.

. . .

  • We also can’t use calculus, so that leaves one approach: numerical optimization, via a grid search or R’s optim function.

Finding the MLEs using R (1/3)

We can write a function and do a grid search to find the values that maximize the log-likelihood.

. . .

maxloglik <- function(nvals){
  # nvals: number of grid values to try for each parameter
  phn <- seq(0, 1, length = nvals)
  phh <- seq(0, 1, length = nvals)
  phv <- seq(0, 1, length = nvals)
  
  loglik <- expand.grid(phn, phh, phv) 
  colnames(loglik) <- c("phn", "phh", "phv")
  
  loglik <- loglik %>%
    mutate(loglik  = log(phn^25 * (1 - phn)^23 * phh^8 * (1 - phh)^12 * 
                           phv^13 * (1 - phv)^9))
  
  loglik %>%
    arrange(desc(loglik)) %>%
    slice(1)
}
maxloglik(100)
        phn       phh       phv    loglik
1 0.5252525 0.4040404 0.5858586 -61.57691

Finding the MLEs using R (2/3)

  • Depending on the number of parameters, it may be hard to conduct a granular enough search to find the exact values of the MLEs.
  • Therefore, one could use the function above to conduct a crude search to find starting values for R’s optim function.
  • The function optim differs from optimize in that it can optimize over multiple parameters (optimize can only handle a single parameter).

. . .

# Function to calculate log-likelihood that will be used in the optim function
loglik <- function(params){
  phn <- params[1]
  phh <- params[2]
  phv <- params[3]

  log(phn^25 * (1 - phn)^23 * phh^8 * (1 - phh)^12 * 
                           phv^13 * (1 - phv)^9)
}
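
One caution: the product inside log() multiplies many numbers less than 1 and can underflow to zero for larger data sets. A numerically safer variant (same model; the name is mine) sums the log terms directly:

# sum the log terms instead of taking the log of a tiny product
loglik_stable <- function(params){
  phn <- params[1]
  phh <- params[2]
  phv <- params[3]

  25 * log(phn) + 23 * log(1 - phn) + 8 * log(phh) +
    12 * log(1 - phh) + 13 * log(phv) + 9 * log(1 - phv)
}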

Finding the MLEs using R (3/3)

# use manual search to get starting values 
start_vals <- maxloglik(50) %>% select(-loglik)
# Use optim function in R to find the values to maximize the log-likelihood
#set fnscale = -1 to maximize (the default is minimize)
optim(par = start_vals, fn = loglik, control = list(fnscale = -1))
$par
      phn       phh       phv 
0.5208272 0.4000361 0.5909793 

$value
[1] -61.57319

$counts
function gradient 
      66       NA 

$convergence
[1] 0

$message
NULL
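
Note that convergence = 0 means optim converged, and the values in $par match the closed-form MLEs from earlier (25/48, 8/20, and 13/22) to about three decimal places.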

Model Comparisons

Model comparisons

  • Nested models

  • Non-nested models

Nested Models (1/2)

Nested models: Models such that the parameters of the reduced model are a subset of the parameters for a larger model

Example:

\[\begin{aligned}&\text{Model A: }y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\\ &\text{Model B: }y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \epsilon\end{aligned}\]

. . .

Model A is nested in Model B. We could use likelihoods to test whether it is useful to add \(x_3\) and \(x_4\) to the model.

. . .

\[\begin{aligned}&H_0: \beta_3 = \beta_4 = 0 \\ &H_a: \text{ at least one }\beta_j \text{ is not equal to 0}\end{aligned}\]
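
In R, this nested comparison can be carried out with anova(). A sketch on simulated data (all variable names and coefficients here are hypothetical):

# simulate data where x3 and x4 have no real effect
set.seed(363)
sim <- tibble(x1 = rnorm(100), x2 = rnorm(100),
              x3 = rnorm(100), x4 = rnorm(100))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(100)

model_a <- lm(y ~ x1 + x2, data = sim)
model_b <- lm(y ~ x1 + x2 + x3 + x4, data = sim)
anova(model_a, model_b) # F test of H0: beta3 = beta4 = 0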

Nested models (2/2)

Another way to think about nested models: the reduced model can be obtained from the larger model by equating some of its parameters with each other or by setting them to constants.

Example:

\[\begin{aligned}&\text{Model 1: }p_H \\ &\text{Model 2: }p_{H| N}, p_{H| H Bias}, p_{H| V Bias}\end{aligned}\]

. . .

Model 1 is nested in Model 2. The parameters \(p_{H| N}\), \(p_{H|H Bias}\), and \(p_{H |V Bias}\) can be set equal to \(p_H\) to get Model 1.

. . .

\[\begin{aligned}&H_0: p_{H| N} = p_{H| H Bias} = p_{H| V Bias} = p_H \\ &H_a: \text{At least one of }p_{H| N}, p_{H| H Bias}, p_{H| V Bias} \text{ differs from the others}\end{aligned}\]

Steps to compare models

1️⃣ Find the MLEs for each model.

2️⃣ Plug the MLEs into the log-likelihood function for each model to get the maximum value of the log-likelihood for each model.

3️⃣ Find the difference in the maximum log-likelihoods

4️⃣ Use the Likelihood Ratio Test to determine if the difference is statistically significant

Steps 1 - 2

Find the MLEs for each model and plug them into the log-likelihood functions.

Model 1:

  • \(\hat{p}_H = 46/90 = 0.511\)
loglik1 <- function(ph){
 log(ph^46 * (1 - ph)^44)
}
loglik1(46/90)
[1] -62.36102

. . .

Model 2

  • \(\hat{p}_{H|N} = 25 / 48 = 0.521\)
  • \(\hat{p}_{H|H Bias} = 8 /20 = 0.4\)
  • \(\hat{p}_{H|V Bias} = 13/ 22 = 0.591\)

. . .

loglik2 <- function(phn, phh, phv) {
  log(phn^25 * (1 - phn)^23 * phh^8 * 
        (1 - phh)^12 * phv^13 * (1 - phv)^9)
}
loglik2(25/48, 8/20, 13/22)
[1] -61.57319

Step 3

Find the difference in the log-likelihoods

(diff <- loglik2(25/48, 8/20, 13/22) - loglik1(46/90))
[1] 0.7878318


. . .

Is the difference in the maximum log-likelihoods statistically significant?

Likelihood Ratio Test

Test statistic

\[\begin{aligned} LRT &= 2[\max\{\log(Lik(\text{larger model}))\} - \max\{\log(Lik(\text{reduced model}))\}]\\[10pt] &= 2\log\Bigg(\frac{\max\{Lik(\text{larger model})\}}{\max\{Lik(\text{reduced model})\}}\Bigg)\end{aligned}\]


. . .

Under \(H_0\), LRT approximately follows a \(\chi^2\) distribution with degrees of freedom equal to the difference in the number of parameters between the two models

Step 4

(LRT <- 2 * (loglik2(25/48, 8/20, 13/22) - loglik1(46/90)))
[1] 1.575664

. . .

The test statistic follows a \(\chi^2\) distribution with 2 degrees of freedom. Therefore, the p-value is \(P(\chi^2 > LRT)\).

pchisq(LRT, 2, lower.tail = FALSE)
[1] 0.4548299

. . .

The p-value is very large, so we fail to reject \(H_0\). We do not have convincing evidence that the conditional model is an improvement over the unconditional model. Therefore, we can stick with the unconditional model.

Comparing non-nested models

AIC = -2(max log-likelihood) + 2p, where p is the number of model parameters

(Model1_AIC <- -2 * loglik1(46/90) + 2 * 1)
[1] 126.722
(Model2_AIC <- -2 * loglik2(25/48, 8/20, 13/22) + 2 * 3)
[1] 129.1464

. . .

BIC = -2(max log-likelihood) + p log(n), where n is the number of observations (here, n = 30 games)

(Model1_BIC <- -2 * loglik1(46/90) + 1 * log(30))
[1] 128.1232
(Model2_BIC <- -2 * loglik2(25/48, 8/20, 13/22) + 3 * log(30))
[1] 133.35

Choose Model 1, the unconditional model, based on AIC and BIC (smaller values indicate the preferred model)
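
These computations can be wrapped in small helper functions (a sketch; the names are mine):

# smaller AIC/BIC indicates the preferred model
aic <- function(max_loglik, p) -2 * max_loglik + 2 * p
bic <- function(max_loglik, p, n) -2 * max_loglik + p * log(n)

aic(loglik1(46/90), p = 1)                      # 126.722
bic(loglik2(25/48, 8/20, 13/22), p = 3, n = 30) # 133.35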

Looking ahead

  • Likelihoods help us answer the question of how likely we are to observe the data given different parameters

  • In this example, we did not consider covariates; in practice, the parameters we want to estimate will typically depend on covariates and look more like this

. . .

\[p_H = \frac{e^{\beta_0 + \beta_1x_1 + \dots + \beta_px_p}}{1 + e^{\beta_0 + \beta_1x_1 + \dots + \beta_px_p}}\]
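
This is the inverse-logit function, available in base R as plogis(). A quick sketch with hypothetical coefficient values:

# plogis(eta) = exp(eta) / (1 + exp(eta))
beta0 <- -0.5
beta1 <- 0.8 # hypothetical values for illustration
x1 <- 1
plogis(beta0 + beta1 * x1) # estimated p_H for a game with this x1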

. . .

  • Finding the MLE becomes much more complex and numerical methods may be required.
    • We will primarily rely on software to find the MLE, but the conceptual ideas will be the same

Acknowledgements

These slides are based on content in BMLR: Chapter 2 - Beyond Least Squares: Using Likelihoods

Initial versions of the slides are by Dr. Maria Tackett, Duke University