Beyond Least Squares: Using Likelihoods

BMLR Chapter 2

Tyler George

Cornell College
STA 363 Spring 2025 Block 8

Setup

library(tidyverse)
library(GGally)
library(knitr)
library(patchwork)
library(viridis)
library(ggfortify)

Learning goals

  • Describe the concept of a likelihood

  • Construct the likelihood for a simple model

  • Define the Maximum Likelihood Estimate (MLE) and use it to answer an analysis question

  • Identify three ways to calculate or approximate the MLE and apply these methods to find the MLE for a simple model

  • Use likelihoods to compare models (next week)

Likelihood

What is the likelihood?

A likelihood is a function that tells us how likely we are to observe our data for a given parameter value (or values).

  • Unlike Ordinary Least Squares (OLS), likelihood-based methods do not require the responses to be independent, identically distributed, and normal (iidN)

  • They are not the same as probability functions

    • Probability function: Fixed parameter value(s) + input possible outcomes \(\Rightarrow\) probability of seeing the different outcomes given the parameter value(s)

    • Likelihood: Fixed data + input possible parameter values \(\Rightarrow\) probability of seeing the fixed data for each parameter value
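
A quick R sketch of the distinction, using a binomial model with 3 trials (the numbers here are illustrative):

# Probability function: fix the parameter, vary the possible outcomes
dbinom(0:3, size = 3, prob = 0.5) # P(0, 1, 2, or 3 successes given p = 0.5)

# Likelihood: fix the data (say, 2 successes in 3 trials), vary the parameter
p <- c(0.3, 0.5, 0.7)
dbinom(2, size = 3, prob = p) # probability of the fixed data under each value of p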

Fouls in college basketball games

The data set 04-refs.csv includes 30 randomly selected NCAA men’s basketball games played in the 2009-2010 season.

We will focus on the variables foul1, foul2, and foul3, which indicate which team had a foul called on them for the 1st, 2nd, and 3rd fouls, respectively.

  • H: Foul was called on the home team
  • V: Foul was called on the visiting team

. . .

We are focusing on the first three fouls for this analysis, but this could easily be extended to include all fouls in a game.

Fouls in college basketball games

refs <- read_csv("data/04-refs.csv")
refs %>% slice(1:5) %>% kable()
game date visitor hometeam foul1 foul2 foul3
166 20100126 CLEM BC V V V
224 20100224 DEPAUL CIN H H V
317 20100109 MARQET NOVA H H H
214 20100228 MARQET SETON V V H
278 20100128 SETON SFL H V V

We will treat the games as independent in this analysis.

Different likelihood models

Model 1 (Unconditional Model): What is the probability the referees call a foul on the home team, assuming foul calls within a game are independent?


Model 2 (Conditional Model):

  • Is there a tendency for the referees to call more fouls on the visiting team or home team?

  • Is there a tendency for referees to call a foul on the team that already has more fouls?

. . .

Ultimately we want to decide which model is better.

Exploratory data analysis

refs %>%
count(foul1, foul2, foul3) %>% kable()
foul1 foul2 foul3 n
H H H 3
H H V 2
H V H 3
H V V 7
V H H 7
V H V 1
V V H 5
V V V 2

There are

  • 46 total fouls on the home team

  • 44 total fouls on the visiting team

Model 1: Unconditional model

What is the probability the referees call a foul on the home team, assuming foul calls within a game are independent?

Likelihood

Let \(p_H\) be the probability the referees call a foul on the home team.

The likelihood for a single observation

\[Lik(p_H) = p_H^{y_i}(1 - p_H)^{n_i - y_i}\]

where \(y_i\) is the number of fouls called on the home team in game \(i\).

(In this example, we know \(n_i = 3\) for all observations.)

. . .

Example

For a single game where the first three fouls are \(H, H, V\), then

\[Lik(p_H) = p_H^{2}(1 - p_H)^{3 - 2} = p_H^{2}(1 - p_H)\]

Model 1: Likelihood contribution

Foul1 Foul2 Foul3 n Likelihood Contribution
H H H 3 \(p_H^3\)
H H V 2 \(p_H^2(1 - p_H)\)
H V H 3 \(p_H^2(1 - p_H)\)
H V V 7 A
V H H 7 B
V H V 1 \(p_H(1 - p_H)^2\)
V V H 5 \(p_H(1 - p_H)^2\)
V V V 2 \((1 - p_H)^3\)

. . .

Fill in A and B.

Model 1: Likelihood function

Because the observations (the games) are independent, the likelihood is

\[Lik(p_H) = \prod_{i=1}^{n}p_H^{y_i}(1 - p_H)^{3 - y_i}\]

We will use this function to find the maximum likelihood estimate (MLE). The MLE is the value between 0 and 1 where we are most likely to see the observed data.

Visualizing the likelihood

p <- seq(0, 1, length.out = 100) # sequence of 100 values between 0 and 1
lik <- p^46 * (1 - p)^44         # likelihood at each value of p

x <- tibble(p = p, lik = lik)
ggplot(data = x, aes(x = p, y = lik)) + 
  geom_point() + 
  geom_line() +
  labs(y = "Likelihood",
       title = "Likelihood of p_H")

Q: What is your best guess for the MLE, \(\hat{p}_H\)?

A. 0.489

B. 0.500

C. 0.511

D. 0.556

Finding the maximum likelihood estimate

There are three primary ways to find the MLE

. . .

✅ Approximate using a graph

. . .

✅ Numerical approximation

. . .

✅ Using calculus

Approximate MLE from a graph

Find the MLE using numerical approximation

Specify a finite set of possible values for \(p_H\) and calculate the likelihood for each value, or let R’s optimize function search the interval for us.

# write an R function for the likelihood
ref_lik <- function(ph) {
  ph^46 *(1 - ph)^44
}
# use the optimize function to find the MLE
optimize(ref_lik, interval = c(0,1), maximum = TRUE)
$maximum
[1] 0.5111132

$objective
[1] 8.25947e-28
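
A direct grid search, the “finite set of values” idea described above, gives essentially the same answer (a sketch):

# evaluate the likelihood on a fine grid and keep the best value
p_grid <- seq(0, 1, length.out = 10001)
p_grid[which.max(ref_lik(p_grid))] # approximately 0.5111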

Find MLE using calculus

  • Find the MLE by taking the first derivative of the likelihood function.

  • This can be tricky because of the Product Rule, so we maximize the log(Likelihood) instead; the same parameter value maximizes both the likelihood and the log(Likelihood).

. . .
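
For the curious, the whole argument for Model 1 fits in a few lines (a sketch; calculus is not required for this course):

\[\begin{aligned}\log(Lik(p_H)) &= 46\log(p_H) + 44\log(1 - p_H)\\ \frac{d}{dp_H}\log(Lik(p_H)) &= \frac{46}{p_H} - \frac{44}{1 - p_H} = 0 \quad\Rightarrow\quad \hat{p}_H = \frac{46}{90} \approx 0.511\end{aligned}\]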

. . .

Since calculus is not a pre-req, we will take this quest no further.

Model 2: Conditional model

  • Is there a tendency for the referees to call more fouls on the visiting team or home team?

  • Is there a tendency for referees to call a foul on the team that already has more fouls?

Model 2: Likelihood contributions

  • Now let’s assume fouls are not independent within each game. We will specify this dependence using conditional probabilities.
    • Conditional probability: \(P(A|B) =\) Probability of \(A\) given \(B\) has occurred

. . .

Define new parameters:

  • \(p_{H|N}\): Probability referees call foul on home team given there are equal numbers of fouls on the home and visiting teams

  • \(p_{H|H Bias}\): Probability referees call foul on home team given there are more prior fouls on the home team

  • \(p_{H|V Bias}\): Probability referees call foul on home team given there are more prior fouls on the visiting team

Model 2: Likelihood contributions

Foul1 Foul2 Foul3 n Likelihood Contribution
H H H 3 \((p_{H\vert N})(p_{H\vert H Bias})(p_{H\vert H Bias}) = (p_{H\vert N})(p_{H\vert H Bias})^2\)
H H V 2 \((p_{H\vert N})(p_{H\vert H Bias})(1 - p_{H\vert H Bias})\)
H V H 3 \((p_{H\vert N})(1 - p_{H\vert H Bias})(p_{H\vert N}) = (p_{H\vert N})^2(1 - p_{H\vert H Bias})\)
H V V 7 A
V H H 7 B
V H V 1 \((1 - p_{H\vert N})(p_{H\vert V Bias})(1 - p_{H\vert N}) = (1 - p_{H\vert N})^2(p_{H\vert V Bias})\)
V V H 5 \((1 - p_{H\vert N})(1-p_{H\vert V Bias})(p_{H\vert V Bias})\)
V V V 2 \(\begin{aligned}&(1 - p_{H\vert N})(1-p_{H\vert V Bias})(1-p_{H\vert V Bias})\\ &=(1 - p_{H\vert N})(1-p_{H\vert V Bias})^2\end{aligned}\)

Fill in A and B

Likelihood function

\[\begin{aligned}Lik(p_{H| N}, p_{H|H Bias}, p_{H |V Bias}) &= [(p_{H| N})^{25}(1 - p_{H|N})^{23}(p_{H| H Bias})^8 \\ &(1 - p_{H| H Bias})^{12}(p_{H| V Bias})^{13}(1-p_{H|V Bias})^9]\end{aligned}\]

(Note: The exponents sum to 90, the total number of fouls in the data)
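
These exponents can be double-checked in R by tallying fouls by game state. Below is a sketch using the pattern counts from the exploratory data analysis (the column and state names are mine):

# one row per foul pattern, with the number of games showing that pattern
patterns <- tibble(
  f1 = c("H","H","H","H","V","V","V","V"),
  f2 = c("H","H","V","V","H","H","V","V"),
  f3 = c("H","V","H","V","H","V","H","V"),
  n  = c(3, 2, 3, 7, 7, 1, 5, 2)
)

patterns %>%
  mutate(s1 = "N",                                  # foul 1: teams always tied
         s2 = if_else(f1 == "H", "HBias", "VBias"), # foul 2: foul-1 team has more fouls
         s3 = case_when(f1 != f2 ~ "N",             # foul 3: tied again if fouls 1-2 split
                        f1 == "H" ~ "HBias",
                        TRUE ~ "VBias")) %>%
  pivot_longer(c(f1, f2, f3), names_to = "foul", values_to = "team") %>%
  mutate(state = case_when(foul == "f1" ~ s1, foul == "f2" ~ s2, TRUE ~ s3)) %>%
  count(state, team, wt = n) # recovers the six exponents: 25, 23, 8, 12, 13, 9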

. . .

\[\begin{aligned}\log (Lik(p_{H| N}, p_{H|H Bias}, p_{H |V Bias})) &= 25 \log(p_{H| N}) + 23 \log(1 - p_{H|N}) \\ & + 8 \log(p_{H| H Bias}) + 12 \log(1 - p_{H| H Bias})\\ &+ 13 \log(p_{H| V Bias}) + 9 \log(1-p_{H|V Bias})\end{aligned}\]

Q: If fouls within a game are independent, how would you expect \(\hat{p}_H\), \(\hat{p}_{H\vert H Bias}\) and \(\hat{p}_{H\vert V Bias}\) to compare?

  1. \(\hat{p}_H\) is greater than \(\hat{p}_{H\vert H Bias}\) and \(\hat{p}_{H \vert V Bias}\)

  2. \(\hat{p}_{H\vert H Bias}\) is greater than \(\hat{p}_H\) and \(\hat{p}_{H \vert V Bias}\)

  3. \(\hat{p}_{H\vert V Bias}\) is greater than \(\hat{p}_H\) and \(\hat{p}_{H \vert H Bias}\)

  4. They are all approximately equal.

Q: If there is a tendency for referees to call a foul on the team that already has more fouls, how would you expect \(\hat{p}_H\) and \(\hat{p}_{H\vert H Bias}\) to compare?

  1. \(\hat{p}_H\) is greater than \(\hat{p}_{H\vert H Bias}\)

  2. \(\hat{p}_{H\vert H Bias}\) is greater than \(\hat{p}_H\)

  3. They are approximately equal.

Likelihoods

Model 1 (Unconditional Model)

  • \(p_H\): probability of a foul being called on the home team

. . .

Model 2 (Conditional Model)

  • \(p_{H|N}\): Probability referees call foul on home team given there are equal numbers of fouls on the home and visiting teams
  • \(p_{H|H Bias}\): Probability referees call foul on home team given there are more prior fouls on the home team
  • \(p_{H|V Bias}\): Probability referees call foul on home team given there are more prior fouls on the visiting team

Likelihoods

Model 1 (Unconditional Model)

\[Lik(p_H) = p_H^{46}(1 - p_H)^{44}\]

. . .

Model 2 (Conditional Model)

\[\begin{aligned}Lik(p_{H| N}, p_{H|H Bias}, p_{H |V Bias}) &= [(p_{H| N})^{25}(1 - p_{H|N})^{23}(p_{H| H Bias})^8 \\ &(1 - p_{H| H Bias})^{12}(p_{H| V Bias})^{13}(1-p_{H|V Bias})^9]\end{aligned}\]

Maximum likelihood estimates

The maximum likelihood estimate (MLE) is the value between 0 and 1 where we are most likely to see the observed data.

. . .

Model 1 (Unconditional Model)

  • \(\hat{p}_H = 46/90 = 0.511\)

Model 2 (Conditional Model)

  • \(\hat{p}_{H|N} = 25 / 48 = 0.521\)
  • \(\hat{p}_{H|H Bias} = 8 /20 = 0.4\)
  • \(\hat{p}_{H|V Bias} = 13/ 22 = 0.591\)

Use these estimates to revisit the original questions:

  • What is the probability the referees call a foul on the home team, assuming foul calls within a game are independent?
  • Is there a tendency for the referees to call more fouls on the visiting team or home team?
  • Is there a tendency for referees to call a foul on the team that already has more fouls?

Finding the MLEs for model 2

The likelihood is

\[\begin{aligned}Lik(p_{H| N}, p_{H|H Bias}, p_{H |V Bias}) &= [(p_{H| N})^{25}(1 - p_{H|N})^{23}(p_{H| H Bias})^8 \\ &(1 - p_{H| H Bias})^{12}(p_{H| V Bias})^{13}(1-p_{H|V Bias})^9]\end{aligned}\]

. . .

The log-likelihood is

\[\begin{aligned}\log (Lik(p_{H| N}, p_{H|H Bias}, p_{H |V Bias})) &= 25 \log(p_{H| N}) + 23 \log(1 - p_{H|N}) \\ & + 8 \log(p_{H| H Bias}) + 12 \log(1 - p_{H| H Bias})\\ &+ 13 \log(p_{H| V Bias}) + 9 \log(1-p_{H|V Bias})\end{aligned}\]

. . .

We would like to find the MLEs for \(p_{H| N}, p_{H|H Bias}, \text{ and }p_{H |V Bias}\).

Finding MLEs using graphs

  • We need to find the MLEs for three parameters, so reading them off a graph would mean visualizing a 4-dimensional object. Given the difficulty of that task and the lack of precision in graphical estimates, we should rely on other approaches here.

. . .

  • We also can’t use calculus, so that leaves one approach: numerical optimization, via a grid search or R’s optim function.

Finding the MLEs using R (1/3)

We can write a function and do a grid search to find the values that maximize the log-likelihood.

. . .

maxloglik <- function(nvals){
  # nvals: number of grid values to try for each parameter
  phn <- seq(0, 1, length = nvals)
  phh <- seq(0, 1, length = nvals)
  phv <- seq(0, 1, length = nvals)
  
  loglik <- expand.grid(phn, phh, phv) 
  colnames(loglik) <- c("phn", "phh", "phv")
  
  loglik <- loglik %>%
    mutate(loglik  = log(phn^25 * (1 - phn)^23 * phh^8 * (1 - phh)^12 * 
                           phv^13 * (1 - phv)^9))
  
  loglik %>%
    arrange(desc(loglik)) %>%
    slice(1)
}
maxloglik(100)
        phn       phh       phv    loglik
1 0.5252525 0.4040404 0.5858586 -61.57691

Finding the MLEs using R (2/3)

  • Depending on the number of parameters, it may be hard to conduct a granular enough search to find the exact values of the MLEs.
  • Therefore, one could use the function above to conduct a crude search to find starting values for R’s optim function.
  • The function optim differs from optimize in that it can optimize over multiple parameters (optimize can only handle a single parameter).

. . .

# Function to calculate log-likelihood that will be used in the optim function
loglik <- function(params){
  phn <- params[1]
  phh <- params[2]
  phv <- params[3]

  log(phn^25 * (1 - phn)^23 * phh^8 * (1 - phh)^12 * 
                           phv^13 * (1 - phv)^9)
}
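
One caution: the product inside log() multiplies many numbers less than 1 and can underflow to zero for larger data sets. A numerically safer variant (same model; the name is mine) sums the log terms directly:

# sum the log terms instead of taking the log of a tiny product
loglik_stable <- function(params){
  phn <- params[1]
  phh <- params[2]
  phv <- params[3]

  25 * log(phn) + 23 * log(1 - phn) + 8 * log(phh) +
    12 * log(1 - phh) + 13 * log(phv) + 9 * log(1 - phv)
}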

Finding the MLEs using R (3/3)

# use manual search to get starting values 
start_vals <- maxloglik(50) %>% select(-loglik)
# Use optim function in R to find the values to maximize the log-likelihood
#set fnscale = -1 to maximize (the default is minimize)
optim(par = start_vals, fn = loglik, control = list(fnscale = -1))
$par
      phn       phh       phv 
0.5208272 0.4000361 0.5909793 

$value
[1] -61.57319

$counts
function gradient 
      66       NA 

$convergence
[1] 0

$message
NULL
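
Note that convergence = 0 means optim converged, and the values in $par match the closed-form MLEs from earlier (25/48, 8/20, and 13/22) to about three decimal places.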

Model Comparisons

Model comparisons

  • Nested models

  • Non-nested models

Nested Models (1/2)

Nested models: Models such that the parameters of the reduced model are a subset of the parameters for a larger model

Example:

\[\begin{aligned}&\text{Model A: }y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\\ &\text{Model B: }y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \epsilon\end{aligned}\]

. . .

Model A is nested in Model B. We could use likelihoods to test whether it is useful to add \(x_3\) and \(x_4\) to the model.

. . .

\[\begin{aligned}&H_0: \beta_3 = \beta_4 = 0 \\ &H_a: \text{ at least one }\beta_j \text{ is not equal to 0}\end{aligned}\]
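
In R, this nested comparison can be carried out with anova(). A sketch on simulated data (all variable names and coefficients here are hypothetical):

# simulate data where x3 and x4 have no real effect
set.seed(363)
sim <- tibble(x1 = rnorm(100), x2 = rnorm(100),
              x3 = rnorm(100), x4 = rnorm(100))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(100)

model_a <- lm(y ~ x1 + x2, data = sim)
model_b <- lm(y ~ x1 + x2 + x3 + x4, data = sim)
anova(model_a, model_b) # F test of H0: beta3 = beta4 = 0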

Nested models (2/2)

Another way to think about nested models: the reduced model can be obtained from the larger model by equating some of its parameters with each other or by setting them to constants.

Example:

\[\begin{aligned}&\text{Model 1: }p_H \\ &\text{Model 2: }p_{H| N}, p_{H| H Bias}, p_{H| V Bias}\end{aligned}\]

. . .

Model 1 is nested in Model 2. The parameters \(p_{H| N}\), \(p_{H|H Bias}\), and \(p_{H |V Bias}\) can be set equal to \(p_H\) to get Model 1.

. . .

\[\begin{aligned}&H_0: p_{H| N} = p_{H| H Bias} = p_{H| V Bias} = p_H \\ &H_a: \text{At least one of }p_{H| N}, p_{H| H Bias}, p_{H| V Bias} \text{ differs from the others}\end{aligned}\]

Steps to compare models

1️⃣ Find the MLEs for each model.

2️⃣ Plug the MLEs into the log-likelihood function for each model to get the maximum value of the log-likelihood for each model.

3️⃣ Find the difference in the maximum log-likelihoods

4️⃣ Use the Likelihood Ratio Test to determine if the difference is statistically significant

Steps 1 - 2

Find the MLEs for each model and plug them into the log-likelihood functions.

Model 1:

  • \(\hat{p}_H = 46/90 = 0.511\)
loglik1 <- function(ph){
 log(ph^46 * (1 - ph)^44)
}
loglik1(46/90)
[1] -62.36102

. . .

Model 2

  • \(\hat{p}_{H|N} = 25 / 48 = 0.521\)
  • \(\hat{p}_{H|H Bias} = 8 /20 = 0.4\)
  • \(\hat{p}_{H|V Bias} = 13/ 22 = 0.591\)

. . .

loglik2 <- function(phn, phh, phv) {
  log(phn^25 * (1 - phn)^23 * phh^8 * 
        (1 - phh)^12 * phv^13 * (1 - phv)^9)
}
loglik2(25/48, 8/20, 13/22)
[1] -61.57319

Step 3

Find the difference in the log-likelihoods

(diff <- loglik2(25/48, 8/20, 13/22) - loglik1(46/90))
[1] 0.7878318


. . .

Is the difference in the maximum log-likelihoods statistically significant?

Likelihood Ratio Test

Test statistic

\[\begin{aligned} LRT &= 2[\max\{\log(Lik(\text{larger model}))\} - \max\{\log(Lik(\text{reduced model}))\}]\\[10pt] &= 2\log\Bigg(\frac{\max\{Lik(\text{larger model})\}}{\max\{Lik(\text{reduced model})\}}\Bigg)\end{aligned}\]


. . .

Under \(H_0\), LRT approximately follows a \(\chi^2\) distribution with degrees of freedom equal to the difference in the number of parameters between the two models

Step 4

(LRT <- 2 * (loglik2(25/48, 8/20, 13/22) - loglik1(46/90)))
[1] 1.575664

. . .

The test statistic follows a \(\chi^2\) distribution with 2 degrees of freedom. Therefore, the p-value is \(P(\chi^2 > LRT)\).

pchisq(LRT, 2, lower.tail = FALSE)
[1] 0.4548299

. . .

The p-value is very large, so we fail to reject \(H_0\). We do not have convincing evidence that the conditional model is an improvement over the unconditional model. Therefore, we can stick with the unconditional model.

Comparing non-nested models

AIC = -2(max log-likelihood) + 2p, where p is the number of model parameters

(Model1_AIC <- -2 * loglik1(46/90) + 2 * 1)
[1] 126.722
(Model2_AIC <- -2 * loglik2(25/48, 8/20, 13/22) + 2 * 3)
[1] 129.1464

. . .

BIC = -2(max log-likelihood) + p log(n), where n is the number of observations (here, n = 30 games)

(Model1_BIC <- -2 * loglik1(46/90) + 1 * log(30))
[1] 128.1232
(Model2_BIC <- -2 * loglik2(25/48, 8/20, 13/22) + 3 * log(30))
[1] 133.35

Choose Model 1, the unconditional model, based on AIC and BIC (smaller values indicate the preferred model)
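
These computations can be wrapped in small helper functions (a sketch; the names are mine):

# smaller AIC/BIC indicates the preferred model
aic <- function(max_loglik, p) -2 * max_loglik + 2 * p
bic <- function(max_loglik, p, n) -2 * max_loglik + p * log(n)

aic(loglik1(46/90), p = 1)                      # 126.722
bic(loglik2(25/48, 8/20, 13/22), p = 3, n = 30) # 133.35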

Looking ahead

  • Likelihoods help us answer the question of how likely we are to observe the data given different parameters

  • In this example, we did not consider covariates; in practice, the parameters we want to estimate will typically depend on covariates and look more like this

. . .

\[p_H = \frac{e^{\beta_0 + \beta_1x_1 + \dots + \beta_px_p}}{1 + e^{\beta_0 + \beta_1x_1 + \dots + \beta_px_p}}\]
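
This is the inverse-logit function, available in base R as plogis(). A quick sketch with hypothetical coefficient values:

# plogis(eta) = exp(eta) / (1 + exp(eta))
beta0 <- -0.5
beta1 <- 0.8 # hypothetical values for illustration
x1 <- 1
plogis(beta0 + beta1 * x1) # estimated p_H for a game with this x1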

. . .

  • Finding the MLE becomes much more complex and numerical methods may be required.
    • We will primarily rely on software to find the MLE, but the conceptual ideas will be the same

Acknowledgements

These slides are based on content in BMLR: Chapter 2 - Beyond Least Squares: Using Likelihoods

Initial versions of the slides are by Dr. Maria Tackett, Duke University