Distribution Theory

BMLR Chapter 3

Tyler George

Cornell College
STA 363 Spring 2025 Block 8

Learning Objectives

  • Write definitions of non-normal random variables in the context of an application.
  • Identify possible values for each random variable.
  • Identify how changing values for a parameter affects the characteristics of the distribution.
  • Recognize a form of the probability density function for each distribution.
  • Identify the mean and variance for each distribution.
  • Match the response for a study to a plausible random variable and provide reasons for ruling out other random variables.
  • Match a histogram of sample data to plausible distributions.
  • Create a mixture of distributions and evaluate the shape, mean, and variance.

Libraries Needed (none new?)

# Packages required for Chapter 3
library(gridExtra)  
library(knitr) 
library(kableExtra)
library(tidyverse)

Introduction

  • What if it is not plausible that a response is normally distributed?
  • You may want to construct a model to predict whether a prospective student will enroll at a school or model the lifetimes of patients following a particular surgery.
  • In the first case you have a binary response (enrolls (1) or does not enroll (0)), and in the second case you are likely to have very skewed data with many similar values and a few hardy souls with extremely long survival.
  • These responses are not expected to be normally distributed; other distributions will be needed to describe and model binary or lifetime data.
  • Non-normal responses are encountered in a large number of situations. Luckily, there are quite a few possibilities for models.

Discrete Random Variables

Discrete Random Variables

  • A discrete random variable has a countable number of possible values
    • Ex: we may want to measure the number of people in a household
    • Ex: model the number of crimes committed on a college campus.
  • With discrete random variables, the associated probabilities can be calculated for each possible value using a probability mass function (pmf).
  • A pmf is a function that gives \(P(Y=y)\), the probability that the random variable \(Y\) takes the value \(y\), given the distribution’s parameters.

Binary Random Variable

  • Consider the event of flipping a (possibly unfair) coin.
  • If the coin lands heads, let’s consider this a success and record \(Y = 1\).
  • A series of these events is a Bernoulli process, independent trials that take on one of two values (e.g., 0 or 1).
  • These values are often referred to as a failure and a success
    • the probability of success is identical for each trial.

Binary Random Variable

  • Suppose we only flip the coin once, so we only have one parameter, the probability of flipping heads, \(p\).
  • If we know this value, we can express \(P(Y=1) = p\) and \(P(Y=0) = 1-p\).
  • In general, if we have a Bernoulli process with only one trial, we have a binary distribution (also called a Bernoulli distribution) where

\[\begin{equation*} P(Y = y) = p^y(1-p)^{1-y} \quad \textrm{for} \quad y = 0, 1. \end{equation*}\]

  • If \(Y \sim \textrm{Binary}(p)\), then \(Y\) has mean \(\operatorname{E}(Y) = p\) and standard deviation \(\operatorname{SD}(Y) = \sqrt{p(1-p)}\).

Example 1

  • Your playlist of 200 songs has 5 which you cannot stand. What is the probability that when you hit shuffle, a song you tolerate comes on?
  • We want to understand the example and be able to translate that into notation/model it with a distribution
  • Assuming all songs have equal odds of playing, we can calculate \(p = \frac{200-5}{200} = 0.975\)
    • so there is a 97.5% chance of a song you tolerate playing, since \(P(Y=1)=.975^1*(1-.975)^0\).
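As a quick check, this arithmetic can be done in R (a minimal sketch; the numbers come straight from the example):

# Example 1: probability that a tolerated song plays
p <- (200 - 5) / 200   # 195 tolerated songs out of 200
p^1 * (1 - p)^0        # P(Y = 1) from the binary pmf, which is just p = 0.975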

Binomial Random Variable (1/2)

  • Does anybody know how this relates to a Bernoulli random variable?
  • We can extend our knowledge of binary random variables.
  • Suppose we flipped an unfair coin \(n\) times and recorded \(Y\), the number of heads after \(n\) flips.
  • If we consider a case where \(p = 0.25\) and \(n = 4\), then here \(P(Y=0)\) represents the probability of no successes in 4 trials
    • 4 consecutive failures.
    • The probability of 4 consecutive failures is \(P(Y = 0) = P(TTTT) = (1-p)^4 = 0.75^4\).

Binomial Random Variable (2/2)

  • Next we consider \(P(Y = 1)\), and are interested in the probability of exactly 1 success anywhere among the 4 trials.

  • How many different ways can we get exactly 1 success in the four trials?

  • To find the probability of this what else do we need to count?

  • What is the probability?

  • There are \(\binom{4}{1} = 4\) ways to have exactly 1 success in 4 trials, \(P(Y = 1) = \binom{4}{1}p^1(1-p)^{4-1} = (4)(0.25)(0.75)^3\).

Binomial Distribution

  • In general, if we carry out a sequence of \(n\) Bernoulli trials (with probability of success \(p\)) and record \(Y\), the total number of successes, then \(Y\) follows a binomial distribution, where

\[\begin{equation} P(Y=y) = \binom{n}{y} p^y (1-p)^{n-y} \quad \textrm{for} \quad y = 0, 1, \ldots, n. \end{equation}\]

  • If \(Y \sim \textrm{Binomial}(n,p)\), then \(\operatorname{E}(Y) = np\) and \(\operatorname{SD}(Y) = \sqrt{np(1-p)}\).
  • \(\operatorname{E}(Y)\) is the expected value of \(Y\): if we repeated the experiment many times, we would expect \(np\) successes on average.

Binomial Distribution Graphs

  • Typical shapes of a binomial distribution
  • I will show you these in R

Binomial Distribution Graphs

  • On the left side \(n\) remains constant.
  • We see that as \(p\) increases, the center of the distribution (\(\operatorname{E}(Y) = np\)) shifts right.
  • On the right, \(p\) is held constant.
  • As \(n\) increases, the distribution becomes less skewed.
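For reference, here is a minimal sketch of how plots like these could be made with dbinom() (the parameter values are illustrative, not necessarily the ones shown in class):

# Sketch: binomial pmfs for a few illustrative (n, p) combinations
binom_pmf <- function(n, p) {
  tibble(y = 0:n,
         prob = dbinom(0:n, size = n, prob = p),
         label = paste0("n = ", n, ", p = ", p))
}
binomDf <- bind_rows(binom_pmf(10, 0.25), binom_pmf(10, 0.50),
                     binom_pmf(10, 0.75), binom_pmf(50, 0.25))
ggplot(binomDf, aes(x = y, y = prob)) +
  geom_col() +
  facet_wrap(~ label, scales = "free_x") +
  labs(x = "number of successes", y = "probability")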

Binomial Distribution vs Bernoulli

  • Note that if \(n=1\),

\[\begin{align*} P(Y=y) &= \binom{1}{y} p^y(1-p)^{1-y} \\ &= p^y(1-p)^{1-y}\quad \textrm{for}\quad y = 0, 1, \end{align*}\] a Bernoulli distribution!

  • In fact, Bernoulli random variables are a special case of binomial random variables where \(n=1\).

Graphing/calling in R

  • In R we can use the function dbinom(y, n, p), which outputs the probability of \(y\) successes given \(n\) trials with probability \(p\), i.e., \(P(Y=y)\) for \(Y \sim \textrm{Binomial}(n,p)\).
  • Let's try it:
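For instance, a small sketch using the numbers from Example 1 and the coin example above (values taken directly from the earlier slides):

dbinom(1, size = 1, prob = 0.975)   # Example 1 as a one-trial binomial: P(Y = 1) = 0.975
dbinom(1, size = 4, prob = 0.25)    # P(Y = 1) for n = 4, p = 0.25: 4 * 0.25 * 0.75^3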

Example 2 (1/2)

  • While taking a multiple choice test, a student encountered 10 problems where she ended up completely guessing, randomly selecting one of the four options.
  • What is the chance that she got exactly 2 of the 10 correct?
  • What assumption do we need to make about the questions?
  • How would we do this without using a distribution?

Example 2 (2/2)

  • Knowing that the student randomly selected her answers, we assume she has a 25% chance of a correct response.
  • n factorial is denoted \(n!\) and \(n!=n\cdot(n-1)\cdot...\cdot 3\cdot 2\cdot 1\)
  • Here we used a combination \({n \choose y}=\frac{n!}{y!(n-y)!}\), read “\(n\) choose \(y\)”
  • Thus, \(P(Y=2) = {10 \choose 2}(.25)^2(.75)^8 = 0.282\).
  • We can use R to verify this:
dbinom(2, size = 10, prob = .25)
[1] 0.2815676

Therefore, there is a 28% chance of exactly 2 correct answers out of 10.

Negative Binomial Random Variable

  • What if we were to carry out multiple independent and identical Bernoulli trials until the \(r\)th success occurs?
  • If we model \(Y\), the number of failures before the \(r\)th success, then \(Y\) follows a negative binomial distribution where

\[\begin{equation} P(Y=y) = \binom{y + r - 1}{r-1} (1-p)^{y}(p)^r \quad \textrm{for}\quad y = 0, 1, \ldots, \infty. \end{equation}\]

Negative Binomial RV

  • If \(Y \sim \textrm{Negative Binomial}(r, p)\) then \(\operatorname{E}(Y) = \frac{r(1-p)}{p}\) and \(\operatorname{SD}(Y) = \sqrt{\frac{r(1-p)}{p^2}}\).
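As a quick sanity check (parameter values chosen only for illustration), we can compare these formulas to a large simulated sample:

set.seed(363)   # illustrative seed for reproducibility
r <- 3; p <- 0.35
sims <- rnbinom(10000, size = r, prob = p)   # failures before the 3rd success
c(theory_mean = r * (1 - p) / p, sim_mean = mean(sims))
c(theory_sd = sqrt(r * (1 - p) / p^2), sim_sd = sd(sims))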

Negative Binomial RV

Negative Binomial
  • Notice how centers shift right as \(r\) increases, and left as \(p\) increases.
#negativeBinomialPlots
plotNBinom <- function(p, r){
  # simulate a large sample of negative binomial counts (failures before the rth success)
  nbd <- tibble(x = rnbinom(10000, size = r, prob = p))
  # plot the relative frequency of each count to approximate the pmf
  ggplot(nbd, aes(x = x)) +
    geom_histogram(aes(y = ..count.. / sum(..count..)), binwidth = .25) +
    labs(title = paste("p = ", p, ", r = ", r),
         x = "number of failures",
         y = "probability") +
    xlim(-1, 30)
}

NBin1 <- plotNBinom(0.35, 3)
NBin2 <- plotNBinom(0.35, 5)
NBin3 <- plotNBinom(0.70, 5)
grid.arrange(NBin1, NBin2, NBin3, ncol = 1)

Negative Binomial RV

  • Note that if we set \(r=1\), then

\[\begin{align*} P(Y=y) &= \binom{y}{0} (1-p)^yp \\ &= (1-p)^yp \quad \textrm{for} \quad y = 0, 1, \ldots, \infty, \end{align*}\] which is the probability mass function of a geometric random variable!

  • Thus, a geometric random variable is, in fact, a special case of a negative binomial random variable.

  • While negative binomial random variables typically are expressed as above using binomial coefficients (expressions such as \(\binom{x}{y}\)), we can generalize our definition to allow non-integer values of \(r\).

  • R function dnbinom(y, r, p) for the probability of \(y\) failures before the \(r\)th success given probability \(p\).
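A short example (illustrative values) of how dnbinom() lines up with the pmf above and with the geometric special case:

# P(Y = 4) failures before the r = 2nd success with p = 0.3, by hand and with dnbinom()
choose(4 + 2 - 1, 2 - 1) * (1 - 0.3)^4 * 0.3^2
dnbinom(4, size = 2, prob = 0.3)
# With r = 1 the negative binomial reduces to the geometric distribution
dnbinom(0:3, size = 1, prob = 0.3)
dgeom(0:3, prob = 0.3)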

Poisson Random Variable

  • Sometimes, random variables are based on a Poisson process.
  • In a Poisson process, we count the number of events occurring in a fixed unit of time or space, where the expected number of events depends only on the length or size of the interval.
  • We can then model \(Y\), the number of events in one of these sections with the Poisson distribution, where

\[\begin{equation} P(Y=y) = \frac{e^{-\lambda}\lambda^y}{y!} \quad \textrm{for} \quad y = 0, 1, \ldots, \infty, \end{equation}\] where \(\lambda\) is the mean or expected count in the unit of time or space of interest.

  • This probability mass function has \(\operatorname{E}(Y) = \lambda\) and \(\operatorname{SD}(Y) = \sqrt{\lambda}\).
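For example (an illustrative calculation), \(P(Y = 2)\) when \(\lambda = 3\) can be found from the pmf by hand or with dpois():

exp(-3) * 3^2 / factorial(2)   # pmf by hand: about 0.224
dpois(2, lambda = 3)           # same probability from R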

Poisson Distribution Graphs

  • Notice how distributions become more symmetric as \(\lambda\) increases.
  • If we wish to use R, dpois(y, lambda) outputs the probability of \(y\) events given \(\lambda\).
#poissonPlots
plotPois <- function(lam){
  # simulate a large sample of Poisson counts with mean lam
  pd <- data.frame(x = rpois(10000, lam))
  # plot the relative frequency of each count to approximate the pmf
  ggplot(pd, aes(x = x)) +
    geom_histogram(aes(y = ..count.. / sum(..count..)), binwidth = .25) +
    xlab("number of events") + ylab("probability") +
    labs(title = paste("Poisson lambda = ", lam)) + xlim(-1, 13)
}

Pois1 <- plotPois(0.5)
Pois2 <- plotPois(1)
Pois3 <- plotPois(5) + scale_y_continuous(breaks = c(0, 0.1))
grid.arrange(Pois1,Pois2,Pois3,ncol=1)

Continuous Random Variables

Continuous Random Variables

  • A continuous random variable can take on an uncountably infinite number of values.
  • With continuous random variables, we define probabilities using probability density functions (pdfs).
  • Probabilities are calculated by computing the area under the density curve over the interval of interest. So, given a pdf, \(f(y)\), we can compute

\[\begin{align*} P(a \le Y \le b) = \int_a^b f(y)dy. \end{align*}\]

Continuous Random Variables

A few properties of continuous random variables:

  • The area under their density curve is 1. (\(\int_{-\infty}^{\infty} f(y)dy = 1\)).
  • For any value \(y\), \(P(Y = y) = \int_y^y f(y)dy = 0\). Why?
  • Because of the above property, \(P(y < Y) = P(y \le Y)\). We will typically use the first notation rather than the second, but both are equally valid.
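A minimal numerical sketch of these properties, using the standard normal density as an example:

integrate(dnorm, -Inf, Inf)   # total area under the curve is 1
integrate(dnorm, 1, 1)        # the "area" at a single point is 0, so P(Y = 1) = 0
integrate(dnorm, -1, 1)       # P(-1 <= Y <= 1), about 0.683
pnorm(1) - pnorm(-1)          # the same probability via the cdf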

Exponential Random Variable

  • Suppose we have a Poisson process with rate \(\lambda\), and we wish to model the wait time \(Y\) until the first event.
  • We could model \(Y\) using an exponential distribution, where

\[\begin{equation} f(y) = \lambda e^{-\lambda y} \quad \textrm{for} \quad y > 0, \end{equation}\]

  • \(\operatorname{E}(Y) = 1/\lambda\) and \(\operatorname{SD}(Y) = 1/\lambda\).

Exponential Distribution

Exponential distributions with \(\lambda = 0.5, 1,\) and \(5\).

Exponential Distribution
  • As \(\lambda\) increases, \(\operatorname{E}(Y)\) tends towards 0, and distributions “die off” quicker.
#exponentialPlots
x <- seq(0, 4, by = 0.01)   # grid of possible values
probex1 <- dexp(x, 0.5)     # density f(y) when lambda = 0.5
probex2 <- dexp(x, 1)
probex3 <- dexp(x, 5)
Expdf <- tibble(x, probex1, probex2, probex3) %>%
  rename(`0.5` = probex1,
         `1` = probex2,
         `5` = probex3) %>%
  gather(2:4, key = "Lambda", value = "value") %>%
  mutate(Lambda = factor(Lambda, levels = c("0.5", "1", "5")))
ggplot(data = Expdf, aes(x = x, y = value, color = Lambda)) +
  geom_line(aes(linetype = Lambda)) +
  xlab("values") + ylab("density") +
  labs(title = "Exponential Distributions") +
  xlim(0, 4) + ylim(0, 3)

  • To use R, pexp(y, lambda) outputs the probability \(P(Y < y)\) given \(\lambda\).
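For example (illustrative values), with \(\lambda = 0.5\) the probability of waiting less than 2 time units matches the cdf \(1 - e^{-\lambda y}\):

1 - exp(-0.5 * 2)     # cdf of an exponential at y = 2 with lambda = 0.5
pexp(2, rate = 0.5)   # same probability from R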

Gamma Random Variable

  • Once again consider a Poisson process.
  • When discussing exponential random variables, we modeled the wait time before one event occurred.
  • If \(Y\) represents the wait time before \(r\) events occur in a Poisson process with rate \(\lambda\), \(Y\) follows a gamma distribution where

\[\begin{equation} f(y) = \frac{\lambda^r}{\Gamma(r)} y^{r-1} e^{-\lambda y}\quad \textrm{for} \quad y >0. \end{equation}\]

  • If \(Y \sim \textrm{Gamma}(r, \lambda)\) then \(\operatorname{E}(Y) = r/\lambda\) and \(\operatorname{SD}(Y) = \sqrt{r/\lambda^2}\).
  • Note: for positive integers \(r\), \(\Gamma(r)=(r-1)!\) (the gamma function also extends this to non-integer \(r\))
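A quick sketch of both facts in R (values are illustrative):

gamma(5); factorial(4)   # Gamma(r) = (r - 1)! for integer r: both equal 24
# gamma pdf at y = 2 with r = 5, lambda = 5, by hand and with dgamma()
5^5 / gamma(5) * 2^(5 - 1) * exp(-5 * 2)
dgamma(2, shape = 5, rate = 5)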

Gamma Distribution

  • Means increase as \(r\) increases, but decrease as \(\lambda\) increases.
x <- seq(0, 7, by = 0.01)
`r = 1, lambda = 1` <- dgamma(x, 1, rate = 1)
`r = 2, lambda = 1` <- dgamma(x, 2, rate = 1) 
`r = 5, lambda = 5` <- dgamma(x, 5, rate = 5)
`r = 5, lambda = 7` <- dgamma(x, 5, rate = 7)

gammaDf <- tibble(x, `r = 1, lambda = 1`, `r = 2, lambda = 1`, `r = 5, lambda = 5`, `r = 5, lambda = 7`) %>%
  gather(2:5, key = "Distribution", value = "value") %>%
  mutate(Distribution = factor(Distribution, 
                               levels = c("r = 2, lambda = 1", 
                                          "r = 1, lambda = 1", 
                                          "r = 5, lambda = 5", 
                                          "r = 5, lambda = 7")))

ggplot(data = gammaDf, aes(x = x, y = value, 
                           color = Distribution)) +
  geom_line(aes(linetype = Distribution)) +
  xlab("values") + ylab("density") + 
  labs(title = "Gamma Distributions") +
  theme(legend.title = element_blank())

Gamma vs Others

  • Note that if we let \(r = 1\), we have the following pdf,

\[\begin{align*} f(y) &= \frac{\lambda}{\Gamma(1)} y^{1-1} e^{-\lambda y} \\ &= \lambda e^{-\lambda y} \quad \textrm{for} \quad y > 0, \end{align*}\] an exponential distribution.

  • Just as the geometric distribution is a special case of the negative binomial, the exponential distribution is in fact a special case of the gamma distribution!

  • Just like the negative binomial, the pdf of a gamma distribution is defined for all real, positive \(r\).
  • In R, pgamma(y, r, lambda) outputs the probability \(P(Y < y)\) given \(r\) and \(\lambda\).
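A short illustration of pgamma() and of the exponential special case (values are illustrative):

pgamma(2, shape = 1, rate = 0.5)   # gamma with r = 1 ...
pexp(2, rate = 0.5)                # ... matches the exponential
pgamma(1.5, shape = 3, rate = 2)   # P(Y < 1.5) for Gamma(r = 3, lambda = 2)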

Distributions Used in Testing

  • We have spent most of this chapter discussing probability distributions that may come in handy when modeling.
  • The following distributions, while rarely used in modeling, prove useful in hypothesis testing as certain commonly used test statistics follow these distributions.
  • \(\chi^2\) distribution (requires one degrees-of-freedom parameter)
  • Student's \(t\) distribution
  • \(F\) distribution (requires two degrees-of-freedom parameters)
  • Since we have used these in the past, we will leave their definitions to be referenced if needed.

Distribution Table!

Acknowledgements

These slides are based on content in BMLR: Chapter 3 - Distribution Theory