# Packages required for Chapter 3
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidyverse)
Distribution Theory
BMLR Chapter 3
Learning Objectives
- Write definitions of non-normal random variables in the context of an application.
- Identify possible values for each random variable.
- Identify how changing values for a parameter affects the characteristics of the distribution.
- Recognize a form of the probability density function for each distribution.
- Identify the mean and variance for each distribution.
- Match the response for a study to a plausible random variable and provide reasons for ruling out other random variables.
- Match a histogram of sample data to plausible distributions.
- Create a mixture of distributions and evaluate the shape, mean, and variance.
Libraries Needed (none new?)
Introduction
- What if it is not plausible that a response is normally distributed?
- You may want to construct a model to predict whether a prospective student will enroll at a school or model the lifetimes of patients following a particular surgery.
- In the first case you have a binary response (enrolls (1) or does not enroll (0)), and in the second case you are likely to have very skewed data with many similar values and a few hardy souls with extremely long survival.
- These responses are not expected to be normally distributed; other distributions will be needed to describe and model binary or lifetime data.
- Non-normal responses are encountered in a large number of situations. Luckily, there are quite a few possibilities for models.
Discrete Random Variables
Discrete Random Variables
- A discrete random variable has a countable number of possible values
- Ex: we may want to measure the number of people in a household
- Ex: model the number of crimes committed on a college campus.
- With discrete random variables, the associated probabilities can be calculated for each possible value using a probability mass function (pmf).
- A pmf is a function that calculates \(P(Y=y)\), given each variable’s parameters.
Binary Random Variable
- Consider the event of flipping a (possibly unfair) coin.
- If the coin lands heads, let’s consider this a success and record \(Y = 1\).
- A series of these events is a Bernoulli process, independent trials that take on one of two values (e.g., 0 or 1).
- These values are often referred to as a failure and a success
- the probability of success is identical for each trial.
Binary Random Variable
- Suppose we only flip the coin once, so we only have one parameter, the probability of flipping heads, \(p\).
- If we know this value, we can express \(P(Y=1) = p\) and \(P(Y=0) = 1-p\).
- In general, if we have a Bernoulli process with only one trial, we have a binary distribution (also called a Bernoulli distribution) where
. . .
\[\begin{equation*} P(Y = y) = p^y(1-p)^{1-y} \quad \textrm{for} \quad y = 0, 1. \end{equation*}\]
- If \(Y \sim \textrm{Binary}(p)\), then \(Y\) has mean \(\operatorname{E}(Y) = p\) and standard deviation \(\operatorname{SD}(Y) = \sqrt{p(1-p)}\).
Example 1
- Your playlist of 200 songs has 5 which you cannot stand. What is the probability that when you hit shuffle, a song you tolerate comes on?
- We want to understand the example and translate it into notation so we can model it with a distribution.
- Assuming all songs are equally likely to play, we can calculate \(p = \frac{200-5}{200} = 0.975\)
- So there is a 97.5% chance that a song you tolerate plays, since \(P(Y=1) = 0.975^1(1-0.975)^0 = 0.975\).
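- As a quick check (a minimal sketch, not from the text), the same arithmetic in R:

p <- (200 - 5) / 200   # probability a tolerable song plays
p^1 * (1 - p)^0        # P(Y = 1) = 0.975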
Binomial Random Variable (1/2)
- Does anybody know how this relates to a Bernoulli random variable?
- We can extend our knowledge of binary random variables.
- Suppose we flipped an unfair coin \(n\) times and recorded \(Y\), the number of heads after \(n\) flips.
- If we consider a case where \(p = 0.25\) and \(n = 4\), then here \(P(Y=0)\) represents the probability of no successes in 4 trials
- 4 consecutive failures.
- The probability of 4 consecutive failures is \(P(Y = 0) = P(TTTT) = (1-p)^4 = 0.75^4\).
Binomial Random Variable (2/2)
Next we consider \(P(Y = 1)\), and are interested in the probability of exactly 1 success anywhere among the 4 trials.
How many different ways can we get exactly 1 success in the four trials?
To find the probability of this what else do we need to count?
What is the probability?
There are \(\binom{4}{1} = 4\) ways to have exactly 1 success in 4 trials, \(P(Y = 1) = \binom{4}{1}p^1(1-p)^{4-1} = (4)(0.25)(0.75)^3\).
Binomial Distribution
- In general, if we carry out a sequence of \(n\) Bernoulli trials (with probability of success \(p\)) and record \(Y\), the total number of successes, then \(Y\) follows a binomial distribution, where
. . .
\[\begin{equation} P(Y=y) = \binom{n}{y} p^y (1-p)^{n-y} \quad \textrm{for} \quad y = 0, 1, \ldots, n. \end{equation}\]
. . .
- If \(Y \sim \textrm{Binomial}(n,p)\), then \(\operatorname{E}(Y) = np\) and \(\operatorname{SD}(Y) = \sqrt{np(1-p)}\).
- \(\operatorname{E}(Y)\) is the expected value of \(Y\): if we repeated the experiment many times, we would expect \(np\) successes on average.
Binomial Distribution Graphs
- Typical shapes of a binomial distribution
- I will show you these in R
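- A minimal sketch of how such plots could be generated (parameter values chosen for illustration and may differ from the textbook figure):

plotBinom <- function(n, p){
  # pmf of a Binomial(n, p) random variable over its possible values 0, ..., n
  tibble(y = 0:n, prob = dbinom(0:n, n, p)) %>%
    ggplot(aes(x = y, y = prob)) +
    geom_col() +
    labs(title = paste("n =", n, ", p =", p),
         x = "number of successes", y = "probability")
}
# left column: n fixed, p varies; right column: p fixed, n varies
grid.arrange(plotBinom(10, 0.25), plotBinom(10, 0.25),
             plotBinom(10, 0.75), plotBinom(50, 0.25),
             ncol = 2)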
Binomial Distribution Graphs
- On the left side \(n\) remains constant.
- We see that as \(p\) increases, the center of the distribution (\(\operatorname{E}(Y) = np\)) shifts right.
- On the right, \(p\) is held constant.
- As \(n\) increases, the distribution becomes less skewed.
Binomial Distribution vs Bernoulli
- Note that if \(n=1\),
. . .
\[\begin{align*} P(Y=y) &= \binom{1}{y} p^y(1-p)^{1-y} \\ &= p^y(1-p)^{1-y}\quad \textrm{for}\quad y = 0, 1, \end{align*}\]
a Bernoulli distribution!
- In fact, Bernoulli random variables are a special case of binomial random variables where \(n=1\).
Graphing/calling in R
- In R we can use the function
dbinom(y, n, p)
which outputs the probability of \(y\) successes given \(n\) trials with probability \(p\), i.e., \(P(Y=y)\) for \(Y \sim \textrm{Binomial}(n,p)\).
- Let's try it.
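- For instance, the probability of exactly 1 success in 4 trials with \(p = 0.25\) from the earlier slide:

dbinom(1, size = 4, prob = 0.25)
[1] 0.421875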
Example 2 (1/2)
- While taking a multiple choice test, a student encountered 10 problems where she ended up completely guessing, randomly selecting one of the four options.
- What is the chance that she got exactly 2 of the 10 correct?
- What assumption do we need to make about the questions?
- How would we do this without using a distribution?
Example 2 (2/2)
- Knowing that the student randomly selected her answers, we assume she has a 25% chance of a correct response.
- Recall that \(n\) factorial is denoted \(n!\), where \(n! = n\cdot(n-1)\cdots 3\cdot 2\cdot 1\)
- Here we used a combination, \({n \choose y} = \frac{n!}{y!(n-y)!}\), read "\(n\) choose \(y\)"
- Thus, \(P(Y=2) = {10 \choose 2}(.25)^2(.75)^8 = 0.282\).
. . .
- We can use R to verify this:
. . .
dbinom(2, size = 10, prob = .25)
[1] 0.2815676
Therefore, there is a 28% chance of exactly 2 correct answers out of 10.
Negative Binomial Random Variable
- What if we were to carry out multiple independent and identical Bernoulli trials until the \(r\)th success occurs?
- If we model \(Y\), the number of failures before the \(r\)th success, then \(Y\) follows a negative binomial distribution where
. . .
\[\begin{equation} P(Y=y) = \binom{y + r - 1}{r-1} (1-p)^{y}(p)^r \quad \textrm{for}\quad y = 0, 1, \ldots, \infty. \end{equation}\]
Negative Binomial RV
- If \(Y \sim \textrm{Negative Binomial}(r, p)\) then \(\operatorname{E}(Y) = \frac{r(1-p)}{p}\) and \(\operatorname{SD}(Y) = \sqrt{\frac{r(1-p)}{p^2}}\).
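- A quick simulation sketch (values chosen for illustration): with \(r = 3\) and \(p = 0.35\), \(\operatorname{E}(Y) = 3(0.65)/0.35 \approx 5.57\) and \(\operatorname{SD}(Y) \approx 3.99\), which we can check in R:

set.seed(1)                                   # for reproducibility
y <- rnbinom(100000, size = 3, prob = 0.35)   # simulate many negative binomial draws
mean(y)  # should be close to 5.57
sd(y)    # should be close to 3.99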
Negative Binomial RV
- Notice how centers shift right as \(r\) increases, and left as \(p\) increases.
#negativeBinomialPlots
plotNBinom <- function(p, r){
  ynb <- 0:10000
  nbd <- tibble(x = rnbinom(ynb, r, p))
  #breaks <- pretty(range(nbd$x), n = nclass.FD(nbd$x), min.n = 1) # pretty binning
  #bwidth <- breaks[2] - breaks[1]
  ggplot(nbd, aes(x = x)) +
    geom_histogram(aes(y = ..count../sum(..count..)), binwidth = .25) +
    labs(title = paste("p = ", p, ", r = ", r),
         x = "number of failures",
         y = "probability") +
    xlim(-1, 30)
}
NBin1 <- plotNBinom(0.35, 3)
NBin2 <- plotNBinom(0.35, 5)
NBin3 <- plotNBinom(0.70, 5)
grid.arrange(NBin1, NBin2, NBin3, ncol = 1)
Negative Binomial RV
- Note that if we set \(r=1\), then
. . .
\[\begin{align*} P(Y=y) &= \binom{y}{0} (1-p)^yp \\ &= (1-p)^yp \quad \textrm{for} \quad y = 0, 1, \ldots, \infty, \end{align*}\] which is the probability mass function of a geometric random variable!
Thus, a geometric random variable is, in fact, a special case of a negative binomial random variable.
While negative binomial random variables typically are expressed as above using binomial coefficients (expressions such as \(\binom{x}{y}\)), we can generalize our definition to allow non-integer values of \(r\).
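- A quick numerical check of the special case (illustrative value \(p = 0.3\)): the negative binomial pmf with \(r = 1\) matches R's built-in geometric pmf.

dnbinom(0:4, size = 1, prob = 0.3)  # negative binomial with r = 1
dgeom(0:4, prob = 0.3)              # geometric; returns the same probabilities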
R function
dnbinom(y, r, p)
for the probability of \(y\) failures before the \(r\)th success given probability \(p\).
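- For instance (illustrative values), the probability of 3 failures before the 2nd success with \(p = 0.5\) is \(\binom{4}{1}(0.5)^2(0.5)^3 = 0.125\):

dnbinom(3, size = 2, prob = 0.5)
[1] 0.125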
Poisson Random Variable
- Sometimes, random variables are based on a Poisson process.
- In a Poisson process, we are counting the number of events per unit of time or space and the number of events depends only on the length or size of the interval.
- We can then model \(Y\), the number of events in one of these sections with the Poisson distribution, where
. . .
\[\begin{equation} P(Y=y) = \frac{e^{-\lambda}\lambda^y}{y!} \quad \textrm{for} \quad y = 0, 1, \ldots, \infty, \end{equation}\]
where \(\lambda\) is the mean or expected count in the unit of time or space of interest.
- This probability mass function has \(\operatorname{E}(Y) = \lambda\) and \(\operatorname{SD}(Y) = \sqrt{\lambda}\).
Poisson Distribution Graphs
- Notice how distributions become more symmetric as \(\lambda\) increases.
- If we wish to use R,
dpois(y, lambda)
outputs the probability of \(y\) events given \(\lambda\).
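- For instance (illustrative values), the probability of exactly 2 events when \(\lambda = 3\) is \(\frac{e^{-3}3^2}{2!} \approx 0.224\):

dpois(2, lambda = 3)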
#poissonPlots
plotPois <- function(lam){
  yp <- 0:10000 # possible values
  pd <- data.frame(x = rpois(yp, lam)) # generate random deviates
  breaks <- pretty(range(pd$x), n = nclass.FD(pd$x), min.n = 1) # pretty binning
  #bwidth <- breaks[2] - breaks[1]
  ggplot(pd, aes(x = x)) +
    geom_histogram(aes(y = ..count../sum(..count..)), binwidth = .25) +
    xlab("number of events") + ylab("probability") +
    labs(title = paste("Poisson lambda = ", lam)) + xlim(-1, 13)
}
Pois1 <- plotPois(0.5)
Pois2 <- plotPois(1)
Pois3 <- plotPois(5) + scale_y_continuous(breaks = c(0, 0.1))
grid.arrange(Pois1, Pois2, Pois3, ncol = 1)
Continuous Random Variables
Continuous Random Variables
- A continuous random variable can take on an uncountably infinite number of values.
- With continuous random variables, we define probabilities using probability density functions (pdfs).
- Probabilities are calculated by computing the area under the density curve over the interval of interest. So, given a pdf, \(f(y)\), we can compute
. . .
\[\begin{align*} P(a \le Y \le b) = \int_a^b f(y)dy. \end{align*}\]
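- A minimal sketch in R, using the familiar standard normal density as an example (any pdf could be substituted): numerical integration gives the area under the curve between two values.

integrate(dnorm, lower = -1.96, upper = 1.96)  # approximately 0.95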
Continuous Random Variables
A few properties of continuous random variables:
- The area under their density curve is 1. (\(\int_{-\infty}^{\infty} f(y)dy = 1\)).
- For any value \(y\), \(P(Y = y) = \int_y^y f(y)dy = 0\). Why?
- Because of the above property, \(P(y < Y) = P(y \le Y)\). We will typically use the first notation rather than the second, but both are equally valid.
Exponential Random Variable
- Suppose we have a Poisson process with rate \(\lambda\), and we wish to model the wait time \(Y\) until the first event.
- We could model \(Y\) using an exponential distribution, where
. . .
\[\begin{equation} f(y) = \lambda e^{-\lambda y} \quad \textrm{for} \quad y > 0, \end{equation}\]
- \(\operatorname{E}(Y) = 1/\lambda\) and \(\operatorname{SD}(Y) = 1/\lambda\).
Exponential Distribution
Exponential distributions with \(\lambda = 0.5, 1,\) and \(5\).
- As \(\lambda\) increases, \(\operatorname{E}(Y)\) tends towards 0, and distributions “die off” quicker.
#exponentialPlots
x <- seq(0, 4, by = 0.01) # possible values
probex1 <- dexp(x, .5)    # density values f(y)
probex2 <- dexp(x, 1)
probex3 <- dexp(x, 5)
Expdf <- tibble(x, probex1, probex2, probex3) %>%
  rename(x = x,
         `0.5` = probex1,
         `1` = probex2,
         `5` = probex3) %>%
  gather(2:4, key = "Lambda", value = "value") %>%
  mutate(Lambda = factor(Lambda, levels = c("0.5", "1", "5")))
ggplot(data = Expdf, aes(x = x, y = value, color = Lambda)) +
  geom_line(aes(linetype = Lambda)) +
  xlab("values") + ylab("density") +
  labs(title = "Exponential Distributions") +
  xlim(0, 4) + ylim(0, 3)
- To use R,
pexp(y, lambda)
outputs the probability \(P(Y < y)\) given \(\lambda\).
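- For instance (illustrative values), with rate \(\lambda = 0.5\), \(P(Y < 1) = 1 - e^{-0.5} \approx 0.393\):

pexp(1, rate = 0.5)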
Gamma Random Variable
- Once again consider a Poisson process.
- When discussing exponential random variables, we modeled the wait time before one event occurred.
- If \(Y\) represents the wait time before \(r\) events occur in a Poisson process with rate \(\lambda\), \(Y\) follows a gamma distribution where
. . .
\[\begin{equation} f(y) = \frac{\lambda^r}{\Gamma(r)} y^{r-1} e^{-\lambda y}\quad \textrm{for} \quad y >0. \end{equation}\]
- If \(Y \sim \textrm{Gamma}(r, \lambda)\) then \(\operatorname{E}(Y) = r/\lambda\) and \(\operatorname{SD}(Y) = \sqrt{r/\lambda^2}\).
- Note: for positive integer \(r\), \(\Gamma(r) = (r-1)!\); more generally, the gamma function \(\Gamma\) extends the factorial to non-integer values.
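- For example, R's gamma() function confirms \(\Gamma(5) = 4! = 24\):

gamma(5)      # 24
factorial(4)  # 24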
Gamma Distribution
- Means increase as \(r\) increases, but decrease as \(\lambda\) increases.
x <- seq(0, 7, by = 0.01)
`r = 1, lambda = 1` <- dgamma(x, 1, rate = 1)
`r = 2, lambda = 1` <- dgamma(x, 2, rate = 1)
`r = 5, lambda = 5` <- dgamma(x, 5, rate = 5)
`r = 5, lambda = 7` <- dgamma(x, 5, rate = 7)
gammaDf <- tibble(x, `r = 1, lambda = 1`, `r = 2, lambda = 1`,
                  `r = 5, lambda = 5`, `r = 5, lambda = 7`) %>%
  gather(2:5, key = "Distribution", value = "value") %>%
  mutate(Distribution = factor(Distribution,
                               levels = c("r = 2, lambda = 1",
                                          "r = 1, lambda = 1",
                                          "r = 5, lambda = 5",
                                          "r = 5, lambda = 7")))
ggplot(data = gammaDf, aes(x = x, y = value,
                           color = Distribution)) +
  geom_line(aes(linetype = Distribution)) +
  xlab("values") + ylab("density") +
  labs(title = "Gamma Distributions") +
  theme(legend.title = element_blank())
Gamma vs Others
- Note that if we let \(r = 1\), we have the following pdf,
. . .
\[\begin{align*} f(y) &= \frac{\lambda}{\Gamma(1)} y^{1-1} e^{-\lambda y} \\ &= \lambda e^{-\lambda y} \quad \textrm{for} \quad y > 0, \end{align*}\]
an exponential distribution.
- Just as the geometric distribution is a special case of the negative binomial, exponential distributions are in fact a special case of gamma distributions!
- Just like the negative binomial, the pdf of a gamma distribution is defined for all positive real values of \(r\).
- In R,
pgamma(y, r, lambda)
outputs the probability \(P(Y < y)\) given \(r\) and \(\lambda\).
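- Two quick checks in R (illustrative values, a minimal sketch): the gamma pdf with \(r = 1\) matches the exponential pdf, and pgamma gives cumulative probabilities.

dgamma(2, shape = 1, rate = 0.5)  # gamma with r = 1
dexp(2, rate = 0.5)               # exponential with lambda = 0.5; same value
pgamma(2, shape = 3, rate = 2)    # P(Y < 2) for Y ~ Gamma(r = 3, lambda = 2), about 0.76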
Distributions Used in Testing
- We have spent most of this chapter discussing probability distributions that may come in handy when modeling.
- The following distributions, while rarely used in modeling, prove useful in hypothesis testing as certain commonly used test statistics follow these distributions.
- \(\chi^2\) distribution (requires one degrees-of-freedom parameter)
- Student \(t\) distribution
- \(F\) distribution (requires two different degrees-of-freedom parameters)
- Since we have used these in the past, we will leave their definitions to be referenced as needed.
Distribution Table!
Would not fit on a slide
Click HERE
The “web” of distributions: https://www.acsu.buffalo.edu/~adamcunn/probability/poisson.html
Acknowledgements
These slides are based on content in BMLR: Chapter 3 - Distribution Theory