library(tidyverse)
library(tidymodels)
library(GGally)
library(knitr)
library(patchwork)
library(viridis)
library(ggfortify)
Welcome and Chapter 1
Welcome!
Instructor
- Dr. Tyler George: tgeorge@cornellcollege.edu
Course logistics
- Course Dates: August 26th to September 18th
- Course sessions: By Appointment
- Exam Dates: TBD
Generalized Linear Models
In statistics, a generalized linear model (GLM) is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.
. . .
Logistic regression
\[\begin{aligned}\pi = P(y = 1 | x) \hspace{2mm} &\Rightarrow \hspace{2mm} \text{Link function: } \log\big(\frac{\pi}{1-\pi}\big) \\ &\Rightarrow \log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1~x\end{aligned}\]
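In R, this model is fit with glm() and a binomial family. A minimal sketch, assuming a hypothetical data frame df with a 0/1 response y and a numeric predictor x:

# Logistic regression as a GLM: the logit link relates P(y = 1 | x) to the linear predictor
fit <- glm(y ~ x, data = df, family = binomial(link = "logit"))
summary(fit)  # beta_0 and beta_1 are estimated on the log-odds scale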
What we’re covering this semester (1/3)
Generalized Linear Models (Ch 1 - 6)
- Introduce models for non-normal response variables
- Estimation, interpretation, and inference
- Mathematical details showing how GLMs are connected
What we’re covering this semester (2/3)
Modeling correlated data (Ch 7 - 9)
- Introduce multilevel models for correlated and longitudinal data
- Estimation, interpretation, and inference
- Mathematical details, particularly diving into covariance structures
What we’re covering this semester (3/3)
More Regression Models (ITSL Chapter 7)
- Polynomial Regression
- Regression Splines
- Smoothing Splines
- Generalized Additive Models (GAMs)
Meet your classmates!
- Create larger groups
- Quick introductions - Name, year, and major
- Choose a reporter
- Need help choosing? Person with birthday closest to December 1st.
- Identify 8 things everyone in the group has in common
- Not being a Cornell Student
- Not clothes (we’re all wearing socks)
- Not body parts (we all have a nose)
. . .
Reporter will share list with the class
What background is assumed for the course?
. . .
Pre-reqs
- STA 201, 202 and DSC 223
. . .
Background knowledge
- Statistical content
- Linear and logistic regression
- Statistical inference
- Basic understanding of random variables
- Computing
- Using R for data analysis
- Writing reports using R Markdown or Quarto
Course Toolkit (1/2)
- Website
- https://stats-tgeorge.github.io/STA363_AdvReg/
- Central hub for the course
- Notes
- Labs
- Datasets
Course Toolkit (2/2)
- Moodle:
- https://moodle.cornellcollege.edu/course/view.php?id=7908
- Submissions
- Gradebook
- Announcements
Class Meetings
Lectures
- Some traditional lecture
- Individual and group labs
- Bring fully-charged laptop
- Mini-projects
- Exams
. . .
Attendance is expected (if you are healthy!)
Textbook
Beyond Multiple Linear Regression by Paul Roback and Julie Legler
- Available online
- Hard copies available for purchase
Textbook 2
The secondary text is An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani; it is freely available online. We will use Chapter 7.
- Hard copies available for purchase
Using R / RStudio
- RStudio Server is installed and should be used
- http://turing.cornellcollege.edu:8787/
Activities & Assessments
Readings
- Primarily from Beyond Multiple Linear Regression
- Recommend reading assigned text before lecture
. . .
Homework
- Primarily from Beyond Multiple Linear Regression
- Individual assignments
- You may work together, but you must complete your own work. Discuss, but don’t copy.
Activities & Assessments
Mini-projects
Examples:
- Mini-project 01: Focused on models for non-normal response variables, such as count data
- Mini-project 02: Focused on models for correlated data
. . .
- Short write up and short presentation
- Team-based
Exams
Two exams this block; dates TBD.
Each will have two components
- Component 1 will be given on the exam dates, and you will have a choice of oral or written format.
- Component 2 will be a take-home, open-book, open-note, exam.
- You will have 12 hours or more to complete this component.
Grading
Final grades will be calculated as follows
Category | Points |
---|---|
Homework | 200 |
Participation | 100 |
Labs and Mini Projects | 300 |
Exams | 400 |
Total | 1000 |
See Syllabus on website for letter grade thresholds.
Resources
- Office hours to meet with your instructor in West 311
- Typically MWTh 3:05pm-4:05pm and by appt.
- Double check course calendar
- Make appointments by going to https://calendar.app.google/oHqiU2FZDJLXKq429
- Email Tyler George for private questions regarding personal matters or grades.
- Please put STA 363 in the subject line since I am also teaching capstone this semester
- College support at https://stats-tgeorge.github.io/personal_website/course-support.html.
Chapter 1 - MLR Review
Setup - R Packages
Assumptions for linear regression
What are the assumptions for linear regression? . . .
Linearity: Linear relationship between mean response and predictor variable(s)
. . .
Independence: Residuals are independent. There is no connection between how far any two points lie above or below the regression line.
. . .
Normality: Response follows a normal distribution at each level of the predictor (or combination of predictors)
. . .
Equal variance: Variability (variance or standard deviation) of the response is equal for all levels of the predictor (or combination of predictors)
. . .
Use residual plots to check that the conditions hold before using the model for statistical inference.
Assumptions for linear regression
Modified from Figure 1.1 in BMLR
Linearity: Linear relationship between mean of the response \(Y\) and the predictor \(X\)
Independence: No connection between how any two points lie above or below the regression line
Normality: Response, \(Y\), follows a normal distribution at each level of the predictor, \(X\) (indicated by red curves)
Equal variance: Variance (or standard deviation) of the response, \(Y\), is equal for all levels of the predictor, \(X\)
Questions
How do we assess these conditions?
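One common approach, used later in these slides, is to examine residual diagnostic plots for a fitted model. A minimal sketch, assuming a fitted lm object named fit (hypothetical):

# ggfortify's autoplot() on an lm object produces residuals vs. fitted,
# normal Q-Q, scale-location, and residuals vs. leverage plots
autoplot(fit)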
Beyond linear regression
When we use linear least squares regression to draw conclusions, we do so under the assumption that the L.I.N.E. conditions are all met.
Generalized linear models require different assumptions and can accommodate violations in L.I.N.E.
- Relationship between response and predictor(s) can be nonlinear
- Response variable can be non-normal
- Variance in response can differ at each level of predictor(s)
. . .
But the independence assumption must hold!
- Multilevel models will be used for data with correlated observations
Review of multiple linear regression
Data: Kentucky Derby Winners
Today’s data is from the Kentucky Derby, an annual 1.25-mile horse race held at the Churchill Downs race track in Louisville, KY. The data is in the file derbyplus.csv and contains information for races 1896 - 2017.
. . .
Response variable

speed
: Average speed of the winner in feet per second (ft/s)

Additional variable

winner
: Winning horse

Predictor variables

year
: Year of the race

condition
: Condition of the track (good, fast, slow)

starters
: Number of horses who raced
Data
derby <- read_csv("data/derbyplus.csv")

derby |>
  head(5) |>
  kable()
year | winner | condition | speed | starters |
---|---|---|---|---|
1896 | Ben Brush | good | 51.66 | 8 |
1897 | Typhoon II | slow | 49.81 | 6 |
1898 | Plaudit | good | 51.16 | 4 |
1899 | Manuel | fast | 50.00 | 5 |
1900 | Lieut. Gibson | fast | 52.28 | 7 |
Data Analysis Life Cycle
Exploratory data analysis (EDA)
Once you’re ready for the statistical analysis (the “explore” stage of the life cycle), the first step should always be exploratory data analysis.
The EDA will help you
- begin to understand the variables and observations
- identify outliers or potential data entry errors
- begin to see relationships between variables
- identify an appropriate model and a modeling strategy
The EDA is exploratory; formal modeling and statistical inference should be used to draw conclusions.
Plots for univariate EDA
p1 <- ggplot(data = derby, aes(x = speed)) +
  geom_histogram(fill = "forestgreen", color = "black") +
  labs(x = "Winning speed (ft/s)", y = "Count")

p2 <- ggplot(data = derby, aes(x = starters)) +
  geom_histogram(fill = "forestgreen", color = "black") +
  labs(x = "Starters", y = "Count")

p3 <- ggplot(data = derby, aes(x = condition)) +
  geom_bar(fill = "forestgreen", color = "black") +
  labs(x = "Condition", y = "Count")

p1 + (p2 / p3) +
  plot_annotation(title = "Univariate data analysis")
Plots for bivariate EDA
p4 <- ggplot(data = derby, aes(x = starters, y = speed)) +
  geom_point() +
  labs(x = "Starters", y = "Speed (ft / s)")

p5 <- ggplot(data = derby, aes(x = year, y = speed)) +
  geom_point() +
  labs(x = "Year", y = "Speed (ft / s)")

p6 <- ggplot(data = derby, aes(x = condition, y = speed)) +
  geom_boxplot(fill = "forestgreen", color = "black") +
  labs(x = "Conditions", y = "Speed (ft / s)")

(p4 + p5) + p6 +
  plot_annotation(title = "Bivariate data analysis")
Scatterplot matrix
A scatterplot matrix helps quickly visualize relationships between many pairs of variables. It is particularly useful for identifying potentially correlated predictors.
. . .
#library(GGally)
ggpairs(data = derby,
columns = c("condition", "year", "starters", "speed"))
Plots for multivariate EDA
Plot the relationship between the response and a predictor at each level of another predictor to assess potential interactions.
#library(viridis)
ggplot(data = derby, aes(x = year, y = speed, color = condition,
shape = condition, linetype = condition)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, aes(linetype = condition)) +
labs(x = "Year", y = "Speed (ft/s)", color = "Condition",
title = "Speed vs. year",
subtitle = "by track condition") +
guides(linetype = "none", shape = "none") +
scale_color_viridis_d(end = 0.9)
Model 1: Main effects model
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 8.197 | 4.508 | 1.818 | 0.072 |
starters | -0.005 | 0.017 | -0.299 | 0.766 |
year | 0.023 | 0.002 | 9.766 | 0.000 |
conditiongood | -0.443 | 0.231 | -1.921 | 0.057 |
conditionslow | -1.543 | 0.161 | -9.616 | 0.000 |
# Fit and display model
model1 <- lm(speed ~ starters + year + condition, data = derby)

tidy(model1) |>
  kable(digits = 3)
Interpretation
\[\widehat{speed} = 8.197 - 0.005 ~ starters + 0.023 ~ year - 0.443 ~ good - 1.543 ~ slow\]
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 8.197 | 4.508 | 1.818 | 0.072 |
starters | -0.005 | 0.017 | -0.299 | 0.766 |
year | 0.023 | 0.002 | 9.766 | 0.000 |
conditiongood | -0.443 | 0.231 | -1.921 | 0.057 |
conditionslow | -1.543 | 0.161 | -9.616 | 0.000 |
. . .
- Write out the interpretations for starters and conditiongood.
- Does the intercept have a meaningful interpretation?
Centering
Centering: Subtract a constant from each observation of a given variable
Do this to make interpretation of model parameters more meaningful (particularly intercept)
In STA 202, we used mean-centering, where we subtracted the mean from each observation of a given variable
How does centering change the model?
Centering year
derby <- derby |>
  mutate(yearnew = year - 1896) # 1896 = starting year
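The refit with the centered year is not shown above; a minimal sketch, assuming the fitted object is named model1Cent (the name used for the diagnostics and comparisons below):

# Refit Model 1 using the centered year
model1Cent <- lm(speed ~ starters + yearnew + condition, data = derby)
tidy(model1Cent) |>
  kable(digits = 3)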
. . .
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 52.175 | 0.194 | 269.079 | 0.000 |
starters | -0.005 | 0.017 | -0.299 | 0.766 |
yearnew | 0.023 | 0.002 | 9.766 | 0.000 |
conditiongood | -0.443 | 0.231 | -1.921 | 0.057 |
conditionslow | -1.543 | 0.161 | -9.616 | 0.000 |
. . .
\[\widehat{speed} = 52.175 - 0.005 ~ starters + 0.023 ~ yearnew - 0.443 ~ good - 1.543 ~ slow\]
Model 1: Check model assumptions
#library(ggfortify)
autoplot(model1Cent)
Which of the model assumptions (L.I.N.E.) does this pass and/or fail?
Model 2: Add quadratic effect for year?
ggplot(data = derby, aes(x = yearnew, y = speed)) +
geom_point(alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
geom_smooth(se = FALSE, color = "red", linetype = 2) +
labs(x = "Years since 1896", y = "Speed (ft/s)",
title = "Speed vs. Years since 1896")
Model 2: Add \(yearnew^2\)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 51.4130 | 0.1826 | 281.5645 | 0.0000 |
starters | -0.0253 | 0.0136 | -1.8588 | 0.0656 |
yearnew | 0.0700 | 0.0061 | 11.4239 | 0.0000 |
I(yearnew^2) | -0.0004 | 0.0000 | -8.0411 | 0.0000 |
conditiongood | -0.4770 | 0.1857 | -2.5689 | 0.0115 |
conditionslow | -1.3927 | 0.1305 | -10.6701 | 0.0000 |
model2 <- lm(speed ~ starters + yearnew + I(yearnew^2) + condition,
             data = derby)

tidy(model2) |> kable(digits = 4)
Interpreting quadratic effects
\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 ~ x_1 + \hat{\beta}_2 ~ x_2 + \hat{\beta}_3 ~ x_2^2\]
General interpretation: When \(x_2\) increases from \(a\) to \(b\), \(y\) is expected to change by \(\hat{\beta}_2(b - a) + \hat{\beta}_3(b^2 - a^2)\), holding \(x_1\) constant.
. . .
Interpreting quadratic effects
\[\begin{aligned}\widehat{speed} = &51.413 - 0.025 ~ starters + 0.070 ~ yearnew \\ & - 0.0004 ~ yearnew^2 - 0.477 ~ good - 1.393 ~ slow\end{aligned}\]
. . .
Questions:
Interpret the effect of year for the 5 most recent years (2013 - 2017).
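A minimal sketch of that computation, using the rounded coefficient estimates above with \(a = 2013\) and \(b = 2017\) on the yearnew scale:

b_yr  <- 0.070    # coefficient on yearnew (rounded)
b_yr2 <- -0.0004  # coefficient on yearnew^2 (rounded)
a <- 2013 - 1896  # yearnew = 117
b <- 2017 - 1896  # yearnew = 121
b_yr * (b - a) + b_yr2 * (b^2 - a^2)  # expected change in speed (ft/s), other predictors held constant

With these rounded estimates, the expected change over 2013 - 2017 is roughly -0.10 ft/s.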
Model 2: Check model assumptions
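As with Model 1, the assumptions can be checked with residual diagnostic plots; a minimal sketch:

autoplot(model2)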
Model 3: Include interaction term?
Recall from the EDA…
library(viridis)
ggplot(data = derby, aes(x = year, y = speed, color = condition,
shape = condition, linetype = condition)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, aes(linetype = condition)) +
labs(x = "Year", y = "Speed (ft/s)", color = "Condition",
title = "Speed vs. year",
subtitle = "by track condition") +
guides(linetype = "none", shape = "none") +
scale_color_viridis_d(end = 0.9)
Model 3: Add interaction term
\[\begin{aligned}\widehat{speed} = & 52.387 - 0.003 ~ starters + 0.020 ~ yearnew - 1.070 ~ good - 2.183 ~ slow \\ &+0.012 ~ yearnew \times good + 0.012 ~ yearnew \times slow \end{aligned}\]
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 52.387 | 0.200 | 262.350 | 0.000 |
starters | -0.003 | 0.016 | -0.189 | 0.850 |
yearnew | 0.020 | 0.003 | 7.576 | 0.000 |
conditiongood | -1.070 | 0.423 | -2.527 | 0.013 |
conditionslow | -2.183 | 0.270 | -8.097 | 0.000 |
yearnew:conditiongood | 0.012 | 0.008 | 1.598 | 0.113 |
yearnew:conditionslow | 0.012 | 0.004 | 2.866 | 0.005 |
model3 <- lm(speed ~ starters + yearnew + condition +
               yearnew * condition,
             data = derby)

tidy(model3) |> kable(digits = 4)
Interpreting interaction effects
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 52.387 | 0.200 | 262.350 | 0.000 |
starters | -0.003 | 0.016 | -0.189 | 0.850 |
yearnew | 0.020 | 0.003 | 7.576 | 0.000 |
conditiongood | -1.070 | 0.423 | -2.527 | 0.013 |
conditionslow | -2.183 | 0.270 | -8.097 | 0.000 |
yearnew:conditiongood | 0.012 | 0.008 | 1.598 | 0.113 |
yearnew:conditionslow | 0.012 | 0.004 | 2.866 | 0.005 |
- Write out the interpretation of…
Which model would you choose?
r.squared | adj.r.squared | AIC | BIC |
---|---|---|---|
0.73 | 0.721 | 259.478 | 276.302 |
r.squared | adj.r.squared | AIC | BIC |
---|---|---|---|
0.827 | 0.819 | 207.429 | 227.057 |
r.squared | adj.r.squared | AIC | BIC |
---|---|---|---|
0.751 | 0.738 | 253.584 | 276.016 |
# Model 1
glance(model1Cent) |>
  select(r.squared, adj.r.squared, AIC, BIC) |>
  kable(digits = 3)

# Model 2
glance(model2) |>
  select(r.squared, adj.r.squared, AIC, BIC) |>
  kable(digits = 3)

# Model 3
glance(model3) |>
  select(r.squared, adj.r.squared, AIC, BIC) |>
  kable(digits = 3)
What are these model quality metrics?
How do we define \(R^2\)?
What is adjusted \(R^2\) (adj.r.squared)?
Measures of model performance
- \(\color{#4187aa}{R^2}\): Proportion of variability in the response explained by the model.
- Will always increase as predictors are added, so it shouldn’t be used to compare models with different numbers of predictors
- \(\color{#4187aa}{Adj. R^2}\): Similar to \(R^2\) with a penalty for extra terms
. . .
\(\color{#4187aa}{AIC}\): Likelihood-based approach balancing model performance and complexity
\(\color{#4187aa}{BIC}\): Similar to AIC with stronger penalty for extra terms
. . .
- Nested F test (extra sum of squares F test): Generalization of the t-test for individual coefficients to significance tests comparing nested models (see the sketch below)
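A minimal sketch of the nested F test, comparing the main-effects model to the model that adds the year-by-condition interaction (both fit in these slides):

# Does adding yearnew:condition significantly improve the fit?
anova(model1Cent, model3)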
Which model would you choose?
Use the glance() function to get model statistics.
model | r.squared | adj.r.squared | AIC | BIC |
---|---|---|---|---|
Model1 | 0.730 | 0.721 | 259.478 | 276.302 |
Model2 | 0.827 | 0.819 | 207.429 | 227.057 |
Model3 | 0.751 | 0.738 | 253.584 | 276.016 |
model1_glance <- glance(model1Cent) |>
  select(r.squared, adj.r.squared, AIC, BIC)

model2_glance <- glance(model2) |>
  select(r.squared, adj.r.squared, AIC, BIC)

model3_glance <- glance(model3) |>
  select(r.squared, adj.r.squared, AIC, BIC)

model1_glance |>
  bind_rows(model2_glance) |>
  bind_rows(model3_glance) |>
  bind_cols(model = c("Model1", "Model2", "Model3")) |>
  select(model, everything()) |>
  kable(digits = 3)
Characteristics of a “good” final model
- Model can be used to answer primary research questions
- Predictor variables control for important covariates
- Potential interactions have been investigated
- Variables are centered, as needed, for more meaningful interpretations
- Unnecessary terms are removed
- Assumptions are met and influential points have been addressed
- Model tells a “persuasive story parsimoniously”
Inference for multiple linear regression
Use statistical inference to
- Determine if predictors are statistically significant (not necessarily practically significant!)
- Quantify uncertainty in coefficient estimates
- Quantify uncertainty in model predictions
. . .
If the L.I.N.E. assumptions are met, we can conduct inference using the \(t\) distribution and estimated standard errors
Inference for regression
Use least squares regression to get the estimates \(\hat{\beta}_0\), \(\hat{\beta}_1\), and \(\hat{\sigma}^2\)
\(\hat{\sigma}\) is the regression standard error
. . .
\[\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^n(y_i - \hat{y}_i)^2}{n - p - 1}} = \sqrt{\frac{\sum_{i=1}^n e_i^2}{n-p-1}}\]
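These quantities are easy to pull from a fitted model; a minimal sketch using Model 2:

summary(model2)$sigma            # regression standard error (sigma-hat)
confint(model2, level = 0.95)    # t-based confidence intervals for the coefficients
tidy(model2, conf.int = TRUE) |> # the same intervals in tidy form
  kable(digits = 4)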
Acknowledgements
These slides are based on content in BMLR: Chapter 1 - Review of Multiple Linear Regression
Initial versions of the slides are by Dr. Maria Tackett, Duke University