Cornell College
STA 363 Spring 2025 Block 8
In statistics, a generalized linear model (GLM) is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.
Logistic regression
\[\begin{aligned}\pi = P(y = 1 | x) \hspace{2mm} &\Rightarrow \hspace{2mm} \text{Link function: } \log\big(\frac{\pi}{1-\pi}\big) \\ &\Rightarrow \log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1~x\end{aligned}\]
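As a minimal sketch of how the logistic regression above is fit in R with `glm()` (the names y, x, and dat are placeholders, not objects from these slides):

```r
# Sketch: logistic regression via glm() with the logit link;
# y, x, and dat are placeholder names, not from these slides
fit <- glm(y ~ x, data = dat, family = binomial(link = "logit"))
summary(fit)   # estimates of beta_0 and beta_1 on the log-odds scale
```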
Generalized Linear Models (Ch 1 - 6)
Modeling correlated data (Ch 7 - 9)
More Regression Models (ITSL Chapter 7)
- Polynomial Regression
- Regression Splines
- Smoothing Splines
- Generalized Additive Models (GAMs)
Reporter will share list with the class
Pre-reqs
Background knowledge
Lectures
Attendance is expected (if you are healthy!)
Beyond Multiple Linear Regression by Paul Roback and Julie Legler
The secondary text is An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani; it is freely available online. We will use Chapter 7.
Readings
Homework
- Primarily from Beyond Multiple Linear Regression
- Individual assignments
- Work together, but you must complete your own work. Discuss but don't copy.
Mini-projects
Examples:
Two exams this block; dates TBD.
Each will have two components
Final grades will be calculated as follows
Category | Points |
---|---|
Homework | 200 |
Participation | 100 |
Labs and Mini Projects | 300 |
Exams | 400 |
Total | 1000 |
See Syllabus on website for letter grade thresholds.
What are the assumptions for linear regression?
Linearity: Linear relationship between mean response and predictor variable(s)
Independence: Residuals are independent. There is no connection between how far any two points lie above or below regression line.
Normality: Response follows a normal distribution at each level of the predictor (or combination of predictors)
Equal variance: Variability (variance or standard deviation) of the response is equal for all levels of the predictor (or combination of predictors)
Use residual plots to check that the conditions hold before using the model for statistical inference.
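A minimal sketch of the usual residual checks in R, assuming a fitted lm object (fit, y, x, and dat are placeholder names, not from these slides):

```r
# Sketch: standard residual diagnostics for a fitted linear model;
# fit, y, x, and dat are placeholder names
fit <- lm(y ~ x, data = dat)
par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2 x 2 grid
plot(fit)              # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
```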
Linearity: Linear relationship between mean of the response \(Y\) and the predictor \(X\)
Independence: No connection between how any two points lie above or below the regression line
Normality: Response, \(Y\), follows a normal distribution at each level of the predictor, \(X\)
Equal variance: Variance (or standard deviation) of the response, \(Y\), is equal for all levels of the predictor, \(X\)
How do we assess these conditions?
When we use linear least squares regression to draw conclusions, we do so under the assumption that L.I.N.E. are all met.
Generalized linear models require different assumptions and can accommodate violations in L.I.N.E.
But the independence assumption must hold!
Today’s data is from the Kentucky Derby, an annual 1.25-mile horse race held at the Churchill Downs race track in Louisville, KY. The data is in the file derbyplus.csv and contains information for races 1896 - 2017.
Response variable
- speed: Average speed of the winner in feet per second (ft/s)

Additional variable
- winner: Winning horse

Predictor variables
- year: Year of the race
- condition: Condition of the track (good, fast, slow)
- starters: Number of horses who raced

Once you're ready for the statistical analysis, the first step should always be exploratory data analysis (EDA).
The EDA will help you understand each variable's distribution and identify relationships between the response and the predictors.
The EDA is exploratory; formal modeling and statistical inference should be used to draw conclusions.
# Assumes the tidyverse and patchwork packages, and that derbyplus.csv
# is in the working directory
library(tidyverse)
library(patchwork)
derby <- read_csv("derbyplus.csv")

p1 <- ggplot(data = derby, aes(x = speed)) +
geom_histogram(fill = "forestgreen", color = "black") +
labs(x = "Winning speed (ft/s)", y = "Count")
p2 <- ggplot(data = derby, aes(x = starters)) +
geom_histogram(fill = "forestgreen", color = "black") +
labs(x = "Starters", y = "Count")
p3 <- ggplot(data = derby, aes(x = condition)) +
  geom_bar(fill = "forestgreen", color = "black") +
  labs(x = "Condition", y = "Count")
p1 + (p2 / p3) +
plot_annotation(title = "Univariate data analysis")
p4 <- ggplot(data = derby, aes(x = starters, y = speed)) +
geom_point() +
labs(x = "Starters", y = "Speed (ft / s)")
p5 <- ggplot(data = derby, aes(x = year, y = speed)) +
geom_point() +
labs(x = "Year", y = "Speed (ft / s)")
p6 <- ggplot(data = derby, aes(x = condition, y = speed)) +
geom_boxplot(fill = "forestgreen", color = "black") +
labs(x = "Conditions", y = "Speed (ft / s)")
(p4 + p5) + p6 +
plot_annotation(title = "Bivariate data analysis")
A scatterplot matrix helps quickly visualize relationships between many pairs of variables; it is particularly useful for identifying potentially correlated predictors.
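One way to build a scatterplot matrix in R is with the GGally package (a sketch, not code from the original slides):

```r
# Sketch: scatterplot matrix of the Derby variables using GGally
library(GGally)
ggpairs(derby, columns = c("speed", "starters", "year", "condition"))
```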
Plot the relationship between the response and a predictor based on levels of another predictor to assess potential interactions.
#library(viridis)
ggplot(data = derby, aes(x = year, y = speed, color = condition,
shape = condition, linetype = condition)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, aes(linetype = condition)) +
labs(x = "Year", y = "Speed (ft/s)", color = "Condition",
title = "Speed vs. year",
subtitle = "by track condition") +
guides(linetype = "none", shape = "none") +
scale_color_viridis_d(end = 0.9)
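The main-effects model summarized in the table below could be fit with a sketch like this (the object name model1 is an assumption; the slides' later code uses model1Cent for the centered version):

```r
# Sketch: main-effects fit behind the coefficient table below;
# the object name model1 is assumed, not shown in the slides
model1 <- lm(speed ~ starters + year + condition, data = derby)
summary(model1)
```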
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 8.197 | 4.508 | 1.818 | 0.072 |
starters | -0.005 | 0.017 | -0.299 | 0.766 |
year | 0.023 | 0.002 | 9.766 | 0.000 |
conditiongood | -0.443 | 0.231 | -1.921 | 0.057 |
conditionslow | -1.543 | 0.161 | -9.616 | 0.000 |
\[\widehat{speed} = 8.197 - 0.005 ~ starters + 0.023 ~ year - 0.443 ~ good - 1.543 ~ slow\]
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 8.197 | 4.508 | 1.818 | 0.072 |
starters | -0.005 | 0.017 | -0.299 | 0.766 |
year | 0.023 | 0.002 | 9.766 | 0.000 |
conditiongood | -0.443 | 0.231 | -1.921 | 0.057 |
conditionslow | -1.543 | 0.161 | -9.616 | 0.000 |
Interpret the coefficients for starters and conditiongood.

Centering: Subtract a constant from each observation of a given variable
Do this to make interpretation of model parameters more meaningful (particularly intercept)
In STA 202, we used mean-centering, where we subtracted the mean from each observation of a given variable
How does centering change the model?
Center year by subtracting 1896, the first year in the data, to create the variable yearnew.
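A minimal sketch of this step in R (assuming the derby data frame and the tidyverse loaded earlier):

```r
# Center year at 1896, the first Derby year in the data,
# creating the yearnew variable used in the table below
derby <- derby |>
  mutate(yearnew = year - 1896)
```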
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 52.175 | 0.194 | 269.079 | 0.000 |
starters | -0.005 | 0.017 | -0.299 | 0.766 |
yearnew | 0.023 | 0.002 | 9.766 | 0.000 |
conditiongood | -0.443 | 0.231 | -1.921 | 0.057 |
conditionslow | -1.543 | 0.161 | -9.616 | 0.000 |
\[\widehat{speed} = 52.175 - 0.005 ~ starters + 0.023 ~ yearnew - 0.443 ~ good - 1.543 ~ slow\]
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 51.4130 | 0.1826 | 281.5645 | 0.0000 |
starters | -0.0253 | 0.0136 | -1.8588 | 0.0656 |
yearnew | 0.0700 | 0.0061 | 11.4239 | 0.0000 |
I(yearnew^2) | -0.0004 | 0.0000 | -8.0411 | 0.0000 |
conditiongood | -0.4770 | 0.1857 | -2.5689 | 0.0115 |
conditionslow | -1.3927 | 0.1305 | -10.6701 | 0.0000 |
\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 ~ x_1 + \hat{\beta}_2 ~ x_2 + \hat{\beta}_3 ~ x_2^2\]
General interpretation: When \(x_2\) increases from a to b, \(y\) is expected to change by \(\hat{\beta}_2(b - a) + \hat{\beta}_3(b^2 - a^2)\), holding \(x_1\) constant.
\[\begin{aligned}\widehat{speed} = &51.413 - 0.025 ~ starters + 0.070 ~ yearnew \\ & - 0.0004 ~ yearnew^2 - 0.477 ~ good - 1.393 ~ slow\end{aligned}\]
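A sketch of how the quadratic model above could be fit (quad_model is a placeholder name, not from the slides):

```r
# Sketch: quadratic fit summarized in the table above;
# quad_model is a placeholder name
quad_model <- lm(speed ~ starters + yearnew + I(yearnew^2) + condition,
                 data = derby)
summary(quad_model)
```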
Questions:
Interpret the effect of year for the 5 most recent years (2013 - 2017).
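As a sketch of the computation, using the rounded coefficients above: 2013 and 2017 correspond to \(yearnew = 117\) and \(yearnew = 121\), so the expected change in winning speed, holding starters and condition constant, is

\[0.070\,(121 - 117) - 0.0004\,(121^2 - 117^2) \approx 0.28 - 0.38 = -0.10 \text{ ft/s}\]

That is, the predicted winning speed in 2017 is about 0.10 ft/s lower than in 2013, all else equal.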
Recall from the EDA…
library(viridis)
ggplot(data = derby, aes(x = year, y = speed, color = condition,
shape = condition, linetype = condition)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, aes(linetype = condition)) +
labs(x = "Year", y = "Speed (ft/s)", color = "Condition",
title = "Speed vs. year",
subtitle = "by track condition") +
guides(linetype = "none", shape = "none") +
scale_color_viridis_d(end = 0.9)
\[\begin{aligned}\widehat{speed} = & 52.387 - 0.003 ~ starters + 0.020 ~ yearnew - 1.070 ~ good - 2.183 ~ slow \\ &+0.012 ~ yearnew \times good + 0.012 ~ yearnew \times slow \end{aligned}\]
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 52.387 | 0.200 | 262.350 | 0.000 |
starters | -0.003 | 0.016 | -0.189 | 0.850 |
yearnew | 0.020 | 0.003 | 7.576 | 0.000 |
conditiongood | -1.070 | 0.423 | -2.527 | 0.013 |
conditionslow | -2.183 | 0.270 | -8.097 | 0.000 |
yearnew:conditiongood | 0.012 | 0.008 | 1.598 | 0.113 |
yearnew:conditionslow | 0.012 | 0.004 | 2.866 | 0.005 |
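A sketch of how the interaction model above could be fit (int_model is a placeholder name, not from the slides):

```r
# Sketch: interaction fit summarized in the table above;
# int_model is a placeholder name
int_model <- lm(speed ~ starters + yearnew + condition + yearnew:condition,
                data = derby)
summary(int_model)
```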
Model1:

r.squared | adj.r.squared | AIC | BIC |
---|---|---|---|
0.73 | 0.721 | 259.478 | 276.302 |

Model2:

r.squared | adj.r.squared | AIC | BIC |
---|---|---|---|
0.827 | 0.819 | 207.429 | 227.057 |

Model3:

r.squared | adj.r.squared | AIC | BIC |
---|---|---|---|
0.751 | 0.738 | 253.584 | 276.016 |
How do we define \(R^2\)?
What is adjusted \(R^2\) (adj.r.squared in the output)? See the definitions below.
\(\color{#4187aa}{AIC}\): Likelihood-based approach balancing model performance and complexity
\(\color{#4187aa}{BIC}\): Similar to AIC with stronger penalty for extra terms
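For reference, the standard definitions (not from the slides): writing SSE for the sum of squared residuals, SST for the total sum of squares, \(\hat{L}\) for the maximized likelihood, and \(k\) for the number of estimated parameters,

\[R^2 = 1 - \frac{SSE}{SST}, \qquad R^2_{adj} = 1 - \frac{SSE/(n - p - 1)}{SST/(n - 1)}\]

\[AIC = 2k - 2\log\hat{L}, \qquad BIC = k\log(n) - 2\log\hat{L}\]

Lower AIC and BIC indicate a better trade-off between fit and complexity.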
Use the glance() function from the broom package to get model statistics.
model | r.squared | adj.r.squared | AIC | BIC |
---|---|---|---|---|
Model1 | 0.730 | 0.721 | 259.478 | 276.302 |
Model2 | 0.827 | 0.819 | 207.429 | 227.057 |
Model3 | 0.751 | 0.738 | 253.584 | 276.016 |
library(broom)   # glance() for model fit statistics
library(knitr)   # kable() for formatted tables

model1_glance <- glance(model1Cent) |>
select(r.squared, adj.r.squared, AIC, BIC)
model2_glance <- glance(model2) |>
select(r.squared, adj.r.squared, AIC, BIC)
model3_glance <- glance(model3) |>
select(r.squared, adj.r.squared, AIC, BIC)
model1_glance |>
bind_rows(model2_glance) |>
bind_rows(model3_glance) |>
bind_cols(model = c("Model1", "Model2", "Model3")) |>
select(model, everything()) |>
kable(digits = 3)
Use statistical inference to test hypotheses about the model coefficients and to quantify uncertainty in estimates and predictions.
If L.I.N.E. assumptions are met, we can conduct inference using the \(t\) distribution and estimated standard errors
Use least squares regression to get the estimates \(\hat{\beta}_0\), \(\hat{\beta}_1\), and \(\hat{\sigma}^2\)
\(\hat{\sigma}\) is the regression standard error
\[\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^n(y_i - \hat{y}_i)^2}{n - p - 1}} = \sqrt{\frac{\sum_{i=1}^n e_i^2}{n-p-1}}\]
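In R, \(\hat{\sigma}\) can be read directly off a fitted model; a sketch using the model1Cent object from the code above:

```r
# Sketch: pull sigma-hat from a fitted model, here model1Cent from the code above
summary(model1Cent)$sigma
glance(model1Cent)$sigma   # same value via broom's glance()
```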
These slides are based on content in BMLR: Chapter 1 - Review of Multiple Linear Regression
Initial versions of the slides are by Dr. Maria Tackett, Duke University