Welcome and Chapter 1

Tyler George
Cornell College
STA 363 Spring 2025 Block 8

Welcome!

Instructor

Course logistics

  • Course Dates: August 26th to September 18th
  • Course sessions: By Appointment
  • Exam Dates: TBD

Generalized Linear Models

In statistics, a generalized linear model (GLM) is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

. . .

Logistic regression

\[\begin{aligned}\pi = P(y = 1 | x) \hspace{2mm} &\Rightarrow \hspace{2mm} \text{Link function: } \log\big(\frac{\pi}{1-\pi}\big) \\ &\Rightarrow \log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1~x\end{aligned}\]
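In R, this model can be fit with glm() using the binomial family. A minimal sketch, where y (a 0/1 response), x (a numeric predictor), and dat (a data frame) are hypothetical placeholders:

fit <- glm(y ~ x, family = binomial(link = "logit"), data = dat) # logistic regression as a GLM
summary(fit) # estimated coefficients are on the log-odds (logit) scale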

What we’re covering this semester (1/3)

Generalized Linear Models (Ch 1 - 6)

  • Introduce models for non-normal response variables
  • Estimation, interpretation, and inference
  • Mathematical details showing how GLMs are connected

What we’re covering this semester (2/3)

Modeling correlated data (Ch 7 - 9)

  • Introduce multilevel models for correlated and longitudinal data
  • Estimation, interpretation, and inference
  • Mathematical details, particularly diving into covariance structures

What we’re covering this semester (3/3)

More Regression Models (ITSL Chapter 7)

  • Polynomial Regression
  • Regression Splines
  • Smoothing Splines
  • Generalized Additive Models (GAMs)

Meet your classmates!

  • Create larger groups
  • Quick introductions - Name, year, and major
  • Choose a reporter
    • Need help choosing? Person with birthday closest to December 1st.
  • Identify 8 things everyone in the group has in common
    • Not being a Cornell Student
    • Not clothes (we’re all wearing socks)
    • Not body parts (we all have a nose)

. . .

Reporter will share list with the class

What background is assumed for the course?

. . .

Pre-reqs

  • STA 201, 202 and DSC 223

. . .

Background knowledge

  • Statistical content
    • Linear and logistic regression
    • Statistical inference
    • Basic understanding of random variables
  • Computing
    • Using R for data analysis
    • Writing reports using R Markdown or Quarto

Course Toolkit (1/2)

Class Meetings

Lectures

  • Some traditional lecture
  • Individual and group labs
  • Bring fully-charged laptop
  • Mini-projects
  • Exams

. . .

Attendance is expected (if you are healthy!)

Textbook

Beyond Multiple Linear Regression by Paul Roback and Julie Legler

  • Available online
  • Hard copies available for purchase

Textbook 2

The secondary text is An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani; it is freely available online. We will use Chapter 7.

  • Hard copies available for purchase

Using R / RStudio

Activities & Assessments

Readings

  • Primarily from Beyond Multiple Linear Regression
  • Recommend reading assigned text before lecture

. . .

Homework

  • Primarily from Beyond Multiple Linear Regression
  • Individual assignments
  • You may work together, but you must complete your own work. Discuss but don’t copy.

Activities & Assessments

Mini-projects

Examples:

  • Mini-project 01: Focused on models for non-normal response variables, such as count data
  • Mini-project 02: Focused on models for correlated data

. . .

  • Short write up and short presentation
  • Team-based

Exams

  • Two exams this block; dates TBD.

  • Each will have two components:

    • Component 1 will be given in class on the exam dates; you will choose an oral or written format.
    • Component 2 will be a take-home, open-book, open-note exam.
    • You will have at least 12 hours to complete this component.

Grading

Final grades will be calculated as follows

Category                 Points
Homework                    200
Participation               100
Labs and Mini Projects      300
Exams                       400
Total                      1000

See Syllabus on website for letter grade thresholds.

Resources

Chapter 1 - MLR Review

Setup - R Packages

library(tidyverse)
library(tidymodels)
library(GGally)
library(knitr)
library(patchwork)
library(viridis)
library(ggfortify)

Assumptions for linear regression

What are the assumptions for linear regression?

. . .

Linearity: Linear relationship between mean response and predictor variable(s)

. . .

Independence: Residuals are independent. There is no connection between how far any two points lie above or below the regression line.

. . .

Normality: Response follows a normal distribution at each level of the predictor (or combination of predictors)

. . .

Equal variance: Variability (variance or standard deviation) of the response is equal for all levels of the predictor (or combination of predictors)

. . .

Use residual plots to check that the conditions hold before using the model for statistical inference.
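In R, the standard diagnostic plots come directly from a fitted lm object. A minimal sketch, where y, x, and dat are hypothetical placeholders (later slides use ggfortify’s autoplot() for the same purpose):

fit <- lm(y ~ x, data = dat) # hypothetical simple linear regression
plot(fit, which = 1:2)       # residuals vs. fitted values and normal Q-Q plot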

Assumptions for linear regression

Modified from Figure 1.1 in BMLR

  • Linearity: Linear relationship between mean of the response \(Y\) and the predictor \(X\)

  • Independence: No connection between how any two points lie above or below the regression line

  • Normality: Response, \(Y\), follows a normal distribution at each level of the predictor, \(X\) (indicated by red curves)

  • Equal variance: Variance (or standard deviation) of the response, \(Y\), is equal for all levels of the predictor, \(X\)

Questions

How do we assess these conditions?

Beyond linear regression

  • When we use linear least squares regression to draw conclusions, we do so under the assumption that L.I.N.E. are all met.

  • Generalized linear models require different assumptions and can accommodate violations in L.I.N.E.

    • Relationship between response and predictor(s) can be nonlinear
    • Response variable can be non-normal
    • Variance in response can differ at each level of predictor(s)

. . .

But the independence assumption must hold!

  • Multilevel models will be used for data with correlated observations

Review of multiple linear regression

Data: Kentucky Derby Winners

Today’s data is from the Kentucky Derby, an annual 1.25-mile horse race held at the Churchill Downs race track in Louisville, KY. The data is in the file derbyplus.csv and contains information for races 1896 - 2017.

. . .

Response variable

  • speed: Average speed of the winner in feet per second (ft/s)

Additional variable

  • winner: Winning horse

Predictor variables

  • year: Year of the race
  • condition: Condition of the track (good, fast, slow)
  • starters: Number of horses who raced

Data

derby <- read_csv("data/derbyplus.csv")
derby |>
  head(5) |> kable()
year  winner         condition  speed  starters
1896  Ben Brush      good       51.66         8
1897  Typhoon II     slow       49.81         6
1898  Plaudit        good       51.16         4
1899  Manuel         fast       50.00         5
1900  Lieut. Gibson  fast       52.28         7

Data Analysis Life Cycle

Exploratory data analysis (EDA)

  • Once you’re ready for the statistical analysis (the “explore” stage of the data analysis life cycle), the first step should always be exploratory data analysis.

  • The EDA will help you

    • begin to understand the variables and observations
    • identify outliers or potential data entry errors
    • begin to see relationships between variables
    • identify the appropriate model and identify a strategy
  • The EDA is exploratory; formal modeling and statistical inference should be used to draw conclusions.

Plots for univariate EDA

p1 <- ggplot(data = derby, aes(x = speed)) + 
  geom_histogram(fill = "forestgreen", color = "black") + 
  labs(x = "Winning speed (ft/s)", y = "Count")

p2 <- ggplot(data = derby, aes(x = starters)) + 
  geom_histogram(fill = "forestgreen", color = "black") + 
  labs(x = "Starters", y = "Count")

p3 <- ggplot(data = derby, aes(x = condition)) +
  geom_bar(fill = "forestgreen", color = "black") + 
  labs(x = "Condition", y = "Count")

p1 + (p2 / p3) + 
  plot_annotation(title = "Univariate data analysis")

Plots for bivariate EDA

p4 <- ggplot(data = derby, aes(x = starters, y = speed)) + 
  geom_point() + 
  labs(x = "Starters", y = "Speed (ft / s)")

p5 <- ggplot(data = derby, aes(x = year, y = speed)) + 
  geom_point() + 
  labs(x = "Year", y = "Speed (ft / s)")

p6 <- ggplot(data = derby, aes(x = condition, y = speed)) + 
  geom_boxplot(fill = "forestgreen", color = "black") + 
  labs(x = "Conditions", y = "Speed (ft / s)")

(p4 + p5) + p6 +
  plot_annotation(title = "Bivariate data analysis")

Scatterplot matrix

A scatterplot matrix helps quickly visualize relationships between many pairs of variables. It is particularly useful for identifying potentially correlated predictors.

. . .

#library(GGally)
ggpairs(data = derby, 
        columns = c("condition", "year", "starters", "speed"))

Plots for multivariate EDA

Plot the relationship between the response and a predictor based on levels of another predictor to assess potential interactions.

#library(viridis)
ggplot(data = derby, aes(x = year, y = speed, color = condition, 
                         shape = condition, linetype = condition)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE, aes(linetype = condition)) + 
  labs(x = "Year", y = "Speed (ft/s)", color = "Condition",
       title = "Speed vs. year", 
       subtitle = "by track condition") +
  guides(linetype = "none", shape = "none") +
  scale_color_viridis_d(end = 0.9)

Model 1: Main effects model

term estimate std.error statistic p.value
(Intercept) 8.197 4.508 1.818 0.072
starters -0.005 0.017 -0.299 0.766
year 0.023 0.002 9.766 0.000
conditiongood -0.443 0.231 -1.921 0.057
conditionslow -1.543 0.161 -9.616 0.000
# Fit and display model
model1 <- lm(speed ~ starters + year + condition, data = derby)
tidy(model1) |> 
  kable(digits = 3)

Interpretation

\[\widehat{speed} = 8.197 - 0.005 ~ starters + 0.023 ~ year - 0.443 ~ good - 1.543 ~ slow\]

term estimate std.error statistic p.value
(Intercept) 8.197 4.508 1.818 0.072
starters -0.005 0.017 -0.299 0.766
year 0.023 0.002 9.766 0.000
conditiongood -0.443 0.231 -1.921 0.057
conditionslow -1.543 0.161 -9.616 0.000

. . .

  1. Write out the interpretations for starters and conditiongood.
  2. Does the intercept have a meaningful interpretation?

Centering

Centering: Subtract a constant from each observation of a given variable

  • Do this to make interpretation of model parameters more meaningful (particularly intercept)

  • In STA 202, we used mean-centering, where we subtracted the mean from each observation of a given variable (see the sketch after this list)

  • How does centering change the model?
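For reference, mean-centering is a single mutate() call; a minimal sketch (derby_mc and year_mc are made-up names for illustration):

derby_mc <- derby |>
  mutate(year_mc = year - mean(year)) # subtract the mean year from each observation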

Centering year

derby <- derby |>
  mutate(yearnew = year - 1896) #1896 = starting year

model1Cent <- lm(speed ~ starters + yearnew + condition, data = derby) # refit Model 1 with centered year

. . .

term estimate std.error statistic p.value
(Intercept) 52.175 0.194 269.079 0.000
starters -0.005 0.017 -0.299 0.766
yearnew 0.023 0.002 9.766 0.000
conditiongood -0.443 0.231 -1.921 0.057
conditionslow -1.543 0.161 -9.616 0.000

. . .

\[\widehat{speed} = 52.175 - 0.005 ~ starters + 0.023 ~ yearnew - 0.443 ~ good - 1.543 ~ slow\]

Model 1: Check model assumptions

#library(ggfortify)
autoplot(model1Cent)

Which of the L.I.N.E. assumptions does this model satisfy, and which does it violate?

Model 2: Add quadratic effect for year?

ggplot(data = derby, aes(x = yearnew, y = speed)) + 
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") + 
  geom_smooth(se = FALSE, color = "red", linetype = 2) + 
  labs(x = "Years since 1896", y = "Speed (ft/s)", 
       title = "Speed vs. Years since 1896")

Model 2: Add \(yearnew^2\)

term estimate std.error statistic p.value
(Intercept) 51.4130 0.1826 281.5645 0.0000
starters -0.0253 0.0136 -1.8588 0.0656
yearnew 0.0700 0.0061 11.4239 0.0000
I(yearnew^2) -0.0004 0.0000 -8.0411 0.0000
conditiongood -0.4770 0.1857 -2.5689 0.0115
conditionslow -1.3927 0.1305 -10.6701 0.0000
model2 <- lm(speed ~ starters + yearnew + I(yearnew^2) + condition, 
             data = derby)
tidy(model2) |> kable(digits = 4)

Interpreting quadratic effects

\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 ~ x_1 + \hat{\beta}_2 ~ x_2 + \hat{\beta}_3 ~ x_2^2\]

General interpretation: When \(x_2\) increases from \(a\) to \(b\), \(y\) is expected to change by \(\hat{\beta}_2(b - a) + \hat{\beta}_3(b^2 - a^2)\), holding \(x_1\) constant.
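A quick numeric sketch of this formula using the rounded Model 2 coefficients for yearnew and yearnew\(^2\) (the values of a and b below are chosen arbitrarily for illustration):

b_lin  <- 0.070    # rounded coefficient on yearnew
b_quad <- -0.0004  # rounded coefficient on yearnew^2
a <- 50
b <- 51
b_lin * (b - a) + b_quad * (b^2 - a^2) # expected change in speed (ft/s), other predictors held fixed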

. . .

Interpreting quadratic effects

\[\begin{aligned}\widehat{speed} = &51.413 - 0.025 ~ starters + 0.070 ~ yearnew \\ & - 0.0004 ~ yearnew^2 - 0.477 ~ good - 1.393 ~ slow\end{aligned}\]

. . .

Questions:

Interpret the effect of year for the 5 most recent years (2013 - 2017).

Model 2: Check model assumptions
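A sketch of the check, mirroring the Model 1 diagnostics above (assuming the same ggfortify autoplot() approach):

#library(ggfortify)
autoplot(model2)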

Model 3: Include interaction term?

Recall from the EDA…

library(viridis)
ggplot(data = derby, aes(x = year, y = speed, color = condition, 
                         shape = condition, linetype = condition)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE, aes(linetype = condition)) + 
  labs(x = "Year", y = "Speed (ft/s)", color = "Condition",
       title = "Speed vs. year", 
       subtitle = "by track condition") +
  guides(linetype = "none", shape = "none") +
  scale_color_viridis_d(end = 0.9)

Model 3: Add interaction term

\[\begin{aligned}\widehat{speed} = & 52.387 - 0.003 ~ starters + 0.020 ~ yearnew - 1.070 ~ good - 2.183 ~ slow \\ &+0.012 ~ yearnew \times good + 0.012 ~ yearnew \times slow \end{aligned}\]

term estimate std.error statistic p.value
(Intercept) 52.387 0.200 262.350 0.000
starters -0.003 0.016 -0.189 0.850
yearnew 0.020 0.003 7.576 0.000
conditiongood -1.070 0.423 -2.527 0.013
conditionslow -2.183 0.270 -8.097 0.000
yearnew:conditiongood 0.012 0.008 1.598 0.113
yearnew:conditionslow 0.012 0.004 2.866 0.005
model3 <- lm(speed ~ starters + yearnew + condition +
               yearnew * condition, 
             data = derby)
tidy(model3) |> kable(digits = 3)

Interpreting interaction effects

term estimate std.error statistic p.value
(Intercept) 52.387 0.200 262.350 0.000
starters -0.003 0.016 -0.189 0.850
yearnew 0.020 0.003 7.576 0.000
conditiongood -1.070 0.423 -2.527 0.013
conditionslow -2.183 0.270 -8.097 0.000
yearnew:conditiongood 0.012 0.008 1.598 0.113
yearnew:conditionslow 0.012 0.004 2.866 0.005
  • Write out the interpretation of…

Which model would you choose?

Model 1:
r.squared  adj.r.squared  AIC      BIC
0.73       0.721          259.478  276.302

Model 2:
r.squared  adj.r.squared  AIC      BIC
0.827      0.819          207.429  227.057

Model 3:
r.squared  adj.r.squared  AIC      BIC
0.751      0.738          253.584  276.016
# Model 1
glance(model1Cent) |>
  select(r.squared, adj.r.squared, AIC, BIC) |>
  kable(digits = 3)

# Model2
glance(model2) |>
  select(r.squared, adj.r.squared, AIC, BIC) |>
  kable(digits = 3)

# Model 3
glance(model3) |>
  select(r.squared, adj.r.squared, AIC, BIC) |>
  kable(digits = 3)

What are these model quality metrics?

  • How do we define \(R^2\)?

  • What is adj.r.squared?

Measures of model performance

  • \(\color{#4187aa}{R^2}\): Proportion of variability in the response explained by the model.
    • Will always increase as predictors are added, so it shouldn’t be used to compare models
  • \(\color{#4187aa}{Adj. R^2}\): Similar to \(R^2\) with a penalty for extra terms
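In symbols, writing \(SSE\) for the residual sum of squares, \(SST\) for the total sum of squares, \(n\) for the number of observations, and \(p\) for the number of predictors:

\[R^2 = 1 - \frac{SSE}{SST} \hspace{10mm} Adj.~R^2 = 1 - \frac{SSE/(n - p - 1)}{SST/(n - 1)}\]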

. . .

  • \(\color{#4187aa}{AIC}\): Likelihood-based approach balancing model performance and complexity

  • \(\color{#4187aa}{BIC}\): Similar to AIC with stronger penalty for extra terms

. . .

  • Nested F Test (extra sum of squares F test): Generalization of t-test for individual coefficients to perform significance tests on nested models
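A minimal sketch of a nested F test in R with the models fit above, assuming we compare the main-effects model (model1Cent) to the interaction model (model3), which contains it:

anova(model1Cent, model3) # extra sum of squares F test for the interaction terms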

Which model would you choose?

Use the glance function to get model statistics.

model r.squared adj.r.squared AIC BIC
Model1 0.730 0.721 259.478 276.302
Model2 0.827 0.819 207.429 227.057
Model3 0.751 0.738 253.584 276.016
model1_glance <- glance(model1Cent) |>
  select(r.squared, adj.r.squared, AIC, BIC)
model2_glance <- glance(model2) |>
  select(r.squared, adj.r.squared, AIC, BIC)
model3_glance <- glance(model3) |>
  select(r.squared, adj.r.squared, AIC, BIC)

model1_glance |>
  bind_rows(model2_glance) |>
  bind_rows(model3_glance) |>
  bind_cols(model = c("Model1", "Model2", "Model3")) |>
  select(model, everything()) |>
kable(digits = 3)

Characteristics of a “good” final model

  • Model can be used to answer primary research questions
  • Predictor variables control for important covariates
  • Potential interactions have been investigated
  • Variables are centered, as needed, for more meaningful interpretations
  • Unnecessary terms are removed
  • Assumptions are met and influential points have been addressed
  • Model tells a “persuasive story parsimoniously”

Inference for multiple linear regression

Use statistical inference to

  • Determine if predictors are statistically significant (not necessarily practically significant!)
  • Quantify uncertainty in coefficient estimates
  • Quantify uncertainty in model predictions

. . .

If L.I.N.E. assumptions are met, we can conduct inference using the \(t\) distribution and estimated standard errors
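A minimal sketch of both kinds of uncertainty using Model 2 from above (the new race values below are hypothetical):

# Confidence intervals for the coefficients
tidy(model2, conf.int = TRUE) |> kable(digits = 3)

# Prediction interval for a hypothetical race: 15 starters, fast track, 2017
newrace <- tibble(starters = 15, yearnew = 2017 - 1896, condition = "fast")
predict(model2, newdata = newrace, interval = "prediction")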

Inference for regression

  • Use least squares regression to get the estimates \(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p\), and \(\hat{\sigma}^2\)

  • \(\hat{\sigma}\) is the regression standard error

. . .

\[\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^n(y_i - \hat{y}_i)^2}{n - p - 1}} = \sqrt{\frac{\sum_{i=1}^n e_i^2}{n-p-1}}\]
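Both base R and broom report the regression standard error directly from a fitted model; a quick sketch with the centered Model 1:

sigma(model1Cent)        # regression standard error from base R
glance(model1Cent)$sigma # same value via broom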

Acknowledgements

These slides are based on content in BMLR: Chapter 1 - Review of Multiple Linear Regression

Initial versions of the slides are by Dr. Maria Tackett, Duke University