Cornell College
STA 363 Spring 2025 Block 8
In statistics, a generalized linear model (GLM) is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.
Logistic regression
\[\begin{aligned}\pi = P(y = 1 | x) \hspace{2mm} &\Rightarrow \hspace{2mm} \text{Link function: } \log\big(\frac{\pi}{1-\pi}\big) \\ &\Rightarrow \log\big(\frac{\pi}{1-\pi}\big) = \beta_0 + \beta_1~x\end{aligned}\]
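As a minimal sketch of how the logistic regression above is fit in R with `glm()` (the names y, x, and dat are placeholders, not objects from these slides):

```r
# Sketch: logistic regression via glm() with the logit link;
# y, x, and dat are placeholder names, not from these slides
fit <- glm(y ~ x, data = dat, family = binomial(link = "logit"))
summary(fit)   # estimates of beta_0 and beta_1 on the log-odds scale
```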
Generalized Linear Models (Ch 1 - 6)
Modeling correlated data (Ch 7 - 9)
More Regression Models (ITSL Chapter 7)
- Polynomial Regression
- Regression Splines
- Smoothing Splines
- Generalized Additive Models (GAMs)
Reporter will share list with the class
Pre-reqs
Background knowledge
Lectures
Attendance is expected (if you are healthy!)
Beyond Multiple Linear Regression by Paul Roback and Julie Legler
The secondary text is An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani; it is freely available online. We will use Chapter 7.
Readings
Homework
- Primarily from Beyond Multiple Linear Regression
- Individual assignments
- Work together, but you must complete your own work. Discuss but don't copy.
Mini-projects
Examples:
Two exams this block; dates TBD.
Each will have two components
Final grades will be calculated as follows
Category | Points |
---|---|
Homework | 200 |
Participation | 100 |
Labs and Mini Projects | 300 |
Exams | 400 |
Total | 1000 |
See Syllabus on website for letter grade thresholds.
What are the assumptions for linear regression?
Linearity: Linear relationship between mean response and predictor variable(s)
Independence: Residuals are independent. There is no connection between how far any two points lie above or below regression line.
Normality: Response follows a normal distribution at each level of the predictor (or combination of predictors)
Equal variance: Variability (variance or standard deviation) of the response is equal for all levels of the predictor (or combination of predictors)
Use residual plots to check that the conditions hold before using the model for statistical inference.
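A minimal sketch of the usual residual checks in R, assuming a fitted lm object (fit, y, x, and dat are placeholder names, not from these slides):

```r
# Sketch: standard residual diagnostics for a fitted linear model;
# fit, y, x, and dat are placeholder names
fit <- lm(y ~ x, data = dat)
par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2 x 2 grid
plot(fit)              # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
```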
Linearity: Linear relationship between mean of the response \(Y\) and the predictor \(X\)
Independence: No connection between how any two points lie above or below the regression line
Normality: Response, \(Y\), follows a normal distribution at each level of the predictor, \(X\)
Equal variance: Variance (or standard deviation) of the response, \(Y\), is equal for all levels of the predictor, \(X\)
How do we assess these conditions?
When we use linear least squares regression to draw conclusions, we do so under the assumption that L.I.N.E. are all met.
Generalized linear models require different assumptions and can accommodate violations in L.I.N.E.
But the independence assumption must hold!
Today’s data is from the Kentucky Derby, an annual 1.25-mile horse race held at the Churchill Downs race track in Louisville, KY. The data is in the file derbyplus.csv and contains information for races 1896 - 2017.
Response variable
- speed: Average speed of the winner in feet per second (ft/s)

Additional variable
- winner: Winning horse

Predictor variables
- year: Year of the race
- condition: Condition of the track (good, fast, slow)
- starters: Number of horses who raced

Once you're ready for the statistical analysis, the first step should always be exploratory data analysis (EDA).
The EDA will help you understand each variable's distribution and identify relationships between the response and the predictors.
The EDA is exploratory; formal modeling and statistical inference should be used to draw conclusions.
# Assumes the tidyverse and patchwork packages, and that derbyplus.csv
# is in the working directory
library(tidyverse)
library(patchwork)
derby <- read_csv("derbyplus.csv")

p1 <- ggplot(data = derby, aes(x = speed)) +
geom_histogram(fill = "forestgreen", color = "black") +
labs(x = "Winning speed (ft/s)", y = "Count")
p2 <- ggplot(data = derby, aes(x = starters)) +
geom_histogram(fill = "forestgreen", color = "black") +
labs(x = "Starters", y = "Count")
p3 <- ggplot(data = derby, aes(x = condition)) +
  geom_bar(fill = "forestgreen", color = "black") +
  labs(x = "Condition", y = "Count")
p1 + (p2 / p3) +
plot_annotation(title = "Univariate data analysis")
p4 <- ggplot(data = derby, aes(x = starters, y = speed)) +
geom_point() +
labs(x = "Starters", y = "Speed (ft / s)")
p5 <- ggplot(data = derby, aes(x = year, y = speed)) +
geom_point() +
labs(x = "Year", y = "Speed (ft / s)")
p6 <- ggplot(data = derby, aes(x = condition, y = speed)) +
geom_boxplot(fill = "forestgreen", color = "black") +
labs(x = "Conditions", y = "Speed (ft / s)")
(p4 + p5) + p6 +
plot_annotation(title = "Bivariate data analysis")
A scatterplot matrix helps quickly visualize relationships between many pairs of variables; it is particularly useful for identifying potentially correlated predictors.
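One way to build a scatterplot matrix in R is with the GGally package (a sketch, not code from the original slides):

```r
# Sketch: scatterplot matrix of the Derby variables using GGally
library(GGally)
ggpairs(derby, columns = c("speed", "starters", "year", "condition"))
```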
Plot the relationship between the response and a predictor based on levels of another predictor to assess potential interactions.
#library(viridis)
ggplot(data = derby, aes(x = year, y = speed, color = condition,
shape = condition, linetype = condition)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, aes(linetype = condition)) +
labs(x = "Year", y = "Speed (ft/s)", color = "Condition",
title = "Speed vs. year",
subtitle = "by track condition") +
guides(linetype = "none", shape = "none") +
scale_color_viridis_d(end = 0.9)
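The main-effects model summarized in the table below could be fit with a sketch like this (the object name model1 is an assumption; the slides' later code uses model1Cent for the centered version):

```r
# Sketch: main-effects fit behind the coefficient table below;
# the object name model1 is assumed, not shown in the slides
model1 <- lm(speed ~ starters + year + condition, data = derby)
summary(model1)
```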
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 8.197 | 4.508 | 1.818 | 0.072 |
starters | -0.005 | 0.017 | -0.299 | 0.766 |
year | 0.023 | 0.002 | 9.766 | 0.000 |
conditiongood | -0.443 | 0.231 | -1.921 | 0.057 |
conditionslow | -1.543 | 0.161 | -9.616 | 0.000 |
\[\widehat{speed} = 8.197 - 0.005 ~ starters + 0.023 ~ year - 0.443 ~ good - 1.543 ~ slow\]
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 8.197 | 4.508 | 1.818 | 0.072 |
starters | -0.005 | 0.017 | -0.299 | 0.766 |
year | 0.023 | 0.002 | 9.766 | 0.000 |
conditiongood | -0.443 | 0.231 | -1.921 | 0.057 |
conditionslow | -1.543 | 0.161 | -9.616 | 0.000 |
Interpret the coefficients for starters and conditiongood.

Centering: Subtract a constant from each observation of a given variable
Do this to make interpretation of model parameters more meaningful (particularly intercept)
In STA 202, we used mean-centering, where we subtracted the mean from each observation of a given variable
How does centering change the model?
Center year by subtracting 1896, the first year in the data, to create the variable yearnew.
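A minimal sketch of this step in R (assuming the derby data frame and the tidyverse loaded earlier):

```r
# Center year at 1896, the first Derby year in the data,
# creating the yearnew variable used in the table below
derby <- derby |>
  mutate(yearnew = year - 1896)
```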
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 52.175 | 0.194 | 269.079 | 0.000 |
starters | -0.005 | 0.017 | -0.299 | 0.766 |
yearnew | 0.023 | 0.002 | 9.766 | 0.000 |
conditiongood | -0.443 | 0.231 | -1.921 | 0.057 |
conditionslow | -1.543 | 0.161 | -9.616 | 0.000 |
\[\widehat{speed} = 52.175 - 0.005 ~ starters + 0.023 ~ yearnew - 0.443 ~ good - 1.543 ~ slow\]
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 51.4130 | 0.1826 | 281.5645 | 0.0000 |
starters | -0.0253 | 0.0136 | -1.8588 | 0.0656 |
yearnew | 0.0700 | 0.0061 | 11.4239 | 0.0000 |
I(yearnew^2) | -0.0004 | 0.0000 | -8.0411 | 0.0000 |
conditiongood | -0.4770 | 0.1857 | -2.5689 | 0.0115 |
conditionslow | -1.3927 | 0.1305 | -10.6701 | 0.0000 |
\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 ~ x_1 + \hat{\beta}_2 ~ x_2 + \hat{\beta}_3 ~ x_2^2\]
General interpretation: When \(x_2\) increases from a to b, \(y\) is expected to change by \(\hat{\beta}_2(b - a) + \hat{\beta}_3(b^2 - a^2)\), holding \(x_1\) constant.
\[\begin{aligned}\widehat{speed} = &51.413 - 0.025 ~ starters + 0.070 ~ yearnew \\ & - 0.0004 ~ yearnew^2 - 0.477 ~ good - 1.393 ~ slow\end{aligned}\]
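A sketch of how the quadratic model above could be fit (quad_model is a placeholder name, not from the slides):

```r
# Sketch: quadratic fit summarized in the table above;
# quad_model is a placeholder name
quad_model <- lm(speed ~ starters + yearnew + I(yearnew^2) + condition,
                 data = derby)
summary(quad_model)
```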
Questions:
Interpret the effect of year for the 5 most recent years (2013 - 2017).
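As a sketch of the computation, using the rounded coefficients above: 2013 and 2017 correspond to \(yearnew = 117\) and \(yearnew = 121\), so the expected change in winning speed, holding starters and condition constant, is

\[0.070\,(121 - 117) - 0.0004\,(121^2 - 117^2) \approx 0.28 - 0.38 = -0.10 \text{ ft/s}\]

That is, the predicted winning speed in 2017 is about 0.10 ft/s lower than in 2013, all else equal.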
Recall from the EDA…
library(viridis)
ggplot(data = derby, aes(x = year, y = speed, color = condition,
shape = condition, linetype = condition)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, aes(linetype = condition)) +
labs(x = "Year", y = "Speed (ft/s)", color = "Condition",
title = "Speed vs. year",
subtitle = "by track condition") +
guides(linetype = "none", shape = "none") +
scale_color_viridis_d(end = 0.9)
\[\begin{aligned}\widehat{speed} = & 52.387 - 0.003 ~ starters + 0.020 ~ yearnew - 1.070 ~ good - 2.183 ~ slow \\ &+0.012 ~ yearnew \times good + 0.012 ~ yearnew \times slow \end{aligned}\]
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 52.387 | 0.200 | 262.350 | 0.000 |
starters | -0.003 | 0.016 | -0.189 | 0.850 |
yearnew | 0.020 | 0.003 | 7.576 | 0.000 |
conditiongood | -1.070 | 0.423 | -2.527 | 0.013 |
conditionslow | -2.183 | 0.270 | -8.097 | 0.000 |
yearnew:conditiongood | 0.012 | 0.008 | 1.598 | 0.113 |
yearnew:conditionslow | 0.012 | 0.004 | 2.866 | 0.005 |
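A sketch of how the interaction model above could be fit (int_model is a placeholder name, not from the slides):

```r
# Sketch: interaction fit summarized in the table above;
# int_model is a placeholder name
int_model <- lm(speed ~ starters + yearnew + condition + yearnew:condition,
                data = derby)
summary(int_model)
```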
Model1:

r.squared | adj.r.squared | AIC | BIC |
---|---|---|---|
0.73 | 0.721 | 259.478 | 276.302 |

Model2:

r.squared | adj.r.squared | AIC | BIC |
---|---|---|---|
0.827 | 0.819 | 207.429 | 227.057 |

Model3:

r.squared | adj.r.squared | AIC | BIC |
---|---|---|---|
0.751 | 0.738 | 253.584 | 276.016 |
How do we define \(R^2\)?
What is adjusted \(R^2\) (adj.r.squared in the output)? See the definitions below.
\(\color{#4187aa}{AIC}\): Likelihood-based approach balancing model performance and complexity
\(\color{#4187aa}{BIC}\): Similar to AIC with stronger penalty for extra terms
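For reference, the standard definitions (not from the slides): writing SSE for the sum of squared residuals, SST for the total sum of squares, \(\hat{L}\) for the maximized likelihood, and \(k\) for the number of estimated parameters,

\[R^2 = 1 - \frac{SSE}{SST}, \qquad R^2_{adj} = 1 - \frac{SSE/(n - p - 1)}{SST/(n - 1)}\]

\[AIC = 2k - 2\log\hat{L}, \qquad BIC = k\log(n) - 2\log\hat{L}\]

Lower AIC and BIC indicate a better trade-off between fit and complexity.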
Use the glance() function from the broom package to get model statistics.
model | r.squared | adj.r.squared | AIC | BIC |
---|---|---|---|---|
Model1 | 0.730 | 0.721 | 259.478 | 276.302 |
Model2 | 0.827 | 0.819 | 207.429 | 227.057 |
Model3 | 0.751 | 0.738 | 253.584 | 276.016 |
library(broom)   # glance() for model fit statistics
library(knitr)   # kable() for formatted tables

model1_glance <- glance(model1Cent) |>
select(r.squared, adj.r.squared, AIC, BIC)
model2_glance <- glance(model2) |>
select(r.squared, adj.r.squared, AIC, BIC)
model3_glance <- glance(model3) |>
select(r.squared, adj.r.squared, AIC, BIC)
model1_glance |>
bind_rows(model2_glance) |>
bind_rows(model3_glance) |>
bind_cols(model = c("Model1", "Model2", "Model3")) |>
select(model, everything()) |>
kable(digits = 3)
Use statistical inference to test hypotheses about the model coefficients and to quantify uncertainty in estimates and predictions.
If L.I.N.E. assumptions are met, we can conduct inference using the \(t\) distribution and estimated standard errors
Use least squares regression to get the estimates \(\hat{\beta}_0\), \(\hat{\beta}_1\), and \(\hat{\sigma}^2\)
\(\hat{\sigma}\) is the regression standard error
\[\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^n(y_i - \hat{y}_i)^2}{n - p - 1}} = \sqrt{\frac{\sum_{i=1}^n e_i^2}{n-p-1}}\]
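In R, \(\hat{\sigma}\) can be read directly off a fitted model; a sketch using the model1Cent object from the code above:

```r
# Sketch: pull sigma-hat from a fitted model, here model1Cent from the code above
summary(model1Cent)$sigma
glance(model1Cent)$sigma   # same value via broom's glance()
```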
These slides are based on content in BMLR: Chapter 1 - Review of Multiple Linear Regression
Initial versions of the slides are by Dr. Maria Tackett, Duke University