library(tidyverse)
library(tidymodels)
library(splines)
Moving Beyond Linearity - Chapter 7 (ISLR)
Wage Data
Lab Setup
- Copy the project lab folder located at Home -> STA363_inst_files -> labs. If you check the box next to the folder name, then click the small gear icon, you can choose “Copy To” and place a copy of the folder in your newly created folder.
- Now, click File-> Open Project and navigate to the project file in the folder you just copied.
- You can place your responses in the .qmd file included.
Packages
In this lab we will work with three packages: tidyverse, a collection of packages for doing data analysis in a “tidy” way; tidymodels, for tidy statistical model output; and splines, for fitting spline models.
If you’d like to run your code in the Console as well, you’ll also need to load the packages there. To do so, run the following in the console:
library(tidyverse)
library(tidymodels)
library(splines)
Data
For this lab, we are using wage data adapted from the ISLR package. This can be loaded in by running:
load("data/wage.rda")
Below is a data dictionary for this dataset:
| variable | definition |
|---|---|
| year | Year that wage information was recorded |
| age | Age of worker |
| maritl | A factor with levels 1. Never Married, 2. Married, 3. Widowed, 4. Divorced, and 5. Separated indicating marital status |
| race | A factor with levels 1. White, 2. Black, 3. Asian, and 4. Other indicating race |
| education | A factor with levels 1. < HS Grad, 2. HS Grad, 3. Some College, 4. College Grad, and 5. Advanced Degree indicating education level |
| region | Region of the country (mid-atlantic only) |
| jobclass | A factor with levels 1. Industrial and 2. Information indicating type of job |
| health | A factor with levels 1. <=Good and 2. >=Very Good indicating health level of worker |
| health_ins | A factor with levels 1. Yes and 2. No indicating whether worker has health insurance |
| logwage | Log of worker’s wage |
| wage | Worker’s raw wage |
Exercises
Explore the wage dataset. What are the variables? How many observations are there? Is there any missing data? If so, how much? For simplicity in this particular lab, we are going to do complete case analysis. Create a new dataframe that is complete (based only on education, not maritl, since we are not going to use that variable in this lab).
For this lab, we are interested in predicting the variable wage from age, health_ins, jobclass, education, and race. Complete a brief univariate and bivariate EDA for each of these variables and with the response.
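As a hint for the complete-case step, dplyr’s filter() combined with is.na() keeps only the rows where a given column is observed. A minimal sketch on a toy data frame (in the lab, apply the same idea to the wage data and its education column):

```r
library(dplyr)

# Toy data frame with one missing education value
toy <- data.frame(wage = c(100, 120, 90),
                  education = c("HS Grad", NA, "College Grad"))

# Keep only rows where education is observed (complete cases for education)
toy_complete <- toy %>% filter(!is.na(education))
nrow(toy_complete)  # 2
```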
Polynomial Regression
Polynomial regression can be thought of as doing polynomial expansion on a variable and passing that expansion into a linear regression model. We will be very explicit about this formulation in this lab.
The following step will take age and replace it with the variables age, age^2, age^3, and age^4, since we set degree = 4 (this is a lie, keep reading).
model1 <- lm(wage ~ poly(age, 4), data = wage)
coef(summary(model1))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 111.70361 0.7287409 153.283015 0.000000e+00
poly(age, 4)1 447.06785 39.9147851 11.200558 1.484604e-28
poly(age, 4)2 -478.31581 39.9147851 -11.983424 2.355831e-32
poly(age, 4)3 125.52169 39.9147851 3.144742 1.678622e-03
poly(age, 4)4 -77.91118 39.9147851 -1.951938 5.103865e-02
Though we expect age, age^2, age^3, and age^4, what is actually returned is a basis of orthogonal polynomials: each of the columns is a linear combination of the variables age, age^2, age^3, and age^4. We can see this by using poly() directly with raw = FALSE, since it is the default:
poly(1:6, degree = 4, raw = FALSE)
1 2 3 4
[1,] -0.5976143 0.5455447 -0.3726780 0.1889822
[2,] -0.3585686 -0.1091089 0.5217492 -0.5669467
[3,] -0.1195229 -0.4364358 0.2981424 0.3779645
[4,] 0.1195229 -0.4364358 -0.2981424 0.3779645
[5,] 0.3585686 -0.1091089 -0.5217492 -0.5669467
[6,] 0.5976143 0.5455447 0.3726780 0.1889822
attr(,"coefs")
attr(,"coefs")$alpha
[1] 3.5 3.5 3.5 3.5
attr(,"coefs")$norm2
[1] 1.00000 6.00000 17.50000 37.33333 64.80000 82.28571
attr(,"degree")
[1] 1 2 3 4
attr(,"class")
[1] "poly" "matrix"
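We can check the orthogonality directly: the cross-product of this basis matrix is (up to floating-point rounding) the identity, confirming the columns are orthonormal.

```r
# The orthogonal polynomial basis has orthonormal columns:
# t(P) %*% P is the 4 x 4 identity matrix, up to floating-point error.
P <- poly(1:6, degree = 4, raw = FALSE)
round(crossprod(P), 10)
```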
We see that these variables don’t directly have the format we would have assumed. But this is still a well-reasoned transformation. We can get the raw polynomial transformation by setting raw = TRUE:
poly(1:6, degree = 4, raw = TRUE)
1 2 3 4
[1,] 1 1 1 1
[2,] 2 4 8 16
[3,] 3 9 27 81
[4,] 4 16 64 256
[5,] 5 25 125 625
[6,] 6 36 216 1296
attr(,"degree")
[1] 1 2 3 4
attr(,"class")
[1] "poly" "matrix"
These transformations align with what we would expect. It is still recommended to stick with the default of raw = FALSE unless you have a reason not to.
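One reason the choice of basis is low-stakes: the raw and orthogonal bases span the same column space, so the two regressions produce identical fitted values; only the coefficients differ. A minimal sketch on simulated data:

```r
# Orthogonal and raw polynomial bases span the same column space, so the
# fitted values agree even though the coefficients look very different.
# (Simulated data for illustration.)
set.seed(1)
x <- runif(100, 18, 80)
y <- 50 + 0.8 * x - 0.01 * x^2 + rnorm(100, sd = 5)

fit_orth <- lm(y ~ poly(x, 4))
fit_raw  <- lm(y ~ poly(x, 4, raw = TRUE))

all.equal(fitted(fit_orth), fitted(fit_raw))  # TRUE
```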
Why is the default to return orthogonal vectors (perpendicular vectors)?
Using your EDA from before, what degree polynomial is needed for age vs wage?
Fit a polynomial regression model with your chosen degree of the polynomial for age to predict wage.
We need to choose the “best” degree for the polynomial regression model, and one way is hypothesis testing. Fit polynomial regression models with age predicting wage with degrees 1 through 5.
Use either a drop-in deviance test or a Nested F Test to determine which model is best.
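One possible way to set this up (a sketch on simulated data; in the lab, replace x and y with age and wage from your complete-case data frame) is to fit all five models and pass them to anova(), which runs the sequence of nested F tests:

```r
# Fit polynomial models of degree 1 through 5, then compare them with
# nested F tests; each row of the table tests whether the added degree
# improves the fit. (Simulated data for illustration.)
set.seed(1)
x <- runif(200, 18, 80)
y <- 50 + 0.8 * x - 0.01 * x^2 + rnorm(200, sd = 5)

fits <- lapply(1:5, function(d) lm(y ~ poly(x, d)))
do.call(anova, fits)
```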
Now create a plot that visualizes the fit of the data. Make sure to include the original data values and a line for the fit polynomial model.
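One way to build such a plot with ggplot2 (loaded with tidyverse) is geom_point() for the raw data plus geom_smooth() with the polynomial formula. A sketch on simulated data; in the lab, map age and wage from your wage data frame instead:

```r
library(ggplot2)

# Simulated stand-ins for age and wage
set.seed(1)
dat <- data.frame(x = runif(200, 18, 80))
dat$y <- 50 + 0.8 * dat$x - 0.01 * dat$x^2 + rnorm(200, sd = 5)

# Scatterplot of the data with the degree-4 polynomial fit overlaid
p <- ggplot(dat, aes(x, y)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 4), se = TRUE)
p
```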
Splines
To fit regression splines, we use the splines library. We saw that regression splines can be fit by constructing an appropriate matrix of basis functions. The bs() function generates the entire matrix of basis functions for splines with the specified set of knots. By default, bs() produces cubic splines. Fitting wage to age using a regression spline is simple:
model2 <- lm(wage ~ bs(age, knots = c(a, b, c)), data = wage)
Here we have pre-specified knots at ages \(a\), \(b\), and \(c\). This will produce a spline with six basis functions. (Recall that a cubic spline with three knots has seven degrees of freedom; these degrees of freedom are used up by an intercept, plus six basis functions.) We could also use the df option to produce a spline with knots at uniform quantiles of the data.
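The df shortcut can be checked directly: a cubic bs() basis with df degrees of freedom places df - 3 interior knots at quantiles of the input, so df = 6 yields knots at the 25th, 50th, and 75th percentiles.

```r
library(splines)

# A cubic spline basis with df = 6 has 6 - 3 = 3 interior knots,
# placed at the quartiles of x.
x <- 1:100
basis <- bs(x, df = 6)   # cubic by default (degree = 3)
attr(basis, "knots")     # knots at the 25th, 50th, and 75th percentiles
</imports>
```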
Use your EDA or new data visualizations to choose the 3 knots. Then fit the cubic spline model using your chosen knots.
Another way to find the knots is quantiles. Use the bs() function, specifying the correct number of degrees of freedom, to fit the cubic spline model using knots at the 25th, 50th, and 75th quantiles.
In order to instead fit a natural spline, we use the ns() function. Here we fit a natural spline with d degrees of freedom. Change the code below to fit a natural spline with three degrees of freedom (df = 3).
model3 <- lm(wage ~ ns(age, df = d), data = wage)
Use all of the other variables of interest and refit your best-degree polynomial regression model, your cubic spline, and your natural spline model with the additional variables.
Now we want to choose the best model among all the models above. Use AIC, BIC, RMSE (root mean squared error), model complexity, and residual plots to choose the best model. Explicitly explain your reasoning.
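One way to organize the comparison (a sketch on simulated data, with in-sample RMSE computed from residuals; in the lab, substitute your fitted polynomial and spline models on the wage data):

```r
library(splines)

# Simulated stand-ins for age and wage
set.seed(1)
x <- runif(200, 18, 80)
y <- 50 + 0.8 * x - 0.01 * x^2 + rnorm(200, sd = 5)

m_poly   <- lm(y ~ poly(x, 4))
m_spline <- lm(y ~ ns(x, df = 4))

# In-sample RMSE: square root of the mean squared residual
rmse <- function(m) sqrt(mean(residuals(m)^2))

# Collect the criteria side by side for easy comparison
data.frame(model = c("polynomial", "natural spline"),
           AIC   = c(AIC(m_poly), AIC(m_spline)),
           BIC   = c(BIC(m_poly), BIC(m_spline)),
           RMSE  = c(rmse(m_poly), rmse(m_spline)))
```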