library(mosaic)
library(skimr) # may have to install
library(dplyr)
library(ggplot2)
library(gridExtra)
<- read.csv("data/banksalary.csv") bank
Stat 363: Multiple Linear Regression Review
Bank Salary Case Study
First Lab Setup
Since this our first lab, I suggest you get setup for files in the following way:
- Log into the course RStudio Server. http://turing.cornellcollege.edu:8787/ Make sure to close out of any projects on the top right corner.
- Click the “File” tab in the bottom right corner. Then click “Home” across the top of the panel.
- If ones does not exists already, use the “+📂” button to create a new folder and call it “STA363”
- Copy project folder called “01_linear_regression” located at Home -> STA363_inst_files-> labs. If you check the box next to the folder name, then click the small gear icon you can “copy to” and put a copy of the folder in your newly created folder.
- Now, click File-> Open Project and navigate to the project file in the folder you just copied.
- You can place your responses in the file 01_banksalary_fillable.qmd.
Introduction
Example: Sex discrimination in bank salaries. In the 1970’s, Harris Trust was sued for sex discrimination in the salaries it paid its employees. One approach to addressing this issue was to examine the starting salaries of all skilled, entry-level clerical workers between 1965 and 1975. The data is saved under banksalary.csv, and relevant R code can be found in the STA363_inst_files folder on the home directory of the RStudio Server. banksalary.qmd.
Data from skilled entry-level clerical workers at bank being sued for sex discrimination in 1970s (from Statistical Sleuth, Ch 12):
bsal
= beginning salary (annual salary at time of hire)sal77
= annual salary in 1977sex
= MALE or FEMALEsenior
= months since hiredage
= age in monthseduc
= years of educationexper
= months of prior work experience
First, we will speculate on what we expect to find and then we will perform an analysis using the data.
- Read the data in and use a
view()
to see the data, but do not do anything else with the data on the computer until question 6.
Questions
Identify the observational units, the response variable, and explanatory variables.
Given the mean starting salary of male workers ($5957) was 16% higher than the mean starting salary of female workers ($5139): Is this enough evidence to conclude sex discrimination exists? If not, what further evidence would you need?
How would you expect age, experience, and education to each be related to starting salary?
Why might it be important to control for seniority (number of years with the bank) if we are only concerned with the salary when the worker started?
Do you expect any explanatory variables (including sex) to be closely related to each other? What implications would this have for modeling?
Using the data…
One approach is to construct a good model for beginning salaries while requiring sex as a predictor, to determine the significance of sex after controlling for the other covariates. Then we can explore interactions with sex to see if its effect is consistent across levels of other predictors.
Use the data to address the primary question of interest here using only the beginning salary and sex variables. Be sure to discuss plots and summary statistics first, and then look at test(s) of significance.
Construct plots to investigate how each of the potential confounders (age, experience, education) is related to beginning salaries. Describe your findings.
Does seniority play a role in the variation of starting salaries? In what way?
Examine how the explanatory variables (including sex) are related to each other, if at all. What implications would this have for modeling?
Fit a simple linear regression model with starting salary as the response and education as the sole explanatory variable. Interpret the intercept and slope of this model; also interpret the R-squared value. Is there a significant relationship between education and starting salary?
Intercept:
Slope:
R2
Significance:
Does model1 from question 10 meet all linear regression assumptions? List each assumption and how you decided if it was met or not.
Is a model with all 4 confounding variables better than a model with just education? Justify with an appropriate significance test in addition to summary statistics of model performance.
You should have noticed that the term for age was not significant in the model3. What does this imply about age and about future modeling steps?
The relationship between experience and beginning salary exhibits some curvature. How might it be interpreted in this context? Determine whether a quadratic term in experience improves a model without curvature.
Based on model6, what conclusions can be drawn about sex discrimination at Harris Trust? Do these conclusions have to be qualified at all, or are they pretty clear cut? Interpret a 95% confidence interval for the male indicator variable in context to help with your response.
Do any explanatory variables exhibit an interaction with sex. If so, what are the implications for your answer in (15)?
Often salary data is logged before analysis. Would you recommend logging starting salary in this study? Support your decision analytically.
Regardless of your answer to (17), provide an interpretation for the coefficient for the male coefficient in model6a after logging starting salary.
Code
Data from skilled entry-level clerical workers at bank being sued for sex discrimination in 1970s (from Statistical Sleuth, Ch 12):
bsal
= beginning salary (annual salary at time of hire)sal77
= annual salary in 1977sex
= MALE or FEMALEsenior
= months since hiredage
= age in monthseduc
= years of educationexper
= months of prior work experience
# Examine data frame
head(bank) # print first 6 rows
# Generate relevant summary statistics for response variable
favstats(~ bsal, data = bank)
# Look at marginal relationship between sex and beginning salary
# Three options for getting summary statistics
favstats(~ bsal | sex, data = bank)
|>
bank group_by(sex) |>
summarize(mean = mean(bsal),
median = median(bsal),
sd = sd(bsal),
iqr = IQR(bsal),
n = n())
# Doesn't seem to render as a pdf
#bank |>
# group_by(sex) |>
# skim()
# Several options for plotting 1 categorical and 1 numeric variable
boxplot(bsal ~ sex, ylab="Beginning salary", data = bank)
bwplot(sex ~ bsal, data = bank)
ggplot(bank, aes(x = sex, y = bsal)) +
geom_boxplot() + coord_flip()
ggplot(data = bank, mapping = aes(x = bsal, y = ..density..)) +
geom_freqpoly(mapping = aes(colour = sex), binwidth = 250)
ggplot(data = bank, mapping = aes(x = bsal, y = ..density..)) +
geom_density(mapping = aes(colour = sex))
ggplot(bank, aes(x=bsal, fill=sex)) +
geom_density(alpha=0.4)
ggplot(bank, aes(x = sex, y = bsal)) +
geom_violin()
ggplot(data = bank) +
geom_histogram(mapping = aes(x = bsal)) +
facet_wrap(~ sex, nrow = 2)
ggplot(data = bank) +
geom_histogram(mapping = aes(x = bsal, y = ..density..)) +
facet_wrap(~ sex, nrow = 2)
# Initial analysis of sex vs salary, ignoring other covariates
t.test(bsal ~ sex, data = bank)
= lm(bsal ~ sex, data = bank)
model1 summary(model1)
<- bank |>
bank mutate(male = ifelse(sex=="MALE",1,0)) # create our own indicator for sex
t.test(bsal ~ male, var.equal = TRUE, data = bank)
# How are covariates related to response?
ggplot(bank, aes(y = bsal, age)) +
geom_point() +
geom_smooth(method = lm)
ggplot(bank, aes(y = bsal, exper)) +
geom_point() +
geom_smooth(method = lm)
ggplot(bank, aes(y = bsal, educ)) +
geom_point() +
geom_smooth(method = lm)
ggplot(bank, aes(y = bsal, senior)) +
geom_point() +
geom_smooth(method = lm)
# How are explanatory variables related to each other?
<- bank |> select(bsal, age, exper, educ, senior, male)
bank0 pairs(bank0) # matrix of scatterplots
cor(bank0) # matrix of correlations
# How are explanatory variables related to sex?
favstats(~ age | sex, data = bank)
ggplot(bank, aes(x = sex, y = age)) +
geom_boxplot() + coord_flip()
favstats(~ exper | sex, data = bank)
ggplot(bank, aes(x = sex, y = exper)) +
geom_boxplot() + coord_flip()
favstats(~ educ | sex, data = bank)
ggplot(bank, aes(x = sex, y = educ)) +
geom_boxplot() + coord_flip()
favstats(~ senior | sex, data = bank)
ggplot(bank, aes(x = sex, y = senior)) +
geom_boxplot() + coord_flip()
# Fit linear regression with single predictor (education)
= lm(bsal ~ educ, data = bank)
model2 summary(model2)
# Residual plots - run off the page unless redo margins
par(mfrow=c(2,2), mar=c(4,4,4,1), oma=c(0.5,0.5,0.5,0))
plot(model2) # Generate residual diagnostic plots
par(mfrow=c(1,1))
# Fit multiple regression model with four predictors
= lm(bsal ~ senior + age + educ + exper, data = bank)
model3 summary(model3)
anova(model2, model3, test="F")
# Examine AIC and BIC scores (lower is better)
AIC(model2) # AIC-model2
AIC(model3) # AIC-model3
AIC(model2, k=log(nrow(bank))) # BIC-model2
AIC(model3, k=log(nrow(bank))) # BIC-model3
# Fit linear regression with experience
= lm(bsal ~ exper, data = bank)
model4 summary(model4)
# Residual plots - run off the page unless redo margins
par(mfrow=c(2,2), mar=c(4,4,4,1), oma=c(0.5,0.5,0.5,0))
plot(model4) # Generate residual diagnostic plots
par(mfrow=c(1,1))
<- bank |>
bank mutate(exper2 = exper^2)
= lm(bsal ~ exper + exper2, data = bank)
model5 summary(model5)
# Residual plots - run off the page unless redo margins
par(mfrow=c(2,2), mar=c(4,4,4,1), oma=c(0.5,0.5,0.5,0))
plot(model5) # Generate residual diagnostic plots
par(mfrow=c(1,1))
# One potential final model
= lm(bsal ~ senior + educ + exper + male, data = bank)
model6 summary(model6)
# Construct 95% CIs for model coefficients the hard way...
= summary(model6)$coef[,1] # store model betas
betas = summary(model6)$coef[,2] # store SEs of betas
SEs = qt(.975,model6$df)
tstar = betas - tstar*SEs
lb = betas + tstar*SEs
ub cbind(lb,ub)
# ... or more simply:
confint(model6)
# Investigate interactions with sex
ggplot(bank, aes(y = bsal, x = age, color = sex)) +
geom_point() +
geom_smooth(method = lm)
ggplot(bank, aes(y = bsal, x = exper, color = sex)) +
geom_point() +
geom_smooth(method = lm)
ggplot(bank, aes(y = bsal, x = educ, color = sex)) +
geom_point() +
geom_smooth(method = lm)
ggplot(bank, aes(y = bsal, x = senior, color = sex)) +
geom_point() +
geom_smooth(method = lm)
# Examine interaction between sex and education
= lm(bsal ~ educ + male + educ:male, data = bank)
model7 summary(model7)
# Should beginning salary be log transformed?
<- bank |>
bank mutate(logbsal = log(bsal))
<- ggplot(bank, aes(bsal)) + geom_histogram(bins = 10)
hist1 <- ggplot(bank, aes(logbsal)) + geom_histogram(bins = 10)
hist2 grid.arrange(hist1, hist2, ncol=2)
# Look at marginal relationship between sex and beginning salary
= lm(logbsal ~ senior + educ + exper + male, data = bank)
model6a summary(model6a)
# Residual plots - run off the page unless redo margins
par(mfrow=c(2,2), mar=c(4,4,4,1), oma=c(0.5,0.5,0.5,0))
plot(model6a) # Generate residual diagnostic plots
par(mfrow=c(1,1))
# Bonus code: to help with HW1 when looking at 2 categorical variables
<- tally(~ sex + educ, data = bank)
ytable
ytablemosaicplot(ytable, color = c("blue", "light blue"))
prop.table(ytable, 1)
# Other ways to visualize two categorical variables
<- bank |>
bank mutate(education = ifelse(educ < 12, "No high school", "High school"),
education = ifelse(educ > 12, "Some college", education))
ggplot(data = bank) +
geom_bar(mapping = aes(x = sex, fill = education))
ggplot(data = bank) +
geom_bar(mapping = aes(x = sex, fill = education), position = "fill")
library(ggmosaic) # need to have ggmosaic package installed
ggplot(data = bank) +
geom_mosaic(aes(x = product(education, sex), fill=education), na.rm = TRUE)
# Other ways to summarize relationship between two categorical variables
|>
bank group_by(sex, education) |>
summarise (n = n()) |>
mutate(freq = n / sum(n))