Homework 1

Review of R, RStudio, and MLR

Getting started

This One Time

Go to our RStudio Server at http://turing.cornellcollege.edu:8787/
Go to your class folder named “STA363”
Create a new folder called FirstInitial_LastInitial_HW, replacing the placeholders with your initials.

Every Homework

Each of your assignments will begin with the following steps.

Find the instructions on our website: https://stats-tgeorge.github.io/STA363_AdvRegF25/.
Go to our RStudio Server at http://turing.cornellcollege.edu:8787/.
Create a new project with a sensible name, such as homework_1, and place it in the course folder you created.
Create a new Quarto document and give it a sensible name, such as hw1.
In the YAML, add the following (add what you don’t have). The embed-resources option will make your final rendered HTML self-contained.
```
---
title: "Document title"
author: "my name"
format:
  html:
    embed-resources: true
---
```

Instructions

Include the relevant R code as well as full sentences answering each of the questions (e.g., if I ask for the average, you can output the answer in R, but also write a full sentence stating the answer). Be sure to save your files frequently!
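
As a minimal sketch of that expectation, the R code below computes an average and then reports it in a full sentence. It uses R's built-in mtcars dataset purely as a stand-in (an assumption for illustration, not one of the homework datasets), and the names avg_mpg and answer are hypothetical.

```r
# Illustration only: mtcars ships with R and is NOT homework data.
avg_mpg <- mean(mtcars$mpg)

# Output the answer in R ...
avg_mpg

# ... and also state it in a full sentence.
answer <- sprintf("The average fuel efficiency is %.2f miles per gallon.", avg_mpg)
print(answer)
```

In a Quarto document, the code would sit in an R code chunk and the sentence would be written as ordinary text beneath it.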

Data for the homework will be in the STA363_inst_files -> data folder.

Exercises

All problems are from the main textbook, Beyond Multiple Linear Regression by Paul Roback and Julie Legler (abbreviated BMLR), Chapters 1-9. It is freely available online.

(C1B) Applications that do not violate assumptions for inference in LLSR. Identify the response and explanatory variable(s) for each problem. Write the LLSR assumptions for inference in the context of each study.

Women’s Heights. A random sample of women aged 20-24 years is selected, and their shoe size is used to predict their height.

(C2a-d) Applications that do violate assumptions for inference in LLSR. All of the examples in this section have at least one violation of the LLSR assumptions for inference. Begin by identifying the response and explanatory variables. Then, identify which model assumption(s) are violated.

Low Birthweights. Researchers are attempting to see if socioeconomic status and parental stability are predictive of low birthweight. They classify a child as having a low birthweight if their birthweight is less than 2,500 grams.
Clinical Trial I. A Phase II clinical trial is designed to compare the number of patients getting relief at different dose levels. 100 patients get dose A, 100 get dose B, and 100 get dose C.
Canoes and Zip Codes. For each of over 27,000 overnight permits for the Boundary Waters Canoe area, the zip code for the group leader has been translated to the distance traveled and socioeconomic data. Thus, for each zip code we can model the number of trips made as a function of distance traveled and various socioeconomic measures.
Clinical Trial II. A randomized clinical trial investigated postnatal depression and the use of an estrogen patch. Patients were randomly assigned to either use the patch or not. Depression scores were recorded on 6 different visits.

(C3) Kentucky Derby. The next set of questions is related to the Kentucky Derby case study from this chapter.
1. Discuss the pros and cons of using side-by-side boxplots vs. stacked histograms to illustrate the relationship between year and track condition in Figure 1.3.
2. Why is a scatterplot more informative than a correlation coefficient to describe the relationship between speed of the winning horse and year in Figure 1.3?
3. How might you incorporate a fourth variable, say number of starters, into Figure 1.4?
4. Explain why \(\epsilon_i\) in Equation 1.1 measures the vertical distance from a data point to the regression line.
5. In the first t-test in Section 1.6.1 (t = 11.251 for \(H_0:\beta_1 = 0\)), notice that \(t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} = \frac{.026}{.0023} = 11.251\). Why is the t-test based on the ratio of the estimated slope to its standard error?
6. In Equation 1.2, explain why the t-test corresponding to \(\beta_{1}\) is equivalent to an independent-samples t-test under equal variances. Why is the equal variance assumption needed?
7. When interpreting \(\beta_2\) in Equation 1.3, why do we have to be careful to say "for a fixed year" or "after adjusting for year"? Is it wrong to leave a qualifier like that off?
8. Interpret in context a 95% confidence interval for \(\beta_0\) in Model 4.
9. State (in context) the result of a t-test for \(\beta_1\) in Model 4.
10. Why is there no \(\epsilon_i\) term in Equation 1.4?
11. If you considered the interaction between two continuous variables (like yearnew and starters), how would you provide an interpretation for that coefficient in context?
12. Interpret (in context) the LLSR estimates for \(\beta_3\) and \(\beta_5\) in Equation 1.5.
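
For the t-test questions above, the following is a rough sketch of where the numbers in an R regression summary come from. It fits a model to simulated data (hypothetical, not the Kentucky Derby data) and recomputes the slope's t statistic by hand as estimate divided by standard error. The names sim, fit, coefs, and t_by_hand are made up for illustration.

```r
set.seed(363)

# Simulated data (hypothetical): NOT the Kentucky Derby data
sim <- data.frame(x = 1:50)
sim$y <- 2 + 0.5 * sim$x + rnorm(50, sd = 3)

fit <- lm(y ~ x, data = sim)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit)$coefficients

# t statistic for the slope, computed by hand as estimate / standard error
t_by_hand <- coefs["x", "Estimate"] / coefs["x", "Std. Error"]
t_by_hand

# The same value, as reported by summary()
coefs["x", "t value"]
```

Note that a ratio computed from rounded estimates will generally differ slightly from the t value R reports, since R works with the unrounded values.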

(G3) Housing prices and log transformations. The dataset kingCountyHouses.csv contains data on over 20,000 houses sold in King County, Washington (Kaggle, 2018a). The dataset includes the following variables:
- price = selling price of the house
- date = date house was sold, measured in days since January 1, 2014
- bedrooms = number of bedrooms
- bathrooms = number of bathrooms
- sqft = interior square footage
- floors = number of floors
- waterfront = 1 if the house has a view of the waterfront, 0 otherwise
- yr_built = year the house was built
- yr_renovated = 0 if the house was never renovated; otherwise, the year the house was renovated
We wish to create a linear model to predict a house’s selling price.
1. Generate appropriate graphs and summary statistics detailing both price and sqft individually and then together. What do you notice?
2. Fit a simple linear regression model with price as the response variable and sqft as the explanatory variable (Model 1). Interpret the slope coefficient \(\beta_1\). Are all conditions met for linear regression?
3. Create a new variable, logprice, the natural log of price. Fit Model 2, where logprice is now the response variable and sqft is still the explanatory variable. Write out the regression line equation.
4. How does logprice change when sqft increases by 1?
5. Recall that \(\log(a) - \log(b) = \log\big(\frac{a}{b}\big)\), and use this to derive how price changes as sqft increases by 1.
6. Are LLSR assumptions satisfied in Model 2? Why or why not?
7. Create a new variable, logsqft, the natural log of sqft. Fit Model 3 where price and logsqft are the response and explanatory variables, respectively. Write out the regression line equation.
8. How does predicted price change as logsqft increases by 1 in Model 3?
9. How does predicted price change as sqft increases by 1%? As a hint, this is the same as multiplying sqft by 1.01.
10. Are LLSR assumptions satisfied in Model 3? Why or why not?
11. Fit Model 4, with logprice and logsqft as the response and explanatory variables, respectively. Write out the regression line equation.
12. In Model 4, what is the effect on price corresponding to a 1% increase in sqft?
13. Are LLSR assumptions satisfied in Model 4? Why or why not?
14. Find another explanatory variable which can be added to Model 4 to create a model with a higher adjusted \(R^2\) value. Interpret the coefficient of this added variable.
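
The mechanics used in this exercise (creating a log-transformed variable, fitting a model with lm(), and checking diagnostics) might look something like the sketch below. It deliberately uses simulated stand-in data with made-up names (toy, size, cost, fit_log) rather than kingCountyHouses.csv, so treat it as an illustration of the workflow, not a solution.

```r
set.seed(2014)

# Simulated stand-in data (hypothetical): NOT kingCountyHouses.csv
n <- 200
toy <- data.frame(size = runif(n, min = 500, max = 4000))
toy$cost <- exp(11 + 0.0004 * toy$size + rnorm(n, sd = 0.3))

# Create log-transformed variables; log() is the natural log in R
toy$logcost <- log(toy$cost)
toy$logsize <- log(toy$size)

# Fit a model with a logged response and an untransformed predictor
fit_log <- lm(logcost ~ size, data = toy)
summary(fit_log)

# Standard residual diagnostic plots for checking LLSR assumptions
par(mfrow = c(2, 2))
plot(fit_log)

# Adjusted R-squared, useful when comparing models with different predictors
summary(fit_log)$adj.r.squared
```

Swapping logsize in for size, or cost in for logcost, gives the other combinations of transformed and untransformed variables.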

Submission

When you are finished with your homework, be sure to Render the final document. Once rendered, you can download your file as follows:

Find the .html file in your Files pane (on the bottom right of the screen)
Click the check box next to the file
Click the blue gear above, then click “Export” to download
Submit your final HTML document to the respective assignment on Moodle