---
title: "Document title"
author: "my name"
format:
  html:
    embed-resources: true
---
Homework 1
Review on R, RStudio, and MLR
Getting started
This One Time
Go to our RStudio Server at http://turing.cornellcollege.edu:8787/
Go to your class folder named “STA363”
Create a new folder named FirstInitial_Last_Initial_HW, substituting your own initials.
Every Homework
Each of your assignments will begin with the following steps.
Find the instructions on our website: https://stats-tgeorge.github.io/STA363_AdvRegF25/.
Go to our RStudio Server at http://turing.cornellcollege.edu:8787/.
Create a new project with a sensible name such as homework_1, and place that project in the course folder you created.
Create a new Quarto document and give it a sensible name such as hw1.
In the YAML header, add the fields shown at the top of this page: title, author, and format with html and embed-resources: true (add what you don’t have). The embed-resources component will make your final rendered html self-contained.
Instructions
Include the relevant R code as well as full sentences answering each of the questions (e.g., if I ask for the average, you can output the answer in R, but also write a full sentence stating the answer). Be sure to save your files frequently!
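As a minimal sketch of this pattern, the chunk below computes an average and then reports it in a full sentence. The built-in mtcars dataset and the names avg_mpg and answer_sentence are assumptions chosen only for illustration; they are not part of the homework data.

```r
# Illustration only: mtcars is a built-in R dataset, not homework data
avg_mpg <- mean(mtcars$mpg)

# Output the answer in R ...
avg_mpg

# ... and also build a full sentence that reports it
answer_sentence <- sprintf(
  "The average fuel efficiency of the cars in mtcars is %.1f miles per gallon.",
  avg_mpg
)
cat(answer_sentence, "\n")
```

In a Quarto document, the same value can also be placed directly in your prose with inline R code, by writing `r round(avg_mpg, 1)` (with the backticks) inside a sentence.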
Data for the homework will be in the STA363_inst_files -> data folder.
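One possible way to read a data file from that folder is sketched below. The relative path is an assumption: it presumes STA363_inst_files sits inside your current working directory, which may not match your setup, so adjust it as needed. The file name kingCountyHouses.csv is the dataset used later in this assignment, and the object names data_path and houses are hypothetical.

```r
# Assumed location: STA363_inst_files/data relative to the working directory.
# Adjust this path if your folder lives somewhere else.
data_path <- file.path("STA363_inst_files", "data", "kingCountyHouses.csv")

if (file.exists(data_path)) {
  # Read the CSV into a data frame and take a quick look at its structure
  houses <- read.csv(data_path)
  str(houses)
} else {
  # Nothing is read if the path is wrong; report where we looked
  message("File not found at: ", data_path)
}
```

Checking file.exists() first gives a readable message when the path is wrong, instead of an error partway through rendering.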
Exercises
All problems are from the main textbook, Beyond Multiple Linear Regression by Paul Roback and Julie Legler (abbreviated BMLR), Chapters 1-9. It is freely available online.
- (C1B) Applications that do not violate assumptions for inference in LLSR. Identify the response and explanatory variable(s) for each problem. Write the LLSR assumptions for inference in the context of each study.
Women’s Heights. A random selection of women aged 20-24 years are selected and their shoe size is used to predict their height.
- (C2a-d) Applications that do violate assumptions for inference in LLSR. All of the examples in this section have at least one violation of the LLSR assumptions for inference. Begin by identifying the response and explanatory variables. Then, identify which model assumption(s) are violated.
Low Birthweights. Researchers are attempting to see if socioeconomic status and parental stability are predictive of low birthweight. They classify a child as having a low birthweight if their birthweight is less than 2,500 grams.
Clinical Trial I. A Phase II clinical trial is designed to compare the number of patients getting relief at different dose levels. 100 patients get dose A, 100 get dose B, and 100 get dose C.
Canoes and Zip Codes. For each of over 27,000 overnight permits for the Boundary Waters Canoe area, the zip code for the group leader has been translated to the distance traveled and socioeconomic data. Thus, for each zip code we can model the number of trips made as a function of distance traveled and various socioeconomic measures.
Clinical Trial II. A randomized clinical trial investigated postnatal depression and the use of an estrogen patch. Patients were randomly assigned to either use the patch or not. Depression scores were recorded on 6 different visits.
(C3) Kentucky Derby. The next set of questions is related to the Kentucky Derby case study from this chapter.
- Discuss the pros and cons of using side-by-side boxplots vs. stacked histograms to illustrate the relationship between year and track condition in Figure 1.3.
- Why is a scatterplot more informative than a correlation coefficient to describe the relationship between speed of the winning horse and year in Figure 1.3?
- How might you incorporate a fourth variable, say number of starters, into Figure 1.4?
- Explain why \(\epsilon_i\) in Equation 1.1 measures the vertical distance from a data point to the regression line.
- In the first t-test in Section 1.6.1 (t = 11.251 for \(H_0:\beta_1 = 0\)), notice that \(t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} = \frac{.026}{.0023} = 11.251\). Why is the t-test based on the ratio of the estimated slope to its standard error?
- In Equation 1.2, explain why the t-test corresponding to \(\beta_{1}\) is equivalent to an independent-samples t-test under equal variances. Why is the equal variance assumption needed?
- When interpreting \(\beta_2\) in Equation 1.3, why do we have to be careful to say “for a fixed year” or “after adjusting for year”? Is it wrong to leave a qualifier like that off?
- Interpret in context a 95% confidence interval for \(\beta_0\) in Model 4.
- State (in context) the result of a t-test for \(\beta_1\) in Model 4.
- Why is there no \(\epsilon_i\) term in Equation 1.4?
- If you considered the interaction between two continuous variables (like yearnew and starters), how would you provide an interpretation for that coefficient in context?
- Interpret (in context) the LLSR estimates for \(\beta_3\) and \(\beta_5\) in Equation 1.5.
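The ratio form of the t-statistic mentioned in the questions above can be checked numerically. The sketch below uses simulated data, not the Kentucky Derby data; the names x, y, and fit, the sample size, and the true slope of 0.5 are all assumptions made for illustration.

```r
# Simulated data for illustration only (not the Kentucky Derby data)
set.seed(363)
n <- 100
x <- runif(n, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)

# Fit a simple linear regression and pull out the coefficient table
fit <- lm(y ~ x)
coefs <- summary(fit)$coefficients

# t-statistic computed by hand: estimated slope divided by its standard error
t_manual <- coefs["x", "Estimate"] / coefs["x", "Std. Error"]

# t-statistic as reported by summary()
t_reported <- coefs["x", "t value"]

# The two agree up to floating-point tolerance
all.equal(t_manual, t_reported)

# 95% confidence intervals for the intercept and slope
confint(fit, level = 0.95)
```

Working from unrounded estimates matters here: dividing rounded values of the slope and standard error will generally not reproduce the reported t-statistic exactly.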
(G3) 1. Housing prices and log transformations. The dataset kingCountyHouses.csv contains data on over 20,000 houses sold in King County, Washington (Kaggle, 2018a). The dataset includes the following variables:
- price = selling price of the house
- date = date house was sold, measured in days since January 1, 2014
- bedrooms = number of bedrooms
- bathrooms = number of bathrooms
- sqft = interior square footage
- floors = number of floors
- waterfront = 1 if the house has a view of the waterfront, 0 otherwise
- yr_built = year the house was built
- yr_renovated = 0 if the house was never renovated, the year the house was renovated otherwise
We wish to create a linear model to predict a house’s selling price.
- Generate appropriate graphs and summary statistics detailing both price and sqft individually and then together. What do you notice?
- Fit a simple linear regression model with price as the response variable and sqft as the explanatory variable (Model 1). Interpret the slope coefficient \(\beta_1\). Are all conditions met for linear regression?
- Create a new variable, logprice, the natural log of price. Fit Model 2, where logprice is now the response variable and sqft is still the explanatory variable. Write out the regression line equation.
- How does logprice change when sqft increases by 1?
- Recall that \(\log(a) - \log(b) = \log\big(\frac{a}{b}\big)\), and use this to derive how price changes as sqft increases by 1.
- Are LLSR assumptions satisfied in Model 2? Why or why not?
- Create a new variable, logsqft, the natural log of sqft. Fit Model 3 where price and logsqft are the response and explanatory variables, respectively. Write out the regression line equation.
- How does predicted price change as logsqft increases by 1 in Model 3?
- How does predicted price change as sqft increases by 1%? As a hint, this is the same as multiplying sqft by 1.01.
- Are LLSR assumptions satisfied in Model 3? Why or why not?
- Fit Model 4, with logprice and logsqft as the response and explanatory variables, respectively. Write out the regression line equation.
- In Model 4, what is the effect on price corresponding to a 1% increase in sqft?
- Are LLSR assumptions satisfied in Model 4? Why or why not?
- Find another explanatory variable which can be added to Model 4 to create a model with a higher adjusted \(R^2\) value. Interpret the coefficient of this added variable.
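The mechanics of the steps above (creating logged variables, fitting the models, and comparing adjusted \(R^2\)) might be sketched as follows. This uses simulated data rather than kingCountyHouses.csv, so the numbers it produces say nothing about the real houses. The data frame houses_sim and its data-generating process are assumptions for illustration, and Model 4 is taken here to be the log-log model with logprice as the response, which is consistent with the question about a 1% increase in sqft.

```r
# Simulated stand-in for the housing data (illustration only)
set.seed(2014)
n <- 500
sqft <- round(runif(n, min = 500, max = 5000))
# Assumed data-generating process: price is log-linear in log(sqft)
price <- exp(11 + 0.8 * log(sqft) + rnorm(n, sd = 0.3))
houses_sim <- data.frame(price = price, sqft = sqft)

# Summary statistics and a scatterplot of the two variables together
summary(houses_sim[, c("price", "sqft")])
plot(price ~ sqft, data = houses_sim)

# New variables: natural logs of price and sqft (log() is base e in R)
houses_sim$logprice <- log(houses_sim$price)
houses_sim$logsqft <- log(houses_sim$sqft)

# The four model forms described in the exercise
model1 <- lm(price ~ sqft, data = houses_sim)
model2 <- lm(logprice ~ sqft, data = houses_sim)
model3 <- lm(price ~ logsqft, data = houses_sim)
model4 <- lm(logprice ~ logsqft, data = houses_sim)

# Coefficients for writing out a regression line equation
coef(model4)

# Adjusted R-squared for each model
models <- list(model1 = model1, model2 = model2,
               model3 = model3, model4 = model4)
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
adj_r2

# Residuals vs. fitted and normal Q-Q plots for checking LLSR assumptions
par(mfrow = c(1, 2))
plot(model4, which = 1:2)
```

Note that adjusted \(R^2\) values are only directly comparable between models that share the same response variable, so price models and logprice models should be compared with care.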
Submission
When you are finished with your homework, be sure to Render the final document. Once rendered, download and submit your file as follows:
- Find the .html file in your Files pane (on the bottom right of the screen)
- Click the check box next to the file
- Click the blue gear above and then click “Export” to download
- Submit your final html document to the respective assignment on Moodle