BIOS 1 - Homework 2
Levi Waldron
2023-05-03
Homework2.Rmd
Source code and data for this assignment
Note: The code used to create this document is here.
In this homework we will again use the NHEFS dataset. Take a look at the codebook linked to from https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/.
This time we will test the hypothesis that quitting smoking between
1971 and 1981 (variable qsmk
) reduces the probability of
death by 2001 (variable death
).
Data import and recoding
Download the dataset, import it into R, and recode variables as instructed in Homework 1.
Question 1 (2 points): filter any missing data
Remove any observations that are missing any of the following variables:
death
active
age
education
exercise
qsmk
race
sex
smokeintensity
smkyrs
wt82_71
You would ordinarily want to perform a sensitivity analysis and consider the possibility of bias from informative censoring, but you can ignore that possibility for the purposes of this assignment.
Question 2 (4 points): Table 1
Create a table of descriptive statistics like in HW1, but this time
stratify it by the exposure variable qsmk
Question 3 (4 points): fit crude and adjusted logistic regression models
Fit simple and adjusted multiple logistic regression models for the
probability of death with quitting smoking as the exposure. In the
multivariate model, adjust for sex, race, age, education, and exercise
as hypothesized confounders. Use a B-spline from the
splines
package to correct for age, ie:
\[ death \sim qsmk + sex + race + bs(age) + education + exercise \]
For the purpose of this question, assume that conditional exchangeability holds when controlling for these confounders.
Interpret the coefficient of qsmk
in the crude
model and in the adjusted model. Compare them.
Question 4 (3 points): model diagnostics
Create a binned residual plot using the binnedplot
function from the arm
library. Interpret the plot.
Question 5 (7 points): Inverse probability weighting
a (2 points)) Create a model to calculate the propensity of exposure given the values of the hypothesized confounders (propensity score). This is your “denominator” model of Lecture 11, Slide 15. ie,
\[ \textbf{denom.fit}:qsmk \sim sex + race + bs(age) + education + exercise \]
Also fit your “numerator model”, \[ \textbf{num.fit}: qsmk \sim 1 \]
b (2 points)) Calculate IPTW weights using this model. Hint - you can use code like this:
pdenom <- predict(denom.fit, type = "response")
pnum <- predict(num.fit, type = "response")
nhefs$iptw <- ifelse(nhefs$qsmk == 0, ((1-pn.qsmk)/(1-pd.qsmk)),
(pn.qsmk/pd.qsmk))
c (2 points)) Use the geeglm
function from the
geepack
library to use these weights to estimate the causal
effect of quitting smoking on death.
Notes: - the id
argument to geepack
is
arbitrary in this case because there is no grouping or hierarchical
structure of participants. You can use the seqn
variable or
1:nrow(nhefs)
. - The corstr
argument is also
arbitrary for the same reason; if you don’t specify it you will be using
the default corstr="independence"
- Remember that when
using these weights, do not include confounders in the
model.
d (1 point)) Compare the coefficient and standard error for quitting
smoking to Question 2, and comment. Recall the discussion in lecture
about geepack
standard errors being too conservative.