library(tidyverse)
42 Homework: IPW and Estimating Treatment Effects
Introduction
In this homework, we’re going to focus on estimating average treatment effects using inverse probability weighting (IPW). We’ll compare what happens with and without IPW and explore the estimation of the average treatment effect on the treated (ATT).
What to Expect
You’ll see a few variations on things we’ve done before.
- Adjusting for covariate shift by averaging predictions. Here that shift is induced by conditional randomization.
- How to use inverse probability weights to make these predictions in a way that gives us unbiased—or nearly unbiased—estimators of \(\Delta_\text{all}\), \(\Delta_1\), etc. Here we’re interpreting these causally.
That’s all useful stuff. But this is also a good opportunity to review skills and ideas from earlier in the semester.
- Approximating the coverage of biased estimators using normal approximation and the \(\text{bias}/\text{standard deviation}\) ratio.1
- Calculating the coverage of complicated point estimators using simulation.2
As usual, we’ll use the tidyverse.
Setup: The Simulation Framework
We’ll use the same simulation framework from the previous homework on simulating experiments. Here’s the setup code.
m = 1000
set.seed(1)
x = sample(c(20,30,40,50), m, replace=TRUE)
y = function(w) { (2 + 10*((x-20)/30 + w*sqrt((x-20)/30))) * runif(m, min=0, max=2) }
po.pop = data.frame(x=x, y0=y(0), y1=y(1))
n = 200
J = sample(1:m, n, replace=FALSE)
po.sam = po.pop[J,]
pi = function(x) { 1/(1+exp(-(x-30)/8 )) }
W = rbinom(n, 1, pi(po.sam$x))
sam = data.frame(w=W, x=po.sam$x, y=ifelse(W==1, po.sam$y1, po.sam$y0))
Estimating the ATE
Now let’s estimate the thing. We’ve shown that the ATE is equal to \(\Delta_\text{all}\), so we can estimate it exactly the same way we estimated \(\Delta_\text{all}\) in our last assignment. We’ll estimate \(\hat\mu\) by (potentially weighted) least squares, then average the difference in predictions \(\textcolor[RGB]{0,191,196}{\hat\mu(1,X_i)} - \textcolor[RGB]{248,118,109}{\hat\mu(0,X_i)}\) over the sample. \[ \begin{aligned} \hat\Delta_\text{all} &= \frac{1}{n} \sum_{i=1}^n \left\{ \textcolor[RGB]{0,191,196}{\hat\mu(1,X_i)} - \textcolor[RGB]{248,118,109}{\hat\mu(0,X_i)} \right\} \\ &\qfor \hat\mu(w,x) = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \gamma(W_i, X_i) \ \left\{ Y_i - m(W_i,X_i) \right\}^2 \end{aligned} \tag{45.1}\]
Below, I’ve packaged all that up into a single function that returns this estimate of the ATE when you give it three arguments.
- A sample \(W_1,X_1,Y_1 \ldots W_n,X_n,Y_n\).
- A formula describing a regression model \(\mathcal{M}\).
- A function \(\gamma\) that computes weights \(\gamma(W_i,X_i)\) for each observation.
constant.weights = function(sam) { rep(1, nrow(sam)) }
estimate.ate = function(sam, formula, gamma=constant.weights) {
model = lm(formula, weights=gamma, data=sam |> mutate(gamma=gamma(sam)))
muhat.0X = predict(model, newdata=sam |> mutate(w=0))
muhat.1X = predict(model, newdata=sam |> mutate(w=1))
mean(muhat.1X - muhat.0X)
}
By default, it uses the weights \(\gamma(w,x)=1\), i.e., it uses unweighted least squares. If we want to use inverse probability weighted least squares, we can pass in this weighting function. We’ll talk about why these are the right weights below in Section 45.1.
ate.weights = function(sam) { ifelse(sam$w==1, 1/pi(sam$x), 1/(1-pi(sam$x))) }
Writing the estimator as a function of the sample makes it easy to calculate its sampling distribution and its bootstrap sampling distribution. We’ll get there. First, let’s just calculate point estimates. We’ll try using both the unweighted and inverse probability weighted least squares predictors in the parallel lines model.
ate.hat.lsq = sam |> estimate.ate(y~w+x)
ate.hat.ipw = sam |> estimate.ate(y~w+x, ate.weights)
And here’s the estimation target.
ate = mean(po.pop$y1 - po.pop$y0)
| target | without ipw | with ipw |
|---|---|---|
| 5.52 | 5.95 | 6.28 |
It looks like we’re doing better without the weights. To get a sense of whether this is just least squares getting lucky, let’s look at the sampling distributions of these two estimators.
To do that, we’ll use two functions defined below.
- sample.and.randomize simulates the process of sampling and randomizing. It takes as input a population of potential outcomes, a sample size \(n\), and a function \(\pi(x)\) that tells us the probability of assigning treatment \(w=1\) to a person with covariate value \(x\). It returns a list of data frames, each describing a sample \(W_1,X_1,Y_1(W_1) \ldots W_n,X_n,Y_n(W_n)\). By default, it gives us 1000 samples like this.
- sampling.distribution is not particularly interesting, but it is useful. It takes a list of samples and a list of estimators, calculates each estimator in each sample, and returns the results as a data frame describing the sampling distribution of each estimator. Here an estimator is a function that takes one argument, a sample like the ones returned by sample.and.randomize, and returns an estimate.
sample.and.randomize = function(po.pop, n, pi, replications=1000) {
1:replications |> map(function(.) {
J = sample(1:nrow(po.pop), n, replace=FALSE)
po.sam = po.pop[J,]
W = rbinom(n, 1, pi(po.sam$x))
sam = data.frame(w=W, x=po.sam$x, y=ifelse(W==1, po.sam$y1, po.sam$y0))
sam
})
}
sampling.distribution = function(samples, estimators) {
samples |> map(function(sam) {
estimators |> map(function(estimator) {
data.frame(value=estimator(sam))
}) |> bind_rows(.id='estimator')
}) |> bind_rows(.id='sample')
}
We can use these functions to calculate the sampling distribution of our two estimators.
samples = po.pop |> sample.and.randomize(n, pi)
estimators = list(
lsq=\(sam)estimate.ate(sam, y~w+x),
ipw=\(sam)estimate.ate(sam, y~w+x, ate.weights))
sampling.dist = samples |> sampling.distribution(estimators)
Here are the results. I’ve used a function defined in the expandable box below to draw my histograms. The one using
inverse probability weighted least squares is drawn in purple and the one using unweighted least squares is drawn in orange. Their means are marked by vertical lines of the same color and the target—the ATE—is marked by a thick green line.
sampling.dist |> plot.sampling.distribution(target=ate)
What can we conclude? Both estimators work pretty well. You can see that the estimator that uses unweighted least squares is slightly biased, but not really enough to worry about.
We know this bias shouldn’t have much impact on coverage, but let’s try to quantify it. We can calculate the bias and standard deviation of that estimator like this.
estimates = sampling.dist |> filter(estimator=='lsq') |> pull(value)
bias = mean(estimates)-ate
se = sd(estimates)
| bias | standard deviation |
|---|---|
| -0.158 | 0.921 |
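To see what numbers like these mean for coverage, we can plug them into the normal approximation mentioned at the top: an interval \(\hat\theta \pm 1.96\sigma_\theta\) covers its target exactly when a standard normal draw shifted by \(\text{bias}/\sigma_\theta\) lands within \(\pm 1.96\). Here’s a small sketch; approx.coverage is a name I’m introducing for illustration, not one from the assignment.

```r
# Normal approximation to coverage for a biased estimator. The interval
# thetahat +/- 1.96*sd covers the target when |Z + bias/sd| <= 1.96
# for Z ~ N(0,1), which gives the difference of normal CDFs below.
approx.coverage = function(bias, sd) {
  r = bias / sd
  pnorm(1.96 - r) - pnorm(-1.96 - r)
}
approx.coverage(bias = -0.158, sd = 0.921)
```

Plugging in the table’s values gives something a touch under 95%, which squares with the claim that this bias isn’t really enough to worry about.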
Locked (Week 13)
When we’re working with real data, we don’t know our estimator’s standard deviation \(\sigma_{\theta}\). We have to estimate it. Let’s do that using the bootstrap. The way we’ve organized our code makes this easy. If we have a list of bootstrap samples, we can just call sampling.distribution, passing in this list and the same estimators as before.
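The resampling step itself isn’t shown here, so here’s a minimal sketch of what it could look like; bootstrap.samples is a name I’m introducing, and its output is meant to plug straight into sampling.distribution.

```r
library(purrr)

# Draw bootstrap samples: resample the rows of sam with replacement,
# once per replication, mirroring the shape of sample.and.randomize's output.
bootstrap.samples = function(sam, replications=1000) {
  1:replications |> map(function(.) {
    I = sample(1:nrow(sam), nrow(sam), replace=TRUE)
    sam[I,]
  })
}
```

With that in hand, something like bootstrap.samples(sam) |> sampling.distribution(estimators) gives a bootstrap analogue of sampling.dist, and the standard deviation of each estimator’s column estimates \(\sigma_\theta\).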
Here’s a version of the table above with this estimated standard deviation added.
| bias | standard deviation | estimated standard deviation |
|---|---|---|
| -0.158 | 0.921 | 1.09 |
It’s not exactly the same, but it’s close.
Let’s see how our bias and our imperfect estimate of our standard deviation affect the coverage of an interval estimate we can really use, \(\hat\theta \pm 1.96 \times \hat\sigma_{\theta}\) where \(\hat\sigma_{\theta}\) is the bootstrap standard deviation.
Locked (Week 13)
Weighting
When we talked about inverse probability weighting in Lecture 14, we talked about using the weights \(\gamma(w,x)=m_{x}/m_{wx}\) to estimate \(\Delta_\text{all}\). That was an exercise—it’s not in the slides—so we’ll review it here. Let’s start with the intuition. By weighting this way, we’re effectively using a sample drawn from an imaginary population in which each column (i.e. at each level of \(x\)) has one red dot and one green dot for each dot of either color in our actual population. As a result, the distributions \(\textcolor[RGB]{248,118,109}{p_{x \mid 0}}\) and \(\textcolor[RGB]{0,191,196}{p_{x \mid 1}}\) of \(x\) in our imaginary population are the same as the distribution \(\textcolor[RGB]{160,32,240}{p_{x}}\) of \(x\) in our actual population.
In the plot below, you can see this in action. On the left, I’ve plotted a population. We see the population, the mean in each group, and the distribution of the covariate \(x\) in each group and in the population as a whole. On the right, you can see the imaginary population we’ve created by using the weights \(\gamma(w,x)=m_{x}/m_{wx}\). I’ve scaled the size of each dot in proportion to its weight. I’ve calculated weighted means within each group and the ‘weighted covariate distribution’ for each group and the population as a whole, i.e., the proportion of weight at each level of \(x\) rather than the proportion of people. You can see that in the imaginary population, the overall (purple) covariate distribution is the same as it is in the unweighted sample, but the covariate distributions of the red and green groups now match the overall distribution. That’s what using the weights \(\gamma(w,x)=m_x/m_{wx}\) does.
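To make the ‘weighted covariate distribution’ concrete, here’s a small sketch of how one might compute it: the share of total weight at each level of \(x\), rather than the share of people. The function name is mine, not part of the assignment.

```r
library(dplyr)

# Proportion of total weight at each level of x, given a data frame d
# and a vector of weights gamma (one weight per row of d).
weighted.x.dist = function(d, gamma) {
  d |>
    mutate(g = gamma) |>
    group_by(x) |>
    summarize(p = sum(g), .groups="drop") |>
    mutate(p = p / sum(p))
}
```

With constant weights this is just the usual covariate distribution; with the weights \(\gamma(w,x)=m_x/m_{wx}\), computed within the red and green groups, it reproduces the purple overall distribution.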
There’s something about this that doesn’t quite make sense in randomized experiments. What does it mean to talk about the covariate distributions \(\textcolor[RGB]{0,191,196}{p_{x \mid 1}}\) and \(\textcolor[RGB]{248,118,109}{p_{x \mid 0}}\) among the treated and untreated people in our population when only the ones we sample are actually randomized to treatment? Or the number of people \(\textcolor[RGB]{0,191,196}{m_{1x}}\) and \(\textcolor[RGB]{248,118,109}{m_{0x}}\) with covariate value \(x\) in those groups?
Let’s think about the plot above. The population that’s shown looks a lot like a bigger version of the samples we’ve been working with because it is. To get it, I randomly assigned a treatment (i.e. a color) to each person in the population we’ve been sampling from, exactly the same way I’ve been assigning treatment to our samples. Like this.
W = rbinom(nrow(po.pop), 1, pi(po.pop$x))
pop = data.frame(x=po.pop$x, w=W, y=ifelse(W==1, po.pop$y1, po.pop$y0))
If we like, we can think of our sample as being drawn from this pre-randomized population. Like this.
J = sample(1:nrow(pop), n, replace=FALSE)
sam = pop[J,]
Why? Let’s think about how we get our sample in physical terms. Suppose we have a population of people and each of them has a coin. To sample without replacement, we write each person’s phone number on a card, shuffle the deck, and call the first \(n\) numbers. Does it matter if the people we call flip their coin before we call them or after? It doesn’t. Earlier, in Chapter 41, they flipped after we called; in the code right above, they flipped before.
If we think like this, the numbers \(\textcolor[RGB]{248,118,109}{m_{0x}}\) and \(\textcolor[RGB]{0,191,196}{m_{1x}}\) of untreated and treated people in our population with covariate value \(x\) do make sense. But they’re random. And they depend on a bunch of coin flips that have absolutely nothing to do with the people in our sample. So instead of using this random denominator \(m_{wx}\) in our weights \(\gamma(w,x)=m_x/m_{wx}\), let’s average over all these coin flips. Let’s use \(\gamma(w,x)=m_x / \mathop{\mathrm{E}}[m_{wx}]\) instead.
It turns out that this is very easy. Recall that we’ve randomized treatment with probability \(\pi(x)\) that depends only on the covariate \(x\), so \(m_{wx}\) is the sum of \(m_x\) coin flips that come up heads (\(W_j=1\)) with probability \(\pi(x)\). \[ \begin{aligned} \mathop{\mathrm{E}}\qty[m_{wx}] &= \mathop{\mathrm{E}}\qty[\sum_{j:X_j=x} 1_{=w}(W_j)] \\ &= \sum_{j:X_j=x} \mathop{\mathrm{E}}\qty[1_{=w}(W_j)] \\ &= m_x \mathop{\mathrm{E}}\qty[1_{=w}(W_j)] \\ &= m_x \begin{cases} \pi(x) & \qif w=1 \\ 1-\pi(x) & \qif w=0 \end{cases} \end{aligned} \tag{45.2}\] As a result, we can write the weights \(\gamma(w,x)=m_x / \mathop{\mathrm{E}}[m_{wx}]\) in terms of \(\pi(x)\). The factor of \(m_x\) cancels out. \[ \begin{aligned} \gamma(w,x) &= \color[RGB]{0,191,196}\frac{m_x}{m_x \pi(x)} &&= \color[RGB]{0,191,196}\frac{1}{\pi(x)} & \color[RGB]{0,191,196}\qif w=1 \\ &= \color[RGB]{248,118,109}\frac{m_x}{m_x \{1-\pi(x)\}} &&= \color[RGB]{248,118,109}\frac{1}{1-\pi(x)} & \color[RGB]{248,118,109}\qif w=0 \end{aligned} \tag{45.3}\] Those are the weights we’re using in the code above. We don’t need to know the covariate distribution in the population to use them; we just need to know how treatment was randomized. Cool, right? This is where the name ‘inverse probability weighting’ comes from. The weight we give each observation is the (multiplicative) inverse of the probability it was randomized to the treatment it received.
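If you want to convince yourself of Equation 45.2 numerically, a quick simulation does it. This is a self-contained sketch: the count m.x is made up, and pi is copied from the setup code.

```r
# Simulate the treated count m_1x among m.x people at covariate level
# x = 40, and compare its average over many randomizations to
# m.x * pi(40), as Equation 45.2 says.
pi = function(x) { 1/(1+exp(-(x-30)/8)) }   # assignment probability from the setup
set.seed(2)
m.x = 250                                    # a made-up count m_x
counts = replicate(2000, sum(rbinom(m.x, 1, pi(40))))
c(simulated=mean(counts), expected=m.x*pi(40))
```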
The randomization probability \(\pi(x)\) plays a pretty pivotal role, so let’s take a look at it. And, while we’re at it, let’s look at \(\gamma(w, x)\).
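A quick table gets at the same thing. This sketch tabulates \(\pi(x)\) and both weights at the four covariate levels from the setup.

```r
library(dplyr)

# pi(x) and the ATE weights gamma(1,x) = 1/pi(x) and gamma(0,x) = 1/(1-pi(x))
# at the covariate levels used in the simulation.
pi = function(x) { 1/(1+exp(-(x-30)/8)) }   # as in the setup code
data.frame(x=c(20,30,40,50)) |>
  mutate(pi = pi(x),
         gamma.1 = 1/pi,       # weight if treated
         gamma.0 = 1/(1-pi))   # weight if untreated
```

Treated people are rare at low levels of \(x\), so they get large weights \(1/\pi(x)\); untreated people are rare at high levels, so they get large weights \(1/\{1-\pi(x)\}\).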
Let’s take a look at what these weights do in our sample. On the left, I’ve plotted our sample, the means in each group, and the covariate distribution in each group and in the sample as a whole. On the right, you can see all the same stuff for the imaginary sample we’ve created using the weights \(\textcolor[RGB]{0,191,196}{\gamma(1,x)=1/\pi(x)}\) and \(\textcolor[RGB]{248,118,109}{\gamma(0,x)=1/\{1-\pi(x)\}}\). To compare to what happens using weights \(\gamma(w,x)=m_x/m_{wx}\) based on our pre-randomized population, flip to the second tab.
Having done all this, let’s think about bias. That is, after all, why we’re inverse probability weighting. This is going to look a lot like our analysis in Lecture 14.
Let’s think about \(\tilde\mu\), the minimizer of expected weighted squared error. \[ \tilde\mu = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \mathop{\mathrm{E}}\qty[ \gamma(W_i,X_i) \left\{ Y_i - m(W_i,X_i) \right\}^2 ] \] What do we know about this? Following the same logic as in Lecture 14, we get a ‘weighted orthogonality condition’. \[ \mathop{\mathrm{E}}\qty[ \gamma(W_i,X_i) \left\{ \mu(W_i,X_i) - \tilde\mu(W_i,X_i) \right\} m(W_i,X_i) ] = 0 \qqtext{ for all } m \in \mathcal{M}. \]
Locked (Week 13)
Now let’s take a moment to acknowledge a quirk. Exercise 45.3 does not necessarily imply that if we do the same thing in our sample, we’ll get an unbiased estimate of the ATE.3
Let’s think about two estimators. \[ \begin{aligned} \hat\theta_0 &= \frac{1}{n}\sum_{i=1}^n 1_{=1}(W_i) \gamma(W_i,X_i) Y_i - \frac{1}{n}\sum_{i=1}^n 1_{=0}(W_i) \gamma(W_i,X_i) Y_i \\ \hat\theta_1 &= \frac{1}{n}\sum_{i=1}^n \hat\mu(1,X_i) - \hat\mu(0,X_i) \qfor \hat\mu(w,x) = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \gamma(W_i, X_i) \ \left\{ Y_i - m(W_i,X_i) \right\}^2 \end{aligned} \] When you use the parallel lines model, these are pretty similar, but they are not the same.
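The first of these is simple enough to write in a couple of lines. Here’s a sketch; theta.hat.0 is my name for it, and gamma is a weighting function like ate.weights above.

```r
# The simple weighted-difference estimator theta.hat.0 from the display
# above: a weighted average of treated outcomes minus a weighted average
# of untreated outcomes, each normalized by n (via mean over all rows).
theta.hat.0 = function(sam, gamma) {
  g = gamma(sam)
  mean((sam$w==1) * g * sam$y) - mean((sam$w==0) * g * sam$y)
}
```

The second, \(\hat\theta_1\), is what estimate.ate computes; comparing the two on the same sample is a good way to see how similar, but not identical, they are.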
Locked (Week 13)
Locked (Week 13)
Locked (Week 13)
Estimating the ATT
Now let’s think about estimating the ATT: the average treatment effect on the treated.
\[ \text{ATT} = \mathop{\mathrm{E}}[Y_i(1) - Y_i(0) \mid W_i=1] \]
Using an argument a lot like the one we used to show that the ATE was equal to \(\Delta_\text{all}\) above, we can show that this is equal to \(\Delta_1\). This argument, like that one, uses the law of iterated expectations to boil things down to the ‘within-group’ identification implied by conditional randomization. But there’s a trick to using the law of iterated expectations here because we’re already conditioning on \(W_i=1\). The trick is to start by getting rid of the conditioning. Here’s a formula that’ll help us do that. \[ \mathop{\mathrm{E}}[ Z \mid W=1] = \mathop{\mathrm{E}}[ Z \ 1_{=1}(W) ] \ / \ P(W=1) \qqtext{ for any random variables $W$ and $Z$ with $W \in \{0,1\}$ }. \tag{46.1}\]
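Equation 46.1 is easy to check numerically on made-up data, since both sides reduce to the same ratio of sums. Here’s a self-contained sketch.

```r
# Check E[Z | W=1] = E[Z 1(W=1)] / P(W=1) on simulated draws:
# both sides equal sum(Z*W) / sum(W).
set.seed(3)
W = rbinom(100, 1, 0.4)
Z = rnorm(100)
lhs = mean(Z[W==1])
rhs = mean(Z * (W==1)) / mean(W==1)
all.equal(lhs, rhs)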
Typically, we estimate the ATT with an analogous sample average of \(\hat\mu\). \[ \widehat{ATT} = \textcolor[RGB]{0,191,196}{\frac{1}{N_1}\sum_{i:W_i=1}}\qty{ \textcolor[RGB]{0,191,196}{\hat\mu(1,X_i)} - \textcolor[RGB]{248,118,109}{\hat \mu(0,X_i)} } \] This should look familiar. We’ve been using the same formula to estimate \(\Delta_1\).
Now let’s run through a few of the same steps we did for the ATE to estimate the ATT. I’ll give you some code to help out. All you’ve got to do, in addition to running the code above, is write functions estimate.att and att.weights analogous to estimate.ate and ate.weights from Chapter 45. Here’s a starting point: I’ve copied those from Chapter 45 and changed the names. What I haven’t done is change the code so it actually estimates the ATT. You won’t have to change much, but you will have to change a few lines. I’ve marked them for you.
constant.weights = function(sam) { rep(1, nrow(sam)) }
estimate.att = function(sam, formula, gamma=constant.weights) {
model = lm(formula, weights=gamma, data=sam |> mutate(gamma=gamma(sam)))
muhat.0X = predict(model, newdata=sam |> mutate(w=0))
muhat.1X = predict(model, newdata=sam |> mutate(w=1))
mean(muhat.1X - muhat.0X)   # <1>
}
att.weights = function(sam) {
ifelse(sam$w==1, 1/pi(sam$x), 1/(1-pi(sam$x)))   # <2>
}
1. You’ll need to change this if you want it to estimate the ATT.
2. You’ll need to change this if you want your inverse probability weights to work.
att = mean((po.pop$y1 - po.pop$y0)*pi(po.pop$x)/mean(pi(po.pop$x)))   # <1>
estimators = list(
lsq=\(sam)estimate.att(sam, y~w+x),
ipw=\(sam)estimate.att(sam, y~w+x, att.weights))
sampling.dist = samples |> sampling.distribution(estimators)
sampling.dist |> plot.sampling.distribution(target=att)
1. See Note 46.1 for a discussion of this formula.
Locked (Week 13)
Now let’s do some inverse probability weighting. When we talked about the inverse probability weights for the ATE, we started with the idea of using the weights \(\gamma(w,x)=m_x/m_{wx}\) and then replaced \(m_{wx}\) with its expectation. Let’s do something similar for the weights \(\gamma(w,x)=m_{1x}/m_{wx}\). These are like the weights we focused on in Lecture 14 when we were estimating \(\Delta_0\), but with the colors flipped—\(\Delta_1\) is an average over the green dots and \(\Delta_0\) is an average over the red ones.
Locked (Week 13)
If you’ve gotten these right, you should be able to get a nearly-unbiased estimator of the ATT using them.
Locked (Week 13)
The Impact of Model Choice
Now let’s think briefly about what happens when we use different models. I’ve tried out four below. For each, I’ve drawn two plots.
- The least squares and inverse probability weighted least squares predictors on top of the sample they’re fit to. Squares show the unweighted least squares predictions; diamonds show the inverse probability weighted least squares predictions.
- The sampling distributions of the ATT estimators using least squares and inverse probability weighted least squares. The unweighted version is in orange; the inverse probability weighted one is in purple.
Let’s use these to do an exercise on visualizing and communicating about bias and coverage.
Locked (Week 13)
formula = y~w
sam |> lsq.plot(formula, gammas)
formula = y~w+x
sam |> lsq.plot(formula, gammas)
formula = y~w*x
sam |> lsq.plot(formula, gammas)
formula = y~w+factor(x)
sam |> lsq.plot(formula, gammas)
This’ll be similar to these exercises from Lecture 13.↩︎
This’ll be similar to this exercise from the Week 9 Homework.↩︎
This isn’t just a randomization thing. In Lecture 14, we did a calculation similar to Exercise 45.3 in a section of the slides called ‘Proving Unbiasedness’, but we never actually proved that our estimator was unbiased. It isn’t. Here, you’ll get a chance to see why. And why it’s pretty close to unbiased anyway.↩︎