library(tidyverse)
42 Homework: IPW and Estimating Treatment Effects
Introduction
In this homework, we’re going to focus on estimating average treatment effects using inverse probability weighting (IPW). We’ll compare what happens with and without IPW and explore the estimation of the average treatment effect on the treated (ATT).
What to Expect
You’ll see a few variations on things we’ve done before.
- Adjusting for covariate shift by averaging predictions. Here that shift is induced by conditional randomization.
- How to use inverse probability weights to make these predictions in a way that gives us unbiased—or nearly unbiased—estimators of \(\Delta_\text{all}\), \(\Delta_1\), etc. Here we’re interpreting these causally.
That’s all useful stuff. But this is also a good opportunity to review skills and ideas from earlier in the semester.
- Approximating the coverage of biased estimators using normal approximation and the \(\text{bias}/\text{standard deviation}\) ratio.1
- Calculating the coverage of complicated point estimators using simulation.2
As usual, we’ll use the tidyverse.
Setup: The Simulation Framework
We’ll use the same simulation framework from the previous homework on simulating experiments. Here’s the setup code.
m = 1000
set.seed(1)
x = sample(c(20,30,40,50), m, replace=TRUE)
y = function(w) { (2 + 10*((x-20)/30 + w*sqrt((x-20)/30))) * runif(m, min=0, max=2) }
po.pop = data.frame(x=x, y0=y(0), y1=y(1))
n = 200
J = sample(1:m, n, replace=FALSE)
po.sam = po.pop[J,]
pi = function(x) { 1/(1+exp(-(x-30)/8 )) }
W = rbinom(n, 1, pi(po.sam$x))
sam = data.frame(w=W, x=po.sam$x, y=ifelse(W==1, po.sam$y1, po.sam$y0))
Estimating the ATE
Now let’s estimate the thing. We’ve shown that the ATE is equal to \(\Delta_\text{all}\), so we can estimate it exactly the same way we estimated \(\Delta_\text{all}\) in our last assignment. We’ll estimate \(\hat\mu\) by (potentially weighted) least squares, then average the difference in predictions \(\textcolor[RGB]{0,191,196}{\hat\mu(1,X_i)} - \textcolor[RGB]{248,118,109}{\hat\mu(0,X_i)}\) over the sample. \[ \begin{aligned} \hat\Delta_\text{all} &= \frac{1}{n} \sum_{i=1}^n \left\{ \textcolor[RGB]{0,191,196}{\hat\mu(1,X_i)} - \textcolor[RGB]{248,118,109}{\hat\mu(0,X_i)} \right\} \\ &\qfor \hat\mu(w,x) = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \gamma(W_i, X_i) \ \left\{ Y_i - m(W_i,X_i) \right\}^2 \end{aligned} \tag{45.1}\]
Below, I’ve packaged all that up into a single function that returns this estimate of the ATE when you give it three arguments.
- A sample \(W_1,X_1,Y_1 \ldots W_n,X_n,Y_n\).
- A formula describing a regression model \(\mathcal{M}\).
- A function \(\gamma\) that computes weights \(\gamma(W_i,X_i)\) for each observation.
constant.weights = function(sam) { rep(1, nrow(sam)) }
estimate.ate = function(sam, formula, gamma=constant.weights) {
model = lm(formula, weights=gamma, data=sam |> mutate(gamma=gamma(sam)))
muhat.0X = predict(model, newdata=sam |> mutate(w=0))
muhat.1X = predict(model, newdata=sam |> mutate(w=1))
mean(muhat.1X - muhat.0X)
}
By default, it uses the weights \(\gamma(w,x)=1\), i.e., it uses unweighted least squares. If we want to use inverse probability weighted least squares, we can pass in this weighting function. We’ll talk about why these are the right weights below in Section 45.1.
ate.weights = function(sam) { ifelse(sam$w==1, 1/pi(sam$x), 1/(1-pi(sam$x))) }
Writing the estimator as a function of the sample makes it easy to calculate its sampling distribution and its bootstrap sampling distribution. We’ll get there. First, let’s just calculate point estimates. We’ll try using both the unweighted and inverse probability weighted least squares predictors in the parallel lines model.
ate.hat.lsq = sam |> estimate.ate(y~w+x)
ate.hat.ipw = sam |> estimate.ate(y~w+x, ate.weights)
And here’s the estimation target.
ate = mean(po.pop$y1 - po.pop$y0)
| target | without ipw | with ipw |
|---|---|---|
| 5.52 | 5.95 | 6.28 |
It looks like we’re doing better without the weights. To get a sense of whether this is just least squares getting lucky, let’s look at the sampling distributions of these two estimators.
To do that, we’ll use two functions defined below.
- sample.and.randomize simulates the process of sampling and randomizing. It takes as input a population of potential outcomes, a sample size \(n\), and a function \(\pi(x)\) that tells us the probability of assigning treatment \(w=1\) to a person with covariate value \(x\). It returns a list of data frames, each describing a sample \(W_1,X_1,Y_1(W_1) \ldots W_n,X_n,Y_n(W_n)\). By default, it gives us 1000 samples like this.
- sampling.distribution is not particularly interesting, but it is useful. It takes a list of samples and a list of estimators, calculates each estimator in each sample, and returns the results as a data frame describing the sampling distribution of each estimator. Here an estimator is a function that takes one argument, a sample like the ones returned by sample.and.randomize, and returns an estimate.
sample.and.randomize = function(po.pop, n, pi, replications=1000) {
1:replications |> map(function(.) {
J = sample(1:nrow(po.pop), n, replace=FALSE)
po.sam = po.pop[J,]
W = rbinom(n, 1, pi(po.sam$x))
sam = data.frame(w=W, x=po.sam$x, y=ifelse(W==1, po.sam$y1, po.sam$y0))
sam
})
}
sampling.distribution = function(samples, estimators) {
samples |> map(function(sam) {
estimators |> map(function(estimator) {
data.frame(value=estimator(sam))
}) |> bind_rows(.id='estimator')
}) |> bind_rows(.id='sample')
}
We can use these functions to calculate the sampling distribution of our two estimators.
samples = po.pop |> sample.and.randomize(n, pi)
estimators = list(
lsq=\(sam)estimate.ate(sam, y~w+x),
ipw=\(sam)estimate.ate(sam, y~w+x, ate.weights))
sampling.dist = samples |> sampling.distribution(estimators)
Here are the results. I’ve used a function defined in the expandable box below to draw my histograms. The one using
inverse probability weighted least squares is drawn in purple and the one using unweighted least squares is drawn in orange. Their means are marked by vertical lines of the same color and the target—the ATE—is marked by a thick green line.
sampling.dist |> plot.sampling.distribution(target=ate)
What can we conclude? Both estimators work pretty well. You can see that the estimator that uses unweighted least squares is slightly biased, but not really enough to worry about.
We know this bias shouldn’t have much impact on coverage, but let’s try to quantify it. We can calculate the bias and standard deviation of that estimator like this.
estimates = sampling.dist |> filter(estimator=='lsq') |> pull(value)
bias = mean(estimates)-ate
se = sd(estimates)
| bias | standard deviation |
|---|---|
| -0.158 | 0.921 |
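To see what numbers like these mean for coverage, we can plug them into the normal approximation mentioned at the top: an interval \(\hat\theta \pm 1.96\sigma_\theta\) covers its target exactly when a standard normal draw shifted by \(\text{bias}/\sigma_\theta\) lands within \(\pm 1.96\). Here’s a small sketch; approx.coverage is a name I’m introducing for illustration, not one from the assignment.

```r
# Normal approximation to coverage for a biased estimator. The interval
# thetahat +/- 1.96*sd covers the target when |Z + bias/sd| <= 1.96
# for Z ~ N(0,1), which gives the difference of normal CDFs below.
approx.coverage = function(bias, sd) {
  r = bias / sd
  pnorm(1.96 - r) - pnorm(-1.96 - r)
}
approx.coverage(bias = -0.158, sd = 0.921)
```

Plugging in the table’s values gives something a touch under 95%, which squares with the claim that this bias isn’t really enough to worry about.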
Locked (Week 13)
When we’re working with real data, we don’t know our estimator’s standard deviation \(\sigma_{\theta}\). We have to estimate it. Let’s do that using the bootstrap. The way we’ve organized our code makes this easy. If we have a list of bootstrap samples, we can just call sampling.distribution, passing in this list and the same estimators as before.
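The resampling step itself isn’t shown here, so here’s a minimal sketch of what it could look like; bootstrap.samples is a name I’m introducing, and its output is meant to plug straight into sampling.distribution.

```r
library(purrr)

# Draw bootstrap samples: resample the rows of sam with replacement,
# once per replication, mirroring the shape of sample.and.randomize's output.
bootstrap.samples = function(sam, replications=1000) {
  1:replications |> map(function(.) {
    I = sample(1:nrow(sam), nrow(sam), replace=TRUE)
    sam[I,]
  })
}
```

With that in hand, something like bootstrap.samples(sam) |> sampling.distribution(estimators) gives a bootstrap analogue of sampling.dist, and the standard deviation of each estimator’s column estimates \(\sigma_\theta\).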
Here’s a version of the table above with this estimated standard deviation added.
| bias | standard deviation | estimated standard deviation |
|---|---|---|
| -0.158 | 0.921 | 1.09 |
It’s not exactly the same, but it’s close.
Let’s see how our bias and our imperfect estimate of our standard deviation affect the coverage of an interval estimate we can really use, \(\hat\theta \pm 1.96 \times \hat\sigma_{\theta}\) where \(\hat\sigma_{\theta}\) is the bootstrap standard deviation.
Locked (Week 13)
Weighting
When we talked about inverse probability weighting in Lecture 14, we talked about using the weights \(\gamma(w,x)=m_{x}/m_{wx}\) to estimate \(\Delta_\text{all}\). That was an exercise—it’s not in the slides—so we’ll review it here. Let’s start with the intuition. By weighting this way, we’re effectively using a sample drawn from an imaginary population in which each column (i.e. at each level of \(x\)) has one red dot and one green dot for each dot of either color in our actual population. As a result, the distributions \(\textcolor[RGB]{248,118,109}{p_{x \mid 0}}\) and \(\textcolor[RGB]{0,191,196}{p_{x \mid 1}}\) of \(x\) in our imaginary population are the same as the distribution \(\textcolor[RGB]{160,32,240}{p_{x}}\) of \(x\) in our actual population.
In the plot below, you can see this in action. On the left, I’ve plotted a population. We see the population, the mean in each group, and the distribution of the covariate \(x\) in each group and in the population as a whole. On the right, you can see the imaginary population we’ve created by using the weights \(\gamma(w,x)=m_{x}/m_{wx}\). I’ve scaled the size of each dot in proportion to its weight. I’ve calculated weighted means within each group and the ‘weighted covariate distribution’ for each group and the population as a whole, i.e., the proportion of weight at each level of \(x\) rather than the proportion of people. You can see that in the imaginary population, the overall (purple) covariate distribution is the same as it is in the unweighted sample, but the covariate distributions of the red and green groups now match the overall distribution. That’s what using the weights \(\gamma(w,x)=m_x/m_{wx}\) does.
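To make the ‘weighted covariate distribution’ concrete, here’s a small sketch of how one might compute it: the share of total weight at each level of \(x\), rather than the share of people. The function name is mine, not part of the assignment.

```r
library(dplyr)

# Proportion of total weight at each level of x, given a data frame d
# and a vector of weights gamma (one weight per row of d).
weighted.x.dist = function(d, gamma) {
  d |>
    mutate(g = gamma) |>
    group_by(x) |>
    summarize(p = sum(g), .groups="drop") |>
    mutate(p = p / sum(p))
}
```

With constant weights this is just the usual covariate distribution; with the weights \(\gamma(w,x)=m_x/m_{wx}\), computed within the red and green groups, it reproduces the purple overall distribution.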
There’s something about this that doesn’t quite make sense in randomized experiments. What does it mean to talk about the covariate distributions \(\textcolor[RGB]{0,191,196}{p_{x \mid 1}}\) and \(\textcolor[RGB]{248,118,109}{p_{x \mid 0}}\) among the treated and untreated people in our population when only the ones we sample are actually randomized to treatment? Or the number of people \(\textcolor[RGB]{0,191,196}{m_{1x}}\) and \(\textcolor[RGB]{248,118,109}{m_{0x}}\) with covariate value \(x\) in those groups?
Let’s think about the plot above. The population that’s shown looks a lot like a bigger version of the samples we’ve been working with because it is. To get it, I randomly assigned a treatment (i.e. a color) to each person in the population we’ve been sampling from, exactly the same way I’ve been assigning treatment to our samples. Like this.
W = rbinom(nrow(po.pop), 1, pi(po.pop$x))
pop = data.frame(x=po.pop$x, w=W, y=ifelse(W==1, po.pop$y1, po.pop$y0))
If we like, we can think of our sample as being drawn from this pre-randomized population. Like this.
J = sample(1:nrow(pop), n, replace=FALSE)
sam = pop[J,]
Why? Let’s think about how we get our sample in physical terms. Suppose we have a population of people and each of them has a coin. To sample without replacement, we write each person’s phone number on a card, shuffle the deck, and call the first \(n\) numbers. Does it matter if the people we call flip their coin before we call them or after? It doesn’t. Earlier, in Chapter 41, they flipped after we called; in the code right above, they flipped before.
If we think like this, the numbers \(\textcolor[RGB]{248,118,109}{m_{0x}}\) and \(\textcolor[RGB]{0,191,196}{m_{1x}}\) of untreated and treated people in our population with covariate value \(x\) do make sense. But they’re random. And they depend on a bunch of coin flips that have absolutely nothing to do with the people in our sample. So instead of using this random denominator \(m_{wx}\) in our weights \(\gamma(w,x)=m_x/m_{wx}\), let’s average over all these coin flips. Let’s use \(\gamma(w,x)=m_x / \mathop{\mathrm{E}}[m_{wx}]\) instead.
It turns out that this is very easy. Recall that we’ve randomized treatment with probability \(\pi(x)\) that depends only on the covariate \(x\), so \(m_{wx}\) is the sum of \(m_x\) coin flips that come up heads (\(W_j=1\)) with probability \(\pi(x)\). \[ \begin{aligned} \mathop{\mathrm{E}}\qty[m_{wx}] &= \mathop{\mathrm{E}}\qty[\sum_{j:X_j=x} 1_{=w}(W_j)] \\ &= \sum_{j:X_j=x} \mathop{\mathrm{E}}\qty[1_{=w}(W_j)] \\ &= m_x \mathop{\mathrm{E}}\qty[1_{=w}(W_j)] \\ &= m_x \begin{cases} \pi(x) & \qif w=1 \\ 1-\pi(x) & \qif w=0 \end{cases} \end{aligned} \tag{45.2}\] As a result, we can write the weights \(\gamma(w,x)=m_x / \mathop{\mathrm{E}}[m_{wx}]\) in terms of \(\pi(x)\). The factor of \(m_x\) cancels out. \[ \begin{aligned} \gamma(w,x) &= \color[RGB]{0,191,196}\frac{m_x}{m_x \pi(x)} &&= \color[RGB]{0,191,196}\frac{1}{\pi(x)} & \color[RGB]{0,191,196}\qif w=1 \\ &= \color[RGB]{248,118,109}\frac{m_x}{m_x \{1-\pi(x)\}} &&= \color[RGB]{248,118,109}\frac{1}{1-\pi(x)} & \color[RGB]{248,118,109}\qif w=0 \end{aligned} \tag{45.3}\] Those are the weights we’re using in the code above. We don’t need to know the covariate distribution in the population to use them; we just need to know how treatment was randomized. Cool, right? This is where the name ‘inverse probability weighting’ comes from. The weight we give each observation is the (multiplicative) inverse of the probability it was randomized to the treatment it received.
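If you want to convince yourself of Equation 45.2 numerically, a quick simulation does it. This is a self-contained sketch: the count m.x is made up, and pi is copied from the setup code.

```r
# Simulate the treated count m_1x among m.x people at covariate level
# x = 40, and compare its average over many randomizations to
# m.x * pi(40), as Equation 45.2 says.
pi = function(x) { 1/(1+exp(-(x-30)/8)) }   # assignment probability from the setup
set.seed(2)
m.x = 250                                    # a made-up count m_x
counts = replicate(2000, sum(rbinom(m.x, 1, pi(40))))
c(simulated=mean(counts), expected=m.x*pi(40))
```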
The randomization probability \(\pi(x)\) plays a pretty pivotal role, so let’s take a look at it. And, while we’re at it, let’s look at \(\gamma(w, x)\).
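A quick table gets at the same thing. This sketch tabulates \(\pi(x)\) and both weights at the four covariate levels from the setup.

```r
library(dplyr)

# pi(x) and the ATE weights gamma(1,x) = 1/pi(x) and gamma(0,x) = 1/(1-pi(x))
# at the covariate levels used in the simulation.
pi = function(x) { 1/(1+exp(-(x-30)/8)) }   # as in the setup code
data.frame(x=c(20,30,40,50)) |>
  mutate(pi = pi(x),
         gamma.1 = 1/pi,       # weight if treated
         gamma.0 = 1/(1-pi))   # weight if untreated
```

Treated people are rare at low levels of \(x\), so they get large weights \(1/\pi(x)\); untreated people are rare at high levels, so they get large weights \(1/\{1-\pi(x)\}\).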
Let’s take a look at what these weights do in our sample. On the left, I’ve plotted our sample, the means in each group, and the covariate distribution in each group and in the sample as a whole. On the right, you can see all the same stuff for the imaginary sample we’ve created using the weights \(\textcolor[RGB]{0,191,196}{\gamma(1,x)=1/\pi(x)}\) and \(\textcolor[RGB]{248,118,109}{\gamma(0,x)=1/\{1-\pi(x)\}}\). To compare to what happens using weights \(\gamma(w,x)=m_x/m_{wx}\) based on our pre-randomized population, flip to the second tab.
Having done all this, let’s think about bias. That is, after all, why we’re inverse probability weighting. This is going to look a lot like our analysis in Lecture 14.
Let’s think about \(\tilde\mu\), the minimizer of expected weighted squared error. \[ \tilde\mu = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \mathop{\mathrm{E}}\qty[ \gamma(W_i,X_i) \left\{ Y_i - m(W_i,X_i) \right\}^2 ] \] What do we know about this? Following the same logic as in Lecture 14, we get a ‘weighted orthogonality condition’. \[ \mathop{\mathrm{E}}\qty[ \gamma(W_i,X_i) \left\{ \mu(W_i,X_i) - \tilde\mu(W_i,X_i) \right\} m(W_i,X_i) ] = 0 \qqtext{ for all } m \in \mathcal{M}. \]
Locked (Week 13)
Now let’s take a moment to acknowledge a quirk. Exercise 45.3 does not necessarily imply that if we do the same thing in our sample, we’ll get an unbiased estimate of the ATE.3
Let’s think about two estimators. \[ \begin{aligned} \hat\theta_0 &= \frac{1}{n}\sum_{i=1}^n 1_{=1}(W_i) \gamma(W_i,X_i) Y_i - \frac{1}{n}\sum_{i=1}^n 1_{=0}(W_i) \gamma(W_i,X_i) Y_i \\ \hat\theta_1 &= \frac{1}{n}\sum_{i=1}^n \hat\mu(1,X_i) - \hat\mu(0,X_i) \qfor \hat\mu(w,x) = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \gamma(W_i, X_i) \ \left\{ Y_i - m(W_i,X_i) \right\}^2 \end{aligned} \] When you use the parallel lines model, these are pretty similar, but they are not the same.
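The first of these is simple enough to write in a couple of lines. Here’s a sketch; theta.hat.0 is my name for it, and gamma is a weighting function like ate.weights above.

```r
# The simple weighted-difference estimator theta.hat.0 from the display
# above: a weighted average of treated outcomes minus a weighted average
# of untreated outcomes, each normalized by n (via mean over all rows).
theta.hat.0 = function(sam, gamma) {
  g = gamma(sam)
  mean((sam$w==1) * g * sam$y) - mean((sam$w==0) * g * sam$y)
}
```

The second, \(\hat\theta_1\), is what estimate.ate computes; comparing the two on the same sample is a good way to see how similar, but not identical, they are.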
Locked (Week 13)
Locked (Week 13)
Locked (Week 13)
Estimating the ATT
Now let’s think about estimating the ATT: the average treatment effect on the treated.
\[ \text{ATT} = \mathop{\mathrm{E}}[Y_i(1) - Y_i(0) \mid W_i=1] \]
Using an argument a lot like the one we used to show that the ATE was equal to \(\Delta_\text{all}\) above, we can show that this is equal to \(\Delta_1\). This argument, like that one, uses the law of iterated expectations to boil things down to the ‘within-group’ identification implied by conditional randomization. But there’s a trick to using the law of iterated expectations here because we’re already conditioning on \(W_i=1\). The trick is to start by getting rid of the conditioning. Here’s a formula that’ll help us do that. \[ \mathop{\mathrm{E}}[ Z \mid W=1] = \mathop{\mathrm{E}}[ Z \ 1_{=1}(W) ] \ / \ P(W=1) \qqtext{ for any random variables $W$ and $Z$ with $W \in \{0,1\}$ }. \tag{46.1}\]
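Equation 46.1 is easy to check numerically on made-up data, since both sides reduce to the same ratio of sums. Here’s a self-contained sketch.

```r
# Check E[Z | W=1] = E[Z 1(W=1)] / P(W=1) on simulated draws:
# both sides equal sum(Z*W) / sum(W).
set.seed(3)
W = rbinom(100, 1, 0.4)
Z = rnorm(100)
lhs = mean(Z[W==1])
rhs = mean(Z * (W==1)) / mean(W==1)
all.equal(lhs, rhs)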
Typically, we estimate the ATT with an analogous sample average of \(\hat\mu\). \[ \widehat{ATT} = \textcolor[RGB]{0,191,196}{\frac{1}{N_1}\sum_{i:W_i=1}}\qty{ \textcolor[RGB]{0,191,196}{\hat\mu(1,X_i)} - \textcolor[RGB]{248,118,109}{\hat \mu(0,X_i)} } \] This should look familiar. We’ve been using the same formula to estimate \(\Delta_1\).
Now let’s run through a few of the same steps we did for the ATE to estimate the ATT. I’ll give you some code to help out. All you’ve got to do, in addition to running the code above, is write functions estimate.att and att.weights analogous to estimate.ate and ate.weights from Chapter 45. Here’s a starting point: I’ve copied those from Chapter 45 and changed the names. What I haven’t done is change the code so it actually estimates the ATT. You won’t have to change much, but you will have to change a few lines. I’ve marked them for you.
constant.weights = function(sam) { rep(1, nrow(sam)) }
estimate.att = function(sam, formula, gamma=constant.weights) {
model = lm(formula, weights=gamma, data=sam |> mutate(gamma=gamma(sam)))
muhat.0X = predict(model, newdata=sam |> mutate(w=0))
muhat.1X = predict(model, newdata=sam |> mutate(w=1))
mean(muhat.1X - muhat.0X)   # <1>
}
att.weights = function(sam) {
ifelse(sam$w==1, 1/pi(sam$x), 1/(1-pi(sam$x)))   # <2>
}
1. You’ll need to change this if you want it to estimate the ATT.
2. You’ll need to change this if you want your inverse probability weights to work.
att = mean((po.pop$y1 - po.pop$y0)*pi(po.pop$x)/mean(pi(po.pop$x)))   # <1>
estimators = list(
lsq=\(sam)estimate.att(sam, y~w+x),
ipw=\(sam)estimate.att(sam, y~w+x, att.weights))
sampling.dist = samples |> sampling.distribution(estimators)
sampling.dist |> plot.sampling.distribution(target=att)
1. See Note 46.1 for a discussion of this formula.
Locked (Week 13)
Now let’s do some inverse probability weighting. When we talked about the inverse probability weights for the ATE, we started with the idea of using the weights \(\gamma(w,x)=m_x/m_{wx}\) and then replaced \(m_{wx}\) with its expectation. Let’s do something similar for the weights \(\gamma(w,x)=m_{1x}/m_{wx}\). These are like the weights we focused on in Lecture 14 when we were estimating \(\Delta_0\), but with the colors flipped—\(\Delta_1\) is an average over the green dots and \(\Delta_0\) is an average over the red ones.
Locked (Week 13)
If you’ve gotten these right, you should be able to get a nearly-unbiased estimator of the ATT using them.
Locked (Week 13)
The Impact of Model Choice
Now let’s think briefly about what happens when we use different models. I’ve tried out four below. For each, I’ve drawn two plots.
- The least squares and inverse probability weighted least squares predictors on top of the sample they’re fit to. Squares show the unweighted least squares predictions; diamonds show the inverse probability weighted least squares predictions.
- The sampling distributions of the ATT estimators using least squares and inverse probability weighted least squares. The unweighted version is in orange; the inverse probability weighted one is in purple.
Let’s use these to do an exercise on visualizing and communicating about bias and coverage.
Locked (Week 13)
formula = y~w
sam |> lsq.plot(formula, gammas)
formula = y~w+x
sam |> lsq.plot(formula, gammas)
formula = y~w*x
sam |> lsq.plot(formula, gammas)
formula = y~w+factor(x)
sam |> lsq.plot(formula, gammas)
This’ll be similar to these exercises from Lecture 13.↩︎
This’ll be similar to this exercise from the Week 9 Homework.↩︎
This isn’t just a randomization thing. In Lecture 14, we did a calculation similar to Exercise 45.3 in a section of the slides called ‘Proving Unbiasedness’, but we never actually proved that our estimator was unbiased. It isn’t. Here, you’ll get a chance to see why. And why it’s pretty close to unbiased anyway.↩︎