38  Homework: Simulating Experiments and Identifying the ATE

Introduction

Summary

In this homework, we’re going to focus on simulating conditionally randomized experiments and understanding what it means for the average treatment effect (ATE) to be identified. This is foundational material for understanding causal inference with covariates.

The new part is how we’re going to check that our estimators work. We’re going to simulate a conditionally randomized experiment. Having done this, we can show our estimator’s sampling distribution and check coverage just like we did in the last homework.

This is a real thing

We’ve been doing simulation all semester. That’s how we’ve been visualizing the sampling distribution of our estimators and calculating coverage. This is just bringing that into a causal inference context. This isn’t just something people do in class. It’s how people check that their estimators work and compare them to each other. Many papers in machine learning conferences and statistics journals use one or both of these simulation techniques. Maybe even the majority of them. Andrew Gelman has a blog post with the title “Yes, I really really really like fake-data simulation, and I can’t stop talking about it.”

What to Expect

You’ll see a few variations on things we’ve done before.

  1. Testing estimators for causal effects using simulated data. Here the simulation involves a randomization step.
  2. Understanding how conditional randomization leads to identification of the ATE.

As usual, we’ll use the tidyverse.

library(tidyverse)

Simulating the Data

We’re going to work with the example we talked about in Lecture 10.¹ We’ve got a list of people who might donate to a political campaign—that’s our population—and we’re considering two ways of contacting them to ask for money: an email or a phone call. We’re running a pilot study to decide which one to use. In particular, we want to estimate the average treatment effect of a call (vs. an email) on the amount they donate.

\[ \begin{aligned} \text{ATE} &= \frac{1}{m}\sum_{j=1}^m \qty{\textcolor[RGB]{0,191,196}{y_j(1)} - \textcolor[RGB]{248,118,109}{y_j(0)}} \qqtext{ where } \\ & \textcolor[RGB]{0,191,196}{y_j(1)} \text{ is amount person $j$ would donate if contacted by phone} \\ & \textcolor[RGB]{248,118,109}{y_j(0)} \text{ is amount person $j$ would donate if contacted by email} \end{aligned} \]

We call \(y_j(w)\) the potential outcome for person \(j\) under treatment \(w\). These are, of course, just things we’re imagining—we either call them or we don’t—but that’s decision-making. When you think through your options, you’re imagining the consequences of some choices that you aren’t ultimately going to make.

To look into what happens when we try to estimate the ATE, we’ll sample people from our population and then randomize them to a treatment. In particular, we’ll do it like this.

  1. Sample without replacement from our population to choose who we’re going to contact.
  2. Assign each of them to treatment by flipping a weighted coin whose probability of heads may depend on that person’s age.

To simulate an experiment like this, we have to start one step back. ‘Step 0’ is making up the population.

Step 0

When we make up our population, we haven’t yet assigned these people to treatment, so we describe each person by their age \(x_j\) and their potential outcomes \(y_j(\cdot)\).² This means our whole population is a list \(x_1, y_1(\cdot) \ldots x_m, y_m(\cdot)\). Here’s some code that makes up a population like this.

We’re going to use a random number generator to help us do this, but this isn’t a part of the random process we’re analyzing statistically. It’s just a way of making up \(m\) different things without having to do it all by hand. That means we only do this once, even when we’re simulating our study over and over to calculate the sampling distribution of our estimator. The population isn’t changing.

m = 1000
set.seed(1)

# ages: drawn uniformly-at-random from four values
x = sample(c(20,30,40,50), m, replace=TRUE)
# potential outcome under treatment w: donations tend to increase with age,
# with an extra age-dependent bump when w=1; each call to y() draws fresh noise
y = function(w) { (2 + 10*((x-20)/30 + w*sqrt((x-20)/30))) * runif(m, min=0, max=2) }

po.pop = data.frame(x=x, y0=y(0), y1=y(1))
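
Because we made up the whole population, potential outcomes and all, we can compute the true ATE directly; it’s the number our estimators will be trying to recover. This little calculation isn’t part of the assignment, but it’s exactly the kind of ground truth fake-data simulation gives us. The name ATE below is mine.

# The true ATE: the population average of the treatment effects y_j(1) - y_j(0).
# We can only compute this because we made up both potential outcomes for everyone.
ATE = mean(po.pop$y1 - po.pop$y0)
ATE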

Step 1

We choose a random sample \(X_1, Y_1(\cdot) \ldots X_n, Y_n(\cdot)\) by sampling without replacement from the population. These random variables \(X_i, Y_i(\cdot)\) are sampled uniformly-at-random from our population, so \(\mathop{\mathrm{E}}[Y_i(w)]=\frac{1}{m}\sum_{j=1}^m y_j(w)\).

n = 200
# choose which n people we contact by sampling indices without replacement
J = sample(1:m, n, replace=FALSE)
po.sam = po.pop[J,]
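
As a quick sanity check (mine, not part of the assignment): because we sample uniformly-at-random, the sample mean of each potential outcome should land near the corresponding population mean, though it won’t match exactly in any one sample.

# Sample vs. population mean of y_j(1): close, but not identical.
c(sample = mean(po.sam$y1), population = mean(po.pop$y1))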

Step 2

For each person \(i\) we’ve chosen, we flip a coin with probability \(\pi(X_i)\) of landing heads. Then, taking the corresponding potential outcome for each person, we get the sample we actually observe: \(W_1, Y_1(W_1) \ldots W_n, Y_n(W_n)\). We call \(Y_i(W_i)\), which is sometimes written simply as \(Y_i\), the realized outcome.

# treatment probability as a function of age: older people are more likely to get a call
pi = function(x) { 1/(1+exp(-(x-30)/8)) }
# one weighted coin flip per sampled person
W = rbinom(n, 1, pi(po.sam$x))
# the sample we observe: treatment, covariate, and realized outcome
sam = data.frame(w=W, x=po.sam$x, y=ifelse(W==1, po.sam$y1, po.sam$y0))
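
Here’s another quick check, again not part of the assignment: within each age group, the fraction of people assigned to treatment should be close to the coin’s probability \(\pi(x)\).

# Fraction treated at each age vs. the assignment probability pi(x).
sam %>%
  group_by(x) %>%
  summarize(fraction.treated = mean(w), pi.x = pi(first(x)))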

Visualizing our Simulation Process

By flipping through the tabs below, you can see the process of simulating our experiment.

  1. In the first tab, you can see the population of potential outcomes we’re sampling from. Each pair \(\textcolor[RGB]{248,118,109}{y_j(0)}, \textcolor[RGB]{0,191,196}{y_j(1)}\) is drawn as a pair of points connected by a thin line. I’ve made them hollow because they’re potential outcomes—they don’t necessarily happen.
  2. In the second, you see the 200 of these that we’ve sampled.
  3. In the third, the circle corresponding to the realized outcome gets filled in.
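
If you want to draw a plot like the third tab’s yourself, here’s a minimal sketch; it’s not the code used for the figures above. It jitters each person’s age once so the pair of potential outcomes stays connected, draws both potential outcomes as hollow circles, and fills in the realized one. Conveniently, ggplot’s default two-color palette is exactly the pair of colors used in the text above.

jitter.x = runif(n, -2, 2)
plot.df = data.frame(
  i = rep(1:n, 2),                   # person index, repeated once per potential outcome
  x = rep(po.sam$x + jitter.x, 2),   # jitter each age once so pairs stay aligned
  y = c(po.sam$y0, po.sam$y1),
  w = factor(rep(c(0, 1), each = n)),
  realized = c(W == 0, W == 1)       # which potential outcome actually happened
)
ggplot(plot.df, aes(x = x, y = y, color = w)) +
  geom_line(aes(group = i), color = "gray", linewidth = 0.2) +  # thin line connecting each pair
  geom_point(shape = 1) +                                       # hollow circles: potential outcomes
  geom_point(data = filter(plot.df, realized), shape = 16)      # filled circles: realized outcomes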

Identifying the ATE

Let’s start by reviewing our identification results from Lecture 11. We know these two sampling and randomization conditions are satisfied because they’re built right into the code above.

  1. \(X_1,Y_1(\cdot) \ldots X_n,Y_n(\cdot)\) are sampled uniformly-at-random from our population of covariate/potential-outcome pairs \(x_1,y_1(\cdot) \ldots x_m, y_m(\cdot)\).
  2. The treatment \(W_i\) is a binary random variable whose probability \(\pi(X_i)\) of being one depends only on the covariate \(X_i\).

When these conditions are satisfied, we can identify the mean potential outcome among people with each level of the covariate. That means we can work out a formula for it in terms of the distribution of the data we actually observe: the treatment, covariate, and realized-outcome triples \(W_1,X_1,Y_1(W_1) \ldots W_n,X_n,Y_n(W_n)\). The mean of the potential outcome \(y_j(w)\) for treatment \(w\) among people with covariate value \(x\) is the conditional mean \(\mu(w,x)\) of the realized outcome \(Y_i(W_i)\) given treatment \(W_i=w\) and covariate value \(X_i=x\).

\[ \begin{aligned} \mu(w,x) &\overset{\texttip{\small{\unicode{x2753}}}{This is just notation. $\mu(w,x)$ is defined by this equation.}}{=} \mathop{\mathrm{E}}[Y_i(W_i) \mid W_i=w, X_i=x] \\ &\overset{\texttip{\small{\unicode{x2753}}}{$Y_i(W_i)=Y_i(w)$ when $W_i=w$}}{=} \mathop{\mathrm{E}}[Y_i(w) \mid W_i=w, X_i=x] \\ &\overset{\texttip{\small{\unicode{x2753}}}{irrelevance of conditionally independent conditioning variables}}{=} \mathop{\mathrm{E}}[Y_i(w) \mid X_i=x] \\ &\overset{\texttip{\small{\unicode{x2753}}}{$Y_i(w)$ is sampled uniformly at random from the potential outcomes $y_j(w)$ of the $m_x$ population units with $x_j=x$}}{=} \frac{1}{m_x} \sum_{j:x_j=x} y_j(w) \qfor m_x = \sum_{j=1}^m 1_{=x}(x_j) \end{aligned} \tag{41.1}\]
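
To make Equation 41.1 concrete, here’s a sketch of a numerical check you can run; it isn’t part of the assignment. We compare the population side at \(w=1\), \(x=40\)—the mean of \(y_j(1)\) among 40-year-olds—to a Monte Carlo approximation of the observed-data side, \(\mathop{\mathrm{E}}[Y_i \mid W_i=1, X_i=40]\), obtained by simulating the experiment many times. The names mu.pop, mu.mc, and one.sim are mine.

# Population side: mean potential outcome y_j(1) among people with x_j = 40.
mu.pop = mean(po.pop$y1[po.pop$x == 40])

# Observed-data side: average, over many simulated experiments, of the
# mean realized outcome among treated 40-year-olds.
one.sim = function() {
  J = sample(1:m, n, replace=FALSE)
  ps = po.pop[J,]
  W = rbinom(n, 1, pi(ps$x))
  Y = ifelse(W == 1, ps$y1, ps$y0)
  mean(Y[W == 1 & ps$x == 40])
}
mu.mc = mean(replicate(1000, one.sim()))
c(population = mu.pop, monte.carlo = mu.mc)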

From here, showing that the average treatment effect (ATE) is equal to the familiar quantity \(\Delta_{\text{all}}\) is a matter of averaging.

\[ \begin{aligned} \Delta_{\text{all}} &\overset{\texttip{\small{\unicode{x2753}}}{This equation defines $\Delta_{\text{all}}$}}{=} \mathop{\mathrm{E}}[ \mu(1,X_i) - \mu(0,X_i) ] \\ &\overset{\texttip{\small{\unicode{x2753}}}{Our within-group identification result, Equation 41.1}}{=} \mathop{\mathrm{E}}\qty[ \mathop{\mathrm{E}}[ Y_i(1) \mid X_i] - \mathop{\mathrm{E}}[Y_i(0) \mid X_i] ] \\ &\overset{\texttip{\small{\unicode{x2753}}}{Linearity of conditional expectations}}{=} \mathop{\mathrm{E}}\qty[ \mathop{\mathrm{E}}[ Y_i(1) - Y_i(0) \mid X_i] ] \\ &\overset{\texttip{\small{\unicode{x2753}}}{The law of iterated expectations}}{=} \mathop{\mathrm{E}}\qty[ Y_i(1) - Y_i(0) ] \\ &\overset{\texttip{\small{\unicode{x2753}}}{sampling uniformly-at-random}}{=} \frac{1}{m}\sum_{j=1}^m \left\{ y_j(1) - y_j(0) \right\} = \text{ATE}. \end{aligned} \tag{41.2}\]
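
The population side of Equation 41.2 is easy to verify numerically. Here’s a sketch (the names Delta.all and m.x are mine) that computes \(\Delta_{\text{all}}\) as an average of the within-group differences \(\mu(1,x) - \mu(0,x)\), weighting each age group by its size, and checks that it matches the ATE exactly.

# Delta.all: the within-group mean differences mu(1,x) - mu(0,x),
# averaged with weights proportional to group size m_x.
Delta.all = po.pop %>%
  group_by(x) %>%
  summarize(diff = mean(y1) - mean(y0), m.x = n()) %>%
  summarize(sum(diff * m.x) / sum(m.x)) %>%
  pull()

c(Delta.all = Delta.all, ATE = mean(po.pop$y1 - po.pop$y0))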

There’s something buried in the notation above. At the heart of it is a difference between studies with and without randomization that’ll come up again and again.

When we aren’t randomizing, \(\mu(w,x)\) is the mean of our outcome \(y_j\) for some group of people in our population. We can write it as a conditional expectation involving random variables \(W_i,X_i,Y_i\) sampled uniformly-at-random from our population \(w_1,x_1,y_1 \ldots w_m,x_m,y_m\), but we don’t have to. The way I like to think about it, that conditional expectation isn’t what \(\mu(w,x)\) really is; it’s just something that has the same value. \[ \begin{aligned} \mu(w,x) &= \frac{1}{m_{wx}} \sum_{j:w_j=w, x_j=x} y_j &&\qfor m_{wx} = \sum_{j:w_j=w, x_j=x} 1 \\ &= \mathop{\mathrm{E}}[Y_i \mid W_i=w, X_i=x] &&\qqtext{when $(W_i,X_i,Y_i)$ is sampled uniformly-at-random} \end{aligned} \tag{41.3}\]

We can’t define \(\mu(w,x)\) the ‘first line way’ in a study involving randomization. There is no such group in the population because the people in our population don’t have a \(w_j\); \(w\) is something that’s assigned to the people in our sample. We can, however, still calculate the conditional expectation of the realized outcome \(Y_i=Y_i(W_i)\) given a treatment assignment \(W_i=w\) and covariate value \(X_i=x\). So that’s how we define \(\mu(w,x)\) in a randomized study. We skip the first line of Equation 41.3 and jump straight to the second. What comes of this is that \(\mu(w,x)\) summarizes the distribution of our sample in the same way when we randomize and when we don’t. The difference is that when we’re randomizing, it doesn’t describe any actual group of people in our population.

So let’s think for a moment about what this means for our identification result, Equation 41.2. It’s saying that a summary of the potential outcomes \(y_j(\cdot)\) of the people in our population (\(\text{ATE}\)) is equal to a summary of the distribution of the observations \(W_i,X_i,Y_i(W_i)\) in our sample (\(\Delta_\text{all}\)). This’ll come up in a moment when we talk about inverse probability weighting. When we randomize, we can’t talk about the covariate distributions \(p_{x \mid 0}\) and \(p_{x \mid 1}\) as the distributions of groups of people in our population. Instead, we’ll have to think of them as conditional distributions of the random variable \(X_i\) given that \(W_i=0\) and \(W_i=1\). Without randomization, that’s an implication of our groups-of-people definition when we sample uniformly-at-random; with randomization, it’s where we start.

Exercises

Exercise 42.1  

Modify the simulation code above to create a population with a different relationship between age and potential outcomes. Specifically, make the treatment effect (difference between \(y_j(1)\) and \(y_j(0)\)) larger for older people than for younger people. Then:

  1. Calculate the ATE for your new population.
  2. Sample 100 people and randomize them to treatment.
  3. Plot the potential outcomes and realized outcomes as shown above.

Exercise 42.2  

In your own words, explain why we need conditional randomization (where \(\pi(x)\) can depend on \(x\)) rather than simple randomization (where everyone has the same probability of treatment) when we want to estimate treatment effects in a population where treatment effects vary with \(x\).

Hint: Think about what happens if older people are more likely to be treated and older people also have larger treatment effects.


Exercise 42.3  

Using the original population (po.pop) and randomization scheme, simulate 1000 experiments and calculate the sample mean difference \(\bar{Y}_1 - \bar{Y}_0\) for each. Plot the sampling distribution of this estimator and compare its mean to the true ATE. Is this estimator biased? Why or why not?



  1. The process we use to generate the data here is different (simpler), but the overall theme is the same.↩︎

  2. With this notation, we’re writing our potential outcomes as one thing (a function of treatment \(w\)) instead of two things (that function’s value at treatments \(w=0\) and \(w=1\)).↩︎