33  Conditional Randomization

A Campaign Finance Example

Conditionally Randomized Experiments

What is conditional randomization?

  • So far, we have focused on the case in which treatment is randomized without looking at anything else.
  • Formally: \(W_1 \ldots W_n\) are independent of \(\{X_1,Y_1(0), Y_1(1)\} \ldots \{X_n, Y_n(0), Y_n(1) \}\).
  • This is not the only way to randomize!
  • Suppose I suspect that phone calls to older people pay off more than calls to younger ones.
    • Maybe older people have more money to spend on campaigns.
    • Maybe they don’t get as annoyed about getting phone calls as young people do.
    • I might want to choose the probability a person gets treated (i.e. called) as a function of their age.
  • In conditionally randomized experiments, that’s what we do.
    • We look at the covariates \(X_1 \ldots X_n\) when we randomize. But only the covariates.
    • Formally: \(W_1 \ldots W_n\) are conditionally independent of \(\{Y_1(0), Y_1(1)\} \ldots \{Y_n(0), Y_n(1)\}\) given \(X\).1
    • We’ll focus on the case that each \(W_i\) is a coin flip with heads probability depending on \(X_i\).2
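To make the mechanism concrete, here is a minimal simulation sketch in Python. The propensity function `pi_of_age` and the age values are invented for illustration; the heads probabilities match the two-group example below.

```python
import numpy as np

rng = np.random.default_rng(0)

def pi_of_age(age):
    """Heads probability of each person's treatment coin as a function
    of their age. This particular function is invented for illustration."""
    return np.where(age >= 65, 0.87, 0.35)

# Covariates for a hypothetical sample: two age groups.
X = rng.choice([55, 75], size=1000)

# Conditional randomization: W_i is a coin flip whose heads probability
# depends on X_i and nothing else -- in particular, not on the
# potential outcomes. That's the conditional independence above.
W = rng.binomial(1, pi_of_age(X))
```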

The Idea

  • To make things simple, let’s look at a subset of our population.
    • We have grouped everyone by age to give us two age groups—a binary covariate.
  • As before, everyone has two potential outcomes—treated and untreated.
  1. We start with two potential outcomes for each person in our population.
    • Those are the connected dots we see in the plot.
  2. Then we sample from our population.
    • This is random, too.
  3. Each person flips a weighted coin to determine whether they’re treated. The coin’s heads probability depends on their age.
  • This is random. Different things happen each time.
  • But the overall pattern is consistent.
    • Most of the green dots are on the right.
      • It’s mostly 75-year-olds getting called.
      • The heads probability of their coin is 0.87.
    • Most of the red dots are on the left.
      • It’s mostly 55-year-olds getting emailed.
      • The heads probability of their coin is 0.35.
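We can reproduce that pattern numerically. Below is a hedged sketch: the sample size and the even split between the two ages are invented, but the coin biases are the 0.87 and 0.35 above.

```python
import numpy as np

rng = np.random.default_rng(1)

ages = rng.choice([55, 75], size=10_000)          # two age groups
p_heads = np.where(ages == 75, 0.87, 0.35)        # coin biases from above
W = rng.binomial(1, p_heads)                      # 1 = called, 0 = emailed

# Within each group, the treated fraction is close to the coin's bias...
print(W[ages == 75].mean())          # about 0.87
print(W[ages == 55].mean())          # about 0.35

# ...so the called skew old and the emailed skew young.
print((ages[W == 1] == 75).mean())   # share of the called who are 75
print((ages[W == 0] == 55).mean())   # share of the emailed who are 55
```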

What Happens to Within-Group Means

  1. We start with two potential outcomes for each person in our population.
    • Those are the connected dots we see in the plot.
    • At each level of \(X\), these potential outcomes have a mean.
  2. Then we sample from our population.
    • This is random, too.
    • These have a mean, too. And it is, of course, random. It changes if our sample changes.
  3. Each person in the sample flips a weighted coin to determine whether they’re treated.
    • ⦻ marks the potential outcomes that don’t happen.
    • We can look at the means of people who actually flip ‘heads’ and ‘tails’, too. More randomness.
    • Our sample has a mean within each group, too. A random one; see the sketch below.
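Here is a sketch of those within-group means \(\hat\mu(w,x)\) on simulated data. The outcome model (donations rise with age; calls add a constant amount) is invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

ages = rng.choice([55, 75], size=2000)
W = rng.binomial(1, np.where(ages == 75, 0.87, 0.35))

# Invented potential outcomes: donations rise with age, calls add 10.
Y0 = ages + rng.normal(0, 5, size=ages.size)
Y1 = Y0 + 10
Y = np.where(W == 1, Y1, Y0)    # we only observe Y_i = Y_i(W_i)

# One mean per (treatment, age) cell: the within-group means.
for x in (55, 75):
    for w in (0, 1):
        print(f"mu_hat({w}, {x}) = {Y[(W == w) & (ages == x)].mean():.1f}")
```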

How different are these means?


What Happens to (Not Within-Group) Means

  1. We start with two potential outcomes for each person in our population.
    • Those are the connected dots we see in the plot.
    • These have a mean outright—ignoring \(X\). Often, that’s what we’re interested in.
  2. Then we sample from our population.
    • This is random, too.
    • These have a mean, too. And it is, of course, random. It changes if our sample changes.
  3. Each person in the sample flips a weighted coin to determine whether they’re treated.
    • ⦻ marks the potential outcomes that don’t happen.
    • We can look at the means of people who actually flip ‘heads’ and ‘tails’, too. More randomness.

How different are these means?


Why? Covariate Shift.

  • When we switch from emails to calls, the distribution of ages shifts to the right.
  • And the trend is that donations increase with age.
  • What does this mean for the raw difference in means?

\[ \color{gray} \begin{aligned} \hat\Delta_{\text{raw}} &= \textcolor[RGB]{0,191,196}{\frac{1}{N_1} \sum_{i:W_i=1} Y_i} - \textcolor[RGB]{248,118,109}{\frac{1}{N_0} \sum_{i:W_i=0} Y_i} \\ &= \sum_x \textcolor[RGB]{0,191,196}{P_{x|1}} \ \textcolor[RGB]{0,191,196}{\hat\mu(1,x)} - \sum_x \textcolor[RGB]{248,118,109}{P_{x | 0}} \ {\textcolor[RGB]{248,118,109}{\hat\mu(0,x)}} \\ &= \underset{\text{adjusted difference} \ \hat\Delta_1}{\sum_x \textcolor[RGB]{0,191,196}{P_{x|1}} \{\textcolor[RGB]{0,191,196}{\hat\mu(1,x)} - \textcolor[RGB]{248,118,109}{\hat\mu(0,x)}\}} + \qty{\underset{\text{covariate shift term}}{\sum_x \textcolor[RGB]{0,191,196}{P_{x|1}} \ \textcolor[RGB]{248,118,109}{\hat\mu(0,x)} - \sum_x \textcolor[RGB]{248,118,109}{P_{x|0}} \ \textcolor[RGB]{248,118,109}{\hat\mu(0,x)}}} \end{aligned} \]
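To see the decomposition in action, here is a worked example with invented numbers. Suppose the within-group means are \(\hat\mu(0,55)=50\), \(\hat\mu(0,75)=100\), \(\hat\mu(1,55)=60\), \(\hat\mu(1,75)=120\), the called skew old with \(P_{55\mid 1}=0.3\) and \(P_{75\mid 1}=0.7\), and the emailed skew young with \(P_{55\mid 0}=0.7\) and \(P_{75\mid 0}=0.3\). Then

\[ \begin{aligned} \hat\Delta_{\text{raw}} &= (0.3 \cdot 60 + 0.7 \cdot 120) - (0.7 \cdot 50 + 0.3 \cdot 100) = 102 - 65 = 37 \\ \hat\Delta_1 &= 0.3\,(60 - 50) + 0.7\,(120 - 100) = 17 \\ \text{covariate shift term} &= (0.3 \cdot 50 + 0.7 \cdot 100) - (0.7 \cdot 50 + 0.3 \cdot 100) = 85 - 65 = 20 \end{aligned} \]

and indeed \(37 = 17 + 20\). More than half of the raw difference is covariate shift rather than a difference within groups.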

Covariate Shift in the Whole Dataset

The same decomposition applies in the whole dataset, not just in the two-age-group subset we’ve been looking at.

What should we do about this?

  • We know how to make comparisons that aren’t influenced by covariate shift. Adjusted comparisons.
\[ \begin{aligned} \hat\Delta_1 &= \frac{1}{N_1}\sum_{i:W_i=1} \qty{ \hat\mu(1, X_i) - \hat\mu(0,X_i) } = \sum_{x} P_{x \mid 1} \qty{ \hat\mu(1,x) - \hat\mu(0,x)} \\ \hat\Delta_0 &= \frac{1}{N_0}\sum_{i:W_i=0} \qty{ \hat\mu(1, X_i) - \hat\mu(0,X_i) } = \sum_{x} P_{x \mid 0} \qty{ \hat\mu(1,x) - \hat\mu(0,x)} \\ \hat\Delta_{\text{all}} &= \frac{1}{n}\sum_{i=1}^n \qty{ \hat\mu(1,X_i) - \hat\mu(0,X_i) } = \sum_{x} P_{x} \qty{ \hat\mu(1,x) - \hat\mu(0,x)} \end{aligned} \]
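Here is a sketch computing all three adjusted comparisons on simulated data (same invented model as before; the weighted-sum and mean-over-individuals forms agree):

```python
import numpy as np

rng = np.random.default_rng(3)

ages = rng.choice([55, 75], size=2000)
W = rng.binomial(1, np.where(ages == 75, 0.87, 0.35))
Y = ages + 10 * W + rng.normal(0, 5, size=ages.size)   # invented model

groups = (55, 75)
# tau_hat(x) = mu_hat(1, x) - mu_hat(0, x) within each group.
tau_hat = {x: Y[(W == 1) & (ages == x)].mean()
            - Y[(W == 0) & (ages == x)].mean() for x in groups}

# The three estimators differ only in how they weight the groups.
delta_1   = sum((ages[W == 1] == x).mean() * tau_hat[x] for x in groups)
delta_0   = sum((ages[W == 0] == x).mean() * tau_hat[x] for x in groups)
delta_all = sum((ages == x).mean()         * tau_hat[x] for x in groups)
```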

  • What do these tell us about our treatment effects \(\tau_j=y_j(1)-y_j(0)\)? Let’s find out.

Identification in Conditionally Randomized Experiments

If treatment assignments are conditionally independent of the potential outcomes given covariates
\[ \text{ i.e. if } \ W_i \qqtext{ is independent of } \{Y_i(0), Y_i(1)\} \qqtext{ conditional on } X_i \]

then we can identify potential outcome means within groups with the same covariate value.
It’s a conditional version of the same formula.

\[ \begin{aligned} \mu(w,x) &= \mathop{\mathrm{E}}[Y_i \mid W_i=w, X_i=x] = \mathop{\mathrm{E}}[Y_i(w) \mid X_i=x] = \frac{1}{m_x} \sum_{j:x_j=x} y_j(w) \\ \qfor &m_x = \sum_{j=1}^m 1_{=x}(x_j) \end{aligned} \]

Conditional Independence and Irrelevance

  • Let’s think about conditioning using multi-stage sampling.
    • Stage 1. We sample \(X_i\).
    • Stage 2.
      • We sample \(\{Y_i(0), Y_i(1)\}\) from the subpopulation with that level of \(X_i\)
      • We choose \(W_i\) by flipping a coin with probability \(\pi(X_i)\) of heads.
      • What we observe is \(W_i\), \(X_i\), and \(Y_i=Y_i(W_i)\).
  • Conditional independence is just independence in the probability distribution describing Stage 2.
  • A consequence we’ll use here is the irrelevance of conditionally independent conditioning variables.

\[ \mathop{\mathrm{E}}[Y(w) \mid W, X] = \mathop{\mathrm{E}}[Y(w) \mid X] \qqtext{ if } W \qqtext{ is independent of } \{Y(0), Y(1)\} \qqtext{ conditional on } X \]
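A quick simulation illustrates the irrelevance property (a sketch; the outcome model is invented). Within an \(x\)-group, conditioning on the coin flip doesn’t move the mean of a potential outcome.

```python
import numpy as np

rng = np.random.default_rng(4)

# Stage 1: sample X. Stage 2: sample Y(0) given X, then flip a coin
# with heads probability pi(X) -- independently of Y(0).
ages = rng.choice([55, 75], size=200_000)
Y0 = ages + rng.normal(0, 5, size=ages.size)   # invented model
W = rng.binomial(1, np.where(ages == 75, 0.87, 0.35))

# E[Y(0) | W=0, X=55] vs E[Y(0) | X=55]: about the same.
print(Y0[(ages == 55) & (W == 0)].mean())
print(Y0[ages == 55].mean())
```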

Formula Derivation

\[ \begin{aligned} \mathop{\mathrm{E}}[Y_i \mid W_i=w, X_i=x] &\overset{\texttip{\small{\unicode{x2753}}}{$Y_i=Y_i(W_i)$}}{=} \mathop{\mathrm{E}}[Y_i(W_i) \mid W_i=w, X_i=x] \\ &\overset{\texttip{\small{\unicode{x2753}}}{$Y_i(W_i)=Y_i(w)$ when $W_i=w$}}{=} \mathop{\mathrm{E}}[Y_i(w) \mid W_i=w, X_i=x] \\ &\overset{\texttip{\small{\unicode{x2753}}}{irrelevance of conditionally independent conditioning variables}}{=} \mathop{\mathrm{E}}[Y_i(w) \mid X_i=x] \\ &\overset{\texttip{\small{\unicode{x2753}}}{$Y_i(w)$ is sampled uniformly at random from the potential outcomes $y_j(w)$ of the $m_x$ units with $x_j=x$}}{=} \frac{1}{m_x} \sum_{j:x_j=x} y_j(w) \qfor m_x = \sum_{j=1}^m 1_{=x}(x_j) \end{aligned} \]

Consequence for Treatment Effects within Groups

\[ \begin{aligned} \mu(1,x) - \mu(0,x) &= \mathop{\mathrm{E}}[Y_i(1) \mid X_i=x] - \mathop{\mathrm{E}}[Y_i(0) \mid X_i=x] \\ &= \frac{1}{m_x} \sum_{j:x_j=x} y_j(1) - \frac{1}{m_x} \sum_{j:x_j=x} y_j(0) \\ &= \frac{1}{m_x} \sum_{j:x_j=x} \qty{y_j(1) - y_j(0)} \\ &= \frac{1}{m_x} \sum_{j:x_j=x} \tau_j \end{aligned} \]

  • We call this the Conditional Average Treatment Effect (CATE).
  • We write it as \(\tau(x)\) in mathematical notation.
  • When we have conditional randomization, our adjusted comparisons are unbiased estimators of averages of the CATE \(\tau(x)\) over groups.
    • \(\hat\Delta_1\) averages over the green dots, i.e., the treated individuals.
    • \(\hat\Delta_0\) averages over the red dots, i.e., the untreated individuals.
    • \(\hat\Delta_{\text{all}}\) averages over all the individuals.
  • Let’s prove that for \(\hat\Delta_{\text{all}}\). We’ll leave the others for homework.

Unbiasedness of \(\hat\Delta_{\text{all}}\)

\[ \begin{aligned} \mathop{\mathrm{E}}[\hat\Delta_\text{all}] &=\mathop{\mathrm{E}}\qty[\sum_x P_{x} \ \qty{\hat \mu(1,x) - \hat\mu(0,x) }] \\ &\overset{\texttip{\small{\unicode{x2753}}}{law of iterated expectations}}{=} \mathop{\mathrm{E}}\qty[\mathop{\mathrm{E}}\qty[ \sum_x P_{x} \ \qty{\hat \mu(1,x) - \hat\mu(0,x)} \mid (W_1, X_1) \ldots (W_n, X_n)]] \\ &\overset{\texttip{\small{\unicode{x2753}}}{linearity of conditional expectation}}{=} \mathop{\mathrm{E}}\qty[\sum_x P_{x} \ \qty{\mathop{\mathrm{E}}\qty[\hat\mu(1,x) \mid (W_1,X_1) \ldots (W_n,X_n)] - \mathop{\mathrm{E}}\qty[\hat\mu(0,x) \mid (W_1,X_1) \ldots (W_n,X_n)]}] \\ &\overset{\texttip{\small{\unicode{x2753}}}{unbiasedness of the sample mean}}{=} \mathop{\mathrm{E}}\qty[\sum_x P_{x} \ \qty{\mu(1,x) - \mu(0,x)}] \\ &\overset{\texttip{\small{\unicode{x2753}}}{identification: $\tau(x)=\mu(1,x)-\mu(0,x)$}}{=} \mathop{\mathrm{E}}\qty[\sum_x P_{x} \ \tau(x)] \\ &\overset{\texttip{\small{\unicode{x2753}}}{linearity of expectation}}{=} \sum_x \mathop{\mathrm{E}}[P_{x}] \ \tau(x) \\ &\overset{\texttip{\small{\unicode{x2753}}}{unbiasedness of sample proportions and def of $\tau(x)$}}{=} \sum_x p_{x} \ \frac{1}{m_x}\sum_{j:x_j=x} \tau_j \qfor p_{x} = \mathop{\mathrm{E}}[P_{x}]=\frac{m_x}{m} \\ &\overset{\texttip{\small{\unicode{x2753}}}{rewriting our sum of column sums as a single sum}}{=} \sum_x \frac{m_x}{m} \ \frac{1}{m_x}\sum_{j:x_j=x} \tau_j = \frac{1}{m}\sum_{j=1}^m \tau_j \end{aligned} \]

  • This is the average of the individual treatment effects \(\tau_j\) over the whole population.
  • Or, for short, the average treatment effect or ATE.
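
As a sanity check on this unbiasedness claim, here is a small Monte Carlo sketch. The fixed population and its potential outcomes are invented; note the call effect is larger for older people, so the CATEs differ across groups.

```python
import numpy as np

rng = np.random.default_rng(5)

# A fixed, invented population of m people with known potential outcomes.
m = 1000
x_pop = rng.choice([55, 75], size=m)
y0_pop = x_pop + rng.normal(0, 5, size=m)
y1_pop = y0_pop + np.where(x_pop == 75, 15, 5)   # bigger effect when older
ate = (y1_pop - y0_pop).mean()                   # the estimation target

def delta_all_hat(n=500):
    """One conditionally randomized experiment and its adjusted estimate."""
    j = rng.integers(m, size=n)                  # sample from the population
    x, y0, y1 = x_pop[j], y0_pop[j], y1_pop[j]
    w = rng.binomial(1, np.where(x == 75, 0.87, 0.35))
    y = np.where(w == 1, y1, y0)
    mu = lambda wv, xv: y[(w == wv) & (x == xv)].mean()
    # n is large enough that every (w, x) cell is almost surely non-empty.
    return sum((x == xv).mean() * (mu(1, xv) - mu(0, xv)) for xv in (55, 75))

estimates = [delta_all_hat() for _ in range(2000)]
print(np.mean(estimates), ate)   # the average estimate should be close to the ATE
```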

  1. Conditional independence is a new term. We’ll define it shortly.↩︎

  2. There are other ways, e.g. making a deck of cards for each age group, shuffling them, and treating the top \(N_{w,x}\) cards of each deck.↩︎