34 Inverse Probability Weighting
Introduction
We’ve seen how to make adjusted comparisons by averaging within-group differences over some covariate distribution. That works well when you have enough data in each cell. But when the groups you’re comparing have very different covariate distributions—when there’s covariate shift—it can be hard to get stable estimates. Here’s another approach: inverse probability weighting.
Inverse probability weighting is a simple technique for getting unbiased estimates of targets like adjusted differences in means, even when there’s substantial covariate shift between the groups you’re comparing.
The essential idea is to fit the data where it’s used in your comparison, not where most of the data is. When we estimate \(\Delta_0\), we’re averaging over the covariate distribution of the red group. So we should weight our data to match that distribution, even when fitting predictions for the green group.
The Problem: Covariate Shift
Comparing Group Means Under Covariate Shift
When we want to estimate an adjusted difference like \(\Delta_0\), we’re comparing what would happen if everyone in the red group were treated like the green group versus how they’re actually treated. In formula form, \[ \Delta_0 = \textcolor[RGB]{248,118,109}{\frac{1}{m_0}\sum_{j:w_j=0}} \qty{ \textcolor[RGB]{0,191,196}{\mu(1,x_j)} - \textcolor[RGB]{248,118,109}{\mu(0,x_j)} }. \] The key point is that we’re averaging over the covariate distribution of the red group. But when we estimate \(\textcolor[RGB]{0,191,196}{\mu(1,x)}\), we’re using data from the green group—and that group may have a very different covariate distribution.
In the plots above, you can see this covariate shift getting more extreme. The red and green histograms show where each group’s observations fall. When shift is mild, they’re similar. When shift is extreme, they’re almost non-overlapping.
The Problem with Simple Averages
Consider what happens when we compute within-group means. The green group mean \(\bar Y_1\) weights each education level \(x\) by how many green people have that education level. But to estimate \(\Delta_0\), we need \(\textcolor[RGB]{0,191,196}{\mu(1,x)}\) averaged over the red covariate distribution—and \(\bar Y_1\) averages it over the green one.
In the moderate covariate shift case:
- At \(x=10\), we have 150 red dots and 50 green dots
- At \(x=18\), we have 50 red dots and 150 green dots
The green group has most of its mass at high education levels, but we want to know what would happen at low education levels—where the red group lives. Simple averaging gets this wrong.
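To make this concrete, here is a minimal sketch in R using just the two education levels above. The outcome function `mu1` is a hypothetical assumption for illustration, not the chapter's data:

```r
# Two covariate cells from the moderate-shift example
x  <- c(10, 18)
m0 <- c(150, 50)   # red counts at each education level
m1 <- c(50, 150)   # green counts at each education level

# Hypothetical mean outcome in the green group at each education level
mu1 <- function(x) 20000 + 2000 * x

# The simple green mean weights cells by the *green* counts ...
simple_green <- sum(m1 * mu1(x)) / sum(m1)
# ... but Delta_0 needs mu(1, x) averaged over the *red* counts
target_green <- sum(m0 * mu1(x)) / sum(m0)

simple_green   # 52000
target_green   # 44000
```

The simple mean overstates the target by 8,000 here, because the green group concentrates its mass exactly where \(\mu(1,x)\) is highest.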
The Solution: Pretending
Reweighting to Match Distributions
We can’t make the covariate shift go away. But we can pretend that we see each green observation multiple times—or fractional times—to make the weighted distribution match the red distribution.
In the moderate case, we pretend that:
- We see each green observation in the \(x=10\) column \(\textcolor[RGB]{248,118,109}{150} / \textcolor[RGB]{0,191,196}{50} = 3\) times
- We see each green observation in the \(x=12\) column \(\textcolor[RGB]{248,118,109}{125} / \textcolor[RGB]{0,191,196}{75} \approx 1.67\) times
- We see each green observation in the \(x=14\) column \(\textcolor[RGB]{248,118,109}{100} / \textcolor[RGB]{0,191,196}{100} = 1\) time
- We see each green observation in the \(x=16\) column \(\textcolor[RGB]{248,118,109}{75} / \textcolor[RGB]{0,191,196}{125} \approx 0.6\) times
- We see each green observation in the \(x=18\) column \(\textcolor[RGB]{248,118,109}{50} / \textcolor[RGB]{0,191,196}{150} \approx 0.33\) times
In the plots above, we scale each dot’s area by the number of times we pretend we see it. The total area of red and green dots in each column is now the same. When there’s extreme shift, some green dots get very large—we’re pretending each observation represents many observations.
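These ratios are easy to compute directly. A quick check in R, with the counts taken from the moderate-shift example above:

```r
# Red and green counts at each education level
x  <- c(10, 12, 14, 16, 18)
m0 <- c(150, 125, 100, 75, 50)
m1 <- c(50, 75, 100, 125, 150)

# How many times we pretend to see each green observation in its column
gamma1 <- m0 / m1
round(gamma1, 2)
# 3.00 1.67 1.00 0.60 0.33

# After reweighting, the total green "mass" per column matches the red mass
isTRUE(all.equal(m1 * gamma1, m0))  # TRUE
```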
The Weights
The weight we give to each observation is the ratio of red to green dots in that column, \[ \gamma(w,x) = \frac{\textcolor[RGB]{248,118,109}{m_{0x}}}{m_{wx}} = \begin{cases} 1 & \text{if } w=0 \\ \frac{\textcolor[RGB]{248,118,109}{m_{0x}}}{\textcolor[RGB]{0,191,196}{m_{1x}}} & \text{if } w=1. \end{cases} \] For red observations, the weight is 1—we don’t need to adjust them. For green observations, the weight is the ratio of red to green counts at that covariate value.
These are called inverse probability weights because in a randomized experiment where treatment probability varies by \(x\), this ratio is inversely proportional to the probability of being in the green group given \(x\).
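Spelling out that connection: write \(p(x)\) for the probability of being in the green group at covariate value \(x\). The green-group weight is then the inverse of the odds of being green, which behaves like \(1/p(x)\) when \(p(x)\) is small:
\[
\gamma(1,x) = \frac{m_{0x}}{m_{1x}} = \frac{m_{0x}/m_x}{m_{1x}/m_x} = \frac{1-p(x)}{p(x)}
\qfor p(x) = \frac{m_{1x}}{m_x}, \quad m_x = m_{0x} + m_{1x}.
\]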
The Weighted Average
The inverse probability weighted estimate of \(\textcolor[RGB]{0,191,196}{\mu(1,x)}\) averaged over the red distribution is \[ \frac{\sum_{j:w_j=1} \gamma(1,x_j) \cdot y_j}{\sum_{j:w_j=1} \gamma(1,x_j)} = \frac{\sum_{j:w_j=1} \frac{m_{0,x_j}}{m_{1,x_j}} \cdot y_j}{\sum_{j:w_j=1} \frac{m_{0,x_j}}{m_{1,x_j}}}. \] This gives more weight to green observations at covariate values where the red group has more mass, and less weight where the red group has less mass.
Using a Sample
In practice, we don’t have the population—we have a sample. We weight people in our sample the same way.
If we know the population covariate distribution (e.g., from a census or voter file), we can use those counts directly—and since only the ratio matters, proportions work just as well: \[ \gamma(w,x) = \frac{\textcolor[RGB]{248,118,109}{m_{0x}}}{m_{wx}} \]
pop = exercise.pop(m0x.flat, m1x.moderate)
J = sample(1:nrow(pop), 200)
sam = pop[J,]
pop.summaries = pop |> group_by(w,x) |> summarize(mwx=n(), .groups='drop')
mwx = summary.lookup('mwx', pop.summaries)
gamma = function(w,x) mwx(0,x) / mwx(w,x)
sam$weights = gamma(sam$w, sam$x)
weighted.mean.green = sum(sam$y[sam$w==1] * sam$weights[sam$w==1]) / sum(sam$weights[sam$w==1])
weighted.mean.green
[1] 39487.66
If we don’t know the population distribution, we estimate the weights from the sample: \[ \hat\gamma(w,x) = \frac{\textcolor[RGB]{248,118,109}{N_{0x}}}{N_{wx}} \qfor N_{wx} = \sum_{i:W_i=w, X_i=x} 1 \]
sam.summaries = sam |> group_by(w,x) |> summarize(Nwx=n(), .groups='drop')
Nwx = summary.lookup('Nwx', sam.summaries)
gammahat = function(w,x) Nwx(0,x) / Nwx(w,x)
sam$weights = gammahat(sam$w, sam$x)
weighted.mean.green = sum(sam$y[sam$w==1] * sam$weights[sam$w==1]) / sum(sam$weights[sam$w==1])
weighted.mean.green
[1] 42313.6
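Putting the pieces together, here is a self-contained sketch of the full estimate \(\hat\Delta_0\): a weighted green mean minus the (unweighted) red mean. The population and outcome model below are illustrative assumptions, not the exercise.pop data used above, so the numbers will differ:

```r
set.seed(34)

# Assumed population: counts per education level in each group
x_levels <- c(10, 12, 14, 16, 18)
m0 <- c(150, 125, 100, 75, 50)   # red
m1 <- c(50, 75, 100, 125, 150)   # green
pop <- data.frame(
  w = rep(c(0, 1), c(sum(m0), sum(m1))),
  x = c(rep(x_levels, m0), rep(x_levels, m1))
)
# Assumed outcome model with a constant treatment effect of 5000
pop$y <- 20000 + 2000 * pop$x + 5000 * pop$w + rnorm(nrow(pop), sd = 1000)

sam <- pop[sample(nrow(pop), 200), ]

# Estimate weights from the sample: gammahat(w, x) = N_{0x} / N_{wx}
N <- table(sam$w, sam$x)
gam <- N["0", as.character(sam$x)] /
  N[cbind(as.character(sam$w), as.character(sam$x))]

green <- sam$w == 1
Delta0.hat <- weighted.mean(sam$y[green], gam[green]) - mean(sam$y[!green])
Delta0.hat  # close to the true effect of 5000
```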
This Works
When we weight based on the population covariate distribution, we get an unbiased estimator, \[ \mathop{\mathrm{E}}[\hat\Delta_0^{\text{IPW}}] = \Delta_0. \] The sampling distribution is centered on our target (shown in green) regardless of how extreme the covariate shift is. The variance increases with more extreme shift—we’re relying heavily on a few observations—but the bias is gone.
Generalizing: Other Targets
\(\Delta_0\)
For \(\Delta_0\), we weight to make the green distribution match the red distribution: \[ \gamma(w,x) = \frac{m_{0x}}{m_{wx}} \]
\(\Delta_1\)
For \(\Delta_1\), we reverse it—weight to make the red distribution match the green distribution: \[ \gamma(w,x) = \frac{m_{1x}}{m_{wx}} \]
\(\Delta_{\text{all}}\)
For \(\Delta_{\text{all}}\), we weight both groups to match the overall distribution: \[ \gamma(w,x) = \frac{m_x}{m_{wx}} \qfor m_x = m_{0x} + m_{1x} \]
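All three cases fit one pattern: the numerator is the target distribution's count, the denominator is the observed group's count. A small helper makes this explicit, using the counts from the assumed moderate-shift example:

```r
x_levels <- c(10, 12, 14, 16, 18)
m0 <- c(150, 125, 100, 75, 50)   # red counts
m1 <- c(50, 75, 100, 125, 150)   # green counts

# target = "red" for Delta_0, "green" for Delta_1, "all" for Delta_all
gamma <- function(w, x, target = c("red", "green", "all")) {
  target <- match.arg(target)
  i <- match(x, x_levels)
  num <- switch(target, red = m0[i], green = m1[i], all = m0[i] + m1[i])
  num / ifelse(w == 0, m0[i], m1[i])
}

gamma(1, 10, "red")    # 150/50 = 3: upweight green where red lives
gamma(0, 18, "green")  # 150/50 = 3: the mirror image, for Delta_1
gamma(1, 10, "all")    # 200/50 = 4
gamma(0, 10, "all")    # 200/150, about 1.33
```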
Summary
Inverse probability weighting is a simple, powerful technique:
- Identify your target. What covariate distribution are you averaging over?
- Compute weights. The weight for an observation is the ratio of the target distribution to the observed distribution at that covariate value.
- Use weighted averages. Weight each observation by \(\gamma(w,x)\) when computing means.
The key insight is that we should fit the data where it matters for our question, not where most of the data happens to be. When there’s covariate shift, these are different places.