43  Inverse Probability Weighted Least Squares

From IPW Averaging to IPW Least Squares

In the previous lecture, we saw how inverse probability weighting gives us unbiased estimates of adjusted differences like \(\Delta_0\). The key idea was to weight observations so that the green distribution matches the red distribution we’re averaging over. With weights \(\gamma(w,x) = m_{0x}/m_{wx}\), we get \[ \mathop{\mathrm{E}}[\hat\Delta_0^{\text{IPW}}] = \Delta_0. \] Now we’ll see that this is actually a special case of something more general: inverse probability weighted least squares. The IPW averaging estimator is IPW least squares in the constant model—a model where our prediction function doesn’t depend on \(x\) at all. And once we see it that way, we can generalize to other models.

IPW Averaging as IPW Least Squares

The IPW estimate of the green group mean, averaged over the red distribution, can be written as \[ \hat\mu^{\text{IPW}}(1) = \frac{\sum_{i:W_i=1} \gamma(1,X_i) Y_i}{\sum_{i:W_i=1} \gamma(1,X_i)}. \] This is exactly what you get from weighted least squares in the constant model—a model where \(m(w,x) = a(w)\) doesn’t depend on \(x\). The weighted least squares solution is \[ \hat\mu^{\text{IPW}} = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \gamma(W_i,X_i) \qty{ Y_i - m(W_i,X_i) }^2 \qfor \mathcal{M}= \{ m(w,x) = a(w) \}. \] Because this model allows different constants for each group but ignores \(x\), the solution for \(\hat\mu^{\text{IPW}}(1)\) is just the weighted mean of \(Y_i\) among green observations.
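To see the equivalence concretely, here is a minimal R sketch. It assumes a data frame sam with columns y and w (coded 0/1) and inverse probability weights stored in a column named weights, as in the code at the end of this section; these names are illustrative assumptions.

# IPW average of Y among the w = 1 group ...
green = sam$w == 1
muhat.avg = weighted.mean(sam$y[green], sam$weights[green])

# ... equals the weighted least squares fit in the constant model m(w,x) = a(w).
# With '- 1' in the formula, the two coefficients are the weighted group means a(0) and a(1).
constant.fit = lm(y ~ factor(w) - 1, weights=weights, data=sam)
muhat.ls = coef(constant.fit)[['factor(w)1']]

all.equal(muhat.avg, muhat.ls)   # TRUE, up to numerical error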

Why Generalize?

Sometimes the constant model isn’t a good fit. The column means \(\mu(w,x)\) might vary substantially with \(x\). In that case, we might want to use a richer model—lines, polynomials, or something else—to estimate \(\mu(w,x)\) more accurately.

The question is: if we use a misspecified model (one that doesn’t include the true \(\mu\)), does IPW still give us unbiased estimates of \(\Delta_0\)?

The answer is yes—as long as the model includes group indicators. Let’s see why.

IPW Least Squares in Other Models

The Setup

We’ll use weighted least squares with inverse probability weights, \[ \hat\mu^{\text{IPW}} = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \gamma(W_i,X_i) \qty{ Y_i - m(W_i,X_i) }^2 \] where \(\gamma(w,x) = m_{0x}/m_{wx}\) as before. The difference is what model \(\mathcal{M}\) we use.

To understand bias, we’ll think about the population version—what happens if we do weighted least squares on the entire population, \[ \tilde\mu^{\text{IPW}} = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{j=1}^m \gamma(w_j,x_j) \qty{ y_j - m(w_j,x_j) }^2. \] We know from earlier that \(\mathop{\mathrm{E}}[\hat\mu^{\text{IPW}}(w,x)] = \tilde\mu^{\text{IPW}}(w,x)\), so the bias of \(\hat\Delta_0\) equals the error of \(\tilde\Delta_0\).

Models We’ll Consider

The horizontal lines model has separate constants for each group, \[ \mathcal{M}= \{ m(w,x) = a(w) \}. \tag{45.1}\]

The parallel lines model adds a common slope in \(x\), \[ \mathcal{M}= \{ m(w,x) = a(w) + bx \}. \tag{45.2}\]

The lines model allows different slopes for each group, \[ \mathcal{M}= \{ m(w,x) = a(w) + b(w)x \}. \tag{45.3}\]

The additive model has separate intercepts for each \(x\) value but constrains the group difference to be constant, \[ \mathcal{M}= \{ m(w,x) = a(w) + c(x) \}. \tag{45.4}\]

All of these models contain the group indicators \(1_{=0}(w)\) and \(1_{=1}(w)\): take \(a(w)\) to be the indicator and set every other coefficient to zero. That’s what matters for IPW.
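For reference, here is how these models are written as lm formulas in R. This is a sketch: it assumes w is a 0/1 indicator (or factor) and x is the covariate, using the same variable names as the code at the end of this section; add weights=weights for the IPW versions.

lm(y ~ w,             data=sam)   # horizontal lines:  m(w,x) = a(w)
lm(y ~ w + x,         data=sam)   # parallel lines:    m(w,x) = a(w) + bx
lm(y ~ w * x,         data=sam)   # lines:             m(w,x) = a(w) + b(w)x
lm(y ~ w + factor(x), data=sam)   # additive:          m(w,x) = a(w) + c(x)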

Empirical Evidence

For all of these models, the sampling distribution of \(\hat\Delta_0\) is centered on the target (shown in green). The estimator is unbiased despite the model being misspecified. The model doesn’t fit the data well—look at how far the fitted lines are from the column means—but that doesn’t cause bias in \(\hat\Delta_0\).

Proving Unbiasedness

Goal

Suppose we have the inverse probability weighted population least squares predictor \(\tilde\mu\), \[ \tilde\mu = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \text{WMSE}(m) \qfor \text{WMSE}(m) = \sum_{wx} \gamma(w,x) m_{wx} \qty{ \mu(w,x) - m(w,x) }^2 \] with \(\gamma(w,x) = m_{0x}/m_{wx}\). We want to show that when we plug it into our formula for \(\tilde\Delta_0\), we get our target \(\Delta_0\), \[ \tilde\Delta_0 = \frac{1}{m_0} \sum_{j:w_j=0} \qty{ \tilde\mu(1,x_j) - \tilde\mu(0,x_j) } = \Delta_0 = \frac{1}{m_0} \sum_{j:w_j=0} \qty{ \mu(1,x_j) - \mu(0,x_j) }. \]

Approach

We’ll decompose the error \(\Delta_0 - \tilde\Delta_0\) into a mismatched term and a matched term, \[ \Delta_0 - \tilde\Delta_0 = \textcolor[RGB]{248,118,109}{\frac{1}{m_0} \sum_{j:w_j=0}} \qty{\textcolor[RGB]{0,191,196}{\mu(1,x_j) - \tilde\mu(1,x_j)}} - \textcolor[RGB]{248,118,109}{\frac{1}{m_0} \sum_{j:w_j=0}} \qty{\textcolor[RGB]{248,118,109}{\mu(0,x_j) - \tilde\mu(0,x_j)}}. \] We’ll derive an orthogonality condition for the weighted least squares residuals and use it to show both terms are zero.

The Weighted Orthogonality Condition

Because \(\tilde\mu\) minimizes weighted mean squared error over all \(m \in \mathcal{M}\), it minimizes along all paths \(\{ m_t = \tilde\mu + tm : t \in \mathbb{R} \}\) for \(m \in \mathcal{M}\); each of our models is closed under addition and scaling, so these paths stay inside \(\mathcal{M}\). Taking the derivative with respect to \(t\) at \(t=0\) and setting it to zero gives \[ 0 = \sum_{wx} \gamma(w,x) m_{wx} \qty{ \mu(w,x) - \tilde\mu(w,x) } m(w,x) \qqtext{for all} m \in \mathcal{M}. \tag{46.1}\] Because \(\gamma(w,x) m_{wx} = m_{0x}\), this simplifies to \[ 0 = \sum_{wx} m_{0x} \qty{ \mu(w,x) - \tilde\mu(w,x) } m(w,x) \qqtext{for all} m \in \mathcal{M}. \] Rewriting the sum over \(x\) as a sum over the population (specifically, over red dots), \[ 0 = \sum_{w \in \{0,1\}} \sum_{j:w_j=0} \qty{ \mu(w,x_j) - \tilde\mu(w,x_j) } m(w,x_j) \qqtext{for all} m \in \mathcal{M}. \]
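In case that differentiation step goes by quickly, here it is written out. Along the path \(m_t = \tilde\mu + tm\), the weighted mean squared error is \(\sum_{wx} \gamma(w,x) m_{wx} \qty{ \mu(w,x) - \tilde\mu(w,x) - t\,m(w,x) }^2\), and its derivative at \(t=0\) is \[ -2 \sum_{wx} \gamma(w,x)\, m_{wx} \qty{ \mu(w,x) - \tilde\mu(w,x) } m(w,x). \] Setting this to zero and dropping the factor of \(-2\) gives Equation 46.1.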

Showing the Error Terms are Zero

Now we plug in specific functions \(m\) to show each error term is zero.

The matched term. Plug in \(m(w,x) = 1_{=0}(w)\), which is in our model (it’s a group indicator). By the indicator trick, only terms with \(w=0\) survive, \[ 0 = \sum_{w \in \{0,1\}} \sum_{j:w_j=0} \qty{ \mu(w,x_j) - \tilde\mu(w,x_j) } 1_{=0}(w) = \sum_{j:w_j=0} \qty{ \textcolor[RGB]{248,118,109}{\mu(0,x_j) - \tilde\mu(0,x_j)} }. \] This is exactly the matched term (times \(m_0\)), so the matched term is zero.

The mismatched term. Plug in \(m(w,x) = 1_{=1}(w)\), which is also in our model. By the indicator trick, only terms with \(w=1\) survive, but the sum is still over red dots (\(j:w_j=0\)), \[ 0 = \sum_{j:w_j=0} \qty{ \textcolor[RGB]{0,191,196}{\mu(1,x_j) - \tilde\mu(1,x_j)} }. \] This is exactly the mismatched term (times \(m_0\)), so the mismatched term is zero.

The key insight is that our weighting makes the orthogonality condition sum over the red distribution for both groups’ residuals. Without weighting, the green residuals would be orthogonal over the green distribution—which isn’t what we need for \(\Delta_0\).

IPW vs. Unweighted Least Squares

On the left, we see IPW least squares—unbiased for all models. On the right, we see unweighted least squares—biased for most models, though sometimes unbiased by accident (the lines model can get lucky when the ratio of covariate distributions happens to be linear in \(x\); see Lab 7).
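If you’d like to reproduce the unweighted side of this comparison, it’s the same lm call with the weights argument dropped. This is a sketch using the same sam data frame and the same predict-based muhat pattern as the IPW code below.

# Unweighted least squares in the lines model, for comparison with the IPW fit below
unweighted.model = lm(y ~ w*x, data=sam)
muhat.unweighted = function(w,x) predict(unweighted.model, newdata=data.frame(w=w,x=x))
Delta0.hat.unweighted = mean(muhat.unweighted(1,sam$x[sam$w==0]) - muhat.unweighted(0,sam$x[sam$w==0]))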

Using IPW Least Squares in R

Using IPW least squares is simple. We compute the weights and pass them to lm.

# Compute the weights gamma(w,x) = N_0x / N_wx, the sample analog of m_0x / m_wx
sam.summaries = sam |> group_by(w,x) |> summarize(Nwx=n(), .groups='drop')  # count observations in each column (w,x)
Nwx = summary.lookup('Nwx', sam.summaries)  # lookup function: Nwx(w,x) returns that count
sam$weights = Nwx(0,sam$x) / Nwx(sam$w,sam$x)

# Fit weighted least squares
fitted.model = lm(y ~ w*x, weights=weights, data=sam)
muhat = function(w,x) predict(fitted.model, newdata=data.frame(w=w,x=x))

# Compute adjusted difference
Delta0.hat = mean(muhat(1,sam$x[sam$w==0]) - muhat(0,sam$x[sam$w==0]))
Delta0.hat
[1] -15440.14

It’s essentially a one-line change to your R code: add weights=weights to your lm call.

Summary

Inverse probability weighted least squares is a simple technique for getting unbiased estimates of adjusted differences even when your model is misspecified.

  1. IPW averaging is IPW least squares in the constant model. The weighted mean is the weighted least squares solution when the model ignores covariates.

  2. IPW least squares works for any model with group indicators. As long as your model can distinguish between groups (includes \(1_{=0}(w)\) and \(1_{=1}(w)\)), IPW gives unbiased estimates of \(\Delta_0\).

  3. The proof uses weighted orthogonality. The key is that weighting by \(\gamma(w,x) = m_{0x}/m_{wx}\) makes the orthogonality condition hold over the target distribution (the red distribution for \(\Delta_0\)).

  4. Implementation is easy. Just add weights=... to your lm call.

People sometimes avoid IPW because it increases variance—you’re relying heavily on a few observations when covariate shift is extreme. But the alternative is bias, which is worse. A wider confidence interval that’s centered on the truth is better than a narrow one that misses.