Homework: Misspecification and Bias
Introduction
In this homework, we focus on what happens when models are misspecified and how to reason about the impact of bias on inference. The key insight is that misspecification creates bias that doesn’t shrink with sample size, but our intervals do shrink—so more data can mean worse coverage.
As usual, we’ll use the tidyverse.
library(tidyverse)
Misspecification and Coverage
Suppose we’re interested in estimating the subpopulation mean \(\mu(0) = \sum_{j:x_j=0} y_j / \sum_{j:x_j=0} 1\) of the population drawn on the left above. We’re working with a sample of \((X_1, Y_1), \ldots, (X_n, Y_n)\) drawn with replacement from this population, like the one drawn on the right above.
As our estimate of \(\mu(0)\), we use \(\hat\mu(0)\) where \(\hat\mu\) is the least squares line (the least squares predictor in the ‘lines model’ \(\mathcal{M}=\{m(w)=a+b w\}\)). Below is a plot of its sampling distribution at sample size \(n=13\). I’ve plotted its mean, mean plus or minus one standard deviation, and mean plus or minus two standard deviations as solid, dashed, and dotted blue lines respectively. I’ve also plotted the subpopulation mean \(\mu(0)\) itself as a green line.
You may assume that the standard deviation of the estimator at sample size \(n\) is \(\text{se}=\sigma/\sqrt{n}\) for some number \(\sigma\) that does not vary with sample size, just like it would be if \(\hat\mu(0)\) were the mean of a sample of size \(n\).
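The population from the figure isn’t reproduced in this document, so the plot above can’t be regenerated from the text alone. Here’s a minimal sketch of how a sampling distribution like this can be simulated, using a stand-in population invented for illustration: \(x\) takes the values 0, 1, and 2 with a nonlinear mean, so the least squares line’s intercept is a biased estimate of \(\mu(0)\).
# A stand-in population (hypothetical, for illustration only): the mean of y
# is nonlinear in x, so the lines model is misspecified and the intercept
# muhat(0) is biased for mu(0).
set.seed(1)
m0 = 9000
pop0 = data.frame(x = sample(0:2, m0, replace=TRUE))
pop0$y = rnorm(m0, mean = c(1, 2, 5)[pop0$x + 1], sd = 0.5)
mu0 = mean(pop0$y[pop0$x == 0])  # the subpopulation mean mu(0), near 1

# Sampling distribution of muhat(0) at n = 13: draw samples with
# replacement, fit the least squares line, record its intercept.
n = 13
muhat0 = replicate(1000, {
  sam = pop0[sample(1:m0, n, replace=TRUE), ]
  coef(lm(y ~ x, data=sam))[[1]]
})
c(mean = mean(muhat0), se = sd(muhat0), mu0 = mu0)
The gap between mean(muhat0) and mu0 is the bias; it stays put as \(n\) grows, while the standard error shrinks like \(1/\sqrt{n}\).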
Locked (Week 13)
Locked (Week 13)
The Bias/SE Ratio and Coverage
What determines coverage when we have bias? The key quantity is the ratio of bias to standard error: \(\text{bias}/\text{se}\).
When our sampling distribution is approximately normal, coverage of the interval \(\hat\theta \pm 1.96 \times \text{se}\) is given by:
\[ \text{coverage} = \Phi(1.96 - \text{bias}/\text{se}) - \Phi(-1.96 - \text{bias}/\text{se}) \]
where \(\Phi\) is the standard normal CDF. We can write this as a function:
coverage.from.ratio = function(bias.over.se) {
  pnorm(1.96 - bias.over.se) - pnorm(-1.96 - bias.over.se)
}
Here’s a plot of coverage as a function of the bias/SE ratio:
ratios = seq(-5, 5, by=0.01)
coverages = coverage.from.ratio(ratios)
ggplot() +
  geom_line(aes(x=ratios, y=coverages)) +
  geom_hline(yintercept=0.95, linetype='dashed', color='green', alpha=0.5) +
  geom_hline(yintercept=0.83, linetype='dashed', color='orange', alpha=0.5) +
  geom_hline(yintercept=0.50, linetype='dashed', color='red', alpha=0.5) +
  labs(x='bias/SE ratio', y='coverage') +
  scale_y_continuous(breaks=seq(0, 1, by=0.1))
Key observations:
- When bias/SE = 0, we get 95% coverage (by design)
- When |bias/SE| ≈ 1, we get about 83% coverage
- When |bias/SE| ≈ 2, we get about 50% coverage
- When |bias/SE| > 3, coverage falls below about 15% toward 0
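As a sanity check on the formula, we can simulate normally distributed estimates with a known bias/SE ratio and count how often the \(\pm 1.96\,\text{se}\) interval covers the truth. A minimal sketch:
# Simulate estimates with bias/SE = 1 and check how often
# theta.hat +/- 1.96*se covers theta. The result should match
# coverage.from.ratio(1), i.e. about 83%.
set.seed(2)
theta = 0; se = 1; bias = 1
theta.hat = rnorm(10000, mean = theta + bias, sd = se)
covered = (theta.hat - 1.96*se <= theta) & (theta <= theta.hat + 1.96*se)
c(simulated = mean(covered), formula = coverage.from.ratio(1))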
Locked (Week 13)
Locked (Week 13)
Visualizing Misspecification Bias
Let’s look at a concrete example of how model misspecification affects our estimates. Consider a population where the true relationship between \(x\) and \(y\) is nonlinear:
set.seed(42)
m = 5000
x = runif(m, 0, 4)
y = 1 + 2*x - 0.3*x^2 + rnorm(m, sd=0.5)
pop = data.frame(x=x, y=y)
The true conditional mean function is \(\mu(x) = 1 + 2x - 0.3x^2\) (a downward-opening parabola). If we fit a linear model \(y = a + bx\), the model is misspecified: no choice of \(a\) and \(b\) can match this curved mean function.
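Before plotting, we can work out where the linear fit should land. In the infinite-population limit, least squares converges to the best linear approximation of \(\mu\), with slope \(b = \text{Cov}(X, \mu(X))/\text{Var}(X)\) and intercept \(a = E[\mu(X)] - b\,E[X]\). For \(X \sim \text{Uniform}(0,4)\) we have \(E[X]=2\), \(\text{Var}(X)=4/3\), \(E[X^2]=16/3\), and \(E[X^3]=16\), so \(\text{Cov}(X, X^2) = E[X^3] - E[X]E[X^2] = 16/3\) and \(E[\mu(X)] = 1 + 2 \cdot 2 - 0.3 \cdot 16/3 = 3.4\), giving
\[ b = \frac{2\,\text{Var}(X) - 0.3\,\text{Cov}(X, X^2)}{\text{Var}(X)} = \frac{8/3 - 0.3 \cdot 16/3}{4/3} = 0.8, \qquad a = E[\mu(X)] - b\,E[X] = 3.4 - 0.8 \cdot 2 = 1.8. \]
So the fitted population line should be close to \(1.8 + 0.8x\) (up to noise in the 5000 draws), and at \(x=0\) it overshoots the true value \(\mu(0)=1\) by about 0.8. That gap is the misspecification bias we’re about to plot.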
# Population with true curve
gridx = seq(0, 4, by=0.01)
true.curve = 1 + 2*gridx - 0.3*gridx^2
# Fit linear model to population
pop.linear = lm(y ~ x, data=pop)
linear.preds = predict(pop.linear, newdata=data.frame(x=gridx))
ggplot(pop) +
  geom_point(aes(x=x, y=y), alpha=0.1, size=0.5) +
  geom_line(aes(x=x, y=y), data=data.frame(x=gridx, y=true.curve), color='green', linewidth=1) +
  geom_line(aes(x=x, y=y), data=data.frame(x=gridx, y=linear.preds), color='blue', linewidth=1, linetype='dashed') +
  labs(title='Population: True curve (green) vs Linear fit (blue)')
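Since the population has 5000 points, the population-level fit should land near the line computed above, and its intercept shows the bias at \(x=0\) directly:
# The fitted coefficients should be near the infinite-population values
# (a, b) = (1.8, 0.8); the intercept minus mu(0) = 1 is the bias at x = 0.
coef(pop.linear)           # roughly (1.8, 0.8)
coef(pop.linear)[[1]] - 1  # misspecification bias at x = 0, roughly 0.8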
# Sample and fit
n = 50
sam = pop[sample(1:m, n, replace=TRUE),]
sam.linear = lm(y ~ x, data=sam)
sam.preds = predict(sam.linear, newdata=data.frame(x=gridx))
ggplot(sam) +
  geom_point(aes(x=x, y=y), alpha=0.5) +
  geom_line(aes(x=x, y=y), data=data.frame(x=gridx, y=true.curve), color='green', linewidth=1) +
  geom_line(aes(x=x, y=y), data=data.frame(x=gridx, y=sam.preds), color='blue', linewidth=1, linetype='dashed') +
  labs(title='Sample: True curve (green) vs Linear fit (blue)')
Locked (Week 13)
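Before moving on, it’s worth seeing the homework’s central claim numerically: the bias at \(x=0\) stays near 0.8 regardless of sample size, while the standard error shrinks like \(1/\sqrt{n}\), so coverage deteriorates as \(n\) grows. A sketch, using lm’s standard 95% interval for the intercept as our interval for \(\mu(0)=1\):
# Coverage of the nominal 95% intercept interval for mu(0) = 1 under the
# misspecified linear model: the bias stays put while the SE shrinks, so
# coverage falls as n grows.
set.seed(7)
coverage.at.n = function(n, reps=1000) {
  mean(replicate(reps, {
    sam = pop[sample(1:m, n, replace=TRUE), ]
    ci = confint(lm(y ~ x, data=sam), '(Intercept)')
    ci[1] <= 1 & 1 <= ci[2]  # does the interval cover mu(0) = 1?
  }))
}
sapply(c(10, 25, 100), coverage.at.n)  # coverage drops toward zero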
Simulation Validity
When using simulations to check whether estimators work, we need to be careful that our simulations actually test the situations we care about. A simulation that shows good coverage doesn’t mean the estimator always has good coverage—it means the estimator has good coverage in that specific simulation scenario.
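To make this concrete, here’s a coverage simulation that looks reassuring but tests the wrong scenario. We check the same linear-model interval against a hypothetical population (pop.lin below, invented for illustration) where the truth really is linear. Coverage comes out near 95%, even though the identical procedure fails badly on the quadratic population above:
# A simulation that 'validates' the linear-model interval only because the
# simulated population happens to satisfy the model. pop.lin is hypothetical.
set.seed(8)
pop.lin = data.frame(x = runif(m, 0, 4))
pop.lin$y = 1 + 2*pop.lin$x + rnorm(m, sd=0.5)
mean(replicate(1000, {
  sam = pop.lin[sample(1:m, 50, replace=TRUE), ]
  ci = confint(lm(y ~ x, data=sam), '(Intercept)')
  ci[1] <= 1 & 1 <= ci[2]  # the true intercept (and mu(0)) is 1
}))                        # close to 0.95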
Locked (Week 13)
Locked (Week 13)
Summary
The key takeaways from this homework:
- Misspecification creates fixed bias that doesn’t shrink with sample size, but our intervals do shrink. This means more data can mean worse coverage.
- The bias/SE ratio determines the coverage impact: a ratio of about 1 gives roughly 83% coverage; a ratio of about 2 gives roughly 50% coverage.
- Be skeptical of simulation evidence. A simulation that shows an estimator works well only tells you it works in that specific scenario. Always consider what aspects of reality your simulation might be missing.