26  Misspecification and Inference

Summary

  • We’ll try out a misspecified regression model in a simple causal inference problem.
    • We’ll get biased point estimates.
    • We’ll look at the implications for inference.
    • And we’ll see that misspecification is an increasingly big deal as sample size grows.

A Simple Example

An Experiment

  • We’re going to talk about causal inference with a multivalued treatment.
  • Our context will be a randomized experiment with three treatments.
  • A large employer is experimenting with a new benefit for their employees: subsidized gym memberships.
    • They’re randomly selecting employees to participate in a trial program.
    • Then randomizing these participants into three plans.
      • Plan 1: No gym membership. It costs them $0/month.
      • Plan 2: A gym membership that covers entry but not classes. It costs $50/month.
      • Plan 3: A gym membership that covers entry and classes. It costs $100/month.
    • Then surveying them to see how often they go to the gym, e.g. in days/month.
  • They’ll be able to use this information to decide what plan to offer everyone next year.

How We Think

  • We’ll think about this using potential outcomes.
  • We imagine, for each employee, that there’s a number of times they’d go to the gym given each plan.
    • \(y_j(0)\): The number of days/month they’d go to the gym if they had no membership.
    • \(y_j(50)\): The number of days/month they’d go to the gym if they had a $50 membership.
    • \(y_j(100)\): The number of days/month they’d go to the gym if they had a $100 membership.
  • And we’ll think about the effect of each plan on each employee.
    • \(\tau_j(50) = y_j(50) - y_j(0)\): The effect of a $50 gym membership vs none.
    • \(\tau_j(100) = y_j(100) - y_j(50)\): The effect of a $100 gym membership vs $50.

\[ \small{ \begin{array}{c|ccc|cc} j & y_j(0) & y_j(50) & y_j(100) & \tau_j(50) & \tau_j(100) \\ \hline 1 & 5 & 10 & 10 & 5 & 0 \\ 2 & 10 & 10 & 10 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 327 & 0 & 0 & 20 & 0 & 20 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 358 & 15 & 18 & 20 & 3 & 2 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ m & 10 & 10 & 5 & 0 & -5 \\ \end{array} } \]

Randomization

  • Treatment is randomized by rolling two dice for each of our 250 participants.1
    • If we roll 2-6: No membership. This happens 106 times.
    • If we roll 7-10: $50 membership. This happens 124 times.
    • If we roll 11 or 12: $100 membership. This happens 20 times.
library(dplyr)   # for case_match
n = 250          # number of participants
roll = sample(1:6, size=n, replace=TRUE) + sample(1:6, size=n, replace=TRUE)
W = case_match(roll,
  1:6   ~ 0,
  7:10  ~ 50,
  11:12 ~ 100)

Estimating An Average Treatment Effect

  • Let’s say we’re interested in the average effect of a $100 gym membership vs. a $50 gym membership.
  • Because we’ve randomized, there’s a natural unbiased estimator of this effect.2
  • It’s the difference in means for the groups receiving these two types of membership.

\[ \tau(100) = \mathop{\mathrm{E}}\qty[\frac{1}{N_{100}}\sum_{i:W_i=100} Y_i - \frac{1}{N_{50}}\sum_{i:W_i=50} Y_i ] \quad \text{ for } N_x = \sum_{i:W_i=x} 1 \]
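  • Here’s what that estimator looks like in R: a minimal sketch, assuming the vectors W and Y defined in ‘The Data’ section below.

tau.hat = mean(Y[W == 100]) - mean(Y[W == 50])   # difference in subsample means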

  • What they want to know is whether paying for the $100 membership will increase attendance by at least 1 day/month.
  • Why? Maybe their insurance company will cover the extra $50 if it has that effect.

Uninformative Interval Estimates

Figure 27.1: Two estimates of our estimator’s sampling distribution: the bootstrap sampling distribution and one based on the normal approximation. A 95% confidence interval calibrated using the normal approximation is shown as well.
  • So what we want to know is whether this effect is at least 1.
    • And we can’t rule that out.
    • But we can’t ‘rule it in’ either.
  • We just don’t have compelling evidence in one direction or the other.
  • So let’s look at why that is. What did we do wrong when designing this experiment?

Variance

  • Let’s estimate the variance of our point estimate to see where the problem is.3 \[ \mathop{\mathrm{V}}\qty[\sum_w \hat\alpha(w) \hat\mu(w)] = \sum_w \sigma^2(w) \times \mathop{\mathrm{E}}\qty[ \frac{\hat\alpha^2(w)}{N_w} ] \]

  • We can use this table. What do you see?

\[ \begin{array}{c|ccc} w & 0 & 50 & 100 \\ \hline \hat\sigma^2(w) & 10.56 & 5.94 & 10.59 \\ \hat\alpha^2(w) & 0 & 1 & 1 \\ N_w & 106.00 & 124.00 & 20.00 \\ \frac{\hat\sigma^2(w) \hat\alpha^2(w)}{N_w} & 0 & 0.05 & 0.53 \\ \end{array} \]
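  • As a quick check, here’s that variance calculation in R, plugging in the table’s values.

sigma2.hat = c(10.56, 5.94, 10.59)           # sigma^2-hat(w) for w = 0, 50, 100
alpha2.hat = c(0, 1, 1)                      # alpha^2-hat(w)
N.w        = c(106, 124, 20)                 # N_w
V.hat = sum(sigma2.hat * alpha2.hat / N.w)   # about 0.58
sqrt(V.hat)                                  # estimated standard error, about 0.76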

  • What I see is that we don’t have enough people randomized to the \(W=100\) treatment.
    • The variance we get from just that one term is \(0.53\).
    • Meaning the contribution to our standard error is \(\sqrt{0.53} \approx 0.73\)
    • And since we multiply standard error by 1.96 to get an ‘arm’ for our interval …
    • This ensures our arms are at least \(1.96 \times 0.73 \approx 1.43\) days wide.

If We Could Do It Again

  • We’d assign more people to the \(W=100\) treatment to get a more precise estimate.
  • Here’s what we’d get if we assigned people to each of our 2 relevant treatments with probability \(1/2\).
  • This does what we need it to do: it gives us a confidence interval that lies entirely to the left of \(1\).

But We Can’t

  • This is our data. We have to do what we can. So we get clever.
  • Instead of estimating each subpopulation mean separately, we fit a line to the data.
  • We take \(\hat\mu(w)\) to be the height of the line at \(w\) and \(\hat\tau(w)=\hat\mu(100)-\hat\mu(50)\) as before.
  • And when we bootstrap it …
    • We get a much narrower interval than the one we had before.
    • But it’s not in the same place at all.
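  • Here’s a minimal sketch of that line-based estimator, again assuming the W and Y from ‘The Data’ section below.

fit = lm(Y ~ W)                                       # least squares line through (W, Y)
mu.hat = function(w) predict(fit, data.frame(W = w))  # the line's height at w
tau.hat.line = mu.hat(100) - mu.hat(50)               # equals 50 * coef(fit)["W"]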

Figure: the difference-in-sample-means sampling distribution and the fitted-line-based sampling distribution, side by side.

What’s Going On?

  • Let’s zoom in. It looks like our blue line …
    • overestimates \(\mu(100)\)
    • underestimates \(\mu(50)\)
    • therefore overestimates \(\tau(100)=\mu(100)-\mu(50)\).
  • That’s what it looks like when we compare to the subsample means, anyway.
  • And that’s why we get a bigger estimate than we did using the subsample means.

Implications

Figure: the difference-in-sample-means and fitted-line-based sampling distributions, shown again.
  • This isn’t a disaster if all we want to know is whether our effect \(\tau(100)\) is \(>1\).
    • i.e., if we want to know whether the $100 plan gives us an additional gym day over the $50 one.
    • No matter which approach we use, we get the same answer: we don’t know.
    • Our interval estimates both include numbers above and below 1.
  • But if what we wanted to know was whether \(\tau(100) > 0\), we’d get different answers.
    • The difference in subsample means approach would say we don’t know.
    • The fitted line approach would say we do know—in fact, that \(\tau(100) > 1/2\).
  • And this is a problem. This is fake data, so I can tell you it’s wrong. The actual effect is exactly zero.
  • In the data we’re looking at, the potential outcomes \(Y_i(50)\) and \(Y_i(100)\) are exactly the same.

The Data

# Sampling 250 employees. Our population is an even mix of ...
#   'blue guys', who go on a quarter of days no matter what
#   'green guys', who go more when they have a membership but --- unlike the ones in our table earlier --- don't care about classes

n = 250
is.blue  = rbinom(n, 1, 1/2)
Yblue    = rbinom(n, 30, 1/4)
Ygreen0  = rbinom(n, 30, 1/10)
Ygreen50 = rbinom(n, 30, 1/5)

Y0   = ifelse(is.blue, Yblue, Ygreen0)
Y50  = ifelse(is.blue, Yblue, Ygreen50)
Y100 = Y50   ### <--- this tells us we've got no effect.

# Randomizing by rolling two dice. R is their sum.
R = sample(1:6, size=n, replace=TRUE) + sample(1:6, size=n, replace=TRUE)
W = case_match(R, 1:6~0, 7:10~50, 11:12~100)
Y = case_match(W, 0~Y0, 50~Y50, 100~Y100)

The Line-Based Estimator is Biased

Figure: the difference-in-means and fitted-line-based sampling distributions, estimated by simulation.
  • We can do a little simulation to see that.
  • We’ll generate 1000 fake datasets like the one we’ve been looking at.
    • And calculate our estimator for each one.
    • This gives us draws from our estimator’s actual sampling distribution.
    • I’ve used them to draw a histogram.
  • The mean of these point estimates is \(1.08\).
    • Since our actual effect is zero, that’s our bias. The blue line.
  • Our actual point estimate is off in the same direction—but a bit more than usual.
    • It’s shown in black with a corresponding 95% interval.
    • An interval that fails to cover the actual effect of zero. The green line.
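  • Here’s a sketch of that simulation, wrapping the data-generating code from ‘The Data’ section in a function.

line.estimate = function(n = 250) {
  is.blue  = rbinom(n, 1, 1/2)
  Yblue    = rbinom(n, 30, 1/4)
  Ygreen0  = rbinom(n, 30, 1/10)
  Ygreen50 = rbinom(n, 30, 1/5)
  Y0   = ifelse(is.blue, Yblue, Ygreen0)
  Y50  = ifelse(is.blue, Yblue, Ygreen50)
  Y100 = Y50
  R = sample(1:6, size=n, replace=TRUE) + sample(1:6, size=n, replace=TRUE)
  W = case_match(R, 1:6~0, 7:10~50, 11:12~100)
  Y = case_match(W, 0~Y0, 50~Y50, 100~Y100)
  50 * coef(lm(Y ~ W))["W"]   # the line-based estimate of tau(100)
}
draws = replicate(1000, line.estimate())
mean(draws)   # roughly 1.08: the bias, since the actual effect is zero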

The Resulting Coverage Is Bad

  • We can construct 95% confidence intervals the same way in each fake dataset.
    • I’ve drawn 100 of these intervals in purple.
    • You can see that they sometimes cover zero, but not as often as you’d like.
  • 21 / 1000 ≈ 2% of them cover zero. That’s not good.
  • If that’s typical of our analysis, people really shouldn’t trust us.
  • So why have statisticians gotten away with fitting lines for so long?
  • In essence, it’s by pretending we were trying to do something else.

What We Pretend To Do

  • Our estimator — 50 times the slope of the least squares line — is a good estimator of something.
  • It’s a good estimate of the analogous population summary: 50 × the population least squares slope.
  • That’s the red line. 981 / 1000 ≈ 98% of our intervals cover it. That sounds pretty good.

\[ \small{ \begin{aligned} \hat\tau(100) &= \hat\mu(100) - \hat\mu(50) = \qty(100 \hat a + \hat b) - \qty(50 \hat a + \hat b) = 50\hat a \\ \qqtext{where} & \hat\mu(x) = \hat a x + \hat b \qfor \hat a,\hat b = \mathop{\mathrm{argmin}}_{a,b} \sum_{i=1}^n \qty{Y_i - (aX_i + b)}^2 \\ \tilde\tau(100) &= \tilde\mu(100) - \tilde\mu(50) = \qty(100 \tilde a + \tilde b) - \qty(50 \tilde a + \tilde b) = 50\tilde a \\ \qqtext{where} & \tilde\mu(x) = \tilde a x + \tilde b \qfor \tilde a,\tilde b = \mathop{\mathrm{argmin}}_{a,b} \mathop{\mathrm{E}}\qty[\qty{Y_i - (aX_i + b)}^2] \end{aligned} } \]

Problem: That’s Not What We Wanted To Estimate

  • The effects we want to estimate are, as a result of randomization, differences in subpopulation means. \[ \tau(50) = \mu(50) - \mu(0) \qand \tau(100) = \mu(100) - \mu(50) \]

  • The population least squares line doesn’t go through these means. It can’t.

  • They don’t lie along a line. They lie along a hockey-stick shaped curve.

  • This means no line—including the population least squares line—can tell us what we want to know.

  • We call this misspecification. We’ve specified a shape that doesn’t match the data.

  • This problem has a history of getting buried in technically correct but opaque language.

Language Games

  • The widespread use of causal language is a recent development.
  • In the past, people would do a little linguistic dance to avoid talking about causality.
  • They’d talk about what they were really estimating as if it were what you wanted to know.
  • This not only buried issues of causality, it buried issues of bias due to misspecification.
    • e.g., due to trying to use a line to estimate a hockey-stick shaped curve.
  • They’d say ‘The regression coefficient for \(Y\) on \(W\) is \(0.03\).’
    • You’d be left to interpret this as saying the treatment effect was roughly \(50 \times 0.03\).
    • That’s wrong, but it’s ‘your fault’ for interpreting it that way. They set you up.

Misspecification and Inference

  • In particular, we’ll think about its implications for statistical inference. This boils down to…
  • what happens when our estimator’s sampling distribution isn’t centered on what we want to estimate.
  • We can see that, if our sampling distribution is narrow enough, the coverage of interval estimates is awful.
  • Today, we’ll think about how sample size impacts that. Something interesting happens.
  • From a point-estimation perspective, more data is always good. Our estimators do get more accurate.
  • But from an inferential perspective, it can be very bad.
  • When we use misspecified models, our coverage claims will get less accurate as sample size grows.

Misspecification’s Impact as Sample Size Grows

  • Here we’re seeing the sampling distributions of our two estimators at three sample sizes.
    • Left/Right: Sample sizes 50, 100, 200
    • Top/Bottom: Difference in Subsample Means / Difference in Values of Fitted Line
  • The green line is the actual effect.
  • The red one is the effect estimate we get using the population least squares line: 50 × that line’s slope.
  • I’ve drawn 100 confidence intervals based on normal approximation for each estimator.
  • We can see that the line-based estimator’s coverage gets increasingly bad as sample size increases.

Coverage

  • Here are the coverage rates for each estimator at each sample size.
  • The difference in sample means works more or less as we’d like.
    • Coverage is always pretty good and is roughly 95% — what we claim — in larger samples.
  • The line-based estimator does the opposite.
    • Coverage is bad in small samples and worse in larger ones.
  • Now, we’re going to do some exercises. We’re going to work out how to calculate coverage by hand.
    • The point is to get some perspective on what it means to have ‘just a little bit of bias’.
    • This’ll come in handy when people around you say a bit of misspecification isn’t a big deal.

Exercises

Set Up

  • Above, I’ve drawn the sampling distribution for the estimator \(\hat\tau\) and a normal approximation to it.
  • We’re going to calculate that estimator’s coverage rate. Approximately. We’ll make some simplifications.
  • We’ll act as if the normal approximation to the sampling distribution were perfect.
  • We’ll assume our intervals are perfectly calibrated.
    • i.e., for a draw \(\hat\tau\) from the sampling distribution …
    • the interval is \(\hat\tau \pm 1.96\sigma\) where \(\sigma\) is our estimator’s actual standard error.

Exercise 1

  • Let’s think about things from the treatment effect’s perspective. The solid green line.
  • Let’s give it arms the same length as our intervals have. They go out to the dashed green lines.
  • An interval’s arms touch the effect if and only if the effect touches the interval’s center, i.e., the corresponding point estimate.
  • So we’re calculating the proportion of point estimates covered by the treatment effect’s arms.
  • Exercise. Write this in terms of the sampling distribution’s probability density \(f(x)\).
    • This is a calculation you can illustrate with a sketch on top of the plot above. You may want to start there.

Exercise 1

  • To calculate the proportion of point estimates covered by the treatment effect’s arms
  • We integrate the sampling distribution’s probability density from one to the other.

\[ \text{coverage} = \int_{-1.96\sigma}^{1.96\sigma} f(x) dx \qqtext{ where } f(x)=\frac{1}{\sqrt{2\pi} \sigma} e^{-\frac{(x-\tilde\tau)^2}{2\sigma^2}} \]

  • Here I’ve written in the actual formula of the density of the normal I drew.
    • The one with mean \(\tilde\tau\) and standard deviation \(\sigma\).
    • I’m not expecting you to have that memorized, but it does come in handy from time to time.
  • This doesn’t yet give us all that much generalizable intuition.
  • But we’ll derive some by simplifying that integral.
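  • Before we do, here’s a quick numerical check of that integral in R. It’s a sketch using the bias and standard error from the \(n=160\) row of the table near the end of this section.

tau.tilde = 1.07   # the estimator's mean, i.e. the bias, since the true effect is zero
sigma     = 0.35   # its standard error
f = function(x) dnorm(x, mean = tau.tilde, sd = sigma)
integrate(f, lower = -1.96 * sigma, upper = 1.96 * sigma)$value   # about 0.14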

Exercise 2

  • Now we’ll rewrite our coverage as an integral of the standard normal density
  • To do it, we’re going back to intro calculus. We’re making a substitution \(u=(x-\tilde\tau)/\sigma\).
  • This means we have to change the limits.
    • As \(x\) goes from \(-1.96\sigma\) to \(1.96\sigma\)
    • \(u\) goes from \(-1.96-\tilde\tau/\sigma\) to \(1.96-\tilde\tau/\sigma\).
  • And make a substitution into the integrand.
    • For \((x-\tilde\tau)^2/\sigma^2\), we substitute \(u^2\).
  • And multiply by \(dx/du\) so we get to integrate over \(u\).
    • \(x= u\sigma + \tilde\tau\) so \(dx=\sigma du\)

\[ \text{coverage} = \int_{-1.96\sigma}^{1.96\sigma} \frac{1}{\sqrt{2\pi} \sigma} e^{-\frac{(x-\tilde\tau)^2}{2\sigma^2}} dx = \int_{-1.96-\tilde\tau/\sigma}^{1.96-\tilde\tau/\sigma} \frac{1}{\sqrt{2\pi} \sigma} e^{-\frac{u^2}{2}} \sigma \, du = \int_{-1.96-\tilde\tau/\sigma}^{1.96-\tilde\tau/\sigma} \frac{1}{\sqrt{2\pi}} e^{-\frac{u^2}{2}} du \]

Summary and Generalization

  • We can calculate the coverage of a biased, approximately normal estimator easily.
  • It’s an integral of the standard normal density over the range \(\pm 1.96 -\text{bias}/\text{se}\).
  • The R function call pnorm(x) calculates that integral from \(-\infty\) to \(x\). All we need to do is subtract.

Math Version

\[ \small{ \begin{aligned} \text{coverage} &=\int_{-1.96-\frac{\text{bias}}{\text{se}}}^{+1.96-\frac{\text{bias}}{\text{se}}} \frac{1}{\sqrt{2\pi}} e^{-\frac{u^2}{2}} du \\ &= \int_{-\infty}^{+1.96-\frac{\text{bias}}{\text{se}}} \frac{1}{\sqrt{2\pi}} e^{-\frac{u^2}{2}} du \ - \ \int_{-\infty}^{-1.96-\frac{\text{bias}}{\text{se}}} \frac{1}{\sqrt{2\pi}} e^{-\frac{u^2}{2}} du \end{aligned} } \]

R Version

coverage = function(bias.over.se) {  pnorm(1.96 - bias.over.se) - pnorm(-1.96 - bias.over.se)  }
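  • Calling it at the bias/se values used in the examples below:

coverage(c(1/2, 1, 2, 3))   # about 0.92, 0.83, 0.48, 0.15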

Examples

\[ \frac{\text{bias}}{\text{se}} = \frac{1}{2} \quad \implies \quad \text{coverage} = 92\% \]

\[ \frac{\text{bias}}{\text{se}} = 1 \quad \implies \quad \text{coverage} = 83\% \]

\[ \frac{\text{bias}}{\text{se}} = 2 \quad \implies \quad \text{coverage} = 48\% \]

\[ \frac{\text{bias}}{\text{se}} = 3 \quad \implies \quad \text{coverage} = 15\% \]

People Report This

 

From Stable Estimation of Survival Causal Effects. Pham et al. arXiv:2310.02278

\[ \begin{array}{c|c} \abs{\frac{\text{bias}}{\text{se}}} & \text{coverage of } \pm 1.96\sigma \text{ CI} \\ \hline \frac{1}{2} & 92\% \\ 1 & 83\% \\ 2 & 48\% \\ 3 & 15\% \\ \end{array} \]
  • This plot shows bias/se for 3 estimators of the effect of a treatment on the probability of survival.
  • In particular, they’re estimators of the effect on survival up to time \(t\). \(t\) is on the \(x\)-axis.
  • Q. For small \(t\), all 3 estimators are doing ok. When does the red one’s coverage drop below 92%? What about 83%? And 48%?
  • Q. What can we say about the coverage of the blue one and the green one?
  • A. The red one’s coverage drops below 92% after \(t=12\). That’s where bias/se starts to exceed 1/2. For \(t=14\), it’s down to 83% (bias/se > 1). And for \(t=20\), it’s down to 48% (bias/se > 2).
  • A. Blue and green have bias/se close to or below 1/2 for all times \(t\). So their coverage is roughly 92% or higher for all times \(t\).

The Whole Bias/SE vs. Coverage Curve
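  • Here’s a minimal sketch that traces out the whole curve using the coverage function above.

bias.over.se = seq(0, 4, by = 0.01)
plot(bias.over.se, coverage(bias.over.se), type = "l",
     xlab = "bias / se", ylab = "coverage of a nominal 95% interval")
abline(h = 0.95, lty = 2)   # the nominal level, for reference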

A Back-of-the-Envelope Calculation

  • Suppose we’ve got misspecification, but only a little bit.

  • We’re doing a pilot study with sample size \(n=100\).

  • And we’re sure that our estimator’s bias is less than half its standard error.

  • Question 1. What’s the coverage of our 95% confidence intervals?

  • A. At least 92%.
  • Now suppose we’re doing a follow-up study with sample size \(n=1600\).
  • And we’re using the same estimator. Bias is the same.
  • Question 2. What’s the coverage of our 95% confidence intervals?
  • Assume our estimator’s standard error is proportional to \(\sqrt{1/n}\).
    • They almost always are. For estimators we’ll talk about in this class, always.
    • That means it’ll be \(\sqrt{1/1600} \ / \ \sqrt{1/100}=\sqrt{1/16}=1/4\) of what it was in the pilot study.
  • A. At least 48%. Here’s why. \[ \text{If } \ \frac{\text{bias}}{\text{pilot se}} \le 1/2 \ \text{ and } \ \text{follow-up se} = \frac{\text{pilot se}}{4}, \ \text { then } \ \frac{\text{bias}}{\text{follow-up se}} = 4 \times \frac{\text{bias}}{\text{pilot se}} \le 2. \]
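  • We can check both bounds with the coverage function from before.

coverage(1/2)       # the pilot bound: about 92%
coverage(4 * 1/2)   # the follow-up bound: about 48%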

Lessons

Be Careful Using Misspecified Models

Figure: sampling distributions for the line-based estimate of the gym-subsidy effect \(\tau(100)\) at sample sizes \(n=10\), \(n=40\), and \(n=160\).
  • You might have an unproblematic level of bias in a small study.
  • That same amount of bias can be a big problem in a larger one.
  • In our gym-subsidy example, a line was pretty badly misspecified.
  • That means we have to go pretty small to get an unproblematic level of bias.
  • Here are the coverage rates at sample sizes 10, 40, and 160.
  • We include two versions.
    • An approximation calculated using our formula.
    • The actual coverage based on simulation.
  • These differ a bit when our simplifying assumptions aren’t roughly true.
    • e.g. accuracy of the normal approximation to the sampling distribution.
    • This is a bigger problem in small samples than in large ones.

\[ \small{ \begin{array}{c|ccccc} n & \text{bias} & \text{se} & \frac{\text{bias}}{\text{se}} & \text{calculated coverage} & \text{actual coverage} \\ \hline 10 & 1.07 & 1.60 & 0.67 & 90\% & 84\% \\ 40 & 1.07 & 0.74 & 1.44 & 70\% & 66\% \\ 160 & 1.07 & 0.35 & 3.03 & 14\% & 13\% \\ \end{array} } \]
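  • The ‘calculated coverage’ column comes straight from our formula.

coverage(c(0.67, 1.44, 3.03))   # about 0.90, 0.70, 0.14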

It’s Best to Use a Model That’s Not Misspecified

Figure: sampling distributions for the difference-in-means estimate and the line-based estimate at sample sizes \(n=10\), \(n=40\), and \(n=160\).

If You’re Fitting a Line, Do It Cleverly

  • If we’re estimating \(\tau(100)\), we can use a line fit to the observations with \(W=50\) and \(W=100\).
  • Then, since we’ve effectively got a binary covariate, it goes through the subsample means.
  • And we end up with the difference in means estimator. We know that’s unbiased.
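  • In code, that’s a one-line change: fit the line only to the observations with \(W > 0\). A sketch, assuming the W and Y from ‘The Data’ section.

fit.clever = lm(Y ~ W, subset = W > 0)   # uses only the W = 50 and W = 100 groups
50 * coef(fit.clever)["W"]               # same as the difference in subsample means
mean(Y[W == 100]) - mean(Y[W == 50])     # the difference in means, for comparison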

Next Week

  • We’ll see a more general version of this trick.
    • We can fit misspecified models and still get unbiased estimators of summaries.
    • We just have to fit them the right way.
  • We’ll use weighted least squares with weights that depend on the summary.
  • In this case, we give zero weight to observations with \(W=0\) and equal weight to the others.
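  • As a preview, here’s a sketch of that weighted fit. With these weights, it reproduces the difference-in-means estimator.

fit.wls = lm(Y ~ W, weights = as.numeric(W > 0))   # zero weight on W = 0 observations
50 * coef(fit.wls)["W"]                            # the difference in means again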

  1. Each participant’s potential outcomes are plotted as three connected circles. The realized outcome is filled in; the others are hollow. The proportions of participants receiving each treatment are indicated by the purple bars.↩︎

  2. See our Lecture on Causal Inference.↩︎

  3. See our Lecture on Inference for Complex Summaries.↩︎