27  Misspecification and Averaging

Setup

Shared Between Examples

Gym Subsidy Example

California Income Data

What We’re Doing

  • Today, we’ll look into something we saw in the Week 9 Homework’s last exercise.
    • We estimated our subpopulation means \(\mu(w,x)\) using least squares …
    • … to choose a function from the ‘not-necessarily-parallel lines’ model.

\[ \hat\mu(w,x) = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \qty{ Y_i - m(W_i,X_i) }^2 \qfor \mathcal{M}= \qty{ m(w,x) = a(w) + b(w)x } \]

  • It didn’t fit the data very well: \(\hat\mu(w,x) \not\approx \mu(w,x)\).1
  • But when we used it to adjust for covariate shift, it worked well: \(\hat \Delta_0 \approx \Delta_0\). \(\hphantom{1}\)

\[ \color{gray} \begin{aligned} \hat\Delta_0 &= \textcolor[RGB]{248,118,109}{\frac{1}{N_0}\sum_{i: W_i=0}} \left\{ \textcolor[RGB]{0,191,196}{\hat\mu(1,X_i)} - \textcolor[RGB]{248,118,109}{\hat\mu(0,X_i)} \right\} &&\qfor \textcolor[RGB]{248,118,109}{N_0 = \sum_{i: W_i=0} 1} \\ \Delta_0 &= \textcolor[RGB]{248,118,109}{\frac{1}{m_0} \sum_{j: w_j=0}} \left\{ \textcolor[RGB]{0,191,196}{\mu(1,x_{j})} - \textcolor[RGB]{248,118,109}{\mu(0,x_{j})} \right\} &&\qfor \textcolor[RGB]{248,118,109}{m_0 = \sum_{j: w_j=0} 1} \end{aligned} \]
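
  • In code, the estimator looks roughly like this. It’s a sketch, not the course’s exact code: it assumes a sample data frame sam with columns y (income), w (the group, coded 0/1), and x (education level).

    fit <- lm(y ~ w * x, data = sam)          # with binary w, this fits a(w) + b(w) x
    muhat <- function(w, x) predict(fit, newdata = data.frame(w = w, x = x))
    x0 <- sam$x[sam$w == 0]                   # education levels of the W = 0 observations
    Delta.hat.0 <- mean(muhat(1, x0) - muhat(0, x0))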

  • How did we know it worked well? We were working with a fake population.
    • We could actually calculate our estimation target. And we did.
    • And we simulated our survey many times to get our estimator’s sampling distribution.2
  • Look at the difference in mean incomes at one education level.

  • Compare it to our target \(\theta\), the actual difference in means in our population. \[ \hat\theta = \textcolor[RGB]{0,191,196}{\hat\mu(1,x)} - \textcolor[RGB]{248,118,109}{\hat\mu(0,x)} \qqtext{ estimates } \theta = \textcolor[RGB]{0,191,196}{\mu(1,x)} - \textcolor[RGB]{248,118,109}{\mu(0,x)} \]

  • The sampling distribution of our estimator \(\hat\theta\) is shown above.

  • For \(x=14\) (a 2-year degree), which is what’s shown for now, its center is far from the target.

    • Our estimator is biased. And its bias is positive.
    • In most samples drawn from the population, we overestimate the difference.
  • Change \(x\) in the code above to see what happens at other education levels.

    • Try \(x=12\) (completed high school) and \(x=16\) (a 4-year degree). And whatever else you like.
    • Remember that some levels, e.g. \(x=15\), don’t exist in our population.
    • Don’t be surprised if you try \(x=15\) and it doesn’t work.
  • Here’s the sampling distribution of our adjusted difference in means. \[ \hat\theta = \textcolor[RGB]{248,118,109}{\frac{1}{N_0}\sum_{i: W_i=0}} \qty{ \textcolor[RGB]{0,191,196}{\hat\mu(1,X_i)} - \textcolor[RGB]{248,118,109}{\hat\mu(0,X_i)} } \qqtext{ estimates } \theta = \textcolor[RGB]{248,118,109}{\frac{1}{m_0} \sum_{j: w_j=0}} \qty{ \textcolor[RGB]{0,191,196}{\mu(1,x_{j})} - \textcolor[RGB]{248,118,109}{\mu(0,x_{j})} } \]

  • We’re using biased estimates of the difference in means at each education level \(x\).

  • But when we average over the covariate distribution3 of male CA residents

    • … we get an estimator that’s barely biased at all.
    • Our term-by-term biases are (imperfectly) canceling each other out.

This Activity’s Goal

  • We’re going to understand this cancellation phenomenon a bit better.
  • We’ll do the math to find out …
    • where bias is and isn’t coming from.
    • and what tends to cause cancellation like this.

Why bother?

  • You could argue that we should never use misspecified models.
    • If we always used the ‘all functions model’, we wouldn’t have to worry about bias.
  • 1. We can’t always use the all functions model.
    • If we have groups in our population that we haven’t sampled …
    • … we need a smaller model, e.g. lines, to help us make predictions for them.
  • 2. We’re not always choosing the model.
    • When we read someone else’s analysis, we want to know whether we should believe what they’re saying.
    • Often, you can’t get their data so you can’t just analyze it your way to check.
  • 3. It’s a good warm-up for what we’ll cover next.

Warm-Up

Our Gym Subsidy Example

A Case of Perfect Cancellation

  • This is the gym subsidy example we looked at in class a few days ago.
    • We have 3 levels of \(x\): 0, 50, and 100.
    • And we’re estimating \(\mu\) by least squares using the ‘lines’ model.

\[ \begin{aligned} \hat \mu &= \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \qty{ Y_i - m(X_i) }^2 \qfor \mathcal{M}= \qty{ m(x) = a + bx } && \text{ estimates } \\ \mu(x) &= \frac{1}{m_x} \sum_{j: x_j=x} y_j \qfor m_x = \sum_{j: x_j=x} 1 \end{aligned} \]
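
  • To make this concrete, here’s a sketch with a made-up hockey-stick population standing in for the one in the plot. The numbers are invented; only the shape matters.

    mu  <- function(x) ifelse(x <= 50, 100, 100 + 2 * (x - 50))   # hypothetical hockey stick
    pop <- data.frame(x = rep(c(0, 50, 100), each = 1000))        # made-up population
    pop$y <- mu(pop$x) + rnorm(nrow(pop), sd = 20)
    sam <- pop[sample(nrow(pop), 100), ]                          # one survey
    fit     <- lm(y ~ 1 + x, data = sam)      # least squares in the 'lines' model
    fit.pop <- lm(y ~ 1 + x, data = pop)      # population least squares, i.e. mu.tilde
    predict(fit,     data.frame(x = 100))     # muhat(100)
    predict(fit.pop, data.frame(x = 100))     # mu.tilde(100), where muhat(100) is centered
    mu(100)                                   # mu(100), the thing we'd like to estimate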

  • This is simulated data, so we know \(\mu(x)\). It’s plotted in green above.
  • We haven’t done a very good job of estimating it.4
    • It’s not because our sample is too small.5
    • The problem is misspecification. \(\mu\) looks like a hockey stick. No line can fit that.
  • Here’s the sampling distribution of \(\hat\mu(100)\), our least squares estimate of \(\mu(100)\).
    • It’s not even close to being centered on \(\mu(100)\).
    • It’s centered on the population least squares estimate of \(\mu(100)\).6

\[ \tilde\theta = \tilde\mu(100) \qfor \tilde \mu = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{j=1}^m \qty{ y_j - m(x_j) }^2 \]

  • Here’s how \(\hat\theta\), \(\tilde\theta\), and \(\theta\) are calculated in R; a self-contained sketch of the whole simulation appears at the end of this subsection.
  • To get the plot above, plot.sampling.distribution simulates our experiment over and over and calculates \(\hat\theta\) each time.
  • So here’s our bias.
  • One thing you’ll notice if you look at \(\hat\mu(x)\) for \(x=0\) and \(x=50\) is that they’re biased in the opposite direction.7
    • If we’re looking at a difference like \(\hat\mu(100) - \hat\mu(50)\), that’s bad: \(\text{too big} - \text{too small} = \text{even more too big}\)
    • But if we’re looking at a sum, that’s good. We’ll get some cancellation.
  • Let’s think about a new target: the average of \(y\) in the population.
    • That’s the same thing as the average of \(\mu(x_j)\) over the whole population.
    • That’s just summing in a certain order. See Section 31.1. \[ \theta = \frac{1}{m} \sum_{j=1}^m y_j = \frac{1}{m} \sum_{j=1}^m \mu(x_j) \]
  • So maybe we should estimate it by averaging \(\hat\mu(x)\) over our sample.

\[ \hat\theta = \frac{1}{n} \sum_{i=1}^n \hat\mu(X_i) \]

  • And when we do that, we get an unbiased estimator.
    • The center of our sampling distribution is the population average of our population least squares predictor.
    • And that’s exactly the same as our estimation target \(\theta\).

\[ \mathop{\mathrm{E}}[\hat\theta] = \tilde\theta = \frac{1}{m} \sum_{j=1}^m \tilde\mu(x_j) \]

  • Here’s the empirical evidence.
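
  • The course’s plot.sampling.distribution code isn’t reproduced here, but here’s a self-contained sketch of the same kind of simulation, again using an invented hockey-stick population. The simulated \(\hat\theta\)s center essentially on \(\theta\).

    mu  <- function(x) ifelse(x <= 50, 100, 100 + 2 * (x - 50))   # invented mu(x)
    pop <- data.frame(x = rep(c(0, 50, 100), each = 1000))
    pop$y <- mu(pop$x) + rnorm(nrow(pop), sd = 20)
    theta <- mean(pop$y)                                          # the target
    theta.hats <- replicate(10000, {
      sam <- pop[sample(nrow(pop), 100), ]
      fit <- lm(y ~ 1 + x, data = sam)                            # the 'lines' model
      mean(predict(fit, newdata = sam))                           # average of muhat(X_i) over the sample
    })
    mean(theta.hats) - theta                                      # roughly zero: no bias
    hist(theta.hats); abline(v = theta)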

Orthogonality and Perfect Cancellation

  • The orthogonality of least squares residuals tells us this happens.

\[ 0 = \sum_{i=1}^n \qty{ Y_i - \hat\mu(X_i) } \ m(X_i) \qqtext{ for all } m \in \mathcal{M} \]

  • Why? Plug in \(m(x) = 1\). That’s a line. It’s a horizontal line. Then do a little algebra. \[ 0 = \sum_{i=1}^n \qty{ Y_i - \hat\mu(X_i) } \times 1 \implies \textcolor[RGB]{192,192,192}{\frac{1}{n}} \sum_{i=1}^n Y_i = \textcolor[RGB]{192,192,192}{\frac{1}{n}} \sum_{i=1}^n \hat\mu(X_i) \]

  • It tells us that our estimate \(\hat\theta\) is really just the sample mean.

    • We calculated it in a strange way, but it’s the same number.
    • And we know the sample mean is an unbiased estimator of the population mean.
  • This is true for any regression model that includes the constant function \(m(x)=1\).

\[ \begin{aligned} \mathcal{M}&= \{ m(x) = a : a \in \mathbb{R} \} && \text{horizontal lines} \\ \mathcal{M}&= \{ \text{all functions} \ m(x) \} && \text{all functions} \\ \mathcal{M}&= \{ m(x) = a + bx : a,b \in \mathbb{R} \} && \text{lines} \\ \mathcal{M}&= \{ m(x) = \sum_{k=0}^p a_k x^k : a_k \in \mathbb{R} \} && \text{polynomials of order } p \\ \end{aligned} \]

  • One model it’s not true for is lines through the origin.
    • That’s one reason people avoid this model: averages of its predictions are biased.
    • If you go back a slide and change formula = y ~ 1+x to formula = y ~ 0+x, you’ll see it. Bias!

\[ \begin{aligned} \mathcal{M}&= \{ m(x) = bx : b \in \mathbb{R} \} && \text{lines through the origin} \end{aligned} \]
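
  • You can check all of this numerically with any data you like: as long as the model contains the constant function, the fitted values from lm average to exactly the sample mean; drop the intercept and they don’t. Here’s a sketch with made-up data.

    set.seed(1)
    x <- runif(100, 0, 100)
    y <- pmax(x - 50, 0) + rnorm(100)                 # made-up data
    mean(y) - mean(fitted(lm(y ~ 1 + x)))             # exactly 0, up to rounding error
    mean(y) - mean(fitted(lm(y ~ poly(x, 3))))        # also 0: polynomials include the constant
    mean(y) - mean(fitted(lm(y ~ 0 + x)))             # not 0: lines through the origin

  • Running the same check on a population data frame illustrates the population version of this property, which we use next.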

Population Orthogonality and Perfect Cancellation

  • Let’s look at this a different way. Let’s think about the orthogonality of population least squares residuals.

\[ \frac{1}{m} \sum_{j=1}^m \qty{ y_j - \tilde\mu(x_j) } \times m(x_j) = 0 \qqtext{ for all } m \in \mathcal{M}. \]

  • Again, plug in \(m(x) = 1\). And do the same algebra as before.

\[ 0 = \sum_{j=1}^m \qty{ y_j - \tilde\mu(x_j) } \times 1 \implies \textcolor[RGB]{192,192,192}{\frac{1}{m}} \sum_{j=1}^m y_j = \underset{\tilde\theta}{\textcolor[RGB]{192,192,192}{\frac{1}{m}} \sum_{j=1}^m \tilde\mu(x_j)} \]

  • This tells us the population version of our estimator, \(\tilde\theta\), is just the population mean. That’s our target.
  • And because we know that \(\tilde\theta\) is the expectation of our estimator \(\hat\theta\), that means our estimator is unbiased.

California Income Data

Our Adjusted Difference \(\hat\Delta_0\)

  • Let’s look at our adjusted difference in means when we use the ‘not necessarily parallel lines’ model. \[ \mathcal{M}= \{ m(w,x) = a(w) + b(w)x \ \text{ for functions} \ a,b \} \]

  • We’ll break it down into two parts.

    • A matched part where we average the red prediction function \(\hat\mu(0,x)\) over the red covariate distribution.
    • A mismatched part where we average the green prediction function \(\hat\mu(1,x)\) over the red covariate distribution. \[ \color{gray} \begin{aligned} \hat\Delta_0 &= \textcolor[RGB]{248,118,109}{\frac{1}{N_0}\sum_{i: W_i=0}} \left\{ \textcolor[RGB]{0,191,196}{\hat\mu(1,X_i)} - \textcolor[RGB]{248,118,109}{\hat\mu(0,X_i)} \right\} &&\qfor \textcolor[RGB]{248,118,109}{N_0 = \sum_{i: W_i=0} 1} \\ &= \underset{\text{mismatched part}}{\textcolor[RGB]{248,118,109}{\frac{1}{N_0}\sum_{i: W_i=0}}\textcolor[RGB]{0,191,196}{\hat\mu(1,X_i)}} - \underset{\text{matched part}}{\textcolor[RGB]{248,118,109}{\frac{1}{N_0}\sum_{i: W_i=0}}\textcolor[RGB]{248,118,109}{\hat\mu(0,X_i)}} \end{aligned} \]
  • Our target, of course, has the same parts.

\[ \color{gray} \begin{aligned} \Delta_0 &= \textcolor[RGB]{248,118,109}{\frac{1}{m_0}\sum_{j: w_j=0}} \left\{ \textcolor[RGB]{0,191,196}{\mu(1,x_j)} - \textcolor[RGB]{248,118,109}{\mu(0,x_j)} \right\} &&\qfor \textcolor[RGB]{248,118,109}{m_0 = \sum_{j: w_j=0} 1} \\ &= \underset{\text{mismatched part}}{\textcolor[RGB]{248,118,109}{\frac{1}{m_0}\sum_{j: w_j=0}}\textcolor[RGB]{0,191,196}{\mu(1,x_j)}} - \underset{\text{matched part}}{\textcolor[RGB]{248,118,109}{\frac{1}{m_0}\sum_{j: w_j=0}}\textcolor[RGB]{248,118,109}{\mu(0,x_j)}} \end{aligned} \]

The Matched Part

  • Here’s the sampling distribution of the matched part.
  • You can see it’s unbiased, i.e., centered on the matched part of our target.

\[ \mathop{\mathrm{E}}\qty[ \textcolor[RGB]{248,118,109}{\frac{1}{N_0}\sum_{i: W_i=0}}\textcolor[RGB]{248,118,109}{\hat\mu(0,X_i)} ] = \textcolor[RGB]{248,118,109}{\frac{1}{m_0} \sum_{j: w_j=0} \mu(0,x_j)} \]

  • Let’s prove it. As usual, we’ll use the orthogonality of least squares residuals.

\[ 0 = \sum_{i=1}^n \qty{ Y_i - \hat\mu(W_i,X_i) } \ m(W_i,X_i) \qqtext{ for all } m \in \mathcal{M} \]

  • What function \(m\) should we plug in? Suppose we’re using the ‘not necessarily parallel lines’ model.
  • See the next slide for the answer.

\[ m(w,x) = 1_{=0}(w) = \begin{cases} 1 & \text{if } w=0 \\ 0 & \text{if } w=1 \end{cases} \]

  • That’s in the model. It’s \(a(w) + b(w)x\) for \(a(w)=1_{=0}(w)\) and \(b(w)=0\).
  • Let’s plug it in and do our algebra.

\[ \begin{aligned} 0 &\overset{\texttip{\small{\unicode{x2753}}}{plugging in}}{=} \sum_{i=1}^n \qty{ Y_i - \hat\mu(W_i,X_i) } \times 1_{=0}(W_i) \\ &\overset{\texttip{\small{\unicode{x2753}}}{using the indicator trick}}{=} \sum_{i=1}^n Y_i 1_{=0}(W_i) - \sum_{i=1}^n \hat\mu(0,X_i) 1_{=0}(W_i). \end{aligned} \]

  • Rearranging and dividing by the number \(N_0\) of red dots gives us what we want.8 \[ \frac{1}{N_0}\sum_{i:W_i=0} Y_i = \frac{1}{N_0}\sum_{i:W_i=0} \hat\mu(0,X_i) \]
  • The left side is just the sample mean of \(Y\) among the red dots, and that’s an unbiased estimator of the corresponding population mean \(\textcolor[RGB]{248,118,109}{\frac{1}{m_0} \sum_{j: w_j=0} \mu(0,x_j)}\), i.e. the matched part of our target.
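
  • Here’s that identity as a quick numeric check, assuming as before a sample data frame sam with columns y, w (0/1), and x. It holds exactly, not just on average.

    fit  <- lm(y ~ w * x, data = sam)                 # 'not necessarily parallel lines'
    sam0 <- sam[sam$w == 0, ]                         # the red dots
    matched <- mean(predict(fit, newdata = sam0))     # average of muhat(0, X_i) over W_i = 0
    all.equal(matched, mean(sam0$y))                  # TRUE, by orthogonality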

The Mismatched Part

  • Here’s the sampling distribution of the mismatched part. You can see that it’s biased, but not too badly. \[ \mathop{\mathrm{E}}\qty[ \textcolor[RGB]{248,118,109}{\frac{1}{N_0}\sum_{i: W_i=0}}\textcolor[RGB]{0,191,196}{\hat\mu(1,X_i)} ] \neq \textcolor[RGB]{248,118,109}{\frac{1}{m_0} \sum_{j: w_j=0}} \textcolor[RGB]{0,191,196}{\mu(1,x_j)} \]
  • Let’s think about why we can’t show that it’s unbiased like we did for the matched part.
    • If you put a group indicator \(m=1_{=0}\) or \(m=1_{=1}\) in the orthogonality condition …
    • … it’s not going to work out like it did for the matched part.

\[ 0 = \sum_{i=1}^n \qty{ Y_i - \hat\mu(W_i,X_i) } \ m(W_i,X_i) \qqtext{ for all } m \in \mathcal{M} \]

  • If you put in an indicator for \(w=1\):
    • you get the right function \(\hat\mu(1,x)\)
    • averaged over the wrong covariate distribution.
  • If you put in an indicator for \(w=0\):
    • you get the wrong function \(\hat\mu(0,x)\)
    • averaged over the right covariate distribution.
  • In short, putting in a group indicator tells us about matched things, i.e.,
    • averages of the least squares prediction for a group
    • averaged over the covariate distribution of that same group.
  • You could, of course, use the population residuals’ orthogonality property. \[ \begin{aligned} 0 &= \mathop{\mathrm{E}}\qty[ \qty{ Y_i - \tilde\mu(W_i, X_i) } \ m(W_i,X_i) ] && \qqtext{ for all } m \in \mathcal{M}\\ &\overset{\texttip{\small{\unicode{x2753}}}{By the law of iterated expectations conditioning on $(W_i,X_i)$}}{=} \mathop{\mathrm{E}}\qty[ \qty{ \mu(W_i,X_i) - \tilde\mu(W_i,X_i) } \ m(W_i,X_i) ] \qqtext{ for all } m \in \mathcal{M}\\ &\overset{\texttip{\small{\unicode{x2753}}}{Writing out our expectation as an average over the population}}{=} \frac{1}{m}\sum_{j=1}^m \qty{ \mu(w_j,x_j) - \tilde\mu(w_j,x_j) } \ m(w_j,x_j) \end{aligned} \]

  • But that has the same problem. Matched stuff only.

  • Let’s do what we just said wouldn’t work: plug an indicator into our population orthogonality condition.
  • Using \(m(w,x)=1_{=1}(w)\), we get the wrong average of the right function.

\[ \begin{aligned} 0 &= \frac{1}{m}\sum_{j=1}^m \qty{ \mu(w_j,x_j) - \tilde\mu(w_j,x_j) } \ 1_{=1}(w_j) \end{aligned} \tag{29.1}\]

  • Using the indicator trick, rearranging, and multiplying by \(m/m_1\) gives us …

\[ \textcolor[RGB]{0,191,196}{\frac{1}{m_1}\sum_{j:w_j=1} \mu(1,x_j)} = \textcolor[RGB]{0,191,196}{\frac{1}{m_1}\sum_{j:w_j=1} \tilde\mu(1,x_j)} \]

  • Which we can rewrite in ‘histogram form’9

\[ \sum_x \textcolor[RGB]{0,191,196}{p_{x \mid 1} \ \tilde\mu(1,x)} = \sum_x \textcolor[RGB]{0,191,196}{p_{x \mid 1} \ \mu(1,x)} \qfor p_{x \mid w} = \frac{\sum_{j:w_j=w,x_j=x} 1}{\sum_{j:w_j=w} 1} \]

  • What happens if our two histograms are almost the same? We can’t be that far off. \[ \qqtext{ therefore } \sum_x \textcolor[RGB]{248,118,109}{p_{x \mid 0}} \ \textcolor[RGB]{0,191,196}{\mu(1,x)} \approx \sum_x \textcolor[RGB]{248,118,109}{p_{x \mid 0}} \ \textcolor[RGB]{0,191,196}{\tilde\mu(1,x)} \qqtext{ if } \textcolor[RGB]{248,118,109}{p_{x \mid 0}} \approx \textcolor[RGB]{0,191,196}{p_{x \mid 1}} \]

  • If our histograms were exactly the same, it wouldn’t really matter what model we used here.

    • But they’re not exactly the same. So it does matter.
    • Try changing your regression formula to y ~ 1+w to fit the ‘horizontal lines’ model.
    • What happens to our estimator’s bias?
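
  • If you don’t have the interactive code handy, here’s a sketch of that comparison. As before, it assumes a sample data frame sam with columns y, w (0/1), and x.

    adjusted.diff <- function(formula, sam) {
      fit <- lm(formula, data = sam)
      x0  <- sam$x[sam$w == 0]
      mean(predict(fit, data.frame(w = 1, x = x0)) -
           predict(fit, data.frame(w = 0, x = x0)))
    }
    adjusted.diff(y ~ w * x, sam)      # 'not necessarily parallel lines'
    adjusted.diff(y ~ 1 + w, sam)      # 'horizontal lines': expect noticeably more bias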

An Exercise

  • Do this exercise to practice the techniques we’ve been using to analyze our estimators.
  • And build some intuition about why misspecification doesn’t always cause much bias.

A Caricature of the California Data

  • Let’s dig into why we do so much better with the lines model than with the horizontal lines model.
    • Above, there’s a plot of a caricature of our California population data.
    • And the population least squares predictor in the lines model.
  • The covariate distributions \(p_{x \mid 0}\) and \(p_{x \mid 1}\) are very simple.
    • \(p_{x \mid 1}\) is flat. At each of the 5 education levels, you have 1/5 of the female residents.
    • \(p_{x \mid 0}\) is linear.
      • At the level \(x=14\), you have 1/5 of the male residents.
      • But the fraction decreases by \(1/20=.05\) with every two years of education.

\[\color{gray} \begin{aligned} \textcolor[RGB]{0,191,196}{p_{x \mid 1}} &= \frac{1}{5} \qqtext{ and } \textcolor[RGB]{248,118,109}{p_{x \mid 0}} &= \frac{1}{5} - \frac{1}{20} \frac{x-14}{2} \end{aligned} \]

  • More importantly, their ratio is a linear function of \(x\).

\[ \color{gray} \begin{aligned} \textcolor[RGB]{248,118,109}{p_{x \mid 0}} &= \textcolor[RGB]{0,191,196}{p_{x \mid 1}} \times (a + bx) \qfor a=\frac{11}{4} \qand b=-\frac{1}{8}. \end{aligned} \]
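
  • Here’s a quick numeric check of these two facts. It assumes the five education levels are \(x = 10, 12, 14, 16, 18\), the choice that makes \(p_{x \mid 0}\) sum to one.

    x  <- c(10, 12, 14, 16, 18)             # assumed education levels
    p1 <- rep(1/5, 5)                       # p_{x|1}: flat
    p0 <- 1/5 - (1/20) * (x - 14)/2         # p_{x|0}: linear in x
    sum(p0)                                 # 1, so it is a proper distribution
    all.equal(p0, p1 * (11/4 - x/8))        # TRUE: p_{x|0} = p_{x|1} * (a + b x)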

Exercise

  • When we look at the sampling distribution of \(\hat\Delta_0\), we see no bias at all.

  • And it’s not because our covariate distributions are close to the same in the two groups. They’re not.

    • Something else is going on. It’s about the relationship between our model and these covariate distributions.
    • If you change the model to the horizontal lines model, you’ll get bias.10
  • Exercise. Prove that the population version of our estimator, \(\tilde\Delta_0\), is equal to the target, i.e., that \[ \color{gray} \begin{aligned} \tilde\Delta_0 &= \sum_x \textcolor[RGB]{248,118,109}{p_{x \mid 0}} \qty{\textcolor[RGB]{0,191,196}{\tilde\mu(1,x)} - \textcolor[RGB]{248,118,109}{\tilde\mu(0,x)}} \qfor \\ \tilde \mu &= \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \frac{1}{m}\sum_{j=1}^m \qty{ y_j - m(w_j,x_j) }^2 \qqtext{ with } \mathcal{M}= \{ a(w) + b(w) x \}. \end{aligned} \]

  • If you’re not convinced that they’re exactly equal, run this code to check.

Hint

  • Think about the orthogonality of population least squares residuals to functions \(m \in \mathcal{M}\).
  • What happened when we plugged in \(m(w,x)=1_{=1}(w)\) in Equation 29.1?
    • That didn’t show the mismatched term was unbiased.
    • We were averaging over the wrong covariate distribution.
  • Can we show what we want by plugging in a different m?
    • Not \(m(w,x)=1_{=1}(w)\), but perhaps \(m(w,x)=1_{=1}(w) \times \text{something}\)?

Appendix

Summing in Order

\[ \begin{aligned} \frac{1}{m}\sum_{j=1}^m y_j &= \frac{1}{m} \ \sum_x \sum_{j:x_j=x} y_j \\ &= \frac{1}{m} \ \sum_x \mu(x) \times m_x \qfor \mu(x) = \frac{1}{m_x} \sum_{j:x_j=x} y_j \qand m_x = \sum_{j:x_j=x} 1 \\ &= \frac{1}{m} \ \sum_x \mu(x) \sum_{j:x_j=x} 1 \\ &= \frac{1}{m} \ \sum_x \sum_{j:x_j=x} \mu(x) \\ &= \frac{1}{m} \ \sum_{j=1}^m \mu(x_j) \end{aligned} \]
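
  • Here’s the same identity as a short numeric check with made-up values of \(y\) and \(x\).

    y  <- c(3, 5, 7, 7, 10)                           # made-up outcomes
    x  <- c(0, 0, 50, 50, 100)                        # made-up covariates
    mu <- tapply(y, x, mean)                          # mu(x): mean of y at each level of x
    all.equal(mean(y), mean(mu[as.character(x)]))     # TRUE: averaging mu(x_j) over j gives mean(y)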


  1. In the plot, ◇ indicates \(\hat\mu(w,x)\) and ⍿ indicates \(\mu(w,x)\).↩︎

  2. Or, to be precise, a histogram that approximates it. But we used 10,000 simulated surveys, so our approximation was pretty good.↩︎

  3. In this case, our covariate is education level. So we’re averaging over the distribution of education levels in our sample.↩︎

  4. To see that more clearly, uncomment the + zoom.in in the code above.↩︎

  5. Replace sam with pop in the code above to see the population least squares predictor \(\tilde\mu\). Still bad.↩︎

  6. We proved that in class earlier this week. It’s here in the slides.↩︎

  7. Try it! Change target in the code above to function(muhat, sam) { muhat(0) } and see.↩︎

  8. Remember that a sum of terms multiplied by an indicator is the sum of the terms where the indicator is 1.↩︎

  9. We’ll write \(p_{x \mid 0}\) and \(p_{x \mid 1}\) for the height of the red and green bars at covariate level \(x\) in our population histogram. That is, \(p_{x \mid 0}\) is the proportion of male residents in our population with education level \(x\) and \(p_{x \mid 1}\) the same for female residents.↩︎

  10. Try it! Change the formula to y ~ 1+w↩︎