27  Misspecification and Averaging

Setup

Shared Between Examples

Gym Subsidy Example

California Income Data

What We’re Doing

  • Today, we’ll look into something we saw in the Week 9 Homework’s last exercise.
    • We estimated our subpopulation means \(\mu(w,x)\) using least squares …
    • … to choose a function from the ‘not-necessarily-parallel lines’ model.

\[ \hat\mu(w,x) = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \qty{ Y_i - m(W_i,X_i) }^2 \qfor \mathcal{M}= \qty{ m(w,x) = a(w) + b(w)x } \]

  • It didn’t fit the data very well: \(\hat\mu(w,x) \not\approx \mu(w,x)\).1
  • But when we used it to adjust for covariate shift, it worked well: \(\hat \Delta_0 \approx \Delta_0\). \(\hphantom{1}\)

\[ \color{gray} \begin{aligned} \hat\Delta_0 &= \textcolor[RGB]{248,118,109}{\frac{1}{N_0}\sum_{i: W_i=0}} \left\{ \textcolor[RGB]{0,191,196}{\hat\mu(1,X_i)} - \textcolor[RGB]{248,118,109}{\hat\mu(0,X_i)} \right\} &&\qfor \textcolor[RGB]{248,118,109}{N_0 = \sum_{i: W_i=0} 1} \\ \Delta_0 &= \textcolor[RGB]{248,118,109}{\frac{1}{m_0} \sum_{j: w_j=0}} \left\{ \textcolor[RGB]{0,191,196}{\mu(1,x_{j})} - \textcolor[RGB]{248,118,109}{\mu(0,x_{j})} \right\} &&\qfor \textcolor[RGB]{248,118,109}{m_0 = \sum_{j: w_j=0} 1} \end{aligned} \]
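
  • In code, the estimator looks roughly like this. It’s a sketch, not the course’s exact code: it assumes a sample data frame sam with columns y (income), w (the group, coded 0/1), and x (education level).

    fit <- lm(y ~ w * x, data = sam)          # with binary w, this fits a(w) + b(w) x
    muhat <- function(w, x) predict(fit, newdata = data.frame(w = w, x = x))
    x0 <- sam$x[sam$w == 0]                   # education levels of the W = 0 observations
    Delta.hat.0 <- mean(muhat(1, x0) - muhat(0, x0))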

  • How did we know it worked well? We were working with a fake population.
    • We could actually calculate our estimation target. And we did.
    • And we simulated our survey many times to get our estimator’s sampling distribution.2
  • Look at the difference in mean incomes at one education level.

  • Compare it to our target \(\theta\), the actual difference in means in our population. \[ \hat\theta = \textcolor[RGB]{0,191,196}{\hat\mu(1,x)} - \textcolor[RGB]{248,118,109}{\hat\mu(0,x)} \qqtext{ estimates } \theta = \textcolor[RGB]{0,191,196}{\mu(1,x)} - \textcolor[RGB]{248,118,109}{\mu(0,x)} \]

  • The sampling distribution of our estimator \(\hat\theta\) is shown above.

  • For \(x=14\) (a 2-year degree), which is what’s shown for now, its center is far from the target.

    • Our estimator is biased. And its bias is positive.
    • In most samples drawn from the population, we overestimate the difference.
  • Change \(x\) in the code above to see what happens at other education levels.

    • Try \(x=12\) (completed high school) and \(x=16\) (a 4-year degree). And whatever else you like.
    • Remember that some levels, e.g. \(x=15\), don’t exist in our population.
    • Don’t be surprised if you try \(x=15\) and it doesn’t work.
  • Here’s the sampling distribution of our adjusted difference in means. \[ \hat\theta = \textcolor[RGB]{248,118,109}{\frac{1}{N_0}\sum_{i: W_i=0}} \qty{ \textcolor[RGB]{0,191,196}{\hat\mu(1,X_i)} - \textcolor[RGB]{248,118,109}{\hat\mu(0,X_i)} } \qqtext{ estimates } \theta = \textcolor[RGB]{248,118,109}{\frac{1}{m_0} \sum_{j: w_j=0}} \qty{ \textcolor[RGB]{0,191,196}{\mu(1,x_{j})} - \textcolor[RGB]{248,118,109}{\mu(0,x_{j})} } \]

  • We’re using biased estimates of the difference in means at each education level \(x\).

  • But when we average over the covariate distribution3 of male CA residents

    • … we get an estimator that’s barely biased at all.
    • Our term-by-term biases are (imperfectly) canceling each other out.

This Activity’s Goal

  • We’re going to understand this cancellation phenomenon a bit better.
  • We’ll do the math to find out …
    • where bias is and isn’t coming from.
    • and what tends to cause cancellation like this.

Why bother?

  • You could argue that we should never use misspecified models.
    • If we always used the ‘all functions model’, we wouldn’t have to worry about bias.
  • 1. We can’t always use the all functions model.
    • If we have groups in our population that we haven’t sampled …
    • … we need a smaller model, e.g. lines, to help us make predictions for them.
  • 2. We’re not always choosing the model.
    • When we read someone else’s analysis, we want to know whether we should believe what they’re saying.
    • Often, you can’t get their data so you can’t just analyze it your way to check.
  • 3. It’s a good warm-up for what we’ll cover next.

Warm-Up

Our Gym Subsidy Example

A Case of Perfect Cancellation

  • This is the gym subsidy example we looked at in class a few days ago.
    • We have 3 levels of \(x\): 0, 50, and 100.
    • And we’re estimating \(\mu\) by least squares using the ‘lines’ model.

\[ \begin{aligned} \hat \mu &= \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \qty{ Y_i - m(X_i) }^2 \qfor \mathcal{M}= \qty{ m(x) = a + bx } && \text{ estimates } \\ \mu(x) &= \frac{1}{m_x} \sum_{j: x_j=x} y_j \qfor m_x = \sum_{j: x_j=x} 1 \end{aligned} \]
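
  • To make this concrete, here’s a sketch with a made-up hockey-stick population standing in for the one in the plot. The numbers are invented; only the shape matters.

    mu  <- function(x) ifelse(x <= 50, 100, 100 + 2 * (x - 50))   # hypothetical hockey stick
    pop <- data.frame(x = rep(c(0, 50, 100), each = 1000))        # made-up population
    pop$y <- mu(pop$x) + rnorm(nrow(pop), sd = 20)
    sam <- pop[sample(nrow(pop), 100), ]                          # one survey
    fit     <- lm(y ~ 1 + x, data = sam)      # least squares in the 'lines' model
    fit.pop <- lm(y ~ 1 + x, data = pop)      # population least squares, i.e. mu.tilde
    predict(fit,     data.frame(x = 100))     # muhat(100)
    predict(fit.pop, data.frame(x = 100))     # mu.tilde(100), where muhat(100) is centered
    mu(100)                                   # mu(100), the thing we'd like to estimate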

  • This is simulated data, so we know \(\mu(x)\). It’s plotted in green above.
  • We haven’t done a very good job of estimating it.4
    • It’s not because our sample is too small.5
    • The problem is misspecification. \(\mu\) looks like a hockey stick. No line can fit that.
  • Here’s the sampling distribution of \(\hat\mu(100)\), our least squares estimate of \(\mu(100)\).
    • It’s not even close to being centered on \(\mu(100)\).
    • It’s centered on the population least squares estimate of \(\mu(100)\).6

\[ \tilde\theta = \tilde\mu(100) \qfor \tilde \mu = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{j=1}^m \qty{ y_j - m(x_j) }^2 \]

  • Here’s how \(\hat\theta\), \(\tilde\theta\), and \(\theta\) are calculated in R; a self-contained sketch of the whole simulation appears at the end of this subsection.
  • To get the plot above, plot.sampling.distribution simulates our experiment over and over and calculates \(\hat\theta\) each time.
  • So here’s our bias.
  • One thing you’ll notice if you look at \(\hat\mu(x)\) for \(x=0\) and \(x=50\) is that they’re biased in the opposite direction.7
    • If we’re looking at a difference like \(\hat\mu(100) - \hat\mu(50)\), that’s bad: \(\text{too big} - \text{too small} = \text{even more too big}\)
    • But if we’re looking at a sum, that’s good. We’ll get some cancellation.
  • Let’s think about a new target: the average of \(y\) in the population.
    • That’s the same thing as the average of \(\mu(x_j)\) over the whole population.
    • That’s just summing in a certain order. See Section 31.1. \[ \theta = \frac{1}{m} \sum_{j=1}^m y_j = \frac{1}{m} \sum_{j=1}^m \mu(x_j) \]
  • So maybe we should estimate it by averaging \(\hat\mu(x)\) over our sample.

\[ \hat\theta = \frac{1}{n} \sum_{i=1}^n \hat\mu(X_i) \]

  • And when we do that, we get an unbiased estimator.
    • The center of our sampling distribution is the population average of our population least squares predictor.
    • And that’s exactly the same as our estimation target \(\theta\).

\[ \mathop{\mathrm{E}}[\hat\theta] = \tilde\theta = \frac{1}{m} \sum_{j=1}^m \tilde\mu(x_j) \]

  • Here’s the empirical evidence.
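
  • The course’s plot.sampling.distribution code isn’t reproduced here, but here’s a self-contained sketch of the same kind of simulation, again using an invented hockey-stick population. The simulated \(\hat\theta\)s center essentially on \(\theta\).

    mu  <- function(x) ifelse(x <= 50, 100, 100 + 2 * (x - 50))   # invented mu(x)
    pop <- data.frame(x = rep(c(0, 50, 100), each = 1000))
    pop$y <- mu(pop$x) + rnorm(nrow(pop), sd = 20)
    theta <- mean(pop$y)                                          # the target
    theta.hats <- replicate(10000, {
      sam <- pop[sample(nrow(pop), 100), ]
      fit <- lm(y ~ 1 + x, data = sam)                            # the 'lines' model
      mean(predict(fit, newdata = sam))                           # average of muhat(X_i) over the sample
    })
    mean(theta.hats) - theta                                      # roughly zero: no bias
    hist(theta.hats); abline(v = theta)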

Orthogonality and Perfect Cancellation

  • The orthogonality of least squares residuals tells us this happens.

\[ 0 = \sum_{i=1}^n \qty{ Y_i - \hat\mu(X_i) } \ m(X_i) \qqtext{ for all } m \in \mathcal{M} \]

  • Why? Plug in \(m(x) = 1\). That’s a line. It’s a horizontal line. Then do a little algebra. \[ 0 = \sum_{i=1}^n \qty{ Y_i - \hat\mu(X_i) } \times 1 \implies \textcolor[RGB]{192,192,192}{\frac{1}{n}} \sum_{i=1}^n Y_i = \textcolor[RGB]{192,192,192}{\frac{1}{n}} \sum_{i=1}^n \hat\mu(X_i) \]

  • It tells us that our estimate \(\hat\theta\) is really just the sample mean.

    • We calculated it in a strange way, but it’s the same number.
    • And we know the sample mean is an unbiased estimator of the population mean.
  • This is true for any regression model that includes the constant function \(m(x)=1\).

\[ \begin{aligned} \mathcal{M}&= \{ m(x) = a : a \in \mathbb{R} \} && \text{horizontal lines} \\ \mathcal{M}&= \{ \text{all functions} \ m(x) \} && \text{all functions} \\ \mathcal{M}&= \{ m(x) = a + bx : a,b \in \mathbb{R} \} && \text{lines} \\ \mathcal{M}&= \{ m(x) = \sum_{k=0}^p a_k x^k : a_k \in \mathbb{R} \} && \text{polynomials of order } p \\ \end{aligned} \]

  • One model it’s not true for is lines through the origin.
    • That’s one reason people avoid this model: averages of its predictions are biased.
    • If you go back a slide and change formula = y ~ 1+x to formula = y ~ 0+x, you’ll see it. Bias!

\[ \begin{aligned} \mathcal{M}&= \{ m(x) = bx : b \in \mathbb{R} \} && \text{lines through the origin} \end{aligned} \]
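
  • You can check all of this numerically with any data you like: as long as the model contains the constant function, the fitted values from lm average to exactly the sample mean; drop the intercept and they don’t. Here’s a sketch with made-up data.

    set.seed(1)
    x <- runif(100, 0, 100)
    y <- pmax(x - 50, 0) + rnorm(100)                 # made-up data
    mean(y) - mean(fitted(lm(y ~ 1 + x)))             # exactly 0, up to rounding error
    mean(y) - mean(fitted(lm(y ~ poly(x, 3))))        # also 0: polynomials include the constant
    mean(y) - mean(fitted(lm(y ~ 0 + x)))             # not 0: lines through the origin

  • Running the same check on a population data frame illustrates the population version of this property, which we use next.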

Population Orthogonality and Perfect Cancellation

  • Let’s look at this a different way. Let’s think about the orthogonality of population least squares residuals.

\[ \frac{1}{m} \sum_{j=1}^m \qty{ y_j - \tilde\mu(x_j) } \times m(x_j) = 0 \qqtext{ for all } m \in \mathcal{M}. \]

  • Again, plug in \(m(x) = 1\). And do the same algebra as before.

\[ 0 = \sum_{j=1}^m \qty{ y_j - \tilde\mu(x_j) } \times 1 \implies \textcolor[RGB]{192,192,192}{\frac{1}{m}} \sum_{j=1}^m y_j = \underset{\tilde\theta}{\textcolor[RGB]{192,192,192}{\frac{1}{m}} \sum_{j=1}^m \tilde\mu(x_j)} \]

  • This tells us the population version of our estimator, \(\tilde\theta\), is just the population mean. That’s our target.
  • And because we know that \(\tilde\theta\) is the expectation of our estimator \(\hat\theta\), that means our estimator is unbiased.

California Income Data

Our Adjusted Difference \(\hat\Delta_0\)

  • Let’s look at our adjusted difference in means when we use the ‘not necessarily parallel lines’ model. \[ \mathcal{M}= \{ m(w,x) = a(w) + b(w)x \ \text{ for functions} \ a,b \} \]

  • We’ll break it down into two parts.

    • A matched part where we average the red prediction function \(\hat\mu(0,x)\) over the red covariate distribution.
    • A mismatched part where we average the green prediction function \(\hat\mu(1,x)\) over the red covariate distribution. \[ \color{gray} \begin{aligned} \hat\Delta_0 &= \textcolor[RGB]{248,118,109}{\frac{1}{N_0}\sum_{i: W_i=0}} \left\{ \textcolor[RGB]{0,191,196}{\hat\mu(1,X_i)} - \textcolor[RGB]{248,118,109}{\hat\mu(0,X_i)} \right\} &&\qfor \textcolor[RGB]{248,118,109}{N_0 = \sum_{i: W_i=0} 1} \\ &= \underset{\text{mismatched part}}{\textcolor[RGB]{248,118,109}{\frac{1}{N_0}\sum_{i: W_i=0}}\textcolor[RGB]{0,191,196}{\hat\mu(1,X_i)}} - \underset{\text{matched part}}{\textcolor[RGB]{248,118,109}{\frac{1}{N_0}\sum_{i: W_i=0}}\textcolor[RGB]{248,118,109}{\hat\mu(0,X_i)}} \end{aligned} \]
  • Our target, of course, has the same parts.

\[ \color{gray} \begin{aligned} \Delta_0 &= \textcolor[RGB]{248,118,109}{\frac{1}{m_0}\sum_{j: w_j=0}} \left\{ \textcolor[RGB]{0,191,196}{\mu(1,x_j)} - \textcolor[RGB]{248,118,109}{\mu(0,x_j)} \right\} &&\qfor \textcolor[RGB]{248,118,109}{m_0 = \sum_{j: w_j=0} 1} \\ &= \underset{\text{mismatched part}}{\textcolor[RGB]{248,118,109}{\frac{1}{m_0}\sum_{j: w_j=0}}\textcolor[RGB]{0,191,196}{\mu(1,x_j)}} - \underset{\text{matched part}}{\textcolor[RGB]{248,118,109}{\frac{1}{m_0}\sum_{j: w_j=0}}\textcolor[RGB]{248,118,109}{\mu(0,x_j)}} \end{aligned} \]

The Matched Part

  • Here’s the sampling distribution of the matched part.
  • You can see it’s unbiased, i.e., centered on the matched part of our target.

\[ \mathop{\mathrm{E}}\qty[ \textcolor[RGB]{248,118,109}{\frac{1}{N_0}\sum_{i: W_i=0}}\textcolor[RGB]{248,118,109}{\hat\mu(0,X_i)} ] = \textcolor[RGB]{248,118,109}{\frac{1}{m_0} \sum_{j: w_j=0} \mu(0,x_j)} \]

  • Let’s prove it. As usual, we’ll use the orthogonality of least squares residuals.

\[ 0 = \sum_{i=1}^n \qty{ Y_i - \hat\mu(W_i,X_i) } \ m(W_i,X_i) \qqtext{ for all } m \in \mathcal{M} \]

  • What function \(m\) should we plug in? Suppose we’re using the ‘not necessarily parallel lines’ model.
  • See the next slide for the answer.

\[ m(w,x) = 1_{=0}(w) = \begin{cases} 1 & \text{if } w=0 \\ 0 & \text{if } w=1 \end{cases} \]

  • That’s in the model. It’s \(a(w) + b(w)x\) for \(a(w)=1_{=0}(w)\) and \(b(w)=0\).
  • Let’s plug it in and do our algebra.

\[ \begin{aligned} 0 &\overset{\texttip{\small{\unicode{x2753}}}{plugging in}}{=} \sum_{i=1}^n \qty{ Y_i - \hat\mu(W_i,X_i) } \times 1_{=0}(W_i) \\ &\overset{\texttip{\small{\unicode{x2753}}}{using the indicator trick}}{=} \sum_{i=1}^n Y_i 1_{=0}(W_i) - \sum_{i=1}^n \hat\mu(0,X_i) 1_{=0}(W_i). \end{aligned} \]

  • Rearranging and dividing by the number \(N_0\) of red dots gives us what we want.8 \[ \frac{1}{N_0}\sum_{i:W_i=0} Y_i = \frac{1}{N_0}\sum_{i:W_i=0} \hat\mu(0,X_i) \]
  • The left side is just the sample mean of \(Y\) among the red dots, and that’s an unbiased estimator of the corresponding population mean \(\textcolor[RGB]{248,118,109}{\frac{1}{m_0} \sum_{j: w_j=0} \mu(0,x_j)}\), i.e. the matched part of our target.
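
  • Here’s that identity as a quick numeric check, assuming as before a sample data frame sam with columns y, w (0/1), and x. It holds exactly, not just on average.

    fit  <- lm(y ~ w * x, data = sam)                 # 'not necessarily parallel lines'
    sam0 <- sam[sam$w == 0, ]                         # the red dots
    matched <- mean(predict(fit, newdata = sam0))     # average of muhat(0, X_i) over W_i = 0
    all.equal(matched, mean(sam0$y))                  # TRUE, by orthogonality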

The Mismatched Part

  • Here’s the sampling distribution of the mismatched part. You can see that it’s biased, but not too badly. \[ \mathop{\mathrm{E}}\qty[ \textcolor[RGB]{248,118,109}{\frac{1}{N_0}\sum_{i: W_i=0}}\textcolor[RGB]{0,191,196}{\hat\mu(1,X_i)} ] \neq \textcolor[RGB]{248,118,109}{\frac{1}{m_0} \sum_{j: w_j=0}} \textcolor[RGB]{0,191,196}{\mu(1,x_j)} \]
  • Let’s think about why we can’t show that it’s unbiased like we did for the matched part.
    • If you put a group indicator \(m=1_{=0}\) or \(m=1_{=1}\) in the orthogonality condition …
    • … it’s not going to work out like it did for the matched part.

\[ 0 = \sum_{i=1}^n \qty{ Y_i - \hat\mu(W_i,X_i) } \ m(W_i,X_i) \qqtext{ for all } m \in \mathcal{M} \]

  • If you put in an indicator for \(w=1\):
    • you get the right function \(\hat\mu(1,x)\)
    • averaged over the wrong covariate distribution.
  • If you put in an indicator for \(w=0\):
    • you get the wrong function \(\hat\mu(0,x)\)
    • averaged over the right covariate distribution.
  • In short, putting in a group indicator tells us about matched things, i.e.,
    • averages of the least squares prediction for a group
    • averaged over the covariate distribution of that same group.
  • You could, of course, use the population residuals’ orthogonality property. \[ \begin{aligned} 0 &= \mathop{\mathrm{E}}\qty[ \qty{ Y_i - \tilde\mu(W_i, X_i) } \ m(W_i,X_i) ] && \qqtext{ for all } m \in \mathcal{M}\\ &\overset{\texttip{\small{\unicode{x2753}}}{By the law of iterated expectations conditioning on $(W_i,X_i)$}}{=} \mathop{\mathrm{E}}\qty[ \qty{ \mu(W_i,X_i) - \tilde\mu(W_i,X_i) } \ m(W_i,X_i) ] \qqtext{ for all } m \in \mathcal{M}\\ &\overset{\texttip{\small{\unicode{x2753}}}{Writing out our expectation as an average over the population}}{=} \frac{1}{m}\sum_{j=1}^m \qty{ \mu(w_j,x_j) - \tilde\mu(w_j,x_j) } \ m(w_j,x_j) \end{aligned} \]

  • But that has the same problem. Matched stuff only.

  • Let’s do what we just said wouldn’t work: plug an indicator into our population orthogonality condition.
  • Using \(m(w,x)=1_{=1}(w)\), we get the wrong average of the right function.

\[ \begin{aligned} 0 &= \frac{1}{m}\sum_{j=1}^m \qty{ \mu(w_j,x_j) - \tilde\mu(w_j,x_j) } \ 1_{=1}(w_j) \end{aligned} \tag{29.1}\]

  • Using the indicator trick, rearranging, and multiplying by \(m/m_1\) gives us …

\[ \textcolor[RGB]{0,191,196}{\frac{1}{m_1}\sum_{j:w_j=1} \mu(1,x_j)} = \textcolor[RGB]{0,191,196}{\frac{1}{m_1}\sum_{j:w_j=1} \tilde\mu(1,x_j)} \]

  • Which we can rewrite in ‘histogram form’9

\[ \sum_x \textcolor[RGB]{0,191,196}{p_{x \mid 1} \ \tilde\mu(1,x)} = \sum_x \textcolor[RGB]{0,191,196}{p_{x \mid 1} \ \mu(1,x)} \qfor p_{x \mid w} = \frac{\sum_{j:w_j=w,x_j=x} 1}{\sum_{j:w_j=w} 1} \]

  • What happens if our two histograms are almost the same? We can’t be that far off. \[ \qqtext{ therefore } \sum_x \textcolor[RGB]{248,118,109}{p_{x \mid 0}} \ \textcolor[RGB]{0,191,196}{\mu(1,x)} \approx \sum_x \textcolor[RGB]{248,118,109}{p_{x \mid 0}} \ \textcolor[RGB]{0,191,196}{\tilde\mu(1,x)} \qqtext{ if } \textcolor[RGB]{248,118,109}{p_{x \mid 0}} \approx \textcolor[RGB]{0,191,196}{p_{x \mid 1}} \]

  • If our histograms were exactly the same, it wouldn’t really matter what model we used here.

    • But they’re not exactly the same. So it does matter.
    • Try changing your regression formula to y ~ 1+w to fit the ‘horizontal lines’ model.
    • What happens to our estimator’s bias?
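
  • If you don’t have the interactive code handy, here’s a sketch of that comparison. As before, it assumes a sample data frame sam with columns y, w (0/1), and x.

    adjusted.diff <- function(formula, sam) {
      fit <- lm(formula, data = sam)
      x0  <- sam$x[sam$w == 0]
      mean(predict(fit, data.frame(w = 1, x = x0)) -
           predict(fit, data.frame(w = 0, x = x0)))
    }
    adjusted.diff(y ~ w * x, sam)      # 'not necessarily parallel lines'
    adjusted.diff(y ~ 1 + w, sam)      # 'horizontal lines': expect noticeably more bias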

An Exercise

  • Do this exercise to practice the techniques we’ve been using to analyze our estimators.
  • And build some intuition about why misspecification doesn’t always cause much bias.

A Caricature of the California Data

  • Let’s dig into why we do so much better with the lines model than with the horizontal lines model.
    • Above, there’s a plot of a caricature of our California population data.
    • And the population least squares predictor in the lines model.
  • The covariate distributions \(p_{x \mid 0}\) and \(p_{x \mid 1}\) are very simple.
    • \(p_{x \mid 1}\) is flat. At each of the 5 education levels, you have 1/5 of the female residents.
    • \(p_{x \mid 0}\) is linear.
      • At the level \(x=14\), you have 1/5 of the male residents.
      • But the fraction decreases by \(1/20=.05\) with every two years of education.

\[\color{gray} \begin{aligned} \textcolor[RGB]{0,191,196}{p_{x \mid 1}} &= \frac{1}{5} \qqtext{ and } \textcolor[RGB]{248,118,109}{p_{x \mid 0}} &= \frac{1}{5} - \frac{1}{20} \frac{x-14}{2} \end{aligned} \]

  • More importantly, their ratio is a linear function of \(x\).

\[ \color{gray} \begin{aligned} \textcolor[RGB]{248,118,109}{p_{x \mid 0}} &= \textcolor[RGB]{0,191,196}{p_{x \mid 1}} \times (a + bx) \qfor a=\frac{11}{4} \qand b=-\frac{1}{8}. \end{aligned} \]
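
  • Here’s a quick numeric check of these two facts. It assumes the five education levels are \(x = 10, 12, 14, 16, 18\), the choice that makes \(p_{x \mid 0}\) sum to one.

    x  <- c(10, 12, 14, 16, 18)             # assumed education levels
    p1 <- rep(1/5, 5)                       # p_{x|1}: flat
    p0 <- 1/5 - (1/20) * (x - 14)/2         # p_{x|0}: linear in x
    sum(p0)                                 # 1, so it is a proper distribution
    all.equal(p0, p1 * (11/4 - x/8))        # TRUE: p_{x|0} = p_{x|1} * (a + b x)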

Exercise

  • When we look at the sampling distribution of \(\hat\Delta_0\), we see no bias at all.

  • And it’s not because our covariate distributions are close to the same in the two groups. They’re not.

    • Something else is going on. It’s about the relationship between our model and these covariate distributions.
    • If you change the model to the horizontal lines model, you’ll get bias.10
  • Exercise. Prove that the population version of our estimator, \(\tilde\Delta_0\), is equal to the target, i.e., that \[ \color{gray} \begin{aligned} \tilde\Delta_0 &= \sum_x \textcolor[RGB]{248,118,109}{p_{x \mid 0}} \qty{\textcolor[RGB]{0,191,196}{\tilde\mu(1,x)} - \textcolor[RGB]{248,118,109}{\tilde\mu(0,x)}} \qfor \\ \tilde \mu &= \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \frac{1}{m}\sum_{j=1}^m \qty{ y_j - m(w_j,x_j) }^2 \qqtext{ with } \mathcal{M}= \{ a(w) + b(w) x \}. \end{aligned} \]

  • If you’re not convinced that they’re exactly equal, run this code to check.

Hint

  • Think about the orthogonality of population least squares residuals to functions \(m \in \mathcal{M}\).
  • What happened when we plugged in \(m(w,x)=1_{=1}(w)\) in Equation 29.1?
    • That didn’t show the mismatched term was unbiased.
    • We were averaging over the wrong covariate distribution.
  • Can we show what we want by plugging in a different m?
    • Not \(m(w,x)=1_{=1}(w)\), but perhaps \(m(w,x)=1_{=1}(w) \times \text{something}\)?

Appendix

Summing in Order

\[ \begin{aligned} \frac{1}{m}\sum_{j=1}^m y_j &= \frac{1}{m} \ \sum_x \sum_{j:x_j=x} y_j \\ &= \frac{1}{m} \ \sum_x \mu(x) \times m_x \qfor \mu(x) = \frac{1}{m_x} \sum_{j:x_j=x} y_j \qand m_x = \sum_{j:x_j=x} 1 \\ &= \frac{1}{m} \ \sum_x \mu(x) \sum_{j:x_j=x} 1 \\ &= \frac{1}{m} \ \sum_x \sum_{j:x_j=x} \mu(x) \\ &= \frac{1}{m} \ \sum_{j=1}^m \mu(x_j) \end{aligned} \]
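
  • Here’s the same identity as a short numeric check with made-up values of \(y\) and \(x\).

    y  <- c(3, 5, 7, 7, 10)                           # made-up outcomes
    x  <- c(0, 0, 50, 50, 100)                        # made-up covariates
    mu <- tapply(y, x, mean)                          # mu(x): mean of y at each level of x
    all.equal(mean(y), mean(mu[as.character(x)]))     # TRUE: averaging mu(x_j) over j gives mean(y)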


  1. In the plot, ◇ indicates \(\hat\mu(w,x)\) and ⍿ indicates \(\mu(w,x)\).↩︎

  2. Or, to be precise, a histogram that approximates it. But we used 10,000 simulated surveys, so our approximation was pretty good.↩︎

  3. In this case, our covariate is education level. So we’re averaging over the distribution of education levels in our sample.↩︎

  4. To see that more clearly, uncomment the + zoom.in in the code above.↩︎

  5. Replace sam with pop in the code above to see the population least squares predictor \(\tilde\mu\). Still bad.↩︎

  6. We proved that in class earlier this week. It’s here in the slides.↩︎

  7. Try it! Change target in the code above to function(muhat, sam) { muhat(0) } and see.↩︎

  8. Remember that a sum of terms multiplied by an indicator is the sum of the terms where the indicator is 1.↩︎

  9. We’ll write \(p_{x \mid 0}\) and \(p_{x \mid 1}\) for the height of the red and green bars at covariate level \(x\) in our population histogram. That is, \(p_{x \mid 0}\) is the proportion of male residents in our population with education level \(x\) and \(p_{x \mid 1}\) the same for female residents.↩︎

  10. Try it! Change the formula to y ~ 1+w↩︎