17  Homework: Covariate Shift

Summary

In this one, we’re going to review some things people mixed up on the exam and work on visualizing and computing complex summaries of data with multiple and multivalued covariates.

Review

A histogram of the population, \(y_1 \ldots y_{m}\) for \(m=10,000,000\).

A histogram of the sample, \(Y_1 \ldots Y_n\) for \(n=10,000\).
Figure 19.1: A histogram of the sample \(Y_1 \ldots Y_{10000}\). The mean of each distribution is marked with a blue line and the mean plus and minus one standard deviation are marked with dashed blue lines.

The sampling distribution of estimator A.

The sampling distribution of estimator B.

The sampling distribution of estimator C.
Figure 19.2: The bootstrap sampling distributions of our three estimators. The mean of each distribution is marked with a blue line.

In Figure 19.1, I’ve drawn two histograms. The first shows a population \(y_1 \ldots y_m\) with mean \(\mu = \frac1m \sum_{j=1}^m y_j\) and variance \(\sigma^2 = \frac1m \sum_{j=1}^m (y_j - \mu)^2\). The second shows a sample \(Y_1 \ldots Y_n\) drawn with replacement from this population—the sample that was used in Problem 1 of Midterm 1.

In Figure 19.2, I’ve drawn the bootstrap sampling distributions of three estimators of the population mean.

  1. \(\hat\mu = \bar{Y}/2\) for \(\bar{Y} = \frac1n \sum_{i=1}^n Y_i\)
  2. \(\hat\mu = 1 + \bar{Y}/2\).
  3. \(\hat\mu = Y_2\).
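If you'd like to see distributions like these come out of a computation, here's a small simulation sketch. The population isn't fully specified here, so this assumes a normal population with mean 0 and standard deviation 3 (values consistent with Figure 19.1 and the solutions below); the shapes of the bootstrap distributions, not the exact numbers, are the point.

```python
# A simulation sketch of the bootstrap sampling distributions in Figure 19.2.
# Assumption: a population with mean 0 and standard deviation 3, the values
# that can be read off Figure 19.1.
import random
import statistics

random.seed(0)
n = 1000  # smaller than the homework's n = 10,000, to keep this fast
sample = [random.gauss(0, 3) for _ in range(n)]

est1, est2, est3 = [], [], []
for _ in range(2000):  # 2000 bootstrap replications
    boot = random.choices(sample, k=n)  # resample with replacement
    ybar = sum(boot) / n
    est1.append(ybar / 2)      # estimator 1: Ybar / 2
    est2.append(1 + ybar / 2)  # estimator 2: 1 + Ybar / 2
    est3.append(boot[1])       # estimator 3: the single observation Y_2

# Estimators 1 and 2 have narrow distributions (width ~ sigma / (2 sqrt(n)));
# estimator 3's distribution is as wide as the sample's.
print([round(statistics.stdev(e), 3) for e in (est1, est2, est3)])
```

Shifting an estimator by a constant (estimator 2 vs. estimator 1) moves its distribution without changing its width, which is what you should see in the printed standard deviations.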

Exercise 19.1  

Match each estimator to the plot of its bootstrap sampling distribution. That is, give an answer of the form 1:A, 2:B, 3:C (though not necessarily that particular matching).

1:A, 2:C, 3:B.

Exercise 19.2  

For each of these estimators, say whether it is unbiased and whether it is consistent.

Tip. Each part of this is a Yes or No question. There are estimators that are unbiased or consistent for some populations and not for others, so if I were asking about a population I’d told you nothing concrete about, the answer might be maybe. But I’m asking about the specific population plotted above and the plot does give you enough information to determine the answer.

  1. unbiased and consistent.1
  2. neither unbiased nor consistent.2
  3. unbiased but not consistent.

Exercise 19.3  

For any estimator you described as biased or inconsistent, describe what you see in the plots above that suggests this is the case.

  • Estimator 2: The sampling distribution is centered in the same place at every sample size, and gets narrower and narrower, but the place it’s centered isn’t the sample mean.
  • Estimator 3: The sampling distribution doesn’t get narrower.

Exercise 19.4  

For each of the three estimators, do the following.

  1. Write a formula for its standard deviation. Your formula may involve the sample size \(n\) and some summaries of the population, e.g. \(m\), \(\mu\), or \(\sigma\).
  2. Using information shown in Figure 19.1, calculate it approximately and report the number.
  3. Check that your answer looks about right using Figure 19.2 and describe how you checked.

We can compute \[ \begin{aligned} \mathrm{V}(\bar Y / 2) &= \qty(\frac{1}{2})^2 \mathrm{V}(\bar Y) = \frac{1}{4} \frac{\sigma^2}{n} \\ \mathrm{V}(1 + \bar Y / 2) &= \mathrm{V}(\bar Y / 2) = \frac{1}{4} \frac{\sigma^2}{n} \\ \mathrm{V}(Y_2) &= \sigma^2 \end{aligned} \] It follows that the standard deviations are \(\sigma/(2\sqrt{n}) = 3/(2\times 100) = 0.015\) for the first two and \(\sigma = 3\) for the third, since we can see from the dashed lines in the population plot that \(\sigma=3\) and from the caption of the sample plot that \(n=10{,}000=100^2\).

The first two being averages and therefore approximately normal, we can check our calculation by going out 2 standard deviations from the mean in the bootstrap sampling distribution and seeing that it gives us about 95% of the distribution. For these two estimators (i.e. estimators A and C), this is about \(\pm 0.03\) and it looks good. For Estimator 3/B, since our estimator \(Y_2\) is sampled uniformly-at-random from the population, its sampling distribution looks exactly like the distribution of the population, which is where we got the number \(\sigma=3\).
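The arithmetic above can be done in a couple of lines:

```python
# Arithmetic check of the standard deviations above, plugging in sigma = 3
# and n = 10,000 as read off Figure 19.1.
import math

sigma, n = 3, 10_000
sd_first_two = sigma / (2 * math.sqrt(n))  # estimators 1 and 2
sd_third = sigma                           # estimator 3
print(sd_first_two, sd_third)  # 0.015 3
```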

Complex Summaries and Covariate Shift

In this section, we’ll work with a subset of the California income data considered in lecture. We’ll look at incomes of CA residents who responded to the Current Population Survey in 2022, were in the age range 25-25, and graduated from high school. I’ve plotted this data below. As in lecture:

  • \(W_i=0\) for male residents and \(W_i=1\) for female residents;
  • \(X_i\) is years of schooling;
  • \(Y_i\) is income in dollars.

Figure 20.1: Scatter plot of education vs income colored by sex with histograms of education overlaid. Note that, because incomes are on a very different scale than the proportions we’re displaying in the histogram, two different scales for the y-axis are used. The scale on the left is for income. The scale on the right is for the proportions.

Figure 20.2: Histograms of education colored by sex in two layouts. On the left, they’re plotted one above the other. On the right, they’re overlaid.
| W | X | mu.hat | N.wx |
|---|----|--------|------|
| 0 | 12 | 35k  | 353 |
| 0 | 13 | 41k  | 179 |
| 0 | 14 | 50k  | 110 |
| 0 | 16 | 70k  | 311 |
| 0 | 18 | 100k | 107 |
| 0 | 20 | 110k | 35  |
| 1 | 12 | 21k  | 252 |
| 1 | 13 | 27k  | 178 |
| 1 | 14 | 30k  | 110 |
| 1 | 16 | 58k  | 329 |
| 1 | 18 | 77k  | 140 |
| 1 | 20 | 107k | 36  |

We’ll consider the following four estimators of the difference in income between the male and female CA residents in our sample.
\[ \color{gray} \begin{aligned} \hat \Delta_{\text{raw}} &= \frac{1}{N_1}\sum_{i: W_i=1} Y_i - \frac{1}{N_0}\sum_{i: W_i=0} Y_i \\ \hat\Delta_0 &=\frac{1}{N_0}\sum_{i: W_i=0} \qty{\hat\mu(1,X_i) - \hat \mu(0, X_i)} \\ \hat \Delta_1 &= \frac{1}{N_1}\sum_{i: W_i=1} \qty{\hat\mu(1,X_i) - \hat \mu(0, X_i)} \\ \hat \Delta_{\text{all}} &= \frac{1}{n} \sum_{i=1}^n \qty{\hat\mu(1,X_i) - \hat \mu(0, X_i)} \end{aligned} \]

Exercise 20.1  

Using the data in the table above, calculate each of these four estimators.

To use the data in the table, we’ll need these written in ‘histogram form.’ \[ \color{gray} \begin{aligned} \hat \Delta_{\text{raw}} &= \sum_x P_{x\mid 1} \hat\mu(1,x) - \sum_x P_{x \mid 0} \hat\mu(0,x) \\ \hat\Delta_0 &= \sum_x P_{x \mid 0} \qty{\hat\mu(1,x) - \hat \mu(0, x)} \\ \hat \Delta_1 &= \sum_x P_{x \mid 1} \qty{\hat\mu(1,x) - \hat \mu(0, x)} \\ \hat \Delta_{\text{all}} &= \sum_x P_{x} \qty{\hat\mu(1,x) - \hat \mu(0, x)} \end{aligned} \] for \[ \begin{aligned} P_{x \mid 1} &= \frac{N_{1x}}{N_1} = \frac{N_{1x}}{\sum_x N_{1x}} \\ P_{x \mid 0} &= \frac{N_{0x}}{N_0} = \frac{N_{0x}}{\sum_x N_{0x}} \\ P_{x} &= \frac{N_{x}}{n} = \frac{N_{0x} + N_{1x}}{\sum_{w,x} N_{wx}} \\ &= \frac{N_{0x}}{\sum_{x} N_{0x}} \times \frac{\sum_{x} N_{0x}}{\sum_{w,x} N_{wx}} + \frac{N_{1x}}{\sum_{x} N_{1x}} \times \frac{\sum_{x} N_{1x}}{\sum_{w,x} N_{wx}} \\ &= P_{x \mid 0} \times P_0 + P_{x \mid 1} \times P_1 \qfor P_w = \frac{\sum_{x} N_{wx}}{\sum_{w,x} N_{wx}} \end{aligned} \] Let’s make a table of these quantities.

| \(x\) | \(P_{x \mid 0}\) | \(P_{x \mid 1}\) | \(P_x\) |
|---|---|---|---|
| 12 | \(\frac{353}{1095} \approx 0.32\) | \(\frac{252}{1045} \approx 0.24\) | \(0.32 \times 0.51 + 0.24 \times 0.49 \approx 0.28\) |
| 13 | \(\frac{179}{1095} \approx 0.16\) | \(\frac{178}{1045} \approx 0.17\) | \(0.16 \times 0.51 + 0.17 \times 0.49 \approx 0.17\) |
| 14 | \(\frac{110}{1095} \approx 0.1\) | \(\frac{110}{1045} \approx 0.11\) | \(0.1 \times 0.51 + 0.11 \times 0.49 \approx 0.1\) |
| 16 | \(\frac{311}{1095} \approx 0.28\) | \(\frac{329}{1045} \approx 0.31\) | \(0.28 \times 0.51 + 0.31 \times 0.49 \approx 0.3\) |
| 18 | \(\frac{107}{1095} \approx 0.1\) | \(\frac{140}{1045} \approx 0.13\) | \(0.1 \times 0.51 + 0.13 \times 0.49 \approx 0.12\) |
| 20 | \(\frac{35}{1095} \approx 0.03\) | \(\frac{36}{1045} \approx 0.03\) | \(0.03 \times 0.51 + 0.03 \times 0.49 \approx 0.03\) |
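If you'd rather compute these probabilities than do the fractions by hand, here's a sketch using the counts from the table of \(N_{wx}\):

```python
# Recompute the histogram-form probabilities from the counts N_wx.
xs = [12, 13, 14, 16, 18, 20]
N0 = {12: 353, 13: 179, 14: 110, 16: 311, 18: 107, 20: 35}  # counts for w = 0
N1 = {12: 252, 13: 178, 14: 110, 16: 329, 18: 140, 20: 36}  # counts for w = 1

n0, n1 = sum(N0.values()), sum(N1.values())  # 1095 and 1045
n = n0 + n1

P_x_given_0 = {x: N0[x] / n0 for x in xs}
P_x_given_1 = {x: N1[x] / n1 for x in xs}
# P_x = P(x|0) P(w=0) + P(x|1) P(w=1), which simplifies to a ratio of counts.
P_x = {x: (N0[x] + N1[x]) / n for x in xs}

print(round(P_x_given_0[12], 2), round(P_x_given_1[12], 2), round(P_x[12], 2))
# prints 0.32 0.24 0.28, matching the first row of the table
```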

Substituting these and values from our table of \(\hat\mu(w,x)\) into our ‘histogram form’ formulas, we get these numbers.

\[ \begin{aligned} \hat\Delta_{\text{raw}} &\approx -11.3k \\ \hat \Delta_0 &\approx -14.8k\\ \hat \Delta_1 &\approx -15.1k\\ \hat \Delta_{\text{all}} &\approx -14.9k \end{aligned} \]
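Here's a sketch that recomputes all four estimators directly from the table. Since the table's \(\hat\mu(w,x)\) values are rounded to the nearest $1k, the results land within a few hundred dollars of the numbers above rather than matching them exactly:

```python
# Recompute the four estimators from the table of counts and mu.hat values.
# The table's mu.hat entries are rounded to the nearest $1k, so these come
# out slightly different from the more precise numbers quoted in the text.
xs = [12, 13, 14, 16, 18, 20]
N0 = {12: 353, 13: 179, 14: 110, 16: 311, 18: 107, 20: 35}
N1 = {12: 252, 13: 178, 14: 110, 16: 329, 18: 140, 20: 36}
mu0 = {12: 35, 13: 41, 14: 50, 16: 70, 18: 100, 20: 110}  # mu.hat(0, x) in $1k
mu1 = {12: 21, 13: 27, 14: 30, 16: 58, 18: 77, 20: 107}   # mu.hat(1, x) in $1k

n0, n1 = sum(N0.values()), sum(N1.values())
n = n0 + n1

raw  = sum(N1[x] * mu1[x] for x in xs) / n1 - sum(N0[x] * mu0[x] for x in xs) / n0
d0   = sum(N0[x] * (mu1[x] - mu0[x]) for x in xs) / n0
d1   = sum(N1[x] * (mu1[x] - mu0[x]) for x in xs) / n1
dall = sum((N0[x] + N1[x]) * (mu1[x] - mu0[x]) for x in xs) / n

print(round(raw, 1), round(d0, 1), round(d1, 1), round(dall, 1))
# prints -11.1 -14.6 -14.8 -14.7
```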

In Lecture 9, we decomposed the raw difference \(\hat\Delta_{\text{raw}}\) as a sum of the adjusted difference \(\hat\Delta_1\) and a covariate shift term. And we talked about how to use plots like the ones above to understand what the covariate shift term would look like. In this next exercise, we’ll get some practice interpreting plots like these. We’ll work with three samples. Each sample is shown in a tab below.

Sample 1:

  • \(W_i\) is an indicator for county.
    • 0 for residents of Los Angeles County
    • 1 for residents of San Francisco and Alameda Counties
  • \(X_i\) is education in years of schooling.
  • \(Y_i\) is income in dollars.

Sample 2:

  • \(W_i\) is an indicator for county.
    • 0 for residents of San Diego
    • 1 for residents of Orange County
  • \(X_i\) is education in years of schooling.
  • \(Y_i\) is unemployment status: 0 for employed and 1 for unemployed.

Sample 3:

  • \(W_i\) is an indicator for voting in the 2004 primary.
    • 1 if they voted in the last primary,
    • 0 if they didn’t.
  • \(X_i\) is birth-year.
  • \(Y_i\) is an indicator for voting in the 2006 primary.
    • 1 if they voted,
    • 0 if they didn’t.

Exercise 20.2  

For each of the three samples, answer these Yes or No questions. No need to explain.

  1. Is the raw difference \(\hat\Delta_{\text{raw}}\) larger than the adjusted difference \(\hat\Delta_1\)?
  2. Is the magnitude of the raw difference, \(| \hat\Delta_{\text{raw}}|\), larger than the magnitude of the adjusted difference, \(|\hat\Delta_1|\)?

Tip. The second question is a bit hard. You should be able to use the plot to know the sign (+/-) of the adjusted difference and covariate shift term. If the signs are the same, you know the magnitude of the raw difference is larger. But if the signs are opposite, it’s a bit more subtle. What happens depends on the relative magnitudes of the adjusted difference and covariate shift term. Usually, the magnitude of the covariate shift is less than twice the magnitude of the adjusted difference, which makes the raw difference smaller in magnitude than the adjusted one. But if the magnitude of the covariate shift term is more than twice the magnitude of the adjusted difference, we have an extreme case of Simpson’s paradox: the raw difference is larger in magnitude but opposite in sign to the adjusted difference. For the purpose of this assignment, don’t worry about this happening.
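The tip's arithmetic can be made concrete with a tiny illustration using made-up numbers (not from any of the samples):

```python
# Toy numbers (made up, not from the data) illustrating the tip: the raw
# difference equals the adjusted difference plus the covariate shift term.
adjusted = -1.0
for shift in (1.5, 2.5):  # below vs above twice |adjusted|
    raw = adjusted + shift
    print(shift, raw, abs(raw) < abs(adjusted))
# shift = 1.5: raw = 0.5, smaller in magnitude than the adjusted difference
# shift = 2.5: raw = 1.5, larger in magnitude and opposite in sign (the
#              extreme Simpson's-paradox case)
```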

Sample 1:

  1. Yes, \(\hat\Delta_{\text{raw}}\) is larger than \(\hat\Delta_1\). Explanation: rightward covariate shift, increasing function \(\hat\mu(0,x)\).
  2. Yes, it’s larger in magnitude as well. Explanation: \(\hat\Delta_1\) is positive because the typical within-group differences are positive.

Sample 2:

  1. Yes, \(\hat\Delta_{\text{raw}}\) is larger than \(\hat\Delta_1\). Explanation: leftward covariate shift, decreasing function \(\hat\mu(0,x)\).
  2. No, it’s smaller in magnitude. But visually, it’s not that obvious. Explanation: The within-group differences \(\hat\mu(1,x) - \hat\mu(0,x)\) are negative, so \(\hat\Delta_1\) is negative. The raw difference, being bigger in signed terms because of covariate shift, could be negative but smaller in magnitude, or positive and either smaller or larger in magnitude. My guess was that there isn’t enough covariate shift to make the raw difference positive. And looking at the numbers in the table in Note 20.1 confirms this: the raw difference is -0.023 and the adjusted one is -0.041.

Sample 3:

  1. Yes, \(\hat\Delta_{\text{raw}}\) is larger than \(\hat\Delta_1\). Explanation: leftward covariate shift, decreasing function \(\hat\mu(0,x)\).
  2. Yes, it’s larger in magnitude as well. Explanation: \(\hat\Delta_1\) is positive because the typical within-group differences are positive.

I wanted this to be an exercise in interpreting plots, so I didn’t give you the data you’d need to calculate these. But, if you’re curious, here they are.

| comparison | \(\hat\Delta_{\text{raw}}\) | \(\hat\Delta_1\) |
|---|---|---|
| Income in LA vs SFBay | 39k | 24k |
| Unemployment in SD vs OC | -0.023 | -0.041 |
| Primary Turnout in MI | 0.159 | 0.147 |

We can derive similar decompositions for our other adjusted differences \(\hat\Delta_0\) and \(\hat\Delta_{\text{all}}\). That is, we can decompose \(\hat\Delta_{\text{raw}}\) as the sum of \(\hat\Delta_0\) or \(\hat\Delta_{\text{all}}\) and a different, but conceptually similar covariate shift term.

Exercise 20.3  

Derive a formula for \(\hat\Delta_{\text{raw}}\) as the sum of \(\hat\Delta_0\) and a covariate shift term. Show your work.

\[ \color{gray} \begin{aligned} \hat\Delta_{\text{raw}} &= \textcolor[RGB]{0,191,196}{\frac{1}{N_1}\sum_{i:W_i=1} \hat\mu(1,X_i)} - \textcolor[RGB]{248,118,109}{\frac{1}{N_0}\sum_{i:W_i=0} \hat\mu(0,X_i)} \\ &= \sum_{x} \textcolor[RGB]{0,191,196}{P_{x\mid 1} \ \hat\mu(1,x)} - \sum_{x} \textcolor[RGB]{248,118,109}{P_{x\mid 0} \ \hat\mu(0,x)} \\ &= \sum_{x} (\textcolor[RGB]{0,191,196}{P_{x\mid 1}} + \textcolor[RGB]{248,118,109}{P_{x\mid 0}} - \textcolor[RGB]{248,118,109}{P_{x\mid 0}}) \ \textcolor[RGB]{0,191,196}{\hat\mu(1,x)} - \sum_{x} \textcolor[RGB]{248,118,109}{P_{x\mid 0}} \ \textcolor[RGB]{248,118,109}{\hat\mu(0,x)} \\ &= \underset{\text{covariate shift term}}{\sum_x (\textcolor[RGB]{0,191,196}{P_{x\mid 1}} - \textcolor[RGB]{248,118,109}{P_{x\mid 0}}) \ \textcolor[RGB]{0,191,196}{\hat \mu(1,x)}} + \underset{\text{adjusted difference}\ \hat\Delta_0}{\sum_x \textcolor[RGB]{248,118,109}{P_{x \mid 0}} \qty{ \textcolor[RGB]{0,191,196}{\hat\mu(1,x)} \ - \ \textcolor[RGB]{248,118,109}{\hat\mu(0,x)}}} \end{aligned} \]

The raw difference differs from \(\hat\Delta_0\) by the red-to-green shift in the average of the function \(\textcolor[RGB]{0,191,196}{\hat\mu(1,x)}\). When comparing to \(\hat\Delta_1\), it was the same shift in \(\textcolor[RGB]{248,118,109}{\hat\mu(0,x)}\). So we can use the same visualization techniques and heuristics; we just have to look at the green curve instead of the red one.
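The decomposition can be verified numerically with the Exercise 20.1 table (whose \(\hat\mu\) values are rounded to the nearest $1k). This sketch computes the covariate shift term directly and checks that it and \(\hat\Delta_0\) sum to \(\hat\Delta_{\text{raw}}\):

```python
# Numeric check of the decomposition raw = shift term + Delta_0, using the
# Exercise 20.1 table (mu.hat values rounded to the nearest $1k).
xs = [12, 13, 14, 16, 18, 20]
N0 = {12: 353, 13: 179, 14: 110, 16: 311, 18: 107, 20: 35}
N1 = {12: 252, 13: 178, 14: 110, 16: 329, 18: 140, 20: 36}
mu0 = {12: 35, 13: 41, 14: 50, 16: 70, 18: 100, 20: 110}
mu1 = {12: 21, 13: 27, 14: 30, 16: 58, 18: 77, 20: 107}
n0, n1 = sum(N0.values()), sum(N1.values())

raw = sum(N1[x] * mu1[x] for x in xs) / n1 - sum(N0[x] * mu0[x] for x in xs) / n0
d0 = sum(N0[x] * (mu1[x] - mu0[x]) for x in xs) / n0
# covariate shift term: red-to-green shift in the average of mu.hat(1, x)
shift = sum((N1[x] / n1 - N0[x] / n0) * mu1[x] for x in xs)

assert abs(raw - (shift + d0)) < 1e-9  # raw = shift term + Delta_0
print(round(shift, 1))  # the shift term is positive, about +3.5k
```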

Note 20.2: A correction.

In a previously posted version of this assignment, Exercise 20.3 asked you to do the same with \(\hat\Delta_{\text{all}}\) in place of \(\hat\Delta_0\). That’s a bad exercise because there isn’t a two-term decomposition of \(\hat\Delta_{\text{raw}}\) as the sum of \(\hat\Delta_{\text{all}}\) and a single covariate shift term. You can, however, get a useful three-term decomposition.

To do this, start with your formula for \(\hat\Delta_{\text{raw}}\) in terms of \(\hat\Delta_1\). Then think of the difference \(\hat\Delta_1 - \hat\Delta_{\text{all}}\) as another covariate shift term—only this time, the function we’re averaging over our two histograms is \(\hat\mu(1,x) - \hat\mu(0,x)\) instead of \(\hat\mu(0,x)\).

\[ \begin{aligned} \hat\Delta_{\text{raw}} &= \hat\Delta_1 + (\hat\Delta_{\text{raw}} - \hat\Delta_1) \\ &= \hat\Delta_{\text{all}} + \underset{\text{new covariate shift term}}{\qty(\hat\Delta_{1} - \hat\Delta_{\text{all}})} + \underset{\text{$\hat\Delta_1$'s covariate shift term}}{(\hat\Delta_{\text{raw}} - \hat\Delta_1)} \end{aligned} \]

I’ve added an Extra Credit version of this exercise, Exercise 20.7, on deriving and using this three-term decomposition.

Now that we have these new decompositions, let’s use them.

Exercise 20.4  

Forget the calculations you did in Exercise 20.1 for a moment. Explain how, using only Figure 20.1 and Figure 20.2, you could predict whether \(\hat\Delta_{\text{raw}}\) or \(\hat\Delta_0\) would be larger. Do the same for \(\hat\Delta_{\text{raw}}\) and \(\hat\Delta_{\text{all}}\).

Are your answers here consistent with what you calculated in Exercise 20.1?

\(\hat\Delta_{\text{raw}}\) is larger than \(\hat\Delta_0\). The trend \(\textcolor[RGB]{0,191,196}{\hat\mu(1,x)}\) is increasing and the distribution of \(x\) shifts to the right, so its average over \(\textcolor[RGB]{0,191,196}{P_{x\mid 1}}\) is larger than its average over \(\textcolor[RGB]{248,118,109}{P_{x\mid 0}}\). This is consistent with the numbers from Exercise 20.1.

The case of \(\hat\Delta_{\text{all}}\) has been moved to the extra credit exercise Exercise 20.7 below. If you did it here, getting it wrong here won’t count against you, but getting it right earns you some of those extra credit points.

Covariate shift isn’t just the phenomenon behind the difference between raw and adjusted differences. It shows up wherever we’re comparing averages over different groups. One example is comparing different adjusted differences, e.g. \(\hat \Delta_1\) and \(\hat \Delta_0\). The following exercises will explore that. These are extra credit exercises, but I encourage you to look them over and think about them at least for a minute or two. There’s a good chance that, having made it this far, you’ll find them pretty easy.

Exercise 20.5  

Extra Credit. Explain how, using only Figure 20.1 and Figure 20.2, you could predict whether \(\hat\Delta_{1}\) or \(\hat\Delta_0\) would be larger.

Hint. Compare the ‘histogram form’ formulas for \(\hat\Delta_1\) and \(\hat\Delta_0\).

This is a tricky one. The formula you need is simple. Observe that the difference \(\hat\Delta_1 - \hat\Delta_0\) is a shift between averages of the function \(f(x)=\hat\mu(1,x) - \hat\mu(0,x)\). \[ \color{gray} \begin{aligned} \hat\Delta_1 - \hat\Delta_0 &= \sum_x \textcolor[RGB]{0,191,196}{P_{x \mid 1}} \qty{ \textcolor[RGB]{0,191,196}{\hat\mu(1,x)} - \textcolor[RGB]{248,118,109}{\hat\mu(0,x)}} - \sum_x \textcolor[RGB]{248,118,109}{P_{x \mid 0}} \qty{ \textcolor[RGB]{0,191,196}{\hat\mu(1,x)} - \textcolor[RGB]{248,118,109}{\hat\mu(0,x)}}. \end{aligned} \]

What’s subtle is making sense of this visually. It’s not going to be a big difference because this function \(\textcolor[RGB]{0,191,196}{\hat\mu(1,x)} - \textcolor[RGB]{248,118,109}{\hat\mu(0,x)}\) isn’t changing much over \(x\) and there’s not a ton of covariate shift either. But if you look very carefully, you can see that \(\hat\Delta_1\) is a bit smaller (i.e. bigger in magnitude) than \(\hat\Delta_0\). How? You have to be a bit subtle about how you think about the shift in the distribution of \(x\).

The green and red histograms are roughly equal at \(x=13\),\(14\), and \(20\). So what’s happening is that the extra mass that’s in the red histogram at \(x=12\) is getting split between \(x=16\) and \(x=18\) in the green histogram. The function \(\textcolor[RGB]{0,191,196}{\hat\mu(1,x)} - \textcolor[RGB]{248,118,109}{\hat\mu(0,x)}\) is roughly the same at \(x=12\) and \(x=16\), but it’s smaller (more negative) at \(x=18\). As a result, the average of this function over the green histogram will be a bit smaller (again, more negative) than the average over the red one.

Does this bear out in the numbers? Yes. If you do the calculations precisely, \(\hat\Delta_1\) comes out about $600 less than \(\hat\Delta_0\), as we saw in Exercise 20.1.
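Here's a check of this ordering using the Exercise 20.1 table. Because the table's \(\hat\mu\) values are rounded to the nearest $1k, the computed gap comes out somewhat smaller than the roughly $600 from the precise calculation, but the sign is the same:

```python
# Check numerically that Delta_1 is a bit below Delta_0, using the rounded
# table from Exercise 20.1 (so the gap is approximate).
xs = [12, 13, 14, 16, 18, 20]
N0 = {12: 353, 13: 179, 14: 110, 16: 311, 18: 107, 20: 35}
N1 = {12: 252, 13: 178, 14: 110, 16: 329, 18: 140, 20: 36}
mu0 = {12: 35, 13: 41, 14: 50, 16: 70, 18: 100, 20: 110}
mu1 = {12: 21, 13: 27, 14: 30, 16: 58, 18: 77, 20: 107}
n0, n1 = sum(N0.values()), sum(N1.values())

d0 = sum(N0[x] * (mu1[x] - mu0[x]) for x in xs) / n0
d1 = sum(N1[x] * (mu1[x] - mu0[x]) for x in xs) / n1
print(round((d1 - d0) * 1000))  # gap in dollars; negative, so Delta_1 < Delta_0
```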

Exercise 20.6  

Extra Credit.

  1. Write the simplest formula you can expressing \(\hat\Delta_{\text{all}}\) in terms of \(\hat\Delta_0\) and \(\hat\Delta_1\) and other numbers you can compute from the sample. It should be very simple.
  2. Referring to the formula, explain why, if you know which of \(\hat\Delta_0\) or \(\hat\Delta_1\) is larger, you also know whether \(\hat\Delta_0\) or \(\hat\Delta_{\text{all}}\) is larger and whether \(\hat\Delta_1\) or \(\hat\Delta_{\text{all}}\) is larger.
  3. Explain why in visual terms, too.

Hint. How does the purple histogram in Figure 20.1 relate to the green and red ones? This is not just relevant to the in-visual-terms part; if you don’t already have a formula, what you see may guide you toward one.

  1. Here’s the formula. It’s a weighted average of the two summaries \(\hat\Delta_0\) and \(\hat\Delta_1\) where the weights are proportional to the number of people averaged over in each summary, i.e., the proportions of red (w=0) and green (w=1) dots in the sample. \[ \hat\Delta_{\text{all}} = P_{w=0} \hat\Delta_0 + P_{w=1} \hat\Delta_1 \qfor P_{w=w'} = \frac{1}{n}\sum_{i=1}^n 1_{=w'}(W_i). \]

  2. Because \(\hat\Delta_{\text{all}}\) is a weighted average of \(\hat\Delta_0\) and \(\hat\Delta_1\), you know it’s going to be in between the two.

  3. All three summaries are averages of the \(f(x)=\hat\mu(1,x)-\hat\mu(0,x)\) over different distributions. And the height of the purple histogram is always between the green and red histograms.

I didn’t ask for a derivation of the formula \(\hat\Delta_{\text{all}}\), but here it is. It’s just a matter of recognizing that the purple histogram is the same weighted average of the green and red histograms. \[ \begin{aligned} P_x &= \frac{N_{0x} + N_{1x}}{n} \\ &= \frac{N_{0x}}{\sum_x N_{0x}} \times \frac{\sum_x N_{0x}}{n} + \frac{N_{1x}}{\sum_x N_{1x}} \times \frac{\sum_x N_{1x}}{n} \\ &= P_{x\mid 0} \times P_{w=0} + P_{x\mid 1} \times P_{w=1} \qfor P_{w=w'} = \frac{1}{n}\sum_x N_{w'x} = \frac{1}{n}\sum_{i=1}^n 1_{=w'}(W_i). \end{aligned} \]

Substituting this into the formula for \(\hat\Delta_{\text{all}}\), we get \[ \begin{aligned} \hat\Delta_{\text{all}} &= \sum_x P_x \qty{ \hat\mu(1,x) - \hat\mu(0,x)} \\ &= \sum_x \qty{ P_{x\mid 0} P_{w=0} + P_{x\mid 1} P_{w=1}} \qty{ \hat\mu(1,x) - \hat\mu(0,x)} \\ &= \sum_x P_{x\mid 0} P_{w=0} \qty{ \hat\mu(1,x) - \hat\mu(0,x)} + \sum_x P_{x\mid 1} P_{w=1} \qty{ \hat\mu(1,x) - \hat\mu(0,x)} \\ &= P_{w=0} \sum_x P_{x\mid 0} \qty{ \hat\mu(1,x) - \hat\mu(0,x)} + P_{w=1} \sum_x P_{x\mid 1} \qty{ \hat\mu(1,x) - \hat\mu(0,x)} \\ &= P_{w=0} \hat\Delta_0 + P_{w=1} \hat\Delta_1. \end{aligned} \]
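The identity, and the claim that \(\hat\Delta_{\text{all}}\) lies between \(\hat\Delta_0\) and \(\hat\Delta_1\), can be checked numerically on the Exercise 20.1 data:

```python
# Check the weighted-average identity
#   Delta_all = P(w=0) Delta_0 + P(w=1) Delta_1
# on the table of counts and (rounded) mu.hat values from Exercise 20.1.
xs = [12, 13, 14, 16, 18, 20]
N0 = {12: 353, 13: 179, 14: 110, 16: 311, 18: 107, 20: 35}
N1 = {12: 252, 13: 178, 14: 110, 16: 329, 18: 140, 20: 36}
mu0 = {12: 35, 13: 41, 14: 50, 16: 70, 18: 100, 20: 110}
mu1 = {12: 21, 13: 27, 14: 30, 16: 58, 18: 77, 20: 107}
n0, n1 = sum(N0.values()), sum(N1.values())
n = n0 + n1

d0 = sum(N0[x] * (mu1[x] - mu0[x]) for x in xs) / n0
d1 = sum(N1[x] * (mu1[x] - mu0[x]) for x in xs) / n1
dall = sum((N0[x] + N1[x]) * (mu1[x] - mu0[x]) for x in xs) / n

# The identity holds exactly, so Delta_all lies between Delta_0 and Delta_1.
assert abs(dall - (n0 / n * d0 + n1 / n * d1)) < 1e-9
print(min(d0, d1) <= dall <= max(d0, d1))  # True
```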

Exercise 20.7  

Extra Credit. Write out the three-term decomposition for \(\hat\Delta_{\text{raw}}\) described above in Note 20.2. And, with reference to the California income data shown in Figure 20.1 and Figure 20.2, explain how to use this to see whether \(\hat\Delta_{\text{raw}}\) or \(\hat\Delta_{\text{all}}\) is larger.

We’ll start with the decomposition from Note 20.2 and expand. \[ \begin{aligned} \hat\Delta_{\text{raw}} &= \hat\Delta_1 + (\hat\Delta_{\text{raw}} - \hat\Delta_1) \\ &= \hat\Delta_{\text{all}} + \underset{\text{new covariate shift term}}{\qty(\hat\Delta_{1} - \hat\Delta_{\text{all}})} + \underset{\text{$\hat\Delta_1$'s covariate shift term}}{(\hat\Delta_{\text{raw}} - \hat\Delta_1)} \\ &= \hat\Delta_{\text{all}} \\ &+ \sum_x P_{x\mid 1} \qty{ \hat\mu(1,x) - \hat\mu(0,x)} - \sum_x P_{x} \qty{ \hat\mu(1,x) - \hat\mu(0,x)} \\ &+ \sum_x P_{x\mid 1} \hat\mu(0,x) - \sum_x P_{x \mid 0} \hat\mu(0,x) \end{aligned} \] and analogously, making a comparison between \(\hat\Delta_{\text{all}}\) and \(\hat\Delta_0\), \[ \begin{aligned} \hat\Delta_{\text{raw}} &= \hat\Delta_{\text{all}} + (\hat\Delta_{0} - \hat\Delta_{\text{all}}) + (\hat\Delta_\text{raw} - \hat\Delta_{0}) \\ &= \hat\Delta_{\text{all}} \\ &+\sum_x P_{x\mid 0} \qty{ \hat\mu(1,x) - \hat\mu(0,x)} - \sum_x P_{x} \qty{ \hat\mu(1,x) - \hat\mu(0,x)} \\ &+ \sum_x P_{x\mid 1} \hat\mu(1,x) - \sum_x P_{x \mid 0} \hat\mu(1,x) \end{aligned} \]
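Here's a numeric check that the first three-term decomposition holds on the Exercise 20.1 data, with the two shift terms computed from their sum formulas:

```python
# Numeric check of the three-term decomposition
#   raw = Delta_all + (Delta_1 - Delta_all) + (raw - Delta_1)
# on the Exercise 20.1 table (mu.hat values rounded to the nearest $1k).
xs = [12, 13, 14, 16, 18, 20]
N0 = {12: 353, 13: 179, 14: 110, 16: 311, 18: 107, 20: 35}
N1 = {12: 252, 13: 178, 14: 110, 16: 329, 18: 140, 20: 36}
mu0 = {12: 35, 13: 41, 14: 50, 16: 70, 18: 100, 20: 110}
mu1 = {12: 21, 13: 27, 14: 30, 16: 58, 18: 77, 20: 107}
n0, n1 = sum(N0.values()), sum(N1.values())
n = n0 + n1

P0 = {x: N0[x] / n0 for x in xs}          # red histogram
P1 = {x: N1[x] / n1 for x in xs}          # green histogram
P = {x: (N0[x] + N1[x]) / n for x in xs}  # purple histogram

raw = sum(P1[x] * mu1[x] for x in xs) - sum(P0[x] * mu0[x] for x in xs)
dall = sum(P[x] * (mu1[x] - mu0[x]) for x in xs)
new_shift = sum((P1[x] - P[x]) * (mu1[x] - mu0[x]) for x in xs)  # Delta_1 - Delta_all
old_shift = sum((P1[x] - P0[x]) * mu0[x] for x in xs)            # raw - Delta_1

assert abs(raw - (dall + new_shift + old_shift)) < 1e-9
print(raw > dall)  # True: Delta_raw is larger than Delta_all here
```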

We’ve seen in Exercise 20.4 that the last term is positive, i.e. \(\hat\Delta_{\text{raw}} > \hat\Delta_0\). So is the middle term. We’ve seen in Exercise 20.5 that \(\hat\Delta_0 > \hat\Delta_{1}\) and it follows from Exercise 20.6 that \(\hat\Delta_0 > \hat\Delta_\text{all} > \hat\Delta_1\).3

This is based on Exercise 20.6. If \(\hat\Delta_\text{raw}\) is bigger than both \(\hat\Delta_0\) and \(\hat\Delta_1\), then because \(\hat\Delta_{\text{all}}\) is between \(\hat\Delta_0\) and \(\hat\Delta_1\), \(\hat\Delta_{\text{raw}}\) is bigger than \(\hat\Delta_{\text{all}}\).

We’ve already shown that this applies here: we saw that \(\hat\Delta_{\text{raw}}\) is bigger than \(\hat\Delta_1\) and \(\hat\Delta_0\) in Exercise 20.1 and Exercise 20.4 respectively.


  1. We got lucky here. This is a version of our prior-observations-estimator where \(n_\text{prior}=n\) and \(\theta_\text{prior}=0\) was exactly equal to \(\theta\). If \(\theta\) weren’t zero, we’d be neither unbiased nor consistent.↩︎

  2. This is another prior observations estimator with \(n_\text{prior}=n\) and \(\theta_\text{prior}=2\). We’re not lucky with our prior observations here.↩︎

  3. You could, if you prefer, use a version of Exercise 20.5 to see that \(\hat\Delta_0 > \hat\Delta_\text{all}\) directly. Instead of thinking about distributing mass from the left side of the red histogram to the right to get the green one, think about distributing mass from the left side of the red histogram to the right to get the purple one.↩︎