28  Homework: Covariate Shift

Summary

In this homework, we’ll review some material that people mixed up on the exam, then work on visualization and computation for complex summaries of data with multiple and multivalued covariates.

Review

Figure 30.1: Histograms of the population \(y_1 \ldots y_{m}\) for \(m=10,000,000\) (first) and of the sample \(Y_1 \ldots Y_n\) for \(n=10,000\) (second). The mean of each distribution is marked with a blue line and the mean plus and minus one standard deviation are marked with dashed blue lines.

Figure 30.2: The bootstrap sampling distributions of our three estimators, in panels labeled A, B, and C. The mean of each distribution is marked with a blue line.

In Figure 30.1, I’ve drawn two histograms. The first shows a population \(y_1 \ldots y_m\) with mean \(\mu = \frac1m \sum_{j=1}^m y_j\) and variance \(\sigma^2 = \frac1m \sum_{j=1}^m (y_j - \mu)^2\). The second shows a sample \(Y_1 \ldots Y_n\) drawn with replacement from this population—the sample that was used in Problem 1 of Midterm 1.
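Here’s a minimal sketch of these definitions in code. The population values below are a made-up stand-in (the actual \(y_1 \ldots y_m\) aren’t listed here), and the names `y`, `Y`, `mu`, and `sigma2` are just illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in population: the real y_1 ... y_m are not given in the text.
m = 100_000
y = rng.gamma(shape=2.0, scale=1.0, size=m)

# Population mean and variance, exactly as defined above.
mu = y.mean()                    # (1/m) * sum_j y_j
sigma2 = np.mean((y - mu) ** 2)  # (1/m) * sum_j (y_j - mu)^2: divides by m, not m - 1

# A sample of size n drawn with replacement from the population.
n = 1_000
Y = rng.choice(y, size=n, replace=True)

print(f"mu = {mu:.3f}, sigma = {np.sqrt(sigma2):.3f}, sample mean = {Y.mean():.3f}")
```

Note that the variance divides by \(m\), which is what `np.var` does by default (`ddof=0`).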

In Figure 30.2, I’ve drawn the bootstrap sampling distributions of three estimators of the population mean.

  1. \(\hat\mu = \bar{Y}/2\) for \(\bar{Y} = \frac1n \sum_{i=1}^n Y_i\).
  2. \(\hat\mu = 1 + \bar{Y}/2\).
  3. \(\hat\mu = Y_2\).
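If you want to see how bootstrap sampling distributions like these can be computed, here’s a rough sketch. It uses a made-up sample rather than the Midterm 1 sample, and the helper `bootstrap_distribution` and the number of replications `B` are hypothetical choices, not necessarily what was used to draw Figure 30.2.

```python
import numpy as np

def bootstrap_distribution(Y, estimator, B=2000, seed=0):
    """Evaluate `estimator` on B resamples of Y, each drawn with replacement."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    draws = np.empty(B)
    for b in range(B):
        Y_star = rng.choice(Y, size=n, replace=True)  # one bootstrap sample
        draws[b] = estimator(Y_star)
    return draws

# The three estimators listed above, as functions of a sample.
estimators = {
    "1": lambda Y: Y.mean() / 2,
    "2": lambda Y: 1 + Y.mean() / 2,
    "3": lambda Y: Y[1],  # Y_2 is the second observation (index 1 in Python)
}

# Hypothetical sample, for illustration only.
Y = np.random.default_rng(1).normal(loc=3.0, scale=1.0, size=500)

dists = {name: bootstrap_distribution(Y, f) for name, f in estimators.items()}
for name, d in dists.items():
    print(f"estimator {name}: mean = {d.mean():.3f}, sd = {d.std():.3f}")
```

Plotting a histogram of each entry of `dists` gives a picture in the style of Figure 30.2.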

Exercise 30.1  

Match each estimator to the plot of its bootstrap sampling distribution. That is, say something like 1:A, 2:B, 3:C. But not exactly that.

🔒

Locked (Week 7)

Exercise 30.2  

For each of these estimators, say whether it is unbiased and whether it is consistent.

Tip. Each part of this is a Yes or No question. There are estimators that are unbiased or consistent for some populations and not for others, so if I were asking about a population I’d told you nothing concrete about, the answer might be maybe. But I’m asking about the specific population plotted above and the plot does give you enough information to determine the answer.


Exercise 30.3  

For any estimator you described as biased or inconsistent, describe what you see in the plots above that suggests this is the case.


Exercise 30.4  

For each of the three estimators, do the following.

  1. Write a formula for its standard deviation. Your formula may involve the sample size \(n\) and some summaries of the population, e.g. \(m\), \(\mu\), or \(\sigma\).
  2. Using information shown in Figure 30.1, calculate it approximately and report the number.
  3. Check that your answer looks about right using Figure 30.2 and describe how you checked.

Complex Summaries and Covariate Shift

In this section, we’ll work with a subset of the California income data considered in lecture. We’ll look at incomes of CA residents who responded to the Current Population Survey in 2022, were in the age range 25-35, and graduated from high school. I’ve plotted this data below. As in lecture:

  • \(W_i=0\) for male residents and \(W_i=1\) for female residents;
  • \(X_i\) is years of schooling;
  • \(Y_i\) is income in dollars.

Figure 31.1: Scatter plot of education vs income colored by sex with histograms of education overlaid. Note that, because incomes are on a very different scale than the proportions we’re displaying in the histogram, two different scales for the y-axis are used. The scale on the left is for income. The scale on the right is for the proportions.

Figure 31.2: Histograms of education colored by sex in two layouts. On the left, they’re plotted one above the other. On the right, they’re overlaid.
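
| W | X  | mu.hat | N.wx |
|---|----|--------|------|
| 0 | 12 | 35k    | 353  |
| 0 | 13 | 41k    | 179  |
| 0 | 14 | 50k    | 110  |
| 0 | 16 | 70k    | 311  |
| 0 | 18 | 100k   | 107  |
| 0 | 20 | 110k   | 35   |
| 1 | 12 | 21k    | 252  |
| 1 | 13 | 27k    | 178  |
| 1 | 14 | 30k    | 110  |
| 1 | 16 | 58k    | 329  |
| 1 | 18 | 77k    | 140  |
| 1 | 20 | 107k   | 36   |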

We’ll consider the following four estimators of the difference in income between the male and female CA residents in our sample.
\[ \color{gray} \begin{aligned} \hat \Delta_{\text{raw}} &= \frac{1}{N_1}\sum_{i: W_i=1} Y_i - \frac{1}{N_0}\sum_{i: W_i=0} Y_i \\ \hat\Delta_0 &=\frac{1}{N_0}\sum_{i: W_i=0} \qty{\hat\mu(1,X_i) - \hat \mu(0, X_i)} \\ \hat \Delta_1 &= \frac{1}{N_1}\sum_{i: W_i=1} \qty{\hat\mu(1,X_i) - \hat \mu(0, X_i)} \\ \hat \Delta_{\text{all}} &= \frac{1}{n} \sum_{i=1}^n \qty{\hat\mu(1,X_i) - \hat \mu(0, X_i)} \end{aligned} \]
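To make the formulas concrete, here’s a sketch that computes all four from a grouped table of subsample means and counts. The numbers are a made-up two-level toy example, not the CPS data, and the sketch assumes \(\hat\mu(w,x)\) is the mean of \(Y_i\) among observations with \(W_i=w\) and \(X_i=x\) and that every level of \(x\) shows up in both groups. The function and variable names are hypothetical.

```python
import numpy as np

# Hypothetical toy table (NOT the CPS data): two education levels.
mu0 = np.array([30.0, 60.0])  # mu.hat(0, x) for x = 12, 16
mu1 = np.array([20.0, 55.0])  # mu.hat(1, x) for x = 12, 16
N0 = np.array([30, 10])       # N_{0x}: group-0 counts at each x
N1 = np.array([10, 30])       # N_{1x}: group-1 counts at each x

def four_differences(mu0, mu1, N0, N1):
    """Raw and adjusted differences, computed from subsample means and counts."""
    p0 = N0 / N0.sum()                           # histogram of X among W = 0
    p1 = N1 / N1.sum()                           # histogram of X among W = 1
    p_all = (N0 + N1) / (N0.sum() + N1.sum())    # histogram of X in the whole sample
    gap = mu1 - mu0                              # mu.hat(1, x) - mu.hat(0, x)
    return {
        # Averaging Y_i within a group equals averaging mu.hat(w, x) over that group's histogram.
        "raw": p1 @ mu1 - p0 @ mu0,
        "0": p0 @ gap,
        "1": p1 @ gap,
        "all": p_all @ gap,
    }

deltas = four_differences(mu0, mu1, N0, N1)
for name, value in deltas.items():
    print(f"Delta_{name} = {value:.2f}")
```

In this toy example the raw difference is positive even though all three adjusted differences are negative, because group 1 is concentrated at the education level where incomes are higher.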

Exercise 31.1  

Using the data in the table above, calculate each of these four estimators.


In Lecture 9, we decomposed the raw difference \(\hat\Delta_{\text{raw}}\) as a sum of the adjusted difference \(\hat\Delta_1\) and a covariate shift term. And we talked about how to use plots like the ones above to understand what the covariate shift term would look like. In this next exercise, we’ll get in some practice interpreting plots like these. We’ll work with three samples. Each sample is shown in a tab below.

Sample 1: Income in LA vs SF Bay

  • \(W_i\) is an indicator for county.
    • 0 for residents of Los Angeles County
    • 1 for residents of San Francisco and Alameda Counties
  • \(X_i\) is education in years of schooling.
  • \(Y_i\) is income in dollars.

Sample 2: Unemployment in SD vs OC

  • \(W_i\) is an indicator for county.
    • 0 for residents of San Diego
    • 1 for residents of Orange County
  • \(X_i\) is education in years of schooling.
  • \(Y_i\) is unemployment status: 0 for employed and 1 for unemployed.

Sample 3: Primary Turnout in MI

  • \(W_i\) is an indicator for voting in the 2004 primary.
    • 1 if they voted in the 2004 primary
    • 0 if they didn’t
  • \(X_i\) is birth-year.
  • \(Y_i\) is an indicator for voting in the 2006 primary.
    • 1 if they voted
    • 0 if they didn’t
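As a reminder, here is one way to write the decomposition of \(\hat\Delta_{\text{raw}}\) into \(\hat\Delta_1\) and a covariate shift term in ‘histogram form’. This is a sketch under two assumptions: that \(\hat\mu(w,x)\) is the mean of \(Y_i\) among observations with \(W_i=w\) and \(X_i=x\), and that we write \(P_{x \mid w} = N_{wx}/N_w\) for the height of group \(w\)’s histogram at \(x\). That notation is introduced here for illustration and may differ from the lecture’s.

\[ \begin{aligned} \hat\Delta_{\text{raw}} &= \sum_x P_{x \mid 1} \ \hat\mu(1,x) - \sum_x P_{x \mid 0} \ \hat\mu(0,x) \\ &= \underset{\hat\Delta_1}{\sum_x P_{x \mid 1} \qty{\hat\mu(1,x) - \hat\mu(0,x)}} + \underset{\text{covariate shift term}}{\sum_x \qty{P_{x \mid 1} - P_{x \mid 0}} \hat\mu(0,x)} \end{aligned} \]

The second line adds and subtracts \(\sum_x P_{x \mid 1} \ \hat\mu(0,x)\). The covariate shift term is the difference between the averages of one function, \(\hat\mu(0,x)\), over the two groups’ histograms.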

Exercise 31.2  

For each of the three samples, answer these two questions. No need to explain.

  1. Is the raw difference \(\hat\Delta_{\text{raw}}\) larger or smaller than the adjusted difference \(\hat\Delta_1\)?
  2. Is the magnitude of the raw difference, \(| \hat\Delta_{\text{raw}}|\), larger or smaller than the magnitude of the adjusted difference, \(|\hat\Delta_1|\)?

Tip. The second question is a bit hard. You should be able to use the plot to know the sign (+/-) of the adjusted difference and covariate shift term. If the signs are the same, you know the magnitude of the raw difference is larger. But if the signs are opposite, it’s a bit more subtle. What happens depends on the relative magnitudes of the adjusted difference and covariate shift term. Usually, the magnitude of the covariate shift is less than twice the magnitude of the adjusted difference, which makes the raw difference smaller in magnitude than the adjusted one. But if the magnitude of the covariate shift term is more than twice the magnitude of the adjusted difference, we have an extreme case of Simpson’s paradox: the raw difference is larger in magnitude but opposite in sign to the adjusted difference. For the purpose of this assignment, don’t worry about this happening.


I wanted this to be an exercise in interpreting plots, so I didn’t give you the data you’d need to calculate these. But, if you’re curious, here they are.

| comparison | \(\hat\Delta_{\text{raw}}\) | \(\hat\Delta_1\) |
|---|---|---|
| Income in LA vs SFBay | 39k | 24k |
| Unemployment in SD vs OC | -0.023 | -0.041 |
| Primary Turnout in MI | 0.159 | 0.147 |
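If you’d like to connect these numbers back to the tip above, here’s a small sketch that backs out the covariate shift term as \(\hat\Delta_{\text{raw}} - \hat\Delta_1\) and checks its sign against the adjusted difference. The values are copied from the table, with 39k and 24k read as 39,000 and 24,000 dollars; the function name `describe` is hypothetical.

```python
# (raw, adjusted) pairs copied from the table above; incomes are rounded to the nearest $1k.
comparisons = {
    "Income in LA vs SFBay": (39_000, 24_000),
    "Unemployment in SD vs OC": (-0.023, -0.041),
    "Primary Turnout in MI": (0.159, 0.147),
}

def describe(raw, adj):
    """Return the implied covariate shift term and two sign/magnitude checks."""
    shift = raw - adj                   # since raw = adjusted + covariate shift
    same_sign = shift * adj > 0         # do the shift and adjusted difference agree in sign?
    raw_bigger = abs(raw) > abs(adj)    # is the raw difference larger in magnitude?
    return shift, same_sign, raw_bigger

for name, (raw, adj) in comparisons.items():
    shift, same_sign, raw_bigger = describe(raw, adj)
    print(f"{name}: shift = {shift:.3g}, same sign = {same_sign}, |raw| > |adjusted| = {raw_bigger}")
```

When the signs agree, the raw difference comes out larger in magnitude; in the one comparison where they disagree, the shift is less than twice the adjusted difference in magnitude, so the raw difference is smaller in magnitude.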

We can derive similar decompositions for our other adjusted differences \(\hat\Delta_0\) and \(\hat\Delta_{\text{all}}\). That is, we can decompose \(\hat\Delta_{\text{raw}}\) as the sum of \(\hat\Delta_0\) or \(\hat\Delta_{\text{all}}\) and a different, but conceptually similar covariate shift term.

Exercise 31.3  

Derive a formula for \(\hat\Delta_{\text{raw}}\) as the sum of \(\hat\Delta_0\) and a covariate shift term. Show your work.


Note 31.2: A correction.

In a previously posted version of this assignment, Exercise 31.3 asked you to do the same with \(\hat\Delta_{\text{all}}\) in place of \(\hat\Delta_0\). That’s a bad exercise because there isn’t a two-term decomposition of \(\hat\Delta_{\text{raw}}\) as the sum of \(\hat\Delta_{\text{all}}\) and a single covariate shift term. You can, however, get a useful three-term decomposition.

To do this, start with your formula for \(\hat\Delta_{\text{raw}}\) in terms of \(\hat\Delta_1\). Then think of the difference \(\hat\Delta_1 - \hat\Delta_{\text{all}}\) as another covariate shift term—only this time, the function we’re averaging over our two histograms is \(\hat\mu(1,x) - \hat\mu(0,x)\) instead of \(\hat\mu(0,x)\).

\[ \begin{aligned} \hat\Delta_{\text{raw}} &= \hat\Delta_1 + (\hat\Delta_{\text{raw}} - \hat\Delta_1) \\ &= \hat\Delta_{\text{all}} + \underset{\text{new covariate shift term}}{\qty(\hat\Delta_{1} - \hat\Delta_{\text{all}})} + \underset{\text{$\hat\Delta_1$'s covariate shift term}}{(\hat\Delta_{\text{raw}} - \hat\Delta_1)} \end{aligned} \]

I’ve added an Extra Credit version of this exercise, Exercise 31.7, on deriving and using this three-term decomposition.

Now that we have these new decompositions, let’s use them.

Exercise 31.4  

Forget the calculations you did in Exercise 31.1 for a moment. Explain how, using only Figure 31.1 and Figure 31.2, you could predict whether \(\hat\Delta_{\text{raw}}\) or \(\hat\Delta_0\) would be larger. Do the same for \(\hat\Delta_{\text{raw}}\) and \(\hat\Delta_{\text{all}}\).

Are your answers here consistent with what you calculated in Exercise 31.1?


Covariate shift isn’t just the phenomenon behind the difference between raw and adjusted differences. It shows up wherever we’re comparing averages over different groups. One example is comparing different adjusted differences, e.g. \(\hat \Delta_1\) and \(\hat \Delta_0\). The following exercises will explore that. These are extra credit exercises, but I encourage you to look them over and think about them at least for a minute or two. There’s a good chance that, having made it this far, you’ll find them pretty easy.

Exercise 31.5  

Extra Credit. Explain how, using only Figure 31.1 and Figure 31.2, you could predict whether \(\hat\Delta_{1}\) or \(\hat\Delta_0\) would be larger.

Hint. Compare the ‘histogram form’ formulas for \(\hat\Delta_1\) and \(\hat\Delta_0\).


Exercise 31.6  

Extra Credit.

  1. Write the simplest formula you can expressing \(\hat\Delta_{\text{all}}\) in terms of \(\hat\Delta_0\) and \(\hat\Delta_1\) and other numbers you can compute from the sample. It should be very simple.
  2. Referring to the formula, explain why, if you know which of \(\hat\Delta_0\) or \(\hat\Delta_1\) is larger, you also know whether \(\hat\Delta_0\) or \(\hat\Delta_{\text{all}}\) is larger and whether \(\hat\Delta_1\) or \(\hat\Delta_{\text{all}}\) is larger.
  3. Explain why in visual terms, too.

Hint. How does the purple histogram in Figure 31.1 relate to the green and red ones? This is not just relevant to the in-visual-terms part; if you don’t already have a formula, what you see may guide you toward one.


Exercise 31.7  

Extra Credit. Write out the three-term decomposition for \(\hat\Delta_{\text{raw}}\) described above in Note 31.2. And, with reference to the California income data shown in Figure 31.1 and Figure 31.2, explain how to use this to see whether \(\hat\Delta_{\text{raw}}\) or \(\hat\Delta_{\text{all}}\) is larger.
