Summary
In this assignment, we’re going to review a few things that people mixed up on the exam and work on visualization and computation for complex summaries of data with multiple and multivalued covariates.
Review
In Figure 30.1, I’ve drawn two histograms. The first shows a population \(y_1 \ldots y_m\) with mean \(\mu = \frac1m \sum_{j=1}^m y_j\) and variance \(\sigma^2 = \frac1m \sum_{j=1}^m (y_j - \mu)^2\). The second shows a sample \(Y_1 \ldots Y_n\) drawn with replacement from this population—the sample that was used in Problem 1 of Midterm 1.
In Figure 30.2, I’ve drawn the bootstrap sampling distributions of three estimators of the population mean.
- \(\hat\mu = \bar{Y}/2\), where \(\bar{Y} = \frac1n \sum_{i=1}^n Y_i\).
- \(\hat\mu = 1 + \bar{Y}/2\).
- \(\hat\mu = Y_2\).
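If it helps to see how plots like these get made, here is a minimal sketch of computing bootstrap sampling distributions for the three estimators listed above. The sample `Y` below is a made-up stand-in, not the midterm sample, and the helper name `bootstrap_distribution` and the number of draws are my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the sample Y_1 ... Y_n (NOT the midterm data).
Y = rng.integers(0, 10, size=50).astype(float)

# The three estimators from the list above, as functions of a sample.
estimators = {
    "Ybar/2": lambda y: y.mean() / 2,
    "1 + Ybar/2": lambda y: 1 + y.mean() / 2,
    "Y_2": lambda y: y[1],  # the second observation (index 1 in Python)
}

def bootstrap_distribution(estimator, sample, draws=2000):
    """Evaluate the estimator on `draws` resamples of the sample.

    Each resample has the same size as the sample and is drawn
    from it with replacement.
    """
    size = len(sample)
    return np.array([
        estimator(rng.choice(sample, size=size, replace=True))
        for _ in range(draws)
    ])

# One array of bootstrap draws per estimator; a histogram of each array
# is a plot of that estimator's bootstrap sampling distribution.
boot = {name: bootstrap_distribution(f, Y) for name, f in estimators.items()}

for name, values in boot.items():
    print(f"{name:>12}: center {values.mean():.2f}, spread {values.std():.2f}")
```

The center and spread printed at the end are the two features of each histogram worth comparing when you match estimators to plots.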
Exercise 30.1
Match each estimator to the plot of its bootstrap sampling distribution. That is, say something like 1:A, 2:B, 3:C. But not exactly that.
Exercise 30.2
For each of these estimators, say whether it is unbiased and whether it is consistent.
Tip. Each part of this is a Yes or No question. There are estimators that are unbiased or consistent for some populations and not for others, so if I were asking about a population I’d told you nothing concrete about, the answer might be maybe. But I’m asking about the specific population plotted above and the plot does give you enough information to determine the answer.
Exercise 30.3
For any estimator you described as biased or inconsistent, describe what you see in the plots above that suggests this is the case.
Exercise 30.4
For each of the three estimators, do the following.
- Write a formula for its standard deviation. Your formula may involve the sample size \(n\) and some summaries of the population, e.g. \(m\), \(\mu\), or \(\sigma\).
- Using information shown in Figure 30.1, calculate it approximately and report the number.
- Check that your answer looks about right using Figure 30.2 and describe how you checked.
Complex Summaries and Covariate Shift
In this section, we’ll work with a subset of the California income data considered in lecture. We’ll look at incomes of CA residents who responded to the Current Population Survey in 2022, were in the age range 25-25, and graduated from high school. I’ve plotted this data below. As in lecture:
- \(W_i=0\) for male residents and \(W_i=1\) for female residents.
- \(X_i\) is years of schooling.
- \(Y_i\) is income in dollars.
| \(w\) | \(x\) (years of schooling) | \(\hat\mu(w,x)\) (income) | count |
|---|---|---|---|
| 0 | 12 | 35k | 353 |
| 0 | 13 | 41k | 179 |
| 0 | 14 | 50k | 110 |
| 0 | 16 | 70k | 311 |
| 0 | 18 | 100k | 107 |
| 0 | 20 | 110k | 35 |
| 1 | 12 | 21k | 252 |
| 1 | 13 | 27k | 178 |
| 1 | 14 | 30k | 110 |
| 1 | 16 | 58k | 329 |
| 1 | 18 | 77k | 140 |
| 1 | 20 | 107k | 36 |
We’ll consider the following four estimators of the difference in income between the male and female CA residents in our sample.
\[
\color{gray}
\begin{aligned}
\hat \Delta_{\text{raw}} &= \frac{1}{N_1}\sum_{i: W_i=1} Y_i - \frac{1}{N_0}\sum_{i: W_i=0} Y_i \\
\hat\Delta_0 &=\frac{1}{N_0}\sum_{i: W_i=0} \qty{\hat\mu(1,X_i) - \hat \mu(0, X_i)} \\
\hat \Delta_1 &= \frac{1}{N_1}\sum_{i: W_i=1} \qty{\hat\mu(1,X_i) - \hat \mu(0, X_i)} \\
\hat \Delta_{\text{all}} &= \frac{1}{n} \sum_{i=1}^n \qty{\hat\mu(1,X_i) - \hat \mu(0, X_i)}
\end{aligned}
\]
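Here is a minimal sketch of how these four formulas turn into arithmetic when all you have is a table of cell summaries. The numbers are made up, not the California data, and I'm assuming that \(\hat\mu(w,x)\) is the mean of \(Y_i\) among the people with \(W_i=w\) and \(X_i=x\), so each group's raw mean is a count-weighted average of its cell means. The names `cells`, `group_mean`, and `adjusted_difference` are my own.

```python
# Hypothetical cell summaries, NOT the California data:
# (w, x) -> (mu_hat(w, x), number of people with W = w and X = x).
cells = {
    (0, 12): (30.0, 3),
    (0, 16): (60.0, 1),
    (1, 12): (20.0, 1),
    (1, 16): (54.0, 3),
}
xs = sorted({x for (_, x) in cells})

def mu_hat(w, x):
    return cells[(w, x)][0]

def count(w, x):
    return cells[(w, x)][1]

# Group sizes N_0 and N_1.
N = {w: sum(count(w, x) for x in xs) for w in (0, 1)}

def group_mean(w):
    """Mean of Y in group w, assuming each mu_hat(w, x) is that cell's mean of Y."""
    return sum(count(w, x) * mu_hat(w, x) for x in xs) / N[w]

def adjusted_difference(weights):
    """Average of mu_hat(1, x) - mu_hat(0, x), weighting each x by weights[x]."""
    total = sum(weights.values())
    return sum(weights[x] * (mu_hat(1, x) - mu_hat(0, x)) for x in xs) / total

delta_raw = group_mean(1) - group_mean(0)
delta_0 = adjusted_difference({x: count(0, x) for x in xs})
delta_1 = adjusted_difference({x: count(1, x) for x in xs})
delta_all = adjusted_difference({x: count(0, x) + count(1, x) for x in xs})

print(delta_raw, delta_0, delta_1, delta_all)  # 8.0 -9.0 -7.0 -8.0
```

The three adjusted differences average the same within-\(x\) differences and differ only in which group's counts are used as weights.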
Exercise 31.1
Using the data in the table above, calculate each of these four estimators.
In Lecture 9, we decomposed the raw difference \(\hat\Delta_{\text{raw}}\) as a sum of the adjusted difference \(\hat\Delta_1\) and a covariate shift term. And we talked about how to use plots like the ones above to understand what the covariate shift term would look like. In this next exercise, we’ll get in some practice interpreting plots like these. We’ll work with three samples, described below.
Income in LA vs SFBay
- \(W_i\) is an indicator for county.
  - 0 for residents of Los Angeles County
  - 1 for residents of San Francisco and Alameda Counties
- \(X_i\) is education in years of schooling.
- \(Y_i\) is income in dollars.
Unemployment in SD vs OC
- \(W_i\) is an indicator for county.
  - 0 for residents of San Diego
  - 1 for residents of Orange County
- \(X_i\) is education in years of schooling.
- \(Y_i\) is unemployment status: 0 for employed and 1 for unemployed.
Primary Turnout in MI
- \(W_i\) is an indicator for voting in the 2004 primary.
  - 1 if they voted in the last primary
  - 0 if they didn’t
- \(X_i\) is birth-year.
- \(Y_i\) is an indicator for voting in the 2006 primary.
  - 1 if they voted
  - 0 if they didn’t
Exercise 31.2
For each of the three samples, answer these two questions. Each answer is just "larger" or "smaller". No need to explain.
- Is the raw difference \(\hat\Delta_{\text{raw}}\) larger or smaller than the adjusted difference \(\hat\Delta_1\)?
- Is the magnitude of the raw difference, \(| \hat\Delta_{\text{raw}}|\), larger or smaller than the magnitude of the adjusted difference, \(|\hat\Delta_1|\)?
Tip. The second question is a bit hard. You should be able to use the plot to know the sign (+/-) of the adjusted difference and covariate shift term. If the signs are the same, you know the magnitude of the raw difference is larger. But if the signs are opposite, it’s a bit more subtle. What happens depends on the relative magnitudes of the adjusted difference and covariate shift term. Usually, the magnitude of the covariate shift is less than twice the magnitude of the adjusted difference, which makes the raw difference smaller in magnitude than the adjusted one. But if the magnitude of the covariate shift term is more than twice the magnitude of the adjusted difference, we have an extreme case of Simpson’s paradox: the raw difference is larger in magnitude but opposite in sign to the adjusted difference. For the purpose of this assignment, don’t worry about this happening.
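To spell out the arithmetic behind this tip, write \(S\) for the covariate shift term. The symbol \(S\) is just shorthand I'm introducing here. The two cases look like this.

\[
\begin{aligned}
\hat\Delta_{\text{raw}} &= \hat\Delta_1 + S \\
\text{same signs:} \quad |\hat\Delta_{\text{raw}}| &= |\hat\Delta_1| + |S| \ \ge\ |\hat\Delta_1| \\
\text{opposite signs:} \quad |\hat\Delta_{\text{raw}}| &= \bigl|\, |\hat\Delta_1| - |S| \,\bigr| \ <\ |\hat\Delta_1| \quad \text{exactly when } 0 < |S| < 2|\hat\Delta_1|
\end{aligned}
\]

For example, with made-up numbers, if \(\hat\Delta_1 = -10\) and \(S = 4\), then \(\hat\Delta_{\text{raw}} = -6\), which is smaller in magnitude. If instead \(S = 25\), then \(\hat\Delta_{\text{raw}} = 15\), which is larger in magnitude and opposite in sign.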
I wanted this to be an exercise in interpreting plots, so I didn’t give you the data you’d need to calculate these. But, if you’re curious, here they are.
| Income in LA vs SFBay | 39k | 24k |
| Unemployment in SD vs OC | -0.023 | -0.041 |
| Primary Turnout in MI | 0.159 | 0.147 |
We can derive similar decompositions for our other adjusted differences \(\hat\Delta_0\) and \(\hat\Delta_{\text{all}}\). That is, we can decompose \(\hat\Delta_{\text{raw}}\) as the sum of \(\hat\Delta_0\) or \(\hat\Delta_{\text{all}}\) and a different, but conceptually similar, covariate shift term.
Exercise 31.3
Derive a formula for \(\hat\Delta_{\text{raw}}\) as the sum of \(\hat\Delta_0\) and a covariate shift term. Show your work.
In a previously posted version of this assignment, Exercise 31.3 asked you to do the same with \(\hat\Delta_{\text{all}}\) in place of \(\hat\Delta_0\). That’s a bad exercise because there isn’t a two-term decomposition of \(\hat\Delta_{\text{raw}}\) as the sum of \(\hat\Delta_{\text{all}}\) and a single covariate shift term. You can, however, get a useful three-term decomposition.
To do this, start with your formula for \(\hat\Delta_{\text{raw}}\) in terms of \(\hat\Delta_1\). Then think of the difference \(\hat\Delta_1 - \hat\Delta_{\text{all}}\) as another covariate shift term—only this time, the function we’re averaging over our two histograms is \(\hat\mu(1,x) - \hat\mu(0,x)\) instead of \(\hat\mu(0,x)\).
\[
\begin{aligned}
\hat\Delta_{\text{raw}}
&= \hat\Delta_1 + (\hat\Delta_{\text{raw}} - \hat\Delta_1) \\
&= \hat\Delta_{\text{all}}
+ \underset{\text{new covariate shift term}}{\qty(\hat\Delta_{1} - \hat\Delta_{\text{all}})}
+ \underset{\text{$\hat\Delta_1$'s covariate shift term}}{(\hat\Delta_{\text{raw}} - \hat\Delta_1)}
\end{aligned}
\]
I’ve added an Extra Credit version of this exercise, Exercise 31.7, on deriving and using this three-term decomposition.
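As a sanity check, here is a minimal sketch that verifies the three-term decomposition numerically. It uses made-up cell summaries, not the California data, and assumes \(\hat\mu(w,x)\) is the mean of \(Y_i\) in each cell. Each covariate shift term is computed as a difference between averages of one function over two histograms. The helper names `histogram` and `average` are my own.

```python
# Hypothetical cell summaries, NOT the California data:
# (w, x) -> (mu_hat(w, x), number of people with W = w and X = x).
cells = {
    (0, 12): (30.0, 3),
    (0, 16): (60.0, 1),
    (1, 12): (20.0, 1),
    (1, 16): (54.0, 3),
}
xs = sorted({x for (_, x) in cells})

def mu_hat(w, x):
    return cells[(w, x)][0]

def histogram(groups):
    """Number of people at each x, pooled over the listed groups."""
    return {x: sum(cells[(w, x)][1] for w in groups) for x in xs}

def average(f, hist):
    """Average of f(x) over the people counted in hist."""
    return sum(hist[x] * f(x) for x in xs) / sum(hist.values())

hist_0, hist_1, hist_all = histogram([0]), histogram([1]), histogram([0, 1])

def mu0(x):
    return mu_hat(0, x)

def mu1(x):
    return mu_hat(1, x)

def gap(x):
    return mu_hat(1, x) - mu_hat(0, x)

# Raw difference, assuming mu_hat(w, x) is the cell mean of Y.
delta_raw = average(mu1, hist_1) - average(mu0, hist_0)
delta_1 = average(gap, hist_1)
delta_all = average(gap, hist_all)

# Delta_1's covariate shift term: mu_hat(0, x) averaged over two histograms.
shift_1 = average(mu0, hist_1) - average(mu0, hist_0)
# New covariate shift term: the gap averaged over two histograms.
shift_new = average(gap, hist_1) - average(gap, hist_all)

print(delta_raw, delta_all + shift_new + shift_1)  # 8.0 8.0
```

In this toy example the new covariate shift term is small because the gap \(\hat\mu(1,x) - \hat\mu(0,x)\) barely changes with \(x\), while \(\hat\Delta_1\)'s term is large because \(\hat\mu(0,x)\) changes a lot with \(x\).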
Now that we have these new decompositions, let’s use them.
Exercise 31.4
Forget the calculations you did in Exercise 31.1 for a moment. Explain how, using only Figure 31.1 and Figure 31.2, you could predict whether \(\hat\Delta_{\text{raw}}\) or \(\hat\Delta_0\) would be larger. Do the same for \(\hat\Delta_{\text{raw}}\) and \(\hat\Delta_{\text{all}}\).
Are your answers here consistent with what you calculated in Exercise 31.1?
Covariate shift isn’t just the phenomenon behind the difference between raw and adjusted differences. It shows up wherever we’re comparing averages over different groups. One example is comparing different adjusted differences, e.g. \(\hat \Delta_1\) and \(\hat \Delta_0\). The following exercises will explore that. These are extra credit exercises, but I encourage you to look them over and think about them at least for a minute or two. There’s a good chance that, having made it this far, you’ll find them pretty easy.
Exercise 31.5
Extra Credit. Explain how, using only Figure 31.1 and Figure 31.2, you could predict whether \(\hat\Delta_{1}\) or \(\hat\Delta_0\) would be larger.
Hint. Compare the ‘histogram form’ formulas for \(\hat\Delta_1\) and \(\hat\Delta_0\).
Exercise 31.6
Extra Credit.
- Write the simplest formula you can expressing \(\hat\Delta_{\text{all}}\) in terms of \(\hat\Delta_0\) and \(\hat\Delta_1\) and other numbers you can compute from the sample. It should be very simple.
- Referring to the formula, explain why, if you know which of \(\hat\Delta_0\) or \(\hat\Delta_1\) is larger, you also know whether \(\hat\Delta_0\) or \(\hat\Delta_{\text{all}}\) is larger and whether \(\hat\Delta_1\) or \(\hat\Delta_{\text{all}}\) is larger.
- Explain why in visual terms, too.
Hint. How does the purple histogram in Figure 31.1 relate to the green and red ones? This is not just relevant to the in-visual-terms part; if you don’t already have a formula, what you see may guide you toward one.
Exercise 31.7
Extra Credit. Write out the three-term decomposition for \(\hat\Delta_{\text{raw}}\) described above in Note 31.2. And, with reference to the California income data shown in Figure 31.1 and Figure 31.2, explain how to use this to see whether \(\hat\Delta_{\text{raw}}\) or \(\hat\Delta_{\text{all}}\) is larger.