18  Comparing Two Groups

Examples and Introduction

In Part 1, we estimated a single population mean using a sample mean. Now we compare means between two groups. This is a fundamental statistical task: estimating the difference in means between subpopulations.

Black vs. non-Black Voters in GA

\[ \small{ \begin{array}{r|rr|rr|r|rr|rrrrr} \text{call} & 1 & & 2 & & \dots & 625 & & & & & & \\ \text{question} & W_1 & Y_1 & W_2 & Y_2 & \dots & W_{625} & Y_{625} & \overline{W}_{625} & \overline{Y}_{625} &\frac{\sum_{i:W_i=0} Y_i}{\sum_{i:W_i=0} 1} & \frac{\sum_{i:W_i=1} Y_i}{\sum_{i:W_i=1} 1} & \text{difference} \\ \text{outcome} & \underset{\textcolor{gray}{x_{869369}}}{\textcolor[RGB]{248,118,109}{0}} & \underset{\textcolor{gray}{y_{869369}}}{\textcolor[RGB]{248,118,109}{1}} & \underset{\textcolor{gray}{x_{4428455}}}{\textcolor[RGB]{0,191,196}{1}} & \underset{\textcolor{gray}{y_{4428455}}}{\textcolor[RGB]{0,191,196}{1}} & \dots & \underset{\textcolor{gray}{x_{1268868}}}{\textcolor[RGB]{248,118,109}{0}} & \underset{\textcolor{gray}{y_{1268868}}}{\textcolor[RGB]{248,118,109}{1}} & 0.28 & 0.68 & \textcolor[RGB]{248,118,109}{0.68} & \textcolor[RGB]{0,191,196}{0.69} & 0.01 \\ \end{array} } \]

Recipients of Threatening Letters vs. Others in MI

Figure 19.1
  • The green dots are people who received a letter pressuring them to vote. The letter itself is shown in Figure 24.1 in the appendix.
  • The red dots are people who didn’t.

Residents With vs. Without 4-Year Degrees in CA

income education county
1 $55k 13 orange
2 $25k 13 LA
3 $44k 16 san joaquin
4 $22k 14 orange
⋮
2271 $150k 16 stanislaus
  • In these examples, our sample was drawn with replacement from some population. Or we’re pretending it was.
  • In this one, it’s the population of all California residents between 25 and 35 with at least an 8th grade education.

The locations shown on the map are made up. They’re not the actual locations of the people in the sample.
The survey includes some location information for some people, but as you can see in the table, not for everyone.

Population

income education county
1 $22k 18 unknown
2 $0k 16 solano
3 $98k 16 LA
⋮
5677500 $116k 18 unknown

Sample

income education county
1 $55k 13 orange
2 $25k 13 LA
3 $44k 16 san joaquin
⋮
2271 $150k 16 stanislaus

For illustration, I’ve made up a fake population that looks like a bigger version of the sample.

Our Estimator and Target

Estimation Target

\[ \begin{aligned} \mu(1) - \mu(0) \qfor \mu(x) &= \frac{1}{m_x}\sum_{j:x_j = x } y_j \\ \qqtext{ where } \quad m_x &= \sum_{j:x_j=x} 1 \end{aligned} \]

Estimator

\[ \begin{aligned} \hat\mu(1) - \hat\mu(0) \qfor \hat\mu(x) &= \frac{1}{N_x}\sum_{i:W_i=x} Y_i \\ \qqtext{ where } \quad N_x &= \sum_{i:W_i=x} 1. \end{aligned} \]

Is This Estimator Good?

[Figure: two panels, ‘An Unbiased Estimator’s Sampling Distribution’ and ‘A Biased Estimator’s Sampling Distribution’.]
  • We want to know about our estimator’s …
    • bias. That’s the location of the sampling distribution relative to the estimation target.
    • standard deviation. Roughly speaking, that determines the width of our 95% confidence intervals.

Two Ways to Get Two Groups

Before we answer that question, we need to think about how we got our two groups. There are (at least) two different sampling schemes that give us data like this.

  1. Stratified Sampling. We decide in advance how many people to sample from each group. We sample \(n_0\) people from the no-degree subpopulation and \(n_1\) people from the degree subpopulation, independently.

  2. Ordinary Sampling. We sample \(n\) people from the whole population. We get whoever we get. The number of people in each group, \(N_0\) and \(N_1\), is random.

These are different. In stratified sampling, the group sizes \(n_0\) and \(n_1\) are fixed numbers we choose. In ordinary sampling, the group sizes \(N_0\) and \(N_1\) are random variables—we don’t know what they’ll be until we see our sample.

This matters for our analysis. Stratified sampling is simpler because we’re essentially doing two separate one-sample problems and combining them. Ordinary sampling is more complicated because we have to deal with the randomness in the group sizes.

We’ll start with stratified sampling.

Stratified Sampling

The Setup

Suppose we have two subpopulations—people without and with 4-year degrees. We sample independently from each:

  • From the no-degree subpopulation: draw \(n_0\) people with replacement.
  • From the degree subpopulation: draw \(n_1\) people with replacement.

Our data looks like this: \[ \begin{aligned} &\text{From subpopulation 0:} && Y_{0,1} \ldots Y_{0,n_0} && \text{sampled uniformly-at-random from the $m_0$ values $y_j$ with $x_j=0$} \\ &\text{From subpopulation 1:} && Y_{1,1} \ldots Y_{1,n_1} && \text{sampled uniformly-at-random from the $m_1$ values $y_j$ with $x_j=1$} \end{aligned} \]

The key feature: \(n_0\) and \(n_1\) are fixed numbers that we chose in advance.

Target and Estimator

Our target is still the difference in subpopulation means. \[ \mu(1) - \mu(0) \qfor \mu(x) = \frac{1}{m_x}\sum_{j:x_j=x} y_j \]

Our estimator is the difference in subsample means. \[ \hat\mu(1) - \hat\mu(0) \qfor \hat\mu(x) = \frac{1}{n_x}\sum_{i=1}^{n_x} Y_{x,i} \]

Notice the notation is a bit different. We’re writing \(Y_{x,i}\) for the \(i\)th observation from subpopulation \(x\), and the denominator is \(n_x\), a fixed number we chose.

Unbiasedness

This is immediate from Part 1. Each subsample mean is an unbiased estimator of its subpopulation mean. \[ \mathop{\mathrm{E}}[\hat\mu(x)] = \mu(x) \]

Why? Because \(\hat\mu(x)\) is just a sample mean of \(n_x\) observations drawn uniformly-at-random from a subpopulation with mean \(\mu(x)\). That’s exactly the setup from Part 1.

And by linearity: \[ \mathop{\mathrm{E}}[\hat\mu(1) - \hat\mu(0)] = \mathop{\mathrm{E}}[\hat\mu(1)] - \mathop{\mathrm{E}}[\hat\mu(0)] = \mu(1) - \mu(0) \]

Variance

The variance is equally straightforward. From Part 1, we know: \[ \mathop{\mathrm{\mathop{\mathrm{V}}}}[\hat\mu(x)] = \frac{\sigma^2(x)}{n_x} \] where \(\sigma^2(x)\) is the variance of the subpopulation with \(x_j=x\).

For the difference, we use the fact that the two subsample means are independent—we sampled from the two subpopulations separately. When random variables are independent, the variance of their difference is the sum of their variances: \[ \mathop{\mathrm{\mathop{\mathrm{V}}}}[\hat\mu(1) - \hat\mu(0)] = \mathop{\mathrm{\mathop{\mathrm{V}}}}[\hat\mu(1)] + \mathop{\mathrm{\mathop{\mathrm{V}}}}[\hat\mu(0)] = \frac{\sigma^2(1)}{n_1} + \frac{\sigma^2(0)}{n_0} \]

That’s it. No conditioning tricks, no iterated expectations. Just two separate one-sample problems combined.

Estimation and Confidence Intervals

We estimate the variance by plugging in subsample variances: \[ \widehat{\mathop{\mathrm{\mathop{\mathrm{V}}}}}[\hat\mu(1) - \hat\mu(0)] = \frac{\hat\sigma^2(1)}{n_1} + \frac{\hat\sigma^2(0)}{n_0} \]

And our confidence interval is: \[ \hat\mu(1) - \hat\mu(0) \pm 1.96 \sqrt{\frac{\hat\sigma^2(1)}{n_1} + \frac{\hat\sigma^2(0)}{n_0}} \]
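Here’s a minimal R sketch of this interval. The function name and the vectors y0 and y1 (outcomes sampled from the two subpopulations) are mine, not from the text; note also that R’s var() divides by \(n_x - 1\) rather than \(n_x\), a negligible difference at reasonable sample sizes.

# 95% interval for mu(1) - mu(0) under stratified sampling.
# y0, y1: outcomes sampled from the no-degree and degree subpopulations.
stratified.interval = function(y0, y1) {
  estimate = mean(y1) - mean(y0)
  se = sqrt(var(y1)/length(y1) + var(y0)/length(y0))  # plug-in variance formula
  estimate + c(-1, 1) * 1.96 * se
}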

Summary

Stratified sampling is simple because:

  1. Group sizes \(n_0\) and \(n_1\) are fixed, not random.
  2. The two subsamples are independent.
  3. Everything reduces to two separate applications of Part 1’s results.

But stratified sampling requires us to decide in advance how many people to sample from each group. That’s not always possible or practical. Often we just sample from the whole population and see who we get. That’s ordinary sampling, and it’s more complicated.

The Forgetting Heuristic

What if our population has more than two groups? Suppose instead of just degree vs. no-degree, we had data on years of education: 8, 9, 10, …, 16, 18, 20. That’s many groups. But often we want to compare just two of them.

Here’s the key insight: you can “forget” the other groups exist. If you want to compare people with 16 years of education (4-year degree) to people with 12 years (high school diploma), just ignore everyone else. Extract the two groups you care about and apply exactly what we’ve developed here.

Why does this work? Because the math we did for comparing two groups never assumed they were the only two groups. We proved:

  • The difference in subsample means is unbiased for the difference in subpopulation means.
  • The variance of this difference is \(\frac{\sigma^2(1)}{n_1} + \frac{\sigma^2(0)}{n_0}\).

None of this depended on there being no other groups. The existence of people with 10 or 14 years of education doesn’t affect the sampling distribution of the comparison between the 12-year and 16-year groups. You can always “coarsen” a many-group comparison into a two-group one by ignoring the groups you’re not interested in.

This is a useful heuristic. When faced with a complex comparison involving many groups, ask yourself: can I reframe this as a two-group comparison by “forgetting” some groups? If so, everything from this lecture applies directly.
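Concretely, in R, with a hypothetical data frame ca holding the survey’s income and education columns (the name and columns are assumptions for illustration):

# 'forget' every group except 12 and 16 years of education
hs = ca$income[ca$education == 12]    # high-school group
ba = ca$income[ca$education == 16]    # 4-year-degree group
(mean(ba) - mean(hs)) + c(-1, 1) * 1.96 * sqrt(var(ba)/length(ba) + var(hs)/length(hs))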

Ordinary Sampling

Now suppose we sample \(n\) people from the whole population—a population where each person has both a covariate \(x_j\) (degree status) and an outcome \(y_j\) (income). We get whoever we get.

The data looks like this: \[ (W_1, Y_1) \ldots (W_n, Y_n) \] Each pair \((W_i, Y_i)\) is drawn uniformly-at-random from the population of pairs \((x_1, y_1) \ldots (x_m, y_m)\).

The group sizes are now random: \[ N_x = \sum_{i=1}^n 1_{=x}(W_i) \] This is the number of sampled people with \(W_i=x\). It’s a random variable, not a fixed number.
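A quick simulation sketch of this randomness, using a made-up population in which 42% of people have degrees:

# group sizes vary from sample to sample under ordinary sampling
x = rbinom(10000, 1, 0.42)                            # made-up degree indicators
N1 = replicate(5, sum(sample(x, 445, replace=TRUE) == 1))
N1   # five different values near 0.42 * 445 = 187, none fixed in advance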

This randomness complicates everything. To handle it, we need some probability tools.

Probability Review

Population Means and Variances as Expectations

  • If a random variable \(Y\) is equally likely to be any member of a population \(y_1 \ldots y_m\), …
  • … then its expectation is the population mean \(\mu=\frac{1}{m}\sum_{j=1}^m y_j\).
  • … and its variance is the population variance \(\sigma^2 = \frac{1}{m}\sum_{j=1}^m (y_j - \mu)^2\).
  • Terminology. When each of our observations \(Y_i\) is like this, we’re sampling uniformly-at-random.

\[ \begin{aligned} \mathop{\mathrm{E}}[Y] &= \sum_{j=1}^m \underset{\text{response}}{y_j} \times \underset{\text{probability}}{\frac{1}{m}} = \frac{1}{m}\sum_{j=1}^m y_j = \mu \\ \mathop{\mathrm{E}}[(Y-\mathop{\mathrm{E}}[Y])^2] &= \sum_{j=1}^m \underset{\text{deviation}^2}{(y_j - \mu)^2} \times \underset{\text{probability}}{\frac{1}{m}} = \frac{1}{m}\sum_{j=1}^m (y_j - \mu)^2 = \sigma^2 \end{aligned} \]

Subpopulations and Conditioning

  • Conditioning is, in effect, a way of thinking about sampling as a two-stage process.
    • First, we choose the color of our dot, i.e., the value of \(W_i\), according to its frequency in the population.
    • Then, we choose a specific one of those dots, i.e. \(J_i\), from those with that color—with equal probability.
  • Because this is just a way of thinking, each person still gets chosen with probability \(1/m\).

\[ \begin{aligned} P(J_i=j) = \begin{cases} \frac{m_{green}}{m} \ \times \frac{1}{m_{green}} \ = \ \frac{1}{m} & \text{if the $j$th dot is green ($x_j=1$) } \\ \frac{m_{red}}{m} \ \times \frac{1}{m_{red}} \ = \ \frac{1}{m} & \text{otherwise} \end{cases} \end{aligned} \]

  • The first stage is ‘a coin flip’ that decides whether we roll the ‘red die’ or the ‘green die’ in the second.
  • The Conditional Probability of \(Y_i\) is the probability resulting from the second stage.
    • \(P(Y_i=y \mid W_i=1)\) is the probability distribution of \(Y_i\) when we’re ‘rolling the green die’.
    • \(P(Y_i=y \mid W_i=0)\) is the probability distribution of \(Y_i\) when we’re ‘rolling the red die’.
  • The Conditional Expectation of \(Y_i\) is the ‘second stage expected value’ in the same sense.
    • \(\mathop{\mathrm{E}}[Y_i \mid W_i=1]\) is the expected value of \(Y_i\) when we’re ‘rolling the green die’.
    • It’s the mean value \(\mu(1)\) of \(y_j\) in the subpopulation drawn as little green dots.
    • \(\mathop{\mathrm{E}}[Y_i \mid W_i=0]\) is the expected value of \(Y_i\) when we’re ‘rolling the red die’.
    • It’s the mean value \(\mu(0)\) of \(y_j\) in the subpopulation drawn as little red dots.
  • The Conditional Variance of \(Y_i\) is like this too. It’s the ‘second stage variance.’
    • \(\mathop{\mathrm{\mathop{\mathrm{V}}}}[Y_i \mid W_i=1]\) is the variance of \(Y_i\) when we’re ‘rolling the green die’.
    • It’s the variance \(\sigma^2(1)\) of \(y_j\) in the subpopulation drawn as little green dots.
    • \(\mathop{\mathrm{\mathop{\mathrm{V}}}}[Y_i \mid W_i=0]\) is the variance of \(Y_i\) when we’re ‘rolling the red die’.
    • It’s the variance \(\sigma^2(0)\) of \(y_j\) in the subpopulation drawn as little red dots.

About Conditional Expectations and Variances

  1. The conditional expectation function \(\mu(x)=E[Y \mid W=x]\).
    • \(\mu\) is a function; evaluated at \(x\), it’s a number. It’s not random.
    • It’s the mean of the subpopulation of people with \(W=x\).
  2. The conditional expectation \(\mu(W)=E[Y \mid W]\).
    • \(\mu(W)\) is the mean of a random subpopulation of people.
    • It’s the conditional expectation function evaluated at the random variable \(W\).
  3. The conditional variance function \(\sigma^2(x)=E[\{Y - \mu(x)\}^2 \mid W=x]\).
    • \(\sigma^2\) is a function; evaluated at \(x\), it’s a number. It’s not random.
    • It’s the variance of the subpopulation of people with \(W=x\).
  4. The conditional variance \(\sigma^2(W)=E[\{Y - \mu(W)\}^2 \mid W]\).
    • \(\sigma^2(W)\) is the variance of a random subpopulation of people.
    • It’s the conditional variance function evaluated at the random variable \(W\).
  • This is like what happens with conditional expectations because it is one.
  • It’s the conditional expectation of some random variable. What random variable?
  • It’s the conditional expectation of the square of \(\textcolor[RGB]{0,0,255}{Y-\mu(W)}\), a conditionally centered version of \(Y\).

\[ \mathop{\mathrm{\mathop{\mathrm{V}}}}[Y \mid W] = \mathop{\mathrm{E}}[\textcolor[RGB]{0,0,255}{\{Y-\mu(W)\}}^2 \mid W] \qfor \mu(W) = \mathop{\mathrm{E}}[Y \mid W] \]

Working with Expectations

Linearity

\[ \begin{aligned} E ( a Y + b Z ) &= E (aY) + E (bZ) = aE(Y) + bE(Z) \\ &\text{ for random variables $Y, Z$ and numbers $a,b$ } \end{aligned} \]

Application. When we sample uniformly-at-random, the sample mean is unbiased.

\[ \begin{aligned} \mathop{\mathrm{E}}[\hat\mu] &= \mathop{\mathrm{E}}\qty[\frac1n\sum_{i=1}^n Y_i] \\ &= \frac1n\sum_{i=1}^n \mathop{\mathrm{E}}[Y_i] \\ &= \frac1n\sum_{i=1}^n \mu \\ &= \frac1n \times n \times \mu = \mu. \end{aligned} \]

Conditional Version. \[ \begin{aligned} E\{ a(W) Y + b(W) Z \mid W \} &= E\{a(W)Y \mid W\} + E\{ b(W)Z \mid W\} \\ &= a(W)E(Y \mid W) + b(W)E(Z \mid W) \\ & \text{ for random variables $W, Y, Z$ and functions $a,b$ } \end{aligned} \]

Expectations of Products Factor into Products of Expectations

\[ \color{gray} \mathop{\mathrm{E}}[\textcolor[RGB]{239,71,111}{Y}\textcolor[RGB]{17,138,178}{Z}] = \textcolor[RGB]{239,71,111}{\mathop{\mathrm{E}}[Y]}\textcolor[RGB]{17,138,178}{\mathop{\mathrm{E}}[Z]} \qqtext{when $\textcolor[RGB]{239,71,111}{Y}$ and $\textcolor[RGB]{17,138,178}{Z}$ are independent} \]

Application. When we sample with replacement, the sample mean’s variance is the population variance divided by \(n\).

\[ \color{gray} \begin{aligned} \mathop{\mathrm{\mathop{\mathrm{V}}}}[\hat\mu] &= \mathop{\mathrm{E}}\qty[ \qty{ \frac{1}{n}\sum_{i=1}^n Y_i - \mathop{\mathrm{E}}\qty[ \frac{1}{n}\sum_{i=1}^n Y_i ] }^2 ] \\ &= \mathop{\mathrm{E}}\qty[ \qty{ \frac{1}{n}\sum_{i=1}^n (Y_i - \mathop{\mathrm{E}}[Y_i]) }^2 ] \\ &= \mathop{\mathrm{E}}\qty[ \qty{ \frac{1}{n}\sum_{i=1}^n Z_i }^2 ] && \text{ for } \ Z_i = Y_i - \mathop{\mathrm{E}}[Y_i] \\ &= \mathop{\mathrm{E}}\qty[ \qty{ \textcolor[RGB]{239,71,111}{\frac{1}{n}\sum_{i=1}^n Z_i }} \times \qty{\textcolor[RGB]{17,138,178}{\frac{1}{n}\sum_{j=1}^n Z_j}} ] &&\text{ with } \mathop{\mathrm{E}}[ Z_i ] = \mathop{\mathrm{E}}[ Y_i ] - \mathop{\mathrm{E}}[Y_i] = \mu - \mu = 0 \\ &= \mathop{\mathrm{E}}\qty[ \frac{1}{n^2}\textcolor[RGB]{239,71,111}{\sum_{i=1}^n} \textcolor[RGB]{17,138,178}{\sum_{j=1}^n} \textcolor[RGB]{239,71,111}{Z_i} \textcolor[RGB]{17,138,178}{Z_j} ] &&\text{ and } \ \mathop{\mathrm{E}}[ Z_i^2] = \mathop{\mathrm{E}}[ \{ Y_i - \mathop{\mathrm{E}}[Y_i] \}^2 ] = \mathop{\mathrm{\mathop{\mathrm{V}}}}[Y_i] = \sigma^2 \\ &= \frac{1}{n^2} \textcolor[RGB]{239,71,111}{\sum_{i=1}^n} \textcolor[RGB]{17,138,178}{\sum_{j=1}^n} \mathop{\mathrm{E}}\qty[\textcolor[RGB]{239,71,111}{Z_i} \textcolor[RGB]{17,138,178}{Z_j} ] \\ &= \frac{1}{n^2} \textcolor[RGB]{239,71,111}{\sum_{i=1}^n} \textcolor[RGB]{17,138,178}{\sum_{j=1}^n} \begin{cases} \mathop{\mathrm{E}}[Z_i^2]=\sigma^2 & \text{if } j=i \\ \textcolor[RGB]{239,71,111}{\mathop{\mathrm{E}}[Z_i]}\textcolor[RGB]{17,138,178}{\mathop{\mathrm{E}}[Z_j]} = 0 \times 0 & \text{if } j\neq i \end{cases} \\ &= \frac{1}{n^2} \textcolor[RGB]{239,71,111}{\sum_{i=1}^n} \sigma^2 = \frac{1}{n^2} \times n \times \sigma^2 = \frac{\sigma^2}{n} \end{aligned} \]
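Here’s a quick simulation sketch of this application, with a made-up population:

# check V[muhat] = sigma^2 / n by simulation
y = c(rep(0, 3), rep(1, 7))              # made-up population with mu = 0.7
n = 25
muhat = replicate(10000, mean(sample(y, n, replace=TRUE)))
var(muhat)                               # close to ...
mean((y - mean(y))^2) / n                # ... sigma^2 / n = 0.21 / 25 = 0.0084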

The Law of Iterated Expectations

\[ \mathop{\mathrm{E}}[Y] = \mathop{\mathrm{E}}\qty[ \mathop{\mathrm{E}}( Y \mid X ) ] \quad \text{ for any random variables $X, Y$} \]

\(p\) \(W_i\) \(\mathop{\mathrm{E}}[Y_i \mid W_i]\)
\(\frac{3}{6}\) \(0\) \(1\)
\(\frac{3}{6}\) \(1\) \(1.25\)

What is \(\mathop{\mathrm{E}}[Y_i]\)?

\(p\) \(W_i\) \(\mathop{\mathrm{E}}[Y_i \mid W_i]\)
\(0.58\) 0 \(31.22K\)
\(0.42\) 1 \(72.35K\)

What is \(\mathop{\mathrm{E}}[Y_i]\)?
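Applying the law to these tables, \(\mathop{\mathrm{E}}[Y_i]\) is the average of the conditional means, weighted by the probabilities \(p\). For the first table and then the second:

\[ \begin{aligned} \mathop{\mathrm{E}}[Y_i] &= \mathop{\mathrm{E}}\qty[\mathop{\mathrm{E}}(Y_i \mid W_i)] = \frac{3}{6} \times 1 + \frac{3}{6} \times 1.25 = 1.125 \\ \mathop{\mathrm{E}}[Y_i] &= 0.58 \times 31.22K + 0.42 \times 72.35K \approx 48.49K \end{aligned} \]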

Irrelevance of independent conditioning variables

\[ \mathop{\mathrm{E}}[ \textcolor[RGB]{17,138,178}{Y} \mid \textcolor[RGB]{17,138,178}{X}, \textcolor[RGB]{239,71,111}{X'} ] = \textcolor[RGB]{17,138,178}{\mathop{\mathrm{E}}[ Y \mid X ]} \quad \text{ when $X'$ is independent of $X,Y$ } \]

Application. When we sample with replacement, the conditional expectation of \(Y_i\) given \(W_1 \ldots W_n\) is the conditional mean of \(Y_i\) given \(W_i\) alone.

\[ \mathop{\mathrm{E}}[Y_i \mid W_1 \ldots W_n] = \mathop{\mathrm{E}}[Y_i \mid W_i] = \mu(W_i) \qqtext{ when $(W_1, Y_1) \ldots (W_n, Y_n)$ are independent.} \]

The Indicator Trick

\[ 1_{=w}(W)\mu(W) = \begin{cases} 1 \times \mu(W) & \text{ when } W=w \\ 0 \times \mu(W) & \text{ when } W \neq w \end{cases} = \begin{cases} 1 \times \mu(w) & \text{ when } W=w \\ 0 \times \mu(w) & \text{ when } W \neq w \end{cases} = 1_{=w}(W)\mu(w) \]

Application.

\[ \mathop{\mathrm{E}}[1_{=w}(W)\mu(W)] = \mathop{\mathrm{E}}[ 1_{=w}(W) \mu(w) ] = \mu(w) \mathop{\mathrm{E}}[1_{=w}(W)] = \mu(w) \times P(W=w) \]

Unbiasedness

A Sample Mean is Unbiased

Claim. When we sample uniformly-at-random, the sample mean is an unbiased estimator of the population mean. \[ \mathop{\mathrm{E}}[\hat\mu] = \mu \]

We proved this earlier today using linearity of expectations. See Section 22.2.1.

A Subsample Mean is Conditionally Unbiased

Claim. When we sample with replacement, a subsample mean is a conditionally unbiased estimator of the corresponding subpopulation mean as long as there’s at least one person with \(W_i=1\) \[ \mathop{\mathrm{E}}[\hat\mu(1) \mid W_1 \ldots W_n] = \mu(1) \quad \text{ if } \quad \sum_{i=1}^{n} 1_{=1}(W_i) > 0. \]

Use the linearity of conditional expectations to push the conditional expectation into the sum. Then use the irrelevance of independent conditioning variables to write things in terms of the random variable \(\mu(W_i)\). Then the indicator trick. What’s \(1_{=1}(W_i) \mu(W_i)\)? How is it related to \(\mu(1)\)?

\[ \hat\mu(1) = \frac{\sum_{i:W_i=1} Y_i}{\sum_{i:W_i=1} 1} = \frac{\sum_{i=1}^{n} 1_{=1}(W_i) Y_{i}}{\sum_{i=1}^{n} 1_{=1}(W_i)} \]

It’s easy to make mistakes using linearity of expectations when we’re summing over a subsample. Can we or can we not ‘push’ or ‘pull’ an expectation through a sum when the terms in that sum depend on the value of a random variable?

To make this a lot more obvious, we can rewrite these as sums over the whole sample. Instead of ‘excluding’ the terms where \(W_i=0\), we ‘make them zero’ by multiplying in the indicator \(1_{=1}(W_i)\). Then, of course we can distribute expectations through the sum. The question becomes whether we can ‘pull out’ the indicator. And we have a rule for that. We can do it if we’re conditioning on \(W_i\).

When we write a subsample mean this way, we do it in both the numerator (the sum of observations in the subsample) and the denominator (the number of those terms, which is a sum of ones over the subsample).

\[ \begin{aligned} \mathop{\mathrm{E}}[\hat\mu(1) \mid W_1 \ldots W_n] &=\mathop{\mathrm{E}}\qty[\frac{\sum_{i=1}^{n} 1_{=1}(W_i) Y_{i}}{\sum_{i=1}^{n} 1_{=1}(W_i)} \mid W_1 \ldots W_n] \\ &\overset{\texttip{\small{\unicode{x2753}}}{via linearity. All the indicators (and their sum) are functions of $W_1\ldots W_n$.}}{=} \frac{\sum_{i=1}^{n} 1_{=1}(W_i) \mathop{\mathrm{E}}\qty{ Y_{i} \mid W_1 \ldots W_n}}{\sum_{i=1}^{n}1_{=1}(W_i)} \\ &\overset{\texttip{\small{\unicode{x2753}}}{via irrelevance of independent conditioning variables. $(W_i,Y_i)$ is independent of the other pairs.}}{=} \frac{\sum_{i=1}^{n} 1_{=1}(W_i) \mathop{\mathrm{E}}\qty{ Y_{i} \mid W_i}}{\sum_{i=1}^{n}1_{=1}(W_i)} \\ &\overset{\texttip{\small{\unicode{x2753}}}{This is just a change of notation, as $\mu(W_i)=\mathop{\mathrm{E}}[Y_i \mid W_i]$, but it makes it easier to understand what happens next.}}{=} \frac{\sum_{i=1}^{n} 1_{=1}(W_i) \mu(W_i)}{\sum_{i=1}^{n}1_{=1}(W_i)} \\ &\overset{\texttip{\small{\unicode{x2753}}}{via the indicator trick.}}{=}\frac{\sum_{i=1}^{n}1_{=1}(W_i) \mu(1)}{\sum_{i=1}^{n}1_{=1}(W_i)} \\ &\overset{\texttip{\small{\unicode{x2753}}}{via linearity}}{=} \mu(1) \ \frac{\sum_{i=1}^{n} 1_{=1}(W_i) }{\sum_{i=1}^{n}1_{=1}(W_i)} = \mu(1) \end{aligned} \]

Differences in Subsample Means

Claim. A difference in subsample means is conditionally unbiased for the corresponding difference in subpopulation means as long as the sample includes at least one person in each group.

\[ \mathop{\mathrm{E}}[\hat\mu(1) - \hat\mu(0) \mid W_1 \ldots W_n] = \mu(1) - \mu(0) \quad \text{ if } \quad \sum_{i=1}^{n} 1_{=1}(W_i) > 0 \ \text{ and } \ \sum_{i=1}^{n} 1_{=0}(W_i) > 0. \]

Why?

This follows from the linearity of expectations and conditional unbiasedness of the subsample means.

\[ \mathop{\mathrm{E}}[\hat\mu(1) - \hat\mu(0) \mid W_1 \ldots W_n] = \mathop{\mathrm{E}}[\hat\mu(1) \mid W_1 \ldots W_n] - \mathop{\mathrm{E}}[\hat\mu(0) \mid W_1 \ldots W_n] = \mu(1) - \mu(0) \]

Conditional vs. Unconditional Unbiasedness

Conditional unbiasedness without the ‘if’ implies unconditional unbiasedness. That’s the law of iterated expectations.

\[ \mathop{\mathrm{E}}[\hat\mu(1)] = \mathop{\mathrm{E}}\qty[\mathop{\mathrm{E}}[\hat\mu(1) \mid W_1 \ldots W_n]] = \mathop{\mathrm{E}}[\mu(1)] = \mu(1) \qif \mathop{\mathrm{E}}[\hat\mu(1) \mid W_1 \ldots W_n] = \mu(1) \]

But the ‘if’ is a problem. Fundamentally, the problem is that if there’s nobody with \(W_i=1\), there isn’t a subsample mean for that group. There’s nobody in the subsample to average. The best we could do is just make something up. However, when our sample is large, it’s very unlikely that there’s nobody in the subsample for any group, so if we make up anything at all reasonable, we’ll get an estimator that’s pretty close to unbiased.

\[ \begin{aligned} \mathop{\mathrm{E}}[\hat\mu(1)] &= \mathop{\mathrm{E}}\qty[\mathop{\mathrm{E}}[\hat\mu(1) \mid W_1 \ldots W_n]] \\ &= \sum_{\substack{x_1 \ldots x_n \\ \sum_{i=1}^{n} 1_{=1}(x_i) > 0}} \mu(1) \times P(W_1=x_1, \ldots, W_n=x_n) \\ &+ \sum_{\substack{x_1 \ldots x_n \\ \sum_{i=1}^{n} 1_{=1}(x_i) = 0}} \text{whatever} \times P(W_1=x_1, \ldots, W_n=x_n) \\ &= (1-p_0) \ \mu(1) \ + \ p_0 \ \text{whatever} \qfor p_0 = P(W_i=0 \text{ for all } i) \\ &\approx \mu(1) \qif p_0 \approx 0 \end{aligned} \]

For this reason, we’ll mostly ignore this issue from here on. We’ll call subsample means unbiased estimators of subpopulation means because they’re close enough, but you should be aware that there is a subtle issue here.
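Here’s a simulation sketch of that near-unbiasedness, with a made-up population and an arbitrary fallback (the whole-sample mean) when the subsample is empty:

# subsample means are nearly unbiased when P(empty subsample) is tiny
x = rbinom(1000, 1, 0.4)
y = rnorm(1000, mean = 50 + 20*x, sd = 10)          # made-up incomes
mu1 = mean(y[x == 1])                               # subpopulation mean mu(1)
muhat1 = replicate(10000, {
  I = sample(1000, 20, replace=TRUE)
  if (any(x[I] == 1)) mean(y[I][x[I] == 1]) else mean(y[I])  # 'whatever'
})
mean(muhat1) - mu1                                  # close to zero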

Why Unconditional Unbiasedness is Impossible

It’s not just that we haven’t found an unconditionally unbiased estimator—there isn’t one. No matter how clever we are about what to do when a group is empty, we can’t achieve unconditional unbiasedness.

Here’s why. Consider estimating \(\mu(1)\). Any estimator \(\hat\mu(1)\) we construct is a function of our data \((W_1, Y_1) \ldots (W_n, Y_n)\). When \(N_1 = 0\)—when nobody in our sample has \(W_i = 1\)—we have no observations from the group with \(W=1\). Our estimator can only depend on observations from the other group.

But the subpopulation mean \(\mu(1)\) could be anything. It’s a feature of the green subpopulation, which we haven’t seen at all when \(N_1 = 0\). For our estimator to be unbiased, we’d need \[ \mathop{\mathrm{E}}[\hat\mu(1)] = \mu(1) \qqtext{for all possible values of } \mu(1). \] When \(N_1 = 0\), our estimator \(\hat\mu(1)\) doesn’t depend on \(\mu(1)\) at all—it’s determined entirely by the red observations. So \(\mathop{\mathrm{E}}[\hat\mu(1) \mid N_1 = 0]\) is some fixed number that doesn’t change with \(\mu(1)\). But we need it to equal \(\mu(1)\), which does change. That’s impossible.

More formally: let \(c = \mathop{\mathrm{E}}[\hat\mu(1) \mid N_1 = 0]\). Then \[ \mathop{\mathrm{E}}[\hat\mu(1)] = P(N_1 > 0) \cdot \mu(1) + P(N_1 = 0) \cdot c \] For this to equal \(\mu(1)\) for all \(\mu(1)\), we’d need \(P(N_1 = 0) = 0\). But with ordinary sampling, there’s always some chance—however small—that everyone in our sample comes from the same group.

This is why stratified sampling matters. With stratified sampling, we choose to sample \(n_1 > 0\) people from the green group. We guarantee \(N_1 = n_1 > 0\) by design. The “what if nobody shows up” problem disappears, and we get unconditional unbiasedness for free.

Implications of Unbiasedness

[Figure: two panels, ‘Like this.’ and ‘Not like this.’, contrasting sampling distributions with and without bias.]
  • If we can estimate the width of its sampling distribution, we can get interval estimates with the coverage we want.
  • The question that remains is whether those interval estimates are narrow enough to be useful.

Spread

The Question

[Figure: two panels, ‘Is it like this?’ and ‘Or like this?’, contrasting smaller and larger spread.]

Spread Within and Between Groups

  • You can think of there being two kinds of variation in a population.
  1. Within-group variation.
    • That’s illustrated by the ‘arms’ on the subpopulation means in the plot above.
    • The arm length is, in fact, the subpopulation standard deviation.
    • That’s the square root of the conditional variance function \(\sigma^2(x) = \mathop{\mathrm{\mathop{\mathrm{V}}}}[Y \mid W=x]\).
  2. Between-group variation.
    • That’s about the differences between the subpopulation means.
    • One reasonable summary is the standard deviation of the subpopulation means \(\mu(x) = \mathop{\mathrm{E}}[Y \mid W=x]\).

The Law of Total Variance

\[ \mathop{\mathrm{\mathop{\mathrm{V}}}}[Y] = \mathop{\mathrm{E}}\qty{\mathop{\mathrm{\mathop{\mathrm{V}}}}( Y \mid X ) } + \mathop{\mathrm{\mathop{\mathrm{V}}}}\qty{\mathop{\mathrm{E}}( Y \mid X ) } \]

  • The Law of Total Variance breaks Variance into within-group and between-group terms.
  • It’s like the Law of Iterated Expectations, but for Variance.
  • It’s a useful way to decompose the variance of a random variable.
  • Think about where most of the variance in \(Y\) is coming from in the two populations shown above.

We won’t prove this here, but there’s a proof in the appendix if you’re interested. See Chapter 25.

Check Your Understanding

  • Try to put the three terms into words.
    • \(\mathop{\mathrm{\mathop{\mathrm{V}}}}[Y]\) is the …
    • \(\mathop{\mathrm{E}}\qty{ \mathop{\mathrm{\mathop{\mathrm{V}}}}(Y \mid W) }\) is the …
    • \(\mathop{\mathrm{\mathop{\mathrm{V}}}}\qty{\mathop{\mathrm{E}}( Y \mid W)}\) is the …
  • Do this in two ways.
    1. Using ‘Random Variables Language’,
      • e.g. ‘the expected value of the conditional variance of Y given W’
    2. Using ‘Population Language’
      • e.g. ‘the variance of income (Y) within the random subpopulation with degree status W’

The Variance of the Sample Mean

Claim. When we sample with replacement, the variance of the sample mean
is the population variance divided by the number of people in the sample.

\[ \mathop{\mathrm{\mathop{\mathrm{V}}}}[\hat\mu] = \frac{\sigma^2}{n} \]

  • We proved this earlier today. See Section 22.2.2.
  • We used, among other things, the property that expectations of products factor into products of expectations.
  • Our proof for the subsample mean will be similar, with more conditioning.
    • We’ll use a conditional version of that property, which we can derive using the ‘law of iterated iterated expectations’.

Conditional Expectations of Products Factor

When we sample with replacement, the conditional expectation of products factors into a product of conditional expectations.

\[ \mathop{\mathrm{E}}[Y_i \ Y_j \mid W_1 \ldots W_n] = \mathop{\mathrm{E}}[Y_i \mid W_i]\mathop{\mathrm{E}}[Y_j \mid W_j] = \mu(W_i)\mu(W_j) \qqtext{ when $(W_1, Y_1) \ldots (W_n, Y_n)$ are independent.} \]

This is a subtle application of two of our ‘laws’. \[ \begin{aligned} &\mathop{\mathrm{E}}[Y \mid W] = \mathop{\mathrm{E}}[ \mathop{\mathrm{E}}\qty[Y \mid W, X] \mid W] \quad \text{ for any random variables $W, X, Y$} && \text{Law of Iterated Iterated Expectations} \\ &\mathop{\mathrm{E}}[ \textcolor[RGB]{17,138,178}{Y} \mid \textcolor[RGB]{17,138,178}{X}, \textcolor[RGB]{239,71,111}{X'} ] = \textcolor[RGB]{17,138,178}{\mathop{\mathrm{E}}[ Y \mid X ]} \quad \text{ when $X'$ is independent of $X,Y$ } && \text{Irrelevance of Independent Conditioning Variables} \end{aligned} \]

\[ \color{gray} \begin{aligned} \mathop{\mathrm{E}}[ \textcolor[RGB]{239,71,111}{Y_i} \ \textcolor[RGB]{17,138,178}{Y_j} \mid W_1 \ldots W_n] &\overset{\texttip{\small{\unicode{x2753}}}{Here we use the law of iterated iterated expectations to 'break the second stage into a second and third'}}{=} \mathop{\mathrm{E}}[\mathop{\mathrm{E}}[\textcolor[RGB]{239,71,111}{Y_i} \ \textcolor[RGB]{17,138,178}{Y_j} \mid \textcolor[RGB]{239,71,111}{Y_i}, W_1 \ldots W_n] \mid W_1 \ldots W_n] \\ &\overset{\texttip{\small{\unicode{x2753}}}{Here we pull $Y_i$ out of the inner expectation (stage 3). That's justified by linearity of conditional expectations because it's a function of $\textcolor[RGB]{239,71,111}{Y_i}, W_1 \ldots W_n$, the stuff we chose in stages 1 and 2.}}{=} \mathop{\mathrm{E}}[ \textcolor[RGB]{239,71,111}{Y_i} \ \mathop{\mathrm{E}}[\textcolor[RGB]{17,138,178}{Y_j} \mid \textcolor[RGB]{239,71,111}{Y_i}, W_1 \ldots W_n] \mid W_1 \ldots W_n] \\ &\overset{\texttip{\small{\unicode{x2753}}}{Here we drop the irrelevant conditioning variables from the inner expectation. In the notation above, $Y=Y_j$, $X=W_j$, and $X'=(Y_i, W_1 \ldots W_n \text{ except } W_j)$, which describes other 'calls', is independent of $(X,Y)=(W_j, Y_j)$.}}{=} \mathop{\mathrm{E}}[\textcolor[RGB]{239,71,111}{Y_i} \ \mathop{\mathrm{E}}[\textcolor[RGB]{17,138,178}{Y_j} \mid W_j] \mid W_1 \ldots W_n] \\ &\overset{\texttip{\small{\unicode{x2753}}}{Here we pull $\textcolor[RGB]{17,138,178}{Y_j}$ out of the outer expectation (stage 2). That's justified by linearity of conditional expectations because it's a function of $W_1 \ldots W_n$, the stuff we chose in stage 1.}}{=} \textcolor[RGB]{17,138,178}{\mathop{\mathrm{E}}[Y_j \mid W_j]} \ \mathop{\mathrm{E}}[\textcolor[RGB]{239,71,111}{Y_i} \mid W_1 \ldots W_n] \\ &\overset{\texttip{\small{\unicode{x2753}}}{Here we drop the irrelevant conditioning variables from the expectation involving $\textcolor[RGB]{239,71,111}{Y_i}$. In the notation above, $Y=Y_i$, $X=W_i$, and $X'=(Y_j, W_1 \ldots W_n \text{ except } W_i)$, which describes other 'calls', is independent of $(X,Y)=(W_i, Y_i)$.}}{=} \textcolor[RGB]{17,138,178}{\mathop{\mathrm{E}}[Y_j \mid W_j]} \ \textcolor[RGB]{239,71,111}{\mathop{\mathrm{E}}[Y_i \mid W_i]} \\ &\overset{\texttip{\small{\unicode{x2753}}}{This is the same thing in different notation.}}{=} \textcolor[RGB]{17,138,178}{\mu(W_j)} \ \textcolor[RGB]{239,71,111}{\mu(W_i)} \end{aligned} \]

This proof is a bit stranger conceptually than it is as a matter of pushing symbols around. You don’t have to get it conceptually if you don’t want to. Sometimes it’s not worth it. But if you do want to, here’s what’s going on.

We start out thinking about a ‘two stage’ sampling process. First, we flip our ‘color coin’ \(n\) times to choose \(W_1 \ldots W_n\). Those flips together are \(W\) in the law of iterated iterated expectations above. Then we roll \(n\) times to choose \(Y_1 \ldots Y_n\). For \(Y_i\), we roll the ‘red die’ if \(W_i=0\) and the ‘green die’ if \(W_i=1\). You can think of the product \(Y_i Y_j\) as \(Y\) in that formula.

Then we use the ‘law of iterated iterated expectations’ to break that second stage into two: first we roll once to choose \(Y_i\) (that’s \(X\) in the formula), then we roll \(n-1\) times to choose the rest of \(Y_1 \ldots Y_n\). After that, we simplify using linearity and the irrelevance of independent conditioning variables.

It sounds a little odd because our ‘3 stage sampling’ scheme depends on the pair \(Y_i, Y_j\) we’re thinking about. When we think about the pairs \(Y_1 Y_3\) and \(Y_2 Y_3\), we’re thinking about two different ways of drawing \(Y_3\). And we can’t really do both. It’s just one thing. We can’t draw it twice.

That’s ok because we don’t really have to sample this way. It’s just a way of thinking about doing a calculation. This is a bit like how we can think of any pair \(Y_i,Y_j\) as distributed like \(Y_1,Y_2\) when calculating variance after sampling without replacement in this week’s homework. Ultimately, it’s about linearity. When we’re calculating the expectation of a sum, we only care about the marginal distribution of each term. When that term involves only two observations, like we see in variance calculations, we can think about the random mechanism that gives us each pair separately.

The Variance of the Subsample Mean

Claim. When we sample with replacement, the variance of a subsample mean is the
expected value of the subpopulation variance divided by the number of people in the subsample. \[ \mathop{\mathrm{\mathop{\mathrm{V}}}}[\hat\mu(1)] = \mathop{\mathrm{E}}\qty[ \frac{\sigma^2(1)}{N_1} ] \quad \text{ for } \quad N_1 = \sum_{i=1}^n 1_{=1}(W_i) \]

  • This is like what we saw in the unconditional case, but with a twist.
  • The number of people in the subsample is random, so instead of \(1/n\), we have \(\mathop{\mathrm{E}}[ \frac{1}{N_1} ]\).
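Before we prove it, here’s a simulation sketch of the claim with a made-up population; the variance of \(\hat\mu(1)\) across samples should match the average of \(\sigma^2(1)/N_1\):

# simulation check of V[muhat(1)] = E[sigma^2(1) / N_1]
x = rbinom(1000, 1, 0.4)
y = rnorm(1000, mean = 50 + 20*x, sd = 10)            # made-up population
sigma2.1 = mean((y[x==1] - mean(y[x==1]))^2)          # sigma^2(1)
draws = replicate(20000, {
  I = sample(1000, 50, replace=TRUE)
  c(mean(y[I][x[I] == 1]), sigma2.1 / sum(x[I] == 1)) # muhat(1), sigma^2(1)/N_1
})
c(var(draws[1,]), mean(draws[2,]))                    # these should be close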

Step 1. Total Variance

  • We’ll start by using the Law of Total Variance, conditioning on \(W_1 \ldots W_n\).
  • And then we’ll have to calculate two terms.
    1. The expected value of the conditional variance
    2. The variance of the conditional expectation \[ \begin{aligned} \mathop{\mathrm{\mathop{\mathrm{V}}}}\qty[ \hat \mu(1) ] &= \underset{\color[RGB]{64,64,64}\text{within-groups term}}{\mathop{\mathrm{E}}\qty[ \mathop{\mathrm{\mathop{\mathrm{V}}}}\qty{ \hat \mu(1) \mid W_1 \ldots W_n }] } + \underset{\color[RGB]{64,64,64}\text{between-groups term}}{\mathop{\mathrm{\mathop{\mathrm{V}}}}\qty[ \mathop{\mathrm{E}}\qty{\hat \mu(1) \mid W_1 \ldots W_n} ] } \\ \end{aligned} \]
  • This is one place it helps to know our estimator is conditionally unbiased.
    • That conditional expectation is the constant \(\mu(1)\).1
    • So the ‘between-groups’ variance term is zero.

Step 2. Centering and Squaring

\[ \begin{aligned} \mathop{\mathrm{\mathop{\mathrm{V}}}}\qty[ \hat \mu(1) \mid W_1 \ldots W_n ] &\overset{\texttip{\small{\unicode{x2753}}}{Definitionally.}}{=} \mathop{\mathrm{E}}\qty[ \qty{\hat \mu(1) - \mu(1) }^2 \mid W_1 \ldots W_n ] \\ &\overset{\texttip{\small{\unicode{x2753}}}{Also definitionally.}}{=} \mathop{\mathrm{E}}\qty[ \qty{\frac{\sum_{i=1}^n 1_{=1}(W_i)Y_{i}}{\sum_{i=1}^n 1_{=1}(W_i)} - \mu(1)}^2 \mid W_1 \ldots W_n ] \\ &\overset{\texttip{\small{\unicode{x2753}}}{Multiplying $\mu(1)$ by $N_1/N_1$ to get a common denominator}}{=} \mathop{\mathrm{E}}\qty[ \qty{ \frac{\sum_{i=1}^n 1_{=1}(W_i){\textcolor[RGB]{0,0,255}{\qty{Y_{i} - \mu(1)}}}}{\sum_{i=1}^n 1_{=1}(W_i)}}^2 \mid W_1 \ldots W_n ] \\ &\overset{\texttip{\small{\unicode{x2753}}}{The indicator trick tells us the factors we've highlighted in \textcolor[RGB]{0,0,255}{blue} are the same.}}{=} \mathop{\mathrm{E}}\qty[ \qty{ \frac{\sum_{i=1}^n 1_{=1}(W_i)\textcolor[RGB]{0,0,255}{Z_i}}{\sum_{i=1}^n 1_{=1}(W_i)}}^2 \mid W_1 \ldots W_n ] \qfor \textcolor[RGB]{0,0,255}{Z_i = Y_{i} - \mu(W_i)} \\ &\overset{\texttip{\small{\unicode{x2753}}}{Expanding the square.}}{=} \mathop{\mathrm{E}}\qty[ \frac{\textcolor[RGB]{239,71,111}{\sum_{i=1}^n}\textcolor[RGB]{17,138,178}{\sum_{j=1}^n} \textcolor[RGB]{239,71,111}{1_{=1}(W_i)Z_i} \ \textcolor[RGB]{17,138,178}{1_{=1}(W_j)Z_j}}{\qty{\sum_{i=1}^n 1_{=1}(W_i) }^2} \mid W_1 \ldots W_n ] \\ &\overset{\texttip{\small{\unicode{x2753}}}{Distributing the expectation. We can pull out the denominator and the indicator in each term because they're functions of $W_1 \ldots W_n$.}}{=} \frac{\textcolor[RGB]{239,71,111}{\sum_{i=1}^n}\textcolor[RGB]{17,138,178}{\sum_{j=1}^n} \textcolor[RGB]{239,71,111}{1_{=1}(W_i)} \textcolor[RGB]{17,138,178}{1_{=1}(W_j)}\mathop{\mathrm{E}}\qty[\textcolor[RGB]{239,71,111}{Z_i} \textcolor[RGB]{17,138,178}{Z_j} \mid W_1 \ldots W_n] }{\qty{\sum_{i=1}^n 1_{=1}(W_i) }^2} \end{aligned} \]

This random variable \(Z_i\) is like the one we used in the unconditional case, but it’s centered conditionally rather than unconditionally.

\[ \begin{aligned} \mathop{\mathrm{E}}[ Z_i \mid W_i ] &= \mathop{\mathrm{E}}[ Y_i - \mu(W_i) \mid W_i ] = \mathop{\mathrm{E}}[ Y_i \mid W_i ] - \mu(W_i) = \mu(W_i) - \mu(W_i) = 0 && \overset{\texttip{\small{\unicode{x2753}}}{It has conditional expectation zero. }}{\ } \\ \mathop{\mathrm{E}}[ Z_i^2 \mid W_i] &= \mathop{\mathrm{E}}[ \{ Y_i - \mathop{\mathrm{E}}[Y_i \mid W_i] \}^2 \mid W_i] = \mathop{\mathrm{\mathop{\mathrm{V}}}}[Y_i \mid W_i] && \overset{\texttip{\small{\unicode{x2753}}}{The conditional expectation of its square is the conditional variance of $Y_i$ }}{\ } \end{aligned} \]

Step 3. Taking Expectations Term-by-Term

Squared Terms (\(i=j\))

\[ \begin{aligned} \mathop{\mathrm{E}}\qty[\textcolor[RGB]{239,71,111}{Z_i}\textcolor[RGB]{17,138,178}{Z_j} \mid W_1 \ldots W_n] &\overset{\texttip{\small{\unicode{x2753}}}{Because $i=j$}}{=} \mathop{\mathrm{E}}\qty[ Z_i^2 \mid W_1 \ldots W_n] \\ &\overset{\texttip{\small{\unicode{x2753}}}{Because $X' = (W_1 \ldots W_n \text{ except } W_i)$ is independent of $W_i$ and $Y_i$.}}{=} \mathop{\mathrm{E}}\qty[ Z_i^2 \mid W_i] \\ &\overset{\texttip{\small{\unicode{x2753}}}{Definitionally}}{=}\mathop{\mathrm{\mathop{\mathrm{V}}}}[Y_i \mid W_i] = \sigma^2(W_i) \end{aligned} \]

Cross-Terms (\(i\neq j\))

\[ \begin{aligned} \mathop{\mathrm{E}}\qty[\textcolor[RGB]{239,71,111}{Z_i} \textcolor[RGB]{17,138,178}{Z_j} \mid W_1 \ldots W_n] &\overset{\texttip{\small{\unicode{x2753}}}{This is the factorization identity we proved a few slides back.}}{=}\textcolor[RGB]{239,71,111}{\mathop{\mathrm{E}}\qty[Z_i \mid W_i]} \textcolor[RGB]{17,138,178}{\mathop{\mathrm{E}}\qty[Z_j \mid W_j]} \\ &\overset{\texttip{\small{\unicode{x2753}}}{Definitionally}}{=} 0 \times 0 \end{aligned} \]

Step 4. Putting the Pieces Together

\[ \begin{aligned} \mathop{\mathrm{\mathop{\mathrm{V}}}}\qty[ \hat \mu(1) ] &\overset{\texttip{\small{\unicode{x2753}}}{Step 1}}{=} \underset{\color[RGB]{64,64,64}\text{within-groups term}}{\mathop{\mathrm{E}}\qty[ \mathop{\mathrm{\mathop{\mathrm{V}}}}\qty{ \hat \mu(1) \mid W_1 \ldots W_n }] } + \underset{\color[RGB]{64,64,64}\text{between-groups term}}{ 0 } \\ &\overset{\texttip{\small{\unicode{x2753}}}{Step 2}}{=} \mathop{\mathrm{E}}\qty[\frac{\textcolor[RGB]{239,71,111}{\sum_{i=1}^n}\textcolor[RGB]{17,138,178}{\sum_{j=1}^n} \textcolor[RGB]{239,71,111}{1_{=1}(W_i)} \textcolor[RGB]{17,138,178}{1_{=1}(W_j)}\mathop{\mathrm{E}}\qty[\textcolor[RGB]{239,71,111}{Z_i} \textcolor[RGB]{17,138,178}{Z_j} \mid W_1 \ldots W_n] }{\qty{\sum_{i=1}^n 1_{=1}(W_i) }^2}] + 0 \\ &\overset{\texttip{\small{\unicode{x2753}}}{Step 3}}{=} \mathop{\mathrm{E}}\qty[\frac{\textcolor[RGB]{239,71,111}{\sum_{i=1}^n 1_{=1}(W_i)} \ \sigma^2(W_i) }{\qty{\sum_{i=1}^n 1_{=1}(W_i) }^2}] \\ &\overset{\texttip{\small{\unicode{x2753}}}{The indicator trick again.}}{=} \mathop{\mathrm{E}}\qty[\frac{\textcolor[RGB]{239,71,111}{\sum_{i=1}^n 1_{=1}(W_i)} \ \sigma^2(1) }{\qty{\sum_{i=1}^n 1_{=1}(W_i) }^2}] \\ &\overset{\texttip{\small{\unicode{x2753}}}{Cancelling a common factor of $N_1$ in the numerator and denominator.}}{=} \mathop{\mathrm{E}}\qty[\frac{\sigma^2(1)}{\sum_{i=1}^n 1_{=1}(W_i) } ] \end{aligned} \]

The Variance of Differences in Means

Claim. The variance of the difference in subsample means is the sum of the variances of the subsample means. \[ \mathop{\mathrm{\mathop{\mathrm{V}}}}[\hat\mu(1) - \hat\mu(0)] = \mathop{\mathrm{E}}\qty[ \frac{\sigma^2(1)}{N_1} + \frac{\sigma^2(0)}{N_0}] \quad \text{ for } \quad N_x = \sum_{i=1}^n 1_{=x}(W_i) \]

This’ll be a Homework Exercise.

Estimating the Variance

[Figure: two panels, ‘The population.’ (left) and ‘The sample.’ (right), each showing group means with standard-deviation ‘arms’.]

\[ \mathop{\mathrm{\mathop{\mathrm{V}}}}[\hat\mu(1) - \hat\mu(0)] = \mathop{\mathrm{E}}\qty[ \frac{\sigma^2(1)}{N_1} + \frac{\sigma^2(0)}{N_0}] \quad \text{ for } \quad N_x = \sum_{i=1}^n 1_{=x}(W_i) \]

  • This formula includes a few things we don’t know.
  • We don’t know the subpopulation standard deviations \(\sigma(0)\) or \(\sigma(1)\).
    • Those are drawn in as ‘arms’ on the subpopulation means in the plot on the left.
    • That’s a fake-data plot. It’s an illustration of data we don’t really get to see.
  • We also don’t know the expected values of \(1/N_1\) or \(1/N_0\).
    • We’d need to know some things about the population to calculate those, too.
    • In particular, we’d need to know the frequency that \(x_j=1\) in our population.
  • What we can do is use our sample to estimate all of those things.
  • We can use the subsample standard deviations \(\hat\sigma(0)\) or \(\hat\sigma(1)\).
    • Those are drawn in as ‘arms’ on the subsample means in the plot on the right.
    • That’s a plot of the data we have. Our sample. And its arms are about the same length as the population’s.
  • We can use the \(1/N_1\) and \(1/N_0\) themselves to estimate their expected values.
  • And we can plug those into our variance formula to get an estimate of this variance.

\[ \widehat{\mathop{\mathrm{\mathop{\mathrm{V}}}}}[\hat\mu(1) - \hat\mu(0)] = \frac{\hat{\sigma}^2(1)}{N_1} + \frac{\hat{\sigma}^2(0)}{N_0} \qfor \hat \sigma^2(x) = \frac{1}{N_x} \sum_{i:W_i=x} (Y_i - \hat \mu(x))^2 \]
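A sketch of this plug-in estimate in R, assuming W and Y are vectors coding group membership and outcomes as above (the function name is mine):

# plug-in variance estimate and 95% interval for muhat(1) - muhat(0)
difference.interval = function(W, Y) {
  sigma2.hat = function(x) mean((Y[W == x] - mean(Y[W == x]))^2)
  N = function(x) sum(W == x)
  estimate = mean(Y[W == 1]) - mean(Y[W == 0])
  se = sqrt(sigma2.hat(1)/N(1) + sigma2.hat(0)/N(0))
  estimate + c(-1, 1) * 1.96 * se
}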

Case Study

The NSW Experiment

Code

library(dplyr)
library(purrr)

nsw.data = read.csv('../data/nsw.csv')
nsw = nsw.data |> 
  filter(dataset=='nsw') 

X = as.numeric(nsw$treated)
Y = as.numeric(nsw$income78)
n = length(Y)

# point estimate: the difference in subsample means
difference = mean(Y[X==1]) - mean(Y[X==0])

# bootstrap: resample rows with replacement and recompute the difference
difference.bootstrap.samples = map_vec(1:10000, function(.) {
    I = sample(1:n, n, replace=TRUE)
    Xstar = X[I]
    Ystar = Y[I]
    mean(Ystar[Xstar==1]) - mean(Ystar[Xstar==0])
})

# width of the central 95% interval of the bootstrap draws.
# 'width' isn't defined in this excerpt; this quantile-based version is an assumption.
width = function(draws, alpha) diff(quantile(draws, c(alpha/2, 1-alpha/2)))
w95.boot = width(difference.bootstrap.samples, alpha=.05)
nsw.interval = difference + c(-1,1)*w95.boot/2

The National Supported Work Demonstration

[Figure: the bootstrap sampling distribution of the difference in means.]
\(x\) \(N_x\) \(\hat \mu(x)\) \(\hat \sigma(x)\)
0 260 4600 5500
1 185 6300 7900
  • The National Supported Work Demonstration was an experimental program in the 1970s that provided job training and other services.
  • Here ‘experimental’ doesn’t just mean untested. It was run as a big experiment.
  • From a population of eligible people, they drew a sample of size 445 to participate.
  • To help estimate the impact of the program, participants were randomized into two groups.
    • The ‘treated’ group received the training and services.
    • The ‘control’ group did not.
  • A year or two after the program ended, they compared the incomes of the two groups.
    • And the difference within the sample was about $1800 per year. In 1978 dollars.
    • That’s pretty good. A dollar in 1978 was worth about what $4.80 is today.
    • So adjusting for inflation, that’s an in-sample difference of about $8600. Huge impact.
  • But there’s a problem. It’s not clear that this describes a real population-level difference.
  • The sample was relatively small. And, as a result, their interval estimate was pretty wide.

\[ \begin{aligned} \mu(1)-\mu(0) &\in 1800 \pm 1300 \approx [500, \ 3100] &&\qqtext{is a 95\% confidence interval in 1978 dollars} \\ &\in 8600 \pm 6300 \approx [2300, \ 14900] &&\qqtext{adjusted for inflation} \end{aligned} \]

  • This was an expensive program to run.
    • If the population difference were at the low end of the interval, at 500 1978 dollars, it wouldn’t be cost-effective.
    • If the population difference were at the high end of the interval, at 3100 1978 dollars, it would be.
  • To figure out if it’s worth running, we need a more conclusive result.

Activities

[Figure: the bootstrap sampling distribution of the difference in means.]
\(x\) \(N_x\) \(\hat \mu(x)\) \(\hat \sigma(x)\)
0 260 4600 5500
1 185 6300 7900

Use the information in this table to calculate a 95% interval based on normal approximation. Compare it to the interval we got using the bootstrap sampling distribution for calibration.

Suppose we want to know the population difference to within $500 (1978 dollars). How can we resize our experiment to make that happen?

  1. We could restart the experiment, running it exactly as before, to get a larger sample like the one we have. In particular, one with the same proportion of participants assigned to treated and control units. What sample size would we need to be able to get a 95% confidence interval this accurate?

  2. It’d be cheaper to increase the size of our control group without actually treating anyone. Suppose you can analyze the resulting ‘diluted’ version of our experiment exactly like the original. Same variance formula, same \(N_1\), but larger \(N_0\). How big do we need to make \(N_0\) to get the desired level of precision? Is it even possible without increasing \(N_1\)? How precise can we get keeping \(N_1\) as-is?

You can use a ‘two stage sampling’ argument to think of the ‘diluted’ version of our experiment as being like the one we ran originally, but with a different coin to assign treatment. In essence, getting assigned treatment in the diluted version requires you to flip heads twice: once to be in the first part of this larger sample (our original sample), and once to be in the part of that sample assigned to treatment. We can boil those two flips down to one flip of an appropriately weighted coin.

Warm-Up.

Our interval based on normal approximation should be \(\hat\theta \pm 1.96 \times \hat\sigma_{\hat\theta}\) where \(\hat\theta=\hat\mu(1) - \hat\mu(0)\) is our point estimate and \(\hat\sigma_{\hat\theta}\) is an estimate of its standard deviation. That’s roughly \(\hat\theta \pm 2 \times 670\). Pretty similar to our bootstrap interval. Using the formula from Section 22.4.13, we calculate it like this.

\[ \begin{aligned} \widehat{\mathop{\mathrm{\mathop{\mathrm{V}}}}}[\hat\mu(1) - \hat\mu(0)] &= \frac{\hat\sigma^2(1)}{N_1} + \frac{\hat\sigma^2(0)}{N_0} \\ &\approx \frac{61.90M}{185} + \frac{30.07M}{260} \\ &\approx 450.24K \approx \hat\sigma_{\hat\theta}^2 \qfor \hat\sigma_{\hat\theta} \approx 670 \end{aligned} \]
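The same calculation in R, from the table’s rounded summaries; it lands within rounding of the \(\hat\sigma_{\hat\theta} \approx 670\) above:

# normal-approximation interval from the table's (rounded) values
v.hat = 7900^2/185 + 5500^2/260            # about 453.7K with these rounded inputs
se.hat = sqrt(v.hat)                       # about 674
(6300 - 4600) + c(-1, 1) * 1.96 * se.hat   # point estimate is ~1800 before rounding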

Resizing Part 1.

If we’re resizing our experiment but otherwise running it as before, we can think of the new subsample sizes as being roughly proportional to the old ones. That is, if our new sample size is \(\alpha n\), then our subsample sizes will be \(\alpha N_1\) and \(\alpha N_0\). Plugging these new sample sizes into our variance formula, we get a simple formula for our estimator’s standard deviation in this new experiment relative to the old one. It’s just \(\sigma_{\hat\theta}/\sqrt{\alpha}\) where \(\sigma_{\hat\theta}\) is the standard deviation of our estimator in our original experiment.

\[ \begin{aligned} \mathop{\mathrm{\mathop{\mathrm{V}}}}[\hat\mu(1) - \hat\mu(0)] &= \frac{\sigma^2(1)}{\alpha N_1} + \frac{\sigma^2(0)}{\alpha N_0} = \frac{1}{\alpha} \sigma_{\hat\theta}^2 = \qty(\frac{\sigma_{\hat\theta}}{\sqrt{\alpha}})^2 \end{aligned} \]

That is, it’s \(1/\sqrt{\alpha}\) times the quantity we just estimated to be \(\hat\sigma_{\hat\theta} \approx 670\), so it’s sensible to estimate it by \(\hat\sigma_{\hat\theta}/\sqrt{\alpha}\). We want our new interval \(\hat\theta \pm 1.96\hat\sigma_{\hat\theta}/\sqrt{\alpha}\) to be \(\hat\theta \pm 500\). The old one, \(\hat\theta \pm 1.96\hat\sigma_{\hat\theta} \approx \hat\theta \pm 1320\), was about 3 times wider than this, so we want \(\sqrt{\alpha} \approx 3\) and therefore \(\alpha \approx 9\). Since we’ve already got a sample of size \(n\), we can run a ‘second wave’ of size \((\alpha-1)n \approx 8n\) to get a big enough sample.

That’s a back-of-the-envelope calculation. If we want to be a bit more precise, we can actually solve for \(\alpha\) that equates these two interval widths. \[ \begin{aligned} \frac{1.96\hat\sigma_{\hat\theta}}{\sqrt{\alpha}} = 500 \qfor \alpha = \qty(\frac{1.96\hat\sigma_{\hat\theta}}{500})^2 \approx 6.92 \end{aligned} \]

That’s a bit better than 9. It pays to be precise sometimes.

Resizing Part 2. If it’s free to include more participants in our control group, we may as well make \(N_0\) arbitrarily large. If we did, we’d have this variance.

\[ \begin{aligned} \widehat{\mathop{\mathrm{\mathop{\mathrm{V}}}}}[\hat\mu(1) - \hat\mu(0)] &= \frac{\hat\sigma^2(1)}{N_1} + \frac{\hat\sigma^2(0)}{\infty} = \qty(\frac{\hat\sigma(1)}{\sqrt{N_1}})^2 \approx 580^2 \end{aligned} \]

The resulting interval would be \(\pm 1.96 \times 580 \approx \pm 1130\). Roughly twice as wide as we want. So we’re not getting the precision we want for free. Thinking back to Part 1, that means we’d want \(N_1\) to be roughly \(\alpha = 4\) times larger. And that’d mean treating roughly \((\alpha-1)N_1 = 3N_1\) more people. That’s a lot cheaper than treating \(8\) (or really \(5.92\)) times as many people, as we would if we ran our second wave as a scaled-up version of the first.
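The back-of-the-envelope arithmetic for both parts, in R; note the exact Part 2 factor comes out a bit above the rough \(\alpha = 4\) used above:

# Part 1: scale the whole experiment by alpha
alpha = (1.96 * 670 / 500)^2             # about 6.9
# Part 2: with N_0 unbounded, the narrowest achievable half-width ...
half.width = 1.96 * sqrt(7900^2 / 185)   # about 1140
# ... so N_1 itself must grow by roughly this factor
(half.width / 500)^2                     # about 5.2 (rounded to 4 above)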

Appendix: Additional Figures

The Social Pressure Letter

Figure 24.1

Appendix: Proving the Law of Total Variance

Starting Point: Interpreting Variance as Excess

\[ \small{ \begin{aligned} \mathop{\mathrm{\mathop{\mathrm{V}}}}(Y) &= \mathop{\mathrm{E}}\qty[ \qty{ Y - \mathop{\mathrm{E}}(Y) }^2 ] && \text{ the \emph{spread} idea: the mean square of a \emph{centered version} of $Y$ } \\ &= \mathop{\mathrm{E}}\qty(Y^2) - \qty{\mathop{\mathrm{E}}\qty(Y)}^2 && \text{ the \emph{excess} idea: the average amount $Y^2$ \emph{exceeds} the square of its mean}. \end{aligned} } \]

Let’s think, in terms of the plot, about why there is excess in the square.

  • Suppose \(Y\) takes on three values — the dots — with equal probability.
    • Its expectation is the average height of the black dots.
    • And its expectation, squared, is the square of this.
  • Its square \(Y^2\) takes on the square of those values — the squares — with equal probability.
    • Its expectation is the average height of the squares.
  • Why is the second one bigger?
    • Because squaring increases big draws of Y more than it decreases small ones.
    • Visually, the line from \(Y\) to \(Y^2\) is longer when \(Y\) is bigger than when it’s smaller.
  • We’ll mostly think about variance in terms of spread, but the ‘excess’ formula comes up in calculations.

Equivalence Proof

Claim. The spread and excess formulas for variance are equivalent. \[ \mathop{\mathrm{E}}\qty[ \qty{ Y - \mathop{\mathrm{E}}(Y) }^2 ] =\mathop{\mathrm{E}}[ Y^2] - \qty{\mathop{\mathrm{E}}(Y)}^2 \]

\[ \small{ \begin{aligned} \mathop{\mathrm{E}}\qty[ \qty{ Y - \mathop{\mathrm{E}}(Y) }^2 ] &= \mathop{\mathrm{E}}\qty[ Y^2 - 2 Y \mathop{\mathrm{E}}(Y) + \qty{\mathop{\mathrm{E}}(Y)}^2 ] && \text{ FOIL } \\ &= \mathop{\mathrm{E}}[ Y^2] - \mathop{\mathrm{E}}\qty[2 Y \mathop{\mathrm{E}}(Y)] + \mathop{\mathrm{E}}\qty[\qty{\mathop{\mathrm{E}}(Y)}^2] && \text{Distributing expectations (linearity)} \\ &= \mathop{\mathrm{E}}[ Y^2] - 2\mathop{\mathrm{E}}(Y)\mathop{\mathrm{E}}\qty[Y] + \qty{\mathop{\mathrm{E}}(Y)}^2\mathop{\mathrm{E}}[1] && \text{Pulling constants out of expectations (linearity)} \\ &= \mathop{\mathrm{E}}[ Y^2] - 2\qty{\mathop{\mathrm{E}}(Y)}^2 + \qty{\mathop{\mathrm{E}}(Y)}^2 && \text{Recognizing a product of something with itself as a square, and } \mathop{\mathrm{E}}[1]=1 \\ &= \mathop{\mathrm{E}}[ Y^2] - \qty{\mathop{\mathrm{E}}(Y)}^2 && \text{Subtracting} \end{aligned} } \]

Conditional Variance as Excess

\[ \small{ \begin{aligned} \mathop{\mathrm{\mathop{\mathrm{V}}}}(Y \mid W) &= \mathop{\mathrm{E}}\qty[ \qty{ Y - \mathop{\mathrm{E}}(Y\mid W)}^2 \mid W ] && \text{ the \emph{spread} idea: the mean square of a conditionally centered version of $Y$ } \\ &= \mathop{\mathrm{E}}\qty(Y^2 \mid W) - \qty{\mathop{\mathrm{E}}\qty(Y \mid W)}^2 && \text{ the \emph{excess} idea: the average amount $Y^2$ exceeds the square of its cond. mean} \end{aligned} } \]

  • We can think of the Conditional Variance as excess, too.
  • It’s the same as the unconditional case, but within subpopulations.

Proof

Claim. \[ \mathop{\mathrm{\mathop{\mathrm{V}}}}[Y] = \textcolor{blue}{\mathop{\mathrm{E}}\qty[\mathop{\mathrm{\mathop{\mathrm{V}}}}(Y \mid W)]} + \textcolor{green}{\mathop{\mathrm{\mathop{\mathrm{V}}}}\qty[\mathop{\mathrm{E}}(Y \mid W)]} \]

  1. Rewrite the variances, conditional and unconditional, in excess form.

\[ \mathop{\mathrm{\mathop{\mathrm{V}}}}[Y] = \textcolor{blue}{\mathop{\mathrm{E}}\qty[ \mathop{\mathrm{E}}(Y^2 \mid W) - \qty{\mathop{\mathrm{E}}\qty(Y \mid W)}^2]} + \textcolor{green}{\mathop{\mathrm{E}}\qty[\qty{ \mathop{\mathrm{E}}(Y \mid W)}^2] - \qty(\mathop{\mathrm{E}}\qty{\mathop{\mathrm{E}}\qty(Y \mid W)})^2} \]

  2. Distribute the outer expectations in the first term.

\[ \mathop{\mathrm{\mathop{\mathrm{V}}}}[Y] = \textcolor{blue}{\mathop{\mathrm{E}}\qty[ \mathop{\mathrm{E}}(Y^2 \mid W)] - \mathop{\mathrm{E}}\qty[\qty{\mathop{\mathrm{E}}\qty(Y \mid W)}^2]} + \textcolor{green}{\mathop{\mathrm{E}}\qty[ \qty{\mathop{\mathrm{E}}(Y \mid W)}^2 ] - \qty(\mathop{\mathrm{E}}\qty{\mathop{\mathrm{E}}\qty(Y \mid W)})^2} \]

  3. Cancel matching terms.

\[ \mathop{\mathrm{\mathop{\mathrm{V}}}}[Y] = \textcolor{blue}{\mathop{\mathrm{E}}\qty[ \mathop{\mathrm{E}}(Y^2 \mid W)] \cancel{- \mathop{\mathrm{E}}\qty[\qty{\mathop{\mathrm{E}}\qty(Y \mid W)}^2]}} + \textcolor{green}{\cancel{\mathop{\mathrm{E}}\qty[ \qty{\mathop{\mathrm{E}}(Y \mid W)}^2 ]} - \qty(\mathop{\mathrm{E}}\qty{\mathop{\mathrm{E}}\qty(Y \mid W)})^2} \]

  4. Use the law of iterated expectations to simplify the remaining terms.

\[ \mathop{\mathrm{\mathop{\mathrm{V}}}}[Y] = \textcolor{blue}{\mathop{\mathrm{E}}[Y^2] \cancel{- \mathop{\mathrm{E}}\qty[\qty{\mathop{\mathrm{E}}\qty(Y \mid W)}^2]}} + \textcolor{green}{\cancel{\mathop{\mathrm{E}}\qty[ \qty{\mathop{\mathrm{E}}(Y \mid W)}^2 ]} - \qty(\mathop{\mathrm{E}}\qty{Y})^2} \]

What remains is \(\mathop{\mathrm{E}}[Y^2] - \qty{\mathop{\mathrm{E}}(Y)}^2\), which is \(\mathop{\mathrm{\mathop{\mathrm{V}}}}[Y]\) in excess form, so the two sides agree.


  1. We showed that in Section 22.2.1.↩︎