6 Calibrating Interval Estimates with Binary Observations
Calibration
What does it mean for an interval estimate to be calibrated? We’ll work through a polling example to see why calibration matters and what goes wrong when you get it wrong.
Before the Election
Your Sample
\[ \begin{array}{r|rrrr|r} i & 1 & 2 & \dots & 625 & \bar{Y}_{625} \\ Y_i & 1 & 1 & \dots & 0 & 0.72 \\ \end{array} \]
The Population
\[ \color{lightgray} \begin{array}{r|rrrrrr|r} j & 1 & 2 & 3 & 4 & \dots & 7.23M & \bar{y}_{7.23M} \\ y_{j} & 1 & 1 & 1 & 0 & \dots & 1 & 0.70 \\ \end{array} \]
You want to estimate the proportion of all registered voters—the population—who will vote. To do this, you use the proportion of polled voters—your sample—who said they would. You’re probably not going to match the population proportion exactly, so you report an interval estimate—a range of values we claim the population proportion is in. You know you’re not going to be right 100% of the time you make claims like this, so you state a nominal coverage probability—how often you’re right about claims like this.1
Without thinking too hard about it, you report, as your interval, your sample proportion (72%) ± 1%. Because it sounds good. And you say your coverage probability is 95%. Because that’s what everybody else says.
After the Election
Your Sample
\[ \begin{array}{r|rrrr|r} i & 1 & 2 & \dots & 625 & \bar{Y}_{625} \\ Y_i & 1 & 1 & \dots & 0 & 0.72 \\ \end{array} \]
The Population
\[ \begin{array}{r|rrrrrr|r} j & 1 & 2 & 3 & 4 & \dots & 7.23M & \bar{y}_{7.23M} \\ y_{j} & 1 & 1 & 1 & 0 & \dots & 1 & 0.70 \\ \end{array} \]
When the election occurs, we get to see who turns out to vote. 5.05M people, or roughly 70% of registered voters, actually vote. You—and future employers—can see how well you did. And how well everybody else did.
Your interval missed the target. It doesn’t contain the turnout proportion. Your point estimate is only off by 2%. But you overstated your precision. Now you’re kicking yourself. You’d briefly considered saying ± 3% or ± 4%. That would’ve done it. But that didn’t sound as good, so you went for ± 1%.
You hope you’re not the only one who missed. So you check out the competition.
The Competition
\[ \begin{array}{r|rr|rr|r|rr|r} \text{call} & 1 & & 2 & & \dots & 625 & & \\ \text{poll} & J_1 & Y_1 & J_2 & Y_2 & \dots & J_{625} & Y_{625} & \overline{Y}_{625} \\ \hline \color[RGB]{7,59,76}1 & \color[RGB]{7,59,76}869369 & \color[RGB]{7,59,76}1 & \color[RGB]{7,59,76}4428455 & \color[RGB]{7,59,76}1 & \color[RGB]{7,59,76}\dots & \color[RGB]{7,59,76}1268868 & \color[RGB]{7,59,76}1 & \color[RGB]{7,59,76}0.68 \\ \color[RGB]{239,71,111}2 & \color[RGB]{239,71,111}600481 & \color[RGB]{239,71,111}0 & \color[RGB]{239,71,111}6793745 & \color[RGB]{239,71,111}1 & \color[RGB]{239,71,111}\dots & \color[RGB]{239,71,111}1377933 & \color[RGB]{239,71,111}1 & \color[RGB]{239,71,111}0.71 \\ \color[RGB]{17,138,178}3 & \color[RGB]{17,138,178}3830847 & \color[RGB]{17,138,178}1 & \color[RGB]{17,138,178}5887416 & \color[RGB]{17,138,178}1 & \color[RGB]{17,138,178}\dots & \color[RGB]{17,138,178}4706637 & \color[RGB]{17,138,178}1 & \color[RGB]{17,138,178}0.70 \\ {\vdots} & {\vdots} & {\vdots} & {\vdots} & {\vdots} & {\vdots} & {\vdots} & {\vdots} & {\vdots} \\ \color[RGB]{6,214,160}1M & \color[RGB]{6,214,160}1487507 & \color[RGB]{6,214,160}1 & \color[RGB]{6,214,160}393580 & \color[RGB]{6,214,160}1 & \color[RGB]{6,214,160}\dots & \color[RGB]{6,214,160}1247545 & \color[RGB]{6,214,160}0 & \color[RGB]{6,214,160}0.72 \\ {\vdots} & {\vdots} & {\vdots} & {\vdots} & {\vdots} & {\vdots} & {\vdots} & {\vdots} & {\vdots} \\ \end{array} \]
It turns out that everyone got their point estimate exactly like you did. Each of them rolled their 7.23M-sided die 625 times. And after making their 625 calls, they reported their sample proportion.
You’re not the only one who missed, but you’re part of a pretty small club. 5% of the other polls got it wrong. But you are the only one who claimed they’d get within 1%. Everyone else claimed they’d get within roughly 4%. Most were right. About 95%. And a lot of them had worse point estimates than you.
What you have is a calibration problem.
Calibration using the Sampling Distribution
You’d have seen that your interval was miscalibrated if you’d known then what you know now: your competitors’ point estimates, which they got exactly the way you got yours, and who actually turned out, which means you could simulate even more polls if you wanted to.
You’d have drawn ± 1% intervals around all these point estimates.2 And noticed how few of these intervals cover the target. Only about 43% do. Or, leaving the target out of it, that only 43% of them cover the mean of all these estimates. Same thing. The mean and target are right on top of one another. Like this: |.
And you’d have known how to choose the right interval width. One that makes 95% of these intervals cover. Right?
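With that hindsight, the check takes a few lines of R. A simulation sketch, assuming the 70% turnout and 625 calls per poll from the story (the exact numbers printed will vary run to run):

```r
theta = 0.70   # the actual turnout -- known only after the election
n = 625        # calls per poll

# a million simulated polls: each reports its sample proportion
theta.hat = rbinom(1e6, n, theta) / n

# how often does a +/- 1% interval around the estimate cover the target?
coverage = mean(abs(theta.hat - theta) <= 0.01)
coverage     # well under half -- nowhere near 95%

# the half-width that would have made 95% of these intervals cover
half.width = quantile(abs(theta.hat - theta), 0.95)
half.width   # close to the +/- 4% your competitors used
```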
Quiz
I’ve drawn the sampling distribution of an estimator, 100 draws from it as ●s, and 3 interval estimates. One of these interval estimates is calibrated to have exactly 95% coverage. Which is it?
- A. The one around the ● on top.
- B. The one around the ● in the middle.
- C. The one around the ● on the bottom.
Hint. You can reach your twin with your arms if your twin can reach you with theirs.
Hint. Maybe you look like this ● and your twin like this |.
Calibration in Reality
But you didn’t know any of this. You just knew your own point estimate. Without all that post-election information, you didn’t know how to calibrate an interval estimate. But everyone else did. Their widths were almost the same as the width you’d choose now.
They didn’t know the point estimator’s actual sampling distribution, but they knew enough about it to choose the right interval width. That’s what we’re going to look into now. Over the course of this lecture and the next, we’ll work out what the sampling distribution of our estimator can look like and how what it does look like depends on the population. And we’ll see how to use that information to calibrate our intervals.
Sampling with Replacement
To calibrate our intervals, we need to understand the sampling distribution of our estimator. That means doing some probability calculations. The machinery isn’t complicated—we’ll count equally likely configurations and add up probabilities—but it gets notationally heavy when we apply it to the polling problem. So let’s warm up on something simpler first.
Warm-up: Card Probabilities
A deck of cards. A standard deck has 52 cards. If we shuffle well and draw one card, each of the 52 cards is equally likely. That’s our starting point: equally likely configurations.
Two Cards with Replacement
Now suppose we draw two cards with replacement—we draw one, note it, put it back, shuffle, and draw again. Each draw is equally likely to be any of the 52 cards, so there are \(52 \times 52 = 2704\) equally likely pairs.
Thinking about Marginalization
Notice what we did in the “same suit” calculation. We could have listed all 2704 pairs and counted which ones had matching suits. Instead, we grouped them: 676 pairs have matching suits (169 for each of the 4 suits), and the rest don’t. This grouping—summing probabilities over configurations that share some property—is marginalization. We’ll use it constantly.
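Here’s that grouping done by brute force in R: enumerate all 2704 equally likely pairs, then sum probabilities over the ones with matching suits.

```r
# one entry per card: its suit (13 cards per suit)
suits = rep(c('clubs', 'diamonds', 'hearts', 'spades'), each = 13)

# all 52 x 52 equally likely ordered pairs drawn with replacement
pairs = expand.grid(first = suits, second = suits, stringsAsFactors = FALSE)

# marginalize: each pair has probability 1/2704, so summing over
# matching-suit pairs is just counting them and dividing
p.match = mean(pairs$first == pairs$second)
p.match   # 676/2704 = 1/4
```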
Now let’s apply this same machinery—equally likely configurations, counting, marginalization—to the polling problem.
Finding the Distribution of \(Y_1\) in a Small Population
Let’s start with the distribution of \(Y_1\), the response to one call, when we sample with replacement. And to keep things concrete, let’s suppose we’re polling a population of size \(m=3\). We’re rolling a 3-sided die.3
The Population
| \(j\) | name |
|---|---|
| \(1\) | Rush |
| \(2\) | Mitt |
| \(3\) | Al |
The Joint
| \(p\) | \(J_1\) | \(Y_1\) |
|---|---|---|
| \(\color[RGB]{239,71,111}\frac13\) | \(1\) | \(\color[RGB]{239,71,111}0\) |
| \(\color[RGB]{239,71,111}\frac13\) | \(2\) | \(\color[RGB]{239,71,111}0\) |
| \(\color[RGB]{17,138,178}\frac13\) | \(3\) | \(\color[RGB]{17,138,178}1\) |
The Marginal
| \(p\) | \(Y_1\) |
|---|---|
| \(\color[RGB]{239,71,111}\frac23\) | \(\color[RGB]{239,71,111}0\) |
| \(\color[RGB]{17,138,178}\frac13\) | \(\color[RGB]{17,138,178}1\) |
We have two Nos (0s) and one Yes (1). The ‘Yes’ frequency in the population is \(\theta_1 = 1/3\). Here’s how we find the distribution of \(Y_1\).
We write the probability distribution of our die roll in a table. It has two columns: one for the roll \(J_1\) and another for its probability \(p\). It has \(m=3\) rows, one for each possible outcome of the roll. Each roll is equally likely, so the probability in each row is \(1/3\).
We add a column for the response to our call, \(Y_1\). Adding this column is ‘free’—it doesn’t change the row’s probability—because what we hear is determined by the roll. E.g. if we roll a 3, we’re going to hear Al say ‘Yes’. So the probability we roll a 3 and hear a ‘Yes’ is the same as the probability we roll a 3.
We marginalize over rolls to find the distribution of our call’s outcome. We can color each row according to the call’s outcome \(Y_1\), then sum the probabilities in each color. The probability of hearing ‘No’ is the probability that we roll a 1 or 2: \(1/3+1/3 = 2/3\). The probability of hearing ‘Yes’ is the probability that we roll a 3: \(1/3\). These probabilities match the frequencies of ‘No’ and ‘Yes’ in our population.
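The same tables are easy to build in code. A minimal sketch, with the joint as a data frame and marginalization as a sum over the rows in each group:

```r
y = c(0, 0, 1)   # responses for Rush, Mitt, Al
m = length(y)

# the joint distribution of the roll J1 and the response Y1
joint = data.frame(p = rep(1/m, m), J1 = 1:m, Y1 = y)

# marginalize: sum the probability over rows sharing each value of Y1
marginal = aggregate(p ~ Y1, data = joint, FUN = sum)
marginal   # P(Y1 = 0) = 2/3, P(Y1 = 1) = 1/3
```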
Generalizing to Larger Populations
Is this probabilities-match-frequencies phenomenon a fluke? Or is it a general rule? To find out, we’ll work through the same steps in the abstract, i.e. for a binary population \(y_1 \ldots y_m\) of arbitrary size \(m\). To ground us while we do it, we’ll think about our population of \(m=7.23M\) registered voters in Georgia. But we’ll handle the general case. Plugging in 7.23M ones and zeros wouldn’t exactly make things easier.
Finding the Distribution of \(Y_1\) in a Large Population
The Population
| \(j\) | \(y_j\) |
|---|---|
| \(1\) | 1 |
| \(2\) | 1 |
| \(3\) | 1 |
| \(4\) | 0 |
| ⋮ | ⋮ |
| \(m\) | 1 |
The Joint
| \(p\) | \(J_1\) | \(Y_1\) |
|---|---|---|
| \(1/m\) | \(1\) | \(1\) |
| \(1/m\) | \(2\) | \(1\) |
| \(1/m\) | \(3\) | \(1\) |
| \(1/m\) | \(4\) | \(0\) |
| ⋮ | ⋮ | ⋮ |
| \(1/m\) | \(m\) | \(1\) |
The Marginal
| \(p\) | \(Y_1\) |
|---|---|
| \(\underset{\color{gray}\approx 0.30}{\sum\limits_{j:y_j=0} \frac{1}{m}}\) | \(0\) |
| \(\underset{\color{gray}\approx 0.70}{\sum\limits_{j:y_j=1} \frac{1}{m}}\) | \(1\) |
We start by writing the probability distribution of our dice rolls. The person we call, \(J_1\), is equally likely to be anyone in the population of \(m \approx 7.23M\) people. It takes on each value \(1 \ldots m\) with probability \(1/m\).
We add a column for the response to our call, \(Y_1\). This is ‘for free’. The roll determines what we hear, so the probabilities don’t change. Still \(1/m\).
And we marginalize to find the distribution of \(Y_1\). We sum the probabilities in rows where \(Y_1=0\) and \(Y_1=1\). What we saw in our small population does generalize. The probability of hearing a response \(y \in \{0,1\}\) is the frequency of that response in the population. We’ll call these frequencies \(\theta_0\) and \(\theta_1\). This is a little redundant because \(\theta_0 = 1 - \theta_1\). Our estimation target is just the ‘Yes’ frequency \(\theta_1\). Some people call it the success rate.
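You can sanity-check the probabilities-match-frequencies rule on an arbitrary binary population. A sketch, with a randomly generated population standing in for the real one:

```r
set.seed(1)
m = 10000
y = rbinom(m, 1, 0.7)   # a stand-in binary population

# marginalize the uniform distribution on rolls over {j : y_j = 1}
p.yes = sum(rep(1/m, m)[y == 1])

# it matches the 'Yes' frequency theta_1 in the population
c(p.yes, mean(y))
```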
To find the distribution of the sum \(Y_1+Y_2\), we can follow the same steps. There’s a bit more to it, so we’ll break down the marginalization step to make it a little more tractable.
Exercise: Two Calls to A Small Population
| \(j\) | name | \(y_j\) |
|---|---|---|
| \(1\) | Rush | \(0\) |
| \(2\) | Mitt | \(0\) |
| \(3\) | Al | \(1\) |
- Make a table for the joint distribution of two rolls of a 3-sided die. Add columns for the responses.
- Partially marginalize to get the joint distribution of the responses. Add a column for their sum.
- Marginalize to get the distribution of the sum.
Aren’t you happy that, even though you can’t make a 3-sided die, I didn’t ask you to use a 4-sided one?
| \(p\) | \(J_1\) | \(J_2\) | \(Y_1\) | \(Y_2\) |
|---|---|---|---|---|
| \(\color[RGB]{239,71,111}1/9\) | \(1\) | \(1\) | \(0\) | \(0\) |
| \(\color[RGB]{239,71,111}1/9\) | \(1\) | \(2\) | \(0\) | \(0\) |
| \(\color[RGB]{17,138,178}1/9\) | \(1\) | \(3\) | \(0\) | \(1\) |
| \(\color[RGB]{239,71,111}1/9\) | \(2\) | \(1\) | \(0\) | \(0\) |
| \(\color[RGB]{239,71,111}1/9\) | \(2\) | \(2\) | \(0\) | \(0\) |
| \(\color[RGB]{17,138,178}1/9\) | \(2\) | \(3\) | \(0\) | \(1\) |
| \(\color[RGB]{17,138,178}1/9\) | \(3\) | \(1\) | \(1\) | \(0\) |
| \(\color[RGB]{17,138,178}1/9\) | \(3\) | \(2\) | \(1\) | \(0\) |
| \(\color[RGB]{6,214,160}1/9\) | \(3\) | \(3\) | \(1\) | \(1\) |
| \(p\) | \(Y_1\) | \(Y_2\) | \(Y_1 + Y_2\) |
|---|---|---|---|
| \(\color[RGB]{239,71,111}\frac{4}{9}\) | \(0\) | \(0\) | \(0\) |
| \(\color[RGB]{17,138,178}\frac{2}{9}\) | \(0\) | \(1\) | \(1\) |
| \(\color[RGB]{17,138,178}\frac{2}{9}\) | \(1\) | \(0\) | \(1\) |
| \(\color[RGB]{6,214,160}\frac{1}{9}\) | \(1\) | \(1\) | \(2\) |
| \(p\) | \(Y_1 + Y_2\) |
|---|---|
| \(\color[RGB]{239,71,111}\frac{4}{9}\) | \(0\) |
| \(\color[RGB]{17,138,178}\frac{4}{9}\) | \(1\) |
| \(\color[RGB]{6,214,160}\frac{1}{9}\) | \(2\) |
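You can check all three tables against a brute-force enumeration in R, marginalizing the nine equally likely roll pairs directly:

```r
y = c(0, 0, 1)   # Rush, Mitt, Al

# the joint: all 9 equally likely pairs of rolls, each with probability 1/9
rolls = expand.grid(J1 = 1:3, J2 = 1:3)
S = y[rolls$J1] + y[rolls$J2]   # the sum of responses for each pair

# marginalize: sum probability 1/9 over rows sharing each value of the sum
tapply(rep(1/9, nrow(rolls)), S, sum)   # 4/9, 4/9, 1/9 for sums 0, 1, 2
```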
From Sampling to Coin Flip Sums: Two Responses
The Joint
| \(p\) | \(J_1\) | \(J_2\) | \(Y_1\) | \(Y_2\) | \(Y_1 + Y_2\) |
|---|---|---|---|---|---|
| \(\frac{1}{m^2}\) | \(1\) | \(1\) | \(y_1\) | \(y_1\) | \(y_1+y_1\) |
| \(\frac{1}{m^2}\) | \(1\) | \(2\) | \(y_1\) | \(y_2\) | \(y_1+y_2\) |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | |
| \(\frac{1}{m^2}\) | \(1\) | \(m\) | \(y_1\) | \(y_m\) | \(y_1+y_m\) |
| \(\frac{1}{m^2}\) | \(2\) | \(1\) | \(y_2\) | \(y_1\) | \(y_2+y_1\) |
| \(\frac{1}{m^2}\) | \(2\) | \(2\) | \(y_2\) | \(y_2\) | \(y_2+y_2\) |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | |
| \(\frac{1}{m^2}\) | \(m\) | \(m\) | \(y_m\) | \(y_m\) | \(y_m+y_m\) |
Partially Marginalized
| \(p\) | \(Y_1\) | \(Y_2\) | \(Y_1 + Y_2\) |
|---|---|---|---|
| \(\textcolor[RGB]{239,71,111}{\sum\limits_{\substack{j_1,j_2 \\ y_{j_1},y_{j_2} = 0,0}} \frac{1}{m^2}}\) | \(0\) | \(0\) | \(\textcolor[RGB]{239,71,111}{0}\) |
| \(\textcolor[RGB]{17,138,178}{\sum\limits_{\substack{j_1,j_2 \\ y_{j_1},y_{j_2} = 0,1}} \frac{1}{m^2}}\) | \(0\) | \(1\) | \(\textcolor[RGB]{17,138,178}{1}\) |
| \(\textcolor[RGB]{17,138,178}{\sum\limits_{\substack{j_1,j_2 \\ y_{j_1},y_{j_2} = 1,0}} \frac{1}{m^2}}\) | \(1\) | \(0\) | \(\textcolor[RGB]{17,138,178}{1}\) |
| \(\textcolor[RGB]{6,214,160}{\sum\limits_{\substack{j_1,j_2 \\ y_{j_1},y_{j_2} = 1,1}} \frac{1}{m^2}}\) | \(1\) | \(1\) | \(\textcolor[RGB]{6,214,160}{2}\) |
Fully Marginalized
| \(p\) | \(Y_1 + Y_2\) |
|---|---|
| \(\textcolor[RGB]{239,71,111}{\sum\limits_{\substack{j_1,j_2 \\ y_{j_1}+y_{j_2} = 0}} \frac{1}{m^2}}\) | \(\textcolor[RGB]{239,71,111}{0}\) |
| \(\textcolor[RGB]{17,138,178}{\sum\limits_{\substack{j_1,j_2 \\ y_{j_1}+y_{j_2} = 1}} \frac{1}{m^2}}\) | \(\textcolor[RGB]{17,138,178}{1}\) |
| \(\textcolor[RGB]{6,214,160}{\sum\limits_{\substack{j_1,j_2 \\ y_{j_1}+y_{j_2} = 2}} \frac{1}{m^2}}\) | \(\textcolor[RGB]{6,214,160}{2}\) |
Two-Step Approach
To find the distribution of a sum of two responses, \(Y_1+Y_2\), we do the same thing. We start with the joint distribution of two dice rolls. Our rolls are equally likely to be any pair of numbers in \(1\ldots m\). And there are \(m^2\) pairs, so the probability is \(1/m^2\) for each pair. Then we add columns that are functions of the rolls: \(Y_1\), \(Y_2\), and \(Y_1+Y_2\). This is ‘for free’ as before. If we know who we’re calling, we know the responses we’ll hear and their sum.
Then we marginalize to find the distribution of the sum. We’ll do this in two steps.
- We partially marginalize over \(J_1,J_2\), to find the joint distribution of the response sequence \(Y_1, Y_2\).
- We fully marginalize the result to find the distribution of \(Y_1+Y_2\).
Step 1. Partial Marginalization
To sum over pairs of responses, we sum over one response then the other. This is exactly like what we did for one response. Except twice. And what we get is pretty similar too.
The probability of observing the pair of responses is determined by their frequencies in the population. It’s the product of the frequencies of those responses. For ‘Yes,Yes’ it’s \(\theta_1^2\). For ‘No,No’ it’s \(\theta_0^2\). And for ‘Yes,No’ and ‘No,Yes’ it’s \(\theta_1\theta_0\).
This is, for what it’s worth, the product of the marginal probabilities of those responses. When this happens for any pair of random variables, we say they are independent. More on that soon. And the probability of any response pair \(a,b\) depends only on its sum \(a+b\).
\[ \begin{aligned} \sum\limits_{\substack{j_1,j_2\\ y_{j_1},y_{j_2} = a,b}} \frac{1}{m^2} &= \sum\limits_{\substack{j_1 \\ y_{j_1}=a}} \qty{ \sum\limits_{\substack{j_2 \\ y_{j_2}=b}} \frac{1}{m^2} } \\ &= \sum\limits_{\substack{j_1 \\ y_{j_1}=a}} \qty{ m_b \times \frac{1}{m^2} } \qfor m_y = \sum\limits_{j:y_j=y} 1 \\ &= m_a \times m_b \times \frac{1}{m^2} = \theta_a \times \theta_b \qfor \theta_y = \frac{m_y}{m} \\ &= \theta_1^{s} \theta_0^{2-s} \qfor s = a+b \end{aligned} \]
Step 2. Full Marginalization
To calculate the marginal probability that \(Y_1+Y_2=s\), we sum the probabilities of all pairs with that sum. This is pretty easy because there aren’t too many pairs. And I’ve color coded the pairs to make it even easier.
From Sampling to Coin Flip Sums: \(n\) Responses.
To find the distribution of a sum \(Y_1 + \ldots + Y_n\), we do the same thing. We start by writing out the joint distribution of \(n\) dice rolls. We’re equally likely to roll any sequence of \(n\) numbers in \(1\ldots m\). And there are \(m^n\) sequences, so the probability is \(1/m^n\) for each sequence. Then we add columns for the corresponding responses ‘for free’. And the sums.
Then we marginalize in two steps.
- Sum over roll sequences leading to the same response sequence \(a_1 \ldots a_n\).
- Sum over response sequences leading to the same sum \(s=a_1+\ldots+a_n\).
Step 1. Partial Marginalization
To sum over roll sequences, we sum over one roll after another. This is exactly like what we did for one response. Except over and over. You can think of this sum ‘inside out.’
Suppose we’re calculating the probability of the response sequence \(a_1 \ldots a_n = 0,0,...,0,0\).
- We consider any ‘head’ \(j_1\ldots j_{n-1}\) where \(y_{j_1} \ldots y_{j_{n-1}}=0,0,...0\) and think about choosing the last element. No matter what the head is, there are \(m_0\) ways to complete it by choosing \(j_{n}\) with \(y_{j_n}=0\).
- We repeat with shortened ‘head’ \(j_1 \ldots j_{n-2}\) to consider the choice of \(j_{n-1}\). We have \(m_0\) choices and our last step tells us that each choice results in \(m_0\) call sequences, so there are \(m_0^2\) sequences with head \(j_1 \ldots j_{n-2}\) ending in \(0,0\).
- Repeat \(n-2\) more times.
What we get is pretty similar to the two-response case. The probability of observing the sequence is determined by the frequency of Yeses and Nos in the population. It’s the product of the frequencies of those responses. For ‘Yes,Yes,…,Yes’ it’s \(\theta_1^n\). For ‘No,No,…,No’ it’s \(\theta_0^n\). And if we have \(s\) Yeses and therefore \(n-s\) Nos, it’s \(\theta_1^s\theta_0^{n-s}\).
This is, again, the product of the marginal probabilities of those \(n\) responses.
\[ \begin{aligned} \sum_{\substack{j_1 \ldots j_n \\ y_{j_1} \ldots y_{j_n} = a_1 \ldots a_n}} \frac{1}{m^n} &= \sum_{\substack{j_1 \ldots j_{n-1} \\ y_{j_1} \ldots y_{j_{n-1}} = a_1 \ldots a_{n-1}}} \sum_{\substack{j_n \\ y_{j_n}=a_n}} \frac{1}{m^n} \\ &= \sum_{\substack{j_1 \ldots j_{n-1} \\ y_{j_1} \ldots y_{j_{n-1}} = a_1 \ldots a_{n-1}}} m_{a_n} \times \frac{1}{m^n} \\ &= \sum_{\substack{j_1 \ldots j_{n-2} \\ y_{j_1} \ldots y_{j_{n-2}} = a_1 \ldots a_{n-2}}} \sum_{\substack{j_{n-1} \\ y_{j_{n-1}}=a_{n-1}}} m_{a_n} \times \frac{1}{m^n} \\ &= \sum_{\substack{j_1 \ldots j_{n-2} \\ y_{j_1} \ldots y_{j_{n-2}} = a_1 \ldots a_{n-2}}} m_{a_{n-1}} m_{a_n} \times \frac{1}{m^n} \\ &= m_{a_1} \ldots m_{a_{n}} \times \frac{1}{m^n} \qqtext{after repeating $n-2$ more times} \\ &= \theta_{a_1} \ldots \theta_{a_n} \\ &= \prod_{i:a_i=1} \theta_1 \prod_{i:a_i=0} \theta_0 = \theta_1^{s} \theta_0^{n-s} \qfor s = a_1 + \ldots + a_n \end{aligned} \]
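If the inside-out sum feels slippery, you can verify its conclusion by enumeration in a small case, e.g. \(n=3\) calls to our \(m=3\) population, checking that every response sequence with \(s\) Yeses has probability \(\theta_1^s\theta_0^{n-s}\):

```r
y = c(0, 0, 1); m = length(y); n = 3
theta1 = mean(y); theta0 = 1 - theta1

# all m^n equally likely roll sequences, each with probability 1/m^n
rolls = expand.grid(j1 = 1:m, j2 = 1:m, j3 = 1:m)
responses = apply(rolls, 1, function(j) paste(y[j], collapse = ''))

# marginalize: the probability of each response sequence
probs = tapply(rep(1/m^n, m^n), responses, sum)

# compare with theta1^s * theta0^(n-s), s = number of ones in the sequence
s = sapply(strsplit(names(probs), ''), function(a) sum(a == '1'))
cbind(enumerated = probs, formula = theta1^s * theta0^(n - s))
```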
Step 2. Full Marginalization
The full marginalization step is a bit trickier. But we can take advantage of our partial marginalization. To calculate the probability that the sum takes on the value \(s\), we order our sum of probabilities thoughtfully. We sum over response sequences \(a_1 \ldots a_n\) with \(s\) Yeses and \(n-s\) Nos. And within that, we sum over calls where we’d hear those responses.
The inner sum we’ve done. That was our partial marginalization. And the outer sum boils down to counting the number of sequences with \(s\) Yeses. The probability of all these sequences is the same: \(\theta_1^s\theta_0^{n-s}\). So summing is counting and multiplying. You may have done the counting part in high school in a unit on ‘permutations and combinations’. In any case, that count has a name. It’s spoken ‘\(n\) choose \(s\)’ and written \(\binom{n}{s}\).
\[ \begin{aligned} \sum\limits_{\substack{j_1 \ldots j_n \\ y_{j_1} + \ldots + y_{j_n} = s}} \frac{1}{m^n} &= \sum\limits_{\substack{a_1 \dots a_n \\ a_1 + \ldots + a_n = s}} \ \ \sum_{\substack{j_1 \ldots j_n \\ y_{j_1} \ldots y_{j_n} = a_1 \ldots a_n}} \frac{1}{m^n} \\ & =\sum\limits_{\substack{a_1 \dots a_n \\ a_1 + \ldots + a_n = s}} \theta_1^{s}\theta_0^{n-s} \\ &= \binom{n}{s} \theta_1^{s} \theta_0^{n-s} \qqtext{ where } \binom{n}{s} = \sum\limits_{\substack{a_1 \dots a_n \\ a_1 + \ldots + a_n = s}} 1 \end{aligned} \]
Using R
\[ P\qty(\sum_{i=1}^n Y_i = s) = \binom{n}{s} \theta_1^{s}\theta_0^{n-s} \ \ \text{ where } \ \ \binom{n}{s} \text{ is the number of binary sequences $a_1 \ldots a_n$ summing to $s$.} \]
We don’t actually have to do all this sequence-summing-to-\(s\) counting ourselves. The choose function in R will do it for us: \(\binom{n}{s}\) is choose(n,s). The dbinom function will give us the whole probability: \(\binom{n}{s}\theta_1^s\theta_0^{n-s}\) is dbinom(s, n, theta_1). And the rbinom function will draw samples from this distribution: rbinom(10000, n, theta_1) gives us 10,000 draws.
```r
# assumes ggplot2 is loaded and pollc (a plot color) is defined, as elsewhere in these notes
library(ggplot2)

theta_1 = .7
n = 625
p = dbinom(0:n, n, theta_1)     # the whole probability distribution
S = rbinom(10000, n, theta_1)   # 10,000 draws from it
ggplot() + geom_area(aes(x=0:n, y=p), color=pollc, alpha=.2) +
  geom_bar(aes(x=S, y=after_stat(prop)), alpha=.3) +
  geom_point(aes(x=S[1:100], y=max(p)*seq(0,1,length.out=100)),
             color='purple', alpha=.2)
```

The Binomial Distribution
\[ \begin{aligned} &\overset{\color{gray}=P\qty(\frac{1}{n}\sum_{i=1}^n Y_i = \frac{s}{n})}{P\qty(\sum_{i=1}^n Y_i = s)} = \binom{n}{s} \theta_1^{s}\theta_0^{n-s} \\ &\qfor n =625 \\ &\qand \theta_1 \in \{\textcolor[RGB]{239,71,111}{0.68}, \textcolor[RGB]{17,138,178}{0.7}, \textcolor[RGB]{6,214,160}{0.72} \} \end{aligned} \]
These functions have ‘binom’ in their name because we call this distribution the Binomial distribution. The Binomial distribution on \(n\) trials with success probability \(\theta\) is our name for the distribution of the sum of \(n\) independent binary random variables with probability \(\theta\) of being \(1\). E.g. the number of heads in \(n\) coin flips is Binomial with \(n\) trials and success probability \(1/2\).
We’ve shown that’s the sampling distribution of the sum of responses, \(Y_1 + \ldots + Y_n\), when we sample with replacement from a population of binary responses \(y_1 \ldots y_m\) in which \(\theta\) is the frequency of ones. We’re interested in the mean of responses, so we divide by \(n\): \(\color{gray} \sum_{i=1}^n Y_i = s \ \text{ when } \ \frac{1}{n}\sum_{i=1}^n Y_i = s/n\).
It’s easy to estimate—and talk about estimating—this sampling distribution. It depends only on one thing we don’t know: the population frequency \(\theta\). And that’s exactly the thing we’re trying to estimate anyway. That’s why we’ve started here—the case of sampling with replacement from a population of binary responses.
Payoff
To estimate this sampling distribution, you plug your point estimate \(\hat\theta\) into the Binomial formula. \[ \hat P\qty(\sum_{i=1}^n Y_i = s) = \binom{s}{n} \hat\theta^{s} (1-\hat\theta)^{n-s} \qqtext{ estimates } P\qty(\sum_{i=1}^n Y_i = s) = \binom{s}{n} \theta^{s} (1-\theta)^{n-s} \]
To calibrate your interval estimate, you use rbinom to draw 10,000 samples from this estimate of the sampling distribution. Even remembering to divide by \(n\). And use the function width from the Week 1 Homework to find an interval that covers 95% of them.
```r
theta.hat = mean(Y)                          # Y: your 625 recorded responses
samples = rbinom(10000, n, theta.hat) / n    # don't forget to divide by n
interval.width = width(samples, alpha=.05)   # width: from the Week 1 Homework
```

You nail it. Your interval covers the estimation target \(\theta\), just like 95% of your competitors’ intervals do.
It’s not just that you’ve widened your interval enough. You’ve widened it almost exactly the right amount. Just like your competitors. Almost as if you all knew how to estimate your estimator’s sampling distribution.
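The width function itself comes from the Week 1 Homework, so it isn’t shown here. As a stand-in, here’s a minimal sketch of one way such a function might work—taking the width of the central interval covering a \(1-\alpha\) fraction of the draws. The homework version may differ.

```r
# hypothetical stand-in for the Week 1 Homework's width function
width = function(draws, alpha = 0.05) {
  q = quantile(draws, c(alpha/2, 1 - alpha/2))   # central 1-alpha interval
  unname(q[2] - q[1])
}

# calibrating against the estimated sampling distribution, as above
n = 625
theta.hat = 0.72
samples = rbinom(10000, n, theta.hat) / n
width(samples, alpha = .05)   # roughly 0.07, i.e. about +/- 3.5%
```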
Sampling without Replacement
Warm-up: Card Probabilities without Replacement
Now let’s revisit the card problems, but this time drawing without replacement—we don’t put cards back after drawing them.
Two Cards without Replacement
When we draw two cards without replacement, there are \(52 \times 51 = 2652\) equally likely ordered pairs—52 choices for the first card, then 51 remaining choices for the second.
The Pattern
Notice what’s different. With replacement, drawing a heart doesn’t change the probability that the next card is a heart—it’s still \(13/52 = 1/4\). Without replacement, drawing a heart does change things—now there are only 12 hearts among 51 remaining cards, so the probability is \(12/51\), slightly less than \(1/4\).
This “using up” of options is what makes without-replacement calculations more involved. Let’s see how it plays out in the polling context.
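Here’s the ‘using up’ made concrete: a sketch enumerating the 2652 ordered pairs of distinct cards and computing the conditional probability directly.

```r
suits = rep(c('clubs', 'diamonds', 'hearts', 'spades'), each = 13)

# all 52 x 51 equally likely ordered pairs of distinct cards
pairs = expand.grid(first = 1:52, second = 1:52)
pairs = pairs[pairs$first != pairs$second, ]   # drop the 52 repeats: 2652 rows

# among pairs whose first card is a heart, how often is the second one too?
first.heart = pairs[suits[pairs$first] == 'hearts', ]
mean(suits[first.heart$second] == 'hearts')    # 12/51, just under 1/4
```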
Marginalization
To find the distribution of a sum \(Y_1 + \ldots + Y_n\) when we sample without replacement we do the same thing. We start by writing out the joint distribution of \(n\) dice rolls. We’re equally likely to pull any sequence of \(n\) distinct numbers in \(1\ldots m\). There are \(m \times (m-1) \times \ldots \times (m-n+1)=m!/(m-n)!\) sequences, so the probability is \((m-n)!/m!\) for each sequence. Then we add the responses and sums as columns. That doesn’t change.
Then we marginalize in two steps.
- Sum over roll sequences leading to the same response sequence \(a_1 \ldots a_n\).
- Sum over response sequences leading to the same sum \(s=a_1+\ldots+a_n\).
Step 1. Partial Marginalization
To sum over roll sequences, we sum over one roll after another. Suppose we’re calculating the probability of the response sequence \(a_1 \ldots a_n = 0,0,...,0,0\).
- We consider any ‘head’ \(j_1\ldots j_{n-1}\) where \(y_{j_1} \ldots y_{j_{n-1}}=0,0,...0\) and think about choosing the last element. No matter what the head is, we’ve ‘used up’ \(n-1\) people with response \(y_j=0\). So there are \(m_0-(n-1)=m_0-n+1\) ways to complete the sequence by choosing an as-yet unused \(j_{n}\) with \(y_{j_n}=0\).
- We repeat with shortened ‘head’ \(j_1 \ldots j_{n-2}\) to consider the choice of \(j_{n-1}\). We have \(m_0-n+2\) as-yet unused choices and our last step tells us that each choice results in \(m_0-(n-1)\) call sequences, so there are \((m_0 - n + 1)(m_0 - n + 2)\) sequences with head \(j_1 \ldots j_{n-2}\) ending in \(0,0\).
- Repeat \(n-2\) more times. We get \((m_0 - n + 1) \times \ldots \times m_0 = m_0!/(m_0-n)!\) rows.
When we work through the same argument for sequences with a mix of 0s and 1s, we see that a call in the head forecloses possibilities for later calls with the same response only. Thus, when we have \(s\) ones and \(n-s\) zeros, we can think of ourselves as choosing calls with response zero and response one separately. We have \(m_0!/(m_0-(n-s))! \times m_1!/(m_1-s)!\) rows. Multiplying by the probability of each row, we get the probability of seeing a response sequence with \(s\) ones.
\[ \frac{m_0!}{(m_0-(n-s))!} \times \frac{m_1!}{(m_1-s)!} \times \frac{(m-n)!}{m!} \]
Step 2. Full Marginalization
To calculate the probability that the sum takes on the value \(s\), we sum over response sequences with \(s\) ones. Just like in the with-replacement case, the probability of all these sequences is the same—it depends only on how many ones and zeros are in the sequence, not their order. So summing is counting and multiplying.
The count is the same as before: the number of binary sequences of length \(n\) with \(s\) ones is \(\binom{n}{s}\). Multiplying, we get
\[ P\qty(\sum_{i=1}^n Y_i = s) = \binom{n}{s} \times \frac{m_0!}{(m_0-(n-s))!} \times \frac{m_1!}{(m_1-s)!} \times \frac{(m-n)!}{m!}. \]
This is the Hypergeometric distribution. It’s a bit uglier than the Binomial, but the same R tricks work.
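You can check this formula against R’s built-in dhyper on a small example. A sketch, working on the log scale so the factorials don’t overflow, and sticking to feasible values of \(s\):

```r
m1 = 7; m0 = 5; m = m1 + m0; n = 6

# feasible sums: we can't draw more ones than m1 or more zeros than m0
s = max(0, n - m0):min(n, m1)

# our formula, on the log scale
log.p = lchoose(n, s) +
  lfactorial(m0) - lfactorial(m0 - (n - s)) +
  lfactorial(m1) - lfactorial(m1 - s) +
  lfactorial(m - n) - lfactorial(m)

cbind(ours = exp(log.p), R = dhyper(s, m1, m0, n))   # the columns agree
```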
Using R
The dhyper function gives us the probability: dhyper(s, m1, m0, n) is \(P(\sum_{i=1}^n Y_i = s)\) when we sample \(n\) times without replacement from a population with \(m_1\) ones and \(m_0\) zeros. The rhyper function draws samples: rhyper(10000, m1, m0, n) gives us 10,000 draws.
```r
# assumes ggplot2 is loaded and pollc (a plot color) is defined, as elsewhere in these notes
library(ggplot2)

theta_1 = .7
m = 1000
m1 = m * theta_1
m0 = m * (1-theta_1)
n = 100
p = dhyper(0:n, m1, m0, n)       # the whole probability distribution
S = rhyper(10000, m1, m0, n)     # 10,000 draws from it
ggplot() + geom_area(aes(x=0:n, y=p), color=pollc, alpha=.2) +
  geom_bar(aes(x=S, y=after_stat(prop)), alpha=.3) +
  geom_point(aes(x=S[1:100], y=max(p)*seq(0,1,length.out=100)),
             color='purple', alpha=.2)
```

The Hypergeometric Distribution
\[ \begin{aligned} &P\qty(\sum_{i=1}^n Y_i = s) = \binom{n}{s} \frac{m_0!}{(m_0-n+s)!} \frac{m_1!}{(m_1-s)!} \frac{(m-n)!}{m!} \\ &\qfor n = 100, \ m = 200 \\ &\qand \theta_1 \in \{\textcolor[RGB]{239,71,111}{0.68}, \textcolor[RGB]{17,138,178}{0.7}, \textcolor[RGB]{6,214,160}{0.72} \} \end{aligned} \]
These functions have ‘hyper’ in their name because we call this distribution the Hypergeometric distribution. The Hypergeometric distribution describes the number of successes when we draw \(n\) items without replacement from a population containing \(m_1\) successes and \(m_0\) failures.
We’ve shown that’s the sampling distribution of the sum of responses, \(Y_1 + \ldots + Y_n\), when we sample without replacement from a population of binary responses \(y_1 \ldots y_m\) in which \(m_1\) are ones.
It’s a bit harder to estimate than the Binomial because it depends on two things we don’t know: the population size \(m\) and the number of ones \(m_1\). We usually know \(m\), though. And if we do, we can estimate \(m_1\) by \(m\hat\theta\) where \(\hat\theta\) is our sample proportion. That’s usually good enough.
The Hypergeometric looks a lot like the Binomial when \(n\) is small relative to \(m\). That’s because sampling without replacement is almost like sampling with replacement when you’re unlikely to pick the same person twice anyway. When \(n\) is a meaningful fraction of \(m\), the Hypergeometric is narrower—you get better precision by not wasting calls on people you’ve already talked to.
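You can see both claims in a quick comparison. A sketch, assuming \(\theta_1 = 0.7\) and a population of \(m=1000\), with \(n\) at half the population so the narrowing is visible:

```r
theta_1 = 0.7; m = 1000; n = 500   # sampling half the population
m1 = m * theta_1; m0 = m - m1

# sample proportions under the two sampling schemes
with.repl    = rbinom(10000, n, theta_1) / n
without.repl = rhyper(10000, m1, m0, n) / n

c(sd(with.repl), sd(without.repl))
# without replacement is narrower, by a factor of about sqrt((m-n)/(m-1))
```

Shrink \(n\) toward a tiny fraction of \(m\) and the two spreads converge, which is why the Binomial served us fine in the polling story.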