2 Sampling

Summary

In this chapter, we’ll explore different sampling methods and their implications for statistical inference. If you’re in a room with at least six people who are willing to talk to you, that’s great! That’ll be your population. You’ll be polling them to find out which of two things they tend to prefer. Don’t worry: you won’t need to talk to all of them. That would be too much work.

If you’re not in a room with six people, you’ll have to pretend. Here’s a way to do it. Put six coins in a bag, shake it, and arrange them into a line without looking. That’s who you’re polling: coin 1 … coin 6. And you’ll be asking whether they’d choose to sit heads-up or tails-up.

Sampling is a way to cut down on the effort we put into surveying our population at the cost of a bit of accuracy. In short, we survey some fraction of the people in our population instead of the whole thing. When it’s done carefully, we can do some math to reason about how accurate our estimate is.

Often, people aren’t clear about how they actually get their sample, or they do the math as if they’d sampled some other way. This might be okay if doing the math ‘right’ and doing it ‘wrong’ lead to the same inferences. In that case, you might say you’re using a simplifying approximation. Today, we’ll look into how much the way we sample affects what we learn about our population.

Our Population

We need a way to represent our population mathematically. When doing math, we tend to refer to a version of the population without irrelevant detail. We boil it down to a list of numbers. In this case, preferences. We’ll use 0 for thing A and 1 for thing B.

\[ y_1 \ldots y_m \qqtext{ is the standard notation. } \]

To get a sense of what a population looks like, we can plot one. Since we’re not going to talk to our whole population, we’ll have to make one up.

In the code block above, we define our population with $m=6$ people, using the letter $j$ to count them out and the letter $y$ to record their preferences. Then we plot our population as a set of 6 dots. A dot’s x-coordinate shows the person’s position in our list; the first person is on the left and the last is on the right. The dot’s y-coordinate shows their preference.

What We’re Estimating

We want to estimate the proportion of these people who prefer B, i.e., who have $y_j=1$. While this could be easy since we’ve just looked at all the $y_j$s, we’ll make it hard for ourselves by acting as if we haven’t. This practice of fake data simulation is an important tool when understanding methods. This is what we do.

1. We make up a fake population.
2. We run a fake (simulated) study on them exactly like the one we’re thinking about running on our actual population.
3. We compare the resulting estimate to the thing we wanted to estimate.

Because it’s fake, we get to poll the whole population so we know what the right answer is (if we don’t, we don’t have a well-defined question) and we can run the study over and over again to look at the distribution of estimates.

If we’re not satisfied with that distribution, that’s evidence it’s a study we shouldn’t be running. Once we’ve done this for a few study designs and seen what works — or at least what doesn’t — we’ll choose one to use on our actual population. In mathematical notation, our estimand looks like this. \[ \text{estimand} = \frac{1}{m}\sum_{j=1}^m y_j \]
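As a minimal sketch in base R, assuming the made-up population of three A-preferrers and three B-preferrers that this chapter's code uses, the estimand is just the mean of the 0/1 preferences.

```r
# A made-up population of m = 6 people; y = 1 means "prefers B".
pop = data.frame(j = 1:6,
                 y = c(0, 1, 0, 1, 0, 1))
m = nrow(pop)

# The estimand: the proportion of the population that prefers B.
estimand = sum(pop$y) / m   # the same thing as mean(pop$y)
estimand
```

We only get to compute this because the population is fake. With a real population, this is the number we're trying to learn.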

Estimating It

Let’s assume we’ve been given a sample. Some list of preferences. \[ Y_1 \ldots Y_n \qqtext{ is the standard notation } \]

We’ll take our plot of the population and color in the dots we talked to.

\[ \text{estimator} = \frac{1}{n}\sum_{i=1}^n Y_i \qqtext{ where } Y_1 \ldots Y_n \qqtext{ is our sample } \]

The horizontal line shows the mean of our sample $Y_1 \ldots Y_n$. It’s the fraction of times we heard ‘I prefer B’ when asking someone’s preference.
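Here is a small sketch of the estimator in base R. It assumes, purely for illustration, that our three calls happened to reach persons 1, 4, and 4 in the made-up population of this chapter.

```r
pop = data.frame(j = 1:6, y = c(0, 1, 0, 1, 0, 1))

# Suppose (hypothetically) our three calls reached persons 1, 4, and 4.
J = c(1, 4, 4)
Y = pop$y[J]             # what we heard on each call
n = length(Y)

# The estimator: the fraction of calls on which we heard 'I prefer B'.
estimator = sum(Y) / n   # the same thing as mean(Y)
estimator
```

The height of the horizontal line in the plot is this number.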

Understanding the Notation

People often talk as if both $Y_i$ and $y_j$ are the same kind of thing (e.g., people’s preferences). This can be a source of confusion because they aren’t. To make this clear, let’s be concrete.

Suppose we find out people’s preferences by calling them.

$y_j$ is a person’s preference. A real person, e.g., the $j$th one in the phone book.
$Y_i$ is an utterance. It’s what we hear on our $i$th call.

In this class, no matter what the context, we’ll say ‘$i$th call’ and ‘$j$th person’. It could be a quality control application in a brewery where we’re opening and pouring a few beers to check that we haven’t overcarbonated. In that case, our $j$th person is a can and our $i$th call is a glass.

Using different notation for the sample and the population helps us tell them apart. Typically, people use the same letter in different cases to communicate that each (big) $Y$ is somebody’s (little) $y$, while remembering that $Y_1$ and $y_1$ are different things. In this book, we’ll also count them out with different letters.

$Y_i$ for $i$ from $1$ to $n$ is our sample.
$y_j$ for $j$ from $1$ to $m$ is our population.

This’ll help us keep track of whether we’re referring to a person or an utterance when we count. That’s not a particularly popular convention, so you shouldn’t expect to see it done elsewhere, but it can be helpful when you’re first learning about this stuff.

Sampling

We’ll repeat the sampling process many times to understand the distribution of our estimates.

To do this efficiently, we’ll use the function map_vec from the package purrr. This function takes two arguments — a list and a function — and calls the function on every element of the list, returning a vector of the results. This saves us from writing a for loop and handling the ‘filing’ ourselves.

We can also use an anonymous function to make it a one-liner:

We’ll write a function that … 1. Takes as input a population. 2. Samples the population. 3. Returns the sample’s mean—an estimate of the population mean.

We’ll run it R times on our population by mapping our function over a list containing R copies of the population.
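To make the pattern concrete, here is a minimal sketch in base R. It uses vapply as a stand-in for map_vec so that it runs without any packages. The names estimate.once and copies are made up for this illustration, and the sampling step shown (with replacement) is just one of the designs discussed below.

```r
set.seed(1)  # arbitrary seed so the run is repeatable

pop = data.frame(j = 1:6, y = c(0, 1, 0, 1, 0, 1))
n = 3     # calls per simulated study
R = 1000  # number of simulated studies

# 1. Takes a population, 2. samples it, 3. returns the sample mean.
estimate.once = function(population) {
  J = sample(1:nrow(population), n, replace = TRUE)
  mean(population$y[J])
}

# A list containing R copies of the population ...
copies = rep(list(pop), R)
# ... and one estimate per copy. vapply plays the role of map_vec here:
# it calls the function on each list element and returns a numeric vector.
estimates = vapply(copies, estimate.once, numeric(1))

length(estimates)
head(estimates)
```

Other sampling designs only change the line that draws J. The mapping step stays the same.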

All this above is just so the code you’ll be seeing isn’t a complete mystery. For now, we won’t look at it too closely. We’ll just run it and see what happens. We’ll come back to what’s going on in the code in Chapter 4.

Sampling with Replacement

To sample with replacement, we roll a ($m$-sided) die $n$ times. Since we’re using $m=6$, it’s the kind of die you see a lot. We’re rolling $n=3$ times. The first roll tells us who we call first, the second who we call second, etc. If our $i$th roll is $4$, $Y_i=y_4$. You can also pick balls numbered 1…m out of a hat. Or an urn. Whatever is at hand. But you have to replace the ball you picked before you pick again.

Do It Yourself.
Have R Do It

Roll a die or pick a ball $n=3$ times. Color in the dots you talked to. If you talked to the same one multiple times, draw in another copy of the dot next to it. Then draw a horizontal line indicating your estimate: the mean of your sample $Y_1 \ldots Y_n$. I tend to use blue when I draw estimates in this book.

You can use the built-in whiteboard tool to draw on the image above with your stylus, mouse, trackpad, etc. If you have a tablet with an active stylus like an Apple Pencil, you should be able to just draw with it. If you’re using a mouse, trackpad, or finger, it’s a little more complicated. Double-click/double-tap on the image to open the whiteboard tool.¹

If you prefer to let R draw for you, change J in the code block above to the list of numbers you rolled then rerun the code block. It’ll color your dots and draw your line for you.

We can use R to repeat this sampling process many times. The built-in function sample rolls the die for us. We’ll look more closely at how to write code to do this later in Chapter 4.

What we see above is the result of repeating this $R=1000$ times. Each of the $1000$ estimates we get is shown as a purple dot. If we estimate $2/6$ 200 times, we’ll see 200 dots in a column around $2/6$. So that we don’t have to count dots ourselves, we use bars to show what fraction of the 1000 dots is in each column. We call this our estimator’s sampling distribution. Why are there four bars?
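The bar heights described above can be tabulated directly. This is a rough sketch in base R, assuming the made-up population from this chapter and an arbitrary random seed, so the exact fractions you see will differ from the plot.

```r
set.seed(2)  # arbitrary seed so the run is repeatable
y = c(0, 1, 0, 1, 0, 1)
m = length(y); n = 3; R = 1000

# Each replication: roll an m-sided die n times, then average what we hear.
estimates = replicate(R, mean(y[sample(1:m, n, replace = TRUE)]))

# Fraction of the R estimates landing in each column (the bar heights).
# We tabulate the number of 'B' responses (0..n) to avoid comparing decimals.
counts = round(estimates * n)
bar.heights = table(factor(counts, levels = 0:n)) / R
bar.heights
```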

If you typed in your sample on the last slide, you should see a blue dot indicating your estimate. If you didn’t, you can use the whiteboard tool to draw one in yourself. There’s nothing special about that dot other than the color. It’s just the one you happened to get when you went through the process once. You could draw 1000 dots too and you’d get essentially the same picture you see above. It’d just take a while.

It might feel wasteful to use a sampling method in which we might call the same person twice. If you were willing to make three calls, you could’ve gained more information. But there is something to like. The math is easy. We say the utterances $Y_1 \ldots Y_n$ are independent and identically distributed.

Independent means that each call is a fresh start. What you hear on the first one has no impact on what you’ll hear on the second.
Identically distributed means that we do the same thing each call. We’re rolling the same die and using the same list of people.

A Question about Independence
The Answer

It can feel a little counterintuitive that this process gives you independence. After all, you might place two calls to the same person. And the second time you wouldn’t have to ask. You’d know their response. Why, informally, does this not make the utterances $Y_1 \ldots Y_n$ dependent?

Because what you gain from that first call is knowledge about a person, not a call. Ask yourself whether, if you already knew the preferences $y_1 \ldots y_m$ of the whole population, you’d know more about what you’d hear on your second call after placing your first one. If you wouldn’t, $Y_1$ and $Y_2$ are independent.
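One way to check this informally is to list every equally likely pair of die rolls and see whether knowing the first utterance changes the chances for the second. The sketch below does that in base R for the made-up population used in this chapter. It takes the point of view of someone who already knows $y_1 \ldots y_m$.

```r
y = c(0, 1, 0, 1, 0, 1)
m = length(y)

# All m^2 equally likely outcomes of two rolls of the die.
rolls = expand.grid(J1 = 1:m, J2 = 1:m)
Y1 = y[rolls$J1]   # what we hear on call 1
Y2 = y[rolls$J2]   # what we hear on call 2

# Chance of hearing 'B' on call 2: overall, and given each call-1 response.
p.B            = mean(Y2 == 1)
p.B.given.Y1.0 = mean(Y2[Y1 == 0] == 1)
p.B.given.Y1.1 = mean(Y2[Y1 == 1] == 1)
c(p.B, p.B.given.Y1.0, p.B.given.Y1.1)
```

All three numbers agree, so hearing the first response tells this person nothing new about the second.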

And there are four bars because we can hear ‘I prefer B’ 0,1,2, or 3 times when we make 3 calls. We’ll see 0,1,2, or 3 dots in each column.
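Because the calls are independent and identically distributed, the number of ‘I prefer B’ responses in $n$ calls follows a binomial distribution, so we can also compute the four bar heights exactly. This is a small sketch assuming the made-up population used in this chapter, using the built-in function dbinom.

```r
y = c(0, 1, 0, 1, 0, 1)
n = 3
theta = mean(y)   # chance that a single call hears 'B'

k = 0:n           # possible numbers of 'B' responses in n calls
exact = dbinom(k, size = n, prob = theta)
data.frame(estimate = sprintf("%d/%d", k, n), probability = exact)
```

With 1000 simulated studies, the simulated bar heights should land close to these values but not exactly on them.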

Sampling without Replacement

We sample without replacement by picking balls from a hat one after another without replacing the balls we’ve picked. That way, once you’ve called somebody, you can’t call them again. If you don’t have balls and a hat, you can use cards. Shuffle cards labeled $1 \ldots m$ then draw the first $n$ cards off the top. $Y_1$ is what you hear when you call the person on your first card, $Y_2$ is what you hear when you call the second, etc.

Do It Yourself
Have R Do It

Choose $n=3$ of the $m=6$ people in our population to call by picking balls from a hat (without replacing them) or drawing cards from a shuffled deck. Draw in your sample and your estimate like you did when you sampled with replacement.

We can use R to repeat this sampling process many times too. The built-in function sample will draw cards from a shuffled deck for us too. We just have to tell it whether to sample with or without replacement by passing replace=TRUE or replace=FALSE.
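Here is a minimal sketch of that switch in base R, assuming the made-up population used in this chapter. The helper name draw.estimate is made up for this illustration.

```r
set.seed(3)  # arbitrary seed so the run is repeatable
y = c(0, 1, 0, 1, 0, 1)
m = length(y); n = 3; R = 1000

# One simulated study: draw n indices, average the corresponding preferences.
draw.estimate = function(replace) {
  J = sample(1:m, n, replace = replace)
  mean(y[J])
}
with.repl    = replicate(R, draw.estimate(TRUE))
without.repl = replicate(R, draw.estimate(FALSE))

# Without replacement, no one is ever called twice.
J = sample(1:m, n, replace = FALSE)
anyDuplicated(J) == 0

# Spread of the estimates under each design.
c(with = sd(with.repl), without = sd(without.repl))
```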

People like sampling without replacement because it doesn’t feel wasteful. You don’t call the same person twice. However, it makes the math a bit harder because calls aren’t independent. So people often do this, then do the math as if they’d sampled with replacement. The justification is that, if you have enough balls, you weren’t going to pick the same one twice anyway. What do you think about this? We’ll look into it a few chapters from now.
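As a starting point for thinking about that question, the sketch below compares the exact sampling distributions under the two designs for the made-up population used in this chapter. It uses dbinom for sampling with replacement and dhyper, R's built-in hypergeometric probability function, for sampling without replacement. It is only an illustration for one tiny population, not a general answer.

```r
y = c(0, 1, 0, 1, 0, 1)
m = length(y); n = 3
n.B = sum(y)   # number of people who prefer B
k = 0:n        # possible numbers of 'B' responses in n calls

with.repl    = dbinom(k, size = n, prob = n.B / m)
# dhyper arguments: (responses of 'B', # who prefer B, # who don't, # of calls)
without.repl = dhyper(k, n.B, m - n.B, n)
data.frame(estimate = sprintf("%d/%d", k, n), with.repl, without.repl)
```

With only six balls in the hat, picking the same one twice is quite likely, so the two columns differ noticeably here.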

Convenience Sampling

A convenience sample is what it sounds like. Pick any $n$ people you want. Most people will pick them however it’s easiest. That’s why it’s called what it is.

Do It Yourself.
Have R Do It For You?

You know the drill. Draw in your sample and your estimate.

You’ll have to give R a little guidance if you want it done for you. What’s convenient? You tell it.

People who collect data tend to like this approach more than people who analyze it do. What do you think about it? How would you analyze it? Can you sketch the estimator’s distribution?
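To give a flavor of what telling R ‘what’s convenient’ might look like, here is a sketch under one entirely hypothetical rule: always call the first $n$ people on the list. The function name convenient and the rule itself are assumptions made for illustration. Your own notion of convenience may behave very differently.

```r
y = c(0, 1, 0, 1, 0, 1)
m = length(y); n = 3; R = 1000

# Hypothetical convenience rule: call the first n people on the list.
convenient = function() 1:n

estimates = replicate(R, mean(y[convenient()]))
table(estimates) / R   # a single bar: we get the same estimate every time
```

Under this rule nothing is random, so repeating the study tells us nothing about how far the estimate is from the estimand. That is part of what makes convenience samples hard to analyze.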

Randomization (Bernoulli Sampling)

To sample by randomization, flip a coin for each person in the population. Our sample is everyone whose coin lands heads-up.

Do It Yourself.
Have R Do It For You?

You’ve got this.

We can use R to repeat this sampling process. The built-in function rbinom will flip coins for us.

This method combines some of the benefits of the previous two. We don’t call anyone twice and the math is easy because flips are independent. But there is a price: the sample size is random and zero is an option. When we get no heads in our coin flips, we get an empty sample. When plotting our estimates above, we’ve just thrown out the empty samples we got. But to do the math when you’re using this sampling method, you really need to specify what it does when it has no data to work with.
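The sketch below simulates this design in base R for the made-up population used in this chapter. Returning NA for an empty sample is one possible convention, chosen here only for illustration. The name estimate.once is also made up.

```r
set.seed(4)  # arbitrary seed so the run is repeatable
y = c(0, 1, 0, 1, 0, 1)
m = length(y); n = 3; R = 1000
sampling.rate = n / m

estimate.once = function() {
  W = as.logical(rbinom(m, 1, sampling.rate))  # one coin flip per person
  if (!any(W)) return(NA_real_)                # empty sample: no estimate
  mean(y[W])
}
estimates = replicate(R, estimate.once())

mean(is.na(estimates))   # fraction of simulated studies with an empty sample
(1 - sampling.rate)^m    # exact chance of an empty sample: 1/64 here
```

Throwing out the empty samples, as in the plot, corresponds to working with estimates[!is.na(estimates)].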

Your Survey

Now let’s get to the point. You really want to estimate the proportion of people in the population who prefer thing B to thing A, i.e. the ones in the top row of our population plot (Section 2.2). We’ve looked at three reasonable ways to sample from the population and at the sampling distribution of our estimator (the corresponding sample proportion) for each of them. Which would you use? Why?

Once you’ve made your choice, sample your population and make your calls. The proportion of times you hear ‘B’ is your estimate of the proportion of your population that prefers B to A. Then go ahead and ask everyone else in your population so you can calculate your estimand. It’s only 6 people after all. Did you get close?

Do you think you had good luck? Bad luck? Or were just about as accurate as you’d expect? To investigate, enter your population’s preferences into the code block in Section 2.2, then come back and rerun the code blocks above to see the sampling distribution of your estimator with each of the three sampling methods you considered. Did that change your mind?
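If you’d like a single number to compare the three designs, here is a rough sketch in base R. It simulates each design and reports the root-mean-squared distance between the estimates and the estimand. The names designs and run.design are made up for this illustration, and dropping empty samples under randomization is an assumption that mirrors the plots above.

```r
set.seed(5)  # arbitrary seed so the run is repeatable
y = c(0, 1, 0, 1, 0, 1)   # replace with your own population's preferences
m = length(y); n = 3; R = 1000

# Each design is a function that returns the indices of the people we call.
designs = list(
  with.replacement    = function() sample(1:m, n, replace = TRUE),
  without.replacement = function() sample(1:m, n, replace = FALSE),
  randomization       = function() which(rbinom(m, 1, n / m) == 1)
)

# For each design: R estimates, with NaN (from empty samples) dropped.
run.design = function(draw) {
  estimates = replicate(R, mean(y[draw()]))
  estimates[!is.nan(estimates)]
}
results = lapply(designs, run.design)

# Typical distance between estimate and estimand under each design.
estimand = mean(y)
sapply(results, function(est) sqrt(mean((est - estimand)^2)))
```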

What’s Going On in the Sampling Code?

Sampling with Replacement

pop = data.frame(j=1:6,
                 y=c(0,1,0,1,0,1))
m = nrow(pop)

1: We input our population as a data frame pop with two columns, j and y, corresponding to the rows in the table below.
2: We record our population size as m for later use.

$j$	1	2	3	4	5	6
$y$	$\underset{y_1}{0}$	$\underset{y_2}{1}$	$\underset{y_3}{0}$	$\underset{y_4}{1}$	$\underset{y_5}{0}$	$\underset{y_6}{1}$

scales = list(scale_x_continuous(breaks = 1:6),
              scale_y_continuous(breaks = (0:3/3), labels=sprintf("%d/3", 0:3)))

pop.plot = ggplot(pop, aes(x = j, y = y)) +
           geom_point(size=5, shape='circle', alpha=.1) +
           scales
pop.plot

1: Before we plot, we’ll define the ‘scales’ to communicate to ggplot the grid lines (breaks) we want in our plot and how we want them labeled.
2: We create a plot pop.plot that uses pop as its data source and interprets its columns j and y as x and y-coordinates respectively.
3: We add a visualization that plots these points $(j, y_j)$. Adding plot elements is done with +. The arguments we pass to geom_point tell ggplot how to style the points. We ask that they be 5mm circles (size=5, shape='circle') that are fairly transparent (alpha=.1 for roughly 10% opacity). Why those choices? It looked right to me. Styling is a bit of a trial and error process.
4: We add the scales we defined before. This is done with + again.
5: So far, we’ve defined but not displayed the plot. Here we ‘return’ the plot to the R terminal so it gets displayed.

pop$y

1: Saying pop$y gets the y column of pop. That’s a vector.

[1] 0 1 0 1 0 1

pop[1,]

2: Saying pop[1,] gets the first row of pop. That’s still a data frame.

J = c(1,2,1)
J

3: Saying c(1,2,1) gives us a vector of three numbers: the vector $[1,2,1]$.

[1] 1 2 1

pop[J,]

4: Saying pop[J,], if J is a vector of numbers, stacks the rows of pop[J[1],], pop[J[2],]… into a data frame. It handles repetition in J by repeating rows. This one is pop[1,], pop[2,], and pop[1,] stacked.

pop$y[J]

5: You can do the same thing with a vector, e.g. pop$y. This one is pop$y[1], pop$y[2], and pop$y[1].

[1] 0 1 0

n = 3  
J = sample(1:m, n, replace=TRUE)
sam = pop[J, ]

1: We ask R to sample n numbers from 1 to m with replacement, as if we were rolling an m-sided die n times. It gives us a vector of length n that we will call J. J[1] (in code) or $J_1$ (in math) is the first number in this list and so on for 2,3,…
2: We ask R to give us a data frame sam with $n$ rows: its $i$th row is the row of the population specified by our $i$th die roll: sam[i,]=pop[J[i],] (code) and $Y_i = y_{J_i}$ (math).

$i$	1	2	3
$J_i$	1	4	4
$Y_i$	$\underset{y_{1}}{0}$	$\underset{y_{4}}{1}$	$\underset{y_{4}}{1}$

pop.plot + geom_point(aes(x=j, y=y), data=sam,
                      color='blue', size=4,
                      position=position_dodge2(width=.3))

3: What we’re doing here is adding a visualization of our sample on top of our population plot. We have to specify a new data source data=sam for this visualization because it would otherwise think we were using the population plot’s data source pop. We also specify that we want to use the j column as the x-coordinate so we plot our sample points on top of the corresponding population points. We usually can’t do this because we don’t know the population, but it can be useful when we can, e.g. in simulated studies.
4: Use 4mm blue dots.
5: Using position=position_dodge2(...) lets us see when we have multiple copies of the same point in our sample. It plots the copies side-by-side instead of on top of each other. The width argument tells ggplot how much space to put between the points.

$i$	1	2	3
$J_i$	1	4	4
$Y_i$	$\underset{y_{1}}{0}$	$\underset{y_{4}}{1}$	$\underset{y_{4}}{1}$

becomes

$i$	1	2	3
$Y_i$	$\underset{\color{lightgray}y_{1}}{0}$	$\underset{\color{lightgray}y_{4}}{1}$	$\underset{\color{lightgray}y_{4}}{1}$

sam$i = 1:n
ggplot(sam) + geom_point(aes(x=i, y=y),
                         color='blue', size=4)  + scales

1: Usually, instead of plotting on top of the population, we just plot the sample as its own thing. We use $i$ as the x-coordinate instead of $J_i$ like we did before. There’s no need to dodge because $i$ isn’t duplicated even if $J_i$ is.

Sampling by Randomization

$j$	1	2	3	4	5	6
$y$	$\underset{y_1}{0}$	$\underset{y_2}{1}$	$\underset{y_3}{0}$	$\underset{y_4}{1}$	$\underset{y_5}{0}$	$\underset{y_6}{1}$

scales = list(scale_x_continuous(breaks = 1:6),                                       
              scale_y_continuous(breaks = (0:3/3), labels=sprintf("%d/3", 0:3)))      

pop.plot = ggplot(pop, aes(x = j, y = y)) +                                           
           geom_point(size=5, shape='circle', alpha=.1) +                             
           scales                                                                     
pop.plot

W = c(FALSE,FALSE,TRUE,TRUE,FALSE,TRUE)
pop[W,]

1: We create a vector of ‘logicals’ (TRUE or FALSE) of the same size as our population.
2: We use this vector to index our population. It gives us the 3 rows of pop where W is TRUE. Looking at the population member id $j$ in the output, we see that it’s the 3rd, 4th, and 6th rows. As it should be.

not.quite.W = ifelse(W, 1, 0)
pop[not.quite.W, ]

3: Often, people treat the number 0 as false and the number 1 as true. This code converts our vector of logicals W to a corresponding vector of 0s and 1s.
4: Logical indexing doesn’t work with vectors of 0s and 1s. It only works with logicals. So this code doesn’t give us the same result as the previous one. In fact, it ignores the zeros in not.quite.W and gives us 3 copies of pop[1,] — one for each copy of 1 in not.quite.W.
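If you do end up with a vector of 0s and 1s, there are two common ways to get the intended rows back. The sketch below shows both on the example above. The names via.comparison and via.as.logical are made up for this illustration.

```r
pop = data.frame(j = 1:6, y = c(0, 1, 0, 1, 0, 1))
W = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE)
not.quite.W = ifelse(W, 1, 0)

# Either line turns the 0s and 1s back into logicals before indexing,
# which recovers the rows we wanted (j = 3, 4, 6).
via.comparison = pop[not.quite.W == 1, ]
via.as.logical = pop[as.logical(not.quite.W), ]
via.comparison
```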

n = 3
sampling.rate = n/m
not.quite.W = rbinom(m, 1, sampling.rate)
W = as.logical(not.quite.W)
sam = pop[W, ]

1: We calculate our sampling rate (the probability that our coin comes up heads) as $n/m$. This’ll give us roughly n heads when we flip our coin m times.
2: We flip our coin m times. This gives us a vector of 0s and 1s that we’ll call not.quite.W.
3: We convert our 0s and 1s to logicals. This gives us a vector of TRUEs and FALSEs that we’ll call W.
4: We use the vector of logicals W to index our population. This gives us the rows of pop where W is TRUE—the rows where our coin came up heads.

$j$	1	2	3	4	5	6
$W_j$	0	0	1	1	0	1
$y_j$	$\underset{y_{1}}{0}$	$\underset{y_{2}}{1}$	$\underset{y_{3}}{0}$	$\underset{y_{4}}{1}$	$\underset{y_{5}}{0}$	$\underset{y_{6}}{1}$

becomes, dropping the rows where we flip tails ($W_j=0$), and counting out the remaining rows $i=1,2,\ldots$,

$i$	1	2	3
$J_i$	3	4	6
$Y_i$	$\underset{y_{3}}{0}$	$\underset{y_{4}}{1}$	$\underset{y_{6}}{1}$

pop.plot + geom_point(aes(x=j, y=y), data=sam,
                      color='blue',  size=4)

5: Here we’re visualizing our sample on top of the population again. Because we’re not sampling any population members twice, we don’t need to use position=position_dodge2(...) like we did before.

$i$	1	2	3
$J_i$	3	4	6
$Y_i$	$\underset{y_{3}}{0}$	$\underset{y_{4}}{1}$	$\underset{y_{6}}{1}$

becomes

$i$	1	2	3
$Y_i$	$\underset{\color{lightgray}y_{3}}{0}$	$\underset{\color{lightgray}y_{4}}{1}$	$\underset{\color{lightgray}y_{6}}{1}$

sam$i = seq_len(nrow(sam))  # the sample size is random here, so count the rows we actually got
ggplot(sam) + geom_point(aes(x=i, y=y),
                         color='blue', size=4)  + scales

At the moment the whiteboard tool works in the slides but not the book. If you’re using the book, you’ll have to let R draw for you.↩︎
