46  Practice Final

QTM 285-1

Premise

Whether you’re in academia or industry, you’ll run into people who are analyzing data badly. Often, what they’re doing is both the established common practice and obviously wrong. In this practice exam, I’m going to run you through some examples in which this is happening. I’m going to ask you how to do better; explain what’s going wrong; and, in some cases, deal with what happens when you’ve convinced someone to do better but they haven’t entirely understood you. Obviously, in real life not everything that happens can be explained using the concepts we’ve learned in this class, and often the problems are a bit better-hidden than they are here. But it’s surprising—at least to me—how often these things come up in a relatively obvious way.

Aside from giving you some practice recognizing and dealing with this stuff, what I hope to do here is give you some examples to help you explain what’s going on when these things happen. Often, if someone’s making a mistake in a subtler form or a more complex problem, you can point out that the situation is analogous to one where the problem is much more obvious, as it is in many of these examples.

About the Exam

This practice exam was long and, in places, hard. Think about this as a set of exercises I wrote to make sure y’all had some practice with the ideas and skills that’ll come up on the exam.

For the actual exam, expect questions that look like these, but simpler, with less repetition and fewer parts.

A Theme

Several of these exercises — Problems 2, 4, and parts of 5 — look at how evidence from simulations can be misleading. I wanted to bring this up, show you a few flavors of it, and pick apart what’s going on a little in a simple case for two reasons. To explain them, I do have to give away a few exercise solutions.

  1. A lot of the evidence people use to choose methods is based on simulation. When you write a paper proposing a new method, you’re expected to show it works—and often that it works better than the alternatives—in simulated data. That evidence is often pretty questionable. Even if you put aside the incentive to make your method look good, if you spend months or years working on a method to solve one piece of a problem, you get tunnel vision. That’s the piece of the problem that shows up in your simulated data because that’s the piece of the problem you’ve been thinking about. Knowing that this happens and a little about what it can look like can make it a little less disorienting the first time you download someone’s code, try it out, and find that it doesn’t seem to work as well as you’ve been led to believe.

  2. Arguably, many methods work pretty often for the same reason simulations don’t show that they don’t work. A stopped clock is right twice a day. And in Problem 2, we see an estimator that’s right ‘once a day’. But when we get to more complex estimators, there start to be a lot of ways for things to go right. In Problem 5, we see two that come up when estimating treatment effects.

  • We could have misspecification but not covariate shift, and therefore no bias.
  • We could have covariate shift but no misspecification, and therefore no bias.

We can even have misspecification and covariate shift that somehow fit together perfectly so there’s still no bias. We saw one example of this a few weeks ago. It’s not likely that this is happening exactly— unless you use models that can’t be misspecified (all functions) or techniques like inverse probability weighting, you’ll get bias— but maybe not enough that it really matters. I don’t mean to suggest that you should be relying on this kind of luck instead of just using better models or inverse probability weighting if you have a choice. But it might be a reason not to discount conclusions based on problematic methods too much. Pretty often, when you reanalyze the data using a more trustworthy approach, what you get is roughly the same point estimate and a slightly wider confidence interval.

Problem 1

The population and the population least squares line

A sample of size 13 and the least squares line

Suppose we’re interested in estimating the subpopulation mean \(\mu(0) = \sum_{j:x_j=0} y_j / \sum_{j:x_j=0} 1\) of the population drawn on the left above. We’re working with a sample of \((X_1, Y_1), \ldots, (X_n, Y_n)\) drawn with replacement from this population, like the one drawn on the right above.

As our estimate of \(\mu(0)\), we use \(\hat\mu(0)\) where \(\hat\mu\) is the least squares line1. Below is a plot of its sampling distribution at sample size \(n=13\). As usual, I’ve plotted its mean, mean plus or minus one standard deviation, and mean plus or minus two standard deviations as solid, dashed, and dotted blue lines respectively. I’ve also plotted the subpopulation mean \(\mu(0)\) itself as a green line.

You may assume that the standard deviation of the estimator at sample size \(n\) is \(\text{se}=\sigma/\sqrt{n}\) for some number \(\sigma\) that does not vary with sample size, just like it would be if \(\hat\mu(0)\) were the mean of a sample of size \(n\).

Exercise 47.1  

At which of these three sample sizes — 25, 50, or 100 — would you expect the interval \(\hat\mu(0) \pm 1.96\ \text{se}\) to have a width of roughly \(1/2\)?

🔒

Locked (Week 0)

Exercise 47.2  

At which of these three sample sizes — 25, 50, or 100 — would you expect the coverage of the usual 95% confidence interval \(\hat\mu(0) \pm 1.96\ \text{se}\) to be closest to 50%?

If the answer isn’t obvious to you, it might be helpful to sketch the sampling distribution of the estimator at the different sample sizes you’re considering. How does their width vary as you increase sample size? What about their center?

🔒

Locked (Week 0)

This is a summary of our discussion in Lecture 13 in the context of this question. What we saw there is that, because our sampling distribution is approximately normal, what determines coverage is the ratio \(\text{bias}/\text{se}\) of our estimator’s bias to its standard deviation. And we can plot coverage as a function of that ratio. We do that in Chapter 52.

Where is this coming from? If we’re looking at the sampling distribution of \(\hat\theta\) and thinking about the coverage of intervals \(\hat\theta \pm \ell\) of width \(2\ell\) (\(\ell\) for ‘arm-length’), the ones that cover are the ones where the arms reach from \(\hat\theta\) to \(\theta\), i.e. the ones where \(\abs{\hat\theta-\theta} \le \ell\). To count those, we integrate the sampling distribution of \(\hat\theta\) from \(\theta-\ell\) to \(\theta+\ell\). You can see this by drawing the integral you’d use to do the coverage calculation. I’ve done that below.

Look at the top 2 plots. On the left, I’ve drawn the \(n=13\) sampling distribution again, this time with this domain of integration shaded. On the right, I’ve drawn a version for \(n=50\). What you know about the sampling distribution for \(\hat\mu(0)\) when \(n=50\) is that it’s centered in the same place as it is for \(n=13\) and it’s about half as wide. I’ve drawn exactly that on the right; I’ve taken the \(n=13\) histogram on the left and ‘compressed it’ so everything is \(1/2\) as far from the mean. To show that this width-halving approach works, I’ve also drawn the actual sampling distribution for \(n=50\) as a faint gray outline. Since we’re using intervals that are ± two standard deviations, we integrate the sampling distribution from two standard deviations left of the estimation target to two standard deviations right of it—i.e., over the region I’ve shaded yellow. Because that yellow region extends right to the center of the (roughly symmetric) sampling distribution and left far enough to include essentially all of its probability mass on that side, this integral is roughly \(.5\)—half of the sampling distribution’s total mass.

Why is \(\text{bias}/\text{se}\) all that really matters? In Lecture 13, we showed it via calculus. We wrote out the integral of the normal distribution with mean \(\theta+\text{bias}\) and standard deviation \(\text{se}\) and did a change of variables. But I’ll try to give you a more visual explanation here.

Think about what’s happened when we (roughly) quadrupled our sample size to \(n=50\). We took the \(n=13\) histogram and made it half as wide. And our intervals got half as wide too: their width scales with the sampling distribution’s. What doesn’t change is where the target is. On the left, it’s 1 standard deviation from the mean. On the right, it’s 2. Not because the target or the sampling distribution’s center has changed, but because the standard deviation has. That changes the value of the integral. But if we halved the bias at the same time as we halved everything else, then we’d get exactly the same integral—the plot on the right would just be a half-as-wide version of the plot on the left.

I’ve illustrated that on the bottom row by drawing exactly the same thing twice: the \(n=50\) plot with the target changed so the bias is halved. By comparing the two plots on the right, top and bottom, you can see that all that’s changed is the target: the bias is halved. And by comparing the two plots on the left, you can see that all that’s changed are the numbers on the axes—it’s exactly the same plot, but half as wide and twice as tall. And because of that, the proportion of the sampling distribution in the range we’re integrating over is the same as it is in the \(n=13\) case.
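The calculus version of this argument is short enough to check numerically. Here’s a sketch in R (the function name coverage is mine, not course code) that integrates a normal sampling distribution with a given bias/se ratio over the interval’s reach:

```r
# Coverage of the interval thetahat +/- 1.96 se when thetahat is normal
# with mean theta + bias and standard deviation se. After a change of
# variables, only the ratio bias/se matters.
coverage = function(bias.over.se) {
  pnorm(1.96 - bias.over.se) - pnorm(-1.96 - bias.over.se)
}

coverage(0)  # unbiased: the usual 95%
coverage(1)  # target one standard deviation from the center
coverage(2)  # target two standard deviations from the center: roughly 50%
```

Plotting coverage(x) over a grid of ratios x gives the curve in the appendix.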

Problem 2

You have a sample \(Y_1, Y_2, \ldots, Y_n\) drawn with replacement from a population with mean \(\mu\) and standard deviation \(\sigma\). The usual estimate for \(\mu\) is the sample mean \(\bar{Y}_0 = \frac{1}{n}\sum_{i=1}^n Y_i\). But one of your coworkers has a clever idea to reduce variance—instead of dividing by \(n\), they divide by \(n+10\). This is their estimator: \(\bar{Y}_{10} = \frac{1}{n+10}\sum_{i=1}^n Y_i\). You try to convince them it’s a bad idea by being a bit hyperbolic—you argue that if that’s a good idea, then why not divide by \(n+100\)? They, unfortunately, love this idea. You now have to explain why this is a bad idea. There are some plots in Appendix Chapter 52 that may help.

Your coworker has calculated formulas for the bias and standard deviation of the ‘divide by \(n+k\) estimator’ \(\bar{Y}_k=\frac{1}{n+k}\sum_{i=1}^n Y_i\). \[ \begin{aligned} \mathop{\mathrm{E}}[\bar{Y}_k] - \mu &= -\frac{k}{n+k}\mu \\ \mathop{\mathrm{sd}}[\bar{Y}_k] &= \frac{\sigma}{\sqrt{n}} \times \frac{n}{n+k}. \end{aligned} \]

\[ \begin{aligned} \mathop{\mathrm{E}}\qty[\bar{Y}_k] - \mu &= \mathop{\mathrm{E}}\qty[\frac{n}{n+k}\bar Y_0] - \mu \\ &= \frac{n}{n+k}\mathop{\mathrm{E}}[\bar Y_0] - \mu \\ &=\frac{n}{n+k}\mu - \mu \\ &= \qty(\frac{n}{n+k} - 1)\mu \\ &= \qty(\frac{n}{n+k} - \frac{n+k}{n+k})\mu \\ &= -\frac{k}{n+k}\mu \\ & \\ \mathop{\mathrm{\mathop{\mathrm{V}}}}\qty[\bar{Y}_k] &= \mathop{\mathrm{\mathop{\mathrm{V}}}}\qty[\frac{n}{n+k} \bar {Y}_0] \\ &= \qty(\frac{n}{n+k})^2 V[\bar{Y}_0] \\ &= \frac{n^2}{(n+k)^2} \frac{\sigma^2}{n} \\ &= \frac{ n\sigma^2}{(n+k)^2}=\frac{\sigma^2}{n} \times \frac{n^2}{(n+k)^2} \end{aligned} \]
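If you’d rather check this algebra by simulation than re-derive it, a quick Monte Carlo sketch (my own, with made-up values \(\mu=1\), \(\sigma=1\), \(n=100\), \(k=10\)) should reproduce both formulas:

```r
# Monte Carlo check of the bias and sd formulas for the
# 'divide by n+k' estimator Ybar.k = sum(Y) / (n + k).
set.seed(1)
n = 100; k = 10; mu = 1; sigma = 1
Ybar.k = replicate(10000, sum(rnorm(n, mu, sigma)) / (n + k))

mean(Ybar.k) - mu  # bias: should be near -k/(n+k) = -0.0909
sd(Ybar.k)         # sd:   should be near (sigma/sqrt(n)) * n/(n+k) = 0.0909
```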

And they’ve drawn the sampling distribution of \(\bar{Y}_0\), \(\bar{Y}_{10}\), and \(\bar{Y}_{100}\) (top, middle, and bottom rows respectively) for sample sizes \(n=100\) and \(n=10,000\) (left and right columns respectively) for \(\mu=0\) and \(\sigma=1\). The solid line is the estimator’s expected value \(\mathop{\mathrm{E}}[\bar{Y}_k]\), the dashed lines are \(\mathop{\mathrm{E}}[\bar Y_k] \pm \mathop{\mathrm{sd}}[\bar Y_k]\), and the dotted lines are \(\mathop{\mathrm{E}}[\bar Y_k] \pm 2 \mathop{\mathrm{sd}}[\bar Y_k]\).

On the basis of those drawings, your coworker proposes using \(\bar{Y}_{100}\) to estimate \(\mu\).

Exercise 48.1  

Suppose your sample size is \(n=100\). Explain why your coworker’s original proposal to use \(\bar Y_{10}\) and your hyperbolic suggestion to use \(\bar Y_{100}\) appear, from the drawings, to be ways to increase precision without increasing bias. To explain why this is misleading, describe what happens in the case that \(\mu=11/10\) and \(\sigma=1\) by doing the following.

  • Modifying the drawings to show the sampling distributions in this new case.
  • Stating an approximation to the coverage of each of the confidence intervals \(\bar{Y}_{k} \pm 1.96\ \text{se}\) for \(k=0\), \(k=10\), and \(k=100\).

Tip. You can ‘change the location’ of the distributions in the drawing above by changing the labels on the x-axis.

🔒

Locked (Week 0)

Exercise 48.2  

This coworker isn’t going anywhere. Next time, it might be better to bite your tongue and save time. If your sample size were \(n=10,000\), maybe your coworker’s idea to use \(\bar Y_{10}\) really wasn’t creating much of a problem. What would you need to know about \(\mu\) and \(\sigma\) to guarantee at least 85% coverage?

🔒

Locked (Week 0)

Exercise 48.3  

Suppose the context for all this is that you are running a poll of 100 registered voters to find out whether they intend to vote for the candidate from Party 0 or Party 1. Elections are always pretty close, so you can assume that the proportion of registered voters who intend to vote for Party 1 is between .4 and .6. Approximately how much coverage does using the interval estimator \(\bar{Y}_{k} \pm 1.96\ \text{se}\) guarantee you for \(k=0\), \(k=10\), and \(k=100\)?

🔒

Locked (Week 0)

Problem 3

Consider the following estimation targets. \[ \begin{aligned} \Delta_{\text{raw}} &= \frac{1}{m_1}\sum_{j:w_j=1} y_j - \frac{1}{m_0}\sum_{j:w_j=0} y_j \qfor m_w = \sum_{j:w_j=w} 1 \\ \Delta_{0} &= \frac{1}{m_0}\sum_{j:w_j=0} \color{gray}\left\{ \color{black} \mu(1,x_j) - \mu(0,x_j) \color{gray} \right\} \color{black} \\ \Delta_{1} &= \frac{1}{m_1}\sum_{j:w_j=1} \color{gray}\left\{ \color{black} \mu(1,x_j) - \mu(0,x_j) \color{gray} \right\} \color{black} \\ \Delta_{\text{all}} &= \frac{1}{m}\sum_{j=1}^m \color{gray}\left\{ \color{black} \mu(1,x_j) - \mu(0,x_j) \color{gray} \right\} \color{black} \end{aligned} \]

Exercise 49.1  

For each of the examples above, answer the following questions.

  1. Is \(\Delta_1 > \Delta_\text{raw}\)?
  2. Is \(\Delta_1 > \Delta_0\)?
  3. Is \(\Delta_1 > \Delta_\text{all}\)?

I’m not asking you to compare the magnitudes of these estimation targets. If \(\Delta_1\) is positive and \(\Delta_\text{raw}\) is negative, then no matter what the magnitudes are, \(\Delta_1 > \Delta_\text{raw}\).

🔒

Locked (Week 0)

Problem 4

Your friend tells you about a new estimator for the mean \(\theta=\frac1m\sum_{j=1}^m y_j\) of a population \(y_1 \ldots y_m\). They like to use it when they think the population mean \(\theta\) is close to zero.

\[ \hat \theta = \begin{cases} 0 & \text{if } \abs{\bar Y} < .1 \\ \bar Y & \text{otherwise} \end{cases} \]

estimator = function(x, kill.below=.1) {
  if(abs(mean(x)) < kill.below) { 0 } else { mean(x) }
}

To show you how it works, they’ve made up a population \(y_1 \ldots y_{1000}\) with mean \(\theta=.02\) and standard deviation 1. They show you the sampling distribution of this estimator when you draw samples \(Y_1 \ldots Y_n\) of size \(n=25\) with replacement from this population.

They’re pretty pleased with themselves. Their estimator is …

  • almost never further from \(\theta\) than the sample mean is. It’s further in 0% of samples.
  • often closer to \(\theta\) than the sample mean is. It’s closer in 30% of samples.

This is a weird estimator, so you ask if you can still use the bootstrap to calibrate confidence intervals around it. Your friend says …

I checked this morning. The coverage probability is \(.93\). That’s not bad when you’re getting a more accurate estimate 30% of the time.

They pull up the code below to show you how they checked, but when they run it, it doesn’t say \(.93\). It says \(.84\).

bootstrap.sampling.distribution = function(Y) { 
  1:1000 |> map_vec( function(.) {  
    J.star = sample(1:n, n, replace=TRUE)
    Y.star = Y[J.star]
    estimator(Y.star)
  })
}

se = sd(bootstrap.sampling.distribution(Y))
covers = 1:1000 |> map_vec( function(.) { 
  J = sample(1:m, n, replace=TRUE)  
  Y = y[J]                          
  estimator(Y) - 1.96*se <= theta & theta <= estimator(Y) + 1.96*se
})
mean(covers)
[1] 0.841

Exercise 50.1  

Tell your friend what’s wrong with their code and how to fix it. A sentence or two should do.

It doesn’t take me more than 5 seconds to fix and I don’t type very fast.

🔒

Locked (Week 0)

Exercise 50.2  

Extra Credit Style. Suppose that somewhere in the code they used to draw those plots, they were actually drawing a sample Y from the population y. They were running this code.

J = sample(1:m, n, replace=TRUE)  
Y = y[J]    

They ran it before they calculated the coverage probability this morning and they ran it again before they calculated it just now. Why do you think the two coverage probabilities they calculated are so different?

Take a look at the plots below. Each one shows the bootstrap sampling distribution of this new estimator in teal and of the sample mean in pink for a random sample \(Y_1 \ldots Y_n\) drawn from the population \(y_1 \ldots y_m\). Each estimator’s bootstrap mean ± 2 bootstrap standard deviations is shown by a pair of vertical lines in the same colors.

🔒

Locked (Week 0)

Problem 5

You’re working at a company that makes a fancy alarm clock that watches you sleep. You don’t tell it when to wake you up. You tell it when you need to be up by; it wakes you earlier if, based on your sleep patterns, it thinks that’s better for you. You can, of course, turn that feature off so it wakes you when you ask like a regular alarm clock. You’re on a team that’s trying to figure out how well this feature is working.

You have a population \(w_1, x_1, y_1 \ldots w_m, x_m, y_m\) of people who have been using the clock.

  • \(w_j\) indicates whether they have the feature turned on or off.
  • \(x_j\) is a sleep quality score they reported when they bought the clock (0 is worst, 4 is best).
  • \(y_j\) is the average of the sleep quality scores they’ve logged since they started using the clock.

Privacy is a big deal when you’re selling a device that watches people sleep, so your app doesn’t report sleep quality scores directly to your company. You actually have to run a survey. The result is a sample \((W_1, X_1, Y_1) \ldots (W_n,X_n,Y_n)\) that’s drawn with replacement from the population. You have it in an R dataframe sam with n rows. I’ve plotted it above.

As usual, \(X\) is on the horizontal axis, the \(Y\) variable is on the vertical axis, and the \(W\) variable, which is binary, is indicated by color: green for \(W=1\), red for \(W=0\). The mean in each column is indicated by a black pointrange (⍿). The proportion of times that \(X\) takes on each value \(x\) when \(W=0\) and when \(W=1\), \(\textcolor[RGB]{248,118,109}{P_{x\mid 0}}\) and \(\textcolor[RGB]{0,191,196}{P_{x\mid 1}}\) respectively, are plotted as histograms in matching colors. They’re also shown in the table below.

\[ \color{gray} \begin{array}{lll} x & \textcolor[RGB]{248,118,109}{P_{x \mid 0}} & \textcolor[RGB]{0,191,196}{P_{x \mid 1}} \\ \hline 0 & \textcolor[RGB]{248,118,109}{0.41} & \textcolor[RGB]{0,191,196}{0.09} \\ 1 & \textcolor[RGB]{248,118,109}{0.35} & \textcolor[RGB]{0,191,196}{0.11} \\ 2 & \textcolor[RGB]{248,118,109}{0.05} & \textcolor[RGB]{0,191,196}{0.45} \\ 3 & \textcolor[RGB]{248,118,109}{0.15} & \textcolor[RGB]{0,191,196}{0.31} \\ 4 & \textcolor[RGB]{248,118,109}{0.03} & \textcolor[RGB]{0,191,196}{0.04} \\ \end{array} \]

Part 1

Suppose your coworker fits the additive model by least squares. That is, they choose \(\hat\mu(w,x)\) as their estimate of \(\mu(w,x)\) like this. \[ \begin{aligned} \hat \mu &= \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \qty{ Y_i - m(W_i,X_i) }^2 \qquad \\ &\qqtext{ where } \mathcal{M}= \qty{m(w,x) = a(w) + b(x) \ \text{ for all functions $a(w)$ and $b(x)$ }} \end{aligned} \]

They claim that the difference in the green and red lines’ values at \(x=0\), \(\textcolor[RGB]{0,191,196}{\hat\mu(1, 0)} - \textcolor[RGB]{248,118,109}{\hat\mu(0, 0)} = -1.36\), is a good estimate of the adjusted difference in means \(\Delta_1\). 2 3

\[ \Delta_1 = \textcolor[RGB]{0,191,196}{\frac{1}{m_1}\sum_{j:w_j=1}} \color{gray}\left\{ \color{black} \textcolor[RGB]{0,191,196}{\mu(1, x_j)} - \textcolor[RGB]{248,118,109}{\mu(0, x_j)} \color{gray} \right\} \color{black} \qfor m_1 = \sum_{j:w_j=1} 1 \]

You’re not sure, so you plot their fitted model. Here’s what you see. As usual, \(\textcolor[RGB]{0,191,196}{\mu(1, x)}\) is the green line and \(\textcolor[RGB]{248,118,109}{\mu(0, x)}\) is the red one.

Exercise 51.1  

Do you think their estimate is biased up (too large) or down (too small)? Referring to the plot, explain why.

🔒

Locked (Week 0)

A Revision

I’ve drawn in the column means in a revised version posted Monday night. If you were working on this one over the weekend, it was probably a little difficult because, in this data, it’s a little hard to see where the mean in each column is. You could get those means from the plot below, but it’s easier having them all in one place.

Part 2

When they calculate their estimator, they’re probably not explicitly making a comparison at \(x=0\), e.g. by writing code that says something like muhat(1,0)-muhat(0,0). They’re probably just looking at the coefficient on \(w\) in the output of the lm function.

additive.predictor = lm(y~w+factor(x), data=sam) 
additive.predictor    

Call:
lm(formula = y ~ w + factor(x), data = sam)

Coefficients:
(Intercept)            w   factor(x)1   factor(x)2   factor(x)3   factor(x)4  
      1.347       -1.362        0.636        0.695        1.727        3.091  

That’s the number in front of \(w\) when you write your least squares predictor out in terms of a certain basis: it’s \(\hat a_1\) for \(\hat\mu(w,x) = \hat a_0 + \hat a_1 w + \hat a_2 1_{=1}(x) + \ldots + \hat a_5 1_{=4}(x)\). When they hear what you said in Exercise 51.1, they change their model to one they’ve seen you use before. This is their new least squares predictor. \[ \hat\mu = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \qty{ Y_i - m(W_i,X_i)}^2 \qquad \text{ where } \mathcal{M}= \qty{\text{all functions} \ m(w,x) } \]

And they do exactly what they did before. They report the coefficient on \(w\) that lm gives them: -0.0860.

all.functions.predictor = lm(y~w*factor(x), data=sam) 
all.functions.predictor

Call:
lm(formula = y ~ w * factor(x), data = sam)

Coefficients:
 (Intercept)             w    factor(x)1    factor(x)2    factor(x)3  
       1.084        -0.086         0.954         1.731         2.427  
  factor(x)4  w:factor(x)1  w:factor(x)2  w:factor(x)3  w:factor(x)4  
       2.665        -1.472        -2.125        -1.890        -0.161  

That’s \(\hat a_1\) when we write out the least squares predictor like this. \[ \hat\mu(w,x) = \hat a_0 + \hat a_1 w + \hat a_2 1_{=1}(x) + \ldots + \hat a_5 1_{=4}(x) + \hat a_6 w 1_{=1}(x) + \ldots + \hat a_{9} w 1_{=4}(x) \tag{51.1}\]

That number doesn’t sound right to them, so they plot their new predictor using the code you just wrote. Here’s what they see.

They’re confused. The plot tells basically the same story as the one for the additive model. There’s a green curve below a red one. It seems like they should be getting a negative number, not zero.

Exercise 51.2  

  1. Using the output of lm above and other information you have above, estimate \(\Delta_1\). Or, if you don’t feel like getting out your calculator, explain how to do it very concretely, so your confused coworker can handle it from there. Don’t expect too much from them. Say something like ‘take this number from this place, multiply it by this number from this other place, etc.’

  2. Explain why what they were doing before worked for the additive model but not for the all functions model.

Interpreting the output of lm isn’t something we’ve spent much time on in this class. And it’s not something you’ll have to do on the actual final exam.

If you want to try, go ahead. Reading the next tip might help. But if you don’t, you’re welcome to use this table describing the values of \(\hat\mu(w,x)\). And if you’re writing instructions for your coworker instead of doing the calculation yourself, you’re welcome to refer to this table in your instructions.

w x muhat
0 0 1.08
1 0 1.00
0 1 2.04
1 1 0.48
0 2 2.82
1 2 0.60
0 3 3.51
1 3 1.54
0 4 3.75
1 4 3.50

To interpret the coefficients in the output of lm, you need to understand how it describes the basis it’s thinking in. This is one of those skills that’s not all that necessary when you’re working on your own—you can just use predict to calculate \(\hat\mu(w,x)\) and not worry about how R is actually doing it under the hood. But it’s a good skill to have around other people, because a lot of them are in the habit of thinking about coefficients instead of predictions, and when that results in mistakes like the one ‘your coworker’ made here, you’ll need to do a little translation between their way of thinking and yours so you can say something more helpful than ‘Forget everything you know. Just do it my way.’

In the output of lm(y~w*factor(x), data=data), they’re thinking of \(\hat\mu\) like I’ve written it in Equation 51.1.

  • What I’ve called \(\hat a_0\), R writes under ‘(Intercept)’. It’s the value of \(\hat\mu(0,0)\).
  • What I’ve called \(\hat a_1\), R writes under ‘w’. It’s what’s multiplying \(w\) in that expansion.
  • What I’ve called \(\hat a_2 \ldots a_5\), R writes under ‘factor(x)1’ … ‘factor(x)4’. What it means by ‘factor(x)k’ is the indicator function \(1_{=k}(x)\).
  • What I’ve called \(\hat a_6 \ldots \hat a_9\), R writes under ‘w:factor(x)1’ … ‘w:factor(x)4’. What it means by ‘w:factor(x)k’ is the product of \(w\) and the indicator function \(1_{=k}(x)\).
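To check your reading of the basis, you can rebuild a couple of entries of the \(\hat\mu(w,x)\) table from the printed coefficients by hand. Here’s that arithmetic in R, with the relevant coefficients copied from the output above:

```r
# Coefficients copied by hand from the lm output above.
coefs = c("(Intercept)" = 1.084, "w" = -0.086,
          "factor(x)1" = 0.954, "w:factor(x)1" = -1.472)

# muhat(1,0): the intercept plus the coefficient on w.
coefs[["(Intercept)"]] + coefs[["w"]]  # about 1.00, matching the table

# muhat(1,1): also add 'factor(x)1' and 'w:factor(x)1'.
sum(coefs)                             # 0.48, matching the table
```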

Why it says factor(x). R says factor(x)1, etc., because I’ve converted \(x\) to a ‘factor’— something where we want an indicator for each possible value in our model instead of a single slope — directly in the formula y~w*factor(x). If \(x\) were already a factor, you’d see ‘x1’ … ‘x4’ instead of ‘factor(x)1’ … ‘factor(x)4’ and ‘w:x1’ … ‘w:x4’ instead of ‘w:factor(x)1’ … ‘w:factor(x)4’ in the output as you do below.

another.all.functions.predictor = lm(y~w*x, data=sam |> mutate(x=factor(x)))
another.all.functions.predictor

Call:
lm(formula = y ~ w * x, data = mutate(sam, x = factor(x)))

Coefficients:
(Intercept)            w           x1           x2           x3           x4  
      1.084       -0.086        0.954        1.731        2.427        2.665  
       w:x1         w:x2         w:x3         w:x4  
     -1.472       -2.125       -1.890       -0.161  

The difference between converting it in the formula and converting it ahead of time is what predict expects. If you say factor(x) in the formula, you can pass in data in which \(x\) is not a factor. It’ll convert for you.

predict(all.functions.predictor, data.frame(w=1, x=1:4))
   1    2    3    4 
0.48 0.60 1.54 3.50 

If you convert it ahead of time when you call lm, you have to convert it when you call predict, too.

predict(another.all.functions.predictor, data.frame(w=1, x=factor(1:4)))
   1    2    3    4 
0.48 0.60 1.54 3.50 

If you don’t, it gives you an error.

Error: variable 'x' was fitted with type "factor" but type "numeric" was supplied
🔒

Locked (Week 0)

Part 3

Now you’ve managed to get your coworker estimating \(\Delta_1\) by using predict instead of just looking at coefficients. Congratulations! That means no matter what model they’re using, if \(\hat\mu(w,x)\) is a reasonable estimate of \(\mu(w,x)\), they’ll get a reasonable estimate of \(\Delta_1\). And you’ve got them packaging everything up into a function estimate.Delta.1 that takes in a sample, does all the work, and gives you back an estimate of \(\Delta_1\).
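The function itself isn’t shown, so here’s one way it might look (a sketch using the all functions model; the internals are my guess, not the course’s actual code):

```r
# A sketch of estimate.Delta.1: fit muhat by least squares in the
# all functions model, then average muhat(1,x) - muhat(0,x) over the
# treated observations using predict.
estimate.Delta.1 = function(sam) {
  muhat = lm(y ~ w * factor(x), data = sam)
  treated = subset(sam, w == 1)
  mean(predict(muhat, transform(treated, w = 1)) -
       predict(muhat, transform(treated, w = 0)))
}
```

The key design choice is that it predicts at \(w=1\) and \(w=0\) for the same treated covariates, so it works unchanged if you swap in a different model for muhat.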

Delta1.hat = estimate.Delta.1(sam)

Their point estimate of \(\Delta_1\) is -1.8.

Now they want a confidence interval, so they bootstrap their estimator. To do it, they first write a function that calculates the sampling distribution of any estimator that’s written as a function of a sample when you sample with replacement from a population. Something like this.

sampling.distribution = function(pop, estimator, draws=1000) { 
  map_vec(1:draws, function(b) {
    J = sample(1:nrow(pop), nrow(pop), replace=TRUE)
    sam = pop[J,]
    estimator(sam)
  }) 
}

Then, to get the bootstrap sampling distribution of their estimator, they use it as if their sample were the population.

Delta1.hat.star = sampling.distribution(sam, estimate.Delta.1) 

They calculate a 95% interval like this.

se = sd(Delta1.hat.star)
interval = c(Delta1.hat - 1.96*se, Delta1.hat + 1.96*se)

They get this interval \([-1.94 \ , \ -1.67 ]\). And they’ve done everything right. They’re using an interval estimator with 95% coverage. But your boss, who has a habit of correcting employees without really spending the time to understand what they’re doing, glances at the code and says this:

You forgot to divide by \(\sqrt{n}\). Rookie mistake.

This isn’t great for your coworker’s confidence. To help explain the \(\sqrt{n}\) thing, you draw a few things. You draw the bootstrap sampling distribution of \(\hat\Delta_1\), the bootstrap sampling distribution of \(\bar{Y}\), and the distribution of \(Y\) in the sample.

The bootstrap sampling distribution of Delta1.hat.

The distribution of Y in the sample.

The bootstrap sampling distribution of the sample mean.

Here’s what you say.

Suppose we were just using the sample mean \(\bar Y\) to estimate the mean outcome in the population, \(\frac{1}{m}\sum_{j=1}^m y_j\). One way of calculating the standard deviation of \(\bar Y\) is to take the standard deviation of the outcomes \(Y_1 \ldots Y_n\) and divide by \(\sqrt{n}\). But when we bootstrap \(\bar Y\), we get a distribution that looks like the distribution of our estimator, not our outcomes. It’s already narrower by a factor of \(1/\sqrt{n}\).

You even calculate the standard deviation of \(Y_1 \ldots Y_n\)—it’s 1.11—divide it by \(\sqrt{n}\approx 32\), and compare it to the bootstrap standard deviation of \(\bar Y\), which is 0.033. But you really want to end your explanation with a flourish. You say this.

If you want to think of the standard deviation of \(Y_1 \ldots Y_n\)—the one you’d divide by \(\sqrt{n}\)—in bootstrap terms, it’s the bootstrap standard deviation of an estimator you’d never actually use. It’s the standard deviation of …
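You can see the same comparison numerically with any sample. In this sketch (made-up data, not the survey sample), sd(Y)/sqrt(n) and the bootstrap standard deviation of \(\bar Y\) come out nearly identical:

```r
# sd(Y)/sqrt(n) vs. the bootstrap standard deviation of the sample mean.
set.seed(1)
n = 1000
Y = rnorm(n, mean = 2, sd = 1.11)  # a stand-in for the survey outcomes

boot.means = replicate(1000, mean(sample(Y, n, replace = TRUE)))

sd(Y) / sqrt(n)  # the 'divide by sqrt(n)' standard error
sd(boot.means)   # the bootstrap standard error: about the same
```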

Exercise 51.3  

Finish the sentence. What simple estimator has bootstrap standard deviation \(\sigma=\sqrt{\frac{1}{n}\sum\{Y_i - \bar Y\}^2}\), the same standard deviation as the sample \(Y_1 \ldots Y_n\)?

Tip. There’s more than one. Any one you can think of is fine.

🔒

Locked (Week 0)

Part 4

When you and your coworker bring a summary of your findings to your boss, they’re annoyed. They say this.

Why are you doing all this complicated nonsense?

They pull up the data and subtract the mean of the red points from the mean of the green points. They estimate the standard deviation of their estimator \(\hat\Delta_{\text{raw}}\) to be 0.07, which gives them a 95% confidence interval of \(\hat\Delta_{\text{raw}} \pm 0.13\).

Much better! Simple. And more precise. Your interval was \(\pm 0.14\). Mine is \(\pm 0.13\).

But width isn’t everything. Their interval isn’t just narrower. It’s nowhere near yours.

Exercise 51.4  

Is their point estimate \(\hat\Delta_{\text{raw}}\) bigger or smaller than your point estimate \(\hat\Delta_1\)? Explain why.

🔒

Locked (Week 0)

Part 5

Your coworker decides to do a little simulation to test out all the new stuff they’ve learned.

They make up a population of 10,000 people. The population they’ve made up is plotted above. They draw samples of size 1000 repeatedly to compare the sampling distributions of four different estimators of \(\Delta_1\).

  1. The estimator they were using at the beginning, in Section 51.1, where \(\hat\mu\) is the least squares predictor in the additive model.
  2. The estimator you calculated in Exercise 51.2. You used the least squares predictor in the all functions model.
  3. An estimator like the one they used in Section 51.1, but based on the inverse probability weighted least squares predictor in the additive model.
  4. Your boss’s estimator \(\Delta_{\text{raw}}\) from Section 51.4.

From what they see, none of these estimators is biased. The sampling distributions of all four, plotted below, are centered at the estimation target \(\Delta_1\).

Exercise 51.5  

What is it about this population that makes this happen?

No need for a long explanation. A sentence should be enough.

🔒

Locked (Week 0)

Part 6

Your coworker has lost confidence in their ability to make up fake populations that they can use to test estimators. To try to test their four estimators again, they get a fake population from an old paper published in Statistical Science.

Now they’re angry. The problem they had in their last simulation is gone—your boss’s estimator, at least, is biased. But they’re still not seeing any problem with their estimator from Section 51.1. All the changes they’ve considered have increased their estimator’s variance without meaningfully reducing bias.

Exercise 51.6  

What’s going on? What is it about this population that makes this happen?

No need for a long explanation. A sentence should be enough.

🔒

Locked (Week 0)

Exercise 51.7  

When they use the same axes for all the plots, the sampling distribution of your boss’s estimator \(\hat\Delta_{\text{raw}}\) just isn’t there. That estimator is so biased that its sampling distribution is off the plot to one side or the other. Which side is it on? Is the bias of \(\hat\Delta_{\text{raw}}\) positive or negative?

🔒

Locked (Week 0)

Appendix

Figure 52.1: The coverage of a 95% confidence interval \(\hat\theta \pm 1.96\ \mathop{\mathrm{sd}}[\hat\theta]\) when \(\hat\theta\) has a normal sampling distribution as a function of the bias/standard deviation ratio.

Figure 52.2: This is a plot of \(f(x) = x/\sqrt{x(1-x)}\).

  1. If we want to be verbose, it’s the least squares predictor in the ‘lines model’ \(\mathcal{M}=\{m(x)=a+b x\}\)↩︎

  2. They don’t say ‘the difference \(\textcolor[RGB]{0,191,196}{\hat\mu(1, 0)} - \textcolor[RGB]{248,118,109}{\hat\mu(0, 0)}\)’. We’ll get to that in the next section.↩︎

  3. They describe what they’re estimating as the average treatment effect on the treated (ATT), but the treatment \(W_i\) isn’t randomized; the people in our population chose it themselves. People, myself included, do this pretty often. We say average treatment effect on the treated (ATT), average treatment effect on the controls (ATC), and average treatment effect (ATE) to refer to \(\Delta_1\), \(\Delta_0\), and \(\Delta_\text{all}\) respectively even outside the context of randomized experiments. Sometimes that’s because they really want to estimate those treatment effects and they’re doing their best despite lack of randomization. That’s what’s going on in Section 2.1 of Kim et al. (2023). But sometimes it’s just something we do because we don’t have a good alternative: there isn’t standard terminology that differentiates between \(\Delta_1\), \(\Delta_0\), and \(\Delta_\text{all}\) that isn’t about causality. Sometimes it’s something in between, like when people talk about estimating the ‘effect of race’ or ‘effect of gender’; they want to be making a causal claim but what they’re calling a treatment is a social construct and what exactly our potential outcomes mean is unclear. If you’re interested in questions like that, take a look at Issa Kohler-Hausmann’s paper Eddie Murphy and the Danger of Counterfactual Causal Thinking About Detecting Racial Discrimination.↩︎