23  Least Squares Regression in Linear Models

A Problem With Subsample Averages

Sometimes There Isn’t Much to Average

  • Very few people in our California CPS sample have less than 11 years of education.
    • There’s a good reason for this: CA law requires that residents stay in school until they’re 18.
  • But that means our 8th-10th grade income averages are based on very few people.
  • And that makes them very sensitive to who, exactly, is in the sample.

Sensitivity

  • If we leave even one person out of our 8th grade average, it changes a fair amount.
    • Each person’s income is 1/14th of that average.
    • We can see those changes in the ‘leave-one-out means’ plot on the left.
  • That’s not a problem with the 12th grade average.
    • Each person’s income is 1/605th of that average.
    • And the leave-one-out means barely change.
  • We see a more ‘statistical’ version of this by looking at the bootstrap means plot on the right.
    • The spread of these bootstrap means is a good approximation to the spread
      we get in the actual sample means when we sample from the population.
    • And it’s huge for folks with 8th grade educations.
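
To make the leave-one-out and bootstrap diagnostics above concrete, here is a minimal sketch in Python. The `incomes` vector is a made-up stand-in for one education column of our sample, not the actual CPS data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the incomes in one education column
# (14 people, like our 8th-grade column).
incomes = rng.lognormal(mean=10, sigma=1, size=14)

# Leave-one-out means: drop each observation in turn and re-average.
loo_means = np.array([np.mean(np.delete(incomes, i)) for i in range(len(incomes))])

# Bootstrap means: resample the column with replacement and re-average.
boot_means = np.array([
    rng.choice(incomes, size=len(incomes), replace=True).mean()
    for _ in range(1000)
])

print("sample mean:                 ", incomes.mean())
print("range of leave-one-out means:", loo_means.max() - loo_means.min())
print("sd of bootstrap means:       ", boot_means.std())
```

Rerunning this with a column of 605 made-up incomes instead of 14 shrinks both spreads dramatically, which is the contrast the two plots are making.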

Sometimes There’s Nothing to Average

  • When we break things out in additional dimensions, e.g. by race, we see a bigger problem.
  • We can’t even take an average of Black survey respondents with 8th grade educations.
  • There’s nobody in that column to average.

Often.

  • When we break things out by county, we have empty columns too.
  • We’ve sampled nobody in San Francisco with an 8th grade education.
  • This isn’t an edge case. This is common.

Extremely Common.

  • When we break things down in multiple dimensions, we have a lot of empty categories.
  • Even using 3 dimensions—race, county, and education—we have a lot of empties.
    • Those aren’t necessarily empty categories in the population: we’ve only sampled about 1 in 2500 people.
    • But if we need to sample people in those groups to make predictions about them, we’re stuck.
      • We don’t have anything we can say about these unsampled groups.
      • So we can’t estimate any population summaries involving them.
  • So we need some way to make predictions for groups with no data.
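
Here’s a hedged sketch of how quickly empty cells show up when we cross-tabulate in three dimensions. The categories and sample size are invented for illustration; they’re not the actual CPS coding.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000  # hypothetical sample size

# Made-up categories standing in for racial identity, county, and years of education.
race = pd.Series(rng.choice(["A", "B", "C", "D"], size=n), name="race")
county = pd.Series(rng.choice([f"county_{k}" for k in range(20)], size=n), name="county")
education = pd.Series(rng.choice(np.arange(8, 19), size=n), name="education")

# Count how many sampled people fall into each (race, county, education) cell.
cells = pd.crosstab(index=[race, county], columns=education)
print("empty cells:", int((cells == 0).to_numpy().sum()), "out of", cells.size)
```

Even with only a handful of made-up categories per dimension, a large share of the cells come up empty.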

A Partial Solution

  • We’ve already seen a partial solution to this problem: coarsening.
  • Here we’ve coarsened in two dimensions.
    • We’ve coarsened education into two categories: 4-year degree (≥ 16 years) vs. no 4-year degree (< 16 years).
    • We’ve coarsened county into two areas: SF Bay Area (SF, Alameda) and LA Area (LA, Orange).
  • And it helps, but we have two problems.
    • We’ve renamed everything, so we’d have to redefine all our summaries. This is fixable.
    • We still have empty groups, although we have fewer. This is not.

We Need A General Solution

  • This empty-groups problem is a much bigger deal with this data than we’ve seen so far.
  • We’ve been looking at some of the most populous counties and common racial identities.
  • If we use all the counties in CA and all the identities the CPS includes, most groups are empty.
  • And the range of coarsening options is overwhelming.

Today’s Plan

  • Today, we’re going to talk about how to deal with the problem of small or empty groups.
  • To make this more manageable, we’ll break things down into a two-step process.
    1. We’ll decide on a regression model: a set of functions we could use to make predictions.
    2. We’ll choose the best function from our model using the least squares criterion.
  • We call the function we choose this way the least squares predictor within our model.

Notation

The sample

An illustration of the population
  • So far, we’ve used \(\mu(x)\) to refer to the mean of a column in the population plot.

\[ \mu(x) = \frac{1}{m_x}\sum_{j:x_j=x} y_j \quad \text{ for } m_x = \sum_{j:x_j=x} 1 \]

  • We’ve used \(\hat\mu(x)\) for one particular estimate of \(\mu(x)\): the mean of the corresponding column in the sample.

\[ \hat\mu(x) = \frac{1}{N_x}\sum_{i:X_i=x} Y_i \quad \text{ for } N_x = \sum_{i:X_i=x} 1 \]

  • From today on, we’ll be a bit more flexible with the meaning of \(\hat\mu\).
    • We’ll use \(\hat\mu(x)\) to refer to any estimator of the subpopulation mean \(\mu(x)\).
    • Which estimator we mean should be clear from context.

A Least Squares Interpretation of Means

Reinterpreting the Sample Mean

  • Let’s stop thinking about the sample mean procedurally, i.e., in terms of how we calculate it.
  • Let’s think of it as a choice, i.e., the best number for summarizing the data according to some criterion.
  • Here are the outcomes \(Y_i\) for sampled Californians with 8 years of education and three potential summaries of the location of those outcomes:
    1. The sample mode (blue line)
    2. The sample median (orange line)
    3. The sample mean (red line)
  • All of these are reasonable summaries, and each is the best choice according to some reasonable criterion.
  • Criteria which, sensibly, look at what’s left over when we compare our observations to our location summary.

\[ \hat\varepsilon_i = Y_i - \hat\mu \qquad \text{ are the \textbf{residuals}} \]

Residuals

  • Here we’re looking at the residuals for each of our location summaries \(\hat\mu\).
    1. \(\hat\varepsilon_i = Y_i - \hat\mu\) when \(\hat\mu\) is the mode (blue points). It has the largest number of zero residuals.
    2. \(\hat\varepsilon_i = Y_i - \hat\mu\) when \(\hat\mu\) is the sample median (orange points). It has the smallest sum of absolute residuals.
    3. \(\hat\varepsilon_i = Y_i - \hat\mu\) when \(\hat\mu\) is the sample mean (red points). It has the smallest sum of squared residuals.
  • For better or for worse, the sum of squared residuals is the criterion we tend to use.

\[ \textcolor{red}{\hat \mu} = \mathop{\mathrm{argmin}}_{\text{numbers}\ m} \sum_{i=1}^n \qty{ Y_i - m }^2 \quad \text{ is the \textbf{least squares estimate}} \]
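
A small numerical sketch of this comparison. The outcomes `Y` below are invented, not the CPS incomes, but the pattern is the same: the mode has the most zero residuals, the median the smallest sum of absolute residuals, and the mean the smallest sum of squared residuals.

```python
import numpy as np

# Hypothetical outcomes standing in for the incomes in one education column.
Y = np.array([0, 10, 10, 10, 20, 25, 40, 40, 55, 80, 120, 200, 350, 900], dtype=float)

values, counts = np.unique(Y, return_counts=True)
summaries = {
    "mode": values[np.argmax(counts)],
    "median": np.median(Y),
    "mean": Y.mean(),
}

for name, mu_hat in summaries.items():
    resid = Y - mu_hat  # residuals for this location summary
    print(f"{name:>6}: zero residuals = {np.sum(resid == 0)}, "
          f"sum of |residuals| = {np.abs(resid).sum():.0f}, "
          f"sum of squared residuals = {np.square(resid).sum():.0f}")
```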

Terminology

  • The argmin of a function is the argument at which the function is minimized. \[ \hat\mu = \mathop{\mathrm{argmin}}_{m} \ f(m) \quad \iff \quad f(\hat\mu) = \min_m \ f(m) \]
  • Here that function is the sum of squared residuals, \(f(m) = \sum_{i=1}^n \qty{ Y_i - m }^2\).

The Least Squares Location Summary

\[ \hat \mu = \mathop{\mathrm{argmin}}_{\text{numbers}\ m} \sum_{i=1}^n \qty{ Y_i - m }^2 \quad \text{ satisfies the \textbf{zero-derivative condition} } \]
\[ \begin{aligned} 0 &= \frac{d}{dm} \sum_{i=1}^n \qty{ Y_i - m }^2 \ \Big|_{m=\hat\mu} \\ &= \sum_{i=1}^n \frac{d}{dm} \qty{ Y_i - m }^2 \ \Big|_{m=\hat\mu} \\ &= \sum_{i=1}^n -2 \qty{ Y_i - m } \ \Big|_{m=\hat\mu} \\ &= \sum_{i=1}^n -2 \qty{ Y_i - \hat\mu } \end{aligned} \]

  • This says the least squares residuals \(\hat\varepsilon_i = Y_i - \hat \mu\) sum to zero.
  • With a bit of algebra, this tells us that the least squares estimator \(\hat\mu\) is the sample mean.

\[ \begin{aligned} &0 = \sum_{i=1}^n -2\qty{ Y_i - \hat\mu } && \text{ when } \\ &2\sum_{i=1}^n Y_i = 2\sum_{i=1}^n \hat\mu = 2n\hat\mu && \text{ and therefore when } \\ &\frac{1}{n}\sum_{i=1}^n Y_i = \hat\mu \end{aligned} \]

  • It’s not just the best choice of the 3 location summaries we’ve looked at.
  • It’s the best choice — in terms of the sum of squared errors criterion — of all numbers outright.
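
A quick numerical check of this claim, minimizing the sum of squared residuals over a fine grid of candidate numbers \(m\) (with made-up outcomes again):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.exponential(scale=50_000, size=25)  # hypothetical incomes

# Sum of squared residuals evaluated at each candidate location m on a grid.
grid = np.linspace(Y.min(), Y.max(), 100_001)
ssr = ((Y[:, None] - grid[None, :]) ** 2).sum(axis=0)

print("grid minimizer:", grid[np.argmin(ssr)])
print("sample mean:   ", Y.mean())  # agrees up to the grid's resolution
```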

Two Locations: Regression with a Binary Covariate

  • Suppose we have two groups: people in our sample with 8 and 12 years of education.
  • We can find the best location for each using this criterion of minimizing squared residuals.
  • We can think of the column means \(\hat\mu(x)\) as the best function of \(x\) for predicting \(y\).
    • That is, the best function of years of education for predicting income.
    • Where best means the one that minimizes the sum of squared residuals.

\[ \hat \mu(x) = \mathop{\mathrm{argmin}}_{\text{functions}\ m(x)} \sum_{i=1}^n \qty{ Y_i - m(X_i) }^2 \]

Two Locations: Regression with a Binary Covariate

  • Why? Recall that a function of \(X\) is just a number for each value of \(X\).
  • So to see that the subsample means are the solution, we can break our sum into pieces where \(X\) is the same.

\[ \hat \mu(x) = \mathop{\mathrm{argmin}}_{\text{functions}\ m(x)} \sum_{i:X_i=8} \{ Y_i - m(8) \}^2 + \sum_{i:X_i=12} \{ Y_i - m(12) \}^2 \]

  • And observe that the first sum depends only on \(m(8)\) and the second only on \(m(12)\).
    • So we get \(\hat\mu(8)\) by minimizing squared residuals in the \(X=8\) column.
    • And we get \(\hat\mu(12)\) by minimizing squared residuals in the \(X=12\) column.
  • The minimizers — which we can solve for exactly as before — are the means in each column.

\[ \begin{aligned} \hat \mu(8) &= \mathop{\mathrm{argmin}}_{\text{numbers}\ m} \sum_{i:X_i=8} \{ Y_i - m \}^2 = \frac{\sum_{i:X_i=8} Y_i}{\sum_{i:X_i=8} 1 } \\ \hat \mu(12) &= \mathop{\mathrm{argmin}}_{\text{numbers}\ m} \sum_{i:X_i=12} \{ Y_i - m \}^2 = \frac{\sum_{i:X_i=12} Y_i}{\sum_{i:X_i=12} 1 } \end{aligned} \]

Many Locations: Regression with Multiple or Many-Valued Covariates

  • There’s nothing special about two locations. We can do this for as many locations as we like.

  • We break our sum into pieces where \(X\) is the same. \[ \hat \mu(x) = \mathop{\mathrm{argmin}}_{\text{functions}\ m(x)} \sum_x \sum_{i:X_i=x} \{ Y_i - m(x) \}^2 \]

  • And we get the best location at each \(x\), in terms of squared residuals, by minimizing each piece.

  • The solution is, as before, the mean of each subsample.

\[ \hat \mu(x) = \mathop{\mathrm{argmin}}_{\text{numbers}\ m} \sum_{i:X_i=x} \{ Y_i - m \}^2 = \frac{\sum_{i:X_i=x} Y_i}{\sum_{i:X_i=x} 1 } \]
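
A hedged sketch of this in code, using made-up education and income vectors. Computing the subsample mean in each column gives the same answer as a least squares fit with one indicator column per value of \(X\), which is the ‘all functions of \(X\)’ model written out explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample: years of education X and income Y (not the CPS data).
X = rng.choice([8, 12, 14, 16, 18], size=200)
Y = 20_000 + 3_000 * X + rng.normal(scale=15_000, size=200)

# Column means: the least squares predictor over all functions of X.
levels = np.unique(X)
mu_hat = {int(x): Y[X == x].mean() for x in levels}

# The same fit via least squares with one indicator column per value of X.
D = (X[:, None] == levels[None, :]).astype(float)  # n-by-5 indicator matrix
coef, *_ = np.linalg.lstsq(D, Y, rcond=None)

print({k: round(v, 1) for k, v in mu_hat.items()})
print(dict(zip(levels.astype(int).tolist(), np.round(coef, 1))))  # same numbers
```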

Coarsening

  • When we coarsen, we’re in effect restricting the set of functions we’re considering.
  • We’re restricting them to be functions of the coarsened groups \(c(x)\), i.e. the color in the plot.
  • Now we break our sum into pieces where the coarsened group \(c(x)\) is the same.

\[ \begin{aligned} \hat \mu(c(x)) &= \mathop{\mathrm{argmin}}_{\text{functions}\ m(c(x)) } \sum_{c(x)} \sum_{i:c(X_i)=c(x)} \{ Y_i - m(c(x)) \}^2 \\ &= \mathop{\mathrm{argmin}}_{\text{functions}\ m(c(x)) } \textcolor[RGB]{248,118,109}{\sum_{i:c(X_i)=\text{red}} \{ Y_i - m(\text{red}) \}^2} + \textcolor[RGB]{0,191,196}{\sum_{i:c(X_i)=\text{green}} \{ Y_i - m(\text{green}) \}^2} \end{aligned} \]

  • And we get the best location at each coarsened \(x\), in terms of squared error, by minimizing each piece.
  • And the solution is the mean of each coarsened group.

\[ \hat \mu(x) = \mathop{\mathrm{argmin}}_{\text{numbers}\ m} \sum_{i:c(X_i)=c(x)} \{ Y_i - m \}^2 = \frac{\sum_{i:c(X_i)=c(x)} Y_i}{\sum_{i:c(X_i)=c(x)} 1 } \]

  • This says the prediction we make at each \(x\) is the mean of the \(y\) values in the coarsened group it belongs to.
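
A short continuation of the made-up \(X, Y\) sketch above (regenerated here so the block stands alone): coarsening just means grouping by \(c(x)\) instead of \(x\), here the degree / no-degree split from earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.choice([8, 12, 14, 16, 18], size=200)                 # hypothetical education
Y = 20_000 + 3_000 * X + rng.normal(scale=15_000, size=200)   # hypothetical income

def c(x):
    """Coarsen education: 4-year degree (>= 16 years) vs. no 4-year degree."""
    return np.where(np.asarray(x) >= 16, "degree", "no degree")

groups = c(X)
mu_hat_coarse = {g: Y[groups == g].mean() for g in np.unique(groups)}

def predict(x):
    # The prediction at any x is the mean of the coarsened group it belongs to.
    return mu_hat_coarse[c(x).item()]

print(mu_hat_coarse)
print(predict(8), predict(12), predict(18))
```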

Least Squares Regression

  • We can think of both uncoarsened and coarsened subsample means as least squares estimators.
  • Each is the best choice, within some set of functions, for predicting \(Y\) from \(X\).
  • But the set of functions we’re choosing from is different in each case.
    • In the uncoarsened case, we’re choosing from all functions of \(X\).
    • In the coarsened case, we’re choosing from functions of the coarsened \(X\).
    • We could consider other sets of functions as well.

\[ \hat\mu = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \{ Y_i - m(X_i) \}^2 \qquad \text{ where } \quad \mathcal{M} \ \ \text{ is our model.} \]

Least Squares Regression in Linear Models

Least Squares Regression

  • Regression, generally, is choosing a function of covariates \(X_i\) to predict outcomes \(Y_i\).
  • We choose from a set of functions \(\mathcal{M}\) that we call a regression model.

\[ Y_i \approx \hat\mu(X_i) \qfor \hat\mu \in \mathcal{M} \]

  • In this class, we’ll choose by least squares. We can do this for any model we like.

\[ \begin{aligned} \hat\mu &= \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \{ Y_i - m(X_i) \}^2 \qfor && \textcolor{red}{\mathcal{M}} = \qty{ \text{all functions} \ m(x) } \\ &&& \textcolor{blue}{\mathcal{M}} = \qty{ \text{all lines} \ m(x) = a + bx } \\ &&& \textcolor{magenta}{\mathcal{M}} = \qty{ \text{all increasing functions} \ m(x) } \\ &&& \textcolor{cyan}{\mathcal{M}} = \qty{ \text{all functions of an indicator} \ m(x)=f(1_{\ge 16}(x))} \\ \end{aligned} \]
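
A sketch of what least squares looks like in three of these models on the same made-up education/income data: all functions of \(x\) (column means), all lines (slope and intercept via `np.polyfit`), and all functions of the indicator \(1_{\ge 16}(x)\) (two means).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.choice([8, 12, 14, 16, 18], size=200).astype(float)   # hypothetical education
Y = 20_000 + 3_000 * X + rng.normal(scale=15_000, size=200)   # hypothetical income

# M = all functions of x: least squares gives the column mean at each x.
mu_all = {int(x): round(float(Y[X == x].mean())) for x in np.unique(X)}

# M = all lines a + b x: least squares slope and intercept.
b, a = np.polyfit(X, Y, deg=1)

# M = all functions of the indicator 1{x >= 16}: one mean on each side of 16.
mu_ind = {flag: round(float(Y[(X >= 16) == flag].mean())) for flag in (False, True)}

print(mu_all)
print(f"line: {a:.0f} + {b:.0f} x")
print(mu_ind)
```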

Least Squares in Linear Models

  • We’re going to focus on choosing from linear models.
    • These are sets of functions that are ‘closed’ under addition, subtraction, and scaling by constants.
    • This means when we add, subtract, and scale functions in the model, we get another function in the model. \[ a(x), b(x) \in \mathcal{M}\implies a(x) + b(x), a(x) - b(x), c a(x) \in \mathcal{M}\quad \text{ for any constant $c$ } \]
  • Q. Which of these models are linear?

\[ \begin{aligned} \textcolor{red}{\mathcal{M}} &= \qty{ \text{all functions} \ m(x) } \\ \textcolor{blue}{\mathcal{M}} &= \qty{ \text{all lines} \ m(x)=a + bx } \\ \textcolor{magenta}{\mathcal{M}} &= \qty{ \text{all increasing functions} \ m(x) } \\ \textcolor{cyan}{\mathcal{M}} &= \qty{ \text{all functions of an indicator} \ m(x)=f(1_{\ge 16}(x))} \\ \end{aligned} \]

Multivariate Regression Models

A few examples.

Additive Models

  • Here we let our predictions be arbitrary functions of education (\(x\)) as before.
  • But we’re requiring that race (\(w\)) only shift that function up and down—it can’t change the function’s shape.
  • The result, in visual terms, is that the predictions for different values of \(w\) are ‘parallel curves’.

\[\small{ \begin{aligned} \textcolor{blue}{\mathcal{M}} &= \qty{ m(w,x) = m_0(w) + m_1(x) \ \ \text{for univariate functions} \ \ m_0, m_1 } &&\text{additive bivariate model} \\ \end{aligned} } \]

The Parallel Lines Model

  • Here we impose the additional restriction that the function of education (\(x\)) is a line.
  • This means that the predictions for different values of \(w\) are parallel lines: same slope, different intercepts.

\[\small{ \begin{aligned} \textcolor{red}{\mathcal{M}} &= \qty{ m(w,x) = a(w) + b x } &&\text{parallel lines} \\ \end{aligned} } \]

The Not-Necessarily-Parallel Lines Model

  • Here we stick with the lines thing, but drop the additivity restriction.
  • For each value of \(w\), we can get a totally different line—different intercept and slope.

\[\small{ \begin{aligned} \textcolor{magenta}{\mathcal{M}} &= \qty{ m(w,x) = a(w) + b(w) x } &&\text{lines} \\ \end{aligned} } \]
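
A hedged sketch of the last two models, fit by least squares with explicit design matrices (via `np.linalg.lstsq`). The binary group \(w\), the covariate \(x\), and the data-generating process below are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
x = rng.choice(np.arange(8, 19), size=n).astype(float)  # hypothetical education
w = rng.integers(0, 2, size=n)                          # hypothetical binary group
y = 15_000 + 5_000 * w + 3_000 * x + 500 * w * x + rng.normal(scale=10_000, size=n)

# Parallel lines, m(w, x) = a(w) + b x: columns 1{w=0}, 1{w=1}, x.
D_par = np.column_stack([(w == 0).astype(float), (w == 1).astype(float), x])
a0, a1, b = np.linalg.lstsq(D_par, y, rcond=None)[0]

# Not-necessarily-parallel lines, m(w, x) = a(w) + b(w) x:
# columns 1{w=0}, 1{w=1}, 1{w=0} x, 1{w=1} x.
D_sep = np.column_stack([(w == 0).astype(float), (w == 1).astype(float),
                         (w == 0) * x, (w == 1) * x])
c0, c1, b0, b1 = np.linalg.lstsq(D_sep, y, rcond=None)[0]

print(f"parallel lines: intercepts {a0:.0f} (w=0), {a1:.0f} (w=1); common slope {b:.0f}")
print(f"separate lines: intercepts {c0:.0f}, {c1:.0f}; slopes {b0:.0f} (w=0) vs {b1:.0f} (w=1)")
```

The additive model can be fit the same way, using indicator columns for the levels of \(w\) and of \(x\) (dropping one \(x\) indicator so the columns aren’t redundant).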

Where This Leaves Us

  • Each of these approaches allows us to make predictions everywhere—even in columns we haven’t sampled.
  • Fundamentally, it’s because they’re using information from one column to make predictions in another.
  • Q. This solves the empty-groups problem, but it introduces a new one. What is it?
  • A. We’re basing our predictions on assumptions about what \(\mu\) looks like.
  • If we’re wrong, our predictions will be wrong. It’ll bias our estimates of \(\mu(x)\).

Where We’re Going

  • There’s no right model to use. Each has its own strengths and weaknesses.
  • In the next couple weeks, we’ll look into the issues surrounding model choice.
    • Implications of bad model choice.
    • A good model to use by default.
    • How to choose a model automatically.

Practice

A Small Dataset

  • Researchers hypothesize that higher calcium intake causes greater bone density (stronger bones).
  • Bone density scores indicate how good someone’s bone density is.
    • 3 is typical; larger numbers are better, smaller numbers are worse.
    • A score of 0.5 or lower is used to diagnose osteoporosis.
  • Patients in our sample are classified as old (age 65+) or young (age 64-).
    • We’ll call this \(X_i\): \(X_i=0\) if patient \(i\) is young, \(X_i=1\) if patient \(i\) is old.
  • Each patient in our sample regularly takes a calcium supplement or not.
    • We’ll call this \(W_i\): \(W_i=1\) if patient \(i\) takes a calcium supplement, \(W_i=0\) if they do not.

The Horizontal Lines Model

Sketch in the least squares predictor in the horizontal lines model.

\[ \mathcal{M} = \qty{ m(w,x) = a(w) \ \ \text{for univariate functions} \ \ a } \]

Flip a slide forward to check your answer.

The Not-Necessarily-Parallel Lines Model

Sketch in the least squares predictor in the not-necessarily-parallel lines model.

\[ \mathcal{M} = \qty{ m(w,x) = a(w) + b(w) x \ \ \text{for univariate functions} \ \ a,b } \]

Flip a slide forward to check your answer.

The Parallel Lines Model

Sketch in the least squares predictor in the parallel lines model.

\[ \mathcal{M} = \qty{ m(w,x) = a(w) + b x \ \ \text{for univariate functions} \ \ a \ \ \text{and constants} \ \ b } \]

Flip a slide forward to check your answer.

The Additive Model

Sketch in the least squares predictor in the additive model. \[ \mathcal{M} = \qty{ m(w,x) = a(w) + b(x) \ \ \text{for univariate functions} \ \ a,b } \]

Flip a slide forward to check your answer. Is what you’ve drawn familiar? If so, why?

  • It’s the same as the parallel lines model.
    • When \(x\) takes on two values, all functions of \(x\) can be written as lines.
    • Adding a function of \(w\) to it, we get different—but parallel—lines for different values of \(w\).
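
A minimal numerical check of this equivalence on made-up bone-density data with binary \(X\) (old vs. young) and \(W\) (calcium vs. not). With binary \(x\), any function \(b(x)\) can be written as \(b(0) + \{b(1)-b(0)\}x\), so the additive and parallel lines models describe the same set of functions and share one design matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.integers(0, 2, size=n)   # 1 = old, 0 = young         (hypothetical)
W = rng.integers(0, 2, size=n)   # 1 = takes calcium, 0 = not (hypothetical)
Y = 3.0 - 1.0 * X + 0.5 * W + rng.normal(scale=0.4, size=n)   # made-up scores

# One design matrix serves both models: m(w, x) = a(w) + b x,
# with columns 1{w=0}, 1{w=1}, x.
D = np.column_stack([(W == 0).astype(float), (W == 1).astype(float), X.astype(float)])
a0, a1, b = np.linalg.lstsq(D, Y, rcond=None)[0]

print(f"a(0) = {a0:.2f}, a(1) = {a1:.2f}, slope b = {b:.2f}")
```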