37  Least Squares in R

Setup

Least Squares in Linear Models

The estimation targets we’ve talked about are all summaries of column means in the population.

\[ \begin{aligned} \mu(w,x) &= \frac{1}{m_{w,x}}\sum_{j:w_j=w,x_j=x} y_j \qfor m_{w,x} = \sum_{j: w_j=w, x_j=x} 1 \\ &= \mathop{\mathrm{E}}[Y_i \mid W_i=w, X_i=x] \qfor (W_i,X_i,Y_i) \qqtext{sampled uniformly-at-random} \end{aligned} \]

Least Squares Regression is a method for choosing a function to estimate these column means. When we choose from the set of all functions of \((w,x)\), we get the mean in the corresponding column of the sample.

\[ \begin{aligned} \hat\mu(w,x) &= \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \qty{Y_i - m(W_i,X_i)}^2 \\ &= \frac{1}{N_{w,x}}\sum_{i:W_i=w,X_i=x} Y_i \qqtext{ when } \mathcal{M}= \{ \text{all functions of } (w,x) \} \end{aligned} \]

The set of functions we choose from is our model. Or regression model if we want to be verbose. And there are, of course, many other models. When we don’t have a lot of data in some columns in the sample, column means are really sensitive to who we happened to sample. And we’re probably better off using a smaller model, i.e. one that includes fewer functions. The models we’ll focus on in this class are linear models. They’re closed under addition, subtraction, and scalar multiplication. That means when you add, subtract, or scale any functions in the model, you get another function in the model.
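For example, the set of all lines is linear in this sense: adding two lines, or scaling one, gives another line.

\[ \begin{aligned} m_1(x) + m_2(x) &= (a_1+a_2) + (b_1+b_2)x \\ c \, m_1(x) &= (c a_1) + (c b_1) x \end{aligned} \qfor m_1(x) = a_1 + b_1 x \qqtext{and} m_2(x) = a_2 + b_2 x. \]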

Least Squares with Univariate Data

Models

\[ \begin{aligned} \mathcal{M}&= \{ \text{all functions } \ m(x) \} && \text{all functions} \\ \mathcal{M}&= \{ \text{all functions } \ m(c(x)) \} && \text{all functions of $c(x)$, a coarsened version of $x$} \\ \mathcal{M}&= \{ m(x) = a + bx \ : \ a,b \in \mathbb{R} \} && \text{all lines} \end{aligned} \]

We use R’s built-in function lm¹ to do least squares with linear models. To tell it which model to use, we use what R calls a ‘formula’. What a formula does is describe a basis for the model, i.e. a set of functions that we can take linear combinations of to get any function in the model. \[ \begin{aligned} \mathcal{M}&= \{ m(x) = a_0 b_0(x) + a_1 b_1(x) + \ldots + a_k b_k(x) \ \ : \ a_0 \ldots a_k \in \mathbb{R} \} \\ & \qif b_0(x),b_1(x),\ldots,b_k(x) \ \text{ is a basis for } \ \mathcal{M}. \end{aligned} \]
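Here’s a minimal sketch of what this looks like for the line model, assuming our sample is stored in a data frame unisam with columns x and y.

```r
# Fit the line model by least squares.
# unisam is assumed to be a data frame with numeric columns x and y.
fitted.model = lm(y ~ 1 + x, data = unisam)
```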

Where it says lm(y~1+x, data=unisam) above, y~1+x is the formula and unisam is the data. It’s saying that we want to choose a function \(m\) from our model for which \(m(X_i) \approx Y_i\), where \((X_i,Y_i)\) are unisam$x[i] and unisam$y[i] for \(i=1 \ldots n\). And our model is the set of all linear combinations of the functions \(b_0(x) = 1\) and \(b_1(x) = x\). \[ \hat \mu = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \qty{Y_i - m(X_i)}^2 \qfor \mathcal{M}= \{m(x) = a_0 \ 1 + a_1 \ x \ \ : \ \ a_0,a_1 \in \mathbb{R} \} \]

What lm gives you is not really the function \(\hat\mu\). It’s something else. But whatever it is, you can calculate \(\hat\mu(x)\) by calling predict. The code below uses predict to calculate \(\hat\mu(8)\), \(\hat\mu(10)\), and \(\hat\mu(12)\).
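A sketch of that calculation, assuming the fit from above is stored in fitted.model.

```r
# Evaluate the fitted function at x = 8, 10, and 12.
predict(fitted.model, newdata = data.frame(x = c(8, 10, 12)))
```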

I like to write a function muhat that calls predict on its argument. This makes the code shorter and more like the mathematical notation.
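A sketch of that wrapper, again assuming the fit is stored in fitted.model.

```r
# muhat(x) evaluates the least squares fit at x, so the code reads like the math.
muhat = function(x) {
  predict(fitted.model, newdata = data.frame(x = x))
}
muhat(c(8, 10, 12))   # same numbers as the predict call above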

If you’re familiar with object oriented programming, lm returns an object we call fitted.model, and predict is one of its methods. Method calls in R look a little different than they do in Python, C++, or Java. But they work pretty much the same way. Almost. R, like Common Lisp and Julia, has an object system based on multimethods.

Formulas in Abstract
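Here is a sketch of the kind of fit this section works with, assuming b.0 and b.1 are the constant and identity functions from the line model above.

```r
# Hand-written basis functions: b.0(x) = 1 and b.1(x) = x.
b.0 = function(x) { rep(1, length(x)) }
b.1 = function(x) { x }
fitted.model = lm(y ~ b.0(x) + b.1(x), data = unisam)
```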

You can use any R functions you like in a formula. Above, I’ve implemented my own functions b.0 and b.1 to use as my basis. You can also use R’s built-in functions. Try replacing b.1 in the formula with sin. You can use as many basis functions as you like. Try adding b.2(x) to the formula y ~ b.0(x) + b.1(x) above.

You don’t really have to define all these functions. \(1\) and \(x\) will do in place of b.0(x) and b.1(x). But x^2 won’t do in place of b.2(x), because squaring means something special in formulas: ^ is used to cross terms, so x^2 expands to just x. You can work around this by writing I(x^2) instead. In short, wrapping an expression in I says ‘interpret this the normal way.’ Try writing out a formula that does the same thing as y ~ b.0(x) + b.1(x) + b.2(x) but without using b.0, b.1, or b.2.
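If you want to check your answer, here’s one way to write such a formula, assuming b.2 is the squaring function.

```r
# 1 and x can appear directly; x^2 needs I() so it's squared arithmetically.
fitted.model = lm(y ~ 1 + x + I(x^2), data = unisam)
```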

R implicitly adds the constant function \(b_0(x)=1\) to every basis you give it. Go ahead and get rid of b.0(x) in the formula y~b.0(x)+b.1(x) above. Does anything change in the plot? If you want to stop it from doing this, you can write y ~ 0 + b.1(x) instead. What changes in the plot? Why?
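Here’s a sketch of the two versions, assuming b.1 is defined as above.

```r
# With the implicit intercept: R adds the constant function for you.
with.intercept = lm(y ~ b.1(x), data = unisam)
# Without it: the fitted line is forced through the origin.
without.intercept = lm(y ~ 0 + b.1(x), data = unisam)
```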

The All-Functions Model

When we write y ~ factor(x), we’re telling R to treat each value of x as a separate category. The model has one parameter per unique value of x—so the least squares solution is just the mean of y within each column. This is the “all functions” model: it can fit any pattern in the column means.
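A sketch of that fit, along with a check that the fitted values really are the column means of the sample.

```r
# The all-functions model: one parameter per observed value of x.
all.functions = lm(y ~ factor(x), data = unisam)

# Fitted values at each observed x match the within-column means of y.
xs = sort(unique(unisam$x))
cbind(muhat   = predict(all.functions, newdata = data.frame(x = xs)),
      colmean = tapply(unisam$y, unisam$x, mean))
```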

Try comparing this to y ~ 1 + x from before. The line model smooths across columns; the all-functions model doesn’t.
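One way to see the difference is to overlay both fits on a scatterplot of the sample, sketched here with base R graphics.

```r
# Scatterplot of the sample with the line model and the all-functions model overlaid.
plot(unisam$x, unisam$y, col = "gray")
xs = sort(unique(unisam$x))
lines(xs, predict(lm(y ~ 1 + x, data = unisam), newdata = data.frame(x = xs)), col = "blue")
points(xs, predict(lm(y ~ factor(x), data = unisam), newdata = data.frame(x = xs)), col = "red", pch = 16)
```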

Piecewise-Constant Models

We can get something in between by coarsening x before treating it as a factor. Here coarsen(x) groups years of education into four categories: no college degree, 2-year degree, 4-year degree, and graduate degree. The least squares solution is the mean within each group.
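Here’s a sketch of one possible coarsening. The cut points below are assumptions about how years of education are coded; adjust them to match your data.

```r
# Group years of education into four categories (cut points are assumptions).
coarsen = function(x) {
  cut(x, breaks = c(-Inf, 13, 15, 17, Inf),
      labels = c("no college degree", "2-year degree", "4-year degree", "graduate degree"))
}
piecewise.constant = lm(y ~ coarsen(x), data = unisam)
```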

Try changing the coarsening. What happens if you collapse “no college” and “2-year” into one group?

Least Squares with Multivariate Data


  1. ‘lm’ stands for ‘linear model’