The Algebra Behind OLS
Everything so far has described OLS in words: the best linear approximation to the CEF, the line that minimizes squared residuals, the thing that partials out controls. This page shows where the numbers actually come from — the matrix algebra that produces coefficient estimates and their standard errors.
From scalar to matrix
In simple regression (\(Y = \beta_0 + \beta_1 X + \varepsilon\)), you already know the formula:
\[ \hat{\beta}_1 = \frac{\text{Cov}(X, Y)}{\text{Var}(X)} \]
That’s one regressor and one coefficient. But the moment you add a second regressor — say \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon\) — this scalar formula breaks down. You can’t just compute \(\text{Cov}(X_1, Y) / \text{Var}(X_1)\) and call it \(\hat{\beta}_1\), because that ignores the correlation between \(X_1\) and \(X_2\). You need to adjust for all the regressors simultaneously.
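A quick simulation makes the failure concrete. This is a minimal numpy sketch with made-up coefficients: \(X_1\) and \(X_2\) are correlated, the true \(\beta_1\) is 2, and the one-regressor formula misses it because \(X_1\) partly proxies for \(X_2\).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# x1 and x2 are correlated; true coefficients (chosen for illustration): beta1 = 2, beta2 = 3
x2 = rng.normal(size=n)
x1 = 0.8 * x2 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

# The simple-regression formula ignores x2 and absorbs part of its effect
naive_b1 = np.cov(x1, y)[0, 1] / np.var(x1, ddof=1)
print(naive_b1)  # roughly 3.5, not 2: x1 gets credit for x2's contribution
```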
Matrix notation handles this cleanly. Stack all \(n\) observations into:
\[ \underset{(n \times 1)}{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \qquad \underset{(n \times k)}{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1,k-1} \\ 1 & x_{21} & x_{22} & \cdots & x_{2,k-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{n,k-1} \end{bmatrix}, \qquad \underset{(k \times 1)}{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{k-1} \end{bmatrix} \]
The first column of \(X\) is all ones — that’s the intercept. Now the model is just \(y = X\beta + \varepsilon\), regardless of how many regressors you have.
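In code, the stacking is just a column of ones glued to the regressors. A minimal numpy sketch with hypothetical data arrays:

```python
import numpy as np

# Hypothetical data: y, x1, x2 are 1-D arrays of length n
n = 5
y = np.array([3.1, 4.0, 5.2, 6.1, 7.3])
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.5, 3.5, 3.0, 4.5])

# Design matrix: intercept column of ones, then one column per regressor (n x k, here k = 3)
X = np.column_stack([np.ones(n), x1, x2])
print(X.shape)  # (5, 3)
```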
The OLS solution
OLS minimizes \(\sum_i e_i^2 = e'e = (y - X\hat{\beta})'(y - X\hat{\beta})\). Take the derivative with respect to \(\hat{\beta}\), set it to zero, and you get the normal equations:
\[ X'X\hat{\beta} = X'y \]
Solve for \(\hat{\beta}\):
\[ \boxed{\hat{\beta} = (X'X)^{-1}X'y} \]
That’s it. Every OLS coefficient you’ve ever seen comes from this formula.
What is \((X'X)^{-1}\) doing? The matrix \(X'X\) captures how the regressors relate to each other — their variances and covariances. Inverting it adjusts for the fact that regressors are correlated. This is exactly what Frisch-Waugh-Lovell does when it “partials out” controls: the mechanical operation of partialling out is the \((X'X)^{-1}\) adjustment.
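Here is the formula implemented directly on simulated data (arbitrary coefficients), checked against numpy's least-squares solver. In practice you would solve the normal equations rather than invert \(X'X\) explicitly, which is why the sketch also shows `np.linalg.solve`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

# The textbook formula: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerically safer equivalents: solve the normal equations, or use a least-squares solver
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_hat, beta_solve), np.allclose(beta_hat, beta_lstsq))  # True True
```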
The Gauss-Markov assumptions
The OLS formula \(\hat{\beta} = (X'X)^{-1}X'y\) is purely mechanical — it always gives you a number. But whether that number has good properties (unbiased, minimum variance) depends on a set of assumptions known as the Gauss-Markov conditions:
1. Linearity. The true model is \(y = X\beta + \varepsilon\) — the relationship between \(y\) and \(X\) is linear in the parameters.
2. Strict exogeneity. \(E[\varepsilon \mid X] = 0\) — the errors are mean-zero and uncorrelated with the regressors. This rules out omitted variable bias, reverse causality, and measurement error in \(X\).
3. No perfect multicollinearity. The columns of \(X\) are linearly independent, so \(X'X\) is invertible. (Near-collinearity is allowed but inflates standard errors — see below.)
4. Homoskedasticity. \(\text{Var}(\varepsilon \mid X) = \sigma^2 I\) — the error variance is constant across observations.
5. No autocorrelation. \(\text{Cov}(\varepsilon_i, \varepsilon_j \mid X) = 0\) for \(i \neq j\) — errors are uncorrelated with each other.
Under assumptions 1-3, OLS is unbiased: \(E[\hat{\beta} \mid X] = \beta\). Add 4 and 5, and the Gauss-Markov theorem kicks in: OLS is the Best Linear Unbiased Estimator (BLUE) — no other linear, unbiased estimator has smaller variance.
Notice that assumptions 4 and 5 are about the standard errors, not the coefficient estimates. If they fail, \(\hat{\beta}\) is still unbiased — but the formula for its variance changes. That’s exactly what heteroskedasticity-robust SEs and clustered SEs fix.
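A Monte Carlo sketch of the unbiasedness claim: draw many samples from a model with known \(\beta\), run OLS on each, and average the estimates. The sample size, number of replications, and coefficients below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 200, 5_000
beta_true = np.array([1.0, 2.0, -0.5])  # intercept, beta1, beta2 (chosen arbitrarily)

estimates = np.empty((reps, 3))
for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = 0.6 * x1 + rng.normal(size=n)   # correlated regressors are fine
    X = np.column_stack([np.ones(n), x1, x2])
    eps = rng.normal(size=n)             # homoskedastic, independent errors
    y = X @ beta_true + eps
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)

print(estimates.mean(axis=0))  # close to [1.0, 2.0, -0.5]: E[beta_hat] = beta
```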
The variance-covariance matrix
We have estimates \(\hat{\beta}\). How precise are they? Under the Gauss-Markov assumptions (homoskedastic errors with variance \(\sigma^2\), uncorrelated with \(X\)):
\[ \boxed{\text{Var}(\hat{\beta}) = \sigma^2 (X'X)^{-1}} \]
This is a \(k \times k\) matrix — one row and column for each coefficient (including the intercept). Here’s what lives inside it:
| Position | Contains | Tells you |
|---|---|---|
| Diagonal entry \((j, j)\) | \(\text{Var}(\hat{\beta}_j)\) | How uncertain you are about \(\hat{\beta}_j\) |
| Off-diagonal entry \((j, m)\) | \(\text{Cov}(\hat{\beta}_j, \hat{\beta}_m)\) | How estimation error in one coefficient relates to another |
Standard errors are the square roots of the diagonal:
\[ \text{SE}(\hat{\beta}_j) = \sqrt{\text{Var}(\hat{\beta}_j)} = \sqrt{\sigma^2 \left[(X'X)^{-1}\right]_{jj}} \]
In practice, \(\sigma^2\) is unknown and estimated by:
\[ \hat{\sigma}^2 = \frac{e'e}{n - k} = \frac{\sum_i e_i^2}{n - k} \]
where \(e = y - X\hat{\beta}\) are the residuals and \(n - k\) corrects for the degrees of freedom used up by estimating \(k\) parameters. This is the same \(n-1\) logic from Variance, SD & SE, generalized to \(k\) parameters.
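Putting the pieces together: residuals, \(\hat{\sigma}^2\) with the \(n-k\) correction, the variance matrix, and the SEs from its diagonal. A sketch under the classical assumptions, on simulated data; these should match what a standard regression routine reports as its default (non-robust) standard errors.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
k = X.shape[1]

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

e = y - X @ beta_hat              # residuals
sigma2_hat = (e @ e) / (n - k)    # degrees-of-freedom-corrected error variance
vcov = sigma2_hat * XtX_inv       # k x k variance-covariance matrix
se = np.sqrt(np.diag(vcov))       # classical standard errors

print(se)
```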
What makes standard errors big or small
The formula \(\text{Var}(\hat{\beta}) = \sigma^2(X'X)^{-1}\) tells you exactly what controls precision. There are four levers:
More data (\(n\) large). Each additional observation adds a row to \(X\), so the entries of \(X'X\) grow with \(n\). A larger \(X'X\) means a smaller \((X'X)^{-1}\), which means smaller standard errors. This is why standard errors shrink at rate \(1/\sqrt{n}\).
More spread in \(X\). If your regressor has high variance, \(X'X\) is large in the relevant entries. More variation in \(X\) gives you more “leverage” to estimate the slope — like fitting a line through points that are far apart rather than bunched together.
Less noise (\(\sigma^2\) small). If the errors are small, the data points hug the regression line closely, and you can estimate the slope precisely. This enters directly as a multiplier on the whole variance matrix.
Correlated regressors (multicollinearity). When regressors are highly correlated, \(X'X\) is close to singular — its determinant is near zero. Inverting a nearly singular matrix produces huge numbers, so \((X'X)^{-1}\) blows up and standard errors explode. The data can’t tell the regressors apart, so it can’t precisely attribute the effect to one vs. the other. This is multicollinearity: the estimates are still unbiased, but they’re noisy.
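The multicollinearity lever is easy to see numerically: crank up the correlation between two regressors and watch the relevant diagonal entry of \((X'X)^{-1}\) grow. A sketch with an arbitrary sample size and a few correlation levels:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000

for rho in [0.0, 0.9, 0.99, 0.999]:
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)  # corr(x1, x2) ~ rho
    X = np.column_stack([np.ones(n), x1, x2])
    XtX_inv = np.linalg.inv(X.T @ X)
    # Var(beta1_hat) is proportional to this diagonal entry; it explodes as rho -> 1
    print(rho, XtX_inv[1, 1])
```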
Connecting to the rest of the course
The formula \(\text{Var}(\hat{\beta}) = \sigma^2(X'X)^{-1}\) assumes homoskedastic, independent errors. The rest of the course shows what happens when those assumptions fail — and each fix modifies this formula in a specific way.
Heteroskedasticity. When \(\sigma^2\) isn’t constant across observations, the formula \(\sigma^2(X'X)^{-1}\) is wrong. Robust (sandwich) standard errors replace it with:
\[ \text{Var}(\hat{\beta}) = (X'X)^{-1}\left(\sum_i e_i^2 \, x_i x_i'\right)(X'X)^{-1} \]
The middle term lets each observation contribute its own squared residual instead of assuming a single \(\sigma^2\) for everyone.
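A minimal numpy sketch of this sandwich (the HC0 variant, with no small-sample correction), on simulated data where the error variance grows with \(|x_1|\):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])
# Heteroskedastic errors: variance grows with |x1|
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * (0.5 + np.abs(x1))

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat

# Middle "meat" matrix: sum_i e_i^2 * x_i x_i'  (equivalently X' diag(e^2) X)
meat = (X * e[:, None]**2).T @ X
vcov_robust = XtX_inv @ meat @ XtX_inv   # the sandwich
se_robust = np.sqrt(np.diag(vcov_robust))

k = X.shape[1]
se_classical = np.sqrt(np.diag((e @ e) / (n - k) * XtX_inv))
print(se_classical, se_robust)           # the two disagree once sigma^2 varies by observation
```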
Clustered SEs. When observations within groups are correlated, the sandwich gets an additional layer: residuals are summed within clusters before forming the middle matrix. This accounts for the fact that 100 observations in 10 clusters carry less information than 100 independent observations.
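The cluster-robust version sums \(e_i x_i\) within each cluster before forming the middle matrix. A minimal sketch on simulated data with a shared within-cluster shock; real implementations typically add small-sample correction factors that are omitted here.

```python
import numpy as np

rng = np.random.default_rng(6)
n_clusters, per = 10, 10                    # 100 observations in 10 clusters
g = np.repeat(np.arange(n_clusters), per)   # cluster labels
n = n_clusters * per

x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])
cluster_shock = rng.normal(size=n_clusters)[g]   # common shock within each cluster
y = X @ np.array([1.0, 2.0]) + cluster_shock + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat

# Meat: sum over clusters of (sum_{i in c} e_i x_i)(sum_{i in c} e_i x_i)'
meat = np.zeros((X.shape[1], X.shape[1]))
for c in range(n_clusters):
    s = (X[g == c] * e[g == c, None]).sum(axis=0)   # cluster score
    meat += np.outer(s, s)

vcov_cluster = XtX_inv @ meat @ XtX_inv
print(np.sqrt(np.diag(vcov_cluster)))   # larger than i.i.d. SEs when shocks are shared
```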
FWL. The partialling-out operation in Frisch-Waugh-Lovell is exactly what \((X'X)^{-1}\) does mechanically. When you residualize \(Y\) and \(X_1\) on \(X_2\), the slope of the residualized regression equals the \(\hat{\beta}_1\) from the full regression — because both are doing the same linear algebra.
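The FWL equivalence is a one-screen check: residualize \(y\) and \(x_1\) on a constant and \(x_2\), regress residual on residual, and compare to the full-regression coefficient. All data below are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000
x2 = rng.normal(size=n)
x1 = 0.7 * x2 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

# Full regression: y on [1, x1, x2]
X_full = np.column_stack([np.ones(n), x1, x2])
b_full = ols(X_full, y)

# FWL: residualize y and x1 on [1, x2], then regress residual on residual
X2 = np.column_stack([np.ones(n), x2])
y_res = y - X2 @ ols(X2, y)
x1_res = x1 - X2 @ ols(X2, x1)
b_fwl = (x1_res @ y_res) / (x1_res @ x1_res)

print(b_full[1], b_fwl)   # identical up to floating-point noise
```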
Residuals. The residual vector \(e = y - X\hat{\beta}\) can be written as \(e = My\), where:
\[ M = I - X(X'X)^{-1}X' \]
\(M\) is the annihilator matrix — it projects \(y\) onto the space orthogonal to the columns of \(X\). Everything \(X\) can explain gets removed; what’s left is the residual. The matrix \(H = X(X'X)^{-1}X'\) is the hat matrix (it puts the “hat” on \(y\) to get \(\hat{y} = Hy\)), and \(M = I - H\) is its complement.
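Both projection matrices are a couple of lines of numpy. Forming the \(n \times n\) matrices explicitly is only sensible for small \(n\), so this sketch keeps the sample tiny.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 20
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix: H y = fitted values
M = np.eye(n) - H                      # annihilator: M y = residuals

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(H @ y, X @ beta_hat))        # True: H puts the hat on y
print(np.allclose(M @ y, y - X @ beta_hat))    # True: M y is the residual vector
print(np.allclose(M @ X, 0))                   # True: M wipes out everything X can explain
```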