The Gauss-Markov & Gaussian Assumptions
Every result in classical regression — unbiasedness, efficiency, exact t and F tests — rests on five assumptions about the data-generating process. They are the conditions under which OLS has its textbook properties; other estimators (MLE, MoM, GMM) come with their own regularity conditions. But since OLS is the workhorse of applied economics, these five assumptions are where inference begins.
The assumptions
For the linear model \(y = X\beta + \varepsilon\):
| # | Assumption | Formal statement | What it gives you |
|---|---|---|---|
| 1 | Linearity | \(y = X\beta + \varepsilon\) | Model is correctly specified |
| 2 | Strict exogeneity | \(E[\varepsilon \mid X] = 0\) | OLS is unbiased |
| 3 | No perfect multicollinearity | \(\text{rank}(X) = k\) | \((X'X)^{-1}\) exists, estimates are unique |
| 4 | Spherical errors | \(\text{Var}(\varepsilon \mid X) = \sigma^2 I\) | OLS is BLUE, standard errors are correct |
| 5 | Normality | \(\varepsilon \mid X \sim N(0, \sigma^2 I)\) | t and F are exact in finite samples |
Two tiers
Assumptions 1–4 are the Gauss-Markov conditions. Under these, OLS is the Best Linear Unbiased Estimator (BLUE) — no other linear unbiased estimator has smaller variance. But you don’t yet know the exact distribution of \(\hat{\beta}\), so you can’t do exact finite-sample inference.
Adding assumption 5 upgrades you to the classical normal linear model. Now t-statistics follow \(t_{n-k}\) exactly and F-statistics follow \(F_{q,\,n-k}\) exactly, even with \(n = 20\). This is the world where all the textbook formulas — confidence intervals, p-values, prediction intervals — are exact, not approximate.
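The exactness claim is easy to check by simulation. A minimal sketch (simulated data; the true slope of zero and the replication count are choices of the demo, not part of the text): with normal errors and \(n = 20\), the 5% t-test rejects at almost exactly its nominal rate.

```r
# Simulate the classical normal linear model with a small sample and a
# true slope of zero, then check the size of the 5% two-sided t-test.
set.seed(5)
n <- 20
tstats <- replicate(5000, {
  x <- rnorm(n)
  y <- rnorm(n)                             # true beta_1 = 0, normal errors
  summary(lm(y ~ x))$coefficients[2, 3]     # slope t-statistic
})
rej <- mean(abs(tstats) > qt(0.975, df = n - 2))
rej   # close to 0.05 even at n = 20
```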
Three roads to \(\hat{\beta} = (X'X)^{-1}X'y\)
OLS isn’t just one estimator — it’s the point where three different estimation philosophies converge.
OLS as Method of Moments. The population moment condition is:
\[ E\!\left[X'(y - X\beta)\right] = 0 \]
This says errors are uncorrelated with regressors — a direct restatement of assumption 2 (exogeneity). Replace the expectation with the sample average and solve:
\[ \frac{1}{n}X'(y - X\hat{\beta}) = 0 \;\;\Longrightarrow\;\; \hat{\beta} = (X'X)^{-1}X'y \]
That’s it. OLS is the method of moments estimator for the linear model. You only need assumptions 1–3 for this — no distributional assumption at all. See Method of Moments for the general framework.
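The sample analogue is easy to verify numerically. A minimal sketch with simulated data (all names are illustrative), showing that solving the sample moment condition reproduces what `lm()` computes:

```r
# Moment-condition check: solving (1/n) X'(y - Xb) = 0 for b
# gives exactly the coefficients lm() reports.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)                    # true intercept 1, slope 2
X <- cbind(1, x)                             # design matrix with a constant
beta_mom <- solve(t(X) %*% X, t(X) %*% y)    # (X'X)^{-1} X'y
beta_lm  <- coef(lm(y ~ x))
all.equal(as.numeric(beta_mom), as.numeric(beta_lm))   # TRUE
```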
OLS as MLE. Under all five assumptions (including normality), maximising the log-likelihood gives the same formula. The normal log-likelihood is proportional to \(-\sum(y_i - x_i'\beta)^2\), so maximising it is identical to minimising the sum of squared residuals. That’s assumption 5 doing double duty — it makes OLS = MLE, which is why t and F tests are exact under the full classical model. See Maximum Likelihood for the derivation.
OLS as least squares. Minimise \(\sum(y_i - x_i'\beta)^2\) directly — a purely algebraic/geometric operation. No probability model needed. This is how Gauss and Legendre originally derived it (early 1800s), before the statistical framework existed.
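The convergence of the three roads can be checked directly: the closed form, numerical least squares, and numerical maximum likelihood all land on the same \(\hat{\beta}\). A sketch with simulated data (the starting values and the log-sigma parameterisation are choices of the demo):

```r
# Three estimation philosophies, one answer.
set.seed(2)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
X <- cbind(1, x)

beta_ols <- solve(t(X) %*% X, t(X) %*% y)    # closed form / method of moments

ssr <- function(b) sum((y - X %*% b)^2)      # least-squares road
beta_ls <- optim(c(0, 0), ssr)$par

negll <- function(th) {                      # MLE road, normal errors
  b <- th[1:2]
  sigma <- exp(th[3])                        # log-sigma keeps sigma > 0
  -sum(dnorm(y, mean = X %*% b, sd = sigma, log = TRUE))
}
beta_mle <- optim(c(0, 0, 0), negll)$par[1:2]

round(cbind(ols = beta_ols, ls = beta_ls, mle = beta_mle), 3)
```

The least-squares and MLE roads agree with the closed form up to numerical optimiser tolerance.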
What breaks when each assumption fails
| Violated | Consequence | Fix | Page |
|---|---|---|---|
| Linearity | Bias, meaningless coefficients | Correct specification, nonparametric methods | |
| Exogeneity | \(\hat{\beta}\) is biased — you’re testing the wrong value | IV, experiments, panel methods | OVB |
| Multicollinearity | \((X'X)\) is near-singular, so \((X'X)^{-1}\) explodes — huge SEs, unstable estimates | Drop variables, regularize | |
| Homoskedasticity | OLS SEs are wrong \(\Rightarrow\) wrong p-values | Robust / clustered SEs | Heteroskedasticity, Clustered SEs |
| Normality | t and F are approximate, not exact | Large \(n\) (CLT), bootstrap | Bootstrap |
The hierarchy of damage
Not all violations are equally serious:
Exogeneity failure is fatal. If \(E[\varepsilon \mid X] \neq 0\), OLS is biased and inconsistent — more data doesn’t help. Your estimates converge to the wrong number. This is the violation that keeps econometricians up at night.
Heteroskedasticity is fixable. OLS is still unbiased and consistent, but the standard errors are wrong. The fix is simple: use robust or clustered SEs. The coefficient estimates themselves don’t change.
Non-normality is usually harmless. With large \(n\), the CLT makes inference approximately valid. Only matters in small samples or when you need exact finite-sample results.
Multicollinearity is a data problem, not a model problem. OLS is still BLUE — it’s doing the best it can. The SEs are large because the data don’t contain enough information to separate the effects.
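The heteroskedasticity fix is worth seeing once by hand. A base-R sketch of White-style (HC1) robust standard errors on deliberately heteroskedastic simulated data (the DGP and names are illustrative): same coefficients, corrected SEs.

```r
# White (HC1) robust standard errors computed by hand in base R,
# on simulated data whose error variance grows with |x|.
set.seed(3)
n <- 500
x <- rnorm(n)
y <- 2 * x + rnorm(n) * (1 + 2 * abs(x))
fit <- lm(y ~ x)
X <- model.matrix(fit)
u <- residuals(fit)
k <- ncol(X)
bread <- solve(t(X) %*% X)
meat  <- t(X) %*% (X * u^2)                  # sum over i of u_i^2 x_i x_i'
V_hc1 <- n / (n - k) * bread %*% meat %*% bread
cbind(classical = sqrt(diag(vcov(fit))),
      robust    = sqrt(diag(V_hc1)))         # slope SE grows; beta-hat unchanged
```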
Simulation: break each assumption and watch the t-test fail
This generates data under \(H_0\!: \beta_1 = 0\) (the null is true) and computes the OLS t-statistic 2,000 times. If the test works correctly, the histogram should match the theoretical \(t(n-2)\) curve and the rejection rate should be close to the nominal 5%.
#| standalone: true
#| viewerHeight: 620
library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
  "))),
  sidebarLayout(
    sidebarPanel(
      width = 3,
      selectInput("violation", "Scenario:",
        choices = c(
          "All assumptions hold" = "none",
          "Non-normal errors" = "nonnormal",
          "Heteroskedastic errors" = "hetero",
          "Endogenous regressor" = "endogenous"
        )
      ),
      sliderInput("n", "Sample size (n):",
        min = 20, max = 500, value = 30, step = 10),
      actionButton("run", "Run 2,000 replications",
        style = "width:100%; margin-top:10px;
                 background:#3498db; color:white;
                 border:none; padding:10px; font-weight:bold;"),
      uiOutput("stats")
    ),
    mainPanel(
      width = 9,
      plotOutput("hist_plot", height = "500px")
    )
  )
)

server <- function(input, output, session) {
  sim_data <- reactiveVal(NULL)

  observeEvent(input$run, {
    n <- input$n
    B <- 2000
    viol <- input$violation

    # --- Generate data (vectorised: n x B matrices) ---
    X <- matrix(rnorm(n * B), nrow = n, ncol = B)
    if (viol == "none") {
      EPS <- matrix(rnorm(n * B), nrow = n, ncol = B)
    } else if (viol == "nonnormal") {
      # Skewed chi-squared errors, mean-centred
      EPS <- (matrix(rchisq(n * B, df = 2), n, B) - 2) / 2
    } else if (viol == "hetero") {
      # Variance grows with |x|
      EPS <- matrix(rnorm(n * B), n, B) * (1 + 2 * abs(X))
    } else {
      # Endogeneity: x and eps share a common component
      U <- matrix(rnorm(n * B), n, B)
      X <- X + U
      EPS <- U + matrix(rnorm(n * B, sd = 0.5), n, B)
    }
    Y <- EPS  # true beta_1 = 0

    # --- Vectorised OLS t-statistics ---
    Xc <- X - matrix(colMeans(X), n, B, byrow = TRUE)
    b1 <- colSums(Xc * Y) / colSums(Xc^2)
    Res <- Y - matrix(colMeans(Y), n, B, byrow = TRUE) -
      Xc * matrix(b1, n, B, byrow = TRUE)
    s2 <- colSums(Res^2) / (n - 2)
    se <- sqrt(s2 / colSums(Xc^2))
    tst <- b1 / se

    sim_data(list(t = tst, n = n, viol = viol))
  })

  output$hist_plot <- renderPlot({
    d <- sim_data()
    if (is.null(d)) {
      plot.new()
      text(0.5, 0.5, "Press the button to run the simulation",
           cex = 1.4, col = "#7f8c8d")
      return()
    }
    df_val <- d$n - 2
    par(mar = c(5, 5, 4, 2))
    lbl <- switch(d$viol,
      none = "All assumptions hold",
      nonnormal = "Non-normal errors (\u03c7\u00b2 - skewed)",
      hetero = "Heteroskedastic errors",
      endogenous = "Endogenous regressor")
    # Allow wider x-range for endogeneity (t-stats shift)
    xlim_lo <- min(-5, quantile(d$t, 0.005))
    xlim_hi <- max( 5, quantile(d$t, 0.995))
    hist(d$t, breaks = 60, freq = FALSE, col = "#dfe6e9",
         border = "#b2bec3", xlim = c(xlim_lo, xlim_hi),
         main = paste0(lbl, " (n = ", d$n, ")"),
         xlab = "t-statistic", ylab = "Density",
         cex.main = 1.4, cex.lab = 1.2)
    xseq <- seq(xlim_lo, xlim_hi, length.out = 500)
    lines(xseq, dt(xseq, df = df_val), lwd = 3, col = "#e74c3c", lty = 2)
    rej <- mean(abs(d$t) > qt(0.975, df = df_val))
    legend("topright", bty = "n", cex = 1.0,
           legend = c("Simulated t-stats",
                      paste0("t(", df_val, ") theory"),
                      paste0("Rejection rate: ", sprintf("%.1f%%", rej * 100),
                             " (nominal 5%)")),
           col = c("#b2bec3", "#e74c3c", NA),
           lwd = c(NA, 3, NA),
           lty = c(NA, 2, NA),
           pch = c(15, NA, NA),
           pt.cex = 2)
  })

  output$stats <- renderUI({
    d <- sim_data()
    if (is.null(d)) return(NULL)
    df_val <- d$n - 2
    rej <- mean(abs(d$t) > qt(0.975, df = df_val))
    col <- if (abs(rej - 0.05) < 0.02) "#27ae60" else "#e74c3c"
    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>Replications:</b> 2,000<br>",
        "<b>df:</b> ", df_val, "<br>",
        "<hr style='margin:8px 0'>",
        "<b>Nominal size:</b> 5.0%<br>",
        "<b>Actual rejection:</b> ",
        "<span style='color:", col, "; font-weight:bold'>",
        sprintf("%.1f%%", rej * 100), "</span><br>",
        "<hr style='margin:8px 0'>",
        "<b>Mean(t):</b> ", round(mean(d$t), 3), "<br>",
        "<b>SD(t):</b> ", round(sd(d$t), 3), "<br>",
        "<small>Should be ≈ 0 and ≈ 1<br>",
        "if assumptions hold.</small>"
      ))
    )
  })
}

shinyApp(ui, server)