The Gauss-Markov & Gaussian Assumptions

Every result in classical regression — unbiasedness, efficiency, exact t and F tests — rests on five assumptions about the data-generating process. These are assumptions about OLS specifically. Other estimators (MLE, MoM, GMM) have their own regularity conditions. But since OLS is the workhorse of applied economics, these five assumptions are where inference begins.

The assumptions

For the linear model \(y = X\beta + \varepsilon\):

| # | Assumption | Formal statement | What it gives you |
|---|------------|------------------|-------------------|
| 1 | Linearity | \(y = X\beta + \varepsilon\) | Model is correctly specified |
| 2 | Strict exogeneity | \(E[\varepsilon \mid X] = 0\) | OLS is unbiased |
| 3 | No perfect multicollinearity | \(\text{rank}(X) = k\) | \((X'X)^{-1}\) exists, estimates are unique |
| 4 | Spherical errors | \(\text{Var}(\varepsilon \mid X) = \sigma^2 I\) | OLS is BLUE, standard errors are correct |
| 5 | Normality | \(\varepsilon \mid X \sim N(0, \sigma^2 I)\) | t and F are exact in finite samples |
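Assumption 3 is the easiest to check by hand. A minimal sketch (simulated data, names illustrative): a column that is an exact linear combination of the others leaves \(X\) rank-deficient, so \(X'X\) is singular and no unique \(\hat{\beta}\) exists.

```r
set.seed(1)
n  <- 50
x1 <- rnorm(n)

X_ok  <- cbind(1, x1)          # intercept + regressor: full column rank
X_bad <- cbind(1, x1, 2 * x1)  # third column = 2 * second: rank-deficient

qr(X_ok)$rank   # 2 of 2 columns
qr(X_bad)$rank  # still 2, but X_bad has 3 columns

# X'X is exactly singular, so the inverse does not exist
inherits(try(solve(crossprod(X_bad)), silent = TRUE), "try-error")  # TRUE
```

(`lm()` handles the exact-collinearity case by pivoting and reporting an `NA` coefficient for the redundant column rather than failing.)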

Two tiers

Assumptions 1–4 are the Gauss-Markov conditions. Under these, OLS is the Best Linear Unbiased Estimator (BLUE) — no other linear unbiased estimator has smaller variance. But you don’t yet know the exact distribution of \(\hat{\beta}\), so you can’t do exact finite-sample inference.

Adding assumption 5 upgrades you to the classical normal linear model. Now t-statistics follow \(t_{n-k}\) exactly and F-statistics follow \(F_{q,\,n-k}\) exactly, even with \(n = 20\). This is the world where all the textbook formulas — confidence intervals, p-values, prediction intervals — are exact, not approximate.

What about large samples? Without normality but with assumptions 1–4, the CLT still delivers: \(t \xrightarrow{d} N(0,1)\) as \(n \to \infty\). So in large samples, you can drop assumption 5 and still do valid inference. This is why applied econometrics rarely tests for normality — with \(n\) in the thousands, it doesn’t matter. The assumptions that do matter regardless of sample size are exogeneity (2) and homoskedasticity (4).
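The small-sample versus large-sample distinction shows up in the critical values alone; a quick check in R:

```r
# 97.5% critical values: exact t(n - 2) vs the N(0, 1) approximation
n <- c(20, 100, 1000, 10000)
rbind(n      = n,
      t_crit = round(qt(0.975, df = n - 2), 3),
      z_crit = rep(round(qnorm(0.975), 3), length(n)))
# At n = 20 the exact cutoff is 2.101 vs 1.960; by n = 1000 the gap is
# essentially gone
```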

Three roads to \(\hat{\beta} = (X'X)^{-1}X'y\)

OLS isn’t just one estimator — it’s the point where three different estimation philosophies converge.

OLS as Method of Moments. The population moment condition is:

\[ E\!\left[X'(y - X\beta)\right] = 0 \]

This says errors are uncorrelated with regressors — a direct restatement of assumption 2 (exogeneity). Replace the expectation with the sample average and solve:

\[ \frac{1}{n}X'(y - X\hat{\beta}) = 0 \;\;\Longrightarrow\;\; \hat{\beta} = (X'X)^{-1}X'y \]

That’s it. OLS is the method of moments estimator for the linear model. You only need assumptions 1–3 for this — no distributional assumption at all. See Method of Moments for the general framework.
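The two-line derivation above can be checked numerically. A sketch with simulated data (all names illustrative): solving the sample moment condition reproduces `lm()` exactly.

```r
set.seed(42)
n <- 200
X <- cbind(1, rnorm(n), rnorm(n))   # design matrix, intercept included
y <- X %*% c(1, 2, -0.5) + rnorm(n)

# Sample moment condition X'(y - Xb) = 0, solved for b
beta_mom <- solve(crossprod(X), crossprod(X, y))

# lm() on the same data gives the same numbers
beta_lm <- coef(lm(y ~ X[, 2] + X[, 3]))
max(abs(drop(beta_mom) - beta_lm))  # ~ 0 up to floating point
```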

OLS as MLE. Under all five assumptions (including normality), maximising the log-likelihood gives the same formula. The normal log-likelihood is proportional to \(-\sum(y_i - x_i'\beta)^2\), so maximising it is identical to minimising the sum of squared residuals. That’s assumption 5 doing double duty — it makes OLS = MLE, which is why t and F tests are exact under the full classical model. See Maximum Likelihood for the derivation.
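That equivalence can also be verified directly. A sketch (simulated data with normal errors): numerically maximising the Gaussian log-likelihood with `optim()` recovers the `lm()` coefficients.

```r
set.seed(7)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# Negative Gaussian log-likelihood in (beta0, beta1, log sigma)
negll <- function(p) {
  -sum(dnorm(y, mean = p[1] + p[2] * x, sd = exp(p[3]), log = TRUE))
}
fit <- optim(c(0, 0, 0), negll, method = "BFGS")

rbind(mle = fit$par[1:2], ols = unname(coef(lm(y ~ x))))  # agree to several decimals
```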

OLS as least squares. Minimise \(\sum(y_i - x_i'\beta)^2\) directly — a purely algebraic/geometric operation. No probability model needed. This is how Gauss and Legendre originally derived it (early 1800s), before the statistical framework existed.
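The geometric view has a direct numerical counterpart; a sketch (simulated data): fitted values are the projection of \(y\) onto the column space of \(X\), and residuals are orthogonal to every regressor.

```r
set.seed(3)
n <- 30
X <- cbind(1, rnorm(n))
y <- rnorm(n)

P   <- X %*% solve(crossprod(X)) %*% t(X)  # projection matrix onto col(X)
res <- y - P %*% y                         # residuals

max(abs(crossprod(X, res)))  # ~ 0: residuals orthogonal to X
max(abs(P %*% P - P))        # ~ 0: P is idempotent, as a projection must be
```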

The pattern. MoM needs exogeneity (assumption 2). Least squares needs nothing beyond algebra. MLE needs normality (assumption 5). All three give the same \(\hat{\beta}\), but the inferential guarantees differ — MoM gives you consistency, Gauss-Markov gives you efficiency, and MLE gives you exact finite-sample distributions.

What breaks when each assumption fails

| Violated | Consequence | Fix | Page |
|----------|-------------|-----|------|
| Linearity | Bias, meaningless coefficients | Correct specification, nonparametric methods | |
| Exogeneity | \(\hat{\beta}\) is biased — you’re testing the wrong value | IV, experiments, panel methods | OVB |
| Multicollinearity | \((X'X)^{-1}\) explodes — huge SEs, unstable estimates | Drop variables, regularize | |
| Homoskedasticity | OLS SEs are wrong \(\Rightarrow\) wrong p-values | Robust / clustered SEs | Heteroskedasticity, Clustered SEs |
| Normality | t and F are approximate, not exact | Large \(n\) (CLT), bootstrap | Bootstrap |
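The homoskedasticity fix can be computed by hand. A sketch (simulated heteroskedastic data, mirroring the app's scenario below; in practice you would use the sandwich package) of the HC0 sandwich estimator:

```r
set.seed(5)
n <- 500
x <- rnorm(n)
y <- 1 + rnorm(n) * (1 + 2 * abs(x))  # true slope 0, variance grows with |x|

m  <- lm(y ~ x)
Xm <- model.matrix(m)
u  <- resid(m)

bread <- solve(crossprod(Xm))
meat  <- crossprod(Xm * u)            # sum of u_i^2 * x_i x_i'
V_hc0 <- bread %*% meat %*% bread     # HC0 sandwich variance

cbind(classical = sqrt(diag(vcov(m))),
      robust    = sqrt(diag(V_hc0)))  # robust slope SE is noticeably larger
```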

The hierarchy of damage

Not all violations are equally serious:

  1. Exogeneity failure is fatal. If \(E[\varepsilon \mid X] \neq 0\), OLS is biased and inconsistent — more data doesn’t help. Your estimates converge to the wrong number. This is the violation that keeps econometricians up at night.

  2. Heteroskedasticity is fixable. OLS is still unbiased and consistent, but the standard errors are wrong. The fix is simple: use robust or clustered SEs. The coefficient estimates themselves don’t change.

  3. Non-normality is usually harmless. With large \(n\), the CLT makes inference approximately valid. Only matters in small samples or when you need exact finite-sample results.

  4. Multicollinearity is a data problem, not a model problem. OLS is still BLUE — it’s doing the best it can. The SEs are large because the data don’t contain enough information to separate the effects.
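Point 4 is easy to see in a sketch (simulated data, with the correlation deliberately extreme): with two nearly identical regressors, the standard error on each explodes, even though the fit itself is fine.

```r
set.seed(11)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)  # nearly a copy of x1
x3 <- rnorm(n)
y  <- 1 + x1 + x3 + rnorm(n)

se <- function(m) sqrt(diag(vcov(m)))
se(lm(y ~ x1 + x2 + x3))["x1"]  # inflated: data can't separate x1 from x2
se(lm(y ~ x1 + x3))["x1"]       # an order of magnitude smaller
```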

Simulation: break each assumption and watch the t-test fail

This generates data under \(H_0\!: \beta_1 = 0\) (the null is true) and computes the OLS t-statistic 2,000 times. If the test works correctly, the histogram should match the theoretical \(t(n-2)\) curve and the rejection rate should be 5%.

#| standalone: true
#| viewerHeight: 620

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      selectInput("violation", "Scenario:",
        choices = c(
          "All assumptions hold" = "none",
          "Non-normal errors" = "nonnormal",
          "Heteroskedastic errors" = "hetero",
          "Endogenous regressor" = "endogenous"
        )
      ),

      sliderInput("n", "Sample size (n):",
                  min = 20, max = 500, value = 30, step = 10),

      actionButton("run", "Run 2,000 replications",
                   style = "width:100%; margin-top:10px;
                            background:#3498db; color:white;
                            border:none; padding:10px; font-weight:bold;"),

      uiOutput("stats")
    ),

    mainPanel(
      width = 9,
      plotOutput("hist_plot", height = "500px")
    )
  )
)

server <- function(input, output, session) {

  sim_data <- reactiveVal(NULL)

  observeEvent(input$run, {
    n    <- input$n
    B    <- 2000
    viol <- input$violation

    # --- Generate data (vectorised: n x B matrices) ---
    X <- matrix(rnorm(n * B), nrow = n, ncol = B)

    if (viol == "none") {
      EPS <- matrix(rnorm(n * B), nrow = n, ncol = B)
    } else if (viol == "nonnormal") {
      # Skewed chi-squared errors, centred and scaled to mean 0, variance 1
      EPS <- (matrix(rchisq(n * B, df = 2), n, B) - 2) / 2
    } else if (viol == "hetero") {
      # Variance grows with |x|
      EPS <- matrix(rnorm(n * B), n, B) * (1 + 2 * abs(X))
    } else {
      # Endogeneity: x and eps share a common component
      U   <- matrix(rnorm(n * B), n, B)
      X   <- X + U
      EPS <- U + matrix(rnorm(n * B, sd = 0.5), n, B)
    }

    Y <- EPS                       # true beta_1 = 0

    # --- Vectorised OLS t-statistics ---
    Xc  <- X - matrix(colMeans(X), n, B, byrow = TRUE)
    b1  <- colSums(Xc * Y) / colSums(Xc^2)
    Res <- Y - matrix(colMeans(Y), n, B, byrow = TRUE) -
           Xc * matrix(b1, n, B, byrow = TRUE)
    s2  <- colSums(Res^2) / (n - 2)
    se  <- sqrt(s2 / colSums(Xc^2))
    tst <- b1 / se

    sim_data(list(t = tst, n = n, viol = viol))
  })

  output$hist_plot <- renderPlot({
    d <- sim_data()
    if (is.null(d)) {
      plot.new()
      text(0.5, 0.5, "Press the button to run the simulation",
           cex = 1.4, col = "#7f8c8d")
      return()
    }

    df_val <- d$n - 2
    par(mar = c(5, 5, 4, 2))

    lbl <- switch(d$viol,
      none       = "All assumptions hold",
      nonnormal  = "Non-normal errors (\u03c7\u00b2 - skewed)",
      hetero     = "Heteroskedastic errors",
      endogenous = "Endogenous regressor")

    # Allow wider x-range for endogeneity (t-stats shift)
    xlim_lo <- min(-5, quantile(d$t, 0.005))
    xlim_hi <- max( 5, quantile(d$t, 0.995))

    hist(d$t, breaks = 60, freq = FALSE, col = "#dfe6e9",
         border = "#b2bec3", xlim = c(xlim_lo, xlim_hi),
         main = paste0(lbl, "   (n = ", d$n, ")"),
         xlab = "t-statistic", ylab = "Density",
         cex.main = 1.4, cex.lab = 1.2)

    xseq <- seq(xlim_lo, xlim_hi, length.out = 500)
    lines(xseq, dt(xseq, df = df_val), lwd = 3, col = "#e74c3c", lty = 2)

    rej <- mean(abs(d$t) > qt(0.975, df = df_val))

    legend("topright", bty = "n", cex = 1.0,
      legend = c("Simulated t-stats",
                 paste0("t(", df_val, ") theory"),
                 paste0("Rejection rate: ", sprintf("%.1f%%", rej * 100),
                        "  (nominal 5%)")),
      col  = c("#b2bec3", "#e74c3c", NA),
      lwd  = c(NA, 3, NA),
      lty  = c(NA, 2, NA),
      pch  = c(15, NA, NA),
      pt.cex = 2)
  })

  output$stats <- renderUI({
    d <- sim_data()
    if (is.null(d)) return(NULL)

    df_val <- d$n - 2
    rej    <- mean(abs(d$t) > qt(0.975, df = df_val))
    col    <- if (abs(rej - 0.05) < 0.02) "#27ae60" else "#e74c3c"

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>Replications:</b> 2,000<br>",
        "<b>df:</b> ", df_val, "<br>",
        "<hr style='margin:8px 0'>",
        "<b>Nominal size:</b> 5.0%<br>",
        "<b>Actual rejection:</b> ",
        "<span style='color:", col, "; font-weight:bold'>",
        sprintf("%.1f%%", rej * 100), "</span><br>",
        "<hr style='margin:8px 0'>",
        "<b>Mean(t):</b> ", round(mean(d$t), 3), "<br>",
        "<b>SD(t):</b> ", round(sd(d$t), 3), "<br>",
        "<small>Should be &asymp; 0 and &asymp; 1<br>",
        "if assumptions hold.</small>"
      ))
    )
  })
}

shinyApp(ui, server)

What to try:

  • All assumptions hold: histogram matches the red \(t\)-curve, rejection \(\approx\) 5%.
  • Non-normal errors (skewed \(\chi^2\)): at \(n = 20\), the histogram is visibly off-centre. Crank \(n\) up to 200 — the CLT kicks in and the histogram snaps back onto the \(t\)-curve.
  • Heteroskedastic errors: rejection rate drifts above 5% regardless of \(n\). The OLS standard errors are systematically wrong. This is why you need robust SEs.
  • Endogenous regressor: t-statistics shift away from zero — the estimator is biased, so the “null is true” test rejects almost every time. No amount of \(n\) fixes this; you need a different estimator (IV, etc.).
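The endogeneity scenario can also be replicated outside the app. A sketch using the same data-generating process shows that even a huge sample converges to the wrong slope:

```r
set.seed(99)
n   <- 100000
u   <- rnorm(n)
x   <- rnorm(n) + u            # regressor shares the common component u
eps <- u + rnorm(n, sd = 0.5)  # ... so Cov(x, eps) = Var(u) = 1
y   <- eps                     # true slope is 0

coef(lm(y ~ x))["x"]  # ~ 0.5 = Cov(x, eps) / Var(x), not 0
```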