When Inference Breaks Down

The common thread

You need variation to learn anything. When variation disappears in some dimension, the math collapses.

Every formula in this course has a denominator that measures variation somewhere. When that variation goes to zero, the formula divides by zero and inference breaks down:

| What disappears | What breaks | Why |
|---|---|---|
| Variation in data (all values identical) | SD = 0, SE = 0, CI has zero width | Nothing to estimate — your “sample” is a constant |
| Degrees of freedom (\(n = 1\)) | \(s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1}\) divides by 0 | Can’t measure spread from a single point |
| Variation in X (regressor is constant) | OLS slope is undefined | Can’t draw a line through a vertical stack of points |
| Variation between regressors (perfect collinearity) | Individual coefficients undefined | Can’t separate the effect of X1 from X2 if they move together perfectly |
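Each failure mode in the table can be reproduced at the R console in a few lines. A minimal sketch (the variable names are illustrative):

```r
# No variation in the data: SD is exactly 0, so SE and CI width are 0
sd(c(7, 7, 7, 7))        # 0

# n = 1: the sample variance divides by n - 1 = 0; R returns NA
sd(7)                    # NA

# Constant regressor: Var(X) = 0, so the slope cannot be estimated
set.seed(1)
x <- rep(5, 20)
y <- rnorm(20)
coef(lm(y ~ x))          # slope reported as NA

# Perfect collinearity: x2 carries no information beyond x1
x1 <- rnorm(20)
x2 <- 2 * x1
coef(lm(y ~ x1 + x2))    # x2 coefficient reported as NA
```

Note that in the last two cases `lm()` does not error — it silently drops the degenerate column and reports `NA`, which previews the point about near-degeneracy below.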

The simulations below let you push each slider to the edge and watch the math collapse in real time.

The Oracle View. In these simulations, we set \(\sigma\), the true slope \(\beta\), and \(SD(X)\) — we control the data-generating process and can push parameters to their breaking point. In practice, you estimate all of these from data. The danger is that near-degeneracy (tiny Var(X), high collinearity) can silently produce huge SEs without your software raising an error.


Simulation 1: Too little data

The standard error is \(SE = s / \sqrt{n}\). As \(n\) shrinks, the SE explodes — your estimate becomes useless. At \(n = 1\), the sample variance \(s^2\) is undefined (dividing by \(n - 1 = 0\)).

Watch the confidence interval expand until it swallows everything.

#| standalone: true
#| viewerHeight: 650

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
    .warning { color: #e74c3c; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("n", "Sample size (n):",
                  min = 1, max = 50, value = 30, step = 1),

      sliderInput("true_mu", "True mean:",
                  min = 0, max = 100, value = 50, step = 5),

      sliderInput("true_sd", "Population SD:",
                  min = 5, max = 30, value = 15, step = 1),

      uiOutput("results")
    ),

    mainPanel(
      width = 9,
      plotOutput("ci_plot", height = "550px")
    )
  )
)

server <- function(input, output, session) {

  sim <- reactive({
    n   <- input$n
    mu  <- input$true_mu
    sig <- input$true_sd

    samp <- rnorm(n, mu, sig)
    xbar <- mean(samp)

    if (n == 1) {
      list(samp = samp, xbar = xbar, se = NA, ci_lo = NA, ci_hi = NA,
           s = NA, n = n, mu = mu, degenerate = TRUE)
    } else {
      s  <- sd(samp)
      se <- s / sqrt(n)
      tc <- qt(0.975, df = n - 1)
      ci_lo <- xbar - tc * se
      ci_hi <- xbar + tc * se
      list(samp = samp, xbar = xbar, se = se, ci_lo = ci_lo, ci_hi = ci_hi,
           s = s, n = n, mu = mu, tc = tc, degenerate = FALSE)
    }
  })

  output$ci_plot <- renderPlot({
    d <- sim()
    par(mar = c(4.5, 4.5, 3.5, 1))

    if (d$degenerate) {
      plot(1, d$samp[1], pch = 19, cex = 2, col = "#e74c3c",
           xlim = c(0.5, 1.5), ylim = c(d$mu - 60, d$mu + 60),
           xlab = "", ylab = "Value", xaxt = "n",
           main = "n = 1: Variance is UNDEFINED")
      abline(h = d$mu, col = "#27ae60", lwd = 2, lty = 2)
      text(1, d$mu + 5, paste0("True mean = ", d$mu), col = "#27ae60", cex = 1.1)
      text(1, d$samp[1] - 5, paste0("Your one observation = ", round(d$samp[1], 1)),
           col = "#e74c3c", cex = 1.1)
      text(1, d$mu - 40,
           "Can't compute SD, SE, or CI from a single point.",
           cex = 1.3, font = 2, col = "#e74c3c")
    } else {
      ylim <- range(c(d$ci_lo, d$ci_hi, d$samp, d$mu)) + c(-10, 10)

      plot(seq_along(d$samp), d$samp, pch = 16, cex = 0.8,
           col = adjustcolor("#3498db", 0.6),
           xlim = c(0, d$n + 1), ylim = ylim,
           xlab = "Observation", ylab = "Value",
           main = paste0("n = ", d$n, " — 95% CI"))

      abline(h = d$mu, col = "#27ae60", lwd = 2, lty = 2)
      abline(h = d$xbar, col = "#2c3e50", lwd = 2)

      # CI band
      rect(0, d$ci_lo, d$n + 1, d$ci_hi,
           col = adjustcolor("#e74c3c", 0.1), border = NA)
      abline(h = d$ci_lo, col = "#e74c3c", lwd = 1.5, lty = 3)
      abline(h = d$ci_hi, col = "#e74c3c", lwd = 1.5, lty = 3)

      legend("topright", bty = "n", cex = 0.9,
             legend = c(paste0("True mean = ", d$mu),
                        paste0("Sample mean = ", round(d$xbar, 1)),
                        paste0("95% CI: [", round(d$ci_lo, 1), ", ",
                               round(d$ci_hi, 1), "]"),
                        paste0("CI width = ", round(d$ci_hi - d$ci_lo, 1))),
             col = c("#27ae60", "#2c3e50", "#e74c3c", NA),
             lwd = c(2, 2, 1.5, NA), lty = c(2, 1, 3, NA))
    }
  })

  output$results <- renderUI({
    d <- sim()

    if (d$degenerate) {
      tags$div(class = "stats-box",
        HTML(paste0(
          "<span class='warning'>DEGENERATE</span><br>",
          "<b>n = 1</b><br>",
          "s = undefined (0/0)<br>",
          "SE = undefined<br>",
          "CI = undefined<br>",
          "<hr style='margin:8px 0'>",
          "<small>You used up your one degree",
          " of freedom estimating the mean.",
          " None left to estimate spread.</small>"
        ))
      )
    } else {
      tags$div(class = "stats-box",
        HTML(paste0(
          "<b>n = </b>", d$n, "<br>",
          "<b>s = </b>", round(d$s, 2), "<br>",
          "<b>SE = </b>", round(d$se, 2), "<br>",
          "<b>t critical = </b>", round(d$tc, 2), "<br>",
          "<b>CI width = </b>", round(d$ci_hi - d$ci_lo, 1), "<br>",
          "<hr style='margin:8px 0'>",
          "<small>df = ", d$n - 1, "</small>"
        ))
      )
    }
  })
}

shinyApp(ui, server)

Things to try

  • n = 30: normal-looking CI, reasonably tight around the mean.
  • n = 5: CI balloons — the t critical value jumps from ~2 to ~2.8, and the SE is much larger. Your estimate is barely useful.
  • n = 2: CI is enormous. The t critical value for df = 1 is 12.7 — you need the sample mean to be more than 12 SEs away from the hypothesized value to reject. You have almost no power.
  • n = 1: the math breaks. You can’t compute variance, SE, or a CI. One observation tells you nothing about uncertainty.
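The jump in the t critical value quoted above is easy to verify at the console (this is a quick check outside the app; `half_width` is an illustrative helper, not part of the simulation):

```r
# 97.5th percentiles of the t distribution at df = 1, 4, 29
# (i.e., n = 2, 5, 30 after spending one df on the mean)
qt(0.975, df = c(1, 4, 29))
# ~ 12.71  2.78  2.05

# The CI half-width is t_crit * s / sqrt(n): both factors punish small n
half_width <- function(s, n) qt(0.975, df = n - 1) * s / sqrt(n)
half_width(s = 15, n = c(2, 5, 30))   # shrinks rapidly as n grows
```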

Simulation 2: No variation in X

The OLS slope formula is:

\[\hat{\beta} = \frac{\text{Cov}(X, Y)}{\text{Var}(X)}\]

If \(\text{Var}(X) = 0\) (the regressor is constant), you’re dividing by zero. You can’t estimate a slope because there’s no variation in X to “explain” variation in Y.
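The ratio form of the slope is not just notation — you can compute it directly and confirm it matches `lm()`. A small sketch with made-up data:

```r
set.seed(1)
x <- rnorm(100, mean = 5, sd = 3)
y <- 1 + 1.5 * x + rnorm(100)

# Slope as the ratio of sample covariance to sample variance
b_ratio <- cov(x, y) / var(x)
b_lm    <- coef(lm(y ~ x))[["x"]]

all.equal(b_ratio, b_lm)   # TRUE
```

The two agree exactly (up to floating point) because the \(n-1\) denominators in `cov()` and `var()` cancel. When `var(x)` is 0, the ratio is 0/0 — the same degeneracy the simulation shows.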

#| standalone: true
#| viewerHeight: 600

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
    .warning { color: #e74c3c; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("sd_x", "SD of X (variation in regressor):",
                  min = 0, max = 5, value = 3, step = 0.1),

      sliderInput("beta_true", "True slope:",
                  min = 0.5, max = 3, value = 1.5, step = 0.1),

      sliderInput("n_obs", "Sample size:",
                  min = 30, max = 200, value = 100, step = 10),

      uiOutput("results2")
    ),

    mainPanel(
      width = 9,
      plotOutput("ols_plot", height = "500px")
    )
  )
)

server <- function(input, output, session) {

  base_noise <- reactive({
    n <- input$n_obs
    list(z = rnorm(n), u = rnorm(n))
  })

  sim <- reactive({
    d    <- base_noise()
    sd_x <- input$sd_x
    b    <- input$beta_true
    n    <- input$n_obs

    x <- 5 + sd_x * d$z
    y <- 1 + b * x + d$u

    if (sd_x < 0.05) {
      list(x = x, y = y, beta = b, sd_x = sd_x,
           degenerate = TRUE)
    } else {
      fit <- lm(y ~ x)
      ci  <- confint(fit)[2, ]
      se  <- summary(fit)$coefficients[2, 2]
      list(x = x, y = y, beta = b, sd_x = sd_x,
           b_hat = coef(fit)[2], se = se,
           ci_lo = ci[1], ci_hi = ci[2],
           degenerate = FALSE)
    }
  })

  output$ols_plot <- renderPlot({
    d <- sim()
    par(mar = c(4.5, 4.5, 3.5, 1))

    plot(d$x, d$y, pch = 16, cex = 0.7,
         col = adjustcolor("#3498db", 0.5),
         xlab = "X", ylab = "Y",
         main = if (d$degenerate) "Var(X) = 0: Slope is UNDEFINED"
                else paste0("SD(X) = ", round(d$sd_x, 1),
                            " — Slope SE = ", round(d$se, 2)))

    if (!d$degenerate) {
      abline(lm(d$y ~ d$x), col = "#e74c3c", lwd = 3)
      abline(a = 1, b = d$beta, col = "#27ae60", lwd = 2, lty = 2)
      legend("topleft", bty = "n", cex = 0.9,
             legend = c(paste0("OLS: ", round(d$b_hat, 3)),
                        paste0("True: ", d$beta)),
             col = c("#e74c3c", "#27ae60"), lwd = c(3, 2), lty = c(1, 2))
    } else {
      text(mean(d$x), mean(d$y),
           "All X values are the same.\nCan't fit a line through\na vertical stack of points.",
           cex = 1.4, font = 2, col = "#e74c3c")
    }
  })

  output$results2 <- renderUI({
    d <- sim()

    if (d$degenerate) {
      tags$div(class = "stats-box",
        HTML(paste0(
          "<span class='warning'>DEGENERATE</span><br>",
          "Var(X) = 0<br>",
          "Slope = 0/0 = undefined<br>",
          "SE = undefined<br>",
          "<hr style='margin:8px 0'>",
          "<small>No variation in X means you",
          " can't tell if Y changes with X.",
          " Every value of \u03b2 fits equally well.</small>"
        ))
      )
    } else {
      tags$div(class = "stats-box",
        HTML(paste0(
          "<b>True \u03b2:</b> ", d$beta, "<br>",
          "<b>OLS \u03b2:</b> ", round(d$b_hat, 3), "<br>",
          "<b>SE:</b> ", round(d$se, 3), "<br>",
          "<b>95% CI:</b> [", round(d$ci_lo, 3),
          ", ", round(d$ci_hi, 3), "]<br>",
          "<b>CI width:</b> ", round(d$ci_hi - d$ci_lo, 3), "<br>",
          "<hr style='margin:8px 0'>",
          "<small>SE(\u03b2) = \u03c3 / [\u221an \u00b7 SD(X)]<br>",
          "Less variation in X \u2192 larger SE</small>"
        ))
      )
    }
  })
}

shinyApp(ui, server)

Things to try

  • SD(X) = 3: data fans out horizontally, OLS estimates the slope precisely.
  • SD(X) = 1: the cloud compresses horizontally. SE roughly triples — harder to pin down the slope.
  • SD(X) = 0.3: the data is nearly a vertical column. The SE is huge, the CI is wide, and the estimate bounces wildly.
  • SD(X) = 0: complete degeneracy. All X values are identical. You can’t draw a line through a point.

The formula makes it obvious: \(SE(\hat{\beta}) = \frac{\sigma}{\sqrt{n} \cdot SD(X)}\). When \(SD(X) \to 0\), the SE \(\to \infty\).
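You can check the scaling numerically: holding the noise and \(n\) fixed, halving SD(X) doubles the slope SE. A sketch (reusing the same draws across settings, as the app does, so the comparison is clean):

```r
set.seed(42)
n <- 200
z <- rnorm(n)   # standardized X shape, rescaled below
u <- rnorm(n)   # noise in Y, held fixed across settings

slope_se <- function(sd_x) {
  x <- 5 + sd_x * z
  y <- 1 + 1.5 * x + u
  summary(lm(y ~ x))$coefficients["x", "Std. Error"]
}

sapply(c(4, 2, 1, 0.5), slope_se)
# each halving of SD(X) exactly doubles the SE here, because rescaling x
# doesn't change the column span of the design, so the residuals are identical
```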


Simulation 3: Multicollinearity

When two regressors are highly correlated, OLS can estimate their combined effect but can’t tell which one is doing the work. The individual coefficient SEs explode even though predictions remain fine.

#| standalone: true
#| viewerHeight: 650

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
    .warning { color: #e74c3c; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("rho", "Correlation between X1 and X2:",
                  min = 0, max = 0.99, value = 0.3, step = 0.01),

      sliderInput("b1", "True effect of X1:",
                  min = 0.5, max = 3, value = 1.5, step = 0.25),

      sliderInput("b2", "True effect of X2:",
                  min = 0.5, max = 3, value = 1.0, step = 0.25),

      sliderInput("n3", "Sample size:",
                  min = 50, max = 300, value = 100, step = 50),

      uiOutput("results3")
    ),

    mainPanel(
      width = 9,
      fluidRow(
        column(6, plotOutput("coef_plot", height = "500px")),
        column(6, plotOutput("vif_plot", height = "500px"))
      )
    )
  )
)

server <- function(input, output, session) {

  sim <- reactive({
    n   <- input$n3
    rho <- input$rho
    b1  <- input$b1
    b2  <- input$b2

    nsims <- 300

    betas1 <- numeric(nsims)
    betas2 <- numeric(nsims)

    for (i in seq_len(nsims)) {
      z1 <- rnorm(n)
      z2 <- rnorm(n)
      x1 <- z1
      x2 <- rho * z1 + sqrt(1 - rho^2) * z2
      y  <- 1 + b1 * x1 + b2 * x2 + rnorm(n)
      fit <- lm(y ~ x1 + x2)
      betas1[i] <- coef(fit)[2]
      betas2[i] <- coef(fit)[3]
    }

    vif <- 1 / (1 - rho^2)

    list(betas1 = betas1, betas2 = betas2,
         b1 = b1, b2 = b2, rho = rho, vif = vif)
  })

  output$coef_plot <- renderPlot({
    d <- sim()
    par(mar = c(4.5, 4.5, 3.5, 1))

    xlim <- range(c(d$betas1, d$betas2, d$b1, d$b2)) + c(-0.3, 0.3)

    hist(d$betas1, breaks = 30,
         col = adjustcolor("#3498db", 0.5), border = "white",
         main = paste0("Coefficient estimates (\u03c1 = ", d$rho, ")"),
         xlab = "Estimated coefficient", xlim = xlim,
         freq = FALSE, cex.main = 1.2)
    hist(d$betas2, breaks = 30,
         col = adjustcolor("#e74c3c", 0.4), border = "white",
         add = TRUE, freq = FALSE)

    abline(v = d$b1, col = "#3498db", lwd = 2.5, lty = 2)
    abline(v = d$b2, col = "#e74c3c", lwd = 2.5, lty = 2)

    legend("topright", bty = "n", cex = 0.85,
           legend = c(
             paste0("X1 (true=", d$b1, ", SD=", round(sd(d$betas1), 3), ")"),
             paste0("X2 (true=", d$b2, ", SD=", round(sd(d$betas2), 3), ")")
           ),
           fill = c(adjustcolor("#3498db", 0.5),
                    adjustcolor("#e74c3c", 0.4)),
           border = "white")
  })

  output$vif_plot <- renderPlot({
    d <- sim()
    par(mar = c(4.5, 4.5, 3.5, 1))

    rhos <- seq(0, 0.99, by = 0.01)
    vifs <- 1 / (1 - rhos^2)

    plot(rhos, vifs, type = "l", lwd = 2.5, col = "#2c3e50",
         xlab = "Correlation between X1 and X2",
         ylab = "Variance Inflation Factor (VIF)",
         main = "VIF: How much SEs inflate",
         ylim = c(1, min(50, max(vifs))),
         cex.main = 1.2)

    points(d$rho, d$vif, pch = 19, cex = 2, col = "#e74c3c")
    segments(d$rho, 1, d$rho, d$vif, lty = 2, col = "#e74c3c")

    abline(h = 1, lty = 2, col = "#95a5a6")
    abline(h = 10, lty = 3, col = "#e67e22")
    text(0.5, 11, "Rule of thumb: VIF > 10 = serious problem",
         cex = 0.8, col = "#e67e22")

    text(d$rho + 0.03, d$vif + 1.5,
         paste0("VIF = ", round(d$vif, 1)),
         col = "#e74c3c", font = 2, cex = 1.1, adj = 0)
  })

  output$results3 <- renderUI({
    d <- sim()

    tags$div(class = "stats-box",
      HTML(paste0(
        if (d$vif > 10) "<span class='warning'>HIGH COLLINEARITY</span><br>" else "",
        "<b>Correlation:</b> ", d$rho, "<br>",
        "<b>VIF:</b> ", round(d$vif, 1), "<br>",
        "<hr style='margin:8px 0'>",
        "<b>SD of \u03b21 estimates:</b> ", round(sd(d$betas1), 3), "<br>",
        "<b>SD of \u03b22 estimates:</b> ", round(sd(d$betas2), 3), "<br>",
        "<hr style='margin:8px 0'>",
        "<small>VIF = 1/(1 \u2212 \u03c1\u00b2)<br>",
        "SE inflates by \u221aVIF = ",
        round(sqrt(d$vif), 2), "x</small>"
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • \(\rho\) = 0: the two coefficient distributions are tight, centered on their true values. No collinearity, no problem.
  • \(\rho\) = 0.7: distributions visibly wider. VIF = 2. SEs inflate by ~40%.
  • \(\rho\) = 0.9: distributions are quite wide. VIF = 5.3. Hard to tell X1 and X2 apart.
  • \(\rho\) = 0.99: distributions are enormous. VIF = 50. The estimates are unbiased (still centered at the truth!) but so noisy they’re useless. One regression might say X1 has a huge effect and X2 has none; the next says the opposite.

The key insight: multicollinearity doesn’t cause bias — it causes imprecision. OLS still gets it right on average, but any single estimate can be wildly off.
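The “unbiased but noisy” claim is checkable with a short Monte Carlo outside the app (a sketch; `mc_sd`, `nsims`, and the parameter values are illustrative choices mirroring the simulation):

```r
set.seed(7)

# Distribution of the X1 coefficient across repeated samples
mc_sd <- function(rho, n = 100, nsims = 1000, b1 = 1.5, b2 = 1.0) {
  est <- replicate(nsims, {
    z1 <- rnorm(n); z2 <- rnorm(n)
    x1 <- z1
    x2 <- rho * z1 + sqrt(1 - rho^2) * z2
    y  <- 1 + b1 * x1 + b2 * x2 + rnorm(n)
    coef(lm(y ~ x1 + x2))[["x1"]]
  })
  c(mean = mean(est), sd = sd(est))
}

mc_sd(rho = 0)     # mean near 1.5, small sd
mc_sd(rho = 0.99)  # mean still near 1.5, sd roughly 7x larger (sqrt(VIF) = sqrt(50.25))
```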


The unifying principle

Every formula in statistics has variation in the denominator:

| Formula | Denominator contains | Goes to zero when |
|---|---|---|
| \(s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1}\) | \(n - 1\) | \(n = 1\) |
| \(SE = \frac{s}{\sqrt{n}}\) | \(\sqrt{n}\) | \(n = 0\) (no data) |
| \(\hat{\beta} = \frac{\text{Cov}(X,Y)}{\text{Var}(X)}\) | \(\text{Var}(X)\) | All X values identical |
| \(SE(\hat{\beta}_j) \propto \frac{1}{\sqrt{1 - R_j^2}}\) | \(1 - R_j^2\) | Perfect collinearity (\(R_j^2 = 1\)) |

Variation is the fuel of inference. No variation, no learning.

  • No variation in Y (your outcome) → can’t estimate spread or uncertainty
  • No variation in X (your regressor) → can’t estimate slopes
  • No variation between X1 and X2 (regressors move together) → can’t separate their effects
  • No variation in treatment assignment (everyone treated or everyone control) → can’t estimate causal effects

Precision problems vs identification problems

Not all “bad variation” is the same. The SE formula has a numerator and a denominator, and they fail in completely different ways:

\[SE(\hat{\beta}) = \frac{\overbrace{\sigma}^{\text{numerator: noise in Y}}}{\underbrace{\sqrt{n} \cdot SD(X)}_{\text{denominator: signal in X}}}\]

A note on \(\sigma\): in the simulations above, you set \(\sigma\) with a slider — you’re the Oracle, generating data from a known distribution so you can watch what happens when things break. In real life, you never know \(\sigma\). You estimate it from the data using the sample SD, \(s = \sqrt{\sum(x_i - \bar{x})^2 / (n-1)}\), and plug that in. This is why real inference uses the t-distribution (which accounts for the extra uncertainty of estimating \(\sigma\)) rather than the normal. As \(n\) grows, \(s\) converges to \(\sigma\) and the t converges to the normal — but with small samples, the distinction matters.
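The t-to-normal convergence is quick to see at the console:

```r
# 97.5th-percentile critical values: t at several df vs. the normal
qt(0.975, df = c(4, 29, 999))   # ~ 2.776  2.045  1.962
qnorm(0.975)                    # ~ 1.960
```

At df = 4 the t critical value is about 42% larger than the normal’s; by df = 999 the difference is in the third decimal place.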

Precision problem (numerator too big). High variance in Y means \(\sigma\) is large, so your SE is large and your CIs are wide. But OLS is still unbiased — you still get the right answer on average. You can fix this by collecting more data: \(SE \propto \sigma / \sqrt{n}\), so if the noise in Y doubles, quadrupling your sample size gets you back to the same SE.

Why does more data help? It doesn’t reduce the noise in any single observation — \(\sigma^2\) is a property of the population, not your sample size. But when you average over more observations, the random noise cancels out: some observations are too high, some too low, and the average lands closer to the truth. Formally, \(\text{Var}(\bar{X}) = \sigma^2 / n\). The variance of the average shrinks even though each observation is equally noisy. That’s the Law of Large Numbers. More data always helps a precision problem.
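The \(\text{Var}(\bar{X}) = \sigma^2 / n\) claim can be verified by brute force (a sketch; the population values mirror Simulation 1’s defaults, and `var_of_mean` is an illustrative helper):

```r
set.seed(123)
mu <- 50; sigma <- 15   # so sigma^2 = 225

# Empirical variance of the sample mean across many repeated samples
var_of_mean <- function(n, reps = 20000) {
  var(replicate(reps, mean(rnorm(n, mu, sigma))))
}

var_of_mean(5)     # ~ 225 / 5   = 45
var_of_mean(25)    # ~ 225 / 25  = 9
var_of_mean(100)   # ~ 225 / 100 = 2.25
```

Each observation is equally noisy in every run; only the averaging changes, and the variance of the average falls like \(1/n\).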

Identification problem (denominator goes to zero). When \(SD(X) \to 0\), or \(n \to 1\), or regressors become perfectly collinear, the denominator vanishes and the SE goes to infinity. This isn’t imprecision — the parameter literally isn’t estimable. No amount of data fixes it. If every person in your study has the same income, you cannot estimate the effect of income, period — not with 100 observations, not with a million. The formula divides by zero and OLS either crashes or gives you a meaningless number. More data never helps an identification problem.

| | Precision problem | Identification problem |
|---|---|---|
| What’s wrong | Numerator too big (\(\sigma\) large) | Denominator goes to zero |
| Bias? | No — estimates are centered on truth | Undefined — no estimate exists |
| CIs? | Wide but valid | Infinite or meaningless |
| Fix with more data? | Yes (\(SE \propto 1/\sqrt{n}\)) | No |
| Example | Noisy outcome | Constant regressor, perfect collinearity, \(n = 1\) |

Understanding where the variation comes from — and when it might disappear — is the difference between statistics that works and statistics that lies.


Did you know?

  • The problem of perfect multicollinearity is sometimes called the identification problem — there are infinitely many combinations of \(\beta_1\) and \(\beta_2\) that fit the data equally well. This is the same issue that arises in causal inference when you try to separate selection from treatment without an experiment.

  • Near-degeneracy can be worse than full degeneracy in practice. When Var(X) = 0, your software errors out or reports NA, so you know something is wrong. When Var(X) is tiny but nonzero, OLS happily produces an estimate with a massive SE that you might not notice — especially if you never look at standard errors. This is why you should always report SEs, not just point estimates.

  • The condition number of the \(X'X\) matrix is used in numerical linear algebra to detect near-degeneracy. A condition number above \(10^6\) means your computer’s floating-point arithmetic is losing significant digits when inverting the matrix — your regression coefficients may be numerically unreliable even if they’re theoretically defined.
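Base R’s `kappa()` estimates the condition number, so you can check the near-degeneracy diagnostic yourself (a sketch; the `1e-6` perturbation and matrix names are arbitrary illustrative choices):

```r
set.seed(99)
x1 <- rnorm(100)

# Well-conditioned design matrix: intercept plus two independent regressors
X_good <- cbind(1, x1, rnorm(100))
kappa(X_good)    # small — comfortably invertible

# Near-degenerate design: the second regressor is almost a copy of the first
X_bad <- cbind(1, x1, x1 + 1e-6 * rnorm(100))
kappa(X_bad)     # huge — X'X is nearly singular
```

The regression on `X_bad` is still theoretically defined, but its coefficients sit on the numerical knife-edge the bullet describes.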