Omitted Variable Bias

The formula

Suppose the true model is:

\[Y = \beta_1 X_1 + \beta_2 X_2 + \varepsilon\]

If you omit \(X_2\) and run the short regression \(Y = \tilde{\beta}_1 X_1 + u\), the estimator \(\tilde{\beta}_1\) converges in probability to:

\[\tilde{\beta}_1 \xrightarrow{p} \beta_1 + \beta_2 \, \delta\]

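where \(\delta\) is the slope coefficient from regressing \(X_2\) on \(X_1\) (the auxiliary regression). It has the same sign as the correlation between \(X_1\) and \(X_2\), but it is a regression slope, not a correlation coefficient. The bias has a clean interpretation:

\[\text{Bias} = \underbrace{\beta_2}_{\text{effect of omitted}} \times \underbrace{\delta}_{\text{slope of omitted on included}}\]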

If the omitted variable doesn’t affect \(Y\) (\(\beta_2 = 0\)) or is uncorrelated with \(X_1\) (\(\delta = 0\)), there is no bias. Both links in the chain must be present.
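
The formula follows from a single substitution. As a sketch, write the auxiliary regression as \(X_2 = \delta X_1 + v\), where \(v\) (a symbol introduced here for illustration) is uncorrelated with \(X_1\) by construction, and assume \(\varepsilon\) is also uncorrelated with \(X_1\):

\[Y = \beta_1 X_1 + \beta_2 (\delta X_1 + v) + \varepsilon = (\beta_1 + \beta_2 \delta)\, X_1 + \underbrace{(\beta_2 v + \varepsilon)}_{u}\]

Because the composite error \(u\) is uncorrelated with \(X_1\), the short regression consistently estimates the coefficient on \(X_1\) in this rewritten equation, which is \(\beta_1 + \beta_2 \delta\) rather than \(\beta_1\).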

Sign-of-bias table

You can sign the bias without knowing magnitudes — just think about the two ingredients:

	\(\delta > 0\) (positive correlation)	\(\delta < 0\) (negative correlation)
\(\beta_2 > 0\) (positive effect)	Positive bias (overestimate)	Negative bias (underestimate)
\(\beta_2 < 0\) (negative effect)	Negative bias (underestimate)	Positive bias (overestimate)

Example 1: Returns to education, omitting ability. Ability likely has a positive effect on wages (\(\beta_2 > 0\)) and is positively correlated with education (\(\delta > 0\)), so \(\beta_2 \delta > 0\). Omitting ability biases the estimated return to education upward.

Example 2: Class size and test scores, omitting SES. Higher SES likely raises scores (\(\beta_2 > 0\)), and wealthier districts may have smaller classes (\(\delta < 0\)), so \(\beta_2 \delta < 0\). Omitting SES biases the class-size coefficient downward: the estimate is more negative than the true effect, which makes larger classes look more harmful than they are.
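
The sign logic in the table and the two examples can be written as a tiny R helper. This is only a sketch: `ovb_direction` is a hypothetical function name, and the numeric inputs are illustrative stand-ins for the signs discussed above, not estimates from any dataset.

```r
# Hypothetical helper (not part of the original text): returns the direction
# of the omitted variable bias from the signs of beta2 and delta.
ovb_direction <- function(beta2, delta) {
  bias_sign <- sign(beta2 * delta)
  if (bias_sign > 0) {
    "positive bias (overestimate)"
  } else if (bias_sign < 0) {
    "negative bias (underestimate)"
  } else {
    "no bias"
  }
}

# Illustrative magnitudes only; just the signs matter here.
ovb_direction(beta2 = 1, delta = 0.6)    # Example 1: ability and education
ovb_direction(beta2 = 1, delta = -0.6)   # Example 2: SES and class size
ovb_direction(beta2 = 0, delta = 0.6)    # omitted variable has no effect on Y
```

Only the product of the two signs matters, which is why the table is symmetric along its diagonals.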

The Oracle View. In the simulation below, we set the true \(\beta_2\) and \(\delta\) ourselves, so we can check the OVB formula directly. In practice, you don’t observe the omitted variable (otherwise you’d include it), so you cannot compute the bias. The formula still tells you its direction, which is often enough to judge whether an estimate is too high or too low.
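
When the omitted variable is observed, as in a simulation, the formula also holds as an exact in-sample identity: with an intercept in every regression, the short OLS slope equals the long slope on \(X_1\) plus the long slope on \(X_2\) times the auxiliary slope. A minimal sketch in R, assuming illustrative parameter values (1.5, 1, 0.6), a fixed seed, and variable names chosen here for the example:

```r
set.seed(1)   # fixed seed so the sketch is reproducible
n <- 200

# Illustrative data-generating process (values are assumptions, not estimates)
x1 <- rnorm(n)
x2 <- 0.6 * x1 + sqrt(1 - 0.6^2) * rnorm(n)
y  <- 1.5 * x1 + 1 * x2 + rnorm(n)

long_fit  <- coef(lm(y ~ x1 + x2))   # includes the "omitted" variable
short_fit <- coef(lm(y ~ x1))        # omits x2
aux_fit   <- coef(lm(x2 ~ x1))       # auxiliary regression of x2 on x1

# In-sample OVB identity: short slope = long slope + (long x2 slope) * (aux slope)
implied_short <- long_fit["x1"] + long_fit["x2"] * aux_fit["x1"]

all.equal(unname(short_fit["x1"]), unname(implied_short))
```

The identity is algebraic, so it holds in every sample up to floating-point error. The population formula \(\beta_1 + \beta_2 \delta\) is its large-sample counterpart.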

Simulation

Left panel: sampling distributions of the short regression (omitting \(X_2\)) vs the long regression (including \(X_2\)). Right panel: realized bias across simulations vs the formula prediction \(\beta_2 \times \delta\).

#| standalone: true
#| viewerHeight: 750

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .eq-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-bottom: 14px; font-size: 14px; line-height: 1.9;
    }
    .eq-box b { color: #2c3e50; }
    .match  { color: #27ae60; font-weight: bold; }
    .coef   { color: #e74c3c; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 4,

      sliderInput("n", "Sample size (n):",
                  min = 50, max = 500, value = 200, step = 50),

      sliderInput("b1", HTML("True &beta;<sub>1</sub>:"),
                  min = -3, max = 3, value = 1.5, step = 0.1),

      sliderInput("b2", HTML("True &beta;<sub>2</sub> (omitted variable effect):"),
                  min = -3, max = 3, value = 1, step = 0.1),

      sliderInput("delta", HTML("&delta; = Corr(X<sub>1</sub>, X<sub>2</sub>):"),
                  min = -0.9, max = 0.9, value = 0.6, step = 0.1),

      sliderInput("sigma", HTML("Error SD (&sigma;):"),
                  min = 0.5, max = 5, value = 1, step = 0.5),

      actionButton("resim", "Run simulations", class = "btn-primary", width = "100%"),

      uiOutput("results_box")
    ),

    mainPanel(
      width = 8,
      fluidRow(
        column(6, plotOutput("plot_dist", height = "450px")),
        column(6, plotOutput("plot_bias", height = "450px"))
      ),
      uiOutput("formula_box")
    )
  )
)

server <- function(input, output, session) {

  sim_results <- reactive({
    input$resim
    n     <- input$n
    b1    <- input$b1
    b2    <- input$b2
    delta <- input$delta
    sigma <- input$sigma
    n_sims <- 500

    short_coefs <- numeric(n_sims)
    long_coefs  <- numeric(n_sims)

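    for (i in seq_len(n_sims)) {
      # Build correlated regressors from two independent standard normals:
      # Var(x1) = Var(x2) = 1 and Cov(x1, x2) = delta, so the auxiliary
      # slope of x2 on x1 equals delta.
      z1 <- rnorm(n)
      z2 <- rnorm(n)
      x1 <- z1
      x2 <- delta * z1 + sqrt(1 - delta^2) * z2
      eps <- rnorm(n, sd = sigma)
      y <- b1 * x1 + b2 * x2 + eps

      # Short regression omits x2; long regression includes it.
      short_coefs[i] <- coef(lm(y ~ x1))["x1"]
      long_coefs[i]  <- coef(lm(y ~ x1 + x2))["x1"]
    }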

    list(short = short_coefs, long = long_coefs,
         b1 = b1, b2 = b2, delta = delta,
         formula_bias = b2 * delta)
  })

  output$plot_dist <- renderPlot({
    d <- sim_results()
    par(mar = c(5, 5, 4, 2))

    rng <- range(c(d$short, d$long))
    brks <- seq(rng[1] - 0.1, rng[2] + 0.1, length.out = 40)

    hist(d$long, breaks = brks, col = adjustcolor("#27ae60", 0.4),
         border = "white", main = expression("Sampling distributions of " * hat(beta)[1]),
         xlab = expression(hat(beta)[1]), freq = FALSE,
         xlim = rng, ylim = c(0, max(
           hist(d$long, breaks = brks, plot = FALSE)$density,
           hist(d$short, breaks = brks, plot = FALSE)$density
         ) * 1.2))
    hist(d$short, breaks = brks, col = adjustcolor("#e74c3c", 0.4),
         border = "white", add = TRUE, freq = FALSE)

    abline(v = d$b1, lty = 2, lwd = 2, col = "#2c3e50")
    abline(v = d$b1 + d$formula_bias, lty = 2, lwd = 2, col = "#e74c3c")

    legend("topright", bty = "n", cex = 0.85,
           legend = c("Long regression (unbiased)",
                      "Short regression (biased)",
                      expression("True " * beta[1]),
                      expression(beta[1] + beta[2] * delta)),
           col = c(adjustcolor("#27ae60", 0.6),
                   adjustcolor("#e74c3c", 0.6),
                   "#2c3e50", "#e74c3c"),
           pch = c(15, 15, NA, NA), lwd = c(NA, NA, 2, 2),
           lty = c(NA, NA, 2, 2), pt.cex = 2)
  })

  output$plot_bias <- renderPlot({
    d <- sim_results()
    par(mar = c(5, 5, 4, 2))

    realized_bias <- d$short - d$b1
    hist(realized_bias, breaks = 35,
         col = adjustcolor("#3498db", 0.5), border = "white",
         main = "Realized bias vs formula prediction",
         xlab = expression(tilde(beta)[1] - beta[1]),
         freq = FALSE)

    abline(v = d$formula_bias, col = "#e74c3c", lwd = 3)
    abline(v = mean(realized_bias), col = "#2c3e50", lwd = 2, lty = 2)

    legend("topright", bty = "n", cex = 0.85,
           legend = c(
             paste0("Formula: ", round(d$formula_bias, 3)),
             paste0("Mean realized: ", round(mean(realized_bias), 3))
           ),
           col = c("#e74c3c", "#2c3e50"),
           lwd = c(3, 2), lty = c(1, 2))
  })

  output$results_box <- renderUI({
    d <- sim_results()
    tags$div(class = "eq-box", style = "margin-top: 16px;",
      HTML(paste0(
        "<b>OVB Formula:</b><br>",
        "Bias = &beta;<sub>2</sub> &times; &delta; = ",
        d$b2, " &times; ", d$delta, " = <span class='coef'>",
        round(d$formula_bias, 3), "</span><br><br>",
        "<b>Mean short estimate:</b> ", round(mean(d$short), 3), "<br>",
        "<b>Mean long estimate:</b> ", round(mean(d$long), 3), "<br>",
        "<b>True &beta;<sub>1</sub>:</b> ", d$b1
      ))
    )
  })

  output$formula_box <- renderUI({
    tags$div(class = "eq-box", style = "margin-top: 8px;",
      HTML(paste0(
        "<b>Key:</b> The short regression (red) is centered at ",
        "&beta;<sub>1</sub> + &beta;<sub>2</sub>&delta;, not at &beta;<sub>1</sub>. ",
        "The long regression (green) is centered at the truth. ",
        "Both concentrate as n grows, but the short regression concentrates ",
        "around the <i>wrong</i> value."
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

Set \(\beta_2 = 0\): no matter what \(\delta\) is, the short regression is unbiased. The omitted variable doesn’t affect \(Y\).
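Set \(\delta = 0\): the omitted variable affects \(Y\) but is uncorrelated with \(X_1\). No bias: omitting a relevant but orthogonal variable leaves \(\tilde{\beta}_1\) centered on \(\beta_1\), although the short regression is noisier because \(X_2\) ends up in the error term.
Make both large in absolute value: the two histograms separate visibly. The bias is \(\beta_2 \times \delta\); for example, \(\beta_2 = 3\) and \(\delta = 0.9\) give a bias of 2.7.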
Increase \(n\): both distributions get tighter, but the short regression still converges to the wrong value.
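
The last point can also be checked outside the app. The sketch below assumes the app's default values (\(\beta_1 = 1.5\), \(\beta_2 = 1\), \(\delta = 0.6\), \(\sigma = 1\)). The helper name `short_slope` and the sample sizes are choices made for this example.

```r
set.seed(42)   # fixed seed so the sketch is reproducible
b1 <- 1.5
b2 <- 1
delta <- 0.6

# Hypothetical helper: simulate one sample of size n and return the
# short-regression slope of y on x1 (OLS with an intercept).
short_slope <- function(n) {
  x1 <- rnorm(n)
  x2 <- delta * x1 + sqrt(1 - delta^2) * rnorm(n)
  y  <- b1 * x1 + b2 * x2 + rnorm(n)
  cov(x1, y) / var(x1)
}

sample_sizes <- c(100, 10000, 1000000)
estimates <- sapply(sample_sizes, short_slope)

# Compare each estimate with the truth (b1) and with the OVB limit (b1 + b2 * delta)
data.frame(n = sample_sizes,
           short_estimate = estimates,
           true_b1 = b1,
           ovb_limit = b1 + b2 * delta)
```

As the sample grows, the estimates should settle near `ovb_limit` rather than `true_b1`, which is what inconsistency means here.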

The bottom line

Omitting a variable biases the included coefficient if and only if the omitted variable (1) affects \(Y\) and (2) correlates with the included \(X\).
The bias doesn’t vanish with more data: the short-regression estimator is inconsistent, so its probability limit is \(\beta_1 + \beta_2 \delta\) no matter how large the sample.
The sign-of-bias table lets you reason about direction without knowing magnitudes.

Connections

Frisch-Waugh-Lovell — FWL shows mechanically what controlling for \(X_2\) does; OVB shows what happens when you don’t.
From Correlation to Causation — OVB is the main reason correlation ≠ causation.
Selection on Observables — When you can observe the confounders, controlling for them removes OVB.

Did you know?

The OVB formula is arguably the single most important result in applied econometrics. Joshua Angrist and Jörn-Steffen Pischke call it the “lingua franca” of empirical economics in Mostly Harmless Econometrics.
OVB is the formal version of “correlation does not imply causation.” Every confounding story is an OVB story: there exists some \(X_2\) that affects \(Y\) and correlates with \(X_1\).
The formula generalizes to the multivariate case via the FWL theorem — the bias from omitting a set of variables equals the effect of those variables times their auxiliary regression coefficients on the included variables.
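
A sketch of that generalization in matrix form, assuming the included regressors are stacked in a matrix \(X_1\) and the omitted ones in \(X_2\), with \(\beta_1\) and \(\beta_2\) now coefficient vectors (this stacked notation and the symbol \(\Delta\) are introduced here for illustration):

\[Y = X_1 \beta_1 + X_2 \beta_2 + \varepsilon, \qquad \tilde{\beta}_1 = (X_1' X_1)^{-1} X_1' Y \xrightarrow{p} \beta_1 + \Delta\, \beta_2, \qquad \Delta = \operatorname{plim}\, (X_1' X_1)^{-1} X_1' X_2\]

Each column of \(\Delta\) holds the auxiliary-regression coefficients of one omitted variable on all included variables. With one included and one omitted variable, \(\Delta\) reduces to the scalar \(\delta\) and the expression collapses to \(\beta_1 + \beta_2 \delta\).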