Omitted Variable Bias

The formula

Suppose the true model is:

\[Y = \beta_1 X_1 + \beta_2 X_2 + \varepsilon\]

If you omit \(X_2\) and run the short regression \(Y = \tilde{\beta}_1 X_1 + u\), the short-regression estimator converges to:

\[\tilde{\beta}_1 \xrightarrow{p} \beta_1 + \beta_2 \, \delta\]

where \(\delta\) is the coefficient from regressing \(X_2\) on \(X_1\) (the auxiliary regression). The bias has a clean interpretation:

\[\text{Bias} = \underbrace{\beta_2}_{\text{effect of omitted}} \times \underbrace{\delta}_{\text{slope of omitted on included}}\]

If the omitted variable doesn’t affect \(Y\) (\(\beta_2 = 0\)) or is uncorrelated with \(X_1\) (\(\delta = 0\)), there is no bias. Both links in the chain must be present.
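
The formula follows in one line once the true model is substituted into the short-regression slope. This sketch assumes the variables have mean zero (or that both regressions include an intercept) and that \(\varepsilon\) is uncorrelated with \(X_1\); neither assumption is spelled out above.

\[\tilde{\beta}_1 \xrightarrow{p} \frac{\operatorname{Cov}(X_1, Y)}{\operatorname{Var}(X_1)} = \frac{\operatorname{Cov}(X_1,\, \beta_1 X_1 + \beta_2 X_2 + \varepsilon)}{\operatorname{Var}(X_1)} = \beta_1 + \beta_2 \, \underbrace{\frac{\operatorname{Cov}(X_1, X_2)}{\operatorname{Var}(X_1)}}_{\delta} + \underbrace{\frac{\operatorname{Cov}(X_1, \varepsilon)}{\operatorname{Var}(X_1)}}_{0} = \beta_1 + \beta_2 \, \delta\]

The middle fraction is exactly the population slope from the auxiliary regression of \(X_2\) on \(X_1\), which is why \(\delta\) appears in the bias.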

Sign-of-bias table

You can sign the bias without knowing magnitudes — just think about the two ingredients:

| | \(\delta > 0\) (positive correlation) | \(\delta < 0\) (negative correlation) |
|---|---|---|
| \(\beta_2 > 0\) (positive effect) | Positive bias (overestimate) | Negative bias (underestimate) |
| \(\beta_2 < 0\) (negative effect) | Negative bias (underestimate) | Positive bias (overestimate) |

Example 1 — Returns to education, omitting ability. Ability likely has a positive effect on wages (\(\beta_2 > 0\)) and is positively correlated with education (\(\delta > 0\)). Omitting ability biases the return to education upward.

Example 2 — Class size and test scores, omitting SES. Higher SES likely raises scores (\(\beta_2 > 0\)) and wealthier districts may have smaller classes (\(\delta < 0\)). Omitting SES biases the class-size effect downward (makes class size look more harmful than it is).

The Oracle View. In the simulation below, we set the true \(\beta_2\) and \(\delta\) ourselves, so we can check the OVB formula directly. In practice, you don’t observe the omitted variable; otherwise you’d include it. The formula still tells you which direction the bias goes, which is often enough to treat the estimate as an upper or lower bound on the true effect.
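
Before running many simulations, a single simulated sample already shows the algebra at work. The sketch below is illustrative only: the parameter values and variable names are assumptions chosen for the example, not taken from any dataset. It fits the long, short, and auxiliary regressions with `lm()` and checks that the short slope equals the long slope on `x1` plus the long slope on `x2` times the auxiliary slope.

```r
# One-sample check of the OVB algebra; all parameter values are illustrative.
set.seed(1)
n     <- 1000
b1    <- 1.5   # assumed true effect of x1
b2    <- 1.0   # assumed true effect of the omitted x2
delta <- 0.6   # assumed population slope of x2 on x1

x1 <- rnorm(n)
x2 <- delta * x1 + sqrt(1 - delta^2) * rnorm(n)
y  <- b1 * x1 + b2 * x2 + rnorm(n)

long  <- coef(lm(y ~ x1 + x2))   # includes x2
short <- coef(lm(y ~ x1))        # omits x2
aux   <- coef(lm(x2 ~ x1))       # auxiliary regression of x2 on x1

# In-sample identity: short slope = long slope on x1 + long slope on x2 * aux slope
implied <- unname(long["x1"] + long["x2"] * aux["x1"])
gap     <- unname(short["x1"]) - implied

print(c(short = unname(short["x1"]), implied = implied, gap = gap))
```

Within a sample the identity holds up to floating-point error, using the estimated coefficients. The population version \(\beta_1 + \beta_2 \delta\) holds only approximately in any one sample, which is what the repeated simulations illustrate.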

Simulation

Left panel: sampling distributions of the short regression (omitting \(X_2\)) vs the long regression (including \(X_2\)). Right panel: realized bias across simulations vs the formula prediction \(\beta_2 \times \delta\).

#| standalone: true
#| viewerHeight: 750

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .eq-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-bottom: 14px; font-size: 14px; line-height: 1.9;
    }
    .eq-box b { color: #2c3e50; }
    .match  { color: #27ae60; font-weight: bold; }
    .coef   { color: #e74c3c; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 4,

      sliderInput("n", "Sample size (n):",
                  min = 50, max = 500, value = 200, step = 50),

      sliderInput("b1", HTML("True &beta;<sub>1</sub>:"),
                  min = -3, max = 3, value = 1.5, step = 0.1),

      sliderInput("b2", HTML("True &beta;<sub>2</sub> (omitted variable effect):"),
                  min = -3, max = 3, value = 1, step = 0.1),

      sliderInput("delta", HTML("&delta; = Corr(X<sub>1</sub>, X<sub>2</sub>):"),
                  min = -0.9, max = 0.9, value = 0.6, step = 0.1),

      sliderInput("sigma", HTML("Error SD (&sigma;):"),
                  min = 0.5, max = 5, value = 1, step = 0.5),

      actionButton("resim", "Run simulations", class = "btn-primary", width = "100%"),

      uiOutput("results_box")
    ),

    mainPanel(
      width = 8,
      fluidRow(
        column(6, plotOutput("plot_dist", height = "450px")),
        column(6, plotOutput("plot_bias", height = "450px"))
      ),
      uiOutput("formula_box")
    )
  )
)

server <- function(input, output, session) {

  sim_results <- reactive({
    input$resim
    n     <- input$n
    b1    <- input$b1
    b2    <- input$b2
    delta <- input$delta
    sigma <- input$sigma
    n_sims <- 500

    short_coefs <- numeric(n_sims)
    long_coefs  <- numeric(n_sims)

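    for (i in seq_len(n_sims)) {
      # Build x1 and x2 from independent standard normals so that both have
      # unit variance and Corr(x1, x2) = delta; the population slope of x2
      # on x1 then equals delta as well.
      z1 <- rnorm(n)
      z2 <- rnorm(n)
      x1 <- z1
      x2 <- delta * z1 + sqrt(1 - delta^2) * z2
      eps <- rnorm(n, sd = sigma)
      y <- b1 * x1 + b2 * x2 + eps

      # Short regression omits x2; long regression includes it.
      short_coefs[i] <- coef(lm(y ~ x1))["x1"]
      long_coefs[i]  <- coef(lm(y ~ x1 + x2))["x1"]
    }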

    list(short = short_coefs, long = long_coefs,
         b1 = b1, b2 = b2, delta = delta,
         formula_bias = b2 * delta)
  })

  output$plot_dist <- renderPlot({
    d <- sim_results()
    par(mar = c(5, 5, 4, 2))

    rng <- range(c(d$short, d$long))
    brks <- seq(rng[1] - 0.1, rng[2] + 0.1, length.out = 40)

    hist(d$long, breaks = brks, col = adjustcolor("#27ae60", 0.4),
         border = "white", main = expression("Sampling distributions of " * hat(beta)[1]),
         xlab = expression(hat(beta)[1]), freq = FALSE,
         xlim = rng, ylim = c(0, max(
           hist(d$long, breaks = brks, plot = FALSE)$density,
           hist(d$short, breaks = brks, plot = FALSE)$density
         ) * 1.2))
    hist(d$short, breaks = brks, col = adjustcolor("#e74c3c", 0.4),
         border = "white", add = TRUE, freq = FALSE)

    abline(v = d$b1, lty = 2, lwd = 2, col = "#2c3e50")
    abline(v = d$b1 + d$formula_bias, lty = 2, lwd = 2, col = "#e74c3c")

    legend("topright", bty = "n", cex = 0.85,
           legend = c("Long regression (unbiased)",
                      "Short regression (biased)",
                      expression("True " * beta[1]),
                      expression(beta[1] + beta[2] * delta)),
           col = c(adjustcolor("#27ae60", 0.6),
                   adjustcolor("#e74c3c", 0.6),
                   "#2c3e50", "#e74c3c"),
           pch = c(15, 15, NA, NA), lwd = c(NA, NA, 2, 2),
           lty = c(NA, NA, 2, 2), pt.cex = 2)
  })

  output$plot_bias <- renderPlot({
    d <- sim_results()
    par(mar = c(5, 5, 4, 2))

    realized_bias <- d$short - d$b1
    hist(realized_bias, breaks = 35,
         col = adjustcolor("#3498db", 0.5), border = "white",
         main = "Realized bias vs formula prediction",
         xlab = expression(tilde(beta)[1] - beta[1]),
         freq = FALSE)

    abline(v = d$formula_bias, col = "#e74c3c", lwd = 3)
    abline(v = mean(realized_bias), col = "#2c3e50", lwd = 2, lty = 2)

    legend("topright", bty = "n", cex = 0.85,
           legend = c(
             paste0("Formula: ", round(d$formula_bias, 3)),
             paste0("Mean realized: ", round(mean(realized_bias), 3))
           ),
           col = c("#e74c3c", "#2c3e50"),
           lwd = c(3, 2), lty = c(1, 2))
  })

  output$results_box <- renderUI({
    d <- sim_results()
    tags$div(class = "eq-box", style = "margin-top: 16px;",
      HTML(paste0(
        "<b>OVB Formula:</b><br>",
        "Bias = &beta;<sub>2</sub> &times; &delta; = ",
        d$b2, " &times; ", d$delta, " = <span class='coef'>",
        round(d$formula_bias, 3), "</span><br><br>",
        "<b>Mean short estimate:</b> ", round(mean(d$short), 3), "<br>",
        "<b>Mean long estimate:</b> ", round(mean(d$long), 3), "<br>",
        "<b>True &beta;<sub>1</sub>:</b> ", d$b1
      ))
    )
  })

  output$formula_box <- renderUI({
    tags$div(class = "eq-box", style = "margin-top: 8px;",
      HTML(paste0(
        "<b>Key:</b> The short regression (red) is centered at ",
        "&beta;<sub>1</sub> + &beta;<sub>2</sub>&delta;, not at &beta;<sub>1</sub>. ",
        "The long regression (green) is centered at the truth. ",
        "Both concentrate as n grows, but the short regression concentrates ",
        "around the <i>wrong</i> value."
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • Set \(\beta_2 = 0\): no matter what \(\delta\) is, the short regression is unbiased. The omitted variable doesn’t affect \(Y\).
  • Set \(\delta = 0\): the omitted variable affects \(Y\) but is uncorrelated with \(X_1\). There is no bias: omitting a relevant but orthogonal variable is harmless for \(\tilde{\beta}_1\).
  • Make both large: the two histograms separate visibly. The bias is \(\beta_2 \times \delta\).
  • Increase \(n\): both distributions get tighter, but the short regression still converges to the wrong value.
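
The same experiments can be run without the interactive app. The sketch below is a minimal base-R version under the app's data-generating process; `ovb_sim()` and its default arguments are hypothetical names introduced here for illustration. It reports the formula bias, the average bias of the short and long estimates, and the spread of the short estimates for each scenario.

```r
# Non-interactive sketch of the experiments above; ovb_sim() is a hypothetical helper.
ovb_sim <- function(n = 200, b1 = 1.5, b2 = 1, delta = 0.6,
                    sigma = 1, n_sims = 500) {
  short <- numeric(n_sims)
  long  <- numeric(n_sims)
  for (i in seq_len(n_sims)) {
    x1 <- rnorm(n)
    x2 <- delta * x1 + sqrt(1 - delta^2) * rnorm(n)
    y  <- b1 * x1 + b2 * x2 + rnorm(n, sd = sigma)
    short[i] <- coef(lm(y ~ x1))["x1"]        # omits x2
    long[i]  <- coef(lm(y ~ x1 + x2))["x1"]   # includes x2
  }
  c(formula_bias = b2 * delta,
    short_bias   = mean(short) - b1,
    long_bias    = mean(long) - b1,
    short_sd     = sd(short))
}

set.seed(42)
results <- rbind(
  baseline   = ovb_sim(),            # both links present
  no_effect  = ovb_sim(b2 = 0),      # omitted variable does not affect y
  orthogonal = ovb_sim(delta = 0),   # omitted variable uncorrelated with x1
  large_n    = ovb_sim(n = 1000)     # more data, same bias
)
print(round(results, 3))
```

In each row, `short_bias` should sit close to `formula_bias` and `long_bias` close to zero. The `large_n` row should show a smaller `short_sd` than the baseline while `short_bias` stays near \(\beta_2 \times \delta\).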

The bottom line

  • Omitting a variable biases the included coefficient if and only if the omitted variable (1) affects \(Y\) and (2) correlates with the included \(X\).
  • The bias doesn’t vanish with more data: it sits in the probability limit of the estimator, so it is an inconsistency rather than small-sample noise.
  • The sign-of-bias table lets you reason about direction without knowing magnitudes.

Did you know?

  • The OVB formula is arguably the single most important result in applied econometrics. Joshua Angrist and Jörn-Steffen Pischke call it the “lingua franca” of empirical economics in Mostly Harmless Econometrics.
  • OVB is the formal version of “correlation does not imply causation.” Every confounding story is an OVB story: there exists some \(X_2\) that affects \(Y\) and correlates with \(X_1\).
  • The formula generalizes to the multivariate case via the FWL theorem — the bias from omitting a set of variables equals the effect of those variables times their auxiliary regression coefficients on the included variables.
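
For reference, the multivariate statement in the last bullet can be written out as follows. The matrix notation is an assumption introduced here for illustration: \(X_1\) now denotes the matrix of included regressors and \(X_2\) the matrix of omitted ones.

\[Y = X_1 \beta_1 + X_2 \beta_2 + \varepsilon, \qquad \tilde{\beta}_1 = (X_1' X_1)^{-1} X_1' Y = \beta_1 + \underbrace{(X_1' X_1)^{-1} X_1' X_2}_{\hat{\Delta}} \, \beta_2 + (X_1' X_1)^{-1} X_1' \varepsilon\]

Each column of \(\hat{\Delta}\) holds the auxiliary regression coefficients of one omitted variable on all included variables. If the last term vanishes in probability, then \(\tilde{\beta}_1 \xrightarrow{p} \beta_1 + \Delta \beta_2\), which reduces to \(\beta_1 + \beta_2 \, \delta\) when there is one included and one omitted variable.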