Omitted Variable Bias
The formula
Suppose the true model is:
\[Y = \beta_1 X_1 + \beta_2 X_2 + \varepsilon\]
If you omit \(X_2\) and run the short regression \(Y = \tilde{\beta}_1 X_1 + u\), the short-regression estimator converges to:
\[\tilde{\beta}_1 \xrightarrow{p} \beta_1 + \beta_2 \, \delta\]
where \(\delta\) is the slope coefficient from regressing \(X_2\) on \(X_1\) (the auxiliary regression), which has the same sign as the correlation between \(X_1\) and \(X_2\). The bias has a clean interpretation:
\[\text{Bias} = \underbrace{\beta_2}_{\text{effect of omitted}} \times \underbrace{\delta}_{\text{correlation with included}}\]
If the omitted variable doesn’t affect \(Y\) (\(\beta_2 = 0\)) or is uncorrelated with \(X_1\) (\(\delta = 0\)), there is no bias. Both links in the chain must be present.
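One way to see where the formula comes from is to substitute the auxiliary regression into the true model. The sketch below assumes mean-zero variables and \(\operatorname{Cov}(X_1, \varepsilon) = 0\); the residual \(v\) is notation introduced here for illustration and is not used elsewhere on this page.
\[X_2 = \delta X_1 + v, \qquad \operatorname{Cov}(X_1, v) = 0\]
\[Y = \beta_1 X_1 + \beta_2 (\delta X_1 + v) + \varepsilon = (\beta_1 + \beta_2 \delta)\, X_1 + \underbrace{(\beta_2 v + \varepsilon)}_{u}\]
Because the composite error \(u\) is uncorrelated with \(X_1\) under these assumptions, the short regression consistently estimates the coefficient on \(X_1\) in the last expression, which is \(\beta_1 + \beta_2 \delta\) rather than \(\beta_1\).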
Sign-of-bias table
You can sign the bias without knowing magnitudes — just think about the two ingredients:
| | \(\delta > 0\) (positive correlation) | \(\delta < 0\) (negative correlation) |
|---|---|---|
| \(\beta_2 > 0\) (positive effect) | Positive bias (overestimate) | Negative bias (underestimate) |
| \(\beta_2 < 0\) (negative effect) | Negative bias (underestimate) | Positive bias (overestimate) |
Example 1 — Returns to education, omitting ability. Ability likely has a positive effect on wages (\(\beta_2 > 0\)) and is positively correlated with education (\(\delta > 0\)). Omitting ability biases the return to education upward.
Example 2 — Class size and test scores, omitting SES. Higher SES likely raises scores (\(\beta_2 > 0\)) and wealthier districts may have smaller classes (\(\delta < 0\)). Omitting SES biases the class-size effect downward (makes class size look more harmful than it is).
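The sign logic in the table can be checked numerically. The following is a minimal sketch for the Example 2 pattern (\(\beta_2 > 0\), \(\delta < 0\)); every parameter value is an illustrative assumption, not an estimate from any real class-size data.

```r
# Illustrative parameter values (assumptions, not estimates from any dataset).
set.seed(1)
n     <- 100000   # large n so sampling noise is small
b1    <- -0.5     # hypothetical true effect of x1
b2    <- 1.0      # omitted variable raises the outcome (beta_2 > 0)
delta <- -0.6     # omitted variable falls as x1 rises (delta < 0)

x1 <- rnorm(n)
x2 <- delta * x1 + sqrt(1 - delta^2) * rnorm(n)  # Var(x2) = 1, slope on x1 = delta
y  <- b1 * x1 + b2 * x2 + rnorm(n)

short <- unname(coef(lm(y ~ x1))["x1"])       # omits x2
long  <- unname(coef(lm(y ~ x1 + x2))["x1"])  # includes x2

predicted_sign <- sign(b2 * delta)  # sign read off the table
realized_sign  <- sign(short - b1)  # sign of the realized bias

cat("short:", round(short, 3), " long:", round(long, 3),
    " predicted sign:", predicted_sign, " realized sign:", realized_sign, "\n")
```

With a positive effect and a negative auxiliary slope, the predicted sign is negative, and the short estimate should land below the true coefficient while the long estimate stays close to it.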
Simulation
Left panel: sampling distributions of the short regression (omitting \(X_2\)) vs the long regression (including \(X_2\)). Right panel: realized bias across simulations vs the formula prediction \(\beta_2 \times \delta\).
#| standalone: true
#| viewerHeight: 750
library(shiny)
ui <- fluidPage(
  tags$head(tags$style(HTML("
    .eq-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-bottom: 14px; font-size: 14px; line-height: 1.9;
    }
    .eq-box b { color: #2c3e50; }
    .match { color: #27ae60; font-weight: bold; }
    .coef { color: #e74c3c; font-weight: bold; }
  "))),
  sidebarLayout(
    sidebarPanel(
      width = 4,
      sliderInput("n", "Sample size (n):",
                  min = 50, max = 500, value = 200, step = 50),
      sliderInput("b1", HTML("True β<sub>1</sub>:"),
                  min = -3, max = 3, value = 1.5, step = 0.1),
      sliderInput("b2", HTML("True β<sub>2</sub> (omitted variable effect):"),
                  min = -3, max = 3, value = 1, step = 0.1),
      sliderInput("delta", HTML("δ = Corr(X<sub>1</sub>, X<sub>2</sub>):"),
                  min = -0.9, max = 0.9, value = 0.6, step = 0.1),
      sliderInput("sigma", HTML("Error SD (σ):"),
                  min = 0.5, max = 5, value = 1, step = 0.5),
      actionButton("resim", "Run simulations", class = "btn-primary", width = "100%"),
      uiOutput("results_box")
    ),
    mainPanel(
      width = 8,
      fluidRow(
        column(6, plotOutput("plot_dist", height = "450px")),
        column(6, plotOutput("plot_bias", height = "450px"))
      ),
      uiOutput("formula_box")
    )
  )
)
server <- function(input, output, session) {
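  # Re-run the Monte Carlo whenever a slider changes or the button is clicked.
  sim_results <- reactive({
    input$resim  # take a dependency on the button
    n <- input$n
    b1 <- input$b1
    b2 <- input$b2
    delta <- input$delta
    sigma <- input$sigma
    n_sims <- 500
    short_coefs <- numeric(n_sims)
    long_coefs <- numeric(n_sims)
    for (i in seq_len(n_sims)) {
      # x1 and x2 both have unit variance and Corr(x1, x2) = delta, so the
      # auxiliary-regression slope of x2 on x1 is also delta.
      z1 <- rnorm(n)
      z2 <- rnorm(n)
      x1 <- z1
      x2 <- delta * z1 + sqrt(1 - delta^2) * z2
      eps <- rnorm(n, sd = sigma)
      y <- b1 * x1 + b2 * x2 + eps
      # Short regression omits x2; long regression includes it.
      short_coefs[i] <- coef(lm(y ~ x1))["x1"]
      long_coefs[i] <- coef(lm(y ~ x1 + x2))["x1"]
    }
    list(short = short_coefs, long = long_coefs,
         b1 = b1, b2 = b2, delta = delta,
         formula_bias = b2 * delta)
  })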
  output$plot_dist <- renderPlot({
    d <- sim_results()
    par(mar = c(5, 5, 4, 2))
    rng <- range(c(d$short, d$long))
    brks <- seq(rng[1] - 0.1, rng[2] + 0.1, length.out = 40)
    hist(d$long, breaks = brks, col = adjustcolor("#27ae60", 0.4),
         border = "white", main = expression("Sampling distributions of " * hat(beta)[1]),
         xlab = expression(hat(beta)[1]), freq = FALSE,
         xlim = rng, ylim = c(0, max(
           hist(d$long, breaks = brks, plot = FALSE)$density,
           hist(d$short, breaks = brks, plot = FALSE)$density
         ) * 1.2))
    hist(d$short, breaks = brks, col = adjustcolor("#e74c3c", 0.4),
         border = "white", add = TRUE, freq = FALSE)
    abline(v = d$b1, lty = 2, lwd = 2, col = "#2c3e50")
    abline(v = d$b1 + d$formula_bias, lty = 2, lwd = 2, col = "#e74c3c")
    legend("topright", bty = "n", cex = 0.85,
           legend = c("Long regression (unbiased)",
                      "Short regression (biased)",
                      expression("True " * beta[1]),
                      expression(beta[1] + beta[2] * delta)),
           col = c(adjustcolor("#27ae60", 0.6),
                   adjustcolor("#e74c3c", 0.6),
                   "#2c3e50", "#e74c3c"),
           pch = c(15, 15, NA, NA), lwd = c(NA, NA, 2, 2),
           lty = c(NA, NA, 2, 2), pt.cex = 2)
  })
  output$plot_bias <- renderPlot({
    d <- sim_results()
    par(mar = c(5, 5, 4, 2))
    realized_bias <- d$short - d$b1
    hist(realized_bias, breaks = 35,
         col = adjustcolor("#3498db", 0.5), border = "white",
         main = "Realized bias vs formula prediction",
         xlab = expression(tilde(beta)[1] - beta[1]),
         freq = FALSE)
    abline(v = d$formula_bias, col = "#e74c3c", lwd = 3)
    abline(v = mean(realized_bias), col = "#2c3e50", lwd = 2, lty = 2)
    legend("topright", bty = "n", cex = 0.85,
           legend = c(
             paste0("Formula: ", round(d$formula_bias, 3)),
             paste0("Mean realized: ", round(mean(realized_bias), 3))
           ),
           col = c("#e74c3c", "#2c3e50"),
           lwd = c(3, 2), lty = c(1, 2))
  })
  output$results_box <- renderUI({
    d <- sim_results()
    tags$div(class = "eq-box", style = "margin-top: 16px;",
      HTML(paste0(
        "<b>OVB Formula:</b><br>",
        "Bias = β<sub>2</sub> × δ = ",
        d$b2, " × ", d$delta, " = <span class='coef'>",
        round(d$formula_bias, 3), "</span><br><br>",
        "<b>Mean short estimate:</b> ", round(mean(d$short), 3), "<br>",
        "<b>Mean long estimate:</b> ", round(mean(d$long), 3), "<br>",
        "<b>True β<sub>1</sub>:</b> ", d$b1
      ))
    )
  })
  output$formula_box <- renderUI({
    tags$div(class = "eq-box", style = "margin-top: 8px;",
      HTML(paste0(
        "<b>Key:</b> The short regression (red) is centered at ",
        "β<sub>1</sub> + β<sub>2</sub>δ, not at β<sub>1</sub>. ",
        "The long regression (green) is centered at the truth. ",
        "Both concentrate as n grows, but the short regression concentrates ",
        "around the <i>wrong</i> value."
      ))
    )
  })
}
shinyApp(ui, server)
Things to try
- Set \(\beta_2 = 0\): no matter what \(\delta\) is, the short regression is unbiased. The omitted variable doesn’t affect \(Y\).
- Set \(\delta = 0\): the omitted variable affects \(Y\) but is uncorrelated with \(X_1\). No bias — omitting a relevant but orthogonal variable is harmless for \(\hat{\beta}_1\).
- Make both large: the two histograms separate visibly. The bias is \(\beta_2 \times \delta\).
- Increase \(n\): both distributions get tighter, but the short regression still converges to the wrong value.
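The last experiment can also be run outside the app. The sketch below is a standalone approximation of the app's data-generating process at its default slider values; the helper `short_estimate` and the chosen sample sizes are assumptions introduced for illustration.

```r
set.seed(42)
b1 <- 1.5; b2 <- 1; delta <- 0.6   # the app's default slider values (error SD = 1)

# Hypothetical helper: one draw of the short-regression slope at sample size n.
short_estimate <- function(n) {
  x1 <- rnorm(n)
  x2 <- delta * x1 + sqrt(1 - delta^2) * rnorm(n)
  y  <- b1 * x1 + b2 * x2 + rnorm(n)
  unname(coef(lm(y ~ x1))["x1"])   # x2 is omitted on purpose
}

sizes     <- c(100, 10000, 200000)
estimates <- sapply(sizes, short_estimate)

# Compare each estimate with the formula target and with the true beta_1.
print(data.frame(n = sizes, short = round(estimates, 3),
                 target = b1 + b2 * delta, truth = b1))
```

As the sample grows, the estimates should settle near the `target` column (\(\beta_1 + \beta_2 \delta\)), not the `truth` column.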
The bottom line
- Omitting a variable biases the included coefficient if and only if the omitted variable (1) affects \(Y\) and (2) correlates with the included \(X\).
- The bias doesn’t vanish with more data: it appears in the probability limit, so the short regression is inconsistent, not merely noisy in small samples.
- The sign-of-bias table lets you reason about direction without knowing magnitudes.
Connections
- Frisch-Waugh-Lovell — FWL shows mechanically what controlling for \(X_2\) does; OVB shows what happens when you don’t.
- From Correlation to Causation — OVB is the main reason correlation ≠ causation.
- Selection on Observables — When you can observe the confounders, controlling for them removes OVB.
Did you know?
- The OVB formula is arguably the single most important result in applied econometrics. Joshua Angrist and Jörn-Steffen Pischke call it the “lingua franca” of empirical economics in Mostly Harmless Econometrics.
- OVB is the formal version of “correlation does not imply causation.” Every confounding story is an OVB story: there exists some \(X_2\) that affects \(Y\) and correlates with \(X_1\).
- The formula generalizes to the multivariate case via the FWL theorem — the bias from omitting a set of variables equals the effect of those variables times their auxiliary regression coefficients on the included variables.
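The multivariate version can be checked in-sample, where it holds as an algebraic identity between OLS coefficients, not only in the limit. The sketch below omits two variables; the data-generating coefficients and the variable names `x2` and `x3` are arbitrary assumptions chosen for illustration.

```r
set.seed(7)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)               # first omitted variable
x3 <- -0.3 * x1 + 0.4 * x2 + rnorm(n)   # second omitted variable
y  <- 1 + 2 * x1 + 1.5 * x2 - 1 * x3 + rnorm(n)

long  <- coef(lm(y ~ x1 + x2 + x3))   # includes everything
short <- coef(lm(y ~ x1))             # omits x2 and x3

# Auxiliary regressions: each omitted variable on the included one.
d2 <- coef(lm(x2 ~ x1))["x1"]
d3 <- coef(lm(x3 ~ x1))["x1"]

# Short slope implied by the long coefficients and the auxiliary slopes.
implied_short <- long["x1"] + long["x2"] * d2 + long["x3"] * d3

print(all.equal(unname(short["x1"]), unname(implied_short)))
```

The comparison uses the estimated long-regression coefficients, not the true ones, so the two numbers should agree up to floating-point error in any sample.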