Bayesian Causal Inference
Why go Bayesian for causal effects?
Frequentist causal inference gives you a point estimate and a confidence interval. But sometimes you want more:
- “What’s the probability the program actually works?” A frequentist CI doesn’t answer this. A posterior does: \(P(\tau > 0 \mid \text{data})\).
- “A previous study found an effect of $1,500. Can I use that?” Bayesian priors let you formally incorporate prior evidence.
- “My sample is tiny. Can I still say anything?” With a reasonable prior, Bayesian estimates are more stable in small samples.
The cost: you need to specify a prior, and your results depend on it. The payoff: richer inference and better small-sample behavior.
Setup: a job training program
A nonprofit runs a job training program for unemployed workers. We observe earnings for participants (treated) and non-participants (control). The question: does the program increase earnings?
We’ll use selection on observables — controlling for baseline covariates like age and prior earnings — and compare three approaches:
- Difference in means (naive)
- Regression adjustment (frequentist, with controls)
- Bayesian regression (with an informative prior from a previous study)
#| standalone: true
#| viewerHeight: 700
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.info-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.8;
}
.info-box b { color: #2c3e50; }
.good { color: #27ae60; font-weight: bold; }
.bad { color: #e74c3c; font-weight: bold; }
.muted { color: #7f8c8d; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
tags$h5("True DGP"),
sliderInput("true_ate", "True treatment effect ($):",
min = -500, max = 3000, value = 1200, step = 100),
sliderInput("n_obs", "Sample size:",
min = 30, max = 500, value = 80, step = 10),
sliderInput("confounding", "Selection bias strength:",
min = 0, max = 1, value = 0.5, step = 0.1),
tags$hr(),
tags$h5("Bayesian prior"),
sliderInput("prior_mean", "Prior mean ($):",
min = -1000, max = 3000, value = 1500, step = 100),
sliderInput("prior_sd", "Prior SD ($):",
min = 100, max = 3000, value = 800, step = 100),
tags$hr(),
actionButton("go", "New draw", class = "btn-primary", width = "100%"),
uiOutput("info")
),
mainPanel(
width = 9,
fluidRow(
column(6, plotOutput("estimates_plot", height = "420px")),
column(6, plotOutput("posterior_plot", height = "420px"))
)
)
)
)
server <- function(input, output, session) {
dat <- reactive({
input$go
n <- input$n_obs
ate <- input$true_ate
conf <- input$confounding
# Covariates
age <- rnorm(n, 35, 8)
prior_earn <- rnorm(n, 20000, 5000)
# Selection into treatment depends on covariates
# Higher prior earnings -> more likely to be treated (positive selection)
latent <- conf * (prior_earn - 20000) / 5000 + rnorm(n)
treat <- as.integer(latent > 0)
# Outcome: earnings
noise <- rnorm(n, sd = 3000)
earnings <- 15000 + 200 * (age - 35) + 0.3 * (prior_earn - 20000) +
ate * treat + noise
df <- data.frame(earnings = earnings, treat = treat,
age = age, prior_earn = prior_earn)
# 1. Naive difference in means
naive_est <- mean(earnings[treat == 1]) - mean(earnings[treat == 0])
n1 <- sum(treat); n0 <- n - n1
se_naive <- sqrt(var(earnings[treat == 1]) / max(n1, 1) +
var(earnings[treat == 0]) / max(n0, 1))
# 2. Regression adjustment (frequentist)
fit_freq <- lm(earnings ~ treat + age + prior_earn, data = df)
freq_est <- coef(fit_freq)["treat"]
freq_se <- summary(fit_freq)$coefficients["treat", "Std. Error"]
# 3. Bayesian (conjugate normal update)
# Likelihood: treat coefficient ~ N(freq_est, freq_se^2)
# Prior: tau ~ N(prior_mean, prior_sd^2)
prior_mean <- input$prior_mean
prior_sd <- input$prior_sd
prior_prec <- 1 / prior_sd^2
data_prec <- 1 / freq_se^2
post_prec <- prior_prec + data_prec
post_var <- 1 / post_prec
post_mean <- post_var * (prior_prec * prior_mean + data_prec * freq_est)
post_sd <- sqrt(post_var)
# P(tau > 0 | data)
p_positive <- 1 - pnorm(0, post_mean, post_sd)
list(df = df, ate = ate,
naive_est = naive_est, se_naive = se_naive,
freq_est = freq_est, freq_se = freq_se,
prior_mean = prior_mean, prior_sd = prior_sd,
post_mean = post_mean, post_sd = post_sd,
p_positive = p_positive,
n = n, n1 = n1, n0 = n0)
})
output$estimates_plot <- renderPlot({
d <- dat()
par(mar = c(5, 10, 3, 2))
ests <- c(d$naive_est, d$freq_est, d$post_mean)
ses <- c(d$se_naive, d$freq_se, d$post_sd)
labs <- c("Naive\n(diff in means)", "Regression\nadjustment", "Bayesian\nposterior")
cols <- c("#e74c3c", "#3498db", "#27ae60")
xlim <- range(c(ests - 2.5 * ses, ests + 2.5 * ses, d$ate))
plot(NULL, xlim = xlim, ylim = c(0.5, 3.5),
xlab = "Treatment effect ($)", ylab = "", yaxt = "n",
main = "Three estimates of the training effect")
axis(2, at = 1:3, labels = labs, las = 1, cex.axis = 0.9)
abline(v = d$ate, col = "gray40", lty = 2, lwd = 2)
text(d$ate, 3.45, paste0("Truth: $", d$ate), cex = 0.85, col = "gray40")
for (i in 1:3) {
ci_lo <- ests[i] - 1.96 * ses[i]
ci_hi <- ests[i] + 1.96 * ses[i]
segments(ci_lo, i, ci_hi, i, col = cols[i], lwd = 3)
points(ests[i], i, pch = 16, col = cols[i], cex = 1.8)
}
abline(v = 0, col = "gray80", lty = 3)
legend("topright", bty = "n", cex = 0.8,
legend = c("Point estimate", "95% interval", "True effect"),
pch = c(16, NA, NA), lty = c(NA, 1, 2),
lwd = c(NA, 3, 2), col = c("gray40", "gray40", "gray40"))
})
output$posterior_plot <- renderPlot({
d <- dat()
par(mar = c(5, 4.5, 3, 1))
# Plot range
all_means <- c(d$prior_mean, d$freq_est, d$post_mean)
all_sds <- c(d$prior_sd, d$freq_se, d$post_sd)
xlo <- min(all_means - 3 * all_sds)
xhi <- max(all_means + 3 * all_sds)
x <- seq(xlo, xhi, length.out = 500)
# Densities
d_prior <- dnorm(x, d$prior_mean, d$prior_sd)
d_like <- dnorm(x, d$freq_est, d$freq_se)
d_post <- dnorm(x, d$post_mean, d$post_sd)
ymax <- max(c(d_prior, d_like, d_post)) * 1.15
plot(x, d_like, type = "l", col = "#3498db", lwd = 2.5,
xlab = "Treatment effect ($)", ylab = "Density",
main = "Prior, likelihood & posterior",
ylim = c(0, ymax))
lines(x, d_prior, col = "#e67e22", lwd = 2.5, lty = 2)
lines(x, d_post, col = "#27ae60", lwd = 3)
# Shade posterior
poly_x <- c(x, rev(x))
poly_y <- c(d_post, rep(0, length(x)))
polygon(poly_x, poly_y, col = "#27ae6025", border = NA)
abline(v = d$ate, col = "gray40", lty = 2, lwd = 1.5)
abline(v = 0, col = "gray80", lty = 3)
legend("topright", bty = "n", cex = 0.85,
legend = c(paste0("Prior: N(", d$prior_mean, ", ", d$prior_sd, "\u00b2)"),
paste0("Likelihood (data): N(",
round(d$freq_est), ", ", round(d$freq_se), "\u00b2)"),
paste0("Posterior: N(",
round(d$post_mean), ", ", round(d$post_sd), "\u00b2)")),
col = c("#e67e22", "#3498db", "#27ae60"),
lwd = c(2.5, 2.5, 3), lty = c(2, 1, 1))
})
output$info <- renderUI({
d <- dat()
# Coverage
freq_covers <- (d$freq_est - 1.96 * d$freq_se <= d$ate) &
(d$freq_est + 1.96 * d$freq_se >= d$ate)
bayes_covers <- (d$post_mean - 1.96 * d$post_sd <= d$ate) &
(d$post_mean + 1.96 * d$post_sd >= d$ate)
freq_tag <- if (freq_covers) "<span class='good'>Yes</span>" else "<span class='bad'>No</span>"
bayes_tag <- if (bayes_covers) "<span class='good'>Yes</span>" else "<span class='bad'>No</span>"
tags$div(class = "info-box", HTML(paste0(
"<b>Naive:</b> $", round(d$naive_est), "<br>",
"<b>Regression:</b> $", round(d$freq_est),
" ± ", round(1.96 * d$freq_se), "<br>",
"<b>Bayesian:</b> $", round(d$post_mean),
" ± ", round(1.96 * d$post_sd), "<br>",
"<b>P(effect > 0 | data):</b> ", round(d$p_positive, 3), "<br>",
"<hr style='margin:8px 0'>",
"<small>Covers truth? Freq: ", freq_tag,
" | Bayes: ", bayes_tag, "</small>"
)))
})
}
shinyApp(ui, server)
How the Bayesian update works
The mechanics are simple. The regression gives us an (approximate) likelihood for the treatment effect: the estimated coefficient is normally distributed around the true effect,
\[\hat{\tau}_{\text{freq}} \mid \tau \sim N(\tau, \; \text{SE}^2)\]
We combine this with a prior from a previous study:
\[\tau \sim N(\mu_0, \; \sigma_0^2)\]
The posterior is another normal (conjugate updating):
\[\tau \mid \text{data} \sim N\!\left(\frac{\sigma_0^{-2} \mu_0 + \text{SE}^{-2} \hat{\tau}_{\text{freq}}}{\sigma_0^{-2} + \text{SE}^{-2}}, \;\; \frac{1}{\sigma_0^{-2} + \text{SE}^{-2}}\right)\]
The posterior mean is a precision-weighted average of the prior mean and the data estimate. Whichever has smaller variance (higher precision) gets more weight.
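This update can be sketched in a few lines of R. The numbers are illustrative stand-ins, not output from the app: assume a prior of N(1500, 800²) from the earlier study and a regression estimate of $900 with SE $600.

```r
# Conjugate normal update for a treatment effect.
# Prior: tau ~ N(mu0, s0^2); likelihood: tau_hat ~ N(tau, se^2).
conjugate_update <- function(mu0, s0, tau_hat, se) {
  prior_prec <- 1 / s0^2        # precision = 1 / variance
  data_prec  <- 1 / se^2
  post_var   <- 1 / (prior_prec + data_prec)
  post_mean  <- post_var * (prior_prec * mu0 + data_prec * tau_hat)
  list(mean = post_mean, sd = sqrt(post_var),
       p_positive = 1 - pnorm(0, post_mean, sqrt(post_var)))
}

# Illustrative numbers (assumed, not from the app):
# prior N(1500, 800^2), regression estimate 900 with SE 600
post <- conjugate_update(mu0 = 1500, s0 = 800, tau_hat = 900, se = 600)
round(post$mean)  # -> 1116: precision-weighted average of 1500 and 900
```

Because the data is the more precise source here (SE 600 vs prior SD 800), the posterior mean lands closer to $900 than to $1,500, and the posterior SD (480) is smaller than either input: precisions add.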
Things to try
- Small sample (n=30), high confounding: the naive estimate is way off. Regression adjustment helps but is noisy. The Bayesian estimate, anchored by the prior, is more stable.
- Large sample (n=500): the prior barely matters — the data dominates. All three estimates converge. This is the Bayesian guarantee: with enough data, the prior washes out.
- Wrong prior (set prior mean = -500, true effect = 1200): watch how the posterior gets pulled toward the wrong prior at small \(n\) but corrects at large \(n\). This is the risk of informative priors.
- Set confounding = 0: naive and regression estimates agree (no selection bias). The Bayesian estimate is a compromise between data and prior.
Prior sensitivity
The most common objection to Bayesian causal inference: “your results depend on the prior.” This is true — and it’s a feature, not a bug, when used honestly. The key question is: how much do the results depend on the prior?
A robust result is one where the conclusion holds across a range of reasonable priors. If flipping the prior from optimistic to skeptical changes your policy recommendation, that’s important to know.
#| standalone: true
#| viewerHeight: 600
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.info-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.8;
}
.info-box b { color: #2c3e50; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("n_sens", "Sample size:",
min = 30, max = 500, value = 80, step = 10),
tags$hr(),
tags$h5("Data (fixed once drawn)"),
sliderInput("true_effect", "True effect ($):",
min = 0, max = 3000, value = 1200, step = 100),
actionButton("draw", "Draw new data", class = "btn-primary", width = "100%"),
tags$hr(),
tags$h5("Vary the prior"),
sliderInput("pm", "Prior mean ($):",
min = -2000, max = 5000, value = 1500, step = 100),
sliderInput("ps", "Prior SD ($):",
min = 100, max = 5000, value = 800, step = 100),
uiOutput("sens_info")
),
mainPanel(
width = 9,
fluidRow(
column(6, plotOutput("sens_posterior", height = "420px")),
column(6, plotOutput("sens_ppos", height = "420px"))
)
)
)
)
server <- function(input, output, session) {
# Fix the data on button press
fixed_data <- reactiveVal(NULL)
observeEvent(input$draw, {
n <- input$n_sens
ate <- input$true_effect
age <- rnorm(n, 35, 8)
pe <- rnorm(n, 20000, 5000)
lat <- 0.5 * (pe - 20000) / 5000 + rnorm(n)
treat <- as.integer(lat > 0)
earnings <- 15000 + 200 * (age - 35) + 0.3 * (pe - 20000) +
ate * treat + rnorm(n, sd = 3000)
fit <- lm(earnings ~ treat + age + pe)
freq_est <- coef(fit)["treat"]
freq_se <- summary(fit)$coefficients["treat", "Std. Error"]
fixed_data(list(freq_est = freq_est, freq_se = freq_se, ate = ate, n = n))
}, ignoreNULL = FALSE)
# Recompute posterior whenever prior sliders change
post <- reactive({
fd <- fixed_data()
req(fd)
pm <- input$pm
ps <- input$ps
prior_prec <- 1 / ps^2
data_prec <- 1 / fd$freq_se^2
post_prec <- prior_prec + data_prec
post_var <- 1 / post_prec
post_mean <- post_var * (prior_prec * pm + data_prec * fd$freq_est)
post_sd <- sqrt(post_var)
p_pos <- 1 - pnorm(0, post_mean, post_sd)
list(post_mean = post_mean, post_sd = post_sd, p_pos = p_pos,
pm = pm, ps = ps,
freq_est = fd$freq_est, freq_se = fd$freq_se, ate = fd$ate)
})
output$sens_posterior <- renderPlot({
d <- post()
par(mar = c(5, 4.5, 3, 1))
all_means <- c(d$pm, d$freq_est, d$post_mean)
all_sds <- c(d$ps, d$freq_se, d$post_sd)
xlo <- min(all_means - 3.5 * all_sds, -500)
xhi <- max(all_means + 3.5 * all_sds, 3000)
x <- seq(xlo, xhi, length.out = 500)
d_prior <- dnorm(x, d$pm, d$ps)
d_like <- dnorm(x, d$freq_est, d$freq_se)
d_post <- dnorm(x, d$post_mean, d$post_sd)
ymax <- max(c(d_prior, d_like, d_post)) * 1.15
plot(x, d_like, type = "l", col = "#3498db", lwd = 2.5,
xlab = "Treatment effect ($)", ylab = "Density",
main = "Drag the prior, watch the posterior",
ylim = c(0, ymax))
lines(x, d_prior, col = "#e67e22", lwd = 2.5, lty = 2)
lines(x, d_post, col = "#27ae60", lwd = 3)
polygon(c(x, rev(x)), c(d_post, rep(0, length(x))),
col = "#27ae6025", border = NA)
abline(v = d$ate, col = "gray40", lty = 2, lwd = 1.5)
abline(v = 0, col = "gray80", lty = 3)
legend("topright", bty = "n", cex = 0.85,
legend = c("Prior", "Likelihood (data)", "Posterior", "Truth"),
col = c("#e67e22", "#3498db", "#27ae60", "gray40"),
lwd = c(2.5, 2.5, 3, 1.5), lty = c(2, 1, 1, 2))
})
output$sens_ppos <- renderPlot({
fd <- fixed_data()
req(fd)
par(mar = c(5, 4.5, 3, 1))
# Compute P(tau > 0) for a grid of prior means
prior_means <- seq(-2000, 5000, by = 100)
ps <- input$ps
prior_prec <- 1 / ps^2
data_prec <- 1 / fd$freq_se^2
p_pos_grid <- sapply(prior_means, function(pm) {
post_prec <- prior_prec + data_prec
post_var <- 1 / post_prec
post_mean <- post_var * (prior_prec * pm + data_prec * fd$freq_est)
post_sd <- sqrt(post_var)
1 - pnorm(0, post_mean, post_sd)
})
plot(prior_means, p_pos_grid, type = "l", col = "#9b59b6", lwd = 2.5,
xlab = "Prior mean ($)", ylab = "P(effect > 0 | data)",
main = "Sensitivity: how prior changes the conclusion",
ylim = c(0, 1))
abline(h = 0.5, col = "gray70", lty = 3)
abline(h = 0.95, col = "#27ae60", lty = 3)
abline(v = input$pm, col = "#e67e22", lty = 2, lwd = 2)
points(input$pm, post()$p_pos, pch = 16, col = "#e67e22", cex = 2)
text(max(prior_means) * 0.7, 0.97, "95% threshold", col = "#27ae60", cex = 0.8)
text(max(prior_means) * 0.7, 0.52, "50% threshold", col = "gray50", cex = 0.8)
})
output$sens_info <- renderUI({
d <- post()
robust_msg <- if (d$p_pos > 0.95) {
"<span style='color:#27ae60'>Strong evidence under this prior (P > 95%)</span>"
} else if (d$p_pos > 0.5) {
"<span style='color:#e67e22'>Likely positive, but sensitive to prior</span>"
} else {
"<span style='color:#e74c3c'>Inconclusive or negative</span>"
}
prior_wt <- round(100 * (1/d$ps^2) / (1/d$ps^2 + 1/d$freq_se^2), 1)
tags$div(class = "info-box", HTML(paste0(
"<b>Posterior mean:</b> $", round(d$post_mean), "<br>",
"<b>P(effect > 0):</b> ", round(d$p_pos, 3), "<br>",
"<b>Prior weight:</b> ", prior_wt, "%<br>",
"<b>Data weight:</b> ", round(100 - prior_wt, 1), "%<br>",
"<hr style='margin:8px 0'>",
robust_msg
)))
})
}
shinyApp(ui, server)
Reading the sensitivity plot
The right panel shows \(P(\tau > 0 \mid \text{data})\) as a function of the prior mean, holding prior SD fixed. This is a sensitivity analysis:
- If the curve stays above 0.95 across all reasonable prior means, your conclusion is robust — even a skeptic would agree.
- If the curve crosses 0.5, a skeptical prior would flip your conclusion. That’s important to report.
- Tight priors (small SD) make the curve steeper — the prior matters more. Vague priors (large SD) flatten the curve — the data dominates.
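The last point — tight priors steepen the curve, vague priors flatten it — can be checked directly. The regression estimate ($900 with SE $600) below is an assumed stand-in for a single draw from the app, not an actual result:

```r
# P(tau > 0 | data) as a function of the prior mean, for two prior SDs.
# tau_hat / se are assumed stand-ins for one regression fit.
p_positive <- function(mu0, s0, tau_hat = 900, se = 600) {
  post_var  <- 1 / (1 / s0^2 + 1 / se^2)
  post_mean <- post_var * (mu0 / s0^2 + tau_hat / se^2)
  1 - pnorm(0, post_mean, sqrt(post_var))
}

prior_means <- seq(-2000, 5000, by = 100)
tight <- sapply(prior_means, p_positive, s0 = 300)   # confident prior
vague <- sapply(prior_means, p_positive, s0 = 3000)  # weak prior

print(range(tight))  # spans nearly the full [0, 1] range: steep curve
print(range(vague))  # stays near the data-only answer (0.93): flat curve
```

With the tight prior, dragging the prior mean from -$2,000 to $5,000 swings the conclusion from "almost surely negative" to "almost surely positive"; with the vague prior, the same drag barely moves it.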
Things to try
- Draw data, then drag prior mean from -2000 to +5000: watch the posterior (green) slide between prior and data. With large \(n\), it barely moves.
- Set prior SD = 100 (very confident prior): the posterior sticks near the prior mean regardless of data. This is dogmatic — bad practice.
- Set prior SD = 5000 (very vague): the posterior collapses onto the data. The prior contributes nothing — you’ve recovered the frequentist estimate.
- Small n (30) vs large n (500): at \(n = 30\) the prior matters a lot. At \(n = 500\) even a strong prior gets overwhelmed.
When does Bayesian help most?
#| standalone: true
#| viewerHeight: 600
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.info-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.8;
}
.info-box b { color: #2c3e50; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("true_eff3", "True effect ($):",
min = 0, max = 3000, value = 1200, step = 100),
sliderInput("prior_mean3", "Prior mean ($):",
min = 0, max = 3000, value = 1500, step = 100),
sliderInput("prior_sd3", "Prior SD ($):",
min = 200, max = 2000, value = 800, step = 100),
sliderInput("n_sims", "Simulations:",
min = 100, max = 1000, value = 500, step = 100),
actionButton("run", "Run simulation", class = "btn-primary", width = "100%"),
uiOutput("mc_info")
),
mainPanel(
width = 9,
fluidRow(
column(6, plotOutput("mse_plot", height = "420px")),
column(6, plotOutput("coverage_plot", height = "420px"))
)
)
)
)
server <- function(input, output, session) {
results <- reactiveVal(NULL)
observeEvent(input$run, {
ate <- input$true_eff3
pm <- input$prior_mean3
ps <- input$prior_sd3
B <- input$n_sims
sample_sizes <- c(30, 50, 80, 120, 200, 350, 500)
out <- data.frame()
for (n in sample_sizes) {
freq_ests <- numeric(B)
bayes_ests <- numeric(B)
freq_covers <- logical(B)
bayes_covers <- logical(B)
for (b in 1:B) {
age <- rnorm(n, 35, 8)
pe <- rnorm(n, 20000, 5000)
lat <- 0.5 * (pe - 20000) / 5000 + rnorm(n)
treat <- as.integer(lat > 0)
earnings <- 15000 + 200 * (age - 35) + 0.3 * (pe - 20000) +
ate * treat + rnorm(n, sd = 3000)
fit <- lm(earnings ~ treat + age + pe)
fe <- coef(fit)["treat"]
fse <- summary(fit)$coefficients["treat", "Std. Error"]
prior_prec <- 1 / ps^2
data_prec <- 1 / fse^2
post_prec <- prior_prec + data_prec
post_var <- 1 / post_prec
post_mean <- post_var * (prior_prec * pm + data_prec * fe)
post_sd <- sqrt(post_var)
freq_ests[b] <- fe
bayes_ests[b] <- post_mean
freq_covers[b] <- (fe - 1.96 * fse <= ate) & (fe + 1.96 * fse >= ate)
bayes_covers[b] <- (post_mean - 1.96 * post_sd <= ate) &
(post_mean + 1.96 * post_sd >= ate)
}
out <- rbind(out, data.frame(
n = n,
mse_freq = mean((freq_ests - ate)^2),
mse_bayes = mean((bayes_ests - ate)^2),
cov_freq = mean(freq_covers),
cov_bayes = mean(bayes_covers)
))
}
results(out)
})
output$mse_plot <- renderPlot({
res <- results()
req(res)
par(mar = c(5, 5, 3, 1))
ylim <- range(c(res$mse_freq, res$mse_bayes))
plot(res$n, res$mse_freq / 1e6, type = "b", pch = 16, col = "#3498db",
lwd = 2.5, cex = 1.3,
xlab = "Sample size", ylab = "MSE (millions $\u00b2)",
main = "Mean squared error",
ylim = ylim / 1e6)
lines(res$n, res$mse_bayes / 1e6, type = "b", pch = 17, col = "#27ae60",
lwd = 2.5, cex = 1.3)
legend("topright", bty = "n", cex = 0.95,
legend = c("Frequentist", "Bayesian"),
col = c("#3498db", "#27ae60"), pch = c(16, 17), lwd = 2.5)
})
output$coverage_plot <- renderPlot({
res <- results()
req(res)
par(mar = c(5, 5, 3, 1))
plot(res$n, res$cov_freq, type = "b", pch = 16, col = "#3498db",
lwd = 2.5, cex = 1.3,
xlab = "Sample size", ylab = "Coverage (95% interval)",
main = "Coverage probability",
ylim = c(0.7, 1))
lines(res$n, res$cov_bayes, type = "b", pch = 17, col = "#27ae60",
lwd = 2.5, cex = 1.3)
abline(h = 0.95, col = "gray50", lty = 2, lwd = 1.5)
legend("bottomright", bty = "n", cex = 0.95,
legend = c("Frequentist", "Bayesian", "Nominal 95%"),
col = c("#3498db", "#27ae60", "gray50"),
pch = c(16, 17, NA), lty = c(1, 1, 2), lwd = 2.5)
})
output$mc_info <- renderUI({
res <- results()
req(res)
small <- res[res$n == min(res$n), ]
big <- res[res$n == max(res$n), ]
tags$div(class = "info-box", HTML(paste0(
"<b>At n=", small$n, ":</b><br>",
"Freq MSE: ", round(small$mse_freq / 1e6, 2), "M<br>",
"Bayes MSE: ", round(small$mse_bayes / 1e6, 2), "M<br>",
"<hr style='margin:6px 0'>",
"<b>At n=", big$n, ":</b><br>",
"Freq MSE: ", round(big$mse_freq / 1e6, 2), "M<br>",
"Bayes MSE: ", round(big$mse_bayes / 1e6, 2), "M<br>",
"<hr style='margin:6px 0'>",
"<small>Bayesian advantage is largest at small n, ",
"vanishes at large n.</small>"
)))
})
}
shinyApp(ui, server)
What the Monte Carlo shows
- Left panel (MSE): at small \(n\), the Bayesian estimator has lower MSE because the prior shrinks noisy estimates toward a reasonable value. At large \(n\), both converge — the data overwhelms the prior.
- Right panel (coverage): both methods achieve roughly 95% coverage at large \(n\). At small \(n\), the Bayesian interval is actually narrower (precisions add), yet it can cover as well or better when the prior is well centered, because shrinkage pulls the interval's center toward the truth. A badly centered prior costs coverage.
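A stripped-down version of the simulation behind the left panel, at a single small sample size. The DGP here is simplified to one standardized confounder (the app uses age and prior earnings), and the prior N(1500, 800²) near the truth of $1,200 is the "good prior" case:

```r
# Monte Carlo: frequentist vs Bayesian MSE at small n, with a good prior.
# Simplified one-confounder DGP (assumption; the app uses two covariates).
set.seed(42)
B <- 2000; n <- 30; ate <- 1200
pm <- 1500; ps <- 800  # informative prior, centered near the truth

freq_ests <- bayes_ests <- numeric(B)
for (b in 1:B) {
  x  <- rnorm(n)                            # standardized confounder
  tr <- as.integer(0.5 * x + rnorm(n) > 0)  # selection depends on x
  y  <- 1500 * x + ate * tr + rnorm(n, sd = 3000)
  fit <- lm(y ~ tr + x)
  fe  <- coef(fit)["tr"]
  fse <- summary(fit)$coefficients["tr", "Std. Error"]
  w   <- (1 / fse^2) / (1 / ps^2 + 1 / fse^2)  # data weight
  freq_ests[b]  <- fe
  bayes_ests[b] <- w * fe + (1 - w) * pm       # posterior mean (shrinkage)
}
mse <- c(freq  = mean((freq_ests - ate)^2),
         bayes = mean((bayes_ests - ate)^2))
mse["bayes"] < mse["freq"]  # TRUE: shrinkage toward a good prior lowers MSE
```

The Bayesian estimator trades a small bias (the prior mean is $300 off) for a large variance reduction; at \(n = 30\) that trade is decisively worth it, which is exactly what the MSE panel shows.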
Things to try
- Good prior (prior mean near truth): Bayesian MSE is much lower at small \(n\). Free lunch from prior knowledge.
- Bad prior (prior mean = 0, truth = 2000): Bayesian MSE is higher at small \(n\) — the prior pulls you away from truth. But at large \(n\), the data corrects it.
- Very tight prior (SD = 200): amplifies both the benefit (when right) and the cost (when wrong). Very vague prior (SD = 2000): negligible difference from frequentist.
The bottom line
| | Frequentist | Bayesian |
|---|---|---|
| What you get | Point estimate + CI | Full posterior distribution |
| Prior knowledge | Not formally used | Incorporated via prior |
| “Does the program work?” | “We reject \(H_0: \tau = 0\) at 5%” | “\(P(\tau > 0 \mid \text{data}) = 0.97\)” |
| Small samples | Noisy, wide CIs | Stabilized by prior |
| Large samples | Gold standard | Converges to frequentist |
| Risk | None from prior | Bad prior = bad estimate (at small \(n\)) |
In practice, the Bayesian approach is most valuable when:
- You have prior evidence — meta-analyses, pilot studies, or domain expertise that you want to formally incorporate
- Your sample is small — the prior provides regularization, reducing MSE
- You need probabilistic statements — “90% probability the effect exceeds $500” is more useful for policy than “we fail to reject at 5%”
- Sensitivity analysis matters — showing results are robust across priors is more convincing than a single p-value
When your sample is large and you have no strong prior information, the Bayesian and frequentist answers will be nearly identical. Use whichever your audience expects.
Did you know?
- The LaLonde (1986) critique showed that non-experimental estimates of job training effects could differ wildly from experimental benchmarks. This launched the credibility revolution in economics — and is why we care so much about identification. A Bayesian prior doesn’t fix bad identification; it just stabilizes the estimate conditional on your assumptions being right.
- The Bayesian approach to clinical trials is gaining ground at the FDA. Berry (2006) argued that Bayesian adaptive designs let you stop trials earlier (saving money and lives) by continuously updating the probability that a drug works. The same logic applies to program evaluation.
- Rubin (1978) — the father of the potential outcomes framework — was himself a Bayesian. His framework for causal inference was motivated by Bayesian thinking about missing data: the untreated potential outcome for a treated unit is “missing,” and we can reason about it using posterior predictive distributions.