Application: Returns to Education
Putting it all together: returns to education
You’ve learned the pieces — Bayes’ theorem, priors and posteriors, shrinkage, MCMC, Bayesian regression. Now let’s use them on a real question.
What’s the causal return to an extra year of education on wages?
This is one of the most studied questions in labor economics. The standard approach uses the Mincer equation:
\[ \log(\text{wage}) = \beta_0 + \beta_1 \cdot \text{educ} + \beta_2 \cdot \text{exper} + \beta_3 \cdot \text{exper}^2 + \varepsilon \]
We’ll simulate data from this model and walk through the full Bayesian workflow: estimate, interpret, check sensitivity, and compare to OLS.
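Before the app, here is a minimal, self-contained sketch of this data-generating process. The coefficient values are illustrative defaults (they match what the app simulates), not estimates from real data:

```r
set.seed(1)                                   # reproducible draw
n     <- 200
educ  <- pmax(round(rnorm(n, 13, 2.5)), 8)    # years of schooling, floored at 8
exper <- round(pmax(rnorm(n, 15, 8), 0))      # years of experience, non-negative
beta_educ <- 0.10                             # true 10% return per year
log_wage  <- 1.0 + beta_educ * educ + 0.03 * exper - 0.0005 * exper^2 +
  rnorm(n, 0, 0.4)                            # residual SD of 0.4 in log wages

# OLS recovers beta_educ up to sampling noise
fit <- lm(log_wage ~ educ + exper + I(exper^2))
coef(fit)["educ"]
```

With n = 200 and this much residual noise, the OLS estimate typically lands within a couple of hundredths of 0.10 — which is exactly the sampling variability the app lets you explore.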
Simulation: Full Bayesian workflow
Walk through the whole workflow. The left panel shows the simulated data with the OLS, Bayesian, and true regression lines. The right panel shows the prior, the likelihood (OLS), and the posterior for the education coefficient. The bottom panel shows how the posterior shifts as you change the prior strength.
#| standalone: true
#| viewerHeight: 680
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 13px; line-height: 1.8;
}
.stats-box b { color: #2c3e50; }
.good { color: #27ae60; font-weight: bold; }
.bad { color: #e74c3c; font-weight: bold; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
tags$h5("Data"),
sliderInput("n", "Sample size (n):",
min = 30, max = 1000, value = 200, step = 10),
sliderInput("true_beta", HTML("True β<sub>educ</sub>:"),
min = 0.01, max = 0.20, value = 0.10, step = 0.01),
tags$hr(),
tags$h5("Prior"),
sliderInput("prior_mean", HTML("Prior mean for β<sub>educ</sub>:"),
min = 0, max = 0.20, value = 0.10, step = 0.01),
sliderInput("prior_sd", HTML("Prior SD for β<sub>educ</sub>:"),
min = 0.005, max = 0.10, value = 0.03, step = 0.005),
actionButton("go", "New data", class = "btn-primary", width = "100%"),
uiOutput("results")
),
mainPanel(
width = 9,
fluidRow(
column(6, plotOutput("scatter_plot", height = "300px")),
column(6, plotOutput("posterior_plot", height = "300px"))
),
fluidRow(
column(12, plotOutput("sensitivity_plot", height = "280px"))
)
)
)
)
server <- function(input, output, session) {
dat <- reactive({
input$go
n <- input$n
true_beta <- input$true_beta
prior_mean <- input$prior_mean
prior_sd <- input$prior_sd
# Simulate Mincer equation
educ <- round(rnorm(n, 13, 2.5)) # years of education ~13 mean
educ <- pmax(educ, 8) # floor at 8 years
exper <- round(pmax(rnorm(n, 15, 8), 0)) # years of experience
sigma <- 0.4 # residual SD in log wages
log_wage <- 1.0 + true_beta * educ + 0.03 * exper - 0.0005 * exper^2 +
rnorm(n, 0, sigma)
# OLS: regress log_wage on educ, exper, exper^2
X <- cbind(1, educ, exper, exper^2)
XtX_inv <- solve(t(X) %*% X)
beta_ols <- as.numeric(XtX_inv %*% t(X) %*% log_wage)
resid <- log_wage - X %*% beta_ols
s2 <- sum(resid^2) / (n - 4)
se_ols <- sqrt(s2 * diag(XtX_inv))
# Extract education coefficient (index 2)
b_ols <- beta_ols[2]
se_b <- se_ols[2]
# Bayesian posterior (conjugate, focusing on educ coeff)
prior_prec <- 1 / prior_sd^2
data_prec <- 1 / se_b^2
post_prec <- prior_prec + data_prec
post_sd <- 1 / sqrt(post_prec)
post_mean <- (prior_mean * prior_prec + b_ols * data_prec) / post_prec
w_prior <- prior_prec / post_prec
w_data <- data_prec / post_prec
# CrI
cri_lo <- qnorm(0.025, post_mean, post_sd)
cri_hi <- qnorm(0.975, post_mean, post_sd)
# CI
ci_lo <- b_ols - 1.96 * se_b
ci_hi <- b_ols + 1.96 * se_b
# P(beta > 0 | data)
p_positive <- 1 - pnorm(0, post_mean, post_sd)
# P(beta > 0.05 | data)
p_above_05 <- 1 - pnorm(0.05, post_mean, post_sd)
# Sensitivity: how posterior mean changes with prior SD
prior_sd_grid <- seq(0.005, 0.15, length.out = 100)
sens_means <- numeric(length(prior_sd_grid))
sens_lo <- numeric(length(prior_sd_grid))
sens_hi <- numeric(length(prior_sd_grid))
for (i in seq_along(prior_sd_grid)) {
pp <- 1 / prior_sd_grid[i]^2
posp <- pp + data_prec
pm <- (prior_mean * pp + b_ols * data_prec) / posp
ps <- 1 / sqrt(posp)
sens_means[i] <- pm
sens_lo[i] <- qnorm(0.025, pm, ps)
sens_hi[i] <- qnorm(0.975, pm, ps)
}
list(educ = educ, log_wage = log_wage, exper = exper,
beta_ols_all = beta_ols, se_ols_all = se_ols,
b_ols = b_ols, se_b = se_b, ci_lo = ci_lo, ci_hi = ci_hi,
prior_mean = prior_mean, prior_sd = prior_sd,
post_mean = post_mean, post_sd = post_sd,
cri_lo = cri_lo, cri_hi = cri_hi,
w_prior = w_prior, w_data = w_data,
p_positive = p_positive, p_above_05 = p_above_05,
true_beta = true_beta, sigma = sigma,
prior_sd_grid = prior_sd_grid, sens_means = sens_means,
sens_lo = sens_lo, sens_hi = sens_hi)
})
# --- Left: scatterplot with OLS and Bayesian regression lines ---
output$scatter_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
plot(d$educ, d$log_wage, pch = 16, col = "#2c3e5040", cex = 0.9,
xlab = "Years of education", ylab = "log(wage)",
main = "Data + Regression Lines")
# Predict at education values, holding experience at mean
educ_seq <- seq(min(d$educ), max(d$educ), length.out = 100)
mean_exp <- mean(d$exper)
# OLS fitted line (red)
ols_pred <- d$beta_ols_all[1] + d$beta_ols_all[2] * educ_seq +
d$beta_ols_all[3] * mean_exp + d$beta_ols_all[4] * mean_exp^2
lines(educ_seq, ols_pred, col = "#e74c3c", lwd = 2.5)
# Bayesian fitted line (blue) — same intercept/exper coeffs, Bayesian educ slope
bayes_pred <- d$beta_ols_all[1] + d$post_mean * educ_seq +
d$beta_ols_all[3] * mean_exp + d$beta_ols_all[4] * mean_exp^2
lines(educ_seq, bayes_pred, col = "#3498db", lwd = 2.5)
# True line (green dashed)
true_pred <- 1.0 + d$true_beta * educ_seq +
0.03 * mean_exp - 0.0005 * mean_exp^2
lines(educ_seq, true_pred, col = "#27ae60", lwd = 2, lty = 2)
legend("topleft", bty = "n", cex = 0.8,
legend = c(
paste0("OLS: ", round(d$b_ols, 4)),
paste0("Bayesian: ", round(d$post_mean, 4)),
paste0("True: ", d$true_beta)
),
col = c("#e74c3c", "#3498db", "#27ae60"),
lwd = c(2.5, 2.5, 2), lty = c(1, 1, 2))
})
# --- Right: posterior density with OLS comparison ---
output$posterior_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
# Range for plotting
all_means <- c(d$prior_mean, d$b_ols, d$post_mean)
all_sds <- c(d$prior_sd, d$se_b, d$post_sd)
xlim <- range(c(all_means - 3.5 * all_sds, all_means + 3.5 * all_sds))
x_seq <- seq(xlim[1], xlim[2], length.out = 400)
prior_y <- dnorm(x_seq, d$prior_mean, d$prior_sd)
lik_y <- dnorm(x_seq, d$b_ols, d$se_b)
post_y <- dnorm(x_seq, d$post_mean, d$post_sd)
ylim <- c(0, max(c(prior_y, lik_y, post_y)) * 1.25)
plot(NULL, xlim = xlim, ylim = ylim,
xlab = expression(beta[educ]),
ylab = "Density",
main = expression("OLS vs Bayesian: " * beta[educ]))
# Prior (red dashed)
lines(x_seq, prior_y, col = "#e74c3c", lwd = 2, lty = 2)
# Likelihood / OLS (gray dotted)
lines(x_seq, lik_y, col = "gray50", lwd = 2, lty = 3)
# Posterior (blue filled)
polygon(c(x_seq, rev(x_seq)), c(post_y, rep(0, length(x_seq))),
col = "#3498db30", border = NA)
lines(x_seq, post_y, col = "#3498db", lwd = 2.5)
# True value
abline(v = d$true_beta, lty = 2, col = "#27ae60", lwd = 1.5)
# OLS CI bracket (red)
y_ci <- max(post_y) * 0.05
arrows(d$ci_lo, y_ci, d$ci_hi, y_ci,
code = 3, angle = 90, length = 0.04, lwd = 2, col = "#e74c3c")
text(d$b_ols, y_ci + max(post_y) * 0.07, "95% CI", cex = 0.7,
col = "#e74c3c", font = 2)
# Bayesian CrI bracket (blue)
y_cri <- max(post_y) * 0.15
arrows(d$cri_lo, y_cri, d$cri_hi, y_cri,
code = 3, angle = 90, length = 0.04, lwd = 2, col = "#3498db")
text(d$post_mean, y_cri + max(post_y) * 0.07, "95% CrI", cex = 0.7,
col = "#3498db", font = 2)
legend("topright", bty = "n", cex = 0.75,
legend = c("Prior", "OLS (likelihood)",
"Posterior", expression("True " * beta[educ])),
col = c("#e74c3c", "gray50", "#3498db", "#27ae60"),
lwd = c(2, 2, 2.5, 1.5),
lty = c(2, 3, 1, 2))
})
# --- Bottom: sensitivity to prior strength ---
output$sensitivity_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 2.5, 1))
ylim <- range(c(d$sens_lo, d$sens_hi, d$b_ols, d$true_beta))
plot(d$prior_sd_grid, d$sens_means, type = "l",
lwd = 2.5, col = "#3498db",
xlab = expression("Prior SD for " * beta[educ]),
ylab = expression("Posterior mean of " * beta[educ]),
main = "Sensitivity: How Much Does the Prior Matter?",
ylim = ylim)
# CrI band
polygon(c(d$prior_sd_grid, rev(d$prior_sd_grid)),
c(d$sens_lo, rev(d$sens_hi)),
col = "#3498db15", border = NA)
lines(d$prior_sd_grid, d$sens_lo, col = "#3498db50", lty = 2)
lines(d$prior_sd_grid, d$sens_hi, col = "#3498db50", lty = 2)
# OLS reference
abline(h = d$b_ols, col = "#e74c3c", lwd = 1.5, lty = 3)
# True value
abline(h = d$true_beta, col = "#27ae60", lwd = 1.5, lty = 2)
# Prior mean reference
abline(h = d$prior_mean, col = "#f39c12", lwd = 1.5, lty = 3)
# Current prior SD
abline(v = d$prior_sd, col = "#2c3e50", lwd = 1, lty = 3)
points(d$prior_sd, d$post_mean, pch = 16, col = "#2c3e50", cex = 1.5)
legend("bottomright", bty = "n", cex = 0.75,
legend = c("Posterior mean", "95% CrI",
"OLS estimate", expression("True " * beta),
"Prior mean", "You are here"),
col = c("#3498db", "#3498db50", "#e74c3c",
"#27ae60", "#f39c12", "#2c3e50"),
lwd = c(2.5, 1.5, 1.5, 1.5, 1.5, NA),
lty = c(1, 2, 3, 2, 3, NA),
pch = c(NA, NA, NA, NA, NA, 16))
})
output$results <- renderUI({
d <- dat()
tags$div(class = "stats-box",
HTML(paste0(
"<b>OLS:</b> ", round(d$b_ols, 4),
" [", round(d$ci_lo, 4), ", ", round(d$ci_hi, 4), "]<br>",
"<b>Posterior:</b> ", round(d$post_mean, 4),
" [", round(d$cri_lo, 4), ", ", round(d$cri_hi, 4), "]<br>",
"<b>True:</b> ", d$true_beta, "<br>",
"<hr style='margin:8px 0'>",
"<b>Prior weight:</b> ", round(d$w_prior * 100, 1), "%<br>",
"<b>Data weight:</b> ", round(d$w_data * 100, 1), "%<br>",
"<hr style='margin:8px 0'>",
"<b>P(β > 0 | data):</b> ",
round(d$p_positive, 4), "<br>",
"<b>P(β > 0.05 | data):</b> ",
round(d$p_above_05, 4)
))
)
})
}
shinyApp(ui, server)
Things to try
- Strong correct prior (prior mean = 0.10, prior SD = 0.02, true β = 0.10): the CrI is tighter than the OLS CI. Incorporating good prior information genuinely improves precision. This is the Bayesian advantage.
- Wrong prior + small n (prior mean = 0.02, true β = 0.10, n = 30): the posterior is pulled toward 0.02 — the prior biases the estimate. Now increase n to 500 — the data overwhelm the wrong prior.
- Large n (n = 1000): the posterior and OLS converge regardless of the prior. The bottom panel shows the sensitivity curve flattening — the prior SD barely matters.
- P(β > 0 | data): this direct probability statement is impossible with frequentist methods. A 95% CI of [0.03, 0.12] doesn’t tell you the probability that β > 0. The posterior does.
- Bottom panel insight: as prior SD increases (the prior gets vaguer), the posterior mean converges to OLS. The x-axis is “how much you trust the prior” and the curve shows the bias-precision tradeoff.
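The updating rule behind all of these behaviors is the conjugate normal formula the app applies to the OLS estimate: precisions (inverse variances) add, and the posterior mean is a precision-weighted average of the prior mean and the OLS estimate.

\[
\frac{1}{\sigma_{\text{post}}^2} = \frac{1}{\sigma_{\text{prior}}^2} + \frac{1}{\operatorname{se}(\hat\beta_{\text{OLS}})^2},
\qquad
\mu_{\text{post}} = \frac{\mu_{\text{prior}}/\sigma_{\text{prior}}^2 + \hat\beta_{\text{OLS}}/\operatorname{se}(\hat\beta_{\text{OLS}})^2}{1/\sigma_{\text{prior}}^2 + 1/\operatorname{se}(\hat\beta_{\text{OLS}})^2}
\]

As the prior SD grows, the prior's precision term vanishes and \(\mu_{\text{post}} \to \hat\beta_{\text{OLS}}\); as \(n\) grows, \(\operatorname{se}(\hat\beta_{\text{OLS}})\) shrinks and the data term dominates instead — which is why the sensitivity curve flattens at large n.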
The Stata workflow
Here’s the complete analysis in Stata, mirroring each step above:
* ----- Step 1: Look at the data -----
summarize wage education experience
scatter wage education
* ----- Step 2: OLS -----
gen log_wage = log(wage)
gen exper2 = experience * experience
reg log_wage education experience exper2
* ----- Step 3: Bayesian with default (vague) priors -----
bayes: reg log_wage education experience exper2
* ----- Step 4: Bayesian with informative prior -----
* Literature says the return is around 10%; encode that as a prior
bayes, prior({log_wage:education}, normal(0.10, 0.01)) : ///
reg log_wage education experience exper2
* Note: Stata's normal() prior takes a VARIANCE, not an SD.
* normal(0.10, 0.01) means prior variance 0.01, i.e. prior SD = 0.1 —
* centered at 10% with moderate spread. For a tighter prior SD of
* 0.03, you would write normal(0.10, 0.0009).
* ----- Step 5: Check convergence -----
bayesgraph diagnostics {log_wage:education}
* Look for: a stationary trace, low autocorrelation, adequate ESS
* ----- Step 6: Compare models -----
bayesstats ic
* Compare DIC and log marginal likelihoods to the default-prior model
What did we learn?
With a reasonable prior, Bayesian and OLS agree — especially with large samples. The prior adds almost no bias but can reduce uncertainty.
The prior adds value when samples are small or when you have genuine outside information (literature, expert knowledge, previous studies).
The posterior gives direct probability statements that OLS can’t: \(P(\beta > 0.05 \mid \text{data})\) answers the question “how confident should I be that the return exceeds 5%?” directly.
Sensitivity analysis is easy and important. The sensitivity panel shows exactly how your conclusions depend on the prior. If the answer doesn’t change much across reasonable priors, your results are robust.
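Here is the same conjugate calculation outside the app, with assumed illustrative numbers — an OLS estimate of 0.085 with standard error 0.02, combined with a N(0.10, 0.03²) prior:

```r
# Illustrative inputs (assumed, not from real data)
prior_mean <- 0.10; prior_sd <- 0.03   # N(0.10, 0.03^2) prior
b_ols      <- 0.085; se_b    <- 0.02   # OLS estimate and standard error

prior_prec <- 1 / prior_sd^2           # precision = 1 / variance
data_prec  <- 1 / se_b^2
post_prec  <- prior_prec + data_prec   # precisions add under conjugacy
post_mean  <- (prior_mean * prior_prec + b_ols * data_prec) / post_prec
post_sd    <- 1 / sqrt(post_prec)      # tighter than either input alone

# Direct probability statements from the normal posterior
p_positive <- 1 - pnorm(0,    post_mean, post_sd)  # P(beta > 0    | data)
p_above_05 <- 1 - pnorm(0.05, post_mean, post_sd)  # P(beta > 0.05 | data)
round(c(post_mean = post_mean, post_sd = post_sd, p_above_05 = p_above_05), 4)
```

Because the posterior is normal here, any probability statement about β is a single `pnorm` call — there is no frequentist analogue of this output.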
Where to go deeper:
- Bayes’ theorem — the updating formula behind everything
- Priors & Posteriors — mechanics of conjugate updating
- Shrinkage — why the prior pulls estimates and when it helps
- Bayesian Regression — the full OLS vs Bayesian comparison
- Model Comparison — Bayes factors for choosing between models
- Posterior Predictive Checks — does the model fit?
Did you know?
Mincer (1974) introduced the log-linear earnings equation in Schooling, Experience, and Earnings. The specification — log wages regressed on education and a quadratic in experience — became the workhorse of labor economics. Nearly every study of returns to education starts from this framework.
Card (1999, 2001) used instrumental variables (geographic proximity to colleges) to address the endogeneity of education, finding returns of about 10% per year. His work highlighted that OLS may underestimate the return for marginal students — those on the fence about attending college.
Bayesian approaches in development economics: researchers studying returns to education in developing countries often face small samples and noisy data. Informative priors based on findings from similar countries can stabilize estimates — a natural application of the framework you just practiced.