Bayesian Regression
From reg y x to Bayesian regression
On the MCMC page, you saw how to sample from any posterior — even when no closed-form solution exists. Now we apply that machinery to something familiar: regression.
In Stata, you type reg y x and get a point estimate \(\hat{\beta}\), a standard error, and a 95% confidence interval. That’s OLS — it picks the single line that minimizes the sum of squared residuals.
Bayesian regression starts from the same place (the same likelihood) but adds one ingredient: a prior distribution on \(\beta\). Instead of a single best-fit line, you get a full posterior distribution — a curve showing how plausible each value of \(\beta\) is, given both the data and your prior beliefs.
The math side-by-side:
|  | OLS | Bayesian |
|---|---|---|
| Model | \(y = X\beta + \varepsilon, \;\; \varepsilon \sim N(0, \sigma^2 I)\) | Same likelihood + prior \(\beta \sim N(\mu_0, \sigma_0^2)\) |
| Estimate | \(\hat{\beta}_{OLS} = (X'X)^{-1}X'y\) | Posterior mean \(= w \cdot \hat{\beta}_{OLS} + (1-w) \cdot \mu_0\) |
| Uncertainty | SE, confidence interval | Full posterior distribution, credible interval |
| Result | One number + interval | Entire distribution |
The posterior mean is a precision-weighted average of the prior mean and the OLS estimate, where \(w = \frac{\text{data precision}}{\text{data precision} + \text{prior precision}}\). The more data you have, the more the posterior looks like OLS. The stronger the prior (smaller \(\sigma_0^2\)), the more the posterior is pulled toward the prior mean.
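In code the update is only a few lines. A minimal R sketch for the same setup as the simulation below (one centered regressor, no intercept, known \(\sigma\)); the numbers are illustrative:

```r
# Conjugate normal-normal update for a single slope (known sigma, no intercept).
set.seed(1)
n          <- 20
beta_true  <- 1.5
sigma      <- 2          # known noise SD
prior_mean <- 0
prior_sd   <- 1

x <- rnorm(n)                                  # centered regressor
y <- beta_true * x + rnorm(n, 0, sigma)

beta_ols   <- sum(x * y) / sum(x^2)            # OLS slope (no intercept)
data_prec  <- sum(x^2) / sigma^2               # precision contributed by the data
prior_prec <- 1 / prior_sd^2                   # precision of the prior

w          <- data_prec / (data_prec + prior_prec)   # weight on the data
post_mean  <- w * beta_ols + (1 - w) * prior_mean    # precision-weighted average
post_sd    <- 1 / sqrt(data_prec + prior_prec)

c(ols = beta_ols, posterior = post_mean, weight_on_data = w)
```

Raising `n` (or widening `prior_sd`) pushes `w` toward 1 and the posterior mean toward the OLS estimate; shrinking `prior_sd` pulls it toward `prior_mean`.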
Simulation: OLS vs Bayesian — the role of the prior
#| standalone: true
#| viewerHeight: 620
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
.good { color: #27ae60; font-weight: bold; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("true_beta", HTML("True β:"),
min = -3, max = 3, value = 1.5, step = 0.5),
sliderInput("prior_mean", "Prior mean:",
min = -3, max = 3, value = 0, step = 0.5),
sliderInput("prior_sd", "Prior SD:",
min = 0.1, max = 5, value = 1, step = 0.1),
sliderInput("n", "Sample size (n):",
min = 5, max = 200, value = 20, step = 5),
sliderInput("sigma", HTML("Noise (σ):"),
min = 0.5, max = 5, value = 2, step = 0.5),
actionButton("go", "New data", class = "btn-primary", width = "100%"),
uiOutput("results")
),
mainPanel(
width = 9,
fluidRow(
column(6, plotOutput("scatter_plot", height = "420px")),
column(6, plotOutput("density_plot", height = "420px"))
)
)
)
)
server <- function(input, output, session) {
dat <- reactive({
input$go
true_beta <- input$true_beta
prior_mean <- input$prior_mean
prior_sd <- input$prior_sd
n <- input$n
sigma <- input$sigma
# Generate data: y = true_beta * x + noise (centered x, no intercept)
x <- rnorm(n, 0, 1)
y <- true_beta * x + rnorm(n, 0, sigma)
# OLS estimate
beta_ols <- sum(x * y) / sum(x^2)
se_ols <- sigma / sqrt(sum(x^2))
# Bayesian posterior (conjugate normal-normal, known sigma)
prior_prec <- 1 / prior_sd^2
data_prec <- sum(x^2) / sigma^2
post_prec <- prior_prec + data_prec
post_sd <- 1 / sqrt(post_prec)
post_mean <- (prior_mean * prior_prec + beta_ols * data_prec) / post_prec
# Weights
w_data <- data_prec / post_prec
w_prior <- prior_prec / post_prec
list(x = x, y = y, true_beta = true_beta,
beta_ols = beta_ols, se_ols = se_ols,
prior_mean = prior_mean, prior_sd = prior_sd,
post_mean = post_mean, post_sd = post_sd,
w_data = w_data, w_prior = w_prior, sigma = sigma)
})
output$scatter_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
plot(d$x, d$y, pch = 16, col = "#2c3e5060", cex = 1.2,
xlab = "x", ylab = "y", main = "Data + Regression Lines")
# True line (green, dashed)
abline(0, d$true_beta, col = "#27ae60", lwd = 2.5, lty = 2)
# OLS line (red)
abline(0, d$beta_ols, col = "#e74c3c", lwd = 2.5)
# Bayesian line (blue)
abline(0, d$post_mean, col = "#3498db", lwd = 2.5)
legend("topleft", bty = "n", cex = 0.85,
legend = c(
paste0("True: ", d$true_beta),
paste0("OLS: ", round(d$beta_ols, 3)),
paste0("Bayesian: ", round(d$post_mean, 3))
),
col = c("#27ae60", "#e74c3c", "#3498db"),
lwd = 2.5, lty = c(2, 1, 1))
})
output$density_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
# Range for plotting
all_means <- c(d$prior_mean, d$beta_ols, d$post_mean)
all_sds <- c(d$prior_sd, d$se_ols, d$post_sd)
xlim <- range(c(all_means - 3.5 * all_sds, all_means + 3.5 * all_sds))
x_seq <- seq(xlim[1], xlim[2], length.out = 300)
# Densities
prior_y <- dnorm(x_seq, d$prior_mean, d$prior_sd)
lik_y <- dnorm(x_seq, d$beta_ols, d$se_ols)
post_y <- dnorm(x_seq, d$post_mean, d$post_sd)
ylim <- c(0, max(c(prior_y, lik_y, post_y)) * 1.15)
plot(NULL, xlim = xlim, ylim = ylim,
xlab = expression(beta), ylab = "Density",
main = expression("Prior, Likelihood, & Posterior for " * beta))
# Prior (red, dashed)
lines(x_seq, prior_y, col = "#e74c3c", lwd = 2, lty = 2)
# Likelihood (gray, dotted)
lines(x_seq, lik_y, col = "gray50", lwd = 2, lty = 3)
# Posterior (blue, shaded)
polygon(c(x_seq, rev(x_seq)), c(post_y, rep(0, length(x_seq))),
col = "#3498db30", border = NA)
lines(x_seq, post_y, col = "#3498db", lwd = 2.5)
# True beta
abline(v = d$true_beta, lty = 2, col = "#27ae60", lwd = 1.5)
legend("topright", bty = "n", cex = 0.8,
legend = c("Prior", "Likelihood (data)",
"Posterior", expression("True " * beta)),
col = c("#e74c3c", "gray50", "#3498db", "#27ae60"),
lwd = c(2, 2, 2.5, 1.5),
lty = c(2, 3, 1, 2))
})
output$results <- renderUI({
d <- dat()
tags$div(class = "stats-box",
HTML(paste0(
"<b>OLS estimate:</b> ", round(d$beta_ols, 3), "<br>",
"<b>Posterior mean:</b> ", round(d$post_mean, 3), "<br>",
"<b>True β:</b> ", d$true_beta, "<br>",
"<hr style='margin:8px 0'>",
"<b>Weight on data:</b> ", round(d$w_data * 100, 1), "%<br>",
"<b>Weight on prior:</b> ", round(d$w_prior * 100, 1), "%<br>",
"<hr style='margin:8px 0'>",
"<b>OLS SE:</b> ", round(d$se_ols, 3), "<br>",
"<b>Posterior SD:</b> ", round(d$post_sd, 3)
))
)
})
}
shinyApp(ui, server)
Things to try
- Vague prior (prior SD = 5): the posterior almost exactly matches OLS. With a wide prior, the data dominates — the weight on prior drops near 0%. A vague Bayesian is a frequentist.
- Strong prior (prior SD = 0.3): the posterior is pulled heavily toward the prior mean. Even if the data says otherwise, the posterior barely moves. This is shrinkage in action.
- Large n (n = 200): regardless of the prior, the posterior converges to the OLS estimate. Data overwhelms the prior. Watch the weight on data approach 100%.
- Wrong prior (prior mean = -2, true beta = 1.5, n = 10): the Bayesian estimate is biased toward -2. Now increase n — the data corrects the bad prior. This is why Bayesian inference is “self-correcting” with enough data.
- Compare the right panel: the posterior (blue) always sits between the prior (red) and the likelihood (gray). It’s literally a compromise between what you believed and what you saw.
Credible intervals vs confidence intervals
Both give you a range of plausible values for \(\beta\). But they answer different questions:
|  | Confidence interval (frequentist) | Credible interval (Bayesian) |
|---|---|---|
| Statement | “If I repeated this experiment many times, 95% of my CIs would contain the true \(\beta\)” | “Given the data and my prior, there’s a 95% probability that \(\beta\) is in this interval” |
| About | The procedure | The parameter |
| Depends on prior? | No | Yes |
| Interpretation | Frequency guarantee across experiments | Direct probability statement for this experiment |
The credible interval says what most people think the confidence interval says. See Bayesian vs Frequentist for more on this distinction.
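To make both computations concrete, here is a small sketch in the same conjugate setup (no intercept, known \(\sigma\), illustrative numbers) that builds a 95% CI and a 95% CrI from one simulated dataset:

```r
# One simulated dataset: a frequentist 95% CI and a Bayesian 95% CrI for the slope.
set.seed(7)
n <- 30; beta_true <- 1.5; sigma <- 2
prior_mean <- 0; prior_sd <- 1

x <- rnorm(n); y <- beta_true * x + rnorm(n, 0, sigma)

beta_ols <- sum(x * y) / sum(x^2)
se_ols   <- sigma / sqrt(sum(x^2))
ci       <- beta_ols + c(-1.96, 1.96) * se_ols           # confidence interval

prior_prec <- 1 / prior_sd^2
data_prec  <- sum(x^2) / sigma^2
post_sd    <- 1 / sqrt(prior_prec + data_prec)
post_mean  <- (prior_mean * prior_prec + beta_ols * data_prec) / (prior_prec + data_prec)
cri        <- qnorm(c(0.025, 0.975), post_mean, post_sd)  # credible interval

rbind(CI = ci, CrI = cri)
```

The two intervals are computed from the same data, but only the CrI depends on the prior, which is exactly the "Depends on prior?" row in the table.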
The prior as regularization
A prior isn’t just a philosophical stance — it has a direct mathematical connection to regularization:
| Prior on \(\beta\) | Equivalent to | Penalty |
|---|---|---|
| \(N(0, \sigma_0^2)\) | Ridge regression | \(\lambda = \sigma^2 / \sigma_0^2\) |
| Laplace\((0, b)\) | LASSO | \(\lambda = \sigma^2 / b\) |
| Flat (improper) | OLS | No penalty |
A tighter prior (smaller \(\sigma_0^2\)) implies stronger regularization (larger \(\lambda\)). This is why Bayesian regression with an informative prior produces shrinkage — it pulls estimates toward the prior mean, just like ridge regression pulls coefficients toward zero.
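You can verify the ridge connection numerically. A minimal sketch assuming the same no-intercept, known-\(\sigma\) setup, with a prior centered at 0 and illustrative values:

```r
# The posterior mean under beta ~ N(0, prior_sd^2) equals the ridge estimate
# with lambda = sigma^2 / prior_sd^2 (no intercept, known sigma).
set.seed(3)
n <- 50; sigma <- 2; prior_sd <- 0.5
x <- rnorm(n); y <- 1.5 * x + rnorm(n, 0, sigma)

lambda     <- sigma^2 / prior_sd^2
beta_ridge <- sum(x * y) / (sum(x^2) + lambda)     # ridge: minimizes RSS + lambda * beta^2

data_prec  <- sum(x^2) / sigma^2
prior_prec <- 1 / prior_sd^2
beta_bayes <- (sum(x * y) / sigma^2) / (data_prec + prior_prec)  # posterior mean, prior mean 0

c(ridge = beta_ridge, bayes = beta_bayes)          # identical
```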
MAP vs MLE: where the prior shows up in optimization
Three ways to estimate \(\beta\), each using a different objective:
|  | MLE | MAP | Full Bayesian |
|---|---|---|---|
| Objective | \(\arg\max_\beta \; p(y \mid \beta)\) | \(\arg\max_\beta \; p(y \mid \beta)\,p(\beta)\) | Compute \(p(\beta \mid y)\) |
| What you get | Point estimate | Point estimate | Entire distribution |
| Uncertainty? | Only via SE / asymptotics | No (just the peak) | Yes — built in |
| Equivalent to | OLS | Ridge / LASSO | — |
In log space the relationship is transparent:
\[\log p(\beta \mid y) = \underbrace{\log p(y \mid \beta)}_{\text{log-likelihood}} + \underbrace{\log p(\beta)}_{\text{log-prior}} - \text{const}\]
- MLE maximizes the first term alone.
- MAP maximizes the first plus the second — the prior acts as a penalty.
What does that penalty look like?
- Normal prior \(\beta \sim N(0, \sigma_0^2)\): \(\log p(\beta) = -\beta^2 / (2\sigma_0^2) + \text{const}\) — that’s the ridge (\(L_2\)) penalty.
- Laplace prior \(\beta \sim \text{Laplace}(0, b)\): \(\log p(\beta) = -|\beta| / b + \text{const}\) — that’s the LASSO (\(L_1\)) penalty.
So MAP = regularized MLE. Full Bayesian = MAP + uncertainty. The posterior mean, posterior median, and MAP (posterior mode) are three different summaries of the same posterior distribution. For symmetric posteriors (like the normal-normal case in the simulation above) they coincide, but in general they can differ.
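The same relationship can be checked numerically. A sketch using `optimize()` with a normal prior and known \(\sigma\) (illustrative values); because the normal-normal posterior is symmetric, the MAP here coincides with the analytic posterior mean:

```r
# MLE vs MAP for a single slope, normal prior, known sigma.
set.seed(5)
n <- 25; sigma <- 2; prior_mean <- 0; prior_sd <- 0.5
x <- rnorm(n); y <- 1.5 * x + rnorm(n, 0, sigma)

neg_loglik  <- function(b) -sum(dnorm(y, b * x, sigma, log = TRUE))
neg_logpost <- function(b) neg_loglik(b) - dnorm(b, prior_mean, prior_sd, log = TRUE)

mle <- optimize(neg_loglik,  c(-10, 10))$minimum   # maximizes the likelihood = OLS slope
map <- optimize(neg_logpost, c(-10, 10))$minimum   # likelihood + prior = shrunken estimate

# Analytic posterior mean from the conjugate formula above, for comparison
data_prec <- sum(x^2) / sigma^2; prior_prec <- 1 / prior_sd^2
post_mean <- (prior_mean * prior_prec + (sum(x * y) / sum(x^2)) * data_prec) /
             (data_prec + prior_prec)

c(MLE = mle, MAP = map, posterior_mean = post_mean)
```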
Simulation: Repeated experiments — CI vs CrI coverage
Run the same experiment many times. Each repetition generates new data from the true model, computes both a frequentist CI and a Bayesian CrI, and checks whether each contains the true \(\beta\). The prior is centered at 0 (like ridge regression).
#| standalone: true
#| viewerHeight: 620
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
.good { color: #27ae60; font-weight: bold; }
.bad { color: #e74c3c; font-weight: bold; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("true_beta2", HTML("True β:"),
min = -3, max = 3, value = 1.5, step = 0.5),
sliderInput("prior_sd2", "Prior SD:",
min = 0.1, max = 5, value = 1, step = 0.1),
sliderInput("n2", "Sample size (n):",
min = 5, max = 200, value = 20, step = 5),
sliderInput("n_reps", "Repetitions:",
min = 10, max = 100, value = 40, step = 10),
actionButton("go2", "Run experiments", class = "btn-primary", width = "100%"),
uiOutput("results2")
),
mainPanel(
width = 9,
fluidRow(
column(6, plotOutput("interval_plot", height = "420px")),
column(6, plotOutput("dot_plot", height = "420px"))
)
)
)
)
server <- function(input, output, session) {
dat2 <- reactive({
input$go2
true_beta <- input$true_beta2
prior_sd <- input$prior_sd2
n <- input$n2
n_reps <- input$n_reps
sigma <- 2
prior_mean <- 0 # centered at 0, like ridge
prior_prec <- 1 / prior_sd^2
# Storage: ols_est, ols_lo, ols_hi, bayes_est, bayes_lo, bayes_hi
results <- matrix(NA, nrow = n_reps, ncol = 6)
for (r in 1:n_reps) {
x <- rnorm(n, 0, 1)
y <- true_beta * x + rnorm(n, 0, sigma)
beta_ols <- sum(x * y) / sum(x^2)
se_ols <- sigma / sqrt(sum(x^2))
data_prec <- sum(x^2) / sigma^2
post_prec <- prior_prec + data_prec
post_sd <- 1 / sqrt(post_prec)
post_mean <- (prior_mean * prior_prec + beta_ols * data_prec) / post_prec
results[r, ] <- c(
beta_ols,
beta_ols - 1.96 * se_ols,
beta_ols + 1.96 * se_ols,
post_mean,
qnorm(0.025, post_mean, post_sd),
qnorm(0.975, post_mean, post_sd)
)
}
# Coverage
ci_covers <- results[, 2] <= true_beta & results[, 3] >= true_beta
cri_covers <- results[, 5] <= true_beta & results[, 6] >= true_beta
# Average widths
ci_width <- mean(results[, 3] - results[, 2])
cri_width <- mean(results[, 6] - results[, 5])
# MSE
mse_ols <- mean((results[, 1] - true_beta)^2)
mse_bayes <- mean((results[, 4] - true_beta)^2)
# Implied ridge lambda
lambda <- sigma^2 / prior_sd^2
list(results = results, true_beta = true_beta, n_reps = n_reps,
ci_covers = ci_covers, cri_covers = cri_covers,
ci_width = ci_width, cri_width = cri_width,
mse_ols = mse_ols, mse_bayes = mse_bayes, lambda = lambda)
})
output$interval_plot <- renderPlot({
d <- dat2()
par(mar = c(4.5, 4.5, 3, 1))
n_show <- min(d$n_reps, 40)
xlim <- range(c(d$results[1:n_show, 2:3],
d$results[1:n_show, 5:6], d$true_beta)) + c(-0.3, 0.3)
plot(NULL, xlim = xlim, ylim = c(0.5, n_show + 0.5),
xlab = expression(beta), ylab = "Experiment #",
main = "95% Intervals Across Experiments")
for (i in 1:n_show) {
# CI (red)
ci_col <- if (d$ci_covers[i]) "#e74c3c" else "#e74c3c40"
segments(d$results[i, 2], i - 0.15, d$results[i, 3], i - 0.15,
lwd = 2, col = ci_col)
# CrI (blue)
cri_col <- if (d$cri_covers[i]) "#3498db" else "#3498db40"
segments(d$results[i, 5], i + 0.15, d$results[i, 6], i + 0.15,
lwd = 2, col = cri_col)
}
abline(v = d$true_beta, lty = 2, lwd = 2, col = "#2c3e50")
legend("topright", bty = "n", cex = 0.8,
legend = c(
paste0("CI (", sum(d$ci_covers[1:n_show]), "/", n_show, " cover)"),
paste0("CrI (", sum(d$cri_covers[1:n_show]), "/", n_show, " cover)")
),
col = c("#e74c3c", "#3498db"), lwd = 3)
})
output$dot_plot <- renderPlot({
d <- dat2()
par(mar = c(4.5, 4.5, 3, 1))
n_show <- min(d$n_reps, 40)
xlim <- range(c(d$results[1:n_show, 1],
d$results[1:n_show, 4], d$true_beta)) + c(-0.3, 0.3)
plot(NULL, xlim = xlim, ylim = c(0.5, n_show + 0.5),
xlab = expression(hat(beta)), ylab = "Experiment #",
main = "Point Estimates Across Experiments")
for (i in 1:n_show) {
points(d$results[i, 1], i - 0.12, pch = 16, col = "#e74c3c", cex = 0.9)
points(d$results[i, 4], i + 0.12, pch = 17, col = "#3498db", cex = 0.9)
}
abline(v = d$true_beta, lty = 2, lwd = 2, col = "#2c3e50")
legend("topright", bty = "n", cex = 0.8,
legend = c("OLS estimate", "Posterior mean"),
col = c("#e74c3c", "#3498db"),
pch = c(16, 17))
})
output$results2 <- renderUI({
d <- dat2()
ci_rate <- round(mean(d$ci_covers) * 100, 1)
cri_rate <- round(mean(d$cri_covers) * 100, 1)
tags$div(class = "stats-box",
HTML(paste0(
"<b>CI coverage:</b> ", ci_rate, "%<br>",
"<b>CrI coverage:</b> ", cri_rate, "%<br>",
"<hr style='margin:8px 0'>",
"<b>Avg CI width:</b> ", round(d$ci_width, 3), "<br>",
"<b>Avg CrI width:</b> ", round(d$cri_width, 3), "<br>",
"<hr style='margin:8px 0'>",
"<b>MSE (OLS):</b> ", round(d$mse_ols, 4), "<br>",
"<b>MSE (Bayes):</b> ", round(d$mse_bayes, 4), "<br>",
"<hr style='margin:8px 0'>",
"<b>Implied ridge λ:</b> ", round(d$lambda, 3)
))
)
})
}
shinyApp(ui, server)
Things to try
- Large n (n = 200): both intervals are nearly identical. CI and CrI coverage, width, and MSE all converge. With lots of data, Bayesian = frequentist.
- Strong prior (prior SD = 0.3), true beta = 1.5: the CrI is narrower but the prior pulls estimates toward 0. Some CrIs miss the true value — the coverage drops below 95%. A strong wrong prior hurts.
- Vague prior (prior SD = 5): the CrI and CI are virtually the same. The implied ridge \(\lambda\) is tiny — almost no regularization.
- True beta = 0, prior SD = 1: the prior is correct! The CrI has excellent coverage and is narrower than the CI. The Bayesian MSE beats OLS because the prior genuinely helps.
- Right panel: notice how the blue dots (Bayesian) cluster tighter around the true value than the red dots (OLS) when the prior is reasonable. That’s the bias-variance tradeoff — a little bias from the prior reduces variance enough to lower overall MSE.
In Stata: bayes: reg y x
Stata makes Bayesian regression straightforward. Just prepend bayes: to your regression command:
* Frequentist OLS
reg wage education experience
* Bayesian regression (default priors)
bayes: reg wage education experience
* With custom priors
bayes, prior({wage:education}, normal(1, 0.25)) ///
prior({wage:experience}, normal(0.5, 1)) ///
: reg wage education experience

The output looks different from OLS — instead of a coefficient table with p-values, you get posterior means, standard deviations, and credible intervals. Under the hood, Stata uses the Metropolis-Hastings algorithm (see MCMC) to sample from the posterior.
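If you want a feel for what that sampling step looks like, here is a minimal random-walk Metropolis sketch in base R for a toy regression with normal priors. This is not Stata's implementation; the priors, proposal scale, and data are illustrative assumptions:

```r
# Minimal random-walk Metropolis for y = a + b*x + e, normal priors on (a, b),
# sigma treated as known for simplicity. Illustrative sketch only.
set.seed(42)
n <- 100
x <- rnorm(n); y <- 1 + 1.5 * x + rnorm(n, 0, 2)
sigma <- 2

log_post <- function(theta) {
  a <- theta[1]; b <- theta[2]
  sum(dnorm(y, a + b * x, sigma, log = TRUE)) +   # log-likelihood
    dnorm(a, 0, 10, log = TRUE) +                 # vague prior on the intercept
    dnorm(b, 0, 10, log = TRUE)                   # vague prior on the slope
}

n_iter <- 5000
draws  <- matrix(NA, n_iter, 2, dimnames = list(NULL, c("a", "b")))
theta  <- c(0, 0)                                 # starting values
for (i in 1:n_iter) {
  prop <- theta + rnorm(2, 0, 0.15)               # random-walk proposal
  if (log(runif(1)) < log_post(prop) - log_post(theta)) theta <- prop
  draws[i, ] <- theta
}

post <- draws[-(1:1000), ]                        # drop burn-in
colMeans(post)                                    # posterior means
apply(post, 2, quantile, probs = c(0.025, 0.975)) # 95% credible intervals
```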
Did you know?
Lindley & Smith (1972) formalized the conjugate Bayesian linear model, showing how hierarchical priors on regression coefficients lead to shrinkage estimators. Their work unified the Bayesian and empirical Bayes approaches to regression.
Hoerl & Kennard (1970) introduced ridge regression as a purely frequentist technique for handling multicollinearity. The Bayesian interpretation — that ridge is equivalent to a normal prior — came later, connecting two literatures that developed independently.
Modern frontiers: Bayesian regression ideas power some of today’s most flexible methods. BART (Bayesian Additive Regression Trees) uses priors on tree structures for nonparametric regression. Bayesian model averaging puts priors on competing models themselves, not just parameters, to account for model uncertainty.