Bayesian Regression

From reg y x to Bayesian regression

On the MCMC page, you saw how to sample from any posterior — even when no closed-form solution exists. Now we apply that machinery to something familiar: regression.

In Stata, you type reg y x and get a point estimate \(\hat{\beta}\), a standard error, and a 95% confidence interval. That’s OLS — it picks the single line that minimizes the sum of squared residuals.

Bayesian regression starts from the same place (the same likelihood) but adds one ingredient: a prior distribution on \(\beta\). Instead of a single best-fit line, you get a full posterior distribution — a curve showing how plausible each value of \(\beta\) is, given both the data and your prior beliefs.

Example: You’re estimating the effect of an extra year of education on hourly wages. Before seeing your data, you’ve read the literature — most estimates land between $0.50 and $1.50 per year. A Bayesian approach lets you encode this: set a prior of \(\beta \sim N(1.0, 0.5^2)\), centered on $1.00 with moderate uncertainty. OLS ignores this information entirely. The Bayesian estimate blends it with your data.

What “a prior on \(\beta\)” means: it’s a probability distribution expressing your beliefs about the slope before seeing the data. A prior of \(N(0, 10^2)\) says “I think \(\beta\) is probably near 0 but I’m very uncertain” (vague). A prior of \(N(1, 0.3^2)\) says “I’m fairly confident \(\beta\) is close to 1” (informative). After seeing data, the prior gets updated to a posterior.
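If you want to see what those two priors look like side by side, a minimal base-R sketch (not part of the interactive app below; the grid and colors are just illustrative choices) is enough:

# Compare a vague and an informative prior on the slope beta
beta_grid   <- seq(-4, 4, length.out = 400)
vague       <- dnorm(beta_grid, mean = 0, sd = 10)   # N(0, 10^2): "anything goes"
informative <- dnorm(beta_grid, mean = 1, sd = 0.3)  # N(1, 0.3^2): "beta is near 1"

plot(beta_grid, informative, type = "l", col = "#3498db", lwd = 2,
     xlab = expression(beta), ylab = "Prior density")
lines(beta_grid, vague, col = "#e74c3c", lwd = 2, lty = 2)
legend("topleft", bty = "n",
       legend = c("Informative N(1, 0.3^2)", "Vague N(0, 10^2)"),
       col = c("#3498db", "#e74c3c"), lwd = 2, lty = c(1, 2))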

The math side-by-side:

|             | OLS | Bayesian |
|-------------|-----|----------|
| Model       | \(y = X\beta + \varepsilon, \;\; \varepsilon \sim N(0, \sigma^2 I)\) | Same likelihood + prior \(\beta \sim N(\mu_0, \sigma_0^2)\) |
| Estimate    | \(\hat{\beta}_{OLS} = (X'X)^{-1}X'y\) | Posterior mean \(= w \cdot \hat{\beta}_{OLS} + (1-w) \cdot \mu_0\) |
| Uncertainty | SE, confidence interval | Full posterior distribution, credible interval |
| Result      | One number + interval | Entire distribution |

The posterior mean is a precision-weighted average of the prior mean and the OLS estimate, where \(w = \frac{\text{data precision}}{\text{data precision} + \text{prior precision}}\). The more data you have, the more the posterior looks like OLS. The stronger the prior (smaller \(\sigma_0^2\)), the more the posterior is pulled toward the prior mean.
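In code, that update is only a few lines. Here is a minimal plain-R sketch using the same known-\(\sigma\), no-intercept setup as the simulation below; the specific numbers (n = 20, prior \(N(1, 0.5^2)\)) are illustrative, not prescribed:

set.seed(1)
n <- 20; sigma <- 2; true_beta <- 1.5
x <- rnorm(n)
y <- true_beta * x + rnorm(n, 0, sigma)

# OLS slope (no intercept, centered x) and its precision
beta_ols  <- sum(x * y) / sum(x^2)
data_prec <- sum(x^2) / sigma^2        # = 1 / Var(beta_ols)

# Prior beta ~ N(mu0, sd0^2) and its precision
mu0 <- 1.0; sd0 <- 0.5
prior_prec <- 1 / sd0^2

# Precision-weighted average: the conjugate posterior
w         <- data_prec / (data_prec + prior_prec)   # weight on the data
post_mean <- w * beta_ols + (1 - w) * mu0
post_sd   <- 1 / sqrt(data_prec + prior_prec)

round(c(ols = beta_ols, post_mean = post_mean, weight_on_data = w), 3)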

Simulation: OLS vs Bayesian — the role of the prior

#| standalone: true
#| viewerHeight: 620

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
    .good { color: #27ae60; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("true_beta", HTML("True &beta;:"),
                  min = -3, max = 3, value = 1.5, step = 0.5),

      sliderInput("prior_mean", "Prior mean:",
                  min = -3, max = 3, value = 0, step = 0.5),

      sliderInput("prior_sd", "Prior SD:",
                  min = 0.1, max = 5, value = 1, step = 0.1),

      sliderInput("n", "Sample size (n):",
                  min = 5, max = 200, value = 20, step = 5),

      sliderInput("sigma", HTML("Noise (&sigma;):"),
                  min = 0.5, max = 5, value = 2, step = 0.5),

      actionButton("go", "New data", class = "btn-primary", width = "100%"),

      uiOutput("results")
    ),

    mainPanel(
      width = 9,
      fluidRow(
        column(6, plotOutput("scatter_plot", height = "420px")),
        column(6, plotOutput("density_plot", height = "420px"))
      )
    )
  )
)

server <- function(input, output, session) {

  dat <- reactive({
    input$go
    true_beta  <- input$true_beta
    prior_mean <- input$prior_mean
    prior_sd   <- input$prior_sd
    n          <- input$n
    sigma      <- input$sigma

    # Generate data: y = true_beta * x + noise (centered x, no intercept)
    x <- rnorm(n, 0, 1)
    y <- true_beta * x + rnorm(n, 0, sigma)

    # OLS estimate
    beta_ols <- sum(x * y) / sum(x^2)
    se_ols   <- sigma / sqrt(sum(x^2))

    # Bayesian posterior (conjugate normal-normal, known sigma)
    prior_prec <- 1 / prior_sd^2
    data_prec  <- sum(x^2) / sigma^2
    post_prec  <- prior_prec + data_prec
    post_sd    <- 1 / sqrt(post_prec)
    post_mean  <- (prior_mean * prior_prec + beta_ols * data_prec) / post_prec

    # Weights
    w_data  <- data_prec / post_prec
    w_prior <- prior_prec / post_prec

    list(x = x, y = y, true_beta = true_beta,
         beta_ols = beta_ols, se_ols = se_ols,
         prior_mean = prior_mean, prior_sd = prior_sd,
         post_mean = post_mean, post_sd = post_sd,
         w_data = w_data, w_prior = w_prior, sigma = sigma)
  })

  output$scatter_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    plot(d$x, d$y, pch = 16, col = "#2c3e5060", cex = 1.2,
         xlab = "x", ylab = "y", main = "Data + Regression Lines")

    # True line (green, dashed)
    abline(0, d$true_beta, col = "#27ae60", lwd = 2.5, lty = 2)
    # OLS line (red)
    abline(0, d$beta_ols, col = "#e74c3c", lwd = 2.5)
    # Bayesian line (blue)
    abline(0, d$post_mean, col = "#3498db", lwd = 2.5)

    legend("topleft", bty = "n", cex = 0.85,
           legend = c(
             paste0("True: ", d$true_beta),
             paste0("OLS: ", round(d$beta_ols, 3)),
             paste0("Bayesian: ", round(d$post_mean, 3))
           ),
           col = c("#27ae60", "#e74c3c", "#3498db"),
           lwd = 2.5, lty = c(2, 1, 1))
  })

  output$density_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    # Range for plotting
    all_means <- c(d$prior_mean, d$beta_ols, d$post_mean)
    all_sds   <- c(d$prior_sd, d$se_ols, d$post_sd)
    xlim <- range(c(all_means - 3.5 * all_sds, all_means + 3.5 * all_sds))
    x_seq <- seq(xlim[1], xlim[2], length.out = 300)

    # Densities
    prior_y <- dnorm(x_seq, d$prior_mean, d$prior_sd)
    lik_y   <- dnorm(x_seq, d$beta_ols, d$se_ols)
    post_y  <- dnorm(x_seq, d$post_mean, d$post_sd)

    ylim <- c(0, max(c(prior_y, lik_y, post_y)) * 1.15)

    plot(NULL, xlim = xlim, ylim = ylim,
         xlab = expression(beta), ylab = "Density",
         main = expression("Prior, Likelihood, & Posterior for " * beta))

    # Prior (red, dashed)
    lines(x_seq, prior_y, col = "#e74c3c", lwd = 2, lty = 2)

    # Likelihood (gray, dotted)
    lines(x_seq, lik_y, col = "gray50", lwd = 2, lty = 3)

    # Posterior (blue, shaded)
    polygon(c(x_seq, rev(x_seq)), c(post_y, rep(0, length(x_seq))),
            col = "#3498db30", border = NA)
    lines(x_seq, post_y, col = "#3498db", lwd = 2.5)

    # True beta
    abline(v = d$true_beta, lty = 2, col = "#27ae60", lwd = 1.5)

    legend("topright", bty = "n", cex = 0.8,
           legend = c("Prior", "Likelihood (data)",
                      "Posterior", expression("True " * beta)),
           col = c("#e74c3c", "gray50", "#3498db", "#27ae60"),
           lwd = c(2, 2, 2.5, 1.5),
           lty = c(2, 3, 1, 2))
  })

  output$results <- renderUI({
    d <- dat()
    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>OLS estimate:</b> ", round(d$beta_ols, 3), "<br>",
        "<b>Posterior mean:</b> ", round(d$post_mean, 3), "<br>",
        "<b>True &beta;:</b> ", d$true_beta, "<br>",
        "<hr style='margin:8px 0'>",
        "<b>Weight on data:</b> ", round(d$w_data * 100, 1), "%<br>",
        "<b>Weight on prior:</b> ", round(d$w_prior * 100, 1), "%<br>",
        "<hr style='margin:8px 0'>",
        "<b>OLS SE:</b> ", round(d$se_ols, 3), "<br>",
        "<b>Posterior SD:</b> ", round(d$post_sd, 3)
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • Vague prior (prior SD = 5): the posterior almost exactly matches OLS. With a wide prior, the data dominates — the weight on prior drops near 0%. A vague Bayesian is a frequentist.
  • Strong prior (prior SD = 0.3): the posterior is pulled heavily toward the prior mean. Even if the data says otherwise, the posterior barely moves. This is shrinkage in action.
  • Large n (n = 200): regardless of the prior, the posterior converges to the OLS estimate. Data overwhelms the prior. Watch the weight on data approach 100%.
  • Wrong prior (prior mean = -2, true beta = 1.5, n = 10): the Bayesian estimate is biased toward -2. Now increase n — the data corrects the bad prior. This is why Bayesian inference is “self-correcting” with enough data.
  • Compare the right panel: the posterior (blue) always sits between the prior (red) and the likelihood (gray). It’s literally a compromise between what you believed and what you saw.

Credible intervals vs confidence intervals

Both give you a range of plausible values for \(\beta\). But they answer different questions:

|                   | Confidence interval (frequentist) | Credible interval (Bayesian) |
|-------------------|-----------------------------------|------------------------------|
| Statement         | “If I repeated this experiment many times, 95% of my CIs would contain the true \(\beta\).” | “Given the data and my prior, there’s a 95% probability that \(\beta\) is in this interval.” |
| About             | The procedure                     | The parameter                |
| Depends on prior? | No                                | Yes                          |
| Interpretation    | Frequency guarantee across experiments | Direct probability statement for this experiment |

The credible interval says what most people think the confidence interval says. See Bayesian vs Frequentist for more on this distinction.
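To make the contrast concrete, the sketch below computes both 95% intervals for one simulated dataset, assuming the same known-\(\sigma\) conjugate setup as above; the \(N(0, 1)\) prior is just an illustrative choice. Numerically the two intervals can look similar; the difference is in what each one licenses you to say.

set.seed(2)
n <- 20; sigma <- 2; true_beta <- 1.5
x <- rnorm(n)
y <- true_beta * x + rnorm(n, 0, sigma)

# Frequentist: OLS estimate and 95% confidence interval (known sigma)
beta_ols <- sum(x * y) / sum(x^2)
se_ols   <- sigma / sqrt(sum(x^2))
ci <- beta_ols + c(-1.96, 1.96) * se_ols

# Bayesian: conjugate posterior with an illustrative N(0, 1) prior, 95% credible interval
prior_mean <- 0; prior_sd <- 1
post_prec  <- 1 / prior_sd^2 + sum(x^2) / sigma^2
post_mean  <- (prior_mean / prior_sd^2 + sum(x * y) / sigma^2) / post_prec
cri <- qnorm(c(0.025, 0.975), post_mean, 1 / sqrt(post_prec))

round(rbind(confidence = ci, credible = cri), 3)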


The prior as regularization

A prior isn’t just a philosophical stance — it has a direct mathematical connection to regularization:

| Prior on \(\beta\)   | Equivalent to    | Penalty |
|----------------------|------------------|---------|
| \(N(0, \sigma_0^2)\) | Ridge regression | \(\lambda = \sigma^2 / \sigma_0^2\) |
| Laplace\((0, b)\)    | LASSO            | \(\lambda = \sigma^2 / b\) |
| Flat (improper)      | OLS              | No penalty |

MAP vs full Bayesian: The maximum a posteriori (MAP) estimate — the peak of the posterior — equals the ridge/LASSO solution exactly. But full Bayesian inference gives you the entire posterior, not just the peak. You get uncertainty for free.

A tighter prior (smaller \(\sigma_0^2\)) implies stronger regularization (larger \(\lambda\)). This is why Bayesian regression with an informative prior produces shrinkage — it pulls estimates toward the prior mean, just like ridge regression pulls coefficients toward zero.
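You can check the ridge correspondence numerically. The sketch below assumes a single centered regressor and known \(\sigma\), as in the simulation above: the conjugate posterior mean under a \(N(0, \sigma_0^2)\) prior matches the ridge estimator with \(\lambda = \sigma^2 / \sigma_0^2\).

set.seed(3)
n <- 30; sigma <- 2; sd0 <- 0.5
x <- rnorm(n)
y <- 1.5 * x + rnorm(n, 0, sigma)

# Bayesian posterior mean with prior N(0, sd0^2), known sigma
post_mean <- (sum(x * y) / sigma^2) / (1 / sd0^2 + sum(x^2) / sigma^2)

# Ridge estimator with lambda = sigma^2 / sd0^2
lambda <- sigma^2 / sd0^2
ridge  <- sum(x * y) / (sum(x^2) + lambda)

c(posterior_mean = post_mean, ridge = ridge)   # identical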

MAP vs MLE: where the prior shows up in optimization

Three ways to estimate \(\beta\), each using a different objective:

|               | MLE | MAP | Full Bayesian |
|---------------|-----|-----|---------------|
| Objective     | \(\arg\max_\beta \; p(y \mid \beta)\) | \(\arg\max_\beta \; p(y \mid \beta)\,p(\beta)\) | Compute \(p(\beta \mid y)\) |
| What you get  | Point estimate | Point estimate | Entire distribution |
| Uncertainty?  | Only via SE / asymptotics | No (just the peak) | Yes — built in |
| Equivalent to | OLS | Ridge / LASSO | |

In log space the relationship is transparent:

\[\log p(\beta \mid y) = \underbrace{\log p(y \mid \beta)}_{\text{log-likelihood}} + \underbrace{\log p(\beta)}_{\text{log-prior}} - \text{const}\]

  • MLE maximizes the first term alone.
  • MAP maximizes the first plus the second — the prior acts as a penalty.

What does that penalty look like?

  • Normal prior \(\beta \sim N(0, \sigma_0^2)\): \(\log p(\beta) = -\beta^2 / (2\sigma_0^2) + \text{const}\) — that’s the ridge (\(L_2\)) penalty.
  • Laplace prior \(\beta \sim \text{Laplace}(0, b)\): \(\log p(\beta) = -|\beta| / b + \text{const}\) — that’s the LASSO (\(L_1\)) penalty.

So MAP = regularized MLE. Full Bayesian = MAP + uncertainty. The posterior mean, posterior median, and MAP (posterior mode) are three different summaries of the same posterior distribution. For symmetric posteriors (like the normal-normal case in the simulation above) they coincide, but in general they can differ.

Connecting to the simulation above: the OLS line is the MLE. Because the normal posterior in this simulation is symmetric, its mean and mode coincide, so the Bayesian line is also the MAP. Try a strong prior (SD = 0.3) and watch it get pulled toward the prior mean — that’s the ridge penalty at work. Then set the prior SD to 5 and watch the MAP converge to the MLE — a vague prior means almost no penalty.
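To see the penalty interpretation numerically, here is a small sketch (not how Stata or any particular package implements it) that minimizes the negative log-likelihood and negative log-posterior directly with optimize; the strong \(N(0, 0.3^2)\) prior is chosen only to make the shrinkage visible:

set.seed(4)
n <- 15; sigma <- 2; sd0 <- 0.3
x <- rnorm(n)
y <- 1.5 * x + rnorm(n, 0, sigma)

# Negative log-likelihood and negative log-posterior (known sigma, prior N(0, sd0^2))
neg_loglik  <- function(b) -sum(dnorm(y, mean = b * x, sd = sigma, log = TRUE))
neg_logpost <- function(b) neg_loglik(b) - dnorm(b, mean = 0, sd = sd0, log = TRUE)

mle <- optimize(neg_loglik,  interval = c(-10, 10))$minimum   # likelihood only: equals OLS
map <- optimize(neg_logpost, interval = c(-10, 10))$minimum   # likelihood + prior: shrunk toward 0

round(c(MLE = mle, MAP = map, OLS_check = sum(x * y) / sum(x^2)), 3)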

Simulation: Repeated experiments — CI vs CrI coverage

Run the same experiment many times. Each repetition generates new data from the true model, computes both a frequentist CI and a Bayesian CrI, and checks whether each contains the true \(\beta\). The prior is centered at 0 (like ridge regression).

#| standalone: true
#| viewerHeight: 620

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
    .good { color: #27ae60; font-weight: bold; }
    .bad  { color: #e74c3c; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("true_beta2", HTML("True &beta;:"),
                  min = -3, max = 3, value = 1.5, step = 0.5),

      sliderInput("prior_sd2", "Prior SD:",
                  min = 0.1, max = 5, value = 1, step = 0.1),

      sliderInput("n2", "Sample size (n):",
                  min = 5, max = 200, value = 20, step = 5),

      sliderInput("n_reps", "Repetitions:",
                  min = 10, max = 100, value = 40, step = 10),

      actionButton("go2", "Run experiments", class = "btn-primary", width = "100%"),

      uiOutput("results2")
    ),

    mainPanel(
      width = 9,
      fluidRow(
        column(6, plotOutput("interval_plot", height = "420px")),
        column(6, plotOutput("dot_plot", height = "420px"))
      )
    )
  )
)

server <- function(input, output, session) {

  dat2 <- reactive({
    input$go2
    true_beta  <- input$true_beta2
    prior_sd   <- input$prior_sd2
    n          <- input$n2
    n_reps     <- input$n_reps
    sigma      <- 2
    prior_mean <- 0  # centered at 0, like ridge

    prior_prec <- 1 / prior_sd^2

    # Storage: ols_est, ols_lo, ols_hi, bayes_est, bayes_lo, bayes_hi
    results <- matrix(NA, nrow = n_reps, ncol = 6)

    for (r in 1:n_reps) {
      x <- rnorm(n, 0, 1)
      y <- true_beta * x + rnorm(n, 0, sigma)

      beta_ols <- sum(x * y) / sum(x^2)
      se_ols   <- sigma / sqrt(sum(x^2))

      data_prec <- sum(x^2) / sigma^2
      post_prec <- prior_prec + data_prec
      post_sd   <- 1 / sqrt(post_prec)
      post_mean <- (prior_mean * prior_prec + beta_ols * data_prec) / post_prec

      results[r, ] <- c(
        beta_ols,
        beta_ols - 1.96 * se_ols,
        beta_ols + 1.96 * se_ols,
        post_mean,
        qnorm(0.025, post_mean, post_sd),
        qnorm(0.975, post_mean, post_sd)
      )
    }

    # Coverage
    ci_covers  <- results[, 2] <= true_beta & results[, 3] >= true_beta
    cri_covers <- results[, 5] <= true_beta & results[, 6] >= true_beta

    # Average widths
    ci_width  <- mean(results[, 3] - results[, 2])
    cri_width <- mean(results[, 6] - results[, 5])

    # MSE
    mse_ols   <- mean((results[, 1] - true_beta)^2)
    mse_bayes <- mean((results[, 4] - true_beta)^2)

    # Implied ridge lambda
    lambda <- sigma^2 / prior_sd^2

    list(results = results, true_beta = true_beta, n_reps = n_reps,
         ci_covers = ci_covers, cri_covers = cri_covers,
         ci_width = ci_width, cri_width = cri_width,
         mse_ols = mse_ols, mse_bayes = mse_bayes, lambda = lambda)
  })

  output$interval_plot <- renderPlot({
    d <- dat2()
    par(mar = c(4.5, 4.5, 3, 1))

    n_show <- min(d$n_reps, 40)
    xlim <- range(c(d$results[1:n_show, 2:3],
                    d$results[1:n_show, 5:6], d$true_beta)) + c(-0.3, 0.3)

    plot(NULL, xlim = xlim, ylim = c(0.5, n_show + 0.5),
         xlab = expression(beta), ylab = "Experiment #",
         main = "95% Intervals Across Experiments")

    for (i in 1:n_show) {
      # CI (red)
      ci_col <- if (d$ci_covers[i]) "#e74c3c" else "#e74c3c40"
      segments(d$results[i, 2], i - 0.15, d$results[i, 3], i - 0.15,
               lwd = 2, col = ci_col)

      # CrI (blue)
      cri_col <- if (d$cri_covers[i]) "#3498db" else "#3498db40"
      segments(d$results[i, 5], i + 0.15, d$results[i, 6], i + 0.15,
               lwd = 2, col = cri_col)
    }

    abline(v = d$true_beta, lty = 2, lwd = 2, col = "#2c3e50")

    legend("topright", bty = "n", cex = 0.8,
           legend = c(
             paste0("CI (", sum(d$ci_covers[1:n_show]), "/", n_show, " cover)"),
             paste0("CrI (", sum(d$cri_covers[1:n_show]), "/", n_show, " cover)")
           ),
           col = c("#e74c3c", "#3498db"), lwd = 3)
  })

  output$dot_plot <- renderPlot({
    d <- dat2()
    par(mar = c(4.5, 4.5, 3, 1))

    n_show <- min(d$n_reps, 40)
    xlim <- range(c(d$results[1:n_show, 1],
                    d$results[1:n_show, 4], d$true_beta)) + c(-0.3, 0.3)

    plot(NULL, xlim = xlim, ylim = c(0.5, n_show + 0.5),
         xlab = expression(hat(beta)), ylab = "Experiment #",
         main = "Point Estimates Across Experiments")

    for (i in 1:n_show) {
      points(d$results[i, 1], i - 0.12, pch = 16, col = "#e74c3c", cex = 0.9)
      points(d$results[i, 4], i + 0.12, pch = 17, col = "#3498db", cex = 0.9)
    }

    abline(v = d$true_beta, lty = 2, lwd = 2, col = "#2c3e50")

    legend("topright", bty = "n", cex = 0.8,
           legend = c("OLS estimate", "Posterior mean"),
           col = c("#e74c3c", "#3498db"),
           pch = c(16, 17))
  })

  output$results2 <- renderUI({
    d <- dat2()

    ci_rate  <- round(mean(d$ci_covers) * 100, 1)
    cri_rate <- round(mean(d$cri_covers) * 100, 1)

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>CI coverage:</b> ", ci_rate, "%<br>",
        "<b>CrI coverage:</b> ", cri_rate, "%<br>",
        "<hr style='margin:8px 0'>",
        "<b>Avg CI width:</b> ", round(d$ci_width, 3), "<br>",
        "<b>Avg CrI width:</b> ", round(d$cri_width, 3), "<br>",
        "<hr style='margin:8px 0'>",
        "<b>MSE (OLS):</b> ", round(d$mse_ols, 4), "<br>",
        "<b>MSE (Bayes):</b> ", round(d$mse_bayes, 4), "<br>",
        "<hr style='margin:8px 0'>",
        "<b>Implied ridge &lambda;:</b> ", round(d$lambda, 3)
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • Large n (n = 200): both intervals are nearly identical. CI and CrI coverage, width, and MSE all converge. With lots of data, Bayesian = frequentist.
  • Strong prior (prior SD = 0.3), true beta = 1.5: the CrI is narrower but the prior pulls estimates toward 0. Some CrIs miss the true value — the coverage drops below 95%. A strong wrong prior hurts.
  • Vague prior (prior SD = 5): the CrI and CI are virtually the same. The implied ridge \(\lambda\) is tiny — almost no regularization.
  • True beta = 0, prior SD = 1: the prior is correct! The CrI has excellent coverage and is narrower than the CI. The Bayesian MSE beats OLS because the prior genuinely helps.
  • Right panel: notice how the blue dots (Bayesian) cluster tighter around the true value than the red dots (OLS) when the prior is reasonable. That’s the bias-variance tradeoff — a little bias from the prior reduces variance enough to lower overall MSE.
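If you want to verify the coverage and MSE patterns outside the app, the sketch below is a stripped-down, non-interactive replay of the same experiment (same conjugate formulas, prior centered at 0; the settings shown are just one illustrative configuration):

set.seed(5)
n_reps <- 2000; n <- 20; sigma <- 2; true_beta <- 1.5
prior_mean <- 0; prior_sd <- 1

cover_ci <- cover_cri <- err_ols <- err_bayes <- numeric(n_reps)
for (r in 1:n_reps) {
  x <- rnorm(n)
  y <- true_beta * x + rnorm(n, 0, sigma)

  beta_ols <- sum(x * y) / sum(x^2)
  se_ols   <- sigma / sqrt(sum(x^2))

  post_prec <- 1 / prior_sd^2 + sum(x^2) / sigma^2
  post_mean <- (prior_mean / prior_sd^2 + sum(x * y) / sigma^2) / post_prec
  post_sd   <- 1 / sqrt(post_prec)

  cover_ci[r]  <- abs(beta_ols - true_beta) <= 1.96 * se_ols
  cover_cri[r] <- abs(post_mean - true_beta) <= 1.96 * post_sd
  err_ols[r]   <- beta_ols  - true_beta
  err_bayes[r] <- post_mean - true_beta
}

round(c(ci_coverage = mean(cover_ci), cri_coverage = mean(cover_cri),
        mse_ols = mean(err_ols^2), mse_bayes = mean(err_bayes^2)), 3)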

In Stata: bayes: reg y x

Stata makes Bayesian regression straightforward. Just prepend bayes: to your regression command:

* Frequentist OLS
reg wage education experience

* Bayesian regression (default priors)
bayes: reg wage education experience

* With custom priors
bayes, prior({wage:education}, normal(1, 0.25)) ///
      prior({wage:experience}, normal(0.5, 1))  ///
      : reg wage education experience

The output looks different from OLS — instead of a coefficient table with p-values, you get posterior means, standard deviations, and credible intervals. Under the hood, Stata uses the Metropolis-Hastings algorithm (see MCMC) to sample from the posterior.

Try it: run reg y x and bayes: reg y x on the same dataset. With Stata’s default (vague) priors and moderate sample sizes, the results will be nearly identical — confirming that the prior washes out with enough data.


Did you know?

  • Lindley & Smith (1972) formalized the conjugate Bayesian linear model, showing how hierarchical priors on regression coefficients lead to shrinkage estimators. Their work unified the Bayesian and empirical Bayes approaches to regression.

  • Hoerl & Kennard (1970) introduced ridge regression as a purely frequentist technique for handling multicollinearity. The Bayesian interpretation — that ridge is equivalent to a normal prior — came later, connecting two literatures that developed independently.

  • Modern frontiers: Bayesian regression ideas power some of today’s most flexible methods. BART (Bayesian Additive Regression Trees) uses priors on tree structures for nonparametric regression. Bayesian model averaging puts priors on competing models themselves, not just parameters, to account for model uncertainty.