Priors & Posteriors

What is Bayesian inference, really?

Imagine you’re trying to estimate something — say, the average effect of a tutoring program on test scores. In frequentist statistics, you collect data, compute a point estimate (perhaps with a confidence interval), and that’s your answer.

In Bayesian statistics, you do something different:

  1. Start with a prior — what you believed before seeing data. Maybe from past studies, expert opinion, or just “I have no idea” (a flat prior).
  2. Observe data — the likelihood tells you how probable the data is for each possible value of the parameter.
  3. Combine them — Bayes’ theorem multiplies the prior by the likelihood to give you the posterior: your updated belief after seeing the data.

\[\underbrace{P(\theta \mid \text{data})}_{\text{posterior}} \propto \underbrace{P(\text{data} \mid \theta)}_{\text{likelihood}} \times \underbrace{P(\theta)}_{\text{prior}}\]

Example: diagnosing a headache. A doctor sees a patient with a headache. Before any tests, her prior is: 99.9% chance it’s a tension headache, 0.1% chance it’s a brain tumor (base rates from experience). Then an MRI shows something unusual — that’s the data. The posterior updates: maybe now it’s 95% tension headache, 5% tumor. The prior didn’t disappear — it got updated by the evidence. A second MRI (more data) updates it further. With enough evidence, even a strong prior gets overwhelmed. That’s Bayesian inference: start with what you know, update with what you see.
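
To see the arithmetic, here is that same update as a few lines of R. The MRI likelihoods (how often an “unusual” scan appears under each diagnosis) aren’t stated in the story, so the values below are made up to roughly reproduce the 95% / 5% posterior.

# Prior: base rates before any test
prior <- c(tension = 0.999, tumor = 0.001)

# Assumed likelihoods: P(unusual MRI | diagnosis) -- illustrative values, not from the text
likelihood <- c(tension = 0.017, tumor = 0.90)

# Bayes' theorem: posterior is proportional to prior x likelihood
unnormalized <- prior * likelihood
posterior <- unnormalized / sum(unnormalized)
round(posterior, 3)
# tension: 0.950, tumor: 0.050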

The simulation below makes this tangible. You’re estimating the true mean of a normal distribution. Set a prior, generate data, and watch the posterior form.

#| standalone: true
#| viewerHeight: 600

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("true_mu", HTML("True &mu; (unknown to you):"),
                  min = -5, max = 5, value = 2, step = 0.5),

      sliderInput("prior_mu", "Prior mean:",
                  min = -5, max = 5, value = 0, step = 0.5),

      sliderInput("prior_sd", "Prior SD (certainty):",
                  min = 0.5, max = 10, value = 3, step = 0.5),

      sliderInput("n", "Sample size (data):",
                  min = 1, max = 200, value = 5, step = 1),

      sliderInput("sigma", HTML("Data noise (&sigma;):"),
                  min = 0.5, max = 5, value = 2, step = 0.5),

      actionButton("go", "New draw", class = "btn-primary", width = "100%"),

      uiOutput("results")
    ),

    mainPanel(
      width = 9,
      plotOutput("posterior_plot", height = "450px")
    )
  )
)

server <- function(input, output, session) {

  dat <- reactive({
    input$go

    true_mu  <- input$true_mu
    prior_mu <- input$prior_mu
    prior_sd <- input$prior_sd
    n        <- input$n
    sigma    <- input$sigma

    # Generate data
    y <- rnorm(n, mean = true_mu, sd = sigma)
    y_bar <- mean(y)

    # Posterior (conjugate normal-normal)
    prior_prec <- 1 / prior_sd^2
    data_prec  <- n / sigma^2
    post_prec  <- prior_prec + data_prec
    post_sd    <- 1 / sqrt(post_prec)
    post_mu    <- (prior_prec * prior_mu + data_prec * y_bar) / post_prec

    # Shrinkage weight on prior
    w_prior <- prior_prec / post_prec

    list(true_mu = true_mu, prior_mu = prior_mu, prior_sd = prior_sd,
         y_bar = y_bar, sigma = sigma, n = n,
         post_mu = post_mu, post_sd = post_sd, w_prior = w_prior)
  })

  output$posterior_plot <- renderPlot({
    d <- dat()

    xmin <- min(d$prior_mu - 3.5 * d$prior_sd, d$post_mu - 4 * d$post_sd, d$true_mu - 2)
    xmax <- max(d$prior_mu + 3.5 * d$prior_sd, d$post_mu + 4 * d$post_sd, d$true_mu + 2)
    x <- seq(xmin, xmax, length.out = 500)

    y_prior <- dnorm(x, d$prior_mu, d$prior_sd)
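    # Likelihood of mu, summarized through the sample mean: ybar ~ Normal(mu, sigma/sqrt(n))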
    y_like  <- dnorm(x, d$y_bar, d$sigma / sqrt(d$n))
    y_post  <- dnorm(x, d$post_mu, d$post_sd)

    ylim <- c(0, max(y_prior, y_like, y_post) * 1.15)

    par(mar = c(4.5, 4.5, 3, 1))
    plot(x, y_prior, type = "l", lwd = 2.5, col = "#e74c3c",
         xlab = expression(mu), ylab = "Density",
         main = "Prior + Likelihood = Posterior",
         ylim = ylim)
    lines(x, y_like, lwd = 2.5, col = "#3498db")
    lines(x, y_post, lwd = 3, col = "#27ae60")

    # Shade posterior
    polygon(c(x, rev(x)),
            c(y_post, rep(0, length(x))),
            col = adjustcolor("#27ae60", 0.2), border = NA)

    # True value
    abline(v = d$true_mu, lty = 2, lwd = 2, col = "#2c3e50")

    legend("topright", bty = "n", cex = 0.9,
           legend = c("Prior (your belief before data)",
                      "Likelihood (what the data says)",
                      "Posterior (updated belief)",
                      expression("True " * mu)),
           col = c("#e74c3c", "#3498db", "#27ae60", "#2c3e50"),
           lwd = c(2.5, 2.5, 3, 2),
           lty = c(1, 1, 1, 2))
  })

  output$results <- renderUI({
    d <- dat()
    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>Prior mean:</b> ", d$prior_mu, "<br>",
        "<b>Data mean:</b> ", round(d$y_bar, 3), "<br>",
        "<b>Posterior mean:</b> ", round(d$post_mu, 3), "<br>",
        "<b>Posterior SD:</b> ", round(d$post_sd, 3), "<br>",
        "<hr style='margin:8px 0'>",
        "<b>Weight on prior:</b> ", round(d$w_prior * 100, 1), "%<br>",
        "<b>Weight on data:</b> ", round((1 - d$w_prior) * 100, 1), "%<br>",
        "<small>Posterior = weighted average of prior & data</small>"
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • n = 1: the posterior leans heavily on the prior (red). A single observation can’t say much.
  • Slide n to 100: the posterior (green) collapses onto the data mean (blue). Data overwhelms the prior. With enough data, the prior doesn’t matter.
  • Set prior SD = 0.5 (strong prior) with n = 5: the posterior is pulled toward the prior. This is shrinkage — the prior is “shrinking” your estimate away from the data and toward your prior belief.
  • Set prior SD = 10 (vague prior): the posterior tracks the data almost exactly. A flat prior says “I have no opinion” and lets the data speak.
  • Watch the weight on prior in the sidebar — it shows exactly how much the posterior is a compromise between prior and data.

Shrinkage: the Bayesian superpower

Look at the “weight on prior” number in the sidebar. The posterior mean is literally a weighted average:

\[\mu_{post} = w \cdot \mu_{prior} + (1 - w) \cdot \bar{y}\]

where \(w\) depends on how confident your prior is relative to how much data you have.

This is shrinkage: the posterior “shrinks” the data estimate toward the prior. When is this useful?

  • Small samples: noisy data gets regularized toward a sensible default.
  • Many groups: estimating batting averages for 500 baseball players? Shrink extreme estimates toward the league average. A player who went 3-for-3 on opening day probably isn’t a 1.000 hitter.
  • Hierarchical models: borrow strength across groups by shrinking toward a common mean.

Shrinkage isn’t bias — it’s a bias-variance tradeoff. You add a little bias but reduce variance a lot, often improving overall accuracy.
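
A small simulation makes the tradeoff concrete. The settings below are mine (a prior that sits near, but not exactly on, the truth, rather than the app defaults); the sketch repeats the normal-normal update many times and compares the mean squared error of the raw sample mean with that of the shrunk posterior mean.

set.seed(1)

true_mu  <- 2;   sigma    <- 2;  n <- 5   # small, noisy samples
prior_mu <- 1.5; prior_sd <- 2            # a sensible but imperfect prior guess

w <- (1 / prior_sd^2) / (1 / prior_sd^2 + n / sigma^2)  # weight on the prior

reps <- 10000
raw <- shrunk <- numeric(reps)
for (r in seq_len(reps)) {
  y_bar     <- mean(rnorm(n, true_mu, sigma))
  raw[r]    <- y_bar                               # plain sample mean
  shrunk[r] <- w * prior_mu + (1 - w) * y_bar      # posterior mean (shrunk)
}

c(mse_raw    = mean((raw - true_mu)^2),
  mse_shrunk = mean((shrunk - true_mu)^2))
# the shrunk estimate trades a little bias for lower variance and a lower MSE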

Why did the math work out so cleanly?

The simulation above computes the posterior instantly — no sampling, no iteration, just a formula. That’s because we used a conjugate prior: a special prior-likelihood pair where the posterior has the same distributional form as the prior.

Here, the prior is Normal, the likelihood is Normal, and the posterior is Normal too. You just update the mean and variance:

\[\mu_{post} = \frac{\frac{\mu_0}{\sigma_0^2} + \frac{n\bar{y}}{\sigma^2}}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}}\]

Plug in numbers, get the answer. No algorithm required.
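
For example, with the app’s default prior (mean 0, SD 3), \(\sigma = 2\), \(n = 5\), and a hypothetical sample mean of \(\bar{y} = 1.8\):

mu0 <- 0;   sd0 <- 3    # prior mean and SD
sigma <- 2; n <- 5      # data noise and sample size
y_bar <- 1.8            # hypothetical sample mean

post_mu <- (mu0 / sd0^2 + n * y_bar / sigma^2) / (1 / sd0^2 + n / sigma^2)
post_sd <- 1 / sqrt(1 / sd0^2 + n / sigma^2)
round(c(post_mu = post_mu, post_sd = post_sd), 2)
# post_mu ~ 1.65, post_sd ~ 0.86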

Common conjugate pairs

| Likelihood | Conjugate prior | Posterior | Example |
|---|---|---|---|
| Normal (known \(\sigma\)) | Normal | Normal | Estimating a mean (this page) |
| Binomial | Beta | Beta | Estimating a proportion (Bayes’ Theorem) |
| Poisson | Gamma | Gamma | Estimating a rate |
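
The Binomial–Beta row is exactly the machinery behind the batting-average example above. A quick sketch, using a Beta(78, 222) prior as a stand-in for “roughly a .260 league average” (that prior is an assumption for illustration, not a number from this page):

a0 <- 78; b0 <- 222       # prior: Beta(78, 222), mean 0.26, like a league average
hits <- 3; at_bats <- 3   # opening-day data: 3 for 3

# Conjugate update: Beta prior + Binomial likelihood -> Beta posterior
a_post <- a0 + hits
b_post <- b0 + (at_bats - hits)

c(raw_average    = hits / at_bats,               # 1.000
  posterior_mean = a_post / (a_post + b_post))   # ~0.267: shrunk hard toward the prior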

The problem: most real models aren’t conjugate

Conjugacy is elegant but limited. It only works for these specific combinations. The moment your model gets realistic — logistic regression with priors on coefficients, hierarchical models with multiple levels, non-standard likelihoods — there’s no conjugate solution. The posterior is some high-dimensional surface with no closed-form expression.

What does “closed-form” mean? A closed-form solution is one you can write as a finite formula using standard operations (addition, multiplication, exponents, etc.) and evaluate directly.

Closed-form (conjugate case): the posterior mean above — plug in \(\mu_0\), \(\sigma_0\), \(n\), \(\bar{y}\), \(\sigma\), do arithmetic, get the exact answer. Done.

Not closed-form (non-conjugate case): say you want the posterior for a logistic regression coefficient \(\theta\) with a standard normal prior. Even in the stripped-down case where every observation is a success, the posterior is:

\[p(\theta \mid y) = \frac{\prod_{i=1}^n \frac{1}{1 + e^{-\theta x_i}} \cdot e^{-\theta^2/2}}{\int_{-\infty}^{\infty} \prod_{i=1}^n \frac{1}{1 + e^{-\theta x_i}} \cdot e^{-\theta^2/2} \, d\theta}\]

That integral in the denominator? There’s no formula for it. You can’t simplify it to “plug in numbers.” You’d have to numerically approximate it — which is exactly what MCMC does (it sidesteps the integral entirely by sampling).
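
For a single coefficient you can still approximate that integral by brute force: evaluate the unnormalized posterior on a fine grid and normalize numerically. A minimal sketch with made-up data (one predictor, 20 binary outcomes, true coefficient 1.2, and the full Bernoulli likelihood rather than the all-successes simplification above):

# Made-up data: one predictor, binary outcomes, true coefficient 1.2
set.seed(7)
x <- rnorm(20)
y <- rbinom(20, 1, plogis(1.2 * x))

# Log of the unnormalized posterior: logistic likelihood x standard normal prior
log_post <- function(theta) {
  sum(dbinom(y, 1, plogis(theta * x), log = TRUE)) + dnorm(theta, 0, 1, log = TRUE)
}

# Grid approximation: the sum below stands in for the intractable integral
theta_grid <- seq(-4, 6, length.out = 2001)
step <- diff(theta_grid)[1]
lp   <- sapply(theta_grid, log_post)
post <- exp(lp - max(lp))            # subtract the max for numerical stability
post <- post / sum(post * step)      # normalize so the density integrates to 1

sum(theta_grid * post * step)        # posterior mean: the MLE shrunk toward the prior at 0

This only works because \(\theta\) is one-dimensional; with ten or a hundred coefficients a grid blows up combinatorially, and that is where sampling takes over.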

That’s why MCMC exists: when you can’t write down the posterior, you sample from it instead. The progression in this course:

  • This page: conjugate priors — exact, instant posteriors (the special case)
  • MCMC: numerical sampling — posteriors for any model (the general case)
  • Hierarchical Models: the reason you need MCMC in practice