Priors & Posteriors

What is Bayesian inference, really?

Imagine you’re trying to estimate something — say, the average effect of a tutoring program on test scores. In frequentist statistics, you collect data, compute a point estimate, and that’s your answer.

In Bayesian statistics, you do something different:

  1. Start with a prior — what you believed before seeing data. Maybe from past studies, expert opinion, or just “I have no idea” (a flat prior).
  2. Observe data — the likelihood tells you how probable the data is for each possible value of the parameter.
  3. Combine them — Bayes’ theorem multiplies the prior by the likelihood to give you the posterior: your updated belief after seeing the data.

\[\underbrace{P(\theta \mid \text{data})}_{\text{posterior}} \propto \underbrace{P(\text{data} \mid \theta)}_{\text{likelihood}} \times \underbrace{P(\theta)}_{\text{prior}}\]
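
Before the app, here is a minimal R sketch of the same update done numerically, on a grid of candidate values for \(\theta\). The prior, the noise level, and the three observations are made up for illustration:

# Grid approximation of Bayes' theorem for a normal mean
theta <- seq(-10, 10, length.out = 1000)  # candidate values of theta
width <- theta[2] - theta[1]              # grid spacing
prior <- dnorm(theta, mean = 0, sd = 3)   # prior belief: N(0, 3^2)

y <- c(1.8, 2.4, 1.1)                     # observed data (made up)
# Likelihood: how probable the whole sample is at each candidate theta
likelihood <- sapply(theta, function(t) prod(dnorm(y, mean = t, sd = 2)))

# Posterior: prior times likelihood, normalized to integrate to 1
posterior <- prior * likelihood
posterior <- posterior / sum(posterior * width)
theta[which.max(posterior)]               # posterior mode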

The simulation below makes this tangible. You’re estimating the true mean of a normal distribution. Set a prior, generate data, and watch the posterior form.

#| standalone: true
#| viewerHeight: 600

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("true_mu", HTML("True &mu; (unknown to you):"),
                  min = -5, max = 5, value = 2, step = 0.5),

      sliderInput("prior_mu", "Prior mean:",
                  min = -5, max = 5, value = 0, step = 0.5),

      sliderInput("prior_sd", "Prior SD (certainty):",
                  min = 0.5, max = 10, value = 3, step = 0.5),

      sliderInput("n", "Sample size (data):",
                  min = 1, max = 200, value = 5, step = 1),

      sliderInput("sigma", HTML("Data noise (&sigma;):"),
                  min = 0.5, max = 5, value = 2, step = 0.5),

      actionButton("go", "New draw", class = "btn-primary", width = "100%"),

      uiOutput("results")
    ),

    mainPanel(
      width = 9,
      plotOutput("posterior_plot", height = "450px")
    )
  )
)

server <- function(input, output, session) {

  dat <- reactive({
    input$go  # take a dependency on the button so "New draw" resamples

    true_mu  <- input$true_mu
    prior_mu <- input$prior_mu
    prior_sd <- input$prior_sd
    n        <- input$n
    sigma    <- input$sigma

    # Generate data
    y <- rnorm(n, mean = true_mu, sd = sigma)
    y_bar <- mean(y)

    # Posterior (conjugate normal-normal, known sigma): precisions add,
    # and the posterior mean is a precision-weighted average
    prior_prec <- 1 / prior_sd^2
    data_prec  <- n / sigma^2
    post_prec  <- prior_prec + data_prec
    post_sd    <- 1 / sqrt(post_prec)
    post_mu    <- (prior_prec * prior_mu + data_prec * y_bar) / post_prec

    # Shrinkage weight on prior
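    # (w_prior -> 1 with a tight prior or little data; -> 0 as data accumulate)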
    w_prior <- prior_prec / post_prec

    list(true_mu = true_mu, prior_mu = prior_mu, prior_sd = prior_sd,
         y_bar = y_bar, sigma = sigma, n = n,
         post_mu = post_mu, post_sd = post_sd, w_prior = w_prior)
  })

  output$posterior_plot <- renderPlot({
    d <- dat()

    xmin <- min(d$prior_mu - 3.5 * d$prior_sd, d$post_mu - 4 * d$post_sd, d$true_mu - 2)
    xmax <- max(d$prior_mu + 3.5 * d$prior_sd, d$post_mu + 4 * d$post_sd, d$true_mu + 2)
    x <- seq(xmin, xmax, length.out = 500)

    y_prior <- dnorm(x, d$prior_mu, d$prior_sd)
    # Likelihood curve for mu: since y_bar ~ N(mu, sigma^2 / n), its spread
    # is the standard error sigma / sqrt(n), which narrows as n grows
    y_like  <- dnorm(x, d$y_bar, d$sigma / sqrt(d$n))
    y_post  <- dnorm(x, d$post_mu, d$post_sd)

    ylim <- c(0, max(y_prior, y_like, y_post) * 1.15)

    par(mar = c(4.5, 4.5, 3, 1))
    plot(x, y_prior, type = "l", lwd = 2.5, col = "#e74c3c",
         xlab = expression(mu), ylab = "Density",
         main = "Prior + Likelihood = Posterior",
         ylim = ylim)
    lines(x, y_like, lwd = 2.5, col = "#3498db")
    lines(x, y_post, lwd = 3, col = "#27ae60")

    # Shade posterior
    polygon(c(x, rev(x)),
            c(y_post, rep(0, length(x))),
            col = adjustcolor("#27ae60", 0.2), border = NA)

    # True value
    abline(v = d$true_mu, lty = 2, lwd = 2, col = "#2c3e50")

    legend("topright", bty = "n", cex = 0.9,
           legend = c("Prior (your belief before data)",
                      "Likelihood (what the data says)",
                      "Posterior (updated belief)",
                      expression("True " * mu)),
           col = c("#e74c3c", "#3498db", "#27ae60", "#2c3e50"),
           lwd = c(2.5, 2.5, 3, 2),
           lty = c(1, 1, 1, 2))
  })

  output$results <- renderUI({
    d <- dat()
    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>Prior mean:</b> ", d$prior_mu, "<br>",
        "<b>Data mean:</b> ", round(d$y_bar, 3), "<br>",
        "<b>Posterior mean:</b> ", round(d$post_mu, 3), "<br>",
        "<b>Posterior SD:</b> ", round(d$post_sd, 3), "<br>",
        "<hr style='margin:8px 0'>",
        "<b>Weight on prior:</b> ", round(d$w_prior * 100, 1), "%<br>",
        "<b>Weight on data:</b> ", round((1 - d$w_prior) * 100, 1), "%<br>",
        "<small>Posterior = weighted average of prior & data</small>"
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • n = 1: the posterior is mostly the prior (red). You barely have data.
  • Slide n to 100: the posterior (green) collapses onto the data mean (blue). Data overwhelms the prior. With enough data, any reasonable prior washes out.
  • Set prior SD = 0.5 (strong prior) with n = 5: the posterior is pulled toward the prior. This is shrinkage — the prior is “shrinking” your estimate away from the data and toward your prior belief.
  • Set prior SD = 10 (vague prior): the posterior tracks the data almost exactly. A vague, nearly flat prior says “I have no opinion” and lets the data speak.
  • Watch the weight on prior in the sidebar — it shows exactly how much the posterior is a compromise between prior and data.

Shrinkage: the Bayesian superpower

Look at the “weight on prior” number in the sidebar. The posterior mean is literally a weighted average:

\[\mu_{\text{post}} = w \cdot \mu_{\text{prior}} + (1 - w) \cdot \bar{y}\]

where \(w = \dfrac{1/\sigma_{\text{prior}}^2}{1/\sigma_{\text{prior}}^2 + n/\sigma^2}\) is the prior’s precision as a share of the total: the more confident your prior (small \(\sigma_{\text{prior}}\)) and the less data you have (small \(n\)), the larger \(w\).
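
To see the weight respond to sample size outside the app, here is a quick sketch using the same conjugate formulas as the server code (the prior SD and noise level are just illustrative values):

# How the weight on the prior falls as data accumulate
prior_sd <- 3; sigma <- 2                    # illustrative values
n <- c(1, 5, 20, 100)

prior_prec <- 1 / prior_sd^2
data_prec  <- n / sigma^2
w <- prior_prec / (prior_prec + data_prec)   # weight on the prior

round(data.frame(n = n, weight_on_prior = w), 3)
# roughly 0.31 at n = 1, 0.08 at n = 5, 0.02 at n = 20, 0.004 at n = 100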

This is shrinkage: the posterior “shrinks” the data estimate toward the prior. When is this useful?

  • Small samples: noisy data gets regularized toward a sensible default.
  • Many groups: estimating batting averages for 500 baseball players? Shrink extreme estimates toward the league average. A player who went 3-for-3 on opening day probably isn’t a 1.000 hitter (see the sketch after this list).
  • Hierarchical models: borrow strength across groups by shrinking toward a common mean.
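
Here is one way to do that shrinkage, sketched with simulated players and a beta-binomial model rather than the app’s normal-normal one; the Beta(80, 220) prior stands in for the league distribution and is made up for illustration:

# Empirical-Bayes-style shrinkage of batting averages toward the league mean
set.seed(42)
n_players  <- 500
true_skill <- rbeta(n_players, 80, 220)              # true averages near .267
at_bats    <- sample(3:600, n_players, replace = TRUE)
hits       <- rbinom(n_players, at_bats, true_skill)

raw <- hits / at_bats                                # raw batting average

# Beta(a, b) prior playing the role of the league average
a <- 80; b <- 220
shrunk <- (hits + a) / (at_bats + a + b)             # posterior mean

# The opening-day 3-for-3 player: raw = 1.000, shrunk to about .274
(3 + a) / (3 + a + b)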

Shrinkage does introduce bias, but that’s the point: it’s a bias-variance tradeoff. You accept a little bias in exchange for a large reduction in variance, which often lowers overall error (mean squared error).
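
A quick simulation makes the tradeoff concrete (the true mean, noise, and prior below are made up, with the prior deliberately centered away from the truth):

# MSE of the raw sample mean vs. the posterior mean in a small sample
set.seed(1)
true_mu <- 2; sigma <- 2; n <- 5
prior_mu <- 0; prior_sd <- 3                 # mildly informative, off-center

w <- (1 / prior_sd^2) / (1 / prior_sd^2 + n / sigma^2)  # weight on prior

y_bar <- replicate(10000, mean(rnorm(n, true_mu, sigma)))
post  <- w * prior_mu + (1 - w) * y_bar

c(mse_raw  = mean((y_bar - true_mu)^2),      # about sigma^2 / n = 0.8
  mse_post = mean((post - true_mu)^2))       # lower, despite the bias toward 0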