Priors & Posteriors
What is Bayesian inference, really?
Imagine you’re trying to estimate something — say, the average effect of a tutoring program on test scores. In frequentist statistics, you collect data, compute a point estimate, and that’s your answer.
In Bayesian statistics, you do something different:
- Start with a prior — what you believed before seeing data. Maybe from past studies, expert opinion, or just “I have no idea” (a flat prior).
- Observe data — the likelihood tells you how probable the data is for each possible value of the parameter.
- Combine them — Bayes’ theorem multiplies the prior by the likelihood to give you the posterior: your updated belief after seeing the data.
\[\underbrace{P(\theta \mid \text{data})}_{\text{posterior}} \propto \underbrace{P(\text{data} \mid \theta)}_{\text{likelihood}} \times \underbrace{P(\theta)}_{\text{prior}}\]
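You can see that multiplication directly with a minimal grid approximation in R: lay out candidate values of the parameter, evaluate the prior and the likelihood at each one, multiply, and normalize. The prior, the data vector, and the noise level below are made up purely for illustration.

```r
# Posterior by grid approximation: discretize mu, multiply prior by likelihood, normalize.
mu_grid <- seq(-5, 5, length.out = 1001)             # candidate values of the mean
prior   <- dnorm(mu_grid, mean = 0, sd = 3)          # prior belief: Normal(0, 3)
y       <- c(1.8, 2.4, 1.1, 3.0, 2.2)                # observed data (hypothetical)
sigma   <- 2                                         # data noise, assumed known
lik  <- sapply(mu_grid, function(m) prod(dnorm(y, mean = m, sd = sigma)))
post <- prior * lik                                  # Bayes' theorem, up to a constant
post <- post / sum(post * (mu_grid[2] - mu_grid[1])) # normalize so it integrates to 1
mu_grid[which.max(post)]                             # posterior mode
```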
The simulation below makes this tangible. You’re estimating the true mean of a normal distribution. Set a prior, generate data, and watch the posterior form.
#| standalone: true
#| viewerHeight: 600
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("true_mu", HTML("True μ (unknown to you):"),
min = -5, max = 5, value = 2, step = 0.5),
sliderInput("prior_mu", "Prior mean:",
min = -5, max = 5, value = 0, step = 0.5),
sliderInput("prior_sd", "Prior SD (certainty):",
min = 0.5, max = 10, value = 3, step = 0.5),
sliderInput("n", "Sample size (data):",
min = 1, max = 200, value = 5, step = 1),
sliderInput("sigma", HTML("Data noise (σ):"),
min = 0.5, max = 5, value = 2, step = 0.5),
actionButton("go", "New draw", class = "btn-primary", width = "100%"),
uiOutput("results")
),
mainPanel(
width = 9,
plotOutput("posterior_plot", height = "450px")
)
)
)
server <- function(input, output, session) {
dat <- reactive({
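# Reading input$go here makes this reactive re-run (drawing a fresh sample) on each button click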
input$go
true_mu <- input$true_mu
prior_mu <- input$prior_mu
prior_sd <- input$prior_sd
n <- input$n
sigma <- input$sigma
# Generate data
y <- rnorm(n, mean = true_mu, sd = sigma)
y_bar <- mean(y)
# Posterior (conjugate normal-normal)
prior_prec <- 1 / prior_sd^2
data_prec <- n / sigma^2
post_prec <- prior_prec + data_prec
post_sd <- 1 / sqrt(post_prec)
post_mu <- (prior_prec * prior_mu + data_prec * y_bar) / post_prec
# Shrinkage weight on prior
w_prior <- prior_prec / post_prec
list(true_mu = true_mu, prior_mu = prior_mu, prior_sd = prior_sd,
y_bar = y_bar, sigma = sigma, n = n,
post_mu = post_mu, post_sd = post_sd, w_prior = w_prior)
})
output$posterior_plot <- renderPlot({
d <- dat()
xmin <- min(d$prior_mu - 3.5 * d$prior_sd, d$post_mu - 4 * d$post_sd, d$true_mu - 2)
xmax <- max(d$prior_mu + 3.5 * d$prior_sd, d$post_mu + 4 * d$post_sd, d$true_mu + 2)
x <- seq(xmin, xmax, length.out = 500)
y_prior <- dnorm(x, d$prior_mu, d$prior_sd)
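# As a function of mu, the likelihood is proportional to a Normal density centered at y_bar with SD sigma/sqrt(n)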
y_like <- dnorm(x, d$y_bar, d$sigma / sqrt(d$n))
y_post <- dnorm(x, d$post_mu, d$post_sd)
ylim <- c(0, max(y_prior, y_like, y_post) * 1.15)
par(mar = c(4.5, 4.5, 3, 1))
plot(x, y_prior, type = "l", lwd = 2.5, col = "#e74c3c",
xlab = expression(mu), ylab = "Density",
main = "Prior + Likelihood = Posterior",
ylim = ylim)
lines(x, y_like, lwd = 2.5, col = "#3498db")
lines(x, y_post, lwd = 3, col = "#27ae60")
# Shade posterior
polygon(c(x, rev(x)),
c(y_post, rep(0, length(x))),
col = adjustcolor("#27ae60", 0.2), border = NA)
# True value
abline(v = d$true_mu, lty = 2, lwd = 2, col = "#2c3e50")
legend("topright", bty = "n", cex = 0.9,
legend = c("Prior (your belief before data)",
"Likelihood (what the data says)",
"Posterior (updated belief)",
expression("True " * mu)),
col = c("#e74c3c", "#3498db", "#27ae60", "#2c3e50"),
lwd = c(2.5, 2.5, 3, 2),
lty = c(1, 1, 1, 2))
})
output$results <- renderUI({
d <- dat()
tags$div(class = "stats-box",
HTML(paste0(
"<b>Prior mean:</b> ", d$prior_mu, "<br>",
"<b>Data mean:</b> ", round(d$y_bar, 3), "<br>",
"<b>Posterior mean:</b> ", round(d$post_mu, 3), "<br>",
"<b>Posterior SD:</b> ", round(d$post_sd, 3), "<br>",
"<hr style='margin:8px 0'>",
"<b>Weight on prior:</b> ", round(d$w_prior * 100, 1), "%<br>",
"<b>Weight on data:</b> ", round((1 - d$w_prior) * 100, 1), "%<br>",
"<small>Posterior = weighted average of prior & data</small>"
))
)
})
}
shinyApp(ui, server)
Things to try
- n = 1: the posterior still leans noticeably on the prior (red). You barely have any data.
- Slide n to 100: the posterior (green) collapses onto the data mean (blue). Data overwhelms the prior. With enough data, the prior doesn’t matter.
- Set prior SD = 0.5 (strong prior) with n = 5: the posterior is pulled toward the prior. This is shrinkage — the prior is “shrinking” your estimate away from the data and toward your prior belief.
- Set prior SD = 10 (vague prior): the posterior tracks the data almost exactly. A flat prior says “I have no opinion” and lets the data speak.
- Watch the weight on prior in the sidebar — it shows exactly how much the posterior is a compromise between prior and data.
Shrinkage: the Bayesian superpower
Look at the “weight on prior” number in the sidebar. The posterior mean is literally a weighted average:
\[\mu_{post} = w \cdot \mu_{prior} + (1 - w) \cdot \bar{y}\]
where \(w\) depends on how confident your prior is relative to how much data you have.
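Concretely, with prior SD \(\sigma_0\), data noise \(\sigma\), and sample size \(n\), the weight is the prior's share of the total precision (this is the `w_prior` quantity computed in the app's server code):

\[w = \frac{1/\sigma_0^2}{1/\sigma_0^2 + n/\sigma^2}\]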
This is shrinkage: the posterior “shrinks” the data estimate toward the prior. When is this useful?
- Small samples: noisy data gets regularized toward a sensible default.
- Many groups: estimating batting averages for 500 baseball players? Shrink extreme estimates toward the league average. A player who went 3-for-3 on opening day probably isn't a 1.000 hitter (a rough sketch of this follows below).
- Hierarchical models: borrow strength across groups by shrinking toward a common mean.
Shrinkage isn’t a defect; it’s a bias-variance tradeoff. You accept a little bias in exchange for a large reduction in variance, which often improves overall accuracy.
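Here is a rough sketch of the baseball case, using the Beta-Binomial pair from the table below. The prior is a made-up Beta distribution standing in for league-wide information, not a fitted model:

```r
# Shrinking a 3-for-3 opening day toward a league-average prior (illustrative numbers).
# Beta(a, b) prior centered near 0.250, worth roughly 100 at-bats of prior information.
a <- 25; b <- 75
hits <- 3; at_bats <- 3
post_mean <- (a + hits) / (a + b + at_bats)            # conjugate Beta-Binomial update
c(raw = hits / at_bats, shrunk = round(post_mean, 3))  # 1.000 vs. roughly 0.27
```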
Why did the math work out so cleanly?
The simulation above computes the posterior instantly — no sampling, no iteration, just a formula. That’s because we used a conjugate prior: a special prior-likelihood pair where the posterior has the same distributional form as the prior.
Here, the prior is Normal, the likelihood is Normal, and the posterior is Normal too. You just update the mean and variance:
\[\mu_{post} = \frac{\frac{\mu_0}{\sigma_0^2} + \frac{n\bar{y}}{\sigma^2}}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}}\]
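The posterior variance comes from the same precisions, added and inverted (the `post_prec` and `post_sd` lines in the app's code):

\[\sigma_{post}^2 = \left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\right)^{-1}\]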
Plug in numbers, get the answer. No algorithm required.
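As a minimal standalone sketch, here is that update with example numbers in place of the app's sliders (the prior, noise, and sample mean below are made up):

```r
# Same conjugate normal-normal update as in the app, with example numbers instead of sliders.
prior_mu <- 0; prior_sd <- 3       # prior mean and SD
sigma <- 2; n <- 20; y_bar <- 1.8  # known noise, sample size, observed sample mean (made up)
prior_prec <- 1 / prior_sd^2
data_prec  <- n / sigma^2
post_mu <- (prior_prec * prior_mu + data_prec * y_bar) / (prior_prec + data_prec)
post_sd <- 1 / sqrt(prior_prec + data_prec)
c(mean = round(post_mu, 3), sd = round(post_sd, 3))    # roughly 1.76 and 0.44
```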
Common conjugate pairs
| Likelihood | Conjugate prior | Posterior | Example |
|---|---|---|---|
| Normal (known \(\sigma\)) | Normal | Normal | Estimating a mean (this page) |
| Binomial | Beta | Beta | Estimating a proportion (Bayes’ Theorem) |
| Poisson | Gamma | Gamma | Estimating a rate |
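The other rows are equally mechanical. As a rough sketch of the last row (all numbers invented for illustration), a Gamma(shape, rate) prior on a Poisson rate updates by adding the total count to the shape and the number of observations to the rate:

```r
# Gamma-Poisson conjugate update for a rate (hypothetical counts).
a <- 2; b <- 1            # Gamma(shape = a, rate = b) prior on the event rate
y <- c(3, 5, 4, 2, 6)     # observed counts per interval
a_post <- a + sum(y)      # shape gains the total count
b_post <- b + length(y)   # rate gains the number of intervals
a_post / b_post           # posterior mean of the rate (about 3.7)
```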
The problem: most real models aren’t conjugate
Conjugacy is elegant but limited. It only works for these specific combinations. The moment your model gets realistic — logistic regression with priors on coefficients, hierarchical models with multiple levels, non-standard likelihoods — there’s no conjugate solution. The posterior is some high-dimensional surface with no closed-form expression.
That’s why MCMC exists: when you can’t write down the posterior, you sample from it instead. The progression in this course:
- This page: conjugate priors — exact, instant posteriors (the special case)
- MCMC: numerical sampling — posteriors for any model (the general case)
- Hierarchical Models: the reason you need MCMC in practice