Markov Chain Monte Carlo
What is MCMC?
The name breaks down into two parts:
- Monte Carlo: using random sampling to approximate something you can’t compute exactly. Instead of solving an integral analytically, you draw random samples and use their average as an approximation. (Named after the Monte Carlo casino — it’s fundamentally about randomness.)
- Markov Chain: a sequence of random values where each value depends only on the previous one — not on the full history. The “chain” is a random walk through parameter space, where each step proposes a new value based on where you currently are.
Put them together: MCMC constructs a Markov chain whose long-run distribution equals the posterior. Run it long enough, and the samples you collect are (approximately) draws from \(p(\theta \mid y)\) — even though you never computed that distribution directly.
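To make the Monte Carlo half concrete before adding the Markov chain, here is a minimal R sketch (the target probability is just an illustration): instead of looking up \(P(X > 1)\) for a standard normal, draw samples and count.

```r
set.seed(1)
x <- rnorm(1e5)   # 100,000 random draws from a standard normal
mean(x > 1)       # Monte Carlo estimate of P(X > 1): the fraction of draws above 1
1 - pnorm(1)      # exact value (about 0.1587), for comparison
```

The only thing MCMC adds is a clever way to generate the draws when the target is a posterior you cannot sample from directly.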
Watch it happen
The chain wanders through parameter space. Early on, the histogram of visited values looks nothing like the target. But as samples accumulate, the histogram converges to the true posterior — the chain has “learned” the distribution just by random walking.
#| standalone: true
#| viewerHeight: 480
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 8px; font-size: 13px; line-height: 1.7;
}
.stats-box b { color: #2c3e50; }
.good { color: #27ae60; font-weight: bold; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("show_n", "Samples to show:",
min = 10, max = 5000, value = 50, step = 10,
animate = animationOptions(interval = 80, loop = FALSE)),
actionButton("go_intro", "New chain", class = "btn-primary", width = "100%"),
uiOutput("results_intro")
),
mainPanel(
width = 9,
fluidRow(
column(6, plotOutput("trace_intro", height = "340px")),
column(6, plotOutput("hist_intro", height = "340px"))
)
)
)
)
server <- function(input, output, session) {
chain_data <- reactive({
input$go_intro
# Target: posterior is N(2, 0.5^2) — we pretend we can't compute this
target_mu <- 2
target_sd <- 0.5
log_target <- function(x) dnorm(x, target_mu, target_sd, log = TRUE)
# Run a long chain once
n_total <- 5000
chain <- numeric(n_total)
chain[1] <- -1 # start far from the target
prop_sd <- 0.4
for (t in 2:n_total) {
proposal <- rnorm(1, chain[t - 1], prop_sd)
log_r <- log_target(proposal) - log_target(chain[t - 1])
if (log(runif(1)) < log_r) {
chain[t] <- proposal
} else {
chain[t] <- chain[t - 1]
}
}
list(chain = chain, target_mu = target_mu, target_sd = target_sd)
})
output$trace_intro <- renderPlot({
d <- chain_data()
n_show <- min(input$show_n, length(d$chain))
par(mar = c(4, 4.5, 3, 1))
plot(1:n_show, d$chain[1:n_show], type = "l",
col = "#3498db80", lwd = 0.6,
xlim = c(1, max(200, n_show)),
ylim = range(d$chain),
xlab = "Step", ylab = expression(theta),
main = "The chain explores")
# Show current position
points(n_show, d$chain[n_show], pch = 19, col = "#e74c3c", cex = 1.5)
abline(h = d$target_mu, lty = 2, col = "#27ae60", lwd = 1.5)
text(max(200, n_show) * 0.95, d$target_mu,
expression("Target " * mu), pos = 3, cex = 0.8, col = "#27ae60")
})
output$hist_intro <- renderPlot({
d <- chain_data()
n_show <- min(input$show_n, length(d$chain))
samples <- d$chain[1:n_show]
par(mar = c(4, 4.5, 3, 1))
# Span both the samples and the target's +/- 3 SD region so the hist() breaks always cover the data (the chain starts at -1)
xlim <- range(c(samples, d$target_mu - 3 * d$target_sd - 1, d$target_mu + 3 * d$target_sd + 1))
hist(samples, breaks = seq(xlim[1], xlim[2], length.out = 40),
freq = FALSE, col = "#3498db40", border = "#3498db",
xlim = xlim, ylim = c(0, 1.2),
xlab = expression(theta), main = "Histogram vs true posterior")
# True target
x_seq <- seq(xlim[1], xlim[2], length.out = 300)
lines(x_seq, dnorm(x_seq, d$target_mu, d$target_sd),
col = "#27ae60", lwd = 2.5)
legend("topright", bty = "n", cex = 0.85,
legend = c(paste0("MCMC samples (n=", n_show, ")"),
"True posterior"),
col = c("#3498db", "#27ae60"),
lwd = c(NA, 2.5), pch = c(15, NA), pt.cex = c(1.5, NA))
})
output$results_intro <- renderUI({
d <- chain_data()
n_show <- min(input$show_n, length(d$chain))
samples <- d$chain[1:n_show]
tags$div(class = "stats-box",
HTML(paste0(
"<b>Samples:</b> ", n_show, "<br>",
"<b>MCMC mean:</b> ", round(mean(samples), 3), "<br>",
"<b>True mean:</b> ", d$target_mu, "<br>",
"<b>Error:</b> <span class='good'>",
round(abs(mean(samples) - d$target_mu), 3), "</span>"
))
)
})
}
shinyApp(ui, server)
Drag the slider (or hit the play button) and watch:
- 10 samples: the histogram is jagged, nothing like the green curve. The chain hasn’t explored enough.
- 100 samples: the shape starts to emerge. The chain has found the high-probability region.
- 1000+ samples: the histogram matches the true posterior almost exactly. The chain has “learned” the distribution through random walking alone.
The chain never knew the formula for the green curve. It only knew how to evaluate the posterior at any single point and compare “is this new spot better or worse?” That’s enough.
Why do we need it?
On the Priors & Posteriors page, we used conjugate priors — special prior-likelihood pairs where the posterior has a closed-form solution. That’s elegant but limiting. Most real models don’t have conjugate posteriors:
- Logistic regression with a prior on coefficients
- Hierarchical models with multiple levels of parameters
- Any model where the posterior \(p(\theta \mid y) \propto p(y \mid \theta) \, p(\theta)\) doesn’t simplify to a known distribution
For these models, we can’t write down the posterior analytically. We need to sample from it numerically. That’s what MCMC does.
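As a concrete illustration (the data and the N(0, 2.5²) priors here are hypothetical), the unnormalized log-posterior for a simple logistic regression is easy to write down and evaluate at any point, yet it is not any named distribution you can sample from directly:

```r
# Unnormalized log-posterior: log-likelihood + log-prior (the normalizing
# constant p(y) is unknown, and that's fine for MCMC)
log_post <- function(beta, y, x) {
  eta <- beta[1] + beta[2] * x
  log_lik   <- sum(dbinom(y, size = 1, prob = plogis(eta), log = TRUE))
  log_prior <- sum(dnorm(beta, mean = 0, sd = 2.5, log = TRUE))
  log_lik + log_prior
}

# Hypothetical data, just to show that point-wise evaluation is all we need
set.seed(10)
x <- rnorm(50)
y <- rbinom(50, size = 1, prob = plogis(-0.5 + x))
log_post(c(0, 1), y, x)   # a single number: the log-posterior height at beta = (0, 1)
```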
The Metropolis-Hastings algorithm
Metropolis-Hastings is the foundational MCMC algorithm, and the core idea is beautifully simple:
- Start at some value \(\theta_0\).
- Propose a new value \(\theta^*\) from a proposal distribution \(q(\theta^* \mid \theta_t)\) — typically \(\theta^* \sim N(\theta_t, \sigma_{prop}^2)\).
- Compute the acceptance ratio: \[\alpha = \min\left(1, \, \frac{p(\theta^* \mid y) \; q(\theta_t \mid \theta^*)}{p(\theta_t \mid y) \; q(\theta^* \mid \theta_t)}\right)\]
- Accept \(\theta^*\) with probability \(\alpha\) (set \(\theta_{t+1} = \theta^*\)). Otherwise stay: \(\theta_{t+1} = \theta_t\).
- Repeat.
When the proposal is symmetric — e.g. \(q(\theta^* \mid \theta_t) = N(\theta_t, \sigma_{prop}^2)\) — the proposal ratio \(q(\theta_t \mid \theta^*) / q(\theta^* \mid \theta_t) = 1\) and the acceptance ratio simplifies to \(\alpha = \min\!\left(1,\; p(\theta^* \mid y) \,/\, p(\theta_t \mid y)\right)\). This is the original Metropolis algorithm. The Hastings (1970) generalization adds the proposal ratio so that asymmetric proposals can be used.
Key insight: you never need to compute the normalizing constant \(p(y) = \int p(y \mid \theta) \, p(\theta) \, d\theta\). The ratio \(p(\theta^* \mid y) / p(\theta_t \mid y)\) cancels it out. You only need to evaluate the unnormalized posterior — the numerator of Bayes’ theorem.
After enough iterations, the chain converges to the posterior distribution. The samples \(\theta_1, \theta_2, \ldots\) are (correlated) draws from \(p(\theta \mid y)\).
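Stripped of plotting and interactivity, the whole algorithm fits in a few lines of R. This is a sketch under the same setup as the simulation below (normal data with known SD, flat prior on \(\mu\)); the data, tuning values, and burn-in length are illustrative.

```r
set.seed(42)
y <- rnorm(20, mean = 1, sd = 2)                            # observed data, known sd = 2
log_post <- function(mu) sum(dnorm(y, mu, 2, log = TRUE))   # unnormalized log-posterior (flat prior)

n_iter <- 5000
chain <- numeric(n_iter)
chain[1] <- 0                                               # arbitrary starting value
for (t in 2:n_iter) {
  proposal  <- rnorm(1, mean = chain[t - 1], sd = 0.5)      # symmetric random-walk proposal
  log_alpha <- log_post(proposal) - log_post(chain[t - 1])  # log acceptance ratio
  if (log(runif(1)) < log_alpha) {
    chain[t] <- proposal                                    # accept
  } else {
    chain[t] <- chain[t - 1]                                # reject: stay put
  }
}
mean(chain[-(1:1000)])                                      # posterior mean, after dropping burn-in
```

Working on the log scale avoids numerical underflow, and comparing \(\log u\) with the log ratio implements the \(\min(1, \cdot)\) rule automatically: any proposal with a higher posterior is always accepted.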
Simulation 1: Metropolis-Hastings in action
We estimate the mean \(\mu\) of normally distributed data; the sampler sees only the data, not the true value you set on the slider. The proposal width controls exploration: too narrow and the chain moves slowly; too wide and most proposals are rejected.
#| standalone: true
#| viewerHeight: 650
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
.good { color: #27ae60; font-weight: bold; }
.bad { color: #e74c3c; font-weight: bold; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("true_mu", HTML("True μ:"),
min = -3, max = 3, value = 1, step = 0.5),
sliderInput("n_data", "Data points:",
min = 5, max = 100, value = 20, step = 5),
sliderInput("prop_sd", "Proposal width (SD):",
min = 0.05, max = 5, value = 0.5, step = 0.05),
sliderInput("n_iter", "Iterations:",
min = 500, max = 5000, value = 2000, step = 500),
actionButton("go", "Run chain", class = "btn-primary", width = "100%"),
uiOutput("results")
),
mainPanel(
width = 9,
fluidRow(
column(6, plotOutput("trace_plot", height = "420px")),
column(6, plotOutput("hist_plot", height = "420px"))
)
)
)
)
server <- function(input, output, session) {
dat <- reactive({
input$go
true_mu <- input$true_mu
n_data <- input$n_data
prop_sd <- input$prop_sd
n_iter <- input$n_iter
sigma <- 2 # known SD
# Generate data
y <- rnorm(n_data, mean = true_mu, sd = sigma)
# Log posterior (unnormalized): normal likelihood + flat prior
log_post <- function(mu) {
sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}
# Metropolis-Hastings
chain <- numeric(n_iter)
chain[1] <- 0 # start at 0
accepted <- 0
for (t in 2:n_iter) {
proposal <- rnorm(1, mean = chain[t - 1], sd = prop_sd)
log_ratio <- log_post(proposal) - log_post(chain[t - 1])
if (log(runif(1)) < log_ratio) {
chain[t] <- proposal
accepted <- accepted + 1
} else {
chain[t] <- chain[t - 1]
}
}
accept_rate <- accepted / (n_iter - 1)
# Analytic posterior for comparison (conjugate: flat prior + normal)
post_mean <- mean(y)
post_sd <- sigma / sqrt(n_data)
list(chain = chain, accept_rate = accept_rate,
true_mu = true_mu, post_mean = post_mean, post_sd = post_sd,
n_iter = n_iter, prop_sd = prop_sd)
})
output$trace_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
plot(d$chain, type = "l", col = "#3498db80", lwd = 0.5,
xlab = "Iteration", ylab = expression(mu),
main = "Trace Plot")
abline(h = d$true_mu, lty = 2, col = "#e74c3c", lwd = 2)
abline(h = d$post_mean, lty = 3, col = "#27ae60", lwd = 2)
legend("topright", bty = "n", cex = 0.8,
legend = c(expression("True " * mu),
"Posterior mean (analytic)"),
col = c("#e74c3c", "#27ae60"),
lty = c(2, 3), lwd = 2)
})
output$hist_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
# Discard first 20% as burn-in
burnin <- floor(d$n_iter * 0.2)
samples <- d$chain[(burnin + 1):d$n_iter]
hist(samples, breaks = 40, freq = FALSE,
col = "#3498db40", border = "#3498db",
xlab = expression(mu), main = "Posterior Distribution",
xlim = range(c(samples, d$true_mu - 0.5, d$true_mu + 0.5)))
# Analytic posterior
x_seq <- seq(min(samples) - 0.5, max(samples) + 0.5, length.out = 200)
lines(x_seq, dnorm(x_seq, d$post_mean, d$post_sd),
col = "#27ae60", lwd = 2.5)
abline(v = d$true_mu, lty = 2, col = "#e74c3c", lwd = 2)
legend("topright", bty = "n", cex = 0.8,
legend = c("MCMC samples", "Analytic posterior",
expression("True " * mu)),
col = c("#3498db", "#27ae60", "#e74c3c"),
lwd = c(NA, 2.5, 2), lty = c(NA, 1, 2),
pch = c(15, NA, NA), pt.cex = c(1.5, NA, NA))
})
output$results <- renderUI({
d <- dat()
burnin <- floor(d$n_iter * 0.2)
samples <- d$chain[(burnin + 1):d$n_iter]
rate_class <- if (d$accept_rate > 0.15 && d$accept_rate < 0.5) "good" else "bad"
rate_note <- if (d$accept_rate < 0.15) {
"Too low — proposal too wide"
} else if (d$accept_rate > 0.5) {
"Too high — proposal too narrow"
} else {
"Good range (0.15-0.50)"
}
tags$div(class = "stats-box",
HTML(paste0(
"<b>Acceptance rate:</b> <span class='", rate_class, "'>",
round(d$accept_rate * 100, 1), "%</span><br>",
"<small>", rate_note, "</small><br>",
"<hr style='margin:8px 0'>",
"<b>MCMC posterior mean:</b> ", round(mean(samples), 3), "<br>",
"<b>Analytic posterior mean:</b> ", round(d$post_mean, 3), "<br>",
"<b>True μ:</b> ", d$true_mu, "<br>",
"<hr style='margin:8px 0'>",
"<b>Proposal width:</b> ", d$prop_sd
))
)
})
}
shinyApp(ui, server)
Things to try
- Proposal width = 0.5: a reasonably tuned chain. The trace plot shows good mixing (bouncing around the posterior), and the histogram matches the analytic posterior (green curve). With the default data the acceptance rate tends to sit on the high side; widening the proposal a little pulls it down toward the sweet spot.
- Proposal width = 0.05: too narrow. The chain takes tiny steps — the trace plot shows slow, random-walk behavior. Acceptance rate is near 100% (almost every proposal is accepted because it’s barely different). The chain explores the posterior very slowly.
- Proposal width = 5: too wide. Most proposals jump far from the current value and land in low-probability regions — they get rejected. The trace plot shows long flat stretches (the chain is stuck). Acceptance rate is very low.
- The goldilocks principle: you want a proposal width that’s “just right”: large enough to explore, small enough to get accepted. The theoretical optimum for 1D is an acceptance rate around 44% (Roberts et al., 1997). The sketch below makes this concrete by sweeping the proposal width and reporting the acceptance rate at each setting.
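This rough sketch reruns a short Metropolis chain at several proposal widths and reports the acceptance rate of each (the data and the grid of widths are made up):

```r
set.seed(7)
y <- rnorm(20, mean = 1, sd = 2)
log_post <- function(mu) sum(dnorm(y, mu, 2, log = TRUE))

# Run one chain and return its acceptance rate
accept_rate <- function(prop_sd, n_iter = 5000) {
  chain <- numeric(n_iter)
  accepted <- 0
  for (t in 2:n_iter) {
    proposal <- rnorm(1, chain[t - 1], prop_sd)
    if (log(runif(1)) < log_post(proposal) - log_post(chain[t - 1])) {
      chain[t] <- proposal
      accepted <- accepted + 1
    } else {
      chain[t] <- chain[t - 1]
    }
  }
  accepted / (n_iter - 1)
}

round(sapply(c(0.05, 0.5, 1, 2, 5), accept_rate), 2)   # acceptance falls as the proposal widens
```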
Burn-in and convergence
A practical concern: the chain starts at an arbitrary value (\(\theta_0 = 0\) above). The early samples reflect the starting point, not the posterior. You need to discard these initial samples — the “burn-in” period.
How do you know the chain has converged? Run multiple chains from different starting points. If they all end up exploring the same region, you have evidence of convergence. If they’re stuck in different places, the chains haven’t converged and you need more iterations.
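The visual check can be formalized with the Gelman-Rubin statistic \(\hat{R}\), which compares between-chain and within-chain variance; values near 1 suggest the chains agree. Here is a sketch of the basic (non-split) version, assuming a matrix of post-burn-in draws with one column per chain:

```r
# Basic Gelman-Rubin R-hat: `chains` is an (iterations x chains) matrix of
# post-burn-in draws. Values close to 1 indicate the chains have mixed.
rhat <- function(chains) {
  n <- nrow(chains)
  chain_means <- colMeans(chains)
  B <- n * var(chain_means)               # between-chain variance
  W <- mean(apply(chains, 2, var))        # average within-chain variance
  var_plus <- (n - 1) / n * W + B / n     # pooled estimate of the posterior variance
  sqrt(var_plus / W)
}
```

As a rule of thumb, values above roughly 1.1 (or 1.01 under stricter modern guidance) mean you should run the chains longer.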
Simulation 2: Multiple chains and convergence
#| standalone: true
#| viewerHeight: 580
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
.good { color: #27ae60; font-weight: bold; }
.bad { color: #e74c3c; font-weight: bold; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("n_iter2", "Iterations:",
min = 200, max = 3000, value = 1000, step = 200),
sliderInput("burnin", "Burn-in (discard first %):",
min = 0, max = 50, value = 20, step = 5),
sliderInput("prop_sd2", "Proposal width:",
min = 0.1, max = 3, value = 0.5, step = 0.1),
actionButton("go2", "Run chains", class = "btn-primary", width = "100%"),
uiOutput("results2")
),
mainPanel(
width = 9,
plotOutput("multi_trace", height = "420px")
)
)
)
server <- function(input, output, session) {
dat2 <- reactive({
input$go2
n_iter <- input$n_iter2
burnin <- input$burnin / 100
prop_sd <- input$prop_sd2
sigma <- 2
true_mu <- 1.5
# Generate data once
y <- rnorm(30, mean = true_mu, sd = sigma)
log_post <- function(mu) {
sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}
# Run 4 chains from different starting points
starts <- c(-5, -2, 4, 7)
chain_cols <- c("#e74c3c", "#3498db", "#27ae60", "#f39c12")
chains <- matrix(0, nrow = n_iter, ncol = 4)
for (ch in 1:4) {
chains[1, ch] <- starts[ch]
for (t in 2:n_iter) {
proposal <- rnorm(1, chains[t - 1, ch], prop_sd)
log_r <- log_post(proposal) - log_post(chains[t - 1, ch])
if (log(runif(1)) < log_r) {
chains[t, ch] <- proposal
} else {
chains[t, ch] <- chains[t - 1, ch]
}
}
}
# Posterior (analytic)
post_mean <- mean(y)
list(chains = chains, starts = starts, chain_cols = chain_cols,
n_iter = n_iter, burnin = burnin, true_mu = true_mu,
post_mean = post_mean)
})
output$multi_trace <- renderPlot({
d <- dat2()
par(mar = c(4.5, 4.5, 3, 1))
burnin_line <- floor(d$n_iter * d$burnin)
ylim <- range(d$chains)
plot(NULL, xlim = c(1, d$n_iter), ylim = ylim,
xlab = "Iteration", ylab = expression(mu),
main = "Four Chains from Different Starting Points")
# Shade burn-in region
if (burnin_line > 0) {
rect(0, ylim[1] - 1, burnin_line, ylim[2] + 1,
col = "#f0f0f080", border = NA)
abline(v = burnin_line, lty = 2, col = "gray40", lwd = 1.5)
text(burnin_line, ylim[2], "burn-in", pos = 2,
cex = 0.8, col = "gray40")
}
for (ch in 1:4) {
lines(d$chains[, ch], col = paste0(d$chain_cols[ch], "90"),
lwd = 0.8)
}
abline(h = d$true_mu, lty = 2, col = "#2c3e50", lwd = 2)
text(d$n_iter * 0.98, d$true_mu, expression("True " * mu),
pos = 3, cex = 0.85, col = "#2c3e50")
legend("topright", bty = "n", cex = 0.75,
legend = paste0("Chain ", 1:4, " (start = ", d$starts, ")"),
col = d$chain_cols, lwd = 2)
})
output$results2 <- renderUI({
d <- dat2()
burnin_n <- floor(d$n_iter * d$burnin)
# Post-burnin means per chain
if (burnin_n < d$n_iter) {
post_samples <- d$chains[(burnin_n + 1):d$n_iter, ]
chain_means <- round(colMeans(post_samples), 3)
spread <- round(max(chain_means) - min(chain_means), 3)
} else {
chain_means <- rep(NA, 4)
spread <- NA
}
converged <- !is.na(spread) && spread < 0.3
conv_class <- if (converged) "good" else "bad"
conv_label <- if (converged) "Chains agree" else "Chains disagree"
tags$div(class = "stats-box",
HTML(paste0(
"<b>Post-burn-in means:</b><br>",
paste0("Chain ", 1:4, ": ", chain_means, collapse = "<br>"), "<br>",
"<hr style='margin:8px 0'>",
"<b>Spread:</b> <span class='", conv_class, "'>",
spread, "</span><br>",
"<span class='", conv_class, "'>", conv_label, "</span>"
))
)
})
}
shinyApp(ui, server)
Things to try
- Default settings (1000 iterations, burn-in = 20%): all four chains start at different values (-5, -2, 4, 7) but converge to the same region within ~100 iterations. After burn-in, all chain means agree. This is convergence.
- Burn-in = 0%: the early samples (from the starting points) contaminate the posterior. The chain means diverge because each chain’s average is pulled toward its start.
- Proposal width = 0.1: very slow exploration. The chains take longer to converge — you can see them creeping slowly toward the true value. With only 1000 iterations, they might not fully converge. Increase iterations to fix this.
- Proposal width = 3: the chains converge quickly but the trace plot shows many flat stretches (rejected proposals). The posterior is explored but inefficiently.
The key message
MCMC lets you compute posteriors for any model — not just conjugate ones. Specify the likelihood and the prior, and the algorithm samples from the posterior. This is what makes Bayesian inference practical for real-world problems like hierarchical models, where closed-form posteriors don’t exist.
This is also the engine behind Bayesian regression. When you type bayes: reg y x in Stata, it runs an adaptive Metropolis-Hastings sampler to draw from the posterior over the regression coefficients: the same basic algorithm you just watched above, applied to a model you already know.
Modern tools like Stan, JAGS, and PyMC automate this — you specify the model and they handle the sampling. But understanding the basics (proposal tuning, burn-in, convergence checks) helps you diagnose problems when things go wrong.
What you do with the draws
Everything above was about getting posterior draws. Here’s the payoff: once you have draws, any question about the posterior is just counting.
Suppose you run a Bayesian regression in Stan and get 4,000 draws of \(\beta\) (4 chains \(\times\) 1,000 each). Those 4,000 numbers are the posterior. To answer any question, you just look at the draws:
| Question | Code | What it does |
|---|---|---|
| What’s \(P(\beta < 0)\)? | mean(beta_draws < 0) | Fraction of draws where \(\beta < 0\) |
| What’s \(P(\beta > 0.5)\)? | mean(beta_draws > 0.5) | Fraction of draws above 0.5 |
| Posterior mean? | mean(beta_draws) | Average of the draws |
| 95% credible interval? | quantile(beta_draws, c(0.025, 0.975)) | 2.5th and 97.5th percentiles |
| Posterior SD? | sd(beta_draws) | Standard deviation of the draws |
That’s it. No formulas, no distributional assumptions on the output, no asymptotics. The posterior can be skewed, bimodal, heavy-tailed — it doesn’t matter. You just count.
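Here is the counting in action. There is no actual Stan fit on this page, so the sketch below uses simulated numbers as a stand-in for beta_draws; with a real fit you would extract the draws from the fitted object instead.

```r
set.seed(123)
beta_draws <- rnorm(4000, mean = 0.3, sd = 0.2)   # stand-in for 4,000 posterior draws of beta

mean(beta_draws < 0)                       # P(beta < 0): fraction of draws below zero
mean(beta_draws > 0.5)                     # P(beta > 0.5)
mean(beta_draws)                           # posterior mean
quantile(beta_draws, c(0.025, 0.975))      # 95% credible interval
sd(beta_draws)                             # posterior SD
```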
This is what connects MCMC to the rest of the site. The Bayes’ Theorem page shows the logic: prior \(\times\) likelihood \(\propto\) posterior. The coin-flip simulation there gives you the posterior in closed form. But for real models — regression with multiple predictors, hierarchical structures, non-conjugate priors — there is no closed form. MCMC gives you draws from that posterior, and then you count. The logic is the same; only the computation changes.
Did you know?
The Metropolis algorithm was developed at Los Alamos National Laboratory by Nicholas Metropolis, Arianna Rosenbluth, Marshall Rosenbluth, Augusta Teller, and Edward Teller (1953) — originally for simulating equations of state in statistical mechanics, not statistics. W.K. Hastings (1970) generalized it to asymmetric proposal distributions.
MCMC was named one of the top 10 algorithms of the 20th century by Computing in Science & Engineering (2000). It’s used across physics, chemistry, biology, statistics, and machine learning.
Modern Bayesian computation has largely moved beyond basic Metropolis-Hastings to Hamiltonian Monte Carlo (HMC) and its adaptive variant NUTS (Hoffman & Gelman, 2014). HMC uses gradient information to make smarter proposals, dramatically improving efficiency in high-dimensional problems. This is what Stan uses under the hood.