Markov Chain Monte Carlo
What is MCMC?
The name breaks down into two parts:
- Monte Carlo: using random sampling to approximate something you can’t compute exactly. Instead of solving an integral analytically, you draw random samples and use their average as an approximation. (Named after the Monte Carlo casino — it’s fundamentally about randomness.)
- Markov Chain: a sequence of random values where each value depends only on the previous one — not on the full history. The “chain” is a random walk through parameter space, where each step proposes a new value based on where you currently are.
Put them together: MCMC constructs a Markov chain whose long-run distribution equals the posterior. Run it long enough, and the samples you collect are (approximately) draws from \(p(\theta \mid y)\) — even though you never computed that distribution directly.
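To make the Monte Carlo half concrete before adding the Markov chain, here is a minimal R sketch (the target probability is just an illustration): instead of looking up \(P(X > 1)\) for a standard normal, draw samples and count.

```r
set.seed(1)
x <- rnorm(1e5)   # 100,000 random draws from a standard normal
mean(x > 1)       # Monte Carlo estimate of P(X > 1): the fraction of draws above 1
1 - pnorm(1)      # exact value (about 0.1587), for comparison
```

The only thing MCMC adds is a clever way to generate the draws when the target is a posterior you cannot sample from directly.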
Watch it happen
The chain wanders through parameter space. Early on, the histogram of visited values looks nothing like the target. But as samples accumulate, the histogram converges to the true posterior — the chain has “learned” the distribution just by random walking.
#| standalone: true
#| viewerHeight: 480
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 8px; font-size: 13px; line-height: 1.7;
}
.stats-box b { color: #2c3e50; }
.good { color: #27ae60; font-weight: bold; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("show_n", "Samples to show:",
min = 10, max = 5000, value = 50, step = 10,
animate = animationOptions(interval = 80, loop = FALSE)),
actionButton("go_intro", "New chain", class = "btn-primary", width = "100%"),
uiOutput("results_intro")
),
mainPanel(
width = 9,
fluidRow(
column(6, plotOutput("trace_intro", height = "340px")),
column(6, plotOutput("hist_intro", height = "340px"))
)
)
)
)
server <- function(input, output, session) {
chain_data <- reactive({
input$go_intro
# Target: posterior is N(2, 0.5^2) — we pretend we can't compute this
target_mu <- 2
target_sd <- 0.5
log_target <- function(x) dnorm(x, target_mu, target_sd, log = TRUE)
# Run a long chain once
n_total <- 5000
chain <- numeric(n_total)
chain[1] <- -1 # start far from the target
prop_sd <- 0.4
for (t in 2:n_total) {
proposal <- rnorm(1, chain[t - 1], prop_sd)
log_r <- log_target(proposal) - log_target(chain[t - 1])
if (log(runif(1)) < log_r) {
chain[t] <- proposal
} else {
chain[t] <- chain[t - 1]
}
}
list(chain = chain, target_mu = target_mu, target_sd = target_sd)
})
output$trace_intro <- renderPlot({
d <- chain_data()
n_show <- min(input$show_n, length(d$chain))
par(mar = c(4, 4.5, 3, 1))
plot(1:n_show, d$chain[1:n_show], type = "l",
col = "#3498db80", lwd = 0.6,
xlim = c(1, max(200, n_show)),
ylim = range(d$chain),
xlab = "Step", ylab = expression(theta),
main = "The chain explores")
# Show current position
points(n_show, d$chain[n_show], pch = 19, col = "#e74c3c", cex = 1.5)
abline(h = d$target_mu, lty = 2, col = "#27ae60", lwd = 1.5)
text(max(200, n_show) * 0.95, d$target_mu,
expression("Target " * mu), pos = 3, cex = 0.8, col = "#27ae60")
})
output$hist_intro <- renderPlot({
d <- chain_data()
n_show <- min(input$show_n, length(d$chain))
samples <- d$chain[1:n_show]
par(mar = c(4, 4.5, 3, 1))
# Span both the samples and the target's +/- 3 SD region so the hist() breaks always cover the data (the chain starts at -1)
xlim <- range(c(samples, d$target_mu - 3 * d$target_sd - 1, d$target_mu + 3 * d$target_sd + 1))
hist(samples, breaks = seq(xlim[1], xlim[2], length.out = 40),
freq = FALSE, col = "#3498db40", border = "#3498db",
xlim = xlim, ylim = c(0, 1.2),
xlab = expression(theta), main = "Histogram vs true posterior")
# True target
x_seq <- seq(xlim[1], xlim[2], length.out = 300)
lines(x_seq, dnorm(x_seq, d$target_mu, d$target_sd),
col = "#27ae60", lwd = 2.5)
legend("topright", bty = "n", cex = 0.85,
legend = c(paste0("MCMC samples (n=", n_show, ")"),
"True posterior"),
col = c("#3498db", "#27ae60"),
lwd = c(NA, 2.5), pch = c(15, NA), pt.cex = c(1.5, NA))
})
output$results_intro <- renderUI({
d <- chain_data()
n_show <- min(input$show_n, length(d$chain))
samples <- d$chain[1:n_show]
tags$div(class = "stats-box",
HTML(paste0(
"<b>Samples:</b> ", n_show, "<br>",
"<b>MCMC mean:</b> ", round(mean(samples), 3), "<br>",
"<b>True mean:</b> ", d$target_mu, "<br>",
"<b>Error:</b> <span class='good'>",
round(abs(mean(samples) - d$target_mu), 3), "</span>"
))
)
})
}
shinyApp(ui, server)
Drag the slider (or hit the play button) and watch:
- 10 samples: the histogram is jagged, nothing like the green curve. The chain hasn’t explored enough.
- 100 samples: the shape starts to emerge. The chain has found the high-probability region.
- 1000+ samples: the histogram matches the true posterior almost exactly. The chain has “learned” the distribution through random walking alone.
The chain never knew the formula for the green curve. It only knew how to evaluate the posterior at any single point and compare “is this new spot better or worse?” That’s enough.
Why do we need it?
On the Priors & Posteriors page, we used conjugate priors — special prior-likelihood pairs where the posterior has a closed-form solution. That’s elegant but limiting. Most real models don’t have conjugate posteriors:
- Logistic regression with a prior on coefficients
- Hierarchical models with multiple levels of parameters
- Any model where the posterior \(p(\theta \mid y) \propto p(y \mid \theta) \, p(\theta)\) doesn’t simplify to a known distribution
For these models, we can’t write down the posterior analytically. We need to sample from it numerically. That’s what MCMC does.
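As a concrete illustration (the data and the N(0, 2.5²) priors here are hypothetical), the unnormalized log-posterior for a simple logistic regression is easy to write down and evaluate at any point, yet it is not any named distribution you can sample from directly:

```r
# Unnormalized log-posterior: log-likelihood + log-prior (the normalizing
# constant p(y) is unknown, and that's fine for MCMC)
log_post <- function(beta, y, x) {
  eta <- beta[1] + beta[2] * x
  log_lik   <- sum(dbinom(y, size = 1, prob = plogis(eta), log = TRUE))
  log_prior <- sum(dnorm(beta, mean = 0, sd = 2.5, log = TRUE))
  log_lik + log_prior
}

# Hypothetical data, just to show that point-wise evaluation is all we need
set.seed(10)
x <- rnorm(50)
y <- rbinom(50, size = 1, prob = plogis(-0.5 + x))
log_post(c(0, 1), y, x)   # a single number: the log-posterior height at beta = (0, 1)
```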
The Metropolis-Hastings algorithm
Metropolis-Hastings is the foundational MCMC algorithm, and the core idea is beautifully simple:
- Start at some value \(\theta_0\).
- Propose a new value \(\theta^*\) from a proposal distribution \(q(\theta^* \mid \theta_t)\) — typically \(\theta^* \sim N(\theta_t, \sigma_{prop}^2)\).
- Compute the acceptance ratio: \[\alpha = \min\left(1, \, \frac{p(\theta^* \mid y) \; q(\theta_t \mid \theta^*)}{p(\theta_t \mid y) \; q(\theta^* \mid \theta_t)}\right)\]
- Accept \(\theta^*\) with probability \(\alpha\) (set \(\theta_{t+1} = \theta^*\)). Otherwise stay: \(\theta_{t+1} = \theta_t\).
- Repeat.
When the proposal is symmetric — e.g. \(q(\theta^* \mid \theta_t) = N(\theta_t, \sigma_{prop}^2)\) — the proposal ratio \(q(\theta_t \mid \theta^*) / q(\theta^* \mid \theta_t) = 1\) and the acceptance ratio simplifies to \(\alpha = \min\!\left(1,\; p(\theta^* \mid y) \,/\, p(\theta_t \mid y)\right)\). This is the original Metropolis algorithm. The Hastings (1970) generalization adds the proposal ratio so that asymmetric proposals can be used.
Key insight: you never need to compute the normalizing constant \(p(y) = \int p(y \mid \theta) \, p(\theta) \, d\theta\). The ratio \(p(\theta^* \mid y) / p(\theta_t \mid y)\) cancels it out. You only need to evaluate the unnormalized posterior — the numerator of Bayes’ theorem.
After enough iterations, the chain converges to the posterior distribution. The samples \(\theta_1, \theta_2, \ldots\) are (correlated) draws from \(p(\theta \mid y)\).
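Stripped of plotting and interactivity, the whole algorithm fits in a few lines of R. This is a sketch under the same setup as the simulation below (normal data with known SD, flat prior on \(\mu\)); the data, tuning values, and burn-in length are illustrative.

```r
set.seed(42)
y <- rnorm(20, mean = 1, sd = 2)                            # observed data, known sd = 2
log_post <- function(mu) sum(dnorm(y, mu, 2, log = TRUE))   # unnormalized log-posterior (flat prior)

n_iter <- 5000
chain <- numeric(n_iter)
chain[1] <- 0                                               # arbitrary starting value
for (t in 2:n_iter) {
  proposal  <- rnorm(1, mean = chain[t - 1], sd = 0.5)      # symmetric random-walk proposal
  log_alpha <- log_post(proposal) - log_post(chain[t - 1])  # log acceptance ratio
  if (log(runif(1)) < log_alpha) {
    chain[t] <- proposal                                    # accept
  } else {
    chain[t] <- chain[t - 1]                                # reject: stay put
  }
}
mean(chain[-(1:1000)])                                      # posterior mean, after dropping burn-in
```

Working on the log scale avoids numerical underflow, and comparing \(\log u\) with the log ratio implements the \(\min(1, \cdot)\) rule automatically: any proposal with a higher posterior is always accepted.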
Simulation 1: Metropolis-Hastings in action
We estimate the mean \(\mu\) of normally distributed data; the sampler sees only the data, not the true value you set on the slider. The proposal width controls exploration: too narrow and the chain moves slowly; too wide and most proposals are rejected.
#| standalone: true
#| viewerHeight: 650
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
.good { color: #27ae60; font-weight: bold; }
.bad { color: #e74c3c; font-weight: bold; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("true_mu", HTML("True μ:"),
min = -3, max = 3, value = 1, step = 0.5),
sliderInput("n_data", "Data points:",
min = 5, max = 100, value = 20, step = 5),
sliderInput("prop_sd", "Proposal width (SD):",
min = 0.05, max = 5, value = 0.5, step = 0.05),
sliderInput("n_iter", "Iterations:",
min = 500, max = 5000, value = 2000, step = 500),
actionButton("go", "Run chain", class = "btn-primary", width = "100%"),
uiOutput("results")
),
mainPanel(
width = 9,
fluidRow(
column(6, plotOutput("trace_plot", height = "420px")),
column(6, plotOutput("hist_plot", height = "420px"))
)
)
)
)
server <- function(input, output, session) {
dat <- reactive({
input$go
true_mu <- input$true_mu
n_data <- input$n_data
prop_sd <- input$prop_sd
n_iter <- input$n_iter
sigma <- 2 # known SD
# Generate data
y <- rnorm(n_data, mean = true_mu, sd = sigma)
# Log posterior (unnormalized): normal likelihood + flat prior
log_post <- function(mu) {
sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}
# Metropolis-Hastings
chain <- numeric(n_iter)
chain[1] <- 0 # start at 0
accepted <- 0
for (t in 2:n_iter) {
proposal <- rnorm(1, mean = chain[t - 1], sd = prop_sd)
log_ratio <- log_post(proposal) - log_post(chain[t - 1])
if (log(runif(1)) < log_ratio) {
chain[t] <- proposal
accepted <- accepted + 1
} else {
chain[t] <- chain[t - 1]
}
}
accept_rate <- accepted / (n_iter - 1)
# Analytic posterior for comparison (conjugate: flat prior + normal)
post_mean <- mean(y)
post_sd <- sigma / sqrt(n_data)
list(chain = chain, accept_rate = accept_rate,
true_mu = true_mu, post_mean = post_mean, post_sd = post_sd,
n_iter = n_iter, prop_sd = prop_sd)
})
output$trace_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
plot(d$chain, type = "l", col = "#3498db80", lwd = 0.5,
xlab = "Iteration", ylab = expression(mu),
main = "Trace Plot")
abline(h = d$true_mu, lty = 2, col = "#e74c3c", lwd = 2)
abline(h = d$post_mean, lty = 3, col = "#27ae60", lwd = 2)
legend("topright", bty = "n", cex = 0.8,
legend = c(expression("True " * mu),
"Posterior mean (analytic)"),
col = c("#e74c3c", "#27ae60"),
lty = c(2, 3), lwd = 2)
})
output$hist_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
# Discard first 20% as burn-in
burnin <- floor(d$n_iter * 0.2)
samples <- d$chain[(burnin + 1):d$n_iter]
hist(samples, breaks = 40, freq = FALSE,
col = "#3498db40", border = "#3498db",
xlab = expression(mu), main = "Posterior Distribution",
xlim = range(c(samples, d$true_mu - 0.5, d$true_mu + 0.5)))
# Analytic posterior
x_seq <- seq(min(samples) - 0.5, max(samples) + 0.5, length.out = 200)
lines(x_seq, dnorm(x_seq, d$post_mean, d$post_sd),
col = "#27ae60", lwd = 2.5)
abline(v = d$true_mu, lty = 2, col = "#e74c3c", lwd = 2)
legend("topright", bty = "n", cex = 0.8,
legend = c("MCMC samples", "Analytic posterior",
expression("True " * mu)),
col = c("#3498db", "#27ae60", "#e74c3c"),
lwd = c(NA, 2.5, 2), lty = c(NA, 1, 2),
pch = c(15, NA, NA), pt.cex = c(1.5, NA, NA))
})
output$results <- renderUI({
d <- dat()
burnin <- floor(d$n_iter * 0.2)
samples <- d$chain[(burnin + 1):d$n_iter]
rate_class <- if (d$accept_rate > 0.15 && d$accept_rate < 0.5) "good" else "bad"
rate_note <- if (d$accept_rate < 0.15) {
"Too low — proposal too wide"
} else if (d$accept_rate > 0.5) {
"Too high — proposal too narrow"
} else {
"Good range (0.15-0.50)"
}
tags$div(class = "stats-box",
HTML(paste0(
"<b>Acceptance rate:</b> <span class='", rate_class, "'>",
round(d$accept_rate * 100, 1), "%</span><br>",
"<small>", rate_note, "</small><br>",
"<hr style='margin:8px 0'>",
"<b>MCMC posterior mean:</b> ", round(mean(samples), 3), "<br>",
"<b>Analytic posterior mean:</b> ", round(d$post_mean, 3), "<br>",
"<b>True μ:</b> ", d$true_mu, "<br>",
"<hr style='margin:8px 0'>",
"<b>Proposal width:</b> ", d$prop_sd
))
)
})
}
shinyApp(ui, server)
Things to try
- Proposal width = 0.5: a reasonably tuned chain. The trace plot shows good mixing (bouncing around the posterior), and the histogram matches the analytic posterior (green curve). With the default data the acceptance rate tends to sit on the high side; widening the proposal a little pulls it down toward the sweet spot.
- Proposal width = 0.05: too narrow. The chain takes tiny steps — the trace plot shows slow, random-walk behavior. Acceptance rate is near 100% (almost every proposal is accepted because it’s barely different). The chain explores the posterior very slowly.
- Proposal width = 5: too wide. Most proposals jump far from the current value and land in low-probability regions — they get rejected. The trace plot shows long flat stretches (the chain is stuck). Acceptance rate is very low.
- The goldilocks principle: you want a proposal width that’s “just right”: large enough to explore, small enough to get accepted. The theoretical optimum for 1D is an acceptance rate around 44% (Roberts et al., 1997). The sketch below makes this concrete by sweeping the proposal width and reporting the acceptance rate at each setting.
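This rough sketch reruns a short Metropolis chain at several proposal widths and reports the acceptance rate of each (the data and the grid of widths are made up):

```r
set.seed(7)
y <- rnorm(20, mean = 1, sd = 2)
log_post <- function(mu) sum(dnorm(y, mu, 2, log = TRUE))

# Run one chain and return its acceptance rate
accept_rate <- function(prop_sd, n_iter = 5000) {
  chain <- numeric(n_iter)
  accepted <- 0
  for (t in 2:n_iter) {
    proposal <- rnorm(1, chain[t - 1], prop_sd)
    if (log(runif(1)) < log_post(proposal) - log_post(chain[t - 1])) {
      chain[t] <- proposal
      accepted <- accepted + 1
    } else {
      chain[t] <- chain[t - 1]
    }
  }
  accepted / (n_iter - 1)
}

round(sapply(c(0.05, 0.5, 1, 2, 5), accept_rate), 2)   # acceptance falls as the proposal widens
```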
Burn-in and convergence
A practical concern: the chain starts at an arbitrary value (\(\theta_0 = 0\) above). The early samples reflect the starting point, not the posterior. You need to discard these initial samples — the “burn-in” period.
How do you know the chain has converged? Run multiple chains from different starting points. If they all end up exploring the same region, you have evidence of convergence. If they’re stuck in different places, the chains haven’t converged and you need more iterations.
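The visual check can be formalized with the Gelman-Rubin statistic \(\hat{R}\), which compares between-chain and within-chain variance; values near 1 suggest the chains agree. Here is a sketch of the basic (non-split) version, assuming a matrix of post-burn-in draws with one column per chain:

```r
# Basic Gelman-Rubin R-hat: `chains` is an (iterations x chains) matrix of
# post-burn-in draws. Values close to 1 indicate the chains have mixed.
rhat <- function(chains) {
  n <- nrow(chains)
  chain_means <- colMeans(chains)
  B <- n * var(chain_means)               # between-chain variance
  W <- mean(apply(chains, 2, var))        # average within-chain variance
  var_plus <- (n - 1) / n * W + B / n     # pooled estimate of the posterior variance
  sqrt(var_plus / W)
}
```

As a rule of thumb, values above roughly 1.1 (or 1.01 under stricter modern guidance) mean you should run the chains longer.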
Simulation 2: Multiple chains and convergence
#| standalone: true
#| viewerHeight: 580
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
.good { color: #27ae60; font-weight: bold; }
.bad { color: #e74c3c; font-weight: bold; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("n_iter2", "Iterations:",
min = 200, max = 3000, value = 1000, step = 200),
sliderInput("burnin", "Burn-in (discard first %):",
min = 0, max = 50, value = 20, step = 5),
sliderInput("prop_sd2", "Proposal width:",
min = 0.1, max = 3, value = 0.5, step = 0.1),
actionButton("go2", "Run chains", class = "btn-primary", width = "100%"),
uiOutput("results2")
),
mainPanel(
width = 9,
plotOutput("multi_trace", height = "420px")
)
)
)
server <- function(input, output, session) {
dat2 <- reactive({
input$go2
n_iter <- input$n_iter2
burnin <- input$burnin / 100
prop_sd <- input$prop_sd2
sigma <- 2
true_mu <- 1.5
# Generate data once
y <- rnorm(30, mean = true_mu, sd = sigma)
log_post <- function(mu) {
sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}
# Run 4 chains from different starting points
starts <- c(-5, -2, 4, 7)
chain_cols <- c("#e74c3c", "#3498db", "#27ae60", "#f39c12")
chains <- matrix(0, nrow = n_iter, ncol = 4)
for (ch in 1:4) {
chains[1, ch] <- starts[ch]
for (t in 2:n_iter) {
proposal <- rnorm(1, chains[t - 1, ch], prop_sd)
log_r <- log_post(proposal) - log_post(chains[t - 1, ch])
if (log(runif(1)) < log_r) {
chains[t, ch] <- proposal
} else {
chains[t, ch] <- chains[t - 1, ch]
}
}
}
# Posterior (analytic)
post_mean <- mean(y)
list(chains = chains, starts = starts, chain_cols = chain_cols,
n_iter = n_iter, burnin = burnin, true_mu = true_mu,
post_mean = post_mean)
})
output$multi_trace <- renderPlot({
d <- dat2()
par(mar = c(4.5, 4.5, 3, 1))
burnin_line <- floor(d$n_iter * d$burnin)
ylim <- range(d$chains)
plot(NULL, xlim = c(1, d$n_iter), ylim = ylim,
xlab = "Iteration", ylab = expression(mu),
main = "Four Chains from Different Starting Points")
# Shade burn-in region
if (burnin_line > 0) {
rect(0, ylim[1] - 1, burnin_line, ylim[2] + 1,
col = "#f0f0f080", border = NA)
abline(v = burnin_line, lty = 2, col = "gray40", lwd = 1.5)
text(burnin_line, ylim[2], "burn-in", pos = 2,
cex = 0.8, col = "gray40")
}
for (ch in 1:4) {
lines(d$chains[, ch], col = paste0(d$chain_cols[ch], "90"),
lwd = 0.8)
}
abline(h = d$true_mu, lty = 2, col = "#2c3e50", lwd = 2)
text(d$n_iter * 0.98, d$true_mu, expression("True " * mu),
pos = 3, cex = 0.85, col = "#2c3e50")
legend("topright", bty = "n", cex = 0.75,
legend = paste0("Chain ", 1:4, " (start = ", d$starts, ")"),
col = d$chain_cols, lwd = 2)
})
output$results2 <- renderUI({
d <- dat2()
burnin_n <- floor(d$n_iter * d$burnin)
# Post-burnin means per chain
if (burnin_n < d$n_iter) {
post_samples <- d$chains[(burnin_n + 1):d$n_iter, ]
chain_means <- round(colMeans(post_samples), 3)
spread <- round(max(chain_means) - min(chain_means), 3)
} else {
chain_means <- rep(NA, 4)
spread <- NA
}
converged <- !is.na(spread) && spread < 0.3
conv_class <- if (converged) "good" else "bad"
conv_label <- if (converged) "Chains agree" else "Chains disagree"
tags$div(class = "stats-box",
HTML(paste0(
"<b>Post-burn-in means:</b><br>",
paste0("Chain ", 1:4, ": ", chain_means, collapse = "<br>"), "<br>",
"<hr style='margin:8px 0'>",
"<b>Spread:</b> <span class='", conv_class, "'>",
spread, "</span><br>",
"<span class='", conv_class, "'>", conv_label, "</span>"
))
)
})
}
shinyApp(ui, server)
Things to try
- Default settings (1000 iterations, burn-in = 20%): all four chains start at different values (-5, -2, 4, 7) but converge to the same region within ~100 iterations. After burn-in, all chain means agree. This is convergence.
- Burn-in = 0%: the early samples (from the starting points) contaminate the posterior. The chain means diverge because each chain’s average is pulled toward its start.
- Proposal width = 0.1: very slow exploration. The chains take longer to converge — you can see them creeping slowly toward the true value. With only 1000 iterations, they might not fully converge. Increase iterations to fix this.
- Proposal width = 3: the chains converge quickly but the trace plot shows many flat stretches (rejected proposals). The posterior is explored but inefficiently.
The key message
MCMC lets you compute posteriors for any model — not just conjugate ones. Specify the likelihood and the prior, and the algorithm samples from the posterior. This is what makes Bayesian inference practical for real-world problems like hierarchical models, where closed-form posteriors don’t exist.
This is also the engine behind Bayesian regression. When you type bayes: reg y x in Stata, it runs an adaptive Metropolis-Hastings sampler to draw from the posterior over the regression coefficients: the same basic algorithm you just watched above, applied to a model you already know.
Modern tools like Stan, JAGS, and PyMC automate this — you specify the model and they handle the sampling. But understanding the basics (proposal tuning, burn-in, convergence checks) helps you diagnose problems when things go wrong.
What you do with the draws
Everything above was about getting posterior draws. Here’s the payoff: once you have draws, any question about the posterior is just counting.
Suppose you run a Bayesian regression in Stan and get 4,000 draws of \(\beta\) (4 chains \(\times\) 1,000 each). Those 4,000 numbers are the posterior. To answer any question, you just look at the draws:
| Question | Code | What it does |
|---|---|---|
| What’s \(P(\beta < 0)\)? | mean(beta_draws < 0) | Fraction of draws where \(\beta < 0\) |
| What’s \(P(\beta > 0.5)\)? | mean(beta_draws > 0.5) | Fraction of draws above 0.5 |
| Posterior mean? | mean(beta_draws) | Average of the draws |
| 95% credible interval? | quantile(beta_draws, c(0.025, 0.975)) | 2.5th and 97.5th percentiles |
| Posterior SD? | sd(beta_draws) | Standard deviation of the draws |
That’s it. No formulas, no distributional assumptions on the output, no asymptotics. The posterior can be skewed, bimodal, heavy-tailed — it doesn’t matter. You just count.
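Here is the counting in action. There is no actual Stan fit on this page, so the sketch below uses simulated numbers as a stand-in for beta_draws; with a real fit you would extract the draws from the fitted object instead.

```r
set.seed(123)
beta_draws <- rnorm(4000, mean = 0.3, sd = 0.2)   # stand-in for 4,000 posterior draws of beta

mean(beta_draws < 0)                       # P(beta < 0): fraction of draws below zero
mean(beta_draws > 0.5)                     # P(beta > 0.5)
mean(beta_draws)                           # posterior mean
quantile(beta_draws, c(0.025, 0.975))      # 95% credible interval
sd(beta_draws)                             # posterior SD
```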
This is what connects MCMC to the rest of the site. The Bayes’ Theorem page shows the logic: prior \(\times\) likelihood \(\propto\) posterior. The coin-flip simulation there gives you the posterior in closed form. But for real models — regression with multiple predictors, hierarchical structures, non-conjugate priors — there is no closed form. MCMC gives you draws from that posterior, and then you count. The logic is the same; only the computation changes.
Did you know?
The Metropolis algorithm was developed at Los Alamos National Laboratory by Nicholas Metropolis, Arianna Rosenbluth, Marshall Rosenbluth, Augusta Teller, and Edward Teller (1953) — originally for simulating equations of state in statistical mechanics, not statistics. W.K. Hastings (1970) generalized it to asymmetric proposal distributions.
MCMC was named one of the top 10 algorithms of the 20th century by Computing in Science & Engineering (2000). It’s used across physics, chemistry, biology, statistics, and machine learning.
Modern Bayesian computation has largely moved beyond basic Metropolis-Hastings to Hamiltonian Monte Carlo (HMC) and its adaptive variant NUTS (Hoffman & Gelman, 2014). HMC uses gradient information to make smarter proposals, dramatically improving efficiency in high-dimensional problems. This is what Stan uses under the hood.