Bayesian inference as statistical mechanics

Most of the algorithms statisticians use to do Bayesian computation — simulated annealing for MAP, Hamiltonian Monte Carlo for sampling, variational inference for fast approximation — were not invented by statisticians. They came from physics, specifically from the statistical mechanics tradition that traces back to Boltzmann and Gibbs in the nineteenth century, runs through Los Alamos in the 1940s, and was finally imported into statistics in the 1980s.

The reason these methods transfer so cleanly is that the math is identical. A Bayesian posterior is the equilibrium distribution of a physical system in thermal contact with a heat bath. Everything else follows from that.

The energy landscape

In statistical mechanics, the probability that a system at temperature \(T\) is in a state with energy \(E\) is the Boltzmann distribution:

\[ P(\text{state}) \;\propto\; \exp\!\left[-\frac{E(\text{state})}{k_B T}\right]. \]

A Bayesian posterior takes the same form:

\[ p(\theta \mid \text{data}) \;\propto\; p(\text{data} \mid \theta)\, p(\theta) \;=\; \exp\!\bigl[\,\log p(\text{data} \mid \theta) + \log p(\theta)\,\bigr]. \]

Define the energy of a parameter value \(\theta\) as

\[ E(\theta) \;\equiv\; -\log p(\text{data} \mid \theta) \;-\; \log p(\theta). \]

Then the posterior is exactly a Boltzmann distribution over parameter space with \(k_B T = 1\):

\[ p(\theta \mid \text{data}) \;\propto\; \exp[-E(\theta)]. \]

The duality gives every part of the Bayesian setup a physical interpretation:

Statistics Statistical mechanics
Parameter \(\theta\) State of the system
Negative log-posterior \(-\log p(\theta \mid \text{data})\) Energy \(E(\theta)\)
MAP estimate (mode of posterior) Ground state (lowest-energy configuration)
Posterior samples Thermal fluctuations at \(T = 1\)
Tempering (anneal \(T\) down) Cooling a physical system
Log marginal likelihood Negative free energy \(-F = -(U - TS)\)
Multimodal posterior Glassy energy landscape with many local minima

This isn’t analogy. The equations are the same. That is why physicists’ methods work on statisticians’ problems.

Simulated annealing: cooling to find the MAP

If MAP estimation is finding the ground state of an energy landscape, then any physics technique for finding ground states is a candidate optimizer. The oldest is simulated annealing (Kirkpatrick, Gelatt & Vecchi 1983):

  1. Replace the posterior \(p(\theta \mid \text{data}) \propto e^{-E(\theta)}\) with a tempered posterior \(p_T(\theta) \propto e^{-E(\theta)/T}\).
  2. At high \(T\), this is flat — easy to wander around the parameter space without getting stuck in local minima.
  3. Gradually lower \(T\) toward \(0\). As \(T \to 0\), \(p_T\) concentrates entirely on the global minimum of \(E\) — the MAP.
  4. At each \(T\), run a Metropolis–Hastings chain. The slow cooling lets the chain escape local minima while it still has thermal energy, and settle into the global minimum as it freezes.

Simulated annealing is the canonical example of physics serving optimization. It’s used today in chip layout, vehicle routing, protein folding, and combinatorial problems where the energy landscape has lots of false summits.

#| standalone: true
#| viewerHeight: 580

library(shiny)

# Bimodal target "energy" landscape -- two valleys, one deeper than the other.
# E(theta) = -log( 0.6 * dnorm(theta, -2, 0.6) + 0.4 * dnorm(theta, 2.0, 0.5) )
energy <- function(theta) {
  -log(0.6 * dnorm(theta, -2.0, 0.6) + 0.4 * dnorm(theta, 2.0, 0.5))
}

ui <- fluidPage(
  titlePanel("Simulated annealing on a bimodal energy landscape"),

  sidebarLayout(
    sidebarPanel(
      width = 4,
      sliderInput("T_start", "Starting temperature:",
                  min = 0.1, max = 5, value = 2.5, step = 0.1),
      sliderInput("T_end", "Final temperature:",
                  min = 0.001, max = 1, value = 0.05, step = 0.01),
      sliderInput("n_steps", "Cooling steps:",
                  min = 200, max = 5000, value = 1500, step = 100),
      sliderInput("theta0", "Start point θ₀:",
                  min = -4, max = 4, value = 3, step = 0.1),
      actionButton("go", "Run annealing", class = "btn-primary", width = "100%"),
      htmlOutput("readout")
    ),
    mainPanel(
      width = 8,
      plotOutput("anneal_plot", height = "480px")
    )
  )
)

server <- function(input, output, session) {

  run <- eventReactive(input$go, {
    n   <- input$n_steps
    T0  <- input$T_start
    T1  <- input$T_end
    # geometric cooling schedule
    schedule <- T0 * (T1 / T0)^(seq_len(n) / n)

    theta <- numeric(n + 1)
    E_path <- numeric(n + 1)
    theta[1] <- input$theta0
    E_path[1] <- energy(theta[1])
    accepts <- 0
    for (k in seq_len(n)) {
      T <- schedule[k]
      proposal <- theta[k] + rnorm(1, 0, 0.5)
      dE <- energy(proposal) - E_path[k]
      if (dE < 0 || runif(1) < exp(-dE / T)) {
        theta[k + 1]  <- proposal
        E_path[k + 1] <- energy(proposal)
        accepts <- accepts + 1
      } else {
        theta[k + 1]  <- theta[k]
        E_path[k + 1] <- E_path[k]
      }
    }
    list(theta = theta, E = E_path, schedule = schedule,
         accept_rate = accepts / n)
  }, ignoreNULL = FALSE)

  output$anneal_plot <- renderPlot({
    r <- run()
    par(mfrow = c(2, 1), mar = c(4, 4.5, 1.5, 1))

    # Top panel: energy landscape with trajectory
    grid <- seq(-4, 4, length.out = 600)
    plot(grid, energy(grid), type = "l", lwd = 2.5, col = "#2c3e50",
         xlab = expression(theta), ylab = expression(E(theta)),
         main = "Energy landscape and annealing trajectory")
    points(r$theta[1], r$E[1], pch = 19, col = "#c0392b", cex = 1.6)
    points(tail(r$theta, 1), tail(r$E, 1), pch = 19, col = "#27ae60", cex = 1.8)
    sub <- seq(1, length(r$theta), length.out = 150)
    lines(r$theta[sub], r$E[sub], col = adjustcolor("#3498db", 0.4), lwd = 1)
    legend("topright",
           legend = c("Energy curve", "Trajectory", "Start", "End"),
           col = c("#2c3e50", adjustcolor("#3498db", 0.6), "#c0392b", "#27ae60"),
           lwd = c(2.5, 1, NA, NA), pch = c(NA, NA, 19, 19),
           bty = "n", cex = 0.95)

    # Bottom panel: temperature schedule
    plot(seq_along(r$schedule), r$schedule, type = "l", lwd = 2, col = "#8e44ad",
         xlab = "Step", ylab = "Temperature", main = "Cooling schedule (log-y)",
         log = "y")
  })

  output$readout <- renderUI({
    r <- run()
    final_theta <- tail(r$theta, 1)
    HTML(sprintf(paste(
      "<div style='margin-top:10px;font-size:13px;line-height:1.7;'>",
      "<b>Final θ:</b> %.3f<br>",
      "<b>Final energy:</b> %.3f<br>",
      "<b>Acceptance rate:</b> %.0f%%<br><br>",
      "Global minimum is near θ ≈ −2.<br>",
      "If the chain ended up near +2 instead, it got trapped in the shallower well — try a higher starting temperature.",
      "</div>"
    ), final_theta, tail(r$E, 1), 100 * r$accept_rate))
  })
}

shinyApp(ui, server)

Things to try.

  • Drop the starting temperature to 0.5. The chain becomes a local search — it slides into whichever well is closest to \(\theta_0\). Try \(\theta_0 = 3\) and watch it get stuck near the shallow minimum at \(+2\).
  • Start at \(T = 4\). The chain has plenty of thermal energy to climb over the barrier between wells. It nearly always lands at the deep minimum near \(-2\) regardless of \(\theta_0\).
  • Cool too fast (small \(n\)). The chain freezes before it finds the global minimum. Cool too slow and you waste compute. The art of simulated annealing is the cooling schedule.

Hamiltonian Monte Carlo: Newtonian sampling

Simulated annealing finds a single point. To get the whole posterior you need samples. The dominant modern sampler is Hamiltonian Monte Carlo (Duane et al. 1987, originally developed for lattice QCD), and it’s even more directly physical than simulated annealing.

The recipe:

  1. Introduce a fictitious momentum variable \(p\) alongside the parameter \(\theta\). Give it a kinetic-energy density \(K(p) = \tfrac{1}{2} p^\top M^{-1} p\) for some mass matrix \(M\) (usually identity).
  2. The total Hamiltonian is \[ H(\theta, p) \;=\; E(\theta) + K(p) \;=\; -\log p(\theta \mid \text{data}) + \tfrac{1}{2} p^\top M^{-1} p. \]
  3. To propose a new sample: sample momentum \(p \sim \mathcal{N}(0, M)\), then simulate Hamilton’s equations of motion numerically for some short trajectory using leapfrog integration (also borrowed from molecular dynamics).
  4. Accept or reject the endpoint using Metropolis–Hastings on the joint energy \(H\).

Geometrically, you’re rolling a frictionless ball along the negative log-posterior surface. The ball can climb hills (use kinetic energy to surmount barriers) and roll across long flat valleys without random-walk dithering. That makes HMC dramatically more efficient than Metropolis–Hastings in high-dimensional, correlated posteriors.

The No-U-Turn Sampler (NUTS) inside Stan and PyMC is HMC with an automated stop rule for trajectory length. Every modern Bayesian regression you fit is ultimately solving a physics problem.

See MCMC for the broader sampling landscape.

Variational inference: free-energy minimization

When MCMC is too slow, the alternative is variational inference (VI): approximate the true posterior \(p(\theta \mid x)\) by a simpler distribution \(q(\theta)\) chosen from a tractable family. The VI objective is

\[ \mathcal{L}(q) \;=\; \mathbb{E}_q\!\bigl[\,-\log p(x, \theta)\,\bigr] \;-\; H(q), \]

where \(H(q) = -\mathbb{E}_q[\log q(\theta)]\) is the entropy of \(q\). This is exactly the Helmholtz free energy

\[ F \;=\; U \;-\; T S \]

from thermodynamics — internal energy minus temperature times entropy (with \(T=1\)). Minimizing the variational objective is minimizing free energy, the central principle of equilibrium statistical mechanics.

The most common VI choice — assume the variational distribution factorizes, \(q(\theta) = \prod_i q_i(\theta_i)\) — is the mean-field approximation, originally developed by Pierre Weiss (1907) for ferromagnetic systems. The modern mean-field VI updates derived by Beal and Ghahramani in the 2000s are mathematically the same equations physicists had been writing since World War I.

Why this matters for econometrics

Three takeaways for someone who fits regressions for a living:

1. Multimodality is a real risk and there are physics tools for it. Most econometric estimators assume the objective surface is convex — a single global optimum. When it isn’t (mixture models, nonlinear systems, structural models with multiple equilibria), the “energy landscape” framing gives you a vocabulary and a toolbox. Simulated annealing, parallel tempering, and HMC-with-restarts come straight from this tradition.

2. Modern Bayesian software is doing Newtonian mechanics. When PyMC, Stan, or NumPyro samples from a posterior, the engine inside is HMC/NUTS — literally simulating ball-on-landscape physics. Knowing this changes how you diagnose problems: divergent transitions usually mean the leapfrog stepsize is too large for the curvature of your posterior, a problem that’s instantly intuitive once you picture a fast ball overshooting a sharp valley.

3. Variational methods are free-energy minimization. When you fit a variational autoencoder, a topic model, or a fast Bayesian neural net, you are minimizing the same Helmholtz free energy that statistical physicists have been minimizing since the 1870s. The reason VI sometimes underestimates posterior variance is the same reason mean-field theory misses correlations in spin glasses. The known limitations of physics models map directly onto the known limitations of variational Bayes.

The bridge runs both ways. Bayesian inference inherits the algorithms and intuitions of statistical mechanics. The same mathematical structure that governs gas molecules in a box governs the posterior over the parameters of a hedonic regression. Once you see it, it’s hard to unsee.

Did you know?

  • The Metropolis algorithm was developed in 1953 for simulating equation of state for a system of interacting hard spheres on MANIAC I, the Los Alamos computer. The five authors — Metropolis, the Rosenbluths, and the Tellers — were a mix of physicists, applied mathematicians, and computer pioneers. Bayesian statisticians didn’t widely adopt the method until Gelfand & Smith (1990).
  • Edwin Jaynes argued in the 1950s and 60s that statistical mechanics is itself an application of Bayesian inference — the Boltzmann distribution is the maximum-entropy distribution consistent with an expected-energy constraint. So in Jaynes’s view, the connection isn’t that Bayesian inference borrowed from physics; it’s that physics was Bayesian inference all along and didn’t realize it.
  • The “free energy principle” in computational neuroscience (Karl Friston) proposes that the brain itself is doing variational free energy minimization in real time. Whether or not you find that convincing, the fact that someone could make the argument tells you how transferable the free-energy framework is.