Bayesian Thinking in ML & AI

Same ideas, different names

The connection

The Bayesian concepts on this site — priors, posteriors, MCMC, shrinkage — are the same ideas that power modern machine learning and AI. They just go by different names.

If you understand why a prior pulls estimates toward zero, you already understand regularization. If you understand MCMC, you understand how Bayesian neural networks are trained. This page maps the ideas you’ve learned here onto their ML/AI counterparts.

Bayesian ideas under a different name

Many standard ML techniques are Bayesian ideas in disguise:

| ML technique | Bayesian equivalent | What’s happening |
| --- | --- | --- |
| L2 regularization (weight decay) | Normal prior on weights | Penalizing large weights = assuming weights come from a \(N(0, \sigma^2)\) prior. The penalty strength \(\lambda\) controls how tight the prior is. The result is the MAP estimate. |
| L1 regularization (lasso) | Laplace prior on weights | The Laplace prior has a sharp peak at zero, which is why L1 produces sparse solutions — it believes most weights should be exactly zero. |
| Dropout | Approximate Bayesian inference | Randomly dropping neurons during training approximates averaging over an ensemble of networks — a form of approximate posterior inference over the network’s weights (Gal & Ghahramani, 2016). |
| Ensemble methods (random forests, boosting) | Bayesian model averaging | Training multiple models and averaging predictions approximates integrating over model uncertainty — the Bayesian approach to the “which model?” question. See Model Comparison. |
| Early stopping | Implicit prior / regularization | Stopping training before convergence prevents weights from growing too large — similar to a prior that favors simpler models. |
| Batch normalization | Implicit regularization via noise | Adds noise during training that has a regularizing effect, similar to how priors prevent overfitting. |

The connection to shrinkage is direct: regularization is shrinkage. L2 pulls coefficients toward zero just like a hierarchical model pulls group estimates toward the grand mean.
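
To make the equivalence concrete, here is a minimal sketch (assuming NumPy and scikit-learn, with an invented dataset and penalty strength): the closed-form MAP estimate under a \(N(0, \tau^2)\) prior on the weights and an ordinary ridge fit land on the same coefficients.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy regression data (all numbers invented for illustration).
n, p = 200, 5
X = rng.normal(size=(n, p))
true_w = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
y = X @ true_w + rng.normal(scale=1.0, size=n)

# Penalty strength = prior tightness: lam = sigma^2 / tau^2, where sigma^2 is the
# noise variance and tau^2 is the variance of the N(0, tau^2) prior on each weight.
lam = 10.0

# MAP estimate under the normal prior: (X'X + lam*I)^{-1} X'y
w_map = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# L2-regularized (ridge) regression with the same penalty.
w_ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(np.allclose(w_map, w_ridge))  # True: the ridge solution *is* the MAP estimate
```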

Directly Bayesian

Some ML methods don’t just resemble Bayesian thinking — they are Bayesian:

Gaussian processes — A prior over functions, not parameters. Instead of assuming a linear model with normal errors, a GP says “I believe the true function is smooth” and lets the data reveal the shape. The posterior is a distribution over functions, giving you predictions with uncertainty bands.
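
Here is a bare-bones sketch of GP regression in NumPy, assuming a squared-exponential kernel and a handful of invented training points. Real libraries add mean functions, kernel learning, and numerical safeguards, but the conditioning step is the same.

```python
import numpy as np

def rbf(a, b, length=1.0):
    """Squared-exponential kernel: encodes the belief 'the function is smooth'."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

# A few noisy observations of an unknown function (values invented for illustration).
x_train = np.array([-4.0, -1.5, 0.0, 2.0, 3.5])
y_train = np.array([0.8, -1.0, 0.1, 0.9, -0.4])
noise = 1e-2

x_test = np.linspace(-5, 5, 200)

# Posterior over functions: condition the Gaussian prior on the observed points.
K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
K_s = rbf(x_train, x_test)
K_ss = rbf(x_test, x_test)

alpha = np.linalg.solve(K, y_train)
post_mean = K_s.T @ alpha
post_cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
post_std = np.sqrt(np.clip(np.diag(post_cov), 0.0, None))

# post_mean +/- 2 * post_std is the uncertainty band: narrow near data, wide far from it.
print(post_mean[:3], post_std[:3])
```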

Bayesian optimization — Puts a posterior over the objective function (usually via a GP), then uses that uncertainty to decide where to evaluate next. This is how OpenAI and Google tune hyperparameters — it’s far more efficient than grid search because it reasons about where the optimum probably is.
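
The loop below is a toy version of that idea, not any particular library’s implementation: a small GP surrogate plus a lower-confidence-bound rule decides where to evaluate a made-up objective next.

```python
import numpy as np

rng = np.random.default_rng(0)

def expensive_objective(x):
    """Stand-in for an expensive black box, e.g. validation loss vs. a hyperparameter."""
    return np.sin(3 * x) + 0.1 * x ** 2

def rbf(a, b, length=0.5):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

grid = np.linspace(-3, 3, 300)            # candidate hyperparameter values
x_obs = list(rng.uniform(-3, 3, size=2))  # start from two random evaluations
y_obs = [expensive_objective(x) for x in x_obs]

for _ in range(10):
    xo, yo = np.array(x_obs), np.array(y_obs)

    # GP posterior over the objective, given the evaluations so far.
    K = rbf(xo, xo) + 1e-6 * np.eye(len(xo))
    K_s = rbf(xo, grid)
    mean = K_s.T @ np.linalg.solve(K, yo)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    std = np.sqrt(np.clip(var, 0.0, None))

    # Acquisition rule (lower confidence bound): evaluate where the optimum probably is,
    # trading a promising posterior mean against unexplored, high-uncertainty regions.
    x_next = grid[np.argmin(mean - 2.0 * std)]
    x_obs.append(x_next)
    y_obs.append(expensive_objective(x_next))

print("best x found:", x_obs[int(np.argmin(y_obs))])
```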

Bayesian neural networks — Standard neural networks give point estimates for weights. BNNs put priors on every weight and compute (or approximate) the full posterior. The result: predictions that know when they’re uncertain.

Probabilistic programming — Languages like Stan, PyMC, and NumPyro let you write down any Bayesian model and automatically handle the posterior computation. The MCMC samplers you’ve already seen are what these tools use under the hood.
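
For example, the coin-flip model from earlier pages takes a few lines in PyMC (the 7-heads-in-10-flips data here is illustrative):

```python
import pymc as pm

# The coin-flip model: uniform Beta prior, binomial likelihood.
with pm.Model():
    p = pm.Beta("p", alpha=1, beta=1)                  # prior on the heads probability
    pm.Binomial("heads", n=10, p=p, observed=7)        # likelihood of the observed flips
    idata = pm.sample(1000, tune=1000, random_seed=0)  # NUTS (an HMC variant) under the hood

print(float(idata.posterior["p"].mean()))              # close to the analytic Beta(8, 4) mean, 2/3
```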

How MCMC scales to neural nets

Standard neural network training (SGD) finds a single best set of weights — that’s maximum likelihood estimation (MLE), a point estimate. A Bayesian neural network needs the full posterior over weights, which requires something like MCMC.

The challenge: MCMC methods like Metropolis-Hastings don’t scale to millions of parameters. The field has developed a progression from exact-but-slow to approximate-but-fast:

| Method | Idea | Scales to |
| --- | --- | --- |
| Metropolis-Hastings | Random walk through parameter space | ~100s of parameters |
| Hamiltonian Monte Carlo (HMC) | Uses gradients to propose better moves | ~10,000s of parameters |
| Stochastic Gradient Langevin Dynamics (SGLD) | SGD + calibrated noise = MCMC | Millions of parameters |
| Variational Inference (VI) | Approximate the posterior with a simpler distribution | Billions of parameters |

The SGLD insight is striking: if you take standard SGD and add noise of the right magnitude, you’re doing MCMC (Welling & Teh, 2011). Gradient descent with noise explores the posterior rather than collapsing to a point estimate. Every deep learning practitioner is one noise injection away from being Bayesian.
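
A sketch of that one-line change, on a toy Bayesian logistic regression (the data are simulated, and the real algorithm also anneals the step size, which is omitted here for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Bayesian logistic regression (simulated data, sizes invented for illustration).
N, batch = 500, 50
X = rng.normal(size=(N, 3))
true_w = np.array([1.0, -2.0, 0.5])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = (rng.random(N) < sigmoid(X @ true_w)).astype(float)

theta = np.zeros(3)
eps = 1e-3                                          # step size (held fixed here)
samples = []
for step in range(5000):
    idx = rng.choice(N, size=batch, replace=False)  # minibatch, exactly as in SGD
    # Minibatch estimate of the gradient of log p(theta | data): rescale the likelihood
    # term by N/batch and add the gradient of the N(0, 1) log-prior.
    grad = (N / batch) * X[idx].T @ (y[idx] - sigmoid(X[idx] @ theta)) - theta
    # Plain SGD would stop at:  theta += 0.5 * eps * grad   (a point estimate).
    # SGLD adds Gaussian noise with variance eps, turning the iterates into posterior samples.
    theta = theta + 0.5 * eps * grad + np.sqrt(eps) * rng.normal(size=3)
    samples.append(theta.copy())

posterior = np.array(samples[1000:])                # discard burn-in
print("posterior mean:", posterior.mean(axis=0))
print("posterior sd:  ", posterior.std(axis=0))
```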

Variational inference gives up on exact samples and instead finds the closest simple distribution (usually a Gaussian) to the true posterior. It’s fast enough for production but the approximation can miss important structure.
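
A deliberately crude sketch of what “closest simple distribution” means, using the coin-flip posterior: instead of optimizing the ELBO with gradients as real VI does, it just scans a grid of Gaussians and keeps the one with the smallest KL divergence to the true Beta posterior.

```python
import numpy as np
from scipy import stats

# True posterior from the coin-flip example: Beta(8, 4) (uniform prior, 7 heads / 3 tails).
x = np.linspace(1e-3, 1 - 1e-3, 2000)
dx = x[1] - x[0]
log_p = stats.beta(8, 4).logpdf(x)

# Variational idea in miniature: among all Normal(mu, sigma) candidates, keep the one
# with the smallest KL divergence to the true posterior.
best = (np.inf, None, None)
for mu in np.linspace(0.4, 0.9, 51):
    for sigma in np.linspace(0.03, 0.30, 51):
        q = stats.norm(mu, sigma).pdf(x)
        q = q / (q.sum() * dx)                              # renormalize on (0, 1)
        kl = np.sum(q * (np.log(q + 1e-12) - log_p)) * dx   # KL(q || p)
        if kl < best[0]:
            best = (kl, mu, sigma)

print("closest Gaussian: mu=%.3f, sigma=%.3f" % (best[1], best[2]))
# The Beta(8, 4) posterior is skewed; a symmetric Gaussian cannot capture that,
# which is exactly the kind of structure a variational approximation can miss.
```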

Where it matters most in AI right now

Uncertainty quantification — A standard neural network says “99% cat” for an image it’s never seen before. A Bayesian neural network says “72% cat, but I’m uncertain” — it knows what it doesn’t know. This matters for medical diagnosis, autonomous driving, and any setting where confident wrong answers are dangerous.
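
One cheap way to get that behavior from an ordinary network is MC dropout (the Gal & Ghahramani result from the table above). The sketch below uses PyTorch with an arbitrary, untrained architecture, purely to show the mechanics: keep dropout on at test time and look at the spread across forward passes.

```python
import torch
import torch.nn as nn

# A small classifier with dropout layers (architecture and sizes are arbitrary;
# the network is untrained, so this only demonstrates the mechanics).
model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(), nn.Dropout(p=0.3),
    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.3),
    nn.Linear(64, 2),
)

x = torch.randn(1, 20)     # one input, standing in for an image embedding

# Standard prediction: dropout off, one confident-looking point estimate.
model.eval()
with torch.no_grad():
    single = torch.softmax(model(x), dim=-1)

# MC dropout: keep dropout *on* at test time and average many stochastic forward passes.
# The spread across passes is an approximate measure of the model's uncertainty.
model.train()
with torch.no_grad():
    probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(100)])

print("single pass:    ", single)
print("MC-dropout mean:", probs.mean(dim=0))
print("MC-dropout std: ", probs.std(dim=0))   # large std means "I'm not sure"
```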

Large language models — Every time an LLM generates text, it computes \(P(\text{next token} | \text{context})\) — that’s posterior inference. Training is MLE (find the weights that maximize the likelihood of the training data), but usage is Bayesian updating: each token of context updates the model’s beliefs about what comes next.

Active learning — When labeling data is expensive, which examples should you label next? The Bayesian answer: the ones the model is most uncertain about. This is Bayesian decision theory — choose actions that maximize expected information gain.
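
The simplest version of that rule is uncertainty sampling: score each unlabeled example by the entropy of the model’s prediction and label the highest-scoring ones first. (Full expected-information-gain criteria also account for parameter uncertainty; the probabilities in this sketch are invented.)

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy of each predictive distribution; higher means the model is less sure."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

# Predicted class probabilities for a pool of unlabeled examples.
# (These numbers are invented; in practice they come from your current model.)
pool_probs = np.array([
    [0.98, 0.01, 0.01],   # confident: labeling this teaches us little
    [0.40, 0.35, 0.25],   # uncertain: informative to label
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],   # nearly uniform: maximally uncertain
])

scores = predictive_entropy(pool_probs)
query_order = np.argsort(-scores)    # label the most uncertain examples first
print("label these indices next:", query_order)
```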

Safety and alignment — Knowing when a model is uncertain is critical for AI safety. If a model can say “I don’t know,” it can defer to humans instead of confidently hallucinating.

The big picture

| | Frequentist ML | Bayesian ML |
| --- | --- | --- |
| What you get | Point estimates | Distributions |
| Uncertainty | Requires extra work (bootstrapping, etc.) | Built in — it’s the posterior |
| Overfitting | Regularization as ad-hoc fix | Prior as principled regularization |
| Model selection | Cross-validation | Marginal likelihood (automatic Occam’s razor) |
| Computation | Fast (SGD) | Slower (MCMC/VI) |

Production ML is mostly frequentist — it’s faster, and when you have billions of data points, the prior doesn’t matter much anyway. But the frontier — safety, calibration, small-data problems, uncertainty-aware systems — is increasingly Bayesian.

And the two are converging. Regularization is a prior. Dropout is approximate inference. SGD + noise is MCMC.

The bridge back to this site:

  • The coin-flip posterior you simulated? That’s Bayesian updating in its simplest form — the same machinery a Gaussian process applies to entire functions instead of a single parameter.
  • Shrinkage pulling group estimates toward the mean? That’s regularization pulling weights toward zero.
  • MCMC sampling from a posterior? That’s how Bayesian neural networks are trained.
  • MAP estimation choosing the posterior mode? That’s L2-regularized regression.

You already understand the ideas. ML just runs them at scale.