Bayesian Thinking in ML & AI

Same ideas, different names

The connection

The Bayesian concepts on this site — priors, posteriors, MCMC, shrinkage — are the same ideas that power modern machine learning and AI. They just go by different names.

If you understand why a prior pulls estimates toward zero, you already understand regularization. If you understand MCMC, you understand how Bayesian neural networks are trained. This page maps the ideas you’ve learned here onto their ML/AI counterparts.

Bayesian ideas under a different name

Many standard ML techniques are Bayesian ideas in disguise:

| ML technique | Bayesian equivalent | What’s happening |
| --- | --- | --- |
| L2 regularization (weight decay) | Normal prior on weights | Penalizing large weights = assuming weights come from a \(N(0, \sigma^2)\) prior. The penalty strength \(\lambda\) controls how tight the prior is. The result is the MAP estimate. |
| L1 regularization (lasso) | Laplace prior on weights | The Laplace prior has a sharp peak at zero, which is why L1 produces sparse solutions — it believes most weights should be exactly zero. |
| Dropout | Approximate Bayesian inference | Randomly dropping neurons during training approximates averaging over an ensemble of networks — a form of approximate posterior inference over the network’s weights (Gal & Ghahramani, 2016). |
| Ensemble methods (random forests, boosting) | Bayesian model averaging | Training multiple models and averaging predictions approximates integrating over model uncertainty — the Bayesian approach to the “which model?” question. See Model Comparison. |
| Early stopping | Implicit prior / regularization | Stopping training before convergence prevents weights from growing too large — similar to a prior that favors simpler models. |
| Batch normalization | Implicit regularization via noise | Adds noise during training that has a regularizing effect, similar to how priors prevent overfitting. |

The connection to shrinkage is direct: regularization is shrinkage. L2 pulls coefficients toward zero just like a hierarchical model pulls group estimates toward the grand mean.
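
To make the equivalence concrete, here is a minimal sketch (assuming NumPy and scikit-learn, with an invented dataset and penalty strength): the closed-form MAP estimate under a \(N(0, \tau^2)\) prior on the weights and an ordinary ridge fit land on the same coefficients.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy regression data (all numbers invented for illustration).
n, p = 200, 5
X = rng.normal(size=(n, p))
true_w = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
y = X @ true_w + rng.normal(scale=1.0, size=n)

# Penalty strength = prior tightness: lam = sigma^2 / tau^2, where sigma^2 is the
# noise variance and tau^2 is the variance of the N(0, tau^2) prior on each weight.
lam = 10.0

# MAP estimate under the normal prior: (X'X + lam*I)^{-1} X'y
w_map = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# L2-regularized (ridge) regression with the same penalty.
w_ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(np.allclose(w_map, w_ridge))  # True: the ridge solution *is* the MAP estimate
```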

Directly Bayesian

Some ML methods don’t just resemble Bayesian thinking — they are Bayesian:

Gaussian processes — A prior over functions, not parameters. Instead of assuming a linear model with normal errors, a GP says “I believe the true function is smooth” and lets the data reveal the shape. The posterior is a distribution over functions, giving you predictions with uncertainty bands.
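
Here is a bare-bones sketch of GP regression in NumPy, assuming a squared-exponential kernel and a handful of invented training points. Real libraries add mean functions, kernel learning, and numerical safeguards, but the conditioning step is the same.

```python
import numpy as np

def rbf(a, b, length=1.0):
    """Squared-exponential kernel: encodes the belief 'the function is smooth'."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

# A few noisy observations of an unknown function (values invented for illustration).
x_train = np.array([-4.0, -1.5, 0.0, 2.0, 3.5])
y_train = np.array([0.8, -1.0, 0.1, 0.9, -0.4])
noise = 1e-2

x_test = np.linspace(-5, 5, 200)

# Posterior over functions: condition the Gaussian prior on the observed points.
K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
K_s = rbf(x_train, x_test)
K_ss = rbf(x_test, x_test)

alpha = np.linalg.solve(K, y_train)
post_mean = K_s.T @ alpha
post_cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
post_std = np.sqrt(np.clip(np.diag(post_cov), 0.0, None))

# post_mean +/- 2 * post_std is the uncertainty band: narrow near data, wide far from it.
print(post_mean[:3], post_std[:3])
```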

Bayesian optimization — Puts a posterior over the objective function (usually via a GP), then uses that uncertainty to decide where to evaluate next. This is how OpenAI and Google tune hyperparameters — it’s far more efficient than grid search because it reasons about where the optimum probably is.
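
The loop below is a toy version of that idea, not any particular library’s implementation: a small GP surrogate plus a lower-confidence-bound rule decides where to evaluate a made-up objective next.

```python
import numpy as np

rng = np.random.default_rng(0)

def expensive_objective(x):
    """Stand-in for an expensive black box, e.g. validation loss vs. a hyperparameter."""
    return np.sin(3 * x) + 0.1 * x ** 2

def rbf(a, b, length=0.5):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

grid = np.linspace(-3, 3, 300)            # candidate hyperparameter values
x_obs = list(rng.uniform(-3, 3, size=2))  # start from two random evaluations
y_obs = [expensive_objective(x) for x in x_obs]

for _ in range(10):
    xo, yo = np.array(x_obs), np.array(y_obs)

    # GP posterior over the objective, given the evaluations so far.
    K = rbf(xo, xo) + 1e-6 * np.eye(len(xo))
    K_s = rbf(xo, grid)
    mean = K_s.T @ np.linalg.solve(K, yo)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    std = np.sqrt(np.clip(var, 0.0, None))

    # Acquisition rule (lower confidence bound): evaluate where the optimum probably is,
    # trading a promising posterior mean against unexplored, high-uncertainty regions.
    x_next = grid[np.argmin(mean - 2.0 * std)]
    x_obs.append(x_next)
    y_obs.append(expensive_objective(x_next))

print("best x found:", x_obs[int(np.argmin(y_obs))])
```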

Bayesian neural networks — Standard neural networks give point estimates for weights. BNNs put priors on every weight and compute (or approximate) the full posterior. The result: predictions that know when they’re uncertain.

Probabilistic programming — Languages like Stan, PyMC, and NumPyro let you write down any Bayesian model and automatically handle the posterior computation. The MCMC samplers you’ve already seen are what these tools use under the hood.
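
For example, the coin-flip model from earlier pages takes a few lines in PyMC (the 7-heads-in-10-flips data here is illustrative):

```python
import pymc as pm

# The coin-flip model: uniform Beta prior, binomial likelihood.
with pm.Model():
    p = pm.Beta("p", alpha=1, beta=1)                  # prior on the heads probability
    pm.Binomial("heads", n=10, p=p, observed=7)        # likelihood of the observed flips
    idata = pm.sample(1000, tune=1000, random_seed=0)  # NUTS (an HMC variant) under the hood

print(float(idata.posterior["p"].mean()))              # close to the analytic Beta(8, 4) mean, 2/3
```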

How MCMC scales to neural nets

Standard neural network training (SGD) finds a single best set of weights — that’s maximum likelihood estimation (MLE), a point estimate. A Bayesian neural network needs the full posterior over weights, which requires something like MCMC.

The challenge: MCMC methods like Metropolis-Hastings don’t scale to millions of parameters. The field has developed a progression from exact-but-slow to approximate-but-fast:

| Method | Idea | Scales to |
| --- | --- | --- |
| Metropolis-Hastings | Random walk through parameter space | ~100s of parameters |
| Hamiltonian Monte Carlo (HMC) | Uses gradients to propose better moves | ~10,000s of parameters |
| Stochastic Gradient Langevin Dynamics (SGLD) | SGD + calibrated noise = MCMC | Millions of parameters |
| Variational Inference (VI) | Approximate the posterior with a simpler distribution | Billions of parameters |

The SGLD insight is striking: if you take standard SGD and add noise of the right magnitude, you’re doing MCMC (Welling & Teh, 2011). Gradient descent with noise explores the posterior rather than collapsing to a point estimate. Every deep learning practitioner is one noise injection away from being Bayesian.
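
A sketch of that one-line change, on a toy Bayesian logistic regression (the data are simulated, and the real algorithm also anneals the step size, which is omitted here for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Bayesian logistic regression (simulated data, sizes invented for illustration).
N, batch = 500, 50
X = rng.normal(size=(N, 3))
true_w = np.array([1.0, -2.0, 0.5])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = (rng.random(N) < sigmoid(X @ true_w)).astype(float)

theta = np.zeros(3)
eps = 1e-3                                          # step size (held fixed here)
samples = []
for step in range(5000):
    idx = rng.choice(N, size=batch, replace=False)  # minibatch, exactly as in SGD
    # Minibatch estimate of the gradient of log p(theta | data): rescale the likelihood
    # term by N/batch and add the gradient of the N(0, 1) log-prior.
    grad = (N / batch) * X[idx].T @ (y[idx] - sigmoid(X[idx] @ theta)) - theta
    # Plain SGD would stop at:  theta += 0.5 * eps * grad   (a point estimate).
    # SGLD adds Gaussian noise with variance eps, turning the iterates into posterior samples.
    theta = theta + 0.5 * eps * grad + np.sqrt(eps) * rng.normal(size=3)
    samples.append(theta.copy())

posterior = np.array(samples[1000:])                # discard burn-in
print("posterior mean:", posterior.mean(axis=0))
print("posterior sd:  ", posterior.std(axis=0))
```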

Variational inference gives up on exact samples and instead finds the closest simple distribution (usually a Gaussian) to the true posterior. It’s fast enough for production but the approximation can miss important structure.
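
A deliberately crude sketch of what “closest simple distribution” means, using the coin-flip posterior: instead of optimizing the ELBO with gradients as real VI does, it just scans a grid of Gaussians and keeps the one with the smallest KL divergence to the true Beta posterior.

```python
import numpy as np
from scipy import stats

# True posterior from the coin-flip example: Beta(8, 4) (uniform prior, 7 heads / 3 tails).
x = np.linspace(1e-3, 1 - 1e-3, 2000)
dx = x[1] - x[0]
log_p = stats.beta(8, 4).logpdf(x)

# Variational idea in miniature: among all Normal(mu, sigma) candidates, keep the one
# with the smallest KL divergence to the true posterior.
best = (np.inf, None, None)
for mu in np.linspace(0.4, 0.9, 51):
    for sigma in np.linspace(0.03, 0.30, 51):
        q = stats.norm(mu, sigma).pdf(x)
        q = q / (q.sum() * dx)                              # renormalize on (0, 1)
        kl = np.sum(q * (np.log(q + 1e-12) - log_p)) * dx   # KL(q || p)
        if kl < best[0]:
            best = (kl, mu, sigma)

print("closest Gaussian: mu=%.3f, sigma=%.3f" % (best[1], best[2]))
# The Beta(8, 4) posterior is skewed; a symmetric Gaussian cannot capture that,
# which is exactly the kind of structure a variational approximation can miss.
```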

Where it matters most in AI right now

Uncertainty quantification — A standard neural network says “99% cat” for an image it’s never seen before. A Bayesian neural network says “72% cat, but I’m uncertain” — it knows what it doesn’t know. This matters for medical diagnosis, autonomous driving, and any setting where confident wrong answers are dangerous.
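
One cheap way to get that behavior from an ordinary network is MC dropout (the Gal & Ghahramani result from the table above). The sketch below uses PyTorch with an arbitrary, untrained architecture, purely to show the mechanics: keep dropout on at test time and look at the spread across forward passes.

```python
import torch
import torch.nn as nn

# A small classifier with dropout layers (architecture and sizes are arbitrary;
# the network is untrained, so this only demonstrates the mechanics).
model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(), nn.Dropout(p=0.3),
    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.3),
    nn.Linear(64, 2),
)

x = torch.randn(1, 20)     # one input, standing in for an image embedding

# Standard prediction: dropout off, one confident-looking point estimate.
model.eval()
with torch.no_grad():
    single = torch.softmax(model(x), dim=-1)

# MC dropout: keep dropout *on* at test time and average many stochastic forward passes.
# The spread across passes is an approximate measure of the model's uncertainty.
model.train()
with torch.no_grad():
    probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(100)])

print("single pass:    ", single)
print("MC-dropout mean:", probs.mean(dim=0))
print("MC-dropout std: ", probs.std(dim=0))   # large std means "I'm not sure"
```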

Large language models — Every time an LLM generates text, it computes \(P(\text{next token} | \text{context})\) — that’s posterior inference. Training is MLE (find the weights that maximize the likelihood of the training data), but usage is Bayesian updating: each token of context updates the model’s beliefs about what comes next.

Active learning — When labeling data is expensive, which examples should you label next? The Bayesian answer: the ones the model is most uncertain about. This is Bayesian decision theory — choose actions that maximize expected information gain.
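
The simplest version of that rule is uncertainty sampling: score each unlabeled example by the entropy of the model’s prediction and label the highest-scoring ones first. (Full expected-information-gain criteria also account for parameter uncertainty; the probabilities in this sketch are invented.)

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy of each predictive distribution; higher means the model is less sure."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

# Predicted class probabilities for a pool of unlabeled examples.
# (These numbers are invented; in practice they come from your current model.)
pool_probs = np.array([
    [0.98, 0.01, 0.01],   # confident: labeling this teaches us little
    [0.40, 0.35, 0.25],   # uncertain: informative to label
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],   # nearly uniform: maximally uncertain
])

scores = predictive_entropy(pool_probs)
query_order = np.argsort(-scores)    # label the most uncertain examples first
print("label these indices next:", query_order)
```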

Safety and alignment — Knowing when a model is uncertain is critical for AI safety. If a model can say “I don’t know,” it can defer to humans instead of confidently hallucinating.

The big picture

| | Frequentist ML | Bayesian ML |
| --- | --- | --- |
| What you get | Point estimates | Distributions |
| Uncertainty | Requires extra work (bootstrapping, etc.) | Built in — it’s the posterior |
| Overfitting | Regularization as ad-hoc fix | Prior as principled regularization |
| Model selection | Cross-validation | Marginal likelihood (automatic Occam’s razor) |
| Computation | Fast (SGD) | Slower (MCMC/VI) |

Production ML is mostly frequentist — it’s faster, and when you have billions of data points, the prior doesn’t matter much anyway. But the frontier — safety, calibration, small-data problems, uncertainty-aware systems — is increasingly Bayesian.

And the two are converging. Regularization is a prior. Dropout is approximate inference. SGD + noise is MCMC.

The bridge back to this site:

  • The coin-flip posterior you simulated? That’s Bayesian updating in its simplest form — the same machinery a Gaussian process applies to entire functions instead of a single parameter.
  • Shrinkage pulling group estimates toward the mean? That’s regularization pulling weights toward zero.
  • MCMC sampling from a posterior? That’s how Bayesian neural networks are trained.
  • MAP estimation choosing the posterior mode? That’s L2-regularized regression.

You already understand the ideas. ML just runs them at scale.