Regularization as Bayesian Inference

The Bayesian Estimation page showed that MAP with a normal prior gives ridge regression, and MAP with a Laplace prior gives lasso. This page extends that idea to the regularization techniques used in modern machine learning. The central claim is precise: every standard regularizer corresponds to a prior, and every prior corresponds to a regularizer. This is not an analogy. It is an algebraic equivalence.

The equivalence, restated

Recall from Bayesian Estimation that the MAP objective is:

\[ \hat{\theta}_{\text{MAP}} = \arg\max_\theta \left[\log p(\text{data} \mid \theta) + \log p(\theta)\right] \]

The first term is the log-likelihood (the negative of the loss function). The second term is the log-prior (the negative of the regularization penalty). Minimizing a regularized loss is therefore maximizing a penalized log-likelihood, which is computing a MAP estimate under some prior.

Each standard technique pairs a prior with a penalty term:

  • L2 (weight decay / ridge): prior \(\theta_j \sim N(0, \tau^2)\), penalty \(\lambda \|\theta\|_2^2\)
  • L1 (lasso): prior \(\theta_j \sim \text{Laplace}(0, b)\), penalty \(\lambda \|\theta\|_1\)
  • Elastic net: prior a mixture of normal and Laplace, penalty \(\lambda_1 \|\theta\|_1 + \lambda_2 \|\theta\|_2^2\)
  • No regularization: flat (uniform) prior, no penalty (pure MLE)

The penalty strength \(\lambda\) maps to the prior precision: a large \(\lambda\) is a tight prior that strongly constrains the parameters toward zero. A small \(\lambda\) is a vague prior that lets the data dominate.
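
A quick numerical check of the equivalence, sketched in numpy (the data, noise level \(\sigma\), and prior scale \(\tau\) below are all illustrative): for a linear model with Gaussian noise, the closed-form ridge solution and the MAP estimate under a \(N(0, \tau^2)\) prior coincide when \(\lambda = \sigma^2 / \tau^2\).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Illustrative data: y = X @ theta_true + Gaussian noise
n, d = 50, 3
X = rng.normal(size=(n, d))
theta_true = np.array([2.0, -1.0, 0.5])
sigma = 0.5                       # noise standard deviation
y = X @ theta_true + sigma * rng.normal(size=n)

tau = 1.0                         # prior standard deviation: theta_j ~ N(0, tau^2)
lam = sigma**2 / tau**2           # penalty strength implied by the prior

# Ridge regression: minimize ||y - X theta||^2 + lam * ||theta||^2 (closed form)
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# MAP: minimize the negative log posterior under the same likelihood and prior
def neg_log_posterior(theta):
    nll = 0.5 * np.sum((y - X @ theta) ** 2) / sigma**2      # Gaussian likelihood
    neg_log_prior = 0.5 * np.sum(theta**2) / tau**2          # Gaussian prior
    return nll + neg_log_prior

theta_map = minimize(neg_log_posterior, np.zeros(d)).x

print(theta_ridge)   # the two estimates agree (up to optimizer tolerance)
print(theta_map)
```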

Weight decay in neural networks

The most common regularizer in deep learning is weight decay: add \(\lambda \|\theta\|_2^2\) to the loss. In the Bayesian interpretation, this is MAP estimation under a Gaussian prior centered at zero.

What does this prior say? “In the absence of data, I believe the weights should be small.” This is a reasonable default — large weights produce extreme predictions and amplify noise. The prior encodes a preference for smooth, conservative functions.

The weight decay coefficient \(\lambda\) controls the bias-variance tradeoff from Bias-Variance: too small and the model overfits (high variance); too large and the model underfits (high bias). The Bayesian interpretation makes this tradeoff precise: for a Gaussian likelihood with noise variance \(\sigma^2\), \(\lambda\) is the ratio of the noise variance to the prior variance, \(\sigma^2 / \tau^2\).
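
Mechanically, the "decay" is visible in the update rule: the gradient of \(\lambda \|\theta\|_2^2\) is \(2\lambda\theta\), so every step shrinks each weight toward zero, the mode of the prior. A minimal sketch (plain numpy; the step size and \(\lambda\) are illustrative, and conventions for the factor of 2 vary across libraries):

```python
import numpy as np

def sgd_step_with_weight_decay(theta, grad_loss, lr=0.01, lam=1e-4):
    """One gradient step on loss(theta) + lam * ||theta||_2^2.

    The extra term 2 * lam * theta is the gradient of the penalty, i.e. the
    pull toward the mode of the zero-mean Gaussian prior.  (Many libraries
    fold the factor of 2 into lam, or apply the decay as a separate step.)
    """
    return theta - lr * (grad_loss + 2 * lam * theta)

# Equivalent "decay" form: shrink the weights, then apply the data gradient
# theta_new = (1 - 2 * lr * lam) * theta - lr * grad_loss
```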

Dropout as approximate Bayesian inference

Dropout randomly sets a fraction of neural network activations to zero during training. At test time, all units are active and their outputs are rescaled (by the keep probability) so that the expected activations match those seen during training. This was introduced as a heuristic to prevent overfitting.

The Bayesian connection: Gal and Ghahramani (2016) showed that training with dropout is approximately equivalent to variational inference in a Bayesian neural network. Specifically, dropout training can be read as minimizing the KL divergence between a particular approximate posterior over the weights and the true posterior.

This means:

  • Running the network multiple times with dropout at test time (Monte Carlo dropout) produces samples from an approximate posterior predictive distribution
  • The spread of these predictions approximates the model’s epistemic uncertainty
  • This connects dropout to the uncertainty quantification discussed in Calibration and Uncertainty

A caveat. The equivalence between dropout and variational inference is approximate and depends on modeling assumptions that may not hold in practice. The quality of the uncertainty estimates from Monte Carlo dropout is debated in the literature. The connection is theoretically grounded but should not be taken as exact.
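
With that caveat in mind, here is a minimal sketch of Monte Carlo dropout in PyTorch. The architecture, dropout rate, and number of samples are illustrative choices, not prescriptions:

```python
import torch
import torch.nn as nn

# Illustrative regression network; sizes and dropout rate are arbitrary choices.
model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Dropout(p=0.1),
    nn.Linear(64, 1),
)

def mc_dropout_predict(model, x, n_samples=100):
    """Monte Carlo dropout: keep dropout active at prediction time and treat
    repeated stochastic forward passes as approximate posterior predictive samples."""
    model.train()                 # keeps dropout "on" (no gradients or updates happen here)
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    # Predictive mean, plus spread as a rough proxy for epistemic uncertainty
    return samples.mean(dim=0), samples.std(dim=0)

x = torch.randn(5, 10)            # a batch of 5 illustrative inputs
mean, std = mc_dropout_predict(model, x)
```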

Early stopping as implicit regularization

Early stopping — halting training before convergence — also has a Bayesian interpretation. In gradient descent with small learning rate, the trajectory of the parameters from initialization traces out a path from the prior (the initial weights) toward the MLE. Stopping early means the final estimate stays closer to the initialization, which functions as an implicit prior.

For linear models, early stopping of gradient descent corresponds closely to L2 regularization: the product of the learning rate and the number of steps plays the role of \(1/\lambda\). Fewer steps = more regularization = tighter prior. For nonlinear models the correspondence is rougher still, but the intuition carries over.
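
A small numpy sketch of the correspondence (the data, learning rate, and step count are illustrative; the mapping \(\lambda \approx 1/(\text{lr} \times \text{steps})\) is a heuristic, not an identity):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)

lr, steps = 1e-3, 50              # few steps = strong implicit regularization
theta = np.zeros(d)               # the initialization plays the role of the prior mean
for _ in range(steps):
    theta -= lr * X.T @ (X @ theta - y)      # plain gradient descent on squared error

# Ridge with lambda ~ 1 / (lr * steps) applies roughly comparable shrinkage.
# The mapping is a heuristic; it is exact only in idealized spectral / continuous-time settings.
lam = 1.0 / (lr * steps)
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(theta)         # early-stopped estimate
print(theta_ridge)   # ridge estimate: similar shrinkage toward zero, not identical
```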

The RLHF penalty as a prior

Reinforcement Learning from Human Feedback (RLHF) fine-tunes a language model to align with human preferences. The standard objective is:

\[ \max_\theta \; E_{y \sim \pi_\theta(\cdot \mid x)}\left[R(y \mid x)\right] - \beta \, D_{KL}\!\left(\pi_\theta \| \pi_{\text{ref}}\right) \]

where \(R\) is a reward model, \(\pi_\theta\) is the policy being trained, and \(\pi_{\text{ref}}\) is the reference (pre-trained) model. The KL divergence term penalizes deviation from the reference model.

In the Bayesian frame, the reference model acts as a prior: it encodes what the model “knew” before seeing human preference data. The KL penalty pulls the fine-tuned model back toward this prior, preventing it from drifting too far in pursuit of reward. The coefficient \(\beta\) controls how much the prior matters — analogous to \(\lambda\) in weight decay or \(1/\tau^2\) in the Bayesian MAP framework.
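
To see where the prior enters mechanically, here is a minimal sketch of the KL-shaped reward that many RLHF implementations optimize (plain numpy; the function name, \(\beta\), and the numbers are illustrative, and real systems typically apply the penalty per token during training):

```python
import numpy as np

def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Sketch of the KL-shaped reward (not any particular library's API):
    subtract beta times a single-sample Monte Carlo estimate of
    KL(pi_theta || pi_ref), the log-prob ratio of the sampled tokens."""
    kl_estimate = np.sum(logp_policy - logp_ref)   # summed over the tokens of the response
    return reward - beta * kl_estimate

# Illustrative numbers for a 4-token response
logp_policy = np.array([-1.2, -0.8, -2.0, -0.5])   # log pi_theta(token | context)
logp_ref    = np.array([-1.0, -1.1, -1.8, -0.7])   # log pi_ref(token | context)
print(kl_penalized_reward(reward=1.0, logp_policy=logp_policy, logp_ref=logp_ref))
```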

This is the same tug-of-war described in Bayesian Updating: data (human preferences) push the model in one direction; the prior (reference model) pulls it back. With more preference data, the reward signal dominates. With less, the model stays close to its pre-trained behavior.

What the Bayesian lens buys you

Viewing regularization as Bayesian inference is not just a mathematical curiosity. It provides:

Principled choice of \(\lambda\). Instead of tuning the regularization strength by cross-validation alone, you can reason about what prior is appropriate for the problem. If you have domain knowledge that parameters should be small, a tight prior (large \(\lambda\)) is justified. If you expect sparse effects, a Laplace prior (L1) is appropriate.

A path to uncertainty quantification. Pure MLE (unregularized training) gives point predictions with no uncertainty. The Bayesian interpretation opens the door to posterior distributions, credible intervals, and predictive uncertainty — tools explored in Calibration and Uncertainty.

Unified understanding. Weight decay, dropout, early stopping, and the RLHF penalty all look different mechanically. But they are all doing the same thing: encoding a prior belief that constrains the model. Recognizing this prevents treating them as unrelated “tricks” and enables reasoning about their interactions.