Bayesian Estimation

Bayesian Updating showed the mechanics: start with a prior, observe data, apply Bayes’ rule, get a posterior distribution. But that page focused on the updating process — coin-flipping simulations that show how posteriors shift with data. This page is about estimation: how do you extract a point estimate from a posterior, and how does Bayesian estimation connect to MLE and regularization?

From updating to estimation

Bayes’ rule gives you a full posterior distribution over the parameter:

\[ p(\theta \mid \text{data}) \propto p(\text{data} \mid \theta) \cdot p(\theta) \]

That’s a whole distribution, not a single number. But often you need a point estimate — a single “best guess” for \(\theta\). Two natural choices:

  • Posterior mean: \(\hat{\theta}_{\text{mean}} = E[\theta \mid \text{data}]\) — the center of mass of the posterior
  • MAP (Maximum A Posteriori): \(\hat{\theta}_{\text{MAP}} = \arg\max_\theta \; p(\theta \mid \text{data})\) — the peak of the posterior

These are different summaries of the same distribution, and they can give different answers (especially when the posterior is skewed). Each has its own optimality property, as we’ll see below.
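
To make this concrete, here's a minimal sketch in Python, assuming the coin-flip setup from Bayesian Updating with a Beta(2, 2) prior and 7 heads in 10 flips (both numbers are illustrative choices, not from the original simulations):

```python
# Beta-Binomial sketch: the Beta prior is conjugate to the Bernoulli
# likelihood, so a Beta(2, 2) prior plus 7 heads / 3 tails gives a
# Beta(2 + 7, 2 + 3) = Beta(9, 5) posterior.
a, b = 2 + 7, 2 + 3

posterior_mean = a / (a + b)            # E[theta | data] = a / (a + b)
posterior_mode = (a - 1) / (a + b - 2)  # MAP = (a - 1) / (a + b - 2), valid for a, b > 1

print(f"posterior mean: {posterior_mean:.4f}")  # 0.6429
print(f"MAP:            {posterior_mode:.4f}")  # 0.6667
```

Even in this mild case the two summaries disagree, because the Beta(9, 5) posterior is skewed.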

MAP estimation

The MAP estimator maximizes the posterior:

\[ \hat{\theta}_{\text{MAP}} = \arg\max_\theta \; p(\theta \mid \text{data}) = \arg\max_\theta \; p(\text{data} \mid \theta) \cdot p(\theta) \]

Taking logs (since \(\log\) is monotone):

\[ \hat{\theta}_{\text{MAP}} = \arg\max_\theta \; \Big[\underbrace{\log p(\text{data} \mid \theta)}_{\text{log-likelihood}} + \underbrace{\log p(\theta)}_{\text{log-prior}}\Big] \]

This decomposition is revealing. MAP is MLE plus a correction from the prior.

With a flat (uniform) prior, \(\log p(\theta)\) is constant, so it drops out of the optimization:

\[ \hat{\theta}_{\text{MAP}} = \hat{\theta}_{\text{MLE}} \]

A flat prior says “I have no preference” — so the data speak for themselves, and the MAP estimate is pure maximum likelihood.

With an informative prior, the \(\log p(\theta)\) term pulls the estimate toward the prior’s center. The more concentrated the prior (the stronger your belief), the harder it pulls. The more data you have, the more the likelihood dominates and the prior’s influence fades.

Intuition. MAP balances two forces: the likelihood wants to fit the data; the prior wants to stay near your initial beliefs. With little data, the prior wins. With lots of data, the likelihood wins. This is the same tug-of-war described in Bayesian Updating, but now viewed through the lens of optimization.
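
Here's a numerical sketch of that tug-of-war, assuming a normal likelihood with known \(\sigma\) and a \(N(\mu_0, \tau^2)\) prior (all specific values below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Normal-normal conjugate case: Y_i ~ N(theta, sigma^2) with sigma known,
# prior theta ~ N(mu0, tau^2). The posterior is normal, so the MAP is a
# precision-weighted average of the prior mean and the sample mean (= MLE).
sigma, tau, mu0 = 1.0, 0.5, 0.0
theta_true = 2.0

for n in (1, 10, 1000):
    y = rng.normal(theta_true, sigma, size=n)
    w_data = n / sigma**2    # likelihood precision: grows with n
    w_prior = 1 / tau**2     # prior precision: fixed
    theta_map = (w_data * y.mean() + w_prior * mu0) / (w_data + w_prior)
    print(f"n={n:5d}  MLE={y.mean():.3f}  MAP={theta_map:.3f}")
```

At \(n = 1\) the MAP sits far below the MLE, dragged toward \(\mu_0 = 0\); by \(n = 1000\) the two are nearly indistinguishable.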

MAP as regularization

Here’s where Bayesian estimation connects to something you may have seen in machine learning. Consider a linear model \(Y_i = X_i'\beta + \varepsilon_i\) with \(\varepsilon_i \sim N(0, \sigma^2)\), where you want to estimate \(\beta\):

Normal prior → Ridge regression

If \(\beta \sim N(0, \tau^2 I)\), the log-prior is:

\[ \log p(\beta) = -\frac{1}{2\tau^2}\|\beta\|_2^2 + \text{const} \]

The MAP objective becomes:

\[ \hat{\beta}_{\text{MAP}} = \arg\max_\beta \; \left[-\frac{1}{2\sigma^2}\sum_i(Y_i - X_i'\beta)^2 - \frac{1}{2\tau^2}\|\beta\|_2^2\right] \]

which is equivalent to:

\[ \hat{\beta}_{\text{MAP}} = \arg\min_\beta \; \left[\sum_i(Y_i - X_i'\beta)^2 + \lambda\|\beta\|_2^2\right] \]

where \(\lambda = \sigma^2/\tau^2\). This is ridge regression — OLS with an L2 penalty. The “penalty” is just the log-prior.
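
This equivalence is easy to verify numerically. The sketch below, assuming simulated data and the hypothetical values \(\sigma = 1\), \(\tau = 0.5\), minimizes the negative log-posterior directly and compares it to the ridge closed form:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Y = X beta + N(0, sigma^2) noise, prior beta ~ N(0, tau^2 I).
n, p = 200, 3
sigma, tau = 1.0, 0.5
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, sigma, size=n)

def neg_log_posterior(beta):
    # negative log-likelihood + negative log-prior (constants dropped)
    return (np.sum((Y - X @ beta) ** 2) / (2 * sigma**2)
            + np.sum(beta**2) / (2 * tau**2))

beta_map = minimize(neg_log_posterior, np.zeros(p)).x

# Ridge closed form with lambda = sigma^2 / tau^2
lam = sigma**2 / tau**2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

print(np.allclose(beta_map, beta_ridge, atol=1e-4))  # True
```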

Laplace prior → Lasso

If \(\beta_j \sim \text{Laplace}(0, b)\) independently, the log-prior is:

\[ \log p(\beta) = -\frac{1}{b}\|\beta\|_1 + \text{const} \]

The MAP objective becomes:

\[ \hat{\beta}_{\text{MAP}} = \arg\min_\beta \; \left[\sum_i(Y_i - X_i'\beta)^2 + \lambda\|\beta\|_1\right] \]

where \(\lambda = 2\sigma^2/b\). This is the lasso: OLS with an L1 penalty that produces sparse solutions (some coefficients exactly zero).
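
A quick sparsity demonstration, assuming simulated data where only two of ten coefficients are truly nonzero. Note that scikit-learn's Lasso minimizes \(\frac{1}{2n}\|Y - X\beta\|_2^2 + \alpha\|\beta\|_1\), so its \(\alpha\) corresponds to \(\lambda/(2n)\) in the objective above:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)

# Only the first two of ten coefficients are truly nonzero.
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]
Y = X @ beta_true + rng.normal(size=n)

fit = Lasso(alpha=0.5).fit(X, Y)
print(np.round(fit.coef_, 2))  # most entries are exactly 0.0
```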

The correspondence

  • Normal (Gaussian) prior → \(\lambda\|\beta\|_2^2\) penalty → Ridge
  • Laplace (double exponential) prior → \(\lambda\|\beta\|_1\) penalty → Lasso
  • Flat (uniform) prior → no penalty → OLS / MLE

The penalty parameter \(\lambda\) maps directly to the prior’s precision: a tight prior (small \(\tau^2\), large \(\lambda\)) imposes heavy regularization; a vague prior (large \(\tau^2\), small \(\lambda\)) lets the data dominate.

The takeaway. Every regularized regression is implicitly doing Bayesian MAP estimation with some prior. And every Bayesian MAP estimate is implicitly doing regularized regression with some penalty. The two frameworks are saying the same thing in different languages.

Posterior mean vs MAP

The posterior mean and MAP are both valid point estimates, but they optimize different things:

Posterior mean \(E[\theta \mid \text{data}]\) minimizes expected squared error (posterior risk under squared-error loss). If your loss function is quadratic — you care equally about overestimating and underestimating — the posterior mean is optimal.

MAP \(\arg\max_\theta \; p(\theta \mid \text{data})\) gives the single most probable value. Under a 0-1 loss (you’re either right or wrong, no partial credit), MAP is optimal.

For symmetric posteriors (like the normal), the mean and mode coincide, so MAP and posterior mean are the same. For skewed posteriors, they diverge — the mean is pulled toward the tail, while the MAP stays at the peak.
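
To see both the divergence and the loss-function story concretely, here's a sketch assuming a Gamma(2, 1) posterior, chosen arbitrarily for its right skew:

```python
import numpy as np
from scipy import stats

# A right-skewed posterior: Gamma(shape=2, scale=1).
post = stats.gamma(a=2, scale=1)
post_mean = post.mean()    # 2.0, pulled toward the right tail
post_map = (2 - 1) * 1.0   # 1.0, Gamma mode = (shape - 1) * scale for shape >= 1

# Posterior risk under squared-error loss: E[(theta - t)^2 | data],
# approximated by Monte Carlo draws from the posterior.
theta = post.rvs(size=100_000, random_state=0)
for name, t in [("posterior mean", post_mean), ("MAP", post_map)]:
    print(f"{name}: risk = {np.mean((theta - t) ** 2):.3f}")
# posterior mean: risk ~ 2.0  (the posterior variance, the minimum possible)
# MAP:            risk ~ 3.0  (variance + squared mean-mode gap)
```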

With lots of data, both converge to the MLE. As \(n\) grows, the likelihood becomes sharply peaked and overwhelms any finite prior. The posterior concentrates around the MLE, and both the posterior mean and MAP converge to it. This is the Bernstein-von Mises theorem — the Bayesian analog of MLE’s asymptotic normality. It’s the same convergence you saw in the simulations in Bayesian Updating, where the posterior narrowed around the true value as data accumulated.
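
A sketch of that convergence, reusing the Beta-Binomial setup from above (true \(\theta = 0.6\) and the Beta(2, 2) prior are again just illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# Bernoulli data with true theta = 0.6, Beta(2, 2) prior. As n grows,
# the posterior mean and MAP both close in on the MLE (sample proportion).
theta_true = 0.6
for n in (10, 100, 10_000):
    heads = rng.binomial(n, theta_true)
    a, b = 2 + heads, 2 + n - heads
    mle = heads / n
    post_mean = a / (a + b)
    post_map = (a - 1) / (a + b - 2)
    print(f"n={n:6d}  MLE={mle:.4f}  mean={post_mean:.4f}  MAP={post_map:.4f}")
```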

Credible intervals vs confidence intervals

Bayesian estimation gives you a full posterior, which makes uncertainty quantification natural. A 95% credible interval is any interval \([a, b]\) such that:

\[ P(\theta \in [a, b] \mid \text{data}) = 0.95 \]

This is the interpretation people want confidence intervals to have: “there’s a 95% probability the parameter is in this interval.” Confidence intervals don’t actually mean that — they’re about the procedure’s long-run coverage rate, not the probability of a specific interval containing \(\theta\).
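
Computing a credible interval from a known posterior is one line. Here's an equal-tailed 95% interval for the Beta(9, 5) posterior from the coin-flip sketch above:

```python
from scipy import stats

# Equal-tailed 95% credible interval: cut 2.5% off each tail of the posterior.
lo, hi = stats.beta(9, 5).ppf([0.025, 0.975])
print(f"95% credible interval: [{lo:.3f}, {hi:.3f}]")  # roughly [0.38, 0.87]
```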

For more on the mechanics of how posteriors are computed and updated, see Bayesian Updating. For the frequentist perspective on confidence intervals and why they’re often misinterpreted, see p-values & Confidence Intervals.

The estimation landscape

Here’s how all four estimation methods from this section fit together:

  • MoM (method of moments): matches sample moments to population moments. Assumes only the moment conditions. Lower efficiency.
  • MLE: maximizes the likelihood. Assumes the full distribution. Highest efficiency (asymptotically).
  • GMM: minimizes a weighted moment distance. Assumes only moment conditions. Efficiency between MoM and MLE.
  • Bayesian: optimizes the posterior (prior \(\times\) likelihood). Assumes a prior plus a likelihood. Efficiency depends on the prior.

A few patterns emerge:

All four are consistent. With enough data, they all converge to the true parameter (for Bayesian: the posterior concentrates there). They differ in how fast they get there and what they assume along the way.

MLE and Bayesian need the full distribution. Both require you to write down a likelihood \(f(x \mid \theta)\). MoM and GMM only need moment conditions — weaker assumptions, but you pay for that in efficiency.

Bayesian = MLE + prior. With a flat prior, Bayesian MAP is MLE. With an informative prior, the Bayesian estimate is a compromise between the likelihood and prior beliefs. The prior acts as regularization, which can help in small samples but becomes irrelevant in large ones.

GMM nests both MoM and MLE. MoM is GMM with a specific choice of moments. MLE is GMM with the score equation as the moment condition. GMM is the general framework.

The right method depends on what you know and what you’re willing to assume. More assumptions (full distribution, informative prior) buy you more efficiency when they’re right — but more risk when they’re wrong.