Bayesian Estimation
Bayesian Updating showed the mechanics: start with a prior, observe data, apply Bayes’ rule, get a posterior distribution. But that page focused on the updating process — coin-flipping simulations that show how posteriors shift with data. This page is about estimation: how do you extract a point estimate from a posterior, and how does Bayesian estimation connect to MLE and regularization?
From updating to estimation
Bayes’ rule gives you a full posterior distribution over the parameter:
\[ p(\theta \mid \text{data}) \propto p(\text{data} \mid \theta) \cdot p(\theta) \]
That’s a whole distribution, not a single number. But often you need a point estimate — a single “best guess” for \(\theta\). Two natural choices:
- Posterior mean: \(\hat{\theta}_{\text{mean}} = E[\theta \mid \text{data}]\) — the center of mass of the posterior
- MAP (Maximum A Posteriori): \(\hat{\theta}_{\text{MAP}} = \arg\max_\theta \; p(\theta \mid \text{data})\) — the peak of the posterior
These are different summaries of the same distribution, and they can give different answers (especially when the posterior is skewed). Each has its own optimality property, as we’ll see below.
MAP estimation
The MAP estimator maximizes the posterior:
\[ \hat{\theta}_{\text{MAP}} = \arg\max_\theta \; p(\theta \mid \text{data}) = \arg\max_\theta \; p(\text{data} \mid \theta) \cdot p(\theta) \]
Taking logs (since \(\log\) is monotone):
\[ \hat{\theta}_{\text{MAP}} = \arg\max_\theta \; \Big[\underbrace{\log p(\text{data} \mid \theta)}_{\text{log-likelihood}} + \underbrace{\log p(\theta)}_{\text{log-prior}}\Big] \]
This decomposition is revealing. MAP is MLE plus a correction from the prior.
With a flat (uniform) prior, \(\log p(\theta)\) is constant, so it drops out of the optimization:
\[ \hat{\theta}_{\text{MAP}} = \hat{\theta}_{\text{MLE}} \]
A flat prior says “I have no preference” — so the data speak for themselves, and the MAP estimate is pure maximum likelihood.
With an informative prior, the \(\log p(\theta)\) term pulls the estimate toward the prior’s center. The more concentrated the prior (the stronger your belief), the harder it pulls. The more data you have, the more the likelihood dominates and the prior’s influence fades.
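To make this concrete, here is a minimal sketch for the coin-flip setup from Bayesian Updating: a Bernoulli parameter with a conjugate Beta prior, where the MAP estimate is the posterior mode and has a closed form. The prior parameters and data counts below are made up for illustration.

```python
def coin_estimates(heads, n, alpha=5.0, beta=5.0):
    """MLE and MAP for a Bernoulli probability with a Beta(alpha, beta) prior.

    The posterior is Beta(alpha + heads, beta + n - heads); its mode is the MAP.
    """
    mle = heads / n
    a, b = alpha + heads, beta + (n - heads)
    # Mode of Beta(a, b) is (a - 1) / (a + b - 2), valid for a, b > 1.
    map_est = (a - 1) / (a + b - 2)
    return mle, map_est

# Same observed frequency (70% heads), increasing sample size:
for heads, n in [(7, 10), (70, 100), (700, 1000)]:
    mle, map_est = coin_estimates(heads, n)
    print(f"n={n:5d}  MLE={mle:.3f}  MAP={map_est:.3f}")
```

With ten flips, the Beta(5, 5) prior (centered at 0.5) pulls the MAP noticeably toward 0.5; by a thousand flips the two estimates are nearly identical, just as the decomposition above predicts.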
MAP as regularization
Here’s where Bayesian estimation connects to something you may have seen in machine learning. Consider a linear model \(Y_i = X_i'\beta + \varepsilon_i\) with Gaussian noise \(\varepsilon_i \sim N(0, \sigma^2)\), where you want to estimate \(\beta\):
Normal prior → Ridge regression
If \(\beta \sim N(0, \tau^2 I)\), the log-prior is:
\[ \log p(\beta) = -\frac{1}{2\tau^2}\|\beta\|_2^2 + \text{const} \]
The MAP objective becomes:
\[ \hat{\beta}_{\text{MAP}} = \arg\max_\beta \; \left[-\frac{1}{2\sigma^2}\sum_i(Y_i - X_i'\beta)^2 - \frac{1}{2\tau^2}\|\beta\|_2^2\right] \]
which is equivalent to:
\[ \hat{\beta}_{\text{MAP}} = \arg\min_\beta \; \left[\sum_i(Y_i - X_i'\beta)^2 + \lambda\|\beta\|_2^2\right] \]
where \(\lambda = \sigma^2/\tau^2\). This is ridge regression — OLS with an L2 penalty. The “penalty” is just the log-prior.
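As a quick numerical check, here is a sketch comparing the OLS and MAP (ridge) closed forms on simulated data. The design matrix, true coefficients, and variances are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
sigma2, tau2 = 1.0, 0.5                 # noise variance, prior variance
lam = sigma2 / tau2                     # lambda = sigma^2 / tau^2

X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# OLS (flat prior) vs the ridge/MAP closed form (Gaussian prior):
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print("OLS:        ", np.round(beta_ols, 3))
print("MAP (ridge):", np.round(beta_map, 3))   # shrunk toward 0
```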
Laplace prior → Lasso
If \(\beta_j \sim \text{Laplace}(0, b)\) independently, the log-prior is:
\[ \log p(\beta) = -\frac{1}{b}\|\beta\|_1 + \text{const} \]
The MAP objective becomes:
\[ \hat{\beta}_{\text{MAP}} = \arg\min_\beta \; \left[\sum_i(Y_i - X_i'\beta)^2 + \lambda\|\beta\|_1\right] \]
where \(\lambda = 2\sigma^2/b\). This is the lasso — OLS with an L1 penalty that produces sparse solutions (some coefficients exactly zero).
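A short sketch of the sparsity claim, using scikit-learn’s Lasso on simulated data. The design and penalty level are arbitrary, and note that scikit-learn puts a \(1/(2n)\) factor on the squared error and calls the penalty weight \(\alpha\), so its \(\alpha\) is a rescaling of the \(\lambda\) above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]        # only 3 of 10 coefficients are nonzero
y = X @ beta_true + rng.normal(size=n)

fit = Lasso(alpha=0.2).fit(X, y)
print(np.round(fit.coef_, 3))           # the noise coefficients come out exactly 0
```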
The correspondence
| Prior on \(\beta\) | Penalty | Regularization method |
|---|---|---|
| Normal (Gaussian) | \(\lambda\|\beta\|_2^2\) | Ridge |
| Laplace (double exponential) | \(\lambda\|\beta\|_1\) | Lasso |
| Flat (uniform) | None | OLS / MLE |
The penalty parameter \(\lambda\) maps directly to the prior’s precision: a tight prior (small \(\tau^2\), large \(\lambda\)) imposes heavy regularization; a vague prior (large \(\tau^2\), small \(\lambda\)) lets the data dominate.
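To see the mapping in action, here is a sketch that sweeps the prior variance \(\tau^2\) and recomputes the ridge/MAP closed form from above (simulated data, arbitrary true coefficients).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=50)
sigma2 = 1.0

for tau2 in [0.01, 0.1, 1.0, 100.0]:
    lam = sigma2 / tau2                 # tight prior -> large penalty
    b = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    print(f"tau^2={tau2:7.2f}  lambda={lam:8.2f}  beta={np.round(b, 3)}")
```

A tight prior crushes the estimates toward zero; a vague one reproduces OLS almost exactly.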
Posterior mean vs MAP
The posterior mean and MAP are both valid point estimates, but they optimize different things:
Posterior mean \(E[\theta \mid \text{data}]\) minimizes expected squared error (posterior risk under squared-error loss). If your loss function is quadratic — you care equally about overestimating and underestimating — the posterior mean is optimal.
MAP \(\arg\max_\theta \; p(\theta \mid \text{data})\) picks the point where the posterior density is highest. Under 0-1 loss (you’re either right or wrong, no partial credit), MAP is optimal.
For symmetric posteriors (like the normal), the mean and mode coincide, so MAP and posterior mean are the same. For skewed posteriors, they diverge — the mean is pulled toward the tail, while the MAP stays at the peak.
With lots of data, both converge to the MLE. As \(n\) grows, the likelihood becomes sharply peaked and overwhelms any finite prior. The posterior concentrates around the MLE, and both the posterior mean and MAP converge to it. This is the Bernstein-von Mises theorem — the Bayesian analog of MLE’s asymptotic normality. It’s the same convergence you saw in the simulations in Bayesian Updating, where the posterior narrowed around the true value as data accumulated.
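A sketch of both effects, using the Beta posterior from before: with few observations of a rare event the posterior is skewed and the mean and MAP disagree, and as \(n\) grows both converge to the MLE. The prior and counts are illustrative.

```python
def beta_summaries(heads, n, alpha=2.0, beta=2.0):
    """Posterior mean and mode (MAP) under a Beta(alpha, beta) prior."""
    a, b = alpha + heads, beta + (n - heads)
    mean = a / (a + b)
    mode = (a - 1) / (a + b - 2)        # valid for a, b > 1
    return mean, mode

# 10% heads at increasing sample sizes: the posterior starts right-skewed,
# so the mean sits above the mode; both approach the MLE as n grows.
for heads, n in [(1, 10), (10, 100), (100, 1000)]:
    mean, mode = beta_summaries(heads, n)
    print(f"n={n:5d}  MLE={heads/n:.3f}  mean={mean:.3f}  MAP={mode:.3f}")
```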
Credible intervals vs confidence intervals
Bayesian estimation gives you a full posterior, which makes uncertainty quantification natural. A 95% credible interval is any interval \([a, b]\) such that:
\[ P(\theta \in [a, b] \mid \text{data}) = 0.95 \]
This is the interpretation people want confidence intervals to have: “there’s a 95% probability the parameter is in this interval.” Confidence intervals don’t actually mean that — they’re about the procedure’s long-run coverage rate, not the probability of a specific interval containing \(\theta\).
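Computing one is direct once you have the posterior. Here is a sketch using SciPy, with the Beta posterior from a hypothetical 7-heads-in-10-flips experiment under a flat prior; the equal-tailed interval (2.5% cut from each tail) is one common choice among the many intervals satisfying the definition.

```python
from scipy import stats

# Posterior after 7 heads in 10 flips with a flat Beta(1, 1) prior:
posterior = stats.beta(1 + 7, 1 + 3)

# Equal-tailed 95% credible interval:
lo, hi = posterior.ppf([0.025, 0.975])
print(f"95% credible interval: [{lo:.3f}, {hi:.3f}]")
print("P(theta in interval | data):", posterior.cdf(hi) - posterior.cdf(lo))  # 0.95
```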
For more on the mechanics of how posteriors are computed and updated, see Bayesian Updating. For the frequentist perspective on confidence intervals and why they’re often misinterpreted, see p-values & Confidence Intervals.
The estimation landscape
Here’s how all four estimation methods from this section fit together:
| Method | What it optimizes | Assumptions | Efficiency |
|---|---|---|---|
| MoM | Matches moments | Minimal (just moments) | Lower |
| MLE | Maximizes likelihood | Full distribution | Highest (asymptotically) |
| GMM | Weighted moment distance | Moment conditions | Between MoM and MLE |
| Bayesian | Posterior (prior \(\times\) likelihood) | Prior + likelihood | Depends on prior |
A few patterns emerge:
All four are consistent. With enough data, they all converge to the true parameter (for Bayesian: the posterior concentrates there). They differ in how fast they get there and what they assume along the way.
MLE and Bayesian need the full distribution. Both require you to write down a likelihood \(f(x \mid \theta)\). MoM and GMM only need moment conditions — weaker assumptions, but you pay for that in efficiency.
Bayesian = MLE + prior. With a flat prior, Bayesian MAP is MLE. With an informative prior, the Bayesian estimate is a compromise between the likelihood and prior beliefs. The prior acts as regularization, which can help in small samples but becomes irrelevant in large ones.
GMM nests both MoM and MLE. MoM is GMM with a specific choice of moments. MLE is GMM with the score equation as the moment condition. GMM is the general framework.
The right method depends on what you know and what you’re willing to assume. More assumptions (full distribution, informative prior) buy you more efficiency when they’re right — but more risk when they’re wrong.
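As a closing sketch, here is a comparison of three of the four methods on simulated Bernoulli data (GMM is omitted since it requires choosing moment conditions and a weighting matrix; the prior and true parameter are arbitrary). For the Bernoulli, MoM and MLE happen to coincide at the sample mean, so the comparison mainly shows the prior’s influence fading.

```python
import numpy as np

rng = np.random.default_rng(42)
p_true, alpha, beta = 0.3, 4.0, 4.0     # true parameter; Beta(4, 4) prior

for n in [10, 100, 10_000]:
    x = rng.binomial(1, p_true, size=n)
    k = x.sum()
    mom = x.mean()                                  # matches the first moment
    mle = k / n                                     # maximizes the likelihood
    post_mean = (alpha + k) / (alpha + beta + n)    # Bayesian posterior mean
    print(f"n={n:6d}  MoM={mom:.3f}  MLE={mle:.3f}  Bayes={post_mean:.3f}")
```

At \(n = 10\) the Bayesian estimate is pulled toward the prior’s center of 0.5; at \(n = 10{,}000\) all three essentially agree.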