Maximum Likelihood Estimation

The workhorse of parametric estimation. Instead of matching moments, MLE asks: what parameter values make the observed data most probable? It’s more demanding than Method of Moments — you need to specify the entire distribution — but in return you get the most efficient estimator available in large samples.

The idea

You have data \(x_1, \ldots, x_n\) and a model that says each observation was drawn from a distribution \(f(x \mid \theta)\), where \(\theta\) is unknown. The likelihood function treats the data as fixed and asks: how probable is this particular dataset, as a function of \(\theta\)?

\[ L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta) \]

That’s the joint density of the data, viewed as a function of the parameter rather than the data. The maximum likelihood estimator is the value of \(\theta\) that makes the data most probable:

\[ \hat{\theta}_{\text{MLE}} = \arg\max_\theta \; L(\theta) \]

In practice, products are annoying and numerically unstable, so we work with the log-likelihood:

\[ \ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta) \]

Since \(\log\) is monotonically increasing, maximizing \(\ell(\theta)\) gives the same answer as maximizing \(L(\theta)\). The first-order condition is the score equation:

\[ \frac{\partial \ell(\theta)}{\partial \theta} \bigg|_{\theta = \hat{\theta}} = 0 \]
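
To see why the log matters numerically, here is a minimal sketch, assuming NumPy and SciPy are available, using a hypothetical normal sample with the mean as the unknown parameter: the raw product of densities underflows in floating point long before the sum of log-densities runs into trouble.

```python
# A minimal sketch (hypothetical normal sample, NumPy/SciPy assumed) of why we
# work with the log-likelihood rather than the raw likelihood.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=1_000)   # hypothetical data

theta = 2.0                                      # candidate value for the mean
densities = stats.norm.pdf(x, loc=theta, scale=1.0)

print(np.prod(densities))          # 0.0 -- the product underflows
print(np.sum(np.log(densities)))   # a finite log-likelihood, roughly -1.4e3
```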

A worked example

You flip a coin \(n\) times and observe \(k\) heads. The probability of this specific sequence (assuming flips are independent) is:

\[ L(p) = p^k (1-p)^{n-k} \]

The log-likelihood is:

\[ \ell(p) = k \log p + (n - k) \log(1 - p) \]

Take the derivative and set it to zero:

\[ \frac{\partial \ell}{\partial p} = \frac{k}{p} - \frac{n - k}{1 - p} = 0 \]

Solving:

\[ \hat{p}_{\text{MLE}} = \frac{k}{n} \]

The MLE is the sample proportion — exactly what you’d expect. Notice this is also what MoM gives you (set \(E[X] = p\) equal to the sample mean \(k/n\)). For this problem, MLE and MoM agree. That won’t always be the case.
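
A quick numerical check of this result, using hypothetical numbers (n = 100 flips, k = 37 heads) and SciPy's optimizer: maximizing \(\ell(p)\) directly should land on \(k/n\).

```python
# Maximize the Bernoulli log-likelihood numerically and compare to k/n.
# (Hypothetical data: n = 100 flips, k = 37 heads.)
import numpy as np
from scipy.optimize import minimize_scalar

n, k = 100, 37

def neg_log_lik(p):
    return -(k * np.log(p) + (n - k) * np.log(1 - p))

result = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)   # approximately 0.37
print(k / n)      # 0.37, the closed-form MLE
```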

Connection to Bayesian updating. If you put a flat (uniform) prior on \(p\), the posterior distribution is proportional to the likelihood. The posterior mode — the peak of the posterior — equals the MLE. With a non-flat prior, the posterior mode gets pulled away from the MLE, as explored in Bayesian Estimation. For the full mechanics of how priors combine with likelihoods, see Bayesian Updating.

OLS is MLE under normality

Here’s a fact that ties the course together. Suppose your regression model is:

\[ Y_i = X_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) \]

The log-likelihood for one observation is:

\[ \log f(Y_i \mid X_i, \beta, \sigma^2) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(Y_i - X_i'\beta)^2}{2\sigma^2} \]

Sum across all observations:

\[ \ell(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - X_i'\beta)^2 \]

To maximize over \(\beta\), you only need to worry about the second term. The \(\hat{\beta}\) that maximizes this log-likelihood is the one that minimizes \(\sum_i (Y_i - X_i'\beta)^2\) — which is OLS.

\[ \hat{\beta}_{\text{MLE}} = \hat{\beta}_{\text{OLS}} = (X'X)^{-1}X'y \]

OLS is maximum likelihood, under the assumption that errors are normal. This is why the OLS formula feels so natural — it’s doing exactly what MLE would do in the Gaussian model. But notice: if errors aren’t normal, OLS is still the best linear unbiased estimator (by Gauss-Markov), but it’s no longer MLE.

For the full matrix algebra behind this, see The Algebra Behind OLS.
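
As a sanity check, here is a small simulation sketch (simulated data, hypothetical dimensions, NumPy and SciPy assumed) that maximizes the Gaussian log-likelihood numerically and compares the resulting \(\hat{\beta}\) to the closed-form OLS solution.

```python
# Simulate a linear model with normal errors, then check that the numerical
# maximizer of the Gaussian log-likelihood matches (X'X)^{-1} X'y.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=n)

def neg_log_lik(params):
    beta, log_sigma = params[:p], params[p]
    sigma2 = np.exp(2 * log_sigma)                 # keep sigma^2 positive
    resid = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * sigma2) + resid @ resid / (2 * sigma2)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)       # (X'X)^{-1} X'y
mle = minimize(neg_log_lik, x0=np.zeros(p + 1), method="BFGS")
print(beta_ols)
print(mle.x[:p])                                   # the same, up to optimizer tolerance
```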

Why is OLS so famous? If MoM, MLE, GMM, and Bayesian estimation are all valid approaches, why does OLS dominate introductory courses and applied work?

  1. Closed-form solution — \((X'X)^{-1}X'y\) is a formula, not an iterative algorithm. Before computers, this mattered enormously.
  2. Minimal assumptions — Gauss-Markov says OLS is the best linear unbiased estimator without even needing normality. You don’t need to specify the full distribution.
  3. It’s MLE when errors are normal — so in the most common case, OLS is the most efficient estimator. You get MLE for free.
  4. Interpretability — coefficients are partial derivatives of the conditional mean. Everyone can understand “a one-unit change in \(X\) is associated with a \(\hat{\beta}\) change in \(Y\).”
  5. Historical momentum — Gauss and Legendre published it around 1800. By the time MLE (Fisher, 1920s) and GMM (Hansen, 1982) arrived, OLS had a 120+ year head start in textbooks.

OLS hit a sweet spot: minimal assumptions, maximum interpretability, and a formula you can compute by hand. The other methods are more general and more powerful — but OLS solved 80% of problems with 20% of the effort.

Properties of MLE

MLE is popular for good reason. Under regularity conditions (the parameter space is open, the model is identifiable, the likelihood is smooth enough):

Consistent. \(\hat{\theta}_{\text{MLE}} \xrightarrow{p} \theta_0\) as \(n \to \infty\). The estimator converges to the true value.

Asymptotically efficient. No consistent estimator has smaller variance in large samples. MLE achieves the Cramér-Rao lower bound — it extracts the maximum amount of information from the data.

Asymptotically normal.

\[ \sqrt{n}(\hat{\theta}_{\text{MLE}} - \theta_0) \xrightarrow{d} N\big(0, \, \mathcal{I}(\theta_0)^{-1}\big) \]

where \(\mathcal{I}(\theta)\) is the Fisher information (defined below). This means in large samples, MLE estimates are approximately normal, centered at the truth, with variance determined by Fisher information.
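
A small simulation makes this concrete. The sketch below (hypothetical settings: true \(p_0 = 0.3\), \(n = 500\), 10,000 replications, NumPy assumed) revisits the coin example; the standardized estimates \(\sqrt{n}(\hat{p} - p_0)\) should have standard deviation close to \(\sqrt{p_0(1 - p_0)}\).

```python
# Repeatedly draw coin-flip datasets, compute the MLE p_hat = k/n each time,
# and check the spread of sqrt(n)*(p_hat - p0) against the theoretical value.
import numpy as np

rng = np.random.default_rng(2)
p0, n, reps = 0.3, 500, 10_000

k = rng.binomial(n, p0, size=reps)   # number of heads in each replication
p_hat = k / n                        # the MLE in each replication
z = np.sqrt(n) * (p_hat - p0)

print(z.std())                       # close to 0.458
print(np.sqrt(p0 * (1 - p0)))        # 0.458..., the asymptotic standard deviation
```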

Here’s how MLE compares to Method of Moments:

| Property | MLE | MoM |
| --- | --- | --- |
| Consistency | Yes | Yes |
| Asymptotic normality | Yes | Yes |
| Efficiency | Achieves Cramér-Rao bound | Generally less efficient |
| Assumptions needed | Full distribution | Just moments |
| Computation | May need numerical optimization | Usually closed-form |

Fisher information and standard errors

The Fisher information measures how much information one observation carries about \(\theta\):

\[ \mathcal{I}(\theta) = -E\!\left[\frac{\partial^2 \log f(X \mid \theta)}{\partial \theta^2}\right] \]

This is the expected curvature of the log-likelihood for a single observation. A sharply curved log-likelihood — one with a pronounced peak — means a small change in \(\theta\) causes a big drop in \(\ell\). That’s high information: the data strongly distinguish the true \(\theta\) from nearby values. A flat log-likelihood means low information: many parameter values look almost equally plausible.

Intuition. Think of the log-likelihood as a mountain. Fisher information measures how steep the sides are near the peak. A sharp peak means the data pinpoint \(\theta\) precisely. A broad, flat hilltop means you’re uncertain about where exactly the peak is.

The standard error of the MLE is:

\[ \text{SE}(\hat{\theta}) \approx \frac{1}{\sqrt{n \cdot \mathcal{I}(\theta)}} \]

More data (\(n\) large) and more informative data (\(\mathcal{I}\) large) both shrink the standard error.
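
To make this concrete, return to the coin example. A single flip has log-density \(\log f(x \mid p) = x \log p + (1 - x)\log(1 - p)\), whose second derivative in \(p\) is \(-x/p^2 - (1 - x)/(1 - p)^2\). Taking expectations (using \(E[X] = p\)):

\[ \mathcal{I}(p) = \frac{p}{p^2} + \frac{1 - p}{(1 - p)^2} = \frac{1}{p} + \frac{1}{1 - p} = \frac{1}{p(1 - p)} \]

So the standard error of \(\hat{p} = k/n\) is approximately \(\sqrt{\hat{p}(1 - \hat{p})/n}\), the familiar binomial standard error.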

Connection to regression. For the normal linear model, the Fisher information matrix gives back the familiar variance-covariance matrix of OLS:

\[ \text{Var}(\hat{\beta}) = \sigma^2(X'X)^{-1} \]

This is derived in The Algebra Behind OLS. For this model the Fisher information matrix for \(\beta\) is \((X'X)/\sigma^2\), so its inverse is exactly the OLS variance \(\sigma^2(X'X)^{-1}\). The two frameworks — “OLS algebra” and “MLE theory” — are telling you the same thing.
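
A quick Monte Carlo sketch of this equivalence (simulated fixed design, hypothetical sizes, NumPy assumed): the covariance of \(\hat{\beta}\) across repeated samples should line up with \(\sigma^2(X'X)^{-1}\).

```python
# Hold the design X fixed, redraw the normal errors many times, recompute the
# OLS estimate each time, and compare its empirical covariance to sigma^2 (X'X)^{-1}.
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 100, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, -0.7])

estimates = []
for _ in range(20_000):
    y = X @ beta_true + rng.normal(scale=sigma, size=n)
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))

print(np.cov(np.array(estimates).T))        # empirical covariance of beta-hat
print(sigma**2 * np.linalg.inv(X.T @ X))    # sigma^2 (X'X)^{-1}
```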

Limitations

Requires specifying the full distribution. MLE needs the entire density \(f(x \mid \theta)\), not just a few moments. If you specify the wrong distribution, the estimator will still converge — but to the parameter value that makes the assumed distribution closest to the truth in Kullback-Leibler divergence. That’s not necessarily what you want.

Can be biased in small samples. Asymptotic efficiency is a large-sample result. In small samples, MLE can be biased. A classic example: the MLE of \(\sigma^2\) in the normal distribution is \(\frac{1}{n}\sum_i (X_i - \bar{X})^2\), which divides by \(n\) rather than \(n-1\) and is biased downward.
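
The bias is easy to see in a small simulation (hypothetical settings: \(n = 10\), true \(\sigma^2 = 4\), NumPy assumed).

```python
# Compare the average of the MLE of sigma^2 (divide by n) with the unbiased
# estimator (divide by n - 1) across many small samples.
import numpy as np

rng = np.random.default_rng(4)
n, sigma2, reps = 10, 4.0, 100_000

x = rng.normal(scale=np.sqrt(sigma2), size=(reps, n))
mle = x.var(axis=1, ddof=0)         # divides by n     (the MLE)
unbiased = x.var(axis=1, ddof=1)    # divides by n - 1

print(mle.mean())        # close to (n - 1)/n * sigma2 = 3.6, biased downward
print(unbiased.mean())   # close to 4.0
```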

Sensitive to model misspecification. If the true data-generating process doesn’t belong to your parametric family, MLE converges to the “pseudo-true” value — the member of your family closest to the truth in KL divergence. This can be far from the parameter you intended to estimate.

Numerical optimization. For many models, there’s no closed-form MLE. You need iterative algorithms (Newton-Raphson, EM, gradient descent), which can get stuck at local optima or fail to converge.
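
For example, the Gamma distribution’s shape parameter has no closed-form MLE: the score equation involves the digamma function. A minimal sketch of the numerical route, with simulated data (hypothetical true shape 2.5 and scale 1.5, NumPy and SciPy assumed):

```python
# Fit a Gamma distribution by numerically minimizing the negative log-likelihood.
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(5)
x = rng.gamma(shape=2.5, scale=1.5, size=2_000)   # simulated data

def neg_log_lik(params):
    shape, scale = np.exp(params)                 # log-parameterization keeps both positive
    return -np.sum(stats.gamma.logpdf(x, a=shape, scale=scale))

fit = minimize(neg_log_lik, x0=np.zeros(2), method="BFGS")
print(np.exp(fit.x))   # roughly (2.5, 1.5), up to sampling noise
```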

These limitations motivate approaches that require less structure: GMM only needs moment conditions (not the full distribution), and Bayesian Estimation lets you incorporate prior information to stabilize estimates.