Maximum Likelihood Estimation
The workhorse of parametric estimation. Instead of matching moments, MLE asks: what parameter values make the observed data most probable? It’s more demanding than Method of Moments — you need to specify the entire distribution — but in return, you get the most efficient estimator available in large samples.
The idea
You have data \(x_1, \ldots, x_n\) and a model that says each observation was drawn from a distribution \(f(x \mid \theta)\), where \(\theta\) is unknown. The likelihood function treats the data as fixed and asks: how probable is this particular dataset, as a function of \(\theta\)?
\[ L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta) \]
That’s the joint density of the data, viewed as a function of the parameter rather than the data. The maximum likelihood estimator is the value of \(\theta\) that makes the data most probable:
\[ \hat{\theta}_{\text{MLE}} = \arg\max_\theta \; L(\theta) \]
In practice, products are annoying and numerically unstable, so we work with the log-likelihood:
\[ \ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta) \]
Since \(\log\) is monotonically increasing, maximizing \(\ell(\theta)\) gives the same answer as maximizing \(L(\theta)\). The first-order condition is the score equation:
\[ \frac{\partial \ell(\theta)}{\partial \theta} \bigg|_{\theta = \hat{\theta}} = 0 \]
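To make the recipe concrete, here is a minimal sketch (my own illustration, assuming NumPy and SciPy are available; the simulated data and variable names are made up) that writes down the log-likelihood of a normal sample and maximizes it numerically:

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)   # simulated data; "true" parameters are (2.0, 1.5)

def neg_log_likelihood(theta):
    """Negative log-likelihood of an i.i.d. normal sample: -sum_i log f(x_i | mu, sigma)."""
    mu, log_sigma = theta                       # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(stats.norm.logpdf(x, loc=mu, scale=sigma))

# Minimizing the negative log-likelihood is the same as maximizing L(theta).
result = optimize.minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)   # should be close to x.mean() and x.std()
```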
A worked example
You flip a coin \(n\) times and observe \(k\) heads. The probability of this specific sequence (assuming flips are independent) is:
\[ L(p) = p^k (1-p)^{n-k} \]
The log-likelihood is:
\[ \ell(p) = k \log p + (n - k) \log(1 - p) \]
Take the derivative and set it to zero:
\[ \frac{\partial \ell}{\partial p} = \frac{k}{p} - \frac{n - k}{1 - p} = 0 \]
Solving:
\[ \hat{p}_{\text{MLE}} = \frac{k}{n} \]
The MLE is the sample proportion — exactly what you’d expect. Notice this is also what MoM gives you (set \(E[X] = p\) equal to the sample mean \(k/n\)). For this problem, MLE and MoM agree. That won’t always be the case.
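As a quick numerical sanity check (a sketch with hypothetical numbers, not course code), you can confirm that the maximizer of the Bernoulli log-likelihood really is \(k/n\):

```python
import numpy as np
from scipy import optimize

n, k = 100, 37   # hypothetical data: 37 heads in 100 flips

def neg_log_likelihood(p):
    # ell(p) = k log p + (n - k) log(1 - p)
    return -(k * np.log(p) + (n - k) * np.log(1 - p))

res = optimize.minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, k / n)   # both approximately 0.37
```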
OLS is MLE under normality
Here’s a fact that ties the course together. Suppose your regression model is:
\[ Y_i = X_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) \]
The log-likelihood for one observation is:
\[ \log f(Y_i \mid X_i, \beta, \sigma^2) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(Y_i - X_i'\beta)^2}{2\sigma^2} \]
Sum across all observations:
\[ \ell(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - X_i'\beta)^2 \]
To maximize over \(\beta\), you only need to worry about the second term. The \(\hat{\beta}\) that maximizes this log-likelihood is the one that minimizes \(\sum_i (Y_i - X_i'\beta)^2\) — which is OLS.
\[ \hat{\beta}_{\text{MLE}} = \hat{\beta}_{\text{OLS}} = (X'X)^{-1}X'y \]
OLS is maximum likelihood, under the assumption that errors are normal. This is why the OLS formula feels so natural — it’s doing exactly what MLE would do in the Gaussian model. But notice: if errors aren’t normal, OLS is still the best linear unbiased estimator (by Gauss-Markov), but it’s no longer MLE.
For the full matrix algebra behind this, see The Algebra Behind OLS.
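A small simulation makes the equivalence tangible. This is a sketch with simulated data (the names and numbers are mine): maximizing the Gaussian log-likelihood over \(\beta\) lands on the same estimates as the closed-form OLS formula.

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one regressor
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Closed-form OLS: (X'X)^{-1} X'y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Gaussian MLE: maximize the log-likelihood over (beta, log sigma)
def neg_log_likelihood(params):
    beta, log_sigma = params[:-1], params[-1]
    resid = y - X @ beta
    sigma2 = np.exp(2 * log_sigma)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * np.sum(resid**2) / sigma2

res = optimize.minimize(neg_log_likelihood, x0=np.zeros(3))
beta_mle = res.x[:-1]
print(beta_ols, beta_mle)   # agree to numerical precision
```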
Properties of MLE
MLE is popular for good reason. Under regularity conditions (the parameter space is open, the model is identifiable, the likelihood is smooth enough):
Consistent. \(\hat{\theta}_{\text{MLE}} \xrightarrow{p} \theta_0\) as \(n \to \infty\). The estimator converges to the true value.
Asymptotically efficient. Among consistent estimators, none has a smaller asymptotic variance. MLE achieves the Cramér-Rao lower bound — it extracts the maximum amount of information from the data.
Asymptotically normal.
\[ \sqrt{n}(\hat{\theta}_{\text{MLE}} - \theta_0) \xrightarrow{d} N\big(0, \, \mathcal{I}(\theta_0)^{-1}\big) \]
where \(\mathcal{I}(\theta)\) is the Fisher information (defined below). This means in large samples, MLE estimates are approximately normal, centered at the truth, with variance determined by Fisher information.
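To see the asymptotic normality claim in action, here is a toy simulation of my own (not from the course): repeatedly draw Bernoulli samples, compute the MLE \(\hat{p} = k/n\) each time, and compare the spread of the estimates to the standard deviation predicted by the inverse Fisher information, \(\sqrt{p(1-p)/n}\).

```python
import numpy as np

rng = np.random.default_rng(2)
p_true, n, n_sims = 0.3, 500, 10_000

# The MLE of p for each simulated dataset is just the sample proportion
p_hats = rng.binomial(n, p_true, size=n_sims) / n

predicted_sd = np.sqrt(p_true * (1 - p_true) / n)   # sqrt of I(p)^{-1} / n
print(p_hats.mean(), p_true)        # centered at the truth
print(p_hats.std(), predicted_sd)   # spread matches the Fisher-information prediction
```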
Here’s how MLE compares to Method of Moments:
| Property | MLE | MoM |
|---|---|---|
| Consistency | Yes | Yes |
| Asymptotic normality | Yes | Yes |
| Efficiency | Achieves Cramér-Rao bound | Generally less efficient |
| Assumptions needed | Full distribution | Just moments |
| Computation | May need numerical optimization | Usually closed-form |
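The efficiency row is easiest to appreciate with a toy simulation (my own illustration, not part of the course material). For a Laplace (double-exponential) location model, the MLE of the center is the sample median, while matching the first moment gives the sample mean — and the median is visibly less variable:

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true, b, n, n_sims = 0.0, 1.0, 200, 5_000

samples = rng.laplace(loc=mu_true, scale=b, size=(n_sims, n))
mle = np.median(samples, axis=1)   # MLE of the Laplace location parameter
mom = np.mean(samples, axis=1)     # Method of Moments (match the first moment)

print(mle.std(), mom.std())   # the MLE's spread is roughly 1/sqrt(2) of the MoM's
```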
Fisher information and standard errors
The Fisher information measures how much information one observation carries about \(\theta\):
\[ \mathcal{I}(\theta) = -E\!\left[\frac{\partial^2 \log f(X \mid \theta)}{\partial \theta^2}\right] \]
This is the expected curvature of the log-likelihood. A sharply curved log-likelihood — one with a pronounced peak — means a small change in \(\theta\) causes a big drop in \(\ell\). That’s high information: the data strongly distinguish the true \(\theta\) from nearby values. A flat log-likelihood means low information: many parameter values look almost equally plausible.
The standard error of the MLE is:
\[ \text{SE}(\hat{\theta}) \approx \frac{1}{\sqrt{n \cdot \mathcal{I}(\theta)}} \]
More data (\(n\) large) and more informative data (\(\mathcal{I}\) large) both shrink the standard error.
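For the coin-flip example, the Fisher information can be computed by hand: \(\mathcal{I}(p) = \frac{1}{p(1-p)}\), so the standard error formula reduces to the familiar \(\sqrt{\hat{p}(1-\hat{p})/n}\). A quick sketch with the same hypothetical numbers as before:

```python
import numpy as np

n, k = 100, 37                               # hypothetical data: 37 heads in 100 flips
p_hat = k / n

fisher_info = 1 / (p_hat * (1 - p_hat))      # I(p) = 1 / (p(1-p)) for one Bernoulli draw
se = 1 / np.sqrt(n * fisher_info)            # SE = 1 / sqrt(n * I(p_hat))
print(se)                                    # same as sqrt(p_hat * (1 - p_hat) / n), about 0.048
```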
Connection to regression. For the normal linear model, the Fisher information matrix gives back the familiar variance-covariance matrix of OLS:
\[ \text{Var}(\hat{\beta}) = \sigma^2(X'X)^{-1} \]
This is derived in The Algebra Behind OLS. The Fisher information matrix for \(\beta\) in this model is \(X'X/\sigma^2\), and its inverse is exactly the OLS variance above. The two frameworks — “OLS algebra” and “MLE theory” — are telling you the same thing.
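A self-contained sketch of that correspondence (simulated design matrix and an assumed value of \(\sigma^2\), purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
sigma2 = 0.25                                # assumed error variance

fisher_info = X.T @ X / sigma2               # Fisher information matrix for beta
var_beta = np.linalg.inv(fisher_info)        # equals sigma^2 (X'X)^{-1}
print(np.allclose(var_beta, sigma2 * np.linalg.inv(X.T @ X)))   # True
```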
Limitations
Requires specifying the full distribution. MLE needs the entire density \(f(x \mid \theta)\), not just a few moments. If you specify the wrong distribution, the estimator will still converge — but to the parameter value that makes the assumed distribution closest to the truth in Kullback-Leibler divergence. That’s not necessarily what you want.
Can be biased in small samples. Asymptotic efficiency is a large-sample result. In small samples, MLE can be biased. A classic example: the MLE of \(\sigma^2\) in the normal distribution is \(\frac{1}{n}\sum_i (X_i - \bar{X})^2\), which divides by \(n\) rather than \(n-1\) and is biased downward.
Sensitive to model misspecification. If the true data-generating process doesn’t belong to your parametric family, MLE converges to the “pseudo-true” value — the member of your family closest to the truth in KL divergence. This can be far from the parameter you intended to estimate.
Often requires numerical optimization. For many models, there’s no closed-form MLE. You need iterative algorithms (Newton-Raphson, EM, gradient descent), which can get stuck at local optima or fail to converge.
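As an illustration of what that iterative machinery looks like, here is a minimal Newton-Raphson sketch for logistic regression, a standard model with no closed-form MLE. This is a bare-bones teaching example with simulated data, not a robust implementation.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))   # simulated binary outcomes

beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))                     # predicted probabilities
    score = X.T @ (y - p)                               # gradient of the log-likelihood
    hessian = -(X * (p * (1 - p))[:, None]).T @ X       # second-derivative matrix, -X'WX
    step = np.linalg.solve(hessian, score)
    beta = beta - step                                  # Newton-Raphson update
    if np.max(np.abs(step)) < 1e-10:
        break

print(beta)   # close to beta_true in large samples
```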
These limitations motivate approaches that require less structure: GMM only needs moment conditions (not the full distribution), and Bayesian Estimation lets you incorporate prior information to stabilize estimates.