Training Neural Networks as Maximum Likelihood
The loss functions used to train neural networks are not arbitrary design choices. Most of them are negative log-likelihoods in disguise. If you understand MLE, you already understand what neural network training is doing — and, critically, what it is not doing.
Cross-entropy is negative log-likelihood
The standard loss for classification is cross-entropy. For a binary outcome \(Y_i \in \{0, 1\}\) and a model that predicts \(\hat{p}_i = P(Y_i = 1 \mid X_i)\):
\[ \mathcal{L} = -\frac{1}{n}\sum_{i=1}^n \left[Y_i \log \hat{p}_i + (1 - Y_i)\log(1 - \hat{p}_i)\right] \]
Compare this to the Bernoulli log-likelihood from the MLE page, written here with each observation allowed its own success probability \(p_i\):
\[ \ell(p_1, \ldots, p_n) = \sum_{i=1}^n \left[Y_i \log p_i + (1 - Y_i)\log(1 - p_i)\right] \]
They are the same expression, up to a sign and a scaling constant. Minimizing cross-entropy loss is maximizing the Bernoulli log-likelihood. The neural network’s output layer (sigmoid activation) parameterizes \(p_i\) as a flexible function of \(X_i\), but the estimation principle is identical to logistic regression — which is itself MLE.
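A quick numerical check makes this concrete. Below is a minimal sketch in plain NumPy with made-up labels and predicted probabilities: binary cross-entropy computed by hand and the average negative Bernoulli log-likelihood come out identical.

```python
import numpy as np

# Toy data: binary labels and model-predicted probabilities (illustrative values).
y = np.array([1, 0, 1, 1, 0])
p_hat = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Binary cross-entropy, averaged over observations.
bce = -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Negative Bernoulli log-likelihood, also averaged over observations.
likelihoods = np.where(y == 1, p_hat, 1 - p_hat)
nll = -np.mean(np.log(likelihoods))

print(bce, nll)              # identical up to floating-point error
assert np.isclose(bce, nll)
```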
For multi-class classification with \(K\) categories, softmax cross-entropy is the negative log-likelihood of a multinomial model. For regression with squared-error loss:
\[ \mathcal{L} = \frac{1}{n}\sum_{i=1}^n (Y_i - \hat{Y}_i)^2 \]
this is, up to a positive scale factor and an additive constant, the negative log-likelihood of a Gaussian model with constant variance: exactly the same equivalence shown on the MLE page between OLS and MLE under normality.
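To see the Gaussian case explicitly, write the log-likelihood under \(Y_i \mid X_i \sim N(\hat{Y}_i, \sigma^2)\), with \(\sigma^2\) treated as fixed:
\[ \ell(\theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - \hat{Y}_i)^2 \]
The first term does not depend on the network parameters, so maximizing \(\ell\) is the same as minimizing \(\sum_{i=1}^n (Y_i - \hat{Y}_i)^2\), which is \(n\) times the MSE.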
| Loss function | Equivalent to | Implicit distributional assumption |
|---|---|---|
| Squared error (MSE) | Gaussian MLE | \(Y \mid X \sim N(\hat{Y}, \sigma^2)\) |
| Binary cross-entropy | Bernoulli MLE | \(Y \mid X \sim \text{Bernoulli}(\hat{p})\) |
| Categorical cross-entropy | Multinomial MLE | \(Y \mid X \sim \text{Multinomial}(\hat{p}_1, \ldots, \hat{p}_K)\) |
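The categorical row can be checked the same way. The sketch below, again plain NumPy with illustrative logits, compares softmax cross-entropy against the average negative log-likelihood of the observed classes under the multinomial (one-trial) model:

```python
import numpy as np

# Toy data: three classes, four observations (illustrative values).
logits = np.array([[ 2.0, 0.5, -1.0],
                   [ 0.1, 1.5,  0.3],
                   [-0.5, 0.2,  2.2],
                   [ 1.0, 1.0,  1.0]])
labels = np.array([0, 1, 2, 0])

# Softmax probabilities, computed stably by subtracting the row-wise max.
z = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

# Categorical cross-entropy: average of -log(probability assigned to the true class).
ce = -np.mean(np.log(probs[np.arange(len(labels)), labels]))

# Negative multinomial log-likelihood, written out with one-hot labels.
one_hot = np.eye(3)[labels]
nll = -np.mean(np.sum(one_hot * np.log(probs), axis=1))

print(ce, nll)               # identical up to floating-point error
assert np.isclose(ce, nll)
```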
SGD as approximate MLE
Classical MLE computes the gradient of the full log-likelihood and solves the score equation exactly, either in closed form or by Newton-Raphson. Neural networks can't do this: the loss surface is nonconvex and the datasets are enormous. Instead, they use stochastic gradient descent (SGD): at each step, sample a mini-batch of data, compute the gradient of the loss on that mini-batch, and take a step.
This is approximate MLE. The mini-batch gradient is a noisy, unbiased estimate of the full gradient. Over many steps, SGD traces out a path that (under regularity conditions) converges to a local maximum of the likelihood — though not necessarily the global one.
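A minimal sketch, assuming simulated data and a one-layer sigmoid model (i.e. logistic regression), shows SGD doing approximate MLE: each mini-batch gradient is a noisy but unbiased estimate of the full score, and the iterates drift toward the maximum-likelihood solution. The learning rate, batch size, and step count here are illustrative, not a recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data from a known logistic model (illustrative setup).
n, d = 5000, 3
true_theta = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_theta)))

def minibatch_grad(theta, idx):
    """Gradient of the average negative Bernoulli log-likelihood on one mini-batch."""
    Xb, yb = X[idx], y[idx]
    p_hat = 1 / (1 + np.exp(-Xb @ theta))
    return Xb.T @ (p_hat - yb) / len(idx)

theta = np.zeros(d)
lr, batch_size = 0.1, 64
for step in range(2000):
    idx = rng.choice(n, size=batch_size, replace=False)
    theta -= lr * minibatch_grad(theta, idx)   # noisy, unbiased gradient step

print(theta)   # roughly recovers true_theta, up to SGD noise
```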
The analogy to classical statistics is instructive:
| Classical MLE | Neural network training |
|---|---|
| Full-sample gradient, solve exactly | Mini-batch gradient, iterate |
| Convex log-likelihood (often) | Nonconvex loss landscape |
| Single global optimum (typically) | Multiple local optima |
| Closed-form or Newton-Raphson | SGD, Adam, or variants |
| Fisher information → standard errors | No standard errors by default |
The last row matters. Classical MLE gives you standard errors through the Fisher information. Neural network training typically does not. The model gives you a point prediction, but no measure of uncertainty — a limitation addressed in Calibration and Uncertainty.
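For contrast, here is roughly what the classical side looks like when the model is small enough: the same kind of logistic model fit by Newton-Raphson, where the Fisher information computed during the fit immediately yields standard errors. Again a sketch on simulated data, not production code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated logistic-regression data (illustrative values).
n, d = 5000, 3
X = rng.normal(size=(n, d))
y = rng.binomial(1, 1 / (1 + np.exp(-X @ np.array([1.0, -2.0, 0.5]))))

# Newton-Raphson for the logistic MLE: theta <- theta + I(theta)^{-1} * score(theta).
theta = np.zeros(d)
for _ in range(25):
    p_hat = 1 / (1 + np.exp(-X @ theta))
    score = X.T @ (y - p_hat)                            # gradient of the log-likelihood
    fisher = X.T @ (X * (p_hat * (1 - p_hat))[:, None])  # Fisher information X'WX
    theta = theta + np.linalg.solve(fisher, score)

# Standard errors come for free: square roots of the diagonal of the inverse information.
std_errors = np.sqrt(np.diag(np.linalg.inv(fisher)))
print(theta, std_errors)
```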
What training optimizes — and what it does not
Training a neural network finds parameters \(\hat{\theta}\) that minimize prediction error on the training distribution. This is optimization of a statistical objective: \(\hat{\theta} = \arg\min_\theta \mathcal{L}(\theta)\).
But prediction is not the only thing you might care about. The MLE page noted that if the model is misspecified, MLE converges to the distribution closest to the truth in Kullback-Leibler divergence — not necessarily the “right” answer.
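In symbols: under regularity conditions the estimator converges to the pseudo-true parameter
\[ \theta^* = \arg\min_\theta \; \mathbb{E}_X\!\left[\,\mathrm{KL}\!\big(p_{\text{true}}(Y \mid X)\,\big\|\,p_\theta(Y \mid X)\big)\right], \]
the member of the model family whose conditional distribution is, on average over \(X\), closest to the truth.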
For neural networks, this matters acutely:
- The model learns \(P(Y \mid X)\) — the conditional distribution. It does not learn \(P(Y \mid do(X))\) — the interventional distribution. This distinction is explored in Prediction vs Causation in Foundation Models.
- The model minimizes loss on the training distribution. If the deployment distribution differs (distribution shift), the guarantees vanish.
- The model has no notion of identification. It finds a good predictor, not a causally interpretable parameter.
Connecting to the course
This page bridges two frameworks:
- MLE provides the estimation principle. Neural network training is MLE (or regularized MLE) with a flexible function class.
- Regularization as Bayesian inference shows that weight decay, dropout, and other regularization techniques have principled statistical interpretations — they are not ad hoc tricks.
- The Algebra Behind OLS derived standard errors from \((X'X)^{-1}\). Neural networks lack this closed-form machinery, which is why uncertainty quantification requires separate tools.