Training Neural Networks as Maximum Likelihood

The loss functions used to train neural networks are not arbitrary design choices. Most of them are negative log-likelihoods in disguise. If you understand MLE, you already understand what neural network training is doing — and, critically, what it is not doing.

Cross-entropy is negative log-likelihood

The standard loss for classification is cross-entropy. For a binary outcome \(Y_i \in \{0, 1\}\) and a model that predicts \(\hat{p}_i = P(Y_i = 1 \mid X_i)\):

\[ \mathcal{L} = -\frac{1}{n}\sum_{i=1}^n \left[Y_i \log \hat{p}_i + (1 - Y_i)\log(1 - \hat{p}_i)\right] \]

Compare this to the log-likelihood of a Bernoulli model from the MLE page:

\[ \ell(p) = \sum_{i=1}^n \left[Y_i \log p_i + (1 - Y_i)\log(1 - p_i)\right] \]

They are the same expression, up to a sign and a scaling constant. Minimizing cross-entropy loss is maximizing the Bernoulli log-likelihood. The neural network’s output layer (sigmoid activation) parameterizes \(p_i\) as a flexible function of \(X_i\), but the estimation principle is identical to logistic regression — which is itself MLE.
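
To make the sign-and-scale equivalence concrete, here is a minimal NumPy check on toy data; `p_hat` stands in for a network's sigmoid outputs, and all names and numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 1,000 binary labels and predicted probabilities
# (stand-ins for a trained network's sigmoid outputs).
n = 1_000
p_hat = rng.uniform(0.05, 0.95, size=n)
y = rng.binomial(1, p_hat)

# Binary cross-entropy, as a deep learning library would compute it.
bce = -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Bernoulli log-likelihood, as written on the MLE page.
loglik = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Same quantity, up to a sign and the 1/n scaling.
assert np.isclose(bce, -loglik / n)
print(f"cross-entropy = {bce:.4f}, -loglik/n = {-loglik / n:.4f}")
```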

For multi-class classification with \(K\) categories, softmax cross-entropy is the negative log-likelihood of a multinomial model. For regression with squared-error loss:

\[ \mathcal{L} = \frac{1}{n}\sum_{i=1}^n (Y_i - \hat{Y}_i)^2 \]

this is, up to additive and multiplicative constants, the negative log-likelihood of a Gaussian model with constant variance — exactly the same equivalence shown on the MLE page between OLS and MLE under normality.
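
A quick sketch of the Gaussian case: with constant variance, the negative log-likelihood is an affine function of the MSE, so the two objectives share a minimizer. The grid search below assumes a constant-output model and \(\sigma = 1\); the setup is illustrative, not a recipe:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
y = rng.normal(2.0, 1.0, size=n)  # toy targets

def mse(y_hat):
    return np.mean((y - y_hat) ** 2)

def gaussian_nll(y_hat, sigma=1.0):
    # Negative Gaussian log-likelihood with constant variance:
    # an affine function of the MSE, so the argmin is the same.
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + (y - y_hat) ** 2 / (2 * sigma**2))

# Evaluate both objectives over a grid of constant predictions.
grid = np.linspace(0.0, 4.0, 401)
mse_vals = np.array([mse(c) for c in grid])
nll_vals = np.array([gaussian_nll(c) for c in grid])

# Both are minimized at the same point (near the sample mean).
assert mse_vals.argmin() == nll_vals.argmin()
print(f"argmin MSE = argmin NLL = {grid[mse_vals.argmin()]:.2f}")
```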

| Loss function | Equivalent to | Implicit distributional assumption |
| --- | --- | --- |
| Squared error (MSE) | Gaussian MLE | \(Y \mid X \sim N(\hat{Y}, \sigma^2)\) |
| Binary cross-entropy | Bernoulli MLE | \(Y \mid X \sim \text{Bernoulli}(\hat{p})\) |
| Categorical cross-entropy | Multinomial MLE | \(Y \mid X \sim \text{Multinomial}(\hat{p}_1, \ldots, \hat{p}_K)\) |

What this means. When a paper says “we trained a neural network with cross-entropy loss,” it is saying “we found the parameters that maximize the likelihood of the observed labels under a Bernoulli model.” The network architecture determines the function class; the loss function determines the estimation principle. The estimation principle, in most cases, is MLE.

SGD as approximate MLE

Classical MLE differentiates the full log-likelihood and solves the score equation, in closed form or by Newton-Raphson. Neural network training can't do this — the loss surface is nonconvex and the datasets are enormous. Instead, it uses stochastic gradient descent (SGD): at each step, sample a mini-batch of data, compute the gradient of the loss on that mini-batch, and take a step.

This is approximate MLE. The mini-batch gradient is a noisy but unbiased estimate of the full gradient, provided the batches are sampled at random. Over many steps, SGD traces out a path that (under regularity conditions) converges to a local maximum of the likelihood — though not necessarily the global one.
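
A small simulation makes the unbiasedness claim tangible. Logistic regression stands in for the network here, and the sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)

# Logistic regression as a one-layer network: data and a parameter vector.
n, d = 10_000, 5
X = rng.normal(size=(n, d))
y = rng.binomial(1, 1 / (1 + np.exp(-X @ rng.normal(size=d))))
theta = rng.normal(size=d)

def grad(idx):
    # Gradient of the mean cross-entropy loss on the rows in idx.
    p = 1 / (1 + np.exp(-X[idx] @ theta))
    return X[idx].T @ (p - y[idx]) / len(idx)

# Full-sample gradient of the mean loss (the score, up to sign and 1/n).
full = grad(np.arange(n))

# Average many mini-batch gradients: each one is noisy, but because
# each is an unbiased estimate, the noise averages out.
batches = [grad(rng.choice(n, size=64, replace=False)) for _ in range(2_000)]
print("full gradient:       ", np.round(full, 4))
print("mean of batch grads: ", np.round(np.mean(batches, axis=0), 4))
```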

The analogy to classical statistics is instructive:

| Classical MLE | Neural network training |
| --- | --- |
| Full-sample gradient, solve exactly | Mini-batch gradient, iterate |
| Convex log-likelihood (often) | Nonconvex loss landscape |
| Single global optimum (typically) | Multiple local optima |
| Closed-form or Newton-Raphson | SGD, Adam, or variants |
| Fisher information → standard errors | No standard errors by default |

The last row matters. Classical MLE gives you standard errors through the Fisher information. Neural network training typically does not. The model gives you a point prediction, but no measure of uncertainty — a limitation addressed in Calibration and Uncertainty.
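
For contrast, here is what the Fisher information buys in the simplest classical setting — a Bernoulli mean rather than a network. The example is a sketch, not general-purpose code:

```python
import numpy as np

rng = np.random.default_rng(3)

# Bernoulli MLE: a point estimate plus a standard error from Fisher information.
n = 2_000
y = rng.binomial(1, 0.3, size=n)  # toy data with true p = 0.3
p_hat = y.mean()                  # the MLE

# Fisher information per observation: I(p) = 1 / (p(1 - p)).
# Asymptotic variance of the MLE: 1 / (n * I(p_hat)) = p_hat(1 - p_hat)/n.
se = np.sqrt(p_hat * (1 - p_hat) / n)
print(f"p_hat = {p_hat:.3f} +/- {1.96 * se:.3f} (95% CI)")

# A trained classifier emits p_hat-style predictions with no analogue of
# the se line above; uncertainty quantification needs separate machinery.
```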

What training optimizes — and what it does not

Training a neural network finds parameters \(\hat{\theta}\) that minimize prediction error on the training distribution. This is optimization of a statistical objective: \(\hat{\theta} = \arg\min_\theta \mathcal{L}(\theta)\).

But prediction is not the only thing you might care about. The MLE page noted that if the model is misspecified, MLE converges to the distribution closest to the truth in Kullback-Leibler divergence — not necessarily the “right” answer.

For neural networks, this matters acutely:

  • The model learns \(P(Y \mid X)\) — the conditional distribution. It does not learn \(P(Y \mid do(X))\) — the interventional distribution. This distinction is explored in Prediction vs Causation in Foundation Models.
  • The model minimizes loss on the training distribution. If the deployment distribution differs (distribution shift), the guarantees vanish (see the sketch after this list).
  • The model has no notion of identification. It finds a good predictor, not a causally interpretable parameter.

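A toy illustration of the distribution-shift point, using a deliberately misspecified linear fit; the data-generating process is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

def make_data(n, x_mean):
    # Same conditional relationship in both regimes; only P(X) shifts.
    X = rng.normal(x_mean, 1.0, size=n)
    y = X + 0.5 * X**2 + rng.normal(0, 0.5, size=n)  # true nonlinear signal
    return X, y

# Fit a misspecified (linear) predictor on the training distribution.
X_tr, y_tr = make_data(5_000, x_mean=0.0)
slope, intercept = np.polyfit(X_tr, y_tr, deg=1)

def mse(X, y):
    return np.mean((y - (slope * X + intercept)) ** 2)

# In-distribution error looks fine; under covariate shift it degrades,
# because the fit was optimized for the training P(X) only.
X_te, y_te = make_data(5_000, x_mean=3.0)
print(f"train-distribution MSE:   {mse(X_tr, y_tr):.2f}")
print(f"shifted-distribution MSE: {mse(X_te, y_te):.2f}")
```
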
The key distinction. Training minimizes prediction error. Causal inference asks whether the parameter is identified — whether you can recover the quantity of interest from the data at all, regardless of sample size or model complexity. These are different questions, and a powerful model that answers the first does not automatically answer the second.

Connecting to the course

This page bridges two frameworks:

  • MLE provides the estimation principle. Neural network training is MLE (or regularized MLE) with a flexible function class.
  • Regularization as Bayesian inference shows that weight decay, dropout, and other regularization techniques have principled statistical interpretations — they are not ad hoc tricks.
  • The Algebra Behind OLS derived standard errors from \((X'X)^{-1}\). Neural networks lack this closed-form machinery, which is why uncertainty quantification requires separate tools.