Maximum Likelihood Estimation

The workhorse of parametric estimation. Instead of matching moments, MLE asks: what parameter values make the observed data most probable? It’s more demanding than Method of Moments — you need to specify the entire distribution — but in return you get the most efficient estimator available in large samples.

The idea

You have data \(x_1, \ldots, x_n\) and a model that says each observation was drawn from a distribution \(f(x \mid \theta)\), where \(\theta\) is unknown. The likelihood function treats the data as fixed and asks: how probable is this particular dataset, as a function of \(\theta\)?

\[ L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta) \]

That’s the joint density of the data, viewed as a function of the parameter rather than the data. The maximum likelihood estimator is the value of \(\theta\) that makes the data most probable:

\[ \hat{\theta}_{\text{MLE}} = \arg\max_\theta \; L(\theta) \]

In practice, products are annoying and numerically unstable, so we work with the log-likelihood:

\[ \ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta) \]

Since \(\log\) is monotonically increasing, maximizing \(\ell(\theta)\) gives the same answer as maximizing \(L(\theta)\). The first-order condition is the score equation:

\[ \frac{\partial \ell(\theta)}{\partial \theta} \bigg|_{\theta = \hat{\theta}} = 0 \]

A worked example

You flip a coin \(n\) times and observe \(k\) heads. The probability of this specific sequence (assuming flips are independent) is:

\[ L(p) = p^k (1-p)^{n-k} \]

The log-likelihood is:

\[ \ell(p) = k \log p + (n - k) \log(1 - p) \]

Take the derivative and set it to zero:

\[ \frac{\partial \ell}{\partial p} = \frac{k}{p} - \frac{n - k}{1 - p} = 0 \]

Solving:

\[ \hat{p}_{\text{MLE}} = \frac{k}{n} \]

The MLE is the sample proportion — exactly what you’d expect. Notice this is also what MoM gives you (set \(E[X] = p\) equal to the sample mean \(k/n\)). For this problem, MLE and MoM agree. That won’t always be the case.
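You can verify this numerically. A quick sketch (with illustrative values \(k = 7\), \(n = 10\)): hand the log-likelihood to R’s one-dimensional optimizer and check that it lands on \(k/n\).

```r
# Numerical check of the coin-flip MLE (illustrative values: k = 7, n = 10)
k <- 7; n <- 10

loglik <- function(p) k * log(p) + (n - k) * log(1 - p)

# One-dimensional maximization over the interior of (0, 1)
fit <- optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE, tol = 1e-8)

fit$maximum   # numerical maximizer, approximately 0.7
k / n         # closed-form MLE: 0.7
```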

Connection to Bayesian updating. If you put a flat (uniform) prior on \(p\), the posterior distribution is proportional to the likelihood. The posterior mode — the peak of the posterior — equals the MLE. With a non-flat prior, the posterior mode gets pulled away from the MLE, as explored in Bayesian Estimation. For the full mechanics of how priors combine with likelihoods, see Bayesian Updating.

Simulation: Likelihood surface for coin flips

Watch the log-likelihood curve sharpen as you increase the number of flips. With few flips, many values of \(p\) look plausible (flat curve). With many flips, the curve becomes sharply peaked around the MLE — the data strongly distinguish the true parameter. This is Fisher information in action.

#| standalone: true
#| viewerHeight: 680

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("true_p", "True p:",
                  min = 0.1, max = 0.9, value = 0.6, step = 0.05),

      sliderInput("n_flips", "Number of flips n:",
                  min = 5, max = 200, value = 20, step = 5),

      actionButton("go", "Flip coins",
                   class = "btn-primary", width = "100%"),

      uiOutput("results")
    ),

    mainPanel(
      width = 9,
      plotOutput("ll_plot", height = "520px")
    )
  )
)

server <- function(input, output, session) {

  dat <- reactive({
    input$go
    true_p  <- input$true_p
    n_flips <- input$n_flips

    # Simulate flips
    flips <- rbinom(n_flips, size = 1, prob = true_p)
    k     <- sum(flips)
    mle   <- k / n_flips

    list(true_p = true_p, n_flips = n_flips, k = k, mle = mle)
  })

  output$ll_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    p_seq <- seq(0.001, 0.999, length.out = 1000)

    # Log-likelihood: k*log(p) + (n-k)*log(1-p)
    ll <- d$k * log(p_seq) + (d$n_flips - d$k) * log(1 - p_seq)

    # Normalize so the max is 0 (relative log-likelihood)
    ll <- ll - max(ll)

    plot(p_seq, ll, type = "l", lwd = 2.5, col = "#3498db",
         xlab = "p", ylab = "Log-likelihood (relative)",
         main = paste0("Log-likelihood: ", d$k, " heads in ",
                       d$n_flips, " flips"),
         ylim = c(max(min(ll), -10), 0.5))

    # MLE vertical line
    abline(v = d$mle, lwd = 2.5, col = "#e74c3c")

    # True p dashed line
    abline(v = d$true_p, lty = 2, lwd = 2.5, col = "#2c3e50")

    legend("topright", bty = "n", cex = 0.9,
           legend = c(paste0("MLE = k/n = ", round(d$mle, 4)),
                      paste0("True p = ", d$true_p),
                      "Log-likelihood"),
           col = c("#e74c3c", "#2c3e50", "#3498db"),
           lwd = 2.5, lty = c(1, 2, 1))
  })

  output$results <- renderUI({
    d <- dat()

    # Observed Fisher information: curvature at MLE
    # Second derivative of ll: -k/p^2 - (n-k)/(1-p)^2
    # At the MLE p_hat = k/n:
    # -n/p_hat - n/(1-p_hat) = -n/(p_hat*(1-p_hat))
    # Fisher info = negative of that = n/(p_hat*(1-p_hat))
    if (d$mle > 0 && d$mle < 1) {
      fisher <- d$n_flips / (d$mle * (1 - d$mle))
    } else {
      fisher <- NA
    }

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>n flips:</b> ", d$n_flips, "<br>",
        "<b>k heads:</b> ", d$k, "<br>",
        "<b>MLE = k/n:</b> ", round(d$mle, 4), "<br>",
        "<b>True p:</b> ", d$true_p, "<br>",
        "<hr style='margin:8px 0'>",
        "<b>Observed Fisher info:</b> ",
        if (!is.na(fisher)) round(fisher, 2) else "undefined",
        "<br>",
        "<b>Approx SE:</b> ",
        if (!is.na(fisher) && fisher > 0)
          round(1 / sqrt(fisher), 4) else "undefined"
      ))
    )
  })
}

shinyApp(ui, server)

OLS is MLE under normality

Here’s a fact that ties the course together. Suppose your regression model is:

\[ Y_i = X_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) \]

The log-likelihood for one observation is:

\[ \log f(Y_i \mid X_i, \beta, \sigma^2) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(Y_i - X_i'\beta)^2}{2\sigma^2} \]

Sum across all observations:

\[ \ell(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - X_i'\beta)^2 \]

To maximize over \(\beta\), you only need to worry about the second term. The \(\hat{\beta}\) that maximizes this log-likelihood is the one that minimizes \(\sum_i (Y_i - X_i'\beta)^2\) — which is OLS.

\[ \hat{\beta}_{\text{MLE}} = \hat{\beta}_{\text{OLS}} = (X'X)^{-1}X'y \]

OLS is maximum likelihood, under the assumption that errors are normal. This is why the OLS formula feels so natural — it’s doing exactly what MLE would do in the Gaussian model. But notice: if errors aren’t normal, OLS is still the best linear unbiased estimator (by Gauss-Markov), but it’s no longer MLE.

For the full matrix algebra behind this, see The Algebra Behind OLS.
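To see the equivalence numerically, here’s a sketch on simulated data (the coefficients are made up for illustration): compute \(\hat{\beta}\) from the closed form and from direct maximization of the normal log-likelihood with `optim()`.

```r
# OLS and MLE agree under normal errors (simulated data, made-up DGP)
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)               # true beta = (1, 2), sigma = 1
X <- cbind(1, x)

# OLS via the closed form (X'X)^{-1} X'y
beta_ols <- solve(t(X) %*% X, t(X) %*% y)

# MLE via numerical maximization of the normal log-likelihood
negll <- function(par) {
  beta  <- par[1:2]
  sigma <- exp(par[3])                  # optimize log(sigma) so sigma stays positive
  -sum(dnorm(y, mean = X %*% beta, sd = sigma, log = TRUE))
}
fit <- optim(c(0, 0, 0), negll, method = "BFGS",
             control = list(reltol = 1e-12, maxit = 500))

cbind(ols = as.numeric(beta_ols), mle = fit$par[1:2])   # columns should match
```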

Why is OLS so famous? If MoM, MLE, GMM, and Bayesian estimation are all valid approaches, why does OLS dominate introductory courses and applied work?

  1. Closed-form solution — \((X'X)^{-1}X'y\) is a formula, not an iterative algorithm. Before computers, this mattered enormously.
  2. Minimal assumptions — Gauss-Markov says OLS is the best linear unbiased estimator without even needing normality. You don’t need to specify the full distribution.
  3. It’s MLE when errors are normal — so in the most common case, OLS is the most efficient estimator. You get MLE for free.
  4. Interpretability — coefficients are partial derivatives of the conditional mean. Everyone can understand “a one-unit change in \(X\) is associated with a \(\hat{\beta}\) change in \(Y\).”
  5. Historical momentum — Gauss and Legendre published it around 1800. By the time MLE (Fisher, 1920s) and GMM (Hansen, 1982) arrived, OLS had a 120+ year head start in textbooks.

OLS hit a sweet spot: minimal assumptions, maximum interpretability, and a formula you can compute by hand. The other methods are more general and more powerful — but OLS solved 80% of problems with 20% of the effort.

So where is MLE? Everywhere — just under other names. Logistic regression IS MLE — you write down a Bernoulli likelihood and maximize it; the thing we call “logistic regression” is just what comes out. Cross-entropy training is MLE. Poisson regression, probit, Cox proportional hazards — all MLE. Every glm() call and every neural network training run is doing maximum likelihood. But MLE is a principle (“maximize the likelihood”), not a formula. The answer depends on the model: normal errors give you OLS, Bernoulli gives you logistic regression, neural networks give you SGD on cross-entropy. OLS is famous because it’s a concrete tool with a napkin formula. MLE is the invisible engine inside specific tools that get their own names.

Cross-entropy is MLE for classification

OLS is MLE when the errors are normal. There’s an equally clean equivalence for classification.

Suppose you have binary outcomes \(Y_i \in \{0, 1\}\) and a model that predicts \(\hat{p}_i = P(Y_i = 1 \mid X_i)\). The likelihood is:

\[ L = \prod_{i=1}^n \hat{p}_i^{Y_i}(1 - \hat{p}_i)^{1-Y_i} \]

The log-likelihood is:

\[ \ell = \sum_{i=1}^n \left[Y_i \log \hat{p}_i + (1 - Y_i) \log(1 - \hat{p}_i)\right] \]

The cross-entropy loss is the negative of this, divided by \(n\):

\[ \text{CE} = -\frac{1}{n}\sum_{i=1}^n \left[Y_i \log \hat{p}_i + (1 - Y_i) \log(1 - \hat{p}_i)\right] \]

Minimizing cross-entropy = maximizing log-likelihood = MLE. When you train a logistic regression or a neural network with cross-entropy loss, you are doing maximum likelihood estimation. The loss function is not an arbitrary design choice — it comes directly from the Bernoulli likelihood.
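A sketch of that equivalence on simulated data (the true coefficients are made up): minimizing the cross-entropy with `optim()` reproduces the estimates from `glm()`.

```r
# Minimizing cross-entropy reproduces glm()'s logistic MLE (simulated data)
set.seed(2)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))   # made-up true coefficients

# Cross-entropy for a logistic model p_i = plogis(b0 + b1 * x_i)
cross_entropy <- function(b) {
  p <- plogis(b[1] + b[2] * x)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

fit_ce  <- optim(c(0, 0), cross_entropy, method = "BFGS",
                 control = list(reltol = 1e-12))
fit_glm <- glm(y ~ x, family = binomial)

rbind(ce = fit_ce$par, glm = unname(coef(fit_glm)))   # rows should match
```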

For multi-class problems with \(K\) classes, the likelihood is multinomial and the cross-entropy generalizes to:

\[ \text{CE} = -\frac{1}{n}\sum_{i=1}^n \sum_{k=1}^K Y_{ik} \log \hat{p}_{ik} \]

where \(Y_{ik} = 1\) if observation \(i\) belongs to class \(k\). This says: maximize the predicted probability of the correct class.
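A tiny numerical illustration with made-up predicted probabilities: because \(Y_{ik}\) is one-hot, the double sum reduces to the log-probability assigned to each observation’s correct class.

```r
# Multi-class cross-entropy on a tiny made-up example: 3 observations, K = 3
Y <- diag(3)                      # one-hot labels: observation i belongs to class i
P <- rbind(c(0.7, 0.2, 0.1),      # predicted class probabilities (rows sum to 1)
           c(0.1, 0.8, 0.1),
           c(0.2, 0.3, 0.5))

# CE = -(1/n) * sum_i sum_k Y_ik * log(P_ik)
ce <- -mean(rowSums(Y * log(P)))

# Only the probability assigned to the correct class survives the sum
ce_direct <- -mean(log(c(0.7, 0.8, 0.5)))
```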

The pattern. OLS minimizes squared residuals = MLE under normality. Cross-entropy minimizes negative log-probability = MLE under Bernoulli/multinomial. In both cases, the “loss function” is just the negative log-likelihood of a specific probabilistic model. This is why Training as MLE argues that neural network training is MLE in disguise — the loss function is the likelihood.

Properties of MLE

MLE is popular for good reason. Under regularity conditions (the parameter space is open, the model is identifiable, the likelihood is smooth enough):

Consistent. \(\hat{\theta}_{\text{MLE}} \xrightarrow{p} \theta_0\) as \(n \to \infty\). The estimator converges to the true value.

Asymptotically efficient. No consistent estimator has smaller variance in large samples. MLE achieves the Cramer-Rao lower bound — it extracts the maximum amount of information from the data.

Asymptotically normal.

\[ \sqrt{n}(\hat{\theta}_{\text{MLE}} - \theta_0) \xrightarrow{d} N\big(0, \, \mathcal{I}(\theta_0)^{-1}\big) \]

where \(\mathcal{I}(\theta)\) is the Fisher information (defined below). This means in large samples, MLE estimates are approximately normal, centered at the truth, with variance determined by Fisher information.
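A quick Monte Carlo sketch of this result for the Bernoulli model, where \(\mathcal{I}(p) = 1/(p(1-p))\) (the values of \(p\) and \(n\) are illustrative): the simulated variance of \(\hat{p}\) should match \(\mathcal{I}(p)^{-1}/n = p(1-p)/n\).

```r
# Monte Carlo check: for Bernoulli(p), I(p) = 1/(p(1-p)),
# so Var(p_hat) should be close to p(1-p)/n (illustrative p, n)
set.seed(3)
p <- 0.3; n <- 500
p_hat <- replicate(20000, mean(rbinom(n, 1, p)))

var(p_hat)          # simulated variance of the MLE
p * (1 - p) / n     # theoretical I(p)^{-1} / n = 0.00042
```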

Here’s how MLE compares to Method of Moments:

| Property | MLE | MoM |
|---|---|---|
| Consistency | Yes | Yes |
| Asymptotic normality | Yes | Yes |
| Efficiency | Achieves Cramer-Rao bound | Generally less efficient |
| Assumptions needed | Full distribution | Just moments |
| Computation | May need numerical optimization | Usually closed-form |

Fisher information and standard errors

The Fisher information measures how much information one observation carries about \(\theta\):

\[ \mathcal{I}(\theta) = -E\!\left[\frac{\partial^2 \log f(X \mid \theta)}{\partial \theta^2}\right] \]

This is the expected curvature of the log-likelihood. A sharply curved log-likelihood — one with a pronounced peak — means a small change in \(\theta\) causes a big drop in \(\ell\). That’s high information: the data strongly distinguish the true \(\theta\) from nearby values. A flat log-likelihood means low information: many parameter values look almost equally plausible.

Intuition. Think of the log-likelihood as a mountain. Fisher information measures how steep the sides are near the peak. A sharp peak means the data pinpoint \(\theta\) precisely. A broad, flat hilltop means you’re uncertain about where exactly the peak is.

The standard error of the MLE is:

\[ \text{SE}(\hat{\theta}) \approx \frac{1}{\sqrt{n \cdot \mathcal{I}(\theta)}} \]

More data (\(n\) large) and more informative data (\(\mathcal{I}\) large) both shrink the standard error.
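A sketch for the coin example (the counts are illustrative): compute the curvature numerically with `optimHess()` from base R’s stats package and compare the resulting SE to the closed form \(\sqrt{\hat{p}(1-\hat{p})/n}\).

```r
# SE from the curvature of the log-likelihood (illustrative: k = 60, n = 100)
k <- 60; n <- 100
negll <- function(p) -(k * log(p) + (n - k) * log(1 - p))

p_hat <- k / n
# optimHess() (base stats) computes a numerical Hessian of negll at p_hat;
# that is the observed Fisher information n / (p_hat * (1 - p_hat))
obs_info <- as.numeric(optimHess(p_hat, negll))

1 / sqrt(obs_info)               # SE from curvature, approximately 0.049
sqrt(p_hat * (1 - p_hat) / n)    # closed form, same number
```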

Connection to regression. For the normal linear model, the Fisher information matrix gives back the familiar variance-covariance matrix of OLS:

\[ \text{Var}(\hat{\beta}) = \sigma^2(X'X)^{-1} \]

This is derived in The Algebra Behind OLS. The \((X'X)^{-1}\) piece is the inverse of the Fisher information matrix for the regression setting. The two frameworks — “OLS algebra” and “MLE theory” — are telling you the same thing.
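A quick check on simulated data (made-up coefficients): building \(\hat{\sigma}^2(X'X)^{-1}\) by hand reproduces what `vcov()` reports for an `lm()` fit. Note that `vcov()` plugs in the unbiased \(\hat{\sigma}^2\), which divides by \(n - p\).

```r
# The OLS variance formula by hand vs. vcov() (simulated data, made-up DGP)
set.seed(4)
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)

fit <- lm(y ~ x)
X <- model.matrix(fit)
sigma2_hat <- sum(residuals(fit)^2) / (n - 2)   # vcov() uses the unbiased estimate

V_manual <- sigma2_hat * solve(t(X) %*% X)
V_lm     <- vcov(fit)

all.equal(unname(V_manual), unname(V_lm))       # TRUE
```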

How MLE actually finds the maximum

MLE says “maximize the likelihood.” But for most models there’s no closed-form solution — you can’t just take a derivative, set it to zero, and solve. Instead, software uses iterative optimization algorithms that start at a guess and walk uphill until they reach the peak.

The main algorithms

| Algorithm | How it works | Used by |
|---|---|---|
| Newton-Raphson | Uses the gradient (slope) and the Hessian (curvature) to jump directly toward the peak. Quadratic convergence — very fast near the optimum. | glm() in R, Stata’s ml |
| BFGS / L-BFGS-B | Approximates the Hessian instead of computing it exactly. L-BFGS-B adds box constraints (parameter bounds). Cheaper per step than Newton. | R’s optim(), Python’s scipy.optimize |
| Gradient descent / SGD | Uses only the gradient — takes small steps in the steepest direction. Slow but scales to millions of parameters. SGD uses random subsets of data per step. | Neural network training (PyTorch, TensorFlow) |
| EM (Expectation-Maximization) | For models with latent variables. Alternates between imputing the missing data (E-step) and maximizing (M-step). Each step is easy; convergence can be slow. | Mixture models, HMMs |

Intuition. Think of the log-likelihood as a landscape and you’re trying to find the highest point. Newton-Raphson is like having a detailed topographic map — you can jump straight to the peak. L-BFGS-B has a rougher map but it’s cheaper to read. Gradient descent has no map at all — it just looks at the slope under its feet and walks uphill. All three get to the top eventually; they differ in speed and cost per step.
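Here’s a minimal Newton-Raphson sketch on a one-parameter problem where we can check the answer — the MLE of an exponential rate, which has the closed form \(1/\bar{x}\). The data and starting value are made up.

```r
# Minimal Newton-Raphson: MLE of an exponential rate lambda
# (closed form is 1/mean(x); data and starting value are made up)
set.seed(5)
x <- rexp(200, rate = 2)
n <- length(x); s <- sum(x)

# log-likelihood: n*log(lambda) - lambda*s
score <- function(l) n / l - s       # first derivative
hess  <- function(l) -n / l^2        # second derivative

lambda <- 1                          # starting guess
for (i in 1:20) {
  step   <- score(lambda) / hess(lambda)
  lambda <- lambda - step            # Newton update: jump using curvature
  if (abs(step) < 1e-10) break       # converged
}

lambda        # Newton-Raphson answer
1 / mean(x)   # closed-form MLE, same number
```

Each update divides the score by the curvature, which is exactly the “topographic map” jump in the intuition above.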

Watch the algorithms climb

Generate data from a logistic regression, then watch Newton-Raphson, BFGS, and gradient descent find the MLE. The left panel shows the log-likelihood surface with each algorithm’s path. The right panel tracks the log-likelihood value at each iteration.

#| standalone: true
#| viewerHeight: 680

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
    .good { color: #27ae60; font-weight: bold; }
    .bad  { color: #e74c3c; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("n_obs", "Sample size:",
                  min = 50, max = 500, value = 200, step = 50),

      sliderInput("start_b0", "Starting guess for intercept:",
                  min = -4, max = 4, value = -3, step = 0.5),

      sliderInput("start_b1", "Starting guess for slope:",
                  min = -4, max = 4, value = 3, step = 0.5),

      sliderInput("lr", "Gradient descent step size:",
                  min = 0.01, max = 0.5, value = 0.1, step = 0.01),

      sliderInput("max_iter", "Max iterations:",
                  min = 5, max = 100, value = 30, step = 5),

      actionButton("go", "New data + run",
                   class = "btn-primary", width = "100%"),

      uiOutput("results")
    ),

    mainPanel(
      width = 9,
      fluidRow(
        column(6, plotOutput("contour_plot", height = "480px")),
        column(6, plotOutput("convergence_plot", height = "480px"))
      )
    )
  )
)

server <- function(input, output, session) {

  # Logistic log-likelihood
  loglik <- function(beta, x, y) {
    eta <- beta[1] + beta[2] * x
    eta <- pmax(pmin(eta, 20), -20)  # prevent overflow
    sum(y * eta - log(1 + exp(eta)))
  }

  # Gradient
  grad_loglik <- function(beta, x, y) {
    eta <- beta[1] + beta[2] * x
    eta <- pmax(pmin(eta, 20), -20)
    p <- 1 / (1 + exp(-eta))
    r <- y - p
    c(sum(r), sum(r * x))
  }

  # Hessian
  hess_loglik <- function(beta, x, y) {
    eta <- beta[1] + beta[2] * x
    eta <- pmax(pmin(eta, 20), -20)
    p <- 1 / (1 + exp(-eta))
    w <- p * (1 - p)
    H <- matrix(0, 2, 2)
    H[1,1] <- -sum(w)
    H[1,2] <- H[2,1] <- -sum(w * x)
    H[2,2] <- -sum(w * x^2)
    H
  }

  dat <- reactive({
    input$go

    n <- input$n_obs
    start <- c(input$start_b0, input$start_b1)
    lr <- input$lr
    max_iter <- input$max_iter

    # True parameters
    true_b <- c(0.5, 1.5)
    x <- rnorm(n)
    p <- 1 / (1 + exp(-(true_b[1] + true_b[2] * x)))
    y <- rbinom(n, 1, p)

    # --- Newton-Raphson ---
    nr_path <- matrix(start, nrow = 1)
    nr_ll <- loglik(start, x, y)
    beta <- start
    for (i in seq_len(max_iter)) {
      g <- grad_loglik(beta, x, y)
      H <- hess_loglik(beta, x, y)
      det_H <- H[1,1]*H[2,2] - H[1,2]*H[2,1]
      if (abs(det_H) < 1e-10) break
      H_inv <- matrix(c(H[2,2], -H[1,2], -H[2,1], H[1,1]), 2, 2) / det_H
      beta <- beta - H_inv %*% g
      beta <- pmax(pmin(as.numeric(beta), 10), -10)
      nr_path <- rbind(nr_path, beta)
      nr_ll <- c(nr_ll, loglik(beta, x, y))
      if (max(abs(g)) < 1e-8) break
    }

    # --- BFGS (approximate Hessian) ---
    bfgs_path <- matrix(start, nrow = 1)
    bfgs_ll <- loglik(start, x, y)
    beta <- start
    B_inv <- diag(2)  # start with identity as inverse Hessian approx
    g_old <- grad_loglik(beta, x, y)
    for (i in seq_len(max_iter)) {
      direction <- as.numeric(B_inv %*% g_old)
      beta_new <- beta + direction * 0.5  # damped step
      beta_new <- pmax(pmin(beta_new, 10), -10)
      g_new <- grad_loglik(beta_new, x, y)
      s <- beta_new - beta
      yy <- g_new - g_old
      sy <- sum(s * yy)
      if (abs(sy) > 1e-10) {
        rho <- 1 / sy
        I2 <- diag(2)
        B_inv <- (I2 - rho * s %*% t(yy)) %*% B_inv %*%
                 (I2 - rho * yy %*% t(s)) + rho * s %*% t(s)
      }
      beta <- beta_new
      g_old <- g_new
      bfgs_path <- rbind(bfgs_path, beta)
      bfgs_ll <- c(bfgs_ll, loglik(beta, x, y))
      if (max(abs(g_new)) < 1e-8) break
    }

    # --- Gradient descent ---
    gd_path <- matrix(start, nrow = 1)
    gd_ll <- loglik(start, x, y)
    beta <- start
    for (i in seq_len(max_iter)) {
      g <- grad_loglik(beta, x, y)
      beta <- beta + lr * g / n  # normalize by n for stability
      beta <- pmax(pmin(beta, 10), -10)
      gd_path <- rbind(gd_path, beta)
      gd_ll <- c(gd_ll, loglik(beta, x, y))
      if (max(abs(g)) < 1e-8) break
    }

    # MLE (from glm)
    fit <- glm(y ~ x, family = binomial)
    mle <- coef(fit)

    list(x = x, y = y, true_b = true_b, mle = mle,
         nr_path = nr_path, nr_ll = nr_ll,
         bfgs_path = bfgs_path, bfgs_ll = bfgs_ll,
         gd_path = gd_path, gd_ll = gd_ll)
  })

  output$contour_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    # Grid for log-likelihood contours
    b0_seq <- seq(-4, 5, length.out = 80)
    b1_seq <- seq(-3, 5, length.out = 80)
    ll_grid <- matrix(NA, 80, 80)
    for (i in seq_along(b0_seq)) {
      for (j in seq_along(b1_seq)) {
        ll_grid[i, j] <- loglik(c(b0_seq[i], b1_seq[j]), d$x, d$y)
      }
    }

    # Contour plot
    contour(b0_seq, b1_seq, ll_grid, nlevels = 25,
            xlab = expression(beta[0] ~ "(intercept)"),
            ylab = expression(beta[1] ~ "(slope)"),
            main = "Optimizer paths on log-likelihood",
            col = "gray70", labcex = 0.6)

    # Newton-Raphson path
    lines(d$nr_path[,1], d$nr_path[,2], col = "#e74c3c", lwd = 2.5)
    points(d$nr_path[,1], d$nr_path[,2], col = "#e74c3c",
           pch = 16, cex = 1.2)

    # BFGS path
    lines(d$bfgs_path[,1], d$bfgs_path[,2], col = "#3498db", lwd = 2.5)
    points(d$bfgs_path[,1], d$bfgs_path[,2], col = "#3498db",
           pch = 17, cex = 1.2)

    # GD path
    lines(d$gd_path[,1], d$gd_path[,2], col = "#27ae60", lwd = 2.5)
    points(d$gd_path[,1], d$gd_path[,2], col = "#27ae60",
           pch = 15, cex = 0.9)

    # MLE
    points(d$mle[1], d$mle[2], pch = 4, cex = 2.5, lwd = 3, col = "#2c3e50")

    # Start
    points(d$nr_path[1,1], d$nr_path[1,2], pch = 1, cex = 2.5,
           lwd = 2, col = "black")

    legend("topright",
           legend = c("Newton-Raphson", "BFGS", "Gradient descent",
                      "MLE (glm)", "Start"),
           col = c("#e74c3c", "#3498db", "#27ae60", "#2c3e50", "black"),
           pch = c(16, 17, 15, 4, 1), lwd = c(2, 2, 2, NA, NA),
           pt.cex = c(1, 1, 0.9, 1.5, 1.5),
           bg = "white", cex = 0.8)
  })

  output$convergence_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    max_ll <- max(c(d$nr_ll, d$bfgs_ll, d$gd_ll))
    all_iter <- max(length(d$nr_ll), length(d$bfgs_ll), length(d$gd_ll))

    plot(NULL,
         xlim = c(0, all_iter - 1),
         ylim = c(min(c(d$nr_ll, d$bfgs_ll, d$gd_ll)), max_ll * 1.02),
         xlab = "Iteration", ylab = "Log-likelihood",
         main = "Convergence speed")

    # Newton
    lines(seq_along(d$nr_ll) - 1, d$nr_ll, col = "#e74c3c",
          lwd = 2.5, type = "b", pch = 16, cex = 1)
    # BFGS
    lines(seq_along(d$bfgs_ll) - 1, d$bfgs_ll, col = "#3498db",
          lwd = 2.5, type = "b", pch = 17, cex = 1)
    # GD
    lines(seq_along(d$gd_ll) - 1, d$gd_ll, col = "#27ae60",
          lwd = 2.5, type = "b", pch = 15, cex = 0.8)

    abline(h = max_ll, lty = 2, col = "#2c3e50", lwd = 1.5)
    text(all_iter * 0.7, max_ll, "Maximum", pos = 3, cex = 0.85)

    legend("bottomright",
           legend = c("Newton-Raphson", "BFGS", "Gradient descent"),
           col = c("#e74c3c", "#3498db", "#27ae60"),
           pch = c(16, 17, 15), lwd = 2,
           bg = "white", cex = 0.85)
  })

  output$results <- renderUI({
    d <- dat()

    nr_steps <- nrow(d$nr_path) - 1
    bfgs_steps <- nrow(d$bfgs_path) - 1
    gd_steps <- nrow(d$gd_path) - 1

    nr_final <- d$nr_path[nrow(d$nr_path),]
    bfgs_final <- d$bfgs_path[nrow(d$bfgs_path),]
    gd_final <- d$gd_path[nrow(d$gd_path),]

    nr_dist <- round(sqrt(sum((nr_final - d$mle)^2)), 4)
    bfgs_dist <- round(sqrt(sum((bfgs_final - d$mle)^2)), 4)
    gd_dist <- round(sqrt(sum((gd_final - d$mle)^2)), 4)

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>Steps to converge:</b><br>",
        "<span style='color:#e74c3c'>Newton: ", nr_steps, "</span><br>",
        "<span style='color:#3498db'>BFGS: ", bfgs_steps, "</span><br>",
        "<span style='color:#27ae60'>GD: ", gd_steps, "</span><br>",
        "<hr style='margin:8px 0'>",
        "<b>Distance from MLE:</b><br>",
        "Newton: ", nr_dist, "<br>",
        "BFGS: ", bfgs_dist, "<br>",
        "GD: ", gd_dist
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • Move the starting guess far away (e.g., intercept = -4, slope = 3): Newton-Raphson still converges in 3-5 steps. Gradient descent may need all 30+.
  • Tiny step size (0.01): gradient descent barely moves. It will get there eventually, but takes many more iterations.
  • Large step size (0.5): gradient descent may overshoot and oscillate around the peak.
  • Small sample (n = 50): the contours are wider (more uncertainty), but all algorithms still find the peak. The MLE itself is noisier.

Why this matters in practice

Most of the time, the optimizer works silently and you never think about it. But sometimes:

  • Convergence failures. The algorithm doesn’t find the maximum. Common with small samples, perfect separation in logit, or too many parameters. If you see “algorithm did not converge,” try different starting values or a different algorithm.
  • Local optima. Non-convex likelihoods (mixture models, neural nets) can have multiple peaks. The algorithm finds a maximum, not necessarily the maximum.
  • Flat regions. Near-collinear regressors create a nearly flat likelihood surface. The optimizer wanders without making progress. This is the numerical symptom of weak identification.

When you call glm(y ~ x, family = binomial) in R, it runs iteratively reweighted least squares (IRLS) — a variant of Newton-Raphson. When you call optim(par, fn, method = "L-BFGS-B"), you’re using the bounded quasi-Newton method. When PyTorch minimizes cross-entropy, it’s running SGD (or Adam, a fancier variant). All of them are doing the same thing: finding the parameters that maximize the likelihood.
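A sketch showing two of those routes agreeing on the same simulated logit data (the true coefficients are made up): `glm()`’s IRLS and `optim()` with `method = "L-BFGS-B"` reach the same maximum.

```r
# Two routes to the same logistic MLE: glm()'s IRLS vs. optim()'s L-BFGS-B
set.seed(6)
n <- 400
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 + 1.5 * x))   # made-up true coefficients

negll <- function(b) -sum(dbinom(y, 1, plogis(b[1] + b[2] * x), log = TRUE))

fit_optim <- optim(c(0, 0), negll, method = "L-BFGS-B",
                   control = list(factr = 1e4))
fit_glm   <- glm(y ~ x, family = binomial)

rbind(optim = fit_optim$par, glm = unname(coef(fit_glm)))   # rows should match
```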

Limitations

Requires specifying the full distribution. MLE needs the entire density \(f(x \mid \theta)\), not just a few moments. If you specify the wrong distribution, the estimator will still converge — but to the parameter value that makes the assumed distribution closest to the truth in Kullback-Leibler divergence. That’s not necessarily what you want.

Can be biased in small samples. Asymptotic efficiency is a large-sample result. In small samples, MLE can be biased. A classic example: the MLE of \(\sigma^2\) in the normal distribution is \(\frac{1}{n}\sum_i (X_i - \bar{X})^2\), which divides by \(n\) rather than \(n-1\) and is biased downward.
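A quick simulation of that bias (illustrative \(n = 5\), true \(\sigma^2 = 1\)): the average of the divide-by-\(n\) estimator comes out near \((n-1)/n = 0.8\), not 1.

```r
# Small-sample bias of the MLE of sigma^2 (illustrative: n = 5, true sigma^2 = 1)
set.seed(7)
n <- 5
sigma2_mle <- replicate(50000, {
  x <- rnorm(n)
  sum((x - mean(x))^2) / n    # divides by n, not n - 1
})

mean(sigma2_mle)   # close to (n-1)/n = 0.8: biased downward
(n - 1) / n
```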

Sensitive to model misspecification. If the true data-generating process doesn’t belong to your parametric family, MLE converges to the “pseudo-true” value — the member of your family closest to the truth in KL divergence. This can be far from the parameter you intended to estimate.

Numerical optimization. For many models, there’s no closed-form MLE. The iterative algorithms described above can get stuck at local optima or fail to converge.

These limitations motivate approaches that require less structure: GMM only needs moment conditions (not the full distribution), and Bayesian Estimation lets you incorporate prior information to stabilize estimates.