Calibration and Uncertainty Quantification

Classical statistical estimators come with standard errors. OLS gives you \(\sigma^2(X'X)^{-1}\) (Algebra of Regression). MLE gives you the inverse Fisher information (MLE). Bayesian inference gives you the full posterior distribution (Bayesian Updating). Modern machine learning models typically give you none of these. This page asks: when a model says it’s 90% confident, is it right 90% of the time? And if not, what can you do about it?

What calibration means

A model is calibrated if its predicted probabilities match empirical frequencies:

\[ P(Y = 1 \mid \hat{p}(X) = q) = q \qquad \text{for all } q \in [0, 1] \]

If the model says “there’s a 70% chance of rain” for 1,000 days, it should rain on approximately 700 of them. If it rains on only 550 of those days, the model is overconfident at the 70% level. If it rains on 850, it is underconfident.
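
A minimal simulated check of this definition (base R only; the three forecast values are an illustrative assumption):

set.seed(1)
n     <- 1e5
p_hat <- sample(c(0.3, 0.5, 0.7), n, replace = TRUE)  # predicted probabilities
y     <- rbinom(n, 1, p_hat)                          # outcomes drawn at exactly those rates
tapply(y, p_hat, mean)  # empirical frequencies: approximately 0.3, 0.5, 0.7 -- calibrated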

Calibration is a property of the predicted probabilities, not the predicted classes. A model can have high accuracy (correctly classifying most examples) while being poorly calibrated (systematically overconfident or underconfident in its probability estimates).

Why calibration matters

Calibration matters whenever you use predicted probabilities as inputs to decisions:

  • Risk assessment: if a medical model says “30% chance of disease,” a doctor’s decision depends on that number being trustworthy
  • Ranking and prioritization: if a fraud detection model scores transactions, the ordering depends on calibrated probabilities
  • Bayesian updating: if you use a model’s output as a likelihood in Bayesian Updating, miscalibration corrupts the posterior

An uncalibrated model’s predictions give relative orderings at best — “this example is more likely than that one” — but the probability values themselves are unreliable.

Neural networks are typically miscalibrated

Modern deep neural networks tend to be overconfident: they assign high probabilities to their predictions even when they are wrong. This is a well-documented empirical finding (Guo et al., 2017).

The cause is related to training as MLE. The cross-entropy loss encourages the model to push predicted probabilities toward 0 or 1 — the loss is minimized when the model is maximally confident on every training example. With enough capacity, the model memorizes the training set and becomes overconfident on the test set.

This is the bias-variance tradeoff manifesting in probability space: an overparameterized model fits the training data perfectly (low bias) but produces overconfident predictions on new data (high variance in the probability estimates).

#| standalone: true
#| viewerHeight: 680

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
    .good { color: #27ae60; font-weight: bold; }
    .bad  { color: #e74c3c; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("n", "Sample size (n):",
                  min = 500, max = 5000, value = 1000, step = 500),

      sliderInput("miscal", "Miscalibration level:",
                  min = 0, max = 1, value = 0.5, step = 0.1),

      actionButton("go", "New draw", class = "btn-primary", width = "100%"),

      uiOutput("results")
    ),

    mainPanel(
      width = 9,
      fluidRow(
        column(7, plotOutput("reliability_plot", height = "470px")),
        column(5, plotOutput("hist_plot", height = "470px"))
      )
    )
  )
)

server <- function(input, output, session) {

  dat <- reactive({
    input$go
    n      <- input$n
    miscal <- input$miscal

    # True probabilities and outcomes
    p_true <- runif(n)
    Y      <- rbinom(n, 1, p_true)

    # Model predictions: inflated above the true probability when miscal > 0
    # (overconfident; identical to p_true when miscal = 0)
    q <- p_true^(1 / (1 + miscal))

    # Bin predictions into 10 bins
    n_bins <- 10
    breaks <- seq(0, 1, length.out = n_bins + 1)
    bins   <- cut(q, breaks = breaks, include.lowest = TRUE, labels = FALSE)

    bin_mid  <- numeric(n_bins)
    bin_acc  <- numeric(n_bins)
    bin_conf <- numeric(n_bins)
    bin_n    <- numeric(n_bins)

    for (j in 1:n_bins) {
      idx <- which(bins == j)
      bin_n[j]    <- length(idx)
      bin_conf[j] <- if (length(idx) > 0) mean(q[idx]) else NA
      bin_acc[j]  <- if (length(idx) > 0) mean(Y[idx]) else NA
      bin_mid[j]  <- (breaks[j] + breaks[j + 1]) / 2
    }

    # Brier score
    brier <- mean((q - Y)^2)

    # ECE: weighted average of |accuracy - confidence| per bin
    valid   <- bin_n > 0
    ece     <- sum(bin_n[valid] * abs(bin_acc[valid] - bin_conf[valid])) / n

    list(q = q, Y = Y, n = n, miscal = miscal,
         bin_conf = bin_conf, bin_acc = bin_acc, bin_n = bin_n,
         breaks = breaks, n_bins = n_bins,
         mean_pred = mean(q), pos_rate = mean(Y),
         brier = brier, ece = ece)
  })

  output$reliability_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    valid <- d$bin_n > 0

    plot(NULL, xlim = c(0, 1), ylim = c(0, 1),
         xlab = "Mean predicted probability",
         ylab = "Fraction of positives",
         main = "Reliability (Calibration) Diagram",
         asp = 1)

    # Perfect calibration diagonal
    abline(0, 1, lty = 2, lwd = 2, col = "#7f8c8d")

    # Reliability curve
    lines(d$bin_conf[valid], d$bin_acc[valid],
          col = "#e74c3c", lwd = 2.5, type = "o", pch = 19, cex = 1.3)

    # Shaded gap between curve and diagonal
    for (j in which(valid)) {
      if (!is.na(d$bin_conf[j]) && !is.na(d$bin_acc[j])) {
        segments(d$bin_conf[j], d$bin_conf[j],
                 d$bin_conf[j], d$bin_acc[j],
                 col = adjustcolor("#e74c3c", 0.3), lwd = 6)
      }
    }

    legend("topleft", bty = "n", cex = 0.85,
           legend = c("Perfect calibration", "Model"),
           col = c("#7f8c8d", "#e74c3c"),
           lwd = c(2, 2.5), lty = c(2, 1), pch = c(NA, 19))

    # Annotate miscalibration level
    text(0.95, 0.05, paste0("Miscal = ", d$miscal),
         cex = 0.85, font = 2, col = "#2c3e50", pos = 2)
  })

  output$hist_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    hist(d$q, breaks = 20, col = adjustcolor("#3498db", 0.5),
         border = "#3498db", main = "Predicted Probabilities",
         xlab = "Predicted probability", ylab = "Frequency",
         xlim = c(0, 1))

    abline(v = mean(d$q), lty = 2, lwd = 2, col = "#2c3e50")
    text(mean(d$q), par("usr")[4] * 0.9,
         paste0("Mean = ", round(mean(d$q), 3)),
         cex = 0.8, font = 2, col = "#2c3e50", pos = 4)
  })

  output$results <- renderUI({
    d <- dat()

    ece_class <- if (d$ece < 0.05) "good" else "bad"

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>Mean predicted prob:</b> ", round(d$mean_pred, 3), "<br>",
        "<b>Actual positive rate:</b> ", round(d$pos_rate, 3), "<br>",
        "<hr style='margin:6px 0'>",
        "<b>Brier score:</b> ", round(d$brier, 4), "<br>",
        "<b>ECE:</b> <span class='", ece_class, "'>",
        round(d$ece, 4), "</span>"
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • Miscalibration = 0: the reliability curve lies on the diagonal — the model is perfectly calibrated. The histogram of predictions is roughly uniform (matching the true probability distribution).
  • Increase miscalibration: the reliability curve bows below the diagonal. The model’s predicted probabilities are systematically too high — when it says 80%, the actual rate is lower. This is overconfidence, and the ECE increases.
  • Large n: the reliability curve becomes smoother and the ECE estimate becomes more precise. With small \(n\), the binned estimates are noisy even for a well-calibrated model.

Post-hoc calibration

The most common fix is temperature scaling: divide the logits (pre-softmax outputs) by a scalar \(T > 0\) before applying the softmax:

\[ \hat{p}_i = \text{softmax}(z_i / T) \]

  • \(T = 1\): original model
  • \(T > 1\): softer probabilities (less confident)
  • \(T < 1\): sharper probabilities (more confident)

The temperature \(T\) is tuned on a held-out validation set to minimize the negative log-likelihood — which is, once again, MLE. Temperature scaling does not change the model’s rankings (which example is “most likely”) — it only adjusts the magnitudes of the probabilities.
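
A minimal sketch of this tuning step, assuming you already have a matrix of validation-set logits. The simulated logits below are constructed to be overconfident by a factor of 2, so the fitted \(T\) should come out near 2:

softmax_rows <- function(z) {
  ez <- exp(z - apply(z, 1, max))  # subtract row max for numerical stability
  ez / rowSums(ez)
}

nll_at_T <- function(temp, z, y) {        # temp is the temperature T
  p <- softmax_rows(z / temp)
  -mean(log(p[cbind(seq_along(y), y)]))   # negative log-likelihood of the true classes
}

set.seed(1)
n <- 2000; K <- 3
z0 <- matrix(rnorm(n * K), n, K)                                   # "honest" logits
y  <- apply(softmax_rows(z0), 1, function(p) sample.int(K, 1, prob = p))
z  <- 2 * z0                                                       # overconfident logits (true T = 2)

T_hat <- optimize(nll_at_T, interval = c(0.1, 10), z = z, y = y)$minimum
p_cal <- softmax_rows(z / T_hat)  # calibrated probabilities; per-example rankings unchanged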

Why “temperature”? The name comes from statistical mechanics. In the Boltzmann distribution, temperature controls how peaked or flat the distribution is over energy states. High temperature → uniform (maximum entropy). Low temperature → concentrated on the lowest-energy state. The analogy to softmax is exact.

Bayesian approaches to uncertainty

The Bayesian Estimation page showed that Bayesian inference naturally produces uncertainty estimates — the posterior is a full distribution, not a point. Applying this to neural networks:

Bayesian neural networks place priors on the weights and compute (or approximate) the posterior \(p(\theta \mid \text{data})\). Predictions integrate over weight uncertainty:

\[ p(Y \mid X, \text{data}) = \int p(Y \mid X, \theta) \, p(\theta \mid \text{data}) \, d\theta \]

This integral is typically intractable, motivating approximations:

  • Monte Carlo dropout: as discussed in Regularization as Bayesian Inference, running the network with dropout at test time approximates sampling from the posterior. The spread of predictions estimates epistemic uncertainty.
  • Deep ensembles: train multiple networks with different initializations. The disagreement between them estimates uncertainty. This is not formally Bayesian but captures a similar intuition — uncertainty is where the models disagree.
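
A toy sketch of the deep-ensemble idea, using small single-hidden-layer networks from the nnet package as stand-ins for deep networks (the data, network size, and ensemble size are illustrative assumptions):

library(nnet)

set.seed(42)
n  <- 400
x  <- runif(n, -2, 2)
y  <- factor(rbinom(n, 1, plogis(2 * x)))  # binary outcome with a known logistic relationship
df <- data.frame(x = x, y = y)

x_new <- data.frame(x = seq(-4, 4, length.out = 200))  # includes regions with no training data

# Train M members that differ only in their random weight initialization
M <- 10
preds <- sapply(1:M, function(m) {
  set.seed(m)
  fit <- nnet(y ~ x, data = df, size = 8, decay = 1e-3, maxit = 300, trace = FALSE)
  predict(fit, newdata = x_new, type = "raw")  # probability of the positive class
})

p_mean <- rowMeans(preds)       # ensemble prediction
p_sd   <- apply(preds, 1, sd)   # disagreement: an informal estimate of epistemic uncertainty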

Two kinds of uncertainty

A useful decomposition:

Aleatoric uncertainty is inherent randomness in the data — the noise \(\varepsilon\) in \(Y = f(X) + \varepsilon\). This cannot be reduced by collecting more data. It is the \(\sigma^2\) in \(\text{Var}(\hat{\beta}) = \sigma^2(X'X)^{-1}\) from The Algebra Behind OLS.

Epistemic uncertainty is uncertainty about the model or parameters — what you don’t know because you have finite data. This can be reduced with more data. It is what Bayesian Updating reduces as the posterior concentrates.

Classical statistics separates these naturally: \(\sigma^2\) is aleatoric; \(\text{Var}(\hat{\beta})\) is epistemic. Neural networks typically conflate them, reporting a single predicted probability that mixes both sources.
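
A quick sketch of that separation in a simulated linear model (the coefficients and noise level are assumptions): the standard error of \(\hat{\beta}\) shrinks as \(n\) grows, while the estimated noise level does not.

set.seed(3)
sigma <- 2  # aleatoric noise sd

sim_once <- function(n) {
  x   <- rnorm(n)
  y   <- 1 + 0.5 * x + rnorm(n, sd = sigma)
  fit <- lm(y ~ x)
  c(se_beta   = summary(fit)$coefficients["x", "Std. Error"],  # epistemic: shrinks with n
    sigma_hat = summary(fit)$sigma)                            # aleatoric: stays near 2
}

round(sapply(c(100, 1000, 10000), sim_once), 3)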

Why the distinction matters. If uncertainty is aleatoric, the model should say “I don’t know because no one can know — the outcome is inherently random.” If it is epistemic, the model should say “I don’t know because I haven’t seen enough data — more training data or a better model might help.” These have different implications for decision-making: aleatoric uncertainty calls for risk management; epistemic uncertainty calls for more data.

Conformal prediction

Conformal prediction is a recent and increasingly popular approach that provides distribution-free prediction intervals. The idea:

  1. Fit any model (neural network, random forest, anything)
  2. On a calibration set, compute the residuals (or nonconformity scores)
  3. Use the quantiles of these residuals to construct prediction intervals

The guarantee: with probability \(1 - \alpha\), the true outcome falls within the prediction interval — regardless of the model or the distribution of the data. This requires only the assumption that calibration and test data are exchangeable (essentially, drawn from the same distribution).
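
A minimal split-conformal sketch for a regression problem, using lm() as the arbitrary underlying model (the data-generating process and \(\alpha = 0.1\) are illustrative assumptions):

set.seed(7)
n <- 1000
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)
df <- data.frame(x = x, y = y)

# 1. Fit any model on a training split
train <- sample(n, n / 2)
fit   <- lm(y ~ poly(x, 5), data = df[train, ])

# 2. Nonconformity scores: absolute residuals on the calibration split
cal    <- df[-train, ]
scores <- abs(cal$y - predict(fit, newdata = cal))

# 3. The appropriate quantile of the scores gives the interval half-width
alpha <- 0.1
n_cal <- length(scores)
q_hat <- sort(scores)[ceiling((n_cal + 1) * (1 - alpha))]

# Prediction interval for a new point: point prediction +/- q_hat
x_new <- data.frame(x = 3.7)
pred  <- unname(predict(fit, newdata = x_new))
c(lower = pred - q_hat, upper = pred + q_hat)  # covers the new Y with probability >= 1 - alpha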

Conformal prediction is attractive because:

  • It works with any underlying model — no Bayesian assumptions needed
  • It provides finite-sample coverage guarantees (not just asymptotic)
  • It is computationally simple — just sort residuals and take quantiles

The connection to the course: conformal prediction produces objects that resemble confidence intervals (they have a coverage guarantee) but are computed without assuming normality or using Fisher information. They are closer in spirit to the bootstrap — using the empirical distribution of residuals rather than parametric assumptions.

Connecting to the course

This page ties together uncertainty quantification across frameworks:

Framework        | Source of uncertainty    | Tool
OLS              | \(\sigma^2(X'X)^{-1}\)   | Standard errors
MLE              | Fisher information       | Asymptotic SEs
Bayesian         | Full posterior           | Credible intervals
Bootstrap        | Resampled distribution   | Bootstrap CIs
Neural networks  | (typically missing)      | Calibration, ensembles, conformal

The fundamental question is always the same: how much should you trust the estimate? Classical statistics answers this with standard errors and confidence intervals. Bayesian inference answers it with posteriors and credible intervals. For machine learning models, the answer requires additional machinery — calibration, ensembles, or conformal prediction — because the training procedure does not provide uncertainty estimates by default.