Calibration and Uncertainty Quantification
Classical statistical estimators come with standard errors. OLS gives you \(\sigma^2(X'X)^{-1}\) (Algebra of Regression). MLE gives you the inverse Fisher information (MLE). Bayesian inference gives you the full posterior distribution (Bayesian Updating). Modern machine learning models typically give you none of these. This page asks: when a model says it’s 90% confident, is it right 90% of the time? And if not, what can you do about it?
What calibration means
A model is calibrated if its predicted probabilities match empirical frequencies:
\[ P(Y = 1 \mid \hat{p}(X) = q) = q \qquad \text{for all } q \in [0, 1] \]
If the model says “there’s a 70% chance of rain” on 1,000 days, it should rain on approximately 700 of them. If it rains on only 550, the model is overconfident at the 70% level. If it rains on 850, the model is underconfident.
Calibration is a property of the predicted probabilities, not the predicted classes. A model can have high accuracy (correctly classifying most examples) while being poorly calibrated (systematically overconfident or underconfident in its probability estimates).
Why calibration matters
Calibration matters whenever you use predicted probabilities as inputs to decisions:
- Risk assessment: if a medical model says “30% chance of disease,” a doctor’s decision depends on that number being trustworthy
- Ranking and prioritization: if a fraud detection model scores transactions, the ordering depends on calibrated probabilities
- Bayesian updating: if you use a model’s output as a likelihood in Bayesian Updating, miscalibration corrupts the posterior
An uncalibrated model’s predictions provide a relative ordering at best — “this example is more likely than that one” — but the actual probability values cannot be taken at face value.
Neural networks are typically miscalibrated
Modern deep neural networks tend to be overconfident: they assign high probabilities to their predictions even when they are wrong. This is a well-documented empirical finding (Guo et al., 2017).
The cause is rooted in training as MLE: the cross-entropy loss is minimized by pushing predicted probabilities toward 0 or 1, so the model is rewarded for being maximally confident on every training example. With enough capacity, the model effectively memorizes the training set and carries that overconfidence over to the test set.
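A quick numerical illustration of that pressure (a toy calculation, not from the original page): for an example whose true class is 1, the cross-entropy loss \(-\log \hat{p}\) keeps shrinking as the predicted probability approaches 1, so the optimizer is always rewarded for more confidence.

```r
# Cross-entropy loss on a single correctly labeled example (y = 1)
# at increasing levels of predicted confidence.
p_hat <- c(0.6, 0.9, 0.99, 0.999)
round(-log(p_hat), 4)
#> 0.5108 0.1054 0.0101 0.0010
```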
This is the bias-variance tradeoff manifesting in probability space: an overparameterized model fits the training data perfectly (low bias) but produces overconfident predictions on new data (high variance in the probability estimates).
#| standalone: true
#| viewerHeight: 680
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
.good { color: #27ae60; font-weight: bold; }
.bad { color: #e74c3c; font-weight: bold; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("n", "Sample size (n):",
min = 500, max = 5000, value = 1000, step = 500),
sliderInput("miscal", "Miscalibration level:",
min = 0, max = 1, value = 0.5, step = 0.1),
actionButton("go", "New draw", class = "btn-primary", width = "100%"),
uiOutput("results")
),
mainPanel(
width = 9,
fluidRow(
column(7, plotOutput("reliability_plot", height = "470px")),
column(5, plotOutput("hist_plot", height = "470px"))
)
)
)
)
server <- function(input, output, session) {
dat <- reactive({
input$go
n <- input$n
miscal <- input$miscal
# True probabilities and outcomes
p_true <- runif(n)
Y <- rbinom(n, 1, p_true)
# Model predictions: inflated toward 1 (overconfident) when miscal > 0
q <- p_true^(1 / (1 + miscal))
# Bin predictions into 10 bins
n_bins <- 10
breaks <- seq(0, 1, length.out = n_bins + 1)
bins <- cut(q, breaks = breaks, include.lowest = TRUE, labels = FALSE)
bin_mid <- numeric(n_bins)
bin_acc <- numeric(n_bins)
bin_conf <- numeric(n_bins)
bin_n <- numeric(n_bins)
for (j in 1:n_bins) {
idx <- which(bins == j)
bin_n[j] <- length(idx)
bin_conf[j] <- if (length(idx) > 0) mean(q[idx]) else NA
bin_acc[j] <- if (length(idx) > 0) mean(Y[idx]) else NA
bin_mid[j] <- (breaks[j] + breaks[j + 1]) / 2
}
# Brier score
brier <- mean((q - Y)^2)
# ECE: weighted average of |accuracy - confidence| per bin
valid <- bin_n > 0
ece <- sum(bin_n[valid] * abs(bin_acc[valid] - bin_conf[valid])) / n
list(q = q, Y = Y, n = n, miscal = miscal,
bin_conf = bin_conf, bin_acc = bin_acc, bin_n = bin_n,
breaks = breaks, n_bins = n_bins,
mean_pred = mean(q), pos_rate = mean(Y),
brier = brier, ece = ece)
})
output$reliability_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
valid <- d$bin_n > 0
plot(NULL, xlim = c(0, 1), ylim = c(0, 1),
xlab = "Mean predicted probability",
ylab = "Fraction of positives",
main = "Reliability (Calibration) Diagram",
asp = 1)
# Perfect calibration diagonal
abline(0, 1, lty = 2, lwd = 2, col = "#7f8c8d")
# Reliability curve
lines(d$bin_conf[valid], d$bin_acc[valid],
col = "#e74c3c", lwd = 2.5, type = "o", pch = 19, cex = 1.3)
# Shaded gap between curve and diagonal
for (j in which(valid)) {
if (!is.na(d$bin_conf[j]) && !is.na(d$bin_acc[j])) {
segments(d$bin_conf[j], d$bin_conf[j],
d$bin_conf[j], d$bin_acc[j],
col = adjustcolor("#e74c3c", 0.3), lwd = 6)
}
}
legend("topleft", bty = "n", cex = 0.85,
legend = c("Perfect calibration", "Model"),
col = c("#7f8c8d", "#e74c3c"),
lwd = c(2, 2.5), lty = c(2, 1), pch = c(NA, 19))
# Annotate miscalibration level
text(0.95, 0.05, paste0("Miscal = ", d$miscal),
cex = 0.85, font = 2, col = "#2c3e50", pos = 2)
})
output$hist_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
hist(d$q, breaks = 20, col = adjustcolor("#3498db", 0.5),
border = "#3498db", main = "Predicted Probabilities",
xlab = "Predicted probability", ylab = "Frequency",
xlim = c(0, 1))
abline(v = mean(d$q), lty = 2, lwd = 2, col = "#2c3e50")
text(mean(d$q), par("usr")[4] * 0.9,
paste0("Mean = ", round(mean(d$q), 3)),
cex = 0.8, font = 2, col = "#2c3e50", pos = 4)
})
output$results <- renderUI({
d <- dat()
ece_class <- if (d$ece < 0.05) "good" else "bad"
tags$div(class = "stats-box",
HTML(paste0(
"<b>Mean predicted prob:</b> ", round(d$mean_pred, 3), "<br>",
"<b>Actual positive rate:</b> ", round(d$pos_rate, 3), "<br>",
"<hr style='margin:6px 0'>",
"<b>Brier score:</b> ", round(d$brier, 4), "<br>",
"<b>ECE:</b> <span class='", ece_class, "'>",
round(d$ece, 4), "</span>"
))
)
})
}
shinyApp(ui, server)
Things to try
- Miscalibration = 0: the reliability curve lies on the diagonal — the model is perfectly calibrated. The histogram of predictions is roughly uniform (matching the true probability distribution).
- Increase miscalibration: the reliability curve bows below the diagonal. The model’s predicted probabilities are systematically higher than the true probabilities — when it says 80%, the actual positive rate is lower. This is overconfidence. The ECE (defined after this list) increases.
- Large n: the reliability curve becomes smoother and the ECE estimate becomes more precise. With small \(n\), the binned estimates are noisy even for a well-calibrated model.
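For reference, the two numbers reported in the stats box are computed exactly as in the code above. With \(B\) bins, bin counts \(n_j\), bin accuracy \(\text{acc}(j)\) (the fraction of positives in bin \(j\)), and bin confidence \(\text{conf}(j)\) (the mean predicted probability in bin \(j\)):

\[ \text{Brier} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)^2, \qquad \text{ECE} = \sum_{j=1}^{B} \frac{n_j}{n} \left| \text{acc}(j) - \text{conf}(j) \right| \]

Lower is better for both; the Brier score also penalizes inaccuracy, while the ECE isolates miscalibration.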
Post-hoc calibration
The most common fix is temperature scaling: divide the logits (pre-softmax outputs) by a scalar \(T > 0\) before applying the softmax:
\[ \hat{p}_i = \text{softmax}(z_i / T) \]
- \(T = 1\): original model
- \(T > 1\): softer probabilities (less confident)
- \(T < 1\): sharper probabilities (more confident)
The temperature \(T\) is tuned on a held-out validation set to minimize the negative log-likelihood — which is, once again, MLE. Temperature scaling does not change the model’s rankings (which example is “most likely”) — it only adjusts the magnitudes of the probabilities.
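A minimal sketch of that tuning step for a binary classifier (where the softmax reduces to a sigmoid), on simulated data; the names `z_val` and `y_val` and the factor of 3 used to mimic overconfidence are illustrative assumptions, not from this page:

```r
# Temperature scaling (binary case): pick T on validation data by minimizing NLL.
nll <- function(Temp, z, y) {
  p <- plogis(z / Temp)                       # sigmoid of scaled logits
  p <- pmin(pmax(p, 1e-12), 1 - 1e-12)        # clamp to avoid log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))    # negative log-likelihood
}

# Toy validation set standing in for an overconfident model:
# logits are 3x too large, so predicted probabilities are too extreme.
set.seed(1)
p_true <- runif(2000)
y_val  <- rbinom(2000, 1, p_true)
z_val  <- 3 * qlogis(p_true)

T_hat <- optimize(nll, interval = c(0.1, 10), z = z_val, y = y_val)$minimum
T_hat                                         # close to 3 for this toy setup

# Calibrated probabilities for new logits z_test would then be plogis(z_test / T_hat).
```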
Bayesian approaches to uncertainty
The Bayesian Estimation page showed that Bayesian inference naturally produces uncertainty estimates — the posterior is a full distribution, not a point. Applying this to neural networks:
Bayesian neural networks place priors on the weights and compute (or approximate) the posterior \(p(\theta \mid \text{data})\). Predictions integrate over weight uncertainty:
\[ p(Y \mid X, \text{data}) = \int p(Y \mid X, \theta) \, p(\theta \mid \text{data}) \, d\theta \]
This integral is typically intractable, motivating approximations:
- Monte Carlo dropout: as discussed in Regularization as Bayesian Inference, running the network with dropout at test time approximates sampling from the posterior. The spread of predictions estimates epistemic uncertainty.
- Deep ensembles: train multiple networks with different initializations. The disagreement between them estimates uncertainty. This is not formally Bayesian but captures a similar intuition — uncertainty is where the models disagree. A toy version is sketched just after this list.
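A toy version of the ensemble idea, using `nnet` (a single-hidden-layer network from the recommended `nnet` package) as a small stand-in for a deep network; the simulated data and settings below are made up for illustration:

```r
library(nnet)   # small single-hidden-layer network, standing in for a deep net

# Simulated binary classification data.
set.seed(42)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
train <- data.frame(x1, x2, y = factor(rbinom(n, 1, plogis(1.5 * x1 - x2))))

# Two query points: one near the bulk of the data, one far from it.
newx <- data.frame(x1 = c(0, 4), x2 = c(0, -4))

# "Deep ensemble" in miniature: same architecture, different random initial weights.
preds <- sapply(1:10, function(seed) {
  set.seed(seed)
  fit <- nnet(y ~ x1 + x2, data = train, size = 8, decay = 1e-3,
              maxit = 300, trace = FALSE)
  as.numeric(predict(fit, newdata = newx, type = "raw"))
})

rowMeans(preds)       # ensemble prediction: average predicted probability
apply(preds, 1, sd)   # disagreement across members, used as an uncertainty signal
```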
Two kinds of uncertainty
A useful decomposition:
Aleatoric uncertainty is inherent randomness in the data — the noise \(\varepsilon\) in \(Y = f(X) + \varepsilon\). This cannot be reduced by collecting more data. It is the \(\sigma^2\) in \(\text{Var}(\hat{\beta}) = \sigma^2(X'X)^{-1}\) from The Algebra Behind OLS.
Epistemic uncertainty is uncertainty about the model or parameters — what you don’t know because you have finite data. This can be reduced with more data. It is what Bayesian Updating reduces as the posterior concentrates.
Classical statistics separates these naturally: \(\sigma^2\) is aleatoric; \(\text{Var}(\hat{\beta})\) is epistemic. Neural networks typically conflate them, reporting a single predicted probability that mixes both sources.
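The OLS case makes the split concrete: for a linear model, `predict()` with `se.fit = TRUE` returns the epistemic part (the standard error of the fitted mean, driven by \(\text{Var}(\hat{\beta})\) and shrinking as \(n\) grows), while the residual standard deviation estimates the aleatoric part, which does not shrink. A small sketch on simulated data:

```r
set.seed(7)
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1.5)     # true aleatoric sd is 1.5
fit <- lm(y ~ x)

pr <- predict(fit, newdata = data.frame(x = 5), se.fit = TRUE)

epistemic_sd <- pr$se.fit                 # sqrt(x0' Var(beta_hat) x0): shrinks with n
aleatoric_sd <- summary(fit)$sigma        # estimate of the irreducible noise sd
total_sd     <- sqrt(epistemic_sd^2 + aleatoric_sd^2)   # scale of a prediction interval

c(epistemic = epistemic_sd, aleatoric = aleatoric_sd, total = total_sd)
# epistemic_sd shrinks roughly like 1/sqrt(n); aleatoric_sd stays near 1.5.
```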
Conformal prediction
Conformal prediction is a recent and increasingly popular approach that provides distribution-free prediction intervals. The idea:
- Fit any model (neural network, random forest, anything)
- On a calibration set, compute the residuals (or nonconformity scores)
- Use the quantiles of these residuals to construct prediction intervals
The guarantee: with probability at least \(1 - \alpha\), the true outcome falls within the prediction interval — regardless of the model or the distribution of the data. This requires only the assumption that calibration and test data are exchangeable (essentially, drawn from the same distribution).
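A minimal split-conformal sketch for regression on simulated data (the `make_data` helper and the deliberately misspecified linear model are illustrative assumptions, not from this page); the point is that coverage holds even when the model is poor, the intervals are simply wider:

```r
alpha <- 0.1   # target miscoverage: aim for at least 90% coverage

set.seed(3)
make_data <- function(n) {
  x <- runif(n, 0, 10)
  data.frame(x = x, y = sin(x) + rnorm(n, sd = 0.3))
}
train <- make_data(500); calib <- make_data(500); test <- make_data(2000)

# 1. Fit any model on the training split (here a crude linear model).
fit <- lm(y ~ x, data = train)

# 2. Nonconformity scores on the calibration split: absolute residuals.
scores <- abs(calib$y - predict(fit, newdata = calib))

# 3. Conformal quantile with the finite-sample (n + 1) correction.
n_cal <- length(scores)
k     <- ceiling((n_cal + 1) * (1 - alpha))
q_hat <- sort(scores)[min(k, n_cal)]

# Prediction intervals and their empirical coverage on fresh test data.
pred <- predict(fit, newdata = test)
mean(test$y >= pred - q_hat & test$y <= pred + q_hat)   # should be >= 1 - alpha (approx.)
```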
Conformal prediction is attractive because:
- It works with any underlying model — no Bayesian assumptions needed
- It provides finite-sample coverage guarantees (not just asymptotic)
- It is computationally simple — just sort residuals and take quantiles
The connection to the course: conformal prediction produces objects that resemble confidence intervals (they have a coverage guarantee) but are computed without assuming normality or using Fisher information. They are closer in spirit to the bootstrap — using the empirical distribution of residuals rather than parametric assumptions.
Connecting to the course
This page ties together uncertainty quantification across frameworks:
| Framework | Source of uncertainty | Tool |
|---|---|---|
| OLS | \(\sigma^2(X'X)^{-1}\) | Standard errors |
| MLE | Fisher information | Asymptotic SEs |
| Bayesian | Full posterior | Credible intervals |
| Bootstrap | Resampled distribution | Bootstrap CIs |
| Neural networks | (typically missing) | Calibration, ensembles, conformal |
The fundamental question is always the same: how much should you trust the estimate? Classical statistics answers this with standard errors and confidence intervals. Bayesian inference answers it with posteriors and credible intervals. For machine learning models, the answer requires additional machinery — calibration, ensembles, or conformal prediction — because the training procedure does not provide uncertainty estimates by default.