# Model Selection

## The problem
Training error always favors complexity — a more flexible model can always fit the training data at least as well. But out-of-sample, complex models overfit. Model selection asks: how do we choose the right level of complexity?
There are two main approaches: information criteria (penalize the log-likelihood) and cross-validation (directly estimate out-of-sample error).
## Information criteria
Both AIC and BIC start from the maximized log-likelihood \(\hat{\ell}\) and add a penalty for the number of parameters \(k\):
\[\text{AIC} = -2\hat{\ell} + 2k\]
\[\text{BIC} = -2\hat{\ell} + k \log n\]
Lower is better. The key difference:
| | AIC | BIC |
|---|---|---|
| Penalty | \(2k\) | \(k \log n\) |
| For \(n \geq 8\) | Lighter penalty | Heavier penalty |
| Selects | Larger models | More parsimonious models |
| Goal | Best prediction | Consistent model selection |
| Asymptotic property | Minimizes KL divergence to truth | Selects true model (if in candidate set) |
AIC targets prediction — it estimates the out-of-sample KL divergence. BIC targets model identification — as \(n \to \infty\), it selects the true model with probability 1 (if the true model is among the candidates).
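As a sanity check on the formulas, the hand-computed criteria can be compared against R's built-in `AIC()` and `BIC()` (a quick sketch with an arbitrary toy fit; note that R's `logLik` counts the error variance as a parameter, so `k` here is 3, not 2):

```r
set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

ll <- as.numeric(logLik(fit))   # maximized log-likelihood
k  <- attr(logLik(fit), "df")   # intercept + slope + sigma = 3
n  <- length(y)

aic_manual <- -2 * ll + 2 * k
bic_manual <- -2 * ll + k * log(n)

all.equal(aic_manual, AIC(fit))   # TRUE
all.equal(bic_manual, BIC(fit))   # TRUE
```

Because the variance parameter adds the same constant to every model's count, including or excluding it never changes which model each criterion ranks first.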
## Cross-validation
K-fold cross-validation directly estimates the out-of-sample error without distributional assumptions:
- Split data into \(K\) folds.
- For each fold: train on the other \(K-1\) folds, predict on the held-out fold.
- Average the prediction errors across folds.
Common choices: \(K = 5\) or \(K = 10\). Leave-one-out CV (\(K = n\)) is approximately equivalent to AIC for linear models.
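The three steps above can be sketched directly (a minimal illustration; the helper `kfold_mse` and the simple `y ~ x` model are invented for this example):

```r
kfold_mse <- function(df, K = 5) {
  # Randomly assign each row to one of K folds
  folds <- sample(rep(seq_len(K), length.out = nrow(df)))
  errs <- sapply(seq_len(K), function(k) {
    fit <- lm(y ~ x, data = df[folds != k, ])         # train on the other K-1 folds
    pred <- predict(fit, newdata = df[folds == k, ])  # predict on the held-out fold
    mean((df$y[folds == k] - pred)^2)
  })
  mean(errs)  # average prediction error across folds
}

set.seed(42)
df <- data.frame(x = rnorm(100))
df$y <- 2 * df$x + rnorm(100)
kfold_mse(df)  # estimates out-of-sample MSE; close to the noise variance (1)
```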
## Simulation
True polynomial DGP of degree \(d\). Fit degrees 1–10. Training error decreases monotonically; test error is U-shaped. AIC and BIC may pick different models.
#| standalone: true
#| viewerHeight: 750
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.eq-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-bottom: 14px; font-size: 14px; line-height: 1.9;
}
.eq-box b { color: #2c3e50; }
.match { color: #27ae60; font-weight: bold; }
.coef { color: #e74c3c; font-weight: bold; }
"))),
sidebarLayout(
sidebarPanel(
width = 4,
sliderInput("true_d", "True polynomial degree:",
min = 1, max = 6, value = 3, step = 1),
sliderInput("n", "Sample size (n):",
min = 30, max = 300, value = 80, step = 10),
sliderInput("sigma", HTML("Noise level (σ):"),
min = 0.5, max = 5, value = 1.5, step = 0.5),
actionButton("resim", "New draw", class = "btn-primary", width = "100%"),
uiOutput("results_box")
),
mainPanel(
width = 8,
fluidRow(
column(6, plotOutput("plot_fit", height = "420px")),
column(6, plotOutput("plot_ic", height = "420px"))
),
uiOutput("note_box")
)
)
)
server <- function(input, output, session) {
dat <- reactive({
input$resim
true_d <- input$true_d
n <- input$n
sigma <- input$sigma
# Generate true polynomial coefficients
set.seed(NULL)
true_coefs <- runif(true_d, -1, 1)
x_train <- sort(runif(n, -2, 2))
x_test <- sort(runif(n, -2, 2))
# True function
f_true <- function(x) {
val <- rep(0, length(x))
for (j in seq_along(true_coefs)) {
val <- val + true_coefs[j] * x^j
}
val
}
y_train <- f_true(x_train) + rnorm(n, sd = sigma)
y_test <- f_true(x_test) + rnorm(n, sd = sigma)
max_d <- 10
train_mse <- numeric(max_d)
test_mse <- numeric(max_d)
aic_vals <- numeric(max_d)
bic_vals <- numeric(max_d)
cv_mse <- numeric(max_d)
for (d in 1:max_d) {
fit <- lm(y_train ~ poly(x_train, d, raw = TRUE))
# Training MSE
train_mse[d] <- mean(resid(fit)^2)
# Test MSE
pred_test <- predict(fit, newdata = data.frame(x_train = x_test))
test_mse[d] <- mean((y_test - pred_test)^2)
# AIC / BIC
k <- d + 1 # intercept + d slope coefficients (the variance parameter would add the same constant to every model)
log_lik <- -n/2 * log(2 * pi * train_mse[d]) - n/2 # Gaussian log-likelihood at the MLE variance
aic_vals[d] <- -2 * log_lik + 2 * k
bic_vals[d] <- -2 * log_lik + k * log(n)
# 5-fold CV
folds <- sample(rep(1:5, length.out = n))
cv_errs <- numeric(5)
for (fold in 1:5) {
train_idx <- folds != fold
test_idx <- folds == fold
cv_fit <- lm(y_train[train_idx] ~
poly(x_train[train_idx], d, raw = TRUE))
# predict() cannot match the subset expression in newdata column names,
# so compute predictions manually from the fitted coefficients
cv_coefs <- coef(cv_fit)
cv_pred <- cv_coefs[1]
for (j in 1:d) {
cv_pred <- cv_pred + cv_coefs[j + 1] * x_train[test_idx]^j
}
cv_errs[fold] <- mean((y_train[test_idx] - cv_pred)^2)
}
cv_mse[d] <- mean(cv_errs)
}
list(x_train = x_train, y_train = y_train,
f_true = f_true,
train_mse = train_mse, test_mse = test_mse,
aic_vals = aic_vals, bic_vals = bic_vals,
cv_mse = cv_mse,
true_d = true_d, sigma = sigma, n = n)
})
output$plot_fit <- renderPlot({
d <- dat()
par(mar = c(5, 5, 4, 2))
plot(d$x_train, d$y_train, pch = 16,
col = adjustcolor("#3498db", 0.5), cex = 0.7,
xlab = "x", ylab = "y",
main = paste0("True degree: ", d$true_d))
x_grid <- seq(-2, 2, length.out = 300)
lines(x_grid, d$f_true(x_grid), col = "#2c3e50", lwd = 2, lty = 2)
# Show AIC and BIC selected fits
aic_d <- which.min(d$aic_vals)
bic_d <- which.min(d$bic_vals)
fit_aic <- lm(d$y_train ~ poly(d$x_train, aic_d, raw = TRUE))
c_aic <- coef(fit_aic)
pred_aic <- c_aic[1]
for (j in 1:aic_d) pred_aic <- pred_aic + c_aic[j + 1] * x_grid^j
lines(x_grid, pred_aic, col = "#e74c3c", lwd = 2)
if (bic_d != aic_d) {
fit_bic <- lm(d$y_train ~ poly(d$x_train, bic_d, raw = TRUE))
c_bic <- coef(fit_bic)
pred_bic <- c_bic[1]
for (j in 1:bic_d) pred_bic <- pred_bic + c_bic[j + 1] * x_grid^j
lines(x_grid, pred_bic, col = "#27ae60", lwd = 2)
}
legend("topright", bty = "n", cex = 0.85,
legend = c("True function",
paste0("AIC pick (d=", aic_d, ")"),
paste0("BIC pick (d=", bic_d, ")")),
col = c("#2c3e50", "#e74c3c", "#27ae60"),
lwd = 2, lty = c(2, 1, 1))
})
output$plot_ic <- renderPlot({
d <- dat()
par(mar = c(5, 5, 4, 2))
degrees <- 1:10
# Normalize for plotting on same scale
test_norm <- (d$test_mse - min(d$test_mse)) /
(max(d$test_mse) - min(d$test_mse) + 1e-10)
aic_norm <- (d$aic_vals - min(d$aic_vals)) /
(max(d$aic_vals) - min(d$aic_vals) + 1e-10)
bic_norm <- (d$bic_vals - min(d$bic_vals)) /
(max(d$bic_vals) - min(d$bic_vals) + 1e-10)
cv_norm <- (d$cv_mse - min(d$cv_mse)) /
(max(d$cv_mse) - min(d$cv_mse) + 1e-10)
train_norm <- (d$train_mse - min(d$train_mse)) /
(max(d$train_mse) - min(d$train_mse) + 1e-10)
plot(degrees, test_norm, type = "b", pch = 17, cex = 0.9,
col = "#e74c3c", lwd = 2,
xlab = "Polynomial degree", ylab = "Normalized score",
main = "Model selection criteria",
ylim = c(-0.05, 1.2), xaxt = "n")
axis(1, at = 1:10)
lines(degrees, train_norm, type = "b", pch = 19, cex = 0.7,
col = "#95a5a6", lwd = 1.5, lty = 2)
lines(degrees, aic_norm, type = "b", pch = 15, cex = 0.9,
col = "#3498db", lwd = 2)
lines(degrees, bic_norm, type = "b", pch = 18, cex = 1.1,
col = "#27ae60", lwd = 2)
lines(degrees, cv_norm, type = "b", pch = 4, cex = 0.9,
col = "#9b59b6", lwd = 2)
# Mark selections
abline(v = d$true_d, lty = 3, col = "#2c3e50", lwd = 1.5)
legend("topright", bty = "n", cex = 0.8,
legend = c("Test MSE", "Training MSE",
paste0("AIC (picks ", which.min(d$aic_vals), ")"),
paste0("BIC (picks ", which.min(d$bic_vals), ")"),
paste0("CV (picks ", which.min(d$cv_mse), ")"),
paste0("True degree: ", d$true_d)),
col = c("#e74c3c", "#95a5a6", "#3498db",
"#27ae60", "#9b59b6", "#2c3e50"),
pch = c(17, 19, 15, 18, 4, NA),
lwd = c(2, 1.5, 2, 2, 2, 1.5),
lty = c(1, 2, 1, 1, 1, 3))
})
output$results_box <- renderUI({
d <- dat()
tags$div(class = "eq-box", style = "margin-top: 16px;",
HTML(paste0(
"<b>Selections:</b><br>",
"AIC picks: degree <b>", which.min(d$aic_vals), "</b><br>",
"BIC picks: degree <b>", which.min(d$bic_vals), "</b><br>",
"CV picks: degree <b>", which.min(d$cv_mse), "</b><br>",
"Min test MSE: degree <b>", which.min(d$test_mse), "</b><br>",
"<hr style='margin:8px 0'>",
"<b>True degree:</b> ", d$true_d
))
)
})
output$note_box <- renderUI({
tags$div(class = "eq-box", style = "margin-top: 8px;",
HTML(paste0(
"<b>Left:</b> the true function (dashed black) with the AIC and BIC selected fits. ",
"<b>Right:</b> all criteria normalized to [0, 1] for visual comparison. ",
"Training error (gray) always decreases — the other criteria penalize complexity."
))
)
})
}
shinyApp(ui, server)
## Things to try
- True degree 3, low noise: AIC and BIC both nail it. Easy problem.
- True degree 3, high noise (\(\sigma \geq 4\)): BIC may pick degree 1 or 2 (too conservative). AIC stays closer to 3. Neither is always right.
- True degree 1 (linear): BIC’s conservatism pays off. AIC sometimes overfits to degree 2 or 3.
- Large \(n\) (200+): BIC becomes very accurate at identifying the true degree. AIC may still pick slightly more complex models.
## When they disagree
| Goal | Use |
|---|---|
| Prediction (minimize forecast error) | AIC or CV |
| Inference (identify the true model) | BIC |
| No distributional assumptions | CV |
| Safe default | CV |
AIC and cross-validation are asymptotically equivalent for linear models. BIC is more conservative and tends to select simpler models, especially for large \(n\).
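The penalty gap is easy to verify directly: for any fitted model, `BIC(fit) - AIC(fit)` equals \(k(\log n - 2)\), so at \(n = 200\) BIC charges roughly 3.3 extra points per parameter. A small check, with an arbitrary quadratic DGP chosen for this example:

```r
set.seed(7)
x <- runif(200, -2, 2)
y <- x - 0.5 * x^2 + rnorm(200)
fits <- lapply(1:6, function(d) lm(y ~ poly(x, d)))

# Identity: BIC - AIC = k * (log n - 2), where k counts all parameters
k3 <- attr(logLik(fits[[3]]), "df")
BIC(fits[[3]]) - AIC(fits[[3]])   # equals k3 * (log(200) - 2)

# Which degree each criterion picks (BIC never picks a larger one)
which.min(sapply(fits, AIC))
which.min(sapply(fits, BIC))
```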
## Connections
- Bias-Variance Tradeoff — Model selection is the practical response to the bias-variance tradeoff
- Maximum Likelihood — AIC and BIC are both based on the maximized log-likelihood
- Bayesian Model Comparison — BIC approximates the Bayes factor; Bayesian model selection formalizes the complexity penalty
- Regularization as Bayesian Inference — Regularization is a continuous alternative to discrete model selection
## Did you know?
- Hirotugu Akaike proposed the AIC in 1973. The key insight: the maximized log-likelihood overestimates expected out-of-sample log-likelihood by approximately \(k\) (the number of parameters), so on the \(-2\hat{\ell}\) scale the \(2k\) penalty corrects for this optimism. Akaike won the first Kyoto Prize in Basic Sciences for this work.
- Gideon Schwarz proposed the BIC in 1978 from a Bayesian perspective: it approximates the log marginal likelihood (Bayes factor) under certain regularity conditions. Despite its Bayesian motivation, it’s used by frequentists everywhere.
- The AIC vs BIC debate remains unresolved because the two criteria answer different questions: AIC asks “which model predicts best?” while BIC asks “which model is true?” Different questions have different answers.