Model Selection

The problem

Training error always favors complexity — a more flexible model can always fit the training data at least as well. But out-of-sample, complex models overfit. Model selection asks: how do we choose the right level of complexity?

There are two main approaches: information criteria (penalize the log-likelihood) and cross-validation (directly estimate out-of-sample error).

Information criteria

Both AIC and BIC start from the maximized log-likelihood \(\hat{\ell}\) and add a penalty for the number of parameters \(k\):

\[\text{AIC} = -2\hat{\ell} + 2k\]

\[\text{BIC} = -2\hat{\ell} + k \log n\]

Lower is better. The key difference:

| | AIC | BIC |
|---|---|---|
| Penalty | \(2k\) | \(k \log n\) |
| For \(n \ge 8\) | Lighter penalty | Heavier penalty |
| Selects | Larger models | More parsimonious models |
| Goal | Best prediction | Consistent model selection |
| Asymptotic property | Minimizes KL divergence to the truth | Selects the true model (if in the candidate set) |

AIC targets prediction — it estimates the out-of-sample KL divergence. BIC targets model identification — as \(n \to \infty\), it selects the true model with probability 1 (if the true model is among the candidates).
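To make the formulas concrete, here is a hand computation of AIC and BIC for a simple linear model, checked against R's built-in `AIC()` and `BIC()` (a sketch with made-up data). One wrinkle: R's built-ins count the error variance \(\sigma^2\) as an extra parameter, so \(k\) is the number of coefficients plus one. Since that extra term is shared by every candidate model, it never changes which model wins.

```r
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

# Gaussian log-likelihood at the MLE, with sigma^2-hat = RSS / n
rss <- sum(resid(fit)^2)
ll  <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)
k   <- length(coef(fit)) + 1   # intercept + slope + sigma^2

aic <- -2 * ll + 2 * k
bic <- -2 * ll + k * log(n)

stopifnot(isTRUE(all.equal(aic, AIC(fit))),
          isTRUE(all.equal(bic, BIC(fit))))
```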

Cross-validation

K-fold cross-validation directly estimates the out-of-sample error without distributional assumptions:

  1. Split data into \(K\) folds.
  2. For each fold: train on the other \(K-1\) folds, predict on the held-out fold.
  3. Average the prediction errors across folds.

Common choices: \(K = 5\) or \(K = 10\). Leave-one-out CV (\(K = n\)) is approximately equivalent to AIC for linear models.
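The three steps above fit in a few lines of R. A minimal sketch for polynomial degree selection (the data and variable names here are illustrative, not tied to the app below):

```r
set.seed(42)
n <- 100
x <- runif(n, -2, 2)
y <- x - 0.5 * x^2 + rnorm(n, sd = 0.5)

K <- 5
folds <- sample(rep(1:K, length.out = n))  # step 1: assign folds

cv_mse <- sapply(1:6, function(d) {
  errs <- sapply(1:K, function(k) {
    hold <- folds == k                     # step 2: hold out one fold
    fit  <- lm(y ~ poly(x, d, raw = TRUE),
               data = data.frame(x = x[!hold], y = y[!hold]))
    pred <- predict(fit, newdata = data.frame(x = x[hold]))
    mean((y[hold] - pred)^2)
  })
  mean(errs)                               # step 3: average across folds
})

best_d <- which.min(cv_mse)  # degree with lowest estimated test error
```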

The Oracle View. In the simulation below, we know the true polynomial degree that generated the data. We can compare what AIC, BIC, and cross-validation select against the truth. In practice, the “true degree” is unknown — you’re using these tools to approximate an answer you can’t observe.

Simulation

True polynomial DGP of degree \(d\). Fit degrees 1–10. Training error decreases monotonically; test error is U-shaped. AIC and BIC often pick different models.

#| standalone: true
#| viewerHeight: 750

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .eq-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-bottom: 14px; font-size: 14px; line-height: 1.9;
    }
    .eq-box b { color: #2c3e50; }
    .match  { color: #27ae60; font-weight: bold; }
    .coef   { color: #e74c3c; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 4,

      sliderInput("true_d", "True polynomial degree:",
                  min = 1, max = 6, value = 3, step = 1),

      sliderInput("n", "Sample size (n):",
                  min = 30, max = 300, value = 80, step = 10),

      sliderInput("sigma", HTML("Noise level (&sigma;):"),
                  min = 0.5, max = 5, value = 1.5, step = 0.5),

      actionButton("resim", "New draw", class = "btn-primary", width = "100%"),

      uiOutput("results_box")
    ),

    mainPanel(
      width = 8,
      fluidRow(
        column(6, plotOutput("plot_fit", height = "420px")),
        column(6, plotOutput("plot_ic", height = "420px"))
      ),
      uiOutput("note_box")
    )
  )
)

server <- function(input, output, session) {

  dat <- reactive({
    input$resim
    true_d <- input$true_d
    n      <- input$n
    sigma  <- input$sigma

    # Re-seed from the clock so each "New draw" gives a fresh dataset,
    # then draw random coefficients for the true degree-d polynomial
    set.seed(NULL)
    true_coefs <- runif(true_d, -1, 1)

    x_train <- sort(runif(n, -2, 2))
    x_test  <- sort(runif(n, -2, 2))

    # True function
    f_true <- function(x) {
      val <- rep(0, length(x))
      for (j in seq_along(true_coefs)) {
        val <- val + true_coefs[j] * x^j
      }
      val
    }

    y_train <- f_true(x_train) + rnorm(n, sd = sigma)
    y_test  <- f_true(x_test) + rnorm(n, sd = sigma)

    max_d <- 10
    train_mse <- numeric(max_d)
    test_mse  <- numeric(max_d)
    aic_vals  <- numeric(max_d)
    bic_vals  <- numeric(max_d)
    cv_mse    <- numeric(max_d)

    # Fix the 5-fold assignment once, so every degree is scored
    # on the same splits
    folds <- sample(rep(1:5, length.out = n))

    for (d in 1:max_d) {
      fit <- lm(y_train ~ poly(x_train, d, raw = TRUE))

      # Training MSE
      train_mse[d] <- mean(resid(fit)^2)

      # Test MSE (raw polynomial terms are recomputed on the new x values)
      pred_test <- predict(fit, newdata = data.frame(x_train = x_test))
      test_mse[d] <- mean((y_test - pred_test)^2)

      # AIC / BIC from the Gaussian log-likelihood at the MLE.
      # k counts the d slope coefficients plus the intercept; counting
      # sigma^2 too would add the same constant to every model, so the
      # ranking is unchanged.
      k <- d + 1
      log_lik <- -n/2 * log(2 * pi * train_mse[d]) - n/2
      aic_vals[d] <- -2 * log_lik + 2 * k
      bic_vals[d] <- -2 * log_lik + k * log(n)

      # 5-fold CV: refit on four folds, score on the held-out fold.
      # Fitting via a data frame lets predict() handle the new x values.
      cv_errs <- numeric(5)
      for (fold in 1:5) {
        train_idx <- folds != fold
        test_idx  <- folds == fold
        cv_fit <- lm(y ~ poly(x, d, raw = TRUE),
                     data = data.frame(x = x_train[train_idx],
                                       y = y_train[train_idx]))
        cv_pred <- predict(cv_fit,
                           newdata = data.frame(x = x_train[test_idx]))
        cv_errs[fold] <- mean((y_train[test_idx] - cv_pred)^2)
      }
      cv_mse[d] <- mean(cv_errs)
    }

    list(x_train = x_train, y_train = y_train,
         f_true = f_true,
         train_mse = train_mse, test_mse = test_mse,
         aic_vals = aic_vals, bic_vals = bic_vals,
         cv_mse = cv_mse,
         true_d = true_d, sigma = sigma, n = n)
  })

  output$plot_fit <- renderPlot({
    d <- dat()
    par(mar = c(5, 5, 4, 2))

    plot(d$x_train, d$y_train, pch = 16,
         col = adjustcolor("#3498db", 0.5), cex = 0.7,
         xlab = "x", ylab = "y",
         main = paste0("True degree: ", d$true_d))

    x_grid <- seq(-2, 2, length.out = 300)
    lines(x_grid, d$f_true(x_grid), col = "#2c3e50", lwd = 2, lty = 2)

    # Show AIC and BIC selected fits
    aic_d <- which.min(d$aic_vals)
    bic_d <- which.min(d$bic_vals)

    fit_aic <- lm(d$y_train ~ poly(d$x_train, aic_d, raw = TRUE))
    c_aic <- coef(fit_aic)
    pred_aic <- c_aic[1]
    for (j in 1:aic_d) pred_aic <- pred_aic + c_aic[j + 1] * x_grid^j
    lines(x_grid, pred_aic, col = "#e74c3c", lwd = 2)

    if (bic_d != aic_d) {
      fit_bic <- lm(d$y_train ~ poly(d$x_train, bic_d, raw = TRUE))
      c_bic <- coef(fit_bic)
      pred_bic <- c_bic[1]
      for (j in 1:bic_d) pred_bic <- pred_bic + c_bic[j + 1] * x_grid^j
      lines(x_grid, pred_bic, col = "#27ae60", lwd = 2)
    }

    legend("topright", bty = "n", cex = 0.85,
           legend = c("True function",
                      paste0("AIC pick (d=", aic_d, ")"),
                      paste0("BIC pick (d=", bic_d, ")")),
           col = c("#2c3e50", "#e74c3c", "#27ae60"),
           lwd = 2, lty = c(2, 1, 1))
  })

  output$plot_ic <- renderPlot({
    d <- dat()
    par(mar = c(5, 5, 4, 2))

    degrees <- 1:10

    # Normalize for plotting on same scale
    test_norm <- (d$test_mse - min(d$test_mse)) /
                 (max(d$test_mse) - min(d$test_mse) + 1e-10)
    aic_norm  <- (d$aic_vals - min(d$aic_vals)) /
                 (max(d$aic_vals) - min(d$aic_vals) + 1e-10)
    bic_norm  <- (d$bic_vals - min(d$bic_vals)) /
                 (max(d$bic_vals) - min(d$bic_vals) + 1e-10)
    cv_norm   <- (d$cv_mse - min(d$cv_mse)) /
                 (max(d$cv_mse) - min(d$cv_mse) + 1e-10)
    train_norm <- (d$train_mse - min(d$train_mse)) /
                  (max(d$train_mse) - min(d$train_mse) + 1e-10)

    plot(degrees, test_norm, type = "b", pch = 17, cex = 0.9,
         col = "#e74c3c", lwd = 2,
         xlab = "Polynomial degree", ylab = "Normalized score",
         main = "Model selection criteria",
         ylim = c(-0.05, 1.2), xaxt = "n")
    axis(1, at = 1:10)

    lines(degrees, train_norm, type = "b", pch = 19, cex = 0.7,
          col = "#95a5a6", lwd = 1.5, lty = 2)
    lines(degrees, aic_norm, type = "b", pch = 15, cex = 0.9,
          col = "#3498db", lwd = 2)
    lines(degrees, bic_norm, type = "b", pch = 18, cex = 1.1,
          col = "#27ae60", lwd = 2)
    lines(degrees, cv_norm, type = "b", pch = 4, cex = 0.9,
          col = "#9b59b6", lwd = 2)

    # Mark selections
    abline(v = d$true_d, lty = 3, col = "#2c3e50", lwd = 1.5)

    legend("topright", bty = "n", cex = 0.8,
           legend = c("Test MSE", "Training MSE",
                      paste0("AIC (picks ", which.min(d$aic_vals), ")"),
                      paste0("BIC (picks ", which.min(d$bic_vals), ")"),
                      paste0("CV (picks ", which.min(d$cv_mse), ")"),
                      paste0("True degree: ", d$true_d)),
           col = c("#e74c3c", "#95a5a6", "#3498db",
                   "#27ae60", "#9b59b6", "#2c3e50"),
           pch = c(17, 19, 15, 18, 4, NA),
           lwd = c(2, 1.5, 2, 2, 2, 1.5),
           lty = c(1, 2, 1, 1, 1, 3))
  })

  output$results_box <- renderUI({
    d <- dat()

    tags$div(class = "eq-box", style = "margin-top: 16px;",
      HTML(paste0(
        "<b>Selections:</b><br>",
        "AIC picks: degree <b>", which.min(d$aic_vals), "</b><br>",
        "BIC picks: degree <b>", which.min(d$bic_vals), "</b><br>",
        "CV picks: degree <b>", which.min(d$cv_mse), "</b><br>",
        "Min test MSE: degree <b>", which.min(d$test_mse), "</b><br>",
        "<hr style='margin:8px 0'>",
        "<b>True degree:</b> ", d$true_d
      ))
    )
  })

  output$note_box <- renderUI({
    tags$div(class = "eq-box", style = "margin-top: 8px;",
      HTML(paste0(
        "<b>Left:</b> the true function (dashed black) with the AIC and BIC selected fits. ",
        "<b>Right:</b> all criteria normalized to [0, 1] for visual comparison. ",
        "Training error (gray) always decreases — the other criteria penalize complexity."
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • True degree 3, low noise: AIC and BIC both nail it. Easy problem.
  • True degree 3, high noise (\(\sigma \ge 4\)): BIC may pick degree 1 or 2 (too conservative). AIC stays closer to 3. Neither is always right.
  • True degree 1 (linear): BIC’s conservatism pays off. AIC sometimes overfits to degree 2 or 3.
  • Large \(n\) (200+): BIC becomes very accurate at identifying the true degree. AIC may still pick slightly more complex models.

When they disagree

| Goal | Use |
|---|---|
| Prediction (minimize forecast error) | AIC or CV |
| Inference (identify the true model) | BIC |
| No distributional assumptions | CV |
| Safe default | CV |

AIC and leave-one-out cross-validation are asymptotically equivalent for linear models. BIC is more conservative and will select simpler models, especially for large \(n\).
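The LOO-CV side of that equivalence is cheap to check for linear models, because the leave-one-out errors have a closed form via the hat matrix (the PRESS statistic): no refitting is needed. A sketch with made-up data:

```r
set.seed(7)
n <- 60
x <- runif(n, -2, 2)
y <- x - x^2 + rnorm(n)

degrees <- 1:6
loo <- sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d, raw = TRUE))
  # LOO residual for observation i is e_i / (1 - h_ii)
  mean((resid(fit) / (1 - hatvalues(fit)))^2)
})
aic <- sapply(degrees, function(d) AIC(lm(y ~ poly(x, d, raw = TRUE))))

which.min(loo)  # compare with which.min(aic); the two often agree
```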


Did you know?

  • Hirotugu Akaike proposed the AIC in 1973. The key insight: the maximized log-likelihood overestimates expected out-of-sample log-likelihood by approximately \(k\) (the number of parameters), so the \(2k\) penalty on the deviance scale \(-2\hat{\ell}\) corrects for this optimism. Akaike won the 2006 Kyoto Prize in Basic Sciences for this work.
  • Gideon Schwarz proposed the BIC in 1978 from a Bayesian perspective: it approximates the log marginal likelihood (Bayes factor) under certain regularity conditions. Despite its Bayesian motivation, it’s used by frequentists everywhere.
  • The AIC vs BIC debate is still unresolved because the two criteria answer different questions: AIC asks "which model predicts best?" while BIC asks "which model is true?" Different questions have different answers.