The Bias-Variance Tradeoff

Why the best model isn’t the most complex one

You might think: “More flexible models fit the data better, so they must be better.” But there’s a catch. A model that fits the training data perfectly will often perform terribly on new data. This is overfitting.

The bias-variance tradeoff explains why:

|                | Simple model (e.g., linear)         | Complex model (e.g., degree-20 polynomial)    |
|----------------|-------------------------------------|-----------------------------------------------|
| Bias           | High — can’t capture the true shape | Low — flexible enough to match any pattern    |
| Variance       | Low — stable across samples         | High — estimates change wildly with new data  |
| Training error | Higher                              | Lower (maybe zero)                            |
| Test error     | Higher (underfitting arm of the U)  | Higher (overfitting arm of the U)             |

The total prediction error (MSE) decomposes as:

\[\text{MSE} = \text{Bias}^2 + \text{Variance} + \text{Irreducible noise}\]

You can’t drive all three terms to zero: the noise is irreducible no matter which model you choose, and pushing bias down typically pushes variance up (and vice versa). The best model balances the two terms you can control.

The Oracle View. In these simulations, we know the true function \(f(x)\) that generated the data — the curve we’re trying to learn. We can compute exact bias and variance because we know the target. In practice, you never know \(f(x)\). Finding it is the whole point of modeling. You use test error or cross-validation as a proxy for the tradeoff you can’t see directly.
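
The sketch below (separate from the two apps) takes the oracle view literally: it refits a fixed-degree polynomial on many simulated training sets, measures bias² and variance at a single point, and checks that they add up with σ² to the expected test error. The specific setup (sin(2πx) truth, degree 3, noise sd 0.5) is an arbitrary illustrative choice.

set.seed(1)
f      <- function(x) sin(2 * pi * x)   # assumed true function
x0     <- 0.3                           # evaluation point
sigma  <- 0.5                           # noise sd
degree <- 3
n      <- 50
sims   <- 2000

# One prediction at x0 per simulated training set
preds <- replicate(sims, {
  x <- runif(n)
  y <- f(x) + rnorm(n, sd = sigma)
  fit <- lm(y ~ poly(x, degree, raw = TRUE))
  predict(fit, newdata = data.frame(x = x0))
})

bias2    <- (mean(preds) - f(x0))^2   # squared gap of the average fit
variance <- var(preds)                # spread of fits across samples

# Expected squared error predicting a fresh noisy y at x0
mse <- mean((preds - (f(x0) + rnorm(sims, sd = sigma)))^2)

round(c(bias2 = bias2, variance = variance, noise = sigma^2,
        decomposition = bias2 + variance + sigma^2, mse = mse), 4)
# 'decomposition' and 'mse' should agree up to simulation error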

Simulation 1: Polynomial regression — underfitting vs overfitting

Fit polynomials of different degrees to noisy data. Low degrees underfit (too rigid). High degrees overfit (too wiggly). Watch the training error decrease monotonically while the test error has a U-shape.

#| standalone: true
#| viewerHeight: 700

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      selectInput("true_fn", "True function:",
                  choices = c("Sine wave",
                              "Quadratic",
                              "Step function",
                              "Linear")),

      sliderInput("n_train", "Training points:",
                  min = 20, max = 200, value = 50, step = 10),

      sliderInput("noise", "Noise level (\u03c3):",
                  min = 0.1, max = 2, value = 0.5, step = 0.1),

      sliderInput("degree", "Polynomial degree:",
                  min = 1, max = 20, value = 3, step = 1),

      actionButton("go", "New data",
                   class = "btn-primary", width = "100%"),

      uiOutput("results")
    ),

    mainPanel(
      width = 9,
      fluidRow(
        column(7, plotOutput("fit_plot", height = "400px")),
        column(5, plotOutput("error_curve", height = "400px"))
      )
    )
  )
)

server <- function(input, output, session) {

  f_true <- function(x, fn) {
    switch(fn,
      "Sine wave"     = sin(2 * pi * x),
      "Quadratic"     = 2 * (x - 0.5)^2,
      "Step function" = ifelse(x > 0.5, 1, 0),
      "Linear"        = x
    )
  }

  # Simulate fresh train/test sets and fit all 20 degrees in one pass;
  # reruns whenever an input changes or "New data" is pressed.
  dat <- reactive({
    input$go
    n     <- input$n_train
    sigma <- input$noise
    fn    <- input$true_fn

    # Training data
    x_train <- sort(runif(n, 0, 1))
    y_train <- f_true(x_train, fn) + rnorm(n, sd = sigma)

    # Test data (fresh)
    x_test <- sort(runif(n, 0, 1))
    y_test <- f_true(x_test, fn) + rnorm(n, sd = sigma)

    # Fit polynomials of degree 1 through 20
    degrees <- 1:20
    train_mse <- numeric(20)
    test_mse  <- numeric(20)

    for (d in degrees) {
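      # raw = TRUE fits plain powers of x. With few points and a high
      # degree the fit can be rank-deficient; lm() returns NA for the
      # extra coefficients and predict() warns but still gives values.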
      fit <- lm(y_train ~ poly(x_train, d, raw = TRUE))
      pred_train <- predict(fit)
      pred_test  <- predict(fit, newdata = data.frame(x_train = x_test))
      train_mse[d] <- mean((y_train - pred_train)^2)
      test_mse[d]  <- mean((y_test - pred_test)^2)
    }

    # Current degree fit for plotting
    d_now <- input$degree
    fit_now <- lm(y_train ~ poly(x_train, d_now, raw = TRUE))
    x_grid <- seq(0, 1, length.out = 300)
    pred_grid <- predict(fit_now, newdata = data.frame(x_train = x_grid))

    list(x_train = x_train, y_train = y_train,
         x_test = x_test, y_test = y_test,
         x_grid = x_grid, pred_grid = pred_grid,
         train_mse = train_mse, test_mse = test_mse,
         fn = fn, d_now = d_now, sigma = sigma)
  })

  output$fit_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    plot(d$x_train, d$y_train, pch = 16,
         col = "#3498db80", cex = 0.8,
         xlab = "x", ylab = "y",
         main = paste0("Degree ", d$d_now, " polynomial fit"),
         ylim = range(c(d$y_train, d$pred_grid)))

    # True function
    x_fine <- seq(0, 1, length.out = 300)
    lines(x_fine, f_true(x_fine, d$fn),
          col = "#27ae60", lwd = 2, lty = 2)

    # Fitted curve
    lines(d$x_grid, d$pred_grid, col = "#e74c3c", lwd = 2.5)

    legend("topright", bty = "n", cex = 0.85,
           legend = c("Training data", "True function",
                      paste0("Degree ", d$d_now, " fit")),
           col = c("#3498db80", "#27ae60", "#e74c3c"),
           pch = c(16, NA, NA), lwd = c(NA, 2, 2.5),
           lty = c(NA, 2, 1))
  })

  output$error_curve <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    # Cap the y-axis so exploding high-degree test error doesn't
    # flatten the rest of the curve
    ylim <- c(0, min(max(d$test_mse) * 1.1, max(d$test_mse[1:15]) * 1.5))

    plot(1:20, d$train_mse, type = "b", pch = 19, cex = 0.7,
         col = "#3498db", lwd = 2,
         xlab = "Polynomial degree",
         ylab = "Mean Squared Error",
         main = "Train vs Test Error",
         ylim = ylim)
    lines(1:20, d$test_mse, type = "b", pch = 17, cex = 0.7,
          col = "#e74c3c", lwd = 2)

    # Mark current degree
    points(d$d_now, d$train_mse[d$d_now], pch = 19, cex = 2, col = "#3498db")
    points(d$d_now, d$test_mse[d$d_now], pch = 17, cex = 2, col = "#e74c3c")

    # Mark optimal
    best <- which.min(d$test_mse)
    abline(v = best, lty = 3, col = "#7f8c8d")

    abline(h = d$sigma^2, lty = 2, col = "#95a5a6")
    text(15, d$sigma^2 * 1.1, expression("Irreducible noise (" * sigma^2 * ")"),
         cex = 0.75, col = "#95a5a6")

    legend("topright", bty = "n", cex = 0.8,
           legend = c("Training error", "Test error",
                      paste0("Best degree: ", best)),
           col = c("#3498db", "#e74c3c", "#7f8c8d"),
           pch = c(19, 17, NA), lwd = c(2, 2, 1),
           lty = c(1, 1, 3))
  })

  output$results <- renderUI({
    d <- dat()
    best <- which.min(d$test_mse)

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>Degree ", d$d_now, ":</b><br>",
        "Train MSE: ", round(d$train_mse[d$d_now], 4), "<br>",
        "Test MSE: ", round(d$test_mse[d$d_now], 4), "<br>",
        "<hr style='margin:8px 0'>",
        "<b>Optimal degree:</b> ", best, "<br>",
        "Test MSE: ", round(d$test_mse[best], 4), "<br>",
        "<small>Noise floor: \u03c3\u00b2 = ",
        round(d$sigma^2, 3), "</small>"
      ))
    )
  })
}

shinyApp(ui, server)

Simulation 2: Slide along the bias-variance curve

Drag the complexity slider to move the ball along the U-shaped MSE curve — or press ▶ to animate. On the right, see what 20 fits from different training samples look like at each degree: simple models are stable but wrong (bias); complex models scatter wildly (variance).

#| standalone: true
#| viewerHeight: 620

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
    .ball-val { font-weight: bold; color: #e67e22; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      selectInput("fn2", "True function:",
                  choices = c("Sine wave",
                              "Quadratic",
                              "Step function",
                              "Wiggly sine")),

      sliderInput("n2", "Training points:",
                  min = 20, max = 100, value = 40, step = 10),

      sliderInput("noise2", "Noise level:",
                  min = 0.2, max = 1.5, value = 0.5, step = 0.1),

      tags$hr(),

      sliderInput("degree2", "Model complexity (degree):",
                  min = 1, max = 12, value = 1, step = 1,
                  animate = animationOptions(
                    interval = 700, loop = TRUE)),

      actionButton("go2", "New data",
                   class = "btn-primary", width = "100%"),

      uiOutput("results2")
    ),

    mainPanel(
      width = 9,
      fluidRow(
        column(6, plotOutput("bv_plot", height = "480px")),
        column(6, plotOutput("fit_plot2", height = "480px"))
      )
    )
  )
)

server <- function(input, output, session) {

  f_true <- function(x, fn) {
    switch(fn,
      "Sine wave"     = sin(2 * pi * x),
      "Quadratic"     = 2 * (x - 0.5)^2,
      "Step function" = ifelse(x > 0.5, 1, 0),
      "Wiggly sine"   = sin(4 * pi * x) + 0.5 * cos(6 * pi * x)
    )
  }

  # Pre-compute MC results for ALL degrees at once.
  # Only recomputes when function / n / noise / button changes —
  # dragging the degree slider is instant.
  mc_data <- reactive({
    input$go2
    fn    <- input$fn2
    n     <- input$n2
    sigma <- input$noise2
    sims  <- 50

    x_eval <- seq(0.05, 0.95, length.out = 40)
    f_eval <- f_true(x_eval, fn)

    degrees   <- 1:12
    bias2_vec <- numeric(12)
    var_vec   <- numeric(12)
    pred_list <- vector("list", 12)

    for (di in seq_along(degrees)) {
      d <- degrees[di]
      pred_mat <- matrix(NA, nrow = sims, ncol = length(x_eval))

      for (s in seq_len(sims)) {
        x_train <- sort(runif(n, 0, 1))
        y_train <- f_true(x_train, fn) + rnorm(n, sd = sigma)
        fit <- tryCatch(
          lm(y_train ~ poly(x_train, d, raw = TRUE)),
          error = function(e) NULL)
        if (!is.null(fit)) {
          pred_mat[s, ] <- predict(fit,
            newdata = data.frame(x_train = x_eval))
        }
      }

      # Drop simulations where the fit failed or any prediction is NA
      good <- complete.cases(pred_mat)
      if (sum(good) > 2) {
        pred_mat <- pred_mat[good, , drop = FALSE]
        avg_pred <- colMeans(pred_mat)
        bias2_vec[di] <- mean((avg_pred - f_eval)^2)
        var_vec[di]   <- mean(apply(pred_mat, 2, var))
      }
      pred_list[[di]] <- pred_mat
    }

    list(degrees = degrees, bias2 = bias2_vec,
         variance = var_vec,
         mse = bias2_vec + var_vec,
         sigma = sigma, x_eval = x_eval,
         f_eval = f_eval, pred_list = pred_list,
         fn = fn)
  })

  # LEFT PLOT: U-curve with sliding ball
  output$bv_plot <- renderPlot({
    d   <- mc_data()
    deg <- input$degree2
    par(mar = c(4.5, 4.5, 3, 1))

    ylim <- c(0, max(d$mse) * 1.25)
    x    <- d$degrees

    plot(x, d$mse, type = "n",
         xlab = "Polynomial degree", ylab = "Error",
         ylim = ylim,
         main = "Bias\u00b2 + Variance = MSE",
         xaxt = "n")
    axis(1, at = 1:12)

    # Stacked area fills
    polygon(c(x, rev(x)),
            c(d$variance, rep(0, length(x))),
            col = adjustcolor("#3498db", 0.25),
            border = NA)
    polygon(c(x, rev(x)),
            c(d$mse, rev(d$variance)),
            col = adjustcolor("#e74c3c", 0.25),
            border = NA)

    # Lines
    lines(x, d$variance, col = "#3498db", lwd = 2)
    lines(x, d$bias2, col = "#e74c3c", lwd = 2, lty = 2)
    lines(x, d$mse, col = "#2c3e50", lwd = 2.5)

    # Irreducible noise level: the plotted MSE is bias^2 + variance of
    # the fit and excludes sigma^2, so the curve can dip below this line
    abline(h = d$sigma^2, lty = 2, col = "#95a5a6")
    text(11.5, d$sigma^2 + ylim[2] * 0.03,
         expression(sigma^2), cex = 0.8, col = "#95a5a6")

    # Optimal degree
    best <- which.min(d$mse)
    abline(v = best, lty = 3, col = "#7f8c8d")

    # Vertical line from ball to x-axis
    segments(deg, 0, deg, d$mse[deg],
             lty = 2, col = "#e67e22", lwd = 1.5)

    # The sliding ball
    points(deg, d$mse[deg],
           pch = 21, bg = "#f39c12", col = "#2c3e50",
           cex = 3.5, lwd = 2)

    legend("topright", bty = "n", cex = 0.85,
           legend = c("MSE (total)",
                      "Bias\u00b2", "Variance",
                      paste0("Optimal: degree ", best)),
           col = c("#2c3e50", "#e74c3c",
                   "#3498db", "#7f8c8d"),
           lwd = c(2.5, 2, 2, 1),
           lty = c(1, 2, 1, 3))
  })

  # RIGHT PLOT: What fits look like at current degree
  output$fit_plot2 <- renderPlot({
    d   <- mc_data()
    deg <- input$degree2
    par(mar = c(4.5, 4.5, 3, 1))

    pred_mat <- d$pred_list[[deg]]
    avg_pred <- colMeans(pred_mat)
    n_show   <- min(20, nrow(pred_mat))

    # Cap y-range so wild high-degree fits don't
    # blow up the axis
    preds_shown <- as.vector(pred_mat[1:n_show, ])
    yq <- quantile(preds_shown,
                   c(0.02, 0.98), na.rm = TRUE)
    ylim <- c(
      min(yq[1], min(d$f_eval) - 0.3),
      max(yq[2], max(d$f_eval) + 0.3))

    plot(d$x_eval, d$f_eval, type = "n",
         xlab = "x", ylab = "f(x)",
         ylim = ylim,
         main = paste0("Degree ", deg,
           " \u2014 20 fits from different samples"))

    # Individual fits (semi-transparent)
    for (i in 1:n_show) {
      lines(d$x_eval, pred_mat[i, ],
            col = adjustcolor("#3498db", 0.2),
            lwd = 1.2)
    }

    # Average prediction — gap from truth = bias
    lines(d$x_eval, avg_pred,
          col = "#e74c3c", lwd = 2.5)

    # True function
    lines(d$x_eval, d$f_eval,
          col = "#27ae60", lwd = 2.5, lty = 2)

    legend("topright", bty = "n", cex = 0.85,
           legend = c("True f(x)",
                      "Average prediction",
                      "Individual fits"),
           col = c("#27ae60", "#e74c3c",
                   adjustcolor("#3498db", 0.5)),
           lwd = c(2.5, 2.5, 1.2),
           lty = c(2, 1, 1))
  })

  # Stats box
  output$results2 <- renderUI({
    d   <- mc_data()
    deg <- input$degree2
    best <- which.min(d$mse)

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>Degree ", deg, ":</b><br>",
        "Bias\u00b2: ", round(d$bias2[deg], 4), "<br>",
        "Variance: ", round(d$variance[deg], 4), "<br>",
        "<span class='ball-val'>MSE: ",
        round(d$mse[deg], 4), "</span><br>",
        "<hr style='margin:8px 0'>",
        "<b>Optimal:</b> degree ", best,
        " (MSE ", round(d$mse[best], 4), ")"
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • Sine wave, degree 1: the linear fit can’t capture the curve — high bias, low variance. Classic underfitting.
  • Sine wave, degree 20: the polynomial goes wild between data points — low bias, high variance. Classic overfitting.
  • Sine wave, degree 3–5: the sweet spot where bias and variance are balanced. The test error curve reaches its minimum.
  • Increase noise: the optimal degree shifts down. Noisier data means simpler models are better — there’s less signal to extract.
  • Quadratic true function, degree 2: the model matches the data-generating process exactly. Bias is essentially zero. Going beyond degree 2 only adds variance.

The bottom line

  • Simple models are biased but stable. They miss patterns in the data but give consistent predictions across samples.
  • Complex models are flexible but noisy. They can track the training data almost perfectly, but their predictions change wildly from one training sample to the next.
  • The best model balances both. The U-shaped test error curve is the signature of the bias-variance tradeoff.
  • In practice, we use cross-validation to find the sweet spot without needing access to the true function; the sketch below shows the idea.
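
As a rough sketch of that last point, here is 5-fold cross-validation choosing a polynomial degree without ever consulting the true function. The simulated setup (sine truth, noise sd 0.5, 100 points, degrees 1 through 12) mirrors the apps but is otherwise an arbitrary choice.

set.seed(2)
n   <- 100
dat <- data.frame(x = runif(n))
dat$y <- sin(2 * pi * dat$x) + rnorm(n, sd = 0.5)

k    <- 5
fold <- sample(rep(1:k, length.out = n))  # random fold labels

cv_mse <- sapply(1:12, function(d) {
  mean(sapply(1:k, function(j) {
    # Fit on all folds except j, score on the held-out fold j
    fit  <- lm(y ~ poly(x, d, raw = TRUE), data = dat[fold != j, ])
    pred <- predict(fit, newdata = dat[fold == j, ])
    mean((dat$y[fold == j] - pred)^2)
  }))
})

which.min(cv_mse)  # degree with the lowest cross-validated MSE

The chosen degree typically lands near the minimum of the test error curve in Simulation 1, even though no test set or true function was used to score the candidates.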

Did you know?

  • The bias-variance decomposition was formalized for learning problems by Stuart Geman, Élie Bienenstock, and René Doursat in their 1992 paper “Neural Networks and the Bias/Variance Dilemma.” But the intuition goes back much further — Charles Stein showed in 1956 that the sample mean of a multivariate normal is inadmissible (not the best estimator) in dimensions ≥ 3. The James-Stein estimator reduces MSE by shrinking toward the grand mean — trading bias for variance. This result was so counterintuitive that Stein’s colleagues initially didn’t believe it.
  • The bias-variance tradeoff is the theoretical foundation of regularization methods like ridge regression, LASSO, and elastic net. These methods intentionally introduce bias (by shrinking coefficients) to dramatically reduce variance, resulting in better predictions overall.
  • Cross-validation — the practical tool for navigating the tradeoff — was developed in the mid-1970s: Mervyn Stone’s paper on cross-validatory choice appeared in 1974, and Seymour Geisser’s predictive sample reuse method followed in 1975. The leave-one-out version was known even earlier.