The Bias-Variance Tradeoff
Why the best model isn’t the most complex one
You might think: “More flexible models fit the data better, so they must be better.” But there’s a catch. A model that fits the training data perfectly will often perform terribly on new data. This is overfitting.
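A minimal sketch of the failure mode, using assumed toy data (a noisy sine, separate from the apps below): a polynomial with as many coefficients as observations reproduces the training set exactly, yet falls apart between the points.

set.seed(1)
n <- 15
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
fit <- lm(y ~ poly(x, 14))               # 15 coefficients for 15 points
mean(residuals(fit)^2)                   # training MSE: essentially zero
x_new <- runif(n)                        # fresh data from the same process
y_new <- sin(2 * pi * x_new) + rnorm(n, sd = 0.3)
mean((y_new - predict(fit, data.frame(x = x_new)))^2)  # test MSE: typically huge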
The bias-variance tradeoff explains why:
| | Simple model (e.g., linear) | Complex model (e.g., degree-20 polynomial) |
|---|---|---|
| Bias | High — can’t capture the true shape | Low — flexible enough to match any pattern |
| Variance | Low — stable across samples | High — estimates change wildly with new data |
| Training error | Higher | Lower (maybe zero) |
| Test error | High — misses the signal (underfits) | High — chases the noise (overfits) |
The total prediction error (MSE) decomposes as:
\[\text{MSE} = \text{Bias}^2 + \text{Variance} + \underbrace{\sigma^2}_{\text{irreducible noise}}\]
You can’t drive all three terms to zero: the noise term is irreducible, and pushing bias down typically pushes variance up (and vice versa). The best model balances the two.
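You can check the decomposition numerically. A minimal Monte Carlo sketch, with every specific assumed for illustration (a sine truth, σ = 0.5, a degree-3 fit, a single evaluation point x0 = 0.25): refit on many training samples, then compare the three pieces against the realized squared error.

set.seed(1)
f <- function(x) sin(2 * pi * x)
sigma <- 0.5; x0 <- 0.25; n <- 50; sims <- 2000
preds <- replicate(sims, {               # one prediction at x0 per training set
  x <- runif(n)
  y <- f(x) + rnorm(n, sd = sigma)
  predict(lm(y ~ poly(x, 3)), data.frame(x = x0))
})
bias2 <- (mean(preds) - f(x0))^2         # systematic gap of the average fit
variance <- var(preds)                   # spread of fits across samples
mse <- mean((preds - (f(x0) + rnorm(sims, sd = sigma)))^2)   # vs fresh noisy y
c(decomposed = bias2 + variance + sigma^2, simulated = mse)  # roughly equal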
Simulation 1: Polynomial regression — underfitting vs overfitting
Fit polynomials of different degrees to noisy data. Low degrees underfit (too rigid); high degrees overfit (too wiggly). Watch the training error decrease monotonically with degree while the test error traces a U-shape.
#| standalone: true
#| viewerHeight: 700
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
selectInput("true_fn", "True function:",
choices = c("Sine wave",
"Quadratic",
"Step function",
"Linear")),
sliderInput("n_train", "Training points:",
min = 20, max = 200, value = 50, step = 10),
sliderInput("noise", "Noise level (\u03c3):",
min = 0.1, max = 2, value = 0.5, step = 0.1),
sliderInput("degree", "Polynomial degree:",
min = 1, max = 20, value = 3, step = 1),
actionButton("go", "New data",
class = "btn-primary", width = "100%"),
uiOutput("results")
),
mainPanel(
width = 9,
fluidRow(
column(7, plotOutput("fit_plot", height = "400px")),
column(5, plotOutput("error_curve", height = "400px"))
)
)
)
)
server <- function(input, output, session) {
f_true <- function(x, fn) {
switch(fn,
"Sine wave" = sin(2 * pi * x),
"Quadratic" = 2 * (x - 0.5)^2,
"Step function" = ifelse(x > 0.5, 1, 0),
"Linear" = x
)
}
dat <- reactive({
input$go
n <- input$n_train
sigma <- input$noise
fn <- input$true_fn
# Training data
x_train <- sort(runif(n, 0, 1))
y_train <- f_true(x_train, fn) + rnorm(n, sd = sigma)
# Test data (fresh)
x_test <- sort(runif(n, 0, 1))
y_test <- f_true(x_test, fn) + rnorm(n, sd = sigma)
# Fit polynomials of degree 1 through 20
degrees <- 1:20
train_mse <- numeric(20)
test_mse <- numeric(20)
for (d in degrees) {
fit <- lm(y_train ~ poly(x_train, d, raw = TRUE))
pred_train <- predict(fit)
pred_test <- predict(fit, newdata = data.frame(x_train = x_test))
train_mse[d] <- mean((y_train - pred_train)^2)
test_mse[d] <- mean((y_test - pred_test)^2)
}
# Current degree fit for plotting
d_now <- input$degree
fit_now <- lm(y_train ~ poly(x_train, d_now, raw = TRUE))
x_grid <- seq(0, 1, length.out = 300)
pred_grid <- predict(fit_now, newdata = data.frame(x_train = x_grid))
list(x_train = x_train, y_train = y_train,
x_test = x_test, y_test = y_test,
x_grid = x_grid, pred_grid = pred_grid,
train_mse = train_mse, test_mse = test_mse,
fn = fn, d_now = d_now, sigma = sigma)
})
output$fit_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
# Clamp extreme high-degree predictions so the axis stays readable
yr <- range(d$y_train)
ylim <- range(c(d$y_train,
                pmax(pmin(d$pred_grid, yr[2] + 1), yr[1] - 1)))
plot(d$x_train, d$y_train, pch = 16,
     col = "#3498db80", cex = 0.8,
     xlab = "x", ylab = "y",
     main = paste0("Degree ", d$d_now, " polynomial fit"),
     ylim = ylim)
# True function
x_fine <- seq(0, 1, length.out = 300)
lines(x_fine, f_true(x_fine, d$fn),
col = "#27ae60", lwd = 2, lty = 2)
# Fitted curve
lines(d$x_grid, d$pred_grid, col = "#e74c3c", lwd = 2.5)
legend("topright", bty = "n", cex = 0.85,
legend = c("Training data", "True function",
paste0("Degree ", d$d_now, " fit")),
col = c("#3498db80", "#27ae60", "#e74c3c"),
pch = c(16, NA, NA), lwd = c(NA, 2, 2.5),
lty = c(NA, 2, 1))
})
output$error_curve <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
ylim <- c(0, min(max(d$test_mse) * 1.1, max(d$test_mse[1:15]) * 1.5))
plot(1:20, d$train_mse, type = "b", pch = 19, cex = 0.7,
col = "#3498db", lwd = 2,
xlab = "Polynomial degree",
ylab = "Mean Squared Error",
main = "Train vs Test Error",
ylim = ylim)
lines(1:20, d$test_mse, type = "b", pch = 17, cex = 0.7,
col = "#e74c3c", lwd = 2)
# Mark current degree
points(d$d_now, d$train_mse[d$d_now], pch = 19, cex = 2, col = "#3498db")
points(d$d_now, d$test_mse[d$d_now], pch = 17, cex = 2, col = "#e74c3c")
# Mark optimal
best <- which.min(d$test_mse)
abline(v = best, lty = 3, col = "#7f8c8d")
abline(h = d$sigma^2, lty = 2, col = "#95a5a6")
text(15, d$sigma^2 * 1.1, expression("Irreducible noise (" * sigma^2 * ")"),
cex = 0.75, col = "#95a5a6")
legend("topright", bty = "n", cex = 0.8,
legend = c("Training error", "Test error",
paste0("Best degree: ", best)),
col = c("#3498db", "#e74c3c", "#7f8c8d"),
pch = c(19, 17, NA), lwd = c(2, 2, 1),
lty = c(1, 1, 3))
})
output$results <- renderUI({
d <- dat()
best <- which.min(d$test_mse)
tags$div(class = "stats-box",
HTML(paste0(
"<b>Degree ", d$d_now, ":</b><br>",
"Train MSE: ", round(d$train_mse[d$d_now], 4), "<br>",
"Test MSE: ", round(d$test_mse[d$d_now], 4), "<br>",
"<hr style='margin:8px 0'>",
"<b>Optimal degree:</b> ", best, "<br>",
"Test MSE: ", round(d$test_mse[best], 4), "<br>",
"<small>Noise floor: \u03c3\u00b2 = ",
round(d$sigma^2, 3), "</small>"
))
)
})
}
shinyApp(ui, server)
Simulation 2: Slide along the bias-variance curve
Drag the complexity slider to move the ball along the U-shaped MSE curve — or press ▶ to animate. On the right, see what 20 fits from different training samples look like at each degree: simple models are stable but wrong (bias); complex models scatter wildly (variance).
#| standalone: true
#| viewerHeight: 620
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
.ball-val { font-weight: bold; color: #e67e22; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
selectInput("fn2", "True function:",
choices = c("Sine wave",
"Quadratic",
"Step function",
"Wiggly sine")),
sliderInput("n2", "Training points:",
min = 20, max = 100, value = 40, step = 10),
sliderInput("noise2", "Noise level:",
min = 0.2, max = 1.5, value = 0.5, step = 0.1),
tags$hr(),
sliderInput("degree2", "Model complexity (degree):",
min = 1, max = 12, value = 1, step = 1,
animate = animationOptions(
interval = 700, loop = TRUE)),
actionButton("go2", "New data",
class = "btn-primary", width = "100%"),
uiOutput("results2")
),
mainPanel(
width = 9,
fluidRow(
column(6, plotOutput("bv_plot", height = "480px")),
column(6, plotOutput("fit_plot2", height = "480px"))
)
)
)
)
server <- function(input, output, session) {
f_true <- function(x, fn) {
switch(fn,
"Sine wave" = sin(2 * pi * x),
"Quadratic" = 2 * (x - 0.5)^2,
"Step function" = ifelse(x > 0.5, 1, 0),
"Wiggly sine" = sin(4 * pi * x) + 0.5 * cos(6 * pi * x)
)
}
# Pre-compute MC results for ALL degrees at once.
# Only recomputes when function / n / noise / button changes —
# dragging the degree slider is instant.
mc_data <- reactive({
input$go2
fn <- input$fn2
n <- input$n2
sigma <- input$noise2
sims <- 50
x_eval <- seq(0.05, 0.95, length.out = 40)
f_eval <- f_true(x_eval, fn)
degrees <- 1:12
bias2_vec <- numeric(12)
var_vec <- numeric(12)
pred_list <- vector("list", 12)
for (di in seq_along(degrees)) {
d <- degrees[di]
pred_mat <- matrix(NA, nrow = sims, ncol = length(x_eval))
for (s in seq_len(sims)) {
x_train <- sort(runif(n, 0, 1))
y_train <- f_true(x_train, fn) + rnorm(n, sd = sigma)
fit <- tryCatch(
lm(y_train ~ poly(x_train, d, raw = TRUE)),
error = function(e) NULL)
if (!is.null(fit)) {
pred_mat[s, ] <- predict(fit,
newdata = data.frame(x_train = x_eval))
}
}
good <- complete.cases(pred_mat)
if (sum(good) > 2) {
pred_mat <- pred_mat[good, , drop = FALSE]
avg_pred <- colMeans(pred_mat)
bias2_vec[di] <- mean((avg_pred - f_eval)^2)
var_vec[di] <- mean(apply(pred_mat, 2, var))
}
pred_list[[di]] <- pred_mat
}
list(degrees = degrees, bias2 = bias2_vec,
variance = var_vec,
mse = bias2_vec + var_vec,
sigma = sigma, x_eval = x_eval,
f_eval = f_eval, pred_list = pred_list,
fn = fn)
})
# LEFT PLOT: U-curve with sliding ball
output$bv_plot <- renderPlot({
d <- mc_data()
deg <- input$degree2
par(mar = c(4.5, 4.5, 3, 1))
ylim <- c(0, max(d$mse) * 1.25)
x <- d$degrees
plot(x, d$mse, type = "n",
xlab = "Polynomial degree", ylab = "Error",
ylim = ylim,
main = "Bias\u00b2 + Variance = MSE",
xaxt = "n")
axis(1, at = 1:12)
# Stacked area fills
polygon(c(x, rev(x)),
c(d$variance, rep(0, length(x))),
col = adjustcolor("#3498db", 0.25),
border = NA)
polygon(c(x, rev(x)),
c(d$mse, rev(d$variance)),
col = adjustcolor("#e74c3c", 0.25),
border = NA)
# Lines
lines(x, d$variance, col = "#3498db", lwd = 2)
lines(x, d$bias2, col = "#e74c3c", lwd = 2, lty = 2)
lines(x, d$mse, col = "#2c3e50", lwd = 2.5)
# Noise floor
abline(h = d$sigma^2, lty = 2, col = "#95a5a6")
text(11.5, d$sigma^2 + ylim[2] * 0.03,
expression(sigma^2), cex = 0.8, col = "#95a5a6")
# Optimal degree
best <- which.min(d$mse)
abline(v = best, lty = 3, col = "#7f8c8d")
# Vertical line from ball to x-axis
segments(deg, 0, deg, d$mse[deg],
lty = 2, col = "#e67e22", lwd = 1.5)
# The sliding ball
points(deg, d$mse[deg],
pch = 21, bg = "#f39c12", col = "#2c3e50",
cex = 3.5, lwd = 2)
legend("topright", bty = "n", cex = 0.85,
legend = c("MSE (total)",
"Bias\u00b2", "Variance",
paste0("Optimal: degree ", best)),
col = c("#2c3e50", "#e74c3c",
"#3498db", "#7f8c8d"),
lwd = c(2.5, 2, 2, 1),
lty = c(1, 2, 1, 3))
})
# RIGHT PLOT: What fits look like at current degree
output$fit_plot2 <- renderPlot({
d <- mc_data()
deg <- input$degree2
par(mar = c(4.5, 4.5, 3, 1))
pred_mat <- d$pred_list[[deg]]
avg_pred <- colMeans(pred_mat)
n_show <- min(20, nrow(pred_mat))
# Cap y-range so wild high-degree fits don't
# blow up the axis
preds_shown <- as.vector(pred_mat[1:n_show, ])
yq <- quantile(preds_shown,
c(0.02, 0.98), na.rm = TRUE)
ylim <- c(
min(yq[1], min(d$f_eval) - 0.3),
max(yq[2], max(d$f_eval) + 0.3))
plot(d$x_eval, d$f_eval, type = "n",
xlab = "x", ylab = "f(x)",
ylim = ylim,
main = paste0("Degree ", deg,
" \u2014 20 fits from different samples"))
# Individual fits (semi-transparent)
for (i in 1:n_show) {
lines(d$x_eval, pred_mat[i, ],
col = adjustcolor("#3498db", 0.2),
lwd = 1.2)
}
# Average prediction — gap from truth = bias
lines(d$x_eval, avg_pred,
col = "#e74c3c", lwd = 2.5)
# True function
lines(d$x_eval, d$f_eval,
col = "#27ae60", lwd = 2.5, lty = 2)
legend("topright", bty = "n", cex = 0.85,
legend = c("True f(x)",
"Average prediction",
"Individual fits"),
col = c("#27ae60", "#e74c3c",
adjustcolor("#3498db", 0.5)),
lwd = c(2.5, 2.5, 1.2),
lty = c(2, 1, 1))
})
# Stats box
output$results2 <- renderUI({
d <- mc_data()
deg <- input$degree2
best <- which.min(d$mse)
tags$div(class = "stats-box",
HTML(paste0(
"<b>Degree ", deg, ":</b><br>",
"Bias\u00b2: ", round(d$bias2[deg], 4), "<br>",
"Variance: ", round(d$variance[deg], 4), "<br>",
"<span class='ball-val'>MSE: ",
round(d$mse[deg], 4), "</span><br>",
"<hr style='margin:8px 0'>",
"<b>Optimal:</b> degree ", best,
" (MSE ", round(d$mse[best], 4), ")"
))
)
})
}
shinyApp(ui, server)
Things to try
- Sine wave, degree 1: the linear fit can’t capture the curve — high bias, low variance. Classic underfitting.
- Sine wave, degree 20: the polynomial goes wild between data points — low bias, high variance. Classic overfitting.
- Sine wave, degree 3–5: the sweet spot where bias and variance are balanced. The test error curve reaches its minimum.
- Increase noise: the optimal degree shifts down. Noisier data means simpler models are better — there’s less signal to extract.
- Quadratic true function, degree 2: the model matches the data-generating process exactly. Bias is essentially zero. Going beyond degree 2 only adds variance.
The bottom line
- Simple models are biased but stable. They miss patterns in the data but give consistent predictions across samples.
- Complex models are flexible but noisy. They fit the training data almost perfectly, but their predictions change wildly from one training sample to the next.
- The best model balances both. The U-shaped test error curve is the signature of the bias-variance tradeoff.
- In practice, we use cross-validation to find the sweet spot without needing access to the true function (a minimal sketch follows this list).
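A minimal 5-fold sketch of that idea, with the toy data, fold count, and degree range assumed for illustration (base R only): hold out each fold in turn, refit on the rest, and pick the degree with the lowest held-out error.

set.seed(1)
dat <- data.frame(x = runif(100))
dat$y <- sin(2 * pi * dat$x) + rnorm(100, sd = 0.5)
folds <- sample(rep(1:5, length.out = nrow(dat)))  # random fold labels
cv_mse <- sapply(1:10, function(d) {
  mean(sapply(1:5, function(k) {
    fit <- lm(y ~ poly(x, d), data = dat[folds != k, ])
    pred <- predict(fit, newdata = dat[folds == k, ])
    mean((dat$y[folds == k] - pred)^2)             # held-out error for fold k
  }))
})
which.min(cv_mse)  # chosen degree; no access to the true function needed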
Did you know?
- The bias-variance dilemma was brought to prominence in machine learning by Stuart Geman, Élie Bienenstock, and René Doursat in their landmark 1992 paper “Neural networks and the bias/variance dilemma.” But the intuition goes back much further — Charles Stein showed in 1956 that the sample mean of a multivariate normal is inadmissible (it can be uniformly beaten) in dimensions ≥ 3. The James-Stein estimator reduces MSE by shrinking toward the grand mean — accepting a little bias in exchange for a large drop in variance. This result was so counterintuitive that Stein’s colleagues initially didn’t believe it.
- The bias-variance tradeoff is the theoretical foundation of regularization methods like ridge regression, LASSO, and elastic net. These methods intentionally introduce bias (by shrinking coefficients) to dramatically reduce variance, resulting in better predictions overall (see the sketch after this list).
- Cross-validation — the practical tool for navigating the tradeoff — was developed independently by Mervyn Stone (1974) and Seymour Geisser (1975). The leave-one-out version was known even earlier.
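To see the ridge trade concretely, here is a minimal sketch; the toy sine data, the degree-6 basis, and the penalty value are assumptions for illustration, solved via base-R normal equations rather than a packaged solver. The penalty pulls coefficients toward zero, adding a little bias while damping the unstable directions of the fit.

set.seed(1)
f <- function(x) sin(2 * pi * x)
basis <- function(x) outer(x, 0:6, `^`)     # raw polynomial basis, degree 6
ridge_coef <- function(X, y, lambda) {
  P <- diag(ncol(X)); P[1, 1] <- 0          # leave the intercept unpenalized
  solve(t(X) %*% X + lambda * P, t(X) %*% y)
}
x_test <- seq(0, 1, length.out = 200)
err <- replicate(200, {                     # 200 training sets of size 40
  x <- runif(40); y <- f(x) + rnorm(40, sd = 0.5)
  sapply(c(ols = 0, ridge = 0.001), function(l)
    mean((basis(x_test) %*% ridge_coef(basis(x), y, l) - f(x_test))^2))
})
rowMeans(err)  # ridge typically wins: a little bias for much less variance

Under these assumed settings, the unpenalized fit occasionally goes unstable on an unlucky sample; that instability is exactly the variance the penalty suppresses.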