The Bootstrap
The idea
You want a confidence interval for some statistic (mean, median, regression coefficient), but you don’t know the sampling distribution. Maybe the formula is complicated. Maybe there is no formula.
The bootstrap says: pretend your sample is the population. Then simulate the sampling process by resampling with replacement from your data, over and over. Each resample gives you a new estimate. The distribution of those estimates approximates the true sampling distribution.
The steps
- You have a sample of \(n\) observations.
- Draw \(n\) observations with replacement from your sample (some points get picked twice, some not at all).
- Compute your statistic on this resample.
- Repeat B times (typically 1,000–10,000).
- The spread of those B estimates gives you a standard error and a confidence interval.
That’s it. No formulas, no parametric distributional assumptions. Just resampling.
But can’t I just look at my data?
No — and this is a crucial distinction. There are two different distributions at play:
- The distribution of the data = what individual observations look like (their histogram). This tells you about spread, skew, outliers in your raw data.
- The sampling distribution = what your statistic (mean, median, slope) would look like if you could repeat the entire experiment many times. This is what you need for standard errors and confidence intervals.
You can always plot your data. But you cannot see the sampling distribution — you only ran the experiment once. You have one mean, not a distribution of means.
The bootstrap bridges that gap. It simulates the “what if I repeated this experiment?” question using only the data you have. The left panel below shows your data (a histogram for univariate statistics, a scatter plot for correlation/regression). The right panel shows the bootstrap sampling distribution. Notice: they look completely different. Your data might be skewed and spread out, but the sampling distribution of the mean is narrow and roughly normal (thanks to the CLT — though this narrowing applies specifically to the mean; for other statistics like the median or SD, the bootstrap distribution can remain skewed). The bootstrap discovers this for you without any formulas.
#| standalone: true
#| viewerHeight: 900
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
selectInput("stat", "Statistic to bootstrap:",
choices = c("Mean", "Median", "SD",
"Correlation", "Regression slope")),
selectInput("pop", "Population shape:",
choices = c("Normal", "Skewed (exponential)",
"Heavy-tailed", "Bimodal")),
sliderInput("n", "Sample size:",
min = 10, max = 200, value = 30, step = 10),
sliderInput("B", "Bootstrap resamples (B):",
min = 100, max = 5000, value = 1000, step = 100),
actionButton("go", "New sample + bootstrap",
class = "btn-primary", width = "100%"),
uiOutput("results")
),
mainPanel(
width = 9,
fluidRow(
column(6, plotOutput("sample_plot", height = "350px")),
column(6, plotOutput("boot_plot", height = "350px"))
),
fluidRow(
column(12, plotOutput("compare_plot", height = "350px"))
)
)
)
)
server <- function(input, output, session) {
draw_pop <- function(n, pop) {
switch(pop,
"Normal" = rnorm(n, mean = 5, sd = 2),
"Skewed (exponential)" = rexp(n, rate = 0.5),
"Heavy-tailed" = rt(n, df = 3) * 2 + 5,
"Bimodal" = {
k <- rbinom(n, 1, 0.5)
k * rnorm(n, 2, 0.8) + (1 - k) * rnorm(n, 7, 0.8)
}
)
}
true_param <- function(stat, pop) {
switch(stat,
"Mean" = switch(pop,
"Normal" = 5,
"Skewed (exponential)" = 1 / 0.5,
"Heavy-tailed" = 5,
"Bimodal" = 0.5 * 2 + 0.5 * 7
),
"Median" = switch(pop,
"Normal" = 5,
"Skewed (exponential)" = log(2) / 0.5,
"Heavy-tailed" = 5,
"Bimodal" = 4.5
),
"SD" = switch(pop,
"Normal" = 2,
"Skewed (exponential)" = 1 / 0.5,
"Heavy-tailed" = NA_real_,
"Bimodal" = sqrt(0.5 * (0.8^2 + 2^2) + 0.5 * (0.8^2 + 7^2)
- (0.5 * 2 + 0.5 * 7)^2)
),
"Correlation" = {
# y = 1 + 0.8*x + N(0,2), so cor = 0.8*sd(x) / sqrt(0.64*var(x)+4)
var_x <- switch(pop,
"Normal" = 4,
"Skewed (exponential)" = 4,
"Heavy-tailed" = NA_real_, # t(3) variance infinite
"Bimodal" = 0.5*(0.8^2+2^2) + 0.5*(0.8^2+7^2) - (0.5*2+0.5*7)^2
)
if (is.na(var_x)) NA_real_
else 0.8 * sqrt(var_x) / sqrt(0.64 * var_x + 4)
},
"Regression slope" = 0.8
)
}
compute_stat <- function(x, y = NULL, stat) {
switch(stat,
"Mean" = mean(x),
"Median" = median(x),
"SD" = sd(x),
"Correlation" = cor(x, y),
"Regression slope" = coef(lm(y ~ x))[2]
)
}
dat <- reactive({
input$go
n <- input$n
B <- input$B
stat <- input$stat
pop <- input$pop
# Cap B for regression to keep responsiveness
B_eff <- if (stat %in% c("Regression slope")) min(B, 2000) else B
x <- draw_pop(n, pop)
y <- NULL
if (stat %in% c("Correlation", "Regression slope")) {
y <- 1 + 0.8 * x + rnorm(n, sd = 2)
}
obs_stat <- compute_stat(x, y, stat)
truth <- true_param(stat, pop)
# Bootstrap
boot_stats <- replicate(B_eff, {
idx <- sample(n, n, replace = TRUE)
x_b <- x[idx]
y_b <- if (!is.null(y)) y[idx] else NULL
compute_stat(x_b, y_b, stat)
})
boot_se <- sd(boot_stats)
boot_ci <- quantile(boot_stats, c(0.025, 0.975))
covered <- if (!is.na(truth)) {
truth >= boot_ci[1] && truth <= boot_ci[2]
} else NA
list(x = x, y = y, obs_stat = obs_stat,
boot_stats = boot_stats, boot_se = boot_se,
boot_ci = boot_ci, stat = stat, n = n, B = B_eff,
truth = truth, covered = covered)
})
output$sample_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
if (d$stat %in% c("Correlation", "Regression slope")) {
plot(d$x, d$y, pch = 16, col = "#3498db80", cex = 0.8,
xlab = "X", ylab = "Y", main = "Your sample")
if (d$stat == "Regression slope") {
abline(lm(d$y ~ d$x), col = "#e74c3c", lwd = 2)
}
} else {
hist(d$x, breaks = 20, col = "#d5e8d4", border = "#82b366",
main = "Your sample", xlab = "X", ylab = "Frequency")
abline(v = d$obs_stat, col = "#e74c3c", lwd = 2.5, lty = 2)
legend("topright", bty = "n", cex = 0.85,
legend = paste0(d$stat, " = ", round(d$obs_stat, 3)),
col = "#e74c3c", lty = 2, lwd = 2)
}
})
output$boot_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
# Title includes coverage verdict
cov_label <- if (!is.na(d$covered)) {
if (d$covered) " — Covered: YES" else " — Covered: NO"
} else ""
main_title <- paste0("Bootstrap distribution (B = ", d$B, ")", cov_label)
hist(d$boot_stats, breaks = 40, probability = TRUE,
col = "#dae8fc", border = "#6c8ebf",
main = main_title,
xlab = paste0("Bootstrap ", tolower(d$stat)),
ylab = "Density")
abline(v = d$obs_stat, col = "#e74c3c", lwd = 2.5, lty = 2)
abline(v = d$boot_ci, col = "#9b59b6", lwd = 2, lty = 3)
# True parameter (Oracle View)
leg_labels <- c(paste0("Observed: ", round(d$obs_stat, 3)),
paste0("95% CI: [", round(d$boot_ci[1], 3),
", ", round(d$boot_ci[2], 3), "]"))
leg_cols <- c("#e74c3c", "#9b59b6")
leg_ltys <- c(2, 3)
if (!is.na(d$truth)) {
abline(v = d$truth, col = "#27ae60", lwd = 2.5, lty = 2)
leg_labels <- c(leg_labels,
paste0("True value: ", round(d$truth, 3)))
leg_cols <- c(leg_cols, "#27ae60")
leg_ltys <- c(leg_ltys, 2)
}
legend("topright", bty = "n", cex = 0.8,
legend = leg_labels,
col = leg_cols,
lty = leg_ltys, lwd = 2)
})
output$compare_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
# Overlay: data distribution (wide) vs bootstrap distribution (narrow)
dens_data <- density(d$x)
dens_boot <- density(d$boot_stats)
# Normalize both to peak at 1 for visual comparison
dens_data$y <- dens_data$y / max(dens_data$y)
dens_boot$y <- dens_boot$y / max(dens_boot$y)
xlim <- range(c(dens_data$x, dens_boot$x))
if (d$stat %in% c("Correlation", "Regression slope")) {
# For bivariate stats, only show bootstrap dist
# (data dist of X alone is not meaningful here)
plot(dens_boot, col = "#3498db", lwd = 2.5,
main = "Bootstrap sampling distribution",
xlab = paste0("Bootstrap ", tolower(d$stat)),
ylab = "Scaled density")
polygon(dens_boot$x, dens_boot$y,
col = adjustcolor("#3498db", 0.15), border = NA)
abline(v = d$boot_ci, col = "#9b59b6", lwd = 2, lty = 3)
abline(v = d$obs_stat, col = "#e74c3c", lwd = 2, lty = 2)
if (!is.na(d$truth)) {
abline(v = d$truth, col = "#27ae60", lwd = 2.5, lty = 2)
}
} else {
plot(dens_data, col = "#e74c3c", lwd = 2.5, xlim = xlim,
ylim = c(0, 1.25), main = "Data vs Sampling Distribution",
xlab = "Value", ylab = "Scaled density")
polygon(dens_data$x, dens_data$y,
col = adjustcolor("#e74c3c", 0.15), border = NA)
lines(dens_boot, col = "#3498db", lwd = 2.5)
polygon(dens_boot$x, dens_boot$y,
col = adjustcolor("#3498db", 0.15), border = NA)
legend("topright", bty = "n", cex = 0.8,
legend = c("Your data (wide, raw obs.)",
paste0("Sampling dist. of ", tolower(d$stat),
" (narrow)")),
col = c("#e74c3c", "#3498db"), lwd = 2.5)
}
})
output$results <- renderUI({
d <- dat()
# Coverage line
cov_html <- if (!is.na(d$covered)) {
cov_col <- if (d$covered) "#27ae60" else "#e74c3c"
cov_txt <- if (d$covered) "YES" else "NO"
paste0("<b>True value:</b> ", round(d$truth, 4), "<br>",
"<b style='color:", cov_col, "'>Covered: ",
cov_txt, "</b><br>")
} else ""
tags$div(class = "stats-box",
HTML(paste0(
"<b>Statistic:</b> ", d$stat, "<br>",
"<b>Observed:</b> ", round(d$obs_stat, 4), "<br>",
"<b>Bootstrap SE:</b> ", round(d$boot_se, 4), "<br>",
"<b>95% CI:</b> [", round(d$boot_ci[1], 4),
", ", round(d$boot_ci[2], 4), "]<br>",
cov_html,
"<hr style='margin:8px 0'>",
"<small>CI uses the <b>percentile method</b><br>",
"(2.5th & 97.5th percentiles).<br>",
"The BCa (bias-corrected and<br>",
"accelerated) method is generally<br>",
"preferred in practice.</small>"
))
)
})
}
shinyApp(ui, server)
Things to try
- Mean with Normal population: the bootstrap distribution looks normal, similar to what CLT gives you analytically.
- Median with skewed population: no easy formula for the SE of a median. But the bootstrap handles it effortlessly.
- SD with heavy-tailed data: the bootstrap distribution is skewed — the percentile CI is asymmetric. For skewed statistics like the SD, the BCa (bias-corrected and accelerated) interval generally performs better than the simple percentile method shown here.
- Regression slope: bootstrap the slope to get a CI without assuming homoskedasticity.
- Small n (10) vs large n (200): with small samples the bootstrap distribution is choppy (limited unique resamples). With more data it smooths out.
When to use the bootstrap
| Use bootstrap when… | Don’t bother when… |
|---|---|
| No formula for the SE exists | Standard formulas are available and valid |
| The statistic is complicated (ratios, quantiles) | You’re computing a simple mean |
| You suspect non-normality | n is large and CLT applies |
| You want a quick, assumption-free CI | You need exact small-sample inference |
The bootstrap is not magic — it can fail with very small samples or non-smooth statistics. But for most practical situations, it’s the easiest path to a valid confidence interval.
Did you know?
- Bradley Efron invented the bootstrap in 1979 at Stanford. The name comes from the expression “pulling yourself up by your bootstraps” — attributed to the tall tales of Baron Munchausen, who claimed to have pulled himself out of a swamp by his own hair (later versions say bootstraps). The idea that a sample can estimate its own sampling distribution seemed equally absurd at first.
- When Efron first presented the bootstrap, many statisticians were skeptical. The common objection was essentially: if you’re just sampling from your own sample, how can that tell you anything new? The answer: it tells you about the variability of your estimate, not about new data points.
- The bootstrap is now one of the most cited statistical methods ever. Efron’s 1979 paper has over 20,000 citations. He received the International Prize in Statistics (the statistics equivalent of the Nobel) in 2019 for this work.
- Before the bootstrap, getting standard errors for complex statistics (ratios, quantiles, eigenvalues) required either painful analytical derivations or the delta method — a Taylor expansion trick that often required heroic assumptions. The bootstrap made all of that unnecessary.