The Bootstrap
The idea
You want a confidence interval for some statistic (mean, median, regression coefficient), but you don’t know the sampling distribution. Maybe the formula is complicated. Maybe there is no formula.
The bootstrap says: pretend your sample is the population. Then simulate the sampling process by resampling with replacement from your data, over and over. Each resample gives you a new estimate. The distribution of those estimates approximates the true sampling distribution.
The steps
- You have a sample of \(n\) observations.
- Draw \(n\) observations with replacement from your sample (some points get picked twice, some not at all).
- Compute your statistic on this resample.
- Repeat B times (typically 1,000–10,000).
- The spread of those B estimates gives you a standard error and a confidence interval.
That’s it. No formulas, no distributional assumptions. Just resampling.
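The recipe above is only a few lines of base R. Here is a standalone sketch for the median of a simulated skewed sample (the data, seed, and choice of B here are illustrative):

```r
set.seed(42)
x <- rexp(30, rate = 0.5)   # one skewed sample of n = 30 observations
B <- 2000

# Resample with replacement B times; compute the statistic on each resample
boot_medians <- replicate(B, median(sample(x, replace = TRUE)))

sd(boot_medians)                          # bootstrap standard error
quantile(boot_medians, c(0.025, 0.975))   # percentile 95% CI
```

Note that `sample(x, replace = TRUE)` defaults to drawing `length(x)` values, which is exactly the resample size the bootstrap calls for.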
But can’t I just look at my data?
No — and this is a crucial distinction. There are two different distributions at play:
- The distribution of the data = what individual observations look like (their histogram). This tells you about spread, skew, outliers in your raw data.
- The sampling distribution = what your statistic (mean, median, slope) would look like if you could repeat the entire experiment many times. This is what you need for standard errors and confidence intervals.
You can always plot your data. But you cannot see the sampling distribution — you only ran the experiment once. You have one mean, not a distribution of means.
The bootstrap bridges that gap. It simulates the “what if I repeated this experiment?” question using only the data you have. The left panel below shows your data distribution. The right panel shows the bootstrap sampling distribution. Notice: they look completely different. Your data might be skewed and spread out, but the sampling distribution of the mean is narrow and roughly normal (thanks to the CLT). The bootstrap discovers this for you without any formulas.
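The same contrast can be checked numerically. For the mean, the bootstrap standard error should land close to the textbook formula \(s/\sqrt{n}\), even when the raw data are skewed (simulated data, for illustration):

```r
set.seed(1)
x <- rexp(100, rate = 0.5)     # skewed raw data, n = 100

# Bootstrap sampling distribution of the mean
boot_means <- replicate(5000, mean(sample(x, replace = TRUE)))

sd(x)                          # spread of the data themselves (wide)
sd(boot_means)                 # spread of the sampling distribution (narrow)
sd(x) / sqrt(length(x))        # CLT formula: should match sd(boot_means)
```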
#| standalone: true
#| viewerHeight: 900
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
selectInput("stat", "Statistic to bootstrap:",
choices = c("Mean", "Median", "SD",
"Correlation", "Regression slope")),
selectInput("pop", "Population shape:",
choices = c("Normal", "Skewed (exponential)",
"Heavy-tailed", "Bimodal")),
sliderInput("n", "Sample size:",
min = 10, max = 200, value = 30, step = 10),
sliderInput("B", "Bootstrap resamples (B):",
min = 100, max = 5000, value = 1000, step = 100),
actionButton("go", "New sample + bootstrap",
class = "btn-primary", width = "100%"),
uiOutput("results")
),
mainPanel(
width = 9,
fluidRow(
column(6, plotOutput("sample_plot", height = "350px")),
column(6, plotOutput("boot_plot", height = "350px"))
),
fluidRow(
column(12, plotOutput("compare_plot", height = "350px"))
)
)
)
)
server <- function(input, output, session) {
draw_pop <- function(n, pop) {
switch(pop,
"Normal" = rnorm(n, mean = 5, sd = 2),
"Skewed (exponential)" = rexp(n, rate = 0.5),
"Heavy-tailed" = rt(n, df = 3) * 2 + 5,
"Bimodal" = {
k <- rbinom(n, 1, 0.5)
k * rnorm(n, 2, 0.8) + (1 - k) * rnorm(n, 7, 0.8)
}
)
}
compute_stat <- function(x, y = NULL, stat) {
switch(stat,
"Mean" = mean(x),
"Median" = median(x),
"SD" = sd(x),
"Correlation" = cor(x, y),
"Regression slope" = coef(lm(y ~ x))[2]
)
}
dat <- reactive({
input$go
n <- input$n
B <- input$B
stat <- input$stat
pop <- input$pop
x <- draw_pop(n, pop)
y <- NULL
if (stat %in% c("Correlation", "Regression slope")) {
y <- 1 + 0.8 * x + rnorm(n, sd = 2)
}
obs_stat <- compute_stat(x, y, stat)
# Bootstrap
boot_stats <- replicate(B, {
idx <- sample(n, n, replace = TRUE)
x_b <- x[idx]
y_b <- if (!is.null(y)) y[idx] else NULL
compute_stat(x_b, y_b, stat)
})
boot_se <- sd(boot_stats)
boot_ci <- quantile(boot_stats, c(0.025, 0.975))
list(x = x, y = y, obs_stat = obs_stat,
boot_stats = boot_stats, boot_se = boot_se,
boot_ci = boot_ci, stat = stat, n = n, B = B)
})
output$sample_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
if (d$stat %in% c("Correlation", "Regression slope")) {
plot(d$x, d$y, pch = 16, col = "#3498db80", cex = 0.8,
xlab = "X", ylab = "Y", main = "Your sample")
if (d$stat == "Regression slope") {
abline(lm(d$y ~ d$x), col = "#e74c3c", lwd = 2)
}
} else {
hist(d$x, breaks = 20, col = "#d5e8d4", border = "#82b366",
main = "Your sample", xlab = "X", ylab = "Frequency")
abline(v = d$obs_stat, col = "#e74c3c", lwd = 2.5, lty = 2)
legend("topright", bty = "n", cex = 0.85,
legend = paste0(d$stat, " = ", round(d$obs_stat, 3)),
col = "#e74c3c", lty = 2, lwd = 2)
}
})
output$boot_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
hist(d$boot_stats, breaks = 40, probability = TRUE,
col = "#dae8fc", border = "#6c8ebf",
main = paste0("Bootstrap distribution (B = ", d$B, ")"),
xlab = paste0("Bootstrap ", tolower(d$stat)),
ylab = "Density")
abline(v = d$obs_stat, col = "#e74c3c", lwd = 2.5, lty = 2)
abline(v = d$boot_ci, col = "#27ae60", lwd = 2, lty = 3)
legend("topright", bty = "n", cex = 0.8,
legend = c(paste0("Observed: ", round(d$obs_stat, 3)),
paste0("95% CI: [", round(d$boot_ci[1], 3),
", ", round(d$boot_ci[2], 3), "]")),
col = c("#e74c3c", "#27ae60"),
lty = c(2, 3), lwd = 2)
})
output$compare_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
if (d$stat %in% c("Correlation", "Regression slope")) {
# For bivariate stats, just show bootstrap dist with CI
hist(d$boot_stats, breaks = 40, probability = TRUE,
col = "#dae8fc", border = "#6c8ebf",
main = "Bootstrap sampling dist.",
xlab = paste0("Bootstrap ", tolower(d$stat)),
ylab = "Density")
abline(v = d$boot_ci, col = "#27ae60", lwd = 2, lty = 3)
abline(v = d$obs_stat, col = "#e74c3c", lwd = 2, lty = 2)
} else {
# Overlay: data distribution (wide) vs bootstrap distribution (narrow)
# Rescale data density to fit on same axes
dens_data <- density(d$x)
dens_boot <- density(d$boot_stats)
# Normalize both to peak at 1 for visual comparison
dens_data$y <- dens_data$y / max(dens_data$y)
dens_boot$y <- dens_boot$y / max(dens_boot$y)
xlim <- range(c(dens_data$x, dens_boot$x))
plot(dens_data, col = "#e74c3c", lwd = 2.5, xlim = xlim,
ylim = c(0, 1.25), main = "Data vs Sampling Distribution",
xlab = "Value", ylab = "Scaled density")
polygon(dens_data$x, dens_data$y,
col = adjustcolor("#e74c3c", 0.15), border = NA)
lines(dens_boot, col = "#3498db", lwd = 2.5)
polygon(dens_boot$x, dens_boot$y,
col = adjustcolor("#3498db", 0.15), border = NA)
legend("topright", bty = "n", cex = 0.8,
legend = c("Your data (wide, raw obs.)",
paste0("Sampling dist. of ", tolower(d$stat),
" (narrow)")),
col = c("#e74c3c", "#3498db"), lwd = 2.5)
}
})
output$results <- renderUI({
d <- dat()
tags$div(class = "stats-box",
HTML(paste0(
"<b>Statistic:</b> ", d$stat, "<br>",
"<b>Observed:</b> ", round(d$obs_stat, 4), "<br>",
"<b>Bootstrap SE:</b> ", round(d$boot_se, 4), "<br>",
"<b>95% CI:</b> [", round(d$boot_ci[1], 4),
", ", round(d$boot_ci[2], 4), "]<br>",
"<hr style='margin:8px 0'>",
"<small>CI uses percentile method:<br>",
"2.5th and 97.5th percentiles of<br>",
"bootstrap distribution.</small>"
))
)
})
}
shinyApp(ui, server)
Things to try
- Mean with Normal population: the bootstrap distribution looks normal, matching what the CLT gives you analytically.
- Median with skewed population: no easy formula for the SE of a median. But the bootstrap handles it effortlessly.
- SD with heavy-tailed data: the bootstrap distribution is skewed — the percentile CI is asymmetric, which is exactly right.
- Regression slope: bootstrap the slope to get a CI without assuming normal errors or homoskedasticity.
- Small n (10) vs large n (200): with small samples the bootstrap distribution is choppy (limited unique resamples). With more data it smooths out.
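As one worked example from the list, here is the regression slope bootstrapped by resampling \((x, y)\) pairs together (case resampling; the simulated data-generating model mirrors the app's, with true slope 0.8):

```r
set.seed(7)
n <- 50
x <- rnorm(n)
y <- 1 + 0.8 * x + rnorm(n, sd = 2)    # true slope is 0.8

boot_slopes <- replicate(2000, {
  idx <- sample(n, replace = TRUE)     # resample row indices, keeping pairs intact
  coef(lm(y[idx] ~ x[idx]))[2]
})

obs_slope <- coef(lm(y ~ x))[2]
quantile(boot_slopes, c(0.025, 0.975))  # percentile 95% CI for the slope
```

Resampling whole rows is what frees you from error-distribution assumptions: the bootstrap sees whatever heteroskedasticity or non-normality is in the data.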
When to use the bootstrap
| Use bootstrap when… | Don’t bother when… |
|---|---|
| No formula for the SE exists | Standard formulas are available and valid |
| The statistic is complicated (ratios, quantiles) | You’re computing a simple mean |
| You suspect non-normality | n is large and CLT applies |
| You want a quick, assumption-free CI | You need exact small-sample inference |
The bootstrap is not magic — it can fail with very small samples or non-smooth statistics. But for most practical situations, it’s the easiest path to a valid confidence interval.
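A classic example of a non-smooth statistic where the bootstrap breaks down is the sample maximum: each resample either includes the largest observation or it doesn't, so the bootstrap distribution piles up on a single point rather than spreading smoothly. About \(1 - (1 - 1/n)^n \approx 63\%\) of resamples reproduce the sample max exactly (simulated data, for illustration):

```r
set.seed(3)
x <- runif(100)    # true upper bound of the support is 1
boot_maxes <- replicate(2000, max(sample(x, replace = TRUE)))

mean(boot_maxes == max(x))   # roughly 0.63: the distribution is degenerate
```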
Did you know?
- Bradley Efron invented the bootstrap in 1979 at Stanford. The name comes from the expression “pulling yourself up by your bootstraps” — attributed to the tall tales of Baron Munchausen, who claimed to have pulled himself out of a swamp by his own hair (later versions say bootstraps). The idea that a sample can estimate its own sampling distribution seemed equally absurd at first.
- When Efron first presented the bootstrap, many statisticians were skeptical. One famously asked: “You’re just sampling from your sample — how can that tell you anything new?” The answer: it tells you about the variability of your estimate, not about new data points.
- The bootstrap is now one of the most cited statistical methods ever. Efron’s 1979 paper has over 20,000 citations. He received the International Prize in Statistics (the statistics equivalent of the Nobel) in 2019 for this work.
- Before the bootstrap, getting standard errors for complex statistics (ratios, quantiles, eigenvalues) required either painful analytical derivations or the delta method — a Taylor expansion trick that often required heroic assumptions. The bootstrap made all of that unnecessary.