The Bootstrap

The idea

You want a confidence interval for some statistic (mean, median, regression coefficient), but you don’t know the sampling distribution. Maybe the formula is complicated. Maybe there is no formula.

The bootstrap says: pretend your sample is the population. Then simulate the sampling process by resampling with replacement from your data, over and over. Each resample gives you a new estimate. The distribution of those estimates approximates the true sampling distribution.

The steps

  1. You have a sample of \(n\) observations.
  2. Draw \(n\) observations with replacement from your sample (some points get picked twice, some not at all).
  3. Compute your statistic on this resample.
  4. Repeat B times (typically 1,000–10,000).
  5. The spread of those B estimates gives you a standard error and a confidence interval.

That’s it. No formulas, no parametric distributional assumptions. Just resampling.

The Oracle View. In these simulations, we know the true population and the true parameter — so we can check whether the bootstrap CIs actually cover the truth. Look for the green dashed line in the bootstrap distribution plot: that’s the true parameter value. The sidebar tells you whether the CI covers it. In practice, the bootstrap is all you have: you resample from your one sample and trust that it approximates the population well enough.

But can’t I just look at my data?

No — and this is a crucial distinction. There are two different distributions at play:

  • The distribution of the data = what individual observations look like (their histogram). This tells you about spread, skew, outliers in your raw data.
  • The sampling distribution = what your statistic (mean, median, slope) would look like if you could repeat the entire experiment many times. This is what you need for standard errors and confidence intervals.

You can always plot your data. But you cannot see the sampling distribution — you only ran the experiment once. You have one mean, not a distribution of means.

The bootstrap bridges that gap. It simulates the “what if I repeated this experiment?” question using only the data you have. The left panel below shows your data (a histogram for univariate statistics, a scatter plot for correlation/regression). The right panel shows the bootstrap sampling distribution. Notice: they look completely different. Your data might be skewed and spread out, but the sampling distribution of the mean is narrow and roughly normal (thanks to the CLT — though this narrowing applies specifically to the mean; for other statistics like the median or SD, the bootstrap distribution can remain skewed). The bootstrap discovers this for you without any formulas.

#| standalone: true
#| viewerHeight: 900

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      selectInput("stat", "Statistic to bootstrap:",
                  choices = c("Mean", "Median", "SD",
                              "Correlation", "Regression slope")),

      selectInput("pop", "Population shape:",
                  choices = c("Normal", "Skewed (exponential)",
                              "Heavy-tailed", "Bimodal")),

      sliderInput("n", "Sample size:",
                  min = 10, max = 200, value = 30, step = 10),

      sliderInput("B", "Bootstrap resamples (B):",
                  min = 100, max = 5000, value = 1000, step = 100),

      actionButton("go", "New sample + bootstrap",
                   class = "btn-primary", width = "100%"),

      uiOutput("results")
    ),

    mainPanel(
      width = 9,
      fluidRow(
        column(6, plotOutput("sample_plot", height = "350px")),
        column(6, plotOutput("boot_plot", height = "350px"))
      ),
      fluidRow(
        column(12, plotOutput("compare_plot", height = "350px"))
      )
    )
  )
)

server <- function(input, output, session) {

  draw_pop <- function(n, pop) {
    switch(pop,
      "Normal" = rnorm(n, mean = 5, sd = 2),
      "Skewed (exponential)" = rexp(n, rate = 0.5),
      "Heavy-tailed" = rt(n, df = 3) * 2 + 5,
      "Bimodal" = {
        k <- rbinom(n, 1, 0.5)
        k * rnorm(n, 2, 0.8) + (1 - k) * rnorm(n, 7, 0.8)
      }
    )
  }

  true_param <- function(stat, pop) {
    switch(stat,
      "Mean" = switch(pop,
        "Normal" = 5,
        "Skewed (exponential)" = 1 / 0.5,
        "Heavy-tailed" = 5,
        "Bimodal" = 0.5 * 2 + 0.5 * 7
      ),
      "Median" = switch(pop,
        "Normal" = 5,
        "Skewed (exponential)" = log(2) / 0.5,
        "Heavy-tailed" = 5,
        "Bimodal" = 4.5
      ),
      "SD" = switch(pop,
        "Normal" = 2,
        "Skewed (exponential)" = 1 / 0.5,
        "Heavy-tailed" = NA_real_,
        "Bimodal" = sqrt(0.5 * (0.8^2 + 2^2) + 0.5 * (0.8^2 + 7^2)
                        - (0.5 * 2 + 0.5 * 7)^2)
      ),
      "Correlation" = {
        # y = 1 + 0.8*x + N(0,2), so cor = 0.8*sd(x) / sqrt(0.64*var(x)+4)
        var_x <- switch(pop,
          "Normal" = 4,
          "Skewed (exponential)" = 4,
          "Heavy-tailed" = NA_real_,  # t(3) variance infinite
          "Bimodal" = 0.5*(0.8^2+2^2) + 0.5*(0.8^2+7^2) - (0.5*2+0.5*7)^2
        )
        if (is.na(var_x)) NA_real_
        else 0.8 * sqrt(var_x) / sqrt(0.64 * var_x + 4)
      },
      "Regression slope" = 0.8
    )
  }

  compute_stat <- function(x, y = NULL, stat) {
    switch(stat,
      "Mean" = mean(x),
      "Median" = median(x),
      "SD" = sd(x),
      "Correlation" = cor(x, y),
      "Regression slope" = coef(lm(y ~ x))[2]
    )
  }

  dat <- reactive({
    input$go
    n    <- input$n
    B    <- input$B
    stat <- input$stat
    pop  <- input$pop

    # Cap B for regression to keep responsiveness
    B_eff <- if (stat %in% c("Regression slope")) min(B, 2000) else B

    x <- draw_pop(n, pop)
    y <- NULL
    if (stat %in% c("Correlation", "Regression slope")) {
      y <- 1 + 0.8 * x + rnorm(n, sd = 2)
    }

    obs_stat <- compute_stat(x, y, stat)
    truth <- true_param(stat, pop)

    # Bootstrap
    boot_stats <- replicate(B_eff, {
      idx <- sample(n, n, replace = TRUE)
      x_b <- x[idx]
      y_b <- if (!is.null(y)) y[idx] else NULL
      compute_stat(x_b, y_b, stat)
    })

    boot_se <- sd(boot_stats)
    boot_ci <- quantile(boot_stats, c(0.025, 0.975))
    covered <- if (!is.na(truth)) {
      truth >= boot_ci[1] && truth <= boot_ci[2]
    } else NA

    list(x = x, y = y, obs_stat = obs_stat,
         boot_stats = boot_stats, boot_se = boot_se,
         boot_ci = boot_ci, stat = stat, n = n, B = B_eff,
         truth = truth, covered = covered)
  })

  output$sample_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    if (d$stat %in% c("Correlation", "Regression slope")) {
      plot(d$x, d$y, pch = 16, col = "#3498db80", cex = 0.8,
           xlab = "X", ylab = "Y", main = "Your sample")
      if (d$stat == "Regression slope") {
        abline(lm(d$y ~ d$x), col = "#e74c3c", lwd = 2)
      }
    } else {
      hist(d$x, breaks = 20, col = "#d5e8d4", border = "#82b366",
           main = "Your sample", xlab = "X", ylab = "Frequency")
      abline(v = d$obs_stat, col = "#e74c3c", lwd = 2.5, lty = 2)
      legend("topright", bty = "n", cex = 0.85,
             legend = paste0(d$stat, " = ", round(d$obs_stat, 3)),
             col = "#e74c3c", lty = 2, lwd = 2)
    }
  })

  output$boot_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    # Title includes coverage verdict
    cov_label <- if (!is.na(d$covered)) {
      if (d$covered) " — Covered: YES" else " — Covered: NO"
    } else ""
    main_title <- paste0("Bootstrap distribution (B = ", d$B, ")", cov_label)

    hist(d$boot_stats, breaks = 40, probability = TRUE,
         col = "#dae8fc", border = "#6c8ebf",
         main = main_title,
         xlab = paste0("Bootstrap ", tolower(d$stat)),
         ylab = "Density")

    abline(v = d$obs_stat, col = "#e74c3c", lwd = 2.5, lty = 2)
    abline(v = d$boot_ci, col = "#9b59b6", lwd = 2, lty = 3)

    # True parameter (Oracle View)
    leg_labels <- c(paste0("Observed: ", round(d$obs_stat, 3)),
                    paste0("95% CI: [", round(d$boot_ci[1], 3),
                           ", ", round(d$boot_ci[2], 3), "]"))
    leg_cols <- c("#e74c3c", "#9b59b6")
    leg_ltys <- c(2, 3)

    if (!is.na(d$truth)) {
      abline(v = d$truth, col = "#27ae60", lwd = 2.5, lty = 2)
      leg_labels <- c(leg_labels,
                      paste0("True value: ", round(d$truth, 3)))
      leg_cols <- c(leg_cols, "#27ae60")
      leg_ltys <- c(leg_ltys, 2)
    }

    legend("topright", bty = "n", cex = 0.8,
           legend = leg_labels,
           col = leg_cols,
           lty = leg_ltys, lwd = 2)
  })

  output$compare_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    # Overlay: data distribution (wide) vs bootstrap distribution (narrow)
    dens_data <- density(d$x)
    dens_boot <- density(d$boot_stats)

    # Normalize both to peak at 1 for visual comparison
    dens_data$y <- dens_data$y / max(dens_data$y)
    dens_boot$y <- dens_boot$y / max(dens_boot$y)

    xlim <- range(c(dens_data$x, dens_boot$x))

    if (d$stat %in% c("Correlation", "Regression slope")) {
      # For bivariate stats, only show bootstrap dist
      # (data dist of X alone is not meaningful here)
      plot(dens_boot, col = "#3498db", lwd = 2.5,
           main = "Bootstrap sampling distribution",
           xlab = paste0("Bootstrap ", tolower(d$stat)),
           ylab = "Scaled density")
      polygon(dens_boot$x, dens_boot$y,
              col = adjustcolor("#3498db", 0.15), border = NA)
      abline(v = d$boot_ci, col = "#9b59b6", lwd = 2, lty = 3)
      abline(v = d$obs_stat, col = "#e74c3c", lwd = 2, lty = 2)
      if (!is.na(d$truth)) {
        abline(v = d$truth, col = "#27ae60", lwd = 2.5, lty = 2)
      }
    } else {
      plot(dens_data, col = "#e74c3c", lwd = 2.5, xlim = xlim,
           ylim = c(0, 1.25), main = "Data vs Sampling Distribution",
           xlab = "Value", ylab = "Scaled density")
      polygon(dens_data$x, dens_data$y,
              col = adjustcolor("#e74c3c", 0.15), border = NA)

      lines(dens_boot, col = "#3498db", lwd = 2.5)
      polygon(dens_boot$x, dens_boot$y,
              col = adjustcolor("#3498db", 0.15), border = NA)

      legend("topright", bty = "n", cex = 0.8,
             legend = c("Your data (wide, raw obs.)",
                        paste0("Sampling dist. of ", tolower(d$stat),
                               " (narrow)")),
             col = c("#e74c3c", "#3498db"), lwd = 2.5)
    }
  })

  output$results <- renderUI({
    d <- dat()

    # Coverage line
    cov_html <- if (!is.na(d$covered)) {
      cov_col <- if (d$covered) "#27ae60" else "#e74c3c"
      cov_txt <- if (d$covered) "YES" else "NO"
      paste0("<b>True value:</b> ", round(d$truth, 4), "<br>",
             "<b style='color:", cov_col, "'>Covered: ",
             cov_txt, "</b><br>")
    } else ""

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>Statistic:</b> ", d$stat, "<br>",
        "<b>Observed:</b> ", round(d$obs_stat, 4), "<br>",
        "<b>Bootstrap SE:</b> ", round(d$boot_se, 4), "<br>",
        "<b>95% CI:</b> [", round(d$boot_ci[1], 4),
        ", ", round(d$boot_ci[2], 4), "]<br>",
        cov_html,
        "<hr style='margin:8px 0'>",
        "<small>CI uses the <b>percentile method</b><br>",
        "(2.5th & 97.5th percentiles).<br>",
        "The BCa (bias-corrected and<br>",
        "accelerated) method is generally<br>",
        "preferred in practice.</small>"
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • Mean with Normal population: the bootstrap distribution looks normal, similar to what CLT gives you analytically.
  • Median with skewed population: no easy formula for the SE of a median. But the bootstrap handles it effortlessly.
  • SD with heavy-tailed data: the bootstrap distribution is skewed — the percentile CI is asymmetric. For skewed statistics like the SD, the BCa (bias-corrected and accelerated) interval generally performs better than the simple percentile method shown here.
  • Regression slope: bootstrap the slope to get a CI without assuming homoskedasticity.
  • Small n (10) vs large n (200): with small samples the bootstrap distribution is choppy (limited unique resamples). With more data it smooths out.

When to use the bootstrap

Use bootstrap when… Don’t bother when…
No formula for the SE exists Standard formulas are available and valid
The statistic is complicated (ratios, quantiles) You’re computing a simple mean
You suspect non-normality n is large and CLT applies
You want a quick, assumption-free CI You need exact small-sample inference

The bootstrap is not magic — it can fail with very small samples or non-smooth statistics. But for most practical situations, it’s the easiest path to a valid confidence interval.


Did you know?

  • Bradley Efron invented the bootstrap in 1979 at Stanford. The name comes from the expression “pulling yourself up by your bootstraps” — attributed to the tall tales of Baron Munchausen, who claimed to have pulled himself out of a swamp by his own hair (later versions say bootstraps). The idea that a sample can estimate its own sampling distribution seemed equally absurd at first.
  • When Efron first presented the bootstrap, many statisticians were skeptical. The common objection was essentially: if you’re just sampling from your own sample, how can that tell you anything new? The answer: it tells you about the variability of your estimate, not about new data points.
  • The bootstrap is now one of the most cited statistical methods ever. Efron’s 1979 paper has over 20,000 citations. He received the International Prize in Statistics (the statistics equivalent of the Nobel) in 2019 for this work.
  • Before the bootstrap, getting standard errors for complex statistics (ratios, quantiles, eigenvalues) required either painful analytical derivations or the delta method — a Taylor expansion trick that often required heroic assumptions. The bootstrap made all of that unnecessary.