Bag of Little Bootstraps

The problem with the bootstrap at scale

The standard bootstrap is beautifully simple: resample \(n\) observations with replacement, re-estimate, repeat \(B\) times. But each bootstrap replicate processes all \(n\) observations. When \(n\) is large and estimation is expensive — panel regressions, kernel estimators, ML models — the bootstrap becomes prohibitively slow:

\[\text{Cost}_{\text{bootstrap}} = O(B \cdot n)\]

With \(n = 500{,}000\) and \(B = 1{,}000\), you’re fitting your model 1,000 times on half a million observations each. That can take days.

The BLB idea

The Bag of Little Bootstraps (Kleiner, Talwalkar, Sarkar & Jordan, 2014) gets the same answer at a fraction of the cost. The key insight: instead of bootstrapping the full dataset, subsample small chunks and reweight them to mimic full-sample uncertainty.

The three steps

Step 1 — Subsample. Draw \(S\) small subsamples of size \(b \ll n\) from the data without replacement. A common choice is \(b = n^\gamma\) with \(\gamma \approx 0.6\text{–}0.9\):

\[b = n^\gamma, \quad \gamma \in [0.6, 0.9]\]

For \(n = 500{,}000\) and \(\gamma = 0.7\), you get \(b \approx 10{,}000\) — a 50x reduction.

Step 2 — Reweight, don’t resample. Within each subsample, generate \(R\) sets of multinomial weights that sum to \(n\) (not \(b\)). Each weight vector simulates a bootstrap resample from the full population, but you only need to work with \(b\) distinct observations:

\[\mathbf{w} \sim \text{Multinomial}(n, \; 1/b, \ldots, 1/b)\]

Estimate the parameter using weighted estimation for each weight draw. This gives \(R\) estimates per subsample.
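In R, one such weight draw is a single `rmultinom` call. The weights always sum to exactly \(n\), even though only \(b\) distinct observations carry them:

```r
set.seed(1)
n <- 500000   # full-sample size
b <- 10000    # subsample size

# One BLB weight vector: b cells, total count n
w <- as.numeric(rmultinom(1, size = n, prob = rep(1 / b, b)))

sum(w)    # exactly n = 500000
mean(w)   # exactly n / b = 50
```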

Step 3 — Combine. For each subsample \(s\), compute the bootstrap variance from its \(R\) reweighted estimates. Average the variances across the \(S\) subsamples:

\[\widehat{\text{Var}}_{\text{BLB}} = \frac{1}{S} \sum_{s=1}^{S} \widehat{\text{Var}}_s\]

The resulting SE is asymptotically equivalent to the full bootstrap SE.
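Stripped of the app scaffolding further below, the whole procedure fits in a dozen lines. Here is a minimal sketch for the mean of a single variable, with the weighted mean standing in for whatever weighted estimator you actually care about:

```r
set.seed(42)
n <- 100000
x <- rnorm(n, mean = 5, sd = 2)

gamma <- 0.7
b <- round(n^gamma)        # subsample size, about 3,200 here
S <- 10; R <- 100

blb_vars <- numeric(S)
for (s in seq_len(S)) {
  x_sub <- x[sample(n, b)]                           # Step 1: subsample
  ests <- replicate(R, {
    w <- as.numeric(rmultinom(1, n, rep(1 / b, b)))  # Step 2: reweight
    sum(w * x_sub) / n                               # weighted mean
  })
  blb_vars[s] <- var(ests)                           # per-subsample variance
}
blb_se <- sqrt(mean(blb_vars))                       # Step 3: combine
blb_se   # should land near the analytic SE, 2 / sqrt(n) ~= 0.0063
```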

Why the reweighting works

The multinomial weights summing to \(n\) are crucial. If you just bootstrapped the subsample of size \(b\), you’d estimate uncertainty for a sample of size \(b\) — too much variance. By inflating the weights to sum to \(n\), each reweighted estimate behaves as if it came from a full-size bootstrap resample. The subsample provides the right support (which data points can appear); the weights provide the right scale (how much uncertainty a sample of size \(n\) should have).
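A quick check makes the scale argument concrete. Naively bootstrapping the size-\(b\) subsample recovers the uncertainty of a size-\(b\) sample, roughly \(\sqrt{n/b}\) times too large; the inflated weights recover the size-\(n\) uncertainty:

```r
set.seed(7)
n <- 100000; b <- 1000
x <- rnorm(n, sd = 2)
x_sub <- x[sample(n, b)]

# Naive: ordinary bootstrap of the subsample (wrong scale)
naive <- replicate(500, mean(sample(x_sub, b, replace = TRUE)))

# BLB-style: multinomial weights summing to n (right scale)
blb <- replicate(500, {
  w <- as.numeric(rmultinom(1, n, rep(1 / b, b)))
  sum(w * x_sub) / n
})

sd(naive)  # near 2 / sqrt(b) ~= 0.063 — uncertainty of a size-b sample
sd(blb)    # near 2 / sqrt(n) ~= 0.0063 — uncertainty of a size-n sample
```

The ratio of the two SEs is roughly \(\sqrt{n/b} = 10\) here.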

Cost comparison

| Method    | Replications  | Obs per replication | Total cost               |
|-----------|---------------|---------------------|--------------------------|
| Bootstrap | \(B\)         | \(n\)               | \(O(B \cdot n)\)         |
| BLB       | \(S \cdot R\) | \(b\)               | \(O(S \cdot R \cdot b)\) |

Since \(b \ll n\), BLB is dramatically faster. With \(n = 500{,}000\), \(b = 10{,}000\), \(S = 10\), \(R = 100\): bootstrap does \(1{,}000 \times 500{,}000 = 5 \times 10^8\) observation-fits; BLB does \(10 \times 100 \times 10{,}000 = 10^7\) — a 50x speedup.

The Oracle View. In this simulation, we know the true parameter and can compute exact bootstrap SEs by brute force. We compare BLB’s SE estimate against the full bootstrap SE and the true sampling distribution. In practice, you’d use BLB because the full bootstrap is too expensive — you trust the theory that they converge.

Simulation

Compare standard bootstrap vs BLB. Left: both SE estimates as \(n\) grows (they converge). Right: computation cost — bootstrap scales with \(n\), BLB stays manageable.

#| standalone: true
#| viewerHeight: 750

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .eq-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-bottom: 14px; font-size: 14px; line-height: 1.9;
    }
    .eq-box b { color: #2c3e50; }
    .match  { color: #27ae60; font-weight: bold; }
    .coef   { color: #e74c3c; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 4,

      sliderInput("n", "Sample size (n):",
                  min = 500, max = 10000, value = 2000, step = 500),

      sliderInput("gamma", HTML("Subsample exponent (&gamma;):"),
                  min = 0.5, max = 0.9, value = 0.7, step = 0.05),

      sliderInput("S", "Number of subsamples (S):",
                  min = 5, max = 30, value = 10, step = 5),

      sliderInput("R", "Reweights per subsample (R):",
                  min = 50, max = 300, value = 100, step = 50),

      sliderInput("B", "Bootstrap replicates (B):",
                  min = 200, max = 1000, value = 500, step = 100),

      actionButton("resim", "Run comparison",
                   class = "btn-primary", width = "100%"),

      uiOutput("results_box")
    ),

    mainPanel(
      width = 8,
      fluidRow(
        column(6, plotOutput("plot_dist", height = "450px")),
        column(6, plotOutput("plot_cost", height = "450px"))
      ),
      uiOutput("explain_box")
    )
  )
)

server <- function(input, output, session) {

  sim_results <- reactive({
    input$resim  # the button invalidates this reactive
    # isolate() keeps slider changes from re-running the simulation;
    # new settings take effect on the next "Run comparison" click
    n     <- isolate(input$n)
    gamma <- isolate(input$gamma)
    S     <- isolate(input$S)
    R     <- isolate(input$R)
    B     <- isolate(input$B)

    b <- max(10, round(n^gamma))

    # Generate data: Y = 2 + 0.5*X + eps
    x   <- rnorm(n)
    eps <- rnorm(n, sd = 2)
    y   <- 2 + 0.5 * x + eps

    # --- Standard bootstrap ---
    boot_coefs <- numeric(B)
    for (i in seq_len(B)) {
      idx <- sample(n, n, replace = TRUE)
      fit <- lm(y[idx] ~ x[idx])
      boot_coefs[i] <- coef(fit)[2]
    }
    boot_se <- sd(boot_coefs)

    # --- BLB ---
    blb_vars <- numeric(S)
    for (s in seq_len(S)) {
      # Step 1: subsample without replacement
      sub_idx <- sample(n, b, replace = FALSE)
      x_sub <- x[sub_idx]
      y_sub <- y[sub_idx]

      # Step 2: R multinomial reweights
      blb_coefs <- numeric(R)
      for (r in seq_len(R)) {
        # Multinomial weights summing to n
        w <- as.numeric(rmultinom(1, n, rep(1/b, b)))
        # Weighted least squares
        fit_w <- lm(y_sub ~ x_sub, weights = w)
        blb_coefs[r] <- coef(fit_w)[2]
      }
      blb_vars[s] <- var(blb_coefs)
    }
    blb_se <- sqrt(mean(blb_vars))

    # OLS analytic SE for reference
    fit_full <- lm(y ~ x)
    ols_se <- summary(fit_full)$coefficients[2, 2]

    # Cost metrics (observation-fits)
    boot_cost <- B * n
    blb_cost  <- S * R * b

    list(boot_se = boot_se, blb_se = blb_se, ols_se = ols_se,
         boot_coefs = boot_coefs,
         b = b, n = n, S = S, R = R, B = B,
         boot_cost = boot_cost, blb_cost = blb_cost)
  })

  output$plot_dist <- renderPlot({
    d <- sim_results()
    par(mar = c(5, 5, 4, 2))

    hist(d$boot_coefs, breaks = 40,
         col = adjustcolor("#3498db", 0.4), border = "white",
         main = expression("Bootstrap distribution of " * hat(beta)),
         xlab = expression(hat(beta)),
         freq = FALSE)

    # Overlay BLB normal approximation
    x_seq <- seq(min(d$boot_coefs), max(d$boot_coefs),
                 length.out = 200)
    lines(x_seq,
          dnorm(x_seq, mean = 0.5, sd = d$blb_se),
          col = "#27ae60", lwd = 2.5)
    lines(x_seq,
          dnorm(x_seq, mean = 0.5, sd = d$boot_se),
          col = "#3498db", lwd = 2, lty = 2)

    abline(v = 0.5, col = "#e74c3c", lwd = 2)

    legend("topright", bty = "n", cex = 0.85,
           legend = c(
             "Bootstrap histogram",
             paste0("Bootstrap SE = ", round(d$boot_se, 4)),
             paste0("BLB SE = ", round(d$blb_se, 4)),
             expression("True " * beta * " = 0.5")),
           col = c(adjustcolor("#3498db", 0.6),
                   "#3498db", "#27ae60", "#e74c3c"),
           pch = c(15, NA, NA, NA),
           lwd = c(NA, 2, 2.5, 2),
           lty = c(NA, 2, 1, 1),
           pt.cex = 2)
  })

  output$plot_cost <- renderPlot({
    d <- sim_results()
    par(mar = c(5, 5, 4, 2))

    # Show cost comparison across a range of n, using the same gamma
    n_range <- seq(500, 10000, by = 500)
    gamma <- log(d$b) / log(d$n)
    boot_costs <- d$B * n_range
    blb_costs  <- d$S * d$R * round(n_range^gamma)

    ylim <- c(0, max(boot_costs) * 1.1)

    plot(n_range, boot_costs, type = "l", lwd = 2.5,
         col = "#3498db",
         xlab = "Sample size (n)",
         ylab = "Observation-fits",
         main = "Computational cost",
         ylim = ylim)
    lines(n_range, blb_costs, lwd = 2.5, col = "#27ae60")

    # Mark current n
    points(d$n, d$boot_cost, pch = 19, cex = 2, col = "#3498db")
    points(d$n, d$blb_cost, pch = 19, cex = 2, col = "#27ae60")

    # Speedup annotation
    speedup <- round(d$boot_cost / d$blb_cost)
    text(d$n, (d$boot_cost + d$blb_cost) / 2,
         paste0(speedup, "x speedup"),
         cex = 1.1, font = 2, col = "#e74c3c")

    legend("topleft", bty = "n", cex = 0.9,
           legend = c(
             paste0("Bootstrap (B\u00d7n)"),
             paste0("BLB (S\u00d7R\u00d7b)")),
           col = c("#3498db", "#27ae60"),
           lwd = 2.5)
  })

  output$results_box <- renderUI({
    d <- sim_results()

    tags$div(class = "eq-box", style = "margin-top: 16px;",
      HTML(paste0(
        "<b>SE comparison:</b><br>",
        "OLS analytic: ", round(d$ols_se, 4), "<br>",
        "Bootstrap (B=", d$B, "): <span class='coef'>",
        round(d$boot_se, 4), "</span><br>",
        "BLB (S=", d$S, ", R=", d$R, "): <span class='match'>",
        round(d$blb_se, 4), "</span><br>",
        "<hr style='margin:8px 0'>",
        "<b>Subsample size:</b> b = n<sup>",
        round(log(d$b)/log(d$n), 2),
        "</sup> = ", d$b, "<br>",
        "<b>Speedup:</b> ",
        round(d$boot_cost / d$blb_cost), "x<br>",
        "<small>Bootstrap: ",
        format(d$boot_cost, big.mark = ","),
        " obs-fits<br>",
        "BLB: ",
        format(d$blb_cost, big.mark = ","),
        " obs-fits</small>"
      ))
    )
  })

  output$explain_box <- renderUI({
    d <- sim_results()
    tags$div(class = "eq-box", style = "margin-top: 8px;",
      HTML(paste0(
        "<b>Left:</b> Bootstrap sampling distribution (histogram) with ",
        "bootstrap SE (blue dashed) and BLB SE (green solid) overlaid as ",
        "normal densities. They should nearly coincide. ",
        "<b>Right:</b> Computational cost as n grows. Bootstrap scales ",
        "linearly; BLB scales as n<sup>&gamma;</sup> — the gap widens with ",
        "larger data."
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • Small \(n\) (500): bootstrap and BLB give similar SEs, and the speedup is modest (often under 10x). BLB isn’t worth the complexity here.
  • Large \(n\) (8,000+): BLB SE still matches bootstrap, but the cost gap explodes. This is where BLB shines.
  • \(\gamma = 0.5\) (aggressive subsampling): subsample is very small. BLB SE may be noisier — you’re working with less data per subsample.
  • \(\gamma = 0.9\) (conservative subsampling): subsample is nearly the full dataset. BLB SE is very stable but the speedup shrinks.
  • Increase \(S\) (more subsamples): stabilizes the BLB variance estimate across subsamples. Diminishing returns past ~15–20.
  • Increase \(R\) (more reweights): stabilizes within-subsample variance. Cheap since each reweight is on only \(b\) observations.

When to use BLB

| Setting | Bootstrap | BLB |
|---------|-----------|-----|
| \(n < 10{,}000\), fast estimator | Fine | Overkill |
| \(n > 100{,}000\), any estimator | Slow | Use BLB |
| Expensive estimator (kernel, ML, ABM) | Very slow | Use BLB |
| Distributed computing (data on multiple machines) | Awkward | Natural fit — each subsample on one machine |
| Clustered data | Cluster bootstrap | BLB with clustered subsampling |

Rules of thumb:

  • Set \(\gamma \in [0.6, 0.8]\). Lower \(\gamma\) = more speedup, more noise.
  • \(S = 10\text{–}20\) subsamples is usually enough.
  • \(R = 100\text{–}200\) reweights per subsample.
  • The total cost is \(S \times R\) fits on \(b\) observations — plan accordingly.
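Those rules of thumb can be wrapped in a small planning helper (a hypothetical convenience function, not from any package) that reports \(b\), the total cost, and the speedup over a \(B\)-replicate bootstrap:

```r
# Hypothetical helper: plan a BLB run and compare its cost to the bootstrap
blb_plan <- function(n, gamma = 0.7, S = 15, R = 100, B = 1000) {
  b <- round(n^gamma)
  blb_cost  <- S * R * b   # observation-fits for BLB
  boot_cost <- B * n       # observation-fits for the bootstrap
  list(b = b, blb_cost = blb_cost, boot_cost = boot_cost,
       speedup = boot_cost / blb_cost)
}

blb_plan(500000)   # b ~ 9,740, speedup ~ 34x with these defaults
```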

BLB for clustered data

When the data have a cluster structure (panels, spatial data), subsample clusters, not observations: draw \(s\) clusters from the \(G\) total, then put multinomial weights on the sampled clusters (summing to \(G\)), with each cluster's weight applied to all of its observations. This preserves the within-cluster correlation structure that clustered SEs are designed to handle.
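A sketch of one clustered subsample-and-reweight draw. Everything here is illustrative — the toy panel, the variable names, and the choice of \(s\) are assumptions, not prescriptions:

```r
set.seed(3)
# Toy panel: G clusters of 25 observations each, with a cluster effect
G <- 200
g <- rep(seq_len(G), each = 25)               # cluster id per observation
x <- rnorm(length(g))
y <- 2 + 0.5 * x + rnorm(G)[g] + rnorm(length(g))

s_clusters <- 40                               # clusters per subsample
keep <- sample(G, s_clusters)                  # Step 1: subsample whole clusters
in_sub <- g %in% keep

# Step 2: one reweight draw — multinomial over clusters, summing to G,
# so every observation in a cluster shares that cluster's weight
w_cl  <- as.numeric(rmultinom(1, G, rep(1 / s_clusters, s_clusters)))
w_obs <- w_cl[match(g[in_sub], keep)]

fit <- lm(y[in_sub] ~ x[in_sub], weights = w_obs)
coef(fit)[2]   # one reweighted estimate; repeat R times per subsample
```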

This is especially relevant for panel econometrics with large \(n\) and expensive estimators — exactly the setting where both clustering and computational constraints bind simultaneously.


Connections

  • The Bootstrap — BLB produces the same answer as the standard bootstrap, just faster
  • Clustered SEs — BLB can be adapted to preserve cluster structure via clustered subsampling
  • Monte Carlo Experiments — BLB is a way to make Monte Carlo-style inference feasible at scale

Did you know?

  • BLB was introduced by Kleiner, Talwalkar, Sarkar, and Michael I. Jordan in 2014 at UC Berkeley. Jordan is one of the most influential figures in both statistics and machine learning — he supervised Andrew Ng and helped develop variational inference, among many other contributions.
  • The name “Bag of Little Bootstraps” is a deliberate riff on bagging (Bootstrap AGGregatING), Leo Breiman’s 1996 method for reducing variance in prediction. Both use resampling on subsets, but for different purposes: bagging reduces prediction variance, BLB estimates parameter uncertainty.
  • BLB is particularly natural for MapReduce/Spark distributed computing: each subsample lives on one worker node, reweighting is embarrassingly parallel, and only the variance estimates need to be communicated back. This made it popular at tech companies dealing with massive datasets.
  • The subsample size \(b = n^\gamma\) is not arbitrary — the theory requires \(b \to \infty\) and \(b/n \to 0\) for consistency. The power law \(n^\gamma\) with \(\gamma < 1\) satisfies both conditions and gives a clean cost-accuracy tradeoff.