Robust vs Clustered SEs: When Observations Aren’t Independent

The independence assumption you forgot about

OLS assumes that observations are independent. But in many real datasets they aren’t:

  • Students within the same school share teachers and resources
  • Patients within the same hospital get similar care
  • Purchases by the same customer are correlated over time
  • Employees within the same firm face the same management

When observations within a group (cluster) are correlated, OLS standard errors are too small — often dramatically so. Your t-statistics are inflated, your p-values are too small, and you reject the null far more than 5% of the time.

Three types of standard errors

| SE type | What it assumes | When to use |
|---|---|---|
| OLS (classical) | Errors are i.i.d. (independent, constant variance) | Almost never in practice |
| Robust (HC) | Errors can have different variances, but are independent | Cross-sectional data with heteroskedasticity |
| Clustered | Errors can be correlated within clusters | Any grouped/panel data |

Robust SEs fix heteroskedasticity but not within-cluster correlation. Clustered SEs fix both. If your data has clusters, you need clustered SEs.

The Oracle View. In these simulations, we set the cluster structure and the intra-cluster correlation (ICC) — we know exactly how much correlation exists within groups. In practice, you identify clusters from your study design (schools, hospitals, firms) and let the clustered SE estimator handle the rest without knowing the true ICC.

Where does the independence assumption live?

The standard OLS SE formula assumes independent observations. Here is exactly where that assumption appears.

The OLS estimator for \(Y = X\beta + \varepsilon\) is

\[\hat{\beta} = (X'X)^{-1}X'Y = \beta + (X'X)^{-1}X'\varepsilon\]

so the variance of the estimator depends on the variance of the errors:

\[\text{Var}(\hat{\beta}) = (X'X)^{-1} \; X' \, \text{Var}(\varepsilon) \, X \; (X'X)^{-1}\]

This formula is always true. The key question is: what is \(\text{Var}(\varepsilon)\)?

Under independence, classical OLS assumes \(\text{Var}(\varepsilon) = \sigma^2 I\), where the identity matrix \(I\) has zeros on the off-diagonal — meaning \(\text{Cov}(\varepsilon_i, \varepsilon_j) = 0\) for all \(i \neq j\). Plugging this in:

\[\text{Var}(\hat{\beta}) = \sigma^2 (X'X)^{-1}\]

This is the default SE printed in regression output: \(\text{SE}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 [(X'X)^{-1}]_{jj}}\).
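The classical formula is easy to sanity-check numerically. A minimal sketch (simulated data; the variable names are mine) that computes \(\hat{\sigma}^2 [(X'X)^{-1}]_{jj}\) by hand and compares it to the SE that lm() prints:

```r
# Verify the classical OLS SE formula against lm()'s output.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

fit <- lm(y ~ x)
X <- cbind(1, x)                                  # design matrix with intercept
sigma2_hat <- sum(resid(fit)^2) / (n - 2)         # residual variance, k = 2 params
se_by_hand <- sqrt(sigma2_hat * solve(t(X) %*% X)[2, 2])

se_lm <- summary(fit)$coefficients[2, 2]
all.equal(se_by_hand, se_lm)                      # the two agree
```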

When observations are correlated, the true error variance is \(\text{Var}(\varepsilon) = \Omega\), where \(\Omega\) is not diagonal. For example, with three students in the same school and one in a different school, writing \(\rho\) for the common within-school error covariance:

\[\Omega = \begin{bmatrix} \sigma^2 & \rho & \rho & 0 \\ \rho & \sigma^2 & \rho & 0 \\ \rho & \rho & \sigma^2 & 0 \\ 0 & 0 & 0 & \sigma^2 \end{bmatrix}\]

The correct variance is \((X'X)^{-1} X' \Omega X (X'X)^{-1}\), but the OLS formula pretends \(\Omega = \sigma^2 I\) and ignores those off-diagonal terms — so the estimated SE is too small.

Cluster-robust SEs fix this by replacing \(\sigma^2 I\) with an estimate of \(\Omega\):

\[\text{Var}(\hat{\beta}) = (X'X)^{-1} \left( \sum_g X_g' \hat{\varepsilon}_g \hat{\varepsilon}_g' X_g \right) (X'X)^{-1}\]

where \(g\) indexes clusters and \(\hat{\varepsilon}_g\) stacks the residuals for cluster \(g\). This allows \(\text{Cov}(\varepsilon_{it}, \varepsilon_{is}) \neq 0\) within the same cluster; it is exactly the sandwich formula behind Stata's vce(cluster) option and R's sandwich::vcovCL().
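The sandwich is simple enough to compute by hand. A self-contained sketch on simulated grouped data (the setup and names are mine; note that x gets a cluster-level component, because the SE inflation only appears when the regressor is also correlated within clusters):

```r
# Hand-rolled Liang-Zeger clustered SE vs. the classical OLS SE.
set.seed(42)
G <- 40; m <- 25; n <- G * m
cluster_id <- rep(seq_len(G), each = m)
x <- sqrt(0.5) * rep(rnorm(G), each = m) + sqrt(0.5) * rnorm(n)
y <- rep(rnorm(G, sd = 0.6), each = m) + rnorm(n)   # true beta = 0

fit <- lm(y ~ x)
X <- cbind(1, x)
r <- resid(fit)
bread <- solve(crossprod(X))

# Meat: sum over clusters of X_g' e_g e_g' X_g
meat <- matrix(0, 2, 2)
for (g in seq_len(G)) {
  idx <- cluster_id == g
  score_g <- crossprod(X[idx, , drop = FALSE], r[idx])
  meat <- meat + tcrossprod(score_g)
}
adj <- G / (G - 1) * (n - 1) / (n - 2)    # CR1 small-sample correction
cl_se  <- sqrt((adj * bread %*% meat %*% bread)[2, 2])
ols_se <- summary(fit)$coefficients[2, 2]
c(ols = ols_se, clustered = cl_se)        # clustered SE is markedly larger
```

With a purely i.i.d. regressor the two SEs would be close; the gap opens up only when both the errors and the regressor are correlated within clusters.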


Simulation 1: Clustered data and wrong SEs

Generate data with students nested in schools. Within each school, outcomes are correlated (shared school effect). Compare the three types of standard errors.

#| standalone: true
#| viewerHeight: 620

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
    .good { color: #27ae60; font-weight: bold; }
    .bad  { color: #e74c3c; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("n_clusters", "Number of schools:",
                  min = 10, max = 100, value = 30, step = 5),

      sliderInput("cluster_size", "Students per school:",
                  min = 5, max = 50, value = 20, step = 5),

      sliderInput("icc", "ICC (intra-cluster correlation):",
                  min = 0, max = 0.8, value = 0.3, step = 0.05),

      helpText("ICC = share of total variance due to the
               school-level component. Higher = more clustering."),

      actionButton("go", "New draw",
                   class = "btn-primary", width = "100%"),

      uiOutput("results")
    ),

    mainPanel(
      width = 9,
      fluidRow(
        column(6, plotOutput("scatter", height = "400px")),
        column(6, plotOutput("se_compare", height = "400px"))
      )
    )
  )
)

server <- function(input, output, session) {

  dat <- reactive({
    input$go
    G  <- input$n_clusters
    m  <- input$cluster_size
    icc <- input$icc

    # Variance decomposition: total var = 1
    sigma_b <- sqrt(icc)        # between-cluster SD
    sigma_w <- sqrt(1 - icc)    # within-cluster SD

    # Generate clustered data (no true effect: beta = 0).
    # The regressor also gets a cluster-level component: with a purely
    # i.i.d. x, clustered errors would barely inflate Var(beta-hat).
    cluster_id <- rep(seq_len(G), each = m)
    school_effect <- rep(rnorm(G, sd = sigma_b), each = m)
    x <- sqrt(0.5) * rep(rnorm(G), each = m) + sqrt(0.5) * rnorm(G * m)
    y <- 0 * x + school_effect + rnorm(G * m, sd = sigma_w)

    fit <- lm(y ~ x)
    b_hat <- coef(fit)[2]
    n <- G * m

    # OLS SE
    ols_se <- summary(fit)$coefficients[2, 2]

    # Robust SE (HC1)
    r <- resid(fit)
    X <- cbind(1, x)
    bread <- solve(t(X) %*% X)
    meat_hc <- t(X) %*% diag(r^2 * n / (n - 2)) %*% X
    robust_se <- sqrt((bread %*% meat_hc %*% bread)[2, 2])

    # Clustered SE (Liang-Zeger)
    meat_cl <- matrix(0, 2, 2)
    for (g in seq_len(G)) {
      idx <- which(cluster_id == g)
      Xg <- X[idx, , drop = FALSE]
      rg <- r[idx]
      meat_cl <- meat_cl + t(Xg) %*% (rg %*% t(rg)) %*% Xg
    }
    adj <- G / (G - 1) * (n - 1) / (n - 2)
    cl_vcov <- adj * bread %*% meat_cl %*% bread
    clustered_se <- sqrt(cl_vcov[2, 2])

    list(x = x, y = y, cluster_id = cluster_id, fit = fit,
         b_hat = b_hat, ols_se = ols_se, robust_se = robust_se,
         clustered_se = clustered_se, G = G, m = m, icc = icc)
  })

  output$scatter <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    # Color by cluster (cycle through colors)
    palette <- rep(c("#e74c3c", "#3498db", "#27ae60", "#9b59b6",
                     "#e67e22", "#1abc9c", "#34495e", "#f39c12"),
                   length.out = d$G)
    cols <- palette[d$cluster_id]

    plot(d$x, d$y, pch = 16, col = adjustcolor(cols, 0.5), cex = 0.6,
         xlab = "X", ylab = "Y",
         main = paste0(d$G, " schools, ", d$m, " students each"))
    abline(d$fit, col = "#2c3e50", lwd = 2.5)
    abline(h = 0, lty = 3, col = "gray60")
  })

  output$se_compare <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 8, 3, 1))

    ses <- c(d$ols_se, d$robust_se, d$clustered_se)
    names_se <- c("OLS SE", "Robust SE", "Clustered SE")
    cols <- c("#e74c3c", "#e67e22", "#27ae60")

    # CIs
    lo <- d$b_hat - 1.96 * ses
    hi <- d$b_hat + 1.96 * ses

    xlim <- range(c(lo, hi, 0)) + c(-0.3, 0.3)

    plot(NULL, xlim = xlim, ylim = c(0.5, 3.5),
         yaxt = "n", ylab = "", xlab = expression(hat(beta)),
         main = "95% CIs with different SEs")
    axis(2, at = 1:3, labels = names_se, las = 1, cex.axis = 0.85)

    for (i in 1:3) {
      segments(lo[i], i, hi[i], i, lwd = 4, col = cols[i])
      points(d$b_hat, i, pch = 19, cex = 1.5, col = cols[i])
    }

    abline(v = 0, lty = 2, lwd = 2, col = "#2c3e50")
    text(0, 3.4, expression("True " * beta * " = 0"), cex = 0.9)
  })

  output$results <- renderUI({
    d <- dat()
    ratio <- d$clustered_se / d$ols_se

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>\u03b2\u0302:</b> ",
        round(d$b_hat, 4), "<br>",
        "<b>OLS SE:</b> <span class='bad'>",
        round(d$ols_se, 4), "</span><br>",
        "<b>Robust SE:</b> ",
        round(d$robust_se, 4), "<br>",
        "<b>Clustered SE:</b> <span class='good'>",
        round(d$clustered_se, 4), "</span><br>",
        "<hr style='margin:8px 0'>",
        "<b>Cluster/OLS ratio:</b> ", round(ratio, 2), "x<br>",
        "<small>With ICC = ", d$icc, ", clustered SE is<br>",
        round(ratio, 1), "x larger than OLS SE.</small>"
      ))
    )
  })
}

shinyApp(ui, server)

Simulation 2: Rejection rates

The real test: run 500 experiments where the true effect is zero. How often does each type of SE incorrectly reject H₀ at the 5% level? OLS SEs reject far too often. Clustered SEs get it right.

#| standalone: true
#| viewerHeight: 520

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
    .good { color: #27ae60; font-weight: bold; }
    .bad  { color: #e74c3c; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("G2", "Number of clusters:",
                  min = 10, max = 100, value = 30, step = 5),

      sliderInput("m2", "Obs per cluster:",
                  min = 5, max = 50, value = 20, step = 5),

      sliderInput("icc2", "ICC:",
                  min = 0, max = 0.8, value = 0.3, step = 0.05),

      sliderInput("sims", "Number of simulations:",
                  min = 200, max = 1000, value = 500, step = 100),

      actionButton("go2", "Run simulations",
                   class = "btn-primary", width = "100%"),

      uiOutput("results2")
    ),

    mainPanel(
      width = 9,
      plotOutput("rejection_plot", height = "420px")
    )
  )
)

server <- function(input, output, session) {

  dat <- reactive({
    input$go2
    G    <- input$G2
    m    <- input$m2
    icc  <- input$icc2
    sims <- input$sims

    sigma_b <- sqrt(icc)
    sigma_w <- sqrt(1 - icc)
    n <- G * m

    reject_ols <- 0
    reject_robust <- 0
    reject_cluster <- 0

    for (s in seq_len(sims)) {
      cluster_id <- rep(seq_len(G), each = m)
      school_eff <- rep(rnorm(G, sd = sigma_b), each = m)
      # Give x a shared cluster component; with a purely i.i.d. x,
      # OLS SEs would be approximately valid despite clustered errors.
      x <- sqrt(0.5) * rep(rnorm(G), each = m) + sqrt(0.5) * rnorm(n)
      y <- 0 * x + school_eff + rnorm(n, sd = sigma_w)

      fit <- lm(y ~ x)
      b_hat <- coef(fit)[2]
      r <- resid(fit)
      X <- cbind(1, x)

      # OLS
      ols_se <- summary(fit)$coefficients[2, 2]

      # Robust
      bread <- solve(t(X) %*% X)
      meat_hc <- t(X) %*% diag(r^2 * n / (n - 2)) %*% X
      robust_se <- sqrt((bread %*% meat_hc %*% bread)[2, 2])

      # Clustered
      meat_cl <- matrix(0, 2, 2)
      for (g in seq_len(G)) {
        idx <- which(cluster_id == g)
        Xg <- X[idx, , drop = FALSE]
        rg <- r[idx]
        meat_cl <- meat_cl + t(Xg) %*% (rg %*% t(rg)) %*% Xg
      }
      adj <- G / (G - 1) * (n - 1) / (n - 2)
      cl_se <- sqrt((adj * bread %*% meat_cl %*% bread)[2, 2])

      if (abs(b_hat / ols_se) > 1.96)    reject_ols <- reject_ols + 1
      if (abs(b_hat / robust_se) > 1.96) reject_robust <- reject_robust + 1
      if (abs(b_hat / cl_se) > 1.96)     reject_cluster <- reject_cluster + 1
    }

    list(rej_ols = reject_ols / sims,
         rej_robust = reject_robust / sims,
         rej_cluster = reject_cluster / sims,
         sims = sims, G = G, m = m, icc = icc)
  })

  output$rejection_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 8, 3, 1))

    rates <- c(d$rej_ols, d$rej_robust, d$rej_cluster) * 100
    names_se <- c("OLS SE", "Robust SE", "Clustered SE")
    cols <- c("#e74c3c", "#e67e22", "#27ae60")

    bp <- barplot(rates, names.arg = names_se, col = cols,
                  horiz = TRUE, xlim = c(0, max(rates, 10) * 1.3),
                  xlab = "Rejection rate (%)",
                  main = paste0("False rejection rate at 5% level\n(",
                               d$sims, " simulations, ICC = ",
                               d$icc, ")"),
                  las = 1, border = NA)

    abline(v = 5, lty = 2, lwd = 2.5, col = "#2c3e50")
    text(5, max(bp) + 0.6, "Target: 5%", cex = 0.85)

    text(rates + 1, bp, paste0(round(rates, 1), "%"),
         cex = 0.9, adj = 0)
  })

  output$results2 <- renderUI({
    d <- dat()
    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>OLS rejection rate:</b> <span class='",
        ifelse(d$rej_ols > 0.08, "bad", "good"), "'>",
        round(d$rej_ols * 100, 1), "%</span><br>",
        "<b>Robust rejection rate:</b> <span class='",
        ifelse(d$rej_robust > 0.08, "bad", "good"), "'>",
        round(d$rej_robust * 100, 1), "%</span><br>",
        "<b>Clustered rejection rate:</b> <span class='good'>",
        round(d$rej_cluster * 100, 1), "%</span><br>",
        "<hr style='margin:8px 0'>",
        "<small>True \u03b2 = 0 in all simulations.<br>",
        "Correct rejection rate is 5%.</small>"
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • ICC = 0 (no clustering): all three SEs give ~5% rejection. When there’s no within-cluster correlation, OLS SEs are fine.
  • ICC = 0.3: OLS rejection shoots up to ~15–25%. Robust SEs help a little but not enough. Clustered SEs stay near 5%.
  • ICC = 0.5: OLS can reject 30–40% of the time — catastrophic.
  • Increase cluster size, hold clusters fixed: the problem gets worse with more observations per cluster. More correlated data doesn’t help — it just gives you a false sense of precision.
  • Increase number of clusters: this does help. The effective sample size for clustered data is closer to the number of clusters than the number of observations.

Simulation 3: Choosing the cluster level

In hierarchical data — students in classrooms in schools in districts — you have to decide which level to cluster at. The same \(\hat{\beta}\) gets different standard errors depending on your choice:

  • Too fine (e.g., classroom when correlation exists at the school level): SEs are too small, you over-reject.
  • Too coarse (e.g., state when you only have 6 states): too few clusters for the asymptotic approximation, SEs may be unreliable.
  • The right level: cluster where treatment is assigned, or at the coarsest level where errors are correlated.

This simulation generates a four-level hierarchy with variance at each level you control. Watch the SE staircase: SEs weakly increase as you cluster coarser, and the jumps tell you where the correlation lives. If SEs barely change going from school to district, there’s little district-level correlation — clustering at school is fine.

#| standalone: true
#| viewerHeight: 650

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
    .good { color: #27ae60; font-weight: bold; }
    .bad  { color: #e74c3c; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("n_dist3", "Number of districts:",
                  min = 4, max = 12, value = 8, step = 1),

      helpText("4 schools/district, 4 classrooms/school,
               8 students/classroom"),

      tags$hr(),

      sliderInput("var_dist", "District variance share:",
                  min = 0, max = 0.5, value = 0.2, step = 0.05),

      sliderInput("var_school", "School variance share:",
                  min = 0, max = 0.5, value = 0.15, step = 0.05),

      sliderInput("var_class", "Classroom variance share:",
                  min = 0, max = 0.5, value = 0.1, step = 0.05),

      uiOutput("var_indiv_text"),

      actionButton("go3", "New draw",
                   class = "btn-primary", width = "100%"),

      uiOutput("results3")
    ),

    mainPanel(
      width = 9,
      fluidRow(
        column(6, plotOutput("se_staircase", height = "450px")),
        column(6, plotOutput("ci_levels", height = "450px"))
      )
    )
  )
)

server <- function(input, output, session) {

  # Helper: clustered SE for coefficient on x (index 2)
  calc_cluster_se <- function(Xmat, r, cluster_id) {
    n <- nrow(Xmat)
    k <- ncol(Xmat)
    bread <- solve(crossprod(Xmat))
    clusters <- unique(cluster_id)
    G <- length(clusters)
    meat <- matrix(0, k, k)
    for (g in clusters) {
      idx <- which(cluster_id == g)
      Xg <- Xmat[idx, , drop = FALSE]
      rg <- r[idx]
      score_g <- crossprod(Xg, rg)
      meat <- meat + tcrossprod(score_g)
    }
    adj <- G / (G - 1) * (n - 1) / (n - k)
    vcov <- adj * bread %*% meat %*% bread
    sqrt(vcov[2, 2])
  }

  dat3 <- reactive({
    input$go3
    n_dist <- input$n_dist3
    v_d    <- input$var_dist
    v_s    <- input$var_school
    v_c    <- input$var_class
    v_i    <- max(0.01, 1 - v_d - v_s - v_c)

    # Fixed hierarchy
    sch_per_dist <- 4
    cls_per_sch  <- 4
    stu_per_cls  <- 8

    n_sch <- n_dist * sch_per_dist
    n_cls <- n_sch * cls_per_sch
    n_stu <- n_cls * stu_per_cls

    # IDs
    dist_id  <- rep(1:n_dist,
                    each = sch_per_dist * cls_per_sch * stu_per_cls)
    school_id <- rep(1:n_sch,
                     each = cls_per_sch * stu_per_cls)
    class_id  <- rep(1:n_cls,
                     each = stu_per_cls)

    # Random effects
    d_eff <- rep(rnorm(n_dist, sd = sqrt(v_d)),
                 each = sch_per_dist * cls_per_sch * stu_per_cls)
    s_eff <- rep(rnorm(n_sch, sd = sqrt(v_s)),
                 each = cls_per_sch * stu_per_cls)
    c_eff <- rep(rnorm(n_cls, sd = sqrt(v_c)),
                 each = stu_per_cls)
    i_err <- rnorm(n_stu, sd = sqrt(v_i))

    # x gets correlation at every level of the hierarchy (equal shares);
    # error correlation at a level only inflates SEs if x is also
    # correlated at that level.
    x <- sqrt(0.25) * rep(rnorm(n_dist),
                          each = sch_per_dist * cls_per_sch * stu_per_cls) +
         sqrt(0.25) * rep(rnorm(n_sch), each = cls_per_sch * stu_per_cls) +
         sqrt(0.25) * rep(rnorm(n_cls), each = stu_per_cls) +
         sqrt(0.25) * rnorm(n_stu)
    y <- 0 * x + d_eff + s_eff + c_eff + i_err

    fit <- lm(y ~ x)
    b_hat <- coef(fit)[2]
    r <- resid(fit)
    Xmat <- cbind(1, x)

    # Robust SE (HC1) — efficient version
    e2 <- r^2 * n_stu / (n_stu - 2)
    bread <- solve(crossprod(Xmat))
    meat_hc <- crossprod(Xmat * sqrt(e2))
    robust_se <- sqrt((bread %*% meat_hc %*% bread)[2, 2])

    # Clustered SEs at each level
    class_se  <- calc_cluster_se(Xmat, r, class_id)
    school_se <- calc_cluster_se(Xmat, r, school_id)
    dist_se   <- calc_cluster_se(Xmat, r, dist_id)

    ses <- c(robust_se, class_se, school_se, dist_se)
    n_clusters <- c(n_stu, n_cls, n_sch, n_dist)
    labels <- c("Robust\n(individual)",
                paste0("Classroom\n(G=", n_cls, ")"),
                paste0("School\n(G=", n_sch, ")"),
                paste0("District\n(G=", n_dist, ")"))

    list(b_hat = b_hat, ses = ses,
         n_clusters = n_clusters, labels = labels)
  })

  output$var_indiv_text <- renderUI({
    v_i <- max(0, 1 - input$var_dist -
               input$var_school - input$var_class)
    col <- if (v_i < 0.05) "color: #e74c3c;" else ""
    tags$p(style = paste0("font-size: 13px; ", col),
      HTML(paste0("<b>Individual share:</b> ",
                  round(v_i, 2))))
  })

  output$se_staircase <- renderPlot({
    d <- dat3()
    par(mar = c(5, 9, 3, 2))

    cols <- c("#e74c3c", "#e67e22", "#3498db", "#27ae60")
    ses <- d$ses

    bp <- barplot(ses, names.arg = d$labels,
                  horiz = TRUE, col = cols, border = NA,
                  las = 1, cex.names = 0.8,
                  xlab = expression("SE(" * hat(beta) * ")"),
                  main = "SE at each clustering level",
                  xlim = c(0, max(ses) * 1.5))

    # Label: value and ratio vs robust
    text(ses + max(ses) * 0.03, bp,
         paste0(format(round(ses, 4), nsmall = 4),
                "  (", round(ses / ses[1], 1), "x)"),
         adj = 0, cex = 0.85)
  })

  output$ci_levels <- renderPlot({
    d <- dat3()
    par(mar = c(5, 9, 3, 2))

    ses <- d$ses
    b   <- d$b_hat
    lo  <- b - 1.96 * ses
    hi  <- b + 1.96 * ses
    cols <- c("#e74c3c", "#e67e22", "#3498db", "#27ae60")

    xlim <- c(min(lo, 0) - abs(min(lo, 0)) * 0.3,
              max(hi, 0) + abs(max(hi, 0)) * 0.3)

    plot(NULL, xlim = xlim, ylim = c(0.5, 4.5),
         yaxt = "n", ylab = "",
         xlab = expression(hat(beta)),
         main = expression(
           "95% CI at each level (true " * beta * " = 0)"))
    axis(2, at = 1:4, labels = d$labels,
         las = 1, cex.axis = 0.8)

    for (i in 1:4) {
      segments(lo[i], i, hi[i], i,
               lwd = 5, col = cols[i])
      points(b, i, pch = 19, cex = 1.5, col = cols[i])
    }

    abline(v = 0, lty = 2, lwd = 2, col = "#2c3e50")
  })

  output$results3 <- renderUI({
    d <- dat3()
    ses <- d$ses

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>SE staircase:</b><br>",
        "Robust: ", round(ses[1], 4), "<br>",
        "Classroom: ", round(ses[2], 4),
        " (<b>", round(ses[2]/ses[1], 1), "x</b>)<br>",
        "School: ", round(ses[3], 4),
        " (<b>", round(ses[3]/ses[1], 1), "x</b>)<br>",
        "District: ", round(ses[4], 4),
        " (<b>", round(ses[4]/ses[1], 1), "x</b>)<br>",
        "<hr style='margin:8px 0'>",
        "<small>Ratios are vs robust SE.<br>",
        "Big jump = correlation at that level.</small>"
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • All variance at district level (0.5, 0, 0): huge jump from robust to district. Clustering at classroom or school barely helps — the correlation lives at the top.
  • All variance at classroom level (0, 0, 0.5): the jump happens from robust to classroom. School and district clustering add almost nothing beyond classroom, because there’s no correlation between classrooms.
  • Equal shares (0.15, 0.15, 0.15): a gradual staircase — each level adds some.
  • Zero variance at all levels (0, 0, 0): no hierarchy at all. All four SEs are essentially the same. Robust is fine.
  • Few districts (4): even with the “right” SE, only 4 clusters is dangerously few for asymptotic clustered SEs. Consider the wild cluster bootstrap.

Choosing the level in practice

The staircase gives you a diagnostic: where the SE jumps is where the correlation lives. Apply this to your research design:

| If your design looks like… | Cluster at… | Why |
|---|---|---|
| Treatment assigned to whole schools, students measured | School | Treatment assigned at school level |
| Policy varies by state, data is individual | State | Treatment assigned at state level |
| Tracts in counties, treatment is tract-level exposure | County | Spatial correlation in outcomes across tracts in the same county |
| Panel data, units observed over time | Unit | Serial correlation within unit |

The key rule: you can never go wrong clustering coarser (SEs only get bigger), but you can go wrong clustering too fine. The risk of coarse clustering is having too few clusters — below ~30, use the wild cluster bootstrap.

Code reference

| Software | Clustered SEs |
|---|---|
| R | `lmtest::coeftest(fit, vcov = sandwich::vcovCL, cluster = ~school)` |
| Stata | `reg y x, cluster(school)` |
| Python | `sm.OLS(y, X).fit(cov_type='cluster', cov_kwds={'groups': school})` |

The practical rule

  • Always cluster at the level of treatment assignment. If you randomized schools, cluster by school. If you randomized classrooms, cluster by classroom.
  • When in doubt, cluster at the highest level that makes sense. Clustering too aggressively (too few clusters) can be a problem, but not clustering when you should is always worse.
  • The minimum number of clusters for reliable clustered SEs is roughly 30–50. With fewer clusters, consider the wild cluster bootstrap.
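With few clusters, the wild cluster bootstrap is also simple to sketch by hand. A toy version with Rademacher weights and the null imposed (my own minimal implementation; in real work reach for a dedicated package such as fwildclusterboot):

```r
# Wild cluster bootstrap p-value for H0: beta = 0, Rademacher weights.
set.seed(7)
G <- 12; m <- 30; n <- G * m                        # deliberately few clusters
cluster_id <- rep(seq_len(G), each = m)
x <- sqrt(0.5) * rep(rnorm(G), each = m) + sqrt(0.5) * rnorm(n)
y <- rep(rnorm(G, sd = 0.6), each = m) + rnorm(n)   # true beta = 0

# Cluster-robust t-statistic for the slope
cl_t <- function(y, x, cluster_id) {
  fit <- lm(y ~ x)
  X <- cbind(1, x); r <- resid(fit)
  bread <- solve(crossprod(X))
  meat <- matrix(0, 2, 2)
  for (g in unique(cluster_id)) {
    s <- crossprod(X[cluster_id == g, , drop = FALSE], r[cluster_id == g])
    meat <- meat + tcrossprod(s)
  }
  G <- length(unique(cluster_id)); n <- length(y)
  se <- sqrt((G / (G - 1) * (n - 1) / (n - 2) *
                bread %*% meat %*% bread)[2, 2])
  unname(coef(fit)[2] / se)
}

t_obs <- cl_t(y, x, cluster_id)

# Impose H0 (beta = 0): the restricted model is intercept-only.
fit0 <- lm(y ~ 1)
r0 <- resid(fit0); y0_hat <- fitted(fit0)

B <- 399
t_boot <- replicate(B, {
  w <- sample(c(-1, 1), G, replace = TRUE)[cluster_id]  # one +/-1 per cluster
  cl_t(y0_hat + w * r0, x, cluster_id)
})
p_val <- mean(abs(t_boot) >= abs(t_obs))
p_val
```

Flipping residuals cluster-by-cluster preserves the within-cluster correlation in every bootstrap draw, which is what keeps the test honest with a handful of clusters.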

Did you know?

  • Brent Moulton (1990) published a now-classic paper showing that regressions using aggregate variables (like state-level policies) with individual-level data produce t-statistics that are wildly inflated if you ignore clustering. He called it the “Moulton problem.” Many published results in economics were likely spurious because of this.
  • The Moulton factor — the ratio of the true variance to the (too-small) OLS variance — is approximately \(1 + (m-1) \times ICC\) for a regressor that is constant within clusters, where \(m\) is the average cluster size. With 50 students per school and ICC = 0.2, the Moulton factor is about 11, meaning your OLS SE is \(\sqrt{11} \approx 3.3\) times too small and your t-statistic about 3.3x too large.
  • Cameron, Gelbach, and Miller (2008) showed that even clustered SEs can be unreliable when the number of clusters is small (< 30–50), leading to the development of the wild cluster bootstrap as a more robust alternative.
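The Moulton-factor arithmetic above can be checked in a couple of lines:

```r
# Moulton variance inflation: m = 50 students per school, ICC = 0.2
m <- 50; icc <- 0.2
moulton <- 1 + (m - 1) * icc   # = 10.8, i.e. about 11
sqrt(moulton)                  # about 3.29: OLS SE is ~3.3x too small
```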