Multiple Testing & the Replication Crisis

The problem: test enough things and something will be “significant”

If you test one true null hypothesis at \(\alpha = 0.05\), there’s a 5% chance of a false positive. But if you test 20 true null hypotheses (and the tests are independent), the chance that at least one comes up falsely significant is:

\[P(\text{at least one false positive}) = 1 - (1 - 0.05)^{20} \approx 0.64\]

That’s a 64% chance of finding something “significant” even when nothing is real. This is the multiple testing problem, and it’s at the heart of the replication crisis.
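
A quick sanity check of that arithmetic in R (the formula assumes the tests are independent):

# Probability of at least one false positive among m independent true nulls
alpha <- 0.05
m <- c(1, 5, 10, 20, 50)
round(1 - (1 - alpha)^m, 3)
#> [1] 0.050 0.226 0.401 0.642 0.923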

How it happens in practice

  • A researcher tests 20 outcome variables and reports the one with p < 0.05
  • A genetics study tests 500,000 SNPs for association with a disease
  • An A/B test checks conversions, clicks, revenue, time on page, bounce rate…
  • A psychology experiment tests the effect on 10 different mood scales

In each case, the nominal \(\alpha = 0.05\) no longer controls the actual false positive rate. You need a correction.
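
Each of these scenarios comes down to adjusting a vector of p-values. R’s built-in p.adjust() implements the standard corrections; here is a minimal sketch on 20 simulated true nulls (the data are pure noise, so any “significant” result is a false positive):

set.seed(42)

# 20 one-sample t-tests on pure noise: every null hypothesis is true
p <- replicate(20, t.test(rnorm(50), mu = 0)$p.value)

data.frame(
  uncorrected = sum(p < 0.05),  # often 1 or more false positives
  bonferroni  = sum(p.adjust(p, method = "bonferroni") < 0.05),
  bh          = sum(p.adjust(p, method = "BH") < 0.05)
)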

The Oracle View. In these simulations, we know which effects are real and which are null — so we can count exactly how many “significant” results are true discoveries vs false alarms. In practice, when you get a significant p-value, you never know which kind it is. That’s why corrections matter.

Simulation 1: False discoveries from multiple tests

Test 20 true null hypotheses at \(\alpha = 0.05\). Watch how many come up “significant” purely by chance. Repeat to see the pattern.

#| standalone: true
#| viewerHeight: 650

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
    .good { color: #27ae60; font-weight: bold; }
    .bad  { color: #e74c3c; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("n_tests", "Number of hypotheses tested:",
                  min = 5, max = 50, value = 20, step = 5),

      sliderInput("n_obs", "Sample size per test:",
                  min = 20, max = 200, value = 50, step = 10),

      sliderInput("alpha", HTML("&alpha; (per-test):"),
                  min = 0.01, max = 0.10, value = 0.05, step = 0.01),

      sliderInput("n_real", "Number with real effects:",
                  min = 0, max = 10, value = 0, step = 1),

      actionButton("go", "Run tests",
                   class = "btn-primary", width = "100%"),

      uiOutput("results")
    ),

    mainPanel(
      width = 9,
      plotOutput("pval_plot", height = "520px")
    )
  )
)

server <- function(input, output, session) {

  dat <- reactive({
    input$go
    m     <- input$n_tests
    n     <- input$n_obs
    alpha <- input$alpha
    n_real <- min(input$n_real, m)

    p_vals <- numeric(m)
    is_real <- rep(FALSE, m)

    for (i in seq_len(m)) {
      if (i <= n_real) {
        # Real effect (d = 0.5)
        x <- rnorm(n, mean = 0.5, sd = 1)
        is_real[i] <- TRUE
      } else {
        # Null is true
        x <- rnorm(n, mean = 0, sd = 1)
      }
      p_vals[i] <- t.test(x, mu = 0)$p.value
    }

    # Bonferroni correction
    bonf_alpha <- alpha / m
    bonf_reject <- p_vals < bonf_alpha

    # BH (FDR) correction
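    # Step-up rule: find the largest rank k with p_(k) <= (k/m) * alpha,
    # then reject the k tests with the smallest p-values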
    sorted_idx <- order(p_vals)
    sorted_p <- p_vals[sorted_idx]
    bh_threshold <- (seq_len(m) / m) * alpha
    bh_max <- max(c(0, which(sorted_p <= bh_threshold)))
    bh_reject <- rep(FALSE, m)
    if (bh_max > 0) bh_reject[sorted_idx[seq_len(bh_max)]] <- TRUE

    list(p_vals = p_vals, is_real = is_real, m = m, n = n,
         alpha = alpha, n_real = n_real,
         raw_reject = p_vals < alpha,
         bonf_reject = bonf_reject,
         bh_reject = bh_reject,
         bonf_alpha = bonf_alpha)
  })

  output$pval_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 6, 3, 1))

    ord <- order(d$p_vals)
    p_sorted <- d$p_vals[ord]
    real_sorted <- d$is_real[ord]

    cols <- ifelse(real_sorted, "#3498db", "#95a5a6")
    pch_vals <- ifelse(real_sorted, 17, 16)

    plot(p_sorted, seq_len(d$m), pch = pch_vals, cex = 1.5,
         col = cols, xlim = c(0, 1),
         xlab = "p-value", ylab = "Test (sorted by p-value)",
         main = paste0(d$m, " hypothesis tests (", d$n_real,
                      " real effects)"),
         yaxt = "n")
    axis(2, at = seq_len(d$m), labels = seq_len(d$m),
         las = 1, cex.axis = 0.7)

    # Alpha threshold
    abline(v = d$alpha, lty = 2, lwd = 2, col = "#e74c3c")
    text(d$alpha + 0.02, d$m, paste0("\u03b1 = ", d$alpha),
         col = "#e74c3c", cex = 0.8, adj = 0)

    # Bonferroni threshold
    abline(v = d$bonf_alpha, lty = 2, lwd = 2, col = "#e67e22")
    text(d$bonf_alpha + 0.02, d$m - 1,
         paste0("Bonf = ", round(d$bonf_alpha, 4)),
         col = "#e67e22", cex = 0.8, adj = 0)

    # BH line: threshold (i/m) * alpha at rank i (y-axis is already sorted by p-value)
    bh_line <- (seq_len(d$m) / d$m) * d$alpha
    lines(bh_line, seq_len(d$m), lty = 3, lwd = 2, col = "#27ae60")

    # Highlight rejections
    raw_sig <- which(p_sorted < d$alpha)
    if (length(raw_sig) > 0) {
      points(p_sorted[raw_sig], raw_sig, pch = 1, cex = 2.5,
             col = "#e74c3c", lwd = 2)
    }

    legend("bottomright", bty = "n", cex = 0.8,
           legend = c(
             ifelse(d$n_real > 0, "Real effect", ""),
             "Null (no effect)",
             paste0("Uncorrected \u03b1 = ", d$alpha),
             paste0("Bonferroni \u03b1 = ", round(d$bonf_alpha, 4)),
             "BH (FDR) threshold"
           )[c(d$n_real > 0, TRUE, TRUE, TRUE, TRUE)],
           col = c(
             if (d$n_real > 0) "#3498db",
             "#95a5a6", "#e74c3c", "#e67e22", "#27ae60"
           ),
           pch = c(
             if (d$n_real > 0) 17,
             16, NA, NA, NA
           ),
           lwd = c(
             if (d$n_real > 0) NA,
             NA, 2, 2, 2
           ),
           lty = c(
             if (d$n_real > 0) NA,
             NA, 2, 2, 3
           ))
  })

  output$results <- renderUI({
    d <- dat()

    raw_fp <- sum(d$raw_reject & !d$is_real)
    bonf_fp <- sum(d$bonf_reject & !d$is_real)
    bh_fp <- sum(d$bh_reject & !d$is_real)

    raw_tp <- sum(d$raw_reject & d$is_real)
    bonf_tp <- sum(d$bonf_reject & d$is_real)
    bh_tp <- sum(d$bh_reject & d$is_real)

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>Uncorrected:</b><br>",
        "&nbsp; Reject: ", sum(d$raw_reject),
        " (FP: <span class='bad'>", raw_fp, "</span>",
        if (d$n_real > 0) paste0(", TP: ", raw_tp), ")<br>",
        "<b>Bonferroni:</b><br>",
        "&nbsp; Reject: ", sum(d$bonf_reject),
        " (FP: ", bonf_fp,
        if (d$n_real > 0) paste0(", TP: ", bonf_tp), ")<br>",
        "<b>BH (FDR):</b><br>",
        "&nbsp; Reject: ", sum(d$bh_reject),
        " (FP: ", bh_fp,
        if (d$n_real > 0) paste0(", TP: ", bh_tp), ")<br>",
        "<hr style='margin:8px 0'>",
        "<small>FP = false positive<br>",
        "TP = true positive</small>"
      ))
    )
  })
}

shinyApp(ui, server)

Simulation 2: Correction methods compared

Run the multiple testing experiment many times to see the long-run performance of each correction method. Track the family-wise error rate (FWER: any false positive?) and false discovery rate (FDR: what fraction of discoveries are false?).
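
Formally, if \(V\) is the number of false positives and \(R\) is the total number of rejections in one family of tests, the two error rates are

\[\text{FWER} = P(V \geq 1), \qquad \text{FDR} = E\!\left[\frac{V}{\max(R, 1)}\right]\]

Bonferroni guarantees FWER \(\leq \alpha\); Benjamini-Hochberg guarantees FDR \(\leq \alpha\) for independent (or certain positively dependent) tests. The simulation estimates both by averaging over many replications.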

#| standalone: true
#| viewerHeight: 520

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
    .good { color: #27ae60; font-weight: bold; }
    .bad  { color: #e74c3c; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("m2", "Number of tests:",
                  min = 5, max = 50, value = 20, step = 5),

      sliderInput("n_real2", "Tests with real effects:",
                  min = 0, max = 10, value = 3, step = 1),

      sliderInput("sims2", "Simulations:",
                  min = 200, max = 1000, value = 500, step = 100),

      actionButton("go2", "Run simulations",
                   class = "btn-primary", width = "100%"),

      uiOutput("results2")
    ),

    mainPanel(
      width = 9,
      plotOutput("compare_plot", height = "420px")
    )
  )
)

server <- function(input, output, session) {

  dat <- reactive({
    input$go2
    m      <- input$m2
    n_real <- min(input$n_real2, m)
    sims   <- input$sims2
    n      <- 50
    alpha  <- 0.05

    fwer <- c(raw = 0, bonf = 0, bh = 0)
    power <- c(raw = 0, bonf = 0, bh = 0)
    fdr_sum <- c(raw = 0, bonf = 0, bh = 0)

    for (s in seq_len(sims)) {
      p_vals <- numeric(m)
      is_real <- rep(FALSE, m)

      for (i in seq_len(m)) {
        if (i <= n_real) {
          x <- rnorm(n, mean = 0.5, sd = 1)
          is_real[i] <- TRUE
        } else {
          x <- rnorm(n, mean = 0, sd = 1)
        }
        p_vals[i] <- t.test(x, mu = 0)$p.value
      }

      # Raw
      raw_rej <- p_vals < alpha

      # Bonferroni
      bonf_rej <- p_vals < (alpha / m)

      # BH
      sorted_idx <- order(p_vals)
      sorted_p <- p_vals[sorted_idx]
      bh_thresh <- (seq_len(m) / m) * alpha
      bh_max <- max(c(0, which(sorted_p <= bh_thresh)))
      bh_rej <- rep(FALSE, m)
      if (bh_max > 0) bh_rej[sorted_idx[seq_len(bh_max)]] <- TRUE

      for (method in list(list("raw", raw_rej),
                          list("bonf", bonf_rej),
                          list("bh", bh_rej))) {
        nm <- method[[1]]
        rej <- method[[2]]

        # FWER
        if (any(rej & !is_real)) fwer[nm] <- fwer[nm] + 1

        # Power (if any real effects)
        if (n_real > 0) {
          power[nm] <- power[nm] + sum(rej & is_real)
        }

        # FDR: false discovery proportion V / max(R, 1) for this replication
        fdr_sum[nm] <- fdr_sum[nm] + sum(rej & !is_real) / max(sum(rej), 1)
      }
    }

    fwer <- fwer / sims
    if (n_real > 0) power <- power / (sims * n_real)
    fdr <- fdr_sum / sims

    list(fwer = fwer, power = power, fdr = fdr,
         m = m, n_real = n_real, sims = sims)
  })

  output$compare_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 6), xpd = TRUE)

    methods <- c("Uncorrected", "Bonferroni", "BH (FDR)")
    cols <- c("#e74c3c", "#e67e22", "#27ae60")
    x_pos <- seq_along(methods)

    if (d$n_real > 0) {
      # Show both FWER and power
      mat <- rbind(d$fwer * 100, d$power * 100)
      bp <- barplot(mat, beside = TRUE, names.arg = methods,
                    col = c("#e74c3c80", "#3498db80"),
                    border = c("#e74c3c", "#3498db"),
                    ylim = c(0, 100),
                    ylab = "Rate (%)",
                    main = paste0("FWER & Power (", d$sims,
                                 " simulations, ", d$n_real,
                                 " real effects)"))

      abline(h = 5, lty = 2, lwd = 2, col = "#7f8c8d")
      text(max(bp) + 1, 7, "\u03b1 = 5%", cex = 0.8, col = "#7f8c8d")

      legend("topright", inset = c(-0.18, 0), bty = "n", cex = 0.85,
             legend = c("FWER", "Power"),
             fill = c("#e74c3c80", "#3498db80"),
             border = c("#e74c3c", "#3498db"))
    } else {
      bp <- barplot(d$fwer * 100, names.arg = methods,
                    col = cols, border = NA,
                    ylim = c(0, max(d$fwer * 100) * 1.3),
                    ylab = "Family-wise error rate (%)",
                    main = paste0("FWER under H\u2080 (",
                                 d$sims, " simulations)"))

      abline(h = 5, lty = 2, lwd = 2, col = "#7f8c8d")
      text(max(bp) + 0.3, 7, "Target: 5%", cex = 0.8, col = "#7f8c8d")

      text(bp, d$fwer * 100 + 2,
           paste0(round(d$fwer * 100, 1), "%"), cex = 0.9)
    }
  })

  output$results2 <- renderUI({
    d <- dat()

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>FWER (any false positive):</b><br>",
        "&nbsp; Uncorrected: <span class='bad'>",
        round(d$fwer["raw"] * 100, 1), "%</span><br>",
        "&nbsp; Bonferroni: <span class='good'>",
        round(d$fwer["bonf"] * 100, 1), "%</span><br>",
        "&nbsp; BH: ", round(d$fwer["bh"] * 100, 1), "%<br>",
        if (d$n_real > 0) paste0(
          "<hr style='margin:8px 0'>",
          "<b>Power (detect real effects):</b><br>",
          "&nbsp; Uncorrected: ", round(d$power["raw"] * 100, 1), "%<br>",
          "&nbsp; Bonferroni: ", round(d$power["bonf"] * 100, 1), "%<br>",
          "&nbsp; BH: ", round(d$power["bh"] * 100, 1), "%"
        ) else ""
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • 20 tests, 0 real effects: uncorrected FWER is ~64%; Bonferroni brings it to ~5%. BH also lands near 5% here, because when every null is true any rejection is a false discovery, so FDR and FWER coincide. (The sketch after this list reproduces these numbers.)
  • 20 tests, 3 real effects: Bonferroni controls FWER tightly but has lower power. BH controls the false discovery rate (what fraction of your discoveries are wrong) while maintaining better power.
  • 50 tests, 0 real effects: uncorrected FWER is over 90%. The more you test, the worse it gets.
  • Sim 1 with 0 real effects: click “Run tests” repeatedly. Each time, a different set of null hypotheses appears “significant.” This is exactly how researchers accidentally find spurious results.
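
The same long-run comparison can be scripted outside the app. A minimal sketch using p.adjust(), assuming the setup of the simulations above (one-sample t-tests, n = 50, effect size 0.5 for real effects); with 0 real effects it should reproduce roughly the 64% / 5% / 5% figures from the first bullet:

set.seed(1)

sim_once <- function(m = 20, n_real = 0, n = 50, alpha = 0.05) {
  # First n_real tests have a true mean of 0.5; the rest are null
  means <- c(rep(0.5, n_real), rep(0, m - n_real))
  p <- sapply(means, function(mu) t.test(rnorm(n, mean = mu), mu = 0)$p.value)
  is_null <- means == 0

  # Did each method make at least one false positive?
  c(raw  = any(p < alpha & is_null),
    bonf = any(p.adjust(p, "bonferroni") < alpha & is_null),
    bh   = any(p.adjust(p, "BH") < alpha & is_null))
}

# Estimated FWER over 2000 replications, all nulls true
rowMeans(replicate(2000, sim_once(m = 20, n_real = 0)))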

When to use which correction

| Method | Controls | Best for |
|---|---|---|
| None | Per-comparison \(\alpha\) | Single pre-registered hypothesis |
| Bonferroni | FWER (any false positive) | Safety-critical decisions; few tests |
| BH (Benjamini-Hochberg) | FDR (false discovery proportion) | Exploratory analysis; many tests |
| Pre-registration | Everything | Specify your hypothesis before seeing data |

The bottom line

  • If you test enough things, something will be significant. The question is whether it’s real.
  • Bonferroni is conservative — it controls the probability of any false positive but sacrifices power.
  • BH is more permissive — it controls the proportion of discoveries that are false, giving you better power.
  • The best correction is not testing everything. Pre-register your primary hypothesis and keep the number of tests small.

Did you know?

  • The XKCD comic “Significant” perfectly illustrates the multiple testing problem: researchers test whether jelly beans cause acne, testing 20 different colors. Green jelly beans come up significant at p < 0.05 — and the newspaper headline reads “Green Jelly Beans Linked to Acne!”
  • John Ioannidis published “Why Most Published Research Findings Are False” in 2005. His argument: when studies are underpowered, when many hypotheses are tested, and when researchers have flexibility in their analysis, the majority of “significant” findings are likely false positives. The paper has been cited over 10,000 times and helped launch the replication crisis movement.
  • Benjamini and Hochberg published their FDR procedure in 1995. It was initially met with skepticism — why would you allow some false discoveries? But in fields like genomics, where you test thousands of genes simultaneously, controlling the proportion of false discoveries turns out to be much more practical (and powerful) than trying to prevent any single false positive.