Multiple Testing & the Replication Crisis
The problem: test enough things and something will be “significant”
If you test one true null hypothesis at \(\alpha = 0.05\), there’s a 5% chance of a false positive. But if you test 20 independent true null hypotheses, the chance that at least one comes up falsely significant is:
\[P(\text{at least one false positive}) = 1 - (1 - 0.05)^{20} \approx 0.64\]
That’s a 64% chance of finding something “significant” even when nothing is real. This is the multiple testing problem, and it’s at the heart of the replication crisis.
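A few lines of R reproduce this calculation for different numbers of independent tests:

```r
# P(at least one false positive) among m independent true-null tests,
# each run at the same per-test alpha
alpha <- 0.05
for (m in c(1, 5, 10, 20, 50)) {
  cat(sprintf("m = %2d: P(at least one false positive) = %.3f\n",
              m, 1 - (1 - alpha)^m))
}
# m = 20 gives about 0.64; m = 50 climbs past 0.92
```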
How it happens in practice
- A researcher tests 20 outcome variables and reports the one with p < 0.05
- A genetics study tests 500,000 SNPs for association with a disease
- An A/B test checks conversions, clicks, revenue, time on page, bounce rate…
- A psychology experiment tests the effect on 10 different mood scales
In each case, the nominal \(\alpha = 0.05\) no longer controls the actual false positive rate. You need a correction.
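The corrections themselves are a single call to base R’s `p.adjust()`. As a minimal sketch (simulated data; the seed and sample size are arbitrary), test 20 outcomes with no real effects and count how many clear the 0.05 bar before and after adjustment:

```r
set.seed(42)  # arbitrary seed, only for a reproducible illustration

# 20 outcome variables, none with a real effect
p_raw <- replicate(20, t.test(rnorm(50), mu = 0)$p.value)

sum(p_raw < 0.05)                          # "discoveries" at the raw threshold
sum(p.adjust(p_raw, "bonferroni") < 0.05)  # after Bonferroni (usually 0)
sum(p.adjust(p_raw, "BH") < 0.05)          # after Benjamini-Hochberg (usually 0)
```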
Simulation 1: False discoveries from multiple tests
Test 20 true null hypotheses at \(\alpha = 0.05\). Watch how many come up “significant” purely by chance. Repeat to see the pattern.
#| standalone: true
#| viewerHeight: 650
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
.good { color: #27ae60; font-weight: bold; }
.bad { color: #e74c3c; font-weight: bold; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("n_tests", "Number of hypotheses tested:",
min = 5, max = 50, value = 20, step = 5),
sliderInput("n_obs", "Sample size per test:",
min = 20, max = 200, value = 50, step = 10),
sliderInput("alpha", HTML("α (per-test):"),
min = 0.01, max = 0.10, value = 0.05, step = 0.01),
sliderInput("n_real", "Number with real effects:",
min = 0, max = 10, value = 0, step = 1),
actionButton("go", "Run tests",
class = "btn-primary", width = "100%"),
uiOutput("results")
),
mainPanel(
width = 9,
plotOutput("pval_plot", height = "520px")
)
)
)
server <- function(input, output, session) {
dat <- reactive({
input$go
m <- input$n_tests
n <- input$n_obs
alpha <- input$alpha
n_real <- min(input$n_real, m)
p_vals <- numeric(m)
is_real <- rep(FALSE, m)
for (i in seq_len(m)) {
if (i <= n_real) {
# Real effect (d = 0.5)
x <- rnorm(n, mean = 0.5, sd = 1)
is_real[i] <- TRUE
} else {
# Null is true
x <- rnorm(n, mean = 0, sd = 1)
}
p_vals[i] <- t.test(x, mu = 0)$p.value
}
# Bonferroni correction
bonf_alpha <- alpha / m
bonf_reject <- p_vals < bonf_alpha
# BH (FDR) correction
sorted_idx <- order(p_vals)
sorted_p <- p_vals[sorted_idx]
bh_threshold <- (seq_len(m) / m) * alpha
bh_max <- max(c(0, which(sorted_p <= bh_threshold)))
bh_reject <- rep(FALSE, m)
if (bh_max > 0) bh_reject[sorted_idx[seq_len(bh_max)]] <- TRUE
list(p_vals = p_vals, is_real = is_real, m = m, n = n,
alpha = alpha, n_real = n_real,
raw_reject = p_vals < alpha,
bonf_reject = bonf_reject,
bh_reject = bh_reject,
bonf_alpha = bonf_alpha)
})
output$pval_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 6, 3, 1))
ord <- order(d$p_vals)
p_sorted <- d$p_vals[ord]
real_sorted <- d$is_real[ord]
cols <- ifelse(real_sorted, "#3498db", "#95a5a6")
pch_vals <- ifelse(real_sorted, 17, 16)
plot(p_sorted, seq_len(d$m), pch = pch_vals, cex = 1.5,
col = cols, xlim = c(0, 1),
xlab = "p-value", ylab = "Test (sorted by p-value)",
main = paste0(d$m, " hypothesis tests (", d$n_real,
" real effects)"),
yaxt = "n")
axis(2, at = seq_len(d$m), labels = seq_len(d$m),
las = 1, cex.axis = 0.7)
# Alpha threshold
abline(v = d$alpha, lty = 2, lwd = 2, col = "#e74c3c")
text(d$alpha + 0.02, d$m, paste0("\u03b1 = ", d$alpha),
col = "#e74c3c", cex = 0.8, adj = 0)
# Bonferroni threshold
abline(v = d$bonf_alpha, lty = 2, lwd = 2, col = "#e67e22")
text(d$bonf_alpha + 0.02, d$m - 1,
paste0("Bonf = ", round(d$bonf_alpha, 4)),
col = "#e67e22", cex = 0.8, adj = 0)
# BH line
bh_line <- (seq_len(d$m) / d$m) * d$alpha
lines(bh_line[ord], seq_len(d$m), lty = 3, lwd = 2, col = "#27ae60")
# Highlight rejections
raw_sig <- which(p_sorted < d$alpha)
if (length(raw_sig) > 0) {
points(p_sorted[raw_sig], raw_sig, pch = 1, cex = 2.5,
col = "#e74c3c", lwd = 2)
}
legend("bottomright", bty = "n", cex = 0.8,
legend = c(
ifelse(d$n_real > 0, "Real effect", ""),
"Null (no effect)",
paste0("Uncorrected \u03b1 = ", d$alpha),
paste0("Bonferroni \u03b1 = ", round(d$bonf_alpha, 4)),
"BH (FDR) threshold"
)[c(d$n_real > 0, TRUE, TRUE, TRUE, TRUE)],
col = c(
if (d$n_real > 0) "#3498db",
"#95a5a6", "#e74c3c", "#e67e22", "#27ae60"
),
pch = c(
if (d$n_real > 0) 17,
16, NA, NA, NA
),
lwd = c(
if (d$n_real > 0) NA,
NA, 2, 2, 2
),
lty = c(
if (d$n_real > 0) NA,
NA, 2, 2, 3
))
})
output$results <- renderUI({
d <- dat()
raw_fp <- sum(d$raw_reject & !d$is_real)
bonf_fp <- sum(d$bonf_reject & !d$is_real)
bh_fp <- sum(d$bh_reject & !d$is_real)
raw_tp <- sum(d$raw_reject & d$is_real)
bonf_tp <- sum(d$bonf_reject & d$is_real)
bh_tp <- sum(d$bh_reject & d$is_real)
tags$div(class = "stats-box",
HTML(paste0(
"<b>Uncorrected:</b><br>",
" Reject: ", sum(d$raw_reject),
" (FP: <span class='bad'>", raw_fp, "</span>",
if (d$n_real > 0) paste0(", TP: ", raw_tp), ")<br>",
"<b>Bonferroni:</b><br>",
" Reject: ", sum(d$bonf_reject),
" (FP: ", bonf_fp,
if (d$n_real > 0) paste0(", TP: ", bonf_tp), ")<br>",
"<b>BH (FDR):</b><br>",
" Reject: ", sum(d$bh_reject),
" (FP: ", bh_fp,
if (d$n_real > 0) paste0(", TP: ", bh_tp), ")<br>",
"<hr style='margin:8px 0'>",
"<small>FP = false positive<br>",
"TP = true positive</small>"
))
)
})
}
shinyApp(ui, server)
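The app computes the Benjamini-Hochberg cut-off by hand. The same step-up rule fits in a short standalone function; `bh_reject()` below is an illustrative helper (not part of the app), cross-checked against base R’s `p.adjust()` on simulated p-values with an arbitrary small effect:

```r
# Illustrative helper: Benjamini-Hochberg step-up, as implemented in the app.
# Reject the k smallest p-values, where k is the largest rank with
# p_(k) <= (k / m) * alpha.
bh_reject <- function(p, alpha = 0.05) {
  m <- length(p)
  ord <- order(p)
  k <- max(c(0, which(p[ord] <= seq_len(m) / m * alpha)))
  reject <- logical(m)
  if (k > 0) reject[ord[seq_len(k)]] <- TRUE
  reject
}

# Sanity check against base R's adjusted p-values (<= makes the rules match exactly)
p <- replicate(20, t.test(rnorm(50, mean = 0.2), mu = 0)$p.value)
all(bh_reject(p) == (p.adjust(p, method = "BH") <= 0.05))  # TRUE
```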
Simulation 2: Correction methods compared
Run the multiple testing experiment many times to see the long-run performance of each correction method. Track the family-wise error rate (FWER: any false positive?) and false discovery rate (FDR: what fraction of discoveries are false?).
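In symbols, writing \(V\) for the number of false positives and \(R\) for the total number of rejections:
\[\text{FWER} = P(V \ge 1), \qquad \text{FDR} = E\left[\frac{V}{\max(R, 1)}\right]\]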
#| standalone: true
#| viewerHeight: 520
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
.good { color: #27ae60; font-weight: bold; }
.bad { color: #e74c3c; font-weight: bold; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("m2", "Number of tests:",
min = 5, max = 50, value = 20, step = 5),
sliderInput("n_real2", "Tests with real effects:",
min = 0, max = 10, value = 3, step = 1),
sliderInput("sims2", "Simulations:",
min = 200, max = 1000, value = 500, step = 100),
actionButton("go2", "Run simulations",
class = "btn-primary", width = "100%"),
uiOutput("results2")
),
mainPanel(
width = 9,
plotOutput("compare_plot", height = "420px")
)
)
)
server <- function(input, output, session) {
dat <- reactive({
input$go2
m <- input$m2
n_real <- min(input$n_real2, m)
sims <- input$sims2
n <- 50
alpha <- 0.05
fwer <- c(raw = 0, bonf = 0, bh = 0)
power <- c(raw = 0, bonf = 0, bh = 0)
fdr_sum <- c(raw = 0, bonf = 0, bh = 0)
for (s in seq_len(sims)) {
p_vals <- numeric(m)
is_real <- rep(FALSE, m)
for (i in seq_len(m)) {
if (i <= n_real) {
x <- rnorm(n, mean = 0.5, sd = 1)
is_real[i] <- TRUE
} else {
x <- rnorm(n, mean = 0, sd = 1)
}
p_vals[i] <- t.test(x, mu = 0)$p.value
}
# Raw
raw_rej <- p_vals < alpha
# Bonferroni
bonf_rej <- p_vals < (alpha / m)
# BH
sorted_idx <- order(p_vals)
sorted_p <- p_vals[sorted_idx]
bh_thresh <- (seq_len(m) / m) * alpha
bh_max <- max(c(0, which(sorted_p <= bh_thresh)))
bh_rej <- rep(FALSE, m)
if (bh_max > 0) bh_rej[sorted_idx[seq_len(bh_max)]] <- TRUE
for (method in list(list("raw", raw_rej),
list("bonf", bonf_rej),
list("bh", bh_rej))) {
nm <- method[[1]]
rej <- method[[2]]
# FWER
if (any(rej & !is_real)) fwer[nm] <- fwer[nm] + 1
# Power (if any real effects)
if (n_real > 0) {
power[nm] <- power[nm] + sum(rej & is_real)
}
# False discovery proportion for this sim (0 when nothing is rejected)
if (sum(rej) > 0) {
fdr_sum[nm] <- fdr_sum[nm] + sum(rej & !is_real) / sum(rej)
}
}
}
fwer <- fwer / sims
if (n_real > 0) power <- power / (sims * n_real)
# FDR = E[V / max(R, 1)]: average the per-sim FDP over all simulations
fdr <- fdr_sum / sims
list(fwer = fwer, power = power, fdr = fdr,
m = m, n_real = n_real, sims = sims)
})
output$compare_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 6), xpd = TRUE)
methods <- c("Uncorrected", "Bonferroni", "BH (FDR)")
cols <- c("#e74c3c", "#e67e22", "#27ae60")
x_pos <- seq_along(methods)
if (d$n_real > 0) {
# Show both FWER and power
mat <- rbind(d$fwer * 100, d$power * 100)
bp <- barplot(mat, beside = TRUE, names.arg = methods,
col = c("#e74c3c80", "#3498db80"),
border = c("#e74c3c", "#3498db"),
ylim = c(0, 100),
ylab = "Rate (%)",
main = paste0("FWER & Power (", d$sims,
" simulations, ", d$n_real,
" real effects)"))
abline(h = 5, lty = 2, lwd = 2, col = "#7f8c8d")
text(max(bp) + 1, 7, "\u03b1 = 5%", cex = 0.8, col = "#7f8c8d")
legend("topright", inset = c(-0.18, 0), bty = "n", cex = 0.85,
legend = c("FWER", "Power"),
fill = c("#e74c3c80", "#3498db80"),
border = c("#e74c3c", "#3498db"))
} else {
bp <- barplot(d$fwer * 100, names.arg = methods,
col = cols, border = NA,
ylim = c(0, max(d$fwer * 100) * 1.3),
ylab = "Family-wise error rate (%)",
main = paste0("FWER under H\u2080 (",
d$sims, " simulations)"))
abline(h = 5, lty = 2, lwd = 2, col = "#7f8c8d")
text(max(bp) + 0.3, 7, "Target: 5%", cex = 0.8, col = "#7f8c8d")
text(bp, d$fwer * 100 + 2,
paste0(round(d$fwer * 100, 1), "%"), cex = 0.9)
}
})
output$results2 <- renderUI({
d <- dat()
tags$div(class = "stats-box",
HTML(paste0(
"<b>FWER (any false positive):</b><br>",
" Uncorrected: <span class='bad'>",
round(d$fwer["raw"] * 100, 1), "%</span><br>",
" Bonferroni: <span class='good'>",
round(d$fwer["bonf"] * 100, 1), "%</span><br>",
" BH: ", round(d$fwer["bh"] * 100, 1), "%<br>",
if (d$n_real > 0) paste0(
"<hr style='margin:8px 0'>",
"<b>Power (detect real effects):</b><br>",
" Uncorrected: ", round(d$power["raw"] * 100, 1), "%<br>",
" Bonferroni: ", round(d$power["bonf"] * 100, 1), "%<br>",
" BH: ", round(d$power["bh"] * 100, 1), "%"
) else ""
))
)
})
}
shinyApp(ui, server)
Things to try
- 20 tests, 0 real effects: uncorrected FWER is ~64%; Bonferroni brings it to ~5%. BH also stays near 5% here, because with no real effects every rejection is false, so FDR and FWER coincide (the sketch after this list checks these numbers).
- 20 tests, 3 real effects: Bonferroni controls FWER tightly but has lower power. BH controls the false discovery rate (what fraction of your discoveries are wrong) while maintaining better power.
- 50 tests, 0 real effects: uncorrected FWER is over 90%. The more you test, the worse it gets.
- Sim 1 with 0 real effects: click “Run tests” repeatedly. Each time, a different set of null hypotheses appears “significant.” This is exactly how researchers accidentally find spurious results.
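These figures are easy to verify without the app. Here is a minimal Monte Carlo sketch under the global null; `fwer_check()` is an illustrative helper, and the simulation count is arbitrary (the run takes a few seconds):

```r
# Illustrative helper: estimate FWER when all m nulls are true
fwer_check <- function(m, alpha = 0.05, sims = 1000, n = 50) {
  hits <- c(raw = 0, bonferroni = 0, BH = 0)
  for (s in seq_len(sims)) {
    p <- replicate(m, t.test(rnorm(n), mu = 0)$p.value)
    hits["raw"]        <- hits["raw"]        + any(p < alpha)
    hits["bonferroni"] <- hits["bonferroni"] + any(p.adjust(p, "bonferroni") < alpha)
    hits["BH"]         <- hits["BH"]         + any(p.adjust(p, "BH") < alpha)
  }
  round(hits / sims, 2)
}

fwer_check(20)  # roughly 0.64, 0.05, 0.05
fwer_check(50)  # uncorrected climbs past 0.90; both corrections stay near 0.05
```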
When to use which correction
| Method | Controls | Best for |
|---|---|---|
| None | Per-comparison \(\alpha\) | Single pre-registered hypothesis |
| Bonferroni | FWER (any false positive) | Safety-critical decisions; few tests |
| BH (Benjamini-Hochberg) | FDR (false discovery proportion) | Exploratory analysis; many tests |
| Pre-registration | Researcher degrees of freedom | Confirmatory studies; hypotheses specified before seeing the data |
The bottom line
- If you test enough things, something will be significant. The question is whether it’s real.
- Bonferroni is conservative — it controls the probability of any false positive but sacrifices power (see the power calculation after this list).
- BH is more permissive — it controls the proportion of discoveries that are false, giving you better power.
- The best correction is not testing everything. Pre-register your primary hypothesis and keep the number of tests small.
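The power cost of Bonferroni can be put in numbers with base R’s `power.t.test()`, using the same one-sample setting as the simulations above (\(n = 50\), effect size \(d = 0.5\)) and 20 tests:

```r
# Per-test power at the nominal threshold vs. the Bonferroni-adjusted one
power.t.test(n = 50, delta = 0.5, sd = 1, sig.level = 0.05,
             type = "one.sample")$power       # about 0.93
power.t.test(n = 50, delta = 0.5, sd = 1, sig.level = 0.05 / 20,
             type = "one.sample")$power       # substantially lower
```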
Did you know?
- The XKCD comic “Significant” perfectly illustrates the multiple testing problem: researchers test whether jelly beans cause acne, testing 20 different colors. Green jelly beans come up significant at p < 0.05 — and the newspaper headline reads “Green Jelly Beans Linked to Acne!”
- John Ioannidis published “Why Most Published Research Findings Are False” in 2005. His argument: when studies are underpowered, when many hypotheses are tested, and when researchers have flexibility in their analysis, the majority of “significant” findings are likely false positives. The paper has been cited over 10,000 times and helped launch the replication crisis movement.
- Benjamini and Hochberg published their FDR procedure in 1995. It was initially met with skepticism — why would you allow some false discoveries? But in fields like genomics, where you test thousands of genes simultaneously, controlling the proportion of false discoveries turns out to be much more practical (and powerful) than trying to prevent any single false positive.