# Power, Alpha, Beta & MDE

## The big picture
You run an experiment to test whether some treatment works. There are only four things that can happen:
| Your conclusion | Treatment does nothing (H₀ true) | Treatment works (H₁ true) |
|---|---|---|
| You say “no effect” | Correct | Type II error (miss it) — probability = \(\beta\) |
| You say “it works!” | Type I error (false alarm) — probability = \(\alpha\) | Correct — probability = Power = \(1 - \beta\) |
That’s it. Everything on this page is about these four cells.
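Here is a minimal simulation sketch of those four cells. Every number in it (two groups of \(n = 50\), a true effect of \(d = 0.5\), a one-sided z-test with known \(\sigma = 1\)) is an illustrative assumption, not something fixed by the table:

```r
# Simulate many experiments under H0 and under H1 and count how often
# we declare "it works!" (all parameters illustrative).
set.seed(42)
n <- 50; alpha <- 0.05; d <- 0.5     # assumed sample size and true effect
crit <- qnorm(1 - alpha)             # one-sided critical value
se   <- sqrt(2 / n)                  # SE of the difference in means, sigma = 1

one_trial <- function(effect) {
  x <- rnorm(n, mean = 0)            # control group
  y <- rnorm(n, mean = effect)       # treatment group
  (mean(y) - mean(x)) / se > crit    # TRUE = we declare "it works!"
}

mean(replicate(10000, one_trial(0))) # ~ alpha: false alarms when H0 is true
mean(replicate(10000, one_trial(d))) # ~ power: detections when H1 is true
```

The first rate lands near 0.05 (Type I errors) and the second near 0.80 (power), which puts the miss rate \(\beta\) around 0.20.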
## What are \(\alpha\) and \(\beta\)?
\(\alpha\) (alpha) is how often you cry wolf. You set this before the experiment — typically 0.05. It’s the false positive rate: the chance you declare “it works!” when the treatment actually does nothing.
\(\beta\) (beta) is how often you miss a real effect. If the treatment genuinely works, \(\beta\) is the probability you shrug and say “no effect.” You want this to be small.
Power = \(1 - \beta\) is the flip side: the probability you correctly detect a real effect. Convention is to aim for 0.80 (80%).
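You rarely have to compute these by hand: base R's power.t.test() links \(n\), effect size, \(\alpha\), and power, and solves for whichever one you leave as NULL. A quick sketch with illustrative numbers (it assumes a two-sided t-test by default, so its answers differ slightly from the one-sided z approximations used elsewhere on this page):

```r
# Given n, effect size, and alpha, solve for power.
# With sd = 1, delta is the standardized effect d.
power.t.test(n = 50, delta = 0.5, sd = 1, sig.level = 0.05)

# Leave n out and supply the desired power to solve for n instead.
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)
```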
## The two-distribution picture
The key insight is that there are two worlds — one where the treatment does nothing (null), and one where it has an effect (alternative). Each world gives you a different sampling distribution for your test statistic:
- Under the null, the distribution is centered at 0 (no effect).
- Under the alternative, the distribution is shifted by the true effect size.
You pick a critical value (the cutoff). If your test statistic lands past it, you reject H₀. The simulation below shows both distributions. Drag the sliders and watch how \(\alpha\), \(\beta\), and power change.
```{shinylive-r}
#| standalone: true
#| viewerHeight: 580
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("effect", "True effect size (d):",
min = 0, max = 2, value = 0.5, step = 0.05),
sliderInput("n", "Sample size per group (n):",
min = 10, max = 500, value = 50, step = 10),
sliderInput("alpha", HTML("α (significance level):"),
min = 0.01, max = 0.10, value = 0.05, step = 0.01),
uiOutput("results_box")
),
mainPanel(
width = 9,
plotOutput("dist_plot", height = "450px")
)
)
)
server <- function(input, output, session) {
vals <- reactive({
d <- input$effect
n <- input$n
alpha <- input$alpha
se <- sqrt(2 / n) # SE of difference in means (sigma=1 each group)
shift <- d / se # noncentrality (in SE units)
crit <- qnorm(1 - alpha) # one-sided critical value
power <- 1 - pnorm(crit - shift)
beta <- 1 - power
list(se = se, shift = shift, crit = crit,
power = power, beta = beta, alpha = alpha, d = d, n = n)
})
output$dist_plot <- renderPlot({
v <- vals()
xmin <- min(-4, v$shift - 4)
xmax <- max(4, v$shift + 4)
x <- seq(xmin, xmax, length.out = 500)
y_null <- dnorm(x)
y_alt <- dnorm(x, mean = v$shift)
par(mar = c(4.5, 4.5, 3, 1))
plot(x, y_null, type = "l", lwd = 2.5, col = "#2c3e50",
xlab = "Test statistic (z)", ylab = "Density",
main = "Null vs Alternative Distribution",
ylim = c(0, max(y_null, y_alt) * 1.15),
xlim = c(xmin, xmax))
lines(x, y_alt, lwd = 2.5, col = "#3498db")
# Critical value line
abline(v = v$crit, lty = 2, lwd = 2, col = "#7f8c8d")
# Shade alpha region (right tail of null beyond crit)
x_alpha <- seq(v$crit, xmax, length.out = 200)
polygon(c(v$crit, x_alpha, xmax),
c(0, dnorm(x_alpha), 0),
col = adjustcolor("#e74c3c", 0.35), border = NA)
# Shade beta region (left part of alternative, below crit)
x_beta <- seq(xmin, v$crit, length.out = 200)
polygon(c(xmin, x_beta, v$crit),
c(0, dnorm(x_beta, mean = v$shift), 0),
col = adjustcolor("#f39c12", 0.35), border = NA)
# Shade power region (right part of alternative, beyond crit)
x_pow <- seq(v$crit, xmax, length.out = 200)
polygon(c(v$crit, x_pow, xmax),
c(0, dnorm(x_pow, mean = v$shift), 0),
col = adjustcolor("#2ecc71", 0.35), border = NA)
# Labels
legend("topleft", bty = "n", cex = 0.9,
legend = c(
expression("Null distribution (H"[0]*": no effect)"),
expression("Alternative distribution (H"[1]*": effect exists)"),
"Critical value",
expression(alpha * " (false positive)"),
expression(beta * " (miss / Type II)"),
"Power (correct detection)"
),
col = c("#2c3e50", "#3498db", "#7f8c8d",
adjustcolor("#e74c3c", 0.6),
adjustcolor("#f39c12", 0.6),
adjustcolor("#2ecc71", 0.6)),
lwd = c(2.5, 2.5, 2, 8, 8, 8),
lty = c(1, 1, 2, 1, 1, 1))
})
output$results_box <- renderUI({
v <- vals()
tags$div(class = "stats-box",
HTML(paste0(
"<b>α:</b> ", v$alpha, "<br>",
"<b>β:</b> ", round(v$beta, 3), "<br>",
"<b>Power:</b> ", round(v$power, 3), "<br>",
"<hr style='margin:8px 0'>",
"<b>Effect (d):</b> ", v$d, "<br>",
"<b>n per group:</b> ", v$n, "<br>",
"<b>Critical z:</b> ", round(v$crit, 2)
))
)
})
}
shinyApp(ui, server)
```
## Things to try
- Set effect = 0 and watch: the alternative distribution sits exactly on top of the null, so there is no real effect to detect. Any rejection is a false positive.
- Set effect = 0.5 with n = 20: power is low. Now slide n up — power climbs. This is why sample size matters.
- Set n = 200 and shrink the effect toward 0: even large samples struggle to detect tiny effects.
- Lower \(\alpha\) from 0.05 to 0.01: the critical value moves right, \(\alpha\) shrinks, but \(\beta\) grows. There’s always a tradeoff between false positives and false negatives; the short computation below makes it concrete.
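Here is that last tradeoff as numbers rather than shaded areas, a sketch reusing the illustrative setup from earlier (one-sided z-test, \(d = 0.5\), \(n = 50\)):

```r
# beta grows as alpha shrinks (one-sided z-test; d and n are assumptions).
d <- 0.5; n <- 50
shift  <- d / sqrt(2 / n)                  # alternative mean in SE units
alphas <- c(0.10, 0.05, 0.01, 0.001)
beta   <- pnorm(qnorm(1 - alphas) - shift) # P(statistic below cutoff | H1)
round(data.frame(alpha = alphas, beta = beta, power = 1 - beta), 3)
```

Each step down in \(\alpha\) pushes the cutoff right and hands probability mass from power to \(\beta\).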
## Minimum Detectable Effect (MDE)
When planning an experiment, you often ask: “Given my sample size, what’s the smallest effect I can reliably detect?” That’s the MDE.
It depends on three things: sample size (\(n\)), significance level (\(\alpha\)), and desired power (\(1 - \beta\)). For a one-sided two-sample z-test with equal group sizes and \(\sigma = 1\), the formula is:
\[ \text{MDE} = (z_{1-\alpha} + z_{1-\beta}) \times \sqrt{\frac{2}{n}} \]
Notice that \(\sqrt{2/n}\) is just the standard error of the difference in means (with \(\sigma = 1\) per group). So the MDE is really just a scaled-up SE:
\[MDE = (z_{1-\alpha} + z_{1-\beta}) \times SE\]
The z multipliers are fixed once you pick \(\alpha\) and power: for one-sided 5% significance and 80% power, \(z_{0.95} + z_{0.80} \approx 1.64 + 0.84 \approx 2.49\) (about 2.8 with the two-sided \(z_{1-\alpha/2} = 1.96\) instead). The only thing you control is the SE: increase \(n\), or reduce \(\sigma\) through better measurement, stratification, or controls. Power analysis is really just an SE calculation in disguise. See Variance, SD & Standard Error for more on this connection.
Larger \(n\) shrinks the SE, which shrinks the MDE. Higher power demands a larger MDE (or more \(n\)).
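The formula is only a couple of lines of R. A sketch that matches the app below (one-sided test, \(\sigma = 1\) per group; the defaults are illustrative):

```r
# Smallest standardized effect detectable at a given n, alpha, and power
# (one-sided two-sample z-test, sigma = 1 per group).
mde <- function(n, alpha = 0.05, power = 0.80) {
  (qnorm(1 - alpha) + qnorm(power)) * sqrt(2 / n)
}

mde(100)             # ~0.35: n = 100 per group resolves effects of d >= 0.35
mde(c(50, 200, 800)) # quadrupling n halves the MDE: the sqrt(2/n) at work
```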
```{shinylive-r}
#| standalone: true
#| viewerHeight: 480
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.mde-box {
background: #eaf2f8; border-radius: 6px; padding: 16px;
margin-top: 14px; font-size: 15px; line-height: 2;
text-align: center;
}
.mde-box .big { font-size: 28px; color: #e74c3c; font-weight: bold; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("n2", "Sample size per group (n):",
min = 10, max = 1000, value = 100, step = 10),
sliderInput("alpha2", HTML("α:"),
min = 0.01, max = 0.10, value = 0.05, step = 0.01),
sliderInput("power2", "Desired power:",
min = 0.50, max = 0.95, value = 0.80, step = 0.05),
uiOutput("mde_box")
),
mainPanel(
width = 9,
plotOutput("mde_curve", height = "400px")
)
)
)
server <- function(input, output, session) {
output$mde_curve <- renderPlot({
alpha <- input$alpha2
power <- input$power2
n_now <- input$n2
ns <- seq(10, 1000, by = 5)
mdes <- (qnorm(1 - alpha) + qnorm(power)) * sqrt(2 / ns)
mde_now <- (qnorm(1 - alpha) + qnorm(power)) * sqrt(2 / n_now)
par(mar = c(4.5, 4.5, 3, 1))
plot(ns, mdes, type = "l", lwd = 2.5, col = "#3498db",
xlab = "Sample size per group (n)",
ylab = "MDE (standardized effect size)",
main = paste0("MDE curve (\u03b1 = ", alpha, ", power = ", power, ")"),
ylim = c(0, max(mdes)))
# Highlight current n
points(n_now, mde_now, pch = 19, cex = 2, col = "#e74c3c")
segments(n_now, 0, n_now, mde_now, lty = 2, col = "#e74c3c")
segments(0, mde_now, n_now, mde_now, lty = 2, col = "#e74c3c")
text(n_now + 30, mde_now + 0.02,
paste0("MDE = ", round(mde_now, 3)),
col = "#e74c3c", cex = 0.95, adj = 0)
})
output$mde_box <- renderUI({
alpha <- input$alpha2
power <- input$power2
n_now <- input$n2
mde <- (qnorm(1 - alpha) + qnorm(power)) * sqrt(2 / n_now)
tags$div(class = "mde-box",
HTML(paste0(
"With <b>n = ", n_now, "</b> per group,<br>",
"you can detect effects as small as:<br>",
"<span class='big'>d = ", round(mde, 3), "</span>"
))
)
})
}
shinyApp(ui, server)
```
## The intuition
- MDE is your experiment’s resolution. A microscope can’t see atoms; your experiment can’t see effects smaller than the MDE.
- More data (larger \(n\)) = sharper microscope = smaller MDE.
- If you need to detect a 1% lift in click-through rate but your MDE is 3%, your experiment is pointless — you’ll almost certainly miss it even if the effect is real.
- In practice: figure out the smallest effect that would matter for your decision, then compute the \(n\) needed to detect it (see the sketch below).
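That last step is just the MDE formula solved for \(n\). A sketch under the same one-sided, \(\sigma = 1\) assumptions as above:

```r
# Required n per group to detect a target standardized effect d_min
# (one-sided two-sample z-test, sigma = 1 per group; rounded up).
n_required <- function(d_min, alpha = 0.05, power = 0.80) {
  ceiling(2 * ((qnorm(1 - alpha) + qnorm(power)) / d_min)^2)
}

n_required(0.5) # ~50 per group for a medium effect
n_required(0.1) # ~1237 per group: tiny effects are expensive
```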
## Did you know?
- Jacob Cohen, the psychologist who popularized power analysis, found in 1962 that the median power of studies in behavioral science journals was only 0.48 — meaning most studies had less than a coin-flip chance of detecting the effects they were looking for. He spent the rest of his career trying to fix this. His book Statistical Power Analysis for the Behavioral Sciences (1969) remains a classic.
- Cohen’s famous effect size conventions (small = 0.2, medium = 0.5, large = 0.8) were meant as rough guides, not rigid rules. He later regretted that people treated them as gospel: “My intent was that d = 0.5 represents a medium effect… it does not mean that 0.5 is a medium effect in your field.”
- The replication crisis in psychology and medicine is largely a power problem. Underpowered studies that happen to find significant results are published; the many more that find nothing are filed away. This is publication bias, and it’s a direct consequence of running experiments without power calculations.