Power, Alpha, Beta & MDE

The big picture

You run an experiment to test whether some treatment works. There are only four things that can happen:

|  | Treatment does nothing (H₀ true) | Treatment works (H₁ true) |
|---|---|---|
| You say “no effect” | Correct | Type II error (miss it) — probability = \(\beta\) |
| You say “it works!” | Type I error (false alarm) — probability = \(\alpha\) | Correct — probability = Power = \(1 - \beta\) |

That’s it. Everything on this page is about these four cells.

What are \(\alpha\) and \(\beta\)?

\(\alpha\) (alpha) is how often you cry wolf. You set this before the experiment — typically 0.05. It’s the false positive rate: the chance you declare “it works!” when the treatment actually does nothing.

\(\beta\) (beta) is how often you miss a real effect. If the treatment genuinely works, \(\beta\) is the probability you shrug and say “no effect.” You want this to be small.

Power = \(1 - \beta\) is the flip side: the probability you correctly detect a real effect. Convention is to aim for 0.80 (80%).
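
To make these definitions concrete, here is a minimal R sketch (the values of d and n are just illustrative) that computes all three quantities using the same one-sided two-sample z-test math the simulation below uses, assuming unit variance in each group:

```r
alpha <- 0.05                      # false positive rate, fixed in advance
d     <- 0.5                       # true standardized effect size
n     <- 50                        # sample size per group

se    <- sqrt(2 / n)               # SE of the difference in means (sigma = 1 per group)
crit  <- qnorm(1 - alpha)          # one-sided critical value
power <- 1 - pnorm(crit - d / se)  # P(reject H0 | effect is real)
beta  <- 1 - power                 # P(miss the effect | effect is real)

round(c(alpha = alpha, beta = beta, power = power), 3)
#> alpha  beta power
#> 0.050 0.196 0.804
```

With these inputs the design has roughly 80% power, which is exactly the convention mentioned above.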

The two-distribution picture

The key insight is that there are two worlds — one where the treatment does nothing (null), and one where it has an effect (alternative). Each world gives you a different sampling distribution for your test statistic:

  • Under the null, the distribution is centered at 0 (no effect).
  • Under the alternative, the distribution is shifted by the true effect size.

You pick a critical value (the cutoff). If your test statistic lands past it, you reject H₀. The simulation below shows both distributions. Drag the sliders and watch how \(\alpha\), \(\beta\), and power change.

```{shinylive-r}
#| standalone: true
#| viewerHeight: 580

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("effect", "True effect size (d):",
                  min = 0, max = 2, value = 0.5, step = 0.05),

      sliderInput("n", "Sample size per group (n):",
                  min = 10, max = 500, value = 50, step = 10),

      sliderInput("alpha", HTML("&alpha; (significance level):"),
                  min = 0.01, max = 0.10, value = 0.05, step = 0.01),

      uiOutput("results_box")
    ),

    mainPanel(
      width = 9,
      plotOutput("dist_plot", height = "450px")
    )
  )
)

server <- function(input, output, session) {

  vals <- reactive({
    d     <- input$effect
    n     <- input$n
    alpha <- input$alpha

    se    <- sqrt(2 / n)          # SE of difference in means (sigma=1 each group)
    shift <- d / se               # noncentrality (in SE units)
    crit  <- qnorm(1 - alpha)     # one-sided critical value

    power <- 1 - pnorm(crit - shift)
    beta  <- 1 - power

    list(se = se, shift = shift, crit = crit,
         power = power, beta = beta, alpha = alpha, d = d, n = n)
  })

  output$dist_plot <- renderPlot({
    v <- vals()

    xmin <- min(-4, v$shift - 4)
    xmax <- max(4, v$shift + 4)
    x <- seq(xmin, xmax, length.out = 500)

    y_null <- dnorm(x)
    y_alt  <- dnorm(x, mean = v$shift)

    par(mar = c(4.5, 4.5, 3, 1))
    plot(x, y_null, type = "l", lwd = 2.5, col = "#2c3e50",
         xlab = "Test statistic (z)", ylab = "Density",
         main = "Null vs Alternative Distribution",
         ylim = c(0, max(y_null, y_alt) * 1.15),
         xlim = c(xmin, xmax))
    lines(x, y_alt, lwd = 2.5, col = "#3498db")

    # Critical value line
    abline(v = v$crit, lty = 2, lwd = 2, col = "#7f8c8d")

    # Shade alpha region (right tail of null beyond crit)
    x_alpha <- seq(v$crit, xmax, length.out = 200)
    polygon(c(v$crit, x_alpha, xmax),
            c(0, dnorm(x_alpha), 0),
            col = adjustcolor("#e74c3c", 0.35), border = NA)

    # Shade beta region (left part of alternative, below crit)
    x_beta <- seq(xmin, v$crit, length.out = 200)
    polygon(c(xmin, x_beta, v$crit),
            c(0, dnorm(x_beta, mean = v$shift), 0),
            col = adjustcolor("#f39c12", 0.35), border = NA)

    # Shade power region (right part of alternative, beyond crit)
    x_pow <- seq(v$crit, xmax, length.out = 200)
    polygon(c(v$crit, x_pow, xmax),
            c(0, dnorm(x_pow, mean = v$shift), 0),
            col = adjustcolor("#2ecc71", 0.35), border = NA)

    # Labels
    legend("topleft", bty = "n", cex = 0.9,
           legend = c(
             expression("Null distribution (H"[0]*": no effect)"),
             expression("Alternative distribution (H"[1]*": effect exists)"),
             "Critical value",
             expression(alpha * " (false positive)"),
             expression(beta * " (miss / Type II)"),
             "Power (correct detection)"
           ),
           col = c("#2c3e50", "#3498db", "#7f8c8d",
                   adjustcolor("#e74c3c", 0.6),
                   adjustcolor("#f39c12", 0.6),
                   adjustcolor("#2ecc71", 0.6)),
           lwd = c(2.5, 2.5, 2, 8, 8, 8),
           lty = c(1, 1, 2, 1, 1, 1))
  })

  output$results_box <- renderUI({
    v <- vals()
    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>&alpha;:</b> ", v$alpha, "<br>",
        "<b>&beta;:</b> ", round(v$beta, 3), "<br>",
        "<b>Power:</b> ", round(v$power, 3), "<br>",
        "<hr style='margin:8px 0'>",
        "<b>Effect (d):</b> ", v$d, "<br>",
        "<b>n per group:</b> ", v$n, "<br>",
        "<b>Critical z:</b> ", round(v$crit, 2)
      ))
    )
  })
}

shinyApp(ui, server)
```

Things to try

  • Set effect = 0 and watch: the alternative distribution sits exactly on top of the null, so there is nothing to detect and “power” collapses to \(\alpha\). Any rejection is a false positive.
  • Set effect = 0.5 with n = 20: power is low. Now slide n up — power climbs. This is why sample size matters.
  • Set n = 200 and shrink the effect toward 0: even large samples struggle to detect tiny effects.
  • Lower \(\alpha\) from 0.05 to 0.01: the critical value moves right, \(\alpha\) shrinks, but \(\beta\) grows. There’s always a tradeoff between false positives and false negatives (the code below checks this numerically).
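
You can sanity-check these numbers outside the app with base R’s `power.t.test()`. It uses the t distribution rather than the normal, so its answers differ slightly from the z-based math above, but they should track the app closely:

```r
# Roughly the app's defaults: d = 0.5, n = 50 per group, one-sided alpha = 0.05
power.t.test(n = 50, delta = 0.5, sd = 1, sig.level = 0.05,
             type = "two.sample", alternative = "one.sided")

# Lower alpha to 0.01: the same design now has noticeably less power
power.t.test(n = 50, delta = 0.5, sd = 1, sig.level = 0.01,
             type = "two.sample", alternative = "one.sided")
```

The first call reports power near 0.80; the second drops well below it, which is the \(\alpha\)/\(\beta\) tradeoff from the last bullet in numbers.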

Minimum Detectable Effect (MDE)

When planning an experiment, you often ask: “Given my sample size, what’s the smallest effect I can reliably detect?” That’s the MDE.

It depends on three things: sample size (\(n\)), significance level (\(\alpha\)), and desired power (\(1 - \beta\)). For a one-sided two-sample test with equal group sizes and unit variance in each group (the same setup as the simulation above), the standardized MDE is:

\[ \text{MDE} = (z_{1-\alpha} + z_{1-\beta}) \times \sqrt{\frac{2}{n}} \]

Larger \(n\) shrinks the MDE. Demanding higher power or a stricter \(\alpha\) raises it: at a fixed \(n\), you can only reliably detect bigger effects.
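
The formula drops straight into a small R helper. This is a sketch under the same assumptions as above (unit variance per group, one-sided test):

```r
# Smallest standardized effect detectable with n observations per group
mde <- function(n, alpha = 0.05, power = 0.80) {
  (qnorm(1 - alpha) + qnorm(power)) * sqrt(2 / n)
}

round(mde(c(50, 100, 200, 500)), 3)
#> [1] 0.497 0.352 0.249 0.157
```

Note the diminishing returns: quadrupling \(n\) from 50 to 200 only halves the MDE, because of the square root.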

```{shinylive-r}
#| standalone: true
#| viewerHeight: 480

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .mde-box {
      background: #eaf2f8; border-radius: 6px; padding: 16px;
      margin-top: 14px; font-size: 15px; line-height: 2;
      text-align: center;
    }
    .mde-box .big { font-size: 28px; color: #e74c3c; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("n2", "Sample size per group (n):",
                  min = 10, max = 1000, value = 100, step = 10),

      sliderInput("alpha2", HTML("&alpha;:"),
                  min = 0.01, max = 0.10, value = 0.05, step = 0.01),

      sliderInput("power2", "Desired power:",
                  min = 0.50, max = 0.95, value = 0.80, step = 0.05),

      uiOutput("mde_box")
    ),

    mainPanel(
      width = 9,
      plotOutput("mde_curve", height = "400px")
    )
  )
)

server <- function(input, output, session) {

  output$mde_curve <- renderPlot({
    alpha <- input$alpha2
    power <- input$power2
    n_now <- input$n2

    ns <- seq(10, 1000, by = 5)
    mdes <- (qnorm(1 - alpha) + qnorm(power)) * sqrt(2 / ns)

    mde_now <- (qnorm(1 - alpha) + qnorm(power)) * sqrt(2 / n_now)

    par(mar = c(4.5, 4.5, 3, 1))
    plot(ns, mdes, type = "l", lwd = 2.5, col = "#3498db",
         xlab = "Sample size per group (n)",
         ylab = "MDE (standardized effect size)",
         main = paste0("MDE curve (\u03b1 = ", alpha, ", power = ", power, ")"),
         ylim = c(0, max(mdes)))

    # Highlight current n
    points(n_now, mde_now, pch = 19, cex = 2, col = "#e74c3c")
    segments(n_now, 0, n_now, mde_now, lty = 2, col = "#e74c3c")
    segments(0, mde_now, n_now, mde_now, lty = 2, col = "#e74c3c")

    text(n_now + 30, mde_now + 0.02,
         paste0("MDE = ", round(mde_now, 3)),
         col = "#e74c3c", cex = 0.95, adj = 0)
  })

  output$mde_box <- renderUI({
    alpha <- input$alpha2
    power <- input$power2
    n_now <- input$n2
    mde <- (qnorm(1 - alpha) + qnorm(power)) * sqrt(2 / n_now)

    tags$div(class = "mde-box",
      HTML(paste0(
        "With <b>n = ", n_now, "</b> per group,<br>",
        "you can detect effects as small as:<br>",
        "<span class='big'>d = ", round(mde, 3), "</span>"
      ))
    )
  })
}

shinyApp(ui, server)
```

The intuition

  • MDE is your experiment’s resolution. A microscope can’t see atoms; your experiment can’t see effects smaller than the MDE.
  • More data (larger \(n\)) = sharper microscope = smaller MDE.
  • If you need to detect a 1% lift in click-through rate but your MDE is 3%, your experiment is pointless — you’ll almost certainly miss it even if the effect is real.
  • In practice: figure out the smallest effect that would matter for your decision, then compute the \(n\) needed to detect it (see the sketch after this list).
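
Inverting the MDE formula gives that last calculation. A minimal sketch, again assuming unit variance per group and a one-sided test:

```r
# Per-group sample size needed to detect standardized effect d
n_required <- function(d, alpha = 0.05, power = 0.80) {
  ceiling(2 * ((qnorm(1 - alpha) + qnorm(power)) / d)^2)
}

n_required(0.5)  # 50 per group
n_required(0.2)  # 310 per group: small effects are expensive
```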

Did you know?

  • Jacob Cohen, the psychologist who popularized power analysis, found in 1962 that the median power of studies in behavioral science journals was only 0.48 — meaning most studies had less than a coin-flip chance of detecting the effects they were looking for. He spent the rest of his career trying to fix this. His book Statistical Power Analysis for the Behavioral Sciences (1969) remains a classic.
  • Cohen’s famous effect size conventions (small = 0.2, medium = 0.5, large = 0.8) were meant as rough guides, not rigid rules. He later regretted that people treated them as gospel: “My intent was that d = 0.5 represents a medium effect… it does not mean that 0.5 is a medium effect in your field.”
  • The replication crisis in psychology and medicine is largely a power problem. Underpowered studies that happen to find significant results are published; the many more that find nothing are filed away. This is publication bias, and it’s a direct consequence of running experiments without power calculations.