Power, Alpha, Beta & MDE

The big picture

You run an experiment to test whether some treatment works. There are only four things that can happen:

| | Treatment does nothing (H₀ true) | Treatment works (H₁ true) |
|---|---|---|
| You say “no effect” | Correct | Type II error (miss it) — probability = \(\beta\) |
| You say “it works!” | Type I error (false alarm) — probability = \(\alpha\) | Correct — probability = Power = \(1 - \beta\) |

That’s it. Everything on this page is about these four cells.

The Oracle View. In these simulations, we set the true effect size and know whether the treatment works. We can label every rejection as correct or false. In practice, you design the experiment without knowing the effect size — you guess it from pilot data or prior studies. You never find out which cell of the table you landed in.

What are \(\alpha\) and \(\beta\)?

\(\alpha\) (alpha) is how often you cry wolf. You set this before the experiment — typically 0.05. It’s the false positive rate: the chance you declare “it works!” when the treatment actually does nothing.

\(\beta\) (beta) is how often you miss a real effect. If the treatment genuinely works, \(\beta\) is the probability you shrug and say “no effect.” You want this to be small.

Power = \(1 - \beta\) is the flip side: the probability you correctly detect a real effect. Convention is to aim for 0.80 (80%).
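
Because \(\alpha\) and \(\beta\) are long-run frequencies, you can watch them emerge by simulation. A minimal sketch of \(\alpha\) in action, assuming an illustrative setup of two groups of 50 with no true difference:

set.seed(1)
# 10,000 experiments in which the treatment truly does nothing (H0 is true)
false_alarms <- replicate(10000, {
  control   <- rnorm(50)                  # both groups drawn from the same
  treatment <- rnorm(50)                  # distribution: no real effect
  t.test(treatment, control)$p.value < 0.05
})
mean(false_alarms)                        # ~0.05: we cry wolf at rate alpha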

The two-distribution picture

The key insight is that there are two worlds — one where the treatment does nothing (null), and one where it has an effect (alternative). Each world gives you a different sampling distribution for your test statistic:

  • Under the null, the distribution is centered at 0 (no effect).
  • Under the alternative, the distribution is shifted by the true effect size.

You pick a critical value (the cutoff). If your test statistic lands past it, you reject H₀. The simulation below shows both distributions. Drag the sliders and watch how \(\alpha\), \(\beta\), and power change.
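
First, here is that arithmetic by hand at the app’s default settings (d = 0.5, n = 50 per group, one-sided \(\alpha\) = 0.05), using the same normal approximation the simulation uses:

se    <- sqrt(2 / 50)     # SE of the difference in means = 0.2
shift <- 0.5 / se         # alternative sits 2.5 SEs right of the null
crit  <- qnorm(0.95)      # one-sided critical value, about 1.64
1 - pnorm(crit - shift)   # power, about 0.80 (so beta is about 0.20)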

#| standalone: true
#| viewerHeight: 580

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("effect", "True effect size (d):",
                  min = 0, max = 2, value = 0.5, step = 0.05),

      sliderInput("n", "Sample size per group (n):",
                  min = 10, max = 500, value = 50, step = 10),

      sliderInput("alpha", HTML("&alpha; (significance level):"),
                  min = 0.01, max = 0.10, value = 0.05, step = 0.01),

      uiOutput("results_box")
    ),

    mainPanel(
      width = 9,
      plotOutput("dist_plot", height = "450px")
    )
  )
)

server <- function(input, output, session) {

  vals <- reactive({
    d     <- input$effect
    n     <- input$n
    alpha <- input$alpha

    se    <- sqrt(2 / n)          # SE of difference in means (sigma=1 each group)
    shift <- d / se               # noncentrality (in SE units)
    crit  <- qnorm(1 - alpha)     # one-sided critical value

    power <- 1 - pnorm(crit - shift)
    beta  <- 1 - power

    list(se = se, shift = shift, crit = crit,
         power = power, beta = beta, alpha = alpha, d = d, n = n)
  })

  output$dist_plot <- renderPlot({
    v <- vals()

    xmin <- min(-4, v$shift - 4)
    xmax <- max(4, v$shift + 4)
    x <- seq(xmin, xmax, length.out = 500)

    y_null <- dnorm(x)
    y_alt  <- dnorm(x, mean = v$shift)

    par(mar = c(4.5, 4.5, 3, 1))
    plot(x, y_null, type = "l", lwd = 2.5, col = "#2c3e50",
         xlab = "Test statistic (z)", ylab = "Density",
         main = "Null vs Alternative Distribution",
         ylim = c(0, max(y_null, y_alt) * 1.15),
         xlim = c(xmin, xmax))
    lines(x, y_alt, lwd = 2.5, col = "#3498db")

    # Critical value line
    abline(v = v$crit, lty = 2, lwd = 2, col = "#7f8c8d")

    # Shade alpha region (right tail of null beyond crit)
    x_alpha <- seq(v$crit, xmax, length.out = 200)
    polygon(c(v$crit, x_alpha, xmax),
            c(0, dnorm(x_alpha), 0),
            col = adjustcolor("#e74c3c", 0.35), border = NA)

    # Shade beta region (left part of alternative, below crit)
    x_beta <- seq(xmin, v$crit, length.out = 200)
    polygon(c(xmin, x_beta, v$crit),
            c(0, dnorm(x_beta, mean = v$shift), 0),
            col = adjustcolor("#f39c12", 0.35), border = NA)

    # Shade power region (right part of alternative, beyond crit)
    x_pow <- seq(v$crit, xmax, length.out = 200)
    polygon(c(v$crit, x_pow, xmax),
            c(0, dnorm(x_pow, mean = v$shift), 0),
            col = adjustcolor("#2ecc71", 0.35), border = NA)

    # Labels
    legend("topleft", bty = "n", cex = 0.9,
           legend = c(
             expression("Null distribution (H"[0]*": no effect)"),
             expression("Alternative distribution (H"[1]*": effect exists)"),
             "Critical value",
             expression(alpha * " (false positive)"),
             expression(beta * " (miss / Type II)"),
             "Power (correct detection)"
           ),
           col = c("#2c3e50", "#3498db", "#7f8c8d",
                   adjustcolor("#e74c3c", 0.6),
                   adjustcolor("#f39c12", 0.6),
                   adjustcolor("#2ecc71", 0.6)),
           lwd = c(2.5, 2.5, 2, 8, 8, 8),
           lty = c(1, 1, 2, 1, 1, 1))
  })

  output$results_box <- renderUI({
    v <- vals()
    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>&alpha;:</b> ", v$alpha, "<br>",
        "<b>&beta;:</b> ", round(v$beta, 3), "<br>",
        "<b>Power:</b> ", round(v$power, 3), "<br>",
        "<hr style='margin:8px 0'>",
        "<b>Effect (d):</b> ", v$d, "<br>",
        "<b>n per group:</b> ", v$n, "<br>",
        "<b>Critical z:</b> ", round(v$crit, 2)
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • Set effect = 0 and watch: the alternative distribution sits exactly on top of the null, so power collapses to \(\alpha\). Any rejection is a false positive.
  • Set effect = 0.5 with n = 20: power is low. Now slide n up — power climbs. This is why sample size matters.
  • Set n = 200 and shrink the effect toward 0: even large samples struggle to detect tiny effects.
  • Lower \(\alpha\) from 0.05 to 0.01: the critical value moves right, \(\alpha\) shrinks, but \(\beta\) grows. There’s always a tradeoff between false positives and false negatives. (You can check each of these settings with the snippet below.)
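
A quick way to check those numbers outside the app is base R’s power.t.test. It uses the t distribution rather than the normal approximation in the simulation, so its answers will be close but not identical:

# Power for the app's defaults: d = 0.5, n = 50 per group, one-sided alpha = 0.05
power.t.test(n = 50, delta = 0.5, sd = 1, sig.level = 0.05,
             type = "two.sample", alternative = "one.sided")
# reports power close to 0.80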

Minimum Detectable Effect (MDE)

When planning an experiment, you often ask: “Given my sample size, what’s the smallest effect I can reliably detect?” That’s the MDE.

It depends on three things: sample size (\(n\)), significance level (\(\alpha\)), and desired power (\(1 - \beta\)). For a one-sided two-sample z-test with equal group sizes and unit outcome SD (so the MDE is in standardized units), the formula is:

\[ \text{MDE} = (z_{1-\alpha} + z_{1-\beta}) \times \sqrt{\frac{2}{n}} \]

Notice that the \(\sqrt{2/n}\) term is just the standard error of the difference in means. So the MDE is really just a scaled-up SE:

\[ \text{MDE} = (z_{1-\alpha} + z_{1-\beta}) \times \text{SE} \]

The critical values are fixed multipliers: about 2.5 for a one-sided test at 5% significance and 80% power (the familiar ~2.8 is the two-sided version). The only thing you control is the SE — by increasing \(n\) or reducing \(\sigma\) (through better measurement, stratification, or controls). Power analysis is really just an SE calculation in disguise. See Variance, SD & Standard Error for more on this connection.

Larger \(n\) shrinks the SE, which shrinks the MDE. Demanding higher power raises the MDE unless you compensate with more \(n\).
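
Both directions of the calculation are one-liners. A sketch using the one-sided formula above (the target effect of d = 0.2 is an illustrative choice):

mult <- qnorm(1 - 0.05) + qnorm(0.80)   # fixed multiplier, about 2.49
mult * sqrt(2 / 100)                    # MDE with n = 100 per group: ~0.35
2 * (mult / 0.2)^2                      # n per group to detect d = 0.2: ~309, so 310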

#| standalone: true
#| viewerHeight: 480

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .mde-box {
      background: #eaf2f8; border-radius: 6px; padding: 16px;
      margin-top: 14px; font-size: 15px; line-height: 2;
      text-align: center;
    }
    .mde-box .big { font-size: 28px; color: #e74c3c; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("n2", "Sample size per group (n):",
                  min = 10, max = 1000, value = 100, step = 10),

      sliderInput("alpha2", HTML("&alpha;:"),
                  min = 0.01, max = 0.10, value = 0.05, step = 0.01),

      sliderInput("power2", "Desired power:",
                  min = 0.50, max = 0.95, value = 0.80, step = 0.05),

      uiOutput("mde_box")
    ),

    mainPanel(
      width = 9,
      plotOutput("mde_curve", height = "400px")
    )
  )
)

server <- function(input, output, session) {

  output$mde_curve <- renderPlot({
    alpha <- input$alpha2
    power <- input$power2
    n_now <- input$n2

    ns <- seq(10, 1000, by = 5)
    mdes <- (qnorm(1 - alpha) + qnorm(power)) * sqrt(2 / ns)

    mde_now <- (qnorm(1 - alpha) + qnorm(power)) * sqrt(2 / n_now)

    par(mar = c(4.5, 4.5, 3, 1))
    plot(ns, mdes, type = "l", lwd = 2.5, col = "#3498db",
         xlab = "Sample size per group (n)",
         ylab = "MDE (standardized effect size)",
         main = paste0("MDE curve (\u03b1 = ", alpha, ", power = ", power, ")"),
         ylim = c(0, max(mdes)))

    # Highlight current n
    points(n_now, mde_now, pch = 19, cex = 2, col = "#e74c3c")
    segments(n_now, 0, n_now, mde_now, lty = 2, col = "#e74c3c")
    segments(0, mde_now, n_now, mde_now, lty = 2, col = "#e74c3c")

    text(n_now + 30, mde_now + 0.02,
         paste0("MDE = ", round(mde_now, 3)),
         col = "#e74c3c", cex = 0.95, adj = 0)
  })

  output$mde_box <- renderUI({
    alpha <- input$alpha2
    power <- input$power2
    n_now <- input$n2
    mde <- (qnorm(1 - alpha) + qnorm(power)) * sqrt(2 / n_now)

    tags$div(class = "mde-box",
      HTML(paste0(
        "With <b>n = ", n_now, "</b> per group,<br>",
        "you can detect effects as small as:<br>",
        "<span class='big'>d = ", round(mde, 3), "</span>"
      ))
    )
  })
}

shinyApp(ui, server)

The intuition

  • MDE is your experiment’s resolution. A microscope can’t see atoms; your experiment can’t see effects smaller than the MDE.
  • More data (larger \(n\)) = sharper microscope = smaller MDE.
  • If you need to detect a 1% lift in click-through rate but your MDE is 3%, your experiment is pointless — you’ll almost certainly miss it even if the effect is real.
  • In practice: figure out the smallest effect that would matter for your decision, then compute the \(n\) needed to detect it (see the sketch below).
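
For the click-through example, the outcome is a proportion rather than a standardized d, and base R’s power.prop.test handles that directly. A sketch assuming a hypothetical 5% baseline rate and a one-percentage-point lift:

# n per group to detect a lift from 5% to 6% CTR (two-sided test by default)
power.prop.test(p1 = 0.05, p2 = 0.06, sig.level = 0.05, power = 0.80)
# reports just over 8,000 per group: small effects are expensive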

Did you know?

  • Jacob Cohen, the psychologist who popularized power analysis, found in 1962 that the median power of studies in behavioral science journals was only 0.48 — meaning most studies had less than a coin-flip chance of detecting the effects they were looking for. He spent the rest of his career trying to fix this. His book Statistical Power Analysis (1969) remains a classic.
  • Cohen’s famous effect size conventions (small = 0.2, medium = 0.5, large = 0.8) were meant as rough guides, not rigid rules. He later regretted that people treated them as gospel: “My intent was that d = 0.5 represents a medium effect… it does not mean that 0.5 is a medium effect in your field.”
  • The replication crisis in psychology and medicine is largely a power problem. Underpowered studies that happen to find significant results are published; the many more that find nothing are filed away. This is publication bias, and it’s a direct consequence of running experiments without power calculations.