p-values, Confidence Intervals & What They Actually Mean

First: estimation. Then: inference.

These are two separate steps, and confusing them is where most misunderstanding begins.

Step 1 — Estimation is about computing numbers from your data:

  • You collect a sample and calculate a point estimate — your best guess at the true parameter. For example, \(\hat{\beta} = -5\) or \(\bar{x} = 12.3\).
  • You also calculate a standard error (SE) — how much that estimate would bounce around if you repeated the experiment. A small SE means your estimate is precise; a large SE means it’s noisy.

Estimation gives you a number. It does not tell you what to conclude.

Step 2 — Inference is about drawing conclusions from those numbers:

  • Is the effect real, or could it be noise? → p-value
  • What range of values is plausible for the true parameter? → confidence interval
  • Both are built from the same two ingredients: the estimate and its SE.

\[\text{test statistic} = \frac{\text{estimate}}{\text{SE}} \qquad \qquad \text{CI} = \text{estimate} \pm 1.96 \times \text{SE}\]

Everything on this page is Step 2. If the estimation step is wrong (biased estimate, wrong SE), then the inference is wrong too — no matter how sophisticated the test. Good inference starts with good estimation.
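To make the split concrete, here is a minimal R sketch. The estimate (\(-5\)) and its SE (2.0) are hypothetical stand-ins for whatever Step 1 produced; Step 2 just combines them, taking the null value to be 0 as in the formulas above.

# Step 1 output (hypothetical numbers): a point estimate and its standard error
estimate <- -5
se       <- 2.0

# Step 2: inference, built from those two ingredients (null value = 0)
z  <- (estimate - 0) / se              # test statistic
p  <- 2 * pnorm(-abs(z))               # two-sided p-value: ~0.012
ci <- estimate + c(-1, 1) * 1.96 * se  # 95% CI: (-8.92, -1.08)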

The Oracle View. In these simulations, we set the true \(\mu\) — so we know whether \(H_0\) is true, whether each CI covers the truth, and whether each rejection is a correct detection or a false alarm. In practice, you never know if \(H_0\) is true. That’s the whole point of the test.


Part 1: p-values

What is a test statistic?

Before we can talk about p-values, we need the test statistic — a single number that measures how far your estimate is from what the null hypothesis predicts, in units of standard error.

The most common one is the z-statistic (or t-statistic for small samples):

\[z = \frac{\bar{x} - \mu_0}{\text{SE}} = \frac{\text{estimate} - \text{null value}}{\text{standard error}}\]

It answers: “How many standard errors is my estimate from the null?” A \(z\) of 2 means your estimate is 2 SEs away from what H₀ predicts — that’s unusual enough to start doubting H₀.
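For instance (a quick sketch with made-up data, separate from the simulations below):

set.seed(1)
x   <- rnorm(30, mean = 0.4, sd = 1)   # a made-up sample of n = 30
mu0 <- 0                               # value claimed by H0
se  <- sd(x) / sqrt(length(x))         # standard error of the mean
z   <- (mean(x) - mu0) / se            # how many SEs is xbar from the null?
z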

What a p-value actually is

The “p” stands for probability.

The p-value is the probability of getting a test statistic at least as extreme as yours (as far from the null value, or farther), assuming the null hypothesis is true.

That’s it. Large \(|z|\) → small p-value → more evidence against H₀.

It is not:

  • The probability that H₀ is true
  • The probability that you made a mistake
  • The probability that the result will replicate

It’s a statement about the data, not about the hypothesis. The p-value asks: “If there really were no effect, how surprising would my data be?”
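In R, turning a test statistic into a two-sided p-value is one line; this is exactly what the simulations below do (the z value here is just an example):

z <- 2.1                  # your test statistic
p <- 2 * pnorm(-abs(z))   # P(|Z| >= 2.1) when Z ~ N(0, 1) under H0
p                         # about 0.036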

A concrete example

Say you’re testing whether a drug lowers blood pressure. You run a regression and get \(\hat{\beta} = -5\) (blood pressure drops 5 points).

H₀: \(\beta = 0\) — the drug does nothing.

The p-value asks: if the drug truly does nothing (\(\beta = 0\)), what’s the probability of observing \(\hat{\beta}\) as extreme as \(-5\) or more?

p = 0.03 means: if the drug truly did nothing, there would be only a 3% chance of seeing an estimate this extreme purely from random noise. So either:

  1. H₀ is true and you got unlucky (3% chance), or
  2. H₀ is false — the drug actually works

You reject H₀ because 3% feels too unlikely to be just noise.

But the p-value never tells you \(\beta\)’s actual value. It only tells you the data would be surprising if \(\beta\) were exactly zero. This matters:

  • p = 0.03 with \(\hat{\beta} = -5\) → probably a real, meaningful effect
  • p = 0.03 with \(\hat{\beta} = -0.001\) and \(n = 10{,}000{,}000\) → “statistically significant” but practically meaningless

That’s why you should always look at the estimate (\(\hat{\beta}\)) and confidence interval together with the p-value — not just whether p < 0.05.
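A sketch of that contrast (the SEs are hypothetical, chosen so both cases land at p ≈ 0.03):

p_from <- function(beta_hat, se) 2 * pnorm(-abs(beta_hat / se))

# Large effect, moderate SE (e.g. a modest trial): significant AND meaningful
p_from(beta_hat = -5, se = 2.3)          # ~0.03

# Tiny effect, microscopic SE (huge n): significant but practically meaningless
p_from(beta_hat = -0.001, se = 0.00046)  # ~0.03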

Simulation: The p-value machine

Draw a sample, compute a test statistic, and see where it falls on the null distribution. The shaded area is the p-value. Under H₀, p-values are uniformly distributed — every value between 0 and 1 is equally likely.

#| standalone: true
#| viewerHeight: 680

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
    .good { color: #27ae60; font-weight: bold; }
    .bad  { color: #e74c3c; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("n", "Sample size (n):",
                  min = 10, max = 200, value = 30, step = 5),

      sliderInput("true_mu", HTML("True &mu; (for data generation):"),
                  min = -1, max = 1, value = 0, step = 0.1),

      sliderInput("reps", "Number of experiments:",
                  min = 100, max = 2000, value = 500, step = 100),

      helpText("Set true \u03bc = 0 to see p-values under H\u2080.
               Set it away from 0 to see p-values under H\u2081."),

      actionButton("go", "Run experiments",
                   class = "btn-primary", width = "100%"),

      uiOutput("results")
    ),

    mainPanel(
      width = 9,
      fluidRow(
        column(6, plotOutput("null_dist", height = "400px")),
        column(6, plotOutput("pval_hist", height = "400px"))
      )
    )
  )
)

server <- function(input, output, session) {

  dat <- reactive({
    input$go
    n      <- input$n
    mu     <- input$true_mu
    reps   <- input$reps

    # Draw one sample for display; sigma = 1 is known and H0 is mu = 0,
    # so z = (xbar - 0) / (sigma / sqrt(n)) = xbar / (1 / sqrt(n))
    one_sample <- rnorm(n, mean = mu, sd = 1)
    one_z <- mean(one_sample) / (1 / sqrt(n))
    one_p <- 2 * pnorm(-abs(one_z))      # two-sided p-value

    # Repeat the whole experiment many times and collect z and p
    z_stats <- replicate(reps, {
      x <- rnorm(n, mean = mu, sd = 1)
      mean(x) / (1 / sqrt(n))
    })
    p_vals <- 2 * pnorm(-abs(z_stats))

    reject_rate <- mean(p_vals < 0.05)

    list(one_z = one_z, one_p = one_p, z_stats = z_stats, p_vals = p_vals,
         n = n, mu = mu, reps = reps, reject_rate = reject_rate)
  })

  output$null_dist <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    x <- seq(-4, 4, length.out = 300)
    y <- dnorm(x)

    plot(x, y, type = "l", lwd = 2.5, col = "#2c3e50",
         xlab = "Test statistic (z)", ylab = "Density",
         main = "Null distribution & your test statistic")

    # Shade p-value area (two-tailed)
    z_abs <- abs(d$one_z)
    if (z_abs < 4) {
      x_right <- seq(z_abs, 4, length.out = 100)
      polygon(c(z_abs, x_right, 4), c(0, dnorm(x_right), 0),
              col = adjustcolor("#e74c3c", 0.4), border = NA)
      x_left <- seq(-4, -z_abs, length.out = 100)
      polygon(c(-4, x_left, -z_abs), c(0, dnorm(x_left), 0),
              col = adjustcolor("#e74c3c", 0.4), border = NA)
    }

    abline(v = d$one_z, lwd = 2.5, col = "#3498db", lty = 1)

    legend("topright", bty = "n", cex = 0.85,
           legend = c("Null distribution (H\u2080)",
                      paste0("Your z = ", round(d$one_z, 3)),
                      paste0("p-value = ", round(d$one_p, 4))),
           col = c("#2c3e50", "#3498db", adjustcolor("#e74c3c", 0.6)),
           lwd = c(2.5, 2.5, 8), lty = c(1, 1, 1))
  })

  output$pval_hist <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    hist(d$p_vals, breaks = 20, probability = TRUE,
         col = "#3498db30", border = "#3498db80",
         main = paste0("Distribution of p-values (",
                       d$reps, " experiments)"),
         xlab = "p-value", ylab = "Density",
         xlim = c(0, 1))

    if (abs(d$mu) < 0.001) {
      abline(h = 1, lty = 2, lwd = 2, col = "#e74c3c")
      legend("topright", bty = "n", cex = 0.85,
             legend = c("p-value histogram",
                        "Uniform(0,1) reference"),
             col = c("#3498db80", "#e74c3c"),
             lwd = c(8, 2), lty = c(1, 2))
    } else {
      legend("topright", bty = "n", cex = 0.85,
             legend = paste0("Under H\u2081 (\u03bc = ", d$mu, "):\n",
                            "p-values pile up near 0"))
    }
  })

  output$results <- renderUI({
    d <- dat()

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>This sample:</b><br>",
        "z = ", round(d$one_z, 3), "<br>",
        "p = ", round(d$one_p, 4), " ",
        ifelse(d$one_p < 0.05,
               "<span class='bad'>(&lt; 0.05)</span>",
               "<span class='good'>(\u2265 0.05)</span>"), "<br>",
        "<hr style='margin:8px 0'>",
        "<b>Across ", d$reps, " experiments:</b><br>",
        "Rejection rate: ", round(d$reject_rate * 100, 1), "%<br>",
        "<small>",
        ifelse(abs(d$mu) < 0.001,
               "Under H\u2080, this should be ~5%.",
               paste0("Under H\u2081, this is the power.")),
        "</small>"
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • True \(\mu\) = 0: p-values are uniform — flat histogram. This is the defining property of a valid test. About 5% of p-values fall below 0.05 purely by chance.
  • True \(\mu\) = 0.5: p-values pile up near 0. The further \(\mu\) is from 0, the more powerful the test, and the smaller the p-values tend to be.

Part 2: Confidence intervals

What a confidence interval actually is

A 95% CI is constructed by a procedure that, in repeated sampling, captures the true parameter 95% of the time.

It is not:

  • A 95% probability that the true parameter is inside this particular interval
  • A range where 95% of the data falls
  • A range where 95% of sample means fall

Once you compute a specific CI, the true parameter is either in it or it isn’t — there’s no probability about it. The 95% refers to the method, not to any single interval.

How a CI is constructed

The recipe is simple. You need three ingredients:

  1. A point estimate — your best guess (e.g., \(\bar{x}\) or \(\hat{\beta}\))
  2. A standard error — how much the estimate bounces around due to sampling
  3. A critical value — how many SEs to go out for your desired confidence level

The formula for a 95% CI for a mean:

\[\text{CI} = \bar{x} \pm z_{0.975} \times \text{SE} = \bar{x} \pm 1.96 \times \frac{s}{\sqrt{n}}\]

That’s it: estimate ± margin of error. The margin of error is just a scaled-up standard error.

  Confidence level    Critical value (\(z\))    Margin of error
  90%                 1.645                     Narrower CI
  95%                 1.960                     Standard
  99%                 2.576                     Wider CI

Where does the 1.96 come from? The middle 95% of a standard normal distribution falls between \(-1.96\) and \(+1.96\). So if the sampling distribution of \(\bar{x}\) is approximately normal (CLT), going out 1.96 SEs in each direction captures the true mean 95% of the time.

For small samples (roughly \(n < 30\)), replace the \(z\) critical value with a \(t\) critical value from the \(t\)-distribution with \(n - 1\) degrees of freedom. The \(t\)-distribution has heavier tails, making the CI wider to account for the extra uncertainty in estimating the SE from small samples. As \(n\) grows, the \(t\) and \(z\) values converge.

The key insight: a CI is wide when (a) the data is noisy (large \(s\)), (b) the sample is small (small \(n\)), or (c) you demand more confidence (larger critical value). You can’t have precision, small samples, and high confidence all at once — pick two.
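Here is the recipe in R (a minimal sketch with simulated data; the true mean of 5 and sd of 2 are arbitrary choices, matching the simulation below):

set.seed(42)
x  <- rnorm(25, mean = 5, sd = 2)   # simulated data, n = 25
n  <- length(x)
se <- sd(x) / sqrt(n)

# Critical values: z from the normal, t from the t-distribution with n - 1 df
z_crit <- qnorm(0.975)              # 1.96
t_crit <- qt(0.975, df = n - 1)     # about 2.06 for n = 25; approaches 1.96 as n grows

mean(x) + c(-1, 1) * z_crit * se    # 95% CI using z
mean(x) + c(-1, 1) * t_crit * se    # slightly wider 95% CI using t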

Simulation: CI coverage

Draw 100 confidence intervals. Each one either contains the true \(\mu\) (blue) or misses it (red). Over many experiments, about 95% contain \(\mu\) — but any single CI either does or doesn’t. The 95% is about the procedure, not about your interval.

#| standalone: true
#| viewerHeight: 680

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
    .good { color: #27ae60; font-weight: bold; }
    .bad  { color: #e74c3c; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("n2", "Sample size (n):",
                  min = 5, max = 100, value = 25, step = 5),

      sliderInput("conf", "Confidence level:",
                  min = 0.80, max = 0.99, value = 0.95, step = 0.01),

      sliderInput("n_ci", "Number of CIs to draw:",
                  min = 20, max = 200, value = 100, step = 10),

      actionButton("go2", "Draw new CIs",
                   class = "btn-primary", width = "100%"),

      uiOutput("results2")
    ),

    mainPanel(
      width = 9,
      plotOutput("ci_plot", height = "550px")
    )
  )
)

server <- function(input, output, session) {

  dat <- reactive({
    input$go2
    n     <- input$n2
    conf  <- input$conf
    n_ci  <- input$n_ci
    mu    <- 5  # true mean
    sigma <- 2

    # Two-sided critical value for the chosen confidence level
    # (e.g. qnorm(0.975) = 1.96 for a 95% CI)
    z <- qnorm(1 - (1 - conf) / 2)

    # Run n_ci independent experiments: compute each CI and record
    # whether it covers the true mu
    results <- t(replicate(n_ci, {
      x <- rnorm(n, mean = mu, sd = sigma)
      xbar <- mean(x)
      se <- sd(x) / sqrt(n)
      lo <- xbar - z * se
      hi <- xbar + z * se
      c(xbar = xbar, lo = lo, hi = hi, covers = (lo <= mu & mu <= hi))
    }))

    list(results = results, mu = mu, n = n, conf = conf, n_ci = n_ci)
  })

  output$ci_plot <- renderPlot({
    d <- dat()
    res <- d$results
    n_ci <- d$n_ci
    par(mar = c(4.5, 4.5, 3, 1))

    covers <- as.logical(res[, "covers"])
    cols <- ifelse(covers, "#3498db", "#e74c3c")

    plot(NULL, xlim = range(res[, c("lo", "hi")]),
         ylim = c(0.5, n_ci + 0.5),
         xlab = expression("Value of " * mu),
         ylab = "Experiment number",
         main = paste0(n_ci, " Confidence Intervals (",
                       d$conf * 100, "% level)"),
         yaxt = "n")

    # Draw CIs
    segments(res[, "lo"], seq_len(n_ci), res[, "hi"], seq_len(n_ci),
             col = cols, lwd = 1.5)
    points(res[, "xbar"], seq_len(n_ci), pch = 16, cex = 0.5, col = cols)

    # True mu
    abline(v = d$mu, lty = 2, lwd = 2.5, col = "#2c3e50")

    n_miss <- sum(!covers)
    legend("topright", bty = "n", cex = 0.9,
           legend = c(paste0("Contains \u03bc (", sum(covers), ")"),
                      paste0("Misses \u03bc (", n_miss, ")"),
                      paste0("True \u03bc = ", d$mu)),
           col = c("#3498db", "#e74c3c", "#2c3e50"),
           lwd = c(3, 3, 2.5), lty = c(1, 1, 2))
  })

  output$results2 <- renderUI({
    d <- dat()
    covers <- as.logical(d$results[, "covers"])
    cover_rate <- mean(covers)

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>Coverage rate:</b> <span class='",
        ifelse(abs(cover_rate - d$conf) < 0.05, "good", "bad"), "'>",
        round(cover_rate * 100, 1), "%</span><br>",
        "<b>Target:</b> ", d$conf * 100, "%<br>",
        "<b>Missed:</b> ", sum(!covers), " of ", d$n_ci, "<br>",
        "<hr style='margin:8px 0'>",
        "<small>Each red line is a CI that<br>",
        "does not contain the true \u03bc.<br>",
        "The method gets it right ~",
        d$conf * 100, "% of the time.</small>"
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • 95% level: roughly 5 out of 100 CIs miss \(\mu\) (shown in red). Click “Draw new CIs” several times — the exact number varies, but it averages to 5%.
  • Change confidence to 99%: fewer CIs miss \(\mu\), but each CI is wider. There’s always a tradeoff between coverage and precision.
  • Small n (n = 5): the CIs are very wide. With little data, you can’t be precise. This is the price of honesty.

The bottom line

  • A p-value is not the probability the null is true. It’s the probability of seeing data this extreme if the null were true.
  • A confidence interval is not a probability statement about \(\mu\). It’s a statement about the procedure: if you repeated the experiment many times, 95% of your CIs would contain \(\mu\).
  • Both concepts are statements about long-run frequency, not about any single experiment. This is what makes frequentist inference subtle — and why Bayesian methods (which can make probability statements about parameters) are appealing to many.

Did you know?

  • The p-value was popularized by R.A. Fisher in the 1920s as an informal measure of evidence against the null. But the rigid “reject if p < 0.05” framework came from Jerzy Neyman and Egon Pearson in the 1930s. Fisher and Neyman-Pearson fundamentally disagreed about what statistical testing means, and the debate was never resolved — we use an awkward hybrid of both frameworks to this day.
  • In 2016, the American Statistical Association issued its first-ever formal statement on p-values, warning against common misinterpretations. Statement #2: “P-values do not measure the probability that the studied hypothesis is true.”
  • Confidence intervals were invented by Neyman in 1937. He was explicit that the 95% refers to the procedure, not to any single interval. He wrote: “It is not possible to say that the probability of the true value falling in any particular interval is 95%.”