The Sampling Distribution

What is a sampling distribution?

You run an experiment and compute a statistic, say the sample mean. You get \(\bar{x} = 4.7\). But if you reran the experiment with new data, you might get \(\bar{x} = 5.1\). And again: \(\bar{x} = 4.3\).

The sampling distribution is the distribution of all possible values your statistic could take across every possible sample you could have drawn. It tells you: how much does my estimate bounce around due to the randomness of sampling?

You never observe it directly — you only ran the experiment once. But it’s the foundation of everything: standard errors, confidence intervals, p-values, and hypothesis tests all depend on it.

It’s not the same as your data

This is one of the most common points of confusion in statistics:

| | Distribution of the data | Sampling distribution |
|---|---|---|
| What is it? | Histogram of individual observations | Histogram of the statistic across repeated samples |
| What does it show? | How spread out individual values are | How much the estimate varies |
| Shape | Could be anything (skewed, bimodal, etc.) | Often approximately normal (CLT) |
| Width | Depends on \(\sigma\) (population SD) | Depends on \(\sigma / \sqrt{n}\) (standard error) |
| Can you see it? | Yes: plot your data | No: you only have one sample |

Your data might be wildly skewed. But the sampling distribution of the mean can still be normal, because averaging smooths things out. That’s the CLT.
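You can check this in a few lines of base R, independently of the app below (the rate and sample sizes here are arbitrary choices for illustration): the raw draws are exponential, hence skewed, but the sample means cluster around the true mean with spread close to \(\sigma / \sqrt{n}\).

```r
# Data are skewed (exponential), but sample means behave normally.
set.seed(42)
n    <- 50       # observations per experiment
reps <- 10000    # repeated experiments

# One sample mean per experiment, each from fresh exponential data
means <- replicate(reps, mean(rexp(n, rate = 0.5)))

# Exp(rate = 0.5) has mean 2 and SD 2, so the standard error
# of the mean should be about 2 / sqrt(50) = 0.283
mean(means)    # near 2
sd(means)      # near 0.283
```

Plotting `hist(means)` next to `hist(rexp(10000, 0.5))` shows the same contrast as the two panels in the app.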

The simulation below makes this concrete. The left panel shows your data (one sample). The right panel shows what happens when you repeat the experiment 1,000 times — the sampling distribution.

#| standalone: true
#| viewerHeight: 1050

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      selectInput("pop", "Population shape:",
                  choices = c("Normal",
                              "Uniform",
                              "Skewed (exponential)",
                              "Heavily skewed (chi-sq 2)",
                              "Bimodal",
                              "Heavy-tailed (t, df=3)")),

      selectInput("stat", "Statistic:",
                  choices = c("Mean", "Median", "SD",
                              "Max", "IQR")),

      sliderInput("n", "Sample size (n):",
                  min = 2, max = 200, value = 5, step = 1),

      sliderInput("reps", "Repeated experiments:",
                  min = 100, max = 5000, value = 1000, step = 100),

      actionButton("go", "Run experiments",
                   class = "btn-primary", width = "100%"),

      uiOutput("results")
    ),

    mainPanel(
      width = 9,
      fluidRow(
        column(6, plotOutput("data_plot", height = "350px")),
        column(6, plotOutput("samp_dist_plot", height = "350px"))
      ),
      fluidRow(
        column(12, plotOutput("convergence_plot", height = "450px"))
      )
    )
  )
)

server <- function(input, output, session) {

  draw_pop <- function(n, pop) {
    switch(pop,
      "Normal" = rnorm(n, mean = 5, sd = 2),
      "Uniform" = runif(n, 0, 10),
      "Skewed (exponential)" = rexp(n, rate = 0.5),
      "Heavily skewed (chi-sq 2)" = rchisq(n, df = 2),
      "Bimodal" = {
        k <- rbinom(n, 1, 0.5)
        k * rnorm(n, 2, 0.7) + (1 - k) * rnorm(n, 8, 0.7)
      },
      "Heavy-tailed (t, df=3)" = rt(n, df = 3) * 2 + 5
    )
  }

  compute_stat <- function(x, stat) {
    switch(stat,
      "Mean"   = mean(x),
      "Median" = median(x),
      "SD"     = sd(x),
      "Max"    = max(x),
      "IQR"    = IQR(x)
    )
  }

  dat <- reactive({
    input$go  # depend on the button so each click re-runs the simulation
    n    <- input$n
    reps <- input$reps
    pop  <- input$pop
    stat <- input$stat

    # One sample (for display)
    one_sample <- draw_pop(n, pop)
    one_stat   <- compute_stat(one_sample, stat)

    # Big population draw (to show true shape)
    big_pop <- draw_pop(10000, pop)

    # Repeated experiments
    samp_stats <- replicate(reps, compute_stat(draw_pop(n, pop), stat))

    # Normality check (shapiro.test accepts at most 5000 observations)
    shap_sub <- samp_stats[seq_len(min(5000, length(samp_stats)))]
    shap_p <- tryCatch(shapiro.test(shap_sub)$p.value, error = function(e) NA)

    list(one_sample = one_sample, one_stat = one_stat,
         big_pop = big_pop, samp_stats = samp_stats,
         n = n, reps = reps, stat = stat, pop = pop,
         shap_p = shap_p)
  })

  # --- Left: data distribution ---
  output$data_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    hist(d$big_pop, breaks = 60, probability = TRUE,
         col = "#e74c3c20", border = "#e74c3c60",
         main = paste("Population:", d$pop),
         xlab = "X", ylab = "Density")

    # Overlay one sample as rug
    rug(d$one_sample, col = "#2c3e50", lwd = 2)

    legend("topright", bty = "n", cex = 0.85,
           legend = c("Population shape",
                      paste0("Your sample (n = ", d$n, ")")),
           col = c("#e74c3c60", "#2c3e50"),
           lwd = c(8, 2), lty = c(1, 1))
  })

  # --- Right: sampling distribution ---
  output$samp_dist_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    hist(d$samp_stats, breaks = 50, probability = TRUE,
         col = "#3498db30", border = "#3498db80",
         main = paste0("Sampling dist. of ", d$stat, " (n = ", d$n, ")"),
         xlab = paste0("Sample ", tolower(d$stat)),
         ylab = "Density")

    # Normal overlay
    x_seq <- seq(min(d$samp_stats), max(d$samp_stats), length.out = 300)
    lines(x_seq,
          dnorm(x_seq, mean = mean(d$samp_stats), sd = sd(d$samp_stats)),
          col = "#e74c3c", lwd = 2, lty = 2)

    # Mark your one estimate
    abline(v = d$one_stat, col = "#2c3e50", lwd = 2.5, lty = 1)

    legend("topright", bty = "n", cex = 0.8,
           legend = c("Sampling distribution",
                      "Normal approximation",
                      paste0("Your estimate: ", round(d$one_stat, 3))),
           col = c("#3498db80", "#e74c3c", "#2c3e50"),
           lwd = c(8, 2, 2.5), lty = c(1, 2, 1))
  })

  # --- Bottom: how normality improves with n ---
  output$convergence_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    ns <- c(2, 5, 10, 30, 50, 100)
    ns <- ns[ns <= 200]

    colors <- colorRampPalette(c("#e74c3c", "#3498db"))(length(ns))

    # First pass to get x range
    all_stats <- list()
    for (i in seq_along(ns)) {
      all_stats[[i]] <- replicate(500, compute_stat(draw_pop(ns[i], d$pop), d$stat))
    }
    xlim <- range(unlist(all_stats))

    plot(NULL, xlim = xlim, ylim = c(0.5, length(ns) * 1.5 + 1),
         yaxt = "n", ylab = "", xlab = paste0("Sample ", tolower(d$stat)),
         main = paste0("Sampling distribution of ", d$stat, " at different n"))
    positions <- seq_along(ns) * 1.5
    axis(2, at = positions, labels = paste0("n=", ns), las = 1, cex.axis = 0.9)

    for (i in seq_along(ns)) {
      dens <- density(all_stats[[i]])
      # Scale density to fit in a strip
      dens$y <- dens$y / max(dens$y) * 0.7
      polygon(dens$x, dens$y + positions[i], col = adjustcolor(colors[i], 0.4),
              border = colors[i], lwd = 1.5)
    }
  })

  output$results <- renderUI({
    d <- dat()
    se <- sd(d$samp_stats)
    normal_enough <- !is.na(d$shap_p) && d$shap_p > 0.05

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>Your estimate:</b> ", round(d$one_stat, 4), "<br>",
        "<b>SE (= SD of sampling dist):</b> ", round(se, 4), "<br>",
        "<hr style='margin:8px 0'>",
        "<b>Looks normal?</b> ",
        if (normal_enough)
          "<span style='color:#27ae60;'>Yes (Shapiro p = " else
          "<span style='color:#e74c3c;'>No (Shapiro p = ",
        round(d$shap_p, 4), ")</span><br>",
        "<small>Shapiro-Wilk tests whether the<br>",
        "sampling distribution is normal.</small>"
      ))
    )
  })
}

shinyApp(ui, server)

When can you assume normality?

The red dashed line on the right panel is the normal approximation. When it fits the histogram well, you can safely use normal-based inference (z-tests, t-tests, standard CIs).
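Concretely, once the normal approximation holds, a 95% confidence interval is just estimate ± 1.96 × SE. A minimal sketch (the sample and its exponential rate are illustrative, not from the app):

```r
# Normal-based 95% CI for a mean, valid when the sampling
# distribution of the mean is approximately normal (CLT).
set.seed(1)
x    <- rexp(50, rate = 0.5)       # one observed sample
xbar <- mean(x)
se   <- sd(x) / sqrt(length(x))    # estimated standard error

ci <- xbar + c(-1, 1) * qnorm(0.975) * se
ci   # interval centered on xbar, half-width about 1.96 * se
```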

Rules of thumb:

| Population | Statistic | n needed for normality |
|---|---|---|
| Symmetric (normal, uniform) | Mean | ~10–15 |
| Moderately skewed (exponential) | Mean | ~30 |
| Heavily skewed (chi-sq 2) | Mean | ~50–100 |
| Any shape | Mean | ~30 is the textbook answer, but check |
| Any shape | Median | Slower: needs larger n |
| Any shape | Max | Very slow: often never normal |
| Any shape | SD | Moderate: ~30–50 |
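One way to sanity-check these thumb rules yourself is to measure the skewness of the simulated sampling distribution; for the mean of exponential data it shrinks roughly like \(2/\sqrt{n}\). A rough sketch (the `skew` helper is an ad hoc moment estimator, not a library function):

```r
# Skewness of the sampling distribution of the mean, exponential data
skew <- function(z) mean((z - mean(z))^3) / sd(z)^3   # crude estimator

set.seed(99)
s5  <- skew(replicate(5000, mean(rexp(5,  rate = 0.5))))   # ~2/sqrt(5)
s50 <- skew(replicate(5000, mean(rexp(50, rate = 0.5))))   # ~2/sqrt(50)

c(n5 = s5, n50 = s50)   # skewness shrinks as n grows
```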

Things to try

  • Exponential, Mean, n = 5: the sampling distribution is visibly skewed. Normal approximation is poor. Shapiro test says “No.”
  • Same but n = 50: now it’s nearly bell-shaped. CLT has kicked in.
  • Any population, Max, n = 100: the sampling distribution of the maximum is not normal even with large n. It has its own distribution (extreme value theory). Normal approximation fails.
  • Bimodal, Mean, n = 2: the sampling distribution itself is bimodal! With only 2 observations, averaging doesn’t smooth enough.
  • Bottom panel (ridgeline): watch the sampling distribution narrow and become more normal as n increases, all on one plot.
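The Max bullet is easy to verify directly outside the app (the uniform bounds and sample size below are arbitrary):

```r
# The sample maximum never normalizes: for Uniform(0, 10) data it
# piles up against 10 and stays left-skewed at any n.
set.seed(7)
maxima <- replicate(10000, max(runif(100, 0, 10)))

mean(maxima) < median(maxima)            # TRUE: left skew
shapiro.test(maxima[1:5000])$p.value     # essentially zero
```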

The bottom line

The sampling distribution is not your data. It’s the distribution of your estimate. Whether it’s normal depends on three things:

  1. The population shape — the uglier it is, the more data you need
  2. The sample size — larger n → CLT → normality
  3. The statistic — means converge fast, medians slower, maxima barely at all

When in doubt, don’t assume — bootstrap it. The bootstrap page shows you how to build the sampling distribution empirically.
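A minimal version of that idea, resampling a single observed sample with replacement (sample size, rate, and replicate count are illustrative):

```r
# Nonparametric bootstrap: approximate the sampling distribution
# of the median using only the one sample you actually have.
set.seed(123)
x <- rexp(40, rate = 0.5)   # the single observed sample

boot_meds <- replicate(2000, median(sample(x, replace = TRUE)))

sd(boot_meds)                          # bootstrap standard error
quantile(boot_meds, c(0.025, 0.975))   # percentile 95% CI
```

The histogram of `boot_meds` plays the role of the blue panel in the app, built without ever re-running the experiment.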


Did you know?

  • William Sealy Gosset, a chemist at the Guinness brewery, invented the t-distribution in 1908 because he was working with tiny samples of barley. Guinness didn’t let employees publish under their own names, so he used the pen name “Student” — hence “Student’s t-test.”
  • Gosset’s problem: with only 3–4 observations, you can’t assume the sampling distribution is normal. The t-distribution has heavier tails to account for the extra uncertainty when \(n\) is small. As \(n\) grows, the t-distribution converges to the normal — because with enough data, the sampling distribution is normal (CLT).
  • The concept of a sampling distribution was so confusing to early statisticians that R.A. Fisher reportedly said it was the most difficult idea in all of statistics to teach clearly.
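You can see that convergence directly in the critical values, with nothing beyond base R:

```r
# t critical values approach the normal's 1.96 as df grows
qt(0.975, df = c(3, 10, 30, 100))   # 3.18, 2.23, 2.04, 1.98
qnorm(0.975)                        # 1.96
```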