# The Sampling Distribution
## What is a sampling distribution?
You run an experiment and compute a statistic — say the sample mean. You get \(\bar{x} = 4.7\). Run the experiment again with new data and you might get \(\bar{x} = 5.1\). Run it once more: \(\bar{x} = 4.3\).
The sampling distribution is the distribution of all possible values your statistic could take across every possible sample you could have drawn. It tells you: how much does my estimate bounce around due to the randomness of sampling?
You never observe it directly — you only ran the experiment once. But it’s the foundation of everything: standard errors, confidence intervals, p-values, and hypothesis tests all depend on it.
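You can, however, simulate it. Here is a minimal base-R sketch; the population (normal with mean 5 and SD 2) and the sizes are illustrative choices, not anything from a real experiment:

```r
set.seed(42)

n    <- 25     # observations per experiment
reps <- 1000   # how many times we "rerun" the experiment

# Each replicate draws a fresh sample and records its mean
xbars <- replicate(reps, mean(rnorm(n, mean = 5, sd = 2)))

hist(xbars, breaks = 40,
     main = "Simulated sampling distribution of the mean")

# The SD of these simulated means is the standard error of the mean;
# it should land close to the theoretical sigma / sqrt(n)
sd(xbars)      # empirical SE
2 / sqrt(n)    # theoretical SE = 0.4
```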
## It’s not the same as your data
This distinction is one of the most common sources of confusion in statistics:
| | Distribution of the data | Sampling distribution |
|---|---|---|
| What is it? | Histogram of individual observations | Histogram of the statistic across repeated samples |
| What does it show? | How spread out individual values are | How much the estimate varies |
| Shape | Could be anything (skewed, bimodal, etc.) | Often approximately normal (CLT) |
| Width | Depends on \(\sigma\) (population SD) | Shrinks with \(n\); for the mean it is \(\sigma / \sqrt{n}\) (the standard error) |
| Can you see it? | Yes — plot your data | No — you only have one sample |
Your data might be wildly skewed. But the sampling distribution of the mean can still be normal, because averaging smooths things out. That’s the CLT.
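You can see this in a few lines (exponential is just one convenient skewed shape; the rate and sample sizes are arbitrary):

```r
set.seed(1)

# Individual observations: strongly right-skewed
x <- rexp(10000, rate = 0.5)

# Means of repeated samples of size 50: roughly bell-shaped (CLT)
means <- replicate(2000, mean(rexp(50, rate = 0.5)))

par(mfrow = c(1, 2))
hist(x,     breaks = 60, main = "The data: skewed")
hist(means, breaks = 60, main = "Sampling dist. of the mean: ~normal")
```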
The interactive simulation below makes this concrete. The left panel shows your data (one sample). The right panel shows what happens when you repeat the experiment 1,000 times — the sampling distribution.
```{shinylive-r}
#| standalone: true
#| viewerHeight: 1050
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
selectInput("pop", "Population shape:",
choices = c("Normal",
"Uniform",
"Skewed (exponential)",
"Heavily skewed (chi-sq 2)",
"Bimodal",
"Heavy-tailed (t, df=3)")),
selectInput("stat", "Statistic:",
choices = c("Mean", "Median", "SD",
"Max", "IQR")),
sliderInput("n", "Sample size (n):",
min = 2, max = 200, value = 5, step = 1),
sliderInput("reps", "Repeated experiments:",
min = 100, max = 5000, value = 1000, step = 100),
actionButton("go", "Run experiments",
class = "btn-primary", width = "100%"),
uiOutput("results")
),
mainPanel(
width = 9,
fluidRow(
column(6, plotOutput("data_plot", height = "350px")),
column(6, plotOutput("samp_dist_plot", height = "350px"))
),
fluidRow(
column(12, plotOutput("convergence_plot", height = "450px"))
)
)
)
)
server <- function(input, output, session) {
draw_pop <- function(n, pop) {
switch(pop,
"Normal" = rnorm(n, mean = 5, sd = 2),
"Uniform" = runif(n, 0, 10),
"Skewed (exponential)" = rexp(n, rate = 0.5),
"Heavily skewed (chi-sq 2)" = rchisq(n, df = 2),
"Bimodal" = {
k <- rbinom(n, 1, 0.5)
k * rnorm(n, 2, 0.7) + (1 - k) * rnorm(n, 8, 0.7)
},
"Heavy-tailed (t, df=3)" = rt(n, df = 3) * 2 + 5
)
}
compute_stat <- function(x, stat) {
switch(stat,
"Mean" = mean(x),
"Median" = median(x),
"SD" = sd(x),
"Max" = max(x),
"IQR" = IQR(x)
)
}
dat <- reactive({
input$go  # depend on the button: clicking "Run experiments" forces a fresh draw
n <- input$n
reps <- input$reps
pop <- input$pop
stat <- input$stat
# One sample (for display)
one_sample <- draw_pop(n, pop)
one_stat <- compute_stat(one_sample, stat)
# Big population draw (to show true shape)
big_pop <- draw_pop(10000, pop)
# Repeated experiments
samp_stats <- replicate(reps, compute_stat(draw_pop(n, pop), stat))
# Check normality (Shapiro on subset)
shap_sub <- samp_stats[seq_len(min(5000, length(samp_stats)))]
shap_p <- tryCatch(shapiro.test(shap_sub)$p.value, error = function(e) NA)
list(one_sample = one_sample, one_stat = one_stat,
big_pop = big_pop, samp_stats = samp_stats,
n = n, reps = reps, stat = stat, pop = pop,
shap_p = shap_p)
})
# --- Left: data distribution ---
output$data_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
hist(d$big_pop, breaks = 60, probability = TRUE,
col = "#e74c3c20", border = "#e74c3c60",
main = paste("Population:", d$pop),
xlab = "X", ylab = "Density")
# Overlay one sample as rug
rug(d$one_sample, col = "#2c3e50", lwd = 2)
legend("topright", bty = "n", cex = 0.85,
legend = c("Population shape",
paste0("Your sample (n = ", d$n, ")")),
col = c("#e74c3c60", "#2c3e50"),
lwd = c(8, 2), lty = c(1, 1))
})
# --- Right: sampling distribution ---
output$samp_dist_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
hist(d$samp_stats, breaks = 50, probability = TRUE,
col = "#3498db30", border = "#3498db80",
main = paste0("Sampling dist. of ", d$stat, " (n = ", d$n, ")"),
xlab = paste0("Sample ", tolower(d$stat)),
ylab = "Density")
# Normal overlay
x_seq <- seq(min(d$samp_stats), max(d$samp_stats), length.out = 300)
lines(x_seq,
dnorm(x_seq, mean = mean(d$samp_stats), sd = sd(d$samp_stats)),
col = "#e74c3c", lwd = 2, lty = 2)
# Mark your one estimate
abline(v = d$one_stat, col = "#2c3e50", lwd = 2.5, lty = 1)
legend("topright", bty = "n", cex = 0.8,
legend = c("Sampling distribution",
"Normal approximation",
paste0("Your estimate: ", round(d$one_stat, 3))),
col = c("#3498db80", "#e74c3c", "#2c3e50"),
lwd = c(8, 2, 2.5), lty = c(1, 2, 1))
})
# --- Bottom: how normality improves with n ---
output$convergence_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
ns <- c(2, 5, 10, 30, 50, 100)
ns <- ns[ns <= 200]
colors <- colorRampPalette(c("#e74c3c", "#3498db"))(length(ns))
# First pass to get x range
all_stats <- list()
for (i in seq_along(ns)) {
all_stats[[i]] <- replicate(500, compute_stat(draw_pop(ns[i], d$pop), d$stat))
}
xlim <- range(unlist(all_stats))
plot(NULL, xlim = xlim, ylim = c(0.5, length(ns) * 1.5 + 1),
yaxt = "n", ylab = "", xlab = paste0("Sample ", tolower(d$stat)),
main = paste0("Sampling distribution of ", d$stat, " at different n"))
positions <- seq_along(ns) * 1.5
axis(2, at = positions, labels = paste0("n=", ns), las = 1, cex.axis = 0.9)
for (i in seq_along(ns)) {
dens <- density(all_stats[[i]])
# Scale density to fit in a strip
dens$y <- dens$y / max(dens$y) * 0.7
polygon(dens$x, dens$y + positions[i], col = adjustcolor(colors[i], 0.4),
border = colors[i], lwd = 1.5)
}
})
output$results <- renderUI({
d <- dat()
se <- sd(d$samp_stats)
normal_enough <- !is.na(d$shap_p) && d$shap_p > 0.05
tags$div(class = "stats-box",
HTML(paste0(
"<b>Your estimate:</b> ", round(d$one_stat, 4), "<br>",
"<b>SE (from sampling dist):</b> ", round(se, 4), "<br>",
"<b>Sampling dist SD:</b> ", round(sd(d$samp_stats), 4), "<br>",
"<hr style='margin:8px 0'>",
"<b>Looks normal?</b> ",
if (normal_enough)
"<span style='color:#27ae60;'>Yes (Shapiro p = " else
"<span style='color:#e74c3c;'>No (Shapiro p = ",
round(d$shap_p, 4), ")</span><br>",
"<small>Shapiro-Wilk tests whether the<br>",
"sampling distribution is normal.</small>"
))
)
})
}
shinyApp(ui, server)
```
## When can you assume normality?
The red dashed line on the right panel is the normal approximation. When it fits the histogram well, you can safely use normal-based inference (z-tests, t-tests, standard CIs).
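In practice, normal-based inference mostly means intervals of the form estimate ± z × SE. A sketch, assuming the CLT has kicked in at this \(n\) (the data here are simulated, not real):

```r
set.seed(7)
x <- rexp(50, rate = 0.5)        # one skewed sample, n = 50

xbar <- mean(x)
se   <- sd(x) / sqrt(length(x))  # plug-in standard error

# Normal-based 95% CI: valid to the extent that the sampling
# distribution of the mean is approximately normal at this n
xbar + c(-1, 1) * qnorm(0.975) * se
```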
Rules of thumb:
| Population | Statistic | n needed for normality |
|---|---|---|
| Symmetric (normal, uniform) | Mean | ~10–15 |
| Moderately skewed (exponential) | Mean | ~30 |
| Heavily skewed (chi-sq 2) | Mean | ~50–100 |
| Any shape | Mean | ~30 is the textbook answer, but check |
| Any shape | Median | Slower — needs larger n |
| Any shape | Max | Very slow — often never normal |
| Any shape | SD | Moderate — ~30–50 |
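These cutoffs are rough guides, not guarantees. If your case matters, check it directly: simulate the statistic and look at a normal QQ plot. A sketch, using the chi-squared(2) population from the table at n = 50:

```r
set.seed(3)

# Is the mean of a chi-squared(2) sample roughly normal at n = 50?
means <- replicate(2000, mean(rchisq(50, df = 2)))

qqnorm(means)
qqline(means)   # points hugging the line => approximately normal
```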
## Things to try
- Exponential, Mean, n = 5: the sampling distribution is visibly skewed. Normal approximation is poor. Shapiro test says “No.”
- Same but n = 50: now it’s nearly bell-shaped. CLT has kicked in.
- Any population, Max, n = 100: the sampling distribution of the maximum is not normal even at large n. It converges to an extreme-value distribution instead (extreme value theory), so the normal approximation fails.
- Bimodal, Mean, n = 2: the sampling distribution itself is bimodal! With only 2 observations, averaging doesn’t smooth enough.
- Bottom panel (ridgeline): watch the sampling distribution narrow and become more normal as n increases, all on one plot.
## The bottom line
The sampling distribution is not your data. It’s the distribution of your estimate. Whether it’s normal depends on three things:
- The population shape — the uglier it is, the more data you need
- The sample size — larger n → CLT → normality
- The statistic — means converge fast, medians slower, maxima barely at all
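A quick way to see the third point (standard-normal population, sizes arbitrary):

```r
set.seed(5)

means   <- replicate(2000, mean(rnorm(100)))
medians <- replicate(2000, median(rnorm(100)))
maxima  <- replicate(2000, max(rnorm(100)))

# The mean and median look bell-shaped; the max stays right-skewed
# no matter how large n gets (extreme value territory)
par(mfrow = c(1, 3))
hist(means,   breaks = 40, main = "Mean")
hist(medians, breaks = 40, main = "Median")
hist(maxima,  breaks = 40, main = "Max")
```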
When in doubt, don’t assume — bootstrap it. The bootstrap page shows you how to build the sampling distribution empirically.
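As a teaser, the core move fits in a few lines (the sample and the choice of median are made up for illustration):

```r
set.seed(11)
x <- rexp(40, rate = 0.5)   # pretend this is your one observed sample

# Resample your own data with replacement and recompute the statistic:
# the spread of these values approximates the sampling distribution
boot_medians <- replicate(2000, median(sample(x, replace = TRUE)))

hist(boot_medians, breaks = 40,
     main = "Bootstrap sampling dist. of the median")
sd(boot_medians)   # bootstrap standard error
```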
## Did you know?
- William Sealy Gosset, a chemist at the Guinness brewery, invented the t-distribution in 1908 because he was working with tiny samples of barley. Guinness didn’t let employees publish under their own names, so he used the pen name “Student” — hence “Student’s t-test.”
- Gosset’s problem: with only 3–4 observations, you can’t assume the sampling distribution is normal. The t-distribution has heavier tails to account for the extra uncertainty when \(n\) is small. As \(n\) grows, the t-distribution converges to the normal — because with enough data, the sampling distribution is normal (CLT).
- The concept of a sampling distribution was so confusing to early statisticians that R.A. Fisher reportedly said it was the most difficult idea in all of statistics to teach clearly.