Variance, SD & Standard Error
The Big Picture
Imagine you’re measuring the heights of students in a class. Some are tall, some are short — there’s spread. Statistics gives us precise language for that spread:
| Concept | What it measures | Formula |
|---|---|---|
| Variance | Average squared distance from the mean | \(\sigma^2 = \frac{1}{N}\sum(x_i - \mu)^2\) |
| Standard Deviation (SD) | Typical distance from the mean (in original units) | \(\sigma = \sqrt{\sigma^2}\) |
| Standard Error (SE) | How much the sample mean bounces around | \(SE = \frac{\sigma}{\sqrt{n}}\) |
The key insight: SD measures spread in your data. SE measures spread in your estimates.
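In R, all three are one-liners. Here is a quick sketch with made-up heights (the numbers are invented for illustration):

```r
# A made-up sample of student heights (cm)
heights <- c(160, 172, 155, 168, 180, 165, 175, 170)
n <- length(heights)

var(heights)            # sample variance (cm^2); R divides by n - 1
sd(heights)             # standard deviation (cm); sqrt of the variance
sd(heights) / sqrt(n)   # standard error of the mean (cm)
```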
Variance: Measuring Spread
Variance answers: “On average, how far are values from the center?”
We square the deviations because:
- Positive and negative deviations would cancel out otherwise
- It penalizes large deviations more than small ones
- It has nice mathematical properties (variances add for independent variables; quick check below)
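That last property is easy to verify by simulation: for independent variables, variances add, but SDs do not. A sketch with arbitrary SDs of 3 and 4:

```r
set.seed(1)
x <- rnorm(1e5, sd = 3)   # Var(x) = 9
y <- rnorm(1e5, sd = 4)   # Var(y) = 16
var(x + y)                # ~25 = 9 + 16: variances add
sd(x + y)                 # ~5, not 3 + 4 = 7: SDs do not
```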
Population Variance vs Sample Variance
Here’s where it gets tricky. There are two formulas:
Population variance (when you have the entire population): \[\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2\]
Sample variance (when you have a sample from a larger population): \[s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\]
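Note that R's built-in var() implements the sample formula. A small sketch with arbitrary numbers shows the two versions side by side:

```r
x <- c(4, 8, 6, 5, 3, 7)
n <- length(x)
sq_devs <- (x - mean(x))^2    # squared deviations from the sample mean

sum(sq_devs) / n         # population-style formula: 2.92 (biased for a sample)
sum(sq_devs) / (n - 1)   # sample formula: 3.5
var(x)                   # 3.5, because R uses n - 1
```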
Why n - 1? (Bessel’s Correction)
When you compute deviations from the sample mean \(\bar{x}\) instead of the true mean \(\mu\), you’re using the data twice — once to compute \(\bar{x}\), then again to compute deviations from it. The sample mean is always closer to the data points than the true mean would be, so dividing by \(n\) systematically underestimates the true variance.
Think of it this way: if you have \(n = 1\) data point, the sample mean equals that point, so the deviation is 0. But that doesn’t mean there’s no variability in the population! Dividing by \(n - 1 = 0\) makes the formula undefined, which correctly tells you: you can’t estimate spread from a single observation.
The \(n - 1\) is called the degrees of freedom — you “used up” one degree of freedom estimating the mean.
Try the simulation below to see this in action:
#| standalone: true
#| viewerHeight: 750
library(shiny)
ui <- fluidPage(
titlePanel("Why n - 1?"),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("pop_sd", "Population SD (\u03c3):", min = 2, max = 25, value = 10, step = 1),
sliderInput("samp_n", "Sample size (n):", min = 2, max = 100, value = 10),
sliderInput("n_reps", "Number of samples:", min = 100, max = 2000, value = 500, step = 100),
hr(),
h4("Results"),
uiOutput("results_box"),
hr(),
p(strong("Bias check:"), "If the average of many sample estimates equals the true value, the estimator is unbiased."),
p("Notice: dividing by n underestimates. Dividing by n-1 gets it right on average.")
),
mainPanel(
width = 9,
plotOutput("dist_plot", height = "600px")
)
)
)
server <- function(input, output, session) {
sim <- reactive({
mu <- 50
sigma <- input$pop_sd
n <- input$samp_n
n_reps <- input$n_reps
var_n <- numeric(n_reps)
var_n1 <- numeric(n_reps)
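# For each replicate: draw one sample, then compute the variance both ways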
for (i in 1:n_reps) {
samp <- rnorm(n, mu, sigma)
xbar <- mean(samp)
devs <- (samp - xbar)^2
var_n[i] <- sum(devs) / n
var_n1[i] <- sum(devs) / (n - 1)
}
list(var_n = var_n, var_n1 = var_n1, true_var = sigma^2, n = n)
})
output$dist_plot <- renderPlot({
v <- sim()
par(mfrow = c(2, 1), mar = c(4.5, 4.5, 3, 1))
xlims <- range(c(v$var_n, v$var_n1))
hist(v$var_n, breaks = 40, col = adjustcolor("#3498db", 0.6), border = "white",
main = paste0("Divide by n (biased) \u2014 n = ", v$n),
xlab = "Estimated Variance", xlim = xlims, cex.main = 1.3)
abline(v = v$true_var, col = "#e74c3c", lwd = 2.5, lty = 2)
abline(v = mean(v$var_n), col = "#2c3e50", lwd = 2)
legend("topright", legend = c(paste0("True = ", v$true_var),
paste0("Avg = ", round(mean(v$var_n), 2))),
col = c("#e74c3c", "#2c3e50"), lwd = 2, lty = c(2, 1), bty = "n")
hist(v$var_n1, breaks = 40, col = adjustcolor("#2ecc71", 0.6), border = "white",
main = "Divide by n\u22121 (unbiased)",
xlab = "Estimated Variance", xlim = xlims, cex.main = 1.3)
abline(v = v$true_var, col = "#e74c3c", lwd = 2.5, lty = 2)
abline(v = mean(v$var_n1), col = "#2c3e50", lwd = 2)
legend("topright", legend = c(paste0("True = ", v$true_var),
paste0("Avg = ", round(mean(v$var_n1), 2))),
col = c("#e74c3c", "#2c3e50"), lwd = 2, lty = c(2, 1), bty = "n")
})
output$results_box <- renderUI({
v <- sim()
tags$div(
style = "background:#f0f4f8; border-radius:6px; padding:12px; font-size:14px; line-height:1.9;",
HTML(paste0(
"<b>True variance:</b> ", v$true_var, "<br>",
"<b>Avg (/ n):</b> ", round(mean(v$var_n), 2), "<br>",
"<b>Avg (/ n\u22121):</b> ", round(mean(v$var_n1), 2)
))
)
})
}
shinyApp(ui, server)
Standard Deviation: Back to Original Units
Variance is in squared units. If heights are in cm, variance is in cm². That’s hard to interpret.
Standard deviation = \(\sqrt{\text{variance}}\) — it puts spread back in the original units.
\[\sigma = \sqrt{\sigma^2} \qquad \text{(population)} \qquad\qquad s = \sqrt{s^2} \qquad \text{(sample)}\]
Rules of thumb (for roughly normal data; checked with pnorm() below):
- ~68% of data falls within ±1 SD of the mean
- ~95% of data falls within ±2 SD of the mean
- ~99.7% falls within ±3 SD
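These percentages come straight from the normal distribution, and pnorm() confirms them:

```r
# P(within k SDs of the mean) for a normal distribution
sapply(1:3, function(k) pnorm(k) - pnorm(-k))
# 0.6826895 0.9544997 0.9973002  ->  ~68%, ~95%, ~99.7%
```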
Standard Error: The SD of Your Estimate
Here’s where people get confused. The standard error is not about your data — it’s about your estimate.
If you took many samples of size \(n\) and computed the mean each time, those means would form a distribution (the sampling distribution). The standard deviation of that distribution is the standard error:
\[SE(\bar{x}) = \frac{\sigma}{\sqrt{n}}\]
In practice, we don’t know \(\sigma\), so we plug in \(s\):
\[\widehat{SE}(\bar{x}) = \frac{s}{\sqrt{n}}\]
Why does SE = SD / √n? The Intuition
Imagine you want to know the average commute time in your city. You ask one person — they say 45 minutes. But that’s just one person. Maybe they live far away. Your estimate is noisy — it could easily be off by 20 minutes (the SD of commute times).
Now you ask 4 people and average: 45, 25, 60, 30 → mean = 40 minutes. Some are high, some are low — the errors partially cancel out. Your estimate is less noisy. But not 4× less noisy — only 2× less noisy (because \(\sqrt{4} = 2\)).
Ask 100 people: the cancellation is even stronger. Your average is now 10× more precise than a single observation (\(\sqrt{100} = 10\)).
Why square root and not just n? Because the errors don’t perfectly cancel — they’re random, so sometimes more are high than low. The cancellation is partial. Mathematically, independent errors cancel at the rate of \(\sqrt{n}\), not \(n\). If errors perfectly cancelled, the SE would be \(\sigma / n\) and you’d barely need any data. If they didn’t cancel at all, the SE would stay at \(\sigma\) forever and more data would be useless. The \(\sqrt{n}\) is the sweet spot of reality.
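The commute story is easy to replay in code. A sketch assuming normally distributed commute times with mean 40 and SD 20 (both invented):

```r
set.seed(42)
# SD of the average of n observations, estimated from 10,000 repetitions
sd_of_avg <- function(n) sd(replicate(1e4, mean(rnorm(n, mean = 40, sd = 20))))

sd_of_avg(1)     # ~20 : one person is as noisy as the population
sd_of_avg(4)     # ~10 : four people -> 2x less noise (sqrt(4) = 2)
sd_of_avg(100)   # ~2  : 100 people -> 10x less noise (sqrt(100) = 10)
```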
The Math Behind It
Each observation \(x_i\) is an independent draw with variance \(\sigma^2\). What does that actually mean?
The population has a distribution with some spread — that spread is \(\sigma^2\). When you sample one person, you don’t know what you’ll get. You might get someone with a long commute or a short one. Before you observe them, that measurement is uncertain — it could land anywhere in the population distribution. That uncertainty is \(\sigma^2\). Every observation carries the same amount of uncertainty because every observation comes from the same population.
“Independent” means knowing what the first person said tells you nothing about what the second person will say. Each person is a fresh roll of the dice. So you’re averaging \(n\) equally-noisy, independent measurements — and the question is: how much noise survives the averaging?
The sample mean is:
\[\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i\]
Step 1: The variance of a sum of independent random variables equals the sum of their variances. Each \(x_i\) contributes \(\sigma^2\):
\[\text{Var}\left(\sum_{i=1}^n x_i\right) = \sigma^2 + \sigma^2 + \cdots + \sigma^2 = n\sigma^2\]
The noise grows with \(n\) — more observations means more total variability in the sum. But we don’t want the sum, we want the average.
Step 2: The mean divides the sum by \(n\). When you multiply a random variable by a constant \(c\), its variance gets multiplied by \(c^2\) (because variance is in squared units):
\[\text{Var}(cX) = c^2 \cdot \text{Var}(X)\]
So dividing by \(n\) means multiplying by \(1/n\), which divides the variance by \(n^2\):
\[\text{Var}(\bar{x}) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}\]
Step 3: Take the square root to get back to original units:
\[SE = \sqrt{\text{Var}(\bar{x})} = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}\]
The sum’s noise grows as \(n\), but dividing by \(n\) shrinks it by \(n^2\). Net effect: noise shrinks by \(n^2 / n = n\), and the square root gives us \(\sqrt{n}\). Averaging \(n\) independent measurements reduces noise by \(\sqrt{n}\).
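Every step can be checked numerically; here is a sketch with an arbitrary \(\sigma = 10\) and \(n = 25\):

```r
set.seed(7)
sigma <- 10; n <- 25
xs <- matrix(rnorm(n * 1e5, sd = sigma), ncol = n)   # 100,000 samples of size n

var(rowSums(xs))    # ~2500 = n * sigma^2      (Step 1: variances add)
var(rowMeans(xs))   # ~4    = sigma^2 / n      (Step 2: dividing by n)
sd(rowMeans(xs))    # ~2    = sigma / sqrt(n)  (Step 3: the SE)
```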
The Key Relationship
\[\boxed{SE = \frac{SD}{\sqrt{n}}}\]
This single formula connects everything:
- SD tells you how spread out individual data points are
- n is your sample size
- SE tells you how precisely you’ve estimated the mean
- More data → smaller SE → more precise estimate
#| standalone: true
#| viewerHeight: 800
library(shiny)
ui <- fluidPage(
titlePanel("SD vs SE: What's the Difference?"),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("true_sd", "Population SD:", min = 5, max = 40, value = 15, step = 1),
sliderInput("n_obs", "Sample size (n):", min = 5, max = 200, value = 25),
sliderInput("n_samples", "Samples to draw:", min = 50, max = 1000, value = 300, step = 50),
hr(),
h4("Theory"),
uiOutput("theory_box"),
hr(),
h4("Simulation"),
uiOutput("sim_box")
),
mainPanel(
width = 9,
plotOutput("main_plot", height = "650px")
)
)
)
server <- function(input, output, session) {
sim <- reactive({
mu <- 100
sigma <- input$true_sd
n <- input$n_obs
n_samp <- input$n_samples
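# One raw sample (its spread is the SD) and many sample means (their spread is the SE)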
one_sample <- rnorm(n, mu, sigma)
means <- replicate(n_samp, mean(rnorm(n, mu, sigma)))
list(one_sample = one_sample, means = means, mu = mu, sigma = sigma, n = n)
})
output$theory_box <- renderUI({
d <- sim()
se_theory <- round(d$sigma / sqrt(d$n), 2)
tagList(
p(paste0("SD = ", d$sigma)),
p(paste0("SE = SD/\u221an = ", d$sigma, "/\u221a", d$n, " = ", se_theory))
)
})
output$sim_box <- renderUI({
d <- sim()
tagList(
p(paste0("SD of one sample: ", round(sd(d$one_sample), 2))),
p(paste0("SD of sample means: ", round(sd(d$means), 2))),
p(paste0("Ratio: ", round(sd(d$one_sample) / sd(d$means), 1), "x narrower"))
)
})
output$main_plot <- renderPlot({
d <- sim()
sample_sd <- sd(d$one_sample)
se_actual <- sd(d$means)
xbar <- mean(d$one_sample)
xlims <- c(d$mu - 3.5 * d$sigma, d$mu + 3.5 * d$sigma)
par(mfrow = c(2, 1), mar = c(4.5, 4.5, 3.5, 1))
hist(d$one_sample, breaks = 30, col = adjustcolor("#3498db", 0.6), border = "white",
main = paste0("One Sample (n = ", d$n, ") \u2014 spread = SD"),
xlab = "Value", xlim = xlims, freq = FALSE, cex.main = 1.3)
abline(v = xbar, col = "#e74c3c", lwd = 2.5)
segments(xbar - sample_sd, 0, xbar + sample_sd, 0, col = "#e67e22", lwd = 4)
mtext(paste0("Mean = ", round(xbar, 1), " | SD = ", round(sample_sd, 1)),
side = 3, line = 0, cex = 1.1, font = 2)
hist(d$means, breaks = 40, col = adjustcolor("#2ecc71", 0.6), border = "white",
main = "Sampling Distribution of the Mean \u2014 spread = SE",
xlab = "Sample Mean", xlim = xlims, freq = FALSE, cex.main = 1.3)
xseq <- seq(xlims[1], xlims[2], length.out = 300)
lines(xseq, dnorm(xseq, d$mu, d$sigma / sqrt(d$n)), col = "#8e44ad", lwd = 2.5)
abline(v = d$mu, col = "#e74c3c", lwd = 2.5, lty = 2)
segments(d$mu - se_actual, 0, d$mu + se_actual, 0, col = "#e67e22", lwd = 4)
mtext(paste0("Mean of means = ", round(mean(d$means), 1), " | SE = ", round(se_actual, 2)),
side = 3, line = 0, cex = 1.1, font = 2)
legend("topright", "Theoretical N(\u03bc, SD/\u221an)", col = "#8e44ad", lwd = 2.5, bty = "n")
})
}
shinyApp(ui, server)
How They All Connect
Let’s see all three in one place. Watch what happens as you increase \(n\):
- SD stays the same — it’s a property of the population, not the sample size
- SE shrinks — more data means a more precise estimate
- Sample variance (s²) bounces around σ² from sample to sample — but averages out to exactly σ² (unbiased)
#| standalone: true
#| viewerHeight: 700
library(shiny)
ui <- fluidPage(
titlePanel("SD vs SE as n Grows"),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("sigma", "Population SD:", min = 2, max = 30, value = 10),
sliderInput("n_size", "Sample size (n):", min = 5, max = 500, value = 10),
p(style = "margin-top:12px;",
strong("What to notice:"),
br(),
"Blue histogram (data) stays equally wide — SD doesn't shrink with n",
br(), br(),
"Green histogram (means) gets narrower — SE shrinks with sqrt(n)",
br(), br(),
"At n = 100, SE is 10x smaller than SD"
)
),
mainPanel(
width = 9,
plotOutput("combo_plot", height = "280px"),
plotOutput("se_curve", height = "300px")
)
)
)
server <- function(input, output, session) {
output$combo_plot <- renderPlot({
mu <- 50
sigma <- input$sigma
n <- input$n_size
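# Left panel: one raw sample (spread = SD). Right panel: 500 sample means (spread = SE).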
one_samp <- rnorm(n, mu, sigma)
means <- replicate(500, mean(rnorm(n, mu, sigma)))
xlims <- c(mu - 3.5 * sigma, mu + 3.5 * sigma)
par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))
hist(one_samp, breaks = 30, col = adjustcolor("#3498db", 0.6), border = "white",
main = paste0("One Sample (SD = ", round(sd(one_samp), 1), ")"),
xlab = "Value", xlim = xlims, freq = FALSE, cex.main = 1.2)
abline(v = mu, col = "#e74c3c", lwd = 2, lty = 2)
hist(means, breaks = 40, col = adjustcolor("#2ecc71", 0.6), border = "white",
main = paste0("Sample Means (SE = ", round(sd(means), 2), ")"),
xlab = "Value", xlim = xlims, freq = FALSE, cex.main = 1.2)
abline(v = mu, col = "#e74c3c", lwd = 2, lty = 2)
})
output$se_curve <- renderPlot({
sigma <- input$sigma
n_curr <- input$n_size
ns <- 2:500
ses <- sigma / sqrt(ns)
se_now <- sigma / sqrt(n_curr)
par(mar = c(4.5, 4.5, 3, 1))
plot(ns, ses, type = "l", lwd = 2.5, col = "#2c3e50",
xlab = "Sample Size (n)", ylab = "Standard Error",
main = "SE = SD / sqrt(n) — diminishing returns",
ylim = c(0, max(ses) * 1.1), cex.main = 1.3)
abline(h = sigma, col = "#3498db", lwd = 2, lty = 2)
points(n_curr, se_now, pch = 19, cex = 2, col = "#e74c3c")
segments(n_curr, 0, n_curr, se_now, lty = 2, col = "#e74c3c")
segments(0, se_now, n_curr, se_now, lty = 2, col = "#e74c3c")
text(400, sigma + 0.8, paste0("SD = ", sigma), col = "#3498db", cex = 1.2, font = 2)
text(n_curr + 25, se_now + 0.8, paste0("SE = ", round(se_now, 2)),
col = "#e74c3c", cex = 1.1, font = 2, adj = 0)
})
}
shinyApp(ui, server)
SE and MDE: Why SE Determines Your Experiment’s Power
If you run an experiment, the Minimum Detectable Effect (MDE) — the smallest effect you can reliably catch — is directly driven by the SE:
\[MDE = (z_{\alpha/2} + z_{\beta}) \times SE\]
For a two-sample experiment with equal groups (\(\sigma = 1\) for simplicity):
\[MDE = (z_{\alpha/2} + z_{\beta}) \times \sqrt{\frac{2}{n}}\]
That \(\sqrt{2/n}\) term is the SE of the difference in means. So MDE is just a scaled-up SE. The critical values (\(z_{\alpha/2} + z_{\beta} \approx 2.8\) for 5% significance and 80% power) are just multipliers.
This means:
- Shrink SE → shrink MDE → detect smaller effects. The only levers are: increase \(n\), or reduce \(\sigma\) (better measurement, stratification, controls).
- Power analysis is really an SE calculation. You’re asking: “How small can I make my SE with this sample size?”
- If your SE is too large, no statistical test can save you. The effect will be buried in noise.
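A back-of-the-envelope version of this calculation (the mde() helper is ours, assuming equal groups and known \(\sigma\)):

```r
# Smallest detectable difference in means, with n per group
mde <- function(n, sigma = 1, alpha = 0.05, power = 0.80) {
  z <- qnorm(1 - alpha / 2) + qnorm(power)   # ~1.96 + 0.84 = 2.8
  z * sigma * sqrt(2 / n)                    # multipliers x SE of the difference
}

mde(100)   # ~0.40 : 100 per group detects a ~0.4-sigma effect
mde(400)   # ~0.20 : 4x the data halves the MDE
```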
See the Power, Alpha, Beta & MDE page for interactive simulations.
Quick Reference
| | Variance (σ² or s²) | Standard Deviation (σ or s) | Standard Error (SE) |
|---|---|---|---|
| Measures | Spread of data (squared) | Spread of data (original units) | Precision of an estimate |
| Population | \(\sigma^2 = \frac{1}{N}\sum(x_i - \mu)^2\) | \(\sigma = \sqrt{\sigma^2}\) | \(\frac{\sigma}{\sqrt{n}}\) |
| Sample | \(s^2 = \frac{1}{n-1}\sum(x_i - \bar{x})^2\) | \(s = \sqrt{s^2}\) | \(\frac{s}{\sqrt{n}}\) |
| Changes with n? | Converges to σ² | Converges to σ | Shrinks as \(1/\sqrt{n}\) |
| Used for | Math convenience | Describing data | Confidence intervals, hypothesis tests |
Common Mistakes
Reporting SD when you mean SE (or vice versa). SD describes how variable the data is. SE describes how precisely you’ve estimated the mean. A study with low SD but high SE has consistent data but too few observations.
Thinking SE measures data quality. SE can be tiny even with messy data — you just need a big enough \(n\). It says nothing about whether your data is clean or your measurements are accurate.
Forgetting that SE applies to any estimator, not just the mean. The slope in a regression has an SE. A proportion has an SE. Any time you estimate something from data, there’s uncertainty in that estimate — and the SE quantifies it.
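For example, R attaches an SE to every regression coefficient, and a proportion has its own SE formula, \(\sqrt{\hat{p}(1-\hat{p})/n}\). A sketch with simulated data:

```r
set.seed(3)
x <- runif(50)
y <- 2 * x + rnorm(50)            # fake data with a true slope of 2
summary(lm(y ~ x))$coefficients   # the "Std. Error" column is the SE of each estimate

p_hat <- rbinom(1, size = 500, prob = 0.3) / 500   # a sample proportion
sqrt(p_hat * (1 - p_hat) / 500)                    # its estimated SE (~0.02)
```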
Did you know?
The n − 1 Story
The correction for dividing by \(n - 1\) instead of \(n\) is called Bessel’s correction, named after Friedrich Bessel (1784–1846), a German astronomer obsessed with measurement precision. Working on precise positions of stars, he realized that using \(n\) in the denominator systematically underestimated how uncertain his measurements really were. His fix — dividing by \(n - 1\) — is now used billions of times a day in software from Excel to R to Python. Bessel never saw a computer, but his correction is baked into every call to var() and sd() in R and numpy.std(ddof=1) in Python.
Why “Standard” Error?
The word “standard” in “standard deviation” and “standard error” simply means “typical.” Karl Pearson coined “standard deviation” in 1893 because he wanted a word for the typical amount by which values deviate from their average. “Standard error” then naturally means the typical amount by which your estimate deviates from the truth. Nothing fancy — just “how far off is typical?”