Test Statistics
p-values & Confidence Intervals explained what a p-value means. This page explains the machinery that produces them — the test statistics. There are many, but they almost all reduce to the same core idea: how far is the estimate from the null, measured in units of its uncertainty?
The universal structure
Nearly every classical test statistic has this shape:
\[ \text{test statistic} = \frac{(\text{estimate} - \text{null value})^2}{\text{variance of estimate}} \]
That’s a quadratic form — a squared distance, scaled by uncertainty. This is why the \(\chi^2\) distribution shows up everywhere: a squared standard normal is \(\chi^2_1\) by definition, and sums of squared standard normals are \(\chi^2\) with more degrees of freedom.
I. Mean-based tests
Z-statistic
The simplest test statistic. Used when the variance is known (rare) or the sample is large enough for the CLT to kick in:
\[ Z = \frac{\hat{\theta} - \theta_0}{\text{SE}(\hat{\theta})} \]
Under \(H_0\): \(Z \sim N(0, 1)\).
You’ll see Z-tests in large-sample regressions and most asymptotic tests in econometrics. When someone reports a “coefficient divided by its standard error” in a large sample, that’s a Z-statistic.
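A minimal sketch in R, using simulated data and a made-up null of zero, shows how little machinery is involved:

```r
# Sketch: large-sample Z-test of H0: mean = 0 (simulated data, illustrative only)
set.seed(1)
x <- rnorm(5000, mean = 0.05, sd = 1)            # hypothetical sample
z <- (mean(x) - 0) / (sd(x) / sqrt(length(x)))   # (estimate - null) / SE
p <- 2 * pnorm(-abs(z))                          # two-sided p-value from N(0, 1)
c(Z = z, p = p)
```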
t-statistic
Same structure as \(Z\), but the standard error is estimated rather than known:
\[ t = \frac{\hat{\theta} - \theta_0}{\widehat{\text{SE}}(\hat{\theta})} \]
Under \(H_0\): \(t \sim t_{df}\), where \(df\) is the residual degrees of freedom: the sample size minus the number of estimated parameters (\(n - k\) in a regression with \(k\) coefficients).
This is what standard regression output reports. Every coefficient row in a regression table has a t-statistic and a p-value derived from it.
The t-distribution has heavier tails than the normal — it’s more conservative, accounting for the extra uncertainty from estimating the variance. As \(n \to \infty\), the t-distribution converges to the normal, so \(t \to Z\).
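To see that the reported t-statistic really is just the coefficient divided by its estimated standard error, here is a quick sketch with simulated data:

```r
# Sketch: the t value in regression output is coefficient / estimated SE
set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- 1 + 0.3 * d$x + rnorm(100)
ct <- summary(lm(y ~ x, data = d))$coefficients
ct["x", "t value"]                             # as reported
ct["x", "Estimate"] / ct["x", "Std. Error"]    # same number, by hand
```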
II. Variance and distribution tests
Chi-squared (\(\chi^2\))
The chi-squared statistic is a general quadratic form:
\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]
where \(O\) is observed and \(E\) is expected under the null. Used for goodness-of-fit tests, independence in contingency tables, and likelihood ratio tests.
A key fact that connects everything:
\[ \text{If } Z \sim N(0, 1), \text{ then } Z^2 \sim \chi^2_1 \]
This is why \(\chi^2\) shows up everywhere — many test statistics are squared normals, or sums of squared normals. When you have \(k\) independent squared standard normals, you get \(\chi^2_k\).
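Both facts are easy to check by simulation; the 2x2 table below is made up purely for illustration:

```r
# Sketch: Z^2 behaves like chi-squared(1), and the contingency-table statistic
# is literally sum((O - E)^2 / E)
set.seed(1)
z <- rnorm(1e5)
quantile(z^2, c(0.50, 0.90, 0.95))          # empirical quantiles of Z^2
qchisq(c(0.50, 0.90, 0.95), df = 1)         # theoretical chi-squared(1) quantiles

tab <- matrix(c(30, 20, 25, 25), nrow = 2)  # made-up 2x2 counts
chisq.test(tab, correct = FALSE)$statistic  # matches sum((O - E)^2 / E)
```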
F-statistic
Used to test multiple restrictions simultaneously — “are all these coefficients jointly zero?”:
\[ F = \frac{(\text{RSS}_r - \text{RSS}_u) / q}{\text{RSS}_u / (n - k)} \]
where \(\text{RSS}_r\) is the residual sum of squares from the restricted model, \(\text{RSS}_u\) from the unrestricted model, \(q\) is the number of restrictions, and \(n - k\) is the residual degrees of freedom of the unrestricted model (\(n\) observations, \(k\) estimated parameters).
Under \(H_0\): \(F \sim F_{q, \, n-k}\).
Used for joint hypothesis tests (“are all pre-trends zero?” in DiD), ANOVA, and overall model significance.
The F-statistic and the t-statistic are directly related:
\[ F_{1, \, df} = t^2 \]
An F-test with one restriction is exactly the square of the corresponding t-test. The F generalizes the t to multiple restrictions. And just as \(t \to Z\) as \(n \to \infty\), we have \(q \cdot F_{q, n-k} \to \chi^2_q\) — the F-test converges to a scaled chi-squared test in large samples.
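A sketch with simulated data shows both the joint-test mechanics and the \(t^2 = F\) identity for a single restriction:

```r
# Sketch: F-test of one restriction (beta_x2 = 0) equals the squared t-statistic
set.seed(1)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- 1 + 0.5 * d$x1 + rnorm(200)
unrestricted <- lm(y ~ x1 + x2, data = d)
restricted   <- lm(y ~ x1, data = d)                    # imposes the restriction
anova(restricted, unrestricted)                         # F for the restriction
summary(unrestricted)$coefficients["x2", "t value"]^2   # identical to that F
```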
Simulation: Watch t converge to Z and F converge to chi-squared as n grows
Drag the degrees-of-freedom slider and watch two convergences happen simultaneously: the t distribution collapses onto the standard normal, and F(1, df) collapses onto chi-squared(1).
#| standalone: true
#| viewerHeight: 680
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("df", "Degrees of freedom (df):",
min = 2, max = 200, value = 10, step = 1),
uiOutput("stats")
),
mainPanel(
width = 9,
plotOutput("conv_plot", height = "540px")
)
)
)
server <- function(input, output, session) {
output$conv_plot <- renderPlot({
df <- input$df  # numeric; calls to the density function df() below still resolve to stats::df
par(mfrow = c(2, 1), mar = c(4.5, 4.5, 3, 1))
# --- Top panel: t vs Z ---
xseq <- seq(-4.5, 4.5, length.out = 500)
y_t <- dt(xseq, df = df)
y_z <- dnorm(xseq)
plot(xseq, y_z, type = "l", lwd = 2, lty = 2, col = "black",
xlab = "x", ylab = "Density",
main = paste0("t(", df, ") vs Standard Normal"),
ylim = c(0, max(y_z) * 1.15))
lines(xseq, y_t, lwd = 3, col = "#3498db")
legend("topright", bty = "n", cex = 0.9,
legend = c(paste0("t(", df, ")"), "N(0, 1)"),
col = c("#3498db", "black"),
lwd = c(3, 2), lty = c(1, 2))
# --- Bottom panel: F(1, df) vs chi-sq(1) ---
xf <- seq(0.001, 8, length.out = 500)
y_f <- df(xf, df1 = 1, df2 = df)
y_ch <- dchisq(xf, df = 1)
ymax <- min(max(c(y_f, y_ch), na.rm = TRUE) * 1.1, 5)
plot(xf, y_f, type = "l", lwd = 3, col = "#27ae60",
xlab = "x", ylab = "Density",
main = paste0("F(1, ", df, ") vs \u03c7\u00b2(1)"),
ylim = c(0, ymax))
lines(xf, y_ch, lwd = 2, lty = 2, col = "#e74c3c")
legend("topright", bty = "n", cex = 0.9,
legend = c(paste0("F(1, ", df, ")"),
expression(chi^2 * "(1)")),
col = c("#27ae60", "#e74c3c"),
lwd = c(3, 2), lty = c(1, 2))
})
output$stats <- renderUI({
df <- input$df
p_t <- 2 * pt(-1.96, df = df)
p_z <- 2 * pnorm(-1.96)  # = 0.0500 to four decimals
tags$div(class = "stats-box",
HTML(paste0(
"<b>df:</b> ", df, "<br>",
"<hr style='margin:8px 0'>",
"<b>P(|t| > 1.96):</b> ", round(p_t, 4), "<br>",
"<b>P(|Z| > 1.96):</b> ", round(p_z, 4), "<br>",
"<b>Difference:</b> ", round(abs(p_t - p_z), 4), "<br>",
"<hr style='margin:8px 0'>",
"<small>As df → ∞, the t-tail<br>",
"probability converges to 0.05<br>",
"and F(1, df) → χ²(1).</small>"
))
)
})
}
shinyApp(ui, server)
III. The likelihood-based trinity
These three tests are the workhorses of MLE-based inference. They all test the same null hypothesis and are asymptotically equivalent — but they differ in what you need to estimate.
Likelihood Ratio (LR) test
Estimate both the restricted and unrestricted models, then compare their log-likelihoods:
\[ LR = -2\left(\ell_{\text{restricted}} - \ell_{\text{unrestricted}}\right) \]
Under \(H_0\): \(LR \sim \chi^2_q\), where \(q\) is the number of restrictions.
Intuition: does the likelihood drop much when you impose the restriction? If the restriction is true, the restricted model shouldn’t fit much worse, so \(LR\) should be small.
Used in logit/probit, structural estimation, and any MLE comparison.
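A minimal logit sketch with simulated data; `anova()` with `test = "LRT"` gives the same answer as the hand computation:

```r
# Sketch: LR test comparing nested logits (H0: coefficient on x2 is zero)
set.seed(1)
d <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
d$y <- rbinom(500, 1, plogis(-0.2 + 0.8 * d$x1))
unrestricted <- glm(y ~ x1 + x2, family = binomial, data = d)
restricted   <- glm(y ~ x1,      family = binomial, data = d)
LR <- -2 * (as.numeric(logLik(restricted)) - as.numeric(logLik(unrestricted)))
pchisq(LR, df = 1, lower.tail = FALSE)          # chi-squared(1) p-value
anova(restricted, unrestricted, test = "LRT")   # built-in equivalent
```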
Wald test
Estimate only the unrestricted model, then check whether the estimates are far from the restriction:
\[ W = \frac{(\hat{\theta} - \theta_0)^2}{\text{Var}(\hat{\theta})} \]
Under \(H_0\): \(W \sim \chi^2_q\).
Intuition: if the restriction is true, the unrestricted estimate should land close to \(\theta_0\). A large \(W\) means the estimate is far from the null, in variance-scaled units.
This is what most regression software gives you by default. In fact, t-tests and F-tests are special cases of Wald tests: the t-statistic is \(\sqrt{W}\) for a single restriction, and the F-statistic is \(W/q\) with a finite-sample correction.
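Here is the same hypothesis tested the Wald way, needing only the unrestricted fit (same simulated setup as the LR sketch above):

```r
# Sketch: Wald test by hand, H0: coefficient on x2 equals 0
set.seed(1)
d <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
d$y <- rbinom(500, 1, plogis(-0.2 + 0.8 * d$x1))
fit <- glm(y ~ x1 + x2, family = binomial, data = d)   # unrestricted model only
W <- (coef(fit)["x2"] - 0)^2 / vcov(fit)["x2", "x2"]   # (estimate - null)^2 / Var
c(W = unname(W), p = pchisq(unname(W), df = 1, lower.tail = FALSE))
sqrt(W)   # equals the absolute value of the z reported by summary(fit)
```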
Score (Lagrange Multiplier) test
Estimate only the restricted model, then check whether the score (the gradient of the log-likelihood) is far from zero at the restricted estimate:
\[ LM = \frac{\left(\left.\dfrac{\partial \ell}{\partial \theta}\right|_{\theta = \hat{\theta}_r}\right)^{2}}{\mathcal{I}(\hat{\theta}_r)} \]
Under \(H_0\): \(LM \sim \chi^2_q\).
Intuition: if the restriction is correct, the restricted estimate should be near the peak of the log-likelihood, so the slope (score) should be near zero. A steep slope at the restricted estimate means the restriction is pushing you away from the peak.
The Score test is useful when the unrestricted model is hard to estimate — you only need the restricted model.
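One concrete instance is the Breusch-Pagan LM test for heteroskedasticity (Koenker's version), which needs only the restricted, constant-variance fit plus an auxiliary regression. A sketch with simulated data:

```r
# Sketch: LM (score) test via n * R^2 from an auxiliary regression
# (Koenker's version of the Breusch-Pagan heteroskedasticity test)
set.seed(1)
d <- data.frame(x = rnorm(300))
d$y <- 1 + 0.5 * d$x + rnorm(300, sd = 1 + 0.5 * abs(d$x))  # variance rises with |x|
fit <- lm(y ~ x, data = d)                    # restricted model: constant variance
d$u2 <- resid(fit)^2                          # squared residuals
aux  <- lm(u2 ~ x, data = d)                  # does x explain the squared residuals?
LM   <- nrow(d) * summary(aux)$r.squared      # n * R^2
pchisq(LM, df = 1, lower.tail = FALSE)        # chi-squared(1) p-value
```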
The trinity compared
| Test | Requires estimating | Checks |
|---|---|---|
| Likelihood Ratio | Both models | Did the likelihood drop? |
| Wald | Unrestricted only | Is the estimate far from the null? |
| Score (LM) | Restricted only | Is the score far from zero? |
All three converge to \(\chi^2_q\) under the null. In finite samples they can disagree, but the ranking is typically: Wald \(\geq\) LR \(\geq\) Score (Wald tends to over-reject, Score tends to under-reject).
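In the classical linear regression model the ordering \(W \ge LR \ge LM\) holds exactly for the standard large-sample versions of the three statistics, which is easy to verify with simulated data:

```r
# Sketch: W >= LR >= LM for a single linear restriction (simulated data)
set.seed(1)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 0.2 * d$x2 + rnorm(n)
RSS_u <- sum(resid(lm(y ~ x1 + x2, data = d))^2)   # unrestricted
RSS_r <- sum(resid(lm(y ~ x2, data = d))^2)        # restricted: beta_x1 = 0
W  <- n * (RSS_r - RSS_u) / RSS_u
LR <- n * log(RSS_r / RSS_u)
LM <- n * (RSS_r - RSS_u) / RSS_r
c(Wald = W, LR = LR, Score = LM)                   # always in this order
```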
IV. Nonparametric tests
These don’t assume normality or any specific distribution.
Rank-based tests
Wilcoxon and Mann-Whitney tests replace raw values with ranks. Instead of asking “is the mean different?”, they ask “do observations from one group tend to have higher ranks?” Robust to outliers and non-normal distributions.
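A sketch contrasting the rank-based comparison with the mean-based one on skewed, made-up data:

```r
# Sketch: Wilcoxon / Mann-Whitney rank-sum test vs. a t-test on skewed data
set.seed(1)
g1 <- rexp(40, rate = 1.0)        # skewed, occasional large values
g2 <- rexp(40, rate = 0.7)
wilcox.test(g1, g2)               # compares rank positions
t.test(g1, g2)                    # compares means, for contrast
```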
Kolmogorov-Smirnov (KS) test
Tests whether two distributions are equal by finding the maximum distance between their CDFs:
\[ KS = \sup_x |F_1(x) - F_2(x)| \]
This is a whole-distribution test — it detects differences in shape, location, and spread simultaneously.
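A sketch on two simulated samples that share a mean but differ in spread, a difference a mean-based test would miss:

```r
# Sketch: two-sample Kolmogorov-Smirnov test
set.seed(1)
x1 <- rnorm(200)                  # N(0, 1)
x2 <- rnorm(200, sd = 1.5)        # same mean, wider spread
ks.test(x1, x2)                   # max gap between the two empirical CDFs
```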
V. Meta-analysis and multiple testing
Fisher’s combined test
When you have \(k\) independent p-values from separate studies and want to test whether there’s any signal across them:
\[ X^2 = -2\sum_{i=1}^k \ln(p_i) \]
Under the global null (all \(H_0\)’s are true): \(X^2 \sim \chi^2_{2k}\).
The intuition: under the null, p-values are uniform on \([0, 1]\), so \(-2\ln(p_i) \sim \chi^2_2\). Summing them gives a \(\chi^2\) with \(2k\) degrees of freedom.
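A sketch with made-up p-values from five hypothetical studies:

```r
# Sketch: Fisher's combined probability test
p <- c(0.08, 0.21, 0.03, 0.47, 0.12)                 # hypothetical p-values, k = 5
X2 <- -2 * sum(log(p))                               # the combined statistic
pchisq(X2, df = 2 * length(p), lower.tail = FALSE)   # global p-value from chi-sq(2k)
```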
Multiple testing corrections
Not a new statistic — instead, these adjust p-values or rejection thresholds when you’re running many tests simultaneously. Bonferroni (divide \(\alpha\) by the number of tests), Holm, Benjamini-Hochberg (FDR control). Covered in Multiple Testing.
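For reference, R's built-in `p.adjust()` implements the corrections named above; the p-values below are made up:

```r
# Sketch: adjusting a set of hypothetical p-values
p <- c(0.001, 0.012, 0.030, 0.040, 0.200)
p.adjust(p, method = "bonferroni")
p.adjust(p, method = "holm")
p.adjust(p, method = "BH")        # Benjamini-Hochberg (FDR)
```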
The relationships between everything
Here’s how these test statistics connect to each other:
\[ Z^2 = \chi^2_1 = F_{1,\infty} = W \text{ (one restriction, large sample)} \]
\[ t^2 = F_{1, df} \]
\[ q \cdot F_{q, n-k} \;\xrightarrow{n \to \infty}\; \chi^2_q \]
\[ \text{Wald} \approx \text{LR} \approx \text{Score} \;\sim\; \chi^2_q \text{ (asymptotically)} \]
And in causal inference contexts:
| Setting | Individual coefficient | Joint test | Robust version |
|---|---|---|---|
| Regression | t-test | F-test | Cluster-robust t or F |
| DiD / event study | t (each period) | F (all pre-trends) | Cluster-robust |
| MLE (logit, probit) | Wald (z) | LR or Wald \(\chi^2\) | Sandwich-robust Wald |
| Bootstrap | — | — | Empirical distribution |