Test Statistics

p-values & Confidence Intervals explained what a p-value means. This page explains the machinery that produces them — the test statistics. There are many, but they almost all reduce to the same core idea: how far is the estimate from the null, measured in units of its uncertainty?

The universal structure

Nearly every classical test statistic has this shape:

\[ \text{test statistic} = \frac{(\text{estimate} - \text{null value})^2}{\text{variance of estimate}} \]

That’s a quadratic form — a squared distance, scaled by uncertainty. This is why the \(\chi^2\) distribution shows up everywhere: a squared standard normal is \(\chi^2_1\) by definition, and sums of squared standard normals are \(\chi^2\) with more degrees of freedom.

The deep pattern. Linear forms (estimate \(-\) null, unscaled) give you \(Z\) or \(t\) statistics. Quadratic forms (squared and variance-scaled) give you \(\chi^2\) or \(F\) statistics. Likelihood comparisons give you LR statistics. Rank-based forms give you nonparametric tests. But asymptotically, most of them converge to normal or chi-squared distributions.
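A minimal R sketch of this structure, using a simulated sample mean tested against a null of zero (all numbers are illustrative): the linear form is a t/Z-style statistic, and squaring it gives the variance-scaled quadratic form.

```r
# The universal structure: estimate minus null, scaled by its uncertainty.
set.seed(1)
x      <- rnorm(200, mean = 0.1, sd = 1)    # simulated data (true mean 0.1)
theta0 <- 0                                 # null value

est <- mean(x)
se  <- sd(x) / sqrt(length(x))              # estimated standard error

linear_form    <- (est - theta0) / se       # t / Z style statistic
quadratic_form <- (est - theta0)^2 / se^2   # squared distance, variance-scaled

all.equal(linear_form^2, quadratic_form)    # the quadratic form is just the square
quadratic_form > qchisq(0.95, df = 1)       # compare to the chi-squared(1) cutoff
```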

I. Mean-based tests

Z-statistic

The simplest test statistic. Used when the variance is known (rare) or the sample is large enough for the CLT to kick in:

\[ Z = \frac{\hat{\theta} - \theta_0}{\text{SE}(\hat{\theta})} \]

Under \(H_0\): \(Z \sim N(0, 1)\).

You’ll see Z-tests in large-sample regressions and most asymptotic tests in econometrics. When someone reports a “coefficient divided by its standard error” in a large sample, that’s a Z-statistic.
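A quick sketch with simulated data: compute the Z statistic by hand and get its two-sided p-value from the standard normal.

```r
# Hand-computed Z statistic for a large-sample test of a mean against 0.
set.seed(2)
x <- rnorm(5000, mean = 0.03, sd = 1)

z <- (mean(x) - 0) / (sd(x) / sqrt(length(x)))   # estimate minus null, over SE
p <- 2 * pnorm(-abs(z))                          # two-sided p-value from N(0, 1)
c(z = z, p = p)
```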

t-statistic

Same structure as \(Z\), but the standard error is estimated rather than known:

\[ t = \frac{\hat{\theta} - \theta_0}{\widehat{\text{SE}}(\hat{\theta})} \]

Under \(H_0\): \(t \sim t_{df}\), where \(df\) depends on the sample size and number of parameters.

This is what standard regression output reports. Every coefficient row in a regression table has a t-statistic and a p-value derived from it.

The t-distribution has heavier tails than the normal — it’s more conservative, accounting for the extra uncertainty from estimating the variance. As \(n \to \infty\), the t-distribution converges to the normal, so \(t \to Z\).
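A sketch with a simulated regression: the t value in the output is literally the coefficient divided by its standard error, and the p-value comes from the t distribution with the residual degrees of freedom.

```r
# The t statistic in a regression table is just coefficient / standard error.
set.seed(3)
n <- 50
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)

fit <- lm(y ~ x)
tab <- summary(fit)$coefficients

tab["x", "Estimate"] / tab["x", "Std. Error"]             # computed by hand
tab["x", "t value"]                                       # as reported by summary()
2 * pt(-abs(tab["x", "t value"]), df = fit$df.residual)   # its two-sided p-value
```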

II. Variance and distribution tests

Chi-squared (\(\chi^2\))

The Pearson chi-squared statistic is a quadratic form in observed and expected counts:

\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]

where \(O\) is observed and \(E\) is expected under the null. Used for goodness-of-fit tests and independence in contingency tables; the same \(\chi^2\) distribution also governs likelihood ratio tests (below).

A key fact that connects everything:

\[ \text{If } Z \sim N(0, 1), \text{ then } Z^2 \sim \chi^2_1 \]

This is why \(\chi^2\) shows up everywhere — many test statistics are squared normals, or sums of squared normals. When you have \(k\) independent squared standard normals, you get \(\chi^2_k\).
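A short sketch with hypothetical die-roll counts: the hand-computed quadratic form matches chisq.test(), and a simulation shows the squared-normal connection.

```r
# Pearson chi-squared by hand vs chisq.test(), for a fair-die goodness-of-fit test.
observed <- c(18, 22, 19, 25, 16, 20)          # hypothetical counts of each face
expected <- rep(sum(observed) / 6, 6)          # equal counts under the null

sum((observed - expected)^2 / expected)        # the quadratic form
chisq.test(observed, p = rep(1/6, 6))          # same statistic, df = 5

# And the Z^2 ~ chi-squared(1) connection:
z <- rnorm(1e5)
mean(z^2 > qchisq(0.95, df = 1))               # about 0.05
```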

F-statistic

Used to test multiple restrictions simultaneously — “are all these coefficients jointly zero?”:

\[ F = \frac{(\text{RSS}_r - \text{RSS}_u) / q}{\text{RSS}_u / (n - k)} \]

where \(\text{RSS}_r\) is the residual sum of squares from the restricted model, \(\text{RSS}_u\) from the unrestricted model, \(q\) is the number of restrictions, and \(n - k\) is the residual degrees of freedom of the unrestricted model (\(n\) observations, \(k\) estimated parameters).

Under \(H_0\): \(F \sim F_{q, \, n-k}\).

Used for joint hypothesis tests (“are all pre-trends zero?” in DiD), ANOVA, and overall model significance.

The F-statistic and the t-statistic are directly related:

\[ F_{1, \, df} = t^2 \]

An F-test with one restriction is exactly the square of the corresponding t-test. The F generalizes the t to multiple restrictions. And just as \(t \to Z\) as \(n \to \infty\), we have \(q \cdot F_{q, n-k} \to \chi^2_q\) — the F-test converges to a scaled chi-squared test in large samples.
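A sketch with simulated regressors (names are illustrative): anova() on nested models gives the joint F test, and with a single restriction the F statistic equals the squared t statistic.

```r
# Joint F test via nested models, and the t^2 = F link for a single restriction.
set.seed(4)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.4 * x1 + rnorm(n)                  # x2 and x3 are irrelevant

unrestricted <- lm(y ~ x1 + x2 + x3)
restricted   <- lm(y ~ x1)                     # imposes beta2 = beta3 = 0

anova(restricted, unrestricted)                # F test of the q = 2 restrictions

# Single restriction: the F statistic equals the squared t statistic.
one_restriction <- lm(y ~ x1 + x2)
anova(lm(y ~ x1), one_restriction)$F[2]
summary(one_restriction)$coefficients["x2", "t value"]^2
```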

Simulation: Watch t converge to Z and F converge to chi-squared as n grows

Drag the degrees-of-freedom slider and watch two convergences happen simultaneously: the t distribution collapses onto the standard normal, and F(1, df) collapses onto chi-squared(1).

#| standalone: true
#| viewerHeight: 680

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("df", "Degrees of freedom (df):",
                  min = 2, max = 200, value = 10, step = 1),

      uiOutput("stats")
    ),

    mainPanel(
      width = 9,
      plotOutput("conv_plot", height = "540px")
    )
  )
)

server <- function(input, output, session) {

  output$conv_plot <- renderPlot({
    df <- input$df
    par(mfrow = c(2, 1), mar = c(4.5, 4.5, 3, 1))

    # --- Top panel: t vs Z ---
    xseq <- seq(-4.5, 4.5, length.out = 500)
    y_t <- dt(xseq, df = df)
    y_z <- dnorm(xseq)

    plot(xseq, y_z, type = "l", lwd = 2, lty = 2, col = "black",
         xlab = "x", ylab = "Density",
         main = paste0("t(", df, ") vs Standard Normal"),
         ylim = c(0, max(y_z) * 1.15))
    lines(xseq, y_t, lwd = 3, col = "#3498db")

    legend("topright", bty = "n", cex = 0.9,
           legend = c(paste0("t(", df, ")"), "N(0, 1)"),
           col = c("#3498db", "black"),
           lwd = c(3, 2), lty = c(1, 2))

    # --- Bottom panel: F(1, df) vs chi-sq(1) ---
    xf <- seq(0.001, 8, length.out = 500)
    y_f  <- df(xf, df1 = 1, df2 = df)
    y_ch <- dchisq(xf, df = 1)

    ymax <- min(max(c(y_f, y_ch), na.rm = TRUE) * 1.1, 5)

    plot(xf, y_f, type = "l", lwd = 3, col = "#27ae60",
         xlab = "x", ylab = "Density",
         main = paste0("F(1, ", df, ") vs \u03c7\u00b2(1)"),
         ylim = c(0, ymax))
    lines(xf, y_ch, lwd = 2, lty = 2, col = "#e74c3c")

    legend("topright", bty = "n", cex = 0.9,
           legend = c(paste0("F(1, ", df, ")"),
                      expression(chi^2 * "(1)")),
           col = c("#27ae60", "#e74c3c"),
           lwd = c(3, 2), lty = c(1, 2))
  })

  output$stats <- renderUI({
    df <- input$df
    p_t  <- 2 * pt(-1.96, df = df)   # two-sided t tail beyond the 1.96 cutoff
    p_z  <- 2 * pnorm(-1.96)         # corresponding normal tail (approximately 0.05)

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>df:</b> ", df, "<br>",
        "<hr style='margin:8px 0'>",
        "<b>P(|t| > 1.96):</b> ", round(p_t, 4), "<br>",
        "<b>P(|Z| > 1.96):</b> ", round(p_z, 4), "<br>",
        "<b>Difference:</b> ", round(abs(p_t - p_z), 4), "<br>",
        "<hr style='margin:8px 0'>",
        "<small>As df &rarr; &infin;, the t-tail<br>",
        "probability converges to 0.05<br>",
        "and F(1, df) &rarr; &chi;&sup2;(1).</small>"
      ))
    )
  })
}

shinyApp(ui, server)

III. The likelihood-based trinity

These three tests are the workhorses of MLE-based inference. They all test the same null hypothesis and are asymptotically equivalent — but they differ in what you need to estimate.

Likelihood Ratio (LR) test

Estimate both the restricted and unrestricted models, then compare their log-likelihoods:

\[ LR = -2\left(\ell_{\text{restricted}} - \ell_{\text{unrestricted}}\right) \]

Under \(H_0\): \(LR \sim \chi^2_q\), where \(q\) is the number of restrictions.

Intuition: does the likelihood drop much when you impose the restriction? If the restriction is true, the restricted model shouldn’t fit much worse, so \(LR\) should be small.

Used in logit/probit, structural estimation, and any MLE comparison.
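A minimal sketch with simulated logit data (variable names are illustrative): fit both models, difference the log-likelihoods, and compare to \(\chi^2_1\).

```r
# LR test in a logit: does dropping x2 cost a significant amount of likelihood?
set.seed(5)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1))           # x2 has no true effect

unrestricted <- glm(y ~ x1 + x2, family = binomial)
restricted   <- glm(y ~ x1,      family = binomial)   # imposes the restriction

LR <- as.numeric(-2 * (logLik(restricted) - logLik(unrestricted)))
LR
pchisq(LR, df = 1, lower.tail = FALSE)                # q = 1 restriction

anova(restricted, unrestricted, test = "LRT")         # same test, built in
```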

Wald test

Estimate only the unrestricted model, then check whether the estimates are far from the restriction:

\[ W = \frac{(\hat{\theta} - \theta_0)^2}{\text{Var}(\hat{\theta})} \]

for a single restriction; with several restrictions, \(W\) is the matrix quadratic form \((\hat{\theta} - \theta_0)' \, \text{Var}(\hat{\theta})^{-1} (\hat{\theta} - \theta_0)\).

Under \(H_0\): \(W \sim \chi^2_q\).

Intuition: if the restriction is true, the unrestricted estimate should land close to \(\theta_0\). A large \(W\) means the estimate is far from the null, in variance-scaled units.

This is what most regression software gives you by default. In fact, t-tests and F-tests are special cases of Wald tests: the t-statistic is \(\sqrt{W}\) for a single restriction, and the F-statistic is \(W/q\) with a finite-sample correction.
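A matching sketch using only the unrestricted fit, on the same simulated logit setup as above:

```r
# Wald test of H0: coefficient on x2 equals zero, from the unrestricted model only.
set.seed(5)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1))

unrestricted <- glm(y ~ x1 + x2, family = binomial)
est <- coef(summary(unrestricted))["x2", "Estimate"]
se  <- coef(summary(unrestricted))["x2", "Std. Error"]

W <- (est - 0)^2 / se^2                          # (estimate - null)^2 / variance
W
pchisq(W, df = 1, lower.tail = FALSE)            # Wald p-value

coef(summary(unrestricted))["x2", "z value"]^2   # same number: the squared z statistic
```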

Score (Lagrange Multiplier) test

Estimate only the restricted model, then check whether the score (the gradient of the log-likelihood) is far from zero at the restricted estimate:

\[ LM = \frac{\left[ \left. \dfrac{\partial \ell}{\partial \theta} \right|_{\theta = \hat{\theta}_r} \right]^2}{\mathcal{I}(\hat{\theta}_r)} \]

Under \(H_0\): \(LM \sim \chi^2_q\).

Intuition: if the restriction is correct, the restricted estimate should be near the peak of the log-likelihood, so the slope (score) should be near zero. A steep slope at the restricted estimate means the restriction is pushing you away from the peak.

The Score test is useful when the unrestricted model is hard to estimate — you only need the restricted model.
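A sketch on the same simulated logit setup. Base R's anova() can report Rao's score statistic; the interface compares two fitted models, but the statistic itself is built from the score evaluated at the restricted estimates.

```r
# Score (LM) test: Rao's score statistic for the same restriction.
set.seed(5)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1))

restricted   <- glm(y ~ x1,      family = binomial)
unrestricted <- glm(y ~ x1 + x2, family = binomial)

# anova() asks for both fits, but the Rao statistic comes from the gradient
# of the log-likelihood at the restricted estimates.
anova(restricted, unrestricted, test = "Rao")
```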

The trinity compared

| Test             | Requires estimating | Checks                             |
|------------------|---------------------|------------------------------------|
| Likelihood Ratio | Both models         | Did the likelihood drop?           |
| Wald             | Unrestricted only   | Is the estimate far from the null? |
| Score (LM)       | Restricted only     | Is the score far from zero?        |

All three converge to \(\chi^2_q\) under the null. In finite samples they can disagree, but the ranking is typically: Wald \(\geq\) LR \(\geq\) Score (Wald tends to over-reject, Score tends to under-reject).

When to use which. If you’ve already estimated the full model, use Wald — it’s what your regression output gives you. If you’re comparing nested models via MLE, use LR. If the unrestricted model is computationally expensive or doesn’t converge, use Score.

IV. Nonparametric tests

These don’t assume normality or any specific distribution.

Rank-based tests

The Wilcoxon rank-sum test (equivalently, the Mann-Whitney U test) and the Wilcoxon signed-rank test replace raw values with ranks. Instead of asking "is the mean different?", they ask "do observations from one group tend to have higher ranks?" Robust to outliers and non-normal distributions.
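A short sketch with simulated skewed data (exponential draws, purely illustrative):

```r
# Wilcoxon rank-sum (Mann-Whitney) test on skewed, non-normal data.
set.seed(6)
g1 <- rexp(40, rate = 1)
g2 <- rexp(40, rate = 0.7)

wilcox.test(g1, g2)    # do values in one group tend to rank higher than the other?
```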

Kolmogorov-Smirnov (KS) test

Tests whether two distributions are equal by finding the maximum distance between their CDFs:

\[ KS = \sup_x |F_1(x) - F_2(x)| \]

This is a whole-distribution test — it detects differences in shape, location, and spread simultaneously.
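A sketch with the same kind of simulated data: ks.test() and the hand-computed maximum ECDF gap.

```r
# Two-sample Kolmogorov-Smirnov test: maximum gap between empirical CDFs.
set.seed(6)
g1 <- rexp(40, rate = 1)
g2 <- rexp(40, rate = 0.7)

ks.test(g1, g2)

# The statistic by hand: largest |ECDF difference| over the pooled sample points.
grid <- sort(c(g1, g2))
max(abs(ecdf(g1)(grid) - ecdf(g2)(grid)))
```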

V. Meta-analysis and multiple testing

Fisher’s combined test

When you have \(k\) independent p-values from separate studies and want to test whether there’s any signal across them:

\[ X^2 = -2\sum_{i=1}^k \ln(p_i) \]

Under the global null (all \(H_0\)’s are true): \(X^2 \sim \chi^2_{2k}\).

The intuition: under the null, p-values are uniform on \([0, 1]\), so \(-2\ln(p_i) \sim \chi^2_2\). Summing them gives a \(\chi^2\) with \(2k\) degrees of freedom.
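A sketch with hypothetical p-values from four studies:

```r
# Fisher's combined test on k independent p-values.
p_vals <- c(0.09, 0.21, 0.04, 0.33)           # hypothetical p-values from 4 studies
k      <- length(p_vals)

X2 <- -2 * sum(log(p_vals))
X2
pchisq(X2, df = 2 * k, lower.tail = FALSE)    # combined p-value under the global null
```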

Multiple testing corrections

Not a new statistic — instead, these adjust p-values or rejection thresholds when you’re running many tests simultaneously. Bonferroni (divide \(\alpha\) by the number of tests), Holm, Benjamini-Hochberg (FDR control). Covered in Multiple Testing.
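A minimal sketch with hypothetical raw p-values, using base R's p.adjust():

```r
# Adjusting a batch of p-values for multiple testing.
p_raw <- c(0.001, 0.012, 0.030, 0.047, 0.210)  # hypothetical raw p-values

p.adjust(p_raw, method = "bonferroni")
p.adjust(p_raw, method = "holm")
p.adjust(p_raw, method = "BH")                 # Benjamini-Hochberg (FDR)
```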

The relationships between everything

Here’s how these test statistics connect to each other:

\[ Z^2 = \chi^2_1 = F_{1,\infty} = W \text{ (one restriction, large sample)} \]

\[ t^2 = F_{1, df} \]

\[ q \cdot F_{q, n-k} \;\xrightarrow{n \to \infty}\; \chi^2_q \]

\[ \text{Wald} \approx \text{LR} \approx \text{Score} \;\sim\; \chi^2_q \text{ (asymptotically)} \]
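These identities are easy to check numerically with quantile functions (degrees of freedom chosen for illustration):

```r
# Numerical check of the quantile identities linking t, F, and chi-squared.
dof <- 30
qf(0.95, df1 = 1, df2 = dof)          # F(1, dof) critical value ...
qt(0.975, df = dof)^2                 # ... equals the squared t critical value

q <- 3
q * qf(0.95, df1 = q, df2 = 10000)    # with a huge denominator df ...
qchisq(0.95, df = q)                  # ... q * F approaches the chi-squared(q) cutoff
```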

And in causal inference contexts:

| Setting             | Individual coefficient | Joint test             | Robust version        |
|---------------------|------------------------|------------------------|-----------------------|
| Regression          | t-test                 | F-test                 | Cluster-robust t or F |
| DiD / event study   | t (each period)        | F (all pre-trends)     | Cluster-robust        |
| MLE (logit, probit) | Wald (z)               | LR or Wald \(\chi^2\)  | Sandwich-robust Wald  |
| Bootstrap           | Empirical distribution |                        |                       |

The takeaway. Different test statistics differ in their small-sample properties, robustness, and whether they require variance estimation or rely on asymptotics. But the unifying idea is always the same: measure how far the data are from what the null hypothesis predicts, in units that account for sampling variability. The bigger that distance, the stronger the evidence against the null.