Matching

The idea

You want the causal effect of a treatment, and you believe selection on observables holds — all confounders are observed. Matching implements this by finding, for each treated unit, a control unit that looks similar on covariates \(X\).

The logic is simple: if two people have the same \(X\), and one got treated while the other didn’t, the difference in their outcomes estimates the treatment effect for that value of \(X\). Average across all matched pairs and you get the ATT.

Example: job training programs. People who enroll in a government training program earn more afterward — but is that the program, or did motivated people self-select? Matching says: for each enrollee, find a non-enrollee with the same age, education, prior income, and industry. Compare their earnings. If the enrollee still earns more after matching on everything observable, that’s the program’s effect. This is exactly the setting LaLonde (1986) studied — and he found that naive nonexperimental comparisons were badly biased relative to the experimental benchmark.

\[\hat{\tau}_{match} = \frac{1}{N_1} \sum_{i: D_i=1} \Big[ Y_i - Y_{j(i)} \Big]\]

where \(j(i)\) is the control unit matched to treated unit \(i\).
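To make the formula concrete, here is a minimal sketch of this estimator in base R on simulated data. All names and the data-generating process are illustrative (they mirror the app further down), not a definitive implementation:

```r
# Minimal nearest-neighbor matching estimator (illustrative simulation)
set.seed(1)
n     <- 1000
x     <- rnorm(n)                          # single observed confounder
treat <- rbinom(n, 1, pnorm(x))            # treatment depends on x
y     <- 1 + 2 * x + 2 * treat + rnorm(n)  # true effect = 2

idx_t <- which(treat == 1)
idx_c <- which(treat == 0)

# j(i): the control closest to treated unit i in x (with replacement)
j_of_i <- sapply(idx_t, function(i) idx_c[which.min(abs(x[i] - x[idx_c]))])

mean(y[idx_t] - y[j_of_i])                 # the estimator above; close to 2
```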

Types of matching

| Method | How it matches | Pros | Cons |
|---|---|---|---|
| Exact | Identical \(X\) values | No approximation error | Infeasible with continuous or many covariates |
| Nearest neighbor (on \(X\)) | Closest \(\lVert X_i - X_j \rVert\) | Simple, intuitive | Curse of dimensionality with many covariates |
| Nearest neighbor (on PS) | Closest \(\lvert e(X_i) - e(X_j) \rvert\) | Reduces the problem to 1D (Rosenbaum & Rubin, 1983) | Inherits PS model risk |
| Caliper | NN, but reject if distance \(> c\) | Drops bad matches | Loses observations |
| Mahalanobis | Distance accounts for correlations | Better than Euclidean for correlated \(X\) | Still suffers in high dimensions |
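As one example from the table, Mahalanobis matching replaces the Euclidean distance with one that accounts for the covariance of \(X\). A minimal sketch with two correlated covariates (simulated data; MASS is used only to draw them):

```r
# Nearest-neighbor matching on Mahalanobis distance (illustrative)
set.seed(1)
n <- 500
X <- MASS::mvrnorm(n, mu = c(0, 0),
                   Sigma = matrix(c(1, 0.8, 0.8, 1), 2))  # correlated X
treat <- rbinom(n, 1, pnorm(X[, 1]))

idx_t <- which(treat == 1)
idx_c <- which(treat == 0)
S     <- cov(X)   # covariance used to rescale the distance

# mahalanobis() returns squared distances, which preserves the argmin
j_of_i <- sapply(idx_t, function(i) {
  idx_c[which.min(mahalanobis(X[idx_c, ], center = X[i, ], cov = S))]
})
```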

Assumptions

Same as all selection on observables methods:

  1. Conditional independence (CIA): \(Y(0), Y(1) \perp D \mid X\). All confounders are observed.
  2. Overlap (common support): \(0 < P(D = 1 \mid X) < 1\). For every treated unit, a comparable control exists.
  3. SUTVA: no interference between units.
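Overlap is the one assumption you can partly check in the data. An informal diagnostic — sketched here reusing `treat` and `x` from the first snippet above — is to compare estimated propensity scores across the two groups:

```r
# Informal overlap check: regions where only one group has mass
# signal poor common support
ps <- fitted(glm(treat ~ x, family = binomial))
hist(ps[treat == 1], breaks = 20, col = rgb(0, 0, 1, 0.4), xlim = c(0, 1),
     main = "Overlap check", xlab = "Estimated propensity score")
hist(ps[treat == 0], breaks = 20, col = rgb(1, 0, 0, 0.4), add = TRUE)
legend("topright", c("Treated", "Control"), bty = "n",
       fill = c(rgb(0, 0, 1, 0.4), rgb(1, 0, 0, 0.4)))
```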

The vulnerability: match quality

Matching finds comparisons rather than modeling outcomes. But the quality of those comparisons depends on:

  • Overlap: if treated and control units live in different parts of the covariate space, the “nearest” neighbor may be far away. The match is bad and the estimate is biased.
  • Dimensionality: with many covariates, even the nearest neighbor can be quite distant (“curse of dimensionality”). This is why propensity score matching collapses \(X\) to a single dimension.

Compare this to regression adjustment, which extrapolates using a model, and IPW, which reweights instead of pairing. Each has different failure modes.
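The dimensionality point is easy to verify by simulation. In this quick sketch (standard-normal covariates; the numbers are illustrative), the average distance to the nearest neighbor at a fixed sample size grows steadily with the number of covariates:

```r
# Average nearest-neighbor distance vs. number of covariates
set.seed(1)
avg_nn_dist <- sapply(c(1, 2, 5, 10, 20), function(p) {
  X <- matrix(rnorm(500 * p), ncol = p)  # 500 units, p covariates
  d <- as.matrix(dist(X))                # pairwise Euclidean distances
  diag(d) <- Inf                         # exclude self-matches
  mean(apply(d, 1, min))                 # mean distance to nearest neighbor
})
round(avg_nn_dist, 2)                    # grows with p
```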

#| standalone: true
#| viewerHeight: 650

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
    .good { color: #27ae60; font-weight: bold; }
    .bad  { color: #e74c3c; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("n", "Sample size:",
                  min = 200, max = 2000, value = 500, step = 100),

      sliderInput("ate", "True ATE:",
                  min = 0, max = 5, value = 2, step = 0.5),

      sliderInput("confounding", "Confounding strength:",
                  min = 0, max = 3, value = 1.5, step = 0.25),

      selectInput("match_type", "Matching method:",
                  choices = c("NN on X" = "nn_x",
                              "NN on propensity score" = "nn_ps",
                              "Caliper (on X)" = "caliper")),

      actionButton("go", "New draw", class = "btn-primary", width = "100%"),

      uiOutput("results")
    ),

    mainPanel(
      width = 9,
      fluidRow(
        column(6, plotOutput("match_plot", height = "420px")),
        column(6, plotOutput("compare_plot", height = "420px"))
      )
    )
  )
)

server <- function(input, output, session) {

  dat <- reactive({
    input$go
    n    <- input$n
    ate  <- input$ate
    conf <- input$confounding
    mtype <- input$match_type

    # Confounder
    x <- rnorm(n)

    # Treatment depends on x
    p_true <- pnorm(conf * x)
    treat <- rbinom(n, 1, p_true)

    # Outcome
    y <- 1 + 2 * x + ate * treat + rnorm(n)

    idx_t <- which(treat == 1)
    idx_c <- which(treat == 0)

    # Propensity score (for PS matching)
    ps <- fitted(glm(treat ~ x, family = binomial))

    # Matching: pair each treated unit with a control (with replacement)
    matched_j <- integer(length(idx_t))
    match_dist <- numeric(length(idx_t))

    if (mtype == "nn_x") {
      for (k in seq_along(idx_t)) {
        dists <- abs(x[idx_t[k]] - x[idx_c])
        best <- which.min(dists)
        matched_j[k] <- idx_c[best]
        match_dist[k] <- dists[best]
      }
    } else if (mtype == "nn_ps") {
      for (k in seq_along(idx_t)) {
        dists <- abs(ps[idx_t[k]] - ps[idx_c])
        best <- which.min(dists)
        matched_j[k] <- idx_c[best]
        match_dist[k] <- abs(x[idx_t[k]] - x[idx_c[best]])
      }
    } else {
      # Caliper on X (caliper = 0.25 * sd(x))
      cal <- 0.25 * sd(x)
      for (k in seq_along(idx_t)) {
        dists <- abs(x[idx_t[k]] - x[idx_c])
        best <- which.min(dists)
        if (dists[best] <= cal) {
          matched_j[k] <- idx_c[best]
          match_dist[k] <- dists[best]
        } else {
          matched_j[k] <- NA
          match_dist[k] <- NA
        }
      }
    }

    # Matching estimate
    valid <- !is.na(matched_j)
    n_matched <- sum(valid)
    if (n_matched > 0) {
      match_est <- mean(y[idx_t[valid]] - y[matched_j[valid]])
    } else {
      match_est <- NA
    }
    avg_dist <- mean(match_dist[valid])

    # Naive
    naive <- mean(y[treat == 1]) - mean(y[treat == 0])

    # Regression adjustment
    reg_est <- coef(lm(y ~ treat + x))["treat"]

    list(x = x, y = y, treat = treat, ps = ps,
         idx_t = idx_t, idx_c = idx_c,
         matched_j = matched_j, valid = valid,
         match_est = match_est, naive = naive, reg_est = reg_est,
         ate = ate, n_matched = n_matched, avg_dist = avg_dist,
         mtype = mtype)
  })

  output$match_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    plot(d$x, d$y, pch = 16, cex = 0.5,
         col = ifelse(d$treat == 1, "#3498db40", "#e74c3c40"),
         xlab = "X (confounder)", ylab = "Y (outcome)",
         main = "Matched Pairs")

    # Draw match lines for a subset (first 30 valid matches)
    valid_idx <- which(d$valid)
    show <- valid_idx[seq_len(min(30, length(valid_idx)))]

    for (k in show) {
      i <- d$idx_t[k]
      j <- d$matched_j[k]
      segments(d$x[i], d$y[i], d$x[j], d$y[j],
               col = "#9b59b680", lwd = 1)
    }

    # Re-draw matched points on top
    for (k in show) {
      i <- d$idx_t[k]
      j <- d$matched_j[k]
      points(d$x[i], d$y[i], pch = 16, cex = 0.7, col = "#3498db")
      points(d$x[j], d$y[j], pch = 16, cex = 0.7, col = "#e74c3c")
    }

    n_show <- length(show)
    legend("topleft", bty = "n", cex = 0.8,
           legend = c("Treated", "Control",
                      paste0("Match link (", n_show, " shown)")),
           col = c("#3498db", "#e74c3c", "#9b59b6"),
           pch = c(16, 16, NA), lwd = c(NA, NA, 1.5))
  })

  output$compare_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 9, 3, 2))

    ests <- c(d$reg_est, d$match_est, d$naive)
    labels <- c("Regression adj.", "Matching", "Naive")
    cols <- c("#3498db", "#9b59b6", "#e74c3c")

    valid_ests <- !is.na(ests)
    xlim <- range(c(ests[valid_ests], d$ate))
    pad  <- max(diff(xlim) * 0.4, 0.5)
    xlim <- xlim + c(-pad, pad)

    plot(ests, 1:3, pch = 19, cex = 2, col = cols,
         xlim = xlim, ylim = c(0.5, 3.5),
         yaxt = "n", xlab = "Estimated treatment effect",
         ylab = "", main = "Estimator Comparison")
    axis(2, at = 1:3, labels = labels, las = 1, cex.axis = 0.9)

    abline(v = d$ate, lty = 2, col = "#2c3e50", lwd = 2)
    text(d$ate, 3.45, paste0("True ATE = ", d$ate),
         cex = 0.85, font = 2, col = "#2c3e50")

    segments(d$ate, 1:3, ests, 1:3, col = cols, lwd = 2, lty = 2)
  })

  output$results <- renderUI({
    d <- dat()
    b_naive <- d$naive - d$ate
    b_match <- if (!is.na(d$match_est)) d$match_est - d$ate else NA
    b_reg   <- d$reg_est - d$ate

    match_class <- if (!is.na(b_match) && abs(b_match) < abs(b_naive) * 0.5) "good" else "bad"
    reg_class   <- if (abs(b_reg) < abs(b_naive) * 0.5) "good" else "bad"

    method_label <- switch(d$mtype,
      nn_x    = "NN on X",
      nn_ps   = "NN on propensity score",
      caliper = "Caliper on X")

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>Method:</b> ", method_label, "<br>",
        "<b>True ATE:</b> ", d$ate, "<br>",
        "<hr style='margin:6px 0'>",
        "<b>Naive:</b> <span class='bad'>", round(d$naive, 3), "</span>",
        " (bias: ", round(b_naive, 3), ")<br>",
        "<b>Matching:</b> <span class='", match_class, "'>",
        if (!is.na(d$match_est)) round(d$match_est, 3) else "N/A",
        "</span>",
        if (!is.na(b_match)) paste0(" (bias: ", round(b_match, 3), ")") else "", "<br>",
        "<b>Reg. adj:</b> <span class='", reg_class, "'>",
        round(d$reg_est, 3), "</span>",
        " (bias: ", round(b_reg, 3), ")<br>",
        "<hr style='margin:6px 0'>",
        "<b>Pairs matched:</b> ", d$n_matched, " / ", length(d$idx_t), "<br>",
        "<b>Avg match distance:</b> ",
        if (!is.na(d$avg_dist)) round(d$avg_dist, 3) else "N/A"
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • NN on X, confounding = 1.5: matching works well. The purple lines connecting matched pairs are short — each treated unit finds a similar control. The matching estimate is close to the true ATE.
  • Switch to caliper: some treated units in the tails can’t find a close match and get dropped. “Pairs matched” drops below the total. The remaining matches are better quality (shorter distances).
  • Confounding = 3: treated and control units are far apart in X space. Match distances increase, match quality degrades, and the estimate gets noisier. This is the overlap problem — matching can’t fix it.
  • Small sample (n = 200) with high confounding: matching struggles because there aren’t enough control units in the right part of the covariate space. Regression adjustment extrapolates, which helps here but can hurt elsewhere (see the regression adjustment page).
  • NN on propensity score: matches on the estimated \(e(X)\) instead of \(X\) directly. With one confounder, results are similar. The advantage appears with multiple covariates — not shown in the app, but sketched just below.
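A minimal sketch of that multi-covariate case, outside the app (the data-generating process is illustrative):

```r
# PS matching with two confounders: match on the 1-D score, not on X
set.seed(1)
n  <- 1000
x1 <- rnorm(n); x2 <- rnorm(n)
treat <- rbinom(n, 1, pnorm(0.8 * x1 + 0.8 * x2))
y     <- 1 + 2 * x1 + 2 * x2 + 2 * treat + rnorm(n)  # true effect = 2

ps    <- fitted(glm(treat ~ x1 + x2, family = binomial))
idx_t <- which(treat == 1)
idx_c <- which(treat == 0)

# Nearest control on the estimated propensity score
j_of_i <- sapply(idx_t, function(i) idx_c[which.min(abs(ps[i] - ps[idx_c]))])
mean(y[idx_t] - y[j_of_i])   # typically lands near 2 in this simulation
```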

How does matching compare to other tools?

| Tool | Strategy | Vulnerability |
|---|---|---|
| Matching | Find similar units | Bad matches when overlap is poor |
| Regression adj. | Model the outcome | Wrong functional form |
| IPW | Reweight by propensity score | Extreme weights |
| Entropy balancing | Balance moments exactly | Missing higher-order moments |
| Doubly robust | Combine outcome model + PS | Fails only if both models are wrong |

All rely on the same identification assumption (CIA). They differ in how they use \(X\) to make the comparison fair.


In Stata

* Nearest-neighbor matching (on covariates)
teffects nnmatch (outcome x1 x2) (treatment), nneighbor(1)

* Bias-corrected matching (Abadie & Imbens)
teffects nnmatch (outcome x1 x2) (treatment), nneighbor(1) biasadj(x1 x2)

* Propensity score matching
teffects psmatch (outcome) (treatment x1 x2)

* Coarsened exact matching
* ssc install cem
cem x1 (#5) x2 (#3), treatment(treatment)
reg outcome treatment x1 x2 [iw=cem_weights]

teffects nnmatch with biasadj() corrects for the residual covariate imbalance that remains within matched pairs — important when matching on continuous variables, where exact matches are essentially never available.


Did you know?

  • Abadie & Imbens (2006) showed that nearest-neighbor matching is not \(\sqrt{n}\)-consistent — its convergence rate is slower than that of parametric estimators when matching on more than one continuous covariate. The bias from imperfect matches doesn’t vanish fast enough. Their bias-corrected matching estimator fixes this by using an outcome regression, fit on the controls, to adjust each matched difference for the residual covariate gap (a sketch appears after this list).

  • Propensity score matching was proposed by Rosenbaum & Rubin (1983), who proved that matching on the scalar \(e(X) = P(D=1 \mid X)\) is sufficient for removing confounding — you don’t need to match on every covariate separately. This was a breakthrough because it collapsed the high-dimensional matching problem to one dimension.

  • King & Nielsen (2019) argued that propensity score matching can paradoxically increase imbalance and model dependence. Their critique centers on the fact that exact PS matches don’t guarantee covariate balance — two units with the same propensity score can have very different covariate values.
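A sketch of the bias-correction idea in R (simulated data; this mirrors the logic behind biasadj() in the Stata snippet above, not Stata’s exact implementation):

```r
# Abadie–Imbens-style bias correction: use an outcome regression fit on
# the controls to remove the bias from the residual gap x_i - x_j
set.seed(1)
n <- 1000
x <- rnorm(n)
treat <- rbinom(n, 1, pnorm(x))
y <- 1 + 2 * x + 2 * treat + rnorm(n)    # true effect = 2

idx_t <- which(treat == 1)
idx_c <- which(treat == 0)
j_of_i <- sapply(idx_t, function(i) idx_c[which.min(abs(x[i] - x[idx_c]))])

mu0 <- lm(y ~ x, data = data.frame(y = y[idx_c], x = x[idx_c]))
adj <- predict(mu0, newdata = data.frame(x = x[idx_t])) -
       predict(mu0, newdata = data.frame(x = x[j_of_i]))

mean(y[idx_t] - y[j_of_i])               # raw matching estimate
mean(y[idx_t] - y[j_of_i] - adj)         # bias-corrected estimate
```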