Matching
The idea
You want the causal effect of a treatment, and you believe selection on observables holds — all confounders are observed. Matching implements this by finding, for each treated unit, a control unit that looks similar on covariates \(X\).
The logic is simple: if two people have the same \(X\), and one got treated while the other didn’t, the difference in their outcomes estimates the treatment effect for that value of \(X\). Average across all matched pairs and you get the ATT.
\[\hat{\tau}_{\text{match}} = \frac{1}{N_1} \sum_{i: D_i=1} \Big[ Y_i - Y_{j(i)} \Big]\]
where \(j(i)\) is the control unit matched to treated unit \(i\).
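To make the estimator concrete, here is a minimal sketch in R on simulated data (1-NN matching with replacement on a single covariate; all variable names are our own, not from any package):

```r
# Minimal sketch: 1-NN matching (with replacement) on one covariate.
# All data simulated; variable names are our own.
set.seed(1)
n <- 500
x <- rnorm(n)
d <- rbinom(n, 1, pnorm(x))          # treatment probability rises with x
y <- 1 + 2 * x + 2 * d + rnorm(n)    # true effect = 2

treated  <- which(d == 1)
controls <- which(d == 0)

# For each treated unit i, find the control j(i) closest in x
j <- sapply(treated, function(i) controls[which.min(abs(x[i] - x[controls]))])
att_hat <- mean(y[treated] - y[j])
att_hat  # should land near 2
```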
Types of matching
| Method | How it matches | Pros | Cons |
|---|---|---|---|
| Exact | Identical X values | No approximation error | Infeasible with continuous or many covariates |
| Nearest neighbor (on X) | Closest \(\|X_i - X_j\|\) | Simple, intuitive | Curse of dimensionality with many covariates |
| Nearest neighbor (on PS) | Closest \(\|e(X_i) - e(X_j)\|\) | Reduces to 1D; Rosenbaum & Rubin (1983) | Inherits PS model risk |
| Caliper | NN, but reject if distance \(> c\) | Drops bad matches | Loses observations |
| Mahalanobis | Distance accounts for correlations | Better than Euclidean for correlated X | Still suffers in high dimensions |
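As a sketch of why the distance metric matters, base R's `mahalanobis()` can be compared with plain (squared) Euclidean distance on two correlated covariates. This is simulated data and not a full matching routine, just the distance computation:

```r
# Sketch: Mahalanobis vs. Euclidean "nearest" control for one treated unit.
set.seed(2)
n <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + sqrt(1 - 0.8^2) * rnorm(n)  # correlated with x1
X <- cbind(x1, x2)
S <- cov(X)

i <- 1  # pretend unit 1 is treated; the rest are candidate controls
d_maha <- mahalanobis(X[-i, ], center = X[i, ], cov = S)  # squared Mahalanobis
d_eucl <- rowSums(sweep(X[-i, ], 2, X[i, ])^2)            # squared Euclidean

# The two metrics can disagree on which control is "nearest"
which.min(d_maha)
which.min(d_eucl)
```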
Assumptions
The same as for all selection-on-observables methods:
- Conditional independence (CIA): \(Y(0), Y(1) \perp D \mid X\). All confounders are observed.
- Overlap (common support): \(0 < P(D = 1 \mid X) < 1\). For every treated unit, a comparable control exists.
- SUTVA: no interference between units.
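Overlap is the one assumption you can actually inspect in the data. A minimal sketch, assuming a single confounder and simulated data; with real data, swap in your own covariates:

```r
# Sketch: eyeballing overlap via estimated propensity scores.
set.seed(3)
n <- 1000
x <- rnorm(n)
d <- rbinom(n, 1, pnorm(1.5 * x))
ps <- fitted(glm(d ~ x, family = binomial))

# Ranges should overlap; mass near 0 or 1 signals thin common support
summary(ps[d == 1])
summary(ps[d == 0])
brks <- seq(0, 1, by = 0.05)
hist(ps[d == 0], breaks = brks, col = rgb(1, 0, 0, 0.4),
     main = "Propensity score by group", xlab = "e(X)")
hist(ps[d == 1], breaks = brks, col = rgb(0, 0, 1, 0.4), add = TRUE)
```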
The vulnerability: match quality
Matching finds comparisons rather than modeling outcomes. But the quality of those comparisons depends on:
- Overlap: if treated and control units live in different parts of the covariate space, the “nearest” neighbor may be far away. The match is bad and the estimate is biased.
- Dimensionality: with many covariates, even the nearest neighbor can be quite distant (“curse of dimensionality”). This is why propensity score matching collapses \(X\) to a single dimension.
Compare this to regression adjustment, which extrapolates using a model, and IPW, which reweights instead of pairing. Each has different failure modes.
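A quick simulation illustrates the dimensionality point: holding the sample size fixed, the nearest neighbor drifts further away as covariates are added (simulated data; variable names are our own):

```r
# Sketch: nearest-neighbor distance grows with the number of covariates,
# even at a fixed sample size (the curse of dimensionality).
set.seed(4)
n <- 500
for (p in c(1, 2, 5, 10, 20)) {
  X <- matrix(rnorm(n * p), n, p)
  # Distance from unit 1 to its nearest neighbor among the rest
  d1 <- sqrt(rowSums(sweep(X[-1, , drop = FALSE], 2, X[1, ])^2))
  cat(sprintf("p = %2d  nearest-neighbor distance = %.2f\n", p, min(d1)))
}
```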
#| standalone: true
#| viewerHeight: 650
library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
    .good { color: #27ae60; font-weight: bold; }
    .bad { color: #e74c3c; font-weight: bold; }
  "))),
  sidebarLayout(
    sidebarPanel(
      width = 3,
      sliderInput("n", "Sample size:",
                  min = 200, max = 2000, value = 500, step = 100),
      sliderInput("ate", "True ATE:",
                  min = 0, max = 5, value = 2, step = 0.5),
      sliderInput("confounding", "Confounding strength:",
                  min = 0, max = 3, value = 1.5, step = 0.25),
      selectInput("match_type", "Matching method:",
                  choices = c("NN on X" = "nn_x",
                              "NN on propensity score" = "nn_ps",
                              "Caliper (on X)" = "caliper")),
      actionButton("go", "New draw", class = "btn-primary", width = "100%"),
      uiOutput("results")
    ),
    mainPanel(
      width = 9,
      fluidRow(
        column(6, plotOutput("match_plot", height = "420px")),
        column(6, plotOutput("compare_plot", height = "420px"))
      )
    )
  )
)

server <- function(input, output, session) {
  dat <- reactive({
    input$go
    n <- input$n
    ate <- input$ate
    conf <- input$confounding
    mtype <- input$match_type
    # Confounder
    x <- rnorm(n)
    # Treatment depends on x
    p_true <- pnorm(conf * x)
    treat <- rbinom(n, 1, p_true)
    # Outcome
    y <- 1 + 2 * x + ate * treat + rnorm(n)
    idx_t <- which(treat == 1)
    idx_c <- which(treat == 0)
    # Propensity score (for PS matching)
    ps <- fitted(glm(treat ~ x, family = binomial))
    # Matching
    matched_j <- integer(length(idx_t))
    match_dist <- numeric(length(idx_t))
    if (mtype == "nn_x") {
      for (k in seq_along(idx_t)) {
        dists <- abs(x[idx_t[k]] - x[idx_c])
        best <- which.min(dists)
        matched_j[k] <- idx_c[best]
        match_dist[k] <- dists[best]
      }
    } else if (mtype == "nn_ps") {
      for (k in seq_along(idx_t)) {
        dists <- abs(ps[idx_t[k]] - ps[idx_c])
        best <- which.min(dists)
        matched_j[k] <- idx_c[best]
        match_dist[k] <- abs(x[idx_t[k]] - x[idx_c[best]])
      }
    } else {
      # Caliper on X (caliper = 0.25 * sd(x))
      cal <- 0.25 * sd(x)
      for (k in seq_along(idx_t)) {
        dists <- abs(x[idx_t[k]] - x[idx_c])
        best <- which.min(dists)
        if (dists[best] <= cal) {
          matched_j[k] <- idx_c[best]
          match_dist[k] <- dists[best]
        } else {
          matched_j[k] <- NA
          match_dist[k] <- NA
        }
      }
    }
    # Matching estimate
    valid <- !is.na(matched_j)
    n_matched <- sum(valid)
    if (n_matched > 0) {
      match_est <- mean(y[idx_t[valid]] - y[matched_j[valid]])
    } else {
      match_est <- NA
    }
    avg_dist <- mean(match_dist[valid])
    # Naive
    naive <- mean(y[treat == 1]) - mean(y[treat == 0])
    # Regression adjustment
    reg_est <- coef(lm(y ~ treat + x))["treat"]
    list(x = x, y = y, treat = treat, ps = ps,
         idx_t = idx_t, idx_c = idx_c,
         matched_j = matched_j, valid = valid,
         match_est = match_est, naive = naive, reg_est = reg_est,
         ate = ate, n_matched = n_matched, avg_dist = avg_dist,
         mtype = mtype)
  })

  output$match_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))
    plot(d$x, d$y, pch = 16, cex = 0.5,
         col = ifelse(d$treat == 1, "#3498db40", "#e74c3c40"),
         xlab = "X (confounder)", ylab = "Y (outcome)",
         main = "Matched Pairs")
    # Draw match lines for a subset (first 30 valid matches)
    valid_idx <- which(d$valid)
    show <- valid_idx[seq_len(min(30, length(valid_idx)))]
    for (k in show) {
      i <- d$idx_t[k]
      j <- d$matched_j[k]
      segments(d$x[i], d$y[i], d$x[j], d$y[j],
               col = "#9b59b680", lwd = 1)
    }
    # Re-draw matched points on top
    for (k in show) {
      i <- d$idx_t[k]
      j <- d$matched_j[k]
      points(d$x[i], d$y[i], pch = 16, cex = 0.7, col = "#3498db")
      points(d$x[j], d$y[j], pch = 16, cex = 0.7, col = "#e74c3c")
    }
    n_show <- length(show)
    legend("topleft", bty = "n", cex = 0.8,
           legend = c("Treated", "Control",
                      paste0("Match link (", n_show, " shown)")),
           col = c("#3498db", "#e74c3c", "#9b59b6"),
           pch = c(16, 16, NA), lwd = c(NA, NA, 1.5))
  })

  output$compare_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 9, 3, 2))
    ests <- c(d$reg_est, d$match_est, d$naive)
    labels <- c("Regression adj.", "Matching", "Naive")
    cols <- c("#3498db", "#9b59b6", "#e74c3c")
    valid_ests <- !is.na(ests)
    xlim <- range(c(ests[valid_ests], d$ate))
    pad <- max(diff(xlim) * 0.4, 0.5)
    xlim <- xlim + c(-pad, pad)
    plot(ests, 1:3, pch = 19, cex = 2, col = cols,
         xlim = xlim, ylim = c(0.5, 3.5),
         yaxt = "n", xlab = "Estimated treatment effect",
         ylab = "", main = "Estimator Comparison")
    axis(2, at = 1:3, labels = labels, las = 1, cex.axis = 0.9)
    abline(v = d$ate, lty = 2, col = "#2c3e50", lwd = 2)
    text(d$ate, 3.45, paste0("True ATE = ", d$ate),
         cex = 0.85, font = 2, col = "#2c3e50")
    segments(d$ate, 1:3, ests, 1:3, col = cols, lwd = 2, lty = 2)
  })

  output$results <- renderUI({
    d <- dat()
    b_naive <- d$naive - d$ate
    b_match <- if (!is.na(d$match_est)) d$match_est - d$ate else NA
    b_reg <- d$reg_est - d$ate
    match_class <- if (!is.na(b_match) && abs(b_match) < abs(b_naive) * 0.5) "good" else "bad"
    reg_class <- if (abs(b_reg) < abs(b_naive) * 0.5) "good" else "bad"
    method_label <- switch(d$mtype,
                           nn_x = "NN on X",
                           nn_ps = "NN on propensity score",
                           caliper = "Caliper on X")
    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>Method:</b> ", method_label, "<br>",
        "<b>True ATE:</b> ", d$ate, "<br>",
        "<hr style='margin:6px 0'>",
        "<b>Naive:</b> <span class='bad'>", round(d$naive, 3), "</span>",
        " (bias: ", round(b_naive, 3), ")<br>",
        "<b>Matching:</b> <span class='", match_class, "'>",
        if (!is.na(d$match_est)) round(d$match_est, 3) else "N/A",
        "</span>",
        if (!is.na(b_match)) paste0(" (bias: ", round(b_match, 3), ")") else "", "<br>",
        "<b>Reg. adj:</b> <span class='", reg_class, "'>",
        round(d$reg_est, 3), "</span>",
        " (bias: ", round(b_reg, 3), ")<br>",
        "<hr style='margin:6px 0'>",
        "<b>Pairs matched:</b> ", d$n_matched, " / ", length(d$idx_t), "<br>",
        "<b>Avg match distance:</b> ",
        if (!is.na(d$avg_dist)) round(d$avg_dist, 3) else "N/A"
      ))
    )
  })
}

shinyApp(ui, server)
Things to try
- NN on X, confounding = 1.5: matching works well. The purple lines connecting matched pairs are short — each treated unit finds a similar control. The matching estimate is close to the true ATE.
- Switch to caliper: some treated units in the tails can’t find a close match and get dropped. “Pairs matched” drops below the total. The remaining matches are better quality (shorter distances).
- Confounding = 3: treated and control units are far apart in X space. Match distances increase, match quality degrades, and the estimate gets noisier. This is the overlap problem — matching can’t fix it.
- Small sample (n = 200) with high confounding: matching struggles because there aren’t enough control units in the right part of the covariate space. Regression adjustment extrapolates, which helps here but can hurt elsewhere (see the regression adjustment page).
- NN on propensity score: matches on the estimated \(e(X)\) instead of \(X\) directly. With one confounder, results are similar. The advantage appears with multiple covariates (not shown here).
How does matching compare to other tools?
| Tool | Strategy | Vulnerability |
|---|---|---|
| Matching | Find similar units | Bad matches when overlap is poor |
| Regression adj. | Model the outcome | Wrong functional form |
| IPW | Reweight by propensity score | Extreme weights |
| Entropy balancing | Balance moments exactly | Missing higher-order moments |
| Doubly robust | Combine outcome model + PS | Fails only if both models wrong |
All rely on the same identification assumption (CIA). They differ in how they use \(X\) to make the comparison fair.
In Stata
* Nearest-neighbor matching (on covariates)
teffects nnmatch (outcome x1 x2) (treatment), nneighbor(1)
* Bias-corrected matching (Abadie & Imbens)
teffects nnmatch (outcome x1 x2) (treatment), nneighbor(1) biasadj(x1 x2)
* Propensity score matching
teffects psmatch (outcome) (treatment x1 x2)
* Coarsened exact matching
* ssc install cem
cem x1 (#5) x2 (#3), treatment(treatment)
reg outcome treatment x1 x2 [iw=cem_weights]

teffects nnmatch with biasadj() corrects for the remaining covariate imbalance within matched pairs, which matters when matching on continuous variables.
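For reference, a sketch of a comparable workflow in R using the MatchIt package; `df`, `outcome`, `treatment`, `x1`, and `x2` are placeholders for your own data:

```r
# Sketch of the equivalent workflow in R with MatchIt
# (assumes MatchIt is installed; df and the variable names are placeholders).
library(MatchIt)

# Nearest-neighbor matching on the propensity score
m <- matchit(treatment ~ x1 + x2, data = df, method = "nearest")
summary(m)  # balance diagnostics

# Estimate the ATT on the matched sample
md <- match.data(m)
lm(outcome ~ treatment + x1 + x2, data = md, weights = weights)
```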
Did you know?
Abadie & Imbens (2006) showed that nearest-neighbor matching is not \(\sqrt{n}\)-consistent — its convergence rate is slower than parametric estimators when matching on more than one continuous covariate. The bias from imperfect matches doesn’t vanish fast enough. Their bias-corrected matching estimator fixes this by running a regression within each matched pair to adjust for the remaining covariate difference.
Propensity score matching was proposed by Rosenbaum & Rubin (1983), who proved that matching on the scalar \(e(X) = P(D=1 \mid X)\) is sufficient for removing confounding — you don’t need to match on every covariate separately. This was a breakthrough because it collapsed the high-dimensional matching problem to one dimension.
King & Nielsen (2019) argued that propensity score matching can paradoxically increase imbalance and model dependence. Their critique centers on the fact that exact PS matches don’t guarantee covariate balance — two units with the same propensity score can have very different covariate values.