Identification vs Estimation
Two separate questions in every causal study
Every causal inference project answers two fundamentally different questions:
- Identification. Why is this comparison causal? What assumptions make the estimand equal to the causal parameter?
- Estimation. How do we compute the estimand from data? What statistical procedure do we use?
These are conceptually independent:
- You can have correct identification with a poor estimator (unbiased but noisy).
- You can have a highly efficient estimator with no identification (a precise estimate of the wrong quantity).
Identification comes first. If the assumptions fail, no estimator can recover the causal effect.
Identification: the assumption
Identification is about the source of exogenous variation — why the variation in treatment you’re using is “as good as random” for estimating a causal effect.
| Identification strategy | The assumption | In words |
|---|---|---|
| Selection on observables | \(Y(0), Y(1) \perp D \mid X\) | Conditional on X, treatment is as good as random |
| Parallel trends | \(E[Y_t(0) - Y_{t-1}(0) \mid D=1] = E[Y_t(0) - Y_{t-1}(0) \mid D=0]\) | Absent treatment, both groups would have trended the same |
| Exclusion restriction | \(Z\) affects \(Y\) only through \(X\) | The instrument has no direct effect on the outcome |
| Continuity | \(E[Y(0) \mid X=x]\) is continuous at the cutoff | No other jump happens at the cutoff |
Each assumption is a claim about the world — not something you compute. You argue it using institutional knowledge, theory, and indirect evidence. Some are partially testable (you can check pre-trends for DID, run a McCrary test for RDD), but none can be fully proven from data.
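To make the "partially testable" point concrete, here is a minimal sketch of a DID pre-trends check on simulated pre-period data (the variable names, the linear-trend specification, and the parameter values are all illustrative):

```r
# Sketch of a DID pre-trends check: in pre-treatment periods only, test whether
# the eventually-treated group was already trending differently (simulated data)
set.seed(1)
n <- 4000
group  <- rbinom(n, 1, 0.5)                  # 1 = eventually treated
period <- sample(-3:-1, n, replace = TRUE)   # pre-treatment periods only
y <- 1 + 0.5 * group + 0.2 * period + rnorm(n)  # common trend: pre-trends parallel here
fit <- lm(y ~ group * period)
summary(fit)$coefficients["group:period", ]  # interaction near 0 when trends are parallel
```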
Estimation: the computation
Estimation is about how you turn data into a number, given that you believe your identification assumption holds.
| Estimator | What it does |
|---|---|
| OLS regression | Fits a linear model; the coefficient on treatment is the estimate |
| Matching | Pairs treated/control units with similar covariates |
| IPW | Reweights observations by inverse propensity scores |
| Entropy balancing | Finds weights that exactly balance covariate moments |
| Doubly robust | Combines regression and weighting |
| 2SLS | Two-stage regression using predicted values from the first stage |
| Local polynomial | Fits flexible curves on each side of a cutoff |
| Synthetic control weights | Constrained optimization to match pre-treatment trends |
| TWFE | Regression with unit and time fixed effects |
These are tools — they can often be combined with different identification strategies. IPW can implement selection on observables or be used in a DID design (IPW-DID). Regression can adjust for covariates in an RDD or in a cross-sectional study. The estimator doesn’t determine identification; the assumption does.
Research designs bundle both
What we usually call “methods” in applied work — DID, IV, RDD — are really research designs that bundle an identification strategy with a default estimator:
| Research design | Identification | Common estimators |
|---|---|---|
| SOO study | Conditional independence | Regression, matching, IPW, EB, doubly robust |
| DID | Parallel trends | 2×2 difference, TWFE, IPW-DID (Abadie 2005), DR-DID (Sant’Anna & Zhao 2020) |
| IV | Exclusion restriction + relevance | 2SLS, LIML, GMM |
| RDD | Continuity at cutoff | Local polynomial, local randomization |
| Synthetic control | Pre-treatment fit → valid counterfactual | Constrained weight optimization, augmented SCM |
This is why the same estimation tool shows up in multiple designs. IPW appears in the SOO column and the DID column — because it’s a tool, not a strategy.
The math: where bias comes from
When an identification assumption fails, it introduces a bias term that no estimator can remove. Here’s the decomposition for three methods.
Selection on observables
We want the Average Treatment Effect on the Treated (ATT):
\[\tau = E[Y(1) - Y(0) \mid D = 1]\]
We observe \(E[Y \mid D=1] = E[Y(1) \mid D=1]\) and \(E[Y \mid D=0] = E[Y(0) \mid D=0]\). The naive comparison is:
\[E[Y \mid D=1] - E[Y \mid D=0] = \underbrace{E[Y(1) - Y(0) \mid D=1]}_{\text{ATT}} + \underbrace{E[Y(0) \mid D=1] - E[Y(0) \mid D=0]}_{\text{selection bias}}\]
The second term is selection bias — the treated group would have had different outcomes even without treatment. The CIA says: conditional on \(X\), \(E[Y(0) \mid D=1, X] = E[Y(0) \mid D=0, X]\), so the selection bias is zero within each stratum of \(X\).
If the CIA fails — there’s an unobserved confounder \(U\) — then \(E[Y(0) \mid D=1, X] \neq E[Y(0) \mid D=0, X]\) because \(D\) is still correlated with \(Y(0)\) through \(U\) even after conditioning on \(X\). The bias term is nonzero. Regression, IPW, matching — all give biased answers because the selection bias is baked into the estimand, not the estimator.
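A small simulation makes the decomposition tangible. The parameter values below are illustrative; the point is that the naive gap equals the ATT plus the selection-bias term:

```r
# Naive comparison = ATT + selection bias (simulated; parameter values illustrative)
set.seed(42)
n <- 100000
u  <- rnorm(n)                      # drives selection into treatment
d  <- rbinom(n, 1, pnorm(u))        # treatment more likely when u is high
y0 <- 1 + 1.5 * u + rnorm(n)        # untreated outcome also depends on u
y1 <- y0 + 2                        # true ATT = 2
y  <- ifelse(d == 1, y1, y0)

naive <- mean(y[d == 1]) - mean(y[d == 0])
att   <- mean(y1[d == 1] - y0[d == 1])
sel   <- mean(y0[d == 1]) - mean(y0[d == 0])
c(naive = naive, att = att, selection_bias = sel)  # naive ~ att + selection_bias
```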
Difference-in-differences
The DID estimand is:
\[\tau_{DID} = \big(E[Y_{1t}] - E[Y_{1,t-1}]\big) - \big(E[Y_{0t}] - E[Y_{0,t-1}]\big)\]
where group 1 is treated, group 0 is control, \(t\) is post, \(t-1\) is pre. Substitute potential outcomes and add and subtract \(E[Y_{1t}(0)]\):
\[\tau_{DID} = \underbrace{E[Y_{1t}(1) - Y_{1t}(0)]}_{\text{ATT}} + \underbrace{\big(E[Y_{1t}(0)] - E[Y_{1,t-1}(0)]\big) - \big(E[Y_{0t}(0)] - E[Y_{0,t-1}(0)]\big)}_{\text{differential trend bias}}\]
The parallel trends assumption says the second term equals zero — the treated group's untreated trajectory matches the control group's trajectory. Then \(\tau_{DID} = \text{ATT}\).
If parallel trends fail — say the treated group was already trending upward faster — the differential trend term is positive. DID overestimates the effect. This bias doesn’t shrink with more data. It doesn’t go away if you switch from a 2×2 difference to TWFE or IPW-DID. It’s an identification failure, not an estimation failure.
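A minimal sketch of the same point in code, using made-up group means and an assumed differential trend of 1:

```r
# DID under a violated parallel-trends assumption (simulated, illustrative values)
set.seed(7)
n <- 50000
att <- 2           # true treatment effect
gap <- 1           # treated group trends up faster even without treatment
y_pre_1  <- rnorm(n, mean = 5)                  # treated, pre
y_post_1 <- rnorm(n, mean = 5 + 1 + gap + att)  # treated, post (common trend = 1)
y_pre_0  <- rnorm(n, mean = 3)                  # control, pre
y_post_0 <- rnorm(n, mean = 3 + 1)              # control, post
did <- (mean(y_post_1) - mean(y_pre_1)) - (mean(y_post_0) - mean(y_pre_0))
c(did = did, true_att = att, differential_trend = gap)  # did ~ att + gap
```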
Instrumental variables
We have \(Y = \beta X + \varepsilon\) where \(\text{Cov}(X, \varepsilon) \neq 0\) (endogeneity). The IV estimand is:
\[\beta_{IV} = \frac{\text{Cov}(Z, Y)}{\text{Cov}(Z, X)}\]
Substitute \(Y = \beta X + \varepsilon\):
\[\beta_{IV} = \frac{\text{Cov}(Z, \beta X + \varepsilon)}{\text{Cov}(Z, X)} = \beta + \frac{\text{Cov}(Z, \varepsilon)}{\text{Cov}(Z, X)}\]
The exclusion restriction says \(\text{Cov}(Z, \varepsilon) = 0\) — the instrument is uncorrelated with the error. Then \(\beta_{IV} = \beta\).
If the exclusion restriction fails — \(Z\) directly affects \(Y\) through some channel other than \(X\) — then \(\text{Cov}(Z, \varepsilon) \neq 0\) and the bias term \(\frac{\text{Cov}(Z, \varepsilon)}{\text{Cov}(Z, X)}\) is nonzero. No amount of data, no alternative estimator (LIML, GMM, jackknife) removes this. It’s baked in.
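Here is a quick simulated check of the decomposition, with an assumed direct effect of \(Z\) on \(Y\) of 0.5 (all values illustrative):

```r
# IV when the exclusion restriction fails (simulated, illustrative values)
set.seed(3)
n <- 200000
z <- rnorm(n)
u <- rnorm(n)                           # unobserved confounder
x <- 0.8 * z + u + rnorm(n)             # Z is a relevant instrument
direct <- 0.5                           # direct effect of Z on Y (the violation)
y <- 2 * x + direct * z + u + rnorm(n)  # true beta = 2
beta_iv <- cov(z, y) / cov(z, x)
bias    <- direct * var(z) / cov(z, x)  # Cov(Z, eps) / Cov(Z, X)
c(beta_iv = beta_iv, true_beta = 2, predicted_bias = bias)  # beta_iv ~ 2 + bias
```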
Threats to identification
Each method has specific threats — things that make the bias term nonzero:
| Method | Identification assumption | Threat (what breaks it) | What the bias looks like |
|---|---|---|---|
| SOO | No unobserved confounders | Omitted variable that drives both \(D\) and \(Y\) | Selection bias: treated group was different to begin with |
| DID | Parallel trends | Treated group was already on a different trajectory | You attribute the pre-existing trend to the treatment |
| IV | Exclusion restriction | Instrument affects \(Y\) through a channel other than \(X\) | IV picks up the direct effect, not just the causal path |
| RDD | Continuity at cutoff | Units manipulate their score to sort across the cutoff; or another policy also kicks in at the same cutoff | The “jump” reflects sorting or a different treatment, not your treatment |
| Synthetic control | Pre-treatment fit generalizes | Spillovers from treated unit to donors; structural break changes the relationship | Counterfactual is wrong, gap doesn’t reflect the treatment |
Notice: every threat is about the world, not about the math. You can’t test your way out of these — you argue them with institutional knowledge.
The pattern
In all three cases:
\[\text{Estimate} = \text{Causal effect} + \text{Identification bias} + \text{Estimation bias}\]
Identification bias comes from violated assumptions — it's a function of how the world works. Estimation bias comes from the estimator — it's a function of how you computed the number. Identification bias usually dominates and cannot be removed by switching estimators. Estimation bias is usually smaller and can be fixed by choosing a better estimator. That's why identification comes first.
Types of bias
Identification biases
These come from the world, not the estimator. More data doesn’t help. A fancier estimator doesn’t help. You need a different identification strategy or better data.
Omitted variable bias (OVB). The most common. An unobserved variable \(U\) affects both treatment and outcome. For the simple regression \(Y = \beta X + \gamma U + \varepsilon\) where you omit \(U\):
\[\text{OVB} = \gamma \cdot \frac{\text{Cov}(X, U)}{\text{Var}(X)}\]
The bias is the effect of \(U\) on \(Y\) (\(\gamma\)) times how much \(U\) correlates with \(X\). If \(U\) pushes people toward treatment and improves outcomes, both factors are positive and you overestimate the effect.
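A quick numerical check of the formula on simulated data (the coefficients 2, 1.5, and 0.7 are illustrative):

```r
# Check the OVB formula on simulated data (coefficients illustrative)
set.seed(11)
n <- 100000
u <- rnorm(n)
x <- 0.7 * u + rnorm(n)                  # X and U are correlated
y <- 1 + 2 * x + 1.5 * u + rnorm(n)      # beta = 2, gamma = 1.5
short <- coef(lm(y ~ x))["x"]            # regression that omits U
ovb   <- 1.5 * cov(x, u) / var(x)        # gamma * Cov(X, U) / Var(X)
c(short_beta = unname(short), true_beta = 2, ovb = ovb)  # short_beta ~ 2 + ovb
```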
Selection bias. The treated group would have had different outcomes even without treatment: \(E[Y(0) \mid D=1] \neq E[Y(0) \mid D=0]\). This is OVB rephrased in potential outcomes language — the “omitted variable” is whatever drives people to select into treatment.
Simultaneity bias. \(X\) causes \(Y\) but \(Y\) also causes \(X\). Regressing \(Y\) on \(X\) picks up both directions. Common in macro (do interest rates affect GDP, or does GDP affect interest rates?) and in supply/demand estimation.
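A minimal sketch, assuming a simple two-equation linear system with illustrative coefficients:

```r
# Simultaneity: X causes Y and Y causes X; OLS mixes the two directions
set.seed(5)
n <- 100000
beta  <- 1.0    # effect of X on Y
delta <- 0.3    # feedback from Y back to X
e1 <- rnorm(n); e2 <- rnorm(n)
# Solve the two simultaneous equations for the observed values
y <- (e1 + beta * e2) / (1 - beta * delta)
x <- (delta * e1 + e2) / (1 - beta * delta)
coef(lm(y ~ x))["x"]   # about 1.19 here, not beta = 1: the feedback contaminates OLS
```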
Collider bias (sample selection bias). You condition on a variable caused by both treatment and outcome — this opens a fake path between them. The correlation-causation page covers this in detail.
Differential trends bias. In DID: the treated group was already on a different trajectory before treatment. The estimate captures the pre-existing divergence, not the treatment effect.
Estimation biases
These come from the estimator, not the world. They can be reduced or eliminated by choosing a better estimator, using more data, or fixing the specification.
Functional form misspecification. You fit a linear model but the truth is nonlinear. In RDD: a straight line through curved data creates a fake “jump” at the cutoff. Fix: use local polynomials, check robustness to specification.
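A rough illustration of the fake-jump problem (simulated data with no true effect at the cutoff; for simplicity the global fit uses a single shared slope rather than the separate local polynomials used in practice):

```r
# Curved data, zero true effect at the cutoff: a global linear fit invents a jump
set.seed(9)
n <- 2000
running <- runif(n, -1, 1)
treat <- as.numeric(running >= 0)
y <- 3 * running^3 + rnorm(n, sd = 0.3)   # smooth and nonlinear, no jump at 0
global_fit <- coef(lm(y ~ treat + running))["treat"]
local_fit  <- coef(lm(y ~ treat + running, subset = abs(running) < 0.1))["treat"]
c(global = unname(global_fit), local = unname(local_fit))
# The global fit reports a clearly nonzero "effect"; the narrow-window fit is near 0
```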
Finite sample / weak instrument bias. With weak instruments (first-stage \(F < 10\) by the usual rule of thumb), 2SLS is biased toward OLS in finite samples — even if the exclusion restriction holds. The bias shrinks with stronger instruments, alternative estimators such as LIML, or weak-instrument-robust inference (Anderson-Rubin).
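A small Monte Carlo sketch of the pull toward OLS, with a deliberately tiny first-stage coefficient (all values illustrative):

```r
# Weak instrument: the 2SLS estimate is pulled toward OLS in small samples
set.seed(13)
sims <- 2000; n <- 100
res <- replicate(sims, {
  u <- rnorm(n)
  z <- rnorm(n)
  x <- 0.05 * z + u + rnorm(n)     # very weak first stage
  y <- 1 * x + u + rnorm(n)        # true beta = 1, X is endogenous
  xhat <- fitted(lm(x ~ z))        # first stage
  c(tsls = unname(coef(lm(y ~ xhat))[2]),  # manual 2SLS point estimate
    ols  = unname(coef(lm(y ~ x))[2]))
})
apply(res, 1, median)  # 2SLS median sits near the OLS value (~1.5), not near 1
```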
Negative weighting (TWFE with staggered treatment). Goodman-Bacon (2021) and de Chaisemartin & d’Haultfoeuille (2020) showed that two-way fixed effects can produce negative weights on some treatment effects when treatment is staggered across time — giving biased estimates even when parallel trends holds. Fix: use Callaway & Sant’Anna, Sun & Abraham, or other heterogeneity-robust estimators.
Extreme weights. In IPW: when propensity scores are near 0 or 1, some observations get enormous weights, making the estimate noisy and potentially biased in finite samples. Fix: trim extreme scores, use entropy balancing, or use doubly robust estimators.
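A rough sketch of the variance problem and of score trimming as a partial fix (simulated data, illustrative trimming thresholds):

```r
# Extreme propensity scores make IPW noisy; clipping the scores tames the variance
set.seed(17)
one_draw <- function(trim) {
  n <- 1000
  x <- rnorm(n)
  d <- rbinom(n, 1, plogis(3 * x))             # strong selection -> scores near 0/1
  y <- x + 2 * d + rnorm(n)                    # true effect = 2
  ps <- fitted(glm(d ~ x, family = binomial))
  if (trim) ps <- pmin(pmax(ps, 0.05), 0.95)   # clip extreme scores
  w <- ifelse(d == 1, 1 / ps, 1 / (1 - ps))
  weighted.mean(y[d == 1], w[d == 1]) - weighted.mean(y[d == 0], w[d == 0])
}
raw     <- replicate(500, one_draw(FALSE))
trimmed <- replicate(500, one_draw(TRUE))
c(sd_raw = sd(raw), sd_trimmed = sd(trimmed))  # trimming cuts the noise (at some bias cost)
```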
Attenuation bias (measurement error). When \(X\) is measured with noise, OLS is biased toward zero. The noisier the measurement relative to the true signal, the more the estimate shrinks. Can be fixed with better measurement or IV. See the measurement error page for the signal-to-noise ratio formula.
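A minimal sketch of the shrinkage, with measurement-error variance equal to the signal variance so the reliability ratio is 0.5 (illustrative values):

```r
# Attenuation: noise in X shrinks the OLS coefficient toward zero
set.seed(21)
n <- 100000
x_true  <- rnorm(n)
y       <- 2 * x_true + rnorm(n)            # true beta = 2
x_noisy <- x_true + rnorm(n, sd = 1)        # measured with error
c(clean       = unname(coef(lm(y ~ x_true))["x_true"]),
  noisy       = unname(coef(lm(y ~ x_noisy))["x_noisy"]),
  reliability = var(x_true) / (var(x_true) + 1))   # expected shrinkage factor ~ 0.5
# noisy ~ 2 * reliability ~ 1
```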
Summary
| Bias | Type | Goes away with more data? | Fix |
|---|---|---|---|
| Omitted variable / confounding | Identification | No | Better controls, different strategy (IV, DID, RDD) |
| Selection bias | Identification | No | Randomization, or argue CIA |
| Simultaneity | Identification | No | IV, timing restrictions |
| Collider / sample selection | Identification | No | Don’t condition on colliders |
| Differential trends | Identification | No | Different comparison group, different strategy |
| Functional form | Estimation | Partially | Flexible specifications, local methods |
| Weak instruments | Estimation | Partially | Stronger instruments, LIML |
| TWFE negative weighting | Estimation | No | Heterogeneity-robust DID estimators |
| Extreme IPW weights | Estimation | Yes (slowly) | Trimming, EB, doubly robust |
| Attenuation (measurement error) | Both | No | Better data, IV |
Simulation: identification matters, estimation is secondary
Same data, same identification assumption, three different estimators. When the assumption holds, they all work. When it doesn’t, they all fail.
#| standalone: true
#| viewerHeight: 580
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 13px; line-height: 1.8;
}
.stats-box b { color: #2c3e50; }
.good { color: #27ae60; font-weight: bold; }
.bad { color: #e74c3c; font-weight: bold; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("n_ie", "Sample size:",
min = 200, max = 2000, value = 500, step = 100),
sliderInput("ate_ie", "True ATE:",
min = 0, max = 5, value = 2, step = 0.5),
sliderInput("obs_ie", "Observed confounding (X):",
min = 0, max = 3, value = 1.5, step = 0.25),
sliderInput("unobs_ie", "Unobserved confounding (U):",
min = 0, max = 3, value = 0, step = 0.25),
actionButton("go_ie", "New draw", class = "btn-primary", width = "100%"),
uiOutput("results_ie")
),
mainPanel(
width = 9,
plotOutput("ie_plot", height = "420px")
)
)
)
server <- function(input, output, session) {
dat <- reactive({
input$go_ie
n <- input$n_ie
ate <- input$ate_ie
gx <- input$obs_ie
gu <- input$unobs_ie
x <- rnorm(n)
u <- rnorm(n)
p <- pnorm(gx * x + gu * u)
treat <- rbinom(n, 1, p)
y <- 1 + 2 * x + 1.5 * u + ate * treat + rnorm(n)
# Estimator 1: OLS regression controlling for X
est_reg <- coef(lm(y ~ treat + x))[2]
# Estimator 2: IPW
ps <- fitted(glm(treat ~ x, family = binomial))
ps <- pmin(pmax(ps, 0.01), 0.99)
w <- ifelse(treat == 1, 1 / ps, 1 / (1 - ps))
est_ipw <- weighted.mean(y[treat == 1], w[treat == 1]) -
weighted.mean(y[treat == 0], w[treat == 0])
# Estimator 3: Matching (simple: nearest neighbor on X)
matched_y <- numeric(sum(treat == 1))
x_t <- x[treat == 1]
y_t <- y[treat == 1]
x_c <- x[treat == 0]
y_c <- y[treat == 0]
for (i in seq_along(x_t)) {
nearest <- which.min(abs(x_c - x_t[i]))
matched_y[i] <- y_c[nearest]
}
est_match <- mean(y_t) - mean(matched_y)
list(est_reg = est_reg, est_ipw = est_ipw, est_match = est_match,
ate = ate, gu = gu)
})
output$ie_plot <- renderPlot({
d <- dat()
par(mar = c(5, 4.5, 3, 1))
estimates <- c(d$est_reg, d$est_ipw, d$est_match)
biases <- estimates - d$ate
labels <- c("OLS\nregression", "IPW", "Nearest-neighbor\nmatching")
cia_holds <- d$gu == 0
cols <- ifelse(abs(biases) < 0.5, "#27ae60", "#e74c3c")
bp <- barplot(estimates, col = cols, border = NA,
names.arg = labels, cex.names = 0.85,
main = ifelse(cia_holds,
"CIA holds: all estimators work",
"CIA violated: all estimators fail"),
ylab = "Estimate",
ylim = c(0, max(estimates, d$ate) * 1.5))
abline(h = d$ate, lty = 2, col = "gray40", lwd = 2)
text(0.2, d$ate + 0.15, paste0("True ATE = ", d$ate),
col = "gray40", cex = 0.85, adj = 0)
text(bp, estimates + 0.2,
paste0(round(estimates, 2)),
cex = 0.9, font = 2)
})
output$results_ie <- renderUI({
d <- dat()
cia_holds <- d$gu == 0
tags$div(class = "stats-box",
HTML(paste0(
"<b>True ATE:</b> ", d$ate, "<br>",
"<hr style='margin:6px 0'>",
"<b>Regression:</b> ", round(d$est_reg, 2),
" (bias: ", round(d$est_reg - d$ate, 2), ")<br>",
"<b>IPW:</b> ", round(d$est_ipw, 2),
" (bias: ", round(d$est_ipw - d$ate, 2), ")<br>",
"<b>Matching:</b> ", round(d$est_match, 2),
" (bias: ", round(d$est_match - d$ate, 2), ")<br>",
"<hr style='margin:6px 0'>",
if (cia_holds)
"<span class='good'>CIA holds.</span> All three estimators give similar, roughly unbiased answers. The choice of estimator is secondary."
else
"<span class='bad'>CIA violated.</span> All three estimators are biased. Switching estimators doesn't help — you need a different identification strategy."
))
)
})
}
shinyApp(ui, server)
Things to try
- Unobserved confounding = 0: the CIA holds. All three estimators — regression, IPW, matching — give roughly the same answer, close to the true ATE. The choice between them is about efficiency, not bias.
- Unobserved confounding = 2: the CIA is violated. All three estimators are biased in the same direction. Switching from regression to IPW to matching doesn’t help — the problem is identification, not estimation.
- Increase sample size with unobserved confounding: all three get more precise but stay biased. More data doesn’t fix a broken assumption.
The lesson: spend your energy on identification, not on the fanciest estimator.
In Stata: identification → estimation cheat sheet
| Identification strategy | Stata command |
|---|---|
| Random assignment | reg outcome treatment |
| Selection on observables | teffects ra (outcome x1 x2) (treatment) |
| Inverse probability weighting | teffects ipw (outcome) (treatment x1 x2) |
| Matching | teffects nnmatch (outcome x1 x2) (treatment) |
| Doubly robust | teffects aipw (outcome x1 x2) (treatment x1 x2) |
| Difference-in-differences | reg outcome treated##post, cluster(group) |
| Instrumental variables | ivregress 2sls outcome (treatment = instrument) |
| Regression discontinuity | rdrobust outcome running_var, c(0) |
| Fixed effects | xtreg outcome treatment x1, fe cluster(id) |
The right column is the easy part. The hard part is arguing that the left column holds.
Did you know?
The distinction between identification and estimation was articulated clearly by Charles Manski in his 1995 book Identification Problems in the Social Sciences. He argued that most debates in empirical work are really about identification, not estimation.
Angrist & Pischke (Mostly Harmless Econometrics, 2009) organized their entire textbook around identification strategies — regression, IV, DID, RDD — rather than estimators. This framing reshaped how a generation of economists thinks about empirical work.
A common mistake in applied papers: spending pages discussing the estimator (clustered SEs, bootstrap, semiparametric methods) while spending one paragraph on identification. The estimator is the easy part. The hard part is arguing that your comparison is causal.