Identification vs Estimation

Two separate questions in every causal study

Every causal inference project answers two fundamentally different questions:

  1. Identification. Why is this comparison causal? What assumptions make the estimand equal to the causal parameter?
  2. Estimation. How do we compute the estimand from data? What statistical procedure do we use?

These are conceptually independent:

  • You can have correct identification with a poor estimator (unbiased but noisy).
  • You can have a highly efficient estimator without identification (precise, but converging to the wrong parameter).

Identification comes first. If the assumptions fail, no estimator can recover the causal effect.

Identification: the assumption

Identification is about the source of exogenous variation — why the variation in treatment you’re using is “as good as random” for estimating a causal effect.

| Identification strategy | The assumption | In words |
|---|---|---|
| Selection on observables | \(Y(0), Y(1) \perp D \mid X\) | Conditional on X, treatment is as good as random |
| Parallel trends | \(E[Y_t(0) - Y_{t-1}(0) \mid D=1] = E[Y_t(0) - Y_{t-1}(0) \mid D=0]\) | Absent treatment, both groups would have trended the same |
| Exclusion restriction | \(Z\) affects \(Y\) only through \(X\) | The instrument has no direct effect on the outcome |
| Continuity | \(E[Y(0) \mid X=x]\) is continuous at the cutoff | No other jump happens at the cutoff |

Each assumption is a claim about the world — not something you compute. You argue it using institutional knowledge, theory, and indirect evidence. Some are partially testable (you can check pre-trends for DID, run a McCrary test for RDD), but none can be fully proven from data.
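
Here is what one of those partial checks can look like in code: a placebo DID run entirely on pre-treatment periods. The data below are simulated (groups, periods, and effect sizes are made up), so treat it as a sketch of the mechanics rather than a template.

```r
# Placebo DID using two pre-treatment periods: if the groups were trending in
# parallel before treatment, the placebo estimate should be near zero.
# Everything here is simulated for illustration.
set.seed(1)
n <- 2000
g <- rbinom(n, 1, 0.5)                    # eventual treatment group
y_pre1 <- 1 + 0.5 * g + rnorm(n)          # period t-2
y_pre2 <- 2 + 0.5 * g + rnorm(n)          # period t-1 (parallel by construction)

placebo <- (mean(y_pre2[g == 1]) - mean(y_pre1[g == 1])) -
           (mean(y_pre2[g == 0]) - mean(y_pre1[g == 0]))
placebo   # near zero: consistent with parallel pre-trends, but not proof
```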

Estimation: the computation

Estimation is about how you turn data into a number, given that you believe your identification assumption holds.

| Estimator | What it does |
|---|---|
| OLS regression | Fits a linear model, uses coefficients |
| Matching | Pairs treated/control units with similar covariates |
| IPW | Reweights observations by inverse propensity scores |
| Entropy balancing | Finds weights that exactly balance covariate moments |
| Doubly robust | Combines regression and weighting |
| 2SLS | Two-stage regression using predicted values from the first stage |
| Local polynomial | Fits flexible curves on each side of a cutoff |
| Synthetic control weights | Constrained optimization to match pre-treatment trends |
| TWFE | Two-way fixed effects regression |

These are tools — they can often be combined with different identification strategies. IPW can implement selection on observables or be used in a DID design (IPW-DID). Regression can adjust for covariates in an RDD or in a cross-sectional study. The estimator doesn’t determine identification; the assumption does.
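
A small simulated illustration of that point: the same propensity-score weights can be aimed at levels of the outcome (selection on observables) or at changes in the outcome (an IPW-DID in the spirit of Abadie 2005). The data-generating process below is invented; only the mechanics matter.

```r
# One estimation tool, two identification strategies. Simulated data where X is
# the only confounder, so both approaches should recover the true effect of 2.
set.seed(9)
n      <- 5000
x      <- rnorm(n)
d      <- rbinom(n, 1, plogis(x))
y_pre  <- x + rnorm(n)
y_post <- x + 2 * d + rnorm(n)

ps <- fitted(glm(d ~ x, family = binomial))
w  <- ifelse(d == 1, 1, ps / (1 - ps))      # ATT weights: reweight the controls

soo_est <- mean(y_post[d == 1]) -
  weighted.mean(y_post[d == 0], w[d == 0])               # weights applied to levels
did_est <- mean((y_post - y_pre)[d == 1]) -
  weighted.mean((y_post - y_pre)[d == 0], w[d == 0])     # weights applied to changes
c(soo = soo_est, ipw_did = did_est)                      # both near 2
```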

Research designs bundle both

What we usually call “methods” in applied work — DID, IV, RDD — are really research designs that bundle an identification strategy with a default estimator:

| Research design | Identification | Common estimators |
|---|---|---|
| SOO study | Conditional independence | Regression, matching, IPW, EB, doubly robust |
| DID | Parallel trends | 2×2 difference, TWFE, IPW-DID (Abadie 2005), DR-DID (Sant’Anna & Zhao 2020) |
| IV | Exclusion restriction + relevance | 2SLS, LIML, GMM |
| RDD | Continuity at cutoff | Local polynomial, local randomization |
| Synthetic control | Pre-treatment fit → valid counterfactual | Constrained weight optimization, augmented SCM |

This is why the same estimation tool shows up in multiple designs. IPW appears in the SOO column and the DID column — because it’s a tool, not a strategy.


The math: where bias comes from

When an identification assumption fails, it introduces a bias term that no estimator can remove. Here’s the decomposition for three methods.

Selection on observables

We want the Average Treatment Effect on the Treated (ATT):

\[\tau = E[Y(1) - Y(0) \mid D = 1]\]

We observe \(E[Y \mid D=1] = E[Y(1) \mid D=1]\) and \(E[Y \mid D=0] = E[Y(0) \mid D=0]\). The naive comparison is:

\[E[Y \mid D=1] - E[Y \mid D=0] = \underbrace{E[Y(1) - Y(0) \mid D=1]}_{\text{ATT}} + \underbrace{E[Y(0) \mid D=1] - E[Y(0) \mid D=0]}_{\text{selection bias}}\]

The second term is selection bias — the treated group would have had different outcomes even without treatment. The CIA says: conditional on \(X\), \(E[Y(0) \mid D=1, X] = E[Y(0) \mid D=0, X]\), so the selection bias is zero within each stratum of \(X\).

If the CIA fails — there’s an unobserved confounder \(U\) — then \(E[Y(0) \mid D=1, X] \neq E[Y(0) \mid D=0, X]\) because \(D\) is still correlated with \(Y(0)\) through \(U\) even after conditioning on \(X\). The bias term is nonzero. Regression, IPW, matching — all give biased answers because the selection bias is baked into the estimand, not the estimator.
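
A quick simulation makes the decomposition concrete. Because the data are simulated, we can see both potential outcomes and verify that the naive difference equals the ATT plus the selection-bias term exactly; all parameters are made up.

```r
# Naive difference = ATT + selection bias, verified on simulated data where we
# observe both potential outcomes.
set.seed(1)
n  <- 1e5
x  <- rnorm(n)
u  <- rnorm(n)                           # unobserved confounder
d  <- rbinom(n, 1, plogis(x + u))        # treatment depends on X and U
y0 <- x + u + rnorm(n)                   # outcome without treatment
y1 <- y0 + 2                             # constant effect: ATT = 2
y  <- ifelse(d == 1, y1, y0)             # what we actually observe

att   <- mean(y1[d == 1] - y0[d == 1])
naive <- mean(y[d == 1]) - mean(y[d == 0])
sel   <- mean(y0[d == 1]) - mean(y0[d == 0])   # selection bias term
c(naive = naive, att = att, selection_bias = sel, att_plus_sel = att + sel)
# naive and att_plus_sel match exactly; the gap from 2 is all selection bias
```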

Difference-in-differences

The DID estimand is:

\[\tau_{DID} = \big(E[Y_{1t}] - E[Y_{1,t-1}]\big) - \big(E[Y_{0t}] - E[Y_{0,t-1}]\big)\]

where group 1 is treated, group 0 is control, \(t\) is post, \(t-1\) is pre. Substitute potential outcomes and add and subtract \(E[Y_{1t}(0)]\):

\[\tau_{DID} = \underbrace{E[Y_{1t}(1) - Y_{1t}(0)]}_{\text{ATT}} + \underbrace{\big(E[Y_{1t}(0)] - E[Y_{1,t-1}(0)]\big) - \big(E[Y_{0t}(0)] - E[Y_{0,t-1}(0)]\big)}_{\text{differential trend bias}}\]

The parallel trends assumption says the second term equals zero — the treated group’s untreated trajectory matches the control group’s trajectory. Then \(\tau_{DID} = \text{ATT}\).

If the parallel trends assumption fails (say the treated group was already trending upward faster), the differential trend term is positive and DID overestimates the effect. This bias doesn’t shrink with more data. It doesn’t go away if you switch from a 2×2 difference to TWFE or IPW-DID. It’s an identification failure, not an estimation failure.
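
The same point in a simulation (all numbers invented): the treated group’s untreated outcome grows by an extra unit, and that extra unit lands in the DID estimate on top of the true effect.

```r
# Differential trend bias: true ATT = 2, but the treated group would have grown
# by 1 extra unit even without treatment. DID returns roughly 3.
set.seed(2)
n <- 1e5
treat_pre  <- rnorm(n)
ctrl_pre   <- rnorm(n)
ctrl_post  <- ctrl_pre  + 1 + rnorm(n, sd = 0.1)           # common trend = 1
treat_post <- treat_pre + 1 + 1 + 2 + rnorm(n, sd = 0.1)   # trend + extra trend + ATT

did <- (mean(treat_post) - mean(treat_pre)) -
       (mean(ctrl_post)  - mean(ctrl_pre))
did   # about 3 = ATT (2) + differential trend (1)
```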

Instrumental variables

We have \(Y = \beta X + \varepsilon\) where \(\text{Cov}(X, \varepsilon) \neq 0\) (endogeneity). The IV estimand is:

\[\beta_{IV} = \frac{\text{Cov}(Z, Y)}{\text{Cov}(Z, X)}\]

Substitute \(Y = \beta X + \varepsilon\):

\[\beta_{IV} = \frac{\text{Cov}(Z, \beta X + \varepsilon)}{\text{Cov}(Z, X)} = \beta + \frac{\text{Cov}(Z, \varepsilon)}{\text{Cov}(Z, X)}\]

The exclusion restriction says \(\text{Cov}(Z, \varepsilon) = 0\) — the instrument is uncorrelated with the error. Then \(\beta_{IV} = \beta\).

If the exclusion restriction fails because \(Z\) directly affects \(Y\) through some channel other than \(X\), then \(\text{Cov}(Z, \varepsilon) \neq 0\) and the bias term \(\frac{\text{Cov}(Z, \varepsilon)}{\text{Cov}(Z, X)}\) is nonzero. No amount of data, no alternative estimator (LIML, GMM, jackknife) removes this. It’s baked in.
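
You can check the algebra numerically. In the simulation below (parameters invented), the instrument is given a direct effect on the outcome, and the IV estimate lands exactly at \(\beta\) plus the bias term.

```r
# Exclusion restriction violated: Z enters the error term directly. The sample
# IV ratio equals beta + cov(z, eps) / cov(z, x) by construction.
set.seed(3)
n    <- 1e5
beta <- 1
z    <- rnorm(n)
u    <- rnorm(n)                   # unobserved confounder shared by X and Y
eps  <- 0.5 * z + u + rnorm(n)     # Z has a direct effect on Y: exclusion fails
x    <- z + u + rnorm(n)           # instrument is relevant
y    <- beta * x + eps

iv_est  <- cov(z, y) / cov(z, x)
iv_bias <- cov(z, eps) / cov(z, x)
c(iv = iv_est, beta_plus_bias = beta + iv_bias)   # identical
```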

Threats to identification

Each method has specific threats — things that make the bias term nonzero:

| Method | Identification assumption | Threat (what breaks it) | What the bias looks like |
|---|---|---|---|
| SOO | No unobserved confounders | Omitted variable that drives both \(D\) and \(Y\) | Selection bias: treated group was different to begin with |
| DID | Parallel trends | Treated group was already on a different trajectory | You attribute the pre-existing trend to the treatment |
| IV | Exclusion restriction | Instrument affects \(Y\) through a channel other than \(X\) | IV picks up the direct effect, not just the causal path |
| RDD | Continuity at cutoff | Units manipulate their score to sort across the cutoff; or another policy also kicks in at the same cutoff | The “jump” reflects sorting or a different treatment, not your treatment |
| Synthetic control | Pre-treatment fit generalizes | Spillovers from treated unit to donors; structural break changes the relationship | Counterfactual is wrong, gap doesn’t reflect the treatment |

Notice: every threat is about the world, not about the math. You can’t test your way out of these — you argue them with institutional knowledge.

The pattern

In all three cases:

\[\text{Estimate} = \text{Causal effect} + \text{Identification bias} + \text{Estimation bias}\]

Identification bias comes from violated assumptions; it’s a function of how the world works. Estimation bias comes from the estimator; it’s a function of how you computed the number. Identification bias typically dominates and cannot be fixed by switching estimators. Estimation bias is usually smaller and can be reduced by choosing a better estimator. That’s why identification comes first.


Types of bias

Identification biases

These come from the world, not the estimator. More data doesn’t help. A fancier estimator doesn’t help. You need a different identification strategy or better data.

Omitted variable bias (OVB). The most common. An unobserved variable \(U\) affects both treatment and outcome. For the simple regression \(Y = \beta X + \gamma U + \varepsilon\) where you omit \(U\):

\[\text{OVB} = \gamma \cdot \frac{\text{Cov}(X, U)}{\text{Var}(X)}\]

The bias is the effect of \(U\) on \(Y\) (\(\gamma\)) times how much \(U\) correlates with \(X\). If \(U\) drives people toward treatment and improves outcomes, both terms are positive and you overestimate the effect.
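
A numerical check of the formula (coefficients made up): the short regression’s slope on \(X\) matches \(\beta\) plus the OVB term up to sampling noise.

```r
# OVB check: omit U from the regression and compare the short-regression slope
# with beta + gamma * cov(x, u) / var(x).
set.seed(4)
n <- 1e5
beta <- 1; gamma <- 2
u <- rnorm(n)
x <- 0.6 * u + rnorm(n)               # X is correlated with the omitted U
y <- beta * x + gamma * u + rnorm(n)

short <- unname(coef(lm(y ~ x))["x"])
ovb   <- gamma * cov(x, u) / var(x)
c(short_regression = short, beta_plus_ovb = beta + ovb)   # agree up to noise
```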

Selection bias. The treated group would have had different outcomes even without treatment: \(E[Y(0) \mid D=1] \neq E[Y(0) \mid D=0]\). This is OVB rephrased in potential outcomes language — the “omitted variable” is whatever drives people to select into treatment.

Simultaneity bias. \(X\) causes \(Y\) but \(Y\) also causes \(X\). Regressing \(Y\) on \(X\) picks up both directions. Common in macro (do interest rates affect GDP, or does GDP affect interest rates?) and in supply/demand estimation.
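
A minimal sketch of what simultaneity does to OLS, with a made-up two-equation system solved for its reduced form:

```r
# Simultaneity: X responds to Y (coefficient 0.5) and Y responds to X
# (coefficient 1). OLS of Y on X returns neither structural coefficient.
set.seed(11)
n  <- 1e5
ex <- rnorm(n); ey <- rnorm(n)
# Structural system: x = 0.5*y + ex  and  y = 1*x + ey, solved for x:
x <- (0.5 * ey + ex) / (1 - 0.5 * 1)
y <- 1 * x + ey
unname(coef(lm(y ~ x))["x"])   # about 1.2, not 1: it mixes both directions
```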

Collider bias (sample selection bias). You condition on a variable caused by both treatment and outcome — this opens a fake path between them. The correlation-causation page covers this in detail.

Differential trends bias. In DID: the treated group was already on a different trajectory before treatment. The estimate captures the pre-existing divergence, not the treatment effect.

Estimation biases

These come from the estimator, not the world. They can be reduced or eliminated by choosing a better estimator, using more data, or fixing the specification.

Functional form misspecification. You fit a linear model but the truth is nonlinear. In RDD: a straight line through curved data creates a fake “jump” at the cutoff. Fix: use local polynomials, check robustness to specification.
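
A small simulated example of the RDD case (curvature, cutoff, and noise level are arbitrary): the true conditional mean is smooth, yet a global linear fit reports a jump.

```r
# A smooth cubic relationship with no treatment effect. A global linear fit with
# a cutoff dummy finds a spurious "jump"; a flexible fit does not.
set.seed(8)
n <- 2000
r <- runif(n, -1, 1)                       # running variable, cutoff at 0
y <- r^3 + rnorm(n, sd = 0.1)              # smooth: no true discontinuity
d <- as.integer(r >= 0)

coef(lm(y ~ d + r))["d"]                    # clearly nonzero: misspecification bias
coef(lm(y ~ d * r + I(r^2) + I(r^3)))["d"]  # flexible fit: close to zero
```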

Finite sample / weak instrument bias. With weak instruments (first-stage \(F < 10\) is a common rule of thumb), 2SLS is biased toward OLS in finite samples, even if the exclusion restriction holds. The bias shrinks with stronger instruments, larger samples, alternative estimators (LIML), or weak-instrument-robust inference (Anderson-Rubin).
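
A quick Monte Carlo sketch of the weak-instrument problem (sample size, first-stage strength, and error structure all invented): the instrument is valid but tiny, and the 2SLS sampling distribution sits much closer to OLS than to the truth.

```r
# Weak but valid instrument: exclusion holds, yet 2SLS is pulled toward OLS in
# small samples because the first stage is so weak.
set.seed(5)
one_draw <- function(n = 100, fs = 0.1, beta = 1) {
  z   <- rnorm(n)
  err <- rnorm(n)                          # shared error makes X endogenous
  x   <- fs * z + err + rnorm(n)           # very weak first stage
  y   <- beta * x + err + rnorm(n)
  c(ols  = unname(coef(lm(y ~ x))["x"]),
    tsls = cov(z, y) / cov(z, x))
}
est <- replicate(2000, one_draw())
apply(est, 1, median)   # 2SLS median sits well above beta = 1, toward the OLS value
```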

Negative weighting (TWFE with staggered treatment). Goodman-Bacon (2021) and de Chaisemartin & d’Haultfoeuille (2020) showed that two-way fixed effects can put negative weights on some group-time treatment effects when treatment is staggered across time and effects are heterogeneous — giving biased estimates even when parallel trends holds. Fix: use Callaway & Sant’Anna, Sun & Abraham, or other heterogeneity-robust estimators.
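
A bare-bones illustration of the staggered-adoption problem, with no packages and made-up effect sizes: two groups adopt at different times, effects grow with time since treatment, and the plain TWFE coefficient falls far below the average realized effect because already-treated units end up serving as controls.

```r
# Staggered adoption with dynamic effects (1, 2, 3, ... per period treated).
# Parallel trends holds by construction, yet TWFE understates the effect.
set.seed(12)
df <- expand.grid(id = 1:40, t = 1:4)
df$group  <- ifelse(df$id <= 20, "early", "late")
df$adopt  <- ifelse(df$group == "early", 2, 4)
df$d      <- as.integer(df$t >= df$adopt)
df$effect <- ifelse(df$d == 1, df$t - df$adopt + 1, 0)
df$y      <- df$effect + rnorm(nrow(df), sd = 0.1)

mean(df$effect[df$d == 1])                                  # average realized effect: 1.75
coef(lm(y ~ d + factor(id) + factor(t), data = df))["d"]    # TWFE lands far below it
```

Heterogeneity-robust estimators (Callaway & Sant’Anna, Sun & Abraham) avoid this by never using already-treated units as controls; they live in dedicated packages and aren’t shown here.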

Extreme weights. In IPW: when propensity scores are near 0 or 1, some observations get enormous weights, making the estimate noisy and potentially biased in finite samples. Fix: trim extreme scores, use entropy balancing, or use doubly robust estimators.
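
What “extreme weights” looks like in practice, on simulated data with deliberately strong selection (the trimming bounds of 0.01 and 0.99 are an arbitrary but common choice):

```r
# Strong selection pushes propensity scores toward 0 and 1, so a handful of
# observations get enormous IPW weights. Trimming the scores tames the weights.
set.seed(6)
n  <- 5000
x  <- rnorm(n)
d  <- rbinom(n, 1, plogis(3 * x))                 # scores pile up near 0 and 1
ps <- fitted(glm(d ~ x, family = binomial))
w  <- ifelse(d == 1, 1 / ps, 1 / (1 - ps))
summary(w)                                        # note the maximum weight

ps_trim <- pmin(pmax(ps, 0.01), 0.99)
w_trim  <- ifelse(d == 1, 1 / ps_trim, 1 / (1 - ps_trim))
summary(w_trim)                                   # bounded at 100
```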

Attenuation bias (measurement error). When \(X\) is measured with noise, OLS is biased toward zero. The noisier the measurement relative to the true signal, the more the estimate shrinks. Can be fixed with better measurement or IV. See the measurement error page for the signal-to-noise ratio formula.
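
And the attenuation mechanics, simulated with equal signal and noise variance so the estimate shrinks by about half:

```r
# Classical measurement error: the OLS slope shrinks toward zero by the
# reliability ratio var(x_true) / (var(x_true) + var(noise)).
set.seed(7)
n      <- 1e5
x_true <- rnorm(n)                       # signal variance 1
y      <- 1 * x_true + rnorm(n)          # true coefficient is 1
x_obs  <- x_true + rnorm(n)              # noise variance 1

c(ols_on_noisy_x = unname(coef(lm(y ~ x_obs))["x_obs"]),
  reliability    = var(x_true) / (var(x_true) + 1))   # both about 0.5
```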

Summary

| Bias | Type | Goes away with more data? | Fix |
|---|---|---|---|
| Omitted variable / confounding | Identification | No | Better controls, different strategy (IV, DID, RDD) |
| Selection bias | Identification | No | Randomization, or argue CIA |
| Simultaneity | Identification | No | IV, timing restrictions |
| Collider / sample selection | Identification | No | Don’t condition on colliders |
| Differential trends | Identification | No | Different comparison group, different strategy |
| Functional form | Estimation | Partially | Flexible specifications, local methods |
| Weak instruments | Estimation | Partially | Stronger instruments, LIML |
| TWFE negative weighting | Estimation | No | Heterogeneity-robust DID estimators |
| Extreme IPW weights | Estimation | Yes (slowly) | Trimming, EB, doubly robust |
| Attenuation (measurement error) | Both | No | Better data, IV |

Simulation: identification matters, estimation is secondary

Same data, same identification assumption, three different estimators. When the assumption holds, they all work. When it doesn’t, they all fail.

#| standalone: true
#| viewerHeight: 580

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 13px; line-height: 1.8;
    }
    .stats-box b { color: #2c3e50; }
    .good { color: #27ae60; font-weight: bold; }
    .bad  { color: #e74c3c; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("n_ie", "Sample size:",
                  min = 200, max = 2000, value = 500, step = 100),

      sliderInput("ate_ie", "True ATE:",
                  min = 0, max = 5, value = 2, step = 0.5),

      sliderInput("obs_ie", "Observed confounding (X):",
                  min = 0, max = 3, value = 1.5, step = 0.25),

      sliderInput("unobs_ie", "Unobserved confounding (U):",
                  min = 0, max = 3, value = 0, step = 0.25),

      actionButton("go_ie", "New draw", class = "btn-primary", width = "100%"),

      uiOutput("results_ie")
    ),

    mainPanel(
      width = 9,
      plotOutput("ie_plot", height = "420px")
    )
  )
)

server <- function(input, output, session) {

  dat <- reactive({
    input$go_ie
    n   <- input$n_ie
    ate <- input$ate_ie
    gx  <- input$obs_ie
    gu  <- input$unobs_ie

    x <- rnorm(n)
    u <- rnorm(n)

    p <- pnorm(gx * x + gu * u)
    treat <- rbinom(n, 1, p)

    y <- 1 + 2 * x + 1.5 * u + ate * treat + rnorm(n)

    # Estimator 1: OLS regression controlling for X
    est_reg <- coef(lm(y ~ treat + x))[2]

    # Estimator 2: IPW
    ps <- fitted(glm(treat ~ x, family = binomial))
    ps <- pmin(pmax(ps, 0.01), 0.99)
    w <- ifelse(treat == 1, 1 / ps, 1 / (1 - ps))
    est_ipw <- weighted.mean(y[treat == 1], w[treat == 1]) -
               weighted.mean(y[treat == 0], w[treat == 0])

    # Estimator 3: Matching (simple: nearest neighbor on X)
    matched_y <- numeric(sum(treat == 1))
    x_t <- x[treat == 1]
    y_t <- y[treat == 1]
    x_c <- x[treat == 0]
    y_c <- y[treat == 0]
    for (i in seq_along(x_t)) {
      nearest <- which.min(abs(x_c - x_t[i]))
      matched_y[i] <- y_c[nearest]
    }
    est_match <- mean(y_t) - mean(matched_y)

    list(est_reg = est_reg, est_ipw = est_ipw, est_match = est_match,
         ate = ate, gu = gu)
  })

  output$ie_plot <- renderPlot({
    d <- dat()
    par(mar = c(5, 4.5, 3, 1))

    estimates <- c(d$est_reg, d$est_ipw, d$est_match)
    biases <- estimates - d$ate
    labels <- c("OLS\nregression", "IPW", "Nearest-neighbor\nmatching")

    cia_holds <- d$gu == 0
    cols <- ifelse(abs(biases) < 0.5, "#27ae60", "#e74c3c")

    bp <- barplot(estimates, col = cols, border = NA,
                  names.arg = labels, cex.names = 0.85,
                  main = ifelse(cia_holds,
                    "CIA holds: all estimators work",
                    "CIA violated: all estimators fail"),
                  ylab = "Estimate",
                  ylim = c(0, max(estimates, d$ate) * 1.5))

    abline(h = d$ate, lty = 2, col = "gray40", lwd = 2)
    text(0.2, d$ate + 0.15, paste0("True ATE = ", d$ate),
         col = "gray40", cex = 0.85, adj = 0)

    text(bp, estimates + 0.2,
         paste0(round(estimates, 2)),
         cex = 0.9, font = 2)
  })

  output$results_ie <- renderUI({
    d <- dat()
    cia_holds <- d$gu == 0

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>True ATE:</b> ", d$ate, "<br>",
        "<hr style='margin:6px 0'>",
        "<b>Regression:</b> ", round(d$est_reg, 2),
        " (bias: ", round(d$est_reg - d$ate, 2), ")<br>",
        "<b>IPW:</b> ", round(d$est_ipw, 2),
        " (bias: ", round(d$est_ipw - d$ate, 2), ")<br>",
        "<b>Matching:</b> ", round(d$est_match, 2),
        " (bias: ", round(d$est_match - d$ate, 2), ")<br>",
        "<hr style='margin:6px 0'>",
        if (cia_holds)
          "<span class='good'>CIA holds.</span> All three estimators give similar, roughly unbiased answers. The choice of estimator is secondary."
        else
          "<span class='bad'>CIA violated.</span> All three estimators are biased. Switching estimators doesn't help — you need a different identification strategy."
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • Unobserved confounding = 0: the CIA holds. All three estimators — regression, IPW, matching — give roughly the same answer, close to the true ATE. The choice between them is about efficiency, not bias.
  • Unobserved confounding = 2: the CIA is violated. All three estimators are biased in the same direction. Switching from regression to IPW to matching doesn’t help — the problem is identification, not estimation.
  • Increase sample size with unobserved confounding: all three get more precise but stay biased. More data doesn’t fix a broken assumption.

The lesson: spend your energy on identification, not on the fanciest estimator.


In Stata: identification → estimation cheat sheet

| Identification strategy | Stata command |
|---|---|
| Random assignment | `reg outcome treatment` |
| Selection on observables | `teffects ra (outcome x1 x2) (treatment)` |
| Inverse probability weighting | `teffects ipw (outcome) (treatment x1 x2)` |
| Matching | `teffects nnmatch (outcome x1 x2) (treatment)` |
| Doubly robust | `teffects aipw (outcome x1 x2) (treatment x1 x2)` |
| Difference-in-differences | `reg outcome treated##post, cluster(group)` |
| Instrumental variables | `ivregress 2sls outcome (treatment = instrument)` |
| Regression discontinuity | `rdrobust outcome running_var, c(0)` |
| Fixed effects | `xtreg outcome treatment x1, fe cluster(id)` |

The right column is the easy part. The hard part is arguing that the left column holds.


Did you know?

  • The distinction between identification and estimation was articulated clearly by Charles Manski in his 1995 book Identification Problems in the Social Sciences. He argued that most debates in empirical work are really about identification, not estimation.

  • Angrist & Pischke (Mostly Harmless Econometrics, 2009) organized their entire textbook around identification strategies — regression, IV, DID, RDD — rather than estimators. This framing reshaped how a generation of economists thinks about empirical work.

  • A common mistake in applied papers: spending pages discussing the estimator (clustered SEs, bootstrap, semiparametric methods) while spending one paragraph on identification. The estimator is the easy part. The hard part is arguing that your comparison is causal.