Identification vs Estimation
Two separate questions in every causal study
Every causal inference project answers two fundamentally different questions:
- Identification. Why is this comparison causal? What assumptions make the estimand equal to the causal parameter?
- Estimation. How do we compute the estimand from data? What statistical procedure do we use?
These are conceptually independent:
- You can have correct identification with a poor estimator (unbiased but noisy).
- You can have a highly efficient estimator with no identification (a precise estimate of the wrong quantity).
Identification comes first. If the assumptions fail, no estimator can recover the causal effect.
Identification: the assumption
Identification is about the source of exogenous variation — why the variation in treatment you’re using is “as good as random” for estimating a causal effect.
| Identification strategy | The assumption | In words |
|---|---|---|
| Selection on observables | \(Y(0), Y(1) \perp D \mid X\) | Conditional on X, treatment is as good as random |
| Parallel trends | \(E[Y_t(0) - Y_{t-1}(0) \mid D=1] = E[Y_t(0) - Y_{t-1}(0) \mid D=0]\) | Absent treatment, both groups would have trended the same |
| Exclusion restriction | \(Z\) affects \(Y\) only through \(X\) | The instrument has no direct effect on the outcome |
| Continuity | \(E[Y(0) \mid X=x]\) is continuous at the cutoff | No other jump happens at the cutoff |
Each assumption is a claim about the world — not something you compute. You argue it using institutional knowledge, theory, and indirect evidence. Some are partially testable (you can check pre-trends for DID, run a McCrary test for RDD), but none can be fully proven from data.
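To make the "partially testable" point concrete, here is a minimal sketch of a DID pre-trends check on simulated pre-period data (the variable names, the linear-trend specification, and the parameter values are all illustrative):

```r
# Sketch of a DID pre-trends check: in pre-treatment periods only, test whether
# the eventually-treated group was already trending differently (simulated data)
set.seed(1)
n <- 4000
group  <- rbinom(n, 1, 0.5)                  # 1 = eventually treated
period <- sample(-3:-1, n, replace = TRUE)   # pre-treatment periods only
y <- 1 + 0.5 * group + 0.2 * period + rnorm(n)  # common trend: pre-trends parallel here
fit <- lm(y ~ group * period)
summary(fit)$coefficients["group:period", ]  # interaction near 0 when trends are parallel
```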
Estimation: the computation
Estimation is about how you turn data into a number, given that you believe your identification assumption holds.
| Estimator | What it does |
|---|---|
| OLS regression | Fits a linear model; the coefficient on treatment is the estimate |
| Matching | Pairs treated/control units with similar covariates |
| IPW | Reweights observations by inverse propensity scores |
| Entropy balancing | Finds weights that exactly balance covariate moments |
| Doubly robust | Combines regression and weighting |
| 2SLS | Two-stage regression using predicted values from the first stage |
| Local polynomial | Fits flexible curves on each side of a cutoff |
| Synthetic control weights | Constrained optimization to match pre-treatment trends |
| TWFE | Regression with unit and time fixed effects |
These are tools — they can often be combined with different identification strategies. IPW can implement selection on observables or be used in a DID design (IPW-DID). Regression can adjust for covariates in an RDD or in a cross-sectional study. The estimator doesn’t determine identification; the assumption does.
Research designs bundle both
What we usually call “methods” in applied work — DID, IV, RDD — are really research designs that bundle an identification strategy with a default estimator:
| Research design | Identification | Common estimators |
|---|---|---|
| SOO study | Conditional independence | Regression, matching, IPW, EB, doubly robust |
| DID | Parallel trends | 2×2 difference, TWFE, IPW-DID (Abadie 2005), DR-DID (Sant’Anna & Zhao 2020) |
| IV | Exclusion restriction + relevance | 2SLS, LIML, GMM |
| RDD | Continuity at cutoff | Local polynomial, local randomization |
| Synthetic control | Pre-treatment fit → valid counterfactual | Constrained weight optimization, augmented SCM |
This is why the same estimation tool shows up in multiple designs. IPW appears in the SOO column and the DID column — because it’s a tool, not a strategy.
The math: where bias comes from
When an identification assumption fails, it introduces a bias term that no estimator can remove. Here’s the decomposition for three methods.
Selection on observables
We want the Average Treatment Effect on the Treated (ATT):
\[\tau = E[Y(1) - Y(0) \mid D = 1]\]
We observe \(E[Y \mid D=1] = E[Y(1) \mid D=1]\) and \(E[Y \mid D=0] = E[Y(0) \mid D=0]\). The naive comparison is:
\[E[Y \mid D=1] - E[Y \mid D=0] = \underbrace{E[Y(1) - Y(0) \mid D=1]}_{\text{ATT}} + \underbrace{E[Y(0) \mid D=1] - E[Y(0) \mid D=0]}_{\text{selection bias}}\]
The second term is selection bias — the treated group would have had different outcomes even without treatment. The CIA says: conditional on \(X\), \(E[Y(0) \mid D=1, X] = E[Y(0) \mid D=0, X]\), so the selection bias is zero within each stratum of \(X\).
If the CIA fails — there’s an unobserved confounder \(U\) — then \(E[Y(0) \mid D=1, X] \neq E[Y(0) \mid D=0, X]\) because \(D\) is still correlated with \(Y(0)\) through \(U\) even after conditioning on \(X\). The bias term is nonzero. Regression, IPW, matching — all give biased answers because the selection bias is baked into the estimand, not the estimator.
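A small simulation makes the decomposition tangible. The parameter values below are illustrative; the point is that the naive gap equals the ATT plus the selection-bias term:

```r
# Naive comparison = ATT + selection bias (simulated; parameter values illustrative)
set.seed(42)
n <- 100000
u  <- rnorm(n)                      # drives selection into treatment
d  <- rbinom(n, 1, pnorm(u))        # treatment more likely when u is high
y0 <- 1 + 1.5 * u + rnorm(n)        # untreated outcome also depends on u
y1 <- y0 + 2                        # true ATT = 2
y  <- ifelse(d == 1, y1, y0)

naive <- mean(y[d == 1]) - mean(y[d == 0])
att   <- mean(y1[d == 1] - y0[d == 1])
sel   <- mean(y0[d == 1]) - mean(y0[d == 0])
c(naive = naive, att = att, selection_bias = sel)  # naive ~ att + selection_bias
```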
Difference-in-differences
The DID estimand is:
\[\tau_{DID} = \big(E[Y_{1t}] - E[Y_{1,t-1}]\big) - \big(E[Y_{0t}] - E[Y_{0,t-1}]\big)\]
where group 1 is treated, group 0 is control, \(t\) is post, \(t-1\) is pre. Substitute potential outcomes and add and subtract \(E[Y_{1t}(0)]\):
\[\tau_{DID} = \underbrace{E[Y_{1t}(1) - Y_{1t}(0)]}_{\text{ATT}} + \underbrace{\big(E[Y_{1t}(0)] - E[Y_{1,t-1}(0)]\big) - \big(E[Y_{0t}(0)] - E[Y_{0,t-1}(0)]\big)}_{\text{differential trend bias}}\]
The parallel trends assumption says the second term equals zero — the treated group's untreated trajectory matches the control group's trajectory. Then \(\tau_{DID} = \text{ATT}\).
If parallel trends fail — say the treated group was already trending upward faster — the differential trend term is positive. DID overestimates the effect. This bias doesn’t shrink with more data. It doesn’t go away if you switch from a 2×2 difference to TWFE or IPW-DID. It’s an identification failure, not an estimation failure.
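A minimal sketch of the same point in code, using made-up group means and an assumed differential trend of 1:

```r
# DID under a violated parallel-trends assumption (simulated, illustrative values)
set.seed(7)
n <- 50000
att <- 2           # true treatment effect
gap <- 1           # treated group trends up faster even without treatment
y_pre_1  <- rnorm(n, mean = 5)                  # treated, pre
y_post_1 <- rnorm(n, mean = 5 + 1 + gap + att)  # treated, post (common trend = 1)
y_pre_0  <- rnorm(n, mean = 3)                  # control, pre
y_post_0 <- rnorm(n, mean = 3 + 1)              # control, post
did <- (mean(y_post_1) - mean(y_pre_1)) - (mean(y_post_0) - mean(y_pre_0))
c(did = did, true_att = att, differential_trend = gap)  # did ~ att + gap
```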
Instrumental variables
We have \(Y = \beta X + \varepsilon\) where \(\text{Cov}(X, \varepsilon) \neq 0\) (endogeneity). The IV estimand is:
\[\beta_{IV} = \frac{\text{Cov}(Z, Y)}{\text{Cov}(Z, X)}\]
Substitute \(Y = \beta X + \varepsilon\):
\[\beta_{IV} = \frac{\text{Cov}(Z, \beta X + \varepsilon)}{\text{Cov}(Z, X)} = \beta + \frac{\text{Cov}(Z, \varepsilon)}{\text{Cov}(Z, X)}\]
The exclusion restriction says \(\text{Cov}(Z, \varepsilon) = 0\) — the instrument is uncorrelated with the error. Then \(\beta_{IV} = \beta\).
If the exclusion restriction fails — \(Z\) directly affects \(Y\) through some channel other than \(X\) — then \(\text{Cov}(Z, \varepsilon) \neq 0\) and the bias term \(\frac{\text{Cov}(Z, \varepsilon)}{\text{Cov}(Z, X)}\) is nonzero. No amount of data, no alternative estimator (LIML, GMM, jackknife) removes this. It’s baked in.
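Here is a quick simulated check of the decomposition, with an assumed direct effect of \(Z\) on \(Y\) of 0.5 (all values illustrative):

```r
# IV when the exclusion restriction fails (simulated, illustrative values)
set.seed(3)
n <- 200000
z <- rnorm(n)
u <- rnorm(n)                           # unobserved confounder
x <- 0.8 * z + u + rnorm(n)             # Z is a relevant instrument
direct <- 0.5                           # direct effect of Z on Y (the violation)
y <- 2 * x + direct * z + u + rnorm(n)  # true beta = 2
beta_iv <- cov(z, y) / cov(z, x)
bias    <- direct * var(z) / cov(z, x)  # Cov(Z, eps) / Cov(Z, X)
c(beta_iv = beta_iv, true_beta = 2, predicted_bias = bias)  # beta_iv ~ 2 + bias
```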
Threats to identification
Each method has specific threats — things that make the bias term nonzero:
| Method | Identification assumption | Threat (what breaks it) | What the bias looks like |
|---|---|---|---|
| SOO | No unobserved confounders | Omitted variable that drives both \(D\) and \(Y\) | Selection bias: treated group was different to begin with |
| DID | Parallel trends | Treated group was already on a different trajectory | You attribute the pre-existing trend to the treatment |
| IV | Exclusion restriction | Instrument affects \(Y\) through a channel other than \(X\) | IV picks up the direct effect, not just the causal path |
| RDD | Continuity at cutoff | Units manipulate their score to sort across the cutoff; or another policy also kicks in at the same cutoff | The “jump” reflects sorting or a different treatment, not your treatment |
| Synthetic control | Pre-treatment fit generalizes | Spillovers from treated unit to donors; structural break changes the relationship | Counterfactual is wrong, gap doesn’t reflect the treatment |
Notice: every threat is about the world, not about the math. You can’t test your way out of these — you argue them with institutional knowledge.
The pattern
In all three cases:
\[\text{Estimate} = \text{Causal effect} + \text{Identification bias} + \text{Estimation bias}\]
Identification bias comes from violated assumptions — it's a function of how the world works. Estimation bias comes from the estimator — it's a function of how you computed the number. Identification bias usually dominates and cannot be removed by switching estimators. Estimation bias is usually smaller and can be fixed by choosing a better estimator. That's why identification comes first.
Types of bias
Identification biases
These come from the world, not the estimator. More data doesn’t help. A fancier estimator doesn’t help. You need a different identification strategy or better data.
Omitted variable bias (OVB). The most common. An unobserved variable \(U\) affects both treatment and outcome. For the simple regression \(Y = \beta X + \gamma U + \varepsilon\) where you omit \(U\):
\[\text{OVB} = \gamma \cdot \frac{\text{Cov}(X, U)}{\text{Var}(X)}\]
The bias is the effect of \(U\) on \(Y\) (\(\gamma\)) times how much \(U\) correlates with \(X\). If \(U\) pushes people toward treatment and improves outcomes, both factors are positive and you overestimate the effect.
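A quick numerical check of the formula on simulated data (the coefficients 2, 1.5, and 0.7 are illustrative):

```r
# Check the OVB formula on simulated data (coefficients illustrative)
set.seed(11)
n <- 100000
u <- rnorm(n)
x <- 0.7 * u + rnorm(n)                  # X and U are correlated
y <- 1 + 2 * x + 1.5 * u + rnorm(n)      # beta = 2, gamma = 1.5
short <- coef(lm(y ~ x))["x"]            # regression that omits U
ovb   <- 1.5 * cov(x, u) / var(x)        # gamma * Cov(X, U) / Var(X)
c(short_beta = unname(short), true_beta = 2, ovb = ovb)  # short_beta ~ 2 + ovb
```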
Selection bias. The treated group would have had different outcomes even without treatment: \(E[Y(0) \mid D=1] \neq E[Y(0) \mid D=0]\). This is OVB rephrased in potential outcomes language — the “omitted variable” is whatever drives people to select into treatment.
Simultaneity bias. \(X\) causes \(Y\) but \(Y\) also causes \(X\). Regressing \(Y\) on \(X\) picks up both directions. Common in macro (do interest rates affect GDP, or does GDP affect interest rates?) and in supply/demand estimation.
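A minimal sketch, assuming a simple two-equation linear system with illustrative coefficients:

```r
# Simultaneity: X causes Y and Y causes X; OLS mixes the two directions
set.seed(5)
n <- 100000
beta  <- 1.0    # effect of X on Y
delta <- 0.3    # feedback from Y back to X
e1 <- rnorm(n); e2 <- rnorm(n)
# Solve the two simultaneous equations for the observed values
y <- (e1 + beta * e2) / (1 - beta * delta)
x <- (delta * e1 + e2) / (1 - beta * delta)
coef(lm(y ~ x))["x"]   # about 1.19 here, not beta = 1: the feedback contaminates OLS
```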
Collider bias (sample selection bias). You condition on a variable caused by both treatment and outcome — this opens a fake path between them. The correlation-causation page covers this in detail.
Differential trends bias. In DID: the treated group was already on a different trajectory before treatment. The estimate captures the pre-existing divergence, not the treatment effect.
Estimation biases
These come from the estimator, not the world. They can be reduced or eliminated by choosing a better estimator, using more data, or fixing the specification.
Functional form misspecification. You fit a linear model but the truth is nonlinear. In RDD: a straight line through curved data creates a fake “jump” at the cutoff. Fix: use local polynomials, check robustness to specification.
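A rough illustration of the fake-jump problem (simulated data with no true effect at the cutoff; for simplicity the global fit uses a single shared slope rather than the separate local polynomials used in practice):

```r
# Curved data, zero true effect at the cutoff: a global linear fit invents a jump
set.seed(9)
n <- 2000
running <- runif(n, -1, 1)
treat <- as.numeric(running >= 0)
y <- 3 * running^3 + rnorm(n, sd = 0.3)   # smooth and nonlinear, no jump at 0
global_fit <- coef(lm(y ~ treat + running))["treat"]
local_fit  <- coef(lm(y ~ treat + running, subset = abs(running) < 0.1))["treat"]
c(global = unname(global_fit), local = unname(local_fit))
# The global fit reports a clearly nonzero "effect"; the narrow-window fit is near 0
```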
Finite sample / weak instrument bias. With weak instruments (first-stage \(F < 10\) by the usual rule of thumb), 2SLS is biased toward OLS in finite samples — even if the exclusion restriction holds. The bias shrinks with stronger instruments, alternative estimators such as LIML, or weak-instrument-robust inference (Anderson-Rubin).
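A small Monte Carlo sketch of the pull toward OLS, with a deliberately tiny first-stage coefficient (all values illustrative):

```r
# Weak instrument: the 2SLS estimate is pulled toward OLS in small samples
set.seed(13)
sims <- 2000; n <- 100
res <- replicate(sims, {
  u <- rnorm(n)
  z <- rnorm(n)
  x <- 0.05 * z + u + rnorm(n)     # very weak first stage
  y <- 1 * x + u + rnorm(n)        # true beta = 1, X is endogenous
  xhat <- fitted(lm(x ~ z))        # first stage
  c(tsls = unname(coef(lm(y ~ xhat))[2]),  # manual 2SLS point estimate
    ols  = unname(coef(lm(y ~ x))[2]))
})
apply(res, 1, median)  # 2SLS median sits near the OLS value (~1.5), not near 1
```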
Negative weighting (TWFE with staggered treatment). Goodman-Bacon (2021) and de Chaisemartin & d’Haultfoeuille (2020) showed that two-way fixed effects can produce negative weights on some treatment effects when treatment is staggered across time — giving biased estimates even when parallel trends holds. Fix: use Callaway & Sant’Anna, Sun & Abraham, or other heterogeneity-robust estimators.
Extreme weights. In IPW: when propensity scores are near 0 or 1, some observations get enormous weights, making the estimate noisy and potentially biased in finite samples. Fix: trim extreme scores, use entropy balancing, or use doubly robust estimators.
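A rough sketch of the variance problem and of score trimming as a partial fix (simulated data, illustrative trimming thresholds):

```r
# Extreme propensity scores make IPW noisy; clipping the scores tames the variance
set.seed(17)
one_draw <- function(trim) {
  n <- 1000
  x <- rnorm(n)
  d <- rbinom(n, 1, plogis(3 * x))             # strong selection -> scores near 0/1
  y <- x + 2 * d + rnorm(n)                    # true effect = 2
  ps <- fitted(glm(d ~ x, family = binomial))
  if (trim) ps <- pmin(pmax(ps, 0.05), 0.95)   # clip extreme scores
  w <- ifelse(d == 1, 1 / ps, 1 / (1 - ps))
  weighted.mean(y[d == 1], w[d == 1]) - weighted.mean(y[d == 0], w[d == 0])
}
raw     <- replicate(500, one_draw(FALSE))
trimmed <- replicate(500, one_draw(TRUE))
c(sd_raw = sd(raw), sd_trimmed = sd(trimmed))  # trimming cuts the noise (at some bias cost)
```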
Attenuation bias (measurement error). When \(X\) is measured with noise, OLS is biased toward zero. The noisier the measurement relative to the true signal, the more the estimate shrinks. Can be fixed with better measurement or IV. See the measurement error page for the signal-to-noise ratio formula.
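A minimal sketch of the shrinkage, with measurement-error variance equal to the signal variance so the reliability ratio is 0.5 (illustrative values):

```r
# Attenuation: noise in X shrinks the OLS coefficient toward zero
set.seed(21)
n <- 100000
x_true  <- rnorm(n)
y       <- 2 * x_true + rnorm(n)            # true beta = 2
x_noisy <- x_true + rnorm(n, sd = 1)        # measured with error
c(clean       = unname(coef(lm(y ~ x_true))["x_true"]),
  noisy       = unname(coef(lm(y ~ x_noisy))["x_noisy"]),
  reliability = var(x_true) / (var(x_true) + 1))   # expected shrinkage factor ~ 0.5
# noisy ~ 2 * reliability ~ 1
```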
Summary
| Bias | Type | Goes away with more data? | Fix |
|---|---|---|---|
| Omitted variable / confounding | Identification | No | Better controls, different strategy (IV, DID, RDD) |
| Selection bias | Identification | No | Randomization, or argue CIA |
| Simultaneity | Identification | No | IV, timing restrictions |
| Collider / sample selection | Identification | No | Don’t condition on colliders |
| Differential trends | Identification | No | Different comparison group, different strategy |
| Functional form | Estimation | Partially | Flexible specifications, local methods |
| Weak instruments | Estimation | Partially | Stronger instruments, LIML |
| TWFE negative weighting | Estimation | No | Heterogeneity-robust DID estimators |
| Extreme IPW weights | Estimation | Yes (slowly) | Trimming, EB, doubly robust |
| Attenuation (measurement error) | Both | No | Better data, IV |
Simulation: identification matters, estimation is secondary
Same data, same identification assumption, three different estimators. When the assumption holds, they all work. When it doesn’t, they all fail.
#| standalone: true
#| viewerHeight: 580
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 13px; line-height: 1.8;
}
.stats-box b { color: #2c3e50; }
.good { color: #27ae60; font-weight: bold; }
.bad { color: #e74c3c; font-weight: bold; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("n_ie", "Sample size:",
min = 200, max = 2000, value = 500, step = 100),
sliderInput("ate_ie", "True ATE:",
min = 0, max = 5, value = 2, step = 0.5),
sliderInput("obs_ie", "Observed confounding (X):",
min = 0, max = 3, value = 1.5, step = 0.25),
sliderInput("unobs_ie", "Unobserved confounding (U):",
min = 0, max = 3, value = 0, step = 0.25),
actionButton("go_ie", "New draw", class = "btn-primary", width = "100%"),
uiOutput("results_ie")
),
mainPanel(
width = 9,
plotOutput("ie_plot", height = "420px")
)
)
)
server <- function(input, output, session) {
dat <- reactive({
input$go_ie
n <- input$n_ie
ate <- input$ate_ie
gx <- input$obs_ie
gu <- input$unobs_ie
x <- rnorm(n)
u <- rnorm(n)
p <- pnorm(gx * x + gu * u)
treat <- rbinom(n, 1, p)
y <- 1 + 2 * x + 1.5 * u + ate * treat + rnorm(n)
# Estimator 1: OLS regression controlling for X
est_reg <- coef(lm(y ~ treat + x))[2]
# Estimator 2: IPW
ps <- fitted(glm(treat ~ x, family = binomial))
ps <- pmin(pmax(ps, 0.01), 0.99)
w <- ifelse(treat == 1, 1 / ps, 1 / (1 - ps))
est_ipw <- weighted.mean(y[treat == 1], w[treat == 1]) -
weighted.mean(y[treat == 0], w[treat == 0])
# Estimator 3: Matching (simple: nearest neighbor on X)
matched_y <- numeric(sum(treat == 1))
x_t <- x[treat == 1]
y_t <- y[treat == 1]
x_c <- x[treat == 0]
y_c <- y[treat == 0]
for (i in seq_along(x_t)) {
nearest <- which.min(abs(x_c - x_t[i]))
matched_y[i] <- y_c[nearest]
}
est_match <- mean(y_t) - mean(matched_y)
list(est_reg = est_reg, est_ipw = est_ipw, est_match = est_match,
ate = ate, gu = gu)
})
output$ie_plot <- renderPlot({
d <- dat()
par(mar = c(5, 4.5, 3, 1))
estimates <- c(d$est_reg, d$est_ipw, d$est_match)
biases <- estimates - d$ate
labels <- c("OLS\nregression", "IPW", "Nearest-neighbor\nmatching")
cia_holds <- d$gu == 0
cols <- ifelse(abs(biases) < 0.5, "#27ae60", "#e74c3c")
bp <- barplot(estimates, col = cols, border = NA,
names.arg = labels, cex.names = 0.85,
main = ifelse(cia_holds,
"CIA holds: all estimators work",
"CIA violated: all estimators fail"),
ylab = "Estimate",
ylim = c(0, max(estimates, d$ate) * 1.5))
abline(h = d$ate, lty = 2, col = "gray40", lwd = 2)
text(0.2, d$ate + 0.15, paste0("True ATE = ", d$ate),
col = "gray40", cex = 0.85, adj = 0)
text(bp, estimates + 0.2,
paste0(round(estimates, 2)),
cex = 0.9, font = 2)
})
output$results_ie <- renderUI({
d <- dat()
cia_holds <- d$gu == 0
tags$div(class = "stats-box",
HTML(paste0(
"<b>True ATE:</b> ", d$ate, "<br>",
"<hr style='margin:6px 0'>",
"<b>Regression:</b> ", round(d$est_reg, 2),
" (bias: ", round(d$est_reg - d$ate, 2), ")<br>",
"<b>IPW:</b> ", round(d$est_ipw, 2),
" (bias: ", round(d$est_ipw - d$ate, 2), ")<br>",
"<b>Matching:</b> ", round(d$est_match, 2),
" (bias: ", round(d$est_match - d$ate, 2), ")<br>",
"<hr style='margin:6px 0'>",
if (cia_holds)
"<span class='good'>CIA holds.</span> All three estimators give similar, roughly unbiased answers. The choice of estimator is secondary."
else
"<span class='bad'>CIA violated.</span> All three estimators are biased. Switching estimators doesn't help — you need a different identification strategy."
))
)
})
}
shinyApp(ui, server)
Things to try
- Unobserved confounding = 0: the CIA holds. All three estimators — regression, IPW, matching — give roughly the same answer, close to the true ATE. The choice between them is about efficiency, not bias.
- Unobserved confounding = 2: the CIA is violated. All three estimators are biased in the same direction. Switching from regression to IPW to matching doesn’t help — the problem is identification, not estimation.
- Increase sample size with unobserved confounding: all three get more precise but stay biased. More data doesn’t fix a broken assumption.
The lesson: spend your energy on identification, not on the fanciest estimator.
In Stata: identification → estimation cheat sheet
| Identification strategy | Stata command |
|---|---|
| Random assignment | reg outcome treatment |
| Selection on observables | teffects ra (outcome x1 x2) (treatment) |
| Inverse probability weighting | teffects ipw (outcome) (treatment x1 x2) |
| Matching | teffects nnmatch (outcome x1 x2) (treatment) |
| Doubly robust | teffects aipw (outcome x1 x2) (treatment x1 x2) |
| Difference-in-differences | reg outcome treated##post, cluster(group) |
| Instrumental variables | ivregress 2sls outcome (treatment = instrument) |
| Regression discontinuity | rdrobust outcome running_var, c(0) |
| Fixed effects | xtreg outcome treatment x1, fe cluster(id) |
The right column is the easy part. The hard part is arguing that the left column holds.
Did you know?
The distinction between identification and estimation was articulated clearly by Charles Manski in his 1995 book Identification Problems in the Social Sciences. He argued that most debates in empirical work are really about identification, not estimation.
Angrist & Pischke (Mostly Harmless Econometrics, 2009) organized their entire textbook around identification strategies — regression, IV, DID, RDD — rather than estimators. This framing reshaped how a generation of economists thinks about empirical work.
A common mistake in applied papers: spending pages discussing the estimator (clustered SEs, bootstrap, semiparametric methods) while spending one paragraph on identification. The estimator is the easy part. The hard part is arguing that your comparison is causal.