When ML Helps and When It Doesn’t

Machine learning is transforming parts of empirical research. But it is not transforming all parts equally, and understanding where ML helps is as important as knowing the methods. This page draws the boundary.

The separation principle

Identification vs Estimation established that every causal study has two components:

  1. Identification: the argument for why a comparison is causal
  2. Estimation: the procedure for computing the causal parameter from data

Machine learning helps with estimation. It does not help with identification. This is the separation principle, and it is the single most important idea in causal ML.

|                     | What ML improves                        | What ML does not improve                          |
|---------------------|-----------------------------------------|---------------------------------------------------|
| Nuisance estimation | Propensity scores, outcome models       | The argument for unconfoundedness                 |
| Functional form     | Nonlinear, high-dimensional adjustment  | Whether the adjustment variables are sufficient   |
| Heterogeneity       | Flexible CATE estimation                | Whether the ATE is identified in the first place  |
| Prediction          | Out-of-sample forecasting               | Whether the forecast has a causal interpretation  |

The Manski principle, restated. No estimator — OLS, random forest, neural network, or transformer — can identify a causal parameter that the research design does not identify. Identification is a property of the data-generating process and the assumptions you’re willing to make, not the statistical method you apply. ML makes estimation more flexible; it does not make identification less necessary.

Where ML helps

1. High-dimensional confounders

When the set of potential confounders is large — dozens or hundreds of variables — traditional methods struggle. You can’t include 200 variables in a logistic regression for the propensity score without overfitting. ML methods (lasso, random forests, boosting) handle high-dimensional covariate adjustment naturally.

DML provides the framework: use ML for the nuisance functions (propensity score, outcome model), cross-fit to avoid overfitting bias, and use Neyman orthogonal scores for valid inference on the causal parameter.
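The recipe can be sketched end to end. Below is a minimal, illustrative implementation of cross-fitted DML for a partially linear model, using simple least-squares fits in place of real ML learners (the simulated data-generating process and helper names like `linfit` are invented for this sketch, not taken from any library):

```python
import random

random.seed(0)

def linfit(x, y):
    """Fit y ~ a + b*x by least squares; return a prediction function."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
    return lambda z: ybar + b * (z - xbar)

# Simulated partially linear model: Y = theta*D + g(X) + eps, true theta = 2
n = 2000
X = [random.gauss(0, 1) for _ in range(n)]
D = [x + random.gauss(0, 1) for x in X]              # treatment depends on X
Y = [2 * d + x + random.gauss(0, 1) for d, x in zip(D, X)]

# Cross-fitting: estimate the nuisances E[Y|X] and E[D|X] on one fold,
# residualize the held-out fold, then swap the roles of the folds.
folds = [list(range(0, n // 2)), list(range(n // 2, n))]
resD, resY = [], []
for fit_idx, pred_idx in [(folds[0], folds[1]), (folds[1], folds[0])]:
    m_hat = linfit([X[i] for i in fit_idx], [Y[i] for i in fit_idx])
    e_hat = linfit([X[i] for i in fit_idx], [D[i] for i in fit_idx])
    resD += [D[i] - e_hat(X[i]) for i in pred_idx]
    resY += [Y[i] - m_hat(X[i]) for i in pred_idx]

# Final stage: regress Y-residuals on D-residuals (Neyman orthogonal score)
theta_hat = sum(d * y for d, y in zip(resD, resY)) / sum(d * d for d in resD)
print(round(theta_hat, 2))  # close to the true theta = 2
```

In a real application the two `linfit` calls would be replaced by flexible learners (lasso, boosting, forests); the cross-fitting and residual-on-residual structure stay exactly the same.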

2. Nonlinear relationships

If the outcome depends on covariates in complex, nonlinear ways, a linear outcome model is misspecified. ML methods can capture these nonlinearities without the researcher having to specify them in advance (which polynomial terms, which interactions).

This is where doubly robust estimation and DML intersect: the DR structure protects against misspecification of either model, and ML provides flexible estimators that reduce the risk of misspecification in the first place.
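The doubly robust (AIPW) score itself is compact enough to sketch. In the toy simulation below the outcome models are deliberately misspecified (linear fits to a quadratic surface) while the propensity score is estimated correctly, and the AIPW estimate still lands on the true ATE; the DGP and helper names are invented for illustration:

```python
import random

random.seed(1)

def linfit(x, y):
    # Least-squares fit of y ~ a + b*x; returns a predictor
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
    return lambda z: ybar + b * (z - xbar)

# Randomized treatment, nonlinear outcome surface: true ATE = 1
n = 4000
X = [random.gauss(0, 1) for _ in range(n)]
D = [1 if random.random() < 0.5 else 0 for _ in range(n)]
Y = [d + x * x + random.gauss(0, 1) for d, x in zip(D, X)]

# Misspecified outcome models: linear in X even though g(X) = X^2
mu1 = linfit([x for x, d in zip(X, D) if d], [y for y, d in zip(Y, D) if d])
mu0 = linfit([x for x, d in zip(X, D) if not d], [y for y, d in zip(Y, D) if not d])
e_hat = sum(D) / n  # correctly estimated (constant) propensity score

# AIPW / doubly robust score: outcome-model term plus IPW correction terms
ate = sum(
    mu1(x) - mu0(x)
    + d * (y - mu1(x)) / e_hat
    - (1 - d) * (y - mu0(x)) / (1 - e_hat)
    for x, d, y in zip(X, D, Y)
) / n
print(round(ate, 2))  # near the true ATE of 1 despite the wrong outcome model
```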

3. Discovering treatment effect heterogeneity

Heterogeneous treatment effects showed that causal forests and meta-learners can estimate \(\tau(x)\) — revealing which subgroups benefit most from treatment. This is genuinely new: traditional subgroup analysis requires pre-specifying which subgroups to examine, while ML-based methods can search over many covariates simultaneously.
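As a concrete toy illustration, here is a T-learner sketch on simulated data with a known heterogeneous effect \(\tau(x) = 1 + x\), using simple linear fits as stand-in base learners (real applications would use forests or boosting; the DGP and `linfit` helper are invented for this sketch):

```python
import random

random.seed(2)

def linfit(x, y):
    # Least-squares fit of y ~ a + b*x; returns a predictor
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
    return lambda z: ybar + b * (z - xbar)

# Randomized treatment; heterogeneous effect tau(x) = 1 + x
n = 4000
X = [random.gauss(0, 1) for _ in range(n)]
D = [1 if random.random() < 0.5 else 0 for _ in range(n)]
Y = [(1 + x) * d + x + random.gauss(0, 1) for x, d in zip(X, D)]

# T-learner: fit separate outcome models on the treated and control arms,
# then take their difference as the CATE estimate.
mu1 = linfit([x for x, d in zip(X, D) if d], [y for y, d in zip(Y, D) if d])
mu0 = linfit([x for x, d in zip(X, D) if not d], [y for y, d in zip(Y, D) if not d])
tau = lambda x: mu1(x) - mu0(x)

print(round(tau(0.0), 2), round(tau(1.0), 2))  # roughly 1 and 2
```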

4. Improving precision

Even when a simple estimator is consistent, ML-based covariate adjustment can reduce variance. In randomized experiments, the ATE is identified by randomization — you don’t need covariate adjustment for consistency. But adjusting for predictive covariates reduces the residual variance and tightens confidence intervals. ML can identify which covariates are most predictive without the researcher specifying a functional form.
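A small simulation makes the precision gain concrete: across repeated randomized experiments, a covariate-adjusted ATE estimator shows visibly lower sampling variability than the raw difference in means (the DGP here is invented for illustration):

```python
import random, statistics

random.seed(3)

def ate_estimates(n=400):
    """One simulated RCT; returns (unadjusted, adjusted) ATE estimates."""
    X = [random.gauss(0, 1) for _ in range(n)]
    D = [1 if random.random() < 0.5 else 0 for _ in range(n)]
    # True ATE = 1; X is strongly predictive of Y but independent of D
    Y = [d + 3 * x + random.gauss(0, 1) for x, d in zip(X, D)]

    n1 = sum(D)
    diff_means = (sum(y for y, d in zip(Y, D) if d) / n1
                  - sum(y for y, d in zip(Y, D) if not d) / (n - n1))

    # Adjusted: residualize Y on X (pooled), then difference the residuals
    xbar = sum(X) / n
    b = sum((x - xbar) * y for x, y in zip(X, Y)) / sum((x - xbar) ** 2 for x in X)
    R = [y - b * (x - xbar) for x, y in zip(X, Y)]
    adjusted = (sum(r for r, d in zip(R, D) if d) / n1
                - sum(r for r, d in zip(R, D) if not d) / (n - n1))
    return diff_means, adjusted

runs = [ate_estimates() for _ in range(300)]
sd_unadj = statistics.stdev(r[0] for r in runs)
sd_adj = statistics.stdev(r[1] for r in runs)
print(sd_adj < sd_unadj)  # adjustment shrinks the sampling variability
```

Both estimators are consistent for the ATE; the adjustment only removes the residual variance that X explains, which is exactly the role ML-based adjustment plays in larger problems.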

Where ML does not help

1. Unobserved confounding

If treatment assignment depends on an unobserved variable \(U\):

\[ Y(0), Y(1) \not\perp D \mid X \]

then no method — parametric or nonparametric — can estimate the ATE consistently from observational data on \((Y, D, X)\) alone. ML estimates \(E[Y \mid X, D]\) more flexibly, but a more flexible approximation to a confounded conditional expectation is still confounded.
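This failure mode is easy to demonstrate. In the simulation below the true effect is 1, but because the confounder U is unobserved, the comparison of treated and control outcomes is badly biased, and no amount of flexibility in modeling the observed \((Y, D)\) can fix it (DGP invented for illustration):

```python
import random

random.seed(4)

# True treatment effect is 1; U confounds both treatment and outcome
n = 5000
U = [random.gauss(0, 1) for _ in range(n)]
D = [1 if u + random.gauss(0, 1) > 0 else 0 for u in U]
Y = [d + 2 * u + random.gauss(0, 1) for d, u in zip(D, U)]

# Naive comparison of observed outcomes -- all any estimator of
# E[Y | D] can recover when U is not in the data
naive = (sum(y for y, d in zip(Y, D) if d) / sum(D)
         - sum(y for y, d in zip(Y, D) if not d) / (n - sum(D)))
print(round(naive, 2))  # far above the true effect of 1: confounding bias
```

Swapping the naive difference in means for a forest or a neural network fit to the same \((Y, D)\) data changes nothing: the target quantity itself is confounded.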

This is the failure mode that identification guards against. The solution is not a better estimator — it is a better research design (an instrument, a discontinuity, a natural experiment).

2. Providing identification

ML can estimate \(P(Y \mid X)\) but not \(P(Y \mid do(X))\) — the distinction from the stats course. The interventional distribution requires knowledge of the causal structure (a DAG, an exclusion restriction, a parallel trends argument). ML methods are agnostic about causal structure by design — they optimize prediction, not identification.

3. Small samples

ML methods generally require large samples to outperform simple parametric models. With \(n = 200\), a lasso propensity score is unlikely to improve on a carefully specified logit. DML’s cross-fitting further reduces effective sample size. In small samples, domain knowledge and parametric modeling often dominate data-driven flexibility.

4. Interpretability of nuisance functions

When a random forest estimates the propensity score, you lose the ability to inspect and interpret the model. You can’t point to specific coefficients and say “age increases treatment probability by X percentage points.” This matters when the propensity score model is of substantive interest (understanding selection into treatment) rather than just a statistical tool.

A decision framework

When deciding whether to use ML in a causal study:

Is the causal parameter identified?
├── No → Fix the research design. ML cannot help.
└── Yes → Are nuisance functions well-approximated by simple models?
    ├── Yes → Standard parametric methods are fine. ML adds complexity
    │         without clear benefit.
    └── No → ML-based estimation (DML, causal forests) can help.
        ├── Is the sample large enough? (n > 1000 as rough guide)
        │   ├── Yes → Proceed with DML / causal forests
        │   └── No → Consider regularized parametric models (lasso)
        └── Do you need heterogeneous effects?
            ├── Yes → Causal forests, meta-learners
            └── No → DML for ATE

The honest summary. ML is a powerful tool for the estimation half of causal inference. It handles high-dimensional confounders, nonlinear relationships, and heterogeneous effects better than traditional parametric methods. But it is entirely silent on the identification half — the question of whether the causal parameter can be recovered at all. The most sophisticated ML pipeline applied to a confounded comparison produces a precise, confident, wrong answer.

Connecting to the course

This page synthesizes the entire causal ML section: