Heterogeneous Treatment Effects
Everything so far has focused on average treatment effects — the ATE or ATT from Potential Outcomes. But averages can hide enormous variation. A drug that helps most patients and harms a few has a positive ATE, but the few who are harmed would like to know that. A policy that benefits one demographic and hurts another looks mediocre on average.
This page introduces a new estimand: the treatment effect as a function of covariates. The next page covers the estimation machinery — causal forests, meta-learners — for recovering it from data.
From \(\tau\) to \(\tau(x)\)
The ATE is a single number:
\[ \tau = E[Y(1) - Y(0)] \]
The Conditional Average Treatment Effect (CATE) is a function:
\[ \tau(x) = E[Y(1) - Y(0) \mid X = x] \]
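This answers: “What is the average treatment effect among individuals with characteristics \(X = x\)?” It is still an average, taken over everyone who shares the same \(x\), not an individual-level effect. The ATE is the average of the CATE across the population: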
\[ \tau = E[\tau(X)] \]
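The step from \(\tau(x)\) back to \(\tau\) is the law of iterated expectations. Written out, with the discrete case on the second line (an illustrative special case in which \(X\) takes finitely many values):
\[ \tau = E[Y(1) - Y(0)] = E\big[\, E[Y(1) - Y(0) \mid X] \,\big] = E[\tau(X)] \]
\[ \tau = \sum_{x} P(X = x)\, \tau(x) \]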
The CATE sits alongside the other estimands in the course:
| Estimand | Definition | Question it answers |
|---|---|---|
| ATE | \(E[Y(1) - Y(0)]\) | What is the average effect? |
| ATT | \(E[Y(1) - Y(0) \mid D = 1]\) | What is the effect on the treated? |
| LATE | \(E[Y(1) - Y(0) \mid \text{complier}]\) (IV) | What is the effect for those moved by the instrument? |
| CATE | \(E[Y(1) - Y(0) \mid X = x]\) | How does the effect vary with \(x\)? |
The first three are single numbers. The CATE is a function of \(x\), so estimating it means learning a whole surface rather than one parameter, with less data available at each value of \(x\).
Why average effects can mislead
Sign reversal
The ATE can be positive even if the treatment harms a substantial subgroup. Suppose a job training program increases earnings by $5,000 for workers without a college degree but decreases earnings by $1,000 for workers with a degree (perhaps by diverting them from better opportunities). If 80% of participants lack a degree:
\[ \tau = 0.8 \times 5000 + 0.2 \times (-1000) = 3800 \]
The ATE is $3,800 — positive and “significant.” But the policy is actively harmful for 20% of participants. Without estimating \(\tau(x)\), you would never know.
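The arithmetic above is a discrete case of \(\tau = E[\tau(X)]\): the ATE is the share-weighted average of the subgroup effects. A minimal Python sketch using the numbers from the example (the dictionary layout and variable names are illustrative assumptions, not part of any library):

```python
# Subgroup shares and effects taken from the job training example.
subgroups = {
    "no_degree": {"share": 0.8, "effect": 5000.0},
    "degree": {"share": 0.2, "effect": -1000.0},
}

# ATE = share-weighted average of the subgroup effects.
ate = sum(g["share"] * g["effect"] for g in subgroups.values())

# Share of participants who belong to a subgroup with a negative effect.
harmed_share = sum(g["share"] for g in subgroups.values() if g["effect"] < 0)

print(f"ATE: {ate:.0f}")                    # ATE: 3800
print(f"Share harmed: {harmed_share:.0%}")  # Share harmed: 20%
```

The single number 3800 is compatible with many different mixes of subgroup effects, which is why the harmed share has to be read off the subgroup effects themselves.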
Optimal targeting
If a treatment has heterogeneous effects, you want to treat the people who benefit. This requires knowing \(\tau(x)\), not just \(\tau\). When treatment is costless and there is no capacity constraint, the optimal treatment rule is:
\[ D^*(x) = \mathbf{1}\{\tau(x) > 0\} \]
Treat if and only if the expected effect is positive. When resources are limited, the same logic ranks individuals by \(\tau(x)\) and treats those with the largest effects first. Either way, this is policy targeting, and it requires the CATE.
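The rule above maps CATE values to treatment decisions. A minimal sketch, assuming CATE estimates are already available as a plain list (the values below are made up for illustration). The budgeted variant `treat_top_k` is an added assumption for the limited-resources case: it supposes every unit costs the same to treat.

```python
from typing import List


def treat_if_positive(cate: List[float]) -> List[int]:
    """Unconstrained rule: D*(x) = 1{tau(x) > 0}."""
    return [1 if t > 0 else 0 for t in cate]


def treat_top_k(cate: List[float], budget: int) -> List[int]:
    """Budgeted variant (assumption): treat at most `budget` units,
    picking the largest CATE values and never treating when tau(x) <= 0."""
    order = sorted(range(len(cate)), key=lambda i: cate[i], reverse=True)
    chosen = {i for i in order[:budget] if cate[i] > 0}
    return [1 if i in chosen else 0 for i in range(len(cate))]


# Hypothetical CATE values for five units.
cate_hat = [5000.0, -1000.0, 1200.0, 0.0, 300.0]

print(treat_if_positive(cate_hat))      # [1, 0, 1, 0, 1]
print(treat_top_k(cate_hat, budget=2))  # [1, 0, 1, 0, 0]
```

In practice the inputs would be estimates of \(\tau(x)\), so the quality of the targeting depends on how well those estimates are pinned down.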
External validity
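An RCT in one population gives you \(\tau\) for that population. If you want to know what would happen in a different population with a different distribution of \(X\), you need \(\tau(x)\), which you then average over the target population's covariate distribution. The CATE is transportable across populations in a way that the ATE is not, provided the function \(\tau(x)\) is the same in both populations and the covariate values found in the target population were also observed in the study.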
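Transporting an effect amounts to re-averaging \(\tau(x)\) with the target population's covariate shares. A small sketch reusing the job training numbers from earlier on this page; the target shares (30% without a degree) and the helper name `average_effect` are hypothetical, chosen only to illustrate the calculation, and the sketch assumes \(\tau(x)\) is identical in both populations:

```python
def average_effect(cate, shares):
    """Average a group-level CATE over a population's covariate shares."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    missing = [g for g, s in shares.items() if s > 0 and g not in cate]
    if missing:
        # The study never observed these groups, so tau(x) is unknown there.
        raise ValueError(f"no CATE available for groups: {missing}")
    return sum(s * cate[g] for g, s in shares.items() if s > 0)


cate = {"no_degree": 5000.0, "degree": -1000.0}      # from the example
study_shares = {"no_degree": 0.8, "degree": 0.2}     # from the example
target_shares = {"no_degree": 0.3, "degree": 0.7}    # hypothetical target

ate_study = average_effect(cate, study_shares)
ate_target = average_effect(cate, target_shares)

print(f"Study ATE:  {ate_study:.0f}")   # Study ATE:  3800
print(f"Target ATE: {ate_target:.0f}")  # Target ATE: 800
```

The same \(\tau(x)\) yields a much smaller average effect in the hypothetical target population, and the function refuses to extrapolate to a group the study never covered.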
Identification of the CATE
The CATE requires the same identification assumptions as the ATE, but applied conditionally:
\[ Y(0), Y(1) \perp D \mid X = x \qquad \text{(conditional unconfoundedness)} \]
\[ 0 < P(D = 1 \mid X = x) < 1 \qquad \text{(overlap)} \]
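These are the selection on observables conditions. Identifying the ATE requires them throughout the support of \(X\): if overlap fails in some region, the ATE is identified only for the subpopulation with overlap (after trimming) or by extrapolating from a model. Identifying the CATE at a given \(x\) requires them at that specific \(x\), so \(\tau(x)\) can be recovered where overlap holds even if it fails elsewhere. The practical demand is greater, though: estimating \(\tau(x)\) needs enough treated and control units near each \(x\) of interest, whereas the ATE pools information across the whole sample.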
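Under these two conditions the CATE equals a difference of observable conditional means:
\[ \tau(x) = E[Y \mid D = 1, X = x] - E[Y \mid D = 0, X = x] \]
The sketch below checks this on simulated data with a single binary covariate. Everything in the data-generating process (baseline earnings, propensities, noise level, sample size) is a made-up assumption for illustration; only the two subgroup effects echo the job training example.

```python
import random
from collections import defaultdict

random.seed(0)

# Hypothetical data-generating process; x = 1 means "has a degree".
TRUE_CATE = {0: 5000.0, 1: -1000.0}
PROPENSITY = {0: 0.7, 1: 0.3}  # P(D = 1 | X = x), strictly between 0 and 1


def simulate(n):
    rows = []
    for _ in range(n):
        x = 1 if random.random() < 0.2 else 0
        d = 1 if random.random() < PROPENSITY[x] else 0  # depends on x only
        y0 = 30000.0 + 10000.0 * x + random.gauss(0.0, 2000.0)
        rows.append((x, d, y0 + d * TRUE_CATE[x]))
    return rows


def mean(values):
    return sum(values) / len(values)


def cate_by_cell(rows):
    """Difference in mean outcomes, treated minus control, within each x."""
    groups = defaultdict(list)
    for x, d, y in rows:
        groups[(x, d)].append(y)
    estimates = {}
    for x in sorted({x for x, _, _ in rows}):
        treated, control = groups.get((x, 1)), groups.get((x, 0))
        if not treated or not control:
            estimates[x] = None  # overlap fails in the sample at this x
        else:
            estimates[x] = mean(treated) - mean(control)
    return estimates


rows = simulate(20000)
cate_hat = cate_by_cell(rows)
naive = (mean([y for _, d, y in rows if d == 1])
         - mean([y for _, d, y in rows if d == 0]))

print({x: round(v) for x, v in cate_hat.items()})
print("naive difference in means:", round(naive))
```

In this simulated setup the within-cell estimates should land near 5000 and -1000, while the naive comparison is pulled well below the true ATE of 3800 because degree holders are both less likely to be treated and have higher baseline earnings. A cell with no treated or no control units returns `None`, which is the sample analogue of an overlap failure at that \(x\).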
Subgroup analysis: the traditional approach
The simplest way to look for heterogeneity: split the sample by a covariate and estimate the ATE within each subgroup. Run the regression separately for men and women, for young and old, for high-income and low-income.
This works when:
- You have a small number of pre-specified subgroups
- You have large samples within each subgroup
- The heterogeneity is along a single, known dimension
It breaks down when:
- You have many covariates and don’t know which ones drive heterogeneity
- Interactions matter (the effect differs for young women vs old men, not just by age or gender separately)
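- You test many subgroups without correction, which is a multiple testing problem: with 20 independent tests at the 5% level and no true effects, the chance of at least one false positive is about 64%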
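The multiple testing problem in the last bullet can be made concrete with a short simulation. This is a sketch under stated assumptions: 20 disjoint subgroups, no true effect in any of them, and subgroup test statistics that are independent standard normal draws under the null. The Bonferroni correction shown is one common remedy, chosen here for simplicity.

```python
import random
from statistics import NormalDist

random.seed(1)

ALPHA = 0.05
N_SUBGROUPS = 20   # hypothetical number of subgroups tested
N_STUDIES = 5000   # simulated studies, each with zero true effect everywhere
STD_NORMAL = NormalDist()


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def familywise_error_rate(threshold):
    """Share of null studies in which at least one subgroup test rejects."""
    false_alarms = 0
    for _ in range(N_STUDIES):
        p_values = [two_sided_p(random.gauss(0.0, 1.0))
                    for _ in range(N_SUBGROUPS)]
        if min(p_values) < threshold:
            false_alarms += 1
    return false_alarms / N_STUDIES


fwer_uncorrected = familywise_error_rate(ALPHA)
fwer_bonferroni = familywise_error_rate(ALPHA / N_SUBGROUPS)
fwer_theory = 1.0 - (1.0 - ALPHA) ** N_SUBGROUPS  # about 0.64

print(f"theory, uncorrected:    {fwer_theory:.3f}")
print(f"simulated, uncorrected: {fwer_uncorrected:.3f}")
print(f"simulated, Bonferroni:  {fwer_bonferroni:.3f}")
```

Without correction, most null studies report at least one "significant" subgroup. Dividing the threshold by the number of tests brings the family-wise error rate back to roughly the nominal 5%, at the cost of power within each subgroup.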
The limitations of subgroup analysis motivate data-driven methods (causal forests and meta-learners), covered in the next page. These methods search over many covariates simultaneously while guarding against overfitting, which gives a principled way to discover heterogeneity that traditional subgroup analysis would miss.
Connecting to the course
- Potential Outcomes: the CATE is defined within the potential outcomes framework — \(\tau(x) = E[Y(1) - Y(0) \mid X = x]\)
- Selection on Observables: conditional unconfoundedness is the identification assumption for the CATE
- Identification vs Estimation: the CATE is a new estimand, not a new identification strategy — the research design must still justify the causal interpretation
- Causal Forests: the estimation machinery for recovering \(\tau(x)\) from data