Double/Debiased Machine Learning
Doubly robust estimation showed that combining an outcome model with a propensity score model gives you two shots at consistency. But both models were specified parametrically — linear regression for the outcome, logit for the propensity score. What if you want to use machine learning for these nuisance functions? The answer is not as simple as “plug in a random forest.”
The problem with naively plugging in ML
Suppose you want to estimate the ATE under selection on observables:
\[ \tau = E[Y(1) - Y(0)] \]
You need two nuisance functions: the outcome model \(\mu_d(X) = E[Y \mid X, D=d]\) and the propensity score \(e(X) = P(D=1 \mid X)\). The AIPW estimator from the doubly robust page uses both:
\[ \hat{\tau}_{DR} = \frac{1}{n}\sum_{i=1}^n \left[\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{D_i(Y_i - \hat{\mu}_1(X_i))}{\hat{e}(X_i)} - \frac{(1-D_i)(Y_i - \hat{\mu}_0(X_i))}{1 - \hat{e}(X_i)}\right] \]
If you estimate \(\hat{\mu}\) and \(\hat{e}\) with OLS and logit, standard asymptotic theory gives you \(\sqrt{n}\)-consistent inference on \(\hat{\tau}\). But if you estimate them with a random forest, lasso, or neural network, the standard theory breaks down. The problem is regularization bias: ML estimators are biased (they trade bias for variance — the bias-variance tradeoff), and this bias contaminates the estimate of \(\tau\).
Concretely, the bias of \(\hat{\tau}\) scales with the product of the estimation errors of \(\hat{\mu}\) and \(\hat{e}\):
\[ \text{Bias}(\hat{\tau}) \approx \text{Error}(\hat{\mu}) \times \text{Error}(\hat{e}) \]
For correctly specified parametric models, both errors are \(O(n^{-1/2})\), so the product is \(O(n^{-1})\) — negligible. But ML estimators typically converge more slowly than \(n^{-1/2}\), and the error product can be large enough to distort inference.
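To make the rate arithmetic concrete (the rates below are illustrative, not tied to any particular estimator — what matters is that the product shrinks faster than \(n^{-1/2}\)):
\[ \underbrace{n^{-1/3}}_{\text{error in }\hat{\mu}} \times \underbrace{n^{-1/3}}_{\text{error in }\hat{e}} = n^{-2/3} = o\big(n^{-1/2}\big) \ \text{(fine)}, \qquad n^{-1/10} \times n^{-1/10} = n^{-1/5} \gg n^{-1/2} \ \text{(distorts inference)}. \]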
The DML solution: Neyman orthogonality + cross-fitting
Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins (2018) showed that two ingredients fix this problem.
Ingredient 1: Neyman orthogonal scores
Use an estimating equation (a “score”) whose expectation is insensitive to small perturbations in the nuisance functions. The AIPW score from doubly robust estimation has this property — it is Neyman orthogonal:
\[ \psi(Y, D, X; \tau, \mu, e) = \mu_1(X) - \mu_0(X) + \frac{D(Y - \mu_1(X))}{e(X)} - \frac{(1-D)(Y - \mu_0(X))}{1 - e(X)} - \tau \]
The Neyman orthogonality condition means that the derivative of \(E[\psi]\) with respect to the nuisance functions \((\mu, e)\) is zero at the true values. Intuitively, the score is “locally robust” to errors in the nuisance estimates — first-order mistakes in \(\hat{\mu}\) and \(\hat{e}\) don’t affect \(\hat{\tau}\).
This is what “debiased” means. The AIPW structure removes the first-order bias from the nuisance estimation. What remains is the product of biases — which is second-order.
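This second-order behavior is easy to check numerically. Below is a toy simulation (the DGP and perturbation sizes are made up for illustration — this is not the app shown later on this page): the treated-arm outcome model and the propensity score are both off by \(\varepsilon\). The plug-in estimator \(\frac{1}{n}\sum[\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)]\) drifts by a full \(\varepsilon\), while the AIPW correction terms shrink the drift to second order:

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau = 500_000, 2.0
X = rng.normal(size=n)
e = 1 / (1 + np.exp(-X))              # true propensity score
D = rng.binomial(1, e)
Y = tau * D + X + rng.normal(size=n)  # true mu_d(x) = tau*d + x

def drift(eps):
    """Estimation error of plug-in vs AIPW when nuisances are off by eps."""
    mu1_hat = tau + X + eps                 # perturbed outcome model (D = 1)
    mu0_hat = X                             # outcome model for D = 0 left correct
    e_hat = np.clip(e + eps, 0.02, 0.98)    # perturbed propensity score
    plug_in = np.mean(mu1_hat - mu0_hat)
    aipw = np.mean(mu1_hat - mu0_hat
                   + D * (Y - mu1_hat) / e_hat
                   - (1 - D) * (Y - mu0_hat) / (1 - e_hat))
    return plug_in - tau, aipw - tau

plug_err, aipw_err = drift(0.1)
# plug-in drifts by the full 0.1 (first order); the AIPW drift is much smaller
```

The AIPW drift is roughly \(\varepsilon^2\)-sized — the correction term absorbs the first-order error in \(\hat{\mu}_1\), leaving only the product of the two perturbations.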
Ingredient 2: Cross-fitting
Even with an orthogonal score, using the same data to estimate nuisance functions and the causal parameter creates an overfitting bias. ML estimators can overfit in subtle ways that parametric models don’t, and this leaks into \(\hat{\tau}\).
The fix is cross-fitting — a form of sample splitting:
- Split the data into \(K\) folds (typically \(K = 5\))
- For each fold \(k\):
  - Estimate \(\hat{\mu}\) and \(\hat{e}\) using data outside fold \(k\)
  - Compute the score \(\psi_i\) for observations inside fold \(k\)
- Average all scores: \(\hat{\tau} = \frac{1}{n}\sum_{i=1}^n \psi_i\)
This ensures the nuisance estimates are never evaluated on the same data used to construct them. Cross-fitting eliminates the overfitting bias while using all data for both nuisance estimation and causal estimation (unlike simple sample splitting, which wastes half the data).
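In code, the loop is short. A sketch in Python with scikit-learn (the function name, learner choices, and trimming threshold are illustrative, not a reference implementation):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

def dml_aipw_ate(Y, D, X, K=5, seed=0):
    """Cross-fitted AIPW: nuisances fit out-of-fold, scores evaluated in-fold."""
    psi = np.empty(len(Y))
    for train, test in KFold(n_splits=K, shuffle=True, random_state=seed).split(X):
        # Outcome models, one per treatment arm, fit on the other folds
        mu1 = RandomForestRegressor(random_state=seed).fit(
            X[train][D[train] == 1], Y[train][D[train] == 1])
        mu0 = RandomForestRegressor(random_state=seed).fit(
            X[train][D[train] == 0], Y[train][D[train] == 0])
        ps = RandomForestClassifier(random_state=seed).fit(X[train], D[train])
        # Evaluate the AIPW score only on the held-out fold
        m1, m0 = mu1.predict(X[test]), mu0.predict(X[test])
        e = np.clip(ps.predict_proba(X[test])[:, 1], 0.02, 0.98)  # trim overlap
        psi[test] = (m1 - m0
                     + D[test] * (Y[test] - m1) / e
                     - (1 - D[test]) * (Y[test] - m0) / (1 - e))
    return psi.mean()
```

With \(K = 5\), each nuisance model trains on 80% of the data, and every observation still contributes a score to the final average.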
The partially linear model
A common DML application. Suppose the model is:
\[ Y_i = \tau D_i + g(X_i) + \varepsilon_i \]
where \(g(X_i)\) is a potentially complex function of high-dimensional controls. You want \(\tau\) — the treatment effect — but \(g\) is a nuisance function.
The DML approach:
- Use ML to estimate \(\hat{g}(X) \approx E[Y \mid X]\) and \(\hat{m}(X) \approx E[D \mid X]\) via cross-fitting
- Compute residuals: \(\tilde{Y}_i = Y_i - \hat{g}(X_i)\) and \(\tilde{D}_i = D_i - \hat{m}(X_i)\)
- Regress \(\tilde{Y}\) on \(\tilde{D}\) to get \(\hat{\tau}\)
This is Frisch-Waugh-Lovell with ML residualization. Instead of controlling for \(X\) linearly, you partial it out flexibly. The treatment effect \(\tau\) remains a simple, interpretable parameter — only the nuisance part uses ML.
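A minimal sketch of the three steps (Python with scikit-learn; the learner choice and fold count are arbitrary — `cross_val_predict` returns out-of-fold predictions, which is exactly the cross-fitting step):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

def dml_plm(Y, D, X, K=5):
    """Partially linear DML: cross-fit, residualize, then FWL regression."""
    # Step 1: cross-fitted (out-of-fold) predictions of E[Y|X] and E[D|X]
    ell_hat = cross_val_predict(GradientBoostingRegressor(), X, Y, cv=K)
    m_hat = cross_val_predict(GradientBoostingRegressor(), X, D, cv=K)
    # Step 2: residualize both outcome and treatment
    Y_t, D_t = Y - ell_hat, D - m_hat
    # Step 3: no-intercept OLS of the Y-residual on the D-residual
    return float(D_t @ Y_t / (D_t @ D_t))
```

The same function works unchanged for a continuous treatment, since it only needs \(\hat{m}(X) = \hat{E}[D \mid X]\), not a propensity score.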
Why “double-debiased”?
The name means you remove bias from two things — the outcome and the treatment.
Debias Y. You predict \(Y\) from confounders \(X\) using ML (random forest, lasso, whatever), then subtract the prediction. If someone’s obesity rate is 35% and their demographics predict 32%, the residual \(\tilde{Y}_i = 35 - 32 = 3\) is the part of the outcome that confounders cannot explain. That’s what’s left for treatment to potentially explain.
Debias D. Same idea for treatment. If a tract has 4 dollar stores and its demographics predict 3.2, the residual \(\tilde{D}_i = 4 - 3.2 = 0.8\) is the “surprising” part of treatment — the variation not driven by confounders.
Why do you need both? If you only debiased \(Y\) but used raw \(D\), the ML estimation error in \(\hat{g}\) leaks into your causal estimate:
\[ \text{Single-debiased (broken):} \quad \hat{\tau} - \tau \;\approx\; \frac{1}{n}\sum \underbrace{D_i}_{\text{correlated with } X} \cdot \underbrace{\big(g(X_i) - \hat{g}(X_i)\big)}_{\text{ML error}} \]
This does not vanish — \(D_i\) is correlated with \(X_i\), so ML errors accumulate. But after double-debiasing:
\[ \text{Double-debiased (works):} \quad \hat{\tau} - \tau \;\approx\; \frac{1}{n}\sum \underbrace{\tilde{D}_i}_{\perp\, X} \cdot \underbrace{\big(g(X_i) - \hat{g}(X_i)\big)}_{\text{ML error}} \]
Now \(\tilde{D}_i\) is uncorrelated with functions of \(X_i\), so the ML errors average out instead of accumulating. This is the Neyman orthogonality property — the “double” residualization makes the causal estimate immune to first-order ML mistakes.
#| standalone: true
#| viewerHeight: 680
library(shiny)
ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
    .good { color: #27ae60; font-weight: bold; }
    .bad { color: #e74c3c; font-weight: bold; }
  "))),
  sidebarLayout(
    sidebarPanel(
      width = 3,
      sliderInput("n", "Sample size (n):",
                  min = 200, max = 2000, value = 500, step = 200),
      sliderInput("tau", "True treatment effect (\u03C4):",
                  min = 0, max = 5, value = 2, step = 0.5),
      sliderInput("p", "Number of confounders (p):",
                  min = 2, max = 20, value = 5, step = 1),
      actionButton("go", "Run simulation", class = "btn-primary", width = "100%"),
      uiOutput("results")
    ),
    mainPanel(
      width = 9,
      plotOutput("density_plot", height = "500px")
    )
  )
)
server <- function(input, output, session) {
  # eventReactive ties the simulation to the button press;
  # ignoreNULL = FALSE also runs it once at startup
  sim <- eventReactive(input$go, {
    n <- input$n
    tau <- input$tau
    p <- input$p
    B <- 200
    naive_ests <- numeric(B)
    ols_ests <- numeric(B)
    dml_ests <- numeric(B)
    for (b in 1:B) {
      # Generate confounders
      X <- matrix(rnorm(n * p), n, p)
      # Propensity and treatment
      e_x <- plogis(X[, 1] + 0.5 * X[, 2])
      D <- rbinom(n, 1, e_x)
      # Outcome
      Y <- tau * D + sin(X[, 1]) + X[, 2]^2 + rnorm(n)
      # 1. Naive: unadjusted difference in means
      naive_ests[b] <- mean(Y[D == 1]) - mean(Y[D == 0])
      # 2. OLS: lm(Y ~ D + X)
      dat_ols <- data.frame(Y = Y, D = D, X)
      ols_fit <- lm(Y ~ D + ., data = dat_ols)
      ols_ests[b] <- coef(ols_fit)["D"]
      # 3. DML-style: residualize Y and D on X with cross-fitting
      fold <- rep(1:2, length.out = n)
      fold <- sample(fold)
      Y_resid <- numeric(n)
      D_resid <- numeric(n)
      for (k in 1:2) {
        train <- fold != k
        test <- fold == k
        X_train <- X[train, , drop = FALSE]
        X_test <- X[test, , drop = FALSE]
        # Fit nuisance models on training fold
        dat_y <- data.frame(Y = Y[train], X_train)
        dat_d <- data.frame(D = D[train], X_train)
        fit_y <- lm(Y ~ ., data = dat_y)
        fit_d <- lm(D ~ ., data = dat_d)
        # Predict on test fold
        newdat <- data.frame(X_test)
        names(newdat) <- names(dat_y)[-1]
        Y_resid[test] <- Y[test] - predict(fit_y, newdata = newdat)
        D_resid[test] <- D[test] - predict(fit_d, newdata = newdat)
      }
      # Regress residualized Y on residualized D (no intercept)
      dml_ests[b] <- coef(lm(Y_resid ~ D_resid - 1))
    }
    list(naive = naive_ests, ols = ols_ests, dml = dml_ests, tau = tau)
  }, ignoreNULL = FALSE)
  output$density_plot <- renderPlot({
    s <- sim()
    par(mar = c(5, 4.5, 3, 1))
    # Compute densities
    d_naive <- density(s$naive, adjust = 1.2)
    d_ols <- density(s$ols, adjust = 1.2)
    d_dml <- density(s$dml, adjust = 1.2)
    xlim <- range(c(d_naive$x, d_ols$x, d_dml$x))
    ylim <- c(0, max(c(d_naive$y, d_ols$y, d_dml$y)) * 1.15)
    plot(NULL, xlim = xlim, ylim = ylim,
         xlab = expression("Estimated " * tau),
         ylab = "Density",
         main = "Sampling Distributions (200 Monte Carlo Reps)")
    # Density curves
    polygon(d_naive$x, d_naive$y, col = adjustcolor("#e74c3c", 0.25), border = "#e74c3c", lwd = 2)
    polygon(d_ols$x, d_ols$y, col = adjustcolor("#3498db", 0.25), border = "#3498db", lwd = 2)
    polygon(d_dml$x, d_dml$y, col = adjustcolor("#27ae60", 0.25), border = "#27ae60", lwd = 2)
    # True tau
    abline(v = s$tau, lty = 2, lwd = 2.5, col = "#2c3e50")
    text(s$tau, ylim[2] * 0.95,
         bquote("True " * tau == .(s$tau)),
         cex = 0.9, font = 2, col = "#2c3e50", pos = 4)
    legend("topright", bty = "n", cex = 0.85,
           legend = c("Naive (diff. in means)", "OLS (with controls)", "DML-style (cross-fitted)"),
           col = c("#e74c3c", "#3498db", "#27ae60"),
           lwd = 2, fill = adjustcolor(c("#e74c3c", "#3498db", "#27ae60"), 0.25),
           border = c("#e74c3c", "#3498db", "#27ae60"))
  })
  output$results <- renderUI({
    s <- sim()
    tau <- s$tau
    mn <- c(mean(s$naive), mean(s$ols), mean(s$dml))
    sds <- c(sd(s$naive), sd(s$ols), sd(s$dml))
    bias <- mn - tau
    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>True \u03C4:</b> ", tau, "<br>",
        "<hr style='margin:6px 0'>",
        "<b>Naive:</b> mean = <span class='bad'>", round(mn[1], 3), "</span>",
        " (SD: ", round(sds[1], 3), ", bias: ", round(bias[1], 3), ")<br>",
        "<b>OLS:</b> mean = ", round(mn[2], 3),
        " (SD: ", round(sds[2], 3), ", bias: ", round(bias[2], 3), ")<br>",
        "<b>DML:</b> mean = <span class='good'>", round(mn[3], 3), "</span>",
        " (SD: ", round(sds[3], 3), ", bias: ", round(bias[3], 3), ")"
      ))
    )
  })
}
shinyApp(ui, server)
Things to try
- Increase confounders: with more confounders \(p\), the naive estimator gets worse because there is more confounding. OLS and DML both adjust, but DML’s cross-fitting avoids overfitting bias that can creep in when \(p\) is large relative to \(n\).
- Set \(\tau = 0\): the adjusted estimators (OLS, DML) should center near zero. The naive estimator will not, because treatment is still confounded.
- Small \(n\), large \(p\): with \(n = 200\) and \(p = 20\), OLS starts to overfit the confounders. The cross-fitted DML approach is more robust because it never evaluates predictions on the same data used to fit the nuisance models.
What DML assumes
DML does not weaken the identification assumptions. You still need:
- Selection on observables: \(Y(0), Y(1) \perp D \mid X\)
- Overlap: \(0 < P(D=1 \mid X) < 1\) for all \(X\)
- SUTVA and consistency from Potential Outcomes
What DML relaxes are the estimation assumptions:
- The nuisance functions \(\mu(X)\) and \(e(X)\) can be nonparametric — estimated by lasso, random forest, boosting, neural network, or any method
- The nuisance estimators only need to converge faster than \(n^{-1/4}\) (rather than the parametric \(n^{-1/2}\) rate) — a much weaker requirement that many ML methods satisfy
- No need to correctly specify a parametric model for \(\mu\) or \(e\)
Continuous treatment and heterogeneous effects
Everything above assumes a binary treatment \(D \in \{0,1\}\) and estimates a single average effect \(\tau\). But DML extends naturally to two important generalizations.
Continuous treatment
When treatment is a dose rather than a switch — dollar store exposure within a radius, pollution concentration, hours of tutoring — the partially linear model already handles it:
\[Y_i = \tau D_i + g(X_i) + \varepsilon_i\]
Here \(D_i\) is continuous and \(\tau\) is the marginal effect of a one-unit increase in \(D\). The DML recipe is identical: residualize \(Y\) and \(D\) on \(X\) via cross-fitting, then regress \(\tilde{Y}\) on \(\tilde{D}\). No propensity score is needed — you estimate \(E[D \mid X]\) (the conditional mean of treatment) and \(E[Y \mid X]\) as the two nuisance functions.
The orthogonalization ensures \(\sqrt{n}\)-consistency of \(\hat{\tau}\) even when the nuisance models converge at slower nonparametric rates — this is the same guarantee as in the binary case.
Conditional average treatment effects (CATE)
To go from a single \(\tau\) to a function \(\tau(x)\) — how the effect varies across individuals — DML combines with causal forests or other CATE estimators. The idea:
- Residualize \(Y\) and \(D\) on \(X\) via cross-fitting (the DML step)
- Estimate heterogeneity in \(\tilde{Y} = \tau(X)\tilde{D} + \eta\) using a causal forest or kernel method on the residualized data
This is called the R-learner (Nie & Wager, 2021). It minimizes the loss:
\[\hat{\tau}(x) = \arg\min_{\tau} \sum_{i=1}^{n} \left(\tilde{Y}_i - \tau(X_i)\tilde{D}_i\right)^2\]
The residualization removes confounding; the second stage finds heterogeneity. This works for both binary and continuous treatments — with continuous \(D\), \(\tau(x)\) is the heterogeneous marginal effect, not a binary treatment/control comparison.
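A practical note on fitting the R-learner loss: since \((\tilde{Y}_i - \tau(X_i)\tilde{D}_i)^2 = \tilde{D}_i^2\,\big(\tilde{Y}_i/\tilde{D}_i - \tau(X_i)\big)^2\), the second stage can be any off-the-shelf regressor that accepts sample weights. A sketch (assumes residuals `Y_t`, `D_t` from a prior cross-fitting step; the learner is an arbitrary choice):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def r_learner_cate(X, Y_t, D_t):
    """Second stage of the R-learner via the weighted-regression trick."""
    pseudo = Y_t / D_t    # pseudo-outcome (explodes when D_t is near 0 ...)
    weights = D_t ** 2    # ... but the weights shrink at exactly the same rate)
    model = GradientBoostingRegressor()
    model.fit(X, pseudo, sample_weight=weights)
    return model          # model.predict(x) estimates tau(x)
```

Observations whose treatment is fully explained by confounders (\(\tilde{D}_i \approx 0\)) carry almost no weight — they contain no information about the effect, which is exactly what the original loss says.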
When to use this
| Setting | Method |
|---|---|
| Binary \(D\), want ATE | Standard DML (this page) |
| Binary \(D\), want \(\tau(x)\) | DML + causal forest (R-learner) |
| Continuous \(D\), want average marginal effect | Partially linear model with DML |
| Continuous \(D\), want \(\tau(x)\) | R-learner with continuous \(D\) |
Connecting to the course
DML builds directly on several pages:
- Doubly robust estimation: the AIPW estimator is the score function that DML uses. DML adds cross-fitting and ML nuisance estimation.
- IPW: the propensity score \(e(X)\) can now be estimated flexibly instead of by logit.
- FWL: the partially linear model is FWL with ML residualization.
- Identification vs Estimation: DML is the clearest example of this separation — ML for estimation, research design for identification.