Double/Debiased Machine Learning
Doubly robust estimation showed that combining an outcome model with a propensity score model gives you two shots at consistency. But both models were specified parametrically — linear regression for the outcome, logit for the propensity score. What if you want to use machine learning for these nuisance functions? The answer is not as simple as “plug in a random forest.”
The problem with naively plugging in ML
Suppose you want to estimate the ATE under selection on observables:
\[ \tau = E[Y(1) - Y(0)] \]
You need two nuisance functions: the outcome model \(\mu_d(X) = E[Y \mid X, D=d]\) and the propensity score \(e(X) = P(D=1 \mid X)\). The AIPW estimator from the doubly robust page uses both:
\[ \hat{\tau}_{DR} = \frac{1}{n}\sum_{i=1}^n \left[\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{D_i(Y_i - \hat{\mu}_1(X_i))}{\hat{e}(X_i)} - \frac{(1-D_i)(Y_i - \hat{\mu}_0(X_i))}{1 - \hat{e}(X_i)}\right] \]
If you estimate \(\hat{\mu}\) and \(\hat{e}\) with OLS and logit, standard asymptotic theory gives you \(\sqrt{n}\)-consistent inference on \(\hat{\tau}\). But if you estimate them with a random forest, lasso, or neural network, the standard theory breaks down. The problem is regularization bias: ML estimators deliberately accept some bias in exchange for lower variance (the bias-variance tradeoff), and that bias leaks into the estimate of \(\tau\).
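However the nuisance functions are estimated, the estimator in the display above is just a sample average. A minimal sketch in R, assuming `muhat1`, `muhat0`, and `ehat` are vectors of predictions from whatever learners you chose (the names are illustrative, not from any package):

```r
# AIPW / doubly robust ATE given predicted nuisance functions.
# Y: outcome vector; D: 0/1 treatment vector
# muhat1, muhat0: predictions of E[Y | X, D = 1] and E[Y | X, D = 0]
# ehat: predicted propensity score P(D = 1 | X)
aipw_ate <- function(Y, D, muhat1, muhat0, ehat) {
  psi <- muhat1 - muhat0 +
    D * (Y - muhat1) / ehat -
    (1 - D) * (Y - muhat0) / (1 - ehat)
  c(estimate = mean(psi), std_error = sd(psi) / sqrt(length(Y)))
}
```

With cross-fitted nuisance estimates (introduced below), the empirical variance of the score also delivers an asymptotically valid standard error, which is part of what makes this construction attractive.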
Concretely, the bias of \(\hat{\tau}\) depends on the product of the biases of \(\hat{\mu}\) and \(\hat{e}\):
\[ \text{Bias}(\hat{\tau}) \approx \text{Bias}(\hat{\mu}) \times \text{Bias}(\hat{e}) \]
For parametric models with \(\sqrt{n}\) convergence, both biases are \(O(1/\sqrt{n})\), so the product is \(O(1/n)\), negligible next to the \(O(1/\sqrt{n})\) sampling noise. But ML estimators typically converge slower than \(\sqrt{n}\); unless each nuisance error shrinks faster than \(n^{-1/4}\), the bias product can be large enough to distort inference.
The DML solution: Neyman orthogonality + cross-fitting
Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins (2018) showed that two ingredients fix this problem.
Ingredient 1: Neyman orthogonal scores
Use an estimating equation (a “score”) whose expectation is insensitive to small perturbations in the nuisance functions. The AIPW score from doubly robust estimation has this property — it is Neyman orthogonal:
\[ \psi(Y, D, X; \tau, \mu, e) = \mu_1(X) - \mu_0(X) + \frac{D(Y - \mu_1(X))}{e(X)} - \frac{(1-D)(Y - \mu_0(X))}{1 - e(X)} - \tau \]
The Neyman orthogonality condition means that the derivative of \(E[\psi]\) with respect to the nuisance functions \((\mu, e)\) is zero at the true values. Intuitively, the score is “locally robust” to errors in the nuisance estimates: first-order mistakes in \(\hat{\mu}\) and \(\hat{e}\) have no first-order effect on \(\hat{\tau}\).
This is what “debiased” means. The AIPW structure removes the first-order bias from the nuisance estimation. What remains is the product of biases — which is second-order.
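As a quick check, perturb the outcome model inside the score by an arbitrary direction \(h\) and differentiate (a sketch for the \(\mu_1\) direction only; \(\mu_0\) and \(e\) work the same way):

\[
\frac{\partial}{\partial t} E\big[\psi(Y, D, X;\, \tau, \mu_1 + t h, \mu_0, e)\big]\bigg|_{t=0}
= E\!\left[h(X) - \frac{D\, h(X)}{e(X)}\right]
= E\!\left[h(X)\left(1 - \frac{E[D \mid X]}{e(X)}\right)\right] = 0
\]

because \(E[D \mid X] = e(X)\) at the true propensity score. The analogous calculation for perturbations of \(e\) uses \(E[Y - \mu_1(X) \mid X, D = 1] = 0\) at the true outcome model.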
Ingredient 2: Cross-fitting
Even with an orthogonal score, using the same data to estimate nuisance functions and the causal parameter creates an overfitting bias. ML estimators can overfit in subtle ways that parametric models don’t, and this leaks into \(\hat{\tau}\).
The fix is cross-fitting — a form of sample splitting:
- Split the data into \(K\) folds (typically \(K = 5\))
- For each fold \(k\):
  - Estimate \(\hat{\mu}\) and \(\hat{e}\) using data outside fold \(k\)
  - Compute the score \(\psi_i\) for observations inside fold \(k\)
- Solve \(\frac{1}{n}\sum_{i=1}^n \psi_i = 0\) for \(\hat{\tau}\); because the score is linear in \(\tau\), this just means averaging the AIPW terms across all observations
This ensures the nuisance estimates are never evaluated on the same data used to construct them. Cross-fitting eliminates the overfitting bias while using all data for both nuisance estimation and causal estimation (unlike simple sample splitting, which wastes half the data).
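A minimal cross-fitting sketch in R. It uses `lm()` and `glm()` as stand-in learners so it runs with base R alone; in practice you would swap in your preferred ML method. The function name `cross_fit_ate()` is illustrative, not from any package:

```r
# Cross-fitted AIPW estimate of the ATE with K folds.
# Y: outcome; D: 0/1 treatment; X: data.frame of covariates.
cross_fit_ate <- function(Y, D, X, K = 5) {
  n <- length(Y)
  fold <- sample(rep(1:K, length.out = n))
  muhat1 <- muhat0 <- ehat <- numeric(n)
  for (k in 1:K) {
    train <- fold != k
    test  <- fold == k
    # Nuisance models are fit only on data outside fold k...
    fit1 <- lm(Y ~ ., data = data.frame(Y = Y, X)[train & D == 1, ])
    fit0 <- lm(Y ~ ., data = data.frame(Y = Y, X)[train & D == 0, ])
    fite <- glm(D ~ ., data = data.frame(D = D, X)[train, ], family = binomial())
    # ...and evaluated only on observations inside fold k
    muhat1[test] <- predict(fit1, newdata = X[test, , drop = FALSE])
    muhat0[test] <- predict(fit0, newdata = X[test, , drop = FALSE])
    ehat[test]   <- predict(fite, newdata = X[test, , drop = FALSE], type = "response")
  }
  psi <- muhat1 - muhat0 +
    D * (Y - muhat1) / ehat -
    (1 - D) * (Y - muhat0) / (1 - ehat)
  c(estimate = mean(psi), std_error = sd(psi) / sqrt(n))
}
```

The final score average is the same as in `aipw_ate()` above; only the construction of the out-of-fold predictions is new.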
The partially linear model
A common DML application. Suppose the model is:
\[ Y_i = \tau D_i + g(X_i) + \varepsilon_i \]
where \(g(X_i)\) is a potentially complex function of high-dimensional controls. You want \(\tau\) — the treatment effect — but \(g\) is a nuisance function.
The DML approach:
- Use ML, with cross-fitting, to estimate \(\hat{\ell}(X) \approx E[Y \mid X]\) and \(\hat{m}(X) \approx E[D \mid X]\). (Since \(E[Y \mid X] = \tau\, m(X) + g(X)\), \(\hat{\ell}\) is not an estimate of \(g\) itself, but it is exactly what the residualization needs.)
- Compute residuals: \(\tilde{Y}_i = Y_i - \hat{\ell}(X_i)\) and \(\tilde{D}_i = D_i - \hat{m}(X_i)\)
- Regress \(\tilde{Y}\) on \(\tilde{D}\) to get \(\hat{\tau}\)
This is Frisch-Waugh-Lovell with ML residualization. Instead of controlling for \(X\) linearly, you partial it out flexibly. The treatment effect \(\tau\) remains a simple, interpretable parameter — only the nuisance part uses ML.
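The interactive demo below keeps the nuisance models linear so the Monte Carlo stays fast, but the recipe is the same with a genuine ML learner. A sketch using random forests, assuming the randomForest package is installed (the function name `dml_plr()` is illustrative):

```r
library(randomForest)

# Partially linear DML: residualize Y and D on X with cross-fitted
# random forests, then regress residual on residual.
dml_plr <- function(Y, D, X, K = 5) {
  n <- length(Y)
  fold <- sample(rep(1:K, length.out = n))
  Y_res <- D_res <- numeric(n)
  for (k in 1:K) {
    train <- fold != k
    test  <- fold == k
    # Out-of-fold predictions of E[Y | X] and E[D | X]
    rf_y <- randomForest(x = X[train, , drop = FALSE], y = Y[train])
    rf_d <- randomForest(x = X[train, , drop = FALSE], y = D[train])
    Y_res[test] <- Y[test] - predict(rf_y, X[test, , drop = FALSE])
    D_res[test] <- D[test] - predict(rf_d, X[test, , drop = FALSE])
  }
  # Final stage: OLS of residualized Y on residualized D (no intercept)
  fit <- lm(Y_res ~ D_res - 1)
  c(estimate = unname(coef(fit)), std_error = summary(fit)$coefficients[1, 2])
}
```

The final stage is an ordinary regression, so \(\hat{\tau}\) stays a single interpretable coefficient; heteroskedasticity-robust standard errors are a common refinement over the plain OLS one reported here.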
#| standalone: true
#| viewerHeight: 680
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
.good { color: #27ae60; font-weight: bold; }
.bad { color: #e74c3c; font-weight: bold; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("n", "Sample size (n):",
min = 200, max = 2000, value = 500, step = 200),
sliderInput("tau", "True treatment effect (\u03C4):",
min = 0, max = 5, value = 2, step = 0.5),
sliderInput("p", "Number of confounders (p):",
min = 2, max = 20, value = 5, step = 1),
actionButton("go", "Run simulation", class = "btn-primary", width = "100%"),
uiOutput("results")
),
mainPanel(
width = 9,
plotOutput("density_plot", height = "500px")
)
)
)
server <- function(input, output, session) {
sim <- reactive({
input$go  # take a dependency on the button so clicking it re-runs the simulation
n <- input$n
tau <- input$tau
p <- input$p
B <- 200
naive_ests <- numeric(B)
ols_ests <- numeric(B)
dml_ests <- numeric(B)
for (b in 1:B) {
# Generate confounders
X <- matrix(rnorm(n * p), n, p)
# Propensity and treatment
e_x <- plogis(X[, 1] + 0.5 * X[, 2])
D <- rbinom(n, 1, e_x)
# Outcome
Y <- tau * D + sin(X[, 1]) + X[, 2]^2 + rnorm(n)
# 1. Naive: unadjusted difference in means
naive_ests[b] <- mean(Y[D == 1]) - mean(Y[D == 0])
# 2. OLS: lm(Y ~ D + X)
dat_ols <- data.frame(Y = Y, D = D, X)
ols_fit <- lm(Y ~ D + ., data = dat_ols)
ols_ests[b] <- coef(ols_fit)["D"]
# 3. DML-style: residualize Y and D on X with cross-fitting
fold <- rep(1:2, length.out = n)
fold <- sample(fold)
Y_resid <- numeric(n)
D_resid <- numeric(n)
for (k in 1:2) {
train <- fold != k
test <- fold == k
X_train <- X[train, , drop = FALSE]
X_test <- X[test, , drop = FALSE]
# Fit nuisance models on training fold
dat_y <- data.frame(Y = Y[train], X_train)
dat_d <- data.frame(D = D[train], X_train)
fit_y <- lm(Y ~ ., data = dat_y)
fit_d <- lm(D ~ ., data = dat_d)
# Predict on test fold
newdat <- data.frame(X_test)
names(newdat) <- names(dat_y)[-1]
Y_resid[test] <- Y[test] - predict(fit_y, newdata = newdat)
D_resid[test] <- D[test] - predict(fit_d, newdata = newdat)
}
# Regress residualized Y on residualized D
dml_ests[b] <- coef(lm(Y_resid ~ D_resid - 1))
}
list(naive = naive_ests, ols = ols_ests, dml = dml_ests, tau = tau)
})
output$density_plot <- renderPlot({
s <- sim()
par(mar = c(5, 4.5, 3, 1))
# Compute densities
d_naive <- density(s$naive, adjust = 1.2)
d_ols <- density(s$ols, adjust = 1.2)
d_dml <- density(s$dml, adjust = 1.2)
xlim <- range(c(d_naive$x, d_ols$x, d_dml$x))
ylim <- c(0, max(c(d_naive$y, d_ols$y, d_dml$y)) * 1.15)
plot(NULL, xlim = xlim, ylim = ylim,
xlab = expression("Estimated " * tau),
ylab = "Density",
main = "Sampling Distributions (200 Monte Carlo Reps)")
# Density curves
polygon(d_naive$x, d_naive$y, col = adjustcolor("#e74c3c", 0.25), border = "#e74c3c", lwd = 2)
polygon(d_ols$x, d_ols$y, col = adjustcolor("#3498db", 0.25), border = "#3498db", lwd = 2)
polygon(d_dml$x, d_dml$y, col = adjustcolor("#27ae60", 0.25), border = "#27ae60", lwd = 2)
# True tau
abline(v = s$tau, lty = 2, lwd = 2.5, col = "#2c3e50")
text(s$tau, ylim[2] * 0.95,
bquote("True " * tau == .(s$tau)),
cex = 0.9, font = 2, col = "#2c3e50", pos = 4)
legend("topright", bty = "n", cex = 0.85,
legend = c("Naive (diff. in means)", "OLS (with controls)", "DML-style (cross-fitted)"),
col = c("#e74c3c", "#3498db", "#27ae60"),
lwd = 2, fill = adjustcolor(c("#e74c3c", "#3498db", "#27ae60"), 0.25),
border = c("#e74c3c", "#3498db", "#27ae60"))
})
output$results <- renderUI({
s <- sim()
tau <- s$tau
mn <- c(mean(s$naive), mean(s$ols), mean(s$dml))
sd <- c(sd(s$naive), sd(s$ols), sd(s$dml))
bias <- mn - tau
tags$div(class = "stats-box",
HTML(paste0(
"<b>True \u03C4:</b> ", tau, "<br>",
"<hr style='margin:6px 0'>",
"<b>Naive:</b> mean = <span class='bad'>", round(mn[1], 3), "</span>",
" (SD: ", round(sd[1], 3), ", bias: ", round(bias[1], 3), ")<br>",
"<b>OLS:</b> mean = ", round(mn[2], 3),
" (SD: ", round(sd[2], 3), ", bias: ", round(bias[2], 3), ")<br>",
"<b>DML:</b> mean = <span class='good'>", round(mn[3], 3), "</span>",
" (SD: ", round(sd[3], 3), ", bias: ", round(bias[3], 3), ")"
))
)
})
}
shinyApp(ui, server)
Things to try
- Increase confounders: in this simulation only the first two covariates drive treatment and outcome, so raising \(p\) adds irrelevant controls rather than extra confounding. The naive estimator stays roughly as biased as before, while OLS and DML have more covariates to handle; DML’s cross-fitting avoids the overfitting bias that can creep in when \(p\) is large relative to \(n\).
- Set \(\tau = 0\): all estimators should center near zero if unbiased. The naive estimator will still be biased because treatment is confounded.
- Small \(n\), large \(p\): with \(n = 200\) and \(p = 20\), OLS starts to overfit the confounders. The cross-fitted DML approach is more robust because it never evaluates predictions on the same data used to fit the nuisance models.
What DML assumes
DML does not weaken the identification assumptions. You still need:
- Selection on observables: \(Y(0), Y(1) \perp D \mid X\)
- Overlap: \(0 < P(D=1 \mid X) < 1\) for all \(X\)
- SUTVA and consistency from Potential Outcomes
What DML relaxes are the estimation assumptions:
- The nuisance functions \(\mu(X)\) and \(e(X)\) can be nonparametric, estimated by lasso, random forest, boosting, neural networks, or essentially any learner that predicts well enough
- The nuisance estimators only need to converge faster than \(n^{-1/4}\) (rather than the parametric \(n^{-1/2}\) rate), a much weaker requirement that many ML methods satisfy
- No need to correctly specify a parametric model for \(\mu\) or \(e\)
Connecting to the course
DML builds directly on several pages:
- Doubly robust estimation: the AIPW moment condition is the Neyman orthogonal score that DML uses for the ATE. DML adds cross-fitting and ML nuisance estimation.
- IPW: the propensity score \(e(X)\) can now be estimated flexibly instead of by logit.
- FWL: the partially linear model is FWL with ML residualization.
- Identification vs Estimation: DML is the clearest example of this separation — ML for estimation, research design for identification.