Double/Debiased Machine Learning

Doubly robust estimation showed that combining an outcome model with a propensity score model gives you two shots at consistency. But both models were specified parametrically — linear regression for the outcome, logit for the propensity score. What if you want to use machine learning for these nuisance functions? The answer is not as simple as “plug in a random forest.”

The problem with naively plugging in ML

Suppose you want to estimate the ATE under selection on observables:

\[ \tau = E[Y(1) - Y(0)] \]

You need two nuisance functions: the outcome model \(\mu_d(X) = E[Y \mid X, D=d]\) and the propensity score \(e(X) = P(D=1 \mid X)\). The AIPW estimator from the doubly robust page uses both:

\[ \hat{\tau}_{DR} = \frac{1}{n}\sum_{i=1}^n \left[\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{D_i(Y_i - \hat{\mu}_1(X_i))}{\hat{e}(X_i)} - \frac{(1-D_i)(Y_i - \hat{\mu}_0(X_i))}{1 - \hat{e}(X_i)}\right] \]

If you estimate \(\hat{\mu}\) and \(\hat{e}\) with OLS and logit, standard asymptotic theory delivers a \(\sqrt{n}\)-consistent estimate of \(\tau\) with valid confidence intervals. But if you estimate them with a random forest, lasso, or neural network, the standard theory breaks down. The problem is regularization bias: ML estimators deliberately accept some bias in exchange for lower variance (the bias-variance tradeoff), and that bias contaminates the estimate of \(\tau\).

Concretely, for the AIPW estimator the bias of \(\hat{\tau}\) depends, to leading order, on the product of the biases of \(\hat{\mu}\) and \(\hat{e}\):

\[ \text{Bias}(\hat{\tau}) \approx \text{Bias}(\hat{\mu}) \times \text{Bias}(\hat{e}) \]

For parametric models with \(\sqrt{n}\) convergence, both biases are \(O(1/\sqrt{n})\), so the product is \(O(1/n)\): negligible relative to the estimator's own \(O(1/\sqrt{n})\) sampling noise. But ML estimators typically converge slower than \(\sqrt{n}\), and the resulting bias can be large enough to distort point estimates and confidence intervals.
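A back-of-the-envelope calculation (a heuristic, not a formal result) shows the stakes. Suppose each nuisance estimate converges in root-mean-square error at rate \(n^{-a}\). With the AIPW form, the bias term above behaves like

\[ \text{Bias}(\hat{\tau}) \approx O(n^{-a}) \times O(n^{-a}) = O(n^{-2a}), \]

which is negligible next to the \(O(n^{-1/2})\) sampling noise only when \(a > 1/4\). Parametric models, with \(a = 1/2\), clear that bar comfortably; flexible ML methods may not. And if you plug ML predictions into a non-orthogonal estimator (say, plain regression adjustment), the nuisance error enters at first order, \(O(n^{-a})\), which never vanishes fast enough for valid \(\sqrt{n}\) inference when \(a < 1/2\). Both observations motivate the two ingredients below.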

The DML solution: Neyman orthogonality + cross-fitting

Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins (2018) showed that two ingredients fix this problem.

Ingredient 1: Neyman orthogonal scores

Use an estimating equation (a “score”) whose expectation is insensitive to small perturbations in the nuisance functions. The AIPW score from doubly robust estimation has this property — it is Neyman orthogonal:

\[ \psi(Y, D, X; \tau, \mu, e) = \mu_1(X) - \mu_0(X) + \frac{D(Y - \mu_1(X))}{e(X)} - \frac{(1-D)(Y - \mu_0(X))}{1 - e(X)} - \tau \]

The Neyman orthogonality condition says that the (Gateaux) derivative of \(E[\psi]\) with respect to the nuisance functions \((\mu, e)\), evaluated at the true values, is zero. Intuitively, the score is “locally robust” to errors in the nuisance estimates: first-order errors in \(\hat{\mu}\) and \(\hat{e}\) have no first-order effect on \(\hat{\tau}\).

This is what “debiased” means. The AIPW structure removes the first-order bias from the nuisance estimation. What remains is the product of biases — which is second-order.
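Formally, write \(\eta = (\mu_0, \mu_1, e)\) for the nuisance functions and \(\eta_0\) for their true values. Neyman orthogonality is the statement that the pathwise (Gateaux) derivative of the moment condition vanishes at the truth:

\[ \left. \frac{\partial}{\partial r} \, E\Big[\psi\big(Y, D, X;\; \tau_0,\; \eta_0 + r(\eta - \eta_0)\big)\Big] \right|_{r=0} = 0 \]

for any (suitably regular) direction \(\eta - \eta_0\). A small nuisance error \(\hat{\eta} - \eta_0\) therefore moves \(E[\psi]\) only at second order, which is exactly the product-of-biases behavior described above.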

Ingredient 2: Cross-fitting

Even with an orthogonal score, using the same data to estimate nuisance functions and the causal parameter creates an overfitting bias. ML estimators can overfit in subtle ways that parametric models don’t, and this leaks into \(\hat{\tau}\).

The fix is cross-fitting — a form of sample splitting:

  1. Split the data into \(K\) folds (typically \(K = 5\))
  2. For each fold \(k\):
    • Estimate \(\hat{\mu}\) and \(\hat{e}\) using data outside fold \(k\)
    • Compute the score \(\psi_i\) for observations inside fold \(k\)
  3. Solve \(\frac{1}{n}\sum_{i=1}^n \psi_i(\hat{\tau}) = 0\) for \(\hat{\tau}\); for the AIPW score this is just the average of the AIPW terms across all observations

This ensures the nuisance estimates are never evaluated on the same data used to construct them. Cross-fitting eliminates the overfitting bias while using all data for both nuisance estimation and causal estimation (unlike simple sample splitting, which wastes half the data).

The DML recipe. (1) Choose a Neyman orthogonal score: AIPW for the ATE, the partially linear model score for a regression coefficient. (2) Estimate the nuisance functions with any ML method, using cross-fitting. (3) Compute the causal parameter from the averaged scores. (4) Compute standard errors from the variance of the scores: \(\widehat{\text{Var}}(\hat{\tau}) = \frac{1}{n^2}\sum_i \hat{\psi}_i^2\), where \(\hat{\psi}_i\) is the score evaluated at \(\hat{\tau}\) and the cross-fitted nuisances.
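To make the recipe concrete, here is a minimal, self-contained sketch in R. It is illustrative rather than definitive: the data-generating process is invented for the example, K = 5 folds are used, and plain lm()/glm() fits stand in for whatever ML learner you would actually use. The cross-fitting structure and the score-based standard error are the point.

# Cross-fitted AIPW ("DML for the ATE"): an illustrative sketch.
# lm()/glm() below are placeholders for ML learners; swap in a random
# forest, boosting, or lasso without changing anything else.

set.seed(1)
n <- 1000; p <- 5
X <- matrix(rnorm(n * p), n, p)
D <- rbinom(n, 1, plogis(X[, 1] + 0.5 * X[, 2]))      # treatment
Y <- 2 * D + sin(X[, 1]) + X[, 2]^2 + rnorm(n)        # true ATE = 2

Xdf  <- as.data.frame(X)
dat  <- data.frame(Y = Y, Xdf)       # for the outcome models
pdat <- data.frame(D = D, Xdf)       # for the propensity model

K    <- 5
fold <- sample(rep(1:K, length.out = n))
psi  <- numeric(n)                   # per-observation AIPW terms

for (k in 1:K) {
  train <- fold != k
  test  <- fold == k

  # Nuisance models fit on data outside fold k
  fit_mu1 <- lm(Y ~ ., data = dat[train & D == 1, ])
  fit_mu0 <- lm(Y ~ ., data = dat[train & D == 0, ])
  fit_e   <- glm(D ~ ., data = pdat[train, ], family = binomial)

  # Nuisance predictions evaluated only inside fold k (cross-fitting)
  mu1 <- predict(fit_mu1, newdata = dat[test, ])
  mu0 <- predict(fit_mu0, newdata = dat[test, ])
  e   <- predict(fit_e,   newdata = pdat[test, ], type = "response")
  e   <- pmin(pmax(e, 0.01), 0.99)   # crude trimming to enforce overlap

  # AIPW score (without the -tau term)
  psi[test] <- mu1 - mu0 +
    D[test] * (Y[test] - mu1) / e -
    (1 - D[test]) * (Y[test] - mu0) / (1 - e)
}

tau_hat <- mean(psi)                 # step (3): average the scores
se_hat  <- sd(psi) / sqrt(n)         # step (4): SE from the score variance
c(estimate = tau_hat, std.error = se_hat)

In this toy setup the logit propensity model happens to be correctly specified, so double robustness covers for the misspecified linear outcome models; with real data you would let flexible learners handle both nuisance functions.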

The partially linear model

A common DML application. Suppose the model is:

\[ Y_i = \tau D_i + g(X_i) + \varepsilon_i \]

where \(g(X_i)\) is a potentially complex function of high-dimensional controls. You want \(\tau\) — the treatment effect — but \(g\) is a nuisance function.

The DML approach:

  1. Use ML to estimate \(\hat{\ell}(X) \approx E[Y \mid X]\) and \(\hat{m}(X) \approx E[D \mid X]\) via cross-fitting (note that \(E[Y \mid X] = \tau\, m(X) + g(X)\), so \(\hat{\ell}\) is not an estimate of \(g\) itself)
  2. Compute residuals: \(\tilde{Y}_i = Y_i - \hat{\ell}(X_i)\) and \(\tilde{D}_i = D_i - \hat{m}(X_i)\)
  3. Regress \(\tilde{Y}\) on \(\tilde{D}\) to get \(\hat{\tau}\)

This is Frisch-Waugh-Lovell with ML residualization. Instead of controlling for \(X\) linearly, you partial it out flexibly. The treatment effect \(\tau\) remains a simple, interpretable parameter — only the nuisance part uses ML.
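The same three steps can be run with an actual ML learner. The sketch below is one possible implementation, assuming the randomForest package is installed (any regression learner with a fit/predict interface would work) and using a simulated dataset similar to the one in the interactive demo that follows.

# Partially linear DML with a random forest as the nuisance learner.
# Assumes the randomForest package is installed; any regression learner
# with a predict() method could be swapped in.
library(randomForest)

set.seed(2)
n <- 1000; p <- 5
X <- matrix(rnorm(n * p), n, p)
D <- rbinom(n, 1, plogis(X[, 1] + 0.5 * X[, 2]))
Y <- 2 * D + sin(X[, 1]) + X[, 2]^2 + rnorm(n)       # true tau = 2

K     <- 5
fold  <- sample(rep(1:K, length.out = n))
Y_res <- D_res <- numeric(n)

for (k in 1:K) {
  train <- fold != k
  test  <- fold == k

  # Step 1: flexible nuisance fits on the other folds
  fit_y <- randomForest(x = X[train, ], y = Y[train])
  fit_d <- randomForest(x = X[train, ], y = D[train])  # regression forest on the 0/1 treatment

  # Step 2: out-of-fold residuals
  Y_res[test] <- Y[test] - predict(fit_y, newdata = X[test, ])
  D_res[test] <- D[test] - predict(fit_d, newdata = X[test, ])
}

# Step 3: regress residualized Y on residualized D (FWL with ML)
fit_tau <- lm(Y_res ~ D_res - 1)
summary(fit_tau)$coefficients   # conventional SE; use a robust SE in practice

The interactive demo below repeats this procedure (with lm() standing in as the nuisance learner) over many Monte Carlo replications, so you can compare its sampling distribution against the naive and OLS estimators.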

#| standalone: true
#| viewerHeight: 680

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
    .good { color: #27ae60; font-weight: bold; }
    .bad  { color: #e74c3c; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("n", "Sample size (n):",
                  min = 200, max = 2000, value = 500, step = 200),

      sliderInput("tau", "True treatment effect (\u03C4):",
                  min = 0, max = 5, value = 2, step = 0.5),

      sliderInput("p", "Number of confounders (p):",
                  min = 2, max = 20, value = 5, step = 1),

      actionButton("go", "Run simulation", class = "btn-primary", width = "100%"),

      uiOutput("results")
    ),

    mainPanel(
      width = 9,
      plotOutput("density_plot", height = "500px")
    )
  )
)

server <- function(input, output, session) {

  sim <- reactive({
    input$go                        # re-run only when the button is clicked
    n   <- isolate(input$n)         # isolate() keeps slider changes from
    tau <- isolate(input$tau)       # re-triggering the 200-rep simulation
    p   <- isolate(input$p)
    B   <- 200

    naive_ests <- numeric(B)
    ols_ests   <- numeric(B)
    dml_ests   <- numeric(B)

    for (b in 1:B) {
      # Generate confounders
      X <- matrix(rnorm(n * p), n, p)

      # Propensity and treatment
      e_x <- plogis(X[, 1] + 0.5 * X[, 2])
      D   <- rbinom(n, 1, e_x)

      # Outcome
      Y <- tau * D + sin(X[, 1]) + X[, 2]^2 + rnorm(n)

      # 1. Naive: unadjusted difference in means
      naive_ests[b] <- mean(Y[D == 1]) - mean(Y[D == 0])

      # 2. OLS: lm(Y ~ D + X)
      dat_ols  <- data.frame(Y = Y, D = D, X)
      ols_fit  <- lm(Y ~ D + ., data = dat_ols)
      ols_ests[b] <- coef(ols_fit)["D"]

      # 3. DML-style: residualize Y and D on X with cross-fitting
      #    (lm() stands in for an ML learner here)
      fold <- rep(1:2, length.out = n)
      fold <- sample(fold)

      Y_resid <- numeric(n)
      D_resid <- numeric(n)

      for (k in 1:2) {
        train <- fold != k
        test  <- fold == k

        X_train <- X[train, , drop = FALSE]
        X_test  <- X[test, , drop = FALSE]

        # Fit nuisance models on training fold
        dat_y <- data.frame(Y = Y[train], X_train)
        dat_d <- data.frame(D = D[train], X_train)

        fit_y <- lm(Y ~ ., data = dat_y)
        fit_d <- lm(D ~ ., data = dat_d)

        # Predict on test fold
        newdat <- data.frame(X_test)
        names(newdat) <- names(dat_y)[-1]

        Y_resid[test] <- Y[test] - predict(fit_y, newdata = newdat)
        D_resid[test] <- D[test] - predict(fit_d, newdata = newdat)
      }

      # Regress residualized Y on residualized D
      dml_ests[b] <- coef(lm(Y_resid ~ D_resid - 1))
    }

    list(naive = naive_ests, ols = ols_ests, dml = dml_ests, tau = tau)
  })

  output$density_plot <- renderPlot({
    s <- sim()
    par(mar = c(5, 4.5, 3, 1))

    # Compute densities
    d_naive <- density(s$naive, adjust = 1.2)
    d_ols   <- density(s$ols,   adjust = 1.2)
    d_dml   <- density(s$dml,   adjust = 1.2)

    xlim <- range(c(d_naive$x, d_ols$x, d_dml$x))
    ylim <- c(0, max(c(d_naive$y, d_ols$y, d_dml$y)) * 1.15)

    plot(NULL, xlim = xlim, ylim = ylim,
         xlab = expression("Estimated " * tau),
         ylab = "Density",
         main = "Sampling Distributions (200 Monte Carlo Reps)")

    # Density curves
    polygon(d_naive$x, d_naive$y, col = adjustcolor("#e74c3c", 0.25), border = "#e74c3c", lwd = 2)
    polygon(d_ols$x,   d_ols$y,   col = adjustcolor("#3498db", 0.25), border = "#3498db", lwd = 2)
    polygon(d_dml$x,   d_dml$y,   col = adjustcolor("#27ae60", 0.25), border = "#27ae60", lwd = 2)

    # True tau
    abline(v = s$tau, lty = 2, lwd = 2.5, col = "#2c3e50")
    text(s$tau, ylim[2] * 0.95,
         bquote("True " * tau == .(s$tau)),
         cex = 0.9, font = 2, col = "#2c3e50", pos = 4)

    legend("topright", bty = "n", cex = 0.85,
           legend = c("Naive (diff. in means)", "OLS (with controls)", "DML-style (cross-fitted)"),
           col = c("#e74c3c", "#3498db", "#27ae60"),
           lwd = 2, fill = adjustcolor(c("#e74c3c", "#3498db", "#27ae60"), 0.25),
           border = c("#e74c3c", "#3498db", "#27ae60"))
  })

  output$results <- renderUI({
    s <- sim()
    tau <- s$tau

    mn <- c(mean(s$naive), mean(s$ols), mean(s$dml))
    sd <- c(sd(s$naive), sd(s$ols), sd(s$dml))
    bias <- mn - tau

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>True \u03C4:</b> ", tau, "<br>",
        "<hr style='margin:6px 0'>",
        "<b>Naive:</b> mean = <span class='bad'>", round(mn[1], 3), "</span>",
        " (SD: ", round(sd[1], 3), ", bias: ", round(bias[1], 3), ")<br>",
        "<b>OLS:</b> mean = ", round(mn[2], 3),
        " (SD: ", round(sd[2], 3), ", bias: ", round(bias[2], 3), ")<br>",
        "<b>DML:</b> mean = <span class='good'>", round(mn[3], 3), "</span>",
        " (SD: ", round(sd[3], 3), ", bias: ", round(bias[3], 3), ")"
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • Increase covariates: in this simulation only the first two covariates actually drive treatment and outcome; the rest are noise. The naive estimator's bias therefore does not grow with \(p\), but OLS and DML must adjust for more and more irrelevant controls, and DML's cross-fitting avoids the overfitting bias that can creep in when \(p\) is large relative to \(n\).
  • Set \(\tau = 0\): all estimators should center near zero if unbiased. The naive estimator will still be biased because treatment is confounded.
  • Small \(n\), large \(p\): with \(n = 200\) and \(p = 20\), OLS starts to overfit the controls. The cross-fitted DML approach is more robust because it never evaluates predictions on the same data used to fit the nuisance models.

What DML assumes

DML does not weaken the identification assumptions. You still need the same conditions as IPW and doubly robust estimation:

  • Unconfoundedness (selection on observables): \((Y(0), Y(1)) \perp D \mid X\)
  • Overlap: \(0 < e(X) < 1\) for every relevant \(X\)

What DML relaxes are the estimation assumptions:

  • The nuisance functions \(\mu(X)\) and \(e(X)\) can be nonparametric — estimated by lasso, random forest, boosting, neural network, or any method
  • The nuisance estimators only need to converge faster than \(n^{-1/4}\), not at the parametric \(n^{-1/2}\) rate, a much weaker requirement that many ML methods satisfy
  • No need to correctly specify a parametric model for \(\mu\) or \(e\)

The key distinction. DML separates the causal problem (identification) from the statistical problem (nuisance estimation). Identification comes from the research design — the argument for why treatment is as-good-as-random conditional on \(X\). ML handles the high-dimensional, potentially nonlinear adjustment for \(X\). These are complementary, not substitutes. As noted in Identification vs Estimation: identification comes first, and no estimator — parametric or ML — can fix a failure of identification.

Connecting to the course

DML builds directly on several pages:

  • Doubly robust estimation: the AIPW estimator is the score function that DML uses. DML adds cross-fitting and ML nuisance estimation.
  • IPW: the propensity score \(e(X)\) can now be estimated flexibly instead of by logit.
  • FWL: the partially linear model is FWL with ML residualization.
  • Identification vs Estimation: DML is the clearest example of this separation — ML for estimation, research design for identification.