Doubly Robust Estimation
The idea
Every estimation tool under selection on observables relies on getting a model right:
- Regression adjustment needs the outcome model \(E[Y \mid X, D]\) to be correct.
- IPW needs the propensity score model \(P(D = 1 \mid X)\) to be correct.
What if you’re not sure which one you got right? Doubly robust (DR) estimation combines both models and gives you a consistent estimate if either one is correctly specified.
The AIPW estimator
The most common DR estimator is Augmented Inverse Probability Weighting (AIPW):
\[\hat{\tau}_{DR} = \frac{1}{N} \sum_{i=1}^{N} \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{D_i (Y_i - \hat{\mu}_1(X_i))}{\hat{e}(X_i)} - \frac{(1 - D_i)(Y_i - \hat{\mu}_0(X_i))}{1 - \hat{e}(X_i)} \right]\]
where \(\hat{\mu}_d(X)\) is the estimated outcome under treatment \(d\), and \(\hat{e}(X)\) is the estimated propensity score.
Intuition: start with the regression prediction (\(\hat{\mu}_1 - \hat{\mu}_0\)), then correct it using the IPW-weighted residuals. If the outcome model is perfect, the residuals are zero and the correction vanishes. If the outcome model is wrong but the propensity score is right, the correction fixes the bias.
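In code, AIPW is a one-line combination of the two fitted models. Here is a minimal R sketch on simulated data (the toy DGP and the object names mu1, mu0, and ps are illustrative, not part of the app below):

```r
# Minimal AIPW on simulated data (object names mu1, mu0, ps are illustrative)
set.seed(1)
n <- 2000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(x))                      # treatment depends on x
y <- 1 + x + 2 * d + rnorm(n)                     # true effect = 2

out_fit <- lm(y ~ d + x)                          # outcome model
mu1 <- predict(out_fit, data.frame(d = 1, x = x)) # E[Y | X, D = 1]
mu0 <- predict(out_fit, data.frame(d = 0, x = x)) # E[Y | X, D = 0]
ps  <- fitted(glm(d ~ x, family = binomial))      # propensity score

# Regression prediction plus IPW-weighted residual correction
aipw <- mean(mu1 - mu0 +
             d * (y - mu1) / ps -
             (1 - d) * (y - mu0) / (1 - ps))
aipw
```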
The “doubly robust” property
| Outcome model | Propensity score | DR consistent? |
|---|---|---|
| Correct | Correct | Yes |
| Correct | Wrong | Yes |
| Wrong | Correct | Yes |
| Wrong | Wrong | No |
Three out of four — you get two shots at a consistent estimate. This is the key advantage over methods that rely on a single model.
The DGP
To see DR in action, we need a setting where models can be wrong:
- True outcome: \(Y = 1 + 2X^2 + \tau D + \varepsilon\) (quadratic in \(X\))
- True treatment: \(P(D=1 \mid X) = \Phi(1.5 \cdot X)\) (probit)
- “Correct” outcome model: includes \(X^2\)
- “Wrong” outcome model: linear only (\(X\), no \(X^2\))
- “Correct” PS model: logit on \(X\) (close enough to probit)
- “Wrong” PS model: constant 0.5 (ignores \(X\) entirely)
#| standalone: true
#| viewerHeight: 680
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
.good { color: #27ae60; font-weight: bold; }
.bad { color: #e74c3c; font-weight: bold; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("n", "Sample size:",
min = 500, max = 3000, value = 1000, step = 250),
sliderInput("ate", "True ATE:",
min = 0, max = 5, value = 2, step = 0.5),
selectInput("scenario", "Scenario:",
choices = c(
"Both correct" = "both",
"Only outcome correct" = "outcome_only",
"Only PS correct" = "ps_only",
"Neither correct" = "neither"
)),
actionButton("go", "New draw", class = "btn-primary", width = "100%"),
uiOutput("results")
),
mainPanel(
width = 9,
fluidRow(
column(6, plotOutput("fit_plot", height = "430px")),
column(6, plotOutput("compare_plot", height = "430px"))
)
)
)
)
server <- function(input, output, session) {
dat <- reactive({
input$go
n <- input$n
ate <- input$ate
sc <- input$scenario
# Data generating process
x <- rnorm(n)
# True PS: probit
p_true <- pnorm(1.5 * x)
treat <- rbinom(n, 1, p_true)
# True outcome: quadratic
y <- 1 + 2 * x^2 + ate * treat + rnorm(n)
# --- Outcome models ---
outcome_correct <- sc %in% c("both", "outcome_only")
if (outcome_correct) {
x2 <- x^2
fit1 <- lm(y ~ treat + x + x2)
mu1 <- predict(fit1, newdata = data.frame(treat = 1, x = x, x2 = x^2))
mu0 <- predict(fit1, newdata = data.frame(treat = 0, x = x, x2 = x^2))
} else {
fit1 <- lm(y ~ treat + x)
mu1 <- predict(fit1, newdata = data.frame(treat = 1, x = x))
mu0 <- predict(fit1, newdata = data.frame(treat = 0, x = x))
}
# Regression adjustment estimate
reg_est <- mean(mu1 - mu0)
# --- PS models ---
ps_correct <- sc %in% c("both", "ps_only")
if (ps_correct) {
ps <- fitted(glm(treat ~ x, family = binomial))
} else {
ps <- rep(0.5, n)
}
# IPW estimate
w <- ifelse(treat == 1, 1 / ps, 1 / (1 - ps))
ipw_est <- weighted.mean(y[treat == 1], w[treat == 1]) -
weighted.mean(y[treat == 0], w[treat == 0])
# Doubly robust (AIPW)
dr_est <- mean(mu1 - mu0 +
treat * (y - mu1) / ps -
(1 - treat) * (y - mu0) / (1 - ps))
# Naive
naive <- mean(y[treat == 1]) - mean(y[treat == 0])
# For plotting: outcome model fit curves
x_seq <- seq(min(x), max(x), length.out = 200)
if (outcome_correct) {
pred_t <- predict(fit1, newdata = data.frame(treat = 1, x = x_seq, x2 = x_seq^2))
pred_c <- predict(fit1, newdata = data.frame(treat = 0, x = x_seq, x2 = x_seq^2))
} else {
pred_t <- predict(fit1, newdata = data.frame(treat = 1, x = x_seq))
pred_c <- predict(fit1, newdata = data.frame(treat = 0, x = x_seq))
}
true_t <- 1 + 2 * x_seq^2 + ate
true_c <- 1 + 2 * x_seq^2
list(x = x, y = y, treat = treat,
x_seq = x_seq, pred_t = pred_t, pred_c = pred_c,
true_t = true_t, true_c = true_c,
naive = naive, reg_est = reg_est, ipw_est = ipw_est, dr_est = dr_est,
ate = ate, sc = sc,
outcome_correct = outcome_correct, ps_correct = ps_correct)
})
output$fit_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
plot(d$x, d$y, pch = 16, cex = 0.3,
col = ifelse(d$treat == 1, "#3498db30", "#e74c3c30"),
xlab = "X (confounder)", ylab = "Y (outcome)",
main = "Outcome Model Fit vs Truth")
# Fitted curves (solid)
lines(d$x_seq, d$pred_t, col = "#3498db", lwd = 2.5)
lines(d$x_seq, d$pred_c, col = "#e74c3c", lwd = 2.5)
# True curves (dashed)
lines(d$x_seq, d$true_t, col = "#3498db", lwd = 1.5, lty = 3)
lines(d$x_seq, d$true_c, col = "#e74c3c", lwd = 1.5, lty = 3)
outcome_label <- if (d$outcome_correct) "Correct (quadratic)" else "Wrong (linear)"
ps_label <- if (d$ps_correct) "Correct (logit)" else "Wrong (constant)"
legend("topleft", bty = "n", cex = 0.75,
legend = c("Treated data", "Control data",
"Model fit (solid)", "True E[Y|X,D] (dotted)",
paste0("Outcome: ", outcome_label),
paste0("PS: ", ps_label)),
col = c("#3498db", "#e74c3c", "gray40", "gray40", NA, NA),
pch = c(16, 16, NA, NA, NA, NA),
lwd = c(NA, NA, 2.5, 1.5, NA, NA),
lty = c(NA, NA, 1, 3, NA, NA))
})
output$compare_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 9, 3, 2))
ests <- c(d$dr_est, d$ipw_est, d$reg_est, d$naive)
labels <- c("Doubly Robust", "IPW", "Regression adj.", "Naive")
cols <- c("#9b59b6", "#27ae60", "#3498db", "#e74c3c")
xlim <- range(c(ests, d$ate))
pad <- max(diff(xlim) * 0.4, 0.5)
xlim <- xlim + c(-pad, pad)
plot(ests, 1:4, pch = 19, cex = 2, col = cols,
xlim = xlim, ylim = c(0.5, 4.5),
yaxt = "n", xlab = "Estimated treatment effect",
ylab = "", main = "Estimator Comparison")
axis(2, at = 1:4, labels = labels, las = 1, cex.axis = 0.9)
abline(v = d$ate, lty = 2, col = "#2c3e50", lwd = 2)
text(d$ate, 4.45, paste0("True ATE = ", d$ate),
cex = 0.85, font = 2, col = "#2c3e50")
segments(d$ate, 1:4, ests, 1:4, col = cols, lwd = 2, lty = 2)
})
output$results <- renderUI({
d <- dat()
b_naive <- d$naive - d$ate
b_reg <- d$reg_est - d$ate
b_ipw <- d$ipw_est - d$ate
b_dr <- d$dr_est - d$ate
scenario_label <- switch(d$sc,
both = "Both models correct",
outcome_only = "Only outcome model correct",
ps_only = "Only propensity score correct",
neither = "Neither model correct")
dr_ok <- abs(b_dr) < abs(b_naive) * 0.5
tags$div(class = "stats-box",
HTML(paste0(
"<b>Scenario:</b> ", scenario_label, "<br>",
"<b>True ATE:</b> ", d$ate, "<br>",
"<hr style='margin:6px 0'>",
"<b>Naive:</b> <span class='bad'>", round(d$naive, 3), "</span>",
" (bias: ", round(b_naive, 3), ")<br>",
"<b>Reg. adj:</b> ", round(d$reg_est, 3),
" (bias: ", round(b_reg, 3), ")<br>",
"<b>IPW:</b> ", round(d$ipw_est, 3),
" (bias: ", round(b_ipw, 3), ")<br>",
"<b>DR:</b> <span class='", ifelse(dr_ok, "good", "bad"), "'>",
round(d$dr_est, 3), "</span>",
" (bias: ", round(b_dr, 3), ")"
))
)
})
}
shinyApp(ui, server)
Things to try
Walk through all four scenarios:
- Both correct: all three adjusted estimators (regression, IPW, DR) are close to the truth. The left plot shows the fitted curves matching the true curves. DR works.
- Only outcome correct: the PS is constant (ignores X), so IPW is just the naive difference — biased. But regression gets the outcome right, and DR inherits that. DR works.
- Only PS correct: the outcome model fits a line when the truth is quadratic — you can see the misfit on the left plot. Regression is biased. But the PS is right, so the IPW correction rescues DR. DR works.
- Neither correct: both models are wrong. The left plot shows a bad fit, and the PS can’t correct it. DR fails — garbage in, garbage out.
Key insight: DR doesn’t require both models to be right. It requires at least one to be right. That’s the “double robustness” — two chances instead of one.
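To check the 2×2 table beyond a single draw, here is a small Monte Carlo sketch that reuses the app's DGP; the function one_draw and the other names are illustrative:

```r
# Monte Carlo sketch: average bias by scenario (names are illustrative)
set.seed(42)
ate <- 2

one_draw <- function(n = 1000, outcome_correct = TRUE, ps_correct = TRUE) {
  x <- rnorm(n)
  d <- rbinom(n, 1, pnorm(1.5 * x))              # true PS: probit
  y <- 1 + 2 * x^2 + ate * d + rnorm(n)          # true outcome: quadratic

  # Outcome model (correct: quadratic; wrong: linear)
  fit <- if (outcome_correct) lm(y ~ d + x + I(x^2)) else lm(y ~ d + x)
  mu1 <- predict(fit, data.frame(d = 1, x = x))
  mu0 <- predict(fit, data.frame(d = 0, x = x))

  # PS model (correct: logit on x; wrong: constant 0.5)
  ps <- if (ps_correct) fitted(glm(d ~ x, family = binomial)) else rep(0.5, n)

  c(reg = mean(mu1 - mu0),
    ipw = weighted.mean(y[d == 1], 1 / ps[d == 1]) -
          weighted.mean(y[d == 0], 1 / (1 - ps[d == 0])),
    dr  = mean(mu1 - mu0 + d * (y - mu1) / ps - (1 - d) * (y - mu0) / (1 - ps)))
}

scenarios <- list(both = c(TRUE, TRUE),   outcome_only = c(TRUE, FALSE),
                  ps_only = c(FALSE, TRUE), neither = c(FALSE, FALSE))
bias <- sapply(scenarios, function(s) {
  est <- replicate(200, one_draw(outcome_correct = s[1], ps_correct = s[2]))
  rowMeans(est) - ate
})
round(bias, 3)   # DR row should sit near 0 except under "neither"
```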
Why not always use DR?
If DR is better than regression alone and IPW alone, why not always use it?
- You still need at least one model right. DR isn’t magic — it fails when both models are misspecified. In the “neither correct” scenario above, DR is just as biased as the others.
- Variance. DR can have higher variance than a correctly specified single model. The extra robustness comes at the cost of efficiency. In the “both correct” scenario, regression alone is often more precise (see the sketch after this list).
- Complexity. Two models to specify, diagnose, and justify instead of one.
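On the variance point, a quick check that reuses one_draw from the Monte Carlo sketch above (run that block first):

```r
# Simulation spread under "both correct" (reuses one_draw from the sketch above)
est <- replicate(500, one_draw(outcome_correct = TRUE, ps_correct = TRUE))
round(apply(est, 1, sd), 3)   # compare the Monte Carlo SDs of reg, ipw, and dr
```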
In practice, DR is most valuable when you’re uncertain about your models — which is most of the time.
In Stata
* Augmented IPW (doubly robust)
teffects aipw (outcome x1 x2) (treatment x1 x2)
* Check overlap
teffects overlap
* Doubly robust DID (Sant'Anna & Zhao 2020)
* ssc install drdid
drdid outcome x1 x2, ivar(id) time(year) treatment(treated)
teffects aipw specifies two models in one command: the outcome model (first parentheses) and the treatment model (second parentheses). You get double robustness — consistent if either model is correct.
Did you know?
The doubly robust property was established by Scharfstein, Rotnitzky & Robins (1999) and Bang & Robins (2005). The key insight is that AIPW belongs to a class of semiparametrically efficient estimators — it achieves the lowest possible variance among regular estimators when both models are correct, and remains consistent when either is correct.
Doubly robust DID (DR-DID) extends the idea to difference-in-differences. Sant’Anna & Zhao (2020) showed how to combine outcome regression and propensity score weighting in the DID setting, getting double robustness for treatment effect estimation under parallel trends. This is now standard in the staggered DID literature.
The DR estimator is connected to targeted learning (van der Laan & Rose, 2011) and debiased machine learning (Chernozhukov et al., 2018). These frameworks use machine learning for the nuisance models (\(\hat{\mu}\) and \(\hat{e}\)) while maintaining valid inference for the causal parameter — the DR structure is what makes this possible.
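The cross-fitting recipe at the heart of these frameworks fits the nuisance models on one fold and evaluates the AIPW terms on the other. Below is a minimal sketch under that idea, using natural splines from base R's splines package as a stand-in for the machine-learning learners; all names, the DGP, and tuning choices are illustrative, not a reference implementation:

```r
library(splines)   # ns() spline basis, standing in for a flexible ML learner

set.seed(7)
n <- 2000
x <- rnorm(n)
d <- rbinom(n, 1, pnorm(1.5 * x))
y <- 1 + 2 * x^2 + 2 * d + rnorm(n)
dat <- data.frame(y, d, x)

# 2-fold cross-fitting: nuisances fit on one fold, AIPW terms evaluated on the other
folds <- sample(rep(1:2, length.out = n))
psi <- numeric(n)
for (k in 1:2) {
  train <- dat[folds != k, ]
  test  <- dat[folds == k, ]
  out_fit <- lm(y ~ d * ns(x, 5), data = train)
  ps_fit  <- glm(d ~ ns(x, 5), family = binomial, data = train)
  mu1 <- predict(out_fit, transform(test, d = 1))
  mu0 <- predict(out_fit, transform(test, d = 0))
  ps  <- predict(ps_fit, test, type = "response")
  ps  <- pmin(pmax(ps, 0.01), 0.99)               # trim extreme scores (common in practice)
  psi[folds == k] <- mu1 - mu0 +
    test$d * (test$y - mu1) / ps -
    (1 - test$d) * (test$y - mu0) / (1 - ps)
}
c(estimate = mean(psi), se = sd(psi) / sqrt(n))   # SE from the influence-function form
```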