Doubly Robust Estimation
The idea
Every estimation tool under selection on observables relies on getting a model right:
- Regression adjustment needs the outcome model \(E[Y \mid X, D]\) to be correct.
- IPW needs the propensity score model \(P(D = 1 \mid X)\) to be correct.
What if you’re not sure which one you got right? Doubly robust (DR) estimation combines both models and gives you a consistent estimate if either one is correctly specified.
The AIPW estimator
The most common DR estimator is Augmented Inverse Probability Weighting (AIPW):
\[\hat{\tau}_{DR} = \frac{1}{N} \sum_{i=1}^{N} \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{D_i (Y_i - \hat{\mu}_1(X_i))}{\hat{e}(X_i)} - \frac{(1 - D_i)(Y_i - \hat{\mu}_0(X_i))}{1 - \hat{e}(X_i)} \right]\]
where \(\hat{\mu}_d(X)\) is the estimated outcome under treatment \(d\), and \(\hat{e}(X)\) is the estimated propensity score.
Intuition: start with the regression prediction (\(\hat{\mu}_1 - \hat{\mu}_0\)), then correct it using the IPW-weighted residuals. If the outcome model is perfect, the residuals are zero and the correction vanishes. If the outcome model is wrong but the propensity score is right, the correction fixes the bias.
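In code, AIPW is a one-line combination of the two fitted models. Here is a minimal R sketch on simulated data (the toy DGP and the object names mu1, mu0, and ps are illustrative, not part of the app below):

```r
# Minimal AIPW on simulated data (object names mu1, mu0, ps are illustrative)
set.seed(1)
n <- 2000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(x))                      # treatment depends on x
y <- 1 + x + 2 * d + rnorm(n)                     # true effect = 2

out_fit <- lm(y ~ d + x)                          # outcome model
mu1 <- predict(out_fit, data.frame(d = 1, x = x)) # E[Y | X, D = 1]
mu0 <- predict(out_fit, data.frame(d = 0, x = x)) # E[Y | X, D = 0]
ps  <- fitted(glm(d ~ x, family = binomial))      # propensity score

# Regression prediction plus IPW-weighted residual correction
aipw <- mean(mu1 - mu0 +
             d * (y - mu1) / ps -
             (1 - d) * (y - mu0) / (1 - ps))
aipw
```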
The “doubly robust” property
| Outcome model | Propensity score | DR consistent? |
|---|---|---|
| Correct | Correct | Yes |
| Correct | Wrong | Yes |
| Wrong | Correct | Yes |
| Wrong | Wrong | No |
Three out of four — you get two shots at a consistent estimate. This is the key advantage over methods that rely on a single model.
The DGP
To see DR in action, we need a setting where models can be wrong:
- True outcome: \(Y = 1 + 2X^2 + \tau D + \varepsilon\) (quadratic in \(X\))
- True treatment: \(P(D=1 \mid X) = \Phi(1.5 \cdot X)\) (probit)
- “Correct” outcome model: includes \(X^2\)
- “Wrong” outcome model: linear only (\(X\), no \(X^2\))
- “Correct” PS model: logit on \(X\) (close enough to probit)
- “Wrong” PS model: constant 0.5 (ignores \(X\) entirely)
#| standalone: true
#| viewerHeight: 680
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
.good { color: #27ae60; font-weight: bold; }
.bad { color: #e74c3c; font-weight: bold; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("n", "Sample size:",
min = 500, max = 3000, value = 1000, step = 250),
sliderInput("ate", "True ATE:",
min = 0, max = 5, value = 2, step = 0.5),
selectInput("scenario", "Scenario:",
choices = c(
"Both correct" = "both",
"Only outcome correct" = "outcome_only",
"Only PS correct" = "ps_only",
"Neither correct" = "neither"
)),
actionButton("go", "New draw", class = "btn-primary", width = "100%"),
uiOutput("results")
),
mainPanel(
width = 9,
fluidRow(
column(6, plotOutput("fit_plot", height = "430px")),
column(6, plotOutput("compare_plot", height = "430px"))
)
)
)
)
server <- function(input, output, session) {
dat <- reactive({
input$go
n <- input$n
ate <- input$ate
sc <- input$scenario
# Data generating process
x <- rnorm(n)
# True PS: probit
p_true <- pnorm(1.5 * x)
treat <- rbinom(n, 1, p_true)
# True outcome: quadratic
y <- 1 + 2 * x^2 + ate * treat + rnorm(n)
# --- Outcome models ---
outcome_correct <- sc %in% c("both", "outcome_only")
if (outcome_correct) {
x2 <- x^2
fit1 <- lm(y ~ treat + x + x2)
mu1 <- predict(fit1, newdata = data.frame(treat = 1, x = x, x2 = x^2))
mu0 <- predict(fit1, newdata = data.frame(treat = 0, x = x, x2 = x^2))
} else {
fit1 <- lm(y ~ treat + x)
mu1 <- predict(fit1, newdata = data.frame(treat = 1, x = x))
mu0 <- predict(fit1, newdata = data.frame(treat = 0, x = x))
}
# Regression adjustment estimate
reg_est <- mean(mu1 - mu0)
# --- PS models ---
ps_correct <- sc %in% c("both", "ps_only")
if (ps_correct) {
ps <- fitted(glm(treat ~ x, family = binomial))
} else {
ps <- rep(0.5, n)
}
# IPW estimate
w <- ifelse(treat == 1, 1 / ps, 1 / (1 - ps))
ipw_est <- weighted.mean(y[treat == 1], w[treat == 1]) -
weighted.mean(y[treat == 0], w[treat == 0])
# Doubly robust (AIPW)
dr_est <- mean(mu1 - mu0 +
treat * (y - mu1) / ps -
(1 - treat) * (y - mu0) / (1 - ps))
# Naive
naive <- mean(y[treat == 1]) - mean(y[treat == 0])
# For plotting: outcome model fit curves
x_seq <- seq(min(x), max(x), length.out = 200)
if (outcome_correct) {
pred_t <- predict(fit1, newdata = data.frame(treat = 1, x = x_seq, x2 = x_seq^2))
pred_c <- predict(fit1, newdata = data.frame(treat = 0, x = x_seq, x2 = x_seq^2))
} else {
pred_t <- predict(fit1, newdata = data.frame(treat = 1, x = x_seq))
pred_c <- predict(fit1, newdata = data.frame(treat = 0, x = x_seq))
}
true_t <- 1 + 2 * x_seq^2 + ate
true_c <- 1 + 2 * x_seq^2
list(x = x, y = y, treat = treat,
x_seq = x_seq, pred_t = pred_t, pred_c = pred_c,
true_t = true_t, true_c = true_c,
naive = naive, reg_est = reg_est, ipw_est = ipw_est, dr_est = dr_est,
ate = ate, sc = sc,
outcome_correct = outcome_correct, ps_correct = ps_correct)
})
output$fit_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
plot(d$x, d$y, pch = 16, cex = 0.3,
col = ifelse(d$treat == 1, "#3498db30", "#e74c3c30"),
xlab = "X (confounder)", ylab = "Y (outcome)",
main = "Outcome Model Fit vs Truth")
# Fitted curves (solid)
lines(d$x_seq, d$pred_t, col = "#3498db", lwd = 2.5)
lines(d$x_seq, d$pred_c, col = "#e74c3c", lwd = 2.5)
# True curves (dashed)
lines(d$x_seq, d$true_t, col = "#3498db", lwd = 1.5, lty = 3)
lines(d$x_seq, d$true_c, col = "#e74c3c", lwd = 1.5, lty = 3)
outcome_label <- if (d$outcome_correct) "Correct (quadratic)" else "Wrong (linear)"
ps_label <- if (d$ps_correct) "Correct (logit)" else "Wrong (constant)"
legend("topleft", bty = "n", cex = 0.75,
legend = c("Treated data", "Control data",
"Model fit (solid)", "True E[Y|X,D] (dotted)",
paste0("Outcome: ", outcome_label),
paste0("PS: ", ps_label)),
col = c("#3498db", "#e74c3c", "gray40", "gray40", NA, NA),
pch = c(16, 16, NA, NA, NA, NA),
lwd = c(NA, NA, 2.5, 1.5, NA, NA),
lty = c(NA, NA, 1, 3, NA, NA))
})
output$compare_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 9, 3, 2))
ests <- c(d$dr_est, d$ipw_est, d$reg_est, d$naive)
labels <- c("Doubly Robust", "IPW", "Regression adj.", "Naive")
cols <- c("#9b59b6", "#27ae60", "#3498db", "#e74c3c")
xlim <- range(c(ests, d$ate))
pad <- max(diff(xlim) * 0.4, 0.5)
xlim <- xlim + c(-pad, pad)
plot(ests, 1:4, pch = 19, cex = 2, col = cols,
xlim = xlim, ylim = c(0.5, 4.5),
yaxt = "n", xlab = "Estimated treatment effect",
ylab = "", main = "Estimator Comparison")
axis(2, at = 1:4, labels = labels, las = 1, cex.axis = 0.9)
abline(v = d$ate, lty = 2, col = "#2c3e50", lwd = 2)
text(d$ate, 4.45, paste0("True ATE = ", d$ate),
cex = 0.85, font = 2, col = "#2c3e50")
segments(d$ate, 1:4, ests, 1:4, col = cols, lwd = 2, lty = 2)
})
output$results <- renderUI({
d <- dat()
b_naive <- d$naive - d$ate
b_reg <- d$reg_est - d$ate
b_ipw <- d$ipw_est - d$ate
b_dr <- d$dr_est - d$ate
scenario_label <- switch(d$sc,
both = "Both models correct",
outcome_only = "Only outcome model correct",
ps_only = "Only propensity score correct",
neither = "Neither model correct")
dr_ok <- abs(b_dr) < abs(b_naive) * 0.5
tags$div(class = "stats-box",
HTML(paste0(
"<b>Scenario:</b> ", scenario_label, "<br>",
"<b>True ATE:</b> ", d$ate, "<br>",
"<hr style='margin:6px 0'>",
"<b>Naive:</b> <span class='bad'>", round(d$naive, 3), "</span>",
" (bias: ", round(b_naive, 3), ")<br>",
"<b>Reg. adj:</b> ", round(d$reg_est, 3),
" (bias: ", round(b_reg, 3), ")<br>",
"<b>IPW:</b> ", round(d$ipw_est, 3),
" (bias: ", round(b_ipw, 3), ")<br>",
"<b>DR:</b> <span class='", ifelse(dr_ok, "good", "bad"), "'>",
round(d$dr_est, 3), "</span>",
" (bias: ", round(b_dr, 3), ")"
))
)
})
}
shinyApp(ui, server)
Things to try
Walk through all four scenarios:
- Both correct: all three adjusted estimators (regression, IPW, DR) are close to the truth. The left plot shows the fitted curves matching the true curves. DR works.
- Only outcome correct: the PS is constant (ignores X), so IPW is just the naive difference — biased. But regression gets the outcome right, and DR inherits that. DR works.
- Only PS correct: the outcome model fits a line when the truth is quadratic — you can see the misfit on the left plot. Regression is biased. But the PS is right, so the IPW correction rescues DR. DR works.
- Neither correct: both models are wrong. The left plot shows a bad fit, and the PS can’t correct it. DR fails — garbage in, garbage out.
Key insight: DR doesn’t require both models to be right. It requires at least one to be right. That’s the “double robustness” — two chances instead of one.
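To check the 2×2 table beyond a single draw, here is a small Monte Carlo sketch that reuses the app's DGP; the function one_draw and the other names are illustrative:

```r
# Monte Carlo sketch: average bias by scenario (names are illustrative)
set.seed(42)
ate <- 2

one_draw <- function(n = 1000, outcome_correct = TRUE, ps_correct = TRUE) {
  x <- rnorm(n)
  d <- rbinom(n, 1, pnorm(1.5 * x))              # true PS: probit
  y <- 1 + 2 * x^2 + ate * d + rnorm(n)          # true outcome: quadratic

  # Outcome model (correct: quadratic; wrong: linear)
  fit <- if (outcome_correct) lm(y ~ d + x + I(x^2)) else lm(y ~ d + x)
  mu1 <- predict(fit, data.frame(d = 1, x = x))
  mu0 <- predict(fit, data.frame(d = 0, x = x))

  # PS model (correct: logit on x; wrong: constant 0.5)
  ps <- if (ps_correct) fitted(glm(d ~ x, family = binomial)) else rep(0.5, n)

  c(reg = mean(mu1 - mu0),
    ipw = weighted.mean(y[d == 1], 1 / ps[d == 1]) -
          weighted.mean(y[d == 0], 1 / (1 - ps[d == 0])),
    dr  = mean(mu1 - mu0 + d * (y - mu1) / ps - (1 - d) * (y - mu0) / (1 - ps)))
}

scenarios <- list(both = c(TRUE, TRUE),   outcome_only = c(TRUE, FALSE),
                  ps_only = c(FALSE, TRUE), neither = c(FALSE, FALSE))
bias <- sapply(scenarios, function(s) {
  est <- replicate(200, one_draw(outcome_correct = s[1], ps_correct = s[2]))
  rowMeans(est) - ate
})
round(bias, 3)   # DR row should sit near 0 except under "neither"
```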
Why not always use DR?
If DR is better than regression alone and IPW alone, why not always use it?
- You still need at least one model right. DR isn’t magic — it fails when both models are misspecified. In the “neither correct” scenario above, DR is just as biased as the others.
- Variance. DR can have higher variance than a correctly specified single model. The extra robustness comes at the cost of efficiency. In the “both correct” scenario, regression alone is often more precise (see the sketch after this list).
- Complexity. Two models to specify, diagnose, and justify instead of one.
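On the variance point, a quick check that reuses one_draw from the Monte Carlo sketch above (run that block first):

```r
# Simulation spread under "both correct" (reuses one_draw from the sketch above)
est <- replicate(500, one_draw(outcome_correct = TRUE, ps_correct = TRUE))
round(apply(est, 1, sd), 3)   # compare the Monte Carlo SDs of reg, ipw, and dr
```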
In practice, DR is most valuable when you’re uncertain about your models — which is most of the time.
In Stata
* Augmented IPW (doubly robust)
teffects aipw (outcome x1 x2) (treatment x1 x2)
* Check overlap
teffects overlap
* Doubly robust DID (Sant'Anna & Zhao 2020)
* ssc install drdid
drdid outcome x1 x2, ivar(id) time(year) treatment(treated)
teffects aipw specifies two models in one command: the outcome model (first parentheses) and the treatment model (second parentheses). You get double robustness — consistent if either model is correct.
Did you know?
The doubly robust property was established by Scharfstein, Rotnitzky & Robins (1999) and Bang & Robins (2005). The key insight is that AIPW belongs to a class of semiparametrically efficient estimators — it achieves the lowest possible variance among regular estimators when both models are correct, and remains consistent when either is correct.
Doubly robust DID (DR-DID) extends the idea to difference-in-differences. Sant’Anna & Zhao (2020) showed how to combine outcome regression and propensity score weighting in the DID setting, getting double robustness for treatment effect estimation under parallel trends. This is now standard in the staggered DID literature.
The DR estimator is connected to targeted learning (van der Laan & Rose, 2011) and debiased machine learning (Chernozhukov et al., 2018). These frameworks use machine learning for the nuisance models (\(\hat{\mu}\) and \(\hat{e}\)) while maintaining valid inference for the causal parameter — the DR structure is what makes this possible.
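The cross-fitting recipe at the heart of these frameworks fits the nuisance models on one fold and evaluates the AIPW terms on the other. Below is a minimal sketch under that idea, using natural splines from base R's splines package as a stand-in for the machine-learning learners; all names, the DGP, and tuning choices are illustrative, not a reference implementation:

```r
library(splines)   # ns() spline basis, standing in for a flexible ML learner

set.seed(7)
n <- 2000
x <- rnorm(n)
d <- rbinom(n, 1, pnorm(1.5 * x))
y <- 1 + 2 * x^2 + 2 * d + rnorm(n)
dat <- data.frame(y, d, x)

# 2-fold cross-fitting: nuisances fit on one fold, AIPW terms evaluated on the other
folds <- sample(rep(1:2, length.out = n))
psi <- numeric(n)
for (k in 1:2) {
  train <- dat[folds != k, ]
  test  <- dat[folds == k, ]
  out_fit <- lm(y ~ d * ns(x, 5), data = train)
  ps_fit  <- glm(d ~ ns(x, 5), family = binomial, data = train)
  mu1 <- predict(out_fit, transform(test, d = 1))
  mu0 <- predict(out_fit, transform(test, d = 0))
  ps  <- predict(ps_fit, test, type = "response")
  ps  <- pmin(pmax(ps, 0.01), 0.99)               # trim extreme scores (common in practice)
  psi[folds == k] <- mu1 - mu0 +
    test$d * (test$y - mu1) / ps -
    (1 - test$d) * (test$y - mu0) / (1 - ps)
}
c(estimate = mean(psi), se = sd(psi) / sqrt(n))   # SE from the influence-function form
```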