Causal DAGs and the choice of controls
One of the most common questions in applied work is: which variables should I control for, and which should I leave out? Include too few and your estimate is confounded. Include too many and you can introduce a different bias that’s harder to see. Causal DAGs (directed acyclic graphs) give you a clean visual framework for answering this question without invoking heavy machinery.
This page is the practical version. We’re not going to develop the do-calculus. We’re going to draw graphs, name the four roles a covariate can play, and state one rule (the backdoor criterion) that tells you what to do with each.
The setup: variables and arrows
A causal DAG is a picture of the data-generating process. Each variable is a node. Each arrow is a direct causal effect: \(A \to B\) means \(A\) is a cause of \(B\), with no other variable in between.
The graph is directed (arrows point from cause to effect) and acyclic (no variable can cause itself, either directly or through a chain of other variables, so following the arrows never leads back to where you started). That’s the entire definition.
What the graph does not tell you is the size or the sign of any effect — only the structure of causal relationships and the direction in which causation runs. Whether \(A \to B\) is a positive or negative effect, and how large it is, comes from the data, not the graph.
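As a minimal sketch of the definition, a DAG can be stored as an adjacency matrix and checked for cycles by repeatedly removing nodes that have no incoming arrows. The node names (`W`, `D`, `Y`) and the helper `is_acyclic` are illustrative assumptions, not part of any package.

```r
# Adjacency matrix: A[i, j] = 1 means there is an arrow i -> j.
nodes <- c("W", "D", "Y")
A <- matrix(0, nrow = 3, ncol = 3, dimnames = list(nodes, nodes))
A["W", "D"] <- 1
A["W", "Y"] <- 1
A["D", "Y"] <- 1

# Hypothetical helper: TRUE if the directed graph contains no cycle.
# A node with no incoming arrows cannot be part of a cycle, so we peel
# such nodes off until nothing is left (acyclic) or we get stuck (cycle).
is_acyclic <- function(A) {
  while (nrow(A) > 0) {
    roots <- which(colSums(A) == 0)
    if (length(roots) == 0) return(FALSE)
    A <- A[-roots, -roots, drop = FALSE]
  }
  TRUE
}

is_acyclic(A)        # W -> D -> Y with W -> Y: a valid DAG

A_loop <- A
A_loop["Y", "W"] <- 1  # adding Y -> W closes the loop W -> D -> Y -> W
is_acyclic(A_loop)
```

The matrix records only which arrows exist. It carries no information about the sign or size of any effect, which is the point made above.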
The four roles
Once you’ve drawn the DAG with your treatment \(D\) and outcome \(Y\), most other variables you’ll be tempted to control for play one of four roles. The right thing to do with each is different.
Confounder (\(W\)): a common cause of \(D\) and \(Y\)
\[ W \;\to\; D, \qquad W \;\to\; Y. \]
A variable that causes both treatment and outcome generates a spurious association between them. Larger cities (\(W\)) tend to have both more crime (\(D\)) and higher housing prices (\(Y\)), so crime and prices move together across cities even if crime does nothing to raise prices. Unless you control for \(W\), the OLS coefficient on \(D\) is biased.
Rule: confounders must be in your regression. That single move — controlling for the common cause — is the essence of “selection on observables.”
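A small simulation makes the rule concrete. This is a sketch under assumed coefficients (0.8 for \(W \to D\), 1.0 for \(W \to Y\), and a true effect of 1); none of these numbers come from real data.

```r
set.seed(1)
n   <- 5000
tau <- 1                              # assumed true effect of D on Y

W <- rnorm(n)                         # confounder
D <- 0.8 * W + rnorm(n)               # W -> D
Y <- tau * D + 1.0 * W + rnorm(n)     # D -> Y and W -> Y

naive    <- coef(lm(Y ~ D))["D"]      # omits the common cause
adjusted <- coef(lm(Y ~ D + W))["D"]  # controls for the common cause

round(c(naive = unname(naive), adjusted = unname(adjusted)), 2)
```

With these coefficients the uncontrolled slope converges to roughly 1.49, because the omitted-variable bias is \(\text{Cov}(D, W)/\text{Var}(D) = 0.8/1.64 \approx 0.49\). Adding \(W\) brings the estimate back to about 1.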
Mediator (\(M\)): on the causal pathway from \(D\) to \(Y\)
\[ D \;\to\; M \;\to\; Y. \]
A mediator transmits some of the effect of \(D\) on \(Y\). Education (\(D\)) increases earnings (\(Y\)) partly through occupation (\(M\)): more education opens up higher-paying jobs.
Rule: if you want the total effect of \(D\) on \(Y\), don’t control for the mediator. Controlling for \(M\) blocks part of the very effect you’re trying to measure, giving you a “direct effect” instead. If you really want the direct effect, control. If you don’t, leave it out.
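The difference between the total and the direct effect is easy to see in simulated data. The sketch below assumes an unconfounded treatment and made-up path coefficients, so that the total effect is \(0.58 + 0.7 \times 0.6 = 1\).

```r
set.seed(2)
n <- 5000

D <- rnorm(n)                        # treatment (unconfounded here, for simplicity)
M <- 0.7 * D + rnorm(n)              # D -> M
Y <- 0.58 * D + 0.6 * M + rnorm(n)   # direct effect 0.58, indirect 0.7 * 0.6 = 0.42

total_est  <- coef(lm(Y ~ D))["D"]      # leaves the mediator out
direct_est <- coef(lm(Y ~ D + M))["D"]  # holds the mediator fixed

round(c(total = unname(total_est), direct = unname(direct_est)), 2)
```

Neither number is wrong. They answer different questions, and only the first one is the total effect of \(D\) on \(Y\).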
Collider (\(C\)): a common effect of \(D\) and \(Y\)
\[ D \;\to\; C, \qquad Y \;\to\; C. \]
This is the dangerous one. A collider is a variable that both the treatment and the outcome cause. In the raw data \(D\) and \(Y\) are uncorrelated through \(C\) (the arrows point into \(C\), so the path doesn’t transmit association). But the instant you condition on \(C\) — by sampling on it, controlling for it in a regression, or stratifying — you open the path. A spurious correlation appears between \(D\) and \(Y\) that wasn’t there before.
This is collider bias, also called Berkson’s paradox in the epidemiological literature. The classic example: two diseases are independent in the population, but among hospitalized patients (a population conditioned on hospitalization, which both diseases cause), the diseases look negatively correlated. Seeing one disease in a hospitalized patient lowers the conditional probability of the other.
Rule: never control for a collider. This is the rule most often broken in applied work, and the bias it creates can be larger than the original confounding you were trying to fix.
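Both ways of conditioning on a collider (controlling for it and selecting on it) can be sketched in a few lines. The data-generating process below is an assumption chosen for illustration: \(D\) and \(Y\) are independent by construction, and \(C\) is their sum plus noise.

```r
set.seed(3)
n <- 5000

D <- rnorm(n)
Y <- rnorm(n)              # D has no effect on Y at all
C <- D + Y + rnorm(n)      # both D and Y point into C

# Controlling for the collider in a regression
b_raw <- coef(lm(Y ~ D))["D"]
b_col <- coef(lm(Y ~ D + C))["D"]

# Selecting on the collider (the "hospitalized patients" version)
selected <- C > 1
r_all <- cor(D, Y)
r_sel <- cor(D[selected], Y[selected])

round(c(b_raw = unname(b_raw), b_col = unname(b_col),
        r_all = r_all, r_sel = r_sel), 2)
```

In this setup the coefficient on \(D\) moves from about 0 to about \(-0.5\) once \(C\) enters the regression, and the correlation in the selected subsample turns clearly negative, even though nothing connects \(D\) and \(Y\) causally.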
Instrument (\(Z\)): a cause of \(D\) that affects \(Y\) only through \(D\)
\[ Z \;\to\; D, \qquad Z \;\not\to\; Y \text{ except through } D. \]
An instrument doesn’t go in your OLS controls list; it goes into a different estimator entirely (IV, 2SLS, GMM). Instruments are the basis of identification when confounders are unobserved. The DAG framework formalizes what makes a good instrument: relevance (\(Z \to D\) is real), exclusion (\(Z\) has no path to \(Y\) other than through \(D\)), and exogeneity (\(Z\) shares no unblocked common cause with \(Y\)).
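The following sketch shows the idea with a hand-rolled two-stage procedure in base R. The unobserved confounder `U` and all coefficients are assumptions made for the simulation; in practice you would use a dedicated IV routine, since the standard errors from the manual second stage are not valid.

```r
set.seed(4)
n   <- 10000
tau <- 1                             # assumed true effect of D on Y

U <- rnorm(n)                        # unobserved confounder
Z <- rnorm(n)                        # instrument
D <- 0.8 * Z + U + rnorm(n)          # relevance: Z -> D
Y <- tau * D + U + rnorm(n)          # exclusion: Z does not appear here

ols       <- coef(lm(Y ~ D))["D"]       # biased: U is omitted
ols_withZ <- coef(lm(Y ~ D + Z))["D"]   # adding Z as a control does not help

# Two stages by hand: point estimate only, the second-stage SEs are wrong
D_hat <- fitted(lm(D ~ Z))
tsls  <- coef(lm(Y ~ D_hat))["D_hat"]
wald  <- cov(Z, Y) / cov(Z, D)          # same estimate as a ratio of covariances

round(c(ols = unname(ols), ols_withZ = unname(ols_withZ),
        tsls = unname(tsls), wald = wald), 2)
```

In this setup, putting \(Z\) into the OLS controls moves the estimate further from the truth: it strips out the clean variation in \(D\) and leaves the part driven by `U`.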
The backdoor criterion, stated practically
A backdoor path from \(D\) to \(Y\) is any path between them that starts with an arrow into \(D\) (i.e., it doesn’t begin with \(D\) causing something). If you can block every backdoor path, you’ve removed all the confounding.
A set of variables \(X\) satisfies the backdoor criterion if:
1. Every backdoor path between \(D\) and \(Y\) is blocked: it either passes through at least one variable in \(X\) that is not a collider on that path, or it contains a collider that is not in \(X\) and has no descendant in \(X\).
2. No variable in \(X\) is a descendant of \(D\) (i.e., \(X\) contains no post-treatment variables: no mediators, no colliders).
Condition 1 is “control for enough.” Condition 2 is “don’t control for too much.” Both matter, and missing either one leaves a bias.
The practical workflow:
1. Draw the DAG.
2. List the backdoor paths and, for each one that is open, find a non-collider variable on it.
3. Control for that variable. If it is unobserved, a descendant that proxies for it helps only to the extent that it captures the same information.
4. Do not control for any post-treatment variables.
That’s it. Selection on observables, propensity score methods, regression adjustment — all of these are operationalizations of the backdoor criterion under different functional-form assumptions.
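The workflow above can be sketched as a small checker in base R. Everything here is illustrative: the edge list describes a hypothetical DAG with a confounder `W`, a mediator `M`, and a collider `C`, and the helper functions are written for this example, not taken from a package. The checker assumes every variable in the graph is observed.

```r
# Edge list for an illustrative DAG.
edges <- data.frame(
  from = c("W", "W", "D", "M", "D", "D", "Y"),
  to   = c("D", "Y", "M", "Y", "Y", "C", "C"),
  stringsAsFactors = FALSE
)

has_edge <- function(edges, a, b) any(edges$from == a & edges$to == b)

# All nodes reachable from `node` by following arrows forward.
descendants <- function(edges, node) {
  found <- character(0)
  frontier <- node
  while (length(frontier) > 0) {
    kids <- setdiff(edges$to[edges$from %in% frontier], found)
    found <- c(found, kids)
    frontier <- kids
  }
  found
}

# All simple paths from `start` to `end`, ignoring arrow direction.
all_paths <- function(edges, start, end, visited = start) {
  if (start == end) return(list(visited))
  nbrs <- unique(c(edges$to[edges$from == start], edges$from[edges$to == start]))
  paths <- list()
  for (nb in setdiff(nbrs, visited)) {
    paths <- c(paths, all_paths(edges, nb, end, c(visited, nb)))
  }
  paths
}

# A path is blocked by X if it has a non-collider in X, or a collider
# such that neither the collider nor any of its descendants is in X.
is_blocked <- function(edges, path, X) {
  if (length(path) < 3) return(FALSE)   # a direct edge has nothing to block
  for (i in 2:(length(path) - 1)) {
    node <- path[i]
    collider <- has_edge(edges, path[i - 1], node) &&
      has_edge(edges, path[i + 1], node)
    if (collider) {
      if (!any(c(node, descendants(edges, node)) %in% X)) return(TRUE)
    } else if (node %in% X) {
      return(TRUE)
    }
  }
  FALSE
}

satisfies_backdoor <- function(edges, treatment, outcome, X) {
  no_descendants <- !any(X %in% descendants(edges, treatment))
  backdoor <- Filter(function(p) has_edge(edges, p[2], p[1]),  # first arrow points into treatment
                     all_paths(edges, treatment, outcome))
  all_blocked <- all(vapply(backdoor, function(p) is_blocked(edges, p, X),
                            logical(1)))
  no_descendants && all_blocked
}

candidates <- list(none = character(0), W = "W",
                   WM = c("W", "M"), WC = c("W", "C"))
res <- sapply(candidates, function(X) satisfies_backdoor(edges, "D", "Y", X))
res
```

Only the set containing `W` alone passes. The empty set fails the first condition (the path through `W` stays open), and the sets that add `M` or `C` fail the second (both are descendants of `D`).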
Collider bias in action
The interactive demo below shows the OLS estimate of the treatment effect under four different control sets. The DGP has a known true total effect (\(\tau = 1\) by default, adjustable with the slider), a confounder, a mediator, and a collider. Each control set gives a different answer. The lesson, visible in the bars, is that what you control for changes the meaning of your estimate, not just its precision.
#| standalone: true
#| viewerHeight: 640
library(shiny)
# Fixed DGP:
# W -> D, W -> Y (W is a confounder)
# D -> M, M -> Y (M is a mediator on the D -> M -> Y path)
# D -> C, Y -> C (C is a collider)
# True total effect of D on Y is tau.
# Path D -> M -> Y has size beta_DM * beta_MY; the direct D -> Y effect is
# tau minus that, so the *total* derivative dY/dD equals tau by construction.
ui <- fluidPage(
  titlePanel("What different control choices do to the OLS estimate"),
  sidebarLayout(
    sidebarPanel(
      width = 4,
      sliderInput("n", "Sample size:",
                  min = 200, max = 5000, value = 1500, step = 100),
      sliderInput("tau", "True total effect τ (D → Y):",
                  min = 0, max = 3, value = 1, step = 0.1),
      actionButton("go", "Resample", class = "btn-primary", width = "100%"),
      htmlOutput("readout")
    ),
    mainPanel(
      width = 8,
      plotOutput("bias_plot", height = "500px")
    )
  )
)
server <- function(input, output, session) {

  dat <- reactive({
    input$go  # take a dependency on the button so "Resample" redraws the data

    n   <- input$n
    tau <- input$tau

    beta_DM  <- 0.7                  # D -> M
    beta_MY  <- 0.6                  # M -> Y
    indirect <- beta_DM * beta_MY    # part of tau that flows via M
    direct   <- tau - indirect       # remaining direct D -> Y effect

    W <- rnorm(n)
    D <- 0.8 * W + rnorm(n)
    M <- beta_DM * D + rnorm(n)
    Y <- direct * D + beta_MY * M + 1.0 * W + rnorm(n)
    C <- 1.2 * D + 1.2 * Y + rnorm(n)   # both D and Y point into C

    # M and C are each added on top of W, so the last two bars isolate the
    # effect of one bad control instead of mixing it with confounding by W.
    # (Without W, the "+ M" coefficient would not equal the direct effect.)
    specs <- list(
      "no controls"        = lm(Y ~ D),
      "+ W (confounder)"   = lm(Y ~ D + W),
      "+ W + M (mediator)" = lm(Y ~ D + W + M),
      "+ W + C (collider)" = lm(Y ~ D + W + C)
    )

    list(specs = specs, tau = tau, direct = direct)
  })

  output$bias_plot <- renderPlot({
    d <- dat()
    coefs <- sapply(d$specs, function(fit) coef(fit)["D"])
    ses   <- sapply(d$specs, function(fit) summary(fit)$coefficients["D", "Std. Error"])

    # Color each bar according to whether it gives the right answer
    bar_cols <- c("#e67e22",   # no controls -- confounded
                  "#27ae60",   # +W -- unbiased (target)
                  "#3498db",   # +M -- direct effect, not total
                  "#c0392b")   # +C -- collider bias

    par(mar = c(7.5, 5, 1.5, 1))
    ymin <- min(c(0, coefs - 2 * ses, d$tau, d$direct)) - 0.2
    ymax <- max(c(coefs + 2 * ses, d$tau, d$direct)) + 0.3

    bp <- barplot(coefs, names.arg = names(d$specs), las = 2,
                  col = bar_cols, ylim = c(ymin, ymax),
                  ylab = expression("Estimated coefficient on " * D),
                  main = "")
    arrows(bp, coefs - 1.96 * ses, bp, coefs + 1.96 * ses,
           code = 3, angle = 90, length = 0.05, col = "#2c3e50")

    abline(h = d$tau, col = "#27ae60", lwd = 2, lty = 2)
    text(par("usr")[2] * 0.98, d$tau,
         sprintf("true total effect τ = %.2f", d$tau),
         pos = 2, col = "#27ae60", cex = 0.95)

    abline(h = d$direct, col = "#3498db", lwd = 1.5, lty = 3)
    text(par("usr")[2] * 0.98, d$direct,
         sprintf("direct effect = %.2f", d$direct),
         pos = 2, col = "#3498db", cex = 0.85)

    abline(h = 0, col = "gray70", lty = 3)
  })

  output$readout <- renderUI({
    d <- dat()
    coefs <- sapply(d$specs, function(fit) coef(fit)["D"])
    HTML(sprintf(paste(
      "<div style='margin-top:10px;font-size:13px;line-height:1.65;'>",
      "<b>True total τ:</b> %.2f<br>",
      "<b>Direct effect (after stripping D → M → Y):</b> %.2f<br><br>",
      "<span style='color:#e67e22;'>● <b>no controls:</b> %.3f</span> — confounded upward by W<br>",
      "<span style='color:#27ae60;'>● <b>+ W:</b> %.3f</span> — unbiased estimate of τ<br>",
      "<span style='color:#3498db;'>● <b>+ M:</b> %.3f</span> — measures the <i>direct</i> effect, not the total<br>",
      "<span style='color:#c0392b;'>● <b>+ C:</b> %.3f</span> — collider bias from conditioning on a common effect",
      "</div>"
    ),
    d$tau, d$direct,
    coefs[1], coefs[2], coefs[3], coefs[4]))
  })
}
shinyApp(ui, server)
A cheat sheet for econometricians
For a clean estimate of the total causal effect of \(D\) on \(Y\):
| Variable role | Include in OLS? | Why |
|---|---|---|
| Pre-treatment confounder (\(W\)) | ✅ Yes | Blocks a backdoor path |
| Pre-treatment instrument (\(Z\)) | ❌ No (use IV) | Conditioning on \(Z\) throws away identifying variation and can amplify bias from unobserved confounders |
| Mediator (\(M\), on the path \(D \to M \to Y\)) | ❌ No | Removes part of the total effect |
| Collider (\(C\), with \(D \to C\) and \(Y \to C\)) | ❌ Never | Opens a non-causal path |
| Descendant of \(D\) that’s not on the \(D \to Y\) pathway | ❌ No | Post-treatment; becomes a collider if it shares unobserved causes with \(Y\) |
| Variable unrelated to both \(D\) and \(Y\) (neither a cause nor an effect of either) | Don’t care (no bias) | Wastes degrees of freedom but doesn’t bias the point estimate |
Two rules of thumb that follow:
1. Pre-treatment vs. post-treatment is the simplest filter. If a variable was determined before the treatment, controlling for it is usually harmless (if it’s irrelevant) or correct (if it’s a confounder); the exceptions are instruments and pre-treatment colliders, which is why the filter is a first pass and not a substitute for the DAG. If a variable was determined after the treatment, controlling for it is at best harmless and often wrong (mediator or collider).
2. “Including more controls is safer” is wrong. Adding a collider or a mediator can introduce a bias bigger than the confounding you were trying to fix. This is the most consequential lesson from the DAG framework and the hardest sell to applied workers trained on “kitchen sink” regressions.
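To see how badly a “kitchen sink” specification can go wrong, the sketch below compares the correct control set with one that also throws in a mediator and a collider. The coefficients are assumptions chosen for illustration, with a true total effect of 1.

```r
set.seed(5)
n   <- 5000
tau <- 1                                       # assumed true total effect

W <- rnorm(n)                                  # confounder
D <- 0.8 * W + rnorm(n)
M <- 0.7 * D + rnorm(n)                        # mediator
Y <- (tau - 0.7 * 0.6) * D + 0.6 * M + W + rnorm(n)
C <- 1.2 * D + 1.2 * Y + rnorm(n)              # collider

right_set    <- coef(lm(Y ~ D + W))["D"]           # confounder only
kitchen_sink <- coef(lm(Y ~ D + W + M + C))["D"]   # everything available

round(c(right_set = unname(right_set),
        kitchen_sink = unname(kitchen_sink)), 2)
```

Under these assumptions the kitchen-sink coefficient lands at roughly \(-0.35\): not just attenuated, but the wrong sign, while the one-control regression recovers a value close to 1.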
How this connects to the rest of the site
The backdoor criterion is the formal statement behind much of what other pages on this site assume:
- Selection on Observables is the identification strategy that the backdoor criterion enables — controlling for the right pre-treatment variables.
- Regression Adjustment, IPW, and Matching are different ways to operationalize the same backdoor adjustment.
- Instrumental Variables is what you do when you can’t satisfy the backdoor criterion because a key confounder is unobserved: you find a \(Z\) that shifts \(D\) but reaches \(Y\) only through \(D\), and use that variation instead.
- DiD, RDD, and Synthetic Control are designs that use a different blocking strategy — exploiting timing, thresholds, or counterfactual construction — to avoid relying on observable confounders at all.
The DAG framework makes those choices explicit. You draw what you think the world looks like, you check whether the backdoor criterion holds with your chosen controls, and if it doesn’t, you reach for a different research design.
Did you know?
- Judea Pearl won the 2011 Turing Award largely for developing the DAG framework and the do-calculus. He has been openly frustrated that economists adopted the potential-outcomes notation of Rubin and Imbens while remaining cool to graphical methods for decades. The two frameworks are mathematically equivalent for most everyday problems; the gap was about taste, language, and which department you read papers from.
- The “bad controls” warning dates back at least to Rosenbaum (1984) in statistics, and was canonized in econometrics by Angrist & Pischke’s Mostly Harmless Econometrics (2009), whose “Bad Control” section has been required reading in applied PhD programs for over a decade. The DAG-based taxonomic version — Cinelli, Forney & Pearl’s “A Crash Course in Good and Bad Controls” — covers many configurations in one place and is increasingly co-assigned alongside MHE.
- Collider bias in everyday data. Restaurant reviews on Yelp can show a negative correlation between food quality and service quality, not because the two trade off, but because a restaurant has to be good enough overall to stay in business (survival is the collider). A great kitchen can compensate for rude service, and great service can compensate for mediocre food, so the survivors are short on places that are bad at both. Condition on survival and a negative correlation appears even if the two are unrelated among all restaurants that ever opened.