From Correlation to Causation

The regression slope is a correlation

On the Regression & CEF page, you learned that OLS estimates the best linear approximation to \(E[Y \mid X]\). The slope \(\hat{\beta}\) tells you: when X is one unit higher, Y is on average \(\hat{\beta}\) units higher. But “higher” is not “caused by.” The slope measures how X and Y move together — that’s correlation. In fact:

\[\hat{\beta} = \frac{\text{Cov}(X, Y)}{\text{Var}(X)} = r_{XY} \cdot \frac{SD(Y)}{SD(X)}\]

It’s literally a rescaled correlation coefficient. A significant slope means X and Y are associated. It says nothing about why.
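You can check this identity directly in R. Here's a minimal sketch with simulated data (the numbers are arbitrary): the OLS slope, the covariance ratio, and the rescaled correlation are the same quantity.

# Verify: OLS slope = Cov(X,Y)/Var(X) = r * SD(Y)/SD(X)
set.seed(1)
x <- rnorm(200)
y <- 1.5 * x + rnorm(200)                  # any linear relationship will do

slope_ols <- unname(coef(lm(y ~ x))[2])    # slope from lm()
slope_cov <- cov(x, y) / var(x)            # Cov(X, Y) / Var(X)
slope_cor <- cor(x, y) * sd(y) / sd(x)     # r * SD(Y) / SD(X)

c(ols = slope_ols, cov_ratio = slope_cov, scaled_cor = slope_cor)
# All three print the identical number.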

So you see a correlation. Now what?

Your slope is positive, your p-value is small. But why are X and Y correlated?

There are exactly five possibilities:

| Reason | Structure | Example |
|---|---|---|
| X causes Y | \(X \to Y\) | Exercise → lower blood pressure |
| Y causes X | \(Y \to X\) | Higher income → more education (you can afford grad school) |
| Something else causes both | \(Z \to X\), \(Z \to Y\) | Family background → education AND income |
| Spurious correlation | No real connection | US spending on science correlates with suicides by hanging |
| You conditioned on a common effect | Collider bias | Among hospitalized patients, two unrelated diseases look negatively correlated |

Here’s the problem: all five produce the same scatter plot. A positive slope between X and Y looks identical whether X actually causes Y, or a confounder drives both, or it’s pure coincidence. The data can’t tell you which story is true — only theory, study design, and assumptions can.

The Oracle View. In these simulations, we build the causal structure — we decide what causes what. We can compare the true causal effect with what a naive regression estimates. In practice, you never see the causal structure. You see the data and have to argue — using design, theory, and assumptions — that your estimate is causal.


Simulation 1: Same scatter plot, different causal stories

Three different causal structures, each able to produce the same correlation between X and Y. Drag the sliders and watch: when the slopes line up, the scatter plots are indistinguishable, yet the true causal effect of X on Y is completely different in each story.

#| standalone: true
#| viewerHeight: 520

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 8px; font-size: 13px; line-height: 1.7;
    }
    .stats-box b { color: #2c3e50; }
    .causal  { color: #27ae60; font-weight: bold; }
    .mislead { color: #e74c3c; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 4,

      sliderInput("n1", "Sample size:",
                  min = 100, max = 500, value = 300, step = 50),

      sliderInput("true_effect", "True causal effect of X on Y:",
                  min = 0, max = 2, value = 0, step = 0.1),

      sliderInput("confound", "Confounding strength (Z on both):",
                  min = 0, max = 3, value = 2, step = 0.25),

      uiOutput("results1")
    ),

    mainPanel(
      width = 8,
      fluidRow(
        column(4, plotOutput("plot_causal", height = "260px")),
        column(4, plotOutput("plot_confound", height = "260px")),
        column(4, plotOutput("plot_reverse", height = "260px"))
      )
    )
  )
)

server <- function(input, output, session) {

  sim <- reactive({
    n  <- input$n1
    b  <- input$true_effect
    cf <- input$confound

    set.seed(42)
    z <- rnorm(n)
    noise_x <- rnorm(n)
    noise_y <- rnorm(n)

    # Story 1: X causes Y (true effect = b, no confounding)
    x1 <- noise_x
    y1 <- b * x1 + noise_y

    # Story 2: Confounding — Z causes both X and Y
    x2 <- cf * z + noise_x
    y2 <- b * x2 + cf * z + noise_y

    # Story 3: Reverse causality — Y causes X
    y3 <- cf * z + noise_y
    x3 <- 1.5 * y3 + noise_x

    fit1 <- lm(y1 ~ x1)
    fit2 <- lm(y2 ~ x2)
    fit3 <- lm(y3 ~ x3)

    list(x1 = x1, y1 = y1, x2 = x2, y2 = y2, x3 = x3, y3 = y3,
         b1 = coef(fit1)[2], b2 = coef(fit2)[2], b3 = coef(fit3)[2],
         true_b = b, confound = cf)
  })

  output$plot_causal <- renderPlot({
    d <- sim()
    par(mar = c(3, 3, 2, 0.5), cex.axis = 0.8, cex.lab = 0.9, cex.main = 0.85)
    plot(d$x1, d$y1, pch = 16, cex = 0.5,
         col = adjustcolor("#3498db", 0.4),
         xlab = "X", ylab = "Y",
         main = expression("Story 1: X causes Y  (" * X %->% Y * ")"))
    abline(lm(d$y1 ~ d$x1), col = "#e74c3c", lwd = 3)
    legend("topleft", bty = "n", cex = 0.8,
           legend = paste0("OLS slope = ", round(d$b1, 2),
                           "  |  True effect = ", d$true_b),
           text.col = if (abs(d$b1 - d$true_b) < 0.15) "#27ae60" else "#e74c3c",
           text.font = 2)
  })

  output$plot_confound <- renderPlot({
    d <- sim()
    par(mar = c(3, 3, 2, 0.5), cex.axis = 0.8, cex.lab = 0.9, cex.main = 0.85)
    plot(d$x2, d$y2, pch = 16, cex = 0.5,
         col = adjustcolor("#3498db", 0.4),
         xlab = "X", ylab = "Y",
         main = expression("Story 2: Confounding  (" * Z %->% X * "," ~~ Z %->% Y * ")"))
    abline(lm(d$y2 ~ d$x2), col = "#e74c3c", lwd = 3)
    legend("topleft", bty = "n", cex = 0.8,
           legend = paste0("OLS slope = ", round(d$b2, 2),
                           "  |  True effect = ", d$true_b,
                           "  [BIASED]"),
           text.col = "#e74c3c", text.font = 2)
  })

  output$plot_reverse <- renderPlot({
    d <- sim()
    par(mar = c(3, 3, 2, 0.5), cex.axis = 0.8, cex.lab = 0.9, cex.main = 0.85)
    plot(d$x3, d$y3, pch = 16, cex = 0.5,
         col = adjustcolor("#3498db", 0.4),
         xlab = "X", ylab = "Y",
         main = expression("Story 3: Reverse causality  (" * Y %->% X * ")"))
    abline(lm(d$y3 ~ d$x3), col = "#e74c3c", lwd = 3)
    legend("topleft", bty = "n", cex = 0.8,
           legend = paste0("OLS slope = ", round(d$b3, 2),
                           "  |  True effect = 0  [MISLEADING]"),
           text.col = "#e74c3c", text.font = 2)
  })

  output$results1 <- renderUI({
    d <- sim()

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>X \u2192 Y (no confounding):</b><br>",
        "OLS = ", round(d$b1, 3),
        " | True = ", d$true_b,
        if (abs(d$b1 - d$true_b) < 0.15)
          " <span class='causal'>\u2713</span>" else "", "<br>",
        "<hr style='margin:6px 0'>",
        "<b>Z \u2192 X, Z \u2192 Y (confounding):</b><br>",
        "OLS = ", round(d$b2, 3),
        " | True = ", d$true_b,
        " <span class='mislead'>biased</span><br>",
        "<hr style='margin:6px 0'>",
        "<b>Y \u2192 X (reverse causality):</b><br>",
        "OLS = ", round(d$b3, 3),
        " | True = 0",
        " <span class='mislead'>misleading</span>"
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • True effect = 0, confounding = 2: X has no causal effect on Y in any of the three stories, yet panels 2 and 3 show a clear positive correlation. Only panel 1 gets it right, with an OLS slope near zero. The correlations in panels 2 and 3 are entirely spurious — driven by confounding or reverse causality.
  • True effect = 1, confounding = 2: panel 1 recovers the true effect. Panel 2 overestimates it (confounding adds to the slope). Panel 3 shows something completely unrelated.
  • Increase sample size: all three slopes get more precise — but the biased ones stay biased. More data doesn’t fix confounding. It just gives you a more precise wrong answer.
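That last point is worth seeing in numbers. Under the Story 2 data-generating process (true effect 0, confounding 2), the OLS slope converges to a fixed wrong value as n grows; this is a minimal sketch of that fact:

# More data doesn't fix confounding: the bias is in the estimand, not the noise
set.seed(7)
b <- 0; cf <- 2
for (n in c(100, 1000, 10000, 100000)) {
  z <- rnorm(n)
  x <- cf * z + rnorm(n)           # Story 2: Z drives X ...
  y <- b * x + cf * z + rnorm(n)   # ... and Z drives Y; X itself does nothing
  cat("n =", format(n, width = 6),
      "  OLS slope =", round(unname(coef(lm(y ~ x))[2]), 3), "\n")
}
# The slope settles near cf^2 / (cf^2 + 1) = 0.8, not near the true effect of 0.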

Simulation 2: Randomization breaks confounding

This is the fundamental insight of causal inference: randomization makes X independent of everything else, so the only reason Y changes with X is the causal effect.

Compare an observational study (where confounding biases the estimate) with a randomized experiment (where the bias disappears).

#| standalone: true
#| viewerHeight: 700

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
    .good { color: #27ae60; font-weight: bold; }
    .bad  { color: #e74c3c; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 4,

      sliderInput("n2", "Sample size:",
                  min = 100, max = 500, value = 200, step = 50),

      sliderInput("true_b", "True causal effect of treatment:",
                  min = 0, max = 5, value = 2, step = 0.25),

      sliderInput("confound2", "Confounding strength:",
                  min = 0, max = 5, value = 3, step = 0.25),

      uiOutput("results2")
    ),

    mainPanel(
      width = 8,
      plotOutput("rct_plot", height = "550px")
    )
  )
)

server <- function(input, output, session) {

  sim <- reactive({
    n  <- input$n2
    b  <- input$true_b
    cf <- input$confound2

    # Confounder: ability, motivation, etc.
    z <- rnorm(n)

    # Observational: treatment correlated with confounder
    # (motivated people self-select into treatment)
    prob_treat <- plogis(cf * z)
    treat_obs  <- rbinom(n, 1, prob_treat)
    y_obs      <- b * treat_obs + 2 * z + rnorm(n)
    fit_obs    <- lm(y_obs ~ treat_obs)

    # RCT: treatment is random (coin flip)
    treat_rct <- rbinom(n, 1, 0.5)
    y_rct     <- b * treat_rct + 2 * z + rnorm(n)
    fit_rct   <- lm(y_rct ~ treat_rct)

    list(treat_obs = treat_obs, y_obs = y_obs,
         treat_rct = treat_rct, y_rct = y_rct,
         z = z, b_obs = coef(fit_obs)[2], b_rct = coef(fit_rct)[2],
         true_b = b, confound = cf)
  })

  output$rct_plot <- renderPlot({
    d <- sim()
    par(mfrow = c(1, 2), mar = c(5, 5, 4, 2))

    # Left: Observational
    cols_obs <- ifelse(d$treat_obs == 1, "#e74c3c", "#3498db")
    stripchart(d$y_obs ~ d$treat_obs,
               method = "jitter", jitter = 0.15,
               vertical = TRUE, pch = 16, cex = 0.6,
               col = adjustcolor(cols_obs, 0.4),
               group.names = c("Control", "Treated"),
               xlab = "", ylab = "Outcome (Y)",
               main = "Observational study")

    means_obs <- tapply(d$y_obs, d$treat_obs, mean)
    segments(0.8, means_obs[1], 1.2, means_obs[1],
             col = "#3498db", lwd = 3)
    segments(1.8, means_obs[2], 2.2, means_obs[2],
             col = "#e74c3c", lwd = 3)

    mtext(paste0("Difference = ", round(d$b_obs, 2),
                 "  (true = ", d$true_b, ")"),
          side = 3, line = 0, cex = 1, font = 2, col = "#e74c3c")

    # Right: RCT
    cols_rct <- ifelse(d$treat_rct == 1, "#e74c3c", "#3498db")
    stripchart(d$y_rct ~ d$treat_rct,
               method = "jitter", jitter = 0.15,
               vertical = TRUE, pch = 16, cex = 0.6,
               col = adjustcolor(cols_rct, 0.4),
               group.names = c("Control", "Treated"),
               xlab = "", ylab = "Outcome (Y)",
               main = "Randomized experiment")

    means_rct <- tapply(d$y_rct, d$treat_rct, mean)
    segments(0.8, means_rct[1], 1.2, means_rct[1],
             col = "#3498db", lwd = 3)
    segments(1.8, means_rct[2], 2.2, means_rct[2],
             col = "#e74c3c", lwd = 3)

    mtext(paste0("Difference = ", round(d$b_rct, 2),
                 "  (true = ", d$true_b, ")"),
          side = 3, line = 0, cex = 1, font = 2, col = "#27ae60")
  })

  output$results2 <- renderUI({
    d <- sim()

    bias_obs <- d$b_obs - d$true_b
    bias_rct <- d$b_rct - d$true_b

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>True causal effect:</b> ", d$true_b, "<br>",
        "<hr style='margin:6px 0'>",
        "<b>Observational:</b> ", round(d$b_obs, 3),
        "<br>Bias: <span class='bad'>",
        round(bias_obs, 3), "</span><br>",
        "<small>Treated group has higher ability",
        " (selection bias)</small>",
        "<hr style='margin:6px 0'>",
        "<b>Randomized:</b> ", round(d$b_rct, 3),
        "<br>Bias: <span class='good'>",
        round(bias_rct, 3), "</span><br>",
        "<small>Treatment is independent of",
        " everything \u2192 unbiased</small>"
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • True effect = 2, confounding = 3: the observational study overestimates the effect (maybe by 50–100%) because motivated people self-select into treatment. The RCT nails it.
  • True effect = 0, confounding = 3: the treatment does nothing, but the observational study finds a large “effect” — entirely driven by selection bias. The RCT correctly shows ~0.
  • Confounding = 0: both studies agree. When there’s no confounding, observational data works fine. The problem is you never know if confounding is zero in practice.
  • Increase n: both estimates get more precise, but the observational one stays biased. Sample size doesn’t fix confounding.

Digging deeper: why correlations mislead

Collider bias: the trickiest one

The first four reasons are intuitive. The fifth — collider bias — is subtle and catches even experienced researchers.

Start with the name. In causal diagrams, arrows show what causes what. A collider is a variable where two arrows point into it:

\[X \to C \leftarrow Y\]

The arrows “collide” at \(C\): it’s caused by both \(X\) and \(Y\). On its own, that’s fine. \(X\) and \(Y\) can be completely independent. The problem starts when you condition on the collider — by restricting your sample to specific values of \(C\), or by controlling for \(C\) in a regression. That act of conditioning opens a fake pathway between \(X\) and \(Y\) that doesn’t exist in the real world.

Why does conditioning on a collider create a fake association? Think of it as an information leak. If you know someone is in a specific group (defined by \(C\)), then learning about \(X\) tells you something about \(Y\) — not because they’re related, but because both contributed to getting into that group. Knowing one cause of \(C\) changes what you infer about the other cause.

The hospital example, step by step.

  1. Disease A and disease B are completely unrelated in the general population. Knowing someone has A tells you nothing about whether they have B.
  2. Both diseases can put you in the hospital. “Hospitalization” is a collider: A → Hospital ← B.
  3. Now you study only hospitalized patients. You’ve conditioned on the collider.
  4. Pick a hospitalized patient. You learn they don’t have disease A. But they’re in the hospital — so something put them there. That something is more likely to be disease B.
  5. Result: among hospitalized patients, A and B look negatively correlated. Not because one prevents the other, but because you restricted your sample to people who have at least one of them.

In the general population: no correlation. In the hospital: negative correlation. The correlation is entirely an artifact of how you selected your sample.
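A few lines of R make the mechanism concrete. This is a minimal sketch of the hospital story with made-up prevalences and admission probabilities, not real clinical numbers:

# Collider bias: A -> Hospital <- B, with A and B independent
set.seed(123)
n <- 100000
a <- rbinom(n, 1, 0.10)   # disease A (illustrative 10% prevalence)
b <- rbinom(n, 1, 0.10)   # disease B, generated independently of A

# Either disease sharply raises the chance of hospitalization
hospital <- rbinom(n, 1, 0.05 + 0.45 * a + 0.45 * b)

cor(a, b)                                 # ~0: no correlation in the population
cor(a[hospital == 1], b[hospital == 1])   # clearly negative inside the hospital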

The talent-attractiveness tradeoff.

  1. In the general population, acting talent and physical attractiveness are unrelated. Plenty of beautiful untalented people, plenty of talented plain-looking people.
  2. Hollywood selects actors who are talented OR attractive OR both. “Getting cast” is a collider: Talent → Cast ← Attractiveness.
  3. Among successful actors (conditioned on being cast), you notice: the beautiful ones seem less talented, and the talented ones seem less attractive.
  4. Why? If an actor isn’t particularly talented, they probably got cast because they’re attractive (and vice versa). Knowing one cause tells you about the other — but only within the selected group.
  5. You conclude “there’s a tradeoff between talent and looks.” There isn’t. You’re seeing collider bias.

How it shows up in research. Collider bias often sneaks in through:

  • Sample selection: studying only college graduates, only employed people, only published studies, only survivors
  • Controlling for post-treatment variables: if you regress Y on X “controlling for” a variable that X and Y both cause, you’ve conditioned on a collider
  • Attrition: if people drop out of your study for reasons related to both treatment and outcome, the remaining sample is conditioned on a collider

The fix: think carefully about the causal structure before you run your regression. Draw the diagram. If a variable is caused by your treatment and your outcome, don’t control for it and don’t select on it. Condition only on variables that are causes (confounders), never on variables that are consequences (colliders).
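The same trap shows up in regression form. Here’s a minimal sketch (the coefficients are arbitrary): X has exactly zero effect on Y, both feed into a collider, and adding the collider as a “control” manufactures a large, wrong coefficient.

# Controlling for a collider: X -> C <- Y, with X and Y independent
set.seed(42)
n <- 10000
x <- rnorm(n)
y <- rnorm(n)                    # true effect of X on Y is exactly 0
collider <- x + y + rnorm(n)     # caused by both X and Y

coef(lm(y ~ x))[2]               # ~0: the honest regression is right
coef(lm(y ~ x + collider))[2]    # ~ -0.5: "controlling" for C invents an effect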


When does correlation become causation?

Correlation becomes causal when you can rule out reasons 2–5. That’s what the entire field of causal inference is about.

| Threat | What it means | How to address it |
|---|---|---|
| Confounding | A third variable drives both X and Y | Randomize, or control for all confounders (hard) |
| Reverse causality | Y causes X, not the other way around | Randomize, use timing/lags, or find an instrument |
| Selection / collider bias | Your sample is selected in a way that creates a spurious relationship | Think carefully about your sample; don’t condition on post-treatment variables |
| Coincidence | Small samples, multiple testing | Large samples, pre-registration, corrections |

Randomization removes the first three threats by design — that’s why RCTs are the gold standard (coincidence is tamed separately, by sample size and honest testing). But randomization isn’t always possible (you can’t randomly assign smoking, education, or recessions). That’s where the causal inference toolkit comes in:

  • Instrumental variables (IV): find something that affects X but has no other path to Y
  • Difference-in-differences (DiD): compare before/after changes between treated and untreated groups
  • Regression discontinuity (RDD): exploit a cutoff that creates near-random assignment
  • Matching / propensity scores: create comparable treated and control groups from observational data

Each method makes different assumptions to rule out the threats above. None is universally valid — the right method depends on your setting.
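As a taste of how one of these designs works, here’s a minimal difference-in-differences sketch with invented numbers: the treated group starts 5 units higher and both groups share a common time trend, so a naive comparison is badly biased, but the interaction term recovers the true effect.

# Difference-in-differences on simulated data (all effect sizes invented)
set.seed(99)
n <- 2000
treated <- rep(0:1, each = n / 2)   # group indicator
post    <- rbinom(n, 1, 0.5)        # before (0) / after (1) period
true_effect <- 2

# Group gap (5) + common time trend (1) + effect only for treated-after cells
y <- 5 * treated + 1 * post + true_effect * treated * post + rnorm(n)

coef(lm(y ~ treated))[2]                       # naive gap: ~6, badly biased
coef(lm(y ~ treated * post))["treated:post"]   # DiD estimate: ~2, the true effect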

These methods are covered in depth in the Causal Inference course.


Simpson’s Paradox

A striking example of confounding: a trend that appears in aggregate data reverses when you look within subgroups.

The UC Berkeley admissions case (1973)

Berkeley’s graduate admissions data showed an apparent bias against women: overall, 44% of men were admitted vs 35% of women. Lawsuit-worthy? Not so fast.

When researchers looked within departments, women were admitted at equal or higher rates than men in most departments. The aggregate gap existed because women applied disproportionately to more competitive departments (like English) while men applied to less competitive ones (like Engineering). Department was a confounder: it drove both the gender composition of applicants and the admission rate.

| Level | Men admitted | Women admitted | Looks like… |
|---|---|---|---|
| Aggregate | 44% | 35% | Bias against women |
| Within departments | Similar or lower | Similar or higher | No bias (or slight bias for women) |

This is Simpson’s paradox: the aggregate trend is the opposite of the within-group trend. It’s not a paradox in logic — it’s a consequence of confounding. The aggregate numbers mix together departments with very different admission rates and very different gender ratios.
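The reversal is nothing more than weighted averages. Here’s a minimal sketch with two hypothetical departments (numbers invented for illustration, not the actual Berkeley figures):

# Simpson's paradox as arithmetic: B beats A within every department,
# yet loses in aggregate, because B's applicants go to the hard department
apps_a   <- c(easy = 900, hard = 100)   # Group A mostly applies to the easy dept
apps_b   <- c(easy = 100, hard = 900)   # Group B mostly applies to the hard dept
admits_a <- c(easy = 540, hard = 10)    # within-dept rates: 60% and 10%
admits_b <- c(easy = 65,  hard = 135)   # within-dept rates: 65% and 15%

admits_a / apps_a             # A: 0.60, 0.10
admits_b / apps_b             # B: 0.65, 0.15  (higher in BOTH departments)
sum(admits_a) / sum(apps_a)   # A overall: 0.55
sum(admits_b) / sum(apps_b)   # B overall: 0.20 (lower overall!)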

The causal lesson

Simpson’s paradox is resolved by asking a causal question: what should we condition on?

  • If department choice is a pre-treatment variable (a confounder), you should condition on it — and the “bias” disappears.
  • If department choice is a consequence of discrimination (e.g., women are steered away from certain fields), conditioning on it would hide the discrimination.

The data alone can’t tell you which is correct. You need a causal model — a story about what causes what. This is the same lesson as the rest of this page: correlation (the aggregate gap) doesn’t imply causation (discrimination), and the right adjustment depends on the causal structure.

#| standalone: true
#| viewerHeight: 580

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 8px; font-size: 13px; line-height: 1.7;
    }
    .stats-box b { color: #2c3e50; }
    .good { color: #27ae60; font-weight: bold; }
    .bad  { color: #e74c3c; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 4,

      sliderInput("n_simpson", "Applicants per group-dept cell:",
                  min = 50, max = 500, value = 200, step = 50),

      sliderInput("n_groups", "Number of departments:",
                  min = 2, max = 6, value = 3, step = 1),

      sliderInput("confound_strength", "Confounding strength:",
                  min = 0, max = 0.4, value = 0.25, step = 0.05),

      actionButton("go_simp", "New draw", class = "btn-primary", width = "100%"),

      uiOutput("results_simpson")
    ),

    mainPanel(
      width = 8,
      fluidRow(
        column(6, plotOutput("simpson_agg", height = "350px")),
        column(6, plotOutput("simpson_dept", height = "350px"))
      )
    )
  )
)

server <- function(input, output, session) {

  sim <- reactive({
    input$go_simp
    n_cell <- input$n_simpson
    K <- input$n_groups
    cf <- input$confound_strength

    # Department base admission rates (decreasing)
    base_rates <- seq(0.7, 0.2, length.out = K)

    # Group A (e.g., men) applies more to easy departments
    # Group B (e.g., women) applies more to hard departments
    weight_a <- seq(1 + cf * K, 1 - cf * K, length.out = K)
    weight_a <- pmax(weight_a, 0.1)
    weight_a <- weight_a / sum(weight_a)

    weight_b <- seq(1 - cf * K, 1 + cf * K, length.out = K)
    weight_b <- pmax(weight_b, 0.1)
    weight_b <- weight_b / sum(weight_b)

    n_a <- round(weight_a * n_cell * K)
    n_b <- round(weight_b * n_cell * K)

    # Generate admits with same or slightly higher rate for group B within each dept
    dept_results <- data.frame(
      dept = integer(), group = character(),
      n = integer(), admitted = integer(),
      stringsAsFactors = FALSE
    )

    for (k in 1:K) {
      rate_a <- base_rates[k]
      rate_b <- base_rates[k] + 0.02  # Slight advantage for group B within dept

      adm_a <- rbinom(1, n_a[k], rate_a)
      adm_b <- rbinom(1, n_b[k], rate_b)

      dept_results <- rbind(dept_results,
        data.frame(dept = k, group = "A", n = n_a[k], admitted = adm_a),
        data.frame(dept = k, group = "B", n = n_b[k], admitted = adm_b))
    }

    dept_results$rate <- dept_results$admitted / dept_results$n

    # Aggregate rates
    agg_a <- sum(dept_results$admitted[dept_results$group == "A"]) /
             sum(dept_results$n[dept_results$group == "A"])
    agg_b <- sum(dept_results$admitted[dept_results$group == "B"]) /
             sum(dept_results$n[dept_results$group == "B"])

    list(dept_results = dept_results, agg_a = agg_a, agg_b = agg_b,
         K = K, base_rates = base_rates,
         n_a = n_a, n_b = n_b)
  })

  output$simpson_agg <- renderPlot({
    d <- sim()
    par(mar = c(4, 4.5, 3, 1))

    vals <- c(d$agg_a, d$agg_b)
    cols <- c("#3498db", "#e74c3c")
    bp <- barplot(vals * 100, col = cols, border = NA,
                  names.arg = c("Group A", "Group B"),
                  main = "Aggregate Admission Rate",
                  ylab = "Admission rate (%)",
                  ylim = c(0, max(vals * 100) * 1.3))
    text(bp, vals * 100 + 2, paste0(round(vals * 100, 1), "%"),
         cex = 1, font = 2)

    if (d$agg_a > d$agg_b) {
      mtext("Group A admitted at higher rate", side = 1, line = 2.5,
            col = "#e74c3c", font = 2, cex = 0.9)
    } else {
      mtext("Group B admitted at higher rate", side = 1, line = 2.5,
            col = "#27ae60", font = 2, cex = 0.9)
    }
  })

  output$simpson_dept <- renderPlot({
    d <- sim()
    dr <- d$dept_results
    K <- d$K
    par(mar = c(4, 4.5, 3, 1))

    at_a <- (1:K) * 3 - 1.5
    at_b <- (1:K) * 3 - 0.5

    rates_a <- dr$rate[dr$group == "A"]
    rates_b <- dr$rate[dr$group == "B"]

    ylim <- c(0, max(c(rates_a, rates_b) * 100) * 1.25)

    plot(NULL, xlim = c(0, K * 3 + 0.5), ylim = ylim,
         xaxt = "n", xlab = "", ylab = "Admission rate (%)",
         main = "Within-Department Rates")

    rect(at_a - 0.4, 0, at_a + 0.4, rates_a * 100,
         col = "#3498db", border = NA)
    rect(at_b - 0.4, 0, at_b + 0.4, rates_b * 100,
         col = "#e74c3c", border = NA)

    text(at_a, rates_a * 100 + 2, paste0(round(rates_a * 100, 0), "%"),
         cex = 0.7, col = "#3498db", font = 2)
    text(at_b, rates_b * 100 + 2, paste0(round(rates_b * 100, 0), "%"),
         cex = 0.7, col = "#e74c3c", font = 2)

    axis(1, at = (1:K) * 3 - 1, labels = paste("Dept", 1:K), cex.axis = 0.85)

    # Show sample sizes
    n_a <- dr$n[dr$group == "A"]
    n_b <- dr$n[dr$group == "B"]
    mtext(paste0("n: ", paste(n_a, collapse = "/")), side = 1,
          line = 2.3, cex = 0.7, col = "#3498db", adj = 0)
    mtext(paste0("n: ", paste(n_b, collapse = "/")), side = 1,
          line = 3, cex = 0.7, col = "#e74c3c", adj = 0)

    legend("topright", bty = "n", cex = 0.85,
           legend = c("Group A", "Group B"),
           fill = c("#3498db", "#e74c3c"), border = NA)

    b_wins <- sum(rates_b >= rates_a)
    mtext(paste0("Group B higher in ", b_wins, "/", K, " departments"),
          side = 1, line = 2.5, col = "#27ae60", font = 2, cex = 0.85,
          adj = 1)
  })

  output$results_simpson <- renderUI({
    d <- sim()
    dr <- d$dept_results
    rates_a <- dr$rate[dr$group == "A"]
    rates_b <- dr$rate[dr$group == "B"]
    b_higher <- sum(rates_b >= rates_a)

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>Aggregate:</b><br>",
        "Group A: ", round(d$agg_a * 100, 1), "%<br>",
        "Group B: ", round(d$agg_b * 100, 1), "%<br>",
        if (d$agg_a > d$agg_b)
          "<span class='bad'>A looks favored overall</span>"
        else
          "<span class='good'>No aggregate reversal</span>",
        "<hr style='margin:6px 0'>",
        "<b>Within departments:</b><br>",
        "B higher in <span class='good'>", b_higher, "/", d$K,
        "</span> depts<br>",
        "<hr style='margin:6px 0'>",
        "<small>The reversal happens because Group B applies more to ",
        "harder departments. Department is a confounder.</small>"
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • Confounding strength = 0.25: Group A has a higher aggregate rate, but Group B has a higher (or equal) rate in most individual departments. That’s the paradox.
  • Confounding strength = 0: both groups apply to departments at the same rates. The paradox disappears — aggregate and within-department trends agree.
  • Increase confounding: the aggregate gap widens even though within-department rates stay the same. More confounding = more misleading aggregate data.
  • 2 departments: the paradox is easy to see. With 6 departments, it’s harder to spot without the visualization.

Did you know?

  • The phrase “correlation does not imply causation” is one of the most cited principles in statistics, yet researchers violate it constantly. A 2019 study found that over 30% of observational studies in top medical journals used causal language (like “X reduces Y”) without any causal identification strategy.

  • Ronald Fisher, the father of modern statistics, spent decades arguing that the correlation between smoking and lung cancer was not causal — it could be confounding by a genetic factor that causes both the desire to smoke and susceptibility to cancer. He was wrong, but his argument was statistically valid. It took decades of converging evidence from multiple study designs to establish causality.

  • The 2021 Nobel Prize in Economics went to David Card, Joshua Angrist, and Guido Imbens for developing methods to establish causal relationships from observational data — essentially for building the toolkit described above. Their key insight: you don’t need a perfect experiment; you need a credible natural experiment where something approximates random assignment.