Instrumental Variables

The idea

You want the causal effect of \(X\) on \(Y\), but \(X\) is endogenous — correlated with the error term because of confounding, reverse causality, or measurement error. OLS is biased.

The fix: find a variable \(Z\) (the instrument) that:

  1. Relevance: \(Z\) is correlated with \(X\) — it actually moves \(X\)
  2. Exclusion restriction: \(Z\) affects \(Y\) only through \(X\) — no back doors

\[Z \to X \to Y\]

If both conditions hold, you can use \(Z\) to isolate the part of \(X\) that’s “as good as random” and estimate the causal effect.

Two-stage least squares (2SLS)

The mechanics are simple:

First stage: regress \(X\) on \(Z\) to get predicted values \(\hat{X}\)

\[X = \pi_0 + \pi_1 Z + v\]

Second stage: regress \(Y\) on \(\hat{X}\) instead of \(X\)

\[Y = \beta_0 + \beta_1 \hat{X} + \varepsilon\]

Why does this work? \(\hat{X}\) contains only the variation in \(X\) that comes from \(Z\). Since \(Z\) is exogenous (by assumption), \(\hat{X}\) is uncorrelated with the error term. The confounding is gone.
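
The two stages can be run by hand in a few lines of pure Python (an illustrative simulation — the data-generating process and parameter values here are invented for the demo, mirroring the kind of setup used in the app below):

```python
import random

random.seed(42)
n, b_true, confound, pi1 = 5000, 2.0, 3.0, 1.5

u = [random.gauss(0, 1) for _ in range(n)]   # unobserved confounder
z = [random.gauss(0, 1) for _ in range(n)]   # instrument (exogenous)
x = [pi1*zi + confound*ui + random.gauss(0, 1) for zi, ui in zip(z, u)]
y = [b_true*xi + confound*ui + random.gauss(0, 1) for xi, ui in zip(x, u)]

def ols_slope(a, b):
    """Slope from regressing b on a (with intercept): cov(a,b)/var(a)."""
    ma, mb = sum(a)/len(a), sum(b)/len(b)
    cov = sum((ai - ma)*(bi - mb) for ai, bi in zip(a, b))
    return cov / sum((ai - ma)**2 for ai in a)

b_ols = ols_slope(x, y)            # biased upward by the confounder

# First stage: regress x on z, then form fitted values x_hat
pi_hat = ols_slope(z, x)
mz, mx = sum(z)/n, sum(x)/n
x_hat = [mx + pi_hat*(zi - mz) for zi in z]

b_iv = ols_slope(x_hat, y)         # second stage: y on x_hat

print(round(b_ols, 2))   # noticeably above the true 2.0
print(round(b_iv, 2))    # close to the true 2.0
```

With a single instrument and no other regressors, the second-stage slope is algebraically the classic IV ratio \(\mathrm{cov}(Z,Y)/\mathrm{cov}(Z,X)\) — the two-stage recipe and the ratio formula are the same estimator.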

Assumptions

  1. Relevance: \(Z\) is correlated with \(X\) — the instrument actually moves the endogenous variable. Testable: check the first-stage F-statistic.
  2. Exclusion restriction: \(Z\) affects \(Y\) only through \(X\) — no direct effect and no back-door paths. Not testable — you argue it.
  3. Independence: \(Z\) is as good as randomly assigned — uncorrelated with the error term in the outcome equation.
  4. Monotonicity (for LATE): the instrument moves everyone in the same direction — no “defiers” who do the opposite of what the instrument encourages.
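
Relevance is the one assumption you can check directly. With a single instrument, the first-stage F is just the squared t-statistic on \(Z\); here is a pure-Python sketch of the check (simulated data, invented coefficients — not tied to the app below):

```python
import random

def first_stage_F(z, x):
    """F-statistic for the first-stage slope; with one instrument, F = t^2."""
    n = len(z)
    mz, mx = sum(z)/n, sum(x)/n
    szz = sum((zi - mz)**2 for zi in z)
    pi_hat = sum((zi - mz)*(xi - mx) for zi, xi in zip(z, x)) / szz
    resid = [xi - mx - pi_hat*(zi - mz) for zi, xi in zip(z, x)]
    s2 = sum(r*r for r in resid) / (n - 2)   # residual variance
    return pi_hat**2 / (s2 / szz)            # (pi_hat / se)^2

rng = random.Random(7)
n = 500
z = [rng.gauss(0, 1) for _ in range(n)]
noise = [rng.gauss(0, 1) for _ in range(n)]

x_strong = [0.5*zi + ei for zi, ei in zip(z, noise)]   # Z clearly moves X
x_weak   = [0.05*zi + ei for zi, ei in zip(z, noise)]  # Z barely moves X

print(first_stage_F(z, x_strong))   # comfortably above 10
print(first_stage_F(z, x_weak))     # typically small: a weak first stage
```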

Classic example

Returns to education. You want to know if more schooling causes higher earnings. But ability confounds: smarter people get more education and earn more. OLS overstates the return.

Angrist & Krueger (1991) used quarter of birth as an instrument. Because of school-entry cutoffs and compulsory schooling laws, people born in Q1 start school older and reach the legal dropout age having completed slightly less schooling than Q4 births — so quarter of birth affects education (relevance) but presumably doesn’t affect earnings directly (exclusion).

When does IV fail?

  • Weak instruments: if \(Z\) barely moves \(X\), the first stage is weak and the IV estimate becomes wildly noisy and biased. Rule of thumb: first-stage F-statistic > 10.
  • Exclusion restriction violated: if \(Z\) affects \(Y\) through channels other than \(X\), the estimate is biased. This assumption is untestable — you argue it, you don’t prove it.
#| standalone: true
#| viewerHeight: 620

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
    .good { color: #27ae60; font-weight: bold; }
    .bad  { color: #e74c3c; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("n", "Sample size:",
                  min = 100, max = 2000, value = 500, step = 100),

      sliderInput("true_b", "True causal effect of X on Y:",
                  min = 0, max = 5, value = 2, step = 0.25),

      sliderInput("confound", "Confounding strength:",
                  min = 0, max = 5, value = 3, step = 0.25),

      sliderInput("inst_str", "Instrument strength:",
                  min = 0, max = 3, value = 1.5, step = 0.1),

      actionButton("go", "New draw", class = "btn-primary", width = "100%"),

      uiOutput("results")
    ),

    mainPanel(
      width = 9,
      plotOutput("iv_plot", height = "470px")
    )
  )
)

server <- function(input, output, session) {

  dat <- reactive({
    input$go
    n   <- input$n
    b   <- input$true_b
    cf  <- input$confound
    pi1 <- input$inst_str

    # Confounder (unobserved ability)
    u <- rnorm(n)

    # Instrument
    z <- rnorm(n)

    # Endogenous regressor: driven by instrument + confounder
    x <- pi1 * z + cf * u + rnorm(n)

    # Outcome: causal effect of x + confounder
    y <- b * x + cf * u + rnorm(n)

    # OLS (biased)
    ols <- lm(y ~ x)

    # 2SLS by hand
    first <- lm(x ~ z)
    x_hat <- fitted(first)
    second <- lm(y ~ x_hat)

    # First-stage F
    f_stat <- summary(first)$fstatistic[1]

    list(x = x, y = y, z = z, x_hat = x_hat,
         b_ols = coef(ols)[2],
         b_iv = coef(second)[2],
         first_coef = coef(first)[2],
         f_stat = f_stat,
         true_b = b, confound = cf, inst_str = pi1)
  })

  output$iv_plot <- renderPlot({
    d <- dat()
    par(mfrow = c(1, 2), mar = c(4.5, 4.5, 3, 1))

    # Left: OLS scatter (X vs Y)
    plot(d$x, d$y, pch = 16, cex = 0.4,
         col = adjustcolor("#3498db", 0.3),
         xlab = "X (endogenous)", ylab = "Y",
         main = "OLS: Y on X")
    abline(lm(d$y ~ d$x), col = "#e74c3c", lwd = 3)
    abline(a = 0, b = d$true_b, col = "#27ae60", lwd = 2, lty = 2)
    legend("topleft", bty = "n", cex = 0.85,
           legend = c(paste0("OLS = ", round(d$b_ols, 2)),
                      paste0("True = ", d$true_b)),
           col = c("#e74c3c", "#27ae60"), lwd = c(3, 2), lty = c(1, 2))

    # Right: IV scatter (X-hat vs Y)
    plot(d$x_hat, d$y, pch = 16, cex = 0.4,
         col = adjustcolor("#9b59b6", 0.3),
         xlab = expression(hat(X) ~ "(from first stage)"), ylab = "Y",
         main = expression("2SLS: Y on " * hat(X)))
    abline(lm(d$y ~ d$x_hat), col = "#e74c3c", lwd = 3)
    abline(a = 0, b = d$true_b, col = "#27ae60", lwd = 2, lty = 2)
    legend("topleft", bty = "n", cex = 0.85,
           legend = c(paste0("IV = ", round(d$b_iv, 2)),
                      paste0("True = ", d$true_b)),
           col = c("#e74c3c", "#27ae60"), lwd = c(3, 2), lty = c(1, 2))
  })

  output$results <- renderUI({
    d <- dat()
    ols_bias <- d$b_ols - d$true_b
    iv_bias  <- d$b_iv - d$true_b
    weak <- d$f_stat < 10

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>True effect:</b> ", d$true_b, "<br>",
        "<hr style='margin:6px 0'>",
        "<b>OLS:</b> ", round(d$b_ols, 3),
        " &nbsp; Bias: <span class='bad'>", round(ols_bias, 3), "</span><br>",
        "<hr style='margin:6px 0'>",
        "<b>IV (2SLS):</b> ", round(d$b_iv, 3),
        " &nbsp; Bias: <span class='", ifelse(abs(iv_bias) < abs(ols_bias), "good", "bad"), "'>",
        round(iv_bias, 3), "</span><br>",
        "<hr style='margin:6px 0'>",
        "<b>First-stage F:</b> ", round(d$f_stat, 1),
        if (weak) " <span class='bad'>&lt; 10 (weak!)</span>"
        else " <span class='good'>&ge; 10</span>"
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • Confounding = 3, instrument = 1.5: OLS is badly biased. IV recovers the true effect. This is the whole point.
  • Set confounding = 0: OLS and IV agree — when there’s no endogeneity, you don’t need an instrument.
  • Slide instrument strength toward 0: the first-stage F drops below 10. The IV estimate becomes erratic — sometimes worse than OLS. That’s the weak instrument problem.
  • Increase sample size with a weak instrument: it doesn’t help much. The 2SLS bias toward OLS is governed by the first-stage F, not by \(n\) directly — extra observations help only insofar as they strengthen the first stage, and with a near-zero first stage they barely do.
  • True effect = 0, confounding = 3: OLS “finds” a large effect. IV correctly shows ~0.
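
The weak-instrument behavior is easy to reproduce outside the app. A small pure-Python Monte Carlo (illustrative parameter values; with one instrument the 2SLS slope reduces to \(\mathrm{cov}(Z,Y)/\mathrm{cov}(Z,X)\)) compares the spread of IV estimates under a strong versus a near-zero first stage:

```python
import random, statistics

def iv_slope(n, b, cf, pi1, rng):
    """One simulated draw of the single-instrument IV estimate cov(z,y)/cov(z,x)."""
    u = [rng.gauss(0, 1) for _ in range(n)]
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [pi1*zi + cf*ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
    y = [b*xi + cf*ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]
    mz, mx, my = sum(z)/n, sum(x)/n, sum(y)/n
    czy = sum((zi - mz)*(yi - my) for zi, yi in zip(z, y))
    czx = sum((zi - mz)*(xi - mx) for zi, xi in zip(z, x))
    return czy / czx

rng = random.Random(1)
strong = [iv_slope(500, 2, 3, 1.5, rng) for _ in range(200)]
weak   = [iv_slope(500, 2, 3, 0.1, rng) for _ in range(200)]

print(statistics.stdev(strong))   # estimates cluster tightly near the truth
print(statistics.stdev(weak))     # far more dispersed: weak-instrument noise
```

The weak draws occasionally divide by a near-zero sample covariance, which is exactly why the app’s IV estimate goes haywire as instrument strength approaches 0.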

What does IV actually estimate?

A subtle point: IV doesn’t estimate the effect for everyone. It estimates the Local Average Treatment Effect (LATE) — the effect for compliers, people whose treatment status is actually changed by the instrument.

In the Angrist & Krueger example: IV estimates the return to education for people who would have dropped out if born in a different quarter. It says nothing about people who would have stayed in school regardless.

This means two different valid instruments can give you two different IV estimates — not because one is wrong, but because they’re identifying effects for different subpopulations.
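
A toy simulation makes this concrete (pure Python; the compliance shares and per-type effects are made up for illustration). With a binary instrument and binary treatment, the IV estimate is the Wald ratio — the reduced-form effect of \(Z\) on \(Y\) divided by the first-stage effect of \(Z\) on treatment — and it recovers the compliers’ effect, not the population average:

```python
import random

rng = random.Random(3)
n = 100_000

# Three compliance types with different treatment effects:
# compliers (50%) gain 1, always-takers (20%) gain 3,
# never-takers (30%) would gain 2 but are never treated.
effects = {"complier": 1.0, "always": 3.0, "never": 2.0}

y_sum = [0.0, 0.0]   # outcome sums by instrument value
d_sum = [0.0, 0.0]   # treatment sums by instrument value
cnt   = [0, 0]

for _ in range(n):
    r = rng.random()
    t = "complier" if r < 0.5 else ("always" if r < 0.7 else "never")
    z = rng.random() < 0.5                       # randomized instrument
    d = t == "always" or (t == "complier" and z) # who actually gets treated
    y = effects[t]*d + rng.gauss(0, 1)
    i = int(z)
    cnt[i] += 1; d_sum[i] += d; y_sum[i] += y

wald = (y_sum[1]/cnt[1] - y_sum[0]/cnt[0]) / (d_sum[1]/cnt[1] - d_sum[0]/cnt[0])
print(wald)   # ≈ 1: the complier effect, not the population average of 1.7
```

The always-takers’ larger effect never shows up in the Wald ratio, because the instrument never changes their treatment status.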


In Stata

* Two-stage least squares
ivregress 2sls outcome x1 x2 (treatment = instrument)

* First-stage F statistic (check relevance)
estat firststage

* Overidentification test (with multiple instruments)
ivregress 2sls outcome x1 (treatment = inst1 inst2)
estat overid

* Manually run the two stages (to see what's happening)
reg treatment instrument x1 x2          /* first stage */
predict treatment_hat, xb
reg outcome treatment_hat x1 x2         /* second stage */

The first-stage F should be well above 10 (the Staiger & Stock rule of thumb). If it’s weak, the IV estimate is unreliable — possibly more biased than OLS. One caveat on the manual version: running the two stages by hand reproduces the 2SLS point estimate, but the second-stage standard errors are wrong because they ignore that \(\hat{X}\) is itself estimated — use ivregress for inference.


Did you know?

  • The instrumental variables method dates back to Philip Wright (1928), who used it to estimate supply and demand curves for butter and flax seed. Some historians credit his son, Sewall Wright, with the actual derivation.

  • The “weak instruments” problem was formalized by Staiger & Stock (1997). They showed that when the first-stage F is below 10, IV can be more biased than OLS — the cure becomes worse than the disease.

  • Joshua Angrist, one of the 2021 Nobel laureates, built much of his career on clever instruments: quarter of birth for schooling, draft lottery numbers for military service, sibling sex composition for family size. The art is finding instruments that are both relevant and excludable.