Sample Selection & Heckman Correction
The problem
Sometimes the outcome you want to study is only observed for a non-random subset of the population:
- Wages are only observed for people who work. If people with low potential wages choose not to work, the average wage among workers overstates the population average wage offer, and OLS on workers is biased.
- Test scores are only observed for students who stay enrolled. If struggling students drop out, average scores look artificially high.
- Stock returns are only observed for companies that survive (survivorship bias).
When selection into the sample correlates with the outcome, estimates on the selected subsample are biased. This isn't a standard omitted-variable problem: the sample itself is non-random, so adding controls to the regression cannot fix it.
The Heckman model
James Heckman (Nobel 2000) formalized this as a two-equation system:
Selection equation (who is observed): \[S_i^* = Z_i'\gamma + u_i, \qquad S_i = \mathbf{1}(S_i^* > 0)\]
Outcome equation (what we want to estimate): \[Y_i = X_i'\beta + \varepsilon_i, \qquad \text{observed only if } S_i = 1\]
If the errors \((u_i, \varepsilon_i)\) are correlated — i.e., the unobservables that drive selection also affect the outcome — then \(E[\varepsilon_i \mid S_i = 1] \neq 0\), and OLS on the selected sample is biased.
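Under joint normality (with \(\mathrm{Var}(u_i) = 1\)), this conditional expectation has a closed form:

\[E[\varepsilon_i \mid Z_i, S_i = 1] = \rho\sigma_\varepsilon \, \frac{\phi(Z_i'\gamma)}{\Phi(Z_i'\gamma)}\]

so the regression function on the selected sample is \(X_i'\beta\) plus an omitted nonlinear term in \(Z_i'\gamma\). Estimating that term and adding it back is exactly what the correction below does.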
Two-step correction
Step 1: Probit. Estimate the selection equation by probit to get \(\hat{\gamma}\). Compute the inverse Mills ratio for each observation:
\[\hat{\lambda}_i = \frac{\phi(Z_i'\hat{\gamma})}{\Phi(Z_i'\hat{\gamma})}\]
where \(\phi\) is the standard normal density and \(\Phi\) the standard normal CDF. The Mills ratio is the expected value of the truncated selection error, \(E[u_i \mid u_i > -Z_i'\hat{\gamma}]\): it is largest for observations that were least likely to be selected.
Step 2: OLS with correction. Include \(\hat{\lambda}_i\) in the outcome regression:
\[Y_i = X_i'\beta + \rho\sigma_\varepsilon \hat{\lambda}_i + \eta_i\]
The coefficient on the Mills ratio estimates the selection bias term \(\rho\sigma_\varepsilon\). If it is significantly different from zero, selection bias is present. Because \(\hat{\lambda}_i\) is itself an estimate, the usual OLS standard errors in this second step are incorrect and need a generated-regressor adjustment.
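The two steps can be sketched in a few lines on simulated data. All the numbers below (\(\rho = 0.6\), true \(\beta = 1\), \(\sigma_\varepsilon = 2\), the selection-index coefficients) are illustrative choices, not part of the method:

```r
set.seed(1)
n    <- 5000
educ <- rnorm(n, mean = 12, sd = 3)
z    <- rnorm(n)                               # excluded instrument
u    <- rnorm(n)                               # selection error
eps  <- 0.6 * u + sqrt(1 - 0.6^2) * rnorm(n)   # outcome error, rho = 0.6
selected <- (-3.5 + 0.3 * educ - 1.5 * z + u) > 0  # roughly half select in
wage <- 5 + 1 * educ + 2 * eps                 # true beta = 1, sigma_eps = 2

# Step 1: probit for selection, then the inverse Mills ratio
probit <- glm(selected ~ educ + z, family = binomial(link = "probit"))
xg     <- predict(probit, type = "link")       # Z'gamma-hat
mills  <- dnorm(xg) / pnorm(xg)

# Step 2: outcome OLS on the selected sample, Mills ratio included
naive <- lm(wage ~ educ, subset = selected)
heck  <- lm(wage ~ educ + mills, subset = selected)
c(naive = unname(coef(naive)["educ"]), heckman = unname(coef(heck)["educ"]))
```

The corrected `educ` slope should land much closer to the true value of 1 than the naive one, and the `mills` coefficient estimates \(\rho\sigma_\varepsilon\) (here about \(0.6 \times 2 = 1.2\)).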
The exclusion restriction
For the model to be well-identified, you need at least one variable in \(Z\) (selection equation) that is not in \(X\) (outcome equation) — a variable that affects whether you’re observed but not the outcome itself.
This is conceptually identical to the exclusion restriction in IV. Without it, identification relies solely on the nonlinearity of the probit — which is fragile.
Good examples: distance to a job center (affects employment but not wages directly), having young children (affects labor force participation), draft lottery number.
Bad examples: variables that directly affect both selection and the outcome.
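A quick way to see why nonlinearity-only identification is fragile: when \(Z\) contains nothing beyond \(X\), the correction term \(\lambda(X'\gamma)\) is a smooth function of the same index that already enters the outcome equation, and over moderate index values \(\lambda(\cdot)\) is almost linear. The index range below is an illustrative choice:

```r
# With no excluded variable, the Mills ratio depends only on the index
# X'gamma.  Over a moderate range it is nearly linear in that index, so
# the second-step regression is close to perfectly collinear.
idx   <- seq(-1, 1, length.out = 200)   # probit index values, Z = X case
mills <- dnorm(idx) / pnorm(idx)        # inverse Mills ratio
cor(idx, mills)
```

The correlation comes out around \(-0.99\): the Mills ratio adds almost no independent variation, which is why the corrected estimates become unstable without a genuine excluded variable.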
Simulation
Wages are observed only for workers. Selection into work correlates with unobserved ability (which also affects wages). Compare OLS on selected sample (biased) vs Heckman correction vs the population truth.
#| standalone: true
#| viewerHeight: 750
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.eq-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-bottom: 14px; font-size: 14px; line-height: 1.9;
}
.eq-box b { color: #2c3e50; }
.match { color: #27ae60; font-weight: bold; }
.coef { color: #e74c3c; font-weight: bold; }
"))),
sidebarLayout(
sidebarPanel(
width = 4,
sliderInput("n", "Population size:",
min = 500, max = 3000, value = 1000, step = 250),
sliderInput("rho", HTML("Error correlation (ρ):"),
min = -0.9, max = 0.9, value = 0.6, step = 0.1),
sliderInput("gamma_z", "Selection instrument strength:",
min = 0.5, max = 3, value = 1.5, step = 0.25),
sliderInput("beta", HTML("True return to education (β):"),
min = 0.5, max = 3, value = 1, step = 0.25),
actionButton("resim", "New draw", class = "btn-primary", width = "100%"),
uiOutput("results_box")
),
mainPanel(
width = 8,
plotOutput("main_plot", height = "450px"),
uiOutput("explain_box")
)
)
)
server <- function(input, output, session) {
dat <- reactive({
input$resim
n <- input$n
rho <- input$rho
gamma_z <- input$gamma_z
beta <- input$beta
# Education (observed)
educ <- rnorm(n, mean = 12, sd = 3)
# Exclusion restriction: distance to job center
# Affects selection but not wages
distance <- rnorm(n)
# Correlated errors
e1 <- rnorm(n)
e2 <- rnorm(n)
u <- e1 # selection error
eps <- rho * e1 + sqrt(1 - rho^2) * e2 # outcome error
# Latent selection
s_star <- -3.5 + 0.3 * educ - gamma_z * distance + u # intercept centered so roughly half the population selects in
selected <- s_star > 0
# Wages (latent for everyone)
wage <- 5 + beta * educ + 2 * eps
# Observed wages
wage_obs <- ifelse(selected, wage, NA)
# OLS on full population (oracle)
ols_full <- lm(wage ~ educ)
# OLS on selected sample (biased)
ols_sel <- lm(wage[selected] ~ educ[selected])
# Heckman two-step
# Step 1: probit for selection
probit_fit <- glm(selected ~ educ + distance,
family = binomial(link = "probit"))
# Inverse Mills ratio for selected observations
xg <- predict(probit_fit, type = "link")
mills <- dnorm(xg) / pnorm(xg)
mills_sel <- mills[selected]
# Step 2: OLS with Mills ratio
heckman_fit <- lm(wage[selected] ~ educ[selected] + mills_sel)
# Prediction grids
educ_grid <- seq(min(educ), max(educ), length.out = 100)
pred_full <- coef(ols_full)[1] + coef(ols_full)[2] * educ_grid
pred_sel <- coef(ols_sel)[1] + coef(ols_sel)[2] * educ_grid
pred_heck <- coef(heckman_fit)[1] + coef(heckman_fit)[2] * educ_grid
list(educ = educ, wage = wage, selected = selected,
educ_grid = educ_grid,
pred_full = pred_full, pred_sel = pred_sel,
pred_heck = pred_heck,
coef_full = coef(ols_full),
coef_sel = coef(ols_sel),
coef_heck = coef(heckman_fit),
pct_selected = round(100 * mean(selected), 1),
beta = beta, rho = rho)
})
output$main_plot <- renderPlot({
d <- dat()
par(mar = c(5, 5, 4, 2))
# Plot unselected (faded)
plot(d$educ[!d$selected], d$wage[!d$selected],
pch = 16, col = adjustcolor("#e74c3c", 0.15), cex = 0.5,
xlab = "Education (years)", ylab = "Wage",
main = "Sample Selection Bias",
xlim = range(d$educ),
ylim = quantile(d$wage, c(0.01, 0.99)))
# Plot selected
points(d$educ[d$selected], d$wage[d$selected],
pch = 16, col = adjustcolor("#3498db", 0.25), cex = 0.5)
# Regression lines
lines(d$educ_grid, d$pred_full, col = "#2c3e50", lwd = 2.5, lty = 2)
lines(d$educ_grid, d$pred_sel, col = "#e74c3c", lwd = 2.5)
lines(d$educ_grid, d$pred_heck, col = "#27ae60", lwd = 2.5)
legend("topleft", bty = "n", cex = 0.85,
legend = c(
"Not selected (unobserved)",
"Selected (observed)",
paste0("OLS on all (true, slope=",
round(d$coef_full[2], 2), ")"),
paste0("OLS on selected (biased, slope=",
round(d$coef_sel[2], 2), ")"),
paste0("Heckman corrected (slope=",
round(d$coef_heck[2], 2), ")")
),
col = c(adjustcolor("#e74c3c", 0.3),
adjustcolor("#3498db", 0.5),
"#2c3e50", "#e74c3c", "#27ae60"),
pch = c(16, 16, NA, NA, NA),
lwd = c(NA, NA, 2.5, 2.5, 2.5),
lty = c(NA, NA, 2, 1, 1))
})
output$results_box <- renderUI({
d <- dat()
mills_coef <- round(d$coef_heck[3], 3)
bias_sign <- if (d$rho > 0) "upward" else if (d$rho < 0) "downward" else "none"
tags$div(class = "eq-box", style = "margin-top: 16px;",
HTML(paste0(
"<b>True β:</b> ", d$beta, "<br>",
"<b>OLS (full population):</b> ", round(d$coef_full[2], 3), "<br>",
"<b>OLS (selected only):</b> <span class='coef'>",
round(d$coef_sel[2], 3), "</span><br>",
"<b>Heckman corrected:</b> <span class='match'>",
round(d$coef_heck[2], 3), "</span><br>",
"<hr style='margin:8px 0'>",
"<b>Mills ratio coef:</b> ", mills_coef, "<br>",
"<b>Selected:</b> ", d$pct_selected, "% of population<br>",
"<small>Selection bias direction: ", bias_sign, "</small>"
))
)
})
output$explain_box <- renderUI({
d <- dat()
tags$div(class = "eq-box", style = "margin-top: 8px;",
HTML(paste0(
"<b>Faded red dots</b> = people not in the sample (would be invisible in practice). ",
"<b>Blue dots</b> = observed sample. ",
"When ρ > 0, people who select in tend to have positive unobservable shocks ",
"(higher ability), biasing the wage regression <b>upward</b>. ",
"The Heckman correction (green) recovers the population slope."
))
)
})
}
shinyApp(ui, server)
Things to try
- \(\rho = 0\): no correlation between selection and outcome errors. OLS on the selected sample is unbiased — no correction needed.
- \(\rho = 0.6\): strong positive selection. Workers have higher unobserved ability than non-workers. The naive slope is biased upward. Heckman pulls it back.
- \(\rho = -0.5\): negative selection (e.g., people with high outside options don’t work). The naive slope is biased downward.
- Weak instrument (\(\gamma_z = 0.5\)): the exclusion restriction is weak. Heckman correction becomes unstable.
- Large \(n\) (2000+): the Heckman-corrected estimate tightens around the true value.
The bottom line
- Sample selection bias arises when who you observe is correlated with what you’re measuring. It’s a form of endogeneity.
- The Heckman correction models selection explicitly using a probit and corrects the outcome equation with the inverse Mills ratio.
- Like IV, the method requires an exclusion restriction — a variable that shifts selection without directly affecting the outcome. Without it, identification is fragile.
- The coefficient on the Mills ratio tells you whether (and how much) selection bias exists.
Connections
- Selection on Observables — When selection depends on observables, you can condition on them directly. Heckman is for when selection depends on unobservables.
- Instrumental Variables — The exclusion restriction in Heckman is conceptually identical to the IV exclusion restriction.
- Limited Dependent Variables — The selection equation is a probit; the Tobit model is the special case where the same latent index drives both selection and the outcome, so the data are censored rather than incidentally truncated.
Did you know?
- James Heckman shared the 2000 Nobel Prize in Economics for developing methods to deal with sample selection bias. His 1979 paper “Sample Selection Bias as a Specification Error” is one of the most cited papers in economics.
- The inverse Mills ratio is named after John P. Mills, who in 1926 tabulated the ratio of the normal survival function to the density (the Mills ratio); the "inverse" version used here is its reciprocal. It shows up in any problem involving truncated normal distributions.
- Heckman’s original application was to female labor supply. He asked: what is the wage offer for women who choose not to work? Since we never observe their wages, naive estimation on working women overstates the average wage offer.
- The two-step estimator is consistent but inefficient. Full maximum likelihood estimation of both equations jointly is more efficient but computationally harder. In practice, the two-step version dominates because it’s simpler and robust enough.