Two-Part Models
The problem with healthcare spending data
Healthcare spending data has two features that make standard OLS a poor choice:
Mass at zero. Many people have zero medical spending in a given period. In a typical month, roughly half the US population incurs no healthcare costs. In a year, about 15–20% have zero spending.
Right skew. Among those who do spend, the distribution is heavily right-skewed. A few patients with serious conditions account for an enormous share of total spending — the top 5% of spenders account for roughly 50% of all healthcare costs.
OLS on raw spending is badly misspecified. It can predict negative spending (impossible), it treats the zeros and the positive values as arising from the same process (implausible), and it is heavily influenced by the right tail.
Three approaches
OLS on raw spending. Just regress spending on covariates. Simple, but it can predict negative values and is misspecified for both the mass at zero and the skew.
Tobit. Assumes a single latent variable drives both the decision to spend and the amount spent. Spending is censored at zero: \(Y^* = X\beta + \varepsilon\), and we observe \(Y = \max(0, Y^*)\). The Tobit assumes the same \(\beta\) governs both the probability of any spending and the amount conditional on spending. This is often unrealistic — the factors that determine whether you go to the doctor at all may differ from the factors that determine how much you spend once you go.
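The Tobit's single-index censoring can be sketched as a data-generating process in a few lines of R (the numbers here are illustrative, not from the app). Note how the same coefficient on X moves both the share of zeros and the observed amounts:

```r
set.seed(1)
n <- 5000
X <- rbinom(n, 1, 0.5)
Y_star <- 500 + 800 * X + rnorm(n, sd = 2000)  # latent "desired" spending
Y <- pmax(0, Y_star)                           # observed spending, censored at zero
mean(Y == 0)                                   # the same beta drives the zero share
```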
Two-part model. Separates the process into two parts:
- Part 1 (extensive margin): Model \(P(Y > 0 | X)\) using logit or probit
- Part 2 (intensive margin): Model \(E[Y | Y > 0, X]\) using OLS on log spending, GLM with gamma family, or similar
The predicted spending is: \(\hat{E}[Y|X] = \hat{P}(Y > 0 | X) \times \hat{E}[Y | Y > 0, X]\)
The two-part model allows different covariates to have different effects on the two margins. A treatment might increase the probability of using care (Part 1) without changing how much is spent conditional on use (Part 2), or vice versa. See Limited Dependent Variables for the mechanics of logit/probit.
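A two-part fit can be sketched in a few lines of R. The data-generating process, coefficients, and variable names below are illustrative, not taken from the app; Part 2 here uses a gamma GLM with log link rather than log-OLS:

```r
set.seed(42)
n <- 2000
X   <- rbinom(n, 1, 0.5)                        # binary treatment
use <- rbinom(n, 1, plogis(-0.2 + 0.4 * X))     # Part 1 DGP: any care use
Y   <- use * rgamma(n, shape = 2, rate = 2 / (3000 + 500 * X))  # Part 2 DGP

# Part 1 (extensive margin): logit for P(Y > 0 | X)
use01 <- as.integer(Y > 0)
p1 <- glm(use01 ~ X, family = binomial(link = "logit"))

# Part 2 (intensive margin): gamma GLM with log link on positive spending
p2 <- glm(Y ~ X, family = Gamma(link = "log"), subset = Y > 0)

# Combine: E-hat[Y | X] = P-hat(Y > 0 | X) * E-hat[Y | Y > 0, X]
newd <- data.frame(X = c(0, 1))
e_y  <- predict(p1, newd, type = "response") * predict(p2, newd, type = "response")
e_y[2] - e_y[1]  # estimated effect of X on unconditional mean spending
```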
#| standalone: true
#| viewerHeight: 700
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
.good { color: #27ae60; font-weight: bold; }
.bad { color: #e74c3c; font-weight: bold; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("frac_zero", "Fraction with zero spending:",
min = 0.1, max = 0.8, value = 0.5, step = 0.05),
sliderInput("mean_pos", "Mean spending among users ($):",
min = 500, max = 20000, value = 3000, step = 500),
sliderInput("skewness", "Gamma shape (lower = more skew):",
min = 0.5, max = 5, value = 2, step = 0.25),
sliderInput("n_samp", "Sample size:",
min = 200, max = 2000, value = 500, step = 100),
sliderInput("effect_x", "True effect of X on E[spending] ($):",
min = -2000, max = 5000, value = 500, step = 100),
actionButton("resim", "New draw", class = "btn-primary", width = "100%"),
tags$hr(),
uiOutput("info_box")
),
mainPanel(
width = 9,
plotOutput("hist_plot", height = "280px"),
plotOutput("est_plot", height = "300px")
)
)
)
server <- function(input, output, session) {
dat <- reactive({
input$resim
n <- input$n_samp
fz <- input$frac_zero
mu <- input$mean_pos
shape <- input$skewness
bx <- input$effect_x
set.seed(sample(1:10000, 1))
# Generate X (binary treatment)
X <- rbinom(n, 1, 0.5)
# True DGP: two-part
# Part 1: P(Y > 0) depends on X
# True effect on extensive margin
p_base <- 1 - fz
# X increases probability of any spending
b_ext <- 0.1 * sign(bx) * min(1, abs(bx) / 2000)
p_use <- pmin(0.99, pmax(0.01, p_base + b_ext * X))
any_use <- rbinom(n, 1, p_use)
# Part 2: E[Y | Y > 0] depends on X
# Gamma distribution for positive values: mean mu_cond, fixed shape
# Effect on intensive margin
b_int <- bx - b_ext * mu # remaining effect goes to the intensive margin
mu_cond <- mu + b_int * X
mu_cond <- pmax(100, mu_cond)
Y_pos <- rgamma(n, shape = shape, rate = shape / mu_cond)
Y <- any_use * Y_pos
# True effect on E[Y]:
# E[Y|X=1] - E[Y|X=0] = (p_base + b_ext) * (mu + b_int) - p_base * mu
true_ey1 <- (p_base + b_ext) * (mu + b_int)
true_ey0 <- p_base * mu
true_effect <- true_ey1 - true_ey0
# Estimate 1: OLS
ols_fit <- lm(Y ~ X)
ols_est <- coef(ols_fit)[2]
# Estimate 2: Tobit, fit by maximizing the censored-normal log-likelihood with optim()
tobit_ll <- function(par) {
mu_t <- par[1] + par[2] * X
sig <- exp(par[3])
ll <- ifelse(Y == 0,
log(pmax(1e-10, pnorm(-mu_t / sig))),
dnorm(Y, mu_t, sig, log = TRUE))
-sum(ll)
}
tobit_start <- c(mean(Y), ols_est, log(sd(Y[Y > 0])))
tobit_fit <- tryCatch(
optim(tobit_start, tobit_ll, method = "Nelder-Mead"),
error = function(e) list(par = c(0, ols_est, log(sd(Y[Y>0]))))
)
# Tobit marginal effect on E[Y]: beta * Phi(Xb/sigma)
tobit_b <- tobit_fit$par[2]
tobit_sig <- exp(tobit_fit$par[3])
tobit_xb <- mean(tobit_fit$par[1] + tobit_fit$par[2] * X)
tobit_est <- tobit_b * pnorm(tobit_xb / tobit_sig)
# Estimate 3: Two-part model
# Part 1: logit for P(Y > 0)
use_indicator <- as.numeric(Y > 0)
p1_fit <- glm(use_indicator ~ X, family = binomial(link = "logit"))
p1_pred0 <- predict(p1_fit, newdata = data.frame(X = 0), type = "response")
p1_pred1 <- predict(p1_fit, newdata = data.frame(X = 1), type = "response")
# Part 2: OLS on log(Y) for Y > 0, then smearing
pos <- Y > 0
if (sum(pos & X == 1) > 5 && sum(pos & X == 0) > 5) {
p2_fit <- lm(log(Y[pos]) ~ X[pos])
# Duan smearing
resids <- residuals(p2_fit)
smear <- mean(exp(resids))
e_y_given_pos_0 <- exp(coef(p2_fit)[1]) * smear
e_y_given_pos_1 <- exp(coef(p2_fit)[1] + coef(p2_fit)[2]) * smear
tp_est <- p1_pred1 * e_y_given_pos_1 - p1_pred0 * e_y_given_pos_0
} else {
tp_est <- ols_est # too few positive observations in a cell; fall back to OLS
}
list(Y = Y, X = X, n = n,
true_effect = true_effect, ols_est = ols_est,
tobit_est = tobit_est, tp_est = tp_est,
fz = fz, mu = mu)
})
output$hist_plot <- renderPlot({
d <- dat()
par(mar = c(5, 5, 3, 1))
# Histogram with spike at zero
breaks_pos <- seq(0, max(d$Y) * 1.1, length.out = 40)
h <- hist(d$Y[d$Y > 0], breaks = breaks_pos, plot = FALSE)
# Count zeros
n_zero <- sum(d$Y == 0)
n_pos <- sum(d$Y > 0)
y_max <- max(h$counts, n_zero) * 1.2
plot(NULL, xlim = c(-max(d$Y) * 0.05, max(d$Y) * 1.05),
ylim = c(0, y_max),
xlab = "Healthcare spending ($)", ylab = "Count",
main = paste0("Distribution of Spending (n=", d$n, ")"),
cex.lab = 1.1)
# Zero bar
rect(-max(d$Y) * 0.02, 0, max(d$Y) * 0.02, n_zero,
col = adjustcolor("#e74c3c", 0.6), border = "#e74c3c")
text(0, n_zero + y_max * 0.05,
paste0(n_zero, " zeros\n(", round(n_zero / d$n * 100), "%)"),
col = "#e74c3c", cex = 0.8, font = 2)
# Positive values
rect(h$breaks[-length(h$breaks)], 0, h$breaks[-1], h$counts,
col = adjustcolor("#3498db", 0.5), border = "#3498db")
# Mark mean of positives
mean_pos <- mean(d$Y[d$Y > 0])
abline(v = mean_pos, lty = 2, col = "#2c3e50", lwd = 1.5)
text(mean_pos, y_max * 0.9,
paste0("Mean|Y>0: $", format(round(mean_pos), big.mark = ",")),
col = "#2c3e50", cex = 0.8, adj = -0.1)
})
output$est_plot <- renderPlot({
d <- dat()
par(mar = c(5, 8, 3, 2))
estimates <- c(d$ols_est, d$tobit_est, d$tp_est)
names_est <- c("OLS", "Tobit", "Two-Part")
cols <- c("#3498db", "#9b59b6", "#27ae60")
x_range <- max(abs(c(estimates, d$true_effect))) * 1.5
if (x_range < 100) x_range <- 500
bp <- barplot(estimates, names.arg = names_est,
horiz = TRUE, col = cols,
main = "Estimated Effect of X on E[Spending]",
xlab = "Estimated effect ($)",
xlim = c(min(0, -x_range * 0.3), x_range),
las = 1, cex.lab = 1.1, cex.names = 1)
# True effect line
abline(v = d$true_effect, lwd = 3, lty = 2, col = "#e74c3c")
text(d$true_effect, max(bp) + 0.5,
paste0("True: $", round(d$true_effect)),
col = "#e74c3c", cex = 0.9, font = 2, xpd = TRUE)
# Add estimate values on bars
text(pmax(estimates, 0) + x_range * 0.03, bp,
paste0("$", round(estimates)),
cex = 0.9, adj = 0, col = cols, font = 2)
# Add bias labels
bias <- estimates - d$true_effect
bias_txt <- paste0("bias: ", ifelse(bias >= 0, "+", ""),
round(bias))
text(x_range * 0.85, bp, bias_txt, cex = 0.8,
col = ifelse(abs(bias) < abs(d$true_effect) * 0.15, "#27ae60", "#e74c3c"))
})
output$info_box <- renderUI({
d <- dat()
bias_ols <- d$ols_est - d$true_effect
bias_tob <- d$tobit_est - d$true_effect
bias_tp <- d$tp_est - d$true_effect
best <- which.min(abs(c(bias_ols, bias_tob, bias_tp)))
best_name <- c("OLS", "Tobit", "Two-Part")[best]
tags$div(class = "stats-box",
HTML(paste0(
"<b>True effect:</b> $", round(d$true_effect), "<br>",
"<hr style='margin:8px 0'>",
"<b>OLS:</b> $", round(d$ols_est),
" (bias: <span class='", ifelse(abs(bias_ols) < abs(d$true_effect)*0.15, "good", "bad"),
"'>", round(bias_ols), "</span>)<br>",
"<b>Tobit:</b> $", round(d$tobit_est),
" (bias: <span class='", ifelse(abs(bias_tob) < abs(d$true_effect)*0.15, "good", "bad"),
"'>", round(bias_tob), "</span>)<br>",
"<b>Two-Part:</b> $", round(d$tp_est),
" (bias: <span class='", ifelse(abs(bias_tp) < abs(d$true_effect)*0.15, "good", "bad"),
"'>", round(bias_tp), "</span>)<br>",
"<hr style='margin:8px 0'>",
"<b>Best:</b> <span class='good'>", best_name, "</span>"
))
)
})
}
shinyApp(ui, server)
Things to try
- Set fraction zero to 0.7 (lots of non-users): the spike at zero dominates the distribution and the OLS estimate becomes much noisier. The two-part model handles this well by modeling the zeros separately.
- Set the gamma shape parameter low (0.5–1): the positive distribution becomes heavily right-skewed with a long tail. OLS is influenced by extreme values; the two-part model with a log link is more robust.
- Set the effect to 0 and click “New draw” several times: watch how each estimator performs under the null. OLS should be unbiased on average but noisy.
- Compare with fraction zero = 0.1 (few zeros): the three methods converge because the zero problem is less severe.
- Large sample (n = 2000) vs small (n = 200): all estimators improve, but the two-part model’s advantage in handling zeros persists.
The Duan smearing estimator
When Part 2 of the two-part model uses log-transformed spending, you need to retransform predictions back to the dollar scale. A naive approach is to exponentiate the fitted values: \(\hat{E}[Y|Y>0, X] = \exp(\hat{\beta}_0 + \hat{\beta}_1 X)\). But this is biased downward, by Jensen's inequality: \(\exp(E[\log Y]) \leq E[Y]\), with strict inequality whenever \(\log Y\) has positive variance.
Duan’s smearing estimator (1983) corrects for this. Instead of exponentiating the predicted log value, multiply by the smearing factor: the sample average of exponentiated residuals from the log regression.
\[\hat{E}[Y | Y > 0, X] = \exp(X\hat{\beta}) \times \underbrace{\frac{1}{n} \sum_{i=1}^n \exp(\hat{\varepsilon}_i)}_{\text{smearing factor}}\]
This is what the two-part model in the app above does. The smearing factor is typically between 1.1 and 2.0 for health spending data.
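As a quick numeric check on simulated data (with an illustrative \(\sigma\)): when the log-scale errors are normal, the smearing factor should land near the normal-theory retransformation factor \(\exp(\sigma^2/2)\):

```r
set.seed(7)
sigma <- 0.9
logY  <- 7.5 + rnorm(4000, sd = sigma)   # log spending among users, normal errors
fit   <- lm(logY ~ 1)
smear <- mean(exp(residuals(fit)))       # Duan's smearing factor
c(duan = smear, normal_theory = exp(sigma^2 / 2))
```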
Manning et al. (1987) showed that the retransformation problem can introduce substantial bias if the error distribution on the log scale is not normal or is heteroskedastic. Modern practice often uses GLM with a gamma family and log link instead of OLS on logs — this avoids the retransformation problem entirely because the model is specified on the original scale.
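A minimal sketch of that alternative on simulated positive spending (coefficients and values are illustrative): because the gamma GLM with log link models \(E[Y|Y>0,X]\) directly on the dollar scale, `predict(..., type = "response")` needs no smearing correction.

```r
set.seed(3)
n  <- 3000
X  <- rbinom(n, 1, 0.5)
mu <- exp(8 + 0.25 * X)                    # conditional mean on the dollar scale
Y  <- rgamma(n, shape = 2, rate = 2 / mu)  # positive spending only
fit <- glm(Y ~ X, family = Gamma(link = "log"))
predict(fit, data.frame(X = c(0, 1)), type = "response")  # E[Y | Y>0, X], in dollars
```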
Did you know?
Duan’s smearing estimator (1983) was developed specifically for the RAND HIE data analysis. Naihua Duan was a statistician at RAND who realized that the standard retransformation of log-linear models was biased. His nonparametric correction became standard practice in health economics.
Manning, Newhouse, Duan, Keeler, and Leibowitz (1987) — the definitive RAND HIE spending analysis — showed that the choice of model for healthcare spending matters enormously. Their comparison of OLS, log-OLS with smearing, and two-part models remains one of the most cited methods papers in health economics.
In a typical month, about 50% of Americans have zero medical spending. In a year, about 15–20% have zero spending. The top 1% of spenders account for about 23% of total spending, and the top 5% account for about 50%. This extreme concentration makes healthcare spending one of the most challenging outcome variables in applied economics.
The two-part model is closely related to the hurdle model in statistics. The key difference from a Tobit is philosophical: the Tobit assumes a single latent process generates both the zeros and the positive values (you want care but are “censored” at zero), while the two-part model allows two separate processes (the decision to seek care is different from how much you spend once you do).