Hierarchical Models
The problem
You have data from \(K\) groups — schools, hospitals, factories, regions — and you want to estimate a parameter (say, the mean) for each group. Two extreme approaches:
- No pooling: estimate each group separately using only its own data. With small groups, the estimates are noisy and unreliable.
- Complete pooling: ignore groups entirely and estimate a single grand mean. This throws away real group differences.
Neither is satisfying. No pooling overfits to noise. Complete pooling underfits by ignoring structure. You want something in between.
Hierarchical models: partial pooling
A hierarchical (multilevel) model learns the group-level variation from the data and uses it to shrink each group’s estimate toward the grand mean — more for small groups, less for large groups.
The model:
\[y_{ij} \sim N(\theta_j, \sigma^2) \quad \text{(data within group } j\text{)}\] \[\theta_j \sim N(\mu, \tau^2) \quad \text{(group means come from a population)}\]
- \(\theta_j\) is the true mean for group \(j\)
- \(\mu\) is the overall population mean
- \(\tau^2\) is the between-group variance (how different groups really are)
- \(\sigma^2\) is the within-group variance (noise)
The posterior for each \(\theta_j\) is a weighted average of the group’s own data and the grand mean:
\[\hat{\theta}_j^{\text{partial}} = w_j \cdot \bar{y}_j + (1 - w_j) \cdot \hat{\mu}\]
where the weight \(w_j = \frac{n_j / \sigma^2}{n_j / \sigma^2 + 1/\tau^2}\) depends on the group sample size \(n_j\). Small groups shrink more because their data is less informative relative to the population prior.
This is the same shrinkage logic from the shrinkage page, but now applied within a formal model that estimates \(\tau^2\) from the data.
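To make the formula concrete, here is a minimal base-R sketch with \(\mu\), \(\tau\), and \(\sigma\) treated as known (all values are assumptions chosen for illustration; the app below drops that assumption and estimates them from the data):

```r
# Partial pooling with known mu, tau, sigma (illustration only)
set.seed(1)
n     <- c(5, 5, 10, 10, 20, 20, 50, 50)          # group sizes (assumed)
mu    <- 70; tau <- 5; sigma <- 10                # population parameters (assumed)
theta <- rnorm(length(n), mu, tau)                # true group means
y_bar <- rnorm(length(n), theta, sigma / sqrt(n)) # observed group sample means
# Weight = data precision / (data precision + prior precision)
w <- (n / sigma^2) / (n / sigma^2 + 1 / tau^2)
partial <- w * y_bar + (1 - w) * mu
round(data.frame(n, w, y_bar, partial, theta), 2) # small n => small w => more shrinkage
```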
Simulation: School test scores
\(K\) schools, each with \(n\) students (the app uses the same \(n\) for every school). Compare three estimates for each school's true mean:
- No pooling (red): each school’s raw sample mean
- Complete pooling (gray): the grand mean for all schools
- Partial pooling (blue): the hierarchical Bayes estimate
#| standalone: true
#| viewerHeight: 680
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
.good { color: #27ae60; font-weight: bold; }
.bad { color: #e74c3c; font-weight: bold; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("K", "Number of schools:",
min = 5, max = 30, value = 12, step = 1),
sliderInput("n_per", "Students per school:",
min = 5, max = 100, value = 15, step = 5),
sliderInput("tau", "Between-school SD (\u03C4):",
min = 1, max = 15, value = 5, step = 1),
actionButton("go", "New draw", class = "btn-primary", width = "100%"),
uiOutput("results")
),
mainPanel(
width = 9,
fluidRow(
column(6, plotOutput("school_plot", height = "450px")),
column(6, plotOutput("mse_plot", height = "450px"))
)
)
)
)
server <- function(input, output, session) {
dat <- reactive({
input$go
K <- input$K
n_per <- input$n_per
tau <- input$tau
sigma <- 10 # within-school SD (fixed)
# True school means
mu <- 70 # grand mean
theta_true <- rnorm(K, mean = mu, sd = tau)
# Generate student scores
y_bar <- numeric(K)
for (j in 1:K) {
scores <- rnorm(n_per, mean = theta_true[j], sd = sigma)
y_bar[j] <- mean(scores)
}
# No pooling: raw means
no_pool <- y_bar
# Complete pooling: grand mean
grand_mean <- mean(y_bar)
complete_pool <- rep(grand_mean, K)
# Partial pooling (empirical Bayes)
# Estimate tau^2 from the data: var(y_bar) estimates tau^2 + sigma^2/n_per,
# so subtract the within-school sampling variance (method of moments)
between_var <- var(y_bar)
within_var <- sigma^2 / n_per
tau_hat_sq <- max(between_var - within_var, 0.01) # floor keeps it positive
w <- (n_per / sigma^2) / (n_per / sigma^2 + 1 / tau_hat_sq)
partial_pool <- w * y_bar + (1 - w) * grand_mean # shrink toward grand mean
# MSE
mse_no <- mean((no_pool - theta_true)^2)
mse_comp <- mean((complete_pool - theta_true)^2)
mse_part <- mean((partial_pool - theta_true)^2)
list(theta_true = theta_true, no_pool = no_pool,
complete_pool = complete_pool, partial_pool = partial_pool,
grand_mean = grand_mean, w = w,
mse_no = mse_no, mse_comp = mse_comp, mse_part = mse_part,
K = K, n_per = n_per, tau = tau)
})
output$school_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 5.5, 3, 1))
K <- d$K
ord <- order(d$no_pool)
xlim <- range(c(d$no_pool, d$partial_pool, d$theta_true, d$grand_mean))
xlim <- xlim + c(-2, 2)
plot(NULL, xlim = xlim, ylim = c(1, K),
yaxt = "n", xlab = "Score",
ylab = "", main = "School Mean Estimates")
axis(2, at = 1:K, labels = paste0("School ", ord), las = 1, cex.axis = 0.7)
# Grand mean line
abline(v = d$grand_mean, lty = 2, col = "gray50", lwd = 1.5)
for (i in 1:K) {
j <- ord[i]
# Arrow from no-pooling to partial-pooling
arrows(d$no_pool[j], i, d$partial_pool[j], i,
length = 0.04, col = "#bdc3c780", lwd = 1)
# Points
points(d$no_pool[j], i, pch = 16, col = "#e74c3c", cex = 1.2)
points(d$partial_pool[j], i, pch = 17, col = "#3498db", cex = 1.2)
points(d$theta_true[j], i, pch = 4, col = "#27ae60", cex = 1, lwd = 2)
}
legend("bottomright", bty = "n", cex = 0.8,
legend = c("No pooling (raw mean)", "Partial pooling",
"True mean", "Grand mean"),
col = c("#e74c3c", "#3498db", "#27ae60", "gray50"),
pch = c(16, 17, 4, NA),
lty = c(NA, NA, NA, 2), lwd = c(NA, NA, 2, 1.5))
})
output$mse_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4, 3, 1))
vals <- c(d$mse_no, d$mse_comp, d$mse_part)
cols <- c("#e74c3c", "gray60", "#3498db")
labels <- c("No pooling", "Complete\npooling", "Partial\npooling")
bp <- barplot(vals, col = cols, border = NA,
names.arg = labels, cex.names = 0.85,
main = "Mean Squared Error",
ylab = "MSE")
text(bp, vals + max(vals) * 0.03, round(vals, 1), cex = 0.85, font = 2)
pct <- round((1 - d$mse_part / d$mse_no) * 100, 0)
mtext(paste0("Partial pooling reduces MSE by ~", pct, "% vs no pooling"),
side = 1, line = 3.5, cex = 0.85, col = "#2c3e50")
})
output$results <- renderUI({
d <- dat()
pct_np <- round((1 - d$mse_part / d$mse_no) * 100, 1)
pct_cp <- round((1 - d$mse_part / d$mse_comp) * 100, 1)
tags$div(class = "stats-box",
HTML(paste0(
"<b>Shrinkage weight:</b> ", round(d$w * 100, 1),
"% on group data<br>",
"<small>(", round((1 - d$w) * 100, 1), "% on grand mean)</small><br>",
"<hr style='margin:8px 0'>",
"<b>MSE no pooling:</b> <span class='bad'>",
round(d$mse_no, 1), "</span><br>",
"<b>MSE complete pooling:</b> ",
round(d$mse_comp, 1), "<br>",
"<b>MSE partial pooling:</b> <span class='good'>",
round(d$mse_part, 1), "</span><br>",
"<hr style='margin:8px 0'>",
"<b>vs no pooling:</b> <span class='good'>",
pct_np, "% lower</span><br>",
"<b>vs complete pooling:</b> <span class='good'>",
pct_cp, "% lower</span>"
))
)
})
}
shinyApp(ui, server)
Things to try
- Students per school = 5, between-school SD = 5: small samples, real group differences. The no-pooling estimates (red) are scattered — some far from the truth. Partial pooling (blue triangles) shrinks them toward the grand mean, landing closer to the true values (green crosses). MSE drops substantially.
- Students per school = 100: with lots of data per group, the raw means are already precise. Shrinkage is minimal — blue and red nearly overlap. With enough data, partial pooling converges to no pooling.
- Between-school SD = 1: schools are very similar. Complete pooling is nearly optimal because the group differences are tiny. Partial pooling agrees — it shrinks almost entirely to the grand mean.
- Between-school SD = 15: schools differ a lot. Complete pooling is terrible because it ignores real differences. No pooling is better, and partial pooling is best — it respects both the data and the group structure.
- Look at the left plot: the arrows show shrinkage. Extreme schools (raw means far from the grand mean) shrink the most. Schools near the middle barely move. This is adaptive — the model learns how much shrinkage is appropriate.
Why partial pooling wins
The logic is identical to the shrinkage page, but in a structured model:
- No pooling has zero bias but high variance (each estimate uses only local data).
- Complete pooling has high bias but zero variance across groups.
- Partial pooling trades a little bias for a large reduction in variance. The bias-variance tradeoff is optimized by the model.
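A quick simulation makes the decomposition visible. This sketch fixes one set of true means, repeatedly redraws the data, and splits each estimator's MSE into squared bias plus variance (parameter values mirror the app's defaults; \(\tau\) is treated as known here):

```r
# MSE = bias^2 + variance, estimated by repeated sampling
set.seed(7)
K <- 12; n <- 15; sigma <- 10; tau <- 5
theta <- rnorm(K, 70, tau)                   # fix one set of true means
w <- (n / sigma^2) / (n / sigma^2 + 1 / tau^2)
reps <- replicate(5000, {
  y_bar <- rnorm(K, theta, sigma / sqrt(n))  # redraw the data
  rbind(no_pool = y_bar,
        partial = w * y_bar + (1 - w) * mean(y_bar))
})
for (m in c("no_pool", "partial")) {
  est   <- reps[m, , ]                       # K x 5000 estimates
  bias2 <- mean((rowMeans(est) - theta)^2)   # avg squared bias across schools
  vari  <- mean(apply(est, 1, var))          # avg variance across schools
  cat(sprintf("%-8s bias^2 = %5.2f  variance = %5.2f  MSE = %5.2f\n",
              m, bias2, vari, bias2 + vari))
}
```

No pooling comes out (nearly) unbiased but high-variance; partial pooling accepts a small bias for a much smaller variance, and wins on total MSE.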
The more groups you have, the better the model estimates \(\tau^2\) (the between-group variance), and the more precisely it calibrates the amount of shrinkage.
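One way to see this: the sampling variability of the method-of-moments \(\hat{\tau}^2\) used in the app shrinks as \(K\) grows. A sketch (assumed parameter values):

```r
# tau^2-hat gets more precise as the number of groups K grows
set.seed(2)
sigma <- 10; n <- 15; tau <- 5
for (K in c(5, 25, 100)) {
  est <- replicate(2000, {
    theta <- rnorm(K, 70, tau)                 # true group means
    y_bar <- rnorm(K, theta, sigma / sqrt(n))  # observed sample means
    max(var(y_bar) - sigma^2 / n, 0)           # method-of-moments tau^2-hat
  })
  cat(sprintf("K = %3d: mean = %5.1f, SD = %4.1f (true tau^2 = %g)\n",
              K, mean(est), sd(est), tau^2))
}
```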
Where hierarchical models show up
| Application | Groups | What gets shrunk |
|---|---|---|
| Education | Schools / districts | Test score means |
| Sports analytics | Players / teams | Batting averages, win rates |
| Clinical trials | Study sites / subgroups | Treatment effects |
| Marketing | Regions / customer segments | Response rates |
| Ecology | Species / habitats | Population parameters |
In each case, the hierarchical structure lets you borrow strength across groups — improving estimates for every group, especially the small ones.
When is your data hierarchical?
Whenever units are nested inside groups, and you expect units in the same group to be more similar to each other than to units in other groups.
Concretely, suppose you have tracts nested within counties. Without a hierarchical model, you'd have to choose:
| Approach | Trade-off |
|---|---|
| Ignore counties, pool all tracts | Misses that tracts in the same county share unobserved factors |
| County fixed effects | Eats degrees of freedom; can’t estimate county-level predictors; small counties get noisy estimates |
| Hierarchical model | Best of both — partial pooling, can include county-level predictors, small counties get regularized |
The rule of thumb: if your research question involves variation across groups (does the effect differ by county? do rural vs urban counties respond differently?), a hierarchical model is the natural fit. If you’re just estimating one overall effect and groups are a nuisance, fixed effects or clustered SEs might be enough.
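If you want to try this on your own data, a common frequentist entry point in R is the lme4 package (a sketch with simulated data; lme4 is an assumption here, not something the app above uses):

```r
# Random-intercept model with lme4 (assumes lme4 is installed)
library(lme4)
set.seed(42)
K <- 12; n <- 15
school <- factor(rep(seq_len(K), each = n))
theta  <- rnorm(K, 70, 5)                  # true school means
score  <- rnorm(K * n, theta[school], 10)  # student-level scores
fit <- lmer(score ~ 1 + (1 | school))      # partial pooling across schools
fixef(fit)              # estimated grand mean (mu-hat)
VarCorr(fit)            # between-school SD (tau-hat) and residual SD (sigma-hat)
head(coef(fit)$school)  # partially pooled school intercepts
```

`coef(fit)$school` contains the shrunken (partially pooled) intercepts, the BLUPs mentioned below; a fully Bayesian fit would give a posterior distribution for each instead.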
Did you know?
The “8 schools” example from Rubin (1981) and Gelman et al. (2013, Bayesian Data Analysis) is the canonical textbook example of hierarchical modeling. Eight schools ran a coaching program; the hierarchical model showed that the most impressive result (School A, +28 points) was largely noise, and the true effects were more modest and similar across schools.
Hierarchical models are the Bayesian counterpart of random effects models in frequentist statistics. The key difference: Bayesian hierarchical models provide full posterior distributions for each group parameter, while frequentist random effects models typically give only point estimates (BLUPs — Best Linear Unbiased Predictors).
Stein's paradox (Stein, 1956): when estimating three or more normal means simultaneously under squared error loss, the vector of sample means is inadmissible, meaning there always exists an estimator with lower total MSE. The James-Stein estimator and hierarchical models both exploit this: by shrinking toward a common mean, they beat the “obvious” estimator that uses each group's data alone.
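To connect the two, here is a sketch of a positive-part James-Stein estimator that shrinks toward the grand mean (it assumes each group's sample mean has the same known sampling variance, passed as s2; the function name is ours):

```r
# Positive-part James-Stein shrinkage toward the grand mean.
# Assumes y_bar[j] ~ N(theta[j], s2) with equal, known variance s2.
js_shrink <- function(y_bar, s2) {
  K  <- length(y_bar)
  gm <- mean(y_bar)
  ss <- sum((y_bar - gm)^2)
  shrink <- max(1 - (K - 3) * s2 / ss, 0) # K - 3: one df is spent on gm
  gm + shrink * (y_bar - gm)
}
```

The shrinkage factor plays the same role as the weight \(w_j\) in the hierarchical model: the closer the group means cluster, the harder everything gets pulled toward the middle.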