Hierarchical Models
The problem
You have data from \(K\) groups — schools, hospitals, factories, regions — and you want to estimate a parameter (say, the mean) for each group. Two extreme approaches:
- No pooling: estimate each group separately using only its own data. With small groups, the estimates are noisy and unreliable.
- Complete pooling: ignore groups entirely and estimate a single grand mean. This throws away real group differences.
Neither is satisfying. No pooling overfits to noise. Complete pooling underfits by ignoring structure. You want something in between.
Hierarchical models: partial pooling
A hierarchical (multilevel) model learns the group-level variation from the data and uses it to shrink each group’s estimate toward the grand mean — more for small groups, less for large groups.
The model:
\[y_{ij} \sim N(\theta_j, \sigma^2) \quad \text{(data within group } j\text{)}\] \[\theta_j \sim N(\mu, \tau^2) \quad \text{(group means come from a population)}\]
- \(\theta_j\) is the true mean for group \(j\)
- \(\mu\) is the overall population mean
- \(\tau^2\) is the between-group variance (how different groups really are)
- \(\sigma^2\) is the within-group variance (noise)
The posterior for each \(\theta_j\) is a weighted average of the group’s own data and the grand mean:
\[\hat{\theta}_j^{\text{partial}} = w_j \cdot \bar{y}_j + (1 - w_j) \cdot \hat{\mu}\]
where the weight \(w_j = \frac{n_j / \sigma^2}{n_j / \sigma^2 + 1/\tau^2}\) depends on the group sample size \(n_j\). Small groups shrink more because their data is less informative relative to the population prior.
This is the same shrinkage logic from the shrinkage page, but now applied within a formal model that estimates \(\tau^2\) from the data.
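To make the formula concrete, here is a minimal base-R sketch with \(\mu\), \(\tau\), and \(\sigma\) treated as known (all values are assumptions chosen for illustration; the app below drops that assumption and estimates them from the data):

```r
# Partial pooling with known mu, tau, sigma (illustration only)
set.seed(1)
n     <- c(5, 5, 10, 10, 20, 20, 50, 50)          # group sizes (assumed)
mu    <- 70; tau <- 5; sigma <- 10                # population parameters (assumed)
theta <- rnorm(length(n), mu, tau)                # true group means
y_bar <- rnorm(length(n), theta, sigma / sqrt(n)) # observed group sample means
# Weight = data precision / (data precision + prior precision)
w <- (n / sigma^2) / (n / sigma^2 + 1 / tau^2)
partial <- w * y_bar + (1 - w) * mu
round(data.frame(n, w, y_bar, partial, theta), 2) # small n => small w => more shrinkage
```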
Simulation: School test scores
\(K\) schools, each with \(n\) students (the app uses the same \(n\) for every school). Compare three estimates for each school's true mean:
- No pooling (red): each school’s raw sample mean
- Complete pooling (gray): the grand mean for all schools
- Partial pooling (blue): the hierarchical Bayes estimate
#| standalone: true
#| viewerHeight: 680
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
.good { color: #27ae60; font-weight: bold; }
.bad { color: #e74c3c; font-weight: bold; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("K", "Number of schools:",
min = 5, max = 30, value = 12, step = 1),
sliderInput("n_per", "Students per school:",
min = 5, max = 100, value = 15, step = 5),
sliderInput("tau", "Between-school SD (\u03C4):",
min = 1, max = 15, value = 5, step = 1),
actionButton("go", "New draw", class = "btn-primary", width = "100%"),
uiOutput("results")
),
mainPanel(
width = 9,
fluidRow(
column(6, plotOutput("school_plot", height = "450px")),
column(6, plotOutput("mse_plot", height = "450px"))
)
)
)
)
server <- function(input, output, session) {
dat <- reactive({
input$go
K <- input$K
n_per <- input$n_per
tau <- input$tau
sigma <- 10 # within-school SD (fixed)
# True school means
mu <- 70 # grand mean
theta_true <- rnorm(K, mean = mu, sd = tau)
# Generate student scores
y_bar <- numeric(K)
for (j in 1:K) {
scores <- rnorm(n_per, mean = theta_true[j], sd = sigma)
y_bar[j] <- mean(scores)
}
# No pooling: raw means
no_pool <- y_bar
# Complete pooling: grand mean
grand_mean <- mean(y_bar)
complete_pool <- rep(grand_mean, K)
# Partial pooling (empirical Bayes)
# Estimate tau^2 from the data: var(y_bar) estimates tau^2 + sigma^2/n_per,
# so subtract the within-school sampling variance (method of moments)
between_var <- var(y_bar)
within_var <- sigma^2 / n_per
tau_hat_sq <- max(between_var - within_var, 0.01) # floor keeps it positive
w <- (n_per / sigma^2) / (n_per / sigma^2 + 1 / tau_hat_sq)
partial_pool <- w * y_bar + (1 - w) * grand_mean # shrink toward grand mean
# MSE
mse_no <- mean((no_pool - theta_true)^2)
mse_comp <- mean((complete_pool - theta_true)^2)
mse_part <- mean((partial_pool - theta_true)^2)
list(theta_true = theta_true, no_pool = no_pool,
complete_pool = complete_pool, partial_pool = partial_pool,
grand_mean = grand_mean, w = w,
mse_no = mse_no, mse_comp = mse_comp, mse_part = mse_part,
K = K, n_per = n_per, tau = tau)
})
output$school_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 5.5, 3, 1))
K <- d$K
ord <- order(d$no_pool)
xlim <- range(c(d$no_pool, d$partial_pool, d$theta_true, d$grand_mean))
xlim <- xlim + c(-2, 2)
plot(NULL, xlim = xlim, ylim = c(1, K),
yaxt = "n", xlab = "Score",
ylab = "", main = "School Mean Estimates")
axis(2, at = 1:K, labels = paste0("School ", ord), las = 1, cex.axis = 0.7)
# Grand mean line
abline(v = d$grand_mean, lty = 2, col = "gray50", lwd = 1.5)
for (i in 1:K) {
j <- ord[i]
# Arrow from no-pooling to partial-pooling
arrows(d$no_pool[j], i, d$partial_pool[j], i,
length = 0.04, col = "#bdc3c780", lwd = 1)
# Points
points(d$no_pool[j], i, pch = 16, col = "#e74c3c", cex = 1.2)
points(d$partial_pool[j], i, pch = 17, col = "#3498db", cex = 1.2)
points(d$theta_true[j], i, pch = 4, col = "#27ae60", cex = 1, lwd = 2)
}
legend("bottomright", bty = "n", cex = 0.8,
legend = c("No pooling (raw mean)", "Partial pooling",
"True mean", "Grand mean"),
col = c("#e74c3c", "#3498db", "#27ae60", "gray50"),
pch = c(16, 17, 4, NA),
lty = c(NA, NA, NA, 2), lwd = c(NA, NA, 2, 1.5))
})
output$mse_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4, 3, 1))
vals <- c(d$mse_no, d$mse_comp, d$mse_part)
cols <- c("#e74c3c", "gray60", "#3498db")
labels <- c("No pooling", "Complete\npooling", "Partial\npooling")
bp <- barplot(vals, col = cols, border = NA,
names.arg = labels, cex.names = 0.85,
main = "Mean Squared Error",
ylab = "MSE")
text(bp, vals + max(vals) * 0.03, round(vals, 1), cex = 0.85, font = 2)
pct <- round((1 - d$mse_part / d$mse_no) * 100, 0)
mtext(paste0("Partial pooling reduces MSE by ~", pct, "% vs no pooling"),
side = 1, line = 3.5, cex = 0.85, col = "#2c3e50")
})
output$results <- renderUI({
d <- dat()
pct_np <- round((1 - d$mse_part / d$mse_no) * 100, 1)
pct_cp <- round((1 - d$mse_part / d$mse_comp) * 100, 1)
tags$div(class = "stats-box",
HTML(paste0(
"<b>Shrinkage weight:</b> ", round(d$w * 100, 1),
"% on group data<br>",
"<small>(", round((1 - d$w) * 100, 1), "% on grand mean)</small><br>",
"<hr style='margin:8px 0'>",
"<b>MSE no pooling:</b> <span class='bad'>",
round(d$mse_no, 1), "</span><br>",
"<b>MSE complete pooling:</b> ",
round(d$mse_comp, 1), "<br>",
"<b>MSE partial pooling:</b> <span class='good'>",
round(d$mse_part, 1), "</span><br>",
"<hr style='margin:8px 0'>",
"<b>vs no pooling:</b> <span class='good'>",
pct_np, "% lower</span><br>",
"<b>vs complete pooling:</b> <span class='good'>",
pct_cp, "% lower</span>"
))
)
})
}
shinyApp(ui, server)
Things to try
- Students per school = 5, between-school SD = 5: small samples, real group differences. The no-pooling estimates (red) are scattered — some far from the truth. Partial pooling (blue triangles) shrinks them toward the grand mean, landing closer to the true values (green crosses). MSE drops substantially.
- Students per school = 100: with lots of data per group, the raw means are already precise. Shrinkage is minimal — blue and red nearly overlap. With enough data, partial pooling converges to no pooling.
- Between-school SD = 1: schools are very similar. Complete pooling is nearly optimal because the group differences are tiny. Partial pooling agrees — it shrinks almost entirely to the grand mean.
- Between-school SD = 15: schools differ a lot. Complete pooling is terrible because it ignores real differences. No pooling is better, and partial pooling is best — it respects both the data and the group structure.
- Look at the left plot: the arrows show shrinkage. Extreme schools (raw means far from the grand mean) shrink the most. Schools near the middle barely move. This is adaptive — the model learns how much shrinkage is appropriate.
Why partial pooling wins
The logic is identical to the shrinkage page, but in a structured model:
- No pooling has zero bias but high variance (each estimate uses only local data).
- Complete pooling has high bias but zero variance across groups.
- Partial pooling trades a little bias for a large reduction in variance. The bias-variance tradeoff is optimized by the model.
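A quick simulation makes the decomposition visible. This sketch fixes one set of true means, repeatedly redraws the data, and splits each estimator's MSE into squared bias plus variance (parameter values mirror the app's defaults; \(\tau\) is treated as known here):

```r
# MSE = bias^2 + variance, estimated by repeated sampling
set.seed(7)
K <- 12; n <- 15; sigma <- 10; tau <- 5
theta <- rnorm(K, 70, tau)                   # fix one set of true means
w <- (n / sigma^2) / (n / sigma^2 + 1 / tau^2)
reps <- replicate(5000, {
  y_bar <- rnorm(K, theta, sigma / sqrt(n))  # redraw the data
  rbind(no_pool = y_bar,
        partial = w * y_bar + (1 - w) * mean(y_bar))
})
for (m in c("no_pool", "partial")) {
  est   <- reps[m, , ]                       # K x 5000 estimates
  bias2 <- mean((rowMeans(est) - theta)^2)   # avg squared bias across schools
  vari  <- mean(apply(est, 1, var))          # avg variance across schools
  cat(sprintf("%-8s bias^2 = %5.2f  variance = %5.2f  MSE = %5.2f\n",
              m, bias2, vari, bias2 + vari))
}
```

No pooling comes out (nearly) unbiased but high-variance; partial pooling accepts a small bias for a much smaller variance, and wins on total MSE.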
The more groups you have, the better the model estimates \(\tau^2\) (the between-group variance), and the more precisely it calibrates the amount of shrinkage.
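One way to see this: the sampling variability of the method-of-moments \(\hat{\tau}^2\) used in the app shrinks as \(K\) grows. A sketch (assumed parameter values):

```r
# tau^2-hat gets more precise as the number of groups K grows
set.seed(2)
sigma <- 10; n <- 15; tau <- 5
for (K in c(5, 25, 100)) {
  est <- replicate(2000, {
    theta <- rnorm(K, 70, tau)                 # true group means
    y_bar <- rnorm(K, theta, sigma / sqrt(n))  # observed sample means
    max(var(y_bar) - sigma^2 / n, 0)           # method-of-moments tau^2-hat
  })
  cat(sprintf("K = %3d: mean = %5.1f, SD = %4.1f (true tau^2 = %g)\n",
              K, mean(est), sd(est), tau^2))
}
```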
Where hierarchical models show up
| Application | Groups | What gets shrunk |
|---|---|---|
| Education | Schools / districts | Test score means |
| Sports analytics | Players / teams | Batting averages, win rates |
| Clinical trials | Study sites / subgroups | Treatment effects |
| Marketing | Regions / customer segments | Response rates |
| Ecology | Species / habitats | Population parameters |
In each case, the hierarchical structure lets you borrow strength across groups — improving estimates for every group, especially the small ones.
When is your data hierarchical?
Whenever units are nested inside groups, and you expect units in the same group to be more similar to each other than to units in other groups.
Concretely, suppose you have tracts nested within counties. Without a hierarchical model, you'd have to choose:
| Approach | Trade-off |
|---|---|
| Ignore counties, pool all tracts | Misses that tracts in the same county share unobserved factors |
| County fixed effects | Eats degrees of freedom; can’t estimate county-level predictors; small counties get noisy estimates |
| Hierarchical model | Best of both — partial pooling, can include county-level predictors, small counties get regularized |
The rule of thumb: if your research question involves variation across groups (does the effect differ by county? do rural vs urban counties respond differently?), a hierarchical model is the natural fit. If you’re just estimating one overall effect and groups are a nuisance, fixed effects or clustered SEs might be enough.
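If you want to try this on your own data, a common frequentist entry point in R is the lme4 package (a sketch with simulated data; lme4 is an assumption here, not something the app above uses):

```r
# Random-intercept model with lme4 (assumes lme4 is installed)
library(lme4)
set.seed(42)
K <- 12; n <- 15
school <- factor(rep(seq_len(K), each = n))
theta  <- rnorm(K, 70, 5)                  # true school means
score  <- rnorm(K * n, theta[school], 10)  # student-level scores
fit <- lmer(score ~ 1 + (1 | school))      # partial pooling across schools
fixef(fit)              # estimated grand mean (mu-hat)
VarCorr(fit)            # between-school SD (tau-hat) and residual SD (sigma-hat)
head(coef(fit)$school)  # partially pooled school intercepts
```

`coef(fit)$school` contains the shrunken (partially pooled) intercepts, the BLUPs mentioned below; a fully Bayesian fit would give a posterior distribution for each instead.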
Did you know?
The “8 schools” example from Rubin (1981) and Gelman et al. (2013, Bayesian Data Analysis) is the canonical textbook example of hierarchical modeling. Eight schools ran a coaching program; the hierarchical model showed that the most impressive result (School A, +28 points) was largely noise, and the true effects were more modest and similar across schools.
Hierarchical models are the Bayesian counterpart of random effects models in frequentist statistics. The key difference: Bayesian hierarchical models provide full posterior distributions for each group parameter, while frequentist random effects models typically give only point estimates (BLUPs — Best Linear Unbiased Predictors).
Stein's paradox (Stein, 1956): when estimating three or more normal means simultaneously under squared error loss, the vector of sample means is inadmissible, meaning there always exists an estimator with lower total MSE. The James-Stein estimator and hierarchical models both exploit this: by shrinking toward a common mean, they beat the “obvious” estimator that uses each group's data alone.
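To connect the two, here is a sketch of a positive-part James-Stein estimator that shrinks toward the grand mean (it assumes each group's sample mean has the same known sampling variance, passed as s2; the function name is ours):

```r
# Positive-part James-Stein shrinkage toward the grand mean.
# Assumes y_bar[j] ~ N(theta[j], s2) with equal, known variance s2.
js_shrink <- function(y_bar, s2) {
  K  <- length(y_bar)
  gm <- mean(y_bar)
  ss <- sum((y_bar - gm)^2)
  shrink <- max(1 - (K - 3) * s2 / ss, 0) # K - 3: one df is spent on gm
  gm + shrink * (y_bar - gm)
}
```

The shrinkage factor plays the same role as the weight \(w_j\) in the hierarchical model: the closer the group means cluster, the harder everything gets pulled toward the middle.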