Posterior Predictive Checks

Does the model fit?

You’ve chosen a prior, computed the posterior, maybe even compared models. But there’s a question you haven’t asked: does the model actually fit the data?

Frequentists check model fit with residual plots, Q-Q plots, and goodness-of-fit tests. Bayesians have their own tool: posterior predictive checks (PPCs).

The idea is beautifully simple: if your model is correct, then data simulated from it should look like the real data. If they don’t, something is wrong.

Example: You fit a normal model to income data. The model says incomes are symmetric. But real income data has a long right tail — a few people earn much more than average. Simulated datasets from your model won’t have those extreme values. The mismatch tells you the normal model is wrong.


How it works

The posterior predictive distribution generates “fake data” from your fitted model:

  1. Draw \(\theta^*\) from the posterior \(p(\theta \mid \mathbf{y})\)
  2. Generate \(\mathbf{y}^{\text{rep}} \sim p(\mathbf{y} \mid \theta^*)\)
  3. Repeat many times to get many replicated datasets

Each \(\mathbf{y}^{\text{rep}}\) is a dataset that could have been generated by the model, given what you learned from the data. If the model is right, the real data should look like a “typical” replicated dataset.
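For a model with a closed-form posterior, this loop is only a few lines of R. Here is a minimal sketch, assuming a normal model under the standard noninformative prior (the same closed-form posterior used in the simulation below); y is the observed data vector:

# Observed data (illustrative only)
y    <- rnorm(100, mean = 5, sd = 2)
n    <- length(y)
ybar <- mean(y)
s2   <- var(y)

n_rep <- 1000
y_rep <- matrix(NA_real_, nrow = n_rep, ncol = n)

for (r in 1:n_rep) {
  # Step 1: draw (mu*, sigma*) from the posterior
  sig2_star <- s2 * (n - 1) / rchisq(1, df = n - 1)  # scaled inv-chi-squared
  mu_star   <- rnorm(1, ybar, sqrt(sig2_star / n))
  # Step 2: generate a replicated dataset of the same size as y
  y_rep[r, ] <- rnorm(n, mu_star, sqrt(sig2_star))
}
# Step 3: each row of y_rep is one replicated dataset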

To turn the comparison into a single number, pick a test statistic \(T\) — anything that captures a feature you care about (the max, the skewness, the standard deviation) — and compare:

\[ T(\mathbf{y}^{\text{obs}}) \quad \text{vs} \quad T(\mathbf{y}^{\text{rep}}_1),\; T(\mathbf{y}^{\text{rep}}_2),\; \ldots \]

The posterior predictive p-value is:

\[ p_{pp} = P\!\left(T(\mathbf{y}^{\text{rep}}) \geq T(\mathbf{y}^{\text{obs}}) \,\middle|\, \mathbf{y}^{\text{obs}}\right) \]

If \(p_{pp}\) is near 0 or 1, the observed statistic is extreme relative to what the model predicts — evidence of model misfit.
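Continuing the sketch above with \(T = \max\), the check itself is only a few lines:

T_obs <- max(y)
T_rep <- apply(y_rep, 1, max)   # one value of T per replicated dataset
p_pp  <- mean(T_rep >= T_obs)   # posterior predictive p-value

hist(T_rep, main = "T(y_rep)")  # visual version of the same comparison
abline(v = T_obs, lwd = 2)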

Not a “real” p-value: The posterior predictive p-value uses the data twice (once to fit the model, once to check it), so it is not a classical significance test and is not uniformly distributed even when the model is true. Treat it as a diagnostic, not a hypothesis test: values near 0.5 mean the model reproduces that feature of the data well.


Simulation: Is the normal model right?

Fit a normal model (\(y \sim N(\mu, \sigma^2)\)) to data that may or may not be normal. If there is a mismatch, the posterior predictive check reveals it.

#| standalone: true
#| viewerHeight: 620

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
    .good  { color: #27ae60; font-weight: bold; }
    .bad   { color: #e74c3c; font-weight: bold; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      selectInput("dgp", "True DGP:",
                  choices = c("Normal", "Heavy-tailed (t, df=3)", "Skewed (exp)"),
                  selected = "Normal"),

      sliderInput("n", "Sample size (n):",
                  min = 30, max = 500, value = 100, step = 10),

      selectInput("stat", "Test statistic T:",
                  choices = c("Max" = "max", "Skewness" = "skew", "SD" = "sd"),
                  selected = "max"),

      actionButton("go", "New data", class = "btn-primary", width = "100%"),

      uiOutput("results")
    ),

    mainPanel(
      width = 9,
      fluidRow(
        column(6, plotOutput("data_plot", height = "420px")),
        column(6, plotOutput("ppc_plot", height = "420px"))
      )
    )
  )
)

server <- function(input, output, session) {

  n_rep <- 500  # number of replicated datasets

  dat <- reactive({
    input$go
    n   <- input$n
    dgp <- input$dgp

    # Generate observed data from true DGP
    y_obs <- if (dgp == "Normal") {
      rnorm(n, 5, 2)
    } else if (dgp == "Heavy-tailed (t, df=3)") {
      5 + 2 * rt(n, df = 3)
    } else {
      rexp(n, rate = 0.5)  # mean = 2, skewed right
    }

    # Fit normal model: posterior for mu and sigma under the standard
    # noninformative prior p(mu, sigma^2) proportional to 1/sigma^2:
    #   sigma^2 | y ~ scaled Inv-chi-squared(n - 1, s^2)
    #   mu | sigma^2, y ~ N(ybar, sigma^2 / n)
    ybar <- mean(y_obs)
    s2   <- var(y_obs)

    # Compute test statistic
    calc_stat <- function(y, type) {
      if (type == "max") return(max(y))
      if (type == "skew") {
        m <- mean(y); s <- sd(y)
        return(mean(((y - m) / s)^3))
      }
      if (type == "sd") return(sd(y))
    }

    stat_type <- input$stat
    t_obs <- calc_stat(y_obs, stat_type)

    # Generate replicated datasets from posterior predictive
    t_rep <- numeric(n_rep)
    y_rep_curves <- list()
    n_curves <- 50  # store some for visual overlay

    for (r in 1:n_rep) {
      # Draw sigma^2 from its marginal posterior (scaled inverse-chi-squared)
      sig2_star <- s2 * (n - 1) / rchisq(1, df = n - 1)
      sig_star  <- sqrt(sig2_star)

      # Draw mu from conditional posterior
      mu_star <- rnorm(1, ybar, sig_star / sqrt(n))

      # Generate replicated data
      y_r <- rnorm(n, mu_star, sig_star)
      t_rep[r] <- calc_stat(y_r, stat_type)

      if (r <= n_curves) {
        y_rep_curves[[r]] <- y_r
      }
    }

    # Posterior predictive p-value
    p_pp <- mean(t_rep >= t_obs)

    list(y_obs = y_obs, t_obs = t_obs, t_rep = t_rep,
         y_rep_curves = y_rep_curves, p_pp = p_pp,
         stat_type = stat_type, dgp = dgp, ybar = ybar, s2 = s2)
  })

  output$data_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    # Histogram of observed data
    h <- hist(d$y_obs, breaks = 25, plot = FALSE)

    hist(d$y_obs, breaks = 25, freq = FALSE,
         col = "#3498db40", border = "#3498db",
         xlab = "y", main = "Observed Data + Posterior Predictive",
         ylim = c(0, max(h$density) * 1.5))

    # Overlay density curves from replicated datasets
    x_range <- range(c(d$y_obs, unlist(d$y_rep_curves)))

    for (i in seq_along(d$y_rep_curves)) {
      dens <- density(d$y_rep_curves[[i]], from = x_range[1] - 1,
                      to = x_range[2] + 1, n = 200)
      lines(dens$x, dens$y, col = "#e74c3c15", lwd = 1)
    }

    # Observed density on top
    dens_obs <- density(d$y_obs)
    lines(dens_obs$x, dens_obs$y, col = "#2c3e50", lwd = 2.5)

    legend("topright", bty = "n", cex = 0.85,
           legend = c("Observed data", "Replicated (model)"),
           col = c("#2c3e50", "#e74c3c80"),
           lwd = c(2.5, 1.5))
  })

  output$ppc_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    stat_label <- switch(d$stat_type,
                         "max" = "max(y)",
                         "skew" = "skewness(y)",
                         "sd" = "sd(y)")

    h <- hist(d$t_rep, breaks = 30, plot = FALSE)

    # Color bins by whether they're >= t_obs
    bin_cols <- ifelse(h$mids >= d$t_obs, "#e74c3c80", "#bdc3c780")

    hist(d$t_rep, breaks = 30, freq = FALSE,
         col = bin_cols, border = "white",
         xlab = paste0("T = ", stat_label),
         main = paste0("Posterior Predictive Check: ", stat_label))

    # Vertical line at observed
    abline(v = d$t_obs, col = "#2c3e50", lwd = 2.5, lty = 1)
    text(d$t_obs, max(h$density) * 0.9, pos = 4, cex = 0.85,
         col = "#2c3e50", font = 2,
         labels = paste0("T(obs) = ", round(d$t_obs, 3)))

    # p-value label
    p_label <- if (d$p_pp < 0.05 || d$p_pp > 0.95) {
      paste0("p_pp = ", round(d$p_pp, 3), "  MISFIT")
    } else {
      paste0("p_pp = ", round(d$p_pp, 3), "  OK")
    }
    p_col <- if (d$p_pp < 0.05 || d$p_pp > 0.95) "#e74c3c" else "#27ae60"
    mtext(p_label, side = 3, line = -1.5, cex = 0.9, col = p_col, font = 2)

    legend("topleft", bty = "n", cex = 0.8,
           legend = c("T(y_rep) distribution",
                      paste0("T(y_obs) = ", round(d$t_obs, 3)),
                      "p-value region"),
           col = c("#bdc3c7", "#2c3e50", "#e74c3c80"),
           pch = c(15, NA, 15), lwd = c(NA, 2.5, NA), lty = c(NA, 1, NA))
  })

  output$results <- renderUI({
    d <- dat()

    stat_label <- switch(d$stat_type,
                         "max" = "max(y)",
                         "skew" = "skewness",
                         "sd" = "sd(y)")

    verdict <- if (d$p_pp < 0.05 || d$p_pp > 0.95) {
      "<span class='bad'>FAIL — model misfit</span>"
    } else {
      "<span class='good'>PASS — consistent</span>"
    }

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>T(y_obs):</b> ", round(d$t_obs, 3), "<br>",
        "<b>Mean T(y_rep):</b> ", round(mean(d$t_rep), 3), "<br>",
        "<hr style='margin:8px 0'>",
        "<b>p<sub>pp</sub>:</b> ", round(d$p_pp, 3), "<br>",
        "<b>Verdict:</b> ", verdict, "<br>",
        "<hr style='margin:8px 0'>",
        "<b>DGP:</b> ", d$dgp, "<br>",
        "<b>Statistic:</b> ", stat_label
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • Normal DGP, any statistic: the check passes (apart from occasional sampling noise). The model is correct, so \(T(\mathbf{y}^{\text{obs}})\) falls comfortably within the replicated distribution and the posterior predictive p-value stays well away from 0 and 1, typically near 0.5.
  • Heavy-tailed (t, df=3), test stat = Max: the normal model can’t produce the extreme values that a heavy-tailed distribution generates. \(\max(y)\) is far into the tail of the replicated distribution. The check fails.
  • Skewed (exponential), test stat = Skewness: the normal model is symmetric, so replicated datasets have skewness near 0. But the exponential data is heavily right-skewed. The check catches this.
  • Large n makes failures sharper: with n = 500, the replicated distribution concentrates tightly, making any mismatch glaringly obvious. Small samples are more forgiving.
  • Left panel clue: when the model is wrong, the light red replicated curves don’t match the shape of the observed histogram. That visual mismatch is what the p-value quantifies.
  • SD statistic, Normal DGP: a good check of whether the variability is well captured. Switch to heavy-tailed data — the SD check may or may not fail, depending on the particular draw.

In Stata

* Fit the Bayesian model
bayes: reg y x1 x2

* Posterior predictive p-values for test statistics
bayesstats ppvalues (mean) (sd) (max)

* If p_pp near 0 or 1, the model struggles
* to reproduce that feature of the data

Which statistics to check? Pick statistics that matter for your research question. If you care about tail behavior (risk analysis), check max and min. If you’re modeling skewed outcomes (wages, prices), check skewness. If your model should capture variability, check sd.
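If you already have the replicated datasets in hand (as in the earlier sketch), checking several statistics side by side is straightforward. The helper below is a sketch, with skewness computed as the standardized third moment, as in the app:

check_stat <- function(T_fun, y, y_rep) {
  T_obs <- T_fun(y)
  T_rep <- apply(y_rep, 1, T_fun)   # T on each replicated dataset (rows of y_rep)
  c(T_obs = T_obs, p_pp = mean(T_rep >= T_obs))
}

skewness <- function(x) mean(((x - mean(x)) / sd(x))^3)

rbind(
  max      = check_stat(max, y, y_rep),
  min      = check_stat(min, y, y_rep),
  sd       = check_stat(sd, y, y_rep),
  skewness = check_stat(skewness, y, y_rep)
)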


Did you know?

  • Rubin (1984) introduced posterior predictive checks as a general framework for model criticism. The key insight: use the model to generate data, then see if the generated data resemble reality. Simple, powerful, and applicable to any model.

  • Gelman, Meng & Stern (1996) formalized posterior predictive p-values and showed how to choose test statistics that target specific model assumptions. Their paper established PPCs as a standard tool in applied Bayesian work.

  • Connection to cross-validation: LOO-CV (leave-one-out cross-validation) is another model-checking tool. While PPCs ask “can the model reproduce features of the data?”, LOO-CV asks “can the model predict held-out observations?” Both catch model problems, but from different angles. The loo package in R approximates LOO-CV efficiently, and information criteria such as WAIC (or the DIC reported by Stata’s bayesstats ic) serve a similar model-comparison purpose.