Posterior Predictive Checks
Does the model fit?
You’ve chosen a prior, computed the posterior, maybe even compared models. But there’s a question you haven’t asked: does the model actually fit the data?
Frequentists check model fit with residual plots, Q-Q plots, and goodness-of-fit tests. Bayesians have their own tool: posterior predictive checks (PPCs).
The idea is beautifully simple: if your model is correct, then data simulated from it should look like the real data. If they don’t, something is wrong.
How it works
The posterior predictive distribution generates “fake data” from your fitted model:
- Draw \(\theta^*\) from the posterior \(p(\theta \mid \mathbf{y})\)
- Generate \(\mathbf{y}^{\text{rep}} \sim p(\mathbf{y} \mid \theta^*)\)
- Repeat many times to get many replicated datasets
Each \(\mathbf{y}^{\text{rep}}\) is a dataset that could have been generated by the model, given what you learned from the data. If the model is right, the real data should look like a “typical” replicated dataset.
To make this concrete, pick a test statistic \(T\) — anything that captures a feature you care about (the max, the skewness, the standard deviation) — and compare:
\[ T(\mathbf{y}^{\text{obs}}) \quad \text{vs} \quad T(\mathbf{y}^{\text{rep}}_1),\; T(\mathbf{y}^{\text{rep}}_2),\; \ldots \]
The posterior predictive p-value is:
\[ p_{pp} = P\!\left(T(\mathbf{y}^{\text{rep}}) \geq T(\mathbf{y}^{\text{obs}}) \,\middle|\, \mathbf{y}^{\text{obs}}\right) \]
If \(p_{pp}\) is near 0 or 1, the observed statistic is extreme relative to what the model predicts — evidence of model misfit.
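Stripped of plotting and interface code, the whole recipe fits in a few lines of R. The sketch below uses a normal model under the standard noninformative prior (the same conjugate draws as the app further down); the example data, the test statistic, and the number of replications are placeholders to swap for your own.
# Minimal posterior predictive check for a normal model,
# noninformative prior p(mu, sigma^2) proportional to 1/sigma^2
set.seed(1)
y_obs <- rexp(100, rate = 0.5)              # example data (right-skewed); use your own
n     <- length(y_obs)
ybar  <- mean(y_obs)
s2    <- var(y_obs)
T_stat <- function(y) max(y)                # any feature you care about
t_rep <- replicate(4000, {
  sig2_star <- s2 * (n - 1) / rchisq(1, df = n - 1)  # sigma^2 | y (scaled inverse-chi-squared)
  mu_star   <- rnorm(1, ybar, sqrt(sig2_star / n))   # mu | sigma^2, y
  T_stat(rnorm(n, mu_star, sqrt(sig2_star)))         # T of one replicated dataset
})
p_pp <- mean(t_rep >= T_stat(y_obs))        # posterior predictive p-value
p_pp                                        # values near 0 or 1 flag misfit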
Simulation: Is the normal model right?
Fit a normal model (\(y \sim N(\mu, \sigma^2)\)) to data that may or may not be normal. When the data are not normal, the posterior predictive check reveals the mismatch.
#| standalone: true
#| viewerHeight: 620
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
.good { color: #27ae60; font-weight: bold; }
.bad { color: #e74c3c; font-weight: bold; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
selectInput("dgp", "True DGP:",
choices = c("Normal", "Heavy-tailed (t, df=3)", "Skewed (exp)"),
selected = "Normal"),
sliderInput("n", "Sample size (n):",
min = 30, max = 500, value = 100, step = 10),
selectInput("stat", "Test statistic T:",
choices = c("Max" = "max", "Skewness" = "skew", "SD" = "sd"),
selected = "max"),
actionButton("go", "New data", class = "btn-primary", width = "100%"),
uiOutput("results")
),
mainPanel(
width = 9,
fluidRow(
column(6, plotOutput("data_plot", height = "420px")),
column(6, plotOutput("ppc_plot", height = "420px"))
)
)
)
)
server <- function(input, output, session) {
n_rep <- 500 # number of replicated datasets
dat <- reactive({
input$go
n <- input$n
dgp <- input$dgp
# Generate observed data from true DGP
y_obs <- if (dgp == "Normal") {
rnorm(n, 5, 2)
} else if (dgp == "Heavy-tailed (t, df=3)") {
5 + 2 * rt(n, df = 3)
} else {
rexp(n, rate = 0.5) # mean = 2, skewed right
}
# Fit normal model: posterior for mu and sigma under the
# noninformative prior p(mu, sigma^2) proportional to 1/sigma^2:
#   sigma^2 | y      ~ Scaled-Inv-chi^2(n - 1, s^2)
#   mu | sigma^2, y  ~ N(ybar, sigma^2 / n)
ybar <- mean(y_obs)
s2 <- var(y_obs)
# Compute test statistic
calc_stat <- function(y, type) {
if (type == "max") return(max(y))
if (type == "skew") {
m <- mean(y); s <- sd(y)
return(mean(((y - m) / s)^3))
}
if (type == "sd") return(sd(y))
}
stat_type <- input$stat
t_obs <- calc_stat(y_obs, stat_type)
# Generate replicated datasets from posterior predictive
t_rep <- numeric(n_rep)
y_rep_curves <- list()
n_curves <- 50 # store some for visual overlay
for (r in 1:n_rep) {
# Draw sigma^2 from its marginal posterior (scaled inverse-chi-squared)
sig2_star <- s2 * (n - 1) / rchisq(1, df = n - 1)
sig_star <- sqrt(sig2_star)
# Draw mu from conditional posterior
mu_star <- rnorm(1, ybar, sig_star / sqrt(n))
# Generate replicated data
y_r <- rnorm(n, mu_star, sig_star)
t_rep[r] <- calc_stat(y_r, stat_type)
if (r <= n_curves) {
y_rep_curves[[r]] <- y_r
}
}
# Posterior predictive p-value
p_pp <- mean(t_rep >= t_obs)
list(y_obs = y_obs, t_obs = t_obs, t_rep = t_rep,
y_rep_curves = y_rep_curves, p_pp = p_pp,
stat_type = stat_type, dgp = dgp, ybar = ybar, s2 = s2)
})
output$data_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
# Histogram of observed data
h <- hist(d$y_obs, breaks = 25, plot = FALSE)
hist(d$y_obs, breaks = 25, freq = FALSE,
col = "#3498db40", border = "#3498db",
xlab = "y", main = "Observed Data + Posterior Predictive",
ylim = c(0, max(h$density) * 1.5))
# Overlay density curves from replicated datasets
x_range <- range(c(d$y_obs, unlist(d$y_rep_curves)))
x_seq <- seq(x_range[1] - 1, x_range[2] + 1, length.out = 200)
for (i in seq_along(d$y_rep_curves)) {
dens <- density(d$y_rep_curves[[i]], from = x_range[1] - 1,
to = x_range[2] + 1, n = 200)
lines(dens$x, dens$y, col = "#e74c3c15", lwd = 1)
}
# Observed density on top
dens_obs <- density(d$y_obs)
lines(dens_obs$x, dens_obs$y, col = "#2c3e50", lwd = 2.5)
legend("topright", bty = "n", cex = 0.85,
legend = c("Observed data", "Replicated (model)"),
col = c("#2c3e50", "#e74c3c80"),
lwd = c(2.5, 1.5))
})
output$ppc_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
stat_label <- switch(d$stat_type,
"max" = "max(y)",
"skew" = "skewness(y)",
"sd" = "sd(y)")
h <- hist(d$t_rep, breaks = 30, plot = FALSE)
# Color bins by whether they're >= t_obs
bin_cols <- ifelse(h$mids >= d$t_obs, "#e74c3c80", "#bdc3c780")
hist(d$t_rep, breaks = 30, freq = FALSE,
col = bin_cols, border = "white",
xlab = paste0("T = ", stat_label),
main = paste0("Posterior Predictive Check: ", stat_label))
# Vertical line at observed
abline(v = d$t_obs, col = "#2c3e50", lwd = 2.5, lty = 1)
text(d$t_obs, max(h$density) * 0.9, pos = 4, cex = 0.85,
col = "#2c3e50", font = 2,
labels = paste0("T(obs) = ", round(d$t_obs, 3)))
# p-value label
p_label <- if (d$p_pp < 0.05 || d$p_pp > 0.95) {
paste0("p_pp = ", round(d$p_pp, 3), " MISFIT")
} else {
paste0("p_pp = ", round(d$p_pp, 3), " OK")
}
p_col <- if (d$p_pp < 0.05 || d$p_pp > 0.95) "#e74c3c" else "#27ae60"
mtext(p_label, side = 3, line = -1.5, cex = 0.9, col = p_col, font = 2)
legend("topleft", bty = "n", cex = 0.8,
legend = c("T(y_rep) distribution",
paste0("T(y_obs) = ", round(d$t_obs, 3)),
"p-value region"),
col = c("#bdc3c7", "#2c3e50", "#e74c3c80"),
pch = c(15, NA, 15), lwd = c(NA, 2.5, NA), lty = c(NA, 1, NA))
})
output$results <- renderUI({
d <- dat()
stat_label <- switch(d$stat_type,
"max" = "max(y)",
"skew" = "skewness",
"sd" = "sd(y)")
verdict <- if (d$p_pp < 0.05 || d$p_pp > 0.95) {
"<span class='bad'>FAIL — model misfit</span>"
} else {
"<span class='good'>PASS — consistent</span>"
}
tags$div(class = "stats-box",
HTML(paste0(
"<b>T(y_obs):</b> ", round(d$t_obs, 3), "<br>",
"<b>Mean T(y_rep):</b> ", round(mean(d$t_rep), 3), "<br>",
"<hr style='margin:8px 0'>",
"<b>p<sub>pp</sub>:</b> ", round(d$p_pp, 3), "<br>",
"<b>Verdict:</b> ", verdict, "<br>",
"<hr style='margin:8px 0'>",
"<b>DGP:</b> ", d$dgp, "<br>",
"<b>Statistic:</b> ", stat_label
))
)
})
}
shinyApp(ui, server)
Things to try
- Normal DGP, any statistic: the check passes. The model is correct, so \(T(\mathbf{y}^{\text{obs}})\) falls comfortably within the replicated distribution. The posterior predictive p-value is near 0.5.
- Heavy-tailed (t, df=3), test stat = Max: the normal model can’t produce the extreme values that a heavy-tailed distribution generates. \(\max(y)\) is far into the tail of the replicated distribution. The check fails (reproduced in the short sketch after this list).
- Skewed (exponential), test stat = Skewness: the normal model is symmetric, so replicated datasets have skewness near 0. But the exponential data is heavily right-skewed. The check catches this.
- Large n makes failures sharper: with n = 500, the replicated distribution concentrates tightly, making any mismatch glaringly obvious. Small samples are more forgiving.
- Left panel clue: when the model is wrong, the light red replicated curves don’t match the shape of the observed histogram. That visual mismatch is what the p-value quantifies.
- SD statistic, Normal DGP: a good check that the variance is well-captured. Switch to heavy-tailed data — the SD check may or may not fail depending on the specific draw.
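As a companion to the heavy-tailed bullet above, here is a self-contained snippet that reproduces that failure without the app; the seed, sample size, and replication count are arbitrary illustration choices.
# Heavy-tailed truth, normal model, T = max: the check should fail
set.seed(42)
n     <- 500
y_obs <- 5 + 2 * rt(n, df = 3)                 # t(3) data: occasional extreme values
ybar  <- mean(y_obs)
s2    <- var(y_obs)
t_rep <- replicate(2000, {
  sig2 <- s2 * (n - 1) / rchisq(1, df = n - 1) # sigma^2 | y
  mu   <- rnorm(1, ybar, sqrt(sig2 / n))       # mu | sigma^2, y
  max(rnorm(n, mu, sqrt(sig2)))                # max of one replicated dataset
})
mean(t_rep >= max(y_obs))                      # p_pp: usually very close to 0 here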
In Stata
* Fit the Bayesian model, saving the MCMC draws
bayes, saving(ppcsim): regress y x1 x2
* Simulate replicated outcomes from the posterior predictive distribution
bayespredict {_ysim}, saving(ppcrep)
* Posterior predictive p-values for test statistics (any Mata function
* of the simulated outcome can serve as T)
bayesstats ppvalues (mean:@mean({_ysim})) (var:@variance({_ysim})) ///
    (max:@max({_ysim})) using ppcrep
* If p_pp is near 0 or 1, the model struggles
* to reproduce that feature of the data
Did you know?
Rubin (1984) introduced posterior predictive checks as a general framework for model criticism. The key insight: use the model to generate data, then see if the generated data resemble reality. Simple, powerful, and applicable to any model.
Gelman, Meng & Stern (1996) formalized posterior predictive p-values and showed how to choose test statistics that target specific model assumptions. Their paper established PPCs as a standard tool in applied Bayesian work.
Connection to cross-validation: LOO-CV (leave-one-out cross-validation) is another model checking tool. While PPCs ask “can the model reproduce features of the data?”, LOO-CV asks “can the model predict held-out observations?” Both catch model problems, but from different angles. The loo package in R and Stata’s bayesstats ic with WAIC approximate LOO-CV efficiently.
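For the R side of that connection, here is a minimal sketch of what a loo call looks like for the same normal model, assuming you build a draws-by-observations matrix of pointwise log-likelihoods first; the simulated data and draw counts are illustrative.
library(loo)
# Simulated data and posterior draws for the normal model (illustrative)
set.seed(7)
y_obs     <- rnorm(100, 5, 2)
S         <- 2000
n         <- length(y_obs)
ybar      <- mean(y_obs)
s2        <- var(y_obs)
sig_draws <- sqrt(s2 * (n - 1) / rchisq(S, df = n - 1))   # sigma | y
mu_draws  <- rnorm(S, ybar, sig_draws / sqrt(n))          # mu | sigma, y
# Pointwise log-likelihood matrix: rows = posterior draws, columns = observations
log_lik <- sapply(y_obs, function(yi) dnorm(yi, mu_draws, sig_draws, log = TRUE))
loo(log_lik)   # PSIS-LOO estimate of expected log predictive density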