Model Comparison
Which model is better?
You’ve built a model — maybe a regression with three predictors. A colleague suggests dropping one. Frequentists compare models with F-tests or likelihood ratio tests. Bayesians have their own answer: Bayes factors.
The idea is simple. Compute the marginal likelihood of each model — how well the model predicted the data, averaging over all possible parameter values — and take the ratio:
\[ \text{BF}_{10} = \frac{p(\mathbf{y} \mid M_1)}{p(\mathbf{y} \mid M_0)} \]
where \(p(\mathbf{y} \mid M_k) = \int p(\mathbf{y} \mid \theta, M_k) \, p(\theta \mid M_k) \, d\theta\).
A Bayes factor of 10 means the data are 10 times more likely under \(M_1\) than \(M_0\). Unlike a p-value, Bayes factors can support the null — a BF of 0.1 means the data favor \(M_0\) by a factor of 10.
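To make the integral concrete, here is a minimal R sketch for a single-predictor setup (the toy data, the known noise SD, and the \(N(0, \tau^2)\) prior on \(\beta\) are illustrative assumptions): the marginal likelihood of each model is the likelihood averaged over that model's prior, and the Bayes factor is their ratio.

```r
# Numerically average the likelihood over the prior for each model, then take the ratio.
set.seed(1)
n     <- 50
sigma <- 2                               # known noise SD (assumed for simplicity)
tau   <- 1                               # prior SD on beta: beta ~ N(0, tau^2)
x     <- rnorm(n)
y     <- 0.8 * x + rnorm(n, 0, sigma)    # toy data with a real effect

# log-likelihood of the data at a given beta
loglik <- function(beta) sum(dnorm(y, beta * x, sigma, log = TRUE))

# M0 fixes beta = 0 and has no free parameters, so p(y | M0) is the likelihood at 0
log_ml0 <- loglik(0)

# M1: p(y | M1) = integral of likelihood(beta) * prior(beta) d beta.
# Integrating the likelihood *ratio* against the prior keeps the numbers on a
# friendly scale and returns BF10 directly.
integrand <- function(b) sapply(b, function(bi) exp(loglik(bi) - log_ml0) * dnorm(bi, 0, tau))
bf_10 <- integrate(integrand, lower = -Inf, upper = Inf)$value
bf_10   # > 1 means the data favor M1 (including x)
```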
To get posterior odds between models, multiply the Bayes factor by your prior odds:
\[ \underbrace{\frac{P(M_1 \mid \mathbf{y})}{P(M_0 \mid \mathbf{y})}}_{\text{posterior odds}} = \underbrace{\frac{P(M_1)}{P(M_0)}}_{\text{prior odds}} \times \underbrace{\text{BF}_{10}}_{\text{Bayes factor}} \]
With equal prior odds (50/50 between models), the posterior odds equal the Bayes factor.
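A quick worked example: start skeptical of \(M_1\) with prior odds of 1/4; if the data then produce \(\text{BF}_{10} = 12\), the posterior odds are \(12 \times \tfrac{1}{4} = 3\), i.e. \(P(M_1 \mid \mathbf{y}) = 3/(3+1) = 0.75\). The data moved you, but the prior odds still matter.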
Jeffreys’ interpretation scale:
| BF₁₀ | Evidence for M₁ |
|---|---|
| 1–3 | Anecdotal |
| 3–10 | Substantial |
| 10–30 | Strong |
| 30–100 | Very strong |
| > 100 | Decisive |
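If you want this mapping as code, here is a small R helper (the function name is ours) that applies the cutoffs from the table, symmetrically for evidence favoring \(M_0\):

```r
# Map a Bayes factor BF10 to the Jeffreys-style label in the table above
jeffreys_label <- function(bf_10) {
  b <- max(bf_10, 1 / bf_10)                       # strength, ignoring direction
  strength <- cut(b, breaks = c(1, 3, 10, 30, 100, Inf),
                  labels = c("Anecdotal", "Substantial", "Strong",
                             "Very strong", "Decisive"),
                  include.lowest = TRUE)
  paste(strength, "for", if (bf_10 >= 1) "M1" else "M0")
}

jeffreys_label(18)     # "Strong for M1"
jeffreys_label(0.02)   # "Very strong for M0"
```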
Connection to BIC
Computing the marginal likelihood exactly requires integrating over the prior — often intractable. But there’s a shortcut you already have.
The Bayesian Information Criterion, \(\text{BIC} = k \log n - 2 \log p(\mathbf{y} \mid \hat{\theta}, M)\) for a model with \(k\) free parameters and maximized likelihood \(p(\mathbf{y} \mid \hat{\theta}, M)\), approximates \(-2\) times the log marginal likelihood:
\[ \text{BIC} \approx -2 \log p(\mathbf{y} \mid M) + \text{const} \]
The leftover term stays bounded as \(n\) grows, so it washes out in comparisons. The difference in BIC between two models therefore approximates \(-2\) times the log Bayes factor:
\[ \Delta\text{BIC} = \text{BIC}_1 - \text{BIC}_0 \approx -2 \log \text{BF}_{10} \]
which gives us:
\[ \text{BF}_{10} \approx \exp\!\left(-\tfrac{\Delta\text{BIC}}{2}\right) \]
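In practice, this means two `lm()` fits and their `BIC()` values give you a rough Bayes factor for free. A minimal sketch with simulated data (the variable names and true coefficients are made up for illustration):

```r
# Approximate BF10 for "keep x2" (M1) vs "drop x2" (M0) from Delta-BIC
set.seed(42)
n   <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1.5 * dat$x1 + 0.4 * dat$x2 + rnorm(n)

fit0 <- lm(y ~ x1,      data = dat)   # M0: drop x2
fit1 <- lm(y ~ x1 + x2, data = dat)   # M1: keep x2

delta_bic    <- BIC(fit1) - BIC(fit0)
bf_10_approx <- exp(-delta_bic / 2)   # > 1 favors M1, < 1 favors M0
bf_10_approx
```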
Three approaches compared:
| Approach | Favors M₁ when | Can support null? | Requires prior? |
|---|---|---|---|
| p-value | p < 0.05 | No (only rejects) | No |
| BIC | ΔBIC < 0 | Yes (ΔBIC > 0) | No (implicit) |
| Bayes factor | BF₁₀ > 1 | Yes (BF₁₀ < 1) | Yes (explicit) |
Simulation: Does X belong in the model?
Compare two models: \(M_0\): \(y = \text{noise}\) versus \(M_1\): \(y = \beta x + \text{noise}\). The Bayes factor tells you whether the data support including \(x\). Because the models are nested and \(M_0\) simply fixes \(\beta = 0\), the app computes \(\text{BF}_{10}\) with the Savage-Dickey density ratio: the prior density at \(\beta = 0\) divided by the posterior density at \(\beta = 0\).
#| standalone: true
#| viewerHeight: 620
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
.good { color: #27ae60; font-weight: bold; }
.bad { color: #e74c3c; font-weight: bold; }
.muted { color: #7f8c8d; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("true_beta", HTML("True β:"),
min = 0, max = 3, value = 0, step = 0.25),
sliderInput("prior_sd", "Prior SD on β:",
min = 0.1, max = 5, value = 1, step = 0.1),
sliderInput("n", "Sample size (n):",
min = 10, max = 300, value = 50, step = 10),
sliderInput("sigma", HTML("Noise (σ):"),
min = 0.5, max = 5, value = 2, step = 0.5),
actionButton("go", "New data", class = "btn-primary", width = "100%"),
uiOutput("results")
),
mainPanel(
width = 9,
fluidRow(
column(6, plotOutput("density_plot", height = "420px")),
column(6, plotOutput("bf_plot", height = "420px"))
)
)
)
)
server <- function(input, output, session) {
dat <- reactive({
input$go
true_beta <- input$true_beta
prior_sd <- input$prior_sd
n <- input$n
sigma <- input$sigma
# Generate data
x <- rnorm(n, 0, 1)
y <- true_beta * x + rnorm(n, 0, sigma)
# OLS estimate under M1
beta_ols <- sum(x * y) / sum(x^2)
se_ols <- sigma / sqrt(sum(x^2))
# Bayesian posterior under M1 (prior: beta ~ N(0, prior_sd^2))
prior_prec <- 1 / prior_sd^2
data_prec <- sum(x^2) / sigma^2
post_prec <- prior_prec + data_prec
post_sd <- 1 / sqrt(post_prec)
post_mean <- (0 * prior_prec + beta_ols * data_prec) / post_prec
# Savage-Dickey BF: BF10 = prior density at 0 / posterior density at 0
prior_at_0 <- dnorm(0, 0, prior_sd)
post_at_0 <- dnorm(0, post_mean, post_sd)
bf_10 <- prior_at_0 / post_at_0
# BIC approximation
# Under M0: RSS = sum(y^2); Under M1: RSS = sum((y - beta_ols*x)^2)
rss0 <- sum(y^2)
rss1 <- sum((y - beta_ols * x)^2)
bic0 <- n * log(rss0 / n) + 0 * log(n)
bic1 <- n * log(rss1 / n) + 1 * log(n)
delta_bic <- bic1 - bic0
bf_bic <- exp(-delta_bic / 2)
# Jeffreys label
abf <- bf_10
label <- if (abf > 100) "Decisive for M1"
else if (abf > 30) "Very strong for M1"
else if (abf > 10) "Strong for M1"
else if (abf > 3) "Substantial for M1"
else if (abf > 1) "Anecdotal for M1"
else if (abf > 1/3) "Anecdotal for M0"
else if (abf > 1/10) "Substantial for M0"
else if (abf > 1/30) "Strong for M0"
else if (abf > 1/100) "Very strong for M0"
else "Decisive for M0"
list(x = x, y = y, beta_ols = beta_ols, se_ols = se_ols,
prior_sd = prior_sd, post_mean = post_mean, post_sd = post_sd,
prior_at_0 = prior_at_0, post_at_0 = post_at_0,
bf_10 = bf_10, delta_bic = delta_bic, bf_bic = bf_bic,
label = label, true_beta = true_beta)
})
output$density_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
# Plot range
xlim <- c(min(-3 * d$prior_sd, d$post_mean - 4 * d$post_sd),
max(3 * d$prior_sd, d$post_mean + 4 * d$post_sd))
x_seq <- seq(xlim[1], xlim[2], length.out = 400)
prior_y <- dnorm(x_seq, 0, d$prior_sd)
post_y <- dnorm(x_seq, d$post_mean, d$post_sd)
ylim <- c(0, max(c(prior_y, post_y)) * 1.2)
plot(NULL, xlim = xlim, ylim = ylim,
xlab = expression(beta), ylab = "Density",
main = "Savage-Dickey: Prior vs Posterior at \u03b2 = 0")
# Prior (red dashed)
lines(x_seq, prior_y, col = "#e74c3c", lwd = 2, lty = 2)
# Posterior (blue filled)
polygon(c(x_seq, rev(x_seq)), c(post_y, rep(0, length(x_seq))),
col = "#3498db30", border = NA)
lines(x_seq, post_y, col = "#3498db", lwd = 2.5)
# Mark densities at beta = 0
segments(0, 0, 0, d$prior_at_0, col = "#e74c3c", lwd = 2, lty = 3)
points(0, d$prior_at_0, pch = 16, col = "#e74c3c", cex = 1.5)
segments(0, 0, 0, d$post_at_0, col = "#3498db", lwd = 2, lty = 3)
points(0, d$post_at_0, pch = 16, col = "#3498db", cex = 1.5)
# Labels
text(0, d$prior_at_0, pos = 4, cex = 0.8, col = "#e74c3c",
labels = paste0("Prior(0) = ", round(d$prior_at_0, 3)))
text(0, d$post_at_0, pos = 4, cex = 0.8, col = "#3498db",
labels = paste0("Post(0) = ", round(d$post_at_0, 3)))
legend("topright", bty = "n", cex = 0.85,
legend = c("Prior", "Posterior"),
col = c("#e74c3c", "#3498db"),
lwd = c(2, 2.5), lty = c(2, 1))
})
output$bf_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 6, 3, 1))
log_bf <- log10(d$bf_10)
log_bic <- log10(d$bf_bic)
vals <- c(log_bf, log_bic)
xlim <- range(c(vals, -2.5, 2.5))  # always show at least the [-2.5, 2.5] window
bp <- barplot(vals, horiz = TRUE,
names.arg = c("Bayes Factor", "BIC approx"),
col = c("#3498db", "#f39c12"),
xlim = xlim, las = 1, border = NA,
main = expression("log"[10] * "(BF"[10] * ")"),
xlab = expression("log"[10] * "(BF"[10] * ")"))
# Jeffreys reference lines
abline(v = 0, lty = 1, col = "#2c3e50", lwd = 1.5)
ref_vals <- c(log10(3), log10(10), log10(30), log10(100),
-log10(3), -log10(10), -log10(30), -log10(100))
ref_labs <- c("3", "10", "30", "100", "1/3", "1/10", "1/30", "1/100")
for (i in seq_along(ref_vals)) {
if (ref_vals[i] >= xlim[1] && ref_vals[i] <= xlim[2]) {
abline(v = ref_vals[i], lty = 3, col = "#bdc3c7")
}
}
# Label regions
text(xlim[2] * 0.85, mean(bp), "Favors M1 \u2192", cex = 0.8, col = "#27ae60")
text(xlim[1] * 0.85, mean(bp), "\u2190 Favors M0", cex = 0.8, col = "#e74c3c")
})
output$results <- renderUI({
d <- dat()
label_class <- if (d$bf_10 >= 3) "good"
else if (d$bf_10 <= 1/3) "bad"
else "muted"
tags$div(class = "stats-box",
HTML(paste0(
"<b>BF\u2081\u2080:</b> ", round(d$bf_10, 3), "<br>",
"<b>log\u2081\u2080(BF):</b> ", round(log10(d$bf_10), 3), "<br>",
"<hr style='margin:8px 0'>",
"<b>\u0394BIC:</b> ", round(d$delta_bic, 2), "<br>",
"<b>BIC-approx BF:</b> ", round(d$bf_bic, 3), "<br>",
"<hr style='margin:8px 0'>",
"<b>Verdict:</b> <span class='", label_class, "'>",
d$label, "</span>"
))
)
})
}
shinyApp(ui, server)
Things to try
- True β = 0 (null is true): the Bayes factor favors \(M_0\). Unlike a p-value that can only “fail to reject,” the BF actively supports the null.
- True β = 2: the BF strongly favors \(M_1\). Increase n to make the evidence decisive.
- Wide prior (prior SD = 5): even when β is truly nonzero, a very wide prior penalizes \(M_1\). The prior spreads probability over implausible values, wasting “predictive budget.” This is Lindley’s paradox at work: a sufficiently vague prior can push the Bayes factor toward \(M_0\) even when the effect is real (see the sketch after this list).
- Narrow prior (prior SD = 0.5) centered at 0: a tight prior makes \(M_1\) more efficient — if the truth is near 0, \(M_1\) predicts nearly as well as \(M_0\).
- Watch the left panel: the Savage-Dickey ratio is literally prior density at 0 divided by posterior density at 0. When the posterior shifts away from 0, the denominator shrinks and the BF grows.
- Compare BF and BIC: they usually agree in direction but the magnitudes differ. BIC is a large-sample approximation, so they converge with bigger n.
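To see the wide-prior effect outside the app, here is a small sketch of Lindley’s paradox using the same known-\(\sigma\), no-intercept Savage-Dickey formula the app uses (the data-generating values are arbitrary). The regression t-statistic is fixed by the data, but \(\text{BF}_{10}\) drifts toward \(M_0\) as the prior on \(\beta\) widens.

```r
# Same data, increasingly vague priors on beta: watch BF10 fall
set.seed(7)
n     <- 200
sigma <- 2
x     <- rnorm(n)
y     <- 0.3 * x + rnorm(n, 0, sigma)    # small but real effect

bf10_savage_dickey <- function(prior_sd) {
  beta_hat  <- sum(x * y) / sum(x^2)                   # OLS slope, no intercept
  post_prec <- 1 / prior_sd^2 + sum(x^2) / sigma^2     # precisions add
  post_mean <- (beta_hat * sum(x^2) / sigma^2) / post_prec
  dnorm(0, 0, prior_sd) / dnorm(0, post_mean, 1 / sqrt(post_prec))
}

round(sapply(c(0.5, 1, 5, 50, 500), bf10_savage_dickey), 3)

# The frequentist answer doesn't change with the prior:
summary(lm(y ~ x - 1))$coefficients
```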
In Stata
* Fit two competing models
bayes: reg y x1 x2
bayesstats ic
bayes: reg y x1
bayesstats ic
* Compare DIC/WAIC across models
* Lower DIC or WAIC = better predictive fit
* ΔDIC works like ΔBIC: bigger gap = stronger preference
Did you know?
Jeffreys (1961) proposed the interpretation scale for Bayes factors in Theory of Probability. The table above is the common modern rounding of his scale: his cut points were powers of \(\sqrt{10}\), and he labeled the weakest grade “not worth more than a bare mention” rather than “anecdotal.” The categories remain the standard reference, though some Bayesians caution against rigid thresholds (just as with p = 0.05).
Lindley’s paradox (1957): with a sufficiently large sample, a vague prior can lead the Bayes factor to favor the null even when the p-value is tiny. The resolution: the BF is sensitive to the spread of the prior, not just its center. A prior of \(N(0, 1000^2)\) says you think β could be anywhere from -3000 to 3000 — that’s a real claim, and the data punish it.
Kass & Raftery (1995) wrote the definitive review of Bayes factors, proposing modified guidelines and advocating BIC as a practical approximation. Their paper remains one of the most cited in Bayesian statistics.