Model Comparison

Which model is better?

You’ve built a model — maybe a regression with three predictors. A colleague suggests dropping one. Frequentists compare models with F-tests or likelihood ratio tests. Bayesians have their own answer: Bayes factors.

The idea is simple. Compute the marginal likelihood of each model — how well the model predicted the data, averaging over all possible parameter values — and take the ratio:

\[ \text{BF}_{10} = \frac{p(\mathbf{y} \mid M_1)}{p(\mathbf{y} \mid M_0)} \]

where \(p(\mathbf{y} \mid M_k) = \int p(\mathbf{y} \mid \theta, M_k) \, p(\theta \mid M_k) \, d\theta\).

A Bayes factor of 10 means the data are 10 times more likely under \(M_1\) than \(M_0\). Unlike a p-value, Bayes factors can support the null — a BF of 0.1 means the data favor \(M_0\) by a factor of 10.
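
To see the definition in action, here is a minimal sketch that computes both marginal likelihoods by brute-force numerical integration for the one-predictor setup used throughout this section (data and prior values are illustrative; σ is treated as known to keep the integral one-dimensional):

# M0: y ~ N(0, sigma^2) vs M1: y ~ N(beta*x, sigma^2) with prior beta ~ N(0, 1)
set.seed(1)
n <- 50; sigma <- 2
x <- rnorm(n)
y <- 0.8 * x + rnorm(n, 0, sigma)             # true beta = 0.8

loglik1 <- function(beta) sum(dnorm(y, beta * x, sigma, log = TRUE))

# p(y | M1): integrate likelihood x prior over beta
# (exponentiating log-likelihoods underflows for very large n; fine at this scale)
marg1 <- integrate(Vectorize(function(b) exp(loglik1(b)) * dnorm(b, 0, 1)),
                   lower = -10, upper = 10)$value

# p(y | M0) has no free parameters: it is just the likelihood at beta = 0
marg0 <- exp(sum(dnorm(y, 0, sigma, log = TRUE)))

marg1 / marg0                                 # BF10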

To get posterior odds between models, multiply the Bayes factor by your prior odds:

\[ \underbrace{\frac{P(M_1 \mid \mathbf{y})}{P(M_0 \mid \mathbf{y})}}_{\text{posterior odds}} = \underbrace{\frac{P(M_1)}{P(M_0)}}_{\text{prior odds}} \times \underbrace{\text{BF}_{10}}_{\text{Bayes factor}} \]

With equal prior odds (50/50 between models), the posterior odds equal the Bayes factor.
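
In code, the conversion is one line each way (numbers purely illustrative):

bf_10      <- 10                    # data favor M1 ten-fold
prior_odds <- 1                     # 50/50 between models
post_odds  <- prior_odds * bf_10    # posterior odds = 10
post_odds / (1 + post_odds)         # P(M1 | y) = 10/11, about 0.91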

Jeffreys’ interpretation scale:

  BF₁₀       Evidence for M₁
  1–3        Anecdotal
  3–10       Substantial
  10–30      Strong
  30–100     Very strong
  > 100      Decisive

Reading BF < 1: If BF₁₀ = 0.05, that’s the same as BF₀₁ = 20 — strong evidence for the null. Bayes factors are symmetric: just flip the ratio.


Connection to BIC

Computing the marginal likelihood exactly requires integrating over the prior — often intractable. But there’s a shortcut you already have.

The Bayesian Information Criterion (BIC) approximates minus twice the log marginal likelihood:

\[ \text{BIC} \approx -2 \log p(\mathbf{y} \mid M) + \text{const} \]

So the difference in BIC between two models approximates the log Bayes factor:

\[ \Delta\text{BIC} = \text{BIC}_1 - \text{BIC}_0 \approx -2 \log \text{BF}_{10} \]

which gives us:

\[ \text{BF}_{10} \approx \exp\!\left(-\tfrac{\Delta\text{BIC}}{2}\right) \]

BIC is the lazy Bayes factor. It doesn’t require specifying priors — it uses an implicit “unit information prior.” You already get BIC from Stata’s estat ic. To approximate a Bayes factor, just compute \(\exp(-\Delta\text{BIC}/2)\).
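
The same shortcut works in R with any fitted models that report a BIC; a quick sketch with simulated data (names and numbers illustrative):

set.seed(1)
d   <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 0.5 * d$x1 + rnorm(100)

m0 <- lm(y ~ x1, data = d)            # smaller model
m1 <- lm(y ~ x1 + x2, data = d)       # adds x2, which is truly irrelevant here

delta_bic <- BIC(m1) - BIC(m0)
exp(-delta_bic / 2)                   # approximate BF10; values < 1 favor m0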

Three approaches compared:

  Approach       Favors M₁ when    Can support null?    Requires prior?
  p-value        p < 0.05          No (only rejects)    No
  BIC            ΔBIC < 0          Yes (ΔBIC > 0)       No (implicit)
  Bayes factor   BF₁₀ > 1          Yes (BF₁₀ < 1)       Yes (explicit)

Simulation: Does X belong in the model?

Compare two models: \(M_0\): \(y = \text{noise}\) vs \(M_1\): \(y = \beta x + \text{noise}\). The Bayes factor tells you whether the data support including \(x\).

#| standalone: true
#| viewerHeight: 620

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
    .good  { color: #27ae60; font-weight: bold; }
    .bad   { color: #e74c3c; font-weight: bold; }
    .muted { color: #7f8c8d; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("true_beta", HTML("True &beta;:"),
                  min = 0, max = 3, value = 0, step = 0.25),

      sliderInput("prior_sd", "Prior SD on &beta;:",
                  min = 0.1, max = 5, value = 1, step = 0.1),

      sliderInput("n", "Sample size (n):",
                  min = 10, max = 300, value = 50, step = 10),

      sliderInput("sigma", HTML("Noise (&sigma;):"),
                  min = 0.5, max = 5, value = 2, step = 0.5),

      actionButton("go", "New data", class = "btn-primary", width = "100%"),

      uiOutput("results")
    ),

    mainPanel(
      width = 9,
      fluidRow(
        column(6, plotOutput("density_plot", height = "420px")),
        column(6, plotOutput("bf_plot", height = "420px"))
      )
    )
  )
)

server <- function(input, output, session) {

  dat <- reactive({
    input$go
    true_beta <- input$true_beta
    prior_sd  <- input$prior_sd
    n         <- input$n
    sigma     <- input$sigma

    # Generate data
    x <- rnorm(n, 0, 1)
    y <- true_beta * x + rnorm(n, 0, sigma)

    # OLS estimate under M1
    beta_ols <- sum(x * y) / sum(x^2)
    se_ols   <- sigma / sqrt(sum(x^2))

    # Bayesian posterior under M1 (prior: beta ~ N(0, prior_sd^2))
    prior_prec <- 1 / prior_sd^2
    data_prec  <- sum(x^2) / sigma^2
    post_prec  <- prior_prec + data_prec
    post_sd    <- 1 / sqrt(post_prec)
    post_mean  <- (0 * prior_prec + beta_ols * data_prec) / post_prec

    # Savage-Dickey BF: BF10 = prior density at 0 / posterior density at 0
    prior_at_0 <- dnorm(0, 0, prior_sd)
    post_at_0  <- dnorm(0, post_mean, post_sd)
    bf_10      <- prior_at_0 / post_at_0

    # BIC approximation
    # Under M0: RSS = sum(y^2); Under M1: RSS = sum((y - beta_ols*x)^2)
    rss0 <- sum(y^2)
    rss1 <- sum((y - beta_ols * x)^2)
    bic0 <- n * log(rss0 / n) + 0 * log(n)
    bic1 <- n * log(rss1 / n) + 1 * log(n)
    delta_bic <- bic1 - bic0
    bf_bic    <- exp(-delta_bic / 2)

    # Jeffreys label
    abf <- bf_10
    label <- if (abf > 100) "Decisive for M1"
             else if (abf > 30) "Very strong for M1"
             else if (abf > 10) "Strong for M1"
             else if (abf > 3) "Substantial for M1"
             else if (abf > 1) "Anecdotal for M1"
             else if (abf > 1/3) "Anecdotal for M0"
             else if (abf > 1/10) "Substantial for M0"
             else if (abf > 1/30) "Strong for M0"
             else if (abf > 1/100) "Very strong for M0"
             else "Decisive for M0"

    list(x = x, y = y, beta_ols = beta_ols, se_ols = se_ols,
         prior_sd = prior_sd, post_mean = post_mean, post_sd = post_sd,
         prior_at_0 = prior_at_0, post_at_0 = post_at_0,
         bf_10 = bf_10, delta_bic = delta_bic, bf_bic = bf_bic,
         label = label, true_beta = true_beta)
  })

  output$density_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    # Plot range
    xlim <- c(min(-3 * d$prior_sd, d$post_mean - 4 * d$post_sd),
              max(3 * d$prior_sd, d$post_mean + 4 * d$post_sd))
    x_seq <- seq(xlim[1], xlim[2], length.out = 400)

    prior_y <- dnorm(x_seq, 0, d$prior_sd)
    post_y  <- dnorm(x_seq, d$post_mean, d$post_sd)

    ylim <- c(0, max(c(prior_y, post_y)) * 1.2)

    plot(NULL, xlim = xlim, ylim = ylim,
         xlab = expression(beta), ylab = "Density",
         main = "Savage-Dickey: Prior vs Posterior at \u03b2 = 0")

    # Prior (red dashed)
    lines(x_seq, prior_y, col = "#e74c3c", lwd = 2, lty = 2)

    # Posterior (blue filled)
    polygon(c(x_seq, rev(x_seq)), c(post_y, rep(0, length(x_seq))),
            col = "#3498db30", border = NA)
    lines(x_seq, post_y, col = "#3498db", lwd = 2.5)

    # Mark densities at beta = 0
    segments(0, 0, 0, d$prior_at_0, col = "#e74c3c", lwd = 2, lty = 3)
    points(0, d$prior_at_0, pch = 16, col = "#e74c3c", cex = 1.5)

    segments(0, 0, 0, d$post_at_0, col = "#3498db", lwd = 2, lty = 3)
    points(0, d$post_at_0, pch = 16, col = "#3498db", cex = 1.5)

    # Labels
    text(0, d$prior_at_0, pos = 4, cex = 0.8, col = "#e74c3c",
         labels = paste0("Prior(0) = ", round(d$prior_at_0, 3)))
    text(0, d$post_at_0, pos = 4, cex = 0.8, col = "#3498db",
         labels = paste0("Post(0) = ", round(d$post_at_0, 3)))

    legend("topright", bty = "n", cex = 0.85,
           legend = c("Prior", "Posterior"),
           col = c("#e74c3c", "#3498db"),
           lwd = c(2, 2.5), lty = c(2, 1))
  })

  output$bf_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 6, 3, 1))

    log_bf  <- log10(d$bf_10)
    log_bic <- log10(d$bf_bic)

    vals <- c(log_bf, log_bic)
    xlim <- range(c(vals, -2.5, 2.5))  # always show at least [-2.5, 2.5]

    bp <- barplot(vals, horiz = TRUE,
                  names.arg = c("Bayes Factor", "BIC approx"),
                  col = c("#3498db", "#f39c12"),
                  xlim = xlim, las = 1, border = NA,
                  main = expression("log"[10] * "(BF"[10] * ")"),
                  xlab = expression("log"[10] * "(BF"[10] * ")"))

    # Jeffreys reference lines
    abline(v = 0, lty = 1, col = "#2c3e50", lwd = 1.5)
    ref_vals <- c(log10(3), log10(10), log10(30), log10(100),
                  -log10(3), -log10(10), -log10(30), -log10(100))
    ref_labs <- c("3", "10", "30", "100", "1/3", "1/10", "1/30", "1/100")

    for (i in seq_along(ref_vals)) {
      if (ref_vals[i] >= xlim[1] && ref_vals[i] <= xlim[2]) {
        abline(v = ref_vals[i], lty = 3, col = "#bdc3c7")
        mtext(ref_labs[i], side = 3, at = ref_vals[i], line = 0,
              cex = 0.65, col = "#7f8c8d")
      }
    }

    # Label regions
    text(xlim[2] * 0.85, mean(bp), "Favors M1 \u2192", cex = 0.8, col = "#27ae60")
    text(xlim[1] * 0.85, mean(bp), "\u2190 Favors M0", cex = 0.8, col = "#e74c3c")
  })

  output$results <- renderUI({
    d <- dat()
    label_class <- if (d$bf_10 >= 3) "good"
                   else if (d$bf_10 <= 1/3) "bad"
                   else "muted"

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>BF\u2081\u2080:</b> ", round(d$bf_10, 3), "<br>",
        "<b>log\u2081\u2080(BF):</b> ", round(log10(d$bf_10), 3), "<br>",
        "<hr style='margin:8px 0'>",
        "<b>\u0394BIC:</b> ", round(d$delta_bic, 2), "<br>",
        "<b>BIC-approx BF:</b> ", round(d$bf_bic, 3), "<br>",
        "<hr style='margin:8px 0'>",
        "<b>Verdict:</b> <span class='", label_class, "'>",
        d$label, "</span>"
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • True β = 0 (null is true): the Bayes factor favors \(M_0\). Unlike a p-value that can only “fail to reject,” the BF actively supports the null.
  • True β = 2: the BF strongly favors \(M_1\). Increase n to make the evidence decisive.
  • Wide prior (prior SD = 5): even when β is truly nonzero, a very wide prior penalizes \(M_1\). The prior spreads probability over implausible values, wasting “predictive budget.” This is Lindley’s paradox in action: a vague prior can make the Bayes factor favor the null even when the effect is real.
  • Narrow prior (prior SD = 0.5) centered at 0: a tight prior makes \(M_1\) more efficient — if the truth is near 0, \(M_1\) predicts nearly as well as \(M_0\).
  • Watch the left panel: the Savage-Dickey ratio is literally prior density at 0 divided by posterior density at 0. When the posterior shifts away from 0, the denominator shrinks and the BF grows.
  • Compare BF and BIC: they usually agree in direction but the magnitudes differ. BIC is a large-sample approximation, so they converge with bigger n — the sketch after this list checks that convergence outside the app.
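
For that last point, here is a minimal sketch that reruns the app’s two calculations outside Shiny (same formulas; prior SD = 1 and σ = 2 match the app’s defaults) and prints both log-BFs for growing n:

bf_pair <- function(n, beta = 0.5, sigma = 2, prior_sd = 1) {
  x <- rnorm(n); y <- beta * x + rnorm(n, 0, sigma)
  b_hat     <- sum(x * y) / sum(x^2)
  post_prec <- 1 / prior_sd^2 + sum(x^2) / sigma^2
  post_mean <- (b_hat * sum(x^2) / sigma^2) / post_prec
  # Savage-Dickey BF: prior density at 0 over posterior density at 0
  bf_sd     <- dnorm(0, 0, prior_sd) / dnorm(0, post_mean, 1 / sqrt(post_prec))
  # BIC difference, as in the app: M1 has one parameter, M0 has none
  delta_bic <- (n * log(sum((y - b_hat * x)^2) / n) + log(n)) -
                n * log(sum(y^2) / n)
  c(n = n, log10_bf = log10(bf_sd), log10_bic = log10(exp(-delta_bic / 2)))
}

set.seed(42)
t(sapply(c(20, 50, 200, 1000), bf_pair))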

In Stata

* Fit two competing models
bayes: reg y x1 x2
bayesstats ic

bayes: reg y x1
bayesstats ic

* Compare DIC/WAIC across models
* Lower DIC or WAIC = better predictive fit
* ΔDIC works like ΔBIC: bigger gap = stronger preference

Quick BIC-based Bayes factor: run the frequentist models, record each model’s BIC from estat ic, then display exp(-(BIC1 - BIC0)/2) prints an approximate Bayes factor. No MCMC needed.


Did you know?

  • Jeffreys (1961) proposed the interpretation scale for Bayes factors in Theory of Probability. His categories — substantial, strong, very strong, decisive, with “anecdotal” a later relabeling of his “barely worth mentioning” — remain the standard reference, though some Bayesians caution against rigid thresholds (just as with p = 0.05).

  • Lindley’s paradox (1957): with a sufficiently large sample, a vague prior can lead the Bayes factor to favor the null even when the p-value is tiny. The resolution: the BF is sensitive to the spread of the prior, not just its center. A prior of \(N(0, 1000^2)\) says you think β could be anywhere from -3000 to 3000 — that’s a real claim, and the data punish it.

  • Kass & Raftery (1995) wrote the definitive review of Bayes factors, proposing modified guidelines and advocating BIC as a practical approximation. Their paper remains one of the most cited in Bayesian statistics.