Bayes Factors

What a Bayes factor is

One-line intuition. A Bayes factor asks: which hypothesis was less surprised by the data you actually saw? The one that “called it” better wins, and the ratio tells you by how much.

Two weather forecasters. Alice says “70% chance of rain tomorrow.” Bob says “10% chance of rain tomorrow.” It rains. Alice assigned probability \(0.7\) to what actually happened; Bob assigned \(0.1\). The Bayes factor in Alice’s favor is \(0.7 / 0.1 = 7\) — “substantial” evidence that Alice’s model of the weather is better than Bob’s. No priors, no integrals — just whose forecast looked less silly after the fact. Everything else on this page is the same idea, generalized to hypotheses that don’t make a single sharp prediction.

Suppose you have two hypotheses — call them \(H_0\) and \(H_1\) — and a body of data \(y\). Both hypotheses make predictions about what \(y\) should look like, but those predictions depend on unknown parameters. The Bayes factor is the ratio of the average probability each hypothesis assigns to the data, where the averaging is over your prior beliefs about each hypothesis’s parameters:

\[ BF_{10} \;=\; \frac{p(y \mid H_1)}{p(y \mid H_0)} \;=\; \frac{\int p(y \mid \theta, H_1)\, p(\theta \mid H_1)\, d\theta} {\int p(y \mid \theta, H_0)\, p(\theta \mid H_0)\, d\theta}. \]

In plain words: “how much more likely is the data under model 1 than under model 0, after averaging over what each model said could happen.” The two integrals are called marginal likelihoods (or evidence) — this is what makes Bayes factors hard to compute and powerful when you can compute them.

The gambler analogy. Two friends each propose a model for how a coin behaves. Friend 0 says “it’s fair.” Friend 1 says “it could be biased either way.” Before you flip, ask each to bet money on every possible sequence of 100 flips. Friend 0 bets the same tiny amount on every sequence, which puts nearly all of their money on sequences with roughly 50 heads simply because there are so many of them. Friend 1 spreads money more evenly across head counts, so lopsided sequences get far more from Friend 1 than from Friend 0. After flipping, look at who bet how much on the sequence that actually happened. The ratio of their winning bets is the Bayes factor.
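A minimal sketch of the analogy, assuming Friend 1's "could be biased either way" means a uniform prior on the heads probability (an assumption; the text does not pin the prior down). Each friend's bet on one specific sequence with k heads is the probability their model assigns to it. The function names are illustrative.

```python
import math

N_FLIPS = 100

def bet_fair(k, n=N_FLIPS):
    """Friend 0: every specific sequence gets probability 0.5**n."""
    return 0.5 ** n

def bet_uniform_exact(k, n=N_FLIPS):
    """Friend 1 (assumed uniform prior on p): the integral of
    p^k (1-p)^(n-k) over [0, 1], i.e. the Beta function B(k+1, n-k+1)."""
    return math.exp(
        math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)
    )

def bet_uniform_numeric(k, n=N_FLIPS, steps=20_000):
    """Same integral by the midpoint rule: literally averaging the
    likelihood over the prior."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        p = (i + 0.5) * h
        total += p ** k * (1.0 - p) ** (n - k)
    return total * h

def bf10(k, n=N_FLIPS):
    """Ratio of the winning bets: Friend 1 over Friend 0."""
    return bet_uniform_exact(k, n) / bet_fair(k, n)

for heads in (50, 60, 70):
    print(heads, f"BF10 = {bf10(heads):.3g}")
```

With 50 heads the fair-coin friend wins (the ratio falls below 1); with 70 heads the flexible friend wins by a wide margin.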

Note on the subscript convention. \(BF_{10}\) means evidence for \(H_1\) over \(H_0\). \(BF_{01} = 1 / BF_{10}\) goes the other way. Always state which one you’re reporting: swapping them inverts the conclusion, and it is a common mistake.

Jeffreys’ interpretive scale

A Bayes factor is just a number. To turn it into a verdict, the standard references are Jeffreys (1961) and Kass & Raftery (1995). The two scales roughly agree:

\(BF_{10}\) | \(\log_{10} BF_{10}\) | Strength of evidence for \(H_1\)
1 – 3 | 0 – 0.5 | Barely worth mentioning
3 – 10 | 0.5 – 1 | Substantial
10 – 30 | 1 – 1.5 | Strong
30 – 100 | 1.5 – 2 | Very strong
> 100 | > 2 | Decisive

A symmetric reading applies for \(BF_{01} > 1\): evidence for the null.
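The table can be turned into a small lookup. This is a sketch only: the function name is made up, and treating each bin as closed on the left and open on the right is an assumption, since the table's ranges share their endpoints.

```python
def jeffreys_label(bf10):
    """Return (strength label, favored hypothesis) for a Bayes factor BF10.

    Values below 1 are inverted and read as evidence for H0, following
    the symmetric reading of the scale.
    """
    if bf10 <= 0:
        raise ValueError("A Bayes factor must be positive.")
    favored = "H1" if bf10 >= 1 else "H0"
    bf = bf10 if bf10 >= 1 else 1.0 / bf10
    bins = [
        (3, "barely worth mentioning"),
        (10, "substantial"),
        (30, "strong"),
        (100, "very strong"),
    ]
    for upper, label in bins:
        if bf < upper:
            return label, favored
    return "decisive", favored

print(jeffreys_label(7))       # the forecaster example from the introduction
print(jeffreys_label(1 / 20))  # same scale, read in favor of H0
```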

These thresholds are conventions, not laws. They are calibrated against intuitions about how surprised you should be. They do not match p-value thresholds: a p-value of \(0.05\) does not correspond to a particular Bayes factor, and the two can disagree sharply (see Lindley’s paradox below).

Why the prior on \(H_1\) matters so much

Look at the formula again. The numerator, the marginal likelihood under \(H_1\), is an integral over the prior \(p(\theta \mid H_1)\). If you put a vague prior on \(\theta\) (one that spreads mass over a huge range), most of that mass sits in regions where the data are poorly predicted. The integral becomes small. \(H_1\) loses almost regardless of the data, because its prior was too generous about what was possible.

This is the single most important thing to understand about Bayes factors. A diffuse prior is not a “non-committal” prior. A diffuse prior is a strong claim that the parameter could be anywhere over a wide range — and Bayes factors punish you for that claim by averaging the likelihood over places it has no business being.

Three consequences:

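  • Improper priors break things. Setting \(p(\theta \mid H_1)\) to Lebesgue measure (a uniform “prior” over the whole real line) leaves the marginal likelihood defined only up to an arbitrary multiplicative constant, so the Bayes factor can be made any value you like. (This is closely tied to Bartlett’s paradox, 1957: as a proper prior is made ever more diffuse, \(BF_{01}\) grows without bound.)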
  • The choice of prior must be substantive. Use what you actually believed about effect sizes from prior literature, from physical bounds, or from past studies. Report it. A Bayes factor without a disclosed prior is incomplete.
  • Reference priors exist for common testing problems. Jeffreys proposed the Cauchy prior centered at zero for testing a normal mean; this is the basis for JZS (Jeffreys-Zellner-Siow) Bayes factors used in many psychology and economics papers today.
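To see the penalty for diffuseness numerically, here is a small sketch under assumed conditions: a normal estimate with known standard error, a point null at zero, and a \(N(0, \tau^2)\) prior under \(H_1\). The estimate and standard error are illustrative numbers, and the function names are made up. Only the prior width changes between runs.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x (sd is a standard deviation)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def bf01_normal(estimate, se, tau):
    """BF01 for H0: theta = 0 versus H1: theta ~ N(0, tau^2).

    Under H0 the estimate is N(0, se^2); under H1, integrating theta out,
    it is N(0, se^2 + tau^2).
    """
    m0 = normal_pdf(estimate, 0.0, se)
    m1 = normal_pdf(estimate, 0.0, math.sqrt(se ** 2 + tau ** 2))
    return m0 / m1

estimate, se = 0.40, 0.25  # illustrative data, held fixed
for tau in (0.5, 5.0, 50.0, 500.0):
    print(f"tau = {tau:6.1f}   BF01 = {bf01_normal(estimate, se, tau):8.2f}")
```

The data never change, yet the support for the null grows roughly in proportion to \(\tau\): the wider the prior, the more of \(H_1\)'s probability is spent on effect sizes the data rule out.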

Computation

Marginal likelihoods are hard to compute. Four practical approaches:

Savage-Dickey ratio (for nested tests)

When testing whether a parameter equals zero (\(H_0: \theta = 0\) versus \(H_1: \theta \neq 0\)), and the prior under \(H_1\) assigns positive density to \(\theta = 0\):

\[ BF_{01} \;=\; \frac{p(\theta = 0 \mid y, H_1)}{p(\theta = 0 \mid H_1)}. \]

That is — the posterior density at zero divided by the prior density at zero, both computed under \(H_1\). If the posterior places more mass near zero than the prior did, \(BF_{01} > 1\) (evidence for the null). This is practical: you fit one model, evaluate two densities, done.
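A quick check of the identity in a case where both sides are available exactly. This sketch assumes a coin-flip model with \(H_0: p = 0.5\) and a uniform Beta(1, 1) prior under \(H_1\); the data (58 heads in 100 flips) are hypothetical and the function names are illustrative.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, computed on the log scale."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(
        log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)
    )

def bf01_savage_dickey(k, n, a=1.0, b=1.0, p0=0.5):
    """Posterior density at the null value over prior density there.

    With a Beta(a, b) prior and k successes in n trials, the posterior
    under H1 is Beta(a + k, b + n - k).
    """
    prior_at_null = beta_pdf(p0, a, b)
    posterior_at_null = beta_pdf(p0, a + k, b + n - k)
    return posterior_at_null / prior_at_null

def bf01_direct(k, n, p0=0.5):
    """Ratio of marginal likelihoods, valid for the uniform prior only:
    the marginal likelihood of k successes under H1 is 1 / (n + 1)."""
    lik_null = math.comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
    return lik_null * (n + 1)

k, n = 58, 100  # hypothetical data
print(bf01_savage_dickey(k, n), bf01_direct(k, n))
```

The two numbers agree to floating-point precision, which is the point: one fitted model and two density evaluations reproduce the full ratio of marginal likelihoods.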

BIC approximation

For comparing two regression models, the BIC gives a quick back-of-envelope:

\[ \log BF_{10} \;\approx\; -\tfrac{1}{2}\,(BIC_1 - BIC_0). \]

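This holds asymptotically with a unit-information prior, i.e., a prior carrying as much information as one observation. The logarithm here is natural, so divide by \(\ln 10 \approx 2.303\) before reading the \(\log_{10}\) column of Jeffreys’ scale. Useful for rough triage; not a substitute for a real Bayes factor when priors matter.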
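As a sketch of the back-of-envelope calculation: the log-likelihoods, parameter counts, and sample size below are hypothetical, and the helper names are made up. BIC is taken in its usual form, \(k \ln n - 2 \ln \hat{L}\).

```python
import math

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: k * ln(n) - 2 * ln(L_hat)."""
    return n_params * math.log(n_obs) - 2.0 * log_lik

def bf10_from_bic(bic1, bic0):
    """Approximate BF10 via ln BF10 ~ -(BIC1 - BIC0) / 2."""
    return math.exp(-0.5 * (bic1 - bic0))

# Hypothetical fitted models: model 1 adds one parameter to model 0.
n_obs = 100
bic0 = bic(log_lik=-150.0, n_params=2, n_obs=n_obs)
bic1 = bic(log_lik=-147.5, n_params=3, n_obs=n_obs)

bf10 = bf10_from_bic(bic1, bic0)
log10_bf10 = math.log10(bf10)  # the scale used in the Jeffreys table
print(f"BF10 ~ {bf10:.2f}, log10 BF10 ~ {log10_bf10:.2f}")
```

In this made-up case the extra parameter improves the log-likelihood by 2.5, but the \(\ln n\) penalty absorbs most of that gain, so the approximate Bayes factor stays close to 1.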

Closed-form for Gaussian tests

Testing whether a sample mean differs from \(0\) with known \(\sigma^2\) and a normal prior \(\theta \sim N(0, \tau^2)\) under \(H_1\):

\[ BF_{01} \;=\; \sqrt{1 + n\tau^2/\sigma^2}\;\; \exp\!\Big(-\tfrac{1}{2}\, \tfrac{(n\bar{y}/\sigma^2)^2}{n/\sigma^2 + 1/\tau^2}\Big). \]

This is the formula behind the closed-form example we use elsewhere in these notes (see Precise Null vs. Underpowered).
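A short derivation sketch under the same assumptions (known \(\sigma^2\), \(\theta \sim N(0, \tau^2)\) under \(H_1\)), added as a check and not quoted from a source. Since \(\bar{y} \mid \theta \sim N(\theta, \sigma^2/n)\), integrating \(\theta\) out gives \(\bar{y} \mid H_1 \sim N(0, \sigma^2/n + \tau^2)\), while \(\bar{y} \mid H_0 \sim N(0, \sigma^2/n)\). The ratio of the two densities at the observed \(\bar{y}\) is

\[ BF_{01} \;=\; \frac{p(\bar{y} \mid H_0)}{p(\bar{y} \mid H_1)} \;=\; \sqrt{\frac{\sigma^2/n + \tau^2}{\sigma^2/n}}\;\; \exp\!\Big(-\tfrac{\bar{y}^2}{2}\Big[\tfrac{n}{\sigma^2} - \tfrac{1}{\sigma^2/n + \tau^2}\Big]\Big), \]

and the bracket simplifies as

\[ \frac{n}{\sigma^2} - \frac{1}{\sigma^2/n + \tau^2} \;=\; \frac{n^2\tau^2}{\sigma^2(\sigma^2 + n\tau^2)} \;=\; \frac{(n/\sigma^2)^2}{n/\sigma^2 + 1/\tau^2}, \]

which recovers the closed form once \(\bar{y}^2\) is absorbed into the numerator and the square root is written as \(\sqrt{1 + n\tau^2/\sigma^2}\).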

Bridge sampling

For models where neither an analytical formula nor the BIC approximation applies, bridge sampling is the gold-standard numerical method: it uses posterior draws (from MCMC) plus a clever importance-sampling identity to estimate the marginal likelihood. The R package bridgesampling implements it; its bridge_sampler() function accepts fitted Stan models, among others.
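To make the idea concrete, here is a toy one-parameter sketch of the iterative bridge estimator with a moment-matched normal proposal. It is not the bridgesampling package's implementation. Several simplifications are assumptions made for illustration: exact posterior draws stand in for MCMC output, the same draws are used both to fit the proposal and in the estimator, and the model is a conjugate normal one (illustrative estimate 0.40, standard error 0.25, prior standard deviation 0.5) so the answer can be checked exactly.

```python
import math
import random
import statistics

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def bridge_marginal_likelihood(post_draws, unnorm_post, rng, n_iter=50):
    """Estimate the integral of unnorm_post (likelihood times prior)."""
    # Proposal g: a normal matched to the mean and sd of the posterior draws.
    mu = statistics.fmean(post_draws)
    sd = statistics.stdev(post_draws)
    prop_draws = [rng.gauss(mu, sd) for _ in post_draws]
    # Ratios q/g at posterior draws (l1) and at proposal draws (l2).
    l1 = [unnorm_post(t) / normal_pdf(t, mu, sd) for t in post_draws]
    l2 = [unnorm_post(t) / normal_pdf(t, mu, sd) for t in prop_draws]
    n1, n2 = len(l1), len(l2)
    s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)
    r = statistics.fmean(l2)  # start from the plain importance-sampling estimate
    for _ in range(n_iter):
        num = statistics.fmean([l / (s1 * l + s2 * r) for l in l2])
        den = statistics.fmean([1.0 / (s1 * l + s2 * r) for l in l1])
        r = num / den
    return r

rng = random.Random(0)
beta_hat, se, tau = 0.40, 0.25, 0.5  # illustrative numbers

def unnorm_post(beta):
    # Likelihood of the estimate times the N(0, tau^2) prior density.
    return normal_pdf(beta_hat, beta, se) * normal_pdf(beta, 0.0, tau)

post_prec = 1 / se ** 2 + 1 / tau ** 2
post_mean = (beta_hat / se ** 2) / post_prec
post_sd = post_prec ** -0.5
post_draws = [rng.gauss(post_mean, post_sd) for _ in range(4000)]

m1_hat = bridge_marginal_likelihood(post_draws, unnorm_post, rng)
m1_exact = normal_pdf(beta_hat, 0.0, math.sqrt(se ** 2 + tau ** 2))
print(f"bridge estimate {m1_hat:.4f}, exact {m1_exact:.4f}")
```

In a real problem the exact value is unavailable, the draws come from MCMC, and the proposal is usually fitted on a separate portion of the draws, which is part of why the method is not push-button.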

Worked example: testing a regression coefficient

You run a regression and the coefficient on a treatment variable is \(\hat{\beta} = 0.40\) with \(\text{SE} = 0.25\). The \(t\)-statistic is \(1.6\), \(p \approx 0.11\) — not significant at conventional thresholds. Frequentist verdict: fail to reject. But is the data informative about the null, or just quiet?

Assume \(\hat{\beta} \mid \beta \sim N(\beta, 0.25^2)\). Under \(H_1\), put a normal prior \(\beta \sim N(0, \tau^2)\) with \(\tau = 0.5\) — encoding “effects of this magnitude are plausible but not guaranteed.” Plug into the closed-form:

  • Prior precision: \(1/\tau^2 = 4\)
  • Data precision: \(1/0.25^2 = 16\)
  • Posterior precision: \(4 + 16 = 20\)
  • Posterior mean: \((16 \times 0.40) / 20 = 0.32\)
  • Savage-Dickey: \(BF_{01} = N(0; 0.32, 1/\sqrt{20}) / N(0; 0, 0.5)\)

Working through: \(BF_{01} = \sqrt{5}\, e^{-1.024} \approx 0.80\), equivalently \(BF_{10} \approx 1.25\). That is essentially break-even, leaning very slightly toward \(H_1\) and far below “substantial” in either direction. The correct report is “not significant, and the Bayes factor is inconclusive.” The data are quiet, not pro-null.
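The arithmetic can be checked in a few lines. This sketch evaluates the Savage-Dickey ratio from the bullet list and, as a cross-check, the closed-form Gaussian expression with \(n = 1\) and \(\sigma\) set to the standard error (that mapping is an assumption about how the two formulas line up). Variable names are illustrative.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x (sd is a standard deviation)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

beta_hat, se, tau = 0.40, 0.25, 0.5

prior_prec = 1 / tau ** 2                     # 4
data_prec = 1 / se ** 2                       # 16
post_prec = prior_prec + data_prec            # 20
post_mean = data_prec * beta_hat / post_prec  # 0.32
post_sd = post_prec ** -0.5                   # 1 / sqrt(20)

# Savage-Dickey: posterior density at zero over prior density at zero.
bf01_sd = normal_pdf(0.0, post_mean, post_sd) / normal_pdf(0.0, 0.0, tau)

# Closed form for the Gaussian test, with n = 1 and sigma = se.
bf01_cf = math.sqrt(1 + tau ** 2 / se ** 2) * math.exp(
    -0.5 * (beta_hat / se ** 2) ** 2 / (1 / se ** 2 + 1 / tau ** 2)
)

print(f"Savage-Dickey BF01 = {bf01_sd:.3f}, closed form BF01 = {bf01_cf:.3f}")
```

Both routes give a Bayes factor just below 1, which is the numerical content of "inconclusive".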

Why this matters. A frequentist non-rejection collapses every non-significant outcome into one verdict. A Bayes factor distinguishes “the data are too noisy to say” (\(BF_{01} \approx 1\)) from “the data favor the null” (\(BF_{01} \gg 1\)), a distinction a p-value cannot make.

Lindley’s paradox — deep dive

The most famous disagreement between Bayes factors and p-values: with enough data, a frequentist test can reject the null while a Bayes factor with a diffuse prior favors it.

Toy example (Lindley 1957). A sample of \(n = 98{,}000\) newborns yields a male-birth proportion of \(50.36\%\). Testing \(H_0:\ p = 0.5\):

  • \(z = (0.5036 - 0.5) / \sqrt{0.5(0.5)/98000} \approx 2.25\), so the two-sided \(p \approx 0.02\).
  • Frequentist verdict: reject \(H_0\) at the 5% level.
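The frequentist side can be reproduced with the standard library. This is a sketch using the normal approximation and the rounded figures quoted above; the variable names are illustrative.

```python
import math

def two_sided_p_from_z(z):
    """Two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))."""
    return math.erfc(abs(z) / math.sqrt(2.0))

n = 98_000
p_hat = 0.5036   # observed male-birth proportion
p_null = 0.5

se_null = math.sqrt(p_null * (1 - p_null) / n)  # standard error under H0
z = (p_hat - p_null) / se_null
p_value = two_sided_p_from_z(z)

print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")
```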

Now compute a Bayes factor using a uniform prior under \(H_1\): \(p \sim U(0, 1)\). The marginal likelihood under \(H_1\) is the average binomial likelihood over all \(p \in [0,1]\) — most of which is far from \(0.5036\). Working through:

\[ BF_{01} \;=\; \frac{p(y \mid p = 0.5)}{\int_0^1 p(y \mid p)\, dp} \;\approx\; 20. \]

Bayesian verdict: strong evidence for the null. The two procedures disagree on the same dataset.

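Why? The null predicts the observed proportion specifically; the alternative predicts it only on average over a wide range. With \(n\) this large, a deviation of 0.36 percentage points is detectable in z-units but tiny in probability-space, so \(H_1\)’s diffuse prior averages its likelihood down. Tightening the prior under \(H_1\) shrinks that penalty: \(p \sim \text{Beta}(50, 50)\), encoding “roughly 50-50 but allow some slack,” cuts \(BF_{01}\) to about \(2.5\), and a much tighter prior such as \(p \sim \text{Beta}(5000, 5000)\) (prior standard deviation about \(0.005\)) flips the verdict to roughly \(BF_{10} \approx 3\).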
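The prior sensitivity can be explored directly with the beta-binomial marginal likelihood. In this sketch the count of male births is reconstructed by rounding \(0.5036 \times 98{,}000\), which is an assumption because the text gives only the rounded proportion. The Beta(5000, 5000) prior is an illustrative choice of a much tighter alternative, and the function names are made up.

```python
import math

def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def bf01_binomial(k, n, a, b, p0=0.5):
    """BF01 for H0: p = p0 versus H1: p ~ Beta(a, b)."""
    # Binomial likelihood at the point null.
    log_m0 = log_choose(n, k) + k * math.log(p0) + (n - k) * math.log(1 - p0)
    # Beta-binomial marginal likelihood under H1.
    log_m1 = log_choose(n, k) + log_beta(k + a, n - k + b) - log_beta(a, b)
    return math.exp(log_m0 - log_m1)

n = 98_000
k = round(0.5036 * n)  # assumed count of male births

for a, b in [(1, 1), (50, 50), (5000, 5000)]:
    bf01 = bf01_binomial(k, n, a, b)
    print(f"Beta({a}, {b}): BF01 = {bf01:.2f}, BF10 = {1 / bf01:.2f}")
```

The data are identical in all three runs; only the prior under \(H_1\) moves, and the Bayes factor moves with it from favoring the null to mildly favoring the alternative.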

The takeaway is not that one procedure is right and the other wrong. They answer different questions, and the disagreement is a clean diagnostic: whenever a p-value and a Bayes factor disagree, the prior on \(H_1\) is doing the work. Report it explicitly.

When Bayes factors mislead

  • Improper priors give arbitrary answers. Avoid them for testing.
  • Diffuse “default” priors bias toward the null in large samples. This is Lindley’s paradox. Default priors are still useful for exploratory tests; they should not be the basis of a definitive claim.
  • Model misspecification hurts more than in estimation. Bayes factors compare the marginal likelihood under each model’s full specification — if neither model is close to the truth, the ratio is hard to interpret. Posterior predictive checks (see Posterior Predictive) are a useful parallel diagnostic.
  • Computation is fragile. Marginal-likelihood estimators have known bias issues; bridge sampling is the most reliable but is not push-button.

Did you know?

Jeffreys’ textbook Theory of Probability (first edition 1939; the commonly cited third edition is 1961) developed the entire Bayes-factor framework decades before MCMC made it computable for realistic models. His original motivating problem was the rate of seismic activity along the Pacific Rim, which is not what most people associate with founding modern Bayesian hypothesis testing. The Cauchy prior he proposed for testing a normal mean remains the default in many software packages today, more than sixty years later.