Bayes Factors
What a Bayes factor is
Suppose you have two hypotheses — call them \(H_0\) and \(H_1\) — and a body of data \(y\). Both hypotheses make predictions about what \(y\) should look like, but those predictions depend on unknown parameters. The Bayes factor is the ratio of the average probability each hypothesis assigns to the data, where the averaging is over your prior beliefs about each hypothesis’s parameters:
\[ BF_{10} \;=\; \frac{p(y \mid H_1)}{p(y \mid H_0)} \;=\; \frac{\int p(y \mid \theta, H_1)\, p(\theta \mid H_1)\, d\theta} {\int p(y \mid \theta, H_0)\, p(\theta \mid H_0)\, d\theta}. \]
In plain words: “how much more likely is the data under model 1 than under model 0, after averaging over what each model said could happen.” The two integrals are called marginal likelihoods (or evidence) — this is what makes Bayes factors hard to compute and powerful when you can compute them.
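To make the definition concrete, here is a minimal sketch on an assumed toy problem that is not part of the original notes: a coin flipped \(n\) times, with \(H_0: p = 0.5\) and \(H_1: p \sim U(0, 1)\). The data (14 heads in 20 flips) and all function names are hypothetical. The marginal likelihood under \(H_1\) is computed by brute-force averaging of the likelihood over the prior, which is exactly what the integral in the formula says.

```python
import math

def log_binom_pmf(k: int, n: int, p: float) -> float:
    """Log of the binomial probability of k successes in n trials."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1 - p)

def marginal_likelihood_uniform(k: int, n: int, grid: int = 10_000) -> float:
    """Average the binomial likelihood over a U(0, 1) prior (midpoint rule)."""
    total = 0.0
    for i in range(grid):
        p = (i + 0.5) / grid              # midpoint of the i-th grid cell
        total += math.exp(log_binom_pmf(k, n, p))
    return total / grid                   # the prior density is 1 on (0, 1)

k, n = 14, 20                             # hypothetical data
m0 = math.exp(log_binom_pmf(k, n, 0.5))   # H0 has no free parameter to average over
m1 = marginal_likelihood_uniform(k, n)    # H1 averages the likelihood over its prior
bf10 = m1 / m0
print(f"p(y|H0) = {m0:.4f}   p(y|H1) = {m1:.4f}   BF10 = {bf10:.2f}")
```

For this particular prior the integral has a known closed form, \(1/(n+1)\), which gives a quick check on the numerical average. In realistic models no such shortcut exists, which is why the computation section below matters.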
Jeffreys’ interpretive scale
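A Bayes factor is just a number. The standard references for turning it into a verdict are Jeffreys (1961) and Kass & Raftery (1995), whose scales roughly agree. The table below follows Jeffreys’ half-unit steps in \(\log_{10} BF_{10}\), with the \(BF_{10}\) bounds rounded (\(10^{0.5} \approx 3.2\), \(10^{1.5} \approx 31.6\)):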
| \(BF_{10}\) | \(\log_{10} BF_{10}\) | Strength of evidence for \(H_1\) |
|---|---|---|
| 1 – 3 | 0 – 0.5 | Barely worth mentioning |
| 3 – 10 | 0.5 – 1 | Substantial |
| 10 – 30 | 1 – 1.5 | Strong |
| 30 – 100 | 1.5 – 2 | Very strong |
| > 100 | > 2 | Decisive |
A symmetric reading applies for \(BF_{01} > 1\): evidence for the null.
These thresholds are conventions, not laws. They are calibrated against intuitions about how surprised you should be. They do not match p-value thresholds: a p-value of \(0.05\) does not correspond to a particular Bayes factor, and the two can disagree sharply (see Lindley’s paradox below).
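The table can be encoded as a small lookup. This is only a sketch: the function name `jeffreys_label` is hypothetical, the cutoffs are the rounded bounds from the table, and values below 1 are inverted so that they read as evidence for \(H_0\).

```python
def jeffreys_label(bf10: float) -> str:
    """Map BF10 to the verbal categories of the table (rounded cutoffs)."""
    if bf10 <= 0:
        raise ValueError("a Bayes factor must be positive")
    # BF10 < 1 is read symmetrically as BF01 = 1 / BF10 in favor of H0.
    favored, bf = ("H1", bf10) if bf10 >= 1 else ("H0", 1.0 / bf10)
    cutoffs = [
        (3, "barely worth mentioning"),
        (10, "substantial"),
        (30, "strong"),
        (100, "very strong"),
    ]
    for upper, label in cutoffs:
        if bf < upper:
            return f"{label} evidence for {favored}"
    return f"decisive evidence for {favored}"

print(jeffreys_label(5))      # falls in the 3 to 10 band
print(jeffreys_label(0.05))   # BF01 = 20, so the label refers to H0
```

A label like this is a reporting convenience only. The number itself, together with the prior that produced it, is what should be reported.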
Why the prior on \(H_1\) matters so much
Look at the formula again. The numerator, the marginal likelihood under \(H_1\), is an integral over the prior \(p(\theta \mid H_1)\). If you put a vague prior on \(\theta\), one that spreads mass over a huge range, most of that mass sits in regions where the data are poorly predicted. The integral becomes small, and \(H_1\) loses almost regardless of the data, because its prior was too generous about what was possible.
This is the single most important thing to understand about Bayes factors. A diffuse prior is not a “non-committal” prior. A diffuse prior is a strong claim that the parameter could be anywhere over a wide range — and Bayes factors punish you for that claim by averaging the likelihood over places it has no business being.
Three consequences:
- Improper priors break things. Setting \(p(\theta \mid H_1)\) to Lebesgue measure (a uniform “prior” over the whole real line) leaves the marginal likelihood defined only up to an arbitrary multiplicative constant, so the resulting Bayes factor is arbitrary too. (This is Bartlett’s paradox, 1957.)
- The choice of prior must be substantive. Use what you actually believed about effect sizes from prior literature, from physical bounds, or from past studies. Report it. A Bayes factor without a disclosed prior is incomplete.
- Reference priors exist for common testing problems. Jeffreys proposed the Cauchy prior centered at zero for testing a normal mean; this is the basis for JZS (Jeffreys-Zellner-Siow) Bayes factors used in many psychology and economics papers today.
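The penalty for diffuse priors can be seen numerically. The sketch below assumes a setup that is not in the original notes: a single estimate with a normal sampling distribution, \(H_0: \theta = 0\), and \(H_1: \theta \sim U(-w, w)\). The estimate (0.5), standard error (0.2), and the function names are hypothetical. The data are held fixed while the prior half-width \(w\) grows.

```python
import math

def normal_pdf(x: float, mean: float, sd: float) -> float:
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def normal_cdf(x: float) -> float:
    return 0.5 * math.erfc(-x / math.sqrt(2))

def bf01_uniform_prior(estimate: float, se: float, half_width: float) -> float:
    """BF01 for H0: theta = 0 vs H1: theta ~ U(-half_width, half_width),
    assuming estimate ~ N(theta, se^2)."""
    m0 = normal_pdf(estimate, 0.0, se)
    # Marginal under H1: the normal likelihood averaged over the uniform prior.
    mass = (normal_cdf((half_width - estimate) / se)
            - normal_cdf((-half_width - estimate) / se))
    m1 = mass / (2 * half_width)
    return m0 / m1

# Same data every time; only the width of the prior under H1 changes.
results = {w: bf01_uniform_prior(estimate=0.5, se=0.2, half_width=w)
           for w in (1, 10, 100, 1000)}
for w, bf01 in results.items():
    print(f"half-width {w:>5}: BF01 = {bf01:.3g}")
```

With the narrowest prior the Bayes factor favors \(H_1\); each tenfold widening multiplies \(BF_{01}\) by about ten once the prior covers the likelihood, and in the limit of an infinitely wide prior the null wins no matter what was observed.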
Computation
Marginal likelihoods are hard to compute. Four practical approaches:
Savage-Dickey ratio (for nested tests)
When testing whether a parameter equals zero (\(H_0: \theta = 0\) versus \(H_1: \theta \neq 0\)), and the prior under \(H_1\) assigns positive density to \(\theta = 0\):
\[ BF_{01} \;=\; \frac{p(\theta = 0 \mid y, H_1)}{p(\theta = 0 \mid H_1)}. \]
That is — the posterior density at zero divided by the prior density at zero, both computed under \(H_1\). If the posterior places more mass near zero than the prior did, \(BF_{01} > 1\) (evidence for the null). This is practical: you fit one model, evaluate two densities, done.
BIC approximation
For comparing two regression models, the BIC gives a quick back-of-envelope:
\[ \log BF_{10} \;\approx\; -\tfrac{1}{2}\,(BIC_1 - BIC_0). \]
This holds asymptotically with a unit-information prior — i.e., a prior carrying as much information as one observation. Useful for rough triage; not a substitute for a real Bayes factor when priors matter.
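As a rough sketch of the approximation in use, the snippet below computes BIC from a maximized log-likelihood and converts a BIC difference to an approximate Bayes factor. The log-likelihoods, parameter counts, and sample size are hypothetical values chosen for illustration, and the function names are not from any particular library.

```python
import math

def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """BIC = k * ln(n) - 2 * ln(L_hat), with L_hat the maximized likelihood."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

def bf10_from_bic(bic1: float, bic0: float) -> float:
    """Approximate BF10 as exp(-(BIC1 - BIC0) / 2)."""
    return math.exp(-0.5 * (bic1 - bic0))

# Hypothetical fits: model 1 gains 4 log-likelihood units for 1 extra parameter.
bic0 = bic(log_likelihood=-520.0, n_params=3, n_obs=200)
bic1 = bic(log_likelihood=-516.0, n_params=4, n_obs=200)
bf10 = bf10_from_bic(bic1, bic0)
print(f"BIC0 = {bic0:.2f}  BIC1 = {bic1:.2f}  approx BF10 = {bf10:.2f}")
```

Written out, the approximation is the likelihood ratio divided by \(\sqrt{n}\) for each extra parameter, which shows where the complexity penalty comes from.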
Closed-form for Gaussian tests
Testing whether a sample mean differs from \(0\) with known \(\sigma^2\) and a normal prior \(\theta \sim N(0, \tau^2)\) under \(H_1\):
\[ BF_{01} \;=\; \sqrt{1 + n\tau^2/\sigma^2}\;\; \exp\!\Big(-\tfrac{1}{2}\, \tfrac{(n\bar{y}/\sigma^2)^2}{n/\sigma^2 + 1/\tau^2}\Big). \]
This is the formula behind the closed-form example we use elsewhere in these notes (see Precise Null vs. Underpowered).
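A direct transcription of the formula, as a sketch with hypothetical function names. The second function computes the same quantity as a ratio of the two marginal densities of \(\bar{y}\), namely \(N(0, \sigma^2/n)\) under \(H_0\) and \(N(0, \sigma^2/n + \tau^2)\) under \(H_1\), which serves as a cross-check.

```python
import math

def bf01_normal_mean(ybar: float, n: int, sigma: float, tau: float) -> float:
    """BF01 for H0: theta = 0 vs H1: theta ~ N(0, tau^2), with known sigma."""
    data_precision = n / sigma**2
    prior_precision = 1.0 / tau**2
    scale = math.sqrt(1.0 + n * tau**2 / sigma**2)
    exponent = -0.5 * (n * ybar / sigma**2) ** 2 / (data_precision + prior_precision)
    return scale * math.exp(exponent)

def bf01_from_marginals(ybar: float, n: int, sigma: float, tau: float) -> float:
    """Same Bayes factor, as a ratio of the marginal densities of ybar."""
    def zero_mean_pdf(x: float, var: float) -> float:
        return math.exp(-0.5 * x * x / var) / math.sqrt(2 * math.pi * var)
    return zero_mean_pdf(ybar, sigma**2 / n) / zero_mean_pdf(ybar, sigma**2 / n + tau**2)

# Hypothetical inputs, chosen only to exercise both routes.
args = dict(ybar=0.3, n=25, sigma=1.0, tau=0.5)
print(bf01_normal_mean(**args), bf01_from_marginals(**args))
```

When \(\bar{y} = 0\) the exponential term is 1 and the Bayes factor reduces to \(\sqrt{1 + n\tau^2/\sigma^2}\), the largest support for the null that this prior allows at a given \(n\).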
Bridge sampling
For models where neither an analytical formula nor the BIC approximation applies, bridge sampling is the gold-standard numerical method: it uses posterior draws (from MCMC) plus an importance-sampling identity to estimate the marginal likelihood. The R package bridgesampling implements it through its bridge_sampler function, which accepts fitted Stan models among other inputs.
Worked example: testing a regression coefficient
You run a regression and the coefficient on a treatment variable is \(\hat{\beta} = 0.40\) with \(\text{SE} = 0.25\). The \(t\)-statistic is \(1.6\), \(p \approx 0.11\) — not significant at conventional thresholds. Frequentist verdict: fail to reject. But is the data informative about the null, or just quiet?
Assume \(\hat{\beta} \mid \beta \sim N(\beta, 0.25^2)\). Under \(H_1\), put a normal prior \(\beta \sim N(0, \tau^2)\) with \(\tau = 0.5\) — encoding “effects of this magnitude are plausible but not guaranteed.” Plug into the closed-form:
- Prior precision: \(1/\tau^2 = 4\)
- Data precision: \(1/0.25^2 = 16\)
- Posterior precision: \(4 + 16 = 20\)
- Posterior mean: \((16 \times 0.40) / 20 = 0.32\)
- Savage-Dickey: \(BF_{01} = N(0; 0.32, 1/\sqrt{20}) / N(0; 0, 0.5)\)
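Working through: the posterior density at zero is about \(0.641\) and the prior density at zero is about \(0.798\), so \(BF_{01} \approx 0.641 / 0.798 \approx 0.80\), equivalently \(BF_{10} \approx 1.25\). That is the faintest lean toward \(H_1\), essentially break-even and deep inside the “barely worth mentioning” band. The correct report is “not significant, and the Bayes factor is inconclusive.” The data are quiet, not pro-null.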
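The steps of this example can be reproduced in a few lines. This is a sketch under the same normal-normal assumptions as the text; the variable names are illustrative.

```python
import math

def normal_pdf(x: float, mean: float, sd: float) -> float:
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

beta_hat, se, tau = 0.40, 0.25, 0.5       # estimate, standard error, prior sd under H1

prior_precision = 1.0 / tau**2            # 4
data_precision = 1.0 / se**2              # 16
post_precision = prior_precision + data_precision
post_mean = data_precision * beta_hat / post_precision
post_sd = 1.0 / math.sqrt(post_precision)

# Savage-Dickey: posterior density at zero over prior density at zero, both under H1.
bf01 = normal_pdf(0.0, post_mean, post_sd) / normal_pdf(0.0, 0.0, tau)
print(f"posterior mean = {post_mean:.2f}, posterior sd = {post_sd:.4f}")
print(f"BF01 = {bf01:.2f}, BF10 = {1 / bf01:.2f}")
```

Rerunning with a different `tau` is a cheap sensitivity check: if the verdict changes qualitatively across priors you would consider defensible, that fragility belongs in the report.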
Lindley’s paradox — deep dive
The most famous disagreement between Bayes factors and p-values: with enough data, a frequentist test rejects the null while a Bayes factor with a diffuse prior favors it.
Toy example (Lindley 1957). A sample of \(n = 98{,}000\) newborns yields a male-birth proportion of \(50.36\%\). Testing \(H_0:\ p = 0.5\):
- \(z = (0.5036 - 0.5) / \sqrt{0.5(0.5)/98000} \approx 2.25\), so the two-sided \(p \approx 0.02\).
- Frequentist verdict: reject \(H_0\) at the 5% level.
Now compute a Bayes factor using a uniform prior under \(H_1\): \(p \sim U(0, 1)\). The marginal likelihood under \(H_1\) is the average binomial likelihood over all \(p \in [0,1]\) — most of which is far from \(0.5036\). Working through:
\[ BF_{01} \;=\; \frac{p(y \mid p = 0.5)}{\int_0^1 p(y \mid p)\, dp} \;\approx\; 20. \]
Bayesian verdict: strong evidence for the null. The two procedures disagree on the same dataset.
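Why? The null predicts the observed proportion specifically; the alternative predicts it only on average over a wide range. With \(n\) large, the deviation from \(0.5\) is sizeable in z-units but tiny in probability-space, so \(H_1\)’s diffuse prior averages its likelihood down. Tightening the prior under \(H_1\) weakens the verdict and eventually reverses it: \(p \sim \text{Beta}(50, 50)\), encoding “roughly 50-50 but allow some slack,” cuts \(BF_{01}\) to roughly \(2.5\), and a prior concentrated on the scale of plausible deviations, such as \(\text{Beta}(5000, 5000)\) (prior standard deviation about \(0.005\)), gives \(BF_{10} \approx 3\) and flips the verdict back.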
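The toy example can be checked exactly, since a Beta prior on a binomial proportion has a closed-form marginal likelihood. The sketch below assumes the count of boys is reconstructed from the rounded proportion (49,353 of 98,000) and compares three priors under \(H_1\); Beta(1, 1) is the uniform prior, and Beta(5000, 5000) is a hypothetical tight prior added here for contrast.

```python
import math

def log_beta(a: float, b: float) -> float:
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def bf01_binomial(k: int, n: int, a: float, b: float) -> float:
    """BF01 for H0: p = 0.5 vs H1: p ~ Beta(a, b), given k successes in n trials."""
    # The binomial coefficient appears in both marginals and cancels.
    log_m0 = n * math.log(0.5)
    log_m1 = log_beta(a + k, b + n - k) - log_beta(a, b)
    return math.exp(log_m0 - log_m1)

n = 98_000
k = round(0.5036 * n)    # count reconstructed from the rounded proportion

results = {a: bf01_binomial(k, n, a, a) for a in (1, 50, 5000)}
for a, bf01 in results.items():
    print(f"Beta({a}, {a}) prior under H1: BF01 = {bf01:.2f}")
```

Working on the log scale avoids underflow: the individual marginal likelihoods are far too small to represent directly at this sample size, but their ratio is an ordinary number.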
The takeaway is not that one procedure is right and the other wrong. They answer different questions, and the disagreement is a clean diagnostic: when a p-value and a Bayes factor disagree this sharply, the prior on \(H_1\) is usually doing the work. Report it explicitly.
When Bayes factors mislead
- Improper priors give arbitrary answers. Avoid them for testing.
- Diffuse “default” priors bias toward the null in large samples. This is Lindley’s paradox. Default priors are still useful for exploratory tests; they should not be the basis of a definitive claim.
- Model misspecification hurts more than in estimation. Bayes factors compare the marginal likelihood under each model’s full specification — if neither model is close to the truth, the ratio is hard to interpret. Posterior predictive checks (see Posterior Predictive) are a useful parallel diagnostic.
- Computation is fragile. Marginal-likelihood estimators have known bias issues; bridge sampling is the most reliable but is not push-button.
Did you know?
Jeffreys’ Theory of Probability (first edition 1939; the 1961 third edition is the one usually cited) developed the Bayes-factor framework decades before MCMC made it computable for realistic models. Jeffreys was a geophysicist, and his motivating problems came largely from seismology, not what most people associate with the founding of modern Bayesian hypothesis testing. The Cauchy prior he proposed for testing a normal mean remains the default in many software packages today, more than sixty years later.