Bayes’ Theorem
The formula
\[P(H \mid D) = \frac{P(D \mid H) \cdot P(H)}{P(D)}\]
In words:
\[\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}\]
Why each piece exists
Every term in Bayes’ theorem does a specific job. Drop any one and the reasoning breaks:
Prior — \(P(H)\): What you believe before seeing data. Without it, you’re implicitly assuming all hypotheses are equally likely, which is often absurd. The medical test below makes this vivid: the disease has a 0.1% base rate. Ignoring that makes you think a positive test means a 99% chance of being sick. The prior encodes that the disease is rare — which is why the real answer is 9%.
Likelihood — \(P(D \mid H)\): How well the hypothesis explains the data you actually observed. This is the same ingredient frequentists use — the data model. Without it, your beliefs never contact reality. The likelihood is what makes Bayesian inference empirical rather than just opinion.
Evidence — \(P(D)\): The total probability of seeing this data under all hypotheses. It’s a normalizing constant — it ensures the posterior is a valid probability distribution (sums/integrates to 1). You often don’t need to compute it directly (conjugate models and MCMC both avoid it), but conceptually it makes the comparison fair across hypotheses.
Posterior — \(P(H \mid D)\): The answer. Your updated belief after seeing data. Neither the prior nor the likelihood alone gives you this. The prior ignores data; the likelihood ignores context. The posterior combines both.
| Drop this | What goes wrong |
|---|---|
| Prior | You get MLE — no regularization, overfits with small samples, can’t express “this is rare” |
| Likelihood | Pure opinion — beliefs never update with evidence |
| Evidence | Posterior isn’t a proper distribution — can’t compare hypotheses on equal footing |
| Posterior | You have ingredients but no answer — flour and eggs but no cake |
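The four pieces can be seen working together in a few lines. A minimal sketch for a discrete set of hypotheses — the hypothesis names and the prior/likelihood numbers here are invented purely for illustration:

```r
# Three candidate hypotheses with prior beliefs (must sum to 1)
prior <- c(rare = 0.10, common = 0.60, very_common = 0.30)

# Likelihood: how well each hypothesis explains the observed data
likelihood <- c(rare = 0.01, common = 0.30, very_common = 0.50)

# Evidence: total probability of the data, summed over all hypotheses
evidence <- sum(likelihood * prior)

# Posterior: prior times likelihood, normalized by the evidence
posterior <- likelihood * prior / evidence
round(posterior, 3)  # sums to exactly 1
```

Drop the division by `evidence` and the numbers no longer sum to 1; drop `prior` and you are just ranking hypotheses by likelihood alone — both failure modes from the table above.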
The two simulations below let you feel this. In the medical test, the prior (base rate) is the piece most people forget. In the coin-flip, try a flat prior (\(\alpha = \beta = 1\)): the posterior is just the normalized likelihood — the prior contributes nothing. Set a strong prior with only a few flips, and the posterior barely moves from the prior. You need both pieces pulling on each other.
That looks abstract. Let’s make it concrete.
The medical test example
Imagine a disease that affects 1 in 1,000 people. A test for it is 99% accurate — if you have the disease it says positive 99% of the time, and if you don’t have it, it says negative 99% of the time.
You test positive. What’s the probability you actually have the disease?
Most people say 99%. The real answer is about 9%. This is not a trick — it’s Bayes’ theorem. The disease is so rare that even a good test produces more false positives than true positives.
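Plugging the numbers in makes the 9% explicit. With \(H\) = "sick" and \(D\) = "positive test":

\[P(\text{sick} \mid +) = \frac{P(+ \mid \text{sick}) \cdot P(\text{sick})}{P(+)} = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.01 \times 0.999} = \frac{0.00099}{0.01098} \approx 0.09\]

The denominator is the evidence expanded over both hypotheses: true positives from the 0.1% who are sick, plus false positives from the 99.9% who are healthy.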
The simulator below lets you adjust the base rate and test accuracy and watch how the posterior probability changes.
#| standalone: true
#| viewerHeight: 580
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.result-box {
background: #f0f4f8; border-radius: 6px; padding: 16px;
margin-top: 14px; font-size: 15px; line-height: 2;
text-align: center;
}
.result-box .big {
font-size: 32px; color: #e74c3c; font-weight: bold;
}
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("prev", "Base rate (prevalence):",
min = 0.001, max = 0.20, value = 0.001, step = 0.001),
sliderInput("sens", "Sensitivity (true positive rate):",
min = 0.50, max = 1.00, value = 0.99, step = 0.01),
sliderInput("spec", "Specificity (true negative rate):",
min = 0.50, max = 1.00, value = 0.99, step = 0.01),
uiOutput("result_box")
),
mainPanel(
width = 9,
fluidRow(
column(6, plotOutput("tree_plot", height = "420px")),
column(6, plotOutput("icon_plot", height = "420px"))
)
)
)
)
server <- function(input, output, session) {
vals <- reactive({
prev <- input$prev
sens <- input$sens
spec <- input$spec
# Out of 10,000 people
N <- 10000
sick <- round(N * prev)
healthy <- N - sick
true_pos <- round(sick * sens)
false_neg <- sick - true_pos
false_pos <- round(healthy * (1 - spec))
true_neg <- healthy - false_pos
total_pos <- true_pos + false_pos
ppv <- if (total_pos > 0) true_pos / total_pos else 0
list(N = N, sick = sick, healthy = healthy,
true_pos = true_pos, false_neg = false_neg,
false_pos = false_pos, true_neg = true_neg,
total_pos = total_pos, ppv = ppv)
})
output$tree_plot <- renderPlot({
v <- vals()
par(mar = c(1, 1, 3, 1))
plot(NULL, xlim = c(0, 10), ylim = c(0, 10), axes = FALSE,
xlab = "", ylab = "", main = "What happens to 10,000 people?")
# Population
text(5, 9.5, paste0("Population: ", v$N), cex = 1.2, font = 2)
# Sick vs Healthy
text(2.5, 7.5, paste0("Sick: ", v$sick), cex = 1.1, col = "#e74c3c")
text(7.5, 7.5, paste0("Healthy: ", v$healthy), cex = 1.1, col = "#3498db")
segments(5, 9, 2.5, 8, lwd = 2)
segments(5, 9, 7.5, 8, lwd = 2)
# Test results for sick
text(1.2, 5.2, paste0("Test +\n", v$true_pos), cex = 1, col = "#27ae60", font = 2)
text(3.8, 5.2, paste0("Test -\n", v$false_neg), cex = 1, col = "#7f8c8d")
segments(2.5, 7, 1.2, 5.8, lwd = 1.5)
segments(2.5, 7, 3.8, 5.8, lwd = 1.5)
# Test results for healthy
text(6.2, 5.2, paste0("Test +\n", v$false_pos), cex = 1, col = "#e74c3c", font = 2)
text(8.8, 5.2, paste0("Test -\n", v$true_neg), cex = 1, col = "#7f8c8d")
segments(7.5, 7, 6.2, 5.8, lwd = 1.5)
segments(7.5, 7, 8.8, 5.8, lwd = 1.5)
# Total positives
text(3.7, 3, paste0("Total positive tests: ", v$total_pos), cex = 1.1, font = 2)
text(3.7, 2, paste0("Of these, truly sick: ", v$true_pos), cex = 1.1,
col = "#27ae60", font = 2)
text(3.7, 1, paste0("P(sick | test+) = ",
v$true_pos, "/", v$total_pos, " = ",
round(v$ppv * 100, 1), "%"), cex = 1.2, font = 2, col = "#e74c3c")
})
output$icon_plot <- renderPlot({
v <- vals()
par(mar = c(1, 1, 3, 1))
# Show total positive tests as dots
n_show <- min(v$total_pos, 200)
n_true <- round(n_show * v$ppv)
n_false <- n_show - n_true
cols <- c(rep("#27ae60", n_true), rep("#e74c3c", n_false))
cols <- sample(cols)
ncol <- ceiling(sqrt(n_show))
nrow <- ceiling(n_show / ncol)
plot(NULL, xlim = c(0, ncol + 1), ylim = c(0, nrow + 1),
axes = FALSE, xlab = "", ylab = "",
main = paste0(v$total_pos, " positive tests",
if (n_show < v$total_pos) paste0(" (showing ", n_show, ")") else ""))
if (n_show > 0) {
x <- rep(seq_len(ncol), times = nrow)[seq_len(n_show)]
y <- rep(seq(nrow, 1), each = ncol)[seq_len(n_show)]
points(x, y, pch = 15, cex = max(0.5, 3 - n_show / 50), col = cols)
}
legend("bottom", bty = "n", horiz = TRUE, cex = 0.95,
legend = c(paste0("Truly sick (", n_true, ")"),
paste0("False alarm (", n_false, ")")),
col = c("#27ae60", "#e74c3c"), pch = 15, pt.cex = 1.5)
})
output$result_box <- renderUI({
v <- vals()
tags$div(class = "result-box",
HTML(paste0(
"If you test positive,<br>",
"the probability you're sick is:<br>",
"<span class='big'>", round(v$ppv * 100, 1), "%</span><br>",
"<small>not ", round(input$sens * 100), "%!</small>"
))
)
})
}
shinyApp(ui, server)
Things to try
- Default settings (prevalence 0.1%, test 99% accurate): only ~9% of positive tests come from people who are actually sick. The base rate dominates.
- Slide prevalence up to 5%: now ~84% of positives are real. The prior matters!
- Slide prevalence to 50%: the posterior is ~99%. When the disease is common, a positive test is very informative.
- Lower specificity to 90%: false positives explode. Watch the right plot fill with red dots.
The lesson
Bayes’ theorem tells you: don’t just look at the test accuracy — look at how common the thing is. A 99% accurate test is nearly useless for a 1-in-1,000 disease because most positives are false alarms. This is the base rate fallacy, and Bayes’ theorem is the cure.
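To see the base rate dominate numerically, hold the test fixed at 99% sensitivity and specificity and sweep the prevalence — a short sketch using the same positive-predictive-value arithmetic as the simulator above:

```r
# P(sick | positive) as a function of prevalence, for a fixed test
ppv <- function(prev, sens = 0.99, spec = 0.99) {
  sens * prev / (sens * prev + (1 - spec) * (1 - prev))
}

round(ppv(c(0.001, 0.05, 0.5)), 2)  # roughly 0.09, 0.84, 0.99
```

The test never changes; only the prior does — and the answer swings from 9% to 99%.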
The coin-flip: your first conjugate model
The medical test was a discrete example — disease or no disease. Now let’s do the same logic with a continuous parameter.
You have a coin and want to estimate \(\theta = P(\text{heads})\). Before flipping, you express your belief about \(\theta\) as a Beta distribution:
\[\theta \sim \text{Beta}(\alpha, \beta)\]
The Beta distribution lives on \([0, 1]\) — perfect for a probability. The parameters \(\alpha\) and \(\beta\) control the shape:
- \(\alpha = \beta = 1\): flat (uniform) — “I have no idea”
- \(\alpha = \beta = 5\): peaked at 0.5 — “I think it’s fair”
- \(\alpha = 2, \beta = 8\): mass concentrated well below 0.5, with mean 0.2 — “I think it’s biased toward tails”
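These shapes are easy to check numerically: the Beta(\(\alpha, \beta\)) mean is \(\alpha / (\alpha + \beta)\), and for \(\alpha, \beta > 1\) the mode is \((\alpha - 1)/(\alpha + \beta - 2)\). A quick sketch:

```r
beta_mean <- function(a, b) a / (a + b)
beta_mode <- function(a, b) (a - 1) / (a + b - 2)  # only defined for a, b > 1

beta_mean(1, 1)  # 0.5 -- but flat: every value of theta is equally plausible
beta_mode(5, 5)  # 0.5 -- peaked at a fair coin
beta_mean(2, 8)  # 0.2 -- belief tilted toward tails
```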
Now flip the coin \(n\) times and observe \(k\) heads. The likelihood is binomial:
\[k \sim \text{Binomial}(n, \theta)\]
Bayes’ theorem gives the posterior — and because the Beta prior is conjugate to the binomial likelihood, the posterior is also a Beta:
\[\theta \mid k \sim \text{Beta}(\alpha + k, \;\; \beta + n - k)\]
That’s it. The prior “counts” (\(\alpha, \beta\)) get updated by the data counts (\(k\) heads, \(n - k\) tails). No MCMC needed — the answer is a formula.
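The update really is one line of arithmetic. A sketch with invented numbers — a mild fair-coin prior Beta(2, 2), then 7 heads in 10 flips:

```r
a <- 2; b <- 2       # prior Beta(2, 2): mean 0.5, weakly held
n <- 10; k <- 7      # data: 7 heads in 10 flips

a_post <- a + k          # 9
b_post <- b + (n - k)    # 5

a_post / (a_post + b_post)  # posterior mean ~0.643
k / n                       # MLE 0.7 -- the prior shrinks the estimate toward 0.5
```

The posterior Beta(9, 5) sits between the prior mean (0.5) and the MLE (0.7), weighted by how much data each side brings.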
#| standalone: true
#| viewerHeight: 620
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("alpha", HTML("Prior α:"),
min = 1, max = 10, value = 1, step = 1),
sliderInput("beta_param", HTML("Prior β:"),
min = 1, max = 10, value = 1, step = 1),
sliderInput("n_flips", "Number of flips (n):",
min = 1, max = 100, value = 20, step = 1),
sliderInput("true_theta", HTML("True θ:"),
min = 0.01, max = 0.99, value = 0.6, step = 0.01),
actionButton("flip", "Flip coins", class = "btn-primary", width = "100%"),
uiOutput("stats_box")
),
mainPanel(
width = 9,
fluidRow(
column(6, plotOutput("density_plot", height = "420px")),
column(6, plotOutput("sequential_plot", height = "420px"))
)
)
)
)
server <- function(input, output, session) {
sim <- reactive({
input$flip
a <- input$alpha
b <- input$beta_param
n <- input$n_flips
theta <- input$true_theta
# Simulate coin flips
flips <- rbinom(n, 1, theta)
k <- sum(flips)
# Posterior parameters
a_post <- a + k
b_post <- b + (n - k)
# Sequential updating: posterior mean after each flip
seq_means <- numeric(n)
cum_heads <- cumsum(flips)
for (i in seq_len(n)) {
seq_means[i] <- (a + cum_heads[i]) / (a + b + i)
}
list(a = a, b = b, n = n, k = k, theta = theta,
a_post = a_post, b_post = b_post,
flips = flips, seq_means = seq_means)
})
output$density_plot <- renderPlot({
d <- sim()
par(mar = c(4.5, 4.5, 3, 1))
x <- seq(0, 1, length.out = 300)
prior_y <- dbeta(x, d$a, d$b)
post_y <- dbeta(x, d$a_post, d$b_post)
ylim <- c(0, max(c(prior_y, post_y)) * 1.15)
plot(NULL, xlim = c(0, 1), ylim = ylim,
xlab = expression(theta), ylab = "Density",
main = expression("Prior & Posterior for " * theta))
# Prior
lines(x, prior_y, col = "#e74c3c", lwd = 2, lty = 2)
# Posterior (shaded)
polygon(c(x, rev(x)), c(post_y, rep(0, length(x))),
col = "#3498db30", border = NA)
lines(x, post_y, col = "#3498db", lwd = 2.5)
# True theta
abline(v = d$theta, lty = 2, col = "#27ae60", lwd = 2)
legend("topright", bty = "n", cex = 0.85,
legend = c(
paste0("Prior: Beta(", d$a, ", ", d$b, ")"),
paste0("Posterior: Beta(", d$a_post, ", ", d$b_post, ")"),
expression("True " * theta)
),
col = c("#e74c3c", "#3498db", "#27ae60"),
lwd = c(2, 2.5, 2), lty = c(2, 1, 2))
})
output$sequential_plot <- renderPlot({
d <- sim()
par(mar = c(4.5, 4.5, 3, 1))
prior_mean <- d$a / (d$a + d$b)
plot(seq_len(d$n), d$seq_means, type = "l", lwd = 2, col = "#3498db",
xlab = "Flip number", ylab = expression("Posterior mean of " * theta),
main = "Sequential Updating",
ylim = c(0, 1))
# True theta
abline(h = d$theta, lty = 2, col = "#27ae60", lwd = 1.5)
# Prior mean
abline(h = prior_mean, lty = 3, col = "#e74c3c", lwd = 1.5)
# MLE line (cumulative k/n)
cum_heads <- cumsum(d$flips)
mle_seq <- cum_heads / seq_len(d$n)
lines(seq_len(d$n), mle_seq, col = "gray50", lwd = 1.5, lty = 3)
legend("topright", bty = "n", cex = 0.8,
legend = c("Posterior mean", "MLE (k/n)",
expression("True " * theta), "Prior mean"),
col = c("#3498db", "gray50", "#27ae60", "#e74c3c"),
lwd = c(2, 1.5, 1.5, 1.5), lty = c(1, 3, 2, 3))
})
output$stats_box <- renderUI({
d <- sim()
prior_mean <- d$a / (d$a + d$b)
post_mean <- d$a_post / (d$a_post + d$b_post)
mle <- if (d$n > 0) d$k / d$n else NA
tags$div(class = "stats-box",
HTML(paste0(
"<b>Heads:</b> ", d$k, " / ", d$n, "<br>",
"<b>MLE (k/n):</b> ", round(mle, 3), "<br>",
"<hr style='margin:8px 0'>",
"<b>Prior mean:</b> ", round(prior_mean, 3), "<br>",
"<b>Posterior mean:</b> ", round(post_mean, 3), "<br>",
"<b>True θ:</b> ", d$theta
))
)
})
}
shinyApp(ui, server)
Things to try
- Flat prior (\(\alpha = \beta = 1\)): the posterior is driven entirely by the data. The posterior mode equals the MLE, and the posterior mean, \((k+1)/(n+2)\), is only slightly shrunk toward 0.5. With no prior information, the Bayesian nearly matches the frequentist.
- Strong prior (\(\alpha = \beta = 10\)): the prior is concentrated near 0.5. With few flips, the posterior barely moves. Increase \(n\) and watch the data overwhelm the prior.
- Large \(n\) (100 flips): the prior becomes irrelevant — the posterior mean converges to the MLE no matter which prior you chose. This is the “prior washes out” property.
- Sequential plot: watch how the posterior mean (blue) starts at the prior mean and drifts toward the true \(\theta\) as data accumulates. Early flips matter more; later flips change the estimate less.
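The sequential plot works because conjugate updating does not care how the data arrive: folding in the flips one at a time lands on exactly the same posterior as a single batch update. A sketch with an invented flip sequence:

```r
flips <- c(1, 0, 1, 1, 0, 1)  # 1 = heads (hypothetical data)

# One flip at a time, starting from a flat Beta(1, 1) prior
a <- 1; b <- 1
for (f in flips) {
  a <- a + f
  b <- b + (1 - f)
}

# All at once, from the same flat prior
a_batch <- 1 + sum(flips)
b_batch <- 1 + length(flips) - sum(flips)

c(a, b)               # 5 3
c(a_batch, b_batch)   # 5 3 -- identical posteriors
```

This order-invariance is what lets the app draw a meaningful "posterior mean after each flip" curve.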
Why this matters
The coin-flip model is Bayesian inference in its simplest form: prior \(\times\) likelihood \(=\) posterior, all in closed form. The same logic — start with a belief, update with data — drives everything else on this site.
The coin-flip uses a Beta prior on a probability. The next page shows the same updating with different distributions. Bayesian Regression uses a Normal prior on a slope. The math changes; the logic doesn’t.