Bayes’ Theorem

The formula

\[P(H \mid D) = \frac{P(D \mid H) \cdot P(H)}{P(D)}\]

In words:

\[\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}\]

Why each piece exists

Every term in Bayes’ theorem does a specific job. Drop any one and the reasoning breaks:

Prior — \(P(H)\): What you believe before seeing data. Without it, you’re implicitly assuming all hypotheses are equally likely, which is often absurd. The medical test below makes this vivid: the disease has a 0.1% base rate. Ignoring that makes you think a positive test means a 99% chance of being sick. The prior encodes that the disease is rare — which is why the real answer is 9%.

Likelihood — \(P(D \mid H)\): How well the hypothesis explains the data you actually observed. This is the same ingredient frequentists use — the data model. Without it, your beliefs never contact reality. The likelihood is what makes Bayesian inference empirical rather than just opinion.

Evidence — \(P(D)\): The total probability of seeing this data under all hypotheses. It’s a normalizing constant — it ensures the posterior is a valid probability distribution (sums/integrates to 1). You often don’t need to compute it directly (conjugate models and MCMC both avoid it), but conceptually it makes the comparison fair across hypotheses.

Posterior — \(P(H \mid D)\): The answer. Your updated belief after seeing data. Neither the prior nor the likelihood alone gives you this. The prior ignores data; the likelihood ignores context. The posterior combines both.

| Drop this | What goes wrong |
|-----------|-----------------|
| Prior | You get MLE — no regularization, overfits with small samples, can’t express “this is rare” |
| Likelihood | Pure opinion — beliefs never update with evidence |
| Evidence | Posterior isn’t a proper distribution — can’t compare hypotheses on equal footing |
| Posterior | You have ingredients but no answer — flour and eggs but no cake |
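
Concretely, here are the four pieces for the medical test coming up, computed in a few lines of R (0.001, 0.99, and 0.99 are that example’s base rate, sensitivity, and specificity):

```r
# H = "has the disease", D = "tests positive"
prior      <- 0.001                          # P(H): the 1-in-1,000 base rate
likelihood <- 0.99                           # P(D | H): sensitivity
evidence   <- 0.99 * 0.001 + 0.01 * 0.999    # P(D): total probability of a positive
posterior  <- likelihood * prior / evidence  # P(H | D)
posterior                                    # ~0.09, not 0.99
```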

The two simulations below let you feel this. In the medical test, the prior (the base rate) is the piece most people forget. In the coin-flip, try a flat prior (\(\alpha = \beta = 1\)): the posterior is just the normalized likelihood, so the prior contributes nothing. Then set a strong prior with only a few flips: the posterior barely moves from the prior. You need both pieces pulling on each other.

That looks abstract. Let’s make it concrete.

The medical test example

Imagine a disease that affects 1 in 1,000 people. A test for it is 99% accurate — if you have the disease it says positive 99% of the time, and if you don’t have it, it says negative 99% of the time.

You test positive. What’s the probability you actually have the disease?

Most people say 99%. The real answer is about 9%. This is not a trick — it’s Bayes’ theorem. The disease is so rare that even a good test produces more false positives than true positives.
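
If you don’t trust the algebra, check it by brute force. A minimal simulation sketch (one million hypothetical people, same rates as above):

```r
set.seed(1)
n <- 1e6
sick     <- rbinom(n, 1, 0.001)                          # 1-in-1,000 base rate
positive <- rbinom(n, 1, ifelse(sick == 1, 0.99, 0.01))  # 99% sensitivity and specificity
mean(sick[positive == 1])                                # ~0.09 = P(sick | test +)
```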

The simulator below lets you adjust the base rate and test accuracy and watch how the posterior probability changes.

```{shinylive-r}
#| standalone: true
#| viewerHeight: 580

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .result-box {
      background: #f0f4f8; border-radius: 6px; padding: 16px;
      margin-top: 14px; font-size: 15px; line-height: 2;
      text-align: center;
    }
    .result-box .big {
      font-size: 32px; color: #e74c3c; font-weight: bold;
    }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("prev", "Base rate (prevalence):",
                  min = 0.001, max = 0.20, value = 0.001, step = 0.001),

      sliderInput("sens", "Sensitivity (true positive rate):",
                  min = 0.50, max = 1.00, value = 0.99, step = 0.01),

      sliderInput("spec", "Specificity (true negative rate):",
                  min = 0.50, max = 1.00, value = 0.99, step = 0.01),

      uiOutput("result_box")
    ),

    mainPanel(
      width = 9,
      fluidRow(
        column(6, plotOutput("tree_plot", height = "420px")),
        column(6, plotOutput("icon_plot", height = "420px"))
      )
    )
  )
)

server <- function(input, output, session) {

  vals <- reactive({
    prev <- input$prev
    sens <- input$sens
    spec <- input$spec

    # Out of 10,000 people
    N <- 10000
    sick <- round(N * prev)
    healthy <- N - sick

    true_pos  <- round(sick * sens)
    false_neg <- sick - true_pos
    false_pos <- round(healthy * (1 - spec))
    true_neg  <- healthy - false_pos

    total_pos <- true_pos + false_pos
    ppv <- if (total_pos > 0) true_pos / total_pos else 0

    list(N = N, sick = sick, healthy = healthy,
         true_pos = true_pos, false_neg = false_neg,
         false_pos = false_pos, true_neg = true_neg,
         total_pos = total_pos, ppv = ppv)
  })

  output$tree_plot <- renderPlot({
    v <- vals()
    par(mar = c(1, 1, 3, 1))

    plot(NULL, xlim = c(0, 10), ylim = c(0, 10), axes = FALSE,
         xlab = "", ylab = "", main = "What happens to 10,000 people?")

    # Population
    text(5, 9.5, paste0("Population: ", v$N), cex = 1.2, font = 2)

    # Sick vs Healthy
    text(2.5, 7.5, paste0("Sick: ", v$sick), cex = 1.1, col = "#e74c3c")
    text(7.5, 7.5, paste0("Healthy: ", v$healthy), cex = 1.1, col = "#3498db")
    segments(5, 9, 2.5, 8, lwd = 2)
    segments(5, 9, 7.5, 8, lwd = 2)

    # Test results for sick
    text(1.2, 5.2, paste0("Test +\n", v$true_pos), cex = 1, col = "#27ae60", font = 2)
    text(3.8, 5.2, paste0("Test -\n", v$false_neg), cex = 1, col = "#7f8c8d")
    segments(2.5, 7, 1.2, 5.8, lwd = 1.5)
    segments(2.5, 7, 3.8, 5.8, lwd = 1.5)

    # Test results for healthy
    text(6.2, 5.2, paste0("Test +\n", v$false_pos), cex = 1, col = "#e74c3c", font = 2)
    text(8.8, 5.2, paste0("Test -\n", v$true_neg), cex = 1, col = "#7f8c8d")
    segments(7.5, 7, 6.2, 5.8, lwd = 1.5)
    segments(7.5, 7, 8.8, 5.8, lwd = 1.5)

    # Total positives
    text(3.7, 3, paste0("Total positive tests: ", v$total_pos), cex = 1.1, font = 2)
    text(3.7, 2, paste0("Of these, truly sick: ", v$true_pos), cex = 1.1,
         col = "#27ae60", font = 2)
    text(3.7, 1, paste0("P(sick | test+) = ",
         v$true_pos, "/", v$total_pos, " = ",
         round(v$ppv * 100, 1), "%"), cex = 1.2, font = 2, col = "#e74c3c")
  })

  output$icon_plot <- renderPlot({
    v <- vals()
    par(mar = c(1, 1, 3, 1))

    # Show total positive tests as dots
    n_show <- min(v$total_pos, 200)
    n_true <- round(n_show * v$ppv)
    n_false <- n_show - n_true

    cols <- c(rep("#27ae60", n_true), rep("#e74c3c", n_false))
    cols <- sample(cols)  # shuffle so true and false positives intermingle in the grid

    ncol <- ceiling(sqrt(n_show))
    nrow <- ceiling(n_show / ncol)

    plot(NULL, xlim = c(0, ncol + 1), ylim = c(0, nrow + 1),
         axes = FALSE, xlab = "", ylab = "",
         main = paste0(v$total_pos, " positive tests",
                       if (n_show < v$total_pos) paste0(" (showing ", n_show, ")")))

    if (n_show > 0) {
      x <- rep(seq_len(ncol), times = nrow)[seq_len(n_show)]
      y <- rep(seq(nrow, 1), each = ncol)[seq_len(n_show)]
      points(x, y, pch = 15, cex = max(0.5, 3 - n_show / 50), col = cols)
    }

    legend("bottom", bty = "n", horiz = TRUE, cex = 0.95,
           legend = c(paste0("Truly sick (", n_true, ")"),
                      paste0("False alarm (", n_false, ")")),
           col = c("#27ae60", "#e74c3c"), pch = 15, pt.cex = 1.5)
  })

  output$result_box <- renderUI({
    v <- vals()
    tags$div(class = "result-box",
      HTML(paste0(
        "If you test positive,<br>",
        "the probability you're sick is:<br>",
        "<span class='big'>", round(v$ppv * 100, 1), "%</span><br>",
        "<small>not ", round(input$sens * 100), "%!</small>"
      ))
    )
  })
}

shinyApp(ui, server)
```

Things to try

  • Default settings (prevalence 0.1%, test 99% accurate): only ~9% of the people who test positive are actually sick. The base rate dominates.
  • Slide prevalence up to 5%: now ~84% of positives are real. The prior matters!
  • Slide prevalence to 50%: the posterior is ~99%. When the disease is common, a positive test is very informative.
  • Lower specificity to 90%: false positives explode. Watch the right plot fill with red dots.
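
Every number in that list can be reproduced straight from the formula. A small helper function (the defaults match the 99% test above):

```r
ppv <- function(prev, sens = 0.99, spec = 0.99) {
  # P(sick | test +) via Bayes' theorem
  sens * prev / (sens * prev + (1 - spec) * (1 - prev))
}
round(ppv(c(0.001, 0.05, 0.50)), 2)  # 0.09 0.84 0.99
round(ppv(0.001, spec = 0.90), 3)    # ~0.01: barely 1% with the leakier test
```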

The lesson

Bayes’ theorem tells you: don’t just look at the test accuracy — look at how common the thing is. A 99% accurate test is nearly useless for a 1-in-1,000 disease because most positives are false alarms. This is the base rate fallacy, and Bayes’ theorem is the cure.

The coin-flip: your first conjugate model

The medical test was a discrete example — disease or no disease. Now let’s do the same logic with a continuous parameter.

You have a coin and want to estimate \(\theta = P(\text{heads})\). Before flipping, you express your belief about \(\theta\) as a Beta distribution:

\[\theta \sim \text{Beta}(\alpha, \beta)\]

The Beta distribution lives on \([0, 1]\) — perfect for a probability. The parameters \(\alpha\) and \(\beta\) control the shape:

  • \(\alpha = \beta = 1\): flat (uniform) — “I have no idea”
  • \(\alpha = \beta = 5\): peaked at 0.5 — “I think it’s fair”
  • \(\alpha = 2, \beta = 8\): mean 0.2, peak at 0.125 — “I think it’s biased toward tails”
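
Plotting the three priors makes the shapes concrete (base R, like the apps on this page):

```r
x <- seq(0, 1, length.out = 300)
plot(x, dbeta(x, 1, 1), type = "l", ylim = c(0, 4), lwd = 2,
     xlab = expression(theta), ylab = "Density")    # flat: no idea
lines(x, dbeta(x, 5, 5), col = "#3498db", lwd = 2)  # peaked at 0.5: probably fair
lines(x, dbeta(x, 2, 8), col = "#e74c3c", lwd = 2)  # mass near 0.2: biased toward tails
```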

Now flip the coin \(n\) times and observe \(k\) heads. The likelihood is binomial:

\[k \sim \text{Binomial}(n, \theta)\]
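
Viewed as a function of \(\theta\) with the data held fixed, the likelihood is one dbinom call across a grid (7 heads in 10 flips is a made-up example):

```r
theta <- seq(0, 1, length.out = 300)
lik <- dbinom(7, size = 10, prob = theta)  # likelihood of 7 heads in 10 flips
plot(theta, lik, type = "l", xlab = expression(theta), ylab = "Likelihood")
```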

Bayes’ theorem gives the posterior — and because the Beta prior is conjugate to the binomial likelihood, the posterior is also a Beta:

\[\theta \mid k \sim \text{Beta}(\alpha + k, \;\; \beta + n - k)\]
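
To see why, multiply the prior density by the likelihood and collect the powers of \(\theta\):

\[p(\theta \mid k) \propto \underbrace{\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}}_{\text{prior}} \times \underbrace{\theta^{k}(1 - \theta)^{n - k}}_{\text{likelihood}} = \theta^{\alpha + k - 1}(1 - \theta)^{\beta + n - k - 1}\]

That is exactly the kernel of a \(\text{Beta}(\alpha + k, \beta + n - k)\) density; dividing by the evidence merely normalizes it.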

That’s it. The prior “counts” (\(\alpha, \beta\)) get updated by the data counts (\(k\) heads, \(n - k\) tails). No MCMC needed — the answer is a formula.
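
In code, the update really is two additions. A minimal sketch with the same made-up data (7 heads in 10 flips, starting from a Beta(2, 2) prior):

```r
a <- 2; b <- 2                          # prior: Beta(2, 2), a mild vote for fairness
n <- 10; k <- 7                         # hypothetical data: 7 heads in 10 flips
a_post <- a + k                         # 9
b_post <- b + n - k                     # 5
a_post / (a_post + b_post)              # posterior mean: 9/14, about 0.643
qbeta(c(0.025, 0.975), a_post, b_post)  # 95% credible interval for theta
```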

```{shinylive-r}
#| standalone: true
#| viewerHeight: 620

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("alpha", HTML("Prior &alpha;:"),
                  min = 1, max = 10, value = 1, step = 1),

      sliderInput("beta_param", HTML("Prior &beta;:"),
                  min = 1, max = 10, value = 1, step = 1),

      sliderInput("n_flips", "Number of flips (n):",
                  min = 1, max = 100, value = 20, step = 1),

      sliderInput("true_theta", HTML("True &theta;:"),
                  min = 0.01, max = 0.99, value = 0.6, step = 0.01),

      actionButton("flip", "Flip coins", class = "btn-primary", width = "100%"),

      uiOutput("stats_box")
    ),

    mainPanel(
      width = 9,
      fluidRow(
        column(6, plotOutput("density_plot", height = "420px")),
        column(6, plotOutput("sequential_plot", height = "420px"))
      )
    )
  )
)

server <- function(input, output, session) {

  sim <- reactive({
    input$flip  # take a dependency on the button so fresh flips are drawn on each click
    a <- input$alpha
    b <- input$beta_param
    n <- input$n_flips
    theta <- input$true_theta

    # Simulate coin flips
    flips <- rbinom(n, 1, theta)
    k <- sum(flips)

    # Posterior parameters
    a_post <- a + k
    b_post <- b + (n - k)

    # Sequential updating: posterior mean after each flip
    seq_means <- numeric(n)
    cum_heads <- cumsum(flips)
    for (i in seq_len(n)) {
      seq_means[i] <- (a + cum_heads[i]) / (a + b + i)
    }

    list(a = a, b = b, n = n, k = k, theta = theta,
         a_post = a_post, b_post = b_post,
         flips = flips, seq_means = seq_means)
  })

  output$density_plot <- renderPlot({
    d <- sim()
    par(mar = c(4.5, 4.5, 3, 1))

    x <- seq(0, 1, length.out = 300)
    prior_y <- dbeta(x, d$a, d$b)
    post_y  <- dbeta(x, d$a_post, d$b_post)

    ylim <- c(0, max(c(prior_y, post_y)) * 1.15)

    plot(NULL, xlim = c(0, 1), ylim = ylim,
         xlab = expression(theta), ylab = "Density",
         main = expression("Prior & Posterior for " * theta))

    # Prior
    lines(x, prior_y, col = "#e74c3c", lwd = 2, lty = 2)

    # Posterior (shaded)
    polygon(c(x, rev(x)), c(post_y, rep(0, length(x))),
            col = "#3498db30", border = NA)
    lines(x, post_y, col = "#3498db", lwd = 2.5)

    # True theta
    abline(v = d$theta, lty = 2, col = "#27ae60", lwd = 2)

    legend("topright", bty = "n", cex = 0.85,
           legend = c(
             paste0("Prior: Beta(", d$a, ", ", d$b, ")"),
             paste0("Posterior: Beta(", d$a_post, ", ", d$b_post, ")"),
             expression("True " * theta)
           ),
           col = c("#e74c3c", "#3498db", "#27ae60"),
           lwd = c(2, 2.5, 2), lty = c(2, 1, 2))
  })

  output$sequential_plot <- renderPlot({
    d <- sim()
    par(mar = c(4.5, 4.5, 3, 1))

    prior_mean <- d$a / (d$a + d$b)

    plot(seq_len(d$n), d$seq_means, type = "l", lwd = 2, col = "#3498db",
         xlab = "Flip number", ylab = expression("Posterior mean of " * theta),
         main = "Sequential Updating",
         ylim = c(0, 1))

    # True theta
    abline(h = d$theta, lty = 2, col = "#27ae60", lwd = 1.5)

    # Prior mean
    abline(h = prior_mean, lty = 3, col = "#e74c3c", lwd = 1.5)

    # MLE line (cumulative k/n)
    cum_heads <- cumsum(d$flips)
    mle_seq <- cum_heads / seq_len(d$n)
    lines(seq_len(d$n), mle_seq, col = "gray50", lwd = 1.5, lty = 3)

    legend("topright", bty = "n", cex = 0.8,
           legend = c("Posterior mean", "MLE (k/n)",
                      expression("True " * theta), "Prior mean"),
           col = c("#3498db", "gray50", "#27ae60", "#e74c3c"),
           lwd = c(2, 1.5, 1.5, 1.5), lty = c(1, 3, 2, 3))
  })

  output$stats_box <- renderUI({
    d <- sim()
    prior_mean <- d$a / (d$a + d$b)
    post_mean  <- d$a_post / (d$a_post + d$b_post)
    mle        <- if (d$n > 0) d$k / d$n else NA

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>Heads:</b> ", d$k, " / ", d$n, "<br>",
        "<b>MLE (k/n):</b> ", round(mle, 3), "<br>",
        "<hr style='margin:8px 0'>",
        "<b>Prior mean:</b> ", round(prior_mean, 3), "<br>",
        "<b>Posterior mean:</b> ", round(post_mean, 3), "<br>",
        "<b>True &theta;:</b> ", d$theta
      ))
    )
  })
}

shinyApp(ui, server)
```

Things to try

  • Flat prior (\(\alpha = \beta = 1\)): the posterior is proportional to the likelihood, so the data do all the work. The posterior mode equals the MLE; the posterior mean, \((k+1)/(n+2)\), sits just beside it. With no prior information, the Bayesian and the frequentist essentially agree.
  • Strong prior (\(\alpha = \beta = 10\)): the prior is concentrated near 0.5. With few flips, the posterior barely moves. Increase \(n\) and watch the data overwhelm the prior.
  • Large \(n\) (100 flips): the prior becomes irrelevant; the posterior mean converges to the MLE no matter which prior you chose. This is the “prior washes out” property (verified numerically in the sketch after this list).
  • Sequential plot: watch how the posterior mean (blue) starts at the prior mean and drifts toward the true \(\theta\) as data accumulates. Early flips matter more; later flips change the estimate less.
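
The washing-out is easy to verify numerically. A minimal sketch comparing a flat Beta(1, 1) prior with a strong Beta(10, 10) prior as \(n\) grows (true \(\theta = 0.6\)):

```r
set.seed(42)
theta <- 0.6
for (n in c(10, 100, 1000)) {
  k <- rbinom(1, n, theta)        # simulate n flips
  flat   <- (1 + k) / (2 + n)     # posterior mean under Beta(1, 1)
  strong <- (10 + k) / (20 + n)   # posterior mean under Beta(10, 10)
  cat(sprintf("n = %4d   MLE = %.3f   flat = %.3f   strong = %.3f\n",
              n, k / n, flat, strong))
}
```

At \(n = 10\) the strong prior drags the estimate toward 0.5; by \(n = 1000\) the three estimates essentially coincide.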

Why this matters

The coin-flip model is Bayesian inference in its simplest form: prior \(\times\) likelihood \(=\) posterior, all in closed form. The same logic — start with a belief, update with data — drives everything else on this site.

The coin-flip uses a Beta prior on a probability. The next page shows the same updating with different distributions. Bayesian Regression uses a Normal prior on a slope. The math changes; the logic doesn’t.