LLN vs CLT: Why Averages Stabilize Before They Become Normal

Two theorems, two different promises

The Law of Large Numbers (LLN) and the Central Limit Theorem (CLT) are the two pillars of statistics, but they say different things:

|  | LLN | CLT |
| --- | --- | --- |
| What it says | Sample mean converges to the true mean | Distribution of the (scaled) sample mean is approximately normal |
| Object | \(\bar{X}_1, \bar{X}_2, \bar{X}_3, \dots\) — one sequence, growing \(n\) | \(\bar{X}^{(1)}, \bar{X}^{(2)}, \bar{X}^{(3)}, \dots\) — many samples, fixed \(n\) |
| Key result | \(\bar{X}_n \xrightarrow{p} \mu\) | \(\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)\) |
| Interpretation | Consistency | Asymptotic normality |
| Promise | You get the right answer | You know how uncertain your estimate is |
| Requires | Finite mean | Finite variance |

LLN says the sample mean settles down. CLT describes the shape of the scatter around the point where it settles.

Think of it this way: LLN tells you that if you flip a fair coin enough times, the fraction of heads approaches 0.5. CLT tells you that if you repeat the entire experiment many times, the distribution of those fractions is bell-shaped.
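The coin-flip story can be checked directly in base R (a minimal sketch; the seed, the 100 flips per experiment, and the 5,000 repetitions are arbitrary choices):

```r
set.seed(42)

# LLN: one long sequence of fair-coin flips.
# The running fraction of heads approaches 0.5 as n grows.
flips        <- rbinom(1e5, 1, 0.5)
running_frac <- cumsum(flips) / seq_along(flips)
tail(running_frac, 1)   # close to 0.5

# CLT: repeat a fixed-size experiment (n = 100 flips) many times.
# The per-experiment fractions are approximately N(0.5, 0.05^2),
# since sd = sqrt(0.5 * 0.5 / 100) = 0.05.
fracs <- replicate(5000, mean(rbinom(100, 1, 0.5)))
mean(fracs)   # close to 0.5
sd(fracs)     # close to 0.05
```

Same coin, two questions: the first block asks where one average ends up; the second asks how the averages are spread across repeated experiments.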

The Oracle View. In these simulations, we know the true mean \(\mu\) — the horizontal line the running average converges to. In practice, you don’t know \(\mu\). You watch your estimate stabilize but never know exactly what it’s converging toward. That’s the gap between theory and practice: LLN guarantees convergence, but the target is invisible.

Simulation 1: The running average

Watch one running average converge to the true mean as you add observations one at a time. This is LLN in action — the cumulative mean stabilizes. Try different distributions and notice that convergence happens regardless of the population shape.

#| standalone: true
#| viewerHeight: 620

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      selectInput("pop", "Population:",
                  choices = c("Uniform(0, 1)",
                              "Exponential(1)",
                              "Bimodal",
                              "Bernoulli(0.3)",
                              "Heavy-tailed (t, df=2)")),

      sliderInput("n", "Number of observations:",
                  min = 50, max = 5000, value = 500, step = 50),

      sliderInput("paths", "Number of paths:",
                  min = 1, max = 10, value = 3, step = 1),

      actionButton("go", "Draw new paths",
                   class = "btn-primary", width = "100%"),

      uiOutput("results")
    ),

    mainPanel(
      width = 9,
      plotOutput("running_avg", height = "500px")
    )
  )
)

server <- function(input, output, session) {

  draw_pop <- function(n, pop) {
    switch(pop,
      "Uniform(0, 1)"          = runif(n),
      "Exponential(1)"         = rexp(n, rate = 1),
      "Bimodal"                = {
        k <- rbinom(n, 1, 0.5)
        k * rnorm(n, -2, 0.6) + (1 - k) * rnorm(n, 2, 0.6)
      },
      "Bernoulli(0.3)"         = rbinom(n, 1, 0.3),
      "Heavy-tailed (t, df=2)" = rt(n, df = 2)
    )
  }

  pop_mu <- function(pop) {
    switch(pop,
      "Uniform(0, 1)"          = 0.5,
      "Exponential(1)"         = 1,
      "Bimodal"                = 0,
      "Bernoulli(0.3)"         = 0.3,
      "Heavy-tailed (t, df=2)" = 0
    )
  }

  dat <- reactive({
    input$go
    n     <- input$n
    paths <- input$paths
    pop   <- input$pop
    mu    <- pop_mu(pop)

    all_paths <- lapply(seq_len(paths), function(i) {
      x <- draw_pop(n, pop)
      cumsum(x) / seq_along(x)
    })

    list(all_paths = all_paths, n = n, mu = mu, pop = pop, paths = paths)
  })

  output$running_avg <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    ylim <- range(unlist(d$all_paths))
    ylim <- ylim + c(-1, 1) * 0.1 * diff(ylim)

    cols <- c("#e74c3c", "#3498db", "#27ae60", "#9b59b6", "#e67e22",
              "#1abc9c", "#34495e", "#f39c12", "#2ecc71", "#c0392b")

    plot(NULL, xlim = c(1, d$n), ylim = ylim,
         xlab = "Number of observations",
         ylab = "Cumulative mean",
         main = paste0("LLN: Running average converges to \u03bc = ", d$mu))

    abline(h = d$mu, lty = 2, lwd = 2.5, col = "#2c3e50")

    for (i in seq_along(d$all_paths)) {
      lines(seq_along(d$all_paths[[i]]), d$all_paths[[i]],
            col = adjustcolor(cols[i], 0.7), lwd = 1.5)
    }

    legend("topright", bty = "n", cex = 0.9,
           legend = c(paste0("True mean (\u03bc = ", d$mu, ")"),
                      paste0(d$paths, " sample path(s)")),
           col = c("#2c3e50", cols[1]),
           lwd = c(2.5, 1.5), lty = c(2, 1))
  })

  output$results <- renderUI({
    d <- dat()
    final_means <- sapply(d$all_paths, function(p) p[length(p)])

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>True mean:</b> ", d$mu, "<br>",
        "<b>Final running avg(s):</b><br>",
        paste(round(final_means, 4), collapse = ", "), "<br>",
        "<b>Max deviation:</b> ",
        round(max(abs(final_means - d$mu)), 4)
      ))
    )
  })
}

shinyApp(ui, server)

Simulation 2: LLN vs CLT side by side

Now see both theorems at once. The left panel shows LLN — one running average converging to \(\mu\). The right panel shows CLT — the histogram of many sample means forming a normal distribution. Same data, different questions.

#| standalone: true
#| viewerHeight: 580

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      selectInput("pop2", "Population:",
                  choices = c("Uniform(0, 1)",
                              "Exponential(1)",
                              "Bimodal",
                              "Bernoulli(0.3)")),

      sliderInput("n2", "Sample size per experiment:",
                  min = 5, max = 200, value = 30, step = 5),

      sliderInput("reps2", "Number of experiments:",
                  min = 200, max = 3000, value = 1000, step = 100),

      actionButton("go2", "Run experiments",
                   class = "btn-primary", width = "100%"),

      uiOutput("results2")
    ),

    mainPanel(
      width = 9,
      fluidRow(
        column(6, plotOutput("lln_plot", height = "430px")),
        column(6, plotOutput("clt_plot", height = "430px"))
      )
    )
  )
)

server <- function(input, output, session) {

  draw_pop <- function(n, pop) {
    switch(pop,
      "Uniform(0, 1)"  = runif(n),
      "Exponential(1)" = rexp(n, rate = 1),
      "Bimodal"        = {
        k <- rbinom(n, 1, 0.5)
        k * rnorm(n, -2, 0.6) + (1 - k) * rnorm(n, 2, 0.6)
      },
      "Bernoulli(0.3)" = rbinom(n, 1, 0.3)
    )
  }

  pop_mu <- function(pop) {
    switch(pop,
      "Uniform(0, 1)"  = 0.5,
      "Exponential(1)" = 1,
      "Bimodal"        = 0,
      "Bernoulli(0.3)" = 0.3
    )
  }

  pop_sigma <- function(pop) {
    switch(pop,
      "Uniform(0, 1)"  = sqrt(1 / 12),
      "Exponential(1)" = 1,
      "Bimodal"        = sqrt(0.6^2 + 4),
      "Bernoulli(0.3)" = sqrt(0.3 * 0.7)
    )
  }

  dat <- reactive({
    input$go2
    n    <- input$n2
    reps <- input$reps2
    pop  <- input$pop2
    mu   <- pop_mu(pop)
    sig  <- pop_sigma(pop)

    # One long draw for LLN
    long_draw <- draw_pop(n * 5, pop)
    running   <- cumsum(long_draw) / seq_along(long_draw)

    # Many experiments for CLT
    means <- replicate(reps, mean(draw_pop(n, pop)))

    list(running = running, means = means, mu = mu, sig = sig,
         n = n, reps = reps, pop = pop)
  })

  output$lln_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    plot(seq_along(d$running), d$running, type = "l",
         col = "#3498db", lwd = 1.5,
         xlab = "Number of observations",
         ylab = "Cumulative mean",
         main = "LLN: One running average")
    abline(h = d$mu, lty = 2, lwd = 2.5, col = "#e74c3c")

    legend("topright", bty = "n", cex = 0.85,
           legend = c("Running average",
                      paste0("True \u03bc = ", d$mu)),
           col = c("#3498db", "#e74c3c"),
           lwd = c(1.5, 2.5), lty = c(1, 2))
  })

  output$clt_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    hist(d$means, breaks = 40, probability = TRUE,
         col = "#3498db30", border = "#3498db80",
         main = paste0("CLT: ", d$reps, " sample means (n=", d$n, ")"),
         xlab = "Sample mean", ylab = "Density")

    x_seq <- seq(min(d$means), max(d$means), length.out = 300)
    se <- d$sig / sqrt(d$n)
    lines(x_seq, dnorm(x_seq, mean = d$mu, sd = se),
          col = "#e74c3c", lwd = 2.5)

    abline(v = d$mu, lty = 2, lwd = 2, col = "#2c3e50")

    legend("topright", bty = "n", cex = 0.85,
           legend = c("Sampling distribution",
                      paste0("N(\u03bc, \u03c3/\u221an) overlay"),
                      paste0("True \u03bc = ", d$mu)),
           col = c("#3498db80", "#e74c3c", "#2c3e50"),
           lwd = c(8, 2.5, 2), lty = c(1, 1, 2))
  })

  output$results2 <- renderUI({
    d <- dat()
    se <- d$sig / sqrt(d$n)

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>LLN (left):</b><br>",
        "Final avg: ", round(d$running[length(d$running)], 4), "<br>",
        "True \u03bc: ", d$mu, "<br>",
        "<hr style='margin:8px 0'>",
        "<b>CLT (right):</b><br>",
        "Mean of means: ", round(mean(d$means), 4), "<br>",
        "SD of means: ", round(sd(d$means), 4), "<br>",
        "Theoretical SE: ", round(se, 4)
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • Exponential, n = 5: the LLN panel shows the running average still wobbling, while the CLT histogram is visibly skewed. Neither theorem has fully “kicked in” yet — but they will at different rates.
  • Exponential, n = 100: the running average has settled (LLN), and the histogram is now bell-shaped (CLT). Both theorems have done their job.
  • Bimodal: the running average converges to 0 (the midpoint), even though no individual observation is near 0. LLN doesn’t care about the shape.
  • Multiple paths (Sim 1): all paths converge to the same \(\mu\), but they take different routes. The early wobble is sampling variability; the eventual convergence is LLN.

The bottom line

  • LLN = your estimate gets closer to the truth as \(n\) grows. It’s about accuracy of a single estimate.
  • CLT = the shape of the sampling distribution is approximately normal. It’s about the distribution of the estimator across repeated experiments.
  • You need LLN before CLT matters. If the mean hasn’t stabilized, knowing its distribution is normal doesn’t help much.

When does each theorem kick in?

LLN and CLT are both asymptotic results — they describe what happens as \(n \to \infty\). But in practice, they kick in at different speeds, and the speed depends on the population you’re sampling from.

LLN kicks in early. As soon as you have a few dozen observations, the sample mean is usually close to \(\mu\). The only requirement is a finite first moment (\(E[|X|] < \infty\)). Even heavily skewed distributions settle down quickly — the running average doesn’t care about the shape of the population, only that the mean exists.
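A rough check of the "few dozen observations" claim (a sketch; the cutoff 0.3 and the choice of n = 50 are arbitrary):

```r
set.seed(1)

# Exponential(1) is skewed, with mu = 1 and sigma = 1.
# With n = 50 observations, the sample mean is usually within
# sigma / sqrt(n) ~ 0.14 of mu, skewness notwithstanding.
err <- replicate(2000, abs(mean(rexp(50)) - 1))
mean(err < 0.3)   # proportion of experiments landing within 0.3 of mu
median(err)       # typical error, on the order of 0.1
```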

CLT kicks in later, and the speed depends on the population shape. The CLT is about the distribution of \(\bar{X}_n\) converging to a bell curve — a stronger claim that requires more data. How much more depends on how non-normal the population is:

| Population | CLT kicks in at roughly… | Why |
| --- | --- | --- |
| Symmetric (Normal, Uniform) | \(n \approx 5\text{--}10\) | Already close to normal; little work for CLT to do |
| Moderately skewed (Exponential, \(\chi^2\)) | \(n \approx 30\text{--}50\) | Skewness needs to be averaged away |
| Heavily skewed (Bernoulli(0.05), Log-normal) | \(n \approx 100\text{--}500\) | Extreme skewness or heavy tails slow convergence |
| No finite variance (Cauchy) | Never | CLT requires \(\text{Var}(X) < \infty\); without it, the theorem doesn’t apply |

The folklore rule “\(n \geq 30\) is enough” comes from CLT, not LLN. It’s roughly when the normal approximation becomes decent for moderately skewed populations. But it’s a rule of thumb, not a theorem — symmetric populations need far less, and heavily skewed populations need far more.
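One way to see why the rule of thumb scales with skewness: for i.i.d. data, the skewness of \(\bar{X}_n\) is the population skewness divided by \(\sqrt{n}\). A quick Monte Carlo check (a sketch; the `skew` helper and the repetition counts are ad hoc):

```r
set.seed(7)

# Sample skewness (no packages needed)
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Exponential(1) has skewness 2, so means of n draws should have
# skewness near 2 / sqrt(n).
means30  <- replicate(20000, mean(rexp(30)))
means500 <- replicate(20000, mean(rexp(500)))
skew(means30)    # near 2 / sqrt(30)  ~ 0.37 -- still visibly skewed
skew(means500)   # near 2 / sqrt(500) ~ 0.09 -- nearly normal
```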

Why does skewness slow things down?

The CLT convergence rate is governed by the Berry-Esseen bound:

\[ \sup_z \left| P\!\left(\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \leq z\right) - \Phi(z) \right| \;\leq\; \frac{C \cdot \rho}{\sigma^3 \sqrt{n}} \]

where \(\rho = E[|X - \mu|^3]\) is the third absolute moment. The ratio \(\rho / \sigma^3\) measures how “non-normal” the population is: distributions with heavy tails or strong skewness have a large \(\rho / \sigma^3\), so the bound is larger at every \(n\) and more data is needed before the normal approximation becomes accurate. That’s the mathematical reason skewed distributions need more data before the CLT kicks in.
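The ratio \(\rho / \sigma^3\) can be estimated by simulation (a sketch; the draw size of 1e6 is arbitrary, and the log-normal estimate is noisy because of its heavy tail):

```r
set.seed(3)

# Monte Carlo estimate of rho / sigma^3 = E|X - mu|^3 / sigma^3,
# the quantity that scales the Berry-Esseen bound.
be_ratio <- function(x) mean(abs(x - mean(x))^3) / sd(x)^3

be_ratio(runif(1e6))          # Uniform: small ratio, fast CLT
be_ratio(rexp(1e6))           # Exponential: larger
be_ratio(rlnorm(1e6, 0, 1))   # Log-normal: much larger, slow CLT
```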

Simulation 3: watch CLT kick in at different speeds

Pick a population and see how the sampling distribution of \(\bar{X}_n\) (standardised) compares to the \(N(0, 1)\) curve at four sample sizes. The KS distance measures how far the sampling distribution is from normal — smaller = closer.

#| standalone: true
#| viewerHeight: 620

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      selectInput("pop3", "Population:",
        choices = c(
          "Normal(0, 1)"   = "normal",
          "Uniform(0, 1)"  = "uniform",
          "Exponential(1)"  = "exp",
          "Chi-squared(3)"  = "chisq",
          "Bernoulli(0.1)"  = "bern",
          "Log-normal(0,1)" = "lognorm"
        ),
        selected = "exp"
      ),

      actionButton("run3", "Run 2,000 experiments",
                   style = "width:100%; margin-top:10px;
                            background:#3498db; color:white;
                            border:none; padding:10px; font-weight:bold;"),

      uiOutput("stats3")
    ),

    mainPanel(
      width = 9,
      plotOutput("grid_plot", height = "520px")
    )
  )
)

server <- function(input, output, session) {

  sim_data <- reactiveVal(NULL)

  observeEvent(input$run3, {
    pop <- input$pop3
    B   <- 2000
    ns  <- c(5, 30, 100, 500)

    draw <- function(n) {
      switch(pop,
        normal  = rnorm(n),
        uniform = runif(n),
        exp     = rexp(n, 1),
        chisq   = rchisq(n, df = 3),
        bern    = rbinom(n, 1, 0.1),
        lognorm = rlnorm(n, 0, 1))
    }

    mu <- switch(pop,
      normal = 0, uniform = 0.5, exp = 1,
      chisq = 3, bern = 0.1, lognorm = exp(0.5))

    sig <- switch(pop,
      normal = 1, uniform = sqrt(1/12), exp = 1,
      chisq = sqrt(6), bern = sqrt(0.1 * 0.9),
      lognorm = sqrt((exp(1) - 1) * exp(1)))

    results <- lapply(ns, function(n) {
      means <- replicate(B, mean(draw(n)))
      list(means = means, n = n)
    })

    sim_data(list(results = results, mu = mu, sig = sig,
                  pop = pop, ns = ns))
  })

  output$grid_plot <- renderPlot({
    d <- sim_data()
    if (is.null(d)) {
      plot.new()
      text(0.5, 0.5, "Press the button to run",
           cex = 1.4, col = "#7f8c8d")
      return()
    }

    par(mfrow = c(2, 2), mar = c(4, 4, 3.5, 1), oma = c(0, 0, 2, 0))

    pop_label <- switch(d$pop,
      normal = "Normal(0, 1)", uniform = "Uniform(0, 1)",
      exp = "Exponential(1)", chisq = "\u03c7\u00b2(3)",
      bern = "Bernoulli(0.1)", lognorm = "Log-normal(0, 1)")

    for (i in seq_along(d$results)) {
      r  <- d$results[[i]]
      se <- d$sig / sqrt(r$n)
      z  <- (r$means - d$mu) / se

      hist(z, breaks = 40, freq = FALSE, col = "#dfe6e9",
           border = "#b2bec3", xlim = c(-4, 4),
           main = paste0("n = ", r$n),
           xlab = "Standardised mean", ylab = "Density",
           cex.main = 1.3)

      xseq <- seq(-4, 4, length.out = 300)
      lines(xseq, dnorm(xseq), lwd = 2.5, col = "#e74c3c", lty = 2)

      ks <- suppressWarnings(ks.test(z, "pnorm")$statistic)
      legend("topright", bty = "n", cex = 0.85,
             legend = paste0("KS = ", sprintf("%.3f", ks)),
             text.col = if (ks < 0.03) "#27ae60" else "#e74c3c")
    }

    mtext(paste0("CLT convergence: ", pop_label),
          side = 3, outer = TRUE, cex = 1.2, font = 2)
  })

  output$stats3 <- renderUI({
    d <- sim_data()
    if (is.null(d)) return(NULL)

    ks_vals <- sapply(d$results, function(r) {
      se <- d$sig / sqrt(r$n)
      z  <- (r$means - d$mu) / se
      suppressWarnings(ks.test(z, "pnorm")$statistic)
    })

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>KS distance from N(0,1):</b><br>",
        paste(paste0("n = ", d$ns, ": <b>",
              sprintf("%.3f", ks_vals), "</b>"),
              collapse = "<br>"),
        "<br><hr style='margin:8px 0'>",
        "<small>KS &rarr; 0 means the<br>",
        "sampling distribution is<br>",
        "converging to normal.</small>"
      ))
    )
  })
}

shinyApp(ui, server)

What to try:

  • Normal: the approximation is exact at every \(n\): averages of normal draws are themselves exactly normal, so even at \(n = 5\) any nonzero KS value is just simulation noise.
  • Exponential: at \(n = 5\) the histogram is visibly right-skewed. By \(n = 30\) it’s close. By \(n = 100\) it’s indistinguishable from the bell curve.
  • Bernoulli(0.1): at \(n = 5\) the sample mean can only be 0, 0.2, 0.4, … — you can see the discreteness. The CLT needs \(n = 100\)+ to smooth this out.
  • Log-normal: the slowest convergence of the bunch. Heavy right tail means the CLT needs \(n = 300\)+ before the KS distance gets small. This is why the “\(n \geq 30\) rule” can be dangerously optimistic.

Did you know?

  • Jacob Bernoulli proved the first version of the Law of Large Numbers in his book Ars Conjectandi, published posthumously in 1713. He called it his “golden theorem” and spent over 20 years working on the proof. The result seems obvious in hindsight — of course averages converge — but rigorously proving why required entirely new mathematical machinery.
  • There are actually two versions: the Weak LLN (convergence in probability, proved by Bernoulli and later Chebyshev) and the Strong LLN (almost sure convergence, proved by Kolmogorov in 1933). The strong version says that convergence happens with probability 1, not just “usually.”
  • The \(t\)-distribution with \(\text{df} = 1\) is the Cauchy distribution: it has no finite mean, so LLN doesn’t apply and the running average never settles down, no matter how much data you collect. The heavy-tailed option in Simulation 1 uses \(\text{df} = 2\), which does have a mean (so LLN holds) but has infinite variance (so CLT fails); try it and watch the running average converge, punctuated by occasional violent jumps.
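To see the Cauchy failure without Shiny (a sketch; the seed and sequence length are arbitrary):

```r
set.seed(11)

# Cauchy = t with df = 1: no finite mean, so no LLN.
# The running average keeps making large jumps forever.
x       <- rcauchy(1e5)
running <- cumsum(x) / seq_along(x)
running[c(100, 1000, 10000, 100000)]   # no sign of settling
```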