Statistical Foundations

1. Data & Distributions

A distribution is a model of variability. Imagine measuring the commute time of every person in a city. Some people take 10 minutes, most take around 30, a few are stuck for over an hour. If you made a histogram of all those commute times, the shape you’d see is the distribution — it tells you which values are common, which are rare, and how spread out things are.

Why do we care? Because variation is everywhere. Two patients given the same drug respond differently. Two students who study the same hours get different exam scores. Two identical ads shown to similar users get different click rates. None of that is a mistake — it’s the nature of data. A distribution is simply a mathematical way of describing how much things vary and in what pattern.

Once you have a distribution, you can answer useful questions: What’s the most likely outcome? How often do we see extreme values? Where does the middle 50% of the data sit?

Key objects: histogram (shape), CDF (cumulative probabilities), quantiles (the value below which a given share of the data falls; the median is the 50% quantile).
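As a rough sketch of these three objects in base R, the snippet below simulates commute times and reads off the histogram, the empirical CDF, and a few quantiles. The gamma parameters are an assumption chosen only to give a right-skewed shape with a mean near 30 minutes; they are not real data.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical commute times in minutes: gamma with mean = shape / rate = 30
commute <- rgamma(10000, shape = 4, rate = 4 / 30)

# Histogram (shape): computed without plotting; h$density is scaled so the
# total bar area is 1
h <- hist(commute, breaks = 50, plot = FALSE)

# Empirical CDF: Fhat(t) is the fraction of observations <= t
Fhat <- ecdf(commute)
p60  <- Fhat(60)   # share of commutes lasting an hour or less

# Quantiles invert the CDF: the value below which a given share falls
q <- quantile(commute, c(0.25, 0.50, 0.75))

print(p60)
print(q)
```

The CDF and the quantiles answer the same question from opposite directions: `Fhat(60)` maps a value to a share, while `quantile(commute, 0.5)` maps a share back to a value.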

Explore different distributions below — notice how each has a distinct shape, center, and spread.

#| standalone: true
#| viewerHeight: 450

library(shiny)

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      width = 3,
      selectInput("dist", "Distribution:",
                  choices = c("Uniform(0,1)", "Normal(0,1)",
                              "Exponential(1)", "Chi-squared(3)",
                              "Bimodal")),
      sliderInput("n", "Sample size:", min = 50, max = 5000, value = 1000, step = 50),
      actionButton("draw", "New draw", class = "btn-primary", width = "100%"),
      uiOutput("stats")
    ),
    mainPanel(
      width = 9,
      fluidRow(
        column(6, plotOutput("hist_plot", height = "350px")),
        column(6, plotOutput("cdf_plot",  height = "350px"))
      )
    )
  )
)

server <- function(input, output, session) {
  samp <- reactive({
    input$draw  # read the button so each click triggers a fresh sample
    n <- input$n
    switch(input$dist,
      "Uniform(0,1)"   = runif(n),
      "Normal(0,1)"    = rnorm(n),
      "Exponential(1)" = rexp(n),
      "Chi-squared(3)" = rchisq(n, df = 3),
      "Bimodal"        = {
        # 50/50 mixture of Normal(-2, 0.6) and Normal(2, 0.6)
        k <- rbinom(n, 1, 0.5)
        k * rnorm(n, -2, 0.6) + (1 - k) * rnorm(n, 2, 0.6)
      }
    )
  })

  output$hist_plot <- renderPlot({
    x <- samp()
    par(mar = c(4.5, 4, 3, 1))
    hist(x, breaks = 50, probability = TRUE,
         col = "#d5e8d4", border = "#82b366",
         main = paste("Histogram:", input$dist),
         xlab = "x", ylab = "Density")
  })

  output$cdf_plot <- renderPlot({
    x <- samp()
    par(mar = c(4.5, 4, 3, 1))
    # ecdf() gives the empirical CDF of the sample, not the theoretical CDF
    plot(ecdf(x), col = "#3498db", lwd = 2,
         main = paste("Empirical CDF:", input$dist),
         xlab = "x", ylab = "F(x)")
  })

  output$stats <- renderUI({
    x <- samp()
    q <- round(quantile(x, c(0.25, 0.5, 0.75)), 3)
    tags$div(style = "background:#f0f4f8; border-radius:6px; padding:12px; margin-top:12px; font-size:14px; line-height:1.8;",
      HTML(paste0(
        "<b>Mean:</b> ", round(mean(x), 3), "<br>",
        "<b>SD:</b> ", round(sd(x), 3), "<br>",
        "<b>Q25:</b> ", q[1], "<br>",
        "<b>Median:</b> ", q[2], "<br>",
        "<b>Q75:</b> ", q[3]
      ))
    )
  })
}

shinyApp(ui, server)

2. Sampling & Uncertainty

Sampling is why statistics exists. A sample mean is not a fixed truth — it is one draw from a distribution of possible sample means. Repeat the sampling and you get a different answer every time.

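The key insight: larger samples produce less variable estimates. The spread of the sampling distribution (the standard error, \(\sigma/\sqrt{n}\) for a population with standard deviation \(\sigma\)) shrinks at rate \(1/\sqrt{n}\), so quadrupling the sample size halves it.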
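One way to check the \(1/\sqrt{n}\) rate numerically is to simulate many sample means at several sample sizes and compare their spread with \(\sigma/\sqrt{n}\). The sketch below assumes a normal population with mean 5 and SD 2; `sd_of_means` is a hypothetical helper written for this illustration.

```r
set.seed(42)
sigma <- 2  # population SD (assumed)

# Hypothetical helper: empirical SD of `reps` sample means at sample size n
sd_of_means <- function(n, reps = 5000) {
  sd(replicate(reps, mean(rnorm(n, mean = 5, sd = sigma))))
}

sizes <- c(10, 40, 160)
comparison <- data.frame(
  n         = sizes,
  simulated = sapply(sizes, sd_of_means),  # spread seen in simulation
  theory    = sigma / sqrt(sizes)          # standard error formula
)
print(comparison)
```

Each step in `sizes` multiplies \(n\) by 4, so the `theory` column halves from row to row, and the simulated values should track it closely.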

Try it: press “Draw samples” a few times at \(n = 10\), then slide up to \(n = 200\) and watch how the sample means cluster tighter around the true mean.

#| standalone: true
#| viewerHeight: 420

library(shiny)

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      width = 3,
      sliderInput("n", "Sample size (n):", min = 5, max = 200, value = 10, step = 5),
      sliderInput("reps", "Repeated samples:", min = 5, max = 50, value = 20, step = 5),
      actionButton("go", "Draw samples", class = "btn-primary", width = "100%")
    ),
    mainPanel(
      width = 9,
      plotOutput("dot_plot", height = "350px")
    )
  )
)

server <- function(input, output, session) {
  output$dot_plot <- renderPlot({
    input$go
    n    <- input$n
    reps <- input$reps

    means <- replicate(reps, mean(rnorm(n, mean = 5, sd = 2)))

    par(mar = c(4.5, 2, 3, 1))
    stripchart(means, method = "stack", pch = 19, cex = 1.5,
               col = "#3498db", offset = 0.5,
               xlim = c(2, 8),  # fixed axis; wide enough that means at n = 5 are rarely clipped
               main = paste0(reps, " sample means (n = ", n, ")"),
               xlab = "Sample mean")
    abline(v = 5, lty = 2, lwd = 2, col = "#e74c3c")
    legend("topright", legend = "True mean = 5",
           col = "#e74c3c", lty = 2, lwd = 2, bty = "n")
  })
}

shinyApp(ui, server)

Practice questions:

  1. What happens to the spread of sample means as \(n\) increases?
  2. Does the population distribution change when you change \(n\)?
  3. Could a single sample mean be far from the truth? Is that more likely with small or large \(n\)?

3. Confidence Intervals

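A 95% confidence interval does not mean “there’s a 95% probability the true value is inside this particular interval.” In the frequentist view the true value is fixed: once the interval is computed, it either contains the true value or it does not.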

The correct interpretation: if you repeated the experiment many times and built a CI each time, about 95% of those intervals would contain the true value. The 95% describes the procedure, not any single interval.
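For concreteness, here is a minimal sketch of how one such interval is built from a single sample. It assumes roughly normal data with an unknown population SD, so the critical value comes from the t distribution. The sample is simulated and the variable names are illustrative.

```r
set.seed(7)
x <- rnorm(30, mean = 0, sd = 1)  # one simulated sample; the true mean is 0

n    <- length(x)
xbar <- mean(x)
se   <- sd(x) / sqrt(n)           # estimated standard error of the mean
crit <- qt(0.975, df = n - 1)     # t critical value for a 95% interval

ci_manual <- c(lower = xbar - crit * se, upper = xbar + crit * se)

# Base R's t.test() reports the same interval
ci_ttest <- t.test(x, conf.level = 0.95)$conf.int

print(ci_manual)
```

Whether this particular interval contains 0 is a yes-or-no fact. The 95% refers to how often the recipe succeeds across repeated samples.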

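The simulation below shows exactly this. Each horizontal line is one CI from a fresh sample, and the dashed vertical line marks the true mean. Intervals that cover it are drawn in blue; those that miss are drawn in red (about 5% of them at the 95% level).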

#| standalone: true
#| viewerHeight: 500

library(shiny)

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      width = 3,
      sliderInput("n", "Sample size (n):", min = 10, max = 200, value = 30, step = 10),
      sliderInput("k", "Number of CIs:", min = 20, max = 100, value = 50, step = 10),
      sliderInput("conf", "Confidence level:", min = 0.80, max = 0.99, value = 0.95, step = 0.01),
      actionButton("go", "New experiment", class = "btn-primary", width = "100%"),
      uiOutput("coverage")
    ),
    mainPanel(
      width = 9,
      plotOutput("ci_plot", height = "420px")
    )
  )
)

server <- function(input, output, session) {

  res <- reactive({
    input$go
    n    <- input$n
    k    <- input$k
    conf <- input$conf
    mu   <- 0

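    # The SD is estimated from each sample, so use the t critical value;
    # a normal quantile here would under-cover at small n
    crit <- qt(1 - (1 - conf) / 2, df = n - 1)

    ci <- t(replicate(k, {
      x    <- rnorm(n, mean = mu, sd = 1)
      xbar <- mean(x)
      se   <- sd(x) / sqrt(n)
      c(xbar, xbar - crit * se, xbar + crit * se)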
    }))

    covers <- ci[, 2] <= mu & ci[, 3] >= mu

    list(ci = ci, covers = covers, mu = mu, k = k, conf = conf)
  })

  output$ci_plot <- renderPlot({
    r <- res()
    par(mar = c(4.5, 4, 3, 1))

    plot(NULL, xlim = range(r$ci[, 2:3]), ylim = c(1, r$k),
         xlab = "Value", ylab = "Sample #",
         main = paste0(round(r$conf * 100), "% Confidence Intervals"))

    for (i in seq_len(r$k)) {
      clr <- if (r$covers[i]) "#3498db" else "#e74c3c"
      lw  <- if (r$covers[i]) 1.5 else 2.5
      segments(r$ci[i, 2], i, r$ci[i, 3], i, col = clr, lwd = lw)
      points(r$ci[i, 1], i, pch = 16, cex = 0.5, col = clr)
    }

    abline(v = r$mu, lty = 2, lwd = 2, col = "#2c3e50")
  })

  output$coverage <- renderUI({
    r <- res()
    pct <- round(100 * mean(r$covers), 1)
    miss <- sum(!r$covers)
    tags$div(style = "background:#f0f4f8; border-radius:6px; padding:12px; margin-top:12px; font-size:14px; line-height:1.8;",
      HTML(paste0(
        "<b>Coverage:</b> ", pct, "%<br>",
        "<b>Missed:</b> ", miss, " / ", r$k, "<br>",
        "<small>Target: ", round(r$conf * 100), "%</small>"
      ))
    )
  })
}

shinyApp(ui, server)

Key takeaways:

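  • A confidence level describes how often the interval-building procedure captures the parameter; it is not a probability statement about the parameter itself
  • Wider intervals signal more uncertainty; they arise from small \(n\), noisier data, or a higher confidence level
  • The observed coverage rate approaches the nominal level as the number of experiments grows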
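
The last point can be checked with a short simulation. This is a sketch assuming Normal(0, 1) data. `coverage` is a hypothetical helper, and it uses t-based intervals, a choice made here because the SD is estimated from each sample.

```r
set.seed(123)

# Hypothetical helper: share of k intervals (sample size n) that contain mu
coverage <- function(k, n = 30, conf = 0.95, mu = 0) {
  crit <- qt(1 - (1 - conf) / 2, df = n - 1)
  hits <- replicate(k, {
    x  <- rnorm(n, mean = mu, sd = 1)
    se <- sd(x) / sqrt(n)
    abs(mean(x) - mu) <= crit * se   # TRUE when the interval covers mu
  })
  mean(hits)
}

cov_small <- coverage(k = 50)      # noisy: roughly one screen of intervals
cov_large <- coverage(k = 20000)   # should sit close to the nominal 0.95

print(c(k50 = cov_small, k20000 = cov_large))
```

With only 50 intervals the observed coverage can easily land a few points away from 95%, which is why a single run of an experiment may show noticeably more or fewer misses than expected.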

Did you know?

  • Florence Nightingale wasn’t just a nurse; she was a pioneering statistician. She popularized the polar area diagram (a variant of the pie chart) to convince the British government that far more soldiers were dying from preventable disease than from combat wounds. Her charts helped change military policy and saved thousands of lives.
  • The word “statistics” comes from the German Statistik, meaning “science of the state” — it originally referred to collecting data about populations for government use.
  • John Graunt (1620–1674) is considered the father of demography. He analyzed London’s death records and discovered that more boys are born than girls, that urban death rates exceed rural ones, and that plague deaths follow seasonal patterns — all from just counting.