Regression & the CEF

What is the CEF?

The conditional expectation function (CEF), written \(E[Y \mid X]\), answers a simple question: “What do you expect \(Y\) to be, given that you know \(X\)?”

Say you’re looking at income (\(Y\)) vs. years of education (\(X\)). The CEF answers: among everyone with exactly 12 years of education, what’s their average income? What about 16 years? 20 years? If you plot those averages, you get a curve — that’s the CEF. It could be a straight line, or it could bend, flatten out, jump — whatever the data actually does.
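A minimal sketch of that idea in R, using simulated data. The education levels (12, 16, 20) and the mean incomes (35, 55, 62, in thousands) are made-up values for illustration, not figures from any real dataset.

```r
set.seed(42)

# Hypothetical population: three education levels with made-up mean incomes
# (in thousands). These numbers are assumptions for illustration only.
true_mean <- c("12" = 35, "16" = 55, "20" = 62)

educ   <- sample(c(12, 16, 20), size = 3000, replace = TRUE)
income <- true_mean[as.character(educ)] + rnorm(3000, sd = 10)

# Empirical CEF: the average of Y within each value of X
cef_hat <- tapply(income, educ, mean)
print(round(cef_hat, 1))
```

Each entry of `cef_hat` is one point on the empirical CEF. Here the points do not fall on a straight line, because the assumed jump from 12 to 16 years is larger than the jump from 16 to 20.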

How does regression fit in?

OLS fits a straight line to the CEF. There are two cases:

  1. The CEF is already a straight line: the OLS line and the CEF coincide (in a finite sample, up to sampling noise). Each extra unit of \(X\) adds the same amount to expected \(Y\).
  2. The CEF is curved: maybe the first few years of education matter a lot, but returns flatten out by the PhD level. OLS can’t bend, so it draws the best straight line it can through that curve, where “best” means smallest mean squared error. It’s a useful summary, but it misses the shape.
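In symbols, one standard way to state the two cases is the following. The symbols \(a\), \(b\), \(\alpha\), and \(\beta\) are introduced here for illustration and do not appear elsewhere on this page. The population OLS coefficients solve:

```latex
(\alpha, \beta) = \arg\min_{a,\,b} \; E\!\left[\big(E[Y \mid X] - a - bX\big)^2\right],
\qquad
\beta = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)},
\qquad
\alpha = E[Y] - \beta\, E[X].
```

When \(E[Y \mid X]\) is itself linear in \(X\), the minimum is zero and the line is the CEF (case 1). Otherwise the minimum is positive and the line is only an approximation (case 2).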

Regression doesn’t assume the world is linear. It gives you the best linear approximation to whatever the true relationship is. The simulator below lets you see exactly where that approximation works and where it breaks down.
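One way to check the “best linear approximation to the CEF” claim numerically is with a discrete \(X\): regressing the group means of \(Y\) on \(X\), weighted by group size, gives the same coefficients as OLS on the raw data. The sketch below assumes a made-up log-shaped DGP and hypothetical variable names (`micro`, `grouped`, `cef_fit`).

```r
set.seed(7)

# Discrete X so that the empirical CEF is just a table of group means.
# The DGP (3 * log(x + 1) plus noise) is an assumption for illustration.
x <- sample(0:8, size = 2000, replace = TRUE)
y <- 3 * log(x + 1) + rnorm(2000, sd = 1.5)

# OLS on the individual observations
micro <- lm(y ~ x)

# Empirical CEF: one row per value of X, with its mean of Y and its count
grouped <- data.frame(
  x    = sort(unique(x)),
  ybar = as.numeric(tapply(y, x, mean)),
  n    = as.numeric(table(x))
)

# Weighted OLS on the CEF points, weighting each point by its group size
cef_fit <- lm(ybar ~ x, data = grouped, weights = n)

print(rbind(micro = coef(micro), cef_fit = coef(cef_fit)))
```

The two rows match, so the OLS line can be read as a straight-line fit to the CEF points themselves, even though the CEF here is curved.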

In the plots: the red dots show the conditional mean of \(Y\) in each bin of \(X\) (the empirical CEF), the green curve is the true CEF, and the blue line is OLS. Switch DGPs to see when they agree and when they diverge.

Reading the residual plot

The right panel shows residuals vs. fitted values with a LOESS smoother (red curve). LOESS fits a small weighted regression at each point using only nearby observations, producing a flexible curve that follows local patterns. If the linear specification is correct, the LOESS line should stay roughly flat around zero, apart from sampling noise. If it bends systematically, that’s visual evidence of nonlinearity that OLS is missing.
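To make the “small weighted regression at each point” idea concrete, here is a simplified local linear smoother. It is a sketch of the core idea only, not the algorithm inside `stats::loess` (which by default fits local quadratics and has more options). The helper name `local_linear` and the tricube weighting with `span = 0.75` are assumptions made for this example.

```r
# Simplified local linear smoother: a sketch, NOT stats::loess itself.
local_linear <- function(x, y, x0, span = 0.75) {
  k <- ceiling(span * length(x))             # number of neighbours in the window
  d <- abs(x - x0)                           # distance of every point from x0
  h <- sort(d)[k]                            # bandwidth: distance to the k-th nearest point
  w <- ifelse(d < h, (1 - (d / h)^3)^3, 0)   # tricube weights, zero outside the window
  fit <- lm(y ~ x, weights = w)              # weighted least squares centred on x0
  unname(predict(fit, newdata = data.frame(x = x0)))
}

set.seed(1)
x <- runif(500, -3, 5)
y <- 1 + 0.8 * x - 0.15 * x^2 + rnorm(500, sd = 1)   # quadratic CEF

ols <- lm(y ~ x)
r   <- resid(ols)
fv  <- fitted(ols)

# Smooth the residuals along a grid of fitted values
grid     <- seq(min(fv), max(fv), length.out = 25)
smoothed <- sapply(grid, function(g) local_linear(fv, r, g))
print(round(smoothed[c(1, 13, 25)], 2))
```

With a quadratic CEF like this one, the smoothed residuals are negative at both ends and positive in the middle, which is the curved pattern the residual plot is meant to reveal.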

#| standalone: true
#| viewerHeight: 680

library(shiny)

dgp_choices <- c(
  "Linear",
  "Quadratic",
  "Log (diminishing returns)",
  "Step function",
  "Sine wave"
)

# True CEF for each DGP
cef_fun <- function(x, dgp) {
  switch(dgp,
    "Linear"                   = 2 + 1.5 * x,
    "Quadratic"                = 1 + 0.8 * x - 0.15 * x^2,
    "Log (diminishing returns)" = 3 * log(x + 1),
    "Step function"            = ifelse(x < 0, -1, 2),
    "Sine wave"                = 2 * sin(x),
    stop("Unknown DGP: ", dgp)  # fail loudly instead of silently using a different line
  )
}

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .info-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.8;
    }
    .info-box b { color: #2c3e50; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      selectInput("dgp", "True DGP:",
                  choices = dgp_choices),

      sliderInput("n", "Sample size (n):",
                  min = 50, max = 1000, value = 300, step = 50),

      sliderInput("sigma", HTML("Error SD (&sigma;):"),
                  min = 0.5, max = 4, value = 1.5, step = 0.5),

      sliderInput("nbins", "Bins for CEF:",
                  min = 5, max = 30, value = 15, step = 1),

      actionButton("redraw", "New draw", class = "btn-primary", width = "100%"),

      uiOutput("info_box")
    ),

    mainPanel(
      width = 9,
      fluidRow(
        column(6, plotOutput("scatter_plot", height = "420px")),
        column(6, plotOutput("resid_plot",   height = "420px"))
      )
    )
  )
)

server <- function(input, output, session) {

  dat <- reactive({
    input$redraw
    n     <- input$n
    sigma <- input$sigma
    dgp   <- input$dgp

    # X range depends on DGP
    if (dgp == "Log (diminishing returns)") {
      x <- runif(n, 0, 6)
    } else if (dgp == "Step function") {
      x <- runif(n, -3, 3)
    } else if (dgp == "Sine wave") {
      x <- runif(n, -pi, 2 * pi)
    } else {
      x <- runif(n, -3, 5)
    }

    mu <- cef_fun(x, dgp)
    y  <- mu + rnorm(n, sd = sigma)

    ols <- lm(y ~ x)

    # Bin X and compute conditional means
    nbins <- input$nbins
    breaks <- seq(min(x), max(x), length.out = nbins + 1)
    bin    <- cut(x, breaks, include.lowest = TRUE)
    bin_mid <- tapply(x, bin, mean)
    bin_cef <- tapply(y, bin, mean)
    bin_n   <- tapply(y, bin, length)

    list(x = x, y = y, mu = mu, ols = ols, dgp = dgp,
         bin_mid = bin_mid, bin_cef = bin_cef, bin_n = bin_n)
  })

  # --- Main scatter plot ---
  output$scatter_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    plot(d$x, d$y, pch = 16, col = "#bdc3c780", cex = 0.6,
         xlab = "X", ylab = "Y",
         main = paste("DGP:", d$dgp))

    # True CEF curve
    xo <- sort(d$x)
    lines(xo, cef_fun(xo, d$dgp), col = "#2ecc71", lwd = 2.5)

    # OLS line
    abline(d$ols, col = "#3498db", lwd = 2.5)

    # Binned conditional means (empirical CEF)
    keep <- !is.na(d$bin_cef) & !is.na(d$bin_mid)
    points(d$bin_mid[keep], d$bin_cef[keep],
           pch = 19, col = "#e74c3c", cex = 1.6)

    legend("topleft", bty = "n", cex = 0.85,
           legend = c(expression("True " * E * "[Y|X]"),
                      "OLS regression",
                      expression("Binned " * bar(Y) * " (empirical CEF)")),
           col = c("#2ecc71", "#3498db", "#e74c3c"),
           lwd = c(2.5, 2.5, NA),
           pch = c(NA, NA, 19),
           pt.cex = c(NA, NA, 1.4))
  })

  # --- Residual plot ---
  output$resid_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    r <- resid(d$ols)
    fv <- fitted(d$ols)

    plot(fv, r, pch = 16, col = "#9b59b680", cex = 0.6,
         xlab = "Fitted values", ylab = "Residuals",
         main = "Residuals vs Fitted")
    abline(h = 0, lty = 2, col = "gray40", lwd = 1.5)

    # Loess smoother to reveal nonlinearity
    lo <- loess(r ~ fv)
    ox <- order(fv)
    lines(fv[ox], predict(lo)[ox], col = "#e74c3c", lwd = 2)

    legend("topleft", bty = "n", cex = 0.85,
           legend = c("Loess smoother", "Zero line"),
           col = c("#e74c3c", "gray40"),
           lwd = c(2, 1.5),
           lty = c(1, 2))
  })

  # --- Info box ---
  output$info_box <- renderUI({
    d <- dat()
    b0 <- round(coef(d$ols)[1], 3)
    b1 <- round(coef(d$ols)[2], 3)
    r2 <- round(summary(d$ols)$r.squared, 3)

    linear <- d$dgp == "Linear"

    tags$div(class = "info-box",
      HTML(paste0(
        "<b>OLS:</b> Y = ", b0, if (b1 < 0) " &minus; " else " + ", abs(b1), "X<br>",
        "<b>R&sup2;:</b> ", r2, "<br><br>",
        if (linear) {
          "<span style='color:#27ae60;'><b>&#10003; CEF is linear: OLS recovers it up to sampling noise.</b></span>"
        } else {
          "<span style='color:#e67e22;'><b>CEF is nonlinear &mdash; OLS is the best linear approximation.</b><br>Check the residual plot for the pattern.</span>"
        }
      ))
    )
  })
}

shinyApp(ui, server)

Did you know?

  • The word “regression” comes from Francis Galton’s 1886 study of heights. He noticed that tall parents tended to have children who were tall, but not as tall as their parents. Short parents had children who were short, but not as short. Children’s heights “regressed toward the mean.” The statistical technique kept the name, even though it is now used far beyond that original setting.
  • Galton was also Charles Darwin’s half-cousin. He applied statistical thinking to heredity, fingerprints, and even the optimal way to brew tea.
  • The conditional expectation function is sometimes called the “regression function”, which makes sense once you know Galton’s story. When the CEF is linear in \(X\), OLS estimates it directly; when it is not, OLS estimates the best linear approximation to it.