R-squared & Pseudo R-squared

What R-squared actually measures

\(R^2\) answers one question: what fraction of the variation in \(Y\) is explained by the model?

\[R^2 = 1 - \frac{\text{SSR}}{\text{SST}}\]

where:

  • SST = total sum of squares = \(\sum(Y_i - \bar{Y})^2\) — how much \(Y\) varies overall
  • SSR = sum of squared residuals = \(\sum(Y_i - \hat{Y}_i)^2\) — how much variation the model fails to explain

If the model explains everything, SSR = 0 and \(R^2 = 1\). If the model explains nothing, SSR = SST and \(R^2 = 0\).
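The definition is easy to check directly against `lm()`. A minimal sketch on simulated data (the slope and noise level are arbitrary choices for illustration):

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 1.5 * x + rnorm(n, sd = 2)   # arbitrary signal and noise levels

fit <- lm(y ~ x)
sst <- sum((y - mean(y))^2)       # total variation in y
ssr <- sum(resid(fit)^2)          # variation the model fails to explain
r2  <- 1 - ssr / sst

all.equal(r2, summary(fit)$r.squared)   # the hand computation matches lm's
```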

The Oracle View. In the simulations below, we control the true DGP — so we know exactly how much signal exists. In practice you never know the “true” \(R^2\); you only observe the sample estimate, which is biased upward as an estimator of it, increasingly so with more predictors and smaller samples.

#| standalone: true
#| viewerHeight: 620

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .info-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.8;
    }
    .info-box b { color: #2c3e50; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("signal", "Signal strength (slope):",
                  min = 0, max = 3, value = 1, step = 0.25),

      sliderInput("noise", "Noise (SD):",
                  min = 0.5, max = 6, value = 2, step = 0.5),

      sliderInput("n", "Sample size:",
                  min = 30, max = 500, value = 100, step = 10),

      sliderInput("k", "Number of predictors:",
                  min = 1, max = 30, value = 1, step = 1),

      actionButton("go", "New draw", class = "btn-primary", width = "100%"),

      uiOutput("info")
    ),

    mainPanel(
      width = 9,
      fluidRow(
        column(6, plotOutput("scatter", height = "400px")),
        column(6, plotOutput("decomp", height = "400px"))
      )
    )
  )
)

server <- function(input, output, session) {

  dat <- reactive({
    input$go
    n     <- input$n
    noise <- input$noise
    sig   <- input$signal
    k     <- input$k

    # First predictor has signal; rest are pure noise predictors
    x1 <- rnorm(n)
    y  <- sig * x1 + rnorm(n, sd = noise)

    # Build design matrix
    X <- matrix(rnorm(n * k), ncol = k)
    X[, 1] <- x1

    fit <- lm(y ~ X)

    r2     <- summary(fit)$r.squared
    adj_r2 <- summary(fit)$adj.r.squared

    list(x1 = x1, y = y, fit = fit, r2 = r2, adj_r2 = adj_r2,
         n = n, k = k, sig = sig, noise = noise)
  })

  output$scatter <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    ybar <- mean(d$y)
    fv   <- fitted(d$fit)

    plot(d$x1, d$y, pch = 16, col = "#3498db60", cex = 0.7,
         xlab = expression(X[1]), ylab = "Y",
         main = "Data + OLS fit (first predictor)")

    # Fitted line for x1
    abline(lm(d$y ~ d$x1), col = "#e74c3c", lwd = 2.5)
    abline(h = ybar, col = "gray50", lty = 2, lwd = 1.5)

    legend("topleft", bty = "n", cex = 0.85,
           legend = c(paste0("R\u00b2 = ", round(d$r2, 3)),
                      paste0("Adj R\u00b2 = ", round(d$adj_r2, 3))),
           text.col = c("#2c3e50", "#7f8c8d"))
  })

  output$decomp <- renderPlot({
    d  <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    r     <- resid(d$fit)
    fv    <- fitted(d$fit)
    ybar  <- mean(d$y)

    sst <- sum((d$y - ybar)^2)
    ssr <- sum(r^2)
    sse <- sst - ssr

    barplot(c(SST = sst, "Explained\n(SST - SSR)" = sse, "Residual\n(SSR)" = ssr),
            col = c("#95a5a6", "#27ae60", "#e74c3c"),
            main = "Variance decomposition",
            ylab = "Sum of squares", border = NA)

    text(0.7, sst * 0.5, "Total", col = "white", font = 2, cex = 1.1)
    text(1.9, sse * 0.5, paste0(round(100 * d$r2, 1), "%"), col = "white", font = 2, cex = 1.1)
    text(3.1, ssr * 0.5, paste0(round(100 * (1 - d$r2), 1), "%"), col = "white", font = 2, cex = 1.1)
  })

  output$info <- renderUI({
    d <- dat()
    gap <- round(d$r2 - d$adj_r2, 3)

    msg <- paste0(
      "<b>R\u00b2:</b> ", round(d$r2, 3),
      " &mdash; model explains ", round(100 * d$r2, 1), "% of variance<br>",
      "<b>Adj R\u00b2:</b> ", round(d$adj_r2, 3),
      " (gap: ", gap, ")<br>",
      "<hr style='margin:8px 0'>",
      "<small>With ", d$k, " predictor(s) and n=", d$n,
      ". Try adding noise predictors (k>1) with slope=0 to see R\u00b2 inflate while Adj R\u00b2 doesn't.</small>"
    )

    tags$div(class = "info-box", HTML(msg))
  })
}

shinyApp(ui, server)

Things to try

  • Increase noise, keep signal fixed: \(R^2\) drops — the model explains less of a noisier outcome.
  • Set signal = 0: \(R^2\) should be near 0, but with small \(n\) it can be surprisingly large — the model is fitting chance patterns in the noise, which is overfitting in-sample.
  • Increase predictors (k) with signal = 0: \(R^2\) climbs even though all predictors are noise. Adjusted \(R^2\) stays near 0 or goes negative. This is why adjusted \(R^2\) exists.

Adjusted R-squared

\(R^2\) never decreases when you add a predictor — even a useless one — because an extra parameter can only reduce SSR (or leave it unchanged). Adjusted \(R^2\) penalizes for the number of predictors \(k\):

\[\bar{R}^2 = 1 - \frac{\text{SSR} / (n - k - 1)}{\text{SST} / (n - 1)}\]

This is equivalent to using variance estimates (dividing by degrees of freedom) rather than raw sums of squares. A noise predictor that barely reduces SSR will be outweighed by the lost degree of freedom, and adjusted \(R^2\) will drop.

Adjusted \(R^2\) can be negative — this means your model is worse than just predicting \(\bar{Y}\) for everyone.
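The penalty is easy to verify by hand. A sketch, assuming a deliberately bad setup where all \(k\) predictors are pure noise (the sample size and \(k\) are arbitrary):

```r
set.seed(1)
n <- 50
k <- 10
X <- matrix(rnorm(n * k), ncol = k)   # k pure-noise predictors
y <- rnorm(n)                         # y is unrelated to X

fit <- lm(y ~ X)
ssr <- sum(resid(fit)^2)
sst <- sum((y - mean(y))^2)

# adjusted R^2 from the formula: variance estimates, not raw sums of squares
adj_r2 <- 1 - (ssr / (n - k - 1)) / (sst / (n - 1))

all.equal(adj_r2, summary(fit)$adj.r.squared)   # matches lm's value
```

With noise predictors, the raw \(R^2\) from this fit is well above 0 while the adjusted version hovers around 0 (and often below it), which is exactly the penalty at work.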


What R-squared doesn’t tell you

\(R^2\) is one of the most misunderstood statistics. High \(R^2\) does not mean:

  • “The model is correctly specified”: a linear model with \(R^2 = 0.95\) can still miss a quadratic term. Check residual plots.
  • “The estimates are unbiased”: omitted variable bias can exist at any \(R^2\) level.
  • “The model predicts well out of sample”: overfitting inflates \(R^2\) in-sample. See model selection.
  • “Higher \(R^2\) = better model”: adding garbage predictors increases \(R^2\). Use adjusted \(R^2\), AIC, or cross-validation to compare.
  • “Low \(R^2\) means X doesn’t matter”: in a policy study, \(X\) can have a causal effect even if \(R^2 = 0.02\). Most of the variation in \(Y\) comes from things you can’t control.

The last point is critical for applied work. Angrist and Pischke emphasize that a low \(R^2\) is perfectly fine in causal studies. If you randomize a drug and regress health outcomes on treatment, \(R^2\) will be low (most health variation comes from genetics, age, etc.) — but the treatment effect is still identified.


Pseudo R-squared: what to do when R-squared doesn’t exist

For nonlinear models like logit and probit, the model is fit by maximum likelihood rather than least squares, so there is no clean SSR/SST decomposition to build \(R^2\) from. But we still want to measure “how much does the model explain?” Several pseudo \(R^2\) measures approximate this idea, each in a different way.

All pseudo \(R^2\) measures compare your model against a null model (intercept only). They differ in how they make the comparison.

#| standalone: true
#| viewerHeight: 620

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .info-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.8;
    }
    .info-box b { color: #2c3e50; }
    .good  { color: #27ae60; font-weight: bold; }
    .muted { color: #7f8c8d; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("beta", "True effect (logit slope):",
                  min = 0, max = 3, value = 1, step = 0.25),

      sliderInput("n3", "Sample size:",
                  min = 100, max = 1000, value = 300, step = 50),

      actionButton("go3", "New draw", class = "btn-primary", width = "100%"),

      uiOutput("info3")
    ),

    mainPanel(
      width = 9,
      fluidRow(
        column(4, plotOutput("logit_fit", height = "380px")),
        column(4, plotOutput("pseudo_compare", height = "380px")),
        column(4, plotOutput("ll_bar", height = "380px"))
      )
    )
  )
)

server <- function(input, output, session) {

  dat3 <- reactive({
    input$go3
    n    <- input$n3
    beta <- input$beta

    x <- rnorm(n)
    p <- plogis(beta * x)
    y <- rbinom(n, 1, p)

    fit_full <- glm(y ~ x, family = binomial)
    fit_null <- glm(y ~ 1, family = binomial)

    ll_full <- logLik(fit_full)
    ll_null <- logLik(fit_null)

    # McFadden
    mcf <- 1 - as.numeric(ll_full) / as.numeric(ll_null)

    # Cox-Snell
    cox_snell <- 1 - exp(-2/n * (as.numeric(ll_full) - as.numeric(ll_null)))

    # Nagelkerke
    cox_snell_max <- 1 - exp(2/n * as.numeric(ll_null))
    nagelkerke <- cox_snell / cox_snell_max

    # Efron
    phat <- fitted(fit_full)
    efron <- 1 - sum((y - phat)^2) / sum((y - mean(y))^2)

    # Count
    count_r2 <- mean((phat > 0.5) == y)

    list(x = x, y = y, p = p, fit = fit_full, beta = beta,
         ll_full = as.numeric(ll_full), ll_null = as.numeric(ll_null),
         mcf = mcf, cox_snell = cox_snell, nagelkerke = nagelkerke,
         efron = efron, count_r2 = count_r2)
  })

  output$logit_fit <- renderPlot({
    d <- dat3()
    par(mar = c(4.5, 4.5, 3, 1))

    ox <- order(d$x)
    plot(d$x, d$y, pch = 16, col = "#3498db40", cex = 0.6,
         xlab = "X", ylab = "P(Y = 1)",
         main = "Logit fit")
    lines(d$x[ox], fitted(d$fit)[ox], col = "#e74c3c", lwd = 2.5)
    lines(d$x[ox], d$p[ox], col = "#27ae60", lwd = 2, lty = 2)
    legend("topleft", bty = "n", cex = 0.8,
           legend = c("Estimated", "True"),
           col = c("#e74c3c", "#27ae60"), lwd = 2, lty = c(1, 2))
  })

  output$pseudo_compare <- renderPlot({
    d <- dat3()
    par(mar = c(4.5, 8, 3, 1))

    vals  <- c(d$mcf, d$cox_snell, d$nagelkerke, d$efron, d$count_r2)
    names <- c("McFadden", "Cox-Snell", "Nagelkerke", "Efron", "Count")
    cols  <- c("#9b59b6", "#3498db", "#e67e22", "#27ae60", "#95a5a6")

    bp <- barplot(vals, names.arg = names, col = cols, border = NA,
                  main = "Pseudo R\u00b2 measures",
                  ylab = "Value", ylim = c(0, max(1, max(vals) + 0.05)),
                  las = 2, cex.names = 0.85)
    text(bp, vals + 0.02, round(vals, 3), cex = 0.85, font = 2)
  })

  output$ll_bar <- renderPlot({
    d <- dat3()
    par(mar = c(4.5, 4.5, 3, 1))

    vals <- c(d$ll_null, d$ll_full)
    cols <- c("#e74c3c", "#27ae60")

    bp <- barplot(vals, names.arg = c("Null\n(intercept only)", "Full\n(with X)"),
                  col = cols, border = NA, main = "Log-likelihoods",
                  ylab = "Log-likelihood")
    text(bp, vals + abs(min(vals)) * 0.03, round(vals, 1), font = 2, cex = 0.9)

    arrows(bp[1], d$ll_null, bp[1], d$ll_full, col = "#2c3e50",
           lwd = 2, length = 0.1, code = 3)
    text(bp[1] + 0.3, (d$ll_null + d$ll_full) / 2, "Improvement",
         cex = 0.8, srt = 90)
  })

  output$info3 <- renderUI({
    d <- dat3()

    tags$div(class = "info-box", HTML(paste0(
      "<b>McFadden:</b> ", round(d$mcf, 3), "<br>",
      "<b>Cox-Snell:</b> ", round(d$cox_snell, 3), "<br>",
      "<b>Nagelkerke:</b> ", round(d$nagelkerke, 3), "<br>",
      "<b>Efron:</b> ", round(d$efron, 3), "<br>",
      "<b>Count:</b> ", round(d$count_r2, 3), "<br>",
      "<hr style='margin:8px 0'>",
      "<small>McFadden 0.2&ndash;0.4 = excellent fit. Don't compare to OLS R\u00b2 values.</small>"
    )))
  })
}

shinyApp(ui, server)

The pseudo R-squared family

McFadden — the most common in economics:

\[R^2_{\text{McF}} = 1 - \frac{\ln \hat{L}_{\text{full}}}{\ln \hat{L}_{\text{null}}}\]

Compares log-likelihoods. Both are negative and \(\ln \hat{L}_{\text{full}} \geq \ln \hat{L}_{\text{null}}\), so the ratio lies between 0 and 1. If your model is no better than the null, the ratio is 1 and \(R^2 = 0\). Values of 0.2–0.4 indicate excellent fit — don’t compare to OLS \(R^2\) magnitudes.
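Computed by hand from two `glm()` fits, mirroring the app above (the simulated slope of 1 is an arbitrary choice):

```r
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(x))   # binary outcome from a true logit with slope 1

fit_full <- glm(y ~ x, family = binomial)
fit_null <- glm(y ~ 1, family = binomial)   # intercept-only baseline

# McFadden: one minus the ratio of log-likelihoods
mcfadden <- 1 - as.numeric(logLik(fit_full)) / as.numeric(logLik(fit_null))
```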

Cox-Snell — based on the likelihood ratio:

\[R^2_{\text{CS}} = 1 - \left(\frac{\hat{L}_{\text{null}}}{\hat{L}_{\text{full}}}\right)^{2/n}\]

Problem: its theoretical maximum is \(1 - \hat{L}_{\text{null}}^{2/n}\), which is strictly less than 1 for discrete outcomes — even a perfect model cannot reach 1, which is awkward.

Nagelkerke — rescales Cox-Snell so the maximum is 1:

\[R^2_{\text{N}} = \frac{R^2_{\text{CS}}}{1 - \left(\hat{L}_{\text{null}}\right)^{2/n}}\]
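Both measures follow from the same two log-likelihoods. A sketch (same simulated setup as above) showing the Cox-Snell ceiling and the Nagelkerke rescaling:

```r
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(x))

ll_full <- as.numeric(logLik(glm(y ~ x, family = binomial)))
ll_null <- as.numeric(logLik(glm(y ~ 1, family = binomial)))

cs     <- 1 - exp(-(2 / n) * (ll_full - ll_null))   # Cox-Snell
cs_max <- 1 - exp((2 / n) * ll_null)                # its theoretical ceiling, < 1
nagel  <- cs / cs_max                               # Nagelkerke: rescaled so max is 1
```

The ceiling `cs_max` is below 1 whenever the null likelihood is below 1 (always true for discrete data), which is exactly the awkwardness Nagelkerke fixes.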

Efron — closest to OLS \(R^2\) intuition:

\[R^2_{\text{E}} = 1 - \frac{\sum(y_i - \hat{p}_i)^2}{\sum(y_i - \bar{y})^2}\]

Uses predicted probabilities directly. This is essentially the complement of the Brier score, normalized by total variance.

Count — just the percent correctly classified. Simple but crude: it ignores how confident the predictions are.
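Efron and Count work directly from the predicted probabilities rather than the likelihood. A sketch on the same simulated setup:

```r
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(x))

fit  <- glm(y ~ x, family = binomial)
phat <- fitted(fit)                       # predicted probabilities

# Efron: OLS-style R^2 with probabilities in place of fitted values
efron <- 1 - sum((y - phat)^2) / sum((y - mean(y))^2)

# Count: share of observations classified correctly at a 0.5 cutoff
count <- mean((phat > 0.5) == y)
```

Note how `count` throws away the magnitude of `phat`: a prediction of 0.51 and one of 0.99 are treated identically, which is why it is the crudest of the family.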

Things to try

  • Slope = 0: all pseudo \(R^2\) measures are near 0. The model has no predictive power. Count \(R^2\) still shows ~50% (coin flip accuracy).
  • Increase the slope: all measures climb, but at different rates. McFadden is always the smallest. Nagelkerke is always \(\geq\) Cox-Snell.
  • Watch the log-likelihoods: the gap between null and full widens as the effect gets stronger. McFadden is just the ratio of these two bars.

The ML connection

Machine learning uses the same quantities but calls them different names:

  • \(R^2\) → r2_score: fraction of variance explained (regression tasks)
  • McFadden pseudo \(R^2\) → cross-entropy loss: McFadden is a normalized version of cross-entropy; ML just uses the raw loss
  • Efron pseudo \(R^2\) → 1 − Brier score (normalized): how close predicted probabilities are to 0/1
  • Count pseudo \(R^2\) → accuracy: percent correctly classified
  • Adjusted \(R^2\) → cross-validation error: both penalize complexity; CV is more common in ML

The deep connection: cross-entropy loss is the negative log-likelihood. So when ML “minimizes cross-entropy,” it’s doing MLE. McFadden pseudo \(R^2\) just compares that loss between the full model and a baseline. ML skips the \(R^2\) wrapper and works with the raw loss directly, because out-of-sample performance (not in-sample summary) is the goal. See Training as MLE for more.
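The identity is easy to verify: for a fitted logit, mean cross-entropy loss equals the negative log-likelihood per observation. A sketch on simulated data:

```r
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(x))

fit  <- glm(y ~ x, family = binomial)
phat <- fitted(fit)

# cross-entropy loss, as ML frameworks define it for binary outcomes
ce  <- -mean(y * log(phat) + (1 - y) * log(1 - phat))

# negative log-likelihood per observation from the MLE fit
nll <- -as.numeric(logLik(fit)) / n

all.equal(ce, nll)   # identical: minimizing cross-entropy is doing MLE
```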


Did you know?

  • Sewall Wright introduced \(R^2\) (or the “coefficient of determination”) in 1921 in a paper on bone measurements in rabbits. It emerged naturally from his path analysis framework — a precursor to modern structural equation modeling.
  • Daniel McFadden won the 2000 Nobel Prize in Economics for his work on discrete choice models. His pseudo \(R^2\) became standard because it has a clean information-theoretic interpretation: it measures what fraction of the null model’s uncertainty your predictors resolve.
  • The \(R^2\) for the original Mincer (1974) wage equation — regressing log wages on education and experience — was about 0.29. Nearly all wage variation comes from things economists can’t observe. This hasn’t stopped it from being one of the most influential regressions in history.