# R-squared & Pseudo R-squared
## What R-squared actually measures
\(R^2\) answers one question: what fraction of the variation in \(Y\) is explained by the model?
\[R^2 = 1 - \frac{\text{SSR}}{\text{SST}}\]
where:
- SST = total sum of squares = \(\sum(Y_i - \bar{Y})^2\) — how much \(Y\) varies overall
- SSR = sum of squared residuals = \(\sum(Y_i - \hat{Y}_i)^2\) — how much variation the model fails to explain
If the model explains everything, SSR = 0 and \(R^2 = 1\). If the model explains nothing, SSR = SST and \(R^2 = 0\).
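The decomposition is easy to verify by hand. A minimal sketch on simulated data (all names and numbers here are illustrative):

```r
set.seed(42)

# Simulate a simple linear relationship
n <- 200
x <- rnorm(n)
y <- 1.5 * x + rnorm(n, sd = 2)

fit <- lm(y ~ x)

# Compute the decomposition manually
sst <- sum((y - mean(y))^2)    # total variation in y
ssr <- sum(residuals(fit)^2)   # variation the model fails to explain
r2_manual <- 1 - ssr / sst

# Matches lm's own R-squared
all.equal(r2_manual, summary(fit)$r.squared)  # TRUE
```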
#| standalone: true
#| viewerHeight: 620
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.info-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.8;
}
.info-box b { color: #2c3e50; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("signal", "Signal strength (slope):",
min = 0, max = 3, value = 1, step = 0.25),
sliderInput("noise", "Noise (SD):",
min = 0.5, max = 6, value = 2, step = 0.5),
sliderInput("n", "Sample size:",
min = 30, max = 500, value = 100, step = 10),
sliderInput("k", "Number of predictors:",
min = 1, max = 30, value = 1, step = 1),
actionButton("go", "New draw", class = "btn-primary", width = "100%"),
uiOutput("info")
),
mainPanel(
width = 9,
fluidRow(
column(6, plotOutput("scatter", height = "400px")),
column(6, plotOutput("decomp", height = "400px"))
)
)
)
)
server <- function(input, output, session) {
dat <- reactive({
input$go
n <- input$n
noise <- input$noise
sig <- input$signal
k <- input$k
# First predictor has signal; rest are pure noise predictors
x1 <- rnorm(n)
y <- sig * x1 + rnorm(n, sd = noise)
# Build design matrix
X <- matrix(rnorm(n * k), ncol = k)
X[, 1] <- x1
fit <- lm(y ~ X)
r2 <- summary(fit)$r.squared
adj_r2 <- summary(fit)$adj.r.squared
list(x1 = x1, y = y, fit = fit, r2 = r2, adj_r2 = adj_r2,
n = n, k = k, sig = sig, noise = noise)
})
output$scatter <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
ybar <- mean(d$y)
fv <- fitted(d$fit)
plot(d$x1, d$y, pch = 16, col = "#3498db60", cex = 0.7,
xlab = expression(X[1]), ylab = "Y",
main = "Data + OLS fit (first predictor)")
# Fitted line for x1
abline(lm(d$y ~ d$x1), col = "#e74c3c", lwd = 2.5)
abline(h = ybar, col = "gray50", lty = 2, lwd = 1.5)
legend("topleft", bty = "n", cex = 0.85,
legend = c(paste0("R\u00b2 = ", round(d$r2, 3)),
paste0("Adj R\u00b2 = ", round(d$adj_r2, 3))),
text.col = c("#2c3e50", "#7f8c8d"))
})
output$decomp <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
r <- resid(d$fit)
fv <- fitted(d$fit)
ybar <- mean(d$y)
sst <- sum((d$y - ybar)^2)
ssr <- sum(r^2)
sse <- sst - ssr
barplot(c(SST = sst, "Explained\n(SST - SSR)" = sse, "Residual\n(SSR)" = ssr),
col = c("#95a5a6", "#27ae60", "#e74c3c"),
main = "Variance decomposition",
ylab = "Sum of squares", border = NA)
text(0.7, sst * 0.5, "Total", col = "white", font = 2, cex = 1.1)
text(1.9, sse * 0.5, paste0(round(100 * d$r2, 1), "%"), col = "white", font = 2, cex = 1.1)
text(3.1, ssr * 0.5, paste0(round(100 * (1 - d$r2), 1), "%"), col = "white", font = 2, cex = 1.1)
})
output$info <- renderUI({
d <- dat()
gap <- round(d$r2 - d$adj_r2, 3)
msg <- paste0(
"<b>R\u00b2:</b> ", round(d$r2, 3),
" — model explains ", round(100 * d$r2, 1), "% of variance<br>",
"<b>Adj R\u00b2:</b> ", round(d$adj_r2, 3),
" (gap: ", gap, ")<br>",
"<hr style='margin:8px 0'>",
"<small>With ", d$k, " predictor(s) and n=", d$n,
". Try adding noise predictors (k>1) with slope=0 to see R\u00b2 inflate while Adj R\u00b2 doesn't.</small>"
)
tags$div(class = "info-box", HTML(msg))
})
}
shinyApp(ui, server)
### Things to try
- Increase noise, keep signal fixed: \(R^2\) drops — the model explains less of a noisier outcome.
- Set signal = 0: \(R^2\) should be near 0, but with small \(n\) it can be surprisingly large. The model is fitting chance correlations in the sample, which is overfitting in miniature.
- Increase predictors (k) with signal = 0: \(R^2\) climbs even though all predictors are noise. Adjusted \(R^2\) stays near 0 or goes negative. This is why adjusted \(R^2\) exists.
## Adjusted R-squared
\(R^2\) never decreases when you add a predictor — even a useless one — because an extra parameter can only reduce SSR or leave it unchanged. Adjusted \(R^2\) adds a penalty for the number of predictors \(k\):
\[\bar{R}^2 = 1 - \frac{\text{SSR} / (n - k - 1)}{\text{SST} / (n - 1)}\]
This is equivalent to using variance estimates (dividing by degrees of freedom) rather than raw sums of squares. A noise predictor that barely reduces SSR will be outweighed by the lost degree of freedom, and adjusted \(R^2\) will drop.
Adjusted \(R^2\) can be negative — this means that, after accounting for the degrees of freedom it spends, your model fits worse than simply predicting \(\bar{Y}\) for everyone.
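Both facts are easy to check numerically. A sketch, assuming simulated data where the second predictor is pure noise (names are illustrative):

```r
set.seed(1)

n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)               # pure noise predictor
y  <- x1 + rnorm(n, sd = 2)

fit1 <- lm(y ~ x1)
fit2 <- lm(y ~ x1 + x2)

# R-squared never decreases when a predictor is added
summary(fit1)$r.squared <= summary(fit2)$r.squared   # TRUE

# Adjusted R-squared by hand for fit2 (k = 2 predictors)
ssr <- sum(residuals(fit2)^2)
sst <- sum((y - mean(y))^2)
k <- 2
adj_manual <- 1 - (ssr / (n - k - 1)) / (sst / (n - 1))
all.equal(adj_manual, summary(fit2)$adj.r.squared)   # TRUE
```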
## What R-squared doesn’t tell you
\(R^2\) is one of the most misunderstood statistics. High \(R^2\) does not mean:
| Misconception | Reality |
|---|---|
| “The model is correctly specified” | A linear model with \(R^2 = 0.95\) can still miss a quadratic term. Check residual plots. |
| “The estimates are unbiased” | Omitted variable bias can exist at any \(R^2\) level. |
| “The model predicts well out of sample” | Overfitting inflates \(R^2\) in-sample. See model selection. |
| “Higher \(R^2\) = better model” | Adding garbage predictors increases \(R^2\). Use adjusted \(R^2\), AIC, or cross-validation to compare. |
| “Low R-squared means X doesn’t matter” | In a policy study, \(X\) can have a causal effect even if \(R^2 = 0.02\). Most of the variation in \(Y\) comes from things you can’t control. |
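The last row is worth a quick simulation: with a large amount of unexplained variation, \(X\) can have a substantial and precisely estimated effect even though \(R^2\) is tiny. A sketch (all numbers illustrative):

```r
set.seed(7)

n <- 10000
x <- rnorm(n)
y <- 0.5 * x + rnorm(n, sd = 5)   # real effect, drowned in noise

fit <- lm(y ~ x)
summary(fit)$r.squared            # ~0.01: "explains almost nothing"
coef(summary(fit))["x", ]         # yet the slope is ~0.5 and highly significant
```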
## Pseudo R-squared: what to do when R-squared doesn’t exist
For nonlinear models like logit and probit, there are no residuals in the OLS sense and no clean SSR/SST decomposition. But we still want to measure “how much does the model explain?” Several pseudo \(R^2\) measures approximate this idea, each in a different way.
All pseudo \(R^2\) measures compare your model against a null model (intercept only). They differ in how they make the comparison.
#| standalone: true
#| viewerHeight: 620
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.info-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.8;
}
.info-box b { color: #2c3e50; }
.good { color: #27ae60; font-weight: bold; }
.muted { color: #7f8c8d; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("beta", "True effect (logit slope):",
min = 0, max = 3, value = 1, step = 0.25),
sliderInput("n3", "Sample size:",
min = 100, max = 1000, value = 300, step = 50),
actionButton("go3", "New draw", class = "btn-primary", width = "100%"),
uiOutput("info3")
),
mainPanel(
width = 9,
fluidRow(
column(4, plotOutput("logit_fit", height = "380px")),
column(4, plotOutput("pseudo_compare", height = "380px")),
column(4, plotOutput("ll_bar", height = "380px"))
)
)
)
)
server <- function(input, output, session) {
dat3 <- reactive({
input$go3
n <- input$n3
beta <- input$beta
x <- rnorm(n)
p <- plogis(beta * x)
y <- rbinom(n, 1, p)
fit_full <- glm(y ~ x, family = binomial)
fit_null <- glm(y ~ 1, family = binomial)
ll_full <- logLik(fit_full)
ll_null <- logLik(fit_null)
# McFadden
mcf <- 1 - as.numeric(ll_full) / as.numeric(ll_null)
# Cox-Snell
cox_snell <- 1 - exp(-2/n * (as.numeric(ll_full) - as.numeric(ll_null)))
# Nagelkerke
cox_snell_max <- 1 - exp(2/n * as.numeric(ll_null))
nagelkerke <- cox_snell / cox_snell_max
# Efron
phat <- fitted(fit_full)
efron <- 1 - sum((y - phat)^2) / sum((y - mean(y))^2)
# Count
count_r2 <- mean((phat > 0.5) == y)
list(x = x, y = y, p = p, fit = fit_full, beta = beta,
ll_full = as.numeric(ll_full), ll_null = as.numeric(ll_null),
mcf = mcf, cox_snell = cox_snell, nagelkerke = nagelkerke,
efron = efron, count_r2 = count_r2)
})
output$logit_fit <- renderPlot({
d <- dat3()
par(mar = c(4.5, 4.5, 3, 1))
ox <- order(d$x)
plot(d$x, d$y, pch = 16, col = "#3498db40", cex = 0.6,
xlab = "X", ylab = "P(Y = 1)",
main = "Logit fit")
lines(d$x[ox], fitted(d$fit)[ox], col = "#e74c3c", lwd = 2.5)
lines(d$x[ox], d$p[ox], col = "#27ae60", lwd = 2, lty = 2)
legend("topleft", bty = "n", cex = 0.8,
legend = c("Estimated", "True"),
col = c("#e74c3c", "#27ae60"), lwd = 2, lty = c(1, 2))
})
output$pseudo_compare <- renderPlot({
d <- dat3()
par(mar = c(4.5, 8, 3, 1))
vals <- c(d$mcf, d$cox_snell, d$nagelkerke, d$efron, d$count_r2)
names <- c("McFadden", "Cox-Snell", "Nagelkerke", "Efron", "Count")
cols <- c("#9b59b6", "#3498db", "#e67e22", "#27ae60", "#95a5a6")
bp <- barplot(vals, names.arg = names, col = cols, border = NA,
main = "Pseudo R\u00b2 measures",
ylab = "Value", ylim = c(0, max(1, max(vals) + 0.05)),
las = 2, cex.names = 0.85)
text(bp, vals + 0.02, round(vals, 3), cex = 0.85, font = 2)
})
output$ll_bar <- renderPlot({
d <- dat3()
par(mar = c(4.5, 4.5, 3, 1))
vals <- c(d$ll_null, d$ll_full)
cols <- c("#e74c3c", "#27ae60")
bp <- barplot(vals, names.arg = c("Null\n(intercept only)", "Full\n(with X)"),
col = cols, border = NA, main = "Log-likelihoods",
ylab = "Log-likelihood")
text(bp, vals + abs(min(vals)) * 0.03, round(vals, 1), font = 2, cex = 0.9)
arrows(bp[1], d$ll_null, bp[1], d$ll_full, col = "#2c3e50",
lwd = 2, length = 0.1, code = 3)
text(bp[1] + 0.3, (d$ll_null + d$ll_full) / 2, "Improvement",
cex = 0.8, srt = 90)
})
output$info3 <- renderUI({
d <- dat3()
tags$div(class = "info-box", HTML(paste0(
"<b>McFadden:</b> ", round(d$mcf, 3), "<br>",
"<b>Cox-Snell:</b> ", round(d$cox_snell, 3), "<br>",
"<b>Nagelkerke:</b> ", round(d$nagelkerke, 3), "<br>",
"<b>Efron:</b> ", round(d$efron, 3), "<br>",
"<b>Count:</b> ", round(d$count_r2, 3), "<br>",
"<hr style='margin:8px 0'>",
"<small>McFadden 0.2–0.4 = excellent fit. Don't compare to OLS R\u00b2 values.</small>"
)))
})
}
shinyApp(ui, server)
### The pseudo R-squared family
McFadden — the most common in economics:
\[R^2_{\text{McF}} = 1 - \frac{\ln \hat{L}_{\text{full}}}{\ln \hat{L}_{\text{null}}}\]
Compares log-likelihoods. If your model is no better than the null, the ratio is 1 and \(R^2_{\text{McF}} = 0\). By McFadden’s own rule of thumb, values of 0.2–0.4 indicate an excellent fit — don’t compare them to OLS \(R^2\) magnitudes.
Cox-Snell — based on the likelihood ratio:
\[R^2_{\text{CS}} = 1 - \left(\frac{\hat{L}_{\text{null}}}{\hat{L}_{\text{full}}}\right)^{2/n}\]
Problem: for discrete outcomes its theoretical maximum is \(1 - \hat{L}_{\text{null}}^{2/n}\), which is strictly less than 1 — so even a perfect model cannot reach \(R^2_{\text{CS}} = 1\).
Nagelkerke — rescales Cox-Snell so the maximum is 1:
\[R^2_{\text{N}} = \frac{R^2_{\text{CS}}}{1 - \left(\hat{L}_{\text{null}}\right)^{2/n}}\]
Efron — closest to OLS \(R^2\) intuition:
\[R^2_{\text{E}} = 1 - \frac{\sum(y_i - \hat{p}_i)^2}{\sum(y_i - \bar{y})^2}\]
Uses predicted probabilities directly. This is essentially the complement of the Brier score, normalized by total variance.
Count — just the percent correctly classified. Simple but crude: it ignores how confident the predictions are.
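All five measures can be computed in a few lines from a fitted `glm`, using the formulas above. A self-contained sketch on simulated data (the slope 1.2 and all names are illustrative):

```r
set.seed(123)

n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1.2 * x))   # binary outcome, true logit slope 1.2

fit_full <- glm(y ~ x, family = binomial)
fit_null <- glm(y ~ 1, family = binomial)
ll_full <- as.numeric(logLik(fit_full))
ll_null <- as.numeric(logLik(fit_null))

mcfadden   <- 1 - ll_full / ll_null
cox_snell  <- 1 - exp(-2 / n * (ll_full - ll_null))
nagelkerke <- cox_snell / (1 - exp(2 / n * ll_null))
phat  <- fitted(fit_full)
efron <- 1 - sum((y - phat)^2) / sum((y - mean(y))^2)
count <- mean((phat > 0.5) == y)

round(c(McFadden = mcfadden, CoxSnell = cox_snell,
        Nagelkerke = nagelkerke, Efron = efron, Count = count), 3)
```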
### Things to try
- Slope = 0: all pseudo \(R^2\) measures are near 0. The model has no predictive power. Count \(R^2\) still shows ~50% (coin flip accuracy).
- Increase the slope: all measures climb, but at different rates. McFadden is typically the smallest; Nagelkerke is always \(\geq\) Cox-Snell.
- Watch the log-likelihoods: the gap between null and full widens as the effect gets stronger. McFadden is just the ratio of these two bars.
## The ML connection
Machine learning uses the same quantities but calls them different names:
| Stats | ML | What it is |
|---|---|---|
| \(R^2\) | r2_score | Fraction of variance explained (regression tasks) |
| McFadden pseudo \(R^2\) | — | Normalized version of cross-entropy; ML just uses raw cross-entropy loss |
| Efron pseudo \(R^2\) | 1 - Brier score (normalized) | How close predicted probabilities are to 0/1 |
| Count pseudo \(R^2\) | Accuracy | Percent correctly classified |
| Adjusted \(R^2\) | Cross-validation error | Both penalize complexity; CV is more common in ML |
The deep connection: cross-entropy loss is the negative log-likelihood. So when ML “minimizes cross-entropy,” it’s doing MLE. McFadden pseudo \(R^2\) just compares that loss between the full model and a baseline. ML skips the \(R^2\) wrapper and works with the raw loss directly, because out-of-sample performance (not in-sample summary) is the goal. See Training as MLE for more.
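That equivalence is easy to check directly: the average cross-entropy loss of a logit model is exactly its negative log-likelihood divided by \(n\). A sketch on simulated data (names illustrative):

```r
set.seed(99)

n <- 400
x <- rnorm(n)
y <- rbinom(n, 1, plogis(x))

fit <- glm(y ~ x, family = binomial)
p <- fitted(fit)

# Cross-entropy (log loss), as ML frameworks define it
cross_entropy <- -mean(y * log(p) + (1 - y) * log(1 - p))

# Identical to the negative log-likelihood per observation
all.equal(cross_entropy, -as.numeric(logLik(fit)) / n)  # TRUE
```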
## Did you know?
- Sewall Wright introduced \(R^2\) (or the “coefficient of determination”) in 1921 in a paper on bone measurements in rabbits. It emerged naturally from his path analysis framework — a precursor to modern structural equation modeling.
- Daniel McFadden won the 2000 Nobel Prize in Economics for his work on discrete choice models. His pseudo \(R^2\) became standard because it has a clean information-theoretic interpretation: it measures what fraction of the null model’s uncertainty your predictors resolve.
- The \(R^2\) for the original Mincer (1974) wage equation — regressing log wages on education and experience — was about 0.29. Nearly all wage variation comes from things economists can’t observe. This hasn’t stopped it from being one of the most influential regressions in history.