Homoskedasticity & Heteroskedasticity
What do these words mean?
- Homoskedasticity (homo = same, skedasis = spread): the variance of the errors is constant across all values of \(X\).
- Heteroskedasticity (hetero = different): the variance changes with \(X\).
Think of income vs. age. Young people’s incomes are clustered (entry-level jobs). But at age 50, some people earn $40k and others earn $500k. The spread of income increases with age — that’s heteroskedasticity.
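To see the two cases side by side, here is a minimal base-R sketch. It uses the same data-generating process as the interactive app below; the numbers and names are illustrative.

```r
# Simulate the two error structures: constant spread vs. spread that grows with x
set.seed(1)
n <- 200
x <- runif(n, 0.5, 10)
y_homo   <- 2 + 1.5 * x + rnorm(n, sd = 2)        # same sd everywhere
y_hetero <- 2 + 1.5 * x + rnorm(n, sd = 0.3 * x)  # sd proportional to x: the "fan"

par(mfrow = c(1, 2))
plot(x, y_homo,   pch = 16, main = "Homoskedastic")
plot(x, y_hetero, pch = 16, main = "Heteroskedastic (fan)")
```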
Why does it matter?
OLS is still unbiased under heteroskedasticity — the coefficient estimates are fine. But the standard errors are wrong. And wrong standard errors mean wrong confidence intervals, wrong t-statistics, wrong p-values. You might declare something significant when it isn’t (or miss something real).
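Why exactly? The usual OLS formula assumes \(\operatorname{Var}(\varepsilon_i) = \sigma^2\) for every observation and estimates \(\widehat{\operatorname{Var}}(\hat\beta) = \hat\sigma^2 (X'X)^{-1}\). When the variance differs across observations, the correct expression is

\[
\operatorname{Var}(\hat\beta) = (X'X)^{-1} \, X' \Omega X \, (X'X)^{-1},
\qquad \Omega = \operatorname{diag}(\sigma_1^2, \dots, \sigma_n^2),
\]

which is where the "sandwich" name comes from: \((X'X)^{-1}\) is the bread, \(X' \Omega X\) is the meat. Robust estimators plug squared residuals in for the unknown \(\sigma_i^2\).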
The fix: use robust standard errors (heteroskedasticity-consistent, or "sandwich," standard errors) that don't assume constant variance.
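In R, the sandwich and lmtest packages do this in a couple of lines. A minimal sketch, with simulated data mirroring the fan-shaped case from the app below:

```r
library(sandwich)  # heteroskedasticity-consistent covariance estimators
library(lmtest)    # coeftest() for inference with a custom vcov

set.seed(1)
n <- 200
x <- runif(n, 0.5, 10)
y <- 2 + 1.5 * x + rnorm(n, sd = 0.3 * x)  # error sd grows with x (the "fan")
fit <- lm(y ~ x)

coeftest(fit)                                    # classical SEs (assume constant variance)
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))  # robust (sandwich) SEs
```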
```{shinylive-r}
#| standalone: true
#| viewerHeight: 620
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
.good { color: #27ae60; font-weight: bold; }
.bad { color: #e74c3c; font-weight: bold; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
selectInput("type", "Error structure:",
choices = c("Homoskedastic",
"Fan-shaped (variance grows with X)",
"U-shaped variance",
"Group heteroskedasticity")),
sliderInput("n", "Sample size:",
min = 50, max = 500, value = 200, step = 50),
sliderInput("b1", HTML("True β<sub>1</sub>:"),
min = 0, max = 3, value = 1.5, step = 0.25),
actionButton("go", "New draw", class = "btn-primary", width = "100%"),
uiOutput("results")
),
mainPanel(
width = 9,
fluidRow(
column(4, plotOutput("scatter", height = "380px")),
column(4, plotOutput("resid_plot", height = "380px")),
column(4, plotOutput("ci_plot", height = "380px"))
)
)
)
)
server <- function(input, output, session) {
dat <- reactive({
input$go
n <- input$n
b1 <- input$b1
type <- input$type
x <- runif(n, 0.5, 10)
if (type == "Homoskedastic") {
eps <- rnorm(n, sd = 2)
} else if (type == "Fan-shaped (variance grows with X)") {
eps <- rnorm(n, sd = 0.3 * x)
} else if (type == "U-shaped variance") {
eps <- rnorm(n, sd = 0.5 + 0.8 * abs(x - 5))
} else {
eps <- rnorm(n, sd = ifelse(x > 5, 4, 0.8))
}
y <- 2 + b1 * x + eps
fit <- lm(y ~ x)
# OLS SE
ols_se <- summary(fit)$coefficients[2, 2]
# HC robust SE (manual HC1)
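# bread = (X'X)^-1, meat = X' diag(e_i^2) X; the n/(n - 2) factor is
# HC1's small-sample correction (k = 2 estimated coefficients)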
r <- resid(fit)
X <- cbind(1, x)
bread <- solve(t(X) %*% X)
meat <- t(X) %*% diag(r^2 * n / (n - 2)) %*% X
robust_vcov <- bread %*% meat %*% bread
robust_se <- sqrt(robust_vcov[2, 2])
# Simulation: repeat 500 times, see how often CI covers
covers_ols <- 0
covers_robust <- 0
reps <- 500
for (i in seq_len(reps)) {
xx <- runif(n, 0.5, 10)
if (type == "Homoskedastic") {
ee <- rnorm(n, sd = 2)
} else if (type == "Fan-shaped (variance grows with X)") {
ee <- rnorm(n, sd = 0.3 * xx)
} else if (type == "U-shaped variance") {
ee <- rnorm(n, sd = 0.5 + 0.8 * abs(xx - 5))
} else {
ee <- rnorm(n, sd = ifelse(xx > 5, 4, 0.8))
}
yy <- 2 + b1 * xx + ee
ff <- lm(yy ~ xx)
b_hat <- coef(ff)[2]
ols_s <- summary(ff)$coefficients[2, 2]
rr <- resid(ff)
XX <- cbind(1, xx)
br <- solve(t(XX) %*% XX)
mt <- t(XX) %*% diag(rr^2 * n / (n - 2)) %*% XX
rob_s <- sqrt((br %*% mt %*% br)[2, 2])
if (b_hat - 1.96 * ols_s <= b1 && b1 <= b_hat + 1.96 * ols_s)
covers_ols <- covers_ols + 1
if (b_hat - 1.96 * rob_s <= b1 && b1 <= b_hat + 1.96 * rob_s)
covers_robust <- covers_robust + 1
}
list(x = x, y = y, fit = fit, type = type, b1 = b1,
ols_se = ols_se, robust_se = robust_se,
cover_ols = covers_ols / reps,
cover_robust = covers_robust / reps)
})
output$scatter <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
plot(d$x, d$y, pch = 16, col = "#3498db60", cex = 0.7,
xlab = "X", ylab = "Y", main = "Data + OLS fit")
abline(d$fit, col = "#e74c3c", lwd = 2.5)
})
output$resid_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 4.5, 3, 1))
r <- resid(d$fit)
fv <- fitted(d$fit)
plot(fv, r, pch = 16, col = "#9b59b660", cex = 0.7,
xlab = "Fitted values", ylab = "Residuals",
main = "Residuals vs Fitted")
abline(h = 0, lty = 2, col = "gray40", lwd = 1.5)
# Envelope: +/- loess fit of |residuals|, showing how spread varies with fitted values
lo_abs <- loess(abs(r) ~ fv)
ox <- order(fv)
env <- predict(lo_abs)[ox]
lines(fv[ox], env, col = "#e67e22", lwd = 2, lty = 1)
lines(fv[ox], -env, col = "#e67e22", lwd = 2, lty = 1)
})
output$ci_plot <- renderPlot({
d <- dat()
par(mar = c(4.5, 6, 3, 1))
b_hat <- coef(d$fit)[2]
ci_ols <- c(b_hat - 1.96 * d$ols_se, b_hat + 1.96 * d$ols_se)
ci_rob <- c(b_hat - 1.96 * d$robust_se, b_hat + 1.96 * d$robust_se)
xlim <- range(c(ci_ols, ci_rob, d$b1)) + c(-0.3, 0.3)
plot(NULL, xlim = xlim, ylim = c(0.5, 2.5),
yaxt = "n", ylab = "", xlab = expression(beta[1]),
main = "95% Confidence Intervals")
axis(2, at = 1:2, labels = c("OLS SE", "Robust SE"), las = 1, cex.axis = 0.85)
segments(ci_ols[1], 1, ci_ols[2], 1, lwd = 4, col = "#e74c3c")
points(b_hat, 1, pch = 19, cex = 1.5, col = "#e74c3c")
segments(ci_rob[1], 2, ci_rob[2], 2, lwd = 4, col = "#27ae60")
points(b_hat, 2, pch = 19, cex = 1.5, col = "#27ae60")
abline(v = d$b1, lty = 2, lwd = 2, col = "#2c3e50")
text(d$b1, 2.4, expression("True " * beta[1]), cex = 0.9)
})
output$results <- renderUI({
d <- dat()
tags$div(class = "stats-box",
HTML(paste0(
"<b>OLS SE:</b> ", round(d$ols_se, 4), "<br>",
"<b>Robust SE:</b> ", round(d$robust_se, 4), "<br>",
"<hr style='margin:8px 0'>",
"<b>OLS CI coverage:</b> <span class='",
ifelse(abs(d$cover_ols - 0.95) > 0.03, "bad", "good"), "'>",
round(d$cover_ols * 100, 1), "%</span><br>",
"<b>Robust CI coverage:</b> <span class='good'>",
round(d$cover_robust * 100, 1), "%</span><br>",
"<small>Target: 95%. Over 500 simulations.</small>"
))
)
})
}
shinyApp(ui, server)
```
Things to try
- Homoskedastic: both SEs are similar, both CIs have ~95% coverage. Constant variance = OLS SEs are fine.
- Fan-shaped: the residual plot shows a clear funnel. OLS SEs are wrong — coverage drops below 95%. Robust SEs fix it (the sketch just after this list replays that coverage experiment outside the app).
- Group heteroskedasticity: one group (X > 5) is much noisier. Look at the residual plot — the right side has wider scatter. OLS pretends the variance is the same everywhere.
- Compare the right panel: under heteroskedasticity, the OLS CI (red) is often the wrong width. The robust CI (green) adjusts.
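If you'd rather replay the coverage experiment outside the app, here is a condensed version of the same simulation, using sandwich::vcovHC in place of the manual computation. Exact coverage rates will vary from run to run.

```r
library(sandwich)

set.seed(1)
n <- 200; b1 <- 1.5; reps <- 500
covers_ols <- covers_rob <- 0
for (i in seq_len(reps)) {
  x <- runif(n, 0.5, 10)
  y <- 2 + b1 * x + rnorm(n, sd = 0.3 * x)  # fan-shaped heteroskedasticity
  fit <- lm(y ~ x)
  b_hat  <- coef(fit)[2]
  se_ols <- summary(fit)$coefficients[2, 2]
  se_rob <- sqrt(vcovHC(fit, type = "HC1")[2, 2])
  covers_ols <- covers_ols + (abs(b_hat - b1) <= 1.96 * se_ols)
  covers_rob <- covers_rob + (abs(b_hat - b1) <= 1.96 * se_rob)
}
c(ols = covers_ols / reps, robust = covers_rob / reps)  # OLS typically lands below 0.95
```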
The practical rule
In applied work, always use robust standard errors (or clustered SEs). They are valid whether or not heteroskedasticity is present, and if the errors happen to be homoskedastic, robust SEs give you essentially the same answer anyway, so there's no real downside in samples of any reasonable size.
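A quick sanity check of that claim, again a sketch with sandwich: the data below are simulated with constant variance, so the two SEs should nearly agree.

```r
library(sandwich)

set.seed(1)
n <- 200
x <- runif(n, 0.5, 10)
y <- 2 + 1.5 * x + rnorm(n, sd = 2)  # constant error variance (homoskedastic)
fit <- lm(y ~ x)
se_classical <- summary(fit)$coefficients[2, 2]
se_robust    <- sqrt(vcovHC(fit, type = "HC1")[2, 2])
round(c(classical = se_classical, robust = se_robust), 4)  # nearly identical
```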
| Software | How to get robust SEs |
|---|---|
| R | `lmtest::coeftest(fit, vcov = sandwich::vcovHC)` |
| Stata | `reg y x, robust` |
| Python | `sm.OLS(y, X).fit(cov_type='HC3')` |
Did you know?
- Nobody can agree on how to spell it. “Heteroskedasticity” (with a k) is the original Greek-derived spelling (skedasis = scattering). “Heteroscedasticity” (with a c) is the Latinized version. Both are correct. Econometricians tend to use k; other fields use c. The important thing is that your standard errors are right, not your spelling.
- Halbert White published his famous robust standard error estimator (HC0) in 1980. It was so influential that “White standard errors” became the default in applied economics. The paper has over 30,000 citations.
- The HC in HC0, HC1, HC2, HC3 stands for "heteroskedasticity-consistent." HC3 is generally preferred in small samples because it divides each squared residual by \((1 - h_{ii})^2\), which inflates the contribution of high-leverage observations (see the sketch after this list).
- Fun debugging tip: if your robust SEs are much larger than your OLS SEs, you likely have heteroskedasticity. If they’re much smaller, something is probably wrong with your data.
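For the curious, here is a hand-rolled sketch of the whole HC family for a simple regression. hatvalues() returns the leverages \(h_{ii}\), and the results should match sandwich::vcovHC(fit, type = ...) up to rounding.

```r
set.seed(1)
n <- 200
x <- runif(n, 0.5, 10)
y <- 2 + 1.5 * x + rnorm(n, sd = 0.3 * x)
fit <- lm(y ~ x)
e <- resid(fit)
h <- hatvalues(fit)   # leverages h_ii
k <- 2                # number of estimated coefficients
X <- cbind(1, x)
bread <- solve(t(X) %*% X)
hc_se <- function(w) {               # w = weights on squared residuals
  meat <- t(X) %*% (w * X)           # same as t(X) %*% diag(w) %*% X
  sqrt((bread %*% meat %*% bread)[2, 2])
}
c(HC0 = hc_se(e^2),                  # raw squared residuals
  HC1 = hc_se(e^2 * n / (n - k)),    # degrees-of-freedom scaling
  HC2 = hc_se(e^2 / (1 - h)),        # leverage adjustment
  HC3 = hc_se(e^2 / (1 - h)^2))      # stronger leverage adjustment
```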