Measurement Error & Attenuation Bias
The problem
You want to estimate the effect of \(X^*\) on \(Y\):
\[Y_i = \alpha + \beta X_i^* + u_i\]
But you don’t observe \(X^*\) perfectly. Instead you observe \(X\) with noise:
\[X_i = X_i^* + \eta_i, \qquad \eta_i \sim (0, \sigma_\eta^2)\]
where \(\eta\) is measurement error — independent of \(X^*\) and \(u\).
When you run OLS on the mismeasured \(X\), you don’t get \(\beta\). You get:
\[\hat{\beta}_{OLS} \xrightarrow{p} \beta \times \underbrace{\frac{\text{Var}(X^*)}{\text{Var}(X^*) + \sigma_\eta^2}}_{\lambda}\]
That fraction \(\lambda\) is always between 0 and 1. So the estimate is biased toward zero. This is attenuation bias.
Why does it shrink toward zero?
Think of it this way. The measurement error adds random noise to \(X\). From OLS’s perspective, some of the variation in \(X\) is real signal (correlated with \(Y\)) and some is pure noise (uncorrelated with \(Y\)). OLS can’t tell which is which, so it averages over both — diluting the estimated slope.
More noise → more dilution → flatter slope.
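Where does that \(\lambda\) come from? OLS converges to \(\text{Cov}(X, Y)/\text{Var}(X)\). With classical error (\(\eta\) independent of \(X^*\) and \(u\), plus the usual assumption that \(X^*\) is uncorrelated with \(u\)), the covariance picks up only the signal while the variance picks up signal plus noise:
\[\hat{\beta}_{OLS} \xrightarrow{p} \frac{\text{Cov}(X, Y)}{\text{Var}(X)} = \frac{\text{Cov}(X^* + \eta,\ \alpha + \beta X^* + u)}{\text{Var}(X^* + \eta)} = \frac{\beta\,\text{Var}(X^*)}{\text{Var}(X^*) + \sigma_\eta^2} = \beta \lambda\]
The noise inflates the denominator without adding anything to the numerator; that is the dilution in algebraic form.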
Simulation 1: Watch the slope attenuate
Increase the measurement error and watch the estimated slope shrink toward zero. The true relationship stays the same — only the noise changes.
#| standalone: true
#| viewerHeight: 650
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("beta", "True slope (\u03b2):",
min = 0.5, max = 3, value = 1.5, step = 0.1),
sliderInput("n", "Sample size:",
min = 50, max = 500, value = 200, step = 50),
sliderInput("sigma_x", "SD of true X*:",
min = 1, max = 5, value = 2, step = 0.5),
sliderInput("sigma_u", "SD of outcome noise (u):",
min = 0.5, max = 3, value = 1, step = 0.25),
sliderInput("sigma_eta", "Measurement error (\u03c3\u03b7):",
min = 0, max = 5, value = 0, step = 0.25),
uiOutput("results")
),
mainPanel(
width = 9,
plotOutput("scatter", height = "550px")
)
)
)
server <- function(input, output, session) {
# True data — only regenerates when beta, n, sigma_x, sigma_u change
base <- reactive({
n <- input$n
b <- input$beta
sx <- input$sigma_x
su <- input$sigma_u
x_star <- rnorm(n, 0, sx)
u <- rnorm(n, 0, su)
y <- 2 + b * x_star + u
list(x_star = x_star, y = y, beta = b, sx = sx)
})
# Measurement error applied on top — changes when eta slider moves
sim <- reactive({
d <- base()
se <- input$sigma_eta
eta <- rnorm(length(d$x_star), 0, se)
x_obs <- d$x_star + eta
fit_true <- lm(d$y ~ d$x_star)
fit_obs <- lm(d$y ~ x_obs)
lambda <- d$sx^2 / (d$sx^2 + se^2)
list(x_star = d$x_star, x_obs = x_obs, y = d$y,
b_true = coef(fit_true)[2], b_obs = coef(fit_obs)[2],
beta = d$beta, lambda = lambda, sigma_eta = se)
})
output$scatter <- renderPlot({
d <- sim()
par(mfrow = c(1, 2), mar = c(4.5, 4.5, 3.5, 1))
# Left: true X*
plot(d$x_star, d$y, pch = 16, cex = 0.6,
col = adjustcolor("#3498db", 0.5),
xlab = "True X*", ylab = "Y",
main = "Regression on true X*")
abline(lm(d$y ~ d$x_star), col = "#27ae60", lwd = 3)
mtext(paste0("Slope = ", round(d$b_true, 3)),
side = 3, line = 0, cex = 1.1, font = 2, col = "#27ae60")
# Right: observed X with error
plot(d$x_obs, d$y, pch = 16, cex = 0.6,
col = adjustcolor("#e74c3c", 0.4),
xlab = "Observed X (with error)", ylab = "Y",
main = "Regression on mismeasured X")
abline(lm(d$y ~ d$x_obs), col = "#e74c3c", lwd = 3)
abline(a = coef(lm(d$y ~ d$x_star))[1],
b = d$beta, col = "#27ae60", lwd = 2, lty = 2)
mtext(paste0("Slope = ", round(d$b_obs, 3),
" (true = ", d$beta, ")"),
side = 3, line = 0, cex = 1.1, font = 2, col = "#e74c3c")
legend("topleft", bty = "n", cex = 0.85,
legend = c("OLS on mismeasured X", "True slope"),
col = c("#e74c3c", "#27ae60"), lwd = c(3, 2), lty = c(1, 2))
})
output$results <- renderUI({
d <- sim()
tags$div(class = "stats-box",
HTML(paste0(
"<b>True \u03b2:</b> ", d$beta, "<br>",
"<b>OLS on X*:</b> ", round(d$b_true, 3), "<br>",
"<b>OLS on X:</b> ", round(d$b_obs, 3), "<br>",
"<hr style='margin:8px 0'>",
"<b>Attenuation factor (\u03bb):</b><br>",
"Var(X*) / [Var(X*) + Var(\u03b7)]<br>",
"= ", round(d$lambda, 3), "<br>",
"<b>\u03b2 \u00d7 \u03bb = </b>",
round(d$beta * d$lambda, 3)
))
)
})
}
shinyApp(ui, server)
Things to try
- Start with \(\sigma_\eta = 0\): both panels are identical. No measurement error, no bias.
- Slowly increase \(\sigma_\eta\): watch the right panel's slope flatten. The cloud of points spreads horizontally (noise in X), so OLS "sees" a weaker relationship.
- Set \(\sigma_\eta = 5\) with SD of X* = 2: the attenuation factor drops to \(\lambda = 4/29 \approx 0.14\), so your estimate is roughly 86% too small.
- Increase n: the slope doesn't recover! Attenuation bias is not a small-sample problem; it persists no matter how much data you have. More data just gives you a more precise estimate of the wrong number (the sketch below checks this).
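If you want to check that last point outside the app, here is a minimal base-R sketch; the parameter values (\(\beta = 1.5\), SD of X* = 2, \(\sigma_\eta = 2\)) are just illustrative:

```r
set.seed(42)

beta <- 1.5; sigma_x <- 2; sigma_eta <- 2
lambda <- sigma_x^2 / (sigma_x^2 + sigma_eta^2)   # theoretical attenuation factor = 0.5

for (n in c(100, 10000, 500000)) {
  x_star <- rnorm(n, 0, sigma_x)                  # true regressor
  y      <- 2 + beta * x_star + rnorm(n)          # outcome
  x_obs  <- x_star + rnorm(n, 0, sigma_eta)       # mismeasured regressor
  cat(sprintf("n = %6d   slope on observed X = %.3f   (beta * lambda = %.3f)\n",
              n, coef(lm(y ~ x_obs))[2], beta * lambda))
}
# The slope settles at beta * lambda = 0.75, not at beta = 1.5,
# no matter how large n gets.
```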
Simulation 2: Attenuation is systematic, not just noisy
Run 500 regressions, each with fresh measurement error. The distribution of slope estimates is centered below the true \(\beta\) — it’s not random noise, it’s a systematic downward bias.
#| standalone: true
#| viewerHeight: 550
library(shiny)
ui <- fluidPage(
tags$head(tags$style(HTML("
.stats-box {
background: #f0f4f8; border-radius: 6px; padding: 14px;
margin-top: 12px; font-size: 14px; line-height: 1.9;
}
.stats-box b { color: #2c3e50; }
"))),
sidebarLayout(
sidebarPanel(
width = 3,
sliderInput("beta2", "True slope (\u03b2):",
min = 0.5, max = 3, value = 1.5, step = 0.1),
sliderInput("n2", "Sample size per run:",
min = 50, max = 300, value = 100, step = 50),
sliderInput("sigma_eta2", "Measurement error (\u03c3\u03b7):",
min = 0, max = 5, value = 2, step = 0.25),
sliderInput("n_sims", "Number of simulations:",
min = 100, max = 1000, value = 500, step = 100),
uiOutput("results2")
),
mainPanel(
width = 9,
plotOutput("mc_plot", height = "450px")
)
)
)
server <- function(input, output, session) {
sim <- reactive({
n <- input$n2
b <- input$beta2
se <- input$sigma_eta2
nsims <- input$n_sims
sx <- 2
betas_true <- numeric(nsims)
betas_obs <- numeric(nsims)
for (i in seq_len(nsims)) {
x_star <- rnorm(n, 0, sx)
u <- rnorm(n)
y <- 2 + b * x_star + u
eta <- rnorm(n, 0, se)
x_obs <- x_star + eta
betas_true[i] <- coef(lm(y ~ x_star))[2]
betas_obs[i] <- coef(lm(y ~ x_obs))[2]
}
lambda <- sx^2 / (sx^2 + se^2)
list(betas_true = betas_true, betas_obs = betas_obs,
beta = b, lambda = lambda)
})
output$mc_plot <- renderPlot({
d <- sim()
par(mar = c(4.5, 4.5, 3, 1))
all_b <- c(d$betas_true, d$betas_obs)
xlim <- range(all_b) + c(-0.1, 0.1)
# No-error distribution
hist(d$betas_true, breaks = 40,
col = adjustcolor("#27ae60", 0.5), border = "white",
main = "Distribution of slope estimates across simulations",
xlab = expression(hat(beta)), xlim = xlim,
freq = FALSE, cex.main = 1.3)
# With-error distribution
hist(d$betas_obs, breaks = 40,
col = adjustcolor("#e74c3c", 0.45), border = "white",
add = TRUE, freq = FALSE)
abline(v = d$beta, col = "#2c3e50", lwd = 2.5, lty = 2)
abline(v = d$beta * d$lambda, col = "#e74c3c", lwd = 2, lty = 3)
abline(v = mean(d$betas_obs), col = "#e74c3c", lwd = 1.5)
legend("topright", bty = "n", cex = 0.9,
legend = c(
paste0("No error (centered at \u03b2 = ", d$beta, ")"),
paste0("With error (centered at ", round(mean(d$betas_obs), 3), ")"),
paste0("Theory: \u03b2\u03bb = ", round(d$beta * d$lambda, 3))
),
fill = c(adjustcolor("#27ae60", 0.5),
adjustcolor("#e74c3c", 0.45), NA),
border = c("white", "white", NA),
col = c(NA, NA, "#e74c3c"),
lwd = c(NA, NA, 2), lty = c(NA, NA, 3))
})
output$results2 <- renderUI({
d <- sim()
tags$div(class = "stats-box",
HTML(paste0(
"<b>True \u03b2:</b> ", d$beta, "<br>",
"<b>Avg estimate (no error):</b> ",
round(mean(d$betas_true), 3), "<br>",
"<b>Avg estimate (with error):</b> ",
round(mean(d$betas_obs), 3), "<br>",
"<hr style='margin:8px 0'>",
"<b>Attenuation factor:</b> ", round(d$lambda, 3), "<br>",
"<b>\u03b2 \u00d7 \u03bb:</b> ",
round(d$beta * d$lambda, 3), "<br>",
"<small>Bias: ",
round(mean(d$betas_obs) - d$beta, 3), "</small>"
))
)
})
}
shinyApp(ui, server)
Things to try
- \(\sigma_\eta = 0\): both distributions overlap perfectly, no bias at all.
- \(\sigma_\eta = 2\): the red distribution shifts left. The average slope is systematically below the truth.
- Increase n to 300: the distributions get narrower (more precise), but the red one stays centered at the wrong value. Attenuation bias doesn't go away with more data.
- Compare theory vs. simulation: the theoretical \(\beta\lambda\) should closely match the average of the red distribution (the sketch below replays this check outside the app).
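The same check without the app: a stripped-down Monte Carlo in base R (illustrative parameter values) whose average estimate should land on \(\beta\lambda\) rather than \(\beta\):

```r
set.seed(1)

beta <- 1.5; n <- 100; sigma_x <- 2; sigma_eta <- 2
lambda <- sigma_x^2 / (sigma_x^2 + sigma_eta^2)

# One slope estimate from a fresh sample with fresh measurement error
one_run <- function() {
  x_star <- rnorm(n, 0, sigma_x)
  y      <- 2 + beta * x_star + rnorm(n)
  x_obs  <- x_star + rnorm(n, 0, sigma_eta)
  coef(lm(y ~ x_obs))[2]
}

estimates <- replicate(500, one_run())
mean(estimates)    # close to beta * lambda = 0.75
beta * lambda      # 0.75, well below the true beta = 1.5
```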
Measurement error in Y vs X
A crucial asymmetry:
| | Error in X | Error in Y |
|---|---|---|
| Bias? | Yes, toward zero | No bias |
| Precision? | Slightly worse | Worse (larger SEs) |
| Goes away with more data? | No | SEs shrink, but that's just precision |
Why the asymmetry? When \(Y\) is measured with error, the noise goes into the residual — it’s just more \(u\). The slope is unbiased; you just estimate it less precisely. When \(X\) is measured with error, the noise is in the regressor, which contaminates the covariance between \(X\) and \(Y\) and biases the slope.
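A quick way to see the asymmetry yourself; this is a minimal sketch in base R with arbitrary parameter values, not part of the apps above:

```r
set.seed(7)

beta <- 1.5; n <- 5000
x_star <- rnorm(n, 0, 2)
y      <- 2 + beta * x_star + rnorm(n)

# Error in Y only: slope stays near beta = 1.5, the standard error just grows
y_noisy <- y + rnorm(n, 0, 3)
summary(lm(y_noisy ~ x_star))$coefficients[2, 1:2]   # Estimate, Std. Error

# Error in X only: slope is pulled toward zero (here roughly halved)
x_noisy <- x_star + rnorm(n, 0, 2)
summary(lm(y ~ x_noisy))$coefficients[2, 1:2]
```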
What can you do about it?
- Measure better. The best fix is reducing \(\sigma_\eta\): use validated measurement instruments, take multiple measurements, average repeated measures.
- Instrumental variables (IV). Find a variable \(Z\) that predicts \(X^*\) but isn't contaminated by the measurement error (or by \(u\)). Two-stage least squares then recovers the true \(\beta\); a second, independently mismeasured report of \(X^*\) is a classic choice of instrument.
- Reliability ratio correction. OLS with measurement error shrinks the coefficient by the reliability ratio \(\lambda\), the share of the observed variance in \(X\) that is true signal:
\[\hat{\beta}_{OLS} \xrightarrow{p} \beta \cdot \underbrace{\frac{\text{Var}(X^*)}{\text{Var}(X^*) + \text{Var}(\eta)}}_{\lambda}\]
The reliability ratio is always between 0 and 1. High noise (\(\text{Var}(\eta) \gg \text{Var}(X^*)\)) pushes \(\lambda\) toward 0 and crushes the coefficient toward zero; low noise pushes \(\lambda\) toward 1 and leaves barely any attenuation. If you know \(\lambda\) (say, from a validation study), you can correct: \(\hat{\beta}_{corrected} = \hat{\beta}_{OLS} / \lambda\).
- Multiple indicators. If you have two noisy measures of \(X^*\) with independent errors, their covariance identifies \(\text{Var}(X^*)\), letting you estimate \(\lambda\) directly. The sketch after this list works through the last two ideas.
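To make the last two ideas concrete, here is a hedged sketch in base R. The parameter values are made up, and `x1` and `x2` stand for two independently mismeasured reports of the same \(X^*\): their covariance estimates \(\text{Var}(X^*)\), which gives an estimated \(\lambda\) and a corrected slope, while using `x2` as an instrument for `x1` gives essentially the same answer.

```r
set.seed(123)

beta <- 1.5; n <- 5000
x_star <- rnorm(n, 0, 2)
y  <- 2 + beta * x_star + rnorm(n)
x1 <- x_star + rnorm(n, 0, 2)     # first noisy measure (the regressor we actually use)
x2 <- x_star + rnorm(n, 0, 2)     # second noisy measure, independent error

b_naive <- unname(coef(lm(y ~ x1))[2])    # attenuated estimate

# Reliability-ratio correction: cov(x1, x2) estimates Var(X*)
lambda_hat  <- cov(x1, x2) / var(x1)
b_corrected <- b_naive / lambda_hat

# IV: use x2 as an instrument for x1 (ratio form of 2SLS with one instrument)
b_iv <- cov(x2, y) / cov(x2, x1)

round(c(naive = b_naive, corrected = b_corrected, iv = b_iv), 3)
# naive sits near beta * lambda = 0.75; corrected and iv sit near the true beta = 1.5
```

Both estimators end up dividing a sample covariance with `y` by `cov(x1, x2)`, so they agree asymptotically; in finite samples they differ slightly.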
Did you know?
The attenuation bias formula was derived by Karl Pearson in the early 1900s, making it one of the oldest results in regression theory. Pearson was studying the relationship between fathers’ and sons’ heights and realized that imprecise measurements of height would make the hereditary correlation look weaker than it really was.
In economics, Jerry Hausman (2001) showed that measurement error in survey data on income and consumption can attenuate elasticity estimates by 30–50%. Studies using administrative tax records (with near-zero measurement error) consistently find larger effects than survey-based studies — exactly what attenuation bias predicts.
The errors-in-variables literature distinguishes between classical measurement error (what we covered: \(X = X^* + \eta\) with \(\eta\) independent of \(X^*\)) and non-classical error (where the error depends on the true value). Non-classical error can bias in either direction, not just toward zero. Mean-reverting error in test scores is a common example.
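As a toy illustration of that last point (not from the studies above; the mean-reversion strength `delta` and the noise levels are made up): with mean-reverting error, high true values are under-reported and low ones over-reported, and the estimated slope can be pushed above the truth rather than below it.

```r
set.seed(99)

beta <- 1.5; n <- 100000; delta <- 0.4
x_star <- rnorm(n, 0, 2)
y <- 2 + beta * x_star + rnorm(n)

# Classical error: independent noise, slope attenuated toward zero
x_classical <- x_star + rnorm(n, 0, 2)
coef(lm(y ~ x_classical))[2]    # about 0.75, below beta = 1.5

# Mean-reverting (non-classical) error: the error is negatively correlated
# with the true value, X = (1 - delta) * X* + noise
x_meanrev <- (1 - delta) * x_star + rnorm(n, 0, 0.5)
coef(lm(y ~ x_meanrev))[2]      # about 2.1, biased away from zero
```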