Measurement Error & Attenuation Bias

The problem

You want to estimate the effect of \(X^*\) on \(Y\):

\[Y_i = \alpha + \beta X_i^* + u_i\]

But you don’t observe \(X^*\) perfectly. Instead you observe \(X\) with noise:

\[X_i = X_i^* + \eta_i, \qquad \eta_i \sim (0, \sigma_\eta^2)\]

where \(\eta\) is measurement error — independent of \(X^*\) and \(u\).

When you run OLS on the mismeasured \(X\), you don’t get \(\beta\). You get:

\[\hat{\beta}_{OLS} \xrightarrow{p} \beta \times \underbrace{\frac{\text{Var}(X^*)}{\text{Var}(X^*) + \sigma_\eta^2}}_{\lambda}\]

That fraction \(\lambda\) is always between 0 and 1, and it equals 1 only when there is no measurement error. So the estimate is biased toward zero. This is attenuation bias.
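
A quick way to check the formula is a one-off simulation. This is a minimal base-R sketch, separate from the interactive apps below; the sample size and parameter values are purely illustrative:

# One-off check of the attenuation formula: regress Y on the true X*
# and on a noisy version, then compare the noisy slope with beta * lambda.
set.seed(1)
n         <- 1e5
beta      <- 1.5
sigma_x   <- 2      # SD of the true regressor X*
sigma_eta <- 1.5    # SD of the measurement error

x_star <- rnorm(n, 0, sigma_x)
y      <- 2 + beta * x_star + rnorm(n)       # outcome noise u ~ N(0, 1)
x_obs  <- x_star + rnorm(n, 0, sigma_eta)    # mismeasured regressor

lambda <- sigma_x^2 / (sigma_x^2 + sigma_eta^2)   # theoretical attenuation factor

coef(lm(y ~ x_star))[2]   # close to 1.5 (no bias)
coef(lm(y ~ x_obs))[2]    # close to beta * lambda
beta * lambda             # 1.5 * 0.64 = 0.96

With \(\sigma_{X^*} = 2\) and \(\sigma_\eta = 1.5\), \(\lambda = 4/6.25 = 0.64\), so the slope on the noisy regressor converges to roughly \(1.5 \times 0.64 = 0.96\) even though the true slope is 1.5.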

Why does it shrink toward zero?

Think of it this way. The measurement error adds random noise to \(X\). From OLS’s perspective, some of the variation in \(X\) is real signal (correlated with \(Y\)) and some is pure noise (uncorrelated with \(Y\)). OLS can’t tell which is which, so it averages over both — diluting the estimated slope.

More noise → more dilution → flatter slope.

The Oracle View. In these simulations, we know the true \(X^*\) and we set \(\sigma_\eta\) (the measurement error). We can compare the regression on true \(X^*\) vs mismeasured \(X\) side by side. In practice, you only observe \(X\) — you never see \(X^*\) and you don’t know how much noise \(\eta\) adds. That’s what makes measurement error so insidious: you can’t tell it’s there just by looking at your data.


Simulation 1: Watch the slope attenuate

Increase the measurement error and watch the estimated slope shrink toward zero. The true relationship stays the same — only the noise changes.

#| standalone: true
#| viewerHeight: 650

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("beta", "True slope (\u03b2):",
                  min = 0.5, max = 3, value = 1.5, step = 0.1),

      sliderInput("n", "Sample size:",
                  min = 50, max = 500, value = 200, step = 50),

      sliderInput("sigma_x", "SD of true X*:",
                  min = 1, max = 5, value = 2, step = 0.5),

      sliderInput("sigma_u", "SD of outcome noise (u):",
                  min = 0.5, max = 3, value = 1, step = 0.25),

      sliderInput("sigma_eta", "Measurement error (\u03c3\u03b7):",
                  min = 0, max = 5, value = 0, step = 0.25),

      uiOutput("results")
    ),

    mainPanel(
      width = 9,
      plotOutput("scatter", height = "550px")
    )
  )
)

server <- function(input, output, session) {

  # True data — only regenerates when beta, n, sigma_x, sigma_u change
  base <- reactive({
    n   <- input$n
    b   <- input$beta
    sx  <- input$sigma_x
    su  <- input$sigma_u

    x_star <- rnorm(n, 0, sx)
    u      <- rnorm(n, 0, su)
    y      <- 2 + b * x_star + u

    list(x_star = x_star, y = y, beta = b, sx = sx)
  })

  # Measurement error applied on top — changes when eta slider moves
  sim <- reactive({
    d   <- base()
    se  <- input$sigma_eta

    eta   <- rnorm(length(d$x_star), 0, se)
    x_obs <- d$x_star + eta

    fit_true <- lm(d$y ~ d$x_star)
    fit_obs  <- lm(d$y ~ x_obs)

    lambda <- d$sx^2 / (d$sx^2 + se^2)

    list(x_star = d$x_star, x_obs = x_obs, y = d$y,
         b_true = coef(fit_true)[2], b_obs = coef(fit_obs)[2],
         beta = d$beta, lambda = lambda, sigma_eta = se)
  })

  output$scatter <- renderPlot({
    d <- sim()

    par(mfrow = c(1, 2), mar = c(4.5, 4.5, 3.5, 1))

    # Left: true X*
    plot(d$x_star, d$y, pch = 16, cex = 0.6,
         col = adjustcolor("#3498db", 0.5),
         xlab = "True X*", ylab = "Y",
         main = "Regression on true X*")
    abline(lm(d$y ~ d$x_star), col = "#27ae60", lwd = 3)
    mtext(paste0("Slope = ", round(d$b_true, 3)),
          side = 3, line = 0, cex = 1.1, font = 2, col = "#27ae60")

    # Right: observed X with error
    plot(d$x_obs, d$y, pch = 16, cex = 0.6,
         col = adjustcolor("#e74c3c", 0.4),
         xlab = "Observed X (with error)", ylab = "Y",
         main = "Regression on mismeasured X")
    abline(lm(d$y ~ d$x_obs), col = "#e74c3c", lwd = 3)
    abline(a = coef(lm(d$y ~ d$x_star))[1],
           b = d$beta, col = "#27ae60", lwd = 2, lty = 2)
    mtext(paste0("Slope = ", round(d$b_obs, 3),
                 "  (true = ", d$beta, ")"),
          side = 3, line = 0, cex = 1.1, font = 2, col = "#e74c3c")
    legend("topleft", bty = "n", cex = 0.85,
           legend = c("OLS on mismeasured X", "True slope"),
           col = c("#e74c3c", "#27ae60"), lwd = c(3, 2), lty = c(1, 2))
  })

  output$results <- renderUI({
    d <- sim()

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>True \u03b2:</b> ", d$beta, "<br>",
        "<b>OLS on X*:</b> ", round(d$b_true, 3), "<br>",
        "<b>OLS on X:</b> ", round(d$b_obs, 3), "<br>",
        "<hr style='margin:8px 0'>",
        "<b>Attenuation factor (\u03bb):</b><br>",
        "Var(X*) / [Var(X*) + Var(\u03b7)]<br>",
        "= ", round(d$lambda, 3), "<br>",
        "<b>\u03b2 \u00d7 \u03bb = </b>",
        round(d$beta * d$lambda, 3)
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • Start with \(\sigma_\eta = 0\): both panels are identical. No measurement error, no bias.
  • Slowly increase \(\sigma_\eta\): watch the right panel’s slope flatten. The cloud of points spreads horizontally (noise in X), so OLS “sees” a weaker relationship.
  • \(\sigma_\eta = 5\), SD of X* = 2: the attenuation factor drops to about 0.14 (worked out after this list), so your estimate is roughly 86% too small.
  • Increase n: the slope doesn’t recover! Attenuation bias is not a small-sample problem — it persists no matter how much data you have. More data just gives you a more precise estimate of the wrong number.
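
Working out the third bullet: with the SD of X* at 2 and \(\sigma_\eta = 5\),

\[\lambda = \frac{\text{Var}(X^*)}{\text{Var}(X^*) + \sigma_\eta^2} = \frac{2^2}{2^2 + 5^2} = \frac{4}{29} \approx 0.138,\]

so the OLS slope converges to only about 14% of the true \(\beta\).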

Simulation 2: Attenuation is systematic, not just noisy

Run 500 regressions, each with fresh measurement error. The distribution of slope estimates is centered below the true \(\beta\) — not random noise, but a systematic downward bias.

#| standalone: true
#| viewerHeight: 550

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 3,

      sliderInput("beta2", "True slope (\u03b2):",
                  min = 0.5, max = 3, value = 1.5, step = 0.1),

      sliderInput("n2", "Sample size per run:",
                  min = 50, max = 300, value = 100, step = 50),

      sliderInput("sigma_eta2", "Measurement error (\u03c3\u03b7):",
                  min = 0, max = 5, value = 2, step = 0.25),

      sliderInput("n_sims", "Number of simulations:",
                  min = 100, max = 1000, value = 500, step = 100),

      uiOutput("results2")
    ),

    mainPanel(
      width = 9,
      plotOutput("mc_plot", height = "450px")
    )
  )
)

server <- function(input, output, session) {

  sim <- reactive({
    n     <- input$n2
    b     <- input$beta2
    se    <- input$sigma_eta2
    nsims <- input$n_sims
    sx    <- 2

    betas_true <- numeric(nsims)
    betas_obs  <- numeric(nsims)

    for (i in seq_len(nsims)) {
      x_star <- rnorm(n, 0, sx)
      u      <- rnorm(n)
      y      <- 2 + b * x_star + u

      eta   <- rnorm(n, 0, se)
      x_obs <- x_star + eta

      betas_true[i] <- coef(lm(y ~ x_star))[2]
      betas_obs[i]  <- coef(lm(y ~ x_obs))[2]
    }

    lambda <- sx^2 / (sx^2 + se^2)

    list(betas_true = betas_true, betas_obs = betas_obs,
         beta = b, lambda = lambda)
  })

  output$mc_plot <- renderPlot({
    d <- sim()

    par(mar = c(4.5, 4.5, 3, 1))

    all_b <- c(d$betas_true, d$betas_obs)
    xlim  <- range(all_b) + c(-0.1, 0.1)

    # No-error distribution
    hist(d$betas_true, breaks = 40,
         col = adjustcolor("#27ae60", 0.5), border = "white",
         main = "Distribution of slope estimates across simulations",
         xlab = expression(hat(beta)), xlim = xlim,
         freq = FALSE, cex.main = 1.3)

    # With-error distribution
    hist(d$betas_obs, breaks = 40,
         col = adjustcolor("#e74c3c", 0.45), border = "white",
         add = TRUE, freq = FALSE)

    abline(v = d$beta, col = "#2c3e50", lwd = 2.5, lty = 2)
    abline(v = d$beta * d$lambda, col = "#e74c3c", lwd = 2, lty = 3)
    abline(v = mean(d$betas_obs), col = "#e74c3c", lwd = 1.5)

    legend("topright", bty = "n", cex = 0.9,
           legend = c(
             paste0("No error (centered at \u03b2 = ", d$beta, ")"),
             paste0("With error (centered at ", round(mean(d$betas_obs), 3), ")"),
             paste0("Theory: \u03b2\u03bb = ", round(d$beta * d$lambda, 3))
           ),
           fill = c(adjustcolor("#27ae60", 0.5),
                    adjustcolor("#e74c3c", 0.45), NA),
           border = c("white", "white", NA),
           col = c(NA, NA, "#e74c3c"),
           lwd = c(NA, NA, 2), lty = c(NA, NA, 3))
  })

  output$results2 <- renderUI({
    d <- sim()

    tags$div(class = "stats-box",
      HTML(paste0(
        "<b>True \u03b2:</b> ", d$beta, "<br>",
        "<b>Avg estimate (no error):</b> ",
        round(mean(d$betas_true), 3), "<br>",
        "<b>Avg estimate (with error):</b> ",
        round(mean(d$betas_obs), 3), "<br>",
        "<hr style='margin:8px 0'>",
        "<b>Attenuation factor:</b> ", round(d$lambda, 3), "<br>",
        "<b>\u03b2 \u00d7 \u03bb:</b> ",
        round(d$beta * d$lambda, 3), "<br>",
        "<small>Bias: ",
        round(mean(d$betas_obs) - d$beta, 3), "</small>"
      ))
    )
  })
}

shinyApp(ui, server)

Things to try

  • \(\sigma_\eta = 0\): both distributions overlap perfectly — no bias at all.
  • \(\sigma_\eta = 2\): the red distribution shifts left. The average slope is systematically below the truth.
  • Increase n to 300: the distributions get narrower (more precise) but the red one stays centered at the wrong value. Attenuation bias doesn’t go away with more data.
  • Compare theory vs simulation: the theoretical \(\beta\lambda\) should closely match the average of the red distribution.

Measurement error in Y vs X

A crucial asymmetry:

|                           | Error in X        | Error in Y                            |
|---------------------------|-------------------|---------------------------------------|
| Bias?                     | Yes — toward zero | No bias                               |
| Precision?                | Slightly worse    | Worse (larger SEs)                    |
| Goes away with more data? | No                | SEs shrink, but that’s just precision |

Why the asymmetry? When \(Y\) is measured with error, the noise goes into the residual — it’s just more \(u\). The slope is unbiased; you just estimate it less precisely. When \(X\) is measured with error, the noise is in the regressor, which contaminates the covariance between \(X\) and \(Y\) and biases the slope.
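
A minimal sketch of the contrast, with purely illustrative numbers (the same noise SD of 2 is added to Y in one case and to X in the other):

# Error in Y vs error in X: same noise, very different consequences.
set.seed(2)
n      <- 1e5
beta   <- 1.5
x_star <- rnorm(n, 0, 2)
y      <- 2 + beta * x_star + rnorm(n)

y_noisy <- y      + rnorm(n, 0, 2)   # classical error in the outcome
x_noisy <- x_star + rnorm(n, 0, 2)   # classical error in the regressor

coef(summary(lm(y_noisy ~ x_star)))[2, 1:2]  # slope near 1.5, just a larger SE
coef(summary(lm(y ~ x_noisy)))[2, 1:2]       # slope near 1.5 * 4/(4 + 4) = 0.75

Adding noise to \(Y\) leaves the slope near 1.5 and only inflates its standard error; adding the same amount of noise to \(X\) cuts the slope roughly in half.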


What can you do about it?

  1. Measure better. The best fix is reducing \(\sigma_\eta\). Use validated instruments, multiple measurements, averages of repeated measures.

  2. Instrumental variables (IV). Find a variable \(Z\) that predicts \(X^*\) but is uncorrelated with both the measurement error \(\eta\) and the outcome error \(u\). Two-stage least squares then recovers the true \(\beta\).

  3. Reliability ratio correction. OLS with measurement error shrinks the coefficient by the signal-to-noise ratio (also called the reliability ratio):

\[\hat{\beta}_{OLS} \xrightarrow{p} \beta \cdot \underbrace{\frac{\text{Var}(X^*)}{\text{Var}(X^*) + \text{Var}(\eta)}}_{\text{SNR}}\]

The SNR is the same attenuation factor \(\lambda\) from above, and it is always between 0 and 1. High noise (\(\text{Var}(\eta) \gg \text{Var}(X^*)\)) → SNR near 0 → coefficient crushed toward zero. Low noise → SNR near 1 → barely any attenuation. If you know (or can estimate) the SNR, you can correct: \(\hat{\beta}_{corrected} = \hat{\beta}_{OLS} / \text{SNR}\); the sketch after this list shows one way to estimate it.

  4. Multiple indicators. If you have two noisy measures of \(X^*\) with independent errors, their covariance identifies \(\text{Var}(X^*)\), letting you compute \(\lambda\) directly (see the sketch after this list).
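
Here is a minimal sketch of points 3 and 4 (and, implicitly, the IV idea in point 2), assuming you have two noisy measures of the same \(X^*\) with independent errors; all numbers are illustrative:

# Two noisy measures of the same X*, with independent errors.
set.seed(3)
n      <- 1e5
beta   <- 1.5
x_star <- rnorm(n, 0, 2)
y      <- 2 + beta * x_star + rnorm(n)
x1     <- x_star + rnorm(n, 0, 2)   # first noisy measure
x2     <- x_star + rnorm(n, 0, 2)   # second noisy measure, independent error

b_naive <- coef(lm(y ~ x1))[2]      # attenuated: about beta * 4/(4 + 4) = 0.75

# (3) Reliability-ratio correction: with independent errors, Cov(x1, x2) = Var(X*),
#     so the reliability (SNR) of x1 can be estimated by Cov(x1, x2) / Var(x1).
snr_hat <- cov(x1, x2) / var(x1)
b_naive / snr_hat                   # close to the true beta = 1.5

# (2) + (4) Equivalently, use the second measure as an instrument for the first:
#     the IV slope Cov(y, x2) / Cov(x1, x2) is consistent for beta.
cov(y, x2) / cov(x1, x2)            # close to the true beta = 1.5

Averaging the two measures (point 1) also helps: the average has error variance \(\sigma_\eta^2 / 2\), which raises \(\lambda\) from 0.5 to about 0.67 in this example.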

Did you know?

  • The attenuation bias formula was derived by Karl Pearson in the early 1900s, making it one of the oldest results in regression theory. Pearson was studying the relationship between fathers’ and sons’ heights and realized that imprecise measurements of height would make the hereditary correlation look weaker than it really was.

  • In economics, Jerry Hausman (2001) showed that measurement error in survey data on income and consumption can attenuate elasticity estimates by 30–50%. Studies using administrative tax records (with near-zero measurement error) consistently find larger effects than survey-based studies — exactly what attenuation bias predicts.

  • The errors-in-variables literature distinguishes between classical measurement error (what we covered: \(X = X^* + \eta\) with \(\eta\) independent of \(X^*\)) and non-classical error (where the error depends on the true value). Non-classical error can bias in either direction, not just toward zero. Mean-reverting error in test scores is a common example.