Hedonic Pricing

The idea

A house is not a single good — it’s a bundle of attributes: bedrooms, bathrooms, square footage, school quality, distance to the CBD, neighborhood safety. The price you pay reflects the implicit value of each component.

Rosen (1974) formalized this. The hedonic price function \(P(z_1, z_2, \ldots, z_K)\) maps attribute levels to market prices. The marginal implicit price of attribute \(k\) is:

\[\frac{\partial P}{\partial z_k} = \text{how much an extra unit of attribute } k \text{ adds to the house price}\]

An extra bedroom might add $30,000. A one-standard-deviation improvement in school quality might add $50,000. These are the hedonic prices — the market’s revealed valuation of each characteristic.

The regression version

In practice, we estimate:

\[P_i = \beta_0 + \beta_1 \text{Bedrooms}_i + \beta_2 \text{SqFt}_i + \beta_3 \text{SchoolQual}_i + \beta_4 \text{Crime}_i + \varepsilon_i\]

Each \(\beta_k\) is a hedonic price, and its OLS estimate is what we read off the regression table. The challenge is omitted variable bias: if nice neighborhoods have both good schools and high unobserved quality (tree-lined streets, social capital, good restaurants), the school quality coefficient captures the value of schools plus the value of everything that is correlated with schools and left out of the regression.
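
To see where the bias comes from, here is a sketch of the standard omitted-variable algebra. The symbols \(U_i\) (unobserved neighborhood quality), \(\gamma\) (its implicit price), \(\delta_k\), \(\nu_i\), and \(\tilde\beta_3\) are notation introduced here for illustration only. Suppose the true price equation contains \(U_i\), with \(\varepsilon_i\) now unrelated to every attribute:

\[P_i = \beta_0 + \beta_1 \text{Bedrooms}_i + \beta_2 \text{SqFt}_i + \beta_3 \text{SchoolQual}_i + \beta_4 \text{Crime}_i + \gamma U_i + \varepsilon_i\]

Write the linear projection of \(U_i\) on the attributes we do observe as:

\[U_i = \delta_0 + \delta_1 \text{Bedrooms}_i + \delta_2 \text{SqFt}_i + \delta_3 \text{SchoolQual}_i + \delta_4 \text{Crime}_i + \nu_i\]

Substituting the second equation into the first shows what the regression without \(U_i\) recovers for school quality:

\[\tilde\beta_3 = \beta_3 + \gamma \, \delta_3\]

The bias is \(\gamma \, \delta_3\). It vanishes if unobserved quality has no effect on price (\(\gamma = 0\)) or is unrelated to school quality once the other attributes are held fixed (\(\delta_3 = 0\)). With illustrative values \(\beta_3 = 50\), \(\gamma = 25\), and \(\delta_3 = 0.6\), the short regression would center on \(50 + 25 \times 0.6 = 65\) rather than 50.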

The Key Challenge. In the simulation below, you control the correlation between school quality and an unobserved neighborhood quality variable. When you omit the unobserved variable, the school quality coefficient is biased — it picks up the value of neighborhood quality too. This is the central identification problem in hedonic pricing. See also: Omitted Variable Bias.

#| standalone: true
#| viewerHeight: 720

library(shiny)

ui <- fluidPage(
  tags$head(tags$style(HTML("
    .stats-box {
      background: #f0f4f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 14px; line-height: 1.9;
    }
    .stats-box b { color: #2c3e50; }
    .good { color: #27ae60; font-weight: bold; }
    .bad  { color: #e74c3c; font-weight: bold; }
    .info-box {
      background: #eaf2f8; border-radius: 6px; padding: 14px;
      margin-top: 12px; font-size: 13px; line-height: 1.8;
    }
    .info-box b { color: #2c3e50; }
    .reg-table {
      font-family: monospace; font-size: 12px; line-height: 1.6;
      white-space: pre; background: #f9f9f9; padding: 10px;
      border-radius: 4px; margin-top: 8px;
    }
  "))),

  sidebarLayout(
    sidebarPanel(
      width = 4,

      sliderInput("n", "Sample size:",
                  min = 100, max = 2000, value = 500, step = 100),

      tags$h4("True hedonic prices ($1000s)"),
      sliderInput("b_bed", "Bedrooms:", min = 10, max = 60, value = 30, step = 5),
      sliderInput("b_sqft", "Sq footage (per 100):", min = 5, max = 40, value = 15, step = 5),
      sliderInput("b_school", "School quality:", min = 10, max = 80, value = 50, step = 5),
      sliderInput("b_crime", "Crime rate:", min = -60, max = -5, value = -30, step = 5),
      sliderInput("b_unobs", "Unobserved nbhd quality:", min = 0, max = 60, value = 25, step = 5),

      tags$hr(),
      sliderInput("rho", "Correlation: school quality & unobserved quality:",
                  min = 0, max = 0.95, value = 0.6, step = 0.05),

      checkboxInput("include_unobs", "Include unobserved quality in regression", value = FALSE),

      actionButton("go", "New sample", class = "btn-primary", width = "100%"),

      uiOutput("info")
    ),

    mainPanel(
      width = 8,
      fluidRow(
        column(6, uiOutput("reg_output")),
        column(6, plotOutput("scatter_plot", height = "520px"))
      )
    )
  )
)

server <- function(input, output, session) {

  dat <- reactive({
    input$go  # take a dependency on the button so each click draws a fresh sample
    n <- input$n
    b_bed <- input$b_bed
    b_sqft <- input$b_sqft
    b_school <- input$b_school
    b_crime <- input$b_crime
    b_unobs <- input$b_unobs
    rho <- input$rho

    # Generate attributes
    bedrooms <- sample(1:5, n, replace = TRUE, prob = c(0.05, 0.2, 0.4, 0.25, 0.1))
    sqft <- rnorm(n, mean = 15, sd = 5)  # in hundreds
    sqft <- pmax(sqft, 5)

    # School quality and unobserved quality are correlated
    z1 <- rnorm(n)
    z2 <- rnorm(n)
    school <- 5 + 2 * z1
    unobs <- 3 + 2 * (rho * z1 + sqrt(1 - rho^2) * z2)
    school <- pmax(school, 1)
    unobs <- pmax(unobs, 0)

    crime <- rnorm(n, mean = 5, sd = 2)
    crime <- pmax(crime, 0.5)

    # True price (in $1000s)
    eps <- rnorm(n, sd = 20)
    price <- 100 + b_bed * bedrooms + b_sqft * sqft + b_school * school +
             b_crime * crime + b_unobs * unobs + eps

    data.frame(price = price, bedrooms = bedrooms, sqft = sqft,
               school = school, crime = crime, unobs = unobs)
  })

  output$reg_output <- renderUI({
    d <- dat()

    # Short regression (without unobserved)
    m_short <- lm(price ~ bedrooms + sqft + school + crime, data = d)
    cs <- summary(m_short)$coefficients

    # Long regression (with unobserved)
    m_long <- lm(price ~ bedrooms + sqft + school + crime + unobs, data = d)
    cl <- summary(m_long)$coefficients

    # Format regression table
    vars <- c("(Intercept)", "Bedrooms", "Sq Ft (100s)", "School Quality", "Crime Rate")
    vars_long <- c(vars, "Nbhd Quality")

    format_row <- function(name, coef, se, star) {
      sprintf("%-16s %8.2f  (%6.2f) %s", name, coef, se, star)
    }

    get_stars <- function(p) {
      if (p < 0.01) return("***")
      if (p < 0.05) return("** ")
      if (p < 0.10) return("*  ")
      return("   ")
    }

    lines_short <- sapply(seq_len(nrow(cs)), function(i) {
      format_row(vars[i], cs[i, 1], cs[i, 2], get_stars(cs[i, 4]))
    })

    lines_long <- sapply(seq_len(nrow(cl)), function(i) {
      format_row(vars_long[i], cl[i, 1], cl[i, 2], get_stars(cl[i, 4]))
    })

    if (input$include_unobs) {
      header <- "WITH Unobserved Quality"
      lines <- lines_long
      r2 <- round(summary(m_long)$r.squared, 3)
    } else {
      header <- "WITHOUT Unobserved Quality"
      lines <- lines_short
      r2 <- round(summary(m_short)$r.squared, 3)
    }

    table_text <- paste0(
      header, "\n",
      paste(rep("-", 44), collapse = ""), "\n",
      "Variable          Coef     (SE)    \n",
      paste(rep("-", 44), collapse = ""), "\n",
      paste(lines, collapse = "\n"), "\n",
      paste(rep("-", 44), collapse = ""), "\n",
      "R-squared: ", r2, "    N = ", nrow(d), "\n"
    )

    # Bias info
    school_short <- cs["school", 1]
    school_long <- cl["school", 1]

    tags$div(
      tags$div(class = "reg-table", table_text),
      tags$div(class = "stats-box", style = "margin-top: 10px;",
        HTML(paste0(
          "<b>School coef (omitting nbhd):</b> <span class='bad'>",
          round(school_short, 1), "</span><br>",
          "<b>School coef (including nbhd):</b> <span class='good'>",
          round(school_long, 1), "</span><br>",
          "<b>True hedonic price:</b> ", input$b_school, "<br>",
          "<b>OVB:</b> ", round(school_short - school_long, 1),
          " (", round((school_short - school_long) / input$b_school * 100, 0), "% of true value)"
        ))
      )
    )
  })

  output$scatter_plot <- renderPlot({
    d <- dat()
    par(mar = c(4.5, 4.5, 3, 1))

    plot(d$school, d$price, pch = 16, cex = 0.5,
         col = adjustcolor("#3498db", 0.3),
         xlab = "School Quality",
         ylab = "House Price ($1000s)",
         main = "Price vs School Quality")

    # Bivariate regression line (price on school only, no other controls)
    m_short <- lm(price ~ school, data = d)
    abline(m_short, col = "#e74c3c", lwd = 3)

    # Full-model fit as a function of school, other covariates held at their sample means
    m_long <- lm(price ~ bedrooms + sqft + school + crime + unobs, data = d)
    school_range <- seq(min(d$school), max(d$school), length.out = 100)
    pred_vals <- coef(m_long)["school"] * school_range +
                 coef(m_long)["(Intercept)"] +
                 coef(m_long)["bedrooms"] * mean(d$bedrooms) +
                 coef(m_long)["sqft"] * mean(d$sqft) +
                 coef(m_long)["crime"] * mean(d$crime) +
                 coef(m_long)["unobs"] * mean(d$unobs)
    lines(school_range, pred_vals, col = "#27ae60", lwd = 3, lty = 2)

    legend("topleft", bty = "n", cex = 0.85,
           legend = c(
             paste0("Bivariate: slope = ", round(coef(m_short)[2], 1)),
             paste0("Full model: school coef = ", round(coef(m_long)["school"], 1))
           ),
           col = c("#e74c3c", "#27ae60"), lwd = 3, lty = c(1, 2))
  })

  output$info <- renderUI({
    d <- dat()
    m_short <- lm(price ~ bedrooms + sqft + school + crime, data = d)
    m_long <- lm(price ~ bedrooms + sqft + school + crime + unobs, data = d)

    bias <- coef(m_short)["school"] - coef(m_long)["school"]

    tags$div(class = "info-box",
      HTML(paste0(
        "<b>True school price:</b> $", input$b_school, "k<br>",
        "<b>Estimated (w/o nbhd):</b> $", round(coef(m_short)["school"], 1), "k<br>",
        "<b>Estimated (w/ nbhd):</b> $", round(coef(m_long)["school"], 1), "k<br>",
        "<b>Bias magnitude:</b> $", round(abs(bias), 1), "k"
      ))
    )
  })
}

shinyApp(ui, server)
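
A short check of why the construction in the code above delivers the correlation chosen on the slider. This is a sketch that ignores the pmax floors, which clip a small share of draws, and it uses the variable names from the code (school, unobs, z1, z2, rho). With \(z_1\) and \(z_2\) independent standard normals:

\[\text{school} = 5 + 2 z_1, \qquad \text{unobs} = 3 + 2\left(\rho z_1 + \sqrt{1 - \rho^2}\, z_2\right)\]

\[\operatorname{Var}(\text{school}) = 4, \qquad \operatorname{Var}(\text{unobs}) = 4\left(\rho^2 + 1 - \rho^2\right) = 4, \qquad \operatorname{Cov}(\text{school}, \text{unobs}) = 4\rho\]

\[\operatorname{Corr}(\text{school}, \text{unobs}) = \frac{4\rho}{\sqrt{4 \cdot 4}} = \rho\]

Because the two variables have the same variance, the slope from regressing unobs on school is also \(\rho\), so before clipping the expected bias in the school coefficient is roughly the unobserved-quality effect times \(\rho\).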

Things to try

  • Set correlation to 0: the school coefficient is unbiased even without the unobserved variable. Omitting an uncorrelated variable doesn’t cause bias.
  • Set correlation to 0.9: the school coefficient is heavily biased upward because it absorbs the value of neighborhood quality. Toggle “Include unobserved quality” to see the coefficient fall back to roughly its true value.
  • Set the unobserved quality effect to 0: even with high correlation, there’s no bias. The omitted variable doesn’t affect the outcome.
  • Increase sample size: the estimates get more precise, but the bias doesn’t go away. OVB is not a small-sample problem.
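
The experiments above can also be run without the app. The following is a minimal sketch in R, assuming a stripped-down version of the simulation: only school quality and unobserved quality enter the price, and the pmax floors are dropped. The names simulate_school_coef, settings, and results are hypothetical and not part of the app.

```r
# Hypothetical helper (not part of the app): draw one sample from a simplified
# data-generating process and return the school coefficient from the short
# regression (omitting unobs) and the long regression (including it).
simulate_school_coef <- function(n, rho, b_unobs, b_school = 50) {
  z1 <- rnorm(n)
  z2 <- rnorm(n)
  school <- 5 + 2 * z1
  unobs  <- 3 + 2 * (rho * z1 + sqrt(1 - rho^2) * z2)
  price  <- 100 + b_school * school + b_unobs * unobs + rnorm(n, sd = 20)
  c(short = unname(coef(lm(price ~ school))["school"]),
    long  = unname(coef(lm(price ~ school + unobs))["school"]))
}

set.seed(1)  # arbitrary seed, only for reproducibility

# One row per experiment from the list above
settings <- data.frame(
  n       = c(500, 500, 500, 5000),
  rho     = c(0,   0.9, 0.9, 0.9),
  b_unobs = c(25,  25,  0,   25)
)

# Average each estimate over 200 replications per setting
results <- t(sapply(seq_len(nrow(settings)), function(i) {
  s <- settings[i, ]
  rowMeans(replicate(200, simulate_school_coef(s$n, s$rho, s$b_unobs)))
}))

print(cbind(settings, round(results, 1)))
```

Under these assumptions the average short coefficient should sit near 50 in the first and third rows and near \(50 + 25 \times 0.9 = 72.5\) in the second and fourth. The tenfold larger sample in the last row tightens the estimate around the biased value without moving it toward 50.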

Connections

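  • Omitted Variable Bias: the same OVB formula applies here. The hedonic bias = (effect of unobserved quality) \(\times\) (slope from regressing unobserved quality on school quality, given the other controls). In this simulation the two variables have the same variance, so that slope is approximately the correlation set by the slider.
  • Regression & the CEF: the estimated hedonic function approximates the conditional expectation of price given attributes.
  • Selection on Observables: when the controls capture every determinant of price that is correlated with an attribute, that attribute’s hedonic coefficient can be given a causal interpretation.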
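
The OVB formula from the first connection above can be checked numerically. The sketch below, in R, uses simplified hypothetical data (school quality, crime, and unobserved quality only, with made-up coefficients) and verifies that the short-regression school coefficient equals the long-regression coefficient plus the unobserved-quality coefficient times the slope from an auxiliary regression of unobserved quality on the included variables. The object names m_short, m_long, and m_aux are chosen here for illustration.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
n   <- 500
rho <- 0.6

# Simplified, hypothetical data in the spirit of the simulation
school <- 5 + 2 * rnorm(n)
unobs  <- 3 + rho * (school - 5) + 2 * sqrt(1 - rho^2) * rnorm(n)
crime  <- pmax(rnorm(n, mean = 5, sd = 2), 0.5)
price  <- 100 + 50 * school - 30 * crime + 25 * unobs + rnorm(n, sd = 20)

m_short <- lm(price ~ school + crime)          # omits unobs
m_long  <- lm(price ~ school + crime + unobs)  # includes unobs
m_aux   <- lm(unobs ~ school + crime)          # omitted variable on included ones

# Short coefficient vs. long coefficient plus (effect of unobs) x (auxiliary slope)
lhs <- coef(m_short)["school"]
rhs <- coef(m_long)["school"] + coef(m_long)["unobs"] * coef(m_aux)["school"]

all.equal(unname(lhs), unname(rhs))  # TRUE up to floating-point error
```

The equality is an algebraic property of least squares, so it holds in every sample and not just on average. What varies from sample to sample is how close each piece is to its population value.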

Did you know?

  • Sherwin Rosen’s 1974 paper “Hedonic Prices and Implicit Markets” is one of the most cited papers in economics. It showed that prices in differentiated-product markets (houses, cars, jobs) implicitly reveal the value of each characteristic.
  • Sandra Black (1999) cleverly addressed the OVB problem in hedonic pricing by comparing houses on opposite sides of school attendance boundaries — same neighborhood, different schools. She found that a 5% increase in test scores raises house prices by about 2.5%. This boundary discontinuity design is essentially a regression discontinuity applied to hedonic pricing.
  • Zillow’s Zestimate is, at its core, a massive hedonic model — predicting house prices from hundreds of attributes using machine learning. The difference from Rosen’s original framework is scale and flexibility: Zillow uses nonlinear models on millions of observations, but the fundamental idea (price = f(attributes)) is the same.