Generalized Method of Moments
Method of Moments is simple but leaves a key question unanswered: which moments should you match? And what if you have more moment conditions than parameters? GMM formalizes this. It takes a set of moment conditions — potentially more than you need — and finds the parameters that satisfy them as closely as possible in a precise, optimal sense.
From MoM to GMM
Recall how MoM works: you have \(k\) parameters and you pick \(k\) moment conditions. Set the sample versions of those conditions equal to zero and solve. When you have exactly as many conditions as parameters — the just-identified case — there’s a unique solution and everything is clean.
But in practice, economic theory or statistical reasoning often gives you more moment conditions than parameters. Suppose you have \(m\) conditions for \(k\) parameters, with \(m > k\). Now the system is over-identified: you can’t satisfy all \(m\) conditions exactly, so you need a way to get as close as possible.
That’s what GMM does. Instead of solving a system of equations, it minimizes a measure of how far the sample moments are from zero.
The GMM estimator
Start with a vector of moment conditions. The theory says that at the true parameter \(\theta_0\):
\[ E[g(X_i, \theta_0)] = 0 \]
where \(g\) is an \(m\)-dimensional vector (one entry per moment condition). The sample analog is:
\[ \bar{g}(\theta) = \frac{1}{n}\sum_{i=1}^n g(X_i, \theta) \]
If \(m = k\) (just-identified), you can set \(\bar{g}(\theta) = 0\) and solve — that’s MoM. If \(m > k\) (over-identified), you can’t make all entries of \(\bar{g}\) exactly zero. GMM minimizes a weighted quadratic form:
\[ \hat{\theta}_{\text{GMM}} = \arg\min_\theta \; \bar{g}(\theta)' \, W \, \bar{g}(\theta) \]
where \(W\) is an \(m \times m\) positive definite weighting matrix. This is like a weighted sum of squared moment violations — GMM finds the \(\theta\) that makes the sample moments as close to zero as possible, with \(W\) determining how much each moment condition matters.
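To make the objective concrete, here is a minimal sketch in Python (a toy setup of my own, not from the course): one parameter, two moment conditions for a Poisson rate, minimized numerically with an identity weighting matrix.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.poisson(lam=3.0, size=5_000)

def g(theta, x):
    """Stacked moment conditions: E[X] - lam = 0 and E[X^2] - lam - lam^2 = 0 at the truth."""
    lam = theta[0]
    return np.column_stack([x - lam, x**2 - lam - lam**2])

def gmm_objective(theta, x, W):
    gbar = g(theta, x).mean(axis=0)   # sample moments  \bar{g}(theta)
    return gbar @ W @ gbar            # quadratic form  \bar{g}' W \bar{g}

W = np.eye(2)                         # start with identity weights
res = minimize(gmm_objective, x0=[1.0], args=(x, W), method="Nelder-Mead")
print(res.x)                          # should land near the true rate of 3
```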
Choosing \(W\)
The choice of \(W\) affects efficiency but not consistency — any positive definite \(W\) gives a consistent estimator. But some choices are better than others.
Feasible two-step GMM. The most common approach:
- Start with \(W = I\) (or any reasonable \(W\)) and get a preliminary estimate \(\hat{\theta}_1\)
- Use \(\hat{\theta}_1\) to estimate the optimal weighting matrix: \(\hat{W} = \left[\frac{1}{n}\sum_i g(X_i, \hat{\theta}_1) \, g(X_i, \hat{\theta}_1)'\right]^{-1}\)
- Re-estimate with \(\hat{W}\) to get \(\hat{\theta}_2\)
The optimal \(W\) is the inverse of the variance of the moment conditions. This weights precisely-estimated moments more heavily and noisy moments less — exactly the right thing to do.
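A sketch of the two-step procedure, continuing the Poisson example above (the helpers `g` and `gmm_objective` and the data `x` are assumed from that snippet):

```python
# Step 1: preliminary estimate with W = I.
theta1 = minimize(gmm_objective, x0=[1.0], args=(x, np.eye(2)),
                  method="Nelder-Mead").x

# Step 2: estimate the optimal weighting matrix at theta1 ...
G = g(theta1, x)                       # n-by-m matrix of g(X_i, theta1)
S_hat = (G.T @ G) / len(x)             # (1/n) sum_i g g'  -- variance of the moments
W_hat = np.linalg.inv(S_hat)

# ... then re-estimate with the efficient weights.
theta2 = minimize(gmm_objective, x0=theta1, args=(x, W_hat),
                  method="Nelder-Mead").x
```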
Examples that build on the course
OLS as GMM
The population moment condition behind OLS is:
\[ E[X_i(Y_i - X_i'\beta)] = 0 \]
This says the regressors are uncorrelated with the error — exactly the identification assumption from Regression & the CEF. The sample analog is:
\[ \bar{g}(\beta) = \frac{1}{n}\sum_{i=1}^n X_i(Y_i - X_i'\beta) = \frac{1}{n}X'(y - X\beta) \]
Setting this to zero and solving gives \(X'X\hat{\beta} = X'y\), or \(\hat{\beta} = (X'X)^{-1}X'y\) — the OLS formula from The Algebra Behind OLS. OLS is just-identified GMM: \(k\) regressors, \(k\) moment conditions, and \(W\) drops out because you can solve exactly.
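A quick numerical check (on a simulated design of my own, not course data): solving the sample moment condition directly reproduces the least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept plus one regressor
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(size=n)

# Sample moment condition (1/n) X'(y - X beta) = 0  =>  beta = (X'X)^{-1} X'y
beta_gmm = np.linalg.solve(X.T @ X, X.T @ y)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(beta_gmm, beta_ols))                  # True
```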
IV/2SLS as GMM
When regressors are endogenous, you need instruments \(Z_i\) that are correlated with \(X_i\) but uncorrelated with \(\varepsilon_i\). The moment condition becomes:
\[ E[Z_i(Y_i - X_i'\beta)] = 0 \]
If you have more instruments than endogenous regressors (\(m > k\)), this is over-identified — and GMM handles it naturally. Two-stage least squares (2SLS) is a specific GMM estimator with a particular choice of weighting matrix.
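For the linear case the GMM minimization has a closed form, \(\hat{\beta} = (X'ZWZ'X)^{-1}X'ZWZ'y\), and choosing \(W = (Z'Z/n)^{-1}\) gives exactly 2SLS. A sketch with a simulated endogenous regressor and two instruments (the data-generating process is my own, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000
Z = rng.normal(size=(n, 2))                        # two instruments for one regressor
u = rng.normal(size=n)                             # structural error
x_endog = Z @ np.array([1.0, 0.5]) + 0.8 * u + rng.normal(size=n)
y = 2.0 * x_endog + u                              # true coefficient is 2.0
X = x_endog[:, None]

W = np.linalg.inv(Z.T @ Z / n)                     # the 2SLS weighting matrix
beta_2sls = np.linalg.solve(X.T @ Z @ W @ Z.T @ X,
                            X.T @ Z @ W @ Z.T @ y)
print(beta_2sls)                                   # close to 2.0; plain OLS would be biased upward
```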
The J-test (overidentification test)
When you’re over-identified (\(m > k\)), you have a built-in specification test. The logic: if the model is correct and all moment conditions are valid, the minimized GMM objective should be “small” (close to zero). If it’s “large,” at least some moment conditions are being violated, which suggests the model is misspecified.
Formally, under the null that the model is correct:
\[ J = n \cdot \bar{g}(\hat{\theta})' \hat{W} \, \bar{g}(\hat{\theta}) \;\xrightarrow{d}\; \chi^2_{m-k} \]
where \(m - k\) is the number of “extra” moment conditions (the degree of over-identification). A large \(J\) (relative to the \(\chi^2_{m-k}\) distribution) rejects the model.
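A sketch of the test, continuing the IV example above (the simulated `X`, `Z`, `y`, and `beta_2sls` are assumed from that snippet): re-estimate with the efficient weighting matrix, then evaluate the scaled objective.

```python
import numpy as np
from scipy.stats import chi2

n = len(y)
u_hat = y - X @ beta_2sls                       # first-step (2SLS) residuals
G = Z * u_hat[:, None]                          # rows are g(X_i, beta) = Z_i * residual_i
W_hat = np.linalg.inv(G.T @ G / n)              # efficient weighting matrix

# Re-estimate with the efficient weights, then evaluate the scaled objective.
beta_gmm = np.linalg.solve(X.T @ Z @ W_hat @ Z.T @ X,
                           X.T @ Z @ W_hat @ Z.T @ y)
gbar = Z.T @ (y - X @ beta_gmm) / n             # sample moments at the GMM estimate
J = n * gbar @ W_hat @ gbar
p_value = chi2.sf(J, df=Z.shape[1] - X.shape[1])   # m - k = 2 - 1 degree of freedom
print(J, p_value)                               # a large J (small p-value) rejects the model
```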
When to use GMM
When economic theory gives you moment conditions but not a full likelihood. Many structural models in economics specify relationships between variables (Euler equations, equilibrium conditions) that translate directly into moment conditions. You don’t need to know the full distribution of the data — just these conditions. This is much less demanding than MLE.
When you want robustness to distributional assumptions. GMM only assumes the moment conditions hold; it’s agnostic about the rest of the distribution. If you use MLE with the wrong distributional assumption, your estimator converges to the wrong thing. GMM avoids this risk by not making that assumption in the first place.
When you have instruments or panel data. These settings naturally produce moment conditions (instrument exogeneity, sequential exogeneity), making GMM a natural framework.
The tradeoff. GMM is more robust than MLE but less efficient when MLE’s distributional assumptions actually hold. If you know the distribution and it’s correctly specified, MLE squeezes out every last bit of information. GMM leaves some information on the table by not using the full distribution.
Connecting to MLE
MLE is actually a special case of GMM. The MLE score equations — the first-order conditions from maximizing the log-likelihood — are moment conditions:
\[ E\!\left[\frac{\partial \log f(X_i \mid \theta_0)}{\partial \theta}\right] = 0 \]
These are the “moment conditions” that MLE implicitly uses. There are exactly \(k\) of them (one per parameter), so the system is just-identified, and solving their sample analogs within the GMM framework reproduces the MLE exactly. So the hierarchy is:
\[ \text{MoM} \;\subset\; \text{GMM} \;\supset\; \text{MLE} \]
MoM is GMM with a specific (often suboptimal) choice of moments and exactly as many conditions as parameters, so the weighting matrix is irrelevant. MLE is GMM with the score equations as the moment conditions, which are also just-identified. GMM is the general framework that nests both.
This is why GMM achieves MLE-level efficiency when the score equations are used as the moment conditions and the likelihood is correctly specified — it’s doing MLE, just expressed in the language of moments.
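A concrete check of this equivalence, for the exponential density \(f(x \mid \theta) = \theta e^{-\theta x}\): the score condition is the first-moment condition in disguise, so MoM, MLE, and GMM all land on the same estimator.

\[ \frac{\partial \log f(X_i \mid \theta)}{\partial \theta} = \frac{1}{\theta} - X_i \quad\Rightarrow\quad E\!\left[\frac{1}{\theta_0} - X_i\right] = 0 \quad\Rightarrow\quad \hat{\theta} = \frac{1}{\bar{X}}, \]

which is exactly what matching the first moment \(E[X_i] = 1/\theta\) gives you.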