Method of Moments
The simplest estimation idea you’ll encounter: set sample moments equal to population moments and solve. It won’t always give you the best estimator, but it’s often the most transparent one — and it’s the starting point for understanding everything else in this section.
The idea
A moment is just an expected value of some function of the data. The first moment of \(X\) is \(E[X]\) (the mean), the second moment is \(E[X^2]\), and so on. Many parameters you care about can be written as functions of moments:
\[ \mu = E[X], \qquad \sigma^2 = E[X^2] - (E[X])^2 \]
The population moments involve the true distribution, which you don’t know. But you do have data, so you can compute sample moments:
\[ \hat{m}_1 = \frac{1}{n}\sum_{i=1}^n X_i, \qquad \hat{m}_2 = \frac{1}{n}\sum_{i=1}^n X_i^2 \]
The Method of Moments (MoM) strategy is dead simple:
- Write the parameters as functions of population moments
- Replace population moments with sample moments
- Solve for the parameters
For the normal distribution, this gives you \(\hat{\mu} = \bar{X}\) and \(\hat{\sigma}^2 = \frac{1}{n}\sum_i (X_i - \bar{X})^2\). The sample mean and the \(1/n\) version of the sample variance are MoM estimators — you’ve been using Method of Moments all along without calling it that.
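Here is a minimal sketch of that recipe in Python (NumPy only; the data are simulated, and the true values \(\mu = 5\), \(\sigma = 2\) are just illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1_000)  # simulated data: true mu = 5, sigma = 2

# Sample moments
m1 = np.mean(x)      # (1/n) * sum of X_i
m2 = np.mean(x**2)   # (1/n) * sum of X_i^2

# Solve the moment equations: mu = m1, sigma^2 = m2 - m1^2
mu_hat = m1
sigma2_hat = m2 - m1**2   # same as the 1/n sample variance

print(f"mu_hat = {mu_hat:.3f}, sigma2_hat = {sigma2_hat:.3f}")
```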
A worked example
Suppose \(X_1, \ldots, X_n\) are drawn from a Uniform\((0, \theta)\) distribution. You want to estimate \(\theta\).
The population mean is:
\[ E[X] = \frac{\theta}{2} \]
Set the sample moment equal to the population moment:
\[ \bar{X} = \frac{\hat{\theta}}{2} \]
Solve:
\[ \hat{\theta}_{\text{MoM}} = 2\bar{X} \]
That’s it — double the sample mean. Now here’s what’s interesting: the MLE for the same problem is \(\hat{\theta}_{\text{MLE}} = \max(X_1, \ldots, X_n)\) — the largest observation. Same data, same model, completely different estimator.
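To see the two estimators side by side, here is a small sketch on simulated data (the true \(\theta = 10\) and the sample size are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = 10.0
x = rng.uniform(0, theta_true, size=50)

theta_mom = 2 * np.mean(x)   # Method of Moments: double the sample mean
theta_mle = np.max(x)        # MLE: the largest observation

print(f"MoM estimate: {theta_mom:.3f}")
print(f"MLE estimate: {theta_mle:.3f}")
```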
This is important. Different estimation principles can give you different answers. MoM matched a moment; MLE maximized a likelihood. Neither is “wrong” — they’re optimizing different things.
When MoM works well
Simple and closed-form. You write down moment equations and solve. No optimization, no iterative algorithms. For many common distributions, the estimators are formulas you can compute by hand.
Consistent under mild conditions. As long as the law of large numbers applies (sample moments converge to population moments), MoM estimators converge to the true parameter values. You need very little for this — just finite moments and independent data.
A good starting point. Even when MoM isn’t the most efficient estimator, it’s often used as an initial value for more complex procedures (like numerical MLE or GMM).
Limitations
Inefficient. MoM estimators often have higher variance than MLE estimators. In the Uniform example above, \(\hat{\theta}_{\text{MoM}} = 2\bar{X}\) has variance that shrinks at rate \(1/n\), while \(\hat{\theta}_{\text{MLE}} = \max(X_i)\) has variance that shrinks at rate \(1/n^2\). The MLE is dramatically better because it uses the shape of the distribution, not just a single summary statistic. A quick simulation after these limitations makes the rate difference concrete.
Can produce impossible estimates. A MoM estimator for a variance could turn out negative. A MoM estimator for a probability could land outside \([0, 1]\). In the Uniform example, \(2\bar{X}\) can come out smaller than the largest observation, which the model rules out. The method has no built-in mechanism to enforce constraints on the parameter space.
Ambiguity with many parameters. If you have \(k\) parameters, you need \(k\) moment conditions. But there are infinitely many moments to choose from — \(E[X]\), \(E[X^2]\), \(E[X^3]\), \(E[\log X]\), \(E[1/X]\), and so on. Different choices give different estimators, and MoM doesn’t tell you which is best. This ambiguity is exactly what GMM resolves.
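Here is the simulation mentioned above, a rough Monte Carlo sketch (the sample sizes, replication count, and true \(\theta\) are arbitrary illustration choices). The variance of the MoM estimator should shrink roughly like \(1/n\) and the MLE's roughly like \(1/n^2\); it also counts how often the MoM estimate falls below the sample maximum, which illustrates the parameter-space problem.

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true = 10.0
n_reps = 5_000  # number of simulated datasets per sample size

for n in (10, 100, 1000):
    x = rng.uniform(0, theta_true, size=(n_reps, n))
    mom = 2 * x.mean(axis=1)   # 2 * sample mean, one estimate per replication
    mle = x.max(axis=1)        # sample maximum, one estimate per replication
    below_max = np.mean(mom < mle)  # share of MoM estimates smaller than an observed value
    print(f"n={n:5d}  var(MoM)={mom.var():.5f}  var(MLE)={mle.var():.5f}  "
          f"MoM below sample max: {below_max:.1%}")
```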
Connecting forward
MoM is the simplest member of a family of estimation strategies. Each one builds on the same intuition — matching features of the data to features of the model — but adds sophistication:
Maximum Likelihood picks parameters to maximize the probability of the observed data. It’s often more efficient than MoM because it uses the full distributional shape, not just selected moments.
GMM formalizes “which moments to match” and handles the case where you have more moment conditions than parameters (over-identification). It also tells you the best way to weight those conditions.
OLS as MoM. The normal equations \(X'y = X'X\hat{\beta}\) are really just moment conditions in disguise: \(E[X_i(Y_i - X_i'\beta)] = 0\) says “the regressors are uncorrelated with the error.” Setting the sample analog to zero and solving gives you OLS. For more on the algebra, see The Algebra Behind OLS.
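To close the loop, here is a short sketch (simulated data, arbitrary true coefficients) that solves the sample analog of that moment condition, \(\frac{1}{n}\sum_i X_i(Y_i - X_i'\hat{\beta}) = 0\), which is exactly the normal equations:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one regressor
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(size=n)

# Sample moment condition X'(y - X beta) = 0  <=>  X'y = X'X beta
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # should be close to [1.0, 2.0]
```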