Prediction vs Causation in Foundation Models
Foundation models — large language models, vision transformers, multimodal systems — are extraordinarily good at prediction. Given a prompt, they produce a continuation that is statistically plausible. Given an image, they classify it accurately. But prediction and causal reasoning are fundamentally different tasks, and conflating them is one of the most common errors in applied work that uses these models.
\(P(Y \mid X)\) vs \(P(Y \mid do(X))\)
The distinction is precise. A predictive model learns the conditional distribution:
\[ P(Y \mid X) \]
This answers: “Given that I observe \(X = x\), what is the distribution of \(Y\)?” It captures statistical associations — correlations, patterns, regularities in the training data.
A causal model targets the interventional distribution:
\[ P(Y \mid do(X = x)) \]
This answers: “If I set \(X\) to \(x\) (intervening, not observing), what happens to \(Y\)?” This requires knowing the causal structure — which variables cause which — not just their joint distribution.
The difference matters whenever the relationship between \(X\) and \(Y\) is confounded. If \(X\) and \(Y\) are both driven by an unobserved variable \(U\):
\[ P(Y \mid X = x) \neq P(Y \mid do(X = x)) \]
Observing that \(X\) takes a particular value tells you something about \(U\) (and hence \(Y\)) that setting \(X\) does not. This is the omitted variable bias from Residuals & Controls, expressed in the language of interventions.
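A small simulation makes the gap concrete. The structural equations below are purely illustrative (not from the text): an unobserved \(U\) drives both \(X\) and \(Y\), and \(X\) has no direct effect on \(Y\). Conditioning on \(X = 1\) shifts the expectation of \(Y\); intervening to set \(X = 1\) leaves it untouched.

```python
import numpy as np

# Illustrative structural equations: an unobserved confounder U drives both
# X and Y, and X has NO direct effect on Y.
rng = np.random.default_rng(0)
n = 200_000
U = rng.normal(size=n)
X = U + rng.normal(scale=0.5, size=n)
Y = 2.0 * U + rng.normal(scale=0.5, size=n)

# Observational: conditioning on X near 1 is informative about U, and hence Y.
obs_mean = Y[np.abs(X - 1.0) < 0.05].mean()

# Interventional: setting X = 1 by fiat severs the dependence between X and U;
# Y's structural equation does not involve X, so its distribution is unchanged.
Y_do = 2.0 * U + rng.normal(scale=0.5, size=n)
do_mean = Y_do.mean()

print(f"E[Y | X = 1]     ~ {obs_mean:.2f}")   # about 1.6, driven entirely by U
print(f"E[Y | do(X = 1)] ~ {do_mean:.2f}")    # about 0.0, no causal effect
```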
What foundation models actually learn
Large language models are trained to minimize cross-entropy loss — which, as shown in Training as MLE, is maximum likelihood estimation of the conditional distribution of the next token given the context.
This means LLMs are, at their core, estimators of \(P(\text{next token} \mid \text{context})\). They approximate conditional distributions over text, conditioned on the training corpus. They can reproduce causal statements that appear in their training data (“smoking causes cancer”) because those strings have high conditional probability. But reproducing a causal statement is not the same as performing causal reasoning.
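As a toy illustration (the vocabulary and logits below are invented), the per-token training loss is the negative log of the model's conditional probability for the token that actually appears, which is why minimizing it amounts to conditional MLE:

```python
import numpy as np

# Toy numbers, made up for illustration: cross-entropy on next-token prediction
# is -log P(next token | context), so minimizing it is conditional MLE.
vocab = ["cancer", "health", "taxes"]
logits = np.array([2.1, 0.3, -1.0])              # model scores given some context
probs = np.exp(logits) / np.exp(logits).sum()    # softmax -> P(token | context)

observed = vocab.index("cancer")                 # token that actually follows
cross_entropy = -np.log(probs[observed])         # per-token training loss

print(dict(zip(vocab, probs.round(3))))
print(f"cross-entropy = -log P('cancer' | context) = {cross_entropy:.3f}")
```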
To perform genuine causal inference, a model would need to:
- Distinguish observational from interventional distributions
- Identify confounders and adjust for them
- Reason counterfactually (“what would have happened if…?”)
These capabilities are not guaranteed by minimizing prediction loss, regardless of model scale. Whether and to what extent LLMs develop emergent causal reasoning abilities is an active research question — but the default assumption should be that prediction and causation are distinct.
Identification vs training
This is the conceptual anchor that connects statistical foundations to modern AI.
Training asks: given data and a loss function, find the parameters \(\hat{\theta}\) that minimize prediction error.
\[ \hat{\theta} = \arg\min_\theta \; \frac{1}{n}\sum_{i=1}^n \mathcal{L}(Y_i, f_\theta(X_i)) \]
This is an optimization problem. With enough data and a flexible enough model, you can drive the loss arbitrarily close to that of the Bayes-optimal predictor.
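A minimal sketch of that optimization view, assuming a linear model and squared loss (the coefficients and step size below are arbitrary): gradient descent on the empirical loss drives the parameters toward the best predictor for this task.

```python
import numpy as np

# Sketch: training as empirical risk minimization with gradient descent
# on average squared error for a linear model. All numbers are illustrative.
rng = np.random.default_rng(1)
n = 10_000
X = rng.normal(size=(n, 2))
Y = X @ np.array([1.5, -0.7]) + rng.normal(scale=0.3, size=n)

theta = np.zeros(2)
lr = 0.1
for _ in range(500):
    residual = X @ theta - Y
    grad = X.T @ residual / n      # gradient of the average squared error
    theta -= lr * grad             # (constant factor absorbed into the step size)

print("learned theta:", theta.round(3))   # approaches [1.5, -0.7]
```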
Identification asks: given the data-generating process, can the quantity of interest be recovered from the observable data at all — regardless of sample size, model complexity, or computational power?
\[ \text{Is } P(Y \mid do(X)) \text{ recoverable from } P(Y, X, Z)? \]
This is a logical problem. If the causal effect is not identified — because of unobserved confounders, selection bias, or measurement error — no amount of data or model sophistication will recover it. A billion observations and a trillion-parameter model will converge to the wrong answer just as confidently as a simple regression.
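A simulation sketch of this point, under an assumed data-generating process (all numbers are illustrative): the true causal effect of \(X\) on \(Y\) is 1.0, but an unobserved confounder pushes the regression of \(Y\) on \(X\) toward 2.0, and more data only tightens the estimate around the wrong value.

```python
import numpy as np

# Illustrative DGP: the true causal effect of X on Y is 1.0, but an unobserved
# confounder U inflates the observational association.
def ols_slope(n, rng):
    U = rng.normal(size=n)                 # unobserved -- cannot be controlled for
    X = U + rng.normal(size=n)
    Y = 1.0 * X + 2.0 * U + rng.normal(size=n)
    return np.polyfit(X, Y, 1)[0]          # slope of Y ~ X, no controls

rng = np.random.default_rng(2)
for n in [1_000, 100_000, 1_000_000]:
    print(f"n = {n:>9,}: estimated slope = {ols_slope(n, rng):.3f}  (true effect = 1.0)")
# The estimate converges to 2.0, not 1.0: more data sharpens the wrong answer.
```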
| | Training | Identification |
|---|---|---|
| Question | “What minimizes prediction error?” | “Can the causal parameter be recovered?” |
| Constraint | Computational (optimization) | Logical (causal structure) |
| Solved by | More data, bigger models | Research design, assumptions |
| Failure mode | Overfitting, distribution shift | Bias, confounding |
In-context learning and Bayesian updating
A suggestive connection: when you provide examples in a prompt, the model’s predictions shift toward the pattern in those examples. This resembles Bayesian updating — the pre-trained model is the “prior,” and the in-context examples are the “data” that update it.
For simple settings — linear regression, basic classification — this analogy has formal support. Garg et al. (2022) and others have shown that transformers trained on linear regression tasks implement something close to Bayesian posterior computation in their forward pass.
However, this equivalence does not generally hold for deep transformers on complex tasks. In-context learning may involve pattern matching, retrieval from training data, or other mechanisms that are not well-described as Bayesian updating. The analogy is useful as intuition but should not be treated as a general theoretical result.
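For reference, the sketch below writes out the Bayesian posterior-mean predictor for linear regression with a Gaussian prior, the kind of closed-form benchmark that in-context predictions are compared against in this line of work. The prior and noise variances and the context size are placeholders.

```python
import numpy as np

# Bayesian posterior-mean predictor for linear regression with a N(0, tau^2 I)
# prior on the weights and Gaussian observation noise sigma^2. The variances
# used here are arbitrary placeholders.
def bayes_posterior_mean(X_ctx, y_ctx, x_query, sigma2=0.25, tau2=1.0):
    d = X_ctx.shape[1]
    # Posterior mean of the weights: (X'X + (sigma^2/tau^2) I)^{-1} X'y (ridge form)
    A = X_ctx.T @ X_ctx + (sigma2 / tau2) * np.eye(d)
    w_post = np.linalg.solve(A, X_ctx.T @ y_ctx)
    return x_query @ w_post

rng = np.random.default_rng(3)
w_true = rng.normal(size=5)
X_ctx = rng.normal(size=(8, 5))                       # "in-context" examples
y_ctx = X_ctx @ w_true + rng.normal(scale=0.5, size=8)
x_query = rng.normal(size=5)

print("posterior-mean prediction:", bayes_posterior_mean(X_ctx, y_ctx, x_query))
print("true (noiseless) value:   ", x_query @ w_true)
```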
Implications for applied work
Using LLMs for prediction: appropriate. If the goal is forecasting, classification, or text generation, the model is optimized for exactly this task. Evaluate on held-out data, check calibration, and be aware of distribution shift.
Using LLMs for causal claims: requires extreme caution. If the goal is to estimate a treatment effect, evaluate a policy, or make a causal argument, the model’s predictions reflect associations in the training data, not causal relationships. The standard tools of causal inference — randomized experiments, instrumental variables, difference-in-differences, regression discontinuity — remain necessary.
Using LLMs as research tools: promising but limited. LLMs can assist with literature review, code generation, data cleaning, and hypothesis generation. But the identification strategy — the argument for why your estimate is causal — must come from the researcher, not the model.
Product experimentation in AI systems — “which prompt works better?”, “does this fine-tuning improve user satisfaction?” — is fundamentally a randomized controlled trial. The statistical tools are the ones already in this course: power analysis, p-values, multiple testing corrections. The fact that the treatment is an AI system doesn’t change the experimental design principles.
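A minimal sketch of that workflow using statsmodels (the success rates, sample sizes, and targets below are invented for illustration): a power calculation to size the experiment, then a two-proportion z-test on the observed outcomes.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest, proportion_effectsize
from statsmodels.stats.power import NormalIndPower

# Hypothetical prompt A/B test; all rates, counts, and targets are invented.

# 1. Power analysis: sample size per arm to detect a lift from a 10% to a 12%
#    task-success rate at alpha = 0.05 with 80% power.
effect = proportion_effectsize(0.12, 0.10)
n_per_arm = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                         power=0.8, alternative="two-sided")
print(f"required sample size per arm ~ {n_per_arm:.0f}")

# 2. Analysis: standard two-proportion z-test on the observed successes.
successes = np.array([410, 480])      # prompt A, prompt B
trials = np.array([4000, 4000])
stat, pval = proportions_ztest(successes, trials)
print(f"z = {stat:.2f}, p-value = {pval:.4f}")
```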