Experimental Design for AI Systems
Evaluating AI systems — which model is better, which prompt works, whether a feature improves user outcomes — is an empirical question. And the methods for answering empirical questions are the ones already in this course: randomized experiments, power analysis, and multiple testing corrections. The fact that the treatment is an algorithm rather than a drug or a policy does not change the statistical logic.
A/B testing is a randomized controlled trial
When a technology company tests two versions of a product feature (version A vs version B), it is running a randomized controlled trial. Users are randomly assigned to conditions, outcomes are measured, and a treatment effect is estimated.
The statistical framework is identical to what’s covered in Power, Alpha, Beta & MDE:
- Unit of randomization: users, sessions, or requests
- Treatment: the new model, prompt, or feature
- Outcome: click-through rate, user satisfaction, task completion
- Estimand: the average treatment effect (ATE)
The estimator is a simple difference in means, or equivalently a regression of the outcome on the treatment indicator. This is OLS, which is MLE under normality, which is MoM with the moment conditions \(E[Y_i - \alpha - \beta D_i] = 0\) and \(E[D_i(Y_i - \alpha - \beta D_i)] = 0\).
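A minimal sketch of this equivalence on simulated data (the sample size, baseline, and effect size below are illustrative, not from any real experiment):

```python
# A/B test as an RCT: difference in means equals OLS on a treatment dummy.
# Illustrative simulation; the metric and effect size are made up.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 10_000
d = rng.integers(0, 2, size=n)              # random assignment: 0 = control, 1 = treatment
y = 0.30 + 0.02 * d + rng.normal(0, 1, n)   # outcome with a true ATE of 0.02

# Estimator 1: simple difference in means
ate_diff = y[d == 1].mean() - y[d == 0].mean()

# Estimator 2: OLS of the outcome on an intercept and the treatment indicator
X = sm.add_constant(d)
ols = sm.OLS(y, X).fit()

print(f"difference in means: {ate_diff:.4f}")
print(f"OLS coefficient:     {ols.params[1]:.4f}")   # identical up to floating point
```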
Power analysis for AI experiments
AI experiments face specific power challenges:
High-dimensional outcomes. A language model change might affect response quality, latency, safety, user engagement, and revenue simultaneously. Testing all of these increases the multiple testing burden (see Multiple Testing).
Small effect sizes. Mature AI systems are already highly optimized. The marginal improvement from a new model or prompt variant may be small — perhaps a 0.5% improvement in a key metric. Detecting small effects requires large samples and precise estimation.
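As a rough illustration, the sample size needed to detect a small lift in a binary metric can be computed with a standard two-proportion power calculation. The baseline rate and lift below are assumptions chosen only to show the order of magnitude:

```python
# Rough sample-size calculation for a small lift in a binary metric.
# The 5% baseline rate and 0.5% relative lift are illustrative assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

p_control = 0.050                    # assumed baseline conversion rate
p_treat = p_control * 1.005          # assumed 0.5% relative improvement
effect = proportion_effectsize(p_treat, p_control)   # Cohen's h

analysis = NormalIndPower()
n_per_arm = analysis.solve_power(effect_size=effect, alpha=0.05, power=0.8,
                                 ratio=1.0, alternative='two-sided')
print(f"required sample size per arm: {n_per_arm:,.0f}")
```

Under these assumed numbers the required sample runs into the millions per arm, which is exactly why small effects demand large experiments and precise estimation.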
Interference between units. If users interact with each other (social networks, marketplaces), treating one user may affect the outcomes of untreated users. This violates SUTVA (the stable unit treatment value assumption) and biases the ATE estimate. Cluster randomization — randomizing at the group or region level — is the standard fix, with clustered standard errors for inference.
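A sketch of the cluster-randomized analysis, assuming simulated data: assignment happens at the cluster level, and standard errors are clustered at the same level (cluster counts and effect sizes are made up):

```python
# Cluster-randomized design: assign whole clusters (e.g., regions) to treatment
# and cluster the standard errors at the unit of randomization.
# Simulated data; all magnitudes are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_clusters, users_per_cluster = 200, 50
cluster = np.repeat(np.arange(n_clusters), users_per_cluster)
treat_cluster = rng.integers(0, 2, size=n_clusters)       # randomize at the cluster level
d = treat_cluster[cluster]

cluster_effect = rng.normal(0, 0.5, n_clusters)[cluster]   # within-cluster correlation
y = 0.1 * d + cluster_effect + rng.normal(0, 1, len(d))

df = pd.DataFrame({"y": y, "d": d, "cluster": cluster})
fit = smf.ols("y ~ d", data=df).fit(cov_type="cluster",
                                    cov_kwds={"groups": df["cluster"]})
print(fit.summary().tables[1])   # clustered SEs are wider than naive iid SEs
```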
Non-stationarity. User behavior changes over time (weekday vs weekend, novelty effects, seasonal trends). This makes pre-post comparisons unreliable and motivates concurrent randomization — running treatment and control simultaneously rather than sequentially.
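A small simulation makes the point: with an upward time trend and no true treatment effect, a pre-post comparison picks up the trend while a concurrent comparison does not (all numbers below are illustrative):

```python
# Why concurrent randomization beats a pre-post comparison under non-stationarity.
# Illustrative simulation: a time trend, no true treatment effect.
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
trend = np.linspace(0, 0.1, n)                 # user behavior drifts upward over time
y = trend + rng.normal(0, 1, n)

# Pre-post: "launch" at the midpoint and compare after vs before.
pre_post = y[n // 2:].mean() - y[: n // 2].mean()

# Concurrent: randomize within the same period and compare arms.
d = rng.integers(0, 2, size=n)
concurrent = y[d == 1].mean() - y[d == 0].mean()

print(f"pre-post estimate:   {pre_post:+.4f}   (picks up the time trend)")
print(f"concurrent estimate: {concurrent:+.4f}   (centered on the true effect, zero)")
```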
Multiple testing in model evaluation
Evaluating an LLM typically involves many benchmarks: reasoning, coding, math, safety, factual accuracy, instruction following. Each benchmark produces a p-value (or a confidence interval for the performance difference). Testing across many benchmarks without correction inflates the false discovery rate — the fraction of “improvements” that are actually noise.
The tools from Multiple Testing apply directly:
- Bonferroni correction: divide \(\alpha\) by the number of benchmarks. Conservative but simple.
- Benjamini-Hochberg: controls the false discovery rate (FDR) rather than the family-wise error rate. Less conservative, more appropriate when you expect some true improvements.
- Pre-registration: specify the primary benchmark before running the evaluation. Secondary benchmarks are exploratory and should be flagged as such.
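A minimal sketch applying both corrections to a set of per-benchmark p-values (the benchmark names and p-values are placeholders, not real evaluation results):

```python
# Applying Bonferroni and Benjamini-Hochberg to per-benchmark p-values.
# The p-values below are placeholders, not real evaluation results.
from statsmodels.stats.multitest import multipletests

benchmarks = ["reasoning", "coding", "math", "safety", "factuality", "instructions"]
p_values = [0.003, 0.041, 0.020, 0.450, 0.008, 0.060]

for method, label in [("bonferroni", "Bonferroni (FWER)"),
                      ("fdr_bh", "Benjamini-Hochberg (FDR)")]:
    reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method=method)
    kept = [b for b, r in zip(benchmarks, reject) if r]
    print(f"{label}: significant after correction -> {kept}")
```

With these placeholder values, Benjamini-Hochberg retains more benchmarks than Bonferroni, which is the expected pattern when some true improvements exist.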
Observational evaluation and its limits
Not all evaluations can be randomized. Sometimes you want to know: “Did deploying this AI system improve outcomes?” without a controlled experiment. This is an observational causal inference problem, and the standard tools apply:
- Difference-in-differences: compare outcomes before and after deployment, relative to a control group that didn’t receive the system
- Regression discontinuity: if deployment was based on a threshold (e.g., rolled out to users above a certain engagement level), exploit the discontinuity
- Synthetic control: construct a counterfactual from similar units that weren’t treated
Each of these requires identification assumptions — parallel trends, continuity, no anticipation — that must be argued for on substantive grounds. The methods are covered in the causal inference course.
The key point from Prediction vs Causation applies here: measuring the causal impact of an AI system requires an identification strategy, not just a before-after comparison. A correlation between AI adoption and improved outcomes may reflect selection (better organizations adopt AI first), not a causal effect.
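As one illustration of the first approach, a two-period difference-in-differences can be estimated as a regression with a treated-by-post interaction. The sketch below uses simulated data and simply asserts parallel trends rather than testing it:

```python
# Difference-in-differences sketch for a non-randomized AI deployment.
# Group/period structure and effect sizes are illustrative; the identifying
# assumption (parallel trends) is asserted here, not tested.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 4_000
treated = rng.integers(0, 2, size=n)          # units that deployed the system
post = rng.integers(0, 2, size=n)             # observations after the deployment date
# Outcome: group gap + common time trend + true effect of 0.5 only for treated units post-deployment
y = 1.0 * treated + 0.3 * post + 0.5 * treated * post + rng.normal(0, 1, n)

df = pd.DataFrame({"y": y, "treated": treated, "post": post})
did = smf.ols("y ~ treated * post", data=df).fit()
print(did.params["treated:post"])             # DiD estimate of the deployment effect
```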
Connecting to the course
This page applies tools from throughout the course to a specific domain:
- Power and p-values: the foundation of experimental design, whether the treatment is a drug or a prompt
- Multiple Testing: essential when evaluating across many benchmarks or metrics
- Clustered SEs: necessary when randomization is at the cluster level or users are not independent
- Heteroskedasticity: treatment effects may vary across user segments, requiring robust standard errors
- Prediction vs Causation: the overarching distinction between what AI systems optimize and what causal evaluations require