Experimental Design for AI Systems

Evaluating AI systems — which model is better, which prompt works, whether a feature improves user outcomes — is an empirical question. And the methods for answering empirical questions are the ones already in this course: randomized experiments, power analysis, and multiple testing corrections. The fact that the treatment is an algorithm rather than a drug or a policy does not change the statistical logic.

A/B testing is a randomized controlled trial

When a technology company tests two versions of a product feature (version A vs version B), it is running a randomized controlled trial. Users are randomly assigned to conditions, outcomes are measured, and a treatment effect is estimated.

The statistical framework is identical to what’s covered in Power, Alpha, Beta & MDE:

  • Unit of randomization: users, sessions, or requests
  • Treatment: the new model, prompt, or feature
  • Outcome: click-through rate, user satisfaction, task completion
  • Estimand: the average treatment effect (ATE)

The estimator is a simple difference in means — or, equivalently, a regression of the outcome on the treatment indicator. This is OLS, which is MLE under normality, which is MoM with the moment conditions \(E[Y_i - \alpha - \beta D_i] = 0\) and \(E[D_i(Y_i - \alpha - \beta D_i)] = 0\).
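
To make the equivalence concrete, here is a minimal sketch in Python on simulated data (the effect size, noise level, and sample size are illustrative assumptions, not figures from any real experiment):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated A/B test: D = 1 for version B (treatment), 0 for version A (control)
n = 10_000
d = rng.integers(0, 2, size=n)              # random assignment
y = 0.10 + 0.02 * d + rng.normal(0, 1, n)   # outcome with a true ATE of 0.02

# Difference in means
diff_in_means = y[d == 1].mean() - y[d == 0].mean()

# OLS of the outcome on a constant and the treatment indicator
fit = sm.OLS(y, sm.add_constant(d)).fit()

print(diff_in_means)  # matches the OLS coefficient on D ...
print(fit.params[1])  # ... to floating-point precision
```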

The point. A/B tests in AI are not a new methodology. They are the same randomized experiments that have been used in medicine, economics, and social science for decades. The statistical principles — randomization for identification, power for design, multiple testing for honesty — carry over directly.

Power analysis for AI experiments

AI experiments face specific power challenges:

High-dimensional outcomes. A language model change might affect response quality, latency, safety, user engagement, and revenue simultaneously. Testing all of these increases the multiple testing burden (see Multiple Testing).

Small effect sizes. Mature AI systems are already highly optimized. The marginal improvement from a new model or prompt variant may be small — perhaps a 0.5% improvement in a key metric. Detecting small effects requires large samples and precise estimation.
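
For a sense of scale, here is a hedged back-of-the-envelope calculation with statsmodels, reading the 0.5% as a relative lift on an assumed 10% baseline conversion rate (both numbers are illustrative):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Illustrative assumption: baseline conversion of 10%, target lift of 0.5%
# relative (10.00% -> 10.05%), i.e., tiny in absolute terms
p_control = 0.100
p_treatment = 0.100 * 1.005

# Cohen's h effect size for comparing two proportions
h = proportion_effectsize(p_treatment, p_control)

# Sample size per arm for 80% power at a two-sided alpha of 0.05
n_per_arm = NormalIndPower().solve_power(effect_size=h, alpha=0.05,
                                         power=0.80, alternative="two-sided")
print(round(n_per_arm))  # roughly 2.8 million users per arm
```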

Interference between units. If users interact with each other (social networks, marketplaces), treating one user may affect the outcomes of untreated users. This violates SUTVA (the stable unit treatment value assumption) and biases the ATE estimate. Cluster randomization — randomizing at the group or region level — is the standard fix, with clustered standard errors for inference.
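
A sketch of what this looks like in code (simulated data; the cluster structure, within-cluster correlation, and effect size are illustrative): randomize at the cluster level and cluster the standard errors at that same level.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Simulate 200 clusters (e.g., regions) with 50 users each
n_clusters, users_per_cluster = 200, 50
cluster_id = np.repeat(np.arange(n_clusters), users_per_cluster)

# Randomize at the cluster level: every user in a cluster shares one assignment
cluster_treated = rng.integers(0, 2, size=n_clusters)
d = cluster_treated[cluster_id]

# Outcome with a cluster-level random effect (within-cluster correlation)
cluster_effect = rng.normal(0, 0.5, size=n_clusters)[cluster_id]
y = 0.10 + 0.02 * d + cluster_effect + rng.normal(0, 1, size=d.size)

df = pd.DataFrame({"y": y, "d": d, "cluster": cluster_id})

# OLS with standard errors clustered at the unit of randomization
fit = smf.ols("y ~ d", data=df).fit(cov_type="cluster",
                                    cov_kwds={"groups": df["cluster"]})
print(fit.summary().tables[1])
```

Clustering the standard errors at the unit of randomization keeps the inference honest about the effective sample size: 200 clusters, not 10,000 users.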

Non-stationarity. User behavior changes over time (weekday vs weekend, novelty effects, seasonal trends). This makes pre-post comparisons unreliable and motivates concurrent randomization — running treatment and control simultaneously rather than sequentially.
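
A small simulation illustrates the problem (the drift and effect size are invented): a sequential rollout confounds the treatment effect with the time trend, while concurrent randomization balances the trend across arms.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000  # users per period

# A drifting baseline: behavior shifts over the experiment window regardless
# of treatment (novelty effects, seasonality, weekday/weekend mix)
time = np.linspace(0, 1, 2 * n)
baseline = 0.10 + 0.05 * time
true_ate = 0.02

# Sequential rollout: control in the first period, treatment in the second
y_pre = baseline[:n] + rng.normal(0, 1, n)
y_post = baseline[n:] + true_ate + rng.normal(0, 1, n)
print(y_post.mean() - y_pre.mean())  # ~0.045: the true effect plus 0.025 of drift

# Concurrent randomization over the same window
d = rng.integers(0, 2, size=2 * n)
y = baseline + true_ate * d + rng.normal(0, 1, 2 * n)
print(y[d == 1].mean() - y[d == 0].mean())  # ~0.02: the true ATE
```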

Multiple testing in model evaluation

Evaluating an LLM typically involves many benchmarks: reasoning, coding, math, safety, factual accuracy, instruction following. Each benchmark produces a p-value (or a confidence interval for the performance difference). Testing across many benchmarks without correction inflates the false discovery rate — the fraction of “improvements” that are actually noise.

The tools from Multiple Testing apply directly:

  • Bonferroni correction: divide \(\alpha\) by the number of benchmarks. Conservative but simple.
  • Benjamini-Hochberg: controls the false discovery rate (FDR) rather than the family-wise error rate. Less conservative, more appropriate when you expect some true improvements.
  • Pre-registration: specify the primary benchmark before running the evaluation. Secondary benchmarks are exploratory and should be flagged as such.
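
Both corrections are available in statsmodels. Here is a minimal sketch on hypothetical p-values from 15 benchmark comparisons (the numbers are invented for illustration):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from 15 benchmark comparisons (illustrative only)
p_values = np.array([0.001, 0.004, 0.012, 0.021, 0.034, 0.038, 0.045,
                     0.049, 0.062, 0.081, 0.11, 0.19, 0.27, 0.44, 0.71])

# Bonferroni: controls the family-wise error rate
reject_bonf, _, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg: controls the false discovery rate
reject_bh, _, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("significant without correction:", (p_values < 0.05).sum())  # 8
print("Bonferroni rejections:", reject_bonf.sum())                 # 1
print("Benjamini-Hochberg rejections:", reject_bh.sum())           # 2
```

Eight uncorrected "wins" shrink to one or two once the number of comparisons is taken into account, which is exactly the caution in the example below.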

A common pattern in AI evaluation. A paper reports improvements on 12 out of 15 benchmarks. But if the improvements are small and no multiple testing correction is applied, several of those 12 may be false positives. The statistical bar for claiming improvement should scale with the number of comparisons — exactly the lesson from Multiple Testing.

Observational evaluation and its limits

Not all evaluations can be randomized. Sometimes you want to know: “Did deploying this AI system improve outcomes?” without a controlled experiment. This is an observational causal inference problem, and the standard tools apply:

  • Difference-in-differences: compare outcomes before and after deployment, relative to a control group that didn’t receive the system
  • Regression discontinuity: if deployment was based on a threshold (e.g., rolled out to users above a certain engagement level), exploit the discontinuity
  • Synthetic control: construct a counterfactual from similar units that weren’t treated

Each of these requires identification assumptions — parallel trends, continuity, no anticipation — that must be argued for on substantive grounds. The methods are covered in the causal inference course.
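
As one worked example, the difference-in-differences estimate is the coefficient on the interaction of the treated-group and post-period indicators. The sketch below uses simulated data in which parallel trends hold by construction; the variable names, effect sizes, and selection gap are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 40_000

# Simulated deployment: "treated" organizations adopt the AI system in the
# post period; control organizations never do. Both share a common time trend.
treated = rng.integers(0, 2, size=n)   # ever-treated indicator
post = rng.integers(0, 2, size=n)      # pre/post deployment period
selection_gap = 0.30 * treated         # better organizations adopt first
common_trend = 0.10 * post             # shared trend (parallel by construction)
true_effect = 0.05

y = (0.50 + selection_gap + common_trend
     + true_effect * treated * post
     + rng.normal(0, 0.5, size=n))

df = pd.DataFrame({"y": y, "treated": treated, "post": post})
fit = smf.ols("y ~ treated * post", data=df).fit()

# A naive post-period comparison confounds the effect with selection
naive = (df.loc[(df.post == 1) & (df.treated == 1), "y"].mean()
         - df.loc[(df.post == 1) & (df.treated == 0), "y"].mean())
print(naive)                       # ~0.35: selection gap plus effect
print(fit.params["treated:post"])  # ~0.05: the DiD estimate of the true effect
```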

The key point from Prediction vs Causation applies here: measuring the causal impact of an AI system requires an identification strategy, not just a before-after comparison. A correlation between AI adoption and improved outcomes may reflect selection (better organizations adopt AI first), not a causal effect.

Connecting to the course

This page applies tools from throughout the course to a specific domain:

  • Power and p-values: the foundation of experimental design, whether the treatment is a drug or a prompt
  • Multiple Testing: essential when evaluating across many benchmarks or metrics
  • Clustered SEs: necessary when randomization is at the cluster level or users are not independent
  • Heteroskedasticity: treatment effects may vary across user segments, requiring robust standard errors
  • Prediction vs Causation: the overarching distinction between what AI systems optimize and what causal evaluations require