Multi-Armed Bandits: The Scientific Shotgun for Evals
A/B testing is too rigid a tool for AI systems. A fixed traffic split keeps serving the worse variant for the whole experiment, and you keep paying for slower models while three providers ship SOTA updates in the same week.
Steal a trick from data science instead: use multi-armed bandits to organically surface the best models, prompts and harnesses. You want your evals to be more than scores. Make them an exploration in minimising regret.
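
To make the idea concrete, here is a minimal sketch of one common bandit strategy, Thompson sampling, applied to a pass/fail eval. Everything named here is an assumption for illustration: the arm names, the pass rates, the `ThompsonBandit` class and the `simulate` helper are hypothetical, and a real eval would replace the simulated coin flip with an actual graded model call.

```python
import random


class ThompsonBandit:
    """Bernoulli Thompson sampling over named arms (hypothetical eval variants)."""

    def __init__(self, arms, seed=None):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior per arm, stored as [passes + 1, fails + 1].
        self.stats = {arm: [1, 1] for arm in arms}

    def select(self):
        # Draw a plausible pass rate for each arm and play the highest draw.
        draws = {
            arm: self.rng.betavariate(a, b) for arm, (a, b) in self.stats.items()
        }
        return max(draws, key=draws.get)

    def update(self, arm, passed):
        # Index 0 counts passes, index 1 counts fails.
        self.stats[arm][0 if passed else 1] += 1


def simulate(true_pass_rates, rounds=2000, seed=0):
    """Run the bandit against made-up pass rates; return pull counts and regret."""
    rng = random.Random(seed)
    bandit = ThompsonBandit(true_pass_rates, seed=seed)
    best = max(true_pass_rates.values())
    pulls = {arm: 0 for arm in true_pass_rates}
    regret = 0.0
    for _ in range(rounds):
        arm = bandit.select()
        # Stand-in for running one eval case and grading it pass/fail.
        passed = rng.random() < true_pass_rates[arm]
        bandit.update(arm, passed)
        pulls[arm] += 1
        # Expected regret: how much worse this arm is than the best one.
        regret += best - true_pass_rates[arm]
    return pulls, regret


if __name__ == "__main__":
    # Hypothetical arms: each could be a model, a prompt, or a harness config.
    rates = {"model_a": 0.2, "model_b": 0.5, "model_c": 0.8}
    pulls, regret = simulate(rates)
    print(pulls, round(regret, 1))
```

Unlike a fixed 50/50 split, traffic drifts toward the stronger arm as evidence accumulates, so the weaker variants stop eating budget early. Adding a newly released model is just another key in the dictionary with a fresh prior.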