AI Engineering · 12:30–12:48 · Cinema 1

Evaluation Precedes Evolution: Rubrics as the Load-Bearing Infrastructure of Self-Improving Agents

Tanya Dixit
Forward Deployed Engineer · Google

The 2025–2026 wave of "self-evolving" agents (prompt-tuning loops, memory accumulation, agent swarms, GEPA, ReasoningBank) shares a structure that is sometimes lost in the jargon: every one of them is hill-climbing on a judge. The judge is the fitness function. When it's sharp, the agent compounds. When it's vague, the loop drifts confidently in the wrong direction.
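As a rough illustration of "hill-climbing on a judge", here is a toy sketch, not taken from any of the systems named above. The candidate is just an integer, and `true_quality`, `sharp_judge`, `vague_judge`, and `hill_climb` are all hypothetical names invented for this example. The loop only ever sees the judge's score, so a judge that rewards the wrong thing gets optimized just as diligently as one that rewards the right thing.

```python
import random
from typing import Callable


def true_quality(x: int) -> int:
    """Ground truth the loop never sees directly (assumed): best at x == 7."""
    return -((x - 7) ** 2)


def sharp_judge(x: int) -> int:
    """A judge that tracks true quality exactly."""
    return true_quality(x)


def vague_judge(x: int) -> int:
    """A judge that rewards 'more', whether or not it helps."""
    return x


def hill_climb(judge: Callable[[int], int], start: int = 0,
               steps: int = 200, seed: int = 0) -> int:
    """Keep a proposal only when the judge scores it strictly higher."""
    rng = random.Random(seed)
    best, best_score = start, judge(start)
    for _ in range(steps):
        proposal = best + rng.choice([-1, 1])  # toy stand-in for a prompt edit
        score = judge(proposal)
        if score > best_score:
            best, best_score = proposal, score
    return best


sharp_result = hill_climb(sharp_judge)
vague_result = hill_climb(vague_judge)
print("sharp judge:", sharp_result, "true quality:", true_quality(sharp_result))
print("vague judge:", vague_result, "true quality:", true_quality(vague_result))
```

Under the sharp judge the loop should settle at the true optimum and stop. Under the vague judge it keeps accepting "improvements" well past that point, so its reported score rises while true quality falls.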

This talk argues that rubrics, not prompts or scaffolds, are the load-bearing infrastructure of agent improvement. We'll walk through three concrete failures from recent work: prompt optimizers that regressed without rollback (OpenAI), memory systems that hurt performance as they grew (ReasoningBank), and 18 months of capability gains that delivered almost no reliability gain (Princeton). All three share a root cause: the rubric was the bottleneck, and nobody was looking at it.

Then we'll build one. We'll cover five principles for a rubric that can actually drive evolution: stack deterministic checks before semantic ones, score failures explicitly, measure beyond accuracy, version the rubric itself, and keep it cheap. You'll leave with a checklist you can apply to your next agent before you ship a single optimization loop.
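One way the five principles might fit together in code is sketched below. This is an assumption-laden illustration, not the rubric presented in the talk: `Run`, `Verdict`, `evaluate`, the two checks, the 5-second latency budget, and the `RUBRIC_VERSION` label are all hypothetical. The semantic judge is passed in as a plain callable so the sketch stays independent of any particular model API.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical label; bump it whenever a check or threshold changes.
RUBRIC_VERSION = "v1"


@dataclass(frozen=True)
class Run:
    output: str
    latency_s: float  # measured beyond accuracy: how long the agent took


@dataclass(frozen=True)
class Verdict:
    rubric_version: str
    score: float
    failures: Tuple[str, ...]  # explicit, named failure modes
    semantic_called: bool


def not_json(run: Run) -> Optional[str]:
    """Deterministic check: the output must parse as JSON."""
    try:
        json.loads(run.output)
    except json.JSONDecodeError:
        return "output_not_json"
    return None


def too_slow(run: Run) -> Optional[str]:
    """Deterministic check: assumed latency budget of 5 seconds."""
    return "latency_over_budget" if run.latency_s > 5.0 else None


DETERMINISTIC_CHECKS = (not_json, too_slow)


def evaluate(run: Run, semantic_judge: Callable[[str], float]) -> Verdict:
    """Run cheap deterministic checks first; call the judge only if they pass."""
    failures = []
    for check in DETERMINISTIC_CHECKS:
        label = check(run)
        if label is not None:
            failures.append(label)
    if failures:
        # Skipping the semantic judge here is what keeps the rubric cheap.
        return Verdict(RUBRIC_VERSION, 0.0, tuple(failures), False)
    score = min(1.0, max(0.0, semantic_judge(run.output)))
    return Verdict(RUBRIC_VERSION, score, (), True)


judge_calls = []


def fake_judge(text: str) -> float:
    """Stand-in for a model-based judge; records each call it receives."""
    judge_calls.append(text)
    return 0.8


good = evaluate(Run('{"answer": 1}', latency_s=1.0), fake_judge)
bad = evaluate(Run("not json", latency_s=9.0), fake_judge)
print(good)
print(bad)
```

Because every verdict carries the rubric version and a list of named failures, scores from different rubric revisions are not silently compared, and a regression can be traced to the specific check that fired.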

In collaboration with Pouya Ghiasnezhad Omran.