Your Agents Pass Every Benchmark—Then Memory Breaks Them in Production
You add memory to your agent, it works great in testing, and you ship it. A few weeks later, outputs start getting worse and nobody can figure out why. The agent is pulling in old information that's no longer true, retrieving context that's only loosely related and clutters its reasoning, and sometimes carrying forward bad data that quietly corrupts every response after it. Standard evals won't catch any of this because they test single turns, not how memory behaves over hundreds of sessions. In this talk, we will walk through practical design principles and evaluation patterns you can implement to detect memory degradation before your users notice it. You'll leave knowing how to design and evaluate memory-enabled agents so that memory actually makes your agent more reliable instead of silently breaking it.