Our AI Hallucinated in Production: How We Fixed It With Evals
We shipped one of REA Group’s first generative AI features to production: Property Highlights, which turns long real-estate listings into three skimmable takeaways. The demo was easy. Real traffic wasn’t: hallucinations showed up in front of real users.
This talk covers how we built an evaluation stack to launch safely at scale. Basic guardrails (three bullets, length limits) didn’t catch the failures that mattered: made-up features, off-brand tone, and useless copy. We built a review tool for side-by-side prompt/model testing, defined a rubric for factuality, usefulness, and language quality, and scaled it with an LLM-as-judge calibrated to expert reviews to score thousands of listings daily. We then tied evals to real user feedback and business metrics, including a 10% engagement lift.
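To make the gap between structural guardrails and rubric-based evals concrete, here is a minimal sketch in Python. It is illustrative only and not the code behind Property Highlights: the function names, the 120-character limit, and the pass/fail label format are all assumptions. The first function shows why a format check alone passes a fabricated feature. The second shows one simple way to check an LLM judge against expert reviews, using a plain agreement rate over the same listings.

```python
# Hypothetical per-bullet length limit; the real limit is not stated in the abstract.
MAX_BULLET_CHARS = 120


def passes_format_guardrails(highlights: list[str], max_chars: int = MAX_BULLET_CHARS) -> bool:
    """Structural check only: exactly three non-empty bullets within the length limit.

    This says nothing about whether the bullets are true, useful, or on-brand.
    """
    return len(highlights) == 3 and all(
        0 < len(h.strip()) <= max_chars for h in highlights
    )


def agreement_rate(judge_labels: list[bool], expert_labels: list[bool]) -> float:
    """Fraction of listings where the LLM judge's pass/fail matches the expert's.

    Assumes both lists score the same listings in the same order.
    """
    if not judge_labels or len(judge_labels) != len(expert_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == e for j, e in zip(judge_labels, expert_labels))
    return matches / len(judge_labels)


# A listing that never mentions a pool: the first bullet is made up,
# yet the output is well-formed, so the format guardrail still passes it.
hallucinated = ["Heated swimming pool", "Two bathrooms", "Walk to the station"]
format_ok = passes_format_guardrails(hallucinated)

# Hypothetical factuality verdicts on four listings (True = pass).
judge = [True, False, True, True]
expert = [True, False, False, True]
calibration = agreement_rate(judge, expert)
```

Agreement rate is the simplest calibration signal. A chance-corrected statistic such as Cohen's kappa may be a better fit when most listings pass, since raw agreement can look high even for a judge that passes everything.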
You’ll get a practical pipeline and a repeatable way to iterate on LLM features using evals, not vibes.