Towards Long-Horizon Tasks
This talk argues that without a deliberate focus on long-horizon tasks, even the most impressive models will remain brittle and unreliable in real-world applications. Short-form benchmarks and isolated prompts cannot capture the extended reasoning, planning, and execution that real-world problems demand. When models cannot maintain coherence across hundreds or thousands of steps, they fail in subtle but critical ways: losing track of sub-goals, failing to recover from errors, or drifting away from the original objective.
To address this, the talk proposes a new framework for measuring and training long-horizon capabilities, including explicit mechanisms for sub-goal setting, robust error recovery, and sustained persistence over extended timeframes. These are not mere incremental improvements but fundamental shifts in how we design and evaluate AI systems.