Sijia Liu will present her General Exam, "Towards Autonomous Language Agents for Long Horizon Tasks," on Tuesday, May 12, 2026 at 3:00 PM in CS 401 and via Zoom.

Zoom link: https://princeton.zoom.us/j/3816009387?pwd=TfbXJiyYLsJpDhsnkmVa6waA5VbLQs.1

Committee Members: Karthik Narasimhan (advisor), Danqi Chen, Zhuang Liu

Abstract: LLM agents have made rapid progress on complex reasoning and software engineering tasks, yet they still struggle to make sustained, autonomous progress over long horizons. Recent work has begun using state-of-the-art coding agents (e.g., Claude Code) to automate AI research, with promising signals of agents iteratively hill-climbing toward superhuman performance. However, these efforts remain confined to well-specified subproblems and rely on substantial human-in-the-loop supervision to gauge progress. To better understand and address this gap, we study MLE-Bench, a widely used benchmark that evaluates LLM agents on Kaggle-style machine learning engineering competitions over ultra-long horizons (24 hours) under realistic resource constraints (data, memory, GPUs, and network access). Preliminary analysis shows that even frontier proprietary models fall short at proposing novel ideas to iteratively improve their solutions, and often get stuck in debugging loops or commit to suboptimal model choices. Scaffold design remains crucial even for the strongest proprietary models. Open-weights models lag substantially behind, and naive trajectory distillation from strong teachers fails to close the gap. Together, these results illuminate the core challenges on the path to fully autonomous language agents for long-horizon tasks and open-ended discovery.

Reading List: https://docs.google.com/document/d/1Uby4uBlan6YTQXwx1wTXJvcaB74q7iWU_e-BZQyq...

Everyone is invited to attend the talk, and those faculty wishing to remain for the oral exam following are welcome to do so.