Wuwei Zhang will present her MSE Talk "Improving Long-Context Reasoning with Query-Focused Retrieval Heads" on Tuesday, April 21, 2026 at 4:00 p.m. in CS 302.
Committee Members: Prof. Danqi Chen (advisor), Prof. Karthik Narasimhan (reader).

All are welcome to attend.

Title: Improving Long-Context Reasoning with Query-Focused Retrieval Heads

Abstract: Large language models (LLMs) have demonstrated strong performance across many tasks, yet they often struggle to effectively utilize information from long contexts. Recent studies suggest that a subset of attention heads, known as retrieval heads, plays a critical role in retrieving relevant information from long contexts. However, existing methods for identifying these heads rely on copy-paste behavior in synthetic tasks, which may not reflect real-world retrieval requirements. In this thesis, we investigate how retrieval heads contribute to long-context reasoning and propose methods to better utilize them during inference. First, we introduce Query-Focused Retrieval Heads (QRHead), a method for identifying attention heads that selectively retrieve information relevant to the input query. By aggregating the accumulated attention mass of QRHead, we further develop QRRetriever, a training-free retrieval approach that selects the most relevant context segments for downstream reasoning tasks. Building on QRHead, we propose Dynamic Attention-Scaling Decoding (DySCO), an inference-time method that dynamically rescales attention based on query-focused retrieval signals, enabling models to more effectively access relevant information during generation. Our methods achieve substantial improvements on long-context reasoning benchmarks across multiple model families, without requiring additional training. Together, these results demonstrate that retrieval heads provide a useful lens for understanding and improving long-context reasoning in LLMs, and that inference-time interventions guided by these mechanisms can lead to more effective context utilization.