Ted Zadouri will present his General Exam "Hardware-Efficient Attention for Inference" on Friday, June 6, 2025 at 1:00 PM via Zoom.

Zoom link: https://princeton.zoom.us/j/94965926485

Committee Members: Tri Dao (advisor), Kai Li, Chi Jin

Abstract:

With the rise of test-time compute, inference efficiency increasingly drives progress in AI, demanding greater emphasis on inference-aware architectures. The sequential nature of decoding limits parallelism. For large batches and long contexts, the key-value (KV) cache often bottlenecks decoding: it consumes scarce GPU memory, and each decoding step must load the large cache from HBM. Fetching this cache dominates latency relative to the small matrix-vector computation performed per decoding step, causing prolonged low GPU utilization. This bottleneck hinders a wide range of use cases, including latency-sensitive requests, multi-step reasoning agents, long-context video models, and test-time compute scaling.

This work redesigns attention through the lens of arithmetic intensity to shrink the KV cache size, thereby accelerating decoding without sacrificing quality or distributed scalability. We first propose Grouped-Tied Attention (GTA), a simple variant that combines and reuses key and value states, reducing memory transfer while preserving accuracy. We then introduce Grouped Latent Attention (GLA), a parallel-friendly latent attention paired with low-level optimizations for fast decoding at full quality. Experiments show that GTA matches Grouped-Query Attention (GQA) quality while using roughly half the KV cache and that GLA matches Multi-head Latent Attention (MLA) and is easier to shard. Our optimized GLA kernel is up to 2x faster than FlashMLA, e.g., in a speculative decoding setting where the query length exceeds one. Additionally, loading a smaller KV cache per device further reduces latency and boosts throughput in online serving benchmarks.
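
(Illustrative note, not part of the talk materials: a rough back-of-envelope sketch of the arithmetic-intensity argument above, using assumed 7B-class hyperparameters with full multi-head KV. Each decoding step reads the entire KV cache from HBM yet performs only about one FLOP per byte loaded, far below the ratio modern GPUs need to be compute-bound, which is why shrinking the cache directly attacks the dominant cost.)

def decode_step_stats(n_layers=32, n_kv_heads=32, head_dim=128,
                      seq_len=8192, batch=64, bytes_per_elem=2):
    # Assumed hyperparameters for illustration only (roughly 7B-class, fp16 cache).
    # Bytes of K and V read from HBM for one decoding step:
    # 2 tensors (K and V) x layers x KV heads x head_dim x cached tokens x batch.
    kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem
    # FLOPs of the per-step attention matrix-vector products (q.K^T and attn.V),
    # counting a multiply-add as 2 FLOPs and ignoring the projections.
    flops = 2 * 2 * n_layers * n_kv_heads * head_dim * seq_len * batch
    return kv_bytes, flops, flops / kv_bytes

kv_bytes, flops, intensity = decode_step_stats()
print(f"KV cache read per step: {kv_bytes / 1e9:.0f} GB")
print(f"Attention FLOPs per step: {flops / 1e12:.2f} TFLOP")
print(f"Arithmetic intensity: {intensity:.1f} FLOP/byte (heavily memory-bound)")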

Reading List:

https://docs.google.com/document/d/1aNdCzisnXfuicP_7-sqVECLbILa55uwhWha3oXPuGoM/edit?usp=sharing

Everyone is invited to attend the talk, and those faculty wishing to remain for the oral exam that follows are welcome to do so.