Sowmya Thanvantri will present her General Exam "Generalized Combinations of Experts in Multi-Head Attention" on Monday, April 27, 2026 at 1:30 PM in CS 401.
Sowmya Thanvantri will present her General Exam "Generalized Combinations of Experts in Multi-Head Attention" on Monday, April 27, 2026 at 1:30 PM in CS 401.

Committee Members: Ryan Adams (advisor), Tom Griffiths, Karthik Narasimhan

Abstract: The success of large language models today is attributed largely to the attention mechanism used in transformers, which assigns a measure of uncertainty over other tokens. This idea was extended to multi-head attention, allowing models to learn several different distributions of uncertainty over tokens. However, multi-head attention does not take full advantage of the different distributions because the outputs of the heads are concatenated and passed through a linear map, which prevents learning more expressive mappings. To enhance expressivity, a mixture model could aggregate information between these distributions. Alternatively, taking a Bayesian perspective, a product-based model across distributions could lead to a more accurate representation. To achieve both of these, we draw on ideas from mixture of experts (MoE) and product of experts (PoE) and propose a generalized combination of expert heads in multi-head attention to allow models to learn a richer representation of attention.

Reading List: https://docs.google.com/document/d/1PaOPsZcP-rYhBvUYQBqSR-c_BAB3HIYI4rGCSE7J...

Everyone is invited to attend the talk, and those faculty wishing to remain for the oral exam following are welcome to do so.
The time for this General Exam has been updated to 2:30 PM. Other details remain the same.

Sowmya Thanvantri will present her General Exam "Generalized Combinations of Experts in Multi-Head Attention" on Monday, April 27, 2026 at 2:30 PM in CS 401.