Constance Ferragu will present her MSE talk "Vendi-Decoding: Diverse Sampling for Protein Sequence Design" on Wednesday, April 24, at 9:30 AM in CS 402.
Advisor: Adji Bousso Dieng
Reader: Olga Troyanskaya
Abstract:
Protein sequence models have become increasingly valuable for the design of novel proteins. These models learn distributions over amino acids at each sequence position. However, decoding from them is challenging because the sequence space is exponentially large. A key limitation of existing decoding methods is their lack of diversity and exploration: they tend to prioritize high-likelihood tokens, which yields repetitive or near-identical sequences. Generating diverse sequences with high naturalness is crucial for thorough exploration of the sequence space and, in turn, for the discovery of novel protein sequences.
In this thesis, we propose Vendi Decoding, a sequence decoding algorithm designed to improve the efficiency of exploring sequence space and the diversity of decoded sequence sets. Our method leverages the Vendi Score, a statistical measure of diversity, to select edit positions that will most effectively improve our diversity objective and to guide the model’s hidden representations towards diverse decoding steps. Our results demonstrate that Vendi Decoding can iteratively refine a seed sequence into a set of diverse sequences more rapidly, while ensuring that the quality of sequences does not deteriorate.
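To make the diversity objective concrete, here is a minimal sketch of a Vendi Score computation, assuming its commonly cited definition: the exponential of the Shannon entropy of the eigenvalues of the normalized similarity matrix K/n. The similarity function (fraction of matching positions between equal-length sequences) and the names `identity_kernel` and `vendi_score` are assumptions chosen for illustration and are not taken from the thesis. The sketch does not cover how Vendi Decoding selects edit positions or guides hidden representations.

```python
import numpy as np


def identity_kernel(seqs):
    """Pairwise similarity: fraction of positions at which two sequences match.

    Hypothetical kernel for illustration; assumes all sequences have equal length.
    It is symmetric, positive semidefinite, and has ones on the diagonal.
    """
    n = len(seqs)
    K = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            matches = sum(a == b for a, b in zip(seqs[i], seqs[j]))
            K[i, j] = K[j, i] = matches / len(seqs[i])
    return K


def vendi_score(K):
    """exp(Shannon entropy of the eigenvalues of K / n)."""
    n = K.shape[0]
    eigvals = np.linalg.eigvalsh(K / n)
    # Drop zero (or numerically negative) eigenvalues, using 0 * log(0) = 0.
    eigvals = eigvals[eigvals > 1e-12]
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


identical = ["ACDE", "ACDE", "ACDE"]   # no diversity
disjoint = ["AAAA", "CCCC", "DDDD"]    # no shared residues at any position
mixed = ["ACDE", "ACDF", "GHIK"]       # two near-duplicates and one outlier

for name, seqs in [("identical", identical), ("disjoint", disjoint), ("mixed", mixed)]:
    print(name, round(vendi_score(identity_kernel(seqs)), 3))
```

Under this definition the score behaves like an effective number of distinct sequences: it is close to 1 when every sequence in the set is the same and close to n when the n sequences share nothing under the chosen kernel.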