Ameet S. Deshpande will present his General Exam "Guiding Attention for Self-Supervised Learning with Transformers" on Thursday, April 29, 2021 at 11AM via Zoom.

Ameet S. Deshpande will present his General Exam on Thursday, April 29, 2021 at 11AM via Zoom.

The members of his committee are as follows: Karthik Narasimhan (advisor), Danqi Chen, and Sanjeev Arora.

Zoom link: http://princeton.zoom.us/my/ameetsd

Everyone is invited to attend the talk, and those faculty wishing to remain for the oral exam following are welcome to do so. His abstract and reading list follow below.

Title: Guiding Attention for Self-Supervised Learning with Transformers

Abstract:
Recent advances in self-supervised pre-training have resulted in models with impressive downstream performance on several natural language processing (NLP) tasks. However, this has led to the development of enormous models, which often require days of training on non-commodity hardware (e.g., TPUs, distributed GPUs). Furthermore, studies have shown that it is quite challenging to train these large Transformer models successfully, requiring complicated learning schemes and extensive hyperparameter tuning. In this talk, we present a simple and effective technique for efficient self-supervised learning with bi-directional Transformers. Our approach is motivated by recent studies demonstrating that the self-attention patterns of trained models consist largely of non-linguistic regularities. We propose a computationally efficient auxiliary loss function that guides attention heads to conform to such patterns. Our method is agnostic to the actual pre-training objective and results in faster convergence and better performance on downstream tasks compared to the baselines, achieving state-of-the-art results in low-resource settings.

(An illustrative sketch of such an auxiliary attention-guidance loss appears after the reading list.)

Reading List:

Textbooks:
1. Speech and Language Processing, by Jurafsky and Martin. https://web.stanford.edu/~jurafsky/slp3/ed3book_dec302020.pdf
2. Natural Language Processing, by Jacob Eisenstein. https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf

Papers:

Transformers and pre-training in NLP
1. Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. "Attention is all you need."
2. Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-training of deep bidirectional transformers for language understanding."
3. Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. "RoBERTa: A robustly optimized BERT pretraining approach."
4. Clark, Kevin, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. "ELECTRA: Pre-training text encoders as discriminators rather than generators."

Modifying self-attention modules in Transformers
1. Kitaev, Nikita, Łukasz Kaiser, and Anselm Levskaya. "Reformer: The efficient transformer."
2. Child, Rewon, Scott Gray, Alec Radford, and Ilya Sutskever. "Generating long sequences with sparse transformers."
3. Beltagy, Iz, Matthew E. Peters, and Arman Cohan. "Longformer: The long-document transformer."
4. Tay, Yi, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. "Synthesizer: Rethinking self-attention in transformer models."
5. Wang, Sinong, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. "Linformer: Self-attention with linear complexity."
6. Zaheer, Manzil, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, et al. "Big Bird: Transformers for longer sequences."

Fixing self-attention patterns (machine translation)
1. You, Weiqiu, Simeng Sun, and Mohit Iyyer. "Hard-coded Gaussian attention for neural machine translation."
2. Raganato, Alessandro, Yves Scherrer, and Jörg Tiedemann. "Fixed encoder self-attention patterns in transformer-based machine translation."

Analyzing self-attention
1. Clark, Kevin, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. "What does BERT look at? An analysis of BERT's attention."
2. Kovaleva, Olga, Alexey Romanov, Anna Rogers, and Anna Rumshisky. "Revealing the dark secrets of BERT."
3. Voita, Elena, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. "Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned."
4. Michel, Paul, Omer Levy, and Graham Neubig. "Are sixteen heads really better than one?"
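For readers unfamiliar with the general idea in the abstract, below is a minimal, hypothetical sketch of what an auxiliary attention-guidance loss of this kind can look like in PyTorch, assuming a model that exposes per-layer attention probabilities. The pattern choices (previous-token and next-token), the head assignment, and the weight lambda_guide are illustrative assumptions, not the speaker's actual formulation.

    # Hypothetical sketch (not the speaker's exact method): an auxiliary loss that
    # nudges a few attention heads toward fixed, non-linguistic patterns such as
    # "attend to the previous token" or "attend to the next token".
    import torch

    def fixed_patterns(seq_len, device):
        """Build two simple target attention patterns: previous-token and next-token."""
        eye = torch.eye(seq_len, device=device)
        prev_tok = torch.roll(eye, shifts=-1, dims=1)   # row i attends to token i-1
        prev_tok[0] = eye[0]                            # first token attends to itself
        next_tok = torch.roll(eye, shifts=1, dims=1)    # row i attends to token i+1
        next_tok[-1] = eye[-1]                          # last token attends to itself
        return torch.stack([prev_tok, next_tok])        # (num_patterns, seq_len, seq_len)

    def attention_guidance_loss(attn_probs, patterns, eps=1e-9):
        """
        attn_probs: (batch, num_heads, seq_len, seq_len) softmax attention from one layer.
        patterns:   (num_patterns, seq_len, seq_len) fixed target distributions.
        Guides the first num_patterns heads toward the targets via cross-entropy.
        """
        num_patterns = patterns.size(0)
        guided = attn_probs[:, :num_patterns]               # heads we constrain (an assumption)
        targets = patterns.unsqueeze(0).expand_as(guided)   # broadcast over the batch
        return -(targets * torch.log(guided + eps)).sum(-1).mean()

    # Usage sketch: add the auxiliary term to the ordinary pre-training objective
    # (e.g., masked language modeling). lambda_guide is a hypothetical weight.
    # total_loss = mlm_loss + lambda_guide * attention_guidance_loss(attn_probs, patterns)

Because the auxiliary term only compares attention maps against precomputed targets, it adds little overhead on top of the pre-training objective itself, which is consistent with the abstract's emphasis on computational efficiency.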