CS Colloquium Speaker
Speaker: Tri Dao, Stanford University
Date: Thursday, March 2 
Time: 12:30pm EST
Location: CS 105
Host: Jia Deng
Event page: https://www.cs.princeton.edu/events/26343

Title: Hardware-aware Algorithms for Efficient Machine Learning

Abstract: Training of machine learning (ML) models will continue to consume more compute cycles, their inference will proliferate across more kinds of devices, and their capabilities will be applied in more domains. Central goals for this future are to make ML models efficient, so they remain practical to train and deploy, and to unlock new application domains with new capabilities. We describe recent developments in hardware-aware algorithms that improve the efficiency-quality tradeoff of ML models and equip them with long context. In the first half, we focus on structured sparsity, a natural approach to mitigating the extensive compute and memory costs of large ML models. We describe a line of work on learnable fast transforms that, thanks to their expressiveness and efficiency, yields some of the first sparse training methods to speed up large models in wall-clock time (2x) without compromising their quality. In the second half, we focus on efficient Transformer training and inference for long sequences. We describe FlashAttention, a fast and memory-efficient algorithm that computes attention exactly, with no approximation. By carefully accounting for reads and writes between levels of the memory hierarchy, FlashAttention is 2-4x faster and uses 10-20x less memory than the best existing attention implementations, allowing us to train higher-quality Transformers with 8x longer context. Within just six months of its release, FlashAttention has been widely adopted by some of the largest research labs and companies. We conclude with some exciting directions in ML and systems, such as software-hardware co-design, structured sparsity for scientific AI, and long context for new AI workflows and modalities.
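For readers unfamiliar with the core idea behind FlashAttention, the following is a minimal NumPy sketch of the tiling and online-softmax trick it builds on: attention is computed one key/value block at a time with a running max and normalizer, so the full N x N score matrix is never materialized. This is only an illustration of the algorithmic idea, not the actual CUDA implementation; all names and the block size are illustrative.

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Compute softmax(Q K^T / sqrt(d)) V blockwise over K/V.

    Illustrative sketch of the online-softmax/tiling idea: only a
    (N x block) slice of the score matrix exists at any time.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((N, V.shape[1]))
    m = np.full(N, -np.inf)   # running row-wise max of scores seen so far
    l = np.zeros(N)           # running softmax normalizer
    for start in range(0, K.shape[0], block):
        Kb = K[start:start + block]
        Vb = V[start:start + block]
        S = (Q @ Kb.T) * scale              # scores for this block only
        m_new = np.maximum(m, S.max(axis=1))
        alpha = np.exp(m - m_new)           # rescale earlier partial sums
        p = np.exp(S - m_new[:, None])
        l = l * alpha + p.sum(axis=1)
        out = out * alpha[:, None] + p @ Vb
        m = m_new
    return out / l[:, None]
```

Because each block is rescaled by the updated running max, the result matches standard (exact) attention up to floating-point error, which is why no approximation is involved.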

Bio: Tri Dao is a PhD student in Computer Science at Stanford, co-advised by Christopher Ré and Stefano Ermon. He works at the interface of machine learning and systems, and his research interests include sequence models with long-range memory and structured matrices for compact deep learning models. His work has received the ICML 2022 Outstanding Paper runner-up award.