Wentao Guo will present his General Exam, "Accelerating MoE with IO-aware and tile-aware optimizations," on Friday, May 22, 2026, at 12:30 PM in CS 401 and via Zoom.
Zoom link: https://princeton.zoom.us/j/9317908194
Committee Members: Tri Dao (advisor), Ravi Netravali, Kai Li
Abstract:
Mixture-of-Experts (MoE) models have emerged as the dominant architecture for scaling language models without proportional increases in training compute. Recent frontier MoE models exhibit a clear trend toward fine-grained experts (smaller intermediate dimension) and higher sparsity (more total experts at constant activated count), which improve model quality per FLOP but introduce significant hardware inefficiencies. Activation memory grows linearly with expert granularity, arithmetic intensity drops as experts become smaller and sparser, and grouped GEMM kernels waste compute on tile-size padding when each expert receives few tokens. These costs compound on modern GPUs, where memory bandwidth rather than tensor-core throughput often determines kernel runtime.
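The drop in arithmetic intensity described above can be illustrated with a rough roofline-style estimate for a single expert GEMM. This is a minimal sketch, assuming one (tokens × d_model) by (d_model × d_expert) matrix multiply in a 2-byte format with each operand read or written exactly once; the function name and all dimensions are hypothetical and are not taken from the talk.

```python
def gemm_arithmetic_intensity(tokens, d_model, d_expert, bytes_per_elem=2):
    """FLOPs per byte for one (tokens x d_model) @ (d_model x d_expert) GEMM.

    Assumes ideal data movement: input, weight, and output each touch
    global memory exactly once (an optimistic simplification).
    """
    flops = 2 * tokens * d_model * d_expert  # one multiply and one add per MAC
    bytes_moved = bytes_per_elem * (
        tokens * d_model        # input activations
        + d_model * d_expert    # expert weights
        + tokens * d_expert     # output activations
    )
    return flops / bytes_moved


# Hypothetical shapes: a large expert with many tokens versus a
# fine-grained expert (smaller intermediate dimension) with few tokens.
coarse = gemm_arithmetic_intensity(tokens=4096, d_model=4096, d_expert=4096)
fine = gemm_arithmetic_intensity(tokens=256, d_model=4096, d_expert=512)
print(f"coarse expert: {coarse:.1f} FLOPs/byte")
print(f"fine-grained expert: {fine:.1f} FLOPs/byte")
```

Under these assumed shapes the fine-grained expert performs far fewer FLOPs per byte moved, which is the regime in which memory bandwidth rather than tensor-core throughput tends to bound runtime.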
In this talk, I will present SonicMoE, a hardware/algorithm co-design that addresses these challenges through three contributions. First, I will derive an activation-memory-efficient MoE algorithm for fine-grained MoEs by avoiding materialization of any intermediate tensor whose size scales with the number of activated experts. Second, I will describe IO-aware grouped GEMM kernels for both NVIDIA Hopper and Blackwell GPUs that fuse gather operations with global-memory loads, fuse activation functions with the GEMM epilogue, and hide memory IO behind tensor-core computation through Ping-Pong scheduling on Hopper and Tensor Memory double-buffering on Blackwell. Third, I will introduce a tile-aware token-rounding routing method that eliminates compute wasted on grouped-GEMM padding under highly sparse MoE settings while preserving downstream task performance. For a 7B fine-grained MoE on H100 GPUs, SonicMoE reduces activation memory by 45% and improves forward-pass throughput by 1.86× over ScatterMoE, achieving training throughput on 64 H100s comparable to ScatterMoE on 96 H100s. I will further outline ongoing research on extending these IO-aware optimization techniques to expert parallelism and low-precision training in MXFP8.
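The padding cost that motivates tile-aware routing can be sketched with simple counting. The example below is illustrative only and is not SonicMoE's routing method: it assumes a hypothetical tile size of 128 rows along the token dimension, assumes each expert's token count is padded up to a whole number of tiles, and rounds counts to the nearest tile multiple without modeling which tokens would be dropped or added.

```python
import math

TILE_M = 128  # assumed tile size along the token dimension (hypothetical)


def padded_rows(tokens_per_expert, tile_m=TILE_M):
    """Rows processed by a tiled grouped GEMM when every expert's
    token count is padded up to a whole number of tiles."""
    return sum(math.ceil(n / tile_m) * tile_m for n in tokens_per_expert)


def padding_waste(tokens_per_expert, tile_m=TILE_M):
    """Fraction of processed rows that are padding rather than real tokens."""
    total = padded_rows(tokens_per_expert, tile_m)
    useful = sum(tokens_per_expert)
    return 0.0 if total == 0 else (total - useful) / total


def round_to_tile(tokens_per_expert, tile_m=TILE_M):
    """Round each expert's token count to the nearest multiple of the
    tile size (ties round up). Only counts are adjusted here."""
    return [((n + tile_m // 2) // tile_m) * tile_m for n in tokens_per_expert]


# Hypothetical per-expert token counts in a highly sparse setting.
counts = [130, 60, 200, 10]
rounded = round_to_tile(counts)
print("padding waste before rounding:", round(padding_waste(counts), 3))
print("rounded counts:", rounded)
print("padding waste after rounding:", padding_waste(rounded))
```

Once every count is a multiple of the tile size, no tile contains padding rows. In this toy version the total number of routed tokens changes after rounding, so it says nothing about the effect on model quality.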
Reading List:
https://docs.google.com/document/d/1RiC19A195_AGxRByVPD304wK79JY2hEm7kW1hYdnwOc/edit?usp=sharing
Everyone is invited to attend the talk, and faculty who wish to remain for the oral exam that follows are welcome to do so.