Tyler Zhu will present his General Exam, "Unifying Specialized Visual Encoders for Video Language Models," on Tuesday, January 14, 2025 at 10:00 AM in CS 402 and via Zoom.

Zoom link: https://princeton.zoom.us/my/tylerzhu

Committee Members: Olga Russakovsky (advisor), Jia Deng, Danqi Chen

Abstract: The recent advent of Large Language Models (LLMs) has ushered sophisticated reasoning capabilities into the realm of video through Video Large Language Models (VideoLLMs). However, VideoLLMs currently rely on a single vision encoder for all of their visual processing, which limits the amount and type of visual information that can be conveyed to the LLM. Our method, MERV (Multi-Encoder Representation of Videos), instead leverages multiple frozen visual encoders to create a unified representation of a video, providing the VideoLLM with a comprehensive set of specialized visual knowledge. Spatio-temporally aligning the features from each encoder allows us to tackle a wider range of open-ended and multiple-choice video understanding questions and outperform prior state-of-the-art works on their data mixes. MERV is up to 3.7% better in accuracy than Video-LLaVA across the standard suite of video understanding benchmarks, while also having a better Video-ChatGPT score. We also improve upon SeViLA, the previous best on zero-shot Perception Test accuracy, by 2.2%. MERV introduces minimal extra parameters and trains faster than equivalent single-encoder approaches. Finally, we provide qualitative evidence that our model captures domain knowledge from each encoder simultaneously, such as on the motion classification tasks found in Something-Something v2. Our results offer promising directions for future research in utilizing multiple vision encoders for comprehensive video understanding.

Reading List: https://docs.google.com/document/d/1jBXhv-IhFRsFsC6HblTb9ZE4nZkQIfGgm2-NAMOO...

Everyone is invited to attend the talk, and those faculty wishing to remain for the oral exam following are welcome to do so.