Tyler Zhu will present his General Exam "Unifying Specialized Visual Encoders for Video Language Models" on Tuesday, January 14, 2025 at 10:00 AM in CS 402 and via Zoom.

Zoom link: https://princeton.zoom.us/my/tylerzhu

Committee Members: Olga Russakovsky (advisor), Jia Deng, Danqi Chen

Abstract:

The recent advent of Large Language Models (LLMs) has ushered sophisticated reasoning capabilities into the realm of video through Video Large Language Models (VideoLLMs). However, VideoLLMs currently rely on a single vision encoder for all of their visual processing, which limits the amount and type of visual information that can be conveyed to the LLM. Our method, MERV (Multi-Encoder Representation of Videos), instead leverages multiple frozen visual encoders to create a unified representation of a video, providing the VideoLLM with a comprehensive set of specialized visual knowledge. Spatio-temporally aligning the features from each encoder allows us to tackle a wider range of open-ended and multiple-choice video understanding questions and outperform prior state-of-the-art works on their data mixes. MERV is up to 3.7% better in accuracy than Video-LLaVA across the standard suite of video understanding benchmarks, while also having a better Video-ChatGPT score. We also improve upon SeViLA, the previous best on zero-shot Perception Test accuracy, by 2.2%. MERV introduces minimal extra parameters and trains faster than equivalent single-encoder approaches. Finally, we provide qualitative evidence that our model captures domain knowledge from each encoder simultaneously, such as on the motion classification tasks found in Something-Something v2. Our results offer promising directions for future research in utilizing multiple vision encoders for comprehensive video understanding.

Reading List:

https://docs.google.com/document/d/1jBXhv-IhFRsFsC6HblTb9ZE4nZkQIfGgm2-NAMOOuYA/edit?usp=sharing

Everyone is invited to attend the talk, and faculty who wish to remain for the oral exam that follows are welcome to do so.