Alan Zhang will present his MSE talk "Detection-based Attention Steering System for Vision Language Models" on Friday, Apr 24th, 2026 at 2pm in CS 105.

All are welcome to attend.

Presenter: Alan Zhang (az0686)

Advisor: Ravi Netravali

Reader: Wyatt Lloyd

Abstract:

Modern Vision Language Models (VLMs) have demonstrated remarkable versatility in a wide variety of applications involving multimodal reasoning, captioning, and analysis. However, VLMs still struggle with the fundamental tasks of object detection and localization, especially when compared to traditional Convolutional Neural Network (CNN) models. Contemporary VLMs tend to allocate very little attention to non-textual inputs such as image tokens and lack the visual inductive biases present in CNNs, making it difficult for them to identify challenging objects in complex scenes. To address these limitations, we propose a detection-based dynamic attention steering system that uses the locality insight from convolutions to efficiently steer a VLM's attention toward the more relevant regions of an image, as outlined by bounding boxes from a CNN-based detector model. The steering intensity during each inference is dynamically scaled according to the confidence of the bounding boxes. Extensive evaluations across multiple state-of-the-art VLMs demonstrate substantial improvements on object-localization-oriented benchmarks, achieving up to a 4% accuracy gain. These results show the effectiveness of combining different model architectures to harness their respective strengths for advancing VLM capabilities.