Modern Vision Language Models (VLMs) have demonstrated remarkable versatility across a wide variety of applications involving multimodal reasoning, captioning, and analysis. However, VLMs still struggle with the fundamental tasks of object detection and localization, especially when compared to traditional Convolutional Neural Network (CNN) models. Contemporary VLMs tend to allocate little attention to non-textual inputs such as image tokens, and they lack the visual inductive biases present in CNNs, which makes it difficult for them to identify challenging objects in complex scenes. To address these limitations, we propose a detection-based dynamic attention steering system that uses the locality insight from convolutions to efficiently steer a VLM's attention toward the more relevant regions of an image, as outlined by bounding boxes from a CNN-based detector. At each inference, the steering intensity is scaled dynamically according to the confidence of the bounding boxes. Extensive evaluations across multiple state-of-the-art VLMs demonstrate substantial improvements on object localization-oriented benchmarks, with accuracy gains of up to 4%. These results show the effectiveness of combining different model architectures to harness their respective strengths for advancing VLM capabilities.
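
To make the idea of confidence-scaled steering concrete, the following is a minimal sketch and not the system's actual implementation. It assumes, purely for illustration, that steering is applied as an additive bias on the pre-softmax attention logits of image patch tokens, that bounding boxes are given in normalized coordinates, and that a patch is steered when its center falls inside a box. The function names `patch_bias` and `steered_attention` and the scaling parameter `alpha` are hypothetical.

```python
import math


def patch_bias(boxes, grid_h, grid_w, alpha=1.0):
    """Return a flat list of additive attention biases, one per image patch.

    boxes: list of (x0, y0, x1, y1, confidence) with coordinates
    normalized to [0, 1] (an assumed detector output format).
    A patch whose center lies inside a box receives alpha * confidence;
    when boxes overlap, the largest such value is kept. Other patches get 0.
    """
    bias = [0.0] * (grid_h * grid_w)
    for row in range(grid_h):
        for col in range(grid_w):
            # Center of this patch in normalized image coordinates.
            cx = (col + 0.5) / grid_w
            cy = (row + 0.5) / grid_h
            idx = row * grid_w + col
            for x0, y0, x1, y1, conf in boxes:
                if x0 <= cx <= x1 and y0 <= cy <= y1:
                    bias[idx] = max(bias[idx], alpha * conf)
    return bias


def steered_attention(logits, bias):
    """Softmax over attention logits after adding the per-patch bias."""
    shifted = [logit + b for logit, b in zip(logits, bias)]
    peak = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]
```

Under these assumptions, a single box with confidence 0.9 covering the top-left quadrant of a 2x2 patch grid raises the attention weight of the top-left patch above the other three, while a low-confidence box would shift the weights only slightly. With no boxes, the bias is zero everywhere and the attention distribution is unchanged.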