Guillermo Sapiro and Sanketh Vedula will host guests at Princeton Precision Health's office (252 Nassau St, 2nd Floor) from 3:30 to 5:00pm on Wednesday, December 10, for a seminar titled "On decoding the inner workings of multimodal foundation models."

Speakers: Alberto Cazzaniga, Lorenzo Basile, and Diego Doimo, researchers at Area Science Park, Trieste, Italy.

Alberto Cazzaniga will present "On image-text communication in vision-language models."
Vision-language models (VLMs) integrate images and text efficiently, but how
they transmit visual information into text generation remains poorly understood.
We present two mechanistic findings that clarify image–text communication in
modern VLMs. First, using counterfactual multimodal queries, we isolate a small
set of attention heads that decide whether the model follows the image or its
internal knowledge; editing these heads reliably shifts behavior and reveals the
image regions that drive it. Second, comparing native and non-native VLMs, we
show that they rely on distinct pathways for visual-to-text transfer: non-native
models distribute information across many image tokens, whereas native models
depend on a single gate-like token whose removal severely degrades image
understanding. Together, these insights offer a clearer and more actionable view
of how VLMs process visual evidence.

Lorenzo Basile will present "Head Specialization in Vision, Language, and Multimodal Transformers."
Transformers often appear as opaque systems, but recent research shows that
they contain meaningful internal structure. A key example is head specialization,
where individual attention heads consistently encode specific concepts across
language, vision, and multimodal models. Some heads capture visual properties
like shape or color, while others represent linguistic or numerical information
such as sentiment or toxic words. Identifying these specialized heads not only
deepens interpretability but also provides practical tools for controlling model
behavior. By selectively amplifying or suppressing head activity, we can adjust
concept representations, adapt models to new tasks, and improve performance
with minimal parameter changes. This talk presents methods for discovering
head specialization, demonstrates its emergence across diverse architectures,
and shows how these structures can be leveraged for precise,
parameter-efficient interventions.
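
For readers curious what such a head-level intervention can look like in practice, below is a minimal, self-contained PyTorch sketch. It is illustrative only and not the speakers' implementation: a toy multi-head self-attention module with a per-head gain that can be set to amplify, suppress, or ablate individual heads before their outputs are mixed back together. All names (e.g. GatedMultiHeadAttention, head_gain) are hypothetical.

    import torch
    import torch.nn as nn

    class GatedMultiHeadAttention(nn.Module):
        """Toy multi-head self-attention with a per-head gain used to
        amplify (>1), suppress (<1), or ablate (=0) individual heads."""
        def __init__(self, d_model=64, n_heads=4):
            super().__init__()
            assert d_model % n_heads == 0
            self.n_heads, self.d_head = n_heads, d_model // n_heads
            self.qkv = nn.Linear(d_model, 3 * d_model)
            self.out = nn.Linear(d_model, d_model)
            # per-head gain; 1.0 leaves the model's behaviour unchanged
            self.head_gain = nn.Parameter(torch.ones(n_heads), requires_grad=False)

        def forward(self, x):
            B, T, D = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            # reshape to (batch, heads, tokens, head_dim)
            split = lambda t: t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
            q, k, v = split(q), split(k), split(v)
            attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
            per_head = attn @ v                              # (B, heads, T, d_head)
            # scale each head's contribution before the output projection
            per_head = per_head * self.head_gain.view(1, -1, 1, 1)
            merged = per_head.transpose(1, 2).reshape(B, T, D)
            return self.out(merged)

    # usage: ablate head 2 and amplify head 0, then run a forward pass
    mha = GatedMultiHeadAttention()
    x = torch.randn(1, 10, 64)
    with torch.no_grad():
        mha.head_gain[2] = 0.0
        mha.head_gain[0] = 2.0
    print(mha(x).shape)  # torch.Size([1, 10, 64])

In a real model the same effect is typically obtained with forward hooks on existing attention modules rather than a custom layer; the sketch only shows where in the computation the per-head scaling acts.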

Diego Doimo will present "The geometry of hidden representations of large transformer models."
In this talk, we will show how the geometric properties of hidden representations
can help us understand the semantic information encoded by transformers. In the
first part, we will focus on the intrinsic dimension of the internal transformer
representations, showing that it is a valuable tool for identifying the layers
encoding the semantic content of data across different domains, such as images,
biological sequences, and text. In the second part, we will analyze the probability
density of hidden representations. Specifically, we focus on how language
models solve a question-answering task with few-shot learning and fine-tuning.
We show that while both approaches can achieve similar performance, they
create very different density distributions in the hidden representations, changing
with a sharp geometrical transition in the middle of the network.
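
As a concrete illustration of one widely used intrinsic-dimension estimator, the TwoNN method of Facco et al. (2017), here is a short numpy/scikit-learn sketch. It is an example of the general technique only, not the speakers' code, and the function name is hypothetical; in practice one would apply it to the hidden representations extracted from each layer.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def twonn_intrinsic_dimension(X):
        """TwoNN estimator: uses the ratio of distances to the second and
        first nearest neighbours of each point (Facco et al., 2017)."""
        nn = NearestNeighbors(n_neighbors=3).fit(X)
        dists, _ = nn.kneighbors(X)          # column 0 is the point itself
        r1, r2 = dists[:, 1], dists[:, 2]
        mu = r2 / r1
        # maximum-likelihood estimate of the intrinsic dimension
        return len(X) / np.sum(np.log(mu))

    # toy check: points lying on a 2-D plane embedded in 50 dimensions
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 2)) @ rng.normal(size=(2, 50))
    print(twonn_intrinsic_dimension(X))      # close to 2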

For more information, please contact Sanketh Vedula at svedula@princeton.edu.

Getting to the seminar space currently requires climbing a set of stairs. If an accommodation is needed, please contact PPH in advance at PrincetonPPH@princeton.edu.


Thank you,

Princeton Precision Health