[talks] Adithya Bhaskar will present his General Exam "Effective mechanistic interpretability via pruning" on Monday, May 5, 2025 at 3:00 PM in FC 008 and via zoom.

30 Apr 2025

Adithya Bhaskar will present his General Exam "Effective mechanistic interpretability via pruning" on Monday, May 5, 2025 at 3:00 PM in FC 008 and via Zoom.

Zoom link: https://princeton.zoom.us/j/98937358810 

Committee Members: Danqi Chen (advisor), Sanjeev Arora, Benjamin Eysenbach 

Abstract: 

Mechanistic interpretability strives to explain language models bottom-up by understanding their components. In this talk, I will describe how pruning (removing less important parameters or connections in a model) can enable a new approach to interpretability. I will introduce a curious phenomenon as a case study: fine-tuning a language model with different random seeds can yield models with similar in-domain performance but distinct patterns of generalization. Pruning reveals that even a single model contains many subnetworks that match its in-domain performance but generalize differently. By connecting this result to a “grokking”-like phenomenon observed for out-of-domain generalization, I will develop a theory of why these subnetworks exist in the model and what role they fulfill.

Zooming out from this example, we will apply pruning to develop an efficient, accurate, and scalable tool for automatically discovering transformer circuits, the basic units of mechanistic interpretability. We develop Edge Pruning, a method that prunes edges rather than the parameters of a language model. Edge Pruning discovers circuits that are substantially more faithful to the model, as measured by KL divergence, while being more efficient than prior approaches and scaling to models 100x larger. We will see via an example how this tool enables future work to understand phenomena that only emerge in large (1B+ parameter) models, such as in-context learning.

Our results imply that pruning holds great promise for interpretability—both as a paradigm and as a standalone method. 

Reading List: 

https://docs.google.com/document/d/1VJ5busQNlb-er_yt7-ae34x2UC0KXZh2irvR0viO... 

Everyone is invited to attend the talk, and those faculty wishing to remain for the oral exam following are welcome to do so.

CS Grad Department