Dan Friedman will present his General Exam "Finding Dataset Shortcuts with Grammar Induction" on Monday, October 3, 2022 at 3:00 PM in CS 301 and via Zoom.
Zoom link: https://princeton.zoom.us/j/8636741030
Committee Members: Danqi Chen (advisor), Tom Griffiths, Karthik Narasimhan
Abstract:
Many NLP datasets have been found to contain “shortcuts”: simple decision rules that achieve surprisingly high accuracy. However, it is difficult to automatically discover whether a dataset contains shortcuts. Prior work has used simple classifiers, which can find only simple patterns, or qualitative heuristics like saliency maps, which lack a clear statistical interpretation. In this work, we use probabilistic grammars to characterize and discover NLP shortcuts. We describe an approach for inducing expressive grammars from text, and show how the resulting grammars can reveal interesting shortcut features in a number of single-sentence and sentence-pair classification datasets. We also explore applications to other settings where it is useful to have a formal characterization of a distribution of text, including domain generalization and transfer learning.
Reading List:
https://docs.google.com/document/d/1vboaALDbdh4B3AVmVna73vTB8cO7LmbQRzrixEyNDjI/edit
Everyone is invited to attend the talk, and faculty who wish to remain for the oral exam that follows are welcome to do so.