Aatmik Gupta will present his MSE talk “Selecting LLM Training Data using Subjective Qualities” on Monday, April 29, 2024 at 9:30am in Friend 125.
The members of his committee are as follows: Danqi Chen (adviser) and Karthik Narasimhan (reader).

All are welcome to attend. Please see the abstract below.

I will discuss a novel method for selecting pre-training data that focuses on capturing abstract text qualities that are intuitively recognized by humans. The focus is on four qualities: writing style, required expertise, facts & trivia, and educational value. We find that LLMs are better at comparing these qualities between texts than at directly assessing a text's quality. We select data based on quality ratings to train language models from scratch, and discover the importance of balancing quality and diversity in data selection. Using quality ratings as selection criteria, our models outperform models trained on randomly selected data or on data selected using other methods, both on perplexity and on in-context learning task performance. Further analysis examines the ratings' characteristics, biases, and broader implications, including how they suppress or promote texts associated with certain languages, regions, topics, or social roles. The talk concludes with a study of the offensiveness of selected texts using toxicity models.