Sayash Kapoor will present his General Exam, "Leakage and the reproducibility crisis in ML-based science," on Friday, May 6, 2022, at 9:00 AM via Zoom only.

 

Zoom link: https://princeton.zoom.us/j/99164890280 

 

Committee Members: Arvind Narayanan (advisor), Brandon Stewart, Olga Russakovsky

 

Abstract:

The use of Machine Learning (ML) methods for prediction and forecasting has become widespread across the quantitative sciences. However, ML-based science is susceptible to many known methodological pitfalls, including data leakage. In this work, we systematically investigate reproducibility issues in ML-based science and show that data leakage is indeed a widespread problem that has led to severe reproducibility failures. Specifically, through a survey of literature in research communities that have adopted ML methods, we identify 15 fields in which errors have been found, collectively affecting 304 papers and in some cases leading to wildly overoptimistic conclusions. Based on our survey, we present a fine-grained taxonomy of 8 types of leakage, ranging from textbook errors to open research problems. We argue for fundamental methodological changes to ML-based science so that cases of leakage can be caught before publication. To that end, we propose model cards for reporting scientific claims based on ML models and find that they would address all types of leakage identified in the 304 papers with errors in our survey. To investigate the impact of reproducibility errors and the efficacy of our model cards, we undertake a reproducibility study in a field where complex ML models are believed to vastly outperform older statistical models such as Logistic Regression (LR): Civil War prediction. We find that all papers claiming superior performance of complex ML models over LR models fail to reproduce due to data leakage, and that complex ML models do not perform substantively better than decades-old LR models. While none of these errors could have been caught by reading the papers, our model cards enable the detection of leakage in each case.
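
For readers unfamiliar with the pitfall the abstract refers to, the following is a minimal illustrative sketch (not drawn from the talk or the surveyed papers) of one textbook form of leakage: fitting a feature-selection step on the full dataset before splitting into train and test sets. The data, scikit-learn calls, and parameter values are assumptions for illustration only.

# Illustrative sketch of data leakage via pre-split feature selection (hypothetical example).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2000))   # pure noise features
y = rng.integers(0, 2, size=200)   # labels unrelated to X

# Leaky: feature selection sees every label, including the eventual test labels,
# so the held-out accuracy is typically inflated above chance.
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, random_state=0)
leaky = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("leaky accuracy:", leaky.score(X_te, y_te))

# Correct: split first, then keep selection inside a pipeline fit on the training set only;
# accuracy should now be near chance (about 0.5), as it must be for random labels.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clean = make_pipeline(SelectKBest(f_classif, k=20),
                      LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)
print("clean accuracy:", clean.score(X_te, y_te))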

 

Reading List:

https://docs.google.com/document/d/10Zt4UQTATXKb-aEOmNFi99xu4MbQiHcAXWc1hS3cArA/edit?usp=sharing 

 

Everyone is invited to attend the talk, and faculty who wish to remain for the oral exam that follows are welcome to do so.

 

 

Louis Riehl
Graduate Administrator
Computer Science Department, CS213
Princeton University
(609) 258-8014