Title: Context-sensitive methods for learning from genomic data

Abstract:

Recent developments in biotechnology have enabled high-throughput measurement of several complementary cellular phenomena. The wealth of data generated by these technologies promises to support computational prediction of network models, but so far few approaches have succeeded in translating these data into accurate, experimentally testable hypotheses. My thesis focuses on machine learning and signal processing approaches that exploit the contextual clues often accompanying biological data to extract useful information and make precise predictions.

First, my thesis describes methods for using microarray technology to detect chromosomal aberrations. Amplification and deletion of portions of chromosomes often serve as a mechanism of rapid adaptation and have been associated with numerous cancers. Accurate and precise identification of when and where these changes occur will help us understand this important adaptive mechanism and enable steps towards effective cancer treatment. I discuss my solution to this problem, ChARM (Chromosomal Aberration Region Miner), a statistical signal processing approach based on expectation-maximization that uses chromosome context information to accurately identify even subtle chromosomal changes from either gene expression or comparative genomic hybridization (CGH) microarray data.
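To make the idea of combining chromosomal context with expectation-maximization concrete, the following is a minimal sketch and not ChARM itself. It assumes that context can be approximated by a moving average over genes in chromosomal order, and that a two-component Gaussian mixture fitted by EM separates unchanged from aberrant positions. The function names, the window size, and the toy log ratios are all hypothetical.

```python
import math

def moving_average(values, half_window=2):
    """Smooth values along chromosomal order (a crude stand-in for 'context')."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, iterations=50, min_var=1e-3):
    """Fit a two-component 1-D Gaussian mixture by EM.

    Returns (means, resp), where resp[i][k] is the posterior probability
    that point i belongs to component k (0 = lower mean, 1 = higher mean).
    """
    # Assumption: initialise from a hard split at the midpoint of the data range.
    mid = (min(xs) + max(xs)) / 2
    resp = [[1.0, 0.0] if x < mid else [0.0, 1.0] for x in xs]
    means, variances, mix = [0.0, 0.0], [1.0, 1.0], [0.5, 0.5]
    for _ in range(iterations):
        # M-step: re-estimate parameters from the current responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # empty component: keep its previous parameters
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(min_var, var)  # floor avoids a collapsing variance
            mix[k] = nk / len(xs)
        # E-step: recompute responsibilities under the updated parameters.
        resp = []
        for x in xs:
            p = [mix[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
    return means, resp

# Toy log ratios in chromosomal order: positions 10-15 mimic an amplified segment.
log_ratios = [0.1, -0.1] * 5 + [1.1, 0.9] * 3 + [0.1, -0.1] * 5
smoothed = moving_average(log_ratios)
means, resp = em_two_gaussians(smoothed)
aberrant = [i for i, r in enumerate(resp) if r[1] > 0.5]
```

In this sketch, `aberrant` holds the indices whose smoothed signal is more likely under the higher-mean component. Positions at the segment boundary are blurred by the smoothing, which is one reason a real method needs a more careful treatment of segment edges than a fixed window.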

Second, I have addressed the more general problem of integrating diverse types of functional genomic data (e.g., gene expression, protein-protein interactions, genetic interactions, sequence, and protein localization data) to understand gene function and predict biological networks. I discuss a system we have developed for integrating these diverse data and for user-driven network inference. My key contribution in this area is the notion of query context-sensitive prediction. This idea is based on the observation that most experimental technologies capture different biological processes with varying degrees of success, and thus each source of genomic data will vary in relevance depending on the biological process one is interested in predicting. Other key contributions of this work are the data visualization approaches that support intelligent, expert browsing of genomic data, a largely unexplored but powerful paradigm in bioinformatics applications. I discuss evaluation of these methods and examples of biological validation, where we have used our system to characterize several new genes.
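The intuition behind query context-sensitive prediction can be illustrated with a small sketch. It is not the system described in the thesis: it simply assumes that each data source provides pairwise scores between genes, weights each source by how strongly it links the query genes to one another, and ranks candidate genes by their weighted connection to the query. The function names, the weighting scheme, the gene labels, and the scores are all hypothetical.

```python
from itertools import combinations

def mean_pair_score(scores, genes):
    """Average pairwise score among `genes`; missing pairs count as 0."""
    pairs = list(combinations(sorted(genes), 2))
    if not pairs:
        return 0.0
    return sum(scores.get(frozenset(p), 0.0) for p in pairs) / len(pairs)

def dataset_weights(datasets, query):
    """Weight each dataset by how well it connects the query genes to each other."""
    raw = {name: mean_pair_score(scores, query) for name, scores in datasets.items()}
    total = sum(raw.values())
    if total == 0:
        # No dataset is informative for this query: fall back to uniform weights.
        return {name: 1.0 / len(datasets) for name in datasets}
    return {name: value / total for name, value in raw.items()}

def rank_candidates(datasets, query, candidates):
    """Rank candidates by their weighted average link to the query genes."""
    weights = dataset_weights(datasets, query)
    ranked = []
    for gene in candidates:
        score = 0.0
        for name, scores in datasets.items():
            link = sum(scores.get(frozenset((gene, q)), 0.0) for q in query) / len(query)
            score += weights[name] * link
        ranked.append((gene, score))
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Toy data: "expression" links the query genes A and B strongly, "physical" does not.
datasets = {
    "expression": {
        frozenset(("A", "B")): 0.9,
        frozenset(("A", "C")): 0.8,
        frozenset(("B", "C")): 0.7,
        frozenset(("A", "X")): 0.1,
    },
    "physical": {
        frozenset(("A", "B")): 0.1,
        frozenset(("A", "Y")): 0.9,
        frozenset(("B", "Y")): 0.9,
    },
}
query = ["A", "B"]
weights = dataset_weights(datasets, query)
ranking = rank_candidates(datasets, query, ["C", "X", "Y"])
```

With this toy query, the expression data receive most of the weight, so gene C ranks above gene Y even though Y has the stronger links in the physical data. A different query could reverse the weights, which is the sense in which the prediction depends on the query context.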

Finally, my thesis addresses the question of how to use machine learning and other bioinformatics methods to direct large-scale genomic experiments. Until now, most bioinformatics methods have been applied downstream of data-generating experiments, serving mainly as tools for analysis. I discuss methods for directing large-scale experiments in the context of whole-genome genetic interaction screens. We have applied these methods in collaboration with experimental labs, and we demonstrate that such approaches enable more efficient use of high-throughput technology and, ultimately, help us discover more novel biology.