Title: Context-sensitive methods for learning from genomic data
 
Abstract:
Recent developments in biotechnology have enabled high-throughput 
measurement of several complementary cellular phenomena. The wealth of data 
generated by such technology promises to support computational prediction of 
network models, but so far, successful approaches that translate these data into 
accurate, experimentally testable hypotheses have been limited.  My thesis 
focuses on machine learning and signal processing approaches that exploit the 
contextual clues often accompanying biological data to extract useful 
information and make precise predictions.
First, my thesis describes methods for using microarray technology to 
detect chromosomal aberrations.  Amplification and deletion of portions of 
chromosomes often serve as a mechanism of rapid adaptation and have been 
associated with numerous cancers.  Accurate and precise identification of 
when and where these changes occur will help us understand this important 
adaptive mechanism and enable steps towards effective cancer treatment.  I 
discuss my solution to this problem, ChARM (Chromosomal Aberration Region 
Miner), a statistical signal processing approach based on 
expectation-maximization that uses chromosome context information to accurately 
identify even subtle chromosomal changes from either gene expression or CGH 
microarray data.  
Second, I have addressed the more general problem of integrating diverse 
types of functional genomic data (e.g. gene expression, protein-protein 
interactions, genetic interactions, sequence, and protein localization data) to 
understand gene function and predict biological networks.  I discuss a 
system we have developed for integration of these diverse data and user-driven 
network inference.  My key contribution in this area is the notion of query 
context-sensitive prediction.  This idea is based on the observation that 
most experimental technologies capture different biological processes with 
varying degrees of success, and thus, each source of genomic data will vary in 
relevance depending on the biological process one is interested in 
predicting.  Other key contributions of this work are the data 
visualization approaches that support intelligent, expert browsing of genomic 
data, a largely unexplored but powerful paradigm in bioinformatics 
applications.  I discuss evaluation of these methods and examples of 
biological validation, where we have used our system to characterize several new 
genes.
Finally, my thesis addresses the question of how to use machine learning 
and other bioinformatics methods to direct large-scale genomic 
experiments.  Until now, most bioinformatics methods have been applied 
downstream of data-generating experiments, serving mainly as tools for 
analysis.  I discuss methods for directing large-scale experiments in the 
context of whole-genome genetic interaction screens.  We have applied these 
methods in collaboration with experimental labs, and we demonstrate that such 
approaches enable more efficient use of high-throughput technology and, 
ultimately, help us discover more novel biology.