Title: Context-sensitive methods for learning from genomic data
Abstract:
Recent developments in biotechnology have enabled high-throughput
measurement of several complementary cellular phenomena. The wealth of data
generated by such technology promises to support computational prediction of
network models, but so far, successful approaches that translate these data into
accurate, experimentally testable hypotheses have been limited. My thesis
focuses on machine learning and signal processing approaches that utilize
contextual clues that often accompany biological data to extract useful
information and make precise predictions.
First, my thesis describes methods for using microarray technology to
detect chromosomal aberrations. Amplification and deletion of portions of
chromosomes often serve as a mechanism of rapid adaptation and have been
associated with numerous cancers. Accurate and precise identification of
when and where these changes occur will help us understand this important
adaptive mechanism and enable steps towards effective cancer treatment. I
discuss my solution to this problem, ChARM (Chromosomal Aberration Region
Miner), a statistical signal processing approach based on
expectation-maximization that uses chromosome context information to accurately
identify even subtle chromosomal changes from either gene expression or CGH
microarray data.
Second, I have addressed the more general problem of integrating diverse
types of functional genomic data (e.g. gene expression, protein-protein
interactions, genetic interactions, sequence, and protein localization data) to
understand gene function and predict biological networks. I discuss a
system we have developed for integration of these diverse data and user-driven
network inference. My key contribution in this area is the notion of query
context-sensitive prediction. This idea is based on the observation that
most experimental technologies capture different biological processes with
varying degrees of success, and thus each source of genomic data varies in
relevance depending on the biological process one is interested in
predicting. Other key contributions of this work are the data
visualization approaches that support intelligent, expert browsing of genomic
data, which is a largely unexplored but powerful paradigm in bioinformatics
applications. I discuss evaluation of these methods and examples of
biological validation, where we have used our system to characterize several new
genes.
Finally, my thesis addresses the question of how to use machine learning
and other bioinformatics methods to direct large-scale genomic
experiments. Until now, most bioinformatics methods have been applied
downstream of data-generating experiments, serving mainly as tools for
analysis. I discuss methods for directing large-scale experiments in the
context of whole-genome genetic interaction screens. We have applied these
methods in collaboration with experimental labs, and we demonstrate that such
approaches enable more efficient use of high-throughput technology and,
ultimately, help us discover more novel biology.