(Please note - talk is in Carl Icahn Lab, Room 200)
Curtis Huttenhower will present his preFPO on Friday, April 25 at 10 AM in Carl Icahn
Lab Room 200. The members of his committee are: Olga Troyanskaya, advisor;
Mona Singh and Hilary Coller (MOL), readers; Kai Li and David Botstein (MOL/Genomics),
nonreaders. Everyone is invited to attend his talk.
-----
TITLE: Data mining in large biological data collections
ABSTRACT:
Modern biology has developed a wealth of high-throughput experimental techniques. Many of
these, like whole-genome sequencing, produce measurements simultaneously for every gene in
an organism's genome. This means that a single assay can produce tens or hundreds of
thousands of structured data points; some assays detect pairwise interactions between
genes, squaring the amount of data. Each such genome-scale dataset represents a highly
descriptive snapshot of cellular biology. As the number of such datasets reaches the
thousands for many organisms, however, new opportunities arise to understand systems-level
biology by means of very large scale data integration and analysis.
My thesis focuses on three areas of opportunity presented by this situation.
First, analysis of single high-throughput datasets is by no means a solved
problem: as new experimental assays become available, each resulting dataset reveals new
aspects of molecular biology. I will discuss results examining the yeast phosphoproteome
by electron transfer dissociation mass spectrometry, which allows global profiling of
phosphorylation signaling, as well as a statistical model of the transcriptional response
to changes in growth rate. This model is descriptive of yeast biology and can be applied
to predict instantaneous growth rates from arbitrary expression measurements (e.g. from
other experimental platforms or unicellular organisms).
Second, it is critical to extend functional genomic analyses to higher organisms,
particularly human beings. This presents both computational and biological challenges:
the amount of available experimental data is orders of magnitude larger, the amount of
pre-existing biological knowledge is smaller, and the biology of higher eukaryotes is
itself substantially more complex. I will discuss HEFalMp, a machine learning system that
builds on our success in yeast to integrate hundreds of genome-scale datasets for
functional genomic analysis in human beings. This has enabled new views of gene function,
functional relationships, and pathway interactions in human biology. We are currently
following up experimentally on several predictions in the human autophagy pathway, and we
plan to extend HEFalMp to investigate human disease more specifically.
Finally, the process of experimental data integration for functional genomics can be
extended beyond individual gene interactions or single organisms. By mining a sufficiently
large body of experimental results, purely data-driven relationships can be derived
between entire cellular pathways. This captures the complex interplay between pathways
and processes at a systems level and can help to describe biological results in terms of
their specific functional activity. Furthermore, these analyses can be extended across
multiple organisms to study how gene and pathway function changes over the course of
evolution. All of these results are made possible by a novel synthesis of large-scale
machine learning with genome-scale biological data.