(Please note - talk is in Carl Icahn Lab, Room 200)

Curtis Huttenhower will present his preFPO on Friday, April 25 at 10 AM in Carl Icahn Lab Room 200. The members of his committee are: Olga Troyanskaya, advisor; Mona Singh and Hilary Coller (MOL), readers; Kai Li and David Botstein (MOL/Genomics), nonreaders. Everyone is invited to attend his talk.

-----

TITLE: Data mining in large biological data collections

ABSTRACT:

Modern biology has developed a wealth of high-throughput experimental techniques. Many of these, like whole-genome sequencing, produce measurements simultaneously for every gene in an organism's genome. This means that a single assay can produce tens or hundreds of thousands of structured data points; some assays detect pairwise interactions between genes, squaring the amount of data. Each such genome-scale dataset represents a highly descriptive snapshot of cellular biology. As the number of such datasets reaches the thousands for many organisms, however, new opportunities arise to understand systems-level biology by means of very large scale data integration and analysis. My thesis focuses on three areas of opportunity presented by this situation.

First, analysis of single high-throughput datasets is by no means a solved problem: as new experimental assays become available, each resulting dataset reveals new aspects of molecular biology. I will discuss results examining the yeast phosphoproteome by electron transfer dissociation mass spectrometry, which allows global profiling of phosphorylation signaling, as well as a statistical model of the transcriptional response to changes in growth rate. This model is descriptive of yeast biology and can be applied to predict instantaneous growth rates from arbitrary expression measurements (e.g. from other experimental platforms or unicellular organisms).

Second, it is critical to extend functional genomic analyses to higher organisms, particularly human beings. This presents both computational and biological challenges: the amount of available experimental data is orders of magnitude larger, the amount of pre-existing biological knowledge is smaller, and the biology of higher eukaryotes is itself substantially more complex. I will discuss HEFalMp, a machine learning system that builds on our success in yeast to integrate hundreds of genome-scale datasets for functional genomic analysis in human beings. This has enabled new views of gene function, functional relationships, and pathway interactions in human biology. We are currently conducting experimental follow-up for several predictions in the human autophagy pathway, and we are planning to extend HEFalMp to more specifically investigate human disease.

Finally, the process of experimental data integration for functional genomics can be extended beyond single gene interactions or single organisms. By mining a sufficiently large body of experimental results, purely data-driven relationships can be derived between entire cellular pathways. This captures the complex interplay between pathways and processes at a systems level and can help to describe biological results in terms of their specific functional activity. Furthermore, all of these results can be extended across multiple organisms to study gene and pathway functional changes over the course of evolution.

All of these results are made possible by a novel synthesis of large-scale machine learning with genome-scale biological data.