I will begin with a gentle introduction to the diversity and scale of
ENCODE data and a brief overview of the robust statistical methods that we
developed for automated detection of DNA binding sites of hundreds of
regulatory proteins from noisy experimental data. Regulatory proteins
can perform multiple functions by interacting with and co-binding DNA
with different combinations of other regulatory proteins. I developed a
novel discriminative machine learning formulation based on regularized
rule-based ensembles that was able to sort through the combinatorial
complexity of possible regulatory interactions and learn statistically
significant item-sets of co-binding events at an unprecedented level of
detail. I found extensive evidence that regulatory proteins could switch
partners at different sets of genomic domains within a single cell type
and across different cell types, affecting structural and chemical
properties of DNA and regulating different functional categories of
target genes. Using regulatory elements discovered from ENCODE data, we
were also able to provide putative functional interpretations for up to
81% of all publicly available sequence variants (mutations) identified
in large-scale disease studies and generate new hypotheses by
integrating multiple sources of data.
Finally, I will present a brief overview of my recent efforts to use
multivariate Hidden Markov models to analyze the dynamics of various
chemical modifications to DNA along three key axes of variation:
across multiple species, across different cell types in a single species
(human), and across multiple human individuals for the same cell type.
Our results indicate a remarkable universality of chemical modifications
defining hidden regulatory states across the animal kingdom, with
dramatic differences in the variation and functional impact of these
regulatory elements between cell types and individuals.