Bayesian models of structured sparsity for discovery of regulatory genetic variants
Barbara Engelhardt, Duke University
Wednesday, November 6th, 4:30pm
Computer Science 105
In the genomic sciences, the amount of data has grown faster than the statistical methodologies needed to analyze those data. Furthermore, the complex underlying structure of these data means that simple, unstructured statistical models do not perform well. We consider the problem of identifying allelic heterogeneity, or multiple, functionally independent, co-localized genetic regulators of gene transcription. Sparse regression techniques have been critical to the discovery of allelic heterogeneity because of their computational tractability in large data settings. However, these traditional models are hindered by the substantial correlation between genetic variants induced by linkage disequilibrium. I describe a new model for Bayesian structured sparse regression. This model uses positive definite covariance matrices to incorporate the arbitrarily complex structure of the predictors directly into a Gaussian field, yielding structure-aware sparse regression coefficients. This broadly applicable model of Bayesian structured sparsity enables more efficient parameter estimation techniques than models that assume independence would allow. On simulated data, we find that our approach substantially outperforms state-of-the-art models and methods. We also apply this model to a large study of expression quantitative trait loci and find that our approach yields highly interpretable, robust solutions for allelic heterogeneity, particularly when the interactions between genetic variants are well approximated by an additive model.
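As a rough illustration of the idea of encoding predictor structure in a Gaussian field, the sketch below draws a latent field from a positive definite covariance matrix and maps it to per-predictor inclusion probabilities. It is not the speaker's model: the AR(1)-style covariance (a stand-in for correlation induced by linkage disequilibrium), the logistic squashing, and all function names (`ar1_covariance`, `cholesky`, `sample_inclusion_probs`) are assumptions made purely for illustration.

```python
import math
import random


def ar1_covariance(p, rho):
    """Hypothetical positive definite matrix: nearby predictors are more correlated."""
    return [[rho ** abs(i - j) for j in range(p)] for i in range(p)]


def cholesky(a):
    """Return lower-triangular L with L * L^T == a (a must be positive definite)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L


def sample_inclusion_probs(sigma, rng):
    """Draw gamma ~ N(0, sigma), then squash each entry into (0, 1).

    The logistic link is an illustrative assumption, not taken from the talk.
    """
    p = len(sigma)
    L = cholesky(sigma)
    z = [rng.gauss(0.0, 1.0) for _ in range(p)]
    # gamma = L z has covariance L L^T = sigma.
    gamma = [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(p)]
    return [1.0 / (1.0 + math.exp(-g)) for g in gamma]


rng = random.Random(0)
sigma = ar1_covariance(p=6, rho=0.9)
probs = sample_inclusion_probs(sigma, rng)
print([round(q, 3) for q in probs])
```

Because the field is drawn jointly rather than independently, strongly correlated predictors tend to receive similar inclusion probabilities. Under these assumptions, that is one way a sparsity prior can take predictor structure into account.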