Ksenia Sokolova will present her FPO "Deep Learning for Sequence-Based Gene Expression Prediction" on Monday, March 18, 2024 at 10:30 AM in the 252 Nassau Street Conference room.
Location: 252 Nassau Street Conference room
The members of Ksenia’s committee are as follows:
Examiners: Olga Troyanskaya (Adviser), Kai Li, Ellen Zhong
Readers: Mona Singh, Yuri Pritykin
Everyone is invited to attend her talk.
Abstract follows below:
Human biology is rooted in highly specialized cell types programmed by a common genome, 98% of which is outside of genes. While genetic variation in the enormous noncoding space is linked to the majority of disease risk, the impact of this variation is poorly understood. Recent advances in sequencing technology have made it possible to perform whole-genome sequencing of large cohorts, uncovering many variants per individual. A crucial challenge is to understand the collective impact of these variants on gene expression across varied human cell types and their subsequent roles in disease progression.
This dissertation begins by tackling the challenge of associating noncoding genetic variants with changes in gene expression in primary human cell types. We introduce ExPectoSC, an atlas of modular deep-learning-based models for predicting cell-type-specific gene expression directly from sequence. With models spanning 105 primary human cell types across seven organ systems, it offers detailed insight into the effect of variation. The resulting atlas of sequence-based gene expression and variant effects is publicly available in a user-friendly interface and readily extensible to any primary cell type. We follow this work with an example application of ExPectoSC to the study of glomerular diseases, a major cause of end-stage renal disease in the US. Despite having similar clinical presentations, these diseases are known for their heterogeneity and variable patient outcomes. By integrating whole-genome sequencing data with ExPectoSC's predictions, we construct comprehensive gene expression disruption profiles for patients. Finally, we develop a new method for genomic-centered contrastive pre-training, called cGen, to improve training of models from sequence alone in limited-data contexts. After pre-training with sequence augmentations, cGen generates unsupervised embeddings that highlight functional clusters and are informative of gene expression in the absence of any labeled information.
Together, these contributions highlight the power of computational approaches to decode the noncoding genome, offering new avenues for the diagnosis, prognosis, and treatment of human diseases.