Chaitanya Aluru will present his FPO "Reconciliation-Based Methods for Identifying the Evolutionary Origins of Tandem Duplications in Repeat Domain Families" on Friday, 12/17/2021 at 1:30 PM via Zoom and in CS 105.

Zoom link: https://princeton.zoom.us/j/99860934740

The members of his committee are as follows: Adviser: Mona Singh; Readers: Ben Raphael, Bernard Chazelle, and Mona Singh; Examiners: Mona Singh, Olga Troyanskaya, and Barbara Engelhardt.

A copy of his thesis is available upon request; please email gradinfo@cs.princeton.edu if you would like one.

Everyone is invited to attend the talk.

Abstract follows below:

Domains are the structural, functional, and evolutionary building blocks of protein sequences. Proteins can contain multiple domain instances, and duplications and losses of these domains are a key driver of protein evolution. Of particular interest are families of proteins with consecutive repeats of the same domain. These tandem repeat families are involved in a wide variety of functions, including transcriptional regulation, protein transport, muscle contraction, brain size regulation, and many others. Proteins with tandemly repeated domains form a significant portion of the proteome across the tree of life. Despite their prevalence and importance, the evolutionary histories and functional diversification of many of these protein families are largely unknown. Understanding when domains duplicate, whether individually or together as part of an array of domains, could yield deeper insights into the functions of these proteins.

Several attempts have been made to understand the evolution of repeat domains within protein sequences. These approaches can largely be categorized into sequence-based and reconciliation-based methods. Sequence-based approaches attempt to identify the existence of tandem duplications, without placing them in an evolutionary context. Reconciliation-based methods, on the other hand, use gene and domain trees to simultaneously infer both tandem duplication events and the genes they occurred in. These methods, while more powerful, have not accurately captured tandem duplication events.

In this work, we bridge the gap between these two methods, developing reconciliation-based methods that can accurately identify tandem domain duplication events while also placing them correctly in the evolutionary history of their gene families. We extend existing reconciliation frameworks to include flexible cost models for duplication events. Rather than fixed costs regardless of duplication size, we represent costs as arbitrary functions of duplication length. We tackle the problem of distinguishing tandem duplications from other duplication events by incorporating sequence position information from existing domains. We provide both exact solutions and fast, accurate heuristics to these problems. Finally, we apply these approaches to the largest repeat domain family in humans, the Cys2-His2 zinc fingers. In an analysis of 494 Cys2-His2 zinc finger orthogroups, we find evidence of numerous tandem domain duplications throughout the placental mammals.