Jeffrey Helt will present his General Exam "FDR: Addressing the Parallelism Gap in Primary-backup Databases" on Tuesday, January 21, 2020 at 1pm in CS 402.

The members of his committee are as follows: Wyatt Lloyd (adviser), Michael Freedman and Amit Levy.

Everyone is invited to attend his talk, and those faculty wishing to remain for the oral exam following are welcome to do so. His abstract and reading list follow below.

Asynchronously replicated primary-backup databases are the beating heart of many large applications. The backups provide availability if the primary fails and may serve read-only transactions to increase throughput for all clients and reduce latency for clients in other geographic regions. However, to retain the correctness of overlying applications despite their read-only transaction returning stale values, backups should appear virtually indistinguishable from the primary. Thus, backups must guarantee monotonic prefix consistency, where they present a progressing sequence of the primary's recent states. But monotonic prefix consistency severely constrains a backup's execution because it must exactly copy (or clone) the commit order resulting from the primary's concurrency control. Existing cloned concurrency control protocols guarantee this order by limiting the backup's parallelism, but as a result, a parallelism gap exists between them and the primary.

In this paper, we prove this parallelism gap leads to unbounded replication lag, where changes committed by the primary may take infinitely long to propagate to the backup, for some workloads. Unbounded lag recently caused catastrophic failures (i.e., user data loss) and outages in production systems, and thus, our proof reveals such catastrophes are potentially lurking behind every workload change. We then design FDR, the first cloned concurrency protocol to guarantee both monotonic prefix consistency and bounded replication lag. Further, we provide an implementation of FDR that is backward-compatible with existing production databases, making it possible to thwart these potential failures immediately. We demonstrate formally and experimentally FDR provides bounded replication lag under a variety of workloads.

Reading List:

Hong et al., KuaFu: Closing the parallelism gap in database replication, 2013.

Wang et al., Query fresh: log shipping on steroids, 2017.

Tu et al., Speedy transactions in multicore in-memory databases, 2013.

Zheng et al., Fast Databases with Fast Durability and Recovery Through Multicore Parallelism, 2014.

Corbett et al., Spanner: Google's globally distributed database, 2013.

Thomson et al., Fast Distributed Transactions and Strongly Consistent Replication for OLTP Database Systems, 2014.

Van Renesse and Schneider, Chain Replication for Supporting High Throughput and Availability, 2004.

Terrace and Freedman, Object Storage on CRAQ: High-Throughput Chain Replication for Read-Mostly Workloads, 2009.

Lamport, The part-time parliament. 1998.

Papadimitriou, Serializability of Concurrent Database Updates, 1979.

Herlihy and Wing, Linearizability: A correctness condition for concurrent objects, 1990.

Textbooks:

Tannenbaum and van Steen, Distributed Systems: Principles and Paradigms, 2017.