Khiem Ngo will present his FPO "TOLERATING SLOWDOWNS IN REPLICATED STATE MACHINES" on Wednesday, August 25, 2021 at 3PM via Zoom.

Zoom link: https://princeton.zoom.us/j/97848004548

The members of Khiem’s committee are as follows: Wyatt Lloyd (Adviser); Readers: Amit Levy and Siddhartha Sen (Microsoft Research); Examiners: Michael Freedman, Ravi Netravali, Wyatt Lloyd.

A copy of his thesis is available upon request. Please email gradinfo@cs.princeton.edu if you would like a copy of the thesis.

All are welcome to attend his talk.

Abstract follows below:

Replicated state machines (RSMs) are linearizable, fault-tolerant groups of replicas that are coordinated using a consensus algorithm. RSMs are used throughout large-scale sys[1]tems, such as distributed databases, cloud storage, and service managers. At such scale, it is common for some machines to be slow. In this dissertation, we address the prob[1]lem of developing slowdown-tolerant RSMs that continue to operate as fast RSMs despite the presence of slow replicas. We define s-slowdown-tolerance and identify why existing consensus protocols are not slowdown-tolerant. As the first and pragmatic step toward slowdown-tolerant RSMs, we design, implement, and evaluate two 1-slowdown-tolerant consensus protocols, Copilot and Latent Copilot. Copilot replication is the first 1-slowdown-tolerant consensus protocol: it delivers normal latency despite the slowdown of any 1 replica. Copilot uses two distinguished replicas—the pilot and copilot—to proactively add redundancy to all stages of processing a client’s command. Copilot uses dependencies and deduplication to resolve potentially differing orderings proposed by the pilots. To avoid dependencies leading to either pilot being able to slow down the group, Copilot uses fast takeovers that allow a fast pilot to complete the ongoing work of a slow pilot. Copilot includes two optimizations—ping[1]pong batching and null dependency elimination—that improve its performance when there are 0 and 1 slow pilots respectively. Our evaluation of Copilot shows its performance is lower but competitive with Multi-Paxos and EPaxos when no replicas are slow. When a replica is slow, Copilot is the only protocol that avoids high latencies. Latent Copilot, a variant of Copilot, is another design and implementation of a 1- slowdown-tolerant consensus protocol. Latent Copilot operates with one active pilot, which actively proposes commands, and one latent pilot, which proposes commands only when necessary. In this way, Latent Copilot achieves an intermediate tradeoff between Multi[1]Paxos and Copilot in terms of throughput and slowdown tolerance. Our evaluation of La[1]iii tent Copilot shows that it achieves 1-slowdown-tolerance and that its performance with no slowdowns is comparable to Multi-Paxos. Finally, we generalize the design of Copilot replication to handle more than one slow[1]down. We provide the design of an s-slowdown-tolerant protocol that achieves slowdown tolerance despite s slow replicas.