Khiem Ngo will present his Pre-FPO, "Tolerating Slowdowns in Replicated State Machines using Copilots," on Tuesday, March 23, 2021 at 3 pm via Zoom.

Zoom link: https://princeton.zoom.us/j/98688513083

Committee Members:
Wyatt Lloyd (Advisor)
Michael Freedman (Examiner)
Ravi Netravali (Examiner)
Amit Levy (Reader)
Siddhartha Sen (MSR, Reader)

All are welcome to attend.

Title: Tolerating Slowdowns in Replicated State Machines using Copilots

Abstract. Replicated state machines (RSMs) are linearizable, fault-tolerant groups of replicas coordinated by a consensus algorithm. Linearizability makes the RSM appear to be a single machine that responds to client commands one by one. Fault tolerance enables the RSM to continue operating despite the failure of a minority of replicas. RSMs are used throughout large-scale systems, such as distributed databases, cloud storage, and service managers. At such scale, it is common for some machines to be slow. A slowdown manifests as a machine whose latency for responding to other machines is higher than usual. Thus, RSMs should also be slowdown-tolerant, i.e., provide similar performance despite the presence of slow replicas. Unfortunately, no existing consensus protocol is slowdown-tolerant: a single slow replica can sharply increase the protocol's latency. This increased latency decreases availability because a service that does not respond in time is not meaningfully available.

In this talk, I will first define s-slowdown-tolerance and explain why existing consensus protocols are not slowdown-tolerant. Then I will introduce Copilot replication, the first 1-slowdown-tolerant consensus protocol. Copilot replication uses two pilots and proactive redundancy in all stages of processing a client command to ensure the RSM stays fast. Our evaluation shows that Copilot delivers normal latency despite the slowdown of any one replica.
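
The intuition behind proactive redundancy can be illustrated with a toy latency model. The sketch below is not the Copilot protocol: it ignores ordering, quorums, and every other consensus detail, and the function names and latency values are hypothetical, chosen only for illustration. It assumes that a command handled by a single leader waits on that leader, while a command sent to two pilots completes as soon as the faster one responds.

```python
# Toy model only: names and numbers are illustrative assumptions,
# not taken from the Copilot design or its evaluation.

NORMAL_MS = 1.0   # hypothetical latency of a healthy replica
SLOW_MS = 50.0    # hypothetical latency of a slowed-down replica


def single_leader_latency(leader_ms: float) -> float:
    """Every command waits on the one leader, slow or not."""
    return leader_ms


def two_pilot_latency(pilot_ms: float, copilot_ms: float) -> float:
    """The command goes to both pilots; the faster response wins."""
    return min(pilot_ms, copilot_ms)


if __name__ == "__main__":
    # A slow leader slows every command.
    print(single_leader_latency(SLOW_MS))
    # With two pilots, one slow pilot is masked by the other.
    print(two_pilot_latency(SLOW_MS, NORMAL_MS))
    # Two slow pilots are not masked, matching the "1" in 1-slowdown-tolerant.
    print(two_pilot_latency(SLOW_MS, SLOW_MS))
```

In this model, latency stays at the healthy value as long as at most one of the two pilots is slow, which is the behavior the abstract describes as 1-slowdown-tolerance.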