Nanqinqin Li will present his General Exam "Achieving Cheap, Highly Available Fault Tolerance with Disaggregated Storage" on Friday, April 29, 2022 at 9:00 AM via Zoom.
Committee Members: Michael J. Freedman (advisor), Amit Levy, Ravi Netravali
Abstract:
High availability has long been achieved by replication at the application level. In a classic scheme called Primary-Backup, the primary application instance handles all client requests and forwards the execution log to the backup instances. Application-level replication, however, provides high availability at the cost of maintaining fully redundant replicas: they require as many resources (CPU, RAM, disk, etc.) as the primary.
The ubiquity of disaggregated storage in cloud computing offers an attractive alternative. Instead of maintaining live replicas, a newly launched backup instance recovers the application state from the underlying disaggregated storage, where data is already replicated for high durability. This alternative provides fault tolerance at a much lower cost but suffers from long failover periods because failover proceeds sequentially: it must first detect the primary failure and only then start recovery on a backup instance.
We propose speculative recovery to accelerate failover and improve the availability of this alternative. Instead of proceeding with failover sequentially, speculative recovery safely and efficiently parallelizes detecting primary failure and running recovery on a backup by using a clone of the application storage. Our implementation and evaluation show that speculative recovery reduces failover time considerably while keeping costs similarly low, below those of application-level replication.
Reading List:
Everyone is invited to attend the talk, and those faculty wishing to remain for the oral exam following are welcome to do so.