Haoyu Zhang will present his General Exam on May 7, 2015 at 10am in CS 401. The members of his committee are Mike Freedman (Advisor), Jen Rexford, and Nick Feamster. Everyone is invited to attend his talk, and those faculty wishing to remain for the oral exam following are welcome to do so. His abstract and reading list follow below.
Abstract
Software-defined networking (SDN) offers greater flexibility than traditional distributed network architectures, at the risk of the controller being a single point of failure. Unfortunately, existing fault-tolerance techniques, such as replicated state machines, are insufficient to ensure correct network behavior under controller failures. The challenge is that, in addition to the application state of the controllers, the switches maintain hard state that must be handled consistently. Thus, it is necessary to incorporate switch state into the system model to correctly offer a "logically centralized" controller.
We introduce Ravana, a fault-tolerant SDN controller platform that processes control messages transactionally and exactly once (at both the controllers and the switches). Ravana maintains these guarantees in the face of both controller and switch crashes. The key insight in Ravana is that replicated state machines can be extended with lightweight switch-side mechanisms to guarantee correctness, without involving the switches in an elaborate consensus protocol. Our prototype implementation of Ravana provides transparent fault tolerance: controller applications can run on Ravana without modifying a single line of code. Experiments show that, compared to a single controller, Ravana achieves high throughput with reasonable overhead and a failover time under 100 ms. We also use verification tools to prove Ravana's correctness under controller failures.
Reading List
[1] J. H. Saltzer and M. F. Kaashoek, Principles of Computer System Design: An Introduction. Morgan Kaufmann Publishers Inc., 2009.
[2] L. Lamport, “Time, Clocks, and the Ordering of Events in a Distributed System,” Commun. ACM, July 1978.
[3] M. Rosenblum and J. K. Ousterhout, “The Design and Implementation of a Log-structured File System,” ACM Trans. Comput. Syst., Feb. 1992.
[4] B. Liskov and J. Cowling, “Viewstamped Replication Revisited,” tech. rep., MIT, July 2012.
[5] D. B. Terry, M. M. Theimer, K. Petersen, A. J. Demers, M. J. Spreitzer, and C. H. Hauser, “Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System,” in SOSP, Dec. 1995.
[6] W. Lloyd, M. J. Freedman, M. Kaminsky, and D. G. Andersen, “Don’t Settle for Eventual: Scalable Causal Consistency for Wide-area Storage with COPS,” in SOSP, Oct. 2011.
[7] D. Ongaro and J. Ousterhout, “In Search of an Understandable Consensus Algorithm,” in USENIX ATC, June 2014.
[8] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, and D. Woodford, “Spanner: Google’s Globally-distributed Database,” in OSDI, Oct. 2012.
[9] M. Casado, M. J. Freedman, J. Pettit, J. Luo, N. Gude, N. McKeown, and S. Shenker, “Rethinking Enterprise Network Control,” IEEE/ACM Trans. Netw., Aug. 2009.
[10] T. Koponen, M. Casado, N. Gude, J. Stribling, L. Poutievski, M. Zhu, R. Ramanathan, Y. Iwata, H. Inoue, T. Hama, and S. Shenker, “Onix: A Distributed Control Platform for Large-scale Production Networks,” in OSDI, 2010.
[11] E. B. Nightingale, K. Veeraraghavan, P. M. Chen, and J. Flinn, “Rethink the Sync,” in OSDI, Nov. 2006.