Andrew Or will present his FPO on Monday, 5/10/2021 at 10am via Zoom.

Title: Abstracting Systems Challenges from Distributed Deep Learning

Zoom link: https://princeton.zoom.us/j/96432320689?pwd=ZTE2Z1pGcDV4bFNmUEo5Z1VvaUw4Zz09

The members of his committee are as follows: Michael Freedman (Adviser); Readers: Amit Levy, Ravi Netravali; Examiners: Wyatt Lloyd, Kai Li, Michael Freedman.

A copy of his thesis is available upon request. Please email gradinfo@cs.princeton.edu if you would like a copy of the thesis.

Everyone is invited to attend his talk.

Abstract:

State-of-the-art distributed deep learning systems, such as TensorFlow and PyTorch, are built on rigid assumptions that tightly couple model training and inference with the underlying hardware. First, they assume resource allocations must be fixed throughout the lifetime of a job, often leading to inefficient resource usage. Second, they require model hyperparameters to be retuned across different hardware configurations in order to achieve the same training result, posing a significant burden on the user. Due to these requirements, users are forced to juggle both systems challenges and application logic instead of being able to focus on just the latter.

In this dissertation, we demonstrate that the above assumptions are not fundamental to distributed deep learning. We resolve these limitations by proposing two systems built on top of TensorFlow.

The first is an autoscaling engine that, through trial-and-error, automatically determines the most resource-efficient hardware configuration for a given job. We propose pluggable heuristics tailored for deep learning workloads that incrementally guide the system towards such a configuration. Instead of repeatedly stopping the job and restarting it from checkpoints, which can lead to expensive hardware accelerators (e.g., GPUs, TPUs) going idle for minutes every time, our system adjusts the job’s all-reduce membership dynamically in between training steps without interrupting the job.

The second system is VirtualFlow, which leverages a novel abstraction between the model and the underlying hardware called virtual node processing. From the perspective of the model, virtual nodes, instead of physical hardware accelerators, perform the computation. When multiple virtual nodes are mapped to a physical device, they are processed sequentially on that device. This representation offers users the flexibility to trade off computation time against resource requirements, allowing them to train their models using the same set of hyperparameters across different hardware. Using this technique, VirtualFlow preserves application-level semantics while hiding hardware-level details from the user, enabling a variety of important new use cases such as experimentation, hyperparameter exploration, resource elasticity, and heterogeneous training.
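
For readers unfamiliar with the idea, the virtual node abstraction described in the abstract can be illustrated with a toy sketch. This is not VirtualFlow's implementation or API. It assumes, beyond what the abstract states, that each virtual node computes a gradient on an equal shard of the global batch, that a device accumulates the gradients of its virtual nodes one after another, and that the per-device sums are then averaged (a plain sum stands in for all-reduce here). The names `grad_mse`, `split`, and `step_gradient` are hypothetical.

```python
from typing import List, Sequence, Tuple

Example = Tuple[float, float]  # (input x, target y)


def grad_mse(w: float, batch: Sequence[Example]) -> float:
    """Gradient w.r.t. w of the mean squared error of the model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def split(items: Sequence, n: int) -> List[Sequence]:
    """Split items into n equal contiguous chunks."""
    if n <= 0 or len(items) % n != 0:
        raise ValueError("items must divide evenly into n chunks")
    size = len(items) // n
    return [items[i * size:(i + 1) * size] for i in range(n)]


def step_gradient(w: float, global_batch: Sequence[Example],
                  num_virtual_nodes: int, num_devices: int) -> float:
    """Gradient for one training step with virtual nodes mapped onto devices.

    Assumption (not from the abstract): per-virtual-node gradients are
    accumulated on each device and then averaged over all virtual nodes.
    """
    shards = split(global_batch, num_virtual_nodes)  # one shard per virtual node
    per_device = split(shards, num_devices)          # virtual nodes per device
    device_sums = []
    for device_shards in per_device:
        acc = 0.0
        for shard in device_shards:  # virtual nodes run sequentially on a device
            acc += grad_mse(w, shard)
        device_sums.append(acc)
    # Stand-in for all-reduce: combine the device sums, then average.
    return sum(device_sums) / num_virtual_nodes


if __name__ == "__main__":
    batch = [(float(i), 3.0 * i + 1.0) for i in range(8)]
    for devices in (1, 2, 4, 8):
        print(devices, step_gradient(0.5, batch, 8, devices))
```

Under these assumptions, the number of virtual nodes and the global batch size stay fixed, so the gradient for a step is the same (up to floating-point rounding) whether the eight virtual nodes run on one device or eight. Only the time per step changes, which is the trade-off the abstract describes.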