Andrew Or will present his Pre FPO "Abstracting systems challenges from distributed deep learning" on Wednesday, March 17, 2021 at 1pm via Zoom.

Zoom link: https://princeton.zoom.us/j/91618635317?pwd=cyt5WmdFQ2xaWmNGb2U0bTNlZXBYZz09

Committee:
Michael J. Freedman (advisor, examiner)
Kai Li (examiner)
Wyatt Lloyd (examiner)
Amit Levy (reader)
Ravi Netravali (reader)

Talk abstract follows below. All are welcome to attend.

Abstract:

State-of-the-art distributed deep learning systems tightly couple model execution with the underlying hardware. Existing frameworks assume resource allocations must be fixed throughout the lifetime of a job, often leading to inefficient use of cluster resources. Further, model hyperparameters must be retuned across different hardware configurations in order to achieve the same training result. In a world where the scale of deep learning workloads is growing rapidly, these requirements pose significant barriers to experimentation on smaller test beds and to reproducing results across different hardware.

In this thesis, we demonstrate that the above assumptions are not fundamental to distributed deep learning, and we resolve these limitations by proposing two systems built on top of TensorFlow.

The first is an autoscaling engine that, through trial and error, automatically determines the most resource-efficient hardware configuration for a given job. We propose pluggable heuristics tailored to deep learning workloads that incrementally guide the system toward such a configuration. Instead of repeatedly stopping the job and restarting it from checkpoints, which can leave GPUs or TPUs idle for minutes each time, our system adjusts the all-reduce membership dynamically in between batches without interrupting the job (a rough sketch of this kind of trial-and-error loop appears below).

The second system is VirtualFlow, which leverages a novel abstraction between the model and the underlying hardware called virtual node processing. From the perspective of the model, virtual nodes, rather than physical devices (GPUs or TPUs), perform the computation. When multiple virtual nodes are mapped to a physical device, they are processed sequentially on that device (see the second sketch below). This representation allows users to train a model with the same batch size (and other hyperparameters) across different hardware, thereby preserving application-level semantics while hiding hardware-level details from the user. Using this technique, VirtualFlow also enables a variety of important new use cases that improve resource efficiency, including resource elasticity and heterogeneous training (using different types of GPUs in the same job).
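
For readers unfamiliar with the idea, here is a minimal sketch of a trial-and-error scaling heuristic of the general flavor described in the abstract. It is not the autoscaling engine from the thesis: the measure_throughput callback, the doubling policy, and the efficiency threshold are all illustrative assumptions.

    def autoscale(measure_throughput, min_workers=1, max_workers=32,
                  efficiency_threshold=0.75):
        """Grow the allocation while scaling efficiency stays high.

        measure_throughput(n) is assumed to run a few batches with n workers
        (e.g., by adjusting the all-reduce membership between batches rather
        than restarting from a checkpoint) and return examples/second.
        """
        n = min_workers
        tput = measure_throughput(n)
        while n < max_workers:
            candidate = min(2 * n, max_workers)
            candidate_tput = measure_throughput(candidate)
            # Keep growing only while the observed speedup stays close to
            # the ideal (linear) speedup; otherwise settle at the current size.
            if candidate_tput / tput >= efficiency_threshold * (candidate / n):
                n, tput = candidate, candidate_tput
            else:
                break
        return n

In practice a real heuristic would also consider shrinking the allocation and per-resource cost, but the loop above captures the incremental, non-interrupting search the abstract describes.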
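
Similarly, a minimal sketch of the idea behind virtual node processing, assuming TensorFlow 2: each physical device's batch is split into one sub-batch per virtual node, the sub-batches are processed sequentially, and the gradients are accumulated so the effective batch size (and hence the hyperparameters) is unchanged. This is an illustration of the general technique, not the VirtualFlow implementation; it assumes the per-device batch divides evenly among the virtual nodes.

    import tensorflow as tf

    def train_step(model, optimizer, loss_fn, batch_x, batch_y, num_virtual_nodes):
        # One sub-batch per virtual node, processed sequentially on this device.
        x_splits = tf.split(batch_x, num_virtual_nodes)
        y_splits = tf.split(batch_y, num_virtual_nodes)
        accumulated = [tf.zeros_like(v) for v in model.trainable_variables]

        for x, y in zip(x_splits, y_splits):
            with tf.GradientTape() as tape:
                loss = loss_fn(y, model(x, training=True))
            grads = tape.gradient(loss, model.trainable_variables)
            accumulated = [a + g for a, g in zip(accumulated, grads)]

        # Average over virtual nodes so the update matches a single large batch.
        accumulated = [a / num_virtual_nodes for a in accumulated]
        optimizer.apply_gradients(zip(accumulated, model.trainable_variables))

Because the model only ever sees the full batch, the same hyperparameters carry over whether the virtual nodes map to one GPU or many.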