Logan Stafman will be presenting his Pre-FPO on Friday, February 22, 2019 at 3pm in CS 302.

The members of his committee are as follows: Michael Freedman (adviser), Wyatt Lloyd, Kyle Jamieson, Amit Levy, and Jennifer Rexford.

All are welcome to attend. Please see below for the talk title and abstract.


Utility Scheduling in Multi-Tenant Clusters

The rapid growth of data, together with the complex ways data scientists use it, presents new challenges for large-scale data analytics systems. Modern distributed computing frameworks must support applications ranging from answering database queries to training machine learning models. As data centers have grown, managing their resources has become an increasingly important task, and newly popular classes of applications render traditional scheduling systems inadequate.

In this thesis, we present distributed scheduling systems aimed at increasing cluster resource utilization by taking advantage of specific characteristics of data processing applications. We present the following systems:

(i) SLAQ: a cluster scheduling system for machine learning (ML) training jobs that aims to maximize the quality of all models being trained. In exploratory model training, overall quality improves more quickly when resources are redirected to the jobs with the greatest potential for improvement. SLAQ thereby reduces latency and improves the quality of the models trained on a shared ML cluster.

(ii) ReLAQS: a cluster scheduling system for incremental approximate query processing (AQP) that aims to minimize the error of all approximate results. AQP queries compute approximate answers by sampling data, and error shrinks more quickly when resources are allocated to the queries with the highest remaining error. ReLAQS reduces the latency required to reach a result of a given error level in a shared AQP environment.
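Both systems share a core idea: repeatedly give the next unit of resources to the job with the highest estimated marginal improvement (loss reduction for SLAQ, error reduction for ReLAQS). The following minimal sketch illustrates that greedy strategy; the job names, the per-unit improvement estimates, and the diminishing-returns model are illustrative assumptions, not details from the talk.

```python
# Hypothetical sketch of quality/error-driven scheduling: each epoch,
# assign one resource unit to the job with the largest current marginal gain.
# Values and the 1/(1 + units) diminishing-returns model are assumptions.

def greedy_allocate(jobs, total_units):
    """jobs: dict mapping job name -> estimated improvement per resource unit.
    Marginal gain is assumed to shrink as a job accumulates resources,
    modeling diminishing returns during training or sampling."""
    alloc = {name: 0 for name in jobs}
    for _ in range(total_units):
        # Pick the job whose next resource unit helps the most right now.
        best = max(jobs, key=lambda n: jobs[n] / (1 + alloc[n]))
        alloc[best] += 1
    return alloc

# A fast-improving job ends up with most of the resources.
print(greedy_allocate({"a": 9.0, "b": 3.0, "c": 1.0}, 6))
```

The greedy loop is a simplification: the real systems must also estimate each job's improvement online and preempt or rebalance running tasks, rather than allocating from a static table of gains.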

These systems demonstrate a novel set of methods for fine-grained scheduling that can be used to build responsive, efficient distributed systems. We have evaluated them on standard benchmark workloads and datasets, as well as popular ML algorithms, and show both reduced latency and increased accuracy of intermediate results.