[talks] Haoyu Zhang will present his Pre FPO talk "Resource Management for Advanced Data Analytics at Large Scale" on Tuesday December 19th, 2017 at 10:00am in CS 302.

Tue Dec 12 10:28:26 EST 2017

Haoyu Zhang will present his Pre FPO talk on Tuesday December 19th, 2017 at 10:00am in CS 302. 

The members of his committee are: Michael Freedman (advisor); Kyle Jamieson and Wyatt Lloyd (readers); Kai Li and Jennifer Rexford (examiners).

All are welcome to attend. 

Title: Resource Management for Advanced Data Analytics at Large Scale 

The rapidly growing size of data and the complexity of analytics present new challenges for large-scale data processing systems. Modern distributed computing frameworks need to support not only embarrassingly parallel batch jobs, but also advanced applications that analyze text and multimedia data using multi-stage DAG queries and machine learning (ML) models. Given the high costs of advanced data analytics, resource management is crucial. New applications and workloads exhibit vastly different characteristics, which make traditional scheduling systems inadequate, and at the same time offer many opportunities for new system designs with better performance.

In this talk, I will present resource management system designs that can efficiently utilize cluster resources by leveraging the insights from advanced data analytics applications. We identify and study the following three key scenarios: 

(i) VideoStorm: a video analytics system that processes thousands of video analytics queries on live video streams over large clusters. VideoStorm's offline scheduler generates resource-quality profiles for vision processing queries, and its online scheduler allocates resources to maximize performance in terms of quality and lag, in contrast to the commonly used fair sharing of resources. 

(ii) SLAQ: a cluster scheduling system for approximate ML training jobs that aims to maximize the overall job quality. ML training typically involves many iterations that continually improve the quality of the model. Yet in exploratory settings, better models can be obtained faster by directing resources to jobs with the most potential for improvement. SLAQ collects quality and resource usage information from concurrent jobs, and allocates resources to maximize quality improvement based on highly tailored model quality predictions.
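
The idea of directing resources to the jobs with the most potential for improvement can be illustrated with a small greedy allocator. This is only a sketch under stated assumptions, not SLAQ's actual scheduler: the function allocate, the per-job gain predictors, and the assumption that each extra resource unit yields diminishing returns are all hypothetical and chosen for illustration.

```python
import heapq

def allocate(total_units, jobs):
    """Greedily hand out resource units to the job with the largest
    predicted marginal quality gain.

    jobs maps a job name to a hypothetical predictor gain(k): the
    predicted quality improvement if the job receives k units.
    Assumes diminishing returns (gain is concave in k).
    """
    alloc = {name: 0 for name in jobs}
    # Max-heap (via negated values) keyed on the gain of the next unit.
    heap = [(-(gain(1) - gain(0)), name) for name, gain in jobs.items()]
    heapq.heapify(heap)
    for _ in range(total_units):
        if not heap:
            break
        neg_marginal, name = heapq.heappop(heap)
        if -neg_marginal <= 0:
            break  # no job is predicted to benefit from another unit
        alloc[name] += 1
        k = alloc[name]
        gain = jobs[name]
        heapq.heappush(heap, (-(gain(k + 1) - gain(k)), name))
    return alloc

# Hypothetical predictors: a fresh job with a lot of room to improve,
# and a nearly converged job with little left to gain.
jobs = {
    "fresh": lambda k: 10.0 * (1 - 0.5 ** k),
    "converged": lambda k: 1.0 * (1 - 0.5 ** k),
}
print(allocate(5, jobs))
```

In this toy setup most units go to the fresh job, whereas a fair-sharing scheduler would split them evenly regardless of how much each job could still improve.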

(iii) Riffle: an optimized shuffle service for big-data analytics frameworks that significantly improves I/O efficiency and scales to process petabytes of data. The all-to-all data transfer (i.e., shuffle) in modern big-data systems (such as Spark and Hadoop) becomes the scaling bottleneck for multi-stage analytics jobs, due to the superlinear increase in disk I/O operations as data volume increases. Riffle significantly improves system performance by merging fragmented intermediate files and efficiently scheduling the merge operations. 
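
A rough back-of-the-envelope calculation shows why shuffle I/O can grow superlinearly and why merging helps. The sketch below assumes a simplified model that is not taken from the abstract: every reduce task issues one read per map output file, task sizes stay fixed so both task counts grow linearly with data volume, and merging combines a fixed number of map outputs (merge_factor, a hypothetical parameter) into one file.

```python
import math

def shuffle_fetches(num_map_tasks, num_reduce_tasks, merge_factor=1):
    """Estimate the number of shuffle reads in the simplified model:
    each reducer reads one block from every (possibly merged) map output file.
    """
    files = math.ceil(num_map_tasks / merge_factor)
    return files * num_reduce_tasks

# Doubling the data doubles both task counts, so reads grow fourfold.
print(shuffle_fetches(1000, 1000))
print(shuffle_fetches(2000, 2000))

# Merging every 10 map outputs into one file cuts reads by about 10x.
print(shuffle_fetches(1000, 1000, merge_factor=10))
```

Under these assumptions the number of reads is quadratic in data volume, while each read gets correspondingly smaller, which is the kind of fragmented access pattern that merging intermediate files is meant to avoid.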

We have built these systems, deployed them on large clusters, and evaluated them extensively with real production workloads. Our results show significant improvements in resource efficiency, job completion time, and system throughput.
