Matvey Arye will present his research seminar/general exam on Monday, April 29 at 12:30 PM in
Room 401 (note room!). The members of his committee are: Michael Freedman (advisor),
Vivek Pai, and Kai Li. Everyone is invited to attend his talk, and those faculty wishing to remain
for the oral exam that follows are welcome to do so. His abstract and reading list follow below.

Abstract:
Global-scale services generate data that is both widely distributed and big, such as system logs and video feeds. Unfortunately, traditional approaches for backhauling and analyzing this data centrally are slow and expensive, due to the high cost or limited availability of wide-area network bandwidth. Moreover, they require the analyst to commit to a data-collection policy upfront, making that policy agnostic to current and future resource conditions.

Jetstream is a system that allows adaptive and real-time analysis of large, distributed data sets. It uses dispersed, structured storage to enable data collection without a fixed policy, and adapts the fidelity of collection in response to changes in network conditions. Namely, if a given user query cannot be satisfied within the available bandwidth, Jetstream automatically transforms the query, trading precision for bandwidth. One key ingredient in Jetstream’s architecture is its storage abstraction: a novel adaptation of the data cube from OLAP databases, which we use to represent aggregations and approximations of distributed data. The cube model helps us define a range of data-degradation transforms, all of which can be implemented as standard operators in a user’s query graph. The evaluation is conducted on a system stretching between clusters in Europe and North America. It demonstrates the ability to maintain real-time responsiveness, save significant bandwidth through in-place aggregation and approximation, and dynamically adapt the data-degradation policies based on changing resource constraints and input data rates.

Current Reading List:

(9 papers and 1 textbook)

Principles of Computer System Design: An Introduction

Saltzer and Kaashoek

The Design of the Borealis Stream Processing Engine

Abadi et al., CIDR 2005

Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks

Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly

EuroSys, 2007

TAG: A Tiny Aggregation Service for Ad-Hoc Sensor Networks

Madden et al., OSDI, 2002

A Cost-Space Approach to Distributed Query Optimization in Stream Based Overlays

Shneidman et al., NetDB 2005

BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data

Sameer Agarwal, Aurojit Panda, Barzan Mozafari, Samuel Madden, Ion Stoica

To Appear in ACM EuroSys 2013

DBToaster: Higher-order Delta Processing for Dynamic, Frequently Fresh Views

Yanif Ahmad, Oliver Kennedy, Christoph Koch, and Milos Nikolic

VLDB 2012

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat, OSDI 2004

Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals

Jim Gray, et al., Data Mining and Knowledge Discovery, 1997.

Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters

Matei Zaharia, et al., HotCloud, 2012.