Matvey Arye will present his research seminar/general exam on Monday, April 29 at 12:30 PM in Room 401 (note room!). The members of his committee are Michael Freedman (advisor), Vivek Pai, and Kai Li. Everyone is invited to attend his talk, and those faculty wishing to remain for the oral exam following are welcome to do so. His abstract and reading list follow below.

Abstract:

Global-scale services generate data that is both widely distributed and big, such as system logs and video feeds. Unfortunately, traditional approaches for backhauling and analyzing this data centrally are slow and expensive, due to the high cost or limited availability of wide-area network bandwidth. Moreover, they require the analyst to commit to a data-collection policy up front, which makes the policy agnostic to current and future resource conditions.

Jetstream is a system that allows adaptive and real-time analysis of large, distributed data sets. It uses dispersed, structured storage to enable data collection without a fixed policy, and adapts the fidelity of collection in response to changes in network conditions. Namely, if a given user query cannot be satisfied within the available bandwidth, Jetstream automatically transforms the query, trading precision for bandwidth.

One key ingredient in Jetstream’s architecture is its storage abstraction: a novel adaptation of the data cube from OLAP databases, which we use to represent aggregations and approximations of distributed data. The cube model helps us define a range of data-degradation transforms, all of which can be implemented as standard operators in a user’s query graph.

The evaluation is conducted on a system stretching between clusters in Europe and North America. It demonstrates the ability to maintain real-time responsiveness, save significant bandwidth through in-place aggregation and approximation, and dynamically adapt the data-degradation policies based on changing resource constraints and input data rates.

Current Reading List: (9 papers and 1 textbook)

Principles of Computer System Design: An Introduction
Saltzer and Kaashoek

The Design of the Borealis Stream Processing Engine
Abadi et al., CIDR 2005

Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly, EuroSys 2007

TAG: A Tiny Aggregation Service for Ad-Hoc Sensor Networks
Madden et al., OSDI 2002

A Cost-Space Approach to Distributed Query Optimization in Stream Based Overlays
Shneidman et al., NetDB 2005

BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data
Sameer Agarwal, Aurojit Panda, Barzan Mozafari, Samuel Madden, and Ion Stoica, to appear in ACM EuroSys 2013

DBToaster: Higher-order Delta Processing for Dynamic, Frequently Fresh Views
Yanif Ahmad, Oliver Kennedy, Christoph Koch, and Milos Nikolic, VLDB 2012

MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat, OSDI 2004

Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals
Jim Gray et al., Data Mining and Knowledge Discovery, 1997

Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters
Matei Zaharia et al., HotCloud 2012