Jetstream is a system for adaptive, real-time analysis of large, distributed data sets. It uses dispersed, structured storage to enable data collection without a fixed policy, and adapts the fidelity of collection in response to changes in network conditions. Specifically, if a user query cannot be satisfied within the available bandwidth, Jetstream automatically transforms the query, trading precision for bandwidth. One key ingredient in Jetstream’s architecture is its storage abstraction: a novel adaptation of the data cube from OLAP databases, which we use to represent aggregations and approximations of distributed data. The cube model helps us define a range of data-degradation transforms, all of which can be implemented as standard operators in a user’s query graph. Our evaluation, conducted on a deployment spanning clusters in Europe and North America, demonstrates that Jetstream can maintain real-time responsiveness, save significant bandwidth through in-place aggregation and approximation, and dynamically adapt its data-degradation policies to changing resource constraints and input data rates.
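To make the idea of trading precision for bandwidth concrete, here is a minimal Python sketch of one possible degradation transform on a cube-like structure. It is not Jetstream's implementation or API: the function names (`build_cube`, `roll_up_time`, `degrade_to_fit`), the `(second, url)` dimensions, and the use of a cell count as a stand-in for bandwidth are all assumptions made for illustration.

```python
from collections import defaultdict

def build_cube(events):
    """Aggregate raw (second, url, nbytes) events into cells keyed by (second, url)."""
    cube = defaultdict(int)
    for second, url, nbytes in events:
        cube[(second, url)] += nbytes
    return dict(cube)

def roll_up_time(cube, bucket_seconds):
    """Coarsen the time dimension by merging cells that fall in the same time bucket."""
    coarse = defaultdict(int)
    for (second, url), total in cube.items():
        bucket_start = second - second % bucket_seconds
        coarse[(bucket_start, url)] += total
    return dict(coarse)

def degrade_to_fit(cube, max_cells, bucket_sizes=(1, 5, 60)):
    """Pick the finest time granularity whose cube fits within max_cells.

    max_cells is a hypothetical proxy for the bandwidth budget. If no
    granularity fits, the coarsest one is returned anyway.
    """
    candidate = cube
    for size in bucket_sizes:
        candidate = roll_up_time(cube, size)
        if len(candidate) <= max_cells:
            return size, candidate
    return bucket_sizes[-1], candidate

# Hypothetical web-request events: (timestamp in seconds, URL, bytes served).
events = [(0, "/a", 10), (1, "/a", 20), (2, "/b", 5), (7, "/a", 1), (61, "/b", 4)]
cube = build_cube(events)
size, degraded = degrade_to_fit(cube, max_cells=4)
```

In this sketch the roll-up loses time resolution but preserves the per-URL byte totals, which is the kind of trade the abstract describes: a coarser answer that still fits the available budget.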
Current Reading List:
(9 papers and 1 textbook)
Principles of Computer System Design: An Introduction
Saltzer and Kaashoek
The Design of the Borealis Stream Processing Engine
Abadi et al., CIDR 2005
Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly
EuroSys, 2007
TAG: A Tiny Aggregation Service for Ad-Hoc Sensor Networks
Madden et al., OSDI, 2002
A Cost-Space Approach to Distributed Query Optimization in Stream Based Overlays
Shneidman et al., NetDB 2005
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data
Sameer Agarwal, Aurojit Panda, Barzan Mozafari, Samuel Madden, Ion Stoica
To Appear in ACM EuroSys 2013
DBToaster: Higher-order Delta Processing for Dynamic, Frequently Fresh Views
Yanif Ahmad, Oliver Kennedy, Christoph Koch, and Milos Nikolic
VLDB 2012
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat, OSDI 2004
Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals
Jim Gray et al., Data Mining and Knowledge Discovery, 1997.
Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters
Matei Zaharia, et al., HotCloud, 2012.