Aninda Manocha will present her General Exam, "Hardware-Software Co-Design to Accelerate Irregular Applications in the Post-Moore's Law Era," on Wednesday, January 15, 2020 at 10am in CS 302.

The members of her committee are as follows: Margaret Martonosi (adviser), David Wentzlaff, and David August. Everyone is invited to attend her talk, and those faculty wishing to remain for the oral exam that follows are welcome to do so. Her abstract and reading list follow below.

Despite their ubiquity in many important real-world, big-data applications, graph and other sparse workloads remain difficult to accelerate even with modern accelerator-oriented, heterogeneous system designs. Many of these applications are characterized by irregular memory accesses caused by pointer indirection, which confounds caching mechanisms that hinge on the notion of locality, as well as traditional prefetching and speculation techniques. As a result, these applications are bottlenecked by many long-latency memory accesses.

First, my work approaches these workloads from a latency-tolerance perspective by proposing a Decoupled Access/Execute (DAE)-inspired technique, FAST-LLAMAs. This approach slices programs into Producer/Consumer threads and maps them onto simple in-order cores. Significant Memory-Level Parallelism (MLP) is then achieved in the form of dependency-free loads and asynchronous read-modify-write (RMW) instructions issued on the Producer, whose data are efficiently read by the Consumer. By combining full-stack innovations with a simple, in-order, multi-core architecture, FAST-LLAMAs transforms long-latency memory accesses into requests that can issue asynchronously, thus hiding much of their cost. With a single Producer/Consumer pair, FAST-LLAMAs yields a geomean speedup of 4.21x, and up to 8.66x, over a single in-order core.
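To give a flavor of the Decoupled Access/Execute idea behind FAST-LLAMAs, the following is a minimal software analogy (not the actual hardware technique, and all names and the toy workload here are illustrative): an "access" thread performs the irregular, pointer-indirected loads and streams operands through a bounded queue, while an "execute" thread consumes them, so slow gathers overlap with compute.

```python
# Hedged sketch: a software analogy of Decoupled Access/Execute (DAE).
# The producer performs the irregular gather (values[indices[i]]) and
# streams operands into a bounded queue, which plays the role of the
# hardware communication buffer; the consumer only computes.
import threading
import queue

def decoupled_sum(values, indices):
    q = queue.Queue(maxsize=64)   # bounded queue ~ hardware Producer/Consumer buffer
    SENTINEL = object()           # end-of-stream marker
    out = []

    def producer():
        # Access slice: issue the address computation and loads ahead of compute.
        for i in indices:
            q.put(values[i])      # irregular, indirect load
        q.put(SENTINEL)

    def consumer():
        # Execute slice: consume streamed operands; no address computation here.
        total = 0
        while True:
            v = q.get()
            if v is SENTINEL:
                break
            total += v
        out.append(total)

    t1 = threading.Thread(target=producer)
    t2 = threading.Thread(target=consumer)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return out[0]

# Example: gather through an index array, then reduce.
# decoupled_sum([10, 20, 30, 40], [3, 0, 3]) returns 90 (40 + 10 + 40).
```

In software the queue hand-off usually costs more than it saves; the point of the hardware approach is that the decoupled access stream can run far ahead, exposing memory-level parallelism without out-of-order machinery.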
Under the same area budget, the simple in-order cores of FAST-LLAMAs achieve a geomean improvement in energy efficiency of 11.79x, and up to 20.8x, over out-of-order cores, without requiring application-specific hardware.

To address the memory bandwidth limits of these irregular applications, my work then proposes reconfigurable, intelligent caching techniques called GSC (Graph-Specialized Caching) that optimize for the separately allocated regions of memory employed in such applications. By isolating and specializing caching policies for the irregular memory accesses of such workloads, GSC techniques mimic a customized memory hierarchy while retaining reconfigurability. More specifically, they employ logical cache partitioning, tailored access granularities, and node-degree-based replacement policies to target irregular memory accesses. As a result, GSC techniques achieve a geomean speedup of 1.70x (up to 2.54x), a geomean improvement in energy efficiency of 3.76x (up to 12.76x), and a geomean improvement in bandwidth savings of 3.29x (up to 4.74x). This allows for parallel scaling far beyond what prior state-of-the-art memory hierarchies have achieved.

Reading List

James E. Smith. “Decoupled access/execute computer architectures.” In Proceedings of the 9th International Symposium on Computer Architecture (ISCA), pages 112–119. IEEE Press, 1982.

James R. Goodman. “Using cache memory to reduce processor-memory traffic.” In Proceedings of the 10th International Symposium on Computer Architecture (ISCA), pages 124–131. ACM, 1983.

Guilherme Ottoni, Ram Rangan, Adam Stoler, and David I. August. “Automatic thread extraction with decoupled software pipelining.” In Proceedings of the 38th International Symposium on Microarchitecture (MICRO), pages 107–118. IEEE Press, 2005.

Mattan Erez, Jung Ho Ahn, Jayanth Gummaraju, Mendel Rosenblum, and William J. Dally. “Executing irregular scientific applications on stream architectures.” In Proceedings of the 21st International Conference on Supercomputing (ICS), pages 93–104. ACM, 2007.

Carole-Jean Wu, Aamer Jaleel, Will Hasenplaugh, Margaret Martonosi, Simon C. Steely, Jr., and Joel Emer. “SHiP: Signature-based hit predictor for high performance caching.” In Proceedings of the 44th International Symposium on Microarchitecture (MICRO), pages 430–441. ACM, 2011.

Tae Jun Ham, Juan L. Aragón, and Margaret Martonosi. “DeSC: Decoupled supply-compute communication management for heterogeneous architectures.” In Proceedings of the 48th International Symposium on Microarchitecture (MICRO), pages 191–203. ACM, 2015.

Tae Jun Ham, Lisa Wu, Narayanan Sundaram, Nadathur Satish, and Margaret Martonosi. “Graphicionado: A high-performance and energy-efficient accelerator for graph analytics.” In Proceedings of the 49th International Symposium on Microarchitecture (MICRO), pages 1–13. IEEE Press, 2016.

Sam Ainsworth and Timothy M. Jones. “An event-triggered programmable prefetcher for irregular workloads.” In Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 578–592. ACM, 2018.

Abanti Basak, Shuangchen Li, Xing Hu, Sang Min Oh, Xinfeng Xie, Li Zhao, Xiaowei Jiang, and Yuan Xie. “Analysis and optimization of the memory hierarchy for graph processing workloads.” In Proceedings of the 2019 International Symposium on High Performance Computer Architecture (HPCA), pages 373–386. IEEE Press, 2019.

Michael Pellauer, Yakun Sophia Shao, Jason Clemons, Neal Crago, Kartik Hegde, Rangharajan Venkatesan, Steven W. Keckler, Christopher W. Fletcher, and Joel Emer. “Buffets: An efficient and composable storage idiom for explicit decoupled data orchestration.” In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 137–151. ACM, 2019.
Textbook: John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach, 5th edition. Morgan Kaufmann, 2011.