
Ishita Chaturvedi will present her FPO "In the SHADOW of GhOSTs: Cross-Domain Lessons in Parallelism" on April 9, 2025 at 10am in Friend 007 and Zoom.

The members of her committee are as follows:
Examiners: David August (adviser), Margaret Martonosi, David Wentzlaff
Readers: David August, Niraj Jha, Sharad Malik

Zoom link: https://princeton.zoom.us/my/ishita.c

All are welcome to attend.

Title: In the SHADOW of GhOSTs: Cross-Domain Lessons in Parallelism

Abstract: Modern processors struggle to balance instruction-level parallelism (ILP) and thread-level parallelism (TLP) efficiently. GPUs rely on massive TLP to hide latency but suffer in workloads with low occupancy or control divergence due to limited ILP. CPUs, optimized for ILP, become bottlenecked in memory-bound workloads where TLP could improve resource utilization. This thesis challenges the rigid separation of ILP and TLP by introducing GhOST and SHADOW, two architectures that extend ILP in GPUs and TLP in CPUs to improve execution efficiency across diverse workloads.

GhOST introduces lightweight out-of-order (OoO) execution to GPUs, enabling warp-level instruction reordering without speculative execution or register renaming. Its instruction buffer-based reordering mechanism allows instructions to execute as soon as operands are ready, significantly reducing stalls in low-occupancy scenarios and mitigating control divergence overhead. GhOST achieves a 6.9% geometric mean speedup (up to 36%) with only 0.007% area overhead, demonstrating the feasibility of ILP-aware GPU architectures.

SHADOW is the first asymmetric simultaneous multithreading (SMT) core that dynamically balances ILP and TLP by executing OoO and in-order (InO) threads simultaneously on the same core. By leveraging deep ILP in the OoO thread and high TLP in lightweight InO threads, SHADOW maximizes CPU utilization without sacrificing single-thread performance.
Unlike conventional SMT, which uniformly shares resources, SHADOW dynamically reallocates execution resources based on workload behavior. It achieves up to 3.16x speedup and 1.38x geomean improvement over an OoO CPU, with just 1% area and power overhead, demonstrating its ability to accelerate memory-bound workloads efficiently.

This thesis demonstrates how cross-domain architectural techniques can improve execution efficiency by integrating ILP into GPUs and TLP into CPUs. GhOST and SHADOW reduce execution inefficiencies, enhancing performance in workloads that challenge conventional architectures.