Bhargav Reddy Godala will present his FPO "Criticality-Aware Front-end" on Tuesday, 5/14/2024 at 11am in CS 402.

The members of his committee are as follows:
Examiners: David August (adviser), Margaret Martonosi, and David Wentzlaff
Readers: Gilles A. Pokam (Intel), Svilen Kanev (Google), and David August

Please see abstract below.  All are welcome to attend.

Code footprints continue to grow faster than instruction caches, putting additional pressure on existing front-end structures. Even with aggressive front-ends employing fetch-directed instruction prefetching (FDIP), modern processors experience significant front-end stalls. With the end of Moore's Law, increasing cache sizes raises critical-path latency and yields only modest returns for scaling the instruction cache. This dissertation aims to address front-end bottlenecks by making two key observations: in FDIP-enabled processors, cache misses have unequal costs, and a small fraction of critical instruction cache lines contributes most of the front-end stalls.
EMISSARY, the first cost-aware replacement policy tailored for the L1 instruction cache (L1I), defies conventional wisdom: unlike traditional replacement policies, it improves performance even while increasing instruction cache misses. However, EMISSARY proves less effective on datacenter workloads with large code footprints, because these workloads have more critical lines than the L1I can hold. This dissertation first presents the improved EMISSARY-L2, the first family of criticality-aware cache replacement policies designed specifically for datacenter workloads. Observing that modern architectures entirely tolerate many instruction cache misses, EMISSARY-L2 resists evicting those cache lines whose misses cause costly decode starvations from L2. In the context of a modern FDIP-enabled processor, EMISSARY-L2 delivers a 3.24% geomean speedup (up to 23.7%) and a geomean energy savings of 2.1% (up to 17.7%) when evaluated on datacenter workloads. This speedup is 21.6% of the speedup obtained by an unrealizable L2 cache with a zero-cycle miss latency for all capacity and conflict instruction misses.
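To make the replacement idea concrete, the sketch below illustrates a criticality-aware victim-selection policy for one L1I set. It is not the dissertation's implementation: the per-line criticality bit, the probabilistic marking, and all parameters are assumptions introduced only for illustration, loosely following the abstract's idea of resisting eviction of lines whose misses cause decode starvations served from L2.

```python
import random


class CacheLine:
    def __init__(self, tag):
        self.tag = tag
        self.critical = False   # hypothetical per-line criticality bit
        self.lru_stamp = 0      # larger value = more recently used


class CriticalityAwareSet:
    """One set of a set-associative L1I with a toy criticality-aware policy.

    Assumption (not from the abstract): criticality is recorded
    probabilistically when a miss caused a decode starvation filled from L2.
    """

    def __init__(self, ways=8, mark_probability=0.5):
        self.ways = ways
        self.mark_probability = mark_probability
        self.lines = []
        self.clock = 0

    def access(self, tag, caused_decode_starvation=False):
        self.clock += 1
        for line in self.lines:
            if line.tag == tag:                      # hit
                line.lru_stamp = self.clock
                return True
        # Miss: optionally mark the incoming line as critical.
        new_line = CacheLine(tag)
        new_line.lru_stamp = self.clock
        if caused_decode_starvation and random.random() < self.mark_probability:
            new_line.critical = True
        if len(self.lines) >= self.ways:
            self._evict()
        self.lines.append(new_line)
        return False

    def _evict(self):
        # Prefer evicting the LRU non-critical line; only if every resident
        # line is marked critical, fall back to plain LRU over all lines.
        non_critical = [l for l in self.lines if not l.critical]
        pool = non_critical if non_critical else self.lines
        victim = min(pool, key=lambda l: l.lru_stamp)
        self.lines.remove(victim)
```

The key design point the sketch captures is that replacement is biased by miss cost rather than recency alone: lines whose misses stalled the decoder are retained even if that raises the raw miss count.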
This dissertation then proposes Priority Directed Instruction Prefetching (PDIP), a novel cost-aware instruction prefetching technique that complements FDIP by issuing prefetches for targets along the resteer path where FDIP stalls occur. PDIP identifies these targets and associates them with a trigger for future prefetch. When paired with EMISSARY-L2, PDIP achieves a geomean IPC speedup of 3.7% across a set of datacenter workloads using a budget of only 43.5KB, achieving 62% of the ideal prefetching performance.
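As an illustration of the trigger/target idea, the snippet below keeps a small table mapping a trigger PC to the cache-line targets observed along a resteer path that stalled FDIP, and replays prefetches for those targets the next time the trigger is fetched. This is a sketch under assumed details (table sizes, eviction policy, and the callback interface are all hypothetical), not PDIP's actual microarchitecture.

```python
from collections import OrderedDict


class PrefetchTable:
    """Toy trigger -> target table in the spirit of priority-directed
    instruction prefetching. Sizes and policies are illustrative only."""

    def __init__(self, max_entries=64, targets_per_entry=4):
        self.max_entries = max_entries
        self.targets_per_entry = targets_per_entry
        self.table = OrderedDict()   # trigger PC -> list of target line addresses

    def record_stall(self, trigger_pc, resteer_target_lines):
        """Called when FDIP stalls on a resteer: remember which lines were
        needed so they can be prefetched the next time the trigger is seen."""
        targets = resteer_target_lines[: self.targets_per_entry]
        if trigger_pc in self.table:
            self.table.move_to_end(trigger_pc)
        elif len(self.table) >= self.max_entries:
            self.table.popitem(last=False)           # evict the oldest entry
        self.table[trigger_pc] = targets

    def on_fetch(self, pc, issue_prefetch):
        """Called as the front end fetches: if pc is a known trigger,
        issue prefetches for its associated target lines."""
        for line_addr in self.table.get(pc, []):
            issue_prefetch(line_addr)


# Example usage with a stand-in prefetch callback.
if __name__ == "__main__":
    table = PrefetchTable()
    table.record_stall(trigger_pc=0x400120,
                       resteer_target_lines=[0x401000, 0x401040])
    table.on_fetch(0x400120,
                   issue_prefetch=lambda addr: print(f"prefetch line {addr:#x}"))
```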