Bhargav Reddy Godala will present his FPO "Criticality-Aware Front-end" on Tuesday, 5/14/2024 at 11am in CS 402.

The members of his committee are as follows:
Examiners: David August (adviser), Margaret Martonosi, and David Wentzlaff
Readers: Gilles A. Pokam (Intel), Svilen Kanev (Google), and David August

Please see abstract below.  All are welcome to attend.

Code footprints continue to grow faster than instruction caches, putting additional pressure on existing front-end structures. Even with aggressive front-ends employing fetch-directed instruction prefetching (FDIP), modern processors experience significant front-end stalls. With the end of Moore's Law, increasing cache sizes raises critical-path latency and yields only modest returns for scaling the instruction cache. This dissertation aims to address front-end bottlenecks by making two key observations: in FDIP-enabled processors, cache misses have unequal costs, and a small fraction of critical instruction cache lines contributes most of the front-end stalls.
EMISSARY, the first cost-aware replacement policy tailored for the L1 instruction cache (L1I), defies conventional wisdom: unlike traditional replacement policies, it improves performance even while increasing instruction cache misses. However, EMISSARY proves less effective on datacenter workloads with large code footprints, because these workloads have more critical lines than the L1I can hold. This dissertation first presents the improved EMISSARY-L2, the first family of criticality-aware cache replacement policies designed specifically for datacenter workloads. Observing that modern architectures entirely tolerate many instruction cache misses, EMISSARY-L2 resists evicting those cache lines whose misses cause costly decode starvations from L2. In the context of a modern FDIP-enabled processor, EMISSARY-L2 delivers a 3.24% geomean speedup (up to 23.7%) and a geomean energy savings of 2.1% (up to 17.7%) when evaluated on datacenter workloads. This speedup is 21.6% of the speedup obtained by an unrealizable L2 cache with a zero-cycle miss latency for all capacity and conflict instruction misses.
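To make the replacement idea concrete, the sketch below illustrates a criticality-aware victim-selection policy for one L1I set. It is not the dissertation's implementation: the per-line criticality bit, the probabilistic marking, and all parameters are assumptions introduced only for illustration, loosely following the abstract's idea of resisting eviction of lines whose misses cause decode starvations served from L2.

```python
import random


class CacheLine:
    def __init__(self, tag):
        self.tag = tag
        self.critical = False   # hypothetical per-line criticality bit
        self.lru_stamp = 0      # larger value = more recently used


class CriticalityAwareSet:
    """One set of a set-associative L1I with a toy criticality-aware policy.

    Assumption (not from the abstract): criticality is recorded
    probabilistically when a miss caused a decode starvation filled from L2.
    """

    def __init__(self, ways=8, mark_probability=0.5):
        self.ways = ways
        self.mark_probability = mark_probability
        self.lines = []
        self.clock = 0

    def access(self, tag, caused_decode_starvation=False):
        self.clock += 1
        for line in self.lines:
            if line.tag == tag:                      # hit
                line.lru_stamp = self.clock
                return True
        # Miss: optionally mark the incoming line as critical.
        new_line = CacheLine(tag)
        new_line.lru_stamp = self.clock
        if caused_decode_starvation and random.random() < self.mark_probability:
            new_line.critical = True
        if len(self.lines) >= self.ways:
            self._evict()
        self.lines.append(new_line)
        return False

    def _evict(self):
        # Prefer evicting the LRU non-critical line; only if every resident
        # line is marked critical, fall back to plain LRU over all lines.
        non_critical = [l for l in self.lines if not l.critical]
        pool = non_critical if non_critical else self.lines
        victim = min(pool, key=lambda l: l.lru_stamp)
        self.lines.remove(victim)
```

The key design point the sketch captures is that replacement is biased by miss cost rather than recency alone: lines whose misses stalled the decoder are retained even if that raises the raw miss count.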
This dissertation then proposes Priority Directed Instruction Prefetching (PDIP), a novel cost-aware instruction prefetching technique that complements FDIP by issuing prefetches for targets along the resteer path where FDIP stalls occur. PDIP identifies these targets and associates them with a trigger for future prefetch. When paired with EMISSARY-L2, PDIP achieves a geomean IPC speedup of 3.7% across a set of datacenter workloads using a budget of only 43.5KB, achieving 62% of the ideal prefetching performance.
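As an illustration of the trigger/target idea, the snippet below keeps a small table mapping a trigger PC to the cache-line targets observed along a resteer path that stalled FDIP, and replays prefetches for those targets the next time the trigger is fetched. This is a sketch under assumed details (table sizes, eviction policy, and the callback interface are all hypothetical), not PDIP's actual microarchitecture.

```python
from collections import OrderedDict


class PrefetchTable:
    """Toy trigger -> target table in the spirit of priority-directed
    instruction prefetching. Sizes and policies are illustrative only."""

    def __init__(self, max_entries=64, targets_per_entry=4):
        self.max_entries = max_entries
        self.targets_per_entry = targets_per_entry
        self.table = OrderedDict()   # trigger PC -> list of target line addresses

    def record_stall(self, trigger_pc, resteer_target_lines):
        """Called when FDIP stalls on a resteer: remember which lines were
        needed so they can be prefetched the next time the trigger is seen."""
        targets = resteer_target_lines[: self.targets_per_entry]
        if trigger_pc in self.table:
            self.table.move_to_end(trigger_pc)
        elif len(self.table) >= self.max_entries:
            self.table.popitem(last=False)           # evict the oldest entry
        self.table[trigger_pc] = targets

    def on_fetch(self, pc, issue_prefetch):
        """Called as the front end fetches: if pc is a known trigger,
        issue prefetches for its associated target lines."""
        for line_addr in self.table.get(pc, []):
            issue_prefetch(line_addr)


# Example usage with a stand-in prefetch callback.
if __name__ == "__main__":
    table = PrefetchTable()
    table.record_stall(trigger_pc=0x400120,
                       resteer_target_lines=[0x401000, 0x401040])
    table.on_fetch(0x400120,
                   issue_prefetch=lambda addr: print(f"prefetch line {addr:#x}"))
```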