Yun Zhang will present her preFPO on Friday, April 1 at 2PM in Room 301. The members of her committee are: David August, advisor; David Walker and Scott Mahlke (Michigan), readers; Doug Clark and Jen Rexford, nonreaders. Everyone is invited to attend her talk. Her abstract follows below.

-----------------------------------

Title: Decouple Redundant Execution for Transient Fault Tolerance

Abstract:
As semiconductor technology continues to scale, transient faults are emerging as a critical reliability concern in modern microprocessors. The increasing density of transistors on chip, reduced noise margin of each transistor, and voltage scaling are making chips more susceptible to transient faults than ever.

Both hardware and software solutions have been proposed for transient fault detection and recovery. The hardware approach adds redundant hardware modules to the system, thus requiring extra chip area as well as hardware design and verification cost. In addition, the scope and mechanism of fault tolerance are hardwired at design time, which could be suboptimal depending on the deployment environment. Software-only solutions do not require any specialized hardware extensions and are more flexible. However, even the best-performing software-only fault tolerance techniques have a significant performance cost. The overhead of prior work comes from doubled register usage, frequent inter-core communication, or barrier synchronizations. These factors prevent prior techniques from being adopted widely.

To address these problems, this dissertation proposes a software-only decoupled program execution framework for fault tolerance. A compiler automatically transforms a program into its corresponding redundant execution version. Only the values that escape the scope of replication may affect the external behavior of the program and therefore need to be verified for correctness.
At program runtime, the program speculates that transient fault detection code never detects a fault, thus allowing decoupled execution of program code, transient fault detection, I/O operations and system calls. A comprehensive misspeculation detection and recovery framework is also described for programs to detect a transient fault and recover from it with low runtime cost. The prototype of this framework was implemented as a set of automatic compiler transformations and evaluated on a commodity multi-core system. The evaluation demonstrated that with this framework, transient fault tolerance can achieve best-in-class performance, high fault coverage, and fast recovery with no hardware module involved.
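
For readers unfamiliar with redundant execution, the idea of checking only values that escape the replicated region can be illustrated with a small sketch. This is not the framework described in the abstract: it runs the two copies sequentially instead of decoupling them, and every name in it (checksum, run_redundant, flip_bit_at, fault_at, max_retries) is hypothetical and chosen only for illustration.

```python
# Illustrative sketch only; all names are hypothetical and not taken from the dissertation.

def checksum(values, flip_bit_at=None):
    """Sum the values; optionally flip one bit of one element to mimic a transient fault."""
    total = 0
    for i, v in enumerate(values):
        if i == flip_bit_at:
            v ^= 1 << 3  # simulated single-bit upset
        total += v
    return total


def run_redundant(values, fault_at=None, max_retries=3):
    """Run two copies and compare only the value that escapes (the returned result)."""
    for attempt in range(max_retries):
        # The simulated fault is injected on the first attempt only,
        # because a transient fault does not persist across re-execution.
        inject = fault_at if attempt == 0 else None
        leading = checksum(values, flip_bit_at=inject)
        trailing = checksum(values)  # redundant copy
        if leading == trailing:  # check at the escape point only
            return leading, attempt
        # Mismatch: discard both results and re-execute.
    raise RuntimeError("results still disagree after retries")
```

Calling run_redundant([1, 2, 3]) returns the result with zero retries, while run_redundant([1, 2, 3], fault_at=1) detects the mismatch on the first attempt and returns the correct result after one re-execution. Intermediate values inside checksum are never compared; only the escaping result is.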