[talks] Y Zhang preFPO

Melissa Lawson mml at CS.Princeton.EDU
Fri Mar 25 13:17:12 EDT 2011

Yun Zhang will present her preFPO on Friday April 1 at 2PM in Room 301.  The members 
of her committee are:  David August, advisor; David Walker and Scott Mahlke (Michigan), readers; 
Doug Clark and Jen Rexford, nonreaders.  Everyone is invited to attend her talk.  Her abstract 
follows below.

Title: Decouple Redundant Execution for Transient Fault Tolerance

As semiconductor technology continues to scale, transient faults are 
emerging as a critical reliability concern in modern microprocessors. 
The increasing density of transistors on chip, reduced noise margin of 
each transistor, and voltage scaling are making chips more susceptible 
to transient faults than ever.

Both hardware and software solutions have been proposed for transient 
fault detection and recovery. The hardware approach adds redundant 
hardware modules to the system, thus requiring extra chip area as well 
as hardware design and verification cost. In addition, the scope and 
mechanism of fault tolerance are hardwired at design time, which could 
be suboptimal depending on the deployment environment. Software-only 
solutions do not require any specialized hardware extensions and are 
more flexible. However, even the best-performing software-only fault 
tolerance techniques have significant performance cost. The overhead of 
prior work comes from doubled register usage, frequent inter-core 
communication, or barrier synchronizations. These factors prevent prior 
techniques from being adopted widely.

To address these problems, this dissertation proposes a software-only 
decoupled program execution framework for fault tolerance. A compiler 
automatically transforms a program into its corresponding redundant 
execution version. Only the values that escape the scope of replication 
may affect the external behavior of the program and therefore need to 
be verified for correctness. At program runtime, the program speculates 
that transient fault detection code never detects a fault, thus allowing 
decoupled execution of program code, transient fault detection, I/O 
operations and system calls. A comprehensive misspeculation detection 
and recovery framework is also described for programs to detect a 
transient fault and recover from it with low runtime cost. The prototype 
of this framework was implemented as a set of automatic compiler 
transformations and evaluated on a commodity multi-core system. The 
evaluation demonstrated that with this framework, transient fault 
tolerance can achieve best-in-class performance, high fault coverage, 
and fast recovery with no hardware module involved.
