[talks] Y Zhang preFPO
Melissa Lawson
mml at CS.Princeton.EDU
Fri Mar 25 13:17:12 EDT 2011
Yun Zhang will present her preFPO on Friday, April 1 at 2 PM in Room 301. The members
of her committee are: David August, advisor; David Walker and Scott Mahlke (Michigan), readers;
Doug Clark and Jen Rexford, nonreaders. Everyone is invited to attend her talk. Her abstract
follows below.
-----------------------------------
Title: Decouple Redundant Execution for Transient Fault Tolerance
Abstract:
As semiconductor technology continues to scale, transient faults are
emerging as a critical reliability concern in modern microprocessors.
The increasing density of transistors on chip, reduced noise margin of
each transistor, and voltage scaling are making chips more susceptible
to transient faults than ever.
Both hardware and software solutions have been proposed for transient
fault detection and recovery. The hardware approach adds redundant
hardware modules to the system, thus requiring extra chip area as well
as hardware design and verification cost. In addition, the scope and
mechanism of fault tolerance are hardwired at design time, which could
be suboptimal depending on deployment environment. Software-only
solutions do not require any specialized hardware extensions and are
more flexible. However, even the best-performing software-only fault
tolerance techniques have significant performance cost. The overhead of
prior work comes from doubled register usage, frequent inter-core
communication, or barrier synchronizations. These factors prevent prior
techniques from being adopted widely.
To address these problems, this dissertation proposes a software-only
decoupled program execution framework for fault tolerance. A compiler
automatically transforms a program into its corresponding redundant
execution version. Only the values that escape the scope of replication
may affect the external behavior of the program and therefore need to
be verified for correctness. At program runtime, the program speculates
that the transient fault detection code never detects a fault, thus allowing
decoupled execution of program code, transient fault detection, I/O
operations and system calls. A comprehensive misspeculation detection
and recovery framework is also described for programs to detect a
transient fault and recover from it with low runtime cost. The prototype
of this framework was implemented as a set of automatic compiler
transformations and evaluated on a commodity multi-core system. The
evaluation demonstrated that with this framework, transient fault
tolerance can achieve best-in-class performance, high fault coverage,
and fast recovery with no hardware module involved.
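
The speculate, check, and recover cycle described above can be
illustrated with a small sequential simulation. This is only a sketch
under stated assumptions, not the framework from the dissertation: the
names run_speculative, step, epoch, and fault_at are hypothetical, the
two copies run one after the other rather than on separate cores, and a
fault is modeled as a single bit flip injected once into the leading
copy. Outputs are buffered and committed only after the trailing copy
confirms them; a mismatch discards the buffer and re-executes from the
last checkpoint.

```python
from typing import Callable, List, Optional, Tuple


def run_speculative(
    step: Callable[[int, int], int],
    state: int,
    inputs: List[int],
    epoch: int = 2,
    fault_at: Optional[int] = None,
) -> Tuple[List[int], int]:
    """Simulate redundant execution with deferred checking.

    Returns the committed outputs and the number of recoveries.
    All names here are illustrative assumptions.
    """
    committed: List[int] = []
    recoveries = 0
    pos = 0
    while pos < len(inputs):
        chunk = inputs[pos:pos + epoch]
        checkpoint = state  # state to roll back to on misspeculation

        # Leading copy: runs ahead, assuming no fault will be detected.
        lead = checkpoint
        buffered: List[int] = []
        for offset, x in enumerate(chunk):
            lead = step(lead, x)
            if fault_at == pos + offset:
                lead ^= 1        # simulated transient fault: one bit flip
                fault_at = None  # transient, so it does not recur on retry
            buffered.append(lead)  # output is held back, not yet visible

        # Trailing copy: recomputes and checks the values about to escape.
        trail = checkpoint
        ok = True
        for x, out in zip(chunk, buffered):
            trail = step(trail, x)
            if trail != out:
                ok = False
                break

        if ok:
            committed.extend(buffered)  # speculation held: release outputs
            state = lead
            pos += len(chunk)
        else:
            recoveries += 1  # misspeculation: drop buffer, retry the epoch
    return committed, recoveries


def running_sum(acc: int, x: int) -> int:
    return acc + x


clean, clean_recoveries = run_speculative(running_sum, 0, [1, 2, 3, 4, 5])
faulty, faulty_recoveries = run_speculative(
    running_sum, 0, [1, 2, 3, 4, 5], fault_at=2
)
```

In this sketch the run with an injected fault commits the same outputs
as the clean run, at the price of one re-executed epoch. A larger epoch
means fewer checks but more work discarded on recovery.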