[talks] F Perry preFPO

Melissa M Lawson mml at CS.Princeton.EDU
Thu Dec 13 14:48:37 EST 2007


Frances Perry will present her preFPO on Wednesday December 19 at 10:30 AM in 
Room 402.  The members of her committee are:  David Walker, advisor; Andrew 
Appel and David Tarditi (Microsoft), readers; David August and Margaret Martonosi, 
nonreaders.  A copy of her abstract follows below.  Everyone is invited to attend her
talk.

Formal Reasoning about Transient Faults

 

My thesis research focuses on guaranteeing that software is reliable in the presence of
transient faults.  A transient fault occurs when an energetic particle strikes a chip and
causes a change in state in the processor. These faults do not cause permanent damage, but
they may affect the computation that is currently executing. Transient faults have been
shown to be the cause of crashes at major companies, and current technology trends, such
as decreasing transistor size, will make future processors much more susceptible.

 

Researchers in architecture and compilers have already developed a number of different
solutions using software, hardware or a combination of the two. The general approach of
all these solutions is to duplicate computations and check for consistency between the
copies. Unfortunately, generating correct fault-tolerant code is difficult, and there has
been very little research on verifying these techniques. Testing is rarely sufficient
because full coverage would require testing all combinations of features in conjunction
with all combinations of faults.

 

Along with colleagues, I designed a new type system that guarantees that faulty executions
of a well-typed program are identical to non-faulty executions up to the point where a
fault is detected. We use a hybrid hardware-software solution in which an added hardware
queue buffers the stores from the original computation and only writes them to memory if
the duplicate computation agrees. Faults can occur at any point and randomly modify a
register or queue value. The assembly-level type system tracks information about the two
computations which we use to rigorously prove that well-typed programs satisfy reliability
properties. We use these reliability properties to show that all executions containing a
single fault will always detect the fault before the difference from non-faulty versions
becomes observable.

 

More recently, I designed a typed system for reasoning about software solutions for
control-flow faults. Control-flow faults arise when a fault causes a jump or branch
instruction to transfer control to an unexpected code address. Current software solutions
can detect most, but not all, control-flow faults. I formally analyzed a subset of these
solutions that can detect any fault that causes incorrect control transfers between basic
blocks. Similarly to the previous work, this is accomplished by using a type system to
track reliability properties. However, the type system contains new techniques that allow
it to reason soundly even when code transfers to an unexpected location and continues
executing. Here, I proved that executions with a single fault will visit at most one
incorrect block before the fault is detected.

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.cs.princeton.edu/pipermail/talks/attachments/20071213/e491a630/attachment.htm 


More information about the talks mailing list