My thesis research focuses on guaranteeing that software is
reliable in the presence of transient faults. A transient fault occurs
when an energetic particle strikes a chip and causes a change in state in the
processor. These faults do not cause permanent damage, but they may affect the
computation that is currently executing. Transient faults have been shown to be
the cause of crashes at major companies, and current technology trends, such as
decreasing transistor size, will make future processors much more
susceptible.
Researchers in architecture and compilers have already
developed a number of different solutions using software, hardware or a
combination of the two. The general approach of all these solutions is to
duplicate computations and check for consistency between the copies.
Unfortunately, generating correct fault-tolerant code is difficult, and there
has been very little research on verifying these techniques. Testing is rarely
sufficient because full coverage would require testing all combinations of
features in conjunction with all combinations of faults.
Along with colleagues, I designed a new type system that
guarantees that faulty executions of a well-typed program are identical to
non-faulty executions up to the point where a fault is detected. We use a hybrid
hardware-software solution in which an added hardware queue buffers the stores
from the original computation and only writes them to memory if the duplicate
computation agrees. Faults can occur at any point and randomly modify a register
or queue value. The assembly-level type system tracks information about the two
computations which we use to rigorously prove that well-typed programs satisfy
reliability properties. We use these reliability properties to show that all
executions containing a single fault will always detect the fault before the
difference from non-faulty versions becomes observable.
More recently, I designed a typed system for reasoning about
software solutions for control-flow faults. Control-flow faults arise when a
fault causes a jump or branch instruction to transfer control to an unexpected
code address. Current software solutions can detect most, but not all,
control-flow faults. I formally analyzed a subset of these solutions that can
detect any fault that causes incorrect control transfers between basic blocks.
Similarly to the previous work, this is accomplished by using a type system to
track reliability properties. However, the type system contains new techniques
that allow it to reason soundly even when code transfers to an unexpected
location and continues executing. Here, I proved that executions with a single
fault will visit at most one incorrect block before the fault is
detected.