My thesis research focuses on guaranteeing that software is reliable in the presence of transient faults. A transient fault occurs when an energetic particle strikes a chip and causes a change in state in the processor. These faults do not cause permanent damage, but they may affect the computation that is currently executing. Transient faults have been shown to be the cause of crashes at major companies, and current technology trends, such as decreasing transistor size, will make future processors much more susceptible.

Researchers in architecture and compilers have already developed a number of different solutions using software, hardware or a combination of the two. The general approach of all these solutions is to duplicate computations and check for consistency between the copies. Unfortunately, generating correct fault-tolerant code is difficult, and there has been very little research on verifying these techniques. Testing is rarely sufficient because full coverage would require testing all combinations of features in conjunction with all combinations of faults.

Along with colleagues, I designed a new type system that guarantees that faulty executions of a well-typed program are identical to non-faulty executions up to the point where a fault is detected. We use a hybrid hardware-software solution in which an added hardware queue buffers the stores from the original computation and only writes them to memory if the duplicate computation agrees. Faults can occur at any point and randomly modify a register or queue value. The assembly-level type system tracks information about the two computations which we use to rigorously prove that well-typed programs satisfy reliability properties. We use these reliability properties to show that all executions containing a single fault will always detect the fault before the difference from non-faulty versions becomes observable.

More recently, I designed a typed system for reasoning about software solutions for control-flow faults. Control-flow faults arise when a fault causes a jump or branch instruction to transfer control to an unexpected code address. Current software solutions can detect most, but not all, control-flow faults. I formally analyzed a subset of these solutions that can detect any fault that causes incorrect control transfers between basic blocks. Similarly to the previous work, this is accomplished by using a type system to track reliability properties. However, the type system contains new techniques that allow it to reason soundly even when code transfers to an unexpected location and continues executing. Here, I proved that executions with a single fault will visit at most one incorrect block before the fault is detected.