Understanding and Optimizing Heterogeneous Soft-Error Protection

Szafaryn, Lukasz, Computer Engineering - School of Engineering and Applied Science, University of Virginia
Skadron, Kevin, Department of Computer Science, University of Virginia

Continued technology scaling in circuits exacerbates the effects of physical phenomena, such as particle strikes and process variation, that cause soft errors. While recent advances in fabrication technology decrease the severity of these effects for the next transistor generation, error rates will inevitably continue to increase with further scaling or changes in the operating environment. As a result, it is now becoming necessary to protect a larger range of processor types, including commodity products, in addition to high-end servers and safety-critical systems. Maintaining the same level of resiliency as processor complexity constantly increases requires novel approaches that use a wider range of resiliency techniques for improved coverage and efficiency.

While a broad range of existing and proposed protection techniques spans the hardware, architecture, and software layers, only a few of them have made it into commercial products, and most have only been analyzed individually on inconsistent platforms. Therefore, there is little understanding of the relative trade-offs between available solutions, which makes it difficult to arrive at optimal decisions for resilient design. Moreover, no previous work has attempted to compare the more efficient techniques from the architecture and software layers or to analyze their combinations with hardware-level solutions, which could potentially match protection needs more closely. Furthermore, there is little analysis of recovery cost, which is crucial to the efficiency of protection.

To better understand and optimize soft-error protection, we study the following research issues: 1) the design of a framework for analyzing the protection overheads of traditionally used resilience solutions at different core or component granularities for the OpenRISC sample processor; 2) the implementation of data-flow checking, an efficient architecture-level resilience technique, for the Leon and IVM processors, to compare it against low-overhead hardware- and software-level techniques, their complementary combinations, and recovery methods; and 3) the analysis of a hardware algorithm-specific protection technique for the FFT accelerator in comparison to traditionally used hardware solutions, to demonstrate the benefits of the algorithm-specific approach for accelerators.

PHD (Doctor of Philosophy)
Reliability, Computer Architecture, Computer Engineering, Hardware, Software, general-purpose processor, CPU, accelerator, Soft errors, OpenRISC, Leon, IVM
All rights reserved (no additional license for public reuse)
Issued Date: