Towards Wearout-Aware and Accelerated Self-Healing Digital Systems

Author: ORCID orcid.org/0000-0002-2374-3953
Guo, Xinfei, Computer Engineering - School of Engineering and Applied Science, University of Virginia
Advisor:
Stan, Mircea, Department of Electrical and Computer Engineering, University of Virginia
Abstract:

The down-scaling of CMOS technologies has continuously offered better performance, lower power and higher levels of integration. However, in advanced nodes with smaller feature sizes, on-chip components such as transistors and interconnects experience more aggressive degradation due to wearout (aging) effects, which are dominated by bias temperature instability (BTI) and electromigration (EM). Transistors become more susceptible to voltage stress because the effective field increases as the thin oxide scales. Similarly, the shrinking geometries of metal interconnects lead to higher current densities, and the tremendous number of transistors within a compact area results in higher power densities as well. Together, these lead to increased on-chip temperatures, which can accelerate the wearout effects. Meanwhile, with the ubiquity of electronics (e.g. IoT) in our daily lives, the demand for reliable system design has been increasing. Many such applications require very long lifetimes, higher utilization rates and tighter hard-error tolerances. They may also be deployed in extreme environmental conditions, such as high temperatures, which, unfortunately, further accelerate wearout.

Conventional techniques of coping with wearout by “tolerating”, “slowing down” or “compensating” still leave the wearout itself unchecked, since it keeps accumulating as the system operates. Moreover, the continuous increase of the irreversible components of wearout over the entire lifetime can eventually cause permanent errors and failures. This thesis proposes and demonstrates a new category of techniques that “repair” wearout in a physical sense through accelerated and active recovery: wearout (both BTI and EM) is reversed by actively applying conditions such as high temperature, negative voltages (for BTI) and reverse currents (for EM), leading to effective accelerated self-healing. By studying the frequency-dependent behaviors of wearout and recovery experimentally, we demonstrate that the permanent portion of wearout can be almost fully eliminated and avoided by using in-time scheduled recovery. Our experiments on hardware demonstrate that the accelerated self-healing techniques for both BTI and EM wearout are fast, effective and feasible to implement.

To enable on-chip implementation of the accelerated self-healing techniques and fully utilize the explored frequency-dependent behaviors, we propose a cross-layer accelerated self-healing (CLASH) system, which instruments the notion of recovery across multiple layers of the system stack, from the circuit level to the system level. A full set of circuit IP blocks is designed and implemented, including recovery boost components, novel BTI and EM sensors, a multi-mode recovery assist scheme and novel power gating structures. As wearout-induced failures become more visible at the system level, we also explore several potential architecture- and system-level solutions that take advantage of intrinsic sleep behaviors for full recovery. Overall, these techniques work together so that the entire system operates for more of the time at higher levels of performance and power efficiency, by fully exploiting the extra opportunities enabled by accelerated self-healing.

Leading-edge nodes such as FinFETs endeavor to offer the advantages of future scaled devices while offsetting the problems introduced by many generations of planar CMOS scaling. Adapting to the new challenges and fully benefiting from FinFETs requires growing knowledge and design experience. To contribute to this knowledge base, in this thesis we perform a comprehensive study based on circuit simulations across multiple technology nodes, ranging from conventional bulk, through advanced planar technologies such as Fully Depleted Silicon-on-Insulator (FDSOI), to FinFETs. Challenges such as wearout appear to be more pronounced in these advanced nodes, and they become especially critical in IoT applications, where industry is in the process of updating technologies. Each of those end markets has unique needs and characteristics, which affect how chips are used and under what conditions. We investigate how wearout can impact different categories of future IoT applications using foundry-provided wearout models. We conclude that wearout needs to be considered throughout the full design cycle and that IoT lifetime estimation needs to incorporate wearout as an important factor. IoT-specific design solutions for mitigating wearout are also presented in this thesis.

Degree:
PHD (Doctor of Philosophy)
Keywords:
Accelerated Self-Healing, BTI, EM, FinFETs, IoT, Recovery, Cross-layer
Language:
English
Rights:
All rights reserved (no additional license for public reuse)
Issued Date:
2018/04/16