ISCA 2005 Tutorial: Robust System Design from Unreliable Components





Abstract:

Designing robust systems that meet stringent quality, reliability, and availability requirements is becoming increasingly difficult in advanced technologies. Major causes of the rise in hardware failures include increased susceptibility to radiation-induced transient errors (also called soft errors), reduced timing and voltage margins, significant process variability, and the possibility of increased infant mortality when reliability screens such as burn-in become ineffective. The current design paradigm, which assumes that no gate or interconnect will ever operate incorrectly within the lifetime of a product, must change to cope with such failures. New architectural features are required for robust system design, with built-in mechanisms for failure tolerance, detection, and recovery during normal system operation. The tutorial will focus on new design techniques required for building robust systems: concurrent error detection, recovery, and self-repair. A broad spectrum of circuit-level, logic-level, micro-architectural, hardware subsystem, and software techniques will be covered, and the associated trade-offs among techniques will be presented. The choice of protection mechanisms to implement is determined by a complex evaluation of power and performance requirements and constraints, in addition to the vulnerability of specific circuits or structures to failures. The applicability of the presented techniques to actual industrial designs will be a major focus of this tutorial. An overview of various causes of hardware failures, such as radiation-induced soft errors, infant mortality, manufacturing defects, and wearout mechanisms, will also be presented.

Reference


About the Presenters