Design of robust systems meeting stringent quality, reliability, and
availability requirements is becoming increasingly difficult in advanced
technologies. Major causes of increased hardware failures in
advanced technologies include greater susceptibility to radiation-induced
transient errors (also called soft errors), reduced timing and voltage margins,
significant process variability, and the possibility of increased infant mortality
when reliability screens such as burn-in become ineffective. The current design paradigm,
which assumes that no gate or interconnect will ever operate incorrectly within the
lifetime of a product, must change to cope with such failures. New architectural features
are required for robust system design, with built-in mechanisms for failure tolerance,
detection, and recovery during normal system operation.
The tutorial will focus on new design techniques required for building
robust systems: concurrent error detection, recovery, and self-repair. A
broad spectrum of circuit-level, logic-level, micro-architectural, hardware subsystem,
and software techniques will be covered; the associated trade-offs among techniques
will be presented. The choice of protection mechanisms to implement depends on a complex evaluation
of power and performance requirements and constraints, as well as the vulnerability of specific
circuits or structures to failures.
The applicability of the presented techniques to actual industrial designs will be a major
focus of this tutorial. An overview of various causes of hardware failures, such as
radiation-induced soft errors, infant mortality, manufacturing defects, and wearout mechanisms,
will also be presented.
Subhasish Mitra is a Senior Staff Engineer at Intel Corporation,
a Consulting Assistant Professor in the Electrical Engineering
Department of Stanford University, and the Associate Director of
the Center for Reliable Computing of Stanford University. His research
interests include robust system design, VLSI design and test, fault-tolerant computing,
and computer architecture.
At Intel, Dr. Mitra is responsible for developing enabling technologies for Design for Excellence
(DFX): Design for Testability, Reliability, Manufacturability, and Debug in advanced technologies.
At the Center for Reliable Computing (CRC) of Stanford University, he supervises Ph.D. students and
is currently involved with the Stanford CRC test chip experiment projects. Before that, he was the
leader of the DARPA-sponsored Stanford CRC ROAR (Reliability Obtained from Adaptive Reconfiguration)
project. During 2000-2001, he consulted for Agilent Technologies in its
System Chip Testing program.
Dr. Mitra has published more than 60 technical papers in leading conferences and journals,
and has invented design and test techniques that have seen widespread use in industry.
His most recent award is the Intel Achievement Award, Intel's highest corporate award, which he
received in 2004 for the development and deployment of a breakthrough test compression
technology.
Vijaykrishnan Narayanan is an associate professor in the Computer Science and
Engineering Department at the Pennsylvania State University. His research
interests are in the areas of energy-aware reliable systems, nano/VLSI systems
and computer architecture. Dr. Vijaykrishnan is currently leading a
DoE-supported project on experimental investigation of the influence of soft
errors on memories and FPGAs at the Breazeale Nuclear Reactor.
He has presented several tutorials and short courses
on low-power and reliable systems at various international conferences
and to industrial audiences. Dr. Vijaykrishnan has received several awards,
including the IEEE VLSI Transactions Best Paper Award,
the ACM SIGDA Outstanding New Faculty Award, the Upsilon Pi Epsilon award for
academic excellence, the NSF CAREER Award, and the IEEE Computer Society Richard E.
Merwin Award.
Lisa Spainhower is an IBM Distinguished Engineer in Poughkeepsie, NY. She
serves as RAS (Reliability, Availability and Serviceability) architect in
the Systems Technology and Architecture organization within IBM Systems and
Technology Group. She is a member of the IBM Academy of Technology, the IEEE, the IEEE Computer
Society, and the Executive Committee of the IEEE Technical Committee on
Fault Tolerant Computing. Lisa is a graduate of the University of
Michigan.
Yuan Xie is an Assistant Professor in the Computer Science and
Engineering Department at the Pennsylvania State University.
He received his B.S. degree from the Electronics Engineering Department,
Tsinghua University in Beijing, China, and his M.S. and Ph.D. degrees in computer
engineering from the Electrical Engineering Department, Princeton University.
Prior to joining Penn State in Fall 2003, he worked for IBM Microelectronics
Division's Worldwide Design Center. Dr. Xie's research interests include
VLSI design, reliable circuit design, embedded systems design, and electronic
design automation. Dr. Xie won the Semiconductor Research Corporation's Inventor
Recognition Award in 2002. He has presented several tutorials to industry
audiences on reliable circuit design.