Cross-Layer Reliability of Computing Systems
2: Department of Informatics & Telecommunications, National and Kapodistrian University of Athens, Athens, Greece
3: Politecnico di Torino, Turin, Italy
4: Ecole Centrale de Lyon, Lyon Institute of Nanotechnology, Lyon, France
5: Facultat d'Informatica de Barcelona, Universitat Politecnica de Catalunya-Barcelona Tech, Catalonia, Spain
Reliability has always been a major concern in the design of computing systems. However, the increasing complexity of such systems has made assuring reliability extremely costly, both in designing solutions to mitigate possible faults and in assessing the reliability of those solutions. Cross-layer reliability is fast becoming the preferred answer. In a cross-layer resilient system, physical- and circuit-level techniques mitigate low-level faults; hardware redundancy manages errors at the hardware architecture layer; and, finally, software-implemented error detection and correction mechanisms handle the errors that escape the lower layers of the stack. This book presents state-of-the-art solutions for increasing the resilience of computing systems, both at individual levels of abstraction and across multiple layers. The book begins with design techniques to improve the resilience of computing systems, covering the technological, logic, architectural and software layers. The second part of the book focuses on cross-layer resilience, including coverage of physical stress, reliability assessment approaches, fault injection at the instruction set architecture (ISA) level, analytical modeling for cross-layer resiliency, and stochastic methods. Cross-Layer Reliability of Computing Systems is a valuable resource for researchers, postgraduate students and practicing computer architects concerned with the dependability of computing systems.
Inspec keywords: stochastic processes; instruction sets; computer architecture; fault tolerant computing; multiprocessing systems
Other keywords: fault injection; multicore processors; soft error simulation; stochastic methods; soft error modeling; software layer; physical stress; analytical modeling; ISA level; technological layer; cross-layer resiliency; logic layer; architectural layer; instruction set architecture level; microarchitecture-level reliability assessment; cross-layer resilience
Subjects: Computer architecture; Other topics in statistics; Multiprocessing systems; Performance evaluation and testing; General and management topics; Systems software
- Book DOI: 10.1049/PBCS057E
- ISBN: 9781785617973
- e-ISBN: 9781785617980
- Page count: 329
- Format: PDF
Front Matter
(1 page)
Part I. Design techniques to improve the resilience of computing systems
1 Technological layer
pp. 3–22 (20 pages)
This chapter describes the fundamental characteristics of Complementary Metal-Oxide-Semiconductor (CMOS) technology and how it can be assessed for system reliability studies. After some definitions, the dominant manufacturing technologies are described together with their advantages and disadvantages. Then, the core memory circuits used in today's computing systems are presented. Finally, the chapter provides an evaluation of these memory circuits with respect to reliability across technology nodes.
2 Design techniques to improve the resilience of computing systems: logic layer
pp. 23–41 (19 pages)
High-reliability and high-dependability applications require integrated solutions against both permanent and transient hardware faults. Random hardware faults and intermittent faults are generated by process variations or by time-dependent variations, i.e., aging, while transient faults are induced by radiation (soft errors), extreme operating conditions, or electromagnetic interference. Indeed, nanometric static process variations, dynamic voltage and temperature fluctuations due to chip activity, Bias Temperature Instability caused by stress on the transistors, and single-event effects (soft errors) are reported to be major issues in nanometric technology nodes [1,2]. If not properly addressed, these phenomena degrade performance and may reduce circuit lifetime and Mean Time To Failure. Hence, accurate on-chip yield, reliability and performance monitors that check, online or periodically, for guardband violations have become necessary. Adaptive compensation schemes are combined with such monitors in an attempt to recover from potential errors when timing violations occur. This chapter presents the state of the art in performance and reliability monitors, covering insertion methodologies and experimental results for the different sensors and monitors used to compensate for process and environmental variations as well as aging.
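As a deliberately simplified illustration of the monitor-plus-compensation loop described above, the following Python sketch models a periodic guardband check that nudges the supply voltage up when timing slack erodes. The sensor model, thresholds and step sizes are all invented for illustration.

```python
import random

GUARDBAND_PS = 50.0   # assumed timing guardband, picoseconds
VDD_STEP = 0.01       # assumed compensation step, volts
VDD_MAX = 1.10        # assumed maximum allowed supply voltage

def read_slack_ps():
    # Stand-in for an on-chip timing-slack monitor readout; models
    # aging-degraded slack as a noisy measurement.
    return random.gauss(60.0, 15.0)

def periodic_check(vdd):
    slack = read_slack_ps()
    if slack < GUARDBAND_PS and vdd < VDD_MAX:
        vdd = min(vdd + VDD_STEP, VDD_MAX)  # trade power for timing margin
    return vdd

vdd = 0.90
for _ in range(100):                        # 100 monitoring periods
    vdd = periodic_check(vdd)
print(f"final compensated VDD: {vdd:.2f} V")
```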
3 Design techniques to improve the resilience of computing systems: architectural layer
pp. 43–93 (51 pages)
Unreliable hardware components affect computing systems at several levels, all the way from incorrect transistor outputs to incorrect values in memory elements, incorrect program variables and control flow, and ultimately application failure. Resilience is the ability of a system to tolerate errors when they occur, and it comprises two main aspects: (i) how to detect errors and (ii) how to recover from them. The lower the level of abstraction at which an error can be detected and corrected, the less disruption it causes to the upper layers of the computing abstraction stack. This chapter gives an overview of processor-architecture-level techniques to detect and correct errors.
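A minimal sketch of those two aspects, detection and recovery, might look like the following Python fragment, which pairs dual modular redundancy (execute twice, compare) with checkpoint/rollback. It is a toy model, not a description of any specific processor design in the chapter.

```python
import copy

def run_with_rollback(steps, state, max_retries=3):
    """Each step is a pure function on the state; faults surface as mismatches."""
    for step in steps:
        checkpoint = copy.deepcopy(state)      # recovery point
        for _ in range(max_retries):
            a = step(copy.deepcopy(state))     # redundant execution 1
            b = step(copy.deepcopy(state))     # redundant execution 2
            if a == b:                         # detection: outputs agree
                state = a
                break
            state = copy.deepcopy(checkpoint)  # recovery: roll back and retry
        else:
            raise RuntimeError("persistent mismatch: suspected permanent fault")
    return state

print(run_with_rollback([lambda s: s + 1, lambda s: s * 2], 0))  # -> 2
```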
4 Design techniques to improve the resilience of computing systems: software layer
pp. 95–111 (17 pages)
Hardware techniques to improve the robustness of a computing system can be very expensive and difficult to implement and validate. Moreover, they require long evaluation processes that can lead to a redesign of the hardware itself when reliability requirements are not satisfied. This chapter covers software techniques that improve the system's tolerance to hardware faults by acting at the software level only. We cover recently proposed approaches to detect and correct both transient and permanent faults.
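To give a flavor of such software-only techniques, here is a minimal Python sketch in the spirit of instruction-duplication schemes (EDDI/SWIFT-style approaches): each computation runs twice on separate copies of its input, and a checker compares the results before they are used. The names and structure are invented for illustration.

```python
class SilentDataCorruption(Exception):
    """Raised when the duplicated computations disagree."""

def duplicated(f, x):
    x_main, x_shadow = x, x              # two "registers" holding the value
    r_main = f(x_main)                   # original computation
    r_shadow = f(x_shadow)               # duplicated computation
    if r_main != r_shadow:               # checker inserted before use
        raise SilentDataCorruption("duplicate computations disagree")
    return r_main

print(duplicated(lambda v: v * v + 1, 12))  # -> 145
```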
5 Cross-layer resilience
pp. 113–153 (41 pages)
Resilience to errors in the underlying hardware is a key design objective for a large class of computing systems, from embedded systems all the way to the cloud. Sources of hardware errors include radiation, circuit aging, variability induced by manufacturing and operating conditions, manufacturing test escapes, and early-life failures. Many publications have argued that cross-layer resilience, where multiple error resilience techniques from different layers of the system stack cooperate, is essential for designing cost-effective resilient digital systems. This chapter presents a unique framework to address the fundamental cross-layer resilience question: how to achieve desired resilience targets at minimal cost (energy, power, execution time, and area) by combining resilience techniques across the layers of the system stack (circuit, logic, architecture, software, and algorithm). The framework systematically explores the large space of resilience techniques and their combinations across layers, derives cost-effective solutions that achieve resilience targets at minimal cost, and provides guidelines for the design of new resilience techniques.
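The optimization at the core of such a framework can be caricatured in a few lines of Python: choose, per layer, the combination of techniques that meets a coverage target at minimal cost. The coverage/cost numbers and the independence assumption below are invented and far simpler than any real framework's models.

```python
from itertools import product

# (technique name, fraction of errors covered, relative energy cost) - invented
LAYERS = {
    "circuit":      [("none", 0.00, 0.0), ("hardened flip-flops", 0.70, 0.15)],
    "architecture": [("none", 0.00, 0.0), ("ECC + parity", 0.80, 0.10)],
    "software":     [("none", 0.00, 0.0), ("selective duplication", 0.60, 0.20)],
}

def combined_coverage(choice):
    uncovered = 1.0
    for _, cov, _ in choice:
        uncovered *= (1.0 - cov)   # assume per-layer coverage is independent
    return 1.0 - uncovered

TARGET = 0.95
best = min(
    (c for c in product(*LAYERS.values()) if combined_coverage(c) >= TARGET),
    key=lambda c: sum(cost for _, _, cost in c),
)
print([name for name, _, _ in best], f"coverage={combined_coverage(best):.3f}")
```

With these toy numbers, no single layer reaches the target, so the cheapest feasible solution combines techniques from all three layers, which is exactly the cross-layer argument.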
Part II. Reliability assessment
6 Physical stress
pp. 157–174 (18 pages)
This chapter covers some of the general challenges in the physical stress testing of electronic devices. The benefits of using particle beams to evaluate device reliability are numerous, including realistic error rates and realistic error models. Additionally, thanks to the accelerated beam, a statistically significant amount of data can be gathered in a relatively short time (hours). However, preparing the setup is very challenging and, while some general guidelines exist, each facility has its own constraints. When performing beam experiments, it is essential to design the setup carefully. Because beam experiments do not offer the same visibility as fault injection, it is necessary to ensure that all components are exercised and that appropriate benchmarks are run. Finally, the correlation between beam experiment data and fault injection results is still an open question.
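The arithmetic behind "statistically significant data in hours" is the standard cross-section calculation. The sketch below, with invented counts, derives a device cross-section from a beam run and extrapolates it to a sea-level soft error rate using the commonly cited JEDEC reference neutron flux.

```python
errors_observed = 48            # errors counted during the run (invented)
flux = 2.0e6                    # beam flux, particles/(cm^2 * s) (invented)
exposure_s = 3600.0             # one hour of beam time
fluence = flux * exposure_s     # particles/cm^2 delivered to the device

sigma_cm2 = errors_observed / fluence          # cross-section per device
print(f"cross-section = {sigma_cm2:.2e} cm^2")  # ~6.67e-09 cm^2

# Extrapolation to the natural environment: ~13 n/(cm^2 * h) of high-energy
# neutrons at sea level (JEDEC JESD89 reference flux).
fit = sigma_cm2 * 13.0 * 1e9    # failures per 10^9 device-hours
print(f"estimated sea-level rate ~ {fit:.0f} FIT")   # ~87 FIT
```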
7 Soft error modeling and simulation
pp. 175–216 (42 pages)
Although the sources of soft errors are device-level interactions, the generated errors can propagate and cause system-level failures. As a result, it is very important to analyze the impact of soft errors with a device-to-system approach: an efficient soft error vulnerability estimation technique must accurately model error generation at the device level as well as masking behavior at the higher abstraction levels. The proposed cross-layer Soft Error Rate (SER) analysis platform employs a combination of empirical models at the device level, error-site analysis at the chip-layout level, analytical Error Propagation (EP) at the logic level, and fault simulation/emulation at the architecture/application level to provide the detailed contribution of each component (flip-flops, combinational gates, and memory arrays) to the overall SER. At each stage in the modeling hierarchy, an appropriate level of abstraction is used to propagate the effect of errors to the next higher level.
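How per-component contributions combine can be illustrated with a toy derating computation. All rates and derating factors below are invented; a real platform derives them from the device-, layout-, logic- and architecture-level analyses listed above.

```python
# component: (raw FIT, logic-level masking derate, architecture-level derate)
components = {
    "flip-flops":          (400.0, 1.00, 0.35),
    "combinational gates": (150.0, 0.20, 0.35),  # heavy logic-level masking
    "SRAM arrays":         (900.0, 1.00, 0.15),  # post-ECC residual folded in
}

total_ser = 0.0
for name, (raw_fit, logic_derate, arch_derate) in components.items():
    contribution = raw_fit * logic_derate * arch_derate
    print(f"{name:22s} {contribution:7.1f} FIT")
    total_ser += contribution

print(f"{'total system SER':22s} {total_ser:7.1f} FIT")
```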
8 Microarchitecture-level reliability assessment of multi-core processors
pp. 217–238 (22 pages)
One of the primary purposes of reliability evaluation is to identify and protect the vulnerable components of a system. At the hardware level, the results of the evaluation can lead to design revisions that aim to increase the fault tolerance of the system. Every potential change is subject to validation and requires further iterations of reliability evaluation. This back-and-forth process is expensive, especially given that hardware design changes take significant time to apply. To address this problem, microarchitecture-level reliability assessment has been proposed. Instead of assessing the actual hardware design, the evaluation is performed on microarchitecture-level (performance) models, which are often available very early in the design cycle and are both flexible and highly observable. Having reliability evaluation results before the actual design implementation enables early reliability-related design decisions and significantly reduces the cost of redesign cycles. However, the absence of transistor-level detail in the evaluation inevitably results in some accuracy loss. Only components that are accurately modeled at the microarchitecture level, mostly memory elements (Static Random-Access Memory (SRAM) arrays, flip-flops, and latches), can be assessed. Combinational logic and sequential elements are, for the most part, modeled only functionally and thus cannot be evaluated at the microarchitecture level. Fortunately, the literature suggests that only a small portion (<10%) of failures stems from these elements, which implies that the accuracy loss attributable to the unmodeled resources is limited. In this chapter, we present the throughput, capabilities, and accuracy of microarchitecture-level reliability assessment and show how it can be used effectively at early design stages.
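Microarchitecture-level campaigns are usually statistical. A back-of-the-envelope sketch of how a failure probability estimate and its confidence interval are obtained from a sample of injections (counts invented) follows.

```python
import math

injections = 2000   # faults injected into, e.g., a register-file model (invented)
failures = 124      # runs ending in silent data corruption or crash (invented)

p_hat = failures / injections
z = 1.96            # 95% confidence, normal approximation
margin = z * math.sqrt(p_hat * (1.0 - p_hat) / injections)

print(f"estimated failure probability: {p_hat:.3f} +/- {margin:.3f} (95% CI)")
```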
9 Fault injection at the instruction set architecture (ISA) level
pp. 239–260 (22 pages)
Fault Injection (FI) is a commonly used technique to evaluate the reliability of systems. As soft errors become more common in computer systems, it is often necessary to involve the software in the overall system's resilience. It is therefore important to inject faults at the Instruction Set Architecture (ISA) level, emulating soft errors that are visible to the software, in order to test software resilience mechanisms. Consequently, there is a need for ISA-level FI tools and techniques. We start by outlining the goals of ISA-level FI, followed by the main metrics such tools can measure. We then survey techniques in the literature that inject faults at the ISA level and above in the system stack. Finally, we present an overview of LLFI and PINFI, two fault injectors developed in our research group, which allow programmers to inject faults at the LLVM compiler's Intermediate Representation (IR) level and at the x86 assembly code level, respectively. We conclude with a survey of the open challenges in the area.
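At its core, an ISA-level injector perturbs architectural state. The generic sketch below flips a single bit of a 64-bit "register" value; it is a standalone illustration and does not use the actual LLFI or PINFI APIs, which operate on LLVM IR and x86 binaries respectively.

```python
import random

def flip_bit(value, width=64, bit=None):
    """Return value with one randomly chosen (or given) bit inverted."""
    if bit is None:
        bit = random.randrange(width)
    return (value ^ (1 << bit)) & ((1 << width) - 1), bit

reg = 0x0000_0000_DEAD_BEEF          # architectural state before injection
faulty, bit = flip_bit(reg)
print(f"flipped bit {bit}: {reg:#018x} -> {faulty:#018x}")
```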
10 Analytical modeling for cross-layer resiliency
pp. 261–279 (19 pages)
Analytical models, techniques and methods are mathematical models that have a closed-form solution; in other words, the solution to the equations describing the changes in the system being modeled can be expressed as an analytic function. They are widely used in the design of complex systems. In microprocessor design, they are most often employed to analyze performance, power and reliability. In reliability, analytical models are typically employed to compute vulnerability metrics such as the Architectural Vulnerability Factor (AVF) [1] or the Program Vulnerability Factor (PVF) [2]. Vulnerability factors quantify the probability that a bit corruption or fault results in an actual error by determining whether the fault propagates to an observable error state. Analytical models can compute the vulnerability factors of a system orders of magnitude faster than brute-force methods such as statistical fault injection [3], at the cost of some accuracy [4]. This speed/accuracy trade-off is at the heart of a well-balanced analytical model. Analytical models may be used at any level of modeling abstraction, from Register Transfer Level (RTL) models, through emulation models, all the way up to architectural or functional models [5-9]. Most of the analytical modeling techniques described in this chapter are used at the functional-model level, the architectural/microarchitectural level, or the RTL. Some are hybrid techniques that use multiple levels of abstraction: they draw high accuracy from lower-level models such as RTL while retaining the speed benefits of higher-level architectural models [10].
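The AVF computation reduces to counting ACE (Architecturally Correct Execution) bit-cycles: the fraction of bit-cycles during which a structure holds state that matters for correct execution. The sketch below applies that definition to an invented occupancy trace for a register file.

```python
TOTAL_BITS = 64 * 32        # e.g., a 32-entry, 64-bit register file (assumed)
TOTAL_CYCLES = 10_000

# (number of ACE bits resident, number of cycles they were resident) - invented
ace_residency = [(64 * 6, 4_000), (64 * 12, 3_000), (64 * 2, 3_000)]

ace_bit_cycles = sum(bits * cycles for bits, cycles in ace_residency)
avf = ace_bit_cycles / (TOTAL_BITS * TOTAL_CYCLES)
print(f"AVF = {avf:.3f}")   # a derating factor multiplied into the raw error rate
```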
11 Stochastic methods
pp. 281–304 (24 pages)
It is clear from the previous chapters of this book that fault-injection techniques and analytical approaches for cross-layer reliability analysis each have positive and negative aspects that must be carefully weighed when choosing how to evaluate the reliability of a computing system; neither alone represents an optimal solution. With the increasing complexity of future computing systems, analyzing the impact on system reliability of any change in the technology, circuit, microarchitecture or software is a critical and complex design task that requires proper tools and models. The adoption of cross-layer reliability techniques makes this analysis even more complex and challenging. There is therefore increasing interest in stochastic reliability models that combine the benefits of fault-injection techniques at different abstraction levels with analytical approaches such as Register Data Lifetime [1] or Architecturally Correct Execution [2-4] analysis into a unified stochastic model able to cope with the complexity of the target design. This trend is reinforced by growing discussion of the use of such models in the framework of relevant reliability and safety standards such as IEC 61508 [5] and ISO 26262 [6].
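As a taste of the stochastic approach, the sketch below evolves a tiny discrete-time Markov model of fault arrival, detection/recovery and failure. The states and per-cycle probabilities are invented; a unified model of the kind the chapter describes would derive them from fault-injection and analytical results.

```python
import numpy as np

# states: 0 = correct, 1 = latent fault, 2 = detected & recovered, 3 = failed
P = np.array([
    [0.999999, 0.000001, 0.00, 0.00],  # per-cycle fault arrival
    [0.00,     0.90,     0.08, 0.02],  # latent fault lingers, is caught, or fails
    [1.00,     0.00,     0.00, 0.00],  # recovery returns to the correct state
    [0.00,     0.00,     0.00, 1.00],  # failure is absorbing
])

start = np.array([1.0, 0.0, 0.0, 0.0])
dist = start @ np.linalg.matrix_power(P, 1_000_000)   # one million cycles
print(f"P(failed within 1e6 cycles) ~ {dist[3]:.3f}")  # roughly 1 - exp(-0.2)
```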
Back Matter
(1 page)