Ultrascale Computing Systems
2: LaBRI Laboratory, University Bordeaux, Bordeaux, France
3: University of Sydney, Sydney, NSW, Australia
The needs of future digital data and computer systems are expected to be two to three orders of magnitude larger than for today's systems, to take account of unprecedented amounts of heterogeneous hardware, lines of source code, numbers of users, and volumes of data. Ultrascale computing systems (UCS) are a solution. Envisioned as large-scale complex systems joining parallel and distributed computing systems, which can be located at multiple sites and cooperate to provide the required resources and performance to the users, these technologies will extend individual systems to provide the resources that are very much needed. Based on the research work in the COST Action IC 1305 Network for Sustainable Ultrascale Computing (NESUS), this book presents important results and methods towards achieving sustainable UCS. The authors present a wide range of emerging programming models that facilitate the task of scaling and extracting performance on continuously evolving platforms, while providing resilience and fault-tolerance mechanisms to tackle the increasing probability of failures throughout the entire software stack. These methods are needed to achieve scale handling, better programmability and adaptation to rapidly changing underlying computing architectures, data-centric programming models, resilience, and energy efficiency.
Inspec keywords: human computer interaction; hardware-software codesign; fault tolerant computing; distributed processing; power aware computing
Other keywords: heterogeneous hardware; parallel computing systems; system resilience; UCS development; data management; multi-domain cooperative approaches; domain-specific interoperable tools; energy efficiency; fault tolerance; distributed computing systems; large-scale complex systems; programming models; ultrascale computing systems; human-computer interaction; hardware-software co-design principles
Subjects: Performance evaluation and testing; General and management topics; Parallel programming; Operating systems; Hardware-software codesign; Distributed systems software
- Book DOI: 10.1049/PBPC024E
- Chapter DOI: 10.1049/PBPC024E
- ISBN: 9781785618338
- e-ISBN: 9781785618345
- Page count: 302
- Format: PDF
Front Matter
p. (1)
1 Introduction
p. 1–7 (7)
With the spread of the Internet, applications and web-based services, distributed computing infrastructures, local parallel systems, and the availability of huge amounts of dispersed data, software-dependent systems will become increasingly connected and networked, leading to the creation of supersystems. The phrase ultrascale computing systems (UCSs) refers to this type of IT supersystem. UCSs are complex large-scale ecosystems aggregating high-performance parallel and distributed computing infrastructures. These systems provide the end user with intrinsically heterogeneous solutions, located at multiple sites and capable of delivering tremendous performance boosts. They are indispensable to applications that involve an increase of several orders of magnitude in the size of data and in computing power relative to today's conventional technologies. However, to really speak of UCS, we must consider an increase of several orders of magnitude in the size of data, in the computing power and in the network complexity relative to what exists now.
2 Programming models and runtimes
p. 9–63 (55)
Several million execution flows will run in ultrascale computing systems (UCS), and the task for the programmer of understanding their coherency, and for the runtime of coordinating them, is unfathomable. Moreover, given the large scale of UCS and its impact on reliability, the current static point of view is no longer sufficient. A runtime cannot restart an application because of the failure of a single node, since statistically several nodes will fail every day. Classical management of these failures by programmers using checkpoint-restart is also too limited because of its overhead at such a scale. This chapter explores the programming models and runtimes required to facilitate the task of scaling and extracting performance on continuously evolving platforms, while providing resilience and fault-tolerance mechanisms to tackle the increasing probability of failures throughout the whole software stack.
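The reliability argument in this abstract can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only and is not taken from the book: it assumes independent node failures at a constant rate, uses Young's first-order approximation for the checkpoint interval (which is only accurate when the checkpoint cost is small compared with the system MTBF), and all function names and figures are hypothetical.

```python
import math


def failures_per_day(node_mtbf_hours: float, nodes: int) -> float:
    """Expected node failures per day, assuming independent failures
    with a constant per-node rate of 1 / node_mtbf_hours."""
    return 24.0 * nodes / node_mtbf_hours


def system_mtbf_hours(node_mtbf_hours: float, nodes: int) -> float:
    """Mean time between failures anywhere in the system (same assumption)."""
    return node_mtbf_hours / nodes


def young_interval_hours(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's first-order approximation of the checkpoint interval:
    sqrt(2 * C * MTBF). Only a rough guide when C is not << MTBF."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


def checkpoint_overhead_fraction(checkpoint_cost_hours: float,
                                 interval_hours: float) -> float:
    """Fraction of wall-clock time spent writing checkpoints,
    ignoring rework and restart time after a failure."""
    return checkpoint_cost_hours / (interval_hours + checkpoint_cost_hours)


if __name__ == "__main__":
    # Hypothetical figures: 100,000 nodes, 5-year node MTBF, 6-minute checkpoint.
    node_mtbf = 5 * 365 * 24.0
    nodes = 100_000
    cost = 0.1

    mtbf = system_mtbf_hours(node_mtbf, nodes)
    interval = young_interval_hours(cost, mtbf)
    print(f"failures per day:    {failures_per_day(node_mtbf, nodes):.1f}")
    print(f"system MTBF (hours): {mtbf:.3f}")
    print(f"checkpoint interval: {interval:.3f} h")
    print(f"checkpoint overhead: {checkpoint_overhead_fraction(cost, interval):.1%}")
```

Under these assumed figures, failures arrive many times per day and a sizeable share of the run time goes to writing checkpoints, which is the kind of overhead the abstract refers to.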
3 Resilience and fault tolerance
p. 65–83 (19)
Ultrascale computing is a new computing paradigm that arises naturally from the need for computing systems able to handle massive data in possibly very large-scale distributed systems, enabling new forms of applications that can serve a very large number of users, and in a more timely manner than we have ever experienced before. It is very challenging to find sustainable solutions for ultrascale computing systems (UCSs) due to their scale and the wide range of possible applications and involved technologies, including big data management. One of the challenges regarding sustainable UCS is resilience. Traditionally, it has been an important aspect in the area of critical infrastructure protection.
4 Data management techniques
p. 85–126 (42)
Today, data storage and management are projected to become one of the key challenges in achieving ultrascale computing, for several reasons. First, data is expected to grow exponentially in the coming years, and this progression implies that disruptive technologies will be needed to store large amounts of data and, more importantly, to access it in a timely manner. Second, the improvement of computing elements and their scalability is shifting application execution from CPU bound to I/O bound. This creates the additional challenge of significantly improving access to data so that it keeps pace with computation time, and thus of preventing high-performance computing (HPC) systems from being underutilized due to long periods of I/O activity. Third, the boundary between the two initially separate worlds of HPC is blurring: on one hand, simulations that are CPU bound, and on the other hand, analytics that mainly perform huge data scans to discover information and are I/O bound. Now, simulations and analytics need to work cooperatively and share the same I/O infrastructure.
5 Energy aware ultrascale systems
p. 127–188 (62)
Energy consumption is one of the main limiting factors for the design of ultrascale infrastructures. Multi-level hardware and software optimizations must be designed and explored in order to reduce the energy consumption of such large-scale equipment. This chapter addresses the energy efficiency of ultrascale systems in relation to other quality metrics. Its goal is to explore the design of metrics, analysis, frameworks and tools for taking energy awareness and energy efficiency to the next stage. Significant emphasis is placed on the idea of “energy complexity,” reflecting the synergies between energy efficiency and quality of service, resilience and performance, by studying computation power, communication/data sharing power, data access power, algorithm energy consumption, etc.
6 Applications for ultrascale systems
p. 189–244 (56)
Reformulating algorithms and applications from different areas of research for use on ultrascale systems and platforms has to address the challenges that arise from the different application areas, algorithms and programs. These challenges include the scalability of applications so that they use a large number of system resources efficiently, the use of resilience methods that enable application programs to react to system failures, and the inclusion of energy-awareness features in application programs to obtain an energy-efficient execution. The programming models should enable developers to concentrate on the algorithmic aspects and problem-specific issues of the application area, so that program development is supported as far as possible.
7 Conclusion
p. 245–248 (4)
The main conclusion is that it is important to enable ultrascale computing by supporting the evolution of ultrascale systems towards on-demand computing across highly diverse environments. This means providing domain-specific but interoperable tools that enable highly productive human-computer interaction, leading toward robust solutions through multi-domain cooperative approaches that use energy-efficient hardware-software co-design principles.
Back Matter
p. (1)