Many-Core Computing: Hardware and Software
Computing has moved away from a focus on performance-centric serial computation and towards energy-efficient parallel computation. This provides continued performance increases without increasing clock frequencies, and overcomes the thermal and power limitations of the dark-silicon era. As the number of parallel cores increases, we transition into the many-core computing era. There is considerable interest in developing methods, tools, architectures and applications to support many-core computing. The primary aim of this edited book is to provide a timely and coherent account of the recent advances in many-core computing research. Starting with programming models, operating systems and their applications, the authors present runtime management techniques, followed by system modelling, verification and testing methods, and architectures and systems. The book ends with some examples of innovative applications.
Inspec keywords: program processors; parallel programming; multiprocessing systems; parallel processing
Other keywords: processor speed; computing systems; multicore computing; many-core computing; hardware; software; parallel computation
Subjects: General and management topics; Parallel software; Multiprocessing systems
- Book DOI: 10.1049/PBPC022E
- Chapter DOI: 10.1049/PBPC022E
- ISBN: 9781785615825
- e-ISBN: 9781785615832
- Page count: 568
- Format: PDF
Front Matter
Part I - Programming models, OS and applications
1 HPC with many core processors
pp. 3–26 (24 pages)
The current trend in building clusters and supercomputers is to use medium-to-large symmetric multi-processor (SMP) nodes connected through a high-speed network. Applications need to adapt to these execution environments by combining distributed- and shared-memory programming, and thus become hybrid. Hybrid applications are written with two or more programming models, usually the message passing interface (MPI) [1,2] for the distributed environment and OpenMP [3,4] for shared-memory support. The goal of this chapter is to show how the two programming models can be made interoperable so as to ease the work of the programmer. Thus, instead of asking programmers to code optimizations targeting performance, it is possible to rely on the good interoperability between the programming models to achieve high performance. For example, instead of using non-blocking message passing and double buffering to achieve computation-communication overlap, our approach provides this feature by taskifying communications using OpenMP tasks [5,6].
2 From irregular heterogeneous software to reconfigurable hardware
pp. 27–47 (21 pages)
A heterogeneous system is one that incorporates more than one kind of computing device. Such a system can offer better performance per watt than a homogeneous one if the applications it runs are programmed to take advantage of the strengths of the different devices in the system. A typical heterogeneous setup involves a master processor (the `host' CPU) offloading some easily parallelised computations to a graphics processing unit (GPU) or to a custom accelerator implemented on a field-programmable gate array (FPGA). This arrangement can benefit performance because it exploits the massively parallel nature of GPU and FPGA architectures.
3 Operating systems for many-core systems
pp. 49–68 (20 pages)
The ongoing trend toward many-core computer systems and the accompanying new programming models has spawned numerous new activities in the domain of operating system (OS) research in recent years. This chapter will address the challenges and opportunities for OS developers in this new field and give an overview of state-of-the-art research. It will introduce the reader to the spectrum of contemporary many-core CPU architectures and application programming models for many-core systems, and give a brief overview of the resulting challenges for OS developers.
4 Decoupling the programming model from resource management in throughput processors
pp. 69–115 (47 pages)
This chapter introduces a new resource virtualization framework, Zorua, which decouples the graphics processing unit (GPU) programming model from the management of key on-chip resources in hardware to enhance programming ease, portability, and performance. The application resource specification (a static specification of several parameters, such as the number of threads and the scratchpad memory usage per thread block) forms a critical component of existing GPU programming models. This specification determines the parallelism, and hence the performance, of the application during execution, because the corresponding on-chip hardware resources are allocated and managed purely on the basis of this specification. This tight coupling between the software-provided resource specification and resource management in hardware leads to significant challenges in programming ease, portability, and performance, as we demonstrate in this chapter using real data obtained on state-of-the-art GPU systems. Our goal in this work is to reduce the dependence of performance on the software-provided static resource specification and thereby alleviate all of the above challenges. To this end, we introduce Zorua, a new resource virtualization framework that decouples the programmer-specified resource usage of a GPU application from the actual allocation of on-chip hardware resources. Zorua enables this decoupling by virtualizing each resource transparently to the programmer. The virtualization provided by Zorua builds on two key concepts: dynamic allocation of the on-chip resources and their oversubscription using a swap space in memory.
Zorua provides a holistic GPU resource virtualization strategy designed to (i) adaptively control the extent of oversubscription and (ii) coordinate the dynamic management of multiple on-chip resources to maximize the effectiveness of virtualization. We demonstrate that by providing the illusion of more resources than physically available via controlled and coordinated virtualization, Zorua offers several important benefits: (i) Programming ease. It eases the burden on the programmer to provide code that is tuned to efficiently utilize the physically available on-chip resources. (ii) Portability. It alleviates the need to retune an application's resource usage when porting the application across GPU generations. (iii) Performance. By dynamically allocating resources and carefully oversubscribing them when necessary, Zorua improves or retains the performance of applications that are already highly tuned to best utilize the resources. The holistic virtualization provided by Zorua has many other potential uses, e.g., fine-grained resource sharing among multiple kernels, low-latency preemption of GPU programs, and support for dynamic parallelism, which we describe in this chapter.
5 Tools and workloads for many-core computing
pp. 117–140 (24 pages)
Proper tools and workloads are required to evaluate any computing system, enabling designers to fulfil the properties expected by end-users. Multi/many-core chips are now omnipresent in systems of all scales, from mobile phones to data centers. The reliance on multi/many-core chips is increasing as they provide the high processing capability needed to meet the increasing performance requirements of complex applications in various application domains. This high processing capability is achieved by parallel processing across the cores, where the application needs to be partitioned into a number of tasks or threads that are efficiently allocated onto different cores. The applications considered for evaluation are referred to as workloads, and the toolchains required to facilitate the whole evaluation are referred to as tools. The tools facilitate the realization of different actions (e.g., thread-to-core mapping and voltage/frequency control, governed by the OS scheduler and power governor, respectively) and expose their effect on performance monitoring counters, leading to changes in the performance metrics of concern to end-users (e.g., energy consumption and execution time).
6 Hardware and software performance in deep learning
pp. 141–161 (21 pages)
In recent years, deep neural networks (DNNs) have emerged as the most successful technology for many difficult problems in image, video, voice and text processing. DNNs are resource hungry and require very large amounts of computation and memory, which is a particular challenge on IoT, mobile and embedded systems. In this chapter, we outline some major performance challenges of DNNs such as computation, parallelism, data locality and memory requirements. We describe research on these problems, such as the use of existing high-performance linear algebra libraries, hardware acceleration, reduced-precision storage and arithmetic and sparse data representations. Finally, we discuss recent trends in adapting compiler and domain-specific program generation techniques to create high-performance parallel DNN programs.
Part II - Runtime management
7 Adaptive-reflective middleware for power and energy management in many-core heterogeneous systems
pp. 165–189 (25 pages)
8 Advances in power management of many-core processors
pp. 191–213 (23 pages)
9 Runtime thermal management of many-core systems
pp. 215–245 (31 pages)
10 Adaptive packet processing on CPU-GPU heterogeneous platforms
pp. 247–269 (23 pages)
11 From power-efficient to power-driven computing
pp. 271–293 (23 pages)
The dramatic spread of computing, at the scale of trillions of ubiquitous devices, is delivering pervasive penetration into the real world in the form of the Internet of Things (IoT). Today, the widely used power-efficient paradigms directly related to the behaviour of computing systems are those of real-time (working to deadlines imposed from the real world) and low-power (prolonging battery life or reducing heat dissipation and electricity bills). Neither of these addresses the strict requirements on power supply, allocation and utilisation that are imposed by the needs of new devices and applications in the computing swarm, many of which are expected to be confronted with challenges of autonomy and battery-free long life. Indeed, we need to design and build systems for survival, operating under a wide range of power constraints; we need a new power-driven paradigm called real-power computing (RPC). The chapter provides an overview of this emerging paradigm with definitions, taxonomies and a case study, together with a summary of the existing research. Towards the end, the overview leads to the research and development challenges and opportunities surrounding this paradigm. Throughout the chapter, we use the power and energy terms as follows. On the supply side, the energy term refers to harvesters with built-in storage, while the power term indicates instantaneous energy dispensation. On the computing-logic side, the energy term defines the total power consumed over a given time interval.
Part III - System modelling, verification, and testing
12 Modelling many-core architectures
pp. 297–322 (26 pages)
Architectural modelling has two primary objectives: (1) navigating design space exploration, i.e. guiding architects toward better design choices, and (2) facilitating dynamic management, i.e. providing the functional relationships between workloads' characteristics and architectural configurations to enable appropriate runtime hardware/software adaptations. In the past years, many-core architectures, as a typical computing fabric evolving from monolithic single-/multicore architectures, have been shown to be scalable enough to uphold the staggering pace of Moore's Law. Many-core architectures enable two orthogonal approaches, scale-up and scale-out, to utilize the growing budget of transistors. Understanding the rationale behind these approaches is critical to making more efficient use of this powerful computing fabric.
13 Power modelling of multicore systems
pp. 323–344 (22 pages)
The chapter first gives a brief overview of how power is consumed in CPUs before exploring the various energy-saving techniques and power management considerations. A description of different power modelling approaches and applications is presented before top-down, run-time power models are described in detail, highlighting many important, but often-overlooked, considerations. Bottom-up approaches, their accuracy, and methods of improving their representativeness are then discussed and finally, hybrid approaches are proposed.
14 Developing portable embedded software for multicore systems through formal abstraction and refinement
pp. 345–365 (21 pages)
Run-time management (RTM) systems are used in embedded systems to dynamically adapt hardware performance so as to minimise energy consumption. An RTM implementation is coupled to the hardware platform's specification and is implemented individually for each specific platform. A significant challenge is that RTM software can require laborious manual adjustment across different hardware platforms due to the diversity of architecture characteristics. Hardware specifications vary from one platform to another and include a number of characteristics, such as the number of supported voltage and frequency (VF) settings. Formal modelling offers the potential to simplify the management of platform diversity by shifting the focus away from handwritten platform-specific code to platform-independent models from which platform-specific implementations are automatically generated. The chapter presents an overview of the motivations for this work. It goes on to outline the RTM architecture and requirements and introduce the Event-B formal method and its tool support. The chapter then describes the Event-B models of two different RTMs and presents the portability support provided by formal modelling and code generation. Finally, it reviews the verification and experimental results.
15 Self-testing of multicore processors
pp. 367–394 (28 pages)
The purpose of this chapter is to develop a review of state-of-the-art techniques and methodologies for the self-testing of multicore processors. The chapter is divided into two main sections: (a) self-testing solutions covering general-purpose multicore microprocessors such as chip multiprocessors (CMPs) and (b) self-testing solutions targeting application-specific multicore designs known as SoCs. In the first section (general-purpose), a taxonomy of current self-testing approaches is initially presented, followed by a review of the state-of-the-art for each class. The second section (application-specific) provides an overview of the test scheduling flows for multicore SoCs, as well as the testing strategies for the individual components (sub-systems) of such systems.
16 Advances in hardware reliability of reconfigurable many-core embedded systems
pp. 395–416 (22 pages)
The chapter discusses the background for the most demanding dependability challenges for reconfigurable processors in many-core systems and presents a dependable runtime-reconfigurable processor for high reliability. It uses an adaptive modular redundancy technique that guarantees an application-specified level of reliability under changing single-event upset (SEU) rates by budgeting the effective critical bits among all kernels and all accelerators of an application. This allows reconfigurable processors to be deployed in harsh environments without statically protecting them.
Part IV - Architectures and systems
17 Manycore processor architectures
pp. 419–448 (30 pages)
Trade-offs between performance and power have dominated the processor architecture landscape in recent times and are expected to exert a considerable influence in the future. Process technology has ceased to provide automatic speed-ups across generations, leading to a reliance on architectural innovation for achieving better performance. Manycore processor systems have found their way into various computing segments, ranging from mobile systems to the desktop and server space. With the advent of graphics processing units (GPUs) with a large number of processing elements, manycore systems have become the default engine for all target computing domains. We focus in this chapter mainly on the desktop and system-on-chip (SoC) domains, but the architectural possibilities blend seamlessly into the other domains as well. We outline a high-level classification of manycore processors and go on to describe the major architectural components typically expected in modern and future processors, with a focus on the computing elements. Issues arising from the integration of the various components are outlined, and future trends are identified.
18 Silicon photonics enabled rack-scale many-core systems
pp. 449–470 (22 pages)
The increasingly high demands on computing power from scientific computation, big-data processing and deep learning are pushing the emergence of exascale computing systems. Tens of thousands of manycore nodes, or even more, are connected to build such systems, imposing huge performance and power challenges on different aspects of the systems. As a basic building block of high-performance computing systems, the modularized rack will play a significant role in addressing these challenges. In this chapter, we introduce rack-scale optical networks (RSON), a silicon-photonics-enabled inter/intra-chip network for rack-scale many-core systems. RSON leverages the fact that most traffic stays within the rack, where a high-bandwidth, low-latency rack-scale optical network can improve both performance and energy efficiency. We codesign the intra-chip and inter-chip optical networks together with the optical internode interface to provide balanced access to both local memory and a remote node's memory, making the nodes within a rack cooperate effectively. The evaluations show that RSON can improve overall performance and energy efficiency dramatically; specifically, RSON can deliver as much as 5.4x more performance under the same energy consumption compared to a traditional InfiniBand-connected rack.
19 Cognitive I/O for 3D-integrated many-core system
pp. 471–495 (25 pages)
Increasing demands to process large amounts of data in real time are driving the adoption of many-core microprocessors, which poses a grand challenge for the effective management of available resources. As communication power occupies a significant portion of the power consumption when processing such big data, there is an emerging need for a methodology that reduces communication power without sacrificing performance. To address this issue, we introduce a cognitive I/O designed for 3D-integrated many-core microprocessors that performs adaptive tuning of the voltage-swing levels depending on the achieved performance and power consumption. We embed this cognitive I/O in a many-core microprocessor with DRAM memory partitioning to save energy in applications such as fingerprint matching and face recognition.
20 Approximate computing across the hardware and software stacks
pp. 497–522 (26 pages)
Emerging fields like big data and IoT have brought a number of challenges for the hardware as well as the software design community. Among the major challenges is scaling the computational and memory resources, and the efficiency of processing devices, to match growing needs. In the past few years, a number of fields have emerged to address these challenges. We focus on one of the prominent paradigms that has the potential to improve resource efficiency regardless of the underlying technology: approximate computing (AC). AC aims at relaxing the bounds of exact computing to provide new opportunities for gains in energy, power, performance, and/or area efficiency at the cost of reduced output quality, typically within a tolerable range. We first provide an overview of AC and the techniques commonly employed at different abstraction levels to alleviate the resource requirements of computationally intensive applications. Afterwards, a detailed discussion of component-level approximations and their probabilistic behavior, considering approximate adders and multipliers, is presented. Next, a methodology for constructing efficient accelerators from these components is discussed. The discussion is then extended to approximate memories and runtime management systems. Toward the end of the chapter, we present a methodology for designing energy-efficient many-core systems based upon approximate components, followed by the challenges in adopting a cross-layer approach for designing highly energy-, power-, and performance-efficient systems.
21 Many-core systems for big-data computing
pp. 523–544 (22 pages)
In many ways, big data should be the poster-child of many-core computing. By necessity, such applications typically scale extremely well across machines, featuring high levels of thread-level parallelism. Programming techniques, such as Google's MapReduce, have allowed many applications running in the data centre to be programmed with parallelism directly in mind and have enabled extremely high throughput across machines. We explore the state-of-the-art in terms of techniques used to make many-core architectures work for big-data workloads. We explore how tail-latency concerns mean that even though workloads are parallel, high performance is still necessary in at least some parts of the system. We take a look at how memory-system issues can cause some big-data applications to scale less favourably than we would like for many-core architectures. We examine the programming models used for big-data workloads and consider how these both help and hinder the typically complex mapping seen elsewhere for many-core architectures. And we also take a look at the alternatives to traditional many-core systems in exploiting parallelism for efficiency in the big-data space.
22 Biologically-inspired massively-parallel computing
pp. 545–558 (14 pages)
Half a century of progress in computer technology has delivered machines of formidable capability and an expectation that similar advances will continue into the foreseeable future. However, much of the past progress has been driven by developments in semiconductor technology following Moore's Law, and there are strong grounds for believing that these cannot continue at the same rate. This, and related issues, suggests that there are huge challenges ahead in meeting the expectations of future progress, such as understanding how to exploit massive parallelism and how to deliver improvements in energy efficiency and reliability in the face of diminishing component reliability. Alongside these issues, recent advances in machine learning have created a demand, which we will struggle to meet, for machines with cognitive capabilities, for example to control autonomous vehicles. Biological systems have, through evolution, found solutions to many of these problems, but we lack a fundamental understanding of how those solutions function. If we could advance our understanding of biological systems, we would open a rich source of ideas for unblocking progress in our engineered systems. An overview is given of SpiNNaker, a spiking neural network architecture. The SpiNNaker machine puts these principles together in the form of a massively parallel computer architecture designed both to model the biological brain, in order to accelerate our understanding of its principles of operation, and to explore engineering applications of such machines.
Back Matter