ReRAM-based Machine Learning
2: Huawei Technologies, Shenzhen, China
3: Department of Electrical and Computer Engineering, George Mason University (GMU), USA
The transition towards exascale computing has resulted in major transformations in computing paradigms. The need to analyze and respond to such large volumes of data has led to the adoption of machine learning (ML) and deep learning (DL) methods in a wide range of applications. One of the major challenges is fetching data from memory and writing it back without running into the memory-wall bottleneck. To address this concern, in-memory computing (IMC) and supporting frameworks have been introduced. In-memory computing offers ultra-low-power operation and high-density embedded storage. Resistive Random-Access Memory (ReRAM) technology appears to be the most promising IMC solution due to its minimized leakage power, reduced power consumption and smaller hardware footprint, as well as its compatibility with the CMOS technology widely used in industry. In this book, the authors introduce ReRAM techniques for performing distributed computing using IMC accelerators, present ReRAM-based IMC architectures that can perform computations for ML and data-intensive applications, and describe strategies for mapping ML designs onto hardware accelerators. The book serves as a bridge between researchers in the computing domain (algorithm designers for ML and DL) and computing hardware designers.
Inspec keywords: learning (artificial intelligence); compressed sensing; AI chips; resistive RAM
Other keywords: in-memory computing; XIMA machine learning architecture; machine learning accelerators; machine learning algorithms; ReRAM-based machine learning; ResNet; compressive sensing
Subjects: Machine learning (artificial intelligence); Microprocessors and microcomputers; Memory circuits; Microprocessor chips; General electrical engineering topics; Digital storage; General and management topics
- Book DOI: 10.1049/PBPC039E
- Chapter DOI: 10.1049/PBPC039E
- ISBN: 9781839530814
- e-ISBN: 9781839530821
- Page count: 261
- Format: PDF
Front Matter
p. (1)
Part I. Introduction
1 Introduction
pp. 3–17 (15)
Computing-in-memory methods based on emerging nonvolatile devices are popular in both academia and industry. Many researchers believe that this architecture offers an opportunity to move beyond the limits of Moore's law because of its ultra-low power and high-density embedded storage. Various devices, including resistive random-access memory (ReRAM), phase-change random-access memory (PCRAM), magnetic RAM (MRAM), ferroelectric RAM (FeRAM) and NOR Flash, have been discussed. ReRAM is the most promising among these devices due to its potential for multilevel resistance and its compatibility with CMOS technology. Well-known companies such as IBM and HP have invested in this field, and we believe that ReRAM-based IMC will be commercialized in Internet-of-Things (IoT) products in the next 2-3 years. This chapter introduces the need and motivation to deploy ML algorithms for data-intensive computations.
2 The need of in-memory computing
pp. 19–43 (25)
In this computing paradigm, the computing unit can only process one task in a given interval and must wait for memory to update its results, because both data and instructions are stored in the same memory space; this greatly limits throughput and causes idle power consumption. Although mechanisms such as caching and branch prediction can partially alleviate these issues, the "memory wall" still poses a grand challenge for the massive data interchange in modern processor technology. To break the "memory wall," in-memory processing has been studied since the 2000s and is regarded as a promising way to reduce redundant data movement between memory and the processing unit and to decrease power consumption. The concept has been implemented with different hardware tools, e.g., 3D-stacked dynamic random-access memory (DRAM) [69] and embedded Flash [70]. Software solutions such as pruning, quantization and mixed-precision topologies are used to reduce the intensity of signal interchange.
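The software side of this picture can be made concrete with a small sketch. The snippet below is a minimal illustration of magnitude pruning, not any specific framework's method: the smallest-magnitude weights are zeroed so that fewer values need to travel between memory and the processing unit. The matrix size and the 90% pruning ratio are assumed purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 64)).astype(np.float32)

# Magnitude pruning: keep only the top 10% of weights by |w|,
# zeroing the rest so they need not be fetched or transmitted.
threshold = np.quantile(np.abs(weights), 0.9)
pruned = np.where(np.abs(weights) >= threshold, weights, 0.0)

sparsity = 1.0 - np.count_nonzero(pruned) / pruned.size
print(f"sparsity: {sparsity:.2f}")
```

A sparse matrix like `pruned` can then be stored in a compressed format, directly shrinking the data that must cross the memory interface.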
3 The background of ReRAM devices
pp. 45–75 (31)
In this chapter, we first introduce the characteristics of resistive random-access memory (ReRAM) devices. After characterizing the ReRAMs, we introduce the structural design of the ReRAMs, followed by a few application designs of ReRAM-based devices.
Part II. Machine learning accelerators
4 The background of machine learning algorithms
pp. 79–98 (20)
This chapter provides an introduction to machine learning algorithms along with their optimization techniques. We first introduce support vector machines (SVMs), followed by neural networks and their variants. Classification problems serve as the basis for illustrating these algorithms.
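As a taste of the algorithms the chapter covers, the sketch below trains a linear SVM on a toy two-class problem by subgradient descent on the hinge loss. The data, learning rate and regularization constant are illustrative assumptions, not values from the book.

```python
import numpy as np

# Toy linearly separable two-class data with labels in {-1, +1}.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# Linear SVM: minimize lam*||w||^2 + mean(max(0, 1 - y*(Xw + b)))
# by subgradient descent.
w, b, lam, lr = np.zeros(2), 0.0, 0.01, 0.1
for _ in range(200):
    margins = y * (X @ w + b)
    mask = margins < 1                      # points violating the margin
    grad_w = 2 * lam * w - (y[mask, None] * X[mask]).sum(axis=0) / len(X)
    grad_b = -y[mask].sum() / len(X)
    w -= lr * grad_w
    b -= lr * grad_b

pred = np.sign(X @ w + b)
accuracy = (pred == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```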
5 XIMA: the in-ReRAM machine learning architecture
pp. 99–114 (16)
This chapter presents ReRAM-based neural networks with a focus on intensive matrix multiplication operations. The ReRAM crossbar network can be used as a matrix-vector multiplication accelerator, and we illustrate the detailed mapping. The coupled ReRAM oscillator network can be applied to low-power and high-throughput L2-norm calculation. The 3D single-layer CMOS-ReRAM architecture is used for the tensorized neural network (TNN). A 3D multilayer CMOS-ReRAM architecture has three advantages. First, by utilizing the ReRAM crossbar for input data storage, the leakage power of memory is largely removed. In a 3D architecture with TSV interconnection, the bandwidth from one layer to the next is sufficiently large to perform parallel computation. Second, the ReRAM crossbar can be configured as computational units for matrix-vector multiplication with high parallelism and low power. Lastly, with an additional layer of CMOS-ASIC, more complicated tasks such as division and nonlinear mapping can be performed. As a result, the whole training process of ML can be fully mapped to the proposed 3D multilayer CMOS-ReRAM accelerator architecture towards real-time training and testing.
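The crossbar's role as a matrix-vector multiplication accelerator can be modeled in a few lines. The sketch below is an idealized numerical model, with assumed on/off conductance values: weights are mapped to cell conductances, read voltages are applied on the rows, and by Ohm's law and Kirchhoff's current law the bit-line currents deliver the product in a single analog step.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.uniform(0, 1, (4, 3))           # target weight matrix

# Map each weight to a cell conductance between assumed off/on values.
G_ON, G_OFF = 1e-4, 1e-6                      # conductances in siemens (assumed)
G = G_OFF + weights * (G_ON - G_OFF)

V = np.array([0.1, 0.2, 0.0, 0.3])            # read voltages on the word lines

# Each bit line sums the currents of its column: I = G^T V,
# i.e., an analog matrix-vector multiplication.
I = G.T @ V

# Undo the conductance mapping to recover the digital result.
recovered = (I - G_OFF * V.sum()) / (G_ON - G_OFF)
print(np.allclose(recovered, weights.T @ V))  # -> True
```

Real crossbars add nonidealities (wire resistance, device variation, ADC quantization) that this idealized model ignores.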
6 The mapping of machine learning algorithms on XIMA
pp. 115–163 (49)
This chapter maps machine learning applications onto the distributed in-memory computing architecture (XIMA), including single-layer feedforward neural network (SLFN)-based learning and binary convolutional neural network (BCNN)-based inference on a passive array as well as on a one-selector-one-ReRAM (1S1R) array. In this network, matrix-vector multiplication, the major operation of machine learning on the ReRAM crossbar, is implemented in three steps, with a detailed mapping strategy for each step. We also describe the process of mapping X'A to ReRAM. The proposed 3D CMOS-ReRAM accelerator can greatly improve energy efficiency and neural network processing speed. Matrix-vector multiplication can be intrinsically implemented by the ReRAM crossbar. Compared to CMOS, a multibit tensor-core weight can be represented by a single ReRAM device, and addition can be realized by current merging. Moreover, as all the tensor cores for the neural network weights are stored in the ReRAM devices, better area efficiency can be achieved. The power consumption of the proposed CMOS-ReRAM is much smaller than that of the CMOS implementation due to the nonvolatile property of ReRAM. For energy efficiency, our proposed accelerator can achieve 1,055.95 GOPS/W, which is equivalent to 7.661 TOPS/W for the uncompressed neural network. Our proposed TNN accelerator can also achieve 2.39× better energy efficiency compared to the 3D CMOS-ASIC result (441.36 GOPS/W), and 227.24× better energy efficiency compared to the Nvidia Tesla K40, which achieves 1,092 GFLOPS while consuming 235 W.
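The quoted efficiency ratios follow directly from the raw numbers in the abstract and can be sanity-checked:

```python
# Sanity check of the energy-efficiency ratios quoted above.
accel_gops_w = 1055.95                 # proposed CMOS-ReRAM accelerator
asic_3d_gops_w = 441.36                # 3D CMOS-ASIC baseline
k40_gflops, k40_watts = 1092.0, 235.0  # Nvidia Tesla K40

k40_gflops_w = k40_gflops / k40_watts            # ~4.65 GFLOPS/W
print(round(accel_gops_w / asic_3d_gops_w, 2))   # -> 2.39
print(round(accel_gops_w / k40_gflops_w, 2))     # -> 227.24
```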
Part III. Case studies
7 Large-scale case study: accelerator for ResNet
pp. 167–187 (21)
For this case study, we have developed a quantized large-scale ResNet-50 network on the ImageNet [265] benchmark with high accuracy. We further show that the quantized ResNet-50 network can be realized on the ReRAM crossbar with significantly improved throughput and energy efficiency.
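The kind of quantization involved can be illustrated with per-tensor symmetric int8 quantization of a weight matrix. The sketch below uses assumed dimensions and weight scale rather than the actual ResNet-50 layers; it shows that the matrix-vector product of the quantized weights matches the full-precision result to within a small relative error.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0, 0.05, (256, 256)).astype(np.float32)  # assumed layer weights
x = rng.normal(0, 1, 256).astype(np.float32)            # assumed activations

# Per-tensor symmetric quantization: one scale maps floats to int8 codes.
scale = np.abs(W).max() / 127.0
W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)

y_fp = W @ x                                  # full-precision reference
y_deq = (W_q.astype(np.float32) * scale) @ x  # dequantized int8 weights

rel_err = np.linalg.norm(y_fp - y_deq) / np.linalg.norm(y_fp)
print(f"relative error: {rel_err:.4f}")
```

On ReRAM hardware the int8 codes would be programmed as device conductance levels, which is why multilevel resistance matters for this case study.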
8 Large-scale case study: accelerator for compressive sensing
pp. 189–213 (25)
Biomedical wireless circuits for applications such as health telemonitoring and implantable biosensors are energy sensitive. To prolong their service lifetime, it is essential to perform dimension reduction while acquiring the original data. Compressive sensing is a signal processing technique that exploits signal sparsity so that a signal can be reconstructed at a sampling rate lower than the Nyquist rate. Existing works that apply compressive sensing to biomedical hardware focus on efficient signal reconstruction, either through dictionary learning or through more efficient algorithms for finding the sparsest coefficients. However, by improving reconstruction on mobile/server nodes instead of data acquisition on sensor nodes, these works can only indirectly reduce the number of samples for wireless transmission at lower energy. In this work, we aim to achieve both high-performance signal acquisition and low sampling hardware cost at the sensor nodes directly.
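The acquisition-side saving can be sketched numerically: an N-sample, K-sparse signal is acquired with only M < N random projections and then recovered with orthogonal matching pursuit (OMP), one standard greedy reconstruction algorithm. All dimensions below are illustrative, not values from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 256, 64, 5   # signal length, measurements, sparsity (assumed)

# K-sparse signal in the canonical basis.
x = np.zeros(N)
support = rng.choice(N, K, replace=False)
x[support] = rng.uniform(1.0, 2.0, K) * rng.choice([-1.0, 1.0], K)

Phi = rng.normal(0, 1 / np.sqrt(M), (M, N))  # random measurement matrix
y = Phi @ x                                  # only M samples acquired, not N

# OMP: greedily pick the column most correlated with the residual,
# then re-fit the selected columns by least squares.
residual, idx = y.copy(), []
for _ in range(K):
    idx.append(int(np.argmax(np.abs(Phi.T @ residual))))
    coef, *_ = np.linalg.lstsq(Phi[:, idx], y, rcond=None)
    residual = y - Phi[:, idx] @ coef

x_hat = np.zeros(N)
x_hat[idx] = coef
rel_err = np.linalg.norm(x_hat - x) / np.linalg.norm(x)
print(f"relative reconstruction error: {rel_err:.2e}")
```

The sensor node only computes and transmits the M-sample vector `y`; the costly reconstruction runs on the mobile/server side.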
9 Conclusions: wrap-up, open questions and challenges
pp. 215–216 (2)
This book has presented a thorough study of resistive random-access memory (ReRAM)-based nonvolatile in-memory architecture for machine learning applications, from the circuit level to the architecture level and all the way to the system level.
Back Matter
p. (1)