Hardware Architectures for Deep Learning
Department of Electrical and Computer Engineering, University of Tehran, Tehran, Iran
This book presents and discusses innovative ideas in the design, modelling, implementation, and optimization of hardware platforms for neural networks. The rapid growth of server, desktop, and embedded applications based on deep learning has revived interest in neural networks, with applications including image and speech processing, data analytics, robotics, healthcare monitoring, and IoT solutions. Efficient implementation of neural networks to support complex deep learning-based applications is a major challenge for embedded and mobile computing platforms with limited computational/storage resources and a tight power budget. Even for cloud-scale systems, it is critical to select the right hardware configuration based on the neural network complexity and system constraints in order to increase power- and performance-efficiency. Hardware Architectures for Deep Learning provides an overview of this new field, from principles to applications, for researchers, postgraduate students, and engineers who work on learning-based services and hardware platforms.
Inspec keywords: recurrent neural nets; feedforward neural nets; learning (artificial intelligence); neuromorphic engineering; neural net architecture; neural chips
Other keywords: hardware architectures; embedded systems; model sparsity; deep learning hardware; error-tolerance; recurrent neural network; RNN; analog accelerators; feedforward models; low-precision data representation; hardware accelerators; convolutional neural networks; ultra-low-power IoT smart applications; inverter-based memristive neuromorphic circuit; stochastic data representations; binary data representations
Subjects: Neural nets (circuit implementations); Knowledge engineering techniques; Neural computing techniques; Parallel architecture; General and management topics; General electrical engineering topics; Neural net devices
- Book DOI: 10.1049/PBCS055E
- ISBN: 9781785617683
- e-ISBN: 9781785617690
- Page count: 328
- Format: PDF
Front Matter
p. (1)
Part I. Deep learning and neural networks: concepts and models
1 An introduction to artificial neural networks
pp. 3–26 (24)
This chapter presents an introduction to neural networks (NNs) with an emphasis on classification and regression applications. Some preliminaries about natural and artificial neural networks (ANNs) are introduced first. Then, after giving initial concepts about classification and regression problems, appropriate overall structures of ANNs for such applications are explained. Simple NN structures and their limitations, as well as more powerful multilayer and deep learning models, are introduced next. Finally, convolutional NNs and some of their well-known developments are briefly explained.
2 Hardware acceleration for recurrent neural networks
pp. 27–51 (25)
This chapter focuses on the LSTM model and is concerned with the design of a high-performance and energy-efficient solution for deep learning inference. The chapter is organized as follows: Section 2.1 introduces recurrent neural networks (RNNs) and discusses long short-term memory (LSTM) and gated recurrent unit (GRU) network models as special kinds of RNNs. Section 2.2 discusses inference acceleration with hardware. Section 2.3 surveys various FPGA designs in the context of previous related work, after which Section 2.4 concludes the chapter.
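As a point of reference for the LSTM model this chapter builds on, a single LSTM time step can be sketched in NumPy. The fused-gate weight layout and all variable names below are illustrative assumptions, not the chapter's hardware mapping:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h_prev, c_prev, W, U, b):
    """One LSTM step; the four gates are slices of a fused (4H,) projection."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b     # fused pre-activations, shape (4H,)
    i = sigmoid(z[0:H])            # input gate
    f = sigmoid(z[H:2*H])          # forget gate
    o = sigmoid(z[2*H:3*H])        # output gate
    g = np.tanh(z[3*H:4*H])        # candidate cell update
    c = f * c_prev + i * g         # new cell state
    h = o * np.tanh(c)             # new hidden state
    return h, c

# tiny usage example with random weights (hidden size 4, input size 3)
rng = np.random.default_rng(0)
H, D = 4, 3
W = rng.standard_normal((4 * H, D))
U = rng.standard_normal((4 * H, H))
b = np.zeros(4 * H)
h, c = lstm_cell(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, U, b)
```

A GRU cell follows the same pattern with three gates and no separate cell state, which is one reason hardware accelerators often share datapaths between the two models.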
3 Feedforward neural networks on massively parallel architectures
pp. 53–76 (24)
In this chapter, we present ClosNN, a specialized network-on-chip (NoC) for NNs based on the well-known Clos topology. Clos is perhaps the most popular multistage interconnection network (MIN) topology and is commonly used as the basis of switching fabrics in commercial telecommunication and network routers and switches.
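The basic parameters of a three-stage Clos network can be illustrated with a small sketch. The function names are ours; the strict-sense non-blocking condition m ≥ 2n − 1 is Clos's classic result:

```python
def clos_is_strictly_nonblocking(n, m):
    """Clos(m, n, r): r ingress switches with n inputs each, m middle switches.
    Clos's condition: strictly non-blocking iff m >= 2n - 1."""
    return m >= 2 * n - 1

def clos_crosspoints(n, m, r):
    """Total crosspoints of a symmetric three-stage Clos(m, n, r) network:
    ingress r*(n*m) + middle m*(r*r) + egress r*(m*n)."""
    return 2 * r * n * m + m * r * r

# a Clos(5, 3, 3) network: non-blocking (5 >= 2*3 - 1) with 135 crosspoints
ok = clos_is_strictly_nonblocking(3, 5)
cost = clos_crosspoints(3, 5, 3)
```

The crosspoint count is what makes Clos attractive as a NoC substrate: for large port counts it grows far more slowly than the n²r² crosspoints of a single flat crossbar.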
Part II. Deep learning and approximate data representation
4 Stochastic-binary convolutional neural networks with deterministic bit-streams
pp. 79–94 (16)
In this chapter, we propose a low-cost and energy-efficient design for hardware implementation of CNNs. Low-discrepancy (LD) deterministic bit-streams and simple standard AND gates are used to perform fast and accurate multiplication operations in the first layer of the NN. Compared to prior random bit-stream-based designs, the proposed design achieves a lower misclassification rate for the same processing time. Evaluating the LeNet-5 NN with the MNIST dataset as input, the proposed design achieves the same classification rate as a conventional fixed-point binary design with 70% savings in the energy consumption of the first convolutional layer. If slight inaccuracies are acceptable, higher energy savings are feasible by processing shorter bit-streams.
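The AND-gate multiplication idea can be sketched as follows, using simple unary bit-streams and clock division to pair every bit of one stream with every bit of the other. This is a simplified stand-in for the low-discrepancy streams used in the chapter, not the proposed design itself:

```python
from fractions import Fraction

def unary_stream(num, den):
    """Deterministic unary bit-stream for the value num/den: `num` ones first."""
    return [1] * num + [0] * (den - num)

def deterministic_mul(a, b):
    """Multiply two unary streams exactly with an AND gate, via clock
    division: hold each bit of `a` while the whole of `b` repeats."""
    sa = [bit for bit in a for _ in b]      # each bit of a held len(b) cycles
    sb = b * len(a)                         # stream b repeated len(a) times
    out = [x & y for x, y in zip(sa, sb)]   # the AND gate
    return Fraction(sum(out), len(out))

# 3/4 * 2/3 recovered exactly as 1/2 from the AND-ed stream
p = deterministic_mul(unary_stream(3, 4), unary_stream(2, 3))
```

Because every bit pairing occurs exactly once, the product is exact after len(a)·len(b) cycles; truncating the output stream trades accuracy for energy, as the abstract notes.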
5 Binary neural networks
pp. 95–115 (21)
Convolutional neural networks (CNNs) are used in a broad spectrum of machine learning applications, such as computer vision and speech recognition. Computation and memory accesses are the major challenges for the deployment of CNNs in resource-limited and low-power embedded systems. The recently proposed binary neural networks (BNNs) use just 1 bit for weights and/or activations instead of full-precision values, substituting complex multiply-accumulate operations with bitwise logic operations to drastically reduce computation and memory footprint. However, most BNN models come with some accuracy loss, especially on large datasets. Improving the accuracy of BNNs and designing efficient hardware accelerators for them are two important research directions that have attracted much attention in recent years. In this chapter, we survey the state-of-the-art research on the design and hardware implementation of BNN models.
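The core BNN arithmetic trick, replacing a multiply-accumulate over {−1, +1} values with XOR/XNOR and a popcount, can be sketched as (encodings and names are illustrative):

```python
def binary_dot(w_bits, x_bits):
    """Dot product of two {-1, +1} vectors encoded as 0/1 bits
    (+1 -> 1, -1 -> 0): dot = n - 2 * popcount(w XOR x)."""
    n = len(w_bits)
    disagreements = sum(wi ^ xi for wi, xi in zip(w_bits, x_bits))
    return n - 2 * disagreements

# w = (+1, -1, +1, +1), x = (+1, +1, -1, +1)
# true dot product: 1 - 1 - 1 + 1 = 0
w = [1, 0, 1, 1]
x = [1, 1, 0, 1]
```

In hardware, the XOR-popcount over a whole weight row is a single wide gate array plus an adder tree, which is what lets BNN accelerators drop the multiplier array entirely.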
Part III. Deep learning and model sparsity
6 Hardware and software techniques for sparse deep neural networks
pp. 119–145 (27)
Over the past four decades, every generation of processors has delivered a 2x performance boost, as predicted by Moore's law. Ironically, the end of Moore's law coincided with the emergence of computationally intensive deep learning algorithms. Deep neural networks (DNNs) offer state-of-the-art solutions for many applications, including computer vision, speech recognition, and natural language processing. However, this is just the tip of the iceberg: deep learning is taking over many classic machine-learning applications and also creating new markets, such as autonomous vehicles, which will tremendously amplify the demand for even more computational power.
7 Computation reuse-aware accelerator for neural networks
pp. 147–158 (12)
Power consumption has long been a significant concern in neural networks. In particular, large neural networks that implement novel machine learning techniques require much more computation, and hence power, than ever before. In this chapter, we show that computation reuse can exploit the inherent redundancy in the arithmetic operations of a neural network to save power. Experimental results show that computation reuse, when coupled with the approximation property of neural networks, can eliminate up to 90% of multiplications, reducing power consumption by 61% on average in the presented architecture. The proposed computation reuse-aware design can be extended in several ways. First, it can be integrated into state-of-the-art customized architectures for LSTM, spiking, and convolutional neural network models to further reduce power consumption. Second, computation reuse can be coupled with existing mapping and scheduling algorithms to develop reuse-aware scheduling and mapping methods for neural networks. Computation reuse can also boost the performance of methods that eliminate ineffectual computations in deep learning neural networks. Evaluating the impact of CORN on reliability and customizing the CORN architecture for FPGA-based neural network implementations are other future directions in this line of work.
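The computation-reuse idea can be illustrated with a simple memo table over (weight, activation) pairs. This is a software sketch only; the chapter's CORN architecture realizes the lookup in hardware:

```python
def reuse_multiply(weights, activations):
    """Compute elementwise products, reusing any previously seen
    (weight, activation) pair instead of multiplying again."""
    table = {}
    products, reused = [], 0
    for w, a in zip(weights, activations):
        key = (w, a)
        if key in table:
            reused += 1            # hit: the stored product is reused
        else:
            table[key] = w * a     # miss: perform the multiplication
        products.append(table[key])
    return products, reused

# with quantized values, repeated pairs are common: 2 of 4 multiplies reused
products, reused = reuse_multiply([2, 2, 3, 2], [5, 5, 7, 5])
```

Lower-precision quantization shrinks the space of possible pairs, which is why reuse pairs naturally with the approximation tolerance of neural networks.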
Part IV. Convolutional neural networks for embedded systems
8 CNN agnostic accelerator design for low latency inference on FPGAs
pp. 161–189 (29)
In this chapter, we study the factors impacting CNN accelerator designs on FPGAs, show how on-chip memory configuration affects the usage of off-chip bandwidth, and present a uniform memory model that effectively uses both memory systems. Most work on FPGA-based acceleration of CNNs has focused on maximizing throughput; such implementations use batch processing for throughput improvement and are mainly tailored for cloud deployment. However, they fall short in latency-critical applications such as autonomous driving, drone surveillance, and interactive speech recognition. Therefore, we avoid batching of any kind and focus on reducing the latency for each input image. In addition, we avoid Winograd transformations, which are optimized only for 3 × 3 filter layers and lack flexibility, in order to retain support for various filter sizes and different CNN architectures. Furthermore, we provide complete end-to-end automation, including data quantization exploration with Ristretto. The efficiency of the proposed architecture is demonstrated on AlexNet, VGG, SqueezeNet, and GoogLeNet.
9 Iterative convolutional neural network (ICNN): an iterative CNN solution for low power and real-time systems
pp. 191–232 (42)
With convolutional neural networks (CNNs) becoming more of a commodity in the computer vision field, many have attempted to improve CNNs in a bid to achieve better accuracy, to the point that CNN accuracies have surpassed human capabilities. However, with deeper networks, the number of computations, and consequently the energy needed per classification, has grown considerably. In this chapter, an iterative approach is introduced that transforms the CNN from a single feed-forward network processing a large image into a sequence of smaller networks, each processing a subsample of the image. Each smaller network combines the features extracted by all the earlier networks to produce classification results. Such a multistage approach allows the CNN function to be dynamically approximated by creating the possibility of early termination, performing the classification with far fewer operations than a conventional CNN.
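The early-termination control flow of such an iterative approach can be sketched as follows; the stage internals and the confidence threshold are placeholders, not the chapter's actual networks:

```python
def iterative_classify(stages, x, threshold=0.9):
    """Run a sequence of small networks over subsamples of x. Each stage
    sees the features of all earlier stages and returns (features, probs);
    stop early once the top class probability clears the threshold."""
    feats = []
    for used, stage in enumerate(stages, start=1):
        f, probs = stage(x, feats)       # stage reuses earlier features
        feats.append(f)
        if max(probs) >= threshold:      # confident enough: terminate early
            break
    return probs.index(max(probs)), used  # predicted class, stages run

# two dummy stages: the first is unsure, the second is confident
stages = [
    lambda x, feats: (None, [0.5, 0.5]),
    lambda x, feats: (None, [0.95, 0.05]),
]
cls, used = iterative_classify(stages, None)
```

Easy inputs exit after the first stage or two, so the average energy per classification drops while hard inputs still traverse the full sequence.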
Part V. Deep learning on analog accelerators
10 Mixed-signal neuromorphic platform design for streaming biomedical signal processing
pp. 235–264 (30)
This chapter presents a mixed-signal design approach for neuromorphic platforms for biomedical signal processing. The proposed approach combines algorithmic, architectural, and circuit design concepts to offer a low-power neuromorphic platform for streaming biomedical signal processing. The platform employs liquid state machines using spiking neurons (implemented with analog neuron circuits) and a support vector machine (SVM) (implemented as software running on an advanced RISC machine (ARM) processor). A dynamic global synaptic communication network, realized using an ultralow-leakage IGZO thin-film transistor (TFT) circuit switch, is also presented. The proposed architectural technique offers a scalable, low-power neuromorphic platform design approach suitable for processing real-time biomedical signals.
11 Inverter-based memristive neuromorphic circuit for ultra-low-power IoT smart applications
pp. 265–295 (31)
Nowadays, the analysis of massive amounts of data is generally performed by remotely accessing large cloud computing resources. Cloud computing is, however, hindered by security limitations, bandwidth bottlenecks, and high cost. In addition, while unstructured and multimedia data (video, audio, etc.) are straightforwardly recognized and processed by the human brain, conventional digital computing architectures have major difficulties in processing this type of data, especially in real time. Another major concern for data processing, especially for Internet of Things (IoT) devices that perform distributed sensing and typically rely on energy scavenging, is power consumption. One way to deal with the cloud computing bottlenecks is to use low-power neuromorphic circuits: embedded intelligent circuits aimed at real-time screening and preprocessing of data before submitting it to the cloud for further processing. This chapter explores ultra-low-power analog neuromorphic circuits for processing sensor data in IoT devices, where low-power yet area-efficient computation is required. To reduce power consumption without losing performance, we resort to a memristive neuromorphic circuit that employs inverters instead of power-hungry op-amps. We also discuss ultra-low-power mixed-signal analog-to-digital converters (ADCs) and digital-to-analog converters (DACs) that make the analog neuromorphic circuit connectable to other digital components such as an embedded processor. To illustrate how inverter-based memristive neuromorphic circuits can reduce power and area, several case studies are presented.
Back Matter
p. (1)