CNN agnostic accelerator design for low latency inference on FPGAs

From the book: Hardware Architectures for Deep Learning

In this chapter, we study the factors impacting CNN accelerator designs on FPGAs, show how the on-chip memory configuration affects the usage of off-chip bandwidth, and present a uniform memory model that effectively uses both memory systems. The majority of work on FPGA-based acceleration of CNNs has focused on maximizing throughput. Such implementations rely on batch processing to improve throughput and are mainly tailored for cloud deployment; however, they fall short in latency-critical applications such as autonomous driving, drone surveillance and interactive speech recognition. We therefore avoid batching of any kind and focus on reducing the latency of each input image. In addition, we avoid Winograd transformations, which are optimized only for 3 × 3 filter layers and lack flexibility, in order to retain support for various filter sizes and different CNN architectures. Furthermore, we provide complete end-to-end automation, including data quantization exploration with Ristretto. The efficiency of the proposed architecture is demonstrated by studying its performance on AlexNet, VGG, SqueezeNet and GoogLeNet.
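To make the flexibility argument concrete, the sketch below shows a tiled convolution loop nest of the kind such an accelerator might implement (cf. Sections 8.4 and 8.5 on parallelism sources and loop coalescing). It is a minimal illustration, not the chapter's implementation: the tile sizes TM/TN, the function conv_tile and all other identifiers are hypothetical. Keeping the filter size k a runtime parameter is what lets a single engine serve 1 × 1, 3 × 3 and larger filters, whereas a Winograd datapath would be fixed to 3 × 3.

  /* Minimal sketch, not the chapter's code: a tiled convolution loop nest
   * of the kind an HLS-based CNN accelerator might implement. The tile
   * sizes TM/TN and all identifiers here are hypothetical. */
  #include <stdio.h>

  #define TM 4   /* output-feature-map tile: OFMs computed in parallel */
  #define TN 4   /* input-feature-map tile: IFMs consumed in parallel  */

  /* Accumulate one k x k window from TN input maps into TM partial sums.
   * Keeping k a runtime parameter preserves support for arbitrary filter
   * sizes (1x1, 3x3, 5x5, ...), unlike a Winograd-specialized datapath. */
  static void conv_tile(int k,
                        float ifm[TN][k][k],      /* input window      */
                        float wts[TM][TN][k][k],  /* filter weights    */
                        float psum[TM])           /* running partials  */
  {
      for (int m = 0; m < TM; m++)        /* unrolled across PEs in HW */
          for (int n = 0; n < TN; n++)    /* unrolled across PEs in HW */
              for (int r = 0; r < k; r++)
                  for (int c = 0; c < k; c++)
                      psum[m] += wts[m][n][r][c] * ifm[n][r][c];
  }

  int main(void)
  {
      enum { K = 3 };
      float ifm[TN][K][K], wts[TM][TN][K][K], psum[TM] = {0};

      /* Fill with a trivial pattern so the example runs end to end. */
      for (int n = 0; n < TN; n++)
          for (int r = 0; r < K; r++)
              for (int c = 0; c < K; c++)
                  ifm[n][r][c] = 1.0f;
      for (int m = 0; m < TM; m++)
          for (int n = 0; n < TN; n++)
              for (int r = 0; r < K; r++)
                  for (int c = 0; c < K; c++)
                      wts[m][n][r][c] = 0.5f;

      conv_tile(K, ifm, wts, psum);
      printf("psum[0] = %.1f\n", psum[0]);  /* 4 IFMs * 9 taps * 0.5 = 18.0 */
      return 0;
  }

In hardware, the two outer loops would be fully unrolled into a TM × TN grid of multiply-accumulate units, and the running values in psum correspond to the partial-sum reuse strategies the chapter discusses in Section 8.5.2.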

Chapter Contents:

  • 8.1 Introduction
  • 8.2 Brief review of efforts on FPGA-based acceleration of CNNs
  • 8.3 Network structures and operations
  • 8.3.1 Convolution
  • 8.3.2 Inner product
  • 8.3.3 Pooling
  • 8.3.4 Other operations
  • 8.4 Optimizing parallelism sources
  • 8.4.1 Identifying independent computations
  • 8.4.2 Acceleration strategies
  • 8.5 Computation optimization and reuse
  • 8.5.1 Design control variables
  • 8.5.2 Partial sums and data reuse
  • 8.5.2.1 IFMs first strategy
  • 8.5.2.2 OFMs first strategy
  • 8.5.3 Proposed loop coalescing for flexibility with high efficiency
  • 8.6 Bandwidth matching and compute model
  • 8.6.1 Resource utilization
  • 8.6.2 Unifying off-chip and on-chip memory
  • 8.6.2.1 Impact of unmatched system
  • 8.6.2.2 Effective bandwidth latency
  • 8.6.3 Analyzing runtime
  • 8.6.3.1 Estimating required off-chip bandwidth
  • 8.7 Library design and architecture implementation
  • 8.7.1 Concurrent architecture
  • 8.7.2 Convolution engine
  • 8.7.2.1 Optimal DRAM access
  • 8.7.3 Restructuring fully connected layers
  • 8.7.4 Zero overhead pooling
  • 8.7.5 Other layers
  • 8.8 Caffe integration
  • 8.9 Performance evaluation
  • 8.9.1 Optimizer results
  • 8.9.1.1 Latency estimation model
  • 8.9.1.2 Exploration strategy
  • 8.9.1.3 Design variable optimization
  • 8.9.2 Onboard runs
  • 8.9.2.1 Network-specific runs
  • 8.9.2.2 Cross-network run
  • 8.9.3 Architecture comparison
  • 8.9.3.1 Raw performance improvement
  • 8.9.3.2 Comparison with state-of-the-art
  • References

Inspec keywords: field programmable gate arrays; inference mechanisms; convolutional neural nets

Other keywords: Winograd Transformations; data quantization exploration; AlexNet; cloud deployment; SqueezeNet; batch processing; on-chip memory configuration; throughput improvement; low latency inference; uniform memory model; CNN agnostic accelerator design; CNN architectures; VGG; GoogLeNet; latency-critical applications; FPGA-based acceleration; end-to-end automation; off-chip bandwidth; memory systems

Subjects: Logic and switching circuits; Logic circuits; Neural computing techniques
