IET Computers & Digital Techniques
Volume 7, Issue 5, September 2013
A case for three-dimensional stacking of tightly coupled data memories over multi-core clusters using low-latency interconnects
- Author(s): Erfan Azarkhish; Igor Loi; Luca Benini
- Source: IET Computers & Digital Techniques, Volume 7, Issue 5, p. 191–199
- DOI: 10.1049/iet-cdt.2013.0031
- Type: Article
Shared tightly coupled data memories are key architectural elements for building multi-core clusters in programmable accelerators and embedded systems, as they provide a convenient shared memory abstraction while avoiding cache coherence overheads. The performance of these memories largely depends on the architecture of the interconnect between the processing elements (PEs) and the memory banks. The advent of three-dimensional (3D) technology has provided new opportunities to increase design modularity and reduce latency and manufacturing cost. In this study, the authors propose two 3D network architectures, C-logarithmic interconnect (C-LIN) and distributed logarithmic interconnect (D-LIN), both designed in synthesisable RTL, which allow modular stacking of multiple L1 memory dies over a multi-core cluster with a limited number of PEs. The authors have used two through-silicon-via technologies: state-of-the-art micro-bumps and the promising, dense Cu–Cu direct bonding. The overhead of electrostatic discharge protection circuits has also been considered. Architectural simulation results demonstrate that, in the processor-to-L1-memory context, C-LIN and D-LIN perform significantly better than traditional networks-on-chip and simple time-division multiplexing buses. Furthermore, post-layout results show that the proposed 3D architectures achieve speed comparable to their 2D counterparts while enabling modularity: L1 memory configurations from 256 kB to 2 MB with a single mask set.
AMBA bus hardware accelerator IP for Viola–Jones face detection
- Author(s): Laurentiu Acasandrei and Angel Barriga
- Source: IET Computers & Digital Techniques, Volume 7, Issue 5, p. 200–209
- DOI: 10.1049/iet-cdt.2012.0118
- Type: Article
Face detection is an important aspect of biometrics, video surveillance and human–computer interaction. Owing to the complexity of the detection algorithms, any biometric system requires a large amount of computational and memory resources. A direct software-like implementation of a detection algorithm on a low-speed, low-resource, low-power system on chip (SoC) is not feasible. Instead, a software–hardware codesign approach can be used to build hardware accelerators for the most computationally intensive parts of the detection algorithms. The authors therefore propose a hardware IP compliant with the advanced microcontroller bus architecture (AMBA) bus: a modularised, highly configurable, low-power and technology-independent core written in a hardware description language (HDL). The IP core accelerates the Viola–Jones algorithm, considered to be one of the most widely used algorithms for face detection. The hardware accelerator IP is used in an embedded face detection system built around the LEON3 SPARC V8 processor. The authors present the methodology, challenges and performance results for software, hardware and system-level design. For this system, the authors have obtained an acceleration factor of 10–12 when using the hardware accelerator compared with the traditional software-only approach.
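As background to the abstract above, the standard Viola–Jones algorithm evaluates Haar-like features using an integral image (summed-area table), so that any rectangle sum costs four lookups. The following is a minimal software sketch of that general technique only; the abstract does not say which parts of the algorithm the IP core accelerates, and the function names (`integral_image`, `rect_sum`, `two_rect_feature`) are hypothetical, not taken from the paper.

```python
def integral_image(img):
    """Build a summed-area table with an extra zero row and column.

    ii[y][x] holds the sum of all pixels img[r][c] with r < y and c < x.
    """
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0  # running sum of the current row
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle with top-left (x, y), width w, height h."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """Horizontal two-rectangle Haar-like feature: left half minus right half."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)


if __name__ == "__main__":
    image = [[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]]
    table = integral_image(image)
    print(rect_sum(table, 1, 1, 2, 2))          # sum of 5, 6, 8, 9
    print(two_rect_feature(table, 0, 0, 2, 3))  # column 0 minus column 1
```

Because each feature needs only a handful of table lookups and additions, this kind of computation is a natural candidate for a hardware datapath, although the actual partitioning used in the paper may differ.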
Built-in-self-test technique for diagnosis of delay faults in cluster-based field programmable gate arrays
- Author(s): Nachiketa Das; Pranab Roy; Hafizur Rahaman
- Source: IET Computers & Digital Techniques, Volume 7, Issue 5, p. 210–220
- DOI: 10.1049/iet-cdt.2012.0111
- Type: Article
The increasing circuit complexity of field programmable gate arrays (FPGAs) poses a major challenge for FPGA testing. One such challenge is detecting delay faults in high-speed circuits. The built-in self-test (BIST) technique is an easy solution compared with expensive automatic test equipment. In this work, a BIST structure is proposed to detect delay faults in the various resources of the FPGA, such as multipliers, digital signal processing (DSP) blocks and look-up tables, as well as in the FPGA interconnects. The authors have also proposed a fully diagnosable BISTer structure that improves the testing efficiency of the logic BIST. The proposed BISTer structure can diagnose the faulty configurable logic blocks (CLBs) even when all the CLBs in the 2 × 3 BIST are faulty. The proposed scheme has been simulated on a Xilinx Virtex FPGA using the ISE tool, the JBits 3.0 API, XHWI (Xilinx HardWare Interface) and MATLAB 7.0. The results show a significant improvement over earlier BIST methods.
Field programmable gate arrays-based differential evolution coprocessor: a case study of spectrum allocation in cognitive radio network
- Author(s): Kiran Kumar Anumandla; Rangababu Peesapati; Samrat L. Sabat; Siba K. Udgata; Ajith Abraham
- Source: IET Computers & Digital Techniques, Volume 7, Issue 5, p. 221–234
- DOI: 10.1049/iet-cdt.2012.0109
- Type: Article
In this study, a scalable coprocessor for accelerating the differential evolution (DE) algorithm is presented. The coprocessor is interfaced with the PowerPC embedded processor of a Xilinx Virtex-5 FX70T field programmable gate array (FPGA). In the proposed design, the DE algorithm module is tightly coupled with the fitness function module to reduce communication and control overhead. The fixed-point DE algorithm is implemented in the coprocessor, whereas both fixed-point and floating-point DE are implemented in the embedded processor. The performance of the coprocessor is evaluated by optimising benchmark functions of different complexities. The implementation results show that the coprocessor is 73.14–160.2× and 2.19–27.63× faster than software execution of the floating-point and fixed-point algorithms, respectively. As a case study, the spectrum allocation problem of a cognitive radio network is evaluated with the coprocessor. Results show an acceleration of 76.79–105× and 5.19–6.91× with respect to floating-point and fixed-point DE, respectively, on the embedded processor. It is also observed that the application occupies 56% of the BRAM, 54% of the DSP48E blocks and 16% of the slice LUTs, and reaches a maximum operating frequency of 63.55 MHz on a Virtex-5 FPGA. This type of coprocessor is suitable for embedded applications where the fitness function remains unchanged.
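For readers unfamiliar with DE, the loop that such a coprocessor would accelerate can be sketched in software as follows. This is a minimal floating-point sketch of the common DE/rand/1/bin variant with in-place greedy selection; the abstract does not state which DE variant, parameters or fitness functions the authors use, so the function names, the defaults (`pop_size`, `f`, `cr`, `generations`) and the `sphere` benchmark are all assumptions made for illustration.

```python
import random


def differential_evolution(fitness, bounds, pop_size=20, f=0.5, cr=0.9,
                           generations=100, seed=0):
    """Minimise `fitness` over the box `bounds` using DE/rand/1/bin.

    All parameter defaults are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial population inside the bounds.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [fitness(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals, all different from the target i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated gene
            trial = []
            for j in range(dim):
                if rng.random() < cr or j == j_rand:
                    # Mutation: base vector plus scaled difference vector.
                    v = pop[a][j] + f * (pop[b][j] - pop[c][j])
                    lo, hi = bounds[j]
                    v = min(max(v, lo), hi)  # clamp to the search box
                else:
                    v = pop[i][j]  # binomial crossover keeps the target gene
                trial.append(v)
            s = fitness(trial)
            # Greedy selection, applied in place (asynchronous update variant).
            if s <= scores[i]:
                pop[i], scores[i] = trial, s
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


def sphere(x):
    """Simple benchmark function with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


if __name__ == "__main__":
    solution, value = differential_evolution(sphere, [(-5.0, 5.0)] * 3)
    print(solution, value)
```

Note that the fitness evaluation sits inside the innermost selection step, which is consistent with the abstract's point that coupling the DE module tightly with the fitness function module reduces communication overhead.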
Most cited content for this Journal
High-performance elliptic curve cryptography processor over NIST prime fields
- Author(s): Md Selim Hossain; Yinan Kong; Ehsan Saeedi; Niras C. Vayalil
- Type: Article
Majority-based evolution state assignment algorithm for area and power optimisation of sequential circuits
- Author(s): Aiman H. El-Maleh
- Type: Article
Scalable GF(p) Montgomery multiplier based on a digit–digit computation approach
- Author(s): M. Morales-Sandoval and A. Diaz-Perez
- Type: Article
Fabrication and characterisation of Al gate n-metal–oxide–semiconductor field-effect transistor, on-chip fabricated with silicon nitride ion-sensitive field-effect transistor
- Author(s): Rekha Chaudhary; Amit Sharma; Soumendu Sinha; Jyoti Yadav; Rishi Sharma; Ravindra Mukhiya; Vinod K. Khanna
- Type: Article
Adaptively weighted round-robin arbitration for equality of service in a many-core network-on-chip
- Author(s): Hanmin Park and Kiyoung Choi
- Type: Article