IEE Proceedings - Computers and Digital Techniques
Online ISSN 1359-7027
Published from 1994-2006, IEE Proceedings - Computers and Digital Techniques contained significant and original contributions on computers, computing and digital techniques. It contained technical papers describing research and development work in all aspects of digital system-on-chip design and the testing of electronic and embedded systems, including the development of design automation tools. It was aimed at researchers, engineers and educators in the fields of computer and digital systems design and testing.
This publication was previously known as IEE Proceedings E (Computers and Digital Techniques) 1980-1993. ISSN 0143-7062.
This publication is continued by IET Computers & Digital Techniques 2007-. ISSN 1751-8601.
Latest content
Ant colony optimisation for task matching and scheduling
- Author(s): C.-W. Chiang; Y.-C. Lee; C.-N. Lee; T.-Y. Chou
- p. 373–380 (8)
PC clusters have recently received considerable interest as cost-effective parallel platforms for CPU-intensive applications. A cluster of PCs generally comprises a collection of heterogeneous processing elements (PEs). To make effective use of a PC cluster, a parallel program, characterised by a node- and edge-weighted directed acyclic graph (DAG), can usually be decomposed into a set of precedence-constrained atomic tasks such that the PEs can accommodate these tasks and minimise the overall program-completion time. Consequently, techniques for task matching and scheduling become extremely important for effectively harnessing the computing power of the target cluster-based system. This work presents a constructive algorithm based on ant colony optimisation (ACO). The proposed algorithm, ACO-TMS, adopts a new state transition rule that reduces the time required to find satisfactory scheduling results. It also integrates a local search procedure to help improve the scheduling results. The performance of the algorithm is demonstrated by comparing it against existing algorithms, such as the genetic-algorithm-based scheduling method and the dynamic priority scheduling (DPS) heuristic, in terms of the overall schedule length of randomly generated DAGs. Experimental results indicate that the proposed algorithm outperforms the genetic algorithm and the DPS heuristic in heterogeneous computing environments with high communication-to-computation ratios.

Two-phase prediction of L1 data cache misses
- Author(s): A. Mahjur; A.H. Jahangir
- p. 381–388 (8)
Hardware prefetching schemes that divide the misses into streams are generally preferred to other hardware-based schemes. However, because they do not know when the next miss of a stream will happen, they cannot prefetch a block at the appropriate time. Some of them use a substantial amount of hardware storage to keep the predicted miss blocks from all streams. Other approaches follow the program flow and prefetch all target addresses, including blocks that already exist in the L1 data cache. The approach presented predicts the stream of the next miss and then prefetches only the next miss address of that stream. It offers a general prefetching framework, the two-phase prediction (TPP) algorithm, that lets each stream have its own address predictor. Comparing the TPP algorithm with the latest variant of stream buffers and the Markov predictor using SPEC CPU 2000 benchmarks shows that, on average, (1) the TPP approach achieves an 18% speedup, compared with 1% for Markov and 0.05% for stream buffers, and (2) 78% of the TPP prefetches are useful, whereas only 18% and 24% of prefetches are useful in stream buffers and Markov, respectively.

LSQ: a power efficient and scalable implementation
- Author(s): F. Castro; D. Chaver; L. Pinuel; M. Prieto; M.C. Huang; F. Tirado
- p. 389–398 (10)
The load–store queue (LSQ) of modern superscalar processors is a critical and non-scalable component responsible for keeping the order of memory operations. As new architectures become more aggressive, the number of in-flight memory instructions increases, and the LSQ must satisfy higher capacity requirements. An efficient LSQ state filtering mechanism based on Bloom filtering is proposed, which, in conjunction with a dynamic or profiling-based predictor, provides significant energy reduction (up to 55% in the LSQ and 4% in the whole processor), and only incurs a small performance loss.
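The filtering idea in this abstract can be illustrated with a small software model. The sketch below is not the authors' design: it assumes a counting Bloom filter that tracks the addresses of in-flight stores, so that a load whose address misses in the filter can skip the associative LSQ search. The class name, the filter size, the number of hashes and the hash function itself are all hypothetical choices made for illustration.

```python
class CountingBloomFilter:
    """Toy counting Bloom filter over memory addresses (illustrative only)."""

    def __init__(self, num_counters=64, num_hashes=2):
        # Sizes are assumptions; the abstract does not specify them.
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes

    def _indexes(self, address):
        # Hypothetical multiplicative hashes, one per hash function.
        size = len(self.counters)
        return [
            (((address >> 3) * (2 * i + 1) * 0x9E3779B1) >> 16) % size
            for i in range(self.num_hashes)
        ]

    def insert(self, address):
        # Called when a store enters the queue.
        for index in self._indexes(address):
            self.counters[index] += 1

    def remove(self, address):
        # Called when a store leaves the queue; counters allow deletion.
        for index in self._indexes(address):
            self.counters[index] -= 1

    def may_contain(self, address):
        # False means "definitely absent"; True may be a false positive.
        return all(self.counters[index] > 0 for index in self._indexes(address))


def load_must_search_lsq(store_filter, load_address):
    """Return True only when the filter cannot rule out a matching store."""
    return store_filter.may_contain(load_address)
```

A negative answer from the filter is always safe, because a Bloom filter has no false negatives. A false positive only costs an unnecessary queue search, so in this model correctness does not depend on the filter's accuracy.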

Efficient new approach for modulo 2^n−1 addition in RNS
- Author(s): R.A. Patel; M. Benaissa; S. Boussakta
- p. 399–405 (7)
A new modulo 2^n−1 addition algorithm is presented, which is applicable in the residue number system. In contrast to previous work, the input carry in the first stage of the addition is set to one. The associated output carry is then used to conditionally modify the sum to produce the correct modulo 2^n−1 result. Moreover, unlike recent adders in the literature, the result never exceeds the dynamic range of the modulus. Actual VLSI implementations using 130 nm standard-cell technology show that the corresponding architectures provide improved trade-offs in the power–delay–area space when compared against existing designs.
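The arithmetic described in this abstract can be checked with a short behavioural model. The sketch below is one reading of the abstract, not the authors' hardware architecture: it computes a + b + 1, then uses the carry out of bit n to decide whether the low n bits are already the residue or must be decremented by one. The function name and the input-range check are assumptions.

```python
def add_mod_2n_minus_1(a: int, b: int, n: int) -> int:
    """Add two residues modulo 2**n - 1 using an input carry of one."""
    modulus = (1 << n) - 1
    if not (0 <= a < modulus and 0 <= b < modulus):
        raise ValueError("operands must be residues in [0, 2**n - 2]")

    # First stage: add with the input carry forced to one.
    total = a + b + 1
    carry_out = total >> n      # 1 exactly when a + b >= 2**n - 1
    low_bits = total & modulus  # low n bits of the first-stage sum

    # With a carry out, dropping it subtracts 2**n, which together with the
    # injected +1 amounts to subtracting the modulus. Without a carry out,
    # the injected one is simply removed again.
    return low_bits if carry_out else low_bits - 1
```

Because a + b = 2^n−1 produces a carry out and low bits of zero, the model returns 0 rather than the all-ones pattern, which matches the abstract's claim that the result stays within the dynamic range of the modulus.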

Flexible GF(2^m) arithmetic architectures for subword parallel processing ASIPs
- Author(s): W.M. Lim; M. Benaissa
- p. 291–301 (11)
Subword parallel (SWP) architectures for Galois field multiplication and division over GF(2^m) are presented. They are designed to meet the flexibility-versus-performance requirements of an application-specific instruction set processor for applications within the domain of GF(2^m). Suitable choices of basis, algorithm and architecture are addressed. Techniques for mapping an underlying Galois field arithmetic operation onto these SWP architectures are described in the context of suitable well-known GF(2^m) division and multiplication algorithms. The results of a detailed complexity analysis, undertaken to quantify the configuration overheads, are presented, together with the use of these architectures in an SWP processor for cryptography.

Bottom-up approach in automated embedded memory model generation for high-performance microprocessors
- Author(s): J. Bhadra; M.S. Abadir; D. Burgess; E. Trofimova
- p. 302–312 (11)
In modern high-performance microprocessors, embedded memories account for approximately half the area and more than 50% of the transistors. Because of their ubiquitous nature, modelling memories remains an immensely important part of the design methodology. Adding to the challenge of memory modelling is the requirement that the memories be modelled for each individual debug methodology: testing, formal verification, validation, emulation and so on. A tool (MemGen) that automates generation of all memory models required by testing, verification and emulation methodologies is described. MemGen is a robust pivot of our overall design methodology, and is currently in routine use in all live design projects in Freescale Semiconductor's high-performance design centre. Results obtained from using MemGen-generated embedded memories in real-life design projects of Freescale G2 and G4 microprocessors are presented.

Improving energy-efficiency in high-performance processors by bypassing trivial instructions
- Author(s): E. Atoofian; A. Baniasadi
- p. 313–322 (10)
Energy-efficiency benefits of bypassing trivial computations in high-performance processors are studied. Trivial computations are those computations whose output can be determined without performing the computation. Bypassing trivial instructions reduces energy consumption while improving performance. The present study shows that, for the subset of SPEC'2K and MiBench benchmarks studied here, bypassing trivial instructions can improve energy and energy-delay by up to 15.6% and 30.6%, respectively, in an optimistic scenario and by 10.8% and 21.7% in a pessimistic scenario, over a conventional processor.
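The definition of a trivial computation given above can be made concrete with a small detector. The sketch below is illustrative only: the abstract does not list which operations or operand patterns the authors treat as trivial, so the cases covered here (for example x + 0, x * 0, x * 1 and x - x) and the function name are assumptions.

```python
from typing import Optional


def trivial_result(op: str, a: int, b: int) -> Optional[int]:
    """Return the result if (op, a, b) is trivial, otherwise None.

    A None result means the operation would go to a functional unit
    as usual; any other result could bypass it.
    """
    if op == "add":
        if a == 0:
            return b
        if b == 0:
            return a
    elif op == "sub":
        if b == 0:
            return a
        if a == b:
            return 0
    elif op == "mul":
        if a == 0 or b == 0:
            return 0
        if a == 1:
            return b
        if b == 1:
            return a
    elif op == "and":
        if a == 0 or b == 0:
            return 0
        if a == b:
            return a
    elif op == "or":
        if a == 0:
            return b
        if b == 0:
            return a
        if a == b:
            return a
    return None
```

In this model the detector inspects only the operand values, which is what allows the result to be produced without performing the computation itself.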

Applying the Handel-C design flow in designing an HMAC-hash unit on FPGAs
- Author(s): E. Khan; M. Watheq El-Kharashi; F. Gebali; M. Abd-El-Barr
- p. 323–334 (12)
An emerging system design methodology is used to design a reconfigurable HMAC-hash unit. This methodology directly maps a design described in a high-level language, Handel-C, to field programmable gate array platforms. The Handel-C approach narrows the gap between performance and flexibility and thus reduces the risk of translating a high-level prototype into hardware description languages. It allows for a high degree of flexibility from two viewpoints: the language level of abstraction and the hardware reconfiguration. A detailed case study is considered: a reconfigurable HMAC-hash unit that implements six standard hash functions: MD5, SHA-1, RIPEMD-160, HMAC-MD5, HMAC-SHA-1 and HMAC-RIPEMD-160. The performance of the designed unit has been enhanced by applying pipelining, parallelism and reconfigurability through the use of the Handel-C methodology. The use of Handel-C resulted in an HMAC-hash unit architecture that is faster than most of the previously designed units. At the same time, the area cost of putting the six standard algorithms on the same hardware core is kept to a minimum. It is found that the time required to design, implement and test the designed unit using this methodology is reasonably low compared with the time required using other design approaches.
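For readers unfamiliar with how the HMAC variants relate to the underlying hash functions, the following software sketch builds HMAC from a plain hash as defined in RFC 2104. It is a reference model in Python, not the Handel-C design described in the abstract. The function name is hypothetical, and only MD5 and SHA-1 are exercised because RIPEMD-160 support in Python's hashlib depends on the local OpenSSL build.

```python
import hashlib

BLOCK_SIZE = 64  # bytes; the block size shared by MD5, SHA-1 and RIPEMD-160


def hmac_digest(hash_name: str, key: bytes, message: bytes) -> bytes:
    """Compute HMAC(key, message) over the named hash, following RFC 2104."""
    # Keys longer than one block are first hashed down.
    if len(key) > BLOCK_SIZE:
        key = hashlib.new(hash_name, key).digest()
    # Keys are then zero-padded to exactly one block.
    key = key.ljust(BLOCK_SIZE, b"\x00")

    inner_pad = bytes(byte ^ 0x36 for byte in key)
    outer_pad = bytes(byte ^ 0x5C for byte in key)

    # Two passes of the same hash: inner over the message, outer over the
    # inner digest.
    inner = hashlib.new(hash_name, inner_pad + message).digest()
    return hashlib.new(hash_name, outer_pad + inner).digest()
```

The model shows why a single core can plausibly serve both a hash and its HMAC variant: the keyed construction reuses the same compression function twice with different padded keys. Its output can be compared against the standard library's hmac module as a sanity check.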

Simultaneous wiring and buffer block planning with optimal wire-sizing for interconnect-driven floorplanning
- Author(s): J.-T. Yan
- p. 335–347 (13)
As VLSI circuits are scaled into advanced deep-submicron (DSM) dimensions, interconnection delay plays an important role in any performance-driven design. In general, wire sizing and buffer insertion can be used to reduce the timing delay of an interconnection net, but uniform wire sizing cannot lead to the timing optimisation of an interconnection net. On the basis of an analysis of the optimal wire width on one wire segment, wire planning with optimal wire widths is proposed, which uses less routing area to reduce the timing delay of an interconnection net. Furthermore, given a compact floorplan with a set of interconnection nets, an area-driven buffer block planning with optimal wire sizing (ABBP_OWS) algorithm is proposed, based on an analysis of buffer locations on one wire segment and the construction of a recursive buffer-location graph. The algorithm inserts feasible buffers into the given floorplan for each net without violating the timing constraint of any routing net, and its time complexity is proved to be O(mn^2), where m is the number of interconnection nets and n is the number of circuit blocks in the floorplan. Finally, experimental results on all the tested benchmark circuits for interconnect-driven floorplanning show that the proposed ABBP_OWS algorithm uses less routing area and floorplan area while meeting the timing requirements of more interconnection nets.

Employing pipelined thinning architecture for real-time fingerprint verifier
- Author(s): P.Y. Hsiao; X.Z. Chen; C.C. Lin; C.H. Hua; C.C. Chang
- p. 348–354 (7)
Thinning is a very important operation in the pre-processing stage of fingerprint recognition. With the availability of fast thinning hardware, real-time image processing applications can be achieved. The authors introduce a detailed hardware architecture design of a thinning processor used in an embedded fingerprint recognition system. The proposed thinning algorithm has a parallel-pipelining structure suited to hardware realisation, and is implemented and verified using an FPGA. Equipped with a modification unit array, a designated operating schedule and an address generator based on a systolic counter, this thinning processor is able to perform a thinning operation within 0.07 s at 40 MHz for a 512×512 picture, which is at least 40 times faster than software execution. Consequently, the proposed thinning processor was successfully integrated into a real-time fingerprint recognition system.

