IEE Proceedings - Computers and Digital Techniques
Volume 152, Issue 5, September 2005
- Author(s): K.B. Kent ; M. Serra ; N. Horspool
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 152, Issue 5, p. 537–548
- DOI: 10.1049/ip-cdt:20041264
- Type: Article
Hardware/software co-design and (re)configurable computing with field programmable gate arrays (FPGAs) are used to create a highly efficient implementation of the Java virtual machine (JVM). Guidelines are provided for applying a general hardware/software co-design process to virtual machines, as are algorithms for context switching between the hardware and software partitions. The advantages of using co-design as an implementation approach for virtual machines are assessed using several benchmarks applied to the implemented co-design of the JVM. It is shown that significant performance improvements are achievable with appropriate architectural and co-design choices. The co-designed JVM could be a cost-effective solution for use in situations where the usual methods of virtual machine acceleration are inappropriate.
- Author(s): H.R. Simpson
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 152, Issue 5, p. 549–560
- DOI: 10.1049/ip-cdt:20045031
- Type: Article
A common form of asynchronous communication mechanism (ACM) provides a one-way data transfer connection between two concurrent processes in which the writer uses a control algorithm to release data within the mechanism and the reader uses a control algorithm to acquire data within the mechanism, without recourse to arbitration or exclusion which could impede the progress of either the writer or the reader. When execution of the control algorithms is distinct, the reader acquires the latest data item released by the writer. When execution of the control algorithms overlaps, the reader acquires either a data item which is being or has been released during the overlapped acquisition, or the data item which is the latest released prior to the start of acquisition. The particular data item acquired is determined by the interleaving of critical events in each algorithm. The paper formally analyses a pair of algorithms to determine the way in which the data item acquired depends on the precise ordering of interleaved algorithm events. New insights into the notion of data freshness in ACMs are gained, and the essential similarity between several well known pairs of control algorithms is exposed. The formal analysis of ACM algorithms complements the previously derived formal specifications for such ACMs.
- Author(s): H.T. Vergos and C. Efstathiou
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 152, Issue 5, p. 561–566
- DOI: 10.1049/ip-cdt:20055037
- Type: Article
Squarers modulo M are useful design blocks for digital signal processors that internally use a residue number system and for implementing the exponentiators required in cryptographic algorithms. In these applications, some of the most commonly used moduli are those of the form $2^n+1$. To avoid using (n+1)-bit circuits, the diminished-1 number system can be effectively used in modulo $2^n+1$ arithmetic applications. In the paper, for the first time in the open literature, the authors formally derive modulo $2^n+1$ squarers that adopt the diminished-1 number system. The resulting implementations are built using only full- and half-adders and a final diminished-1 adder and can therefore be pipelined straightforwardly.
- Author(s): B.J. Falkowski and C. Fu
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 152, Issue 5, p. 567–576
- DOI: 10.1049/ip-cdt:20045162
- Type: Article
New fastest linearly independent (LI) transforms over Galois field (3) (GF(3)) and their corresponding polynomial expansions have been introduced. The number of required additions and multiplications in new LI transforms is lower when compared with the ternary Reed–Muller transform, which was previously known as the most efficient transform over GF(3). The paper discusses various properties of these fastest LI transforms and their corresponding polynomial expansions over GF(3) as well as their comparison with the ternary Reed–Muller transform. Experimental results in one class of fastest LI transforms for some ternary benchmark functions are also shown here and compared with those of the fixed polarity Reed–Muller transform over GF(3).
- Author(s): S.-K. Lu ; H.-C. Wu ; Y.-C. Tsai
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 152, Issue 5, p. 577–584
- DOI: 10.1049/ip-cdt:20045035
- Type: Article
A novel fault detection and diagnosis technique named the ‘ping-pong’ type approach for field programmable gate arrays (FPGAs) is proposed in the paper. The authors first derive efficient (k+1) test configurations for a single configurable logic block (CLB) which guarantee 100% fault coverage, where k denotes the number of inputs of a lookup table (LUT). Furthermore, the whole CLB array is divided into cell groups and each group contains two cells: the master cell and the slave cell. Since both cells can be used as the test pattern generator (TPG) and the blocks under test (BUTs) at the same time, one test session is required instead of the two test sessions needed by traditional fault detection techniques. The test complexity can then be reduced significantly. The name of the ping-pong type approach comes from the fact that, if the master cell sends a test pattern to the slave cell, the output of the slave cell is forwarded to the input of the master cell as a test pattern. By iterating this process, all cells will receive pseudo-exhaustive test patterns. The output of each cell is read out immediately after one test pattern is applied through the configuration memory readback or implicit scan circuitry. Therefore, multiple fault detection and location can be achieved easily. Since the number of test sessions is less than that for the traditional approaches, significant speedup can be guaranteed. The detection and diagnosis complexities are compared with those of other works.
- Author(s): A. Baniasadi
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 152, Issue 5, p. 585–595
- DOI: 10.1049/ip-cdt:20045117
- Type: Article
Designers have invested much effort in developing accurate branch predictors. To maintain accuracy, current processors update the predictor regularly and frequently. Although this aggressive approach helps to achieve high accuracy, for a large number of branches updating the branch predictor unit is often unnecessary, as there is already enough information available to the predictor to predict the branch outcome accurately. Therefore, the current approach appears to be inefficient since it results in unnecessary energy consumption. The author introduces the power-aware branch predictor update (PABU). PABU uses a simple power-efficient structure to identify well-behaved, accurately predicted branch instructions. Once such branches are identified, the predictor is no longer accessed to update the associated data. The key to the success of the proposed technique is a power-efficient method that can effectively identify such branches. The author exploits branch instruction behaviour to identify such branch instructions. He shows that it is possible to reduce the number of predictor updates considerably without losing performance. The technique is evaluated by studying energy and performance tradeoffs for SPEC2000 benchmarks. It is shown that the technique can reduce branch prediction energy consumption considerably for both floating point and integer benchmarks. This comes with a negligible impact on performance.
- Author(s): Q. Zhao ; S.-J. Lee ; D.J. Lilja
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 152, Issue 5, p. 596–608
- DOI: 10.1049/ip-cdt:20045148
- Type: Article
Value prediction has been proposed as a technique to break true data dependences in order to increase the instruction-level parallelism available in programs. Recent work has pointed out, however, that the delay inherent in updating the value prediction table with the actual correct value can introduce a substantial number of wrong value predictions, which can then decrease the overall processor performance. The authors propose and systematically study a technique that they call ‘hyperprediction’ to compensate for the delay in updating the value prediction table. This approach accurately computes and records the number of outstanding instances of an instruction, which is the number of times an instruction has requested a value to be predicted since the last time the corresponding entry in the value prediction table was updated. With this information, the value predictor can provide reliable predictions for the currently requesting instance of an instruction based on both the currently stored value and the number of outstanding instances. They show how the hyperprediction technique can be implemented in a stride value predictor, a context-based predictor and a hybrid predictor. Their simulations using an extension to the SimpleScalar tool set and integer and floating-point programs from the SPEC95 and SPEC2000 benchmark suites indicate that this technique can consistently improve both the value prediction accuracy and the overall processor performance for each of the different types of value predictors.
- Author(s): W.-D. Tseng
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 152, Issue 5, p. 609–617
- DOI: 10.1049/ip-cdt:20045139
- Type: Article
Scan testing is expensive in power consumption as each test vector requires a large number of shift operations with a high circuit activity. For a scan cell, the number of transitions caused by a test vector being scanned in depends not only on the transitions in the test vector but also on its position in the scan chain. Depending on the circuit structure, the transitions at some scan cells may cause more transitions at the internal circuit than those at other scan cells. Therefore, reducing scan transitions at those scan cells that cause more transitions in the internal circuit will result in greater reduction in switching activity. In the paper, the authors propose a scan cell ordering approach to reduce scan power consumption by arranging the scan cells which cause more internal circuit transitions to the positions with low transition weights in the scan chain. Two functions are developed to compute the transition weight of a scan cell and to measure the impact of transitions at a scan cell on switching activity in the internal circuit, respectively. Experiments performed on the ISCAS 89 benchmark circuits show that the average power consumption during scan testing can be reduced by up to 17.35%. Moreover, because the proposed approach is independent of the order of test vectors, it can be utilised together with the existing test vector reordering techniques to further reduce test power.
- Author(s): J.A. Gibbons ; D.M. Howard ; A.M. Tyrrell
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 152, Issue 5, p. 619–631
- DOI: 10.1049/ip-cdt:20045178
- Type: Article
The paper addresses the potential benefits of using a field programmable gate array (FPGA) as opposed to a traditional processor for music synthesis. The benefits result from the use of a cellular design, with each cell performing identical operations on its own state and the states of its neighbours. This gives advantages of design simplicity through inherent parallelism. A cellular model which has previously been used for music synthesis is the mass-spring paradigm, and this model is implemented on an FPGA. On a sequential processor, the clock speed requirements of this model increase as $N^2$ (where N is the number of cells), whereas the clock speed requirement increases as N on an FPGA (if the cells can be made small enough that available area on the chip is not a constraint). To make the cells small enough, a bit-serial design was used. This work was performed to advocate the FPGA as a stand-alone live performance music synthesis platform, which has the potential to be reconfigured, for example in performance, between synthesis techniques as different sound patches are selected. The paper considers in particular the mass-spring model, because it demonstrates the type of music synthesis technique for which FPGAs provide the greatest potential benefit. The capability to reconfigure combined with the high performance that can be achieved using cellular design suggests that the FPGA is an ideal platform for a live performance hardware music synthesiser, combining the flexibility of software with the speed of a custom ASIC.
- Author(s): G. Srinivasan ; S. Bhattacharya ; S. Cherubal ; A. Chatterjee
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 152, Issue 5, p. 632–642
- DOI: 10.1049/ip-cdt:20045074
- Type: Article
A novel test methodology for fast and accurate testing of RF power amplifiers used in wireless communication that employ time division multiplexing is presented. The steep cost of high frequency testers can be largely complemented by the proposed method due to its ease of implementation on low cost testers. TDMA power amplifiers usually have a control voltage to operate the device in various modes of operation. At each of the control voltage values, all the specifications of the power amplifier are measured to ensure the performance of each tested device. A new method is proposed to test all the specifications of these devices at different control voltage values by capturing the transient current response of their bias circuits to a time-varying control voltage stimulus. This results in shorter test times compared to conventional test methods. The test specification values are measured with an error of less than 5% for all the specifications, and a test time reduction by a factor of more than 3 was achieved. The proposed test approach can specifically benefit production test of quad-band amplifiers (GSM850, GSM900, PCS/DCS), as a single transient current measurement can be used to compute all the specifications of the device in each of the modes of operation.
- Author(s): J. Hu and R. Marculescu
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 152, Issue 5, p. 643–651
- DOI: 10.1049/ip-cdt:20045092
- Type: Article
The objective of the paper is to introduce a novel energy-aware scheduling (EAS) algorithm which statically schedules application-specific communication transactions and computation tasks onto heterogeneous network-on-chip (NoC) architectures. The proposed algorithm automatically assigns the application tasks onto different processing elements and then schedules their execution under real-time constraints. At the same time, the algorithm takes into consideration the exact communication delay by scheduling communication transactions in parallel. As the main theoretical contribution, the authors first formulate the problem of concurrent communication and task scheduling for heterogeneous NoC architectures and then propose an efficient heuristic to solve it. Experimental results show that significant energy savings can be achieved while meeting the specified performance constraints. For instance, for a complex multimedia application, 31% energy savings have been observed, on average, compared to the schedules generated by a standard earliest-deadline-first scheduler.
- Author(s): S.-F. Hsiao and M.-C. Chen
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 152, Issue 5, p. 653–665
- DOI: 10.1049/ip-cdt:20045152
- Type: Article
The Rijndael advanced encryption standard (AES) contains two important paired transformations, MixColumns (inverse MixColumns) and SubBytes (inverse SubBytes), the most crucial operations in the AES encryption/decryption processes. They consist of XOR-based inner-product operations in GF($2^8$). In the paper, two substructure sharing methods are proposed to reduce the area cost of implementing these transformations. The first method exploits pure bit-level sharing with two optimisation stages, while the second method combines both the byte-level and bit-level techniques to further improve the area/speed performance. Comparisons in both the architectural-level designs and the technology-dependent cell-based implementations are given. An AES processor with iterative architecture is implemented using both a 0.18 µm UMC cell library and a Xilinx FPGA device. Experimental results show that the whole AES processor based on our proposed method can reduce area cost significantly compared with Synopsys area-optimised synthesis results or other previous implementations.
- Author(s): G. Reinman
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 152, Issue 5, p. 666–678
- DOI: 10.1049/ip-cdt:20045099
- Type: Article
The register file of a modern superscalar processor is a critical component of the processor pipeline that can have a large impact on processor performance. Large register files provide larger windows of speculation to the processor and allow greater levels of instruction-level parallelism. However, the access time and energy consumption of these structures can grow quite large when these structures increase in size, especially considering the number of ports required. The paper proposes an architecture that moves the large register file needed to fully exploit greater levels of instruction-level parallelism off the schedule-to-execute path of the processor. This is accomplished by decoupling the instruction window (the amount of instruction state maintained in the reorder buffer and register file) from the scheduling window (the working set of registers required by the instruction scheduler and execution core). The state of the scheduling window is maintained by an operand file and a speculative logical register file. The operand file stores only the set of input registers to be consumed by instructions in the issue queue, and provides low-latency and energy-efficient storage for the working set of registers. This design can reduce the energy dissipation by a factor of 6.5 on average over a traditional large register file, and allows the instruction window to be scaled independently of the register file structures on the schedule-to-execute path.
- Author(s): T. Rejimon and S. Bhanja
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 152, Issue 5, p. 679–685
- DOI: 10.1049/ip-cdt:20045106
- Type: Article
The authors propose a novel fault/error model based on a graphical probabilistic framework. They arrive at the logic induced fault encoded directed acyclic graph (LIFE-DAG), which is proven to be a Bayesian network, capturing all spatial dependencies induced by the circuit logic. Bayesian networks are the minimal and exact representation of the joint probability distribution of the underlying probabilistic dependencies that not only use conditional independencies in modelling but also exploit them for achieving minimality and smart probabilistic inference. The detection probabilities also act as a measure of soft error susceptibility (an increased threat in the nano-domain logic block) which depends on the structural correlations of the internal nodes and also on input patterns. Based on this model, they show that they are able to estimate detection probabilities of faults/errors on ISCAS'85 benchmarks with high accuracy, linear space requirement complexity, and with an order of magnitude (≈5 times) reduction in estimation time over corresponding binary decision diagram based approaches.
- Author(s): B. Cao ; T. Srikanthan ; C.H. Chang
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 152, Issue 5, p. 687–696
- DOI: 10.1049/ip-cdt:20045155
- Type: Article
A new reverse conversion algorithm is presented for the four-moduli set $\{2^n-1, 2^n, 2^n+1, 2^{n+1}-1\}$, for even values of n. The number theoretic properties of the popular three-moduli set $\{2^n-1, 2^n, 2^n+1\}$ have been exploited to realise a VLSI efficient alternative to that reported in the literature. The architecture proposed for most time efficient implementation provides for about three times speed-up. Another four-moduli set $\{2^n-1, 2^n, 2^n+1, 2^{n-1}-1\}$ has also been proposed by further extending this algorithm in an attempt to better adjust to dynamic ranges that cannot be best represented by the former four-moduli set. Unlike the existing reverse converter for the four-moduli set $\{2^n-1, 2^n, 2^n+1, 2^{n-1}-1\}$, the proposed architecture is shown to be more efficient both in terms of area and time, mainly due to deploying the properties of the three-moduli set $\{2^n-1, 2^n, 2^n+1\}$. Moreover, adder-based architectures for each moduli set lend themselves well to VLSI efficient implementations. Finally, both the architectures can be readily pipelined to achieve higher throughputs.
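As a rough illustration of what "reverse conversion" means for the first four-moduli set, the sketch below builds the set for an even n, converts an integer to its residues, and recovers it again. It uses the textbook Chinese remainder theorem, not the adder-based algorithm of the paper, and the function names (`moduli_set`, `to_residues`, `from_residues`) are hypothetical names chosen for this example.

```python
from itertools import combinations
from math import gcd, prod


def moduli_set(n):
    """Return [2^n - 1, 2^n, 2^n + 1, 2^(n+1) - 1] for an even n >= 2."""
    if n < 2 or n % 2:
        raise ValueError("n must be even and at least 2")
    return [2**n - 1, 2**n, 2**n + 1, 2**(n + 1) - 1]


def pairwise_coprime(moduli):
    """Check the precondition that makes the residue representation unique."""
    return all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))


def to_residues(x, moduli):
    """Forward conversion: an integer to its residue digits."""
    return [x % m for m in moduli]


def from_residues(residues, moduli):
    """Reverse conversion using the textbook Chinese remainder theorem.

    This is a generic reference method (an assumption of this sketch),
    not the hardware-oriented converter described in the abstract.
    """
    dynamic_range = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        partial = dynamic_range // m
        # pow(partial, -1, m) is the modular inverse (Python 3.8+).
        total += r * partial * pow(partial, -1, m)
    return total % dynamic_range


if __name__ == "__main__":
    moduli = moduli_set(4)
    residues = to_residues(100000, moduli)
    print(moduli, residues, from_residues(residues, moduli))
```

Any integer below the product of the moduli (the dynamic range) round-trips through `to_residues` and `from_residues`; the restriction to even n is what keeps $2^n+1$ and $2^{n+1}-1$ coprime.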
Hardware/software co-design for virtual machines
Data transfer analysis for a pair of asynchronous communication algorithms
Diminished-1 modulo $2^n+1$ squarer design
Fastest classes of linearly independent transforms over GF(3) and their properties
Multiple fault detection and diagnosis techniques for lookup table FPGAs
Power-aware branch predictor update
Using hyperprediction to compensate for delayed updates in value predictors
Scan chain ordering technique for switching activity reduction during scan test
FPGA implementation of 1D wave equation for real-time audio synthesis
Fast specification test of TDMA power amplifiers using transient current measurements
Communication and task scheduling of application-specific networks-on-chip
Efficient substructure sharing methods for optimising the inner-product operations in Rijndael advanced encryption standard
Using an operand file to save energy and to decouple commit resources
Time and space efficient method for accurate computation of error detection probabilities in VLSI circuits
Efficient reverse converters for four-moduli sets $\{2^n-1, 2^n, 2^n+1, 2^{n+1}-1\}$ and $\{2^n-1, 2^n, 2^n+1, 2^{n-1}-1\}$