IEE Proceedings - Computers and Digital Techniques
Volume 152, Issue 4, July 2005
- Author(s): S. Kim ; N. Vijaykrishnan ; M. Kandemir ; M.J. Irwin
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 152, Issue 4, pp. 457–466
- DOI: 10.1049/ip-cdt:20045124
- Type: Article
Increasing clock frequencies and issue rates aggravates the memory latency problem, imposing higher memory bandwidth requirements. While caches can be multi-ported to provide high memory bandwidth, the increase in access latency with the increase in the number of ports limits their potential. The paper proposes a novel technique, called the ‘temporal load cache architecture’, to reduce load latencies and provide higher memory bandwidths. The key motivation for the technique is that temporal loads – dynamic instances of a static load instruction that access the same address as that accessed by the last dynamic instance of the same static load – constitute 48% of all dynamic loads on average for the SPEC2000 benchmarks. When a load is predicted to be temporal, the data predicted to be accessed by it are read early in the pipeline from a small temporal load cache that stores the temporal data. The proposed temporal load cache architecture has two main advantages. First, since instructions dependent on a temporal load are provided with their data early in the pipeline, they can be issued as soon as they resolve their remaining data dependences and resource conflicts. Second, since a large percentage of loads can be filtered by the temporal load cache, the main data cache can service other (nontemporal) loads better, providing higher memory bandwidth. The experimental results show that the proposed temporal load cache architecture improves performance by 8.3% on average for the SPEC2000 integer benchmarks.

- Author(s): T.A. Bartic ; J.-Y. Mignolet ; V. Nollet ; T. Marescaux ; D. Verkest ; S. Vernalde ; R. Lauwereins
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 152, Issue 4, pp. 467–472
- DOI: 10.1049/ip-cdt:20045016
- Type: Article
Network-on-chip designs promise to offer considerable advantages over traditional bus-based designs in solving the numerous technological, economic and productivity problems associated with billion-transistor system-on-chip development. The authors believe that different types of networks will be required, depending on the application domain. Therefore, a very flexible network design is proposed that is highly scalable and can be easily changed to accommodate various needs. A network-on-chip design, realised as part of the platform that the authors are developing for reconfigurable systems, is presented. This design is suitable for building networks with irregular topologies, and with low latency and high throughput.

- Author(s): L.N. Vintan ; A. Florea ; A. Gellert
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 152, Issue 4, pp. 473–481
- DOI: 10.1049/ip-cdt:20045090
- Type: Article
Value prediction (VP) is a relatively new technique that increases performance by eliminating true data dependency constraints. VP architectures allow data dependent instructions to issue and execute speculatively using the predicted value. This technique is built on the concept of value locality, which describes the likelihood of a previously seen value recurring within a storage location. The authors extend dynamic VP by introducing the concept of register-centric prediction instead of instruction-centric prediction. The value localities obtained on some registers of the MIPS architecture were quite remarkable, leading to the conclusion that VP might be successfully applied, at least on these favourable registers. The idea of attaching a value predictor to the processor's favourable registers is original and might involve new architectural techniques for improving performance and reducing the hardware cost of speculative micro-architectures. The register VP technique consists of predicting the registers' next values based on the previously seen values, and executing the subsequent data dependent instructions using the predicted values. The speculative execution is validated when the correct values become known. If the value was correctly predicted, the critical path is shortened; otherwise, the instructions executed with incorrect inputs must be re-executed. The authors examine different favourable register selections and different basic value predictors to capture certain types of value predictability from the SPEC benchmarks (1995 and 2000) and to obtain higher prediction accuracies. Their results show that there is a time correlation between the names of the destination registers and the values stored in these registers. The simulations show that the hybrid predictor optimally exploits this correlation, with an average prediction accuracy of 85.44%, which is quite remarkable (on some benchmarks the values are over 96%). Considering an eight-issue out-of-order superscalar processor, it is shown that register-centric VP produces average speedups of 17.30% for the SPECint95 benchmarks and of 13.58% for the SPECint2000 benchmarks.

- Author(s): P. Petrov and A. Orailoglu
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 152, Issue 4, pp. 482–488
- DOI: 10.1049/ip-cdt:20041101
- Type: Article
A methodology for a low-power branch identification mechanism, which enables the design of extremely power-efficient branch predictors for embedded processors, is presented. The proposed technique utilises application-specific information regarding the control-flow structure of the program's major loops. Such information is used to completely eliminate the power-hungry branch target buffer (BTB) lookups which normally occur at every execution cycle. Exact application knowledge regarding the control-flow structure of the program obviates the power-expensive BTB operations, thus enabling the utilisation of contemporary branch predictors in high-end, yet power-sensitive, embedded processors. The utilisation of exact application knowledge results not only in the complete elimination of the power-hungry BTB structure but also in perfect branch and target-address identification. A cost-efficient and programmable hardware architecture for capturing the control-flow structure of the program is presented. The hardware complexity of the proposed architecture is carefully analysed in terms of power, performance and area overhead. The proposed technique delivers power reductions in excess of 90% for a set of representative embedded benchmarks.

- Author(s): E. Savaş ; M. Naseer ; A.A-A. Gutub ; Ç.K. Koç
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 152, Issue 4, pp. 489–498
- DOI: 10.1049/ip-cdt:20059032
- Type: Article
Computation of multiplicative inverses in the finite fields GF(p) and GF(2^n) is the most time-consuming operation in elliptic curve cryptography, especially when affine co-ordinates are used. Since the existing algorithms based on the extended Euclidean algorithm do not permit a fast software implementation, projective co-ordinates, which eliminate almost all of the inversion operations from the curve arithmetic, are preferred. In the paper, the authors demonstrate that an affine co-ordinate implementation provides speed comparable to that of projective co-ordinates, given careful hardware realisation of existing algorithms for calculating inverses in both fields, without utilising special moduli or irreducible polynomials. They present two inversion algorithms for binary extension and prime fields, which are slightly modified versions of the Montgomery inversion algorithm. The similarity of the two algorithms allows the design of a single unified hardware architecture that performs the computation of inversion in both fields. They also propose a hardware structure where the field elements are represented using a multi-word format. This feature allows a scalable architecture able to operate over a broad range of precision, which has certain advantages in cryptographic applications. In addition, they include a statistical comparison of four inversion algorithms in order to help choose the best one amongst them for hardware implementation.

- Author(s): Y.-T. Lin ; P.-Y. Tsai ; T.-D. Chiueh
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 152, Issue 4, pp. 499–506
- DOI: 10.1049/ip-cdt:20041224
- Type: Article
Fast Fourier transform (FFT) processing is one of the key procedures in the popular orthogonal frequency division multiplexing (OFDM) communication systems. Structured pipeline architectures and low power consumption are the main concerns for its VLSI implementation. In the paper, the authors report a variable-length FFT processor design that is based on a radix-2/4/8 algorithm and a single-path delay feedback architecture. The processor can be used in various OFDM-based communication systems, such as digital audio broadcasting (DAB), digital video broadcasting-terrestrial (DVB-T), asymmetric digital subscriber loop (ADSL) and very-high-speed digital subscriber loop (VDSL). To reduce power consumption and chip area, special current-mode SRAMs are adopted to replace shift registers in the delay lines. In addition, techniques including complex multipliers containing three real multiplications, and reduced sine/cosine tables are adopted. The chip is fabricated using a 0.35 µm CMOS process and it measures 3900 µm × 5500 µm. According to the measured results, the 2048-point FFT operation can function correctly up to 45 MHz with a 3.3 V supply voltage and power consumption of 640 mW. In low-power operation, when the supply voltage is scaled down to 2.3 V, the processor consumes 176 mW when it runs at 17.8 MHz.

- Author(s): J.-H. Lee ; C. Weems ; S.-D. Kim
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 152, Issue 4, pp. 507–516
- DOI: 10.1049/ip-cdt:20045025
- Type: Article
The authors present a translation lookaside buffer (TLB) system with low power consumption for embedded processors. The proposed TLB is constructed as multiple banks, each with an associated block buffer and a corresponding comparator. Either the block buffer or the main bank is selectively accessed on the basis of two bits in the tag buffer. Dynamic power savings are achieved by reducing the number of entries accessed in parallel, as a result of using the tag buffer as a filtering mechanism. The performance overhead of the proposed TLB is negligible compared with other hierarchical TLB structures. For example, the two-cycle overhead of the proposed TLB is only ∼1%, as compared with 5% overhead for a filter (micro)-TLB and 14% overhead for a banked-TLB with block buffering. The authors show that the average hit ratios of the block buffers and the main banks of the proposed TLB are 94% and 6%, respectively. Dynamic power is reduced by ∼93% with respect to a fully associative TLB, 87% with respect to a filter-TLB and 60% relative to a banked-TLB with block buffering. Therefore, significant power savings are achieved with only a small performance degradation.

- Author(s): N. Kavvadias and S. Nikolaidis
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 152, Issue 4, pp. 517–526
- DOI: 10.1049/ip-cdt:20041187
- Type: Article
Multimedia algorithms generally consist of regular, repetitive loop constructs. The authors present a novel control-unit design for implementing such loop-intensive algorithms. The proposed architecture, termed a zero-overhead loop controller (ZOLC), exploits the regularity of computations, which is a common characteristic of multimedia algorithms, in order to support the corresponding datapaths efficiently. The ZOLC controls the operations in datapath modules by activating/deactivating their corresponding controlling FSMs. Algorithmic flow dependencies, which determine the appropriate loop sequencing, are mapped onto a look-up table (LUT). For another algorithm to execute, only the LUT contents and the FSM configurations have to be reprogrammed, assuming a generic datapath. Thus, partial reconfiguration possibilities for implementing multimedia algorithms on programmable platforms can be exploited. As proof of concept, implementations of algorithms from the multimedia domain are investigated to evaluate the performance of the proposed unit against other methods of control. Also, a full-search motion-estimation processor employing the ZOLC is synthesised. It is shown that the ZOLC provides flexibility by supporting various algorithms of the multimedia field, with performance improvements of up to 2.1× over conventional control methods.

- Author(s): Y.-Y. Chen and K.-L. Leu
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 152, Issue 4, pp. 527–536
- DOI: 10.1049/ip-cdt:20045103
- Type: Article
A new concurrent error-detection scheme monitors signatures for online detection of instruction memory and control-flow errors caused by transient and intermittent faults. The proposed signature-monitoring technique is based on grouping the column-bit information of the instructions in a block to produce the block signature. The grouping size, i.e. the number of bits in a group, affects the fault coverage: it is shown that the fault coverage of a three-bit grouping scheme is better than that of two-bit grouping and approaches 1.0. The effect of state assignment on fault coverage is also discussed, and a methodology is given for selecting, among all possible state assignments, one that guarantees near-optimal fault coverage. A software-based simulation is conducted to justify the near-optimal state assignment selected and to validate the effectiveness of the proposed techniques. The proposed schemes are implemented in VHDL, and hardware-based fault simulations running several benchmark programs verify the results obtained. Comparisons between the various schemes are conducted.
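The abstract above does not specify how the grouped column bits are compressed into the signature, so the following is only a hedged software sketch of the column-grouping idea: the parity of each bit column across a block of instruction words is computed, adjacent column parities are folded in groups of g bits (parity folding is an assumption; the paper's actual compression may differ), and a runtime recomputation is compared against the stored reference to flag errors.

```python
# Hedged sketch of a column-bit-grouping block signature, loosely following
# the idea in the Chen & Leu abstract. The per-group parity compression is
# an assumption made for illustration, not the paper's exact scheme.

def block_signature(instructions, width=32, group=3):
    """Compute a reference signature for a block of instruction words.

    Column c collects bit c of every instruction in the block; columns are
    then taken `group` at a time and each group is folded to one parity bit.
    """
    # Parity of each column: XOR of bit c across all instructions.
    col_parity = [0] * width
    for word in instructions:
        for c in range(width):
            col_parity[c] ^= (word >> c) & 1
    # Fold each run of `group` adjacent column parities into one signature bit.
    sig = 0
    for i, start in enumerate(range(0, width, group)):
        bit = 0
        for c in range(start, min(start + group, width)):
            bit ^= col_parity[c]
        sig |= bit << i
    return sig

# At run time the signature is recomputed from the fetched instructions and
# compared with the stored reference; a mismatch flags a memory or
# control-flow error. A single bit flip always flips one group's parity:
block = [0xDEADBEEF, 0x12345678, 0x0000FFFF]
ref = block_signature(block)
corrupted = [0xDEADBEEF, 0x12345678 ^ 0x4, 0x0000FFFF]  # one flipped bit
assert block_signature(corrupted) != ref
```

Note that any odd number of bit flips within one block is caught, while an even number of flips landing in the same group can alias, which is why the paper studies how the grouping size affects fault coverage.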
Articles in this issue:
Exploiting temporal loads for low latency and high bandwidth memory
Topology adaptive network-on-chip design and implementation
Focalising dynamic value prediction to CPU's context
Low-power branch target buffer for application-specific embedded processors
Efficient unified Montgomery inversion with multibit shifting
Low-power variable-length fast Fourier transform processor
Selective block buffering TLB system for embedded processors
Zero-overhead loop controller that implements multimedia algorithms
Signature-monitoring technique based on instruction-bit grouping
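The unified inversion article listed above (Savaş et al.) builds on the Montgomery inversion algorithm. As a rough, software-only illustration of that underlying algorithm — not the authors' unified, multi-word, multibit-shifting hardware — here is the classic two-phase (Kaliski-style) Montgomery inversion over GF(p); the P-192 modulus in the usage example is chosen only as a typical ECC field:

```python
# Sketch of prime-field Montgomery inversion (two-phase, Kaliski-style).
# The paper's contribution -- multibit shifting and a unified GF(p)/GF(2^n)
# multi-word datapath -- is deliberately omitted here.

def almost_mont_inverse(a, p):
    """Phase 1: return (r, k) with r = a^-1 * 2^k mod p, for odd prime p
    and 0 < a < p."""
    u, v, r, s, k = p, a, 0, 1, 0
    while v > 0:
        if u % 2 == 0:
            u, s = u // 2, 2 * s
        elif v % 2 == 0:
            v, r = v // 2, 2 * r
        elif u > v:
            u, r, s = (u - v) // 2, r + s, 2 * s
        else:
            v, s, r = (v - u) // 2, s + r, 2 * r
        k += 1
    if r >= p:
        r -= p
    return p - r, k

def mont_inverse(a, p):
    """Phase 2: strip the 2^k factor by k modular halvings -> a^-1 mod p."""
    r, k = almost_mont_inverse(a, p)
    for _ in range(k):
        r = r // 2 if r % 2 == 0 else (r + p) // 2
    return r

# Usage: invert an element of the NIST P-192 prime field (example modulus).
p = 2**192 - 2**64 - 1
a = 123456789
inv = mont_inverse(a, p)
assert (inv * a) % p == 1
```

Phase 1 uses only shifts, additions and subtractions (no trial division), which is what makes the algorithm attractive for hardware; in a Montgomery-arithmetic system the 2^k factor left by phase 1 can often be absorbed into subsequent Montgomery multiplications instead of being stripped explicitly.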