IEE Proceedings - Computers and Digital Techniques
Volume 151, Issue 2, March 2004
Simulation study of memory performance of SMP multiprocessors running a TPC-W workload
- Author(s): P. Foglia ; R. Giorgi ; C.A. Prete
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 151, Issue 2, p. 93–109
- DOI: 10.1049/ip-cdt:20040349
- Type: Article
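The false-sharing overhead discussed in the abstract below can be illustrated with a toy invalidation counter. This sketch is an illustration only, not the paper's simulation methodology; the MESI-style single-owner model, the line size and the write pattern are all assumptions.

```python
# A toy MESI-style invalidation counter illustrating false sharing:
# two processors write disjoint variables that happen to share one
# cache line, so every write invalidates the other processor's copy.
# Padding the variables onto separate lines removes the traffic.

LINE = 64  # assumed bytes per cache line

def count_invalidations(layout, writes):
    """layout: var -> byte address; writes: list of (cpu, var)."""
    owner = {}            # cache line -> cpu currently holding it Modified
    invalidations = 0
    for cpu, var in writes:
        line = layout[var] // LINE
        if owner.get(line) not in (None, cpu):
            invalidations += 1        # remote copy must be invalidated
        owner[line] = cpu
    return invalidations

# Two counters updated alternately by CPU 0 and CPU 1, 100 writes each.
writes = [(i % 2, f"x{i % 2}") for i in range(200)]
shared_line = {"x0": 0, "x1": 8}      # both vars in line 0: false sharing
padded = {"x0": 0, "x1": 64}          # one var per line: no sharing
```

With the shared layout every write after the first invalidates the other CPU's copy; with the padded layout the coherence traffic disappears entirely, which is the effect the kernel-data restructuring techniques mentioned in the abstract exploit.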
The infrastructure to support electronic commerce is one of the areas where more processing power is needed. A multiprocessor system can offer advantages for running electronic commerce applications. The memory performance of an electronic commerce server, i.e. a system running electronic commerce applications, is evaluated in the case of a shared-bus multiprocessor architecture. The software architecture of this server is based on a three-tier model, and the workloads have been set up as specified by the TPC-W benchmark. The hardware configurations are: a single SMP running tiers two and three, and two SMPs, each running a single tier. The influence of the memory subsystem on performance and scalability is analysed, and several solutions aimed at reducing the latency of memory are considered. After initial experiments, which validate the methodology, choices for cache, scheduling algorithm and coherence protocol are explored to enhance performance and scalability. As in previous studies on shared-bus multiprocessors, it was found that memory performance is highly influenced by cache parameters. As the machine is scaled, the coherence overhead weighs more and more on memory performance. False sharing in the kernel is among the main causes of this overhead. Unlike previous studies, passive sharing, i.e. the useless sharing of the private data of migrating processes, is shown to be an important factor that influences performance. This is especially true when multiprocessors with a higher number of processors are considered: an increase in the number of processors produces real benefits only if advanced techniques for reducing the coherence overhead are properly adopted. Scheduling techniques limiting process migration may reduce passive sharing, while restructuring techniques for kernel data may reduce false-sharing misses. However, even when process migration is reduced through cache-affinity techniques, standard coherence protocols such as the MESI protocol do not allow the best performance. Coherence protocols such as PSCR and AMSD produce performance benefits. PSCR, in particular, eliminates the coherence overhead due to passive sharing and minimises the number of coherence misses. The adoption of PSCR and cache-affinity scheduling allows the multiprocessor scalability to be extended to 20 processors for a 128-bit shared bus and current values of the main-memory-to-processor speed gap.

Freshness specification for a class of asynchronous communication mechanisms
- Author(s): H.R. Simpson
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 151, Issue 2, p. 110–118
- DOI: 10.1049/ip-cdt:20040071
- Type: Article
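The writer/reader protocol described in the abstract below (insert then release on the writer side, acquire then extract on the reader side) can be sketched as a toy sequential model. This is an assumed illustration of the data flow only, not Simpson's mechanism or the paper's real-time logic specification; freshness here simply means the reader acquires the most recently released item, regardless of pending insertions.

```python
# A toy single-writer, single-reader ACM model: the writer inserts and
# then releases an item; the reader acquires and then extracts. An
# item becomes visible to the reader only once it has been released.

class ToyACM:
    def __init__(self, initial):
        self._released = initial   # last value the writer released
        self._pending = None       # value inserted but not yet released
        self._acquired = initial   # value the reader last acquired

    # writer side
    def insert(self, value):
        self._pending = value      # may happen at any time
    def release(self):
        if self._pending is not None:
            self._released = self._pending
            self._pending = None

    # reader side
    def acquire(self):
        self._acquired = self._released
    def extract(self):
        return self._acquired
```

An acquire that runs before a release sees the previous value; one that runs after sees the new value. The paper's contribution is a formal specification of exactly this relationship in terms of the events delimiting the control sequences.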
The paper is concerned with a common form of asynchronous communication mechanism (ACM) that can be used to connect a single writer to a single reader, so that the intermediate data in the ACM can be updated at any time by the writer and inspected at any time by the reader, without recourse to arbitration or exclusion that would impede either party. A class of such ACMs is considered in which data is inserted into the ACM and then released by the writer, and data is acquired within the ACM and then extracted by the reader. The temporal relationship between output and input data is analysed in terms of the control sequences that release and acquire the data. Freshness of data is an essential property, and a formal real-time logic specification is derived giving the relationship between written and read values, using the events delineating the starts and ends of the control sequences. A formal specification for the independent property of data sequencing is also given. The paper derives a more precise and relevant freshness specification for these ACMs than can be achieved using the alternative atomic-register concept.

Modelling economics of DFT and DFY: a profit perspective
- Author(s): S.-K. Lu and C.-Y. Lee
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 151, Issue 2, p. 119–126
- DOI: 10.1049/ip-cdt:20040039
- Type: Article
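A profit scan in the spirit of the evaluation system described in the abstract below can be sketched as follows. The yield model (negative binomial), the defect-level model (Williams-Brown: DL = 1 − Y^(1 − T) for fault coverage T) and every cost figure here are assumptions made for the sketch, not models or values from the paper.

```python
# Illustrative profit scan: pick the fault coverage T that maximises
# profit, trading test cost against the penalty for shipped escapes.

def die_yield(area_cm2, d0, alpha=2.0):
    """Negative-binomial die yield; d0 = defect density in defects/cm^2."""
    return (1.0 + area_cm2 * d0 / alpha) ** (-alpha)

def profit(coverage, area_cm2=1.0, d0=0.5, dies_per_wafer=200,
           wafer_cost=3000.0, price=40.0, test_cost=500.0,
           return_penalty=200.0):
    Y = die_yield(area_cm2, d0)
    shipped = dies_per_wafer * Y ** coverage       # dies passing test
    defect_level = 1.0 - Y ** (1.0 - coverage)     # escapes among shipped
    revenue = shipped * price
    costs = (wafer_cost
             + test_cost * coverage ** 2           # assumed test-cost growth
             + shipped * defect_level * return_penalty)
    return revenue - costs

# Scan fault coverage to find the most profitable test plan.
best_profit, best_cov = max(
    (profit(t / 100.0), t / 100.0) for t in range(101))
```

The interesting behaviour is that neither zero coverage (many defective parts shipped) nor full coverage (maximum test cost) is optimal in general, which is the trade-off the PES is built to resolve.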
Because of the rapid increase in the complexity of VLSI circuits, a yield of 100% is virtually impossible. The problem arises naturally: how can design for testability (DFT) and design for yield (DFY) be combined so as to save money? This question must be addressed for SOC designs at an early stage of the design cycle. To address it, a profit-evaluation system (PES) for IC designers is proposed from a business perspective. The system helps designers to determine the yield and test plan when a specified quality level is given. Its inputs are the type of circuit fabric and raw manufacturing data (i.e. wafer size, wafer cost, defect density and defect distribution); its outputs are the values of yield and fault coverage that generate the maximum profit. Different yield models and cost models are selectable by the user. Experimental results show that the system can find the optimal yield and test plan for generating the maximum profit.

Residue-to-binary decoder for an enhanced moduli set
- Author(s): A. Hiasat and A. Sweidan
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 151, Issue 2, p. 127–130
- DOI: 10.1049/ip-cdt:20040033
- Type: Article
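The residue encoding and decoding for the enhanced moduli set described in the abstract below can be sketched with a plain Chinese-remainder-theorem reconstruction. This is not the paper's decoder (which uses closed-form multiplicative inverses to reach the stated hardware complexity); it is a functional reference showing that the three moduli are pairwise coprime and cover a 4n-bit dynamic range.

```python
# Residue number system round trip for the enhanced moduli set
# (2^(2n), 2^n - 1, 2^n + 1), decoded via the generic CRT.

def to_residues(x, moduli):
    """Encode integer x as its residues modulo each modulus."""
    return [x % m for m in moduli]

def from_residues(residues, moduli):
    """Decode residues back to binary (integer) form via the CRT."""
    M = 1
    for m in moduli:
        M *= m
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        # pow(Mi, -1, m): multiplicative inverse of Mi mod m (Python 3.8+)
        x += r * Mi * pow(Mi, -1, m)
    return x % M

n = 4
moduli = [2 ** (2 * n), 2 ** n - 1, 2 ** n + 1]
```

The product of the moduli is 2^(2n) (2^n − 1)(2^n + 1) = 2^(2n) (2^(2n) − 1), i.e. nearly 2^(4n), which is the 4n-bit dynamic range claimed in the abstract, against 3n bits for the popular set.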
Previous publications have given the moduli set (2^n, 2^n − 1, 2^n + 1) considerable attention; in the residue number system literature it is referred to as the popular set. However, the dynamic range of this set is limited to 3n bits. A new moduli set (2^(2n), 2^n − 1, 2^n + 1) is proposed with a dynamic range of 4n bits. This enhanced set enjoys the same features as the popular one. Also proposed are closed forms for the multiplicative inverses of the set and an algorithm for decoding the residue digits into their binary equivalent. Although it increases the dynamic range by 33%, the residue-to-binary decoder of the new set requires the same hardware and time complexity as that of the popular one.

LANG – algorithm for constructing unique input/output sequences in finite-state machines
- Author(s): I. Ahmad ; F.M. Ali ; A.S. Das
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 151, Issue 2, p. 131–140
- DOI: 10.1049/ip-cdt:20040350
- Type: Article
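What a UIO sequence is, as used in the abstract below, can be shown with a brute-force search: an input sequence whose output from a given state differs from the output it produces from every other state. The LANG algorithm in the paper builds a pruned successor tree; this naive breadth-first enumeration, with an assumed toy Mealy machine, is for illustration only.

```python
# Brute-force UIO computation for a completely specified Mealy FSM.
from itertools import product

def run(fsm, state, seq):
    """Apply input sequence seq from state; return the output string."""
    out = []
    for x in seq:
        state, o = fsm[(state, x)]
        out.append(o)
    return "".join(out)

def find_uio(fsm, s, inputs, max_len=4):
    """Shortest input sequence whose output from s is unique among states."""
    states = {q for q, _ in fsm}
    for length in range(1, max_len + 1):
        for seq in product(inputs, repeat=length):
            ref = run(fsm, s, seq)
            if all(run(fsm, q, seq) != ref for q in states if q != s):
                return "".join(seq)
    return None  # no UIO sequence up to max_len

# Toy machine: fsm[(state, input)] = (next_state, output)
fsm = {
    ("A", "0"): ("B", "0"), ("A", "1"): ("A", "1"),
    ("B", "0"): ("C", "0"), ("B", "1"): ("A", "0"),
    ("C", "0"): ("A", "1"), ("C", "1"): ("B", "0"),
}
```

For this machine, input "1" already distinguishes state A (output 1, against 0 from B and C), while B needs the length-two sequence "00"; applying a state's UIO sequence and observing its unique output verifies that the implementation really is in that state.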
Finite-state machines (FSMs) model a variety of hardware and software systems. Unique input/output (UIO) sequences are used in the generation of test sequences to verify that a machine is in an expected state, which in turn ensures system reliability. The paper presents an efficient algorithm for computing UIO sequences for completely and incompletely specified FSMs with binary inputs. Sets of states that produce identical outputs for nonconflicting inputs are partitioned by incrementally constructing a successor tree. The transitions of the FSM are modified to create a set of more specified transitions; this also results in UIO sequences being constructed for additional states. A heuristic breadth-first search of the successor tree generates shorter test sequences. In addition, classifying nodes as 'active', 'inactive' and 'dead', and applying pruning rules and termination checks while partitioning the states, reduces the search space and speeds up UIO-sequence construction. A chain-node search technique is used when the successor tree fails to construct UIO sequences for some FSMs. The proposed algorithm, designated LANG, was tested with a large number of FSMs, including the MCNC FSM benchmarks. To show its effectiveness, results are compared with those of existing UIO-sequence-computation algorithms; the proposed algorithm computed shorter UIO sequences in negligible CPU time.

Energy-delay efficient filter cache hierarchy using pattern prediction scheme
- Author(s): K. Vivekanandarajah ; T. Srikanthan ; S. Bhattacharyya
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 151, Issue 2, p. 141–146
- DOI: 10.1049/ip-cdt:20040032
- Type: Article
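The idea of a filter cache with a hit/miss predictor, as described in the abstract below, can be sketched with a toy model. The "predict hit while fetches stay in the same line" rule here only illustrates spatial-pattern prediction; it is not the paper's scheme or the NFPT predictor, and the cache geometry and trace are assumptions.

```python
# Toy direct-mapped filter cache (FC) with a simple hit/miss predictor.
# A correct hit prediction lets the fetch go to the tiny FC; a correct
# miss prediction bypasses it, so mispredictions cost time and energy.

LINE_WORDS = 4   # assumed instructions per FC line
FC_LINES = 8     # assumed FC size: 8 lines (tiny by design)

def simulate(trace):
    fc = [None] * FC_LINES                    # per-set line tags
    hits = misses = mispredicts = 0
    prev_line = None
    for pc in trace:
        line = pc // LINE_WORDS
        predicted_hit = (line == prev_line)   # spatial-pattern guess
        actual_hit = fc[line % FC_LINES] == line
        if actual_hit:
            hits += 1
        else:
            misses += 1
            fc[line % FC_LINES] = line        # refill from the main cache
        if predicted_hit != actual_hit:
            mispredicts += 1
        prev_line = line
    return hits, misses, mispredicts

# A loop-heavy stream: 20 iterations over a 12-word loop body.
trace = [pc for _ in range(20) for pc in range(12)]
```

On this trace the FC hits almost always, but the naive predictor wrongly predicts a miss at every line boundary after warm-up; reducing exactly such incorrect predictions is what the paper's pattern predictor targets.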
A filter cache (FC) is an auxiliary cache much smaller than the main cache. The FC is closest in the hierarchy to the instruction fetch unit, and it must be small to achieve energy-efficient realisations. A pattern prediction scheme is adapted to maximise energy savings in the FC hierarchy. The pattern prediction mechanism proposed relies on the spatial hit or miss pattern of the instruction access stream over previous FC line accesses. Unlike existing techniques, which make predominantly incorrect hit predictions, the proposed approach aims to minimise such mispredictions, thereby reducing the performance and power penalties associated with them. Simulation results on an extensive set of multimedia benchmarks are presented as proof of its efficacy. The prediction technique results in energy-delay savings of up to 6.8% over the NFPT predictor, which has been proposed in the past as the preferred prediction scheme for FC structures. Investigations conclusively demonstrate that the performance of the proposed prediction scheme is comparable with, and in most cases better than, that based on NFPT. Unlike NFPT, the proposed prediction technique lends itself well to efficient VLSI implementation, making it the preferred choice for energy-aware implementations.

Multiplier architectures for GF(p) and GF(2^n)
- Author(s): E. Savas ; A.F. Tenca ; M.E. Çiftçibasi ; Ç.K. Koç
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 151, Issue 2, p. 147–160
- DOI: 10.1049/ip-cdt:20040047
- Type: Article
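The Montgomery multiplication algorithm named in the abstract below can be sketched functionally in its GF(p) form. This is the textbook algorithm, not the paper's scalable word-level architecture or its precomputation technique; the integer-wide arithmetic stands in for the hardware datapath.

```python
# Montgomery multiplication over GF(p): compute a*b*R^{-1} mod p using
# only multiplications, additions and an exact division by R = 2^k,
# which in hardware is a shift.

def mont_setup(p, bits):
    R = 1 << bits                     # R = 2^k > p, gcd(R, p) = 1 (p odd)
    p_inv_neg = (-pow(p, -1, R)) % R  # -p^{-1} mod R (Python 3.8+)
    return R, p_inv_neg

def mont_mul(a_bar, b_bar, p, R, p_inv_neg):
    """Montgomery product: a_bar * b_bar * R^{-1} mod p."""
    t = a_bar * b_bar
    m = (t * p_inv_neg) % R           # makes t + m*p divisible by R
    u = (t + m * p) // R              # exact division (a shift in hardware)
    return u - p if u >= p else u     # single conditional subtraction

def modmul(a, b, p):
    """Ordinary modular product computed via the Montgomery domain."""
    R, p_inv_neg = mont_setup(p, p.bit_length())
    a_bar = (a * R) % p               # map into the Montgomery domain
    b_bar = (b * R) % p
    u = mont_mul(a_bar, b_bar, p, R, p_inv_neg)   # = a*b*R mod p
    return mont_mul(u, 1, p, R, p_inv_neg)        # map back out
```

The appeal for hardware, and for the dual-field architectures in the paper, is that the reduction step needs no trial division: the same multiply-accumulate-shift structure serves both GF(p) and, with carry-free addition, GF(2^n).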
Two new hardware architectures are proposed for performing multiplication in GF(p) and GF(2^n), the most time-consuming operation in many cryptographic applications. The architectures provide very fast and efficient execution of multiplication in both GF(p) and GF(2^n), and are mainly intended for elliptic curve cryptography. Both architectures are scalable and can therefore handle operands of any size; they can be configured to the available area and/or desired performance. The algorithm implemented in the architectures is the Montgomery multiplication algorithm, which has proved very efficient in both fields. The first architecture utilises a precomputation technique that reduces the critical path delay at the expense of extra logic, which has a limited negative impact on the silicon area for operand precisions of cryptographic interest. The second architecture computes multiplication faster in GF(2^n) than in GF(p), which conforms with the premise that GF(2^n) is better suited to hardware realisations. Both architectures provide new alternatives that offer faster computation of multiplication and useful features.

Algorithm and architecture for a high density, low power scalar product macrocell
- Author(s): J. Gu ; C.-H. Chang ; K.-S. Yeo
- Source: IEE Proceedings - Computers and Digital Techniques, Volume 151, Issue 2, p. 161–172
- DOI: 10.1049/ip-cdt:20040328
- Type: Article
The authors present a design approach for an arithmetic macrocell that computes the scalar product of two vectors, an operation ubiquitous in the solution of many communications and digital signal processing problems. The core of the proposed architecture is a fully combinational design containing a partial product generator, a partial product accumulator and a vector accumulator. The design addresses the competing optimisation goals of VLSI area, power dissipation and latency in the deep-submicron regime. Compared with conventional merged arithmetic architectures, the proposed macrocell represents a substantial improvement in the VLSI layout, with little area wastage, a high degree of regularity and good scalability for different vector lengths and operand widths. A theoretical analysis shows that the design of a 16-bit scalar product multiplier for input vectors with 16 elements, in comparison with a traditionally designed architecture, achieves a saving of 38.6% in silicon area, an increase of up to 73% in area usage efficiency and a saving of 29.4% in interconnect delay. Post-layout simulations of the proposed circuit, based on a 0.18 µm CMOS process, show an average power dissipation of 64.96 mW and a latency of 6.92 ns at a standard supply voltage of 1.8 V, a superior performance for a single-cycle instruction in a high-speed, low-voltage 16-bit digital signal processor operating at 144 MHz. The use of shorter interconnects and more equalised interconnect delays substantially reduces the power dissipation and delay incurred by the interconnects. Post-layout simulation of the proposed circuit at supply voltages ranging from 0.7 to 3.3 V shows a significant power reduction of 6 to 13% over the pre-layout simulation results of the conventional design.
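The operation the macrocell above computes can be stated as a short functional model. This is only the arithmetic specification, not the VLSI design; the signed 16-bit operand range and the exact (unrounded) wide accumulation mirror what a merged partial-product/vector accumulator produces.

```python
# Functional model of the macrocell's operation: the scalar product of
# two equal-length vectors of signed fixed-point words, accumulated
# exactly in a single wide sum, as merged arithmetic does in hardware.

def scalar_product(xs, ys, width=16):
    """Dot product of two equal-length vectors of signed `width`-bit words."""
    assert len(xs) == len(ys)
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    assert all(lo <= v <= hi for v in xs + ys)
    # All partial products are summed without intermediate rounding.
    return sum(x * y for x, y in zip(xs, ys))
```

For 16 elements of 16-bit operands the exact result needs up to 36 bits, which is why the hardware merges partial-product generation and accumulation rather than rounding between per-element multiplies.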