New Publications are available for Parallel programming and algorithm theory
http://dl-live.theiet.org
Please follow the links to view the publication.

A new parallel collision detection algorithm based on mixed BVH and symmetry breaking
http://dl-live.theiet.org/content/conferences/10.1049/cp_20080920
Collision detection is a key technology in virtual reality, and both its speed and precision are important. In this paper, we present a new parallel collision detection algorithm based on symmetry breaking and divide-and-conquer techniques. First, we combine the merits of AABB bounding boxes and bounding spheres to construct a hybrid bounding representation of arbitrary non-convex polyhedra (S-AABB) for speed; in particular, the S-AABB is balanced using divide-and-conquer techniques. Then we apply symmetry breaking (coloring), which is also important in parallel algorithms, to partition the work into categories and assign them to different processors; multi-threading is used on multi-processor computers. Finally, experimental results show that our algorithm outperforms other current typical collision detection algorithms such as I-COLLIDE [Cohen J. et al., 1995], and so meets the real-time and accuracy requirements of complex interactive virtual environments.

New parallelization scheme for simulated annealing
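The two bounding-volume primitives that the hybrid S-AABB representation combines both admit constant-time overlap tests. A minimal sketch of the two tests (illustrative only; the function names and the S-AABB tree construction itself are not from the paper):

```python
def aabb_overlap(min_a, max_a, min_b, max_b):
    """Axis-aligned bounding boxes overlap iff their intervals overlap on every axis."""
    return all(lo_a <= hi_b and lo_b <= hi_a
               for lo_a, hi_a, lo_b, hi_b in zip(min_a, max_a, min_b, max_b))

def spheres_overlap(centre_a, radius_a, centre_b, radius_b):
    """Bounding spheres overlap iff the centre distance is at most the sum of radii."""
    d2 = sum((x - y) ** 2 for x, y in zip(centre_a, centre_b))
    return d2 <= (radius_a + radius_b) ** 2
```

A hybrid scheme typically uses the cheaper sphere test to reject pairs early and the tighter AABB test to confirm candidates.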
http://dl-live.theiet.org/content/conferences/10.1049/ic_20070689
In this paper, an approach to parallel simulated annealing, called New Parallel Simulated Annealing (NPSA), is proposed. The approach features coarse-grained parallelization aimed at message-passing systems such as clusters, and incorporates heuristics such as adaptive clustering with SA to make local search more efficient. Through experiments on various optimization problems and comparison with existing schemes, it is shown that NPSA is a powerful general-purpose optimization method. It can also serve as a framework for meta-heuristics, broadening their application.

WP4PI: a proof system based on the weakest preconditions for the applied π-calculus
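For readers unfamiliar with the sequential baseline that NPSA parallelises, a minimal serial annealing loop looks roughly as follows (a generic textbook sketch; NPSA's adaptive clustering and message-passing coordination are omitted, and all names are illustrative):

```python
import math
import random

def simulated_annealing(cost, neighbour, x0, t0=1.0, cooling=0.95, steps=1000):
    """Minimal serial SA loop. A coarse-grained parallel scheme runs many
    such searches on cluster nodes and exchanges good solutions."""
    x, fx = x0, cost(x0)
    best, fbest = x, fx
    t = t0
    for _ in range(steps):
        y = neighbour(x)
        fy = cost(y)
        # Accept downhill moves always; uphill moves with Boltzmann probability.
        if fy <= fx or random.random() < math.exp((fx - fy) / t):
            x, fx = y, fy
            if fx < fbest:
                best, fbest = x, fx
        t *= cooling  # geometric cooling schedule
    return best, fbest
```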
http://dl-live.theiet.org/content/conferences/10.1049/cp_20070750
The π-calculus provides a simple but powerful formal basis for specification and verification of parallel and distributed systems with evolving structures. Applied π-calculus is a general extension of the π-calculus with value passing, primitive functions and equations among terms. In this paper, we present WP4PI, a proof system based on the weakest preconditions for the applied π-calculus. An algorithm is developed to compute the fixpoint of the weakest precondition for a recursive process, and a balanced binary tree structure is used to deal with the complexity introduced by the conditional construct of the applied π-calculus. Primitive functions in the applied π-calculus are handled by equipping the proof system with an extensible equation sub-system. Based on the weakest preconditions and Hoare's logic, WP4PI is a target-oriented proof system and captures the term substitution property of the applied π-calculus.

A framework for model-checking timed CSP
http://dl-live.theiet.org/content/conferences/10.1049/ic_19990011
Timed CSP is a well-known process algebra, built as an extension to Hoare's original CSP, designed to handle concurrency combined with timing considerations. It achieves this over a continuous time domain (the non-negative real numbers), which has the drawback of precluding standard model-checking approaches, as the state-space of any process is naturally a priori (uncountably) infinite. This paper shows how to circumvent this problem by translating and reinterpreting timed CSP processes within a new model of standard CSP. In this discrete model, which draws on previous work by A.W. Roscoe (1997) and A. Mukkaram (1993), timing of events is provided by the consistent and regular communication of a special tock event, analogous to the `tick' of a clock. The various parallel components of a process are therefore required to synchronise on tock, ensuring a uniform rate of passage of time. General results yielding tight bounds on the loss of information inherent to the translation are given. (4 pages)

A one pass parallel hexagonal thinning algorithm
http://dl-live.theiet.org/content/conferences/10.1049/cp_19990443
The development of a fully parallel hexagonal thinning algorithm is reported and compared with a similar algorithm for use with rectangularly sampled images. Both algorithms produced good skeletons, but the hexagonal algorithm could be implemented with only 50% of the logical operations required by the rectangular one.

Parallel genetic algorithms for optimised fuzzy modelling with application to a fermentation process
http://dl-live.theiet.org/content/conferences/10.1049/cp_19971167
This paper reports the construction and application of an evolution program to a computational intelligence system used as a software `sensor' for state estimation and prediction of biomass concentration in a fermentation process. A fuzzy logic system (FLS) is used as a computational engine to `infer' the production of biomass from variables easily measured on-line. For this purpose, genetic algorithms (GAs) are employed to train and tune the desired parameters of the fuzzy logic system. It is shown that the fuzzy logic system, tuned by two genetic algorithms implemented in parallel, produces better results in predicting biomass concentration. The mean sum of squared errors and graphical fit are used to compare the performance of the genetically optimised FLS with an artificial neural network (ANN) trained using the Levenberg-Marquardt second-order nonlinear optimisation method.

Direct design of uniform LTI controllers from plant I/O data using a parallel evolutionary algorithm
http://dl-live.theiet.org/content/conferences/10.1049/cp_19960633
This paper attempts to revolutionise control system design by unifying all linear time invariant (LTI) approaches in both the time and frequency domains under performance satisfaction. The design is automated by efficient evolution from plant step response data, bypassing the system identification stage. The underlying aim is for a control engineer to obtain an “off-the-computer” controller by feeding the developed CACSD (computer-aided control system design) system with plant I/O data and customer specifications. Validations against linear and nonlinear plants are convincing: the controller evolved from I/O data of an internally nonlinear plant performed better than one designed from an identified model.

Structured parallel design for embedded vision systems: an application case study
http://dl-live.theiet.org/content/conferences/10.1049/cp_19950752
Presents a case study showing how a top-down structured generic parallel design method can be successfully applied to a typical multi-level computer vision algorithm. The case study involves a written postal address recognition system. The address verification is discussed in particular. Incrementally controllable speedups of up to 15 for the complete application were obtained using up to 29 processors. Although this speedup is insufficient to achieve the required real-time performance of processing 10 envelope images/second, it would be straightforward to apply the same parallel design model to current generation parallel processors such as the TMS320C40 or T9000 which have at least 10 times the processing power and communications bandwidth of the T800 transputers used as processing elements in the present case study. The range of algorithms utilised in the case study, and the flexibility of the parallel solutions generated, lead the authors to believe that the design method is applicable to a wide range of embedded vision applications.

Implementing mathematical morphology in ISETL-LINDA
http://dl-live.theiet.org/content/conferences/10.1049/cp_19950780
Presents the description of a parallel implementation of mathematical morphology using the LINDA model of parallel communication embedded within ISETL. The morphological operation of dilation of binary images is considered in detail, with an ISETL implementation given initially, then a description of how this was converted to run in a parallel version of ISETL, ISETL-Linda.

Algorithmic engineering as a teaching aid for DSP
http://dl-live.theiet.org/content/conferences/10.1049/ic_19950214
The concept of algorithmic engineering was introduced by McWhirter (see IEE Proc.-E, vol.139, no.3, June 1992) in the context of adaptive signal processing. In effect, it constitutes a simple but rigorous diagrammatic approach to the design of parallel algorithms and architectures for digital signal processing. For the purposes of this design methodology, a parallel algorithm is represented in terms of its basic signal flow graph (SFG). In the context of teaching DSP, this could be of particular advantage. In the article, the potential use of algorithmic engineering as a design and teaching aid is illustrated by means of two particularly relevant examples. The first relates to the FFT. The second concerns the least squares lattice algorithm for linear prediction and adaptive filtering. It is shown how each can be derived, and hence taught, by starting with the SFG for a more basic algorithm (the DFT and recursive least squares algorithms respectively) and then applying a sequence of simple diagrammatic transformations. The least squares lattice derivation starts with the SFG for a multichannel least squares adaptive filter based on QR decomposition. (3 pages)

The evolution of morphology using Kauffman nets
http://dl-live.theiet.org/content/conferences/10.1049/cp_19951059
Kauffman has proposed a theory of cell differentiation using random Boolean nets as a model of genetic action. This scheme is used as the basis of a developmental representation in the context of genetic algorithms. A number of simple linear morphologies are evolved using this system.

A parallel implementation of evolutionary divide and conquer for the TSP
http://dl-live.theiet.org/content/conferences/10.1049/cp_19951098
Evolutionary divide and conquer (EDAC), applied to the geometric travelling salesman problem, uses a genetic algorithm to explore the space of problem subdivisions. The underlying techniques for subdivision and patching are based on the cellular dissection algorithms of Richard Karp (1977), but the addition of new repair heuristics, as well as the genetic algorithm, appears to have succeeded in lifting the quality of solutions well above Karp's algorithms, whilst maintaining almost linear scaling of the algorithm with increased problem size. In this paper we outline a parallel implementation of EDAC and present some preliminary results of this work.

Parallel genetic algorithms for optimizing morphological filters
http://dl-live.theiet.org/content/conferences/10.1049/cp_19950762
Several design methods for parallel genetic algorithms (PGAs) optimizing morphological filters are discussed and an optimal scale and machine independent PGA for a loosely coupled, homogeneous or inhomogeneous multiprocessor computer is developed. The optimization is made in terms of computation speed, parallelization efficiency and quality. Quality means, in this context, the fitness of the best chromosome, which is an objective measure for the performance of the corresponding morphological filter in a particular environment.

Multilevel distributed genetic algorithms
http://dl-live.theiet.org/content/conferences/10.1049/cp_19951099
Some problems are very difficult to solve by mathematical programming approaches. A genetic algorithm (GA) is an extremely powerful optimization technique that could be used to solve such problems, but its efficiency is dependent on its ability to do a large number of evaluations in a reasonable amount of time. A classical GA contains three basic operators: reproduction, crossover and mutation. To increase the efficiency of a genetic algorithm, the influence of migration in a multilevel distributed GA (MDGA) was tested. Several different structures of PC computers connected in a local area network (LAN) were used for the MDGAs. MDGAs use the power of the computers better than one-level distributed GAs. The problem of communication between the computers in the MDGAs was dealt with in two different ways, with files on a server or by sending packets.

The block regularised linear quadratic optimal controller
http://dl-live.theiet.org/content/conferences/10.1049/cp_19940317
We have presented an LQ adaptive controller which, using a pipelined parallel implementation, has a data sampling and control action generation interval of O(1), and a control-law readjustment interval of N * O(1), where N is the length of the horizon used for the LQ control computation. This controller has a number of important properties. The regularised identification procedure is robust because it prevents the covariance of the estimates from becoming reduced-rank and causing the estimator to "blow up", and also limits the estimates from varying beyond a "safe" region. The identification is implemented on O(n²) processing cells and produces an estimate every O(n) time intervals. The computation of the LQ control law also uses O(n²) processing cells and has a speed of N * O(1), which is independent of the system order. This is an order of magnitude faster than that previously proposed and has the added advantage of no feedback loops on the systolic array. The computation of the control law allows the setpoints to be nonzero and also time-varying, which is important for practical applications. The penalties can also be time-varying. Lastly, the computation required to implement the control law is very simple and has a speed of O(1) on O(n) processing elements.

Fast parallel algorithms for predictive control
http://dl-live.theiet.org/content/conferences/10.1049/cp_19940316
It is shown that the generalized predictive control (GPC) law for an n-th order plant and N-steps prediction horizon can be computed in O(Nn) processing time on a serial processor and in O(N) time on a linear array of O(n) processors.

Distributed RANSAC for the robust estimation of three-dimensional reconstruction
http://dl-live.theiet.org/content/journals/10.1049/iet-cvi.2010.0223
Many low- or middle-level three-dimensional reconstruction algorithms involve a robust estimation and selection step whereby parameters of the best model are estimated and inliers fitting this model are selected. The RANSAC (RANdom SAmple consensus) algorithm is the most widely used robust algorithm for this task. A new version of RANSAC, called distributed RANSAC (D-RANSAC), is proposed, to save computation time and improve accuracy. The authors compare their results with those of classical RANSAC and randomised RANSAC (R-RANSAC). Experiments show that D-RANSAC is superior to RANSAC and R-RANSAC in computational complexity and accuracy in most cases, particularly when the inlier proportion is below 65%.

Design methodology for construction of asynchronous pipelines with Handel-C
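The hypothesise-and-verify loop on which both classical RANSAC and its distributed variant are built can be sketched for the simplest case, a 2-D line fit (a generic textbook sketch, not the paper's D-RANSAC; all names are illustrative):

```python
import random

def ransac_line(points, iters=200, tol=0.1, seed=0):
    """Classical RANSAC for a 2-D line y = a*x + b: repeatedly fit a model to a
    minimal random sample and keep the model with the most inliers.
    A distributed variant would evaluate these hypotheses on separate nodes."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iters):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical sample: skip, cannot express as y = a*x + b
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) <= tol]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers
```

In practice the winning model is refitted to all its inliers by least squares; that refinement step is omitted here.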
http://dl-live.theiet.org/content/journals/10.1049/ip-sen_20030206
CSP channels are proposed as a means of developing high-level, asynchronous pipeline architectures over and above existing synchronous logic. Channel-based design allows hardware systems to be designed and constructed using top-down software engineering methods, which have not previously been available within hardware–software co-design. The intention is to enhance support for future large-scale co-designs. The design methodology and its performance implications are demonstrated through an exemplar, pipelined design of the Karhunen–Loève Transform (KLT) algorithm, implemented using the Handel-C silicon compiler applied to dense FPGAs.

View-centric reasoning for Linda and Tuple Space computation
http://dl-live.theiet.org/content/journals/10.1049/ip-sen_20030129
In contrast to sequential computation, concurrent computation gives rise to parallel events. Efforts to translate the history of concurrent computations into sequential event traces result in the potential uncertainty of the observed order of these events. Loosely coupled distributed systems complicate this uncertainty even further by introducing the element of multiple imperfect observers of these parallel events. Properties of such systems are difficult to reason about and, in some cases, attempts to prove safety or liveness lead to ambiguities. The authors present a survey of challenges of reasoning about properties of concurrent systems. They then propose a new approach, view-centric reasoning, that avoids the problem of translating concurrency into a sequential representation. Finally, they demonstrate the usefulness of view-centric reasoning as a framework for disambiguating the meaning of Tuple Space predicate operations, versions of which exist commercially in IBM's T Spaces and Sun's JavaSpaces.

CSP extended: imperative state and true concurrency
http://dl-live.theiet.org/content/journals/10.1049/ip-sen_20030133
The first algebra is an extension of CSP, and ℋ is in turn an extension of it which captures the semantics of hardware compilation. Because ℋ is a superset of CSP, it can describe both hardware and software and so is useful for co-design. The extensions beyond CSP include: true concurrency; new hardware constructors; and a simple and natural way to represent an imperative state. Both algebras were invented to cope with problems that arose while the author was trying to prove that the hardware that he had designed correctly implemented channels between a processor and an FPGA. Standard CSP did not capture priority, yet the circuits in the FPGA and the occam processes in the processor both depended on priority for their correctness. The current state of ℋ is reported and attention is focused on handling the imperative state and true concurrency. The acceptance denotational semantics is described briefly.

Supporting architectural concerns in component interoperability standards
http://dl-live.theiet.org/content/journals/10.1049/ip-sen_20000913
There has been considerable work in industry on the development of component-interoperability models, such as COM, CORBA and JavaBeans. These models are intended to reduce the complexity of software development and to facilitate reuse of off-the-shelf components. The focus of these models is syntactic interface specification, component packaging, intercomponent communication, and bindings to a runtime environment. What these models lack is a consideration of architectural concerns—specifying systems of communicating components, explicitly representing loci of component interaction, and exploiting architectural styles that provide well understood global design solutions. The work described involves introducing support for architectural concerns in component models, particularly studying techniques to support notions of architectural style and explicit connectors. The JavaBeans component model has been enhanced to support component composition according to the C2 architectural style. The approach enables the design and development of applications in the C2 style using off-the-shelf Java components or `beans' that are available to the designer. The techniques underlying the approach are described, along with a composition environment called `ARABICA' that embodies these techniques. A number of important issues that must be addressed when extending component standards to support architectural concerns are identified.

Sequential decoding of convolutional codes by a compressed multiple queue algorithm
http://dl-live.theiet.org/content/journals/10.1049/ip-com_19941281
The conventional multiple stack algorithm (MSA) is an efficient approach for solving erasure problems in sequential decoding. However, the requirements of multiple stacks and large memory make its implementation difficult. Furthermore, the MSA allows only one stack to be in use at a time: the other stacks stay idle until the process in that stack is terminated. Thus it seems difficult to implement the MSA with parallel processing technology. A two-stack scheme is proposed to achieve similar effects to the MSA. The scheme greatly reduces the loading for data transfer and the I/O complexity required in the MSA, and makes parallel processing possible. An erasure-free sequential decoding algorithm for convolutional codes, the compressed multiple-queue algorithm (CMQA), is introduced, based on systolic priority queue technology, which can reorder the path metrics in a short and constant time. The decoding speed will therefore be much faster than in traditional sequential decoders using sorting methods. In the CMQA, a systolic priority queue is divided into two queues by adding control signals, thereby simplifying implementation. Computer simulations show that the CMQA outperforms the MSA in bit error rate, with about one-third the memory requirement of the MSA.

Extending CSP: denotational semantics
http://dl-live.theiet.org/content/journals/10.1049/ip-sen_20030128
Modern CSP is an algebra which describes processes which engage in events. Its characteristic comprehensiveness combined with simplicity arises in large part from the fact that it was originally defined by a denotational semantics. This sets it apart from other process algebras. Part of the simplicity of CSP arises from its high level of abstraction. In particular, conventional CSP abstracts from priority. CSP practitioners occasionally find this awkward, particularly when describing real time systems and in hardware compilation. An extension of CSP is introduced which includes priority without compromising simplicity or abstraction. Furthermore, the extension is built by way of a denotational semantics which is in full sympathy with the spirit of CSP. An unexpected merit of the new semantics is that it solves completely some long standing difficulties in describing infinite behaviour with the conventional denotational semantic models for CSP. Furthermore it provides elegant and simple treatments of termination and divergence in contrast to earlier rather awkward approaches.

One-pass parallel hexagonal thinning algorithm
http://dl-live.theiet.org/content/journals/10.1049/ip-vis_20010076
A comparison between two fully parallel thinning algorithms designed for images sampled on the square and hexagonal grids is reported. Using techniques from mathematical morphology, a hexagonal algorithm has been designed to closely match the operation of a well known square grid algorithm. Proofs of the connectivity and single pixel limb width of the resulting converged hexagonal skeleton have been presented. Implementations of both algorithms were found to produce accurate skeletons, but the hexagonal could be implemented with only 50% of the logical operations required by the square.

Introducing monitoring events to timed-CSP
http://dl-live.theiet.org/content/journals/10.1049/ip-sen_20010244
Timed-CSP is a process algebra designed to help in the modelling and analysis of real-time concurrent systems. Timed-CSP can handle synchronisation events, which require the co-operation of all the interested parties, and broadcasting events, which do not require any cooperation from the environment. The paper argues that there are still scenarios that cannot be adequately modelled in timed-CSP, and proposes an extension that would allow the modelling of more advanced communication mechanisms, such as multicasting.

Fail-aware datagram service
http://dl-live.theiet.org/content/journals/10.1049/ip-sen_19990400
In distributed real-time systems it is often useful for a process p to know that another process q will not use a certain piece of information that p has sent to q beyond a certain deadline. If p can learn about the occurrence of the deadline by simply measuring the passage of time on its own local clock, we call this kind of information exchange `communication by time'. It is shown that communication by time is possible in systems where there exists no a priori known upper bound on the transmission delay of messages and where clocks are not synchronised. It is sufficient if one can compute an a posteriori upper bound on the transmission delay of a message m, i.e. at the time when m is delivered. The authors show how one can compute an a posteriori upper bound on the one-way message transmission delay of a message even if the clocks of the sender and receiver process are not synchronised. The method is used to design a fail-aware datagram service. This service supports communication by time by delivering all messages whose computed one-way transmission delays are smaller than a given bound as `fast' and all other messages as `slow'. The properties of this service are specified and an efficient implementation is provided for it.
To illustrate how this service supports communication by time, a leader election protocol that guarantees the existence of at most one leader at any real time is sketched, and it is shown how this allows the detection of out-of-date sensor information in process control applications.

Graphical modelling language for specifying concurrency based on CSP
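The a posteriori delay bound underlying such a service can be illustrated with a round trip: if a request is sent at local time s, its reply arrives at local time r, and the remote process reports how long it held the request, then delay(req) + delay(reply) = (r - s) - hold, so the reply's one-way delay is bounded without synchronised clocks. A minimal sketch (illustrative names only; clock drift, which the paper's service accounts for, is ignored here):

```python
def one_way_delay_bound(send_req, recv_reply, remote_hold, min_delay=0.0):
    """Upper bound on the reply's one-way delay, from the sender's clock only:
    delay(reply) = rtt - hold - delay(req) <= rtt - hold - min_delay."""
    rtt = recv_reply - send_req
    return rtt - remote_hold - min_delay

def classify(bound, delta):
    """Fail-aware delivery: `fast' if the computed bound meets the target
    delta, otherwise `slow'."""
    return 'fast' if bound <= delta else 'slow'
```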
http://dl-live.theiet.org/content/journals/10.1049/ip-sen_20030132
A graphical modelling language for specifying concurrency in software designs is presented. The language notations are derived from the communicating sequential process (CSP) language and the resulting designs form CSP diagrams. The notations reflect both data-flow and control-flow aspects of concurrent software architectures. These designs can automatically be described by CSP algebraic expressions that can be used for formal analysis. The designer does not have to be aware of the underlying mathematics. The techniques and rules presented provide guidance to the development of concurrent software architectures. One can detect and reason about compositional conflicts (errors in design), potential deadlocks (errors at run-time), and priority inversion problems (performance burden) at a high level of abstraction. The CSP diagram collaborates with object-oriented modelling languages and structured methods.

Parallel matching algorithm for CIOQ switches with multiple traffic classes
http://dl-live.theiet.org/content/journals/10.1049/ip-com_20030600
It has recently been shown that a combined input and output queued (CIOQ) switch with a speed-up factor of 2 can exactly emulate an output-queued (OQ) switch. The main benefit of using a CIOQ switch is to reduce the memory bandwidth requirement while providing quality of service (QoS) guarantees. The key component for exact emulation is a matching algorithm for bipartite graphs. For example, a CIOQ switch with a speed-up factor of 2, which adopts the least cushion first/most urgent first (LCF/MUF) matching algorithm, can exactly emulate an OQ switch with any arbitrary service discipline. However, the complexities of cushion calculation and cell reordering required at the output ports make the algorithm very difficult to realise in a high-speed switch. The authors propose an approximate LCF/MUF algorithm and evaluate its performance for the weighted round-robin service discipline. For ease of implementation, the proposed algorithm calculates approximate cushions and does not perform reordering at the output ports. The trade-off is that it loses the property of exact emulation. It was found, via computer simulations, that the performance of a CIOQ switch with the proposed single-iteration matching algorithm is close to that of an OQ switch under uniform, nonuniform, and correlated input traffic models for offered load up to 0.9. In addition, the cell departure order can be maintained under the single-iteration algorithm.

Prioritised dynamic communicating and mobile processes
http://dl-live.theiet.org/content/journals/10.1049/ip-sen_20030182
Continuing research on language design, compilation and kernel support for highly dynamic concurrent reactive systems is reported. The work extends the occam multiprocessing language, which is both sufficiently small to allow for easy experimentation and sufficiently powerful to yield results that are directly applicable to a wide range of industrial and commercial practice. Classical occam was designed for embedded systems and enforced a number of constraints, such as statically predetermined memory allocation and concurrency limits, that were relevant to that generation of application and hardware technology. This work removes most of these constraints and introduces a number of new facilities: explicit channel ends, channel bundles, mobile ends of channels and bundles, dynamic process creation, the extended rendezvous and process priorities. These significantly broaden occam's field of application and raise the level of concurrent system design directly supported. Concurrency overheads have been driven ever downwards; for example, synchronising channel communication now takes around 100 nanoseconds on an 800 MHz P3, and most operations have unit time cost. Finally, a proposal for secure mobile processes is made.

The Honeysuckle programming language: an overview
http://dl-live.theiet.org/content/journals/10.1049/ip-sen_20030130
A language for programming systems with Communicating Process Architecture is introduced which builds on the success of occam. The principal innovations are presented, such as a priori deadlock-freedom, a transfer primitive, which conveys object ownership, and an alternation construct. The latter affords direct expression of prioritised reactive (event-driven) behaviour, including exception response. Honeysuckle also offers source-code modularity, object encapsulation, and the recursive definition of both object and process. Despite such ambition, a primary aim has been to retain simplicity in abstraction, expression and implementation.

Wait-free cache-affinity thread scheduling
http://dl-live.theiet.org/content/journals/10.1049/ip-sen_20030127
Cache utilisation is often very poor in multithreaded applications, due to the loss of data access locality incurred by frequent context switching. This problem is compounded on shared-memory multiprocessors when dynamic load balancing is introduced, as thread migration disrupts cache content. Batching, a technique for reducing the negative impact of fine-grained multithreading on cache performance, is introduced to mitigate this problem. Moreover, the related issue of contention for shared data within a thread scheduler for shared-memory multiprocessors is considered. In this regard, wait-free techniques are applied in lieu of conventional lock-based methods for accessing internal scheduler structures, alleviating to some extent serialisation and hence the degree of contention. Prototype schedulers which make use of the above ideas are described, and finally experimental results which illustrate the observed improvements are presented.

Analysis and evaluation of distributed checkpoint algorithms to avoid rollback propagation
http://dl-live.theiet.org/content/journals/10.1049/ip-sen_19982442
Checkpointing is a very well known mechanism for achieving fault tolerance. In distributed applications where processes can checkpoint independently of each other, a local checkpoint is useful for fault-tolerance purposes only if it belongs to at least one consistent global checkpoint; in that case, execution can be restarted from it without needing to roll back past it. The paper exploits a theoretical framework that facilitates the definition and analysis of distributed checkpoint algorithms which avoid rollback propagation. Several distributed algorithms are presented that avoid rollback propagation by forcing additional checkpoints in processes. The effectiveness of the algorithms is evaluated in several testbed applications, showing their limited capability of bounding the number of additional checkpoints.Predicate transformers in the semantics of <i xmlns="http://pub2web.metastore.ingenta.com/ns/">Circus</i>
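The standard consistency condition behind the abstract above — a global checkpoint is consistent if it records no "orphan" message, i.e. no message received at or before the receiver's checkpoint but sent after the sender's — can be sketched directly (the event-index representation here is a hypothetical simplification, not the paper's framework):

```python
def is_consistent(ckpt, messages):
    """Check a global checkpoint for orphan messages.

    ckpt:     {process_id: event index of that process's local checkpoint}
    messages: iterable of (sender, send_event, receiver, recv_event)

    An orphan message is one recorded as received at or before the
    receiver's checkpoint but sent after the sender's checkpoint.
    """
    return not any(
        send_ev > ckpt[snd] and recv_ev <= ckpt[rcv]
        for snd, send_ev, rcv, recv_ev in messages
    )
```

For example, a message sent at event 2 of process 0 and received at event 1 of process 1 makes the checkpoint set {0: 1, 1: 1} inconsistent (the receive is recorded but not the send), while {0: 2, 1: 1} is consistent.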
http://dl-live.theiet.org/content/journals/10.1049/ip-sen_20030131
<i xmlns="http://pub2web.metastore.ingenta.com/ns/">Circus</i> is a combination of Z and CSP; its chief distinguishing feature is the inclusion of the ideas of the refinement calculus. The main objective is the definition of refinement methods for concurrent programs. The original semantic model for <i xmlns="http://pub2web.metastore.ingenta.com/ns/">Circus</i> is Hoare and He's unifying theories of programming. An equivalent semantics based on predicate transformers is presented. With this model, a more adequate basis for the formalisation of refinement and verification-condition generation rules is provided. Furthermore, this framework makes it possible to include logical variables and angelic nondeterminism in <i xmlns="http://pub2web.metastore.ingenta.com/ns/">Circus</i>. The consistency of the relational and predicate transformer models gives us confidence in their accuracy.Parallel-decomposition algorithm for discrete computational problems and its application in developing an efficient discrete convolution algorithm
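The predicate-transformer idea underlying the abstract above can be illustrated generically (this is plain weakest-precondition semantics for toy assignment and sequencing, not the Circus semantics): a command denotes a function from postconditions to preconditions.

```python
def wp_assign(var, expr):
    """wp(var := expr, Q) holds in state s iff Q holds after the update:
    substitute expr for var in the postcondition."""
    def transformer(post):
        return lambda s: post({**s, var: expr(s)})
    return transformer

def wp_seq(t1, t2):
    """wp(c1; c2, Q) = wp(c1, wp(c2, Q)) — compose the transformers."""
    return lambda post: t1(t2(post))
```

For instance, the weakest precondition of `x := x + 1; x := x + 1` with respect to `x > 2` is `x + 2 > 2`, which holds in the state `{"x": 1}` but not in `{"x": 0}`.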
http://dl-live.theiet.org/content/journals/10.1049/ip-vis_19951680
The authors present a parallel-decomposition algorithm for discrete computational problems. An efficient convolution algorithm is developed under the proposed decomposition algorithm. The decomposition operation is based on integer modular arithmetic. Congruent sets are generated by an integer modular operation over the index set of the problem and constitute a partition. The partition is used to decompose the problem into several small subproblems. The partition under the integer modular operation is unique, and complete reconstruction of the problem is guaranteed by this uniqueness. Since the algorithm is founded on basic algebraic systems, i.e. number and set theory, it is highly problem-independent and provides a uniform approach to parallel decomposition of general problems on the discrete domain. Because the subproblems are highly regular, the algorithm is suitable for VLSI hardware implementation. The computational complexity of the developed convolution algorithm is reduced by a factor of p<sup xmlns="http://pub2web.metastore.ingenta.com/ns/">2</sup>, and (p<sup xmlns="http://pub2web.metastore.ingenta.com/ns/">2</sup>)<sup xmlns="http://pub2web.metastore.ingenta.com/ns/">n</sup> for n-dimensional problems (p can be any common factor of the input sequence lengths). Compared to the popular FFT methods, where round-off error is introduced, the convolution result from the algorithm is exact. In addition, it does not place restrictions on the length of the input sequences, as the well known block convolution algorithm does. As a result, the algorithm is highly efficient for convolution of large input sequences or correlation operations.Generalised approach to parallelising image sequence coding algorithms
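The residue-class decomposition can be sketched for one-dimensional convolution (a minimal illustration, not the authors' VLSI formulation): input indices k ≡ i and l ≡ j (mod p) only contribute to output indices n = k + l ≡ i + j (mod p), so the problem splits into p² independent subconvolutions that could each run on its own processor.

```python
def full_conv(x, h):
    """Direct linear convolution, for reference."""
    y = [0] * (len(x) + len(h) - 1)
    for k, xv in enumerate(x):
        for l, hv in enumerate(h):
            y[k + l] += xv * hv
    return y

def polyphase_conv(x, h, p):
    """Convolution via residue classes mod p: the subsequences
    x_i[m] = x[m*p + i] and h_j[m] = h[m*p + j] are convolved pairwise,
    and the subresult for (i, j) lands at output indices n = s*p + i + j."""
    y = [0] * (len(x) + len(h) - 1)
    for i in range(p):
        xi = x[i::p]
        for j in range(p):
            hj = h[j::p]
            if not xi or not hj:
                continue                      # empty residue class
            for s, v in enumerate(full_conv(xi, hj)):
                y[s * p + i + j] += v         # always within bounds
    return y
```

The p² inner calls to `full_conv` are independent, which is the source of the parallelism; their results are exact integer sums, with none of the round-off of FFT-based methods.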
http://dl-live.theiet.org/content/journals/10.1049/ip-vis_19941558
The author describes the parallelisation of three different versions of the CCITT H.261 encoder algorithm using a generalised parallel design methodology based upon pipelines of processor farms (PPFs). For each algorithm, a theoretical upper-bound scaling model was derived by analysing the execution time profile of the algorithm and its feedback structure. The performance predicted by the model was, in each case, in good agreement with that achieved by the corresponding practical implementation. Practical throughput scaling up to a factor of 11 was achieved, using PPFs containing up to 16 processors. The design examples illustrate the impact which feedback has on potential speedup for image coding algorithms, and the diagnostic role of the model in identifying those algorithm components which restrict scaling performance. It is believed that the techniques presented may be useful both in developing embedded image coders based upon multiple DSP devices, and for simulation work with large image sequences in application areas such as image coding for HDTV and SHDTV.Efficient hierarchical chaotic image encryption algorithm and its VLSI realisation
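The pipeline-of-processor-farms (PPF) scaling argument can be sketched with a toy throughput model (hypothetical stage times, not the H.261 execution profiles from the paper): a pipeline's throughput is limited by its slowest stage, and farming a stage over n workers divides that stage's effective service time by n.

```python
def ppf_throughput(stage_times, workers):
    """Steady-state throughput (items per time unit) of a pipeline of
    processor farms: each stage's effective service time is its per-item
    time divided by its number of farm workers; the slowest effective
    stage is the bottleneck."""
    effective = [t / n for t, n in zip(stage_times, workers)]
    return 1.0 / max(effective)
```

With stages costing 2, 8 and 1 time units and one worker each, throughput is 1/8; farming the 8-unit stage over four workers raises it to 1/2, after which the first stage becomes the new bottleneck — the same diagnostic role the paper's analytic model plays in identifying scaling-limiting components.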
http://dl-live.theiet.org/content/journals/10.1049/ip-vis_20000208
An efficient hierarchical chaotic image encryption algorithm and its VLSI architecture are proposed. Based on a chaotic system and a permutation scheme, the partitions of the original image are rearranged and the pixels in each partition are scrambled. The algorithm's properties of high security, suitability for parallel and pipeline processing, and freedom from distortion are analysed. To implement the algorithm, a VLSI architecture with pipeline processing, real-time processing capability and low hardware cost is designed, and an FPGA realisation of its key modules is given. Finally, the encrypted image is simulated and its fractal dimension is computed to demonstrate the effectiveness of the proposed scheme.Parallelising a set of 2-D frequency transforms in a flexible manner
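The core scrambling step can be sketched with a logistic map (a standard chaotic system used here purely for illustration — the paper's chaotic system, hierarchy and key schedule are not reproduced): the map's iterates are ranked to derive a key-dependent permutation, which rearranges pixels losslessly.

```python
def chaotic_permutation(n, x0=0.37, r=3.99):
    """Derive a permutation of range(n) by ranking logistic-map iterates
    x <- r*x*(1-x); the seed x0 plays the role of a key."""
    xs = []
    x = x0
    for _ in range(n):
        x = r * x * (1.0 - x)
        xs.append(x)
    return sorted(range(n), key=lambda i: xs[i])

def scramble(pixels, perm):
    return [pixels[p] for p in perm]

def unscramble(pixels, perm):
    out = [0] * len(pixels)
    for dst, src in enumerate(perm):
        out[src] = pixels[dst]
    return out
```

Because the transformation is a pure permutation, decryption recovers the image exactly — the "no distortion" property noted in the abstract.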
http://dl-live.theiet.org/content/journals/10.1049/ip-vis_19981693
The implementation of parallel two-dimensional frequency transforms intended for the acceleration of image-processing algorithms is described. The way these routines fit into a wider generic format for such parallel routines is also indicated. There is a brief consideration of the design decisions needed to combine choice of efficient routine with wide utility. Due consideration is given to auxiliary functions. Details are included of the book-keeping required to enable real-valued data to be efficiently transformed in a parallel setting.Robust adaptive quasi-Newton algorithms for eigensubspace estimation
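The natural parallel decomposition of a 2-D frequency transform is the row-column method, sketched below with a naive DFT (an illustration of the decomposition only — the paper's routines, auxiliary functions and real-data book-keeping are not shown): all row transforms are mutually independent, as are the subsequent column transforms, so each can be farmed out to a separate processor.

```python
import cmath

def dft(xs):
    """Naive 1-D discrete Fourier transform."""
    n = len(xs)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * m / n)
                for m, x in enumerate(xs)) for k in range(n)]

def dft2_rowcol(img):
    """2-D DFT by the row-column method: transform every row, then every
    column of the intermediate result."""
    rows = [dft(row) for row in img]
    cols = [dft(list(col)) for col in zip(*rows)]
    return [list(c) for c in zip(*cols)]
```

The result agrees with the direct 2-D definition F[u][v] = Σ_y Σ_x img[y][x]·e^(−2πi(uy/H + vx/W)), because the separable kernel factors into the row and column passes.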
http://dl-live.theiet.org/content/journals/10.1049/ip-vis_20030767
A novel quasi-Newton algorithm for adaptively estimating the principal eigensubspace of a covariance matrix by making use of an approximation of its Hessian matrix is derived. A rigorous analysis of the convergence properties of the algorithm by using the stochastic approximation theory is presented. It is shown that the recursive least squares (RLS) technique can be used to implement the quasi-Newton algorithm, which significantly reduces the computational requirements from O( pN<sup xmlns="http://pub2web.metastore.ingenta.com/ns/">2</sup> ) to O( pN), where <i xmlns="http://pub2web.metastore.ingenta.com/ns/">N</i> is the data vector dimension and <i xmlns="http://pub2web.metastore.ingenta.com/ns/">p</i> is the number of desired eigenvectors. The algorithm is further generalised by introducing two adjustable parameters that efficiently accelerate the adaptation process. The proposed algorithm is applied to different applications such as eigenvector estimation and the Comon–Golub test in order to study the convergence behaviour of the algorithm when compared with others such as PASTd, NIC, and the Kang <i xmlns="http://pub2web.metastore.ingenta.com/ns/">et al.</i> quasi-Newton algorithm. Simulation results show that the new algorithm is robust against changes of the input scenarios and is thus well suited to parallel implementation with online deflation.Parallel algorithms for direct solution of large sparse power system matrix equations
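The deflation idea mentioned above can be illustrated with classical power iteration (a deliberately simpler technique than the paper's quasi-Newton/RLS scheme, shown only to make the eigensubspace-with-deflation setting concrete): once an eigenpair is found, its rank-one contribution is subtracted before the next one is sought.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def top_eigenpairs(A, p, iters=500):
    """Estimate the p largest eigenpairs of a symmetric matrix by power
    iteration with deflation: after each pair (lam, v) is found, subtract
    lam * v v^T and iterate on the deflated matrix."""
    n = len(A)
    B = [row[:] for row in A]
    pairs = []
    for _ in range(p):
        v = [1.0 / (i + 1) for i in range(n)]     # arbitrary start vector
        for _ in range(iters):
            w = matvec(B, v)
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        lam = sum(x * y for x, y in zip(v, matvec(B, v)))  # Rayleigh quotient
        pairs.append((lam, v))
        for r in range(n):                         # deflate: B -= lam * v v^T
            for c in range(n):
                B[r][c] -= lam * v[r] * v[c]
    return pairs
```

The paper's contribution is precisely to avoid the cost profile of such dense iterations, replacing them with an RLS-implemented quasi-Newton update at O(pN) per step with online deflation.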
http://dl-live.theiet.org/content/journals/10.1049/ip-gtd_20010583
A coarse-grain parallel implementation is presented of LU factorisation, forward and backward substitution for solving large, sparse linear sets of algebraic equations arising from network analysis. A block solution approach was chosen instead of the usual element-wise method, to reduce communication overhead and consequently to obtain a better performance of the parallel implementation. An inverse-based technique was used to further improve the overall efficiency of repeated solutions. Data exchanges among processors are kept to the minimum in the factorisation and solution phases. This method has been successfully applied to a realistic UK 811-busbar power system network with up to 16 processors. Results are presented with detailed information on computation and communication.Unit commitment by parallel simulated annealing
http://dl-live.theiet.org/content/journals/10.1049/ip-gtd_19952215
The paper investigates the application of parallel simulated annealing for unit commitment problems. Two parallel simulated annealing concepts, speculative computation and serial subset, are applied to a unit commitment problem of ten thermal generators. The authors also propose a combined scheme where speculative computation is used in the initial phase and the serial subset is used in the final phase. The parallel simulated annealing schemes are tested with an example problem and the results show that the parallel schemes can considerably speed up the computation of simulated annealing.Team algorithms in distributed load flow computations
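A plain serial simulated-annealing loop is sketched below on a toy 1-D objective (not the unit-commitment formulation); the comments mark the two decision points that the abstract's parallel schemes exploit.

```python
import math
import random

def simulated_annealing(f, x0, t0=10.0, alpha=0.995, steps=2000, seed=1):
    """Serial SA minimising f. In the speculative-computation scheme the
    two possible successor moves (accept / reject) are evaluated on
    separate processors before the acceptance decision is known; in the
    serial-subset scheme several such chains run independently and
    exchange their best states."""
    rng = random.Random(seed)
    x, t = x0, t0
    best_x, best_f = x, f(x)
    for _ in range(steps):
        cand = x + rng.uniform(-1.0, 1.0)   # neighbour move
        delta = f(cand) - f(x)
        # Metropolis acceptance test (the speculative branch point):
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            x = cand
        if f(x) < best_f:
            best_x, best_f = x, f(x)
        t *= alpha                          # geometric cooling
    return best_x, best_f
```

The combined scheme proposed in the paper would use speculative computation while the temperature is high (acceptances are unpredictable) and switch to the serial subset in the low-temperature final phase.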
http://dl-live.theiet.org/content/journals/10.1049/ip-gtd_19952249
This paper reports the use of a general technique for combining several different methods to solve complex systems of algebraic equations in the context of load flow calculations for electrical power networks. Such combinations of methods, referred to as 'team algorithms', seem especially well suited to distributed-memory computer systems operating in an asynchronous environment. Experimental results from solving example problems on a commercially available parallel computer system show that a 'synergetic effect' with considerable speedup can be obtained using these 'team algorithms'.Decoupled asymmetrical three-phase load flow study by parallel processing
http://dl-live.theiet.org/content/journals/10.1049/ip-gtd_19960108
A decoupling-compensation technique is proposed for breaking the three-phase coupled power system model into three sequence-decoupled models. With such a decoupling, the positive-, negative- and zero-sequence system models can be solved simultaneously. By combining these models with the well-known Newton–Raphson or fast-decoupled algorithms, two novel methods for asymmetrical three-phase load flow study are established. Based on these, a synchronous parallel algorithm can be employed, and the corresponding software implemented on a multiprocessor system. The validity of the proposed methods is verified and the computation results are analysed herein.Efficient heuristic partitioning algorithm for parallel processing of large power systems network equations
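The sequence decoupling above rests on the classical Fortescue symmetrical-component transform, sketched here for phasors (the per-sequence network models and the compensation step are beyond this sketch): an unbalanced three-phase set splits into independent zero-, positive- and negative-sequence components, which is what allows the three models to be solved in parallel.

```python
import cmath

A = cmath.exp(2j * cmath.pi / 3)   # 120-degree rotation operator 'a'

def to_sequence(va, vb, vc):
    """Phase phasors -> (zero, positive, negative) sequence components."""
    v0 = (va + vb + vc) / 3
    v1 = (va + A * vb + A * A * vc) / 3
    v2 = (va + A * A * vb + A * vc) / 3
    return v0, v1, v2

def to_phase(v0, v1, v2):
    """Inverse transform: sequence components -> phase phasors."""
    va = v0 + v1 + v2
    vb = v0 + A * A * v1 + A * v2
    vc = v0 + A * v1 + A * A * v2
    return va, vb, vc
```

A balanced set (1, a², a) maps to a pure positive-sequence component, and the transform is exactly invertible, so no information is lost by solving the three sequence models separately.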
http://dl-live.theiet.org/content/journals/10.1049/ip-gtd_19952211
An efficient heuristic algorithm for automatic partitioning of a power system network is presented. The algorithm was designed for use in conjunction with a diakoptics-based coarse-grain parallel method for solving large, sparse linear systems arising from network analysis. It exploits the techniques of factorisation path graph partitioning and equivalent post-ordering. Test results for systems with up to 811 busbars and 1476 transmission lines are included for comparison purposes. These indicate that the method is able to divide a power system network model into a number of equal-sized subnetworks, optimising the use of parallel computer systems for network analysis while keeping the resultant number of fill-in elements to a minimum.Message optimal fully decentralised evaluation of associative and commutative functions
http://dl-live.theiet.org/content/journals/10.1049/ip-cdt_19941165
Decentralised protocols can be characterised by successive rounds of message interchanges. We show that at least kN((N<sup xmlns="http://pub2web.metastore.ingenta.com/ns/">1/k</sup>)-1) messages are required for fully decentralised evaluation of functions that are both associative and commutative if k rounds of message interchanges are used in an N-node system. We then present a family of fully decentralised algorithms that requires, at most, a total of kN((N<sup xmlns="http://pub2web.metastore.ingenta.com/ns/">1/k</sup>)-1) messages to be sent using k rounds of message interchanges. The family of algorithms is therefore optimal with respect to the total number of messages exchanged among the processing nodes. Problems which can be modelled as the evaluation of associative and commutative functions include extrema finding and distributed transaction commitment.Analysis of new pivoting strategy for the <strong xmlns="http://pub2web.metastore.ingenta.com/ns/"><i>LDL</i></strong><sup xmlns="http://pub2web.metastore.ingenta.com/ns/"><i>T</i></sup> decomposition on a multiprocessor system with distributed memory
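A scheme achieving the kN(N^(1/k)−1) message count can be sketched for the associative, commutative function "sum" when N = m^k (a dimension-wise exchange; whether this matches the paper's family of algorithms is an assumption): nodes are indexed by k base-m digits, and in round r every node combines partial results from the m−1 nodes differing only in digit r.

```python
def decentralised_sum(values, m, k):
    """All N = m**k nodes learn the global sum in k rounds.
    In round r each node receives the partial results of the m-1 nodes
    whose index differs only in digit r, so the total message count is
    k * N * (m - 1) = k * N * (N**(1/k) - 1)."""
    n = m ** k
    assert len(values) == n
    acc = list(values)
    messages = 0
    for r in range(k):
        stride = m ** r
        new_acc = [0] * n
        for node in range(n):
            digit = (node // stride) % m
            group = [node + (d - digit) * stride for d in range(m)]
            new_acc[node] = sum(acc[g] for g in group)
            messages += m - 1        # partial results received this round
        acc = new_acc
    return acc, messages
```

After round r each node holds the sum over all nodes agreeing with it on digits r+1, …, k−1, so after k rounds every node holds the global sum — fully decentralised, with exactly the optimal message total.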
http://dl-live.theiet.org/content/journals/10.1049/ip-cdt_20030059
It is well known that optimal control techniques can provide the ability to design suitable strategies; however, the on-line computing requirements are excessive. The normal procedure is to make various assumptions so that the processing demands are reduced. Based on these assumptions, sequences of linear-quadratic-performance optimal control problems need to be considered. These in turn give rise to standard two-point boundary-value problems, whose solution involves the computation of algebraic Riccati equations (AREs). The block diagonal decomposition <strong xmlns="http://pub2web.metastore.ingenta.com/ns/"><i>LDL</i></strong><sup xmlns="http://pub2web.metastore.ingenta.com/ns/"><i>T</i></sup> is the key step for those algorithms based on the matrix sign function that are used in solving the AREs. The last few years have witnessed a tremendous effort towards the development of reliable algorithms to solve the AREs and apply them in industrial situations. However, all the implementations and testing of the proposed algorithms have been performed on powerful machines, thereby limiting their practical application. The authors present a pivoting strategy that: (i) requires only one dimension of the matrix for the selection of the pivot; (ii) generates regular communication patterns; and (iii) establishes a software mechanism for the development of fault-tolerant applications. The results obtained from a multiprocessor system with a one-way ring topology indicate that the block diagonal <strong xmlns="http://pub2web.metastore.ingenta.com/ns/"><i>LDL</i></strong><sup xmlns="http://pub2web.metastore.ingenta.com/ns/"><i>T</i></sup> decomposition is a strong candidate for real-time use and fault-tolerant applications, and also as a framework-test for the LAPACK library.Difference clocks: A new scheme for logical time in distributed systems
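The decomposition being pivoted can be sketched in its plain, unpivoted serial form (for a symmetric positive-definite matrix; the paper's contribution — the pivoting strategy and its distributed, blocked realisation — is not shown here):

```python
def ldlt(A):
    """Unpivoted LDL^T factorisation of a symmetric matrix:
    A = L D L^T with L unit lower triangular and D diagonal."""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    D = [0.0] * n
    for j in range(n):
        D[j] = A[j][j] - sum(L[j][k] ** 2 * D[k] for k in range(j))
        for i in range(j + 1, n):
            L[i][j] = (A[i][j]
                       - sum(L[i][k] * L[j][k] * D[k] for k in range(j))) / D[j]
    return L, D
```

For indefinite matrices — the case arising from the matrix sign function — a pivoting strategy such as the authors' is required to keep the factorisation stable, which is exactly the problem the abstract addresses.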
http://dl-live.theiet.org/content/journals/10.1049/ip-cdt_19960468
Logical clocks and vector clocks were proposed in the past to capture causality between events of different processes of a distributed computation. However, these clocks could not capture intraprocess concurrency. Later, bit-matrix clocks and hierarchical clocks were developed to capture interprocess concurrency as well as intraprocess concurrency. The major disadvantages of these clocks are the associated storage and communication overheads. To overcome these disadvantages, the authors introduce the concept of difference clocks. The difference clocks maintain minimal information about the differences among various local clocks. It is shown that they can be used to reconstruct the bit-matrix clocks and result in substantial reduction in storage space and communication overhead.Unified approach to designing parallel Winograd algorithms
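For background, the classic vector-clock mechanism that difference clocks compress can be sketched as follows (this is the standard scheme, not the difference-clock construction itself):

```python
class VectorClock:
    """Classic vector clock for one process in an n-process system."""
    def __init__(self, pid, n):
        self.pid = pid
        self.v = [0] * n

    def tick(self):                 # local event
        self.v[self.pid] += 1

    def send(self):                 # timestamp to piggyback on a message
        self.tick()
        return list(self.v)

    def recv(self, ts):             # merge a received timestamp, then tick
        self.v = [max(a, b) for a, b in zip(self.v, ts)]
        self.tick()

def happened_before(u, v):
    """Causal precedence: u -> v iff u <= v componentwise and u != v."""
    return all(a <= b for a, b in zip(u, v)) and u != v
```

Every process carries a full n-entry vector and every message a full timestamp — the storage and communication overhead that difference clocks reduce by maintaining only the differences among local clocks.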
http://dl-live.theiet.org/content/journals/10.1049/ip-cdt_19949981
Although the recurrence equation for the Winograd algorithm is uniform, no unified approach has been proposed to design parallel Winograd algorithms. The authors propose a unified approach to designing parallel Winograd algorithms. Using this approach, several parallel algorithms are designed. These algorithms are executed on regular arrays including conventional systolic arrays and nonplanar regular arrays. A comparison of their performance is given.Parallel multigrid algorithms on CM-5
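The serial base of the Winograd algorithm — the inner-product scheme that roughly halves the multiplications by precomputing row and column factors — can be sketched directly (the parallel array designs of the paper build on its recurrence; only the sequential form is shown, with the inner dimension assumed even):

```python
def winograd_matmul(A, B):
    """Winograd's inner-product algorithm for C = A * B.
    Row and column factors are computed once, so the main triple loop
    needs roughly half the multiplications of the naive method."""
    m, n, q = len(A), len(A[0]), len(B[0])
    assert n % 2 == 0                # inner dimension assumed even here
    half = n // 2
    row = [sum(A[i][2 * j] * A[i][2 * j + 1] for j in range(half)) for i in range(m)]
    col = [sum(B[2 * j][k] * B[2 * j + 1][k] for j in range(half)) for k in range(q)]
    C = [[0] * q for _ in range(m)]
    for i in range(m):
        for k in range(q):
            C[i][k] = sum((A[i][2 * j] + B[2 * j + 1][k]) * (A[i][2 * j + 1] + B[2 * j][k])
                          for j in range(half)) - row[i] - col[k]
    return C
```

Expanding one term shows why it works: (a₀+b₁)(a₁+b₀) = a₀a₁ + a₀b₀ + a₁b₁ + b₀b₁, and subtracting the precomputed a₀a₁ and b₀b₁ sums leaves exactly the inner-product pairs a₀b₀ + a₁b₁.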
http://dl-live.theiet.org/content/journals/10.1049/ip-cdt_19951865
Solutions to partial differential equations are required in many engineering applications. The multigrid method is an iterative technique for speeding up the solution of these equations. The authors describe a parallel implementation of the multigrid method on the Connection Machine CM-5 architecture. An analytic model is presented for estimating the computation and communication times of the multigrid algorithm. The times predicted by the analytic model are within 5% of the results obtained from CM-5. Results demonstrate that the communication overhead incurred by the parallel multigrid algorithm is relatively small compared to the computation time. Consequently, implementations of the multigrid algorithm on the CM-5 easily achieve processor efficiencies near 100%.Design and application of Parsim — a message-passing computer simulator
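A minimal 1-D V-cycle conveys the structure of the method (a sketch of textbook multigrid, not the CM-5 implementation; it assumes n = 2^k − 1 interior points for the model problem −u″ = f with zero boundary values):

```python
def jacobi(u, f, h, w=2.0 / 3.0):
    """One weighted-Jacobi smoothing sweep for -u'' = f, zero boundaries."""
    n = len(u)
    g = lambda i: u[i] if 0 <= i < n else 0.0
    return [(1 - w) * u[i] + w * 0.5 * (g(i - 1) + g(i + 1) + h * h * f[i])
            for i in range(n)]

def residual(u, f, h):
    n = len(u)
    g = lambda i: u[i] if 0 <= i < n else 0.0
    return [f[i] - (2 * u[i] - g(i - 1) - g(i + 1)) / (h * h) for i in range(n)]

def vcycle(u, f, h):
    """One V-cycle: pre-smooth, restrict the residual, recurse on the
    coarse-grid error equation, interpolate the correction, post-smooth."""
    n = len(u)
    if n == 1:
        return [f[0] * h * h / 2.0]          # exact solve on coarsest grid
    for _ in range(3):
        u = jacobi(u, f, h)
    r = residual(u, f, h)
    nc = (n - 1) // 2                        # full-weighting restriction
    rc = [0.25 * r[2 * i] + 0.5 * r[2 * i + 1] + 0.25 * r[2 * i + 2]
          for i in range(nc)]
    ec = vcycle([0.0] * nc, rc, 2 * h)
    e = [0.0] * n                            # linear interpolation back
    for i in range(nc):
        e[2 * i + 1] = ec[i]
    e[0] = 0.5 * ec[0]
    for i in range(1, nc):
        e[2 * i] = 0.5 * (ec[i - 1] + ec[i])
    e[n - 1] = 0.5 * ec[-1]
    u = [u[i] + e[i] for i in range(n)]
    for _ in range(3):
        u = jacobi(u, f, h)
    return u
```

The smoothing sweeps are the computation-dominant, embarrassingly parallel part; the restriction and interpolation steps involve only nearest-neighbour data, which is why the communication overhead observed on the CM-5 stays small relative to computation.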
http://dl-live.theiet.org/content/journals/10.1049/ip-cdt_19970961
Currently many interconnection networks and parallel algorithms exist for message-passing computers. Users of these machines wish to determine which message-passing computer is best for a given job, and how it will scale with the number of processors and algorithm size. The paper describes a general purpose simulator for message-passing multiprocessors (Parsim), which facilitates system modelling. A structured method for simulator design has been used which gives Parsim the ability to simulate different topologies and algorithm combinations easily. This is illustrated by applying Parsim to a number of algorithms on a variety of topologies. Parsim is then used to predict the performance of the new IBM SP2 parallel computer, with topologies ranging up to 1024 processors.Dynamic evaluation strategy for fine-grain data-parallel computing
http://dl-live.theiet.org/content/journals/10.1049/ip-cdt_19960333
The placement of elemental operations (as opposed to data) of a data-driven data-parallel computation in a network of processors is examined. A fast suboptimal algorithm is proposed for such placement which tends to minimise the overall network load when the computation is essentially nonlocal. The cases of grid, torus and hypercube topology are considered. It is shown that the proposed algorithm, while having moderate computational complexity, demonstrates up to a 50% reduction in required network throughput over some straightforward placement schemes in the practical range of network sizes.