
Analysis and evaluation of distributed checkpoint algorithms to avoid rollback propagation

IEE Proceedings - Software


Checkpointing is a well-known mechanism for achieving fault tolerance. In distributed applications where processes can checkpoint independently of each other, a local checkpoint is useful for fault tolerance only if it belongs to at least one consistent global checkpoint. In that case, execution can be restarted from it without rolling the computation further back into the past. The paper exploits a theoretical framework that facilitates the definition and analysis of distributed checkpoint algorithms that avoid rollback propagation. Several distributed algorithms are presented that avoid rollback propagation by forcing additional checkpoints in processes. The effectiveness of the algorithms is evaluated in several testbed applications, showing that their ability to bound the number of additional checkpoints is limited.
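The notion of a useful local checkpoint can be illustrated with a small sketch. The code below is not one of the paper's algorithms; it is a brute-force check under stated assumptions: each process's history is a numbered sequence of events, a checkpoint is identified by the number of local events that precede it, and a global checkpoint is consistent when no message is recorded as received without also being recorded as sent. The names `Message`, `is_consistent`, and `is_useful` are hypothetical and chosen for illustration only.

```python
from itertools import product
from typing import NamedTuple


class Message(NamedTuple):
    sender: str
    send_pos: int    # index of the send event in the sender's local history
    receiver: str
    recv_pos: int    # index of the receive event in the receiver's local history


def is_consistent(cut, messages):
    """cut maps each process to the number of local events before its checkpoint."""
    for m in messages:
        received = m.recv_pos < cut[m.receiver]
        sent = m.send_pos < cut[m.sender]
        if received and not sent:   # orphan message: received but never sent
            return False
    return True


def is_useful(process, checkpoint, checkpoints, messages):
    """True if `checkpoint` of `process` belongs to some consistent global checkpoint.

    Brute force over all combinations of the other processes' checkpoints;
    exponential in the number of processes, so suitable only for tiny examples.
    """
    others = [p for p in checkpoints if p != process]
    for combo in product(*(checkpoints[p] for p in others)):
        cut = dict(zip(others, combo))
        cut[process] = checkpoint
        if is_consistent(cut, messages):
            return True
    return False


# Assumed scenario: Q sends m2 to P, P checkpoints, P sends m1 to Q.
messages = [
    Message("Q", 0, "P", 0),   # m2: Q's event 0 -> P's event 0
    Message("P", 1, "Q", 1),   # m1: P's event 1 -> Q's event 1
]

# Q checkpoints only at its start (0) and end (2): P's checkpoint 1 is useless.
independent = {"P": [0, 1, 2], "Q": [0, 2]}
print(is_useful("P", 1, independent, messages))   # False

# An additional checkpoint on Q between its send and its receive fixes this.
forced = {"P": [0, 1, 2], "Q": [0, 1, 2]}
print(is_useful("P", 1, forced, messages))        # True
```

In the first configuration, pairing P's checkpoint with either of Q's checkpoints leaves an orphan message, so a restart from it would have to roll back further. Adding one checkpoint on Q makes P's checkpoint part of a consistent global checkpoint, which is the effect that forcing additional checkpoints is meant to achieve.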

http://iet.metastore.ingenta.com/content/journals/10.1049/ip-sen_19982442