Analysis and evaluation of distributed checkpoint algorithms to avoid rollback propagation

F. Zambonelli


Journal: IEE Proceedings - Software


Author(s): F. Zambonelli
Affiliation: Dipartimento di Scienze dell'Ingegneria, Università di Modena e Reggio Emilia, Modena, Italy
Source: Volume 145, Issue 6, December 1998, p. 212–218
DOI: 10.1049/ip-sen:19982442, Print ISSN 1462-5970, Online ISSN 1463-9831


Checkpointing is a well-known mechanism for achieving fault tolerance. In distributed applications where processes can checkpoint independently of each other, a local checkpoint is useful for fault tolerance only if it belongs to at least one consistent global checkpoint. In that case, execution can be restarted from it without rolling back further into the past. The paper exploits a theoretical framework that facilitates the definition and analysis of distributed checkpoint algorithms that avoid rollback propagation. Several distributed algorithms are presented that avoid rollback propagation by forcing additional checkpoints in processes. The effectiveness of the algorithms is evaluated on several testbed applications, showing that their ability to bound the number of additional checkpoints is limited.
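The usefulness condition stated in the abstract can be illustrated with a small Python sketch. It is not taken from the paper and does not reproduce any of its algorithms. The model is an assumption made for illustration: checkpoints on each process are numbered from 0 (the initial state), interval k is the execution between checkpoint k and checkpoint k+1, and a global checkpoint is consistent when it contains no orphan message, that is, no message recorded as received but not as sent. The names `Message`, `is_consistent` and `is_useful` are hypothetical, and the brute-force search is only practical for tiny examples.

```python
from itertools import product
from typing import NamedTuple, Sequence


class Message(NamedTuple):
    """A message, located by the checkpoint intervals of its two events (assumed model)."""
    sender: int          # index of the sending process
    send_interval: int   # interval in which the sender sent it
    receiver: int        # index of the receiving process
    recv_interval: int   # interval in which the receiver delivered it


def is_consistent(cut: Sequence[int], messages: Sequence[Message]) -> bool:
    """Return True if the global checkpoint `cut` has no orphan message.

    cut[p] is the checkpoint index chosen on process p. A message is an
    orphan if it was received before cut[receiver] but sent after cut[sender].
    """
    for m in messages:
        received_before = m.recv_interval < cut[m.receiver]
        sent_after = m.send_interval >= cut[m.sender]
        if received_before and sent_after:
            return False
    return True


def is_useful(proc: int, index: int, num_checkpoints: Sequence[int],
              messages: Sequence[Message]) -> bool:
    """Return True if checkpoint `index` of process `proc` belongs to at
    least one consistent global checkpoint (exhaustive search)."""
    choices = [[index] if p == proc else range(n)
               for p, n in enumerate(num_checkpoints)]
    return any(is_consistent(cut, messages) for cut in product(*choices))


# Two processes, each with checkpoints 0 (initial) and 1.
# P1 sends m2, P0 receives it and takes checkpoint 1, then P0 sends m1,
# which P1 receives before taking its own checkpoint 1.
messages = [
    Message(sender=1, send_interval=0, receiver=0, recv_interval=0),  # m2
    Message(sender=0, send_interval=1, receiver=1, recv_interval=0),  # m1
]

# Same run, but P1 is forced to take an extra checkpoint after sending m2
# and before receiving m1, so m1 is now received in interval 1 of P1.
forced_messages = [
    Message(sender=1, send_interval=0, receiver=0, recv_interval=0),  # m2
    Message(sender=0, send_interval=1, receiver=1, recv_interval=1),  # m1
]

if __name__ == "__main__":
    print(is_useful(0, 1, [2, 2], messages))         # checkpoint 1 of P0, no forced checkpoint
    print(is_useful(0, 1, [2, 3], forced_messages))  # same checkpoint, with the forced one on P1
```

In the first run, checkpoint 1 of P0 is in no consistent global checkpoint, so restarting from it would propagate the rollback. In the second run the extra checkpoint on P1 makes it useful, which is the general effect that forcing additional checkpoints aims for.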

