Computer Science Colloquium, 2005-2006

Adnan Agbaria, Information Sciences Institute (ISI), University of Southern California
January 4th, 2006

Compiler-Driven Distributed Checkpointing

Distributed checkpointing is an important technique for providing fault tolerance in computer systems. Fault tolerance matters especially in distributed systems, where the failure rate is high. In today's applications, e.g., grid and massively parallel applications, the overhead of taking a distributed checkpoint with the known approaches can often outweigh its benefits, due to coordination and other overhead incurred by the processes. In this talk, I present an innovative approach to distributed checkpointing. In this approach, the checkpoints are placed in the application code during compilation, using application-level analysis. During execution, no coordination is required: every process takes a local checkpoint as specified in the code, independently of the other processes. In addition, I present a performance analysis that uses stochastic models to compare the checkpoint overhead of this approach with that of other existing checkpointing protocols.
