Compiler-Driven Distributed Checkpointing
Distributed checkpointing is a key technique for providing fault tolerance in computer systems. Fault tolerance matters most in distributed systems, where the failure rate is high. In today's applications, e.g., grid and massively parallel applications, the overhead of taking a distributed checkpoint with known approaches can often outweigh its benefits, largely because the processes must coordinate with one another. In this talk, I present a new approach to distributed checkpointing. In this approach, checkpoint locations are determined at compile time by application-level analysis and inserted into the application code. During execution, no coordination is required: every process takes a local checkpoint where the code specifies, independently of the other processes. In addition, I present a performance analysis that uses stochastic models to compare the checkpoint overhead of this approach with that of existing checkpointing protocols.
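
As a rough illustration of the idea, the sketch below simulates processes that take purely local checkpoints at fixed program points, with no messages exchanged. It is not the implementation presented in the talk. The `Process` class, the `checkpoint` call, the fixed `interval`, and the `recovery_line` helper are all hypothetical names chosen for this example, and the rule "roll back to the latest checkpoint id that every process has saved" is an assumption that holds only if the compile-time analysis guarantees that checkpoints with the same id form a consistent global state.

```python
import copy


class Process:
    """One process of a hypothetical SPMD application (illustration only)."""

    def __init__(self, rank):
        self.rank = rank
        self.state = {"step": 0, "acc": 0}
        self.checkpoints = {}  # checkpoint id -> saved copy of the local state

    def checkpoint(self, ckpt_id):
        # Stands in for a call the compiler would insert at a chosen program
        # point. It is purely local: nothing is sent to other processes.
        self.checkpoints[ckpt_id] = copy.deepcopy(self.state)

    def run(self, steps, interval):
        # Toy computation with a checkpoint call every `interval` steps.
        for step in range(1, steps + 1):
            self.state["step"] = step
            self.state["acc"] += self.rank * step
            if step % interval == 0:
                self.checkpoint(step // interval)


def recovery_line(processes):
    """Return the latest checkpoint id saved by every process, or None.

    Assumption: checkpoints sharing an id are mutually consistent, which the
    compile-time placement is supposed to ensure.
    """
    if not processes:
        return None
    common = set.intersection(*(set(p.checkpoints) for p in processes))
    return max(common) if common else None


# Three processes; process 1 "fails" after 7 steps, the others reach step 10.
procs = [Process(rank) for rank in range(3)]
for proc, steps in zip(procs, (10, 7, 10)):
    proc.run(steps, interval=3)

line = recovery_line(procs)
restored = [proc.checkpoints[line] for proc in procs]
```

Here every process can restart from its own saved copy of checkpoint `line` without having coordinated at checkpoint time; the cost of the scheme is shifted to the compile-time analysis that decides where the `checkpoint` calls go.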