Checkpoint/Restart and Fault Tolerance in Parallel Computing

Checkpoint/restart is the fault-tolerance mechanism that periodically saves the complete execution state of a parallel program to persistent storage, so that a failed computation can resume from the last checkpoint instead of restarting from scratch. It is essential for long-running HPC and AI training jobs, where a failure without checkpointing can waste days to weeks of compute time. On clusters of 10,000 or more GPUs, hardware failures are not exceptional events: over training runs lasting weeks they are statistically near-certain.

Why Fault Tolerance Is Necessary at Scale
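
A back-of-envelope sketch of why failures become near-certain at scale. It assumes independent failures with a constant per-GPU failure rate (an exponential model) and that any single GPU failure kills the job. The per-GPU MTBF of 50,000 hours is an illustrative assumption, not a measured figure.

```python
import math

def cluster_mtbf_hours(per_gpu_mtbf_hours: float, num_gpus: int) -> float:
    """MTBF of the whole job when any single GPU failure stops it.

    With independent, exponentially distributed failures, the failure
    rates add up, so the job-level MTBF shrinks by a factor of num_gpus.
    """
    return per_gpu_mtbf_hours / num_gpus

def prob_at_least_one_failure(per_gpu_mtbf_hours: float,
                              num_gpus: int,
                              run_hours: float) -> float:
    """Probability of at least one failure during a run of run_hours."""
    job_mtbf = cluster_mtbf_hours(per_gpu_mtbf_hours, num_gpus)
    return 1.0 - math.exp(-run_hours / job_mtbf)

if __name__ == "__main__":
    ASSUMED_GPU_MTBF_H = 50_000.0  # illustrative assumption
    two_weeks_h = 14 * 24
    print("job MTBF (h):", cluster_mtbf_hours(ASSUMED_GPU_MTBF_H, 10_000))
    print("P(failure in 2 weeks):",
          prob_at_least_one_failure(ASSUMED_GPU_MTBF_H, 10_000, two_weeks_h))
```

Under these assumptions a 10,000-GPU job sees a failure every few hours on average, while a single GPU over the same two weeks would fail with probability below 1%.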

Checkpoint Types

| Type | Scope | Speed | Recovery | Overhead |
|---|---|---|---|---|
| Application-level | User code saves model weights | Fast, targeted | Application-level | Low if infrequent |
| System-level (transparent) | OS snapshots all process memory | Complete state | Fully transparent | High (copy all memory) |
| Coordinated | All processes checkpoint simultaneously | Slow (coordination) | Consistent state | Significant |
| Uncoordinated | Each process checkpoints independently | Fast | Complex recovery | Variable |

Application-Level Checkpointing (Deep Learning)
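
A minimal, dependency-free sketch of application-level checkpointing. A plain dictionary stands in for the training state; in a PyTorch job that state would typically be built from `model.state_dict()` and `optimizer.state_dict()` and written with `torch.save`. The function names and the state layout here are hypothetical. The write goes to a temporary file first and is then renamed, so a crash mid-write cannot corrupt the previous checkpoint.

```python
import os
import pickle
import tempfile

def save_checkpoint(state: dict, path: str) -> None:
    """Write state to path without ever leaving a half-written file there."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())      # push the bytes to stable storage
        # Atomic on POSIX when both paths are on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

def load_checkpoint(path: str) -> dict:
    """Read back a checkpoint written by save_checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as ckpt_dir:
        path = os.path.join(ckpt_dir, "latest.pkl")
        # Hypothetical training state: step counter, weights, optimizer state.
        state = {"step": 0, "weights": [0.0, 0.0], "optimizer": {"lr": 0.1}}
        for step in range(1, 6):
            state["step"] = step
            state["weights"] = [w + 0.5 for w in state["weights"]]
            if step % 2 == 0:          # checkpoint every 2 steps
                save_checkpoint(state, path)
        print("resuming from step", load_checkpoint(path)["step"])
```

Saving the optimizer state and step counter alongside the weights matters: restoring weights alone changes the training trajectory after a restart.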

DMTCP (Distributed Multi-Threaded CheckPointing)

Coordinated Checkpointing (MPI)
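
A small simulation of the coordinated pattern, not real MPI code: threads stand in for MPI ranks and `threading.Barrier` stands in for an MPI barrier. The shard file names and the `COMPLETE` marker are hypothetical. The sketch assumes the barrier is placed at a point where no messages are in flight between ranks; handling in-flight messages is outside its scope.

```python
import json
import os
import tempfile
import threading

def run_rank(rank, world_size, barrier, ckpt_dir, step, local_state):
    """One simulated rank taking part in a coordinated checkpoint."""
    # 1. Every rank stops at the same step before anything is written.
    barrier.wait()
    # 2. Each rank writes its own shard of the global state.
    shard_path = os.path.join(ckpt_dir, f"rank{rank}.json")
    with open(shard_path, "w") as f:
        json.dump({"step": step, "state": local_state}, f)
    # 3. Wait until all shards are written.
    barrier.wait()
    # 4. Rank 0 writes a marker so restart code can tell the set is complete.
    if rank == 0:
        with open(os.path.join(ckpt_dir, "COMPLETE"), "w") as f:
            json.dump({"step": step, "world_size": world_size}, f)

def coordinated_checkpoint(world_size: int, step: int, ckpt_dir: str) -> None:
    """Run world_size simulated ranks through one coordinated checkpoint."""
    barrier = threading.Barrier(world_size)
    threads = [
        threading.Thread(
            target=run_rank,
            args=(rank, world_size, barrier, ckpt_dir, step, [rank] * 3),
        )
        for rank in range(world_size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        coordinated_checkpoint(world_size=4, step=100, ckpt_dir=d)
        print(sorted(os.listdir(d)))
```

The two barriers are where the coordination cost comes from: every rank waits for the slowest one, but all shards are guaranteed to describe the same step.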

Asynchronous Checkpointing
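
A minimal sketch of the asynchronous idea: training blocks only for an in-memory snapshot, and a background thread does the slow write to storage. The `AsyncCheckpointer` class is hypothetical and uses only the standard library; it assumes the state fits in host memory twice and is JSON-serializable.

```python
import copy
import json
import os
import tempfile
import threading

class AsyncCheckpointer:
    """Snapshot state synchronously, persist it in a background thread."""

    def __init__(self):
        self._thread = None

    def save(self, state: dict, path: str) -> None:
        self.wait()                       # allow at most one write in flight
        snapshot = copy.deepcopy(state)   # the only part that blocks training
        self._thread = threading.Thread(
            target=self._write, args=(snapshot, path)
        )
        self._thread.start()

    @staticmethod
    def _write(snapshot: dict, path: str) -> None:
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(snapshot, f)
        os.replace(tmp_path, path)        # publish only a fully written file

    def wait(self) -> None:
        """Block until the pending write, if any, has finished."""
        if self._thread is not None:
            self._thread.join()
            self._thread = None

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ckpt.json")
        ckpt = AsyncCheckpointer()
        state = {"step": 1, "weights": [0.1, 0.2]}
        ckpt.save(state, path)
        state["step"] = 2                 # training continues immediately
        ckpt.wait()
        with open(path) as f:
            print(json.load(f)["step"])   # the snapshot, not the mutated state
```

The deep copy is what makes it safe for training to keep mutating the state while the write is still running.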

AI Training Checkpoint Optimization
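
One optimization question under this heading is how often to checkpoint. A common first-order answer is Young's approximation, which sets the interval to the square root of twice the checkpoint cost times the job-level MTBF. The sketch below applies it; the 60-second checkpoint cost and 5-hour MTBF are illustrative assumptions.

```python
import math

def young_interval_s(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's approximation for the checkpoint interval, in seconds.

    interval = sqrt(2 * checkpoint_cost * MTBF)
    Valid as a first-order estimate when checkpoint_cost << MTBF.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

def checkpoint_overhead_fraction(checkpoint_cost_s: float,
                                 interval_s: float) -> float:
    """Fraction of wall-clock time spent writing checkpoints."""
    return checkpoint_cost_s / (interval_s + checkpoint_cost_s)

if __name__ == "__main__":
    cost_s = 60.0          # assumed time to write one checkpoint
    mtbf_s = 5 * 3600.0    # assumed job-level MTBF
    interval = young_interval_s(cost_s, mtbf_s)
    print(f"interval: {interval / 60:.1f} min")
    print(f"overhead: {checkpoint_overhead_fraction(cost_s, interval):.1%}")
```

The formula captures the trade-off: checkpointing too often wastes time writing, and checkpointing too rarely wastes time recomputing after a failure. Reducing the checkpoint cost, for example by writing asynchronously, shortens the interval that is worth using.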

Recovery from Failure

1. Detect failure: Heartbeat timeout, NCCL error, hardware watchdog
2. Kill all processes in the job
3. Identify last complete checkpoint
4. Respawn job on new healthy nodes (replace failed GPU)
5. Load checkpoint: All ranks restore from checkpoint files
6. Verify consistency: Check step number, optimizer state
7. Resume training from checkpoint step
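
Steps 3 and 6 above can be sketched as follows. The directory layout (`step_<N>/` directories containing a `COMPLETE` marker file) and both function names are hypothetical. The marker lets restart code skip a checkpoint that was still being written when the failure happened.

```python
import os
import re
import tempfile

STEP_DIR = re.compile(r"^step_(\d+)$")

def latest_complete_checkpoint(root: str):
    """Return (step, path) of the newest complete checkpoint, or None.

    A checkpoint counts as complete only if its directory contains a
    COMPLETE marker, written last by the checkpointing code.
    """
    best = None
    for name in os.listdir(root):
        match = STEP_DIR.match(name)
        path = os.path.join(root, name)
        if match and os.path.isfile(os.path.join(path, "COMPLETE")):
            step = int(match.group(1))
            if best is None or step > best[0]:
                best = (step, path)
    return best

def verify_consistency(rank_steps, expected_step: int) -> bool:
    """Check that every rank restored the same step before resuming."""
    return all(step == expected_step for step in rank_steps)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for step, complete in [(100, True), (200, True), (300, False)]:
            d = os.path.join(root, f"step_{step}")
            os.makedirs(d)
            if complete:
                open(os.path.join(d, "COMPLETE"), "w").close()
        found = latest_complete_checkpoint(root)
        print("restart from step", found[0])   # step_300 is skipped
```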

Failure Detection
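
A minimal sketch of heartbeat-based detection, one of the signals named in the recovery steps. Each rank reports in periodically, and a monitor flags any rank that has been silent longer than a timeout. The `HeartbeatMonitor` class and the 30-second timeout are hypothetical; the clock is injectable so the logic can be exercised without waiting.

```python
import time

class HeartbeatMonitor:
    """Track the last heartbeat per rank and report ranks that timed out."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_seen = {}

    def beat(self, rank: int) -> None:
        """Record that a heartbeat from this rank arrived now."""
        self._last_seen[rank] = self._clock()

    def failed_ranks(self):
        """Return the sorted ranks whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(
            rank for rank, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )

if __name__ == "__main__":
    fake_time = [0.0]                       # simulated clock for the demo
    monitor = HeartbeatMonitor(timeout_s=30.0, clock=lambda: fake_time[0])
    monitor.beat(0)
    monitor.beat(1)
    fake_time[0] = 20.0
    monitor.beat(1)                         # rank 1 keeps reporting
    fake_time[0] = 40.0                     # rank 0 silent for 40 s
    print("failed ranks:", monitor.failed_ranks())
```

The timeout is a trade-off: too short and slow steps or network hiccups trigger false restarts, too long and the whole job idles before recovery starts.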

Checkpoint/restart is the insurance policy that makes large-scale AI training economically viable. Without it, a single hardware failure in a 10,000-GPU cluster after 20 days of training would waste 200,000 GPU-days (4.8 million GPU-hours) of compute. With hourly checkpoints, the same failure costs at most 10,000 GPU-hours of recomputation, turning a catastrophic loss into a manageable interruption and enabling the multi-week training runs that produce frontier AI models.
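
The cost of a failure in this scenario is the number of GPUs multiplied by the hours of work since the last saved state. A short sketch of that arithmetic, using the cluster size and run length from the example (the function name is hypothetical, and restart time is ignored):

```python
def gpu_hours_lost(num_gpus: int, hours_since_last_checkpoint: float) -> float:
    """Work that must be redone after a failure, in GPU-hours."""
    return num_gpus * hours_since_last_checkpoint

if __name__ == "__main__":
    num_gpus = 10_000
    # No checkpoints: all 20 days of training are lost.
    print(gpu_hours_lost(num_gpus, 20 * 24))   # 4800000
    # Hourly checkpoints: at most one hour of training is lost.
    print(gpu_hours_lost(num_gpus, 1))         # 10000
```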
