Checkpoint/Restart and Fault Tolerance in Parallel Computing

Keywords: checkpoint restart, fault tolerance parallel, dmtcp, checkpoint recovery, resilient computing, parallel fault tolerance

Checkpoint/Restart and Fault Tolerance in Parallel Computing is the reliability mechanism that periodically saves the execution state of a parallel program to persistent storage so that a failed computation can be resumed from the last checkpoint rather than restarted from scratch. It is essential for long-running HPC and AI training jobs, where an uncheckpointed failure wastes days to weeks of compute. At the scale of 10,000+ GPU clusters, hardware failures are not exceptional events but statistically near-certain over training runs lasting weeks.

Why Fault Tolerance Is Necessary at Scale

- Single GPU MTBF (Mean Time Between Failures): ~1 year.
- 10,000 GPU cluster: Expected failures per day = 10,000 / 365 ≈ 27 GPU failures/day.
- A 2-week LLM training job: ~380 GPU failures expected → without checkpointing, the first failure discards all accumulated progress.
- With hourly checkpoints: At most ~1 hour of compute lost per failure (≈0.3% of a 2-week run), as sketched below.
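
A back-of-envelope sketch of this failure math; the ~1-year MTBF and hourly checkpoint interval are the assumptions stated above:

```python
# Back-of-envelope failure math for a large GPU cluster.
GPU_COUNT = 10_000
MTBF_DAYS = 365             # assumed: ~1 year mean time between failures per GPU
JOB_DAYS = 14               # 2-week training job
CKPT_INTERVAL_HOURS = 1     # hourly checkpoints

failures_per_day = GPU_COUNT / MTBF_DAYS                 # ≈ 27 failures/day
failures_per_job = failures_per_day * JOB_DAYS           # ≈ 380 failures over the run
max_loss_per_failure = GPU_COUNT * CKPT_INTERVAL_HOURS   # ≤ 10,000 GPU-hours lost per failure

print(f"{failures_per_day:.1f} failures/day, {failures_per_job:.0f} per job, "
      f"<= {max_loss_per_failure:,} GPU-hours lost per failure")
```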

Checkpoint Types

| Type | Scope | Checkpoint speed | Recovery | Overhead |
|------|-------|------------------|----------|----------|
| Application-level | User code saves model weights and key state | Fast, targeted | Application restores its own state | Low if infrequent |
| System-level (transparent) | OS/library snapshots all process memory | Slow (complete state) | Fully transparent | High (copies all memory) |
| Coordinated | All processes checkpoint simultaneously | Slow (global coordination) | Consistent global snapshot | Significant |
| Uncoordinated | Each process checkpoints independently | Fast | Complex (rollback cascades possible) | Variable |

Application-Level Checkpointing (Deep Learning)

- Save model weights + optimizer state + training step counter to persistent storage (HDFS, S3, NFS).
- PyTorch: torch.save(checkpoint, path) → saves a dict of state dicts (model, optimizer, step).
- Resume: Load the dict with torch.load(path), then model.load_state_dict(...) and optimizer.load_state_dict(...) → continue training from the saved step (see the sketch after this list).
- Frequency: Checkpoint every 100–1000 training steps (1–10 minutes typically).
- Storage: LLM checkpoints can reach hundreds of GB → fast NVMe or a parallel file system is needed.
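
A minimal application-level checkpoint sketch in PyTorch; the dict keys and file path are illustrative, and `model`, `optimizer`, and `step` are assumed to come from the surrounding training loop:

```python
import torch

def save_checkpoint(model, optimizer, step, path="checkpoint.pt"):
    # Save everything needed to resume: weights, optimizer state, step counter.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
    }, path)

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]   # resume the training loop from this step
```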

DMTCP (Distributed Multi-Threaded CheckPointing)

- Transparent checkpointing at OS/library level → works without modifying application.
- Intercepts system calls → saves file descriptors, memory maps, socket state.
- Supports MPI, OpenMP, multi-GPU workloads.
- Resume: Re-execute from checkpoint → process state restored → application continues.
- Use case: Legacy HPC applications that cannot be easily modified for application-level checkpointing.

Coordinated Checkpointing (MPI)

- All MPI processes checkpoint at same logical point → consistent global snapshot.
- Coordination: Blocking protocol: all processes save their state, then synchronize before resuming (see the sketch after this list).
- Problem: N processes × large memory → checkpoint I/O time grows with scale.
- Incremental checkpointing: Save only changed memory pages (dirty pages) → reduce I/O.
- Fork-based copy-on-write checkpointing: Fork the process → parent continues computing; child writes the snapshot to disk → overlaps compute and I/O.
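
A minimal coordinated-checkpoint sketch using mpi4py (an assumption; the document only names MPI). The barriers mark the common logical point, and the per-rank file name is illustrative:

```python
from mpi4py import MPI
import pickle

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def coordinated_checkpoint(local_state, step):
    comm.Barrier()  # all ranks reach the same logical point before saving
    with open(f"ckpt_step{step}_rank{rank}.pkl", "wb") as f:
        pickle.dump(local_state, f)
    comm.Barrier()  # no rank resumes until every rank's file is on disk
```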

Asynchronous Checkpointing

- Main process: Continues computation after triggering checkpoint.
- Shadow process: Asynchronously writes state to disk.
- Risk: If failure occurs during async checkpoint write → last complete checkpoint used.
- Reduces checkpoint overhead from minutes to seconds by overlapping compute and I/O (see the sketch after this list).
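
A minimal async-checkpoint sketch, assuming PyTorch and a background thread as the "shadow" writer; the function name and path template are illustrative:

```python
import threading
import torch

def async_checkpoint(model, step, path_template="ckpt_step{step}.pt"):
    # Snapshot weights into CPU memory first so training can keep mutating GPU tensors.
    cpu_state = {k: v.detach().to("cpu", copy=True) for k, v in model.state_dict().items()}

    def _write():
        torch.save({"model": cpu_state, "step": step}, path_template.format(step=step))

    writer = threading.Thread(target=_write, daemon=True)
    writer.start()       # main process continues computing immediately
    return writer        # join() before exit, or to confirm the checkpoint completed
```

If the job fails while the background write is in flight, the partially written file is discarded and recovery falls back to the last complete checkpoint, as noted above.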

AI Training Checkpoint Optimization

- Mixed precision checkpoint: Save FP16 model + FP32 optimizer states separately → smaller total size.
- Sharded checkpoint: Each GPU rank saves its own state slice → parallel writes → faster I/O (see the sketch after this list).
- DeepSpeed ZeRO checkpoint: Sharded optimizer + model states → consolidate only for inference.
- Flash checkpoint (Meta, 2024): Copy checkpoint to CPU memory first → async write to disk → near-zero training pause.
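
A minimal sharded-checkpoint sketch, assuming torch.distributed is already initialized and that `model_shard` / `optim_shard` are this rank's partitions (e.g., from a ZeRO/FSDP-style setup); the file layout is illustrative:

```python
import os
import torch
import torch.distributed as dist

def save_sharded(model_shard, optim_shard, step, ckpt_dir):
    rank = dist.get_rank()
    os.makedirs(ckpt_dir, exist_ok=True)
    # Each rank writes only its own slice, so write bandwidth scales with the number of ranks.
    torch.save(
        {"model": model_shard, "optim": optim_shard, "step": step},
        os.path.join(ckpt_dir, f"shard_rank{rank}.pt"),
    )
    dist.barrier()  # the checkpoint is complete only once every shard is on disk
```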

Recovery from Failure

```
1. Detect failure: Heartbeat timeout, NCCL error, hardware watchdog
2. Kill all processes in the job
3. Identify last complete checkpoint
4. Respawn job on new healthy nodes (replace failed GPU)
5. Load checkpoint: All ranks restore from checkpoint files
6. Verify consistency: Check step number, optimizer state
7. Resume training from checkpoint step
```
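
A minimal sketch of steps 5–6 (load and verify), assuming PyTorch with torch.distributed initialized; the checkpoint keys are illustrative:

```python
import torch
import torch.distributed as dist

def restore_and_verify(model, optimizer, path):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])

    # Consistency check: every rank must have restored the same training step.
    step = torch.tensor([ckpt["step"]])
    max_step = step.clone()
    dist.all_reduce(max_step, op=dist.ReduceOp.MAX)
    assert step.item() == max_step.item(), "ranks restored different checkpoint steps"
    return ckpt["step"]
```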

Failure Detection

- Heartbeat monitoring: Each node sends periodic heartbeats → orchestrator detects silence → declares the node failed (see the sketch after this list).
- NCCL timeout: Communication operation exceeds timeout → NCCL signals failure → job manager kills job.
- Hardware watchdog: GPU driver detects GPU hang → SIGKILL to process.
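
A minimal heartbeat-timeout sketch; the 30-second threshold and the `last_seen` structure are illustrative assumptions, not any specific orchestrator's API:

```python
import time

HEARTBEAT_TIMEOUT_S = 30  # assumed threshold; tune to the cluster's heartbeat interval

def dead_nodes(last_seen, now=None):
    """Return nodes whose last heartbeat is older than the timeout.

    last_seen: dict mapping node name -> timestamp of its last heartbeat.
    """
    now = time.time() if now is None else now
    return [node for node, t in last_seen.items() if now - t > HEARTBEAT_TIMEOUT_S]

# The orchestrator updates last_seen[node] on every heartbeat and polls dead_nodes();
# any node returned is declared failed, the job is torn down, and training resumes
# from the last complete checkpoint on healthy hardware.
```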

Checkpoint/restart is the insurance policy that makes large-scale AI training economically viable. Without it, a single hardware failure in a 10,000-GPU cluster after 20 days of training would waste roughly 4.8 million GPU-hours of compute (10,000 GPUs × 480 hours); with hourly checkpoints, the same failure costs at most about 10,000 GPU-hours, transforming a catastrophic loss into a manageable interruption and enabling the multi-week training runs that produce frontier AI models.
