Checkpoint/Restart and Fault Tolerance

Keywords: checkpoint restart fault tolerance, dmtcp checkpoint, scr scalable checkpoint, write checkpoint hdf5, resilience exascale computing

Checkpoint/Restart and Fault Tolerance enable resilience against hardware failures in long-running HPC simulations through periodic snapshots of application state, and are essential for exascale computing, where the mean time between failures is measured in hours.

System-Level vs Application-Level Checkpointing

- System-Level (DMTCP): Distributed MultiThreaded CheckPointing library. Captures the entire process state (memory, open files, network sockets). Transparent to the application.
- Advantages of System-Level: No code modification required. Works with legacy applications. Automatic recovery without application awareness.
- Disadvantages: Large checkpoint size (all of memory, including pages the application no longer needs). Slower than a selective application-level checkpoint.
- Application-Level Checkpointing: The application explicitly saves only the state needed to resume (not the entire memory image). Selective, optimized checkpoints at the cost of code instrumentation; a minimal sketch follows this list.
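
Below is a minimal application-level checkpoint sketch in C. It is illustrative only: the file name, field names, and binary layout are assumptions, and a production code would also record a format version and write atomically (write to a temporary file, then rename).

```c
/* Minimal application-level checkpoint: write only the state needed to
 * resume (iteration counter, simulation time, solution array), not the
 * whole address space. File name and layout are illustrative. */
#include <stdio.h>

int write_checkpoint(const char *path, long iter, double sim_time,
                     const double *u, size_t n)
{
    FILE *fp = fopen(path, "wb");
    if (!fp) return -1;
    int ok = fwrite(&iter, sizeof iter, 1, fp) == 1 &&
             fwrite(&sim_time, sizeof sim_time, 1, fp) == 1 &&
             fwrite(&n, sizeof n, 1, fp) == 1 &&
             fwrite(u, sizeof *u, n, fp) == n;
    return (fclose(fp) == 0 && ok) ? 0 : -1;
}

int read_checkpoint(const char *path, long *iter, double *sim_time,
                    double *u, size_t n)
{
    FILE *fp = fopen(path, "rb");
    if (!fp) return -1;                       /* no checkpoint: cold start */
    size_t stored_n = 0;
    int ok = fread(iter, sizeof *iter, 1, fp) == 1 &&
             fread(sim_time, sizeof *sim_time, 1, fp) == 1 &&
             fread(&stored_n, sizeof stored_n, 1, fp) == 1 &&
             stored_n == n &&
             fread(u, sizeof *u, n, fp) == n;
    fclose(fp);
    return ok ? 0 : -1;
}
```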

SCR (Scalable Checkpoint/Restart) Library

- SCR Design: Leverages node-local storage (NVMe, RAID) for fast checkpoints. Metadata distributed across multiple nodes for reliability.
- Checkpoint Stages: Each rank writes its checkpoint to node-local storage first (fast, ~100 GB/s on NVMe); metadata is then distributed synchronously across nodes so recovery information is available even if a node is lost (see the SCR API sketch after this list).
- Recovery: Upon failure, restart from latest checkpoint on local storage or rebuilt from distributed metadata.
- Scalability: Checkpoint time of ~1-10 seconds on 100k-node systems (vs. minutes when all ranks write to a single shared filesystem). Enables frequent checkpointing (every 100-1000 iterations).
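
A hedged sketch of one checkpoint step using the classic SCR C API (SCR_Need_checkpoint, SCR_Start_checkpoint, SCR_Route_file, SCR_Complete_checkpoint). Exact calls and constants vary between SCR versions, so treat this as the shape of the loop rather than copy-paste code; SCR_Init() is assumed to have been called after MPI_Init(), and error handling is abbreviated.

```c
/* Sketch of an SCR checkpoint step: SCR decides when to checkpoint,
 * routes the file to node-local storage, and applies its redundancy
 * scheme when the checkpoint is marked complete. */
#include <mpi.h>
#include <scr.h>
#include <stdio.h>

void checkpoint_step(int rank, const double *u, size_t n)
{
    int need = 0;
    SCR_Need_checkpoint(&need);          /* SCR decides based on its config */
    if (!need) return;

    SCR_Start_checkpoint();

    char name[256], path[SCR_MAX_FILENAME];
    snprintf(name, sizeof name, "ckpt_rank_%d.bin", rank);
    SCR_Route_file(name, path);          /* SCR maps name -> node-local path */

    int valid = 0;
    FILE *fp = fopen(path, "wb");
    if (fp) {
        valid = (fwrite(u, sizeof *u, n, fp) == n);
        fclose(fp);
    }
    SCR_Complete_checkpoint(valid);      /* redundancy/metadata applied here */
}
```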

Checkpoint Interval Optimization (Young's Formula)

- Optimal Interval: T_opt = sqrt(2 × MTBF × C), where C is the time to write one checkpoint. Balances checkpoint overhead against the work lost when a failure occurs.
- MTBF (Mean Time Between Failures): System reliability metric; it shrinks as the machine grows. A 100k-node system with a 24-hour system MTBF and a checkpoint write time of ~10 minutes is a typical planning point.
- Young's Formula Derivation: Minimize the expected overhead per unit of useful work, approximately C/T (checkpoint cost) + T/(2 × MTBF) (expected rework after a failure); setting the derivative with respect to T to zero gives T_opt = sqrt(2 × MTBF × C).
- Practical Example: MTBF = 1 day, checkpoint time C = 10 min. Optimal interval = sqrt(2 × 86400 × 600) ≈ 10,200 sec ≈ 2.8 hours (worked in the sketch below). Trade-off: frequent checkpoints add overhead; sparse checkpoints lengthen recovery.
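
The same arithmetic as the practical example above, as a tiny dependency-free C program:

```c
/* Worked example of Young's formula: T_opt = sqrt(2 * C * MTBF),
 * with C = checkpoint write time and MTBF in the same unit (seconds). */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double mtbf = 86400.0;                 /* 1 day in seconds      */
    double c    = 600.0;                   /* 10-minute checkpoint  */
    double t_opt = sqrt(2.0 * c * mtbf);   /* ~10,182 s, ~2.8 hours */
    printf("optimal checkpoint interval: %.0f s (%.1f h)\n",
           t_opt, t_opt / 3600.0);
    printf("steady-state checkpoint overhead: %.1f%%\n", 100.0 * c / t_opt);
    return 0;
}
```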

HDF5 and Parallel HDF5 for Checkpoint Data

- HDF5 (Hierarchical Data Format 5): Self-describing binary format. Metadata + data coexist. Ideal for large scientific datasets.
- Parallel HDF5 (pHDF5): Collective I/O from all MPI ranks into a single file. MPI-IO underneath enables all ranks to write to one shared file in parallel (see the sketch after this list).
- Checkpoint Structure: Groups per timestep (e.g., /timestep_1000/), datasets for arrays/fields. Metadata (simulation time, iteration count) stored as HDF5 attributes.
- I/O Performance: pHDF5 achieves 100-500 GB/s aggregate throughput on large clusters (vs. ~10-50 GB/s when all data is funneled through a single writer).
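
A sketch of a collective pHDF5 checkpoint write in C. The group/dataset names (/timestep_N, field_u), the file name, and the 1-D block decomposition follow the structure described above but are otherwise assumptions; error checking is omitted for brevity.

```c
/* Each rank writes its slice of a distributed 1-D field into one shared
 * HDF5 file using collective MPI-IO transfers. */
#include <hdf5.h>
#include <mpi.h>
#include <stdio.h>

void write_phdf5_checkpoint(MPI_Comm comm, const double *u,
                            hsize_t local_n, long iter)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* Open one file collectively over MPI-IO. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
    hid_t file = H5Fcreate("checkpoint.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One group per timestep, e.g. /timestep_1000. */
    char gname[64];
    snprintf(gname, sizeof gname, "/timestep_%ld", iter);
    hid_t grp = H5Gcreate2(file, gname, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Global dataset; each rank selects its own hyperslab. */
    hsize_t global_n = local_n * (hsize_t)nprocs;
    hsize_t offset   = local_n * (hsize_t)rank;
    hid_t fspace = H5Screate_simple(1, &global_n, NULL);
    hid_t dset = H5Dcreate2(grp, "field_u", H5T_NATIVE_DOUBLE, fspace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hid_t mspace = H5Screate_simple(1, &local_n, NULL);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &offset, NULL, &local_n, NULL);

    /* Collective transfer: all ranks write into the single shared file. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, u);

    H5Pclose(dxpl); H5Sclose(mspace); H5Dclose(dset); H5Sclose(fspace);
    H5Gclose(grp); H5Fclose(file); H5Pclose(fapl);
}
```

The H5FD_MPIO_COLLECTIVE transfer property is what lets the MPI-IO layer aggregate writes from all ranks into large, well-aligned requests to the parallel filesystem.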

In-Memory Checkpointing

- Memory-to-Memory Checkpoints: Store checkpoints in node-local DRAM (redundant copy) instead of persistent storage. Faster than disk I/O.
- Redundancy Pattern: The checkpoint of rank R is also stored on a different node (cross-node mirroring), so a single node failure can be recovered from the mirror; see the mirroring sketch after this list.
- Trade-off: Extra memory overhead (~2× for mirroring). Protects against a single node failure; losing a node and its mirror at the same time is unrecoverable.
- Hybrid: Disk checkpoint every N iterations (slow but persistent), in-memory checkpoints between disk checkpoints (fast). Best of both.
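
A sketch of the cross-node mirroring exchange using MPI_Sendrecv. The ring partner choice (rank + 1) is an assumption for illustration; in practice the partner is placed on a different node (ideally a different failure domain) than the rank it protects.

```c
/* Partner-copy mirroring: each rank keeps a DRAM copy of its left
 * neighbour's checkpoint, so a single failed rank can be restored from
 * the rank that holds its mirror. */
#include <mpi.h>

void mirror_checkpoint(MPI_Comm comm, const double *local_ckpt, int n,
                       double *partner_ckpt /* size n, neighbour's copy */)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    int right = (rank + 1) % nprocs;            /* holds my mirror        */
    int left  = (rank - 1 + nprocs) % nprocs;   /* I hold this rank's copy */

    /* Send my checkpoint to the right neighbour, receive the left
     * neighbour's checkpoint into DRAM-resident partner_ckpt. */
    MPI_Sendrecv(local_ckpt, n, MPI_DOUBLE, right, 0,
                 partner_ckpt, n, MPI_DOUBLE, left, 0,
                 comm, MPI_STATUS_IGNORE);
}
```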

Silent Data Corruption (SDC) Detection

- Silent Error Threat: Bit flips caused by cosmic rays or marginal hardware can pass through undetected. Undetected errors silently produce incorrect results, compromising scientific validity.
- Detection via Redundancy: Dual computation (compute twice, compare results). Bit flips detected via mismatch. Redundancy overhead ~100% (2x compute).
- Application-Level Detection: Sanity checks on computed quantities. Example: conservation laws (energy, mass), bounds checks (physical reasonableness).
- Lightweight Checking: Checksum computation (e.g., a CRC over the checkpoint data). Detects most corruption with ~10% overhead (a single pass over the data); see the checksum sketch after this list.
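
A self-contained sketch of the lightweight-checking idea: hash the checkpoint buffer when it is written and re-verify before it is used. FNV-1a is used here only to keep the example dependency-free; a CRC (e.g. zlib's crc32) plays the same role.

```c
/* Lightweight SDC check: record a checksum at write time, recompute it
 * before the data is consumed, and compare. */
#include <stdint.h>
#include <stddef.h>

uint64_t fnv1a_64(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 0xcbf29ce484222325ULL;      /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;               /* FNV prime */
    }
    return h;
}

/* Returns 1 if the buffer still matches the checksum recorded at write time. */
int checkpoint_intact(const double *u, size_t n, uint64_t stored_checksum)
{
    return fnv1a_64(u, n * sizeof *u) == stored_checksum;
}
```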

Exascale Fault Tolerance Challenges

- Failure Rate Scaling: System MTBF is roughly inversely proportional to system size. An exascale system (10^18 FLOP/s, i.e. 1000 PFLOP/s) aggregates so many components that MTBF drops to ~30 minutes.
- Checkpoint Overhead: Checkpoint time scales with data size. Exascale checkpoints run ~100 GB-1 TB per node (petabytes in aggregate); at ~10 TB/s aggregate I/O bandwidth that is roughly 10-100 seconds per checkpoint.
- Optimal Interval: sqrt(2 × 1800 s × 50 s) ≈ 424 seconds, i.e. a checkpoint roughly every 7 minutes. Overhead: 50/424 ≈ 12% of runtime lost to checkpointing.
- Novel Approaches: Algorithmic redundancy (encode computation, tolerate errors), lossy compression (lower checkpoint precision for reduced size), in-situ analytics (checkpoint only critical outputs).

Recovery Mechanisms and Rollback

- Checkpoint-Restart Workflow: When a failure is detected, the job is stopped, the latest checkpoint is loaded, and the simulation resumes from that point (see the restart sketch after this list).
- Rollback Logic: The simulation time is set back to the checkpoint timestamp, iteration counters are reset, and environment and pseudo-random-number state are restored.
- Data Validity: After recovery the simulation continues as if the failure never occurred; recovery is transparent from the application's perspective.
- Checkpoint Frequency Trade-off: Frequent checkpoints mean little lost work on rollback but high overhead; sparse checkpoints mean low overhead but more lost work. The interval is chosen to balance the two (Young's formula above).
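
A sketch of a restart-aware driver loop, reusing the hypothetical read_checkpoint helper from the application-level sketch above. In production the full pseudo-random-number state is saved in the checkpoint itself rather than reseeded as done here.

```c
/* Restart-aware driver: try to load the latest checkpoint at startup;
 * on success resume from the stored iteration and time, otherwise cold
 * start. Helper names (read_checkpoint, simulate_step) are illustrative. */
#include <stdio.h>
#include <stdlib.h>

int read_checkpoint(const char *path, long *iter, double *sim_time,
                    double *u, size_t n);              /* see sketch above */
void simulate_step(double *u, size_t n, double *sim_time);

void run(double *u, size_t n, long max_iter, unsigned rng_seed)
{
    long iter = 0;
    double sim_time = 0.0;

    if (read_checkpoint("latest.ckpt", &iter, &sim_time, u, n) == 0)
        printf("restarting from iteration %ld, t = %g\n", iter, sim_time);
    /* else: no checkpoint found, cold start from iteration 0 */

    /* Simplification: reseed deterministically from the restored iteration.
     * A production code restores the saved RNG state instead. */
    srand(rng_seed + (unsigned)iter);

    for (; iter < max_iter; iter++) {
        simulate_step(u, n, &sim_time);
        /* periodic write_checkpoint / SCR calls omitted (see above) */
    }
}
```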
