Fault-Tolerant Distributed Computing is the design of distributed systems that continue to operate correctly despite the failure of individual components (nodes, networks, storage), using redundancy, replication, and recovery mechanisms to mask failures from applications and users. As systems scale to thousands of nodes, component failures become statistical certainties rather than exceptions, making fault tolerance a fundamental design requirement.
Failure Classification:
- Crash Failures: a node stops executing and doesn't recover — the simplest failure model, handled by detecting absence (heartbeats) and replacing the failed node
- Omission Failures: a node fails to send or receive some messages — more subtle than crashes, can cause protocol violations if not anticipated
- Byzantine Failures: a node behaves arbitrarily; it may send conflicting messages, corrupt data, or collude with other faulty nodes. The hardest class to tolerate, requiring 3f+1 nodes to survive f failures (see the sketch after this list)
- Network Partitions: communication between groups of nodes is severed. The CAP theorem shows that during a partition a system must sacrifice either consistency or availability; no distributed system can guarantee all three of consistency, availability, and partition tolerance at once
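The replica-count arithmetic behind these failure models is worth making concrete. A minimal sketch (function names are ours, not from any library):

```python
# Illustrative arithmetic for replica counts under different failure models.

def replicas_for_crash(f: int) -> int:
    """Crash (fail-stop) consensus needs a majority: 2f + 1 replicas tolerate f crashes."""
    return 2 * f + 1

def replicas_for_byzantine(f: int) -> int:
    """Byzantine agreement needs 3f + 1 replicas so that the 2f + 1 responses a
    client can wait for still contain a majority (f + 1) of correct nodes even
    when f of those responses come from faulty nodes."""
    return 3 * f + 1

if __name__ == "__main__":
    for f in (1, 2, 3):
        print(f"f={f}: crash={replicas_for_crash(f)}, "
              f"byzantine={replicas_for_byzantine(f)}")
```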
Checkpoint/Restart:
- Coordinated Checkpointing: all processes synchronize and write their state to stable storage simultaneously — creates a globally consistent snapshot but the coordination barrier limits scalability
- Uncoordinated Checkpointing: each process checkpoints independently — avoids synchronization overhead but recovery requires finding a consistent cut across independent checkpoints, risking the domino effect (cascading rollbacks)
- Incremental Checkpointing: only saves pages modified since the last checkpoint, reducing checkpoint volume by 60-90% via dirty-page tracking (OS page protection or hash-based change detection; see the sketch after this list)
- Multi-Level Checkpointing: stores checkpoints at multiple levels — L1 in local RAM (fast, survives process crash), L2 on partner node (survives node crash), L3 on parallel file system (survives rack failure) — SCR library implements this hierarchy
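A minimal sketch of hash-based incremental checkpointing, assuming a flat byte buffer as the process state and SHA-256 chunk hashes as the change detector (real systems more often use OS page protection; all names here are illustrative):

```python
# Split state into fixed-size chunks, hash each chunk, and persist only the
# chunks whose hash changed since the last checkpoint.
import hashlib

CHUNK_SIZE = 4096  # stand-in for an OS page

def chunk_hashes(state: bytes) -> list[str]:
    return [hashlib.sha256(state[i:i + CHUNK_SIZE]).hexdigest()
            for i in range(0, len(state), CHUNK_SIZE)]

def incremental_checkpoint(state: bytes, prev_hashes: list[str]):
    """Return (dirty chunks to persist, new hash list)."""
    new_hashes = chunk_hashes(state)
    dirty = {i: state[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE]
             for i, h in enumerate(new_hashes)
             if i >= len(prev_hashes) or h != prev_hashes[i]}
    return dirty, new_hashes

if __name__ == "__main__":
    state = bytearray(8 * CHUNK_SIZE)
    _, hashes = incremental_checkpoint(bytes(state), [])  # full first checkpoint
    state[5 * CHUNK_SIZE] = 0xFF                          # dirty one "page"
    dirty, hashes = incremental_checkpoint(bytes(state), hashes)
    print(f"chunks to persist: {sorted(dirty)}")          # -> [5]
```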
Replication Strategies:
- Active Replication: all replicas process every request independently and vote on the output — tolerates Byzantine failures but requires deterministic execution and 3f+1 replicas for f failures
- Passive Replication (Primary-Backup): one primary processes requests and forwards state updates to backups — on primary failure, a backup takes over — simpler and cheaper than active replication but doesn't handle Byzantine failures
- Chain Replication: requests flow through a chain of replicas (head processes writes, tail responds to reads) — provides strong consistency with high throughput by distributing work across the chain
- Quorum Replication: reads and writes require responses from R and W replicas respectively, where R + W > N guarantees that every read quorum overlaps every write quorum. A tunable consistency-availability tradeoff (W=1 for fast writes, R=1 for fast reads), sketched below
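To make the quorum-intersection argument concrete, here is a toy in-memory quorum store (names and structure are ours, not a real client library). Because R + W > N, any read quorum contains at least one replica holding the latest completed write:

```python
# Toy quorum store: writes go to W replicas, reads poll R replicas and take
# the highest-versioned answer. Real systems use per-key versions and fixed
# preference lists; random sampling here just keeps the sketch short.
import random

class QuorumStore:
    def __init__(self, n: int, r: int, w: int):
        assert r + w > n, "read and write quorums must intersect"
        self.n, self.r, self.w = n, r, w
        self.replicas = [{} for _ in range(n)]  # key -> (version, value)
        self.version = 0

    def write(self, key, value):
        self.version += 1
        for rep in random.sample(self.replicas, self.w):  # wait for W acks
            rep[key] = (self.version, value)

    def read(self, key):
        answers = [rep.get(key, (0, None))
                   for rep in random.sample(self.replicas, self.r)]
        return max(answers, key=lambda vv: vv[0])[1]  # highest version wins

store = QuorumStore(n=5, r=2, w=4)  # R + W = 6 > 5
store.write("x", "v1")
store.write("x", "v2")
print(store.read("x"))  # always "v2": any 2 replicas include a fresh one
```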
Failure Detection:
- Heartbeat Protocols: nodes periodically send heartbeat messages to a monitor — failure is suspected after missing k consecutive heartbeats (typically k=3-5 with 1-5 second intervals)
- Phi Accrual Detector: instead of binary alive/dead decisions, computes a suspicion level (φ) from the distribution of heartbeat inter-arrival times; φ > 8 typically indicates failure with high confidence (see the sketch after this list)
- SWIM Protocol: Scalable Weakly-consistent Infection-style Membership; combines direct probing with indirect probing through randomly selected peers and disseminates membership changes via gossip. Expected failure detection time is constant with O(1) message load per node, while membership updates propagate in O(log n) gossip rounds
- Perfect vs. Eventual Detectors: perfect failure detectors (complete and accurate) are impossible in asynchronous systems — practical detectors are eventually accurate (may temporarily suspect correct nodes)
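A minimal sketch of the phi accrual computation under a normal-distribution model of heartbeat inter-arrival times (Hayashibara et al. use a similar approximation; the window size and clamping here are our simplifications):

```python
# phi = -log10(P(a heartbeat arrives later than the time already elapsed)).
# phi = 1 means ~10% chance the node is still alive; phi = 8 means ~10^-8.
import math, statistics

class PhiAccrualDetector:
    def __init__(self, window: int = 100):
        self.intervals: list[float] = []
        self.window = window
        self.last_arrival = None

    def heartbeat(self, now: float) -> None:
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
            self.intervals = self.intervals[-self.window:]
        self.last_arrival = now

    def phi(self, now: float) -> float:
        if len(self.intervals) < 2:
            return 0.0
        mean = statistics.mean(self.intervals)
        std = max(statistics.stdev(self.intervals), 1e-6)  # avoid div by zero
        elapsed = now - self.last_arrival
        # P(next heartbeat arrives later than `elapsed`) under a normal model
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-300))

det = PhiAccrualDetector()
for t in range(50):                # heartbeats arriving every 1 s
    det.heartbeat(float(t))
print(det.phi(49.5))               # recent heartbeat: phi near 0
print(det.phi(60.0))               # 11 s of silence: phi far above 8, suspect
```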
Fault Tolerance in HPC:
- MPI Fault Tolerance: standard MPI aborts the entire job on any process failure — ULFM (User-Level Failure Mitigation) proposal adds MPI_Comm_revoke and MPI_Comm_shrink to enable application-level recovery
- Algorithm-Based Fault Tolerance (ABFT): encodes redundancy into the computation itself. For matrix operations, maintaining row/column checksums allows detecting and correcting single-node data corruption without full checkpoint/restart (see the sketch after this list)
- Proactive Migration: monitoring hardware health indicators (ECC error rates, temperature trends) and migrating processes away from predicted failures before they occur — reduces unexpected failures by 40-60%
- Elastic Scaling: frameworks like Spark and Ray automatically redistribute work when nodes fail or join — the computation continues with reduced parallelism rather than aborting
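A compact illustration of ABFT in the Huang-Abraham style for matrix multiplication, using numpy; the checksum layout is the standard one, but the function names and error tolerance are ours:

```python
# Append a column-checksum row to A and a row-checksum column to B; the
# product then carries both checksums, so a single corrupted entry can be
# located at the intersection of the failing row and column checksums and
# repaired from the checksum residual.
import numpy as np

def abft_multiply(A, B):
    Af = np.vstack([A, A.sum(axis=0)])                 # extra row: col checksums
    Bf = np.hstack([B, B.sum(axis=1, keepdims=True)])  # extra col: row checksums
    return Af @ Bf                                     # fully checksummed product

def detect_and_correct(Cf):
    n, m = Cf.shape[0] - 1, Cf.shape[1] - 1
    row_err = Cf[:n, :m].sum(axis=1) - Cf[:n, m]       # row checksum residuals
    col_err = Cf[:n, :m].sum(axis=0) - Cf[n, :m]       # column checksum residuals
    bad_r = np.flatnonzero(np.abs(row_err) > 1e-6)
    bad_c = np.flatnonzero(np.abs(col_err) > 1e-6)
    if len(bad_r) == 1 and len(bad_c) == 1:            # single corrupted entry
        Cf[bad_r[0], bad_c[0]] -= row_err[bad_r[0]]
    return Cf[:n, :m]

rng = np.random.default_rng(0)
A, B = rng.random((4, 4)), rng.random((4, 4))
Cf = abft_multiply(A, B)
Cf[2, 1] += 7.0                                        # simulate corruption
print(np.allclose(detect_and_correct(Cf), A @ B))      # True
```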
Recovery Techniques:
- Rollback Recovery: restore process state from the most recent checkpoint and replay logged messages; recovery time grows with the checkpoint interval (lost work to redo) and the volume of logged messages to replay
- Forward Recovery: continue execution without rollback by recomputing lost results from available data — possible when the computation is idempotent or redundantly encoded
- Lineage-Based Recovery (Spark): instead of checkpointing intermediate data, track the sequence of transformations (lineage); on failure, recompute lost partitions from the original input data by replaying the lineage (see the toy example after this list)
- Transaction Rollback: databases use write-ahead logging (WAL) to ensure atomic transactions — on failure, incomplete transactions are rolled back using the log while committed data is preserved
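A toy of lineage-based recovery in the spirit of Spark's RDDs (class and method names are illustrative, not Spark's API): a lost partition is rebuilt by replaying transformations from the source rather than restored from a checkpoint:

```python
# Each dataset remembers its parent and transformation instead of
# checkpointing results, so any partition can be recomputed on demand.
class Dataset:
    def __init__(self, partitions=None, parent=None, fn=None):
        self._cache = partitions        # materialized partitions (may be lost)
        self.parent, self.fn = parent, fn

    def map(self, fn):
        return Dataset(parent=self, fn=fn)   # record lineage, compute nothing

    def partition(self, i):
        if self._cache is not None and self._cache[i] is not None:
            return self._cache[i]
        # cache miss (e.g. worker lost): replay lineage for just this partition
        return [self.fn(x) for x in self.parent.partition(i)]

source = Dataset(partitions=[[1, 2], [3, 4], [5, 6]])
derived = source.map(lambda x: x * 10).map(lambda x: x + 1)
print(derived.partition(1))   # [31, 41], computed by replaying the lineage
# nothing is cached on `derived`, so after a simulated worker loss the same
# call rebuilds the partition from `source` instead of restoring a checkpoint
print(derived.partition(1))   # [31, 41] again, recomputed rather than restored
```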
Fault tolerance introduces overhead (5-30% for checkpointing, 2-3× for full replication) but is non-negotiable at scale: a 10,000-node cluster with a 5-year MTTF per node experiences a node failure roughly every 4.4 hours (worked out below), making any long-running computation impossible without fault tolerance mechanisms.
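The arithmetic behind that closing claim, together with Young's classic approximation for the checkpoint interval that balances checkpoint cost against expected lost work (the 5-minute checkpoint cost is an assumed figure, not from the text above):

```python
# System MTBF shrinks linearly with node count; Young's approximation then
# gives a near-optimal checkpoint interval: t_opt ~ sqrt(2 * cost * MTBF).
import math

NODES = 10_000
NODE_MTTF_H = 5 * 365 * 24            # 5-year per-node MTTF, in hours
system_mtbf_h = NODE_MTTF_H / NODES   # failures arrive N times as often
print(f"system MTBF: {system_mtbf_h:.1f} h")   # ~4.4 h between node failures

checkpoint_cost_min = 5.0             # assumed time to write one checkpoint
t_opt_min = math.sqrt(2 * checkpoint_cost_min * system_mtbf_h * 60)
print(f"Young's checkpoint interval: ~{t_opt_min:.0f} min")  # ~51 min
```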