Distributed Consensus Algorithms

Keywords: distributed consensus algorithms, paxos raft leader election, byzantine fault tolerance, state machine replication consensus, two phase commit distributed

Distributed Consensus Algorithms are protocols that enable multiple nodes in a distributed system to agree on a single value or sequence of values despite node failures and network partitions — consensus is the foundational primitive for building reliable distributed systems including databases, coordination services, and blockchain networks.

The Consensus Problem:
- Agreement: all non-faulty nodes must decide on the same value — once a value is decided, the decision is irrevocable
- Validity: the decided value must have been proposed by some node — the algorithm cannot fabricate values
- Termination: all non-faulty nodes must eventually decide — liveness guarantee that prevents indefinite blocking
- FLP Impossibility: Fischer, Lynch, and Paterson proved that deterministic consensus is impossible in an asynchronous system with even one crash failure — practical algorithms circumvent this with timeouts, randomization, or partial synchrony assumptions (an interface sketch encoding these properties follows this list)
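Expressed as an API, the three properties above form a specification that any consensus implementation must satisfy. The following Go sketch is illustrative only; the interface name and method signatures are assumptions, not taken from any particular library:

```go
package consensus

import "context"

// Consensus models a single-value consensus instance as seen by one node.
// The names here are illustrative; real systems (Paxos, Raft) expose
// richer, log-oriented APIs.
type Consensus interface {
	// Propose submits this node's candidate value for the instance.
	Propose(ctx context.Context, value []byte) error

	// Decide blocks until the instance reaches a decision.
	// Agreement: every non-faulty node observes the same value, and a
	// decision is never revoked. Validity: the value was passed to
	// Propose by some node. Termination: non-faulty nodes eventually
	// return, which per FLP requires timeouts, randomization, or
	// partial synchrony rather than pure asynchrony.
	Decide(ctx context.Context) ([]byte, error)
}
```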

Paxos Algorithm:
- Roles: proposers suggest values, acceptors vote on proposals, learners discover decided values — a single node can serve multiple roles simultaneously
- Phase 1 (Prepare): proposer sends Prepare(n) with proposal number n to a majority of acceptors — acceptors promise not to accept proposals numbered less than n and return the highest-numbered proposal they have already accepted, if any
- Phase 2 (Accept): if the proposer receives promises from a majority, it sends Accept(n, v) where v is the value of the highest-numbered accepted proposal reported in those promises (or the proposer's own value if none was reported) — acceptors accept unless they have since promised a higher-numbered proposal (see the acceptor sketch after this list)
- Multi-Paxos: optimizes repeated consensus decisions by electing a stable leader that skips Phase 1 — amortizes leader election cost over many consensus instances, reducing round trips from 2 to 1 per decision
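A minimal sketch of the acceptor side of single-decree Paxos, assuming integer proposal numbers that are unique per proposer; all type and method names here are illustrative:

```go
package paxos

import "sync"

// Acceptor holds the durable state a Paxos acceptor must persist.
// Proposal numbers are plain ints for brevity; real systems make them
// unique per proposer (e.g. round<<16 | proposerID).
type Acceptor struct {
	mu          sync.Mutex
	promised    int // highest proposal number promised in Phase 1
	acceptedNum int // number of the last accepted proposal (0 = none)
	acceptedVal any // value of the last accepted proposal
}

// Prepare handles Phase 1. If n exceeds any prior promise, the acceptor
// promises to ignore proposals numbered below n and reports the
// highest-numbered proposal it has already accepted, if any.
func (a *Acceptor) Prepare(n int) (ok bool, accNum int, accVal any) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if n <= a.promised {
		return false, 0, nil // already promised an equal or higher number
	}
	a.promised = n
	return true, a.acceptedNum, a.acceptedVal
}

// Accept handles Phase 2. The acceptor accepts (n, v) unless it has
// promised a strictly higher-numbered proposal in the meantime.
func (a *Acceptor) Accept(n int, v any) bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	if n < a.promised {
		return false
	}
	a.promised = n
	a.acceptedNum = n
	a.acceptedVal = v
	return true
}
```

A proposer that collects Prepare acknowledgments from a majority must adopt the value of the highest-numbered accepted proposal reported back; this is what keeps a possibly-decided value stable across competing proposers.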

Raft Algorithm:
- Leader Election: nodes start as followers, transition to candidates after an election timeout, and request votes from peers — a candidate receiving votes from a majority becomes leader for the current term
- Log Replication: the leader appends client commands to its log and replicates entries to followers via AppendEntries RPCs — an entry is committed when replicated to a majority of nodes
- Safety Guarantees: only candidates whose logs are at least as up to date as a voting majority can be elected leader (election restriction), committed entries are never lost (Leader Completeness), and if two logs contain an entry with the same index and term, the logs are identical through that entry (Log Matching Property) — the vote-handler sketch after this list shows the election restriction
- Understandability: Raft was explicitly designed for understandability — its decomposition into leader election, log replication, and safety makes it significantly easier to implement correctly than Paxos
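A sketch of a follower's vote handler, loosely following the RequestVote rules from the Raft paper; struct and field names are assumptions, and persistence and RPC plumbing are omitted:

```go
package raft

// VoteRequest carries a candidate's term and the position of its last
// log entry, which the voter uses for the election restriction.
type VoteRequest struct {
	Term         int
	CandidateID  int
	LastLogIndex int
	LastLogTerm  int
}

// Node is the subset of follower state needed to decide a vote.
type Node struct {
	currentTerm int
	votedFor    int // -1 = no vote granted this term
	lastLogIdx  int
	lastLogTerm int
}

// HandleVoteRequest grants a vote only if the candidate's term is current
// and its log is at least as up to date as the voter's: a higher last log
// term wins, ties are broken by log length. This keeps nodes missing
// committed entries from becoming leader.
func (n *Node) HandleVoteRequest(req VoteRequest) bool {
	if req.Term < n.currentTerm {
		return false // stale candidate from an old term
	}
	if req.Term > n.currentTerm {
		n.currentTerm = req.Term // newer term: reset our vote
		n.votedFor = -1
	}
	upToDate := req.LastLogTerm > n.lastLogTerm ||
		(req.LastLogTerm == n.lastLogTerm && req.LastLogIndex >= n.lastLogIdx)
	if (n.votedFor == -1 || n.votedFor == req.CandidateID) && upToDate {
		n.votedFor = req.CandidateID
		return true
	}
	return false
}
```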

Byzantine Fault Tolerance (BFT):
- Byzantine Failures: nodes may behave arbitrarily (crash, send conflicting messages, collude) — requires 3f+1 nodes to tolerate f Byzantine failures vs. 2f+1 for crash failures (the quorum arithmetic is sketched after this list)
- PBFT (Practical BFT): Castro and Liskov's protocol runs three phases (pre-prepare, prepare, commit) across 3f+1 replicas — it commits a request in three message delays with O(n²) message complexity
- Blockchain Consensus: Nakamoto consensus (Bitcoin) uses proof-of-work to achieve probabilistic BFT in a permissionless setting — sacrifices finality for open participation
- HotStuff: linear-complexity BFT protocol using threshold signatures — reduces message complexity from O(n²) to O(n), adopted by Meta's Diem (Libra) blockchain
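The quorum arithmetic behind the 3f+1 bound reduces to a few lines; the helper below is an illustrative sketch, not part of any protocol implementation:

```go
package bft

// MinClusterSize returns the smallest cluster that tolerates f failures
// under each fault model. Crash tolerance only needs overlapping
// majorities (2f+1 nodes). Byzantine tolerance needs any two quorums of
// size 2f+1 to overlap in at least f+1 nodes, so the overlap contains at
// least one honest node even if f of them are faulty; that forces a
// cluster of 3f+1 nodes.
func MinClusterSize(f int, byzantine bool) int {
	if byzantine {
		return 3*f + 1
	}
	return 2*f + 1
}

// QuorumSize returns how many matching votes a node must collect:
// a simple majority (f+1 of 2f+1) for crash faults, or 2f+1 of 3f+1
// for Byzantine faults.
func QuorumSize(f int, byzantine bool) int {
	if byzantine {
		return 2*f + 1
	}
	return f + 1
}
```

For f = 1 this gives a 4-node PBFT cluster with quorums of 3, versus a 3-node Raft or Paxos cluster with quorums of 2.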

Practical Implementations:
- ZooKeeper (ZAB): atomic broadcast protocol similar to Paxos — provides distributed coordination primitives (locks, barriers, leader election) used by Kafka, HBase, and Hadoop
- etcd (Raft): distributed key-value store powering Kubernetes cluster coordination — implements Raft with snapshotting, log compaction, and membership changes (a client sketch follows this list)
- CockroachDB: uses Raft consensus per data range for strongly consistent distributed SQL — thousands of independent Raft groups coordinate data across nodes
- Google Spanner: combines Paxos with TrueTime (a clock API backed by GPS receivers and atomic clocks that exposes bounded clock uncertainty) for globally consistent transactions — externally consistent reads without coordination using timestamp ordering
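To see a consensus round from the application side: a write through etcd is acknowledged only after the entry is committed by its Raft group. A minimal sketch using the go.etcd.io/etcd/client/v3 package; the endpoint address and keys are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to a (placeholder) local etcd endpoint.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// The Put returns only after the entry is replicated to a majority
	// of the cluster's Raft group and committed.
	if _, err := cli.Put(ctx, "/config/leader-region", "us-east-1"); err != nil {
		log.Fatal(err)
	}

	// Reads are linearizable by default, served via the Raft leader.
	resp, err := cli.Get(ctx, "/config/leader-region")
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}
}
```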

Performance Considerations:
- Latency: consensus requires at minimum one round trip to a majority — cross-datacenter Paxos/Raft adds 50-200ms per decision due to WAN latency
- Throughput: batching multiple proposals into a single consensus round amortizes overhead — Multi-Paxos with batching achieves 100K+ decisions/second on modern hardware
- Flexible Paxos: relaxes the requirement that prepare and accept quorums must both be majorities — any two quorums that intersect suffice, allowing optimization for read-heavy or write-heavy workloads (see the quorum check below)
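When quorums are defined purely by size, the Flexible Paxos intersection requirement collapses to one inequality. A sketch, with illustrative names:

```go
package fpaxos

// ValidQuorums reports whether Phase 1 (prepare) quorums of size q1 and
// Phase 2 (accept) quorums of size q2 are safe in an n-node cluster.
// Every prepare quorum must intersect every accept quorum, which for
// size-based quorums holds exactly when q1 + q2 > n. Classic Paxos uses
// majorities for both: q1 = q2 = n/2 + 1.
func ValidQuorums(n, q1, q2 int) bool {
	return q1 >= 1 && q2 >= 1 && q1+q2 > n
}
```

For example, with n = 5 a write-optimized deployment could pick q2 = 2 so that commits need only two accepts, at the cost of q1 = 4 during the comparatively rare leader-election phase.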

Consensus algorithms are the backbone of modern distributed infrastructure — every strongly consistent distributed database, every coordination service, and every blockchain ultimately relies on some form of consensus to ensure that distributed nodes agree on a shared state despite failures.
