Distributed Consensus Protocols

Keywords: distributed consensus protocol, raft consensus, paxos consensus, distributed agreement, consensus algorithm

Distributed Consensus Protocols are the algorithms that enable a group of distributed nodes to agree on a single value or sequence of operations despite node failures and network partitions, solving the fundamental problem of maintaining consistency in distributed systems. Paxos and Raft are the most widely deployed of these protocols, underpinning the distributed databases, configuration services, and replicated state machines in production today.

The Consensus Problem

- N nodes must agree on a value.
- Requirements: Agreement (all correct nodes decide the same value), Validity (the decided value was proposed by some node), Termination (all correct nodes eventually decide).
- FLP Impossibility (Fischer, Lynch, and Paterson, 1985): no deterministic consensus protocol can guarantee termination in an asynchronous system with even one crash failure.
- Practical protocols circumvent FLP by assuming partial synchrony: timeouts and leader election provide termination whenever the network is stable for long enough, while safety holds unconditionally (see the interface sketch below).
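
As a fixed point of reference for the protocols that follow, here is a minimal sketch in Go of the interface such a protocol exposes, with the three requirements stated as contracts. The names are illustrative, not taken from any real library:

```go
// Package consensus sketches the abstraction that Paxos and Raft implement.
package consensus

// Consensus lets N nodes agree on a single value.
type Consensus interface {
	// Propose submits this node's candidate value; it may or may
	// not end up being the one chosen.
	Propose(value []byte) error

	// Decide blocks until a value is chosen, then returns it.
	// Agreement:   every correct node returns the same value.
	// Validity:    the value was proposed by some node.
	// Termination: per FLP, only guaranteed under partial synchrony,
	//              e.g. via timeouts and leader election.
	Decide() ([]byte, error)
}
```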

Paxos (Lamport, 1989)

- Three roles: Proposer, Acceptor, Learner.
- Two phases (sketched in the acceptor code below):
  - Phase 1a (Prepare): Proposer sends prepare(n) to acceptors.
  - Phase 1b (Promise): Acceptors promise not to accept proposals numbered below n and report the highest-numbered proposal they have already accepted, if any.
  - Phase 2a (Accept): Proposer sends accept(n, value), where value comes from the highest-numbered proposal reported in Phase 1b, or is the proposer's own value if none was reported.
  - Phase 2b (Accepted): Acceptors accept the proposal unless they have since promised a higher number.
- Quorum: Majority (N/2 + 1) must respond → tolerates (N-1)/2 failures.
- Multi-Paxos: Leader elected → skips Phase 1 for subsequent values → better performance.
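
To make the two phases concrete, here is a minimal sketch of a single-decree Paxos acceptor in Go, assuming in-memory state and direct method calls standing in for network messages. All names are illustrative:

```go
package main

import "fmt"

type Proposal struct {
	N     int    // proposal number
	Value string // proposed value
}

type Acceptor struct {
	promisedN int       // highest prepare(n) promised so far
	accepted  *Proposal // highest-numbered proposal accepted, if any
}

// Prepare handles Phase 1a: promise not to accept proposals numbered
// below n, and report any previously accepted proposal so the proposer
// can adopt its value in Phase 2a.
func (a *Acceptor) Prepare(n int) (bool, *Proposal) {
	if n <= a.promisedN {
		return false, nil // already promised an equal or higher number
	}
	a.promisedN = n
	return true, a.accepted
}

// Accept handles Phase 2a: accept unless a higher-numbered proposal
// has been promised since.
func (a *Acceptor) Accept(p Proposal) bool {
	if p.N < a.promisedN {
		return false
	}
	a.promisedN = p.N
	a.accepted = &p
	return true
}

func main() {
	acc := &Acceptor{}
	ok, _ := acc.Prepare(1)
	fmt.Println("promise(1):", ok)                                   // true
	fmt.Println("accept(1, x):", acc.Accept(Proposal{N: 1, Value: "x"})) // true
	ok, prev := acc.Prepare(2)                                       // a later proposer
	fmt.Println("promise(2):", ok, "previously accepted:", *prev)    // must learn "x"
}
```

The safety-critical detail is in Prepare: by returning the previously accepted proposal, the acceptor forces any later proposer to adopt that value in Phase 2a, so two different values can never both be chosen.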

Raft (Ongaro & Ousterhout, 2014)

- Designed for understandability; offers fault tolerance and performance equivalent to Multi-Paxos with a much simpler structure.
- Three states: Leader, Follower, Candidate.
- Leader Election: Followers time out → become candidates → request votes → the first candidate to win a majority becomes leader.
- Log Replication: Leader appends entries to its log → replicates them to followers → commits an entry once a majority has stored it (see the commit-rule sketch below).
- Safety guarantee: If a log entry is committed, it will be present in all future leaders' logs.
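
As one concrete piece of the replication path, here is a sketch of the leader's commit rule, assuming the leader tracks matchIndex[i], the highest log index known to be replicated on follower i. The names follow the Raft paper, but the code is illustrative:

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndex returns the highest log index replicated on a majority of
// the cluster, counting the leader's own log.
func commitIndex(leaderLastIndex int, matchIndex []int) int {
	all := append([]int{leaderLastIndex}, matchIndex...)
	sort.Sort(sort.Reverse(sort.IntSlice(all)))
	// The value at the majority position is present on at least
	// len(all)/2 + 1 nodes.
	return all[len(all)/2]
}

func main() {
	// 5-node cluster: leader at index 7; followers at 7, 6, 3, 2.
	fmt.Println(commitIndex(7, []int{7, 6, 3, 2})) // 6: stored on 3 of 5 nodes
}
```

One caveat the sketch omits: Raft only commits an index by counting replicas if the entry at that index is from the leader's current term, which is what makes the safety guarantee above hold across leader changes.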

Raft vs. Paxos

| Aspect | Paxos | Raft |
|--------|-------|------|
| Understandability | Notoriously complex | Designed for clarity |
| Leader | Implicit (Multi-Paxos) | Explicit, strong leader |
| Log ordering | Flexible (gaps allowed) | Strictly sequential |
| Implementation | Many variants, tricky | Straightforward |
| Performance | Similar | Similar |

Production Implementations

| System | Protocol | Use Case |
|--------|---------|----------|
| etcd | Raft | Kubernetes configuration, service discovery |
| ZooKeeper | ZAB (Paxos-like) | Hadoop coordination, distributed locking |
| CockroachDB | Raft | Distributed SQL database |
| Google Spanner | Paxos | Globally distributed database |
| TiKV | Raft | Distributed KV store (TiDB) |
| Consul | Raft | Service mesh, KV store |
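
As a concrete example of consensus behind an everyday API, here is a short usage sketch against etcd's Go client (go.etcd.io/etcd/client/v3), assuming an etcd server is reachable at localhost:2379; the key name is made up. Each Put is proposed to the Raft leader and returns only once a majority of the cluster has committed it:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// This write completes only after the entry is committed via Raft.
	if _, err := cli.Put(ctx, "config/feature-x", "on"); err != nil {
		log.Fatal(err)
	}
	resp, err := cli.Get(ctx, "config/feature-x")
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}
}
```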

Performance Considerations

- Latency: Each consensus round requires 1-2 round trips to a majority of nodes (roughly 3-5 ms within a datacenter).
- Throughput: Batching multiple client requests per consensus round amortizes the per-round overhead (see the batching sketch below).
- Multi-group (Multi-Raft): Partition data into groups, each with its own Raft instance → parallelism across groups.
- Geo-replication: Cross-datacenter RTTs (50-200 ms) make leader placement critical, since every write must reach the leader and then a quorum.
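
Here is a minimal sketch of the batching idea in Go, with an illustrative propose callback standing in for a real consensus round; the batch size and flush interval are arbitrary:

```go
package main

import (
	"fmt"
	"time"
)

// batchProposer drains pending requests and submits them as a single
// consensus round, amortizing the 1-2 round-trip cost across the batch.
func batchProposer(requests <-chan string, propose func([]string)) {
	ticker := time.NewTicker(5 * time.Millisecond) // flush interval
	defer ticker.Stop()
	var batch []string
	for {
		select {
		case req := <-requests:
			batch = append(batch, req)
			if len(batch) >= 64 { // size-based flush
				propose(batch)
				batch = nil
			}
		case <-ticker.C: // time-based flush bounds per-request latency
			if len(batch) > 0 {
				propose(batch)
				batch = nil
			}
		}
	}
}

func main() {
	reqs := make(chan string, 128)
	go batchProposer(reqs, func(b []string) {
		fmt.Printf("consensus round: %d requests\n", len(b))
	})
	for i := 0; i < 10; i++ {
		reqs <- fmt.Sprintf("req-%d", i)
	}
	time.Sleep(20 * time.Millisecond) // let the flush ticker fire
}
```

The size- and time-based flush triggers trade throughput against latency: larger batches amortize the round-trip cost better, while the ticker bounds how long any single request waits.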

Distributed consensus protocols are the foundational primitive that makes distributed systems reliable. Every distributed database, coordination service, and replicated state machine depends on consensus to maintain consistency, which makes Raft and Paxos among the most important and most widely deployed algorithms in computer science.
