Distributed Consensus Protocols

Keywords: distributed consensus raft protocol, paxos consensus algorithm, leader election distributed, log replication consensus, split brain prevention

Distributed Consensus Protocols are algorithms that enable a group of distributed nodes to agree on a single value or sequence of values despite node failures and network partitions — providing the foundation for replicated state machines, distributed databases, and fault-tolerant coordination services.

Consensus Problem Definition:
- Agreement: all non-faulty nodes decide on the same value; no two correct nodes decide differently
- Validity: the decided value was proposed by some node; consensus doesn't fabricate values
- Termination: all non-faulty nodes eventually decide; the protocol makes progress despite failures (liveness); together with Agreement and Validity, this forms the contract of the consensus interface sketched after this list
- FLP Impossibility: Fischer-Lynch-Paterson proved that deterministic consensus is impossible in asynchronous systems with even one crash failure — practical protocols circumvent this by using timeouts (partial synchrony) or randomization
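
Taken together, Agreement, Validity, and Termination describe the contract a consensus module offers to a replicated state machine. A minimal Go interface capturing that contract might look like the sketch below; the type and method names (ConsensusModule, Propose, Committed) are illustrative and not taken from any particular library.

```go
// Illustrative sketch of the interface a consensus module exposes
// to a replicated state machine; names are invented for this example.
package consensus

// Entry is one agreed-upon command in the replicated log.
type Entry struct {
	Index int64  // position in the totally ordered log
	Term  int64  // election term (Raft) or proposal-number epoch (Paxos)
	Cmd   []byte // opaque state-machine command
}

// ConsensusModule is what Agreement/Validity/Termination are promised about:
// every non-faulty replica sees the same sequence of committed entries,
// each committed command was proposed by some client, and proposals
// eventually commit while a majority of replicas can communicate.
type ConsensusModule interface {
	// Propose submits a command; it may fail if this node is not the leader.
	Propose(cmd []byte) (index int64, err error)
	// Committed delivers entries in log order, exactly once per index.
	Committed() <-chan Entry
}
```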

Raft Protocol:
- Leader Election: nodes start as followers; if a follower receives no heartbeat within a randomized timeout (150-300 ms), it increments its term, becomes a candidate, and requests votes; a candidate that gathers votes from a majority becomes leader for that term; randomized timeouts make split votes unlikely (see the election sketch after this list)
- Log Replication: the leader receives client requests, appends them to its log, and replicates log entries to followers via AppendEntries RPCs; once an entry is stored on a majority of servers (counting the leader itself), the leader commits it and applies it to its state machine
- Safety: committed entries are never lost; voters grant their vote only to candidates whose log is at least as up-to-date as their own, so a winning candidate's log is at least as up-to-date as a majority's and therefore contains every committed entry
- Membership Changes: Raft supports joint consensus for configuration changes — adding/removing nodes without downtime by transitioning through a joint configuration where both old and new memberships must agree
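
As a concrete illustration of the election-timeout mechanics above, here is a minimal single-process Go sketch. The requestVote function stands in for the RequestVote RPC, and the cluster size, timeout range, and vote-grant probability are assumptions made purely for illustration.

```go
// Minimal sketch of Raft's randomized election timeout and majority voting.
// Real Raft sends RequestVote RPCs over the network and tracks per-voter
// state; here the control flow (follower -> candidate -> leader) is the point.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const clusterSize = 5

// electionTimeout returns a random duration in [150 ms, 300 ms), so that
// followers time out at different moments and split votes are unlikely.
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

// requestVote stands in for a RequestVote RPC; assume peers grant ~60% of requests.
func requestVote(term int, peer int) bool {
	return rand.Float64() < 0.6
}

// runElection asks every peer for a vote and reports whether a majority granted it.
func runElection(term int) bool {
	votes := 1 // a candidate always votes for itself
	for peer := 1; peer < clusterSize; peer++ {
		if requestVote(term, peer) {
			votes++
		}
	}
	return votes > clusterSize/2 // e.g. 3 of 5 wins the term
}

func main() {
	term := 0
	heartbeat := make(chan struct{}) // receives while a leader is alive (never, in this sketch)
	for {
		select {
		case <-heartbeat:
			// Leader is alive: stay a follower, reset the timeout.
		case <-time.After(electionTimeout()):
			// No heartbeat within the randomized timeout: become a candidate.
			term++
			if runElection(term) {
				fmt.Printf("won election for term %d, acting as leader\n", term)
				return
			}
			// Lost or split vote: wait for another randomized timeout and retry.
		}
	}
}
```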

Paxos Family:
- Basic Paxos: two-phase protocol (Prepare/Accept) for agreeing on a single value; the proposer sends Prepare(n) with proposal number n; acceptors promise to reject lower-numbered proposals and report any value they have already accepted; the proposer then sends Accept(n, v), where v is the value of the highest-numbered accepted proposal reported in the promises (or its own value if none); see the acceptor sketch after this list
- Multi-Paxos: optimization for agreeing on a sequence of values; a stable leader skips the Prepare phase for consecutive proposals, reducing each consensus round to a single Accept phase — equivalent to Raft's steady-state log replication
- Flexible Paxos: generalizes quorum requirements — Prepare quorum and Accept quorum need not be majority, only their intersection must be non-empty; enables optimizing for read-heavy or write-heavy workloads by adjusting quorum sizes
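
The Prepare/Accept rules above can be made concrete with a small sketch of an acceptor's logic. This is an illustrative, in-memory Go version; the field and method names are invented, and a real deployment would add networking, durable storage of the promised/accepted state, and proposer-side quorum counting.

```go
// Minimal sketch of a Basic Paxos acceptor's two-phase rules.
package main

import "fmt"

// Acceptor remembers the highest proposal number it has promised, plus the
// highest-numbered proposal it has actually accepted (if any).
type Acceptor struct {
	promised  int    // highest n seen in a Prepare
	acceptedN int    // number of the accepted proposal, 0 if none
	acceptedV string // value of the accepted proposal
}

// Prepare handles Prepare(n): promise to reject anything numbered below n,
// and report any previously accepted (n, v) so the proposer must adopt it.
func (a *Acceptor) Prepare(n int) (ok bool, prevN int, prevV string) {
	if n <= a.promised {
		return false, 0, ""
	}
	a.promised = n
	return true, a.acceptedN, a.acceptedV
}

// Accept handles Accept(n, v): accept unless a higher-numbered Prepare
// has been promised in the meantime.
func (a *Acceptor) Accept(n int, v string) bool {
	if n < a.promised {
		return false
	}
	a.promised = n
	a.acceptedN = n
	a.acceptedV = v
	return true
}

func main() {
	acc := &Acceptor{}
	ok, _, _ := acc.Prepare(1)         // proposer A prepares with n=1
	fmt.Println(ok)                    // true: promise granted
	fmt.Println(acc.Accept(1, "x"))    // true: value "x" accepted at n=1
	ok, prevN, prevV := acc.Prepare(2) // proposer B prepares with n=2
	fmt.Println(ok, prevN, prevV)      // true 1 x: B must re-propose "x"
	fmt.Println(acc.Accept(1, "y"))    // false: 1 is below the promised 2
}
```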

Production Systems:
- etcd (Raft): the coordination store behind Kubernetes; a 3- or 5-node cluster provides linearizable key-value storage for cluster state, leader election, and distributed locking (see the client sketch after this list); handles roughly 10-30K writes/sec per cluster
- ZooKeeper (ZAB): uses the ZooKeeper Atomic Broadcast (Zab) protocol, which is similar to Raft but with a different leader-election mechanism; used by Hadoop, Kafka, and HBase for coordination; gradually being replaced by Raft-based alternatives (e.g., Kafka's KRaft mode)
- CockroachDB/TiKV (Multi-Raft): run thousands of independent Raft groups — one per data range/partition; each range independently elects leaders and replicates data; enables horizontal scaling while maintaining per-range consistency
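
For a sense of how such a system is consumed, here is a sketch of acquiring a distributed lock through etcd's Go client. It assumes the go.etcd.io/etcd/client/v3 module (v3.5+) and an etcd endpoint at localhost:2379; the lock key /locks/deploy is an arbitrary example.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	// Connect to the (assumed) local etcd endpoint.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// A session ties the lock to a lease; if this process dies,
	// the lease expires and the lock is released automatically.
	sess, err := concurrency.NewSession(cli)
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Close()

	// Clients contending on the same key prefix queue up;
	// Lock returns once this client holds the lock cluster-wide.
	mu := concurrency.NewMutex(sess, "/locks/deploy")
	ctx := context.Background()
	if err := mu.Lock(ctx); err != nil {
		log.Fatal(err)
	}
	log.Println("lock acquired; doing exclusive work")
	if err := mu.Unlock(ctx); err != nil {
		log.Fatal(err)
	}
}
```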

Performance Trade-offs:
- Latency: consensus requires majority acknowledgment — minimum 1 RTT for leader-based protocols in steady state; 2 RTT for leaderless Paxos; cross-datacenter consensus adds 50-200 ms per commit
- Throughput: leader bottleneck limits write throughput to single-node capacity; batching multiple client requests into single log entries improves throughput by 10-100× at the cost of slightly higher latency
- Availability: requires a majority of nodes to be alive (3 nodes tolerate 1 failure, 5 tolerate 2); network partitions may leave the minority side temporarily unavailable; the CAP theorem makes this consistency-availability trade-off explicit (the quorum arithmetic is sketched below)
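
The availability numbers above follow directly from majority-quorum arithmetic: a cluster of n nodes needs floor(n/2)+1 acknowledgments, so it tolerates floor((n-1)/2) failed nodes, as this small sketch shows.

```go
// Majority-quorum arithmetic behind the fault-tolerance figures above.
package main

import "fmt"

func main() {
	for _, n := range []int{3, 5, 7} {
		quorum := n/2 + 1      // acknowledgments needed to commit
		tolerated := (n - 1) / 2 // node failures the cluster survives
		fmt.Printf("%d nodes: quorum=%d, tolerates %d failure(s)\n",
			n, quorum, tolerated)
	}
}
```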

Distributed consensus is the bedrock of reliable distributed systems — Raft and Paxos provide the theoretical and practical foundations that make distributed databases, configuration management, and leader election reliable in production cloud environments.
