Cache Coherence Protocols

Keywords: cache coherence protocols, mesi protocol states, snooping coherence bus, directory based coherence, cache invalidation protocol

Cache Coherence Protocols are hardware mechanisms that ensure all processors in a shared-memory multiprocessor system observe a consistent view of memory by coordinating cache line states across private caches — without coherence protocols, one processor's cached copy of data could become stale when another processor modifies the same memory location.

The Coherence Problem:
- Private Caches: each processor core has private L1/L2 caches for low-latency access — when multiple cores cache the same memory address, modifications by one core must be visible to all others
- Write Propagation: a write to a shared location must eventually become visible to all processors — coherence ensures that a read cannot return a stale value indefinitely
- Write Serialization: all processors must observe writes to the same location in the same order — prevents inconsistent views of memory state
- False Sharing: when two processors modify different variables that happen to reside on the same cache line (typically 64 bytes), the coherence protocol forces unnecessary invalidations — a significant performance pitfall
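The false-sharing pitfall is easy to reproduce in code. Below is a minimal C++ sketch (the struct layout, counter names, and iteration count are illustrative) in which two threads increment different counters that land on the same 64-byte cache line, so the line bounces between the cores even though no data is logically shared:

```cpp
// Minimal false-sharing sketch: two threads update *different* atomic
// counters, but the counters sit on the same 64-byte cache line, so every
// increment invalidates the other core's copy of the line.
#include <atomic>
#include <cstdio>
#include <functional>
#include <thread>

struct Counters {
    std::atomic<long> a{0};  // offset 0
    std::atomic<long> b{0};  // offset 8 -- same cache line as `a`
};

int main() {
    Counters c;
    auto bump = [](std::atomic<long>& n) {
        for (int i = 0; i < 10'000'000; ++i)
            n.fetch_add(1, std::memory_order_relaxed);
    };
    // The threads never touch the same variable, yet the shared line
    // ping-pongs between their caches on every write.
    std::thread t1(bump, std::ref(c.a));
    std::thread t2(bump, std::ref(c.b));
    t1.join();
    t2.join();
    std::printf("a=%ld b=%ld\n", c.a.load(), c.b.load());
}
```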

MESI Protocol:
- Modified (M): the cache line has been modified and is the only valid copy — the cache is responsible for writing back the data before another processor can access it
- Exclusive (E): the cache line is unmodified and is the only cached copy — can be silently promoted to Modified on a write without a bus transaction (an important optimization over MSI)
- Shared (S): the cache line is unmodified and may exist in other caches — a write requires an invalidation broadcast to transition to Modified
- Invalid (I): the cache line is not valid — any access requires fetching the line from another cache or main memory
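The four states and their common transitions can be modeled as a small state machine. The sketch below is a simplified teaching model, not any vendor's implementation; the event names (BusRead for a snooped read, BusReadX for a snooped read-for-ownership) are illustrative, and real protocols add transient states for in-flight transactions:

```cpp
// Toy MESI state machine for a single cache line, seen from one
// cache's controller.
enum class State { Modified, Exclusive, Shared, Invalid };

enum class Event {
    LocalRead, LocalWrite,  // accesses by this core
    BusRead, BusReadX       // snooped requests from other caches
};

State next_state(State s, Event e) {
    switch (s) {
    case State::Invalid:
        if (e == Event::LocalRead)  return State::Shared;    // Exclusive if no other cache holds it
        if (e == Event::LocalWrite) return State::Modified;  // after issuing a read-for-ownership
        break;
    case State::Shared:
        if (e == Event::LocalWrite) return State::Modified;  // broadcast invalidation first
        if (e == Event::BusReadX)   return State::Invalid;
        break;
    case State::Exclusive:
        if (e == Event::LocalWrite) return State::Modified;  // silent upgrade: no bus transaction
        if (e == Event::BusRead)    return State::Shared;
        if (e == Event::BusReadX)   return State::Invalid;
        break;
    case State::Modified:
        if (e == Event::BusRead)    return State::Shared;    // supply data / write back
        if (e == Event::BusReadX)   return State::Invalid;   // supply data, then invalidate
        break;
    }
    return s;  // all other combinations leave the state unchanged
}
```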

MOESI and MESIF Extensions:
- Owned (O) in MOESI: the cache holds a modified copy that is shared with other caches — the owning cache supplies the data on requests instead of main memory, reducing memory bandwidth (used by AMD processors)
- Forward (F) in MESIF: designates one shared copy as the supplier for future requests — prevents all shared copies from responding simultaneously, reducing bus traffic (used by Intel processors)
- State Transitions: each memory operation (read, write, eviction) triggers a state transition that may involve bus transactions — the protocol's efficiency depends on minimizing these transactions
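To make the MOESI refinement concrete, the sketch below (illustrative names, same style as the MESI model above) isolates the one transition that distinguishes it: a snooped read that hits a Modified line downgrades it to Owned rather than Shared, so the dirty data keeps being supplied by this cache instead of forcing an immediate write-back to memory:

```cpp
// The MOESI refinement in isolation: how each state reacts to a
// snooped read from another cache.
enum class MoesiState { Modified, Owned, Exclusive, Shared, Invalid };

MoesiState on_bus_read(MoesiState s) {
    switch (s) {
    case MoesiState::Modified:  return MoesiState::Owned;   // keep dirty copy, supply data
    case MoesiState::Owned:     return MoesiState::Owned;   // remain the designated supplier
    case MoesiState::Exclusive: return MoesiState::Shared;  // clean copy now shared
    default:                    return s;                   // Shared/Invalid unchanged
    }
}
```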

Snooping Protocols:
- Bus-Based Snooping: all cache controllers monitor (snoop) the shared bus for memory transactions — when a cache detects a relevant transaction, it updates its state accordingly
- Write-Invalidate: on a write, the writing cache broadcasts an invalidation to all other copies — other caches mark their copies as Invalid and must fetch the updated version on next access
- Write-Update (Dragon Protocol): on a write, the new value is broadcast to all shared copies — reduces read miss latency but consumes more bus bandwidth than write-invalidate
- Scalability Limitation: snooping requires all caches to observe all bus transactions — practical limit is 8-16 cores before bus bandwidth becomes a bottleneck
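A write-invalidate snooping controller can be sketched as a handler that runs for every transaction observed on the shared bus. The message types and class layout below are illustrative assumptions, not a real bus specification:

```cpp
// Minimal write-invalidate snooping sketch: every cache controller
// observes every bus transaction and reacts to those that touch a line
// it currently caches.
#include <cstdint>
#include <unordered_map>

enum class LineState { Modified, Exclusive, Shared, Invalid };

struct BusMsg {
    enum Kind { Read, ReadExclusive, Invalidate } kind;
    std::uint64_t addr;  // cache-line-aligned address
    int source;          // id of the cache that issued the transaction
};

class SnoopingCache {
    int id_;
    std::unordered_map<std::uint64_t, LineState> lines_;
public:
    explicit SnoopingCache(int id) : id_(id) {}

    // Invoked for every transaction seen on the shared bus.
    void snoop(const BusMsg& m) {
        if (m.source == id_) return;        // ignore our own traffic
        auto it = lines_.find(m.addr);
        if (it == lines_.end()) return;     // line not cached here
        switch (m.kind) {
        case BusMsg::Read:
            // Another cache reads: downgrade our exclusive/dirty copy.
            // (A Modified line would also be supplied or written back.)
            if (it->second == LineState::Modified ||
                it->second == LineState::Exclusive)
                it->second = LineState::Shared;
            break;
        case BusMsg::ReadExclusive:
        case BusMsg::Invalidate:
            it->second = LineState::Invalid;  // write-invalidate policy
            break;
        }
    }
};
```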

Directory-Based Protocols:
- Directory Structure: a centralized or distributed directory tracks which caches hold copies of each memory block — eliminates the need for broadcast by sending targeted messages only to relevant sharers
- Bit Vector: directory entry contains one bit per processor indicating whether that processor caches the line — scales to hundreds of processors but directory storage grows as O(N × M) where N is processors and M is memory blocks
- Coarse Directory: reduces storage by tracking groups of processors rather than individual ones — sacrifices precision (invalidates entire groups) for reduced memory overhead
- NUMA Integration: directory-based coherence naturally integrates with Non-Uniform Memory Access architectures — the directory is distributed across memory controllers, with local lookups for local memory and remote requests for remote memory
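A bit-vector directory can be sketched as follows. The entry layout, the 64-processor cap, and the interconnect helper functions are illustrative assumptions; real directories also track transient states and handle races that this omits:

```cpp
// Bit-vector directory sketch: one entry per memory block records exactly
// which caches hold a copy, so invalidations are targeted messages rather
// than broadcasts.
#include <cstddef>
#include <cstdint>
#include <vector>

struct DirEntry {
    std::uint64_t sharers = 0;  // bit i set => processor i caches the block
    int owner = -1;             // processor holding the block dirty, if any
};

class Directory {
    std::vector<DirEntry> entries_;
public:
    explicit Directory(std::size_t num_blocks) : entries_(num_blocks) {}

    // A processor wants to write: invalidate every other sharer, then
    // record the writer as the sole (dirty) owner.
    void handle_write(std::size_t block, int writer) {
        DirEntry& e = entries_[block];
        std::uint64_t others = e.sharers & ~(1ULL << writer);
        for (int p = 0; p < 64; ++p)
            if (others & (1ULL << p))
                send_invalidate(p, block);  // targeted, not broadcast
        e.sharers = 1ULL << writer;
        e.owner = writer;
    }

    // A processor wants to read: fetch the dirty copy if one exists,
    // then add the reader to the sharer set.
    void handle_read(std::size_t block, int reader) {
        DirEntry& e = entries_[block];
        if (e.owner >= 0 && e.owner != reader)
            fetch_and_writeback(e.owner, block);  // owner supplies data
        e.sharers |= 1ULL << reader;
        e.owner = -1;  // block is clean once the dirty copy is written back
    }

private:
    // Stand-ins for interconnect messages in a real system.
    void send_invalidate(int /*proc*/, std::size_t /*block*/) {}
    void fetch_and_writeback(int /*proc*/, std::size_t /*block*/) {}
};
```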

Performance Impact:
- Coherence Traffic: in a 64-core system running a shared-data workload, coherence messages can consume 30-50% of interconnect bandwidth — optimizing data layout to minimize sharing reduces this overhead
- False Sharing Mitigation: padding data structures to cache line boundaries (64 bytes) prevents false sharing — __attribute__((aligned(64))) or alignas(64) in C/C++ ensures each variable occupies its own cache line (a padded sketch follows this list)
- Read-Write Asymmetry: read sharing is cheap (multiple Shared copies coexist), but write sharing is expensive (requires invalidation) — designing data structures for reader-writer separation dramatically reduces coherence traffic
- Coherence Latency: an L1 cache hit takes 1-4 cycles, but a cache-to-cache transfer for a coherence miss takes 20-100 cycles depending on interconnect topology — minimizing sharing reduces average memory access time
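Returning to the false-sharing sketch shown earlier, the mitigation is mechanical: alignas(64) forces each hot variable onto its own cache line. A minimal sketch, assuming 64-byte cache lines:

```cpp
// Padded version of the earlier false-sharing example: alignas(64) gives
// each counter its own 64-byte cache line, so the threads' writes no
// longer invalidate each other's cached copies.
#include <atomic>
#include <functional>
#include <thread>

struct PaddedCounters {
    alignas(64) std::atomic<long> a{0};  // occupies its own cache line
    alignas(64) std::atomic<long> b{0};  // occupies the next cache line
};
static_assert(sizeof(PaddedCounters) >= 128,
              "each counter must get a full 64-byte line");

int main() {
    PaddedCounters c;
    auto bump = [](std::atomic<long>& n) {
        for (int i = 0; i < 10'000'000; ++i)
            n.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread t1(bump, std::ref(c.a));
    std::thread t2(bump, std::ref(c.b));
    t1.join();
    t2.join();
}
```

C++17 also provides std::hardware_destructive_interference_size (in <new>) as a portable alternative to the hard-coded 64.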

Cache coherence is invisible to most programmers but fundamentally shapes the performance of parallel software — understanding the underlying protocol helps explain why some parallel data structures scale linearly while others hit performance walls at just a few cores.
