HPC Storage and Burst Buffer: Multi-Tier I/O Architecture — parallel file systems combined with NVMe burst buffer tier enabling asynchronous I/O and checkpoint aggregation
Lustre Parallel File System
- Architecture: metadata server (MDS, single or pair), object storage targets (OSTs: 100s-1000s), clients (compute nodes)
- Object-Based: data stored as objects (striped across OSTs), not centralized file server
- Striping: file striped across multiple OSTs (default stripe size 1 MB); single-file bandwidth scales with stripe count, so a file striped over enough OSTs can reach 100+ GB/s
- Metadata Operations: MDS handles file creation, deletion, attribute changes (separate from data path)
- Performance: 100-400 GB/s aggregate bandwidth typical (Lustre @ DOE facilities), sustained (not peak)
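As a concrete illustration of round-robin striping, a small sketch of how a byte offset maps to an OST and an offset within that OST's backing object (the stripe count of 4 is illustrative; real layouts are set per file with `lfs setstripe`):

```python
def stripe_location(offset, stripe_size=1 << 20, stripe_count=4):
    """Map a file byte offset to (ost_index, object_offset) under
    round-robin striping with 1 MB stripes, as in Lustre's default.
    stripe_count=4 is an illustrative layout, not a Lustre default."""
    stripe_idx = offset // stripe_size          # which stripe overall
    ost_index = stripe_idx % stripe_count       # round-robin over OSTs
    # offset within that OST's backing object:
    object_offset = (stripe_idx // stripe_count) * stripe_size + offset % stripe_size
    return ost_index, object_offset

# The first byte of stripe 5 (file offset 5 MB) lands on OST 1,
# at offset 1 MB within that OST's object:
print(stripe_location(5 * (1 << 20)))  # (1, 1048576)
```

Because consecutive stripes land on different OSTs, large sequential reads and writes fan out across servers, which is what lets one file's bandwidth scale with its stripe count.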
BeeGFS (Parallel File System)
- Distribution: metadata distributed across multiple targets (scalable MDS), no single-point failure
- Hardware: commodity storage servers + Ethernet (no Infiniband required), simpler deployment
- Flexibility: dynamic capacity expansion (add OSTs online), adaptive rebalancing
- Use Cases: smaller clusters (<1000 nodes) favor BeeGFS, enterprise storage, lower TCO
I/O Bottleneck in HPC
- Compute-to-I/O Ratio: compute ~1-10 TFLOPS per node vs I/O ~1-10 GB/s per node, i.e. on the order of 1000 flops per byte moved (I/O is far slower than compute)
- Bandwidth Imbalance: 10,000-node system @ 10 GB/s per node = 100 TB/s demand, but storage ~10 TB/s available (10× mismatch)
- Synchronous I/O: if all nodes write checkpoints simultaneously, I/O bandwidth saturated (stalls computation)
- Latency Penalty: file system metadata operations (list files, stat) cost ~1-10 ms round-trip each, accumulating to seconds across thousands of operations
Burst Buffer Architecture
- Tier 0 (Compute Node Memory): DRAM on compute nodes (typical 64-256 GB), fast but limited size
- Tier 1 (Burst Buffer): NVMe SSD (TBs per node, aggregate 1-10 PB system-wide), high bandwidth (several GB/s per device, ~1-10 TB/s aggregate)
- Tier 2 (Parallel File System): HDD-based storage (multi-PB, 100+ GB/s aggregate), slow but large capacity
- Asynchronous I/O: application writes to burst buffer (fast, doesn't stall), background daemon asynchronously flushes to Lustre
Burst Buffer Use Cases
- Checkpoint I/O: application checkpoints every 5-30 min (fault tolerance), writes to burst buffer (fast), daemon stages to Lustre (slow, batched)
- Aggregation: dedicated I/O nodes run staging daemons that merge many checkpoint streams, reducing the load on any single Lustre server
- Temporary Data: intermediate results stored in burst buffer (fast access), discarded after analysis (no need for permanent storage)
DataWarp (Cray Burst Buffer)
- Architecture: SSDs in specialized I/O nodes (separate from compute nodes), connected via the high-speed network
- Capacity: 1-10 PB typical, persistent (survives job completion), shared across multiple jobs
- Performance: ~1-2 TB/s aggregate system-wide, lower than node-local NVMe but shared fairly across jobs
- Integration: POSIX interface (standard file I/O), transparent to applications
DAOS (Distributed Asynchronous Object Storage — Intel)
- Architecture: distributed storage pool (storage nodes with local NVMe), replication for fault tolerance
- Object Interface: key-value store semantic (not traditional file), flexible for structured data
- Consistency Model: relaxed consistency (asynchronous replication) rather than strict POSIX/ACID semantics, a good fit for HPC workloads
- Performance: low-latency I/O (~10 µs), high-throughput (100s GB/s aggregate)
- POSIX Interop: FUSE bridge enables POSIX file semantics, backward-compatible with existing applications
I/O Forwarding Layer
- I/O Forwarder Nodes: subset of the cluster dedicated to I/O (typically 10-20% of nodes), aggregating I/O on behalf of compute nodes
- Aggregation Logic: coalesce I/O requests from many compute nodes and forward them to Lustre in batches (fewer, larger requests and fewer metadata operations)
- Caching: forwarder nodes cache frequently accessed hot data, avoiding repeated Lustre accesses
- Throughput Improvement: 5-10× I/O throughput via intelligent aggregation
Checkpoint I/O Optimization
- Incremental Checkpointing: save only changed data (vs full state), reduces checkpoint size 2-10×
- Asynchronous Checkpointing: background thread saves checkpoint (application continues), reduces stall time
- Lossy Compression: compress checkpoints (trades fidelity for speed), acceptable when the application tolerates bounded error in restored state
- Checkpoint Frequency: balance between fault tolerance (frequent) and I/O overhead (infrequent), typically 10-30 min intervals
Bandwidth Hierarchy
- Compute-Local Cache: ~10 GB/s per node (fast, limited to local data)
- Burst Buffer: ~1-10 TB/s aggregate (moderate latency, far larger than DRAM)
- Parallel FS (Lustre): ~100-400 GB/s aggregate (slow, multi-PB capacity)
- Design Pattern: exploit hierarchy (data locality first, then burst buffer, finally Lustre)
Data Movement and Power
- I/O Power: moving 1 GB from DRAM to disk consumes ~0.1 Joule (storage + network), exceeds computation energy for data-intensive workloads
- Co-Location: store compute near data (minimize movement), reduces power + latency
- In-Memory Analytics: keep data in DRAM for repeated analysis, burst buffer not always necessary
Reliability and Data Integrity
- Replication: optional file-level replication mirrors data across OSTs (2-3 copies, e.g. Lustre FLR), tolerating OST failure; not enabled by default
- RAID: hardware RAID (RAID 6 or RAID 10) on individual storage servers protects against disk failures
- Checksums: verify data integrity (detect bit errors), background scrubber detects silent corruption
Scalability Considerations
- Metadata Scaling: a single MDS becomes the bottleneck as metadata request rate grows with node count; distributed metadata (BeeGFS, Lustre DNE) is preferred at extreme scale
- Network Congestion: many nodes writing simultaneously saturate the fabric, which is typically oversubscribed (2-4x taper toward storage); I/O traffic must be aggregated or scheduled to avoid congesting compute traffic
Future Directions: disaggregated storage (separating compute and storage for flexible provisioning), persistent memory and NVMe over Fabrics, and tiered storage with AI-driven data placement.