HPC Benchmarking (HPL/HPCG)

Keywords: HPC benchmark HPL HPCG, LINPACK benchmark, HPCG benchmark sparse, Top500 list, benchmark methodology HPC

HPC Benchmarking (HPL/HPCG) establishes standardized performance measurements for supercomputers, enabling fair comparison across architectures and identifying achievable sustained performance on realistic workloads.

High Performance LINPACK (HPL) Benchmark

- HPL Algorithm: Dense LU factorization with partial pivoting (solves Ax = b). Highly optimized, cache-friendly computation; achieves 80-90% of theoretical peak on modern CPU hardware.
- Matrix Size: Adjustable N (problem dimension). Typical: N = 100,000-5,000,000 (depends on available memory). Larger N amortizes communication overhead, improving efficiency.
- Computation: ≈2N³/3 floating-point operations. Predictable load and regular memory access make runs easy to profile and tune.
- Measurement: GFLOP/s = (2N³/3 + 2N²) / wall-clock time. Top500 list ranked by HPL performance (Rmax).
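The measurement above is simple arithmetic; a minimal sketch (the function name is illustrative, not part of the HPL software):

```python
def hpl_gflops(n: int, seconds: float) -> float:
    """Convert an HPL run (problem size N, wall-clock time) to GFLOP/s.
    HPL's official operation count is 2/3*N^3 + 2*N^2 floating-point ops."""
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return flops / seconds / 1e9

# A run with N = 100,000 finishing in 300 s sustains roughly 2.2 TFLOP/s
print(hpl_gflops(100_000, 300.0))
```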

HPL Scaling Characteristics

- Weak Scaling: Fixed work per processor. Increase processors + matrix size proportionally. Time = constant (ideal). HPL has been run at full-machine scale on systems with millions of cores.
- Strong Scaling: Fixed problem size. Increase processors, time decreases. Eventually communication dominates; speedup saturates.
- Efficiency: Sustained GFLOP/s / Theoretical peak GFLOP/s. Modern systems achieve 80-90% HPL efficiency (vs 10-30% for irregular applications).
- Tuning: Matrix size (N), process grid (P×Q), block size (NB) all impact performance. Tuned HPL achieves near-peak throughput.
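One concrete tuning step is sizing N from available memory. A rough sizing rule as a sketch, assuming double precision and an illustrative 80% memory fraction (the function and defaults are assumptions, not HPL-mandated values):

```python
import math

def hpl_problem_size(total_mem_bytes: float, mem_fraction: float = 0.8,
                     nb: int = 256) -> int:
    """Rule-of-thumb HPL N: the double-precision matrix (8 bytes/entry)
    should fill most of aggregate memory; round down to a multiple of NB."""
    n = math.sqrt(mem_fraction * total_mem_bytes / 8)
    return int(n // nb) * nb

# Hypothetical cluster: 16 nodes x 256 GiB each
print(hpl_problem_size(16 * 256 * 2**30))
```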

HPCG (High-Performance Conjugate Gradient) Benchmark

- HPCG Algorithm: Sparse symmetric positive-definite system solved via CG with multigrid preconditioning. Memory-bound, irregular access patterns.
- Advantages Over HPL: HPL unrealistic (dense linear algebra rare in science); HPCG more representative of real applications (structural mechanics, CFD, electromagnetics).
- Sparse Matrix: 3D stencil (~27-point stencil, only ~27 nonzeros per row). Structured sparsity, but irregular memory access.
- Multigrid Preconditioning: Geometric multigrid V-cycle over a fixed hierarchy of coarse grids, with a symmetric Gauss-Seidel smoother. Memory-bound bottleneck (low arithmetic intensity).

HPCG Metrics

- Throughput: GFLOP/s (same metric as HPL, but far lower). HPCG typically reaches only 1-5% of theoretical peak, often 20-100x below the HPL result on the same machine.
- Memory Bandwidth Efficiency: HPCG measures memory bandwidth utilization indirectly (embedded in GFLOP/s). Typical: 20-40% of theoretical memory bandwidth.
- Problem Size: Adjustable local 3D grid (nx × ny × nz per process). Typical: ~100-200 points per side locally; the global problem grows with process count and is sized to fill a large fraction of memory.
- Green500 Ranking: Note that the Green500 list ranks systems by HPL (not HPCG) performance per watt. Energy-efficiency metric in GFLOP/s per watt; leading systems exceed 60 GFLOPS/watt (2024).
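Because HPCG is memory-bound, achieved memory bandwidth can be backed out of a measured GFLOP/s figure given an assumed arithmetic intensity. A sketch; the ~0.25 flop/byte value below is a rough sparse-kernel assumption, not an HPCG-specified constant:

```python
def bandwidth_utilization(gflops: float, flops_per_byte: float,
                          peak_gbs: float) -> float:
    """Estimate memory-bandwidth utilization of a memory-bound kernel:
    achieved GB/s = GFLOP/s / arithmetic intensity, then divide by peak."""
    achieved_gbs = gflops / flops_per_byte
    return achieved_gbs / peak_gbs

# e.g. 30 GFLOP/s HPCG, ~0.25 flop/byte sparse kernels, 400 GB/s peak memory
util = bandwidth_utilization(30.0, 0.25, 400.0)
print(f"{util:.0%}")   # fraction of peak bandwidth actually streamed
```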

HPL vs HPCG Comparison

- HPL Throughput-Oriented: Peak performance demonstration. Ideal for vendor marketing. Not representative of real workloads.
- HPCG Realism: More representative of application behavior (memory-bound, sparse). Better predictor of actual application performance on system.
- System Ranking Correlation: HPL rank differs from HPCG rank (e.g., systems with high memory bandwidth rank relatively higher in HPCG than in HPL). Reveals architecture trade-offs.
- Procurement Value: Both benchmarks used by facilities to evaluate systems. HPL important for peak performance marketing; HPCG important for sustained performance.

Top500 List Methodology

- Ranking Criterion: Sustained LINPACK performance (HPL GFLOP/s). Updated twice yearly (June, November).
- Threshold: Entry #500 sets minimum performance (~2 PFLOP/s as of 2024). Systems below threshold not ranked.
- Rmax (Achieved Performance): Actual HPL performance measured (with tuning allowances). Conservative estimate → likely achievable on comparable systems.
- Rpeak (Theoretical Peak): Core count × clock rate × FLOPs per cycle per core. Rmax typically reaches 60-90% of Rpeak, with the larger gaps on GPU-accelerated systems.
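The Rpeak arithmetic can be made concrete; all machine parameters below are hypothetical, and the 80% HPL efficiency figure is the illustrative CPU-system value cited earlier:

```python
def rpeak_tflops(nodes: int, cores_per_node: int, ghz: float,
                 flops_per_cycle: int) -> float:
    """Theoretical peak = cores x clock x FLOPs/cycle.
    E.g. an AVX-512 core with 2 FMA units does 2 x 8 DP lanes x 2 flops
    = 32 flops per cycle."""
    return nodes * cores_per_node * ghz * flops_per_cycle / 1000.0

# Hypothetical 1000-node cluster: 64 cores/node, 2.0 GHz, 32 flops/cycle
rpeak = rpeak_tflops(1000, 64, 2.0, 32)
rmax = 0.8 * rpeak     # assuming 80% HPL efficiency on a CPU system
print(rpeak, rmax)     # TFLOP/s
```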

Green500 and Alternative Benchmarks

- Green500: Separate ranking emphasizing energy efficiency (HPL GFLOP/s per watt). Data center power consumption critical; efficiency rankings increasingly important.
- NAS Parallel Benchmarks: Application-based benchmarks (CFD, sparse LU, etc.). More realistic than HPL but less standardized.
- Sandia Mantevo: Proxy applications mimicking real workloads. Smaller scale, shorter runtime than full application. Good for procurement testing.
- Application-Specific Benchmarks: DL (ResNet, Transformer training), HPC (WRF weather, GROMACS molecular dynamics). Industry-relevant performance metrics.

Benchmark Methodology and Reproducibility

- HPL Run Rules: Specific rules for code generation, compiler flags, network tuning. Ensures comparison fairness but allows vendor optimization.
- Reproducibility: Multiple runs required, statistical significance checked. Variability typically <5% (excellent).
- Tuning Scope: Compiler optimization, blocking factors, process layout all tunable. Substantial tuning effort typically precedes the official timed run.
- Credibility: Independent verification (Top500 committee) checks submitted results. Outliers questioned, spot checks performed on suspicious results.
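The reproducibility check can be sketched as a coefficient-of-variation test over repeated runs (the run values below are made up for illustration):

```python
from statistics import mean, stdev

def run_variability(gflops_runs: list) -> float:
    """Coefficient of variation (sample stdev / mean) across repeated
    benchmark runs; well-behaved systems stay under ~5%."""
    return stdev(gflops_runs) / mean(gflops_runs)

runs = [1510.0, 1498.0, 1522.0, 1505.0]   # GFLOP/s from four repeat runs
cv = run_variability(runs)
print(f"variability: {cv:.2%}, acceptable: {cv < 0.05}")
```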
