Parallel Sorting Algorithms are sorting methods designed to efficiently distribute the work of ordering n elements across p processors, achieving O(n log n / p) computation with minimal communication — a fundamental building block for parallel databases, scientific computing, and distributed systems where sorted data enables binary search, merge joins, and load-balanced partitioning.
Sorting is among the most studied problems in parallel computing because it combines computation, communication, and load balancing challenges in a compact, well-defined problem. Achieving near-linear speedup requires careful attention to data distribution and communication patterns.
Parallel Sorting Algorithms:
| Algorithm | Computation | Communication | Best For |
|----------|------------|--------------|----------|
| Bitonic sort | O(n log^2 n / p) | O(log^2 p) steps | GPU, fixed networks |
| Sample sort | O(n log n / p) | O(n/p + p log p) | Distributed memory |
| Parallel merge sort | O(n log n / p) | O(n/p log p) | General purpose |
| Radix sort | O(n·w / p), w = digits per key | O(n/p) per digit | Integer keys, GPU |
| Histogram sort | O(n log n / p) | O(n/p) | Load-balanced distributed |
Sample Sort: The most practical algorithm for distributed-memory systems. Steps: (1) each process sorts its local n/p elements; (2) each process selects s regular samples from its sorted data; (3) samples are gathered, sorted globally, and p-1 splitters are chosen that partition the key space into p balanced buckets; (4) each process sends elements to the process responsible for their bucket (all-to-all exchange); (5) each process merges received elements. With over-sampling factor s = O(log n), load imbalance is bounded to within a constant factor.
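The five steps above can be sketched in a single process by simulating the p processes with plain Python lists; this is an illustrative sketch (the function name and `oversample` parameter are our own), not a real MPI implementation:

```python
import random
from bisect import bisect_left

def sample_sort(data, p, oversample=4):
    """Single-process simulation of distributed sample sort across p 'processes'."""
    # Step 1: split into p local chunks and sort each locally.
    chunks = [sorted(data[i::p]) for i in range(p)]
    # Step 2: each "process" selects s regular samples from its sorted chunk.
    s = oversample
    samples = []
    for chunk in chunks:
        step = max(1, len(chunk) // s)
        samples.extend(chunk[::step][:s])
    # Step 3: gather and sort samples, choose p-1 splitters.
    samples.sort()
    splitters = [samples[(i * len(samples)) // p] for i in range(1, p)]
    # Step 4: all-to-all exchange — route each element to its bucket.
    buckets = [[] for _ in range(p)]
    for chunk in chunks:
        for x in chunk:
            buckets[bisect_left(splitters, x)].append(x)
    # Step 5: each "process" sorts its bucket; bucket ranges are disjoint
    # and ordered, so concatenation is globally sorted.
    return [x for b in buckets for x in sorted(b)]
```

Because bucket i holds exactly the keys between splitters i-1 and i, no merge across buckets is needed; in a real distributed run, step 4 is the only all-to-all communication.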
GPU Radix Sort: Typically the fastest sort on GPUs for integer keys. Processes keys digit-by-digit (typically 4-8 bits per pass). Each pass: compute per-digit histogram (parallel histogram), prefix sum to determine output positions (parallel scan), scatter keys to new positions (parallel scatter). CUB's radix sort reaches tens of billions of keys per second on modern GPUs by optimizing shared memory usage and minimizing global memory passes.
Bitonic Sort: A comparison-based network sort that performs roughly (n log^2 n)/4 comparisons arranged in (log^2 n + log n)/2 stages. Each stage consists of n/2 independent compare-and-swap operations between pairs of elements — perfect for SIMD/GPU execution where all threads perform the same operation. Not work-optimal (extra log n factor) but the data-independent, regular communication pattern makes it efficient on GPUs and fixed-topology networks.
Load Balancing in Distributed Sort: The fundamental challenge is ensuring each process receives approximately n/p elements after redistribution. Skewed key distributions cause some processes to receive disproportionately many elements. Solutions: over-sampling (more samples provide better splitter estimates), two-round sorting (first pass determines distribution, second pass sorts with adjusted splitters), and dynamic load balancing (redistribute excess elements from overloaded processes).
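The effect of over-sampling can be measured directly: pick p·s random samples, derive p-1 splitters, and check the largest resulting bucket against the ideal n/p (a hypothetical helper for illustration; real implementations sample per-process rather than globally):

```python
import random
from bisect import bisect_left

def max_bucket_size(data, p, s):
    """Largest bucket produced by splitters drawn from p*s random samples."""
    samples = sorted(random.sample(data, p * s))
    splitters = [samples[(i * len(samples)) // p] for i in range(1, p)]
    counts = [0] * p
    for x in data:
        counts[bisect_left(splitters, x)] += 1
    return max(counts)

# With a skewed (exponential) key distribution, raising s from 1 toward
# O(log n) pulls the maximum bucket size down toward the ideal n/p.
random.seed(1)
skewed = [int(random.expovariate(1.0) * 1000) for _ in range(100_000)]
```

In the limit where the samples cover the whole input, the splitters are exact and the imbalance essentially vanishes, which is the intuition behind the constant-factor bound quoted above.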
Parallel sorting algorithms demonstrate the core tension in parallel computing — the work of sorting can be divided easily, but the communication required to produce a globally sorted output is inherent and irreducible, making the algorithm designer's challenge one of performing that communication as efficiently as the network and memory hierarchy allow.