SIMD Vectorization

Keywords: simd vectorization avx512, auto vectorization compiler, vector processing sse avx, simd intrinsics programming, vector width scalability

SIMD Vectorization is a parallel execution technique that processes multiple data elements simultaneously using wide vector registers and single instructions. By exploiting data-level parallelism within individual cores, it achieves 4-16× throughput improvements on modern CPUs and complements thread-level parallelism across cores.

SIMD Instruction Set Evolution:
- SSE/SSE2 (128-bit): four 32-bit floats or two 64-bit doubles per instruction; introduced with the Pentium III (SSE) and Pentium 4 (SSE2); still the baseline for x86 SIMD compatibility
- AVX/AVX2 (256-bit): eight 32-bit floats or four 64-bit doubles; includes fused multiply-add (FMA) instructions; dominant in current production code, standard on x86 CPUs since Intel Haswell (2013) and AMD Zen (2017)
- AVX-512 (512-bit): sixteen 32-bit floats with mask registers for predicated execution, gather/scatter instructions, and conflict detection; available on Intel Xeon and AMD EPYC (Zen 4 onward) server CPUs, while desktop support is inconsistent: Intel 11th-gen cores enabled it but later hybrid designs disabled it (a runtime detection sketch follows this list)
- ARM NEON/SVE/SVE2: NEON provides 128-bit fixed-width SIMD on all ARMv8 cores; SVE provides scalable vector length (128-2048 bits) for HPC; Apple M-series implements 128-bit NEON with exceptional throughput
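
Because AVX-512 support varies even within a single vendor's lineup, portable code usually probes the CPU at runtime and dispatches to the widest available path. A minimal detection sketch using the GCC/Clang __builtin_cpu_supports builtin (x86 only; the file name and printed messages are illustrative):

```c
/* simd_detect.c: pick a SIMD path at runtime (GCC/Clang, x86).
 * Compile: gcc -O2 simd_detect.c */
#include <stdio.h>

int main(void) {
    __builtin_cpu_init();  /* populate the runtime feature table */

    /* Probe from widest to narrowest; SSE2 is the x86-64 baseline. */
    if (__builtin_cpu_supports("avx512f"))
        puts("AVX-512F: 512-bit vectors, 16 floats per instruction");
    else if (__builtin_cpu_supports("avx2"))
        puts("AVX2: 256-bit vectors, 8 floats per instruction");
    else
        puts("SSE2 baseline: 128-bit vectors, 4 floats per instruction");
    return 0;
}
```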

Auto-Vectorization:
- Loop Vectorization: compiler transforms scalar loops into SIMD operations when iterations are independent; GCC/Clang -O2 enables basic vectorization, -O3 enables aggressive vectorization with loop transformations (see the loop sketch after this list)
- SLP Vectorization: superword-level parallelism detects adjacent scalar operations on independent data and packs them into SIMD instructions; effective for straight-line code without loops
- Vectorization Blockers: loop-carried dependencies, function calls without SIMD variants, irregular memory access patterns, and conditional branches prevent auto-vectorization; __restrict pointers and alignment hints help the compiler
- Compiler Reports: -fopt-info-vec (GCC), -Rpass=loop-vectorize (Clang) report which loops were vectorized and why others were not — essential for diagnosing missed vectorization opportunities
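
To make the blockers concrete, here is a sketch of a loop that GCC and Clang vectorize cleanly once restrict rules out aliasing between the two arrays (file and function names are illustrative):

```c
/* saxpy.c: a loop shaped for auto-vectorization.
 * Compile and inspect the report:
 *   gcc   -O3 -march=native -fopt-info-vec -c saxpy.c
 *   clang -O3 -march=native -Rpass=loop-vectorize -c saxpy.c */
#include <stddef.h>

/* restrict promises x and y never overlap, removing the aliasing
 * assumption that would otherwise block vectorization. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y) {
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];  /* independent iterations map to SIMD FMA */
}
```

Without restrict, the compiler must assume the arrays might overlap and will typically emit a runtime overlap check or fall back to scalar code.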

Intrinsics Programming:
- Explicit SIMD: compiler intrinsics (_mm256_mul_ps, _mm512_fmadd_ps) provide direct access to SIMD instructions without assembly — portable across compilers while giving precise control over instruction selection (see the kernel sketch after this list)
- Data Types: __m128/__m256/__m512 for floats, __m128i/__m256i/__m512i for integers; load/store intrinsics handle alignment (_mm256_load_ps requires 32-byte alignment; _mm256_loadu_ps handles unaligned)
- Mask Operations: AVX-512 mask registers (__mmask16) enable predicated execution — each element can be independently enabled/disabled, eliminating branch divergence overhead for conditional operations (sketched in the second example after this list)
- Gather/Scatter: AVX2/AVX-512 support indexed load (_mm256_i32gather_ps) and indexed store from arbitrary memory locations — enabling SIMD processing of indirect array accesses, though at significantly lower throughput than contiguous access
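
As a sketch of the explicit style (assuming an AVX2/FMA-capable target; the function name is illustrative), here is the same SAXPY kernel written with intrinsics, including the scalar tail needed when n is not a multiple of the 8-float vector width:

```c
/* saxpy_avx2.c: explicit AVX2/FMA kernel.
 * Compile: gcc -O2 -mavx2 -mfma -c saxpy_avx2.c */
#include <immintrin.h>
#include <stddef.h>

void saxpy_avx2(size_t n, float a, const float *x, float *y) {
    __m256 va = _mm256_set1_ps(a);            /* broadcast a to all 8 lanes */
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);   /* unaligned load of 8 floats */
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_fmadd_ps(va, vx, vy);     /* vy = a*vx + vy in one op */
        _mm256_storeu_ps(y + i, vy);
    }
    for (; i < n; ++i)                        /* scalar tail: n % 8 leftovers */
        y[i] = a * x[i] + y[i];
}
```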
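And a second sketch for the mask-register style, assuming an AVX-512F target: a vector compare yields a __mmask16, and the masked add leaves lanes with a zero mask bit untouched, so the conditional needs no branch:

```c
/* masked_add.c: branch-free conditional update with AVX-512 masks.
 * Compile: gcc -O2 -mavx512f -c masked_add.c */
#include <immintrin.h>
#include <stddef.h>

/* a[i] += b[i] only where a[i] > 0 */
void masked_add(size_t n, float *a, const float *b) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        /* one mask bit per lane: set where va > 0 */
        __mmask16 m = _mm512_cmp_ps_mask(va, _mm512_setzero_ps(), _CMP_GT_OQ);
        /* disabled lanes keep their original value from va */
        _mm512_storeu_ps(a + i, _mm512_mask_add_ps(va, m, va, vb));
    }
    for (; i < n; ++i)          /* scalar tail */
        if (a[i] > 0.0f) a[i] += b[i];
}
```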

Performance Optimization:
- Memory Bandwidth: SIMD increases compute throughput but not memory bandwidth; memory-bound code gains little from wider vectors — arithmetic intensity must be high enough to benefit from SIMD (a streaming kernel like SAXPY performs only 2 flops per 12 bytes of traffic and saturates DRAM bandwidth long before the vector units)
- Alignment: aligned loads are 0-10% faster than unaligned on modern CPUs (much larger gap on older hardware); aligning arrays to the vector width (32 bytes for AVX2) with posix_memalign or alignas is best practice (see the allocation sketch after this list)
- Register Pressure: wide SIMD operations consume physical registers proportionally; AVX-512 code may reduce available registers, increasing spilling for complex kernels — shorter AVX2 code sometimes outperforms AVX-512 due to better register utilization and higher clock frequency
- Frequency Throttling: heavy AVX-512 usage triggers frequency reduction on some Intel processors (100-300 MHz reduction); the effective speedup may be less than the 2× vector width increase suggests — benchmark on actual target hardware
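
A short sketch of the alignment practice from the list above, assuming a POSIX system: posix_memalign returns storage aligned to the 32-byte AVX2 vector width, which makes the aligned-load intrinsic legal:

```c
/* aligned_buf.c: allocate to the AVX2 vector width.
 * Compile: gcc -O2 -mavx2 aligned_buf.c */
#include <immintrin.h>
#include <stdlib.h>

int main(void) {
    float *buf;
    /* 32-byte alignment matches the 256-bit AVX2 register width */
    if (posix_memalign((void **)&buf, 32, 1024 * sizeof(float)) != 0)
        return 1;

    for (int i = 0; i < 1024; ++i)
        buf[i] = (float)i;

    /* legal only because buf is 32-byte aligned; a misaligned pointer
     * here would fault, unlike the unaligned _mm256_loadu_ps */
    __m256 v = _mm256_load_ps(buf);
    (void)v;

    free(buf);
    return 0;
}
```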

SIMD vectorization is the most accessible form of parallelism available to every programmer: it delivers an immediate 4-16× speedup for data-parallel operations within a single core, multiplies the benefit of multi-core threading, and is essential for achieving peak performance in numerical computing, signal processing, and machine learning inference.
