Vector Processing Units and SIMD Extensions

Vector processing units and SIMD extensions are the processor hardware that applies a single instruction to multiple data elements simultaneously, the primary mechanism for exploiting data-level parallelism within a single CPU core. Through SIMD (Single Instruction, Multiple Data), modern CPUs achieve 4–16× higher throughput on vectorizable workloads by processing 128–512 bits of data per instruction instead of one scalar element (32 or 64 bits), which lets HPC, multimedia, AI inference, and signal processing applications make full use of the core's compute resources.

SIMD Evolution on x86

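| Extension | Year | Width | Data Elements | Peak Throughput |
|-----------|------|-------|---------------|-----------------|
| SSE | 1999 | 128-bit | 4× float32 | 4 FP32 ops/cycle |
| SSE4.2 | 2008 | 128-bit | 4× float32 | 4 FP32 ops/cycle |
| AVX | 2011 | 256-bit | 8× float32 | 8 FP32 ops/cycle |
| AVX2 | 2013 | 256-bit | 8× float32 | 16 FP32 ops/cycle (FMA) |
| AVX-512 | 2017 | 512-bit | 16× float32 | 32 FP32 FMA ops/cycle |
| AMX | 2021 | Tile matrix | 16×16 BF16 | 2048 BF16 ops/cycle |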
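
The float32 rows of the table above follow from simple arithmetic: the lane count is the register width divided by the element width, and a fused multiply-add (FMA) counts as two floating-point operations per lane. The sketch below reproduces those figures under the assumption that one vector instruction is issued per cycle. Real cores may have more than one vector unit, which this ignores. The names `lanes` and `peak_ops_per_cycle` are illustrative and do not come from any library.

```cpp
// Number of elements that fit in one vector register.
constexpr int lanes(int register_bits, int element_bits) {
    return register_bits / element_bits;
}

// Peak operations per cycle, ASSUMING one vector instruction issued per cycle.
// An FMA performs a multiply and an add, so it counts as two operations per lane.
constexpr int peak_ops_per_cycle(int register_bits, int element_bits, bool has_fma) {
    return lanes(register_bits, element_bits) * (has_fma ? 2 : 1);
}

// Checked at compile time against the float32 rows of the table.
static_assert(lanes(128, 32) == 4,  "SSE: 4 x float32");
static_assert(lanes(256, 32) == 8,  "AVX/AVX2: 8 x float32");
static_assert(lanes(512, 32) == 16, "AVX-512: 16 x float32");
static_assert(peak_ops_per_cycle(128, 32, false) == 4,  "SSE");
static_assert(peak_ops_per_cycle(256, 32, false) == 8,  "AVX");
static_assert(peak_ops_per_cycle(256, 32, true)  == 16, "AVX2 with FMA");
static_assert(peak_ops_per_cycle(512, 32, true)  == 32, "AVX-512 with FMA");
```

The same formula gives the lane count for other element types, for example `lanes(512, 64)` for float64 in a 512-bit register.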

AVX-512 Detailed

RISC-V Vector Extension (RVV)

Vectorization in Practice (C++)

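// Auto-vectorized by GCC/Clang with -O3 -march=native
// (Clang also vectorizes at -O2; GCC does so most reliably at -O3)
void saxpy(float a, float* x, float* y, int n) {
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i]; // contracted into one FMA when the target supports it
    }
}
// Compiler: on an AVX-512 target, typically generates vmovups + vfmadd231ps.
// GCC and Clang may default to 256-bit ymm registers unless -mprefer-vector-width=512 is given.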

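// Explicit SIMD with intrinsics (requires AVX-512F: compile with -mavx512f or -march=native)
#include <immintrin.h>
void saxpy_avx512(float a, float* x, float* y, int n) {
    __m512 va = _mm512_set1_ps(a);           // broadcast a to all 16 lanes
    int i = 0;
    for (; i + 16 <= n; i += 16) {           // full blocks of 16 floats only
        __m512 vx = _mm512_loadu_ps(x + i);  // unaligned load of x[i..i+15]
        __m512 vy = _mm512_loadu_ps(y + i);  // unaligned load of y[i..i+15]
        vy = _mm512_fmadd_ps(va, vx, vy);    // vy = va * vx + vy
        _mm512_storeu_ps(y + i, vy);
    }
    for (; i < n; i++) {                     // scalar tail: n need not be a multiple of 16
        y[i] = a * x[i] + y[i];
    }
}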
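
An intrinsics kernel like the one above has two practical hazards: element counts that are not a multiple of 16, and CPUs that lack AVX-512 entirely. The following is a minimal sketch of one way to handle both, assuming GCC or Clang on x86. It uses AVX-512 mask registers for the remainder and `__builtin_cpu_supports` for a runtime check. The function names `saxpy_avx512_masked`, `saxpy_scalar`, and `saxpy_dispatch` are hypothetical, chosen only for this illustration.

```cpp
#include <cstddef>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

// Built for AVX-512F regardless of the global -march setting
// (the target attribute is a GCC/Clang extension).
__attribute__((target("avx512f")))
static void saxpy_avx512_masked(float a, const float* x, float* y, std::size_t n) {
    const __m512 va = _mm512_set1_ps(a);
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {  // full blocks of 16 floats
        __m512 vx = _mm512_loadu_ps(x + i);
        __m512 vy = _mm512_loadu_ps(y + i);
        _mm512_storeu_ps(y + i, _mm512_fmadd_ps(va, vx, vy));
    }
    if (i < n) {
        // 1..15 elements remain: enable only the low (n - i) lanes.
        // Masked-off lanes are neither read from nor written to memory.
        const __mmask16 tail = static_cast<__mmask16>((1u << (n - i)) - 1u);
        __m512 vx = _mm512_maskz_loadu_ps(tail, x + i);
        __m512 vy = _mm512_maskz_loadu_ps(tail, y + i);
        _mm512_mask_storeu_ps(y + i, tail, _mm512_fmadd_ps(va, vx, vy));
    }
}
#endif

// Portable fallback for CPUs (or architectures) without AVX-512.
static void saxpy_scalar(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Chooses the widest implementation the running CPU supports.
void saxpy_dispatch(float a, const float* x, float* y, std::size_t n) {
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx512f")) {
        saxpy_avx512_masked(a, x, y, n);
        return;
    }
#endif
    saxpy_scalar(a, x, y, n);
}
```

Calling `saxpy_dispatch(2.0f, x, y, 37)` processes two full 16-element blocks and then a masked block of 5 elements on AVX-512 hardware, and falls back to the scalar loop elsewhere. A production dispatcher would likely cache the feature check and add AVX2 and SSE paths; this sketch omits both for brevity.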

Auto-Vectorization Requirements

ARM Neon and SVE

AMX (Advanced Matrix Extensions, Intel)

Vector processing units and SIMD extensions are the single-core parallelism engine that turns a modern CPU from a purely sequential processor into a small data-parallel machine. By processing 16 float32 values per instruction with AVX-512, a single CPU core can reach up to 16× the scalar instruction throughput on vectorizable kernels (32 FP32 operations per cycle when each lane performs a fused multiply-add). Whole-application gains are smaller, but SIMD can still be the difference between a scientific code that runs in 1 hour and one that runs in 3 hours, and it lets CPUs compete with dedicated accelerators on structured, vectorizable AI inference workloads.
