Vulkan Compute Shaders enable portable, hardware-agnostic GPU computing across diverse platforms (NVIDIA, AMD, Intel, mobile GPUs), leveraging SPIR-V intermediate representation and compute pipelines for general-purpose GPU applications.
Compute Pipeline Setup in Vulkan
- Compute Pipeline Creation: VkComputePipelineCreateInfo specifies compute shader and layout (descriptor sets, push constants). Compiled to GPU-specific code via driver.
- Shader Module: SPIR-V bytecode (intermediate representation). An offline compiler (glslc, glslangValidator) converts GLSL/HLSL → SPIR-V.
- Pipeline Layout: Describes resource bindings (storage buffers, samplers, push constants). Enables validation, optimization by driver.
- Specialization Constants: Constants baked into shader at compile time. Different specializations for different problem sizes (block size, unroll factor) without recompilation.
SPIR-V Shader Representation
- SPIR-V (Standard Portable Intermediate Representation): Cross-platform binary IR, designed as a portable compilation target for both graphics and compute shaders.
- Advantages: Portable across vendors (NVIDIA, AMD, Intel, ARM). Compiled once, deployed everywhere. Decouples shader source from driver compiler.
- Bytecode Format: 32-bit word stream. Five-word header: magic (0x07230203), version, generator ID, bound (all result IDs are < bound), schema (reserved, typically 0).
- Instruction Format: Each instruction = word count + opcode + operands. Typed SSA (static single assignment) representation.
Workgroup and Thread Invocation Model
- Local Size Declaration: layout(local_size_x = 8, local_size_y = 8, local_size_z = 1) in; declares 8×8×1 = 64 threads per workgroup (thread block in CUDA terminology).
- Workgroup Size: Capped by maxComputeWorkGroupInvocations (1024 typical on desktop GPUs; the spec guarantees at least 128). Larger workgroups expose more parallelism but raise register pressure; the trade-off is application-dependent.
- Global Invocation ID: gl_GlobalInvocationID = global thread index, defined per axis as gl_WorkGroupID × gl_WorkGroupSize + gl_LocalInvocationID.
- Local Invocation ID: gl_LocalInvocationID = thread index within workgroup (0 to local_size-1). Used for shared memory addressing, synchronization.
Descriptor Sets and Bindings
- Descriptor Set Layout: Describes set of resources (buffers, images, samplers) at specific bindings. VkDescriptorSetLayout.
- Storage Buffer Binding: Binding point for read/write buffer. Shader accesses via buffer[index]. SSBO (shader storage buffer object) in OpenGL.
- Descriptor Set: Instance of layout with actual resources. Multiple descriptor sets enable different data per dispatch (e.g., different input/output buffers).
- Pipeline Layout: Groups descriptor set layouts and push constant ranges. Defines all resources accessible to shader.
Push Constants and Shader Parameters
- Push Constants: Small constant values recorded directly into the command buffer and delivered to the shader. Faster than buffer updates; ideal for per-dispatch parameters.
- Example Usage: Output buffer dimensions, iteration count, algorithm parameters. Avoids buffer updates between dispatches.
- Size Limitation: Only 128 bytes guaranteed across all platforms (maxPushConstantsSize); many desktop drivers expose 256 or more. Larger structures require uniform or storage buffers.
- Performance: Push constant updates incur no descriptor binding or memory allocation overhead. Preferred for frequently-changing parameters.
Vulkan Synchronization (Barriers and Semaphores)
- Memory Barrier: vkCmdPipelineBarrier() with a VkMemoryBarrier makes writes from one pipeline stage available and visible to a later stage. Synchronization within a command buffer/queue.
- Execution Barrier: Ensures prior commands reach a given pipeline stage before later commands proceed. Necessary after a compute dispatch before its results are read (e.g., COMPUTE_SHADER → TRANSFER).
- Memory Synchronization Scopes (in-shader GLSL built-ins): memoryBarrierShared() for shared (workgroup) memory visibility; memoryBarrier() for visibility across all memory types.
- Semaphores: GPU-to-GPU or GPU-to-host synchronization. Binary semaphore (signaled/unsignaled) or timeline semaphore (specific value).
Shared Memory and Local Synchronization
- Shared Memory Declaration: shared vec4 data[256]; declares 256×16 bytes = 4KB shared memory per workgroup (Vulkan: workgroup memory).
- Memory Coherence: All threads in a workgroup see a consistent state after a barrier. Primitive: barrier(), which in compute shaders synchronizes execution and orders shared-variable accesses; memoryBarrier*() alone orders memory without an execution sync.
- Bank Conflict Avoidance: Shared memory bank structure (32 banks typical). Stride-1 access conflict-free. Padding arrays avoids conflict penalties.
- Usage: Reduce operation (sum, min, max across workgroup). Shared data staging (load global, store shared, process, store global).
Compute Shader Compilation and Optimization
- Compilation Pipeline: GLSL/HLSL → SPIR-V (via glslc) → Driver-specific code (NVIDIA PTX/SASS, AMD GCN ISA).
- Driver Optimization: Vendor-specific compiler optimizes SPIR-V. Register allocation, instruction scheduling, cache optimization.
- Inlining: SPIR-V toolchains inline aggressively; glslang flattens all functions into the entry point, and spirv-opt exposes explicit inlining passes. GLSL itself has no standard inline pragma.
- Optimization Feedback: Profilers (RenderDoc, NVIDIA Nsight Graphics, AMD Radeon GPU Profiler) show generated ISA, register usage, occupancy, and cache behavior.
Comparison with CUDA and Portability to Non-NVIDIA Hardware
- Portability Advantage: Vulkan compute targets NVIDIA, AMD, Intel, and ARM (mobile); CUDA is NVIDIA-only. HIP (AMD's CUDA-like API) is another portability route.
- Ecosystem: Vulkan ecosystem smaller than CUDA (fewer libraries, kernels). CUDA dominance in ML/HPC (TensorFlow, PyTorch optimized for CUDA).
- Performance Parity: Vulkan compute often achieves throughput comparable to CUDA on NVIDIA hardware, though results are workload-dependent. May lag on AMD/Intel where driver compilers are less mature.
- Use Cases: Graphics + compute integration (real-time rendering), cross-platform applications (games, simulation), mobile computing.