GPU Tensor Core Programming is the practice of utilizing specialized matrix multiply-accumulate (MMA) hardware units in NVIDIA GPUs that perform small matrix operations each cycle (small tiles in hardware, exposed to software as larger warp-level shapes such as 16×16×16) with mixed-precision arithmetic — Tensor Cores deliver roughly 5-10× higher throughput than standard CUDA cores for matrix-heavy workloads like deep learning and scientific computing.
Tensor Core Hardware Architecture:
- Matrix Operation: each Tensor Core performs D = A × B + C where A and B are small matrices (typically 4×4 in hardware, exposed as 16×16×16 at the warp level) — inputs A, B can be FP16/BF16/TF32/INT8 while accumulator C/D is FP32 or FP16
- Throughput per SM: Ampere (A100) has 4 Tensor Cores per SM, each performing 256 FP16 FMA operations per cycle — total 1024 FMA ops/cycle/SM vs. 64 FMA ops/cycle/SM for the FP32 CUDA cores (a 16× ratio; see the peak-throughput arithmetic after this list)
- Supported Precisions: FP16×FP16→FP32, BF16×BF16→FP32, TF32×TF32→FP32, FP64×FP64→FP64, INT8×INT8→INT32, INT4×INT4→INT32, FP8×FP8→FP32 (Hopper)
- Warp-Level Operation: Tensor Core instructions are warp-cooperative — all 32 threads in a warp collectively provide the input matrix fragments and receive the output fragments
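Expanding the per-SM figure above into a chip-level number is a useful sanity check; the back-of-envelope arithmetic below assumes the A100 SXM variant with 108 SMs and a ~1.41 GHz boost clock (figures not stated above) and recovers the advertised dense FP16 Tensor Core peak:

```latex
% Dense FP16 Tensor Core peak, A100 SXM (assumed: 108 SMs, ~1.41 GHz boost clock)
1024\ \tfrac{\text{FMA}}{\text{cycle}\cdot\text{SM}}
\times 2\ \tfrac{\text{FLOP}}{\text{FMA}}
\times 108\ \text{SMs}
\times 1.41\times 10^{9}\ \tfrac{\text{cycles}}{\text{s}}
\approx 3.1\times 10^{14}\ \tfrac{\text{FLOP}}{\text{s}}
\approx 312\ \text{TFLOPS}
```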
WMMA API (Warp Matrix Multiply-Accumulate):
- Fragment Declaration: wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag — declares a fragment for the A matrix in 16×16×16 configuration with FP16 row-major layout
- Load Operation: wmma::load_matrix_sync(a_frag, ptr, stride) — cooperatively loads a 16×16 matrix tile from global or shared memory across all threads in the warp
- MMA Operation: wmma::mma_sync(d_frag, a_frag, b_frag, c_frag) — performs the matrix multiply-accumulate D = A × B + C using Tensor Cores in a single warp-synchronous call
- Store Operation: wmma::store_matrix_sync(ptr, d_frag, stride, wmma::mem_row_major) — cooperatively stores the result fragment back to memory
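Putting these four calls together, the sketch below shows the canonical per-warp pattern: one warp owns one 16×16 output tile and marches along K in 16-wide steps. It is a minimal illustration rather than a tuned kernel; the kernel name, the warp-to-tile indexing, and the assumptions (M, N, K multiples of 16, row-major FP16 A, column-major FP16 B, FP32 accumulation, no shared-memory staging) are all choices made for this example.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes one 16x16 tile of D = A * B (FP16 inputs, FP32 accumulator).
// Assumes M, N, K are multiples of 16 and the launch maps one warp per tile.
__global__ void wmma_gemm_tile(const half* A, const half* B, float* D,
                               int M, int N, int K) {
    // Warp coordinates over the grid of 16x16 output tiles.
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;  // tile row
    int warpN =  blockIdx.y * blockDim.y + threadIdx.y;              // tile column

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;
    wmma::fill_fragment(acc_frag, 0.0f);

    // March along the K dimension, accumulating 16x16x16 MMAs into acc_frag.
    for (int k = 0; k < K; k += 16) {
        const half* a_tile = A + (warpM * 16) * K + k;   // row-major A, ld = K
        const half* b_tile = B + (warpN * 16) * K + k;   // col-major B, ld = K
        wmma::load_matrix_sync(a_frag, a_tile, K);
        wmma::load_matrix_sync(b_frag, b_tile, K);
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
    }

    // Cooperatively store the FP32 result tile (row-major D, ld = N).
    float* d_tile = D + (warpM * 16) * N + warpN * 16;
    wmma::store_matrix_sync(d_tile, acc_frag, N, wmma::mem_row_major);
}
```

With these assumptions, a launch such as blockDim = (64, 4) gives two warps along M and four along N per block, so gridDim = (M/32, N/64) covers the full output.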
MMA PTX Instructions (Lower-Level):
- mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32: PTX instruction for 16×8×16 matrix multiply — finer granularity than WMMA, allows more flexible tiling strategies
- Register Mapping: each thread holds specific elements of the matrix fragments in its registers — understanding the thread-to-element mapping is critical for efficient data loading (see the inline-PTX sketch after this list)
- CUTLASS Library: NVIDIA's templated C++ library abstracts MMA instructions with compile-time tile size selection — provides optimized epilogue fusion, software pipelining, and warp specialization
- Warp Specialization: in Hopper's programming model, warps are specialized for either data loading (producer) or computation (consumer) — decouples memory access from Tensor Core execution
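As a reference point for how this looks below the WMMA level, the device function sketched here issues a single m16n8k16 FP16 MMA through inline PTX. It assumes sm_80 or newer and that the caller has already packed A, B, and C into registers following the thread-to-element mapping defined in the PTX ISA for this shape; the wrapper name and argument layout are illustrative (CUTLASS hides the same instruction behind its MMA atoms).

```cuda
#include <cstdint>

// Single m16n8k16 MMA: per thread, A is 4 x .b32 (8 packed halves), B is 2 x .b32
// (4 packed halves), and C/D are 4 x .f32, per the PTX-defined fragment layout.
__device__ void mma_m16n8k16_f16_f32(float d[4], const uint32_t a[4],
                                     const uint32_t b[2], const float c[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```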
Performance Optimization:
- Shared Memory Staging: load global memory tiles into shared memory, then load WMMA fragments from shared memory — eliminates redundant global memory accesses across warps computing adjacent output tiles (see the staged-kernel sketch after this list)
- Software Pipelining: overlap global-to-shared memory loads for the next tile with Tensor Core computation on the current tile — maintains Tensor Core utilization at >90% for large matrices
- Register Pressure: WMMA fragments consume significant register space (a warp holding A, B, and several FP32 accumulator fragments at once can easily exceed 64 registers per thread) — balance fragment count against occupancy to maximize throughput
- Memory Layout: Tensor Cores achieve peak performance with properly aligned tiles (256-bit/32-byte-aligned tile pointers, contiguous in the fast-changing dimension) — shared-memory tiles additionally need skew padding or XOR swizzling to avoid bank conflicts during fragment loads
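A compressed sketch of the shared-memory staging pattern, under illustrative assumptions (a 32×32 block tile computed by four warps, 128 threads per block, row-major FP16 A and B, dimensions multiples of 16). Software pipelining and the skew padding/swizzling mentioned above are omitted to keep it short, so it shows the data reuse rather than peak performance:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

#define TILE_K 16
#define BLOCK_TILE 32   // 2x2 warps per block, each owning a 16x16 output tile

// Assumes row-major FP16 A (MxK) and B (KxN), FP32 D (MxN), dims multiples of 16,
// launched with 128 threads per block and a (N/32, M/32) grid.
__global__ void wmma_gemm_staged(const half* A, const half* B, float* D,
                                 int M, int N, int K) {
    // Staging buffers: each global tile of A and B is fetched once per block and
    // then reused by all four warps, instead of once per warp from global memory.
    __shared__ half As[BLOCK_TILE][TILE_K];   // 32x16 slice of A
    __shared__ half Bs[TILE_K][BLOCK_TILE];   // 16x32 slice of B

    int warpId  = threadIdx.x / warpSize;     // 0..3
    int warpRow = warpId / 2;                 // 16-row strip within the block tile
    int warpCol = warpId % 2;                 // 16-column strip within the block tile
    int blockRow = blockIdx.y * BLOCK_TILE;
    int blockCol = blockIdx.x * BLOCK_TILE;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    for (int k = 0; k < K; k += TILE_K) {
        // All 128 threads cooperatively copy the 32x16 A slice and 16x32 B slice.
        for (int i = threadIdx.x; i < BLOCK_TILE * TILE_K; i += blockDim.x) {
            int r = i / TILE_K, c = i % TILE_K;
            As[r][c] = A[(blockRow + r) * K + (k + c)];
            Bs[c][r] = B[(k + c) * N + (blockCol + r)];
        }
        __syncthreads();

        // Each warp loads its fragments from shared memory and issues the MMA.
        wmma::load_matrix_sync(a_frag, &As[warpRow * 16][0], TILE_K);
        wmma::load_matrix_sync(b_frag, &Bs[0][warpCol * 16], BLOCK_TILE);
        wmma::mma_sync(acc, a_frag, b_frag, acc);
        __syncthreads();
    }

    float* d_tile = D + (blockRow + warpRow * 16) * N + (blockCol + warpCol * 16);
    wmma::store_matrix_sync(d_tile, acc, N, wmma::mem_row_major);
}
```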
Mixed-Precision Training Pattern:
- Forward Pass: store master weights in FP32, cast to FP16/BF16 for Tensor Core GEMM operations — Tensor Cores compute in reduced precision but accumulate in FP32
- Loss Scaling: multiply loss by a scale factor (typically 1024-65536) before backward pass to prevent FP16 gradient underflow — dynamic loss scaling adjusts the factor based on overflow detection
- Gradient Accumulation: accumulate gradients in FP32 even when individual gradient computations use FP16 — prevents precision loss during summation across micro-batches
- Weight Update: apply the FP32 gradients to the FP32 master weights, then cast the updated weights to FP16 for the next iteration — maintains model accuracy while achieving 2-3× training speedup from Tensor Cores
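As a deliberately simplified illustration of the last two steps, the kernel below un-scales an FP16 gradient, applies a plain SGD update to the FP32 master weight, and refreshes the FP16 copy consumed by the next iteration's GEMMs. The names (mixed_precision_sgd_step, loss_scale, lr) are illustrative, and the inf/nan overflow scan used by dynamic loss scaling is assumed to run as a separate pass beforehand so that an overflowed step is skipped entirely.

```cuda
#include <cuda_fp16.h>

// One element per thread: un-scale, update the FP32 master weight, refresh FP16 copy.
__global__ void mixed_precision_sgd_step(float* master_w,   // FP32 master weights
                                         half* model_w,     // FP16 copy used in GEMMs
                                         const half* grad,  // FP16 gradients (still scaled)
                                         float loss_scale, float lr, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Undo loss scaling so the gradient regains its true magnitude in FP32.
    float g = __half2float(grad[i]) / loss_scale;

    // Update the FP32 master weight (plain SGD; a real optimizer would go here).
    master_w[i] -= lr * g;

    // Cast the updated weight back to FP16 for the next forward/backward pass.
    model_w[i] = __float2half(master_w[i]);
}
```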
Tensor Cores have transformed the GPU from a general throughput-oriented processor into a matrix-computation engine — modern AI training and inference workloads spend 90%+ of their compute time in Tensor Core GEMM operations, making efficient Tensor Core utilization the single most important optimization for GPU performance.