CUDA Tensor Operations and cuBLAS

cuBLAS and its companion libraries form NVIDIA's GPU library ecosystem for high-performance linear algebra and tensor computation. They provide highly optimized implementations of matrix multiplication (GEMM), convolution, and tensor contractions, the operations that form the computational backbone of deep learning training and inference, scientific simulation, and numerical computing. cuBLAS and the related libraries (cuBLASLt, cuDNN, cuTENSOR) can approach theoretical peak GPU throughput on suitably sized problems by exploiting Tensor Cores, the memory hierarchy, and instruction-level parallelism.
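
To make "near peak" concrete, the usual yardstick is the GEMM operation count: an M×K by K×N product performs roughly 2·M·N·K floating-point operations. The sketch below is plain C with hypothetical helper names (`gemm_flops`, `achieved_tflops`, `fraction_of_peak` are illustrative, not part of any NVIDIA API). It assumes you have measured the kernel time yourself and take the peak figure from the GPU's datasheet.

```c
#include <assert.h>

/* Approximate floating-point operations in an (m x k) * (k x n) GEMM:
   each of the m*n outputs needs k multiplies and k adds.
   The alpha/beta scaling adds only O(m*n) work and is ignored here. */
double gemm_flops(long m, long n, long k) {
    assert(m > 0 && n > 0 && k > 0);
    return 2.0 * (double)m * (double)n * (double)k;
}

/* Achieved throughput in TFLOP/s for a measured run time in seconds. */
double achieved_tflops(long m, long n, long k, double seconds) {
    assert(seconds > 0.0);
    return gemm_flops(m, n, k) / seconds / 1e12;
}

/* Fraction of a peak figure. peak_tflops is an input you supply
   (for example from a datasheet); no value is assumed here. */
double fraction_of_peak(double achieved, double peak_tflops) {
    assert(peak_tflops > 0.0);
    return achieved / peak_tflops;
}
```

For example, a 4096×4096×4096 GEMM is about 1.37 × 10^11 operations, so a run measured at 1 ms corresponds to roughly 137 TFLOP/s; dividing by the device's quoted peak for the same precision gives the efficiency.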

Why GEMM Is Central

NVIDIA Tensor Cores

cuBLAS API

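// Single-precision GEMM: C = alpha * (A × B) + beta * C
// cuBLAS assumes column-major storage; A is M×K, B is K×N, C is M×N.
#include <cublas_v2.h>

const float alpha = 1.0f;  // with alpha = 1 and beta = 0 this reduces to C = A × B
const float beta  = 0.0f;

cublasHandle_t handle;
cublasCreate(&handle);     // returns a cublasStatus_t; compare with CUBLAS_STATUS_SUCCESS

cublasSgemm(handle,
    CUBLAS_OP_N, CUBLAS_OP_N, // no transpose on A or B
    M, N, K,                   // dimensions
    &alpha,                     // scalar applied to A × B (host pointer by default)
    d_A, M,                    // matrix A (device), leading dimension lda = M
    d_B, K,                    // matrix B (device), leading dimension ldb = K
    &beta,                      // scalar applied to the existing contents of C
    d_C, M);                   // output C (device), leading dimension ldc = M

cublasDestroy(handle);     // release the handle when finished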
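
The column-major layout and the leading-dimension arguments are the most common source of mistakes with this call. As a hedged illustration, the following plain C function is a CPU reference for the no-transpose case, useful for checking GPU results on small inputs. The names `sgemm_ref_nn` and `idx` are hypothetical helpers, not cuBLAS functions, and the loop is written for clarity rather than speed.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major offset of element (row, col) in a matrix
   whose leading dimension (allocated rows per column) is ld. */
static size_t idx(int row, int col, int ld) {
    return (size_t)col * (size_t)ld + (size_t)row;
}

/* CPU reference for the CUBLAS_OP_N / CUBLAS_OP_N case:
   C (m x n) = alpha * A (m x k) * B (k x n) + beta * C,
   with every matrix stored column-major. */
void sgemm_ref_nn(int m, int n, int k, float alpha,
                  const float *A, int lda,
                  const float *B, int ldb,
                  float beta, float *C, int ldc)
{
    /* Leading dimensions must cover at least the number of rows. */
    assert(lda >= m && ldb >= k && ldc >= m);

    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[idx(i, p, lda)] * B[idx(p, j, ldb)];

            /* When beta is zero, C is treated as write-only so that
               uninitialized values in C never reach the result. */
            size_t c = idx(i, j, ldc);
            C[c] = (beta == 0.0f) ? alpha * acc
                                  : alpha * acc + beta * C[c];
        }
    }
}
```

With A = [[1, 2, 3], [4, 5, 6]] stored column-major as {1, 4, 2, 5, 3, 6} and B = [[7, 8], [9, 10], [11, 12]] stored as {7, 9, 11, 8, 10, 12}, calling `sgemm_ref_nn(2, 2, 3, 1.0f, A, 2, B, 3, 0.0f, C, 2)` fills C with {58, 139, 64, 154}, which is [[58, 64], [139, 154]] read column by column.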

cuBLASLt (LightWeight cuBLAS)

cuTENSOR

Memory Hierarchy Optimization

FP8 GEMM (H100 Feature)

Batched GEMM

Flash Attention vs. cuBLAS GEMM

CUDA tensor operations and cuBLAS are the computational engine underneath the major AI frameworks. When PyTorch, TensorFlow, or JAX run a dense matrix multiplication on an NVIDIA GPU, the call is commonly dispatched to cuBLAS or cuBLASLt, although frameworks may select other kernels for particular shapes or fused operations. GEMM performance in these libraries therefore bears directly on the throughput of neural networks trained or deployed on NVIDIA hardware, which accounts for a large share of AI computation worldwide.
