CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model that enables developers to use NVIDIA GPUs for general-purpose computation, including deep learning training and inference. CUDA is the foundation of the modern AI hardware ecosystem.
Why CUDA Dominates AI
- First-Mover Advantage: CUDA launched in 2007 and has had over 15 years of development, libraries, and ecosystem building.
- Software Ecosystem: Decades of optimized libraries — cuDNN (deep learning primitives), cuBLAS (linear algebra), NCCL (multi-GPU communication), TensorRT (inference optimization).
- Framework Support: The GPU backends of PyTorch and TensorFlow are built on CUDA. Virtually all ML research code assumes a CUDA-capable GPU.
- Developer Community: Millions of developers, extensive documentation, tutorials, and Stack Overflow answers.
CUDA Architecture Concepts
- Kernel: A function executed in parallel by many GPU threads.
- Thread: The smallest unit of execution. Threads are organized in blocks, and blocks form a grid.
- Streaming Multiprocessor (SM): One of the GPU's many compute units — each SM runs multiple thread blocks concurrently, and a GPU contains tens to over a hundred SMs.
- Shared Memory: Fast, on-chip memory shared between threads in a block. Critical for performance optimization.
- Global Memory: The GPU's main memory (HBM/GDDR). High capacity but higher latency than shared memory.
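The kernel/thread/block hierarchy above is easiest to see in the classic vector-add example. This is a minimal sketch (error checking omitted; unified memory via cudaMallocManaged is used to keep it short):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A kernel: one function body, executed in parallel by every thread in the grid.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    // Each thread derives its global index from its block and thread IDs.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // guard: the grid may overshoot n
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Unified memory keeps the sketch short; production code typically
    // uses cudaMalloc plus explicit cudaMemcpy transfers.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Launch configuration: a grid of blocks, each block a group of threads.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vector_add<<<blocks, threadsPerBlock>>>(a, b, c, n);
    cudaDeviceSynchronize();  // kernel launches are asynchronous

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The hardware schedules these blocks across the available SMs; threads within a block can additionally cooperate through shared memory and __syncthreads(), which this simple example does not need.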
CUDA for Deep Learning
- cuDNN: NVIDIA's deep learning library providing optimized implementations of convolutions, attention, normalization, activation functions, and other neural network operations.
- TensorRT: Inference optimization engine that takes trained models and produces optimized CUDA kernels for production deployment.
- FlashAttention: Custom CUDA kernel that computes attention more efficiently by tiling the computation so intermediate results stay in fast on-chip memory, avoiding materializing the full attention matrix in global memory.
- NCCL: Multi-GPU and multi-node communication library for distributed training (AllReduce, AllGather, etc.).
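The AllReduce collective that NCCL provides is the backbone of data-parallel training: each GPU contributes its local gradient buffer and receives the element-wise sum. A minimal single-process, multi-GPU sketch (error checking omitted; assumes at most 8 visible GPUs):

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    ncclComm_t comms[8];
    cudaStream_t streams[8];
    float* buf[8];
    const size_t count = 1024;  // elements per GPU (e.g. a gradient shard)

    // One communicator per visible GPU; nullptr devlist means devices 0..ndev-1.
    ncclCommInitAll(comms, ndev, nullptr);

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&buf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    // Group the per-GPU calls so NCCL launches them as one collective.
    ncclGroupStart();
    for (int i = 0; i < ndev; ++i)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    printf("AllReduce complete across %d GPUs\n", ndev);
    return 0;
}
```

Multi-node training uses the same ncclAllReduce call, but communicators are initialized with a unique ID broadcast between processes rather than ncclCommInitAll.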
CUDA Versions and Compatibility
- CUDA versions must be compatible with the GPU's compute capability (hardware generation) and the NVIDIA driver version.
- CUDA 12.x: Current version, supporting Hopper (H100) and Ada Lovelace (RTX 4090) GPUs.
- Framework compatibility: PyTorch releases are built against specific CUDA versions; mismatches between the installed driver, the CUDA toolkit, and the framework build are a common source of runtime errors.
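The three compatibility layers above can be inspected programmatically. This sketch queries the driver's maximum supported CUDA version, the toolkit version the binary was built against, and each GPU's compute capability:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driverVer = 0, runtimeVer = 0;
    cudaDriverGetVersion(&driverVer);    // highest CUDA version the installed driver supports
    cudaRuntimeGetVersion(&runtimeVer);  // CUDA toolkit version this binary was built against
    printf("driver supports CUDA %d.%d, runtime built with CUDA %d.%d\n",
           driverVer / 1000, (driverVer % 1000) / 10,
           runtimeVer / 1000, (runtimeVer % 1000) / 10);

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    for (int i = 0; i < ndev; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // Compute capability identifies the hardware generation,
        // e.g. 9.0 for Hopper (H100), 8.9 for Ada Lovelace.
        printf("GPU %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```

A kernel compiled only for a newer compute capability than the installed GPU will fail to load, which is why frameworks ship binaries targeting a range of architectures.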
The CUDA Moat
CUDA's dominance is both technical and economic — the vast ecosystem of libraries, tools, and developer knowledge creates a massive switching cost that competitors (AMD ROCm, Intel oneAPI) struggle to overcome. This "CUDA moat" is NVIDIA's most valuable asset beyond the hardware itself.