CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model that enables developers to use NVIDIA GPUs for general-purpose computation, including deep learning training and inference. CUDA is the foundation of the modern AI hardware ecosystem.
Why CUDA Dominates AI
- First-Mover Advantage: CUDA launched in 2007 and has had over 15 years of development, libraries, and ecosystem building.
- Software Ecosystem: Decades of optimized libraries — cuDNN (deep learning primitives), cuBLAS (linear algebra), NCCL (multi-GPU communication), TensorRT (inference optimization).
- Framework Support: The GPU backends of PyTorch and TensorFlow are built on CUDA. Virtually all ML research code assumes a CUDA-capable GPU.
- Developer Community: Millions of developers, extensive documentation, tutorials, and Stack Overflow answers.
CUDA Architecture Concepts
- Kernel: A function executed in parallel by many GPU threads.
- Thread: The smallest unit of execution. Threads are organized in blocks, and blocks form a grid.
- Streaming Multiprocessor (SM): One of the GPU's many compute units — each SM runs multiple thread blocks concurrently, and a GPU contains tens to over a hundred SMs.
- Shared Memory: Fast, on-chip memory shared between threads in a block. Critical for performance optimization.
- Global Memory: The GPU's main memory (HBM/GDDR). High capacity but higher latency than shared memory.
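The kernel/thread/block hierarchy above is easiest to see in the classic vector-add example. This is a minimal sketch (error checking omitted; unified memory via cudaMallocManaged is used to keep it short):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A kernel: one function body, executed in parallel by every thread in the grid.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    // Each thread derives its global index from its block and thread IDs.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // guard: the grid may overshoot n
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Unified memory keeps the sketch short; production code typically
    // uses cudaMalloc plus explicit cudaMemcpy transfers.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Launch configuration: a grid of blocks, each block a group of threads.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vector_add<<<blocks, threadsPerBlock>>>(a, b, c, n);
    cudaDeviceSynchronize();  // kernel launches are asynchronous

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The hardware schedules these blocks across the available SMs; threads within a block can additionally cooperate through shared memory and __syncthreads(), which this simple example does not need.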
CUDA for Deep Learning
- cuDNN: NVIDIA's deep learning library providing optimized implementations of convolutions, attention, normalization, activation functions, and other neural network operations.
- TensorRT: Inference optimization engine that takes trained models and produces optimized CUDA kernels for production deployment.
- FlashAttention: Custom CUDA kernel that computes attention more efficiently by tiling the computation so intermediate results stay in fast on-chip memory, avoiding materializing the full attention matrix in global memory.
- NCCL: Multi-GPU and multi-node communication library for distributed training (AllReduce, AllGather, etc.).
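The AllReduce collective that NCCL provides is the backbone of data-parallel training: each GPU contributes its local gradient buffer and receives the element-wise sum. A minimal single-process, multi-GPU sketch (error checking omitted; assumes at most 8 visible GPUs):

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    ncclComm_t comms[8];
    cudaStream_t streams[8];
    float* buf[8];
    const size_t count = 1024;  // elements per GPU (e.g. a gradient shard)

    // One communicator per visible GPU; nullptr devlist means devices 0..ndev-1.
    ncclCommInitAll(comms, ndev, nullptr);

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&buf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    // Group the per-GPU calls so NCCL launches them as one collective.
    ncclGroupStart();
    for (int i = 0; i < ndev; ++i)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    printf("AllReduce complete across %d GPUs\n", ndev);
    return 0;
}
```

Multi-node training uses the same ncclAllReduce call, but communicators are initialized with a unique ID broadcast between processes rather than ncclCommInitAll.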
CUDA Versions and Compatibility
- CUDA versions must be compatible with the GPU's compute capability (hardware generation) and the NVIDIA driver version.
- CUDA 12.x: Current version, supporting Hopper (H100) and Ada Lovelace (RTX 4090) GPUs.
- Framework compatibility: PyTorch releases are built against specific CUDA versions; mismatches between the installed driver, the CUDA toolkit, and the framework build are a common source of runtime errors.
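The three compatibility layers above can be inspected programmatically. This sketch queries the driver's maximum supported CUDA version, the toolkit version the binary was built against, and each GPU's compute capability:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driverVer = 0, runtimeVer = 0;
    cudaDriverGetVersion(&driverVer);    // highest CUDA version the installed driver supports
    cudaRuntimeGetVersion(&runtimeVer);  // CUDA toolkit version this binary was built against
    printf("driver supports CUDA %d.%d, runtime built with CUDA %d.%d\n",
           driverVer / 1000, (driverVer % 1000) / 10,
           runtimeVer / 1000, (runtimeVer % 1000) / 10);

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    for (int i = 0; i < ndev; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // Compute capability identifies the hardware generation,
        // e.g. 9.0 for Hopper (H100), 8.9 for Ada Lovelace.
        printf("GPU %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```

A kernel compiled only for a newer compute capability than the installed GPU will fail to load, which is why frameworks ship binaries targeting a range of architectures.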
The CUDA Moat
CUDA's dominance is both technical and economic — the vast ecosystem of libraries, tools, and developer knowledge creates a massive switching cost that competitors (AMD ROCm, Intel oneAPI) struggle to overcome. This "CUDA moat" is NVIDIA's most valuable asset beyond the hardware itself.