GPU memory hierarchy refers to the levels of memory on a GPU (registers, shared memory, caches, global memory).
GPU memory usage measures how much GPU memory is currently in use.
NVIDIA GPU Operator manages GPU drivers and plugins in Kubernetes. Required for GPU workloads on k8s.
GPU utilization is the percentage of GPU compute capacity actually being used.
# GPU (Graphics Processing Unit)
## Graphics Processing Unit
- **GPU (Graphics Processing Unit)**: A specialized processor designed for parallel processing tasks
- **GPUs**: Plural form of GPU
- **Graphics Card**: Physical hardware component containing a GPU, VRAM, and cooling system
- **Accelerator**: Specialized hardware that offloads computation from the CPU
## Architecture Fundamentals
### Core Components
- **Streaming Multiprocessors (SMs)**: Contain multiple CUDA cores for parallel execution
- **VRAM (Video RAM)**: High-bandwidth memory dedicated to the GPU
- **Memory Bus**: Data pathway between GPU and VRAM
- **PCIe Interface**: Connection to the motherboard/CPU
### Parallelism Model
GPUs excel at **SIMD** (Single Instruction, Multiple Data) execution, applying one instruction across many data elements in parallel. The achievable speedup from that parallelism is bounded:
$$
\text{Speedup} = \frac{T_{\text{sequential}}}{T_{\text{parallel}}} \leq \frac{1}{(1-P) + \frac{P}{N}}
$$
Where:
- $P$ = Parallelizable fraction of code
- $N$ = Number of parallel processors
- This bound is **Amdahl's Law**: the serial fraction $(1-P)$ caps the achievable speedup no matter how many processors are added
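For example, even a highly parallel workload with $P = 0.95$ running on $N = 10{,}000$ cores is limited to roughly a 20x speedup by its serial 5%:
$$
\text{Speedup} \leq \frac{1}{(1 - 0.95) + \frac{0.95}{10{,}000}} = \frac{1}{0.050095} \approx 20
$$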
## Performance Metrics
### FLOPS (Floating Point Operations Per Second)
$$
\text{FLOPS} = \text{Cores} \times \text{Clock Speed (Hz)} \times \text{FLOPs per core per cycle}
$$
Example calculation for a GPU with 10,000 cores at 2 GHz, each issuing 2 FLOPs per cycle (one fused multiply-add):
$$
\text{FLOPS} = 10{,}000 \times 2 \times 10^9 \times 2 = 40 \text{ TFLOPS}
$$
### Memory Bandwidth
$$
\text{Bandwidth (GB/s)} = \frac{\text{Memory Clock (Hz)} \times \text{Bus Width (bits)} \times \text{Data Rate}}{8 \times 10^9}
$$
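For example, a hypothetical card with a 2 GHz memory clock, a 256-bit bus, and 8 data transfers per clock delivers:
$$
\text{Bandwidth} = \frac{2 \times 10^9 \times 256 \times 8}{8 \times 10^9} = 512 \text{ GB/s}
$$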
### Arithmetic Intensity
$$
\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes Accessed}}
$$
The **Roofline Model** bounds performance:
$$
\text{Attainable FLOPS} = \min\left(\text{Peak FLOPS}, \text{Bandwidth} \times \text{Arithmetic Intensity}\right)
$$
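For example, pairing the 40 TFLOPS figure from above with an assumed 2 TB/s of memory bandwidth puts the ridge point at $\frac{40 \times 10^{12}}{2 \times 10^{12}} = 20$ FLOPs/byte; a kernel with an arithmetic intensity of 5 is therefore bandwidth-bound:
$$
\text{Attainable FLOPS} = \min\left(40\ \text{TFLOPS},\ 2\ \text{TB/s} \times 5\ \tfrac{\text{FLOPs}}{\text{byte}}\right) = 10\ \text{TFLOPS}
$$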
## GPU Computing Concepts
### Thread Hierarchy (CUDA Model)
- **Thread**: Smallest unit of execution
- Each thread has unique indices: `threadIdx.x`, `threadIdx.y`, `threadIdx.z`
- **Block**: Group of threads that can cooperate
- Shared memory accessible within block
- Maximum threads per block: typically 1024
- **Grid**: Collection of blocks
- Total threads: $\text{Grid Size} \times \text{Block Size}$
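To make the index arithmetic concrete, here is a minimal sketch of a 2D launch; the kernel and variable names (`scalePixels`, `d_img`, `width`, `height`) are illustrative, not from any particular library:
```cuda
// Hypothetical kernel: scale every pixel of a width x height image.
__global__ void scalePixels(float *img, int width, int height, float factor) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // global column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // global row index
    if (x < width && y < height) {                   // guard: the grid may overshoot
        img[y * width + x] *= factor;
    }
}

// Host side: 16x16 = 256 threads per block, enough blocks to cover the image.
dim3 block(16, 16);
dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
scalePixels<<<grid, block>>>(d_img, width, height, 2.0f);
```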
### Memory Hierarchy
| Memory Type | Scope | Latency | Size |
|-------------|-------|---------|------|
| Registers | Thread | ~1 cycle | ~256 KB per SM |
| Shared Memory | Block | ~20 cycles | 48-164 KB per SM |
| L1 Cache | SM | ~30 cycles | 128-256 KB per SM |
| L2 Cache | Device | ~200 cycles | 4-50 MB per device |
| Global Memory (VRAM) | Device | ~400-600 cycles | 8-80+ GB per device |
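As a sketch of how a block can exploit this hierarchy (the kernel name and tile size are illustrative, and a 256-thread block is assumed), the reduction below stages data in block-scoped shared memory so repeated accesses never touch global memory:
```cuda
// Hypothetical block-wise sum: one partial sum per block.
// Assumes the kernel is launched with 256 threads per block.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];                       // shared memory: visible to the block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;   // one global load per thread
    __syncthreads();                                  // wait until the whole tile is loaded

    // Tree reduction entirely in shared memory (low latency, high bandwidth).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        out[blockIdx.x] = tile[0];                    // write the block's partial sum
    }
}
```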
## Matrix Operations (Key for AI/ML)
### Matrix Multiplication Complexity
Standard matrix multiplication for $A_{m \times k} \cdot B_{k \times n}$:
$$
C_{ij} = \sum_{l=1}^{k} A_{il} \cdot B_{lj}
$$
- **Time Complexity**: $O(m \times n \times k)$
- **Naive**: $O(n^3)$ for square matrices
- **Strassen's Algorithm**: $O(n^{2.807})$
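A naive GPU mapping of this sum assigns one thread per output element $C_{ij}$; the sketch below (row-major storage assumed, names illustrative) makes the $O(m \times n \times k)$ cost explicit as a length-$k$ loop per thread:
```cuda
// Naive matrix multiply: C (m x n) = A (m x k) * B (k x n), row-major.
// Each thread computes one element of C.
__global__ void matmulNaive(const float *A, const float *B, float *C,
                            int m, int n, int k) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // j: column of C
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // i: row of C
    if (row < m && col < n) {
        float acc = 0.0f;
        for (int l = 0; l < k; ++l) {                 // O(k) work per output element
            acc += A[row * k + l] * B[l * n + col];
        }
        C[row * n + col] = acc;
    }
}
```
Production GEMMs tile this loop through shared memory and registers to raise arithmetic intensity; the naive version is memory-bound.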
### Tensor Core Operations
Mixed-precision matrix multiply-accumulate:
$$
D = A \times B + C
$$
Where:
- $A, B$ are FP16 (16-bit floating point)
- $C, D$ are FP32 (32-bit floating point)
Throughput comparison:
- **FP32 CUDA Cores**: ~40 TFLOPS
- **FP16 Tensor Cores**: ~300+ TFLOPS
- **INT8 Tensor Cores**: ~600+ TFLOPS
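In CUDA, Tensor Cores are reachable through the warp-level `nvcuda::wmma` API (among other routes such as cuBLAS and CUTLASS). The fragment below is a minimal single-tile sketch of $D = A \times B + C$ with FP16 inputs and FP32 accumulation, not a tuned GEMM; it assumes a GPU with Tensor Cores (sm_70 or newer) and a launch of at least one warp (32 threads):
```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16 tile: D = A * B + C (FP16 in, FP32 accumulate).
// Launch example: wmmaTile<<<1, 32>>>(dA, dB, dC, dD);
__global__ void wmmaTile(const half *A, const half *B, const float *C, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> accFrag;

    wmma::load_matrix_sync(aFrag, A, 16);                        // leading dimension 16
    wmma::load_matrix_sync(bFrag, B, 16);
    wmma::load_matrix_sync(accFrag, C, 16, wmma::mem_row_major);
    wmma::mma_sync(accFrag, aFrag, bFrag, accFrag);              // Tensor Core multiply-accumulate
    wmma::store_matrix_sync(D, accFrag, 16, wmma::mem_row_major);
}
```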
## Power and Thermal Equations
### Thermal Design Power (TDP)
$$
P_{\text{dynamic}} = \alpha \cdot C \cdot V^2 \cdot f
$$
Where:
- $\alpha$ = Activity factor
- $C$ = Capacitance
- $V$ = Voltage
- $f$ = Frequency
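The quadratic dependence on voltage is why small voltage changes move power so much; for example, holding $\alpha$, $C$, and $f$ fixed while dropping voltage by 10% cuts dynamic power by roughly 19%:
$$
\frac{P_{\text{dynamic}}(0.9V)}{P_{\text{dynamic}}(V)} = (0.9)^2 = 0.81
$$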
### Temperature Relationship
$$
T_{\text{junction}} = T_{\text{ambient}} + (P \times R_{\theta})
$$
Where $R_{\theta}$ is thermal resistance in °C/W.
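For example, with assumed values of $P = 300$ W, $R_{\theta} = 0.15$ °C/W, and a 30 °C ambient:
$$
T_{\text{junction}} = 30 + (300 \times 0.15) = 75\ ^{\circ}\mathrm{C}
$$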
## Deep Learning Operations
### Convolution (CNN)
For a 2D convolution with input $I$, kernel $K$, output $O$:
$$
O(i,j) = \sum_{m}\sum_{n} I(i+m, j+n) \cdot K(m,n)
$$
Output dimensions:
$$
O_{\text{size}} = \left\lfloor \frac{I_{\text{size}} - K_{\text{size}} + 2P}{S} \right\rfloor + 1
$$
Where:
- $P$ = Padding
- $S$ = Stride
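For example, a $224 \times 224$ input with a $3 \times 3$ kernel, padding $P = 1$, and stride $S = 2$ produces a $112 \times 112$ output:
$$
O_{\text{size}} = \left\lfloor \frac{224 - 3 + 2(1)}{2} \right\rfloor + 1 = \lfloor 111.5 \rfloor + 1 = 112
$$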
### Attention Mechanism (Transformers)
$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$
Time complexity is $O(n^2 \cdot d)$, and the attention score matrix alone requires $O(n^2)$ memory per head, where $n$ is sequence length and $d$ is the head dimension.
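For example, at a sequence length of $n = 4096$, each head materializes an $n \times n$ score matrix:
$$
4096^2 = 16{,}777{,}216 \text{ scores} \times 2 \text{ bytes (FP16)} \approx 32 \text{ MB per head}
$$
This is why long-context attention tends to be limited by memory rather than FLOPs.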
## Major GPU Vendors
### NVIDIA
- **Gaming**: GeForce RTX series
- **Professional**: Quadro / RTX A-series
- **Data Center**: A100, H100, H200, B100, B200
- **CUDA Ecosystem**: Dominant in AI/ML
### AMD
- **Gaming**: Radeon RX series
- **Data Center**: Instinct MI series (MI300X)
- **ROCm**: Open-source GPU computing platform
### Intel
- **Consumer**: Arc A-series
- **Data Center**: Data Center GPU Max series and Gaudi AI accelerators
## Code Example: CUDA Kernel
```cuda
// Vector addition kernel: each thread computes one element of C
__global__ void vectorAdd(float *A, float *B, float *C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx < N) {                                    // guard: grid may overshoot N
        C[idx] = A[idx] + B[idx];
    }
}

// Launch configuration (host code)
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, N);
```
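For completeness, a minimal host-side driver for this kernel might look like the sketch below (no error checking; it assumes the kernel above is in the same source file):
```cuda
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int N = 1 << 20;                          // 1M elements (example size)
    std::vector<float> hA(N, 1.0f), hB(N, 2.0f), hC(N);

    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, N * sizeof(float));     // device (VRAM) allocations
    cudaMalloc((void**)&dB, N * sizeof(float));
    cudaMalloc((void**)&dC, N * sizeof(float));

    cudaMemcpy(dA, hA.data(), N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), N * sizeof(float), cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC, N);
    cudaDeviceSynchronize();                        // wait for the kernel to finish

    cudaMemcpy(hC.data(), dC, N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```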
Reasoning about visual scenes.
Graceful degradation maintains partial functionality when components fail.
Graclus pooling uses a deterministic graph coarsening algorithm for hierarchical graph classification.
Grad-CAM (Gradient-weighted Class Activation Mapping) visualizes the image regions most important for a CNN's predictions.
Grad-CAM++ is an improved version of Grad-CAM.
Gradient accumulation sums gradients over multiple mini-batches (microbatches) before each weight update, simulating a larger virtual batch size when GPU memory is limited.
Gradient boosting builds trees sequentially; popular implementations include XGBoost, LightGBM, and CatBoost, applied to tasks such as defect identification.
Gradient bucketing groups gradients for efficient communication in distributed training.
Gradient centralization centers gradients to improve training.
Gradient clipping limits gradient magnitude (typical max norm 1.0), preventing exploding gradients and training instability; per-sample clipping also bounds privacy leakage.
Gradient compression reduces communication in distributed training by quantizing or sparsifying gradients, sometimes in privacy-preserving forms.
Gradient episodic memory constrains gradient updates so that knowledge from earlier tasks is preserved (continual learning).
Gradient flow refers to maintaining useful gradients through very deep or sparse networks.
Gradient masking makes gradients uninformative (an adversarial defense tactic).
Gradient noise injection adds noise to gradients for regularization.
Gradient normalization normalizes gradient magnitude.
Gradient penalty regularizes gradient magnitude (used in GANs such as WGAN-GP).
Gradient quantization quantizes gradients for transmission.
Gradient reversal reverses gradients during backpropagation for adversarial training (e.g., domain adaptation).
Gradient scaling scales gradients to prevent underflow in mixed-precision training.
Gradient sparsification sends only the significant gradient components.
Gradient synchronization aggregates gradients across devices.
Mask tokens with large gradients.
Gradient-based architecture search (e.g., DARTS) optimizes the architecture itself with gradients.
Gradient-based prompt tuning optimizes continuous prompt embeddings using gradients.
Gradient-based pruning estimates weight importance using gradient information.
Gradients are computed by backpropagation via the chain rule, flowing error from output to input; the optimizer then uses them to update weights.
Gradio creates ML demo interfaces quickly, with Hugging Face integration and instant sharing.
Gradual rollout increases traffic to a new model in stages (1%, 10%, 50%, 100%), monitoring metrics at each stage.
Gradual unfreezing unfreezes layers gradually from top to bottom during fine-tuning.
Grafana dashboards visualize metrics, alert on thresholds, and provide operations visibility.