
AI Factory Glossary

1,005 technical terms and definitions

cross-modal distillation, multimodal ai

Transfer knowledge from a model trained on one modality to a model operating on another, e.g., distilling an image teacher into an audio student.

cross-modal generation, multimodal ai

Generate one modality from another.

cross-modal pretext tasks, multimodal ai

Self-supervised learning objectives defined across modalities.

cross-modal retrieval, audio & speech

Cross-modal retrieval finds matching samples across modalities, such as retrieving audio clips from visual queries.

cross-modal retrieval, multimodal ai

Query with one modality and retrieve matching items in another.

cross-section preparation,metrology

Cut samples to reveal internal structure.

cross-section sem,metrology

SEM of cleaved or FIB-cut wafer to see layers and profiles.

cross-sectioning (package),cross-sectioning,package,failure analysis

Cut open package for inspection.

cross-silo federated learning, federated learning

Federated learning across a small number of organizations (silos), each holding a large local dataset.

cross-stitch networks, multi-task learning

Learn linear combinations of task-specific features to share information across tasks.

cross-training, quality & reliability

Cross-training develops employee capability in multiple roles, improving workforce flexibility.

cross-view consistency, multi-view learning

Enforce agreement between views.

crosstalk, signal & power integrity

Crosstalk is unwanted coupling between adjacent interconnects through capacitive and inductive paths, degrading signal integrity.

crosstalk,design

Unwanted coupling between adjacent signals.

crossvit, computer vision

Dual-branch vision transformer that processes two patch sizes and fuses the branches with cross-attention.

crow-amsaa, reliability

Reliability growth model (NHPP-based) that tracks failure intensity during development and test.

crowdsourcing,data

Use many workers to collect annotations cheaply.

crows-pairs, evaluation

Challenge set of paired sentences for measuring social-bias stereotypes across categories.

crr, crr, reinforcement learning advanced

Critic Regularized Regression is an offline RL algorithm that combines advantage-weighted regression with a learned critic for policy improvement.

cryo pump, manufacturing operations

Cryogenic pumps capture gases by condensing them on cryogenically cooled surfaces.

cryogenic etch,etch

Etch at very low temperature for better anisotropy and selectivity.

cryptographic watermarking,ai safety

Use crypto techniques to prove AI generation.

crystal graph features, materials science

Graph-based material representations.

crystal orientation effects, materials science

How orientation affects properties.

crystal structure prediction, materials science

Predict stable crystal structures.

csrm, csrm, recommendation systems

Collaborative Session-based Recommendation Model integrates global user preferences with session context.

ctc loss, ctc, audio & speech

Connectionist Temporal Classification enables training of sequence models without frame-level alignment by marginalizing over all possible alignments.

ctc-attention, audio & speech

CTC-Attention combines Connectionist Temporal Classification with attention for improved ASR robustness.

ctdg, ctdg, graph neural networks

Continuous-Time Dynamic Graphs represent evolving networks where edges and nodes change in continuous time.

ctdne, ctdne, graph neural networks

Continuous-Time Dynamic Network Embeddings learn representations respecting temporal ordering of interactions.

cte matching with underfill, cte, packaging

Use underfill to reduce stress from CTE (thermal expansion) mismatch between die and substrate.

cte mismatch, cte, reliability

Stress caused by differing coefficients of thermal expansion between joined materials.

ctle, ctle, signal & power integrity

Continuous-Time Linear Equalization amplifies high frequencies, compensating for channel loss.

ctrl (conditional transformer language),ctrl,conditional transformer language,foundation model

Generate text conditioned on control codes.

cts, cts, design & verification

Clock Tree Synthesis automatically generates the clock distribution network to meet skew targets.

cu-cu bonding, advanced packaging

Direct copper-to-copper bonding.

cublas, infrastructure

NVIDIA's CUDA Basic Linear Algebra Subroutines library for dense vector and matrix operations on GPUs.
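For illustration, a minimal host-side sketch of a cuBLAS SGEMM call; the matrix sizes and values are arbitrary and error checking is omitted.

```cuda
// Minimal cuBLAS SGEMM sketch: C = alpha*A*B + beta*C (cuBLAS is column-major).
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int n = 512;                               // illustrative square matrices
    std::vector<float> hA(n * n, 1.0f), hB(n * n, 2.0f), hC(n * n, 0.0f);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dB, n * n * sizeof(float));
    cudaMalloc(&dC, n * n * sizeof(float));
    cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // Leading dimensions equal n because the matrices are stored densely.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(hC.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```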

cuda core,shader,gpu core

CUDA cores are NVIDIA's general-purpose GPU processing units; thousands of them per GPU run parallel threads for massive parallelism.

cuda cores, cuda, hardware

General-purpose GPU compute units.

cuda graph, cuda, optimization

Capture a sequence of GPU work once and replay it with reduced launch overhead.
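A hedged sketch of stream capture and replay; the kernel, sizes, and iteration counts are illustrative, and the `cudaGraphInstantiate` signature shown is the CUDA 12 form (older toolkits take extra error-node/log-buffer arguments).

```cuda
// Capture a short chain of kernel launches into a graph, then replay it cheaply.
#include <cuda_runtime.h>

__global__ void step(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float* d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int k = 0; k < 4; ++k)                      // record 4 dependent launches
        step<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);           // CUDA 12-style signature

    for (int iter = 0; iter < 100; ++iter)           // replay amortizes launch cost
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_x);
    return 0;
}
```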

cuda programming, cuda, infrastructure

Programming NVIDIA GPUs.
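A minimal end-to-end example of the programming model, using an illustrative vector-add kernel and arbitrary sizes.

```cuda
// Define a kernel, launch it over a grid of thread blocks, copy the result back.
#include <cuda_runtime.h>
#include <vector>

__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // 1D global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n);

    float *da, *db, *dc;
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&dc, n * sizeof(float));
    cudaMemcpy(da, ha.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;        // one thread per element
    vecAdd<<<blocks, threads>>>(da, db, dc, n);

    cudaMemcpy(hc.data(), dc, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```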

cuda streams, cuda, infrastructure

Independent queues of GPU work that enable concurrent kernel execution and overlap of copies with compute.
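An illustrative sketch of overlapping transfers and kernels in two streams; pinned host memory is used so the copies can run asynchronously.

```cuda
// Split the work in half and pipeline copy/compute/copy in two streams.
#include <cuda_runtime.h>

__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20, half = n / 2;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));           // pinned host buffer
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int k = 0; k < 2; ++k) {
        float* hk = h + k * half;
        float* dk = d + k * half;
        cudaMemcpyAsync(dk, hk, half * sizeof(float), cudaMemcpyHostToDevice, s[k]);
        scale<<<(half + 255) / 256, 256, 0, s[k]>>>(dk, 2.0f, half);
        cudaMemcpyAsync(hk, dk, half * sizeof(float), cudaMemcpyDeviceToHost, s[k]);
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);

    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
    cudaFree(d); cudaFreeHost(h);
    return 0;
}
```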

cuda,compute capability,nvidia

# CUDA, Compute Capability, and NVIDIA Hardware Logic

## Introduction

This document provides an in-depth analysis of NVIDIA's GPU computing ecosystem, covering three fundamental aspects:

1. **CUDA** - The programming model and software interface
2. **Compute Capability** - The versioning and feature classification system
3. **Hardware Logic** - The physical silicon implementation

## CUDA Architecture

### Programming Model Hierarchy

CUDA organizes computation into a hierarchical structure:

```
Grid
├── Block 0
│   ├── Thread 0
│   ├── Thread 1
│   └── ...
├── Block 1
│   └── ...
└── Block N
```

#### Thread Organization

- **Grid**: The entire kernel launch space
  - Contains multiple thread blocks
  - Can be 1D, 2D, or 3D dimensioned
  - Maximum dimensions: 2^31-1 × 65535 × 65535
- **Block**: A group of threads that can cooperate
  - Threads within a block can synchronize
  - Shared memory is accessible to all threads in the block
  - Maximum threads per block: typically 1024
  - Block dimensions: (Bx, By, Bz) where Bx × By × Bz ≤ 1024
- **Warp**: The fundamental execution unit
  - Fixed size of 32 threads
  - All threads execute in lockstep (SIMT)
  - Number of warps per block: $\lceil \frac{\text{threads per block}}{32} \rceil$

#### Thread Indexing Mathematics

For a thread in a 3D grid of 3D blocks:

$$\text{Global Thread ID} = \text{blockIdx} \cdot \text{blockDim} + \text{threadIdx}$$

For the 1D case:

$$\text{tid} = \text{blockIdx.x} \times \text{blockDim.x} + \text{threadIdx.x}$$

For the 2D case:

$$\text{tid} = (\text{blockIdx.y} \times \text{gridDim.x} + \text{blockIdx.x}) \times (\text{blockDim.x} \times \text{blockDim.y}) + (\text{threadIdx.y} \times \text{blockDim.x} + \text{threadIdx.x})$$

### Memory Hierarchy

#### Memory Types and Characteristics

| Memory Type | Location | Access Speed | Scope | Size |
|-------------|----------|--------------|-------|------|
| Registers | On-chip | 1 cycle | Per-thread | 64K × 32-bit (256 KB) per SM |
| Shared Memory | On-chip | ~5-30 cycles | Per-block | 48-164 KB per SM |
| L1 Cache | On-chip | ~30 cycles | Per-SM | ~128 KB per SM |
| L2 Cache | On-chip | ~200 cycles | Global | 6-60 MB |
| Global Memory | Off-chip | ~400-800 cycles | Global | 8-80 GB |

#### Memory Bandwidth Calculations

Theoretical peak bandwidth:

$$BW_{\text{peak}} = \frac{\text{Data rate per pin} \times \text{Bus width}}{8} \text{ GB/s}$$

For GDDR6X on the RTX 4090 (1313 MHz memory clock with 16 bits transferred per pin per clock, i.e. ≈21 Gbps per pin):

$$BW_{\text{peak}} = \frac{1313 \text{ MHz} \times 16 \times 384 \text{ bits}}{8} \approx 1008 \text{ GB/s}$$

Effective bandwidth with efficiency $\eta$:

$$BW_{\text{effective}} = \eta \times BW_{\text{peak}}$$

#### Memory Coalescing

For optimal memory access, threads in a warp should access contiguous memory addresses.

**Coalesced Access Pattern:**

- Thread 0 accesses address $A$
- Thread 1 accesses address $A + 4$ (for 4-byte elements)
- Thread $i$ accesses address $A + 4i$
- The warp's 128-byte request is served by a minimal set of transactions (e.g., four 32-byte transactions)

**Uncoalesced Access Pattern:**

- Random access patterns
- Strided access with non-unit stride
- Results in up to 32 separate transactions

Memory transaction efficiency:

$$\eta_{\text{mem}} = \frac{\text{Requested bytes}}{\text{Transactions} \times 32 \text{ bytes}}$$
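As a concrete illustration of these rules (kernel names and launch sizes here are assumptions, not from any particular codebase), the first kernel below uses the standard 1D global index so that consecutive threads touch consecutive elements, while the second shows the strided access shape that breaks coalescing.

```cuda
// Coalesced vs. strided access: in scale_coalesced, a warp's 32 loads fall in
// one contiguous 128-byte span; in scale_strided they scatter across memory.
__global__ void scale_coalesced(float* x, float a, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // 1D global thread ID
    if (tid < n) x[tid] *= a;                          // stride-1: coalesced
}

__global__ void scale_strided(float* x, float a, int n, int stride) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    long long i = (long long)tid * stride;             // non-unit stride: uncoalesced
    if (i < n) x[i] *= a;
}

// Launch sketch: one thread per element.
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   scale_coalesced<<<blocks, threads>>>(d_x, 2.0f, n);
```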
## Compute Capability

### Version History and Features

#### Compute Capability Matrix

| CC | Architecture | Year | Key Features |
|----|--------------|------|--------------|
| 3.0 | Kepler | 2012 | Base Kepler feature set |
| 3.5 | Kepler | 2013 | Dynamic Parallelism, Hyper-Q |
| 5.0 | Maxwell | 2014 | Improved power efficiency |
| 5.2 | Maxwell | 2015 | Better FP32/FP16 performance |
| 6.0 | Pascal | 2016 | NVLink, Unified Memory improvements |
| 6.1 | Pascal | 2016 | Consumer Pascal cards |
| 7.0 | Volta | 2017 | **Tensor Cores (1st gen)**, Independent Thread Scheduling |
| 7.5 | Turing | 2018 | RT Cores, Tensor Cores (2nd gen) |
| 8.0 | Ampere | 2020 | Tensor Cores (3rd gen), Sparsity, Async Copy |
| 8.6 | Ampere | 2020 | Consumer Ampere |
| 8.9 | Ada Lovelace | 2022 | Tensor Cores (4th gen), FP8, DLSS 3 |
| 9.0 | Hopper | 2022 | Transformer Engine, FP8, Thread Block Clusters |

### Feature Evolution Equations

#### Tensor Core Performance Growth

Tensor core peak performance (TFLOPS for FP16):

$$P_{\text{tensor}}(CC) = N_{\text{SM}}(CC) \times N_{\text{TC/SM}}(CC) \times f_{\text{clock}} \times \text{FLOPs}_{\text{per cycle}}$$

Where:

- $N_{\text{SM}}(CC)$ = Number of SMs for compute capability CC
- $N_{\text{TC/SM}}(CC)$ = Tensor cores per SM
- $f_{\text{clock}}$ = Clock frequency in GHz
- $\text{FLOPs}_{\text{per cycle}}$ = FP16 FLOPs sustained per cycle per tensor core

Example for the H100 (CC 9.0): 132 SMs with 4 tensor cores each, roughly 1024 dense FP16 FLOPs per tensor core per cycle, at the ≈1.83 GHz clock behind NVIDIA's published figure (boost clock is 1.98 GHz):

$$P_{\text{tensor}} \approx 132 \times 4 \times 1.83 \times 1024 \approx 989 \text{ TFLOPS (FP16, dense)}$$

#### Compute Capability Feature Set

Feature availability function:

$$
F(CC, \text{feature}) =
\begin{cases}
1 & \text{if } CC \geq CC_{\text{min}}(\text{feature}) \\
0 & \text{otherwise}
\end{cases}
$$

Examples:

- $F(CC, \text{Tensor Cores}) = 1$ if $CC \geq 7.0$
- $F(CC, \text{FP8 support}) = 1$ if $CC \geq 8.9$ (Ada) or $CC \geq 9.0$ (Hopper)

## NVIDIA Hardware Logic

### Streaming Multiprocessor (SM) Architecture

#### SM Component Breakdown

**SM = Streaming Multiprocessor**

Components per SM (example from Ampere GA102):

- **CUDA Cores**: 128 per SM
  - FP32 units: 128
  - INT32 units: 64
  - FP64 units: 2 (consumer GA102) or 32 (datacenter GA100)
- **Tensor Cores**: 4 per SM (3rd generation)
  - Matrix tile dimensions (WMMA API): $16 \times 16 \times 16$
  - Throughput: 128 dense FP16 FMA (256 FLOPs) per clock on GA102
- **Special Function Units (SFU)**: 32 per SM
  - Transcendental functions: $\sin, \cos, \log, \exp$
  - Throughput: 1/4 to 1/8 of FP32 operations
- **Load/Store Units (LD/ST)**: 32 per SM
  - Memory transactions per cycle: 1 per unit
- **Warp Schedulers**: 4 per SM
  - Each can issue 1 instruction per warp per cycle
  - Total: up to 4 instructions per cycle across different warps

#### SM Execution Model

Number of active warps per SM:

$$N_{\text{warps}}^{\text{active}} = \min\left(\frac{N_{\text{threads}}}{32}, \frac{R_{\text{total}}}{R_{\text{per thread}}}, \frac{S_{\text{total}}}{S_{\text{per block}}}\right)$$

Where:

- $N_{\text{threads}}$ = Total threads per block
- $R_{\text{total}}$ = Total registers per SM (e.g., 65536)
- $R_{\text{per thread}}$ = Registers used per thread
- $S_{\text{total}}$ = Total shared memory per SM (e.g., 100 KB)
- $S_{\text{per block}}$ = Shared memory used per block

#### Occupancy Calculation

Occupancy is the ratio of active warps to maximum warps:

$$\text{Occupancy} = \frac{N_{\text{warps}}^{\text{active}}}{N_{\text{warps}}^{\text{max}}}$$

For most modern GPUs: $N_{\text{warps}}^{\text{max}} = 64$ per SM.

**Example Calculation:**

Given:

- Kernel uses 32 registers per thread
- Block size: 256 threads = 8 warps
- Shared memory: 16 KB per block
- SM has: 65536 registers, 100 KB shared memory, max 64 warps

Register limit:

$$\text{Blocks}_R = \left\lfloor\frac{65536}{256 \times 32}\right\rfloor = 8 \text{ blocks}$$

Shared memory limit:

$$\text{Blocks}_S = \left\lfloor\frac{100 \times 1024}{16 \times 1024}\right\rfloor = 6 \text{ blocks}$$

Warp limit:

$$\text{Blocks}_W = \left\lfloor\frac{64}{8}\right\rfloor = 8 \text{ blocks}$$

Actual blocks: $\min(8, 6, 8) = 6$ blocks

Occupancy:

$$\text{Occupancy} = \frac{6 \times 8}{64} = \frac{48}{64} = 75\%$$
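The hand calculation above can be cross-checked at runtime with the CUDA occupancy API. A minimal sketch, using an assumed `saxpy` kernel and the same 256-thread / 16 KB configuration:

```cuda
// Query how many blocks of a given kernel fit per SM, then derive occupancy.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void saxpy(const float* x, float* y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSize = 256;                 // 8 warps per block
    size_t dynamicSmem = 16 * 1024;      // 16 KB dynamic shared memory per block
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, saxpy,
                                                  blockSize, dynamicSmem);

    int activeWarps = blocksPerSM * blockSize / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Blocks/SM: %d, occupancy: %.1f%%\n",
           blocksPerSM, 100.0 * activeWarps / maxWarps);
    return 0;
}
```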
### Warp Execution and Divergence

#### SIMT Execution Model

In SIMT (Single Instruction, Multiple Thread), all 32 threads in a warp execute the same instruction.

**Branch Divergence Cost:**

When a conditional splits a warp across both paths, the warp serializes them:

$$T_{\text{divergent}} = T_{\text{path1}} + T_{\text{path2}}$$

vs. non-divergent:

$$T_{\text{non-divergent}} = \max(T_{\text{path1}}, T_{\text{path2}})$$

Divergence overhead:

$$\text{Overhead} = \frac{T_{\text{divergent}} - T_{\text{non-divergent}}}{T_{\text{non-divergent}}} \times 100\%$$

**Example:**

```cuda
if (threadIdx.x < 16) {
    // Path A: 10 cycles
} else {
    // Path B: 15 cycles
}
```

- Threads 0-15 execute Path A: 10 cycles
- Threads 16-31 execute Path B: 15 cycles
- Total warp time: 10 + 15 = 25 cycles
- Without divergence: max(10, 15) = 15 cycles
- Overhead: $\frac{25-15}{15} = 66.7\%$

#### Warp Scheduling

Instruction throughput per cycle:

$$\text{IPC}_{\text{SM}} = \sum_{i=1}^{N_{\text{schedulers}}} \text{Issued}_i$$

Where each scheduler can issue 1 instruction per cycle to 1 warp. For 4 schedulers, the maximum is $\text{IPC}_{\text{SM}} = 4$.
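One common way to keep warp execution divergence-free is to use warp-level primitives. The following is an illustrative sketch (names and the surrounding kernel are assumptions) of a shuffle-based warp sum: every lane runs the same instruction sequence, so no branch divergence or shared memory is needed.

```cuda
// Warp-level sum reduction with __shfl_down_sync; lane 0 ends up with the sum.
__inline__ __device__ float warpReduceSum(float val) {
    const unsigned FULL_MASK = 0xffffffffu;
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(FULL_MASK, val, offset);
    return val;
}

__global__ void blockSums(const float* in, float* out, int n) {
    // out must be zero-initialized before launch.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;
    v = warpReduceSum(v);
    if ((threadIdx.x & 31) == 0)                 // one atomic per warp
        atomicAdd(&out[blockIdx.x], v);
}
```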
### Tensor Core Architecture

#### Matrix Multiply-Accumulate (MMA)

Tensor cores compute:

$$D = A \times B + C$$

Where the matrices have dimensions:

- $A: M \times K$
- $B: K \times N$
- $C: M \times N$
- $D: M \times N$

For a typical tensor core tile: $M = N = K = 16$

**Operations per 16×16×16 MMA tile:**

$$\text{OPS}_{\text{MMA}} = 2 \times M \times N \times K = 2 \times 16 \times 16 \times 16 = 8192 \text{ FLOPs}$$

(The factor of 2 counts a multiply-add as 2 operations; the hardware spreads one tile over several cycles.)

**Peak Performance Calculation:**

For a GPU with $N_{\text{SM}}$ SMs, each with $N_{\text{TC}}$ tensor cores sustaining $\text{FLOPs}_{\text{cycle}}$ per cycle, at frequency $f$ (Hz):

$$\text{TFLOPS} = \frac{N_{\text{SM}} \times N_{\text{TC}} \times \text{FLOPs}_{\text{cycle}} \times f}{10^{12}}$$

**H100 Example:**

$$\text{TFLOPS}_{\text{FP16}} = \frac{132 \times 4 \times 1024 \times 1.83 \times 10^9}{10^{12}} \approx 989 \text{ TFLOPS (dense)}$$

#### Sparsity Acceleration

Ampere introduced 2:4 structured sparsity:

- In every 4 consecutive values, 2 must be zero
- Provides up to 2× speedup for sparse operations
- Effective performance with sparsity:

$$P_{\text{sparse}} = 2 \times P_{\text{dense}} \times \eta_{\text{sparsity}}$$

Where $\eta_{\text{sparsity}}$ is the efficiency factor (typically 0.9-1.0).
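The 16×16×16 MMA tile described above is exposed to CUDA C++ through the WMMA API. A minimal single-tile sketch (layouts, pointers, and the kernel name are illustrative); it requires compute capability 7.0+ and a launch of one full warp (32 threads).

```cuda
// One warp computes a 16x16x16 tile D = A*B + 0 on the tensor cores
// (FP16 inputs, FP32 accumulate).
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_tile(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);          // accumulator starts at zero
    wmma::load_matrix_sync(a, A, 16);        // leading dimension = 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);          // tensor-core multiply-accumulate
    wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
}

// Launch sketch: wmma_tile<<<1, 32>>>(dA, dB, dC);  // exactly one warp
```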
## Performance Analysis

### Roofline Model

The Roofline model relates performance to arithmetic intensity:

$$\text{Performance} = \min(\text{Peak FLOPs}, \text{Arithmetic Intensity} \times \text{Peak Bandwidth})$$

where arithmetic intensity = FLOPs / bytes transferred.

**Roofline Equation:**

$$
P(I) =
\begin{cases}
B \times I & \text{if } I < I_{\text{ridge}} \quad \text{(memory-bound)} \\
P_{\text{peak}} & \text{if } I \geq I_{\text{ridge}} \quad \text{(compute-bound)}
\end{cases}
$$

Ridge point:

$$I_{\text{ridge}} = \frac{P_{\text{peak}}}{B}$$

Where:

- $P(I)$ = Achievable performance at intensity $I$
- $B$ = Peak memory bandwidth
- $P_{\text{peak}}$ = Peak computational throughput
- $I$ = Arithmetic intensity (FLOPs/Byte)

**Example for A100:**

- $P_{\text{peak}} = 312$ TFLOPS (FP16 with Tensor Cores)
- $B = 1555$ GB/s = 1.555 TB/s
- $I_{\text{ridge}} = \frac{312}{1.555} \approx 200$ FLOPs/Byte

### Memory Bandwidth Utilization

Effective bandwidth:

$$BW_{\text{eff}} = \frac{\text{Data transferred (GB)}}{\text{Time (s)}}$$

Bandwidth efficiency:

$$\eta_{BW} = \frac{BW_{\text{eff}}}{BW_{\text{peak}}} \times 100\%$$

### Kernel Performance Metrics

#### Execution Time Model

Total kernel execution time:

$$T_{\text{kernel}} = \max\left(\frac{W_{\text{compute}}}{P_{\text{compute}}}, \frac{W_{\text{memory}}}{BW}\right) + T_{\text{overhead}}$$

Where:

- $W_{\text{compute}}$ = Total computational work (FLOPs)
- $P_{\text{compute}}$ = Compute throughput (FLOPs/s)
- $W_{\text{memory}}$ = Total memory traffic (Bytes)
- $BW$ = Effective bandwidth (Bytes/s)
- $T_{\text{overhead}}$ = Kernel launch overhead (typically 1-10 μs)

#### Scalability Analysis

Speedup with $N$ SMs:

$$S(N) = \frac{T_1}{T_N}$$

Efficiency:

$$E(N) = \frac{S(N)}{N} \times 100\%$$

Amdahl's Law for parallel fraction $p$:

$$S(N) = \frac{1}{(1-p) + \frac{p}{N}}$$

## Mathematical Models

### CUDA Core Performance Model

#### Instruction Throughput

Each CUDA core can execute one FP32 operation per clock cycle:

$$\text{TFLOPS}_{\text{FP32}} = \frac{N_{\text{cores}} \times f_{\text{clock}} \times 2}{10^{12}}$$

The factor of 2 counts an FMA (Fused Multiply-Add) as 2 operations.

**Example - RTX 4090:**

- Cores: 16384 CUDA cores
- Clock: ~2.5 GHz (boost)
- $\text{TFLOPS}_{\text{FP32}} = \frac{16384 \times 2.52 \times 10^9 \times 2}{10^{12}} \approx 82.6$ TFLOPS

### Register Pressure Analysis

Register usage impacts occupancy. For a kernel requiring $R$ registers per thread:

$$\text{Max Threads per SM} = \min\left(\text{Max}_{\text{threads}}, \left\lfloor\frac{R_{\text{total}}}{R}\right\rfloor\right)$$

Register allocation is quantized in multiples of 256 (warp size × 8). Actual registers allocated per warp:

$$R_{\text{warp}} = \left\lceil \frac{R \times 32}{256} \right\rceil \times 256$$

### Shared Memory Bank Conflicts

Shared memory is divided into $N_{\text{banks}}$ banks (typically 32). The bank conflict multiplier for $n$ simultaneous accesses to the same bank:

$$M_{\text{conflict}} = n$$

Memory access time with conflicts:

$$T_{\text{access}} = T_{\text{base}} \times M_{\text{conflict}}$$

**Conflict-Free Access Pattern:**

For stride $s$ and thread $t$:

$$\text{Address}(t) = \text{base} + t \times s$$

Conflict-free when $\gcd(s, N_{\text{banks}}) = 1$.
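The padding trick implied by the conflict-free condition shows up in the standard tiled transpose. An illustrative sketch (tile size and names are assumptions): the extra column changes the effective stride so a column read from shared memory no longer maps 32 threads onto the same bank.

```cuda
// Tiled transpose with padded shared memory to avoid 32-way bank conflicts.
#define TILE 32

__global__ void transpose(const float* in, float* out, int width, int height) {
    __shared__ float tile[TILE][TILE + 1];          // +1 padding breaks conflicts

    int x = blockIdx.x * TILE + threadIdx.x;        // column in input
    int y = blockIdx.y * TILE + threadIdx.y;        // row in input
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced load
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;            // transposed block origin
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y]; // coalesced store
}

// Launch sketch: dim3 block(TILE, TILE);
//                dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
//                transpose<<<grid, block>>>(d_in, d_out, width, height);
```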
### Cache Performance Model

#### Hit Rate and Effective Latency

Effective memory latency:

$$L_{\text{eff}} = h \times L_{\text{cache}} + (1-h) \times L_{\text{mem}}$$

Where:

- $h$ = cache hit rate
- $L_{\text{cache}}$ = cache access latency (~30 cycles)
- $L_{\text{mem}}$ = memory access latency (~400 cycles)

**Example:** With an 80% hit rate:

$$L_{\text{eff}} = 0.8 \times 30 + 0.2 \times 400 = 24 + 80 = 104 \text{ cycles}$$

### Tensor Core Utilization

#### GEMM Performance Model

For matrix multiplication $C = AB + C$ where $A$ is $M \times K$ and $B$ is $K \times N$:

**Computational work:**

$$W_{\text{GEMM}} = 2MNK \text{ FLOPs}$$

**Memory traffic:**

$$W_{\text{mem}} = (MK + KN + 2MN) \times \text{sizeof(dtype)}$$

**Arithmetic intensity:**

$$I_{\text{GEMM}} = \frac{2MNK}{(MK + KN + 2MN) \times \text{sizeof(dtype)}}$$

For large square matrices ($M = N = K$):

$$I_{\text{GEMM}} \approx \frac{2N^3}{4N^2 \times \text{sizeof}} = \frac{N}{2 \times \text{sizeof}}$$

For FP16 (2 bytes): $I_{\text{GEMM}} \approx \frac{N}{4}$

**Tile-level efficiency:**

For tile size $T_M \times T_N \times T_K$:

$$\eta_{\text{tile}} = \frac{2 \times T_M \times T_N \times T_K}{\text{Tensor Core OPS}} \times f_{\text{utilization}}$$

### Power and Energy Models

#### Dynamic Power

Power consumption during execution:

$$P = C \times V^2 \times f \times \alpha$$

Where:

- $C$ = Capacitance
- $V$ = Voltage
- $f$ = Frequency
- $\alpha$ = Activity factor (0-1)

#### Energy Efficiency

Energy per operation:

$$E_{\text{op}} = \frac{P \times T}{N_{\text{ops}}}$$

For tensor cores vs CUDA cores:

$$\frac{E_{\text{CUDA}}}{E_{\text{TC}}} \approx 8\text{-}16\times$$

(Tensor cores are roughly 8-16× more energy efficient for matrix operations.)

### Latency Hiding

Warps needed to hide a latency of $L$ cycles:

$$N_{\text{warps}} = \left\lceil\frac{L}{I}\right\rceil$$

Where $I$ is the instruction interval (cycles between dependent instructions).

**Little's Law for GPUs:**

$$\text{Throughput} = \frac{\text{Concurrency}}{\text{Latency}}$$

Applied to memory:

$$BW_{\text{eff}} = \frac{N_{\text{threads}} \times \text{bytes per thread}}{L_{\text{mem}}}$$

## Advanced Topics

### Multi-GPU Scaling

For $N$ GPUs with interconnect bandwidth $B_{\text{link}}$:

Communication overhead fraction:

$$\alpha = \frac{W_{\text{comm}}/B_{\text{link}}}{W_{\text{comp}}/P_{\text{GPU}}}$$

Effective speedup:

$$S_{\text{eff}}(N) = \frac{N}{1 + \alpha(N-1)}$$

### Mixed Precision Training

Tensor core operations in mixed precision (FP16 compute, FP32 accumulate):

$$D_{\text{FP32}} = \text{FP32}(\text{FP16}(A) \times \text{FP16}(B)) + C_{\text{FP32}}$$

Memory savings:

$$R_{\text{mem}} = \frac{\text{sizeof(FP32)}}{\text{sizeof(FP16)}} = \frac{4}{2} = 2\times$$

Speedup combines compute and memory improvements:

$$S_{\text{mixed}} = \min(S_{\text{compute}}, S_{\text{memory}})$$

### Quantization Effects

For INT8 vs FP16 tensor cores:

Throughput increase:

$$S_{\text{INT8}} = \frac{\text{OPS}_{\text{INT8}}}{\text{OPS}_{\text{FP16}}} \approx 2\times$$

Memory bandwidth improvement:

$$S_{\text{BW}} = \frac{\text{sizeof(FP16)}}{\text{sizeof(INT8)}} = 2\times$$

## Performance Optimization Checklists

### Memory Optimization

- Achieve coalesced memory access patterns
- Minimize global memory transactions
- Use shared memory for data reuse
- Avoid bank conflicts in shared memory
- Maximize L1/L2 cache hit rates
- Use appropriate memory types (constant, texture)

### Compute Optimization

- Maximize occupancy (balance registers and shared memory)
- Minimize warp divergence
- Use tensor cores for matrix operations
- Leverage mixed precision where applicable
- Optimize the instruction mix
- Avoid low-throughput operations (divisions, transcendentals)

### Execution Configuration

- Choose an optimal block size (multiples of the warp size)
- Balance thread blocks across SMs
- Use appropriate grid dimensions
- Consider dynamic parallelism for irregular workloads
- Minimize kernel launch overhead

## Conclusion

The NVIDIA GPU computing ecosystem is a sophisticated interplay between:

1. **Software abstraction** (CUDA programming model)
2. **Architectural versioning** (Compute capability)
3. **Hardware implementation** (Silicon logic)

Understanding these three layers and their mathematical relationships enables:

$$\text{Optimal Performance} = f(\text{Algorithm}, \text{Implementation}, \text{Hardware Knowledge})$$

Each component must be optimized in concert with the others to achieve peak efficiency.

cuda,hardware

NVIDIA's parallel computing platform for GPUs.

cuda,nvidia,programming

CUDA is NVIDIA's parallel programming platform: developers write GPU kernels in C++, and it dominates the AI/ML ecosystem.

cudnn, infrastructure

NVIDIA's deep learning primitives library.
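A minimal setup sketch (handle plus one NCHW tensor descriptor); the full convolution path with filter/convolution descriptors, algorithm selection, and workspace is omitted, so treat this as scaffolding only.

```cuda
// Create a cuDNN handle and describe a batch of NCHW float32 tensors.
#include <cudnn.h>

int main() {
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    cudnnTensorDescriptor_t xDesc;
    cudnnCreateTensorDescriptor(&xDesc);
    // Illustrative shape: batch of 8 RGB images at 224x224.
    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               8, 3, 224, 224);

    cudnnDestroyTensorDescriptor(xDesc);
    cudnnDestroy(handle);
    return 0;
}
```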