A thread block is a cooperating group of GPU threads that executes on one SM and shares synchronization and on-chip shared-memory resources - it is the core unit for organizing cooperation, data reuse, and locality in CUDA kernels.
What Is a Thread Block?
- Definition: Fixed-size thread group with shared-memory scope and barrier synchronization capability.
- Execution Mapping: A block is scheduled on a single SM and may run concurrently with other resident blocks.
- Coordination Tools: Threads can communicate through shared memory and synchronize via block barriers.
- Size Constraints: Block dimensions are limited by the architecture's per-block thread maximum (1,024 threads on current NVIDIA GPUs) and by per-SM register and shared-memory budgets.
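The properties above can be seen in a minimal sketch: each block stages a tile in shared memory, synchronizes at a block-wide barrier, and then writes its tile back reversed. The kernel and buffer names are illustrative, not from the source.

```cuda
#include <cstdio>

// Sketch: each 256-thread block reverses its own tile in place,
// showing shared memory plus barrier synchronization within a block.
__global__ void reverseTile(int *data)
{
    __shared__ int tile[256];               // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = data[i];            // stage this block's tile
    __syncthreads();                        // barrier: all loads complete

    data[i] = tile[blockDim.x - 1 - threadIdx.x];  // write reversed element
}

int main()
{
    const int n = 1024;
    int *d;
    cudaMallocManaged(&d, n * sizeof(int));
    for (int i = 0; i < n; ++i) d[i] = i;

    reverseTile<<<n / 256, 256>>>(d);       // 4 blocks of 256 threads each
    cudaDeviceSynchronize();

    printf("%d %d\n", d[0], d[255]);        // first tile is now reversed
    cudaFree(d);
    return 0;
}
```

Note that the barrier synchronizes only threads within one block; blocks cannot wait on each other this way.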
Why Thread Blocks Matter
- Data Reuse: Block-level collaboration reduces redundant global memory access.
- Synchronization: Many parallel algorithms rely on intra-block barriers for correctness.
- Performance: Block shape influences occupancy, memory coalescing, and scheduler effectiveness.
- Algorithm Design: Choosing the right block decomposition is central to efficient GPU kernel structure.
- Portability: Well-designed block patterns adapt better across changing SM resource limits.
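The data-reuse and synchronization points above come together in the classic block-level reduction: threads cooperate through shared memory so each input element is read from global memory only once. This is a hedged sketch with illustrative names, assuming a 256-thread block whose size is a power of two.

```cuda
// Sketch: block-level tree reduction. Each block produces one partial
// sum; intra-block barriers make each reduction step safe.
__global__ void blockSum(const float *in, float *out)
{
    __shared__ float s[256];                // one slot per thread
    int tid = threadIdx.x;

    s[tid] = in[blockIdx.x * blockDim.x + tid];  // one global read per element
    __syncthreads();

    // Halve the active range each step; the barrier guards every step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = s[0];             // one partial sum per block
}
```

All intermediate sums stay in shared memory, so global-memory traffic is one read per input element plus one write per block, rather than repeated global accesses per reduction step.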
How It Is Used in Practice
- Shape Selection: Match block dimensions to data layout and shared-memory tiling strategy.
- Resource Budgeting: Tune register and shared-memory use so enough blocks can reside concurrently.
- Correctness Checks: Verify barrier placement and shared-memory indexing to avoid race conditions.
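For the resource-budgeting step, the CUDA runtime can suggest a block size that maximizes occupancy for a given kernel via `cudaOccupancyMaxPotentialBlockSize`. A minimal sketch, assuming a hypothetical kernel `myKernel` and a device buffer `d_data` of `n` floats:

```cuda
__global__ void myKernel(float *data);      // hypothetical kernel

void launch(float *d_data, int n)
{
    int minGridSize = 0, blockSize = 0;
    // Ask the runtime for an occupancy-maximizing block size for myKernel
    // (last two arguments: dynamic shared memory per block, block-size limit).
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       myKernel, 0, 0);

    int gridSize = (n + blockSize - 1) / blockSize;  // round up to cover n
    myKernel<<<gridSize, blockSize>>>(d_data);
}
```

The suggestion accounts for the kernel's register and static shared-memory use, so it is a reasonable starting point that can still be tuned against measured performance.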
Thread block design is the building block of efficient CUDA parallelization - effective intra-block cooperation is critical for both correctness and high GPU performance.