Task Graph Execution and Runtime Systems is a parallel computing model in which computation is expressed as a directed acyclic graph (DAG) of tasks with explicit data dependencies. A runtime system exploits dependency analysis to identify which tasks can execute in parallel, automatically schedules them across heterogeneous resources (CPU cores, GPUs, remote nodes), and defers scheduling decisions to runtime so it can adapt to actual resource availability and data locality. Task-based runtimes like Legion, StarPU, OpenMP tasks, and Intel TBB implement this model.
Why Task Graphs Over Fork-Join
- Fork-join: Simple parallel regions → coarse-grained parallelism → difficult to handle irregular or dynamic parallelism.
- Task graph: Fine-grained tasks with explicit dependencies → runtime discovers parallelism automatically → handles irregular workloads.
- Key advantage: Automatic load balancing, adaptive scheduling, heterogeneous resource management.
Task DAG Structure
```
Task A → Task B → Task D → Task F (Output)
      ↘ Task C ↗
```
- A must complete before B and C start
- B and C run in parallel (no dependency between them)
- D waits for both B and C to complete
- Critical path: A → B → D → F (or A → C → D → F)
Critical Path and Span
- Work (T₁): Total task time if executed sequentially.
- Span (T∞): Length of the critical path (longest chain of dependent tasks).
- Parallelism: T₁ / T∞ → maximum achievable speedup on infinite processors.
- Brent's theorem: On P processors, execution time T_P ≤ T∞ + T₁/P, so speedup S_P ≥ T₁ / (T∞ + T₁/P).
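Plugging illustrative numbers into these definitions (a hypothetical DAG, not a measurement):

```cpp
#include <cassert>

// Brent's lower bound on speedup: S_P >= T1 / (Tinf + T1/P).
double brent_speedup_bound(double T1, double Tinf, int P) {
    return T1 / (Tinf + T1 / P);
}

// Example: 100 units of total work, a 10-unit critical path, P = 4.
// Parallelism T1/Tinf = 10, so infinite processors could reach 10x;
// on 4 processors the guaranteed speedup is 100 / (10 + 25) ~= 2.86,
// and the ceiling is min(P, T1/Tinf) = 4.
```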
OpenMP Task-Based Parallelism
```cpp
#pragma omp parallel
#pragma omp single
{
    #pragma omp task depend(out: A_data)
    compute_A();

    #pragma omp task depend(in: A_data) depend(out: B_data)
    compute_B();    // B depends on A

    #pragma omp task depend(in: A_data) depend(out: C_data)
    compute_C();    // C depends on A, runs in parallel with B

    #pragma omp task depend(in: B_data, C_data)
    compute_D();    // D depends on both B and C
}
```
Intel TBB (Threading Building Blocks)
- `tbb::task_group`: Submit tasks → runtime schedules on thread pool.
- `tbb::flow_graph`: Explicit DAG nodes connected by edges → data-flow execution.
- Work-stealing scheduler: Idle threads steal tasks from busy threads' queues → automatic load balancing.
- Used by: Intel oneAPI, many C++ parallel applications.
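The work-stealing idea can be illustrated with a toy deque. This mutex-based sketch is not TBB's actual lock-free implementation; it only shows the two access ends: the owner pushes and pops at the back (LIFO, good cache locality), while idle thieves steal from the front (the oldest, typically largest, subtasks).

```cpp
#include <deque>
#include <mutex>
#include <optional>

// Toy work-stealing deque (hypothetical; real schedulers use lock-free
// variants). Task ids stand in for real closures.
class WorkStealingDeque {
    std::deque<int> tasks_;
    std::mutex mtx_;
public:
    void push(int t) { std::lock_guard<std::mutex> l(mtx_); tasks_.push_back(t); }
    std::optional<int> pop() {            // owner end: newest task (LIFO)
        std::lock_guard<std::mutex> l(mtx_);
        if (tasks_.empty()) return std::nullopt;
        int t = tasks_.back(); tasks_.pop_back(); return t;
    }
    std::optional<int> steal() {          // thief end: oldest task (FIFO)
        std::lock_guard<std::mutex> l(mtx_);
        if (tasks_.empty()) return std::nullopt;
        int t = tasks_.front(); tasks_.pop_front(); return t;
    }
};
```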
Legion Runtime
- Developed at Stanford/NVIDIA/Los Alamos → HPC and AI applications.
- Key abstraction: Logical regions (data) + Tasks (computation) with explicit privileges (read/write/reduce).
- Runtime analyzes privileges → constructs dependency graph → schedules tasks in parallel where safe.
- Transparent data movement: Legion moves data to where computation runs (CPU ↔ GPU ↔ distributed node) automatically.
- Used by: FlexFlow (DNN training), S3D (combustion simulation), circuit simulation.
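The privilege analysis above can be sketched in plain C++. This is a hypothetical simplification of what Legion does; Legion's actual analysis works over logical regions, partitions, and fields, but the core idea is that declared read/write sets imply ordering edges.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical task declaration: one region name per access.
struct TaskDecl {
    std::string name;
    std::set<std::string> reads, writes;
};

// Task j depends on earlier task i if they conflict on any region:
// read-after-write, write-after-read, or write-after-write.
std::vector<std::pair<std::string, std::string>>
infer_edges(const std::vector<TaskDecl>& tasks) {
    auto conflicts = [](const std::set<std::string>& a,
                        const std::set<std::string>& b) {
        for (const auto& r : a) if (b.count(r)) return true;
        return false;
    };
    std::vector<std::pair<std::string, std::string>> edges;
    for (size_t j = 0; j < tasks.size(); ++j)
        for (size_t i = 0; i < j; ++i)
            if (conflicts(tasks[i].writes, tasks[j].reads) ||   // RAW
                conflicts(tasks[i].reads,  tasks[j].writes) ||  // WAR
                conflicts(tasks[i].writes, tasks[j].writes))    // WAW
                edges.push_back({tasks[i].name, tasks[j].name});
    return edges;
}
```

Tasks that touch disjoint regions get no edge, so the runtime is free to run them in parallel.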
StarPU
- Runtime for heterogeneous architectures: CPU cores + CUDA GPUs + OpenCL + other accelerators.
- Tasks annotated with multiple implementations (CPU version, CUDA version).
- Runtime selects which implementation to use based on data location and resource availability → avoids unnecessary data transfers.
- Performance model: Calibrates task duration on each device → schedules to minimize total DAG completion time.
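The performance-model scheduling in the last bullet amounts to picking the worker with the earliest estimated finish time. A minimal sketch follows; the struct and function names are hypothetical, and real StarPU schedulers such as dmda also fold in data-transfer cost.

```cpp
#include <cstddef>
#include <vector>

struct Worker {
    double available_at;  // when this worker becomes free
};

// durations[w] = calibrated duration of the task on worker w
// (StarPU obtains this from its history-based performance model).
size_t pick_worker(const std::vector<Worker>& workers,
                   const std::vector<double>& durations) {
    size_t best = 0;
    double best_finish = workers[0].available_at + durations[0];
    for (size_t w = 1; w < workers.size(); ++w) {
        double finish = workers[w].available_at + durations[w];
        if (finish < best_finish) { best_finish = finish; best = w; }
    }
    return best;
}
```

Note that a busy GPU can still win over an idle CPU if its calibrated duration is short enough, and vice versa.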
Task Graph for Deep Learning (MLCommons)
- TensorFlow dataflow graph: Operations as nodes, tensors as edges → task graph execution.
- PyTorch autograd: Builds dynamic computation graph (DAG) → reverse-mode AD traverses DAG.
- CUDA graphs: Capture kernel launch sequence as graph → replay without CPU overhead → 10–20% throughput improvement for repetitive workloads.
Scheduling Algorithms
| Algorithm | Strategy | Good For |
|-----------|---------|----------|
| List scheduling | Sort tasks by priority (e.g., upward rank) → schedule greedily | Heterogeneous tasks |
| Work-stealing | Idle workers steal from busy ones | Dynamic, irregular |
| Critical-path | Prioritize tasks on critical path | Balanced DAGs |
| HEFT | Heterogeneous Earliest Finish Time | Multi-device scheduling |
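The priority behind HEFT-style list scheduling is the upward rank. Below is a simplified sketch with communication costs omitted and a single mean execution time per task, which reduces it to critical-path priority; tasks are then scheduled in decreasing rank order.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// rank_u(t) = w_t + max over successors rank_u(succ).
// (Full HEFT adds mean communication cost on each edge.)
double upward_rank(const std::string& t,
                   const std::map<std::string, double>& weight,
                   const std::map<std::string, std::vector<std::string>>& succ,
                   std::map<std::string, double>& memo) {
    auto it = memo.find(t);
    if (it != memo.end()) return it->second;
    double best = 0.0;
    auto s = succ.find(t);
    if (s != succ.end())
        for (const auto& child : s->second)
            best = std::max(best, upward_rank(child, weight, succ, memo));
    return memo[t] = weight.at(t) + best;
}
```

On the example DAG with unit-time tasks, A ranks highest (it heads the critical path) and F lowest, so a list scheduler would dispatch A first.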
Task graph execution is the runtime intelligence that makes heterogeneous parallel computing tractable. By expressing computation as an explicit dependency graph and delegating scheduling to a runtime that can observe actual resource availability, data locality, and task performance, task-based systems adaptively achieve parallelism that static fork-join programs would require manual tuning to approach. This lets scientific codes and AI frameworks scale naturally across the diverse mix of CPUs, GPUs, and memory hierarchies that characterizes modern heterogeneous computing infrastructure.