HPC Job Scheduling and Workflow Management: SLURM and DAG-Based Workflows — resource allocation and execution sequencing for batch HPC jobs and complex multi-stage scientific pipelines
SLURM (Simple Linux Utility for Resource Management)
- Design Philosophy: open-source, scales to thousands of nodes, deployed on most large HPC clusters
- Architecture: controller daemon (slurmctld) on the head node, a slurmd agent on every compute node, clients submitting and querying via command-line tools (sbatch, squeue, scontrol)
- Key Components: partitions (node groups that double as job queues, with per-partition limits), nodes (individual compute resources), jobs and job steps (allocations and the tasks launched within them)
- Scalability: a single controller manages thousands of concurrent jobs; slurmd daemons communicate via a hierarchical tree fanout, keeping clusters of 10,000+ nodes responsive
SLURM Job Submission (sbatch)
- Batch Script: shell script specifying resources (`#SBATCH` directives), input/output files, and the command to execute
- Example: `sbatch --nodes=100 --ntasks-per-node=1 --cpus-per-task=4 --time=01:00:00 myjob.sh`
- Job Array: array syntax (`--array=0-99`) spawns 100 independent jobs (parameter sweep)
- Dependencies: `--dependency=afterok:123` ensures job 123 completes successfully before the current job starts; a full batch-script sketch follows this list
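A minimal batch-script sketch tying these directives together; the script, the `./simulate` executable, and the resource sizes are hypothetical and illustrative only.

```bash
#!/bin/bash
# Hypothetical parameter-sweep job: 100 array tasks, 4 cores each, 1 hour limit.
#SBATCH --job-name=sweep
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=01:00:00
#SBATCH --array=0-99                 # spawns 100 independent array tasks
#SBATCH --output=sweep_%A_%a.out     # %A = array job ID, %a = array task index

# Each task selects its parameter set from its array index.
srun ./simulate --param-index "$SLURM_ARRAY_TASK_ID"
```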
SLURM Parallel Launch (srun)
- MPI Process Binding: srun handles MPI startup (rank assignment, process placement on cores)
- CPU Binding: `srun --cpu-bind=sockets` pins processes to sockets (improves memory locality)
- Heterogeneous Steps: `srun --job-name=gpu_step --gpus=1` runs a specific job step with a GPU allocation (see the sketch below)
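A short sketch of launching steps inside an existing allocation; `./mpi_app` and `./gpu_postprocess` are hypothetical executables, and `--gpus` assumes GPUs are configured as a schedulable resource on the cluster.

```bash
# Launch 8 MPI ranks, each process pinned to a socket for memory locality.
srun --ntasks=8 --cpu-bind=sockets ./mpi_app

# Run a separate step in the same allocation with one GPU attached.
srun --job-name=gpu_step --ntasks=1 --gpus=1 ./gpu_postprocess
```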
SLURM Accounting and Fairshare
- Fairshare Algorithm: tracks resource usage (CPU-hours per user/group), prioritizes lower-usage users
- Priority Boost: long-waiting jobs increase priority over time (starvation prevention)
- Reservation: advance resource reservation (`scontrol create reservation`) ensures availability for high-priority jobs
- QOS (Quality of Service): different tiers (standard, premium, debug), different limits/priorities
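A few commands for inspecting fairshare and accounting data and submitting under a QOS; the QOS name `debug`, the start date, and `test.sh` are placeholders.

```bash
# Show fairshare factors for the current user (lower past usage => higher priority).
sshare -u "$USER"

# Summarize recent usage: job IDs, elapsed time, and total CPU time consumed.
sacct -u "$USER" --starttime=2024-03-01 --format=JobID,Elapsed,CPUTime

# Submit a short test job under a (site-specific) debug QOS.
sbatch --qos=debug --time=00:30:00 test.sh
```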
PBS/Torque Job Scheduler
- Design: older HPC standard (predates SLURM), similar functionality, less adoption now
- qsub Command: equivalent to sbatch (submit job), qstat (check status), qdel (delete job)
- Adoption: SLURM's dominance has reduced PBS adoption, though it remains in use at some facilities
- Comparison: SLURM more feature-rich, PBS simpler but slower to evolve
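For comparison, rough PBS/Torque equivalents of the SLURM commands above; exact resource syntax varies between PBS variants (Torque, PBS Pro, OpenPBS), so treat this as a sketch.

```bash
# Submit: 2 nodes, 8 processes per node, 1 hour walltime (Torque-style syntax).
qsub -l nodes=2:ppn=8 -l walltime=01:00:00 myjob.sh

# Check queue status for the current user, then delete a job by ID.
qstat -u "$USER"
qdel 12345
```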
Workflow Management: Nextflow
- DSL (Domain-Specific Language): describe pipeline as directed acyclic graph (DAG), intuitive for scientists
- Process Definition: workflow consists of processes (scripts/tasks), linked by channels (data flow)
- Parallelism: automatic parallelization (fork-join) across data items, job submission to HPC cluster via backend
- Backend Flexibility: supports SLURM, PBS, Kubernetes, cloud platforms (same workflow portable)
- Reproducibility: frozen dependency versions (containers, pinned Nextflow version) enable publication-quality reproducibility
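A sketch of invoking a Nextflow pipeline against a SLURM backend; `main.nf` and the `slurm` profile are assumptions, i.e. they must already be defined in the pipeline and its `nextflow.config`.

```bash
# Run the pipeline, submitting each process as a SLURM job via the configured
# profile; -resume reuses cached results, -with-report writes an execution report.
nextflow run main.nf -profile slurm -resume -with-report report.html
```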
Snakemake Workflow Framework
- Python-Based: rules written in a Python-based DSL (familiar to scientists), with conditional execution and workflow inference
- Dependency Resolution: `snakemake` analyzes file dependencies, constructs an implicit DAG, and executes independent rules in parallel
- Example: a rule `align_fastq` reads a FASTQ file and outputs an aligned BAM, making the file dependency explicit
- Distributed Execution: Snakemake schedules to SLURM/cloud, similar to Nextflow but Python-first
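A sketch of running Snakemake against SLURM; older releases (up to version 7) use a `--cluster` submission command as shown, while newer releases move this into executor plugins and profiles.

```bash
# Preview the inferred DAG without running anything.
snakemake --dry-run

# Execute up to 100 rules concurrently, submitting each as its own SLURM job
# (Snakemake <=7 style; {threads} is filled in per rule).
snakemake --jobs 100 --cluster "sbatch --cpus-per-task={threads} --time=01:00:00"
```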
HPC Queue Priority and Scheduling
- FIFO (First-In-First-Out): simplest policy, fair in arrival order, but small jobs can starve behind large ones
- Backfill: scheduler identifies gaps (small jobs can fit before large job completion), fills gaps (improves utilization)
- Gang Scheduling: time-share nodes (multiple jobs on same node, swapped via preemption), increases utilization but adds latency
- Preemption: high-priority job preempts lower-priority (saves state if possible, or kills), ensures critical work gets resources
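Two commands useful for seeing these policies in action on a SLURM cluster; job ID 12345 is a placeholder.

```bash
# Break a pending job's priority into its factors (age, fairshare, partition, QOS).
sprio -j 12345

# Show the backfill scheduler's estimated start times for your pending jobs.
squeue --start -u "$USER"
```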
Resource Allocation Strategies
- Pack: schedule jobs densely (fill nodes completely before using new node), reduces fragmentation
- Spread: distribute across nodes (anti-pack), improves memory bandwidth but uses more nodes
- Balance: balance between pack/spread based on workload (compute-heavy: pack, memory-heavy: spread)
- Constraint-Based: specify required resources (CPU cores, memory, GPU, specific node features)
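A sketch of expressing these strategies at submission time; the feature name `skylake`, the memory/GPU sizes, and `job.sh` are hypothetical and site-specific.

```bash
# Constraint-based request: only nodes advertising the "skylake" feature,
# with 64 GB of memory and one GPU.
sbatch --constraint=skylake --mem=64G --gres=gpu:1 --cpus-per-task=8 job.sh

# Spread: one task per node (16 nodes used).
sbatch --ntasks=16 --ntasks-per-node=1 job.sh

# Pack: the same 16 tasks confined to 2 nodes.
sbatch --ntasks=16 --nodes=2 job.sh
```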
Heterogeneous Job Allocation
- Multiple Resource Types: job requests CPU + GPU + memory (e.g., 4 CPU + 1 GPU + 8 GB memory)
- Scheduling Complexity: scheduler must find nodes with specific resource combinations, NP-hard in general
- Heuristic Solution: greedy packing (fit largest resource requests first)
- Utilization Impact: heterogeneity reduces bin packing efficiency (~10-20% utilization loss)
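A sketch of mixed-resource requests; the second form uses SLURM's heterogeneous-job syntax (components separated by `:`), which the site's SLURM version must support, and both scripts are hypothetical.

```bash
# Single component with mixed resources: 4 CPUs + 1 GPU + 8 GB memory.
sbatch --cpus-per-task=4 --gpus=1 --mem=8G job.sh

# Heterogeneous job: a 16-task CPU component plus a single-task GPU component.
sbatch --ntasks=16 --cpus-per-task=4 : --ntasks=1 --gpus=1 hetjob.sh
```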
Job Dependency Management
- afterok: job runs after predecessor succeeds (exit code 0)
- afternotok: job runs if predecessor fails (exit code non-zero)
- afterany: job runs regardless of predecessor status
- DAG Support: Nextflow/Snakemake auto-generate dependencies (no manual specification needed)
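A sketch of chaining all three dependency types manually with `--parsable` (which prints only the job ID); the script names are hypothetical.

```bash
# Capture job IDs so later submissions can reference them.
pre=$(sbatch --parsable preprocess.sh)
run=$(sbatch --parsable --dependency=afterok:${pre} simulate.sh)

# Runs only if the simulation fails (non-zero exit).
sbatch --dependency=afternotok:${run} on_failure.sh

# Runs regardless of how the simulation ended.
sbatch --dependency=afterany:${run} archive_logs.sh
```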
Batch vs Interactive Jobs
- Batch (sbatch): job submitted to queue, executed when resources available, results written to files (asynchronous)
- Interactive (salloc): allocate resources, get shell prompt on compute node, immediate feedback
- Use Cases: batch for long-running simulations (1000+ core-hours), interactive for debugging/development
- Reservation: interactive jobs hold their allocation for the requested walltime (`salloc --time=1:00:00`), blocking other jobs from those resources
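A minimal interactive-session sketch; the one-step `srun --pty` form is a common convenience on many clusters.

```bash
# Allocate one node for an hour, then start a shell inside the allocation.
salloc --nodes=1 --time=01:00:00
srun --pty bash

# Or request an interactive shell in a single step.
srun --nodes=1 --time=00:30:00 --pty bash
```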
Advance Reservation
- Use Case: ensure resources available for specific time window (maintenance, deadline-driven project)
- Mechanism: `scontrol create reservation starttime=2024-03-01T09:00:00 duration=60 nodecnt=100 users=alice` (duration in minutes; a user or account list is required, `alice` is a placeholder)
- Preemption: the reserved window is guaranteed; the scheduler will not start jobs that would overlap it, and can preempt lower-priority work if configured to do so
- Cost: reduces cluster utilization (reserved but potentially idle), justified for critical work
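Once a reservation exists, jobs opt into it by name; `deadline_res` is a hypothetical reservation name.

```bash
# Submit into the reserved node set and time window.
sbatch --reservation=deadline_res job.sh

# Inspect the reservation's nodes, window, and allowed users.
scontrol show reservation deadline_res
```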
Job Checkpointing and Restart
- Checkpoint: save job state (memory, open files, execution context) to disk
- Restart: reload state, resume execution (avoids recomputation)
- Benefit: enables job preemption (save + restart), fault tolerance (survive crashes)
- Mechanism: application-level (custom code) or system-level (transparent, but limited portability)
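A sketch of cooperative (application-level) checkpointing driven by SLURM's end-of-time-limit signal; `./simulate` is a hypothetical application assumed to write `restart.chk` and exit when it receives SIGUSR1.

```bash
#!/bin/bash
#SBATCH --time=04:00:00
#SBATCH --signal=B:USR1@300   # signal the batch shell 300 s before the time limit
#SBATCH --requeue             # allow this job to be requeued after checkpointing

checkpoint_and_requeue() {
    kill -USR1 "$app_pid"          # ask the app to write restart.chk and exit
    wait "$app_pid"
    scontrol requeue "$SLURM_JOB_ID"
}
trap checkpoint_and_requeue USR1

# Resume from a checkpoint if one exists, otherwise start fresh.
if [ -f restart.chk ]; then
    ./simulate --restart restart.chk &
else
    ./simulate &
fi
app_pid=$!
wait "$app_pid"
```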
Scientific Workflow Provenance
- Record Execution: track which inputs → outputs, tool versions, parameters, execution environment
- Reproducibility: re-run same pipeline (deterministic if possible), verify results match
- PROV-DM Standard: W3C standard for provenance representation (graph of entities, activities, agents)
- Tools: Galaxy (web-based workflow platform), Common Workflow Language (CWL) for portable workflows
Scalability of Scheduling
- Large Clusters (10,000+ nodes): scheduling becomes critical bottleneck, decision latency limits throughput
- Optimization: approximate algorithms and fast heuristics instead of exact solutions to the underlying NP-hard placement problem
- Distributed Scheduling: multiple schedulers coordinate (reduces single-point bottleneck), enables elasticity
Future Directions: AI-driven scheduling (predict job characteristics, optimize placement), serverless HPC (FaaS model), containers standardizing job environments (reducing scheduling constraints).