HPC Job Scheduling and Workflow Management: SLURM and DAG-Based Workflows — resource allocation and execution sequencing for batch HPC jobs and complex multi-stage scientific pipelines
SLURM (Simple Linux Utility for Resource Management)
- Design Philosophy: open-source, scales to thousands of nodes, deployed on most large HPC clusters
- Architecture: controller daemon (slurmctld) on the head node, a slurmd agent on every compute node, clients submitting and querying via command-line tools (sbatch, squeue, scontrol)
- Key Components: partitions (node groups that double as job queues, with per-partition limits), nodes (individual compute resources), jobs and job steps (allocations and the tasks launched within them)
- Scalability: a single controller manages thousands of concurrent jobs; slurmd daemons communicate via a hierarchical tree fanout, keeping clusters of 10,000+ nodes responsive
SLURM Job Submission (sbatch)
- Batch Script: shell script specifying resources (`#SBATCH` directives), input/output files, and the command to execute
- Example: `sbatch --nodes=100 --ntasks-per-node=1 --cpus-per-task=4 --time=01:00:00 myjob.sh`
- Job Array: array syntax (`--array=0-99`) spawns 100 independent jobs (parameter sweep)
- Dependencies: `--dependency=afterok:123` ensures job 123 completes successfully before the current job starts; a full batch-script sketch follows this list
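A minimal batch-script sketch tying these directives together; the script, the `./simulate` executable, and the resource sizes are hypothetical and illustrative only.

```bash
#!/bin/bash
# Hypothetical parameter-sweep job: 100 array tasks, 4 cores each, 1 hour limit.
#SBATCH --job-name=sweep
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=01:00:00
#SBATCH --array=0-99                 # spawns 100 independent array tasks
#SBATCH --output=sweep_%A_%a.out     # %A = array job ID, %a = array task index

# Each task selects its parameter set from its array index.
srun ./simulate --param-index "$SLURM_ARRAY_TASK_ID"
```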
SLURM Parallel Launch (srun)
- MPI Process Binding: srun handles MPI startup (rank assignment, process placement on cores)
- CPU Binding: `srun --cpu-bind=sockets` pins processes to sockets (improves memory locality)
- Heterogeneous Steps: `srun --job-name=gpu_step --gpus=1` runs a specific job step with a GPU allocation (see the sketch below)
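A short sketch of launching steps inside an existing allocation; `./mpi_app` and `./gpu_postprocess` are hypothetical executables, and `--gpus` assumes GPUs are configured as a schedulable resource on the cluster.

```bash
# Launch 8 MPI ranks, each process pinned to a socket for memory locality.
srun --ntasks=8 --cpu-bind=sockets ./mpi_app

# Run a separate step in the same allocation with one GPU attached.
srun --job-name=gpu_step --ntasks=1 --gpus=1 ./gpu_postprocess
```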
SLURM Accounting and Fairshare
- Fairshare Algorithm: tracks resource usage (CPU-hours per user/group), prioritizes lower-usage users
- Priority Boost: long-waiting jobs increase priority over time (starvation prevention)
- Reservation: advance resource reservation (`scontrol create reservation`) ensures availability for high-priority jobs
- QOS (Quality of Service): different tiers (standard, premium, debug), different limits/priorities
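A few commands for inspecting fairshare and accounting data and submitting under a QOS; the QOS name `debug`, the start date, and `test.sh` are placeholders.

```bash
# Show fairshare factors for the current user (lower past usage => higher priority).
sshare -u "$USER"

# Summarize recent usage: job IDs, elapsed time, and total CPU time consumed.
sacct -u "$USER" --starttime=2024-03-01 --format=JobID,Elapsed,CPUTime

# Submit a short test job under a (site-specific) debug QOS.
sbatch --qos=debug --time=00:30:00 test.sh
```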
PBS/Torque Job Scheduler
- Design: older HPC standard (predates SLURM), similar functionality, less adoption now
- qsub Command: equivalent to sbatch (submit job), qstat (check status), qdel (delete job)
- Adoption: SLURM's dominance has reduced PBS adoption, though it remains in use at some facilities
- Comparison: SLURM more feature-rich, PBS simpler but slower to evolve
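For comparison, rough PBS/Torque equivalents of the SLURM commands above; exact resource syntax varies between PBS variants (Torque, PBS Pro, OpenPBS), so treat this as a sketch.

```bash
# Submit: 2 nodes, 8 processes per node, 1 hour walltime (Torque-style syntax).
qsub -l nodes=2:ppn=8 -l walltime=01:00:00 myjob.sh

# Check queue status for the current user, then delete a job by ID.
qstat -u "$USER"
qdel 12345
```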
Workflow Management: Nextflow
- DSL (Domain-Specific Language): describe pipeline as directed acyclic graph (DAG), intuitive for scientists
- Process Definition: workflow consists of processes (scripts/tasks), linked by channels (data flow)
- Parallelism: automatic parallelization (fork-join) across data items, job submission to HPC cluster via backend
- Backend Flexibility: supports SLURM, PBS, Kubernetes, cloud platforms (same workflow portable)
- Reproducibility: frozen dependency versions (containers, pinned Nextflow version) enable publication-quality reproducibility
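A sketch of invoking a Nextflow pipeline against a SLURM backend; `main.nf` and the `slurm` profile are assumptions, i.e. they must already be defined in the pipeline and its `nextflow.config`.

```bash
# Run the pipeline, submitting each process as a SLURM job via the configured
# profile; -resume reuses cached results, -with-report writes an execution report.
nextflow run main.nf -profile slurm -resume -with-report report.html
```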
Snakemake Workflow Framework
- Python-Based: rules written in a Python-based DSL (familiar to scientists), with conditional execution and workflow inference
- Dependency Resolution: `snakemake` analyzes file dependencies, constructs an implicit DAG, and executes independent rules in parallel
- Example: a rule `align_fastq` reads a FASTQ file and outputs an aligned BAM, making the file dependency explicit
- Distributed Execution: Snakemake schedules to SLURM/cloud, similar to Nextflow but Python-first
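A sketch of running Snakemake against SLURM; older releases (up to version 7) use a `--cluster` submission command as shown, while newer releases move this into executor plugins and profiles.

```bash
# Preview the inferred DAG without running anything.
snakemake --dry-run

# Execute up to 100 rules concurrently, submitting each as its own SLURM job
# (Snakemake <=7 style; {threads} is filled in per rule).
snakemake --jobs 100 --cluster "sbatch --cpus-per-task={threads} --time=01:00:00"
```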
HPC Queue Priority and Scheduling
- FIFO (First-In-First-Out): simplest policy, fair in arrival order, but small jobs can starve behind large ones
- Backfill: scheduler identifies gaps (small jobs can fit before large job completion), fills gaps (improves utilization)
- Gang Scheduling: time-share nodes (multiple jobs on same node, swapped via preemption), increases utilization but adds latency
- Preemption: high-priority job preempts lower-priority (saves state if possible, or kills), ensures critical work gets resources
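Two commands useful for seeing these policies in action on a SLURM cluster; job ID 12345 is a placeholder.

```bash
# Break a pending job's priority into its factors (age, fairshare, partition, QOS).
sprio -j 12345

# Show the backfill scheduler's estimated start times for your pending jobs.
squeue --start -u "$USER"
```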
Resource Allocation Strategies
- Pack: schedule jobs densely (fill nodes completely before using new node), reduces fragmentation
- Spread: distribute across nodes (anti-pack), improves memory bandwidth but uses more nodes
- Balance: balance between pack/spread based on workload (compute-heavy: pack, memory-heavy: spread)
- Constraint-Based: specify required resources (CPU cores, memory, GPU, specific node features)
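A sketch of expressing these strategies at submission time; the feature name `skylake`, the memory/GPU sizes, and `job.sh` are hypothetical and site-specific.

```bash
# Constraint-based request: only nodes advertising the "skylake" feature,
# with 64 GB of memory and one GPU.
sbatch --constraint=skylake --mem=64G --gres=gpu:1 --cpus-per-task=8 job.sh

# Spread: one task per node (16 nodes used).
sbatch --ntasks=16 --ntasks-per-node=1 job.sh

# Pack: the same 16 tasks confined to 2 nodes.
sbatch --ntasks=16 --nodes=2 job.sh
```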
Heterogeneous Job Allocation
- Multiple Resource Types: job requests CPU + GPU + memory (e.g., 4 CPU + 1 GPU + 8 GB memory)
- Scheduling Complexity: scheduler must find nodes with specific resource combinations, NP-hard in general
- Heuristic Solution: greedy packing (fit largest resource requests first)
- Utilization Impact: heterogeneity reduces bin packing efficiency (~10-20% utilization loss)
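A sketch of mixed-resource requests; the second form uses SLURM's heterogeneous-job syntax (components separated by `:`), which the site's SLURM version must support, and both scripts are hypothetical.

```bash
# Single component with mixed resources: 4 CPUs + 1 GPU + 8 GB memory.
sbatch --cpus-per-task=4 --gpus=1 --mem=8G job.sh

# Heterogeneous job: a 16-task CPU component plus a single-task GPU component.
sbatch --ntasks=16 --cpus-per-task=4 : --ntasks=1 --gpus=1 hetjob.sh
```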
Job Dependency Management
- afterok: job runs after predecessor succeeds (exit code 0)
- afternotok: job runs if predecessor fails (exit code non-zero)
- afterany: job runs regardless of predecessor status
- DAG Support: Nextflow/Snakemake auto-generate dependencies (no manual specification needed)
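A sketch of chaining all three dependency types manually with `--parsable` (which prints only the job ID); the script names are hypothetical.

```bash
# Capture job IDs so later submissions can reference them.
pre=$(sbatch --parsable preprocess.sh)
run=$(sbatch --parsable --dependency=afterok:${pre} simulate.sh)

# Runs only if the simulation fails (non-zero exit).
sbatch --dependency=afternotok:${run} on_failure.sh

# Runs regardless of how the simulation ended.
sbatch --dependency=afterany:${run} archive_logs.sh
```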
Batch vs Interactive Jobs
- Batch (sbatch): job submitted to queue, executed when resources available, results written to files (asynchronous)
- Interactive (salloc): allocate resources, get shell prompt on compute node, immediate feedback
- Use Cases: batch for long-running simulations (1000+ core-hours), interactive for debugging/development
- Reservation: interactive jobs hold their allocation for the requested walltime (`salloc --time=1:00:00`), blocking other jobs from those resources
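A minimal interactive-session sketch; the one-step `srun --pty` form is a common convenience on many clusters.

```bash
# Allocate one node for an hour, then start a shell inside the allocation.
salloc --nodes=1 --time=01:00:00
srun --pty bash

# Or request an interactive shell in a single step.
srun --nodes=1 --time=00:30:00 --pty bash
```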
Advance Reservation
- Use Case: ensure resources available for specific time window (maintenance, deadline-driven project)
- Mechanism: `scontrol create reservation starttime=2024-03-01T09:00:00 duration=60 nodecnt=100 users=alice` (duration in minutes; a user or account list is required, `alice` is a placeholder)
- Preemption: the reserved window is guaranteed; the scheduler will not start jobs that would overlap it, and can preempt lower-priority work if configured to do so
- Cost: reduces cluster utilization (reserved but potentially idle), justified for critical work
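Once a reservation exists, jobs opt into it by name; `deadline_res` is a hypothetical reservation name.

```bash
# Submit into the reserved node set and time window.
sbatch --reservation=deadline_res job.sh

# Inspect the reservation's nodes, window, and allowed users.
scontrol show reservation deadline_res
```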
Job Checkpointing and Restart
- Checkpoint: save job state (memory, open files, execution context) to disk
- Restart: reload state, resume execution (avoids recomputation)
- Benefit: enables job preemption (save + restart), fault tolerance (survive crashes)
- Mechanism: application-level (custom code) or system-level (transparent, but limited portability)
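A sketch of cooperative (application-level) checkpointing driven by SLURM's end-of-time-limit signal; `./simulate` is a hypothetical application assumed to write `restart.chk` and exit when it receives SIGUSR1.

```bash
#!/bin/bash
#SBATCH --time=04:00:00
#SBATCH --signal=B:USR1@300   # signal the batch shell 300 s before the time limit
#SBATCH --requeue             # allow this job to be requeued after checkpointing

checkpoint_and_requeue() {
    kill -USR1 "$app_pid"          # ask the app to write restart.chk and exit
    wait "$app_pid"
    scontrol requeue "$SLURM_JOB_ID"
}
trap checkpoint_and_requeue USR1

# Resume from a checkpoint if one exists, otherwise start fresh.
if [ -f restart.chk ]; then
    ./simulate --restart restart.chk &
else
    ./simulate &
fi
app_pid=$!
wait "$app_pid"
```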
Scientific Workflow Provenance
- Record Execution: track which inputs → outputs, tool versions, parameters, execution environment
- Reproducibility: re-run same pipeline (deterministic if possible), verify results match
- PROV-DM Standard: W3C standard for provenance representation (graph of entities, activities, agents)
- Tools: Galaxy (web-based workflow platform), Common Workflow Language (CWL) for portable workflows
Scalability of Scheduling
- Large Clusters (10,000+ nodes): scheduling becomes critical bottleneck, decision latency limits throughput
- Optimization: approximate algorithms and fast heuristics instead of exact solutions to the underlying NP-hard placement problem
- Distributed Scheduling: multiple schedulers coordinate (reduces single-point bottleneck), enables elasticity
Future Directions: AI-driven scheduling (predict job characteristics, optimize placement), serverless HPC (FaaS model), containers standardizing job environments (reducing scheduling constraints).