SLURM - Simple Linux Utility for Resource Management
SLURM is an open-source, highly scalable workload manager for Linux clusters of all sizes. It provides job scheduling and resource management to optimize cluster utilization, and it is used by some of the world’s most powerful supercomputers.
Please refer to the SLURM documentation for more information.
Key Features
- Open-source and actively developed
- Scalable to tens of thousands of nodes
- Flexible job scheduling options
- Supports job arrays, reservations, and dependencies
- Plugins available for authentication, accounting, and more
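As a rough illustration of the job-array feature listed above, the sketch below is a batch script in which each array task selects its own input. The input file names and the `pick_input` helper are hypothetical; `--array`, the `%A`/`%a` filename patterns, and `SLURM_ARRAY_TASK_ID` are standard Slurm features. Outside of Slurm the task ID falls back to 0, so the script still runs.

```shell
#!/bin/bash
#SBATCH --job-name=array_demo        # Name of the array job.
#SBATCH --array=0-3                  # Four array tasks with IDs 0..3.
#SBATCH --output=array_%A_%a.out     # %A = array job ID, %a = array task ID.
#SBATCH --time=00:05:00
#SBATCH --ntasks=1

# Map an array task ID to an input file (hypothetical file names).
pick_input() {
    case "$1" in
        0) echo "alpha.txt" ;;
        1) echo "beta.txt" ;;
        2) echo "gamma.txt" ;;
        3) echo "delta.txt" ;;
        *) echo "unknown" ;;
    esac
}

# Slurm sets SLURM_ARRAY_TASK_ID for each array task; default to 0 elsewhere.
task_id="${SLURM_ARRAY_TASK_ID:-0}"
echo "Array task ${task_id} would process $(pick_input "$task_id")"
```

Submitting this script with `sbatch` would queue four tasks. A dependency can be declared at submission time, for example `sbatch --dependency=afterok:<jobid> next.sbatch`, where `next.sbatch` is a placeholder name.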
Basic Terminology
- Node – A single computer in the cluster.
- Partition – A group of nodes (like a queue).
- Job – A user-submitted task to be run on the cluster.
- Job Step – A set of (possibly parallel) tasks within a job, typically launched with `srun`; each MPI process is one task of a step.
- Scheduler – The component that determines which jobs run when.
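To make the distinction between a job and its job steps concrete, here is a minimal sketch of a batch script that launches two steps. The `run_step` wrapper is a hypothetical helper, not part of Slurm: inside an allocation it calls `srun`, so each call becomes its own job step, and outside Slurm it simply runs the command directly.

```shell
#!/bin/bash
#SBATCH --job-name=steps_demo   # The job: one allocation of resources.
#SBATCH --ntasks=2              # Each step may use up to two tasks.
#SBATCH --time=00:05:00

# Launch a command as a job step when running inside a Slurm job;
# otherwise run it directly so the sketch works anywhere.
run_step() {
    if command -v srun >/dev/null 2>&1 && [ -n "${SLURM_JOB_ID:-}" ]; then
        srun "$@"
    else
        "$@"
    fi
}

run_step uname -n                 # Step 0: print the node name per task.
run_step echo "second step done"  # Step 1: a separate step in the same job.
```

Under Slurm, both steps share the job's allocation and run one after the other.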
Basic Architecture
- slurmctld - Central controller that manages job scheduling and the overall state of the cluster.
- slurmd - Node daemon that runs on each compute node to execute assigned tasks.
- slurmdbd (optional) - Handles accounting and job information storage via a database backend.
The daemons exchange authenticated messages (authentication is provided by a plugin, commonly MUNGE) to coordinate resource usage and job execution.
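The split between the daemons can be seen from the client side. The following sketch only reports which client tools are installed and which daemon each one would query. The `probe` helper is hypothetical, and nothing here contacts a cluster.

```shell
#!/bin/bash
# Report whether a Slurm client tool is installed and name the daemon
# it would query. "probe" is an illustrative helper, not a Slurm command.
probe() {
    cmd="$1"
    target="$2"
    if command -v "$cmd" >/dev/null 2>&1; then
        echo "$cmd: available (talks to $target)"
    else
        echo "$cmd: not found (would talk to $target)"
    fi
}

probe scontrol slurmctld   # e.g. "scontrol ping" checks the controller.
probe sinfo slurmctld      # Node and partition state come from the controller.
probe sacct slurmdbd       # Accounting queries need the optional database daemon.
```

On a configured cluster, `scontrol ping` reports whether the controller is responding, while `sacct` only returns data if accounting storage has been set up.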
Official Source
SchedMD - Slurm Workload Manager
SLURM Commands
The most common commands are:
Command | Description |
---|---|
`srun` | Run a job or job step |
`sbatch` | Submit a job script for batch scheduling |
`scancel` | Cancel a running or pending job |
`squeue` | View job queue |
`sinfo` | View information about nodes and partitions |
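The commands in the table might be combined in a session like the one sketched below. It runs in dry-run mode by default and only prints each command, so it is safe to try without a cluster. The `run` helper, the `DRY_RUN` variable, and the job ID `12345` are assumptions made for illustration.

```shell
#!/bin/bash
# Dry-run by default: print commands instead of executing them.
# Set DRY_RUN=0 on a real cluster to execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

run sinfo                              # Show partitions and node states.
run squeue -u "${USER:-$(id -un)}"     # List your own pending and running jobs.
run sbatch job.sbatch                  # Queue a batch script.
run srun --ntasks=1 hostname           # Run a single task and wait for it.
run scancel 12345                      # Cancel a job (12345 is a placeholder ID).
```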
Example: Submitting a Job
Create a job script (e.g., `job.sbatch`):
    #!/bin/bash
    #SBATCH --job-name=test_job     # Name of your job.
    #SBATCH --output=result.out     # Result file, standard output.
    #SBATCH --error=result.err      # Standard error.
    #SBATCH --time=01:00:00         # Work time.
    #SBATCH --partition=standard    # Partition to work on.
    #SBATCH --ntasks=1              # Number of parallel tasks.

    echo "Hello from SLURM job"
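Once the script exists, it could be submitted and tracked roughly as follows. This is a sketch under assumptions: `extract_job_id` is a hypothetical helper that parses the confirmation line `sbatch` prints (of the form `Submitted batch job 12345`), and the submission is skipped when `sbatch` or `job.sbatch` is not present.

```shell
#!/bin/bash
# Pull the numeric job ID out of sbatch's confirmation line,
# e.g. "Submitted batch job 12345" -> "12345".
extract_job_id() {
    echo "$1" | awk '/^Submitted batch job/ {print $4}'
}

if command -v sbatch >/dev/null 2>&1 && [ -f job.sbatch ]; then
    reply="$(sbatch job.sbatch)"            # Submit the script above.
    job_id="$(extract_job_id "$reply")"
    echo "Submitted job ${job_id}"
    squeue -j "$job_id"                     # Show the job's current state.
    # After the job finishes, its standard output is in result.out
    # and its standard error is in result.err.
else
    echo "sbatch or job.sbatch not found; skipping submission"
fi
```

A pending or running job can be removed from the queue with `scancel` followed by the job ID.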
slurm_tutorial.1744044343.txt.gz · Last modified: 2025/04/07 16:45 by nshegunov