SLURM - Simple Linux Utility for Resource Management
SLURM is an open-source, highly scalable workload manager for Linux clusters of all sizes. It provides job scheduling and resource management to optimize cluster utilization, and it is used by some of the world’s most powerful supercomputers.
Key Features
- Open-source and actively developed
- Scalable to tens of thousands of nodes
- Flexible job scheduling options
- Supports job arrays, reservations, and dependencies
- Plugins available for authentication, accounting, and more
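As a rough illustration of the job array feature, the sketch below shows a batch script in which each array element picks its own input. The job name, the array range, and the `inputs` list are made up for this example. `SLURM_ARRAY_TASK_ID` is set by SLURM for each array element, and the script falls back to index 0 so it can also be tried outside the cluster.

```bash
#!/bin/bash
#SBATCH --job-name=array_demo        # hypothetical job name
#SBATCH --array=0-3                  # four array elements: indices 0..3
#SBATCH --output=array_%A_%a.out     # %A = array job ID, %a = array index
#SBATCH --time=00:05:00
#SBATCH --ntasks=1

# SLURM sets SLURM_ARRAY_TASK_ID for every array element.
# Default to 0 so the script also runs under plain bash for a local check.
task_id="${SLURM_ARRAY_TASK_ID:-0}"

# Hypothetical input list: one entry per array index.
inputs=(alpha beta gamma delta)

# pick_input: print the input that belongs to the given array index.
pick_input() {
    local idx="$1"
    echo "${inputs[$idx]}"
}

echo "Task ${task_id} processes input: $(pick_input "$task_id")"
```

A dependency between two jobs can be expressed in a similar spirit at submission time, for example `sbatch --dependency=afterok:<job_id> next.sh`, where `next.sh` is a hypothetical follow-up script and `<job_id>` is the ID of the job it should wait for.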
Basic Terminology
- Node – A single computer in the cluster.
- Partition – A group of nodes (like a queue).
- Job – A user-submitted task to be run on the cluster.
- Job Step – A set of tasks within a job, typically launched with `srun` (for example, one run of an MPI program).
- Scheduler – The component that determines which jobs run when.
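The difference between a job and its job steps can be sketched with a small batch script. The whole script is one job, and each `srun` call inside it starts one job step. The `run_step` helper and the job name are assumptions made for this example; the helper runs the command directly when the script is not executing inside a SLURM allocation, so the sketch can be tried on a workstation.

```bash
#!/bin/bash
#SBATCH --job-name=steps_demo        # hypothetical job name
#SBATCH --ntasks=2
#SBATCH --time=00:05:00

# run_step: launch a command as a job step via srun when running inside a
# SLURM job; otherwise run it directly (fallback for local experiments).
run_step() {
    if command -v srun >/dev/null 2>&1 && [ -n "${SLURM_JOB_ID:-}" ]; then
        srun --ntasks=1 "$@"
    else
        "$@"
    fi
}

# Each call below becomes a separate job step inside the same job.
run_step echo "step 0: preprocessing"
run_step echo "step 1: main computation"
```

When submitted with `sbatch`, the two `srun` calls would show up as separate steps of the same job, numbered from 0.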
SLURM Commands
Here are some commonly used SLURM commands:
Command | Description |
---|---|
`srun` | Run a job or job step |
`sbatch` | Submit a job script for batch scheduling |
`scancel` | Cancel a running or pending job |
`squeue` | View job queue |
`sinfo` | View information about nodes and partitions |
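A typical way to combine these commands is to submit a script, capture the job ID, and then inspect the queue. The following is a minimal sketch, assuming a working SLURM installation and a job script passed in by the caller; the helper names `extract_job_id` and `submit_and_watch` are made up for this example.

```bash
#!/bin/bash

# extract_job_id: read sbatch output on stdin and print the numeric job ID
# from the "Submitted batch job <id>" line.
extract_job_id() {
    awk '/^Submitted batch job/ {print $4}'
}

# submit_and_watch: submit a job script and show its entry in the queue.
submit_and_watch() {
    local script="$1"
    if ! command -v sbatch >/dev/null 2>&1; then
        echo "sbatch not found; run this on a cluster login node." >&2
        return 1
    fi
    local job_id
    job_id="$(sbatch "$script" | extract_job_id)"
    if [ -z "$job_id" ]; then
        echo "Submission failed for ${script}" >&2
        return 1
    fi
    echo "Submitted job ${job_id}"
    squeue -j "$job_id"    # show the state of this job only
}
```

For example, `submit_and_watch job.sh` submits the script and prints its queue entry. The job can later be cancelled with `scancel <job_id>`, and `sinfo` lists the available partitions and node states.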
Example: Submitting a Job
Create a job script (e.g., `job.sh`):
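```bash
#!/bin/bash
#SBATCH --job-name=test_job      # name shown in the queue
#SBATCH --output=result.out      # file for standard output
#SBATCH --error=result.err       # file for standard error
#SBATCH --time=01:00:00          # wall-clock limit (HH:MM:SS)
#SBATCH --partition=standard     # partition to submit to
#SBATCH --ntasks=1               # number of tasks

echo "Hello from SLURM job"
```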
slurm_tutorial.1744042855.txt.gz · Last modified: 2025/04/07 16:20 by nshegunov