
SLURM - Simple Linux Utility for Resource Management

SLURM is an open-source workload manager designed for Linux clusters of all sizes. It provides job scheduling and resource management to optimize cluster utilization. It is highly scalable and is used by some of the world’s most powerful supercomputers. It manages resources such as CPUs, GPUs, and memory, allows different users to work on the cluster simultaneously, and provides a mechanism for distributing user programs across the cluster.

Please refer to SLURM documentation for more information.

Key Features

Basic Terminology

Basic Architecture

SLURM architecture overview (original picture)

Slurm is built from several components that together manage the cluster resources. Below is a short summary:

Each component communicates over a secure protocol to coordinate resource usage and job execution efficiently.

Official Source

SchedMD - Slurm Workload Manager

SLURM Commands

The most common commands are:

| Command | Description |
|---|---|
| `srun` | Run a job or job step |
| `sbatch` | Submit a job script for batch scheduling |
| `scancel` | Cancel a running or pending job |
| `squeue` | View the job queue |
| `sinfo` | View information about nodes and partitions |
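As a rough illustration of how the commands in the table above are typically invoked, here is a minimal sketch. It assumes the Slurm client tools are installed and on `PATH`; the helper name `slurm_available` is hypothetical and only guards the calls so the script does nothing harmful on a machine without Slurm.

```shell
#!/bin/bash
# Sketch: typical invocations of the commands listed in the table.
# Assumption: Slurm client tools are on PATH; otherwise nothing is run.

# Hypothetical helper: succeeds only if the sinfo binary can be found.
slurm_available() {
  command -v sinfo >/dev/null 2>&1
}

if slurm_available; then
  sinfo                                # partitions and node states
  squeue -u "${USER:-$(id -un)}"       # only the current user's jobs
  srun --ntasks=1 hostname             # run one task, print the node's hostname
else
  echo "Slurm client tools not found; skipping examples"
fi
```

On a login node these run as-is; `srun` blocks until resources are allocated, so it may wait on a busy cluster.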

Example: Submitting a Job

Create a job script (e.g., `job.sbatch`):

  #!/bin/bash
  #SBATCH --job-name=test_job  # Job name shown in the queue.
  #SBATCH --output=result.out  # File for standard output.
  #SBATCH --error=result.err   # File for standard error.
  #SBATCH --time=01:00:00      # Wall-clock time limit (HH:MM:SS).
  #SBATCH --partition=standard # Partition to submit the job to.
  #SBATCH --ntasks=1           # Number of parallel tasks.
 
  echo "Hello from SLURM job"
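Once the script exists, submitting and inspecting the job could look like the sketch below. It assumes `job.sbatch` is in the current directory and that the `standard` partition exists on your cluster. The helper `parse_job_id` is hypothetical; it is there because `sbatch --parsable` prints either the job ID alone or the job ID followed by `;` and the cluster name.

```shell
#!/bin/bash
# Sketch: submit job.sbatch and look up the resulting job.
# Assumptions: job.sbatch is in the current directory and sbatch is on PATH.

# Hypothetical helper: keep only the job ID from "jobid" or "jobid;cluster".
parse_job_id() {
  printf '%s\n' "${1%%;*}"
}

if command -v sbatch >/dev/null 2>&1 && [ -f job.sbatch ]; then
  # --parsable makes sbatch print just the job ID (plus optional cluster name).
  if output="$(sbatch --parsable job.sbatch)"; then
    job_id="$(parse_job_id "$output")"
    echo "Submitted job ${job_id}"
    squeue -j "${job_id}"      # show the job while it is pending or running
    # scancel "${job_id}"      # uncomment to cancel the job
  else
    echo "sbatch rejected the job script" >&2
  fi
else
  echo "sbatch or job.sbatch not found; nothing submitted"
fi
```

When the job finishes, its standard output ends up in `result.out` and its standard error in `result.err`, as set by the `--output` and `--error` directives.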