slurm_tutorial
Differences
This shows you the differences between two versions of the page.
slurm_tutorial [2025/04/07 16:20] – nshegunov | slurm_tutorial [2025/04/07 17:03] (current) – [SLURM - Simple Linux Utility for Resource Management] nshegunov | ||
---|---|---|---|
Line 1: | Line 1: | ||
====== SLURM - Simple Linux Utility for Resource Management ====== | ====== SLURM - Simple Linux Utility for Resource Management ====== | ||
- | SLURM is an open-source workload manager designed for Linux clusters of all sizes. It provides job scheduling and resource management to optimize cluster utilization. It is a highly scalable cluster management and job scheduling system for large and small Linux clusters. It is used by some of the world’s most powerful supercomputers. | + | SLURM is an open-source workload manager designed for Linux clusters of all sizes. It provides job scheduling and resource management to optimize cluster utilization. It is a highly scalable cluster management and job scheduling system for large and small Linux clusters. It is used by some of the world’s most powerful supercomputers. It manages resources such as CPU, GPU, and memory, allows different users to work on the cluster simultaneously, and provides a mechanism for distributing user programs across the cluster. |
+ | |||
+ | Please refer to [[https:// | ||
===== Key Features ===== | ===== Key Features ===== | ||
Line 17: | Line 19: | ||
* **Job Step** – A set of (possibly parallel) tasks within a job, typically launched with **srun**. | * **Job Step** – A set of (possibly parallel) tasks within a job, typically launched with **srun**. | ||
* **Scheduler** – The component that determines which jobs run when. | * **Scheduler** – The component that determines which jobs run when. | ||
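To make the job and job step terms concrete, here is a minimal sketch of a batch script body in which each **srun** call starts one job step. The helper name `run_step` is hypothetical; it falls back to running the command directly when the script is not inside a Slurm allocation (detected here through the `SLURM_JOB_ID` environment variable), so the sketch can also be tried on a machine without Slurm.

```shell
#!/bin/bash
#SBATCH --job-name=steps_demo
#SBATCH --ntasks=1

# run_step is a hypothetical helper, not a Slurm command.
# Inside an allocation it launches the command as a job step via srun;
# elsewhere it simply runs the command so the script can be tested locally.
run_step() {
    if command -v srun >/dev/null 2>&1 && [ -n "${SLURM_JOB_ID:-}" ]; then
        srun "$@"
    else
        "$@"
    fi
}

run_step uname -n                  # first job step: print the node name
run_step echo "second step done"   # second job step
```

When submitted with **sbatch**, the whole script is one job and each `run_step` line becomes a separate step of that job.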
+ | |||
+ | ===== Basic Architecture ===== | ||
+ | | {{ : | ||
+ | | SLURM architecture overview ([[https:// | ||
+ | |||
+ | Slurm relies on several components to manage the cluster resources. Below is a short summary: | ||
+ | |||
+ | * **slurmctld (Controller Daemon)** | ||
+ | - Runs on the management (head) node. | ||
+ | - Handles job scheduling, resource allocation, and overall cluster state. | ||
+ | - Can be paired with an optional backup controller for failover. | ||
+ | |||
+ | * **slurmd (Node Daemon)** | ||
+ | - Runs on each compute node. | ||
+ | - Responsible for launching, monitoring, and cleaning up jobs on the node. | ||
+ | - Communicates with the slurmctld to receive instructions. | ||
+ | |||
+ | * **slurmdbd (Database Daemon)** '' | ||
+ | - Manages job accounting and usage data. | ||
+ | - Works with an external database (e.g., MySQL, MariaDB). | ||
+ | - Enables commands like **sacct** and **sreport** for usage reporting. | ||
+ | |||
+ | * **Client Commands** | ||
+ | - Tools used by users and admins to interact with Slurm: | ||
+ | - **sbatch** – submit batch jobs | ||
+ | - **srun** – run parallel jobs interactively | ||
+ | - **scancel** – cancel jobs | ||
+ | - **squeue** – view job queues | ||
+ | |||
+ | * **Central Database** '' | ||
+ | - Stores job and usage records. | ||
+ | - Used in conjunction with **slurmdbd** for accounting and reporting. | ||
+ | - Supports multiple clusters if needed. | ||
+ | |||
+ | Each component communicates over a secure protocol to coordinate resource usage and job execution efficiently. | ||
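As a quick way to see which of the client commands listed above are reachable from a given login shell, the following sketch can be used. The function name `check_slurm_clients` and the variable `slurm_clients` are illustrative assumptions and not part of Slurm; the script only looks the commands up on `PATH` and does not contact the controller.

```shell
#!/usr/bin/env bash
# Client commands mentioned in this section (the list is an assumption of
# this sketch, not an exhaustive inventory of Slurm tools).
slurm_clients="sbatch srun scancel squeue sacct sreport"

# Print one line per command stating whether it is found on PATH.
check_slurm_clients() {
    for cmd in $slurm_clients; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd: found"
        else
            echo "$cmd: not found"
        fi
    done
}

check_slurm_clients
```

If the commands are present, `scontrol ping` can then be used to check whether the controller daemon responds.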
+ | |||
+ | ==== Official Source ==== | ||
+ | |||
+ | SchedMD - Slurm Workload Manager | ||
+ | * https:// | ||
===== SLURM Commands ===== | ===== SLURM Commands ===== | ||
- | Here are some commonly used SLURM commands: | + | Most common |
^ Command ^ Description ^ | ^ Command ^ Description ^ | ||
Line 31: | Line 73: | ||
===== Example: Submitting a Job ===== | ===== Example: Submitting a Job ===== | ||
- | Create a job script (e.g., `job.sh`): | + | Create a job script (e.g., `job.sbatch`): |
- | ```bash | + | < |
- | # | + | |
- | #SBATCH --job-name=test_job | + | |
- | #SBATCH --output=result.out | + | |
- | #SBATCH --error=result.err | + | |
- | #SBATCH --time=01: | + | |
- | #SBATCH --partition=standard | + | |
- | #SBATCH --ntasks=1 | + | |
- | echo "Hello from SLURM job" | + | # |
+ | #SBATCH --job-name=test_job | ||
+ | #SBATCH --output=result.out | ||
+ | #SBATCH --error=result.err | ||
+ | #SBATCH --time=01: | ||
+ | #SBATCH --partition=standard # Partition to work on. | ||
+ | #SBATCH --ntasks=1 | ||
+ | |||
+ | | ||
+ | </ | ||
slurm_tutorial.1744042855.txt.gz · Last modified: 2025/04/07 16:20 by nshegunov