slurm_tutorial
Differences
This shows you the differences between two versions of the page.
slurm_tutorial [2025/04/07 16:33] – nshegunov | slurm_tutorial [2025/04/07 17:03] (current) – [SLURM - Simple Linux Utility for Resource Management] nshegunov | ||
---|---|---|---|
Line 1: | Line 1: | ||
====== SLURM - Simple Linux Utility for Resource Management ====== | ====== SLURM - Simple Linux Utility for Resource Management ====== | ||
- | SLURM is an open-source workload manager designed for Linux clusters of all sizes. It provides job scheduling and resource management to optimize cluster utilization.It is a highly scalable cluster management and job scheduling system for large and small Linux clusters. It is used by some of the world’s most powerful supercomputers. | + | SLURM is an open-source, highly scalable workload manager for Linux clusters of all sizes. It provides job scheduling and resource management to optimize cluster utilization, and it is used by some of the world’s most powerful supercomputers. |
Please refer to [[https:// | Please refer to [[https:// | ||
Line 19: | Line 19: | ||
* **Job Step** – A set of (possibly parallel) tasks within a job, typically launched with **srun**. | * **Job Step** – A set of (possibly parallel) tasks within a job, typically launched with **srun**. | ||
* **Scheduler** – The component that determines which jobs run when. | * **Scheduler** – The component that determines which jobs run when. | ||
+ | |||
+ | ===== Basic Architecture ===== | ||
+ | | {{ : | ||
+ | | SLURM architecture overview ([[https:// | ||
+ | |||
+ | Slurm consists of several components that together manage the cluster resources. Below is a short summary: | ||
+ | |||
+ | * **slurmctld (Controller Daemon)** | ||
+ | - Runs on the management (head) node. | ||
+ | - Handles job scheduling, resource allocation, and overall cluster state. | ||
+ | - Can be paired with an optional backup controller for failover. | ||
+ | |||
+ | * **slurmd (Node Daemon)** | ||
+ | - Runs on each compute node. | ||
+ | - Responsible for launching, monitoring, and cleaning up jobs on the node. | ||
+ | - Communicates with the slurmctld to receive instructions. | ||
+ | |||
+ | * **slurmdbd (Database Daemon)** '' | ||
+ | - Manages job accounting and usage data. | ||
+ | - Works with an external database (e.g., MySQL, MariaDB). | ||
+ | - Enables commands like **sacct** and **sreport** for usage reporting. | ||
+ | |||
+ | * **Client Commands** | ||
+ | - Tools used by users and admins to interact with Slurm: | ||
+ | - **sbatch** – submit batch jobs | ||
+ | - **srun** – run parallel jobs interactively | ||
+ | - **scancel** – cancel jobs | ||
+ | - **squeue** – view job queues | ||
+ | |||
+ | * **Central Database** '' | ||
+ | - Stores job and usage records. | ||
+ | - Used in conjunction with **slurmdbd** for accounting and reporting. | ||
+ | - Supports multiple clusters if needed. | ||
+ | |||
+ | The components communicate over authenticated connections (typically using MUNGE) to coordinate resource usage and job execution. | ||
+ | |||
+ | ==== Official Source ==== | ||
+ | |||
+ | SchedMD - Slurm Workload Manager | ||
+ | * https:// | ||
===== SLURM Commands ===== | ===== SLURM Commands ===== |
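
As a rough illustration of how the client commands listed under Basic Architecture fit together, the following Python sketch assembles a minimal batch script and the matching command lines. The helper names (`build_batch_script`, `client_commands`), the job name, the user name, and the job ID are hypothetical and chosen only for this example. The sketch prints the commands rather than executing them, so it runs on a machine without Slurm installed.

```python
import shlex


def build_batch_script(job_name, ntasks, time_limit, command):
    """Return the text of a minimal batch script for sbatch.

    The #SBATCH lines use the standard --job-name, --ntasks and --time
    options; the payload is launched as a job step with srun.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --time={time_limit}",
        "",
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


def client_commands(script_path, job_id, user):
    """Return argument lists for the client commands named in the text.

    Nothing is executed here; on a real cluster each list could be
    passed to subprocess.run().
    """
    return {
        "submit": ["sbatch", script_path],           # submit a batch job
        "queue": ["squeue", "-u", user],             # view this user's jobs
        "cancel": ["scancel", str(job_id)],          # cancel a job by ID
        "accounting": ["sacct", "-j", str(job_id)],  # needs slurmdbd accounting
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(build_batch_script("demo", 4, "00:10:00", "hostname"))
    for name, argv in client_commands("job.sh", 12345, "alice").items():
        print(f"{name}: {shlex.join(argv)}")
```

On a cluster, the script text would be saved to a file such as `job.sh` and submitted with the `submit` command. The `accounting` entry only returns data when job accounting through **slurmdbd** is configured.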
slurm_tutorial.1744043611.txt.gz · Last modified: 2025/04/07 16:33 by nshegunov