Using Slurm Workload Manager with Intel Gaudi

Slurm is an open-source workload manager used for job scheduling and cluster management in high-performance computing (HPC) environments. Slurm supports the Intel® Gaudi® AI accelerator through its Generic Resource (GRES) feature. GRES enables Slurm to manage and schedule jobs that require specialized hardware components such as Intel Gaudi accelerators, FPGA accelerators, InfiniBand adapters, custom networking hardware, or any other non-standard resources. For more details, see the Slurm documentation.

Prerequisites

  • Intel Gaudi software stack (drivers and hl-smi). For more details, refer to the Installation Guide.

  • Intel Gaudi container runtime for Docker with the proper /etc/docker/daemon.json configuration. For more details, refer to the Installation Guide.

  • The external networking ports of the Intel Gaudi cards are up.

Build and Install

Intel Gaudi provides a Slurm fork that includes the changes required to support the Intel Gaudi software. To use Slurm with Gaudi, you must build and install Intel Gaudi’s Slurm fork.

Managing Gaudi Accelerators in Slurm

To manage Gaudi accelerators in Slurm, you need to configure Slurm to recognize the Intel Gaudi accelerators in the cluster, schedule jobs that require Gaudi resources, and allocate Gaudi accelerators to those jobs efficiently. The following lists the required steps:

  1. Configure Gaudi detection so that Slurm recognizes the Gaudi accelerators available in the cluster using the following Slurm configuration files (a configuration sketch is shown after this list):

    1. slurm.conf - Slurm configuration file. For an example configuration file, see Intel Gaudi’s Slurm fork:

      1. GresTypes=gpu – A comma-separated list of GRES types provided by nodes in the cluster.

      2. NodeName=... Gres=gpu:8 – Specifies which GRES each node contains. You can specify the card module, such as HL-225, for more specific provisioning in a mixed GRES cluster (Gres=gpu:HL-225:8).

    2. gres.conf - GRES configuration file. This file specifies how to access GRES on the nodes. For an example configuration file, see Intel Gaudi’s Slurm fork:

      1. AutoDetect - Slurm automatically identifies the resources available on the nodes. It looks for the resources on the cluster’s nodes using a pre-programmed path and fills in the resource metadata on its own. This can be done only if Slurm has the relevant GRES plugin. For example, if the HLML plugin is installed, applying the configuration AutoDetect=hlml makes Slurm automatically detect GPU-type GRES under the path /etc/accel/accel* and fill the GRES metadata fields with the card parameters provided by the HLML C library.

      2. Manual configuration - Configure GRES manually with custom resources. Manual configuration can be used if the relevant plugin does not exist or custom detection rules need to be applied. For example, to manage Intel Gaudi accelerators, you can apply the following configuration in the file: Name=gpu Type=HL-225 File=/etc/accel/accel[0-7] COREs=0,1. Note that with this approach the card’s metadata fields are not filled, and in case of an accelerator fault, the scheduler will still provision the faulted accelerator because it cannot monitor it.

  2. Specify Gaudi requirements in job submissions to Slurm. This can be done using the --gres flag followed by the type and quantity of Gaudi accelerators required. For example, --gres=gpu:2 requests two Gaudi accelerators for the job.

  3. (Optional) Configure a dedicated Gaudi partition in Slurm to separate Gaudi compute nodes from other nodes, such as CPU-only nodes. This allows you to submit jobs specifically to the Gaudi partition, ensuring that these nodes are reserved for Gaudi-accelerated workloads. See the partition sketch after this list.

  4. Allocate resources and schedule jobs using Slurm’s scheduler. The scheduler takes Gaudi requests into account when allocating resources to jobs. It ensures that jobs requesting Gaudi accelerators are scheduled only on nodes equipped with the requested accelerators, and that Gaudi resources are allocated efficiently based on job priorities and dependencies.

  5. Use monitoring and accounting in Slurm to track Gaudi usage across the cluster. This includes tracking the number of Gaudi accelerators allocated to each job, the duration of Gaudi usage, and other relevant metrics for accounting and billing purposes. See the accounting example after this list.

  6. Integrate Slurm with Gaudi management tools and libraries to enhance Gaudi management capabilities. Slurm can work with Gaudi monitoring tools to provide real-time insights into Gaudi utilization and performance.
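
The following sketch pulls the parameters from step 1 into minimal slurm.conf and gres.conf excerpts. The node name gaudi-node-01 is a placeholder, other node attributes (CPUs, memory, and so on) are omitted, and the two gres.conf options are alternatives; use the example configuration files in Intel Gaudi’s Slurm fork as the authoritative reference.

# slurm.conf (excerpt) - gaudi-node-01 is a placeholder node name
GresTypes=gpu
NodeName=gaudi-node-01 Gres=gpu:HL-225:8

# gres.conf - option 1: automatic detection through the HLML plugin
AutoDetect=hlml

# gres.conf - option 2: manual configuration (card metadata is not filled)
Name=gpu Type=HL-225 File=/etc/accel/accel[0-7] COREs=0,1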
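
For steps 2 and 3, a dedicated Gaudi partition and a job submission against it could look like the following sketch. The partition name gaudi, the node list, and the script name train_job.sh are assumptions for illustration.

# slurm.conf (excerpt) - partition and node names are placeholders
PartitionName=gaudi Nodes=gaudi-node-[01-04] Default=NO MaxTime=INFINITE State=UP

# Submit to the Gaudi partition, requesting two Gaudi accelerators per node
sbatch --partition=gaudi --gres=gpu:2 train_job.sh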
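
For step 5, assuming Slurm accounting storage is enabled, sacct can report the trackable resources (TRES) allocated to a job, including the gres/gpu count, along with its elapsed time:

# Report allocated trackable resources (including gres/gpu) and elapsed time for a job
sacct -j <job_id> --format=JobID,JobName,AllocTRES,Elapsed,State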

Running Jobs with Gaudi

Every node that needs to expose a Generic Resource to Slurm jobs must be configured with the Gres variable, which contains a list of available GRES: Gres=resource_type1:quantity1,resource_type2:quantity2,...,resource_typeN:quantityN. This comma-separated list pairs each GRES component with the quantity of units that the node has to offer. For example, for a node with two Gaudi accelerators and a single network port, the definition should be Gres=gpu:2,nic:1. The Slurm controller daemon, slurmctld, maintains a GRES list called gres_list for each node, job, and step. This list provides the available and required GRES and is structured differently for each type. When a resource is not associated with any GRES, the list is set to NULL.

To use Gaudi in a Slurm job, you must explicitly request it when running the job using the --gres flag or one of the --gpus flags. The following flags are available:

  • --gres – Specifies the type and number of generic resources required per node.

  • --gpus – Specifies the number of Gaudi accelerators required for the entire job.

  • --gpus-per-node – In a multi-server job, specifies the number of Gaudi accelerators required on each node.

  • --gpus-per-task – Specifies how many Gaudi accelerators are required for each task. This enables resource isolation so that each task has its own Gaudi accelerators. It requires specifying the number of tasks using either the --ntasks or --ntasks-per-node flag.
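
The following commands sketch how these flags can be combined; the training script name train.py is a placeholder.

# Two Gaudi accelerators on a single node (GRES syntax)
srun --gres=gpu:2 python train.py

# Eight Gaudi accelerators for the entire job
srun --gpus=8 python train.py

# Multi-node job: two nodes with eight Gaudi accelerators each
srun --nodes=2 --gpus-per-node=8 python train.py

# Eight tasks, each isolated to its own Gaudi accelerator
srun --ntasks=8 --gpus-per-task=1 python train.py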

Using Intel Gaudi Environment Variables

For each running task that uses Gaudi accelerators, Slurm exports the HABANA_VISIBLE_DEVICES environment variable. This environment variable contains a comma-separated list of the device IDs allocated to that specific task, for example HABANA_VISIBLE_DEVICES=0,1,2.
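
For example, a task allocated three Gaudi accelerators could inspect the variable as follows (the exact device IDs depend on the allocation):

srun --gres=gpu:3 bash -c 'echo $HABANA_VISIBLE_DEVICES'
# Prints a list similar to: 0,1,2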

To isolate the allocated accelerators for the task, this environment variable needs to be forwarded to the habanalabs-container-runtime when running the workload inside a containerized environment. The environment can be based on any containerization tool, such as Docker, containerd, CRI-O, and so on. When running outside of a containerized environment, the job can access unallocated Gaudi accelerators and interfere with other running tasks. For example, the following command runs hl-smi inside a Docker container with a single allocated Gaudi accelerator:

srun --gres=gpu:1 bash -c "docker run --init --rm --runtime=habana -e HABANA_VISIBLE_DEVICES ${ML_FRAMEWORK_IMAGE} hl-smi"
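
The same pattern can also be used from a batch script. The following is a minimal sketch; the job name, the partition name, and the ML_FRAMEWORK_IMAGE value are assumptions that must match your cluster and container image.

#!/bin/bash
#SBATCH --job-name=gaudi-job
#SBATCH --partition=gaudi
#SBATCH --nodes=1
#SBATCH --gres=gpu:8

# srun launches the task so that Slurm exports HABANA_VISIBLE_DEVICES for the allocated accelerators
srun docker run --init --rm --runtime=habana -e HABANA_VISIBLE_DEVICES ${ML_FRAMEWORK_IMAGE} hl-smi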