Running Kubernetes Workloads with Gaudi

Kubernetes provides an efficient and manageable way to orchestrate deep learning workloads at scale.

Prerequisites

  • Kubernetes version listed in the Support Matrix.

  • Make sure to install the Intel Gaudi Base Operator for Kubernetes or the Intel Gaudi Device Plugin for Kubernetes. For more details, refer to Kubernetes Installation. A quick way to verify the installation is sketched after this list.
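
After the Base Operator or Device Plugin is installed, the cluster nodes should advertise the habana.ai/gaudi resource. The check below is a minimal verification sketch using standard kubectl commands, not a step from the installation guide:

    # The device plugin registers the habana.ai/gaudi extended resource, so it
    # should appear under each Gaudi node's Capacity and Allocatable sections.
    kubectl describe nodes | grep habana.ai/gaudi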

Running Gaudi Jobs Example

You can create a Kubernetes job that acquires a Gaudi device by requesting the habana.ai/gaudi resource in the resources.limits field. Below is an example using Intel Gaudi’s PyTorch container image.

  1. Run the job:

    cat <<EOF | kubectl apply -f -
    
    apiVersion: batch/v1
    kind: Job
    metadata:
       name: habanalabs-gaudi-demo
    spec:
       template:
          spec:
             hostIPC: true
             restartPolicy: OnFailure
             containers:
              - name: habana-ai-base-container
                image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
                workingDir: /root
                command: ["hl-smi"]
                securityContext:
                   capabilities:
                      add: ["SYS_NICE"]
                resources:
                   limits:
                      habana.ai/gaudi: 1
                      memory: 409Gi
                      hugepages-2Mi: 95000Mi
    EOF
    
  2. Check the pod status:

    kubectl get pods
    
  3. Retrieve the name of the pod and see the results:

    kubectl logs <pod-name>
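
Because the pod name is generated by the Job controller, you can also address the pod through the Job itself. The commands below are a small convenience sketch that relies only on standard kubectl behavior:

    # Pods created by a Job carry the job-name label.
    kubectl get pods -l job-name=habanalabs-gaudi-demo

    # kubectl logs can resolve the pod directly from the Job object.
    kubectl logs job/habanalabs-gaudi-demo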
    

Note

After setting up your Kubernetes cluster, use Prometheus Metric Exporter to collect the Gaudi device metrics.

Running Gaudi MNIST Training Job Example

Below is an example of training an MNIST PyTorch model using Intel Gaudi’s PyTorch container image. The job runs the training script on the eight Gaudi devices of a single node through mpirun.

  1. Create a mnist.yaml file:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: mnist-demo
    spec:
      template:
        spec:
          hostIPC: true
          restartPolicy: OnFailure
          containers:
            - name: mnist
              image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
              command: ["/bin/bash", "-c"]
              args:
                - >-
                  git clone --branch 1.18.0 https://github.com/HabanaAI/Model-References.git /Model-References;
                  MODEL_PATH=/Model-References/PyTorch/examples/computer_vision/hello_world;
                  cd $MODEL_PATH;
    
                  MNIST_CMD="python mnist.py \
                    --batch-size=64 \
                    --epochs=1 \
                    --lr=1.0 \
                    --gamma=0.7 \
                    --hpu";
    
                  mpirun -np 8 \
                    --allow-run-as-root \
                    --bind-to core \
                    --map-by ppr:4:socket:PE=6 \
                    --rank-by core --report-bindings \
                    --tag-output \
                    --merge-stderr-to-stdout --prefix $MPI_ROOT \
                    $MNIST_CMD;
              securityContext:
                capabilities:
                  add: ["SYS_NICE"]
              resources:
                limits:
                  habana.ai/gaudi: 8
                  memory: 409Gi
                  hugepages-2Mi: 95000Mi
                requests:
                  habana.ai/gaudi: 8
                  memory: 409Gi
                  hugepages-2Mi: 95000Mi
    
  2. Run the job:

    kubectl apply -f mnist.yaml
    
  3. Check the pod status:

    kubectl get pods
    
  4. Retrieve the name of the pod and see the results:

    kubectl logs <pod-name>
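
If you prefer to wait until the training run finishes before reading the output, kubectl can block on the Job’s completion condition. This is a minimal sketch using standard kubectl commands and the job name from the manifest above; adjust the timeout to your environment:

    # Wait (here up to 10 minutes) for the Job to report the Complete condition.
    kubectl wait --for=condition=complete job/mnist-demo --timeout=600s

    # Read the logs of the pod created by the Job.
    kubectl logs job/mnist-demo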
    

MPI Operator for Multi-Gaudi Nodes

Intel® Gaudi® uses the standard MPI Operator from Kubeflow, which allows running MPI allreduce-style workloads in Kubernetes while leveraging Gaudi accelerators. In combination with Intel Gaudi hardware and software, it enables large-scale distributed training with a simple Kubernetes job distribution model.

Installing MPI Operator

Follow the MPI Operator documentation for instructions on setting up the MPI Operator on your Kubernetes cluster.
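
For reference, the MPI Operator is typically installed by applying its release manifest with kubectl. The release tag and manifest URL below are illustrative assumptions; use the version recommended by the MPI Operator documentation for your cluster:

    # Illustrative only: replace v0.4.0 with the release that matches your cluster.
    kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.4.0/deploy/v2beta1/mpi-operator.yaml

    # Confirm that the MPIJob custom resource definition is available.
    kubectl get crd mpijobs.kubeflow.org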

Running Multi-Gaudi Workloads Example

Below is an example of an MPIJob that trains the MNIST model on 16 Gaudi devices, spread across two worker nodes with eight devices each.

  1. Create an mpijob-mnist.yaml file. Make sure to set the number of Gaudi worker nodes in Worker -> replicas:

    apiVersion: kubeflow.org/v2beta1
    kind: MPIJob
    metadata:
      name: mnist-run
    spec:
      slotsPerWorker: 8
      runPolicy:
        cleanPodPolicy: Running
      mpiReplicaSpecs:
        Launcher:
          replicas: 1
          template:
            spec:
              containers:
                - image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
                  name: mnist-launcher
                  command: ["/bin/bash", "-c"]
                  args:
                    - >-
                      /usr/bin/ssh-keygen -A;
                      /usr/sbin/sshd;
    
                      HOSTSFILE=$OMPI_MCA_orte_default_hostfile;
                      MASTER_ADDR="$(head -n 1 $HOSTSFILE | sed -n s/[[:space:]]slots.*//p)";
    
                      NUM_NODES=$(wc -l < $HOSTSFILE);
                      CARDS_PER_NODE=8;
                      N_CARDS=$((NUM_NODES*CARDS_PER_NODE));
    
                      SETUP_CMD="git clone --branch 1.18.0 https://github.com/HabanaAI/Model-References.git /Model-References";
                      $SETUP_CMD;
                      mpirun --npernode 1 \
                        --tag-output \
                        --allow-run-as-root \
                        --prefix $MPI_ROOT \
                        $SETUP_CMD;
    
                      MODEL_PATH=/Model-References/PyTorch/examples/computer_vision/hello_world;
                      MNIST_CMD="python $MODEL_PATH/mnist.py \
                        --batch-size=64 \
                        --epochs=1 \
                        --lr=1.0 \
                        --gamma=0.7 \
                        --hpu";
    
                      cd $MODEL_PATH;
                      mpirun -np ${N_CARDS} \
                        --allow-run-as-root \
                        --bind-to core \
                        --map-by ppr:4:socket:PE=6 \
                        --rank-by core --report-bindings \
                        --tag-output \
                        --merge-stderr-to-stdout --prefix $MPI_ROOT \
                        -x MASTER_ADDR=$MASTER_ADDR \
                        $MNIST_CMD;
        Worker:
          replicas: 2
          template:
            spec:
              hostIPC: true
              containers:
                - image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
                  name: mnist-worker
                  resources:
                    limits:
                      habana.ai/gaudi: 8
                      memory: 409Gi
                      hugepages-2Mi: 95000Mi
                    requests:
                      habana.ai/gaudi: 8
                      memory: 409Gi
                      hugepages-2Mi: 95000Mi
                  command: ["/bin/bash", "-c"]
                  args:
                    - >-
                      /usr/bin/ssh-keygen -A;
                      /usr/sbin/sshd;
                      sleep 365d;
    

    Note

    • PyTorch uses shared memory buffers to communicate between processes. By default, Docker containers are allocated 64MB of shared memory. When using more than one HPU, this allocation can be insufficient. Setting hostIPC: true allows re-using the host’s shared memory space inside the container. An alternative that avoids hostIPC is sketched after this note.

    • According to Kubernetes’ backoff policy, if a failure occurs, such as the worker pods not running, the job is automatically restarted. This is useful for resuming long-running training from a checkpoint if an error causes the job to crash. For more information, refer to Kubernetes backoff failure policy.
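
    If your cluster policy does not permit hostIPC, a common Kubernetes alternative is to mount a memory-backed emptyDir at /dev/shm, which gives the container a larger shared memory segment. This is a general Kubernetes pattern rather than a step from the Gaudi documentation, shown here as a hypothetical fragment of the worker pod spec:

        # Hypothetical worker pod spec fragment: replaces hostIPC: true with a
        # memory-backed emptyDir mounted at /dev/shm. Size it to your workload.
        spec:
          containers:
            - name: mnist-worker
              volumeMounts:
                - name: dshm
                  mountPath: /dev/shm
          volumes:
            - name: dshm
              emptyDir:
                medium: Memory
                sizeLimit: 4Gi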

  2. Run the job:

    kubectl apply -f mpijob-mnist.yaml
    
  3. Check the pod status:

    kubectl get pods -A
    
  4. Retrieve the name of the pod and see the results:

    kubectl logs <pod-name>
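
    With the MPI Operator, the training output comes from the launcher pod; the worker pods only run sshd and sleep. The label keys below are an assumption based on recent MPI Operator releases, so check your operator version if the selector matches nothing:

        # Assumed labels: training.kubeflow.org/job-name and .../job-role are set
        # by recent MPI Operator releases; older releases may use different keys.
        kubectl get pods -l training.kubeflow.org/job-name=mnist-run

        # Print the full launcher log (--tail=-1 requests all lines).
        kubectl logs -l training.kubeflow.org/job-name=mnist-run,training.kubeflow.org/job-role=launcher --tail=-1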
    

    Note

    After setting up your Kubernetes cluster, use Prometheus Metric Exporter to collect the Gaudi device metrics.