Running a Job on the Cluster

  1. Create a job-hl.yaml file. The config file pulls a Docker image and sets up a container with resource settings such as habana.ai/gaudi, hugepages-2Mi, and memory. Adjust these parameters to fit your task and model. The job runs the hl-smi command to print device information to the terminal.

The following is an example of job-hl.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: job-hl
spec:
  template:
    metadata:
      labels:
        app: job-hl
    spec:
      containers:
      - name: job-hl
        image: vault.habana.ai/gaudi-docker/1.7.1/ubuntu20.04/habanalabs/tensorflow-installer-tf-cpu-2.10.1
        command: ["hl-smi"]
        workingDir: /home
        resources:
          limits:
            habana.ai/gaudi: 8
            hugepages-2Mi: "21000Mi"
            memory: 720Gi
            vpc.amazonaws.com/efa: 4
          requests:
            habana.ai/gaudi: 8
            hugepages-2Mi: "21000Mi"
            memory: 700Gi
            vpc.amazonaws.com/efa: 4
        securityContext:
          capabilities:
            add: ["SYS_RAWIO"]
      hostNetwork: true
      restartPolicy: Never
  backoffLimit: 0
  2. Run the job with the following command:

kubectl apply -f job-hl.yaml
  3. Check the job status by running the following command:

kubectl get pods -A
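Listing pods across all namespaces can be noisy on a shared cluster. As an alternative sketch, you can filter on the app: job-hl label defined in the manifest above (this assumes the label was not changed):

```shell
# List only the pods carrying the label set in job-hl.yaml (app: job-hl)
kubectl get pods -l app=job-hl
```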
  4. Retrieve the name of the created pod and run the following command to see the results:

kubectl logs <pod-name>
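If you prefer not to copy the pod name manually, one possible one-liner (again assuming the app: job-hl label from the manifest above) uses the -o jsonpath output format to extract it:

```shell
# Look up the name of the first pod created for this job by its label,
# then print that pod's logs in a single command
kubectl logs $(kubectl get pods -l app=job-hl -o jsonpath='{.items[0].metadata.name}')
```

This avoids a manual copy step, but note that it picks the first matching pod; with backoffLimit: 0 and restartPolicy: Never the job creates at most one pod, so the match is unambiguous here.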