Running a Job on the Cluster

  1. Create a job-hl.yaml file. The job runs the hl-smi command to display a summary table of the Gaudi devices. The following is an example of job-hl.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: job-hl
spec:
  template:
    metadata:
      labels:
        app: job-hl
    spec:
      restartPolicy: Never
      hostIPC: true
      containers:
      - name: job-hl
        image: vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
        command: ["hl-smi"]
        workingDir: /home
        securityContext:
          capabilities:
            add: ["SYS_NICE"]
        resources:
          limits:
            habana.ai/gaudi: 8
            hugepages-2Mi: "42000Mi"
            memory: 690Gi
            vpc.amazonaws.com/efa: 4
          requests:
            habana.ai/gaudi: 8
            hugepages-2Mi: "42000Mi"
            memory: 690Gi
            vpc.amazonaws.com/efa: 4
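
     Before submitting, you can optionally validate the manifest without creating any resources. This is a minimal sketch that relies on kubectl's standard client-side dry run (available on recent kubectl versions); it is not specific to Gaudi:

    kubectl apply -f job-hl.yaml --dry-run=client
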
  2. Run the job:

    kubectl apply -f job-hl.yaml
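
     If you want the command line to block until the job finishes, kubectl wait can be used as a convenience. This is a sketch; the 300-second timeout is an arbitrary example value:

    kubectl wait --for=condition=complete job/job-hl --timeout=300s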
    
  3. Check the job status:

    kubectl get pods -A
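
     Because the pod template labels the pod with app: job-hl, you can also narrow the listing to this job instead of listing all namespaces:

    kubectl get pods -l app=job-hl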
    
  4. Retrieve the name of the created pod and run the following command to see the results:

    kubectl logs <pod-name>
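
     Alternatively, you can skip copying the pod name and fetch the logs through the same label selector. This is a convenience sketch using standard kubectl flags:

    kubectl logs -l app=job-hl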