Running a Job on the Cluster

  1. Create a job-hl.yaml file. The job runs the hl-smi tool to display a summary table of the Gaudi devices. The following is an example job-hl.yaml file:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: job-hl
    spec:
      template:
        metadata:
          labels:
            app: job-hl
        spec:
          restartPolicy: Never
          hostIPC: true
          containers:
          - name: job-hl
            image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
            command: ["hl-smi"]
            workingDir: /home
            securityContext:
              capabilities:
                add: ["SYS_NICE"]
            resources:
              limits:
                habana.ai/gaudi: 8
                hugepages-2Mi: "42000Mi"
                memory: 690Gi
                vpc.amazonaws.com/efa: 4
              requests:
                habana.ai/gaudi: 8
                hugepages-2Mi: "42000Mi"
                memory: 690Gi
                vpc.amazonaws.com/efa: 4
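
     The resource values above (8 Gaudi devices, 4 EFA interfaces, 690Gi of memory, and
     42000Mi of 2Mi hugepages) assume a node that exposes that full capacity; adjust them to
     match your instance type. As a quick check, you can inspect what a worker node actually
     advertises before applying the job (the node name below is a placeholder):

    kubectl describe node <node-name> | grep -E 'habana.ai/gaudi|vpc.amazonaws.com/efa|hugepages-2Mi'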
    
  2. Run the job:

    kubectl apply -f job-hl.yaml
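
     Optionally, you can confirm that the Job object was created and wait for it to complete
     before checking the pods; the timeout below is only an example value:

    kubectl get jobs
    kubectl wait --for=condition=complete job/job-hl --timeout=120s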
    
  3. Check the job status:

    kubectl get pods -A
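
     The -A flag lists pods in all namespaces. Since the pod template in the example manifest
     carries the app: job-hl label, a narrower query such as the following should also work:

    kubectl get pods -l app=job-hl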
    
  4. Retrieve the name of the pod created by the job and run the following command to view the results:

    kubectl logs <pod-name>
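
     The output should be the hl-smi summary table printed by the container. As an alternative,
     kubectl can usually resolve the pod for you when given the Job name directly:

    kubectl logs job/job-hl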