Running a Job on the Cluster

  1. Create a job-hl.yaml file. The config file pulls a Docker image and sets up a container with resource settings such as habana.ai/gaudi, hugepages-2Mi, and memory. Adjust these parameters to fit your task and model. The job runs the hl-smi command to print device information to the terminal.

The following is an example of job-hl.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: job-hl
spec:
  template:
    metadata:
      labels:
        app: job-hl
    spec:
      containers:
      - name: job-hl
        image: vault.habana.ai/gaudi-docker/1.7.1/ubuntu20.04/habanalabs/tensorflow-installer-tf-cpu-2.10.1
        command: ["hl-smi"]
        workingDir: /home
        resources:
          limits:
            habana.ai/gaudi: 8
            hugepages-2Mi: "21000Mi"
            memory: 720Gi
            vpc.amazonaws.com/efa: 4
          requests:
            habana.ai/gaudi: 8
            hugepages-2Mi: "21000Mi"
            memory: 700Gi
            vpc.amazonaws.com/efa: 4
        securityContext:
          capabilities:
            add: ["SYS_RAWIO"]
      hostNetwork: true
      restartPolicy: Never
  backoffLimit: 0
  2. Run the job with the following command:

kubectl apply -f job-hl.yaml
  3. Check the job status by running the following command:

kubectl get pods -A
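Listing pods across all namespaces can be noisy on a shared cluster. As an alternative sketch, you can filter on the app: job-hl label defined in the manifest above (this assumes the label was not changed):

```shell
# List only the pods carrying the label set in job-hl.yaml (app: job-hl)
kubectl get pods -l app=job-hl
```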
  4. Retrieve the name of the created pod and run the following command to see the results:

kubectl logs <pod-name>
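If you prefer not to copy the pod name manually, one possible one-liner (again assuming the app: job-hl label from the manifest above) uses the -o jsonpath output format to extract it:

```shell
# Look up the name of the first pod created for this job by its label,
# then print that pod's logs in a single command
kubectl logs $(kubectl get pods -l app=job-hl -o jsonpath='{.items[0].metadata.name}')
```

This avoids a manual copy step, but note that it picks the first matching pod; with backoffLimit: 0 and restartPolicy: Never the job creates at most one pod, so the match is unambiguous here.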