Running a Job on the Cluster
Create a job-hl.yaml file. The job runs the hl-smi tool to display a summary table of the Gaudi devices. The following is an example of a job-hl.yaml file:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: job-hl
    spec:
      template:
        metadata:
          labels:
            app: job-hl
        spec:
          restartPolicy: Never
          hostIPC: true
          containers:
            - name: job-hl
              image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
              command: ["hl-smi"]
              workingDir: /home
              securityContext:
                capabilities:
                  add: ["SYS_NICE"]
              resources:
                limits:
                  habana.ai/gaudi: 8
                  hugepages-2Mi: "42000Mi"
                  memory: 690Gi
                  vpc.amazonaws.com/efa: 4
                requests:
                  habana.ai/gaudi: 8
                  hugepages-2Mi: "42000Mi"
                  memory: 690Gi
                  vpc.amazonaws.com/efa: 4
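Because indentation mistakes in a manifest like this are easy to make, it can help to check the file before submitting it. The following is a minimal sketch, not part of the original procedure: the function name validate_job_manifest is hypothetical, and it assumes kubectl is already configured for the target cluster. It uses the client-side dry-run mode of kubectl apply, which prints the object that would be sent without creating it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original guide): check a manifest
# without creating any resources on the cluster.
validate_job_manifest() {
  # Default to the file name used in this guide if no argument is given.
  local manifest="${1:-job-hl.yaml}"
  # --dry-run=client prints the object that would be sent, without sending it.
  kubectl apply --dry-run=client -f "$manifest"
}
```

Calling validate_job_manifest job-hl.yaml should report the job without creating it; a parse error indicates a problem in the file.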
Run the job:
kubectl apply -f job-hl.yaml
Check the job status:
kubectl get pods -A
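Instead of polling the pod list by hand, the wait can be scripted. This is a sketch under stated assumptions: the function name wait_for_job_hl and the 300s default timeout are illustrative choices, and the job is assumed to be named job-hl in the current namespace, as in the manifest above. It relies on kubectl wait, which blocks until the Job reports the complete condition or the timeout expires.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original guide): block until the
# job-hl Job completes, or fail when the timeout is reached.
wait_for_job_hl() {
  # The default timeout is an arbitrary illustrative value.
  local timeout="${1:-300s}"
  kubectl wait --for=condition=complete job/job-hl --timeout="$timeout"
}
```

For example, wait_for_job_hl 600s waits up to ten minutes. Large container images may take a while to pull, so a generous timeout is reasonable on the first run.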
Retrieve the name of the created pod and run the following command to see the results:
kubectl logs <pod-name>
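The pod lookup and log retrieval can also be combined. The sketch below assumes the pod carries the app=job-hl label from the example manifest and runs in the current namespace; the function names job_hl_pod_name and job_hl_logs are hypothetical. It selects the pod by label, extracts the name with a jsonpath expression, and passes it to kubectl logs.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not from the original guide).

# Print the name of the first pod labeled app=job-hl.
job_hl_pod_name() {
  # -l filters by the label set in the pod template of job-hl.yaml;
  # the jsonpath expression extracts the first matching pod's name.
  kubectl get pods -l app=job-hl -o jsonpath='{.items[0].metadata.name}'
}

# Print the logs of that pod, i.e. the hl-smi summary table.
job_hl_logs() {
  local pod
  pod="$(job_hl_pod_name)" || return 1
  if [ -z "$pod" ]; then
    echo "no pod found with label app=job-hl" >&2
    return 1
  fi
  kubectl logs "$pod"
}
```

Running job_hl_logs prints the same output as the manual kubectl logs command, without copying the pod name by hand.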