Running a Job on the Cluster
Create a job-hl.yaml file. The job runs the hl-smi command to display a summary table of the Gaudi devices. The following is an example of job-hl.yaml:
apiVersion: batch/v1
kind: Job
metadata:
  name: job-hl
spec:
  template:
    metadata:
      labels:
        app: job-hl
    spec:
      restartPolicy: Never
      hostIPC: true
      containers:
        - name: job-hl
          image: vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
          command: ["hl-smi"]
          workingDir: /home
          securityContext:
            capabilities:
              add: ["SYS_NICE"]
          resources:
            limits:
              habana.ai/gaudi: 8
              hugepages-2Mi: "42000Mi"
              memory: 690Gi
              vpc.amazonaws.com/efa: 4
            requests:
              habana.ai/gaudi: 8
              hugepages-2Mi: "42000Mi"
              memory: 690Gi
              vpc.amazonaws.com/efa: 4
Run the job:
kubectl apply -f job-hl.yaml
Check the job status:
kubectl get pods -A
Retrieve the name of the created pod and run the following command to see the results:
kubectl logs <pod-name>
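
The lookup and log steps above can also be scripted. The following is a minimal sketch, not part of the original procedure. It assumes the Job runs in the current namespace and creates a single pod. The helper name job_logs and the 300s default timeout are hypothetical. It relies on the job-name label that Kubernetes adds to pods created by a Job.

```shell
#!/usr/bin/env bash
# Sketch: wait for a Job to complete, then print the logs of its pod.
# Assumption: the Job is in the current namespace and creates one pod.

# Hypothetical helper; the name and the 300s default timeout are illustrative.
job_logs() {
  local job_name="$1"
  local timeout="${2:-300s}"

  # Block until the Job reports the Complete condition or the timeout expires.
  kubectl wait --for=condition=complete "job/${job_name}" --timeout="${timeout}" || return 1

  # Pods created by a Job carry the label job-name=<job name>.
  local pod
  pod="$(kubectl get pods -l "job-name=${job_name}" \
    -o jsonpath='{.items[0].metadata.name}')" || return 1

  # Print the container output (for job-hl, the hl-smi summary table).
  kubectl logs "${pod}"
}
```

After sourcing the script, running job_logs job-hl waits for the job defined in job-hl.yaml and prints its output. If the job fails instead of completing, the wait step times out and the function returns a non-zero status.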