Running a Job on the Cluster
Create a `job-hl.yaml` file. The config file pulls a Docker image and sets up a container according to resource parameters such as `habana.ai/gaudi`, `hugepages-2Mi`, `memory`, etc. Adapt these parameters to your task and model. The job runs the `hl-smi` command to print device information in the terminal.

The following is an example of `job-hl.yaml`:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-hl
spec:
  template:
    metadata:
      labels:
        app: job-hl
    spec:
      hostNetwork: true
      restartPolicy: Never
      containers:
        - name: job-hl
          image: vault.habana.ai/gaudi-docker/1.9.0/ubuntu20.04/habanalabs/tensorflow-installer-tf-cpu-2.11.0
          command: ["hl-smi"]
          workingDir: /home
          securityContext:
            capabilities:
              add: ["SYS_NICE"]
          resources:
            limits:
              habana.ai/gaudi: 8
              hugepages-2Mi: "42000Mi"
              memory: 690Gi
              vpc.amazonaws.com/efa: 4
            requests:
              habana.ai/gaudi: 8
              hugepages-2Mi: "42000Mi"
              memory: 690Gi
              vpc.amazonaws.com/efa: 4
```
The PyTorch equivalent uses the PyTorch installer image and additionally sets `hostIPC: true`:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-hl
spec:
  template:
    metadata:
      labels:
        app: job-hl
    spec:
      restartPolicy: Never
      hostNetwork: true
      hostIPC: true
      containers:
        - name: job-hl
          image: vault.habana.ai/gaudi-docker/1.9.0/ubuntu20.04/habanalabs/pytorch-installer-1.13.1
          command: ["hl-smi"]
          workingDir: /home
          securityContext:
            capabilities:
              add: ["SYS_NICE"]
          resources:
            limits:
              habana.ai/gaudi: 8
              hugepages-2Mi: "42000Mi"
              memory: 690Gi
              vpc.amazonaws.com/efa: 4
            requests:
              habana.ai/gaudi: 8
              hugepages-2Mi: "42000Mi"
              memory: 690Gi
              vpc.amazonaws.com/efa: 4
```
Run the job by running the following command:
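A standard `kubectl` invocation, assuming the manifest above is saved as `job-hl.yaml` in the current directory, would be:

```shell
# Submit the Job defined in the manifest to the cluster
kubectl apply -f job-hl.yaml
```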
Check the job status by running the following command:
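The job name `job-hl` comes from `metadata.name` in the manifest; standard `kubectl` status commands for it would be:

```shell
# List jobs and their completion status
kubectl get jobs

# Show detailed status and recent events for this job
kubectl describe job job-hl
```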
Retrieve the name of the created pod and run the following command to see the results:
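A typical sequence, using the `app: job-hl` label from the manifest to find the pod (`<pod-name>` below is a placeholder for the name actually printed):

```shell
# Find the pod created by the job via its label
kubectl get pods --selector=app=job-hl

# Print the pod's output (the hl-smi device report)
kubectl logs <pod-name>
```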