Running a Job on the Cluster

  1. Create a job-hl.yaml file. The configuration pulls a Docker image and sets up a container with the requested resources, such as habana.ai/gaudi, hugepages-2Mi, and memory. Adapt these values to your task and model. The job runs the hl-smi command, which prints device information.

The following is an example of job-hl.yaml using a TensorFlow image:

apiVersion: batch/v1
kind: Job
metadata:
  name: job-hl
spec:
  template:
    metadata:
      labels:
        app: job-hl
    spec:
      hostNetwork: true
      restartPolicy: Never
      containers:
        - name: job-hl
          image: vault.habana.ai/gaudi-docker/1.9.0/ubuntu20.04/habanalabs/tensorflow-installer-tf-cpu-2.11.0
          command: ["hl-smi"]
          workingDir: /home
          securityContext:
            capabilities:
              add: ["SYS_NICE"]
          resources:
            limits:
              habana.ai/gaudi: 8
              hugepages-2Mi: "42000Mi"
              memory: 690Gi
              vpc.amazonaws.com/efa: 4
            requests:
              habana.ai/gaudi: 8
              hugepages-2Mi: "42000Mi"
              memory: 690Gi
              vpc.amazonaws.com/efa: 4

The following is the same job using a PyTorch image, with hostIPC enabled:

apiVersion: batch/v1
kind: Job
metadata:
  name: job-hl
spec:
  template:
    metadata:
      labels:
        app: job-hl
    spec:
      restartPolicy: Never
      hostNetwork: true
      hostIPC: true
      containers:
        - name: job-hl
          image: vault.habana.ai/gaudi-docker/1.9.0/ubuntu20.04/habanalabs/pytorch-installer-1.13.1
          command: ["hl-smi"]
          workingDir: /home
          securityContext:
            capabilities:
              add: ["SYS_NICE"]
          resources:
            limits:
              habana.ai/gaudi: 8
              hugepages-2Mi: "42000Mi"
              memory: 690Gi
              vpc.amazonaws.com/efa: 4
            requests:
              habana.ai/gaudi: 8
              hugepages-2Mi: "42000Mi"
              memory: 690Gi
              vpc.amazonaws.com/efa: 4
  2. Run the job:

    kubectl apply -f job-hl.yaml

  3. Check the job status:

    kubectl get pods -A

  4. Retrieve the name of the created pod, then view the output of hl-smi in its logs:

    kubectl logs <pod-name>
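
The steps above can also be scripted. The following is a minimal sketch, not part of the original procedure. It assumes the job is named job-hl as in the examples, that it runs in the current namespace, and that kubectl is already configured for the cluster. The function names, the JOB_NAME, MANIFEST, and TIMEOUT variables, and the 600s timeout are hypothetical choices made for illustration. The first function lists what each node can allocate, which may help when adapting the limits and requests. The second submits the job, waits for it to complete, and prints the pod logs.

```shell
#!/usr/bin/env bash
# Sketch only: names and the timeout below are illustrative assumptions.
JOB_NAME="${JOB_NAME:-job-hl}"
MANIFEST="${MANIFEST:-job-hl.yaml}"
TIMEOUT="${TIMEOUT:-600s}"

# List the allocatable Gaudi devices, 2Mi hugepages, and memory per node.
# Dots inside the resource name are escaped so they are not read as path separators.
show_node_resources() {
  kubectl get nodes -o custom-columns='NODE:.metadata.name,GAUDI:.status.allocatable.habana\.ai/gaudi,HUGEPAGES_2MI:.status.allocatable.hugepages-2Mi,MEMORY:.status.allocatable.memory'
}

# Submit the job, block until it completes, then print the logs of its pod.
run_job() {
  kubectl apply -f "$MANIFEST" || return 1
  kubectl wait --for=condition=complete "job/$JOB_NAME" --timeout="$TIMEOUT" || return 1

  # Pods created by a Job carry the label job-name=<job name>.
  local pod
  pod="$(kubectl get pods -l "job-name=$JOB_NAME" -o jsonpath='{.items[0].metadata.name}')" || return 1
  kubectl logs "$pod"
}
```

To use the sketch, source the file and call show_node_resources or run_job. If the job fails rather than completes, kubectl wait returns only after the timeout expires, so checking the pod status with kubectl get pods, as in step 3, remains useful.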