Running Multiple Workloads on a Single Node K8s Cluster

Running Multiple Workloads on a Single Docker Container with K8s

This scenario is similar to the bare metal case outlined in Multiple Workloads on a Single Docker, and uses the same multi_tenants_resnet.sh example to run multiple tenants in a single Docker container. To run the above example on K8s, follow the steps below:
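Under the hood, a multi-tenant script of this kind typically pins each tenant to a disjoint subset of the node's 8 Gaudi devices. The following is a hypothetical sketch of that partitioning logic, not the contents of multi_tenants_resnet.sh itself; the `split_modules` helper and the use of the `HABANA_VISIBLE_MODULES` environment variable to scope devices per tenant are assumptions for illustration:

```shell
#!/bin/bash
# Hypothetical sketch: split the node's Gaudi devices between two tenants.
# split_modules divides a space-separated device list into two
# comma-separated halves, one per tenant.
split_modules() {
  set -- $1
  local half=$(( $# / 2 )) first="" second="" i=1
  for m in "$@"; do
    if (( i <= half )); then first="$first,$m"; else second="$second,$m"; fi
    i=$(( i + 1 ))
  done
  printf '%s %s\n' "${first#,}" "${second#,}"
}

read -r TENANT1 TENANT2 <<< "$(split_modules "0 1 2 3 4 5 6 7")"
echo "tenant 1 devices: $TENANT1"   # 0,1,2,3
echo "tenant 2 devices: $TENANT2"   # 4,5,6,7

# Each tenant's training job would then be launched with its own device
# subset, e.g. (hypothetical training command):
#   HABANA_VISIBLE_MODULES="$TENANT1" <training command> &> job1.log &
#   HABANA_VISIBLE_MODULES="$TENANT2" <training command> &> job2.log &
#   wait
```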

  1. Create the following Pod YAML file, multi_ten_test.yaml:

 apiVersion: v1
 kind: Pod
 metadata:
   name: multi-tenant
 spec:
   volumes:
   - name: datasets
     hostPath: {path: <path-to-imagenet-dataset>/tf_records/, type: Directory}
   - name: models
     hostPath: {path: <path-to-model-code>/Model-References, type: Directory}
   containers:
   - name: multi-tenant-runner
     image: vault.habana.ai/gaudi-docker/1.11.0/ubuntu20.04/habanalabs/tensorflow-installer-tf-cpu-2.12.1
     volumeMounts:
     - mountPath: /root/tf_records
       name: datasets
     - mountPath: /root/Model-References
       name: models
     env:
     - name: PYTHON
       value: "python3"
     - name: PYTHONPATH
       value: "/root/Model-References:/usr/lib/habanalabs"
     - name: TRAIN_EPOCHS
       value: "3"
     workingDir: /root
     command: ["bash", "-c"]
     args:
     - >-
       /root/Model-References/TensorFlow/examples/multitask/multi_tenants_resnet.sh;
       sleep infinity
     securityContext:
       capabilities:
         add:
         - SYS_RAWIO
         - SYS_PTRACE
     resources:
       limits:
         habana.ai/gaudi: 8
         hugepages-2Mi: "30000Mi"
         cpu: "96"
  2. Run the following Kubernetes command to start the Pod job:

kubectl create -f multi_ten_test.yaml
  3. Run the following Kubernetes commands to check the job traces:

kubectl exec -ti multi-tenant -- bash -c "cat job1.err"
kubectl exec -ti multi-tenant -- bash -c "cat job2.err"
  4. Make sure the job finishes with the expected performance numbers.
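The steps above can be scripted. Since the container ends with `sleep infinity`, the Pod stays in the Running phase even after both training jobs complete, so the trace files are the place to check results. The `wait_for_phase` helper below is a minimal sketch (the helper name is ours; the `kubectl get pod` jsonpath query is standard kubectl):

```shell
#!/bin/bash
# Minimal sketch: poll the Pod phase before exec'ing in to read job traces.

# wait_for_phase POD PHASE [TRIES]: poll `kubectl get pod` until the Pod
# reports the wanted phase, checking once per 10-second interval.
wait_for_phase() {
  local pod="$1" want="$2" tries="${3:-30}"
  local phase
  for _ in $(seq "$tries"); do
    phase=$(kubectl get pod "$pod" -o jsonpath='{.status.phase}')
    [ "$phase" = "$want" ] && return 0
    sleep 10
  done
  return 1
}

# Usage:
#   wait_for_phase multi-tenant Running &&
#     kubectl exec -ti multi-tenant -- bash -c "tail job1.err job2.err"
```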

Running Multiple Workloads with Two Pods, One Workload per Docker with K8s

In this scenario, you create two Pods, each running a single ResNet-50 workload in its own Docker container. To run, follow the steps below:

  1. Create the following Pod YAML files, multi_ten_test_1.yaml and multi_ten_test_2.yaml:

 apiVersion: v1
 kind: Pod
 metadata:
   name: multi-tenant-tf-1/2
   namespace: default
 spec:
   restartPolicy: Never
   volumes:
   - name: datasets
     hostPath: {path: <path-to-imagenet-dataset>/tf_records/, type: Directory}
   - name: models
     hostPath: {path: <path-to-model-code>/Model-References, type: Directory}
   - name: tmp
     hostPath: {path: /tmp, type: Directory}
   containers:
   - name: multi-tenant-runner
     image: vault.habana.ai/gaudi-docker/1.11.0/ubuntu20.04/habanalabs/tensorflow-installer-tf-cpu-2.12.1
     volumeMounts:
     - mountPath: /root/tf_records
       name: datasets
     - mountPath: /root/Model-References
       name: models
     - mountPath: /tmp
       name: tmp
     env:
     - name: PYTHON
       value: "python3"
     - name: PYTHONPATH
       value: "/root/Model-References:/usr/lib/habanalabs"
     - name: TRAIN_EPOCHS
       value: "3"
     workingDir: /root
     command:
     [
       "mpirun",
       "--allow-run-as-root",
       "-map-by",
       "socket:PE=6",
       "-np",
       "4",
       "--bind-to",
       "core",
       "--report-bindings",
       "--tag-output",
       "bash",
       "-c",
     ]
     args:
     - >-
       python3 /root/Model-References/TensorFlow/computer_vision/Resnets/resnet_keras/resnet_ctl_imagenet_main.py
       --enable_tensorboard
       --dtype bf16
       --data_loader_image_type bf16
       --use_horovod
       --data_dir "/root/tf_records"
       --steps_per_loop 100
       --batch_size 256
       --train_epochs 3
       --epochs_between_evals 5
       --optimizer LARS
       --base_learning_rate 9.5
       --warmup_epochs 3
       --lr_schedule polynomial
       --label_smoothing 0.1
       --weight_decay 0.0001
       --single_l2_loss_op
       --model_dir "/tmp/resnet_keras"
     securityContext:
       capabilities:
         add:
         - SYS_RAWIO
         - SYS_PTRACE
     resources:
       limits:
         habana.ai/gaudi: 4
         hugepages-2Mi: "15000Mi"
         cpu: "45"
       requests:
         habana.ai/gaudi: 4
         hugepages-2Mi: "15000Mi"
         cpu: "45"
  2. Run the following Kubernetes commands to start the two Pod jobs:

kubectl create -f multi_ten_test_1.yaml
kubectl create -f multi_ten_test_2.yaml
  3. Run the following Kubernetes commands to check the status:

kubectl logs -f multi-tenant-tf-1
kubectl logs -f multi-tenant-tf-2
  4. Make sure both jobs finish with the expected performance numbers.
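Because these Pods use `restartPolicy: Never` and run mpirun directly (no `sleep infinity`), each Pod reaches the Succeeded phase once its workload exits cleanly, which gives a simple pass/fail check. The `pods_succeeded` helper below is a minimal sketch (the helper name is ours; the jsonpath query is standard kubectl):

```shell
#!/bin/bash
# Minimal sketch: fail fast if either Pod did not end in the Succeeded
# phase. Pod names match the two YAML files above.

# pods_succeeded POD...: check each Pod's final phase via kubectl.
pods_succeeded() {
  local pod phase
  for pod in "$@"; do
    phase=$(kubectl get pod "$pod" -o jsonpath='{.status.phase}')
    if [ "$phase" != "Succeeded" ]; then
      echo "Pod $pod ended in phase: $phase" >&2
      return 1
    fi
  done
  return 0
}

# Usage (after both `kubectl logs -f` streams end):
#   pods_succeeded multi-tenant-tf-1 multi-tenant-tf-2 && echo "both jobs passed"
```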