Running Multiple Workloads on a Single Node K8s Cluster¶
Running Multiple Workloads on a Single Docker Container with K8s¶
This scenario is similar to the bare metal case outlined in Multiple Workloads on a Single Docker, and likewise uses the multi_tenants_resnet.sh example to run multiple tenants in a single Docker container. To run this example on K8s, follow the steps below:
Provide the following Pod job YAML file, multi_ten_test.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: multi-tenant
spec:
  volumes:
    - name: datasets
      hostPath: {path: <path-to-imagenet-dataset>/tf_records/, type: Directory}
    - name: models
      hostPath: {path: <path-to-model-code>/Model-References, type: Directory}
  containers:
    - name: multi-tenant-runner
      image: vault.habana.ai/gaudi-docker/1.11.0/ubuntu20.04/habanalabs/tensorflow-installer-tf-cpu-2.12.1
      volumeMounts:
        - mountPath: /root/tf_records
          name: datasets
        - mountPath: /root/Model-References
          name: models
      env:
        - name: PYTHON
          value: "python3"
        - name: PYTHONPATH
          value: "/root/Model-References:/usr/lib/habanalabs"
        - name: TRAIN_EPOCHS
          value: "3"
      workingDir: /root
      command: ["bash", "-c"]
      args:
        - >-
          /root/Model-References/TensorFlow/examples/multitask/multi_tenants_resnet.sh;
          sleep infinity
      securityContext:
        capabilities:
          add:
            - SYS_RAWIO
            - SYS_PTRACE
      resources:
        limits:
          habana.ai/gaudi: 8
          hugepages-2Mi: "30000Mi"
          cpu: "96"
Run the following Kubernetes command to start the Pod job:
kubectl create -f multi_ten_test.yaml
Run the following Kubernetes commands to check the job traces:
kubectl exec -ti multi-tenant -- bash -c "cat job1.err"
kubectl exec -ti multi-tenant -- bash -c "cat job2.err"
Make sure both jobs finish with the proper performance numbers.
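As a side note on the resource limits above: hugepages-2Mi is expressed in Mi, so a limit of "30000Mi" of 2 MiB hugepages corresponds to 15000 pages, and the node must have at least that many 2 MiB hugepages configured (vm.nr_hugepages). A minimal sketch of the arithmetic:

```shell
# Sketch: number of 2 MiB hugepages implied by the "30000Mi" limit above.
# The node's vm.nr_hugepages must be at least this value.
limit_mi=30000
page_size_mi=2
echo $((limit_mi / page_size_mi))   # 15000
```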
Running Multiple Workloads with Two Pods, One Workload per Docker with K8s¶
In this scenario, you need to create two Pods, each running a single ResNet-50 workload in its own Docker container. To run, follow the steps below:
Provide the following Pod job YAML files, multi_ten_test_1.yaml and multi_ten_test_2.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: multi-tenant-tf-1/2
  namespace: default
spec:
  restartPolicy: Never
  volumes:
    - name: datasets
      hostPath: {path: <path-to-imagenet-dataset>/tf_records/, type: Directory}
    - name: models
      hostPath: {path: <path-to-model-code>/Model-References, type: Directory}
    - name: tmp
      hostPath: {path: /tmp, type: Directory}
  containers:
    - name: multi-tenant-runner
      image: vault.habana.ai/gaudi-docker/1.11.0/ubuntu20.04/habanalabs/tensorflow-installer-tf-cpu-2.12.1
      volumeMounts:
        - mountPath: /root/tf_records
          name: datasets
        - mountPath: /root/Model-References
          name: models
        - mountPath: /tmp
          name: tmp
      env:
        - name: PYTHON
          value: "python3"
        - name: PYTHONPATH
          value: "/root/Model-References:/usr/lib/habanalabs"
        - name: TRAIN_EPOCHS
          value: "3"
      workingDir: /root
      command:
        [
          "mpirun",
          "--allow-run-as-root",
          "-map-by",
          "socket:PE=6",
          "-np",
          "4",
          "--bind-to",
          "core",
          "--report-bindings",
          "--tag-output",
          "bash",
          "-c",
        ]
      args:
        - >-
          python3 /root/Model-References/TensorFlow/computer_vision/Resnets/resnet_keras/resnet_ctl_imagenet_main.py
          --enable_tensorboard
          --dtype bf16
          --data_loader_image_type bf16
          --use_horovod
          --data_dir "/root/tf_records"
          --steps_per_loop 100
          --batch_size 256
          --train_epochs 3
          --epochs_between_evals 5
          --optimizer LARS
          --base_learning_rate 9.5
          --warmup_epochs 3
          --lr_schedule polynomial
          --label_smoothing 0.1
          --weight_decay 0.0001
          --single_l2_loss_op
          --model_dir "/tmp/resnet_keras"
      securityContext:
        capabilities:
          add:
            - SYS_RAWIO
            - SYS_PTRACE
      resources:
        limits:
          habana.ai/gaudi: 4
          hugepages-2Mi: "15000Mi"
          cpu: "45"
        requests:
          habana.ai/gaudi: 4
          hugepages-2Mi: "15000Mi"
          cpu: "45"
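The mpirun line above launches 4 ranks per Pod and, via -map-by socket:PE=6 with --bind-to core, pins each rank to 6 physical cores, so the binding consumes 24 cores per Pod, comfortably inside the cpu: "45" limit. A minimal sketch of that budget, assuming two such Pods on the 96-CPU node from the first scenario:

```shell
# Sketch: core budget implied by the mpirun binding above.
ranks=4           # -np 4
cores_per_rank=6  # socket:PE=6
pods=2
echo $((ranks * cores_per_rank))          # cores pinned per Pod: 24
echo $((pods * 45))                       # combined CPU limit of both Pods: 90
```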
Run the following Kubernetes commands to start the two Pod jobs:
kubectl create -f multi_ten_test_1.yaml
kubectl create -f multi_ten_test_2.yaml
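Since the two manifests differ only in the Pod name, they can be generated from a single template rather than maintained by hand. A minimal sketch, assuming a hypothetical template file with the placeholder name multi-tenant-tf-N (the spec body is abbreviated here; use the full spec shown above):

```shell
# Sketch: generate the two Pod manifests from one template.
# "multi-tenant-tf-N" is a hypothetical placeholder, not part of the docs.
cat > multi_ten_test_template.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: multi-tenant-tf-N
  namespace: default
# ... rest of the Pod spec shown above ...
EOF

for i in 1 2; do
  sed "s/multi-tenant-tf-N/multi-tenant-tf-$i/" multi_ten_test_template.yaml \
    > "multi_ten_test_$i.yaml"
done
```

Then create both Pods as in the kubectl commands above.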
Run the following Kubernetes commands to check the status:
kubectl logs -f multi-tenant-tf-1
kubectl logs -f multi-tenant-tf-2
Make sure both jobs finish with the proper performance numbers.