Enabling Plugins

Before you start training, enable the Habana Gaudi device plugin, the EFA device plugin, and the MPI Operator, as described in the following sections.

Enable Habana Gaudi Device

  1. To enable Gaudi devices, run the Habana device plugin on every node equipped with a Habana Gaudi device by deploying the following DaemonSet with the kubectl create command:

kubectl create -f https://vault.habana.ai/artifactory/docker-k8s-device-plugin/habana-k8s-device-plugin.yaml

  2. Check the device plugin deployment status by running the following command. The device plugin pods in the habana-system namespace should report a Running status:

kubectl get pods -n habana-system
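The status check above can be taken one step further. The following is a minimal sketch, assuming bash and a kubectl context that points at the cluster, that waits for the plugin pods to become Ready and then shows how many Gaudi devices each node advertises. The resource name habana.ai/gaudi is assumed here (confirm it against your device plugin version), and the helper names sum_gaudi_allocatable and verify_habana_plugin are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: the resource name habana.ai/gaudi and both helper names are
# assumptions for illustration, not part of the Habana tooling.

# Reads "NODE COUNT" lines on stdin and prints the sum of the numeric counts.
# Nodes without the resource show "<none>" and are skipped.
sum_gaudi_allocatable() {
  awk '$2 ~ /^[0-9]+$/ { total += $2 } END { print total + 0 }'
}

verify_habana_plugin() {
  # Block until every pod in habana-system reports Ready, or give up after 120s.
  kubectl wait --for=condition=Ready pod --all -n habana-system --timeout=120s || return 1
  # Print the per-node table to stderr and the cluster-wide total to stdout.
  kubectl get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,GAUDI:.status.allocatable.habana\.ai/gaudi' \
    | tee /dev/stderr | sum_gaudi_allocatable
}
```

Run verify_habana_plugin after deploying the DaemonSet. A total of 0 usually means the plugin has not yet registered devices on any node.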

Enable EFA Device Plugin

  1. To enable EFA, run the EFA device plugin by deploying the following DaemonSet with the kubectl apply command:

kubectl apply -f https://raw.githubusercontent.com/aws-samples/aws-efa-eks/main/manifest/efa-k8s-device-plugin.yml

  2. Check the device plugin deployment status by running the following command, which lists the pods in all namespaces. The EFA device plugin pods should report a Running status:

kubectl get pods -A
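Because kubectl get pods -A lists every pod in the cluster, it can help to narrow the output to the EFA plugin. The following is a minimal sketch, assuming bash, that the EFA plugin pod names contain "efa", and that the plugin advertises the vpc.amazonaws.com/efa resource (check the manifest you deployed). The helper names check_efa_pods and verify_efa_plugin are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: matching "efa" in the pod name and the resource name
# vpc.amazonaws.com/efa are assumptions; both helper names are hypothetical.

# Reads `kubectl get pods -A --no-headers` lines on stdin
# (NAMESPACE NAME READY STATUS ...), prints EFA pods that are not Running,
# and exits non-zero if no EFA pod was found or any EFA pod is not Running.
check_efa_pods() {
  awk '$2 ~ /efa/ { found = 1; if ($4 != "Running") { print; bad = 1 } }
       END { exit !(found && !bad) }'
}

verify_efa_plugin() {
  kubectl get pods -A --no-headers | check_efa_pods || return 1
  # Show how many EFA interfaces each node advertises ("<none>" if none).
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,EFA:.status.allocatable.vpc\.amazonaws\.com/efa'
}
```

Matching on the pod name keeps the sketch independent of the namespace the manifest uses.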

Enable Habana MPI Operator for MPIJob

To enable the MPIJob resource type for multi-node training, install the MPI Operator. For further information, refer to the Kubeflow mpi-operator installation guide.
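As a sketch only, assuming the v2beta1 manifest layout of the kubeflow/mpi-operator repository, the installation can be scripted as below. The release tag v0.6.0, the mpi-operator namespace, and the helper names mpi_operator_manifest_url and install_mpi_operator are assumptions; use the release and command given in the Kubeflow mpi-operator installation guide.

```shell
#!/usr/bin/env bash
# Sketch only: the default release tag and the helper names are assumptions.
# Override MPI_OPERATOR_VERSION with the release the Kubeflow guide recommends.
MPI_OPERATOR_VERSION="${MPI_OPERATOR_VERSION:-v0.6.0}"

# Builds the raw GitHub URL of the v2beta1 operator manifest for the chosen release.
mpi_operator_manifest_url() {
  echo "https://raw.githubusercontent.com/kubeflow/mpi-operator/${MPI_OPERATOR_VERSION}/deploy/v2beta1/mpi-operator.yaml"
}

install_mpi_operator() {
  kubectl apply --server-side -f "$(mpi_operator_manifest_url)" || return 1
  # The MPIJob CRD must be Established before MPIJob resources can be created.
  kubectl wait --for=condition=Established crd/mpijobs.kubeflow.org --timeout=60s || return 1
  # List the operator pods (namespace assumed to be mpi-operator).
  kubectl get pods -n mpi-operator
}
```

Pinning a release tag rather than a branch keeps the installed operator version reproducible across clusters.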