Enabling Plugins

Before you start training, enable the Intel® Gaudi® AI accelerator device plugin, the EFA device plugin, and the Intel Gaudi MPI Operator, as explained in the following sections.

Enable Gaudi Device

  1. To enable Gaudi devices, run the Intel Gaudi device plugin on every node equipped with Gaudi by deploying the following DaemonSet with the kubectl create command:

kubectl create -f https://vault.habana.ai/artifactory/docker-k8s-device-plugin/habana-k8s-device-plugin.yaml

  2. Check the device plugin deployment status by running the following command:

kubectl get pods -n habana-system
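
The status check can also be scripted. As a minimal sketch, the hypothetical helper pods_not_running below reads the output of kubectl get pods --no-headers for a single namespace and prints every pod whose STATUS column is neither Running nor Completed. It assumes the default column order (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Hypothetical helper: print the names of pods that are not healthy.
# Reads `kubectl get pods -n <namespace> --no-headers` output from stdin,
# where the third column is assumed to be STATUS.
pods_not_running() {
  awk 'NF && $3 != "Running" && $3 != "Completed" { print $1 }'
}

# Usage against a live cluster (prints nothing when all pods are healthy):
#   kubectl get pods -n habana-system --no-headers | pods_not_running
```

An empty result suggests that the device plugin pods have started on all scheduled nodes.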

Enable EFA Device Plugin

  1. To enable EFA, run the EFA device plugin by deploying the following DaemonSet with the kubectl apply command:

kubectl apply -f https://raw.githubusercontent.com/aws-samples/aws-efa-eks/main/manifest/efa-k8s-device-plugin.yml

  2. Check the device plugin deployment status by running the following command:

kubectl get pods -A
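
Once the plugin pods are running, each node is expected to advertise the device as an allocatable resource. The sketch below flags nodes that report none. The helper name nodes_without_resource is hypothetical, and the resource name vpc.amazonaws.com/efa in the usage comment is an assumption about what the EFA device plugin registers; adjust it if your cluster reports a different name.

```shell
# Hypothetical helper: print the names of nodes that do not advertise a resource.
# Reads "NODE COUNT" lines (no header) from stdin; kubectl prints <none>
# in a custom column when the field is absent.
nodes_without_resource() {
  awk 'NF && ($2 == "<none>" || $2 == "0") { print $1 }'
}

# Usage against a live cluster (resource name is an assumption):
#   kubectl get nodes --no-headers \
#     -o custom-columns='NODE:.metadata.name,EFA:.status.allocatable.vpc\.amazonaws\.com/efa' \
#     | nodes_without_resource
```

The same helper can be pointed at another resource by changing the custom-columns expression.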

Enable Intel Gaudi MPI Operator for MPIJob

To enable the MPIJob type for a multi-node cluster, install the MPI Operator. For further information, refer to the Kubeflow mpi-operator installation guide.
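
After installation, one way to confirm that the MPIJob type is available is to look for its custom resource definition. The sketch below assumes the operator registers the CRD as mpijobs.kubeflow.org, the name used by the Kubeflow mpi-operator; has_mpijob_crd is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the MPIJob CRD is present.
# Reads `kubectl get crd -o name` output from stdin, one resource per line,
# in the form customresourcedefinition.apiextensions.k8s.io/<crd-name>.
has_mpijob_crd() {
  grep -q '/mpijobs\.kubeflow\.org$'
}

# Usage against a live cluster:
#   kubectl get crd -o name | has_mpijob_crd && echo "MPIJob type is available"
```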