MPI Operator for Kubernetes

Habana uses the standard MPI Operator from Kubeflow, which runs MPI allreduce-style workloads in Kubernetes on Gaudi accelerators. Combined with Habana’s hardware and software, it enables large-scale distributed training with a simple Kubernetes job distribution model.

Prerequisites

The following prerequisites are required to run the Habana MPI Operator on Habana hardware:

  • 1.10 <= Kubernetes version < 1.22

  • SynapseAI software drivers are loaded on the system.

  • The habana-container-runtime package is installed.

  • The habana-container-runtime is set up.

The installation instructions for the above components are detailed in the Installation Guide.
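The Kubernetes version bound listed above (1.10 or later, earlier than 1.22) can be checked programmatically. The following is a minimal sketch, not part of the Habana tooling: the function name kubernetes_version_supported is hypothetical, and it assumes the version string has already been obtained from the cluster in the usual "v1.21.3" form. It compares the major and minor numbers as integers, because a plain string comparison would wrongly rank "1.9" above "1.10".

```python
import re

# Supported range taken from the prerequisites: 1.10 <= version < 1.22.
MIN_VERSION = (1, 10)   # inclusive lower bound
MAX_VERSION = (1, 22)   # exclusive upper bound


def kubernetes_version_supported(version: str) -> bool:
    """Return True if a version string such as 'v1.21.3' is within range.

    Hypothetical helper for illustration only. Only the major and minor
    components are examined; the patch level and any vendor suffix are
    ignored.
    """
    match = re.match(r"v?(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized Kubernetes version: {version!r}")
    major_minor = (int(match.group(1)), int(match.group(2)))
    # Tuple comparison orders by major first, then minor, numerically.
    return MIN_VERSION <= major_minor < MAX_VERSION


if __name__ == "__main__":
    for candidate in ("v1.9.11", "v1.10.0", "v1.21.3", "v1.22.0"):
        print(candidate, kubernetes_version_supported(candidate))
```

The check covers only the first prerequisite. Whether the drivers and the container runtime are installed still has to be verified as described in the Installation Guide.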

Running Multi-Gaudi Workloads

For more details on deploying and running workloads at scale with the MPI Operator, refer to the MPI Operator documentation.
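To give a sense of what such a workload looks like, the sketch below builds an MPIJob manifest as a Python dictionary and prints it as JSON, which kubectl accepts in place of YAML. It is an illustration under stated assumptions, not a reference configuration: it assumes the kubeflow.org/v1 MPIJob schema (the apiVersion depends on the MPI Operator release installed), assumes Gaudi devices are exposed under the habana.ai/gaudi resource name, and uses a hypothetical helper name (build_mpijob), a placeholder image, and a placeholder training command. Check each of these against the MPI Operator documentation for your release before use.

```python
import json


def build_mpijob(name, image, workers, gaudis_per_worker, command):
    """Build an MPIJob manifest as a plain dict.

    Hypothetical helper for illustration. Assumes the kubeflow.org/v1
    MPIJob schema and the habana.ai/gaudi resource name.
    """
    if workers < 1 or gaudis_per_worker < 1:
        raise ValueError("workers and gaudis_per_worker must be >= 1")
    return {
        "apiVersion": "kubeflow.org/v1",  # assumption: varies by operator release
        "kind": "MPIJob",
        "metadata": {"name": name},
        "spec": {
            # One MPI slot per Gaudi device on each worker.
            "slotsPerWorker": gaudis_per_worker,
            "mpiReplicaSpecs": {
                # The launcher runs the MPI command and needs no accelerator.
                "Launcher": {
                    "replicas": 1,
                    "template": {"spec": {"containers": [{
                        "name": "launcher",
                        "image": image,
                        "command": list(command),
                    }]}},
                },
                # Each worker pod requests its Gaudi devices.
                "Worker": {
                    "replicas": workers,
                    "template": {"spec": {"containers": [{
                        "name": "worker",
                        "image": image,
                        "resources": {
                            "limits": {"habana.ai/gaudi": gaudis_per_worker},
                        },
                    }]}},
                },
            },
        },
    }


if __name__ == "__main__":
    job = build_mpijob(
        name="gaudi-training-example",           # placeholder name
        image="example.invalid/training:latest",  # placeholder image
        workers=2,
        gaudis_per_worker=8,
        # Placeholder command: 2 workers x 8 slots = 16 ranks.
        command=["mpirun", "-np", "16", "python", "train.py"],
    )
    print(json.dumps(job, indent=2))
```

Generating the manifest from a function keeps the worker count, the slots per worker, and the device limit derived from the same two inputs, so they cannot drift apart when the job is scaled.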