5. Getting Started Guide: EKS with Habana

5.1. Introduction

This document provides guidelines on how to set up a Habana Deep Learning AMI on Amazon EKS. First, you create a cluster, then add a node group to it. To enable the accelerators, you must specify the Habana Gaudi device and install the Habana device plugin on every node. To run a job, you may need to adjust the job config file so that the pod works properly.

5.2. Create a Cluster

Follow the guide Getting Started with Amazon EKS – eksctl to learn how to create an EKS cluster. To use Habana devices, we recommend that you do not create node groups in this step: add --without-nodegroup to the command in the Step 1 Managed Nodes – Linux tab. Note that the command kubectl get nodes -o wide in Step 2 will not show any nodes, since no node groups were added.
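
For example, a minimal cluster-creation command using the cluster name and region assumed throughout this guide would be:

eksctl create cluster --name my-cluster --region us-east-1 --without-nodegroup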

5.3. Determine the Zone Where DL1 Is Available for Your Account

Run the following command to determine the availability zone in which dl1.24xlarge is available for your account, modifying the region as appropriate:

aws ec2 describe-instance-type-offerings --location-type availability-zone \
  --filters Name=instance-type,Values=dl1.24xlarge --region us-east-1 --output table

In the Add Node Group step below, make sure to set one of the two availability zones to the zone (location) discovered in this step.
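
If you prefer a plain list of zone names over a table, a --query variation of the same command (optional, not part of the original flow) also works:

aws ec2 describe-instance-type-offerings --location-type availability-zone \
  --filters Name=instance-type,Values=dl1.24xlarge --region us-east-1 \
  --query "InstanceTypeOfferings[].Location" --output text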

5.4. Find the Latest AMI ID

You need to find the latest Habana EKS AMI ID to set up the node:

  1. Open the EC2 homepage and select the correct region.

  2. Click AMIs and search for “habana” and “eks”.

  3. Select the AMI ID whose Source is aws-marketplace (for example, ami-066d1f0319ef14b51).

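Alternatively, you can search for the AMI from the command line. Note that the name filter below is an assumption about the marketplace AMI naming and may need adjusting:

aws ec2 describe-images --owners aws-marketplace \
  --filters "Name=name,Values=*habana*" --region us-east-1 \
  --query "Images[].{ID:ImageId,Name:Name,Created:CreationDate}" --output table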

5.5. Add Node Group

Follow the guide Creating a managed node group to learn how to create a node group. You can create the node group with an eksctl config file:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
availabilityZones: ["us-east-1a", "us-east-1b"]
metadata:
  name: my-cluster
  region: us-east-1
managedNodeGroups:
- name: ng01
  ami: ami-066d1f0319ef14b51
  ssh:
    publicKeyName: DLAMI_Key
  instanceType: dl1.24xlarge
  desiredCapacity: 1
  overrideBootstrapCommand: |
    #!/bin/bash
    /etc/eks/bootstrap.sh my-cluster

This creates a node group named ng01 with the Habana EKS AMI in my-cluster, in the us-east-1 region.

Update these areas:

  1. ami: set this to the desired Habana EKS AMI ID (for example, ami-066d1f0319ef14b51). Refer to Find the Latest AMI ID above.

  2. publicKeyName: set this to your own EC2 key pair name.
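
Assuming the config above is saved to a file such as cluster.yaml (the filename is illustrative), you can then create the node group with:

eksctl create nodegroup --config-file=cluster.yaml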

5.6. Enable Habana Gaudi Device

In order to enable Gaudi devices, you must run the Habana device plugin on every node equipped with a Habana device. Deploy the following DaemonSet using the kubectl create command:

kubectl create -f https://vault.habana.ai/artifactory/docker-k8s-device-plugin/habana-k8s-device-plugin.yaml

Check the device plugin deployment status by running the following command:

kubectl get pods -n habana-system
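
As an optional sanity check (not part of the original flow), you can confirm that the nodes now advertise the Gaudi resource:

kubectl describe nodes | grep habana.ai/gaudi

Each DL1 node should list habana.ai/gaudi under its Capacity and Allocatable resources.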

5.7. Run Job on the Cluster

First, create a file named job-hl.yaml. The config file pulls a Docker image and sets up a container, requesting resources through habana.ai/gaudi, hugepages-2Mi, and memory; adapt these three parameters to your task and model. The job runs the command hl-smi to print device info to the terminal. The command to run the job is:

kubectl apply -f job-hl.yaml

Here is an example of the job-hl.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: job-hl
spec:
  template:
    metadata:
      labels:
        app: job-hl
    spec:
      containers:
      - name: job-hl
        image: vault.habana.ai/gaudi-docker/1.2.0/ubuntu20.04/habanalabs/tensorflow-installer-tf-cpu-2.7.0:1.2.0-585
        command: ["hl-smi"]
        workingDir: /home
        resources:
          limits:
            habana.ai/gaudi: 8
            hugepages-2Mi: "21000Mi"
            memory: 720Gi
          requests:
            habana.ai/gaudi: 8
            hugepages-2Mi: "21000Mi"
            memory: 700Gi
        securityContext:
          capabilities:
            add: ["SYS_RAWIO"]
      hostNetwork: true
      restartPolicy: Never
  backoffLimit: 0
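
Once the job has run, you can check the pod status and view the hl-smi output with standard kubectl commands:

kubectl get pods
kubectl logs job/job-hl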

5.8. Clean and Delete the Cluster

You can run the following commands to delete the node group and the cluster:

eksctl delete nodegroup --cluster=<clusterName> --name=<nodegroupName>
eksctl delete cluster --name my-cluster --region us-east-1
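
As an optional final check, you can confirm that no clusters remain in the region:

eksctl get cluster --region us-east-1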