Prometheus Metric Exporter for Kubernetes

This Prometheus exporter for Kubernetes collects Intel® Gaudi® AI accelerator metrics in a container cluster running compute workloads. With the appropriate hardware and the exporter deployed in your Kubernetes cluster, you can collect information about the state of each Gaudi device. The Intel Gaudi Prometheus metric exporter for Kubernetes runs as a DaemonSet.

Prerequisites

Running the Intel Gaudi Prometheus metric exporter requires the following:

  • Intel Gaudi software drivers loaded on the system

  • Kubernetes version listed in the Support Matrix

Create a Namespace

Create the habanalabs namespace if it does not already exist, because the metric exporter is deployed into it:

$ kubectl create ns habanalabs
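
The "if it does not already exist" condition can be scripted so that the step is safe to repeat. The following is a minimal sketch, assuming a POSIX shell with kubectl on the PATH; ensure_namespace is a hypothetical helper name used for illustration, not part of kubectl or the exporter.

```shell
# Hypothetical helper: create a namespace only when it is missing.
ensure_namespace() {
  ns="$1"
  # "kubectl get namespace" exits non-zero when the namespace is absent.
  if kubectl get namespace "$ns" >/dev/null 2>&1; then
    echo "namespace $ns already exists"
  else
    kubectl create ns "$ns"
  fi
}

# Usage (run against your cluster):
#   ensure_namespace habanalabs
```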

Deployment

The following steps enable Intel Gaudi Prometheus metric exporter support in Kubernetes.

The metric exporter must run on every node that is equipped with Gaudi cards. The simplest way to do this is to deploy the following DaemonSet with the kubectl create command.

Note

kubectl must have access to a Kubernetes cluster to run these commands.

$ kubectl create -f https://vault.habana.ai/artifactory/gaudi-metric-exporter/yaml/1.15.1/metric-exporter-daemonset.yaml
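
After the DaemonSet is created, it can be useful to confirm that one exporter pod is ready on each eligible node. The sketch below compares the standard DaemonSet status fields desiredNumberScheduled and numberReady. The helper name daemonset_ready is hypothetical, and the DaemonSet name and namespace are passed as arguments because they are assumptions here; check the deployed YAML for the actual values.

```shell
# Hypothetical helper: succeed when every scheduled pod of a DaemonSet is ready.
daemonset_ready() {
  ns="$1"
  name="$2"
  desired="$(kubectl get daemonset "$name" -n "$ns" \
    -o jsonpath='{.status.desiredNumberScheduled}')"
  ready="$(kubectl get daemonset "$name" -n "$ns" \
    -o jsonpath='{.status.numberReady}')"
  echo "ready $ready of $desired"
  # Fail when the query returned nothing or the counts differ.
  [ -n "$desired" ] && [ "$desired" = "$ready" ]
}

# Usage (namespace and DaemonSet name are placeholders):
#   daemonset_ready <namespace> <daemonset_name>
```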

Deploying the Prometheus metric exporter along with kube-prometheus is highly recommended. In a cluster that uses kube-prometheus, also deploy a Kubernetes Service and a kube-prometheus ServiceMonitor to integrate the metric exporter with kube-prometheus.

To install the Service and ServiceMonitor, run the following commands:

$ kubectl create -f https://vault.habana.ai/artifactory/gaudi-metric-exporter/yaml/1.15.1/metric-exporter-service.yaml
$ kubectl create -f https://vault.habana.ai/artifactory/gaudi-metric-exporter/yaml/1.15.1/metric-exporter-serviceMonitor.yaml
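
ServiceMonitor is a custom resource provided by the Prometheus Operator that kube-prometheus installs, so creating one fails in a cluster where that resource type is not registered. A minimal sketch of a pre-check, assuming the usual CRD name servicemonitors.monitoring.coreos.com; has_servicemonitor_crd is a hypothetical helper name.

```shell
# Hypothetical helper: succeed when the ServiceMonitor CRD is registered.
has_servicemonitor_crd() {
  kubectl get crd servicemonitors.monitoring.coreos.com >/dev/null 2>&1
}

# Hypothetical helper: print a short status message for the pre-check.
report_servicemonitor_crd() {
  if has_servicemonitor_crd; then
    echo "ServiceMonitor CRD found"
  else
    echo "ServiceMonitor CRD missing: install kube-prometheus first"
    return 1
  fi
}

# Usage:
#   report_servicemonitor_crd
```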

Collecting Metrics

You can now collect metrics from a node with Gaudi cards by querying the endpoint of the metric exporter pod on port 41611 from within the cluster. To set a different port for the exporter application, use the --port flag (int). To find the endpoints associated with the metric exporter, run the following command against the namespace where the exporter is deployed:

$ kubectl get ep -n habana-system
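
To extract only the endpoint IP addresses, the same query can be combined with kubectl's JSONPath output. This is a sketch rather than part of the official procedure; endpoint_ips is a hypothetical helper name, and the namespace is passed as an argument so that it can match your deployment.

```shell
# Hypothetical helper: print each Endpoints object name followed by its IPs.
endpoint_ips() {
  ns="$1"
  kubectl get endpoints -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.subsets[*].addresses[*].ip}{"\n"}{end}'
}

# Usage:
#   endpoint_ips habana-system
```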

Once you have the endpoints for the metric exporter, a command such as the following retrieves Prometheus metrics for all Gaudi cards on that node:

$ curl http://<endpoint_ip>:41611/metrics
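
The response uses the Prometheus text exposition format, in which lines starting with # carry HELP and TYPE metadata and every other line is a sample. As a rough sketch, the hypothetical helper below lists the distinct metric names in such a response; it assumes standard grep, sed, and sort, and makes no assumption about which metrics the exporter actually publishes.

```shell
# Hypothetical helper: read Prometheus text format on stdin and
# print the unique metric names, one per line.
metric_names() {
  grep -v '^#' |                 # drop HELP/TYPE comment lines
    grep -v '^[[:space:]]*$' |   # drop blank lines
    sed 's/[{ ].*$//' |          # strip labels and sample values
    sort -u
}

# Usage:
#   curl -s http://<endpoint_ip>:41611/metrics | metric_names
```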