Prometheus Metric Exporter

This is a Prometheus exporter implementation that enables the collection of Intel® Gaudi® AI accelerator metrics in a container cluster running compute workloads. With the appropriate hardware and this plugin deployed in your cluster, you can collect information regarding the state of a Gaudi device.

Prerequisites

  • Intel Gaudi software drivers loaded on the system. For more details, refer to the Installation Guide.

  • For Kubernetes deployments only, a Kubernetes version listed in the Support Matrix.

Deploying Prometheus Metric Exporter in Docker

  1. Start the container. The optional --port flag sets the port the exporter listens on (default: 41611):

    docker run -it --privileged --network=host -v /dev:/dev vault.habana.ai/gaudi-metric-exporter/metric-exporter:1.22.0-740 --port <PORT_NUMBER>
    
  2. Define the Prometheus configuration file. Prometheus fundamentally stores all data as time series: streams of timestamped values of the same metric with the same set of labeled dimensions. The metric data exported from the exporter can be accessed in Prometheus for easier management. For details, refer to the Prometheus documentation. For example:

    - job_name: bmc
      scrape_interval: 30s                 # A 30s scrape interval is recommended
      metrics_path: /metrics               # The exporter exposes its own metrics at /metrics
      static_configs:
        - targets:
            - 192.168.22.189               # Address of the server running the metric exporter
      relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - source_labels: [__param_target]
          target_label: instance
        - target_label: __address__
          replacement: localhost:41611     # The location of the exporter as seen by Prometheus
    
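The job shown above is only a fragment of the scrape configuration. As a hedged sketch of where it sits in a complete configuration file (the global section shown here is illustrative, not taken from the vendor manifests), a minimal prometheus.yml embedding it could look like:

```yaml
# Minimal complete prometheus.yml embedding the scrape job above.
# The global settings are illustrative defaults, not vendor recommendations.
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: bmc
    metrics_path: /metrics
    static_configs:
      - targets:
          - 192.168.22.189            # server running the metric exporter
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:41611  # where Prometheus reaches the exporter
```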

Deploying Prometheus Metric Exporter in Kubernetes

  1. Create the monitoring namespace if it does not already exist, as the metric exporter is deployed into this namespace:

    kubectl create ns monitoring
    
  2. Run the metric exporter on all Gaudi nodes by deploying the following DaemonSet with the kubectl create command, using the associated .yaml file to set up the environment:

    $ kubectl create -f https://vault.habana.ai/artifactory/gaudi-metric-exporter/yaml/1.22.0/metric-exporter-daemonset.yaml
    

    Note

    kubectl requires access to a Kubernetes cluster to run its commands. To verify access, run $ kubectl get pod -A.

  3. To enable the Prometheus metric exporter and kube-prometheus integration, install the Kubernetes Service and the kube-prometheus ServiceMonitor by running the following commands. Prometheus Operator must be installed first, as it is required to deploy a ServiceMonitor; the Operator allows you to create, configure, and manage Prometheus instances on Kubernetes:

    $ kubectl create -f https://vault.habana.ai/artifactory/gaudi-metric-exporter/yaml/1.22.0/metric-exporter-service.yaml
    
    $ kubectl create -f https://vault.habana.ai/artifactory/gaudi-metric-exporter/yaml/1.22.0/metric-exporter-serviceMonitor.yaml
    

    It is highly recommended to deploy the Prometheus metric exporter along with kube-prometheus.
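The ServiceMonitor installed above is what ties the exporter's Service to Prometheus Operator. As a rough illustration only (the resource name, label selector, and port name here are hypothetical assumptions, not the actual values in the vault.habana.ai manifests), such a ServiceMonitor generally looks like:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: metric-exporter        # hypothetical name
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: metric-exporter     # must match the labels on the exporter's Service
  endpoints:
    - port: metrics            # named port on the Service (hypothetical)
      path: /metrics
      interval: 30s
```

Prometheus Operator watches for ServiceMonitor objects and generates the corresponding scrape configuration automatically, which is why no manual prometheus.yml edit is needed in the Kubernetes deployment.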

Note

The Prometheus metric exporter runs with hostNetwork: true and therefore exposes its metrics on the Intel Gaudi node's network interfaces.

Collecting Metrics

Now you can collect metrics from a node with Gaudi cards by querying the metric exporter pod's endpoint on port 41611, as follows:

  1. Find the endpoints associated with the metric exporter. To serve on a different port, start the exporter with the --port flag (int):

    $ kubectl get ep -n monitoring
    
  2. Once you have the endpoint for the metric exporter, run a simple command such as the following to retrieve Prometheus metrics for all Gaudi cards on that node:

    $ curl http://<endpoint_ip>:41611/metrics
    
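The response from /metrics uses the Prometheus text exposition format: one sample per line, with # HELP and # TYPE comment lines interleaved. The following standard-library Python sketch shows one way such a response could be parsed; the sample payload and its values are illustrative, not real hardware output:

```python
# Parse a sample of the Prometheus text exposition format, as returned by
# the exporter's /metrics endpoint. This is a simplistic parser for
# illustration: it assumes label values contain no spaces.
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Return a map of 'metric{labels}' -> value, skipping comment lines."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics


# Illustrative sample; real output depends on your hardware.
sample = """\
# HELP habanalabs_power_mW Power usage in milliwatts.
# TYPE habanalabs_power_mW gauge
habanalabs_power_mW{device="accel0"} 98000
habanalabs_memory_used_bytes{device="accel0"} 1073741824
"""

parsed = parse_metrics(sample)
print(parsed['habanalabs_power_mW{device="accel0"}'])  # 98000.0
```

In practice Prometheus scrapes and parses this format itself; a hand-rolled parser like this is only useful for ad-hoc scripting against the raw endpoint.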

Exposed Metrics

  • go_gc_duration_seconds: A summary of the pause duration of garbage collection cycles.

  • go_goroutines: Number of goroutines that currently exist.

  • go_info: Information about the Go environment.

  • go_memstats_alloc_bytes: Number of bytes allocated and still in use.

  • go_memstats_alloc_bytes_total: Total number of bytes allocated, even if freed.

  • go_memstats_buck_hash_sys_bytes: Number of bytes used by the profiling bucket hash table.

  • go_memstats_frees_total: Total number of frees.

  • go_memstats_gc_sys_bytes: Number of bytes used for garbage collection system metadata.

  • go_memstats_heap_alloc_bytes: Number of heap bytes allocated and still in use.

  • go_memstats_heap_idle_bytes: Number of heap bytes waiting to be used.

  • go_memstats_heap_inuse_bytes: Number of heap bytes that are in use.

  • go_memstats_heap_objects: Number of allocated objects.

  • go_memstats_heap_released_bytes: Number of heap bytes released to the OS.

  • go_memstats_heap_sys_bytes: Number of heap bytes obtained from the system.

  • go_memstats_last_gc_time_seconds: Number of seconds since 1970 of the last garbage collection.

  • go_memstats_lookups_total: Total number of pointer lookups.

  • go_memstats_mallocs_total: Total number of mallocs.

  • go_memstats_mcache_inuse_bytes: Number of bytes in use by mcache structures.

  • go_memstats_mcache_sys_bytes: Number of bytes used for mcache structures obtained from the system.

  • go_memstats_mspan_inuse_bytes: Number of bytes in use by mspan structures.

  • go_memstats_mspan_sys_bytes: Number of bytes used for mspan structures obtained from the system.

  • go_memstats_next_gc_bytes: Number of heap bytes when the next garbage collection will take place.

  • go_memstats_other_sys_bytes: Number of bytes used for other system allocations.

  • go_memstats_stack_inuse_bytes: Number of bytes in use by the stack allocator.

  • go_memstats_stack_sys_bytes: Number of bytes obtained from the system for the stack allocator.

  • go_memstats_sys_bytes: Number of bytes obtained from the system.

  • go_threads: Number of OS threads created.

  • habanalabs_clock_soc_max_mhz: Maximum SoC clock frequency.

  • habanalabs_clock_soc_mhz: Operating SoC clock frequency.

  • habanalabs_device_config: Device information.

  • habanalabs_ecc_feature_mode: ECC feature status.

  • habanalabs_energy: Device energy usage.

  • habanalabs_memory_free_bytes: Current free bytes of memory.

  • habanalabs_memory_total_bytes: Current total bytes of memory.

  • habanalabs_memory_used_bytes: Current used bytes of memory.

  • habanalabs_nic_port_status: NIC port status.

  • habanalabs_pci_link_speed: PCIe link speed.

  • habanalabs_pci_link_width: PCIe link width.

  • habanalabs_pcie_receive_throughput: PCIe receive throughput.

  • habanalabs_pcie_replay_count: Total number of PCIe replay events.

  • habanalabs_pcie_rx: PCIe receive traffic.

  • habanalabs_pcie_transmit_throughput: PCIe transmit throughput.

  • habanalabs_pcie_tx: PCIe transmit traffic.

  • habanalabs_pending_rows_state: Number of memory rows in pending state.

  • habanalabs_pending_rows_with_double_bit_ecc_errors: Number of memory rows with double-bit ECC errors.

  • habanalabs_pending_rows_with_single_bit_ecc_errors: Number of memory rows with single-bit ECC errors.

  • habanalabs_power_default_limit_mW: Power cap for the device.

  • habanalabs_power_mW: Power usage in milliwatts.

  • habanalabs_temperature_onboard: Temperature on the board in Celsius.

  • habanalabs_temperature_onchip: Temperature on the ASIC in Celsius.

  • habanalabs_temperature_threshold_gpu: Threshold temperature for the GPU in Celsius.

  • habanalabs_temperature_threshold_memory: Threshold temperature for memory in Celsius.

  • habanalabs_temperature_threshold_shutdown: Temperature at which the device shuts down, in Celsius.

  • habanalabs_temperature_threshold_slowdown: Temperature at which the device slows down, in Celsius.

  • habanalabs_utilization: Device utilization.

  • process_cpu_seconds_total: Total user and system CPU time spent, in seconds.

  • process_max_fds: Maximum number of open file descriptors.

  • process_open_fds: Number of open file descriptors.

  • process_resident_memory_bytes: Resident memory size in bytes.

  • process_start_time_seconds: Start time of the process since the Unix epoch, in seconds.

  • process_virtual_memory_bytes: Virtual memory size in bytes.

  • process_virtual_memory_max_bytes: Maximum amount of virtual memory available, in bytes.

  • promhttp_metric_handler_requests_in_flight: Current number of scrapes being served.

  • promhttp_metric_handler_requests_total: Total number of scrapes by HTTP status code.
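Metrics such as habanalabs_temperature_onchip and the habanalabs_memory_* family lend themselves to alerting. As a hedged sketch only (the rule names, thresholds, and label conventions below are illustrative assumptions, not vendor recommendations; in practice, compare against the device's own habanalabs_temperature_threshold_* metrics), a Prometheus alerting rule file could look like:

```yaml
groups:
  - name: gaudi.rules              # hypothetical group name
    rules:
      - alert: GaudiHighOnChipTemperature
        # 83 C is an illustrative threshold; prefer comparing against the
        # device-reported habanalabs_temperature_threshold_slowdown metric.
        expr: habanalabs_temperature_onchip > 83
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gaudi on-chip temperature is high on {{ $labels.instance }}"
      - alert: GaudiMemoryNearlyFull
        expr: habanalabs_memory_used_bytes / habanalabs_memory_total_bytes > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Gaudi device memory above 90% on {{ $labels.instance }}"
```

Such a file would be referenced from the rule_files section of prometheus.yml, or delivered as a PrometheusRule object when running under Prometheus Operator.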