Prometheus Metric Exporter

This is a Prometheus exporter implementation that enables the collection of Intel® Gaudi® AI accelerator metrics in a container cluster running compute workloads. With the appropriate hardware and this plugin deployed in your cluster, you can collect information about the state of each Gaudi device.

Prerequisites

  • Intel Gaudi software drivers loaded on the system. For more details, refer to the Installation Guide.

  • For Kubernetes only, the Kubernetes version listed in the Support Matrix.

Deploying Prometheus Metric Exporter in Docker

  1. Start the container. The exporter listens on port 41611 by default; use the --port flag to set a different port. (A quick way to verify the running exporter is shown after this procedure.)

    docker run -it --privileged --network=host -v /dev:/dev vault.habana.ai/gaudi-metric-exporter/metric-exporter:1.18.0 --port <PORT_NUMBER>

  2. Define the Prometheus configuration file. Prometheus fundamentally stores all data as a time series: streams of timestamped values belonging to the same metric and the same set of labeled dimensions. The metric data exposed by the exporter can then be accessed in Prometheus for easier management. For details, refer to the Prometheus documentation. For example:

    - job_name: bmc
      scrape_interval: 30s                   # A 30s scrape interval is recommended
      metrics_path: /metrics                 # The exporter exposes its metrics at /metrics
      static_configs:
      - targets:
        - 192.168.22.189                     # Address of the server running the metric exporter
      relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:41611         # The address at which Prometheus scrapes the exporter
    
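Once the container is running and Prometheus has been reloaded with the scrape job above, you can sanity-check both ends with plain HTTP requests. The following is a minimal sketch that assumes the exporter uses the default port 41611 on the local host and that Prometheus listens on its default port 9090; adjust the addresses to match your environment:

    # Query the exporter directly; it serves plain-text Prometheus metrics at /metrics.
    curl http://localhost:41611/metrics

    # Ask Prometheus whether the "bmc" scrape job defined above is up (value 1 means the target is reachable).
    curl 'http://localhost:9090/api/v1/query?query=up{job="bmc"}'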

Deploying Prometheus Metric Exporter in Kubernetes

  1. Create the habanalabs namespace if it does not already exist, as the metric exporter is deployed into this namespace:

    kubectl create ns habanalabs
    
  2. Run the metric exporter on all the Gaudi nodes by deploying the following DaemonSet using the kubectl create command. Use the associated .yaml file to set up the environment:

    $ kubectl create -f https://vault.habana.ai/artifactory/gaudi-metric-exporter/yaml/1.18.0/metric-exporter-daemonset.yaml
    

    Note

    kubectl requires access to a Kubernetes cluster to run its commands. To verify that you have access, run $ kubectl get pod -A.

  3. To enable integration between the Prometheus metric exporter and kube-prometheus, install the Kubernetes Service and the kube-prometheus ServiceMonitor by running the following commands:

    $ kubectl create -f https://vault.habana.ai/artifactory/gaudi-metric-exporter/yaml/1.18.0/metric-exporter-service.yaml
    
    $ kubectl create -f https://vault.habana.ai/artifactory/gaudi-metric-exporter/yaml/1.18.0/metric-exporter-serviceMonitor.yaml
    

    It is highly recommended to deploy the Prometheus metric exporter along with kube-prometheus. (A quick way to verify the created objects is shown after this procedure.)
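
After the DaemonSet, Service, and ServiceMonitor are created, a quick status check can confirm that the objects exist and the exporter pods are running. This is a sketch only; the namespace (habanalabs, created above) and the exact resource names may differ in your deployment:

    # List the exporter pods started by the DaemonSet, one per Gaudi node.
    kubectl get pods -n habanalabs -o wide

    # Confirm the Service and the kube-prometheus ServiceMonitor were created.
    kubectl get svc,servicemonitor -n habanalabs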

Note

The Prometheus metric exporter exposes metrics on the Gaudi node's network interfaces by running with hostNetwork: true.
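
If you want to confirm this on a running cluster, a check along the following lines can be used; the habanalabs namespace is an assumption carried over from the steps above and may need adjusting:

    # Print each pod in the namespace together with its hostNetwork setting (expected: true for the exporter pods).
    kubectl get pods -n habanalabs -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.hostNetwork}{"\n"}{end}'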

Collecting Metrics

You can now collect metrics on a node with Gaudi cards by querying the metric exporter pod's endpoint on port 41611 from within the cluster, as follows:

  1. To find the endpoints associated with the metric exporter, run the command below. You can use the --port flag (int) to set a different port for the application exporter:

    $ kubectl get ep -n habana-system
    
  2. Once you have the endpoints associated with the metric exporter, run a simple command such as the following to retrieve Prometheus metrics for all Gaudi cards on that node (a filtering sketch follows this procedure):

    $ curl http://<endpoint_ip>:41611/metrics
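
The /metrics endpoint returns every metric the exporter exposes. To narrow the output, a simple text filter can be applied. The habanalabs_ prefix used below is an assumption about the exporter's metric naming; check the names that actually appear in your output and adjust the filter accordingly:

    # List the available metric names (HELP lines) first.
    curl -s http://<endpoint_ip>:41611/metrics | grep '^# HELP'

    # Then filter for specific metrics; the habanalabs_ prefix is assumed, not confirmed.
    curl -s http://<endpoint_ip>:41611/metrics | grep '^habanalabs_'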