Prometheus Metric Exporter

This Prometheus exporter collects Intel® Gaudi® AI accelerator metrics in a container cluster that runs compute workloads. With the appropriate hardware and this exporter deployed in your cluster, you can collect information about the state of each Gaudi device.

Prerequisites

Intel Gaudi software drivers loaded on the system. For more details, refer to the Installation Guide.
For Kubernetes deployments only, a Kubernetes version listed in the Support Matrix.

Deploying Prometheus Metric Exporter in Docker

Start the container. The --port flag is optional; the default port number is 41611:

docker run -it --privileged --network=host -v /dev:/dev vault.habana.ai/gaudi-metric-exporter/metric-exporter:1.23.0-695 --port <PORT_NUMBER>

Define the Prometheus configuration file. Prometheus stores all data as time series: streams of timestamped values that belong to the same metric and the same set of labeled dimensions. Once Prometheus scrapes the exporter, the exported metric data can be queried and managed in Prometheus. For details, refer to the Prometheus documentation. For example:

- job_name: bmc
  scrape_interval: 30s                   # A 30s scrape interval is recommended
  metrics_path: /metrics                 # The exporter exposes its own metrics at /metrics
  static_configs:
    - targets:
        - 192.168.22.189                 # Name of the server running the metric exporter
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: localhost:41611       # The location of the exporter to Prometheus
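To make the effect of the three relabel_configs rules above concrete, the following is a minimal Python sketch that imitates how Prometheus applies `replace` rules to a target's labels. It is a simplified model, not Prometheus code: `apply_relabel` is a hypothetical helper, it handles only the default `replace` action with at most one `$1` reference, and it ignores details such as dropping empty labels.

```python
import re


def apply_relabel(labels, rules):
    """Apply simplified Prometheus-style 'replace' relabel rules to a label dict."""
    labels = dict(labels)  # work on a copy so the caller's dict is unchanged
    for rule in rules:
        # Prometheus defaults: join source values with ';', regex '(.*)', replacement '$1'.
        source = ";".join(labels.get(name, "") for name in rule.get("source_labels", []))
        match = re.fullmatch(rule.get("regex", "(.*)"), source)
        if match is None:
            continue  # a rule whose regex does not match leaves the labels untouched
        captured = match.group(1) if match.re.groups >= 1 else ""
        replacement = rule.get("replacement", "$1")
        labels[rule["target_label"]] = replacement.replace("$1", captured)
    return labels


# The three rules from the configuration above, expressed as dicts.
rules = [
    {"source_labels": ["__address__"], "target_label": "__param_target"},
    {"source_labels": ["__param_target"], "target_label": "instance"},
    {"target_label": "__address__", "replacement": "localhost:41611"},
]

# The static target from the configuration above.
result = apply_relabel({"__address__": "192.168.22.189"}, rules)
for name in sorted(result):
    print(name, "=", result[name])
```

Under this model, the scrape request goes to localhost:41611 while the stored series keep instance="192.168.22.189", and the original address is also passed along as the `target` URL parameter.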

Deploying Prometheus Metric Exporter in Kubernetes

Create the monitoring namespace if it does not already exist. The metric exporter is deployed into this namespace:
$ kubectl create ns monitoring
Run the metric exporter on all Gaudi nodes by deploying its DaemonSet with the kubectl create command and the associated .yaml file:
$ kubectl create -f https://vault.habana.ai/artifactory/gaudi-metric-exporter/yaml/1.23.0/metric-exporter-daemonset.yaml
Note

kubectl requires access to a Kubernetes cluster to run its commands. To verify that access, run $ kubectl get pod -A.
To integrate the Prometheus metric exporter with kube-prometheus, install the Kubernetes Service and the kube-prometheus ServiceMonitor by running the following commands. Make sure the Prometheus Operator is installed, because it is required for deploying a ServiceMonitor. The Prometheus Operator lets you create, configure, and manage Prometheus clusters on Kubernetes:
$ kubectl create -f https://vault.habana.ai/artifactory/gaudi-metric-exporter/yaml/1.23.0/metric-exporter-service.yaml
$ kubectl create -f https://vault.habana.ai/artifactory/gaudi-metric-exporter/yaml/1.23.0/metric-exporter-serviceMonitor.yaml
It is highly recommended to deploy the Prometheus metric exporter along with kube-prometheus.

Note

The Prometheus metric exporter runs with hostNetwork: true in order to expose metrics for the Intel Gaudi network interfaces.

Collecting Metrics

You can now collect metrics from a node with Gaudi cards by querying the endpoint of the metric exporter pod in the cluster on port 41611, as follows:

Find the endpoints associated with the metric exporter. The exporter listens on port 41611 unless a different port was set with the --port flag (int):
$ kubectl get ep -n monitoring
Once you have the endpoints for the metric exporter, run a simple command such as the following to retrieve Prometheus metrics for all Gaudi cards on that node:
$ curl http://<endpoint_ip>:41611/metrics
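The two steps above can also be scripted. The sketch below is one possible approach in Python, using only the standard library: it builds /metrics URLs from the JSON form of the endpoint listing (`kubectl get ep -n monitoring -o json`), fetches a URL, and parses the Prometheus text exposition format into tuples. The helper names (`endpoint_urls`, `fetch_metrics`, `parse_metrics`) are hypothetical, the parser is deliberately simplified (it does not unescape label values or read timestamps), and the sample text is made-up input, not captured exporter output.

```python
import re
import urllib.request

# Matches "name{labels} value" or "name value" sample lines.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)')
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def endpoint_urls(endpoints, port=41611):
    """Build /metrics URLs from parsed `kubectl get ep -n monitoring -o json` output."""
    urls = []
    for item in endpoints.get("items", []):
        for subset in item.get("subsets") or []:
            # Keep only endpoints that serve the exporter port.
            if not any(p.get("port") == port for p in subset.get("ports") or []):
                continue
            for address in subset.get("addresses") or []:
                urls.append(f"http://{address['ip']}:{port}/metrics")
    return urls


def fetch_metrics(url, timeout=5.0):
    """Return the raw text served at the exporter's /metrics endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_metrics(text):
    """Parse exposition-format text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Illustrative input only: the "device" label and all values are made up.
sample = """# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 12
habanalabs_power_mW{device="0"} 95000
"""
for name, labels, value in parse_metrics(sample):
    print(name, labels, value)
```

Against a live cluster, `parse_metrics(fetch_metrics(url))` can be called for each URL returned by `endpoint_urls`; the network call is kept in its own function so that the parsing logic can be exercised offline.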

Exposed Metrics

Metric	Description
go_gc_duration_seconds	A summary of the pause duration of garbage collection cycles.
go_goroutines	Number of goroutines that currently exist.
go_info	Information about the Go environment.
go_memstats_alloc_bytes	Number of bytes allocated and still in use.
go_memstats_alloc_bytes_total	Total number of bytes allocated, even if freed.
go_memstats_buck_hash_sys_bytes	Number of bytes used by the profiling bucket hash table.
go_memstats_frees_total	Total number of frees.
go_memstats_gc_sys_bytes	Number of bytes used for garbage collection system metadata.
go_memstats_heap_alloc_bytes	Number of heap bytes allocated and still in use.
go_memstats_heap_idle_bytes	Number of heap bytes waiting to be used.
go_memstats_heap_inuse_bytes	Number of heap bytes that are in use.
go_memstats_heap_objects	Number of allocated objects.
go_memstats_heap_released_bytes	Number of heap bytes released to OS.
go_memstats_heap_sys_bytes	Number of heap bytes obtained from system.
go_memstats_last_gc_time_seconds	Number of seconds since 1970 of last garbage collection.
go_memstats_lookups_total	Total number of pointer lookups.
go_memstats_mallocs_total	Total number of mallocs.
go_memstats_mcache_inuse_bytes	Number of bytes in use by mcache structures.
go_memstats_mcache_sys_bytes	Number of bytes used for mcache structures obtained from system.
go_memstats_mspan_inuse_bytes	Number of bytes in use by mspan structures.
go_memstats_mspan_sys_bytes	Number of bytes used for mspan structures obtained from system.
go_memstats_next_gc_bytes	Number of heap bytes when next garbage collection will take place.
go_memstats_other_sys_bytes	Number of bytes used for other system allocations.
go_memstats_stack_inuse_bytes	Number of bytes in use by the stack allocator.
go_memstats_stack_sys_bytes	Number of bytes obtained from system for stack allocator.
go_memstats_sys_bytes	Number of bytes obtained from system.
go_threads	Number of OS threads created.
habanalabs_clock_soc_max_mhz	Maximum SoC clock frequency.
habanalabs_clock_soc_mhz	Operating SoC clock frequency.
habanalabs_device_config	Device information.
habanalabs_ecc_feature_mode	ECC feature status.
habanalabs_energy	Device energy usage.
habanalabs_memory_free_bytes	Current free bytes of memory.
habanalabs_memory_total_bytes	Current total bytes of memory.
habanalabs_memory_used_bytes	Current used bytes of memory.
habanalabs_nic_port_status	NIC port status.
habanalabs_pci_link_speed	PCIe link speed.
habanalabs_pci_link_width	PCIe link width.
habanalabs_pcie_receive_throughput	PCIe receive throughput.
habanalabs_pcie_replay_count	Total number of PCIe replay events.
habanalabs_pcie_rx	PCIe receive traffic.
habanalabs_pcie_transmit_throughput	PCIe transmit throughput.
habanalabs_pcie_tx	PCIe transmit traffic.
habanalabs_pending_rows_state	Number of memory rows in pending state.
habanalabs_pending_rows_with_double_bit_ecc_errors	Number of memory rows with double-bit ECC errors.
habanalabs_pending_rows_with_single_bit_ecc_errors	Number of memory rows with single-bit ECC errors.
habanalabs_power_default_limit_mW	Power cap for the device.
habanalabs_power_mW	Power usage in milliwatts.
habanalabs_temperature_onboard	Temperature on the board in Celsius.
habanalabs_temperature_onchip	Temperature on the ASIC in Celsius.
habanalabs_temperature_threshold_gpu	Threshold temperature for GPU in Celsius.
habanalabs_temperature_threshold_memory	Threshold temperature for memory in Celsius.
habanalabs_temperature_threshold_shutdown	Temperature at which device shuts down in Celsius.
habanalabs_temperature_threshold_slowdown	Temperature at which device slows down in Celsius.
habanalabs_utilization	Device utilization.
process_cpu_seconds_total	Total user and system CPU time spent in seconds.
process_max_fds	Maximum number of open file descriptors.
process_open_fds	Number of open file descriptors.
process_resident_memory_bytes	Resident memory size in bytes.
process_start_time_seconds	Start time of the process since unix epoch in seconds.
process_virtual_memory_bytes	Virtual memory size in bytes.
process_virtual_memory_max_bytes	Maximum amount of virtual memory available in bytes.
promhttp_metric_handler_requests_in_flight	Current number of scrapes being served.
promhttp_metric_handler_requests_total	Total number of scrapes by HTTP status code.
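As an illustration of how the metrics in the table can be combined, the following Python sketch derives a few per-device health figures from a mapping of metric name to value. The metric names and units come from the table above; the `summarize` helper, the choice of derived figures, and the input values in the usage lines are assumptions made for this example.

```python
def summarize(values):
    """Derive simple health figures from one device's metric values."""
    total = values["habanalabs_memory_total_bytes"]
    if total <= 0:
        raise ValueError("habanalabs_memory_total_bytes must be positive")
    return {
        # Fraction of device memory currently in use (0.0 to 1.0).
        "memory_used_fraction": values["habanalabs_memory_used_bytes"] / total,
        # The exporter reports power in milliwatts; convert to watts.
        "power_watts": values["habanalabs_power_mW"] / 1000.0,
        # Degrees Celsius left before the device reaches its slowdown threshold.
        "slowdown_headroom_celsius": (
            values["habanalabs_temperature_threshold_slowdown"]
            - values["habanalabs_temperature_onchip"]
        ),
    }


# Made-up values for a single device, for illustration only.
device = {
    "habanalabs_memory_used_bytes": 32 * 1024**3,
    "habanalabs_memory_total_bytes": 128 * 1024**3,
    "habanalabs_power_mW": 350000.0,
    "habanalabs_temperature_onchip": 60.0,
    "habanalabs_temperature_threshold_slowdown": 100.0,
}
summary = summarize(device)
for key in sorted(summary):
    print(key, summary[key])
```

In a Prometheus deployment, the same ratios would more typically be computed in queries or alerting rules; the Python form is used here only to spell out the arithmetic.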

Gaudi Documentation 1.23.0 documentation
