BMC Exporter User Guide

The BMC exporter exposes metrics to Prometheus and keeps the data organized, secure, and efficiently exported while scaling out to numerous machines. Prometheus is an open-source monitoring system that is commonly used alongside Grafana, a dashboard visualization tool. Grafana utilizes AlertManager, which manages alerts and notifies users of potential issues.

The BMC exporter uses the Redfish protocol to transmit commands during each data scrape. With Redfish, users can interact with systems through standard web services, such as HTTP and REST APIs.

Figure 22 Prometheus Components

Note

The BMC exporter is supported for Gaudi 2 only.

Prerequisites

To use the BMC exporter for Prometheus, the management node (local system) must have one of the following containerized solutions installed:

  • Docker

  • Podman

  • Kubernetes

Using the BMC Exporter

To use the BMC exporter, generate a JSON configuration file containing the server access details. Two available types of files are described below:

  • Basic Configuration - If the hosts share the same username and password, you can add a single set of credentials in this configuration file. This set will be considered as the default, allowing the BMC exporter to access all hosts using these credentials. For example:

    {
        "username": "ADMIN",
        "password": "ADMIN"
    }
    
  • Advanced Configuration - If the credentials of each host are different, you can specify them alongside their respective host IP addresses in the configuration file. The default port for the BMC exporter is 4001. To change it, set the port field, as in the example below, which uses port 5000:

    {
        "username": "ADMIN",
        "password": "ADMIN",
        "port": "5000",
        "servers": [
            {
                "password": "ADMIN",
                "username": "ADMIN",
                "hostname": "192.168.22.188"
            },
            {
                "password": "ADMIN",
                "username": "ADMIN",
                "hostname": "192.168.22.189"
            }
        ]
    }
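
Either type of configuration file can also be generated with a short script. The following is a minimal Python sketch that assumes only the field names shown in the examples above. The build_config helper, and its choice to copy the default credentials into servers that do not specify their own, are illustrative assumptions and not part of the BMC exporter.

```python
import json


def build_config(username, password, port=None, servers=None):
    """Build a BMC exporter configuration dictionary (hypothetical helper)."""
    config = {"username": username, "password": password}
    if port is not None:
        # The examples above store the port as a string.
        config["port"] = str(port)
    if servers:
        config["servers"] = [
            {
                # Assumption of this sketch: reuse the default credentials
                # when a server entry does not provide its own.
                "username": server.get("username", username),
                "password": server.get("password", password),
                "hostname": server["hostname"],
            }
            for server in servers
        ]
    return config


basic = build_config("ADMIN", "ADMIN")
advanced = build_config(
    "ADMIN",
    "ADMIN",
    port=5000,
    servers=[{"hostname": "192.168.22.188"}, {"hostname": "192.168.22.189"}],
)

# Print the advanced configuration; redirect the output to config.json to use it.
print(json.dumps(advanced, indent=4))
```

Saving the printed output as config.json gives a file with the same structure as the advanced example.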
    

Highlights:

  • The BMC exporter is specifically tested and designed for use with Intel Gaudi drivers. Attempting to run it with other drivers, such as a developer driver, may result in missing functionality.

  • Ulimit: The BMC exporter requires one file descriptor per BMC for the UDP socket, so it may be necessary to increase the limit. To check the current number of file descriptors a process can open, run the ulimit -n command. To raise the soft limit to the hard limit, which is the maximum the BMC exporter can use, run ulimit -Sn $(ulimit -Hn).

  • Upon receiving SIGINT or SIGTERM signals, the BMC exporter gracefully shuts down its web server, waits for all ongoing scrapes to complete, and then closes all BMC connections.

  • Alerts: In addition to alerting on BMC metrics, you may also want to receive notifications if the BMC exporter becomes unhealthy. Both the BMC exporter and its underlying BMC library were developed with Prometheus in mind, providing a wide range of metrics including collection latency and the number of attempted IPMI commands for each IP address.
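
The file descriptor check described in the Ulimit item can also be done from a script, for example a wrapper that launches the exporter as a child process. The sketch below uses Python's standard resource module (Unix only). The ensure_fd_limit helper and the headroom value are hypothetical; the only figure taken from this guide is one descriptor per BMC.

```python
import resource


def ensure_fd_limit(bmc_count, headroom=64):
    """Raise the soft file descriptor limit if it is too low (hypothetical helper).

    bmc_count: number of BMCs to monitor (one descriptor each, per the guide).
    headroom:  assumed extra descriptors for logs, HTTP connections, and so on.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    needed = bmc_count + headroom
    if soft != resource.RLIM_INFINITY and soft < needed:
        # Never ask for more than the hard limit allows.
        if hard == resource.RLIM_INFINITY:
            new_soft = needed
        else:
            new_soft = min(needed, hard)
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        soft = new_soft
    return soft, hard


soft_limit, hard_limit = ensure_fd_limit(bmc_count=2)
print("soft limit:", soft_limit, "hard limit:", hard_limit)
```

The new limit applies only to the calling process and the processes it starts afterwards, which mirrors how ulimit behaves in a shell.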

Running the BMC Exporter

To run the BMC exporter, use one of the solutions described below.

Docker

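To deploy the BMC exporter in Docker, run the following command from the directory that contains config.json. The command mounts the current directory at /tmp in the container and maps host port 5003 to container port 5000:

docker run -p 5003:5000 -v "$(pwd)":/tmp habana-bmc-exporter:1.16.2 -config /tmp/config.json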

Kubernetes

To deploy the BMC exporter in Kubernetes, apply the files described below. To apply all the files simultaneously, run kubectl apply -f <file1> -f <file2>.

Note

Make sure the Kubernetes Prometheus stack is installed before you start. Refer to helm-charts.

  • Deployment

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: bmc-monitoring
      namespace: monitoring
      labels:
        app: bmc-monitoring
    spec:
      selector:
        matchLabels:
          app: bmc-monitoring
      template:
        metadata:
          labels:
            app: bmc-monitoring
        spec:
          containers:
          - name: bmc-monitoring
            image: vault.habana.ai/habana-bmc-exporter/bmc-exporter:1.16.2
            imagePullPolicy: Always
            args:
              - "--config"
              - "/tmp/config.json"
            resources:
              limits:
                memory: 3Gi
                cpu: 500m
              requests:
                cpu: 350m
                memory: 2Gi
            # readiness probes mark the service available to accept traffic.
            readinessProbe:
              httpGet:
                path: /debug/readiness
                port: 5000
              initialDelaySeconds: 50
              periodSeconds: 15
              timeoutSeconds: 5
              successThreshold: 1
              failureThreshold: 2
            # liveness probes mark the service alive or dead (to be restarted).
            livenessProbe:
              httpGet:
                path: /debug/liveness
                port: 5000
              initialDelaySeconds: 50
              periodSeconds: 30
              timeoutSeconds: 5
              successThreshold: 1
              failureThreshold: 2
            volumeMounts:
            - name: config-volume
              mountPath: /tmp
            env:
              - name: USERNAME
                valueFrom:
                  secretKeyRef:
                    name: bmc-monitoring-secret
                    key: username
              - name: PASSWORD
                valueFrom:
                  secretKeyRef:
                    name: bmc-monitoring-secret
                    key: password
          volumes:
          - name: config-volume
            configMap:
              name: bmc-exporter-conf
    
  • Secret

    apiVersion: v1
    kind: Secret
    metadata:
      name: bmc-monitoring-secret
      namespace: monitoring
    type: Opaque
    data:
      username: <BASE64 USERNAME>
      password: <BASE64 PASSWORD>
    
  • Config Map

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: bmc-exporter-conf
      namespace: monitoring
    data:
      config.json: |
        {
          "username": "",
          "password": "",
          "port": "5000",
          "servers": [
            {
              "hostname": ""
            }
          ]
        }
    
  • Service

    apiVersion: v1
    kind: Service
    metadata:
      name: bmc-monitoring-service
      namespace: monitoring
      labels:
        app: bmc-monitoring
    spec:
      selector:
        app: bmc-monitoring
      ports:
        - protocol: TCP
          port: 5000
          targetPort: 5000
          name: bmc-monitoring-endpoint
    
  • Service Monitor

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      labels:
        release: stable
        app: bmc-monitoring
      name: bmc-monitoring-service-monitor
      namespace: monitoring
    spec:
      endpoints:
      - interval: 5m
        path: /metric
        port: bmc-monitoring-endpoint
        scrapeTimeout: 1m30s
      namespaceSelector:
        matchNames:
        - monitoring
      selector:
        matchLabels:
          app: bmc-monitoring
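
The Secret manifest above expects base64-encoded values in place of <BASE64 USERNAME> and <BASE64 PASSWORD>. As a minimal sketch, the values can be produced with Python's standard base64 module. The encode_secret_value helper is illustrative, and ADMIN is only the sample credential used earlier in this guide.

```python
import base64


def encode_secret_value(value):
    """Return the base64 form of a string for use in a Kubernetes Secret."""
    # Encode exactly the given characters; a stray trailing newline
    # would become part of the stored credential.
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


print("username:", encode_secret_value("ADMIN"))
print("password:", encode_secret_value("ADMIN"))
```

Both lines print QURNSU4= for the sample value ADMIN.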
    

Exposed Data

The following outlines the monitoring components exposed to Prometheus:

Monitoring Component    | Description                                                    | Endpoint
------------------------+----------------------------------------------------------------+--------------------------------------------
OAM Info                | Internal modules IDs and main memories sizes.                  | /info
OAM Status              | Device operational status.                                     | /status
Temperature             | Current, max and historical temperatures and thresholds.       | /temperature
Power                   | Peak and current power consumption.                            | /power
Frequency               | Max and current frequencies.                                   | /frequency
Ethernet Info           | Ethernet configuration info and complete connectivity status.  | /ethernet-info
Ethernet Status         | Ethernet status per port.                                      | /ethernet-status, /ethernet-status-counters
PCIe Info               | PCIe information and errors.                                   | /pcie-info
Alerts                  | Error alerts and information.                                  | /alerts
Sensors Temperature     | Temperature sensors readouts.                                  | /sensor-temperature
Ctemperature            | Current maximal temperature between SOC and HBMs temperatures. | /ctemperature
Sensors Voltage         | Voltage sensors readouts.                                      | /sensor-voltage
Sensors Voltage Monitor | Voltage monitors readouts.                                     | /sensor-voltage-monitor
Sensors Current         | Current sensors readouts.                                      | /sensor-current
Security                | Security related information.                                  | /security
HBM                     | HBM information related to repairs and repair resources.       | /hbm
BMC state               | BMC state (up/down).                                           | /bmc-state
Exporter info           | Application information, such as version.                      | /exporter-info
Direct                  | Direct NVMe-MI information.                                    | /direct

Configuration Using Prometheus

Prometheus fundamentally stores all data as a time series: streams of timestamped values of the same metric and the same sets of labeled dimensions. The BMC data exported from the BMC exporter can be accessed in Prometheus for easier management. For details, refer to the Prometheus documentation. For example, the following scrape configuration sends each BMC address to the exporter as a target parameter:

    - job_name: bmc
      scrape_interval: 30s                  # a 30s scrape interval is recommended
      metrics_path: /metrics                # the exporter exposes its own metrics at /metrics
      static_configs:
      - targets:
        - 192.168.22.189                    # strings corresponding to the keys in secrets.yml
        - 192.168.22.188
      relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:5000         # the location of the exporter to Prometheus
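
To make the effect of the three relabel_configs rules concrete, the following Python sketch mimics what Prometheus does with each static target: the BMC address is copied into the target URL parameter, reused as the instance label, and the scrape address is replaced with the exporter's location. This is a simplified model of Prometheus relabeling for illustration only; the relabel function is hypothetical, and the exporter address and metrics path are taken from the example above.

```python
from urllib.parse import urlencode


def relabel(target, exporter="localhost:5000", metrics_path="/metrics"):
    """Model the three relabeling rules for one static target."""
    labels = {"__address__": target}
    # Rule 1: copy the BMC address into the "target" URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Rule 2: use the BMC address as the instance label.
    labels["instance"] = labels["__param_target"]
    # Rule 3: scrape the exporter instead of the BMC itself.
    labels["__address__"] = exporter
    query = urlencode({"target": labels["__param_target"]})
    scrape_url = "http://{}{}?{}".format(labels["__address__"], metrics_path, query)
    return labels["instance"], scrape_url


for bmc in ("192.168.22.189", "192.168.22.188"):
    instance, url = relabel(bmc)
    print(instance, "->", url)
```

Each BMC therefore appears in Prometheus under its own instance label even though every scrape request goes to the same exporter address.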

Adding Grafana Dashboard and Alerts

Follow the steps below to add the Grafana dashboard and alerts.

Note

Adding the Grafana dashboard and alerts via the API is only available for Grafana versions 9.0 and 9.5.

Preparation

  1. Create a dashboard folder:

    1. Sign in to Grafana.

    2. Click “Dashboards” on the left-side menu.

    3. On the Dashboards page, click “New” and select “New folder” in the dropdown menu.

    4. Enter a unique name and click “Create”. For further details, refer to Manage dashboards | Grafana documentation.

  2. Get a folder UID:

    1. Select the folder you want.

    2. Click on “Go to folder”.

    3. Save the folder UID. For example, in https://habana-grafana.com/dashboards/f/toPjxZy4z/bmc-exporter, toPjxZy4z is the folder UID.

  3. Create an API Key. For further instructions, refer to API keys.
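
The folder UID from step 2 can also be read from the folder URL with a short script. The sketch below assumes the URL layout shown in the example above, in which the UID is the path segment that follows /dashboards/f/. The folder_uid_from_url helper is hypothetical.

```python
from urllib.parse import urlparse


def folder_uid_from_url(url):
    """Extract the folder UID from a Grafana folder URL (hypothetical helper)."""
    parts = [segment for segment in urlparse(url).path.split("/") if segment]
    # Look for ".../dashboards/f/<uid>/..." anywhere in the path.
    for index in range(len(parts) - 2):
        if parts[index] == "dashboards" and parts[index + 1] == "f":
            return parts[index + 2]
    raise ValueError("no folder UID found in URL: " + url)


print(folder_uid_from_url("https://habana-grafana.com/dashboards/f/toPjxZy4z/bmc-exporter"))
```

For the example URL, the script prints toPjxZy4z.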

Importing habana_alert

  1. Before importing habana_alert, make sure to set up the following:

  2. Import habana_alert using the API:

    wget <JSON VAULT>
    # Set the folderUID in the alert JSON by running the jq command.
    jq '.folderUID = "<FOLDER ID>"' ./alerts.json > ./habana_alerts.json
    # Replace the datasource UID with your datasource UID.
    sed -i 's/"datasourceUid": "prometheus"/"datasourceUid": "<DATASOURCE_UID>"/' ./habana_alerts.json
    # Upload the alert rule to Grafana.
    curl --data "@<JSON FILE PATH>" -H "Authorization: Bearer <API KEY>" -H "Content-Type: application/json" <GRAFANA URL>/api/v1/provisioning/alert-rules
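
The same two modifications (setting folderUID and replacing the prometheus datasource UID) can be made without jq and sed. The following Python sketch shows one way to do it. The patch_alert_rule helper is hypothetical, and the sample rule below is a made-up stand-in, since the real structure of the downloaded alert file is not described in this guide.

```python
import json


def patch_alert_rule(rule, folder_uid, datasource_uid, placeholder="prometheus"):
    """Return a copy of the rule with folderUID and datasource UIDs updated."""

    def walk(node):
        # Replace every "datasourceUid" that still holds the placeholder value.
        if isinstance(node, dict):
            return {
                key: datasource_uid
                if key == "datasourceUid" and value == placeholder
                else walk(value)
                for key, value in node.items()
            }
        if isinstance(node, list):
            return [walk(item) for item in node]
        return node

    patched = walk(rule)
    patched["folderUID"] = folder_uid
    return patched


# Hypothetical sample; the real alerts.json may contain different fields.
sample_rule = {
    "title": "example rule",
    "folderUID": "",
    "data": [
        {"refId": "A", "datasourceUid": "prometheus"},
        {"refId": "B", "datasourceUid": "other"},
    ],
}

patched_rule = patch_alert_rule(sample_rule, "toPjxZy4z", "my-datasource-uid")
print(json.dumps(patched_rule, indent=2))
```

Unlike the text substitution done by sed, this approach does not depend on the exact spacing inside the JSON file.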
    

Importing habana_dashboard

Import habana_dashboard using the API:

wget <JSON VAULT>
# Update the dashboard with your datasource UID.
sed -i 's/"uid": "prometheus"/"uid": "<DATASOURCE_UID>"/' habana_dashboard.json
curl -X POST -H "Authorization: Bearer <API KEY>" -H "Content-Type: application/json" -d "@<JSON FILE PATH>" <GRAFANA URL>/api/dashboards/db
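
For scripted deployments, the curl call above can be expressed with Python's standard urllib module. The sketch below only builds the request and does not send it. The build_dashboard_request helper and the example URL and key are placeholders; the endpoint path and headers are the ones used in the curl command.

```python
import urllib.request


def build_dashboard_request(grafana_url, api_key, body):
    """Build the POST request that uploads a dashboard JSON payload."""
    return urllib.request.Request(
        url=grafana_url.rstrip("/") + "/api/dashboards/db",
        data=body,  # raw bytes of the dashboard JSON file
        method="POST",
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
    )


# Placeholder values; replace them with your Grafana URL, API key, and file contents.
request = build_dashboard_request("https://grafana.example.com/", "API_KEY", b"{}")
print(request.get_method(), request.full_url)
```

To send the request, pass it to urllib.request.urlopen and check the response status.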

To import habana_dashboard using the GUI, refer to Manage dashboards | Grafana documentation.