BMC Exporter Guide

The BMC exporter exposes metrics to Prometheus and keeps the data organized, secure, and efficiently exported while scaling out to many machines. Prometheus is an open-source monitoring application used alongside Grafana, a dashboard visualization tool. Grafana utilizes AlertManager, which manages alerts and notifies users of potential issues.

The BMC exporter uses the Redfish protocol to transmit commands during each data scrape. With Redfish, users can interact with systems through standard web services, such as HTTP and REST APIs.

Figure 22 Prometheus Components

Note

The BMC exporter is supported for Gaudi 2 only.

Prerequisites

To use the BMC exporter for Prometheus, the management node (local system) must have one of the following containerized solutions installed:

  • Docker

  • Podman

  • Kubernetes

Using the BMC Exporter

Follow the steps below to get started with the BMC exporter.

Creating a Configuration File

To use the BMC exporter, create a JSON configuration file containing the server access details. Two types of configuration file are available, basic and advanced, as described below:

  • Basic Configuration - if all hosts share the same username and password, set a single pair of credentials in the configuration file. The BMC exporter treats this pair as the default and uses it to access every host. For example:

{
    "username": "ADMIN",
    "password": "ADMIN"
}
  • Advanced Configuration - if the hosts use different credentials, specify each host's username and password alongside its IP address in the configuration file. The default port for the BMC exporter is 4001; it can be changed if necessary, as the port field in the example shows. For example:

{
    "username": "ADMIN",
    "password": "ADMIN",
    "port": "5000",
    "servers": [
        {
            "password": "ADMIN",
            "username": "ADMIN",
            "hostname": "192.168.22.188"
        },
        {
            "password": "ADMIN",
            "username": "ADMIN",
            "hostname": "192.168.22.189"
        }
    ]
}
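
When the host list is long, a file of this shape can be generated rather than written by hand. The following is a minimal Python sketch, not part of the BMC exporter: the function name build_config and the (hostname, username, password) tuple layout are assumptions made for illustration, while the JSON keys mirror the examples above.

```python
import json


def build_config(default_user, default_password, port, hosts):
    """Return a BMC exporter configuration as a dict.

    hosts is a list of (hostname, username, password) tuples
    (a hypothetical input layout chosen for this sketch).
    """
    return {
        "username": default_user,
        "password": default_password,
        "port": str(port),  # the examples above give the port as a string
        "servers": [
            {"hostname": h, "username": u, "password": p}
            for h, u, p in hosts
        ],
    }


if __name__ == "__main__":
    config = build_config(
        "ADMIN", "ADMIN", 5000,
        [
            ("192.168.22.188", "ADMIN", "ADMIN"),
            ("192.168.22.189", "ADMIN", "ADMIN"),
        ],
    )
    # Print the JSON; redirect the output to config.json to use it.
    print(json.dumps(config, indent=4))
```

Redirecting the printed output to config.json gives a file equivalent to the advanced example above.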

Highlights:

  • The BMC exporter is specifically tested and designed for use with Intel Gaudi drivers. Attempting to run it with other drivers, such as a developer driver, may result in missing functionality.

  • Ulimit: The exporter requires one file descriptor per BMC for the UDP socket, so you may need to increase the limit. To check how many file descriptors a process can open, run the ulimit -n command. To raise the soft limit to the hard limit, which is the maximum available to the exporter, run ulimit -Sn $(ulimit -Hn).

  • Upon receiving SIGINT or SIGTERM signals, the exporter gracefully shuts down its web server, waits for all ongoing scrapes to complete, and then closes all BMC connections.

  • Alerts: In addition to alerting on BMC metrics, you may also want to receive notifications if the exporter becomes unhealthy. Both the exporter and its underlying BMC library were developed with Prometheus in mind, providing a wide range of metrics including collection latency and the number of attempted IPMI commands for each IP address.
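
The file descriptor check described under Ulimit can also be done from a script before starting the exporter. The sketch below uses Python's standard resource module (Unix only) to read the same limits that ulimit -n and ulimit -Hn report. The headroom value is a hypothetical allowance for the web server and other descriptors, not a documented figure.

```python
import resource


def fd_limits():
    """Return the (soft, hard) limits on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def has_enough_descriptors(bmc_count, soft_limit, headroom=64):
    """Check whether soft_limit covers one descriptor per BMC.

    headroom is a hypothetical allowance for the web server, logs and
    other descriptors; it is an assumption of this sketch.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= bmc_count + headroom


if __name__ == "__main__":
    soft, hard = fd_limits()
    print(f"soft={soft} hard={hard}")
    print("enough for 2000 BMCs:", has_enough_descriptors(2000, soft))
```

The script only inspects the limits of its own process. The limit that matters is the one in effect in the shell or container that launches the exporter.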

Running the BMC Exporter

To run the BMC exporter, use one of the solutions described below.

Docker

To deploy the BMC exporter in Docker, run the following command from the directory that contains config.json:

docker run -p 5003:5000 -v "$(pwd)":/tmp habana-bmc-exporter:1.15.1 -config /tmp/config.json

Kubernetes

To deploy the BMC exporter in Kubernetes, apply the files described below. To apply all the files simultaneously, run kubectl apply -f <file1> -f <file2>.

Note

Make sure the Kubernetes Prometheus stack is installed before you start. Refer to helm-charts.

Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: bmc-monitoring
  namespace: monitoring
  labels:
    app: bmc-monitoring
spec:
  selector:
    matchLabels:
      app: bmc-monitoring
  template:
    metadata:
      labels:
        app: bmc-monitoring
    spec:
      containers:
      - name: bmc-monitoring
        image: vault.habana.ai/habana-bmc-exporter/bmc-exporter:1.15.1
        imagePullPolicy: Always
        args:
          - "--config"
          - "/tmp/config.json"
        resources:
          limits:
            memory: 3Gi
            cpu: 500m
          requests:
            cpu: 350m
            memory: 2Gi
        # readiness probes mark the service available to accept traffic.
        readinessProbe:
          httpGet:
            path: /debug/readiness
            port: 5000
          initialDelaySeconds: 50
          periodSeconds: 15
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 2
        # liveness probes mark the service alive or dead (to be restarted).
        livenessProbe:
          httpGet:
            path: /debug/liveness
            port: 5000
          initialDelaySeconds: 50
          periodSeconds: 30
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 2
        volumeMounts:
        - name: config-volume
          mountPath: /tmp
        env:
          - name: USERNAME
            valueFrom:
              secretKeyRef:
                name: bmc-monitoring-secret
                key: username
          - name: PASSWORD
            valueFrom:
              secretKeyRef:
                name: bmc-monitoring-secret
                key: password
      volumes:
      - name: config-volume
        configMap:
          name: bmc-exporter-conf

Secret

apiVersion: v1
kind: Secret
metadata:
  name: bmc-monitoring-secret
  namespace: monitoring
type: Opaque
data:
  username: <BASE64 USERNAME>
  password: <BASE64 PASSWORD>
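
Values under the data field of a Kubernetes Secret must be base64-encoded. As a small illustration, the following Python sketch produces the strings to paste in place of the two placeholders; the helper name to_secret_value is an assumption of this example.

```python
import base64


def to_secret_value(text):
    """Base64-encode text for the data field of a Kubernetes Secret."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    # "ADMIN" is the placeholder credential used in the examples above.
    print("username:", to_secret_value("ADMIN"))
    print("password:", to_secret_value("ADMIN"))
```

Base64 is an encoding, not encryption, so the resulting manifest should be handled as carefully as the plain-text credentials.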

Config Map

apiVersion: v1
kind: ConfigMap
metadata:
  name: bmc-exporter-conf
  namespace: monitoring
data:
  config.json: |
    {
      "username": "",
      "password": "",
      "port": "5000",
      "servers": [
        {
          "hostname": ""
        }
      ]
    }

Service

apiVersion: v1
kind: Service
metadata:
  name: bmc-monitoring-service
  namespace: monitoring
  labels:
    app: bmc-monitoring
spec:
  selector:
    app: bmc-monitoring
  ports:
    - protocol: TCP
      port: 5000
      targetPort: 5000
      name: bmc-monitoring-endpoint

Service Monitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    release: stable
    app: bmc-monitoring
  name: bmc-monitoring-service-monitor
  namespace: monitoring
spec:
  endpoints:
  - interval: 5m
    path: /metrics
    port: bmc-monitoring-endpoint
    scrapeTimeout: 1m30s
  namespaceSelector:
    matchNames:
    - monitoring
  selector:
    matchLabels:
      app: bmc-monitoring

Exposed Data

The following outlines the monitoring components exposed to Prometheus.

| Monitoring Component | Description | Endpoint |
|---|---|---|
| OAM Info | Internal module IDs and main memory sizes | /info |
| OAM Status | Device operational status | /status |
| Temperature | Current, maximum, and historical temperatures and thresholds | /temperature |
| Power | Peak and current power consumption | /power |
| Frequency | Maximum and current frequencies | /frequency |
| Ethernet Info | Ethernet configuration information and complete connectivity status | /ethernet-info |
| Ethernet Status | Ethernet status per port | /ethernet-status, /ethernet-status-counters |
| PCIe Info | PCIe information and errors | /pcie-info |
| Alerts | Error alerts and information | /alerts |
| Sensors Temperature | Temperature sensor readouts | /sensor-temperature |
| Ctemperature | Current maximum of the SoC and HBM temperatures | /ctemperature |
| Sensors Voltage | Voltage sensor readouts | /sensor-voltage |
| Sensors Voltage Monitor | Voltage monitor readouts | /sensor-voltage-monitor |
| Sensors Current | Current sensor readouts | /sensor-current |
| Security | Security-related information | /security |
| HBM | HBM information related to repairs and repair resources | /hbm |
| BMC state | BMC state (up/down) | /bmc-state |
| Exporter info | Application information, such as version | /exporter-info |
| Direct | Direct NVMe-MI information | /direct |

Data Scraping

To allow Prometheus access, enable data scraping on the target hosts by using http://<bmc_exporter_host>:<port>/metrics?target=<target_host>. For example:

http://localhost:5000/metrics?target=192.168.22.189
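
The same URL can be assembled and requested from a script, which is a convenient way to confirm that the exporter answers before Prometheus is configured. The following Python sketch uses only the standard library. The function names are assumptions of this example, and the 90-second timeout is chosen to mirror the 1m30s scrape timeout in the Service Monitor above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def scrape_url(exporter_host, port, target):
    """Build the /metrics URL for one BMC target."""
    query = urlencode({"target": target})
    return f"http://{exporter_host}:{port}/metrics?{query}"


def fetch_metrics(exporter_host, port, target, timeout=90):
    """Fetch the raw metrics text for one target.

    Requires a running exporter that can reach the target BMC.
    """
    url = scrape_url(exporter_host, port, target)
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    print(scrape_url("localhost", 5000, "192.168.22.189"))
```

Calling fetch_metrics("localhost", 5000, "192.168.22.189") against a running exporter should return the metrics text for that BMC.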

Configuration Using Prometheus

Prometheus fundamentally stores all data as time series: streams of timestamped values that belong to the same metric and the same set of labeled dimensions. The BMC data exported by the exporter can be accessed in Prometheus for easier management. For details, refer to the Prometheus documentation. The following scrape configuration is an example:

  - job_name: bmc
    scrape_interval: 30s                  # a 30s scrape interval is recommended
    metrics_path: /metrics                # the exporter exposes its own metrics at /metrics
    static_configs:
    - targets:
      - 192.168.22.189                    # BMC hosts, passed to the exporter as the target parameter
      - 192.168.22.188
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: localhost:5000         # the location of the exporter to Prometheus
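
The three relabel rules are easier to follow when traced for a single target. The Python sketch below is a simplified model of their effect, not Prometheus's actual implementation; the function names are assumptions of this example.

```python
def apply_relabeling(target, exporter_address="localhost:5000"):
    """Mimic the three relabel rules for a single static target."""
    labels = {"__address__": target}
    # Rule 1: copy the BMC address into the "target" URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Rule 2: use the BMC address as the instance label on stored series.
    labels["instance"] = labels["__param_target"]
    # Rule 3: point the actual HTTP request at the exporter.
    labels["__address__"] = exporter_address
    return labels


def effective_scrape_url(labels, metrics_path="/metrics"):
    """Return the URL that would be requested for the relabeled target."""
    return (
        f"http://{labels['__address__']}{metrics_path}"
        f"?target={labels['__param_target']}"
    )


if __name__ == "__main__":
    relabeled = apply_relabeling("192.168.22.189")
    print(relabeled["instance"])
    print(effective_scrape_url(relabeled))
```

In this model, each address listed under targets ends up as the instance label and as the target query parameter, while the request itself goes to the exporter at localhost:5000.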

Adding Grafana Dashboard and Alerts

Follow the steps below to add the Grafana dashboard and alerts.

Note

Adding the Grafana dashboard and alerts via the API is only available for Grafana versions 9.0 and 9.5.

Preparation

  1. Create a dashboard folder:

    1. Sign in to Grafana.

    2. Click Dashboards on the left-side menu.

    3. On the Dashboards page, click New and select New folder in the dropdown menu.

    4. Enter a unique name and click Create. For further details, refer to Manage dashboards | Grafana documentation.

  2. Get a folder UID:

    1. Select the folder you want.

    2. Click on Go to folder.

    3. Save the folder UID. For example, in https://habana-grafana.com/dashboards/f/toPjxZy4z/bmc-exporter, toPjxZy4z is the folder UID.

  3. Create an API Key. For further instructions, refer to API keys.
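
If the folder URL is already at hand, the UID from step 2 can be extracted with a few lines of Python. This sketch assumes the path contains /dashboards/f/<uid>/<slug>, as in the example above; other Grafana versions may lay out folder URLs differently, and the function name is hypothetical.

```python
from urllib.parse import urlparse


def folder_uid_from_url(url):
    """Extract the folder UID from a Grafana folder URL.

    Assumes the path contains /dashboards/f/<uid>/..., as in the
    example URL given in the steps above.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    for i in range(len(parts) - 2):
        if parts[i] == "dashboards" and parts[i + 1] == "f":
            return parts[i + 2]
    raise ValueError(f"not a Grafana folder URL: {url}")


if __name__ == "__main__":
    url = "https://habana-grafana.com/dashboards/f/toPjxZy4z/bmc-exporter"
    print(folder_uid_from_url(url))
```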

Importing habana_alert

  1. Before importing habana_alert, make sure to set up the following:

  2. Import habana_alert using the API:

wget <JSON VAULT>
# Set the folderUID in the alert JSON with jq.
jq '.folderUID = "<FOLDER ID>"' ./alerts.json > ./habana_alerts.json
# Replace the default datasource UID with your own datasource UID.
sed -i 's/"datasourceUid": "prometheus"/"datasourceUid": "<DATASOURCE_UID>"/g' ./habana_alerts.json

curl --data "@<JSON FILE PATH>" -H "Authorization: Bearer <API KEY>" -H "Content-Type: application/json" <GRAFANA URL>/api/v1/provisioning/alert-rules
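
The jq and sed steps can also be performed on the parsed JSON, which avoids depending on the exact spacing inside the file. The following Python sketch shows one way to do that. The function names are assumptions of this example, and the sample alert is a hypothetical, heavily trimmed structure used only to exercise the code, not the real habana_alert content.

```python
import json


def set_datasource_uid(node, new_uid, old_uid="prometheus"):
    """Recursively replace datasourceUid values equal to old_uid."""
    if isinstance(node, dict):
        return {
            key: (
                new_uid
                if key == "datasourceUid" and value == old_uid
                else set_datasource_uid(value, new_uid, old_uid)
            )
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [set_datasource_uid(item, new_uid, old_uid) for item in node]
    return node


def prepare_alert(alert, folder_uid, datasource_uid):
    """Return a copy of the alert with folderUID and datasource UID set."""
    updated = set_datasource_uid(alert, datasource_uid)
    updated["folderUID"] = folder_uid
    return updated


if __name__ == "__main__":
    # Hypothetical sample; the real file comes from the wget step above.
    sample = {"folderUID": "", "data": [{"datasourceUid": "prometheus"}]}
    print(json.dumps(prepare_alert(sample, "toPjxZy4z", "my-ds-uid"), indent=2))
```

The result can be written to a file with json.dump and posted with the curl command above.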

Importing habana_dashboard

Import habana_dashboard using the API:

wget <JSON VAULT>
# Update the dashboard with your DataSource uid.
sed -i 's/"uid": "prometheus"/"uid": "<DATASOURCE_UID>"/g' ./habana_dashboard.json
curl -X POST -H "Authorization: Bearer <API KEY>" -H "Content-Type: application/json" -d "@<JSON File PATH>" <GRAFANA URL>/api/dashboards/db

To import habana_dashboard using the GUI, refer to Manage dashboards | Grafana documentation.