BMC Exporter Guide
BMC Exporter Guide¶
The BMC exporter exposes metrics to Prometheus and keeps the data organized, secure, and efficiently exported while scaling out to numerous machines. Prometheus is an open-source monitoring application commonly used alongside Grafana, a dashboard visualization tool. Grafana uses Alertmanager to manage alerts and notify users of potential issues.
The BMC exporter uses the Redfish protocol to send commands during each data scrape. Redfish lets users interact with systems through standard web services, such as HTTP and REST APIs.
Note
The BMC exporter is supported for Gaudi 2 only.
Prerequisites¶
To use the BMC exporter for Prometheus, the management node (local system) must have one of the following containerized solutions installed:
Docker
Podman
Kubernetes
Using the BMC Exporter¶
Follow the steps below to get started with the BMC exporter.
Creating a Configuration File¶
To use the BMC exporter, create a JSON configuration file containing the server access details. Two types of configuration file are supported, basic and advanced, as described below:
Basic Configuration - if all hosts share the same username and password, set that single set of credentials in the configuration file. The BMC exporter treats it as the default and uses it to access every host. For example:
{
  "username": "ADMIN",
  "password": "ADMIN"
}
Advanced Configuration - if hosts have different credentials, specify each host's username and password alongside its IP address in the configuration file. The default port for the BMC exporter is 4001; set "port" to change it if necessary. For example:
{
  "username": "ADMIN",
  "password": "ADMIN",
  "port": "5000",
  "servers": [
    {
      "password": "ADMIN",
      "username": "ADMIN",
      "hostname": "192.168.22.188"
    },
    {
      "password": "ADMIN",
      "username": "ADMIN",
      "hostname": "192.168.22.189"
    }
  ]
}
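For a large number of hosts, the advanced configuration file can be generated rather than written by hand. The following is a minimal sketch, assuming the layout shown in the example above; the function name `build_config` and the tuple format of its input are hypothetical and not part of the BMC exporter.

```python
import json


def build_config(default_user, default_password, port, servers):
    """Return a dict in the advanced configuration layout.

    `servers` is a list of (hostname, username, password) tuples.
    """
    return {
        "username": default_user,
        "password": default_password,
        # The examples in this guide give the port as a JSON string.
        "port": str(port),
        "servers": [
            {"password": password, "username": user, "hostname": host}
            for host, user, password in servers
        ],
    }


config = build_config(
    "ADMIN",
    "ADMIN",
    5000,
    [
        ("192.168.22.188", "ADMIN", "ADMIN"),
        ("192.168.22.189", "ADMIN", "ADMIN"),
    ],
)

# Serialize the configuration; write this text to config.json as needed.
config_text = json.dumps(config, indent=2)
print(config_text)
```

The printed text can be saved as config.json and passed to the exporter in the same way as a hand-written file.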
Highlights:
The BMC exporter is specifically tested and designed for use with Intel Gaudi drivers. Attempting to run it with other drivers, such as a developer driver, may result in missing functionality.
Ulimit: The exporter requires one file descriptor per BMC for the UDP socket, so it may be necessary to increase the limit. To check the current number of file descriptors a process can open, run the ulimit -n command. You can maximize the number of file descriptors that the exporter can use by running ulimit -Sn $(ulimit -Hn).
Upon receiving SIGINT or SIGTERM signals, the exporter gracefully shuts down its web server, waits for all ongoing scrapes to complete, and then closes all BMC connections.
Alerts: In addition to alerting on BMC metrics, you may also want to receive notifications if the exporter becomes unhealthy. Both the exporter and its underlying BMC library were developed with Prometheus in mind, providing a wide range of metrics including collection latency and the number of attempted IPMI commands for each IP address.
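The file descriptor requirement described under Ulimit can be checked before starting the exporter. The sketch below is an illustration only: it reads the limits of the current process with Python's standard `resource` module (Unix only), and the `reserve` value of 64 descriptors for the web server and other files is an assumption, not a figure from the exporter.

```python
import resource


def descriptors_sufficient(bmc_count, soft_limit, reserve=64):
    """Return True if soft_limit covers one descriptor per BMC plus a reserve.

    `reserve` is a hypothetical allowance for non-BMC descriptors.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= bmc_count + reserve


# Soft and hard limits, as reported by `ulimit -Sn` and `ulimit -Hn`.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
bmc_count = 2  # number of BMCs listed in the configuration file
print("soft limit:", soft, "hard limit:", hard)
print("sufficient:", descriptors_sufficient(bmc_count, soft))
```

If the check fails, raise the soft limit in the shell that launches the exporter with the ulimit command shown above.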
Running the BMC Exporter¶
To run the BMC exporter, use one of the solutions described below.
Docker¶
To deploy the BMC exporter in Docker, run the following command:
docker run -p 5003:5000 -v `pwd`:/tmp habana-bmc-exporter:1.15.1 -config /tmp/config.json
Kubernetes¶
To deploy the BMC exporter in Kubernetes, apply the files described below. To apply all the files simultaneously, run kubectl apply -f <file1> -f <file2>.
Note
Make sure the Kubernetes Prometheus stack is installed before you start. Refer to helm-charts.
Deployment¶
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bmc-monitoring
  namespace: monitoring
  labels:
    app: bmc-monitoring
spec:
  selector:
    matchLabels:
      app: bmc-monitoring
  template:
    metadata:
      labels:
        app: bmc-monitoring
    spec:
      containers:
        - name: bmc-monitoring
          image: vault.habana.ai/habana-bmc-exporter/bmc-exporter:1.15.1
          imagePullPolicy: Always
          args:
            - "--config"
            - "/tmp/config.json"
          resources:
            limits:
              memory: 3Gi
              cpu: 500m
            requests:
              cpu: 350m
              memory: 2Gi
          # readiness probes mark the service available to accept traffic.
          readinessProbe:
            httpGet:
              path: /debug/readiness
              port: 5000
            initialDelaySeconds: 50
            periodSeconds: 15
            timeoutSeconds: 5
            successThreshold: 1
            failureThreshold: 2
          # liveness probes mark the service alive or dead (to be restarted).
          livenessProbe:
            httpGet:
              path: /debug/liveness
              port: 5000
            initialDelaySeconds: 50
            periodSeconds: 30
            timeoutSeconds: 5
            successThreshold: 1
            failureThreshold: 2
          volumeMounts:
            - name: config-volume
              mountPath: /tmp
          env:
            - name: USERNAME
              valueFrom:
                secretKeyRef:
                  name: bmc-monitoring-secret
                  key: username
            - name: PASSWORD
              valueFrom:
                secretKeyRef:
                  name: bmc-monitoring-secret
                  key: password
      volumes:
        - name: config-volume
          configMap:
            name: bmc-exporter-conf
Secret¶
apiVersion: v1
kind: Secret
metadata:
  name: bmc-monitoring-secret
  namespace: monitoring
type: Opaque
data:
  username: <BASE64 USERNAME>
  password: <BASE64 PASSWORD>
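The values under `data` in a Kubernetes Secret must be base64-encoded. As a small sketch of how the `<BASE64 USERNAME>` and `<BASE64 PASSWORD>` placeholders could be produced, the helper below (the name `encode_secret_value` is hypothetical) encodes a plain-text credential with Python's standard library.

```python
import base64


def encode_secret_value(value):
    """Base64-encode a plain-text value for the `data` section of a Secret."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


# Example credential taken from the configuration examples in this guide.
print(encode_secret_value("ADMIN"))
```

Base64 is an encoding, not encryption, so the manifest containing these values should be handled as sensitive.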
Config Map¶
apiVersion: v1
kind: ConfigMap
metadata:
  name: bmc-exporter-conf
  namespace: monitoring
data:
  config.json: |
    {
      "username": "",
      "password": "",
      "port": "5000",
      "servers": [
        {
          "hostname": ""
        }
      ]
    }
Service¶
apiVersion: v1
kind: Service
metadata:
  name: bmc-monitoring-service
  namespace: monitoring
  labels:
    app: bmc-monitoring
spec:
  selector:
    app: bmc-monitoring
  ports:
    - protocol: TCP
      port: 5000
      targetPort: 5000
      name: bmc-monitoring-endpoint
Service Monitor¶
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    release: stable
    app: bmc-monitoring
  name: bmc-monitoring-service-monitor
  namespace: monitoring
spec:
  endpoints:
    - interval: 5m
      path: /metric
      port: bmc-monitoring-endpoint
      scrapeTimeout: 1m30s
  namespaceSelector:
    matchNames:
      - monitoring
  selector:
    matchLabels:
      app: bmc-monitoring
Exposed Data¶
The following outlines the monitoring components exposed to Prometheus.
Monitoring Component | Description | Endpoint |
---|---|---|
OAM Info | Internal module IDs and main memory sizes | /info |
OAM Status | Device operational status | /status |
Temperature | Current, maximum, and historical temperatures and thresholds | /temperature |
Power | Peak and current power consumption | /power |
Frequency | Maximum and current frequencies | /frequency |
Ethernet Info | Ethernet configuration information and complete connectivity status | /ethernet-info |
Ethernet Status | Ethernet status per port | /ethernet-status, /ethernet-status-counters |
PCIe Info | PCIe information and errors | /pcie-info |
Alerts | Error alerts and information | /alerts |
Sensors Temperature | Temperature sensor readouts | /sensor-temperature |
Ctemperature | Current maximum of the SOC and HBM temperatures | /ctemperature |
Sensors Voltage | Voltage sensor readouts | /sensor-voltage |
Sensors Voltage Monitor | Voltage monitor readouts | /sensor-voltage-monitor |
Sensors Current | Current sensor readouts | /sensor-current |
Security | Security-related information | /security |
HBM | HBM information related to repairs and repair resources | /hbm |
BMC state | BMC state (up/down) | /bmc-state |
Exporter info | Application information, such as version | /exporter-info |
Direct | Direct NVMe-MI information | /direct |
Data Scraping¶
To scrape a target host, Prometheus requests the exporter's /metrics endpoint and passes the address of the BMC in the target query parameter, in the form http://<bmc_exporter_host>:<port>/metrics?target=<bmc_address>. For example:
http://localhost:5000/metrics?target=192.168.22.189
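The scrape URL follows a fixed pattern, so it can be assembled programmatically when checking several BMCs by hand. The sketch below builds the URL from the example above with the standard library; the function name `build_scrape_url` is hypothetical.

```python
from urllib.parse import urlencode


def build_scrape_url(exporter_host, exporter_port, bmc_address):
    """Build the exporter URL that scrapes one BMC via the `target` parameter."""
    query = urlencode({"target": bmc_address})
    return f"http://{exporter_host}:{exporter_port}/metrics?{query}"


for bmc in ("192.168.22.188", "192.168.22.189"):
    print(build_scrape_url("localhost", 5000, bmc))
```

Each printed URL can be opened in a browser or fetched with curl to confirm that the exporter can reach the corresponding BMC.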
Configuration Using Prometheus¶
Prometheus stores all data as time series: streams of timestamped values that belong to the same metric and the same set of labeled dimensions. The BMC data exported by the exporter can be accessed in Prometheus for easier management. For details, refer to the Prometheus documentation. The following is an example scrape configuration:
- job_name: bmc
  scrape_interval: 30s  # a 30s scrape interval is recommended
  metrics_path: /metrics  # the exporter exposes its own metrics at /metrics
  static_configs:
    - targets:
        - 192.168.22.189  # addresses of the BMCs to scrape through the exporter
        - 192.168.22.188
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: localhost:5000  # the address of the exporter as seen from Prometheus
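The three relabeling rules in the scrape configuration can be hard to read at first. The sketch below is not Prometheus code; it is a simplified Python model of what those rules do to the labels of one target, under the assumption that each rule uses the default regex and replacement. The function name `relabel` is hypothetical.

```python
def relabel(target_address, exporter_address="localhost:5000"):
    """Model the effect of the three relabel rules on one static target."""
    # Prometheus starts with the configured target in __address__.
    labels = {"__address__": target_address}
    # Rule 1: copy the BMC address into the `target` URL query parameter.
    labels["__param_target"] = labels["__address__"]
    # Rule 2: keep the BMC address as the `instance` label on the metrics.
    labels["instance"] = labels["__param_target"]
    # Rule 3: send the actual HTTP request to the exporter, not the BMC.
    labels["__address__"] = exporter_address
    return labels


print(relabel("192.168.22.189"))
```

The result is that Prometheus contacts the exporter at localhost:5000 with target=192.168.22.189, while the stored series keep the BMC address as their instance label.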
Adding Grafana Dashboard and Alerts¶
Follow the steps below to add the Grafana dashboard and alerts.
Note
Adding the Grafana dashboard and alerts via the API is only available for Grafana versions 9.0 and 9.5.
Preparation¶
Create a dashboard folder:
Sign in to Grafana.
Click Dashboards on the left-side menu.
On the Dashboards page, click New and select New folder in the dropdown menu.
Enter a unique name and click Create. For further details, refer to Manage dashboards | Grafana documentation.
Get a folder UID:
Select the folder you want.
Click on Go to folder.
Save the folder UID. For example, in https://habana-grafana.com/dashboards/f/toPjxZy4z/bmc-exporter, toPjxZy4z is the folder UID.
Create an API Key. For further instructions, refer to API keys.
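The folder UID from the preparation steps can also be extracted from the folder URL in a script. This is a minimal sketch that assumes the URL layout of the example above (/dashboards/f/<uid>/<name>); the function name `folder_uid_from_url` is hypothetical, and other Grafana versions may use a different path layout.

```python
from urllib.parse import urlparse


def folder_uid_from_url(folder_url):
    """Return the folder UID from a URL shaped like /dashboards/f/<uid>/<name>."""
    parts = [p for p in urlparse(folder_url).path.split("/") if p]
    # Expected path segments: "dashboards", "f", <uid>, <name>
    if len(parts) >= 3 and parts[0] == "dashboards" and parts[1] == "f":
        return parts[2]
    raise ValueError(f"not a Grafana folder URL: {folder_url}")


print(folder_uid_from_url(
    "https://habana-grafana.com/dashboards/f/toPjxZy4z/bmc-exporter"
))
```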
Importing habana_alert¶
Before importing habana_alert, make sure to set up the following:
<API KEY>
<JSON FILE PATH>
Import habana_alert using the API:
wget <JSON VAULT>
# Set the folderUID in the alert JSON by running the jq command.
jq '.folderUID = "<FOLDER ID>"' ./alerts.json > ./habana_alerts.json
# Replace the datasource UID with the UID of your datasource.
sed -i 's/"datasourceUid": "prometheus"/"datasourceUid": "<DATASOURCE_UID>"/g' ./habana_alerts.json
curl --data "@<JSON FILE PATH>" -H "Authorization: Bearer <API KEY>" -H "Content-Type: application/json" <GRAFANA URL>/api/v1/provisioning/alert-rules
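The jq and sed steps above edit the alert JSON as text. As an alternative sketch, the same two changes can be made on the parsed JSON, which avoids depending on exact whitespace in the file. The structure of the sample alert below is invented for illustration and is not the real habana_alert file; the function names are hypothetical.

```python
import json


def replace_datasource_uid(node, new_uid, old_uid="prometheus"):
    """Return a copy of node with every matching "datasourceUid" replaced."""
    if isinstance(node, dict):
        return {
            key: new_uid
            if key == "datasourceUid" and value == old_uid
            else replace_datasource_uid(value, new_uid, old_uid)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [replace_datasource_uid(item, new_uid, old_uid) for item in node]
    return node


def prepare_alert(alert, folder_uid, datasource_uid):
    """Apply the folder UID and datasource UID changes to an alert dict."""
    patched = replace_datasource_uid(alert, datasource_uid)
    patched["folderUID"] = folder_uid
    return patched


# Hypothetical, simplified alert used only to demonstrate the functions.
sample_alert = {
    "folderUID": "",
    "data": [{"refId": "A", "datasourceUid": "prometheus"}],
}
print(json.dumps(prepare_alert(sample_alert, "toPjxZy4z", "my-ds-uid"), indent=2))
```

The patched dictionary can then be written to the file passed to curl as <JSON FILE PATH>.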
Importing habana_dashboard¶
Import habana_dashboard using the API:
wget <JSON VAULT>
# Update the dashboard with the UID of your datasource.
sed -i 's/"uid": "prometheus"/"uid": "<DATASOURCE_UID>"/g' ./habana_dashboard.json
curl -X POST -H "Authorization: Bearer <API KEY>" -H "Content-Type: application/json" -d "@<JSON FILE PATH>" <GRAFANA URL>/api/dashboards/db
To import habana_dashboard using the GUI, refer to Manage dashboards | Grafana documentation.