RDMA / RoCE NIC Telemetry
Grafana dashboard for rdma_exporter covering port health, throughput, congestion, and error diagnostics.
RDMA / RoCE Port Telemetry Dashboard
This Grafana dashboard visualizes telemetry emitted by the rdma_exporter Prometheus exporter. The dashboard focuses on InfiniBand and RoCE ports, highlighting availability, throughput, congestion, and error signals that are read from /sys/class/infiniband on each host.
Prerequisites
- Grafana 9.0 or later with a Prometheus data source.
- rdma_exporter running on each RDMA-capable node with HTTP access to its /metrics endpoint.
- Prometheus scraping the exporter at an interval that matches your operational needs (the dashboard defaults to 1m/5m/15m range selectors).
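To confirm that a node meets the exporter prerequisite, you can fetch its /metrics endpoint and look for rdma_* samples. The following is a minimal sketch rather than part of the dashboard: the function names are hypothetical, and it assumes the exporter serves the standard Prometheus text exposition format.

```python
from urllib.request import urlopen


def rdma_metric_names(exposition_text: str) -> set:
    """Return the names of rdma_* samples found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line is either "name{labels} value" or "name value".
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith("rdma_"):
            names.add(name)
    return names


def check_exporter(url: str, timeout: float = 5.0) -> set:
    """Fetch a /metrics URL and return the rdma_* metric names it exposes."""
    with urlopen(url, timeout=timeout) as response:
        return rdma_metric_names(response.read().decode("utf-8"))
```

Calling check_exporter("http://host-a.example.com:9879/metrics") against a healthy node should return a non-empty set that includes rdma_port_info; an empty set suggests the exporter is running but found no RDMA devices.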
Data Pipeline
- rdma_exporter traverses the RDMA sysfs hierarchy and exposes counters such as rdma_port_rcv_data_total and rdma_port_xmit_wait_total, along with the rdma_port_info gauge that carries metadata (device, port, link state, speed, width, etc.).
- Prometheus scrapes the exporter and stores the metrics with labels job, instance, device, and port.
- Grafana queries Prometheus using the expressions embedded in the dashboard panels and renders time-series, single-stat, and table visualizations for operators.
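The first stage of the pipeline can be illustrated with a short sketch that walks /sys/class/infiniband/<device>/ports/<port>/counters and reads each counter file. This is not the exporter's actual implementation; the function names are hypothetical, and the sysfs-to-metric naming rule is an assumption inferred from the two counter names mentioned above.

```python
from pathlib import Path


def read_port_counters(root: str = "/sys/class/infiniband") -> dict:
    """Map (device, port) to {counter_name: value} read from RDMA sysfs."""
    result = {}
    base = Path(root)
    if not base.is_dir():
        return result  # No RDMA devices are present on this host.
    for device in sorted(base.iterdir()):
        ports_dir = device / "ports"
        if not ports_dir.is_dir():
            continue
        for port in sorted(ports_dir.iterdir()):
            counters_dir = port / "counters"
            if not counters_dir.is_dir():
                continue
            counters = {}
            for counter_file in sorted(counters_dir.iterdir()):
                try:
                    counters[counter_file.name] = int(counter_file.read_text().strip())
                except (OSError, ValueError):
                    # Skip counters that are unreadable or not plain integers.
                    continue
            result[(device.name, port.name)] = counters
    return result


def to_metric_name(counter_name: str) -> str:
    """Assumed naming rule: port_rcv_data -> rdma_port_rcv_data_total."""
    return f"rdma_{counter_name}_total"
```

The (device, port) keys correspond to the device and port labels that Prometheus stores alongside job and instance.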
A minimal Prometheus scrape configuration might look like the following:
    # prometheus.yml
    global:
      scrape_interval: 30s
      scrape_timeout: 5s
    scrape_configs:
      - job_name: rdma-exporter
        static_configs:
          - targets:
              - host-a.example.com:9879
              - host-b.example.com:9879
        metrics_path: /metrics
        scheme: http

Note: No CollectD layer is required. If you already use CollectD, you can expose its metrics via the write_prometheus plugin on a different port; this dashboard is specifically tuned for the rdma_exporter metric names listed below.
Importing the Dashboard
- Open Grafana and navigate to Dashboards → Import.
- Click Upload JSON file and select dashboards/rdma_exporter_dashboard.json, or paste its contents into the JSON textarea.
- Choose your Prometheus data source when prompted (the default variable is named Datasource).
- Save the dashboard; it will appear under the name RDMA / RoCE Port Telemetry (rdma_exporter).
Template Variables
The dashboard ships with template variables to scope queries:
Panels at a Glance
Extending the Dashboard
- Duplicate panels and swap in any other rdma_*_total counters exposed by the exporter (e.g., rdma_duplicate_request_total).
- Adjust the $interval variable defaults if your Prometheus scrape interval is higher than 60 seconds.
- Pair the dashboard with Grafana alerts on critical expressions (e.g., sustained rdma_link_downed_total rates).
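When reasoning about what a "sustained" rate on a counter such as rdma_link_downed_total means, it can help to see the arithmetic spelled out. The sketch below is a simplified illustration, not how Grafana or Prometheus evaluate alerts: it sums the increase across ordered samples, treats any drop as a counter reset, and calls a condition sustained only if every evaluation window saw an increase. Both function names are hypothetical, and unlike PromQL's increase() it does no extrapolation.

```python
def counter_increase(samples: list) -> float:
    """Total increase across time-ordered counter samples, treating a drop as a reset."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter restarted (e.g., exporter or host reboot): count from zero.
            total += current
    return total


def is_sustained(window_increases: list, threshold: float = 0.0) -> bool:
    """True when every consecutive window shows an increase above the threshold."""
    return bool(window_increases) and all(inc > threshold for inc in window_increases)
```

For example, three consecutive 5-minute windows that each record at least one link-down event would count as sustained, whereas a single isolated event would not.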
