NVIDIA DCGM Exporter
This dashboard displays GPU metrics collected by NVIDIA dcgm-exporter. Prometheus scrapes them from the exporter's metrics endpoint, which is added to Prometheus as a separate endpoint via a ServiceMonitor.
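Once an exporter is running, a quick way to see which GPU metrics the dashboard can draw on is to list the metric names the endpoint exposes. The sketch below is a hypothetical helper, not part of dcgm-exporter: it reads Prometheus text-format output on stdin and prints the distinct metric names that start with DCGM_FI_. The port 9400 in the usage line is the one used in the Consul registration further down; localhost is an assumption.

```shell
# Hypothetical helper: print the distinct DCGM metric names found in
# Prometheus text-format output read from stdin.
dcgm_metric_names() {
    # Comment lines start with '#', so anchoring at line start skips HELP/TYPE lines.
    grep -o '^DCGM_FI_[A-Z0-9_]*' | LC_ALL=C sort -u
}

# Usage (assumes an exporter is listening locally on port 9400):
#   curl -s http://localhost:9400/metrics | dcgm_metric_names
```

An empty result suggests the exporter is not running or is not exposing any DCGM counters on that node.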
Management Node (download and build dcgm-exporter):
[yaoge123]$ git clone https://github.com/NVIDIA/dcgm-exporter.git
[yaoge123]$ cd dcgm-exporter
[yaoge123]$ make binary
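The compute-node script below downloads three files from the build tree over HTTP, so it can help to confirm they exist after the build. This is a minimal sketch with a hypothetical helper name, check_build_tree. The relative paths are taken from the wget URLs in the compute-node script. That make binary leaves the binary at cmd/dcgm-exporter/dcgm-exporter is inferred from those URLs and may differ between dcgm-exporter versions.

```shell
# Hypothetical helper: check that a dcgm-exporter build tree contains the
# files the compute nodes fetch. Returns non-zero if any file is missing.
check_build_tree() {
    root="$1"
    missing=0
    for f in cmd/dcgm-exporter/dcgm-exporter \
             etc/default-counters.csv \
             etc/dcp-metrics-included.csv; do
        if [ ! -f "$root/$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage (the checkout location is an assumption):
#   check_build_tree "$HOME/dcgm-exporter" && echo "build tree looks complete"
```

How the tree is published under http://mgmt/dcgm-exporter/ is not covered here. Any web server that exposes the checkout directory would do.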
Compute Node shell script:
#!/bin/bash
# Install and start dcgm-exporter only on nodes that have an NVIDIA GPU.
if /sbin/lspci | /usr/bin/grep -q NVIDIA; then
    # Fetch the binary and counter lists built on the management node.
    wget -q -O /usr/local/sbin/dcgm-exporter http://mgmt/dcgm-exporter/cmd/dcgm-exporter/dcgm-exporter
    mkdir -p /etc/dcgm-exporter
    wget -q -O /etc/dcgm-exporter/default-counters.csv http://mgmt/dcgm-exporter/etc/default-counters.csv
    wget -q -O /etc/dcgm-exporter/dcp-metrics-included.csv http://mgmt/dcgm-exporter/etc/dcp-metrics-included.csv
    chmod +x /usr/local/sbin/dcgm-exporter
    # Run the exporter for 2 seconds; use the DCP counter list only if it logs "Collecting DCP Metrics".
    if [[ "$(timeout 2s /usr/local/sbin/dcgm-exporter 2>&1 | grep DCP)" =~ "\"Collecting DCP Metrics\"" ]]; then
        collectors="dcp-metrics-included.csv"
    else
        collectors="default-counters.csv"
    fi
    # The heredoc delimiter is unquoted, so $collectors is expanded when the unit file is written.
    cat > /etc/systemd/system/dcgm-exporter.service <<EOF
[Unit]
Description=Prometheus DCGM exporter
Wants=network-online.target nvidia-dcgm.service
After=network-online.target nvidia-dcgm.service
[Service]
Type=simple
Restart=always
ExecStart=/usr/local/sbin/dcgm-exporter -f /etc/dcgm-exporter/$collectors
[Install]
WantedBy=multi-user.target
EOF
    systemctl daemon-reload
    systemctl enable dcgm-exporter.service
    systemctl restart dcgm-exporter.service
    # Register the exporter with Consul. The payload is double-quoted so that ${HOSTNAME} expands.
    curl -X PUT -d "{\"id\": \"${HOSTNAME}_dcgm-exporter\",\"name\": \"dcgm_exporter\",\"address\": \"${HOSTNAME}\",\"port\": 9400,\"tags\": [\"prometheus\",\"hpc\",\"compute\"],\"checks\": [{\"http\": \"http://${HOSTNAME}:9400/metrics\",\"interval\": \"60s\"}]}" http://consul:8500/v1/agent/service/register
fi
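The Consul registration payload is the part of the script most sensitive to shell quoting. As an alternative sketch, the JSON can be built with printf and piped to curl, which avoids escaping every quote. The function name consul_payload is hypothetical. The field values are the ones used in the script above.

```shell
# Hypothetical helper: print the Consul service definition for one host.
# The format string is single-quoted, so JSON quotes need no escaping, and
# the host name is substituted through %s.
consul_payload() {
    host="$1"
    printf '{"id": "%s_dcgm-exporter", "name": "dcgm_exporter", "address": "%s", "port": 9400, "tags": ["prometheus","hpc","compute"], "checks": [{"http": "http://%s:9400/metrics", "interval": "60s"}]}\n' \
        "$host" "$host" "$host"
}

# Usage: curl reads the request body from stdin with "-d @-".
#   consul_payload "$HOSTNAME" | curl -X PUT -d @- http://consul:8500/v1/agent/service/register
```

Printing the payload first (consul_payload "$HOSTNAME") makes it easy to inspect what would be sent before anything is registered.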