NVIDIA DCGM Exporter

This dashboard displays metrics from the NVIDIA DCGM Exporter.


This dashboard displays GPU metrics that Prometheus scrapes from the metrics endpoint of NVIDIA dcgm-exporter. A separate endpoint is added to Prometheus via a Service Monitor.

Management Node (download and build dcgm-exporter):

[yaoge123]$ git clone https://github.com/NVIDIA/dcgm-exporter.git
[yaoge123]$ cd dcgm-exporter
[yaoge123]$ make binary

Compute Node shell script:

#!/bin/bash
# Install and start dcgm-exporter only on nodes that have an NVIDIA GPU.
if [[ $(/sbin/lspci | /usr/bin/grep NVIDIA) ]]; then
	# Fetch the binary and the counter definitions from the management node.
	wget -q -O /usr/local/sbin/dcgm-exporter http://mgmt/dcgm-exporter/cmd/dcgm-exporter/dcgm-exporter
	mkdir -p /etc/dcgm-exporter
	wget -q -O /etc/dcgm-exporter/default-counters.csv http://mgmt/dcgm-exporter/etc/default-counters.csv
	wget -q -O /etc/dcgm-exporter/dcp-metrics-included.csv http://mgmt/dcgm-exporter/etc/dcp-metrics-included.csv
	chmod +x /usr/local/sbin/dcgm-exporter

	# Use the DCP counter list only if a short trial run reports collecting DCP metrics.
	if [[ "$(timeout 2s /usr/local/sbin/dcgm-exporter 2>&1 | grep DCP)" =~ "\"Collecting DCP Metrics\"" ]]; then
		collectors="dcp-metrics-included.csv"
	else
		collectors="default-counters.csv"
	fi

	# The unquoted EOF lets the shell expand $collectors while writing the unit file.
	cat > /etc/systemd/system/dcgm-exporter.service <<EOF
[Unit]
Description=Prometheus DCGM exporter
Wants=network-online.target nvidia-dcgm.service
After=network-online.target nvidia-dcgm.service

[Service]
Type=simple
Restart=always
ExecStart=/usr/local/sbin/dcgm-exporter -f /etc/dcgm-exporter/$collectors

[Install]
WantedBy=multi-user.target
EOF
	systemctl daemon-reload
	systemctl enable dcgm-exporter.service
	systemctl restart dcgm-exporter.service

	# Register the exporter with Consul; double quotes are needed so that ${HOSTNAME} expands.
	curl -X PUT -d "{\"id\": \"${HOSTNAME}_dcgm-exporter\",\"name\": \"dcgm_exporter\",\"address\": \"${HOSTNAME}\",\"port\": 9400,\"tags\": [\"prometheus\",\"hpc\",\"compute\"],\"checks\": [{\"http\": \"http://${HOSTNAME}:9400/metrics\",\"interval\": \"60s\"}]}" http://consul:8500/v1/agent/service/register
fi
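After the service starts, it can be useful to confirm that the exporter is actually reporting GPUs before relying on the dashboard. The snippet below is only a minimal sketch and is not part of the original script: `dcgm_gpu_count` is a hypothetical helper name, and it assumes that the `DCGM_FI_DEV_GPU_UTIL` counter is enabled in the selected CSV file and that each sample carries a `gpu` label.

```shell
# Hypothetical helper: count the distinct GPUs found in Prometheus text
# format read from stdin.
# Assumption: DCGM_FI_DEV_GPU_UTIL samples carry a gpu="<index>" label.
dcgm_gpu_count() {
	grep '^DCGM_FI_DEV_GPU_UTIL{' \
		| grep -o 'gpu="[^"]*"' \
		| sort -u \
		| wc -l \
		| tr -d ' '
}

# Example use on a compute node (port 9400 as in the Consul check above):
#   curl -s http://localhost:9400/metrics | dcgm_gpu_count
```

A result of 0 suggests that the exporter is not running, that DCGM sees no GPUs, or that the counter is not listed in the CSV file passed with `-f`.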
