
NVIDIA DCGM Exporter

This dashboard displays GPU metrics collected by NVIDIA dcgm-exporter and scraped by Prometheus from the exporter's metrics endpoint. A separate endpoint is added to Prometheus via a Service Monitor.
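
Once an exporter is running on a GPU node, it can help to confirm that the metrics endpoint returns GPU samples before pointing Prometheus at it. The sketch below is only an illustration: the helper name `gpu_util_samples` is hypothetical, and it assumes the `DCGM_FI_DEV_GPU_UTIL` counter is enabled in the counters file in use and that the exporter listens on port 9400, the port used in the registration step on this page.

```shell
# Hypothetical helper: keep only the DCGM_FI_DEV_GPU_UTIL sample lines from
# Prometheus exposition text read on stdin (HELP/TYPE comment lines are dropped).
gpu_util_samples() {
	grep '^DCGM_FI_DEV_GPU_UTIL{'
}

# Usage against a live exporter (assumes it listens on localhost:9400):
#   curl -s http://localhost:9400/metrics | gpu_util_samples
```

If the command prints nothing, the exporter is either not running or not exporting that counter.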

Management Node: (download and build dcgm-exporter)

[yaoge123]$ git clone https://github.com/NVIDIA/dcgm-exporter.git
[yaoge123]$ cd dcgm-exporter
[yaoge123]$ make binary

Compute Node shell script:

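#!/bin/bash
# Uses bash features ([[ ]] and =~), so it must run under bash rather than sh.
# Only install the exporter on nodes that have an NVIDIA device.
if [[ $(/sbin/lspci|/usr/bin/grep NVIDIA) ]];then
	# Fetch the binary and counter definitions built on the management node.
	wget -q -O /usr/local/sbin/dcgm-exporter http://mgmt/dcgm-exporter/cmd/dcgm-exporter/dcgm-exporter
	mkdir -p /etc/dcgm-exporter
	wget -q -O /etc/dcgm-exporter/default-counters.csv http://mgmt/dcgm-exporter/etc/default-counters.csv
	wget -q -O /etc/dcgm-exporter/dcp-metrics-included.csv http://mgmt/dcgm-exporter/etc/dcp-metrics-included.csv
	chmod +x /usr/local/sbin/dcgm-exporter

	# Run the exporter briefly; use the DCP counter set only if it reports collecting DCP metrics.
	if [[ "$(timeout 2s /usr/local/sbin/dcgm-exporter 2>&1|grep DCP)" =~ "\"Collecting DCP Metrics\"" ]];then
		collectors="dcp-metrics-included.csv"
	else
		collectors="default-counters.csv"
	fi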

	cat > /etc/systemd/system/dcgm-exporter.service <<EOF
[Unit]
Description=Prometheus DCGM exporter
Wants=network-online.target nvidia-dcgm.service
After=network-online.target nvidia-dcgm.service

[Service]
Type=simple
Restart=always
ExecStart=/usr/local/sbin/dcgm-exporter -f /etc/dcgm-exporter/$collectors

[Install]
WantedBy=multi-user.target
EOF
	systemctl daemon-reload
	systemctl enable dcgm-exporter.service
	systemctl restart dcgm-exporter.service
	# Register the exporter with Consul. The payload is double-quoted so that ${HOSTNAME} expands and the escaped quotes yield valid JSON.
	curl -X PUT -d "{\"id\": \"${HOSTNAME}_dcgm-exporter\",\"name\": \"dcgm_exporter\",\"address\": \"${HOSTNAME}\",\"port\": 9400,\"tags\": [\"prometheus\",\"hpc\",\"compute\"],\"checks\": [{\"http\": \"http://${HOSTNAME}:9400/metrics\",\"interval\": \"60s\"}]}" http://consul:8500/v1/agent/service/register
fi
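
The Consul registration payload is easy to get wrong because of shell quoting. One way to keep it readable is to build the JSON with `printf` in a small function and pipe it to `curl`. This is a sketch, not part of the original script: the function name `consul_payload` is hypothetical, and the field values simply mirror the registration call in the script above.

```shell
# Hypothetical helper: print the Consul service registration JSON for one host.
# $1 is the hostname to register; port 9400 and the tags mirror the script above.
consul_payload() {
	printf '{"id": "%s_dcgm-exporter","name": "dcgm_exporter","address": "%s","port": 9400,"tags": ["prometheus","hpc","compute"],"checks": [{"http": "http://%s:9400/metrics","interval": "60s"}]}\n' "$1" "$1" "$1"
}

# Usage (assumes the Consul agent is reachable at consul:8500, as in the script):
#   consul_payload "$HOSTNAME" | curl -X PUT -d @- http://consul:8500/v1/agent/service/register
```

Because the format string is single-quoted and the hostname is passed as a `printf` argument, no backslash escaping is needed and the payload can be inspected before it is sent.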
