GPU Health - Cluster
Cluster Wide View of Common GPU Errors
This dashboard shows common GPU errors exported by the NVIDIA DCGM Exporter.
How to implement:
- This dashboard assumes you are using NVIDIA DCGM Exporter version 3.3.5-3.4.1-ubuntu22.04 or later. You can find an example bootstrap script here, which installs DCGM Exporter and runs it in a Docker container.
- For an example of a Prometheus + Grafana + cluster architecture, see: https://github.com/aws-samples/awsome-distributed-training/tree/main/4.validation_and_observability/4.prometheus-grafana
About this dashboard
This dashboard displays metrics for common GPU errors, including Uncorrectable Remapped Rows, Correctable Remapped Rows, XID Error Codes, Row Remap Failure, Thermal Violations, and Missing GPUs (from nvidia-smi).
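To sanity-check the same signals outside Grafana, the sketch below scans text scraped from a DCGM Exporter metrics endpoint and reports any error metric with a non-zero value. It rests on several assumptions: the metric names are taken from DCGM field identifiers and may not match what your exporter's collector configuration actually emits, the `Hostname` and `gpu` label names are assumed from typical exporter output, and `find_gpu_errors` is a hypothetical helper, not part of this dashboard.

```python
import re

# Assumed metric names (DCGM field identifiers). Check them against your
# exporter's counters file, since the exported set is configurable.
ERROR_METRICS = {
    "DCGM_FI_DEV_XID_ERRORS": "XID error",
    "DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS": "uncorrectable remapped rows",
    "DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS": "correctable remapped rows",
    "DCGM_FI_DEV_ROW_REMAP_FAILURE": "row remap failure",
    "DCGM_FI_DEV_THERMAL_VIOLATION": "thermal violation",
}

# One sample line: name{labels} value [optional timestamp]
SAMPLE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)(?:\s+\d+)?$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def find_gpu_errors(exposition_text):
    """Return (hostname, gpu, description, value) for non-zero error metrics."""
    findings = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip samples without labels or malformed lines
        name, raw_labels, raw_value = match.groups()
        if name not in ERROR_METRICS:
            continue
        value = float(raw_value)
        if value != 0:
            labels = dict(LABEL_RE.findall(raw_labels))
            findings.append((
                labels.get("Hostname", "?"),  # assumed label name
                labels.get("gpu", "?"),       # assumed label name
                ERROR_METRICS[name],
                value,
            ))
    return findings


if __name__ == "__main__":
    # Made-up sample for illustration, not real exporter output.
    sample = (
        'DCGM_FI_DEV_XID_ERRORS{gpu="3",Hostname="node-1"} 79\n'
        'DCGM_FI_DEV_ROW_REMAP_FAILURE{gpu="3",Hostname="node-1"} 0\n'
    )
    for hostname, gpu, description, value in find_gpu_errors(sample):
        print(f"{hostname} gpu {gpu}: {description} = {value}")
```

Treating any non-zero value as a finding is a deliberately coarse rule. Missing GPUs are not covered here, since that panel is sourced from nvidia-smi rather than from these metrics.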