---
title: "GPU Observability | Grafana Cloud documentation"
description: "Track GPU utilization, temperature, memory usage, and hardware performance metrics"
---

# GPU observability

GPU Observability provides hardware-level monitoring for the GPU infrastructure that runs AI workloads, helping you maintain performance and catch hardware issues before they cause failures.

## Overview

The GPU Monitoring dashboard provides hardware-level monitoring for AI infrastructure:

- **Hardware utilization** - Real-time GPU usage and performance tracking
- **Thermal management** - Temperature monitoring and cooling system analysis
- **Performance tracking** - Compute efficiency and throughput metrics
- **Resource management** - Multi-GPU coordination and resource allocation
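In practice, these hardware signals typically arrive as Prometheus metrics from a GPU exporter such as NVIDIA's DCGM exporter. The sketch below is illustrative rather than prescriptive: it assumes metric names follow the DCGM exporter's `DCGM_FI_DEV_*` convention with a `gpu` label, and parses an exposition-format snippet into per-GPU readings.

```python
# Minimal sketch: parse Prometheus exposition text (as emitted by a GPU
# metrics exporter such as NVIDIA's DCGM exporter) and group samples per GPU.
# The metric names and labels below follow the DCGM exporter convention;
# your exporter may differ, so treat them as illustrative.

import re
from collections import defaultdict

SAMPLE = """\
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 87
DCGM_FI_DEV_GPU_UTIL{gpu="1"} 12
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 74
DCGM_FI_DEV_GPU_TEMP{gpu="1"} 41
"""

LINE_RE = re.compile(r'^(\w+)\{gpu="(\d+)"\}\s+([\d.]+)$')

def per_gpu_metrics(text):
    """Return {gpu_id: {metric_name: value}} from exposition-format lines."""
    gpus = defaultdict(dict)
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            name, gpu, value = m.group(1), m.group(2), float(m.group(3))
            gpus[gpu][name] = value
    return dict(gpus)

print(per_gpu_metrics(SAMPLE)["0"])
```

A dashboard backend would query these series from Prometheus rather than parsing text directly; the grouping-by-GPU-label step is the same either way.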

## Key features

### Resource optimization

- **GPU instance tracking** - Individual GPU performance across infrastructure
- **Resource allocation** - GPU resource distribution across workloads
- **Capacity planning** - Usage trend analysis for scaling decisions
- **Cost optimization** - GPU usage efficiency monitoring for cost management
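Capacity-planning decisions usually reduce to a trend over historical utilization samples. A minimal sketch, using made-up hourly averages, of fitting a linear trend and projecting when utilization would cross a scaling threshold:

```python
# Capacity-planning sketch: fit a least-squares linear trend to hourly
# average GPU-utilization samples (illustrative data) and estimate how many
# hours remain until a scaling threshold is crossed. A real setup would
# query this history from your metrics backend instead.

def linear_trend(samples):
    """Least-squares slope and intercept for evenly spaced samples."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

def hours_until(samples, threshold):
    """Hours from the last sample until the trend line hits `threshold`."""
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None  # flat or declining usage: no projected crossing
    return max(0.0, (threshold - intercept) / slope - (len(samples) - 1))

history = [52, 55, 57, 61, 63, 66, 70, 71]  # avg utilization %, hourly
print(hours_until(history, threshold=90))
```

A linear fit is the simplest possible model; bursty AI training workloads often need percentile- or seasonality-aware forecasting instead.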

### Hardware health

- **Power consumption** - GPU power usage and efficiency tracking
- **Hardware error rates** - GPU hardware failure and error monitoring
- **Driver stability** - GPU driver performance and stability metrics
- **Device availability** - GPU device status and accessibility monitoring
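Health signals like temperature, power draw, and error counts usually feed simple threshold alerts. A hedged sketch of that pattern, where the threshold values are placeholders (actual safe operating ranges depend on the GPU model and environment):

```python
# Illustrative health check: classify per-GPU readings against static
# thresholds. The limits below are placeholders -- real safe operating
# ranges depend on the specific GPU model and data-center environment.

THRESHOLDS = {
    "temp_c":     {"warn": 80, "crit": 90},    # core temperature
    "power_w":    {"warn": 350, "crit": 400},  # board power draw
    "ecc_errors": {"warn": 1, "crit": 10},     # uncorrected memory errors
}

def classify(reading):
    """Return 'ok', 'warn', or 'crit' for a dict of metric readings."""
    level = "ok"
    for metric, value in reading.items():
        limits = THRESHOLDS.get(metric)
        if limits is None:
            continue  # ignore metrics without configured thresholds
        if value >= limits["crit"]:
            return "crit"
        if value >= limits["warn"]:
            level = "warn"
    return level

print(classify({"temp_c": 84, "power_w": 300, "ecc_errors": 0}))  # warn
```

In Grafana this classification would normally be expressed as alert rules on the underlying metrics rather than application code; the sketch just makes the warn/crit escalation logic concrete.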

## Getting started

- [Setup Guide](./setup) - Set up GPU Observability to monitor GPU hardware performance and utilization
- [Configuration](./configuration) - Configure thermal monitoring, performance alerts, and resource optimization
