---
title: "GPU Observability Setup | Grafana Cloud documentation"
description: "Set up GPU Observability to monitor GPU hardware performance and utilization"
---

# GPU observability setup

GPU Observability provides comprehensive hardware-level monitoring for GPU infrastructure used in AI workloads, essential for ensuring optimal performance and preventing hardware issues.

## Install OpenLIT SDK

Install the OpenLIT SDK in your Python environment:

Bash ![Copy code to clipboard](/media/images/icons/icon-copy-small-2.svg) Copy

```bash
pip install openlit
```

## Configure OTEL environment variables

### Get your Grafana Cloud OTEL credentials

If you haven’t already obtained your OTEL credentials during the [main AI Observability setup](../setup/), follow these steps:

1. Sign in to Grafana Cloud and go to the [Grafana Cloud Portal](/profile/org)
2. Select your organization if you have access to multiple
3. Click your stack from the sidebar or main stack list
4. Under **Manage your stack**, click the **Configure** button in the OpenTelemetry section
5. Scroll down to the **Password / API Token** section and click **Generate now** (if you don’t have a token)
6. Enter a name for the token and click **Create token**
7. Click **Close** - you don’t need to copy the token manually
8. Scroll down and copy the `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_EXPORTER_OTLP_HEADERS` values from the **Environment variables** section

### Set the environment variables

Set up the OpenTelemetry endpoints using the values you copied:

Bash ![Copy code to clipboard](/media/images/icons/icon-copy-small-2.svg) Copy

```bash
export OTEL_EXPORTER_OTLP_ENDPOINT="<YOUR_GRAFANA_OTEL_GATEWAY_URL>"
export OTEL_EXPORTER_OTLP_HEADERS="<YOUR_GRAFANA_OTEL_GATEWAY_AUTH>"
```

Replace the values with those you copied:

- Replace `<YOUR_GRAFANA_OTEL_GATEWAY_URL>` with the `OTEL_EXPORTER_OTLP_ENDPOINT` value  
  Example: `https://otlp-gateway-<ZONE>.grafana.net/otlp`
- Replace `<YOUR_GRAFANA_OTEL_GATEWAY_AUTH>` with the `OTEL_EXPORTER_OTLP_HEADERS` value  
  Example: `Authorization=Basic%20<BASE64 ENCODED INSTANCE ID AND API TOKEN>`

## Initialize OpenLIT in your application

Add the following to your application code to enable GPU monitoring:

Python ![Copy code to clipboard](/media/images/icons/icon-copy-small-2.svg) Copy

```python
import openlit

openlit.init(collect_gpu_stats=True)
```

The OpenLIT SDK automatically uses the OTEL environment variables to send telemetry data to your Grafana Cloud instance. The `collect_gpu_stats=True` parameter enables GPU performance monitoring.

## Alternative: GPU collector deployment

For containerized environments, you can use the dedicated GPU collector instead of the SDK approach:

### Pull the GPU collector Docker image

Bash ![Copy code to clipboard](/media/images/icons/icon-copy-small-2.svg) Copy

```bash
docker pull ghcr.io/openlit/otel-gpu-collector:latest
```

### Run the GPU collector container

Bash ![Copy code to clipboard](/media/images/icons/icon-copy-small-2.svg) Copy

```bash
docker run --gpus all \
    -e GPU_APPLICATION_NAME='my-ai-app' \
    -e GPU_ENVIRONMENT='production' \
    -e OTEL_EXPORTER_OTLP_ENDPOINT="<YOUR_GRAFANA_OTEL_GATEWAY_URL>" \
    -e OTEL_EXPORTER_OTLP_HEADERS="<YOUR_GRAFANA_OTEL_GATEWAY_AUTH>" \
    ghcr.io/openlit/otel-gpu-collector:latest
```

Use the same OTEL environment variable values you copied from Grafana Cloud Portal.

This approach is useful when you want to monitor GPU metrics system-wide rather than from within specific applications.

## Visualize and analyze

With the GPU Observability data now being collected and sent to Grafana Cloud, the next step is to visualize and analyze this data to get insights into your GPU utilization, performance, and identify areas of improvement.

Navigate to the **GPU Observability** dashboard in your Grafana Cloud instance to start exploring. The dashboard provides:

- **Hardware monitoring** - GPU utilization, temperature, and power consumption
- **Memory tracking** - GPU memory usage and allocation patterns
- **Performance metrics** - Compute performance and throughput analysis
- **Resource optimization** - Efficiency insights and capacity planning
