GPU Memory Fragmentation: OOM on half-full GPU debug (multi-node)

Multi-node NVIDIA GPU memory fragmentation dashboard. Per-GPU used/free/total, fragmentation estimate (heuristic), top processes by allocation, IOCTL event volume per cmd code (experimental kprobe on nvidia_unlocked_ioctl). For OOM-on-half-full-GPU debugging across the cluster.

What this dashboard shows

A multi-node NVIDIA GPU memory fragmentation view. For the "my GPU is half-empty but allocations fail" debugging case across a cluster.

The headline panels:

  • Avg fragmentation estimate - across selected GPUs. Heuristic: 1 - (largest free block / total free). 0 = unfragmented; 1 = fully fragmented. (See the short computation sketch after this list.)
  • Free memory + fragmentation per GPU - time series. Spikes correlate with alloc churn (DataLoader rebuilds, optimizer state shuffles, KV-cache reset).
  • IOCTL events / sec + distinct cmd codes - aggregate signal from the experimental memfrag IOCTL kprobe. Higher rate + more diverse cmd codes = more memory operations underway.
  • Top 10 IOCTL events by cmd - per-cmd time series. Decode cmd values against nv-ioctl-numbers.h and nvos.h from NVIDIA's open-gpu-kernel-modules repo to map raw codes to RM operation kinds.
  • Top 10 processes by allocated memory - from gpu_memory_process_allocated_bytes.
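
A minimal sketch of the heuristic in Go (the free-block sizes are invented for illustration, and this page does not document how the agent obtains the block layout, so read it only as a restatement of the formula above):

package main

import "fmt"

// fragmentationEstimate restates the dashboard's heuristic:
// 1 - (largest free block / total free). 0 = unfragmented, 1 = fully fragmented.
func fragmentationEstimate(freeBlocks []uint64) float64 {
	var largest, total uint64
	for _, b := range freeBlocks {
		total += b
		if b > largest {
			largest = b
		}
	}
	if total == 0 {
		return 0 // no free memory at all, so nothing to fragment
	}
	return 1 - float64(largest)/float64(total)
}

func main() {
	// Hypothetical layout: 10 GiB free in total, largest contiguous block 4 GiB.
	blocks := []uint64{4 << 30, 3 << 30, 2 << 30, 1 << 30}
	fmt.Printf("fragmentation estimate: %.2f\n", fragmentationEstimate(blocks)) // 0.60
}

With those numbers the estimate comes out at 0.60: plenty of free memory in total, but the largest contiguous block covers less than half of it.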

Linux only. amd64 + arm64 agents.

How the data is collected

Two metric families feed this dashboard:

Production (always emitted)

NVML / nvidia-smi --query-gpu is polled every --memfrag-poll-interval (default 10s):

  • gpu_memory_used_bytes{gpu_uuid} / _free_bytes / _total_bytes
  • gpu_memory_fragmentation_estimate{gpu_uuid} (heuristic)
  • gpu_memory_process_allocated_bytes{gpu_uuid,pid} (from nvidia-smi --query-compute-apps; see the sketch below)
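
The collection code itself isn't shown on this page, but a rough Go sketch of the nvidia-smi path behind the per-process metric could look like the following (the agent may just as well use NVML bindings directly; the parsing assumes CSV output with values in MiB):

package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// Rough sketch only. With --format=csv,noheader,nounits nvidia-smi reports
// used_memory in MiB, one compute process per line.
func pollProcessAllocations() (map[string]map[int]uint64, error) {
	out, err := exec.Command("nvidia-smi",
		"--query-compute-apps=gpu_uuid,pid,used_memory",
		"--format=csv,noheader,nounits").Output()
	if err != nil {
		return nil, err
	}
	// gpu_uuid -> pid -> allocated bytes (feeds gpu_memory_process_allocated_bytes)
	alloc := map[string]map[int]uint64{}
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		f := strings.Split(line, ",")
		if len(f) != 3 {
			continue
		}
		uuid := strings.TrimSpace(f[0])
		pid, _ := strconv.Atoi(strings.TrimSpace(f[1]))
		mib, _ := strconv.ParseUint(strings.TrimSpace(f[2]), 10, 64)
		if alloc[uuid] == nil {
			alloc[uuid] = map[int]uint64{}
		}
		alloc[uuid][pid] = mib << 20 // MiB -> bytes
	}
	return alloc, nil
}

func main() {
	alloc, err := pollProcessAllocations()
	if err != nil {
		fmt.Println("poll failed:", err)
		return
	}
	for uuid, pids := range alloc {
		for pid, bytes := range pids {
			fmt.Printf("gpu_memory_process_allocated_bytes{gpu_uuid=%q,pid=\"%d\"} %d\n", uuid, pid, bytes)
		}
	}
}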

Experimental (gated behind --enable-experimental-kprobes)

Per-cmd nvidia driver IOCTL counter:

  • gpu_memfrag_ioctl_event_total{cmd}

Backed by an eBPF kprobe attached to nvidia_unlocked_ioctl in the closed-source NVIDIA driver. It emits one ringbuf event per IOCTL invocation; the cmd argument is recorded raw. The probe only loads when the running NVIDIA driver + Linux kernel pair is on internal/kprobe.DefaultAllowlist; off-allowlist hosts surface a startup warning and the experimental panels stay empty.
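
The {cmd} label carries the raw 32-bit IOCTL command word. A minimal decoding sketch in Go, assuming the standard Linux _IOC bit layout (the raw value in main is invented for illustration; the NR byte is what you match against the defines in nv-ioctl-numbers.h / nvos.h):

package main

import "fmt"

// Standard Linux _IOC bit layout (asm-generic/ioctl.h):
//   bits 0-7 NR (command number), 8-15 TYPE (magic), 16-29 SIZE, 30-31 DIR.
func decodeIoctlCmd(cmd uint32) (dir, size, typ, nr uint32) {
	nr = cmd & 0xff
	typ = (cmd >> 8) & 0xff
	size = (cmd >> 16) & 0x3fff
	dir = (cmd >> 30) & 0x3
	return
}

func main() {
	// Hypothetical raw cmd value as it would appear in the {cmd="..."} label.
	var raw uint32 = 0xc020462a
	dir, size, typ, nr := decodeIoctlCmd(raw)
	fmt.Printf("cmd=%#x dir=%d size=%d type=%#x ('%c') nr=%#x\n",
		raw, dir, size, typ, typ, nr)
}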

Boot the agent:

sudo ingero trace --enable-experimental-kprobes --prometheus :9090

Cluster aggregation works via a Prometheus / Grafana Alloy / Grafana Cloud scrape, or via OTLP push to the Ingero Fleet collector + Echo store.

Install the agent: https://github.com/ingero-io/ingero
Install Fleet: https://github.com/ingero-io/ingero-fleet

How to read the heuristic

The fragmentation estimate is a coarse signal. Values above ~0.6 across multiple sequential polls usually correlate with:

  • Alloc / free churn (DataLoader workers, KV-cache reset)
  • Mixed-size allocator pressure (a single PyTorch process running variable-batch inference, or a Triton server hot-reloading models)
  • Driver-side fragmentation that survives cudaMemPool trimming

For the actual allocator-level signal (per-cmd IOCTL volume), turn on --enable-experimental-kprobes and watch the bottom panel.

Companion dashboards

  • Cluster set (multi-node aggregation): GPU Cluster Overview, NCCL Stragglers, GPU Memcpy Bandwidth, Per-Node GPU Drill-Down. All published under https://grafana.com/orgs/ingero
  • Single-host: GPU Trace Overview, CUDA Op Profiler, GPU Data Movement, GPU Memory & Throttle.
  • Fleet pipeline health: the operator dashboard for the Ingero Fleet collector itself.

Source

Issues, panel suggestions, and dashboard PRs are welcome at https://github.com/ingero-io/ingero-fleet/issues
