GPU Memcpy Bandwidth: h2d / d2h / d2d, latency percentiles (multi-node)

Multi-node CUDA memcpy bandwidth dashboard. Per-direction throughput (h2d, d2h, d2d, peer, default), per-direction latency p50/p95/p99 from per-event histogram. Surfaces data-pipeline bottlenecks (h2d-d2h skew = CPU-bound prep; d2d dominance = peer-copy hot path). eBPF uprobes on libcudart cudaMemcpy*.

What this dashboard shows

A multi-node CUDA memcpy bandwidth view, broken out by direction (h2d, d2h, d2d, default, unknown). Built for ML pipeline engineers debugging data-pipeline bottlenecks across a cluster.

The headline panels:

  • Bandwidth per direction - rate of gpu_memcpy_bytes_total per direction. h2d-d2h skew = data-pipeline imbalance (CPU-bound prep). d2d dominance = peer-copy or in-GPU shuffle hot path.
  • p50 / p95 / p99 duration per direction - per-event latency percentiles from the gpu_memcpy_duration_ms histogram. Tail latency surfaces transient contention; sustained p99 elevation surfaces structural memcpy-shape problems.
  • Heatmap (all directions) - duration distribution over time. Right-skewed clusters = tail latency. Bimodal clusters = two distinct memcpy shapes interleaving.
  • Per-direction summary table (1h) - cumulative bytes + event counts. Quick way to confirm which direction owns the volume.

Linux only. amd64 + arm64 agents.

How the data is collected

Each node runs the Ingero agent with the Prometheus exporter:

sudo ingero trace --prometheus :9090

The agent attaches eBPF uprobes to libcudart cudaMemcpy* symbols (cudaMemcpy, cudaMemcpyAsync, cudaMemcpy2D, cudaMemcpy2DAsync, cudaMemcpyPeer, cudaMemcpyPeerAsync) at function entry + return. The cudaMemcpyKind argument is read from the userspace register at uprobe entry to label the direction.

Two metric families per direction:

  • gpu_memcpy_bytes_total{direction} - cumulative bytes counter. Sum-of-rates on this gives cluster-wide bandwidth.
  • gpu_memcpy_duration_ms_{bucket,sum,count}{direction} - per-event histogram. histogram_quantile() over rate(_bucket) gives accurate per-direction percentiles.

The 2D variants emit direction=unknown because cudaMemcpyKind is

  • gpu_memcpy_duration_ms_{bucket,sum,count}{direction} - per-event histogram. histogram_quantile() over rate(_bucket) gives accurate per-direction percentiles.

The 2D variants emit direction=unknown because cudaMemcpyKind is the 7th parameter and not portably readable from BPF on amd64 + arm64.

Cluster aggregation comes from your Prometheus / Grafana Alloy / Grafana Cloud agent scraping all nodes, OR from pushing OTLP to the Ingero Fleet collector and Echo store.

Install agent: https://github.com/ingero-io/ingero Install Fleet: https://github.com/ingero-io/ingero-fleet

Why eBPF for memcpy

nvidia-smi and DCGM-exporter expose throughput counters but no per-call duration. Ingero's libcudart uprobes capture every memcpy at the function boundary, yielding per-event duration (not averaged) and per-event bytes counted at the call site. The histogram exposition gives accurate percentiles via standard histogram_quantile() on Prometheus rather than averages.

Companion dashboards

  • Cluster set (multi-node aggregation): GPU Cluster Overview, NCCL Stragglers, GPU Memory Fragmentation, Per-Node GPU Drill-Down. All published under https://grafana.com/orgs/ingero
  • Single-host: GPU Trace Overview, CUDA Op Profiler, GPU Data Movement (memcpy + NCCL on one host), GPU Memory & Throttle.
  • Fleet pipeline health: the operator dashboard for the Ingero Fleet collector itself (peer-relative median + MAD across the cluster). Different metrics, different audience.

Source

Issues, panel suggestions, dashboard PRs welcome on https://github.com/ingero-io/ingero-fleet/issues

Revisions
RevisionDescriptionCreated

Get this dashboard

Import the dashboard template

or

Download JSON

Datasource
Dependencies