GPU Data Movement: CUDA memcpy + NCCL collectives (single host)

Single-host CUDA memcpy + NCCL data-movement dashboard: per-direction memcpy throughput and p50/p95/p99 latency (from a per-event histogram), local NCCL collective rates by op_type, and a libnccl version roster. Answers one question: is data movement the bottleneck on this box? Data comes from eBPF uprobes on libcudart and libnccl.

What this dashboard shows

A single-host data-movement view: CUDA memcpy bandwidth and per-event duration percentiles by direction, plus local NCCL collective rates and a libnccl version roster. It answers one question: "Is data movement my bottleneck on this box?"

The headline panels:

Memcpy section:

  • Total memcpy throughput - sum of the gpu_memcpy_bytes_total rate across all directions.
  • p50 / p95 / p99 duration (all directions) - per-event latency percentiles from the gpu_memcpy_duration_ms histogram (sample queries follow this list).
  • Bandwidth per direction - h2d / d2h / d2d / default / unknown over time. Skew between h2d and d2h points to a data-pipeline imbalance.
  • Duration percentiles per direction - p95 per direction, to surface direction-specific tail latency.
  • Duration distribution heatmap - the shape of the duration histogram over time. A right-skewed distribution means tail latency.
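
A minimal sketch of the queries behind these panels (the dashboard JSON is the source of truth; the 5m windows here are illustrative):

# Total memcpy throughput in bytes/s, summed over all directions
sum(rate(gpu_memcpy_bytes_total[5m]))

# p95 per-event memcpy duration (ms) per direction, from the histogram buckets
histogram_quantile(0.95, sum by (direction, le) (rate(gpu_memcpy_duration_ms_bucket[5m])))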

NCCL on this node section:

  • Local NCCL processes - number of libnccl-loaded PIDs found by the discovery scanner.
  • Collective rate by op_type - locally fired collectives: ncclAllReduce, ncclAllGather, ncclReduceScatter, ncclBcast, ncclSend, ncclRecv (sample queries follow this list).
  • Bandwidth by op_type - rate of the cumulative bytes counter.
  • Distinct libnccl versions - count of unique libnccl_version labels (different PyTorch venvs can ship different NCCL ABIs).
  • Local libnccl-loaded processes (table) - PID, comm, libnccl path, and version for each process.
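
Sketches of the queries these panels map to, using the op_type and libnccl_version label names listed above:

# Collectives fired per second on this node, split by op
sum by (op_type) (rate(gpu_nccl_collective_count[5m]))

# Number of distinct libnccl versions currently loaded on this node
count(count by (libnccl_version) (gpu_nccl_process_loaded))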

Linux only. amd64 + arm64.

How the data is collected

Install the Ingero agent:

curl -sSL https://github.com/ingero-io/ingero/releases/latest/download/install.sh | bash

Run with the Prometheus exporter and NCCL uprobes:

sudo ingero trace --nccl --prometheus :9090

Scrape and import:

Scrape :9090/metrics with your Prometheus server, Grafana Alloy, or Grafana Cloud agent, then import this dashboard and pick your Prometheus datasource.
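
A quick sanity check before importing: run a query like the one below in Explore (this is a check we suggest, not part of the dashboard). A non-empty result means the agent is exporting and the scrape is flowing.

# Should return a series once memcpy events are being traced and scraped
sum(rate(gpu_memcpy_bytes_total[1m]))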

Memcpy metrics

eBPF uprobes attach to the libcudart memcpy symbols (cudaMemcpy, cudaMemcpyAsync, cudaMemcpy2D, cudaMemcpy2DAsync, cudaMemcpyPeer, cudaMemcpyPeerAsync) at function entry and return. The cudaMemcpyKind argument is read from the userspace register state at uprobe entry to label the direction.

  • gpu_memcpy_bytes_total (per-direction label) - cumulative bytes copied.
  • gpu_memcpy_duration_ms_bucket / _sum / _count (per-direction label) - per-event histogram. Use histogram_quantile() on rate(_bucket) for accurate percentiles; see the example after this list.
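
For example, the all-directions p99 panel boils down to a query like this (a sketch; the mean-duration cross-check is ours, not a dashboard panel):

# p99 memcpy duration (ms) across all directions
histogram_quantile(0.99, sum by (le) (rate(gpu_memcpy_duration_ms_bucket[5m])))

# Mean event duration (ms) as a cross-check on the percentiles
sum(rate(gpu_memcpy_duration_ms_sum[5m])) / sum(rate(gpu_memcpy_duration_ms_count[5m]))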

The 2D variants emit direction=unknown because cudaMemcpyKind is the seventh parameter, which is not portably readable from BPF entry context on both amd64 and arm64 (on amd64 the seventh integer argument is passed on the stack, not in a register).
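
If 2D copies matter in your workload, a query along these lines (our suggestion, not a built-in panel) quantifies how much traffic lands in the unknown bucket:

# Fraction of memcpy bytes that could not be attributed to a direction
sum(rate(gpu_memcpy_bytes_total{direction="unknown"}[5m])) / sum(rate(gpu_memcpy_bytes_total[5m]))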

NCCL metrics

eBPF uprobes attach to libnccl.so symbols (ncclAllReduce, ncclAllGather, ncclReduceScatter, ncclBcast, ncclSend, ncclRecv, etc.) at function entry and return. A runtime libnccl-discovery scanner finds PyTorch- and pip-installed NCCL ABIs via /proc/PID/maps (no system libnccl required) and attaches uprobes to each discovered libnccl path on the fly.

  • gpu_nccl_collective_count (per op_type label) - total events.
  • gpu_nccl_collective_bytes_total (per op_type label) - total bytes; see the example after this list.
  • gpu_nccl_collective_barrier_events (per op_type label) - barrier-shaped collectives.
  • gpu_nccl_process_loaded (with pid, comm, libnccl_path, libnccl_version labels) - per-PID gauge from the discovery scanner.
  • gpu_nccl_processes_total - count of discovered PIDs.
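
Sketch queries for the bandwidth panel and the process table, using the label names above:

# Collective bandwidth in bytes/s, split by op
sum by (op_type) (rate(gpu_nccl_collective_bytes_total[5m]))

# The process-table panel is the raw gauge, one series per discovered PID
gpu_nccl_process_loaded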

Why eBPF for data movement

nvidia-smi --query-gpu=memory.used polls device state; it cannot tell you whether the bytes flowed h2d, d2h, or d2d, and it has no per-call latency. DCGM-exporter has the same limitation. Ingero's libcudart and libnccl uprobes capture every memcpy and every collective at the call boundary, so the per-direction and per-op_type breakdowns both come from real call-site data, not aggregate counters.

For NCCL specifically, the multi-libnccl discovery is what makes this work on PyTorch and pip workloads, where libnccl ships inside the venv (e.g. $VENV/lib/python3.10/site-packages/torch/../nvidia/nccl/lib/libnccl.so.2) rather than at a system path. Without runtime discovery, the agent would attach to whatever libnccl was on LD_LIBRARY_PATH at boot, which is usually nothing useful for PyTorch wheels.

Companion dashboards

  • Single-host set (this set): GPU Trace Overview (#25277), CUDA Op Profiler (#25278), GPU Data Movement (this one), GPU Memory & Throttle.
  • Multi-node cluster set: GPU Cluster Overview (#25271), NCCL Stragglers (#25273), GPU Memcpy Bandwidth (#25274), GPU Memory Fragmentation (#25275), Per-Node Drill-Down (#25276).
  • Fleet pipeline health: operator dashboard for the Ingero Fleet collector itself.

All published under https://grafana.com/orgs/ingero

Source

Issues, panel suggestions, and dashboard PRs are welcome at https://github.com/ingero-io/ingero/issues
