CUDA Op Profiler: per-op latency p50/p95/p99 (single host, eBPF)

Single-host CUDA Runtime + Driver operation profiler. Per-op p50/p95/p99 latency from eBPF uprobes on libcudart and libcuda. Find your slowest kernel: top-10 slowest p99 table, per-op time series, op rate. Covers cudaMemcpy, cudaLaunchKernel, cuLaunchKernel, cuMemAlloc_v2, cuCtxSynchronize, CUDA Graphs, and host tracepoints.

What this dashboard shows

A single-host CUDA Runtime + Driver operation profiler. Per-op p50 / p95 / p99 latency, captured by eBPF uprobes on libcudart and libcuda. The "find my slow kernel" view.

The headline panels:

  • Slowest p99 (current) - top stat. Worst p99 across all selected ops in microseconds.
  • p99 duration per operation - tail-latency time series. Spikes here surface ops that occasionally take much longer than their median, often a sign of hardware contention or run-to-run divergence in compiled code.
  • p50 / p95 duration per operation - median + tail.
  • Op rate per operation - per-op frequency. A slow op that fires rarely matters less than a slow op that fires constantly.
  • Top 10 slowest ops by p99 (table) - quick triage list.
  • Anomaly count - the rate of ingero_anomaly_count, emitted by the agent's correlator.

Source and operation template variables let you scope the dashboard to a specific tracer (cuda / driver / cuda_graph / host / io / tcp / net) or a specific op family (cudaMemcpy*, cudaMalloc*, cuLaunchKernel*, sched_switch, block_rq_*, etc.).
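
These variables are most likely plain label lookups against the metrics listed under "How the data is collected" below. A sketch of equivalent Grafana variable queries (illustrative, not copied from the dashboard JSON):

label_values(gpu_cuda_operation_duration_microseconds, source)
label_values(gpu_cuda_operation_duration_microseconds{source=~"$source"}, operation)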

Linux only. amd64 + arm64.

How the data is collected

Install the Ingero agent:

curl -sSL https://github.com/ingero-io/ingero/releases/latest/download/install.sh | bash

Run with the Prometheus exporter:

sudo ingero trace --prometheus :9090

Scrape and import:

Scrape :9090/metrics with Prometheus, Grafana Alloy, or the Grafana Cloud agent, then import this dashboard and select your Prometheus datasource.
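
A minimal Prometheus scrape config for the exporter's default address; the job name and interval here are illustrative, not mandated by the agent:

scrape_configs:
  - job_name: ingero          # illustrative job name
    scrape_interval: 15s      # any interval works; percentiles are pre-computed by the agent
    static_configs:
      - targets: ["localhost:9090"]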

The agent exposes per-op latency reservoirs via:

  • gpu_cuda_operation_duration_microseconds{source,operation,percentile} with percentile in {p50, p95, p99}
  • gpu_cuda_operation_count_total{source,operation} (cumulative counter; take rate() for per-op frequency)
  • ingero_anomaly_count (correlator output)
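
Given those series, the headline panels reduce to short PromQL. Sketches of the kind of queries involved, where $source / $operation are the template variables above and $__rate_interval is Grafana's rate-interval macro (the shipped panel queries may differ in detail):

# Slowest p99 across the selected ops (top stat)
max(gpu_cuda_operation_duration_microseconds{percentile="p99", source=~"$source", operation=~"$operation"})

# Top 10 slowest ops by p99 (table)
topk(10, gpu_cuda_operation_duration_microseconds{percentile="p99"})

# Op rate per operation
rate(gpu_cuda_operation_count_total{source=~"$source", operation=~"$operation"}[$__rate_interval])

# Anomaly count
rate(ingero_anomaly_count[$__rate_interval])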

Source decoder

The source label distinguishes which tracer captured the event:

  • cuda - libcudart Runtime API uprobes (cudaMemcpy, cudaMalloc, cudaLaunchKernel, cudaStreamSynchronize, etc.)
  • driver - libcuda Driver API uprobes (cuLaunchKernel, cuCtxSynchronize, cuMemAlloc_v2, etc.)
  • cuda_graph - CUDA Graph capture / launch ops
  • host - host kernel tracepoints (sched_switch, page_fault, etc.)
  • io - block I/O tracepoints (block_rq_issue / complete)
  • tcp - TCP retransmit tracepoints
  • net - network socket syscalls (sendto / recvfrom)
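
Since Runtime API calls (cudaLaunchKernel) ultimately go through the Driver API (cuLaunchKernel), the same workload can show up under both the cuda and driver sources; filtering on source keeps the two layers separate. An illustrative pair of queries for comparing them side by side:

gpu_cuda_operation_duration_microseconds{percentile="p99", source="cuda", operation="cudaLaunchKernel"}
gpu_cuda_operation_duration_microseconds{percentile="p99", source="driver", operation="cuLaunchKernel"}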

Why eBPF over CUDA profilers

nvprof, Nsight Systems (nsys), and the CUPTI library all hook the CUDA driver through the official profiling interface. They give detailed single-process traces but require attaching at process start and add measurable overhead. Ingero's libcudart + libcuda uprobes attach to already-running processes (no recompile, no restart), add sub-microsecond overhead per call, and produce continuous per-percentile signals suitable for scraping.

The two are complementary: use Nsight Systems for one-time deep dives; use Ingero for continuous monitoring + alerting on production workloads.

Companion dashboards

  • Single-host set (this set): GPU Trace Overview (#25277), CUDA Op Profiler (this one), GPU Data Movement, GPU Memory & Throttle.
  • Multi-node cluster set: GPU Cluster Overview (#25271), NCCL Stragglers (#25273), GPU Memcpy Bandwidth (#25274), GPU Memory Fragmentation (#25275), Per-Node Drill-Down (#25276).
  • Fleet pipeline health: operator dashboard for the Ingero Fleet collector itself.

All are published under https://grafana.com/orgs/ingero

Source

Issues, panel suggestions, and dashboard PRs are welcome at https://github.com/ingero-io/ingero/issues
