NCCL Stragglers: distributed training slowdown finder (multi-GPU)

Multi-node NCCL straggler dashboard. Top-N nodes by collective byte volume, per-op_type bandwidth (ncclAllReduce, ncclAllGather, ncclReduceScatter, ncclBcast, ncclSend/Recv), barrier-event ratio, per-rank rates. Find which node is dragging down distributed PyTorch / FSDP / DeepSpeed training. eBPF uprobes on libnccl.so.

NCCL Stragglers: distributed training slowdown finder (multi-GPU)

Find which node is dragging down distributed PyTorch / FSDP / DeepSpeed training by surfacing per-node NCCL collective-op volume, bandwidth, and barrier-wait ratio across the cluster.

What it shows

Cluster-wide totals over the last hour: ncclAllReduce count, total NCCL bytes, NCCL-loaded processes, distinct libnccl versions.
Event volume and byte volume broken out per collective op type (ncclAllReduce, ncclAllGather, ncclReduceScatter, ncclBcast, ncclSend / ncclRecv).
Per-node collective rate and per-op-type bandwidth as time series.
Barrier-event ratio per op type (a fast read on synchronization tax).
Top 10 nodes by NCCL byte volume in the last hour, ranked - the candidate straggler list.

How it helps

Stragglers in distributed training are usually invisible to per-node GPU dashboards. Each rank's GPU looks busy, but one rank is consistently late to the collective, and every other rank waits at the barrier. This dashboard sorts on the right axis (per-node byte volume + barrier-wait ratio) so the laggard is the row that stands out.

Typical use:

Cluster overview tells you training throughput dropped.
Open this dashboard, scope to the same time range.
The "Top 10 nodes by NCCL byte volume" table and the per-node "Collective rate per op_type" panel surface the outlier rank.
Drill into that node with the "Per-Node Drilldown" dashboard (Grafana ID 25276).

Requirements

Data source: Prometheus, scraping the Ingero agent's /metrics endpoint on every node in the training cluster.
Agent: Ingero v0.15 or newer. Older agents do not expose the gpu_nccl_* metric family.
Metrics consumed: gpu_nccl_collective_count, gpu_nccl_collective_bytes_total, gpu_nccl_collective_barrier_events, gpu_nccl_processes_total, gpu_nccl_process_loaded. All emitted by the agent's libnccl.so uprobes when an NCCL-using process is

The "Top 10 nodes by NCCL byte volume" table and the per-node "Collective rate per op_type" panel surface the outlier rank.
Drill into that node with the "Per-Node Drilldown" dashboard (Grafana ID 25276).

Requirements

Data source: Prometheus, scraping the Ingero agent's /metrics endpoint on every node in the training cluster.
Agent: Ingero v0.15 or newer. Older agents do not expose the gpu_nccl_* metric family.
Metrics consumed: gpu_nccl_collective_count, gpu_nccl_collective_bytes_total, gpu_nccl_collective_barrier_events, gpu_nccl_processes_total, gpu_nccl_process_loaded. All emitted by the agent's libnccl.so uprobes when an NCCL-using process is detected on the host.
Variables: $instance (multi-value, populated from any of the metrics above).

How to install

In Grafana, Dashboards → New → Import, paste the dashboard ID, pick your Prometheus datasource.

Or, with the Grafana operator's dashboard sidecar:

  kubectl create configmap ingero-nccl-stragglers \
    --from-file=cluster-02-nccl-stragglers.json \
    --namespace=monitoring
  kubectl label configmap ingero-nccl-stragglers grafana_dashboard=1

The dashboard JSON is Apache-2.0 and lives at https://github.com/ingero-io/ingero-grafana-app/blob/main/src/dashboards/cluster-02-nccl-stragglers.json. Forks and PRs welcome.

Notes

NCCL tracing is via eBPF uprobes on libnccl.so; the agent attaches when it sees an NCCL-linked process and detaches on exit. There is no library-side instrumentation and no NCCL build flag required.
Panels stay empty on hosts that do not load libnccl (inference-only nodes, control-plane nodes). That is expected.
Barrier-event ratio is a heuristic, not a kernel-level wait timer. A high ratio means a high share of collective ops emit a barrier event; treat it as a relative signal across nodes in the same job, not an absolute latency.

Ingero Cluster: Overview (25271) - start here when training throughput drops.
Ingero Cluster: Per-Node Drilldown (25276) - drill into the candidate straggler this dashboard surfaces.
Ingero Cluster: Memcpy Bandwidth (25274) - rule out a data-pipeline bottleneck on the candidate node before blaming NCCL.

License

Apache-2.0. Source: https://github.com/ingero-io/ingero-grafana-app

Revisions

Revision	Description	Created
			Download

Get this dashboard

Import the dashboard template

Download JSON

Datasource

Dependencies

Resources

Docs: Importing dashboards Webinar: Getting started with Grafana dashboard design Webinar: Building advanced Grafana dashboards

NCCL Stragglers: distributed training slowdown finder (multi-GPU)