EKS Node Monitoring Agent Observability

Comprehensive health monitoring for EKS nodes powered by the EKS Node Monitoring Agent and kubernetes-event-exporter. This dashboard tracks critical hardware, kernel, and network conditions that trigger Karpenter Node Auto-Repair (Disruption) to maintain cluster reliability and automate infrastructure recovery.

This dashboard gives visibility into EKS node health, built around the events and conditions reported by the EKS Node Monitoring Agent add-on. It correlates low-level system signals with Karpenter's Node Auto-Repair logic, so you can see which node problems led to an automated repair.

Key Features

  • AWS Health Integration: Fully mapped to the AWS EKS Node Health documentation.
  • Auto-Repair Synergy: Tracks events that trigger Karpenter Node Auto-Repair.
  • 5-Pillar Monitoring: Categorized views for Accelerated Hardware (GPU/Neuron), Networking, Storage, Kernel, and Container Runtime.
  • Noise-Reduced Alerts: Optimized LogQL queries to distinguish between transient Events and critical Conditions.

Prerequisites

To use this dashboard, ensure your cluster has the following components installed:

  1. EKS Node Monitoring Agent: The core add-on providing node-level signals.
  2. kubernetes-event-exporter: Required to watch Kubernetes events and forward them to Loki.
  3. Grafana Loki: Used as the primary data source for event logs.
  4. Prometheus: (Optional) For correlating system metrics with health events.
  5. Karpenter: To visualize automated disruption and repair actions.

Dashboard Sections

  • Summary (Global Health): High-level "Healthy Node Ratio" and active critical conditions.
  • Accelerated Hardware: Monitors NVIDIA XID errors and Neuron HBM errors.
  • Networking & CNI: Tracks Bandwidth, PPS, and IPAMD address exhaustion.
  • Storage & I/O: Visualizes EBS throughput limits and disk latency.
  • Kernel & OS Stability: Detects Soft Lockups, OOMs, and PID/File exhaustion.
  • Runtime & Kubelet: Tracks CRI failures and Pod/Service restart loops.
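
The "Healthy Node Ratio" in the Summary section can be read as the share of nodes with no failing health condition. The dashboard's own query is not reproduced here, so the Python below is only a sketch of that arithmetic under stated assumptions: the condition names in `READY_CONDITIONS` are assumed to match what the agent publishes on your nodes, and the input shape (node name mapped to condition statuses) and both function names are hypothetical.

```python
from typing import Dict

# Assumed condition types set by the EKS Node Monitoring Agent. Verify them
# with `kubectl describe node` on your own cluster before relying on this list.
READY_CONDITIONS = (
    "AcceleratedHardwareReady",
    "ContainerRuntimeReady",
    "KernelReady",
    "NetworkingReady",
    "StorageReady",
)


def is_healthy(conditions: Dict[str, str]) -> bool:
    """A node counts as healthy when no monitored condition is "False".

    Missing or "Unknown" conditions are not treated as failures here,
    which is a simplifying assumption.
    """
    return all(conditions.get(name) != "False" for name in READY_CONDITIONS)


def healthy_node_ratio(nodes: Dict[str, Dict[str, str]]) -> float:
    """Return healthy nodes / total nodes, or 1.0 for an empty cluster."""
    if not nodes:
        return 1.0
    healthy = sum(1 for conds in nodes.values() if is_healthy(conds))
    return healthy / len(nodes)


if __name__ == "__main__":
    # Made-up sample data: one healthy node, one with a networking failure.
    sample = {
        "node-a": {"KernelReady": "True", "NetworkingReady": "True"},
        "node-b": {"KernelReady": "True", "NetworkingReady": "False"},
    }
    print(healthy_node_ratio(sample))
```

Returning 1.0 for an empty input mirrors the "No Value" recommendation in the setup steps: no data is shown as healthy rather than as an error.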

Setup Instructions

  1. Import this dashboard via JSON or ID.
  2. Select your Loki data source.
  3. Ensure your kubernetes-event-exporter is configured to include the field reportingComponent: eks-node-monitoring-agent.
  4. (Recommended) Set the "No Value" option in Stat panels to 0 or Healthy to maintain a clean UI when no errors are present.
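
Step 3 depends on exported events carrying the reportingComponent field. Before wiring up panels, it can help to check a sample of the exporter's output. The Python below is a minimal sketch of such a check, assuming each log line is one JSON-encoded Kubernetes event with a top-level `reportingComponent` key; the helper name `agent_events` and the sample lines are hypothetical.

```python
import json
from typing import Iterable, Iterator

AGENT = "eks-node-monitoring-agent"  # value named in step 3


def agent_events(lines: Iterable[str], component: str = AGENT) -> Iterator[dict]:
    """Yield parsed events whose reportingComponent matches `component`.

    Assumes one JSON object per line. Blank lines and lines that are not
    valid JSON objects are skipped rather than raising.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(event, dict) and event.get("reportingComponent") == component:
            yield event


if __name__ == "__main__":
    # Made-up sample lines; real exporter output carries many more fields.
    sample = [
        '{"reason": "ExampleReason", "reportingComponent": "eks-node-monitoring-agent"}',
        '{"reason": "Scheduled", "reportingComponent": "default-scheduler"}',
        "not json",
    ]
    for ev in agent_events(sample):
        print(ev["reason"])
```

If the function yields nothing for real exporter output, the field is probably missing or named differently in your exporter layout, and the dashboard's queries would return no data for the same reason.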