Review infrastructure conditions

The Infrastructure section of Kubernetes Overview surfaces platform conditions in the layer below your workloads. When this section reports problems, workload symptoms in Stability usually follow.

An infrastructure check identifies:

  • Nodes under resource pressure
  • Pods that have been evicted or lost contact with the Cluster
  • Nodes that cannot accept new Pods

Figure: Infrastructure panels on the Kubernetes Overview home page

Click View detail on any tile to see the affected items listed under Detail view at the bottom of the page.

Node pressure

These are Nodes with an active MemoryPressure, DiskPressure, or PIDPressure condition. Each condition is an early warning signal for Pod evictions or failures.

Common causes: High memory consumption on the Node, low available disk space or inodes, or too many running processes exhausting available PIDs.
What to do: Free up Node resources by evicting or rescheduling non-critical workloads, expand disk capacity, or increase the PID limit. Address these conditions before they cause Pod evictions or failures.
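
To check the same conditions outside the dashboard, the following is a minimal sketch in Python. It assumes you have parsed the output of kubectl get nodes -o json into a dict. The function name nodes_under_pressure is hypothetical; the condition types and the status.conditions layout follow the Kubernetes Node API.

```python
PRESSURE_TYPES = {"MemoryPressure", "DiskPressure", "PIDPressure"}


def nodes_under_pressure(node_list):
    """Return {node name: sorted list of active pressure condition types}.

    Assumption: node_list is the parsed JSON of `kubectl get nodes -o json`.
    """
    result = {}
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        # A pressure condition is active when its status is the string "True".
        active = sorted(
            cond["type"]
            for cond in conditions
            if cond.get("type") in PRESSURE_TYPES and cond.get("status") == "True"
        )
        if active:
            result[node["metadata"]["name"]] = active
    return result
```

Nodes with no active pressure condition are omitted from the result, so an empty dict means no Node is currently reporting pressure.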

Evicted Pods

These are Pods evicted by the kubelet due to Node resource pressure or by the scheduler due to priority preemption.

Common causes: Node memory, disk, or PID pressure; priority preemption by higher-priority Pods.
What to do: Resolve the underlying Node pressure (refer to Node pressure above), adjust Pod priority classes, or redistribute workloads across Nodes.
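
As an illustration of how kubelet evictions appear in the API, the sketch below filters a Pod list for Pods whose status shows phase Failed with reason Evicted. It assumes the parsed output of kubectl get pods --all-namespaces -o json. The helper name evicted_pods is hypothetical, and the sketch covers only kubelet evictions, not Pods removed through scheduler preemption.

```python
def evicted_pods(pod_list):
    """Return (namespace, name, node, message) for each kubelet-evicted Pod.

    Assumption: pod_list is the parsed JSON of
    `kubectl get pods --all-namespaces -o json`.
    """
    found = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        # Kubelet evictions leave the Pod in phase Failed with reason Evicted.
        if status.get("phase") == "Failed" and status.get("reason") == "Evicted":
            meta = pod.get("metadata", {})
            found.append((
                meta.get("namespace", ""),
                meta.get("name", ""),
                pod.get("spec", {}).get("nodeName", ""),
                status.get("message", ""),
            ))
    return found
```

The message field, when present, usually names the resource that triggered the eviction, which helps link an evicted Pod back to the Node pressure condition behind it.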

Pods in unknown phase

These are Pods in the Unknown phase. This typically occurs when the Node hosting the Pod becomes unreachable and Kubernetes can no longer determine the Pod’s state.

Common causes: Node failure or severe network partition between the Node and the control plane.
What to do: Check Node status and network connectivity. Restart the Node or restore network access. Unknown Pods transition to Failed or recover once Node connectivity is restored.
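
When several Pods report Unknown at once, grouping them by Node can show which Node to check first. The following is a minimal sketch, assuming the parsed output of kubectl get pods --all-namespaces -o json. The helper name unknown_pods_by_node and the "<unassigned>" placeholder are hypothetical.

```python
from collections import defaultdict


def unknown_pods_by_node(pod_list):
    """Return {node name: ["namespace/name", ...]} for Pods in phase Unknown.

    Assumption: pod_list is the parsed JSON of
    `kubectl get pods --all-namespaces -o json`.
    """
    grouped = defaultdict(list)
    for pod in pod_list.get("items", []):
        if pod.get("status", {}).get("phase") != "Unknown":
            continue
        meta = pod.get("metadata", {})
        # Pods without a nodeName are grouped under a placeholder key.
        node = pod.get("spec", {}).get("nodeName", "<unassigned>")
        grouped[node].append(f"{meta.get('namespace', '')}/{meta.get('name', '')}")
    return dict(grouped)
```

A Node that appears with many Pods in the result is a reasonable first candidate for a status and connectivity check.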

Nodes that cannot be scheduled

These are Nodes that have been cordoned or tainted to block new Pod placement, usually during maintenance or drains. If Nodes cannot be scheduled for long periods, effective Cluster capacity shrinks and Pods can stay pending or concentrate load on the remaining Nodes.

Common causes: A manual cordon for maintenance or upgrade, an in-progress or abandoned drain, a NoSchedule or NoExecute taint added by an operator or controller (for example, GPU or dedicated workload Nodes), the cluster autoscaler marking a Node before scale-down, or automatic taints triggered by MemoryPressure, DiskPressure, or PIDPressure.
What to do: Identify who or what cordoned the Node and confirm the maintenance window is still active. Review the Node’s taints and conditions, then uncordon the Node, remove the offending taint, or resolve the underlying pressure condition (refer to Node pressure above). If a drain stalled, complete or cancel it.
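
To review cordons and taints across all Nodes in one pass, the sketch below lists each Node that is cordoned (spec.unschedulable is true) or carries a NoSchedule or NoExecute taint. It assumes the parsed output of kubectl get nodes -o json. The helper name scheduling_blockers and the reason strings are hypothetical.

```python
BLOCKING_EFFECTS = {"NoSchedule", "NoExecute"}


def scheduling_blockers(node_list):
    """Return {node name: list of reasons new Pods may be blocked}.

    Assumption: node_list is the parsed JSON of `kubectl get nodes -o json`.
    """
    result = {}
    for node in node_list.get("items", []):
        spec = node.get("spec", {})
        reasons = []
        # A cordoned Node has spec.unschedulable set to true.
        if spec.get("unschedulable"):
            reasons.append("cordoned")
        # PreferNoSchedule taints are skipped because they do not block placement.
        for taint in spec.get("taints") or []:
            if taint.get("effect") in BLOCKING_EFFECTS:
                reasons.append(f"taint {taint.get('key')}:{taint['effect']}")
        if reasons:
            result[node["metadata"]["name"]] = reasons
    return result
```

A cordoned Node typically also carries the node.kubernetes.io/unschedulable taint, so it can appear with both reasons. Control plane Nodes that are tainted by design also show up here and are usually expected.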