Grafana Cloud

Risks

The Risks tab on the Cluster Health page shows the results of a set of checks, each with a live count of active reliability problems in your Clusters. The checks fall into three categories:

  • Availability: Workloads that are completely unreachable
  • Stability: Workloads that are running but degraded or failing
  • Infrastructure: Underlying node-level issues that can cascade into wider outages

Each check is color-coded:

  • Green: No issues found. The count is zero.
  • Red: One or more active issues that need attention.
  • Blue: The check is currently selected and its detail table is displayed below.

Click View detail on any check to jump to the detail table for that issue. You can also switch between tables using the Detail view drop-down menu.

Health Risks tab showing current state of missing configuration risks

Availability

Availability checks identify workloads and nodes that are down or unable to serve traffic.

Zero replica deployments

These are Deployments configured to run at least one replica but with zero available replicas. The workload is fully down. Deployments intentionally scaled to zero are excluded.

Common causes: Failed rollouts, image pull errors, insufficient cluster resources, or misconfigured probes.
How to resolve: Check rollout status, Pod events, and container logs. Roll back to a previous revision, fix the image reference, free up cluster resources, or correct probe settings.
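The condition behind this check can be sketched in a few lines of Python, assuming Deployment objects shaped like the JSON that `kubectl get deployment -o json` returns (the helper name is illustrative, not part of any API):

```python
def is_fully_down(deployment: dict) -> bool:
    """True when a Deployment wants at least one replica but none are available.

    Field names follow the Kubernetes API; the data shape assumes
    `kubectl get deployment -o json` output.
    """
    # Kubernetes defaults spec.replicas to 1 when unset.
    desired = deployment.get("spec", {}).get("replicas", 1)
    available = deployment.get("status", {}).get("availableReplicas", 0)
    # Deployments intentionally scaled to zero (desired == 0) are excluded.
    return desired >= 1 and available == 0

# A Deployment that wants 3 replicas but has none available:
dep = {"spec": {"replicas": 3}, "status": {"availableReplicas": 0}}
print(is_fully_down(dep))  # True
```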

Deployment rollout issues

These are deployments whose rollout has one of these conditions:

  • Not Progressing means the deployment controller has not made progress within the deadline.
  • Replica Failure means at least one replica Pod could not be created or deleted.
Common causes: Insufficient cluster resources (CPU or memory), image pull errors (wrong image name, tag, or expired credentials), failing readiness or liveness probes, resource quota limits exceeded, volume mount failures, or Pod security policy violations.
How to resolve: Inspect deployment events and Pod status. Scale up node resources, fix image references or registry credentials, adjust probe settings, increase resource quotas, correct volume configurations, or update security policies.
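The two conditions map directly to fields in Deployment status. As a rough sketch in Python (assuming objects shaped like `kubectl get deployment -o json` output; the function name is illustrative):

```python
def rollout_issue(deployment: dict):
    """Return the failing rollout condition, if any, from a Deployment's status."""
    for cond in deployment.get("status", {}).get("conditions", []):
        # Progressing=False: the controller missed its progress deadline.
        if cond["type"] == "Progressing" and cond["status"] == "False":
            return "NotProgressing"
        # ReplicaFailure=True: a replica Pod could not be created or deleted.
        if cond["type"] == "ReplicaFailure" and cond["status"] == "True":
            return "ReplicaFailure"
    return None

stuck = {"status": {"conditions": [
    {"type": "Progressing", "status": "False", "reason": "ProgressDeadlineExceeded"},
]}}
print(rollout_issue(stuck))  # NotProgressing
```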

Nodes not ready

These are Nodes where the Ready condition is False or Unknown. A NotReady node prevents new Pods from being scheduled and may disrupt running workloads. The Status column distinguishes a confirmed NotReady state from a transient Unknown state (meaning the Node is unreachable).

Common causes: kubelet crash or failure to report status; the Node running out of memory, disk, or PIDs; network connectivity loss between the Node and the control plane; underlying VM or hardware failure; expired Node certificates; or a kernel or OS-level crash.
How to resolve: Check kubelet logs and Node events. Restart the kubelet, free up Node resources, restore network connectivity, renew certificates, or replace the failed Node.
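The NotReady-versus-Unknown distinction comes from the Node's Ready condition. A minimal classifier, assuming Node objects shaped like `kubectl get node -o json` output (the function name is illustrative):

```python
def node_readiness(node: dict) -> str:
    """Classify a Node by its Ready condition: 'Ready', 'NotReady', or 'Unknown'."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond["type"] == "Ready":
            if cond["status"] == "True":
                return "Ready"
            # "Unknown" means the control plane has lost contact with the kubelet.
            return "Unknown" if cond["status"] == "Unknown" else "NotReady"
    # No Ready condition reported at all: treat as unreachable.
    return "Unknown"

unreachable = {"status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}}
print(node_readiness(unreachable))  # Unknown
```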

Pods not ready

These are Pods in the Running phase that are failing their readiness probe. They are excluded from Service endpoints and are not receiving traffic.

Common causes: Misconfigured readiness probes, application startup delays, or missing dependencies.
How to resolve: Review the readiness probe configuration and adjust thresholds or timeouts. Check that dependent services are available and that the application starts within the expected window.
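This check combines the Pod phase with its Ready condition. A sketch of the logic, assuming Pod objects shaped like `kubectl get pod -o json` output (the function name is illustrative):

```python
def running_but_not_ready(pod: dict) -> bool:
    """True for a Running Pod whose Ready condition is not True.

    Such Pods are failing readiness and are excluded from Service endpoints.
    """
    status = pod.get("status", {})
    if status.get("phase") != "Running":
        return False
    for cond in status.get("conditions", []):
        if cond["type"] == "Ready":
            return cond["status"] != "True"
    # Running but no Ready condition reported yet: treat as not ready.
    return True

pod = {"status": {"phase": "Running",
                  "conditions": [{"type": "Ready", "status": "False"}]}}
print(running_but_not_ready(pod))  # True
```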

Stability

Stability checks detect containers and Pods that are crashing, restarting, or stuck.

Restarting containers

These are containers that have restarted more than twice in the last hour, sorted by highest restart count.

A high restart count typically signals a crash loop.

Common causes: OOM kills, failed liveness probes, or application errors.
How to resolve: Inspect Pod logs and events to determine why the container is crashing. Adjust resource limits, correct the liveness probe configuration, or fix the underlying application error.
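A simplified sketch of the check, assuming Pod objects shaped like `kubectl get pods -o json` output. Note one simplification: `restartCount` in Pod status is cumulative, so the live check's "last hour" window would require diffing counts over time; this sketch just flags cumulative counts over a threshold (names are illustrative):

```python
def frequent_restarters(pods, threshold=2):
    """List (pod, container, restarts) over the threshold, highest count first."""
    rows = []
    for pod in pods:
        for cs in pod.get("status", {}).get("containerStatuses", []):
            if cs.get("restartCount", 0) > threshold:
                rows.append((pod["metadata"]["name"], cs["name"], cs["restartCount"]))
    # Sort by restart count, descending, to surface the worst crash loops first.
    return sorted(rows, key=lambda r: r[2], reverse=True)

pods = [
    {"metadata": {"name": "api"}, "status": {"containerStatuses": [{"name": "app", "restartCount": 7}]}},
    {"metadata": {"name": "web"}, "status": {"containerStatuses": [{"name": "app", "restartCount": 1}]}},
]
print(frequent_restarters(pods))  # [('api', 'app', 7)]
```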

OOMKilled containers

These are containers whose most recent termination was caused by the kernel's out-of-memory (OOM) killer, reported by Kubernetes as the OOMKilled termination reason.

Common causes: The container exceeded its memory limit, or the Node ran out of memory.
How to resolve: Increase memory limits or requests for the affected container. If the Node is under memory pressure, consider scaling up or redistributing workloads.
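The OOMKilled reason lives under the container's `lastState.terminated` field. A sketch of the condition, assuming a container status object from `kubectl get pod -o json` output (the function name is illustrative):

```python
def was_oom_killed(container_status: dict) -> bool:
    """True when the container's most recent termination was an OOM kill."""
    terminated = container_status.get("lastState", {}).get("terminated")
    # The kernel OOM killer surfaces in Kubernetes as reason "OOMKilled".
    return bool(terminated) and terminated.get("reason") == "OOMKilled"

cs = {"name": "app", "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}
print(was_oom_killed(cs))  # True
```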

Pending Pods

These are Pods stuck in the Pending phase that cannot be scheduled onto a Node.

Common causes: Insufficient cluster resources, unsatisfiable Node affinity or taints, missing PersistentVolumes, or image pull failures.
How to resolve: Scale up the cluster or free resources, adjust Node affinity rules or taints, provision the required PersistentVolumes, or fix image references.
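A Pod the scheduler has explicitly failed to place carries a PodScheduled=False condition. A sketch of the check, assuming Pod objects shaped like `kubectl get pod -o json` output (the function name is illustrative):

```python
def is_unschedulable(pod: dict) -> bool:
    """True for a Pending Pod the scheduler has explicitly failed to place."""
    status = pod.get("status", {})
    if status.get("phase") != "Pending":
        return False
    for cond in status.get("conditions", []):
        # PodScheduled=False (often with reason "Unschedulable") means no
        # Node currently satisfies the Pod's resource or affinity requirements.
        if cond["type"] == "PodScheduled" and cond["status"] == "False":
            return True
    return False

pod = {"status": {"phase": "Pending",
                  "conditions": [{"type": "PodScheduled", "status": "False",
                                  "reason": "Unschedulable"}]}}
print(is_unschedulable(pod))  # True
```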

Image pull errors

These are containers waiting because their image cannot be pulled. ImagePullBackOff means Kubernetes is retrying with exponential backoff. ErrImagePull is the initial failure.

Common causes: Incorrect image names or tags, missing or expired registry credentials, or rate limiting by the registry.
How to resolve: Verify the image name and tag, update or create registry pull secrets, or wait for rate limits to reset and consider authenticating to increase limits.
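Both reasons appear under the container's `state.waiting` field. A sketch of the check, assuming Pod objects shaped like `kubectl get pod -o json` output (names are illustrative):

```python
# ErrImagePull is the initial failure; ImagePullBackOff is the retry loop.
PULL_ERRORS = {"ErrImagePull", "ImagePullBackOff"}

def image_pull_failures(pod: dict):
    """Containers in this Pod that are waiting on a failed image pull."""
    failures = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting")
        if waiting and waiting.get("reason") in PULL_ERRORS:
            failures.append((cs["name"], waiting["reason"]))
    return failures

pod = {"status": {"containerStatuses": [
    {"name": "app", "state": {"waiting": {"reason": "ImagePullBackOff"}}},
    {"name": "sidecar", "state": {"running": {}}},
]}}
print(image_pull_failures(pod))  # [('app', 'ImagePullBackOff')]
```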

Infrastructure

Infrastructure checks identify:

  • Nodes under resource pressure
  • Pods that have been evicted or lost contact with the Cluster

Node pressure

These are Nodes with an active MemoryPressure, DiskPressure, or PIDPressure condition. These conditions are early warning signals for Pod evictions and failures.

Common causes: High memory consumption on the Node, low available disk space or inodes, or too many running processes exhausting available PIDs.
How to resolve: Free up Node resources by evicting or rescheduling non-critical workloads, expand disk capacity, or increase the PID limit. Address these before they cause Pod evictions or failures.
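Pressure conditions are standard entries in Node status. A sketch of the check, assuming Node objects shaped like `kubectl get node -o json` output (names are illustrative):

```python
PRESSURE_TYPES = {"MemoryPressure", "DiskPressure", "PIDPressure"}

def active_pressures(node: dict):
    """Names of pressure conditions currently True on this Node."""
    return [
        cond["type"]
        for cond in node.get("status", {}).get("conditions", [])
        if cond["type"] in PRESSURE_TYPES and cond["status"] == "True"
    ]

node = {"status": {"conditions": [
    {"type": "MemoryPressure", "status": "True"},
    {"type": "DiskPressure", "status": "False"},
    {"type": "Ready", "status": "True"},
]}}
print(active_pressures(node))  # ['MemoryPressure']
```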

Evicted Pods

These are Pods evicted by the kubelet due to Node resource pressure or by the scheduler due to priority preemption.

Common causes: Node memory, disk, or PID pressure; or preemption by higher-priority Pods.
How to resolve: Resolve the underlying Node pressure (refer to Node pressure above), adjust Pod priority classes, or redistribute workloads across Nodes.
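Evicted Pods are reported with phase Failed and reason Evicted in Pod status. A sketch of the condition, assuming Pod objects shaped like `kubectl get pod -o json` output (the function name is illustrative):

```python
def was_evicted(pod: dict) -> bool:
    """True for a Pod terminated by eviction (phase Failed, reason Evicted)."""
    status = pod.get("status", {})
    return status.get("phase") == "Failed" and status.get("reason") == "Evicted"

pod = {"status": {"phase": "Failed", "reason": "Evicted",
                  "message": "The node was low on resource: memory."}}
print(was_evicted(pod))  # True
```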

Pods in unknown phase

These are Pods in the Unknown phase. This typically occurs when the Node hosting the Pod becomes unreachable and Kubernetes can no longer determine the Pod’s state.

Common causes: Node failure or severe network partition between the Node and the control plane.
How to resolve: Check Node status and network connectivity. Restart the Node or restore network access. Unknown Pods transition to Failed or recover once Node connectivity is restored.
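Because an unreachable Node is the usual cause, grouping Unknown Pods by their Node is a quick way to spot it. A sketch, assuming Pod objects shaped like `kubectl get pods -o json` output (the function name is illustrative):

```python
def unknown_pods_by_node(pods):
    """Group Pods in the Unknown phase by the Node hosting them.

    A cluster of Unknown Pods on one Node usually points at that Node
    being unreachable rather than at the workloads themselves.
    """
    by_node = {}
    for pod in pods:
        if pod.get("status", {}).get("phase") == "Unknown":
            node = pod.get("spec", {}).get("nodeName", "<unscheduled>")
            by_node.setdefault(node, []).append(pod["metadata"]["name"])
    return by_node

pods = [
    {"metadata": {"name": "api"}, "spec": {"nodeName": "node-1"}, "status": {"phase": "Unknown"}},
    {"metadata": {"name": "web"}, "spec": {"nodeName": "node-1"}, "status": {"phase": "Unknown"}},
    {"metadata": {"name": "db"}, "spec": {"nodeName": "node-2"}, "status": {"phase": "Running"}},
]
print(unknown_pods_by_node(pods))  # {'node-1': ['api', 'web']}
```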