Identify unhealthy infrastructure

Use the entity catalog to find infrastructure components (Pods, nodes, clusters) with health issues across your entire environment. Filter by insight categories to focus on critical problems like resource saturation or Pod failures.

When to use this workflow

Use this workflow when you want to:

- Find all infrastructure with critical insights across clusters
- Identify resource saturation before it impacts services
- Locate failing Pods or nodes during an incident
- Audit infrastructure health across your environment

This workflow is most useful for infrastructure teams and SREs, particularly during incident response.

Before you begin

Ensure your infrastructure is:

- Sending Kubernetes metrics to Grafana Cloud
- Configured with infrastructure metrics collection (CPU, memory, disk)
- Visible in the entity catalog

Open the entity catalog

From Grafana Cloud, navigate to Observability > Entity catalog.

Filter to infrastructure entities

Narrow the entity catalog to show only infrastructure components with issues.

Show infrastructure entity types

Under type, select one infrastructure type:
- Pod - Individual Kubernetes Pods
- Node - Kubernetes worker nodes
- NodeGroup - Collections of similar nodes
- KubeCluster - Entire clusters
- Namespace - Logical groupings

Start with Pod to see the most granular infrastructure issues, or select Node to focus on worker node health.

Filter by insight categories

Show only infrastructure with specific problems:

Under Insight Rings, select relevant categories:
- Saturation - Resources approaching limits (CPU, memory, disk)
- Failure - Pod crashes, node failures, persistent volume issues
- Anomaly - Unusual resource consumption patterns

Focus on Saturation and Failure for infrastructure troubleshooting.

Review infrastructure health

After you’ve filtered to infrastructure entities, examine the specific health indicators for each entity type.

Pods with issues

When viewing Pods, check for:

Restart count spikes

- High restart counts often indicate Pods stuck in CrashLoopBackOff or Pods killed for running out of memory (OOMKilled)
- Click the Pod to see logs and identify crash causes

CPU/Memory saturation

- Red insight rings indicate Pods hitting resource limits
- Check if limits are too low or if the Pod has a memory leak
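
As a rough illustration of what "hitting resource limits" means, the following sketch compares usage against a configured limit. The function names and the 0.9 threshold are assumptions made for this example. They are not values or logic taken from the entity catalog.

```python
from typing import Optional


def saturation_ratio(usage: float, limit: Optional[float]) -> Optional[float]:
    """Return usage as a fraction of the limit, or None when no limit is set."""
    if limit is None or limit <= 0:
        return None
    return usage / limit


def is_saturated(usage: float, limit: Optional[float], threshold: float = 0.9) -> bool:
    """Flag a resource whose usage is at or above the (illustrative) threshold."""
    ratio = saturation_ratio(usage, limit)
    return ratio is not None and ratio >= threshold
```

Usage and limit must share a unit, for example bytes for memory or cores for CPU. A Pod without a limit returns None rather than a ratio, because there is no ceiling to measure against.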

Common Pod insights

- Pod OOMKilled - Increase memory limits or investigate memory usage
- Pod CrashLoopBackOff - Check logs for application errors
- CPU Throttling - Increase CPU limits or optimize application
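
To cross-check these insights against the raw Kubernetes state, you can inspect a Pod's container statuses. The sketch below is a minimal example that assumes the Pod is available as a dictionary in the shape returned by `kubectl get pod <name> -o json`. The function name `pod_findings` and the default restart threshold of 5 are illustrative choices, not part of any Grafana or Kubernetes API.

```python
from typing import Any


def pod_findings(pod: dict[str, Any], restart_threshold: int = 5) -> list[str]:
    """Summarize crash-related signals from a Pod's status.containerStatuses."""
    findings: list[str] = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        name = cs.get("name", "<unknown>")
        restarts = cs.get("restartCount", 0)
        # state.waiting.reason is "CrashLoopBackOff" while Kubernetes backs off restarts.
        waiting = cs.get("state", {}).get("waiting") or {}
        # lastState.terminated.reason is "OOMKilled" when the container exceeded its memory limit.
        last_terminated = cs.get("lastState", {}).get("terminated") or {}

        if waiting.get("reason") == "CrashLoopBackOff":
            findings.append(f"{name}: CrashLoopBackOff, check logs for application errors")
        if last_terminated.get("reason") == "OOMKilled":
            findings.append(f"{name}: OOMKilled, review memory limits and usage")
        if restarts >= restart_threshold:
            findings.append(f"{name}: {restarts} restarts")
    return findings
```

A container can produce more than one finding, for example a high restart count together with an OOMKilled last state.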

Nodes with issues

When viewing nodes, look for:

Resource pressure

- Nodes with high CPU or memory usage
- Disk pressure warnings
- Too many Pods scheduled on the node

Node status

- NotReady status indicates node failure
- DiskPressure or MemoryPressure conditions
- Network or kubelet issues

Common node insights

- Node disk pressure - Clean up disk space or add capacity
- Node memory pressure - Rebalance Pods or add nodes
- Node NotReady - Investigate kubelet logs or infrastructure issues
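
The node conditions above can also be read directly from the Kubernetes Node object. The following sketch assumes a node dictionary in the shape returned by `kubectl get node <name> -o json`, where `status.conditions` is a list of entries with `type` and `status` fields. The function name `node_findings` is hypothetical.

```python
from typing import Any


def node_findings(node: dict[str, Any]) -> list[str]:
    """Return the problem conditions reported in a Node's status.conditions."""
    conditions = {
        c["type"]: c["status"]
        for c in node.get("status", {}).get("conditions", [])
    }
    findings: list[str] = []
    # Ready must be "True"; "False" or "Unknown" both mean the node is NotReady.
    if conditions.get("Ready") != "True":
        findings.append("NotReady")
    # Pressure conditions are problems when their status is "True".
    for pressure in ("DiskPressure", "MemoryPressure"):
        if conditions.get(pressure) == "True":
            findings.append(pressure)
    return findings
```

A node with no reported `Ready` condition is treated as NotReady here, which is a conservative choice for this sketch.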

Namespaces and clusters

For higher-level views:

Namespace resource usage

- Total CPU and memory across all Pods in the namespace
- Pod count approaching quota limits
- Identify which namespace is consuming the most resources

Cluster capacity

- Total nodes and their health status
- Overall cluster resource utilization
- Pods pending due to insufficient capacity
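
To make the namespace roll-up concrete, the sketch below totals per-Pod usage by namespace and ranks namespaces by memory. The record shape (`namespace`, `cpu_cores`, `memory_bytes`) is a hypothetical one chosen for this example. It does not reflect how the entity catalog stores or exposes data.

```python
from collections import defaultdict
from typing import Any


def usage_by_namespace(pods: list[dict[str, Any]]) -> list[tuple[str, dict[str, Any]]]:
    """Sum CPU, memory, and Pod count per namespace, largest memory consumer first."""
    totals: dict[str, dict[str, Any]] = defaultdict(
        lambda: {"cpu_cores": 0.0, "memory_bytes": 0, "pods": 0}
    )
    for pod in pods:
        entry = totals[pod["namespace"]]
        entry["cpu_cores"] += pod["cpu_cores"]
        entry["memory_bytes"] += pod["memory_bytes"]
        entry["pods"] += 1
    # Rank by memory; swap the key to "cpu_cores" to rank by CPU instead.
    return sorted(totals.items(), key=lambda item: item[1]["memory_bytes"], reverse=True)
```

The first item in the result is the namespace consuming the most memory, and its `pods` count can be compared against the namespace quota.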

Investigate infrastructure issues

Click any infrastructure entity to open its details:

Pod details

Kubernetes tab:

- CPU and memory usage over time
- Network I/O patterns
- Restart history

Logs tab:

- Pre-filtered to this Pod
- Shows crash logs and error messages
- Check logs around restart times

Properties tab:

- Node the Pod is running on
- Resource requests and limits
- Labels and annotations

Node details

Kubernetes tab:

- CPU/memory capacity vs usage
- Disk usage and I/O
- Pod count on this node

Connected entities:

- See all Pods running on the node
- Check if specific Pods are causing issues
- Identify if Pods should be rescheduled

Common infrastructure patterns

Recognize these patterns to quickly diagnose and respond to infrastructure issues.

Pod failures

If you see multiple Pods failing in the same namespace:

Check if they run on the same node (node failure).
Look for shared dependency failures (database, external service).
Review recent deployments or configuration changes.
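
The first check, whether failing Pods share a node, can be sketched as a simple grouping. The example assumes Pod dictionaries in the Kubernetes JSON shape, where `spec.nodeName` holds the node a Pod is scheduled on. The function names and the 50% share threshold are assumptions for illustration.

```python
from collections import defaultdict
from typing import Any


def failing_pods_by_node(pods: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Group failing Pod names by the node they are scheduled on."""
    by_node: dict[str, list[str]] = defaultdict(list)
    for pod in pods:
        node = pod.get("spec", {}).get("nodeName", "<unscheduled>")
        by_node[node].append(pod["metadata"]["name"])
    return dict(by_node)


def suspect_nodes(pods: list[dict[str, Any]], min_share: float = 0.5) -> list[str]:
    """Return nodes hosting more than one failing Pod and at least min_share of all failures."""
    grouped = failing_pods_by_node(pods)
    total = sum(len(names) for names in grouped.values())
    return [
        node
        for node, names in grouped.items()
        if total and len(names) > 1 and len(names) / total >= min_share
    ]
```

If no node stands out, the failures are more likely caused by a shared dependency or a recent change than by a single node.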

Resource saturation

When CPU or memory saturation appears:

Immediate: Check if autoscaling is configured.
Short-term: Increase resource limits if appropriate.
Long-term: Investigate application efficiency and optimization.

Node problems affecting services

If services are degraded and you suspect infrastructure:

Filter the entity catalog to services with errors.
Click a service and view Connected entities.
Check Pods and nodes running the service.
Look for Pod restarts or node issues correlating with service errors.

Use RCA workbench for multi-entity investigation

When infrastructure issues span multiple entities:

From the entity catalog, click problematic Pods or nodes.
Click Add to RCA workbench for each relevant entity.
Navigate to Observability > RCA workbench.
View insights on a timeline to see:
- Which failures happened first
- Correlation between infrastructure and service issues
- Amend insights (deployments, scale events) that triggered problems
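
The reasoning behind the timeline view can be sketched as an ordering problem: sort insights by start time, find the first failure, and list the changes that preceded it. The `Insight` record below is a hypothetical shape invented for this example and is not the RCA workbench data model. Preceding a failure also does not prove that a change caused it.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Insight:
    entity: str
    category: str  # for example "Failure", "Saturation", or "Amend"
    started: datetime


def timeline(insights: list[Insight]) -> list[Insight]:
    """Order insights by start time, earliest first."""
    return sorted(insights, key=lambda insight: insight.started)


def amends_before_first_failure(insights: list[Insight]) -> list[Insight]:
    """Return Amend insights that started at or before the earliest Failure insight."""
    ordered = timeline(insights)
    first_failure = next((i for i in ordered if i.category == "Failure"), None)
    if first_failure is None:
        return []
    return [
        i for i in ordered
        if i.category == "Amend" and i.started <= first_failure.started
    ]
```

Changes returned by `amends_before_first_failure` are candidates to review first, such as a deployment that landed shortly before Pods started failing.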

Filter by cluster, namespace, or environment

Narrow your view to specific parts of your infrastructure using property filters.

By cluster

Click the dropdown and select Show all KubeClusters.
Select the cluster experiencing issues.
See all unhealthy infrastructure in that cluster.

By namespace

Use the Namespace dropdown.
Select namespaces your team owns.
Focus on infrastructure you’re responsible for.

By environment

Use the Env dropdown.
Select production, staging, or other environments.
Prioritize production infrastructure issues.

Bookmark critical views

Save filtered views for quick access to infrastructure health checks.

Create bookmarked views for common scenarios:

- All critical infrastructure - Filter to Saturation and Failure insights
- Production Pods with issues - Filter to the production namespace, the Pod entity type, and insights
- Node health - Filter to the Node entity type and insights
- Cluster capacity - Show all KubeCluster entities