Identify unhealthy infrastructure
Use the entity catalog to find infrastructure components (Pods, nodes, clusters) with health issues across your entire environment. Filter by insight categories to focus on critical problems like resource saturation or Pod failures.
When to use this workflow
Use this workflow when you want to:
- Find all infrastructure with critical insights across clusters
- Identify resource saturation before it impacts services
- Locate failing Pods or nodes during an incident
- Audit infrastructure health across your environment
This workflow is essential for infrastructure teams, SREs, and during incident response.
Before you begin
Ensure your infrastructure is:
- Sending Kubernetes metrics to Grafana Cloud
- Configured with infrastructure metrics collection (CPU, memory, disk)
- Visible in the entity catalog
Open the entity catalog
From Grafana Cloud, navigate to Observability > Entity catalog.
Filter to infrastructure entities
Narrow the entity catalog to show only infrastructure components with issues.
Show infrastructure entity types
- Under Type, select an infrastructure type:
- Pod - Individual Kubernetes Pods
- Node - Kubernetes worker nodes
- NodeGroup - Collections of similar nodes
- KubeCluster - Entire clusters
- Namespace - Logical groupings
Start with Pod to see the most granular infrastructure issues, or select Node to focus on worker node health.
Filter by insight categories
Show only infrastructure with specific problems:
- Under Insight Rings, select relevant categories:
- Saturation - Resources approaching limits (CPU, memory, disk)
- Failure - Pod crashes, node failures, persistent volume issues
- Anomaly - Unusual resource consumption patterns
Focus on Saturation and Failure for infrastructure troubleshooting.
Review infrastructure health
After you’ve filtered to infrastructure entities, examine the specific health indicators for each entity type.
Pods with issues
When viewing Pods, check for:
Restart count spikes
- High restart counts indicate CrashLoopBackOff or OOMKilled Pods
- Click the Pod to see logs and identify crash causes
CPU/Memory saturation
- Red insight rings indicate Pods hitting resource limits
- Check if limits are too low or if the Pod has a memory leak
Common Pod insights
- Pod OOMKilled - Increase memory limits or investigate memory usage
- Pod CrashLoopBackOff - Check logs for application errors
- CPU Throttling - Increase CPU limits or optimize application
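The Pod checks above can be sketched as a small triage function. The field names (`last_termination_reason`, `waiting_reason`, `cpu_throttled_ratio`) and the throttling threshold are illustrative assumptions for this sketch, not the entity catalog's schema or Grafana's actual insight logic.

```python
def triage_pod(pod):
    """Map simplified Pod status fields to a suggested action.

    `pod` is a plain dict; the field names and thresholds here are
    illustrative assumptions, not a product-defined schema.
    """
    if pod.get("last_termination_reason") == "OOMKilled":
        return "Increase memory limits or investigate memory usage"
    if pod.get("waiting_reason") == "CrashLoopBackOff":
        return "Check logs for application errors"
    if pod.get("cpu_throttled_ratio", 0.0) > 0.25:
        return "Increase CPU limits or optimize application"
    return "No common insight matched"

# Example: a Pod whose last container was killed for exceeding its memory limit
print(triage_pod({"last_termination_reason": "OOMKilled"}))
```

In practice these signals come from the insight rings and the Pod's Kubernetes tab; the function only illustrates the decision order.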
Nodes with issues
When viewing nodes, look for:
Resource pressure
- Nodes with high CPU or memory usage
- Disk pressure warnings
- Too many Pods scheduled on the node
Node status
- NotReady status indicates node failure
- DiskPressure or MemoryPressure conditions
- Network or kubelet issues
Common node insights
- Node disk pressure - Clean up disk space or add capacity
- Node memory pressure - Re-balance Pods or add nodes
- Node NotReady - Investigate kubelet logs or infrastructure issues
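The node-condition checks above map naturally to a lookup. The condition type names (Ready, DiskPressure, MemoryPressure) are standard Kubernetes node conditions; the suggested actions simply restate the list above.

```python
def node_actions(conditions):
    """Suggest follow-ups for common node conditions.

    `conditions` is the set of Kubernetes node condition types that are
    currently True for the node (a simplified stand-in for node status).
    """
    actions = []
    if "Ready" not in conditions:
        actions.append("Node NotReady: investigate kubelet logs or infrastructure")
    if "DiskPressure" in conditions:
        actions.append("Disk pressure: clean up disk space or add capacity")
    if "MemoryPressure" in conditions:
        actions.append("Memory pressure: re-balance Pods or add nodes")
    return actions

# A node that is Ready but reporting disk pressure
print(node_actions({"Ready", "DiskPressure"}))
```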
Namespaces and clusters
For higher-level views:
Namespace resource usage
- Total CPU/memory across all Pods in namespace
- Pod count approaching quota limits
- Identify which namespace is consuming the most resources
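The namespace rollup above amounts to a group-by over per-Pod usage. A minimal sketch, assuming you already have per-Pod CPU and memory figures (in practice these come from Grafana Cloud metrics, not hard-coded tuples):

```python
from collections import defaultdict

def namespace_usage(pods):
    """Aggregate per-Pod CPU (millicores) and memory (MiB) by namespace.

    `pods` is a list of (namespace, cpu_millicores, memory_mib) tuples,
    a simplified stand-in for metrics pulled from your data source.
    """
    totals = defaultdict(lambda: [0, 0])
    for ns, cpu, mem in pods:
        totals[ns][0] += cpu
        totals[ns][1] += mem
    return dict(totals)

usage = namespace_usage([
    ("payments", 500, 1024),
    ("payments", 250, 512),
    ("frontend", 100, 256),
])
# Find the namespace consuming the most memory
top = max(usage, key=lambda ns: usage[ns][1])
print(top, usage[top])
```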
Cluster capacity
- Total nodes and their health status
- Overall cluster resource utilization
- Pods pending due to insufficient capacity
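"Pods pending due to insufficient capacity" boils down to a headroom check per node. This sketch deliberately considers only memory; real Kubernetes scheduling also weighs CPU, taints, and affinity, and the node data shape here is an assumption for illustration.

```python
def can_schedule(pending_request_mib, nodes):
    """Return the nodes with enough free memory for a pending Pod.

    `nodes` maps node name -> (allocatable_mib, requested_mib). This is
    a simplified capacity check, not the Kubernetes scheduler's logic.
    """
    return [name for name, (alloc, used) in nodes.items()
            if alloc - used >= pending_request_mib]

nodes = {"node-a": (8192, 7900), "node-b": (8192, 4096)}
print(can_schedule(512, nodes))  # only node-b has enough headroom
```

If the list comes back empty for every pending Pod, the cluster needs more capacity or existing Pods need lower requests.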
Investigate infrastructure issues
Click any infrastructure entity to open its details:
Pod details
Kubernetes tab:
- CPU and memory usage over time
- Network I/O patterns
- Restart history
Logs tab:
- Pre-filtered to this Pod
- Shows crash logs and error messages
- Check logs around restart times
Properties tab:
- Node the Pod is running on
- Resource requests and limits
- Labels and annotations
Node details
Kubernetes tab:
- CPU/memory capacity vs usage
- Disk usage and I/O
- Pod count on this node
Connected entities:
- See all Pods running on the node
- Check if specific Pods are causing issues
- Identify if Pods should be rescheduled
Common infrastructure patterns
Recognize these patterns to quickly diagnose and respond to infrastructure issues.
Pod failures
If you see multiple Pods failing in the same namespace:
- Check if they run on the same node (node failure).
- Look for shared dependency failures (database, external service).
- Review recent deployments or configuration changes.
Resource saturation
When CPU or memory saturation appears:
- Immediate: Check if auto-scaling is configured.
- Short-term: Increase resource limits if appropriate.
- Long-term: Investigate application efficiency and optimization.
Node problems affecting services
If services are degraded and you suspect infrastructure:
- Filter the entity catalog to services with errors.
- Click a service and view Connected entities.
- Check Pods and nodes running the service.
- Look for Pod restarts or node issues correlating with service errors.
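Step 4 above, correlating Pod restarts with the service error window, can be checked programmatically. Timestamps here are plain Unix seconds for illustration; in practice you would read them off the RCA workbench timeline.

```python
def restarts_in_window(restart_times, error_start, error_end):
    """Return the Pod restart timestamps that fall inside a service
    error window. Any hits suggest the restarts and the service errors
    are related and worth investigating together.
    """
    return [t for t in restart_times if error_start <= t <= error_end]

# Restarts at t=100 and t=205; service errors observed from t=200 to t=300
print(restarts_in_window([100, 205], 200, 300))
```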
Use RCA workbench for multi-entity investigation
When infrastructure issues span multiple entities:
- From the entity catalog, click problematic Pods or nodes.
- Click Add to RCA workbench for each relevant entity.
- Navigate to Observability > RCA workbench.
- View insights on a timeline to see:
- Which failures happened first
- Correlation between infrastructure and service issues
- Events (deployments, scale events) that triggered problems
Filter by cluster, namespace, or environment
Narrow your view to specific parts of your infrastructure using property filters.
By cluster
- Click the dropdown and select Show all KubeClusters.
- Select the cluster experiencing issues.
- See all unhealthy infrastructure in that cluster.
By namespace
- Use the Namespace dropdown.
- Select namespaces your team owns.
- Focus on infrastructure you’re responsible for.
By environment
- Use the Env dropdown.
- Select production, staging, or other environments.
- Prioritize production infrastructure issues.
Bookmark critical views
Save filtered views for quick access to infrastructure health checks.
Create bookmarked views for common scenarios:
- All critical infrastructure - Filter to Saturation + Failure insights
- Production Pods with issues - Filter to production namespace + Pod + insights
- Node health - Filter to Node entity type + insights
- Cluster capacity - Show all KubeCluster entities
What to look for
Prioritize infrastructure issues based on severity and potential impact.
Immediate action required
- Pods in CrashLoopBackOff (application startup failing)
- Nodes NotReady (infrastructure failure)
- Disk saturation above 90% (imminent failure)
- Memory saturation causing out of memory kills
Monitor closely
- CPU throttling (performance impact)
- Memory usage above 80% (approaching limits)
- Restart counts increasing over time
- Network I/O saturation
Investigate proactively
- Resource usage trends climbing
- Anomaly insights on resource consumption
- Namespaces approaching quota limits
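The three tiers above can be expressed as a priority function. The field names and the 90%/80% thresholds mirror the lists above but are otherwise illustrative assumptions, not product-defined rules.

```python
def priority(entity):
    """Bucket an entity into the triage tiers described above.

    `entity` is a plain dict of simplified health fields; the field
    names and thresholds are illustrative, not the entity catalog's.
    """
    # Immediate action required
    if (entity.get("crash_loop") or entity.get("not_ready")
            or entity.get("disk_pct", 0) > 90 or entity.get("oom_killed")):
        return "immediate"
    # Monitor closely
    if (entity.get("cpu_throttled") or entity.get("memory_pct", 0) > 80
            or entity.get("restarts_trending_up")):
        return "monitor"
    # Investigate proactively (default tier)
    return "investigate"

print(priority({"disk_pct": 95}))    # immediate
print(priority({"memory_pct": 85}))  # monitor
print(priority({}))                  # investigate
```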
Next steps
After identifying unhealthy infrastructure:
- Pod issues: Check telemetry data for logs and metrics
- Service impact: Use monitor services to see affected applications
- Multi-entity incident: Use investigate incidents in RCA workbench
- Deployment correlation: Use track changes to find triggering events
Related workflows
- Monitor services - See service impact of infrastructure issues
- Investigate incidents - Correlate infrastructure and service problems
- Explore dependencies - Understand which services run on affected infrastructure