Identify unhealthy infrastructure

Use the entity catalog to find infrastructure components (Pods, nodes, clusters) with health issues across your entire environment. Filter by insight categories to focus on critical problems like resource saturation or Pod failures.

When to use this workflow

Use this workflow when you want to:

  • Find all infrastructure with critical insights across clusters
  • Identify resource saturation before it impacts services
  • Locate failing Pods or nodes during an incident
  • Audit infrastructure health across your environment

This workflow is most useful for infrastructure teams and SREs, particularly during incident response.

Before you begin

Ensure your infrastructure is:

  • Sending Kubernetes metrics to Grafana Cloud
  • Configured with infrastructure metrics collection (CPU, memory, disk)
  • Visible in the entity catalog

Open the entity catalog

From Grafana Cloud, navigate to Observability > Entity catalog.

Filter to infrastructure entities

Narrow the entity catalog to show only infrastructure components with issues.

Show infrastructure entity types

  1. Under type, select one infrastructure type:
    • Pod - Individual Kubernetes Pods
    • Node - Kubernetes worker nodes
    • NodeGroup - Collections of similar nodes
    • KubeCluster - Entire clusters
    • Namespace - Logical groupings

Start with Pod to see the most granular infrastructure issues, or select Node to focus on worker node health.

Filter by insight categories

Show only infrastructure with specific problems:

  1. Under Insight Rings, select relevant categories:
    • Saturation - Resources approaching limits (CPU, memory, disk)
    • Failure - Pod crashes, node failures, persistent volume issues
    • Anomaly - Unusual resource consumption patterns

Focus on Saturation and Failure for infrastructure troubleshooting.

Review infrastructure health

After you’ve filtered to infrastructure entities, examine the specific health indicators for each entity type.

Pods with issues

When viewing Pods, check for:

Restart count spikes

  • High restart counts usually indicate Pods in CrashLoopBackOff or Pods killed for running out of memory (OOMKilled)
  • Click the Pod to see logs and identify crash causes

CPU/Memory saturation

  • Red insight rings indicate Pods hitting resource limits
  • Check if limits are too low or if the Pod has a memory leak

Common Pod insights

  • Pod OOMKilled - Increase memory limits or investigate memory usage
  • Pod CrashLoopBackOff - Check logs for application errors
  • CPU Throttling - Increase CPU limits or optimize application
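
To make the "restart count spike" check concrete, here is a minimal Python sketch. It assumes you have already exported a Pod's cumulative restart counter as a time-ordered list of samples; how you obtain them is outside this sketch. The function names, the sample format, and the threshold of 3 restarts are illustrative assumptions, not part of the entity catalog.

```python
from typing import Sequence


def restart_increase(samples: Sequence[int]) -> int:
    """Return how many restarts occurred across a window of counter samples.

    `samples` is a hypothetical, time-ordered list of cumulative restart
    counts for one Pod. A drop between two samples is treated as a counter
    reset (for example, the Pod was recreated), so counting resumes from
    the new value.
    """
    total = 0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # Counter reset: count only the restarts seen since the reset.
            total += current
    return total


def is_restart_spike(samples: Sequence[int], threshold: int = 3) -> bool:
    """Flag a window whose restart increase meets an assumed threshold."""
    return restart_increase(samples) >= threshold
```

A Pod flagged this way is a candidate for the log review described above; the threshold should be tuned to how often your workloads normally restart.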

Nodes with issues

When viewing nodes, look for:

Resource pressure

  • Nodes with high CPU or memory usage
  • Disk pressure warnings
  • Too many Pods scheduled on the node

Node status

  • NotReady status indicates node failure
  • DiskPressure or MemoryPressure conditions
  • Network or kubelet issues

Common node insights

  • Node disk pressure - Clean up disk space or add capacity
  • Node memory pressure - Re-balance Pods or add nodes
  • Node NotReady - Investigate kubelet logs or infrastructure issues
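
The node status checks above can be sketched as a small Python function. It assumes node conditions are available as a mapping from condition type to status string ("True", "False", or "Unknown", as Kubernetes reports them); the function name and input shape are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Pressure conditions discussed in this section. A status of "True"
# means the node is under that kind of pressure.
PRESSURE_CONDITIONS = ("DiskPressure", "MemoryPressure")


def node_problems(conditions: Dict[str, str]) -> List[str]:
    """List the problems implied by a node's condition statuses.

    A missing or non-"True" Ready condition is reported as NotReady,
    because an unknown state usually means the kubelet stopped reporting.
    """
    problems = []
    if conditions.get("Ready", "Unknown") != "True":
        problems.append("NotReady")
    for name in PRESSURE_CONDITIONS:
        if conditions.get(name) == "True":
            problems.append(name)
    return problems
```

For example, a node reporting `{"Ready": "False", "DiskPressure": "True"}` yields both `NotReady` and `DiskPressure`, which maps to the kubelet investigation and disk cleanup actions listed above.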

Namespaces and clusters

For higher-level views:

Namespace resource usage

  • Total CPU/memory across all Pods in namespace
  • Pod count approaching quota limits
  • Identify which namespace is consuming the most resources
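
As a rough illustration of the namespace roll-up, the following Python sketch sums per-Pod memory usage by namespace and returns the largest consumer. The input format (pairs of namespace and memory bytes) and the function name are assumptions made for this example; the entity catalog performs its own aggregation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def top_namespace(pod_usage: Iterable[Tuple[str, int]]) -> Optional[Tuple[str, int]]:
    """Return (namespace, total_bytes) for the heaviest namespace.

    `pod_usage` is a hypothetical iterable of (namespace, memory_bytes)
    pairs, one per Pod. Returns None when there are no Pods.
    """
    totals: Dict[str, int] = defaultdict(int)
    for namespace, memory_bytes in pod_usage:
        totals[namespace] += memory_bytes
    if not totals:
        return None
    return max(totals.items(), key=lambda item: item[1])
```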

Cluster capacity

  • Total nodes and their health status
  • Overall cluster resource utilization
  • Pods pending due to insufficient capacity

Investigate infrastructure issues

Click any infrastructure entity to open its details:

Pod details

Kubernetes tab:

  • CPU and memory usage over time
  • Network I/O patterns
  • Restart history

Logs tab:

  • Pre-filtered to this Pod
  • Shows crash logs and error messages
  • Check logs around restart times

Properties tab:

  • Node the Pod is running on
  • Resource requests and limits
  • Labels and annotations

Node details

Kubernetes tab:

  • CPU/memory capacity vs usage
  • Disk usage and I/O
  • Pod count on this node

Connected entities:

  • See all Pods running on the node
  • Check if specific Pods are causing issues
  • Identify if Pods should be rescheduled

Common infrastructure patterns

Recognize these patterns to quickly diagnose and respond to infrastructure issues.

Pod failures

If you see multiple Pods failing in the same namespace:

  1. Check if they run on the same node (node failure).
  2. Look for shared dependency failures (database, external service).
  3. Review recent deployments or configuration changes.
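
The first check, whether failing Pods share a node, can be sketched in Python as follows. The input is assumed to be a list of (Pod name, node name) pairs that you have collected from the Pod properties; the function name and the 80% share threshold are illustrative assumptions.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def shared_node(
    failing_pods: Iterable[Tuple[str, str]], min_share: float = 0.8
) -> Optional[str]:
    """Return the node hosting most failing Pods, or None.

    `failing_pods` is a hypothetical iterable of (pod_name, node_name)
    pairs. A node is returned only if it hosts at least `min_share` of
    the failing Pods, which points to a node failure rather than a
    shared dependency or a bad deployment.
    """
    counts = Counter(node for _pod, node in failing_pods)
    total = sum(counts.values())
    if total == 0:
        return None
    node, count = counts.most_common(1)[0]
    return node if count / total >= min_share else None
```

If this returns None, the failures are spread across nodes, so the shared dependency and recent change checks in steps 2 and 3 become the more likely explanations.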

Resource saturation

When CPU or memory saturation appears:

  1. Immediate: Check if auto-scaling is configured.
  2. Short-term: Increase resource limits if appropriate.
  3. Long-term: Investigate application efficiency and optimization.

Node problems affecting services

If services are degraded and you suspect infrastructure:

  1. Filter the entity catalog to services with errors.
  2. Click a service and view Connected entities.
  3. Check Pods and nodes running the service.
  4. Look for Pod restarts or node issues correlating with service errors.

Use RCA workbench for multi-entity investigation

When infrastructure issues span multiple entities:

  1. From the entity catalog, click problematic Pods or nodes.
  2. Click Add to RCA workbench for each relevant entity.
  3. Navigate to Observability > RCA workbench.
  4. View insights on a timeline to see:
    • Which failures happened first
    • Correlation between infrastructure and service issues
    • Amend insights (deployments, scale events) that triggered problems
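
The timeline reasoning above amounts to ordering insights by onset and looking for changes that closely precede a failure. The Python sketch below illustrates that idea only; the `Insight` record, its fields, and the 30-minute window are hypothetical and do not reflect how RCA workbench stores or exposes data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class Insight:
    """Hypothetical record of one insight on the timeline."""
    entity: str
    category: str  # for example "Amend", "Failure", "Saturation"
    started_at: datetime


def order_by_onset(insights: Iterable[Insight]) -> List[Insight]:
    """Sort insights so the earliest onset comes first."""
    return sorted(insights, key=lambda insight: insight.started_at)


def preceding_amends(
    insights: Iterable[Insight],
    failure: Insight,
    window: timedelta = timedelta(minutes=30),
) -> List[Insight]:
    """Return Amend insights that began within `window` before `failure`."""
    return [
        insight
        for insight in order_by_onset(insights)
        if insight.category == "Amend"
        and timedelta(0) <= failure.started_at - insight.started_at <= window
    ]
```

An Amend insight returned here is only a candidate cause: proximity in time suggests where to look, not proof that the change triggered the failure.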

Filter by cluster, namespace, or environment

Narrow your view to specific parts of your infrastructure using property filters.

By cluster

  1. Click the dropdown and select Show all KubeClusters.
  2. Select the cluster experiencing issues.
  3. See all unhealthy infrastructure in that cluster.

By namespace

  1. Use the Namespace dropdown.
  2. Select namespaces your team owns.
  3. Focus on infrastructure you’re responsible for.

By environment

  1. Use the Env dropdown.
  2. Select production, staging, or other environments.
  3. Prioritize production infrastructure issues.

Bookmark critical views

Save filtered views for quick access to infrastructure health checks.

Create bookmarked views for common scenarios:

  • All critical infrastructure - Filter to Saturation + Failure insights
  • Production Pods with issues - Filter to production namespace + Pod + insights
  • Node health - Filter to Node entity type + insights
  • Cluster capacity - Show all KubeCluster entities

What to look for

Prioritize infrastructure issues based on severity and potential impact.

Immediate action required

  • Pods in CrashLoopBackOff (application startup failing)
  • Nodes NotReady (infrastructure failure)
  • Disk saturation above 90% (imminent failure)
  • Memory saturation causing out-of-memory (OOM) kills

Monitor closely

  • CPU throttling (performance impact)
  • Memory usage above 80% (approaching limits)
  • Restart counts increasing over time
  • Network I/O saturation

Investigate proactively

  • Resource usage trends climbing
  • Anomaly insights on resource consumption
  • Namespaces approaching quota limits
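
The priority tiers above can be expressed as a simple triage rule. The following Python sketch uses the thresholds from this section (disk above 90%, memory above 80%); the function name, parameter names, and return labels are assumptions for illustration, and the sketch covers only a subset of the listed signals.

```python
def triage(
    disk_pct: float,
    memory_pct: float,
    crash_looping: bool = False,
    node_not_ready: bool = False,
    oom_killed: bool = False,
    cpu_throttled: bool = False,
) -> str:
    """Classify an entity as "immediate", "monitor", or "ok".

    Percentages are expected in the range 0-100. The checks run from
    most to least severe, so the first matching tier wins.
    """
    if crash_looping or node_not_ready or oom_killed or disk_pct > 90:
        return "immediate"
    if cpu_throttled or memory_pct > 80:
        return "monitor"
    return "ok"
```

For instance, `triage(disk_pct=50, memory_pct=85)` falls into the "monitor" tier, while any entity with `crash_looping=True` is "immediate" regardless of its resource usage.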

Next steps

After identifying unhealthy infrastructure: