Explore your infrastructure with Kubernetes Monitoring

Kubernetes Monitoring offers visualization and analysis tools for you to:

  • Carefully examine your data to evaluate the health, efficiency, and cost of Kubernetes infrastructure components.
  • Analyze historical data as well as predictions created with machine learning.
  • Discover issues with resource usage to make informed decisions about efficiency and costs.

To open Kubernetes Monitoring:

  1. Navigate to your Grafana Cloud portal.
  2. In the menu, select the stack you want to work with.
  3. Click the upper-left menu icon.
  4. In the main menu, expand Infrastructure, then click Kubernetes.
    Animation of navigating on the menu to Kubernetes
    Start sending data button

See the issues at a glance

The main Kubernetes page displays a snapshot of issues that exceed specific thresholds (and any associated alerts) for the data source chosen in the drop-down menu.

Issues exceeding thresholds and associated alerts

In this view, you can see the graphed counts for Clusters, Nodes, Pods, and containers, as well as:

  • Pods that have been in a non-running state for 15 minutes or more
  • Nodes with CPU and memory usage over 90% for more than 5 minutes, and disks using over 90% of their capacity
  • Persistent Volumes using over 90% of their capacity
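
As a rough illustration, the thresholds listed above can be written as simple checks. This is a sketch under stated assumptions, not the app's implementation: the function and constant names are hypothetical, usage is expressed as a fraction of capacity, CPU and memory are checked independently, and any phase other than Running counts as non-running (the app may treat Succeeded Pods differently).

```python
from __future__ import annotations

# Thresholds quoted from the list above. All names are illustrative only.
USAGE_THRESHOLD = 0.90          # CPU, memory, disk, and Persistent Volume usage
NODE_SUSTAINED_MINUTES = 5      # CPU or memory must stay high for longer than this
POD_NOT_RUNNING_MINUTES = 15    # Pods not running for at least this long


def pod_has_issue(phase: str, minutes_in_phase: float) -> bool:
    """Flag a Pod that has been in a non-running phase for 15 minutes or more."""
    return phase != "Running" and minutes_in_phase >= POD_NOT_RUNNING_MINUTES


def node_issues(cpu: float, memory: float, disk: float,
                minutes_sustained: float) -> list[str]:
    """Return the resources on a Node that exceed the thresholds.

    cpu, memory, and disk are fractions of capacity (0.0 to 1.0 or above).
    minutes_sustained is how long the CPU and memory readings have held.
    """
    issues = []
    sustained = minutes_sustained > NODE_SUSTAINED_MINUTES
    if cpu > USAGE_THRESHOLD and sustained:
        issues.append("cpu")
    if memory > USAGE_THRESHOLD and sustained:
        issues.append("memory")
    # The source gives no duration for the disk check, so none is applied here.
    if disk > USAGE_THRESHOLD:
        issues.append("disk")
    return issues


def volume_has_issue(used_fraction: float) -> bool:
    """Flag a Persistent Volume that uses over 90% of its capacity."""
    return used_fraction > USAGE_THRESHOLD
```

For example, `node_issues(0.95, 0.50, 0.92, minutes_sustained=6)` reports both the CPU and the disk, while the same readings held for only 5 minutes report the disk alone.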

Sort the columns, and with one click, go to Pod, Cluster, Node, and namespace views for greater detail.

Drill into data

Click the Cluster navigation menu item to navigate from Clusters, namespaces, workloads, and Nodes through to containers. Use filters and sorting to target the data you want.

List of Clusters

Analyze costs

In the list view on any page, select Cost to see the estimated cost data.

Cost switch
Cost for each workload for the last two days

Click Cost on the main menu to view the Cost Overview and Savings pages. These pages show resource costs at a higher level, and the cost per provider if you use more than one.

Cost Overview for last seven days

Every detail view provides cost data as well.

Detail of Cluster with CPU and memory cost and idle cost for last seven days

For more information, refer to Manage costs.

Understand efficiency and resource use

Throughout the app, resource usage statistics appear for each item so that you can filter and sort to make the best use of your time. In the list view on any page, select Usage to see usage data.

Usage switch
List of workloads with CPU and memory usage statistics, and number of alerts

Detail views also reveal efficiency data and recommendations, so you can optimize resource usage.

Usage graphs and suggested sizing and limits

With this data, you can:

  • Understand performance and troubleshoot stability issues by correlating average and maximum resource usage.
  • Observe resource usage for each Kubernetes object.
  • Discover any stranded resources in your fleet.

Manage alerts

From the main menu, click Alerts to view all Kubernetes-related alerts.

Alerts page

You can also manage preconfigured alerting rules.

Resolve issues better with cross-functionality

From the Kubernetes Monitoring app, you can navigate to other capabilities in Grafana Cloud to analyze, troubleshoot, and solve issues.

Diagnose with Sift investigations

From a Pod, Cluster, namespace, or workload view, you can begin an incident investigation by clicking Run Sift investigation. Sift performs a set of automated system checks and surfaces potential issues in your Kubernetes environment, and works to identify the root cause of an incident.

Open a Sift investigation from Kubernetes Monitoring

Go directly to the RCA Workbench

Within Kubernetes Monitoring, you can go directly to the Asserts RCA Workbench from any list of Clusters, Nodes, workloads, namespaces, or Pods you choose. To do so, select the box to the left of the list item and click the Compare in Asserts Workbench button.

Selected list items and RCA Workbench button

The RCA Workbench opens in a new tab. You can take troubleshooting deeper by understanding relationships between components and what is occurring between them.

Note

To access the RCA Workbench, enable Asserts on your stack.

View raw metrics with Explore

To further query data, use any of the Explore buttons available throughout the interface (such as Explore namespaces or Explore alerts). The Explore view opens with additional query tools.

Raw query with options to add, view query history, and inspect query
Raw metrics

Access Application Observability

On the detail page for a Pod or workload, click Application Observability to navigate directly to more data on the application.

Navigate directly to the Application Observability app

To return to Kubernetes Monitoring, click the Kubernetes icon.

Kubernetes icon in Application Observability

Analyze historical data

Select a time range to see your historical data for any time frame you choose. As you navigate from page to page, the time range you set stays in effect until you change it.

Time range selector options

As an example, the Pod optimization section of the Pod detail page shows a time range over several hours. You can use this to understand the historical pattern of CPU usage and memory usage.

Graphs showing Pod bursting over CPU request and bursting above memory requests
Pod optimization view on Pod detail page

Learn what’s predicted

CPU and memory prediction can help you ensure resources are available during spikes in resource usage and help you decrease the amount of unused resources due to over provisioning. To use prediction tools, first enable the Machine Learning plugin.

The following buttons are available in various views. Click them to show a prediction for Clusters, namespaces, workloads, Nodes, Pods, and containers:

  • Predict Mem Usage: Shows a predictive graph for memory usage one week in the future. Calculations are based on metrics from the previous week.
  • Predict CPU: Shows a predictive graph for CPU usage one week in the future. Calculations are based on metrics from the previous week.
    Three graph lines showing the actual CPU usage, the lower predicted future usage, and upper predicted future usage
    Predictions for Node CPU Usage

Within a workload view, click the Detect Outlier CPU Usage amongst Pods button to identify a Pod that has CPU usage different from the other Pods.

Link to explore outlier detection query
Outlier message and exploration link

Click Explore this query in the Machine Learning plugin to view the raw data. Here you can adjust parameters and see a more detailed graph of the findings.

Raw data, query details, and graph regarding outlier data
Outlier raw data
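
To make the idea of a Pod whose CPU usage differs from its peers more concrete, here is a minimal sketch. The source does not say which algorithm the Machine Learning plugin uses, so this example swaps in a modified z-score based on the median absolute deviation (MAD). The function name, the input shape, and the 3.5 cutoff are all assumptions made for illustration.

```python
from __future__ import annotations

from statistics import median


def find_cpu_outliers(cpu_by_pod: dict[str, float],
                      threshold: float = 3.5) -> list[str]:
    """Return names of Pods whose CPU usage is far from the group's median.

    Illustrative only: this is NOT the Machine Learning plugin's method.
    cpu_by_pod maps a (hypothetical) Pod name to its CPU usage.
    """
    values = list(cpu_by_pod.values())
    if len(values) < 3:
        # Too few Pods to say what "typical" usage looks like.
        return []

    med = median(values)
    mad = median([abs(v - med) for v in values])

    if mad == 0:
        # At least half the Pods share the same value, so any Pod that
        # differs from the median is treated as an outlier.
        return [pod for pod, v in cpu_by_pod.items() if v != med]

    # 0.6745 scales the MAD so the score is comparable to a z-score
    # when the data is roughly normal.
    return [
        pod for pod, v in cpu_by_pod.items()
        if abs(0.6745 * (v - med) / mad) > threshold
    ]
```

A median-based score is used here because a single extreme Pod shifts the mean and standard deviation but barely moves the median, which keeps the outlier from hiding itself.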

Control app refresh

You can set the automatic refresh interval of the GUI, or turn off automatic refresh and refresh manually when you are ready.

Menu for controlling automatic refresh and refresh interval

Use color cues

Throughout the views in Kubernetes Monitoring, you see color used as an additional means of indicating status or condition. For example, sometimes text is a different color for Pod status:

List of pods with the status of running showing in green
Color coding
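| Text      | Color | Comments                  |
|-----------|-------|---------------------------|
| Running   | Green | Healthy Pod               |
| Running   | Red   | Pod failing to start      |
| Failed    | Red   | Failed Pod                |
| Unknown   | Grey  | Pod status unknown        |
| Succeeded | Green | Job Pod successfully run  |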

For more information on Pod status, refer to the Kubernetes documentation on Pod lifecycle.

The following table describes the color indicators for resource capacity and the state of resource usage:

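| Usage Colors | Usage             | Comments                                                                |
|--------------|-------------------|-------------------------------------------------------------------------|
| Green        | 60-90% of maximum | This is the ideal state of resource usage.                              |
| Yellow       | Below 60%         | Low usage percentages indicate that the item might be over provisioned. |
| Red          | 90%+              | Your resource usage is close to or above its configured capacity.       |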
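
The usage color bands in the table can be sketched as a small lookup. The function name is hypothetical. The table lists 90% under both "60-90%" and "90%+", so this sketch assumes that exactly 90% counts as red.

```python
def usage_color(usage_percent: float) -> str:
    """Map a usage percentage to the color band described in the table.

    Assumption: a value of exactly 90 is treated as red, because the
    table's "60-90%" and "90%+" ranges overlap at that point.
    """
    if usage_percent < 0:
        raise ValueError("usage_percent must not be negative")
    if usage_percent >= 90:
        return "red"      # close to or above configured capacity
    if usage_percent >= 60:
        return "green"    # ideal state of resource usage
    return "yellow"       # low usage, item might be over provisioned
```

For example, `usage_color(45)` returns `"yellow"` and `usage_color(75)` returns `"green"`.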

If you enabled traces when you configured Kubernetes Monitoring, you can view them in Explore:

  1. Click the main menu icon.

  2. Click Explore.

  3. Choose the Tempo data source.

  4. With the TraceQL tab selected, enter your search query.

  5. Click Run query.

    A table of traces appears.

  6. Click a trace to see the detail.

Explore detail page showing table of traces, TraceQL query, and trace graph
View traces
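
For the search query entered on the TraceQL tab, the following sketch builds a simple TraceQL span selector as a string. The helper name is hypothetical, and the `resource.k8s.namespace.name` attribute is an assumption: whether your traces carry it depends on how your workloads are instrumented.

```python
from typing import Optional


def build_traceql(namespace: str, min_duration_ms: Optional[int] = None) -> str:
    """Build a TraceQL query that selects spans from one Kubernetes namespace.

    Assumes spans carry the resource attribute k8s.namespace.name.
    Optionally keeps only spans longer than min_duration_ms milliseconds.
    The namespace is inserted as-is, so it must not contain double quotes.
    """
    conditions = [f'resource.k8s.namespace.name = "{namespace}"']
    if min_duration_ms is not None:
        conditions.append(f"duration > {min_duration_ms}ms")
    return "{ " + " && ".join(conditions) + " }"


# Paste the printed query into the TraceQL tab in Explore.
print(build_traceql("production", min_duration_ms=500))
```

The printed text is what you would enter on the TraceQL tab before clicking Run query.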

Manage configuration

If you have the admin role, you can manage the configuration of Kubernetes Monitoring by working with:

  • Data source choices
  • Alerts
  • Integration installations
  • Optional custom log queries
  • Configuration instructions to deploy, configure, and update the Grafana Kubernetes Monitoring Helm chart

For more information, refer to Configure Kubernetes Monitoring.