Grafana Cloud

Explore your infrastructure with Kubernetes Monitoring

Kubernetes Monitoring offers visualization and analysis tools for you to:

  • Carefully examine your data to evaluate the health, efficiency, and cost of Kubernetes infrastructure components.
  • Analyze historical data as well as predictions created with machine learning.
  • Discover issues with resource usage to make informed decisions about efficiency and costs.

To open Kubernetes Monitoring:

  1. Navigate to your Grafana Cloud portal.
  2. In the menu, select the stack you want to work with.
  3. Click the upper-left menu icon.
  4. In the main menu, expand Infrastructure, then click Kubernetes.
    Figure: Animation of navigating the menu to Kubernetes, and the Start sending data button

See the issues at a glance

The main Kubernetes page displays a snapshot of issues that exceed specific thresholds for the data source chosen in the drop-down menu.

Figure: Snapshot of issues (home page showing issues exceeding thresholds)

In this view, you can see the graphed counts for Clusters, Nodes, Pods, and containers, as well as:

  • Pods that have been in a non-running state for 15 minutes or more
  • Nodes with CPU or memory usage over 90% for over 5 minutes, and disks using over 90% of their capacity
  • Persistent Volumes using over 90% of their capacity
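
The thresholds in the list above can be read as simple rules. The following Python sketch restates them for illustration only. It is not how Kubernetes Monitoring evaluates issues internally; the `NodeSample` shape, the function names, and the assumption of one sample per minute are all hypothetical.

```python
from dataclasses import dataclass

# Thresholds copied from the list above. Everything else is hypothetical.
USAGE_THRESHOLD = 0.90         # over 90% of capacity
NODE_SUSTAINED_MINUTES = 5     # CPU or memory must stay high for over 5 minutes
POD_NOT_RUNNING_MINUTES = 15   # Pods are flagged at 15 minutes or more


@dataclass
class NodeSample:
    """One usage sample for a Node, as fractions of capacity (0.0 to 1.0)."""
    cpu: float
    memory: float
    disk: float


def pod_has_issue(minutes_not_running: float) -> bool:
    """A Pod is flagged once it has been non-running for 15 minutes or more."""
    return minutes_not_running >= POD_NOT_RUNNING_MINUTES


def node_issues(samples: list[NodeSample]) -> set[str]:
    """Return the flagged resources for a Node.

    Assumes one sample per minute, oldest first, so "over 5 minutes" is
    approximated as more than five consecutive samples above the threshold.
    """
    issues = set()
    for resource in ("cpu", "memory"):
        streak = 0
        for sample in samples:
            if getattr(sample, resource) > USAGE_THRESHOLD:
                streak += 1
            else:
                streak = 0
            if streak > NODE_SUSTAINED_MINUTES:
                issues.add(resource)
                break
    # Disk is checked against the most recent sample only.
    if samples and samples[-1].disk > USAGE_THRESHOLD:
        issues.add("disk")
    return issues
```

For example, six consecutive samples with CPU at 95% would flag `cpu`, while five such samples would not.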

Sort the columns, and with one click, go to Pod, Cluster, Node, and namespace views for greater detail.

Drill down into data

As you delve into your data, navigate from Nodes through to Pods by clicking the Cluster navigation menu item and choosing any of the following tabs:

  • Clusters
  • Namespaces
  • Workloads
  • Nodes

Figure: Cluster view (view with list of Clusters)

Analyze historical data

Select a time range to see your historical data for any time frame you choose. As you navigate from page to page, the time range you set stays in effect until you change it.

Figure: Time picker (selection of dates or a list of times such as last 5 minutes)

As an example, the Pod optimization section of the Pod detail page shows a time range over several hours. You can use this to understand the historical pattern of CPU usage and memory usage.

Figure: Pod optimization view on Pod detail page (graphs showing Pod bursting over CPU request and bursting above memory requests)

Learn what’s predicted

CPU and memory prediction can help you ensure resources are available during spikes in resource usage, and help you reduce unused resources caused by overprovisioning. To use prediction tools, first enable the Machine Learning plugin.

The following buttons are available in various views. Click them to show a prediction for Clusters, namespaces, workloads, Nodes, Pods, and containers:

  • Predict Mem Usage: Shows a predictive graph for memory usage one week in the future. Calculations are based on metrics from the previous week.
  • Predict CPU: Shows a predictive graph for CPU usage one week in the future. Calculations are based on metrics from the previous week.

Figure: Predictions for Node CPU Usage (three graph lines showing the actual CPU usage, the lower predicted future usage, and upper predicted future usage)
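
To make the time frames concrete, the following Python sketch computes the two windows described above: the previous week of metrics that the calculation is based on, and the week ahead that the graph covers. The function name is hypothetical, and the sketch says nothing about the model the Machine Learning plugin uses.

```python
from datetime import datetime, timedelta, timezone

WEEK = timedelta(weeks=1)


def prediction_windows(now: datetime):
    """Return ((history_start, history_end), (forecast_start, forecast_end)).

    The history window is the previous week; the forecast window is the
    following week. Both are anchored at `now`.
    """
    history = (now - WEEK, now)
    forecast = (now, now + WEEK)
    return history, forecast


if __name__ == "__main__":
    history, forecast = prediction_windows(datetime.now(timezone.utc))
    print("Based on metrics from:", history[0], "to", history[1])
    print("Prediction covers:", forecast[0], "to", forecast[1])
```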

Within a workload view, click the Detect Outlier CPU Usage amongst Pods button to identify a Pod that has CPU usage different from the other Pods.

Figure: Outlier message and exploration link (link to explore outlier detection query)

Click Explore this query in the Machine Learning plugin to view the raw data. Here you can adjust parameters and see a more detailed graph of the findings.

Figure: Outlier raw data (raw data, query details, and graph regarding outlier data)
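
As a rough intuition for what "CPU usage different from the other Pods" can mean, the following Python sketch flags Pods whose usage is far from the group median. It is an assumption-laden illustration, not the algorithm used by the Machine Learning plugin: the function name, the modified z-score technique, and the 3.5 cutoff are choices made for this example.

```python
from statistics import median


def cpu_outliers(cpu_by_pod: dict[str, float], threshold: float = 3.5) -> list[str]:
    """Return names of Pods whose CPU usage differs strongly from their peers.

    Uses the modified z-score based on the median absolute deviation (MAD),
    which is less sensitive to the outlier itself than a mean-based score.
    """
    values = list(cpu_by_pod.values())
    if len(values) < 3:
        return []  # too few Pods to compare against each other
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Most Pods report the same value; anything different stands out.
        return sorted(pod for pod, v in cpu_by_pod.items() if v != med)
    return sorted(
        pod
        for pod, v in cpu_by_pod.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )
```

For a workload where four Pods use about 0.2 cores and a fifth uses 0.95 cores, only the fifth Pod is returned.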

Understand efficiency and resource use

The Efficiency page shows a correlation between CPU, memory, and storage use for Clusters, Nodes, and namespaces. The list of Clusters indicates each Cluster’s resource usage. Use this data to:

  • Understand performance and troubleshoot stability issues by correlating between average and maximum resource usage.
  • Observe resource usage per Cluster and per Cloud provider.
  • Discover any stranded resources in your fleet.

Figure: Efficiency page (the list of Clusters with sortable columns for average/max CPU usage, average/max memory usage, and max root partition storage)

You can also explore resource usage for detail views on:

  • Clusters
  • Namespaces
  • Workloads
  • Nodes
  • Pods
  • Containers

Analyze costs

Use the Cost page to help you understand the costs of resources consumed by your Kubernetes infrastructure, and identify areas of potential savings.

You can also explore costs on any optimization detail panel for Clusters, namespaces, workloads, Nodes, Pods, and containers. Refer to Optimize resource usage and efficiency for more information.

View out-of-the-box dashboards

Kubernetes Monitoring includes preconfigured dashboards. For more details, refer to Use dashboards.

Filter for data

Use the controls on each page to further specify the data you want to view and examine. For example, choose a data source and use filters to refine what you want to analyze. Click the heading of a list column to sort it. Click underlined items within lists to further explore details about the item.

Use color cues

Throughout the views in Kubernetes Monitoring, color is an additional indicator of status or condition. For example, the text color of a Pod status reflects the condition of the Pod:

Figure: Color coding (list of Pods with the status of running showing in green)

Status     Color  Description
Running    Green  Healthy Pod
Running    Red    Pod failing to start
Failed     Red    Failed Pod
Unknown    Grey   Pod status unknown
Succeeded  Green  Job Pod successfully run

For more information on Pod status, refer to the Kubernetes documentation on Pod lifecycle.

The following table describes the color indicators for resource capacity and the state of resource usage:

Usage bar color  Usage              Comments
Green            60-90% of maximum  This is the ideal state of resource usage.
Yellow           Below 60%          Low usage percentages indicate that the Node might be over-provisioned.
Red              90-100%            Your Node resource is dangerously close to maximum capacity.
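
The usage ranges in the table can be expressed as a small lookup. The following Python sketch is illustrative only. The function name is hypothetical, and because the table lists 90% in both the green and red ranges, treating exactly 90% as red is an assumption.

```python
def usage_bar_color(usage_percent: float) -> str:
    """Map a resource usage percentage to the usage bar color in the table."""
    if not 0 <= usage_percent <= 100:
        raise ValueError("usage_percent must be between 0 and 100")
    if usage_percent < 60:
        return "yellow"  # low usage: the Node might be over-provisioned
    if usage_percent < 90:
        return "green"   # ideal state of resource usage
    return "red"         # close to maximum capacity (assumes 90% counts as red)
```

For example, a Node at 45% memory usage maps to yellow, and a Node at 75% maps to green.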

Figure: Node color coding (Node capacity graphs for Pods, CPU, memory, and disk space showing a yellow color on the capacity bar to indicate percentage)

View raw metrics

To further query data, use any of the Explore buttons available throughout the interface (such as Explore namespaces or Explore alerts). Each button opens a view that provides additional query tools.

Figure: Raw metrics (raw query with options to add, view query history, and inspect query)

Manage alerts

Kubernetes Monitoring includes preconfigured alert rules. The Alerts view shows alert rules by namespace or group, along with the status of any alerts that each rule has triggered. For more information on alerts, see Configure alerting.

Figure: Alert rule detail (detail of alert rule, including labels, expression, description, runbook URL, summary, and matching instances)

While you investigate alerts, it can be useful to temporarily silence some of the default alerts.

If you enabled traces when you configured Kubernetes Monitoring, you can view them in Explore:

  1. Click the main menu icon.

  2. Click Explore.

  3. Choose the Tempo data source.

  4. With the TraceQL tab selected, enter your search query.

  5. Click Run query.

    A table of traces appears.

  6. Click a trace to see the detail.

Figure: View traces (Explore detail page showing table of traces, TraceQL query, and trace graph)
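
As an example of a search query for step 4, the following Python sketch builds a TraceQL query string that you could paste into the TraceQL tab. The helper name is hypothetical, and the `resource.k8s.namespace.name` attribute is an assumption: it follows OpenTelemetry naming and is only present if your traces carry it.

```python
def traceql_for_namespace(namespace: str, min_duration_ms: int) -> str:
    """Build a TraceQL query for slow spans in one Kubernetes namespace.

    The attribute name is an assumption; adjust it to match the attributes
    your traces actually contain.
    """
    if '"' in namespace or "\\" in namespace:
        raise ValueError("namespace must not contain quotes or backslashes")
    if min_duration_ms < 0:
        raise ValueError("min_duration_ms must not be negative")
    return (
        "{ "
        f'resource.k8s.namespace.name = "{namespace}"'
        f" && duration > {min_duration_ms}ms"
        " }"
    )


if __name__ == "__main__":
    print(traceql_for_namespace("default", 500))
```

Running the script prints a query that selects spans from the `default` namespace that took longer than 500 milliseconds.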

Manage configuration

If you have the admin role, you can manage the configuration of Kubernetes Monitoring by working with:

  • Data source choices
  • Prebuilt dashboards and alerts
  • Integration installations
  • Optional custom log queries
  • Configuration instructions for deploying, configuring, and updating the Grafana Kubernetes Monitoring Helm chart

For more information, refer to Configure Kubernetes Monitoring.