Kubernetes Monitoring pages reflect the hierarchy of Kubernetes objects, so you can begin at any level above containers. Main pages include lists of Clusters, namespaces, workloads, and Nodes.
For example, the Cluster main page shows the list of your Clusters. When you click on a Cluster in the list, it opens the Cluster detail page. That page shows the details for the Cluster along with a list of Nodes within that Cluster.
You can continue to drill into a Node and see the list of Pods for that Node, all the way to the container level.
The Kubernetes Overview page gives you a high-level view of your Clusters, usage, and alerts. This page brings to the forefront key data about your infrastructure.
Refine counts of Kubernetes objects and navigate to them
Adjust the time range and filter by Cluster and namespace to narrow the counts, including historical data, for:
Clusters, Nodes, namespaces, workloads, Pods, and containers
The Overview page calculation uses the most recent data point within your selected time range. The rest of Kubernetes Monitoring also includes objects which are no longer active. For example, a Node can be active and then not active many times throughout a given time range. Therefore, you may see a discrepancy between the count on the Overview page and the count on the list page.
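The difference between the two counts can be illustrated with a small sketch. It is not the query Kubernetes Monitoring runs. The sample data, the Node names, and both function names are hypothetical, and each Node is assumed to have one active/inactive flag per data point, oldest first.

```python
from typing import Dict, List


def count_latest(samples: Dict[str, List[bool]]) -> int:
    """Count objects that are active at the most recent data point."""
    return sum(1 for states in samples.values() if states and states[-1])


def count_seen(samples: Dict[str, List[bool]]) -> int:
    """Count objects that were active at any data point in the range."""
    return sum(1 for states in samples.values() if any(states))


# Hypothetical activity flags per Node across one time range (oldest first).
node_activity = {
    "node-a": [True, True, True],
    "node-b": [True, False, True],   # went inactive, then came back
    "node-c": [True, True, False],   # no longer active at the end of the range
}

overview_count = count_latest(node_activity)  # Overview-style count
list_count = count_seen(node_activity)        # list-page-style count
print(overview_count, list_count)
```

Here the Overview-style count is lower because `node-c` is inactive at the last data point, while the list-style count still includes it.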
Find usage spikes
Use the time range selector to focus on a time period while looking for patterns or spikes in CPU and memory usage in your Clusters.
When spikes occur:
Zoom in on the graph to narrow the time selection.
Hover over and click the peak of the spike to see the percentage of use compared to capacity. In the following example, the spike shows 46.5% of CPU usage compared to capacity.
On the Cost page, use the Overview and Savings tabs to understand what Kubernetes is costing you and how you can save.
You can see the cost of each item in any list view as well as on the detail pages.
Throughout Kubernetes Monitoring, resource usage statistics are available for Kubernetes objects.
CPU and memory tabs
On any detail page you can view an overview of CPU and memory usage. You can also click the CPU tab or the Memory tab to view more correlated usage information. For example, the CPU and Memory tabs on the Cluster detail page show:
Requests compared to capacity
Usage compared to capacity
Usage compared to requests for Nodes and namespaces in the Cluster
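The three comparisons above are simple ratios. The following sketch shows the arithmetic under the assumption that usage, requests, and capacity are expressed in the same unit (for example, CPU cores). The `ResourceSample` type, the function names, and the sample values are hypothetical and are not part of Kubernetes Monitoring.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ResourceSample:
    usage: float     # what is currently consumed
    requests: float  # what workloads have requested
    capacity: float  # what the Cluster can provide


def ratio_percent(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator as a percentage of denominator, or None if undefined."""
    if denominator <= 0:
        return None
    return 100.0 * numerator / denominator


def summarize(sample: ResourceSample) -> Dict[str, Optional[float]]:
    """Compute the three comparisons shown on the CPU and Memory tabs."""
    return {
        "requests_vs_capacity": ratio_percent(sample.requests, sample.capacity),
        "usage_vs_capacity": ratio_percent(sample.usage, sample.capacity),
        "usage_vs_requests": ratio_percent(sample.usage, sample.requests),
    }


# Hypothetical Cluster: 4 cores used, 10 cores requested, 16 cores of capacity.
cluster_cpu = ResourceSample(usage=4.0, requests=10.0, capacity=16.0)
summary = summarize(cluster_cpu)
print(summary)
```

In this example, usage that is low relative to requests suggests the requests may be set higher than the workloads need.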
Graphs in the Storage tab on the Cluster, Namespace, Workload, Node, and Pod detail pages show how persistent volume (PV) storage changes over a specific time range.
You can gain insight into:
The status phase of the PV and PVC, including the binding of the PVC request
Throughput to understand how much data is being read and written per second
IOPS (Input/Output Operations per Second) to understand how many read and write operations are being performed per second
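Throughput and IOPS are both per-second rates. As a rough sketch, assuming the underlying metrics are cumulative counters sampled at two points in time, the rate is the counter increase divided by the elapsed seconds. The function name and the sample values are hypothetical. Treating a decrease as a counter that restarted from zero is a simplifying assumption.

```python
def per_second_rate(earlier: float, later: float, seconds: float) -> float:
    """Average per-second increase of a cumulative counter between two samples."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    if later < earlier:
        # Assumption: the counter restarted from zero between the samples.
        return later / seconds
    return (later - earlier) / seconds


# Hypothetical samples taken 60 seconds apart.
read_throughput = per_second_rate(1_000_000, 4_000_000, 60)  # bytes per second
read_iops = per_second_rate(1_200, 4_200, 60)                # operations per second
print(read_throughput, read_iops)
```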
The PV status on the Pod details page indicates the relationship between persistent volumes and Pods, and also shows the name of the volume, which can change over time.
CPU and memory prediction can help you ensure resources are available during spikes in usage, and help you reduce the resources left unused because of overprovisioning.
To use prediction tools, first enable the Machine Learning plugin.
The following buttons are available in various views. Click them to show a prediction for Clusters, namespaces, workloads, Nodes, Pods, and containers. The time range you select must be at least two hours to use these prediction tools:
Predict Mem Usage: Shows a predictive graph for memory usage one week in the future. Calculations are based on metrics from the previous week.
Predict CPU: Shows a predictive graph for CPU usage one week in the future. Calculations are based on metrics from the previous week.
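This page does not describe the model behind these predictions. To show the general idea of forecasting a week ahead from the previous week, the following sketch uses a seasonal naive forecast, which simply repeats the most recent season of samples. It is a stand-in technique for illustration and is not the Machine Learning plugin's method. The function name and the sample values are hypothetical.

```python
from typing import List


def seasonal_naive_forecast(history: List[float], horizon: int) -> List[float]:
    """Forecast `horizon` future points by repeating the samples in `history`.

    `history` is treated as one full season (for example, the previous week).
    """
    if not history:
        raise ValueError("history must not be empty")
    if horizon < 0:
        raise ValueError("horizon must not be negative")
    return [history[i % len(history)] for i in range(horizon)]


# Hypothetical CPU usage samples (cores) standing in for the previous week.
previous_week = [0.4, 0.6, 0.9, 0.5]
next_week = seasonal_naive_forecast(previous_week, horizon=len(previous_week))
print(next_week)
```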
You can identify Pods whose CPU usage differs from that of the other Pods in the same workload.
For any multi-Pod workload, go to the workload detail page, and review the information in the Overview tab. If there is a Pod in the workload that is an outlier for CPU usage, it is indicated in the outliers by CPU field. Click the link to open Explore and discover the outlier Pod.
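This page does not state how an outlier is determined. One common way to flag a value that differs from its peers is the modified z-score based on the median absolute deviation (MAD), sketched below. The function name, the threshold of 3.5, the minimum of three Pods, and the Pod names are assumptions for illustration, not the method Kubernetes Monitoring uses.

```python
from statistics import median
from typing import Dict, List


def cpu_outliers(cpu_by_pod: Dict[str, float], threshold: float = 3.5) -> List[str]:
    """Return Pods whose CPU usage is far from the workload median (MAD-based)."""
    values = list(cpu_by_pod.values())
    if len(values) < 3:
        return []  # too few Pods to compare meaningfully
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Most Pods are identical; anything that differs at all stands out.
        return [pod for pod, v in cpu_by_pod.items() if v != med]
    return [
        pod
        for pod, v in cpu_by_pod.items()
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical CPU usage (cores) for the Pods of one workload.
usage = {"web-1": 0.20, "web-2": 0.22, "web-3": 0.21, "web-4": 0.95}
outliers = cpu_outliers(usage)
print(outliers)
```

A median-based measure is used here because a single extreme Pod barely shifts the median, whereas it would inflate a mean and standard deviation.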
Select a time range to see your historical data for any time frame you choose. As you navigate from page to page, the time range stays set to the period you chose until you change it again.
As an example, the Pod optimization section of the Pod detail page shows a time range over several hours. You can use this to understand the historical pattern of CPU usage and memory usage.
Zoom into an area of any graph on the detail pages to narrow the time range selector even further. The time range remains selected until you click Back to default.
Give it a try using Grafana Play
With Grafana Play, you can explore and see how it works, learning from practical examples to accelerate your development.
This feature can be seen on this workload details page set for the last 2 days.
You can monitor manual jobs and scheduled (cron) jobs. Use the main menu to find and select All jobs. Use the Cronjobs and Jobs lists to view jobs across all Clusters and Namespaces, based on the time range you choose in the time range selector. You can view:
A color-coded status indicator for each job
How jobs are distributed and where jobs are placed across the infrastructure
For cron jobs:
Last succeeded, to verify jobs are completing successfully
Last scheduled compared to succeeded, to view any gaps that reveal failed or skipped executions
For manual jobs, Pods/completions to track when the job was run
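The comparison of last scheduled to last succeeded for cron jobs can be reasoned about as follows. This is a minimal sketch, assuming both timestamps are available as timezone-aware datetimes. The function name, the grace period, and the sample times are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def has_gap(
    last_scheduled: datetime,
    last_succeeded: Optional[datetime],
    now: datetime,
    grace: timedelta,
) -> bool:
    """True when the latest scheduled run has no success after the grace period."""
    if last_succeeded is not None and last_succeeded >= last_scheduled:
        return False  # the most recent scheduled run completed successfully
    # No success since the last schedule: only flag it once the grace period ends.
    return now - last_scheduled > grace


utc = timezone.utc
now = datetime(2024, 5, 2, 3, 0, tzinfo=utc)
scheduled = datetime(2024, 5, 2, 2, 0, tzinfo=utc)

# Last success was the previous night's run, so tonight's run failed or was skipped.
stale = has_gap(scheduled, datetime(2024, 5, 1, 2, 3, tzinfo=utc), now, timedelta(minutes=30))
# Last success came shortly after tonight's schedule.
healthy = has_gap(scheduled, datetime(2024, 5, 2, 2, 4, tzinfo=utc), now, timedelta(minutes=30))
print(stale, healthy)
```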
On the job detail page, the Overview tab contains:
Status, start time, end time, Pod status phase, logs, and events
CPU and memory usage, to identify any overprovisioning or underprovisioning as well as spot any gradual increases that indicate memory leaks
Container logs for debugging failed runs
Events for identifying error messages or unexpected behavior
Runs table to track success/failure patterns over time, and understand duration and completion
You can further explore each job’s CPU and Memory tabs for greater insight.
Find deleted Kubernetes objects
You can find deleted Clusters, namespaces, workloads, Nodes, Pods, and containers to understand what occurred in the past. To do so, set the time range selector to a past time period.
The following example shows a time range of the previous 30 days with some Nodes that show no data (also colored in white text). When you click on a Node with no data, you can learn when the Node expired.
Grafana Cloud has a default 30-day limit for queries. If your Kubernetes object was deleted more than 30 days before the current date, use the time range selector to choose a specific 30-day time frame in the past.
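To choose a 30-day time frame that includes the moment an object was deleted, you can work out the start and end dates before entering them in the time range selector. The following sketch centers the window on the deletion time and never extends past the present. The function name, the centering choice, and the dates are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

QUERY_LIMIT = timedelta(days=30)  # default query limit described above


def window_containing(
    event_time: datetime, now: datetime, limit: timedelta = QUERY_LIMIT
) -> Tuple[datetime, datetime]:
    """Return a (start, end) range of length `limit` that contains event_time."""
    if event_time > now:
        raise ValueError("event_time must not be in the future")
    # Center on the event where possible, but do not extend past the present.
    end = min(now, event_time + limit / 2)
    start = end - limit
    return start, end


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
deleted_at = datetime(2024, 1, 10, tzinfo=timezone.utc)  # hypothetical deletion time
start, end = window_containing(deleted_at, now)
print(start.date(), end.date())
```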
Access Nodes in Cloud provider accounts
You can navigate to the EC2 dashboard for Nodes managed by AWS from Kubernetes Monitoring. For example:
Find the EC2 Node by going to Search and searching for the Node name.
In the search results, click the Node name to open the Node detail page.
On the far right-hand side of the screen, open the AWS drop-down to see the link to the EC2 instance.
Use the network panels to understand when bandwidth limits are causing network saturation, which can lead to dropped packets.
On any detail page for Cluster, namespace, workload, Node, or Pod, click the Network tab to view:
Network Bandwidth Rx/Tx: Shows the rate of received and transmitted bytes
Network Saturation Rx/Tx dropped packets: Shows rate of received and transmitted packets dropped
Network Bandwidth and Network Saturation by Node, workload, or Pod: Shows the bandwidth and saturation by object
This feature can be seen on the Network tab of this namespace details page.
From any detail page, click the Logs & Events tab to view the logs and events for that Kubernetes object. You can filter for many dimensions, including:
Navigate easily from Kubernetes Monitoring to other capabilities in Grafana Cloud to analyze, troubleshoot, and solve issues.
Start an automated diagnostic
From a Pod, Cluster, namespace, or workload detail page, you can begin an automated investigation by clicking Run Sift investigation.
Sift performs a set of automated system checks, and surfaces potential issues in your Kubernetes environment.
It then works to identify the root cause of an incident.
To access root cause analysis tools, enable Asserts on your stack.
You can take troubleshooting deeper by understanding relationships between components and what is occurring between them.
Within Kubernetes Monitoring, access RCA Workbench to perform root cause analysis.
Access the RCA Workbench by any of these methods:
Select the box to the left of the list item, and click the Compare button.
On the detail page for a Pod or workload, click View application layer, then Go to Application Observability to navigate directly to more data, such as the service health.
To return to Kubernetes Monitoring, click the browser back button.
View queries to troubleshoot with Explore
To further query data, use any of the Explore buttons available throughout the interface (such as Explore namespaces or Explore alerts). Explore opens with additional query tools for troubleshooting.
If you have the admin role, you can manage the configuration of Kubernetes Monitoring by working with:
Data source choices
Alerts
Integration installations
Optional custom log queries
Configuration instructions for deploying, configuring, and updating the Grafana Kubernetes Monitoring Helm chart
Access more information
Click the documentation links on a page to find more information about what you’re viewing.
Navigation tips
Here are some tips and shortcuts for getting around in Kubernetes Monitoring.
This feature can be seen on the Kubernetes Monitoring Overview.
Throughout the views in Kubernetes Monitoring, you see color used as an additional means of indicating status or condition.
For example, sometimes text is a different color for Pod status: