Cilium Enterprise integration for Grafana Cloud

The Cilium Enterprise integration uses Grafana Alloy to collect metrics exposed by the Cilium Operator, the Cilium Agent and its components, and Hubble. It provides dashboards for both high-level overviews and individual components. This integration includes 18 alerts and 20 pre-built dashboards to help you monitor and visualize Cilium Enterprise metrics.

Dashboards

The Cilium Enterprise integration installs the following dashboards in your Grafana Cloud instance to help monitor your system.

  • Cilium / Agent Overview
  • Cilium / Components / API
  • Cilium / Components / Agent
  • Cilium / Components / BPF
  • Cilium / Components / Conntrack
  • Cilium / Components / Datapath
  • Cilium / Components / External HA FQDN Proxy
  • Cilium / Components / FQDN Proxy
  • Cilium / Components / Identities
  • Cilium / Components / Kubernetes
  • Cilium / Components / L3 Policy
  • Cilium / Components / L7 Proxy
  • Cilium / Components / Network
  • Cilium / Components / Nodes
  • Cilium / Components / Policy
  • Cilium / Components / Resource Utilization
  • Cilium / Operator
  • Cilium / Overview
  • Hubble / Overview
  • Hubble / Timescape

[Image: Cilium Overview dashboard]

[Image: Cilium Overview dashboard (2)]

[Image: Cilium Agent Overview dashboard]

Alerts

The Cilium Enterprise integration includes the following useful alerts:

Cilium Endpoints

Alert: Description
CiliumAgentEndpointFailuresWarning: Cilium Agent endpoints are in the invalid state.
CiliumAgentEndpointUpdateFailureWarning: Calls to the Cilium Agent API to create or update Endpoints are failing.
CiliumAgentContainerNetworkInterfaceApiErrorEndpointCreateInfo: The Cilium Endpoint API endpoint rate limiter is reporting errors during endpoint creation.
CiliumAgentApiEndpointErrorsWarning: API calls to the Cilium Endpoints API are failing due to server errors.

Cilium IPAM

Alert: Description
CiliumOperatorExhaustedIpamIpsCritical: Cilium Operator has exhausted its IPAM IPs.
CiliumOperatorLowAvailableIpamIpsWarning: Cilium Operator has used up over 90% of its available IPs.
CiliumOperatorEniIpamErrorsCritical: Cilium Operator has a high error rate while trying to create or attach ENIs for IPAM.
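
The arithmetic behind the two IPAM capacity alerts can be illustrated with a minimal Python sketch. This is not the alert rule shipped with the integration; the actual expressions live in the Cilium Enterprise mixin and are not shown on this page. The function names and the `used_ips` and `available_ips` inputs are hypothetical, and only the 90% figure is taken from the alert description above.

```python
def ipam_usage_ratio(used_ips: int, available_ips: int) -> float:
    """Return the fraction of the IP pool that is in use (0.0 to 1.0)."""
    total = used_ips + available_ips
    if total == 0:
        # An empty pool has nothing to exhaust; treat it as 0% used.
        return 0.0
    return used_ips / total


def classify_ipam(used_ips: int, available_ips: int, warn_ratio: float = 0.9) -> str:
    """Classify pool state; warn_ratio=0.9 mirrors the 'over 90%' wording."""
    if used_ips > 0 and available_ips == 0:
        # Every IP is in use: the pool is exhausted.
        return "critical"
    if ipam_usage_ratio(used_ips, available_ips) > warn_ratio:
        return "warning"
    return "ok"


print(classify_ipam(95, 5))   # 95% used
print(classify_ipam(100, 0))  # pool exhausted
```

With these assumptions, a pool at exactly 90% usage does not yet count as a warning, because the description says "over 90%".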

Cilium Maps

Alert: Description
CiliumAgentMapOperationFailuresWarning: Cilium Agent is experiencing errors updating BPF maps on Agent Pod.
CiliumAgentBpfMapPressureWarning: Map on Cilium Agent Pod is currently experiencing high map pressure.

Cilium NAT

Alert: Description
CiliumAgentNatTableFullCritical: Cilium Agent Pod is dropping packets due to “No mapping for NAT masquerade” errors.

Cilium API

Alert: Description
CiliumAgentApiHighErrorRateInfo: Cilium Agent API on Pod is experiencing a high error rate.

Cilium Conntrack

Alert: Description
CiliumAgentConntrackTableFullCritical: Cilium's conntrack map is failing on new insertions on the Agent Pod.
CiliumAgentConnTrackFailedGarbageCollectorRunsWarning: Cilium Agent Conntrack GC runs are failing on Agent Pod.

Cilium Drops

Alert: Description
CiliumAgentHighDeniedRateInfo: Cilium Agent is experiencing a high drop rate due to policy rule denies.

Cilium Policy

Alert: Description
CiliumAgentPolicyMapPressureWarning: Cilium Agent is experiencing high BPF map pressure.

Cilium Identity

Alert: Description
CiliumNodeLocalHighIdentityAllocationWarning: Cilium is using a very high percent (over 80%) of its maximum per-node identity limit (65535).
RunningOutOfCiliumClusterIdentitiesWarning: Cilium is using a very high percent of its maximum cluster identity limit (65280).
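
As a worked example of the identity limits quoted above, the following Python sketch computes identity usage as a percentage of a limit. The two limits and the 80% per-node threshold come from the alert descriptions; the function names are hypothetical, and the threshold for the cluster-wide alert is not stated on this page, so it is left as a caller-supplied parameter.

```python
NODE_LOCAL_IDENTITY_LIMIT = 65535  # per-node limit quoted in the alert description
CLUSTER_IDENTITY_LIMIT = 65280     # cluster-wide limit quoted in the alert description


def identity_usage_percent(allocated: int, limit: int) -> float:
    """Return allocated identities as a percentage of the given limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return 100.0 * allocated / limit


def exceeds_threshold(allocated: int, limit: int, threshold_percent: float) -> bool:
    """True when usage is strictly above the threshold ('over N%')."""
    return identity_usage_percent(allocated, limit) > threshold_percent


# 80% of 65535 is 52428, so the per-node condition is first met at 52429.
print(exceeds_threshold(52428, NODE_LOCAL_IDENTITY_LIMIT, 80.0))
print(exceeds_threshold(52429, NODE_LOCAL_IDENTITY_LIMIT, 80.0))
```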

Cilium Nodes

Alert: Description
CiliumUnreachableNodesInfo: Cilium Agent is reporting unreachable Nodes in the cluster.

Metrics

The most important metrics provided by the Cilium Enterprise integration, which are used on the pre-built dashboards and Prometheus alerts, are as follows:

  • cilium_agent_api_process_time_seconds_count
  • cilium_agent_api_process_time_seconds_sum
  • cilium_api_limiter_processed_requests_total
  • cilium_bpf_map_ops_total
  • cilium_bpf_map_pressure
  • cilium_controllers_runs_duration_seconds_count
  • cilium_controllers_runs_duration_seconds_sum
  • cilium_controllers_runs_total
  • cilium_datapath_conntrack_gc_duration_seconds_count
  • cilium_datapath_conntrack_gc_duration_seconds_sum
  • cilium_datapath_conntrack_gc_entries
  • cilium_datapath_conntrack_gc_key_fallbacks_total
  • cilium_datapath_conntrack_gc_runs_total
  • cilium_drop_bytes_total
  • cilium_drop_count_total
  • cilium_endpoint_regeneration_time_stats_seconds_count
  • cilium_endpoint_regeneration_time_stats_seconds_sum
  • cilium_endpoint_regenerations_total
  • cilium_endpoint_state
  • cilium_errors_warnings_total
  • cilium_forward_bytes_total
  • cilium_forward_count_total
  • cilium_identity
  • cilium_ip_addresses
  • cilium_k8s_client_api_calls_total
  • cilium_k8s_client_api_latency_time_seconds_count
  • cilium_k8s_client_api_latency_time_seconds_sum
  • cilium_kubernetes_events_received_total
  • cilium_kubernetes_events_total
  • cilium_nodes_all_events_received_total
  • cilium_nodes_all_num
  • cilium_operator_ces_queueing_delay_seconds_bucket
  • cilium_operator_ces_sync_errors_total
  • cilium_operator_ec2_api_duration_seconds_bucket
  • cilium_operator_identity_gc_entries
  • cilium_operator_identity_gc_runs
  • cilium_operator_ipam_allocation_ops
  • cilium_operator_ipam_deficit_resolver_duration_seconds_bucket
  • cilium_operator_ipam_interface_creation_ops
  • cilium_operator_ipam_ips
  • cilium_operator_ipam_k8s_sync_queued_total
  • cilium_operator_ipam_nodes
  • cilium_operator_ipam_resync_queued_total
  • cilium_operator_ipam_resync_total
  • cilium_operator_number_of_ceps_per_ces_sum
  • cilium_operator_process_cpu_seconds_total
  • cilium_operator_process_open_fds
  • cilium_operator_process_resident_memory_bytes
  • cilium_operator_process_virtual_memory_bytes
  • cilium_policy
  • cilium_policy_endpoint_enforcement_status
  • cilium_policy_l7_denied_total
  • cilium_policy_l7_forwarded_total
  • cilium_policy_l7_received_total
  • cilium_proxy_redirects
  • cilium_proxy_upstream_reply_seconds_count
  • cilium_proxy_upstream_reply_seconds_sum
  • cilium_services_events_total
  • cilium_triggers_policy_update_call_duration_seconds_count
  • cilium_triggers_policy_update_call_duration_seconds_sum
  • cilium_unreachable_nodes
  • cilium_version
  • hubble_dns_queries_total
  • hubble_dns_response_types_total
  • hubble_dns_responses_total
  • hubble_drop_total
  • hubble_flows_processed_total
  • hubble_http_request_duration_seconds_bucket
  • hubble_http_requests_total
  • hubble_http_responses_total
  • hubble_icmp_total
  • hubble_port_distribution_total
  • hubble_tcp_flags_total
  • isovalent_external_dns_proxy_policy_l7_total
  • isovalent_external_dns_proxy_processing_duration_seconds
  • isovalent_external_dns_proxy_update_errors_total
  • isovalent_external_dns_proxy_update_queue_size
  • timescape_clickhouse_queries_duration_seconds_bucket
  • timescape_clickhouse_queries_results_count
  • timescape_clickhouse_queries_results_sum
  • timescape_ingestor_flows_ingested_total
  • timescape_ingestor_ingest_duration_seconds_bucket
  • timescape_ingestor_ingest_running
  • timescape_ingestor_ingestfilter_batch_duration_seconds_bucket
  • timescape_ingestor_ingestfilter_filtered_errors_total
  • timescape_ingestor_ingestfilter_filtered_skipped_total
  • timescape_ingestor_ingestfilter_filtered_total
  • timescape_ingestor_ingestlog_getinfo_queries
  • up
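
One way to check whether a scrape target exposes the metrics listed above is to parse its output in the Prometheus text exposition format and compare the metric names against the list. The Python sketch below is a minimal illustration under that assumption. The helper names and the `SAMPLE` payload (including its label) are made up for the example, and a single target such as the Cilium Agent would not be expected to expose Operator, Hubble, or Timescape metrics.

```python
import re
from typing import Iterable, List, Set

# Matches the metric name at the start of a sample line,
# for example `cilium_version{...} 1`.
_SAMPLE_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")

# Made-up payload for illustration only; real output has many more series.
SAMPLE = """\
# HELP cilium_version Example help text
# TYPE cilium_version gauge
cilium_version{example_label="example"} 1
cilium_unreachable_nodes 0
"""


def metric_names(exposition_text: str) -> Set[str]:
    """Collect the metric names found on sample lines, skipping comments."""
    names: Set[str] = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and # HELP / # TYPE lines carry no samples
        match = _SAMPLE_NAME.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_metrics(exposition_text: str, expected: Iterable[str]) -> List[str]:
    """Return the expected metric names that do not appear in the payload."""
    present = metric_names(exposition_text)
    return sorted(name for name in expected if name not in present)


expected = ["cilium_version", "cilium_unreachable_nodes", "hubble_drop_total"]
print(missing_metrics(SAMPLE, expected))
```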

Changelog

# 1.0.0 - June 2024

* Update Mixin to latest version
  - Removed pod filter from alert rules
  - Added thresholds for alerts using rate()
  - Added aggregation label support

# 0.0.4 - November 2023

* Replaced Angular dashboard panels with React panels

# 0.0.3 - July 2023

* Added support for using the integration in the Grafana Cloud Kubernetes App
* Update all scrape intervals to be 60s
* Fix job name to correct value in static agent config

# 0.0.2 - January 2023

* Update mixin to latest version:
  - Add new alert `CiliumOperatorEniIpamErrors` to alert on errors related to allocating new IPAM addresses and situations where nodes are experiencing IPAM exhaustion
  - Fix alert conditions to trigger correctly

# 0.0.1 - October 2022

* Initial release

Cost

By connecting your Cilium Enterprise instance to Grafana Cloud, you might incur charges. To view information on the number of active series that your Grafana Cloud account uses for metrics included in each Cloud tier, see "Active series and dpm usage" and "Cloud tier pricing".
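
As a rough illustration of how scrape interval relates to data points per minute (DPM), the Python sketch below estimates DPM from an active series count. It assumes each active series produces one data point per scrape, and it defaults to the 60s scrape interval mentioned in the changelog. The function names and the series count are hypothetical, and the sketch does not model pricing or billing.

```python
def dpm_per_series(scrape_interval_seconds: float) -> float:
    """Data points per minute produced by one series at the given interval."""
    if scrape_interval_seconds <= 0:
        raise ValueError("scrape interval must be positive")
    return 60.0 / scrape_interval_seconds


def total_dpm(active_series: int, scrape_interval_seconds: float = 60.0) -> float:
    """Estimated data points per minute across all active series."""
    return active_series * dpm_per_series(scrape_interval_seconds)


# Hypothetical series count, scraped at the 60s and at a 15s interval.
print(total_dpm(1000))
print(total_dpm(1000, scrape_interval_seconds=15))
```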