Cilium Enterprise integration for Grafana Cloud
The Cilium Enterprise integration uses Grafana Alloy to collect metrics exposed by the Cilium Operator, the Cilium Agent and its components, and Hubble. Dashboards are provided both as overviews and per component. The integration includes 18 alerts and 20 pre-built dashboards to help you monitor and visualize Cilium Enterprise metrics.
Kubernetes instructions
Before you begin with Kubernetes
Note: These instructions assume you are using the Kubernetes Monitoring Helm chart.
This integration monitors a Cilium Enterprise and Hubble Enterprise deployment that has metrics exporters enabled. Make sure you have completed the following setup steps:
- Enable the embedded Prometheus exporter in your Cilium deployment to collect and expose metrics.
 - Enable the embedded Prometheus exporter in Hubble if you want Hubble metrics to be included.
 
Once the exporters have been enabled, the metrics will be automatically exposed and available for collection by either Prometheus or Grafana Alloy deployed to your cluster.
This integration assumes Hubble metrics have been enabled for:
- dns
 - drop
 - tcp
 - flow
 - icmp
 - http
 
For example, you can enable these with a Helm command similar to the following, adjusted for Cilium Enterprise:
helm install <cilium-enterprise-repository> --version 1.12.2 \
  --namespace kube-system \
  --set hubble.metrics.enabled="{dns,drop,tcp,flow,icmp,http}"
Cilium version 1.12.2 and later is supported.
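The same exporter settings can also be kept in a Helm values file rather than passed as `--set` flags. The following is a minimal sketch that assumes the key names used by the open-source Cilium chart (`prometheus.enabled`, `operator.prometheus.enabled`, `hubble.metrics.enabled`). Your Cilium Enterprise chart may name or nest these values differently, so check its values reference before applying.

```yaml
# Sketch of a values file; key names are assumed from the open-source Cilium chart
# and may differ in the Cilium Enterprise chart.
prometheus:
  enabled: true          # expose Cilium Agent metrics
operator:
  prometheus:
    enabled: true        # expose Cilium Operator metrics
hubble:
  enabled: true
  metrics:
    enabled:             # the Hubble metrics this integration expects
      - dns
      - drop
      - tcp
      - flow
      - icmp
      - http
```

Keeping these settings in a file makes it easier to review them alongside the rest of your Cilium configuration.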
Configuration snippets for Kubernetes Helm chart
The following snippets provide examples to guide you through the configuration process.
To scrape your Cilium Enterprise instances, manually modify your Kubernetes Monitoring Helm chart with these configuration snippets.
Replace any values between the angle brackets <> in the provided snippets with your desired configuration values.
Metrics snippets
# Replace any values between the angle brackets '<>', with your desired configuration
alloy-metrics:
    extraConfig: |-
        // Cilium Agent
        discovery.kubernetes "cilium_agent" {
            role = "service"
            selectors {
                role = "service"
                label = "k8s-app=cilium"
            }
        }
        
        discovery.relabel "cilium_agent" {
            targets = discovery.kubernetes.cilium_agent.targets
            rule {
                source_labels = ["__meta_kubernetes_endpoint_port_name"]
                regex = "metrics"
                action = "keep"
            }
            rule {
                source_labels = ["__meta_kubernetes_service_label_k8s_app"]
                target_label = "k8s_app"
            }
        }
        prometheus.scrape "cilium_agent" {
            targets      = discovery.relabel.cilium_agent.output
            job_name     = "integrations/cilium-enterprise/cilium-agent"
            honor_labels = true
            forward_to   = [prometheus.remote_write.grafana_cloud_metrics.receiver]
        }
        // Cilium Operator
        discovery.kubernetes "cilium_operator" {
            role = "service"
            selectors {
                role = "service"
                label = "name=cilium-operator,io.cilium/app=operator"
            }
        }
        discovery.relabel "cilium_operator" {
            targets = discovery.kubernetes.cilium_operator.targets
            rule {
                source_labels = ["__meta_kubernetes_endpoint_port_name"]
                regex = "metrics"
                action = "keep"
            }
            rule {
                source_labels = ["__meta_kubernetes_service_label_io_cilium_app"]
                target_label = "io_cilium_app"
            }
        }
        prometheus.scrape "cilium_operator" {
            targets      = discovery.relabel.cilium_operator.output
            job_name     = "integrations/cilium-enterprise/cilium-operator"
            honor_labels = true
            forward_to   = [prometheus.remote_write.grafana_cloud_metrics.receiver]
        }
        // Hubble Relay
        discovery.kubernetes "hubble_relay" {
            role = "service"
            selectors {
                role = "service"
                label = "k8s-app=hubble-relay"
            }
        }
        discovery.relabel "hubble_relay" {
            targets = discovery.kubernetes.hubble_relay.targets
            rule {
                source_labels = ["__meta_kubernetes_endpoint_port_name"]
                regex = "metrics"
                action = "keep"
            }
        }
        prometheus.scrape "hubble_relay" {
            targets    = discovery.relabel.hubble_relay.output
            job_name   = "integrations/cilium-enterprise/hubble-relay"
            forward_to = [prometheus.remote_write.grafana_cloud_metrics.receiver]
        }
        // Hubble
        discovery.kubernetes "hubble" {
            role = "service"
            selectors {
                role = "service"
                label = "k8s-app=hubble"
            }
        }
        
        discovery.relabel "hubble" {
            targets = discovery.kubernetes.hubble.targets
            rule {
                source_labels = ["__meta_kubernetes_endpoint_port_name"]
                regex = "hubble-metrics"
                action = "keep"
            }
        }
        prometheus.scrape "hubble" {
            targets      = discovery.relabel.hubble.output
            job_name     = "integrations/cilium-enterprise/hubble"
            honor_labels = true
            forward_to   = [prometheus.remote_write.grafana_cloud_metrics.receiver]
        }
        // Hubble Enterprise
        discovery.kubernetes "hubble_enterprise" {
            role = "service"
            selectors {
                role = "service"
                label = "app.kubernetes.io/name=hubble-enterprise"
            }
        }
        discovery.relabel "hubble_enterprise" {
            targets = discovery.kubernetes.hubble_enterprise.targets
            rule {
                source_labels = ["__meta_kubernetes_endpoint_port_name"]
                regex = "metrics"
                action = "keep"
            }
        }
        prometheus.scrape "hubble_enterprise" {
            targets      = discovery.relabel.hubble_enterprise.output
            job_name     = "integrations/cilium-enterprise/hubble-enterprise"
            honor_labels = true
            forward_to   = [prometheus.remote_write.grafana_cloud_metrics.receiver]
        }
        // Hubble Timescape Ingester
        discovery.kubernetes "hubble_timescape_ingester" {
            role = "service"
            selectors {
                role = "service"
                label = "app.kubernetes.io/name=hubble-timescape-ingester,app.kubernetes.io/component=ingester"
            }
        }
        discovery.relabel "hubble_timescape_ingester" {
            targets = discovery.kubernetes.hubble_timescape_ingester.targets
            rule {
                source_labels = ["__meta_kubernetes_endpoint_port_name"]
                regex = "metrics"
                action = "keep"
            }
        }
        
        prometheus.scrape "hubble_timescape_ingester" {
            targets      = discovery.relabel.hubble_timescape_ingester.output
            job_name     = "integrations/cilium-enterprise/hubble-timescape-ingester"
            honor_labels = true
            forward_to   = [prometheus.remote_write.grafana_cloud_metrics.receiver]
        }
        // Hubble Timescape Server
        discovery.kubernetes "hubble_timescape_server" {
            role = "service"
            selectors {
                role = "service"
                label = "app.kubernetes.io/name=hubble-timescape-server,app.kubernetes.io/component=server"
            }
        }
        discovery.relabel "hubble_timescape_server" {
            targets = discovery.kubernetes.hubble_timescape_server.targets
            rule {
                source_labels = ["__meta_kubernetes_endpoint_port_name"]
                regex = "metrics"
                action = "keep"
            }
        }
        
        prometheus.scrape "hubble_timescape_server" {
            targets      = discovery.relabel.hubble_timescape_server.output
            job_name     = "integrations/cilium-enterprise/hubble-timescape-server"
            honor_labels = true
            forward_to   = [prometheus.remote_write.grafana_cloud_metrics.receiver]
        }
Dashboards
The Cilium Enterprise integration installs the following dashboards in your Grafana Cloud instance to help monitor your system.
- Cilium / Agent Overview
 - Cilium / Components / API
 - Cilium / Components / Agent
 - Cilium / Components / BPF
 - Cilium / Components / Conntrack
 - Cilium / Components / Datapath
 - Cilium / Components / External HA FQDN Proxy
 - Cilium / Components / FQDN Proxy
 - Cilium / Components / Identities
 - Cilium / Components / Kubernetes
 - Cilium / Components / L3 Policy
 - Cilium / Components / L7 Proxy
 - Cilium / Components / Network
 - Cilium / Components / Nodes
 - Cilium / Components / Policy
 - Cilium / Components / Resource Utilization
 - Cilium / Operator
 - Cilium / Overview
 - Hubble / Overview
 - Hubble / Timescape
 
Dashboard screenshots: Cilium Overview, Cilium Overview (2), Cilium Agent Overview.
Alerts
The Cilium Enterprise integration includes the following useful alerts:
Cilium Endpoints
| Alert | Description | 
|---|---|
| CiliumAgentEndpointFailures | Warning: Cilium Agent has endpoints in the invalid state. | 
| CiliumAgentEndpointUpdateFailure | Warning: Calls to the Cilium Agent API to create or update endpoints are failing. | 
| CiliumAgentContainerNetworkInterfaceApiErrorEndpointCreate | Info: The Cilium Endpoint API rate limiter is reporting errors during endpoint creation. | 
| CiliumAgentApiEndpointErrors | Warning: API calls to Cilium Endpoints API are failing due to server errors. | 
Cilium IPAM
| Alert | Description | 
|---|---|
| CiliumOperatorExhaustedIpamIps | Critical: Cilium Operator has exhausted its IPAM IPs. | 
| CiliumOperatorLowAvailableIpamIps | Warning: Cilium Operator has used up over 90% of its available IPs. | 
| CiliumOperatorEniIpamErrors | Critical: Cilium Operator has high error rate while trying to create/attach ENIs for IPAM. | 
Cilium Maps
| Alert | Description | 
|---|---|
| CiliumAgentMapOperationFailures | Warning: Cilium Agent is experiencing errors updating BPF maps on Agent Pod. | 
| CiliumAgentBpfMapPressure | Warning: Map on Cilium Agent Pod is currently experiencing high map pressure. | 
Cilium NAT
| Alert | Description | 
|---|---|
| CiliumAgentNatTableFull | Critical: Cilium Agent Pod is dropping packets due to “No mapping for NAT masquerade” errors. | 
Cilium API
| Alert | Description | 
|---|---|
| CiliumAgentApiHighErrorRate | Info: Cilium Agent API on Pod is experiencing a high error rate. | 
Cilium Conntrack
| Alert | Description | 
|---|---|
| CiliumAgentConntrackTableFull | Critical: Cilium's conntrack map is failing on new insertions on Agent Pod. | 
| CiliumAgentConnTrackFailedGarbageCollectorRuns | Warning: Cilium Agent Conntrack GC runs are failing on Agent Pod. | 
Cilium Drops
| Alert | Description | 
|---|---|
| CiliumAgentHighDeniedRate | Info: Cilium Agent is experiencing a high drop rate due to policy rule denies. | 
Cilium Policy
| Alert | Description | 
|---|---|
| CiliumAgentPolicyMapPressure | Warning: Cilium Agent is experiencing high BPF map pressure. | 
Cilium Identity
| Alert | Description | 
|---|---|
| CiliumNodeLocalHighIdentityAllocation | Warning: Cilium is using a very high percent (over 80%) of its maximum per-node identity limit (65535). | 
| RunningOutOfCiliumClusterIdentities | Warning: Cilium is using a very high percent of its maximum cluster identity limit (65280). | 
Cilium Nodes
| Alert | Description | 
|---|---|
| CiliumUnreachableNodes | Info: Cilium Agent is reporting unreachable Nodes in the cluster. | 
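If you need a threshold that differs from the bundled alerts, you could add your own rule next to them. The sketch below uses the standard Prometheus rule-group format. It is a hypothetical custom rule, not the definition shipped with the integration: the group name, alert name, threshold, and `for` duration are all illustrative assumptions.

```yaml
# Hypothetical custom rule; this is NOT the rule installed by the integration.
groups:
  - name: custom-cilium-alerts
    rules:
      - alert: CustomCiliumBpfMapPressureHigh
        # cilium_bpf_map_pressure reports map fill as a ratio between 0 and 1.
        expr: max by (instance, map_name) (cilium_bpf_map_pressure) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "BPF map {{ $labels.map_name }} on {{ $labels.instance }} is more than 90% full"
```

Giving custom rules a distinct name avoids confusion with the alerts listed in the tables above.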
Metrics
The most important metrics provided by the Cilium Enterprise integration, which are used on the pre-built dashboards and Prometheus alerts, are as follows:
- cilium_agent_api_process_time_seconds_count
 - cilium_agent_api_process_time_seconds_sum
 - cilium_api_limiter_processed_requests_total
 - cilium_bpf_map_ops_total
 - cilium_bpf_map_pressure
 - cilium_controllers_runs_duration_seconds_count
 - cilium_controllers_runs_duration_seconds_sum
 - cilium_controllers_runs_total
 - cilium_datapath_conntrack_gc_duration_seconds_count
 - cilium_datapath_conntrack_gc_duration_seconds_sum
 - cilium_datapath_conntrack_gc_entries
 - cilium_datapath_conntrack_gc_key_fallbacks_total
 - cilium_datapath_conntrack_gc_runs_total
 - cilium_drop_bytes_total
 - cilium_drop_count_total
 - cilium_endpoint_regeneration_time_stats_seconds_count
 - cilium_endpoint_regeneration_time_stats_seconds_sum
 - cilium_endpoint_regenerations_total
 - cilium_endpoint_state
 - cilium_errors_warnings_total
 - cilium_forward_bytes_total
 - cilium_forward_count_total
 - cilium_identity
 - cilium_ip_addresses
 - cilium_k8s_client_api_calls_total
 - cilium_k8s_client_api_latency_time_seconds_count
 - cilium_k8s_client_api_latency_time_seconds_sum
 - cilium_kubernetes_events_received_total
 - cilium_kubernetes_events_total
 - cilium_nodes_all_events_received_total
 - cilium_nodes_all_num
 - cilium_operator_ces_queueing_delay_seconds_bucket
 - cilium_operator_ces_sync_errors_total
 - cilium_operator_ec2_api_duration_seconds_bucket
 - cilium_operator_identity_gc_entries
 - cilium_operator_identity_gc_runs
 - cilium_operator_ipam_allocation_ops
 - cilium_operator_ipam_deficit_resolver_duration_seconds_bucket
 - cilium_operator_ipam_interface_creation_ops
 - cilium_operator_ipam_ips
 - cilium_operator_ipam_k8s_sync_queued_total
 - cilium_operator_ipam_nodes
 - cilium_operator_ipam_resync_queued_total
 - cilium_operator_ipam_resync_total
 - cilium_operator_number_of_ceps_per_ces_sum
 - cilium_operator_process_cpu_seconds_total
 - cilium_operator_process_open_fds
 - cilium_operator_process_resident_memory_bytes
 - cilium_operator_process_virtual_memory_bytes
 - cilium_policy
 - cilium_policy_endpoint_enforcement_status
 - cilium_policy_l7_denied_total
 - cilium_policy_l7_forwarded_total
 - cilium_policy_l7_received_total
 - cilium_proxy_redirects
 - cilium_proxy_upstream_reply_seconds_count
 - cilium_proxy_upstream_reply_seconds_sum
 - cilium_services_events_total
 - cilium_triggers_policy_update_call_duration_seconds_count
 - cilium_triggers_policy_update_call_duration_seconds_sum
 - cilium_unreachable_nodes
 - cilium_version
 - hubble_dns_queries_total
 - hubble_dns_response_types_total
 - hubble_dns_responses_total
 - hubble_drop_total
 - hubble_flows_processed_total
 - hubble_http_request_duration_seconds_bucket
 - hubble_http_requests_total
 - hubble_http_responses_total
 - hubble_icmp_total
 - hubble_port_distribution_total
 - hubble_tcp_flags_total
 - isovalent_external_dns_proxy_policy_l7_total
 - isovalent_external_dns_proxy_processing_duration_seconds
 - isovalent_external_dns_proxy_update_errors_total
 - isovalent_external_dns_proxy_update_queue_size
 - timescape_clickhouse_queries_duration_seconds_bucket
 - timescape_clickhouse_queries_results_count
 - timescape_clickhouse_queries_results_sum
 - timescape_ingestor_flows_ingested_total
 - timescape_ingestor_ingest_duration_seconds_bucket
 - timescape_ingestor_ingest_running
 - timescape_ingestor_ingestfilter_batch_duration_seconds_bucket
 - timescape_ingestor_ingestfilter_filtered_errors_total
 - timescape_ingestor_ingestfilter_filtered_skipped_total
 - timescape_ingestor_ingestfilter_filtered_total
 - timescape_ingestor_ingestlog_getinfo_queries
 - up
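If you want to forward only a subset of these metrics, for example to limit active series, one option is to route a scrape through a `prometheus.relabel` allow-list. The sketch below is an assumption-laden example rather than part of the integration: the component name `cilium_agent_allowlist` is hypothetical, the regular expression covers only a few metrics from the list above, and the `prometheus.scrape "cilium_agent"` block is meant to replace the one in the metrics snippet, not to be added alongside it. Dropping metrics may leave some dashboard panels or alerts without data.

```yaml
alloy-metrics:
    extraConfig: |-
        // Hypothetical allow-list: keep only the metric names matched by the regex.
        // "up" is kept so that target health stays visible.
        prometheus.relabel "cilium_agent_allowlist" {
            forward_to = [prometheus.remote_write.grafana_cloud_metrics.receiver]
            rule {
                source_labels = ["__name__"]
                regex         = "up|cilium_version|cilium_endpoint_state|cilium_drop_count_total|cilium_bpf_map_pressure"
                action        = "keep"
            }
        }
        // Same scrape as in the metrics snippet, but forwarding to the allow-list.
        prometheus.scrape "cilium_agent" {
            targets      = discovery.relabel.cilium_agent.output
            job_name     = "integrations/cilium-enterprise/cilium-agent"
            honor_labels = true
            forward_to   = [prometheus.relabel.cilium_agent_allowlist.receiver]
        }
```

Extend the regular expression with any other metric names from the list that your dashboards and alerts rely on.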
 
Changelog
# 1.0.0 - June 2024
* Update Mixin to latest version
  - Removed pod filter from alert rules
  - Added thresholds for alerts using rate()
  - Added aggregation label support
# 0.0.4 - November 2023
* Replaced Angular dashboard panels with React panels
# 0.0.3 - July 2023
* Added support for using the integration in the Grafana Cloud Kubernetes App
* Update all scrape intervals to be 60s
* Fix job name to correct value in static agent config
# 0.0.2 - January 2023
* Update mixin to latest version:
  - Add new alert `CiliumOperatorEniIpamErrors` to alert on errors related to allocating new IPAM addresses and situations where nodes are experiencing IPAM exhaustion
  - Fix alert conditions to trigger correctly
# 0.0.1 - October 2022
* Initial release
Cost
By connecting your Cilium Enterprise instance to Grafana Cloud, you might incur charges. To view information on the number of active series that your Grafana Cloud account uses for metrics included in each Cloud tier, see Active series and dpm usage and Cloud tier pricing.



