Kafka integration for Grafana Cloud
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
This integration includes eight useful alerts and seven pre-built dashboards to help monitor and visualize Kafka metrics.
Pre-install configuration for the Kafka integration
For the integration to work, you must configure a JMX exporter on each instance in your Kafka cluster, including all brokers, ZooKeeper nodes, ksqlDB servers, Schema Registry instances, and Kafka Connect nodes.
Each of these instances has its own JMX Exporter config file. The following files should be used for each respective Kafka component. For more details on how to configure your Kafka JVM with the JMX exporter, refer to the JMX Exporter documentation.
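As a rough illustration of what such a JMX Exporter config file contains, the sketch below maps one broker MBean to a Prometheus metric. It is not one of the files provided for this integration. The single rule and its pattern are assumptions, chosen only to show how a name such as kafka_controller_kafkacontroller_offlinepartitionscount could be derived. Use the integration's own files in practice.

```yaml
# Illustrative JMX Exporter config (an assumption, not the integration's file).
# Lowercase the generated metric names.
lowercaseOutputName: true
rules:
  # Hypothetical rule: expose gauge-style controller MBeans, for example
  # kafka.controller:type=KafkaController,name=OfflinePartitionsCount
  - pattern: kafka.controller<type=(.+), name=(.+)><>Value
    name: kafka_controller_$1_$2
    type: GAUGE
```

The exporter is typically attached to the component's JVM as a Java agent and listens on the port that the Agent scrapes (7001 in the agent configuration example on this page).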
If you also want to monitor consumption lag, you need to update the Grafana Agent to version 0.17.0 or higher. For details, see the Grafana Agent configuration reference.
Install Kafka integration for Grafana Cloud
- In your Grafana Cloud instance, click Integrations and Connections (lightning bolt icon).
- Navigate to the Kafka tile and review the prerequisites. Then click Install integration.
- Once the integration is installed, follow the steps on the Configuration Details page to set up Grafana Agent and start sending Kafka metrics to your Grafana Cloud instance.
Post-install configuration for the Kafka integration
We recommend that you configure a separate user for the Agent and grant it only the security privileges strictly necessary for monitoring your nodes, as described in the documentation.
Below is an agent configuration example for this integration:
metrics:
  wal_directory: /tmp/wal
  configs:
    - name: integrations
      scrape_configs:
        - job_name: integrations/kafka
          static_configs:
            - targets: ['kafka-node1:7001', 'kafka-node2:7001', 'kafka-node3:7001']
        - job_name: integrations/kafka-zookeeper
          static_configs:
            - targets: ['zookeeper-node1:7001', 'zookeeper-node2:7001', 'zookeeper-node3:7001']
        - job_name: integrations/kafka-connect
          static_configs:
            - targets: ['kafka-connect-node1:7001', 'kafka-connect-node2:7001', 'kafka-connect-node3:7001']
        - job_name: integrations/kafka-schemaregistry
          static_configs:
            - targets:
                [
                  'kafka-schemaregistry-node1:7001',
                  'kafka-schemaregistry-node2:7001',
                  'kafka-schemaregistry-node3:7001',
                ]
        - job_name: integrations/kafka-ksqldb
          static_configs:
            - targets: ['kafka-ksqldb-node1:7001', 'kafka-ksqldb-node2:7001', 'kafka-ksqldb-node3:7001']
      remote_write:
        - url: http://cortex:9009/api/prom/push
integrations:
  kafka_exporter:
    enabled: true
    kafka_uris: ['kafka-node1:9091', 'kafka-node2:9091', 'kafka-node3:9091']
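If your brokers require authentication, the dedicated monitoring user recommended above could be passed to the kafka_exporter integration roughly as follows. This is a sketch under assumptions: the user name and password are placeholders, and the option names should be checked against the Grafana Agent configuration reference for your Agent version.

```yaml
integrations:
  kafka_exporter:
    enabled: true
    kafka_uris: ['kafka-node1:9091', 'kafka-node2:9091', 'kafka-node3:9091']
    # Assumption: the brokers accept SASL logins over TLS.
    use_sasl: true
    # Placeholder credentials for a dedicated, least-privilege monitoring user.
    sasl_username: 'grafana-agent-monitor'
    sasl_password: 'REPLACE_ME'
    use_tls: true
```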
Dashboards
The Kafka integration installs the following dashboards in your Grafana Cloud instance to help monitor your metrics.
- Kafka Connect Overview
- Kafka Lag Overview
- Kafka Overview
- Kafka Topics
- Schema Registry Overview
- Zookeeper Overview
- ksqldb Overview
Kafka Overview dashboard
Kafka Connect Overview dashboard
Kafka KSQL Overview dashboard
Schema Registry Overview dashboard
Alerts
The Kafka integration includes the following useful alerts:
Group: Kafka_Alerts
Alert | Description |
---|---|
KafkaOfflinePartitonCount | Critical: After a successful leader election, if the leader for a partition dies, the partition moves to the OfflinePartition state. Offline partitions are not available for reading or writing. Restart the brokers, if needed, and check the logs for errors. |
KafkaUnderReplicatedPartitionCount | Critical: Under-replicated partitions mean that one or more replicas are not available. This is usually because a broker is down. Restart the broker, and check for errors in the logs. |
KafkaActiveController | Critical: No broker in the cluster has reported as the active controller in the last 1-minute interval. In steady state, there should be exactly one active controller per cluster. |
KafkaUncleanLeaderElection | Critical: Unclean partition leader elections were reported in the cluster in the last 1-minute interval. When an unclean leader election is held among out-of-sync replicas, data loss is possible if any messages were not synced before the former leader was lost. If the number of unclean elections is greater than 0, investigate the broker logs to determine why leaders were re-elected, and look for WARN or ERROR messages. Consider setting the broker configuration parameter unclean.leader.election.enable to false so that a replica outside the set of in-sync replicas is never elected leader. |
KafkaISRExpandRate | Warning: If a broker goes down, the ISR for some of the partitions shrinks. When that broker is up again, the ISRs expand once the replicas are fully caught up. Otherwise, the expected value for the ISR expansion rate is 0. If the ISR is expanding and shrinking frequently, adjust the allowed replica lag. |
KafkaISRShrinkRate | Warning: If a broker goes down, the ISR for some of the partitions shrinks. When that broker is up again, the ISRs expand once the replicas are fully caught up. Otherwise, the expected value for the ISR shrink rate is 0. If the ISR is expanding and shrinking frequently, adjust the allowed replica lag. |
KafkaBrokerCount | Critical: Broker count is 0. |
KafkaZookeeperSyncConnect | Warning: Zookeeper Sync Disconnected. |
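To make the alert descriptions above more concrete, here is a sketch of how two of these conditions could be written as Prometheus-style alerting rules. These are not the rules that the integration installs. The group name, alert names, expressions, and `for` durations are assumptions. Only the metric names come from the metrics list on this page.

```yaml
groups:
  - name: kafka_alerts_sketch # hypothetical group name
    rules:
      # Fires when any partition is reported offline.
      - alert: ExampleKafkaOfflinePartitions
        expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: One or more Kafka partitions are offline.
      # Fires when the cluster does not report exactly one active controller.
      - alert: ExampleKafkaActiveController
        expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: The cluster is not reporting exactly one active controller.
```

If you monitor several clusters, the sums would need to be grouped by a label that identifies each cluster.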
Metrics
The following metrics are automatically written to your Grafana Cloud instance by connecting your Kafka instance through this integration:
- jvm_gc_collection_seconds_sum
- jvm_memory_bytes_max
- jvm_memory_bytes_used
- kafka_cluster_partition_underminisr
- kafka_cluster_partition_underreplicated
- kafka_connect_connect_metrics_connection_count
- kafka_connect_connect_metrics_failed_authentication_total
- kafka_connect_connect_metrics_incoming_byte_rate
- kafka_connect_connect_metrics_io_ratio
- kafka_connect_connect_metrics_network_io_rate
- kafka_connect_connect_metrics_outgoing_byte_rate
- kafka_connect_connect_metrics_request_rate
- kafka_connect_connect_metrics_response_rate
- kafka_connect_connect_metrics_successful_authentication_rate
- kafka_connect_connect_worker_metrics_connector_destroyed_task_count
- kafka_connect_connect_worker_metrics_connector_failed_task_count
- kafka_connect_connect_worker_metrics_connector_paused_task_count
- kafka_connect_connect_worker_metrics_connector_running_task_count
- kafka_connect_connect_worker_metrics_connector_total_task_count
- kafka_connect_connect_worker_metrics_connector_unassigned_task_count
- kafka_connect_connect_worker_rebalance_metrics_rebalance_avg_time_ms
- kafka_connect_connect_worker_rebalance_metrics_time_since_last_rebalance_ms
- kafka_connect_connector_metrics
- kafka_connect_connector_task_metrics_batch_size_avg
- kafka_connect_connector_task_metrics_batch_size_max
- kafka_connect_connector_task_metrics_offset_commit_avg_time_ms
- kafka_connect_connector_task_metrics_offset_commit_success_percentage
- kafka_connect_connector_task_metrics_running_ratio
- kafka_connect_sink_task_metrics_partition_count
- kafka_connect_sink_task_metrics_put_batch_avg_time_ms
- kafka_connect_sink_task_metrics_put_batch_max_time_ms
- kafka_connect_source_task_metrics_poll_batch_avg_time_ms
- kafka_connect_source_task_metrics_poll_batch_max_time_ms
- kafka_connect_source_task_metrics_source_record_active_count_avg
- kafka_connect_source_task_metrics_source_record_active_count_max
- kafka_connect_source_task_metrics_source_record_poll_rate
- kafka_connect_source_task_metrics_source_record_write_rate
- kafka_connect_task_error_metrics_deadletterqueue_produce_requests
- kafka_connect_task_error_metrics_total_errors_logged
- kafka_connect_task_error_metrics_total_record_errors
- kafka_connect_task_error_metrics_total_record_failures
- kafka_connect_task_error_metrics_total_records_skipped
- kafka_connect_task_error_metrics_total_retries
- kafka_consumer_lag_millis
- kafka_consumergroup_current_offset
- kafka_consumergroup_uncommitted_offsets
- kafka_controller_ControllerStats_UncleanLeaderElectionsPerSec
- kafka_controller_KafkaController_ActiveControllerCount
- kafka_controller_KafkaController_OfflinePartitionsCount
- kafka_controller_controllerstats_uncleanleaderelectionspersec
- kafka_controller_kafkacontroller_activecontrollercount
- kafka_controller_kafkacontroller_offlinepartitionscount
- kafka_controller_kafkacontroller_preferredreplicaimbalancecount
- kafka_coordinator_group_groupmetadatamanager_numgroups
- kafka_coordinator_group_groupmetadatamanager_numgroupscompletingrebalance
- kafka_coordinator_group_groupmetadatamanager_numgroupsdead
- kafka_coordinator_group_groupmetadatamanager_numgroupsempty
- kafka_coordinator_group_groupmetadatamanager_numgroupspreparingrebalance
- kafka_coordinator_group_groupmetadatamanager_numgroupsstable
- kafka_log_log_logendoffset
- kafka_log_log_logstartoffset
- kafka_log_log_size
- kafka_network_acceptor_acceptorblockedpercent
- kafka_network_requestchannel_requestqueuesize
- kafka_network_requestchannel_responsequeuesize
- kafka_network_requestmetrics_localtimems
- kafka_network_requestmetrics_remotetimems
- kafka_network_requestmetrics_requestqueuetimems
- kafka_network_requestmetrics_requestspersec
- kafka_network_requestmetrics_responsequeuetimems
- kafka_network_requestmetrics_responsesendtimems
- kafka_network_socketserver_networkprocessoravgidlepercent
- kafka_schema_registry_jersey_metrics_request_latency_99
- kafka_schema_registry_jersey_metrics_request_rate
- kafka_schema_registry_jetty_metrics_connections_active
- kafka_schema_registry_registered_count
- kafka_schema_registry_schemas_created
- kafka_server_KafkaServer_BrokerState
- kafka_server_ReplicaManager_IsrExpandsPerSec
- kafka_server_ReplicaManager_IsrShrinksPerSec
- kafka_server_ReplicaManager_UnderReplicatedPartitions
- kafka_server_SessionExpireListener_ZooKeeperSyncConnectsPerSec
- kafka_server_brokertopicmetrics_bytesinpersec
- kafka_server_brokertopicmetrics_bytesoutpersec
- kafka_server_brokertopicmetrics_fetchmessageconversionspersec
- kafka_server_brokertopicmetrics_messagesinpersec
- kafka_server_brokertopicmetrics_producemessageconversionspersec
- kafka_server_brokertopicmetrics_totalfetchrequestspersec
- kafka_server_brokertopicmetrics_totalproducerequestspersec
- kafka_server_kafkarequesthandlerpool_requesthandleravgidlepercent_total
- kafka_server_replicamanager_isrexpandspersec
- kafka_server_replicamanager_isrshrinkspersec
- kafka_server_replicamanager_leadercount
- kafka_server_replicamanager_partitioncount
- kafka_server_replicamanager_underreplicatedpartitions
- kafka_server_sessionexpirelistener_zookeeperauthfailurespersec
- kafka_server_sessionexpirelistener_zookeeperdisconnectspersec
- kafka_server_sessionexpirelistener_zookeeperexpirespersec
- kafka_server_sessionexpirelistener_zookeepersyncconnectspersec
- kafka_server_socketservermetrics_connection_close_rate
- kafka_server_socketservermetrics_connection_count
- kafka_server_socketservermetrics_connection_creation_rate
- kafka_server_socketservermetrics_connections
- kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms
- kafka_streams_stream_state_metrics_delete_latency_avg
- kafka_streams_stream_state_metrics_delete_latency_max
- kafka_streams_stream_state_metrics_delete_rate
- kafka_streams_stream_state_metrics_fetch_latency_avg
- kafka_streams_stream_state_metrics_fetch_rate
- kafka_streams_stream_state_metrics_put_if_absent_latency_avg
- kafka_streams_stream_state_metrics_put_if_absent_latency_max
- kafka_streams_stream_state_metrics_put_if_absent_rate_rate
- kafka_streams_stream_state_metrics_put_latency_avg
- kafka_streams_stream_state_metrics_put_latency_max
- kafka_streams_stream_state_metrics_put_rate
- kafka_streams_stream_state_metrics_restore_latency_avg
- kafka_streams_stream_state_metrics_restore_latency_max
- kafka_streams_stream_state_metrics_restore_rate
- kafka_streams_stream_thread_metrics_commit_latency_avg
- kafka_streams_stream_thread_metrics_commit_latency_max
- kafka_streams_stream_thread_metrics_poll_latency_avg
- kafka_streams_stream_thread_metrics_poll_latency_max
- kafka_streams_stream_thread_metrics_process_latency_avg
- kafka_streams_stream_thread_metrics_process_latency_max
- kafka_streams_stream_thread_metrics_punctuate_latency_avg
- kafka_streams_stream_thread_metrics_punctuate_latency_max
- kafka_topic_partition_current_offset
- kafka_topic_partitions
- ksql_ksql_engine_query_stats_error_queries
- ksql_ksql_engine_query_stats_liveness_indicator
- ksql_ksql_engine_query_stats_messages_consumed_per_sec
- ksql_ksql_engine_query_stats_messages_produced_per_sec
- ksql_ksql_engine_query_stats_not_running_queries
- ksql_ksql_engine_query_stats_num_active_queries
- ksql_ksql_engine_query_stats_num_idle_queries
- ksql_ksql_engine_query_stats_num_persistent_queries
- ksql_ksql_engine_query_stats_pending_shutdown_queries
- ksql_ksql_engine_query_stats_rebalancing_queries
- ksql_ksql_engine_query_stats_running_queries
- ksql_ksql_metrics_ksql_queries_query_status
- process_cpu_seconds_total
- zookeeper_avgrequestlatency
- zookeeper_inmemorydatatree_nodecount
- zookeeper_inmemorydatatree_watchcount
- zookeeper_maxrequestlatency
- zookeeper_minrequestlatency
- zookeeper_numaliveconnections
- zookeeper_outstandingrequests
- zookeeper_status_quorumsize
- zookeeper_ticktime
Changelog
# 0.0.4 - December 2022
- Update mixin to latest version:
- Fix missing job and instance label on all the dashboards
- Fix alert names to have a Kafka prefix
# 0.0.3 - February 2022
- Added the following alerts:
- OfflinePartitonCount
- UnderReplicatedPartitionCount
- ActiveController
- UncleanLeaderElection
- ISRExpandRate
- ISRShrinkRate
- BrokerCount
- ZookeeperSyncConnect
# 0.0.2 - October 2021
- Update mixin to latest version:
- Update all rate queries to use `$__rate_interval` so they respect the default resolution
# 0.0.1 - June 2021
- Initial release
Cost
By connecting your Kafka instance to Grafana Cloud you might incur charges. To view information on the number of active series that your Grafana Cloud account uses for metrics included in each Cloud tier, see Active series and dpm usage and Cloud tier pricing.
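One common way to limit the number of active series is to keep only selected metrics at scrape time. The fragment below is a sketch of a scrape job under the agent's scrape_configs, not a recommended setting. The regex is an arbitrary assumption, and dropping metrics that the dashboards or alerts rely on will leave those panels and rules without data.

```yaml
scrape_configs:
  - job_name: integrations/kafka
    static_configs:
      - targets: ['kafka-node1:7001', 'kafka-node2:7001', 'kafka-node3:7001']
    metric_relabel_configs:
      # Keep only series whose metric name fully matches the regex;
      # every other scraped series is dropped before it is sent.
      - source_labels: [__name__]
        regex: 'kafka_(controller|server)_.*|jvm_memory_bytes_used'
        action: keep
```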