Kafka integration for Grafana Cloud

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

This integration includes 8 useful alerts and 7 pre-built dashboards to help monitor and visualize Kafka metrics.

Before you begin

For the integration to work, you must configure a JMX exporter on each instance in your Kafka cluster, including all brokers, Zookeeper nodes, ksqlDB servers, schema registries, and Kafka Connect nodes.

Each of these instances has its own JMX Exporter config file. The following files should be used for each respective Kafka component. For more details on how to configure your Kafka JVM with the JMX exporter, refer to the JMX Exporter documentation.

We strongly recommend that you configure a separate user for Grafana Alloy and grant it only the privileges strictly necessary for monitoring your node.

Install Kafka integration for Grafana Cloud

  1. In your Grafana Cloud stack, click Connections in the left-hand menu.
  2. Find Kafka and click its tile to open the integration.
  3. Review the prerequisites in the Configuration Details tab and set up Grafana Alloy to send Kafka metrics to your Grafana Cloud instance.
  4. Click Install to add this integration’s pre-built dashboards and alerts to your Grafana Cloud instance, then start monitoring your Kafka setup.

Configuration snippets for Grafana Alloy

Advanced mode

To instruct Grafana Alloy to scrape your Kafka nodes, go through the following instructions.

The snippets provide examples to guide you through the configuration process.

First, manually copy and append the following snippets into your Grafana Alloy configuration file.

Then follow the instructions below to modify the necessary variables.
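
The snippets in this section forward samples to prometheus.remote_write.metrics_service.receiver but do not define that component. If your configuration file does not already contain it, a minimal sketch might look like the following. The URL, username, and password values are placeholders and an assumption about your stack; copy the real values from your Grafana Cloud stack details.

```alloy
// Sketch only: all three values below are placeholders, not real endpoints or credentials.
prometheus.remote_write "metrics_service" {
    endpoint {
        url = "<your-prometheus-remote-write-url>"

        basic_auth {
            username = "<your-metrics-instance-id>"
            password = "<your-access-token>"
        }
    }
}
```

The component label, metrics_service, must match the name referenced in the forward_to property of each prometheus.scrape component.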

Advanced metrics snippets

alloy
discovery.relabel "metrics_integrations_kafka" {
    targets = [{
        __address__ = "kafka-node:7001",
    }]

    rule {
        target_label = "instance"
        replacement  = "<your-instance-name>"
    }

    rule {
        target_label = "kafka_cluster"
        replacement  = "<your-cluster-name>"
    }
}
discovery.relabel "metrics_integrations_kafka_zookeeper" {
    targets = [{
        __address__ = "zookeeper-node:7001",
    }]

    rule {
        target_label = "instance"
        replacement  = "<your-instance-name>"
    }

    rule {
        target_label = "kafka_cluster"
        replacement  = "<your-cluster-name>"
    }

}
discovery.relabel "metrics_integrations_kafka_connect" {
    targets = [{
        __address__ = "kafka-connect-node:7001",
    }]

    rule {
        target_label = "instance"
        replacement  = "<your-instance-name>"
    }

    rule {
        target_label = "kafka_cluster"
        replacement  = "<your-cluster-name>"
    }

}
discovery.relabel "metrics_integrations_kafka_schemaregistry" {
    targets = [{
        __address__ = "kafka-schemaregistry-node:7001",
    }]

    rule {
        target_label = "instance"
        replacement  = "<your-instance-name>"
    }

    rule {
        target_label = "kafka_cluster"
        replacement  = "<your-cluster-name>"
    }

}
discovery.relabel "metrics_integrations_kafka_ksqldb" {
    targets = [{
        __address__ = "kafka-ksqldb-node:7001",
    }]

    rule {
        target_label = "instance"
        replacement  = "<your-instance-name>"
    }

    rule {
        target_label = "kafka_cluster"
        replacement  = "<your-cluster-name>"
    }
}
prometheus.scrape "metrics_integrations_kafka" {
    // concat() merges the per-node target lists into one flat list of targets.
    targets    = concat(discovery.relabel.metrics_integrations_kafka.output, discovery.relabel.metrics_integrations_kafka_zookeeper.output, discovery.relabel.metrics_integrations_kafka_connect.output, discovery.relabel.metrics_integrations_kafka_schemaregistry.output, discovery.relabel.metrics_integrations_kafka_ksqldb.output)
    forward_to = [prometheus.remote_write.metrics_service.receiver]
    job_name   = "integrations/kafka"
}

After enabling the JMX exporter in each node, instruct Grafana Alloy to scrape them.

One discovery.relabel must be added for each node composing your cluster (Kafka Server, Schema Registry, ksqlDB, Zookeeper, Kafka Connect) to avoid instance label conflicts.

Make sure the instance label matches the one used in the exporter snippet for the Kafka Server nodes.

Configure the following properties within each discovery.relabel component:

  • __address__: The address to your Kafka node.
  • <your-instance-name>: The instance label for all metrics scraped from this Kafka node.
  • <your-cluster-name>: The kafka_cluster label to group your Kafka nodes within a cluster. Set the same value for all nodes within your cluster.

Finally, reference each discovery.relabel component within the targets property of the prometheus.scrape component.
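
As an illustration of the one-component-per-node rule, the following sketch adds a second broker. The component label metrics_integrations_kafka_broker_2, the host name kafka-node-2, and the instance value are hypothetical; only the structure is taken from the snippet above.

```alloy
// Hypothetical second broker: the address and instance value are placeholders.
discovery.relabel "metrics_integrations_kafka_broker_2" {
    targets = [{
        __address__ = "kafka-node-2:7001",
    }]

    rule {
        target_label = "instance"
        replacement  = "kafka-node-2"
    }

    // Same cluster name as every other node in this cluster.
    rule {
        target_label = "kafka_cluster"
        replacement  = "<your-cluster-name>"
    }
}
```

To have the new node scraped, add discovery.relabel.metrics_integrations_kafka_broker_2.output alongside the other outputs in the targets property of the prometheus.scrape component.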

Advanced integrations snippets

alloy
prometheus.exporter.kafka "integrations_kafka_exporter" {
    kafka_uris = ["kafka-node1:9091"]
}
discovery.relabel "integrations_kafka_exporter" {
    targets = prometheus.exporter.kafka.integrations_kafka_exporter.targets

    rule {
        target_label = "job"
        replacement  = "integrations/kafka"
    }

    rule {
        target_label = "kafka_cluster"
        replacement  = "<your-cluster-name>"
    }

    rule {
        target_label = "instance"
        replacement  = "<your-instance-name>"
    }
}
prometheus.scrape "integrations_kafka_exporter" {
    targets    = discovery.relabel.integrations_kafka_exporter.output
    forward_to = [prometheus.remote_write.metrics_service.receiver]
    job_name   = "integrations/kafka_exporter"
}

To monitor consumer lag, add one pair of prometheus.exporter.kafka and discovery.relabel components to your Grafana Alloy configuration file for each Kafka Server you monitor. Using a separate pair per server avoids instance label conflicts.

Configure the following property within the prometheus.exporter.kafka component:

  • kafka_uris: The URI to connect to your Kafka Server node.

Refer to prometheus.exporter.kafka in Grafana Alloy reference documentation for a complete description of the configuration options.

Configure the following properties within the discovery.relabel component:

  • <your-instance-name>: The instance label for all metrics from this Kafka Server node.
  • <your-cluster-name>: The kafka_cluster label to group your Kafka nodes within a cluster. Set the same value for all nodes within your cluster.

Finally, reference each prometheus.exporter.kafka component within the targets property of its paired discovery.relabel component, and reference each discovery.relabel component within the targets property of the prometheus.scrape component.
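
The pairing described above can be sketched for a second Kafka Server as follows. The component label integrations_kafka_exporter_2, the URI kafka-node2:9091, and the instance value are hypothetical. The sketch also assumes the concat function from the Alloy standard library to join the two target lists; newer Alloy releases may expose it as array.concat.

```alloy
// Hypothetical second Kafka Server: the URI and instance value are placeholders.
prometheus.exporter.kafka "integrations_kafka_exporter_2" {
    kafka_uris = ["kafka-node2:9091"]
}
discovery.relabel "integrations_kafka_exporter_2" {
    targets = prometheus.exporter.kafka.integrations_kafka_exporter_2.targets

    rule {
        target_label = "job"
        replacement  = "integrations/kafka"
    }

    rule {
        target_label = "kafka_cluster"
        replacement  = "<your-cluster-name>"
    }

    rule {
        target_label = "instance"
        replacement  = "kafka-node2"
    }
}
prometheus.scrape "integrations_kafka_exporter" {
    // concat() joins both relabeled target lists into a single list.
    targets    = concat(discovery.relabel.integrations_kafka_exporter.output, discovery.relabel.integrations_kafka_exporter_2.output)
    forward_to = [prometheus.remote_write.metrics_service.receiver]
    job_name   = "integrations/kafka_exporter"
}
```

The prometheus.scrape block here is meant to replace the one in the snippet above rather than sit next to it, because two components of the same type cannot share a label.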

Grafana Agent static configuration (deprecated)

The following section shows configuration for running Grafana Agent in static mode, which is deprecated. You should use Grafana Alloy for all new deployments.

Dashboards

The Kafka integration installs the following dashboards in your Grafana Cloud instance to help monitor your system.

  • Kafka Connect Overview
  • Kafka Overview
  • Kafka Topics
  • Kafka lag overview
  • Schema Registry Overview
  • Zookeeper overview
  • ksqldb Overview

Kafka Overview dashboard

Kafka Connect Overview dashboard

Kafka KSQL Overview dashboard

Alerts

The Kafka integration includes the following useful alerts:

| Alert | Description |
| --- | --- |
| KafkaOfflinePartitonCount | Critical: Kafka has offline partitions. |
| KafkaUnderReplicatedPartitionCount | Critical: Kafka has under-replicated partitions. |
| KafkaActiveController | Critical: Kafka has no active controller. |
| KafkaUncleanLeaderElection | Critical: Kafka has unclean leader elections. |
| KafkaISRExpandRate | Warning: Kafka ISR is expanding. |
| KafkaISRShrinkRate | Warning: Kafka ISR is shrinking. |
| KafkaBrokerCount | Critical: Kafka has no brokers online. |
| KafkaZookeeperSyncConnect | Warning: Kafka Zookeeper sync disconnected. |

Metrics

The most important metrics provided by the Kafka integration, which are used on the pre-built dashboards and Prometheus alerts, are as follows:

  • jvm_gc_collection_seconds_sum
  • jvm_memory_bytes_max
  • jvm_memory_bytes_used
  • kafka_cluster_partition_underminisr
  • kafka_cluster_partition_underreplicated
  • kafka_connect_app_info
  • kafka_connect_connect_metrics_connection_count
  • kafka_connect_connect_metrics_failed_authentication_total
  • kafka_connect_connect_metrics_incoming_byte_rate
  • kafka_connect_connect_metrics_io_ratio
  • kafka_connect_connect_metrics_network_io_rate
  • kafka_connect_connect_metrics_outgoing_byte_rate
  • kafka_connect_connect_metrics_request_rate
  • kafka_connect_connect_metrics_response_rate
  • kafka_connect_connect_metrics_successful_authentication_rate
  • kafka_connect_connect_worker_metrics_connector_count
  • kafka_connect_connect_worker_metrics_connector_destroyed_task_count
  • kafka_connect_connect_worker_metrics_connector_failed_task_count
  • kafka_connect_connect_worker_metrics_connector_paused_task_count
  • kafka_connect_connect_worker_metrics_connector_running_task_count
  • kafka_connect_connect_worker_metrics_connector_startup_failure_total
  • kafka_connect_connect_worker_metrics_connector_startup_success_total
  • kafka_connect_connect_worker_metrics_connector_total_task_count
  • kafka_connect_connect_worker_metrics_connector_unassigned_task_count
  • kafka_connect_connect_worker_metrics_task_count
  • kafka_connect_connect_worker_metrics_task_startup_failure_total
  • kafka_connect_connect_worker_metrics_task_startup_success_total
  • kafka_connect_connect_worker_rebalance_metrics_rebalance_avg_time_ms
  • kafka_connect_connect_worker_rebalance_metrics_time_since_last_rebalance_ms
  • kafka_connect_connector_info
  • kafka_connect_connector_metrics
  • kafka_connect_connector_task_metrics_batch_size_avg
  • kafka_connect_connector_task_metrics_batch_size_max
  • kafka_connect_connector_task_metrics_offset_commit_avg_time_ms
  • kafka_connect_connector_task_metrics_offset_commit_success_percentage
  • kafka_connect_connector_task_metrics_pause_ratio
  • kafka_connect_connector_task_metrics_running_ratio
  • kafka_connect_sink_task_metrics_partition_count
  • kafka_connect_sink_task_metrics_put_batch_avg_time_ms
  • kafka_connect_sink_task_metrics_put_batch_max_time_ms
  • kafka_connect_source_task_metrics_poll_batch_avg_time_ms
  • kafka_connect_source_task_metrics_poll_batch_max_time_ms
  • kafka_connect_source_task_metrics_source_record_active_count_avg
  • kafka_connect_source_task_metrics_source_record_active_count_max
  • kafka_connect_source_task_metrics_source_record_poll_rate
  • kafka_connect_source_task_metrics_source_record_write_rate
  • kafka_connect_task_error_metrics_deadletterqueue_produce_requests
  • kafka_connect_task_error_metrics_total_errors_logged
  • kafka_connect_task_error_metrics_total_record_errors
  • kafka_connect_task_error_metrics_total_record_failures
  • kafka_connect_task_error_metrics_total_records_skipped
  • kafka_connect_task_error_metrics_total_retries
  • kafka_consumer_lag_millis
  • kafka_consumergroup_current_offset
  • kafka_consumergroup_uncommitted_offsets
  • kafka_controller_controllerstats_uncleanleaderelectionspersec
  • kafka_controller_kafkacontroller_activecontrollercount
  • kafka_controller_kafkacontroller_offlinepartitionscount
  • kafka_controller_kafkacontroller_preferredreplicaimbalancecount
  • kafka_coordinator_group_groupmetadatamanager_numgroups
  • kafka_coordinator_group_groupmetadatamanager_numgroupscompletingrebalance
  • kafka_coordinator_group_groupmetadatamanager_numgroupsdead
  • kafka_coordinator_group_groupmetadatamanager_numgroupsempty
  • kafka_coordinator_group_groupmetadatamanager_numgroupspreparingrebalance
  • kafka_coordinator_group_groupmetadatamanager_numgroupsstable
  • kafka_log_log_logendoffset
  • kafka_log_log_logstartoffset
  • kafka_log_log_size
  • kafka_network_acceptor_acceptorblockedpercent
  • kafka_network_requestchannel_requestqueuesize
  • kafka_network_requestchannel_responsequeuesize
  • kafka_network_requestmetrics_localtimems
  • kafka_network_requestmetrics_remotetimems
  • kafka_network_requestmetrics_requestqueuetimems
  • kafka_network_requestmetrics_requestspersec
  • kafka_network_requestmetrics_responsequeuetimems
  • kafka_network_requestmetrics_responsesendtimems
  • kafka_network_socketserver_networkprocessoravgidlepercent
  • kafka_schema_registry_jersey_metrics_request_latency_99
  • kafka_schema_registry_jersey_metrics_request_rate
  • kafka_schema_registry_jetty_metrics_connections_active
  • kafka_schema_registry_registered_count
  • kafka_schema_registry_schemas_created
  • kafka_server_brokertopicmetrics_bytesinpersec
  • kafka_server_brokertopicmetrics_bytesoutpersec
  • kafka_server_brokertopicmetrics_fetchmessageconversionspersec
  • kafka_server_brokertopicmetrics_messagesinpersec
  • kafka_server_brokertopicmetrics_producemessageconversionspersec
  • kafka_server_brokertopicmetrics_totalfetchrequestspersec
  • kafka_server_brokertopicmetrics_totalproducerequestspersec
  • kafka_server_kafkarequesthandlerpool_requesthandleravgidlepercent_total
  • kafka_server_kafkaserver_brokerstate
  • kafka_server_replicamanager_isrexpandspersec
  • kafka_server_replicamanager_isrshrinkspersec
  • kafka_server_replicamanager_leadercount
  • kafka_server_replicamanager_partitioncount
  • kafka_server_replicamanager_underreplicatedpartitions
  • kafka_server_sessionexpirelistener_zookeeperauthfailurespersec
  • kafka_server_sessionexpirelistener_zookeeperdisconnectspersec
  • kafka_server_sessionexpirelistener_zookeeperexpirespersec
  • kafka_server_sessionexpirelistener_zookeepersyncconnectspersec
  • kafka_server_socketservermetrics_connection_close_rate
  • kafka_server_socketservermetrics_connection_count
  • kafka_server_socketservermetrics_connection_creation_rate
  • kafka_server_socketservermetrics_connections
  • kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms
  • kafka_streams_stream_state_metrics_delete_latency_avg
  • kafka_streams_stream_state_metrics_delete_latency_max
  • kafka_streams_stream_state_metrics_delete_rate
  • kafka_streams_stream_state_metrics_fetch_latency_avg
  • kafka_streams_stream_state_metrics_fetch_rate
  • kafka_streams_stream_state_metrics_put_if_absent_latency_avg
  • kafka_streams_stream_state_metrics_put_if_absent_latency_max
  • kafka_streams_stream_state_metrics_put_if_absent_rate_rate
  • kafka_streams_stream_state_metrics_put_latency_avg
  • kafka_streams_stream_state_metrics_put_latency_max
  • kafka_streams_stream_state_metrics_put_rate
  • kafka_streams_stream_state_metrics_restore_latency_avg
  • kafka_streams_stream_state_metrics_restore_latency_max
  • kafka_streams_stream_state_metrics_restore_rate
  • kafka_streams_stream_thread_metrics_commit_latency_avg
  • kafka_streams_stream_thread_metrics_commit_latency_max
  • kafka_streams_stream_thread_metrics_poll_latency_avg
  • kafka_streams_stream_thread_metrics_poll_latency_max
  • kafka_streams_stream_thread_metrics_process_latency_avg
  • kafka_streams_stream_thread_metrics_process_latency_max
  • kafka_streams_stream_thread_metrics_punctuate_latency_avg
  • kafka_streams_stream_thread_metrics_punctuate_latency_max
  • kafka_topic_partition_current_offset
  • ksql_ksql_engine_query_stats_error_queries
  • ksql_ksql_engine_query_stats_liveness_indicator
  • ksql_ksql_engine_query_stats_messages_consumed_per_sec
  • ksql_ksql_engine_query_stats_messages_produced_per_sec
  • ksql_ksql_engine_query_stats_not_running_queries
  • ksql_ksql_engine_query_stats_num_active_queries
  • ksql_ksql_engine_query_stats_num_idle_queries
  • ksql_ksql_engine_query_stats_num_persistent_queries
  • ksql_ksql_engine_query_stats_pending_shutdown_queries
  • ksql_ksql_engine_query_stats_rebalancing_queries
  • ksql_ksql_engine_query_stats_running_queries
  • ksql_ksql_metrics_ksql_queries_query_status
  • process_cpu_seconds_total
  • up
  • zookeeper_avgrequestlatency
  • zookeeper_inmemorydatatree_nodecount
  • zookeeper_inmemorydatatree_watchcount
  • zookeeper_maxrequestlatency
  • zookeeper_minrequestlatency
  • zookeeper_numaliveconnections
  • zookeeper_outstandingrequests
  • zookeeper_quorumsize
  • zookeeper_status_quorumsize
  • zookeeper_ticktime
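
The changelog below mentions a Filter Metrics option that drops any metric not used by this integration. A rough sketch of that idea in Grafana Alloy is a prometheus.relabel component with a keep rule on the metric name. The component label and the regex are assumptions: the regex keeps whole metric families by prefix, so it is broader than the exact list above and is not the integration's official filter.

```alloy
// Sketch only: keeps metric families by prefix; tighten the regex to match the list above if needed.
prometheus.relabel "metrics_integrations_kafka_filter" {
    forward_to = [prometheus.remote_write.metrics_service.receiver]

    rule {
        source_labels = ["__name__"]
        regex         = "up|process_cpu_seconds_total|jvm_.+|kafka_.+|ksql_.+|zookeeper_.+"
        action        = "keep"
    }
}
```

To apply the filter, point the forward_to property of the prometheus.scrape component at prometheus.relabel.metrics_integrations_kafka_filter.receiver instead of the remote write receiver. Anything custom built on metrics outside the regex stops receiving data.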

Changelog

md
# 1.0.1 - January 2024

* Update mixin to latest version:
  - Update all Angular-based panels to React panels

# 1.0.0 - September 2023

* Update mixin to latest version:
  - Added new kafka_cluster label to differentiate from kubernetes reserved cluster label
  - Set all job names to integrations/kafka
  - Added links between all dashboards
  - Added telemetry status panels
  - Improved alerts  
* Enable Kubernetes support

# 0.0.6 - September 2023

* New Filter Metrics option for configuring the Grafana Agent, which saves on metrics cost by dropping any metric not used by this integration. Beware that anything custom built using metrics that are not on the snippet will stop working.
* New hostname relabel option, which applies the instance name you write on the text box to the Grafana Agent configuration snippets, making it easier and less error prone to configure this mandatory label.

# 0.0.5 - May 2023

* Update mixin to latest version:
  - Kafka overview: Show only 0.99 percentile by default
  - Kafka lag: Change table panel to bar chart for partitions per topic panel
  - Kafka lag: Stretch kafka lag dashboard to full screen width
  - Kafka lag panels: Convert old graph to timeseries (message per sec/per minute)
  - Kafka lag: Change delta() to increase() for per minute metrics
  - Add multichoice and 'All' options supportable in 'job'
  - Zookeeper dashboard: Use sentence case
  - Zookeeper dashboard: Get templated variables by non quorum metric. Otherwise, standalone zookeeper couldn't be discovered
  - Zookeeper dashboard: Add support in queries to jmx_config metrics notations used in Strimzi operator
  - Zookeeper dashboard: Convert graphs to timeseries panel
  - Zookeeper dashboard: Temp fix for latency graphs ignoring (minrequestlatency, ticktime)

# 0.0.4 - December 2022

* Update mixin to latest version:
  - Fix missing job and instance label on all the dashboards
  - Fix alert names to have a Kafka prefix

# 0.0.3 - February 2022

* Added the following alerts:
  - OfflinePartitonCount
  - UnderReplicatedPartitionCount
  - ActiveController
  - UncleanLeaderElection
  - ISRExpandRate
  - ISRShrinkRate
  - BrokerCount
  - ZookeeperSyncConnect

# 0.0.2 - October 2021

* Update mixin to latest version:
  - Update all rate queries to use `$__rate_interval` so they respect the default resolution

# 0.0.1 - June 2021

* Initial release

Cost

By connecting your Kafka instance to Grafana Cloud, you might incur charges. To view information on the number of active series that your Grafana Cloud account uses for metrics included in each Cloud tier, see Active series and dpm usage and Cloud tier pricing.