About Kafka integration pre-built alerts
The Kafka integration provides a variety of pre-built alerts that you can use right away to begin troubleshooting issues. In this step of the journey, you’ll become familiar with these pre-built alerts and learn how to use them to diagnose and resolve common Kafka problems.
Did you know?
If your Kafka cluster is functioning properly, you won’t receive any alerts. No news is good news!
Kafka alerts
Description: Kafka has offline partitions.
What this means: One or more partitions are offline, meaning they have no active leader and cannot be read from or written to. This indicates a critical issue affecting data availability.
What to do: Check broker health and logs to identify why partitions went offline. Verify that the cluster has sufficient healthy brokers and that replication is functioning properly.
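To confirm exactly which partitions are offline, a minimal sketch with the Kafka Java AdminClient (kafka-clients) might look like the following. The bootstrap address is a placeholder for a broker in your cluster, and allTopicNames() assumes a 3.x client; the check simply flags every partition that currently reports no leader.

```java
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class OfflinePartitionCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder: replace with a reachable broker in your cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Describe every topic and flag partitions that currently have no leader.
            var topicNames = admin.listTopics().names().get();
            for (TopicDescription desc : admin.describeTopics(topicNames).allTopicNames().get().values()) {
                desc.partitions().forEach(p -> {
                    if (p.leader() == null || p.leader().isEmpty()) {
                        System.out.printf("OFFLINE: %s-%d (replicas: %s)%n",
                                desc.name(), p.partition(), p.replicas());
                    }
                });
            }
        }
    }
}
```

The bundled kafka-topics.sh script offers a similar view with its --describe --unavailable-partitions options.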
Description: Kafka has under-replicated partitions.
What this means: Some partitions don’t have the configured number of in-sync replicas, meaning data durability is at risk. If the leader broker fails, data loss could occur.
What to do: Investigate broker performance and network connectivity. Check if any brokers are down or experiencing high load that prevents replication from keeping up.
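A similar point-in-time check can list the partitions whose in-sync replica set is smaller than their assigned replica set. The sketch below makes the same assumptions as the previous one (placeholder bootstrap address, 3.x kafka-clients):

```java
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class UnderReplicatedCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            var topicNames = admin.listTopics().names().get();
            for (TopicDescription desc : admin.describeTopics(topicNames).allTopicNames().get().values()) {
                desc.partitions().forEach(p -> {
                    // A partition is under-replicated when its in-sync replica set
                    // is smaller than its assigned replica set.
                    if (p.isr().size() < p.replicas().size()) {
                        System.out.printf("UNDER-REPLICATED: %s-%d isr=%d/%d%n",
                                desc.name(), p.partition(), p.isr().size(), p.replicas().size());
                    }
                });
            }
        }
    }
}
```

kafka-topics.sh provides the equivalent with --describe --under-replicated-partitions.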
Description: No active Kafka controller detected or multiple controllers detected.
What this means: The cluster either has no active controller (preventing partition leadership changes and other metadata operations) or is in a split-brain state with multiple controllers. Both are critical issues.
What to do: Check ZooKeeper connectivity and health. Examine broker logs for controller election issues. Ensure network connectivity between brokers and ZooKeeper is stable.
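To see which broker, if any, the cluster currently reports as the active controller, a minimal AdminClient sketch like the following can help; the bootstrap address is a placeholder:

```java
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.Node;

public class ControllerCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // The cluster metadata reports which broker is currently
            // considered the active controller.
            Node controller = admin.describeCluster().controller().get();
            if (controller == null || controller.isEmpty()) {
                System.out.println("No active controller reported by the cluster");
            } else {
                System.out.printf("Active controller: broker %d (%s:%d)%n",
                        controller.id(), controller.host(), controller.port());
            }
        }
    }
}
```

Note that this only reflects what the cluster metadata reports; diagnosing a true split-brain usually also involves inspecting each broker's ActiveControllerCount metric and its logs.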
Description: Unclean leader elections are occurring.
What this means: Kafka is electing partition leaders from brokers that were not in-sync, which can result in message loss. This indicates the cluster is prioritizing availability over data consistency.
What to do: Investigate why in-sync replicas are unavailable. Review broker health and replication lag. Consider adjusting replication factors or min.insync.replicas settings.
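Because remediation often comes down to a topic's durability settings, it can be useful to read them directly. The sketch below uses the AdminClient's describeConfigs with a placeholder topic name and prints the two settings mentioned above:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class TopicDurabilityConfigCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders"); // placeholder topic
        try (Admin admin = Admin.create(props)) {
            Config config = admin.describeConfigs(List.of(topic)).all().get().get(topic);
            // Settings that control the availability-vs-durability trade-off.
            System.out.println("unclean.leader.election.enable = "
                    + config.get("unclean.leader.election.enable").value());
            System.out.println("min.insync.replicas = "
                    + config.get("min.insync.replicas").value());
        }
    }
}
```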
Description: In-Sync Replica (ISR) expansion rate is high.
What this means: Replicas are frequently joining the ISR set, which may indicate intermittent broker or network issues causing replicas to fall behind and then catch up.
What to do: Monitor broker performance and network stability. Check for broker restarts or network partitions. Review replication lag metrics to identify problematic brokers; the JMX sketch after the next alert shows how to read the underlying rates.
Description: In-Sync Replica (ISR) shrink rate is high.
What this means: Replicas are frequently being removed from the ISR set because they’re falling behind the leader, indicating replication performance issues.
What to do: Investigate broker resource utilization (CPU, disk I/O, network). Check for slow disks or network issues. Review replication lag and broker logs for errors.
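Both the ISR expansion and shrink alerts above are driven by per-broker rate metrics that Kafka exposes over JMX, kafka.server:type=ReplicaManager,name=IsrExpandsPerSec and name=IsrShrinksPerSec. A minimal sketch that reads their one-minute rates from a single broker might look like the following; the JMX host and port are placeholders and assume the broker was started with remote JMX enabled (for example via the JMX_PORT environment variable):

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class IsrChurnCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder JMX endpoint for the broker under investigation.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker1.example.com:9999/jmxrmi");

        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            for (String name : new String[] {"IsrExpandsPerSec", "IsrShrinksPerSec"}) {
                ObjectName mbean = new ObjectName(
                        "kafka.server:type=ReplicaManager,name=" + name);
                // Yammer meter MBeans expose OneMinuteRate alongside the raw count.
                Object rate = mbsc.getAttribute(mbean, "OneMinuteRate");
                System.out.printf("%s OneMinuteRate = %s%n", name, rate);
            }
        }
    }
}
```

A healthy, stable cluster keeps both rates near zero; sustained non-zero values point at the churn these alerts describe.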
Description: Kafka broker count has changed.
What this means: The number of active brokers in the cluster has changed, most often because one or more brokers went offline. This may indicate broker failures or planned maintenance.
What to do: Verify if the broker loss was intentional. If unplanned, investigate why brokers went offline and restore them to maintain cluster capacity and fault tolerance.
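To confirm the current broker count against what you expect, a quick AdminClient check such as the following can help; the expected count and bootstrap address are placeholders:

```java
import java.util.Collection;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.Node;

public class BrokerCountCheck {
    static final int EXPECTED_BROKERS = 3; // placeholder: your cluster's intended size

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            Collection<Node> brokers = admin.describeCluster().nodes().get();
            System.out.printf("Active brokers: %d of %d expected%n",
                    brokers.size(), EXPECTED_BROKERS);
            // Listing the IDs makes it obvious which broker dropped out.
            brokers.forEach(b -> System.out.printf("  broker %d at %s:%d%n",
                    b.id(), b.host(), b.port()));
        }
    }
}
```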
Description: ZooKeeper sync connection issues detected.
What this means: Kafka brokers are experiencing problems maintaining connections to ZooKeeper, which can affect cluster metadata operations and coordination.
What to do: Check ZooKeeper cluster health and network connectivity between Kafka brokers and ZooKeeper nodes. Review ZooKeeper logs for errors.
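A basic reachability check against each ZooKeeper ensemble member can rule out simple network problems. The sketch below opens a plain TCP socket, sends ZooKeeper's srvr four-letter command, and prints the response, which includes whether the node is a leader, follower, or standalone. The hostnames are placeholders, and srvr must be allowed by 4lw.commands.whitelist (it is in the default whitelist on recent ZooKeeper releases).

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class ZooKeeperHealthCheck {
    public static void main(String[] args) {
        // Placeholder ensemble members; use your own host:port pairs.
        String[] ensemble = {"zk1.example.com:2181", "zk2.example.com:2181", "zk3.example.com:2181"};

        for (String member : ensemble) {
            String host = member.split(":")[0];
            int port = Integer.parseInt(member.split(":")[1]);
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), 3000);
                // "srvr" asks the ZooKeeper node for its server stats, including its mode.
                OutputStream out = socket.getOutputStream();
                out.write("srvr".getBytes(StandardCharsets.US_ASCII));
                out.flush();
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII));
                System.out.println("--- " + member);
                in.lines().forEach(System.out::println);
            } catch (Exception e) {
                System.out.println("--- " + member + " UNREACHABLE: " + e.getMessage());
            }
        }
    }
}
```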
Description: Consumer lag is too high.
What this means: Consumer groups are falling significantly behind in processing messages, indicating consumers can’t keep up with the message production rate.
What to do: Scale up consumer instances, optimize consumer processing logic, or investigate performance bottlenecks in consumer applications.
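Consumer lag for a partition is the difference between its latest (log-end) offset and the offset the group has committed. A minimal AdminClient sketch that computes this per partition might look like the following; the bootstrap address and group ID are placeholders:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        String groupId = "order-processor"; // placeholder consumer group

        try (Admin admin = Admin.create(props)) {
            // Offsets the group has committed so far, per partition.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets(groupId)
                         .partitionsToOffsetAndMetadata().get();

            // Latest (log-end) offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> request = new HashMap<>();
            committed.keySet().forEach(tp -> request.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(request).all().get();

            committed.forEach((tp, offsets) -> {
                long lag = latest.get(tp).offset() - offsets.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```

The same numbers are available from the bundled kafka-consumer-groups.sh tool with --describe --group.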
Description: Consumer lag keeps increasing.
What this means: Consumer lag is continuously growing over time, indicating a persistent problem where consumers cannot keep pace with producers.
What to do: Urgently investigate consumer performance issues. Consider increasing consumer parallelism, optimizing consumer code, or adjusting partition assignments.
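Increasing consumer parallelism usually just means running more instances of the same application with the same group.id; Kafka rebalances the topic's partitions across the instances automatically, so parallelism is capped by the partition count. A minimal worker sketch (topic, group ID, and bootstrap address are placeholders) might look like this:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processor");          // same group on every instance
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Keep per-record work fast; slow processing here is the most
                    // common cause of growing lag.
                    process(record);
                }
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("partition=%d offset=%d%n", record.partition(), record.offset());
    }
}
```

If lag keeps growing even with one consumer instance per partition, the per-record processing itself is the bottleneck, or the topic needs more partitions.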
Description: JVM memory filling up for Kafka broker.
What this means: Kafka broker JVM heap memory usage is high and trending upward, which may indicate a memory leak or insufficient heap size configuration.
What to do: Monitor for garbage collection issues. Consider increasing JVM heap size if appropriate, or investigate potential memory leaks in custom components or configurations.
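If remote JMX is enabled on the broker, you can read its heap usage directly through the standard platform MemoryMXBean rather than waiting for the next alert evaluation. A minimal sketch with a placeholder JMX endpoint:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerHeapCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder JMX endpoint for the broker under investigation.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker1.example.com:9999/jmxrmi");

        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
                    mbsc, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);

            // Current heap usage versus the configured maximum (-Xmx).
            MemoryUsage heap = memory.getHeapMemoryUsage();
            System.out.printf("Heap used: %d MiB of %d MiB max (%.1f%%)%n",
                    heap.getUsed() >> 20, heap.getMax() >> 20,
                    100.0 * heap.getUsed() / heap.getMax());
        }
    }
}
```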
Description: JVM threads are deadlocked in Kafka broker.
What this means: The Kafka broker JVM has detected threads that are stuck in a deadlock, which can cause the broker to become unresponsive.
What to do: Collect thread dumps and analyze them for deadlocks. Recovery typically requires restarting the affected broker, and a recurring deadlock may indicate a bug that needs investigation.
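The same kind of remote JMX connection can ask the broker's JVM directly whether any threads are deadlocked, using the platform ThreadMXBean. A minimal sketch, again with a placeholder JMX endpoint:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerDeadlockCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder JMX endpoint for the broker under investigation.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker1.example.com:9999/jmxrmi");

        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    mbsc, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);

            long[] deadlocked = threads.findDeadlockedThreads();
            if (deadlocked == null) {
                System.out.println("No deadlocked threads detected");
                return;
            }
            // Print the stack of each thread involved in the deadlock.
            for (ThreadInfo info : threads.getThreadInfo(deadlocked, Integer.MAX_VALUE)) {
                System.out.println(info);
            }
        }
    }
}
```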
In the next milestone, you explore the Kafka metrics displayed in your dashboards.
