Grafana Enterprise Metrics runbooks

This document contains runbooks specific to Grafana Enterprise Metrics (GEM), extending the Mimir runbooks. These runbooks provide troubleshooting procedures for GEM-specific alerts and components.

Alerts

GEMFederationFrontendRemoteClusterErrors

This alert fires when the federation-frontend is receiving a high error rate (>1%) from a remote cluster over a 15-minute period.

How it works:

  • The federation-frontend delegates query execution to remote clusters
  • The alert triggers when more than 1% of requests to a specific remote cluster result in server errors (5xx) over a 15-minute window
  • If partial responses are disabled (the default configuration), clients querying the federation-frontend receive errors
  • If partial responses are enabled, clients still receive responses, but the results can be incomplete
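
The alert condition described above can be sketched as a simple ratio check. This is only an illustration of the arithmetic, not the alert's actual rule expression: the function names and the idea of passing raw request counts for one 15-minute window are assumptions made for the example.

```python
# Illustrative sketch of the ">1% server errors" condition.
# The names below are hypothetical and not part of GEM.

ERROR_RATIO_THRESHOLD = 0.01  # 1%, as stated in the alert description


def remote_cluster_error_ratio(server_errors: int, total_requests: int) -> float:
    """Return the fraction of requests to one remote cluster that ended in a 5xx."""
    if total_requests == 0:
        # No traffic in the window means there is nothing to alert on.
        return 0.0
    return server_errors / total_requests


def alert_condition_met(server_errors: int, total_requests: int) -> bool:
    """True when strictly more than 1% of the window's requests were 5xx errors."""
    ratio = remote_cluster_error_ratio(server_errors, total_requests)
    return ratio > ERROR_RATIO_THRESHOLD
```

For example, 11 server errors out of 1,000 requests in the window exceeds the threshold, while exactly 10 out of 1,000 does not, because the comparison is strict.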

How to investigate:

  1. Check the federation-frontend logs for detailed error messages about the failing requests to the remote cluster
  2. Check the health of the remote cluster:
    • Look for any ongoing alerts in the remote cluster
    • Check resource utilization (CPU, memory, disk)
    • Verify that the remote cluster’s query path components are healthy
  3. Check network connectivity:
    • Verify network connectivity between clusters
    • Check for any firewall or security group changes
    • Ensure DNS resolution is working correctly
  4. Monitor the error rate and request patterns on the Mimir / Federation-frontend dashboard:
    • Look at the “Remote requests / sec by request type” panel
    • Check the error rates by remote cluster
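
The DNS and connectivity checks in step 3 can be scripted from a host or pod that shares the federation-frontend's network path. The following is a minimal sketch using only the Python standard library. The function name is hypothetical, and the host and port you pass in depend on your own remote cluster configuration.

```python
import socket


def check_remote_endpoint(host: str, port: int, timeout: float = 3.0) -> dict:
    """Resolve `host`, then attempt a TCP connection to `host:port`."""
    result = {"dns_ok": False, "tcp_ok": False, "addresses": [], "error": None}

    # Step 1: DNS resolution.
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        result["error"] = f"DNS resolution failed: {exc}"
        return result
    result["dns_ok"] = True
    result["addresses"] = sorted({info[4][0] for info in infos})

    # Step 2: TCP reachability. A timeout often points to a firewall or
    # security group dropping packets; a refusal means nothing is listening.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            result["tcp_ok"] = True
    except OSError as exc:
        result["error"] = f"TCP connection failed: {exc}"
    return result


# Example call (the hostname is a placeholder, not a real GEM endpoint):
# print(check_remote_endpoint("gem.remote-cluster.example.com", 443))
```

A result with `dns_ok` set but `tcp_ok` unset suggests that the network path or a firewall rule is the problem rather than name resolution.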

Common causes and solutions:

  1. Remote cluster is overloaded:
    • Check the remote cluster’s resource utilization
    • Consider scaling up the remote cluster’s query path components
  2. Network connectivity issues:
    • Verify network paths between clusters
    • Check for any recent network infrastructure changes
    • Ensure all required ports are open between clusters
  3. Authentication/Authorization issues:
    • Verify that the federation-frontend’s credentials for the remote cluster are valid. See Cross-cluster query federation for setting up authentication with GEM.
    • Check if any authentication tokens or certificates have expired
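
If the remote cluster endpoint is served over TLS, one way to check for an expiring certificate is to read its `notAfter` field. The sketch below uses only the Python standard library. The function names are hypothetical, and it assumes the endpoint is reachable on the given host and port and presents a certificate that the system trust store accepts.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the peer certificate's notAfter string, e.g. 'Jan  5 09:34:43 2018 GMT'.

    With the default context, an already-expired or untrusted certificate makes
    the handshake raise ssl.SSLCertVerificationError, which is itself a finding.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days between `now` (epoch seconds, default: current time) and expiry."""
    if now is None:
        now = time.time()
    expiry = ssl.cert_time_to_seconds(not_after)
    return (expiry - now) / 86400


# Example call (the hostname is a placeholder, not a real GEM endpoint):
# print(days_until_expiry(fetch_not_after("gem.remote-cluster.example.com")))
```

A negative or small positive result indicates a certificate that has expired or is about to. The script does not cover token expiry, which depends on how the tokens were issued.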