Important: This documentation is about an older version. It's relevant only to the release noted; many of the features and functions have been updated or replaced. Please view the current version.

Grafana Enterprise Metrics runbooks

This document contains runbooks specific to Grafana Enterprise Metrics (GEM), extending the Mimir runbooks. These runbooks provide troubleshooting procedures for GEM-specific alerts and components.

Alerts

GEMFederationFrontendRemoteClusterErrors

This alert fires when more than 1% of the federation-frontend's requests to a remote cluster fail with server errors over a 15-minute period.

How it works:

  • The federation-frontend delegates queries to remote clusters, which execute them
  • The alert triggers when more than 1% of requests to a specific remote cluster result in server errors (5xx) over a 15-minute window
  • If partial responses are disabled (default configuration), clients querying the federation-frontend receive errors
  • If partial responses are enabled, responses are incomplete but still returned to clients
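
The alert condition described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the actual alerting rule shipped with GEM: the `RemoteClusterWindow` type, the function names, and the sample counts are all hypothetical, and the only value taken from this runbook is the 1% threshold over a 15-minute window.

```python
from dataclasses import dataclass
from typing import List

ERROR_RATIO_THRESHOLD = 0.01  # 1%, taken from the alert description


@dataclass
class RemoteClusterWindow:
    """Request counts for one remote cluster over a 15-minute window (hypothetical type)."""
    cluster: str
    total_requests: int
    server_errors: int  # responses with a 5xx status code


def error_ratio(window: RemoteClusterWindow) -> float:
    """Fraction of requests that ended in a 5xx; 0.0 when there was no traffic."""
    if window.total_requests == 0:
        return 0.0
    return window.server_errors / window.total_requests


def firing_clusters(windows: List[RemoteClusterWindow]) -> List[str]:
    """Names of clusters whose error ratio is strictly above the threshold."""
    return [w.cluster for w in windows if error_ratio(w) > ERROR_RATIO_THRESHOLD]


# Made-up sample data: cluster-a is at 0.5%, cluster-b is at 2.5%.
windows = [
    RemoteClusterWindow("cluster-a", total_requests=10_000, server_errors=50),
    RemoteClusterWindow("cluster-b", total_requests=10_000, server_errors=250),
]
print(firing_clusters(windows))  # ['cluster-b']
```

Because the ratio is evaluated per remote cluster, a single unhealthy cluster can fire the alert even when the error rate aggregated across all clusters stays below 1%.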

How to investigate:

  1. Check the federation-frontend logs for detailed error messages about the failing requests to the remote cluster
  2. Check the health of the remote cluster:
    • Look for any ongoing alerts in the remote cluster
    • Check resource utilization (CPU, memory, disk)
    • Verify that the remote cluster’s query path components are healthy
  3. Check network connectivity:
    • Verify network connectivity between clusters
    • Check for any firewall or security group changes
    • Ensure DNS resolution is working correctly
  4. Monitor the error rate and request patterns on the Mimir / Federation-frontend dashboard:
    • Look at the “Remote requests / sec by request type” panel
    • Check the error rates by remote cluster
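
For the network connectivity step, a quick probe from a host in the same network as the federation-frontend can help separate DNS failures from blocked ports. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical and it only tests name resolution and a TCP handshake, so a successful result does not prove that the remote cluster answers queries correctly.

```python
import socket
from typing import List


def resolve(host: str) -> List[str]:
    """Return the distinct IP addresses that `host` resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def report(host: str, port: int) -> str:
    """Summarize DNS and TCP reachability for one endpoint as a single line."""
    addresses = resolve(host)
    dns = ", ".join(addresses) if addresses else "resolution failed"
    tcp = "reachable" if can_connect(host, port) else "unreachable"
    return f"{host}:{port} dns=[{dns}] tcp={tcp}"
```

Calling `report()` with the host and port of the remote cluster endpoint that the federation-frontend is configured to query prints one line per endpoint. If DNS resolves but the TCP check fails, firewall or security group changes are the more likely cause.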

Common causes and solutions:

  1. Remote cluster is overloaded:
    • Check the remote cluster’s resource utilization
    • Consider scaling up the remote cluster’s query path components
  2. Network connectivity issues:
    • Verify network paths between clusters
    • Check for any recent network infrastructure changes
    • Ensure all required ports are open between clusters
  3. Authentication/Authorization issues:
    • Verify that the federation-frontend’s credentials for the remote cluster are valid. See Cross-cluster query federation for setting up authentication with GEM.
    • Check if any authentication tokens or certificates have expired
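
If the remote cluster is reached over TLS, the certificate expiry check can be scripted. The sketch below uses Python's standard `ssl` module. The function names are hypothetical, it only inspects the server certificate presented by an endpoint, and it does not cover access tokens or client certificates, which need to be checked by other means.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days remaining until `not_after`, a timestamp in the format used by
    ssl.getpeercert(), for example "Jan  5 09:34:43 2018 GMT".
    A negative result means the certificate has already expired."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter field of the certificate served at host:port.
    The default context verifies the certificate, so an expired or untrusted
    certificate raises ssl.SSLCertVerificationError instead of returning."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Passing the result of `fetch_not_after()` for the remote cluster's endpoint to `days_until_expiry()` shows how long the certificate remains valid. A verification error raised during the handshake is itself a strong hint that the certificate has expired or is not trusted.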