New in Grafana Enterprise Metrics 2.0: Cross-tenant alerting and recording rules

April 13, 2022 7 min

On the heels of launching our new open source TSDB Grafana Mimir, we are excited to introduce Grafana Enterprise Metrics 2.0.

GEM 2.0 is built on top of Grafana Mimir 2.0, our easy-to-operate, high-scale database which we’ve shown can handle upwards of 1 billion active series. That means that GEM 2.0 inherits all of the highlights of Mimir, including easy deployment, native multi-tenancy, high availability, durable long-term metrics storage, and exceptional query performance. 

But GEM also includes additional functionality that users need to operate a time series database at scale in their organization, including simplified tenant management. One new addition in this area is GEM 2.0’s experimental cross-tenant alerting and recording rules feature.

How single-tenant alerting and recording rules work

By default, GEM runs alerting and recording rules as queries that are limited to single tenants. So, for example, you can create a tenant for a group in your organization and name the tenant “Team A.” You could then set up a recording rule that read and acted on the series stored in the Team A tenant. But if you wanted to deploy the same rule across multiple teams, you would need to create the rule group in each team’s tenant. You can see how this can quickly become difficult and tedious to manage in large organizations. 

We’ve also had users who want to aggregate their metrics and analyze the results of all their recording rules, which would ultimately mean introducing a new federated data source into their infrastructure. The other option? Write all the tenant data twice: once to the tenant and once to the administrator tenant that needs the overview. That means every duplicated series is ingested and stored twice.

For alerts, the same friction exists. To deploy the same alerting rules across an organization, you would have to manage the YAML rules for each tenant. This means that a single incident affecting multiple tenants could trigger a storm of alerts, one for each tenant affected.

What are cross-tenant alerting and recording rules in GEM 2.0?

To address these pain points, GEM 2.0 inherits Grafana Mimir’s experimental support for cross-tenant alerting and recording rules, which operate on data from multiple tenants.

Now, instead of managing your alerting and recording rules in various tenants, you can create a single rule group that uses data from multiple tenants by setting the source_tenants field in your rule group. Translation: GEM 2.0 allows you to monitor all your tenants at once without the hassle of an additional data source and without duplicating data.

How cross-tenant alerting and recording rules work

In this example, the source_tenants are Team A and Team B, and Team Infra needs to monitor both of these tenants.

Team Infra can monitor both Team A and Team B using the same recording rules without duplicating data using the cross-tenant recording rule functionality in GEM 2.0. The metrics for Team A (top right) show that request errors spike every 2 minutes. For Team B (bottom right), request failures peak every 5 minutes. With the cross-tenant recording rule, Team Infra can also see unified data from Team A and Team B (top left), which allows them to create alerts based on the overall performance of the two tenants. Here, a cross-tenant alerting rule has been set up to fire if both tenants have an elevated error rate (bottom left).

Below is the cross-tenant rule group definition (federated-rules.yaml):

groups:
  - name: http-requests-results
    interval: 15s
    source_tenants: [ team-a, team-b ]
    rules:
      - record: result:my_app_http_requests_total:sum
        expr: sum by (result) (rate(my_app_http_requests_total[1m]))

      - alert: HTTPRequestsHighFailureRate
        expr: sum by (result) (rate(my_app_http_requests_total{result="failure"}[1m])) > 1.1
        for: 15s

The new cross-tenant alerting rules in GEM 2.0 are especially helpful in use cases where you want to set an alert for a single metric that exists in multiple tenants. For example, let’s say your team monitors CPU usage across multiple teams. Now if one tenant is running at 100% CPU usage, and they’re getting throttled (and don’t even realize it), you can create a single cross-tenant alerting rule that will look across your org to find any services, pods, machines, or containers that have a CPU utilization of 100% and fire an alert to identify the overtaxed target(s).
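As a rough sketch of the CPU scenario, a cross-tenant rule group might look like the following. It assumes each tenant ingests node_exporter’s node_cpu_seconds_total metric, which this post does not state; the group name, alert name, threshold, and durations are illustrative as well. Grouping by __tenant_id__ relies on the label that Mimir’s tenant federation attaches to federated query results, so confirm that your deployment exposes it to rule queries before depending on it.

```yaml
groups:
  - name: cpu-saturation            # hypothetical group name
    interval: 1m
    source_tenants: [ team-a, team-b ]
    rules:
      - alert: HostCPUSaturated     # hypothetical alert name
        # Busy fraction per instance: 1 minus the average idle rate across cores.
        # Assumes node_exporter metrics exist in every source tenant.
        expr: (1 - avg by (__tenant_id__, instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.99
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} in tenant {{ $labels.__tenant_id__ }} is running at close to 100% CPU"
```

Keeping the tenant in the grouping labels means the alert identifies which tenant owns the overtaxed target, rather than only reporting that one exists somewhere in the org.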

Another example is if your team is monitoring and maintaining databases, and you want to be alerted whenever any of your clients start seeing errors from your databases. With cross-tenant alerting rules, you can create a rule that sums the count of errors across all tenants to help you identify databases with the highest error counts. This way, if any of your tenants start experiencing increased errors, you’d be alerted. You could take the same approach to set a global alert that would flag high latency queries in any of your tenants.
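The database scenario could be sketched along the same lines. The metric db_client_errors_total and its database label are hypothetical stand-ins for whatever error counter your databases actually expose, and the 0.5 errors per second threshold is arbitrary.

```yaml
groups:
  - name: database-errors           # hypothetical group name
    interval: 1m
    source_tenants: [ team-a, team-b ]
    rules:
      # Error rate per database, summed across every source tenant.
      - record: database:db_client_errors:rate5m
        expr: sum by (database) (rate(db_client_errors_total[5m]))

      # Fires when any single database exceeds the (arbitrary) error threshold.
      - alert: DatabaseErrorsIncreasing
        expr: sum by (database) (rate(db_client_errors_total[5m])) > 0.5
        for: 5m
```

Because the sum drops the tenant dimension, one alert fires per affected database regardless of how many tenants are seeing its errors.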

Access controls

Unlike in Grafana Mimir, we had to face a big question when incorporating this new feature in Grafana Enterprise Metrics: How do we make it work with GEM’s existing access control system? Specifically, how do we evaluate cross-tenant rule groups and ensure that they are still authorized to read data from the specified source_tenants?

In order to read metrics from other tenants, you need metrics:read scope over all of them. The challenge here is that GEM authorization only happens on the edge (when you push metrics, when you query, etc.). Once deployed, rules constantly run internally (every 15s, every 1m, etc.), and no tokens are needed to continue evaluating them. So if someone changed the access policy after the group had been created, the group would keep on running.

To navigate this issue, we now store the access policy name used to create the rule within the rule group. On every rule storage synchronization, we check if the policy still permits metrics:read over the source_tenants. If the permission has been removed (e.g., the access policy has been modified to remove read permissions for one of the source tenants), the rule stops being evaluated. Thus, GEM 2.0 seamlessly integrates with GEM’s existing access control model.

To set up a cross-tenant alerting or recording rule in GEM 2.0, an administrator starts by creating an access policy. In the example above, we identify Team A and Team B as the source_tenants and define the rules that apply to both Team A and Team B within the access policy. Then the admin will generate a token, which serves as the password for creating new rule groups. The output of the rule evaluation gets written to whatever tenant the rule group was created for.

Administrator benefits

With cross-tenant alerting and recording rules, GEM 2.0 allows admins to separate administrative rule groups from each tenant’s rule groups. So if Team Infra sets up an alerting rule which monitors CPU usage across all tenants, a user in Team A or Team B could not accidentally delete the rule (and thus break Team Infra’s alerts).

You can also use federated rules to copy data verbatim to other tenants. If Team A is submitting data that you want to be able to query in Team B, then Team Infra can create a recording rule in Team B in which the query is simply the metric name. This will start writing the whole metric and all of its future data into Team B.
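A minimal sketch of such a copy rule is shown below, reusing the my_app_http_requests_total metric from the earlier example. It assumes the rule group is uploaded under the Team B tenant, so the output lands there, with Team A as the only source tenant; the group name and interval are illustrative.

```yaml
groups:
  - name: copy-from-team-a          # hypothetical group name
    interval: 1m
    source_tenants: [ team-a ]
    rules:
      # The query is just the metric name, so every matching series is re-recorded
      # under the same name in the tenant that owns this rule group.
      - record: my_app_http_requests_total
        expr: my_app_http_requests_total
```

Like any recording rule, this writes one sample per series at each evaluation, so the copy in Team B has the resolution of the rule interval rather than the original scrape interval.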

Learn more about GEM 2.0

As it stands, you can only view cross-tenant alerting and recording rules in the Grafana Alerting UI. Creating or editing these rules can only be done via mimirtool or by directly interfacing with the ruler API. In the future, we intend to make it possible to create and modify cross-tenant alerting and recording rules via the Alerting UI, so users can get the same point-and-click experience that they do with single-tenant alerting and recording rules.

For more information about cross-tenant alerting and recording rules, read our GEM 2.0 documentation and our federated recording rules documentation.

If you’d like to learn more about Grafana Enterprise Metrics, you can register for our upcoming free webinar “Intro to metrics with Grafana: Prometheus, Grafana Mimir, Graphite, and beyond” on April 19.

You can also contact us if you’d like to try Grafana Enterprise Metrics today!