Introducing o11y-bench: an open benchmark for AI agents running observability workflows

Yasir Ekinci

Jack Gordley

•

2026-04-21•8 min

Evaluating agents is hard. Verifying observability tasks is harder.

Yes, AI agents have gotten dramatically and quantifiably better at coding and tool use, but observability presents a different kind of challenge. In a real incident, the hard part is rarely just writing a query. It's deciding which signal matters, figuring out whether a spike is noise or symptom, correlating metrics with logs and traces, and sometimes making a change in Grafana without breaking the dashboard another engineer depends on.

To help the Grafana community navigate this new world of AI-assisted observability, we’re open sourcing grafana/o11y-bench, a benchmark for evaluating AI agents on observability workflows. It runs agents against a real Grafana stack with access to Grafana MCP server and grades them on a set of observability tasks within that environment.

o11y-bench is built on Harbor, an open source framework released by the creators of Terminal Bench that standardizes environments for benchmarking agents against sets of focused tasks. The benchmark we developed focuses on the workflows that actually matter in practice: querying metrics, logs, and traces; investigating incidents; and making targeted dashboard changes.

Why observability needs its own benchmark

Observability isn't just another straightforward agent tool-calling problem. Observability tasks such as root-cause investigations or dashboard creation often depend on the interaction between large amounts of metrics, logs, traces, time ranges, and saved application state. And that collection of variables makes it harder to tell whether an agent actually got the work right. For example, a query can be syntactically valid and still select the wrong series; a dashboard can render and still be saved incorrectly.

To properly evaluate AI systems today, benchmark tasks and simulated environments must reflect reality. o11y-bench runs agents against a real Grafana stack and evaluates them on a set of focused criteria simulating the complexity of a modern observability stack.

The main page for o11y-bench, with a description of the observability benchmark, as well as links to the leaderboard, GitHub and Grafana blog post, as well as a ranking of the top agents

This type of standardized measuring can provide critical insights for Grafana users because the outcome can help you discern the difference between an agent that looks helpful in a demo and one you can trust in a real workflow. In observability, the dangerous mistakes are often the subtle ones.

And by open sourcing the tasks, environment, grading logic, and results, we want this to be inspectable, reproducible, and open to challenge. We are also hoping that these tasks can help the next generation of models improve their observability related skills.

Open source, open testing

Built on Harbor, o11y-bench allows you to run your model, agent harness, or any combination of the two in a sandboxed environment alongside a Grafana Docker container with synthetic metrics, logs, and traces present. It’s as simple as running the following task to get started:

mise run bench:job -- --model openai/gpt-5.4-nano --task-name query-cpu-metrics --agent opencode

This command will kick off the benchmark over just one task, (query-cpu-metrics ) and output the results to the /jobs folder where you can inspect agent trajectory, see the LLM-as-a-judge and heuristic scoring, and understand how your agent or model performed.

Our goal with o11y-bench is to engage the community to see what's possible. We have kicked off the leaderboard with a set of base frontier models, but we welcome new combinations of agent harnesses, model configurations, and experimentation to push agent capabilities in observability forward.

What tasks o11y-bench tests

The first public release of o11y-bench includes 63 tasks across various observability workflows:

Prometheus and PromQL tasks
Loki and LogQL tasks
Tempo and TraceQL tasks
Multi-step incident investigations
Dashboard editing and repair tasks

The tasks we have curated aim to be deterministic enough to grade reliably, but rich enough to produce real failure modes. For instance, take a problem from the Prometheus query category, promql-retry-backlog-triage:

“We think the payment incident may have built up retries behind the scenes. Over roughly the last six hours, which service showed the highest retry/backlog depth, about how high did it get, and does the next-worst service look like a smaller spillover or a comparable primary problem?”

To a human familiar with the system, this problem seems relatively straightforward. However, we noticed high-thinking or token-heavy agents would spin their wheels by gathering too much information about the system, wasting tokens and timing out. On the other hand, more-focused agents were able to zero-in on the proper queries and diagnose the system quickly and accurately.

While a high-thinking agent may get there in the end, the metrics included with o11y-bench also allow us to examine cost, token usage, and overall performance rather than just a "0" or "1" answer, providing actionable insights on agents and models we may want to use for these types of scenarios.

Why verifying observability work is hard

Coming up with observability tasks that sufficiently test an agent is only part of the assessment process. You also need to be able to verify that those tasks are accurately completed.

If a user asks an agent to investigate latency, compare error rates, or update a dashboard, simply getting a final answer that looks good isn't good enough. For many query tasks, we run a reference Prometheus or Loki query against the same stack the agent saw, then compare that value to what the model actually cited. For dashboard tasks, we inspect the saved Grafana state and, when needed, execute the saved panel query and compare it against a reference query for the same case.

We start with outcomes. For the explanation itself, we still score the response, but we pair that with verifiable facts from the environment rather than treating fluent prose as evidence.

Two simple examples:

If a model says “the p95 latency was about 2.3 seconds,” the verifier can run the reference query against the same Prometheus or Loki data and check whether that number is actually supported.
If a model says it fixed a dashboard, the verifier can inspect the saved panel JSON, bind the expected variable values, execute the saved query, and compare the result to a reference query for the same case.

Our general grading philosophy is to always check against the ground truth of what the agent actually did, not just what it said. In practice, that is the difference between an agent that looks convincing in a transcript and one you can trust in a real investigation.

Measuring reliability vs. best-of-three success

Two headline scores are used in this benchmark to evaluate model performance:

Pass^3: A measure of consistency, computed as the average benchmark score across three runs
Pass@3: A measure of best-of-three success, indicating whether the model solved the task at least once across three attempts

Note: Each metric has its value, but their individual usefulness will depend on the use case. For the purposes of this exercise, we care more about consistency, so Pass^3 takes priority in the rankings. Further reading on agent eval methodology and metrics can be found on the Anthropic blog.

It's interesting to note how each model family performs in the relative metrics, as different leaders emerge when looking at the different metrics for success.

The results

The initial launch suite covered 29 model variants on 63 tasks (at three attempts each) for a total 5,481 total trials.

The o11y-bench leaderboard, as well as a description of the metric measurements

o11y-bench leaderboard (Top 15)

Using Pass^3 as the headline metric:

Opus 4.7 with reasoning turned off led the launch run
Opus 4.7 with reasoning set to high came in second, interestingly lower consistency at Pass^3 but with a higher Pass@3 score

Qwen 3.6 Plus performed the best of the frontier open source models we tested, even beating some of the smaller Sonnet and GPT models.

The main takeaway is that reliability is what truly separates the top models. Many models could get a task right at least once across three attempts. Far fewer could do it consistently. That gap is exactly why we treat reliability as the main benchmark signal.

“Got it right once” and “gets it right consistently” are not the same thing, especially in observability, where a subtle mistake can send an engineer down the wrong path. Mean score is still useful for debugging tasks and graders, but it is not a good headline metric for agent trust.

Category scores panel, with a ranking of the top 10 most consistent models

o11y-bench task category scores (Top 10 performers)

A per-category view sharpened that picture further. Grafana API tasks were close to saturated, and Prometheus was relatively strong. Tempo and Loki sat in the middle. Dashboarding remains the hardest area, not because it is the only thing that matters, but because it combines state, query correctness, variable wiring, and saved behavior in ways that are easy to get almost right.

Here, Pass@3 means “got an individual task right at least once across three tries,” while a perfect Pass^3 means “got it right all three times.” The gap between those two is one of the main things the benchmark is trying to expose.

Try it yourself

The quickest way to get started is to head to the grafana/o11y-bench repo, clone it locally, and follow the README.

From there, you can run individual tasks, full suites, and comparison reports against any model or agent harness available through Harbor and LiteLLM.

If you try o11y-bench, we’d be interested to see how it holds up across more models, agent setups, and independent reproductions, as well as what it suggests for future benchmark revisions. Submit contributions to the HuggingFace leaderboard per the contributing guide or open an issue in the benchmark repo for feedback or discussions.

For more information on this and all the other exciting updates from GrafanaCON 2026, check out our announcement blog for all the news. And for more information on Grafana Cloud AI, including FAQs about Assistant and our other AI capabilities, check out our AI observability page.

Grafana Cloud is the easiest way to get started with metrics, logs, traces, dashboards, and more. We have a generous forever-free tier and plans for every use case. Sign up for free now!

Introducing o11y-bench: an open benchmark for AI agents running observability workflows

Why observability needs its own benchmark

Open source, open testing

What tasks o11y-bench tests

Why verifying observability work is hard

Measuring reliability vs. best-of-three success

The results

Try it yourself

Up next

Related content

Related videos

Related docs

Related products

Still have questions?

Get every update

Introducing o11y-bench: an open benchmark for AI agents running observability workflows

Why observability needs its own benchmark

Open source, open testing

What tasks o11y-bench tests

Why verifying observability work is hard

Measuring reliability vs. best-of-three success

The results

Try it yourself

Related Content

Up next

Related content

Related videos

Related docs

Related products