Grafana Cloud

Set up online evaluation

Online evaluation lets you score your agents’ live production traffic automatically. This guide walks you through creating your first evaluator and rule.

Before you begin

  • The eval worker is enabled (SIGIL_EVAL_WORKER_ENABLED=true).
  • At least one judge provider is configured. Refer to Configure evaluation for provider setup.
  • You have the AI Observability Admin role.

Use the evaluation wizard

The evaluation overview page includes a setup wizard that walks you through creating your first evaluator and rule:

  1. Navigate to Evaluation in the AI Observability plugin.
  2. Click Create evaluator on the overview card.
  3. Choose an evaluator type from the card grid:
    • LLM judge — scores responses using an LLM.
    • JSON schema — validates response structure.
    • Regex — matches response patterns.
    • Heuristic — applies rule-based checks.
  4. Configure the evaluator settings and scoring criteria.
  5. Choose a rule selector and sampling rate.
  6. Optionally set up an alert with a pass-rate threshold.
  7. Review and activate.

You can also create evaluators and rules individually outside the wizard.

Write an LLM judge prompt

For an LLM judge evaluator, write a prompt that describes the scoring criteria. Use template variables to inject generation data:

text
You are evaluating the quality of an AI assistant response.

User message:
{{latest_user_message}}

Assistant response:
{{assistant_response}}

Rate the response quality on a scale of 1-5, where 1 is poor and 5 is excellent.
Consider accuracy, helpfulness, and clarity.

Return only the numeric score.

Create a rule

Rules connect evaluators to generation traffic:

  1. In Evaluation, click Create rule.
  2. Choose a selector:
    • User visible turn — assistant text responses without tool calls.
    • All assistant generations — any assistant output.
    • Tool call steps — generations containing tool calls.
    • Errored generations — generations with errors.
  3. Set a sampling rate (for example, 10% to evaluate 1 in 10 matching generations).
  4. Attach your evaluator.
  5. Click Save.

Set the evaluation target

For non-LLM-judge evaluators (JSON schema, regex, and heuristic), you can choose which text the evaluator analyzes:

  • Response (default) — the assistant’s output text.
  • Input — the user’s input text.
  • System prompt — the system prompt.

Select the target in the Evaluate against dropdown when you create or edit an evaluator. This is useful for detecting prompt injection or validating input structure without an LLM judge.

Create alerts from evaluation rules

You can create Grafana alert rules directly from an evaluation rule:

  1. Open a rule’s detail page.
  2. In the Alerts section, click Add alert.
  3. Set a pass-rate threshold (for example, alert when fewer than 90% of evaluations pass).
  4. Choose a contact point for notifications.
  5. Click Create.

Alert rules are created as non-provisioned, so you can edit them in the standard Grafana Alerting UI afterward. Alert rules appear as clickable rows in the rule detail page, and the rules table shows an alert count column.

Monitor results

Evaluation scores appear in several places:

  • Conversation detail: scores displayed next to each evaluated generation, with pass/fail/neutral counts in the metrics bar.
  • Evaluation overview: live stats, rules table, and score distributions.
  • Analytics dashboards: quality metrics alongside cost and performance data.

Use the evaluation dashboard to identify quality regressions after prompt changes, compare evaluator results across agent versions, and spot patterns in low-scoring generations.

Next steps