Set up online evaluation
Online evaluation lets you score your agents’ live production traffic automatically. This guide walks you through creating your first evaluator and rule.
Before you begin
- The eval worker is enabled (`SIGIL_EVAL_WORKER_ENABLED=true`).
- At least one judge provider is configured. Refer to Configure evaluation for provider setup.
- You have the AI Observability Admin role.
Use the evaluation wizard
The evaluation overview page includes a setup wizard that walks you through creating your first evaluator and rule:
- Navigate to Evaluation in the AI Observability plugin.
- Click Create evaluator on the overview card.
- Choose an evaluator type from the card grid:
- LLM judge — scores responses using an LLM.
- JSON schema — validates response structure.
- Regex — matches response patterns.
- Heuristic — applies rule-based checks.
- Configure the evaluator settings and scoring criteria.
- Choose a rule selector and sampling rate.
- Optionally set up an alert with a pass-rate threshold.
- Review and activate.
You can also create evaluators and rules individually outside the wizard.
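To build intuition for the non-LLM evaluator types, they can be pictured as simple pass/fail scoring functions. The sketch below is illustrative only; the function names and simplified checks are assumptions, not the plugin's actual implementation:

```python
import json
import re

# Illustrative sketches of the non-LLM evaluator types.
# The real plugin implementations are internal; these names are hypothetical.

def regex_evaluator(text: str, pattern: str) -> bool:
    """Regex: pass if the text matches the pattern."""
    return re.search(pattern, text) is not None

def json_schema_evaluator(text: str, required_keys: list[str]) -> bool:
    """JSON schema (simplified): parse and check required top-level keys."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(k in data for k in required_keys)

def heuristic_evaluator(text: str, max_len: int = 2000) -> bool:
    """Heuristic: a rule-based check, e.g. response length within bounds."""
    return 0 < len(text) <= max_len

print(regex_evaluator("score: 5", r"\d"))                     # True
print(json_schema_evaluator('{"answer": "hi"}', ["answer"]))  # True
```

The LLM judge type differs in that it delegates scoring to a configured judge provider rather than a local rule.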
Write an LLM judge prompt
For an LLM judge evaluator, write a prompt that describes the scoring criteria. Use template variables to inject generation data:
```
You are evaluating the quality of an AI assistant response.

User message:
{{latest_user_message}}

Assistant response:
{{assistant_response}}

Rate the response quality on a scale of 1-5, where 1 is poor and 5 is excellent.
Consider accuracy, helpfulness, and clarity.
Return only the numeric score.
```

Create a rule
Rules connect evaluators to generation traffic:
- In Evaluation, click Create rule.
- Choose a selector:
- User visible turn — assistant text responses without tool calls.
- All assistant generations — any assistant output.
- Tool call steps — generations containing tool calls.
- Errored generations — generations with errors.
- Set a sampling rate (for example, 10% to evaluate 1 in 10 matching generations).
- Attach your evaluator.
- Click Save.
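A sampling rate of r means each matching generation has probability r of being evaluated. A minimal sketch of how such a gate behaves (this is an assumption about the mechanism for illustration, not the plugin's code):

```python
import random

def should_evaluate(sampling_rate: float, rng: random.Random) -> bool:
    """Return True for roughly `sampling_rate` of calls (rate in 0.0-1.0)."""
    return rng.random() < sampling_rate

# At a 10% rate, roughly 1 in 10 matching generations is evaluated.
rng = random.Random(42)
sampled = sum(should_evaluate(0.10, rng) for _ in range(10_000))
print(sampled)  # roughly 1000 out of 10,000
```

Probabilistic sampling keeps judge costs proportional to the rate while still giving a statistically representative picture of quality.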
Set the evaluation target
For non-LLM-judge evaluators (JSON schema, regex, and heuristic), you can choose which text the evaluator analyzes:
- Response (default) — the assistant’s output text.
- Input — the user’s input text.
- System prompt — the system prompt.
Select the target in the Evaluate against dropdown when you create or edit an evaluator. This is useful for detecting prompt injection or validating input structure without an LLM judge.
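For example, a regex evaluator targeting Input could flag common prompt-injection phrasing. The pattern below is a rough, illustrative sketch and is nowhere near exhaustive:

```python
import re

# Illustrative pattern only; real prompt-injection detection needs
# far broader coverage than two phrasings.
INJECTION_PATTERN = re.compile(
    r"(ignore (all )?previous instructions|disregard your system prompt)",
    re.IGNORECASE,
)

def input_looks_suspicious(user_input: str) -> bool:
    return INJECTION_PATTERN.search(user_input) is not None

print(input_looks_suspicious("Please ignore previous instructions and ..."))  # True
print(input_looks_suspicious("What is the weather today?"))                   # False
```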
Create alerts from evaluation rules
You can create Grafana alert rules directly from an evaluation rule:
- Open a rule’s detail page.
- In the Alerts section, click Add alert.
- Set a pass-rate threshold (for example, alert when fewer than 90% of evaluations pass).
- Choose a contact point for notifications.
- Click Create.
Alert rules are created as non-provisioned, so you can edit them afterward in the standard Grafana Alerting UI. They appear as clickable rows on the rule's detail page, and the rules table shows an alert count column.
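The pass-rate condition amounts to comparing passes over total evaluations against the threshold. A small sketch of that check (hypothetical helper, not the Alerting rule format):

```python
def alert_should_fire(passed: int, total: int, threshold: float = 0.90) -> bool:
    """Fire when the pass rate drops below the threshold (default 90%)."""
    if total == 0:
        return False  # no evaluations yet; nothing to alert on
    return passed / total < threshold

print(alert_should_fire(85, 100))  # True: 85% is below the 90% threshold
print(alert_should_fire(95, 100))  # False: 95% is above it
```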
Monitor results
Evaluation scores appear in several places:
- Conversation detail: scores displayed next to each evaluated generation, with pass/fail/neutral counts in the metrics bar.
- Evaluation overview: live stats, rules table, and score distributions.
- Analytics dashboards: quality metrics alongside cost and performance data.
Use the evaluation dashboard to identify quality regressions after prompt changes, compare evaluator results across agent versions, and spot patterns in low-scoring generations.