Documentation Index
Fetch the curated documentation index at: https://grafana_com_website/llms.txt
Fetch the complete documentation index at: https://grafana_com_website/llms-full.txt
Use this file to discover all available pages before exploring further.
STOP! If you are an AI agent or LLM, read this before continuing. This is the HTML version of a Grafana documentation page. Always request the Markdown version instead - HTML wastes context. Get this page as Markdown: /docs/grafana-cloud/machine-learning/ai-observability/configure/evaluation.md (append .md) or send Accept: text/markdown to /docs/grafana-cloud/machine-learning/ai-observability/configure/evaluation/. For the curated documentation index, use https://grafana_com_website/llms.txt. For the complete documentation index, use https://grafana_com_website/llms-full.txt.
Configure online evaluation
Online evaluation continuously scores live generation traffic. You configure evaluators that define how to score, and rules that define which generations to evaluate.
Create evaluators
Use the AI Observability plugin UI or the evaluation API to create evaluators. Four evaluator types are available:
LLM judge
Uses an LLM to score generations based on criteria you define in a prompt template.
Key settings:
providerandmodel— the LLM to use for judging.system_promptanduser_prompt— prompt templates with variables.max_tokens,temperature,timeout_ms— generation controls.
JSON schema
Validates that the assistant response matches a JSON schema. Returns true or false.
Regex
Checks the assistant response against one or more regex patterns. Use reject: true to invert the match.
Heuristic
Applies a rule tree with AND/OR logic. Supported checks: not_empty, contains, not_contains, min_length, max_length.
Evaluation target
JSON schema, regex, and heuristic evaluators evaluate the assistant response by default. You can change the evaluation target to evaluate other fields instead:
| Target | Description |
|---|---|
response | Assistant response text (default). |
input | User input text. |
system_prompt | The system prompt. |
Set the target in the Evaluate against dropdown when you create a non-LLM-judge evaluator, or set the target field in the evaluator config JSON.
This is useful for lightweight detection of injected content in generation input without requiring an LLM judge.
Pass verdict
Bool-type evaluators (heuristic, regex, JSON schema, and LLM judge bool outputs) record a pass/fail verdict only when you explicitly configure a pass_value on the output key. When pass_value is omitted, the score is recorded but no verdict is determined.
Template variables
LLM judge prompts support these template variables:
| Variable | Content |
|---|---|
{{latest_user_message}} | Most recent user message |
{{user_history}} | All user messages |
{{assistant_response}} | Assistant output |
{{assistant_thinking}} | Thinking/reasoning content |
{{system_prompt}} | System prompt |
{{tool_calls}} | Tool call details |
{{tool_results}} | Tool result details |
{{tools}} | Available tool definitions |
{{call_error}} | Error information |
Create rules
Rules connect evaluators to generation traffic. Each rule has:
- Selector — which generations to evaluate:
user_visible_turn— assistant text responses without tool calls.all_assistant_generations— any assistant output.tool_call_steps— generations with tool calls.errored_generations— generations with errors.
- Match filters — additional criteria to narrow the selection.
- Sampling rate — percentage of matching generations to evaluate.
- Evaluator — the evaluator to run.
Next steps
Was this page helpful?
Related resources from Grafana Labs


