---
title: "Configure online evaluation | Grafana Cloud documentation"
description: "Set up LLM judges, evaluator types, rules, and providers for continuous quality scoring in AI Observability."
---

# Configure online evaluation

Online evaluation continuously scores live generation traffic. You configure evaluators that define how to score, and rules that define which generations to evaluate.

## Create evaluators

Use the AI Observability plugin UI or the evaluation API to create evaluators. Four evaluator types are available:

### LLM judge

Uses an LLM to score generations based on criteria you define in a prompt template.

Key settings:

- `provider` and `model` — the LLM to use for judging.
- `system_prompt` and `user_prompt` — prompt templates with variables.
- `max_tokens`, `temperature`, `timeout_ms` — generation controls.
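
A minimal LLM judge configuration might look like the following sketch. The setting names (`provider`, `model`, `system_prompt`, `user_prompt`, `max_tokens`, `temperature`, `timeout_ms`) are the ones listed above; the wrapper fields (`name`, `type`) and the provider and model values are illustrative assumptions, not confirmed API names.

```jsonc
{
  // "name" and "type" are hypothetical wrapper fields for this sketch
  "name": "helpfulness-judge",
  "type": "llm_judge",
  "provider": "openai",    // judging LLM provider (example value)
  "model": "gpt-4o-mini",  // judging LLM model (example value)
  "system_prompt": "You are a strict grader. Score helpfulness from 0 to 1.",
  "user_prompt": "User asked: {{latest_user_message}}\nAssistant replied: {{assistant_response}}",
  "max_tokens": 256,
  "temperature": 0,
  "timeout_ms": 10000
}
```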

### JSON schema

Validates that the assistant response matches a JSON schema. Returns `true` or `false`.
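
As a sketch, a JSON schema evaluator wraps an ordinary JSON Schema. The inner schema below is standard; the `type`, `name`, and `schema` keys are assumed for illustration.

```jsonc
{
  "type": "json_schema",  // hypothetical wrapper fields
  "name": "response-shape-check",
  "schema": {
    // Standard JSON Schema: require an object with "answer" and "sources"
    "type": "object",
    "required": ["answer", "sources"],
    "properties": {
      "answer": { "type": "string" },
      "sources": { "type": "array", "items": { "type": "string" } }
    }
  }
}
```

A response that validates against the schema scores `true`; anything else scores `false`.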

### Regex

Checks the assistant response against one or more regex patterns. Use `reject: true` to invert the match.
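
As an illustration, a regex evaluator that fails any response leaking an internal hostname could look like this; the `patterns` key name is an assumption (this page only specifies "one or more regex patterns"):

```jsonc
{
  "type": "regex",  // hypothetical wrapper fields
  "name": "no-internal-hostnames",
  "patterns": ["\\binternal\\.example\\.com\\b"],
  "reject": true  // invert the match: a hit on the pattern counts as a failure
}
```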

### Heuristic

Applies a rule tree with AND/OR logic. Supported checks: `not_empty`, `contains`, `not_contains`, `min_length`, `max_length`.
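
This page documents the checks and the AND/OR combinators but not the exact node shape, so the `rules`, `check`, and `value` keys below are purely illustrative:

```jsonc
{
  "type": "heuristic",  // hypothetical wrapper fields
  "name": "grounded-answer-check",
  "rules": {
    // AND node: every child must pass
    "and": [
      { "check": "not_empty" },
      { "check": "min_length", "value": 20 },
      {
        // OR node: at least one child must pass
        "or": [
          { "check": "contains", "value": "Source:" },
          { "check": "contains", "value": "Reference:" }
        ]
      }
    ]
  }
}
```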

## Evaluation target

JSON schema, regex, and heuristic evaluators score the assistant response by default. You can change the evaluation target to score other fields instead:

| Target          | Description                        |
|-----------------|------------------------------------|
| `response`      | Assistant response text (default). |
| `input`         | User input text.                   |
| `system_prompt` | The system prompt.                 |

Set the target in the **Evaluate against** dropdown when you create a non-LLM-judge evaluator, or set the `target` field in the evaluator config JSON.

This enables lightweight checks, such as catching prompt-injection patterns in user input, without the cost of an LLM judge.
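
For example, a regex evaluator pointed at the user input can flag common injection phrasing. Here `target` is the documented field; as before, the `patterns` key is an assumption:

```jsonc
{
  "type": "regex",  // hypothetical wrapper fields
  "name": "prompt-injection-probe",
  "target": "input",  // evaluate the user input instead of the assistant response
  "patterns": ["(?i)ignore (all|previous) instructions"]
}
```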

## Pass verdict

Bool-type evaluators (heuristic, regex, JSON schema, and LLM judge bool outputs) record a pass/fail verdict only when you explicitly configure a `pass_value` on the output key. When `pass_value` is omitted, the score is recorded but no verdict is determined.
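
This page doesn't show the full shape of an output key, so the sketch below assumes a `key`/`type` structure; only `pass_value` is the documented setting:

```jsonc
{
  "output": {
    // "output", "key", and "type" are hypothetical field names
    "key": "is_safe",
    "type": "bool",
    "pass_value": true  // record a pass verdict when the evaluator returns true
  }
}
```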

## Template variables

LLM judge prompts support these template variables:

| Variable                  | Content                    |
|---------------------------|----------------------------|
| `{{latest_user_message}}` | Most recent user message   |
| `{{user_history}}`        | All user messages          |
| `{{assistant_response}}`  | Assistant output           |
| `{{assistant_thinking}}`  | Thinking/reasoning content |
| `{{system_prompt}}`       | System prompt              |
| `{{tool_calls}}`          | Tool call details          |
| `{{tool_results}}`        | Tool result details        |
| `{{tools}}`               | Available tool definitions |
| `{{call_error}}`          | Error information          |
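
For example, a `user_prompt` for a judge that grades tool usage might combine several of these variables (all variable names come from the table above):

```jsonc
{
  "user_prompt": "User asked: {{latest_user_message}}\nTools available: {{tools}}\nTool calls: {{tool_calls}}\nTool results: {{tool_results}}\nFinal answer: {{assistant_response}}\n\nDid the assistant use the tools correctly? Answer yes or no."
}
```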

## Create rules

Rules connect evaluators to generation traffic. Each rule has:

- **Selector** — which generations to evaluate:
  
  - `user_visible_turn` — assistant text responses without tool calls.
  - `all_assistant_generations` — any assistant output.
  - `tool_call_steps` — generations with tool calls.
  - `errored_generations` — generations with errors.
- **Match filters** — additional criteria to narrow the selection.
- **Sampling rate** — percentage of matching generations to evaluate.
- **Evaluator** — the evaluator to run.
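
Putting these parts together, a rule might look like the following sketch. The selector value and the concepts are documented above, but the exact field names (`selector`, `match_filters`, `sampling_rate`, `evaluator`) are assumptions:

```jsonc
{
  "selector": "user_visible_turn",  // documented selector: text responses without tool calls
  "match_filters": { "model": "gpt-4o" },  // hypothetical filter narrowing the selection
  "sampling_rate": 10,  // evaluate 10% of matching generations
  "evaluator": "helpfulness-judge"  // the evaluator to run (name from the earlier sketch)
}
```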

## Next steps

- [Set up evaluation end-to-end](/docs/grafana-cloud/machine-learning/ai-observability/guides/evaluation)
- [Configure deployment options](/docs/sigil/next/configure/deployment)
