Configure online evaluation

Online evaluation continuously scores live generation traffic. You configure evaluators that define how to score, and rules that define which generations to evaluate.

Enable the eval worker

Set these environment variables (or Helm values) to enable evaluation:

| Variable | Default | Description |
|---|---|---|
| SIGIL_EVAL_WORKER_ENABLED | false | Enable the evaluation worker loop. |
| SIGIL_EVAL_MAX_CONCURRENT | 8 | Maximum in-flight evaluations. |
| SIGIL_EVAL_MAX_RATE | 600 | Maximum evaluations per minute. |
| SIGIL_EVAL_MAX_ATTEMPTS | 3 | Retry cap for transient failures. |
| SIGIL_EVAL_CLAIM_BATCH_SIZE | 20 | Work items claimed per cycle. |
| SIGIL_EVAL_POLL_INTERVAL | 250ms | How often the worker claims new work. |
| SIGIL_EVAL_DEFAULT_JUDGE_MODEL | openai/gpt-4o-mini | Default model for LLM judge evaluators. |
| SIGIL_EVAL_JUDGE_MONTHLY_USAGE_LIMIT_USD | 0 | Global default monthly llm_judge spend cap per tenant, in whole USD. 0 disables the cap. Per-tenant overrides can be set via sigil-runtime-config.yaml (see Per-tenant overrides). |
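As a minimal sketch, a deployment might enable the worker like this. The values below are illustrative, not recommendations; tune concurrency and rate limits for your own traffic:

```shell
# Enable the evaluation worker loop with a modest throughput budget.
# All values here are examples -- adjust per deployment.
export SIGIL_EVAL_WORKER_ENABLED=true
export SIGIL_EVAL_MAX_CONCURRENT=4
export SIGIL_EVAL_MAX_RATE=300
export SIGIL_EVAL_DEFAULT_JUDGE_MODEL=openai/gpt-4o-mini
```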

Push evaluation metrics

AI Observability can push per-tenant evaluation metrics to a Prometheus-compatible remote-write endpoint. This lets you query evaluation pass rates and score distributions in Grafana alongside your other metrics.

| Variable | Default | Description |
|---|---|---|
| SIGIL_EVAL_METRICS_PUSH_ENDPOINT | "" | Remote-write endpoint URL. Leave empty to disable. |
| SIGIL_EVAL_METRICS_PUSH_INTERVAL | 15s | How often metrics are pushed. |
| SIGIL_EVAL_METRICS_PUSH_TIMEOUT | 10s | HTTP timeout for remote-write requests. |
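For example, to push metrics every 30 seconds to a Prometheus-compatible endpoint (the URL below is a placeholder):

```shell
# Enable metrics push to a remote-write endpoint.
# Replace the URL with your own Prometheus-compatible receiver.
export SIGIL_EVAL_METRICS_PUSH_ENDPOINT=https://prometheus.example.com/api/v1/write
export SIGIL_EVAL_METRICS_PUSH_INTERVAL=30s
```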

Configure judge providers

The querier discovers judge providers and judge-model dropdown options from environment variables. The eval worker enforces the same provider and allowlist config when evaluations execute.

| Provider | Required variable | Optional allowlist variable |
|---|---|---|
| OpenAI | SIGIL_EVAL_OPENAI_API_KEY | SIGIL_EVAL_OPENAI_ALLOWED_MODELS |
| Azure OpenAI | SIGIL_EVAL_AZURE_OPENAI_ENDPOINT, SIGIL_EVAL_AZURE_OPENAI_API_KEY | SIGIL_EVAL_AZURE_OPENAI_ALLOWED_MODELS |
| Anthropic | SIGIL_EVAL_ANTHROPIC_API_KEY | SIGIL_EVAL_ANTHROPIC_ALLOWED_MODELS |
| AWS Bedrock | AWS default credentials or SIGIL_EVAL_BEDROCK_BEARER_TOKEN | SIGIL_EVAL_BEDROCK_ALLOWED_MODELS |
| Google | SIGIL_EVAL_GOOGLE_API_KEY | SIGIL_EVAL_GOOGLE_ALLOWED_MODELS |
| Vertex AI | SIGIL_EVAL_VERTEXAI_PROJECT | SIGIL_EVAL_VERTEXAI_ALLOWED_MODELS |
| Anthropic on Vertex | SIGIL_EVAL_ANTHROPIC_VERTEX_PROJECT | SIGIL_EVAL_ANTHROPIC_VERTEX_ALLOWED_MODELS |
| OpenAI-compatible | Custom endpoint with optional API key | SIGIL_EVAL_OPENAI_COMPAT_ALLOWED_MODELS or SIGIL_EVAL_OPENAI_COMPAT_<N>_ALLOWED_MODELS |

Allowlist values are comma-separated model IDs. Use the same IDs returned by the judge model API or shown in the UI dropdown.

Empty or unset allowlist variables mean all models returned by that provider remain eligible.

In split deployments, keep allowlist env vars aligned across querier and eval-worker so the UI matches execution behavior.
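For example, to configure the OpenAI provider and restrict judging to two models (the key value is a placeholder):

```shell
# Configure the OpenAI judge provider and restrict eligible models.
# Model IDs must match those returned by the judge model API.
export SIGIL_EVAL_OPENAI_API_KEY=sk-example-placeholder
export SIGIL_EVAL_OPENAI_ALLOWED_MODELS=gpt-4o-mini,gpt-4o
```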

Bedrock supports both Anthropic models and non-Anthropic models (for example, Mistral). Anthropic models use the Anthropic Messages API, while non-Anthropic models use the model-agnostic Bedrock Converse API. AI Observability routes model IDs automatically based on the anthropic. prefix.

Anthropic on Vertex dynamically lists available Claude models from the Vertex AI model catalog. You don’t need to specify model IDs manually.

Create evaluators

Use the AI Observability plugin UI or the evaluation API to create evaluators. Four evaluator types are available:

LLM judge

Uses an LLM to score generations based on criteria you define in a prompt template.

Key settings:

  • provider and model — the LLM to use for judging.
  • system_prompt and user_prompt — prompt templates with variables.
  • max_tokens, temperature, timeout_ms — generation controls.
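The settings above can be sketched as an evaluator config. The field names listed in the documentation (provider, model, system_prompt, user_prompt, max_tokens, temperature, timeout_ms) are real; the overall JSON shape and the type field are illustrative assumptions, not the authoritative schema:

```json
{
  "type": "llm_judge",
  "provider": "openai",
  "model": "gpt-4o-mini",
  "system_prompt": "You are a strict grader. Score the response for helpfulness.",
  "user_prompt": "Question:\n{{latest_user_message}}\n\nAnswer:\n{{assistant_response}}",
  "max_tokens": 256,
  "temperature": 0,
  "timeout_ms": 30000
}
```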

JSON schema

Validates that the assistant response matches a JSON schema. Returns true or false.

Regex

Checks the assistant response against one or more regex patterns. Use reject: true to invert the match.

Heuristic

Applies a rule tree with AND/OR logic. Supported checks: not_empty, contains, not_contains, min_length, max_length.
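A rule tree combining the supported checks might look like the following sketch. The check names are documented; the surrounding JSON structure (the and/or nesting and the value key) is an assumed shape for illustration:

```json
{
  "type": "heuristic",
  "rules": {
    "and": [
      { "check": "not_empty" },
      { "check": "min_length", "value": 20 },
      {
        "or": [
          { "check": "not_contains", "value": "I cannot help" },
          { "check": "max_length", "value": 4000 }
        ]
      }
    ]
  }
}
```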

Evaluation target

JSON schema, regex, and heuristic evaluators evaluate the assistant response by default. You can change the evaluation target to evaluate other fields instead:

| Target | Description |
|---|---|
| response | Assistant response text (default). |
| input | User input text. |
| system_prompt | The system prompt. |

Set the target in the Evaluate against dropdown when you create a non-LLM-judge evaluator, or set the target field in the evaluator config JSON.

This is useful for lightweight detection of injected content in generation input without requiring an LLM judge.
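For example, a regex evaluator pointed at the input can flag likely injection attempts. The target and reject fields are documented; the patterns key and overall shape are assumptions for illustration:

```json
{
  "type": "regex",
  "target": "input",
  "patterns": ["(?i)ignore (all )?previous instructions"],
  "reject": false
}
```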

Pass verdict

Bool-type evaluators (heuristic, regex, JSON schema, and LLM judge bool outputs) record a pass/fail verdict only when you explicitly configure a pass_value on the output key. When pass_value is omitted, the score is recorded but no verdict is determined.
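As a sketch, an output-key config with an explicit pass verdict might look like this. Only pass_value is documented; the surrounding key structure is an assumption:

```json
{
  "output": {
    "key": "matches_schema",
    "type": "bool",
    "pass_value": true
  }
}
```

Without the pass_value line, the boolean score would still be recorded, but no pass/fail verdict would be determined.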

Template variables

LLM judge prompts support these template variables:

| Variable | Content |
|---|---|
| {{latest_user_message}} | Most recent user message |
| {{user_history}} | All user messages |
| {{assistant_response}} | Assistant output |
| {{assistant_thinking}} | Thinking/reasoning content |
| {{system_prompt}} | System prompt |
| {{tool_calls}} | Tool call details |
| {{tool_results}} | Tool result details |
| {{tools}} | Available tool definitions |
| {{call_error}} | Error information |
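A user_prompt template that interpolates these variables could look like the following example (the grading instructions themselves are illustrative):

```
Evaluate whether the assistant answered the user's question.

User question:
{{latest_user_message}}

Assistant answer:
{{assistant_response}}

Respond with "pass" or "fail" and a one-sentence reason.
```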

Create rules

Rules connect evaluators to generation traffic. Each rule has:

  • Selector — which generations to evaluate:
    • user_visible_turn — assistant text responses without tool calls.
    • all_assistant_generations — any assistant output.
    • tool_call_steps — generations with tool calls.
    • errored_generations — generations with errors.
  • Match filters — additional criteria to narrow the selection.
  • Sampling rate — percentage of matching generations to evaluate.
  • Evaluator — the evaluator to run.
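Putting these parts together, a rule might be sketched as the JSON below. The selector values are documented; the field names (match, sampling_rate, evaluator_id) and the overall shape are assumptions for illustration:

```json
{
  "selector": "user_visible_turn",
  "match": { "model": "gpt-4o" },
  "sampling_rate": 0.1,
  "evaluator_id": "helpfulness-judge"
}
```

With a sampling rate of 0.1, roughly one in ten matching generations would be evaluated, which keeps judge spend bounded on high-volume traffic.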

Next steps