# Configure online evaluation
Online evaluation continuously scores live generation traffic. You configure evaluators that define how to score, and rules that define which generations to evaluate.
## Enable the eval worker
Set these environment variables (or Helm values) to enable evaluation:
| Variable | Default | Description |
|---|---|---|
| `SIGIL_EVAL_WORKER_ENABLED` | `false` | Enable the evaluation worker loop. |
| `SIGIL_EVAL_MAX_CONCURRENT` | `8` | Maximum in-flight evaluations. |
| `SIGIL_EVAL_MAX_RATE` | `600` | Maximum evaluations per minute. |
| `SIGIL_EVAL_MAX_ATTEMPTS` | `3` | Retry cap for transient failures. |
| `SIGIL_EVAL_CLAIM_BATCH_SIZE` | `20` | Work items claimed per cycle. |
| `SIGIL_EVAL_POLL_INTERVAL` | `250ms` | How often the worker claims new work. |
| `SIGIL_EVAL_DEFAULT_JUDGE_MODEL` | `openai/gpt-4o-mini` | Default model for LLM judge evaluators. |
| `SIGIL_EVAL_JUDGE_MONTHLY_USAGE_LIMIT_USD` | `0` | Global default monthly `llm_judge` spend cap per tenant, in whole USD. `0` disables the cap. Per-tenant overrides can be set via `sigil-runtime-config.yaml` (see Per-tenant overrides). |
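A minimal configuration enabling the worker might look like the following (the values are illustrative; tune them for your traffic volume):

```shell
# Enable the evaluation worker loop.
export SIGIL_EVAL_WORKER_ENABLED=true
# Allow more parallel evaluations and a higher per-minute rate.
export SIGIL_EVAL_MAX_CONCURRENT=16
export SIGIL_EVAL_MAX_RATE=1200
# Cap monthly llm_judge spend at 50 USD per tenant (0 disables the cap).
export SIGIL_EVAL_JUDGE_MONTHLY_USAGE_LIMIT_USD=50
```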
## Push evaluation metrics
AI Observability can push per-tenant evaluation metrics to a Prometheus-compatible remote-write endpoint. This lets you query evaluation pass rates and score distributions in Grafana alongside your other metrics.
| Variable | Default | Description |
|---|---|---|
| `SIGIL_EVAL_METRICS_PUSH_ENDPOINT` | `""` | Remote-write endpoint URL. Leave empty to disable. |
| `SIGIL_EVAL_METRICS_PUSH_INTERVAL` | `15s` | How often metrics are pushed. |
| `SIGIL_EVAL_METRICS_PUSH_TIMEOUT` | `10s` | HTTP timeout for remote-write requests. |
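For example, to push metrics every 30 seconds to a remote-write endpoint (the URL is illustrative):

```shell
export SIGIL_EVAL_METRICS_PUSH_ENDPOINT="https://mimir.example.com/api/v1/push"
export SIGIL_EVAL_METRICS_PUSH_INTERVAL=30s
export SIGIL_EVAL_METRICS_PUSH_TIMEOUT=5s
```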
## Configure judge providers
The querier discovers judge providers and judge-model dropdown options from environment variables. The eval worker enforces the same provider and allowlist config when evaluations execute.
| Provider | Required variable | Optional allowlist variable |
|---|---|---|
| OpenAI | `SIGIL_EVAL_OPENAI_API_KEY` | `SIGIL_EVAL_OPENAI_ALLOWED_MODELS` |
| Azure OpenAI | `SIGIL_EVAL_AZURE_OPENAI_ENDPOINT`, `SIGIL_EVAL_AZURE_OPENAI_API_KEY` | `SIGIL_EVAL_AZURE_OPENAI_ALLOWED_MODELS` |
| Anthropic | `SIGIL_EVAL_ANTHROPIC_API_KEY` | `SIGIL_EVAL_ANTHROPIC_ALLOWED_MODELS` |
| AWS Bedrock | AWS default credentials or `SIGIL_EVAL_BEDROCK_BEARER_TOKEN` | `SIGIL_EVAL_BEDROCK_ALLOWED_MODELS` |
| Google AI | `SIGIL_EVAL_GOOGLE_API_KEY` | `SIGIL_EVAL_GOOGLE_ALLOWED_MODELS` |
| Vertex AI | `SIGIL_EVAL_VERTEXAI_PROJECT` | `SIGIL_EVAL_VERTEXAI_ALLOWED_MODELS` |
| Anthropic on Vertex | `SIGIL_EVAL_ANTHROPIC_VERTEX_PROJECT` | `SIGIL_EVAL_ANTHROPIC_VERTEX_ALLOWED_MODELS` |
| OpenAI-compatible | Custom endpoint with optional API key | `SIGIL_EVAL_OPENAI_COMPAT_ALLOWED_MODELS` or `SIGIL_EVAL_OPENAI_COMPAT_<N>_ALLOWED_MODELS` |
Allowlist values are comma-separated model IDs. Use the same IDs returned by the judge model API or shown in the UI dropdown.
Empty or unset allowlist variables mean all models returned by that provider remain eligible.
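For example, to limit OpenAI judging to two models while leaving every other provider unrestricted (the model IDs are illustrative):

```shell
# Comma-separated model IDs, matching those shown in the UI dropdown.
export SIGIL_EVAL_OPENAI_ALLOWED_MODELS="gpt-4o-mini,gpt-4o"
```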
In split deployments, keep allowlist env vars aligned across querier and eval-worker so the UI matches execution behavior.
Bedrock supports both Anthropic models and non-Anthropic models (for example, Mistral). Anthropic models use the Anthropic Messages API, while non-Anthropic models use the model-agnostic Bedrock Converse API. AI Observability routes model IDs automatically based on the `anthropic.` prefix.
Anthropic on Vertex dynamically lists available Claude models from the Vertex AI model catalog. You don’t need to specify model IDs manually.
## Create evaluators
Use the AI Observability plugin UI or the evaluation API to create evaluators. Four evaluator types are available:
### LLM judge
Uses an LLM to score generations based on criteria you define in a prompt template.
Key settings:
- `provider` and `model` — the LLM to use for judging.
- `system_prompt` and `user_prompt` — prompt templates with variables.
- `max_tokens`, `temperature`, `timeout_ms` — generation controls.
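As a sketch, an LLM judge evaluator configuration could combine these settings like so. The JSON field names mirror the settings above, but the exact config shape is an assumption and may differ in your version:

```json
{
  "provider": "openai",
  "model": "gpt-4o-mini",
  "system_prompt": "You are a strict grader. Reply with a single score from 1 to 5.",
  "user_prompt": "Question:\n{{latest_user_message}}\n\nAnswer:\n{{assistant_response}}\n\nScore the answer's helpfulness from 1 (poor) to 5 (excellent).",
  "max_tokens": 64,
  "temperature": 0,
  "timeout_ms": 30000
}
```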
### JSON schema
Validates that the assistant response matches a JSON schema. Returns `true` or `false`.
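For instance, a schema requiring a structured answer object could be attached like this. The `schema` field name is an assumption; only the pass/fail behavior is documented above:

```json
{
  "schema": {
    "type": "object",
    "required": ["answer", "confidence"],
    "properties": {
      "answer": { "type": "string" },
      "confidence": { "type": "number", "minimum": 0, "maximum": 1 }
    }
  }
}
```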
### Regex
Checks the assistant response against one or more regex patterns. Use `reject: true` to invert the match.
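A sketch of a regex evaluator that fails any response containing a boilerplate refusal phrase. The `patterns` field name is an assumption; `reject` is documented above:

```json
{
  "patterns": ["(?i)as an ai language model"],
  "reject": true
}
```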
### Heuristic
Applies a rule tree with AND/OR logic. Supported checks: `not_empty`, `contains`, `not_contains`, `min_length`, `max_length`.
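A heuristic rule tree might look like the following sketch. The exact JSON shape is an assumption; only the check names and AND/OR semantics are documented above:

```json
{
  "and": [
    { "check": "not_empty" },
    { "check": "min_length", "value": 20 },
    {
      "or": [
        { "check": "not_contains", "value": "TODO" },
        { "check": "contains", "value": "see the documentation" }
      ]
    }
  ]
}
```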
## Evaluation target
JSON schema, regex, and heuristic evaluators evaluate the assistant response by default. You can change the evaluation target to evaluate other fields instead:
| Target | Description |
|---|---|
| `response` | Assistant response text (default). |
| `input` | User input text. |
| `system_prompt` | The system prompt. |
Set the target in the **Evaluate against** dropdown when you create a non-LLM-judge evaluator, or set the `target` field in the evaluator config JSON.
This is useful for lightweight detection of injected content in generation input without requiring an LLM judge.
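For example, a regex evaluator pointed at the user input can flag common prompt-injection phrasing. The `patterns` field name is an assumption; `target` and `reject` are documented above:

```json
{
  "target": "input",
  "patterns": ["(?i)ignore (all )?previous instructions"],
  "reject": true
}
```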
## Pass verdict
Bool-type evaluators (heuristic, regex, JSON schema, and LLM judge bool outputs) record a pass/fail verdict only when you explicitly configure a `pass_value` on the output key. When `pass_value` is omitted, the score is recorded but no verdict is determined.
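For example, to record a verdict that passes when the evaluator returns `true` (the surrounding structure is illustrative; only `pass_value` on the output key is documented above):

```json
{
  "output": {
    "pass_value": true
  }
}
```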
## Template variables
LLM judge prompts support these template variables:
| Variable | Content |
|---|---|
| `{{latest_user_message}}` | Most recent user message |
| `{{user_history}}` | All user messages |
| `{{assistant_response}}` | Assistant output |
| `{{assistant_thinking}}` | Thinking/reasoning content |
| `{{system_prompt}}` | System prompt |
| `{{tool_calls}}` | Tool call details |
| `{{tool_results}}` | Tool result details |
| `{{tools}}` | Available tool definitions |
| `{{call_error}}` | Error information |
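For example, a `user_prompt` template that gives the judge both the request and the response, plus any tool activity:

```
Evaluate the following exchange.

User request:
{{latest_user_message}}

Assistant response:
{{assistant_response}}

Tool calls, if any:
{{tool_calls}}

Did the response fully address the request? Answer PASS or FAIL.
```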
## Create rules
Rules connect evaluators to generation traffic. Each rule has:
- **Selector** — which generations to evaluate:
  - `user_visible_turn` — assistant text responses without tool calls.
  - `all_assistant_generations` — any assistant output.
  - `tool_call_steps` — generations with tool calls.
  - `errored_generations` — generations with errors.
- **Match filters** — additional criteria to narrow the selection.
- **Sampling rate** — percentage of matching generations to evaluate.
- **Evaluator** — the evaluator to run.
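Put together, a rule could be sketched as follows. The field names and the evaluator reference are illustrative; the selector values are listed above:

```json
{
  "selector": "user_visible_turn",
  "sampling_rate": 10,
  "evaluator": "helpfulness-judge"
}
```

Here a sampling rate of `10` would evaluate roughly 10% of matching generations.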