Configure online evaluation

Online evaluation continuously scores live generation traffic. You configure evaluators that define how to score, and rules that define which generations to evaluate.

Enable the eval worker

Set these environment variables (or Helm values) to enable evaluation:

| Variable | Default | Description |
|---|---|---|
| SIGIL_EVAL_WORKER_ENABLED | false | Enable the evaluation worker loop. |
| SIGIL_EVAL_MAX_CONCURRENT | 8 | Maximum in-flight evaluations. |
| SIGIL_EVAL_MAX_RATE | 600 | Maximum evaluations per minute. |
| SIGIL_EVAL_MAX_ATTEMPTS | 3 | Retry cap for transient failures. |
| SIGIL_EVAL_CLAIM_BATCH_SIZE | 20 | Work items claimed per cycle. |
| SIGIL_EVAL_POLL_INTERVAL | 250ms | How often the worker claims new work. |
| SIGIL_EVAL_DEFAULT_JUDGE_MODEL | openai/gpt-4o-mini | Default model for LLM judge evaluators. |
| SIGIL_EVAL_JUDGE_MONTHLY_USAGE_LIMIT_USD | 0 | Global default monthly llm_judge spend cap per tenant, in whole USD. 0 disables the cap. Per-tenant overrides can be set via sigil-runtime-config.yaml (see Per-tenant overrides). |
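As a minimal sketch, a deployment might enable the worker like this. The values below are illustrative, not recommendations; tune concurrency and rate limits for your own traffic:

```shell
# Enable the evaluation worker loop with a modest throughput budget.
# All values here are examples -- adjust per deployment.
export SIGIL_EVAL_WORKER_ENABLED=true
export SIGIL_EVAL_MAX_CONCURRENT=4
export SIGIL_EVAL_MAX_RATE=300
export SIGIL_EVAL_DEFAULT_JUDGE_MODEL=openai/gpt-4o-mini
```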

Push evaluation metrics

AI Observability can push per-tenant evaluation metrics to a Prometheus-compatible remote-write endpoint. This lets you query evaluation pass rates and score distributions in Grafana alongside your other metrics.

| Variable | Default | Description |
|---|---|---|
| SIGIL_EVAL_METRICS_PUSH_ENDPOINT | "" | Remote-write endpoint URL. Leave empty to disable. |
| SIGIL_EVAL_METRICS_PUSH_INTERVAL | 15s | How often metrics are pushed. |
| SIGIL_EVAL_METRICS_PUSH_TIMEOUT | 10s | HTTP timeout for remote-write requests. |
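For example, to push metrics every 30 seconds to a Prometheus-compatible endpoint (the URL below is a placeholder):

```shell
# Enable metrics push to a remote-write endpoint.
# Replace the URL with your own Prometheus-compatible receiver.
export SIGIL_EVAL_METRICS_PUSH_ENDPOINT=https://prometheus.example.com/api/v1/write
export SIGIL_EVAL_METRICS_PUSH_INTERVAL=30s
```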

Configure judge providers

The querier discovers judge providers and judge-model dropdown options from environment variables. The eval worker enforces the same provider and allowlist config when evaluations execute.

| Provider | Required variable | Optional allowlist variable |
|---|---|---|
| OpenAI | SIGIL_EVAL_OPENAI_API_KEY | SIGIL_EVAL_OPENAI_ALLOWED_MODELS |
| Azure OpenAI | SIGIL_EVAL_AZURE_OPENAI_ENDPOINT, SIGIL_EVAL_AZURE_OPENAI_API_KEY | SIGIL_EVAL_AZURE_OPENAI_ALLOWED_MODELS |
| Anthropic | SIGIL_EVAL_ANTHROPIC_API_KEY | SIGIL_EVAL_ANTHROPIC_ALLOWED_MODELS |
| AWS Bedrock | AWS default credentials or SIGIL_EVAL_BEDROCK_BEARER_TOKEN | SIGIL_EVAL_BEDROCK_ALLOWED_MODELS |
| Google | SIGIL_EVAL_GOOGLE_API_KEY | SIGIL_EVAL_GOOGLE_ALLOWED_MODELS |
| Vertex AI | SIGIL_EVAL_VERTEXAI_PROJECT | SIGIL_EVAL_VERTEXAI_ALLOWED_MODELS |
| Anthropic on Vertex | SIGIL_EVAL_ANTHROPIC_VERTEX_PROJECT | SIGIL_EVAL_ANTHROPIC_VERTEX_ALLOWED_MODELS |
| OpenAI-compatible | Custom endpoint with optional API key | SIGIL_EVAL_OPENAI_COMPAT_ALLOWED_MODELS or SIGIL_EVAL_OPENAI_COMPAT_<N>_ALLOWED_MODELS |

Allowlist values are comma-separated model IDs. Use the same IDs returned by the judge model API or shown in the UI dropdown.

Empty or unset allowlist variables mean all models returned by that provider remain eligible.

In split deployments, keep allowlist env vars aligned across querier and eval-worker so the UI matches execution behavior.
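For example, to configure the OpenAI provider and restrict judging to two models (the key value is a placeholder):

```shell
# Configure the OpenAI judge provider and restrict eligible models.
# Model IDs must match those returned by the judge model API.
export SIGIL_EVAL_OPENAI_API_KEY=sk-example-placeholder
export SIGIL_EVAL_OPENAI_ALLOWED_MODELS=gpt-4o-mini,gpt-4o
```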

Bedrock supports both Anthropic models and non-Anthropic models (for example, Mistral). Anthropic models use the Anthropic Messages API, while non-Anthropic models use the model-agnostic Bedrock Converse API. AI Observability routes model IDs automatically based on the anthropic. prefix.

Anthropic on Vertex dynamically lists available Claude models from the Vertex AI model catalog. You don’t need to specify model IDs manually.

Create evaluators

Use the AI Observability plugin UI or the evaluation API to create evaluators. Four evaluator types are available:

LLM judge

Uses an LLM to score generations based on criteria you define in a prompt template.

Key settings:

  • provider and model — the LLM to use for judging.
  • system_prompt and user_prompt — prompt templates with variables.
  • max_tokens, temperature, timeout_ms — generation controls.
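The settings above can be sketched as an evaluator config. The field names listed in the documentation (provider, model, system_prompt, user_prompt, max_tokens, temperature, timeout_ms) are real; the overall JSON shape and the type field are illustrative assumptions, not the authoritative schema:

```json
{
  "type": "llm_judge",
  "provider": "openai",
  "model": "gpt-4o-mini",
  "system_prompt": "You are a strict grader. Score the response for helpfulness.",
  "user_prompt": "Question:\n{{latest_user_message}}\n\nAnswer:\n{{assistant_response}}",
  "max_tokens": 256,
  "temperature": 0,
  "timeout_ms": 30000
}
```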

JSON schema

Validates that the assistant response matches a JSON schema. Returns true or false.

Regex

Checks the assistant response against one or more regex patterns. Use reject: true to invert the match.

Heuristic

Applies a rule tree with AND/OR logic. Supported checks: not_empty, contains, not_contains, min_length, max_length.
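A rule tree combining the supported checks might look like the following sketch. The check names are documented; the surrounding JSON structure (the and/or nesting and the value key) is an assumed shape for illustration:

```json
{
  "type": "heuristic",
  "rules": {
    "and": [
      { "check": "not_empty" },
      { "check": "min_length", "value": 20 },
      {
        "or": [
          { "check": "not_contains", "value": "I cannot help" },
          { "check": "max_length", "value": 4000 }
        ]
      }
    ]
  }
}
```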

Evaluation target

JSON schema, regex, and heuristic evaluators evaluate the assistant response by default. You can change the evaluation target to evaluate other fields instead:

| Target | Description |
|---|---|
| response | Assistant response text (default). |
| input | User input text. |
| system_prompt | The system prompt. |

Set the target in the Evaluate against dropdown when you create a non-LLM-judge evaluator, or set the target field in the evaluator config JSON.

This is useful for lightweight detection of injected content in generation input without requiring an LLM judge.
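For example, a regex evaluator pointed at the input can flag likely injection attempts. The target and reject fields are documented; the patterns key and overall shape are assumptions for illustration:

```json
{
  "type": "regex",
  "target": "input",
  "patterns": ["(?i)ignore (all )?previous instructions"],
  "reject": false
}
```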

Pass verdict

Bool-type evaluators (heuristic, regex, JSON schema, and LLM judge bool outputs) record a pass/fail verdict only when you explicitly configure a pass_value on the output key. When pass_value is omitted, the score is recorded but no verdict is determined.
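As a sketch, an output-key config with an explicit pass verdict might look like this. Only pass_value is documented; the surrounding key structure is an assumption:

```json
{
  "output": {
    "key": "matches_schema",
    "type": "bool",
    "pass_value": true
  }
}
```

Without the pass_value line, the boolean score would still be recorded, but no pass/fail verdict would be determined.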

Template variables

LLM judge prompts support these template variables:

| Variable | Content |
|---|---|
| {{latest_user_message}} | Most recent user message |
| {{user_history}} | All user messages |
| {{assistant_response}} | Assistant output |
| {{assistant_thinking}} | Thinking/reasoning content |
| {{system_prompt}} | System prompt |
| {{tool_calls}} | Tool call details |
| {{tool_results}} | Tool result details |
| {{tools}} | Available tool definitions |
| {{call_error}} | Error information |
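A user_prompt template that interpolates these variables could look like the following example (the grading instructions themselves are illustrative):

```
Evaluate whether the assistant answered the user's question.

User question:
{{latest_user_message}}

Assistant answer:
{{assistant_response}}

Respond with "pass" or "fail" and a one-sentence reason.
```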

Create rules

Rules connect evaluators to generation traffic. Each rule has:

  • Selector — which generations to evaluate:
    • user_visible_turn — assistant text responses without tool calls.
    • all_assistant_generations — any assistant output.
    • tool_call_steps — generations with tool calls.
    • errored_generations — generations with errors.
  • Match filters — additional criteria to narrow the selection.
  • Sampling rate — percentage of matching generations to evaluate.
  • Evaluator — the evaluator to run.
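Putting these parts together, a rule might be sketched as the JSON below. The selector values are documented; the field names (match, sampling_rate, evaluator_id) and the overall shape are assumptions for illustration:

```json
{
  "selector": "user_visible_turn",
  "match": { "model": "gpt-4o" },
  "sampling_rate": 0.1,
  "evaluator_id": "helpfulness-judge"
}
```

With a sampling rate of 0.1, roughly one in ten matching generations would be evaluated, which keeps judge spend bounded on high-volume traffic.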

Next steps