Set up online evaluation
Online evaluation lets you score your agents’ live production traffic automatically. This guide walks you through creating your first evaluator and rule.
Before you begin
- The eval worker is enabled (`SIGIL_EVAL_WORKER_ENABLED=true`).
- At least one judge provider is configured. Refer to Configure evaluation for provider setup.
- You have the AI Observability Admin role.
Use the evaluation wizard
The evaluation overview page includes a setup wizard that walks you through creating your first evaluator and rule:
- Navigate to Evaluation in the AI Observability plugin.
- Click Create evaluator on the overview card.
- Choose an evaluator type from the card grid:
- LLM judge — scores responses using an LLM.
- JSON schema — validates response structure.
- Regex — matches response patterns.
- Heuristic — applies rule-based checks.
- Configure the evaluator settings and scoring criteria.
- Choose a rule selector and sampling rate.
- Optionally set up an alert with a pass-rate threshold.
- Review and activate.
You can also create evaluators and rules individually outside the wizard.
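To build intuition for the non-LLM evaluator types, they can be pictured as simple pass/fail scoring functions. The sketch below is illustrative only; the function names and simplified checks are assumptions, not the plugin's actual implementation:

```python
import json
import re

# Illustrative sketches of the non-LLM evaluator types.
# The real plugin implementations are internal; these names are hypothetical.

def regex_evaluator(text: str, pattern: str) -> bool:
    """Regex: pass if the text matches the pattern."""
    return re.search(pattern, text) is not None

def json_schema_evaluator(text: str, required_keys: list[str]) -> bool:
    """JSON schema (simplified): parse and check required top-level keys."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(k in data for k in required_keys)

def heuristic_evaluator(text: str, max_len: int = 2000) -> bool:
    """Heuristic: a rule-based check, e.g. response length within bounds."""
    return 0 < len(text) <= max_len

print(regex_evaluator("score: 5", r"\d"))                     # True
print(json_schema_evaluator('{"answer": "hi"}', ["answer"]))  # True
```

The LLM judge type differs in that it delegates scoring to a configured judge provider rather than a local rule.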
Write an LLM judge prompt
For an LLM judge evaluator, write a prompt that describes the scoring criteria. Use template variables to inject generation data:
```
You are evaluating the quality of an AI assistant response.

User message:
{{latest_user_message}}

Assistant response:
{{assistant_response}}

Rate the response quality on a scale of 1-5, where 1 is poor and 5 is excellent.
Consider accuracy, helpfulness, and clarity.
Return only the numeric score.
```

Create a rule
Rules connect evaluators to generation traffic:
- In Evaluation, click Create rule.
- Choose a selector:
- User visible turn — assistant text responses without tool calls.
- All assistant generations — any assistant output.
- Tool call steps — generations containing tool calls.
- Errored generations — generations with errors.
- Set a sampling rate (for example, 10% to evaluate 1 in 10 matching generations).
- Attach your evaluator.
- Click Save.
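A sampling rate of r means each matching generation has probability r of being evaluated. A minimal sketch of how such a gate behaves (this is an assumption about the mechanism for illustration, not the plugin's code):

```python
import random

def should_evaluate(sampling_rate: float, rng: random.Random) -> bool:
    """Return True for roughly `sampling_rate` of calls (rate in 0.0-1.0)."""
    return rng.random() < sampling_rate

# At a 10% rate, roughly 1 in 10 matching generations is evaluated.
rng = random.Random(42)
sampled = sum(should_evaluate(0.10, rng) for _ in range(10_000))
print(sampled)  # roughly 1000 out of 10,000
```

Probabilistic sampling keeps judge costs proportional to the rate while still giving a statistically representative picture of quality.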
Set the evaluation target
For non-LLM-judge evaluators (JSON schema, regex, and heuristic), you can choose which text the evaluator analyzes:
- Response (default) — the assistant’s output text.
- Input — the user’s input text.
- System prompt — the system prompt.
Select the target in the Evaluate against dropdown when you create or edit an evaluator. This is useful for detecting prompt injection or validating input structure without an LLM judge.
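For example, a regex evaluator targeting Input could flag common prompt-injection phrasing. The pattern below is a rough, illustrative sketch and is nowhere near exhaustive:

```python
import re

# Illustrative pattern only; real prompt-injection detection needs
# far broader coverage than two phrasings.
INJECTION_PATTERN = re.compile(
    r"(ignore (all )?previous instructions|disregard your system prompt)",
    re.IGNORECASE,
)

def input_looks_suspicious(user_input: str) -> bool:
    return INJECTION_PATTERN.search(user_input) is not None

print(input_looks_suspicious("Please ignore previous instructions and ..."))  # True
print(input_looks_suspicious("What is the weather today?"))                   # False
```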
Create alerts from evaluation rules
You can create Grafana alert rules directly from an evaluation rule:
- Open a rule’s detail page.
- In the Alerts section, click Add alert.
- Set a pass-rate threshold (for example, alert when fewer than 90% of evaluations pass).
- Choose a contact point for notifications.
- Click Create.
Alert rules are created as non-provisioned, so you can edit them afterward in the standard Grafana Alerting UI. They appear as clickable rows on the rule's detail page, and the rules table shows an alert count column.
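The pass-rate condition amounts to comparing passes over total evaluations against the threshold. A small sketch of that check (hypothetical helper, not the Alerting rule format):

```python
def alert_should_fire(passed: int, total: int, threshold: float = 0.90) -> bool:
    """Fire when the pass rate drops below the threshold (default 90%)."""
    if total == 0:
        return False  # no evaluations yet; nothing to alert on
    return passed / total < threshold

print(alert_should_fire(85, 100))  # True: 85% is below the 90% threshold
print(alert_should_fire(95, 100))  # False: 95% is above it
```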
Monitor results
Evaluation scores appear in several places:
- Conversation detail: scores displayed next to each evaluated generation, with pass/fail/neutral counts in the metrics bar.
- Evaluation overview: live stats, rules table, and score distributions.
- Analytics dashboards: quality metrics alongside cost and performance data.
Use the evaluation dashboard to identify quality regressions after prompt changes, compare evaluator results across agent versions, and spot patterns in low-scoring generations.