
GenAI evaluations configuration

Configure OpenLIT evaluations to monitor AI model quality, safety, and performance with customizable thresholds, providers, and evaluation parameters.

Basic configuration

Provider selection

Choose between OpenAI and Anthropic for evaluation services:

Python
import openlit

# OpenAI-based evaluations (default)
evals = openlit.evals.All(provider="openai")

# Anthropic-based evaluations
evals = openlit.evals.All(provider="anthropic")

API key configuration

Set your evaluation provider API key:

Python
import os

# Option 1: Environment variable (recommended)
# Set the key for the provider you plan to use
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-api-key"

# Option 2: Direct parameter
evals = openlit.evals.All(provider="openai", api_key="your-api-key")

Model selection

Specify which model to use for evaluations:

Python
# OpenAI models
evals = openlit.evals.All(provider="openai", model="gpt-4o")
evals = openlit.evals.All(provider="openai", model="gpt-4o-mini")

# Anthropic models  
evals = openlit.evals.All(provider="anthropic", model="claude-3-5-sonnet-20241022")
evals = openlit.evals.All(provider="anthropic", model="claude-3-5-haiku-20241022")
evals = openlit.evals.All(provider="anthropic", model="claude-3-opus-20240229")

Advanced configuration

Threshold scoring

Configure the score threshold for determining evaluation verdicts:

Python
import openlit

# Default threshold is 0.5
evals = openlit.evals.All(provider="openai", threshold_score=0.7)

# Different thresholds for different evaluation metrics
hallucination_eval = openlit.evals.Hallucination(threshold_score=0.6)
bias_eval = openlit.evals.Bias(threshold_score=0.8)
toxicity_eval = openlit.evals.Toxicity(threshold_score=0.9)
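
Each evaluator assigns a score between 0 and 1, and threshold_score is the cut-off that turns that score into a verdict: scores at or above the threshold are reported as a detection. The sketch below uses illustrative prompt, context, and response values; the exact fields on the returned result depend on your OpenLIT version.

Python
import openlit

# Stricter evaluator: scores of 0.7 or higher count as a detection
evals = openlit.evals.All(provider="openai", threshold_score=0.7)

result = evals.measure(
    prompt="Where is the Eiffel Tower?",
    contexts=["The Eiffel Tower is in Paris and was completed in 1889."],
    text="The Eiffel Tower is in Berlin."
)

# The result carries the score alongside the verdict and an explanation;
# field names may vary between OpenLIT versions.
print(result)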

Custom base URL

For custom API endpoints or proxies:

Python
import openlit

# Custom OpenAI-compatible endpoint
evals = openlit.evals.All(
    provider="openai",
    base_url="https://your-custom-endpoint.com/v1"
)

Custom categories

Add custom evaluation categories beyond the defaults:

Python
import openlit

# Add custom categories for specialized detection
custom_categories = {
    "spam_detection": "Identify promotional or spam content",
    "factual_verification": "Verify claims against known facts",
    "technical_accuracy": "Check technical information correctness"
}

evals = openlit.evals.All(
    provider="openai",
    custom_categories=custom_categories
)
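
Custom categories are evaluated alongside the built-in ones for each evaluation type, so a match can surface in the result's classification. A minimal sketch with illustrative inputs (the result's field names depend on your OpenLIT version):

Python
import openlit

custom_categories = {
    "spam_detection": "Identify promotional or spam content"
}

evals = openlit.evals.All(
    provider="openai",
    custom_categories=custom_categories
)

# Promotional text like this should be picked up under the custom
# spam_detection category and reflected in the result's classification.
result = evals.measure(
    prompt="Summarize the article",
    contexts=["The article compares open-source observability tools."],
    text="BUY NOW!!! Limited-time offer, click here to claim your free prize!"
)
print(result)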

Metrics collection

Enable OpenTelemetry metrics collection for evaluations:

Python
import openlit

# Initialize OpenLIT for metrics collection first
openlit.init()

# Enable metrics collection for evaluations
evals = openlit.evals.All(
    provider="openai",
    collect_metrics=True
)

# Evaluation metrics are now sent to Grafana Cloud along with each result
result = evals.measure(prompt=prompt, contexts=contexts, text=text)
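
For the metrics to reach Grafana Cloud, openlit.init() needs an OTLP destination. One option is the standard OpenTelemetry environment variables (openlit.init() also accepts otlp_endpoint and otlp_headers arguments); the endpoint and credentials below are placeholders for your own stack's values.

Python
import os
import openlit

# Placeholders: substitute your Grafana Cloud stack's OTLP gateway URL and token
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://otlp-gateway-<zone>.grafana.net/otlp"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = "Authorization=Basic%20<base64-encoded-instance-id:token>"

openlit.init()  # picks up the OTLP settings from the environment

evals = openlit.evals.All(provider="openai", collect_metrics=True)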

Evaluation-specific configuration

Hallucination detection

Python
import openlit

hallucination_detector = openlit.evals.Hallucination(
    provider="openai",
    model="gpt-4o",
    threshold_score=0.6,
    custom_categories={
        "medical_misinformation": "Incorrect medical or health information",
        "historical_inaccuracy": "Incorrect historical facts or dates"
    },
    collect_metrics=True
)

# Usage with detailed context
result = hallucination_detector.measure(
    prompt="Explain the discovery of penicillin",
    contexts=[
        "Alexander Fleming discovered penicillin in 1928",
        "Penicillin was discovered accidentally when Fleming noticed mold killing bacteria"
    ],
    text="Fleming invented penicillin in 1925 as a deliberate research project"
)

Bias detection

Python
import openlit

bias_detector = openlit.evals.Bias(
    provider="anthropic",
    model="claude-3-5-sonnet-20241022",
    threshold_score=0.7,
    custom_categories={
        "professional_bias": "Stereotypes about professional roles",
        "geographic_bias": "Assumptions based on geographic location"
    },
    collect_metrics=True
)

# Usage for bias detection
result = bias_detector.measure(
    prompt="Describe a typical nurse",
    text="Nurses are usually women who are very caring and emotional"
)

Toxicity detection

Python
import openlit

toxicity_detector = openlit.evals.Toxicity(
    provider="openai", 
    model="gpt-4o-mini",
    threshold_score=0.8,
    custom_categories={
        "cyberbullying": "Online harassment or bullying behavior",
        "discriminatory_language": "Language that discriminates against groups"
    },
    collect_metrics=True
)

# Usage for toxicity detection
result = toxicity_detector.measure(
    prompt="Provide feedback on this comment",
    text="Your opinion is worthless and you should be ashamed"
)