
GenAI evaluations configuration

Configure OpenLIT evaluations to monitor AI model quality, safety, and performance with customizable thresholds, providers, and evaluation parameters.

Basic configuration

Provider selection

Choose between OpenAI and Anthropic for evaluation services:

Python
import openlit

# OpenAI-based evaluations (default)
evals = openlit.evals.All(provider="openai")

# Anthropic-based evaluations
evals = openlit.evals.All(provider="anthropic")

API key configuration

Set your evaluation provider API key:

Python
import os

# Option 1: Environment variable (recommended)
# Set the key for the provider you plan to use
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-api-key"

# Option 2: Direct parameter
evals = openlit.evals.All(provider="openai", api_key="your-api-key")

Model selection

Specify which model to use for evaluations:

Python
# OpenAI models
evals = openlit.evals.All(provider="openai", model="gpt-4o")
evals = openlit.evals.All(provider="openai", model="gpt-4o-mini")

# Anthropic models  
evals = openlit.evals.All(provider="anthropic", model="claude-3-5-sonnet-20241022")
evals = openlit.evals.All(provider="anthropic", model="claude-3-5-haiku-20241022")
evals = openlit.evals.All(provider="anthropic", model="claude-3-opus-20240229")

Advanced configuration

Threshold scoring

Configure the score threshold for determining evaluation verdicts:

Python
import openlit

# Default threshold is 0.5
evals = openlit.evals.All(provider="openai", threshold_score=0.7)

# Different thresholds for different evaluation metrics
hallucination_eval = openlit.evals.Hallucination(threshold_score=0.6)
bias_eval = openlit.evals.Bias(threshold_score=0.8)
toxicity_eval = openlit.evals.Toxicity(threshold_score=0.9)
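
Each evaluator assigns a score between 0 and 1, and threshold_score is the cut-off that turns that score into a verdict: scores at or above the threshold are reported as a detection. The sketch below uses illustrative prompt, context, and response values; the exact fields on the returned result depend on your OpenLIT version.

Python
import openlit

# Stricter evaluator: scores of 0.7 or higher count as a detection
evals = openlit.evals.All(provider="openai", threshold_score=0.7)

result = evals.measure(
    prompt="Where is the Eiffel Tower?",
    contexts=["The Eiffel Tower is in Paris and was completed in 1889."],
    text="The Eiffel Tower is in Berlin."
)

# The result carries the score alongside the verdict and an explanation;
# field names may vary between OpenLIT versions.
print(result)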

Custom base URL

For custom API endpoints or proxies:

Python
import openlit

# Custom OpenAI-compatible endpoint
evals = openlit.evals.All(
    provider="openai",
    base_url="https://your-custom-endpoint.com/v1"
)

Custom categories

Add custom evaluation categories beyond the defaults:

Python
import openlit

# Add custom categories for specialized detection
custom_categories = {
    "spam_detection": "Identify promotional or spam content",
    "factual_verification": "Verify claims against known facts",
    "technical_accuracy": "Check technical information correctness"
}

evals = openlit.evals.All(
    provider="openai",
    custom_categories=custom_categories
)
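
Custom categories are evaluated alongside the built-in ones for each evaluation type, so a match can surface in the result's classification. A minimal sketch with illustrative inputs (the result's field names depend on your OpenLIT version):

Python
import openlit

custom_categories = {
    "spam_detection": "Identify promotional or spam content"
}

evals = openlit.evals.All(
    provider="openai",
    custom_categories=custom_categories
)

# Promotional text like this should be picked up under the custom
# spam_detection category and reflected in the result's classification.
result = evals.measure(
    prompt="Summarize the article",
    contexts=["The article compares open-source observability tools."],
    text="BUY NOW!!! Limited-time offer, click here to claim your free prize!"
)
print(result)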

Metrics collection

Enable OpenTelemetry metrics collection for evaluations:

Python
import openlit

# Initialize OpenLIT for metrics collection first
openlit.init()

# Enable metrics collection for evaluations
evals = openlit.evals.All(
    provider="openai",
    collect_metrics=True
)

# Evaluation metrics are now sent to Grafana Cloud along with each result
result = evals.measure(prompt=prompt, contexts=contexts, text=text)
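
For the metrics to reach Grafana Cloud, openlit.init() needs an OTLP destination. One option is the standard OpenTelemetry environment variables (openlit.init() also accepts otlp_endpoint and otlp_headers arguments); the endpoint and credentials below are placeholders for your own stack's values.

Python
import os
import openlit

# Placeholders: substitute your Grafana Cloud stack's OTLP gateway URL and token
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://otlp-gateway-<zone>.grafana.net/otlp"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = "Authorization=Basic%20<base64-encoded-instance-id:token>"

openlit.init()  # picks up the OTLP settings from the environment

evals = openlit.evals.All(provider="openai", collect_metrics=True)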

Evaluation-specific configuration

Hallucination detection

Python
import openlit

hallucination_detector = openlit.evals.Hallucination(
    provider="openai",
    model="gpt-4o",
    threshold_score=0.6,
    custom_categories={
        "medical_misinformation": "Incorrect medical or health information",
        "historical_inaccuracy": "Incorrect historical facts or dates"
    },
    collect_metrics=True
)

# Usage with detailed context
result = hallucination_detector.measure(
    prompt="Explain the discovery of penicillin",
    contexts=[
        "Alexander Fleming discovered penicillin in 1928",
        "Penicillin was discovered accidentally when Fleming noticed mold killing bacteria"
    ],
    text="Fleming invented penicillin in 1925 as a deliberate research project"
)

Bias detection

Python
import openlit

bias_detector = openlit.evals.Bias(
    provider="anthropic",
    model="claude-3-5-sonnet-20241022",
    threshold_score=0.7,
    custom_categories={
        "professional_bias": "Stereotypes about professional roles",
        "geographic_bias": "Assumptions based on geographic location"
    },
    collect_metrics=True
)

# Usage for bias detection
result = bias_detector.measure(
    prompt="Describe a typical nurse",
    text="Nurses are usually women who are very caring and emotional"
)

Toxicity detection

Python
import openlit

toxicity_detector = openlit.evals.Toxicity(
    provider="openai", 
    model="gpt-4o-mini",
    threshold_score=0.8,
    custom_categories={
        "cyberbullying": "Online harassment or bullying behavior",
        "discriminatory_language": "Language that discriminates against groups"
    },
    collect_metrics=True
)

# Usage for toxicity detection
result = toxicity_detector.measure(
    prompt="Provide feedback on this comment",
    text="Your opinion is worthless and you should be ashamed"
)