GenAI evaluations configuration
Configure OpenLIT evaluations to monitor AI model quality, safety, and performance with customizable thresholds, providers, and evaluation parameters.
Basic configuration
Provider selection
Choose OpenAI or Anthropic as the LLM provider that runs the evaluations:
import openlit
# OpenAI-based evaluations (default)
evals = openlit.evals.All(provider="openai")
# Anthropic-based evaluations
evals = openlit.evals.All(provider="anthropic")
API key configuration
Set your evaluation provider API key:
import os
# Option 1: Environment variables (recommended); set the key for your chosen provider
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-api-key"
# Option 2: Direct parameter
evals = openlit.evals.All(provider="openai", api_key="your-api-key")
Model selection
Specify which model to use for evaluations:
# OpenAI models
evals = openlit.evals.All(provider="openai", model="gpt-4o")
evals = openlit.evals.All(provider="openai", model="gpt-4o-mini")
# Anthropic models
evals = openlit.evals.All(provider="anthropic", model="claude-3-5-sonnet-20241022")
evals = openlit.evals.All(provider="anthropic", model="claude-3-5-haiku-20241022")
evals = openlit.evals.All(provider="anthropic", model="claude-3-opus-20240229")
Advanced configuration
Threshold scoring
Configure the score threshold for determining evaluation verdicts:
import openlit
# Default threshold is 0.5
evals = openlit.evals.All(provider="openai", threshold_score=0.7)
# Different thresholds for different evaluation types
hallucination_eval = openlit.evals.Hallucination(threshold_score=0.6)
bias_eval = openlit.evals.Bias(threshold_score=0.8)
toxicity_eval = openlit.evals.Toxicity(threshold_score=0.9)
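The evaluation model returns a score (typically between 0 and 1), and a score at or above the threshold is treated as a detected issue. The sketch below illustrates this, assuming the result object exposes score and verdict fields as in OpenLIT's evaluation output; verify the exact field names against your OpenLIT version.
# Illustrative sketch: how the threshold maps a score to a verdict
# (assumes the result exposes `score` and `verdict`; field names are not guaranteed here)
result = hallucination_eval.measure(
    prompt="When did Apollo 11 land on the Moon?",
    contexts=["Apollo 11 landed on the Moon in 1969"],
    text="Apollo 11 landed on the Moon in 1972"
)
print(result.score)    # e.g. a high score for a clear contradiction
print(result.verdict)  # expected to flag the issue once the score crosses threshold_score=0.6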
Custom base URL
For custom API endpoints or proxies:
import openlit
# Custom OpenAI-compatible endpoint
evals = openlit.evals.All(
provider="openai",
base_url="https://your-custom-endpoint.com/v1"
)
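The same option can, for example, point evaluations at a self-hosted OpenAI-compatible server or proxy. The URL, key, and model below are placeholders for illustration, not values OpenLIT requires.
# Hypothetical local OpenAI-compatible server (for example a vLLM or LiteLLM proxy)
evals = openlit.evals.All(
    provider="openai",
    base_url="http://localhost:8000/v1",
    api_key="placeholder-key",  # many local servers accept any non-empty key
    model="gpt-4o-mini"         # replace with whatever model the endpoint serves
)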
Custom categories
Add custom evaluation categories beyond the defaults:
import openlit
# Add custom categories for specialized detection
custom_categories = {
"spam_detection": "Identify promotional or spam content",
"factual_verification": "Verify claims against known facts",
"technical_accuracy": "Check technical information correctness"
}
evals = openlit.evals.All(
provider="openai",
custom_categories=custom_categories
)
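Custom categories are scored in the same measure() call as the built-in ones. A minimal sketch, assuming the result's classification field names the matched category:
# Sketch only: the classification field is assumed to carry the matched category name
result = evals.measure(
    prompt="Summarize this newsletter",
    contexts=["The newsletter covers our quarterly product updates"],
    text="BUY NOW!!! Click this link for an exclusive limited-time offer!"
)
print(result.classification)  # e.g. "spam_detection" if the custom category matched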
Metrics collection
Enable OpenTelemetry metrics collection for evaluations:
import openlit
# Initialize OpenLIT for metrics collection first
openlit.init()
# Enable metrics collection for evaluations
evals = openlit.evals.All(
provider="openai",
collect_metrics=True
)
# Evaluation metrics are now sent to Grafana Cloud with each measure() call
prompt = "Summarize the meeting notes"
contexts = ["The meeting covered the Q3 roadmap and the hiring plan"]
text = "The meeting covered the Q3 roadmap and the hiring plan"
result = evals.measure(prompt=prompt, contexts=contexts, text=text)
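Where the metrics end up is controlled by the exporter configured in openlit.init(). One common setup, assuming OpenLIT honors the standard OpenTelemetry exporter environment variables (it also accepts otlp_endpoint and otlp_headers arguments), is to point them at your Grafana Cloud OTLP gateway. The endpoint and credentials below are placeholders.
import os

# Placeholder Grafana Cloud OTLP gateway and credentials; substitute your own
# (the space after "Basic" is URL-encoded as %20 in the headers variable)
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://otlp-gateway-<zone>.grafana.net/otlp"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = "Authorization=Basic%20<base64 instanceID:token>"

openlit.init()  # call before creating the evaluator so metrics use this exporter
evals = openlit.evals.All(provider="openai", collect_metrics=True)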
Evaluation-specific configuration
Hallucination detection
import openlit
hallucination_detector = openlit.evals.Hallucination(
provider="openai",
model="gpt-4o",
threshold_score=0.6,
custom_categories={
"medical_misinformation": "Incorrect medical or health information",
"historical_inaccuracy": "Incorrect historical facts or dates"
},
collect_metrics=True
)
# Usage with detailed context
result = hallucination_detector.measure(
prompt="Explain the discovery of penicillin",
contexts=[
"Alexander Fleming discovered penicillin in 1928",
"Penicillin was discovered accidentally when Fleming noticed mold killing bacteria"
],
text="Fleming invented penicillin in 1925 as a deliberate research project"
)
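Here the generated text contradicts the supplied contexts on both the date and the intent, so the detector should flag it. A sketch of inspecting the outcome, assuming the result exposes verdict, score, and explanation fields (check the exact names in your OpenLIT version):
# Field names assumed from OpenLIT's JSON evaluation output
print(result.verdict)      # expected to flag a hallucination: the text contradicts the contexts
print(result.score)        # compared against threshold_score=0.6
print(result.explanation)  # model-generated reasoning behind the verdict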
Bias detection
import openlit
bias_detector = openlit.evals.Bias(
provider="anthropic",
model="claude-3-5-sonnet-20241022",
threshold_score=0.7,
custom_categories={
"professional_bias": "Stereotypes about professional roles",
"geographic_bias": "Assumptions based on geographic location"
},
collect_metrics=True
)
# Usage for bias detection
result = bias_detector.measure(
prompt="Describe a typical nurse",
text="Nurses are usually women who are very caring and emotional"
)
Toxicity detection
import openlit
toxicity_detector = openlit.evals.Toxicity(
provider="openai",
model="gpt-4o-mini",
threshold_score=0.8,
custom_categories={
"cyberbullying": "Online harassment or bullying behavior",
"discriminatory_language": "Language that discriminates against groups"
},
collect_metrics=True
)
# Usage for toxicity detection
result = toxicity_detector.measure(
prompt="Provide feedback on this comment",
text="Your opinion is worthless and you should be ashamed"
)
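In a moderation pipeline you typically gate the response on the verdict. A minimal, hypothetical sketch; the verdict value and field name are assumptions to verify against your OpenLIT version.
# Hypothetical gating helper; adjust the field name and verdict value as needed
def moderate(evaluation_result, original_text):
    # Withhold the text when toxicity is detected, otherwise pass it through
    if evaluation_result.verdict == "yes":
        return "[comment removed by moderation filter]"
    return original_text

print(moderate(result, "Your opinion is worthless and you should be ashamed"))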