GenAI evaluations
GenAI Evaluations provides comprehensive monitoring for AI model quality and safety using OpenLIT’s built-in evaluation capabilities for hallucination detection, toxicity analysis, bias assessment, and automated quality scoring.
Overview
The GenAI Evaluations dashboard monitors AI model quality and safety using OpenLIT’s evaluation capabilities, providing:
- OpenTelemetry-native evaluations - Built-in metrics collection and monitoring
- LLM-powered assessments - AI-driven evaluation using OpenAI or Anthropic models
- Real-time quality scoring - Immediate feedback on content quality and safety
- Comprehensive issue detection - Detailed categorization and explanations for problems
Built-in evaluation metrics
Combined evaluation metric (openlit.evals.All)
Comprehensive evaluation that checks for all three risk types in a single call:
- Hallucination detection - Identifies factual inaccuracies and false information
- Bias assessment - Detects unfair treatment across demographics
- Toxicity detection - Flags harmful, offensive, or threatening content
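A minimal sketch of running the combined detector follows; the provider, threshold, prompt, and contexts shown here are illustrative, and an OpenAI or Anthropic API key is expected to be available in the environment:

```python
import openlit

# Single detector that scores hallucination, bias, and toxicity in one call
detector = openlit.evals.All(provider="openai", threshold_score=0.5)

result = detector.measure(
    prompt="When was Albert Einstein born?",
    contexts=["Albert Einstein was born on March 14, 1879."],
    text="Einstein was born in 1879 in Ulm, Germany.",
)
print(result)  # verdict, score, classification, and explanation
```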
Specific evaluation metrics
- openlit.evals.Hallucination - Focused hallucination detection with detailed categorization
- openlit.evals.Bias - Specialized bias detection across multiple categories
- openlit.evals.Toxicity - Targeted toxicity assessment with threat analysis
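Each specific detector follows the same measure pattern as the combined one. As a sketch, a toxicity-only check might look like this (the prompt, contexts, and text are illustrative):

```python
import openlit

# Toxicity-only detector; threshold_score controls how strict the verdict is
toxicity = openlit.evals.Toxicity(provider="openai", threshold_score=0.5)

result = toxicity.measure(
    prompt="Summarize the customer feedback.",
    contexts=["The customer reported a delayed delivery."],
    text="The support agent handled the complaint politely.",
)
```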
Evaluation categories
Hallucination types:
- factual_inaccuracy - Incorrect facts or information
- nonsensical_response - Irrelevant or unrelated content
- gibberish - Nonsensical text output
- contradiction - Conflicting information
Bias types:
gender, age, ethnicity, religion, sexual_orientation, disability, physical_appearance, socioeconomic_status
Toxicity types:
threat, hate, personal_attack, dismissive, mockery
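As an illustration, each evaluation reports a classification drawn from the categories above alongside its verdict, score, and explanation. The result below is a sketch; field names follow OpenLIT’s result format, and the values are made up:

```python
# Illustrative result for a response that contradicts the supplied context
example_result = {
    "verdict": "yes",                      # threshold exceeded
    "evaluation": "Hallucination",
    "score": 0.9,
    "classification": "factual_inaccuracy",
    "explanation": "The response states a birth year that conflicts with the context.",
}
```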
Supported providers
OpenLIT evaluations support multiple LLM providers for evaluation services:
- OpenAI
- Anthropic
Key features
OpenTelemetry integration
- Native metrics collection - Built-in OpenTelemetry metrics with collect_metrics=True
- Grafana Cloud compatibility - Direct metrics export to dashboards
- Real-time monitoring - Live evaluation results and trends
- Custom resource attributes - Enhanced context and filtering
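One way to wire this up is sketched below, assuming the exporter honors the standard OpenTelemetry environment variables; the endpoint and resource attribute values are placeholders:

```python
import os
import openlit

# Placeholder OTLP endpoint and resource attributes; substitute your own values
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://otlp-gateway.example.com/otlp"
os.environ["OTEL_RESOURCE_ATTRIBUTES"] = "service.name=chat-api,deployment.environment=production"

# collect_metrics=True records evaluation scores as OpenTelemetry metrics
detector = openlit.evals.All(provider="openai", collect_metrics=True)

result = detector.measure(
    prompt="What is the capital of France?",
    contexts=["Paris is the capital of France."],
    text="The capital of France is Paris.",
)
```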
Configurable thresholds
- Custom score thresholds - Adjust sensitivity per use case
- Provider flexibility - Switch between OpenAI and Anthropic models
- Custom categories - Add domain-specific evaluation criteria
- Batch processing - Efficient evaluation of multiple texts
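A sketch of tuning these options, assuming the threshold_score, custom_categories, and provider parameters described above; the custom category name, its dictionary format, and the sample texts are illustrative:

```python
import openlit

# Anthropic-backed bias detector with a stricter threshold and an extra,
# domain-specific category (category name and format assumed for illustration)
bias = openlit.evals.Bias(
    provider="anthropic",
    threshold_score=0.3,
    custom_categories={"regional_bias": "Unfair treatment based on geographic region"},
)

prompt = "Describe typical users of the product."
contexts = ["The product is used by people across many regions and age groups."]
candidate_responses = [
    "The product is popular with users of all backgrounds.",
    "Only people from large cities understand this product.",
]

# Batch-style evaluation: score multiple candidate responses in a loop
for text in candidate_responses:
    print(bias.measure(prompt=prompt, contexts=contexts, text=text))
```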
Production-ready features
- Automated verdicts - “yes/no” determinations based on thresholds
- Detailed explanations - Clear reasoning for each evaluation result
- Score distribution - Confidence levels from 0.0 to 1.0
- Custom base URLs - Support for enterprise API endpoints
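For enterprise endpoints, the evaluator can be pointed at a custom base URL. The sketch below assumes the base_url and model parameters described above; the URL and model name are placeholders:

```python
import openlit

# Route evaluation calls through an OpenAI-compatible enterprise gateway
detector = openlit.evals.Hallucination(
    provider="openai",
    base_url="https://llm-gateway.internal.example.com/v1",  # placeholder URL
    model="gpt-4o",                                          # placeholder model
    threshold_score=0.5,
)
```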