GenAI evaluations

GenAI Evaluations provides comprehensive monitoring of AI model quality and safety using OpenLIT’s built-in evaluation capabilities: hallucination detection, toxicity analysis, bias assessment, and automated quality scoring.

Overview

The GenAI Evaluations dashboard monitors AI model quality and safety using OpenLIT’s evaluation capabilities, providing:

  • OpenTelemetry-native evaluations - Built-in metrics collection and monitoring
  • LLM-powered assessments - AI-driven evaluation using OpenAI or Anthropic models
  • Real-time quality scoring - Immediate feedback on content quality and safety
  • Comprehensive issue detection - Detailed categorization and explanations for problems

Built-in evaluation metrics

Combined evaluation metric (openlit.evals.All)

A comprehensive evaluation that checks for all three risk types in a single call, as shown in the sketch after this list:

  • Hallucination detection - Identifies factual inaccuracies and false information
  • Bias assessment - Detects unfair treatment across demographics
  • Toxicity detection - Flags harmful, offensive, or threatening content
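
The following sketch shows the combined evaluator in use, based on OpenLIT’s documented openlit.evals.All API; the prompt, contexts, and text are illustrative, and an OpenAI API key is assumed to be set in the environment:

```python
import openlit

# Combined evaluator: checks hallucination, bias, and toxicity in one call.
# Assumes OPENAI_API_KEY is set in the environment.
evals = openlit.evals.All(provider="openai")

result = evals.measure(
    prompt="Discuss Einstein's achievements",
    contexts=["Einstein won the Nobel Prize in Physics in 1921 for the photoelectric effect"],
    text="Einstein won the Nobel Prize in 1969 for the theory of relativity",
)
print(result)  # verdict, evaluation type, score, classification, and explanation
```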

Specific evaluation metrics

  • openlit.evals.Hallucination - Focused hallucination detection with detailed categorization
  • openlit.evals.Bias - Specialized bias detection across multiple categories
  • openlit.evals.Toxicity - Targeted toxicity assessment with threat analysis
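
Each focused evaluator follows the same pattern; a minimal sketch assuming the same measure signature as the combined evaluator:

```python
import openlit

# Focused hallucination detection; the Bias and Toxicity evaluators work the same way.
hallucination = openlit.evals.Hallucination(provider="openai")

result = hallucination.measure(
    prompt="When was the Eiffel Tower completed?",
    contexts=["The Eiffel Tower was completed in 1889"],
    text="The Eiffel Tower was completed in 1925",
)
```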

Evaluation categories

Hallucination types:

  • factual_inaccuracy - Incorrect facts or information
  • nonsensical_response - Irrelevant or unrelated content
  • gibberish - Nonsensical text output
  • contradiction - Conflicting information

Bias types:

  • gender, age, ethnicity, religion, sexual_orientation
  • disability, physical_appearance, socioeconomic_status

Toxicity types:

  • threat, hate, personal_attack, dismissive, mockery
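
These labels surface in the classification field of an evaluation result. An illustrative result shape for the hallucinated Eiffel Tower answer above (values invented for this example):

```python
{
    "verdict": "yes",                        # threshold was crossed
    "evaluation": "Hallucination",           # which evaluation flagged the text
    "score": 0.9,                            # confidence from 0.0 to 1.0
    "classification": "factual_inaccuracy",  # one of the categories listed above
    "explanation": "The text says 1925, contradicting the provided context (1889).",
}
```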

Supported providers

OpenLIT can use the following LLM providers to power its evaluations:

  • OpenAI
  • Anthropic
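
Switching providers is a constructor argument; a sketch assuming OpenLIT’s documented provider, api_key, and model parameters, with placeholder values:

```python
import openlit

# Use Anthropic instead of OpenAI as the evaluation LLM (placeholder credentials).
evals = openlit.evals.All(
    provider="anthropic",
    api_key="YOUR_ANTHROPIC_API_KEY",
    model="claude-3-5-sonnet-20240620",
)
```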

Key features

OpenTelemetry integration

  • Native metrics collection - Built-in OpenTelemetry metrics with collect_metrics=True (see the sketch after this list)
  • Grafana Cloud compatibility - Direct metrics export to dashboards
  • Real-time monitoring - Live evaluation results and trends
  • Custom resource attributes - Enhanced context and filtering
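
A sketch of exporting evaluation metrics to Grafana Cloud, assuming OpenLIT’s openlit.init OTLP export and the standard OpenTelemetry environment variables; the endpoint and credentials are placeholders for your stack’s values:

```python
import os
import openlit

# Point the OTLP exporter at Grafana Cloud (placeholder endpoint and token).
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://otlp-gateway-<zone>.grafana.net/otlp"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = "Authorization=Basic <base64-encoded instance:token>"

openlit.init()  # start OpenTelemetry instrumentation and export

# collect_metrics=True emits each evaluation result as OpenTelemetry metrics.
evals = openlit.evals.All(provider="openai", collect_metrics=True)
```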

Configurable thresholds

  • Custom score thresholds - Adjust sensitivity per use case (see the sketch after this list)
  • Provider flexibility - Switch between OpenAI and Anthropic models
  • Custom categories - Add domain-specific evaluation criteria
  • Batch processing - Efficient evaluation of multiple texts
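
A sketch of tuning an evaluator, assuming OpenLIT’s documented threshold_score and custom_categories parameters; the custom category shown is illustrative:

```python
import openlit

# Require higher confidence before a "yes" verdict, and add a domain-specific category.
evals = openlit.evals.All(
    provider="openai",
    threshold_score=0.6,
    custom_categories={
        "medical_misinformation": "Health claims that contradict established clinical guidance",
    },
)
```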

Production-ready features

  • Automated verdicts - “yes/no” determinations based on thresholds
  • Detailed explanations - Clear reasoning for each evaluation result
  • Score distribution - Confidence levels from 0.0 to 1.0
  • Custom base URLs - Support for enterprise API endpoints
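
In production, verdicts typically gate what happens next; a sketch assuming the result fields shown earlier and OpenLIT’s base_url parameter, with a placeholder enterprise endpoint:

```python
import openlit

# Route evaluation calls through an OpenAI-compatible enterprise gateway (placeholder URL).
evals = openlit.evals.All(
    provider="openai",
    base_url="https://llm-gateway.internal.example.com/v1",
)

result = evals.measure(
    prompt="Summarize our refund policy",
    contexts=["Refunds are available within 30 days of purchase"],
    text="Refunds are available for up to one year",
)

# Field access assumes an object-style result; use dict access if your SDK
# version returns a plain dict.
if result.verdict == "yes":
    print(f"{result.classification} (score {result.score}): {result.explanation}")
```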

Getting started