---
title: "GenAI Evaluations Configuration | Grafana Cloud documentation"
description: "Configure GenAI Evaluations for comprehensive AI quality and safety monitoring"
---

# GenAI evaluations configuration

Configure OpenLIT evaluations to monitor AI model quality, safety, and performance with customizable thresholds, providers, and evaluation parameters.

## Basic configuration

### Provider selection

Choose between OpenAI and Anthropic as the LLM provider that runs the evaluations:


```python
import openlit

# OpenAI-based evaluations (default)
evals = openlit.evals.All(provider="openai")

# Anthropic-based evaluations
evals = openlit.evals.All(provider="anthropic")
```

### API key configuration

Set your evaluation provider API key:


```python
import os

# Option 1: Environment variables (recommended); set the key for your chosen provider
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-api-key"

# Option 2: Direct parameter
evals = openlit.evals.All(provider="openai", api_key="your-api-key")
```
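If you switch between providers, a small helper can select whichever provider has an API key set in the environment. This is a convenience sketch, not part of the OpenLIT API; the function name `pick_provider` is illustrative:

```python
import os

def pick_provider():
    """Return the first evaluation provider whose API key is set."""
    if os.environ.get("OPENAI_API_KEY"):
        return "openai"
    if os.environ.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    raise RuntimeError(
        "Set OPENAI_API_KEY or ANTHROPIC_API_KEY before creating evaluators"
    )

# Usage:
# evals = openlit.evals.All(provider=pick_provider())
```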

### Model selection

Specify which model to use for evaluations:


```python
# OpenAI models
evals = openlit.evals.All(provider="openai", model="gpt-4o")
evals = openlit.evals.All(provider="openai", model="gpt-4o-mini")

# Anthropic models
evals = openlit.evals.All(provider="anthropic", model="claude-3-5-sonnet-20241022")
evals = openlit.evals.All(provider="anthropic", model="claude-3-5-haiku-20241022")
evals = openlit.evals.All(provider="anthropic", model="claude-3-opus-20240229")
```

## Advanced configuration

### Threshold scoring

Configure the score threshold for determining evaluation verdicts:


```python
import openlit

# Default threshold is 0.5
evals = openlit.evals.All(provider="openai", threshold_score=0.7)

# Different thresholds for different evaluation metrics
hallucination_eval = openlit.evals.Hallucination(threshold_score=0.6)
bias_eval = openlit.evals.Bias(threshold_score=0.8)
toxicity_eval = openlit.evals.Toxicity(threshold_score=0.9)
```
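Conceptually, the threshold maps a numeric evaluation score to a verdict. The sketch below illustrates that mapping only; it is not OpenLIT's internal code, and it assumes higher scores indicate stronger evidence of the issue (hallucination, bias, toxicity):

```python
def verdict(score: float, threshold_score: float = 0.5) -> str:
    """Map an evaluation score to a verdict.

    Scores above the threshold yield "yes", meaning the
    issue was detected; raising the threshold makes the
    check more tolerant.
    """
    return "yes" if score > threshold_score else "no"

# The same score can pass or fail depending on the threshold:
verdict(0.6, threshold_score=0.5)  # "yes" - flagged
verdict(0.6, threshold_score=0.7)  # "no"  - below threshold
```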

### Custom base URL

For custom API endpoints or proxies:


```python
import openlit

# Custom OpenAI-compatible endpoint
evals = openlit.evals.All(
    provider="openai",
    base_url="https://your-custom-endpoint.com/v1"
)
```

### Custom categories

Add custom evaluation categories beyond the defaults:


```python
import openlit

# Add custom categories for specialized detection
custom_categories = {
    "spam_detection": "Identify promotional or spam content",
    "factual_verification": "Verify claims against known facts",
    "technical_accuracy": "Check technical information correctness"
}

evals = openlit.evals.All(
    provider="openai",
    custom_categories=custom_categories
)
```

### Metrics collection

Enable OpenTelemetry metrics collection for evaluations:


```python
import openlit

# Initialize OpenLIT for metrics collection first
openlit.init()

# Enable metrics collection for evaluations
evals = openlit.evals.All(
    provider="openai",
    collect_metrics=True
)

# Evaluation metrics are now sent to Grafana Cloud
result = evals.measure(prompt=prompt, contexts=contexts, text=text)
```

## Evaluation-specific configuration

### Hallucination detection


```python
import openlit

hallucination_detector = openlit.evals.Hallucination(
    provider="openai",
    model="gpt-4o",
    threshold_score=0.6,
    custom_categories={
        "medical_misinformation": "Incorrect medical or health information",
        "historical_inaccuracy": "Incorrect historical facts or dates"
    },
    collect_metrics=True
)

# Usage with detailed context
result = hallucination_detector.measure(
    prompt="Explain the discovery of penicillin",
    contexts=[
        "Alexander Fleming discovered penicillin in 1928",
        "Penicillin was discovered accidentally when Fleming noticed mold killing bacteria"
    ],
    text="Fleming invented penicillin in 1925 as a deliberate research project"
)
```
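Once the evaluation returns, the result can be checked programmatically, for example to gate a response before it reaches users. The sketch below assumes the result exposes `verdict`, `score`, and `explanation` fields; adjust the attribute access to match the result object your OpenLIT version returns. The helper name `gate_response` is illustrative:

```python
def gate_response(result) -> bool:
    """Return True if the text passed the evaluation (no issue detected)."""
    if result.verdict == "yes":
        print(f"Evaluation failed (score {result.score}): {result.explanation}")
        return False
    return True

# Usage (after result = hallucination_detector.measure(...)):
# if gate_response(result):
#     deliver(text)
```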

### Bias detection


```python
import openlit

bias_detector = openlit.evals.Bias(
    provider="anthropic",
    model="claude-3-5-sonnet-20241022",
    threshold_score=0.7,
    custom_categories={
        "professional_bias": "Stereotypes about professional roles",
        "geographic_bias": "Assumptions based on geographic location"
    },
    collect_metrics=True
)

# Usage for bias detection
result = bias_detector.measure(
    prompt="Describe a typical nurse",
    text="Nurses are usually women who are very caring and emotional"
)
```

### Toxicity detection


```python
import openlit

toxicity_detector = openlit.evals.Toxicity(
    provider="openai",
    model="gpt-4o-mini",
    threshold_score=0.8,
    custom_categories={
        "cyberbullying": "Online harassment or bullying behavior",
        "discriminatory_language": "Language that discriminates against groups"
    },
    collect_metrics=True
)

# Usage for toxicity detection
result = toxicity_detector.measure(
    prompt="Provide feedback on this comment",
    text="Your opinion is worthless and you should be ashamed"
)
```
