
How to monitor LLMs in production with Grafana Cloud, OpenLIT, and OpenTelemetry
Note: The world is changing all around us thanks to AI. Today, anyone and everyone can be a developer, using LLMs to create LLM-powered applications, which users can then interact with by using even more LLMs.
Observability practitioners need to adapt and they need the right tools for the job. In this series, we'll show you how to use Grafana Cloud to monitor AI applications, including workloads in production (this post), AI agents, MCP servers, and zero-code LLMs.
Moving a large language model (LLM) application from a demo to a production‑scale service raises very different questions than the ones you ask when playing with an API key in a notebook.
In production, you have to answer: How much is each model costing us? Are we keeping latency within our service‑level objectives? Are we accidentally returning hallucinations or toxic content? Is the system vulnerable to prompt‑injection attacks?
The good news is that modern observability tools such as Grafana Cloud enable you to answer these questions in a single place. In this guide, you'll learn how to set up end‑to‑end observability for generative‑AI workloads using the OpenLIT SDK and AI Observability in Grafana Cloud.
Why use Grafana Cloud for LLM observability?
AI Observability builds on Grafana Cloud's ability to visualize and query metrics, logs, traces, and profiles, and tailors it to the unique needs of AI workloads. Key capabilities include:
- Unified GenAI monitoring: AI Observability tracks model latency, throughput, and availability, and surfaces user prompts and completions so developers can understand usage patterns. It also provides real‑time cost management and token analytics so teams can see how many input/output tokens are consumed and how much each call costs.
- Quality and safety evaluations: The integration adds programmatic evaluators on top of your traces so you can create alerts for hallucinations, verify factual accuracy, and score content quality. It also monitors for toxicity, bias, and other safety issues. These evaluation signals can be used to gate deployments and alert operators when model quality drifts.
- Full‑stack observability: In addition to LLM metrics, Grafana Cloud (using OpenLIT) monitors vector database operations, Model Context Protocol (MCP) servers, and GPU performance. It tracks query latencies, resource utilization, and protocol health across the AI stack. There are also prebuilt dashboards covering five key areas: GenAI observability, GenAI evaluations, vector database observability, MCP observability, and GPU monitoring.
- Vendor‑neutral instrumentation: The integration relies on OpenTelemetry, so you can export traces and metrics to any backend. Grafana Cloud provides a managed OpenTelemetry Protocol (OTLP) gateway and fully managed Prometheus and Tempo services, eliminating the need to run your own observability stack.
These capabilities make Grafana Cloud a natural destination for AI telemetry. Combined with OpenLIT’s auto‑instrumentation, you can gain comprehensive insights with minimal code changes.
We use OpenLIT because it makes it easier to instrument AI applications with minimal setup, supporting 50+ GenAI tools, including LLMs, vector databases, and frameworks such as LangChain and CrewAI.
It is OpenTelemetry-native and follows the GenAI semantic conventions, which makes it a natural fit for Grafana Cloud and other OTel-based backends. The SDK generates traces and metrics for tokens, latency, and cost (including custom models), and the same integration can also be used to run evaluations.
Demo: How to configure Grafana Cloud to monitor a customer support bot
Let's build a practical example: a customer support chatbot that uses different LLM providers based on the query complexity. We'll monitor everything in Grafana Cloud, and by the end, you'll see how this integration can save you money on AI queries and dramatically reduce latency.

Note: if you get stuck anywhere along the way or need help with your own setup, click on the pulsar icon in the top-right corner of the Grafana Cloud UI to open a chat with Grafana Assistant, our purpose-built LLM that can help troubleshoot incidents, manage dashboards, and answer product questions.
Architecture
The high‑level architecture of an instrumented GenAI service is shown below. A router classifies incoming user messages and directs them to the most appropriate model. Every call, regardless of provider, is instrumented by OpenLIT, which generates traces and metrics. These signals are forwarded via the OTLP gateway to Grafana Cloud, where pre‑built dashboards visualize performance, cost, and quality.
User Query → Route by complexity → [GPT-4 (complex) | Claude (medium) | GPT-3.5 (simple)]
↓
OpenLIT SDK captures traces & metrics
↓
Grafana Cloud (Grafana Cloud Metrics + Grafana Cloud Traces + Grafana Cloud Logs)
↓
AI Observability Dashboards
Step 1: Install AI Observability
Start by adding AI Observability to your Grafana Cloud stack. This can be done by clicking on Connections in the left-side menu and following the steps outlined in our documentation.
This installs the five dashboards mentioned earlier (GenAI observability, GenAI evaluations, vector DB observability, MCP observability, and GPU monitoring). When metrics arrive, these dashboards automatically populate with latency histograms, token counts, cost summaries, and evaluation results.
Step 2: Install OpenLIT
Install OpenLIT and your preferred model providers via pip:
pip install openlit openai anthropic cohere
This command pulls the latest OpenLIT SDK from PyPI (v1.35.9 at the time of writing) and any client libraries you need for your models.
Step 3: Instrument your application
The simplest way to add observability is to call openlit.init() at the beginning of your application. You can optionally pass an application_name and environment to improve dashboard organization.
Below is a more realistic example than the original single‑model snippet. The router uses simple logic to choose between GPT‑3.5, Claude 3, and GPT‑4 based on message complexity. OpenLIT instruments every API call automatically. We also demonstrate how to use the evaluation and guardrail APIs to flag hallucinations and prompt‑injection attempts:
import os

import openlit
from openai import OpenAI
from anthropic import Anthropic

# Initialize OpenLIT with an application name and environment
openlit.init(application_name="support-bot", environment="production")

# Initialize clients for each model provider
openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
anthropic_client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

def route_query(message: str) -> str:
    """Select a model based on the message length/complexity."""
    words = len(message.split())
    if words < 20:
        return "gpt-3.5-turbo"            # simple queries
    elif words < 100:
        return "claude-3-haiku-20240307"  # medium complexity
    else:
        return "gpt-4-turbo"              # complex queries

def call_model(model_name: str, message: str) -> str:
    """Send the query to the selected LLM.

    OpenLIT will automatically instrument each API call.
    """
    if model_name.startswith(("gpt-3.5", "gpt-4")):
        response = openai_client.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": message},
            ],
        )
        return response.choices[0].message.content
    elif model_name.startswith("claude"):
        # Claude 3 models use the Messages API, not the legacy Completions API
        response = anthropic_client.messages.create(
            model=model_name,
            max_tokens=1024,
            messages=[{"role": "user", "content": message}],
        )
        return response.content[0].text
    else:
        raise ValueError(f"Unsupported model: {model_name}")

def chat(user_message: str) -> str:
    # Choose the right model for the query
    model = route_query(user_message)
    answer = call_model(model, user_message)

    # Optional: Evaluate the response for hallucinations
    evals = openlit.evals.Hallucination(provider="openai", api_key=os.getenv("OPENAI_API_KEY"))
    # The evaluation metric is automatically sent to the configured OTel destination
    evals.measure(prompt=user_message,
                  contexts=["Internal knowledge base"],
                  text=answer)

    # Guard against prompt injection or sensitive topics
    guard = openlit.guard.All(provider="openai", api_key=os.getenv("OPENAI_API_KEY"))
    guard.detect(text=user_message)

    return answer

if __name__ == "__main__":
    user_question = input("Ask our support bot a question: ")
    print(chat(user_question))
This example shows how to route requests to different models and still emit consistent traces and metrics. We also run a hallucination evaluator and a combined guardrail on each message. The hallucination evaluator detects factual inaccuracies, contradictions, and fabricated information, while the All guardrail simultaneously performs injection detection, sensitive‑topic filtering, and topic restriction.
Step 4: Run the application
To send data to Grafana Cloud, you need an OTLP endpoint and an API token. Log in to your Grafana Cloud stack, open the OpenTelemetry settings, generate an API token, and copy the OTLP endpoint and headers.
Export these values as environment variables before starting your application:
export OTEL_SERVICE_NAME=my-ai-app
export OTEL_DEPLOYMENT_ENVIRONMENT=production
export OTEL_EXPORTER_OTLP_ENDPOINT="https://otlp-gateway-<region>.grafana.net/otlp"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <base64-instanceID:token>"
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
python your_app.py
Replace <region> with your Grafana Cloud zone (for example, prod-us-central-0) and <base64-instanceID:token> with the base64-encoded instance ID and API token. When you run your instrumented application, OpenLIT will connect to the OTLP gateway and begin sending traces and metrics.
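If you need to build the Basic auth value yourself, it is simply the base64 encoding of the instanceID:token pair. A quick sketch (123456 and glc_example-token are placeholder values; substitute your own instance ID and token):

```shell
# Build the Basic auth value for OTEL_EXPORTER_OTLP_HEADERS.
# 123456 and glc_example-token are placeholders, not real credentials.
printf '%s' "123456:glc_example-token" | base64
# -> MTIzNDU2OmdsY19leGFtcGxlLXRva2Vu
```

Note the use of printf rather than echo, which would append a trailing newline and corrupt the encoded value.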
Step 5: Visualize in Grafana Cloud
Once data flows into Grafana, open the AI Observability dashboards. The GenAI observability dashboard visualizes request rates, latency percentiles, and cost metrics. For instance, it tracks "time to first token" and overall latency across providers and surfaces the total and average cost per request using metrics such as gen_ai_usage_cost_USD_sum and gen_ai_usage_input_tokens_total. The GenAI evaluations dashboard summarizes hallucination, bias, and toxicity detection events.
Grafana Alerting can trigger notifications when costs exceed thresholds, latency spikes, or evaluation scores cross your quality gates. Because everything is built on OpenTelemetry metrics, you can also build custom panels and alerts tailored to your use case.
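As one sketch of such an alert, a PromQL expression over the cost metric mentioned above could fire when hourly spend crosses a budget (adjust the metric name and threshold to match what your stack actually exports):

```promql
# Fire when total GenAI spend over the last hour exceeds $10
sum(increase(gen_ai_usage_cost_USD_sum[1h])) > 10
```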
The value you get
Now that we've walked through how to use AI Observability in Grafana Cloud, let's look at some hypothetical scenarios that illustrate the kind of actionable insights it surfaces to help improve application performance.
Cost optimization
Before: "Our AI costs are going up, but we don't know why."
After: You can see:
- GPT-4 accounts for 70% of costs but only 20% of queries
- Switching simple queries to GPT-3.5 saves $2,000/month
- One user is making excessive API calls
Performance monitoring
Before: "Users complain the bot is slow sometimes."
After: You discover:
- Claude has 30% lower time to first token (TTFT) than GPT-4
- Latency spikes correlate with Claude API rate limits
- 95th percentile latency is 3.2s (above your 3s SLA)
Quality assurance
Before: "Why are our users not happy with AI responses?"
After: You can see:
- Hallucinations appear in 20% of the requests
- Answer accuracy dropped from 92% to 81% after the latest prompt change
- Instruction-following failures increased to 15% for complex queries
Debugging complex issues
With distributed tracing, you can:
- Follow a request from user input → classification → LLM call → response
- See exact prompts that caused errors
- Identify which part of your pipeline is slow
- Correlate issues with specific users or time periods
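For the slow-pipeline case, a TraceQL query in Grafana Cloud Traces is a quick way to pull up the offending spans. A sketch, assuming OpenLIT tags spans with the OpenTelemetry GenAI semantic-convention attribute gen_ai.request.model:

```traceql
{ span.gen_ai.request.model = "gpt-4-turbo" && duration > 2s }
```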
Next steps
Want to go further? In the next blog in this series, we’ll show how to set this up, step by step, for an agentic AI application.
You can also learn more about AI Observability in the official docs, including setup instructions and dashboards. These resources will help you move from a basic demo to a production-ready setup for your AI applications in no time.
Grafana Cloud is the easiest way to get started with metrics, logs, traces, dashboards, and more. We have a generous forever-free tier and plans for every use case. Sign up for free now!


