
How to monitor LLMs in production with Grafana Cloud, OpenLIT, and OpenTelemetry
Note: The world is changing all around us thanks to AI. Today, anyone and everyone can be a developer, using LLMs to create LLM-powered applications, which users can then interact with by using even more LLMs.
Observability practitioners need to adapt and they need the right tools for the job. In this series, we'll show you how to use Grafana Cloud to monitor AI applications, including workloads in production (this post), AI agents, MCP servers, and zero-code LLMs.
Moving a large language model (LLM) application from a demo to a production‑scale service raises very different questions than the ones you ask when playing with an API key in a notebook.
In production, you have to answer: How much is each model costing us? Are we keeping latency within our service‑level objectives? Are we accidentally returning hallucinations or toxic content? Is the system vulnerable to prompt‑injection attacks?
The good news is that modern observability tools such as Grafana Cloud enable you to answer these questions in a single place. In this guide, you'll learn how to set up end‑to‑end observability for generative‑AI workloads using the OpenLIT SDK and AI Observability in Grafana Cloud.
Why use Grafana Cloud for LLM observability?
AI Observability builds on Grafana Cloud's ability to visualize and query metrics, logs, traces, and profiles, and tailors it to the unique needs of AI workloads. Key capabilities include:
- Unified GenAI monitoring: AI Observability tracks model latency, throughput, and availability, and surfaces user prompts and completions so developers can understand usage patterns. It also provides real‑time cost management and token analytics so teams can see how many input/output tokens are consumed and how much each call costs.
- Quality and safety evaluations: The integration adds programmatic evaluators on top of your traces so you can create alerts for hallucinations, verify factual accuracy, and score content quality. It also monitors for toxicity, bias, and other safety issues. These evaluation signals can be used to gate deployments and alert operators when model quality drifts.
- Full‑stack observability: In addition to LLM metrics, Grafana Cloud (using OpenLIT) monitors vector database operations, Model Context Protocol (MCP) servers, and GPU performance. It tracks query latencies, resource utilization, and protocol health across the AI stack. There are also prebuilt dashboards covering five key areas: GenAI observability, GenAI evaluations, vector database observability, MCP observability, and GPU monitoring.
- Vendor‑neutral instrumentation: The integration relies on OpenTelemetry, so you can export traces and metrics to any backend. Grafana Cloud provides a managed OpenTelemetry Protocol (OTLP) gateway and fully managed Prometheus and Tempo services, eliminating the need to run your own observability stack.
These capabilities make Grafana Cloud a natural destination for AI telemetry. Combined with OpenLIT’s auto‑instrumentation, you can gain comprehensive insights with minimal code changes.
We use OpenLIT because it makes it easier to instrument AI applications with minimal setup, supporting 50+ GenAI tools, including LLMs, vector databases, and frameworks such as LangChain and CrewAI.
It is OpenTelemetry-native and follows the GenAI semantic conventions, which makes it a natural fit for Grafana Cloud and other OTel-based backends. The SDK generates traces and metrics for tokens, latency, and cost (including custom models), and the same integration can also be used to run evaluations.
Demo: How to configure Grafana Cloud to monitor a customer support bot
Let's build a practical example: a customer support chatbot that uses different LLM providers based on the query complexity. We'll monitor everything in Grafana Cloud, and by the end, you'll see how this integration can save you money on AI queries and dramatically reduce latency.

Note: if you get stuck anywhere along the way or need help with your own setup, click on the pulsar icon in the top-right corner of the Grafana Cloud UI to open a chat with Grafana Assistant, our purpose-built LLM that can help troubleshoot incidents, manage dashboards, and answer product questions.
Architecture
The high‑level architecture of an instrumented GenAI service is shown below. A router classifies incoming user messages and directs them to the most appropriate model. Every call, regardless of provider, is instrumented by OpenLIT, which generates traces and metrics. These signals are forwarded via the OTLP gateway to Grafana Cloud, where pre‑built dashboards visualize performance, cost, and quality.
User Query → Route by complexity → [GPT-4 (complex) | Claude (medium) | GPT-3.5 (simple)]
↓
OpenLIT SDK captures traces & metrics
↓
Grafana Cloud (Grafana Cloud Metrics + Grafana Cloud Traces + Grafana Cloud Logs)
↓
AI Observability Dashboards
Step 1: Install AI Observability
Start by adding AI Observability to your Grafana Cloud stack. This can be done by clicking on Connections in the left-side menu and following the steps outlined in our documentation.
This installs the five dashboards mentioned earlier (GenAI observability, GenAI evaluations, vector DB observability, MCP observability, and GPU monitoring). When metrics arrive, these dashboards automatically populate with latency histograms, token counts, cost summaries, and evaluation results.
Step 2: Install OpenLIT
Install OpenLIT and your preferred model providers via pip:
pip install openlit openai anthropic cohere
This command pulls the latest OpenLIT SDK from PyPI (v1.35.9 at the time of writing) and any client libraries you need for your models.
Step 3: Instrument your application
The simplest way to add observability is to call openlit.init() at the beginning of your application. You can optionally pass an application_name and environment to improve dashboard organization.
Below is a more realistic example than the original single‑model snippet. The router uses simple logic to choose between GPT‑3.5, Claude 3, and GPT‑4 based on message complexity. OpenLIT instruments every API call automatically. We also demonstrate how to use the evaluation and guardrail APIs to flag hallucinations and prompt‑injection attempts:
import os

import openlit
from openai import OpenAI
from anthropic import Anthropic

# Initialize OpenLIT with an application name and environment
openlit.init(application_name="support-bot", environment="production")

# Initialize clients for each model provider
openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
anthropic_client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

def route_query(message: str) -> str:
    """Select a model based on the message length/complexity."""
    words = len(message.split())
    if words < 20:
        return "gpt-3.5-turbo"            # simple queries
    elif words < 100:
        return "claude-3-haiku-20240307"  # medium complexity
    else:
        return "gpt-4-turbo"              # complex queries

def call_model(model_name: str, message: str) -> str:
    """Send the query to the selected LLM.

    OpenLIT will automatically instrument each API call.
    """
    if model_name.startswith(("gpt-3.5", "gpt-4")):
        response = openai_client.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": message},
            ],
        )
        return response.choices[0].message.content
    elif model_name.startswith("claude"):
        # Claude 3 models use the Messages API, not the legacy Completions API
        response = anthropic_client.messages.create(
            model=model_name,
            max_tokens=1024,
            messages=[{"role": "user", "content": message}],
        )
        return response.content[0].text
    else:
        raise ValueError(f"Unsupported model: {model_name}")

def chat(user_message: str) -> str:
    # Choose the right model for the query
    model = route_query(user_message)
    answer = call_model(model, user_message)

    # Optional: Evaluate the response for hallucinations
    evals = openlit.evals.Hallucination(provider="openai", api_key=os.getenv("OPENAI_API_KEY"))
    # The evaluation metric is automatically sent to the configured OTel destination
    evals.measure(prompt=user_message,
                  contexts=["Internal knowledge base"],
                  text=answer)

    # Guard against prompt injection or sensitive topics
    guard = openlit.guard.All(provider="openai", api_key=os.getenv("OPENAI_API_KEY"))
    guard.detect(text=user_message)

    return answer

if __name__ == "__main__":
    user_question = input("Ask our support bot a question: ")
    print(chat(user_question))
This example shows how to route requests to different models and still emit consistent traces and metrics. We also run a hallucination evaluator and a combined guardrail on each message. The hallucination evaluator detects factual inaccuracies, contradictions, and fabricated information, while the All guardrail simultaneously performs injection detection, sensitive‑topic filtering, and topic restriction.
Step 4: Run the application
To send data to Grafana Cloud, you need an OTLP endpoint and an API token. Log in to your Grafana Cloud stack, open the OpenTelemetry settings, generate an API token, and copy the OTLP endpoint and headers.
Export these values as environment variables before starting your application:
export OTEL_SERVICE_NAME=my-ai-app
export OTEL_DEPLOYMENT_ENVIRONMENT=production
export OTEL_EXPORTER_OTLP_ENDPOINT="https://otlp-gateway-<region>.grafana.net/otlp"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic <base64-instanceID:token>"
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
python your_app.py
Replace <region> with your Grafana Cloud zone (for example, prod-us-central-0) and <base64-instanceID:token> with the base64-encoded instance ID and API token. When you run your instrumented application, OpenLIT will connect to the OTLP gateway and begin sending traces and metrics.
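If you need to build the Basic auth value yourself, it is simply the base64 encoding of the instanceID:token pair. A quick sketch (123456 and glc_example-token are placeholder values; substitute your own instance ID and token):

```shell
# Build the Basic auth value for OTEL_EXPORTER_OTLP_HEADERS.
# 123456 and glc_example-token are placeholders, not real credentials.
printf '%s' "123456:glc_example-token" | base64
# -> MTIzNDU2OmdsY19leGFtcGxlLXRva2Vu
```

Note the use of printf rather than echo, which would append a trailing newline and corrupt the encoded value.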
Step 5: Visualize in Grafana Cloud
Once data flows into Grafana, open the AI Observability dashboards. The GenAI observability dashboard visualizes request rates, latency percentiles, and cost metrics. For instance, it tracks "time to first token" and overall latency across providers and surfaces the total and average cost per request using metrics such as gen_ai_usage_cost_USD_sum and gen_ai_usage_input_tokens_total. The GenAI evaluations dashboard summarizes hallucination, bias, and toxicity detection events.
Grafana Alerting can trigger notifications when costs exceed thresholds, latency spikes, or evaluation scores cross your quality gates. Because everything is built on OpenTelemetry metrics, you can also build custom panels and alerts tailored to your use case.
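As one sketch of such an alert, a PromQL expression over the cost metric mentioned above could fire when hourly spend crosses a budget (adjust the metric name and threshold to match what your stack actually exports):

```promql
# Fire when total GenAI spend over the last hour exceeds $10
sum(increase(gen_ai_usage_cost_USD_sum[1h])) > 10
```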
The value you get
Now that we've walked through how to use AI Observability in Grafana Cloud, let's look at some hypothetical scenarios that illustrate the kind of actionable insights it surfaces to help improve application performance.
Cost optimization
Before: "Our AI costs are going up, but we don't know why."
After: You can see:
- GPT-4 accounts for 70% of costs but only 20% of queries
- Switching simple queries to GPT-3.5 saves $2,000/month
- One user is making excessive API calls
Performance monitoring
Before: "Users complain the bot is slow sometimes."
After: You discover:
- Claude has 30% lower time to first token (TTFT) than GPT-4
- Latency spikes correlate with Claude API rate limits
- 95th percentile latency is 3.2s (above your 3s SLA)
Quality assurance
Before: "Why are our users not happy with AI responses?"
After: You can see:
- Hallucinations appear in 20% of the requests
- Answer accuracy dropped from 92% to 81% after the latest prompt change
- Instruction-following failures increased to 15% for complex queries
Debugging complex issues
With distributed tracing, you can:
- Follow a request from user input → classification → LLM call → response
- See exact prompts that caused errors
- Identify which part of your pipeline is slow
- Correlate issues with specific users or time periods
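For the slow-pipeline case, a TraceQL query in Grafana Cloud Traces is a quick way to pull up the offending spans. A sketch, assuming OpenLIT tags spans with the OpenTelemetry GenAI semantic-convention attribute gen_ai.request.model:

```traceql
{ span.gen_ai.request.model = "gpt-4-turbo" && duration > 2s }
```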
Next steps
Want to go further? In the next blog in this series, we’ll show how to set this up, step by step, for an agentic AI application.
You can also learn more about AI Observability in the official docs, including setup instructions and dashboards. These resources will help you move from a basic demo to a production-ready setup for your AI applications in no time.
Grafana Cloud is the easiest way to get started with metrics, logs, traces, dashboards, and more. We have a generous forever-free tier and plans for every use case. Sign up for free now!


