<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Grafana Labs blog on Grafana Labs</title><link>https://grafana.com/blog/</link><description>Recent content in Grafana Labs blog on Grafana Labs</description><generator>Hugo -- gohugo.io</generator><language>en</language><atom:link href="/blog/index.xml" rel="self" type="application/rss+xml"/><item><title>Business metrics in Grafana Cloud: Get an AI assist to help securely analyze your data</title><link>https://grafana.com/blog/business-metrics-in-grafana-cloud-get-an-ai-assist-to-help-securely-analyze-your-data/</link><pubDate>Wed, 08 Apr 2026 18:26:00</pubDate><author>Matt Wimpelberg</author><guid>https://grafana.com/blog/business-metrics-in-grafana-cloud-get-an-ai-assist-to-help-securely-analyze-your-data/</guid><description>&lt;p>For modern businesses, the data landscape demands security &lt;em>and&lt;/em> flexibility. &lt;/p>&lt;p>You need to connect your observability platform to rich, proprietary datasets that often reside in private networks without compromising security or managing complex network infrastructure. You may also face an extra layer of complexity in order to effectively query and visualize that data. Luckily, modern artificial intelligence tools have made these previously complicated processes much simpler.&lt;/p>&lt;p>This is where &lt;strong>Grafana Cloud's private data source connect (&lt;u>&lt;a href="https://grafana.com/docs/grafana-cloud/connect-externally-hosted/private-data-source-connect/">PDC&lt;/a>&lt;/u>)&lt;/strong> truly shines, offering a secure, elegant solution to bring relational data like business metrics and analytics directly into your dashboards. 
This secure connection also allows Grafana Assistant to access the data and leverage the power of AI to query and visualize it.&lt;/p>&lt;p>In this blog post, we’ll demonstrate how you can access that private data securely in Grafana Cloud and how to use our AI assistant to translate complex database queries into human-readable language and visualizations.&lt;/p>&lt;h2>Making business analytics easy with PDC, Assistant, and Postgres&lt;/h2>&lt;p>Observability was born from the need to give engineers deeper visibility into their workloads, but the scope of how it's used is quickly expanding. In fact, half of all organizations today use observability tools to track business-related metrics such as security, compliance, revenue, order tracking, customer conversions, and more, according to our &lt;u>&lt;a href="https://grafana.com/observability-survey/">2026 Observability Survey&lt;/a>&lt;/u>.&lt;/p>&lt;p>So while Grafana Cloud was built by engineers for engineers, it's also powerful and flexible enough to meet a wide range of needs, including business analytics. In this section, we'll briefly describe the tools you'll need to get started and tell you a little bit about the data source we'll use to demo this functionality.&lt;/p>&lt;h3>The power of PDC&lt;/h3>&lt;p>PDC is a key feature for enterprise-grade observability. It establishes a &lt;strong>secure, encrypted, private connection&lt;/strong> between your Grafana Cloud instance and data sources hosted within your private networks.&lt;/p>&lt;p>Here's how it works: A lightweight PDC agent is deployed in your private network. This agent creates a customer-controlled SSH tunnel back to Grafana Cloud, securely routing all queries. This critical design choice means:&lt;/p>&lt;ul>&lt;li>&lt;strong>Security first:&lt;/strong> Your databases are never exposed to the public internet. 
Traffic is encrypted end-to-end.&lt;/li>&lt;li>&lt;strong>Simplicity:&lt;/strong> You avoid the complexity of managing VPNs, NAT gateways, or intricate network-level access controls.&lt;/li>&lt;li>&lt;strong>Scalability:&lt;/strong> The agent can be deployed for high availability and easily scaled to meet your query demands.&lt;/li>&lt;li>&lt;strong>Local experience:&lt;/strong> You configure the data source in Grafana as if it were running locally within your private network.&lt;/li>&lt;/ul>&lt;h3>PostgreSQL: analytics beyond metrics&lt;/h3>&lt;p>While tools like &lt;u>&lt;a href="https://grafana.com/docs/grafana/latest/fundamentals/getting-started/first-dashboards/get-started-grafana-prometheus/">Prometheus&lt;/a>&lt;/u> are essential for scraping and querying time series metrics from infrastructure and applications, many critical business insights live in relational databases. PostgreSQL, with its robust support for complex queries, joins, and rich datasets, is the perfect complement to pure metrics-based observability.&lt;/p>&lt;p>Consider the example of the &lt;strong>World Happiness Report&lt;/strong>, which is a &lt;u>&lt;a href="https://www.worldhappiness.report/">research-based global report&lt;/a>&lt;/u> that ranks countries by how happy their people say they are, and explores the social and economic factors behind those differences. This dataset is full of relational context: countries, years, GDP per capita, life expectancy, and social support. 
Visualizing this data requires sophisticated queries that are not easily performed using traditional metrics-optimized sources.&lt;/p>&lt;p>By connecting PostgreSQL via PDC, you can:&lt;/p>&lt;ul>&lt;li>Query relational data like business metrics, customer survey results, or rich time series data&lt;/li>&lt;li>Perform complex joins to enrich time series metrics with contextual data&lt;/li>&lt;li>Unlock deep analytics directly within your Grafana dashboards&lt;/li>&lt;/ul>&lt;h3>Grafana Assistant: Query the data with natural language&lt;/h3>&lt;p>&lt;u>&lt;a href="https://grafana.com/docs/grafana-cloud/machine-learning/assistant/">Grafana Assistant&lt;/a>&lt;/u> is our LLM purpose-built for Grafana Cloud. It's an invaluable AI-powered feature that significantly accelerates the dashboard creation process, as it lets you leverage natural language prompts to generate complex queries and refine visualizations quickly. &lt;/p>&lt;p>In this demo, Grafana Assistant was used to rapidly construct and fine-tune the prebuilt dashboard, demonstrating how it can quickly turn raw PostgreSQL data into meaningful, happiness-focused visualizations.&lt;/p>&lt;h2>AI-powered dashboard generation from PostgreSQL&lt;/h2>&lt;p>When connecting to a rich data source like PostgreSQL via PDC, Assistant acts as an intelligent translator between your analytical goal and the necessary SQL.&lt;/p>&lt;p>Here's how Assistant works with the PostgreSQL data source:&lt;/p>&lt;ol>&lt;li>&lt;strong>Natural language query translation:&lt;/strong> Instead of manually writing complex SQL joins and aggregations, a user can simply prompt the assistant. 
For example: &lt;em>"Show me the trend of 'Life Ladder' score over time for the top 5 happiest countries in 2024."&lt;/em>&lt;/li>&lt;li>&lt;strong>SQL generation:&lt;/strong> The AI processes this prompt, understands the structure of the connected PostgreSQL schema (e.g., table names, column names like &lt;code>Life Ladder&lt;/code>, &lt;code>country_name&lt;/code>, &lt;code>year&lt;/code>), and automatically generates the precise SQL query required to fetch the data.&lt;/li>&lt;li>&lt;strong>Visualization suggestion and refinement:&lt;/strong> Once the query runs, Assistant analyzes the returned dataset (e.g., time series data, categorical rankings). It then suggests the most appropriate visualization type (e.g., &lt;u>&lt;a href="https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/time-series/">time series panel&lt;/a>&lt;/u> for trends, &lt;u>&lt;a href="https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/bar-chart/">bar chart&lt;/a>&lt;/u> for rankings) and generates the panel configuration, including axis labels and legends.&lt;/li>&lt;/ol>&lt;p>This capability drastically lowers the barrier to entry for users who may not be SQL experts, allowing them to rapidly prototype and deploy complex analytical dashboards based on their private relational data.&lt;/p>&lt;h2>Automated setup: A Terraform blueprint for secure observability&lt;/h2>&lt;p>To demonstrate this modern observability pattern, we've created a comprehensive Terraform repository that automates the entire setup. 
This blueprint embodies the principle of "&lt;u>&lt;a href="https://en.wikipedia.org/wiki/Infrastructure_as_code">infrastructure as code&lt;/a>&lt;/u>" for your secure data connections.&lt;/p>&lt;p>Everything you need to set things up can be found in this public GitHub repo: &lt;u>&lt;a href="https://github.com/mwimpelberg28/grafana_happiness">https://github.com/mwimpelberg28/grafana_happiness&lt;/a>&lt;/u>&lt;/p>&lt;p>The blueprint includes the following components:&lt;/p>&lt;ul>&lt;li>&lt;strong>Amazon RDS PostgreSQL instance:&lt;/strong> Provisioned securely within a private Amazon VPC, preloaded with the World Happiness Report dataset&lt;/li>&lt;li>&lt;strong>PDC agent deployment:&lt;/strong> The PDC agent is deployed within the same private VPC to establish the secure tunnel and enforce network restrictions&lt;/li>&lt;li>&lt;strong>Grafana Terraform provider:&lt;/strong> Used to programmatically create the secure PostgreSQL data source, configured specifically to route queries over the PDC tunnel&lt;/li>&lt;li>&lt;strong>Prebuilt Grafana dashboard:&lt;/strong> A ready-to-use dashboard featuring PostgreSQL queries to visualize happiness data, including:&lt;ul>&lt;li>Time series panels tracking happiness scores over time&lt;/li>&lt;li>Bar charts ranking countries by key metrics&lt;/li>&lt;li>Tables correlating happiness metrics with factors like GDP per capita and life expectancy&lt;/li>&lt;/ul>&lt;/li>&lt;/ul>&lt;h2>Getting started with PDC and PostgreSQL&lt;/h2>&lt;p>Before you start deploying the Terraform setup, you need to configure the connection credentials within Grafana Cloud. This involves setting up an access policy and a service account to authenticate the PDC agent.&lt;/p>&lt;h3>Setup instructions&lt;/h3>&lt;p>1. Find the cluster your Grafana stack is deployed in. You will use this as the value of the &lt;code>pdc_cluster&lt;/code> Terraform variable.&lt;/p>&lt;p>2. Create a new service account with the Admin role within Grafana. 
In your Grafana Cloud instance, navigate to &lt;strong>Administration &lt;/strong>>&lt;strong> Users and access &lt;/strong>>&lt;strong> Service accounts&lt;/strong> and then click &lt;strong>Add service account&lt;/strong>.&lt;/p>&lt;p>Once the service account is created, click &lt;strong>Add service account token&lt;/strong>. Copy the token and set it as the value of the &lt;code>sa_token&lt;/code> Terraform variable.&lt;/p>&lt;p>3. Next, create a new access policy by navigating to &lt;strong>Administration &lt;/strong>>&lt;strong> Users and access &lt;/strong>>&lt;strong> Cloud access policies&lt;/strong>, then click &lt;strong>Create access policy&lt;/strong>. Name your policy and then add the required permissions.&lt;/p>&lt;p>You will need to click &lt;strong>Add scope&lt;/strong> in order to add the &lt;code>stacks:read&lt;/code> permission. After you’ve created the policy, click &lt;strong>Add token&lt;/strong>, name it (and optionally set an expiration date), then copy the token to the &lt;code>cloud_access_policy_token&lt;/code> Terraform variable.&lt;/p>&lt;p>4. Now that we’ve finished setting up connection credentials in Grafana, let’s set the remaining Terraform variables:&lt;/p>&lt;table>&lt;thead>&lt;tr>&lt;th>Variable&lt;/th>&lt;th>Example value&lt;/th>&lt;/tr>&lt;/thead>&lt;tbody>&lt;tr>&lt;td>&lt;code>grafana_url&lt;/code>&lt;/td>&lt;td>https://&lt;stack-name>.grafana.net&lt;/td>&lt;/tr>&lt;tr>&lt;td>&lt;code>grafana_slug&lt;/code>&lt;/td>&lt;td>&lt;stack-name>&lt;/td>&lt;/tr>&lt;tr>&lt;td>&lt;code>vpc_name&lt;/code>&lt;/td>&lt;td>happiness-demo-vpc&lt;/td>&lt;/tr>&lt;/tbody>&lt;/table>&lt;p>5. Now you are ready to provision the infrastructure needed to support the demo. From the &lt;strong>grafana_happiness&lt;/strong> repository directory, run &lt;code>terraform init&lt;/code> to download the project’s dependencies, then &lt;code>terraform apply&lt;/code> to create the infrastructure and Grafana resources. Be aware that provisioning the infrastructure and resources can take eight minutes or more.&lt;/p>&lt;p>6. 
Once &lt;code>terraform apply&lt;/code> has completed successfully, you will find a new dashboard at &lt;strong>Dashboards &lt;/strong>>&lt;strong> Happiness&lt;/strong> >&lt;strong> World Happiness Index&lt;/strong>:&lt;/p>&lt;p>Now you are visualizing data from your new private data source using PDC!&lt;/p>&lt;p>7. This demo deploys real cloud infrastructure that costs money, so remember to run &lt;code>terraform destroy&lt;/code> when you are done exploring so you don’t incur unwanted expenses.&lt;/p>&lt;h2>What's next&lt;/h2>&lt;p>We encourage you to use the provided demo as a starting point for your business analytics journey with Grafana Cloud. And now that you are connected to your private network, you can use any of the supported data sources for &lt;u>&lt;a href="https://grafana.com/docs/grafana-cloud/machine-learning/assistant/get-started/">Grafana Assistant&lt;/a>&lt;/u> to help you analyze and visualize your business data. &lt;/p>&lt;p>Assistant can run ad hoc queries, build dashboards, and help you gain insight into your data without having to learn numerous query languages. Assistant can also provide the translated SQL queries if you need to use them in other systems. 
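&lt;/p>&lt;p>To make that last point concrete, here's a sketch of the kind of SQL a prompt like "show me the top 5 happiest countries in 2024" might translate to. The table and column names below are illustrative assumptions rather than the demo's exact schema, and the snippet runs the query against an in-memory SQLite stand-in (with made-up scores) instead of the real PostgreSQL database:&lt;/p>

```python
# Hypothetical sketch: the kind of SQL an AI assistant might generate for
# "top 5 happiest countries in 2024". Table and column names (happiness,
# country_name, year, life_ladder) are assumptions, not the demo's real
# schema, and the scores are illustrative sample values.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE happiness (country_name TEXT, year INTEGER, life_ladder REAL)")
rows = [
    ("Finland", 2024, 7.74), ("Denmark", 2024, 7.58), ("Iceland", 2024, 7.53),
    ("Sweden", 2024, 7.34), ("Israel", 2024, 7.30), ("Mexico", 2024, 6.68),
]
conn.executemany("INSERT INTO happiness VALUES (?, ?, ?)", rows)

# The generated SQL itself: filter by year, rank by score, keep the top 5.
query = """
    SELECT country_name, life_ladder
    FROM happiness
    WHERE year = 2024
    ORDER BY life_ladder DESC
    LIMIT 5
"""
top5 = conn.execute(query).fetchall()
for country, score in top5:
    print(country, score)
```

&lt;p>Minus the SQLite scaffolding, the same kind of statement could be pasted into any PostgreSQL client or reused in another reporting tool. &lt;/p>&lt;p>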
And check out &lt;u>&lt;a href="https://grafana.com/docs/grafana-cloud/machine-learning/assistant/guides/querying/">this guide&lt;/a>&lt;/u> for even more use cases to explore in your journey with Grafana Assistant.&lt;/p></description></item><item><title>Query fair usage in Grafana Cloud: What it is and how it affects your logs observability practice</title><link>https://grafana.com/blog/query-fair-usage-in-grafana-cloud-what-is-it-and-how-it-affects-your-logs-observability-practice/</link><pubDate>Tue, 07 Apr 2026 18:32:14</pubDate><author>Russ Erbe</author><guid>https://grafana.com/blog/query-fair-usage-in-grafana-cloud-what-is-it-and-how-it-affects-your-logs-observability-practice/</guid><description>&lt;p>In Grafana Cloud we use a simple yet generous formula that lets you query up to 100x your monthly ingested log volume in gigabytes for free. This works for the vast majority of our customers, but if you aren’t careful and strategic with your usage, you could find yourself with an overage bill. &lt;/p>&lt;p>We obviously don't want to surprise any customer with an unexpected bill, so in this blog post you'll learn how to find your usage ratio in your Grafana dashboard so you can understand what you're looking at, where the queries are coming from, and who your top users are. You'll also learn some best practices for applying good policies to your observability practice going forward.&lt;/p>&lt;h2>What is 'query fair usage' and why does the policy exist?&lt;/h2>&lt;p>In Grafana Cloud Logs, “query fair usage” refers to a pricing policy that lets you query up to &lt;strong>100x your monthly ingested log volume in GBs&lt;/strong> at no additional charge. It is primarily a billing mechanism designed to allow typical usage without extra charges, while preventing abuse from extremely resource-intensive queries.&lt;/p>&lt;p>Grafana Cloud’s query engine (especially for Loki logs) can scan huge amounts of data. 
Without guardrails, a few heavy queries could create disproportionate load. The fair‑use policy:&lt;/p>&lt;ul>&lt;li>Keeps costs predictable&lt;/li>&lt;li>Encourages efficient querying&lt;/li>&lt;li>Protects shared infrastructure&lt;/li>&lt;li>Still allows generous exploration of logs&lt;/li>&lt;/ul>&lt;p>&lt;strong>Example&lt;/strong>&lt;/p>&lt;p>If you ingest &lt;strong>50 GB&lt;/strong> of logs in a month:&lt;/p>&lt;ul>&lt;li>Your fair‑use query allowance = &lt;strong>50 GB × 100 = 5,000 GB&lt;/strong> of queried logs.&lt;/li>&lt;/ul>&lt;p>As long as your queries scan &lt;strong>≤ 5,000 GB&lt;/strong>, you pay nothing extra for queries.&lt;/p>&lt;h2>How to find and monitor your query fair usage&lt;/h2>&lt;h3>How to view query usage&lt;/h3>&lt;p>For starters, you can view the cost of a query &lt;em>before&lt;/em> you run it. When you write queries in Explore, Grafana provides an estimate of how much data will be scanned when the query runs.&lt;/p>&lt;p>But if you want to know your usage ratio, you have a few options. The first, and the one most people are familiar with, is going to the billing dashboard to view your current query ratio.&lt;/p>&lt;ul>&lt;li>From the Grafana main menu, click the dashboard icon.&lt;/li>&lt;li>Select the &lt;strong>Billing/Usage&lt;/strong> dashboard.&lt;/li>&lt;li>Scroll down to the &lt;strong>Logs Ingestion and Query Details&lt;/strong> section.&lt;/li>&lt;li>Expand the section and scroll to the &lt;strong>Query Usage Ratio&lt;/strong> panel.&lt;/li>&lt;/ul>&lt;p>The screenshot below is an example of what you might see. This is from a large environment with seven different Grafana instances.&lt;/p>&lt;p>Another way to view your usage ratio is to use the newly redesigned &lt;u>&lt;a href="https://grafana.com/docs/grafana-cloud/cost-management-and-billing/">Cost Management and billing&lt;/a>&lt;/u> experience. 
This is the method you should get familiar with, as the billing dashboard has been deprecated and will be removed in the future.&lt;/p>&lt;p>Once you're there, simply scroll down to &lt;strong>Products&lt;/strong>, and locate and expand &lt;strong>Logs&lt;/strong>. From there, go to the bottom of the page, where you will find the log query rate as well as the query usage ratio.&lt;/p>&lt;h3>Determine the source of query usage&lt;/h3>&lt;p>To help you track down the source of query usage, we built the &lt;u>&lt;a href="https://grafana.com/grafana/dashboards/21936-usage-insights-6-loki-query-fair-usage-drilldown/?tab=revisions">Loki query fair usage dashboard&lt;/a>&lt;/u>. For Grafana Cloud customers using hosted Grafana, this dashboard is automatically provisioned on each of your hosted Grafana instances.&lt;/p>&lt;p>The dashboard shows a breakdown of your query usage by query type (dashboard, grafana-alert, and Explore/other), by query bytes and query count. For each type, there are rows showing more detailed information on the highest-volume queries, including:&lt;/p>&lt;ul>&lt;li>The originating query&lt;/li>&lt;li>The Grafana username that submitted the query (if relevant)&lt;/li>&lt;li>Rule and dashboard names&lt;/li>&lt;li>Query size in bytes&lt;/li>&lt;li>Query execution frequency&lt;/li>&lt;/ul>&lt;p>And here's a breakdown of the different panels in the dashboard:&lt;/p>&lt;ul>&lt;li>&lt;strong>Grafana-alerts&lt;/strong> refers to rules managed within Grafana under Grafana Alerting, found at &lt;strong>Home&lt;/strong> > &lt;strong>Alerts &amp; IRM&lt;/strong> > &lt;strong>Alerting&lt;/strong>. These rules can be alerts or recording rules. These are separate from the rules you upload to Loki with &lt;code>cortextool&lt;/code> or &lt;code>lokitool&lt;/code> using the Grafana Cloud APIs.&lt;/li>&lt;li>&lt;strong>Explore/other&lt;/strong> refers to a subset of queries executed against Loki that come from the Explore page. 
It doesn’t include those coming from the Grafana Logs Drilldown app. It does include queries that come from a non-Grafana frontend source such as &lt;code>logcli&lt;/code>. Explore queries likely have a &lt;code>grafana_username&lt;/code> populated in the dashboard; queries from other sources don’t.&lt;/li>&lt;li>&lt;strong>Estimated interval&lt;/strong> is an estimate of how frequently the query ran over the selected time period. It’s the number of times the query ran divided by the total time range.&lt;/li>&lt;/ul>&lt;h2>Understanding your Grafana Cloud invoice for logs&lt;/h2>&lt;p>At this point, you have learned what query fair usage is, where to find it in your billing dashboard, and how to determine the source of your query usage. Now let's get an understanding of how that could impact your monthly invoice. &lt;/p>&lt;p>Grafana Cloud calculates logs usage by looking at the following components:&lt;/p>&lt;ul>&lt;li>&lt;strong>GBs ingested: &lt;/strong>The total number of GBs ingested into Grafana Cloud on a monthly basis.&lt;/li>&lt;li>&lt;strong>GBs retained: &lt;/strong>How long log data is retained within Grafana Cloud.&lt;/li>&lt;/ul>&lt;p>The minimum retention period is 30 days, and you can purchase additional retention in 30-day increments. Retention is customizable per stack or by individual streams within the same stack. To enable this, contact Grafana Support.&lt;/p>&lt;p>&lt;strong>Note:&lt;/strong> Retention period changes are not retroactive. 
Once the retention is increased, the current logs will be stored following the new retention period, but logs already outside the old retention period will not be recovered.&lt;/p>&lt;h2>Billing calculations&lt;/h2>&lt;p>Billing is based on usage, and usage is determined by these primary factors:&lt;/p>&lt;ul>&lt;li>The number of GBs ingested per month&lt;/li>&lt;li>The number of months of retention&lt;/li>&lt;/ul>&lt;p>&lt;strong>Note: &lt;/strong>For customers exceeding the 100x fair use policy for GBs queried per month, the following billing calculation applies:&lt;/p>&lt;p>&lt;code>logs billable gb = max(ingested gb, queried gb / fair use query ratio)&lt;/code>&lt;/p>&lt;p>This calculation is performed on a per-stack level.&lt;/p>&lt;p>Even though you can see what your query fair usage is at any time, you are only billed at the beginning of each month, for the previous month. Early in the billing cycle your usage ratio will often be high, but it typically drops as the month goes on and more logs are ingested.&lt;/p>&lt;h3>Example&lt;/h3>&lt;ul>&lt;li>Ingested: 50 GB&lt;/li>&lt;li>Queried: 7,000 GB&lt;/li>&lt;li>Fair‑use threshold: 5,000 GB&lt;/li>&lt;/ul>&lt;p>Billable GB = max(50, 7,000 / 100) = max(50, 70) = &lt;strong>70 GB&lt;/strong>&lt;/p>&lt;p>So you would be billed for &lt;strong>70 GB&lt;/strong> instead of 50 GB at whatever your set rate is per GB in your Grafana contract.&lt;/p>&lt;h2>Managing your query fair usage costs&lt;/h2>&lt;p>Now that we've walked through the usage policy and how it could impact your costs, let's finish with some tips for avoiding potential overages in the future.&lt;/p>&lt;h3>Recommendations&lt;/h3>&lt;p>One common source of excess queries is misconfigured Grafana/Loki-managed alerting rules—for example, querying one hour of data but running that query every minute.&lt;/p>&lt;p>For alerting rules using the Loki data source:&lt;/p>&lt;ul>&lt;li>Use instant queries 
instead of range queries for all rules. An instant query executes exactly one time and produces one data point for each series matched by your label selectors. Range queries are effectively instant queries executed multiple times. For more details, refer to the &lt;a href="https://grafana.com/blog/2023/07/05/how-to-run-faster-loki-metric-queries-with-more-accurate-results/">How to run faster Loki metric queries with more accurate results&lt;/a> blog post.&lt;/li>&lt;li>Look at the evaluation period and interval period and make their intervals match the amount of time queried. That is, a rule that runs once per minute should have a query range of &lt;code>1m&lt;/code>.&lt;/li>&lt;/ul>&lt;p>&lt;strong>Note: &lt;/strong>We also recommend checking alert rules run by the scheduler (recorded queries).&lt;/p>&lt;h3>Best practices for fair query usage&lt;/h3>&lt;p>To stay within fair usage policies and optimize performance:&lt;/p>&lt;ul>&lt;li>&lt;strong>Filter early:&lt;/strong> Use label selectors and log pipeline filters at the beginning of your queries (e.g., in LogQL) to reduce the data set before applying more complex operations.&lt;/li>&lt;li>&lt;strong>Avoid wide scans:&lt;/strong> Be cautious with queries that scan large time ranges or entire datasets, especially when using &lt;u>&lt;a href="https://grafana.com/docs/grafana-cloud/machine-learning/assistant/">Grafana Assistant&lt;/a>&lt;/u>.&lt;/li>&lt;li>&lt;strong>Narrow time ranges:&lt;/strong> Select a smaller time frame instead of querying “last 30 days” by default.&lt;/li>&lt;li>&lt;strong>Use aggregation and recording rules:&lt;/strong> Define Prometheus recording rules to pre-calculate frequently used, resource-heavy expressions into new metrics. Querying these pre-aggregated metrics is much more efficient than calculating them ad-hoc.&lt;/li>&lt;li>&lt;strong>Avoid wildcard-heavy filters: &lt;/strong>They force Loki to scan massive amounts of data, which will consume your query usage quickly. 
A precise label selector like &lt;code>{cluster="us-central1"}&lt;/code> can reduce the search space from, say, 100 TB down to 1 TB.&lt;/li>&lt;li>&lt;strong>Use labels wisely:&lt;/strong> Loki’s indexing model rewards good label design. For more details on how to do this, check out our &lt;u>&lt;a href="https://grafana.com/blog/the-concise-guide-to-grafana-loki-everything-you-need-to-know-about-labels/">concise guide to labels in Loki&lt;/a>&lt;/u>.&lt;/li>&lt;li>&lt;strong>Optimize alert rules: &lt;/strong>Ensure your log-based alert rules follow best practices, as they are a common cause of excessive query usage. And for more information on how to address poorly defined alert rules, check out our &lt;u>&lt;a href="https://grafana.com/docs/grafana-cloud/cost-management-and-billing/analyze-costs/reduce-costs/logs-costs/control-query-usage-costs/">docs page on this topic&lt;/a>&lt;/u>.&lt;/li>&lt;li>&lt;strong>Monitor usage dashboards:&lt;/strong> Regularly check the "Billing and Usage" dashboards in your Grafana Cloud portal to understand your consumption patterns and configure alerts for unexpected spikes. You can also use Grafana's "Usage insights" feature (available in Grafana Enterprise and Grafana Cloud) to identify heavy-hitting queries or unused dashboards.&lt;/li>&lt;/ul>&lt;h2>Configure Loki query limit policies&lt;/h2>&lt;p>If, even with all of these best practices in place, you still find your company exceeding the query fair usage policy, Grafana recently introduced a way to put up guardrails that prevent expensive queries.&lt;/p>&lt;p>&lt;strong>Note: &lt;/strong>Loki query limit policies are currently in &lt;a href="https://grafana.com/docs/release-life-cycle/">public preview&lt;/a>. Grafana Labs offers limited support, and breaking changes might occur prior to the feature being made generally available.&lt;/p>&lt;p>This feature is disabled by default. 
Contact Grafana Support to enable query limit policies using the &lt;code>lokiQueryLimitsContext&lt;/code> feature flag.&lt;/p>&lt;p>Loki query limit policies provide fine-grained control over how users query your Grafana Cloud Logs data. You can configure these policies as attributes on &lt;a href="https://grafana.com/docs/grafana-cloud/security-and-account-management/authentication-and-permissions/access-policies/">access policies&lt;/a> to limit query result sizes.&lt;/p>&lt;p>When a query exceeds a configured limit, users receive meaningful error messages that explain why the query was rejected and how to adjust it.&lt;/p>&lt;h3>How query limit policies work&lt;/h3>&lt;p>Query limit policies are applied as &lt;code>lokiQueryPolicy&lt;/code> attributes on access policies. When a user makes a request using a token associated with an access policy that has query limits configured, Loki validates the entire time period of the query against those limits before execution.&lt;/p>&lt;p>Query limit policies are not enforced for Loki-managed or Grafana-managed alerts.&lt;/p>&lt;p>To learn more about &lt;strong>Loki query limit policies&lt;/strong> and how to configure them, see the &lt;u>&lt;a href="https://grafana.com/docs/grafana-cloud/cost-management-and-billing/analyze-costs/logs-costs/log-query-limit-policies/">Grafana documentation&lt;/a>&lt;/u>.&lt;/p></description></item><item><title>Observability in Go: Where to start and what matters most</title><link>https://grafana.com/blog/observability-in-go-where-to-start-and-what-matters-most/</link><pubDate>Mon, 06 Apr 2026 15:51:58</pubDate><author>Grafana Labs Team</author><guid>https://grafana.com/blog/observability-in-go-where-to-start-and-what-matters-most/</guid><description>&lt;p>Sometimes the hardest part of debugging a system isn’t fixing the problem—it’s figuring out what’s actually happening in the first place.&lt;/p>&lt;p>In this episode of “Grafana’s Big Tent” podcast, host 
Mat Ryer, Principal Software Engineer at Grafana Labs, is joined by Donia Chaiehloudj, Senior Software Engineer at Isovalent (Cisco) and co-author of &lt;u>&lt;a href="https://www.manning.com/books/learn-go-with-pocket-sized-projects">“Learn Go with Pocket-Sized Projects,”&lt;/a>&lt;/u> along with Charles Korn, Principal Software Engineer at Grafana Labs and Bryan Boreham, Distinguished Engineer at Grafana Labs, to talk about observability in &lt;u>&lt;a href="https://go.dev/">Go&lt;/a>&lt;/u>.&lt;/p>&lt;p>They dig into where to start (hint: logs are often the first step) and how context, metrics, traces, and profiling fit together as systems grow more complex. Along the way, they share practical lessons on turning logs into metrics, avoiding common pitfalls with context and tracing, using &lt;u>&lt;a href="https://github.com/google/pprof">pprof&lt;/a>&lt;/u> effectively, and what &lt;u>&lt;a href="https://ebpf.io/">eBPF&lt;/a>&lt;/u> unlocks when you need visibility beyond your application.&lt;/p>&lt;p>You can watch the full episode in the YouTube video below, or listen on &lt;u>&lt;a href="https://open.spotify.com/show/3beQvS8to0rYs1gxOnPrfD">Spotify&lt;/a>&lt;/u> or &lt;u>&lt;a href="https://podcasts.apple.com/us/podcast/grafanas-big-tent/id1616725129">Apple Podcasts&lt;/a>&lt;/u>.&lt;/p>&lt;p>&lt;em>(Note: The following are highlights from episode 8, season 3 of “Grafana’s Big Tent” podcast. This transcript has been edited for length and clarity.)&lt;/em>&lt;/p>&lt;h2>&lt;strong>Starting a Go project: Where observability begins&lt;/strong>&lt;/h2>&lt;p>&lt;strong>Donia Chaiehloudj: &lt;/strong>I would go simple to start. We know that we are always refactoring along the way and that priorities change, like real life. 
But I would try to go for the Go standard library as much as possible, because we know that it’s stable and not going to be archived tomorrow.&lt;/p>&lt;p>I would also go for well-known libraries that are not standard, but are used by a lot of people and are well-maintained, even though we know that contributors are fewer and fewer in the open source world. I would also think about some standardization from the beginning—for your data, your context, the way you want to trace, that kind of thing. &lt;/p>&lt;p>&lt;strong>Mat Ryer: &lt;/strong>Yeah, I think that makes sense. I like your point that you're going to refactor. Things are going to change. That kind of takes a bit of pressure off. It doesn't have to be perfect straight away.&lt;/p>&lt;h2>&lt;strong>Starting with logs—and turning them into metrics&lt;/strong>&lt;/h2>&lt;p>&lt;strong>Charles Korn:&lt;/strong> The thing I use most often, at least in the stuff that I'm working on at the moment, is logs. From logs, you can derive metrics if you really need to, so that's probably where I'd start. They're really easy to get started with. You can dump them into a file, you can dump them to the console, and you can start shipping them off to a system like &lt;u>&lt;a href="https://grafana.com/oss/loki/">Loki&lt;/a>&lt;/u>.&lt;/p>&lt;p>&lt;strong>Mat:&lt;/strong> Yeah, I think starting with logs is quite natural.&lt;/p>&lt;p>&lt;strong>Charles:&lt;/strong> We've got a bunch of Go services at Grafana Labs, and unfortunately, occasionally they panic, and they dump the stack trace to standard error, and it gets picked up by our logging system.&lt;/p>&lt;p>And it's really useful to be able to show that on a graph—how often a thing's panicking. We actually have a system where it'll look at the logs, count the number of things that look like a panic, and turn that into a metric. And then we can alert on that metric just like any other metric. 
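&lt;/p>&lt;p>The pattern Charles describes is language-agnostic: scan the log stream, count lines that look like a panic, and export that count as a metric you can graph and alert on. Here's a minimal sketch in Python (the sample log lines and the panic-matching regex are invented for illustration):&lt;/p>

```python
# Minimal sketch of the log-to-metric pattern described above: scan log
# lines, count the ones that look like the start of a Go panic, and expose
# that count as a metric value. The sample log lines are invented.
import re

PANIC_RE = re.compile(r"^panic: |runtime error:")

def count_panics(lines):
    """Return how many log lines look like a Go panic."""
    return sum(1 for line in lines if PANIC_RE.search(line))

logs = [
    "level=info msg=\"request handled\" duration=12ms",
    "panic: runtime error: invalid memory address or nil pointer dereference",
    "level=error msg=\"retrying upstream\"",
    "panic: close of closed channel",
]

# In a real system this number would be pushed to, or scraped by, a
# metrics backend and alerted on like any other metric.
print("panic_total", count_panics(logs))
```

&lt;p>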
That's really helpful.&lt;/p>&lt;p>&lt;strong>Mat: &lt;/strong>Yeah, so you literally then get a graph that shows you how many panics you're having.&lt;/p>&lt;p>&lt;strong>Charles&lt;/strong>: Exactly. And you can have alerts on that.&lt;/p>&lt;h2>&lt;strong>Tracing and why context matters&lt;/strong>&lt;/h2>&lt;p>&lt;strong>Bryan Boreham&lt;/strong>: Tracing adds that explicit parent-child relationship, and everything's always got a beginning and an end. So I think tracing is kind of the superpower to figure out, with any complicated program, what happened. &lt;/p>&lt;p>&lt;strong>Mat:&lt;/strong> So the idea being it's spending more resources in that bit, and therefore, if you're going to optimize something, go for the big-hanging fruit, would you say?&lt;/p>&lt;p>&lt;strong>Bryan: &lt;/strong>I suppose so.&lt;/p>&lt;p>&lt;strong>Mat&lt;/strong>: So you mentioned that you would do that only in advanced projects or complicated projects. How do you know when it's time to reach for tracing?&lt;/p>&lt;p>&lt;strong>Bryan&lt;/strong>: Well, for myself, I said 20 or 30 lines, but it gets complicated. So my bar is very low. My ability to concentrate on things is quite poor.&lt;/p>&lt;p>It also depends, because tracing is quite complicated, or people find it complicated to set up. With logs, you just stick it in a file and then read it, so it's orders of magnitude different.&lt;/p>&lt;p>But tracing really comes into its own when you have multiple bits in what you call a distributed system, multiple frontend and backends, or multiple bits of backend, or something like that. 
You pass the same ID around to everything because it's related, and then they're all logging or they're all reporting using that same ID, and that allows you to then tie the whole process together across all these multiple systems.&lt;/p>&lt;p>&lt;strong>Mat&lt;/strong>: Yeah, it's very cool, and of course makes sense at scale.&lt;/p>&lt;h2>&lt;strong>Errors, tradeoffs, and observability in Go&lt;/strong>&lt;/h2>&lt;p>&lt;strong>Charles&lt;/strong>: One thing I do miss sometimes coming from other languages is that you've got an exception type, and each of those exceptions is a particular type. It's a file-not-found error or a network error or whatever it is.&lt;/p>&lt;p>Whereas with Go, most of those things are just strings. So if you're going to do any kind of analysis, like how many file-not-found errors did I get, that could be quite tricky in Go because they're just a whole bunch of strings.&lt;/p>&lt;p>But at the same time, it makes it really simple to create these really rich errors. They're really easy, as an engineer trying to solve a problem, to get that context of what's going on. It's a bit of good and bad.&lt;/p>&lt;p>&lt;strong>Donia&lt;/strong>: So I started my career with Go. And for me, it was very natural to have error types. That was like, "no, I want to create new types of errors" if I had something specific. And it was a reflex to just check if there was the type of error that I wanted already in the library.&lt;/p>&lt;p>I did one year of Java in a company and I was playing with exceptions, and I was confused, actually. I want to define my own error type, because it's something very specific. And I want to type it for that type of library that I'm dealing with.&lt;/p>&lt;p>So what Charles is talking about is really interesting—the way exceptions can be, in general, easier maybe. 
But I find that error types and being more granular is easier to read in the code and to understand when you're debugging, too.&lt;/p>&lt;h2>&lt;strong>Profiling with pprof&lt;/strong>&lt;/h2>&lt;p>&lt;strong>Mat: &lt;/strong>So then profiles. I know Go has pprof. What is pprof?&lt;/p>&lt;p>&lt;strong>Charles&lt;/strong>: It's a tool that allows you to measure the performance of your Go application. And it can show you a bunch of different profiles. The one that I use most often is CPU time. It's literally just how much time is spent in different functions. And the other one I spend a lot of time looking at is memory consumption, like peak in-use memory consumption.&lt;/p>&lt;p>&lt;strong>Donia: &lt;/strong>I was very intimidated at the beginning of my career by pprof, actually. Do you have any advice for someone getting started with it?&lt;/p>&lt;p>&lt;strong>Bryan&lt;/strong>: I was just thinking to myself, actually, that there were one or two gotchas. The big one that catches some people is that they dive into CPU profiling when they don't actually have a CPU problem.&lt;/p>&lt;p>They're not running out of CPU. They've got a program that's slow, and so they think, "Oh, profiling." Then it turns out that this program is slow because it's waiting on some other program, like a database, and a profile will not show you that.&lt;/p>&lt;p>The simplest way to watch out for it is to watch your CPU meter. If it's ticking along at sort of 0.1 CPU usage or something like that, then it's very unlikely that profiling is going to get you anywhere. Whereas if the fans are all running, 18 CPUs going in parallel, then that's probably a good one to point the CPU profiler at.&lt;/p>&lt;p>The next thing is that it's almost always memory allocation in Go that is causing issues. 
If you do have a CPU problem, look at the memory profile, is my next top tip.&lt;/p>&lt;h2>&lt;strong>eBPF and observing the “dark side” of systems&lt;/strong>&lt;/h2>&lt;p>&lt;strong>Donia&lt;/strong>: eBPF, for people who maybe don't know what it is, is a way to write C programs, BPF programs, in the Linux kernel to dynamically observe or secure your kernel.&lt;/p>&lt;p>That's very powerful. But it can be very daunting and costly to write BPF programs. So having Go wrappers on top of that is very interesting.&lt;/p>&lt;p>Something I personally like about eBPF is that you can actually access dark sides of your kernel that you can't access from user space.&lt;/p>&lt;h2>&lt;strong>What Go could improve&lt;/strong>&lt;/h2>&lt;p>&lt;strong>Charles&lt;/strong>: One thing that would be really helpful is if, when you put out a stack trace, it could say, “give me the pointer address of this pointer, and print out this value from that struct.” The other thing that's kind of related is errors. I'd love to be able to get a stack trace reliably for an error.&lt;/p>&lt;p>&lt;strong>Bryan&lt;/strong>: I would love more flexibility, and it's probably more in the debugging tool than in Go itself.&lt;/p>&lt;p>&lt;em>“Grafana’s Big Tent” podcast wants to hear from you. 
If you have a great story to share, want to join the conversation, or have any feedback, please contact the Big Tent team at &lt;/em>&lt;em>&lt;strong>&lt;a href="mailto:bigtent@grafana.com">bigtent@grafana.com&lt;/a>&lt;/strong>&lt;/em>&lt;em>.&lt;/em>&lt;/p></description></item><item><title>Finding performance bottlenecks with Pyroscope and Alloy: An example using TON blockchain</title><link>https://grafana.com/blog/finding-performance-bottlenecks-with-pyroscope-and-alloy-an-example-using-ton-blockchain/</link><pubDate>Mon, 30 Mar 2026 16:22:31</pubDate><author>Anatoly Korniltsev</author><guid>https://grafana.com/blog/finding-performance-bottlenecks-with-pyroscope-and-alloy-an-example-using-ton-blockchain/</guid><description>&lt;p>Performance optimization often feels like searching for a needle in a haystack. You know your code is slow, but where exactly is the bottleneck? &lt;/p>&lt;p>This is where continuous profiling comes in. &lt;/p>&lt;p>In this blog post, we’ll explore how continuous profiling with &lt;u>&lt;a href="/oss/alloy-opentelemetry-collector/">Alloy&lt;/a>&lt;/u> and &lt;u>&lt;a href="/oss/pyroscope/">Pyroscope&lt;/a>&lt;/u> can transform the way you approach performance optimization. Using real-world examples from last year’s &lt;u>&lt;a href="https://contest.com/docs/BlockValidationChallenge">TON blockchain optimization contest&lt;/a>&lt;/u>, a C++ developer challenge, we’ll explore how modern profiling tools accelerate the optimization process. &lt;/p>&lt;h2>First, some background on the contest&lt;/h2>&lt;p>&lt;u>&lt;a href="https://en.wikipedia.org/wiki/TON_(blockchain)">The Open Network (TON)&lt;/a>&lt;/u> blockchain optimization contest is a C++ optimization challenge where contestants have to squeeze every microsecond out of a blockchain validation algorithm.&lt;/p>&lt;p>The challenge was straightforward: participants were given the reference implementation based on the original block validation algorithm in TON. 
Their task was to optimize the implementation, which had to be consistent with the reference algorithm. Scores were based on execution time.&lt;/p>&lt;p>While we did not directly participate in the contest, a handful of Pyroscope engineers ran several contestant submissions locally and profiled them. This allowed us to observe where the optimized implementations spent their time and how specific changes affected performance.&lt;/p>&lt;p>We used Alloy, an open source OpenTelemetry collector with built-in Prometheus pipelines and support for metrics, logs, traces, and profiles. Specifically, we leveraged Alloy’s &lt;code>pyroscope.ebpf&lt;/code> component, an eBPF-based CPU profiler, to capture detailed profiling data and send it to &lt;u>&lt;a href="/products/cloud/">Grafana Cloud&lt;/a>&lt;/u> for analysis. This approach allowed us to identify hotspots and track optimization progress.&lt;/p>&lt;p>With Alloy’s eBPF-based profiling, we were able to gain immediate visibility into performance bottlenecks without modifying a single line of contestant code.&lt;/p>&lt;h2>Alloy setup&lt;/h2>&lt;p>Setting up eBPF-based profiling with Alloy requires minimal configuration:&lt;/p>&lt;p>&lt;/p>&lt;pre>&lt;code>pyroscope.write "staging" {
  endpoint {
    url = "&lt;URL>"
    basic_auth {
      username = "&lt;User>"
      password = "&lt;Password>"
    }
  }
}
pyroscope.ebpf "default" {
  targets_only = false
  forward_to = [pyroscope.write.staging.receiver]
  demangle = "full"
}&lt;/code>&lt;/pre>&lt;p>Replace &lt;code>&lt;URL>&lt;/code> with your Pyroscope server URL, and &lt;code>&lt;User>&lt;/code> and &lt;code>&lt;Password>&lt;/code> with your Grafana Cloud credentials if sending data to the cloud. For local setups, you can skip the authentication and point to a local Pyroscope instance.&lt;/p>&lt;p>The profiler runs with root privileges and starts immediately:&lt;/p>&lt;p>&lt;/p>&lt;pre>&lt;code>sudo ./alloy run ./ebpf.alloy.txt&lt;/code>&lt;/pre>&lt;p>Once running, it profiles the entire system and sends data to your configured endpoint.&lt;/p>&lt;p>For the contest, we compiled with &lt;code>clang&lt;/code> using &lt;code>RelWithDebInfo&lt;/code> to preserve symbols for proper flame graph visualization:&lt;/p>&lt;p>&lt;/p>&lt;pre>&lt;code>CC=clang CXX=clang++ cmake ../ton -DCMAKE_BUILD_TYPE=RelWithDebInfo
make contest-grader -j
./contest/grader/contest-grader --threads 8 --tests ../../tests&lt;/code>&lt;/pre>&lt;h2>Crypto library optimizations&lt;/h2>&lt;p>Looking at the reference implementation flame graph, we can see that &lt;code>vm::DataCell::create&lt;/code> (DataCell deserialization) consumes about 14% of the total execution time. This function is responsible for creating and validating cells, which are TON's fundamental data structure. Each cell can store up to 1023 bits of data and references to other cells, forming a directed acyclic graph.&lt;/p>&lt;p>The SHA256 computation happens because every cell in TON has a cryptographic hash that serves as its unique identifier. During deserialization, the system must compute SHA256 hashes to verify data integrity, prevent circular references, and enable efficient deduplication. This hash computation involves serializing the cell's data, descriptor bytes, reference depths, and reference hashes into a single byte string that gets hashed with SHA256.&lt;/p>&lt;p>Another crypto operation hotspot is &lt;code>vm::exec_ed25519_check_signature&lt;/code>, which implements the TVM bytecode operation for Ed25519 signature verification. This operation is frequently called during smart contract execution and transaction validation.&lt;/p>&lt;p>These cryptographic operations represent natural optimization targets, as they consume significant CPU time during blockchain validation.&lt;/p>&lt;h3>SHA256 alternative implementation&lt;/h3>&lt;p>Sometimes the most effective optimizations are the simplest ones. One contestant took the low-hanging fruit approach and replaced the default OpenSSL SHA256 implementation with an alternative from SerenityOS. 
This submission (&lt;u>&lt;a href="https://contest.com/ton-block-validation/entry6294">entry6294&lt;/a>&lt;/u>) swapped out the library routine with one from &lt;u>&lt;a href="https://git.tu-berlin.de/leon.a.albrecht/serenity-mirror/-/blob/057abb9023a30ea226a7b979a9f53f4f9dbe3c93/Userland/Libraries/LibCrypto/Hash/SHA2.cpp">SerenityOS's crypto library&lt;/a>&lt;/u>.&lt;/p>&lt;p>The flame graph diff shows the impact: a ~2% total speedup. While this might seem modest, every percentage point matters in competitive optimization. It's unclear why the SerenityOS implementation was faster, but the execution time and flame graph diff data confirmed the improvement.&lt;/p>&lt;h3>SHA256 single feed &lt;/h3>&lt;p>Beyond replacing the SHA256 implementation, contestants also optimized how the algorithm is used. One particularly effective optimization consolidated multiple SHA256 feed operations into a single call within &lt;code>CellChecker::compute_hash&lt;/code>. This &lt;u>&lt;a href="https://github.com/ton-blockchain/ton/pull/1590">pull request&lt;/a>&lt;/u> demonstrates how algorithmic improvements can be more impactful than library replacements.&lt;/p>&lt;p>The change sped up &lt;code>DataCell::create&lt;/code> by 20% and improved overall verification performance by 3.5%. By reducing the overhead of multiple hash update calls and leveraging more efficient batched processing, this optimization showed that understanding the usage patterns of cryptographic functions can lead to gains.&lt;/p>&lt;h3>ED25519&lt;/h3>&lt;p>Another straightforward optimization targeted the Ed25519 signature verification in &lt;code>vm::exec_ed25519_check_signature&lt;/code>. Like the SHA256 case, this involved replacing the default OpenSSL implementation with an alternative that uses handwritten assembly for x86_64.&lt;/p>&lt;p>While this approach sacrifices portability for performance, the results justified the trade-off in a contest environment. 
The assembly-optimized implementation delivered a ~1.5% speedup, demonstrating how platform-specific optimizations can provide measurable gains even for well-established cryptographic operations.&lt;/p>&lt;h2>Ordered collections replacements &lt;/h2>&lt;p>Another low-hanging fruit optimization involved replacing &lt;code>std::map&lt;/code> with &lt;code>std::unordered_set&lt;/code> in &lt;code>CellStorageStat::add_used_storage()&lt;/code>. The original implementation used a map to track visited cells:&lt;/p>&lt;pre>&lt;code>- std::map&lt;vm::Cell::Hash, CellInfo> seen;
+ std::unordered_set&lt;vm::Cell::Hash> seen;&lt;/code>&lt;/pre>&lt;p>This seemingly trivial change provided a ~10% speedup. The performance improvement came from the difference between these data structures: &lt;code>std::map&lt;/code> maintains elements in sorted order using a balanced binary tree (typically a red-black tree), providing O(log n) lookup time. In contrast, &lt;code>std::unordered_set&lt;/code> uses a hash table with O(1) average lookup time.&lt;/p>&lt;p>Since the collection is only used for memoization to avoid reprocessing the same cells, ordering is unnecessary. The hash-based lookup eliminated the overhead of tree traversal and comparison operations, making cell deduplication significantly faster.&lt;/p>&lt;h2>Custom profilers&lt;/h2>&lt;p>Interestingly, contestant submissions and the TON codebase itself included custom-built profiling solutions. This demonstrates the lack of ready-to-use, gold-standard profilers in the C++ ecosystem, forcing developers to implement their own instrumentation when they need deeper insights.&lt;/p>&lt;h3>Tracing profiler&lt;/h3>&lt;p>One contestant implemented a manual instrumentation tracing profiler with RAII-style timing blocks. The system used a &lt;code>PROFILER(name)&lt;/code> macro that created static IDs for O(1) record lookup and automatically measured execution time using RAII destructors. While lightweight and precise, it required manual code instrumentation at every point of interest.&lt;/p>&lt;p>The profiler aggregated timing data by call site and provided sorted output showing the most expensive operations first. This approach offered fine-grained control over what gets measured but came with the overhead of manual instrumentation and potential code clutter.&lt;/p>&lt;h3>Memory profiler&lt;/h3>&lt;p>The TON monorepo includes a sophisticated memory allocation profiler (&lt;code>memprof&lt;/code>) that intercepts all malloc/free calls and C++ new/delete operators. 
It captures full stack traces for each allocation, aggregates them by call site, and maintains a hash table of unique allocation patterns.&lt;/p>&lt;p>The profiler uses fast assembly-based stack walking on x86_64 with fallback to standard backtrace functions. It can track memory usage patterns, identify leaks, and provide detailed allocation statistics, which are essential for optimizing memory-intensive blockchain validation.&lt;/p>&lt;p>These custom profiling implementations highlighted a common challenge in C++ optimization work: the absence of standardized, production-ready profiling tools forces developers to reinvent the wheel. eBPF-based profiling with tools like Alloy offers an attractive alternative, providing comprehensive system-wide profiling without requiring custom instrumentation or code modifications.&lt;/p>&lt;h2>Wrapping up&lt;/h2>&lt;p>You can learn more about each implementation in the contest &lt;u>&lt;a href="https://contest.com/ton-block-validation">here&lt;/a>&lt;/u>; winners are also listed anonymously on that page.&lt;/p>&lt;p>Looking back on the contest, the flame graph visualizations in Pyroscope made it easy to spot hotspots like &lt;code>DataCell::create&lt;/code> consuming 14% of execution time, while flame graph diffs clearly showed the impact of each optimization attempt.&lt;/p>&lt;p>What's particularly striking is how contestants achieved significant speedups through relatively simple changes: swapping crypto libraries, replacing ordered collections with hash tables, and optimizing algorithmic patterns. These optimizations, ranging from 1.5% to 20% improvements per change, demonstrate that performance gains often come from understanding your data structures and choosing the right tool for the job. &lt;/p>&lt;p>The big take-away for me was that modern profiling tools like Pyroscope and Alloy are making performance optimization more accessible and data-driven. 
Whether you're optimizing blockchain validators or any other performance-critical application, continuous profiling should be in your optimization toolkit from day one.&lt;/p>&lt;p>&lt;em>&lt;a href="/products/cloud/?pg=ai-observability-mcp-servers&amp;plcmt=footer-cta">Grafana Cloud&lt;/a>&lt;/em>&lt;em> is the easiest way to get started with metrics, logs, traces, dashboards, and more. We have a generous forever-free tier and plans for every use case. &lt;/em>&lt;em>&lt;a href="/auth/sign-up/create-user/?pg=ai-observability-mcp-servers&amp;plcmt=footer-cta">Sign up for free now!&lt;/a>&lt;/em>&lt;/p></description></item><item><title>Grafana security release: Critical and high severity security fixes for CVE-2026-27876 and CVE-2026-27880</title><link>https://grafana.com/blog/grafana-security-release-critical-and-high-severity-security-fixes-for-cve-2026-27876-and-cve-2026-27880/</link><pubDate>Thu, 26 Mar 2026 04:43:23</pubDate><author>Mariell Hoversholm</author><guid>https://grafana.com/blog/grafana-security-release-critical-and-high-severity-security-fixes-for-cve-2026-27876-and-cve-2026-27880/</guid><description>&lt;p>Today we are releasing Grafana 12.4.2 along with patches for Grafana 12.3, 12.2, 12.1, and 11.6, which include critical and high severity security fixes. 
We recommend that you install the newly released versions as soon as possible.&lt;/p>&lt;p>Grafana 12.4.2 with security fixes:&lt;/p>&lt;ul>&lt;li>&lt;a href="/grafana/download/12.4.2">Download Grafana 12.4.2&lt;/a>&lt;/li>&lt;/ul>&lt;p>Grafana 12.3.6 with security fixes:&lt;/p>&lt;ul>&lt;li>&lt;a href="/grafana/download/12.3.6">Download Grafana 12.3.6&lt;/a>&lt;/li>&lt;/ul>&lt;p>Grafana 12.2.8 with security fixes: &lt;/p>&lt;ul>&lt;li>&lt;a href="/grafana/download/12.2.8">Download Grafana 12.2.8&lt;/a>&lt;/li>&lt;/ul>&lt;p>Grafana 12.1.10 with security fixes:&lt;/p>&lt;ul>&lt;li>&lt;a href="/grafana/download/12.1.10">Download Grafana 12.1.10&lt;/a>&lt;/li>&lt;/ul>&lt;p>Grafana 11.6.14 with security fixes:&lt;/p>&lt;ul>&lt;li>&lt;a href="/grafana/download/11.6.14">Download Grafana 11.6.14&lt;/a>&lt;/li>&lt;/ul>&lt;p>As per our security policy, Grafana Labs customers have received security patched versions two weeks in advance under embargo, and Grafana Cloud has been patched. &lt;/p>&lt;p>We have also coordinated closely with all cloud providers licensed to offer Grafana Cloud. They received early notification under embargo and confirmed that their offerings are secure at the time of this announcement. This is applicable to Amazon Managed Grafana and Azure Managed Grafana.&lt;/p>&lt;h2>CVE-2026-27876: SQL expressions arbitrary file write enabling remote code execution&lt;/h2>&lt;p>Grafana's SQL expressions feature enables transforming query data with familiar SQL syntax. 
This syntax, however, also permitted writing arbitrary files to the file system in such a way that one could chain several attack vectors to achieve remote code execution.&lt;/p>&lt;p>The CVSS score for this vulnerability is 9.1 CRITICAL (&lt;u>&lt;a href="https://www.first.org/cvss/calculator/3.1">CVSS link&lt;/a>&lt;/u>).&lt;/p>&lt;p>&lt;strong>The following prerequisites are required for this vulnerability:&lt;/strong> &lt;/p>&lt;ul>&lt;li>Access to execute data source queries (Viewer permissions or higher)&lt;/li>&lt;li>The sqlExpressions feature toggle must be enabled on the Grafana instance.&lt;/li>&lt;/ul>&lt;h2>Impact&lt;/h2>&lt;p>An attacker with access to execute data source queries could overwrite a Sqlyze driver or write an AWS data source configuration file in order to achieve full remote code execution. We have confirmed this vulnerability could be exploited to acquire an SSH connection to the Grafana host.&lt;/p>&lt;h2>Impacted versions&lt;/h2>&lt;p>Grafana versions v11.6.0 and later are impacted by this vulnerability. &lt;/p>&lt;h2>Solutions and mitigations&lt;/h2>&lt;p>We recommend upgrading to one of the patched versions listed above as soon as possible.&lt;/p>&lt;p>If an upgrade is not immediately possible, the following workarounds reduce risk. Note: these may cause disruption to Grafana users and do not fully remediate the vulnerability.&lt;/p>&lt;p>&lt;strong>Option 1:&lt;/strong> Disable the sqlExpressions feature toggle.&lt;/p>&lt;p>&lt;strong>Option 2:&lt;/strong> Perform ALL of the following:&lt;/p>&lt;ul>&lt;li>If you have Sqlyze installed: update to at least v1.5.0 or disable it.&lt;/li>&lt;li>Disable all AWS data sources you have installed.&lt;/li>&lt;/ul>&lt;h2>CVE-2026-27880: Unauthenticated denial-of-service via OpenFeature endpoint&lt;/h2>&lt;p>Grafana's OpenFeature feature flag validation endpoints do not require authentication and accept unbounded user input. 
This input is read into memory.&lt;/p>&lt;p>The CVSS score for this vulnerability is 7.5 HIGH (&lt;u>&lt;a href="https://www.first.org/cvss/calculator/3.1">CVSS link&lt;/a>&lt;/u>).&lt;/p>&lt;h2>Impact&lt;/h2>&lt;p>An attacker could crash the Grafana server by sending requests that exhaust available memory.&lt;/p>&lt;h2>Impacted versions&lt;/h2>&lt;p>Grafana versions v12.1.0 and later are impacted by this vulnerability. &lt;/p>&lt;h2>Solutions and mitigations&lt;/h2>&lt;p>We recommend upgrading to one of the patched versions listed above as soon as possible.&lt;/p>&lt;p>If an upgrade is not immediately possible, any of the following workarounds reduces risk:&lt;/p>&lt;ul>&lt;li>Deploy Grafana in a highly available environment with automatic restarts.&lt;/li>&lt;li>Implement a reverse proxy in front of Grafana that limits input payload size. Cloudflare does this by default. Nginx supports this &lt;u>&lt;a href="https://nginx.org/en/docs/http/ngx_http_core_module.html#client_max_body_size">via explicit configuration&lt;/a>&lt;/u>.&lt;/li>&lt;/ul>&lt;h2>Timeline and post-incident review&lt;/h2>&lt;p>Here is a detailed incident timeline. 
All times are in UTC.&lt;/p>&lt;p>&lt;strong>CVE-2026-27876&lt;/strong>&lt;/p>&lt;table>&lt;thead>&lt;tr>&lt;th>Date/Time (UTC)&lt;/th>&lt;th>Event&lt;/th>&lt;/tr>&lt;/thead>&lt;tbody>&lt;tr>&lt;td>2025-02-06&lt;/td>&lt;td>sqlExpressions feature reimplemented with MySQL syntax and released in v11.6.0&lt;/td>&lt;/tr>&lt;tr>&lt;td>2026-02-23 13:33&lt;/td>&lt;td>Internal incident declared&lt;/td>&lt;/tr>&lt;tr>&lt;td>2026-02-23 15:08&lt;/td>&lt;td>Grafana Cloud patched&lt;/td>&lt;/tr>&lt;tr>&lt;td>2026-03-09&lt;/td>&lt;td>Private release issued to customers under embargo&lt;/td>&lt;/tr>&lt;tr>&lt;td>2026-03-25&lt;/td>&lt;td>Public release&lt;/td>&lt;/tr>&lt;tr>&lt;td>2026-03-26 04:00&lt;/td>&lt;td>Blog published&lt;/td>&lt;/tr>&lt;/tbody>&lt;/table>&lt;p>&lt;strong>CVE-2026-27880&lt;/strong>&lt;/p>&lt;table>&lt;thead>&lt;tr>&lt;th>Date/Time (UTC)&lt;/th>&lt;th>Event&lt;/th>&lt;/tr>&lt;/thead>&lt;tbody>&lt;tr>&lt;td>2025-06-27&lt;/td>&lt;td>New OpenFeature evaluation endpoint introduced and released in v12.1.0&lt;/td>&lt;/tr>&lt;tr>&lt;td>2026-02-24 13:12&lt;/td>&lt;td>Internal incident declared&lt;/td>&lt;/tr>&lt;tr>&lt;td>2026-02-24 17:49&lt;/td>&lt;td>Grafana Cloud stacks not behind Cloudflare were patched; Cloudflare-backed stacks were not affected&lt;/td>&lt;/tr>&lt;tr>&lt;td>2026-03-09&lt;/td>&lt;td>Private release issued to customers under embargo&lt;/td>&lt;/tr>&lt;tr>&lt;td>2026-03-25&lt;/td>&lt;td>Public release&lt;/td>&lt;/tr>&lt;tr>&lt;td>2026-03-26 04:00&lt;/td>&lt;td>Blog published&lt;/td>&lt;/tr>&lt;/tbody>&lt;/table>&lt;h2>Acknowledgements&lt;/h2>&lt;p>We would like to thank Liad Eliyahu, Head of Research at Miggo Security, for responsibly disclosing CVE-2026-27876 through our bug bounty program.&lt;/p>&lt;p>CVE-2026-27880 was discovered internally by the Grafana Labs security team.&lt;/p>&lt;h2>Reporting security issues&lt;/h2>&lt;p>If you think you have found a security vulnerability, please go to our &lt;u>&lt;a href="/legal/report-a-security-issue/">Report a security issue&lt;/a>&lt;/u> page to learn how to send a security report.&lt;/p>&lt;p>Grafana Labs will send you a response indicating the next steps in handling your report. 
After the initial reply to your report, the security team will keep you informed of the progress towards a fix and full announcement, and may ask for additional information or guidance.&lt;/p>&lt;p>&lt;strong>Important:&lt;/strong> We ask you not to disclose the vulnerability before it has been fixed and announced, unless you have received a response from the Grafana Labs security team saying that you may do so.&lt;/p>&lt;p>You can also read more about our &lt;u>&lt;a href="https://app.intigriti.com/programs/grafanalabs/grafanaossbbp/">bug bounty program&lt;/a>&lt;/u> and have a look at our &lt;u>&lt;a href="/security/hall-of-fame/">Security Hall of Fame&lt;/a>&lt;/u>.&lt;/p>&lt;h2>Security announcements&lt;/h2>&lt;p>We maintain a &lt;u>&lt;a href="/security/security-advisories/">security advisories page&lt;/a>&lt;/u>, where we always post a summary, remediation, and mitigation details for any patch containing security fixes. You can also subscribe to our &lt;u>&lt;a href="/blog/index.xml">RSS feed&lt;/a>&lt;/u>.&lt;/p></description></item><item><title>From raw data to flame graphs: A deep dive into how the OpenTelemetry eBPF profiler symbolizes Go</title><link>https://grafana.com/blog/deep-dive-into-how-the-opentelemetry-ebpf-profiler-symbolizes-go/</link><pubDate>Wed, 25 Mar 2026 14:52:53</pubDate><author>Marc Sanmiquel</author><guid>https://grafana.com/blog/deep-dive-into-how-the-opentelemetry-ebpf-profiler-symbolizes-go/</guid><description>&lt;p>Imagine you're troubleshooting a production issue: your application is slow, the CPU is spiking, and users are complaining. You turn to your profiler for answers—after all, this is exactly what it's built for.&lt;/p>&lt;p>The profiler runs, collecting thousands of stack samples. 
eBPF profilers, including the &lt;u>&lt;a href="https://github.com/open-telemetry/opentelemetry-ebpf-profiler">OpenTelemetry eBPF profiler&lt;/a>&lt;/u>, operate at the kernel level, so they capture raw program counters: memory addresses pointing into your binary. Before these addresses reach &lt;u>&lt;a href="/oss/pyroscope/">Pyroscope&lt;/a>&lt;/u>, the open source continuous profiling database, they have to pass through a process called symbolization. &lt;/p>&lt;p>Here's what that data looks like &lt;em>before&lt;/em> symbolization:&lt;/p>&lt;p>Raw memory addresses. Long strings of hexadecimal with no obvious meaning. &lt;/p>&lt;p>Which function is actually consuming CPU? Where in your code should you even start looking? To make sense of this, you'd need to manually map each address back to your binary, assuming you have the exact version that’s running in production. In many cases, that’s slow, error-prone, or simply impossible.&lt;/p>&lt;p>Now, you look at the same profile with symbolization enabled:&lt;/p>&lt;p>Suddenly, everything clicks. You can see exactly what's consuming CPU: &lt;code>main.computeResult&lt;/code> is your bottleneck. You know which function to investigate, and can jump straight to the source code to start optimizing.&lt;/p>&lt;p>This transformation from useless hex addresses to actionable function names is symbolization. And for eBPF profilers, making this happen is far more complex than it might seem.&lt;/p>&lt;p>In this post, we’ll unpack that process step by step by following a single memory address through the entire symbolization pipeline, from a raw program counter all the way to a function name. We’ll focus specifically on Go programs, which have a unique advantage: they embed a &lt;code>.gopclntab&lt;/code> section that remains in the binary even when debug symbols are removed (stripped), enabling profilers to extract function names on-target. 
In contrast, most other native languages rely on server-side symbolization, which is why Go programs tend to produce better profiling data out of the box.&lt;/p>&lt;h2>What you'll learn&lt;/h2>&lt;p>Whether you're debugging missing symbols in production or wondering why your stripped Go binaries still profile correctly while C programs show hex addresses, this post will demystify Go symbolization in eBPF profilers from the ground up.&lt;/p>&lt;p>We'll explore:&lt;/p>&lt;ul>&lt;li>What symbols are and where they hide in your binaries (you might be surprised to learn they can represent a significant part of your binary's size)&lt;/li>&lt;li>The pipeline steps from raw address to function name, with real code from the OpenTelemetry eBPF profiler&lt;/li>&lt;li>Binary search and frame caching—the performance tricks that make symbolization fast enough for production&lt;/li>&lt;li>Practical commands (&lt;code>readelf&lt;/code>,&lt;code>nm&lt;/code>, &lt;code>file&lt;/code>) to inspect your own binaries&lt;/li>&lt;li>What happens when symbolization fails and how to debug it&lt;/li>&lt;/ul>&lt;p>By the end, you'll understand why Go programs profile better than other native languages even when stripped, how to debug symbol issues, and why &lt;code>gopclntab&lt;/code>—a compact data structure that maps every function's address range to its name and source location—makes Go uniquely suited for eBPF profiling.&lt;/p>&lt;h2>Why symbolization is a challenge with eBPF profilers&lt;/h2>&lt;p>Traditional profilers inject agents into your process, call runtime APIs, or even recompile your code with instrumentation. Need a function name? Just ask the running program.&lt;/p>&lt;p>eBPF profilers can't do any of that. They run in the kernel space, which, on one hand, gives them superpowers—they can profile any process, see through container boundaries, and capture kernel stacks without modification. 
But this comes with strict constraints:&lt;/p>&lt;p>What eBPF profilers can see:&lt;/p>&lt;ul>&lt;li>Which instruction is currently executing (a memory address)&lt;/li>&lt;li>The stack of return addresses (more memory addresses)&lt;/li>&lt;li>Process memory maps (which binary contains each address)&lt;/li>&lt;/ul>&lt;p>What eBPF profilers cannot do:&lt;/p>&lt;ul>&lt;li>Modify the running program&lt;/li>&lt;li>Call functions inside your application&lt;/li>&lt;li>Access language runtime APIs (Go's reflection, Python's introspection)&lt;/li>&lt;li>Load debugging agents or libraries into processes&lt;/li>&lt;/ul>&lt;p>When the profiler captures a stack trace, it gets this:&lt;/p>&lt;p>&lt;em>[0x00000000000f0318, 0x00000000000f0478, 0x0000000000050c08]&lt;/em>&lt;/p>&lt;p>Three addresses. No names, no context, no metadata. Everything must be figured out externally by analyzing binary files on disk, while maintaining sub-1% CPU overhead in production.&lt;/p>&lt;p>This constraint shapes the entire symbolization architecture:&lt;/p>&lt;ul>&lt;li>All symbol extraction happens outside the process: parsing &lt;u>&lt;a href="https://en.wikipedia.org/wiki/Executable_and_Linkable_Format">ELF&lt;/a>&lt;/u> files, &lt;u>&lt;a href="https://dwarfstd.org/">DWARF&lt;/a>&lt;/u> debug info, and language-specific sections like Go's &lt;code>gopclntab&lt;/code>&lt;/li>&lt;li>Performance is critical: with 20-100 samples/sec across hundreds of processes, the profiler needs microsecond lookups&lt;/li>&lt;li>Graceful degradation: production binaries are often stripped; the profiler needs fallback strategies&lt;/li>&lt;/ul>&lt;h2>Introducing our Go program example&lt;/h2>&lt;p>To make these concepts concrete, we’ll use a simple Go program throughout this post. Here's the complete code:&lt;/p>&lt;p>&lt;/p>&lt;pre>&lt;code>package main
import (
"os"
"runtime/pprof"
"time"
)
func processRequest(n int) int {
data := fetchData(n)
return computeResult(data)
}
func fetchData(n int) int {
sum := 0
for i := 0; i &lt; n; i++ {
sum += i * i
}
return sum
}
func computeResult(data int) int {
result := 0
for i := 0; i &lt; data/1000; i++ {
result += i * 2
}
return result
}
func main() {
f, _ := os.Create("cpu.pprof")
defer f.Close()
pprof.StartCPUProfile(f)
defer pprof.StopCPUProfile()
start := time.Now()
for time.Since(start) &lt; 10*time.Second {
processRequest(50000)
}
}&lt;/code>&lt;/pre>&lt;p>Clear call relationships: &lt;code>main&lt;/code> → &lt;code>processRequest&lt;/code> → &lt;code>fetchData&lt;/code> and &lt;code>computeResult&lt;/code>. When profiled, &lt;code>computeResult&lt;/code> dominates CPU time due to its larger loop.&lt;/p>&lt;p>Compile it:&lt;/p>&lt;pre>&lt;code># Disable optimizations to prevent inlining
go build -gcflags="all=-N -l" -o demo demo.go&lt;/code>&lt;/pre>&lt;p>This produces a ~2.6MB binary we’ll explore throughout this post.&lt;/p>&lt;h2>What is symbolization: a closer look&lt;/h2>&lt;p>Symbolization is the process of mapping memory addresses to function names. When our demo compiles, the compiler transforms source into machine instructions:&lt;/p>&lt;p>&lt;/p>&lt;pre>&lt;code>func processRequest(n int) int {
data := fetchData(n)
return computeResult(data)
}
// Becomes machine code at address 0xf0310
// objdump -d demo | grep -A8 "00000000000f0310"
00000000000f0310 &lt;main.processRequest>:
f0310: ldr x16, [x28, #16]
f0314: cmp sp, x16
f0318: b.ls f0350
f031c: str x30, [sp, #-48]!
f0320: stur x29, [sp, #-8]
...
&lt;/code>&lt;/pre>&lt;p>The compiler knows &lt;code>main.processRequest&lt;/code> starts at address 0xf0310. Symbolization is the process of recovering that mapping when all you have is the address.&lt;/p>&lt;p>When the eBPF profiler samples your running application, it captures a stack trace of addresses:&lt;/p>&lt;p>&lt;em>0x00000000000f0318  ← CPU is here (inside &lt;/em>&lt;code>processRequest&lt;/code>&lt;em>)&lt;/em>&lt;/p>&lt;p>&lt;em>0x00000000000f0478  ← Called from here (inside &lt;/em>&lt;code>main.main&lt;/code>&lt;em>)&lt;/em>&lt;/p>&lt;p>&lt;em>0x0000000000050c08  ← Called from here (&lt;/em>&lt;code>runtime.main&lt;/code>&lt;em>)&lt;/em>&lt;/p>&lt;p>To transform these addresses into the flame graph you see in Pyroscope, the profiler must answer: "What function contains address 0xf0318?"&lt;/p>&lt;h3>The answer: symbol tables&lt;/h3>&lt;p>The compiler embeds this mapping in the binary’s symbol table. Here’s what &lt;code>nm&lt;/code> shows for our demo:&lt;/p>&lt;p>&lt;/p>&lt;pre>&lt;code>nm demo | grep -E 'main\.(process|fetch|compute)|runtime\.main'
00000000000f03e0 T main.computeResult
00000000000f0370 T main.fetchData
00000000000f0310 T main.processRequest
00000000000f0470 T main.main
0000000000050c00 T runtime.main
&lt;/code>&lt;/pre>&lt;p>Each line maps an address to a name. Given address 0xf0318, the profiler searches this table, finds it falls between 0xf0310 (&lt;code>processRequest&lt;/code>) and 0xf0370 (&lt;code>fetchData&lt;/code>), and returns &lt;code>main.processRequest&lt;/code>.&lt;/p>&lt;p>&lt;strong>Note: &lt;/strong>Not all symbols appear in flame graphs—only functions where the profiler captured samples. If &lt;code>fetchData&lt;/code> runs too fast to be sampled, it won't appear, even though &lt;code>nm&lt;/code> shows it exists. Profilers show where time is spent, not what was called.&lt;/p>&lt;h3>The lookup challenge&lt;/h3>&lt;p>If symbolization were as simple as saying "read table and look up address," it would be trivial. But production profiling faces several challenges:&lt;/p>&lt;ul>&lt;li>Performance: Thousands of lookups per second across hundreds of processes&lt;/li>&lt;li>Missing symbols: Production binaries are often stripped to save space&lt;/li>&lt;li>Multiple formats: Go binaries may have &lt;code>gopclntab&lt;/code>, ELF symbol tables, or DWARF debug info.&lt;/li>&lt;li>Size constraints: Symbol information can represent 20-30% of binary size&lt;/li>&lt;li>Dynamic loading: Shared libraries load at different addresses each run&lt;/li>&lt;/ul>&lt;h2>What's inside a binary?&lt;/h2>&lt;p>Our compiled demo is 2.6 MB. Where does that space go? Let’s explore the sections:&lt;/p>&lt;p>&lt;code>readelf -S demo | grep -E 'Name|gopclntab|symtab|debug'&lt;/code>&lt;/p>&lt;p>This shows section headers, but sizes appear on the next line. To see everything clearly:&lt;/p>&lt;p>&lt;code>readelf -S demo | grep -A1 "\.text\|\.gopclntab\|\.debug_info\|\.debug_line"&lt;/code>&lt;/p>&lt;p>You'll see output like:&lt;/p>&lt;pre>&lt;code>[ 1] .text PROGBITS 0000000000011000 00001000
00000000000dfc04 0000000000000000 AX 0 0 16
[ 6] .gopclntab PROGBITS 00000000001426c0 001326c0
000000000008f848 0000000000000000 A 0 0 32
&lt;/code>&lt;/pre>&lt;p>The second line shows the size in hex. Converting these to human-readable format (you can use &lt;code>printf '%d\n' 0x8f848&lt;/code> or a calculator) will show:&lt;/p>&lt;table>&lt;thead>&lt;tr>&lt;th>Section&lt;/th>&lt;th>Hex size&lt;/th>&lt;th>Human size&lt;/th>&lt;th>Purpose&lt;/th>&lt;/tr>&lt;/thead>&lt;tbody>&lt;tr>&lt;td>&lt;code>.text&lt;/code>&lt;/td>&lt;td>0xdfc04&lt;/td>&lt;td>0.87 MB&lt;/td>&lt;td>Actual executable code&lt;/td>&lt;/tr>&lt;tr>&lt;td>&lt;code>.gopclntab&lt;/code>&lt;/td>&lt;td>0x8f848&lt;/td>&lt;td>0.56 MB&lt;/td>&lt;td>Go's PC-to-line table (22% of binary!)&lt;/td>&lt;/tr>&lt;tr>&lt;td>&lt;code>.debug_info&lt;/code>&lt;/td>&lt;td>0x3ddca&lt;/td>&lt;td>0.24 MB&lt;/td>&lt;td>DWARF debug information&lt;/td>&lt;/tr>&lt;tr>&lt;td>&lt;code>.debug_line&lt;/code>&lt;/td>&lt;td>0x1c00e&lt;/td>&lt;td>0.11 MB&lt;/td>&lt;td>DWARF line number mappings&lt;/td>&lt;/tr>&lt;/tbody>&lt;/table>&lt;p>Key insight: Symbol information (&lt;code>.gopclntab&lt;/code> + debug sections) represents ~35% of this binary's size.&lt;/p>&lt;h3>Finding functions with nm&lt;/h3>&lt;p>We can use &lt;code>nm&lt;/code> to list the symbols in our binary and confirm the address-to-function mapping:&lt;/p>&lt;p>&lt;/p>&lt;pre>&lt;code>nm demo | grep -E 'processRequest|fetchData|computeResult'
00000000000f0310 T main.processRequest
00000000000f0370 T main.fetchData
00000000000f03e0 T main.computeResult&lt;/code>&lt;/pre>&lt;p>Format: address type name. The &lt;code>T&lt;/code> means "function in the text section." When the profiler sees address 0xf0318, it searches this table and finds it falls within &lt;code>main.processRequest&lt;/code> (which starts at 0xf0310).&lt;/p>&lt;h3>The stripped binary trade-off&lt;/h3>&lt;p>Production binaries are often stripped to save space:&lt;/p>&lt;p>&lt;/p>&lt;pre>&lt;code>cp demo demo-stripped
strip demo-stripped
ls -lh demo demo-stripped&lt;/code>&lt;/pre>&lt;p>Output:&lt;/p>&lt;p>&lt;/p>&lt;pre>&lt;code>-rwxr-xr-x  2.6M  demo
-rwxr-xr-x  1.9M  demo-stripped    # 27% smaller!&lt;/code>&lt;/pre>&lt;p>Quick way to check if a binary is stripped:&lt;/p>&lt;p>&lt;/p>&lt;pre>&lt;code>file demo
# demo: ELF 64-bit LSB executable, ARM aarch64 ... not stripped
file demo-stripped
# demo-stripped: ELF 64-bit LSB executable, ARM aarch64 ... stripped&lt;/code>&lt;/pre>&lt;p>Check what happened to symbols:&lt;/p>&lt;p>&lt;/p>&lt;pre>&lt;code>nm demo | wc -l           # 4,041 symbols
nm demo-stripped          # "no symbols"&lt;/code>&lt;/pre>&lt;p>But Go has a safety net—&lt;code>.gopclntab&lt;/code> survives stripping:&lt;/p>&lt;p>&lt;/p>&lt;pre>&lt;code>readelf -S demo-stripped | grep gopclntab
[ 6] .gopclntab        PROGBITS         00000000001426c0  001326c0&lt;/code>&lt;/pre>&lt;p>This is why Go is special. When you strip a C or Rust binary, symbolization becomes impossible without separate debug files. When you strip a Go binary, &lt;code>gopclntab&lt;/code> remains embedded—it's required by Go's runtime for panic traces and reflection. The OpenTelemetry eBPF profiler can still extract every function name.&lt;/p>&lt;p>This asymmetry is why Go programs are particularly well-suited for eBPF profiling in production. You can strip binaries to save space without sacrificing observability, as the profiler continues to provide full function names.&lt;/p>&lt;h2>The symbolization pipeline&lt;/h2>&lt;p>When the eBPF profiler captures address 0xf0310 from our demo program, here's the journey to transform it into &lt;code>main.processRequest&lt;/code>:&lt;/p>&lt;p>Raw Address: 0x00000000000f0310&lt;/p>&lt;p>  ↓&lt;/p>&lt;p>[1] Find the binary&lt;/p>&lt;p>  ↓&lt;/p>&lt;p>[2] Load symbol information&lt;/p>&lt;p>  ↓&lt;/p>&lt;p>[3] Extract symbols from &lt;code>gopclntab&lt;/code>&lt;/p>&lt;p>  ↓&lt;/p>&lt;p>[4] Cache the result&lt;/p>&lt;p>  ↓&lt;/p>&lt;p>Result: &lt;code>main.processRequest&lt;/code>&lt;/p>&lt;h3>Step 1: Find the binary&lt;/h3>&lt;p>The profiler reads &lt;code>/proc/&lt;pid>/maps&lt;/code> to see all memory mappings for the process. Each line shows a memory region with its address range, permissions, and which file it maps to.&lt;/p>&lt;p>For our demo, one of those lines would show:&lt;/p>&lt;p>&lt;code>&lt;address-range>  r-xp  &lt;offset>  demo&lt;/code>&lt;/p>&lt;p>The profiler checks: does 0xf0310 fall within this range? Yes → it's in our demo binary. The profiler now knows which file to analyze.&lt;/p>&lt;h3>Step 2: Load symbol information&lt;/h3>&lt;p>The profiler opens the ELF file (&lt;code>libpf/pfelf/file.go:171-183 - Open()&lt;/code>) and looks for the &lt;code>.gopclntab&lt;/code> section, which is Go's primary symbol source. 
If &lt;code>gopclntab&lt;/code> is missing or corrupted (extremely rare), it falls back to standard ELF symbol tables.&lt;/p>&lt;h3>Step 3: Extract symbols from gopclntab&lt;/h3>&lt;p>This is where Go’s design shines. The profiler doesn't need to try multiple strategies or handle complex fallbacks—&lt;code>gopclntab&lt;/code> provides everything needed.&lt;/p>&lt;p>&lt;strong>What is gopclntab, exactly?&lt;/strong>&lt;/p>&lt;p>The &lt;code>.gopclntab&lt;/code> section (Go "program counter to line table") is a compact data structure that maps every function's address range to its name and source location. The Go compiler embeds this because the runtime needs it for:&lt;/p>&lt;ul>&lt;li>Stack traces in panic messages&lt;/li>&lt;li>Runtime reflection (&lt;code>runtime.FuncForPC&lt;/code>)&lt;/li>&lt;li>Profiler support (runtime/pprof)&lt;/li>&lt;/ul>&lt;p>Because it's required by the runtime, &lt;code>gopclntab&lt;/code> is always present, even in stripped binaries.&lt;/p>&lt;p>&lt;strong>The structure&lt;/strong>&lt;/p>&lt;p>Let's see what &lt;code>gopclntab&lt;/code> contains for our demo:&lt;/p>&lt;p>&lt;/p>&lt;pre>&lt;code># Extract gopclntab section to analyze it
readelf -S demo | grep -A1 gopclntab&lt;/code>&lt;/pre>&lt;p>Output:&lt;/p>&lt;p>&lt;/p>&lt;pre>&lt;code>[ 6] .gopclntab        PROGBITS         00000000001426c0  001326c0
000000000008f848  0000000000000000   A       0     0     32&lt;/code>&lt;/pre>&lt;p>The section is 0x8f848 bytes (0.56 MB), or about 22% of our binary. It contains a header followed by a table of function entries. Each entry stores:&lt;/p>&lt;ul>&lt;li>Function start address (PC)&lt;/li>&lt;li>Function end address&lt;/li>&lt;li>Function name offset (points to string table)&lt;/li>&lt;li>Source file and line number information&lt;/li>&lt;/ul>&lt;p>&lt;strong>How the profiler uses it&lt;/strong>&lt;/p>&lt;p>When the profiler needs to symbolize address 0xf0318:&lt;/p>&lt;p>1. Load &lt;code>gopclntab&lt;/code>: The profiler reads the &lt;code>.gopclntab &lt;/code>section from the demo binary&lt;/p>&lt;p> (Code: &lt;code>nativeunwind/elfunwindinfo/elfgopclntab.go:388 - NewGopclntab()&lt;/code>)&lt;/p>&lt;p>2. Binary search: Find which function contains 0xf0318 by searching the sorted function table&lt;/p>&lt;ul>&lt;li>Searches entries until it finds: &lt;code>start=0xf0310&lt;/code>, &lt;code>end=0xf0370&lt;/code>, &lt;code>name="main.processRequest"&lt;/code>&lt;/li>&lt;/ul>&lt;p>3. Return result: The profiler now knows 0xf0318 is inside &lt;code>main.processRequest&lt;/code>&lt;/p>&lt;p>&lt;strong>Fallback strategy&lt;/strong>&lt;/p>&lt;p>If &lt;code>gopclntab&lt;/code> is somehow missing or corrupted (extremely rare), the profiler falls back to standard ELF symbol tables (&lt;code>.symtab&lt;/code>, &lt;code>.dynsym&lt;/code>). But in practice, every Go binary has a valid &lt;code>gopclntab&lt;/code>.&lt;/p>&lt;h3>Step 4: Cache the result&lt;/h3>&lt;p>Once resolved, the profiler caches 0xf0310 → &lt;code>main.processRequest&lt;/code>. If the next stack sample hits the same address, it returns instantly without re-parsing the binary. Unlike DWARF debug info (which is compressed and expensive to decode), &lt;code>gopclntab&lt;/code> is uncompressed and memory-mapped. 
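A minimal version of that cache is just a map in front of the expensive lookup. The sketch below is illustrative only: resolveSlow and its address check are stand-ins for the real gopclntab parse, and the actual frame cache adds LRU eviction on top.

```go
package main

import "fmt"

// parseCount tracks how often we pay the expensive-lookup cost.
var parseCount = 0

// resolveSlow stands in for a full gopclntab parse and binary search.
func resolveSlow(pc uint64) string {
	parseCount++
	if pc >= 0xf0310 { // hypothetical range from the demo binary
		return "main.processRequest"
	}
	return "runtime.main"
}

// cache memoizes resolved addresses so repeated stack samples that
// hit the same PC never re-parse the binary.
var cache = map[uint64]string{}

func resolve(pc uint64) string {
	if name, ok := cache[pc]; ok {
		return name // cache hit: returns instantly
	}
	name := resolveSlow(pc)
	cache[pc] = name
	return name
}

func main() {
	resolve(0xf0318)
	resolve(0xf0318) // the second lookup is served from the cache
	fmt.Println(parseCount)
}
```

Two lookups of the same address cost only one slow parse.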
This makes Go symbolization particularly fast—the profiler can parse &lt;code>gopclntab&lt;/code> once at process startup, then perform microsecond lookups for every subsequent address.&lt;/p>&lt;p>The frame cache (&lt;code>processmanager/manager.go:75-79&lt;/code>) stores the resolved frames with an LRU eviction policy, keeping hot functions instantly accessible.&lt;/p>&lt;h2>Performance and optimizations&lt;/h2>&lt;p>Symbolization must be fast. With profilers sampling at 20-100 Hz across potentially hundreds of processes, the profiler might need to resolve thousands of addresses per second. At that scale, even small inefficiencies compound into significant overhead.&lt;/p>&lt;h3>The speed requirements&lt;/h3>&lt;p>Consider a modest setup: 50 processes, 20 samples/second, 20 stack frames per sample. That's 20,000 address lookups per second. If each lookup takes 1 millisecond (linear scan), the profiler would consume an entire CPU core just for symbolization, which is unacceptable overhead. The profiler's target: under 1% CPU overhead, requiring lookups in the microsecond range.&lt;/p>&lt;h3>Binary search: O(log n) lookups&lt;/h3>&lt;p>The profiler needs to solve the reverse lookup problem: given an address, find the symbol name. Since &lt;code>gopclntab&lt;/code> stores functions as address ranges (each function spans multiple addresses), the profiler moves through the following phases:&lt;/p>&lt;p>1. Extraction phase (once per binary):&lt;/p>&lt;ul>&lt;li>Parses &lt;code>gopclntab&lt;/code> to extract all functions&lt;/li>&lt;li>Each entry contains: start address, function name, source file info&lt;/li>&lt;li>Functions are naturally sorted by address in &lt;code>gopclntab&lt;/code>&lt;/li>&lt;/ul>&lt;p>2. 
Lookup phase (for each stack address):&lt;/p>&lt;ul>&lt;li>Uses binary search to find which range contains the address&lt;/li>&lt;li>Example: address 0xf0318 → binary search → found in range starting at 0xf0310→ returns &lt;code>"main.processRequest"&lt;/code>&lt;/li>&lt;/ul>&lt;p>Complexity: &lt;em>O(log n)&lt;/em> where n is the number of functions. With 4,000 functions (like our demo), this means ~12 comparisons per lookup instead of 4,000 linear scans.&lt;/p>&lt;p>Code reference: &lt;code>nativeunwind/elfunwindinfo/elfgopclntab.go:544-556&lt;/code> uses Go’s &lt;code>sort.Search&lt;/code>&lt;/p>&lt;h3>Frame caching&lt;/h3>&lt;p>Once a frame is symbolized, the profiler caches the complete result—not just the function name, but the entire resolved frame including source file and line number information.&lt;/p>&lt;p>The frame cache (&lt;code>processmanager/manager.go:345-355&lt;/code>) uses an LRU eviction policy.&lt;/p>&lt;p>Configuration:&lt;/p>&lt;ul>&lt;li> Cache size: 16,384 entries&lt;/li>&lt;li> TTL: 5 minutes per entry&lt;/li>&lt;li> Refreshed on each hit to keep hot paths cached&lt;/li>&lt;/ul>&lt;p>Since &lt;code>gopclntab&lt;/code> is memory-mapped and uncompressed, even cache misses are fast (microseconds). 
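The binary search on the cache-miss path can be sketched with Go's sort.Search, over a hypothetical function table that mirrors the nm output shown earlier (the end addresses here are illustrative):

```go
package main

import (
	"fmt"
	"sort"
)

// funcEntry mirrors one gopclntab record: an address range and a name.
// (Field names are illustrative, not the profiler's actual types.)
type funcEntry struct {
	start uint64
	end   uint64
	name  string
}

// table must be sorted by start address, as gopclntab entries are.
var table = []funcEntry{
	{0x50c00, 0x50d00, "runtime.main"},
	{0xf0310, 0xf0370, "main.processRequest"},
	{0xf0370, 0xf03e0, "main.fetchData"},
	{0xf03e0, 0xf0470, "main.computeResult"},
	{0xf0470, 0xf04d0, "main.main"},
}

// lookup returns the function containing pc, via O(log n) binary search.
func lookup(pc uint64) (string, bool) {
	// Find the first entry whose start address exceeds pc; the
	// candidate containing range is the entry just before it.
	i := sort.Search(len(table), func(i int) bool { return table[i].start > pc })
	if i == 0 {
		return "", false // pc is below the first known function
	}
	e := table[i-1]
	if pc >= e.end {
		return "", false // pc falls in a gap between functions
	}
	return e.name, true
}

func main() {
	name, ok := lookup(0xf0318)
	fmt.Println(name, ok) // main.processRequest true
}
```

sort.Search returns the smallest index for which the predicate holds, so the containing function, if any, is always the entry immediately before that index.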
The cache primarily avoids repeated parsing of the same addresses across multiple stack samples.&lt;/p>&lt;h3>Real performance&lt;/h3>&lt;p>With these optimizations, the OpenTelemetry eBPF profiler achieves:&lt;/p>&lt;ul>&lt;li>Sub-microsecond symbol lookups (cached)&lt;/li>&lt;li>~100 microseconds for cache misses (disk read + parse)&lt;/li>&lt;li>&lt; 1% CPU overhead in production&lt;/li>&lt;/ul>&lt;p>This makes continuous profiling practical—you can run it 24/7 without noticing the performance impact.&lt;/p>&lt;h2>When symbolization fails&lt;/h2>&lt;p>Now that you know where symbols live, what happens when they're missing or incomplete?&lt;/p>&lt;h3>Missing functions despite having symbols&lt;/h3>&lt;p>If &lt;code>nm&lt;/code> doesn't show a function you know exists, the compiler likely inlined it—merged the function into its caller for optimization. This is common with small, frequently called functions.&lt;/p>&lt;p>For Go, prevent inlining during development:&lt;/p>&lt;p>&lt;code>go build -gcflags="all=-N -l" -o app main.go&lt;/code>&lt;/p>&lt;p>The &lt;code>-N&lt;/code> disables optimizations and &lt;code>-l&lt;/code> disables inlining. Don't use this for production—the performance cost is significant.&lt;/p>&lt;h3>CGO and C libraries&lt;/h3>&lt;p>For pure Go programs, symbolization "just works" and all your dependencies compile into a single binary with &lt;code>gopclntab&lt;/code> covering everything. But if your Go program uses &lt;u>&lt;a href="https://pkg.go.dev/cmd/cgo">CGO&lt;/a>&lt;/u> to call C libraries, those portions behave differently:&lt;/p>&lt;ul>&lt;li>Pure Go dependencies compile into your binary with &lt;code>gopclntab&lt;/code>, so all function calls are symbolized—whether it's your code or third-party Go packages.&lt;/li>&lt;li>For CGO/C libraries, functions may appear as hex addresses if the libraries are stripped. 
&lt;code>gopclntab&lt;/code> only covers Go code, not linked C binaries&lt;/li>&lt;/ul>&lt;p>In practice:&lt;/p>&lt;ul>&lt;li>If you see hex addresses in a Go program's profile, check for CGO usage&lt;/li>&lt;li>The Go portions always symbolize correctly&lt;/li>&lt;li>C library calls might show as addresses unless the shared libraries have debug symbols&lt;/li>&lt;/ul>&lt;h3>Quick diagnostic commands&lt;/h3>&lt;p>These four commands quickly tell you what symbol information is available before you start profiling.&lt;/p>&lt;p>&lt;/p>&lt;pre>&lt;code>file your-app # Stripped or not?
nm your-app | wc -l # How many symbols?
readelf -S your-app | grep gopclntab # Go binary check
readelf -S your-app | grep debug # Has debug info?
&lt;/code>&lt;/pre>&lt;h2>Wrapping up&lt;/h2>&lt;p>The next time you open Pyroscope and see function names in a flame graph for your Go application, you'll know the sophisticated machinery that made them appear. That &lt;code>main.processRequest&lt;/code> you're investigating? It started as raw address 0x00000000000f0310, was captured by eBPF from a running process the profiler couldn't modify, was then looked up in &lt;code>gopclntab&lt;/code> using binary search, and emerged as a readable name—all in microseconds, with minimal overhead.&lt;/p>&lt;p>Go's design makes this remarkably reliable. While other native languages lose all symbol information when stripped, Go's &lt;code>gopclntab&lt;/code> survives—the runtime needs it for panic traces, so it's always present. This single design decision means you can strip Go binaries to save 30% space in production while maintaining perfect symbolization. No separate debug files, no symbol servers, and no trade-offs.&lt;/p>&lt;p>The OpenTelemetry eBPF profiler leverages this by parsing &lt;code>gopclntab&lt;/code> directly, providing consistent symbolization whether your binary is fresh from development or stripped for production. This is why Go programs are particularly well-suited for continuous profiling—you get full observability without sacrificing binary size or runtime performance.&lt;/p>&lt;p>Symbolization is the invisible foundation of modern observability. Without it, profiling data would be nearly useless—just hexadecimal addresses with no meaning. 
To learn more, you can check out the &lt;u>&lt;a href="https://github.com/open-telemetry/opentelemetry-ebpf-profiler">OTel eBPF profiler on GitHub&lt;/a>&lt;/u> and our &lt;u>&lt;a href="/docs/pyroscope/latest/configure-client/opentelemetry/ebpf-profiler/">Pyroscope eBPF setup docs&lt;/a>&lt;/u>.&lt;/p>&lt;p>&lt;em>&lt;a href="/products/cloud/?pg=quickly-go-from-exploration-to-action-with-new-one-click-integrations-in-grafana-drilldown&amp;plcmt=footer-cta">Grafana Cloud&lt;/a>&lt;/em>&lt;em> is the easiest way to get started with metrics, logs, traces, dashboards, and more. We have a generous forever-free tier and plans for every use case. &lt;/em>&lt;em>&lt;a href="/auth/sign-up/create-user/?pg=quickly-go-from-exploration-to-action-with-new-one-click-integrations-in-grafana-drilldown&amp;plcmt=footer-cta">Sign up for free now!&lt;/a>&lt;/em>&lt;/p></description></item><item><title>How OpenRouter and Grafana Cloud bring observability to LLM-powered applications</title><link>https://grafana.com/blog/how-openrouter-and-grafana-cloud-bring-observability-to-llm-powered-applications/</link><pubDate>Tue, 24 Mar 2026 00:20:56</pubDate><author>Chris Watts</author><guid>https://grafana.com/blog/how-openrouter-and-grafana-cloud-bring-observability-to-llm-powered-applications/</guid><description>&lt;p>&lt;em>Chris Watts is Head of Enterprise Engineering at OpenRouter, building infrastructure for AI applications. Previously at Amazon and a startup founder.&lt;/em>&lt;/p>&lt;p>As large language models become core infrastructure for more and more applications, teams are discovering a familiar challenge in a new context: you can't improve what you can't see. 
Whether you're routing requests across multiple AI providers, managing costs across dozens of models, or debugging why a particular prompt is timing out in production, observability is no longer optional for LLM-powered systems.&lt;/p>&lt;p>At &lt;u>&lt;a href="https://openrouter.ai">OpenRouter&lt;/a>&lt;/u>, we provide a unified API that gives developers access to hundreds of models from providers like OpenAI, Anthropic, Google, and Meta through a single integration. We handle load balancing, provider fallbacks, and model routing so teams can focus on building their applications rather than managing multiple API keys and billing accounts.&lt;/p>&lt;p>But access to models is only half the story. When you're running AI workloads in production, you need to understand how those workloads are performing, what they're costing, and where they're failing. That's why we built &lt;u>&lt;a href="https://openrouter.ai/docs/guides/features/broadcast">Broadcast&lt;/a>&lt;/u>, a feature that automatically sends traces from your OpenRouter requests to observability platforms like  &lt;u>&lt;a href="/products/cloud/">Grafana Cloud&lt;/a>&lt;/u>, with no additional instrumentation required in your application code.&lt;/p>&lt;p>In this post, we'll walk through how Broadcast works with Grafana Cloud, and share some of the real-world use cases we're seeing.&lt;/p>&lt;h2>Why LLM observability is different&lt;/h2>&lt;p>Traditional application monitoring focuses on familiar signals: HTTP status codes, response times, and error rates. LLM applications use those same signals, but they also introduce entirely new dimensions that teams need to track:&lt;/p>&lt;ul>&lt;li>&lt;strong>Token usage and costs&lt;/strong>: Every request consumes tokens, and costs vary across models. A single prompt sent to GPT-4o vs. 
Claude 3.5 Haiku can differ dramatically.&lt;/li>&lt;li>&lt;strong>Model behavior variability&lt;/strong>: The same prompt can produce different results depending on which model or provider handles it. When you're using fallbacks or load balancing across providers, understanding which model actually served a request matters.&lt;/li>&lt;li>&lt;strong>Latency profiles&lt;/strong>: LLM latency isn't just about total response time. Time to first token, tokens per second, and total generation time each tell a different part of the story.&lt;/li>&lt;li>&lt;strong>Non-deterministic failures&lt;/strong>: LLM requests can fail in subtle ways, like hitting rate limits, receiving truncated outputs, or producing responses that technically succeed but don't meet quality expectations.&lt;/li>&lt;/ul>&lt;p>Most teams start by adding logging and metrics to their own application code, but this approach quickly becomes difficult to maintain, especially when you're using multiple models and providers. What you really want is observability that's built into the infrastructure layer, where the routing and model selection actually happen.&lt;/p>&lt;h2>How OpenRouter Broadcast works with Grafana Cloud&lt;/h2>&lt;p>OpenRouter Broadcast works by automatically generating &lt;u>&lt;a href="https://opentelemetry.io/">OpenTelemetry&lt;/a>&lt;/u> traces for every API request and sending them to your configured destinations. There's no SDK to install, no code to change, and no additional latency added to your requests. You configure it once in your OpenRouter dashboard, and every request flowing through your account is traced.&lt;/p>&lt;p>For Grafana Cloud, traces are sent via the standard OTLP HTTP/JSON endpoint directly to &lt;u>&lt;a href="/products/cloud/traces/">Grafana Cloud Traces&lt;/a>&lt;/u>, the cloud-based tracing backend powered by &lt;u>&lt;a href="/oss/tempo/">Tempo&lt;/a>&lt;/u> OSS. 
Each trace includes rich attributes following OpenTelemetry semantic conventions for generative AI:&lt;/p>&lt;ul>&lt;li>&lt;strong>Model information&lt;/strong>: Which model was requested, which model actually served the response, and which provider handled it&lt;/li>&lt;li>&lt;strong>Token usage&lt;/strong>: Input tokens, output tokens, and total tokens consumed&lt;/li>&lt;li>&lt;strong>Timing data&lt;/strong>: Total request duration, time to first token, and generation speed&lt;/li>&lt;li>&lt;strong>Cost data&lt;/strong>: The cost in USD for each request&lt;/li>&lt;li>&lt;strong>Status and errors&lt;/strong>: Whether the request succeeded, why generation ended, and any error details&lt;/li>&lt;li>&lt;strong>Custom metadata&lt;/strong>: Any application-specific context you attach to your requests, like user IDs, session IDs, or feature flags&lt;/li>&lt;/ul>&lt;p>Once traces are flowing into Grafana Cloud, you can query them using &lt;u>&lt;a href="/docs/tempo/latest/traceql/">TraceQL&lt;/a>&lt;/u>, build dashboards, and set up alerts, all using the same Grafana Cloud interface your team already knows.&lt;/p>&lt;p>You can see span rate, error rates, and duration for OpenRouter traces at a glance:&lt;/p>&lt;p>You can drill into a single LLM Generation trace to inspect timing and service details:&lt;/p>&lt;p>Full span attributes show the prompt, model, token count, and completion, all captured via OpenTelemetry:&lt;/p>&lt;h2>Real-world use cases&lt;/h2>&lt;p>Here are some of the ways teams are using OpenRouter Broadcast with Grafana Cloud today.&lt;/p>&lt;h3>Tracking costs across models and features&lt;/h3>&lt;p>One of the most immediate use cases is cost visibility. When you're routing requests across multiple models, it's easy to lose track of where your spend is going. 
With traces flowing into Grafana Cloud, teams build dashboards that break down costs by model, API key, user, or any custom metadata they attach to their requests.&lt;/p>&lt;p>For example, a team running both a customer-facing chatbot and an internal document processing pipeline can use separate API keys or custom metadata to attribute costs to each workload. A simple TraceQL query like this surfaces all requests from a specific environment:&lt;/p>&lt;pre>&lt;code>{ resource.service.name = "openrouter" &amp;&amp; span.trace.metadata.environment = "production" }&lt;/code>&lt;/pre>&lt;p>This kind of visibility lets engineering leads and finance teams answer questions like "How much did our AI features cost last week?" or "Which model is giving us the best cost-per-quality ratio?" without building custom analytics infrastructure.&lt;/p>&lt;h3>Monitoring latency and performance&lt;/h3>&lt;p>LLM latency directly impacts user experience. A chatbot that takes 8 seconds to start responding feels broken, even if the final output is excellent. With OpenRouter traces in Grafana Cloud, teams can monitor latency trends over time, set alerts for slow requests, and compare performance across models.&lt;/p>&lt;p>TraceQL makes it easy to find outliers:&lt;/p>&lt;p>&lt;/p>&lt;pre>&lt;code>{ resource.service.name = "openrouter" &amp;&amp; duration > 5s }&lt;/code>&lt;/pre>&lt;p>Teams often build Grafana dashboards that show p50, p95, and p99 latency by model, which helps them make informed decisions about which models to use for latency-sensitive vs. batch workloads.&lt;/p>&lt;h3>Debugging errors and failed requests&lt;/h3>&lt;p>When something goes wrong in an LLM pipeline, the cause isn't always obvious. Was it a rate limit? A poorly created prompt? A provider outage? 
With distributed traces in Grafana Cloud, teams can quickly filter for errors and drill into individual requests to see exactly what happened:&lt;/p>&lt;p>&lt;/p>&lt;pre>&lt;code>{ resource.service.name = "openrouter" &amp;&amp; status = error }&lt;/code>&lt;/pre>&lt;p>Each trace includes the model, provider, error details, and timing information, giving teams the context they need to diagnose issues without digging through application logs.&lt;/p>&lt;h3>Usage analytics and capacity planning&lt;/h3>&lt;p>As AI features grow, teams need to understand usage patterns to plan capacity and negotiate contracts with providers. Grafana Cloud dashboards built on OpenRouter traces can show request volume over time, token consumption trends, and model popularity, all without any additional instrumentation.&lt;/p>&lt;p>Teams use this data to track how usage is growing and answer questions like: "Are we approaching our rate limits?" or "Should we shift more traffic to a cheaper model for this use case?" &lt;/p>&lt;h2>Getting started&lt;/h2>&lt;p>Setting up the integration takes just a few minutes:&lt;/p>&lt;p>1. &lt;strong>Get your Grafana Cloud credentials&lt;/strong>: You'll need your OTLP gateway endpoint, instance ID, and an API token with &lt;code>traces:write&lt;/code> permissions from your &lt;u>&lt;a href="/auth/sign-in">Grafana Cloud portal&lt;/a>&lt;/u>.&lt;/p>&lt;p>2. &lt;strong>Enable Broadcast in OpenRouter&lt;/strong>: Navigate to &lt;u>&lt;a href="https://openrouter.ai/settings/broadcast">Settings > Observability&lt;/a>&lt;/u> in your OpenRouter dashboard and toggle Broadcast on.&lt;/p>&lt;p>3. &lt;strong>Configure Grafana Cloud as a destination&lt;/strong>: Enter your Grafana Cloud credentials and click &lt;strong>Test Connection&lt;/strong> to verify the setup.&lt;/p>&lt;p>4.&lt;strong> Start querying traces&lt;/strong>: Once configured, every OpenRouter request will generate a trace in Grafana Cloud. 
Navigate to Explore, select your Tempo data source, and run &lt;code>{ resource.service.name = "openrouter" }&lt;/code> to see your traces.&lt;/p>&lt;p>For detailed setup instructions, including how to find your OTLP endpoint and create API tokens, check out our &lt;u>&lt;a href="https://openrouter.ai/docs/guides/features/broadcast/grafana">Broadcast to Grafana Cloud documentation&lt;/a>&lt;/u>.&lt;/p>&lt;h3>Adding custom metadata&lt;/h3>&lt;p>To get the most out of the integration, we recommend attaching custom metadata to your OpenRouter requests. This metadata flows through to Grafana Cloud as span attributes, making it easy to filter and group traces by your own application context:&lt;/p>&lt;p>&lt;/p>&lt;pre>&lt;code>{
"model": "openai/gpt-4o",
"messages": [{ "role": "user", "content": "Summarize this document..." }],
"user": "user_12345",
"session_id": "session_abc",
"trace": {
"trace_name": "Document Summary",
"environment": "production",
"feature": "summarization"
}
}
&lt;/code>&lt;/pre>&lt;p>You can then query these attributes in TraceQL:&lt;/p>&lt;p>&lt;/p>&lt;pre>&lt;code>{ resource.service.name = "openrouter" &amp;&amp; span.trace.metadata.feature = "summarization" }&lt;/code>&lt;/pre>&lt;h3>Privacy controls&lt;/h3>&lt;p>For teams working with sensitive data, Broadcast supports a &lt;u>&lt;a href="https://openrouter.ai/docs/guides/features/broadcast/overview#privacy-mode">Privacy Mode&lt;/a>&lt;/u> that excludes prompt and completion content from traces while still sending all operational data like token usage, costs, timing, and model information. This lets you get full observability without exposing the content of your LLM interactions.&lt;/p>&lt;h2>What's next&lt;/h2>&lt;p>We're continuing to invest in making LLM observability as seamless as possible. We're adding new integrations regularly and are working on richer trace data, including more granular timing breakdowns and quality signals that can help you build even more comprehensive observability dashboards.&lt;/p>&lt;p>If you're building with LLMs and want visibility into how your AI workloads are performing, give the &lt;u>&lt;a href="https://openrouter.ai/docs/guides/features/broadcast/grafana">OpenRouter and Grafana Cloud integration&lt;/a>&lt;/u> a try. You can get started with a &lt;u>&lt;a href="/auth/sign-up/create-user">free Grafana Cloud account&lt;/a>&lt;/u> and an &lt;u>&lt;a href="https://openrouter.ai">OpenRouter&lt;/a>&lt;/u> account in minutes.&lt;/p>&lt;p>&lt;em>To learn more about OpenRouter's Broadcast feature and all supported destinations, visit the &lt;/em>&lt;u>&lt;em>&lt;a href="https://openrouter.ai/docs/guides/features/broadcast">Broadcast documentation&lt;/a>&lt;/em>&lt;/u>&lt;em>. 
For questions or feedback, reach out to us at &lt;/em>&lt;u>&lt;em>&lt;a href="http://openrouter.ai">openrouter.ai&lt;/a>&lt;/em>&lt;/u>&lt;em>.&lt;/em>&lt;/p>&lt;p>&lt;/p></description></item><item><title>Instrument zero‑code observability for LLMs and agents on Kubernetes</title><link>https://grafana.com/blog/ai-observability-zero-code/</link><pubDate>Fri, 20 Mar 2026 17:20:53</pubDate><author>Ishan Jain</author><guid>https://grafana.com/blog/ai-observability-zero-code/</guid><description>&lt;p>&lt;strong>Note: &lt;/strong>The world is changing all around us thanks to AI. Today, anyone and everyone can be a developer, using LLMs to create LLM-powered applications, which users can then interact with by using even more LLMs. &lt;/p>&lt;p>Observability practitioners need to adapt and they need the right tools for the job. In this series, we'll show you how to use Grafana Cloud to monitor AI applications, including &lt;u>&lt;a href="/blog/ai-observability-llms-in-production">workloads in production&lt;/a>&lt;/u>, &lt;u>&lt;a href="/blog/ai-observability-ai-agents">AI agents&lt;/a>&lt;/u>, &lt;u>&lt;a href="/blog/ai-observability-MCP-servers">MCP servers&lt;/a>&lt;/u>, and zero-code LLMs (this post).&lt;/p>&lt;p>Building AI services with large language models and agentic frameworks often means running complex microservices on Kubernetes. Observability is vital, but instrumenting every pod in a distributed system can quickly become a maintenance nightmare. &lt;/p>&lt;p>&lt;strong>OpenLIT Operator&lt;/strong> solves this problem by automatically injecting OpenTelemetry instrumentation into your AI workloads—no code changes or image rebuilds required. 
When combined with &lt;a href="/docs/grafana-cloud/monitor-applications/ai-observability/">AI Observability&lt;/a> in Grafana Cloud, you can monitor costs, latency, token usage, and agent workflows across your entire cluster in minutes.&lt;/p>&lt;p>In this final post in our AI Observability series, we'll show you how to easily get started by combining OpenLIT Operator and Grafana Cloud to enable zero-code observability for your AI workloads.&lt;/p>&lt;h2>Why zero‑code instrumentation matters&lt;/h2>&lt;p>Traditional observability relies on developers adding instrumentation libraries to their application code. But in the fast‑moving world of generative AI, your stack might include multiple model providers, agent frameworks, vector databases, and custom tools. Keeping instrumentation up to date across all these components is a burden. &lt;/p>&lt;p>The &lt;u>&lt;a href="https://docs.openlit.io/latest/operator/overview">OpenLIT Operator&lt;/a>&lt;/u> brings zero‑code AI observability to Kubernetes. It automatically injects and configures OpenTelemetry instrumentation into your pods, producing distributed traces and metrics without any code changes. Because it is built on OpenTelemetry standards, it integrates with existing observability infrastructure and allows you to switch between providers (OpenLIT, OpenInference, OpenLLMetry, custom) without redeploying your applications.&lt;/p>&lt;p>This zero‑code approach is designed specifically for AI workloads. It provides seamless observability for LLMs, vector databases, and agentic frameworks running in Kubernetes. 
You can track token usage, monitor agent workflows, measure response times, and debug AI framework interactions—all without touching your code.&lt;/p>&lt;h3>Benefits of zero‑code observability in Grafana Cloud&lt;/h3>&lt;p>There are also multiple reasons why you should use zero-code observability in Grafana Cloud.&lt;/p>&lt;ul>&lt;li>&lt;strong>Rapid onboarding:&lt;/strong> Deploy the OpenLIT Operator once and instrument all your AI workloads without modifying a single line of code.&lt;/li>&lt;li>&lt;strong>Comprehensive coverage:&lt;/strong> The operator supports major LLM providers, vector databases, and agent frameworks, and can be extended to other providers through its plugin architecture.&lt;/li>&lt;li>&lt;strong>Vendor neutrality:&lt;/strong> Built on OpenTelemetry, the operator allows you to send telemetry to Grafana Cloud, a self‑hosted OpenTelemetry collector, or any OTLP‑compatible backend.&lt;/li>&lt;li>&lt;strong>Cost and performance insights:&lt;/strong> Distributed traces capture token usage, cost, latency, and agent step sequences, enabling you to optimise model selection and resource allocation.&lt;/li>&lt;/ul>&lt;h2>How to set up zero-code observability for AI applications in Grafana Cloud&lt;/h2>&lt;p>Now that we've covered why you should be using Grafana Cloud for zero-code observability, let's look at how you can make that happen, starting with a high-level explanation of the workflow, followed by step-by-step instructions for getting started quickly.&lt;/p>&lt;p>And if you get stuck anywhere along the way or need help with your own setup, click on the pulsar icon in the top-right corner of the Grafana Cloud UI to open a chat with &lt;u>&lt;a href="/docs/grafana-cloud/machine-learning/assistant/?pg=blog&amp;plcmt=body-txt">Grafana Assistant&lt;/a>&lt;/u>, our purpose-built LLM that can help troubleshoot incidents, manage dashboards, and answer product questions.&lt;/p>&lt;h3>Architecture overview&lt;/h3>&lt;p>AI applications like LLMs and 
agents run inside pods in your Kubernetes cluster. The OpenLIT Operator continuously monitors these pods and checks them against your instrumentation policies. When it finds a matching pod, it automatically injects an init container that sets up OpenTelemetry instrumentation, enabling observability without requiring manual changes to your application code.&lt;/p>&lt;p>Telemetry is sent to an OpenLIT collector or directly to Grafana Cloud’s OpenTelemetry Protocol (OTLP) gateway. The AI Observability dashboards in Grafana Cloud then visualize latency, cost, and quality metrics.&lt;/p>&lt;p>The workflow consists of four key pieces:&lt;/p>&lt;ol>&lt;li>&lt;strong>AI workloads&lt;/strong>: Pods running LLMs, vector DBs, or agent frameworks such as LangChain, CrewAI, or OpenAI Agents. The operator supports a wide range of LLM providers (OpenAI, Anthropic, Google, AWS Bedrock, Mistral) and frameworks (LangChain, LlamaIndex, CrewAI, Haystack, DSPy, and more).&lt;/li>&lt;li>&lt;strong>OpenLIT Operator&lt;/strong>: A Kubernetes operator that injects OpenTelemetry instrumentation into selected pods based on label selectors. The operator is OpenTelemetry‑native and allows you to switch providers without changing your application code.&lt;/li>&lt;li>&lt;strong>OpenLIT collector&lt;/strong>: Collects traces and metrics from instrumented pods. You can run it in‑cluster via Helm or send telemetry directly to Grafana Cloud’s OTLP endpoint.&lt;/li>&lt;li>&lt;strong>Grafana Cloud&lt;/strong>: Stores traces in &lt;u>&lt;a href="/docs/grafana-cloud/send-data/traces/?pg=blog&amp;plcmt=body-txt">Tempo&lt;/a>&lt;/u> and metrics in &lt;u>&lt;a href="/docs/grafana-cloud/send-data/metrics/metrics-prometheus/?pg=blog&amp;plcmt=body-txt">Prometheus&lt;/a>&lt;/u> through our fully managed platform. 
Our AI observability solution provides pre‑built dashboards for GenAI, vector DBs, agents, and Model Context Protocol (MCP), allowing you to explore latency percentiles, token and cost metrics, agent step sequences, and evaluation results.&lt;/li>&lt;/ol>&lt;h2>Step 1: Add the AI Observability integration&lt;/h2>&lt;p>Before instrumenting your cluster, add the AI Observability integration to your Grafana Cloud stack. This can be done by clicking on &lt;strong>Connections&lt;/strong> in the left-side menu and following the &lt;u>&lt;a href="/docs/grafana-cloud/monitor-applications/ai-observability/setup/?pg=blog&amp;plcmt=body-txt#install-the-ai-observability-integration">steps outlined in our documentation&lt;/a>&lt;/u>.&lt;/p>&lt;p>This provisions dashboards and sets up a managed OTLP gateway for receiving your traces and metrics. Once telemetry arrives, the dashboards populate automatically with request rates, latency distributions, and cost summaries.&lt;/p>&lt;h2>Step 2: Prepare your Kubernetes environment&lt;/h2>&lt;p>To follow this guide, you’ll need a Kubernetes cluster with cluster‑admin privileges, Helm, and &lt;code>kubectl&lt;/code> configured. If you don’t have a cluster, you can create one locally using &lt;u>&lt;a href="https://k3d.io/">k3d&lt;/a>&lt;/u> or &lt;u>&lt;a href="https://minikube.sigs.k8s.io/">minikube&lt;/a>&lt;/u>. For a quick test drive, create a cluster with:&lt;/p>&lt;pre>&lt;code>k3d cluster create openlit-demo
&lt;/code>&lt;/pre>&lt;h2>Step 3: Deploy OpenLIT Operator&lt;/h2>&lt;p>First add the OpenLIT Helm repository and update your charts:&lt;/p>&lt;pre>&lt;code>helm repo add openlit https://openlit.github.io/helm/
helm repo update
&lt;/code>&lt;/pre>&lt;p>Install the &lt;strong>OpenLIT Operator&lt;/strong> to enable zero‑code instrumentation:&lt;/p>&lt;pre>&lt;code>helm install openlit-operator openlit/openlit-operator
&lt;/code>&lt;/pre>&lt;p>Verify that the operator pod is running:&lt;/p>&lt;pre>&lt;code>kubectl get pods -n openlit -l app.kubernetes.io/name=openlit-operator
&lt;/code>&lt;/pre>&lt;p>You should see the operator in a &lt;code>Running&lt;/code> state.&lt;/p>&lt;h2>Step 4: Create an AutoInstrumentation resource&lt;/h2>&lt;p>The &lt;strong>AutoInstrumentation&lt;/strong> custom resource defines which pods to instrument and how to configure the injected instrumentation. It specifies label selectors to target your AI applications, the instrumentation provider (OpenLIT by default), and the OTLP endpoint to send telemetry.&lt;/p>&lt;p>Here is a minimal example that instruments pods labeled &lt;code>instrumentation=openlit&lt;/code> and sends data to Grafana Cloud:&lt;/p>&lt;pre>&lt;code>apiVersion: openlit.io/v1alpha1
kind: AutoInstrumentation
metadata:
  name: grafana-observability
  namespace: default
spec:
  selector:
    matchLabels:
      instrumentation: openlit
  python:
    instrumentation:
      enabled: true
  otlp:
    endpoint: "https://otlp-gateway-&lt;REGION>.grafana.net/otlp" # Grafana OTLP gateway
    headers:
      Authorization: "Basic &lt;BASE64>" # Replace with base64‑encoded instanceID:token
  resource:
    attributes:
      deployment.environment: "production"
      service.namespace: "ai-services"&lt;/code>&lt;/pre>&lt;p>Apply the manifest:&lt;/p>&lt;pre>&lt;code>kubectl apply -f autoinstrumentation.yaml
&lt;/code>&lt;/pre>&lt;p>Already have AI applications running? Restart the pods that match your selector to pick up the injected instrumentation:&lt;/p>&lt;pre>&lt;code>kubectl rollout restart deployment your-deployment-name
&lt;/code>&lt;/pre>&lt;p>When the pods restart, the OpenLIT Operator automatically injects an init container that configures Python instrumentation. The pods begin emitting distributed traces with LLM costs, token usage, and agent performance metrics.&lt;/p>&lt;h2>Step 5: Deploy your AI application (no code changes)&lt;/h2>&lt;p>You can now deploy or continue running your AI workloads normally. Whether you’re using OpenAI Agents SDK, CrewAI, LangChain, or a custom Python service, you don’t need to modify your code. The operator recognizes supported frameworks and model providers, and it instruments them transparently. &lt;/p>&lt;p>For example, a simple deployment of a CrewAI‑based chatbot can be launched via a Kubernetes &lt;code>Deployment&lt;/code>; the operator will detect and instrument all LLM and agent calls as soon as the pod starts. The instrumentation captures the sequence of agent steps, tool invocations, and model responses, along with latency and token metrics.&lt;/p>&lt;h2>Step 6: Visualize metrics and traces in Grafana&lt;/h2>&lt;p>With your pods instrumented and telemetry flowing to Grafana Cloud, open the AI Observability dashboards. &lt;/p>&lt;p>The &lt;strong>GenAI observability&lt;/strong> dashboard shows request rates, p95/p99 latencies, and cost metrics across different providers. The &lt;strong>AI Agents&lt;/strong> dashboard surfaces agent workflows, step durations, and tool success rates. The &lt;strong>Vector DB&lt;/strong> and &lt;strong>MCP&lt;/strong> dashboards provide context on database queries and protocol health. &lt;/p>&lt;p>Because OpenLIT’s traces include LLM costs and token counts, Grafana can also estimate costs and highlight expensive calls. 
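The cost estimates derived from OpenLIT's token counts boil down to simple arithmetic: tokens multiplied by per-token prices. A sketch with hypothetical per-million-token prices (the PRICES table is illustrative only, not actual provider pricing):

```python
# Hypothetical per-million-token prices in USD; real prices vary by model and provider.
PRICES = {
    "openai/gpt-4o": {"prompt": 2.50, "completion": 10.00},
    "openai/gpt-4o-mini": {"prompt": 0.15, "completion": 0.60},
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the USD cost of a single LLM call from its token counts."""
    price = PRICES[model]
    return (prompt_tokens * price["prompt"]
            + completion_tokens * price["completion"]) / 1_000_000

# One call with 1,200 prompt tokens and 300 completion tokens.
call_cost = estimate_cost("openai/gpt-4o", 1200, 300)
```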
In the dashboard, you’ll see a service overview, individual traces for HTTP requests and OpenAI API calls, detailed spans with token usage, performance metrics (response times, error rates, throughput), and cost tracking.&lt;/p>&lt;p>Grafana’s alerting engine can trigger notifications when latency spikes, error rates increase, or token usage exceeds budget. Since the telemetry is OpenTelemetry‑native, you can build custom panels and alerts on top of Prometheus metrics and Tempo traces.&lt;/p>&lt;h2>Next steps&lt;/h2>&lt;p>You can learn more about Grafana Cloud AI Observability in the &lt;u>&lt;a href="/docs/grafana-cloud/monitor-applications/ai-observability/?pg=blog&amp;plcmt=body-txt">official docs&lt;/a>&lt;/u>, including setup instructions and dashboards. You can also check out the &lt;u>&lt;a href="/blog/ai-observability-llms-in-production">first post in this series&lt;/a>&lt;/u> for a full demo of monitoring AI workloads in Grafana Cloud, or browse our &lt;u>&lt;a href="/tags/ai-ml/?pg=blog&amp;plcmt=body-txt">other AI blogs&lt;/a>&lt;/u>, including posts about our own LLM: Grafana Assistant.&lt;/p>&lt;p>Taken collectively, these resources will help you move from a basic demo to a production-ready setup for your AI applications.&lt;/p></description></item><item><title>Observe your AI agents: End‑to‑end tracing with OpenLIT and Grafana Cloud</title><link>https://grafana.com/blog/ai-observability-ai-agents/</link><pubDate>Fri, 20 Mar 2026 17:20:49</pubDate><author>Ishan Jain</author><guid>https://grafana.com/blog/ai-observability-ai-agents/</guid><description>&lt;p>&lt;strong>Note: &lt;/strong>The world is changing all around us thanks to AI. Today, anyone and everyone can be a developer, using LLMs to create LLM-powered applications, which users can then interact with by using even more LLMs. 
&lt;/p>&lt;p>Observability practitioners need to adapt, and they need the right tools for the job. In this series, we'll show you how to use Grafana Cloud to monitor AI applications, including &lt;u>&lt;a href="/blog/ai-observability-llms-in-production">workloads in production&lt;/a>&lt;/u>, AI agents (this post), &lt;u>&lt;a href="/blog/ai-observability-MCP-servers">MCP servers&lt;/a>&lt;/u>, and &lt;u>&lt;a href="/blog/ai-observability-zero-code">zero-code LLMs&lt;/a>&lt;/u>.&lt;/p>&lt;p>In &lt;u>&lt;a href="/blog/ai-observability-llms-in-production">another post&lt;/a>&lt;/u> in this series, we discussed how to instrument large language model (LLM) calls. This can be a good starting point, but generative AI workloads increasingly rely on agents, which are systems that plan, call tools, reason, and act autonomously. &lt;/p>&lt;p>Their non‑deterministic behavior makes incidents harder to diagnose, in part because the same prompt can trigger different tool sequences and costs. &lt;/p>&lt;p>AI agents combine LLM reasoning with external tools and dynamic workflows, and observability data must serve as a feedback loop for continuous improvement. Without proper tracing, you end up guessing why an agent took a particular path. &lt;/p>&lt;p>In this guide, you'll learn how to use the &lt;u>&lt;a href="https://github.com/openlit/openlit">OpenLIT&lt;/a>&lt;/u> SDK to capture agent‑level telemetry and how to use Grafana Cloud to visualize every step.&lt;/p>&lt;h2>Why observability matters for agents&lt;/h2>&lt;p>Traditional APM covers infrastructure metrics and latency, but that's not enough to get a holistic view of your agents. &lt;a href="/docs/grafana-cloud/monitor-applications/ai-observability/">AI Observability&lt;/a> in Grafana Cloud uses the OpenLIT SDK to automatically generate distributed traces and metrics to provide insights into each agentic event. 
&lt;/p>&lt;p>AI Observability provides five prebuilt dashboards that analyze response times, error rates, throughput, token usage, and costs across your AI stack. Beyond raw metrics, OpenLIT captures agent names, actions, tool calls, token usage, and errors. This enables:&lt;/p>&lt;ul>&lt;li>&lt;strong>Full sequence visibility:&lt;/strong> Follow a request from the user query through planning, tool invocations, LLM calls, and final responses. Each span in the trace shows the prompt, selected tool, and reasoning chain.&lt;/li>&lt;li>&lt;strong>Cost and token tracking:&lt;/strong> For each step, you see token counts and API costs, so you can optimize tool choices and model selection.&lt;/li>&lt;li>&lt;strong>Behavioral troubleshooting:&lt;/strong> Agent traces reveal reasoning paths and tool usage. If the agent produces an incorrect answer, you can reconstruct the chain to find where it went wrong.&lt;/li>&lt;li>&lt;strong>Unified dashboards and alerting:&lt;/strong> Grafana Cloud combines fully managed versions of &lt;u>&lt;a href="/docs/grafana-cloud/send-data/metrics/metrics-prometheus/?pg=blog&amp;plcmt=body-txt">Prometheus&lt;/a>&lt;/u>, &lt;u>&lt;a href="/docs/grafana-cloud/send-data/traces/?pg=blog&amp;plcmt=body-txt">Tempo&lt;/a>&lt;/u>, and &lt;u>&lt;a href="/docs/grafana-cloud/send-data/logs/?pg=blog&amp;plcmt=body-txt">Loki&lt;/a>&lt;/u> to present metrics, traces, and logs in one place, with optional alerts on cost thresholds or latency spikes.&lt;/li>&lt;/ul>&lt;h2>Benefits of agent observability in Grafana Cloud&lt;/h2>&lt;p>Agent observability is more than just infrastructure monitoring. With OpenLIT and Grafana Cloud, you gain:&lt;/p>&lt;ul>&lt;li>&lt;strong>Predictable costs:&lt;/strong> Identify which agent step or tool call accounts for most of your spending and reroute simple queries to cheaper models.&lt;/li>&lt;li>&lt;strong>Performance optimization:&lt;/strong> Detect latency spikes at specific stages (e.g., search API vs. 
LLM) and adjust concurrency or caching accordingly.&lt;/li>&lt;li>&lt;strong>Quality assurance:&lt;/strong> Traces can be replayed to understand reasoning mistakes, while integrated evaluation tools in OpenLIT (such as hallucination detection and toxicity analysis) provide safety metrics.&lt;/li>&lt;li>&lt;strong>Faster debugging:&lt;/strong> When an agent fails, you have a single trace that links user input, internal reasoning, external calls, and the error, making root‑cause analysis straightforward.&lt;/li>&lt;li>&lt;strong>Future‑proof instrumentation:&lt;/strong> OpenTelemetry semantic conventions for AI agents are evolving; by using OpenLIT, you adopt these standards and avoid vendor lock‑in. Grafana Cloud’s integration ensures your telemetry remains compatible as conventions mature.&lt;/li>&lt;/ul>&lt;h2>How to monitor your AI agents with Grafana Cloud&lt;/h2>&lt;p>Now that you understand some of the nuances of observing AI agents, let's show you how you can use prebuilt capabilities in Grafana Cloud to start collecting and visualizing telemetry from your agents.&lt;/p>&lt;p>And if you get stuck anywhere along the way or need help with your own setup, click on the pulsar icon in the top-right corner of the Grafana Cloud UI to open a chat with &lt;u>&lt;a href="/docs/grafana-cloud/machine-learning/assistant/?pg=blog&amp;plcmt=body-txt">Grafana Assistant&lt;/a>&lt;/u>, our purpose-built LLM that can help troubleshoot incidents, manage dashboards, and answer product questions.&lt;/p>&lt;h3>Architecture overview&lt;/h3>&lt;p>AI agents orchestrate multiple actions: planning, calling external tools or models, and producing a response. OpenLIT instruments each of these steps and emits OpenTelemetry spans and metrics. You can send this data directly to Grafana Cloud or via an OpenTelemetry Collector. 
The following diagram shows how a user request flows through an agent orchestrator and is monitored:&lt;/p>&lt;p>The workflow consists of four key pieces:&lt;/p>&lt;ol>&lt;li>&lt;strong>User query&lt;/strong>: A customer sends a message to your agent.&lt;/li>&lt;li>&lt;strong>Agent orchestrator&lt;/strong>: Frameworks like CrewAI or the OpenAI Agents SDK break the task into sequential steps: plan, call a tool (e.g., a search API), call an LLM, and generate a result.&lt;/li>&lt;li>&lt;strong>OpenLIT instrumentation&lt;/strong>: A single &lt;code>openlit.init()&lt;/code> call instruments the entire agent pipeline. Each planning step, tool call, and model completion is captured as an OpenTelemetry span.&lt;/li>&lt;li>&lt;strong>Grafana Cloud&lt;/strong>: Metrics and traces flow into Grafana Cloud’s managed Prometheus and Tempo backends, where pre‑built AI dashboards visualize performance and costs.&lt;/li>&lt;/ol>&lt;h2>Step 1: Install the AI Observability integration&lt;/h2>&lt;p>Start by adding AI Observability to your Grafana Cloud stack. This can be done by clicking on &lt;strong>Connections&lt;/strong> in the left-side menu and following the &lt;u>&lt;a href="/docs/grafana-cloud/monitor-applications/ai-observability/setup/?pg=blog&amp;plcmt=body-txt#install-the-ai-observability-integration">steps outlined in our documentation&lt;/a>&lt;/u>.&lt;/p>&lt;p>This installs the five dashboards mentioned earlier (GenAI observability, GenAI evaluations, vector DB observability, MCP observability, and GPU monitoring). When metrics arrive, these dashboards automatically populate with latency histograms, token counts, cost summaries, and evaluation results.&lt;/p>&lt;h2>Step 2: Install OpenLIT&lt;/h2>&lt;p>OpenLIT is an OpenTelemetry‑native SDK for instrumenting GenAI workloads. 
Install it alongside your agent framework:&lt;/p>&lt;p>&lt;code>pip install openlit crewai&lt;/code>&lt;/p>&lt;p>OpenLIT supports dozens of frameworks, including CrewAI, OpenAI Agents, LangChain, AutoGen, and others. The SDK automatically instruments supported libraries; no manual span creation is required.&lt;/p>&lt;h2>Step 3: Instrument your agent&lt;/h2>&lt;p>OpenLIT can be added with a single line of code. Below is an example that uses CrewAI to build a simple agent with two tools: a search tool and a summarizer. &lt;/p>&lt;p>The agent plans its steps, uses the search tool to fetch content, and then summarizes the result. OpenLIT records each step, tool call, and model completion. You can swap CrewAI with the OpenAI Agents SDK—the instrumentation code remains the same.&lt;/p>&lt;pre>&lt;code>import os
import openlit  # instruments all supported frameworks when initialised
from crewai import Agent, Task, Crew  # CrewAI framework
from your_search_module import SearchTool  # hypothetical search tool
from your_summarise_module import SummariseTool  # hypothetical summariser

openlit.init()  # one line to enable OpenTelemetry tracing and metrics

# Define tools the agent can use
search_tool = SearchTool()
summarise_tool = SummariseTool()

# Compose an agent with a planning function and tool access
assistant = Agent(
    name="research_assistant",
    role="Find relevant sources and summarise them",
    tools=[search_tool, summarise_tool],
    planning=True,  # enable internal reasoning and tool selection
)

# Define a task for the agent
task = Task(
    description="Provide a concise summary of the latest developments in battery recycling.",
    expected_output="A two‑paragraph summary highlighting key advances",
)

# Create a crew to execute the task
crew = Crew(
    agents=[assistant],
    tasks=[task],
    verbose=True,
)

if __name__ == "__main__":
    result = crew.kickoff()  # kickoff() runs the crew's tasks
    print(result)
&lt;/code>&lt;/pre>&lt;p>When this script runs, OpenLIT automatically captures:&lt;/p>&lt;ul>&lt;li>&lt;strong>LLM prompts and completions&lt;/strong>: The prompts sent to the LLM and the responses returned&lt;/li>&lt;li>&lt;strong>Token usage and costs&lt;/strong>: Counts the tokens for each call and estimates API cost&lt;/li>&lt;li>&lt;strong>Agent names and actions&lt;/strong>: Identifies which agent or sub‑agent executed each step&lt;/li>&lt;li>&lt;strong>Tool usage&lt;/strong>: Records which tool was invoked and its parameters&lt;/li>&lt;li>&lt;strong>Errors&lt;/strong>: Surfaces exceptions such as API failures or tool errors&lt;/li>&lt;/ul>&lt;p>This information becomes distributed spans in Tempo and metrics in Prometheus. If you use the &lt;strong>OpenAI Agents SDK&lt;/strong>, the pattern is the same: Call &lt;code>openlit.init()&lt;/code> before constructing your agent, and every agent step will emit telemetry.&lt;/p>&lt;h2>Step 4: Forward telemetry to Grafana Cloud&lt;/h2>&lt;p>To send traces and metrics directly to Grafana Cloud, set the following environment variables before running your agent. Replace the values with your own service name, environment, and Grafana credentials:&lt;/p>&lt;pre>&lt;code># identify your service and environment
export OTEL_SERVICE_NAME="agent-demo"
export OTEL_DEPLOYMENT_ENVIRONMENT="production"
# Grafana Cloud OTLP endpoint
export OTEL_EXPORTER_OTLP_ENDPOINT="https://otlp-gateway-&lt;region>.grafana.net/otlp"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic &lt;your-base64-credentials>"
# Set any API keys for your agent framework
export OPENAI_API_KEY="sk-..."
python agent_service.py
&lt;/code>&lt;/pre>&lt;h2>Step 5: Explore your agent traces and metrics&lt;/h2>&lt;p>With your agent running, open Grafana Cloud. Navigate to AI Observability and select the &lt;strong>AI Agents&lt;/strong> dashboard. Here you can:&lt;/p>&lt;ul>&lt;li>&lt;strong>View complete traces&lt;/strong>: Each user request produces a trace containing spans for planning, tool invocations, LLM calls, and response generation. The traces page in OpenLIT provides detailed span analysis and execution flow, and Grafana Cloud mirrors this via Tempo.&lt;/li>&lt;li>&lt;strong>Monitor metrics and costs&lt;/strong>: Custom dashboards can display throughput, latency, token usage, and cost metrics stored in Prometheus.&lt;/li>&lt;li>&lt;strong>Filter and investigate errors:&lt;/strong> The errors page surfaces traces with exceptions and allows filtering by time range or exception type.&lt;/li>&lt;li>&lt;strong>Correlate with infrastructure&lt;/strong>: Grafana Cloud unifies metrics, logs, and traces, so you can correlate an agent’s slow step with CPU spikes or external API rate limits.&lt;/li>&lt;/ul>&lt;p>Grafana Cloud’s AI dashboards are purpose-built for GenAI applications and include separate panels for LLM performance, agent performance, vector database operations, and GPU health. Because OpenLIT uses OpenTelemetry standards, you can extend these dashboards or forward the data to other observability tools if required.&lt;/p>&lt;h2>Next steps&lt;/h2>&lt;p>Want to go further? In the next blog in this series, we’ll show, step by step, how to enable this &lt;a href="/blog/ai-observability-MCP-servers">for an MCP client&lt;/a>.&lt;/p>&lt;p>You can also learn more about Grafana Cloud AI Observability in the &lt;u>&lt;a href="/docs/grafana-cloud/monitor-applications/ai-observability/?pg=blog&amp;plcmt=body-txt">official docs&lt;/a>&lt;/u>, including setup instructions and dashboards. 
These resources will help you move from a basic demo to a production-ready setup for your AI applications.&lt;/p></description></item><item><title>Monitor Model Context Protocol (MCP) servers with OpenLIT and Grafana Cloud</title><link>https://grafana.com/blog/ai-observability-MCP-servers/</link><pubDate>Fri, 20 Mar 2026 17:20:37</pubDate><author>Ishan Jain</author><guid>https://grafana.com/blog/ai-observability-MCP-servers/</guid><description>&lt;p>&lt;strong>Note: &lt;/strong>The world is changing all around us thanks to AI. Today, anyone and everyone can be a developer, using LLMs to create LLM-powered applications, which users can then interact with by using even more LLMs. &lt;/p>&lt;p>Observability practitioners need to adapt and they need the right tools for the job. In this series, we'll show you how to use Grafana Cloud to monitor AI applications, including &lt;u>&lt;a href="/blog/ai-observability-llms-in-production">workloads in production&lt;/a>&lt;/u>, &lt;u>&lt;a href="/blog/ai-observability-ai-agents">AI agents&lt;/a>&lt;/u>, MCP servers (this post), and &lt;u>&lt;a href="/blog/ai-observability-zero-code">zero-code LLMs&lt;/a>&lt;/u>.&lt;/p>&lt;p>Large language models don’t work in a vacuum. They often rely on Model Context Protocol (MCP) servers to fetch additional context from external tools or data sources. &lt;/p>&lt;p>MCP provides a standard way for AI agents to talk to tool servers, but this extra layer introduces complexity. Without visibility, an MCP server becomes a black box: you send a request and hope a tool answers. When something breaks, it’s hard to tell if the agent, the server or the downstream API failed. &lt;/p>&lt;p>In this guide, you'll learn how to instrument MCP servers using OpenLIT and how to analyze those servers in Grafana Cloud.&lt;/p>&lt;h2>Why MCP observability matters&lt;/h2>&lt;p>In an agentic system, an MCP server may route tool calls to multiple services. 
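When an MCP server routes tool calls to multiple services, the agent's context grows with every call. A minimal sketch of tracking cumulative context-window usage per call (the token counts and the 128K limit are hypothetical):

```python
CONTEXT_LIMIT = 128_000  # hypothetical context window size in tokens

def context_usage(call_token_counts, limit=CONTEXT_LIMIT):
    """Cumulative context consumed after each tool call, with an 80%-utilisation flag."""
    total, usage = 0, []
    for tokens in call_token_counts:
        total += tokens
        usage.append((total, total > 0.8 * limit))
    return usage

# Three tool calls with simulated token counts; the third pushes past 80% of the window.
usage = context_usage([40_000, 50_000, 30_000])
```

Production systems would read these numbers from span attributes rather than compute them by hand; the sketch only illustrates why context-window telemetry matters for right-sizing.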
Observability helps you answer critical questions about:&lt;/p>&lt;ul>&lt;li>&lt;strong>Latency spikes: &lt;/strong>When a tool is slow to respond, user experience suffers. By examining request throughput and the 95th/99th percentile latency distributions, you can determine whether a downstream API or the MCP layer is responsible. &lt;/li>&lt;li>&lt;strong>Silent failures&lt;/strong>: For example, a tool returning partial data or timing out often goes unnoticed without structured telemetry. End‑to‑end tracing across the agent, MCP server, and external tools provides the full context needed to diagnose these issues. &lt;/li>&lt;li>&lt;strong>Cross‑service visibility: &lt;/strong>This is important because MCP calls cross network and language boundaries. Fortunately, OpenTelemetry propagates context, so spans started in a Python client link seamlessly to spans in a Node.js tool server, producing a coherent trace across systems.&lt;/li>&lt;li>&lt;strong>Context window usage: &lt;/strong>Resource consumption can grow quickly as agents query more tools. By tracking context window usage and memory consumption, you can right‑size your MCP servers and avoid over‑allocating resources.&lt;/li>&lt;/ul>&lt;p>&lt;a href="/docs/grafana-cloud/monitor-applications/ai-observability/">AI Observability&lt;/a> in Grafana Cloud supports MCP out of the box. The solution includes pre‑built dashboards for tool performance, protocol health, resource usage, and error tracking.&lt;/p>&lt;h2>Benefits of MCP observability in Grafana Cloud&lt;/h2>&lt;p>Observing your MCP servers unlocks a range of advantages:&lt;/p>&lt;ul>&lt;li>&lt;strong>End‑to‑end tracing&lt;/strong> shows the entire path of a request—from the agent through the MCP server to each tool call—so you can pinpoint bottlenecks and failures. 
&lt;/li>&lt;li>&lt;strong>Detailed&lt;/strong> &lt;strong>performance metrics&lt;/strong> like &lt;code>tool_invocation_duration_ms&lt;/code> and invocation counts help you identify slow or overused tools and adjust resource allocation accordingly. &lt;/li>&lt;li>&lt;strong>Scalability and cost control &lt;/strong>are enabled through context‑window and memory usage telemetry, so you can right‑size servers and avoid over‑provisioning. Because OpenTelemetry uses an open, vendor‑neutral format, your instrumentation remains portable; you can route data to Grafana, a self‑hosted OTLP stack, or any other backend without code changes. &lt;/li>&lt;li>&lt;strong>Security and compliance&lt;/strong> are also strengthened by MCP monitoring, which lets you audit tool interactions and ensure protocols are used as intended.&lt;/li>&lt;/ul>&lt;h2>How to monitor your MCP server with Grafana Cloud&lt;/h2>&lt;p>Next, let's take a high-level look at how you can use Grafana Cloud to observe your MCP server, then we'll walk through the setup process so you can get up and running today.&lt;/p>&lt;p>And if you get stuck anywhere along the way or need help with your own setup, click on the pulsar icon in the top-right corner of the Grafana Cloud UI to open a chat with &lt;u>&lt;a href="/docs/grafana-cloud/machine-learning/assistant/?pg=blog&amp;plcmt=body-txt">Grafana Assistant&lt;/a>&lt;/u>, our purpose-built LLM that can help troubleshoot incidents, manage dashboards, and answer product questions.&lt;/p>&lt;h3>Architecture overview&lt;/h3>&lt;p>The diagram below illustrates how agent interactions with an MCP server are instrumented and visualized. &lt;/p>&lt;p>The agent or client calls the MCP server to execute tools. OpenLIT instruments both the client and the server, capturing spans for context management, tool selection, and tool execution. 
These traces and metrics are exported to Grafana Cloud, where pre‑built dashboards provide insight into performance and failures.&lt;/p>&lt;p>The workflow consists of five key components:&lt;/p>&lt;ol>&lt;li>&lt;strong>Agent/client&lt;/strong>: AI agents use the MCP protocol to invoke tools hosted on external servers.&lt;/li>&lt;li>&lt;strong>MCP server&lt;/strong>: Hosts one or more tools (e.g., search, database query). The server handles context loading, manages tool state, and responds to requests.&lt;/li>&lt;li>&lt;strong>External tools&lt;/strong>: Actual services (databases, APIs) that do the work. They may be local or remote.&lt;/li>&lt;li>&lt;strong>OpenLIT instrumentation&lt;/strong>: A single &lt;code>openlit.init()&lt;/code> call instruments both the client and the server; context interactions and tool executions generate OpenTelemetry spans.&lt;/li>&lt;li>&lt;strong>Grafana Cloud&lt;/strong>: Collected traces and metrics flow into Grafana Cloud’s fully managed &lt;u>&lt;a href="/docs/grafana-cloud/send-data/metrics/metrics-prometheus/?pg=blog&amp;plcmt=body-txt">Prometheus&lt;/a>&lt;/u> and &lt;u>&lt;a href="/docs/grafana-cloud/send-data/traces/?pg=blog&amp;plcmt=body-txt">Tempo&lt;/a>&lt;/u> backends, where specialized MCP dashboards offer visibility into protocol usage.&lt;/li>&lt;/ol>&lt;h2>Step 1: Install the AI Observability solution&lt;/h2>&lt;p>Start by adding the AI Observability integration to your Grafana Cloud stack. This can be done by clicking on &lt;strong>Connections&lt;/strong> in the left-side menu and following the remaining &lt;u>&lt;a href="/docs/grafana-cloud/monitor-applications/ai-observability/setup/?pg=blog&amp;plcmt=body-txt#install-the-ai-observability-integration">steps outlined in our documentation&lt;/a>&lt;/u>.&lt;/p>&lt;p>This provisions pre‑built dashboards, including one for &lt;strong>MCP observability&lt;/strong>, and configures a managed OpenTelemetry Protocol (OTLP) gateway to receive traces and metrics. 
Once telemetry flows in, the dashboards automatically populate with call rates, latency percentiles, and error counts.&lt;/p>&lt;h2>Step 2: Install OpenLIT and the MCP library&lt;/h2>&lt;p>OpenLIT provides auto‑instrumentation for MCP alongside LLMs, vector stores, and agent frameworks. Install OpenLIT and the &lt;code>mcp&lt;/code> library (which implements the client and server) via &lt;code>pip&lt;/code>:&lt;/p>&lt;pre>&lt;code>pip install openlit mcp
&lt;/code>&lt;/pre>&lt;p>After installation, a single call to &lt;code>openlit.init()&lt;/code> automatically instruments all MCP operations. If you choose to run your own telemetry collector instead of Grafana’s OTLP gateway, OpenLIT can be self‑hosted via Docker Compose or deployed to Kubernetes using the OpenLIT Operator.&lt;/p>&lt;h2>Step 3: Instrument your MCP application&lt;/h2>&lt;p>Instrumentation requires just two lines of code. Below is a simple example of an MCP server that exposes a &lt;code>search_documents&lt;/code> tool. OpenLIT instruments the server, capturing each tool invocation and context interaction:&lt;/p>&lt;pre>&lt;code>import openlit
from mcp import Server

openlit.init()  # enable OpenTelemetry tracing and metrics

# Create an MCP server instance
server = Server()

# Placeholder backend for this example; imagine it calls a search API or database
def document_search(query: str):
    return [{"title": "Example document", "query": query}]

# Define a tool to fetch documents
@server.tool("search_documents")
def search_documents(query: str):
    results = document_search(query)
    return results

# Run the MCP server on localhost
server.run(host="localhost", port=8080)

# When a client invokes search_documents, OpenLIT captures:
# * Context protocol interactions (e.g., context loading and management)
# * Tool usage metrics (latency and success rate)
# * Protocol handshake performance
# * Resource usage (context window size, memory)
# * Errors or exceptions
&lt;/code>&lt;/pre>&lt;p>To instrument an MCP client, use the &lt;code>Client&lt;/code> class from the &lt;code>mcp&lt;/code> library and call &lt;code>openlit.init()&lt;/code> before making requests:&lt;/p>&lt;pre>&lt;code>import openlit
from mcp import Client
openlit.init()

client = Client("http://localhost:8080")
tools = client.list_tools()  # Lists available tools
result = client.call_tool(
    "search_documents",
    {"query": "AI observability"},
)  # Invokes the tool

# All client operations are automatically instrumented
&lt;/code>&lt;/pre>&lt;p>OpenLIT supports zero‑code instrumentation via a CLI wrapper. To instrument an existing MCP service without code modifications, use:&lt;/p>&lt;pre>&lt;code>openlit-instrument \
  --service-name my-mcp-app \
  python your_mcp_app.py

# With custom settings:
openlit-instrument \
  --otlp-endpoint http://127.0.0.1:4318 \
  --service-name my-mcp-app \
  --environment production \
  python your_mcp_app.py
&lt;/code>&lt;/pre>&lt;h2>Step 4: Configure environment variables&lt;/h2>&lt;p>To send your traces and metrics to Grafana Cloud, you need OpenTelemetry credentials. Generate them from the Grafana Cloud portal and set the following variables in your environment:&lt;/p>&lt;pre>&lt;code>export OTEL_EXPORTER_OTLP_ENDPOINT="https://otlp-gateway-&lt;ZONE>.grafana.net/otlp"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic &lt;BASE64>"
export OTEL_SERVICE_NAME="mcp-server-demo"
export OTEL_DEPLOYMENT_ENVIRONMENT="production"
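As an aside, the Basic value in the Authorization header is the base64 encoding of your Grafana Cloud OTLP instance ID and API token joined by a colon. A hedged sketch with hypothetical credentials (your instance ID and token will differ):

```shell
# Hypothetical credentials -- substitute your own instance ID and API token
printf '%s' "123456:glc_abc123" | base64
```

Use the resulting string in place of the BASE64 placeholder above.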
&lt;/code>&lt;/pre>&lt;p>When you run your client or server with these variables, the OpenLIT SDK automatically sends telemetry data to Grafana Cloud.&lt;/p>&lt;h2>Step 5: Explore the MCP observability dashboard&lt;/h2>&lt;p>After you start your instrumented MCP client and server, open Grafana Cloud and navigate to &lt;strong>AI Observability &lt;/strong>→&lt;strong> MCP Observability&lt;/strong>. The dashboard provides:&lt;/p>&lt;ul>&lt;li>&lt;strong>Tool performance&lt;/strong>: Call latency histograms, success rates, and invocation counts per tool.&lt;/li>&lt;li>&lt;strong>Protocol health&lt;/strong>: Session stability and connection metrics to detect handshake issues.&lt;/li>&lt;li>&lt;strong>Resource usage&lt;/strong>: Context window size, memory, and data access patterns, helping you optimize server resources.&lt;/li>&lt;li>&lt;strong>Error tracking&lt;/strong>: Lists failed operations with trace IDs and detailed exception information to aid debugging&lt;/li>&lt;/ul>&lt;p>You can build custom dashboards by querying Prometheus metrics (e.g., tool invocation duration) and Tempo traces. Because OpenLIT uses OpenTelemetry, you’re not locked into a single backend. You can forward telemetry to any OTLP‑compatible observability stack.&lt;/p>&lt;h2>Next steps&lt;/h2>&lt;p>Ready to learn more? In the &lt;u>&lt;a href="/blog/ai-observability-zero-code">final blog in this series&lt;/a>&lt;/u>, we’ll show how to set this up, step by step, for a zero-code instrumentation approach to AI Observability.&lt;/p>&lt;p>You can also learn more about Grafana Cloud AI Observability in the &lt;u>&lt;a href="/docs/grafana-cloud/monitor-applications/ai-observability/?pg=blog&amp;plcmt=body-txt">official docs&lt;/a>&lt;/u>, including setup instructions and dashboards. These resources will help you move from a basic demo to a production-ready setup for your AI applications.&lt;/p></description></item></channel></rss>