How Alter Domus unified observability and improved their AI-powered applications with Grafana Cloud

Alter Domus is a global provider of fund administration and alternative investment services with $3.9tn in assets under administration. It serves 85% of the top 30 asset managers and has 39 offices across 23 jurisdictions.

Needing to simplify a fragmented observability environment while supporting strict regulatory requirements, growing digital services, and a rapidly expanding AI footprint, Alter Domus consolidated six observability vendors into Grafana Cloud, reducing observability costs by more than 55%, expanding visibility across more than 2,000 resources, and improving MTTR from hours to minutes. The company also began using Grafana’s AI capabilities to investigate issues faster and track AI token consumption across teams and workflows.

“Once we onboarded onto Grafana Cloud, we not only reduced costs significantly, but we also gained full control over our observability, including how we manage and scale AI usage across the company,” said Philip Pencal, Head of Site Reliability Engineering at Alter Domus.

Philip recently spoke with Grafana Labs about consolidating observability, improving operational performance, and building visibility into AI systems to reduce costs and remove complexity.

Can you start by introducing yourself, your role at Alter Domus, and what your team is responsible for?

My name is Philip Pencal, and I am the Head of Site Reliability Engineering at Alter Domus. I’ve been at Alter Domus since 2019 and leading observability initiatives since 2023.

Today, I lead a team of senior engineers within our platform engineering organization. We are responsible for technical standards, reliability, resilience, documentation, and observability across the company.

The team acts almost like a chapter model. We define the standards for engineering teams across the organization, so we require very experienced engineers. Everyone on the team has at least 10 years of experience.

How do scale, regulation, and user expectations shape your technology environment today?

Alter Domus operates in the alternative investment space, serving clients across private equity, infrastructure, real estate, and venture capital.

It’s a very demanding environment. We operate under some of the strictest regulatory frameworks, and we work with 85% of the top 30 asset managers globally.

At the same time, we have more than 6,500 employees using our systems every day, plus our clients’ investors using digital platforms that need to be extremely fast and reliable. Delays are not acceptable.

What challenges led your team to rethink observability?

When I joined Alter Domus, we did not have substantial investments in observability. Incidents could take days to resolve. We did not have proper foundations. Some systems did not even transmit logs. It was very difficult to troubleshoot.

I remember one incident in a client-facing application that took two days to resolve. It generated a lot of complaints and impacted our reputation. When we reevaluated that incident using an APM proof of concept, we realized it could have been resolved in 10 minutes.

After that, we started investing substantially more in observability, and improved our mean time to resolution from eight hours to one hour in some systems. But as we did so, we created a new problem. Different teams started adopting their own tools, and we ended up with a fragmented environment with up to six vendors providing the same capabilities.

This created challenges with cost, governance, and efficiency. Nobody wanted to maintain the platforms, and engineers had to learn multiple tools, which slowed everything down.

What made Grafana stand out during your evaluation process?

We evaluated 22 suppliers during a six-month procurement process. We had a very detailed checklist covering cloud infrastructure, Kubernetes, applications, frontend observability, security, user experience, and open standards.

One of the biggest differentiators was Grafana’s pricing model. It was simple and predictable. Because pricing is ingestion-based with low user licensing costs, we can onboard engineers broadly without worrying about costs increasing every time adoption grows.

OpenTelemetry was another major factor. With Grafana Alloy and OpenTelemetry, we were able to standardize observability without relying on vendor-specific SDKs or proprietary instrumentation.

The ecosystem also mattered. Grafana has a strong cloud-native community presence, which makes it easier to hire engineers who already know the platform.

How did the migration to Grafana Cloud unfold?

The migration from our previous supplier to Grafana Cloud took about two months.

What surprised us most was the immediate cost reduction. Simply by migrating, we reduced observability costs by 55%.

Leadership initially did not believe the results were possible. When I presented the migration outcomes, there was silence in the room. Everyone asked, “What’s the trick? What did we lose?” But the answer was nothing. We still had the same baseline observability standards and quality, but at half the price.

But the savings were only part of the story. We reinvested those savings into expanding observability coverage across the company. We started onboarding systems we previously could not support properly, including Windows services and additional infrastructure components. Today we monitor more than 2,000 resources in Grafana Cloud.

In Q1 2026, even though adoption increased by 21%, we still reduced our monthly observability run rate by another 46%. Features like Adaptive Telemetry also helped us optimize high-cardinality telemetry and control unnecessary data growth without sacrificing visibility.

How has having a unified observability platform changed how teams work?

Before Grafana, frontend observability, database visibility, infrastructure telemetry, and application observability were all disconnected across different tools and vendors. Now everything is centralized. If an engineer needs to troubleshoot something, they know exactly where to go.

We now have frontend observability, Kubernetes, Windows services, applications, and custom telemetry connected together with tracing correlations. Engineers can move very quickly from an application to the Kubernetes infrastructure underneath it, which makes troubleshooting much faster.

We also started creating dashboards not only for engineers but for product owners and business stakeholders. For example, we use Apdex scores, a standard metric for measuring application responsiveness and user satisfaction, to measure user experience quality. In one system, we improved the Apdex score from 0.52 to 0.78, which represented a significant improvement for end users.
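For readers unfamiliar with Apdex, the standard formula counts requests at or below a target threshold T as satisfied, requests between T and 4T as tolerating (worth half), and slower requests as frustrated. The sketch below is illustrative only: the function name, the 0.5-second threshold, and the sample latencies are assumptions, not Alter Domus's actual configuration or data.

```python
def apdex(response_times_s, threshold_s=0.5):
    """Return the Apdex score for a list of response times in seconds.

    Apdex = (satisfied + tolerating / 2) / total samples, where
    satisfied is t <= T and tolerating is T < t <= 4T.
    """
    if not response_times_s:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for t in response_times_s if t <= threshold_s)
    tolerating = sum(
        1 for t in response_times_s if threshold_s < t <= 4 * threshold_s
    )
    return (satisfied + tolerating / 2) / len(response_times_s)


# Hypothetical latencies: 3 satisfied, 2 tolerating, 1 frustrated.
samples = [0.2, 0.3, 0.4, 0.9, 1.5, 2.5]
score = apdex(samples)  # (3 + 2 / 2) / 6
```

A score of 1.0 means every request met the target, so a move from 0.52 to 0.78 indicates that a much larger share of requests fell inside the satisfied or tolerating bands.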

You’ve also started building observability around AI systems and agents. What triggered that?

We started thinking about AI observability very early, before it became a major trend.

One of the first things we realized is that AI usage behaves exactly like cloud usage did in the early days. If you don’t instrument it properly, you lose visibility. And once you lose visibility, you lose control over cost and operations.

So, we decided to treat AI like any other production workload and build observability around it from the beginning.

Today, we use Grafana Cloud to track token consumption across teams, agents, and workflows. That data is used by engineering leadership, FinOps, and product teams.

Engineering teams use it to understand adoption patterns and model efficiency. FinOps uses it for forecasting and anomaly detection. Product teams use it to understand whether AI-powered features are actually delivering value.
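As a rough illustration of what tracking token consumption across teams, agents, and workflows can look like, the sketch below rolls up usage records along one dimension at a time. It is a minimal example under stated assumptions: the record shape, field names, and sample values are hypothetical and do not reflect Alter Domus's implementation or any Grafana API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    """One LLM call's token counts, tagged with hypothetical labels."""
    team: str
    agent: str
    workflow: str
    input_tokens: int
    output_tokens: int


def total_tokens_by(records, dimension):
    """Sum input + output tokens grouped by one label (e.g. 'team')."""
    totals = defaultdict(int)
    for record in records:
        totals[getattr(record, dimension)] += (
            record.input_tokens + record.output_tokens
        )
    return dict(totals)


# Made-up sample data for illustration only.
records = [
    TokenUsage("platform", "triage-agent", "incident-summary", 1200, 300),
    TokenUsage("platform", "docs-agent", "runbook-search", 800, 200),
    TokenUsage("finance", "docs-agent", "report-draft", 2000, 500),
]

by_team = total_tokens_by(records, "team")
by_agent = total_tokens_by(records, "agent")
```

Keeping the labels on every record is what lets the same data serve different audiences: the same totals can be grouped by team for cost allocation or by workflow to see which AI-powered features drive usage.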

Token consumption is becoming a meaningful operational expense, especially when using providers like OpenAI, Anthropic, or Azure OpenAI at scale.

For us, token consumption is becoming a signal of AI maturity. It helps us scale responsibly, manage risk, and bring transparency to stakeholders as AI becomes part of our operational DNA.

Can you share an example of how Grafana’s AI capabilities helped your team investigate an issue?

We built our own internal chatbot platform with a strong focus on security and data protection. It interacts with multiple agents, MCP servers, and retrieval-augmented generation components.

At one point, users started reporting severe latency issues, but because LLMs behave probabilistically, there wasn’t a clean dependency chain showing where the issue originated.

Most engineers were already occupied with other incidents, so before opening dashboards manually, I started an investigation directly through Grafana Assistant.

Within a few minutes, Grafana Assistant had generated a full investigation report with likely root causes and an executive summary. One of the main findings was that a RAG pod was hitting memory limits after a usage spike. Once we increased the limits, the latency disappeared.

The investigation also surfaced additional weak points in the architecture that we later optimized proactively.

Without Grafana Assistant Investigations, this would have become a slow, manual, multi-team troubleshooting effort. Instead, we resolved the issue quickly and improved the platform’s overall reliability.

What made Grafana stand out compared to other AI observability approaches you evaluated?

We evaluated 22 vendors in total, and many of them are doing interesting work around model visibility and agent tracing, but most still treat AI as a chat interface layered on top of dashboards.

Grafana took a different approach.

Instead of only translating queries, it built a more agentic workflow capable of running investigations, correlating telemetry across Prometheus and Loki, and actively helping drive troubleshooting.

As we expanded our use of AI agents and MCP servers internally, we realized we lacked visibility into what happened after requests reached the LLM.

I reached out to Grafana Labs and said, “We need a solution for this use case. We are able to collect metrics and information about our applications, but everything happening after the LLM made decisions is a black box for us.”

The Grafana team then showed us its upcoming AI Observability capabilities and invited us into the preview. We implemented it the same day, and after two days in production, the difference was amazing.

We started collecting information about our MCP servers, prompts, and conversations, so now we can understand why MCP servers fail and why users are not achieving what they expect when interacting with those systems.

What has it been like working with the Grafana Labs team?

The relationship feels more like a partnership than a traditional vendor relationship. We’ve been able to test early-stage features, provide feedback directly to product teams, and influence the direction of capabilities we care about. In some situations, we started using new functionality within hours of requesting access.

That level of collaboration makes a huge difference because it allows us to shape the platform around our operational needs.

What’s next for your observability strategy?

I’ve seen Grafana evolve very quickly since we started using it, and I’m impressed with how much innovation has happened in a short amount of time.

The area I’m most excited about is AI observability and AI governance. As more companies operationalize AI, observability becomes critical for understanding cost, performance, reliability, and risk across those systems.

We’re looking forward to continuing to build on that foundation with Grafana.