How Property Finder cut observability costs by more than 50% and achieved full-stack visibility in just three months

Property Finder, a leading real estate platform serving customers across the Middle East and Northern Africa, is on a mission to “change living for the good” by connecting home seekers with agents and homeowners. As the company rapidly scaled its architecture, its site reliability engineering (SRE) team found itself constrained by Datadog’s rising costs, shrinking retention windows, and the operational burden of managing telemetry at scale.

“We reached a point where the value proposition was no longer there,” said Tarek AbdelSater, Principal/Lead Site Reliability Engineer at Property Finder. “We needed something open standards–based, cost-effective, and mature, and that’s what led us to Grafana Cloud.”

Within just three months, Property Finder migrated more than 300 microservices and thousands of AWS Lambda functions to Grafana Cloud. Using Grafana Alloy’s Datadog receiver, the team replicated telemetry without re-instrumenting code, restored its retention period from seven to 30 days, cut spending by more than half, and reduced MTTR by around 50%, all while expanding to full-stack observability across metrics, logs, and traces.

Tarek recently spoke with Grafana Labs about Property Finder’s success with Grafana Cloud.

Can you start by introducing yourself, your role at Property Finder, what your team is responsible for, and a bit about the company’s mission and focus?

My name is Tarek AbdelSater. I am the Principal/Lead Site Reliability Engineer at Property Finder. Property Finder’s mission is to change living for the good in the MENA region by connecting home seekers with real estate agents and homeowners through our apps, making the journey as seamless as possible.

Our SRE team is responsible for a few major areas: observability, incident management, and application performance testing. We want to provide the best possible experience for our customers, and having strong observability is extremely important to achieving that.

Our observability group is three people, and the broader infrastructure team is about seven.

What specific challenges or limitations led your team to start looking for an alternative to Datadog and other tools?

Datadog became extremely expensive as we scaled the company. We added a lot of new microservices and were running major migrations from legacy systems. That meant sending a lot more logs, traces, and metrics than before. During sensitive migrations, you can’t cut back on telemetry.

We tried a lot of cost optimizations in Datadog, but it still remained prohibitively expensive. Some features like real user monitoring were so costly that we enabled them briefly and then turned them off. Datadog had amazing features that we could only admire from a distance because we couldn’t responsibly use them.

Eventually, we had to make big sacrifices, especially around retention. We went from 30 days of logs down to seven days. Losing the ability to look two or three weeks back was not acceptable for the long term. That’s when we realized the value just wasn’t there anymore and it was time to find an alternative.

Migrating observability platforms is never easy, yet your team moved quickly. What were the key steps or decisions that helped make your migration successful so fast?

We had a big challenge. Property Finder has more than 300 microservices on Kubernetes and several thousand AWS Lambda functions. All of them were instrumented using Datadog libraries and sending data through the Datadog agent. Our migration deadline was only three months, and we didn’t have time to re-instrument everything to OpenTelemetry.

Another constraint was that we needed to keep Datadog working during the migration. We couldn’t afford subpar monitoring even for a day.

Fortunately, Grafana Labs has Grafana Alloy, its distribution of the OpenTelemetry Collector. Alloy includes a Datadog receiver that can ingest Datadog telemetry, convert it into OpenTelemetry, and send it to Grafana Cloud.

Following the instructions in the documentation, we modified the Datadog agent to dual-write to both Datadog and Grafana Alloy at the same time. That allowed us to replicate all telemetry with one configuration change. We left both running for a few weeks to build historical data in Grafana.

All of this happened in one swoop, with no need to re-instrument each service one by one. After that, all we had to do was migrate dashboards and alerts gradually.
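As a rough illustration of the dual-write idea described above, the sketch below fans one telemetry payload out to an existing backend and a new one. It is a conceptual model only, not the Datadog agent's or Grafana Alloy's actual implementation; `dual_write`, the in-memory stores, and `failing_sink` are hypothetical names introduced for this example.

```python
from typing import Callable, Dict, List

Payload = Dict[str, object]
Sink = Callable[[Payload], None]


def dual_write(payload: Payload, primary: Sink, secondaries: List[Sink]) -> List[Exception]:
    """Send a payload to the primary sink, then best-effort to each secondary.

    Errors from secondaries are collected instead of raised, so a problem in
    the new pipeline cannot interrupt delivery to the existing one.
    """
    primary(payload)  # existing backend: failures here propagate as before
    errors: List[Exception] = []
    for sink in secondaries:
        try:
            sink(payload)
        except Exception as exc:  # keep going; report the failure to the caller
            errors.append(exc)
    return errors


# Hypothetical in-memory stand-ins for the two backends.
datadog_store: List[Payload] = []
alloy_store: List[Payload] = []


def failing_sink(payload: Payload) -> None:
    """Simulates a secondary destination that is temporarily unreachable."""
    raise ConnectionError("collector unreachable")


point: Payload = {"metric": "http.requests", "value": 1, "tags": ["service:search"]}
errors = dual_write(point, datadog_store.append, [alloy_store.append, failing_sink])
```

The design choice worth noting is that the existing destination is written first and secondary failures are isolated, which mirrors the requirement that monitoring in the original platform must not degrade during the migration.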

The three-month deadline came from the business. Datadog was getting more expensive every day, and delaying migration would have cost us a lot of money.

What measurable results have you seen since the migration in terms of cost, retention, or performance?

We reduced costs by more than 50%, even while sending more telemetry due to other migrations happening in the company.

We restored our log and trace retention from seven days back to 30 days.

We also had enough headroom in our budget to comfortably use additional features in Grafana Cloud, such as Frontend Observability for real user monitoring, Grafana Cloud IRM for incident management, and synthetic monitoring, while still paying significantly less than we did with Datadog.

In summary, we scaled up our usage – in terms of both data size and software stack – while reducing costs dramatically.

You’ve now achieved full-stack observability across metrics, logs, and traces. How has this visibility changed how your teams work day to day?

A lot of people already knew how Grafana dashboards worked, so we created templates to standardize dashboards across teams. That helped a lot.

Engineers now feel confident pushing telemetry to Grafana without worrying about costs. They can log almost anything as well as send metrics and traces freely. Because of that, Grafana has become much more useful to us.

The data being interconnected and correlated helped us catch issues much earlier. We reduced our mean time to resolve by around 50%.

Tell me about your experience using Adaptive Telemetry. What difference has it made in reducing overhead and helping your team move faster?

Adaptive Telemetry helped us optimize costs without constantly waiting for engineers to fix instrumentation.

For example, Adaptive Metrics allowed us to aggregate collector IDs and other labels that we needed for the single-writer principle in Prometheus. Normally those labels would’ve increased cardinality and cost, but Adaptive Metrics aggregated them away. Engineers are now much less worried about cardinality.
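To make the cardinality point concrete, here is a minimal sketch of what aggregating a label away means: series that differ only in the dropped label collapse into one. It assumes summation is the right aggregation (as it would be for counters) and uses made-up label names and values; it is not how Adaptive Metrics is implemented, and `aggregate_away` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Tuple

Labels = Dict[str, str]
Sample = Tuple[Labels, float]
SeriesKey = Tuple[Tuple[str, str], ...]


def aggregate_away(samples: Iterable[Sample], drop: FrozenSet[str]) -> Dict[SeriesKey, float]:
    """Sum sample values across every label listed in `drop`.

    The remaining labels, sorted, identify the aggregated series.
    """
    totals: Dict[SeriesKey, float] = defaultdict(float)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        totals[key] += value
    return dict(totals)


# Illustrative data: three series before aggregation.
samples = [
    ({"service": "search", "collector_id": "a1"}, 3.0),
    ({"service": "search", "collector_id": "b2"}, 5.0),
    ({"service": "listings", "collector_id": "a1"}, 2.0),
]

aggregated = aggregate_away(samples, frozenset({"collector_id"}))
# Two series remain: service=search (8.0) and service=listings (2.0).
```

With the `collector_id` label removed, the number of stored series depends only on the number of services, not on how many collectors write them.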

Adaptive Tracing allowed us to shift trace sampling policies left. Instead of having a centralized policy, we gave teams direct access to set their own sampling policies. That allowed us to control costs while keeping engineers fully in the loop.
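The idea of team-owned sampling policies can be sketched as a lookup of per-team rates with a shared default. This is a simplified probabilistic sampler written for illustration, with hypothetical team names and rates; it does not represent the policy types or configuration format that Grafana Cloud actually offers.

```python
import hashlib

DEFAULT_RATE = 0.10  # hypothetical fallback for teams without their own policy

# Hypothetical per-team policies, each owned and edited by the team itself.
TEAM_RATES = {
    "search": 0.50,
    "payments": 1.0,  # keep every trace
}


def keep_trace(trace_id: str, team: str) -> bool:
    """Decide deterministically whether to keep a trace.

    The trace ID is hashed into a bucket in [0, 1); the trace is kept when
    the bucket falls below the owning team's sampling rate, so every span
    of the same trace gets the same decision.
    """
    rate = TEAM_RATES.get(team, DEFAULT_RATE)
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Hashing the trace ID instead of drawing a random number keeps the decision consistent across services, which matters when several teams' services participate in one trace.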

Overall, Adaptive Telemetry has helped keep costs under control while giving the teams more freedom and visibility.

Can you share how building a joint success plan with Grafana Labs has helped you tie your observability goals to broader business outcomes?

Grafana’s team, especially the Observability Architects, helped us immensely. They worked with us to lay out our roadmap for observability, and following that roadmap allowed us to complete the migration quickly.

Shortly afterward, we were able to deliver major improvements like incident management, synthetic monitoring, and SLOs. The support has been excellent. Any time we faced an issue, we opened a ticket and the team helped us right away.

What’s next for Property Finder’s observability journey? Are there any areas or technologies you’re excited to explore next?

Grafana has been releasing a lot of impressive new features and we want to try many of them.

We already rolled out monitoring with Grafana Assistant. It has reduced support requests significantly because engineers can just ask the assistant how to write dashboards or alerts. It has been amazing so far.

Next on our roadmap is load testing with Grafana Cloud k6. We can reuse our existing synthetic monitoring scripts for load testing, and we’re very excited about that.

Anything else you’d like to add?

At this point, I don’t have anything else to add except that the Grafana team has been extremely helpful. I’m grateful to have worked with them.