Webinar

How Gopuff Cut Observability Costs 40% with Grafana Cloud

How Gopuff runs Grafana Cloud at scale: Observability, cost control & load testing

Company: Gopuff

Industry: Instant Commerce / Retail Delivery

Gopuff operates the largest instant commerce platform in the U.S. and U.K., with hundreds of micro-fulfillment centers covering most major markets. Behind its promise of delivery in minutes is a complex, multi-cloud architecture of 500+ microservices, where even small infrastructure issues can directly impact customer experience and revenue.

Challenge

Gopuff’s previous observability platform had become both a financial and an operational burden, which the team called the “Observability Tax.”

  • Costs were increasing year-over-year without meaningful improvements in insight quality
  • A proprietary pricing model penalized scale, discouraging proper instrumentation
  • Engineers began second-guessing telemetry decisions, prioritizing cost over visibility
  • High-cardinality labels (e.g., user IDs, raw URLs) silently inflated costs
  • Teams spent significant time auditing and pruning metrics just to control spend
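
To illustrate why labels such as user IDs and raw URLs inflate costs, here is a minimal Python sketch. It assumes that each distinct combination of label values is stored as its own time series, which is how Prometheus-style metrics behave. The helper names (count_series, normalize_path) and the sample data are hypothetical and do not come from Gopuff's setup.

```python
import re
from typing import Iterable


def count_series(samples: Iterable[dict[str, str]]) -> int:
    """Count distinct label sets; each distinct set is a separate time series."""
    return len({tuple(sorted(s.items())) for s in samples})


def normalize_path(path: str) -> str:
    """Replace numeric path segments with a placeholder (hypothetical rule)."""
    return re.sub(r"/\d+", "/:id", path)


# Hypothetical traffic: 1,000 requests, each to a different order URL.
raw = [{"method": "GET", "path": f"/orders/{i}"} for i in range(1000)]

# The same traffic with the raw URL collapsed to a route template.
normalized = [{**s, "path": normalize_path(s["path"])} for s in raw]

print(count_series(raw))         # one series per raw URL
print(count_series(normalized))  # one series for the whole route
```

In this toy case the raw "path" label produces 1,000 series, while the templated route produces one. Under a pricing model that bills per active series, that gap is the kind of silent inflation described above.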

“Engineers were asking, ‘Should I add this metric?’ instead of, ‘What do I need to add or what do I need to do to improve observability?’”

—Brad Oyler, Sr. Engineering Manager

Solution

In 2024, Gopuff migrated to Grafana Cloud, adopting the full LGTM stack to unify observability across its distributed system.

  • Grafana Loki for cost-efficient log aggregation
  • Grafana Tempo for end-to-end distributed tracing
  • Grafana Mimir for scalable, long-term Prometheus metrics
  • Grafana dashboards and alerting as a single interface for visibility

To control telemetry at scale:

  • Grafana Alloy was deployed for centralized telemetry collection and preprocessing
  • Adaptive Telemetry automatically identified and reduced unused high-cardinality metrics, eliminating manual audits
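
The general idea of reducing unused high-cardinality metrics can be sketched in a few lines of Python. This is an illustration only and not how Adaptive Telemetry is implemented: it assumes counter-style samples that can safely be summed, and it assumes the set of labels actually referenced by dashboards and alerts is already known. The function name and sample data are hypothetical.

```python
from collections import defaultdict


def aggregate_unused_labels(series, used_labels):
    """Drop labels that nothing queries and sum the samples that collapse together.

    series: iterable of (labels_dict, value) pairs for one counter-style metric.
    used_labels: label names referenced by dashboards or alerts (assumed known).
    """
    out = defaultdict(float)
    for labels, value in series:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k in used_labels))
        out[kept] += value
    return dict(out)


# Hypothetical samples: the user_id label is never queried.
samples = [
    ({"route": "/cart", "user_id": "u1"}, 1.0),
    ({"route": "/cart", "user_id": "u2"}, 2.0),
    ({"route": "/checkout", "user_id": "u1"}, 4.0),
]

reduced = aggregate_unused_labels(samples, used_labels={"route"})
print(len(samples), "series before,", len(reduced), "series after")
```

Summing is only appropriate for counters; gauges and histograms would need a different aggregation, which this sketch does not cover.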

The team also standardized observability practices across all services:

  • Unified on Prometheus metrics and OpenTelemetry
  • Implemented consistent Golden Signals across 500+ microservices
  • Integrated alerting with Gopuff’s internal Service Catalog, ensuring context-rich routing and linked runbooks
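
As a rough illustration of what consistent Golden Signals look like for a single service, the Python sketch below computes the four signals commonly used in SRE practice (latency, traffic, errors, saturation) from a window of request records. The Request type, the choice of p95 latency, and CPU usage as the saturation measure are assumptions made for the example; the source does not say which definitions Gopuff uses.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def golden_signals(requests, window_s, cpu_used, cpu_limit):
    """Summarize one observation window as the four Golden Signals."""
    n = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile, using integer math to avoid float rounding.
    rank = (95 * n + 99) // 100
    p95 = durations[rank - 1] if n else 0.0
    return {
        "latency_p95_ms": p95,
        "traffic_rps": n / window_s,
        "error_ratio": errors / n if n else 0.0,
        "saturation": cpu_used / cpu_limit,
    }


# Hypothetical window: 100 requests in 10 seconds, the 5 slowest fail with 500.
window = [Request(float(i), 500 if i > 95 else 200) for i in range(1, 101)]
signals = golden_signals(window, window_s=10, cpu_used=2.0, cpu_limit=4.0)
print(signals)
```

Computing the same four values in the same way for every service is what makes dashboards and alert rules reusable across a large fleet.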

For proactive reliability:

  • Deployed k6 Operator to run recurring stress tests in UAT and production
  • Simulated peak-demand scenarios (e.g., the Big Game traffic) weeks in advance

“We chose Grafana for its open source roots. It aligns with our engineering culture: change transparency, flexibility, and no vendor lock-in.”

—Brad Oyler, Sr. Engineering Manager

Impact

Gopuff transformed observability from a cost center into an operational advantage:

  • ~40% reduction in total observability spend 
  • Reclaimed budget equivalent to 1–2 SRE hires
  • Eliminated ongoing engineering time spent managing metric cardinality
  • Shifted from alert fatigue to context-rich, actionable incident response
  • Replaced reactive over-provisioning with proactive capacity planning using k6 insights

Most importantly, engineers regained the ability to instrument systems based on need, not cost.

“We achieved a 40% reduction in total observability spend. We’ve moved from alert fatigue to actionable insights, allowing our team to focus on innovation rather than maintenance.”

—Brad Oyler, Sr. Engineering Manager

Looking Ahead

With a clean, standardized telemetry foundation in place, Gopuff is building toward autonomous infrastructure: an AI agent embedded in Grafana dashboards capable of reasoning about telemetry and taking safe, scoped production actions. The near-term focus is on guardrails and human-in-the-loop confirmation for production changes.

The long-term vision is a system where open-standard telemetry becomes the instruction set for an automated, self-optimizing infrastructure.
