Company: Aurora Innovation
Industry: Automotive & Manufacturing (self-driving vehicles)
Aurora builds self-driving technology for commercial vehicles, operating some of the most complex distributed systems in the world: mobile data centers on wheels that stream high-fidelity sensor data in challenging environments like West Texas. Their infrastructure spans Kubernetes-based cloud services, large-scale batch processing clusters, and real-time AI/ML pipelines. Managing observability across dynamic, high-cardinality workloads and hybrid architectures is critical not only for system performance but also for the safety and reliability of autonomous operations. Aurora’s engineering teams face the challenge of scaling telemetry, debugging distributed systems, and optimizing performance both at the edge and in the cloud, all under tight operational and cost constraints.
Challenge
Aurora struggled with fragmented observability: the company used Chronosphere for metrics and Honeycomb for logs and traces. The problem grew when Aurora acquired Uber’s ATG division, which added scale and complexity. Managing multiple vendors created inefficiencies, high operational overhead, and increased costs, especially as dynamic batch processing clusters introduced extreme cardinality challenges. Honeycomb’s simplicity helped onboarding but limited advanced debugging, and Chronosphere’s custom solutions complicated scaling. With a broad engineering team, from LiDAR specialists to infrastructure engineers, Aurora needed a unified, scalable, and cost-efficient observability platform to support their fast-growing, high-complexity operations.
Solution
Aurora migrated to Grafana Cloud and the LGTM stack, consolidating metrics, logs, and traces into a single, Prometheus-compatible platform. They leveraged Adaptive Metrics to control cardinality costs and integrated OpenTelemetry for consistent tracing across their diverse set of services. Pyroscope was introduced in test clusters to deepen performance insights, while Terraform and API automation streamlined observability infrastructure as code. By moving to Grafana Cloud, Aurora simplified developer workflows, improved system-wide visibility, and laid the foundation for adopting AI-powered features like Sift Investigations, Metric Forecasting, and expanded SLO automation.
Impact
Migrating to Grafana Cloud allowed Aurora to unify metrics, logs, and traces under a single platform, dramatically improving visibility and reducing operational complexity. Adaptive Metrics helped control observability costs in their fast-scaling environments, while Terraform integration streamlined infrastructure management. Developers now troubleshoot faster with a single, powerful interface, and continuous profiling with Pyroscope has unlocked deeper performance insights. With a scalable, AI-ready observability foundation in place, Aurora is positioned to detect issues earlier, optimize performance more efficiently, and accelerate the deployment of safe, autonomous technology.
“Swapping between multiple vendors to diagnose an issue, especially if you are used to one of those vendors, ends up becoming very cumbersome and expensive. With Grafana Cloud, we have our first pane of glass with all three pillars of observability in one place. We have easy access to out of the box tools like profiles, adaptive metrics and logs, and frontend observability. It’s helped us expand our observability capabilities to support new teams and products.”
– Craig Sebenik, Staff SRE
