Company: Aurora
Industry: Travel & Transportation
Aurora is a leader in autonomous vehicle technology, focused on delivering safe and scalable self-driving solutions for the commercial trucking industry. Its ecosystem spans real-time onboard compute, cloud-based machine-learning pipelines, large-scale Kubernetes platforms, and a network of safety-critical services enabling autonomous freight movement across Texas and the Southwest. Maintaining observability across this highly distributed, multi-tenant, latency-sensitive environment is essential for safety, performance, and operational scale.
Challenge
Aurora’s observability footprint expanded dramatically when the company acquired a division of Uber ATG in early 2021—instantly growing its engineering organization by ~200%. This rapid expansion strained an already fragmented monitoring stack that included:
- A self-hosted OSS toolchain (Prometheus + Thanos + Grafana) coupled with a separate logging vendor and additional “best-of-breed” telemetry tools.
- Multiple dashboards, alerting systems, and time-zone inconsistencies that slowed troubleshooting.
- Divergent vendor billing models, making cost forecasting difficult.
- Disparate instrumentation patterns across 30+ Kubernetes clusters and numerous service types.
This fragmentation made troubleshooting across metrics, logs, traces, and profiling both time-consuming and expensive—particularly for developers working in a safety-critical autonomous vehicle environment.
Solution
Aurora consolidated telemetry onto Grafana Cloud as its unified observability platform. Key elements included:
- Consolidation: Migrating from Chronosphere, Honeycomb, and self-hosted Grafana OSS into one Grafana Cloud platform supporting PromQL, logs, traces, and continuous profiling via Pyroscope.
- Phased migration: Metrics migrated first (~30–45 days), followed by logs and traces (~11 months) across 30+ clusters with standardized pipelines and alerting.
- Adaptive telemetry: Dynamic control of metric and profiling volume; opt-in profiling to manage cost; deployment and feature-flag annotations added directly to telemetry.
- Developer enablement: A single pane of glass with consistent time zones, unified dashboards, and reduced context switching for teams with diverse skill sets.
“A single pane of glass was one of the ways … we really wanted to make this simple for people.”
– Craig Sebenik, Observability Lead
Impact
By unifying its telemetry into a single platform, Aurora unlocked dramatic improvements in speed, efficiency, and operational scale:
- Faster incident resolution: Issues that previously took days to resolve now take hours, or even minutes, because all telemetry is in one place.
- Cost control: Adaptive metrics (and, in the future, adaptive profiling) prevent runaway ingestion; opt-in profiling reduced spend while preserving visibility.
- Higher developer productivity: Fewer vendor pivots, standardized telemetry, and lower cognitive load across teams.
- Scalable operations: Unified observability now supports 30+ Kubernetes clusters spanning core infrastructure, ML/batch workloads, R&D systems, and customer-visible autonomous trucking services.
“There have been cases where teams have reported that a given kind of incident or issue that might’ve [taken] days to resolve… now takes them hours or even potentially minutes because all the data’s in one place.”
– Craig Sebenik, Observability Lead