Webinar

Aurora’s observability in motion: Adaptive profiling and cost-efficient monitoring with Grafana Cloud

Company: Aurora
Industry: Travel & Transportation

Aurora is a leader in autonomous vehicle technology, focused on delivering safe and scalable self-driving solutions for the commercial trucking industry. Their ecosystem spans real-time onboard compute, cloud-based machine-learning pipelines, large-scale Kubernetes platforms, and a network of safety-critical services enabling autonomous freight movement across Texas and the Southwest. Maintaining observability across this highly distributed, multi-tenant, latency-sensitive environment is essential for safety, performance, and operational scale.

Challenge

Aurora’s observability footprint expanded dramatically when the company acquired a division of Uber ATG in early 2021—instantly growing its engineering organization by ~200%. This rapid expansion strained an already fragmented monitoring stack that included:

  • A self-hosted OSS toolchain (Prometheus + Thanos + Grafana) coupled with a separate logging vendor and additional “best-of-breed” telemetry tools.
  • Multiple dashboards, alerting systems, and time-zone inconsistencies that slowed troubleshooting.
  • Divergent vendor billing models, making cost forecasting difficult.
  • Disparate instrumentation patterns across 30+ Kubernetes clusters and numerous service types.

This fragmentation made troubleshooting across metrics, logs, traces, and profiling both time-consuming and expensive—particularly for developers working in a safety-critical autonomous vehicle environment.

Solution

Aurora consolidated telemetry onto Grafana Cloud as its unified observability platform. Key elements included:

  • Consolidation: Migrating from Chronosphere, Honeycomb, and self-hosted Grafana OSS into one Grafana Cloud platform supporting PromQL, logs, traces, and continuous profiling via Pyroscope.
  • Phased migration: Metrics migrated first (~30–45 days), followed by logs and traces (~11 months) across 30+ clusters with standardized pipelines and alerting.
  • Adaptive telemetry: Dynamic control of metric and profiling volume; opt-in profiling to manage cost; deployment and feature-flag annotations added directly to telemetry.
  • Developer enablement: A single pane of glass with consistent time zones, unified dashboards, and reduced context switching for teams with diverse skill sets.
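As an illustration of the "deployment and feature-flag annotations" idea above, here is a minimal sketch of how a deploy event could be shaped into an annotation body. The source does not describe how Aurora implemented this; the function name, the tag naming scheme, and the example service and flag names are all hypothetical. The `time` (epoch milliseconds), `tags`, and `text` fields follow the body accepted by Grafana's annotations HTTP API (`POST /api/annotations`). The sketch only builds the payload and does not send it.

```python
import json
from datetime import datetime, timezone


def build_deploy_annotation(service, version, feature_flags, when):
    """Build an annotation body for a deployment event (hypothetical helper).

    `feature_flags` maps flag name -> bool. `when` must be timezone-aware so
    that every team sees the event at the same instant regardless of locale.
    """
    if when.tzinfo is None:
        raise ValueError("when must be timezone-aware")

    # Tag naming scheme is an assumption made for this sketch.
    tags = ["deploy", f"service:{service}", f"version:{version}"]
    for name, enabled in sorted(feature_flags.items()):
        tags.append(f"flag:{name}={'on' if enabled else 'off'}")

    return {
        "time": int(when.timestamp() * 1000),  # epoch milliseconds
        "tags": tags,
        "text": f"Deployed {service} {version}",
    }


if __name__ == "__main__":
    body = build_deploy_annotation(
        service="route-planner",            # hypothetical service name
        version="1.4.2",
        feature_flags={"new-scheduler": True},
        when=datetime(2024, 1, 1, tzinfo=timezone.utc),
    )
    print(json.dumps(body, indent=2))
```

Tagging by service, version, and flag state is one way to let a dashboard overlay deploys on metrics, logs, traces, and profiles at once; requiring a timezone-aware timestamp mirrors the consistent-time-zone goal described in the list.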

“A single pane of glass was one of the ways … we really wanted to make this simple for people.”

– Craig Sebenik, Observability Lead

Impact

By unifying their telemetry into a single platform, Aurora unlocked dramatic improvements in speed, efficiency, and operational scale:

  • Faster incident resolution: Issues that previously took days now take hours, or even minutes, with all telemetry in one place.
  • Cost control: Adaptive metrics (and, in the future, adaptive profiling) prevent runaway ingestion; opt-in profiling reduced spend while preserving visibility.
  • Higher developer productivity: Fewer vendor pivots, standardized telemetry, and lower cognitive load across teams.
  • Scalable operations: Unified observability now supports 30+ Kubernetes clusters spanning core infrastructure, ML/batch workloads, R&D systems, and customer-visible autonomous trucking services.

“There have been cases where teams have reported that a given kind of incident or issue that might’ve [taken] days to resolve… now takes them hours or even potentially minutes because all the data’s in one place.”

– Craig Sebenik, Observability Lead


Your guide

Craig Sebenik
Lead for Observability
Aurora