Company: Atlassian
Industry: Software & Technology
Atlassian is a global leader in software collaboration and productivity tools—best known for products like Jira, Confluence, and Bitbucket—that help teams plan, track, and deliver work. As Atlassian’s cloud infrastructure grew rapidly, the company faced a complex web of microservices and fragmented visibility across teams. Diagnosing issues and identifying root causes during incidents could take hours, impacting both customer experience and operational efficiency.
Challenge
Atlassian’s rapid move from on-prem to cloud created a sprawling ecosystem of thousands of microservices. During incidents, responders faced complex dependencies, fragmented tools, and siloed teams. Identifying the right team or root cause could take up to an hour. The Observability Insights team set out to drastically reduce three “time to” metrics: time to engage, time to mitigate, and time to root cause.
Solution
To solve its incident response and troubleshooting challenges, Atlassian built OpsDeck, an observability platform powered by Grafana.
- Adopted Grafana’s open and flexible platform, enabling seamless integration with existing tools and self-hosted data for security and cost control.
- Created opinionated workflows instead of static dashboards, guiding engineers and support staff through step-by-step troubleshooting journeys.
- Leveraged Grafana Mimir for real-time alerting and high-cardinality metrics, enabling proactive detection before customers were affected.
- Built custom tail-sampling algorithms to manage massive trace volumes efficiently and open-sourced the technology for the community.
- Integrated logs, traces, and telemetry data into a unified observability experience through Grafana and OpenTelemetry.
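The case study does not describe how Atlassian's tail-sampling algorithms work, so the following is only a minimal, generic sketch of the idea behind tail sampling: the keep-or-drop decision is made after a trace has completed, so traces with errors or slow spans can always be retained while healthy traces are kept at a small baseline rate. The `Span` type, the `keep_trace` function, and the threshold and rate values are all hypothetical and chosen for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """Hypothetical, simplified span record (not an OpenTelemetry type)."""
    trace_id: str
    duration_ms: float
    is_error: bool


def keep_trace(
    spans: List[Span],
    latency_threshold_ms: float = 1000.0,  # assumed threshold
    baseline_rate: float = 0.01,           # assumed rate for healthy traces
) -> bool:
    """Decide whether to keep a completed trace.

    Unlike head sampling, this runs once all spans of a trace are known.
    """
    if not spans:
        return False
    # Always keep traces that contain at least one failed span.
    if any(s.is_error for s in spans):
        return True
    # Always keep traces that contain an unusually slow span.
    if max(s.duration_ms for s in spans) >= latency_threshold_ms:
        return True
    # Otherwise keep a deterministic fraction, derived from the trace ID,
    # so every collector makes the same decision for the same trace.
    digest = hashlib.sha256(spans[0].trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(baseline_rate * 2**64)
```

Hashing the trace ID rather than drawing a random number is a design choice in this sketch: it keeps the decision reproducible across processes, which matters when spans of one trace arrive at different collectors.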
Impact
With Grafana as its observability foundation, Atlassian achieved measurable and lasting improvements:
- Reduced incident engagement time from up to an hour to less than one minute using automated incident creation and paging workflows.
- Improved customer experience by detecting and mitigating issues before users noticed an outage.
- Accelerated support troubleshooting across Jira, Jira Service Management (JSM), and Confluence, delivering significant reductions in mean time to resolution (MTTR).
- Enhanced collaboration by breaking down silos and centralizing observability data across engineering, SRE, and support teams.
- Laid the groundwork for AI-driven observability, enabling predictive insights and faster root cause analysis.
The future of observability at Atlassian
Next, Atlassian plans to extend OpsDeck with machine learning and AI. The goal is to automatically surface root causes—identifying whether an outage stems from CPU saturation, a recent deployment, or a newly enabled feature flag.
By continuing to standardize data and enrich service dependency graphs, Atlassian aims to build an even smarter observability ecosystem—one that keeps humans in the loop while making incident response faster, more predictive, and more reliable than ever.
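To make the root-cause goal concrete, here is a deliberately simple, rule-based sketch (not machine learning, and not Atlassian's design) of how the three signals named above could be turned into a ranked list of candidates for a human responder to review. The `ChangeEvent` type, the `candidate_causes` function, the 30-minute lookback window, the 90% CPU threshold, and the service names in the usage example are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class ChangeEvent:
    """Hypothetical record of a change to a service."""
    kind: str       # e.g. "deployment" or "feature_flag"
    service: str
    at: datetime


def candidate_causes(
    incident_start: datetime,
    events: List[ChangeEvent],
    cpu_utilization: float,                       # 0.0 to 1.0
    lookback: timedelta = timedelta(minutes=30),  # assumed window
    cpu_threshold: float = 0.9,                   # assumed threshold
) -> List[str]:
    """Return candidate causes, most suspicious first, for human review."""
    candidates: List[str] = []
    # Resource saturation is listed first when it crosses the threshold.
    if cpu_utilization >= cpu_threshold:
        candidates.append(f"cpu_saturation ({cpu_utilization:.0%})")
    # Keep only changes that happened shortly before the incident began.
    recent = [
        e for e in events
        if timedelta(0) <= incident_start - e.at <= lookback
    ]
    # Changes closest to the incident start rank highest.
    recent.sort(key=lambda e: incident_start - e.at)
    candidates.extend(f"{e.kind}:{e.service}" for e in recent)
    return candidates


start = datetime(2024, 1, 1, 12, 0)
events = [
    ChangeEvent("deployment", "checkout", datetime(2024, 1, 1, 11, 50)),
    ChangeEvent("feature_flag", "search", datetime(2024, 1, 1, 11, 58)),
    ChangeEvent("deployment", "billing", datetime(2024, 1, 1, 10, 0)),
]
print(candidate_causes(start, events, cpu_utilization=0.95))
# → ['cpu_saturation (95%)', 'feature_flag:search', 'deployment:checkout']
```

The function only proposes and orders candidates; the decision stays with the responder, in line with the stated aim of keeping humans in the loop.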

