Observability Journey Maturity Model

Modern infrastructure and applications are complex and constantly evolving. Understand where your organization is on its observability journey and how to improve your maturity.

Assess your organization’s observability maturity

The model

This observability maturity model gives you a method to evaluate your tools, people, and processes across 9 key dimensions, so you can identify your current level of maturity, your strengths, and your opportunities for improvement. It then helps you identify actions you can take to become systematic about observability and keep your apps running.

How to use it

You can use this model to evaluate observability through the lens of how you access and analyze observability data, as well as how you respond to and prevent incidents. Achieving systematic observability means mastering 9 key dimensions.

Dimensions of observability

The 9 dimensions fall into three groups: how you access observability data, how you analyze it, and how you respond to and prevent incidents.

Access

Observability coverage
Observability data access
Observability data efficiency

Analyze

Visualization
Correlation
Root Cause Analysis

Respond and Prevent

SLOs and Business Impact
Incident Response & Management
Observability Driven Development

Access

Observability coverage

Start your observability journey by determining the applications, cloud services, and infrastructure that you need to observe to keep your environment running. Then collect the observability data needed to get visibility into your systems, including the metrics, logs, and traces from the key components of your application architecture. Observability data sources include cloud and self-hosted infrastructure, databases, and APIs, as well as network, security, real-user, and synthetic monitoring. Increasing the number of data sources and types of observability data you can access will broaden your observability coverage across your organization and limit blind spots.

Observability data access

Next, evaluate how to collect, store, and access your observability data. Some data collection agents are proprietary, while others use open standards whose data can be stored and visualized using a wide ecosystem of tools. Observability teams should offer developers and operations teams data stores for hosting metrics, logs, and traces that support widely adopted open standards and projects, such as Prometheus and OpenTelemetry. You should determine which data can be accessed using APIs and which data must be collected and stored so you can fully observe your environment.

Observability data efficiency

Finally, you will need to efficiently store and manage large volumes of observability data. The data stores you offer should be scalable, highly available, secure, and highly performant. You should define policies for data fidelity and retention, as well as policies for managing cardinality and cost.
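To make the cardinality point concrete, here is a minimal sketch of how one might estimate the number of time series a single metric can produce. It assumes the worst case, where every combination of label values occurs. The function name, label names, and counts are hypothetical and chosen only for illustration.

```python
from math import prod


def estimate_series(label_values: dict[str, int]) -> int:
    """Upper bound on time series for one metric.

    Assumes every combination of label values actually occurs, so the
    bound is the product of the distinct-value counts per label.
    """
    return prod(label_values.values())


# Hypothetical label set for a request-count metric.
labels = {"service": 20, "endpoint": 50, "status_code": 5}
baseline = estimate_series(labels)

# Adding one unbounded label (a per-user ID) multiplies the total.
with_user_id = estimate_series({**labels, "user_id": 10_000})

print(baseline)      # 5000
print(with_user_id)  # 50000000
```

A cardinality policy could use an estimate like this to flag labels with unbounded values before they reach the data store, since storage and query cost tend to grow with the number of series.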

Analyze

Visualization

Once you can access your observability data, you’ll need to determine the best way to visualize it. You’ll want to create a global view across your organization as well as role-specific views for executives and technical users, and it's a best practice to offer the ability to visualize data from numerous sources in a single place. You should provide your business with a reliable platform for visualization, such as Grafana dashboards, while securing data access among teams using a role-based access control (RBAC) model that is integrated with your directory service.

Correlation

After visualizing data, you’ll need to be able to correlate across data sources to solve problems quickly. This includes the ability to correlate many types of data, including metrics, logs, and traces, as well as business and technology data sources. Navigating between tools is time-consuming and error-prone, so reducing the number of tools required to correlate data can make a big difference in how long it takes to identify and solve issues.

Root Cause Analysis

Once you can correlate data, you will want to create an efficient root cause analysis (RCA) process and toolset. Great observability teams track Mean Time To Recover (MTTR) metrics and constantly seek to improve their process so they can identify root causes faster. The RCA step depends on solid data fidelity and retention policies to ensure enough data is on hand, balanced against budgetary needs. Cross-functional collaboration that reduces the number of tools and people required to determine a root cause will reduce errors and shorten MTTR. A well-designed RCA practice also allows observability teams to learn from each issue and continually improve their RCA process to prevent future outages.
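As an illustration of tracking MTTR, the sketch below averages the time from detection to resolution over a set of incidents. It assumes each incident is recorded as a (detected, resolved) pair of timestamps. The function name and the sample incidents are hypothetical, and teams may define the start and end of recovery differently.

```python
from datetime import datetime, timedelta


def mean_time_to_recover(
    incidents: list[tuple[datetime, datetime]],
) -> timedelta:
    """Average of (resolved - detected) across incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    durations = [resolved - detected for detected, resolved in incidents]
    # sum() needs a timedelta start value to add timedeltas together.
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incident log: (detected, resolved).
incidents = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 45)),   # 45 min
    (datetime(2024, 1, 9, 22, 15), datetime(2024, 1, 9, 23, 30)),  # 75 min
    (datetime(2024, 1, 20, 6, 0), datetime(2024, 1, 20, 6, 30)),   # 30 min
]

mttr = mean_time_to_recover(incidents)
print(mttr)  # 0:50:00
```

Comparing this value across months or quarters is one simple way to see whether changes to the RCA process are shortening recovery time.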

Respond and Prevent

SLOs and Business Impact

For each of the services you support, you will need to determine expectations for performance and availability. These expectations often take the form of a Service Level Agreement (SLA) for external customers and an Operational Level Agreement (OLA) for internal customers. Most observability teams have defined Service Level Objectives (SLOs) for key services, whether or not they are bound by an SLA. Reporting on SLO performance and business impact becomes an essential tool for executives to manage their key business systems.
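One common way to report on SLO performance is an error budget: the share of failures the SLO allows over a period, and how much of it remains. The sketch below assumes a request-based availability SLO. The function name, the 99.9% target, and the request counts are hypothetical values used only for illustration.

```python
def error_budget_remaining(
    slo_target: float, total_requests: int, failed_requests: int
) -> float:
    """Fraction of the error budget still unspent.

    1.0 means no budget has been used, 0.0 means it is exhausted,
    and a negative value means the SLO has been missed.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


# Hypothetical month: 99.9% availability target, 1,000,000 requests,
# so roughly 1,000 failures are allowed; 400 have occurred.
remaining = error_budget_remaining(0.999, 1_000_000, 400)
print(f"{remaining:.0%} of the error budget remains")
```

A single percentage like this is easier for executives to track than raw failure counts, and a negative value makes an SLO miss explicit.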

Incident Response & Management

Unifying alerts through a central system can help observability teams identify and notify the appropriate on-call engineers with relevant information. Observability teams should define on-call rotations and escalation policies, along with runbooks to guide troubleshooting so that the dependence on specific individuals is minimized.

Observability Driven Development

Organizations that implement observability during the development process roll out applications with higher uptime and improved performance. The earlier in the Software Development Life Cycle (SDLC) that observability and performance testing are implemented, the more issues can be prevented before they impact users. This “shift-left” approach requires metrics, logs, and traces to be included in the coding process. It also includes a Quality Assurance (QA) performance testing stage, as well as tools designed for developers that can stress test applications to make sure they will perform when production loads hit each new release of the application. Ideally, these should be tools and methodologies that your developers are already familiar with, so that they are easy to integrate into your development pipelines.

Observability Maturity Levels

Reactive

Reactive observability teams are working to establish competence in the 9 dimensions. Customers bring problems to the team before the team knows about them, and time is spent responding to issues reported by users. Many observability teams fall into this group today, as developing systematic observability is a journey.

Proactive

Proactive observability teams establish competency across a majority of the 9 dimensions and develop mastery in a few. They work to develop procedures and implement tools to know about issues before users do. They are able to identify some issues and prevent them from impacting customers, but other issues still reach users.

Systematic

Systematic observability teams demonstrate mastery across all 9 dimensions of observability. They develop procedures and implement tools to know about issues before users do, and they are able to prioritize problems to minimize impact on users. Observability and performance testing are implemented early in the SDLC, preventing issues from occurring in production.
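The three levels can be read as a simple classification over per-dimension scores. The sketch below is one possible interpretation, not an official scoring method. It assumes a hypothetical 0 to 2 scale per dimension (0 = not yet competent, 1 = competent, 2 = mastery) and treats "mastery in a few" as at least two dimensions, which is an assumption exposed as the min_mastered parameter.

```python
DIMENSIONS = [
    "Observability coverage",
    "Observability data access",
    "Observability data efficiency",
    "Visualization",
    "Correlation",
    "Root Cause Analysis",
    "SLOs and Business Impact",
    "Incident Response & Management",
    "Observability Driven Development",
]


def maturity_level(scores: dict[str, int], min_mastered: int = 2) -> str:
    """Map per-dimension scores (hypothetical 0-2 scale) to a level."""
    missing = set(DIMENSIONS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    competent = sum(1 for d in DIMENSIONS if scores[d] >= 1)
    mastered = sum(1 for d in DIMENSIONS if scores[d] >= 2)
    if mastered == len(DIMENSIONS):
        return "Systematic"  # mastery across all 9 dimensions
    if competent > len(DIMENSIONS) / 2 and mastered >= min_mastered:
        return "Proactive"   # competent in a majority, mastery in a few
    return "Reactive"        # still establishing competence


# Hypothetical self-assessment: competent everywhere, mastery in two.
example = {d: 1 for d in DIMENSIONS}
example["Visualization"] = 2
example["Correlation"] = 2
print(maturity_level(example))  # Proactive
```

The thresholds are deliberately simple. An actual assessment would likely weigh evidence for each dimension rather than rely on a single number.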
