Slide 6 of 7

Operations at Level 2

Service-level operations

At Level 2, your operational practices shift from infrastructure to services. You’re now alerting on service health, building service dashboards, and defining SLOs around availability and latency.

Alerting

| What to alert on | Example |
| --- | --- |
| Service error rate | Checkout service errors > 1% |
| RED metrics | Request rate dropped 50%, duration P95 > 2s |
| Dependency health | Payment gateway latency increasing |
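
The alert conditions in the table can be sketched as simple checks over one evaluation window. This is a minimal illustration, assuming request counts and a P95 duration have already been aggregated per service; `ServiceWindow` and `evaluate_alerts` are hypothetical names, not part of any product API, and the thresholds are just the example values from the table.

```python
from dataclasses import dataclass


@dataclass
class ServiceWindow:
    """Aggregated request stats for one service over one evaluation window (hypothetical type)."""
    requests: int           # total requests in this window
    errors: int             # failed requests in this window
    p95_seconds: float      # 95th percentile request duration, in seconds
    baseline_requests: int  # requests in a comparable earlier window


def evaluate_alerts(w: ServiceWindow) -> list[str]:
    """Return the names of the example alert conditions that fire for this window."""
    alerts = []

    # Service error rate: more than 1% of requests failed.
    error_rate = w.errors / w.requests if w.requests else 0.0
    if error_rate > 0.01:
        alerts.append("error_rate")

    # RED "rate": traffic dropped to half of the baseline or less.
    if w.baseline_requests and w.requests <= 0.5 * w.baseline_requests:
        alerts.append("request_rate_drop")

    # RED "duration": P95 above 2 seconds.
    if w.p95_seconds > 2.0:
        alerts.append("slow_p95")

    return alerts
```

For example, a window with 1,000 requests and 20 errors has a 2% error rate, so the `error_rate` condition fires even if latency and traffic look normal.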

SLOs and error budgets

| SLO type | Example |
| --- | --- |
| Availability | 99.5% of requests successful |
| Latency | 95% of requests < 500ms |
| Error budget | Alert when burning budget too fast |
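
To make "burning budget too fast" concrete: a 99.5% availability SLO leaves an error budget of 0.5% of requests. One common way to express the burn is as a ratio of the observed error rate to that budget. The sketch below assumes this burn-rate definition; the function names and the default threshold of 10 are illustrative assumptions, not values from this course.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.995 -> roughly 0.005."""
    return 1.0 - slo_target


def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget.

    A value around 1.0 means the budget is being consumed exactly as fast
    as the SLO allows; higher values mean it will run out early.
    """
    if requests == 0:
        return 0.0
    return (errors / requests) / error_budget(slo_target)


def burning_too_fast(errors: int, requests: int,
                     slo_target: float = 0.995,
                     threshold: float = 10.0) -> bool:
    """True when the burn rate reaches the (assumed, illustrative) threshold."""
    return burn_rate(errors, requests, slo_target) >= threshold
```

With a 99.5% target, 100 failed requests out of 1,000 is a 10% error rate, roughly 20 times the budget, so `burning_too_fast(100, 1000)` returns `True`. Five failures out of 1,000 sits right at the budget and does not trigger.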

Dashboards

| Dashboard type | What you see |
| --- | --- |
| Service health overview | All services, RED metrics at a glance |
| Service detail | Deep dive into one service’s metrics |
| Dependency map | Service graph with health indicators |

Investigation

| Tool | How you use it at Level 2 |
| --- | --- |
| Service Inventory | Find the service, see its health at a glance |
| Service Graph | Trace dependencies, find upstream issues |
| Explore | Query service-level metrics and logs |
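
Tracing dependencies to find the source of a problem amounts to walking the service graph from the affected service. The sketch below is only a conceptual illustration of that idea, not how Service Graph is implemented; the graph shape (a dict mapping each service to the services it calls), the service names, and the `unhealthy_dependencies` function are all assumptions made for the example.

```python
from collections import deque


def unhealthy_dependencies(graph: dict[str, list[str]],
                           unhealthy: set[str],
                           start: str) -> list[str]:
    """Breadth-first walk from `start`, returning unhealthy services it depends on.

    `graph` maps a service to the services it calls directly.
    Results are ordered from the closest dependency to the farthest.
    """
    seen = {start}
    queue = deque([start])
    found = []
    while queue:
        service = queue.popleft()
        for dep in graph.get(service, []):
            if dep in seen:
                continue  # already visited; also guards against cycles
            seen.add(dep)
            if dep in unhealthy:
                found.append(dep)
            queue.append(dep)
    return found


# Hypothetical topology: checkout calls cart and payment,
# and payment calls an external payment gateway.
example_graph = {
    "checkout": ["cart", "payment"],
    "payment": ["payment-gateway"],
    "cart": [],
}
print(unhealthy_dependencies(example_graph, {"payment-gateway"}, "checkout"))
```

Here the walk starts at `checkout` and reports `payment-gateway` as the unhealthy dependency, even though it is two hops away.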

At Level 3, you’ll alert on individual transactions and trace-based metrics.

Script

Here’s where things get interesting. Remember that at Level 1 you were asking, “Is my server healthy?” Now you’re asking, “Is my checkout flow healthy?” That’s a much more meaningful question for your business.

The big shift is from alerting on CPU spikes to alerting on things your customers actually feel. If your payment service starts returning errors, you want to know immediately, even if the servers look fine.

SLOs become really powerful here because you can set targets that matter: “99.5% of checkouts should succeed” is something everyone in your organization can understand and rally around.

And when something goes wrong, you’re no longer guessing. Service Inventory shows you which services exist and how healthy each one is. Service Graph shows you whether the problem is actually coming from an upstream dependency. You go from “something’s broken” to “the payment gateway is slow and it’s affecting checkout” in seconds.

This is the level where observability starts feeling like a superpower instead of just monitoring.