Operations at Level 2

Service-level operations

At Level 2, your operational practices shift from infrastructure to services. You’re now alerting on service health, building service dashboards, and defining SLOs around availability and latency.

Alerting

What to alert on	Example
Service error rate	Checkout service errors > 1%
RED metrics	Request rate dropped 50%, duration P95 > 2s
Dependency health	Payment gateway latency increasing
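
As a rough illustration of the conditions in the table, the sketch below checks one service's RED metrics for a single evaluation window. The `ServiceWindow` shape, the field names, and the alert names are assumptions made for this example, not part of any product API. The thresholds (1% errors, 50% rate drop, 2s P95) are taken from the examples above.

```python
from dataclasses import dataclass


@dataclass
class ServiceWindow:
    """RED metrics for one service over one evaluation window (hypothetical shape)."""
    requests: int            # requests seen in the current window
    errors: int              # failed requests in the current window
    baseline_requests: int   # requests seen in a comparable earlier window
    p95_seconds: float       # 95th-percentile request duration, in seconds


def evaluate_alerts(w: ServiceWindow) -> list[str]:
    """Return the names of the alert conditions that fire for this window."""
    fired = []
    # Errors: more than 1% of requests failed.
    if w.requests > 0 and w.errors / w.requests > 0.01:
        fired.append("error_rate")
    # Rate: traffic fell to half of the baseline or less.
    if w.baseline_requests > 0 and w.requests <= 0.5 * w.baseline_requests:
        fired.append("rate_drop")
    # Duration: P95 is above 2 seconds.
    if w.p95_seconds > 2.0:
        fired.append("slow_p95")
    return fired


# Illustrative numbers only: 1.5% errors and traffic at 40% of baseline.
checkout = ServiceWindow(requests=4000, errors=60,
                         baseline_requests=10000, p95_seconds=1.4)
print(evaluate_alerts(checkout))
```

In practice these conditions would usually be written as alert rules in your alerting tool. The point here is only that each one is evaluated per service, not per host.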

SLOs and error budgets

SLO type	Example
Availability	99.5% of requests successful
Latency	95% of requests < 500ms
Error budget	Alert when errors consume the budget faster than the SLO window allows
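
To make the error budget idea concrete, here is a minimal sketch of the arithmetic, assuming a request-based availability SLO like the 99.5% example above. The function names and the burn-rate threshold of 2.0 are illustrative assumptions. A burn rate of 1.0 means errors are arriving exactly as fast as the budget allows.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates over the whole SLO window."""
    return (1.0 - slo_target) * total_requests


def burn_rate(slo_target: float, requests: int, errors: int) -> float:
    """Observed error rate divided by the allowed error rate."""
    if requests == 0:
        return 0.0
    return (errors / requests) / (1.0 - slo_target)


def burning_too_fast(slo_target: float, requests: int, errors: int,
                     threshold: float = 2.0) -> bool:
    """True when the budget is being spent faster than `threshold` times the
    sustainable pace. The default threshold is an illustrative choice."""
    return burn_rate(slo_target, requests, errors) > threshold


SLO = 0.995  # 99.5% of requests successful

# Over 1,000,000 requests the budget is roughly 5,000 failed requests.
print(round(error_budget(SLO, 1_000_000)))

# 200 failures in the last 10,000 requests is a 2% error rate,
# about four times the 0.5% the SLO allows.
print(round(burn_rate(SLO, 10_000, 200), 2))
print(burning_too_fast(SLO, 10_000, 200))
```

Alerting on burn rate rather than on raw error counts ties the alert to the SLO itself: a short blip that barely touches the budget stays quiet, while a sustained problem fires early.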

Dashboards

Dashboard type	What you see
Service health overview	All services, RED metrics at a glance
Service detail	Deep dive into one service’s metrics
Dependency map	Service graph with health indicators

Investigation

Tool	How you use it at Level 2
Service Inventory	Find the service, see its health at a glance
Service Graph	Trace dependencies, find upstream issues
Explore	Query service-level metrics and logs
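
The dependency tracing described for Service Graph can be pictured as a walk over a graph of which service calls which. The sketch below is only a conceptual model: the `CALLS` and `HEALTHY` dictionaries and the service names are hypothetical sample data, and the function does not reflect how any particular tool implements its service graph.

```python
from collections import deque

# Hypothetical service graph: each service maps to the services it calls.
CALLS = {
    "checkout": ["cart", "payment"],
    "payment": ["payment-gateway"],
    "cart": [],
    "payment-gateway": [],
}

# Hypothetical health status per service.
HEALTHY = {
    "checkout": False,
    "cart": True,
    "payment": False,
    "payment-gateway": False,
}


def unhealthy_dependencies(service, calls, healthy):
    """Breadth-first walk over everything `service` depends on, directly or
    indirectly, returning the unhealthy dependencies in visit order."""
    seen = {service}
    queue = deque([service])
    found = []
    while queue:
        current = queue.popleft()
        for dep in calls.get(current, []):
            if dep in seen:
                continue
            seen.add(dep)
            queue.append(dep)
            # Services with no recorded status are treated as healthy.
            if not healthy.get(dep, True):
                found.append(dep)
    return found


print(unhealthy_dependencies("checkout", CALLS, HEALTHY))
```

With this sample data, the walk from checkout reaches payment and then payment-gateway, which mirrors the kind of answer you want from an investigation: checkout is unhealthy because something it depends on is unhealthy.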

At Level 3, you’ll alert on individual transactions and trace-based metrics.

Script

Here’s where things get interesting. Remember, at Level 1 you were asking “is my server healthy?” Now you’re asking “is my checkout flow healthy?”

That’s a much more meaningful question for your business.

The big shift is from alerting on CPU spikes to alerting on things your customers actually feel. If your payment service starts returning errors, you want to know immediately, even if the servers look fine.

SLOs become really powerful here because you can set targets that matter: “99.5% of checkouts should succeed” is something everyone in your organization can understand and rally around.

And when something goes wrong, you’re no longer guessing. Service Inventory shows you exactly which services exist and their health. Service Graph shows you if the problem is actually coming from an upstream dependency.

You go from “something’s broken” to “the payment gateway is slow and it’s affecting checkout” in seconds.

This is the level where observability starts feeling like a superpower instead of just monitoring.
