Service-level operations
At Level 2, your operational practices shift from infrastructure to services. You’re now alerting on service health, building service dashboards, and defining SLOs for availability and latency.
Alerting
| What to alert on | Example |
|---|---|
| Service error rate | Checkout service errors > 1% |
| RED metrics | Request rate drops 50%, P95 duration > 2s |
| Dependency health | Payment gateway latency increasing |
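The example thresholds above can be expressed as simple checks over a window of service traffic. The following is a minimal sketch, not a description of any particular alerting product: `ServiceWindow`, `evaluate_alerts`, and the alert names are hypothetical, and the baseline used for the request-rate comparison (for example, the same window one week earlier) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ServiceWindow:
    """Hypothetical aggregate of one service's traffic over an evaluation window."""
    requests: int           # requests seen in this window
    errors: int             # failed requests in this window
    p95_seconds: float      # 95th percentile request duration, in seconds
    baseline_requests: int  # requests in the comparison window (assumed baseline)


def evaluate_alerts(w: ServiceWindow) -> list[str]:
    """Return the names of the example alerts that would fire for this window."""
    fired = []

    # Service error rate: errors > 1% of requests.
    error_rate = w.errors / w.requests if w.requests else 0.0
    if error_rate > 0.01:
        fired.append("error_rate")

    # RED rate: request volume dropped by 50% or more against the baseline.
    if w.baseline_requests and w.requests <= 0.5 * w.baseline_requests:
        fired.append("request_rate_drop")

    # RED duration: P95 above 2 seconds.
    if w.p95_seconds > 2.0:
        fired.append("p95_duration")

    return fired


if __name__ == "__main__":
    checkout = ServiceWindow(requests=1000, errors=20, p95_seconds=0.4,
                             baseline_requests=1100)
    print(evaluate_alerts(checkout))
```

In practice these conditions would live in your alerting system's rule language rather than in application code; the sketch only makes the thresholds concrete.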
SLOs and error budgets
| SLO type | Example |
|---|---|
| Availability | 99.5% of requests successful |
| Latency | 95% of requests < 500ms |
| Error budget | Alert when the budget is burning faster than the SLO window allows |
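To make the error budget row concrete: a 99.5% availability SLO leaves a budget of 0.5% failed requests, and the burn rate is the observed error rate divided by that allowed rate. A burn rate of 1 spends the budget exactly over the SLO window; anything higher spends it early. The sketch below assumes hypothetical function names and an illustrative alert threshold of 2.0, which you would tune for your own windows.

```python
def burn_rate(total: int, failed: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo_target  # e.g. 0.005 for a 99.5% SLO
    observed_error_rate = failed / total
    return observed_error_rate / allowed_error_rate


def should_alert(total: int, failed: int,
                 slo_target: float = 0.995,
                 max_burn_rate: float = 2.0) -> bool:
    """True when the budget is burning faster than the (assumed) threshold."""
    return burn_rate(total, failed, slo_target) > max_burn_rate


if __name__ == "__main__":
    # 30 failures in 1,000 requests is a 3% error rate against a 0.5% budget.
    print(round(burn_rate(1000, 30, 0.995), 2))
    print(should_alert(1000, 30))
```

A single threshold is the simplest form; a common refinement is to evaluate the burn rate over both a short and a long window so that brief spikes do not page anyone.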
Dashboards
| Dashboard type | What you see |
|---|---|
| Service health overview | All services, RED metrics at a glance |
| Service detail | Deep dive into one service’s metrics |
| Dependency map | Service graph with health indicators |
Investigation
| Tool | How you use it at Level 2 |
|---|---|
| Service Inventory | Find the service, see its health at a glance |
| Service Graph | Trace dependencies, find upstream issues |
| Explore | Query service-level metrics and logs |
At Level 3, you’ll alert on individual transactions and trace-based metrics.