Important: This documentation is about an older version. It's relevant only to the release noted, many of the features and functions have been updated or replaced. Please view the current version.
Common observability strategies
When you have a lot to monitor, like a server farm, you need a strategy to decide what is important enough to monitor. This page describes several common methods for choosing what to monitor.
A logical strategy allows you to make uniform dashboards and scale your observability platform more easily.
Guidelines for usage
- The USE method tells you how happy your machines are, the RED method tells you how happy your users are.
- USE reports on causes of issues.
- RED reports on user experience and is more likely to report symptoms of problems.
- The best practice of alerting is to alert on symptoms rather than causes, so alerting should be done on RED dashboards.
USE stands for:
- Utilization - Percent time the resource is busy, such as node CPU usage
- Saturation - Amount of work a resource has to do, often queue length or node load
- Errors - Count of error events
This method is best for hardware resources in infrastructure, such as CPU, memory, and network devices. For more information, refer to The USE Method.
RED stands for:
- Rate - Requests per second
- Errors - Number of requests that are failing
- Duration - Amount of time these requests take, distribution of latency measurements
This method is most applicable to services, especially a microservices environment. For each of your services, instrument the code to expose these metrics for each component. RED dashboards are good for alerting and SLAs. A well-designed RED dashboard is a proxy for user experience.
For more information, refer to Tom Wilkie’s blog post The RED method: How to instrument your services.
The Four Golden Signals
According to the Google SRE handbook, if you can only measure four metrics of your user-facing system, focus on these four.
This method is similar to the RED method, but it includes saturation.
- Latency - Time taken to serve a request
- Traffic - How much demand is placed on your system
- Errors - Rate of requests that are failing
- Saturation - How “full” your system is