Important: This documentation covers an older version. It is relevant only to the release noted; many of the features and functions have since been updated or replaced. Please view the current version.

Common observability strategies

When you have a lot to monitor, like a server farm, you need a strategy to decide what is important enough to monitor. This page describes several common methods for choosing what to monitor.

A logical strategy allows you to make uniform dashboards and scale your observability platform more easily.

Guidelines for usage

The USE method tells you how happy your machines are; the RED method tells you how happy your users are.
USE reports on causes of issues.
RED reports on user experience and is more likely to report symptoms of problems.
Best practice is to alert on symptoms rather than causes, so base your alerts on RED dashboards.
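
As a rough illustration of alerting on symptoms, the sketch below checks only user-facing signals (failed requests and slow requests) and ignores machine-level causes such as CPU usage. The function name, the message strings, and both default thresholds are hypothetical values chosen for the example, not recommendations.

```python
from typing import List


def symptom_alerts(
    total_requests: int,
    failed_requests: int,
    p99_seconds: float,
    max_error_ratio: float = 0.01,   # assumed threshold: 1% of requests failing
    max_p99_seconds: float = 0.5,    # assumed threshold: 99th percentile above 500 ms
) -> List[str]:
    """Return the user-facing symptoms that currently exceed their thresholds."""
    alerts: List[str] = []
    # Errors: compare the share of failing requests, not the raw count,
    # so the check behaves the same at low and high traffic.
    if total_requests > 0 and failed_requests / total_requests > max_error_ratio:
        alerts.append("error ratio above threshold")
    # Duration: a slow tail is a symptom users feel directly.
    if p99_seconds > max_p99_seconds:
        alerts.append("p99 latency above threshold")
    return alerts


if __name__ == "__main__":
    print(symptom_alerts(total_requests=1000, failed_requests=25, p99_seconds=0.3))
```

A cause-level signal from a USE dashboard, such as high CPU utilization, would then be something to consult while investigating one of these alerts rather than something that pages anyone on its own.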

USE method

USE stands for:

Utilization - Percent time the resource is busy, such as node CPU usage
Saturation - Amount of work a resource has to do, often queue length or node load
Errors - Count of error events

This method is best for hardware resources in infrastructure, such as CPU, memory, and network devices. For more information, refer to The USE Method.
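
The three USE signals can be sketched as a calculation over two readings of a resource's counters. This is a minimal example under stated assumptions: the `ResourceSample` shape, its field names, and the choice of queue length as the saturation signal are hypothetical, and a real collector would read equivalent values from the operating system or an exporter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceSample:
    """One reading of a single resource (hypothetical shape for this sketch)."""
    busy_seconds: float    # cumulative time the resource spent busy
    total_seconds: float   # cumulative time observed
    queue_length: int      # work items waiting at the moment of the reading
    error_count: int       # cumulative error events


@dataclass(frozen=True)
class UseReport:
    utilization_percent: float
    saturation: int
    errors: int


def use_report(previous: ResourceSample, current: ResourceSample) -> UseReport:
    """Derive Utilization, Saturation, and Errors for the interval between two samples."""
    elapsed = current.total_seconds - previous.total_seconds
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing times")
    busy = current.busy_seconds - previous.busy_seconds
    return UseReport(
        # Utilization: percent of the interval the resource was busy.
        utilization_percent=100.0 * busy / elapsed,
        # Saturation: work waiting right now (queue length in this sketch).
        saturation=current.queue_length,
        # Errors: error events that occurred during the interval.
        errors=current.error_count - previous.error_count,
    )


if __name__ == "__main__":
    before = ResourceSample(busy_seconds=100.0, total_seconds=1000.0, queue_length=0, error_count=3)
    after = ResourceSample(busy_seconds=145.0, total_seconds=1060.0, queue_length=4, error_count=5)
    print(use_report(before, after))
```

Utilization and errors are computed as differences between cumulative counters, while saturation is taken as the instantaneous value from the later sample.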

RED method

RED stands for:

Rate - Requests per second
Errors - Number of requests that are failing
Duration - Amount of time these requests take, expressed as a distribution of latency measurements

This method is most applicable to services, especially in a microservices environment. For each of your services, instrument the code to expose these metrics for each component. RED dashboards are good for alerting and SLAs. A well-designed RED dashboard is a proxy for user experience.
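
To make the three RED signals concrete, the sketch below summarizes a batch of requests observed during a fixed time window. The `Request` record, the `red_summary` function, and the nearest-rank percentile helper are assumptions made for illustration; an instrumented service would usually expose counters and a latency histogram through its metrics library instead of storing individual requests.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class Request:
    """One completed request (hypothetical record for this sketch)."""
    duration_seconds: float
    failed: bool


def percentile(sorted_values: Sequence[float], fraction: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty sequence."""
    if not sorted_values:
        raise ValueError("no values to summarize")
    rank = max(1, math.ceil(fraction * len(sorted_values)))
    return sorted_values[rank - 1]


def red_summary(requests: Sequence[Request], window_seconds: float) -> Dict[str, Optional[float]]:
    """Summarize Rate, Errors, and Duration for requests seen in one window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    durations = sorted(r.duration_seconds for r in requests)
    return {
        # Rate: requests per second over the window.
        "rate_per_second": len(requests) / window_seconds,
        # Errors: number of requests that failed.
        "errors": sum(1 for r in requests if r.failed),
        # Duration: points on the latency distribution, not a single average.
        "duration_p50": percentile(durations, 0.50) if durations else None,
        "duration_p99": percentile(durations, 0.99) if durations else None,
    }


if __name__ == "__main__":
    observed = [
        Request(0.1, False),
        Request(0.2, False),
        Request(0.3, True),
        Request(0.4, False),
    ]
    print(red_summary(observed, window_seconds=2.0))
```

Reporting duration as percentiles rather than a mean keeps slow outliers visible, which matters when the dashboard is meant to stand in for user experience.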

For more information, refer to Tom Wilkie’s blog post The RED method: How to instrument your services.