Often there’s a focus on how a service is running from the perspective of the organization. But what does service health monitoring look like from the perspective of a user?
There are many metrics that indicate the overall health of a container, VM, or application, but independently they do not indicate whether the system is functioning correctly.
Metrics like CPU, disk, and memory are often too narrow and can be poor indicators: high CPU may be desirable, and bursts of memory usage may be perfectly normal.
Synthetic metrics address the user experience, whether measuring a simple API call or authenticating into an application and viewing a dashboard.
In this example, we’ll use hosted Grafana since the entire process is well-known. This will demonstrate
the common steps and metrics collected that can be used to monitor service health and, as a by-product, show where bottlenecks exist.
Here’s the final dashboard:
What are synthetic metrics?
Synthetic metrics are a collection of multi-stage steps required to complete an API call or transaction.
A set of metrics for an API call typically includes:
Time to connect to API (connect latency)
Duration of request (response latency)
Size of response payload
Result code of the request (200, 204, 400, 500, etc.)
Success/Failure state of the request
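As a sketch (not the article's actual script), the five metrics above can be collected with a few lines of Python; the `probe` and `summarize` names and the metric field names are illustrative assumptions:

```python
import time
import urllib.request

def summarize(status_code, connect_s, total_s, body):
    """Map raw observations onto the five synthetic metrics listed above."""
    return {
        "connect_latency_s": round(connect_s, 3),
        "response_latency_s": round(total_s, 3),
        "response_bytes": len(body),
        "status_code": status_code,
        "success": 200 <= status_code < 300,
    }

def probe(url, timeout=10):
    """Time a single GET and return its synthetic metrics."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        connect_s = time.monotonic() - start  # crude approximation: time to first response
        body = resp.read()
    return summarize(resp.status, connect_s, time.monotonic() - start, body)
```

Keeping `summarize` pure makes the measurement harness thin and the metric shape easy to test.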
That’s a very high-level synthetic, and it can serve as a model for more complex API calls.
Taking this idea further, an API call may require authentication before making a request.
The user making the request may have a valid authentication token but lack the authorization to make some API calls.
A “read only” user, for example, could not modify data but could still make some useful queries.
Why use synthetics?
User experience is the most important aspect of service offerings. As long as the user can perform their tasks according to
expectations, a service is healthy.
From the SRE viewpoint, a service can be “degraded” but remain operational:
A database could be degraded (Two out of three nodes in a cluster are healthy, but the third is offline)
Kafka replication may not be working, but enough nodes are online to continue working
Cassandra storage may be running out (It always does over time, particularly when you are on-call next)
Kubernetes Masters are offline (This does happen, even in the best of clouds)
From the user experience, none of the above issues matter as long as the service is functioning.
Synthetic metrics with hosted Grafana
A very basic Python script will be used to traverse the 10 steps required to log in and validate a session with a hosted Grafana instance.
The metrics generated by the script are in Graphite format and will be sent to a hosted metrics instance with tags enabled.
The same script can be adapted to send this data to InfluxDB or provide a metrics API that can be scraped by Prometheus.
Time series databases
Grafana offers hosted metrics for both Graphite and Prometheus. The script currently generates metrics suitable for Graphite with tags enabled.
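A minimal sketch of emitting Graphite metrics with tags, using the plaintext carbon protocol (`metric;tag=value <value> <timestamp>`); the host, port, and metric name here are placeholders, and hosted Graphite endpoints typically also require an API key:

```python
import socket
import time

def graphite_line(name, value, tags, ts=None):
    """Render one metric in Graphite's tagged plaintext format:
    name;tag1=v1;tag2=v2 <value> <unix_timestamp>"""
    tag_part = "".join(f";{k}={v}" for k, v in sorted(tags.items()))
    return f"{name}{tag_part} {value} {int(ts if ts is not None else time.time())}"

def send_metrics(lines, host="carbon.example.com", port=2003):
    """Ship newline-delimited metric lines over TCP (plaintext carbon protocol)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(("\n".join(lines) + "\n").encode())
```

The same `graphite_line` output could be adapted for InfluxDB line protocol or exposed via an HTTP endpoint for Prometheus to scrape.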
10 steps to success
There are 10 steps in the login process, plus a final step that parses the result and ensures the login has succeeded.
To discover these steps, a combination of Chrome DevTools and Postman was used to duplicate the process.
Here’s the general process used to figure out each step for a hosted Grafana login. The script that performs each step is written in Python, but could easily be written in other languages.
Step 1
Start with Chrome DevTools open, enable “Preserve log,” and visit the destination; in this case, it is
https://bkgann3.grafana.net
In dev tools you’ll see a 302 (redirect) as the response. The response will also include the redirect_url.
With those two items, we can test step 1 by connecting, checking for a 302 HTTP response code (anything else is an error), and grabbing the redirect_url, which we’ll use in the next step.
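Step 1 can be sketched like this (the function names are hypothetical; only the 302 check and the redirect URL come from the description above):

```python
import http.client

def expect_redirect(status, location):
    """Step 1 passes only on a 302 that carries a redirect target."""
    if status != 302 or not location:
        raise RuntimeError(f"expected 302 with Location header, got {status}")
    return location

def step1(host="bkgann3.grafana.net"):
    conn = http.client.HTTPSConnection(host, timeout=10)
    conn.request("GET", "/")
    resp = conn.getresponse()
    # http.client does not follow redirects, so the 302 is visible here
    return expect_redirect(resp.status, resp.getheader("Location"))
```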
Step 2
Connecting to the redirect_url from step 1, we’ll be sent to the login path, in this case https://bkgann3.grafana.net/login. We get a 200 response from this step. Anything else is an error.
We get a 302 redirect and a URL again, which is exactly where we started!
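Every step follows the same pattern (time it, check it, record the outcome), so the script can wrap each one in a small runner; this helper and its metric fields are an assumption, not the original code:

```python
import time

def run_step(step_no, fn, metrics):
    """Execute one login step, timing it and recording duration and success."""
    start = time.monotonic()
    try:
        result, ok = fn(), 1
    except Exception:
        result, ok = None, 0
    duration_s = time.monotonic() - start
    metrics.append({"step": step_no, "duration_s": duration_s, "success": ok})
    return result
```

A failed step records success=0 and returns None, so later steps can bail out early while the per-step metrics are still shipped.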
Step 10
Now that we have authorization and a valid session established, we can connect and get back a 200 response.
Step 11
The 11th step parses the body of the step 10 response for a successful login string, which is easy to locate:
"isSignedIn": true
If we see this string in the body, we’ve completed our login successfully.
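The final check can be as simple as a substring or regex scan of the step 10 body; this sketch uses a regex that tolerates whitespace variations around the colon (an assumption, since the article only shows the literal string):

```python
import re

SIGNED_IN = re.compile(r'"isSignedIn"\s*:\s*true')

def login_succeeded(body: str) -> bool:
    """Step 11: look for the success marker in the step 10 response body."""
    return bool(SIGNED_IN.search(body))
```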
Wrapping it all up
In this example, the end-user experience is measured and provides real feedback on site reliability.
Granular metrics like CPU, disk, and memory are still collected, but an SRE leverages them only when looking for opportunities to optimize the service. The synthetics provide insight into where to start looking.
The synthetic script can be cloned from this repo.
Use it to monitor your own experience with hosted Grafana or adapt it for your application!