
The 3 major benefits that Grafana Cloud customers get from our hosted Prometheus service

18 Feb 2021

Grafana Cloud is the easiest way to get what you need for observability: Prometheus and Graphite for metrics, Loki for logs, and Tempo for tracing, all integrated within Grafana and managed by the Grafana Labs team. You can go from zero to beautiful graphs, insightful logs, and preconfigured alerts in minutes. Built with modern distributed systems techniques, Grafana Cloud allows you to grow with your applications and infrastructure and easily scale to 100M+ metrics.

Recently, a question came up on one of our community boards about using the Grafana Cloud hosted Prometheus service, and why it’s even valuable if Prometheus is still needed to scrape metrics.  

The uncertainty stemmed from the fact that to get metrics into Grafana Cloud, you need to configure Prometheus or use the Grafana Agent (or soon, the Prometheus Agent). Shouldn’t a SaaS reduce the amount of resources you run in your own clusters?

In my experience talking to our Grafana Cloud customers, I’ve found that there are three main reasons why the benefits of Grafana Cloud far outweigh this one concern:

Scalable querying and storage — as a service

Grafana Cloud heavily parallelizes and caches queries, and stores 13 months of data. The service can answer hundreds of queries — many of which can span months of data — at sub-second latency. 

While you can store years of data in Prometheus, you will most likely make engineering trade-offs for single node or high-availability (HA) operation, potentially with federation. These long-lived Prometheus instances will need the usual lifecycle management, including backup and restore. This works great for small, independent teams, but is still an overall cost factor.

With Grafana Cloud, all of that is handled for you. Plus, it’s built on Cortex, which also powers Grafana Enterprise Metrics, our offering for Prometheus-as-a-Service.

A global view 

Another issue with the single-node model of Prometheus is how data from different Prometheus servers is combined.

The most obvious case is when you run one Prometheus per data center (as recommended), and you need to perform aggregations across Prometheus servers. For example: What is my global 99th percentile latency, or global requests per second? While this can be achieved with federation in OSS, it’s tricky to get right and can easily end up being a hassle at scale. This gets trickier if you’re using HA pairs of Prometheus for redundancy. 
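To make the global percentile question concrete, here is a minimal Python sketch of why per-server results cannot simply be averaged. The two data centers, their bucket bounds, and all request counts are invented for illustration, and the `histogram_quantile` helper is a simplified stand-in for bucket interpolation, not Prometheus's actual implementation.

```python
# Hypothetical latency histograms from two data centers. Each entry is
# (bucket upper bound in seconds, cumulative request count), mirroring
# Prometheus `le` buckets. All numbers are made up for illustration.
DC_A = [(0.1, 9950), (0.5, 10000), (1.0, 10000), (2.0, 10000)]  # fast, busy
DC_B = [(0.1, 500), (0.5, 800), (1.0, 950), (2.0, 1000)]        # slow, quiet


def histogram_quantile(q, buckets):
    """Estimate quantile q by linear interpolation inside the matching bucket."""
    rank = q * buckets[-1][1]  # position of the q-th observation
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:  # empty bucket: nothing to interpolate
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]


def merge(*histograms):
    """Add cumulative counts bucket by bucket (inputs must share the same bounds)."""
    return [(group[0][0], sum(count for _, count in group))
            for group in zip(*histograms)]


p99_a = histogram_quantile(0.99, DC_A)
p99_b = histogram_quantile(0.99, DC_B)
naive_p99 = (p99_a + p99_b) / 2                          # averages the percentiles
global_p99 = histogram_quantile(0.99, merge(DC_A, DC_B))  # merges buckets first

print(f"average of per-DC p99s: {naive_p99:.3f}s")
print(f"p99 over merged data:   {global_p99:.3f}s")
```

With these made-up numbers, the averaged figure comes out near 0.95 s while the merged figure is about 0.8 s, because the busy data center contributes far more requests than the quiet one. A correct global percentile needs the underlying buckets from every server in one place, which is what federation or a shared backend has to provide.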

This is another task Grafana Cloud takes off your hands; it instead provides you with a single endpoint to query all your data (however many hundreds of millions of series it is).
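As a rough illustration of what a single query endpoint means in practice, the sketch below builds one instant-query URL in the style of the Prometheus HTTP API (`/api/v1/query`). The base URL is a placeholder, the metric name `http_request_duration_seconds_bucket` is hypothetical, and the exact path and authentication for a hosted service are assumptions you would need to check against your own account settings.

```python
from urllib.parse import urlencode

# Placeholder base URL, not a real endpoint. Substitute the query endpoint
# and credentials shown for your own hosted Prometheus instance.
BASE_URL = "https://metrics.example.invalid"

# One PromQL expression covering every data center at once. The metric
# name is hypothetical; `sum by (le)` merges histogram buckets across all
# series before the percentile is computed.
GLOBAL_P99 = (
    "histogram_quantile(0.99, "
    "sum by (le) (rate(http_request_duration_seconds_bucket[5m])))"
)


def build_instant_query_url(base_url, promql):
    """Return a Prometheus-style instant query URL for the given expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


url = build_instant_query_url(BASE_URL, GLOBAL_P99)
print(url)
```

The point of the sketch is that the caller sends one expression to one place, instead of querying each Prometheus server separately and combining the results by hand.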

Reliability and ease of performance 

We have strict service-level agreements, and the service is built to be highly available throughout, so you can be confident that even at scale, or during an outage in your own cluster, your monitoring system is up and alerting you. Some people prefer depending on us rather than running beefy Prometheus servers locally. Using hosted Prometheus lets them run their local Prometheus with only a few hours of retention, which also makes the local Prometheus easier to keep healthy.
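For a feel of how much retention drives local resource needs, here is a back-of-the-envelope sketch using the rule of thumb from the Prometheus storage documentation: disk space is roughly retention time in seconds, times ingested samples per second, times bytes per sample (commonly cited as 1 to 2 bytes). The ingestion rate and the 395-day approximation of 13 months are assumptions picked for illustration, not measurements.

```python
def needed_disk_bytes(retention_seconds, samples_per_second, bytes_per_sample=2):
    """Rule-of-thumb Prometheus disk estimate: retention * ingest rate * sample size."""
    return retention_seconds * samples_per_second * bytes_per_sample


SAMPLES_PER_SECOND = 100_000          # assumed ingestion rate for one server
SIX_HOURS = 6 * 3600                  # short local retention, in seconds
THIRTEEN_MONTHS = 395 * 24 * 3600     # about 13 months, in seconds

short_term = needed_disk_bytes(SIX_HOURS, SAMPLES_PER_SECOND)
long_term = needed_disk_bytes(THIRTEEN_MONTHS, SAMPLES_PER_SECOND)

print(f"6 hours of retention:   {short_term / 1e9:.2f} GB")
print(f"13 months of retention: {long_term / 1e12:.2f} TB")
```

Under these assumptions, a few hours of retention needs a few gigabytes of disk, while 13 months lands in the terabyte range, before counting backups or replicas for HA.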

It’s fair to ask why you’d even run Prometheus in the first place if there is a hosted service. You still need something to collect the data, and before we built the Grafana Agent (a great solution our community member @wlargou recommended in response to the original question), Prometheus was the only way. Now you can install the agent, which collects the data and sends it to Grafana Cloud.

We also built a suite of integrations around the agent and Grafana Cloud. They help you monitor the most popular systems out there very easily and give you a default set of powerful dashboards and alerts for those systems. Our list of integrations is expanding constantly, and you’ll be able to get up and running with Prometheus much faster by using Grafana Cloud.

There are still some reasons people choose to run a Prometheus locally, mainly as a backup if the cloud service has issues. They run the local Prometheus with low retention (a few hours) to keep costs down while still having enough data locally to debug anything required if the cloud service has issues. In general, local alerting is more reliable than cloud alerting (which depends on network availability), but that level of reliability isn’t something most customers require.

In conclusion, the scalable querying and storage, global view, and reliability and ease of performance — all handled for you — make Grafana Cloud a worthwhile investment for our customers. And while you do need to run a local Prometheus or install the Grafana Agent, those can bring value too.

The easiest way to get started with Grafana, Prometheus, Loki for logging, and Tempo for tracing is Grafana Cloud, and we’ve recently added a new free plan and upgraded our paid plans. If you’re not already using Grafana Cloud, sign up today for free and see which plan meets your use case.