ObservabilityCON Day 2 recap: The latest Grafana Cloud tools for Prometheus to improve alerting, debugging, and scaling. Plus why continuous monitoring matters now

Published: 28 Oct 2020

ObservabilityCON 2020 is live! This week Grafana Labs is bringing together the Grafana community for talks dedicated to observability.

We hope you’re able to catch the great sessions we have planned. You can find the full schedule on the event page, and for additional information on viewing, participating in Q&As, and more, check out our quick guide to getting the most out of ObservabilityCON. 

Day 2 was dedicated to all things Prometheus — featuring new solutions and in-depth case studies.

If you aren’t up to date on the presentations so far, here’s a recap of Day 2 of the conference:

The evolution of Prometheus observability

Prometheus has become the industry standard for monitoring applications and services — and a must-have for the Grafana community. There are currently 24 maintainers working on Prometheus, and six of those are from Grafana Labs. “We want to increase adoption further and make it easier to find success using Prometheus,” said Grafana Labs co-founder and CTO Anthony Woods.

The Grafana Labs team discussed the Prometheus ecosystem and their work to introduce new features and capabilities to drive more users to the project:

Backfilling in Prometheus

Grafana Labs software engineer intern Atibhi Agrawal reviewed what’s next for backfilling in Prometheus, the most in-demand feature in the project. “It’s on the road map but it’s still not solved,” said Agrawal. “So [Grafana Labs] decided to start working on adding backfilling support.”

Grafana Cloud Agent

Grafana Labs co-founder and CTO Anthony Woods detailed how Cortex clusters operate in the Grafana Cloud Agent. Cortex allows multiple Prometheus servers to push their data to a central store, which users can then query through Grafana for a global view of all their metrics. Bonus: Grafana Cloud Agent also delivered a 40% reduction in resources compared with equivalent Prometheus workloads. (Sign up for a free trial of Grafana Cloud here.)

Scaling Cortex with shuffle sharding

With more complex, multi-tenant infrastructures deployed across multiple regions or multiple clouds, Grafana Labs software engineer Marco Pracucci explained how Cortex leverages shuffle sharding techniques to horizontally scale while isolating different tenant workloads to protect against widespread failures. 
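
For readers new to the idea, here is a minimal Python sketch of shuffle sharding in general. It is not Cortex's implementation: the function name, the instance names, and the choice of a SHA-256 seeded random sample are all assumptions made for illustration. Each tenant is deterministically assigned a small subset of instances, so a misbehaving tenant can only affect the instances in its own shard.

```python
import hashlib
import random
from typing import List


def shuffle_shard(tenant_id: str, instances: List[str], shard_size: int) -> List[str]:
    """Return a deterministic pseudo-random subset of instances for one tenant.

    Hypothetical helper for illustration only; not the Cortex algorithm.
    """
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    # Sort first so the result does not depend on the caller's ordering.
    pool = sorted(instances)
    if shard_size >= len(pool):
        return pool
    # Derive a stable seed from the tenant ID so every caller gets the same shard.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return sorted(rng.sample(pool, shard_size))


if __name__ == "__main__":
    # Instance and tenant names are made up for the example.
    instances = [f"ingester-{i}" for i in range(12)]
    for tenant in ("tenant-a", "tenant-b", "tenant-c"):
        print(tenant, shuffle_shard(tenant, instances, 3))
```

With 12 instances and a shard size of 3 there are 220 possible shards, so two tenants rarely land on exactly the same set of instances. That is the isolation property the technique relies on.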

Grafana Cloud alerting  

Grafana Labs UX designer Jess Müller showcased the new Grafana Cloud interface for Prometheus alerting. In v1.0, users get a more streamlined interface for managing alerts, rather than using the command line and logging into multiple tools. The interface can be used to alert on Loki logs as well. “You have double the fun in one tool,” said Müller. (Sign up for a free trial with Grafana Cloud here.)

Grafana Cloud synthetic monitoring 

Grafana Labs UX designer Teddy Bartha walked through the new synthetic monitoring plugin, which allows Grafana Cloud users to debug the end user experience. The solution sits on top of the Prometheus blackbox exporter, and the robust UI is accessible via Grafana Cloud. So now users can access all their data, from app performance to user experience, on Grafana Cloud to “get a fuller single pane of glass picture as to what’s happening in your app,” said Bartha. (Sign up for a free trial with Grafana Cloud here.)
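
To give a rough sense of what a blackbox-style probe measures, here is a small Python sketch. It is not the synthetic monitoring plugin or the exporter itself. The `probe` and `http_status` helpers are hypothetical names for this example, and the result keys only borrow the names of metrics the blackbox exporter exposes (`probe_success`, `probe_duration_seconds`, `probe_http_status_code`).

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Dict


def http_status(url: str, timeout: float = 5.0) -> int:
    """Fetch a URL over the network and return its HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is still useful.
        return err.code


def probe(url: str, fetch: Callable[[str], int] = http_status) -> Dict[str, float]:
    """Check an endpoint from the outside and report success and latency.

    Hypothetical helper; `fetch` is injectable so the probe can be tried offline.
    """
    start = time.monotonic()
    try:
        status = fetch(url)
    except Exception:
        # DNS failures, timeouts, refused connections: the probe simply failed.
        status = 0
    duration = time.monotonic() - start
    return {
        "probe_success": 1.0 if 200 <= status < 300 else 0.0,
        "probe_http_status_code": float(status),
        "probe_duration_seconds": duration,
    }


if __name__ == "__main__":
    # A stub fetcher stands in for a real endpoint, so no network is needed.
    print(probe("https://example.com", fetch=lambda url: 200))
```

Running a check like this on a schedule from several locations, then graphing and alerting on the results, is the general shape of synthetic monitoring.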

Prometheus for enterprise organizations

Grafana Labs technical services manager Alex Martin provided an in-depth introduction to Grafana Metrics Enterprise (GME), a new Cortex-powered on-prem product by Grafana Labs. The scalable, self-hosted Prometheus-as-a-service solution is seamless to use and helps streamline and solve common problems related to architecture, config management, and security. “It will take out so much of the stress and difficulty when it comes to large and complex deployments,” said Martin.

Watch the full session on demand here

ConProf: Production-grade Prometheus for continuous profiling

It’s a familiar situation: A process in your system starts crash-looping, and every couple of minutes your monitoring system sends an alert. But before you can retrieve the memory profile, the kernel shuts down the process. To help prevent future OOM kills, Red Hat principal software engineer Bartek Plotka and Polar Signals founder and CEO Frederic Branczyk demonstrated the benefits of ConProf, an open source tool that allows for continuous profiling to improve your Kubernetes debugging story.

Watch the full session on demand here

Building observability infrastructure on Istio using Prometheus and Jaeger at LastPass

At LastPass, there are millions of daily users with thousands of requests that hit the application’s endpoints at any given second. Krisztian Fekete, the DevOps engineer at LogMeIn who works on the LastPass platform, outlined why implementing Istio as the service mesh was a game changer for monitoring.

In his presentation, Fekete highlighted Istio’s observability options, which make troubleshooting live issues faster and more efficient. Plus, Istio’s built-in telemetry solutions such as distributed tracing allow the LastPass team to use Jaeger to follow requests through services and identify performance bottlenecks and design issues. Fekete provided a deep dive into Istio and all the tools that contribute to running an effective infrastructure at LastPass.

Watch the full session on demand here

Always-enabled monitoring with Loki, Prometheus, and Grafana

Gofers, the leading e-grocery platform in India, can credit much of its success to the continuous monitoring model activated on the application’s backend. Gofers infrastructure engineer Vaibhav Krishna talked about how vital monitoring can be to an application’s codebase as well as to the health of the business, especially in the era of microservices, rapid deployments, and ambitious SLOs. He also walked through the development of the internal tool Legend, which was built with Loki, Prometheus, and Grafana to create end-to-end dashboards that have enabled the team to expand its scope of work and share its progress in operational reviews.

Watch the full session on demand here

Today’s Sessions

View the full schedule and all the session details on the event page

Also join the Grafana Labs Community Slack workspace and drop into the #observabilitycon channel.

See the full ObservabilityCON 2020 schedule here.
