How we eliminated service outages from ‘certificate expired’ by setting up alerts with Grafana and Prometheus

Published: 25 Nov 2020

Here at Grafana Labs we are lucky to work with many partners around the globe. These partnerships inspire us with clever use cases that show how Grafana and Prometheus can be used to great effect for service monitoring and availability. We came across this use case that our partner OpenAdvice came up with for their client base, and we thought it was too good to keep secret! Say goodbye to spending endless hours debugging a service availability issue because key metrics are missing, only to find out that it was all due to an expired certificate on one endpoint and, worst of all, that it could easily have been avoided! It’s our pleasure to introduce Malte to tell you more about how to implement this solution for yourself!


Hi! My name is Malte, and I’m a consultant with OpenAdvice in Germany. We offer our customers open source as well as commercial monitoring solutions. In both cases we use Grafana for dashboarding, so it was a logical step to team up with Grafana Labs to offer Grafana Enterprise on-premise for the DACH region.  

There’s one thing most of our customers have in common: At one point or another, expired certificates have caused a problem. In theory, they shouldn’t; the exact expiration date is known, and so is the process for renewing them. But still the problems persist!

In this blog post, we present a simple yet effective solution: Monitor the expiration date of certificates with Prometheus and visualize it with Grafana, using features from the new table visualization in Grafana 7.

This is how it looks, with all your certificates at a glance: the time remaining until the certificate expires, the HTTP response message, and connection metrics.

Exporting and scraping the metrics

Fortunately there is an existing blackbox exporter that we can use, which offers everything we need to collect this data, so it’s a “low code” implementation. We only have to take care of the configuration, and luckily we do not have to build our own exporter!  

The blackbox exporter is used to monitor HTTP(S) pages. By probing a [host:port] combination, it grabs the SSL certificate information and publishes the expiry date as the `probe_ssl_earliest_cert_expiry` metric (a Unix timestamp), from which we can calculate the time remaining.
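To make the idea behind this metric concrete, here is a minimal Python sketch of the arithmetic, not the exporter's actual implementation. It assumes the certificate `notAfter` dates are already available as strings in the format Python's `ssl` module reports; the function names and the sample chain are hypothetical.

```python
import ssl


def earliest_cert_expiry(not_after_dates):
    """Return the earliest expiry, in Unix seconds, among notAfter strings.

    Each string uses the format reported by Python's ssl module,
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    return min(ssl.cert_time_to_seconds(d) for d in not_after_dates)


def seconds_remaining(expiry_unix, now_unix):
    """Time left until expiry; negative once the certificate has expired."""
    return expiry_unix - now_unix


# Hypothetical chain: two certificates with different expiry dates.
chain = ["Jan  1 00:00:00 2021 GMT", "Jan  5 09:34:43 2018 GMT"]
expiry = earliest_cert_expiry(chain)
print("earliest expiry (Unix seconds):", expiry)
```

The certificate that expires first determines the value, which is why a single number per endpoint is enough to drive the table and the alerts later in this post.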

Setting up the blackbox exporter is very simple and consists of two configurations:

  1. The blackbox.yml itself: Simply configure the following module to export the metrics.
modules:
  http_2xx:
    prober: http
    http:
      preferred_ip_protocol: "ip4"
      tls_config:
        insecure_skip_verify: true
 
  2. Then update the scraping targets of your Prometheus server: add a scrape job for the blackbox exporter to prometheus.yml to collect the metrics.
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://websrv01.openadvice.de
          - https://www.openadvice.de
          - https://oa-win2016-oa.int.openadvice.de
          - https://portal.openadvice.de
          - https://xwiki.int.openadvice.de
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: <blackboxexporter_IP>:9115  # Blackbox exporter scraping address
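The relabeling rules above can be hard to read at first. In effect, each listed URL becomes the `target` query parameter (and the `instance` label), while the address Prometheus actually scrapes is the blackbox exporter itself. The following Python sketch only illustrates the resulting request URL; the exporter hostname used here is a made-up placeholder, and the function name is hypothetical.

```python
from urllib.parse import urlencode


def probe_url(exporter_address, module, target):
    """Build the URL Prometheus requests for one target after relabeling.

    exporter_address stands in for <blackboxexporter_IP>:9115,
    module comes from params.module, and target is one static_configs entry.
    """
    query = urlencode({"module": module, "target": target})
    return f"http://{exporter_address}/probe?{query}"


# Placeholder exporter address, for illustration only.
print(probe_url("blackbox.example.internal:9115", "http_2xx",
                "https://www.openadvice.de"))
```

Requesting such a URL by hand (for example in a browser) is a quick way to check what the exporter returns for a single endpoint before adding it to Prometheus.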

You can see in these configurations that we are scraping metrics from a number of HTTPS endpoints and their associated certificates. It’s really simple to add more targets and monitor more URLs as your services grow, and this can be automated in your provisioning pipelines!
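One possible way to automate this, sketched here as an assumption rather than what we run in production, is to let the pipeline generate a target file and switch the job from `static_configs` to Prometheus file-based service discovery (`file_sd_configs`). The helper name and the example label below are hypothetical.

```python
import json


def build_file_sd(urls, labels=None):
    """Render a Prometheus file_sd document (JSON) for the given probe URLs.

    Duplicates are removed and targets are sorted so that repeated pipeline
    runs produce identical files.
    """
    group = {"targets": sorted(set(urls))}
    if labels:
        group["labels"] = dict(labels)
    return json.dumps([group], indent=2)


# Example: two endpoints from this post, tagged with a hypothetical label.
document = build_file_sd(
    ["https://www.openadvice.de", "https://portal.openadvice.de"],
    labels={"team": "web"},
)
print(document)
```

The pipeline would write this output to a file that the scrape job lists under `file_sd_configs`; the relabeling rules stay exactly as shown above.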

Bringing it all together in one panel

Now that we have our metrics, we can quickly visualize them in Grafana. We’ve created a new table panel in Grafana 7 to display all the relevant metrics in one panel.

We’ll explain each step, but we’ve also uploaded a dashboard with this table as an example to the open source Grafana dashboards repo here: Dashboard ID: 13230. Feel free to import it, explore, and modify it to make it your own! 

A couple of Prometheus queries…

We’ve kept the queries simple in this example.

Remaining time:

probe_ssl_earliest_cert_expiry - time()

HTTP status codes:

probe_http_status_code

All HTTP duration queries:

probe_http_duration_seconds{phase="resolve"}

probe_http_duration_seconds{phase="connect"}

probe_http_duration_seconds{phase="tls"}

probe_http_duration_seconds{phase="processing"}

probe_http_duration_seconds{phase="transfer"}

A transformation or two…

We are also leveraging the new “Transform” features of Grafana 7: using an Outer join on the “instance” field to display the results of all queries together on one row. 

The Organize Fields transform is also used to filter the columns that are displayed in our panel.
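If the two transformations feel abstract, the Python sketch below mimics what they do to the query results. It is only an illustration of the concept, not Grafana's implementation, and the column names and sample values are made up.

```python
def outer_join_by_instance(query_results):
    """Merge per-query results into one row per instance.

    query_results maps a column name to {instance: value}. Instances missing
    from a query get None in that column, as in an outer join.
    """
    instances = sorted({inst for series in query_results.values() for inst in series})
    return {
        inst: {col: series.get(inst) for col, series in query_results.items()}
        for inst in instances
    }


def organize_fields(rows, keep):
    """Keep only the listed columns, in the given order."""
    return {inst: {col: row[col] for col in keep} for inst, row in rows.items()}


# Made-up sample values for two endpoints from this post.
results = {
    "expires_in": {"https://www.openadvice.de": 2592000,
                   "https://portal.openadvice.de": 864000},
    "http_code": {"https://www.openadvice.de": 200},
}
rows = outer_join_by_instance(results)
print(organize_fields(rows, ["expires_in", "http_code"]))
```

Because the join is an outer join, an endpoint still gets a row even when one of the queries returns nothing for it, which is exactly what you want when a probe is failing.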

And some series overrides…  

One of our favorite new features in the table visualization is the “Cell Display Mode” option for each column. We’ve chosen bright background colors for the “Certificate expires in” and the “HTTP Code” columns. The “LCD Gauge” in the connection performance columns provides clarity, but also a cool retro look. :)

Alerting

In the sections above, we built a shiny table panel for our HTTPS connections. It’s visually impactful, but as we don’t look at dashboards all the time, we still need a notification when a certificate is about to expire. For this, we defined the following alert in a Prometheus rules file (loaded via rule_files in prometheus.yml):

- name: ssl_expiry
  rules:
  - alert: Ssl Cert Will Expire in 30 days
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "SSL certificate will expire soon on (instance {{ $labels.instance }})"
      description: "SSL certificate expires in 30 days\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

Of course you can easily modify the alert to another number of days; you just have to adjust the end of the expression line, e.g., to 10 days:

expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 10

And don’t forget to adjust the alert name and description accordingly. :) See the Alertmanager configuration for a list of channels that you can send alerts and notifications to!

You can also send an alert notification in the unfortunate event that a certificate has already expired:

expr: probe_ssl_earliest_cert_expiry - time() < 0
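To see how the thresholds in this section relate to each other, here is a small Python sketch of the same comparisons. It is a simplified illustration that ignores the `for: 5m` clause, and the function name and state labels are hypothetical.

```python
SECONDS_PER_DAY = 86400


def certificate_state(expiry_unix, now_unix, warn_days=30):
    """Classify a certificate using the same comparisons as the alert rules.

    "expired"       corresponds to: expiry - time() < 0
    "expiring_soon" corresponds to: expiry - time() < 86400 * warn_days
    """
    remaining = expiry_unix - now_unix
    if remaining < 0:
        return "expired"
    if remaining < SECONDS_PER_DAY * warn_days:
        return "expiring_soon"
    return "ok"


now = 1606262400  # 25 Nov 2020 00:00:00 UTC
print(certificate_state(now + 10 * SECONDS_PER_DAY, now))  # within 30 days
print(certificate_state(now + 10 * SECONDS_PER_DAY, now, warn_days=10))
```

Note that the comparison is strict: a certificate with exactly `warn_days` left does not yet match, just as in the PromQL expressions.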

Conclusion 

Monitoring certificates and connections with Prometheus and Grafana is a must-have for anyone providing services via HTTPS.

It was possible to do something similar with earlier Grafana versions, but now with Grafana 7 features, we can put everything into one compact table without overloading the dashboard visually, and the certificate/connection problems jump right out at even a quick glance!

For any questions regarding this blog post, further ideas, or if you need help with your integration, feel free to contact us:

OpenAdvice website 

OpenAdvice on LinkedIn

Technical contact @OpenAdvice: Malte Grimm

Sales contact @OpenAdvice: Jeanette Fürst
