
Analyze log costs with Grafana Explore

Grafana Cloud provides a managed Loki environment for storing logs. Similar to metric labels in Prometheus, Loki indexes only the log metadata, using labels. This guide will help you analyze and understand log usage in Grafana Cloud.

Note

This process will work with any Grafana Loki install, not just within Grafana Cloud.

Before you begin

To view and manage logs, you must have the following:

  • A Grafana Cloud account
  • Admin or Editor user permissions for the managed Grafana Cloud instance

Limitations

Grafana Cloud limits the size of expensive queries, so if you expect a query to return a large number of results, use a shorter time range to avoid timeout errors. For more information on setting query duration limits, see Prometheus querying basics. You can also aggregate the results along a label dimension. See Aggregating logs usage below.
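
For example, instead of returning one series per log stream, you can aggregate along a label dimension and keep the range selector short. The following is a sketch that assumes your streams carry a job label:

Logql
sum by (job) (bytes_over_time({job=~".+"}[5m]))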

If you would prefer to explore logs with a command line interface, see the LogCLI documentation.

View log usage by log stream

In this section, we’ll use a wildcard regex to query for metrics about each log stream that contains the job label.

  1. Log in to your instance and click the Explore (compass) icon in the menu sidebar.

  2. Use the data sources dropdown located at the top of the page to select the data source corresponding to your Grafana Cloud Logs endpoint. The data source name should be similar to grafanacloud-<yourstackname>-logs.

  3. Select Instant for the Query Type, and Code for the query editor. Then use the following LogQL searches in the query toolbar to explore usage for your environment (logs generated by Synthetic Monitoring checks will also be returned):

    • Number of entries for each log stream over a five minute interval:

      Logql
      count_over_time({job=~".+"}[5m])

      This count is listed in the far right-hand column of the results table, under Value #A.

    • Bytes used by each log stream for the past five minutes:

      Logql
      bytes_over_time({job=~".+"}[5m])
    • Count the number of entries within the last minute and return any job with greater than 100 log lines. Adjust the number of log lines as needed for more insight:

      Logql
      count_over_time({job=~".+"}[1m]) > 100
    • Count of logs ingested over the past hour and specify filename, host, and job names if those labels exist:

      Logql
      sum(count_over_time({job=~".+"}[1h])) by (filename, host, job)

Note

The job label is used in these statements, but you can use other labels. If you are not sure which labels your environment might be using, click on the Log browser tab in Explore to review the available labels. If an expected label is missing, this is a good indication that these logs are not being successfully received by your Grafana Cloud environment.
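
For example, if your streams are labeled by host rather than job, the same pattern applies. The following is a sketch that assumes a host label exists in your environment:

Logql
sum(count_over_time({host=~".+"}[5m])) by (host)
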
  4. Save your queries (optional).

    You can save a query in your query history to quickly access your favorites.

    You can also download the query results as a text file directly from Explore using Grafana Inspector. For more information, see Grafana Inspector.

Aggregating logs usage

You may wish to view your log usage grouped by some dimension, such as an app name, team name, cluster, or even log level. These dimensions don’t need to be in the label set. You may also wish to track usage over a long time period. The examples below use log lines from the Grafana Cloud Synthetic Monitoring app for both of these cases; the queries, particularly the bytes_over_time metric query, which is the most relevant to Grafana Cloud costs, should be easy to adapt for your own purposes.

Aggregation by label

Aggregation by label is a great option if you have labeled log streams for apps, services, or teams. A typical Synthetic Monitoring log line looks like this:

level=info target=https://grafana-assets.grafana.net probe=Amsterdam region=EMEA instance=https://grafana-assets.grafana.net job=grafanaAssetsCDN check_name=http source=synthetic-monitoring-agent label_env=production msg="Check succeeded" duration_seconds=0.242576593

We might want to determine the bytes ingested per probe, so we can use the following query:

Logql
sum by (probe) (bytes_over_time({source="synthetic-monitoring-agent"} [1m]))

If a result comes up with an empty string under the probe column, then some of your queried log streams do not contain that label.
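
If you want to exclude those unlabeled streams from the aggregation, you can require the label in the stream selector. The following is one way to do it:

Logql
sum by (probe) (bytes_over_time({source="synthetic-monitoring-agent", probe=~".+"} [1m]))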

Aggregation by log line content

Sometimes, you might need to aggregate something that you don’t have scraped into a label. Continuing our Synthetic Monitoring example, we don’t have the response message (msg) mapped into a label, since it is potentially unbounded.

However, you may need to see logs ingested by message. You can do this by using logfmt to extract log fields for aggregation:

Logql
sum by (msg) (bytes_over_time({source="synthetic-monitoring-agent"} | logfmt [1m]))

What if I get time series limit errors?

logfmt is a very powerful tool, but since it creates a temporary log stream for every combination of log line fields, it can easily hit the time series limits. You can use regexp to ensure you don’t increase the cardinality of results too much before aggregation.

Moving away from our Synthetic Monitoring example, consider this log line:

logger=context traceID=12ab34cd56ef userId=x orgId=y uname=grafanauser t=2022-12-19T23:22:18.825687729Z level=info msg="Request Completed" method=POST path=/api/ds/query status=200 remote_addr=127.0.0.6 time_ms=38 duration=38.275733ms size=1369 referer="https://grafanauser.grafana.net/d/89gh01ij/super-cool-dashboard?from=now-1h&orgId=1&refresh=10s&to=now" db_call_count=1 handler=/api/ds/query

This is a typical log line you’ll see from the Grafana backend, and in many Go web services. But if you try to use logfmt to insert it into a metric query, it’ll quickly create hundreds of thousands of log streams due to parts like the timestamp, duration, size, and referrer.

For example, the following query might fail due to the time series limit:

Logql
sum by (logger) (bytes_over_time({app="grafana"} | logfmt | __error__="" [1m]))

To avoid these errors, you can replace logfmt and add a regular expression:

Logql
sum by (logger) (bytes_over_time({app="grafana"} | regexp `logger=(?P<logger>[^ ]+)` | logger != "" [1m]))

Here, we take advantage of named capture groups to enrich the log labels with a single new label.
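
The same approach extends to more than one capture group. The following sketch also extracts the status field shown in the example log line above; adjust both field names to match your own log format:

Logql
sum by (logger, status) (bytes_over_time({app="grafana"} | regexp `logger=(?P<logger>[^ ]+).*status=(?P<status>\d+)` | logger != "" [1m]))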

What if I STILL get time series limit errors?

This depends on the use case. Query time series limits are set in the Loki configuration to help manage resources. Raising them can be an option, but should be done with caution. Alternatives include:

  1. Filtering out some high-cardinality log streams, for example the value with the highest cardinality. In this case, logger=context is associated with a large number of log streams, so we can use the selector:

    Logql
    {app="grafana"} != "logger=context" | regexp `logger=(?P<logger>[^ ]+)` | logger != ""
  2. Adjusting the regex to match only a subset of values: This query will select only those loggers which start with ‘A’ through ‘H’ (case insensitive).

    Logql
    {app="grafana"} | regexp `logger=(?P<logger>[a-hA-H][^ ]+)` | logger != ""
  3. Scrubbing one or more labels with label_format: This will reduce the number of time series in the result set. The following scrubs the pod label, which often has a Kubernetes UID attached and inflates the number of time series created.

    Logql
    {app="grafana"} | label_format pod=`` | regexp `logger=(?P<logger>[a-hA-H][^ ]+)` | logger != ""
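
Any of these selectors can be wrapped back into the metric query. For example, the following sketch combines the filtered selector from the first option with the earlier aggregation:

Logql
sum by (logger) (bytes_over_time({app="grafana"} != "logger=context" | regexp `logger=(?P<logger>[^ ]+)` | logger != "" [1m]))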

Why are you using such small time ranges?

In short, these queries will almost always be fairly slow.

The example queries here all use a small time range, with two cases in mind:

  1. sudden and recent spikes in usage
  2. use in recording rules for continuous tracking

Running any of the queries over a long time range will be fairly slow. Since they require the querier to process and mutate every single line, it’s just not possible to get response times down to what you can get with time-series databases such as Prometheus. See below to read about setting up recording rules.

What if I don’t have a label or any log line content common across all of my log lines?

You’ll need to mix the above approaches. Suppose you’re using the following query:

Logql
sum by (logger) (bytes_over_time({app="grafana"} | regexp `logger=(?P<logger>[^ ]+)` | logger != "" [1m]))

You can use the following selector and line filter to pull the logs that don’t contain a logger field, then build a separate query to account for them:

Logql
{app="grafana"} !~ `logger=[^ ]+`
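
For example, the following sketch attributes those remaining logs by a different dimension. The filename label is an assumption; use whatever label or extracted field your leftover streams share:

Logql
sum by (filename) (bytes_over_time({app="grafana"} !~ `logger=[^ ]+` [1m]))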

Once you’ve crafted another query to account for the other logs, make sure to check your work to prevent double counting. One approach would be to use the empty value from one sum by (<something>) query to spot-check the other.
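
Another simple cross-check is to compare the partitioned results against an unpartitioned total over the same range; the sums should add up to it:

Logql
sum(bytes_over_time({app="grafana"} [1m]))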

Recording rules for long-term tracking

You can use recording rules for very high-resolution usage attribution, usage attribution across multiple dimensions, and usage attribution over long periods of time. To create a recording rule:

  1. Hover over the Grafana Alerting icon (bell).

  2. Click New Alert Rule.

  3. Use a descriptive name.

  4. Select Mimir or Loki Recording Rule.

  5. Select the Loki data source you want to perform usage attribution on.

  6. Set the query (see the example query after these steps).

  7. If desired, set the Namespace and Group.

  8. Click Save.

  9. Return to Explore and query your hosted metrics with the rule name. You might have to wait a few moments for results to be returned.
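
For the query in step 6, a recording rule typically evaluates an aggregation on a fixed interval. The following is a sketch; the job grouping is an assumption to adapt to your own attribution dimension:

Logql
sum by (job) (bytes_over_time({job=~".+"}[5m]))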

What if I have a large volume of historical logs I want to attribute from before I set up the recording rules?

If you’re using these queries in a dashboard, optimize them by running them as range queries with a specific minimum interval, then sum the results on the visualization panel using transformations. We tested this approach on up to a week of internal logs, totaling 23 PB, which took almost three minutes.
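
As a sketch, the dashboard version of the earlier query can use Grafana’s $__interval variable as the range so that the panel’s Min interval option controls the resolution; the job grouping is again an assumption:

Logql
sum by (job) (bytes_over_time({job=~".+"} [$__interval]))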