
Observability concepts

Understanding these foundational concepts helps you use signal correlation effectively and avoid common pitfalls.

Labels, attributes, and matching

To navigate between telemetry signals, you first need to understand what information Grafana Cloud uses to correlate them.

What are labels and attributes?

Labels and attributes are key-value pairs that describe your telemetry data. Metrics, logs, and profiles use labels. Traces use attributes.

Note

In the Grafana data source configuration UI, trace attributes are sometimes called tags for historical reasons. This documentation uses attributes (the OpenTelemetry term) when discussing traces conceptually, and tags only when referring to specific Grafana configuration fields.

For metrics, you can use this example PromQL query to monitor the order rate of a checkout service. With slight variations, you could use it to track throughput, calculate success rates, monitor SLIs and SLOs, and plan capacity.

promql
checkout_orders_rate:actual{service_name="checkoutservice", ml_job_name="Checkout Orders Rate"}

In this example:

  • Metric name: checkout_orders_rate:actual
  • Labels: service_name="checkoutservice", ml_job_name="Checkout Orders Rate"

Traces use resource attributes (shared across all spans from a service) and span attributes (specific to each operation). You can use TraceQL, the trace query language, to create a similar query using tracing data.

For example, this TraceQL query finds traces containing spans from the currencyservice where the CurrencyService/Convert operation converts from USD and ends in an error. Instead of the service_name label used in the PromQL example, this query uses the resource.service.name attribute. The query returns traces and matching spans instead of aggregated metrics.

traceql
{span.app.currency.conversion.from="USD" && resource.service.name="currencyservice" && name="CurrencyService/Convert" && status=error}

Match for correlation

This diagram shows how labels and attributes must match exactly across different signals:

Label matching for correlation

Correlation works by matching labels and attributes across different signals. For two pieces of telemetry to correlate:

  1. Names must match exactly (case-sensitive)

    • The label service="api" matches the label service="api"
    • The attribute resource.service.name="api" matches the label service="api"
    • service="api" doesn’t match Service="api"; the match fails because of the capital S
    • service="api" doesn’t match service_name="api"; the match fails because the label names differ
  2. Values must match exactly (case-sensitive)

    • environment="prod" matches environment="prod"
    • environment="prod" doesn’t match environment="production"
    • environment="prod" doesn’t match environment="Prod"; the match fails because of the capital P

Example correlation

Prometheus metric query using PromQL:

promql
http_requests_total{service="checkout", environment="prod"}

Loki logs query using LogQL:

logql
{service="checkout", environment="prod"} |= "error"

These correlate because both have identical service and environment labels. For traces, use resource attributes such as resource.service.name and resource.environment to match with labels from other signals.
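
For example, a TraceQL sketch that targets the same service and environment (this assumes the environment is recorded in a resource.environment attribute; your instrumentation might use a different name, such as deployment.environment):

traceql
{resource.service.name="checkout" && resource.environment="prod"}

Because these attribute values match the service and environment labels in the metric and log queries above, Grafana can link the three signals.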

What is label cardinality?

The cardinality of a data attribute is the number of distinct values that the attribute can have. For example, a boolean column in a database, which can only have a value of either true or false, has a cardinality of 2.

In Grafana Cloud, cardinality refers to the number of unique combinations of label values for a telemetry signal. It determines how many time series or log streams exist.

Cardinality examples

Low cardinality:

Bash
environment = ["dev", "staging", "prod"]  // 3 unique values

Medium cardinality:

Bash
http_endpoint = ["/api/users", "/api/orders", "/api/products", ...]  // ~50 unique values

High cardinality:

Bash
user_id = ["user_123", "user_456", "user_789", ...]  // Millions of unique values

Why cardinality matters

This diagram illustrates the impact of low versus high cardinality:

mermaid
graph TD
    subgraph Low["Low Cardinality: environment"]
        L1["{service='api', environment='prod'}"]
        L2["{service='api', environment='staging'}"]
        L3["{service='api', environment='dev'}"]
    end

    subgraph High["High Cardinality: user_id"]
        H1["{service='api', user_id='user_1'}"]
        H2["{service='api', user_id='user_2'}"]
        H3["{service='api', user_id='user_3'}"]
        H4["... millions more ..."]
    end

    Low --> LowImpact["3 time series, fast queries, good for correlation"]
    High --> HighImpact["Millions of series, slow queries, poor for correlation"]

For metrics:

  • Each unique label combination creates a time series
  • High cardinality = many time series = higher costs and slower queries
  • Example: {service="api", user_id="123"} creates one series per user

For logs:

  • Each unique label combination creates a log stream
  • High cardinality = many streams = slower queries
  • Example: {service="api", request_id="abc"} creates one stream per request

For correlation:

  • High cardinality labels make correlation less effective
  • Matching on user_id rarely finds related data
  • Matching on service finds all related data for that service

Cardinality best practices

Do:

  • Use low-cardinality labels for correlation (service, environment, cluster)
  • Use medium-cardinality labels for filtering (endpoint, method, status)
  • Include high-cardinality data in log content, trace span attributes, or structured metadata (see the sketch after these lists)
  • Keep metric labels under 30 per series (40 maximum allowed, but performance degrades)
  • Keep log labels to 15 or fewer per stream (hard limit)
  • For profiles, use consistent service_name labels to enable trace-to-profile correlation

Don’t:

  • Don’t use user IDs as labels
  • Don’t use request IDs as labels
  • Don’t use timestamps as labels
  • Don’t use UUIDs as labels
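
If you need to look up a high-cardinality value such as a user ID in traces, store it as a span attribute and query it with TraceQL rather than indexing it as a metric or log label. A minimal sketch, assuming a hypothetical span.user.id attribute:

traceql
{resource.service.name="api" && span.user.id="user_123"}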

Signal-specific guidance:

Signal   | Label limit                 | High-cardinality alternative
---------|-----------------------------|--------------------------------------------------
Metrics  | 30 recommended, 40 max      | Use exemplars to link to traces with full context
Logs     | 15 max                      | Use structured metadata or log content
Traces   | No hard limit on attributes | Attributes over 2 KB are auto-truncated
Profiles | No documented limit         | Use service_name for correlation

For metrics, use the cardinality management dashboard to monitor label distribution and identify high-cardinality issues.

Examples

Store high-cardinality values like user IDs in trace span attributes or log message content, not as metric or log labels.

This example uses a high-cardinality label, which creates a separate time series for every user:

promql
http_requests_total{service="api", user_id="12345"}

In contrast, this example keeps high-cardinality data out of labels:

promql
http_requests_total{service="api", endpoint="/api/users"}
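
When you do need to find a specific user, filter on log line content or structured metadata at query time instead of indexing the value as a label. A minimal LogQL sketch, assuming the user ID appears in the log line as user_id=<value>:

logql
{service="api"} |= "user_id=12345"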

Check cardinality

To check cardinality in Prometheus:

promql
count(http_requests_total) by (service)     # Count series per service
count(http_requests_total) by (user_id)     # Count series per user_id (likely very high)

To check cardinality in Loki:

logql
sum(count_over_time({service="api"}[1h]))                 # Total log entries for the service
sum by (user_id) (count_over_time({service="api"}[1h]))   # Entries per user_id; only works if user_id is a stream label (high-cardinality warning)

What is sampling?

Sampling is the process of determining which data to store and which to discard. Configured sampling policies make the store-or-discard decision.

Why sampling is used

For traces, sampling is commonly used to ensure that only relevant traces are stored. Common use cases include:

  • Reducing the volume of stored tracing telemetry. High volumes of unused trace data lead to unnecessary costs.
  • Dropping traces that add little informational value to the overall health of an application, such as traces generated by health check endpoints (for example, Kubernetes readiness probes).
  • Removing duplicate traces replicated across active-active high availability (HA) instances.
  • Ensuring that critical issues are always sampled, such as traces with errors or above-average latencies.
  • Sampling a baseline percentage of traces across all requests (commonly 1% or less) so you can compare nominal and anomalous traces.

Refer to Sampling in the Tempo documentation for more information.

Log sampling reduces volume by keeping only a percentage of logs. At ingestion time, configure agents such as Promtail or Grafana Alloy with a sampling stage (for example, rate: 0.1) to keep 10% of log lines and drop 90% before they are sent to Loki. At query time, you can use LogQL functions like first_over_time() or last_over_time() to downsample by selecting one value per time window instead of every line. This helps reduce ingestion costs, stay within rate limits (5 MB/sec per user in Grafana Cloud), and manage high-volume streams, though you trade granularity for representative patterns.
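
For example, a query-time downsampling sketch, assuming logfmt-formatted lines that include a duration_ms field:

logql
first_over_time({service="api"} | logfmt | unwrap duration_ms [1m])

This returns one value per stream per one-minute window rather than every matching line.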

Strategies for sampling

When sampling traces, you can use a head or tail sampling strategy.

With a head sampling strategy, the decision to sample a trace is made as early as possible, without taking the whole trace into account. It's simple and inexpensive, but it can't target traces based on outcomes such as errors or high latency.

With a tail sampling strategy, the decision to sample a trace is made after considering all or most of the spans. For example, tail sampling is a good option to sample only traces that have errors or traces with long request duration. Tail sampling is more complex to configure, implement, and maintain but is the recommended sampling strategy for large systems with a high telemetry volume.

Tail sampling policies

For production environments, use tail sampling with multiple policies to keep the most valuable traces:

  • Error-based: Keep all traces that contain errors
  • Latency-based: Keep traces that exceed a latency threshold (for example, requests over 500ms)
  • Attribute-based: Keep traces matching specific attributes (for example, specific endpoints or user segments)
  • Probabilistic: Keep a percentage of remaining traces for baseline visibility

When configuring tail sampling:

  • Set the decision wait period to accommodate your typical trace duration
  • Tune batch timeouts and sizes to avoid latency in sampling decisions
  • Refer to the traces sampling documentation for configuration details

How sampling affects correlation

Metrics-to-traces (exemplars):

  • Exemplars might point to traces that were sampled out (dropped).
  • Clicking an exemplar might show “No trace found”.
  • This is expected behavior with sampling.
  • Exemplars are generated before sampling decisions, so the trace may not exist.
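
For example, exemplars typically appear on latency panels backed by a histogram query like the following (http_request_duration_seconds is a hypothetical metric name, and exemplars must be enabled in your instrumentation and data source). Clicking a point opens the linked trace, or reports that no trace was found if it was sampled out:

promql
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{service="checkout"}[5m])))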

Logs-to-traces:

  • Log entry has a trace ID, but the trace was sampled out.
  • Clicking the trace ID shows “No trace found”.
  • Logs remain, but trace details aren’t available.
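
For example, a LogQL sketch that keeps only log lines carrying a trace ID, assuming logfmt-formatted lines with a trace_id field; some of the IDs it returns may still point to traces that were sampled out:

logql
{service="checkout", environment="prod"} | logfmt | trace_id=~".+"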

Traces-to-metrics (metrics-generator):

  • The Tempo metrics-generator derives metrics from traces.
  • Metrics are generated before sampling, so they represent all traces.
  • However, clicking from generated metrics to traces may show sampled-out traces.
  • The metrics-generator collection interval is 60 seconds by default.
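
For example, a sketch that queries span metrics produced by the metrics-generator, assuming the default metric and label names of the span-metrics processor (your configuration may emit different dimensions):

promql
sum by (span_name) (rate(traces_spanmetrics_calls_total{service="checkout", status_code="STATUS_CODE_ERROR"}[5m]))

As noted above, these series represent all traces, so the error rate stays complete even when some of the corresponding traces are no longer retrievable.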

Best practices:

  • Keep higher sampling rates during development and testing.
  • Use intelligent tail sampling (keep errors, slow requests) in production.
  • Accept that not all traces are available.
  • Use metrics and logs for complete visibility, traces for detailed investigation.
  • Configure error-based and latency-based policies to ensure problematic traces are always kept.

Time ranges and alignment

Correlation between telemetry signals is also affected by timestamps and the time ranges of your queries.

Time synchronization

For correlation to work, timestamps must be synchronized across systems.

Requirements:

  • All systems use UTC
  • Clocks are synchronized (NTP recommended)
  • Timestamps in telemetry data are accurate

Common issues:

  • Clock drift causes time misalignment
  • Logs show events at 10:00, but metrics show activity at 10:05
  • This breaks time-based correlation

Time range queries

When correlating signals, use the same time range.

In Grafana Explore:

  • Use the time picker to set the same range for all queries
  • Use split view with time sync enabled (chain link icon)

In Drilldown apps:

  • Selecting a time range in one signal focuses other signals automatically
  • Time ranges stay synchronized across signal types

Time range best practices

Do:

  • Use consistent time ranges across all queries
  • Include a buffer (for example, query 5 minutes before and after the event)
  • Use UTC for all timestamps
  • Verify clock synchronization with NTP
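
A quick way to verify the last item on Linux hosts (a minimal sketch, assuming systemd's timedatectl is available and chrony is the NTP client):

Bash
timedatectl status | grep -i synchronized   # Shows whether the system clock is NTP-synchronized
chronyc tracking                            # Shows the current offset from the NTP source (chrony only)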

Don’t:

  • Don’t assume local time zones match
  • Don’t query narrow time ranges that might miss related events
  • Don’t forget clock drift on long-running systems

Data source limits

Grafana Cloud has limits to ensure system stability and fair usage. Understanding these limits helps you design effective correlation strategies and avoid unexpected query failures.

Signal-specific limits

The following table summarizes key limits for each telemetry signal. Many limits autoscale based on your usage tier. For complete details, refer to the Grafana Cloud usage limits documentation.

Metrics (Prometheus/Mimir)

Limit                   | Default value          | Notes
------------------------|------------------------|---------------------------------------
Active series per user  | 150,000                | Autoscaled based on tier
Ingestion rate          | 10,000 samples/sec     | Autoscaled based on tier
Labels per series       | 40 max, 30 recommended | Keep under 30 for best performance
Label name length       | 1,024 characters       | Fixed limit
Label value length      | 2,048 characters       | Fixed limit; applies to metric names
Query time range        | 768 hours (32 days)    | Maximum max_partial_query_length
Out-of-order window     | 2 hours                | Samples older than this are rejected

Logs (Loki)

Limit                   | Default value         | Notes
------------------------|-----------------------|-----------------------------------------------------
Ingestion rate          | 5 MB/sec              | Per-user rate limit
Active streams per user | 5,000                 | Per-user stream limit
Labels per stream       | 15                    | Maximum labels allowed
Label name length       | 1,024 characters      | Fixed limit
Label value length      | 2,048 characters      | Fixed limit
Max line size           | 256 KB                | Cannot be modified; lines exceeding this are dropped
Retention period        | 30 days default       | Configurable from 30 days to 1 year
Query time range        | 721 hours (~30 days)  | Maximum max_query_length

Traces (Tempo)

Limit                       | Default value | Notes
----------------------------|---------------|----------------------------------------------
Max bytes per trace         | 5 MB          | Traces exceeding this are rejected
Ingestion rate              | 500 KB/sec    | Per-tenant rate limit
Attribute size              | 2 KB          | Attributes exceeding this are auto-truncated
Retention period            | 30 days       | Default block retention
TraceQL metrics query range | 24 hours      | Can be increased via Support

Profiles (Pyroscope)

Limit                 | Default value   | Notes
----------------------|-----------------|---------------------------------------------
Daily ingestion limit | Configurable    | Set per stack; data discarded when reached
Billing               | Per GB ingested | $0.50/GB beyond plan inclusion

Note

Many metrics and logs limits autoscale based on your Grafana Cloud tier. Contact Grafana Support to request increases for non-autoscaled limits.

How limits affect correlation

Label and attribute limits:

  • Metrics allow up to 40 labels per series, but performance degrades above 30
  • Logs allow only 15 labels per stream
  • When designing shared labels for correlation, stay within the most restrictive limit (15 for logs)
  • Label name and value length limits (1,024/2,048 characters) are consistent across metrics and logs

Query limits:

  • Metrics queries can span up to 32 days; logs queries up to ~30 days
  • TraceQL metrics queries are limited to 24 hours by default
  • Use shorter time ranges when correlating across signals to ensure all queries complete

Retention differences:

  • Default retention varies: metrics (13 months), logs (30 days), traces (30 days)
  • Can’t correlate signals if one has been deleted due to shorter retention
  • Plan retention periods to ensure overlapping data availability

Ingestion limits:

  • Exceeding ingestion limits causes data to be dropped
  • Dropped data creates gaps that break correlation
  • Monitor ingestion rates using the Usage Insights dashboard

Work within limits

Here are some best practices:

  • Use specific label selectors to reduce data volume
  • Query shorter time ranges for high-cardinality data
  • Use topk() or bottomk() to limit results in PromQL (see the example after this list)
  • Add filters progressively to narrow results
  • Monitor your usage with the cardinality management dashboard
  • Keep label counts well under the maximum (aim for fewer than 15 shared labels to work across all signals)
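
For example, a sketch of the topk() tip, using the http_requests_total example from earlier and assuming an endpoint label:

promql
topk(10, sum by (endpoint) (rate(http_requests_total{service="api"}[5m])))

This returns only the ten busiest endpoints instead of every endpoint series.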

Summary

Key takeaways:

  1. Labels must match exactly (name and value, case-sensitive) for correlation
  2. Cardinality measures unique label combinations - low is better for correlation
  3. Sampling means not all traces are kept - this is normal and expected
  4. Time ranges must be synchronized and consistent across queries
  5. Limits require thoughtful querying - use filters and shorter time ranges

For effective correlation:

  • Use low-cardinality labels (service, environment, cluster)
  • Accept that sampling means some traces won’t be available
  • Synchronize time ranges across all signals
  • Query efficiently within data source limits
  • Test label matching across all signals before relying on correlation

Next steps