
Observability concepts

Understanding these foundational concepts helps you use signal correlation effectively and avoid common pitfalls.

Labels, attributes, and matching

To navigate between telemetry signals, you first need to understand what information Grafana Cloud uses to correlate them.

What are labels and attributes?

Labels and attributes are key-value pairs that describe your telemetry data. Metrics, logs, and profiles use labels. Traces use attributes.

Note

In the Grafana data source configuration UI, trace attributes are sometimes called tags for historical reasons. This documentation uses attributes (the OpenTelemetry term) when discussing traces conceptually, and tags only when referring to specific Grafana configuration fields.

For metrics, you can use this example PromQL query to monitor the order rate of a checkout service. With slight variations, you could use it to track throughput, calculate success rates, monitor SLIs and SLOs, and plan capacity.

promql
checkout_orders_rate:actual{service_name="checkoutservice", ml_job_name="Checkout Orders Rate"}

In this example:

  • Metric name: checkout_orders_rate:actual
  • Labels: service_name="checkoutservice", ml_job_name="Checkout Orders Rate"

Traces use resource attributes (shared across all spans from a service) and span attributes (specific to each operation). You can use TraceQL, the trace query language, to create a similar query using tracing data.

For example, this TraceQL query finds traces containing spans from the currencyservice where the CurrencyService/Convert operation converts from USD and ends in an error. Instead of the service_name label used in the PromQL example, this query uses the resource.service.name attribute. The query returns traces and matching spans instead of aggregated metrics.

traceql
{span.app.currency.conversion.from="USD" && resource.service.name="currencyservice" && name="CurrencyService/Convert" && status=error}

Match for correlation

This diagram shows how labels and attributes must match exactly across different signals:

Label matching for correlation

Correlation works by matching labels and attributes across different signals. For two pieces of telemetry to correlate:

  1. Names must match exactly (case-sensitive)

    • The label service="api" matches the label service="api"
    • The attribute resource.service.name="api" matches the label service="api"
    • service="api" doesn’t match Service="api"; the match fails because of the capital S
    • service="api" doesn’t match service_name="api"; the match fails because the label names differ
  2. Values must match exactly (case-sensitive)

    • environment="prod" matches environment="prod"
    • environment="prod" doesn’t match environment="production"
    • environment="prod" doesn’t match environment="Prod"; the match fails because of the capital P

Example correlation

Prometheus metric query using PromQL:

promql
http_requests_total{service="checkout", environment="prod"}

Loki logs query using LogQL:

logql
{service="checkout", environment="prod"} |= "error"

These correlate because both have identical service and environment labels. For traces, use resource attributes such as resource.service.name and resource.environment to match with labels from other signals.
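
For example, a TraceQL sketch that targets the same service and environment (this assumes the environment is recorded in a resource.environment attribute; your instrumentation might use a different name, such as deployment.environment):

traceql
{resource.service.name="checkout" && resource.environment="prod"}

Because these attribute values match the service and environment labels in the metric and log queries above, Grafana can link the three signals.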

What is label cardinality?

The cardinality of a data attribute is the number of distinct values that the attribute can have. For example, a boolean column in a database, which can only have a value of either true or false, has a cardinality of 2.

In Grafana Cloud, cardinality refers to the number of unique combinations of label values for a telemetry signal. It determines how many time series or log streams exist.

Cardinality examples

Low cardinality:

Bash
environment = ["dev", "staging", "prod"]  // 3 unique values

Medium cardinality:

Bash
http_endpoint = ["/api/users", "/api/orders", "/api/products", ...]  // ~50 unique values

High cardinality:

Bash
user_id = ["user_123", "user_456", "user_789", ...]  // Millions of unique values

Why cardinality matters

This diagram illustrates the impact of low versus high cardinality:

mermaid
graph TD
    subgraph Low["Low Cardinality: environment"]
        L1["{service='api', environment='prod'}"]
        L2["{service='api', environment='staging'}"]
        L3["{service='api', environment='dev'}"]
    end

    subgraph High["High Cardinality: user_id"]
        H1["{service='api', user_id='user_1'}"]
        H2["{service='api', user_id='user_2'}"]
        H3["{service='api', user_id='user_3'}"]
        H4["... millions more ..."]
    end

    Low --> LowImpact["3 time series, fast queries, good for correlation"]
    High --> HighImpact["Millions of series, slow queries, poor for correlation"]

For metrics:

  • Each unique label combination creates a time series
  • High cardinality = many time series = higher costs and slower queries
  • Example: {service="api", user_id="123"} creates one series per user

For logs:

  • Each unique label combination creates a log stream
  • High cardinality = many streams = slower queries
  • Example: {service="api", request_id="abc"} creates one stream per request

For correlation:

  • High cardinality labels make correlation less effective
  • Matching on user_id rarely finds related data
  • Matching on service finds all related data for that service

Cardinality best practices

Do:

  • Use low-cardinality labels for correlation (service, environment, cluster)
  • Use medium-cardinality labels for filtering (endpoint, method, status)
  • Include high-cardinality data in log content, trace span attributes, or structured metadata (see the sketch after these lists)
  • Keep metric labels under 30 per series (40 maximum allowed, but performance degrades)
  • Keep log labels to 15 or fewer per stream (hard limit)
  • For profiles, use consistent service_name labels to enable trace-to-profile correlation

Don’t:

  • Don’t use user IDs as labels
  • Don’t use request IDs as labels
  • Don’t use timestamps as labels
  • Don’t use UUIDs as labels
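
If you need to look up a high-cardinality value such as a user ID in traces, store it as a span attribute and query it with TraceQL rather than indexing it as a metric or log label. A minimal sketch, assuming a hypothetical span.user.id attribute:

traceql
{resource.service.name="api" && span.user.id="user_123"}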

Signal-specific guidance:

Signal   | Label limit                 | High-cardinality alternative
---------|-----------------------------|--------------------------------------------------
Metrics  | 30 recommended, 40 max      | Use exemplars to link to traces with full context
Logs     | 15 max                      | Use structured metadata or log content
Traces   | No hard limit on attributes | Attributes over 2 KB are auto-truncated
Profiles | No documented limit         | Use service_name for correlation

For metrics, use the cardinality management dashboard to monitor label distribution and identify high-cardinality issues.

Examples

Store high-cardinality values like user IDs in trace span attributes or log message content, not as metric or log labels.

This example uses a high-cardinality label, which creates a separate time series for every user:

promql
http_requests_total{service="api", user_id="12345"}

In contrast, this example keeps high-cardinality data out of labels:

promql
http_requests_total{service="api", endpoint="/api/users"}
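
When you do need to find a specific user, filter on log line content or structured metadata at query time instead of indexing the value as a label. A minimal LogQL sketch, assuming the user ID appears in the log line as user_id=<value>:

logql
{service="api"} |= "user_id=12345"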

Check cardinality

To check cardinality in Prometheus:

promql
count(http_requests_total) by (service)     # Count series per service
count(http_requests_total) by (user_id)     # Count series per user_id (likely very high)

To check cardinality in Loki:

logql
sum(count_over_time({service="api"}[1h]))                 # Total log entries for the service
sum by (user_id) (count_over_time({service="api"}[1h]))   # Entries per user_id; only works if user_id is a stream label (high-cardinality warning)

What is sampling?

Sampling is the process of determining which data to store and which to discard. Configured sampling policies make the store-or-discard decision.

Why sampling is used

For traces, sampling is commonly used to ensure that only relevant traces are stored. Common use cases include:

  • Reducing the volume of stored tracing telemetry. High volumes of unused trace data lead to unnecessary costs.
  • Dropping traces that add little informational value to the overall health of an application, such as traces generated by health check endpoints (for example, Kubernetes readiness probes).
  • Removing duplicate traces replicated across active-active high availability (HA) instances.
  • Ensuring that critical issues are always sampled, such as traces with errors or above-average latencies.
  • Sampling a baseline percentage of traces across all requests (commonly 1% or less) so you can compare nominal and anomalous traces.

Refer to Sampling in the Tempo documentation for more information.

Log sampling reduces volume by keeping only a percentage of logs. At ingestion time, configure agents such as Promtail or Grafana Alloy with a sampling stage (for example, rate: 0.1) to keep 10% of log lines and drop 90% before they are sent to Loki. At query time, you can use LogQL functions like first_over_time() or last_over_time() to downsample by selecting one value per time window instead of every line. This helps reduce ingestion costs, stay within rate limits (5 MB/sec per user in Grafana Cloud), and manage high-volume streams, though you trade granularity for representative patterns.
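
For example, a query-time downsampling sketch, assuming logfmt-formatted lines that include a duration_ms field:

logql
first_over_time({service="api"} | logfmt | unwrap duration_ms [1m])

This returns one value per stream per one-minute window rather than every matching line.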

Strategies for sampling

When sampling traces, you can use a head or tail sampling strategy.

With a head sampling strategy, the decision to sample a trace is made as early as possible, without taking the whole trace into account. It's simple and inexpensive, but it can't target traces based on outcomes such as errors or high latency.

With a tail sampling strategy, the decision to sample a trace is made after considering all or most of the spans. For example, tail sampling is a good option to sample only traces that have errors or traces with long request duration. Tail sampling is more complex to configure, implement, and maintain but is the recommended sampling strategy for large systems with a high telemetry volume.

Tail sampling policies

For production environments, use tail sampling with multiple policies to keep the most valuable traces:

  • Error-based: Keep all traces that contain errors
  • Latency-based: Keep traces that exceed a latency threshold (for example, requests over 500ms)
  • Attribute-based: Keep traces matching specific attributes (for example, specific endpoints or user segments)
  • Probabilistic: Keep a percentage of remaining traces for baseline visibility

When configuring tail sampling:

  • Set the decision wait period to accommodate your typical trace duration
  • Tune batch timeouts and sizes to avoid latency in sampling decisions
  • Refer to the traces sampling documentation for configuration details

How sampling affects correlation

Metrics-to-traces (exemplars):

  • Exemplars might point to traces that were sampled out (dropped).
  • Clicking an exemplar might show “No trace found”.
  • This is expected behavior with sampling.
  • Exemplars are generated before sampling decisions, so the trace may not exist.
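
For example, exemplars typically appear on latency panels backed by a histogram query like the following (http_request_duration_seconds is a hypothetical metric name, and exemplars must be enabled in your instrumentation and data source). Clicking a point opens the linked trace, or reports that no trace was found if it was sampled out:

promql
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{service="checkout"}[5m])))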

Logs-to-traces:

  • Log entry has a trace ID, but the trace was sampled out.
  • Clicking the trace ID shows “No trace found”.
  • Logs remain, but trace details aren’t available.
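
For example, a LogQL sketch that keeps only log lines carrying a trace ID, assuming logfmt-formatted lines with a trace_id field; some of the IDs it returns may still point to traces that were sampled out:

logql
{service="checkout", environment="prod"} | logfmt | trace_id=~".+"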

Traces-to-metrics (metrics-generator):

  • The Tempo metrics-generator derives metrics from traces.
  • Metrics are generated before sampling, so they represent all traces.
  • However, clicking from generated metrics to traces may show sampled-out traces.
  • The metrics-generator collection interval is 60 seconds by default.
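
For example, a sketch that queries span metrics produced by the metrics-generator, assuming the default metric and label names of the span-metrics processor (your configuration may emit different dimensions):

promql
sum by (span_name) (rate(traces_spanmetrics_calls_total{service="checkout", status_code="STATUS_CODE_ERROR"}[5m]))

As noted above, these series represent all traces, so the error rate stays complete even when some of the corresponding traces are no longer retrievable.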

Best practices:

  • Keep higher sampling rates during development and testing.
  • Use intelligent tail sampling (keep errors, slow requests) in production.
  • Accept that not all traces are available.
  • Use metrics and logs for complete visibility, traces for detailed investigation.
  • Configure error-based and latency-based policies to ensure problematic traces are always kept.

Time ranges and alignment

Correlation between telemetry signals is also affected by timestamps and the time ranges of your queries.

Time synchronization

For correlation to work, timestamps must be synchronized across systems.

Requirements:

  • All systems use UTC
  • Clocks are synchronized (NTP recommended)
  • Timestamps in telemetry data are accurate

Common issues:

  • Clock drift causes time misalignment
  • Logs show events at 10:00, but metrics show activity at 10:05
  • This breaks time-based correlation

Time range queries

When correlating signals, use the same time range.

In Grafana Explore:

  • Use the time picker to set the same range for all queries
  • Use split view with time sync enabled (chain link icon)

In Drilldown apps:

  • Selecting a time range in one signal focuses other signals automatically
  • Time ranges stay synchronized across signal types

Time range best practices

Do:

  • Use consistent time ranges across all queries
  • Include a buffer (for example, query 5 minutes before and after the event)
  • Use UTC for all timestamps
  • Verify clock synchronization with NTP
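
A quick way to verify the last item on Linux hosts (a minimal sketch, assuming systemd's timedatectl is available and chrony is the NTP client):

Bash
timedatectl status | grep -i synchronized   # Shows whether the system clock is NTP-synchronized
chronyc tracking                            # Shows the current offset from the NTP source (chrony only)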

Don’t:

  • Don’t assume local time zones match
  • Don’t query narrow time ranges that might miss related events
  • Don’t forget clock drift on long-running systems

Data source limits

Grafana Cloud has limits to ensure system stability and fair usage. Understanding these limits helps you design effective correlation strategies and avoid unexpected query failures.

Signal-specific limits

The following table summarizes key limits for each telemetry signal. Many limits autoscale based on your usage tier. For complete details, refer to the Grafana Cloud usage limits documentation.

Metrics (Prometheus/Mimir)

Limit                   | Default value          | Notes
------------------------|------------------------|---------------------------------------
Active series per user  | 150,000                | Autoscaled based on tier
Ingestion rate          | 10,000 samples/sec     | Autoscaled based on tier
Labels per series       | 40 max, 30 recommended | Keep under 30 for best performance
Label name length       | 1,024 characters       | Fixed limit
Label value length      | 2,048 characters       | Fixed limit; applies to metric names
Query time range        | 768 hours (32 days)    | Maximum max_partial_query_length
Out-of-order window     | 2 hours                | Samples older than this are rejected

Logs (Loki)

Limit                   | Default value         | Notes
------------------------|-----------------------|-----------------------------------------------------
Ingestion rate          | 5 MB/sec              | Per-user rate limit
Active streams per user | 5,000                 | Per-user stream limit
Labels per stream       | 15                    | Maximum labels allowed
Label name length       | 1,024 characters      | Fixed limit
Label value length      | 2,048 characters      | Fixed limit
Max line size           | 256 KB                | Cannot be modified; lines exceeding this are dropped
Retention period        | 30 days default       | Configurable from 30 days to 1 year
Query time range        | 721 hours (~30 days)  | Maximum max_query_length

Traces (Tempo)

Limit                       | Default value | Notes
----------------------------|---------------|----------------------------------------------
Max bytes per trace         | 5 MB          | Traces exceeding this are rejected
Ingestion rate              | 500 KB/sec    | Per-tenant rate limit
Attribute size              | 2 KB          | Attributes exceeding this are auto-truncated
Retention period            | 30 days       | Default block retention
TraceQL metrics query range | 24 hours      | Can be increased via Support

Profiles (Pyroscope)

Limit                 | Default value   | Notes
----------------------|-----------------|---------------------------------------------
Daily ingestion limit | Configurable    | Set per stack; data discarded when reached
Billing               | Per GB ingested | $0.50/GB beyond plan inclusion

Note

Many metrics and logs limits autoscale based on your Grafana Cloud tier. Contact Grafana Support to request increases for non-autoscaled limits.

How limits affect correlation

Label and attribute limits:

  • Metrics allow up to 40 labels per series, but performance degrades above 30
  • Logs allow only 15 labels per stream
  • When designing shared labels for correlation, stay within the most restrictive limit (15 for logs)
  • Label name and value length limits (1,024/2,048 characters) are consistent across metrics and logs

Query limits:

  • Metrics queries can span up to 32 days; logs queries up to ~30 days
  • TraceQL metrics queries are limited to 24 hours by default
  • Use shorter time ranges when correlating across signals to ensure all queries complete

Retention differences:

  • Default retention varies: metrics (13 months), logs (30 days), traces (30 days)
  • Can’t correlate signals if one has been deleted due to shorter retention
  • Plan retention periods to ensure overlapping data availability

Ingestion limits:

  • Exceeding ingestion limits causes data to be dropped
  • Dropped data creates gaps that break correlation
  • Monitor ingestion rates using the Usage Insights dashboard

Work within limits

Here are some best practices:

  • Use specific label selectors to reduce data volume
  • Query shorter time ranges for high-cardinality data
  • Use topk() or bottomk() to limit results in PromQL (see the example after this list)
  • Add filters progressively to narrow results
  • Monitor your usage with the cardinality management dashboard
  • Keep label counts well under the maximum (aim for fewer than 15 shared labels to work across all signals)
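
For example, a sketch of the topk() tip, using the http_requests_total example from earlier and assuming an endpoint label:

promql
topk(10, sum by (endpoint) (rate(http_requests_total{service="api"}[5m])))

This returns only the ten busiest endpoints instead of every endpoint series.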

Summary

Key takeaways:

  1. Labels must match exactly (name and value, case-sensitive) for correlation
  2. Cardinality measures unique label combinations - low is better for correlation
  3. Sampling means not all traces are kept - this is normal and expected
  4. Time ranges must be synchronized and consistent across queries
  5. Limits require thoughtful querying - use filters and shorter time ranges

For effective correlation:

  • Use low-cardinality labels (service, environment, cluster)
  • Accept that sampling means some traces won’t be available
  • Synchronize time ranges across all signals
  • Query efficiently within data source limits
  • Test label matching across all signals before relying on correlation

Next steps