Observability concepts
Understanding these foundational concepts helps you use signal correlation effectively and avoid common pitfalls.
Labels, attributes, and matching
To navigate between telemetry signals, you first need to understand what information Grafana Cloud uses to correlate them.
What are labels and attributes?
Labels and attributes are key-value pairs that describe your telemetry data. Metrics, logs, and profiles use labels. Traces use attributes.
Note
In the Grafana data source configuration UI, trace attributes are sometimes called tags for historical reasons. This documentation uses attributes (the OpenTelemetry term) when discussing traces conceptually, and tags only when referring to specific Grafana configuration fields.
For metrics, you can use this example PromQL query to monitor the checkout order rate for the checkout service. With slight variations, you could use similar queries to track throughput, calculate success rates, monitor SLIs and SLOs, and support capacity planning.
```promql
checkout_orders_rate:actual{service_name="checkoutservice", ml_job_name="Checkout Orders Rate"}
```

In this example:

- Metric name: `checkout_orders_rate:actual`
- Labels: `service_name="checkoutservice"` and `ml_job_name="Checkout Orders Rate"`
Traces use resource attributes (shared across all spans from a service) and span attributes (specific to each operation). You can use TraceQL, the trace query language, to create a similar query using tracing data.
For example, this TraceQL query finds traces containing spans from the currencyservice where the CurrencyService/Convert operation converted from USD and ended in an error.

```traceql
{span.app.currency.conversion.from="USD" && resource.service.name="currencyservice" && name="CurrencyService/Convert" && status=error}
```

Instead of the PromQL service_name label, this query uses the resource.service.name attribute. The query returns traces and matching spans instead of aggregated metrics.
Match for correlation
This diagram shows how labels and attributes must match exactly across different signals:

Correlation works by matching labels and attributes across different signals. For two pieces of telemetry to correlate:
Names must match exactly (case-sensitive)
- The label `service="api"` matches the label `service="api"`
- The attribute `resource.service.name="api"` matches the label `service="api"`
- The label `service="api"` doesn't match `Service="api"`; it fails due to the capital S
- The label `service="api"` doesn't match `service_name="api"`; it fails because the name is different
Values must match exactly (case-sensitive)
environment="prod"matchesenvironment="prod"environment="prod"doesn’t matchenvironment="production"environment="prod"doesn’t matchenvironment="Prod", it fails due to the capital P.
Example correlation
Prometheus metric query using PromQL:
http_requests_total{service="checkout", environment="prod"}Loki logs query using LogQL:
{service="checkout", environment="prod"} |= "error"These correlate because both have identical service and environment labels. For traces, use resource attributes such as resource.service.name and resource.environment to match with labels from other signals.
What is label cardinality?
The cardinality of a data attribute is the number of distinct values that the attribute can have. For example, a boolean column in a database, which can only have a value of either true or false, has a cardinality of 2.
In Grafana Cloud, cardinality refers to the number of unique combinations of label values for a telemetry signal. It determines how many time series or log streams exist.
Cardinality examples
Low cardinality:
environment = ["dev", "staging", "prod"] // 3 unique valuesMedium cardinality:
http_endpoint = ["/api/users", "/api/orders", "/api/products", ...] // ~50 unique valuesHigh cardinality:
user_id = ["user_123", "user_456", "user_789", ...] // Millions of unique valuesWhy cardinality matters
This diagram illustrates the impact of low versus high cardinality:
```mermaid
graph TD
    subgraph Low["Low Cardinality: environment"]
        L1["{service='api', environment='prod'}"]
        L2["{service='api', environment='staging'}"]
        L3["{service='api', environment='dev'}"]
    end
    subgraph High["High Cardinality: user_id"]
        H1["{service='api', user_id='user_1'}"]
        H2["{service='api', user_id='user_2'}"]
        H3["{service='api', user_id='user_3'}"]
        H4["... millions more ..."]
    end
    Low --> LowImpact["3 time series, fast queries, good for correlation"]
    High --> HighImpact["Millions of series, slow queries, poor for correlation"]
```

For metrics:
- Each unique label combination creates a time series
- High cardinality = many time series = higher costs and slower queries
- Example: `{service="api", user_id="123"}` creates one series per user
For logs:
- Each unique label combination creates a log stream
- High cardinality = many streams = slower queries
- Example: `{service="api", request_id="abc"}` creates one stream per request
For correlation:
- High cardinality labels make correlation less effective
- Matching on `user_id` rarely finds related data
- Matching on `service` finds all related data for that service
Cardinality best practices
Do:
- Use low-cardinality labels for correlation (service, environment, cluster)
- Use medium-cardinality labels for filtering (endpoint, method, status)
- Include high-cardinality data in log content, trace span attributes, or structured metadata
- Keep metrics labels under 30 per series (40 maximum allowed, but performance degrades)
- Keep logs labels under 15 per stream (hard limit)
- For profiles, use consistent `service_name` labels to enable trace-to-profile correlation
Don’t:
- Don’t use user IDs as labels
- Don’t use request IDs as labels
- Don’t use timestamps as labels
- Don’t use UUIDs as labels
Signal-specific guidance:
For metrics, use the cardinality management dashboard to monitor label distribution and identify high-cardinality issues.
Examples
Store high-cardinality values like user IDs in trace span attributes or log message content, not as metric or log labels.
This example uses a high-cardinality label, which creates a separate time series for every user:

```promql
http_requests_total{service="api", user_id="12345"}
```

This is a better example that keeps high-cardinality data out of labels:

```promql
http_requests_total{service="api", endpoint="/api/users"}
```

Check cardinality
To check cardinality in Prometheus:
```promql
count(http_requests_total) by (service)   # Count series per service
count(http_requests_total) by (user_id)   # Count series per user_id (likely very high)
```

To check cardinality in Loki:

```logql
count_over_time({service="api"}[1h])                     # Total log entries
sum(count_over_time({service="api"}[1h])) by (user_id)   # Count by user_id (high cardinality warning)
```

What is sampling?
Sampling is the process of determining which data to store and which to discard. Sampling policies that you configure govern this decision.
Why sampling is used
For traces, sampling is commonly used to ensure that only relevant traces are stored for observation. Common use cases include:
- Reduction of stored tracing telemetry volume. Higher volumes of unused trace data can lead to unnecessary costs.
- Dropping traces that don't add informational value about the overall health of an application, such as traces generated by health check endpoints like Kubernetes readiness probes.
- Dropping replicated traces generated by active-active high-availability (HA) instances.
- Ensuring that only critical issues are sampled (such as traces with errors, or those with above average latencies).
- Sampling a baseline number of traces across all requests (common patterns are 1% or fewer), to ensure that comparisons can be made between nominal and anomalous traces.
Refer to Sampling in the Tempo documentation for more information.
Log sampling reduces volume by keeping only a percentage of logs.
At ingestion, configure agents such as Promtail or Alloy with a sampling stage (for example, rate: 0.1) to keep 10% of log lines and drop the other 90% before sending to Loki.
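A minimal sketch of the ingestion-time approach as a Promtail pipeline stage fragment (Alloy offers an equivalent stage.sampling block inside loki.process):

```yaml
pipeline_stages:
  - sampling:
      # Keep roughly 10% of log lines; the rest are dropped before they reach Loki.
      rate: 0.1
```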
At query time, you can use LogQL functions like first_over_time() or last_over_time() to downsample by selecting one log line per time window instead of all lines.
This helps reduce ingestion costs, stay within rate limits (5 MB/sec per user in Cloud), and manage high-volume streams, though you trade granularity for representative patterns.
Strategies for sampling
When sampling traces, you can use a head or tail sampling strategy.
With a head sampling strategy, the decision to sample the trace is made as early as possible and doesn't take the whole trace into account. It's a simple and efficient sampling strategy, but it can't base decisions on outcomes such as errors or high latency.
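As an illustration of head sampling, here's a minimal sketch using the OpenTelemetry Collector's probabilistic_sampler processor; the percentage is an example, not a recommendation.

```yaml
processors:
  probabilistic_sampler:
    # Decide per trace ID as spans arrive, without waiting to see the full trace.
    # Keeps roughly 1% of traces.
    sampling_percentage: 1
```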
With a tail sampling strategy, the decision to sample a trace is made after considering all or most of the spans. For example, tail sampling is a good option to sample only traces that have errors or traces with long request duration. Tail sampling is more complex to configure, implement, and maintain but is the recommended sampling strategy for large systems with a high telemetry volume.
Tail sampling policies
For production environments, use tail sampling with multiple policies to keep the most valuable traces:
- Error-based: Keep all traces that contain errors
- Latency-based: Keep traces that exceed a latency threshold (for example, requests over 500ms)
- Attribute-based: Keep traces matching specific attributes (for example, specific endpoints or user segments)
- Probabilistic: Keep a percentage of remaining traces for baseline visibility
When configuring tail sampling:
- Set the decision wait period to accommodate your typical trace duration
- Tune batch timeouts and sizes to avoid latency in sampling decisions
- Refer to the traces sampling documentation for configuration details
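The policies above might be combined in an OpenTelemetry Collector tail_sampling processor configuration like the following sketch; the thresholds and decision wait period are illustrative assumptions, not recommendations.

```yaml
processors:
  tail_sampling:
    # Wait long enough for a typical trace to complete before deciding.
    decision_wait: 30s
    policies:
      # Keep all traces that contain errors.
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Keep traces slower than 500 ms.
      - name: keep-slow-requests
        type: latency
        latency:
          threshold_ms: 500
      # Keep a small probabilistic baseline of all traces.
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 1
```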
How sampling affects correlation
Metrics-to-traces (exemplars):
- Exemplars might point to traces that were sampled out (dropped).
- Clicking an exemplar might show “No trace found”.
- This is expected behavior with sampling.
- Exemplars are generated before sampling decisions, so the trace may not exist.
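For context, an exemplar attaches a trace ID to a metric sample. A sketch in OpenMetrics exposition format, with an illustrative metric name, labels, and trace ID:

```
# A histogram bucket sample with a trace_id exemplar attached
http_request_duration_seconds_bucket{service="checkout",le="0.5"} 1027 # {trace_id="4bf92f3577b34da6a3ce929d0e0e4736"} 0.23
```

If that trace was later sampled out, the exemplar still appears on the graph, but its link finds no trace.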
Logs-to-traces:
- Log entry has a trace ID, but the trace was sampled out.
- Clicking the trace ID shows “No trace found”.
- Logs remain, but trace details aren’t available.
Traces-to-metrics (metrics-generator):
- The Tempo metrics-generator derives metrics from traces.
- Metrics are generated before sampling, so they represent all traces.
- However, clicking from generated metrics to traces may show sampled-out traces.
- The metrics-generator collection interval is 60 seconds by default.
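For example, a query over span-derived metrics might look like the following sketch, assuming the default metric names emitted by the metrics-generator span-metrics processor:

```promql
# Request rate per service, derived from span data rather than application metrics
sum by (service) (rate(traces_spanmetrics_calls_total[5m]))
```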
Best practices:
- Keep higher sampling rates during development and testing.
- Use intelligent tail sampling (keep errors, slow requests) in production.
- Accept that not all traces are available.
- Use metrics and logs for complete visibility, traces for detailed investigation.
- Configure error-based and latency-based policies to ensure problematic traces are always kept.
Time ranges and alignment
Correlation between telemetry signals is also affected by timestamps and the time ranges of your queries.
Time synchronization
For correlation to work, timestamps must be synchronized across systems.
Requirements:
- All systems use UTC
- Clocks are synchronized (NTP recommended)
- Timestamps in telemetry data are accurate
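As a quick check, on systemd-based Linux hosts you can verify NTP synchronization as follows; other platforms have their own equivalents.

```shell
# Filter timedatectl output for the clock synchronization status and NTP service state
timedatectl status | grep -iE "synchronized|ntp"
```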
Common issues:
- Clock drift causes time misalignment
- Logs show events at 10:00, but metrics show activity at 10:05
- This breaks time-based correlation
Time range queries
When correlating signals, use the same time range.
In Grafana Explore:
- Use the time picker to set the same range for all queries
- Use split view with time sync enabled (chain link icon)
In Drilldown apps:
- Selecting a time range in one signal focuses other signals automatically
- Time ranges stay synchronized across signal types
Time range best practices
Do:
- Use consistent time ranges across all queries
- Include a buffer (for example, query 5 minutes before and after the event)
- Use UTC for all timestamps
- Verify clock synchronization with NTP
Don’t:
- Don’t assume local time zones match
- Don’t query narrow time ranges that might miss related events
- Don’t forget clock drift on long-running systems
Data source limits
Grafana Cloud has limits to ensure system stability and fair usage. Understanding these limits helps you design effective correlation strategies and avoid unexpected query failures.
Signal-specific limits
Each telemetry signal has its own limits: Metrics (Prometheus/Mimir), Logs (Loki), Traces (Tempo), and Profiles (Pyroscope). Many limits autoscale based on your usage tier. For complete details on the limits for each signal, refer to the Grafana Cloud usage limits documentation.
Note
Many metrics and logs limits autoscale based on your Grafana Cloud tier. Contact Grafana Support to request increases for non-autoscaled limits.
How limits affect correlation
Label and attribute limits:
- Metrics allow up to 40 labels per series, but performance degrades above 30
- Logs allow only 15 labels per stream
- When designing shared labels for correlation, stay within the most restrictive limit (15 for logs)
- Label name and value length limits (1,024/2,048 characters) are consistent across metrics and logs
Query limits:
- Metrics queries can span up to 32 days; logs queries up to ~30 days
- TraceQL metrics queries are limited to 24 hours by default
- Use shorter time ranges when correlating across signals to ensure all queries complete
Retention differences:
- Default retention varies: metrics (13 months), logs (30 days), traces (30 days)
- You can't correlate signals if one signal's data has already been deleted due to shorter retention
- Plan retention periods to ensure overlapping data availability
Ingestion limits:
- Exceeding ingestion limits causes data to be dropped
- Dropped data creates gaps that break correlation
- Monitor ingestion rates using the Usage Insights dashboard
Work within limits
Here are some best practices:
- Use specific label selectors to reduce data volume
- Query shorter time ranges for high-cardinality data
- Use `topk()` or `bottomk()` to limit results in PromQL (see the example after this list)
- Add filters progressively to narrow results
- Monitor your usage with the cardinality management dashboard
- Keep label counts well under the maximum (aim for fewer than 15 shared labels to work across all signals)
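For instance, here's a sketch that limits a potentially large result set, reusing the hypothetical `http_requests_total` metric and `endpoint` label from earlier examples:

```promql
# Return only the 10 endpoints with the highest request rate for the service
topk(10, sum by (endpoint) (rate(http_requests_total{service="api"}[5m])))
```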
Summary
Key takeaways:
- Labels must match exactly (name and value, case-sensitive) for correlation
- Cardinality measures unique label combinations - low is better for correlation
- Sampling means not all traces are kept - this is normal and expected
- Time ranges must be synchronized and consistent across queries
- Limits require thoughtful querying - use filters and shorter time ranges
For effective correlation:
- Use low-cardinality labels (service, environment, cluster)
- Accept that sampling means some traces won’t be available
- Synchronize time ranges across all signals
- Query efficiently within data source limits
- Test label matching across all signals before relying on correlation
Next steps
- Configure signal correlation - Set up correlation between signals
- Why correlation matters - Understand correlation benefits
- Troubleshoot signal correlation - Solve common correlation issues
- Grafana Cloud usage limits - Complete limits reference



