---
title: "Observability concepts | Grafana Cloud documentation"
description: "Foundational observability concepts for effective signal correlation."
---


# Observability concepts

Understanding these foundational concepts helps you use signal correlation effectively and avoid common pitfalls.

## Labels, attributes, and matching

To navigate between telemetry signals, you first need to understand what information Grafana Cloud uses to correlate them.

### What are labels and attributes?

Labels and attributes are key-value pairs that describe your telemetry data. Metrics, logs, and profiles use **labels**. Traces use **attributes**.

> Note
> 
> In the Grafana data source configuration UI, trace attributes are sometimes called *tags* for historical reasons. This documentation uses *attributes* (the OpenTelemetry term) when discussing traces conceptually, and tags only when referring to specific Grafana configuration fields.

For metrics, you can use this example PromQL query to monitor the checkout order rate for a service. With slight variations, you could use similar queries to track throughput, calculate success rates, monitor SLIs/SLOs, and plan capacity.


```promql
checkout_orders_rate:actual{service_name="checkoutservice", ml_job_name="Checkout Orders Rate"}
```

In this example:

- Metric name: `checkout_orders_rate:actual`
- Labels: `service_name="checkoutservice"`, `ml_job_name="Checkout Orders Rate"`

Traces use resource attributes (shared across all spans from a service) and span attributes (specific to each operation). You can use TraceQL, the trace query language, to create a similar query using tracing data.

For example, this TraceQL query finds traces containing error spans from the `currencyservice` for the `CurrencyService/Convert` operation converting from USD. Instead of the `service_name` label used in the PromQL example, this query uses the `resource.service.name` attribute. The query returns traces and matching spans instead of aggregated metrics.


```traceql
{span.app.currency.conversion.from="USD" && resource.service.name="currencyservice" && name="CurrencyService/Convert" && status=error}
```

### Match for correlation

Correlation works by matching labels and attributes exactly across different signals. For two pieces of telemetry to correlate:

1. Names must match exactly (case-sensitive)
   
   - The label `service="api"` matches the label `service="api"`
   - The attribute `resource.service.name="api"` can match the label `service="api"` when the data source configuration maps the attribute to that label
   - `service="api"` doesn’t match `Service="api"` because of the capital S
   - `service="api"` doesn’t match `service_name="api"` because the names differ
2. Values must match exactly (case-sensitive)
   
   - `environment="prod"` matches `environment="prod"`
   - `environment="prod"` doesn’t match `environment="production"`
   - `environment="prod"` doesn’t match `environment="Prod"` because of the capital P

### Example correlation

Prometheus metric query using PromQL:


```promql
http_requests_total{service="checkout", environment="prod"}
```

Loki logs query using LogQL:


```logql
{service="checkout", environment="prod"} |= "error"
```

These correlate because both have identical `service` and `environment` labels. For traces, use resource attributes such as `resource.service.name` and `resource.environment` to match with labels from other signals.
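To illustrate, a matching TraceQL query might look like the following. This is a sketch: it assumes your traces carry `resource.service.name` and `resource.environment` attributes with values identical to the labels above.

```traceql
{resource.service.name="checkout" && resource.environment="prod" && status=error}
```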

## What is label cardinality?

The cardinality of a data attribute is the number of distinct values that the attribute can have. For example, a boolean column in a database, which can only have a value of either `true` or `false`, has a cardinality of `2`.

In Grafana Cloud, cardinality refers to the number of unique combinations of label values for a telemetry signal. It determines how many time series or log streams exist.

### Cardinality examples

Low cardinality:


```bash
environment = ["dev", "staging", "prod"]  # 3 unique values
```

Medium cardinality:


```bash
http_endpoint = ["/api/users", "/api/orders", "/api/products", ...]  # ~50 unique values
```

High cardinality:


```bash
user_id = ["user_123", "user_456", "user_789", ...]  # Millions of unique values
```

### Why cardinality matters

This diagram illustrates the impact of low versus high cardinality:


```mermaid
graph TD
    subgraph Low["Low Cardinality: environment"]
        L1["{service='api', environment='prod'}"]
        L2["{service='api', environment='staging'}"]
        L3["{service='api', environment='dev'}"]
    end

    subgraph High["High Cardinality: user_id"]
        H1["{service='api', user_id='user_1'}"]
        H2["{service='api', user_id='user_2'}"]
        H3["{service='api', user_id='user_3'}"]
        H4["... millions more ..."]
    end

    Low --> LowImpact["3 time series, fast queries, good for correlation"]
    High --> HighImpact["Millions of series, slow queries, poor for correlation"]
```

For metrics:

- Each unique label combination creates a time series
- High cardinality = many time series = higher costs and slower queries
- Example: `{service="api", user_id="123"}` creates one series per user

For logs:

- Each unique label combination creates a log stream
- High cardinality = many streams = slower queries
- Example: `{service="api", request_id="abc"}` creates one stream per request

For correlation:

- High cardinality labels make correlation less effective
- Matching on `user_id` rarely finds related data
- Matching on `service` finds all related data for that service
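The multiplicative effect of adding labels can be made concrete with a quick calculation. The distinct-value counts below are hypothetical; the total series count is the product of each label's cardinality.

```python
# Hypothetical distinct-value counts for each label.
services = 10
environments = 3
endpoints = 50

# Each unique label combination creates one time series.
low_cardinality_series = services * environments      # 30 series
with_endpoints = low_cardinality_series * endpoints   # 1,500 series

# Adding a user_id label with 100,000 users multiplies the count again.
users = 100_000
with_user_id = with_endpoints * users                 # 150,000,000 series

print(low_cardinality_series, with_endpoints, with_user_id)
```

One high-cardinality label can dominate the product, which is why dropping it (or moving it to log content or span attributes) has an outsized effect.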

### Cardinality best practices

Do:

- Use low-cardinality labels for correlation (service, environment, cluster)
- Use medium-cardinality labels for filtering (endpoint, method, status)
- Include high-cardinality data in log content, trace span attributes, or structured metadata
- Keep metrics labels under 30 per series (40 maximum allowed, but performance degrades)
- Keep logs labels under 15 per stream (hard limit)
- For profiles, use consistent `service_name` labels to enable trace-to-profile correlation

Don’t:

- Don’t use user IDs as labels
- Don’t use request IDs as labels
- Don’t use timestamps as labels
- Don’t use UUIDs as labels

Signal-specific guidance:


| Signal   | Label limit                 | High-cardinality alternative                      |
|----------|-----------------------------|---------------------------------------------------|
| Metrics  | 30 recommended, 40 max      | Use exemplars to link to traces with full context |
| Logs     | 15 max                      | Use structured metadata or log content            |
| Traces   | No hard limit on attributes | Attributes over 2KB are auto-truncated            |
| Profiles | No documented limit         | Use `service_name` for correlation                |

For metrics, use the [cardinality management dashboard](/docs/grafana-cloud/cost-management-and-billing/analyze-costs/metrics-costs/prometheus-metrics-costs/cardinality-management/) to monitor label distribution and identify high-cardinality issues.
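For logs, structured metadata lets you attach high-cardinality values to log lines without creating new streams. As a sketch, assuming `user_id` is stored as structured metadata rather than as a stream label, you can still filter on it at query time with label-filter syntax:

```logql
{service="api", environment="prod"} | user_id="12345"
```

The stream selector in curly braces stays low-cardinality, while the `| user_id=` filter narrows results after the streams are read.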

#### Examples

Store high-cardinality values like user IDs in trace span attributes or log message content, not as metric or log labels.

This example uses a high-cardinality label, which creates a separate time series for every user.


```promql
http_requests_total{service="api", user_id="12345"}
```

This example keeps high-cardinality data out of labels.


```promql
http_requests_total{service="api", endpoint="/api/users"}
```

### Check cardinality

To check cardinality in Prometheus:


```promql
count(http_requests_total) by (service)     # Count series per service
count(http_requests_total) by (user_id)     # Count series per user_id (likely very high)
```

To check cardinality in Loki:


```logql
count_over_time({service="api"}[1h])        # Total log entries
sum(count_over_time({service="api"}[1h])) by (user_id)  # Count by user_id (high cardinality warning)
```

## What is sampling?

Sampling is the process of deciding which telemetry data to store and which to discard. Sampling policies determine the outcome of that decision.

### Why sampling is used

For traces, sampling is commonly used to ensure that only relevant traces are stored. Common use cases include:

- Reducing stored trace volume. High volumes of unused trace data lead to unnecessary costs.
- Dropping traces that add little informational value to the overall health of an application, such as traces generated by health-check endpoints like Kubernetes readiness probes.
- Dropping duplicate traces replicated across active-active HA instances.
- Ensuring that critical issues are always sampled, such as traces with errors or above-average latencies.
- Sampling a baseline of traces across all requests (commonly 1% or fewer) so you can compare nominal and anomalous traces. Refer to [Sampling](/docs/tempo/latest/set-up-for-tracing/instrument-send/set-up-collector/tail-sampling/) in the Tempo documentation for more information.

Log sampling reduces volume by keeping only a percentage of logs. At ingestion, configure agents like Alloy with `rate: 0.1` to keep 10% and drop 90% before sending to Loki. At query time, you can use LogQL functions like `first_over_time()` or `last_over_time()` to downsample by selecting one log line per time window instead of all lines. This helps reduce ingestion costs, stay within rate limits (5 MB/sec per user in Cloud), and manage high-volume streams, though you trade granularity for representative patterns.
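In Alloy, ingestion-time log sampling can be configured with a sampling stage in a `loki.process` component. This is a minimal sketch; the component labels and the downstream `loki.write` target are placeholders for your own pipeline.

```alloy
loki.process "sampled" {
  // Keep roughly 10% of log lines and drop the rest before sending to Loki.
  stage.sampling {
    rate = 0.1
  }

  forward_to = [loki.write.default.receiver]
}
```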

### Strategies for sampling

When sampling traces, you can use a head or tail sampling strategy.

With a head sampling strategy, the decision to sample a trace is made as early as possible, without considering the whole trace. Head sampling is simple and resource-efficient, but it can’t base decisions on information that appears later in the trace, such as errors or total duration.

With a tail sampling strategy, the decision to sample a trace is made after considering all or most of the spans. For example, tail sampling is a good option to sample only traces that have errors or traces with long request duration. Tail sampling is more complex to configure, implement, and maintain but is the recommended sampling strategy for large systems with a high telemetry volume.

### Tail sampling policies

For production environments, use tail sampling with multiple policies to keep the most valuable traces:

- Error-based: Keep all traces that contain errors
- Latency-based: Keep traces that exceed a latency threshold (for example, requests over 500ms)
- Attribute-based: Keep traces matching specific attributes (for example, specific endpoints or user segments)
- Probabilistic: Keep a percentage of remaining traces for baseline visibility

When configuring tail sampling:

- Set the decision wait period to accommodate your typical trace duration
- Tune batch timeouts and sizes to avoid latency in sampling decisions
- Refer to the [traces sampling documentation](/docs/tempo/latest/set-up-for-tracing/instrument-send/set-up-collector/tail-sampling/policies-strategies/) for configuration details
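With the OpenTelemetry Collector’s `tail_sampling` processor, the policies above can be sketched as follows. The policy names and thresholds are illustrative, not prescriptive:

```yaml
processors:
  tail_sampling:
    # Wait long enough for a typical trace to complete before deciding.
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow-requests
        type: latency
        latency: {threshold_ms: 500}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 1}
```

Policies are evaluated together, so a trace is kept if any policy matches: all errors, all slow requests, and a 1% baseline of everything else.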

### How sampling affects correlation

Metrics-to-traces (exemplars):

- Exemplars might point to traces that were sampled out (dropped).
- Clicking an exemplar might show “No trace found”.
- This is expected behavior with sampling.
- Exemplars are generated before sampling decisions, so the trace may not exist.

Logs-to-traces:

- Log entry has a trace ID, but the trace was sampled out.
- Clicking the trace ID shows “No trace found”.
- Logs remain, but trace details aren’t available.

Traces-to-metrics (metrics-generator):

- The Tempo metrics-generator derives metrics from traces.
- Metrics are generated before sampling, so they represent all traces.
- However, clicking from generated metrics to traces may show sampled-out traces.
- The metrics-generator collection interval is 60 seconds by default.

Best practices:

- Keep higher sampling rates during development and testing.
- Use intelligent tail sampling (keep errors, slow requests) in production.
- Accept that not all traces are available.
- Use metrics and logs for complete visibility, traces for detailed investigation.
- Configure error-based and latency-based policies to ensure problematic traces are always kept.

## Time ranges and alignment

Correlation between telemetry signals is also affected by timestamps and the time ranges of your queries.

### Time synchronization

For correlation to work, timestamps must be synchronized across systems.

Requirements:

- All systems use UTC
- Clocks are synchronized (NTP recommended)
- Timestamps in telemetry data are accurate

Common issues:

- Clock drift causes time misalignment
- Logs show events at 10:00, but metrics show activity at 10:05
- This breaks time-based correlation

### Time range queries

When correlating signals, use the same time range.

In Grafana Explore:

- Use the time picker to set the same range for all queries
- Use split view with time sync enabled (chain link icon)

In Drilldown apps:

- Selecting a time range in one signal focuses other signals automatically
- Time ranges stay synchronized across signal types

### Time range best practices

Do:

- Use consistent time ranges across all queries
- Include a buffer (for example, query 5 minutes before and after event)
- Use UTC for all timestamps
- Verify clock synchronization with NTP

Don’t:

- Don’t assume local time zones match
- Don’t query narrow time ranges that might miss related events
- Don’t forget clock drift on long-running systems
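The buffer guidance above can be sketched in a few lines. This hypothetical helper computes a UTC query window padded around an event timestamp:

```python
from datetime import datetime, timedelta, timezone

def query_window(event_time: datetime, buffer_minutes: int = 5) -> tuple[datetime, datetime]:
    """Return a (start, end) query range padded around an event, in UTC."""
    event_utc = event_time.astimezone(timezone.utc)
    buffer = timedelta(minutes=buffer_minutes)
    return event_utc - buffer, event_utc + buffer

# Example: an error logged at 10:00 UTC yields a 09:55-10:05 query range.
event = datetime(2024, 5, 1, 10, 0, tzinfo=timezone.utc)
start, end = query_window(event)
print(start.isoformat(), end.isoformat())
```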

## Data source limits

Grafana Cloud has limits to ensure system stability and fair usage. Understanding these limits helps you design effective correlation strategies and avoid unexpected query failures.

### Signal-specific limits

The following table summarizes key limits for each telemetry signal. Many limits autoscale based on your usage tier. For complete details, refer to the [Grafana Cloud usage limits documentation](/docs/grafana-cloud/cost-management-and-billing/manage-invoices/understand-your-invoice/usage-limits/).

#### Metrics (Prometheus/Mimir)


| Limit                  | Default value          | Notes                                |
|------------------------|------------------------|--------------------------------------|
| Active series per user | 150,000                | Autoscaled based on tier             |
| Ingestion rate         | 10,000 samples/sec     | Autoscaled based on tier             |
| Labels per series      | 40 max, 30 recommended | Keep under 30 for best performance   |
| Label name length      | 1,024 characters       | Fixed limit                          |
| Label value length     | 2,048 characters       | Fixed limit; applies to metric names |
| Query time range       | 768 hours (32 days)    | Maximum `max_partial_query_length`   |
| Out-of-order window    | 2 hours                | Samples older than this are rejected |

#### Logs (Loki)

Expand table

| Limit                   | Default value        | Notes                                                |
|-------------------------|----------------------|------------------------------------------------------|
| Ingestion rate          | 5 MB/sec             | Per-user rate limit                                  |
| Active streams per user | 5,000                | Per-user stream limit                                |
| Labels per stream       | 15                   | Maximum labels allowed                               |
| Label name length       | 1,024 characters     | Fixed limit                                          |
| Label value length      | 2,048 characters     | Fixed limit                                          |
| Max line size           | 256 KB               | Cannot be modified; lines exceeding this are dropped |
| Retention period        | 30 days default      | Configurable from 30 days to 1 year                  |
| Query time range        | 721 hours (~30 days) | Maximum `max_query_length`                           |

#### Traces (Tempo)


| Limit                       | Default value | Notes                                        |
|-----------------------------|---------------|----------------------------------------------|
| Max bytes per trace         | 5 MB          | Traces exceeding this are rejected           |
| Ingestion rate              | 500 KB/sec    | Per-tenant rate limit                        |
| Attribute size              | 2 KB          | Attributes exceeding this are auto-truncated |
| Retention period            | 30 days       | Default block retention                      |
| TraceQL metrics query range | 24 hours      | Can be increased via Support                 |

#### Profiles (Pyroscope)


| Limit                 | Default value   | Notes                                      |
|-----------------------|-----------------|--------------------------------------------|
| Daily ingestion limit | Configurable    | Set per stack; data discarded when reached |
| Billing               | Per GB ingested | $0.50/GB beyond plan inclusion             |

> Note
> 
> Many metrics and logs limits autoscale based on your Grafana Cloud tier. Contact Grafana Support to request increases for non-autoscaled limits.

### How limits affect correlation

Label and attribute limits:

- Metrics allow up to 40 labels per series, but performance degrades above 30
- Logs allow only 15 labels per stream
- When designing shared labels for correlation, stay within the most restrictive limit (15 for logs)
- Label name and value length limits (1,024/2,048 characters) are consistent across metrics and logs

Query limits:

- Metrics queries can span up to 32 days; logs queries up to ~30 days
- TraceQL metrics queries are limited to 24 hours by default
- Use shorter time ranges when correlating across signals to ensure all queries complete

Retention differences:

- Default retention varies: metrics (13 months), logs (30 days), traces (30 days)
- Can’t correlate signals if one has been deleted due to shorter retention
- Plan retention periods to ensure overlapping data availability

Ingestion limits:

- Exceeding ingestion limits causes data to be dropped
- Dropped data creates gaps that break correlation
- Monitor ingestion rates using the Usage Insights dashboard

### Work within limits

Here are some best practices:

- Use specific label selectors to reduce data volume
- Query shorter time ranges for high-cardinality data
- Use `topk()` or `bottomk()` to limit results in PromQL
- Add filters progressively to narrow results
- Monitor your usage with the [cardinality management dashboard](/docs/grafana-cloud/cost-management-and-billing/analyze-costs/metrics-costs/prometheus-metrics-costs/cardinality-management/)
- Keep label counts well under the maximum (aim for fewer than 15 shared labels to work across all signals)
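For example, a `topk()` query that limits results to the busiest endpoints might look like this (the metric and label names follow the examples earlier on this page):

```promql
topk(10, sum by (endpoint) (rate(http_requests_total{service="api"}[5m])))
```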

## Summary

Key takeaways:

1. Labels must match exactly (name and value, case-sensitive) for correlation
2. Cardinality measures unique label combinations; lower cardinality is better for correlation
3. Sampling means not all traces are kept; this is normal and expected
4. Time ranges must be synchronized and consistent across queries
5. Limits require thoughtful querying; use filters and shorter time ranges

For effective correlation:

- Use low-cardinality labels (service, environment, cluster)
- Accept that sampling means some traces won’t be available
- Synchronize time ranges across all signals
- Query efficiently within data source limits
- Test label matching across all signals before relying on correlation

## Next steps

- [Configure signal correlation](/docs/grafana-cloud/telemetry-signals/use-signals-together/setup-correlations/) - Set up correlation between signals
- [Why correlation matters](/docs/grafana-cloud/telemetry-signals/use-signals-together/why-correlation-matters/) - Understand correlation benefits
- [Troubleshoot signal correlation](/docs/grafana-cloud/telemetry-signals/use-signals-together/troubleshooting/) - Solve common correlation issues
- [Grafana Cloud usage limits](/docs/grafana-cloud/cost-management-and-billing/manage-invoices/understand-your-invoice/usage-limits/) - Complete limits reference
