Troubleshooting the OpenTelemetry Collector

The OpenTelemetry Collector is made up of components such as Receivers, Exporters, Processors, Connectors, and Extensions. Each component is usually part of one or more pipelines. This article helps you sort out common problems with the Collector and explains what to do if you suspect you’ve found a bug.

Receiver issues

When your telemetry client has generated data but your backend hasn’t received it, use the metric otelcol_receiver_accepted_spans to confirm that the data points were received by the Collector. If you expect data points to have been counted as part of this metric but they haven’t, check the metric otelcol_receiver_refused_spans to ensure they weren’t refused by the Collector.
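
If you have shell access to the Collector host, you can inspect these counters directly. Here’s a quick check, assuming the Collector’s default self-metrics endpoint on localhost:8888:

curl -s http://localhost:8888/metrics | grep otelcol_receiver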

See Metrics for more information on the Collector’s own metrics.

When neither metric shows that data points have been seen, it’s an indication that the Collector hasn’t received the data at all. In that case, check the connectivity between your telemetry client and the Collector. When possible, simplify the networking between the source of data (typically your workload) and the Collector. For instance, try running everything directly on your machine instead of as containers. Refer to the Getting Started Guide for more information on how to run the Collector locally.
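
To quickly rule out basic connectivity problems, check whether the Collector’s OTLP port is reachable from the host running your workload. A minimal sketch, assuming the default OTLP gRPC port of 4317 and a hypothetical collector.example.com hostname:

nc -vz collector.example.com 4317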

A quick way to verify whether data is being received by the Collector is to use a configuration similar to the following:

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  logging:

service:
  pipelines:
    traces:
      receivers: [ otlp ]
      processors:
      exporters: [ logging ]

With that, you should see output similar to the following when data is received via the traces pipeline:

2023-05-05T15:04:58.982-0300	info	TracesExporter	{"kind": "exporter", "data_type": "traces", "name": "logging", "#spans": 512}

If you don’t see that, it’s a good indication that your receiver didn’t receive the data you were expecting.

Exporter issues

Once you’ve confirmed that your Collector received the data, the next step is to check the corresponding metrics on the exporter side: otelcol_exporter_sent_spans and otelcol_exporter_send_failed_spans. If you see only sent spans, the exporter reported that it was able to send all data points to the destination. In that case, check the logs at your backend for clues.

See Metrics for more information on the Collector’s own metrics.

A good way to find out whether an exporter is working as intended is to add a logging exporter to the same pipeline you are currently using. With that in place, you should see your telemetry data printed to the console, confirming that the data has reached the exporters:

2023-05-07T10:53:48.893-0300    info    TracesExporter  {"kind": "exporter", "data_type": "traces", "name": "logging", "#spans": 2}

If you still see no data at your backends, try disabling any resiliency mechanisms the exporter may have, such as sending queues or retry mechanisms. Here’s one example of a configuration file with all the tips above:

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  logging:

  otlp:
    endpoint: failing.example.com:4317
    sending_queue:
      enabled: false
    retry_on_failure:
      enabled: false

service:
  pipelines:
    traces:
      receivers: [ otlp ]
      processors:
      exporters: [ logging, otlp ]

At this point, the metrics might look like this, indicating that the logging exporter was able to export the data but the otlp exporter wasn’t:

otelcol_exporter_sent_spans{exporter="logging",service_instance_id="d9853063-c63c-48db-8b13-69e379b38314",service_name="otelcol",service_version="0.75.0"} 4
otelcol_exporter_sent_spans{exporter="otlp",service_instance_id="d9853063-c63c-48db-8b13-69e379b38314",service_name="otelcol",service_version="0.75.0"} 0

From here, the best place to look for clues is the Collector’s console output. If it doesn’t show any further logs that help explain the issue, double-check the exporter’s documentation for instructions specific to that exporter.

Authentication issues

When you need to authenticate against a remote OTLP server, such as Grafana Cloud OTLP, use an authentication extension and tell the exporter to use it. Check Connect OpenTelemetry Collector to Grafana Cloud databases for more information on how to obtain your username and password. This is preferable to passing the credentials directly as username and password in the URL. Here’s an example of the recommended approach:

extensions:
  basicauth/traces:
    client_auth:
      username: "1234" # your Username / Instance ID
      password: "ey..." # your API key with the "MetricsPublisher" role

exporters:
  otlphttp:
    endpoint: "https://otlp-gateway-prod-us-central-0.grafana.net/otlp"
    auth:
      authenticator: basicauth/traces
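
Remember that extensions also have to be enabled under the service node, otherwise they won’t be loaded. Here’s a minimal sketch of that part of the configuration, assuming a traces pipeline with an otlp receiver defined elsewhere:

service:
  extensions: [ basicauth/traces ]
  pipelines:
    traces:
      receivers: [ otlp ]
      exporters: [ otlphttp ]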

Issues sending data to Grafana Tempo / Grafana Cloud Traces

When sending data to Grafana Tempo or Grafana Cloud Traces, make sure you are using the OTLP gRPC Exporter and that the endpoint contains only the hostname and port, like in the following example:

exporters:
  otlp:
    endpoint: tempo-us-central1.grafana.net:443

You can find the correct hostname and port for your Grafana Cloud Traces by looking at the Tempo section of your Grafana Cloud account page. Under “Details”, you should see a URL like this: https://tempo-us-central1.grafana.net/tempo, which translates to tempo-us-central1.grafana.net:443 in the Collector configuration.

Issues sending data to Grafana Mimir / Grafana Cloud Metrics

When sending data to Grafana Mimir or Grafana Cloud Metrics, make sure you are using the Prometheus Remote Write Exporter and that the endpoint contains the full URL, including the path for the push endpoint, like in the following example:

exporters:
  prometheusremotewrite:
    endpoint: https://prometheus-blocks-prod-us-central1.grafana.net/api/prom/push

You can find the correct hostname and port for your Grafana Cloud Metrics by looking at the Prometheus section of your Grafana Cloud account page. Under “Details”, you should see the Prometheus Remote Write endpoint like this: https://prometheus-blocks-prod-us-central1.grafana.net/api/prom/push, which can be used as is in the Collector configuration.

Issues sending data to Grafana Loki / Grafana Cloud Logs

When sending data to Grafana Loki or Grafana Cloud Logs, make sure you are using the Loki Exporter and that the endpoint contains the full URL, including the path for the push endpoint, like in the following example:

exporters:
  loki:
    endpoint: https://logs-prod-us-central1.grafana.net/loki/api/v1/push

You can find the correct hostname and port for your Grafana Cloud Logs by looking at the Loki section of your Grafana Cloud account page. Under “Details”, you should see a URL like this: https://logs-prod-us-central1.grafana.net, which translates to https://logs-prod-us-central1.grafana.net/loki/api/v1/push in the Collector configuration.

Debugging configuration issues

The Collector’s configuration file is composed of different sections:

  1. top-level components, such as extensions, receivers, exporters, connectors, and processors
  2. the service node, which defines the list of extensions to load, the Collector’s own telemetry settings, as well as the pipeline definitions
  3. the pipelines section within the service node defines all the pipelines for the Collector instance, listing all components that are part of each pipeline

A common mistake is to define a component’s configuration in the top-level component section but not include the component in any pipeline, like in the following example:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: localhost:4317
  otlp/2:
    protocols:
      grpc:
        endpoint: localhost:5317

exporters:
  logging:

service:
  pipelines:
    traces:
      receivers:  [ otlp ]
      processors: [  ]
      exporters:  [ logging ]

In the example above, we define two receivers (otlp and otlp/2) but use only one of them in the traces pipeline. A client attempting to send data to port 5317 of this Collector will therefore not succeed, likely receiving a Connection refused message.
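
To fix this, either remove the unused receiver or add it to a pipeline. Here’s a minimal sketch of the pipeline referencing both receivers:

service:
  pipelines:
    traces:
      receivers:  [ otlp, otlp/2 ]
      processors: [  ]
      exporters:  [ logging ]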

Similarly, it’s also common to make a reference to a component that hasn’t been defined before, like in the following example:

receivers:
  otlp/2:
    protocols:
      grpc:
        endpoint: localhost:5317

exporters:
  logging:

service:
  pipelines:
    traces:
      receivers:  [ otlp ]
      processors: [  ]
      exporters:  [ logging ]

In the example above, we tell the Collector to use the receiver otlp in the traces pipeline, but there’s no such receiver specified, only otlp/2. In this case, the Collector will fail with an error message like this:

Error: invalid configuration: service::pipeline::traces: references receiver "otlp" which is not configured
2023/04/15 17:47:14 collector server run finished with error: invalid configuration: service::pipeline::traces: references receiver "otlp" which is not configured
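
The fix is to make the pipeline reference the receiver that is actually defined, otlp/2 in this case:

service:
  pipelines:
    traces:
      receivers:  [ otlp/2 ]
      processors: [  ]
      exporters:  [ logging ]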

Another common source of confusion is trying to use a component that is not part of the distribution you are running. For instance, the “core” distribution of the Collector does not include the Loki exporter. An example configuration would be the following:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: localhost:5317

exporters:
  loki:
    endpoint: https://logs-prod-us-central1.grafana.net/loki/api/v1/push

service:
  pipelines:
    logs:
      receivers:  [ otlp ]
      processors: [  ]
      exporters:  [ loki ]

While running this configuration with the OpenTelemetry Collector Contrib distribution works fine, running it with the core distribution returns this:

Error: failed to get config: cannot unmarshal the configuration: 1 error(s) decoding:

* error decoding 'exporters': unknown type: "loki" for id: "loki" (valid values: [logging otlp otlphttp])
2023/04/15 17:51:14 collector server run finished with error: failed to get config: cannot unmarshal the configuration: 1 error(s) decoding:

* error decoding 'exporters': unknown type: "loki" for id: "loki" (valid values: [logging otlp otlphttp])

If you are unsure about which components are available in a given distribution, use the components command of the otelcol binary:

> otelcol components
buildinfo:
    command: otelcol
    description: OpenTelemetry Collector
    version: 0.75.0
receivers:
    - otlp
    - hostmetrics
    - jaeger
    - kafka
    - opencensus
    - prometheus
    - zipkin
processors:
    - filter
    - batch
    - memory_limiter
    - attributes
    - resource
    - span
    - probabilistic_sampler
exporters:
    - jaeger
    - kafka
    - opencensus
    - prometheus
    - logging
    - otlp
    - otlphttp
    - file
    - prometheusremotewrite
    - zipkin
extensions:
    - zpages
    - memory_ballast
    - health_check
    - pprof

Connector issues

Connectors bridge pipelines, acting as an exporter in one pipeline and as a receiver in another. Therefore, connectors must always be used on both sides, like in the following example:

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  logging:

connectors:
  forward:

service:
  pipelines:
    traces/1:
      receivers: [ otlp ]
      processors:
      exporters: [ forward ]
    traces/2:
      receivers: [ forward ]
      processors:
      exporters: [ logging ]

When the connector is only used on one side (exporter or receiver), an error like the following is shown:

2023/05/07 11:42:42 collector server run finished with error: invalid configuration: connectors::forward: must be used as both receiver and exporter but is not used as receiver

Troubleshooting tools

The Collector provides a set of tools that can be used for further troubleshooting, such as the following.

Logging exporter

The logging exporter can be added to any pipeline and prints basic information to the console about the telemetry data it has seen. Here’s an example shown earlier in this document:

2023-05-07T10:53:48.893-0300    info    TracesExporter  {"kind": "exporter", "data_type": "traces", "name": "logging", "#spans": 2}

However, when you want to inspect the actual data being exported, increase the verbosity of the logging exporter to detailed, like this:

exporters:
  logging:
    verbosity: detailed

The following example output shows what to expect when the logging exporter is used for traces:

2023-05-11T10:59:15.268-0300    info    TracesExporter  {"kind": "exporter", "data_type": "traces", "name": "logging", "#spans": 2}
2023-05-11T10:59:15.269-0300    info    ResourceSpans #0
Resource SchemaURL: https://opentelemetry.io/schemas/1.4.0
Resource attributes:
     -> service.name: Str(telemetrygen)
ScopeSpans #0
ScopeSpans SchemaURL: 
InstrumentationScope telemetrygen 
Span #0
    Trace ID       : 70eb62646b2e74db99bab9966cbf2ccf
    Parent ID      : 08ff66da51fc330a
    ID             : 1f4ce9f7660b1bf4
    Name           : okey-dokey
    Kind           : Internal
    Start time     : 2023-05-11 13:59:15.267019806 +0000 UTC
    End time       : 2023-05-11 13:59:15.267147306 +0000 UTC
    Status code    : Unset
    Status message : 
Attributes:
     -> span.kind: Str(server)
     -> net.peer.ip: Str(1.2.3.4)
     -> peer.service: Str(telemetrygen-client)
Span #1
    Trace ID       : 70eb62646b2e74db99bab9966cbf2ccf
    Parent ID      : 
    ID             : 08ff66da51fc330a
    Name           : lets-go
    Kind           : Internal
    Start time     : 2023-05-11 13:59:15.266974872 +0000 UTC
    End time       : 2023-05-11 13:59:15.267147306 +0000 UTC
    Status code    : Unset
    Status message : 
Attributes:
     -> span.kind: Str(client)
     -> net.peer.ip: Str(1.2.3.4)
     -> peer.service: Str(telemetrygen-server)
        {"kind": "exporter", "data_type": "traces", "name": "logging"}

Logs

When you are still setting up your collection pipeline, logs are your best troubleshooting tool. They are available directly in the console where the Collector instance is running. The verbosity can be configured by changing the telemetry settings in the configuration file, under the service top-level node, as follows:

service:
  telemetry:
    logs:
      level: debug
      initial_fields:
        service: my-instance

The initial_fields option can be used so that every log entry produced by the Collector includes the given fields. This is useful when sending the logs from all your Collectors to a centralized location, as it helps identify which Collector in the fleet generated which log entry. Here’s an example of a log entry produced by a Collector using the configuration above:

2023-05-05T14:16:52.935-0300	info	TracesExporter	{"service": "my-instance", "kind": "exporter", "data_type": "traces", "name": "logging", "#spans": 2000}

The possible log levels are: debug, info, warn, error, dpanic, panic, and fatal. The levels follow the semantics of the Zap logging library.

Other logging tweaks

Beyond the already mentioned options, the following attributes can be used to further tweak the behavior of the Collector’s logger:

  • development, to set the logger to development mode, allowing for more detailed feedback in case of problems. This is particularly useful when developing your own components, not so much when setting up your pipelines.
  • encoding, to set the log output format to either json or the default console format.
  • disable_caller, to disable adding the caller’s location to the log entry.
  • disable_stacktrace, to disable the automatic logging of stack traces.
  • sampling, to specify a sampling strategy for the logs (see below).
  • output_paths, to specify where to write the log entries, where stderr and stdout are interpreted as the output devices, not as text files.
  • error_output_paths, same as above, but specifically for error messages.

The sampling configuration may include the following options:

  • initial, following Zap’s semantics, sets the maximum number of log entries per second for a specific message before throttling occurs. Counters are reset every second.
  • thereafter, also following Zap’s semantics, sets how many entries per second are allowed to be recorded when throttling is already in place. Counters are reset every second.

Here’s an example setting all fields:

service:
  telemetry:
    logs:
      level: debug
      initial_fields:
        service: my-instance
      development: true
      encoding: json
      disable_caller: true
      disable_stacktrace: true
      sampling:
        initial: 10
        thereafter: 5
      output_paths:
        - /var/log/otelcol.log
        - stdout
      error_output_paths:
        - /var/log/otelcol.err
        - stderr

Metrics

The Collector exposes its own metrics in OpenMetrics format on localhost:8888/metrics, which can be used in production pipelines to understand the runtime behavior of the Collector and its components. With these metrics, you can check how many data points were seen by individual receivers, processors, and exporters. See Receiver issues and Exporter issues for specific examples.

The verbosity can be configured by changing the telemetry settings in the configuration file, under the service top-level node, as follows:

service:
  telemetry:
    metrics:
      level: detailed

The following levels are available:

  • none, telling the Collector that no metrics should be collected. This effectively disables the metrics endpoint altogether.
  • basic, with only the metrics deemed essential by the component authors.
  • normal, with metrics for regular usage.
  • detailed, with potentially new metrics or attributes for existing ones.

Beyond the level attribute, the address attribute can be used to specify the address (host:port, like localhost:8888) to which the metrics endpoint is bound.
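
For instance, here’s a sketch that sets both attributes, using the default localhost:8888 address mentioned above:

service:
  telemetry:
    metrics:
      level: detailed
      address: localhost:8888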

zpages

This extension serves a few endpoints that can be viewed in a browser, giving insights into the Collector’s own effective configuration and behavior. For instance, the URL http://localhost:55679/debug/servicez shows the Collector’s version, the start time of the process, as well as the configured pipelines and enabled extensions.

The URL http://localhost:55679/debug/tracez provides a simple span viewer for the Collector’s own spans, including the number of spans in pre-determined latency buckets.
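
If zpages isn’t enabled yet, here’s a minimal sketch of the fragments to add to an existing configuration (keeping your existing pipelines), assuming the default endpoint of localhost:55679:

extensions:
  zpages:

service:
  extensions: [ zpages ]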

telemetrygen

The OpenTelemetry Collector project has a tool called telemetrygen, which can generate telemetry data to test pipelines. Refer to the tool’s readme for installation instructions.
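
If you have a Go toolchain installed, one way to install it is with go install, assuming the module path used by the opentelemetry-collector-contrib repository at the time of writing:

go install github.com/open-telemetry/opentelemetry-collector-contrib/cmd/telemetrygen@latest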

When you have a local OpenTelemetry Collector with the OTLP receiver and the gRPC protocol enabled, you can run it like this:

telemetrygen traces --traces 1 --otlp-insecure
telemetrygen logs --logs 1 --otlp-insecure
telemetrygen metrics --metrics 1 --otlp-insecure

A suitable OpenTelemetry Collector configuration would look like this:

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  logging:

service:
  pipelines:
    traces:
      receivers: [ otlp ]
      processors:
      exporters: [ logging ]
    logs:
      receivers: [ otlp ]
      processors:
      exporters: [ logging ]
    metrics:
      receivers: [ otlp ]
      processors:
      exporters: [ logging ]

Dealing with bugs

After troubleshooting, if you suspect you are facing a bug in the Collector, you have the following options:

  • for components owned by Grafana Labs, such as the loki receiver and exporter, open an issue with Grafana Labs’ support
  • for core components (otlp receiver and exporter, batch processor, logging exporter, among others), open an issue against the OpenTelemetry Collector core repository
  • for contrib components (loadbalancer exporter, spanmetrics processor, tailsampling processor, among others), open an issue against the OpenTelemetry Collector contrib repository

We are also available on Grafana Labs’ Community Slack in the #opentelemetry channel.