Troubleshooting the OpenTelemetry Collector
The OpenTelemetry Collector is made of components such as Receivers, Exporters, Processors, Connectors, and Extensions. Each component is usually part of one or more pipelines. This article helps you sort out common problems with the Collector and explains what to do if you suspect you have found a bug.
Receiver issues
When your telemetry client generates data but it isn't received by your backend, use the metric `otelcol_receiver_accepted_spans` to confirm that the data points have been received by the Collector. If you expect a data point to have been counted as part of this metric but it hasn't been, check the metric `otelcol_receiver_refused_spans` to ensure it wasn't refused by the Collector.
See Metrics for more information on the Collector’s own metrics.
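For a quick check of these counters, you can query the Collector's own metrics endpoint directly. The following is a minimal sketch, assuming the default endpoint on `localhost:8888` described in the Metrics section below:

```shell
# Look at the receiver metrics exposed by the Collector itself
curl -s http://localhost:8888/metrics | grep otelcol_receiver
```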
When neither metric shows that data points have been seen, it's an indication that the Collector hasn't received the data at all. In that case, check the connectivity between your telemetry client and the Collector. When possible, simplify the networking between the source of data (typically your workload) and the Collector. For instance, try running everything directly on your machine instead of as containers. Refer to the Getting Started Guide for more information on how to run the Collector locally.
A quick way to verify whether data is being received by the Collector is to use a configuration similar to the following:
```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  logging:

service:
  pipelines:
    traces:
      receivers: [ otlp ]
      processors:
      exporters: [ logging ]
```
With that, you should see output similar to the following when receiving data via a traces pipeline:
```
2023-05-05T15:04:58.982-0300 info TracesExporter {"kind": "exporter", "data_type": "traces", "name": "logging", "#spans": 512}
```
If you don’t see that, it’s a good indication that your receiver didn’t receive the data you were expecting.
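If you don't have a telemetry client at hand for this check, the `telemetrygen` tool described later in this document can send a single test span to a local Collector:

```shell
telemetrygen traces --traces 1 --otlp-insecure
```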
Exporter issues
Once you have confirmed that your Collector received the data point, the next step is to check the equivalent metrics on the exporter side: `otelcol_exporter_sent_spans` and `otelcol_exporter_send_failed_spans`. If you see only sent spans, the exporter reported that it was able to send all data points to the destination. In that case, check the logs at your backend for clues.
See Metrics for more information on the Collector’s own metrics.
A good way to find out whether the exporter is not working as intended is to add a `logging` exporter to the same pipeline you are currently using. When doing that, you should see your telemetry data printed to the console, confirming that the data has been seen by the exporters:
```
2023-05-07T10:53:48.893-0300 info TracesExporter {"kind": "exporter", "data_type": "traces", "name": "logging", "#spans": 2}
```
If you still see no data at your backends, try disabling any resiliency mechanisms the exporter may have, such as the sending queue or retry on failure. Here's one example of a configuration file with all the tips above:
```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  logging:
  otlp:
    endpoint: failing.example.com:4317
    sending_queue:
      enabled: false
    retry_on_failure:
      enabled: false

service:
  pipelines:
    traces:
      receivers: [ otlp ]
      processors:
      exporters: [ logging, otlp ]
```
At this point, the metrics might look like this, indicating that the `logging` exporter was able to export the data but the `otlp` exporter wasn't:
```
otelcol_exporter_sent_spans{exporter="logging",service_instance_id="d9853063-c63c-48db-8b13-69e379b38314",service_name="otelcol",service_version="0.75.0"} 4
otelcol_exporter_sent_spans{exporter="otlp",service_instance_id="d9853063-c63c-48db-8b13-69e379b38314",service_name="otelcol",service_version="0.75.0"} 0
```
When that happens, the best place to look for clues is the Collector's console output. If it doesn't show any further logs that might help explain the issue, double-check the exporter's documentation for instructions specific to that exporter.
Authentication issues
When you need to authenticate against a remote OTLP server, such as Grafana Cloud OTLP, use an authentication extension and configure the exporter to use it. Refer to Connect OpenTelemetry Collector to Grafana Cloud databases for more information on how to obtain your username and password. This is preferable to passing the credentials directly as username and password in the URL. Here's an example of the recommended approach:
```yaml
extensions:
  basicauth/traces:
    client_auth:
      username: "1234" # your Username / Instance ID
      password: "ey..." # your API key with the "MetricsPublisher" role

exporters:
  otlphttp:
    endpoint: "https://otlp-gateway-prod-us-central-0.grafana.net/otlp"
    auth:
      authenticator: basicauth/traces
```
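Note that authenticator extensions also need to be listed under the `service` node so the Collector loads them. Here's a minimal sketch of that part of the configuration, assuming the `basicauth/traces` extension and `otlphttp` exporter from the example above, plus an `otlp` receiver:

```yaml
service:
  extensions: [ basicauth/traces ]
  pipelines:
    traces:
      receivers: [ otlp ]
      processors:
      exporters: [ otlphttp ]
```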
Issues sending data to Grafana Tempo / Grafana Cloud Traces
When sending data to Grafana Tempo or Grafana Cloud Traces, make sure you are using the OTLP gRPC Exporter and that the endpoint contains only the hostname and port, like in the following example:
```yaml
exporters:
  otlp:
    endpoint: tempo-us-central1.grafana.net:443
```
You can find the correct hostname and port for your Grafana Cloud Traces instance in the Tempo section of your Grafana Cloud account page. Under “Details”, you should see a URL like `https://tempo-us-central1.grafana.net/tempo`, which translates to `tempo-us-central1.grafana.net:443` in the Collector configuration.
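Grafana Cloud Traces also requires authentication, so in practice you would typically combine this exporter with a basic auth extension like the one shown earlier. Here's a hedged sketch, where the instance ID and API key values are placeholders:

```yaml
extensions:
  basicauth/traces:
    client_auth:
      username: "1234"  # your Grafana Cloud Traces instance ID
      password: "ey..." # your API key

exporters:
  otlp:
    endpoint: tempo-us-central1.grafana.net:443
    auth:
      authenticator: basicauth/traces

service:
  extensions: [ basicauth/traces ]
```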
Issues sending data to Grafana Mimir / Grafana Cloud Metrics
When sending data to Grafana Mimir or Grafana Cloud Metrics, make sure you are using the Prometheus Remote Write Exporter and that the endpoint contains the full URL, including the path of the push endpoint, like in the following example:
```yaml
exporters:
  prometheusremotewrite:
    endpoint: https://prometheus-blocks-prod-us-central1.grafana.net/api/prom/push
```
You can find the correct endpoint for your Grafana Cloud Metrics instance in the Prometheus section of your Grafana Cloud account page. Under “Details”, you should see the Prometheus Remote Write endpoint, like `https://prometheus-blocks-prod-us-central1.grafana.net/api/prom/push`, which can be used as is in the Collector configuration.
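As with any exporter, `prometheusremotewrite` only does something once it is referenced in a pipeline. Here's a minimal sketch of the `service` section, assuming an `otlp` receiver feeding a metrics pipeline:

```yaml
service:
  pipelines:
    metrics:
      receivers: [ otlp ]
      processors:
      exporters: [ prometheusremotewrite ]
```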
Issues sending data to Grafana Loki / Grafana Cloud Logs
When sending data to Grafana Loki or Grafana Cloud Logs, make sure you are using the Loki Exporter and that the endpoint contains the full URL, including the path of the push endpoint, like in the following example:
```yaml
exporters:
  loki:
    endpoint: https://logs-prod-us-central1.grafana.net/loki/api/v1/push
```
You can find the correct endpoint for your Grafana Cloud Logs instance in the Loki section of your Grafana Cloud account page. Under “Details”, you should see a URL like `https://logs-prod-us-central1.grafana.net`, which translates to `https://logs-prod-us-central1.grafana.net/loki/api/v1/push` in the Collector configuration.
Debugging configuration issues
The Collector's configuration file is composed of different sections (see the skeleton after this list):

- top-level components, such as `extensions`, `receivers`, `exporters`, `connectors`, and `processors`
- the `service` node, which defines the list of extensions to load, the Collector's own telemetry settings, as well as the pipeline definitions
- the `pipelines` section within the `service` node, which defines all the pipelines for the Collector instance, listing all components that are part of each pipeline
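Putting these sections together, a skeleton configuration looks like this (the component names are only examples):

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  batch:

exporters:
  logging:

extensions:
  zpages:

service:
  extensions: [ zpages ]
  telemetry:
    logs:
      level: info
  pipelines:
    traces:
      receivers: [ otlp ]
      processors: [ batch ]
      exporters: [ logging ]
```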
A common mistake is to define a component's configuration in the top-level section but not include the component in any pipeline, like in the following example:
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: localhost:4317
  otlp/2:
    protocols:
      grpc:
        endpoint: localhost:5317

exporters:
  logging:

service:
  pipelines:
    traces:
      receivers: [ otlp ]
      processors: [ ]
      exporters: [ logging ]
```
In the example above, we define two receivers (`otlp` and `otlp/2`) but use only one of them in the `traces` pipeline. A client attempting to send data to this Collector's `5317` port therefore won't succeed, and will likely receive a `Connection refused` message.
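To fix this, reference the second receiver in a pipeline as well, for example:

```yaml
service:
  pipelines:
    traces:
      receivers: [ otlp, otlp/2 ]
      processors: [ ]
      exporters: [ logging ]
```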
Similarly, it's also common to reference a component that hasn't been defined, like in the following example:
```yaml
receivers:
  otlp/2:
    protocols:
      grpc:
        endpoint: localhost:5317

exporters:
  logging:

service:
  pipelines:
    traces:
      receivers: [ otlp ]
      processors: [ ]
      exporters: [ logging ]
```
In the example above, we tell the Collector to use the receiver `otlp` in the `traces` pipeline, but no such receiver is defined, only `otlp/2`. In this case, the Collector will fail with an error message like this:
```
Error: invalid configuration: service::pipeline::traces: references receiver "otlp" which is not configured
2023/04/15 17:47:14 collector server run finished with error: invalid configuration: service::pipeline::traces: references receiver "otlp" which is not configured
```
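The fix is to either define an `otlp` receiver at the top level or reference the receiver that does exist, `otlp/2`:

```yaml
service:
  pipelines:
    traces:
      receivers: [ otlp/2 ]
      processors: [ ]
      exporters: [ logging ]
```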
Another common source of confusion is trying to use a component that is not part of the distribution you are running. For instance, the “core” distribution of the Collector does not include the Loki exporter. An example configuration would be the following:
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: localhost:5317

exporters:
  loki:
    endpoint: https://logs-prod-us-central1.grafana.net/loki/api/v1/push

service:
  pipelines:
    logs:
      receivers: [ otlp ]
      processors: [ ]
      exporters: [ loki ]
```
While running this configuration with the OpenTelemetry Collector Contrib distribution works fine, running it with the core distribution returns this:
```
Error: failed to get config: cannot unmarshal the configuration: 1 error(s) decoding:
* error decoding 'exporters': unknown type: "loki" for id: "loki" (valid values: [logging otlp otlphttp])
2023/04/15 17:51:14 collector server run finished with error: failed to get config: cannot unmarshal the configuration: 1 error(s) decoding:
* error decoding 'exporters': unknown type: "loki" for id: "loki" (valid values: [logging otlp otlphttp])
```
If you are unsure about which components are available in a given distribution, use the `components` option of the `otelcol` binary:
```
> otelcol components
buildinfo:
  command: otelcol
  description: OpenTelemetry Collector
  version: 0.75.0
receivers:
  - otlp
  - hostmetrics
  - jaeger
  - kafka
  - opencensus
  - prometheus
  - zipkin
processors:
  - filter
  - batch
  - memory_limiter
  - attributes
  - resource
  - span
  - probabilistic_sampler
exporters:
  - jaeger
  - kafka
  - opencensus
  - prometheus
  - logging
  - otlp
  - otlphttp
  - file
  - prometheusremotewrite
  - zipkin
extensions:
  - zpages
  - memory_ballast
  - health_check
  - pprof
```
Connector issues
Connectors bridge pipelines, acting as an exporter in one pipeline and a receiver in another. Therefore, connectors must always be specified on both sides, like in the following example:
```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  logging:

connectors:
  forward:

service:
  pipelines:
    traces/1:
      receivers: [ otlp ]
      processors:
      exporters: [ forward ]
    traces/2:
      receivers: [ forward ]
      processors:
      exporters: [ logging ]
```
When the connector is only used on one side (exporter or receiver), an error like the following is shown:
```
2023/05/07 11:42:42 collector server run finished with error: invalid configuration: connectors::forward: must be used as both receiver and exporter but is not used as receiver
```
Troubleshooting tools
The Collector provides a set of tools that can be used for further troubleshooting, such as the following.
Logging exporter
The logging exporter can be added to any pipeline and prints basic information to the console about the telemetry data it has seen, as in this example shown earlier in this document:
```
2023-05-07T10:53:48.893-0300 info TracesExporter {"kind": "exporter", "data_type": "traces", "name": "logging", "#spans": 2}
```
However, when you want to inspect the actual data that was exported, increase the verbosity of the logging exporter to `detailed`, like this:
```yaml
exporters:
  logging:
    verbosity: detailed
```
The following example output shows what to expect when the logging exporter is used for traces:
```
2023-05-11T10:59:15.268-0300 info TracesExporter {"kind": "exporter", "data_type": "traces", "name": "logging", "#spans": 2}
2023-05-11T10:59:15.269-0300 info ResourceSpans #0
Resource SchemaURL: https://opentelemetry.io/schemas/1.4.0
Resource attributes:
     -> service.name: Str(telemetrygen)
ScopeSpans #0
ScopeSpans SchemaURL:
InstrumentationScope telemetrygen
Span #0
    Trace ID       : 70eb62646b2e74db99bab9966cbf2ccf
    Parent ID      : 08ff66da51fc330a
    ID             : 1f4ce9f7660b1bf4
    Name           : okey-dokey
    Kind           : Internal
    Start time     : 2023-05-11 13:59:15.267019806 +0000 UTC
    End time       : 2023-05-11 13:59:15.267147306 +0000 UTC
    Status code    : Unset
    Status message :
Attributes:
     -> span.kind: Str(server)
     -> net.peer.ip: Str(1.2.3.4)
     -> peer.service: Str(telemetrygen-client)
Span #1
    Trace ID       : 70eb62646b2e74db99bab9966cbf2ccf
    Parent ID      :
    ID             : 08ff66da51fc330a
    Name           : lets-go
    Kind           : Internal
    Start time     : 2023-05-11 13:59:15.266974872 +0000 UTC
    End time       : 2023-05-11 13:59:15.267147306 +0000 UTC
    Status code    : Unset
    Status message :
Attributes:
     -> span.kind: Str(client)
     -> net.peer.ip: Str(1.2.3.4)
     -> peer.service: Str(telemetrygen-server)
{"kind": "exporter", "data_type": "traces", "name": "logging"}
```
Logs
When you are still setting up your collection pipeline, logs are your best troubleshooting tool. They are available directly on the console running the Collector instance. The verbosity can be configured by changing the `telemetry` settings in the configuration file, under the `service` top-level node, as follows:
```yaml
service:
  telemetry:
    logs:
      level: debug
      initial_fields:
        service: my-instance
```
`initial_fields` can be added so that every log entry produced by the Collector includes them. This is useful when sending the logs from all your Collectors to a centralized location, as it lets you identify which Collector in the fleet generated which log entry. Here's an example of a log entry produced by a Collector using the configuration above:
```
2023-05-05T14:16:52.935-0300 info TracesExporter {"service": "my-instance", "kind": "exporter", "data_type": "traces", "name": "logging", "#spans": 2000}
```
The possible log levels are: `debug`, `info`, `warn`, `error`, `dpanic`, `panic`, and `fatal`. The levels follow the semantics of the logging library Zap.
Other logging tweaks
Beyond the options already mentioned, the following attributes can be used to further tweak the behavior of the Collector's logger:

- `development`, to set the logger in development mode, allowing for more detailed feedback in case of problems. This is particularly useful when developing your own components, not so much when setting up your pipelines.
- `encoding`, allowing the output format to be either `json` or the default `console` format.
- `disable_caller`, to disable adding the caller's location to the log entry.
- `disable_stacktrace`, to disable the automatic logging of stack traces.
- `sampling`, to specify a sampling strategy for the logs (see below).
- `output_paths`, to specify where to write the log entries, where `stderr` and `stdout` are interpreted as the output devices, not as text files.
- `error_output_paths`, same as above, but specifically for error messages.
The sampling configuration may include the following options:
- `initial`, following Zap's semantics, sets the maximum number of log entries per second for a specific message before throttling occurs. Counters are reset every second.
- `thereafter`, also following Zap's semantics, sets how many entries per second are allowed to be recorded when throttling is already in place. Counters are reset every second.
Here’s an example setting all fields:
```yaml
service:
  telemetry:
    logs:
      level: debug
      initial_fields:
        service: my-instance
      development: true
      encoding: json
      disable_caller: true
      disable_stacktrace: true
      sampling:
        initial: 10
        thereafter: 5
      output_paths:
        - /var/log/otelcol.log
        - stdout
      error_output_paths:
        - /var/log/otelcol.err
        - stderr
```
Metrics
The Collector exposes its own metrics in OpenMetrics format on `localhost:8888/metrics`. They can be used in production pipelines to understand the runtime behavior of the Collector and its components. With metrics, you can check how many data points were seen by receivers, processors, and exporters individually. See Receiver issues and Exporter issues for specific examples.
The verbosity can be configured by changing the `telemetry` settings in the configuration file, under the `service` top-level node, as follows:
```yaml
service:
  telemetry:
    metrics:
      level: detailed
```
The following levels are available:
- `none`, telling the Collector that no metrics should be collected. This effectively disables the metrics endpoint altogether.
- `basic`, with only the metrics deemed essential by the component authors.
- `normal`, with metrics for regular usage.
- `detailed`, with potentially new metrics or attributes for existing ones.
Beyond the `level` attribute, the `address` attribute can be used to specify the address (`host:port`, like `localhost:8888`) to bind the metrics endpoint to.
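For example, to expose detailed metrics on a different port (the port here is arbitrary):

```yaml
service:
  telemetry:
    metrics:
      level: detailed
      address: localhost:9090
```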
zpages
This extension provides a few endpoints that can be viewed in a browser, providing insights about the Collector’s own effective configuration. For instance, the URL http://localhost:55679/debug/servicez shows the Collector’s version, the start time of the process, as well as the configured pipelines and enabled extensions.
The URL http://localhost:55679/debug/tracez provides a simple span viewer for the Collector’s own spans, including the number of spans in pre-determined latency buckets.
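These endpoints are only available when the zpages extension is enabled in your configuration. Here's a minimal sketch, assuming the default endpoint used in the URLs above:

```yaml
extensions:
  zpages:

service:
  extensions: [ zpages ]
```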
telemetrygen
The OpenTelemetry Collector project has a tool called `telemetrygen`, which can generate telemetry data to test pipelines. Refer to the tool's readme for installation instructions.
When you have a local OpenTelemetry Collector with the OTLP receiver and the gRPC protocol enabled, you can run it like this:
```shell
telemetrygen traces --traces 1 --otlp-insecure
telemetrygen logs --logs 1 --otlp-insecure
telemetrygen metrics --metrics 1 --otlp-insecure
```
A suitable OpenTelemetry Collector configuration would look like this:
```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  logging:

service:
  pipelines:
    traces:
      receivers: [ otlp ]
      processors:
      exporters: [ logging ]
    logs:
      receivers: [ otlp ]
      processors:
      exporters: [ logging ]
    metrics:
      receivers: [ otlp ]
      processors:
      exporters: [ logging ]
```
Dealing with bugs
If, after troubleshooting, you suspect you are facing a bug in the Collector, you have the following options:
- for components owned by Grafana Labs, such as the `loki` receiver and exporter, open an issue with Grafana Labs' support
- for core components (the `otlp` receiver and exporter, the `batch` processor, the `logging` exporter, among others), open an issue against the OpenTelemetry Collector core repository
- for contrib components (the `loadbalancing` exporter, the `spanmetrics` processor, the `tailsampling` processor, among others), open an issue against the OpenTelemetry Collector contrib repository
We are also available in the #opentelemetry channel on Grafana Labs' Community Slack.