TensorFlow integration for Grafana Cloud
TensorFlow is an end-to-end open source platform for machine learning. The TensorFlow integration uses Grafana Agent to collect metrics for monitoring a TensorFlow instance, including aspects such as model request latency, model runtime latency, batch queuing latency, graph build time, and graph run time. The integration also supports scraping TensorFlow Docker container logs through the agent's Promtail-based logs subsystem. An accompanying dashboard is provided to visualize these metrics and logs.
This integration supports TensorFlow 2.10.0+.
Pre-install configuration for the TensorFlow integration
This integration monitors a TensorFlow instance that exposes its metrics through TensorFlow's built-in Prometheus metrics server.
In order for the integration to work, you must first enable the Prometheus metrics as described in the TensorFlow Monitoring Configuration documentation.
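For TensorFlow Serving, the Prometheus endpoint is typically enabled by passing a monitoring configuration file via the `--monitoring_config_file` flag. A minimal sketch (the file name is illustrative):

```
# monitoring.config -- enables the Prometheus endpoint used by this integration
prometheus_config {
  enable: true
  path: "/monitoring/prometheus/metrics"
}
```

The `path` value must match the `metrics_path` used in the Agent scrape config.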
Additional configuration is required to enable the Prometheus batching metrics. Batching can be enabled as described in the TensorFlow Batching Configuration documentation.
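Batching metrics only appear once batching itself is enabled. For TensorFlow Serving this is typically done with the `--enable_batching` flag plus a batching parameters file; the values below are illustrative, not recommendations:

```
# batching.config -- tune these values for your model and hardware
max_batch_size { value: 128 }
batch_timeout_micros { value: 1000 }
num_batch_threads { value: 4 }
max_enqueued_batches { value: 1000000 }
```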
Install TensorFlow integration for Grafana Cloud
- In your Grafana Cloud instance, click Integrations and Connections (lightning bolt icon).
- Navigate to the TensorFlow tile and review the prerequisites. Then click Install integration.
- Once the integration is installed, follow the steps on the Configuration Details page to set up Grafana Agent and start sending TensorFlow metrics to your Grafana Cloud instance.
Post-install configuration for the TensorFlow integration
This integration supports metrics and logs from a TensorFlow instance. If you want to show logs and metrics signals correlated on the same dashboards, ensure the following:
- The `instance` label values must match for the metrics scrape config and the logs scrape config in the Agent configuration file.
- Replace `<your-instance-name>` with the value that uniquely identifies your instance.
- There must be a `name => tensorflow` label in the logs scrape config.
Refer to the following snippet:
```yaml
metrics:
  wal_directory: /tmp/wal
  configs:
    - name: integrations
      scrape_configs:
        - job_name: integrations/tensorflow
          metrics_path: /monitoring/prometheus/metrics
          relabel_configs:
            - replacement: "<your-instance-name>"
              target_label: instance
          static_configs:
            - targets: ['localhost']
      remote_write:
        - url: http://cortex:9009/api/prom/push
logs:
  configs:
    - name: integrations
      clients:
        - url: http://loki:3100/loki/api/v1/push
      positions:
        filename: /var/lib/grafana-agent/logs/positions.yaml
      scrape_configs:
        - job_name: integrations/tensorflow
          relabel_configs:
            - source_labels: ['__meta_docker_container_name']
              replacement: tensorflow
              target_label: name
            - source_labels: ['__meta_docker_container_name']
              replacement: integrations/tensorflow
              target_label: job
            - source_labels: ['__meta_docker_container_name']
              replacement: "<your-instance-name>"
              target_label: instance
          docker_sd_configs:
            - host: unix:///var/run/docker.sock
              refresh_interval: 5s
              filters:
                - name: name
                  values: [tensorflow]
```
The TensorFlow integration installs the following dashboards in your Grafana Cloud instance to help monitor your metrics.
- TensorFlow overview
TensorFlow overview dashboard 1
TensorFlow overview dashboard 2
The TensorFlow integration includes the following useful alerts:
| Alert | Description |
| --- | --- |
| TensorFlowModelRequestHighErrorRate | Critical: More than 30% of all model requests are not successful. |
| TensorFlowHighBatchQueuingLatency | Warning: Batch queuing latency is more than 5,000,000 µs. |
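As a sketch of how the first alert could be expressed as a Prometheus alerting rule, assuming TensorFlow Serving's `:tensorflow:serving:request_count` metric with a `status` label (verify the metric and label names against your own metrics endpoint):

```yaml
groups:
  - name: tensorflow
    rules:
      - alert: TensorFlowModelRequestHighErrorRate
        # Assumed metric/label names; check /monitoring/prometheus/metrics.
        expr: |
          sum(rate(:tensorflow:serving:request_count{status!="OK"}[5m]))
            / sum(rate(:tensorflow:serving:request_count[5m])) > 0.30
        for: 5m
        labels:
          severity: critical
```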
The following metrics are automatically written to your Grafana Cloud instance by connecting your TensorFlow instance through this integration:
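To see which series you would send before connecting, you can scrape the metrics endpoint yourself and list the metric names. A small sketch that extracts unique metric names from a Prometheus exposition payload (the sample payload below is illustrative, not the integration's actual metric set):

```python
import re

# Illustrative sample of a Prometheus exposition payload, e.g. the body of
# GET /monitoring/prometheus/metrics on a local TensorFlow Serving instance.
sample = """\
# TYPE :tensorflow:serving:request_count counter
:tensorflow:serving:request_count{model_name="demo",status="OK"} 42
:tensorflow:serving:request_latency_sum{model_name="demo"} 1.2e06
"""

def metric_names(payload: str) -> set:
    """Return the unique metric names found in an exposition payload."""
    names = set()
    for line in payload.splitlines():
        # Skip blank lines and comments (# HELP / # TYPE).
        if not line or line.startswith("#"):
            continue
        # Metric names may contain colons, letters, digits, and underscores.
        m = re.match(r"([a-zA-Z_:][a-zA-Z0-9_:]*)", line)
        if m:
            names.add(m.group(1))
    return names

print(sorted(metric_names(sample)))
```

The same regex applies to any Prometheus exposition output, so the helper is reusable against a real scrape.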
0.0.1 - December 2022
- Initial release
By connecting your TensorFlow instance to Grafana Cloud you might incur charges. To view information on the number of active series that your Grafana Cloud account uses for metrics included in each Cloud tier, see Active series and DPM usage and Cloud tier pricing.