TensorFlow integration for Grafana Cloud

TensorFlow is an end-to-end open source platform for machine learning. The TensorFlow integration uses Grafana Agent to collect metrics for monitoring a TensorFlow instance, including model request latency, model runtime latency, batch queuing latency, graph build time, and graph run time. The Agent can also scrape the TensorFlow Docker container logs using Promtail. An accompanying dashboard visualizes these metrics and logs.

This integration supports TensorFlow 2.10.0+.

Pre-install configuration for the TensorFlow integration

This integration monitors a TensorFlow instance that exposes its metrics through TensorFlow’s built-in Prometheus metrics server.

For the integration to work, you must first enable Prometheus metrics as described in the TensorFlow Monitoring Configuration documentation.

Additional configuration is required to enable the Prometheus batching metrics. Enable batching as described in the TensorFlow Batching Configuration documentation.

Install TensorFlow integration for Grafana Cloud

  1. In your Grafana Cloud instance, click Integrations and Connections (lightning bolt icon).
  2. Navigate to the TensorFlow tile and review the prerequisites. Then click Install integration.
  3. Once the integration is installed, follow the steps on the Configuration Details page to set up Grafana Agent and start sending TensorFlow metrics to your Grafana Cloud instance.

Post-install configuration for the TensorFlow integration

This integration supports metrics and logs from a TensorFlow instance. If you want to show logs and metrics signals correlated on the same dashboards, ensure the following:

  • The job and instance label values must match between the metrics scrape config and the logs scrape config in the Agent configuration file.
  • <your-instance-name> must be replaced with a value that uniquely identifies your instance.
  • The logs scrape config must include a name => tensorflow label.

Refer to the following snippet:

metrics:
  wal_directory: /tmp/wal
  configs:
    - name: integrations
      scrape_configs:
        - job_name: integrations/tensorflow
          metrics_path: /monitoring/prometheus/metrics
          relabel_configs:
            - replacement: "<your-instance-name>"
              target_label: instance
          static_configs:
            - targets: ['localhost']
      remote_write:
        - url: http://cortex:9009/api/prom/push
logs:
  configs:
    - name: integrations
      clients:
        - url: http://loki:3100/loki/api/v1/push
      positions:
        filename: /var/lib/grafana-agent/logs/positions.yaml
      scrape_configs:
        - job_name: integrations/tensorflow
          relabel_configs:
            - source_labels: ['__meta_docker_container_name']
              replacement: tensorflow
              target_label: name
            - source_labels: ['__meta_docker_container_name']
              replacement: integrations/tensorflow
              target_label: job
            - source_labels: ['__meta_docker_container_name']
              replacement: "<your-instance-name>"
              target_label: instance
          docker_sd_configs:
            - host: unix:///var/run/docker.sock
              refresh_interval: 5s
              filters:
                - name: name
                  values: [tensorflow]
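
The three requirements listed before the snippet can be checked mechanically before you deploy the Agent. The following is a minimal Python sketch, not part of the integration: check_correlation_labels is a hypothetical helper, and it assumes you copy the final label values of each scrape config into plain dictionaries by hand. The instance value tf-serving-1 is a made-up stand-in for <your-instance-name>.

```python
def check_correlation_labels(metrics_labels, logs_labels):
    """Return a list of problems that would stop logs and metrics from correlating."""
    problems = []
    # job and instance must be present and identical on both sides.
    for key in ("job", "instance"):
        metrics_value = metrics_labels.get(key)
        logs_value = logs_labels.get(key)
        if metrics_value is None or logs_value is None:
            problems.append(f"{key} label is missing")
        elif metrics_value != logs_value:
            problems.append(
                f"{key} mismatch: metrics={metrics_value!r}, logs={logs_value!r}"
            )
    # The logs scrape config must also carry name => tensorflow.
    if logs_labels.get("name") != "tensorflow":
        problems.append("logs scrape config needs name => tensorflow")
    return problems


# Label values as they appear in the snippet; "tf-serving-1" is a placeholder.
metrics_labels = {"job": "integrations/tensorflow", "instance": "tf-serving-1"}
logs_labels = {
    "job": "integrations/tensorflow",
    "instance": "tf-serving-1",
    "name": "tensorflow",
}

print(check_correlation_labels(metrics_labels, logs_labels))  # prints []
```

An empty list means the two scrape configs agree. Any returned string names the label that needs fixing.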

Dashboards

The TensorFlow integration installs the following dashboard in your Grafana Cloud instance to help monitor your metrics.

  • TensorFlow overview

TensorFlow overview dashboard 1

TensorFlow overview dashboard 2

Alerts

The TensorFlow integration includes the following useful alerts:

Group: TensorFlowAlerts

Alert: Description
TensorFlowModelRequestHighErrorRateCritical: More than 30% of all model requests are not successful.
TensorFlowHighBatchQueuingLatencyWarning: Batch queuing latency is more than 5,000,000 µs (5 seconds).
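
To make the two thresholds concrete, here is a small Python sketch of the arithmetic they describe. It is an illustration under assumptions, not the integration's actual alert rules: the function names are hypothetical, the inputs are counter increases over some evaluation window of your choosing, and the queuing latency is assumed to be an average (sum divided by count) in microseconds.

```python
ERROR_RATE_THRESHOLD = 0.30               # "more than 30%" of requests unsuccessful
QUEUING_LATENCY_THRESHOLD_US = 5_000_000  # 5,000,000 microseconds = 5 seconds


def error_rate(successful_requests, total_requests):
    """Fraction of requests in the window that were not successful."""
    if total_requests == 0:
        return 0.0  # no traffic, so nothing failed
    return (total_requests - successful_requests) / total_requests


def average_latency_us(latency_sum_us, latency_count):
    """Average latency in microseconds from the increase of a _sum and a _count series."""
    if latency_count == 0:
        return 0.0
    return latency_sum_us / latency_count


# 350 of 1000 requests failed: 35% is above the 30% threshold.
print(error_rate(650, 1000) > ERROR_RATE_THRESHOLD)  # prints True

# Two batches queued for 12 seconds in total: a 6-second average exceeds 5 seconds.
print(average_latency_us(12_000_000, 2) > QUEUING_LATENCY_THRESHOLD_US)  # prints True
```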

Metrics

The following metrics are automatically written to your Grafana Cloud instance by connecting your TensorFlow instance through this integration:

  • :tensorflow:core:graph_build_calls
  • :tensorflow:core:graph_build_time_usecs
  • :tensorflow:core:graph_run_time_usecs
  • :tensorflow:core:graph_runs
  • :tensorflow:serving:batching_session:queuing_latency_count
  • :tensorflow:serving:batching_session:queuing_latency_sum
  • :tensorflow:serving:request_count
  • :tensorflow:serving:request_latency_count
  • :tensorflow:serving:request_latency_sum
  • :tensorflow:serving:runtime_latency_count
  • :tensorflow:serving:runtime_latency_sum
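
One way to confirm that a TensorFlow instance exposes these series is to scan its Prometheus exposition text for the metric names. The Python sketch below is a hedged example rather than part of the integration: missing_metrics, exposed_metric_names, and fetch_exposition are hypothetical helpers, and the sample text (including its label) is made up for illustration.

```python
import re
from urllib.request import urlopen

# Metric names copied from the list above.
EXPECTED_METRICS = [
    ":tensorflow:core:graph_build_calls",
    ":tensorflow:core:graph_build_time_usecs",
    ":tensorflow:core:graph_run_time_usecs",
    ":tensorflow:core:graph_runs",
    ":tensorflow:serving:batching_session:queuing_latency_count",
    ":tensorflow:serving:batching_session:queuing_latency_sum",
    ":tensorflow:serving:request_count",
    ":tensorflow:serving:request_latency_count",
    ":tensorflow:serving:request_latency_sum",
    ":tensorflow:serving:runtime_latency_count",
    ":tensorflow:serving:runtime_latency_sum",
]


def exposed_metric_names(exposition_text):
    """Collect metric names from Prometheus text-format sample lines."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The name runs up to the first "{" (labels) or whitespace (value).
        match = re.match(r"[^\s{]+", line)
        if match:
            names.add(match.group(0))
    return names


def missing_metrics(exposition_text, expected=EXPECTED_METRICS):
    """Return the expected metric names that do not appear in the text, sorted."""
    return sorted(set(expected) - exposed_metric_names(exposition_text))


def fetch_exposition(url, timeout=5):
    """Download the exposition text; the URL is supplied by the caller."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Made-up sample covering two of the eleven expected names.
sample = """# TYPE :tensorflow:serving:request_count counter
:tensorflow:serving:request_count{example_label="x"} 12
:tensorflow:core:graph_runs 3
"""
print(len(missing_metrics(sample)))  # prints 9
```

Against a live instance, you would pass the result of fetch_exposition to missing_metrics. The path /monitoring/prometheus/metrics comes from the Agent snippet on this page, but the host and port of your TensorFlow instance are deployment-specific and must be filled in by you. Note that the batching series only appear once batching is enabled.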

Changelog

# 0.0.1 - December 2022

* Initial Release

Cost

By connecting your TensorFlow instance to Grafana Cloud, you might incur charges. To view information on the number of active series that your Grafana Cloud account uses for metrics included in each Cloud tier, see Active series and dpm usage and Cloud tier pricing.