Telegraf: system dashboard

Dashboard

InfluxDB dashboards for telegraf metrics
Last updated: 4 months ago

Downloads: 1509

  • Screenshot: Grafana, Telegraf system dashboard

Templated dashboard for Telegraf + InfluxDB.
Similar to the basic https://grafana.net/dashboards/914, but with templating, repeating panels/rows, etc.
It started as a "learn InfluxDB/Telegraf" project and ended up as something I use daily.
Variables (in addition to the standard ones like server / datasource / interval):

  • CPUs (defaults to all)
  • Disks (per-disk IOPS)
  • Network interfaces (packets, bandwidth, errors/drops)
  • Mountpoints (space / inodes)

Metrics:

  • Detailed network stack info; the nstat plugin lets us grab raw SNMP data, such as:
    • TCP handshakes data
    • TCP aborts data
    • ICMP errors, ICMP data
    • SYN data
    • TCP errors (retransmissions/etc)
    • IPv4 errors
    • IPv6 errors
  • Conntrack data
  • File descriptors
  • UDP data

...and basically everything "generic" you can extract from an ordinary Linux system.

By default all variables point to "all", so the dashboard can be huge if you have a large number of disks or network interfaces.
So far I have tested it on a machine with 46 disks and 8 interfaces; it loaded correctly, but quite slowly (the poor browser barely handled all that data).

Known issues / kludges:

  • Docker "veth" interfaces are blacklisted via a template regexp. Docker creates lots of them with names like "veth%container_id%", and they appear in the selector even if they were last alive months ago, so info about those interfaces is practically useless.

  • Disk IO is only displayed for whole disks such as /dev/sda, /dev/hda, /dev/vda. Per-partition IOPS produces far too many graphs, but if you really want it, you can edit the regexp in the "disk" template variable. Also, I'm not sure about DRBD and other "virtual" block devices.

  • IPv4/IPv6 data is not split yet (I own no IPv6 networks, so this is not high on the priority list).
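
As an alternative to hiding "veth" interfaces in the dashboard, they could probably be dropped on the collector side so they never reach InfluxDB. This is only a sketch, not what the dashboard itself does: it assumes Telegraf's generic tagdrop metric filter and the "interface" tag set by the net input, and the "veth*" glob is my guess at a pattern that matches Docker's naming.

```toml
# Sketch only: drop metrics for Docker veth interfaces before they are written.
[[inputs.net]]
  # collect data only about specific interfaces
  # interfaces = ["eth0"]

  # tagdrop is a generic Telegraf metric filter; it has to come last in the
  # plugin section because it is a sub-table.
  [inputs.net.tagdrop]
    # assumption: Docker interface names start with "veth"
    interface = ["veth*"]
```

With this in place the template regexp in the dashboard becomes a second line of defence rather than the only filter. Old veth names already stored in InfluxDB would still show up until they age out.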

Collector Configuration Details

Sample /etc/telegraf/telegraf.conf; just change urls = ["http://your_host:8086"] to point at your server and you're ready to go:

# Global tags can be specified here in key="value" format.
[global_tags]
  # dc = "us-east-1" # will tag all metrics with dc=us-east-1
  # rack = "1a"
  ## Environment variables can be used as tags, and throughout the config file
  # user = "$USER"


# Configuration for telegraf agent
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  quiet = false
  hostname = ""
  omit_hostname = false


### OUTPUT

# Configuration for influxdb server to send metrics to
[[outputs.influxdb]]
  urls = ["http://your_host:8086"]
  database = "telegraf_metrics"

  ## Retention policy to write to. Empty string writes to the default rp.
  retention_policy = ""
  ## Write consistency (clusters only), can be: "any", "one", "quorum", "all"
  write_consistency = "any"

  ## Write timeout (for the InfluxDB client), formatted as a string.
  ## If not provided, will default to 5s. 0s means no timeout (not recommended).
  timeout = "5s"
  # username = "telegraf"
  # password = "your_password"
  ## Set the user agent for HTTP POSTs (can be useful for log differentiation)
  # user_agent = "telegraf"
  ## Set UDP payload size, defaults to InfluxDB UDP Client default (512 bytes)
  # udp_payload = 512


# Read metrics about cpu usage
[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## Comment this line if you want the raw CPU time metrics
  fielddrop = ["time_*"]


# Read metrics about disk usage by mount point
[[inputs.disk]]
  ## By default, telegraf gathers stats for all mountpoints.
  ## Setting mountpoints will restrict the stats to the specified mountpoints.
  # mount_points = ["/"]

  ## Ignore some mountpoints by filesystem type. For example (dev)tmpfs (usually
  ## present on /run, /var/run, /dev/shm or /dev).
  ignore_fs = ["tmpfs", "devtmpfs"]


# Read metrics about disk IO by device
[[inputs.diskio]]
  ## By default, telegraf will gather stats for all devices including
  ## disk partitions.
  ## Setting devices will restrict the stats to the specified devices.
  # devices = ["sda", "sdb"]
  ## Uncomment the following line if you need disk serial numbers.
  # skip_serial_number = false


# Get kernel statistics from /proc/stat
[[inputs.kernel]]
  # no configuration


# Read metrics about memory usage
[[inputs.mem]]
  # no configuration


# Get the number of processes and group them by status
[[inputs.processes]]
  # no configuration


# Read metrics about swap memory usage
[[inputs.swap]]
  # no configuration


# Read metrics about system load & uptime
[[inputs.system]]
  # no configuration

# Read metrics about network interface usage
[[inputs.net]]
  # collect data only about specific interfaces
  # interfaces = ["eth0"]


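# Read TCP connection state and UDP socket counts
[[inputs.netstat]]
  # no configuration


# Gather IRQ metrics from /proc/interrupts and /proc/softirqs
[[inputs.interrupts]]
  # no configuration


# Read Linux sysctl fs metrics (file descriptors, inodes, etc.)
[[inputs.linux_sysctl_fs]]
  # no configuration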
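
The Metrics list mentions nstat and conntrack data, but the sample config has no sections for those two inputs. Below is a minimal sketch of what they might look like. The option names follow the Telegraf nstat and conntrack plugin documentation as I remember it, and the paths are the usual Linux defaults, so treat both as assumptions and check them against your Telegraf version.

```toml
# Sketch only: collect raw network stack counters (SNMP / netstat data)
[[inputs.nstat]]
  ## assumed default locations of the kernel counter files
  proc_net_netstat = "/proc/net/netstat"
  proc_net_snmp = "/proc/net/snmp"
  proc_net_snmp6 = "/proc/net/snmp6"
  ## also report counters whose value is zero
  dump_zeros = true


# Sketch only: collect conntrack table usage
[[inputs.conntrack]]
  ## file names to look for inside the directories below
  files = ["ip_conntrack_count", "ip_conntrack_max",
           "nf_conntrack_count", "nf_conntrack_max"]
  ## directories to search; which one exists depends on the kernel
  dirs = ["/proc/sys/net/ipv4/netfilter", "/proc/sys/net/netfilter"]
```

If the conntrack kernel module is not loaded, none of these files exist and the conntrack panels will simply stay empty.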
Dependencies: