Linux Server integration for Grafana Cloud

Linux is a family of open-source Unix-like operating systems based on the Linux kernel. Linux is the leading operating system on servers, and is one of the most prominent examples of free and open-source software collaboration.

The Linux Server integration for Grafana Cloud collects operating system metrics from a node, including CPU usage, load average, memory usage, and disk and network I/O, using the node_exporter integration. It also lets you use the agent to scrape logs.

This integration includes 22 useful alerts and 7 pre-built dashboards to help monitor and visualize Linux Server metrics and logs.

Before you begin

Each Linux node that you want to monitor must run its own dedicated Grafana Agent.

If you want to monitor more than one Linux node with this integration, we recommend using the Ansible collection for Grafana Cloud to deploy Grafana Agent to multiple machines, as described in this documentation.

Install Linux Server integration for Grafana Cloud

  1. In your Grafana Cloud stack, click Connections in the left-hand menu.
  2. Find Linux Server and click its tile to open the integration.
  3. Review the prerequisites in the Configuration Details tab and set up Grafana Agent to send Linux Server metrics and logs to your Grafana Cloud instance.
  4. Click Install to add this integration’s pre-built dashboards and alerts to your Grafana Cloud instance. You can then start monitoring your Linux Server setup.

Post-install configuration for the Linux Server integration

This integration is configured to work with the node_exporter, which is embedded in the Grafana Agent.

Enable the integration by manually adding the provided snippets to your agent configuration file.

Note: The instance label must uniquely identify the node being scraped. Also, ensure each deployed Grafana Agent has a configuration that matches the node it is deployed to.

This integration supports metrics and logs from Linux. If you want to monitor your Linux node logs, there are 3 options. You can:

  • scrape the journal
  • scrape your OS log files directly
  • scrape both your journal and OS log files

We recommend that you enable journal scraping because it comes with a unit label that can be used to filter logs on the dashboards. Configuration snippets for both journal scraping and direct log file scraping are provided.

To show correlated logs and metrics in your dashboards, as a single pane of glass, ensure the following:

  • The job and instance label values must match between the node_exporter integration and the logs scrape config in your agent configuration file.
  • The job label must be set to integrations/node_exporter (already configured in the snippets).
  • The instance label must be set to a value that uniquely identifies your Linux node. Replace the default <your-instance-name> placeholder with a value that fits your environment; it must be set manually. If you use localhost for multiple nodes, the dashboards cannot filter correctly by instance.
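
The three conditions above can be checked mechanically. The following is a minimal sketch and not part of the integration: it assumes the agent configuration has already been parsed into plain Python dictionaries (for example with a YAML library), the function names are hypothetical, and `web-01` is a made-up instance name. It only inspects the labels on log scrape configs and the instance relabel rule of the node_exporter block.

```python
EXPECTED_JOB = "integrations/node_exporter"
PLACEHOLDER = "<your-instance-name>"


def metrics_instance(config):
    """Return the instance value set by the node_exporter relabel_configs, or None."""
    node_exporter = config.get("integrations", {}).get("node_exporter", {})
    for rule in node_exporter.get("relabel_configs", []):
        if rule.get("target_label") == "instance":
            return rule.get("replacement")
    return None


def log_label_sets(config):
    """Yield (job_name, labels) for each journal and static log scrape config."""
    for logs_config in config.get("logs", {}).get("configs", []):
        for scrape in logs_config.get("scrape_configs", []):
            if "journal" in scrape:
                yield scrape["job_name"], scrape["journal"].get("labels", {})
            for static in scrape.get("static_configs", []):
                yield scrape["job_name"], static.get("labels", {})


def find_label_problems(config):
    """Return a list of problems; an empty list means the labels line up."""
    problems = []
    instance = metrics_instance(config)
    if instance is None or instance == PLACEHOLDER:
        problems.append("node_exporter: instance label is unset or still the placeholder")
    for job_name, labels in log_label_sets(config):
        if labels.get("job") != EXPECTED_JOB:
            problems.append(f"{job_name}: job label must be {EXPECTED_JOB}")
        if labels.get("instance") != instance:
            problems.append(f"{job_name}: instance label does not match node_exporter")
    return problems


# Hypothetical parsed configuration for a node named "web-01".
example_config = {
    "integrations": {
        "node_exporter": {
            "enabled": True,
            "relabel_configs": [{"replacement": "web-01", "target_label": "instance"}],
        }
    },
    "logs": {
        "configs": [
            {
                "scrape_configs": [
                    {
                        "job_name": "integrations/node_exporter_journal_scrape",
                        "journal": {"labels": {"instance": "web-01", "job": EXPECTED_JOB}},
                    }
                ]
            }
        ]
    },
}

print(find_label_problems(example_config))  # prints []
```

A check like this could run before a configuration is rolled out to many nodes, where a leftover placeholder or a copied `localhost` value is easy to miss.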

For a full description of configuration options, see how to configure the node_exporter_config block in the agent documentation.

Configuration snippets for Grafana Agent

Under the integrations section, insert the following lines and replace the placeholder values to match your environment:

yaml
  node_exporter:
    enabled: true
    # disable unused collectors
    disable_collectors:
      - ipvs # high cardinality on kubelet
      - btrfs
      - infiniband
      - xfs
      - zfs
    # exclude dynamic interfaces
    netclass_ignored_devices: "^(veth.*|cali.*|[a-f0-9]{15})$"
    netdev_device_exclude: "^(veth.*|cali.*|[a-f0-9]{15})$"
    # exclude tmpfs and other special-purpose filesystem types
    filesystem_fs_types_exclude: "^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|tmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$"
    # drop extensive scrape statistics
    metric_relabel_configs:
    - action: drop
      regex: node_scrape_collector_.+
      source_labels: [__name__]
    relabel_configs:
    - replacement: '<your-instance-name>'
      target_label: instance
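
To see what the exclusion patterns in this snippet filter out, the sketch below applies them to a few sample names. It is an illustration under stated assumptions: the interface, filesystem, and metric names are made up for the example, and Python's `re` module stands in for the regular expression engine the agent uses (the patterns here rely only on syntax the two share). The metric pattern is matched against the whole name, because Prometheus-style relabeling anchors its regular expressions.

```python
import re

# Patterns copied from the configuration snippet.
NET_DEVICE_EXCLUDE = re.compile(r"^(veth.*|cali.*|[a-f0-9]{15})$")
FS_TYPES_EXCLUDE = re.compile(
    r"^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|tmpfs"
    r"|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore"
    r"|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$"
)
DROPPED_METRICS = re.compile(r"node_scrape_collector_.+")


def kept(names, pattern):
    """Return the names that the exclusion pattern does not fully match."""
    return [name for name in names if not pattern.fullmatch(name)]


# Hypothetical sample values.
interfaces = ["eth0", "lo", "veth3f2a1b", "cali9d8c7b6a5e4", "0a1b2c3d4e5f6a7"]
fs_types = ["ext4", "xfs", "tmpfs", "overlay", "cgroup2"]
metrics = [
    "node_load1",
    "node_scrape_collector_success",
    "node_scrape_collector_duration_seconds",
]

print(kept(interfaces, NET_DEVICE_EXCLUDE))  # prints ['eth0', 'lo']
print(kept(fs_types, FS_TYPES_EXCLUDE))      # prints ['ext4', 'xfs']
print(kept(metrics, DROPPED_METRICS))        # prints ['node_load1']
```

Running your own interface or filesystem names through the same patterns is a quick way to confirm that nothing you want to monitor is excluded.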

Under logs.configs.scrape_configs, insert the following lines, adjusted to match your environment:

yaml
    - job_name: integrations/node_exporter_journal_scrape
      journal:
        max_age: 24h
        labels:
          instance: '<your-instance-name>'
          job: integrations/node_exporter
      relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'
      - source_labels: ['__journal__boot_id']
        target_label: 'boot_id'
      - source_labels: ['__journal__transport']
        target_label: 'transport'
      - source_labels: ['__journal_priority_keyword']
        target_label: 'level'
    - job_name: integrations/node_exporter_direct_scrape
      static_configs:
      - targets:
        - localhost
        labels:
          instance: '<your-instance-name>'
          __path__: /var/log/{syslog,messages,*.log}
          job: integrations/node_exporter
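
The `__path__` value in the direct scrape job is a glob with brace alternatives. The sketch below approximates which files such a pattern selects. It is not the agent's matcher: it assumes that wildcards appear only in the file name and that `*` does not cross directory boundaries, and the helper names are hypothetical. Verify the actual behavior against your agent version.

```python
import fnmatch
import posixpath
import re

LOG_PATH_PATTERN = "/var/log/{syslog,messages,*.log}"


def expand_braces(pattern):
    """Expand {a,b,c} alternatives into a list of plain glob patterns."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[: match.start()], pattern[match.end():]
    expanded = []
    for alternative in match.group(1).split(","):
        expanded.extend(expand_braces(head + alternative + tail))
    return expanded


def is_scraped(path, pattern=LOG_PATH_PATTERN):
    """Return True if path matches the pattern, with wildcards limited to the file name."""
    directory, file_name = posixpath.split(path)
    for candidate in expand_braces(pattern):
        candidate_dir, candidate_name = posixpath.split(candidate)
        if directory == candidate_dir and fnmatch.fnmatchcase(file_name, candidate_name):
            return True
    return False


print(expand_braces(LOG_PATH_PATTERN))
# prints ['/var/log/syslog', '/var/log/messages', '/var/log/*.log']
print(is_scraped("/var/log/auth.log"))          # prints True
print(is_scraped("/var/log/nginx/access.log"))  # prints False
```

Under these assumptions, logs in subdirectories of /var/log are not picked up by the default pattern, so an application that writes there would need its own entry.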

Full example configuration for Grafana Agent

Refer to the following Grafana Agent configuration for a complete example that contains all the snippets used for the Linux Server integration. This example also includes metrics that are sent to monitor your Grafana Agent instance.

yaml
integrations:
  prometheus_remote_write:
  - basic_auth:
      password: <your_prom_pass>
      username: <your_prom_user>
    url: <your_prom_url>
  agent:
    enabled: true
    relabel_configs:
    - action: replace
      source_labels:
      - agent_hostname
      target_label: instance
    - action: replace
      target_label: job
      replacement: "integrations/agent-check"
    metric_relabel_configs:
    - action: keep
      regex: (prometheus_target_sync_length_seconds_sum|prometheus_target_scrapes_.*|prometheus_target_interval.*|prometheus_sd_discovered_targets|agent_build.*|agent_wal_samples_appended_total|process_start_time_seconds)
      source_labels:
      - __name__
  # Add here any snippet that belongs to the `integrations` section.
  # For a correct indentation, paste snippets copied from Grafana Cloud at the beginning of the line.
  node_exporter:
    enabled: true
    # disable unused collectors
    disable_collectors:
      - ipvs # high cardinality on kubelet
      - btrfs
      - infiniband
      - xfs
      - zfs
    # exclude dynamic interfaces
    netclass_ignored_devices: "^(veth.*|cali.*|[a-f0-9]{15})$"
    netdev_device_exclude: "^(veth.*|cali.*|[a-f0-9]{15})$"
    # exclude tmpfs and other special-purpose filesystem types
    filesystem_fs_types_exclude: "^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|tmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$"
    # drop extensive scrape statistics
    metric_relabel_configs:
    - action: drop
      regex: node_scrape_collector_.+
      source_labels: [__name__]
    relabel_configs:
    - replacement: '<your-instance-name>'
      target_label: instance
logs:
  configs:
  - clients:
    - basic_auth:
        password: <your_loki_pass>
        username: <your_loki_user>
      url: <your_loki_url>
    name: integrations
    positions:
      filename: /tmp/positions.yaml
    scrape_configs:
      # Add here any snippet that belongs to the `logs.configs.scrape_configs` section.
      # For a correct indentation, paste snippets copied from Grafana Cloud at the beginning of the line.
    - job_name: integrations/node_exporter_journal_scrape
      journal:
        max_age: 24h
        labels:
          instance: '<your-instance-name>'
          job: integrations/node_exporter
      relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'
      - source_labels: ['__journal__boot_id']
        target_label: 'boot_id'
      - source_labels: ['__journal__transport']
        target_label: 'transport'
      - source_labels: ['__journal_priority_keyword']
        target_label: 'level'
    - job_name: integrations/node_exporter_direct_scrape
      static_configs:
      - targets:
        - localhost
        labels:
          instance: '<your-instance-name>'
          __path__: /var/log/{syslog,messages,*.log}
          job: integrations/node_exporter
metrics:
  configs:
  - name: integrations
    remote_write:
    - basic_auth:
        password: <your_prom_pass>
        username: <your_prom_user>
      url: <your_prom_url>
    scrape_configs:
      # Add here any snippet that belongs to the `metrics.configs.scrape_configs` section.
      # For a correct indentation, paste snippets copied from Grafana Cloud at the beginning of the line.
  global:
    scrape_interval: 60s
  wal_directory: /tmp/grafana-agent-wal

Dashboards

The Linux Server integration installs the following dashboards in your Grafana Cloud instance to help monitor your system.

  • Linux node / CPU and system
  • Linux node / filesystem and disks
  • Linux node / fleet overview
  • Linux node / logs
  • Linux node / memory
  • Linux node / network
  • Linux node / overview

Node overview dashboard

Fleet overview dashboard

Drill down dashboards: Network interfaces

Alerts

The Linux Server integration includes the following useful alerts:

node-exporter-filesystem

Alert: Description
NodeFilesystemAlmostOutOfSpaceWarning: Filesystem has less than 5% space left.
NodeFilesystemAlmostOutOfSpaceCritical: Filesystem has less than 3% space left.
NodeFilesystemFilesFillingUpWarning: Filesystem is predicted to run out of inodes within the next 24 hours.
NodeFilesystemFilesFillingUpCritical: Filesystem is predicted to run out of inodes within the next 4 hours.
NodeFilesystemAlmostOutOfFilesWarning: Filesystem has less than 5% inodes left.
NodeFilesystemAlmostOutOfFilesCritical: Filesystem has less than 3% inodes left.
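
The space thresholds above are percentages of filesystem capacity. As a worked example, the sketch below derives the percentage from the two values reported as node_filesystem_avail_bytes and node_filesystem_size_bytes and maps it to a severity. It models only the 5% and 3% thresholds listed here; the shipped alert rules may include further conditions, such as how long the condition must hold, which this sketch ignores. The function names are hypothetical.

```python
def percent_available(avail_bytes, size_bytes):
    """Return the share of the filesystem that is still available, in percent."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    return 100.0 * avail_bytes / size_bytes


def space_severity(avail_bytes, size_bytes, warning_pct=5.0, critical_pct=3.0):
    """Map available space to 'critical', 'warning', or 'ok' using the listed thresholds."""
    pct = percent_available(avail_bytes, size_bytes)
    if pct < critical_pct:
        return "critical"
    if pct < warning_pct:
        return "warning"
    return "ok"


GIB = 1024 ** 3
print(space_severity(4 * GIB, 100 * GIB))  # prints warning
print(space_severity(2 * GIB, 100 * GIB))  # prints critical
print(space_severity(5 * GIB, 100 * GIB))  # prints ok
```

The same arithmetic applies to the inode alerts if you substitute the values of node_filesystem_files_free and node_filesystem_files.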

node-exporter

Alert: Description
NodeNetworkReceiveErrsWarning: Network interface is reporting many receive errors.
NodeNetworkTransmitErrsWarning: Network interface is reporting many transmit errors.
NodeHighNumberConntrackEntriesUsedWarning: Number of conntrack entries is getting close to the limit.
NodeTextFileCollectorScrapeErrorWarning: Node Exporter text file collector failed to scrape.
NodeClockSkewDetectedWarning: Clock skew detected.
NodeClockNotSynchronisingWarning: Clock not synchronising.
NodeRAIDDegradedCritical: RAID Array is degraded.
NodeRAIDDiskFailureWarning: Failed device in RAID array.
NodeFileDescriptorLimitWarning: Kernel is predicted to exhaust file descriptors limit soon.
NodeFileDescriptorLimitCritical: Kernel is predicted to exhaust file descriptors limit soon.
NodeCPUHighUsageInfo: High CPU usage.
NodeSystemSaturationWarning: System saturated, load per core is very high.
NodeMemoryMajorPagesFaultsWarning: Memory major page faults are occurring at a very high rate.
NodeMemoryHighUtilizationWarning: Host is running out of memory.
NodeDiskIOSaturationWarning: Disk IO queue is high.
NodeSystemdServiceFailedWarning: Systemd service has entered failed state.

Metrics

The most important metrics provided by the Linux Server integration, which are used on the pre-built dashboards and Prometheus alerts, are as follows:

  • node_arp_entries
  • node_boot_time_seconds
  • node_context_switches_total
  • node_cpu_seconds_total
  • node_disk_io_time_seconds_total
  • node_disk_io_time_weighted_seconds_total
  • node_disk_read_bytes_total
  • node_disk_read_time_seconds_total
  • node_disk_reads_completed_total
  • node_disk_write_time_seconds_total
  • node_disk_writes_completed_total
  • node_disk_written_bytes_total
  • node_filefd_allocated
  • node_filefd_maximum
  • node_filesystem_avail_bytes
  • node_filesystem_device_error
  • node_filesystem_files
  • node_filesystem_files_free
  • node_filesystem_readonly
  • node_filesystem_size_bytes
  • node_intr_total
  • node_load1
  • node_load15
  • node_load5
  • node_md_disks
  • node_md_disks_required
  • node_memory_Active_anon_bytes
  • node_memory_Active_bytes
  • node_memory_Active_file_bytes
  • node_memory_AnonHugePages_bytes
  • node_memory_AnonPages_bytes
  • node_memory_Bounce_bytes
  • node_memory_Buffers_bytes
  • node_memory_Cached_bytes
  • node_memory_CommitLimit_bytes
  • node_memory_Committed_AS_bytes
  • node_memory_DirectMap1G_bytes
  • node_memory_DirectMap2M_bytes
  • node_memory_DirectMap4k_bytes
  • node_memory_Dirty_bytes
  • node_memory_HugePages_Free
  • node_memory_HugePages_Rsvd
  • node_memory_HugePages_Surp
  • node_memory_HugePages_Total
  • node_memory_Hugepagesize_bytes
  • node_memory_Inactive_anon_bytes
  • node_memory_Inactive_bytes
  • node_memory_Inactive_file_bytes
  • node_memory_Mapped_bytes
  • node_memory_MemAvailable_bytes
  • node_memory_MemFree_bytes
  • node_memory_MemTotal_bytes
  • node_memory_SReclaimable_bytes
  • node_memory_SUnreclaim_bytes
  • node_memory_ShmemHugePages_bytes
  • node_memory_ShmemPmdMapped_bytes
  • node_memory_Shmem_bytes
  • node_memory_Slab_bytes
  • node_memory_SwapTotal_bytes
  • node_memory_VmallocChunk_bytes
  • node_memory_VmallocTotal_bytes
  • node_memory_VmallocUsed_bytes
  • node_memory_WritebackTmp_bytes
  • node_memory_Writeback_bytes
  • node_netstat_Icmp6_InErrors
  • node_netstat_Icmp6_InMsgs
  • node_netstat_Icmp6_OutMsgs
  • node_netstat_Icmp_InErrors
  • node_netstat_Icmp_InMsgs
  • node_netstat_Icmp_OutMsgs
  • node_netstat_IpExt_InOctets
  • node_netstat_IpExt_OutOctets
  • node_netstat_TcpExt_ListenDrops
  • node_netstat_TcpExt_ListenOverflows
  • node_netstat_TcpExt_TCPSynRetrans
  • node_netstat_Tcp_InErrs
  • node_netstat_Tcp_InSegs
  • node_netstat_Tcp_OutRsts
  • node_netstat_Tcp_OutSegs
  • node_netstat_Tcp_RetransSegs
  • node_netstat_Udp6_InDatagrams
  • node_netstat_Udp6_InErrors
  • node_netstat_Udp6_NoPorts
  • node_netstat_Udp6_OutDatagrams
  • node_netstat_Udp6_RcvbufErrors
  • node_netstat_Udp6_SndbufErrors
  • node_netstat_UdpLite_InErrors
  • node_netstat_Udp_InDatagrams
  • node_netstat_Udp_InErrors
  • node_netstat_Udp_NoPorts
  • node_netstat_Udp_OutDatagrams
  • node_netstat_Udp_RcvbufErrors
  • node_netstat_Udp_SndbufErrors
  • node_network_carrier
  • node_network_info
  • node_network_mtu_bytes
  • node_network_receive_bytes_total
  • node_network_receive_compressed_total
  • node_network_receive_drop_total
  • node_network_receive_errs_total
  • node_network_receive_fifo_total
  • node_network_receive_multicast_total
  • node_network_receive_packets_total
  • node_network_speed_bytes
  • node_network_transmit_bytes_total
  • node_network_transmit_compressed_total
  • node_network_transmit_drop_total
  • node_network_transmit_errs_total
  • node_network_transmit_fifo_total
  • node_network_transmit_multicast_total
  • node_network_transmit_packets_total
  • node_network_transmit_queue_length
  • node_network_up
  • node_nf_conntrack_entries
  • node_nf_conntrack_entries_limit
  • node_os_info
  • node_sockstat_FRAG6_inuse
  • node_sockstat_FRAG_inuse
  • node_sockstat_RAW6_inuse
  • node_sockstat_RAW_inuse
  • node_sockstat_TCP6_inuse
  • node_sockstat_TCP_alloc
  • node_sockstat_TCP_inuse
  • node_sockstat_TCP_mem
  • node_sockstat_TCP_mem_bytes
  • node_sockstat_TCP_orphan
  • node_sockstat_TCP_tw
  • node_sockstat_UDP6_inuse
  • node_sockstat_UDPLITE6_inuse
  • node_sockstat_UDPLITE_inuse
  • node_sockstat_UDP_inuse
  • node_sockstat_UDP_mem
  • node_sockstat_UDP_mem_bytes
  • node_sockstat_sockets_used
  • node_softnet_dropped_total
  • node_softnet_processed_total
  • node_softnet_times_squeezed_total
  • node_systemd_unit_state
  • node_textfile_scrape_error
  • node_time_zone_offset_seconds
  • node_timex_estimated_error_seconds
  • node_timex_maxerror_seconds
  • node_timex_offset_seconds
  • node_timex_sync_status
  • node_uname_info
  • node_vmstat_oom_kill
  • node_vmstat_pgfault
  • node_vmstat_pgmajfault
  • node_vmstat_pgpgin
  • node_vmstat_pgpgout
  • node_vmstat_pswpin
  • node_vmstat_pswpout
  • process_max_fds
  • process_open_fds
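
Many of these metrics are cumulative counters, so dashboards work with the change between two samples and not with the raw value. As a rough illustration, the sketch below estimates CPU utilization from two readings of node_cpu_seconds_total, grouped by its mode label. The sample numbers and the function name are hypothetical, counter resets are not handled, and the pre-built dashboards may compute CPU usage differently.

```python
def cpu_busy_fraction(before, after):
    """Return the busy share of CPU time between two counter samples.

    before and after map a CPU mode (for example "idle" or "user") to the
    cumulative seconds reported for that mode, summed over all cores.
    """
    deltas = {mode: after[mode] - before.get(mode, 0.0) for mode in after}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("samples must cover a positive amount of CPU time")
    # Everything that is not idle time counts as busy in this sketch.
    return 1.0 - deltas.get("idle", 0.0) / total


# Hypothetical samples taken some time apart on the same node.
sample_before = {"idle": 1000.0, "user": 200.0, "system": 100.0}
sample_after = {"idle": 1045.0, "user": 210.0, "system": 105.0}

print(cpu_busy_fraction(sample_before, sample_after))  # prints 0.25
```

Here 60 seconds of CPU time elapsed between the samples, 45 of them idle, so the node was busy for a quarter of the interval.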

Changelog

md
# 1.2.3 - December 2023

* Accept `integrations/unix` for compatibility with default flow mode node_exporter job name.

# 1.2.2 - December 2023

* Fix issues with showing data on dashboards when `cluster` label has no value.

# 1.2.1 - December 2023

* Fix queries for memoryBuffers and memoryCached metrics
* Update network traffic panels to show only interfaces that had traffic
* Update network errors/drops panels to show only values greater than 0.

# 1.2.0 - November 2023

* Dashboards prefixes are changed to 'Linux node/ '
* Add new Loki based annotations:
    * Service failed
    * Critical system event
    * Session (ssh,console) opened/closed
* Apply panel changes, some examples:
    * Use Sentence case in titles
    * Memory TS panel: Show only 'Memory total' and 'Memory used' by default
    * CPU usage TS panel: Use Blue-Yellow-Red color Schema
    * Add OS and group labels(job, cluster) as columns in Fleet overview table
* NodeSystemSaturation alert severity is set to warning
* Attach integration status panel to fleet and logs dashboards

# 1.1.2 - August 2023

* Add regex filter for logs datasource

# 1.1.1 - July 2023

* New Filter Metrics option for configuring the Grafana Agent, which saves on metrics cost by dropping any metric not used by this integration. Beware that anything custom built using metrics that are not on the snippet will stop working.

# 1.1.0 - June 2023

* This update introduces generic logs dashboard 'Node Exporter / Node Logs'
    * Drop log panels 'Node Overview' dashboard

# 1.0.1 - June 2023

* This update includes the following, by updating to the latest mixin:
    * Panel description typos have been fixed
    * Incorrect data links in the "Node Fleet Overview" Dashboard now correctly include the dashboard selector

# 1.0.0 - April 2023

* This update introduces 3-tier view of linux nodes:
    * TOP: Fleet view: see group of your linux instances at once
    * Overview of the specific node: see specific node at a glance
    * Drill down: Set of dashboards for deep analysis using advanced metrics (Memory, CPU and System, Filesystem and Disk, Networking)
    * Links and data links are provided for better navigation between views
* Update agent's filter config in docs, to reduce number of timeseries generated per node
    * Metrics filter instructions to exclude dynamic network devices, temp filesystems and extended scrape statistics
* Remove USE dashboards
* Convert all graphs to timeseries panels
* Add information row
* New alerts
* Split alerts into two alert groups
* Annotations for events: Reboot, OOMkill, and 'Kernel update'

# 0.0.8 - October 2022

* Update upstream node_exporter mixin: [ba8c043079b38748e57adf1f80e3d86a4060efc5](https://github.com/prometheus/node_exporter/commit/ba8c043079b38748e57adf1f80e3d86a4060efc5)
* Enable multicluster dashboards for use in kubernetes.
* Add direct log file scrape to the agent snippets

# 0.0.7 - September 2022

* Remove source_address from relabel_configs

# 0.0.6 - May 2022

* Reverse fsSpaceAvailableCriticalThreshold and fsSpaceAvailableWarningThreshold
* Update units for disk and networking panels

# 0.0.5 - May 2022

* Update 'Disk Space Usage' panel to table format

# 0.0.4 - April 2022

* Fixed alerts and recording rules by providing proper nodeSelector

# 0.0.3 - February 2022

* Added logs support from Loki datasource

# 0.0.2 - October 2021

* Update all rate queries to use `$__rate_interval`

# 0.0.1 - June 2020

* Initial release

Cost

By connecting your Linux Server instance to Grafana Cloud, you might incur charges. To view information on the number of active series that your Grafana Cloud account uses for metrics included in each Cloud tier, see Active series and dpm usage and Cloud tier pricing.