Multiple Nvidia GPU Metrics
Nvidia GPU metrics based on the Prometheus metrics exported by github.com/utkuozdemir/nvidia_gpu_exporter.
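As a minimal sketch, the exporter behind these metrics can be run as a Docker container (the image name and default port 9835 are assumptions based on the exporter's documentation; adjust flags to your driver setup):

# Run nvidia_gpu_exporter; assumes Nvidia drivers and the Nvidia container toolkit are installed on the host
docker run -d \
  --name nvidia_gpu_exporter \
  --gpus all \
  -p 9835:9835 \
  utkuozdemir/nvidia_gpu_exporter
# Prometheus can then scrape GPU metrics from http://<host>:9835/metrics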
Run Nvidia GPU Monitoring with vLLM under Sbnb Linux in an Automated Way
This tutorial shows how to get vLLM up and running with Nvidia GPUs on a Bare Metal server in minutes.
We also set up monitoring with Grafana and run a vLLM benchmark. At the end, you will be able to see vLLM benchmark results and monitoring graphs like the ones below from your own Bare Metal server.
The graph below shows GPU load during a vLLM benchmark test lasting a few minutes, during which GPU load spikes to 100%. Memory allocation stays at 90%, as set in the vLLM config.
Prerequisites
- Boot Bare Metal server into Sbnb Linux. Read more at README-INSTALL.md.
- One or more Nvidia GPUs attached to the Bare Metal server
- A laptop with Tailscale configured, used to access the Bare Metal server for configuration.
Howto
1. Boot Bare Metal Server into Sbnb Linux
Boot the Bare Metal server into Sbnb Linux using the instructions in README-INSTALL.md. After booting, verify that the server appears in your Tailscale machine list.
For more details on automatic hostname assignments, refer to README-SERIAL-NUMBER.md.
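To verify from the laptop, a quick check with the Tailscale CLI (the hostname below is an example; yours is derived from the server's serial number):

# List tailnet machines and look for the Sbnb server
tailscale status | grep sbnb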
2. Connect Your Laptop to Tailscale
We will use a MacBook in this tutorial, but any machine, such as a Linux instance, should work the same.
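Once the Tailscale app or CLI is installed, joining the tailnet takes one command (a sketch; the GUI app achieves the same):

# Authenticate this laptop into your tailnet (opens a browser window for login)
tailscale up
# Confirm the laptop and the Sbnb server are both visible
tailscale status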
3. Download Tailscale Dynamic Inventory Script
curl https://raw.githubusercontent.com/m4wh6k/ansible-tailscale-inventory/refs/heads/main/ansible_tailscale_inventory.py -O
chmod +x ansible_tailscale_inventory.py
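As a sanity check, the script can be invoked directly; like any Ansible dynamic inventory, it supports --list (this assumes the Tailscale CLI is logged in on this machine):

# Print the discovered tailnet hosts as Ansible inventory JSON
./ansible_tailscale_inventory.py --list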
4. Pull Sbnb Linux Repo with All Required Grafana Configs and Ansible Playbooks
git clone https://github.com/sbnb-io/sbnb.git
cd sbnb/automation/
5. Configure VM Settings
Open the sbnb-example-vm.json file with an editor of your choice and configure the following parameters:
{
  "vcpu": 2,
  "mem": "4G",
  "tskey": "your-tskey-auth",
  "attach_gpus": true,
  "image_size": "10G"
}
Replace "your-tskey-auth"
with your actual Tailscale key.
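If you prefer to script the edit instead of using an editor, a hedged one-liner with jq (jq is assumed to be installed; the key value below is a placeholder):

# Insert the Tailscale auth key into the VM config non-interactively
export TSKEY="your-tskey-auth"   # replace with the real key from the Tailscale admin console
jq --arg key "$TSKEY" '.tskey = $key' sbnb-example-vm.json > tmp.json && mv tmp.json sbnb-example-vm.json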
6. Start VM with Ansible Playbook
export SBNB_HOSTS=sbnb-F6S0R8000719
ansible-playbook -i ./ansible_tailscale_inventory.py sbnb-start-vm.yaml
Once the VM starts, you should see it appear in the Tailscale network as sbnb-vm-VMID. For example, sbnb-vm-67f97659333f.
All Nvidia GPUs present in the system will be attached to this VM using the low-overhead vfio-pci mechanism.
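To confirm the passthrough on the Sbnb Linux host, one quick check (device IDs will differ per system):

# GPUs bound to vfio-pci are ready for VM passthrough
lspci -nnk | grep -A 3 -i nvidia
# Expect "Kernel driver in use: vfio-pci" for each GPU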
7. Configure vLLM
By default, run-vllm.yaml has the following settings:
--max-model-len 2048
--gpu-memory-utilization 0.9
--tensor-parallel-size 2
--max-num-seqs 32
--enforce-eager
--model "allenai/OLMo-1B-hf"
We set --tensor-parallel-size 2 because there are two Nvidia GPU cards in the system. We also choose the small model "allenai/OLMo-1B-hf" to fit the limited GPU memory in this setup (12 GB × 2 = 24 GB).
Please refer to the vLLM engine arguments for more details:
https://docs.vllm.ai/en/latest/serving/engine_args.html
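For reference, the container invocation the playbook performs would look roughly like this sketch (the vllm/vllm-openai image, port mapping, and --ipc=host flag are assumptions; consult run-vllm.yaml for the exact command):

# Approximate equivalent of run-vllm.yaml: serve the model via the OpenAI-compatible API
docker run -d --gpus all --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai \
  --model "allenai/OLMo-1B-hf" \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.9 \
  --tensor-parallel-size 2 \
  --max-num-seqs 32 \
  --enforce-eager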
8. Start vLLM in the VM
Run on the laptop:
export SBNB_HOSTS=sbnb-vm-67f97659333f
for playbook in install-docker.yaml install-nvidia.yaml install-nvidia-container-toolkit.yaml run-vllm.yaml; do
ansible-playbook -i ./ansible_tailscale_inventory.py $playbook
done
Note that this time we set SBNB_HOSTS to the hostname of the VM we started in the previous step.
These commands will install Docker, Nvidia drivers, Nvidia container toolkit, and vLLM into the VM.
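Once the playbooks finish, the server can be smoke-tested from the laptop over Tailscale (assuming vLLM's default OpenAI-compatible port 8000 is exposed):

# List the served model
curl http://sbnb-vm-67f97659333f:8000/v1/models
# Send a tiny completion request
curl http://sbnb-vm-67f97659333f:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "allenai/OLMo-1B-hf", "prompt": "Hello", "max_tokens": 8}'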
Congratulations! Now you have vLLM up and running.
Run vLLM Benchmark
Run on the laptop:
ansible-playbook -i ./ansible_tailscale_inventory.py run-vllm-benchmark.yaml
Example Output of the Benchmark
=========== Serving Benchmark Result ============
Successful requests: 10000
Benchmark duration (s): 1650.40
Total input tokens: 10240000
Total generated tokens: 1249339
Request throughput (req/s): 6.06
Output token throughput (tok/s): 756.99
Total Token throughput (tok/s): 6961.54
---------------Time to First Token----------------
Mean TTFT (ms): 823866.75
Median TTFT (ms): 822703.42
P99 TTFT (ms): 1627880.13
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 41.80
Median TPOT (ms): 41.25
P99 TPOT (ms): 64.80
---------------Inter-token Latency----------------
Mean ITL (ms): 41.37
Median ITL (ms): 18.53
P99 ITL (ms): 179.15
==================================================
Display GPU Utilization in Grafana
Follow this guide:
README-NVIDIA-GPU-FRYER-GRAFANA.md
✅ Summary
You now have:
- A GPU-enabled VM on Bare Metal running Sbnb Linux
- vLLM deployed automatically via Ansible
- Full monitoring via Grafana
- vLLM benchmark results confirming throughput and performance
Happy experimenting with vLLM!