Learn OpenTelemetry tracing through a grand strategy game: introducing Game of Traces
A trace always remembers!
Okay, okay. I will try to keep my Game of Thrones references to a minimum throughout this post, but there is a lot of truth to that statement.
In observability, a trace is the “when” and “where” of telemetry signals, allowing us to track the state of interactions between services within a microservice architecture. This makes traces the ideal observability signal for discovering bottlenecks and interconnection issues.
The challenge with traces, however, is that they often come with a steep learning curve – particularly for those who are new to observability. This is mostly due to the fact that it’s hard to relate traces to a real-world concept unless we talk about logistics or a road trip. But where is the fun in that?
Here at Grafana Labs, I’m a developer advocate, and a big part of my job is to learn and teach the pillars of observability to our end users. In my personal life, I also love grand strategy games: Total War: Warhammer III, Sid Meier’s Civilization, and Sins of a Solar Empire (which is more of a hybrid). You name it, I’m there.
Recently, I’ve been blending these two things together to create strategy games that can teach engineers the basics of observability. I did this last year (alongside my fellow dev advocate Tom Glenn) to create Quest World, an interactive game to learn about OpenTelemetry, Grafana Alloy, and the Grafana LGTM (Loki for logs, Grafana for visualization, Tempo for traces, Mimir for metrics) Stack.
And now, I’m applying the same approach to tracing.
In this post, I’ll walk you through Game of Traces, a grand strategy game you can play to learn the key concepts of OpenTelemetry tracing — and, of course, heroically defend a few kingdoms along the way.

But first, how does it all work?
In grand strategy games, you build up forces within settlements and send them to capture new towns and territories in a path of conquest to the enemy capital. This got me thinking: what if we built a grand strategy game where each capital and village was a service, and we instrumented each with OpenTelemetry (OTel) tracing? We could track the grand battles as they unfold: the wins, the losses, and the interactions with the AI opponent. Thus, a Game of Traces was born!
I normally leave how it works to the end of the blog post, but it’s important to understand the underlying mechanism here to see how traces track the game state.

Let’s break down the components:
- Capitals are the seats of power within the game. This is where the player can collect resources and spend those resources to create their armies. In this game, only capitals can make armies. A capital is essentially a Python Flask server with a series of API endpoints that can be called to create different game events (more on this later). Capitals are connected to a series of villages via a series of API endpoints, which generate our game map. Once a capital is lost, the game is over for that player.
- Villages are essentially minor capitals, but with some key differences. They start neutral with a predefined army. Once captured, they automatically generate resources that can be sent back to the capital (you will see how this mechanism works when you play the game). They cannot generate armies, but can hold player armies from the capital as defense. They are also Flask servers connected to one another — in fact, they are built from the same source code. We define the role of a Flask server using the game config.
- The war map is how the player interacts with the series of Flask servers. It is itself a Flask server, which hosts the grand map UI. The war map supports one-on-one human battles and one-on-one battles between a human and an AI opponent.
- The AI is another Flask server which acts as its own war map. It provides the AI with the ability to interact with its capital and villages, and plays a game against you based on a weighted decision tree depending on how far you are into the game.
- Last but not least, OpenTelemetry and the open source Grafana LGTM Stack. We have manually instrumented each Flask server within our architecture to generate metrics, logs, and traces within the OpenTelemetry protocol (OTLP) format. We use the telemetry signals to monitor key aspects of the game, such as army marching, returning resources to the capital, and AI interactions. Spoiler alert: what you’ll learn is that traces tie all these signals together as we monitor our army’s progress.
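To make the AI's weighted decision making a little more concrete, here is a minimal sketch of how weights could shift as a game progresses. It is not the game's actual code: the action names, the weights, and the late_game_turn cut-off are all made up for illustration, and it uses a simple weighted random choice rather than a full decision tree.

```python
import random

# Hypothetical actions and weights, chosen only for illustration.
EARLY_WEIGHTS = {"collect_resources": 5, "create_army": 3, "capture_village": 2, "all_out_attack": 0}
LATE_WEIGHTS = {"collect_resources": 1, "create_army": 3, "capture_village": 3, "all_out_attack": 3}


def action_weights(turn, late_game_turn=20):
    """Blend the early and late weights linearly as the game progresses."""
    progress = min(max(turn / late_game_turn, 0.0), 1.0)
    return {
        action: (1 - progress) * EARLY_WEIGHTS[action] + progress * LATE_WEIGHTS[action]
        for action in EARLY_WEIGHTS
    }


def choose_action(turn, rng):
    """Pick the next action at random, biased by the weights for this turn."""
    weights = action_weights(turn)
    actions = list(weights)
    return rng.choices(actions, weights=[weights[a] for a in actions], k=1)[0]


rng = random.Random(42)  # seeded so a run is repeatable
opening_moves = [choose_action(0, rng) for _ in range(5)]
```

In this sketch the all-out attack has a weight of zero on the first turn, so it can never appear among the opening moves, and it becomes more likely the longer the game runs.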
Game setup
To play the game locally, all you need is Docker. You can find the setup instructions here. Or, if you prefer not to deploy the game locally, you can run the game online using Killercoda.
Whether you are playing the game locally or via Killercoda, make sure you import the included V2 Dashboard. It shows you all telemetry signals in one user interface and also showcases some of the new dynamic dashboard features introduced in Grafana 12.

Playing the game
Okay, we’d better play this game before winter arrives. There are two primary interfaces:
- The game itself, which can be found at http://localhost:8080 once you’ve run docker compose (make sure you have followed the steps in Game setup).
- The Grafana dashboard you installed during setup.
Let’s focus on the game UI first. You will be asked to select your faction.

Once selected, you can enter the game. You will be greeted by the war map.

I recommend enabling the AI opponent if you are playing by yourself.
It’s time to play the game:
- Collect resources
- Create armies
- Move armies to villages to capture them
- Return resources to your capital to create more armies
- Use all-out attack with a big enough army at your capital to send your troops on a war path to the enemy’s capital (a personal favorite for generating interesting traces)
Once you have won or lost, check back in and read the next section of the blog!
I told you it was a game of traces!
So you are back! You’re either the victor or the loser of the great war between the northern and southern kingdoms. Let’s see how we can use our battles to understand how tracing works.
A common move you would have taken is moving your army to take the villages. Behind the scenes, this is being represented as a trace. Let’s find one of these moves within the Player Decisions table. Select the trace ID, which will take you to the Grafana traces view panel.

The trace allows us to understand the order in which our services are called behind the scenes to carry out the move. A trace is made up of a collection of spans.
We start at the top with the root span, which is the parent of every other span in the trace. The root span comes from our war_map service, which sends a request to our northern-capital to initiate a move order to village-2. We can see that the northern-capital calls the receive_army endpoint of village-2 to initiate this move. The color of each span indicates which service it belongs to. As you can see, context is very important when it comes to populating spans within a trace. If the current span has no context from the previous one, the chain of events is broken, and we lose the ability to track API and function calls across services.
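To show why that context matters, here is a small standard-library sketch of context propagation between services. It models the W3C traceparent header (version, trace ID, span ID, flags), which is the default wire format OpenTelemetry uses for HTTP calls. The SpanContext class and the start_span, inject, and extract helpers are hypothetical stand-ins for what the SDK does for you, not the game's code.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanContext:
    trace_id: str                         # 32 hex characters, shared by every span in a trace
    span_id: str                          # 16 hex characters, unique to each span
    parent_span_id: Optional[str] = None  # None for the first span in a trace


def start_span(parent=None):
    """Start a span, joining the parent's trace when a parent context is known."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    return SpanContext(trace_id, secrets.token_hex(8), parent.span_id if parent else None)


def inject(ctx):
    """Serialise a context into outgoing HTTP headers (version 00, sampled flag 01)."""
    return {"traceparent": f"00-{ctx.trace_id}-{ctx.span_id}-01"}


def extract(headers):
    """Rebuild the caller's context from incoming headers, or None if it is absent."""
    parts = headers.get("traceparent", "").split("-")
    if len(parts) != 4:
        return None
    _version, trace_id, span_id, _flags = parts
    return SpanContext(trace_id, span_id)


# war_map -> northern-capital -> village-2, passing headers on each hop
war_map_span = start_span()
capital_span = start_span(extract(inject(war_map_span)))
village_span = start_span(extract(inject(capital_span)))

# A request that arrives without the header starts a brand new, unrelated trace.
orphan_span = start_span(extract({}))
```

All three spans in the chain share one trace ID, while the orphan span gets its own. That is exactly the broken chain described above: the work still happens, but it no longer shows up in the same trace.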
Okay, so we can see that an army moved to village-2, but what actually unfolded in that battle? That’s where span attributes come in. To find span attributes, let’s select village-2 receive_army and expand the span attributes dropdown.

Span attributes are key-value pairs that provide further context for a particular span. Span attributes are optional, but are extremely useful when determining the state of particular variables or API calls. In our case, it allows us to track the battle that took place. We can see that the northern capital sent an army of four to defeat the army of three at village 2. It was a successful attack, leaving one army remaining to defend the captured village.
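As a rough illustration, the battle above could be described with key-value pairs like the ones in this sketch. The attribute names and the resolve_battle helper are hypothetical, and the subtraction rule is only an assumption that happens to match the four-versus-three example. The game's real battle logic may differ.

```python
def resolve_battle(attacking_army, defending_army):
    """Return key-value pairs describing a battle, in the style of span attributes."""
    attacker_wins = attacking_army > defending_army  # assumption: ties go to the defender
    return {
        "battle.attacking_army": attacking_army,
        "battle.defending_army": defending_army,
        "battle.result": "victory" if attacker_wins else "defeat",
        "battle.remaining_army": abs(attacking_army - defending_army),
    }


attributes = resolve_battle(attacking_army=4, defending_army=3)
for key, value in attributes.items():
    # With the OpenTelemetry Python API, each pair would be attached to the
    # active span with span.set_attribute(key, value).
    print(f"{key} = {value}")
```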
A trace also allows us to generate a node graph to understand service flow, and to summarize rate, error, and duration (RED) metrics.

In the context of our game, we represent lost battles by setting the span status to error. This allows us to see how many battles were successful and how many were unsuccessful during a game.
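To give a feel for how RED metrics fall out of spans, here is a simplified sketch that summarizes a handful of finished spans per service. The FinishedSpan class and the sample data are made up for illustration. In the real stack this aggregation is done for you from the stored traces, not by hand.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FinishedSpan:
    service: str
    duration_ms: float
    is_error: bool  # True when the span was marked as an error, i.e. a lost battle


def red_summary(spans, window_seconds):
    """Summarise rate, errors, and duration per service over a time window."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span.service].append(span)
    return {
        service: {
            "rate_per_second": len(group) / window_seconds,
            "errors": sum(1 for s in group if s.is_error),
            "avg_duration_ms": sum(s.duration_ms for s in group) / len(group),
        }
        for service, group in grouped.items()
    }


# Made-up spans standing in for one minute of battles.
spans = [
    FinishedSpan("village-2", 10.0, False),
    FinishedSpan("village-2", 20.0, True),
    FinishedSpan("village-2", 30.0, False),
    FinishedSpan("village-3", 40.0, True),
]
summary = red_summary(spans, window_seconds=60)
```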

By the light of the three!
One of OpenTelemetry’s greatest strengths is correlation of telemetry signals (in our case, traces, logs, and metrics), which is largely available out-of-the-box. If you generate logs within the context of a span, these logs will automatically contain the trace ID and span ID as part of their attributes. Metrics have a similar correlation method, but let’s focus on traces-to-logs first.

Within the LGTM stack, we provide the ability to correlate between traces stored in Tempo and logs stored in Loki. If we take our battle span (never thought I would say that in a sentence), we can select logs for the span. This will automatically query Loki using the span ID or trace ID or both (depending on your data source configuration), providing the logs from our village-2 at the time of the battle span. We can see a historic log of the battle that took place.
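The idea behind that correlation can be sketched with nothing but Python's standard logging module. This is a simplified model and not the OpenTelemetry SDK: the current_span context variable and the SpanContextFilter class are hypothetical, and the IDs are example values. It shows how any log written while a span is active can pick up that span's IDs without the caller passing them in.

```python
import contextvars
import io
import logging

# Holds the (trace_id, span_id) of the currently active span, if there is one.
current_span = contextvars.ContextVar("current_span", default=None)


class SpanContextFilter(logging.Filter):
    """Copy the active span's IDs onto every log record that passes through."""

    def filter(self, record):
        trace_id, span_id = current_span.get() or ("-", "-")
        record.trace_id = trace_id
        record.span_id = span_id
        return True


stream = io.StringIO()  # stands in for wherever the logs are shipped
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s")
)
handler.addFilter(SpanContextFilter())

logger = logging.getLogger("village-2")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.handlers = [handler]

# Inside the battle span: the log line carries the span's IDs.
token = current_span.set(("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))
logger.info("Battle won, 1 army remaining")
current_span.reset(token)

# Outside any span: there is nothing to correlate with.
logger.info("Village idle")

log_lines = stream.getvalue().splitlines()
```

Once every log line carries a trace ID, jumping from a span to its logs is just a query for that ID.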
Lastly, let’s look at traces-to-metrics. This is handled within OpenTelemetry and Prometheus as an exemplar. An exemplar is a specific trace representative of a measurement taken in a given time interval. While metrics excel at giving you an aggregated view of your system, traces give you a fine-grained view of a single request; exemplars are a way to link the two.

In our scenario, we can see that we generate a total number of battles per village, as battles take place. This metric is updated within the scope of our receive_army battle span. Thus, when the metric reading is taken, we are given the trace and span ID, which is stored as an exemplar in Prometheus/Mimir. Grafana allows us to overlay exemplars on top of line graphs so we can see a timeline of the total number of battles each village has had and then drill down into a particular trace to understand what happened during that particular battle.
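Here is a rough, standard-library model of what happens when that counter is updated inside the battle span. The BattleCounter and Exemplar classes are hypothetical and only mimic the idea. In the game, the OpenTelemetry SDK and Prometheus/Mimir handle exemplars themselves.

```python
import time
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Exemplar:
    trace_id: str
    span_id: str
    value: float      # the measurement that was recorded (one battle)
    timestamp: float


@dataclass
class BattleCounter:
    """A per-village counter that remembers the trace behind its latest increment."""
    totals: Dict[str, int] = field(default_factory=dict)
    exemplars: Dict[str, Exemplar] = field(default_factory=dict)

    def add(self, village, trace_id, span_id, now=None):
        timestamp = time.time() if now is None else now
        self.totals[village] = self.totals.get(village, 0) + 1
        self.exemplars[village] = Exemplar(trace_id, span_id, 1.0, timestamp)


counter = BattleCounter()
# Example IDs only; in practice they come from the active receive_army span.
counter.add("village-2", "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", now=100.0)
counter.add("village-2", "0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331", now=160.0)

# The aggregated view (total battles) and the drill-down (which trace to open).
total_battles = counter.totals["village-2"]
latest_trace_id = counter.exemplars["village-2"].trace_id
```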
Continuous profiling is the only signal missing here, but I plan to add profiles using Grafana’s OpenTelemetry Profiling library. Stay tuned!
A battle is a series of cause and effect
The last feature we introduced (I say “we” because the idea came from Hedley Simons, one of our resident tracing experts) was span links. Links are optional and can be a bit more challenging to implement. Essentially, their role is to create relationships between spans that aren’t in a direct parent-child hierarchy. Unlike parent-child relationships (which are synchronous and hierarchical), links are more flexible and can connect spans across different traces or time periods. This becomes particularly useful in our case, as it ties gameplay together by linking a player’s actions across the course of the game.
Let’s first take a look at how links are represented in Grafana.

As we can see, our final all-out attack has a span link attached to it. This references the previous action our player took within the game sequence. By following these links one at a time, we can trace our game in reverse order, all the way back to the very first move.
So, what does this allow us to do? Well, this isn’t a practical use of Tempo’s query API, but it allowed us to create a replay feature.

The war map searches Tempo for a particular game session ID, which is stored as an attribute on each span. It then follows the span links to retrieve each linked trace from Tempo, from the most recent action back to the first, and reverses that order. This gives us the ability to step through the actions of the player during the game.
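The link-following part of the replay can be sketched in a few lines. This is a simplified model with made-up trace IDs and action names, where a plain dictionary stands in for the traces fetched from Tempo. The ActionTrace class and replay function are hypothetical and not the war map's actual code.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ActionTrace:
    trace_id: str
    action: str
    linked_trace_id: Optional[str]  # the previous action's trace, taken from the span link


def replay(traces: Dict[str, ActionTrace], last_trace_id: str) -> List[str]:
    """Walk the span links backwards from the last action, then reverse the result."""
    newest_first = []
    seen = set()  # guards against a cycle in malformed data
    current = last_trace_id
    while current is not None and current not in seen:
        seen.add(current)
        trace = traces.get(current)
        if trace is None:  # the linked trace could not be found
            break
        newest_first.append(trace.action)
        current = trace.linked_trace_id
    return list(reversed(newest_first))


# Made-up game session: each action links back to the one before it.
traces = {
    "t1": ActionTrace("t1", "collect_resources", None),
    "t2": ActionTrace("t2", "create_army", "t1"),
    "t3": ActionTrace("t3", "move_army", "t2"),
    "t4": ActionTrace("t4", "all_out_attack", "t3"),
}
steps = replay(traces, "t4")
```

Because each action only knows about the one before it, the walk has to start from the final move. Reversing the collected actions at the end gives the replay in the order the player actually took them.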
As the dust settles, a chance for reflection
When building this demo, my own “aha!” moment with traces came when I was designing the game AI and how it could interact with the game mechanics. It was hard to know how the set of decision weights affected the AI’s actions: whether the AI would get stuck repeating the same mechanic over and over, or fail to interact with one of the Flask servers (capital or village).

Traces let me track the decisions of the AI opponent throughout our match together. I discovered rather early on that my first attempt at the AI opponent didn’t react quickly enough to imminent threats from the player. If it didn’t recognize an all-out attack or that a village had been captured close to its capital, it remained passively collecting resources and taking villages. After modifying the AI’s behavior, I could see it play a more defensive role in its actions when the enemy was near, through its decision to spend less time collecting resources and more time building armies at its capital.
Game over… for now
I had a lot of fun building this game. It allowed me to truly dive into the OpenTelemetry SDK for Python and understand how traces and spans are created, how context is passed from one span to another, and how metrics and logs can be tied to a given span.
I purposefully chose not to auto-instrument Game of Traces, so that the instrumentation techniques apply across programming languages. I plan to write a follow-up post that dives deeper into the code base to cover how the game was instrumented.
In the meantime, I hope this game provides a new way to learn observability and its value from the ground up. Until next time, for the north!