Here is a short version: In Cortex and Loki, the ring is a space divided by tokens into smaller segments. Each segment belongs to a single “ingester” (component in Cortex and Loki that receives data) and is used to shard series/logs across multiple ingesters. In addition to the tokens, each ingester also has an entry with its ID, address, and the latest heartbeat timestamp updated periodically. This allows other components (distributors and queriers) to discover which ingesters are available and alive.
In Cortex and Loki, we store the ring in some key-value store like Consul or etcd. The key-value store (KV store) is a simple dictionary, where each value is stored under a key. The ring is stored as a single value, typically under the key “ring” – perhaps with some prefix. Distributors and queriers then watch this key for updates, and keep the latest value into memory for quick access.
This works great … until it starts to break. In this blog post, we will discuss some of the issues that we have observed, and how we are trying to solve them using gossip.
Some common problems
One problem is related to updates. Each update requires two operations to the KV store: one to read the ring into memory, and one to write back the updated value using a Compare-and-Set (or Compare-and-Swap, CAS) operation. CAS updates the value only if it hasn’t been changed since reading. If multiple ingesters are updating the ring at the same time, only one of them will succeed, and other ingesters will need to retry the update. That means another read, update in memory, and CAS cycle. As the number of ingesters grows, the rate of failed CAS operations goes up as well, and this leads to a lot of wasted effort on the ingester side and also on the KV-store side.
Another problem lies in components watching the ring for updates. Let’s say we have 30 ingesters, each writing a heartbeat every 10 seconds. That is 3 updates per second in total. Now let’s say we have 20 distributors and 20 queriers, each watching the ring. That’s 40 watchers, receiving 3 updates per second = 120 updates per second that the KV store has to send out. This can be reduced by applying rate limits to watchers, as we have implemented recently for the Consul client, which helped Consul to use less CPU and bandwidth.
Two more problems are worth mentioning… since the ring is currently stored as a single value, each read or CAS operation needs to transfer the entire ring. Lastly, a separate KV store is just another dependency and component to operate.
Can we do better?
An alternative option to Consul or etcd may be using gossip to propagate the ring data structure across nodes. The gossip algorithm is a generic algorithm for distributing information across a set of nodes.
In the picture, we have 25 nodes. The node in the middle has some information that it wants to distribute to all the other nodes. In gossip, every node runs a simple loop: In each iteration, a node randomly picks 3 other nodes, and sends the information to them.
In round 1, information is now available on 4 nodes: the original node in the middle, and three new nodes (marked as red in the picture).
Now all nodes do the same again: randomly pick 3 other nodes, and send the information to them. It’s possible that some nodes get the information from multiple source nodes, or a selected node already has the information. That’s fine.
In the next round, more nodes get the information, and will start sending it out. In a couple of more rounds, all nodes will have it. The number of rounds required to spread the piece of information everywhere depends on the total number of nodes, and how many nodes we randomly select in each iteration (3 in our example).
As we can see, the gossip algorithm (sometimes also called gossip or epidemic protocol) is characterized by these properties:
- There’s periodic communication with other nodes.
- Exchanged information is bounded in size.
- It doesn’t require reliable communication (failures are OK, although they do delay the spread).
- It is important to select target nodes randomly.
- There is no need for having some central server.
The algorithm requires that nodes are aware of each other; however, nodes can also learn about other nodes via gossip, as we will see later.
For use in Cortex, we have chosen the Memberlist library as the gossip implementation for sharing the ring. In memberlist the information that is shared between the nodes is actually a list of nodes in the cluster. Instead of exchanging the full list, nodes using memberlist library exchange “JOIN” and “LEAVE” messages between each other. Memberlist also sends “PING” messages to detect failed nodes, and has features to speed up the synchronization of the state and to make information spread faster.
For our usage, it’s important that we can send custom messages via its underlying gossip implementation.
Now that we have a library for gossiping, what can we do with it? Build a KV store! Why? Simply because it’s an abstraction that we already use for storing the ring, and using the same abstraction allows us to easily switch between different implementations.
KV store on top of memberlist
As a typical KV store, our implementation will map string keys to binary blobs (values). Exchanged state information will be modified key-value pairs. In the case of the ring, we will be only exchanging a modified subset of the ring, typically entries with timestamp updates. The KV store will start as empty, and ingesters will be able to write their own entries to the ring. When this update is received by another node, this node will need to merge incoming updates to the state it already has in memory.
Merging two rings together is conceptually a very simple operation: For non-conflicting updates (different ingesters), we simply put them together.
If there are conflicting entries (“Ingester-1” on the picture), the entry with the more recent heartbeat wins. Heartbeats are generated on the ingester and propagated from there, so a more recent heartbeat indicates more recent information.
The merge operation is invoked between the state that node already has and the state (update) that was received via gossip. If the merge operation actually updates a ring, this update is then sent out via gossip to the other members. Sending out only updates (deltas) keeps the messages exchanged between nodes small. The merge operation must be idempotent, commutative, and associative.
This is a high-level overview of how the KV store and ring merging works in Cortex and Loki. There are many details that we’re not covering for simplicity, like handling conflicting tokens, tombstones for entries removed from the ring, or the KV store being independent from actual data type.
What did we get by doing this?
In our production clusters, we use Consul to store the ring, and while it works great, it has caused some issues as well. With gossip, we can drop Consul dependency.
Ingester updates (that is CAS operations) are now very cheap, since they only happen in memory, and there is no immediate network communication. Ingesters can therefore update their timestamps more often, and higher frequency doesn’t cause problems: It doesn’t generate more gossip messages, since gossip happens with its own frequency, independent from updates. If old updates are still in the gossip queue, they are simply discarded.
Watching the ring for updates (as done by distributors and queriers) now also doesn’t require any extra effort. Messages with updates are received all the time, and if they cause a ring update, watchers are notified. We may still want to apply watch rate limits, since watchers use CPU cycles to convert the ring as stored in the KV store to the representation they need.
Gossip update messages are small, since they only include a small part of the ring, and are stable: Size of the single update doesn’t grow with the number of ingesters. This is in contrast with Consul/etcd, since all operations on those transfer the entire ring.
Not everything is so great though. Spreading information through gossip means that each node sees a slightly different view of the ring. If there were no updates, eventually they would converge to the same state, but since ingesters are updating the ring all the time, this convergence to a single state never happens. However, it’s not a big issue. Once ingesters are running, they only update heartbeat timestamps, and watchers start treating them as “unhealthy” only if the timestamp is older than a minute. That’s usually enough time for updates to spread via gossip.
Cortex and Loki distributors have the /ring page where operators can see the ring. Refreshing this page may hit different distributors each time, showing slightly different views of the ring. In the picture below, we see that viewing the /ring page three times shows an older Ingester-3 timestamp on the last refresh. This can happen if the distributor serving the page hasn’t seen the update yet.
Cortex and Loki have supported gossip for storing the ring for some time now. Most recent releases (Cortex 0.7, and the Loki version after the soon-to-be-released 1.4) have some improvements that allow usage of gossip in single-binary mode as well. Please give gossip in Cortex and Loki a try, and let us know how it works for you.
Learn more about Cortex
If you’re interested in finding out more about Cortex, sign up for our webinar scheduled for March 31 at 9:30am PT. There will be a preview of new features in the upcoming Cortex 1.0 release, a demonstration, and Q&A with Cortex maintainers Tom Wilkie and Goutham Veeramachaneni.