Storage on Grafana Labs

Log Entry Deletion

Tue, 16 Jul 2024 15:42:20 +0000

Log Entry Deletion

Grafana Loki supports the deletion of log entries from a specified stream. Log entries that fall within a specified time window and match an optional line filter are those that will be deleted.

Log entry deletion is supported only when the BoltDB Shipper is configured for the index store.

The compactor component exposes REST endpoints that process delete requests. A delete request specifies the streams and the time window. The log entries are deleted after a configurable cancellation period expires.

Log entry deletion relies on configuration of the custom logs retention workflow as defined for the compactor. The compactor looks at unprocessed requests which are past their cancellation period to decide whether a chunk is to be deleted or not.
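
The selection and timing rules described above can be illustrated with a minimal Python sketch. This is not Loki's implementation: the DeleteRequest class, its field names, the inclusive window bounds, and the substring-based line filter are all assumptions made for illustration. Only the 24h default cancellation period comes from this page.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class DeleteRequest:
    """Hypothetical model of a delete request; field names are assumptions."""
    created_at: datetime               # when the request was submitted
    start: datetime                    # start of the time window (assumed inclusive)
    end: datetime                      # end of the time window (assumed inclusive)
    line_filter: Optional[str] = None  # optional filter, modeled as a substring


def is_past_cancel_period(req, now, cancel_period=timedelta(hours=24)):
    """A request is only processed once its cancellation period has expired."""
    return now - req.created_at >= cancel_period


def entry_matches(req, ts, line):
    """An entry is selected if it is inside the window and passes the filter."""
    in_window = req.start <= ts <= req.end
    passes_filter = req.line_filter is None or req.line_filter in line
    return in_window and passes_filter
```

In this model, an entry is removed only when both functions return true: the request has outlived its cancellation period, and the entry falls inside the window and matches the optional filter.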

Configuration

Enable log entry deletion by setting retention_enabled to true and deletion_mode to filter-only or filter-and-delete in the compactor’s configuration.

With filter-only, log lines matching the query in the delete request are filtered out when querying Loki. They are not removed from storage. With filter-and-delete, log lines matching the query in the delete request are filtered out when querying Loki, and they are also removed from storage.

A delete request may be canceled within a configurable cancellation period. Set the delete_request_cancel_period in the compactor’s YAML configuration or on the command line when invoking Loki. Its default value is 24h.

Access to the deletion API can be enabled per tenant via the allow_deletes setting.

Filesystem

Tue, 16 Jul 2024 15:42:20 +0000

Filesystem Object Store

The filesystem object store is the easiest way to get started with Grafana Loki, but this approach has both pros and cons.

Very simply it stores all the objects (chunks) in the specified directory:

storage_config:
  filesystem:
    directory: /tmp/loki/

A folder is created for every tenant, and all the chunks for that tenant are stored in it.

If Loki is run in single-tenant mode, all the chunks are put in a folder named fake, which is the synthesized tenant name used for single-tenant mode.

Pros

Very simple, no additional software required to use Loki when paired with the BoltDB index store.

Great for low volume applications, proof of concepts, and just playing around with Loki.

Cons

Scaling

At some point there is a limit to how many chunks can be stored in a single directory, for example see issue #1502 which explains how a Loki user ran into a strange error with about 5.5 million chunk files in their file store (and also a workaround for the problem).

However, you can reduce the number of chunks flushed by keeping the number of streams low (remember that Loki writes a chunk per stream) and by tweaking settings such as chunk_target_size (around 1MB), max_chunk_age (increase beyond 1h), and chunk_idle_period (increase to match max_chunk_age), although this trades off against higher memory consumption.

It’s still very possible to store terabytes of log data with the filestore, but realize there are limitations to how many files a filesystem will want to store in a single directory.

Durability

The durability of the objects is at the mercy of the filesystem itself, whereas other object stores like S3/GCS do a lot behind the scenes to offer extremely high durability for your data.

High Availability

Running Loki clustered is not possible with the filesystem store unless the filesystem is shared in some fashion (NFS, for example). However, using shared filesystems is likely to be a bad experience with Loki, just as it is for almost every other application.

Retention

Tue, 16 Jul 2024 15:42:20 +0000

Grafana Loki Storage Retention

Retention in Grafana Loki is achieved either through the Table Manager or the Compactor.

Retention through the Table Manager is achieved by relying on the object store TTL feature, and works for both the boltdb-shipper store and the chunk/index store. However, retention through the Compactor is supported only with the boltdb-shipper store.

The Compactor retention will become the default and have long term support. It supports more granular retention policies on per tenant and per stream use cases.

Compactor

The Compactor can deduplicate index entries. It can also apply granular retention. When applying retention with the Compactor, the Table Manager is unnecessary.

Run the compactor as a singleton (a single instance).

Compaction and retention are idempotent. If the compactor restarts, it will continue from where it left off.

The Compactor loops to apply compaction and retention at every compaction_interval, or as soon as possible if running behind.

The compactor’s algorithm to update the index:

For each table within each day:
- Compact the table into a single index file.
- Traverse the entire index. Use the tenant configuration to identify and mark chunks that need to be removed.
- Remove marked chunks from the index and save their reference in a file on disk.
- Upload the new modified index files.

The retention algorithm is applied to the index. Chunks are not deleted while applying the retention algorithm. The chunks will be deleted by the compactor asynchronously when swept.

Marked chunks are deleted only after the configured retention_delete_delay has expired, for two reasons:

boltdb-shipper indexes are refreshed from the shared store on the components that use them (querier and ruler) at a specific interval. Deleting chunks instantly could therefore leave components holding references to old chunks, causing them to fail to execute queries. The delay allows components to refresh their store and gracefully remove their references to those chunks.
It provides a short window of time in which to cancel chunk deletion in the case of a configuration mistake.

Marker files (containing chunks to delete) should be stored on a persistent disk, since the disk will be the sole reference to them.

Retention Configuration

This compactor configuration example activates retention.

compactor:
  working_directory: /data/retention
  shared_store: gcs
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
schema_config:
    configs:
      - from: "2020-07-31"
        index:
            period: 24h
            prefix: loki_index_
        object_store: gcs
        schema: v11
        store: boltdb-shipper
storage_config:
    boltdb_shipper:
        active_index_directory: /data/index
        cache_location: /data/boltdb-cache
        shared_store: gcs
    gcs:
        bucket_name: loki

Note that retention is only available if the index period is 24h.

Set retention_enabled to true. Without this, the Compactor will only compact tables.

Define schema_config and storage_config to access the storage.

The index period must be 24h.

working_directory is the directory where marked chunks and temporary tables will be saved.

compaction_interval dictates how often compaction and/or retention is applied. If the Compactor falls behind, compaction and/or retention occur as soon as possible.

retention_delete_delay is the delay after which the compactor will delete marked chunks.

retention_delete_worker_count specifies the maximum quantity of goroutine workers instantiated to delete chunks.

Configuring the retention period

Retention period is configured within the limits_config configuration section.

There are two ways of setting retention policies:

retention_period which is applied globally.
retention_stream which is only applied to chunks matching the selector

The minimum retention period is 24h.

This example configures global retention:

...
limits_config:
  retention_period: 744h
  retention_stream:
  - selector: '{namespace="dev"}'
    priority: 1
    period: 24h
  per_tenant_override_config: /etc/overrides.yaml
...

Per tenant retention can be defined using the /etc/overrides.yaml files. For example:

overrides:
    "29":
        retention_period: 168h
        retention_stream:
        - selector: '{namespace="prod"}'
          priority: 2
          period: 336h
        - selector: '{container="loki"}'
          priority: 1
          period: 72h
    "30":
        retention_stream:
        - selector: '{container="nginx"}'
          priority: 1
          period: 24h

A rule to apply is selected by choosing the first in this list that matches:

If a per-tenant retention_stream matches the current stream, the highest priority is picked.
If a global retention_stream matches the current stream, the highest priority is picked.
If a per-tenant retention_period is specified, it will be applied.
The global retention_period will be selected if nothing else matched.
If no global retention_period is specified, the default value of 744h (31 days) retention is used.

Stream matching uses the same syntax as Prometheus label matching:

=: Select labels that are exactly equal to the provided string.
!=: Select labels that are not equal to the provided string.
=~: Select labels that regex-match the provided string.
!~: Select labels that do not regex-match the provided string.

The example configurations will set these rules:

All tenants except 29 and 30 in the dev namespace will have a retention period of 24h.
All tenants except 29 and 30 that are not in the dev namespace will have the retention period of 744h.
For tenant 29:
- All streams except those in the container loki or in the namespace prod will have retention period of 168h (1 week).
- All streams in the prod namespace will have a retention period of 336h (2 weeks), even if the container label is loki, since the priority of the prod rule is higher.
- Streams that have the container label loki but are not in the namespace prod will have a 72h retention period.
For tenant 30:
- All streams except those having the container label nginx will have the global retention period of 744h, since there is no override specified.
- Streams that have the container label nginx will have a retention period of 24h.
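
The selection order can be sketched in Python as follows. This is an illustration rather than Loki's code: it follows the five-step list above literally, reduces selectors to exact-match (=) labels only, expresses periods in hours, and uses hypothetical names (GLOBAL, OVERRIDES, retention_hours). How Loki merges per-tenant overrides with global retention_stream rules is not modeled beyond that list.

```python
DEFAULT_PERIOD_H = 744  # default used when no global retention_period is set

# Global limits_config from the example; selectors reduced to exact-match labels.
GLOBAL = {
    "retention_period": 744,
    "retention_stream": [
        {"selector": {"namespace": "dev"}, "priority": 1, "period": 24},
    ],
}

# Per-tenant overrides from the example /etc/overrides.yaml.
OVERRIDES = {
    "29": {
        "retention_period": 168,
        "retention_stream": [
            {"selector": {"namespace": "prod"}, "priority": 2, "period": 336},
            {"selector": {"container": "loki"}, "priority": 1, "period": 72},
        ],
    },
    "30": {
        "retention_stream": [
            {"selector": {"container": "nginx"}, "priority": 1, "period": 24},
        ],
    },
}


def best_stream_rule(rules, labels):
    """Return the period of the highest-priority matching rule, or None."""
    matching = [
        r for r in rules
        if all(labels.get(k) == v for k, v in r["selector"].items())
    ]
    if not matching:
        return None
    return max(matching, key=lambda r: r["priority"])["period"]


def retention_hours(tenant, labels):
    """Walk the list in order and return the first retention that applies."""
    tenant_cfg = OVERRIDES.get(tenant, {})
    period = best_stream_rule(tenant_cfg.get("retention_stream", []), labels)
    if period is not None:
        return period
    period = best_stream_rule(GLOBAL.get("retention_stream", []), labels)
    if period is not None:
        return period
    if "retention_period" in tenant_cfg:
        return tenant_cfg["retention_period"]
    return GLOBAL.get("retention_period", DEFAULT_PERIOD_H)
```

For example, retention_hours("29", {"namespace": "prod", "container": "loki"}) returns 336, because the prod rule has the higher priority.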

Table Manager

In order to enable the retention support, the Table Manager needs to be configured to enable deletions and a retention period. Refer to the table_manager section of the Loki configuration reference for all available options. Alternatively, the table-manager.retention-period and table-manager.retention-deletes-enabled command line flags can be used. The provided retention period needs to be a duration represented as a string that can be parsed using the Prometheus common model ParseDuration. Examples: 7d, 1w, 168h.

WARNING: The retention period must be a multiple of the index and chunks table period, configured in the period_config block. See the Table Manager documentation for more information.

NOTE: To avoid querying of data beyond the retention period, max_look_back_period config in chunk_store_config must be set to a value less than or equal to what is set in table_manager.retention_period.

When using S3 or GCS, the bucket storing the chunks needs to have the expiry policy set correctly. For more details check S3’s documentation or GCS’s documentation.

Currently, the retention policy can only be set globally. A per-tenant retention policy with an API to delete ingested logs is still under development.

Since a design goal of Loki is to make storing logs cheap, a volume-based deletion API is deprioritized. Until this feature is released, if you suddenly must delete ingested logs, you can delete old chunks in your object store. Note, however, that this only deletes the log content and keeps the label index intact; you will still be able to see related labels but will be unable to retrieve the deleted log content.

For further details on the Table Manager internals, refer to the Table Manager documentation.

Example Configuration

Example configuration with GCS with a 28 day retention:

schema_config:
  configs:
  - from: 2018-04-15
    store: bigtable
    object_store: gcs
    schema: v11
    index:
      prefix: loki_index_
      period: 168h

storage_config:
  bigtable:
    instance: BIGTABLE_INSTANCE
    project: BIGTABLE_PROJECT
  gcs:
    bucket_name: GCS_BUCKET_NAME

chunk_store_config:
  max_look_back_period: 672h

table_manager:
  retention_deletes_enabled: true
  retention_period: 672h

Single Store (boltdb-shipper)

Tue, 16 Jul 2024 15:42:20 +0000

Single Store Loki (boltdb-shipper index type)

BoltDB Shipper lets you run Grafana Loki without any dependency on NoSQL stores for storing the index. Instead, it stores the index locally in BoltDB files and keeps shipping those files to a shared object store, i.e., the same object store that is used for storing chunks. It also keeps syncing BoltDB files from the shared object store to a configured local directory to get index entries created by other services of the same Loki cluster. This lets you run Loki with one less dependency and also saves storage costs, since object stores are likely to be much cheaper than a hosted NoSQL store or a self-hosted instance of Cassandra.

Note: BoltDB shipper works best with 24h periodic index files. It is a requirement to have index period set to 24h for either active or upcoming usage of boltdb-shipper. If boltdb-shipper already has created index files with 7 days period, and you want to retain previous data then just add a new schema config using boltdb-shipper with a future date and index files period set to 24h.

Example Configuration

Example configuration with GCS:

schema_config:
  configs:
    - from: 2018-04-15
      store: boltdb-shipper
      object_store: gcs
      schema: v11
      index:
        prefix: loki_index_
        period: 24h

storage_config:
  gcs:
    bucket_name: GCS_BUCKET_NAME

  boltdb_shipper:
    active_index_directory: /loki/index
    shared_store: gcs
    cache_location: /loki/boltdb-cache

This would run Loki with BoltDB Shipper storing BoltDB files locally at /loki/index and chunks in the configured GCS_BUCKET_NAME. It would also keep shipping BoltDB files periodically to the same configured bucket, and keep downloading BoltDB files uploaded to the shared bucket by other ingesters to the local /loki/boltdb-cache folder.

Operational Details

Loki can be configured to run as a single vertically scaled instance, as a cluster of horizontally scaled single-binary instances (each running all Loki services), or in microservices mode with each instance running just one of the services. When it comes to reads and writes, Ingesters write the index and chunks to the stores, and Queriers read the index and chunks from the stores to serve requests.

Before we get into more details, it is important to understand how Loki manages the index in stores. Loki shards the index according to the configured period, which defaults to seven days; that is, with table-based stores like Bigtable/Cassandra/DynamoDB there is a separate table per week containing the index for that week. In the case of BoltDB Shipper, a table is defined by a collection of many smaller BoltDB files, each file storing just 15 minutes’ worth of index. Tables created per day are identified by a configured prefix_ + <period-number-since-epoch>. For boltdb-shipper, <period-number-since-epoch> is the day number since the epoch. For example, if you have the prefix set to loki_index_ and a write request comes in on 20th April 2020, it is stored in a table named loki_index_18372, because 18372 whole days had elapsed between the epoch and the start of that day. Since sharding of the index creates multiple files when using BoltDB, BoltDB Shipper creates a folder per day, adds the files for that day to that folder, and names those files after the ingesters which created them.
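
The table naming can be checked with a few lines of Python. This is a sketch of the arithmetic only, assuming UTC timestamps and a 24h period; the function name table_name is hypothetical.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 24 * 60 * 60


def table_name(prefix, ts, period_seconds=SECONDS_PER_DAY):
    """Return prefix + number of whole periods elapsed since the Unix epoch."""
    return f"{prefix}{int(ts.timestamp()) // period_seconds}"


# A write arriving at noon UTC on 20 April 2020:
print(table_name("loki_index_", datetime(2020, 4, 20, 12, 0, tzinfo=timezone.utc)))
# prints "loki_index_18372"
```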

To reduce file size, which helps with faster transfer speeds and lower storage costs, the files are compressed with gzip before they are stored.

To show what BoltDB files in the shared object store look like, consider two ingesters named ingester-0 and ingester-1 running in a Loki cluster, both having shipped files for days 18371 and 18372 with the prefix loki_index_. The files would look like this:

└── index
    ├── loki_index_18371
    │   ├── ingester-0-1587254400.gz
    │   └── ingester-1-1587255300.gz
    │   ...
    └── loki_index_18372
        ├── ingester-0-1587254400.gz
        └── ingester-1-1587254400.gz
        ...

Note: We also add a timestamp to the names of the files to randomize them, which avoids overwriting files when running Ingesters with the same name and without persistent storage. Timestamps are not shown here for simplicity.

Let us look in more depth at how Ingesters and Queriers work when running with BoltDB Shipper.

Ingesters

Ingesters keep writing the index to BoltDB files in active_index_directory, and BoltDB Shipper looks for new and updated files in that directory every 15 minutes to upload them to the shared object store. When running Loki in clustered mode, there could be multiple ingesters serving write requests, each of them generating BoltDB files locally.

Note: To avoid any loss of index when an Ingester crashes, it is recommended to run Ingesters as a StatefulSet (when using Kubernetes) with persistent storage for storing index files.

Another important detail is that when chunks are flushed, they are available for reads in the object store instantly, while the index is not, since BoltDB Shipper uploads it only every 15 minutes. Ingesters expose a new RPC that lets Queriers query the Ingester’s local index for chunks that were recently flushed but whose index might not yet be available to Queriers. For all queries that require chunks to be read from the store, Queriers also query Ingesters over RPC for the IDs of recently flushed chunks, to avoid missing any logs from queries.

Queriers

To avoid running Queriers as a StatefulSet with persistent storage, we recommend running an Index Gateway. An Index Gateway will download and synchronize the index, and it will serve it over gRPC to Queriers and Rulers.

Queriers lazily load BoltDB files from the shared object store to the configured cache_location. When a querier receives a read request, the query range from the request is resolved to period numbers, and all the files for those period numbers are downloaded to cache_location if they are not already there. Once the files for a period have been downloaded, the querier keeps looking for updates in the shared object store and downloads them every 5 minutes by default. The frequency of checking for updates can be configured with the resync_interval config.
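
Resolving a query range to period numbers can be sketched as below. This only illustrates the arithmetic, assuming a 24h index period and UTC timestamps; tables_for_range is a hypothetical helper, not a Loki function.

```python
from datetime import datetime, timezone


def tables_for_range(prefix, start, end, period_seconds=86400):
    """List the table names whose periods overlap the range [start, end]."""
    first = int(start.timestamp()) // period_seconds
    last = int(end.timestamp()) // period_seconds
    return [f"{prefix}{n}" for n in range(first, last + 1)]


# A two-hour query that crosses midnight UTC touches two daily tables.
tables = tables_for_range(
    "loki_index_",
    datetime(2020, 4, 19, 23, 0, tzinfo=timezone.utc),
    datetime(2020, 4, 20, 1, 0, tzinfo=timezone.utc),
)
print(tables)  # prints ['loki_index_18371', 'loki_index_18372']
```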

To avoid keeping downloaded index files forever there is a ttl for them which defaults to 24 hours, which means if index files for a period are not used for 24 hours they would be removed from cache location. ttl can be configured using cache_ttl config.

Within Kubernetes, if you are not using an Index Gateway, we recommend running Queriers as a StatefulSet with persistent storage for downloading and querying index files. This will obtain better read performance, and it will avoid using node disk.

Index Gateway

An Index Gateway downloads and synchronizes the BoltDB index from the Object Storage in order to serve index queries to the Queriers and Rulers over gRPC. This avoids running Queriers and Rulers with a disk for persistence. Disks can become costly in a big cluster.

To run an Index Gateway, configure StorageConfig and set the -target CLI flag to index-gateway. To connect Queriers and Rulers to the Index Gateway, set the address (with gRPC port) of the Index Gateway with the -boltdb.shipper.index-gateway-client.server-address CLI flag or its equivalent YAML value under StorageConfig.

When using the Index Gateway within Kubernetes, we recommend using a StatefulSet with persistent storage for downloading and querying index files. This can obtain better read performance, avoids noisy neighbor problems by not using the node disk, and avoids the time consuming index downloading step on startup after rescheduling to a new node.

Write Deduplication disabled

Loki deduplicates writes of chunks and index entries using the Chunks and WriteDedupe caches respectively, configured with ChunkStoreConfig. With boltdb-shipper, however, ingesters upload BoltDB files only periodically to make them available to the other services, so there is a brief period during which some services have not yet received the updated index. If the ingester that first wrote the chunks and index goes down, and all the other ingesters in the replication scheme skipped writing those chunks and index because of deduplication, those logs would be missing from query responses, since only the ingester that went down had the index. This problem would occur even during rollouts, which are quite common.

To avoid this, Loki disables deduplication of the index when the replication factor is greater than 1 and boltdb-shipper is an active or upcoming index type. While using boltdb-shipper, avoid configuring the WriteDedupe cache: it is used purely for index deduplication, so it would not be used anyway.

Compactor

Compactor is a BoltDB Shipper specific service that reduces the index size by deduplicating the index and merging all the files into a single file per table. We recommend running a Compactor, since a single Ingester creates 96 files per day, which include a lot of duplicate index entries, and querying multiple files per table adds to the overall query latency.

Note: Only one compactor instance should run at a time; running more than one could create problems and may lead to data loss.
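
The figure of 96 files follows from the 15-minute file granularity described earlier: 24h / 15min = 96. A small sketch of that arithmetic, assuming every ingester writes one file in every 15-minute window of a 24h table (files_per_table is a hypothetical helper):

```python
from datetime import timedelta


def files_per_table(ingesters=1,
                    file_window=timedelta(minutes=15),
                    table_period=timedelta(hours=24)):
    """Upper bound on uncompacted index files in one daily table."""
    return ingesters * (table_period // file_window)


print(files_per_table())              # prints 96
print(files_per_table(ingesters=10))  # prints 960
```

Without compaction, a query over one day in a ten-ingester cluster could therefore have to open several hundred files instead of one.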

Delete Permissions

The compactor is an optional but suggested component that combines and deduplicates the boltdb-shipper index files. When compacting index files, the compactor writes a new file and deletes unoptimized files. Ensure that the compactor has appropriate permissions for deleting files, for example, s3:DeleteObject permission for AWS S3.

Example compactor configuration with GCS:

compactor:
  working_directory: /loki/compactor
  shared_store: gcs

storage_config:
  gcs:
    bucket_name: GCS_BUCKET_NAME

Table manager

Tue, 16 Jul 2024 15:42:20 +0000

Table Manager

Grafana Loki supports storing indexes and chunks in table-based data stores. When such a storage type is used, multiple tables are created over time: each table, also called a periodic table, contains the data for a specific time range.

This design brings two main benefits:

Schema config changes: each table is bound to a schema config and version, so that changes can be introduced over time and multiple schema configs can coexist
Retention: retention is implemented by deleting an entire table, which makes delete operations fast

The Table Manager is a Loki component which takes care of creating a periodic table before its time period begins, and deleting it once its data time range exceeds the retention period.

The Table Manager supports the following backends:

Index store
- Single Store (boltdb-shipper)
- Amazon DynamoDB
- Google Bigtable
- Apache Cassandra
- BoltDB (primarily used for local environments)
Chunk store
- Amazon DynamoDB
- Google Bigtable
- Apache Cassandra
- Filesystem (primarily used for local environments)

The object storages - like Amazon S3 and Google Cloud Storage - supported by Loki to store chunks, are not managed by the Table Manager, and a custom bucket policy should be set to delete old data.

For detailed information on configuring the Table Manager, refer to the table_manager section in the Loki configuration document.

Tables and schema config

A periodic table stores the index or chunk data relative to a specific period of time. The duration of the time range of the data stored in a single table and its storage type is configured in the schema_config configuration block.

The schema_config can contain one or more configs. Each config defines the storage used between the day set in from (in the format yyyy-mm-dd) and the next config, or “now” in the case of the last schema config entry.

This allows multiple non-overlapping schema configs to exist over time, in order to perform schema version upgrades or change storage settings (including changing the storage type).

The write path hits the table where the log entry timestamp falls into (usually the last table, except short periods close to the end of a table and the beginning of the next one), while the read path hits the tables containing data for the query time range.

Schema config example

For example, the following schema_config defines two configurations: the first one using the schema v10 and the current one using the v11.

The first config stores data between 2019-01-01 and 2019-04-14 (included), then a new config has been added - to upgrade the schema version to v11 - storing data using the v11 schema from 2019-04-15 on.

For each config, multiple tables are created, each one storing data for one period (168 hours = 7 days).

schema_config:
  configs:
    - from:   2019-01-01
      store:  dynamo
      schema: v10
      index:
        prefix: loki_
        period: 168h
    - from:   2019-04-15
      store:  dynamo
      schema: v11
      index:
        prefix: loki_
        period: 168h
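
How a timestamp maps to one of these configs can be sketched in Python. The sketch mirrors only the from/schema fields of the example above; config_for and SCHEMA_CONFIGS are hypothetical names, and raising an error for dates before the first config is an assumption of this sketch, not documented Loki behavior.

```python
from datetime import date

# The two entries of the example schema_config, reduced to the relevant fields.
SCHEMA_CONFIGS = [
    {"from": date(2019, 1, 1), "store": "dynamo", "schema": "v10"},
    {"from": date(2019, 4, 15), "store": "dynamo", "schema": "v11"},
]


def config_for(day, configs=SCHEMA_CONFIGS):
    """Return the latest config whose `from` date is on or before `day`."""
    applicable = [c for c in configs if c["from"] <= day]
    if not applicable:
        raise ValueError(f"no schema config covers {day}")
    return max(applicable, key=lambda c: c["from"])


print(config_for(date(2019, 4, 14))["schema"])  # prints v10
print(config_for(date(2019, 4, 15))["schema"])  # prints v11
```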

Table creation

The Table Manager creates new tables slightly ahead of their start period, in order to make sure that the new table is ready once the current table end period is reached.

The creation_grace_period property, in the table_manager configuration block, defines how long before its start period a table should be created.

Retention

Retention, as managed by the Table Manager, is disabled by default due to its destructive nature. You can enable data retention by explicitly enabling it in the configuration and setting a retention_period greater than zero:

table_manager:
  retention_deletes_enabled: true
  retention_period: 336h

The Table Manager implements retention by deleting entire tables whose data has exceeded the retention_period. This design makes delete operations fast, at the cost of a retention granularity controlled by the table’s period.

Given that each table contains data for one period of time and that the entire table is deleted, the Table Manager keeps the last tables alive using this formula:

number_of_tables_to_keep = floor(retention_period / table_period) + 1

It’s important to note that - due to the internal implementation - the table period and retention_period must be multiples of 24h in order to get the expected behavior.
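
As a worked example of the formula, here is a minimal Python sketch with both durations expressed in whole hours; the function name simply mirrors the formula.

```python
def number_of_tables_to_keep(retention_period_h, table_period_h):
    """floor(retention_period / table_period) + 1, with durations in hours."""
    return retention_period_h // table_period_h + 1


# retention_period: 336h with 168h (weekly) tables keeps 3 tables.
print(number_of_tables_to_keep(336, 168))  # prints 3
# retention_period: 672h with 168h tables keeps 5 tables.
print(number_of_tables_to_keep(672, 168))  # prints 5
```

Because whole tables are kept, the oldest retained table can hold data older than retention_period, by up to one table period.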

For detailed information on configuring the retention, refer to the Loki Storage Retention documentation.

Active / inactive tables

A table can be active or inactive.

A table is considered active if the current time is within the range:

Table start period - creation_grace_period
Table end period + max chunk age (hardcoded to 12h)
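
The active range can be expressed as a simple interval check. In this sketch the two lines above are read as the lower and upper bound of the range; treating both bounds as inclusive and the 10-minute grace period in the usage example are assumptions, and is_active is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

MAX_CHUNK_AGE = timedelta(hours=12)  # hardcoded value quoted in the text above


def is_active(table_start, table_end, now, creation_grace_period):
    """True if `now` lies between (start - grace) and (end + max chunk age)."""
    lower = table_start - creation_grace_period
    upper = table_end + MAX_CHUNK_AGE
    return lower <= now <= upper


# A weekly (168h) table starting on 2019-04-15, with an assumed 10m grace period.
start = datetime(2019, 4, 15, tzinfo=timezone.utc)
end = start + timedelta(hours=168)
grace = timedelta(minutes=10)
print(is_active(start, end, start - timedelta(minutes=5), grace))  # prints True
print(is_active(start, end, end + timedelta(hours=13), grace))     # prints False
```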

Currently, the difference between an active and inactive table only applies to the DynamoDB storage settings: capacity mode (on-demand or provisioned), read/write capacity units and autoscaling.

DynamoDB	Active table	Inactive table
Capacity mode	`enable_ondemand_throughput_mode`	`enable_inactive_throughput_on_demand_mode`
Read capacity unit	`provisioned_read_throughput`	`inactive_read_throughput`
Write capacity unit	`provisioned_write_throughput`	`inactive_write_throughput`
Autoscaling	Enabled (if configured)	Always disabled

DynamoDB Provisioning

When configuring DynamoDB with the Table Manager, the default on-demand provisioning capacity units for reads are set to 300 and writes are set to 3000. The defaults can be overwritten:

table_manager:
  index_tables_provisioning:
    provisioned_write_throughput: 10
    provisioned_read_throughput: 10
  chunk_tables_provisioning:
    provisioned_write_throughput: 10
    provisioned_read_throughput: 10

If Table Manager is not automatically managing DynamoDB, old data cannot easily be erased and the index will grow indefinitely. Manual configurations should ensure that the primary index key is set to h (string) and the sort key is set to r (binary). The “period” attribute in the configuration YAML should be set to 0.

Table Manager deployment mode

The Table Manager can be executed in two ways:

Implicitly executed when Loki runs in monolithic mode (single process)
Explicitly executed when Loki runs in microservices mode

Monolithic mode

When Loki runs in monolithic mode, the Table Manager is also started as a component of the entire stack.

Microservices mode

When Loki runs in microservices mode, the Table Manager should be started as a separate service named table-manager.

You can check out a production grade deployment example at table-manager.libsonnet.

Write Ahead Log

Mon, 14 Apr 2025 21:05:47 +0000

Write Ahead Log (WAL)

Ingesters temporarily store data in memory. In the event of a crash, there could be data loss. The WAL helps fill this gap in reliability.

The WAL in Grafana Loki records incoming data and stores it on the local file system in order to guarantee persistence of acknowledged data in the event of a process crash. Upon restart, Loki will “replay” all of the data in the log before registering itself as ready for subsequent writes. This allows Loki to maintain the performance & cost benefits of buffering data in memory and durability benefits (it won’t lose data once a write has been acknowledged).

This section will use Kubernetes as a reference deployment paradigm in the examples.

Disclaimer & WAL nuances

The Write Ahead Log in Loki takes a few particular tradeoffs compared to other WALs you may be familiar with. The WAL aims to add additional durability guarantees, but not at the expense of availability. Particularly, there are two scenarios where the WAL sacrifices these guarantees.

Corruption/Deletion of the WAL prior to replaying it

In the event the WAL is corrupted or partially deleted, Loki will not be able to recover all of its data. In this case, Loki will attempt to recover any data it can, and the corruption will not prevent Loki from starting.

Note: the Prometheus metric loki_ingester_wal_corruptions_total can be used to track and alert when this happens.

No space left on disk

In the event the underlying WAL disk is full, Loki will not fail incoming writes, but neither will it log them to the WAL. In this case, the persistence guarantees across process restarts will not hold.

Note: the Prometheus metric loki_ingester_wal_disk_full_failures_total can be used to track and alert when this happens.

Backpressure

The WAL also includes a backpressure mechanism to allow a large WAL to be replayed within a smaller memory bound. This is helpful after bad scenarios (e.g. an outage) when a WAL has grown past the point at which it can be recovered in memory. In this case, the ingester tracks the amount of data being replayed and, once it has passed the ingester.wal-replay-memory-ceiling threshold, flushes to storage. When this happens, it’s likely that Loki’s attempt to deduplicate chunks via content addressable storage will suffer. We deemed this efficiency loss an acceptable tradeoff considering how it simplifies operation and that it should not occur during regular operation (rollouts, rescheduling) where the WAL can be replayed without triggering this threshold.
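
The idea of replaying under a memory ceiling can be modeled with a short Python sketch. It is a simplified illustration, not the ingester's actual replay code: records are plain byte strings, memory use is approximated by their total length, and replay_wal and the flush callback are hypothetical names.

```python
def replay_wal(records, memory_ceiling_bytes, flush):
    """Replay records, flushing buffered data whenever the ceiling is passed.

    Returns the records still buffered in memory and the number of flushes.
    """
    buffered = []
    buffered_bytes = 0
    flushes = 0
    for record in records:
        buffered.append(record)
        buffered_bytes += len(record)
        if buffered_bytes > memory_ceiling_bytes:
            flush(buffered)  # stand-in for flushing to storage
            buffered, buffered_bytes = [], 0
            flushes += 1
    return buffered, flushes


# Five 40-byte records replayed under a 100-byte ceiling.
flushed = []
remaining, flush_count = replay_wal([b"x" * 40] * 5, 100, flushed.extend)
print(len(flushed), len(remaining), flush_count)  # prints 3 2 1
```

A WAL much larger than the ceiling can be replayed this way, at the cost of flushing early, which is the trade-off described above.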

Metrics

Changes to deployment

Since ingesters need to have the same persistent volume across restarts/rollouts, all the ingesters should be run as a StatefulSet with fixed volumes.
The following flags need to be set:
- --ingester.wal-enabled to true which enables writing to WAL during ingestion.
- --ingester.wal-dir to the directory where the WAL data should be stored and/or recovered from. Note that this should be on the mounted volume.
- --ingester.checkpoint-duration to the interval at which checkpoints should be created.
- --ingester.wal-replay-memory-ceiling (default 4GB) may be set higher/lower depending on your resource settings. It handles memory pressure during WAL replays, allowing a WAL many times larger than available memory to be replayed. This is provided to minimize reconciliation time after very bad situations, e.g. an outage, and will likely not impact regular operations/rollouts at all. We suggest setting this to a high percentage (~75%) of available memory.

Changes in lifecycle when WAL is enabled

Flushing of data to the chunk store during rollouts or scale down is disabled. This is because during a rollout of a statefulset no ingesters are simultaneously leaving and joining; rather, the same ingester is shut down and brought back up with the updated config. Hence flushing is skipped and the data is recovered from the WAL.

Disk space requirements

Based on real-world tests:

- Numbers are from an ingester with 5000 series ingesting ~5MB/s.
- The checkpoint period was 5 minutes.
- Disk utilization on a WAL-only disk was steady at ~10-15GB.

You should not target 100% disk utilisation.
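
A small worked example of leaving headroom: divide the observed peak WAL usage by the utilization you are willing to run at. The 75% target below is an arbitrary assumption for illustration, not a recommendation from these tests, and the function name is hypothetical.

```python
def provisioned_disk_gb(observed_peak_gb, target_utilization):
    """Disk size needed so that the observed peak sits at the target utilization."""
    if not 0 < target_utilization < 1:
        raise ValueError("target_utilization must be between 0 and 1 (exclusive)")
    return observed_peak_gb / target_utilization


# Upper end of the observed range (15GB) at an assumed 75% target utilization.
size_gb = provisioned_disk_gb(15, 0.75)
```

With these inputs the disk would be sized at 20GB, so the observed peak leaves a quarter of the volume free.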

Migrating from stateless deployments

Scale down the ingester deployment without WAL and scale up the statefulset with WAL in step with each other, one ingester at a time, without transferring data between them. This ensures that ingestion is reliable immediately after the migration.

Let’s take an example of 4 ingesters. The migration would look something like this:

1. Bring up one stateful ingester, ingester-0, and wait until it’s ready (accepting read and write requests).
2. Scale down the old ingester deployment to 3 and wait until the leaving ingester flushes all of its data to the chunk store.
3. Once that ingester has disappeared from kubectl get pods ..., add another stateful ingester and wait until it’s ready. Now you have ingester-0 and ingester-1.
4. Repeat step 2 to remove another ingester from the old deployment.
5. Repeat step 3 to add another stateful ingester. Now you have ingester-0, ingester-1 and ingester-2.
6. Repeat steps 4 and 5, and you will finally have ingester-0, ingester-1, ingester-2 and ingester-3.
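
The alternating pattern in the steps above can be sketched as a small plan generator for any number of ingesters. The function name and step wording are illustrative, and the final scale-down of the old deployment to 0 replicas is an assumption about how the migration is closed out.

```python
def migration_steps(n):
    """Alternate one scale-up of the statefulset with one scale-down of the old deployment."""
    steps = []
    for i in range(n):
        steps.append(
            "start stateful ingester-%d and wait until it is ready" % i
        )
        # Assumption: the old deployment is eventually scaled all the way to 0.
        steps.append(
            "scale old deployment down to %d and wait for the leaving ingester to flush"
            % (n - i - 1)
        )
    return steps


plan = migration_steps(4)
```

For 4 ingesters this yields 8 steps, starting with ingester-0 and ending with the old deployment at 0 replicas.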

How to scale up/down

Scale up

Scaling up is the same as what you would do without the WAL or statefulsets. Nothing changes here.

Scale down

When scaling down, we must ensure that existing data on the leaving ingesters is flushed to storage instead of just the WAL. This is because we won’t be replaying the WAL on an ingester that will no longer exist, and we need to make sure the data is not orphaned.

Suppose you have 4 ingesters (ingester-0, ingester-1, ingester-2 and ingester-3) and you want to scale down to 2. According to statefulset rules, the ingesters that will be shut down are ingester-3 and then ingester-2.

Hence, before actually scaling down in Kubernetes, port forward to each of those ingesters and hit its /ingester/flush_shutdown endpoint. This causes the ingester to flush its chunks and remove itself from the ring, after which it will register as unready and may be deleted.

After hitting the endpoint for ingester-2 and ingester-3, scale down the ingesters to 2.
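
A minimal sketch of working out which ingesters need the flush before a scale-down, assuming the usual statefulset behaviour of removing the highest ordinals first. The helper names are hypothetical, and port 3100 is an assumption about the local port used for the port forward.

```python
def ingesters_to_flush(current, target, name="ingester"):
    """Pods that will be removed when scaling from `current` to `target` replicas."""
    if not 0 <= target <= current:
        raise ValueError("target must be between 0 and current")
    # Highest ordinal is shut down first.
    return ["%s-%d" % (name, i) for i in range(current - 1, target - 1, -1)]


def flush_url(local_port=3100):
    """URL to hit once the ingester has been port forwarded to localhost."""
    return "http://localhost:%d/ingester/flush_shutdown" % local_port


leaving = ingesters_to_flush(4, 2)
url = flush_url()
```

Each pod in the returned list is port forwarded in turn and the URL is requested once per pod; only then is the statefulset scaled down.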

Additional notes

Kubernetes hacking

Statefulsets are significantly more cumbersome to work with, upgrade, and so on. Much of this stems from immutable fields on the specification. For example, if you want to start using the WAL with single store Loki and want separate volume mounts for the WAL and the boltdb-shipper, you may see immutability errors when attempting to update the Kubernetes statefulsets.

In this case, try kubectl -n <namespace> delete sts ingester --cascade=false (newer kubectl versions spell this option --cascade=orphan). This leaves the pods alive but deletes the statefulset. You can then recreate the updated statefulset and delete the pods ingester-0 through ingester-n one by one, in that order, allowing the statefulset to spin up new pods to replace them.
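
The pod deletions for that last step can be listed in order with a short helper. This is only a sketch: the function name and the "loki" namespace are hypothetical, and each command should be run only after the previous replacement pod is ready.

```python
def pod_delete_commands(namespace, replicas, name="ingester"):
    """kubectl commands that delete the ingester pods from ordinal 0 upwards."""
    return [
        "kubectl -n %s delete pod %s-%d" % (namespace, name, i)
        for i in range(replicas)
    ]


# Hypothetical namespace "loki" with three ingester pods.
commands = pod_delete_commands("loki", 3)
```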

Non-Kubernetes or baremetal deployments

- When the ingester restarts for any reason (upgrade, crash, etc.), it should be able to attach to the same volume in order to recover the WAL and tokens.
- Two ingesters should not work with the same volume or directory for the WAL.
- A rollout should bring down an ingester completely and then start the new ingester, not the other way around.