Run Grafana Mimir in production using the Helm chart

Beyond Getting started with Grafana Mimir using the Helm chart, which covers setting up Grafana Mimir on a local Kubernetes cluster or within a low-risk development environment, you can prepare Grafana Mimir for production.

Although the information that follows assumes that you are using Grafana Mimir in a production environment that is customer-facing, you might need the high-availability and horizontal-scalability features of Grafana Mimir even in an internal, development environment.

Before you begin

Meet all the following prerequisites:

  • You are familiar with Helm 3.x.

    Add the grafana Helm repository to your local environment or to your CI/CD tooling:

    helm repo add grafana https://grafana.github.io/helm-charts
    helm repo update
    
  • You have an external object storage that is different from the MinIO object storage that mimir-distributed deploys, because the MinIO deployment in the Helm chart is only intended for getting started and is not intended for production use.

    To use Grafana Mimir in production, you must replace the default object storage with an Amazon S3 compatible service, Google Cloud Storage, Microsoft® Azure Blob Storage, or OpenStack Swift. Alternatively, to deploy MinIO yourself, see MinIO High Performance Object Storage.

Plan capacity

The mimir-distributed Helm chart comes with two sizing plans, small.yaml and large.yaml.

These sizing plans are estimates based on experience from operating Grafana Mimir at Grafana Labs. The ideal size for your cluster depends on your usage patterns. Therefore, use the sizing plans as a starting point for sizing your Grafana Mimir cluster, rather than as strict guidelines. To get a better idea of how to plan capacity, refer to the YAML comments at the beginning of the small.yaml and large.yaml files, which relate to read and write workloads. See also Planning Grafana Mimir capacity.

To use a sizing plan, copy it from the mimir GitHub repository, and pass it as a values file to the helm command. Note that sizing plans may change with new versions of the mimir-distributed chart. Make sure to use a sizing plan from a version close to the version of the Helm chart that you are installing.

For example:

helm install mimir-prod grafana/mimir-distributed -f ./small.yaml
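
Helm merges multiple values files in the order given, and later files take precedence. One way to combine a sizing plan with your own overrides, while pinning the chart version that the plan was copied from, is sketched below. The release name mimir-prod comes from the example above; the function name, the custom.yaml file name, the namespace, and the version argument are placeholders assumed for illustration.

```shell
# Sketch: install the chart with a sizing plan plus your own overrides.
# Assumptions: small.yaml and custom.yaml are in the current directory, and
# the caller supplies the chart version that the sizing plan was copied from.
install_mimir() {
  chart_version="$1"   # the chart version that small.yaml belongs to
  namespace="$2"       # hypothetical namespace name chosen by you

  # Later -f files override earlier ones, so custom.yaml wins over small.yaml.
  helm install mimir-prod grafana/mimir-distributed \
    --version "$chart_version" \
    --namespace "$namespace" --create-namespace \
    -f ./small.yaml \
    -f ./custom.yaml
}

# Usage (hypothetical values):
#   install_mimir 4.0.0 mimir
```

Keeping your overrides in a separate file means you can replace small.yaml with a newer copy of the sizing plan without editing your own settings.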

Conform to fault-tolerance requirements

As part of Pod scheduling, the small.yaml and large.yaml files add Pod anti-affinity rules so that no two ingester Pods, nor two store-gateway Pods, are scheduled on any given Kubernetes Node. This increases fault tolerance of the Mimir cluster.

You must create and add Nodes, such that the number of Nodes is equal to or larger than either the number of ingester Pods or the number of store-gateway Pods, whichever one is larger. Expressed as a formula, it reads as follows:

number_of_nodes >= max(number_of_ingester_pods, number_of_store_gateway_pods)
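
As a quick check of the formula above, the following shell helper returns the minimum Node count for a given number of ingester and store-gateway Pods. The function name and its arguments are illustrative only and are not part of the Helm chart.

```shell
# Sketch: compute the minimum Node count from the formula above.
# The function name and arguments are illustrative, not part of the chart.
required_nodes() {
  ingesters="$1"
  store_gateways="$2"
  # Return whichever Pod count is larger.
  if [ "$ingesters" -ge "$store_gateways" ]; then
    echo "$ingesters"
  else
    echo "$store_gateways"
  fi
}

# Usage: required_nodes <ingester_pods> <store_gateway_pods>
#   required_nodes 3 5    # prints 5
```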

For more information about the failure modes of either the ingester or store-gateway component, refer to Ingesters failure and data loss or Store-gateway: Blocks sharding and replication.

Decide whether you need geographical redundancy, fast rolling updates, or both.

You can use a rolling update strategy to apply configuration changes to Grafana Mimir, and to upgrade Grafana Mimir to a newer version. A rolling update results in no downtime to Grafana Mimir.

The Helm chart performs a rolling update for you. To make sure that rolling updates are faster, configure the Helm chart to deploy Grafana Mimir with zone-aware replication.

New installations

Grafana Mimir supports replication across availability zones within your Kubernetes cluster. This further increases fault tolerance of the Mimir cluster. Even if your Kubernetes cluster does not currently span multiple zones, enabling zone-aware replication now means that you do not have to migrate your cluster later, when you start using multiple zones.

For mimir-distributed Helm chart v4.0 or higher, zone-awareness is enabled by default for new installations.

To benefit from zone-awareness, choose the node selectors for your different zones. For convenience, you can use the following YAML configuration snippet as a starting point:

ingester:
  zoneAwareReplication:
    enabled: true
    topologyKey: kubernetes.io/hostname
    zones:
      - name: zone-a
        nodeSelector:
          topology.kubernetes.io/zone: us-central1-a
      - name: zone-b
        nodeSelector:
          topology.kubernetes.io/zone: us-central1-b
      - name: zone-c
        nodeSelector:
          topology.kubernetes.io/zone: us-central1-c

store_gateway:
  zoneAwareReplication:
    enabled: true
    topologyKey: kubernetes.io/hostname
    zones:
      - name: zone-a
        nodeSelector:
          topology.kubernetes.io/zone: us-central1-a
      - name: zone-b
        nodeSelector:
          topology.kubernetes.io/zone: us-central1-b
      - name: zone-c
        nodeSelector:
          topology.kubernetes.io/zone: us-central1-c
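
Before you set the nodeSelector values, it can help to confirm which zone labels your Nodes actually carry. The following is a minimal sketch that assumes your Nodes use the well-known topology.kubernetes.io/zone label, as in the snippet above; the function name is hypothetical.

```shell
# Sketch: list the distinct zone labels present on your Nodes, so that the
# nodeSelector values match real label values in your cluster.
# Nodes that lack the label contribute an empty line to the output.
list_node_zones() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}{end}' \
    | sort -u
}

# Usage:
#   list_node_zones
```

If the output shows zone names that differ from us-central1-a, us-central1-b, and us-central1-c, replace the values in the snippet with the names from your cluster.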

Existing installations

If you are upgrading from a previous mimir-distributed Helm chart version to v4.0, then refer to the migration guide to configure zone-aware replication.

Configure Mimir to use object storage

For the different object storage types that Mimir supports, and examples, see Configure Grafana Mimir object storage backend.

  1. Add the following YAML to your values file, if you are not using the sizing plans that are mentioned in Plan capacity:

    minio:
      enabled: false
    
  2. Prepare the credentials and bucket names for the object storage.

  3. Add the object storage configuration to the Helm chart values. Nest the object storage configuration under mimir.structuredConfig. This example uses Amazon S3:

    mimir:
      structuredConfig:
        common:
          storage:
            backend: s3
            s3:
              endpoint: s3.us-east-2.amazonaws.com
              region: us-east-2
              secret_access_key: "${AWS_SECRET_ACCESS_KEY}" # This is a secret injected via an environment variable
              access_key_id: "${AWS_ACCESS_KEY_ID}" # This is a secret injected via an environment variable
    
        blocks_storage:
          s3:
            bucket_name: mimir-blocks
        alertmanager_storage:
          s3:
            bucket_name: mimir-alertmanager
        ruler_storage:
          s3:
            bucket_name: mimir-ruler
    
        # The following admin_client configuration only applies to Grafana Enterprise Metrics deployments:
        #admin_client:
        #  storage:
        #    s3:
        #      bucket_name: gem-admin
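
The example above expects AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to be available as environment variables in the Mimir Pods. One possible way to supply them is to store them in a Kubernetes Secret, as sketched below. The function name and the Secret name mimir-bucket-secret are assumptions made for illustration, and the sketch reads the credentials from your current shell environment.

```shell
# Sketch: store the object storage credentials in a Kubernetes Secret.
# Assumptions: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set in the
# calling shell, and mimir-bucket-secret is a hypothetical Secret name.
create_s3_secret() {
  namespace="$1"   # the namespace where Mimir is deployed
  kubectl create secret generic mimir-bucket-secret \
    --namespace "$namespace" \
    --from-literal=AWS_ACCESS_KEY_ID="$AWS_ACCESS_KEY_ID" \
    --from-literal=AWS_SECRET_ACCESS_KEY="$AWS_SECRET_ACCESS_KEY"
}

# Usage (hypothetical namespace):
#   create_s3_secret mimir
```

The Secret then needs to be exposed to the Mimir Pods as environment variables. The chart's global.extraEnvFrom value is one likely place to reference it, but treat that key name as an assumption and confirm it against the values.yaml of your chart version.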
    

Meet security compliance regulations

Grafana Mimir does not require any special permissions on the hosts that it runs on. Because of this, you can deploy it in environments that enforce the Kubernetes Restricted security policy.

In Kubernetes v1.23 and higher, the Restricted policy can be enforced via a namespace label on the Namespace resource where Mimir is deployed. For example:

pod-security.kubernetes.io/enforce: restricted
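
One way to apply this label to an existing Namespace is with kubectl label, as in the following sketch. The function name is hypothetical, and the namespace argument stands for whichever Namespace you deploy Mimir into.

```shell
# Sketch: enforce the Restricted Pod Security Standard on a Namespace.
# --overwrite replaces any enforce level that is already set on the Namespace.
enforce_restricted() {
  namespace="$1"
  kubectl label --overwrite namespace "$namespace" \
    pod-security.kubernetes.io/enforce=restricted
}

# Usage (hypothetical namespace):
#   enforce_restricted mimir
```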

In Kubernetes versions prior to 1.23, the mimir-distributed Helm chart provides a PodSecurityPolicy resource that enforces many of the recommendations from the Restricted policy that the namespace label enforces. To enable the PodSecurityPolicy admission controller for your Kubernetes cluster, refer to How do I turn on an admission controller?.

Monitor the health of your Grafana Mimir cluster

To monitor the health of your Grafana Mimir cluster, which is also known as metamonitoring, you can use ready-made Grafana dashboards, and Prometheus alerting and recording rules. For more information, see Installing Grafana Mimir dashboards and alerts.

The mimir-distributed Helm chart makes it easy for you to collect metrics and logs from Mimir. It assigns the correct labels for you so that the dashboards and alerts work without further configuration. The chart uses the Grafana Agent to ship metrics to a Prometheus-compatible server and logs to a Loki or GEL (Grafana Enterprise Logs) server.

  1. Download the Grafana Agent Operator Custom Resource Definitions (CRDs) from https://github.com/grafana/agent/tree/main/production/operator/crds

  2. Install the CRDs on your cluster:

    kubectl apply -f production/operator/crds/
    
  3. Add the following YAML snippet to your values file, to send metamonitoring telemetry from Mimir. Change the URLs and credentials to match your desired destination.

    metaMonitoring:
      serviceMonitor:
        enabled: true
      grafanaAgent:
        enabled: true
        installOperator: true
    
        logs:
          remote:
            url: "https://example.com/loki/api/v1/push"
            auth:
              username: 12345
    
        metrics:
          remote:
            url: "https://prometheus.prometheus.svc.cluster.local./api/v1/push"
            headers:
              X-Scope-OrgID: metamonitoring
    

    For details about how to set up the credentials, see Collecting metrics and logs from Grafana Mimir.

Your Grafana Mimir cluster can now ingest metrics in production.

Configure clients to write metrics to Mimir

To configure each client to remote-write metrics to Mimir, refer to Configure Prometheus to write to Grafana Mimir and Configure Grafana Agent to write to Grafana Mimir.

Set up redundant Prometheus or Grafana Agent instances for high availability

If you need redundancy on the write path before it reaches Mimir, then you can set up redundant instances of Prometheus or Grafana Agent to write metrics to Mimir.

For more information, see Configure high-availability deduplication with Consul.