Run Grafana Mimir in production using the Helm chart
The guide Get started with Grafana Mimir using the Helm chart covers setting up Grafana Mimir on a local Kubernetes cluster or within a low-risk development environment. This guide builds on it and explains how to prepare Grafana Mimir for production.
Although the information that follows assumes that you are using Grafana Mimir in a production environment that is customer-facing, you might need the high-availability and horizontal-scalability features of Grafana Mimir even in an internal, development environment.
Before you begin
Meet all the following prerequisites:
You are familiar with Helm 3.x.
Add the grafana Helm repository to your local environment or to your CI/CD tooling:
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
You have an external object storage that is different from the MinIO object storage that mimir-distributed deploys, because the MinIO deployment in the Helm chart is only intended for getting started and is not intended for production use.
To use Grafana Mimir in production, you must replace the default object storage with an Amazon S3 compatible service, Google Cloud Storage, Microsoft® Azure Blob Storage, or OpenStack Swift. Alternatively, to deploy MinIO yourself, see MinIO High Performance Object Storage.
Note
Like Amazon S3, the chosen object storage implementation must not create directories. Grafana Mimir doesn’t have any notion of object storage directories, and so will leave empty directories behind when removing blocks. For example, if you use Azure Blob Storage, you must disable hierarchical namespace.
Plan capacity
The mimir-distributed Helm chart comes with two sizing plans:
- For 1M series: small.yaml
- For 10M series: large.yaml
These sizing plans are estimates based on experience from operating Grafana Mimir at Grafana Labs. The ideal size for your cluster depends on your usage patterns. Therefore, use the sizing plans as a starting point for sizing your Grafana Mimir cluster, rather than as strict guidelines.
To get a better idea of how to plan capacity, refer to the YAML comments at the beginning of the small.yaml and large.yaml files, which relate to read and write workloads.
See also [Planning Grafana Mimir capacity].
To use a sizing plan, copy it from the mimir GitHub repository, and pass it as a values file to the helm command. Note that sizing plans may change with new versions of the mimir-distributed chart. Make sure to use a sizing plan from a version close to the version of the Helm chart that you are installing.
For example:
helm install mimir-prod grafana/mimir-distributed -f ./small.yaml
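To keep the sizing plan and the chart version aligned, you can fetch the plan from the Git tag that matches the chart release. The following is a minimal sketch: the helper name sizing_plan_url is hypothetical, and both the tag format (mimir-distributed-<version>) and the file path inside the mimir repository are assumptions that you should verify against the repository before relying on them.

```shell
# sizing_plan_url CHART_VERSION PLAN
# Prints the raw GitHub URL of a sizing plan for a given chart version.
# ASSUMPTION: chart releases are tagged "mimir-distributed-<version>" and the
# plans live under operations/helm/charts/mimir-distributed/ in the repository.
sizing_plan_url() {
  chart_version=$1
  plan=$2
  echo "https://raw.githubusercontent.com/grafana/mimir/mimir-distributed-${chart_version}/operations/helm/charts/mimir-distributed/${plan}.yaml"
}

# Example with a hypothetical chart version.
sizing_plan_url 4.0.0 small
```

You could then download the file with a tool such as curl and install the same chart version by adding the --version flag to the helm install command.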
Conform to fault-tolerance requirements
As part of Pod scheduling, the small.yaml and large.yaml files add Pod anti-affinity rules so that no two ingester Pods, and no two store-gateway Pods, are scheduled on the same Kubernetes Node. This increases the fault tolerance of the Mimir cluster.
You must create and add Nodes, such that the number of Nodes is equal to or larger than either the number of ingester Pods or the number of store-gateway Pods, whichever one is larger. Expressed as a formula, it reads as follows:
number_of_nodes >= max(number_of_ingesters_pods, number_of_store_gateway_pods)
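As a quick check of the formula, the following sketch computes the minimum Node count from two replica counts. The function name min_nodes and the replica counts are hypothetical; read the real counts from the values file that you deploy.

```shell
# min_nodes INGESTERS STORE_GATEWAYS
# Prints the smallest Node count that satisfies
# number_of_nodes >= max(ingesters, store_gateways).
min_nodes() {
  ingesters=$1
  store_gateways=$2
  if [ "$ingesters" -ge "$store_gateways" ]; then
    echo "$ingesters"
  else
    echo "$store_gateways"
  fi
}

# Hypothetical replica counts: 3 ingesters and 5 store-gateways.
min_nodes 3 5   # prints 5
```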
For more information about the failure modes of either the ingester or store-gateway component, refer to [Ingesters failure and data loss] or [Store-gateway: Blocks sharding and replication].
Decide whether you need geographical redundancy, fast rolling updates, or both.
You can use a rolling update strategy to apply configuration changes to Grafana Mimir, and to upgrade Grafana Mimir to a newer version. A rolling update results in no downtime to Grafana Mimir.
The Helm chart performs a rolling update for you. To make sure that rolling updates are faster, configure the Helm chart to deploy Grafana Mimir with zone-aware replication.
New installations
Grafana Mimir supports [replication across availability zones] within your Kubernetes cluster. This further increases fault tolerance of the Mimir cluster. Even if your Kubernetes cluster does not currently span multiple zones, enabling zone-aware replication now means that you do not have to migrate your cluster later, when you start using multiple zones.
For mimir-distributed Helm chart v4.0 or higher, zone-awareness is enabled by default for new installations.
To benefit from zone-awareness, choose the node selectors for your different zones. For convenience, you can use the following YAML configuration snippet as a starting point:
ingester:
  zoneAwareReplication:
    enabled: true
    topologyKey: kubernetes.io/hostname
    zones:
      - name: zone-a
        nodeSelector:
          topology.kubernetes.io/zone: us-central1-a
      - name: zone-b
        nodeSelector:
          topology.kubernetes.io/zone: us-central1-b
      - name: zone-c
        nodeSelector:
          topology.kubernetes.io/zone: us-central1-c
store_gateway:
  zoneAwareReplication:
    enabled: true
    topologyKey: kubernetes.io/hostname
    zones:
      - name: zone-a
        nodeSelector:
          topology.kubernetes.io/zone: us-central1-a
      - name: zone-b
        nodeSelector:
          topology.kubernetes.io/zone: us-central1-b
      - name: zone-c
        nodeSelector:
          topology.kubernetes.io/zone: us-central1-c
Existing installations
If you are upgrading from a previous mimir-distributed Helm chart version to v4.0, then refer to the migration guide to configure zone-aware replication.
Configure Mimir to use object storage
For the different object storage types that Mimir supports, and examples, see [Configure Grafana Mimir object storage backend].
Add the following YAML to your values file, if you are not using the sizing plans that are mentioned in Plan capacity:
minio:
  enabled: false
Prepare the credentials and bucket names for the object storage.
Add the object storage configuration to the Helm chart values. Nest the object storage configuration under mimir.structuredConfig. This example uses Amazon S3:
mimir:
  structuredConfig:
    common:
      storage:
        backend: s3
        s3:
          endpoint: s3.us-east-2.amazonaws.com
          region: us-east-2
          secret_access_key: "${AWS_SECRET_ACCESS_KEY}" # This is a secret injected via an environment variable
          access_key_id: "${AWS_ACCESS_KEY_ID}" # This is a secret injected via an environment variable
    blocks_storage:
      s3:
        bucket_name: mimir-blocks
    alertmanager_storage:
      s3:
        bucket_name: mimir-alertmanager
    ruler_storage:
      s3:
        bucket_name: mimir-ruler
    # The following admin_client configuration only applies to Grafana Enterprise Metrics deployments:
    #admin_client:
    #  storage:
    #    s3:
    #      bucket_name: gem-admin
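The S3 example reads its credentials from environment variables. One possible way to provide them, sketched here as an assumption rather than the only approach, is to store them in a Kubernetes Secret and expose that Secret to the Mimir Pods through the chart's global.extraEnvFrom value. The helper name env_from_values and the Secret name mimir-bucket-secret are hypothetical.

```shell
# env_from_values SECRET_NAME
# Prints a Helm values fragment that loads every key of the named Secret
# as environment variables into the Pods managed by the chart.
# ASSUMPTION: the Secret already exists in the release namespace and holds
# the keys AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
env_from_values() {
  cat <<EOF
global:
  extraEnvFrom:
    - secretRef:
        name: $1
EOF
}

# Example with a hypothetical Secret name.
env_from_values mimir-bucket-secret
```

Redirect the output into a separate values file and pass it to helm with an additional -f flag, so that credentials never appear in the values file itself.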
Meet security compliance regulations
Grafana Mimir does not require any special permissions on the hosts that it runs on. Because of this, you can deploy it in environments that enforce the Kubernetes Restricted security policy.
In Kubernetes v1.23 and higher, the Restricted policy can be enforced via a namespace label on the Namespace resource where Mimir is deployed. For example:
pod-security.kubernetes.io/enforce: restricted
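To show where the label goes, the following sketch prints a complete Namespace manifest that carries it. The helper name namespace_manifest and the namespace name mimir-prod are hypothetical.

```shell
# namespace_manifest NAME
# Prints a Namespace manifest that enforces the Restricted Pod Security
# Standard through the namespace label described above.
namespace_manifest() {
  cat <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: $1
  labels:
    pod-security.kubernetes.io/enforce: restricted
EOF
}

# Example with a hypothetical namespace name.
namespace_manifest mimir-prod
```

You can pipe the output to kubectl apply -f - to create or update the Namespace before installing the chart into it.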
In Kubernetes versions prior to 1.23, the mimir-distributed Helm chart provides a PodSecurityPolicy resource that enforces many of the recommendations from the Restricted policy that the namespace label enforces.
To enable the PodSecurityPolicy admission controller for your Kubernetes cluster, refer to How do I turn on an admission controller? in the Kubernetes documentation.
For OpenShift-specific instructions, see Deploy on OpenShift.
The mimir-distributed Helm chart also deploys most of the containers with a read-only root filesystem (readOnlyRootFilesystem: true). The exceptions are the optional MinIO and Grafana Agent containers. The PodSecurityPolicy resource enforces this setting.
Monitor the health of your Grafana Mimir cluster
To monitor the health of your Grafana Mimir cluster, which is also known as metamonitoring, you can use ready-made Grafana dashboards, and Prometheus alerting and recording rules. For more information, see [Installing Grafana Mimir dashboards and alerts].
The mimir-distributed Helm chart makes it easy for you to collect metrics and logs from Mimir. It assigns the correct labels for you so that the dashboards and alerts work without further configuration. The chart uses the Grafana Agent to ship metrics to a Prometheus-compatible server and logs to a Loki or GEL (Grafana Enterprise Logs) server.
Download the Grafana Agent Operator Custom Resource Definitions (CRDs) from https://github.com/grafana/agent/tree/main/operations/agent-static-operator/crds
Install the CRDs on your cluster:
kubectl apply -f operations/agent-static-operator/crds/
Add the following YAML snippet to your values file, to send metamonitoring telemetry from Mimir. Change the URLs and credentials to match your desired destination.
metaMonitoring:
  serviceMonitor:
    enabled: true
  grafanaAgent:
    enabled: true
    installOperator: true
    logs:
      remote:
        url: "https://example.com/loki/api/v1/push"
        auth:
          username: 12345
    metrics:
      remote:
        url: "https://prometheus.prometheus.svc.cluster.local./api/v1/push"
        headers:
          X-Scope-OrgID: metamonitoring
For details about how to set up the credentials, see [Collecting metrics and logs from Grafana Mimir].
Your Grafana Mimir cluster can now ingest metrics in production.
Configure clients to write metrics to Mimir
To configure each client to remote-write metrics to Mimir, refer to Configure Prometheus to write to Grafana Mimir and Configure Grafana Agent to write to Grafana Mimir.
Set up redundant Prometheus or Grafana Agent instances for high availability
If you need redundancy on the write path before it reaches Mimir, then you can set up redundant instances of Prometheus or Grafana Agent to write metrics to Mimir.
For more information, see Configure high-availability deduplication with Consul.
Deploy on OpenShift
To deploy the mimir-distributed Helm chart on OpenShift, you need to change some of the default values.
Add the following YAML snippet to your values file. This creates a dedicated SecurityContextConstraints (SCC) resource for the mimir-distributed chart.
rbac:
  create: true
  type: scc
  podSecurityContext:
    fsGroup: null
    runAsGroup: null
    runAsUser: null
rollout_operator:
  podSecurityContext:
    fsGroup: null
    runAsGroup: null
    runAsUser: null
Alternatively, to deploy using the default SCC in your OpenShift cluster, add the following YAML snippet to your values file:
rbac:
  create: false
  type: scc
  podSecurityContext:
    fsGroup: null
    runAsGroup: null
    runAsUser: null
rollout_operator:
  podSecurityContext:
    fsGroup: null
    runAsGroup: null
    runAsUser: null
Note: When using mimir-distributed as a subchart, setting Helm values to null requires a workaround due to a bug in Helm. To set the PodSecurityContext fields to null, in addition to the YAML, set the values to null via the command line when using helm. For example, to use helm template:
helm template grafana/mimir-distributed -f values.yaml \
  --set 'mimir-distributed.rbac.podSecurityContext.fsGroup=null' \
  --set 'mimir-distributed.rbac.podSecurityContext.runAsUser=null' \
  --set 'mimir-distributed.rbac.podSecurityContext.runAsGroup=null' \
  --set 'mimir-distributed.rollout_operator.podSecurityContext.fsGroup=null' \
  --set 'mimir-distributed.rollout_operator.podSecurityContext.runAsUser=null' \
  --set 'mimir-distributed.rollout_operator.podSecurityContext.runAsGroup=null'