Scaling Prometheus with Thanos or Grafana Mimir

A Production Blueprint for Multi-Cluster Metric Federation, Deep Archetypal Comparisons, Object Storage Lifecycle Strategy, and Sub-Second Historical Query Optimization.

1. Executive Summary & The Monolithic Scale Wall

Prometheus is an exceptionally efficient time-series database (TSDB) for local, single-cluster metric collection. However, its native architecture is monolithic and localized. It stores data on local, non-replicated block storage, serves queries from the same process that performs ingestion, and offers only limited built-in options (such as hierarchical federation) for combining data from multiple locations. As an organization expands across hundreds of Kubernetes namespaces and multiple cloud regions, this localized design hits an operational scale wall.

When metric volumes scale past millions of concurrent active series, a monolithic Prometheus instance faces three major failure modes:

Storage Exhaustion: Retaining months or years of fine-grained metrics requires large, expensive SSD-backed block volumes (such as AWS EBS gp3), which drives up costs and makes a single disk the point of failure for all historical data.
Query-Time Memory Exhaustion (OOM): When a user runs a PromQL query covering a 90-day window across high-cardinality targets, Prometheus must load millions of series and samples from historical blocks into memory, which can push the process past its memory limit and trigger the kernel OOM killer.
Lack of a Global View: Engineers must switch between separate Grafana instances or data sources to inspect different clusters, which makes cross-cluster correlation, capacity planning, and global alerting difficult.
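To put the storage pressure in numbers, the sketch below applies the disk-sizing rule of thumb from the Prometheus storage documentation: retention seconds × ingested samples per second × bytes per sample, at roughly 1-2 bytes per compressed sample. The 5M-series fleet, 15s scrape interval, and 1.5 bytes per sample are illustrative assumptions, not benchmarks.

```python
"""Back-of-the-envelope local disk sizing for one Prometheus server.

Rule of thumb from the Prometheus storage documentation:
    needed_disk = retention_seconds * ingested_samples_per_second * bytes_per_sample
Compressed samples average roughly 1-2 bytes each; 1.5 is assumed here.
The fleet size and scrape interval in the example are hypothetical.
"""

SECONDS_PER_DAY = 86_400


def samples_per_second(active_series: int, scrape_interval_s: float) -> float:
    """Every active series yields one sample per scrape."""
    return active_series / scrape_interval_s


def needed_disk_bytes(active_series: int, scrape_interval_s: float,
                      retention_days: float, bytes_per_sample: float = 1.5) -> float:
    """Estimated TSDB block storage, excluding WAL and compaction headroom."""
    rate = samples_per_second(active_series, scrape_interval_s)
    return retention_days * SECONDS_PER_DAY * rate * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical fleet: 5M active series scraped every 15s by one replica.
    for days in (1, 90, 365):
        terabytes = needed_disk_bytes(5_000_000, 15, days) / 1e12
        print(f"{days:>3} days of retention -> ~{terabytes:.2f} TB")
```

Under these assumptions the estimate grows from about 0.04 TB for one day to roughly 3.89 TB for 90 days and 15.77 TB for a year, per replica; an HA pair doubles it.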

To overcome these limits, platform engineers adopt distributed metric engines. The two most widely adopted open-source options for scaling Prometheus are Thanos and Grafana Mimir. Both provide inexpensive long-term storage on object storage and decouple ingestion from query processing, but they achieve this scalability with fundamentally different architectural patterns.

2. Thanos: Architecture, Components, and Sidecar Dynamics

Thanos is an open-source CNCF Incubating project that extends Prometheus into a highly available, long-term metric system. In its classic sidecar deployment it uses a **pull-based distributed proxy model** that builds on top of your existing Prometheus servers, minimizing changes to your collection layer.

The Modular Thanos Component Ecosystem

Thanos Sidecar: Runs as a container alongside each Prometheus instance and watches the local TSDB directory. Whenever Prometheus persists a new immutable 2-hour block to disk, the Sidecar detects it and uploads it to object storage (this requires Prometheus's own local compaction to be disabled, which the Prometheus Operator normally configures for you). The Sidecar also exposes the Thanos gRPC Store API, which lets the Querier read recent data that has not been uploaded yet, including the in-memory head block.
Thanos Querier (Query Engine): The stateless brain of the system. It implements the Prometheus HTTP query API and evaluates PromQL. When an engineer runs a query from Grafana, the Querier fans the request out to all matching downstream Store API endpoints in parallel, reading recent data from Sidecars and historical data from Store Gateways, and merges the results. When started with one or more replica labels (--query.replica-label), it also deduplicates series produced by highly available Prometheus replica pairs.
Thanos Store Gateway: A largely stateless component that serves historical blocks from your object storage bucket. For each block it builds a small index-header on local disk and, at query time, fetches only the byte ranges it needs from the index and chunk files instead of downloading entire blocks.
Thanos Compactor: A singleton component (one per bucket or block stream) that continuously processes object storage. It merges small 2-hour blocks into progressively larger ones (up to 2-week blocks by default), which shrinks indexes, improves compression, and reduces the number of objects to scan. Critically, it also performs **Downsampling**, creating 5-minute and 1-hour resolution copies of older data so that year-long graphs can be served from far fewer samples.
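The Querier's fan-out and its tolerance for failing endpoints can be illustrated with a minimal sketch. The "stores" here are hypothetical Python callables standing in for Sidecars and Store Gateways; real Thanos speaks the gRPC Store API and exposes a similar partial-response toggle, but none of its code is reproduced here.

```python
"""Minimal sketch of Thanos-Querier-style fan-out with partial responses.

Assumptions: each "store" is a hypothetical callable returning
{label_set: [(timestamp, value), ...]}. This only illustrates concurrent
fan-out, merging by label set, and the partial-response trade-off.
"""
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Labels = Tuple[Tuple[str, str], ...]
Samples = List[Tuple[int, float]]
Series = Dict[Labels, Samples]
Store = Callable[[], Series]


def fan_out(stores: Dict[str, Store], timeout_s: float,
            partial_response: bool = True) -> Tuple[Series, List[str]]:
    """Query every store concurrently and merge the results by label set.

    With partial_response=True a failing or slow store becomes a warning;
    otherwise the whole query fails.
    """
    merged: Dict[Labels, Dict[int, float]] = {}
    warnings: List[str] = []
    deadline = time.monotonic() + timeout_s
    pool = ThreadPoolExecutor(max_workers=max(1, len(stores)))
    try:
        futures = {name: pool.submit(fn) for name, fn in stores.items()}
        for name, future in futures.items():
            remaining = max(0.0, deadline - time.monotonic())
            try:
                result = future.result(timeout=remaining)
            except Exception as exc:  # covers both timeouts and store errors
                if not partial_response:
                    raise RuntimeError(f"store {name} failed: {exc!r}") from exc
                warnings.append(f"{name}: {exc!r}")
                continue
            for labels, samples in result.items():
                # Overlapping stores (e.g. Sidecar and Store Gateway) may return
                # the same timestamp; keep a single sample per timestamp.
                merged.setdefault(labels, {}).update(samples)
    finally:
        # Do not wait for stragglers once the deadline has passed.
        pool.shutdown(wait=False, cancel_futures=True)
    return {labels: sorted(points.items()) for labels, points in merged.items()}, warnings
```

Returning warnings instead of failing keeps a global dashboard usable when one cluster is unreachable, at the cost of silently incomplete panels.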

Production Thanos Manifest Integration

To integrate Thanos with a cluster managed by the Prometheus Operator, enable the sidecar and reference the object storage secret directly in your Prometheus custom resource:


# thanos-prometheus-crd.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: cluster-prod-prometheus
  namespace: telemetry
spec:
  replicas: 2 # Twin replicas for HA scraping; Thanos Querier will deduplicate these
  replicaExternalLabelName: "prometheus_replica" # External label Thanos uses to deduplicate series
  retention: 24h # Local disk only acts as a short cache; older data lives in object storage
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: gp3-ebs-sc
        resources:
          requests:
            storage: 50Gi # Sized for the 24-hour local retention window above
  thanos:
    version: v0.31.0
    # Secret key holding the objstore config (bucket, endpoint, credentials) for your central S3 bucket
    objectStorageConfig:
      key: thanos.yaml
      name: thanos-objstore-secret

3. Grafana Mimir: Enterprise Remote-Write Scale Mechanics

Grafana Mimir is a horizontally scalable, highly available, multi-tenant long-term storage engine that Grafana Labs reports testing at up to 1 billion active series. Unlike Thanos's decentralized proxy approach, Mimir implements a **centralized, push-based microservices architecture** inherited from Cortex, of which it is a fork.

The High-Throughput Mimir Pipeline Components

Distributor: The stateless entry point for incoming metrics. Prometheus instances stream samples to it using the standard HTTP remote_write protocol. The Distributor validates each batch against the tenant's limits (ingestion rate, label counts, sample age), hashes every series by tenant ID and full label set, and uses the consistent hashing ring to send each series to several Ingesters (the replication factor).
Ingester: A stateful, memory-intensive component responsible for recent data. It appends incoming samples to an in-memory head block using the same TSDB format as Prometheus and records every write in a local write-ahead log (WAL). Protection against node failure comes from the Distributor writing each series to multiple Ingesters. Every 2 hours, each Ingester cuts a completed block and uploads it to long-term object storage.
Querier & Query Frontend: The Query Frontend receives PromQL requests from Grafana, splits long time ranges into smaller sub-queries (one day each by default), caches results, and schedules the sub-queries for parallel execution. The stateless Queriers then execute each sub-query, reading recent data from the Ingesters and historical blocks through the store-gateways, which serve data from object storage.
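The range splitting performed by the Query Frontend can be sketched as follows. This assumes sub-queries are cut at multiples of the split interval (UTC midnight for 24h) and kept on the original step grid; Mimir's exact boundary handling may differ, and `evaluate` is a hypothetical stand-in for a Querier.

```python
"""Sketch of query-frontend style range splitting (assumed semantics)."""
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Point = Tuple[int, float]
Evaluate = Callable[[int, int, int], List[Point]]


def split_by_interval(start: int, end: int, step: int,
                      interval: int = 86_400) -> List[Tuple[int, int]]:
    """Return (sub_start, sub_end) pairs, in Unix seconds, that never cross an interval boundary."""
    if step <= 0 or end < start:
        raise ValueError("need step > 0 and end >= start")
    ranges: List[Tuple[int, int]] = []
    sub_start = start
    while sub_start <= end:
        boundary = (sub_start // interval + 1) * interval  # next boundary after sub_start
        # Last evaluation timestamp on the step grid strictly before the boundary.
        sub_end = min(end, sub_start + ((boundary - 1 - sub_start) // step) * step)
        ranges.append((sub_start, sub_end))
        sub_start = sub_end + step
    return ranges


def run_split_query(start: int, end: int, step: int, evaluate: Evaluate,
                    interval: int = 86_400, workers: int = 8) -> List[Point]:
    """Evaluate each sub-range concurrently and concatenate results in time order."""
    parts = split_by_interval(start, end, step, interval)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda r: evaluate(r[0], r[1], step), parts))
    return [point for part in results for point in part]
```

A 30-day range at a 1-hour step becomes 31 sub-queries here (30 full days plus a final single-point day), and each one can be computed, retried, and cached independently.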

Production Mimir Architecture Configuration

Configure the core settings in /etc/mimir/mimir.yaml to define object storage, per-tenant limits, ingester replication, and query splitting:


# /etc/mimir/mimir.yaml
# Start Mimir with -config.expand-env=true so the ${...} placeholders are read from the environment.
multitenancy_enabled: true

# Shared object storage backend (inherited by blocks, ruler, and alertmanager storage)
common:
  storage:
    backend: s3
    s3:
      endpoint: s3.us-east-1.amazonaws.com
      bucket_name: enterprise-mimir-longterm-storage
      access_key_id: "${MIMIR_AWS_ACCESS_KEY}"
      secret_access_key: "${MIMIR_AWS_SECRET_KEY}"

# Keep blocks under their own prefix so they do not collide with ruler/alertmanager objects
blocks_storage:
  storage_prefix: blocks
  tsdb:
    dir: /data/tsdb # Ingester head block and WAL on local disk

# Gossip-based ring membership (hypothetical headless Service name)
memberlist:
  join_members:
    - mimir-gossip-ring.telemetry.svc.cluster.local:7946

# Per-tenant ingestion limits enforced by the distributors
limits:
  ingestion_rate: 100000       # Sustained samples per second per tenant
  ingestion_burst_size: 200000 # Maximum samples accepted in a burst

# Ingester hash ring and replication settings
ingester:
  ring:
    kvstore:
      store: memberlist # Dynamic peer-to-peer cluster discovery mesh
    replication_factor: 3 # Each series is written to 3 independent ingesters

# Query splitting and slow-query logging
frontend:
  log_queries_longer_than: 5s
  split_queries_by_interval: 24h # Slices long queries into 1-day sub-queries to limit memory spikes
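Mimir's per-tenant ingestion limits (its `ingestion_rate` and `ingestion_burst_size` settings, 100,000 and 200,000 samples in this example) behave roughly like a token bucket: a sustained refill rate plus a capacity that absorbs bursts. The sketch below models a single limiter in one process and is an assumption-laden illustration; Mimir divides a tenant's limit across the healthy distributors and its internals differ.

```python
"""Per-tenant token-bucket sketch of ingestion rate and burst limits."""
import time
from typing import Callable, Dict


class TokenBucket:
    def __init__(self, rate: float, burst: float, clock: Callable[[], float]):
        self.rate = rate          # samples refilled per second
        self.burst = burst        # bucket capacity: the largest admissible burst
        self.clock = clock
        self.tokens = burst       # start full
        self.last = clock()

    def allow(self, n: int) -> bool:
        now = self.clock()
        # Refill for the elapsed time, never beyond the burst capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if n <= self.tokens:
            self.tokens -= n
            return True
        return False


class TenantLimiter:
    """Isolates tenants: each tenant ID gets its own bucket."""

    def __init__(self, rate: float, burst: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self.buckets: Dict[str, TokenBucket] = {}

    def admit(self, tenant: str, samples: int) -> int:
        """Return an HTTP-style status code for a remote-write batch."""
        if tenant not in self.buckets:
            self.buckets[tenant] = TokenBucket(self.rate, self.burst, self.clock)
        return 200 if self.buckets[tenant].allow(samples) else 429
```

A tenant can send a 200,000-sample burst at once, then must fall back to 100,000 samples per second; one noisy tenant drains only its own bucket.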

4. Deep Archetypal Comparison: Thanos vs. Mimir

Choosing between Thanos and Mimir requires balancing operational complexity against scale requirements. The following comparison summarizes their core operational differences:

Architectural Pillar	Thanos (Decentralized Proxy Pattern)	Grafana Mimir (Centralized Microservices Pattern)
Data Ingestion Mechanic	Pull / Hybrid: Prometheus scrapes targets as usual; the Sidecar asynchronously uploads completed 2h blocks to object storage. Queries for recent data are served from each Prometheus instance through the Sidecar's gRPC Store API.	Push / Remote-Write: Prometheus (optionally in lightweight agent mode) acts as a collector and continuously pushes samples to Mimir's HTTP endpoints via the `remote_write` protocol.
Local Storage Footprint	High / Stateful: Each Prometheus instance must keep a working local TSDB (e.g., a 24-48h window) so data survives until the Sidecar uploads it and recent queries can be answered.	Minimal / Near-Stateless: Prometheus needs only enough local disk for its WAL buffer, because samples are streamed to Mimir, which handles all long-term durability.
Multi-Tenancy Isolation	Difficult / Bolted-On: Query-side isolation typically requires separate Querier deployments per tenant or a label-enforcing proxy (such as prom-label-proxy), combined with Kubernetes NetworkPolicies and RBAC.	Native / Built-In: With multi-tenancy enabled, every API request carries a tenant ID in the `X-Scope-OrgID` HTTP header. Mimir isolates each tenant's series, stored blocks, and limits out of the box, with optional per-tenant shuffle sharding.
Operational Complexity	Low to Moderate: Highly modular and non-intrusive. You can add Thanos components to an existing Prometheus deployment without changing how it scrapes.	High: Requires operating many microservices (Distributors, Ingesters, Queriers, Query Frontends, Store-gateways, Compactors) plus a `memberlist` gossip ring for membership.
Scale Limits (Active Series)	Commonly deployed in small to large multi-cluster environments, typically cited in the 10M to 50M active series range.	Built for very large infrastructures; Grafana Labs reports tests at up to 1 billion active series.

5. Object Storage Configurations & Lifecycle Topologies

Both scaling platforms use low-cost object storage as their single source of truth for historical metrics. Managing these storage buckets efficiently is critical for minimizing long-term data costs and optimizing query lookups.

Production AWS S3 Lifecycle Bucket Policy

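Because Thanos and Mimir write historical data as immutable blocks, the bucket should be tuned to avoid paying for stale object versions while still archiving metrics across multi-year compliance windows. Apply an AWS S3 lifecycle configuration such as the following JSON. Both transition targets keep millisecond access, so Store Gateways can still read the blocks directly; classes that require a restore first (S3 Glacier Flexible Retrieval or Deep Archive) would make those blocks unqueryable. Also note that STANDARD_IA and GLACIER_IR carry minimum storage durations (30 and 90 days) and per-GB retrieval fees, so reads by queries or the Compactor, and early deletions by the Compactor, add cost: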


{
  "Rules": [
    {
      "ID": "ArchiveHistoricalTelemetryToGlacier",
      "Status": "Enabled",
      "Filter": {
        "Prefix": ""
      },
      "Transitions": [
        {
          "Days": 90,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 365,
          "StorageClass": "GLACIER_IR"
        }
      ],
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 14
      },
      "AbortIncompleteMultipartUpload": {
        "DaysAfterInitiation": 7
      }
    }
  ]
}
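The interaction between the transition schedule and block deletion can be sketched with a small model. It assumes transitions happen exactly on the configured day and that minimum storage durations (30 days for STANDARD_IA, 90 for GLACIER_IR) count from the transition date; both are simplifications of S3's actual billing rules.

```python
"""Sketch: storage class of a block object under the lifecycle policy above,
and the minimum-duration charges a deletion at a given age would incur."""
from typing import Dict, List, Tuple

# (days after creation, target class), mirroring the "Transitions" array above.
TRANSITIONS: List[Tuple[int, str]] = [(90, "STANDARD_IA"), (365, "GLACIER_IR")]
MIN_STORAGE_DAYS: Dict[str, int] = {"STANDARD": 0, "STANDARD_IA": 30, "GLACIER_IR": 90}


def storage_class(age_days: int) -> str:
    current = "STANDARD"
    for threshold, target in TRANSITIONS:
        if age_days >= threshold:
            current = target
    return current


def early_deletion_days(age_days: int) -> int:
    """Extra billed days if the Compactor deletes a block at this age."""
    current = storage_class(age_days)
    entered = max([0] + [t for t, c in TRANSITIONS if c == current])
    return max(0, MIN_STORAGE_DAYS[current] - (age_days - entered))
```

For example, if a retention setting makes the Compactor delete blocks at 400 days, each deletion is still billed as though the object stayed in GLACIER_IR for 55 more days; aligning retention with the transition schedule avoids that.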

Data Indexing Optimization Mechanics

To prevent queries from scanning petabytes of raw data in object storage, both engines utilize a structured layout inside the bucket:

Block Directory: Each 2-hour block is written to its own directory named after a unique ULID (in Mimir, nested under a per-tenant prefix). The directory contains the compressed chunk files, an index file, and a meta.json file describing the block's time range and labels.
Index Headers: Rather than downloading each block's full index, the Store Gateway reads only the table of contents at the end of the index file plus the symbol and postings-offset tables, and keeps them locally as a compact index-header. At query time it uses these offsets to issue byte-range requests for just the postings, series, and chunks a query needs, minimizing network transfer.
Compactor Downsampling (Thanos only): The Thanos Compactor pre-aggregates older, compacted blocks into coarser resolutions: a 5-minute resolution for queries spanning weeks and a 1-hour resolution for queries spanning months or years. With auto-downsampling enabled, the Querier picks a suitable resolution based on the query step, so long-term trend graphs load far faster. Grafana Mimir does not downsample; it relies on query splitting, query sharding, and results caching instead.
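A simplified sketch of Thanos-style downsampling follows. Thanos keeps several aggregates per window (count, sum, min, max, and a counter-aware aggregate); this sketch keeps only the first four. `pick_resolution()` uses an assumed heuristic (coarsest resolution no larger than step / 5) rather than quoting Thanos's code.

```python
"""Simplified Thanos-style downsampling sketch."""
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

RESOLUTIONS = (0, 300, 3_600)  # raw, 5m, 1h in seconds


def downsample(samples: Iterable[Tuple[int, float]],
               resolution: int) -> List[Tuple[int, Dict[str, float]]]:
    """Aggregate raw (timestamp, value) samples into fixed windows."""
    if resolution <= 0:
        raise ValueError("resolution must be positive; raw data is not downsampled")
    windows: Dict[int, List[float]] = defaultdict(list)
    for ts, value in samples:
        windows[ts - ts % resolution].append(value)  # key is the window start
    return [
        (start, {"count": len(vals), "sum": sum(vals), "min": min(vals), "max": max(vals)})
        for start, vals in sorted(windows.items())
    ]


def pick_resolution(step_s: float) -> int:
    """Coarsest resolution that still yields at least five windows per query step."""
    return max(r for r in RESOLUTIONS if r <= step_s / 5)
```

Keeping sum and count (not just an average) lets functions such as avg_over_time stay exact when computed from downsampled data, and min/max preserve spikes that a plain average would flatten.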

6. Distributed Telemetry Diagnostics & Operational Recovery

Operating a distributed metric architecture introduces complex, cross-network failure modes. Use these diagnostic procedures to locate and resolve production bottlenecks.

1. Triaging Thanos Global View Query Timeouts

If your global Thanos dashboard fails with an HTTP 504 gateway timeout, use these steps to pinpoint the failing component:


# Step A: Check the Thanos Querier logs for downstream Store API endpoints that are failing or lagging
# (exact message text varies between Thanos versions)
kubectl logs -n telemetry deployment/thanos-querier -c thanos-querier --tail=200 | grep -iE "fetch series for|receive series from|partial response"

# Step B: Check whether Store Gateway pods are being OOMKilled while loading index-headers
kubectl describe pod -n telemetry -l app.kubernetes.io/name=thanos-store | grep -iE -A3 "last state|oomkilled"

# Step C: Audit the Store Gateway index cache hit ratio via its own Prometheus metrics
# Query: sum by (item_type) (rate(thanos_store_index_cache_hits_total[5m])) / sum by (item_type) (rate(thanos_store_index_cache_requests_total[5m]))
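To track the index cache hit ratio from a script instead of a dashboard, the sketch below calls the standard Prometheus HTTP API endpoint `/api/v1/query`. The Prometheus URL is hypothetical, and it assumes the Store Gateway's `thanos_store_index_cache_hits_total` and `thanos_store_index_cache_requests_total` counters carry an `item_type` label.

```python
"""Sketch: fetch the Store Gateway index cache hit ratio from Prometheus."""
import json
import urllib.parse
import urllib.request
from typing import Dict

HIT_RATIO_QUERY = (
    "sum by (item_type) (rate(thanos_store_index_cache_hits_total[5m]))"
    " / sum by (item_type) (rate(thanos_store_index_cache_requests_total[5m]))"
)


def build_query_url(base_url: str, promql: str) -> str:
    return f"{base_url.rstrip('/')}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_vector(payload: dict) -> Dict[str, float]:
    """Map item_type -> ratio from an instant-vector API response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    return {
        sample["metric"].get("item_type", ""): float(sample["value"][1])
        for sample in payload["data"]["result"]
    }


def fetch_hit_ratios(base_url: str, timeout_s: float = 10.0) -> Dict[str, float]:
    url = build_query_url(base_url, HIT_RATIO_QUERY)
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return parse_vector(json.load(resp))
```

Calling `fetch_hit_ratios("http://prometheus.telemetry:9090")` (hypothetical address) returns one ratio per cached item type; persistently low values usually point to an undersized index cache.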

2. Resolving Mimir Ingest Ring Split-Brain and Token Desynchronization

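If Prometheus remote-write requests start failing with errors such as "too many unhealthy instances in the ring" or "at least 2 live replicas required", the ingester ring may contain stale or unhealthy members. An ingestion rate limit error (HTTP 429) is different: it means a tenant exceeded its ingestion_rate limit, which you fix by raising limits, not by repairing the ring. Use these steps to inspect and repair the hashing ring: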


# Step A: Port-forward to a distributor, which serves the ingester ring status page
kubectl port-forward -n telemetry deployment/mimir-distributor 8080:8080

# Step B: Open your browser and navigate to http://localhost:8080/ingester/ring
# Look for instances stuck in the "LEAVING" state or reported as "UNHEALTHY" (stale heartbeat).

# Step C: Remove an unrecoverable, dead ingester with the page's "Forget" button,
# or submit the same form from the command line
curl -X POST -d "forget=mimir-ingester-3" http://localhost:8080/ingester/ring

# Step D: With memberlist, ring changes propagate to all components via gossip within seconds.
# Restart the distributors only as a last resort if they keep reporting the removed instance.
kubectl rollout restart deployment/mimir-distributor -n telemetry

7. Technical Interview Architecture Deep Dive

Q1: Explain how Thanos manages real-time deduplication of metric streams. If you have two highly available Prometheus instances scraping the exact same target, what internal logic does the Thanos Querier use to present a clean, single line on a Grafana panel?

Answer: Thanos handles metric deduplication transparently at the query layer by evaluating specific identifier labels across your data streams. The process works through these technical steps:

When you deploy a pair of highly available Prometheus instances for redundancy, both instances scrape the same targets and collect nearly identical samples (their scrape timestamps differ slightly). Each instance attaches a unique external label to its data (e.g., prometheus_replica="pod-a" and prometheus_replica="pod-b").
When an operator runs a query via Grafana, the request hits the centralized **Thanos Querier**, which fans it out to both Prometheus instances (through their Sidecars) in parallel and receives two duplicate sets of time series.
If deduplication is enabled for the query and the Querier was started with --query.replica-label=prometheus_replica, the Querier treats that label as a replica marker rather than as part of each series' identity.
The Querier strips the replica label and groups the series whose remaining labels are identical. Because the replicas' timestamps never line up exactly, it does not pair samples one-to-one. Instead, its default penalty-based algorithm reads samples from one replica and switches to the other only when it detects a gap in the current one larger than roughly the observed scrape interval, presenting a single, continuous line.
If one Prometheus pod crashes or is cut off by a network partition, that replica simply stops producing samples, and the Querier switches to the surviving twin at the first gap. Dashboards show a continuous line without visible gaps, although the switch-over point can introduce minor irregularities because the replicas scraped at slightly different times.
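A simplified sketch of gap-based replica switching, similar in spirit to Thanos's default penalty-based deduplication, is shown below. The fixed `gap_s` threshold is an assumption standing in for Thanos's penalty, which is derived from observed scrape intervals; inputs are sorted (timestamp, value) lists from two replicas whose series match once the replica label is removed.

```python
"""Simplified gap-based deduplication of two HA replica series."""
from typing import List, Optional, Sequence, Tuple

Sample = Tuple[float, float]


def _skip_until_after(series: Sequence[Sample], pos: int, last: Optional[float]) -> int:
    """Advance pos past samples at or before the last emitted timestamp."""
    while last is not None and pos < len(series) and series[pos][0] <= last:
        pos += 1
    return pos


def dedup_replicas(a: Sequence[Sample], b: Sequence[Sample], gap_s: float) -> List[Sample]:
    replicas = (a, b)
    pos = [0, 0]
    # Start on whichever replica has the earliest sample.
    active = 0 if a and (not b or a[0][0] <= b[0][0]) else 1
    last: Optional[float] = None
    out: List[Sample] = []
    while True:
        for r in (0, 1):
            pos[r] = _skip_until_after(replicas[r], pos[r], last)
        nxt = [replicas[r][pos[r]] if pos[r] < len(replicas[r]) else None for r in (0, 1)]
        if nxt[0] is None and nxt[1] is None:
            return out
        candidate = nxt[active]
        if candidate is None or (last is not None and candidate[0] - last > gap_s):
            other = 1 - active
            # Switch only if the other replica can fill the gap sooner.
            if nxt[other] is not None and (candidate is None or nxt[other][0] < candidate[0]):
                active = other
                candidate = nxt[other]
        out.append(candidate)
        last = candidate[0]
        pos[active] += 1
```

Sticking to one replica until it shows a gap avoids zig-zagging between two slightly offset scrape schedules, which is what makes the merged line look continuous.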

Q2: Why does Grafana Mimir utilize a Consistent Hashing Ring layout across its Ingester tier? What failure scenario occurs if a core Ingester container crashes while holding un-flushed metric blocks in memory?

Answer: Grafana Mimir uses a **Consistent Hashing Ring** distributed coordination pattern to load-balance metric ingestion dynamically across its storage tier while minimizing data reshuffling when nodes scale up or down:

The Ingestion Ring Mechanic: Each Ingester registers itself in the ring, which is shared across components through a decentralized `memberlist` gossip protocol, and claims a set of pseudo-random tokens in a 32-bit hash space that wraps around like a ring. When the stateless **Distributor** receives a batch of metrics, it hashes each series (tenant ID plus its full label set) into a 32-bit value, then walks clockwise from that position to the first token and routes the series to the Ingester that owns it.
Handling Node Crashes: Because Ingesters keep roughly the most recent 2 hours of data in memory before uploading a block to object storage, an abrupt container crash could lose data. To prevent this, Mimir enforces a **Replication Factor** (3 by default). The Distributor does not write a series to just one Ingester; it continues clockwise from the series' position on the ring and also sends the data to the next two distinct Ingesters.
Quorum Verification: Mimir uses write quorums to balance consistency and availability. The Distributor returns HTTP 200 to the Prometheus client only after a majority of the target Ingesters (at least 2 of 3) acknowledge the write, which each appends to its in-memory head and its local Write-Ahead Log (WAL). If a single Ingester crashes, the remaining two replicas keep accepting writes and serving queries, so no acknowledged data is lost, and the crashed Ingester replays its WAL when it restarts.
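The ring lookup, replication, and quorum rules described above fit in a toy model. Assumptions: SHA-256 truncated to 32 bits stands in for Mimir's own hash function, tokens come from a seeded RNG, zone awareness is ignored, and the ingester names are hypothetical.

```python
"""Toy consistent hash ring with replication and write quorum."""
import bisect
import hashlib
import random
from typing import Dict, List, Sequence, Set

RING_SIZE = 2 ** 32


class HashRing:
    def __init__(self, instances: Sequence[str], tokens_per_instance: int = 128, seed: int = 42):
        rng = random.Random(seed)
        pairs = sorted((rng.randrange(RING_SIZE), inst)
                       for inst in instances for _ in range(tokens_per_instance))
        self._tokens = [token for token, _ in pairs]
        self._owners = [owner for _, owner in pairs]
        self._instance_count = len(set(instances))

    def replicas(self, key_hash: int, rf: int) -> List[str]:
        """Walk clockwise from key_hash, collecting rf distinct owners."""
        if rf > self._instance_count:
            raise ValueError("replication factor exceeds number of instances")
        i = bisect.bisect_left(self._tokens, key_hash) % len(self._tokens)
        owners: List[str] = []
        while len(owners) < rf:
            owner = self._owners[i]
            if owner not in owners:
                owners.append(owner)
            i = (i + 1) % len(self._tokens)  # wrap around the ring
        return owners


def series_hash(tenant: str, labels: Dict[str, str]) -> int:
    """Hash tenant ID plus the full, sorted label set into the 32-bit token space."""
    key = tenant + "\x00" + "\x00".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")


def write(ring: HashRing, tenant: str, labels: Dict[str, str], rf: int, healthy: Set[str]) -> int:
    """Return 200 if a majority of the rf target ingesters acknowledge, else 500."""
    targets = ring.replicas(series_hash(tenant, labels), rf)
    acks = sum(1 for target in targets if target in healthy)
    quorum = rf // 2 + 1
    return 200 if acks >= quorum else 500
```

Because each ingester owns many small token ranges, adding or removing one ingester moves only a small fraction of series, which is the main reason for using a consistent hashing ring rather than modulo hashing.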

Q3: Your organization monitors thousands of dynamic microservices that generate short-lived, high-cardinality pods. Why is Thanos’s pull-based Store Gateway architecture more prone to long-term query degradation compared to Grafana Mimir's pre-split Query Frontend engine under this specific workload?

Answer: The difference in query performance under heavy, short-lived workloads comes down to how each platform indexes and accesses historical data within your object storage buckets:

Thanos Store Gateway Mechanics: Thanos uses a pull-based proxy model. When you query a long historical time window, the **Thanos Store Gateway** must consult the index of every block that overlaps that window. If your cluster creates thousands of short-lived pods each day, every pod produces new unique series, inflating each block's symbol table, postings lists, and series index. Over time, the Store Gateway spends more memory on index-headers and more network and CPU time fetching and intersecting these bloated postings lists, leading to slow load times and occasional query timeouts.
Grafana Mimir Query Frontend Optimization: Mimir reduces the cost of these lookups before queries reach storage. Its **Query Frontend** splits long query windows into independent 1-day sub-queries and can further shard them by series, so a pool of stateless Queriers computes them in parallel. It also keeps a **results cache** in a memory store such as Memcached. When a user repeats a 30-day query, Mimir can serve the 29 completed days from the cache and execute only the sub-query for the current day, avoiding repeated scans of historical indexes. Thanos also offers a Query Frontend with similar splitting and caching; under this workload Mimir's additional edge comes mainly from query sharding and its split-and-merge compactor, which shards large tenants' blocks so that each index stays smaller.
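The 29-cached-days-plus-one behavior can be sketched with a results cache over day-aligned sub-queries. Assumptions: the query start is aligned to UTC midnight, the step divides 24h, and sub-query results newer than `max_freshness_s` (10 minutes here) are not cached because late samples may still arrive. `evaluate` is a hypothetical stand-in for the Queriers; Mimir's real cache stores and merges extents.

```python
"""Sketch of a query-frontend results cache over day-aligned sub-queries."""
import time
from typing import Callable, Dict, List, Tuple

DAY = 86_400
Point = Tuple[int, float]
Evaluate = Callable[[str, int, int, int], List[Point]]


class ResultsCache:
    def __init__(self, evaluate: Evaluate, max_freshness_s: int = 600,
                 clock: Callable[[], float] = time.time):
        self._evaluate = evaluate
        self._max_freshness_s = max_freshness_s
        self._clock = clock
        self._cache: Dict[Tuple[str, int, int, int], List[Point]] = {}
        self.evaluations = 0  # sub-queries that actually reached the Queriers

    def query(self, promql: str, start: int, end: int, step: int) -> List[Point]:
        if start % DAY or DAY % step:
            raise ValueError("sketch assumes a day-aligned start and a step dividing 24h")
        now = self._clock()
        points: List[Point] = []
        for sub_start in range(start, end + 1, DAY):
            sub_end = min(end, sub_start + DAY - step)
            key = (promql, sub_start, sub_end, step)
            if key in self._cache:
                points.extend(self._cache[key])
                continue
            result = self._evaluate(promql, sub_start, sub_end, step)
            self.evaluations += 1
            if sub_end < now - self._max_freshness_s:  # only cache settled data
                self._cache[key] = result
            points.extend(result)
        return points
```

On a repeated query only the final, still-fresh sub-query is evaluated again; every settled day is served from the cache.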

8. Summary

Scaling Prometheus past monolithic limits requires a thoughtful balance of operational trade-offs and structural choices. Implementing Thanos provides an intuitive, non-intrusive approach that adds high-availability global views and long-term storage archival to existing deployments without changing your core collection loops. Transitioning to Grafana Mimir delivers an enterprise-grade, centralized remote-write engine optimized for massive, high-throughput multi-tenant environments. Mastering distributed components, configuring downsampling compaction loops, and tuning object storage lifecycle policies ensures a resilient monitoring architecture that scales reliably across enterprise platforms.

About the Author

Naresh Kumar