Introduction to Prometheus: Architecture and Core Concepts
A comprehensive, enterprise-grade guide to the internals, storage engine, retrieval mechanisms, and operational patterns of Prometheus.
Table of Contents
- 1. Executive Summary & Core Definitions
- 2. What You Will Learn
- 3. Prerequisites
- 4. Why Prometheus? The Paradigm Shift in Observability
- 5. Deep-Dive Architecture
- 6. TSDB Internals: How Prometheus Stores Data
- 7. Scraping Mechanics and Service Discovery
- 8. Alerting Pipeline and the Pushgateway
- 9. Production-Grade Configuration Blueprint
- 10. Operational Best Practices & Security Hardening
- 11. Meta-Monitoring: Monitoring Prometheus Itself
- 12. Enterprise Scaling Patterns (Thanos, Cortex, Mimir)
- 13. Troubleshooting and Runbooks
- 14. Advanced Technical Interview Questions
- 15. Frequently Asked Questions (FAQs)
- 16. Summary & Next Steps
1. Executive Summary & Core Definitions
In modern cloud-native ecosystems, observability is not merely a debugging aid; it is a fundamental architectural requirement. At the center of this revolution sits Prometheus, an open-source systems monitoring and alerting toolkit originally built at SoundCloud and now a graduated project of the Cloud Native Computing Foundation (CNCF).
What is Prometheus?
Prometheus is a multi-dimensional, pull-based time-series monitoring system designed to collect, store, query, and alert on real-time metrics. It operates by scraping HTTP metrics endpoints exposed by monitored targets, storing these data points as time-series data, and evaluating alerting rules to identify system anomalies.
Unlike traditional push-based monitoring agents that send metrics to a centralized database over custom protocols, Prometheus flips this model on its head. It utilizes a highly optimized pull architecture combined with dynamic service discovery to actively fetch metrics. This approach decouples the collection infrastructure from the application lifecycle, ensuring resilience, predictability, and low overhead in high-throughput enterprise environments.
2. What You Will Learn
This lesson provides an exhaustive, production-oriented dive into Prometheus. By the end of this guide, you will be able to:
- Deconstruct the internal architecture of Prometheus, including the retrieval engine, TSDB, and HTTP query API.
- Configure and optimize the Prometheus Time Series Database (TSDB) for high-throughput, low-latency environments.
- Implement advanced dynamic service discovery patterns for Kubernetes, Consul, and file-based targets.
- Author complex relabeling rules to clean, filter, and enrich metrics before ingestion.
- Architect resilient alerting pipelines using Prometheus alerting rules and Alertmanager.
- Diagnose and remediate common production failure modes, such as Out-of-Memory (OOM) kills, high cardinality, and WAL corruption.
- Evaluate and select scaling patterns like federation, Thanos, and Grafana Mimir for multi-cluster enterprise deployments.
3. Prerequisites
To extract the maximum value from this lesson, you should possess:
- A solid understanding of basic Linux systems administration, networking concepts (TCP/IP, HTTP, DNS), and containerization (Docker/Kubernetes).
- Familiarity with the core pillars of observability: metrics, logs, and traces. If you need a refresher, please review our Introduction to Observability lesson.
- Basic knowledge of YAML syntax, as it is the primary configuration language for Prometheus and its ecosystem.
4. Why Prometheus? The Paradigm Shift in Observability
Traditional monitoring tools (such as Nagios, Zabbix, or basic Graphite setups) were designed for static, physical infrastructure. In these environments, servers had long lifespans, fixed IP addresses, and predictable workloads.
With the advent of microservices, containerization (Docker), and dynamic orchestration platforms (Kubernetes), infrastructure became ephemeral. Instances are spun up and torn down in seconds. Traditional tools failed under this dynamic churn for several fundamental reasons:
- Configuration Overhead: Manually updating monitoring targets when containers scale is operationally impossible.
- Hierarchical Data Models: Dot-separated metric names (e.g., servers.datacenter1.web01.cpu.user) cannot represent multidimensional data elegantly. If you want to aggregate CPU usage across all web servers regardless of datacenter, you must resort to complex wildcard patterns.
- Push Bottlenecks: When thousands of ephemeral containers simultaneously push metrics to a centralized monitoring server, they can easily overwhelm the ingestion pipeline, acting as a self-inflicted Distributed Denial of Service (DDoS) attack.
The Prometheus Solution
Prometheus solves these challenges through a set of architectural innovations:
1. Multidimensional Data Model
Metrics are stored as time series identified by a metric name and a set of key-value pairs called labels. This allows for arbitrary slicing and dicing of data. For example:
http_requests_total{method="POST", handler="/api/v1/checkout", status="200", environment="production"}
Using this model, a single query can aggregate metrics across different dimensions (e.g., total requests for the checkout API, or total 500 errors across all endpoints in the production environment).
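As a rough sketch of that slicing, the rule group below precomputes the two aggregations mentioned above from the example series. It assumes a rules file loaded through rule_files; the group name and the recorded series names are illustrative choices, not something the metric itself defines.

```yaml
groups:
  - name: http_request_aggregations   # hypothetical group name
    rules:
      # Request rate for the checkout API, summed over every other label
      - record: handler:http_requests:rate5m
        expr: sum by (handler) (rate(http_requests_total{handler="/api/v1/checkout"}[5m]))
      # Rate of HTTP 500 responses across all endpoints in production
      - record: environment:http_requests_errors:rate5m
        expr: sum by (environment) (rate(http_requests_total{status="500", environment="production"}[5m]))
```

Both expressions read the same underlying series; only the label matchers and the `by` clause change, which is what a hierarchical naming scheme cannot offer without wildcards.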
2. Pull-Based Ingestion Model
Instead of waiting for targets to send data, Prometheus actively scrapes metrics at configured intervals. This provides several operational advantages:
- Rate Limiting by Design: The Prometheus server controls the ingestion rate, protecting itself from being overwhelmed.
- Simplified Target Diagnostics: If a target is down or misconfigured, the failed scrape is recorded at the next scrape interval by setting the up metric to 0. You do not have to guess whether a silent target is healthy or dead.
- No Agent Required on Target: Targets only need to expose a standard HTTP endpoint serving plain-text metrics. No complex agent daemon is required on the client side.
3. PromQL: A Powerful Query Language
PromQL (Prometheus Query Language) is a read-only, functional query language designed specifically for selecting and aggregating time-series data. It allows operators to perform real-time math, rate calculations, and statistical aggregations directly on the metrics database.
5. Deep-Dive Architecture
Prometheus is not a single monolithic tool; it is an ecosystem of decoupled components that communicate via standard protocols. Understanding the flow of data through these components is essential for designing and operating a stable monitoring platform.
Architectural Component Diagram
                                               PROMETHEUS ECOSYSTEM

              +------------------+           +-----------------------+
              | Kubernetes API   | <-------- | Service Discovery     |
              | Consul / File SD |           | (Dynamic Targets)     |
              +------------------+           +-----------+-----------+
                                                         | discovered targets
                                                         v
+--------------------+      +-----------------------+     +-------------------------------+
| Short-lived Tasks  | ---> | Pushgateway           |     | Exporters (Node, MySQL)       |
| (Batch Jobs)       |      | (Metrics Cache)       |     | Instrumented Apps (Go, Java)  |
+--------------------+      +-----------+-----------+     +---------------+---------------+
                                        |                                 |
                                        | Pull Scrape                     | Pull Scrape
                                        +----------------+----------------+
                                                         |
                                                         v
                                         +-------------------------------+
                                         |       PROMETHEUS SERVER       |
                                         |                               |
                                         |  +-------------------------+  |
                                         |  | Retrieval Engine        |  |
                                         |  +------------+------------+  |
                                         |               |               |
                                         |               v               |
                                         |  +-------------------------+  |
                                         |  | TSDB (Storage Engine)   |  |
                                         |  +------------+------------+  |
                                         |               |               |
                                         |               v               |
                                         |  +-------------------------+  |
                                         |  | HTTP Query API          |  |
                                         |  +------------+------------+  |
                                         +---------------+---------------+
                                                         |
                                        +----------------+----------------+
                                        | PromQL Evaluation               | Alert Rules
                                        v                                 v
                              +-------------------+           +-----------------------+
                              | Grafana /         |           | Alertmanager          |
                              | Prometheus Web UI |           | (Routing, Silencing)  |
                              +-------------------+           +-----------+-----------+
                                                                          |
                                                                          v
                                                              +-----------------------+
                                                              | PagerDuty, Slack,     |
                                                              | Webhooks, Email       |
                                                              +-----------------------+
Component Descriptions and Workflows
1. The Prometheus Server
The core engine of the system, responsible for three primary functions:
- Retrieval (Scraper): Pulls metrics from registered targets via HTTP GET requests on a configured schedule (the scrape interval).
- TSDB (Time Series Database): A purpose-built time-series database that persists metrics data to disk and manages memory caching.
- HTTP API & PromQL Engine: Receives queries from dashboards (like Grafana), CLI tools, or scripts, parses them, and executes them against the TSDB.
2. Service Discovery
Prometheus relies on service discovery to dynamically locate scrape targets. Rather than hardcoding IP addresses in configuration files, Prometheus queries external APIs (such as Kubernetes, Consul, AWS EC2, or DNS) to fetch lists of active nodes, containers, or services.
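A minimal sketch of two discovery mechanisms side by side. The job names, file path, Consul address, and service names are placeholders chosen for illustration; only the configuration keys come from Prometheus itself.

```yaml
scrape_configs:
  # File-based discovery: Prometheus re-reads the listed files when they change
  - job_name: "file-targets"                        # hypothetical job name
    file_sd_configs:
      - files:
          - "/etc/prometheus/targets/*.json"        # assumed path
        refresh_interval: 5m
  # Consul discovery: targets come from the Consul service catalog
  - job_name: "consul-services"                     # hypothetical job name
    consul_sd_configs:
      - server: "consul.example.internal:8500"      # placeholder address
        services: ["web", "api"]                    # placeholder service names
```

In both cases the target list is refreshed without restarting or reloading Prometheus, which is the property that matters for ephemeral infrastructure.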
3. Exporters
Many third-party software packages (like Linux, Apache, MySQL, or Redis) do not expose Prometheus metrics natively. Exporters act as translators. An exporter is a small service that runs alongside the target application, queries its internal state via native APIs, and exposes that data as a Prometheus-compatible HTTP metrics endpoint (typically /metrics).
Examples include:
- Node Exporter: Measures hardware and OS metrics (CPU, memory, disk I/O, network).
- Blackbox Exporter: Performs synthetic probing (HTTP, HTTPS, DNS, TCP, ICMP) to measure endpoint latency and availability from the outside.
- Kube-State-Metrics: Listens to the Kubernetes API server and generates metrics about the state of resources (deployments, pods, nodes).
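The two exporter styles are scraped differently, which the following sketch tries to show. Host names are placeholders, and the http_2xx module is assumed to be defined in the Blackbox Exporter's own configuration file.

```yaml
scrape_configs:
  # Node Exporter: scraped directly, listens on port 9100 by default
  - job_name: "node"
    static_configs:
      - targets: ["node01.example.internal:9100"]   # placeholder host
  # Blackbox Exporter: the URL to probe is passed as a query parameter
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ["https://example.com"]            # endpoint to probe (placeholder)
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target                # send the URL as ?target=...
      - source_labels: [__param_target]
        target_label: instance                      # keep the probed URL as the instance label
      - target_label: __address__
        replacement: "blackbox-exporter.example.internal:9115"   # scrape the exporter itself
```

The relabeling in the second job is needed because the thing being measured (the URL) is not the thing being scraped (the exporter).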
4. Pushgateway
Because Prometheus is pull-based, it cannot easily monitor short-lived batch jobs that complete in seconds. The Pushgateway acts as an intermediary buffer. Batch jobs push their metrics to the Pushgateway before exiting. Prometheus then scrapes the Pushgateway at its regular interval, ensuring no metrics from transient jobs are lost.
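On the Prometheus side, the Pushgateway is just another scrape target. A minimal sketch, assuming a placeholder host name and the Pushgateway's default port of 9091:

```yaml
scrape_configs:
  - job_name: "pushgateway"
    honor_labels: true   # keep the job and instance labels that the batch jobs pushed
    static_configs:
      - targets: ["pushgateway.example.internal:9091"]   # placeholder address
```

Without honor_labels, the pushed job and instance labels would be renamed to exported_job and exported_instance, and every series would appear to belong to the Pushgateway itself.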
5. Alertmanager
Prometheus does not send alerts directly to end-users. Instead, the Prometheus server evaluates alerting rules at regular intervals. If a rule condition is met, Prometheus generates an alert and forwards it to the Alertmanager.
The Alertmanager is responsible for:
- Deduplication: Merging multiple identical alerts into a single notification.
- Grouping: Combining related alerts (e.g., 50 pods failing in a single namespace) into one cohesive notification.
- Routing: Sending alerts to different receivers (Slack, PagerDuty, Email, Webhooks) based on labels.
- Inhibition and Silencing: Muting alerts based on active outages or dependencies (e.g., muting application alerts if the underlying host is known to be offline).
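These responsibilities map onto a handful of keys in the Alertmanager configuration file. The sketch below is illustrative only: receiver names, the Slack webhook URL, and the PagerDuty key are placeholders, and the timing values are arbitrary starting points.

```yaml
route:
  receiver: team-slack                     # default receiver
  group_by: [alertname, namespace]         # grouping: one notification per alert name and namespace
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers: ['severity="critical"']    # routing: critical alerts page the on-call engineer
      receiver: oncall-pagerduty
receivers:
  - name: team-slack
    slack_configs:
      - api_url: "https://hooks.slack.com/services/REPLACE_ME"   # placeholder
        channel: "#alerts"
  - name: oncall-pagerduty
    pagerduty_configs:
      - routing_key: "REPLACE_ME"          # placeholder
inhibit_rules:
  # inhibition: while a critical alert fires, mute warnings for the same instance
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity="warning"']
    equal: [instance]
```

Silences are not part of this file; they are created at runtime through the Alertmanager UI or API.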
6. TSDB Internals: How Prometheus Stores Data
To manage millions of samples per second on modest hardware, the Prometheus Time Series Database (TSDB) employs a highly optimized architecture designed specifically for append-only, sequential time-series workloads.
The Anatomy of a Metric Sample
Every data point stored in Prometheus is a sample consisting of:
- A 64-bit float value.
- A millisecond-precision Unix timestamp.
The identifier for this data point is the series, which is defined by the unique combination of the metric name and its label set.
Memory and Disk Layout
The TSDB splits its data into non-overlapping block directories on disk. Each block represents a slice of time (by default, 2 hours).
data/
├── 01F8Z6Y7... (Block 1 - 2 hours old)
│   ├── chunks/
│   │   └── 000001
│   ├── index
│   ├── meta.json
│   └── tombstones
├── 01F8Z9A2... (Block 2 - Current 2-hour window)
│   ├── chunks/
│   │   └── 000001
│   ├── index
│   ├── meta.json
│   └── tombstones
├── chunks_head/ (Active, in-memory chunks)
│   └── 000001
└── wal/ (Write-Ahead Log)
    ├── 00000001
    └── 00000002
1. The Head Block (In-Memory Buffer)
When Prometheus scrapes a metric, the sample is immediately written to two places:
- The Write-Ahead Log (WAL): An append-only log on disk. If Prometheus crashes, the WAL is replayed on startup to restore the in-memory state. The WAL is synchronized to disk using sequential I/O, which is highly performant.
- The Head Block: An in-memory data structure. Samples are compressed in memory using the Gorilla compression algorithm, reducing the footprint of a sample from 16 bytes (8 bytes for timestamp + 8 bytes for value) to an average of just 1.37 bytes.
2. Memory-Mapped Chunks (mmap)
To prevent the Head block from consuming excessive RAM, Prometheus periodically flushes closed chunks of samples from memory to disk as "memory-mapped" (mmap) files located in the chunks_head/ directory. The operating system handles caching these files in the page cache, allowing Prometheus to access them quickly without manual memory management.
3. Block Compaction
Every 2 hours, Prometheus cuts the active Head block and writes a new immutable block directory to disk containing:
- chunks/: Raw compressed time-series samples.
- index: An inverted index mapping labels to series IDs, allowing fast lookups of metrics by labels (similar to how search engines index web pages).
- meta.json: Metadata about the block, including start and end times, compaction levels, and source data.
- tombstones: Marks for deleted data, which are cleaned up during compaction.
Over time, a background process merges these 2-hour blocks into larger blocks (e.g., 24 hours, 30 days) in a process called compaction. This optimization reduces disk fragmentation, improves compression ratios, and speeds up queries covering long time ranges.
High Cardinality: The Silent Killer
The performance of the TSDB is directly tied to the number of active series. Cardinality is the number of unique time series, which is determined by how many distinct combinations of label values each metric produces.
High cardinality occurs when a label can take thousands or millions of unique values. Common culprits include:
- Using a user_id, session_token, or ip_address as a metric label.
- Using UUIDs or high-resolution timestamps as labels.
When cardinality explodes, the series count is multiplied by every additional high-cardinality label, the index grows with it, and Prometheus must allocate large amounts of RAM to track the active series. This leads to high memory utilization, slow query performance, and eventual Out-of-Memory (OOM) crashes.
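A worked upper bound may help make the growth concrete. The label names and value counts below are illustrative assumptions, not measurements. If a metric carries labels whose sets of possible values are V_1 through V_k, the number of series it can produce is bounded by the product of their sizes:

```latex
\[
N_{\text{series}} \le \prod_{i=1}^{k} |V_i|
\]
\[
\underbrace{4}_{\text{method}} \times \underbrace{20}_{\text{handler}} \times \underbrace{5}_{\text{status}} = 400
\qquad\text{but}\qquad
400 \times \underbrace{100{,}000}_{\text{user\_id}} = 40{,}000{,}000
\]
```

The bound is rarely reached in full, since not every combination occurs, but a single unbounded label is enough to move a metric from hundreds of series to tens of millions.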
7. Scraping Mechanics and Service Discovery
To understand how Prometheus gathers data, we must dissect the lifecycle of a scrape target, from discovery to ingestion.
The Scrape Lifecycle
- Discovery: Prometheus queries the configured Service Discovery mechanism (e.g., Kubernetes API) to find targets.
- Relabeling (Pre-Scrape): Prometheus applies relabel_configs to the discovered targets. This allows you to filter targets, rewrite labels, or dynamically drop targets before they are scraped.
- Scrape: Prometheus sends an HTTP GET request to the target (e.g., http://10.244.1.45:8080/metrics).
- Metric Relabeling (Post-Scrape): Prometheus applies metric_relabel_configs to the scraped metrics. This allows you to drop specific high-cardinality metrics, rewrite metric names, or filter out unused labels.
- Ingestion: The finalized metrics are written to the TSDB.
Understanding Relabeling
Relabeling is one of the most powerful, yet often misunderstood, features of Prometheus. It is executed during two distinct phases:
1. Target Relabeling (relabel_configs)
This phase operates on the metadata labels discovered by service discovery (which start with double underscores, e.g., __meta_kubernetes_pod_name). These temporary labels are discarded after relabeling unless mapped to target labels.
2. Metric Relabeling (metric_relabel_configs)
This phase operates on the actual scraped metrics. It is highly useful for managing cardinality by dropping metrics you do not need before they hit the storage engine.
Relabeling Actions Explained
| Action Name | Description | Typical Use Case |
|---|---|---|
| keep | Only keep targets/metrics where the source labels match the regex. Drop all others. | Scrape only pods that have the annotation prometheus.io/scrape: "true". |
| drop | Drop targets/metrics where the source labels match the regex. Keep all others. | Exclude high-volume debug metrics from ingestion. |
| replace | Match a regex against source labels and write the result to a target label. | Extract a pod name from a Kubernetes label and set it as the pod label. |
| labelmap | Match regex against all label names, then map the matched label name to a new name. | Promote Kubernetes annotations directly into Prometheus labels. |
| labeldrop | Match regex against label names and drop matching labels from the metric. | Remove ephemeral labels like pod_template_hash to reduce cardinality. |
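To show how two of these actions look in practice, here is a small sketch applied after the scrape. The myapp_debug_ prefix is a made-up metric family used only for illustration.

```yaml
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    metric_relabel_configs:
      # drop: discard debug series by metric name (hypothetical prefix)
      - source_labels: [__name__]
        regex: "myapp_debug_.*"
        action: drop
      # labeldrop: remove a label by name from every remaining series
      - regex: "pod_template_hash"
        action: labeldrop
```

Note that drop matches label values through source_labels, while labeldrop matches label names and takes no source_labels. After a labeldrop, the remaining labels must still identify each series uniquely, otherwise samples from the same scrape collide.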
8. Alerting Pipeline and the Pushgateway
A complete observability stack must not only visualize data but also proactively notify human operators when things go wrong. Prometheus achieves this by splitting alerting into two distinct steps: detection and notification.
Alert Detection (Prometheus Server)
Alerting rules are defined in YAML files loaded by the Prometheus server. These rules use PromQL to express conditions that, if true for a specified duration, transition the alert state.
An alert goes through three states:
- Inactive: The alert condition is false.
- Pending: The alert condition is true, but has not yet met the for duration threshold. This prevents flapping alerts on transient spikes.
- Firing: The alert condition has been true for longer than the for duration. The Prometheus server begins sending alert payloads to the Alertmanager.
Alert Notification (Alertmanager)
Once Alertmanager receives a firing alert from Prometheus, it processes it through a pipeline:
+---------------------+
|  Prometheus Alerts  |
+----------+----------+
           |
           v
+---------------------+
| Alertmanager Ingest |
+----------+----------+
           |
           v
+---------------------+
|    Routing Tree     | <--- Matches labels to select a route and its receiver
+----------+----------+
           |
           v
+---------------------+
|   Grouping Engine   | <--- Groups alerts by label (e.g., service="payment")
+----------+----------+
           |
           v
+---------------------+
|  Inhibition Rules   | <--- Mutes alerts if root-cause alert is active
+----------+----------+
           |
           v
+---------------------+
|  Silences Checking  | <--- Drops alerts matching active maintenance windows
+----------+----------+
           |
           v
+---------------------+
| Receivers (Slack,   |
| PagerDuty, Webhook) |
+---------------------+
The Pushgateway: When and When Not to Use It
The Pushgateway is an important architectural component, but it is frequently abused by beginners.
Anti-Patterns of Pushgateway Abuse
- Using it to turn Prometheus into a push-based system: Do not push metrics from long-running services. This bypasses service discovery, breaks the up metric health-checking, and introduces a single point of failure.
- High-Cardinality accumulation: Pushgateway never deletes metrics unless explicitly requested via an API call. If ephemeral batch jobs push metrics with unique job IDs, the Pushgateway will accumulate these metrics forever, leading to unbounded memory growth and performance degradation.
Valid Pushgateway Use Cases
- Short-lived batch jobs: Cron jobs, data migration scripts, or machine learning model training runs that execute and terminate in less time than the standard Prometheus scrape interval.
9. Production-Grade Configuration Blueprint
Below is a production-ready, fully commented configuration file for Prometheus (prometheus.yml) and an alerting rules file (alerts.yml). These configurations demonstrate enterprise best practices, including security, performance optimization, and advanced relabeling.
1. The Core Configuration: prometheus.yml
# Global configuration settings
global:
  scrape_interval: 15s # How frequently to scrape targets (default: 1m)
  evaluation_interval: 15s # How frequently to evaluate alerting rules (default: 1m)
  scrape_timeout: 10s # Timeout before a scrape request is aborted
# Alertmanager configuration
alerting:
  alert_relabel_configs:
    - regex: replica
      action: labeldrop # Strip the replica label so alerts from an HA pair deduplicate
  alertmanagers:
    - scheme: http
      static_configs:
        - targets:
            - "alertmanager-0.monitoring.svc.cluster.local:9093"
            - "alertmanager-1.monitoring.svc.cluster.local:9093"
# Load alerting and recording rules
rule_files:
  - "/etc/prometheus/rules/*.yml"
# Scrape configurations
scrape_configs:
  # 1. Self-monitoring scrape job
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  # 2. Kubernetes Pods Scrape Configuration with Advanced Relabeling
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Scrape only pods annotated with prometheus.io/scrape="true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Override the default scrape path if prometheus.io/path annotation exists
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: "(.+)"
      # Override the port if prometheus.io/port annotation exists
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: "([^:]+)(?::\\d+)?;(\\d+)"
        replacement: "$1:$2"
        target_label: __address__
      # Map Kubernetes labels to Prometheus metrics labels
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      # Promote pod name and namespace to standard labels
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
    # Metric relabeling to drop high-cardinality metrics before storage
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "jvm_gc_memory_allocated_bytes_total|http_request_duration_seconds_bucket"
        action: drop # Drop high-volume bucket or JVM metrics if not required
2. The Alerting Rules: rules/alerts.yml
groups:
  - name: InfrastructureAlerts
    rules:
      # 1. Host High CPU Usage Alert
      - alert: HostHighCpuLoad
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
          tier: infrastructure
        annotations:
          summary: "High CPU load on instance {{ $labels.instance }}"
          description: "CPU usage on {{ $labels.instance }} has been over 85% for the last 10 minutes. Current value: {{ $value | printf \"%.2f\" }}%"
      # 2. Prometheus Target Down Alert
      - alert: PrometheusTargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
          tier: monitoring
        annotations:
          summary: "Prometheus target down: {{ $labels.job }}"
          description: "The target {{ $labels.instance }} of job {{ $labels.job }} has been unreachable for more than 5 minutes."
      # 3. High Ingestion Latency (Scrape Duration)
      - alert: PrometheusHighScrapeDuration
        expr: prometheus_target_interval_length_seconds{quantile="0.99"} > 30
        for: 10m
        labels:
          severity: warning
          tier: monitoring
        annotations:
          summary: "High scrape latency detected for job {{ $labels.job }}"
          description: "Prometheus 99th percentile scrape interval length is {{ $value }} seconds, indicating slow target response or network congestion."
10. Operational Best Practices & Security Hardening
Running Prometheus in an enterprise environment requires careful planning around storage, security, and query optimization.
1. Storage Optimization and Retention
- Retention Time vs. Size: By default, Prometheus retains data on disk for 15 days. You can configure this using --storage.tsdb.retention.time. However, in containerized environments, it is safer to configure retention by size using --storage.tsdb.retention.size (e.g., 50GB) to prevent disk exhaustion.
- Disk Type: Prometheus relies heavily on fast random disk access for queries and sequential disk access for WAL. Always use SSDs (such as AWS EBS gp3 or equivalent) with high IOPS. Avoid standard HDDs or network-attached NFS storage, which can cause severe query latency and block corruption.
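Retention is set through command-line flags rather than prometheus.yml. The fragment below sketches how the flags might be passed in a Kubernetes container spec; the deployment style, container name, and paths are assumptions, while the flags themselves are standard Prometheus flags.

```yaml
# Fragment of a Kubernetes container spec (assumed deployment style)
containers:
  - name: prometheus
    image: prom/prometheus          # tag omitted here; pin a version in practice
    args:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=15d"
      - "--storage.tsdb.retention.size=50GB"
```

When both retention flags are set, whichever limit is reached first triggers the removal of old blocks, so the size cap acts as a safety net under the time-based policy.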
2. Security Hardening
- Enable Basic Authentication / TLS: Out of the box, Prometheus serves plain HTTP with no authentication. Anyone who can reach the HTTP port (9090) can run expensive queries and, if the lifecycle or admin APIs are enabled, trigger shutdowns or delete data. Always run Prometheus behind an ingress controller or reverse proxy (Nginx, Envoy), or use Prometheus's native web config file to enforce TLS and Basic Auth.
- Disable Admin API: Unless strictly required, keep the administrative HTTP endpoints (like /api/v1/admin/tsdb/delete_series) disabled by omitting the --web.enable-admin-api command-line flag. This prevents unauthorized deletion of metrics.
- Network Segmentation: Ensure that the Prometheus server is the only entity allowed to scrape metrics from internal application ports. Use firewall rules or Kubernetes Network Policies to block public access to /metrics endpoints.
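A minimal sketch of the native web config file mentioned above, which Prometheus reads when started with --web.config.file. The file name, certificate paths, and user name are assumptions, and the password hash is deliberately left as a placeholder.

```yaml
# web-config.yml, passed to Prometheus with --web.config.file=web-config.yml
tls_server_config:
  cert_file: /etc/prometheus/tls/server.crt   # assumed path
  key_file: /etc/prometheus/tls/server.key    # assumed path
basic_auth_users:
  # username: bcrypt hash of the password (placeholder, generate your own)
  admin: "<bcrypt-hash>"
```

Passwords are stored as bcrypt hashes, never in plain text, so the hash has to be generated separately before it is added to the file.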
3. Managing Cardinality Explosion
- Use Recording Rules: If you have expensive queries that are run repeatedly (e.g., by Grafana dashboards), configure recording rules. A recording rule evaluates a PromQL expression at regular intervals and saves the result as a new pre-computed time series. This reduces query execution times from seconds to milliseconds.
- Establish Label Governance: Create a strict developer guideline banning dynamic values (such as database query strings, raw URLs with IDs, or timestamps) from being passed as labels.
11. Meta-Monitoring: Monitoring Prometheus Itself
To ensure your monitoring system is reliable, you must monitor Prometheus itself. This is known as meta-monitoring. A failure in the monitoring infrastructure must be detected before production applications go down.
Key Prometheus Self-Monitoring Metrics
| Metric Name | Type | Critical Threshold & Meaning |
|---|---|---|
| prometheus_tsdb_head_series | Gauge | Monitors active series in memory. Sudden spikes indicate a cardinality explosion. |
| prometheus_tsdb_wal_corruptions_total | Counter | Value > 0 indicates WAL corruption, which can cause data loss or prevent Prometheus from starting. |
| prometheus_target_scrapes_exceeded_sample_limit_total | Counter | Counts scrapes rejected because the target exceeded the configured sample_limit. |
| prometheus_engine_query_duration_seconds | Summary | Track the 99th percentile of query execution times. High values indicate slow Grafana dashboards or unoptimized PromQL. |
| process_resident_memory_bytes | Gauge | The actual RAM consumed by the Prometheus process. Use this to predict and prevent OOM kills. |
The "Dead Man's Snitch" Pattern
What happens if the Prometheus server crashes entirely? It will stop sending alerts, meaning your operations team will receive no notifications of the outage.
To solve this, implement the Dead Man's Snitch pattern:
- Configure Prometheus to constantly run an alert that is always firing:
  rules:
    - alert: Watchdog
      expr: vector(1)
      labels:
        severity: critical
      annotations:
        description: "This is an always-firing alert used to verify the alerting pipeline."
- Route this alert through Alertmanager to an external SaaS uptime service (such as Dead Man's Snitch or Healthchecks.io) using a webhook.
- The external service expects to receive a ping from Alertmanager at a fixed interval (for example, every minute). If Prometheus or Alertmanager goes down, the ping stops, and the external service triggers a notification to your on-call team.
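The routing half of the pattern could look like the sketch below. The receiver names and the check-in URL are placeholders, and the example assumes the external service accepts Alertmanager's webhook POST as a heartbeat.

```yaml
route:
  receiver: default
  routes:
    - matchers: ['alertname="Watchdog"']
      receiver: deadman
      group_wait: 0s
      group_interval: 1m
      repeat_interval: 1m        # re-send roughly once a minute while the alert fires
receivers:
  - name: default
  - name: deadman
    webhook_configs:
      - url: "https://heartbeat.example.com/ping/REPLACE_ME"   # placeholder check-in URL
        send_resolved: false
```

The short repeat_interval is the point of the exercise: unlike ordinary alerts, this one should keep notifying for as long as the pipeline is healthy.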
12. Enterprise Scaling Patterns (Thanos, Cortex, Mimir)
While a single Prometheus instance can scale to handle millions of active series, it is fundamentally a vertical-scaling system. It does not support native clustering, distributed queries, or long-term cold storage.
When your infrastructure expands across multiple regions, or you require years of metric retention, you must implement a scaling framework.
Comparison of Enterprise Scaling Solutions
| Scaling Framework | Architecture Style | Storage Backend | Best Used For |
|---|---|---|---|
| Federation | Hierarchical Prometheus Servers | Local Storage (TSDB) | Aggregating high-level metrics from edge datacenters. Not suitable for long-term storage. |
| Thanos | Sidecar & Store-Gateway | Object Storage (S3, GCS) | Adding global querying and infinite retention to existing Prometheus installations. |
| Grafana Mimir | Microservices (Distributed) | Object Storage (S3, GCS) | Massive, multi-tenant SaaS platforms requiring high-throughput ingestion and fast queries. |
1. Hierarchical Federation
Federation allows a master Prometheus server to scrape a subset of metrics from other Prometheus servers.
+-------------------------------------------------+
|                Global Prometheus                |
+------------------------+------------------------+
                         |
           +-------------+-------------+
           | Scrapes /federate endpoint|
           v                           v
+--------------------+       +--------------------+
| Datacenter A Prom  |       | Datacenter B Prom  |
+--------------------+       +--------------------+
This pattern is useful for localized alerting and metrics storage at the edge, while aggregating key performance indicators (KPIs) globally. However, it does not scale well for deep analytical queries over raw historical data.
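On the global server, federation is an ordinary scrape job pointed at the /federate endpoint. In this sketch the target host names are placeholders, and the second selector assumes the edge servers expose recording rules named with a job: prefix.

```yaml
scrape_configs:
  - job_name: "federate"
    scrape_interval: 30s
    honor_labels: true                 # keep the labels set by the source servers
    metrics_path: /federate
    params:
      "match[]":
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'       # only pre-aggregated recording-rule series
    static_configs:
      - targets:
          - "prometheus-dc-a.example.internal:9090"   # placeholder
          - "prometheus-dc-b.example.internal:9090"   # placeholder
```

Restricting match[] to aggregated series is what keeps the global server small; federating raw series from every edge server recreates the scaling problem one level up.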
2. Thanos: The Sidecar Pattern
Thanos extends Prometheus by running a sidecar container alongside each Prometheus instance.
- Thanos Sidecar: Watches the local Prometheus TSDB. Every 2 hours, as Prometheus writes a block, the sidecar uploads it to Object Storage (e.g., AWS S3).
- Thanos Querier: Evaluates PromQL queries by fetching data from both local Prometheus instances (for real-time data) and Object Storage (for historical data). It performs deduplication across HA pairs on the fly.
- Thanos Store-Gateway: Acts as a proxy, indexing and serving historical metrics directly from Object Storage.
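For deduplication to work, each Prometheus instance needs external labels that identify it. A minimal sketch, with placeholder label values and the assumption that replica is the label name chosen to distinguish HA members:

```yaml
global:
  external_labels:
    cluster: "dc-a"             # identifies this Prometheus deployment (placeholder value)
    replica: "prometheus-0"     # differs between the two members of an HA pair
```

The Querier is then told which label marks a replica (its --query.replica-label flag), so that series differing only in that label are merged into one at query time.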
3. Grafana Mimir: The Microservices Pattern
Grafana Mimir is a horizontally scalable, multi-tenant, long-term storage tool. Instead of relying on local Prometheus storage, Prometheus instances are configured to stream their metrics to Mimir using the Remote Write protocol. Mimir breaks down ingestion, querying, and storage into separate microservices (Distributor, Ingester, Querier, Compactor) that scale independently.
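The Prometheus side of this setup is a remote_write block. The sketch below uses a placeholder host and tenant ID, and the queue values are illustrative rather than tuned recommendations.

```yaml
remote_write:
  - url: "http://mimir.example.internal/api/v1/push"   # placeholder host; /api/v1/push is Mimir's write path
    headers:
      X-Scope-OrgID: "team-a"        # tenant ID, needed when multi-tenancy is enabled (placeholder)
    queue_config:
      capacity: 10000                # samples buffered per shard
      max_samples_per_send: 2000
      max_shards: 50
```

Prometheus still writes to its local TSDB in this mode; remote write ships a copy of each sample, so local retention can be kept short once Mimir holds the long-term data.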
13. Troubleshooting and Runbooks
Below are real-world troubleshooting scenarios encountered by systems engineers running Prometheus at scale, along with actionable recovery steps.
Scenario A: Prometheus is Out-of-Memory (OOM) Killed
Symptoms:
The Prometheus container crashes repeatedly. The Linux kernel logs show Out of memory: Kill process (prometheus).
Root Cause:
This is almost always caused by a high cardinality explosion or an expensive query that fetched millions of series into memory simultaneously.
Remediation Runbook:
- Identify the culprit query or target: If Prometheus is still partially running, use the TSDB Status page (/tsdb-status) to identify the highest cardinality labels and metric names.
- Increase Memory Limits: Temporarily allocate more RAM to the Prometheus container to allow it to complete its WAL replay on startup.
- Apply Scrape Limits: Prevent targets from sending too many metrics by setting a sample_limit in your scrape configuration:
  scrape_configs:
    - job_name: 'flapping-app'
      sample_limit: 10000 # Fail the scrape if the target exposes more than 10,000 samples
- Drop High-Cardinality Labels: Use metric_relabel_configs to strip out problematic labels before they hit the TSDB.
Scenario B: WAL Corruption on Disk
Symptoms:
Prometheus fails to start, displaying the error: panic: error rebuilding WAL: error reading segment ....
Root Cause:
An unclean shutdown, disk exhaustion, or underlying hardware failure corrupted the active Write-Ahead Log segments.
Remediation Runbook:
- Stop the Prometheus service entirely.
- Navigate to the Prometheus data directory (e.g., /var/lib/prometheus/data/).
- Back up the corrupted WAL directory: cp -r wal/ wal_backup/.
- Delete the corrupted WAL segment files. Note that deleting WAL segments will result in the loss of the last 1-2 hours of scraped metrics, but it is necessary to restore service.
- Restart Prometheus. It will initialize cleanly and begin scraping new metrics.
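The backup and delete steps can be sketched in Python as follows. This is a minimal illustration, not an official recovery tool: the helper name backup_and_clear_wal is hypothetical, it assumes segment files are regular files with purely numeric names inside wal/ (check your own data directory first), and it must only be run while Prometheus is stopped.

```python
import shutil
from pathlib import Path
from typing import List


def backup_and_clear_wal(data_dir: str, backup_name: str = "wal_backup") -> List[str]:
    """Copy wal/ to a backup directory, then delete the segment files.

    Run this only while Prometheus is stopped.
    """
    wal_dir = Path(data_dir) / "wal"
    backup_dir = Path(data_dir) / backup_name
    # Roughly `cp -r wal/ wal_backup/`; raises if the backup already exists,
    # so an earlier backup is never overwritten by accident.
    shutil.copytree(wal_dir, backup_dir)

    removed = []
    for entry in sorted(wal_dir.iterdir()):
        # Assumption: WAL segments are regular files with numeric names.
        if entry.is_file() and entry.name.isdigit():
            entry.unlink()
            removed.append(entry.name)
    return removed
```

Returning the list of removed segment names gives the operator a record of exactly what was deleted, which is useful when writing up the incident.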
Scenario C: Slow Queries and Grafana Timeout Errors
Symptoms:
Grafana dashboards display timeout errors (HTTP 504 Gateway Timeout) when loading panels.
Root Cause:
Dashboards are querying raw, unaggregated metrics over large time ranges (e.g., querying 30 days of raw 15-second resolution data).
Remediation Runbook:
- Enable Query Logging: Edit prometheus.yml to enable the query log, which records every query the engine executes along with its timings:
    global:
      query_log_file: /var/log/prometheus/query.log
- Analyze the log file to find queries with high execution times.
- Convert to Recording Rules: Replace the slow, raw PromQL expressions in Grafana with pre-computed recording rules. For example, instead of running sum(rate(http_requests_total[5m])) by (job) on every page load, save it as a recording rule:
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
- Update Grafana to query the pre-computed metric job:http_requests:rate5m instead of the raw rate function.
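To make the log analysis step concrete, here is a small Python sketch that ranks logged queries by execution time. It assumes each log line is a JSON object carrying the query text under params.query and the timing under stats.timings.execTotalTime; treat those field names as assumptions and confirm them against a line from your own query log. The function name slowest_queries is hypothetical.

```python
import json


def slowest_queries(log_lines, top_n=5):
    """Return (seconds, query) pairs for the slowest logged queries."""
    timed = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # Skip truncated or non-JSON lines.
        if not isinstance(entry, dict):
            continue
        # Assumed field layout; verify against your own query log.
        query = entry.get("params", {}).get("query")
        seconds = entry.get("stats", {}).get("timings", {}).get("execTotalTime")
        if query is not None and seconds is not None:
            timed.append((float(seconds), query))
    # Largest execution time first.
    return sorted(timed, reverse=True)[:top_n]


if __name__ == "__main__":
    # The path matches the query_log_file value used in this scenario.
    with open("/var/log/prometheus/query.log", encoding="utf-8") as handle:
        for seconds, query in slowest_queries(handle):
            print(f"{seconds:8.3f}s  {query}")
```

The queries at the top of this list are the natural candidates for conversion to recording rules.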
14. Advanced Technical Interview Questions
The following questions are commonly asked in senior DevOps, SRE, Cloud Engineering, and Platform Engineering interviews.
Q1. Why does Prometheus use a pull model instead of a push model?
The pull model gives Prometheus complete control over scrape frequency,
timeout handling, and ingestion rate. It simplifies service health
verification through the up metric and integrates naturally
with dynamic service discovery systems such as Kubernetes.
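The health-check point can be illustrated with a toy Python sketch. Because the server initiates every scrape, it can record a synthetic up value of 1 or 0 depending on whether the attempt succeeded. The scrape_once function and the injected fetch callable are hypothetical stand-ins for illustration, not Prometheus internals.

```python
def scrape_once(fetch, timeout_seconds=10.0):
    """Attempt one scrape and return the synthetic `up` value with the payload."""
    try:
        payload = fetch(timeout_seconds)
    except Exception:
        # Any failure (refused connection, timeout, bad response) marks the target down.
        return {"up": 0, "payload": None}
    return {"up": 1, "payload": payload}


def healthy_target(timeout_seconds):
    """Stand-in for an HTTP GET against a working /metrics endpoint."""
    return "http_requests_total 42\n"


def dead_target(timeout_seconds):
    """Stand-in for a target that no longer answers."""
    raise TimeoutError("scrape timed out")


if __name__ == "__main__":
    print(scrape_once(healthy_target)["up"])
    print(scrape_once(dead_target)["up"])
```

With a push model the server cannot tell a silent target from a dead one without extra machinery; here the failed pull is itself the signal.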
Q2. What is the difference between relabel_configs and metric_relabel_configs?
| Feature | relabel_configs | metric_relabel_configs |
|---|---|---|
| Execution Stage | Before Scrape | After Scrape |
| Operates On | Targets | Metrics |
| Use Cases | Filtering targets | Dropping metrics |
| Performance Impact | Lower | Higher |
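The two stages in the table can be mimicked with a simplified Python sketch. It supports only the keep, drop, and labeldrop actions, joins source label values with ";" and requires a full regex match, which mirrors the documented defaults. The function name apply_relabel and the example rules are hypothetical, and real relabeling has more actions and options than shown here.

```python
import re


def apply_relabel(labels, rules):
    """Apply simplified relabel rules; return new labels, or None if dropped."""
    result = dict(labels)
    for rule in rules:
        action = rule["action"]
        pattern = re.compile(rule["regex"])
        if action in ("keep", "drop"):
            # Source label values are joined with ";" and matched in full.
            joined = ";".join(result.get(name, "") for name in rule["source_labels"])
            matched = pattern.fullmatch(joined) is not None
            if (action == "keep" and not matched) or (action == "drop" and matched):
                return None
        elif action == "labeldrop":
            # Remove every label whose *name* matches the regex.
            result = {k: v for k, v in result.items() if not pattern.fullmatch(k)}
        else:
            raise ValueError(f"unsupported action: {action}")
    return result


# Stage 1 (relabel_configs): decide which discovered targets get scraped at all.
target_rules = [{"action": "keep", "source_labels": ["env"], "regex": "prod"}]

# Stage 2 (metric_relabel_configs): filter or trim scraped series before storage.
metric_rules = [
    {"action": "drop", "source_labels": ["__name__"], "regex": "go_gc_.*"},
    {"action": "labeldrop", "regex": "user_id"},
]
```

A target rejected in stage 1 is never contacted, while a series rejected in stage 2 has already been transferred and parsed, which is why the second stage costs more per dropped item.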
Q3. What is Cardinality in Prometheus?
Cardinality refers to the number of unique time series created by combinations of metric names and label values.
http_requests_total{
method="GET",
status="200"
}
http_requests_total{
method="POST",
status="200"
}
Each unique label combination creates a new time series. Excessive cardinality increases memory consumption, index size, and query latency.
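A quick way to reason about this is to multiply the number of distinct values each label can take. The Python sketch below computes that worst-case upper bound for a single metric; the function name and the label counts are made up for illustration, and real cardinality is often lower because not every combination occurs.

```python
import math


def worst_case_series(label_value_counts):
    """Upper bound on series for one metric: product of distinct values per label."""
    # math.prod of an empty mapping is 1: a metric without labels is one series.
    return math.prod(label_value_counts.values())


if __name__ == "__main__":
    bounded = {"method": 5, "status": 10}
    unbounded = {"method": 5, "status": 10, "user_id": 100_000}
    print(worst_case_series(bounded))
    print(worst_case_series(unbounded))
```

Adding one unbounded label such as a user ID multiplies the bound by the number of users, which is how a single label can turn a harmless metric into an outage.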
Q4. What happens when Prometheus crashes?
During ingestion, every sample is written to the WAL (Write-Ahead Log) before being committed to memory. Upon restart, Prometheus replays WAL segments and rebuilds the Head Block.
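The write-then-replay idea can be shown with a toy Python model. This is a conceptual sketch only: the TinyHead class is hypothetical, a plain list stands in for the on-disk WAL segments, and real Prometheus adds checkpoints, compression, and memory-mapped chunks on top of this.

```python
class TinyHead:
    """Toy in-memory head block backed by an append-only log."""

    def __init__(self, wal):
        self.wal = wal      # Durable record list (stands in for on-disk segments).
        self.series = {}    # series_id -> list of (timestamp, value)

    def append(self, series_id, timestamp, value):
        # 1. Record the sample in the log first, so it survives a crash.
        self.wal.append((series_id, timestamp, value))
        # 2. Only then make it visible in memory.
        self.series.setdefault(series_id, []).append((timestamp, value))

    @classmethod
    def replay(cls, wal):
        """Rebuild the in-memory state from the log after a restart."""
        head = cls(wal)
        for series_id, timestamp, value in list(wal):
            head.series.setdefault(series_id, []).append((timestamp, value))
        return head


if __name__ == "__main__":
    durable_log = []
    head = TinyHead(durable_log)
    head.append("http_requests_total", 1000, 1.0)
    head.append("http_requests_total", 1015, 2.0)
    del head  # Simulated crash: the in-memory state is gone.
    recovered = TinyHead.replay(durable_log)
    print(recovered.series)
```

Because replay walks the whole log, a large WAL means a long and memory-hungry startup, which ties back to the OOM scenario in the runbooks.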
Q5. What is Federation?
Federation allows one Prometheus server to scrape metrics from another
Prometheus server's /federate endpoint.
Global Prometheus
|
v
Regional Prometheus
|
v
Application Metrics
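For a sense of what the global server actually requests, the Python sketch below builds a /federate URL with one match[] parameter per series selector. The function name federate_url, the hostname, and the selector are hypothetical examples.

```python
from urllib.parse import urlencode


def federate_url(base_url, matchers):
    """Build a /federate URL with one match[] parameter per series selector."""
    # urlencode percent-escapes the brackets, braces, and quotes in each pair.
    query = urlencode([("match[]", matcher) for matcher in matchers])
    return f"{base_url.rstrip('/')}/federate?{query}"


if __name__ == "__main__":
    # Hypothetical regional server and selector.
    print(federate_url("http://regional-prometheus:9090", ['{job="api"}']))
```

Keeping the selectors narrow (ideally pre-aggregated recording rules) matters, because a broad selector makes the global server pull every matching series on each scrape.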
Q6. Explain Thanos Architecture.
+-----------------------------------------------------+
| THANOS ECOSYSTEM |
+-----------------------------------------------------+
Prometheus + Sidecar
|
v
S3/GCS
|
v
Store Gateway
|
v
Thanos Querier
|
v
Grafana
Thanos extends Prometheus by providing:
- Global Query View
- Effectively Unlimited Retention (bounded only by object storage capacity and cost)
- High Availability Deduplication
- Object Storage Integration
Q7. How does Alertmanager prevent alert storms?
Alertmanager uses:
- Grouping
- Deduplication
- Inhibition
- Silencing
These mechanisms ensure that hundreds of related alerts become a single actionable notification.
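Grouping and deduplication can be sketched in a few lines of Python. This is a conceptual model, not Alertmanager's implementation: the function name group_alerts is hypothetical, alerts are plain label dictionaries, and inhibition and silencing are left out.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Deduplicate alerts by label set, then bucket them by the group_by labels."""
    groups = defaultdict(list)
    seen = set()
    for labels in alerts:
        fingerprint = tuple(sorted(labels.items()))
        if fingerprint in seen:
            continue  # Deduplication: identical label sets count once.
        seen.add(fingerprint)
        # Grouping: alerts sharing these label values form one notification.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical storm: 100 nodes in one cluster fire the same alert.
    storm = [
        {"alertname": "NodeDown", "cluster": "eu-1", "instance": f"node-{i}"}
        for i in range(100)
    ]
    grouped = group_alerts(storm, group_by=["alertname", "cluster"])
    print(len(grouped), "notification(s) for", len(storm), "alerts")
```

Grouping on alertname and cluster collapses the per-instance alerts into one notification that lists all affected nodes.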
Q8. Why is Prometheus not a distributed database?
Prometheus was intentionally designed as a single-node TSDB optimized for operational simplicity and local reliability. Horizontal scaling is achieved through systems such as Thanos, Cortex, and Grafana Mimir.
Q9. What causes OOM issues in Prometheus?
- High Cardinality Labels
- Excessive Active Series
- Large WAL Replay
- Expensive PromQL Queries
- Large Histograms
Q10. Explain Recording Rules.
Recording rules precompute expensive PromQL expressions and store the results as new metrics.
groups:
  - name: recording-rules
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
Dashboards can query the recorded metric directly, reducing CPU usage and query latency.
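To see what the rule above precomputes, the Python sketch below evaluates a simplified version of sum(rate(...)) by (job) over in-memory samples. The function names and sample data are hypothetical, and the rate here is a plain (last - first) / elapsed calculation; PromQL's rate() also handles counter resets and extrapolation, which this sketch ignores.

```python
from collections import defaultdict


def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) sample."""
    if len(samples) < 2:
        return 0.0
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 == t0:
        return 0.0
    return (v1 - v0) / (t1 - t0)


def sum_rate_by_job(series):
    """Sum per-series rates into one value per `job` label."""
    totals = defaultdict(float)
    for labels, samples in series:
        totals[labels["job"]] += simple_rate(samples)
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical counters sampled over a 300-second window.
    window = [
        ({"job": "api", "instance": "a"}, [(0, 0.0), (300, 300.0)]),
        ({"job": "api", "instance": "b"}, [(0, 100.0), (300, 250.0)]),
        ({"job": "web", "instance": "c"}, [(0, 0.0), (300, 600.0)]),
    ]
    print(sum_rate_by_job(window))
```

A recording rule runs this kind of aggregation once per evaluation interval and stores one series per job, so a dashboard reads a handful of points instead of touching every raw series.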
15. Frequently Asked Questions (FAQs)
Is Prometheus a monitoring tool or a database?
Prometheus is both. It includes a monitoring system, query engine, alerting engine, and a purpose-built Time Series Database.
Can Prometheus replace Grafana?
Prometheus includes a basic expression browser and graphing UI, but Grafana provides significantly better visualization, dashboards, RBAC, annotations, and alerting workflows.
Can Prometheus monitor Kubernetes automatically?
Yes. Prometheus integrates natively with Kubernetes Service Discovery and automatically discovers:
- Pods
- Services
- Endpoints
- Nodes
- Ingress Resources
Can Prometheus collect logs?
No. Prometheus is designed for metrics. Log collection is typically handled by Loki, Elasticsearch, Fluent Bit, or Fluentd.
Can Prometheus collect traces?
No. Distributed tracing is handled by systems such as:
- Jaeger
- Tempo
- Zipkin
- OpenTelemetry (the instrumentation and collection standard that feeds backends such as the ones above)
How much data can a single Prometheus server handle?
A properly tuned Prometheus instance running on modern hardware can comfortably manage millions of active series and hundreds of thousands of samples per second.
When should I choose Thanos?
- Need Long-Term Storage
- Need Multi-Cluster Visibility
- Need High Availability
- Already Running Prometheus
When should I choose Grafana Mimir?
- Massive Scale Environments
- Multi-Tenant SaaS Platforms
- Petabyte-Level Metrics
- Distributed Ingestion Requirements
What retention period is recommended?
| Environment | Retention |
|---|---|
| Development | 7 Days |
| Staging | 15 Days |
| Production | 30-90 Days |
| Enterprise Historical Analysis | Years (Thanos/Mimir) |
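Retention choices translate directly into disk capacity. A common rule of thumb multiplies retention time in seconds by ingested samples per second and by bytes per sample, with roughly 1 to 2 bytes per sample after compression. The Python sketch below applies that estimate; the function name and the example ingestion rate are hypothetical, and the result is a planning figure rather than a guarantee.

```python
def needed_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Estimate local TSDB disk usage for a given retention and ingestion rate."""
    retention_seconds = retention_days * 86_400
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical server ingesting 100,000 samples per second.
    for days in (7, 15, 30, 90):
        gigabytes = needed_disk_bytes(days, 100_000) / 1e9
        print(f"{days:>3} days -> ~{gigabytes:,.0f} GB")
```

Using the upper figure of 2 bytes per sample leaves some headroom for the WAL and for blocks awaiting compaction.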
16. Summary & Next Steps
Prometheus has become the de facto standard for cloud-native metrics monitoring because of its elegant pull architecture, powerful multidimensional data model, and highly optimized time-series storage engine.
Throughout this lesson we explored:
- Prometheus Architecture
- TSDB Internals
- WAL and Block Compaction
- Service Discovery
- Relabeling Workflows
- Alerting Architecture
- Pushgateway Best Practices
- Security Hardening
- Meta-Monitoring
- Federation
- Thanos
- Grafana Mimir
- Troubleshooting Runbooks
Prometheus Architecture Summary
Applications / Exporters
           |
           v
   Service Discovery
           |
           v
      Prometheus
           |
  +--------+--------+
  |                 |
  v                 v
Alertmanager     Grafana
  |
  v
Notifications
(Long-Term Storage)
Prometheus
    |
    v
  Thanos
    |
    v
    S3
Recommended Next Lessons
- Installing Prometheus on Linux
- Prometheus Configuration Deep Dive
- PromQL Fundamentals and Advanced Querying
- Node Exporter Monitoring
- Kubernetes Monitoring with Prometheus Operator
- Alertmanager Deep Dive
- Grafana Dashboards and Visualization
- Recording Rules and Alerting Rules
- Thanos Architecture and Deployment
- Grafana Mimir at Scale
Key Takeaway:
Prometheus is not merely a monitoring tool. It is a specialized observability platform built around efficient time-series storage, dynamic service discovery, and real-time alerting. Understanding TSDB internals, cardinality management, and scaling architectures such as Thanos and Mimir is essential for operating modern cloud-native infrastructure at enterprise scale.