Introduction to Modern Observability: Metrics, Logs, and Traces

In the era of microservices, serverless architectures, and Kubernetes-orchestrated cloud environments, understanding the runtime state of your software has evolved from a simple operational task into a complex engineering discipline. Traditional monitoring systems, designed for monolithic applications running on static physical servers, are fundamentally inadequate for today's dynamic, distributed systems.

Modern observability is not just a buzzword; it is an engineering paradigm rooted in control theory. It defines how well we can infer the internal states of a system based solely on its external outputs. In this comprehensive guide, we will explore the theoretical foundations, structural components, and practical implementation patterns of the three pillars of observability: Metrics, Logs, and Traces. We will examine how these signals interact within an enterprise-grade Grafana, Prometheus, and Loki ecosystem to provide high-fidelity insights into production workloads.

What You Will Learn

The mathematical and conceptual differences between monitoring and observability.
Deep architectural analysis of the Three Pillars: Metrics, Logs, and Traces.
How to design a zero-loss telemetry pipeline using Prometheus, Loki, and Tempo.
The mechanics of context propagation and W3C Trace Context standards.
Real-world strategies for mitigating high-cardinality issues in Time Series Databases (TSDBs).
Production-grade instrumentation code in Python and Go using OpenTelemetry and Prometheus clients.
How to troubleshoot common failures in observability pipelines, such as ingestion bottlenecks and OOM kills.

Prerequisites

To get the most out of this guide, you should have a solid foundation in systems engineering and software development. Specifically:

Familiarity with containerized applications (Docker, Kubernetes).
Basic understanding of network protocols (HTTP/1.1, HTTP/2, gRPC, TCP/IP).
Intermediate knowledge of at least one programming language (such as Go, Python, or Java).
Conceptual understanding of distributed systems and microservices architectures.

What is Observability? Monitoring vs. Observability
The Three Pillars of Observability
Metrics Deep Dive: Prometheus Data Model & TSDB Mechanics
Logs Deep Dive: Loki, Label-Based Indexing & Cardinality
Distributed Tracing Deep Dive: W3C Context & Spans
Enterprise Observability Architecture & Data Flow
Practical Instrumentation: Code Examples
Cardinality Management & Performance Optimization
Troubleshooting & Operational Runbooks
Technical Interview Questions & Answers
Frequently Asked Questions (FAQs)
Summary & Next Steps

What is Observability? Monitoring vs. Observability

To understand why we build modern observability platforms, we must first clarify the semantic distinction between monitoring and observability.

Observability (O11y): A measure of how well the internal states of a system can be inferred from knowledge of its external outputs. If a system is highly observable, an engineer can quickly diagnose the root cause of a novel, unprecedented failure mode without deploying new code or changing configuration.

In contrast, Monitoring is the action of tracking specific, predefined metrics to determine if a system is operating within normal parameters. Monitoring is symptom-oriented. It answers the question: "Is the system broken?" Observability, on the other hand, is diagnostic-oriented. It answers the question: "Why is the system broken, and where exactly is the bottleneck?"

The Shift from Known-Unknowns to Unknown-Unknowns

In traditional systems, failures were typically "known-unknowns." We knew a database disk could fill up, so we monitored disk space. We knew a web server process could crash, so we monitored process health.

In a modern microservices topology, failures are almost always "unknown-unknowns." They arise from complex, emergent behaviors: a specific combination of user input, network latency, database lock contention, and third-party API throttling. You cannot pre-configure alert thresholds for failure states you have never conceived. Observability gives you the raw, high-cardinality, and high-dimensionality data needed to query and reconstruct these emergent system states on the fly.

Attribute	Monitoring	Observability
Core Focus	Failure detection and alerting (Symptom-based)	Exploratory debugging and system understanding (Cause-based)
Problem Space	Known-Unknowns (predictable failure modes)	Unknown-Unknowns (emergent, unpredictable failure modes)
Data Structure	Aggregated metrics, static dashboards, binary states	High-cardinality metrics, structured logs, distributed traces
System View	Component-centric (e.g., CPU usage per server)	Transaction-centric (e.g., lifecycle of an API request)
Primary Tooling	Pingdom, Nagios, basic SNMP pollers	Prometheus, Loki, Tempo, OpenTelemetry, Grafana

The Three Pillars of Observability

To capture the complete state of a distributed system, we rely on three distinct telemetry signals: Metrics, Logs, and Traces. While each signal provides a unique perspective, they must not exist in silos. True observability is achieved when these signals are deeply correlated.

1. Metrics (The "What")

Metrics are numeric values aggregated over intervals of time. They are optimized for real-time querying, mathematical aggregation, and long-term trend analysis. Metrics are highly structured, lightweight, and cheap to store.

Metrics excel at telling you what is happening. For example, a metric can tell you that your application's 95th percentile (p95) latency has spiked to 2.4 seconds, or that the rate of HTTP 500 errors has increased by 15% over the last 5 minutes. However, metrics lack context; they cannot tell you which specific user or request caused the error.

2. Logs (The "Why")

Logs are discrete, timestamped text or structured JSON events emitted by applications or infrastructure components. They capture detailed, contextual information about specific operations.

Logs tell you why something happened. When an error occurs, the logs can reveal the exact stack trace, database query, or payload parameters that triggered the failure. The challenge with logs is their volume and cost. Because they contain rich text and arbitrary metadata, storing and indexing logs at scale is computationally expensive and disk-intensive.

3. Traces (The "Where")

Traces represent the end-to-end journey of a single request or transaction as it propagates through a distributed system. A trace is composed of multiple "spans," where each span represents a discrete unit of work performed by an individual microservice, database, or external API.

Traces show you where latency or errors occur in a multi-hop call chain. If a user request takes 5 seconds, a distributed trace will visually pinpoint that 4.2 seconds of that time was spent waiting for a specific PostgreSQL query execution inside a downstream payment service.

+-------------------------------------------------------------+
|                     The Observability Loop                  |
|                                                             |
|  1. METRICS (Detect)                                        |
|     "ALERT: p99 Latency > 2s on /checkout endpoint!"        |
|            |                                                |
|            v (Correlation via Exemplar / Trace ID)          |
|  2. TRACES (Isolate)                                        |
|     "Trace ID: abc123xyz shows 1.8s spent in Auth Service"  |
|            |                                                |
|            v (Correlation via Service & Timestamp)          |
|  3. LOGS (Diagnose)                                         |
|     "Log: DB Connection Pool Exhausted at Auth Service"     |
+-------------------------------------------------------------+

Metrics Deep Dive: Prometheus Data Model & TSDB Mechanics

To build a high-performance metrics pipeline, we must understand how metrics are structured and stored. Prometheus, the de facto standard for cloud-native metric collection, uses a multi-dimensional data model with time-series data identified by metric name and key-value pairs called labels.

The Prometheus Time-Series Identity

Every time series in Prometheus is uniquely identified by its metric name and an unordered set of labels (dimensions). The notation looks like this:

http_requests_total{method="POST", handler="/checkout", status="500"}

Behind the scenes, the Prometheus Time Series Database (TSDB) treats this entire string as a unique key. The corresponding values are a sequence of float64 samples, each paired with a millisecond-resolution timestamp.
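
Because the label set is unordered, the same labels written in a different order must resolve to the same series. The sketch below illustrates that idea with a hypothetical `series_key` helper that sorts labels before building the key. It is an illustration only and does not reproduce how Prometheus computes series identity internally.

```python
def series_key(name: str, labels: dict[str, str]) -> str:
    """Build a canonical identity string for a series (hypothetical helper)."""
    # Sort by label name so that label order never changes the identity.
    pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{pairs}}}"


a = series_key(
    "http_requests_total",
    {"method": "POST", "status": "500", "handler": "/checkout"},
)
b = series_key(
    "http_requests_total",
    {"handler": "/checkout", "method": "POST", "status": "500"},
)
print(a)
print("same series:", a == b)
```

Changing any label value, or adding a label, yields a different key and therefore a different series, which is why label design drives storage cost.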

The Four Core Prometheus Metric Types

When instrumenting your code, you must select the appropriate metric type based on the behavior you want to capture:

Counter: A cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. Use counters to track total events, such as http_requests_total or tasks_completed_total.
Rule of thumb: Never use a counter for values that can decrease (like active memory usage). Use the rate() or irate() functions in PromQL to calculate the per-second rate of change.
Gauge: A metric that represents a single numerical value that can arbitrarily go up and down. Use gauges for measured values like current memory usage, CPU load, temperature, or the number of concurrent active connections.
Histogram: A complex metric that samples observations (usually things like request durations or response sizes) and counts them in configurable, bucketed ranges. It provides three main outputs:
- Cumulative counters for the bucket upper bounds (e.g., http_request_duration_seconds_bucket{le="0.1"})
- The total sum of all observed values (http_request_duration_seconds_sum)
- The total count of events (http_request_duration_seconds_count)
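Histograms allow you to estimate quantiles (p50, p90, p99) across any number of distributed instances using the histogram_quantile() PromQL function. Because bucket counters can be summed across instances before the quantile is computed, the result is aggregatable. Its accuracy is bounded by the bucket widths, since values are interpolated within a bucket.
Summary: Similar to a histogram, a summary samples observations, but it calculates configurable quantiles on the client over a sliding time window. Summaries need no bucket boundaries chosen up front, yet the streaming quantile calculation makes observations more expensive on the client side, and the precomputed quantiles cannot be meaningfully aggregated across multiple instances. This makes them unsuitable for distributed microservices where you need cluster-wide quantiles.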
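
To make the histogram behavior concrete, here is a minimal Python sketch of quantile estimation from cumulative buckets using linear interpolation inside the matching bucket, which is the general approach histogram_quantile() takes. The function name mirrors the PromQL function for readability, but this is a simplified stand-in and the bucket counts are invented for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q (0 < q <= 1) from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into the +Inf bucket
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative buckets scraped from two instances of one service.
instance_a = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
instance_b = [(0.1, 10), (0.5, 40), (1.0, 90), (math.inf, 100)]

# Buckets are plain counters, so instances aggregate by simple addition.
merged = [(le, a + b) for (le, a), (_, b) in zip(instance_a, instance_b)]
p90 = histogram_quantile(0.9, merged)
print(f"cluster-wide p90 is roughly {p90:.3f}s")
```

The estimate can only be as precise as the bucket that contains it, so bucket boundaries should be placed around the latencies you care about (for example, your SLO threshold).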

Prometheus TSDB Storage Engine Mechanics

The Prometheus TSDB is engineered for massive write throughput and highly efficient compression. It structures data into two-hour blocks. Each block contains:

A Metadata file detailing the block's time range and status.
A Chunks directory containing the raw time-series samples compressed using Gorilla compression, which can compress 16-byte samples (8-byte timestamp + 8-byte value) down to an average of 1.37 bytes.
An Index file that maps label pairs to time-series IDs through inverted postings lists, so label lookups stay fast even across millions of active series.
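
The compression ratio quoted for the Chunks directory relies on samples arriving at near-regular intervals. The following sketch shows only the timestamp half of a Gorilla-style scheme, delta-of-delta encoding, on made-up scrape timestamps. Real chunk encoding packs these values at the bit level and XOR-encodes the float values, neither of which is reproduced here.

```python
def delta_of_delta(timestamps: list[int]) -> list[int]:
    """Return [first timestamp, first delta, delta-of-delta, ...]."""
    if len(timestamps) < 2:
        return list(timestamps)
    out = [timestamps[0], timestamps[1] - timestamps[0]]
    prev_delta = out[1]
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        delta = cur - prev
        out.append(delta - prev_delta)  # near zero for regular scrapes
        prev_delta = delta
    return out


def restore(encoded: list[int]) -> list[int]:
    """Invert delta_of_delta()."""
    if len(encoded) < 2:
        return list(encoded)
    timestamps = [encoded[0], encoded[0] + encoded[1]]
    delta = encoded[1]
    for dod in encoded[2:]:
        delta += dod
        timestamps.append(timestamps[-1] + delta)
    return timestamps


# Hypothetical 15-second scrape interval (milliseconds) with slight jitter.
ts = [1700000000000, 1700000015000, 1700000030000, 1700000045002, 1700000060000]
encoded = delta_of_delta(ts)
print(encoded)
```

After the first two entries, the values that need to be stored are tiny integers, which is what allows a variable-length bit encoding to shrink them so aggressively.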

Active, un-flushed data is held in memory in a Head block and simultaneously written to a Write-Ahead Log (WAL) on disk to prevent data loss in the event of an unexpected crash.

Logs Deep Dive: Loki, Label-Based Indexing & Cardinality

Traditional log management solutions like Elasticsearch or OpenSearch index every single word of every log line. While this allows for complex full-text search, it creates massive, resource-hungry indexes that can easily exceed the size of the raw logs themselves. Grafana Loki takes a completely different, highly optimized approach.

Loki's Design Philosophy: Index the Metadata, Not the Log Text

Loki does not index the message body of your logs. Instead, it only indexes the metadata labels associated with the log stream—exactly like Prometheus. The raw log lines are compressed into chunks and stored directly in cheap object storage (such as Amazon S3, Google Cloud Storage, or MinIO).

+------------------------------------------------------------------------+
|                      Loki Ingestion Pipeline                           |
|                                                                        |
|  [Log Line] -> "2023-10-27 10:00:01 [ERROR] User login failed: uid=42" |
|                                                                        |
|  1. Extract Labels: {app="auth-service", env="prod", severity="error"}  |
|  2. Index Labels: BoltDB / TSDB Index (Tiny, fast, fits in memory)      |
|  3. Compress Log Body: gzip/snappy chunk -> Push to Object Storage (S3)|
+------------------------------------------------------------------------+

This design makes Loki incredibly cost-effective and highly performant at scale. However, it shifts the engineering responsibility onto the developer to design a clean, low-cardinality labeling strategy.

Understanding Cardinality in Loki

In database systems, cardinality refers to the number of unique values in a particular field. In Loki, a "stream" is defined by the unique combination of its labels.

If you add a high-cardinality label—such as a user_id, request_id, or ip_address—to your Loki logs as an index label, Loki will create a separate stream and index entry for every single unique user or request. This will quickly exhaust memory, degrade query performance, and potentially crash your Loki ingesters. This phenomenon is known as a cardinality explosion.
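
A quick back-of-the-envelope calculation shows why a single unbounded label is so damaging. The sketch below computes the worst-case number of streams as the product of distinct values per label. All label names and counts are hypothetical.

```python
from math import prod


def max_streams(distinct_values: dict[str, int]) -> int:
    """Upper bound on streams: the product of distinct values per label."""
    return prod(distinct_values.values())


# Hypothetical label design with only static, bounded labels.
static_labels = {"app": 20, "env": 3, "severity": 4}
print("static labels:", max_streams(static_labels))

# The same design after someone adds user_id as an index label.
with_user_id = {**static_labels, "user_id": 100_000}
print("with user_id:", max_streams(with_user_id))
```

The real number of streams is usually below this bound because not every combination occurs, but the multiplication explains how one extra label can turn hundreds of streams into millions.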

Best Practice: Keep Loki index labels static and low-cardinality (e.g., environment, service_name, container_name). If you need to search logs by user_id or request_id, write them inside the log message body (preferably formatted as JSON) and extract them at query time using Loki's LogQL parser operators:

{app="auth-service"} | json | user_id = "42"
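
Conceptually, the query above selects a stream by its indexed labels and then filters the decompressed lines by a field parsed at query time. The Python sketch below loosely mimics that second step on a handful of invented log lines. It is not how Loki is implemented, and `filter_json_lines` is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def filter_json_lines(lines: Iterable[str], field: str, expected: str) -> Iterator[dict]:
    """Yield parsed log records whose JSON field equals `expected`."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # non-JSON lines cannot match a parsed field
        if isinstance(record, dict) and field in record and str(record[field]) == expected:
            yield record


# Invented lines that all belong to one stream, e.g. {app="auth-service"}.
stream = [
    '{"level": "error", "msg": "User login failed", "user_id": 42}',
    "plain text line without structure",
    '{"level": "info", "msg": "User login ok", "user_id": 7}',
]
matches = list(filter_json_lines(stream, "user_id", "42"))
print(matches)
```

The trade-off is that every line in the selected streams is scanned, so narrow label selectors and short time ranges keep such queries fast.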

Distributed Tracing Deep Dive: W3C Context & Spans

Distributed tracing solves the problem of tracking requests across network boundaries. When a client clicks "Buy Now," that request may hit an API gateway, which calls an auth service, which calls an order processing service, which executes a database query and publishes an event to Kafka. To trace this entire flow, we must propagate context.

The Anatomy of a Trace

Trace: The complete DAG (Directed Acyclic Graph) of spans representing the end-to-end execution of a request. A trace is identified by a globally unique 128-bit integer called a TraceID.
Span: The fundamental building block of a trace. A span represents a single contiguous block of time-bound work. Each span has a SpanID, a parent SpanID (unless it is the root span), start/end timestamps, attributes (key-value metadata, called tags in some older tracing systems), and events (structured logs inside the span).
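
The parent/child relationship is what lets a tracing backend attribute time to the right service. As a rough sketch, the code below models spans with a hypothetical `Span` dataclass and computes each span's self time (its duration minus the time spent in direct children). It assumes children run sequentially and do not overlap, and the timings are invented to match the 5-second checkout example used earlier.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def self_times(spans: list[Span]) -> dict[str, int]:
    """Duration of each span minus the duration of its direct children."""
    children = defaultdict(list)
    for span in spans:
        if span.parent_id is not None:
            children[span.parent_id].append(span)
    return {
        span.name: span.duration_ms - sum(c.duration_ms for c in children[span.span_id])
        for span in spans
    }


trace = [
    Span("a1", None, "api-gateway GET /checkout", 0, 5000),
    Span("b2", "a1", "payment-service POST /charge", 300, 4800),
    Span("c3", "b2", "postgres SELECT", 500, 4700),
]
times = self_times(trace)
slowest = max(times, key=times.get)
print(times)
print("most self time:", slowest)
```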

Context Propagation and the W3C Trace Context Standard

For tracing to work, the unique TraceID and parent SpanID must be passed along with every network call. Historically, tracing systems used proprietary HTTP headers (such as Zipkin's B3 headers). Today, the industry has standardized on the W3C Trace Context specification.

The W3C standard defines two primary HTTP headers:

traceparent: A single, highly structured header containing all the information required to propagate a trace:
```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```
Let's break down this header value:
- 00: Version (currently 00).
- 4bf92f3577b34da6a3ce929d0e0e4736: Trace ID (16 bytes / 32 hex characters).
- 00f067aa0ba902b7: Parent Span ID (8 bytes / 16 hex characters).
- 01: Trace Flags (8 bits, where 01 indicates that the request was sampled for recording).
tracestate: A set of key-value pairs used to propagate vendor-specific routing and filtering metadata (e.g., congo=t61rcWkgMzE), ensuring compatibility across different tracing backends.
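
The field layout described above can be checked with a few lines of code. This is a minimal sketch of a version-00 traceparent parser. `parse_traceparent` and `TraceParent` are hypothetical names, and production services should rely on an OpenTelemetry propagator rather than hand-written parsing.

```python
import re
from typing import NamedTuple, Optional

# version - trace-id - parent-id - trace-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    sampled = bool(int(flags, 16) & 0x01)  # lowest bit is the sampled flag
    return TraceParent(version, trace_id, parent_id, sampled)


tp = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(tp)
```

When parsing fails, the usual behavior is to start a new trace rather than reject the request, so a malformed header never breaks the call itself.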

+-------------------------------------------------------------------------+
|                      Context Propagation Flow                           |
|                                                                         |
|  [Client]                                                               |
|     |                                                                   |
|     v  HTTP GET /checkout                                               |
|  [API Gateway] (Generates Trace ID: abc123xyz, Span ID: 111)            |
|     |                                                                   |
|     v  HTTP POST /charge  (Header: traceparent: 00-abc123xyz-111-01)    |
|  [Payment Service] (Reuses Trace ID: abc123xyz, New Span ID: 222)       |
|     |                                                                   |
|     v  SQL SELECT...      (Injected traceparent as SQL comment)         |
|  [PostgreSQL]                                                           |
+-------------------------------------------------------------------------+

Enterprise Observability Architecture & Data Flow

In an enterprise-scale architecture, telemetry data must be collected reliably, buffered to handle network partitions or traffic spikes, and routed to the appropriate backend storage engines. Below is the blueprint of a production-grade, zero-loss observability pipeline using the Grafana LGTM stack (Loki, Grafana, Tempo, Mimir/Prometheus) combined with the OpenTelemetry (OTel) Collector.

System Architecture Diagram

+------------------------------------------------------------------------------------+
|                                  KUBERNETES CLUSTER                                |
|                                                                                    |
|  +------------------+     +------------------+     +----------------------------+  |
|  |   App Pod A      |     |   App Pod B      |     |   Infrastructure / Node    |  |
|  |  (OTel SDK)      |     |  (Prom Client)   |     |  (Node Exporter, Kube-API) |  |
|  +------------------+     +------------------+     +----------------------------+  |
|         |                        |                                |                |
|         | (OTLP/gRPC)            | (Metrics Scraping)             | (Log Scraping) |
|         v                        v                                v                |
|  +------------------------------------------------------------------------------+  |
|  |                      Grafana Alloy / OpenTelemetry Collector                 |  |
|  |  - Receives traces, metrics, logs                                            |  |
|  |  - Enriches with Kubernetes metadata (namespace, pod, node labels)           |  |
|  |  - Batches and compresses outgoing streams                                   |  |
|  +------------------------------------------------------------------------------+  |
+------------------------------------------------------------------------------------+
         |                                |                                |
         | (OTLP/gRPC)                    | (Remote Write)                 | (Loki Push API)
         v                                v                                v
+------------------+            +------------------+            +--------------------+
|  GRAFANA TEMPO   |            |  GRAFANA MIMIR   |            |    GRAFANA LOKI    |
| (Trace Storage)  |            | (Metric Storage) |            |   (Log Storage)    |
+------------------+            +------------------+            +--------------------+
         |                                |                                |
         +-----------------+              |              +-----------------+
                           |              |              |
                           v              v              v
                        +----------------------------------+
                        |         GRAFANA ENTERPRISE       |
                        |      (Visualization & Alerting)  |
                        +----------------------------------+

The Ingestion Mechanics: Push vs. Pull

An ongoing architectural debate in observability is whether to use a Push or Pull model for telemetry collection.

Pull (Prometheus Default): The monitoring server periodically queries (scrapes) a known HTTP endpoint (e.g., /metrics) on each target instance.
Pros: Easier to detect if a target is down (scrape failure); self-throttling (the server controls scrape intervals, preventing it from being overwhelmed).
Cons: Requires service discovery to find targets; difficult to implement behind strict firewalls or in ephemeral serverless environments (AWS Lambda).
Push (Loki, Tempo, OTel Default): The application or a local daemon pushes telemetry data directly to the backend collector.
Pros: Excellent for ephemeral workloads; simpler network configuration (egress-only).
Cons: Risk of overwhelming the ingestion backend during traffic spikes (requires load balancing and queueing buffers).

The Enterprise Hybrid Solution: Deploy a local collector agent (such as Grafana Alloy or the OpenTelemetry Collector) on every node. The application pushes traces and logs locally to the agent via localhost (low latency, high reliability). The agent then scrapes application metrics locally, enriches all telemetry signals with consistent Kubernetes metadata, and safely batches and pushes the data to the centralized backend engines.
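
The batching and buffering role of the local agent can be illustrated with a small sketch. The `Batcher` class below is hypothetical and far simpler than Grafana Alloy or the OpenTelemetry Collector: it keeps a bounded in-memory buffer, sends fixed-size batches, and retains unsent data when the backend is unreachable. Note the assumption that the oldest items are dropped once the buffer is full, which is one possible policy among several.

```python
from collections import deque
from typing import Callable


class Batcher:
    """Bounded buffer that forwards telemetry in batches (illustrative only)."""

    def __init__(self, send: Callable[[list], None], max_batch: int = 3, max_buffer: int = 5):
        self.send = send
        self.max_batch = max_batch
        self.buffer = deque(maxlen=max_buffer)  # oldest item is evicted when full
        self.dropped = 0

    def add(self, item) -> None:
        if len(self.buffer) == self.buffer.maxlen:
            self.dropped += 1  # track loss so it can be exposed as a metric
        self.buffer.append(item)

    def flush(self) -> bool:
        """Send all buffered items; keep the remainder if the backend fails."""
        while self.buffer:
            batch = list(self.buffer)[: self.max_batch]
            try:
                self.send(batch)
            except ConnectionError:
                return False  # retry on the next flush
            for _ in batch:
                self.buffer.popleft()
        return True


sent = []
batcher = Batcher(send=sent.append, max_batch=2, max_buffer=4)
for i in range(5):
    batcher.add(f"span-{i}")
batcher.flush()
print(sent, "dropped:", batcher.dropped)
```

A pipeline that must not lose data would replace the in-memory deque with a persistent queue on disk, trading latency and disk I/O for durability.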

Practical Instrumentation: Code Examples

Let's look at how to implement structured logging, metrics generation, and distributed tracing in production code. We will use two complete, clean examples.

Example 1: Python FastAPI Service with Prometheus Metrics & Structured JSON Logging

This example shows how to configure custom Prometheus metrics and structured JSON logs that output standard trace correlation IDs.

import time
import json
import logging
import uuid
from fastapi import FastAPI, Request, Response
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST

# Initialize FastAPI App
app = FastAPI(title="PaymentService")

# Define Prometheus Metrics with low-cardinality labels
HTTP_REQUESTS_TOTAL = Counter(
    "http_requests_total",
    "Total number of HTTP requests processed",
    ["method", "endpoint", "status_code"]
)

HTTP_REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    ["method", "endpoint"],
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0)
)

# Custom JSON Log Formatter for Loki compatibility
class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_record = {
            "timestamp": self.formatTime(record, self.datefmt),
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
            # Inject trace context for correlation
            "trace_id": getattr(record, "trace_id", "00000000000000000000000000000000"),
            "span_id": getattr(record, "span_id", "0000000000000000")
        }
        return json.dumps(log_record)

# Configure Logging
logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

@app.middleware("http")
async def observability_middleware(request: Request, call_next):
    start_time = time.perf_counter()
    
    # Extract W3C Trace Context from incoming headers or generate mock ones
    traceparent = request.headers.get("traceparent")
    if traceparent:
        parts = traceparent.split("-")
        if len(parts) == 4:
            trace_id = parts[1]
            parent_span_id = parts[2]
        else:
            trace_id = uuid.uuid4().hex
            parent_span_id = uuid.uuid4().hex[:16]
    else:
        trace_id = uuid.uuid4().hex
        parent_span_id = uuid.uuid4().hex[:16]

    # Bind IDs to a local logging context
    extra_log_context = {"trace_id": trace_id, "span_id": parent_span_id}
    
    response = Response("Internal Server Error", status_code=500)
    try:
        response = await call_next(request)
    finally:
        duration = time.perf_counter() - start_time
        status_code = str(response.status_code)
        
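        # Record Prometheus Metrics.
        # Label by the matched route template (e.g. "/users/{id}") instead of the raw
        # URL path, which would create one time series per distinct path. FastAPI
        # stores the matched route in the ASGI scope; fall back to a constant otherwise.
        route = request.scope.get("route")
        endpoint = getattr(route, "path", "unmatched")

        HTTP_REQUESTS_TOTAL.labels(
            method=request.method,
            endpoint=endpoint,
            status_code=status_code
        ).inc()
        
        HTTP_REQUEST_LATENCY.labels(
            method=request.method,
            endpoint=endpoint
        ).observe(duration)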
        
        # Log event with structured trace context
        logger.info(
            f"Processed request {request.method} {request.url.path} in {duration:.4f}s with status {status_code}",
            extra=extra_log_context
        )
        
    return response

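@app.get("/checkout")
def checkout():
    # Simulate blocking database work. Declared as a plain `def` so FastAPI runs it
    # in a worker thread; calling time.sleep() inside `async def` would block the event loop.
    time.sleep(0.05)
    return {"status": "success", "transaction_id": uuid.uuid4().hex}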

@app.get("/metrics")
async def metrics():
    # Expose Prometheus metrics endpoint
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)

Example 2: Go Microservice implementing W3C Context Propagation

This Go snippet demonstrates how to manually inject and extract W3C Trace Context headers when making downstream HTTP calls.

package main

import (
	"context"
	"fmt"
	"net/http"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
	trace "go.opentelemetry.io/otel/trace"
)

const (
	port        = ":8080"
	serviceName = "order-processor"
)

type Server struct {
	tracer trace.Tracer
}

func main() {
	// Initialize Global W3C Text Map Propagator
	otel.SetTextMapPropagator(propagation.TraceContext{})

	// Mock Tracer Provider (In production, replace with real OTLP exporter)
	tp := trace.NewNoopTracerProvider()
	otel.SetTracerProvider(tp)

	server := &Server{
		tracer: tp.Tracer(serviceName),
	}

	mux := http.NewServeMux()
	mux.HandleFunc("/process", server.handleProcess)

	fmt.Printf("Starting server on %s\n", port)
	if err := http.ListenAndServe(port, mux); err != nil {
		panic(err)
	}
}

func (s *Server) handleProcess(w http.ResponseWriter, r *http.Request) {
	// 1. Extract context from incoming HTTP headers using W3C Text Map Propagator
	propagator := otel.GetTextMapPropagator()
	ctx := propagator.Extract(r.Context(), propagation.HeaderCarrier(r.Header))

	// 2. Start a new child span
	ctx, span := s.tracer.Start(ctx, "process-order-span")
	defer span.End()

	// Perform database query simulation
	s.queryDatabase(ctx)

	// 3. Prepare an outgoing downstream request to Payment Service
	downstreamURL := "http://payment-service:8081/charge"
	req, err := http.NewRequestWithContext(ctx, "POST", downstreamURL, nil)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// 4. Inject trace context into outgoing HTTP headers
	propagator.Inject(ctx, propagation.HeaderCarrier(req.Header))

	// Execute client call (mocked)
	client := &http.Client{Timeout: 5 * time.Second}
	fmt.Printf("Sending request to downstream service with Trace ID: %s\n", span.SpanContext().TraceID().String())
	_ = client // In production, execute req with client.Do and handle the response and error

	w.WriteHeader(http.StatusOK)
	w.Write([]byte(`{"status":"order_processed"}`))
}

func (s *Server) queryDatabase(ctx context.Context) {
	_, span := s.tracer.Start(ctx, "postgres-select-query")
	defer span.End()
	time.Sleep(20 * time.Millisecond) // Simulate DB latency
}

Cardinality Management & Performance Optimization

One of the most common operational failures in large-scale observability platforms is a crash caused by mismanaged cardinality. If left unchecked, runaway cardinality can exhaust TSDB memory, cause out-of-disk errors, and make dashboards unusable.

What Causes Cardinality to Explode?

Consider this Prometheus metric:

http_requests_total{service="user-service", route="/users/{id}"}

If your router does not normalize the route variable, Prometheus will create a separate time series for every single user ID. If you have 1,000,000 users, you will have 1,000,000 unique time series. This is a cardinality explosion.

To prevent this, always normalize path variables in metrics:

http_requests_total{service="user-service", route="/users/:id"}
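
Most web frameworks expose the matched route template, and that is the preferred value for a route label. Where it is not available, a normalization step can be applied before the label is set. The sketch below is one hedged way to do that with regular expressions. `normalize_route` is a hypothetical helper that only handles numeric IDs and UUIDs.

```python
import re

# A path segment made only of digits, e.g. /users/12345
_NUMERIC_SEGMENT = re.compile(r"/\d+(?=/|$)")
# A path segment that is a canonical UUID.
_UUID_SEGMENT = re.compile(
    r"/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}(?=/|$)"
)


def normalize_route(path: str) -> str:
    """Replace variable path segments with a fixed placeholder."""
    path = _UUID_SEGMENT.sub("/:id", path)
    return _NUMERIC_SEGMENT.sub("/:id", path)


for raw in ["/users/12345", "/users/12345/orders/9", "/health", "/v2/users/1"]:
    print(raw, "->", normalize_route(raw))
```

With this in place, one million users still produce a single route label value, and the per-user detail belongs in logs or traces instead.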

Techniques for Optimizing Prometheus TSDB Performance

Relabeling Configurations: Use Prometheus relabel_configs or metric_relabel_configs to drop high-cardinality labels or discard entire metrics that provide low operational value.
Recording Rules: For complex PromQL queries that run frequently (such as those powering critical Grafana dashboards), create Recording Rules. Recording rules pre-compute the query results at regular intervals and save them as a new, lightweight time series.
Limit Max Series per Scrape: Enforce safety limits in your scrape configuration so that a scrape fails, and nothing from it is ingested, when a target exposes too many samples:
```
scrape_configs:
  - job_name: 'my-app'
    sample_limit: 10000 # Scrape is treated as failed if the target exposes > 10,000 samples
```

Optimizing Loki Logging Pipelines

To keep Loki performing optimally, adhere to the following architectural rules:

Do not use UUIDs, user IDs, or timestamps as labels. Use them within the log line body.
Use dynamic label parsing: Standardize on JSON logging. Use Loki's LogQL | json parser to extract fields dynamically during search operations instead of pre-indexing them.
Configure retention policies: Set distinct retention periods for different environments. Keep dev and staging logs for 3 days, while preserving prod logs for 30 days.

Troubleshooting & Operational Runbooks

Even the best-designed observability stacks experience operational failures. When telemetry pipelines fail, you lose visibility into your applications. Below are common failure modes and step-by-step troubleshooting procedures.

Scenario A: Prometheus Memory Usage Explodes (OOM Kill Loop)

Symptom: The Prometheus pod crashes repeatedly with an Out-of-Memory error (Exit Code 137).

Root Cause: A recent deployment has introduced a high-cardinality metric label, overloading the TSDB Head block memory.

Runbook:

Temporarily increase the container memory limit in Kubernetes to allow Prometheus to start up and process queries.
Identify the high-cardinality metric using the Prometheus TSDB Status API or by running this PromQL query:
```
topk(10, count by (__name__) ({__name__=~".+"}))
```
This query returns the top 10 metrics with the highest number of active time series.
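Once the offending metric is identified (e.g., api_request_duration_seconds), count the distinct values of each candidate label, replacing label_name with the label under test. The label with the largest result is the one responsible:
```
count(count by (label_name) (api_request_duration_seconds))
```
Apply a metric_relabel_configs block in your Prometheus scrape configuration to drop the offending label immediately. The labeldrop action matches label names with regex rather than source_labels, and the remaining labels must still identify each series uniquely:
```
metric_relabel_configs:
  - regex: offending_label
    action: labeldrop
```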
Contact the development team to fix the instrumentation code (e.g., normalize the URL path variables).
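
The label hunt in this runbook can also be done offline once you have the label sets of the offending metric. The sketch below assumes the label sets are already loaded as Python dictionaries (here they are generated in place as sample data) and counts distinct values per label. `label_cardinality` is a hypothetical helper.

```python
from collections import defaultdict


def label_cardinality(series: list[dict[str, str]]) -> dict[str, int]:
    """Count distinct values seen for each label name."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}


# Sample data: 1,000 series that differ only in an unnormalized path label.
series = [
    {"method": "GET", "status": "200", "path": f"/users/{i}"} for i in range(1000)
]
series.append({"method": "POST", "status": "500", "path": "/checkout"})

cardinality = label_cardinality(series)
worst = max(cardinality, key=cardinality.get)
print(cardinality)
print("label to fix:", worst)
```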

Scenario B: Loki Returns "Entry too far behind" or "Ingestion Rate Limit Exceeded"

Symptom: Applications or log shippers (Promtail/Alloy) report errors like 429 Too Many Requests or 400 Entry too far behind when pushing logs to Loki.

Root Cause:

429 Error: The application is emitting logs faster than Loki's configured rate-limiting threshold per tenant.
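400 Error: A system clock is out of sync, or an application is backfilling historical logs. Loki rejects entries that are too old relative to the newest entry it has already accepted for the same stream; the acceptance window is derived from max_chunk_age (half of it, according to the Loki documentation on out-of-order writes).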

Runbook:

For 429 errors, check Loki's limits configuration (typically in the limits_config section) and adjust the ingestion rates:

limits_config:
  ingestion_rate_mb: 10 # Increase limit from default 4MB
  ingestion_burst_size_mb: 20
  max_streams_per_user: 10000

For 400 errors, verify NTP (Network Time Protocol) sync on the host servers emitting logs:
```
timedatectl status
```
Ensure "System clock synchronized" is yes.
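
For the 429 path, it helps to understand how a sustained rate and a burst size interact. The sketch below is a generic token bucket, included as an illustration of that interaction under the example values from the configuration above. It is not Loki's actual limiter, and the class and its parameters are hypothetical.

```python
class TokenBucket:
    """Generic token bucket: `rate` tokens per second, capped at `burst`."""

    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s
        self.burst = burst
        self.tokens = burst  # start with a full bucket
        self.last = 0.0

    def allow(self, size: float, now: float) -> bool:
        # Refill for the elapsed time, never exceeding the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False  # a server would answer this push with HTTP 429


# 10 MB/s sustained with a 20 MB burst; sizes are in MB, times in seconds.
bucket = TokenBucket(rate_per_s=10, burst=20)
results = [
    bucket.allow(15, now=0.0),  # fits inside the initial burst
    bucket.allow(10, now=0.0),  # only 5 MB of tokens remain
    bucket.allow(10, now=1.0),  # one second later, 10 MB was refilled
]
print(results)
```

If rejected pushes are frequent even after raising the limits, the better fix is usually to reduce log volume at the source rather than to keep increasing the thresholds.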

Technical Interview Questions & Answers

Q1: Explain the difference between rate() and irate() in PromQL. When would you use each?

Answer: Both rate() and irate() calculate the per-second rate of change of a counter metric, but they do so differently:

rate() calculates the average per-second rate of increase over the entire specified time window (e.g., [5m]). It is derived from the total increase across the window, with counter resets accounted for, so a brief spike is averaged out. This makes it ideal for alerting thresholds and long-term trend analysis.
irate() calculates the instantaneous rate of change from only the last two data points within the specified time window. It is highly sensitive to rapid, short-term fluctuations, so it suits high-resolution, zoomed-in dashboards where you need to see precise spikes. Avoid it for alerting and long-term trends: the result depends on just two samples, so a spike right at the end of the window can trigger false alarms.
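The difference is easiest to see on a small set of samples. The sketch below is a simplified model of the two functions, not the Prometheus implementation: it handles counter resets but omits the extrapolation to the window boundaries that the real rate() performs. The names simple_rate and simple_irate and the sample values are made up for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def _increase(prev: float, cur: float) -> float:
    # A drop in a counter means the process restarted, so count from zero.
    return cur - prev if cur >= prev else cur


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase over the whole window."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    total = sum(_increase(a[1], b[1]) for a, b in zip(samples, samples[1:]))
    return total / (samples[-1][0] - samples[0][0])


def simple_irate(samples: List[Sample]) -> float:
    """Per-second increase between the last two samples only."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return _increase(v0, v1) / (t1 - t0)


if __name__ == "__main__":
    # One request per second for 45 s, then a burst in the final 15 s.
    window = [(0.0, 0.0), (15.0, 15.0), (30.0, 30.0), (45.0, 45.0), (60.0, 360.0)]
    print(simple_rate(window))   # 6.0
    print(simple_irate(window))  # 21.0
```

The same burst moves the averaged value to 6 requests per second but the instantaneous value to 21, which is why the latter is better for spotting spikes and worse for stable alert thresholds.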

Q2: Why is it bad practice to put a User ID or Request ID in a Prometheus label?

Answer: Prometheus is a Time Series Database designed to track long-lived series over time. Every unique combination of key-value labels creates a distinct time series that must be stored in memory and indexed.

Because User IDs and Request IDs have unbounded cardinality (potentially millions of unique values), putting them in a label causes a cardinality explosion. This consumes massive amounts of RAM, degrades query performance, increases disk utilization, and can ultimately crash Prometheus. Instead, such identifiers should be recorded in logs or traces where high-cardinality data is expected and managed more efficiently.
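A quick back-of-the-envelope calculation shows the scale of the problem. In the worst case, the number of series for one metric is the product of the number of distinct values of each label. The label names and counts below are invented for illustration.

```python
from math import prod
from typing import Dict


def max_series(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single HTTP metric.
bounded = {"method": 5, "status_code": 10, "endpoint": 50}
with_user_id = {**bounded, "user_id": 1_000_000}

if __name__ == "__main__":
    print(max_series(bounded))       # 2500
    print(max_series(with_user_id))  # 2500000000
```

Real workloads rarely hit every combination, but even a small fraction of 2.5 billion potential series is far beyond what a single Prometheus server is sized for.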

Q3: What is the purpose of the OpenTelemetry Collector?

Answer: The OpenTelemetry Collector acts as a vendor-neutral telemetry gateway that receives, processes, enriches, batches, filters, and exports metrics, logs, and traces. It decouples application instrumentation from backend observability platforms.

Centralizes telemetry collection.
Reduces application complexity.
Supports multiple exporters (Prometheus, Tempo, Loki, Jaeger, Datadog, Splunk).
Provides buffering and retry capabilities.
Enriches telemetry with Kubernetes metadata.

Q4: Explain Exemplars in Prometheus.

Answer: Exemplars are metadata attached to metric samples that link metrics directly to traces. They allow engineers to move from a metric spike directly into the distributed trace responsible for that anomaly.

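In the OpenMetrics text format, the exemplar follows the sample value after a # and carries its own labels and value (the bucket count and exemplar value below are illustrative):
```
http_request_duration_seconds_bucket{le="1.0"} 42 # {trace_id="abc123xyz"} 0.67
```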

When Grafana displays a latency spike, exemplars allow users to click directly on the associated trace, dramatically reducing Mean Time To Resolution (MTTR).
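To make the structure of an exemplar concrete, the sketch below pulls the trace ID out of a single exposition line of the form "series value # {labels} exemplar_value". It is a deliberately simplified parser for illustration: it assumes no whitespace inside the series labels and no exemplar timestamp, and the function name parse_exemplar and the sample line are assumptions, not part of any client library.

```python
import re
from typing import Dict, Optional

# Simplified pattern: <series> <value> # {<exemplar labels>} <exemplar value>
_EXEMPLAR_LINE = re.compile(
    r"^(?P<series>\S+)\s+(?P<value>\S+)\s+#\s+"
    r"\{(?P<labels>[^}]*)\}\s+(?P<exemplar_value>\S+)"
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_exemplar(line: str) -> Optional[Dict[str, object]]:
    """Return the sample and its exemplar, or None if the line has no exemplar."""
    match = _EXEMPLAR_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "series": match.group("series"),
        "value": float(match.group("value")),
        "exemplar_labels": dict(_LABEL.findall(match.group("labels"))),
        "exemplar_value": float(match.group("exemplar_value")),
    }


if __name__ == "__main__":
    # Illustrative line; the numbers are made up.
    line = 'http_request_duration_seconds_bucket{le="1.0"} 42 # {trace_id="abc123xyz"} 0.67'
    parsed = parse_exemplar(line)
    print(parsed["exemplar_labels"]["trace_id"])
```

The trace_id recovered here is the value a dashboard would use to open the matching trace in the tracing backend.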

Q5: Why are Histograms preferred over Summaries in distributed systems?

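Answer: Histograms can be aggregated across multiple instances because they expose cumulative bucket counts that can be summed before a quantile is estimated (with histogram_quantile() in PromQL). Summaries calculate quantiles inside each process, and averaging precomputed quantiles is not statistically meaningful, so they cannot be aggregated accurately across replicas.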

In Kubernetes environments where applications run across dozens or hundreds of pods, histograms are the preferred choice for latency measurements and Service Level Objective (SLO) calculations.
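The aggregation argument can be sketched in a few lines: sum the cumulative bucket counts from each pod, then estimate the quantile by linear interpolation inside the bucket that contains the target rank. This is a simplified model in the spirit of histogram_quantile(), not a reimplementation of it; the function names, bucket boundaries, and counts are assumptions for the example.

```python
import math
from typing import Dict, List

Buckets = Dict[float, int]  # upper bound ("le") -> cumulative count


def merge_buckets(per_pod: List[Buckets]) -> Buckets:
    """Sum cumulative counts bucket by bucket; valid because all pods share bounds."""
    merged: Buckets = {}
    for buckets in per_pod:
        for le, count in buckets.items():
            merged[le] = merged.get(le, 0) + count
    return merged


def estimate_quantile(q: float, buckets: Buckets) -> float:
    """Estimate the q-quantile by linear interpolation within one bucket."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for le in bounds:
        count = buckets[le]
        if count >= rank:
            if math.isinf(le):
                return lower_bound  # cannot interpolate into the +Inf bucket
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (le - lower_bound) * fraction
        lower_bound, lower_count = le, count
    return lower_bound


if __name__ == "__main__":
    pod_a = {0.1: 50, 0.5: 90, 1.0: 100, math.inf: 100}
    pod_b = {0.1: 10, 0.5: 40, 1.0: 90, math.inf: 100}
    merged = merge_buckets([pod_a, pod_b])
    print(merged)
    print(round(estimate_quantile(0.5, merged), 4))
```

No equivalent operation exists for summaries: given only a per-pod median from each pod, there is no way to recover the median of the combined traffic.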

Q6: What is sampling in distributed tracing?

Answer: Sampling determines which traces should be stored and which should be discarded. Recording every trace in a high-traffic environment is often prohibitively expensive.

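Head-Based Sampling: The decision is made when the trace starts, before its outcome is known.
Tail-Based Sampling: The decision is made after the trace completes, when errors and latency are visible.
Probabilistic Sampling: A fixed percentage of traces is kept at random.
Rule-Based Sampling: Traces are kept based on attributes such as status code or endpoint.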

Enterprise observability platforms often use tail-based sampling to retain error traces while discarding routine successful requests.
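The contrast between head-based and tail-based decisions can be sketched as two small functions. The head-based function decides from the trace ID alone, comparing its low 64 bits against a threshold so that every service reaches the same verdict without coordination; the tail-based function inspects the finished spans. Both are illustrative sketches: the function names, the span dictionary shape (error, duration_ms), and the one-second threshold are assumptions, not the API of any tracing SDK.

```python
from typing import Dict, List


def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Decide at trace start from the trace ID alone (deterministic per trace)."""
    low_64_bits = int(trace_id_hex, 16) & 0xFFFFFFFFFFFFFFFF
    return low_64_bits < int(ratio * (1 << 64))


def tail_sample(spans: List[Dict], slow_ms: float = 1000.0) -> bool:
    """Decide after completion: keep traces with an error or a slow span."""
    return any(
        bool(span.get("error")) or span.get("duration_ms", 0.0) >= slow_ms
        for span in spans
    )


if __name__ == "__main__":
    # Hypothetical 128-bit trace ID in hex, as carried in a traceparent header.
    trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
    print(head_sample(trace_id, ratio=0.10))

    routine = [{"duration_ms": 20.0}, {"duration_ms": 35.0}]
    failed = [{"duration_ms": 20.0}, {"duration_ms": 35.0, "error": True}]
    print(tail_sample(routine), tail_sample(failed))
```

The head-based decision is cheap but blind to the outcome, while the tail-based decision requires buffering all spans of a trace until it completes, which is the main operational cost of that approach.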

Frequently Asked Questions (FAQs)

How much data does a typical observability platform generate?

A medium-sized Kubernetes cluster running 100 microservices can easily generate:

50,000–500,000 active Prometheus time series
100GB–1TB of logs per day
Millions of distributed traces per day

This is why retention management, compression, sampling, and cardinality controls are critical architectural concerns.
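For the metrics portion, a rough capacity estimate follows the sizing rule from the Prometheus storage documentation: disk space is approximately retention time in seconds, times ingested samples per second, times bytes per sample, with roughly 1 to 2 bytes per sample after compression. The sketch below applies that rule; the 15-second scrape interval, 15-day retention, and 2 bytes per sample are assumptions chosen for the example, and the function name is hypothetical.

```python
def prometheus_disk_bytes(
    active_series: int,
    scrape_interval_s: float,
    retention_days: float,
    bytes_per_sample: float = 2.0,  # assumption: upper end of the 1-2 byte range
) -> float:
    """Estimate on-disk TSDB size: retention * samples/s * bytes/sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    for series in (50_000, 500_000):
        size = prometheus_disk_bytes(series, scrape_interval_s=15, retention_days=15)
        print(f"{series} series -> {size / 1e9:.1f} GB")
```

Under these assumptions the range quoted above corresponds to roughly 9 GB to 86 GB of metric storage, which is small next to the log volume and explains why log retention usually dominates the storage budget.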

Can Prometheus store logs?

No. Prometheus is a Time Series Database (TSDB) optimized for numeric measurements. Logs should be stored in systems designed for event data, such as Loki, Elasticsearch, or OpenSearch.

Can Loki replace Elasticsearch?

It depends on requirements. Loki is significantly cheaper and more scalable for Kubernetes workloads because it indexes labels instead of full log contents. However, Elasticsearch provides richer full-text search capabilities and complex analytics.

What is the LGTM Stack?

LGTM stands for:

Loki → Logs
Grafana → Visualization
Tempo → Traces
Mimir (or Prometheus) → Metrics

Together, these components provide a complete open-source observability platform.

What are the biggest observability anti-patterns?

High-cardinality labels in Prometheus.
Using request IDs as Loki labels.
Logging sensitive customer data.
Alerting on every metric instead of business-impacting signals.
Collecting traces without context propagation.
Running observability backends without retention policies.

Summary & Next Steps

Modern observability extends far beyond traditional infrastructure monitoring. In distributed systems, engineers must be able to investigate unknown failure modes, correlate telemetry across service boundaries, and reconstruct the complete execution path of user transactions.

The three pillars of observability work together to answer distinct questions:

Metrics: What is happening?
Logs: Why did it happen?
Traces: Where did it happen?

A production-grade observability platform typically consists of:

Prometheus or Mimir for metrics.
Loki for logs.
Tempo for traces.
Grafana for visualization and alerting.
OpenTelemetry Collector or Grafana Alloy for telemetry collection and enrichment.

Successful observability implementations depend on:

Consistent instrumentation standards.
Structured logging.
Trace context propagation.
Low-cardinality metric and log labels.
Efficient retention and sampling strategies.
Well-defined operational runbooks.

Observability is not about collecting more data. It is about collecting the right data, correlating it effectively, and enabling engineers to answer questions they have never needed to ask before.

Conclusion

Observability has become a foundational capability for operating modern distributed systems at scale. Whether you are running cloud-native microservices, event-driven architectures, or large Kubernetes platforms, the ability to correlate metrics, logs, and traces directly impacts system reliability, operational efficiency, and customer experience.

By combining Prometheus, Loki, Tempo, Grafana, and OpenTelemetry, organizations can build a robust observability ecosystem capable of handling millions of telemetry events while maintaining deep visibility into application behavior.

Mastering observability is no longer optional for platform engineers, DevOps engineers, SREs, cloud architects, and senior software developers. It is an essential engineering competency for building resilient, scalable systems.