Published: 2026-06-01 • Updated: 2026-07-05

Monitoring and Logging in Kubernetes with Prometheus, Grafana, Loki, and Alertmanager

Running applications in Kubernetes is not only about deploying Pods, Services, and Ingress resources. In production, teams must continuously understand what is happening inside the cluster. They need to know whether applications are healthy, whether users are facing errors, whether CPU or memory usage is increasing, and whether any service is close to failure.

This is where monitoring and logging become extremely important. Monitoring helps teams measure system health using metrics. Logging helps teams investigate what happened inside applications and infrastructure. Together, they provide observability.


What is Observability?

Observability means understanding the internal state of a system by looking at its external outputs.

In Kubernetes, observability usually includes:

  • Metrics: Numerical measurements like CPU usage, memory usage, request count, and error rate
  • Logs: Text records generated by applications and containers
  • Traces: Request journey across multiple microservices
  • Alerts: Notifications when something crosses a danger threshold

Why Are Monitoring and Logging Needed?

Without monitoring and logging, production issues become guesswork.

For example, if users say the website is slow, teams must quickly answer:

  • Which service is slow?
  • Is CPU usage high?
  • Is memory full?
  • Are Pods restarting?
  • Is the database slow?
  • Are APIs returning errors?
  • When did the issue start?

Prometheus, Grafana, Loki, and Alertmanager help answer these questions quickly.


Real-World E-Commerce Example

Suppose an e-commerce platform runs a festival sale.

During peak traffic:

  • Frontend traffic increases
  • Checkout service gets heavy load
  • Payment service latency increases
  • Database CPU usage rises
  • Some Pods start restarting

With proper monitoring:

  • Prometheus detects high CPU and error rate
  • Grafana dashboards show which service is affected
  • Loki logs reveal payment timeout errors
  • Alertmanager notifies the DevOps team

[ User Complaints ]
        |
        v
[ Grafana Dashboard ]
        |
        v
[ Prometheus Metrics ]
        |
        v
[ Loki Logs ]
        |
        v
Root Cause Found Quickly

Prometheus: Metrics Collection

Prometheus is an open-source monitoring system widely used in Kubernetes. It collects metrics from applications, Kubernetes components, nodes, and exporters.

Prometheus stores metrics as time-series data. Each metric contains:

  • Metric name
  • Timestamp
  • Value
  • Labels

Prometheus Pull-Based Model

Prometheus usually uses a pull model. That means Prometheus scrapes metrics from targets instead of waiting for applications to push metrics.

[ Application /metrics Endpoint ]
              ^
              |
[ Prometheus Scrapes Metrics ]

Instrumented applications typically expose metrics over HTTP at a path such as:

/metrics
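On the Prometheus side, the pull model comes down to a list of targets to scrape. The following is a minimal sketch of a `prometheus.yml`, assuming a hypothetical `payment-service` Service in the `default` namespace that serves metrics on port 8080. The job name, address, and interval are illustrative and not taken from a real setup.

```yaml
# prometheus.yml (sketch): scrape one statically configured target.
global:
  scrape_interval: 30s          # how often Prometheus pulls from each target
scrape_configs:
  - job_name: payment-service   # hypothetical job name; becomes the "job" label
    metrics_path: /metrics      # the default path, written out for clarity
    static_configs:
      - targets:
          - payment-service.default.svc:8080   # assumed Service DNS name and port
```

In a real cluster, static targets are usually replaced by Kubernetes service discovery (`kubernetes_sd_configs`) so that new Pods are picked up automatically.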

Prometheus Workflow

Kubernetes Pods
      |
      v
Expose Metrics Endpoint
      |
      v
Prometheus Scrapes Metrics
      |
      v
Metrics Stored in Time-Series DB
      |
      v
PromQL Queries Analyze Data
      |
      v
Grafana Visualizes Results

Prometheus Deployment Example

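apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  replicas: 1                    # a single Prometheus instance
  selector:
    matchLabels:
      app: prometheus            # must match the Pod template labels below
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus   # pin a specific version tag in production
        ports:
        - containerPort: 9090    # Prometheus web UI and HTTP API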

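This runs a single Prometheus replica that listens on port 9090 inside the Pod. On its own it is only a starting point: there is no stable address for it until a Service is added, and with no configuration or persistent volume mounted, Prometheus uses the image's default configuration and its data does not survive the Pod being replaced.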
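To reach port 9090 from Grafana or through `kubectl port-forward`, the Deployment is normally paired with a Service. This is a minimal sketch, assuming the `app: prometheus` label from the Deployment above and a ClusterIP Service. The Service name and port name are illustrative.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus        # illustrative name; becomes the in-cluster DNS name
spec:
  type: ClusterIP         # reachable only from inside the cluster
  selector:
    app: prometheus       # selects the Pods created by the Deployment
  ports:
    - name: web
      port: 9090          # port exposed by the Service
      targetPort: 9090    # containerPort of the Prometheus container
```

With this in place, workloads in the same namespace can reach Prometheus at `http://prometheus:9090`.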


What Metrics Should Be Monitored?

In production Kubernetes environments, teams usually monitor:

  • CPU usage
  • Memory usage
  • Pod restart count
  • Container OOMKilled events
  • Node disk usage
  • Network traffic
  • API latency
  • Error rate
  • Request throughput
  • Database connections
  • Queue length
  • Ingress traffic

Golden Signals of Monitoring

A useful monitoring strategy often focuses on four golden signals:

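| Signal     | Meaning                               | Example                     |
|------------|---------------------------------------|-----------------------------|
| Latency    | How long requests take                | Payment API takes 3 seconds |
| Traffic    | How many requests the system receives | 10,000 requests per minute  |
| Errors     | How many requests fail                | 5xx errors increase         |
| Saturation | How full resources are                | CPU at 90%                  |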
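One way to make the four signals concrete is to precompute them as Prometheus recording rules. The sketch below assumes the application exposes Micrometer-style `http_server_requests_seconds_*` metrics with a `status` label, and that cAdvisor and kube-state-metrics series carry matching `namespace` and `pod` labels. The rule names are illustrative.

```yaml
# golden-signals.rules.yml (sketch): one recording rule per signal.
groups:
  - name: golden-signals
    rules:
      # Latency: mean request duration in seconds over 5 minutes.
      - record: job:http_request_duration_seconds:mean5m
        expr: |
          sum by (job) (rate(http_server_requests_seconds_sum[5m]))
          /
          sum by (job) (rate(http_server_requests_seconds_count[5m]))
      # Traffic: requests per second.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_server_requests_seconds_count[5m]))
      # Errors: share of requests that returned a 5xx status.
      - record: job:http_requests_errors:ratio5m
        expr: |
          sum by (job) (rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_server_requests_seconds_count[5m]))
      # Saturation: CPU used as a fraction of the Pod's CPU limit.
      - record: namespace_pod:container_cpu_usage:ratio_limit
        expr: |
          sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
          /
          sum by (namespace, pod) (kube_pod_container_resource_limits{resource="cpu"})
```

A mean hides slow outliers. If the application publishes histogram buckets, a percentile computed with `histogram_quantile` is usually a better latency signal.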

Grafana: Visualization and Dashboards

Grafana is used to visualize metrics through dashboards. Prometheus stores metrics, but Grafana makes them easy to understand using graphs, panels, tables, and alerts.

Grafana can connect to:

  • Prometheus
  • Loki
  • Elasticsearch
  • InfluxDB
  • Cloud monitoring systems
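Data sources can be added through the Grafana UI or declared in a provisioning file that Grafana reads at startup. The following is a sketch of such a file, assuming Prometheus and Loki are reachable through Services named `prometheus` and `loki` in the same namespace. Both names and URLs are assumptions.

```yaml
# Mounted under /etc/grafana/provisioning/datasources/ (sketch).
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana's backend makes the requests
    url: http://prometheus:9090    # assumed in-cluster Service address
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100          # assumed in-cluster Service address
```

Keeping data sources in a file rather than in the UI means a rebuilt Grafana Pod comes back with the same connections.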

Grafana Deployment Example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana               # must match the Pod template labels below
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana   # pin a specific version tag in production
        ports:
        - containerPort: 3000    # Grafana web UI

Grafana Dashboard Example

A Kubernetes dashboard may show:

  • Cluster CPU usage
  • Cluster memory usage
  • Pod restart count
  • Node health
  • HTTP request rate
  • Error percentage
  • Top slow services

[ Grafana Dashboard ]
        |
        +-- CPU Usage Panel
        +-- Memory Usage Panel
        +-- Pod Restart Panel
        +-- API Latency Panel
        +-- Error Rate Panel

Alertmanager

Alertmanager handles alerts generated by Prometheus.

It can send notifications to:

  • Email
  • Slack
  • Microsoft Teams
  • PagerDuty
  • Webhook endpoints

Alerting Flow

Prometheus Detects Problem
          |
          v
Alert Rule Triggers
          |
          v
Alertmanager Receives Alert
          |
          v
Notification Sent to Team
          |
          v
Issue Investigated
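The routing step in this flow is configured in `alertmanager.yml`. The sketch below sends everything to Slack and additionally pages on-call for alerts labeled `severity="critical"`. The receiver names, channel, webhook URL, and routing key are placeholders, and the `severity` label is an assumed convention that alert rules would need to set.

```yaml
# alertmanager.yml (sketch)
route:
  receiver: team-slack               # default receiver for all alerts
  group_by: [alertname, namespace]   # batch related alerts into one notification
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - 'severity="critical"'      # assumed label set by the alert rules
      receiver: oncall-pagerduty
receivers:
  - name: team-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/ME   # placeholder
        channel: "#alerts"                                     # placeholder
        send_resolved: true
  - name: oncall-pagerduty
    pagerduty_configs:
      - routing_key: REPLACE_ME      # placeholder integration key
```

Grouping and repeat intervals are the main levers against alert fatigue: they control how many notifications a single incident produces.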

Real-World Banking Alert Example

In a banking system, alerts may be created for:

  • Payment API error rate greater than 2%
  • Transaction latency greater than 1 second
  • Database CPU above 85%
  • Pod restart count increasing
  • Disk usage above 80%

These alerts help teams respond before customers face major issues.
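Some of these thresholds could be written as Prometheus alerting rules along the following lines. This is a sketch under several assumptions: the payment service is scraped as `job="payment"` and exposes Micrometer-style `http_server_requests_seconds_*` metrics, node_exporter provides the filesystem metrics, and the `for` durations and severities are illustrative.

```yaml
# payment-alerts.rules.yml (sketch)
groups:
  - name: payment-alerts
    rules:
      - alert: PaymentApiHighErrorRate
        # Share of 5xx responses above 2%.
        expr: |
          sum(rate(http_server_requests_seconds_count{job="payment", status=~"5.."}[5m]))
          /
          sum(rate(http_server_requests_seconds_count{job="payment"}[5m]))
          > 0.02
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Payment API error rate above 2% for 5 minutes
      - alert: PaymentApiHighLatency
        # Mean request duration above 1 second.
        expr: |
          sum(rate(http_server_requests_seconds_sum{job="payment"}[5m]))
          /
          sum(rate(http_server_requests_seconds_count{job="payment"}[5m]))
          > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Payment API mean latency above 1 second
      - alert: NodeDiskUsageHigh
        # Filesystem more than 80% full.
        expr: |
          1 - (
            node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            /
            node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
          ) > 0.80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Node filesystem usage above 80%
```

The `for` clause keeps a short spike from firing an alert, at the cost of reacting a few minutes later.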


Loki for Kubernetes Logging

Prometheus is mainly for metrics. Logs require a logging system.

Loki is commonly used with Grafana for centralized logging.

A typical logging stack includes:

  • Promtail: Collects logs from nodes and Pods
  • Loki: Stores logs
  • Grafana: Searches and visualizes logs

Kubernetes Logging Flow

Application Pods
      |
      v
Container Logs
      |
      v
Promtail Collects Logs
      |
      v
Loki Stores Logs
      |
      v
Grafana Searches Logs
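The collection step in this flow can be sketched as a Promtail configuration. The example assumes Promtail runs as a DaemonSet with `/var/log/pods` mounted from the node, a CRI-based runtime such as containerd, and a Loki Service reachable at `http://loki:3100`. The Service name and label choices are assumptions. Grafana has announced that Promtail is being superseded by Grafana Alloy, so check the current recommendation before adopting it for a new setup.

```yaml
# promtail-config.yaml (sketch)
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml            # remembers how far each file was read
clients:
  - url: http://loki:3100/loki/api/v1/push # assumed Loki Service address
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod                          # discover Pods through the Kubernetes API
    pipeline_stages:
      - cri: {}                            # parse the CRI log line format
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: __host__             # lets each Promtail keep only its node's Pods
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        target_label: container
      # Build the log file path from the Pod UID and container name.
      - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
        separator: /
        target_label: __path__
        replacement: /var/log/pods/*$1/*.log
```

The `namespace`, `pod`, and `container` labels set here are the ones later used to filter logs in Grafana, so it pays to keep the set small and stable.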

Real-World Log Debugging Example

Suppose users report payment failures.

Metrics may show:

  • Payment API error rate increased

Logs may show:

Payment gateway timeout after 30000ms

Together, metrics and logs help identify the root cause faster.
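A recurring log pattern like this can also be turned into an alert, because the Loki ruler accepts Prometheus-style rule groups whose expressions are written in LogQL. The sketch below assumes the payment logs carry `namespace="payments"` and `container="payment"` labels. The labels and the threshold of 10 are illustrative.

```yaml
# Loki ruler rule file (sketch)
groups:
  - name: payment-log-alerts
    rules:
      - alert: PaymentGatewayTimeouts
        # Count matching log lines over 5 minutes and compare to a threshold.
        expr: |
          sum(count_over_time({namespace="payments", container="payment"} |= "Payment gateway timeout" [5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Payment gateway timeouts are appearing in the payment service logs
```

The same selector and filter, without the `count_over_time` wrapper, can be pasted into Grafana's Explore view to read the matching lines.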


Monitoring vs Logging

| Monitoring          | Logging                       |
|---------------------|-------------------------------|
| Shows system health | Shows detailed events         |
| Uses metrics        | Uses log messages             |
| Good for alerts     | Good for investigation        |
| Example: CPU 90%    | Example: NullPointerException |

Prometheus Operator and ServiceMonitor

In production Kubernetes environments, Prometheus is often installed using Prometheus Operator.

The Operator simplifies:

  • Prometheus setup
  • Alertmanager setup
  • Service discovery
  • ServiceMonitor configuration
  • Rule management
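With the Operator, alert rules are managed as `PrometheusRule` resources instead of files on disk. This is a sketch, assuming kube-state-metrics is installed and that the Prometheus instance selects rules by a `release` label. That label, the namespace, and the threshold are assumptions that depend on the installation.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-health-rules
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # assumed; must match the Prometheus ruleSelector
spec:
  groups:
    - name: pod-health
      rules:
        - alert: PodRestartingFrequently
          # More than 3 container restarts within 15 minutes.
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
```

If a rule never shows up in the Prometheus UI, a label mismatch between the `PrometheusRule` and the Prometheus resource's rule selector is a common cause.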

What is ServiceMonitor?

A ServiceMonitor tells Prometheus Operator which services should be scraped for metrics.

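apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payment-monitor
spec:
  selector:
    matchLabels:
      app: payment      # selects Services (not Pods) that carry this label
  endpoints:
  - port: metrics       # name of the Service port to scrape, not a port number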
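The ServiceMonitor only works if a Service carries the label it selects and has a port with the name it references. A sketch of such a Service follows, assuming the payment Pods are labeled `app: payment` and serve metrics on port 8080. The port number and Service name are assumptions.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: payment
  labels:
    app: payment          # matched by the ServiceMonitor's spec.selector
spec:
  selector:
    app: payment          # assumed Pod label
  ports:
    - name: metrics       # referenced by "port: metrics" in the ServiceMonitor
      port: 8080          # assumed metrics port
      targetPort: 8080
```

By default a ServiceMonitor looks for Services in its own namespace, so the two resources are usually created side by side.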

Spring Boot Metrics Example

For Spring Boot applications, Actuator and Micrometer are commonly used.

The application exposes metrics at:

/actuator/prometheus

Prometheus scrapes this endpoint and stores application metrics.
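The endpoint is not exposed over HTTP by default. This minimal sketch of the Spring Boot configuration assumes `spring-boot-starter-actuator` and `micrometer-registry-prometheus` are on the classpath.

```yaml
# src/main/resources/application.yml (sketch)
management:
  endpoints:
    web:
      exposure:
        include: health,prometheus   # expose /actuator/health and /actuator/prometheus
```

A ServiceMonitor or scrape job pointing at this application then needs its metrics path set to `/actuator/prometheus` instead of the default `/metrics`.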


Useful PromQL Examples

CPU Usage

rate(container_cpu_usage_seconds_total[5m])

Memory Usage

container_memory_usage_bytes

Pod Restart Count

kube_pod_container_status_restarts_total

HTTP Error Rate

rate(http_server_requests_seconds_count{status=~"5.."}[5m])

Real-World Production Incident Example

Suppose an order service becomes slow.

Investigation Flow

User Reports Slow Checkout
        |
        v
Check Grafana Latency Panel
        |
        v
Prometheus Shows High Response Time
        |
        v
Check Loki Logs
        |
        v
Database Timeout Found
        |
        v
Scale Database / Fix Query

Common Mistakes

1. Monitoring Only CPU and Memory

CPU and memory are important, but not enough. Monitor latency, errors, traffic, and saturation.

2. No Alert Threshold Planning

Bad thresholds create alert fatigue or missed incidents.

3. Too Many Grafana Panels

Dashboards should be useful, not overloaded.

4. Not Collecting Application Metrics

Infrastructure metrics alone cannot explain business failures.

5. Ignoring Logs

Metrics show symptoms. Logs often show the reason.


Production Troubleshooting Commands

kubectl get pods -n monitoring

kubectl logs prometheus-pod -n monitoring

kubectl logs grafana-pod -n monitoring

kubectl get svc -n monitoring

kubectl top pods

kubectl top nodes

kubectl describe pod pod-name

Monitoring Best Practices

  • Monitor golden signals: latency, traffic, errors, saturation
  • Use meaningful labels
  • Create service-level dashboards
  • Configure alert thresholds carefully
  • Use logs with metrics for debugging
  • Track business metrics such as orders and payments
  • Monitor Pod restarts and OOMKilled events
  • Use Alertmanager for notifications
  • Review dashboards regularly

Interview Questions

Q1: What is Prometheus?

Prometheus is an open-source monitoring system that collects and stores time-series metrics.

Q2: What is Grafana?

Grafana is a visualization tool used to create dashboards from metrics and logs.

Q3: What is Alertmanager?

Alertmanager receives alerts from Prometheus and sends notifications to teams.

Q4: What is PromQL?

PromQL is the query language used to analyze Prometheus metrics.

Q5: What is Loki?

Loki is a log aggregation system commonly used with Grafana.


Interview Trap Questions

Does Prometheus collect logs?

No. Prometheus collects metrics. Loki or similar tools collect logs.

Can Grafana store metrics?

Grafana mainly visualizes data from data sources. Prometheus stores metrics.

Is Metrics Server enough for production monitoring?

No. Metrics Server is mainly for Kubernetes resource metrics and autoscaling. Prometheus provides deeper monitoring.

Should every alert wake up the team?

No. Only actionable alerts should trigger urgent notifications.



Summary

Prometheus, Grafana, Loki, and Alertmanager provide a strong observability stack for Kubernetes.

Prometheus collects metrics, Grafana visualizes metrics and logs, Loki stores logs, and Alertmanager sends notifications when issues occur.

In real production systems, monitoring and logging help teams detect issues early, reduce downtime, troubleshoot faster, and improve application reliability.

For banking, e-commerce, healthcare, SaaS, and cloud-native platforms, observability is not optional. It is a core requirement for running reliable Kubernetes applications.

About the Author

Naresh Kumar

Senior Java Backend Engineer experienced in Banking, Payments, ISO 20022, Spring Boot, Microservices, Kafka, Docker, Kubernetes, AWS and Cloud Native Systems.

Built enterprise payment solutions, transaction processing systems, API platforms and scalable microservices used in production.
