Monitoring and Logging in Kubernetes with Prometheus, Grafana, Loki, and Alertmanager

Running applications in Kubernetes is not only about deploying Pods, Services, and Ingress resources. In production, teams must continuously understand what is happening inside the cluster. They need to know whether applications are healthy, whether users are facing errors, whether CPU or memory usage is increasing, and whether any service is close to failure.

This is where monitoring and logging become extremely important. Monitoring helps teams measure system health using metrics. Logging helps teams investigate what happened inside applications and infrastructure. Together, they provide observability.

This article explains how Prometheus, Grafana, Alertmanager, and Loki work together, and adds real-world production examples, Kubernetes architecture diagrams, troubleshooting workflows, interview notes, and practical best practices.

What is Observability?

Observability means understanding the internal state of a system by looking at its external outputs.

In Kubernetes, observability usually includes:

Metrics: Numerical measurements like CPU usage, memory usage, request count, and error rate
Logs: Text records generated by applications and containers
Traces: The path of a single request across multiple microservices
Alerts: Notifications when something crosses a danger threshold

Why Are Monitoring and Logging Needed?

Without monitoring and logging, production issues become guesswork.

For example, if users say the website is slow, teams must quickly answer:

Which service is slow?
Is CPU usage high?
Is memory full?
Are Pods restarting?
Is the database slow?
Are APIs returning errors?
When did the issue start?

Prometheus, Grafana, Loki, and Alertmanager help answer these questions quickly.

Real-World E-Commerce Example

Suppose an e-commerce platform runs a festival sale.

During peak traffic:

Frontend traffic increases
Checkout service gets heavy load
Payment service latency increases
Database CPU usage rises
Some Pods start restarting

With proper monitoring:

Prometheus detects high CPU and error rate
Grafana dashboards show which service is affected
Loki logs reveal payment timeout errors
Alertmanager notifies the DevOps team

[ User Complaints ]
        |
        v
[ Grafana Dashboard ]
        |
        v
[ Prometheus Metrics ]
        |
        v
[ Loki Logs ]
        |
        v
Root Cause Found Quickly

Prometheus: Metrics Collection

Prometheus is an open-source monitoring system widely used in Kubernetes. It collects metrics from applications, Kubernetes components, nodes, and exporters.

Prometheus stores metrics as time-series data. Each stored sample contains:

Metric name
Labels: Key-value pairs that, together with the metric name, identify the time series
Timestamp
Value

Prometheus Pull-Based Model

Prometheus usually uses a pull model. That means Prometheus scrapes metrics from targets instead of waiting for applications to push metrics.

[ Application /metrics Endpoint ]
              ^
              |
[ Prometheus Scrapes Metrics ]

Most applications expose metrics through an endpoint such as:

/metrics
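
The pull model is configured on the Prometheus side, not in the application. The following is a minimal sketch of a prometheus.yml, assuming that Pods opt in with a prometheus.io/scrape: "true" annotation (a common convention, not something Prometheus requires) and that Prometheus has permission to list Pods through the Kubernetes API. The job name and the 30s interval are illustrative choices.

```yaml
global:
  scrape_interval: 30s        # how often each target is scraped (assumed value)

scrape_configs:
  - job_name: kubernetes-pods # hypothetical job name
    metrics_path: /metrics    # the default path, written out for clarity
    kubernetes_sd_configs:
      - role: pod             # discover Pods through the Kubernetes API
    relabel_configs:
      # Keep only Pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Copy namespace and Pod name onto every scraped series
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

With this in place, adding a new application to monitoring means annotating its Pods rather than editing the Prometheus configuration.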

Prometheus Workflow

Kubernetes Pods
      |
      v
Expose Metrics Endpoint
      |
      v
Prometheus Scrapes Metrics
      |
      v
Metrics Stored in Time-Series DB
      |
      v
PromQL Queries Analyze Data
      |
      v
Grafana Visualizes Results

Prometheus Deployment Example

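apiVersion: apps/v1
kind: Deployment

metadata:
  name: prometheus

spec:
  # A single replica is enough for a demo, but it is not highly available
  replicas: 1

  selector:
    matchLabels:
      app: prometheus

  template:
    metadata:
      labels:
        app: prometheus

    spec:
      containers:
      - name: prometheus
        # Pin a specific version tag in production instead of the implicit "latest"
        image: prom/prometheus

        ports:
        # Default port of the Prometheus web UI and HTTP API
        - containerPort: 9090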

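This runs a single Prometheus replica that listens on container port 9090. With no configuration file mounted, the image's default configuration scrapes only Prometheus itself, and with no persistent volume the stored metrics are lost when the Pod is recreated.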
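
Other components, such as Grafana, normally reach Prometheus through a Service rather than through the Pod directly. A minimal sketch, assuming the Deployment above keeps the label app: prometheus; the Service name prometheus and the port name web are illustrative choices.

```yaml
apiVersion: v1
kind: Service

metadata:
  name: prometheus            # hypothetical name; becomes the in-cluster DNS name

spec:
  selector:
    app: prometheus           # must match the Pod labels in the Deployment
  ports:
  - name: web
    port: 9090                # port exposed by the Service
    targetPort: 9090          # containerPort of the Prometheus container
```

Inside the same namespace, Prometheus would then be reachable at http://prometheus:9090.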

What Metrics Should Be Monitored?

In production Kubernetes environments, teams usually monitor:

CPU usage
Memory usage
Pod restart count
Container OOMKilled events
Node disk usage
Network traffic
API latency
Error rate
Request throughput
Database connections
Queue length
Ingress traffic

Golden Signals of Monitoring

A useful monitoring strategy often focuses on four golden signals:

Signal	Meaning	Example
Latency	How long requests take	Payment API takes 3 seconds
Traffic	How many requests the system receives	10,000 requests per minute
Errors	How many requests fail	5xx errors increase
Saturation	How full resources are	CPU at 90%
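
The four signals can be precomputed with Prometheus recording rules. The sketch below assumes Spring Boot style metrics (http_server_requests_seconds_count and http_server_requests_seconds_sum) and cAdvisor container metrics; the group name and the recorded series names are illustrative, and latency is approximated by the mean request duration.

```yaml
groups:
  - name: golden-signals      # hypothetical group name
    rules:
      # Traffic: requests per second, per scrape job
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_server_requests_seconds_count[5m]))

      # Errors: share of requests answered with a 5xx status
      - record: job:http_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_server_requests_seconds_count[5m]))

      # Latency: mean request duration in seconds
      - record: job:http_request_duration_seconds:mean5m
        expr: |
          sum by (job) (rate(http_server_requests_seconds_sum[5m]))
            /
          sum by (job) (rate(http_server_requests_seconds_count[5m]))

      # Saturation: CPU cores consumed per Pod
      - record: namespace_pod:container_cpu_usage_seconds:rate5m
        expr: sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
```

Dashboards and alerts can then query the short recorded names instead of repeating the full expressions.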

Grafana: Visualization and Dashboards

Grafana is used to visualize metrics through dashboards. Prometheus stores metrics, but Grafana makes them easy to understand using graphs, panels, tables, and alerts.

Grafana can connect to:

Prometheus
Loki
Elasticsearch
InfluxDB
Cloud monitoring systems
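
Data sources can be added in the Grafana UI or provisioned from a file. The following is a sketch of a provisioning file (typically placed under /etc/grafana/provisioning/datasources/), assuming Prometheus and Loki are reachable through Services named prometheus and loki in the same namespace; both URLs are assumptions.

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                 # Grafana's backend sends the queries
    url: http://prometheus:9090   # assumed Service name and port
    isDefault: true

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100         # assumed Service name and port
```

Provisioning from a file keeps the data source setup in version control and makes a fresh Grafana Pod usable without manual steps.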

Grafana Deployment Example

apiVersion: apps/v1
kind: Deployment

metadata:
  name: grafana

spec:
  replicas: 1

  selector:
    matchLabels:
      app: grafana

  template:
    metadata:
      labels:
        app: grafana

    spec:
      containers:
      - name: grafana
        # Pin a specific version tag in production instead of the implicit "latest"
        image: grafana/grafana

        ports:
        # Default port of the Grafana web UI
        - containerPort: 3000

Grafana Dashboard Example

A Kubernetes dashboard may show:

Cluster CPU usage
Cluster memory usage
Pod restart count
Node health
HTTP request rate
Error percentage
Top slow services

[ Grafana Dashboard ]
        |
        +-- CPU Usage Panel
        +-- Memory Usage Panel
        +-- Pod Restart Panel
        +-- API Latency Panel
        +-- Error Rate Panel

Alertmanager

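Alertmanager handles alerts generated by Prometheus. Prometheus evaluates the alert rules; Alertmanager groups, deduplicates, and routes the resulting alerts to the right receivers.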

It can send notifications to:

Email
Slack
Microsoft Teams
PagerDuty
Webhook endpoints
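
Routing is defined in the Alertmanager configuration file. A minimal sketch, assuming alerts carry a severity label and that critical alerts should page the on-call engineer while everything else goes to Slack. The receiver names, channel, webhook URL, and routing key are placeholders.

```yaml
route:
  receiver: team-slack              # default receiver for all alerts
  group_by: [alertname, namespace]  # send related alerts as one notification
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Only critical alerts page the on-call engineer
    - matchers:
        - severity="critical"
      receiver: oncall-pagerduty

receivers:
  - name: team-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/ME   # placeholder
        channel: "#alerts"                                      # placeholder
        send_resolved: true

  - name: oncall-pagerduty
    pagerduty_configs:
      - routing_key: REPLACE_ME                                 # placeholder
```

Splitting receivers by severity is one way to ensure that only actionable alerts interrupt people.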

Alerting Flow

Prometheus Detects Problem
          |
          v
Alert Rule Triggers
          |
          v
Alertmanager Receives Alert
          |
          v
Notification Sent to Team
          |
          v
Issue Investigated

Real-World Banking Alert Example

In a banking system, alerts may be created for:

Payment API error rate greater than 2%
Transaction latency greater than 1 second
Database CPU above 85%
Pod restart count increasing
Disk usage above 80%

These alerts help teams respond before customers face major issues.
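
Several of these thresholds can be written as Prometheus alerting rules. The sketch below assumes the payment service is scraped under job="payment" and exposes Spring Boot style HTTP metrics, and that kube-state-metrics and node exporter are installed. The alert names, durations, and the restart threshold are illustrative. Database CPU is left out because the metric name depends on the database exporter in use.

```yaml
groups:
  - name: banking-alerts            # hypothetical group name
    rules:
      - alert: PaymentApiHighErrorRate
        expr: |
          sum(rate(http_server_requests_seconds_count{job="payment", status=~"5.."}[5m]))
            /
          sum(rate(http_server_requests_seconds_count{job="payment"}[5m]))
            > 0.02
        for: 5m                     # must hold for 5 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: Payment API 5xx ratio has been above 2% for 5 minutes

      - alert: TransactionLatencyHigh
        # Mean request duration in seconds; a percentile would need histogram buckets
        expr: |
          sum(rate(http_server_requests_seconds_sum{job="payment"}[5m]))
            /
          sum(rate(http_server_requests_seconds_count{job="payment"}[5m]))
            > 1
        for: 5m
        labels:
          severity: warning

      - alert: PodRestartingFrequently
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 2
        labels:
          severity: warning

      - alert: NodeDiskUsageHigh
        expr: |
          (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
             / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 0.80
        for: 10m
        labels:
          severity: warning
```

When Prometheus Operator is used, the same groups would go under spec in a PrometheusRule resource instead of a standalone rule file.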

Loki for Kubernetes Logging

Prometheus is mainly for metrics. Logs require a logging system.

Loki is commonly used with Grafana for centralized logging.

A typical logging stack includes:

Promtail: Collects logs from nodes and Pods
Loki: Stores logs
Grafana: Searches and visualizes logs
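
Promtail usually runs as a DaemonSet so that one instance reads the container log files on each node. The following configuration is a sketch, assuming Loki is reachable at http://loki:3100, the node uses a CRI runtime such as containerd, and /var/log/pods is mounted into the Promtail Pod. The job name and positions path are illustrative.

```yaml
server:
  http_listen_port: 9080

positions:
  filename: /run/promtail/positions.yaml   # remembers how far each file was read

clients:
  - url: http://loki:3100/loki/api/v1/push # assumed Loki Service name and port

scrape_configs:
  - job_name: kubernetes-pods              # hypothetical job name
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      - cri: {}                            # parse the CRI log line format
    relabel_configs:
      # Attach Kubernetes metadata as Loki labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        target_label: container
      # Tell Promtail which log files on the node belong to this container
      - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
        separator: /
        target_label: __path__
        replacement: /var/log/pods/*$1/*.log
```

Keeping the label set small (namespace, pod, container) is deliberate, since Loki indexes labels rather than the log text.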

Kubernetes Logging Flow

Application Pods
      |
      v
Container Logs
      |
      v
Promtail Collects Logs
      |
      v
Loki Stores Logs
      |
      v
Grafana Searches Logs

Real-World Log Debugging Example

Suppose users report payment failures.

Metrics may show:

Payment API error rate increased

Logs may show:

Payment gateway timeout after 30000ms

Together, metrics and logs help identify the root cause faster.
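
A log pattern like this can also drive an alert. Loki's ruler accepts Prometheus-style rule groups whose expressions are written in LogQL. The sketch below assumes the ruler is enabled and that logs carry namespace and container labels; the label values, the threshold of 10 matches, and the names are hypothetical.

```yaml
groups:
  - name: payment-log-alerts        # hypothetical group name
    rules:
      - alert: PaymentGatewayTimeouts
        # Count log lines containing the timeout message over the last 5 minutes
        expr: |
          sum(count_over_time({namespace="payments", container="payment"} |= "Payment gateway timeout" [5m])) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Payment service is logging repeated gateway timeouts
```

The same selector and filter, without the count and threshold, can be pasted into Grafana's Explore view to read the matching log lines.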

Monitoring vs Logging

Monitoring	Logging
Shows system health	Shows detailed events
Uses metrics	Uses log messages
Good for alerts	Good for investigation
Example: CPU 90%	Example: NullPointerException

Prometheus Operator and ServiceMonitor

In production Kubernetes environments, Prometheus is often installed using Prometheus Operator.

The Operator simplifies:

Prometheus setup
Alertmanager setup
Service discovery
ServiceMonitor configuration
Rule management

What is ServiceMonitor?

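A ServiceMonitor tells Prometheus Operator which Services should be scraped for metrics. It selects Services by label, and each endpoint refers to a named port on the selected Service. The example below selects Services labeled app: payment and scrapes their port named metrics.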

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor

metadata:
  name: payment-monitor

spec:
  selector:
    matchLabels:
      app: payment

  endpoints:
  - port: metrics
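
For this ServiceMonitor to find anything, a Service with a matching label and a port named metrics must exist. A minimal sketch of such a Service, assuming the payment Pods are labeled app: payment and serve metrics on port 8080 (both assumptions):

```yaml
apiVersion: v1
kind: Service

metadata:
  name: payment
  labels:
    app: payment              # matched by the ServiceMonitor selector

spec:
  selector:
    app: payment              # assumed Pod label
  ports:
  - name: metrics             # must equal the port name in the ServiceMonitor endpoint
    port: 8080                # assumed metrics port
    targetPort: 8080
```

The endpoint scrapes /metrics by default; a different path can be set with the path field on the ServiceMonitor endpoint.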

Spring Boot Metrics Example

For Spring Boot applications, Actuator and Micrometer are commonly used.

The application exposes metrics at:

/actuator/prometheus

Prometheus scrapes this endpoint and stores application metrics.
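
The endpoint is not exposed over HTTP by default. A minimal application.yml sketch, assuming the spring-boot-starter-actuator and micrometer-registry-prometheus dependencies are already on the classpath:

```yaml
management:
  endpoints:
    web:
      exposure:
        # Expose only what is needed; "prometheus" enables /actuator/prometheus
        include: health,prometheus
```

Because the path differs from the default /metrics, a ServiceMonitor endpoint for this application would also need path: /actuator/prometheus.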

Useful PromQL Examples

CPU Usage

rate(container_cpu_usage_seconds_total[5m])

Memory Usage

container_memory_usage_bytes

Pod Restart Count

kube_pod_container_status_restarts_total

HTTP Error Rate

rate(http_server_requests_seconds_count{status=~"5.."}[5m])

Real-World Production Incident Example

Suppose an order service becomes slow.

Investigation Flow

User Reports Slow Checkout
        |
        v
Check Grafana Latency Panel
        |
        v
Prometheus Shows High Response Time
        |
        v
Check Loki Logs
        |
        v
Database Timeout Found
        |
        v
Scale Database / Fix Query

Common Mistakes

1. Monitoring Only CPU and Memory

CPU and memory are important, but not enough. Monitor latency, errors, traffic, and saturation.

2. No Alert Threshold Planning

Thresholds that are too sensitive create alert fatigue. Thresholds that are too loose miss real incidents.

3. Too Many Grafana Panels

Dashboards should be useful, not overloaded.

4. Not Collecting Application Metrics

Infrastructure metrics alone cannot explain business failures.

5. Ignoring Logs

Metrics show symptoms. Logs often show the reason.

Production Troubleshooting Commands

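# List the monitoring components and check their status
kubectl get pods -n monitoring

# Read logs from the Prometheus and Grafana Pods
# (prometheus-pod and grafana-pod are placeholders for the real Pod names)
kubectl logs prometheus-pod -n monitoring

kubectl logs grafana-pod -n monitoring

# Check Service names and ports
kubectl get svc -n monitoring

# Show current CPU and memory usage (requires Metrics Server)
kubectl top pods

kubectl top nodes

# Show events, restart count, and last termination reason (pod-name is a placeholder)
kubectl describe pod pod-name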

Monitoring Best Practices

Monitor golden signals: latency, traffic, errors, saturation
Use meaningful labels
Create service-level dashboards
Configure alert thresholds carefully
Use logs with metrics for debugging
Track business metrics such as orders and payments
Monitor Pod restarts and OOMKilled events
Use Alertmanager for notifications
Review dashboards regularly

Interview Questions

Q1: What is Prometheus?

Prometheus is an open-source monitoring system that collects and stores time-series metrics.

Q2: What is Grafana?

Grafana is a visualization tool used to create dashboards from metrics and logs.

Q3: What is Alertmanager?

Alertmanager receives alerts from Prometheus and sends notifications to teams.

Q4: What is PromQL?

PromQL is the query language used to analyze Prometheus metrics.

Q5: What is Loki?

Loki is a log aggregation system commonly used with Grafana.

Interview Trap Questions

Does Prometheus collect logs?

No. Prometheus collects metrics. Loki or similar tools collect logs.

Can Grafana store metrics?

No. Grafana queries and visualizes data from data sources; it does not store the metrics itself. Prometheus stores them.

Is Metrics Server enough for production monitoring?

No. Metrics Server only provides current CPU and memory usage for kubectl top and autoscaling. It keeps no history and offers no alerting or application metrics. Prometheus provides deeper monitoring.

Should every alert wake up the team?

No. Only actionable alerts should trigger urgent notifications.

Summary

Prometheus, Grafana, Loki, and Alertmanager provide a strong observability stack for Kubernetes.

Prometheus collects metrics, Grafana visualizes metrics and logs, Loki stores logs, and Alertmanager sends notifications when issues occur.

In real production systems, monitoring and logging help teams detect issues early, reduce downtime, troubleshoot faster, and improve application reliability.

For banking, e-commerce, healthcare, SaaS, and cloud-native platforms, observability is not optional. It is a core requirement for running reliable Kubernetes applications.