Monitoring and Logging in Kubernetes with Prometheus, Grafana, Loki, and Alertmanager
Running applications in Kubernetes is not only about deploying Pods, Services, and Ingress resources. In production, teams must continuously understand what is happening inside the cluster. They need to know whether applications are healthy, whether users are facing errors, whether CPU or memory usage is increasing, and whether any service is close to failure.
This is where monitoring and logging become extremely important. Monitoring helps teams measure system health using metrics. Logging helps teams investigate what happened inside applications and infrastructure. Together, they provide observability.
What is Observability?
Observability means understanding the internal state of a system by looking at its external outputs.
In Kubernetes, observability usually includes:
- Metrics: Numerical measurements like CPU usage, memory usage, request count, and error rate
- Logs: Text records generated by applications and containers
- Traces: The path a single request takes across multiple microservices
- Alerts: Notifications when something crosses a danger threshold
Why Are Monitoring and Logging Needed?
Without monitoring and logging, production issues become guesswork.
For example, if users say the website is slow, teams must quickly answer:
- Which service is slow?
- Is CPU usage high?
- Is memory full?
- Are Pods restarting?
- Is the database slow?
- Are APIs returning errors?
- When did the issue start?
Prometheus, Grafana, Loki, and Alertmanager help answer these questions quickly.
Real-Time E-Commerce Example
Suppose an e-commerce platform runs a festival sale.
During peak traffic:
- Frontend traffic increases
- Checkout service gets heavy load
- Payment service latency increases
- Database CPU usage rises
- Some Pods start restarting
With proper monitoring:
- Prometheus detects high CPU and error rate
- Grafana dashboards show which service is affected
- Loki logs reveal payment timeout errors
- Alertmanager notifies the DevOps team
[ User Complaints ]
|
v
[ Grafana Dashboard ]
|
v
[ Prometheus Metrics ]
|
v
[ Loki Logs ]
|
v
Root Cause Found Quickly
Prometheus: Metrics Collection
Prometheus is an open-source monitoring system widely used in Kubernetes. It collects metrics from applications, Kubernetes components, nodes, and exporters.
Prometheus stores metrics as time-series data. Each sample in a series contains:
- Metric name
- Timestamp
- Value
- Labels
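As a rough illustration of this data model (not how Prometheus actually stores data on disk), a sample can be sketched in Python as a metric name, a label set, a timestamp, and a value. The names `Sample`, `make_sample`, and `series_key` are hypothetical and exist only for this sketch. The point it shows is that the metric name plus the labels identify one time series.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One data point of a time series (illustrative model only)."""
    name: str         # metric name, e.g. "http_requests_total"
    labels: tuple     # sorted (key, value) pairs
    timestamp: float  # Unix time in seconds
    value: float


def make_sample(name, labels, timestamp, value):
    # Sort labels so the same label set always produces the same key.
    return Sample(name, tuple(sorted(labels.items())), timestamp, value)


def series_key(sample):
    # A time series is identified by its metric name plus its label set.
    return (sample.name, sample.labels)


a = make_sample("http_requests_total", {"service": "checkout", "status": "200"}, 1000.0, 41.0)
b = make_sample("http_requests_total", {"status": "200", "service": "checkout"}, 1015.0, 57.0)
c = make_sample("http_requests_total", {"service": "checkout", "status": "500"}, 1015.0, 3.0)

same_series = series_key(a) == series_key(b)       # same labels, later timestamp
different_series = series_key(a) != series_key(c)  # one label value differs
```

Because every distinct label combination creates a separate series, labels with many possible values (user IDs, request IDs) are usually a poor fit for metrics.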
Prometheus Pull-Based Model
Prometheus usually uses a pull model. That means Prometheus scrapes metrics from targets instead of waiting for applications to push metrics.
[ Application /metrics Endpoint ]
^
|
[ Prometheus Scrapes Metrics ]
Most applications expose metrics through an endpoint such as:
/metrics
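To make the scrape step more concrete, the sketch below parses a small sample of the text a `/metrics` endpoint might return. It handles only a simplified subset of the Prometheus text format (no timestamps, no escaped quotes in label values). The metric names in `EXAMPLE` and the function name `parse_metrics` are assumptions made for illustration.

```python
import re

# Matches lines such as: http_requests_total{method="GET",status="200"} 1027
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Parse a simplified subset of the Prometheus text format."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = LINE_RE.match(line)
        if match is None:  # ignore anything this sketch does not understand
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


EXAMPLE = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
http_requests_total{method="POST",status="500"} 3
process_resident_memory_bytes 52428800
"""

parsed = parse_metrics(EXAMPLE)
```

In a real scrape, Prometheus fetches this text over HTTP at a fixed interval and attaches its own timestamp and target labels to each sample.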
Prometheus Workflow
Kubernetes Pods
|
v
Expose Metrics Endpoint
|
v
Prometheus Scrapes Metrics
|
v
Metrics Stored in Time-Series DB
|
v
PromQL Queries Analyze Data
|
v
Grafana Visualizes Results
Prometheus Deployment Example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus
          ports:
            - containerPort: 9090
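This Deployment runs a single Prometheus replica whose container listens on port 9090. A Service is still needed to reach it from other Pods or from outside the cluster, and a scrape configuration (typically mounted from a ConfigMap) is needed before it collects anything useful from your applications.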
What Metrics Should Be Monitored?
In production Kubernetes environments, teams usually monitor:
- CPU usage
- Memory usage
- Pod restart count
- Container OOMKilled events
- Node disk usage
- Network traffic
- API latency
- Error rate
- Request throughput
- Database connections
- Queue length
- Ingress traffic
Golden Signals of Monitoring
A useful monitoring strategy often focuses on four golden signals:
| Signal | Meaning | Example |
|---|---|---|
| Latency | How long requests take | Payment API takes 3 seconds |
| Traffic | How many requests the system receives | 10,000 requests per minute |
| Errors | How many requests fail | 5xx errors increase |
| Saturation | How full resources are | CPU at 90% |
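The four signals can be computed from raw request data in a few lines. The sketch below is only an illustration under simple assumptions: requests arrive as `(latency_seconds, http_status)` tuples, errors are HTTP 5xx responses, and saturation is CPU used divided by the CPU limit. The function name `golden_signals` and the sample numbers are hypothetical.

```python
def golden_signals(requests, window_seconds, cpu_used_cores, cpu_limit_cores):
    """Summarize the four golden signals for one time window.

    `requests` is a list of (latency_seconds, http_status) tuples.
    """
    total = len(requests)
    latencies = sorted(latency for latency, _ in requests)
    errors = sum(1 for _, status in requests if status >= 500)

    # Nearest-rank 95th percentile: rank = ceil(0.95 * total), 1-based.
    if total:
        rank = (95 * total + 99) // 100
        p95_latency = latencies[rank - 1]
    else:
        p95_latency = 0.0

    return {
        "latency_p95_seconds": p95_latency,
        "traffic_rps": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "saturation": cpu_used_cores / cpu_limit_cores,
    }


sample_requests = [
    (0.12, 200), (0.15, 200), (0.11, 200), (0.30, 200), (3.00, 503),
    (0.14, 200), (0.13, 200), (0.20, 200), (0.18, 500), (0.16, 200),
]
signals = golden_signals(sample_requests, window_seconds=60,
                         cpu_used_cores=1.8, cpu_limit_cores=2.0)
```

In practice these values come from PromQL queries over counters and histograms rather than from raw request lists, but the definitions are the same.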
Grafana: Visualization and Dashboards
Grafana is used to visualize metrics through dashboards. Prometheus stores metrics, but Grafana makes them easy to understand using graphs, panels, tables, and alerts.
Grafana can connect to:
- Prometheus
- Loki
- Elasticsearch
- InfluxDB
- Cloud monitoring systems
Grafana Deployment Example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana
          ports:
            - containerPort: 3000
Grafana Dashboard Example
A Kubernetes dashboard may show:
- Cluster CPU usage
- Cluster memory usage
- Pod restart count
- Node health
- HTTP request rate
- Error percentage
- Top slow services
[ Grafana Dashboard ]
|
+-- CPU Usage Panel
+-- Memory Usage Panel
+-- Pod Restart Panel
+-- API Latency Panel
+-- Error Rate Panel
Alertmanager
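Alertmanager handles alerts generated by Prometheus alerting rules. It deduplicates and groups related alerts, applies silences, and routes each alert to the right receiver.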
It can send notifications to:
- Slack
- Microsoft Teams
- PagerDuty
- Webhook endpoints
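Grouping and routing are easier to picture with a toy model. The sketch below is not Alertmanager's implementation: it maps a `severity` label to a receiver through a flat, hypothetical `ROUTES` table and then groups alerts by `alertname` and `service`, loosely mirroring the `group_by` setting in an Alertmanager route. The receiver names and alert labels are invented for the example.

```python
from collections import defaultdict

# Hypothetical routing table: severity label -> receiver name.
ROUTES = {"critical": "pagerduty", "warning": "slack"}
DEFAULT_RECEIVER = "webhook"


def group_and_route(alerts, group_by=("alertname", "service")):
    """Return one (receiver, group_key, alert_count) tuple per notification."""
    groups = defaultdict(list)
    for alert in alerts:
        receiver = ROUTES.get(alert.get("severity"), DEFAULT_RECEIVER)
        group_key = tuple(alert.get(label, "") for label in group_by)
        groups[(receiver, group_key)].append(alert)
    # One notification per group instead of one per alert.
    return [
        (receiver, group_key, len(members))
        for (receiver, group_key), members in sorted(groups.items())
    ]


incoming = [
    {"alertname": "HighErrorRate", "service": "payment", "severity": "critical", "pod": "payment-1"},
    {"alertname": "HighErrorRate", "service": "payment", "severity": "critical", "pod": "payment-2"},
    {"alertname": "HighMemory", "service": "checkout", "severity": "warning", "pod": "checkout-1"},
]
notifications = group_and_route(incoming)
```

Here two Pod-level alerts for the same service collapse into a single page, which is the main reason grouping reduces noise during an incident.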
Alerting Flow
Prometheus Detects Problem
|
v
Alert Rule Triggers
|
v
Alertmanager Receives Alert
|
v
Notification Sent to Team
|
v
Issue Investigated
Real-Time Banking Alert Example
In a banking system, alerts may be created for:
- Payment API error rate greater than 2%
- Transaction latency greater than 1 second
- Database CPU above 85%
- Pod restart count increasing
- Disk usage above 80%
These alerts help teams respond before customers face major issues.
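As a minimal sketch of how such thresholds might be evaluated, the code below fires an alert only when a metric stays above its threshold for several consecutive checks, which loosely mirrors the `for` duration in Prometheus alerting rules. The metric names, rule names, and the `firing_alerts` helper are hypothetical; the thresholds come from the list above, and the restart-count rule is left out because it needs a trend rather than a fixed threshold.

```python
# (alert name, metric name, threshold); metric names are hypothetical.
RULES = [
    ("PaymentErrorRateHigh", "payment_error_ratio", 0.02),
    ("TransactionLatencyHigh", "transaction_latency_seconds", 1.0),
    ("DatabaseCpuHigh", "database_cpu_ratio", 0.85),
    ("DiskUsageHigh", "disk_usage_ratio", 0.80),
]


def firing_alerts(history, rules=RULES, min_consecutive=3):
    """Return alerts whose metric exceeded its threshold in each of the
    last `min_consecutive` evaluations (values are ordered oldest first)."""
    firing = []
    for alert_name, metric, threshold in rules:
        recent = history.get(metric, [])[-min_consecutive:]
        if len(recent) == min_consecutive and all(v > threshold for v in recent):
            firing.append(alert_name)
    return firing


history = {
    "payment_error_ratio": [0.01, 0.03, 0.04, 0.05],       # sustained breach
    "transaction_latency_seconds": [0.4, 1.6, 0.5, 0.6],   # single spike only
    "database_cpu_ratio": [0.70, 0.88, 0.90, 0.91],        # sustained breach
    "disk_usage_ratio": [0.55, 0.56, 0.56, 0.57],          # healthy
}
alerts = firing_alerts(history)
```

Requiring a sustained breach keeps a single latency spike from paging anyone, at the cost of a short delay before a real incident is reported.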
Loki for Kubernetes Logging
Prometheus is mainly for metrics. Logs require a logging system.
Loki is commonly used with Grafana for centralized logging.
A typical logging stack includes:
- Promtail: Collects logs from nodes and Pods
- Loki: Stores logs
- Grafana: Searches and visualizes logs
Kubernetes Logging Flow
Application Pods
|
v
Container Logs
|
v
Promtail Collects Logs
|
v
Loki Stores Logs
|
v
Grafana Searches Logs
Real-Time Log Debugging Example
Suppose users report payment failures.
Metrics may show:
- Payment API error rate increased
Logs may show:
Payment gateway timeout after 30000ms
Together, metrics and logs help identify the root cause faster.
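The log side of this investigation can be sketched as a label match followed by a text filter, which is roughly what a LogQL query such as `{app="payment"} |= "timeout"` expresses. The code below is a simplified in-memory model, not Loki's query engine; the `query_logs` helper, the label names, and the log lines other than the timeout message are assumptions.

```python
def query_logs(entries, selector, contains):
    """Return log lines whose labels match `selector` exactly and whose
    text contains the substring `contains`."""
    return [
        line
        for labels, line in entries
        if all(labels.get(key) == value for key, value in selector.items())
        and contains in line
    ]


log_entries = [
    ({"app": "payment", "namespace": "shop"}, "Payment gateway timeout after 30000ms"),
    ({"app": "payment", "namespace": "shop"}, "Payment accepted for order 1234"),
    ({"app": "checkout", "namespace": "shop"}, "Upstream timeout calling payment"),
]

timeouts = query_logs(log_entries, {"app": "payment"}, "timeout")
```

Selecting by labels first and scanning text second is the intended usage pattern: labels narrow the search to a few streams, and the line filter finds the relevant messages inside them.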
Monitoring vs Logging
| Monitoring | Logging |
|---|---|
| Shows system health | Shows detailed events |
| Uses metrics | Uses log messages |
| Good for alerts | Good for investigation |
| Example: CPU 90% | Example: NullPointerException |
Prometheus Operator and ServiceMonitor
In production Kubernetes environments, Prometheus is often installed using Prometheus Operator.
The Operator simplifies:
- Prometheus setup
- Alertmanager setup
- Service discovery
- ServiceMonitor configuration
- Rule management
What is ServiceMonitor?
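A ServiceMonitor tells Prometheus Operator which Services should be scraped for metrics. The selector matches Service labels, and each endpoint refers to a named port on the matching Services.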
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payment-monitor
spec:
  selector:
    matchLabels:
      app: payment
  endpoints:
    - port: metrics
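The selection logic in this manifest can be modeled in a few lines. The sketch below is an approximation for illustration, not the Operator's code: a Service is picked when its labels contain every `matchLabels` pair and it has a port with the requested name. The Service names, labels, and the `select_targets` helper are hypothetical.

```python
def select_targets(services, match_labels, port_name):
    """Return names of Services that carry all `match_labels` and expose
    a port called `port_name`."""
    selected = []
    for name, spec in services.items():
        labels_match = all(
            spec["labels"].get(key) == value for key, value in match_labels.items()
        )
        if labels_match and port_name in spec["ports"]:
            selected.append(name)
    return sorted(selected)


services = {
    "payment-api": {"labels": {"app": "payment", "tier": "backend"}, "ports": ["http", "metrics"]},
    "payment-admin": {"labels": {"app": "payment"}, "ports": ["http"]},  # no metrics port
    "checkout-api": {"labels": {"app": "checkout"}, "ports": ["http", "metrics"]},
}

targets = select_targets(services, {"app": "payment"}, "metrics")
```

A common cause of missing targets is the situation shown by `payment-admin`: the labels match, but the Service has no port with the name the ServiceMonitor expects.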
Spring Boot Metrics Example
For Spring Boot applications, Actuator and Micrometer are commonly used.
With the Micrometer Prometheus registry enabled, the application exposes metrics at:
/actuator/prometheus
Prometheus scrapes this endpoint and stores application metrics.
Useful PromQL Examples
CPU Usage
rate(container_cpu_usage_seconds_total[5m])
Memory Usage
container_memory_usage_bytes
Pod Restart Count
kube_pod_container_status_restarts_total
HTTP Error Rate
rate(http_server_requests_seconds_count{status=~"5.."}[5m])
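Several of these queries rely on `rate()`, which turns an ever-increasing counter into a per-second value. The sketch below shows the core idea under simplifying assumptions: it sums the increases between consecutive samples, treats a drop as a counter reset, and divides by the elapsed time. The real PromQL function also extrapolates to the window boundaries, so its results can differ slightly. The `simple_rate` name and the sample values are hypothetical.

```python
def simple_rate(samples):
    """Approximate per-second rate of a counter.

    `samples` is a list of (timestamp_seconds, counter_value), oldest first.
    Returns None when there are not enough samples to compute a rate.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # The counter dropped, so assume it restarted from zero
            # (for example after a Pod restart).
            increase += current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Counter scraped every 15 seconds; the Pod restarts between t=30 and t=45.
counter_samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
requests_per_second = simple_rate(counter_samples)
```

To turn the HTTP query above into an error percentage rather than errors per second, the 5xx rate is usually divided by the rate of all requests over the same window.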
Real-Time Production Incident Example
Suppose an order service becomes slow.
Investigation Flow
User Reports Slow Checkout
|
v
Check Grafana Latency Panel
|
v
Prometheus Shows High Response Time
|
v
Check Loki Logs
|
v
Database Timeout Found
|
v
Scale Database / Fix Query
Common Mistakes
1. Monitoring Only CPU and Memory
CPU and memory are important, but not enough. Monitor latency, errors, traffic, and saturation.
2. No Alert Threshold Planning
Thresholds that are too sensitive create alert fatigue, and thresholds that are too loose let real incidents go unnoticed.
3. Too Many Grafana Panels
Dashboards should be useful, not overloaded.
4. Not Collecting Application Metrics
Infrastructure metrics alone cannot explain business failures.
5. Ignoring Logs
Metrics show symptoms. Logs often show the reason.
Production Troubleshooting Commands
kubectl get pods -n monitoring
kubectl logs <prometheus-pod-name> -n monitoring
kubectl logs <grafana-pod-name> -n monitoring
kubectl get svc -n monitoring
kubectl top pods          # requires Metrics Server
kubectl top nodes         # requires Metrics Server
kubectl describe pod <pod-name>
Monitoring Best Practices
- Monitor golden signals: latency, traffic, errors, saturation
- Use meaningful labels
- Create service-level dashboards
- Configure alert thresholds carefully
- Use logs with metrics for debugging
- Track business metrics such as orders and payments
- Monitor Pod restarts and OOMKilled events
- Use Alertmanager for notifications
- Review dashboards regularly
Interview Questions
Q1: What is Prometheus?
Prometheus is an open-source monitoring system that collects and stores time-series metrics.
Q2: What is Grafana?
Grafana is a visualization tool used to create dashboards from metrics and logs.
Q3: What is Alertmanager?
Alertmanager receives alerts from Prometheus and sends notifications to teams.
Q4: What is PromQL?
PromQL is the query language used to analyze Prometheus metrics.
Q5: What is Loki?
Loki is a log aggregation system commonly used with Grafana.
Interview Trap Questions
Does Prometheus collect logs?
No. Prometheus collects metrics. Loki or similar tools collect logs.
Can Grafana store metrics?
Grafana mainly visualizes data from data sources. Prometheus stores metrics.
Is Metrics Server enough for production monitoring?
No. Metrics Server is mainly for Kubernetes resource metrics and autoscaling. Prometheus provides deeper monitoring.
Should every alert wake up the team?
No. Only actionable alerts should trigger urgent notifications.
Recommended Learning Path
- Kubernetes Pods
- Health Probes
- Requests and Limits
- Horizontal Pod Autoscaler
- Monitoring and Logging
- Prometheus and Grafana
- Loki and Promtail
Summary
Prometheus, Grafana, Loki, and Alertmanager provide a strong observability stack for Kubernetes.
Prometheus collects metrics, Grafana visualizes metrics and logs, Loki stores logs, and Alertmanager sends notifications when issues occur.
In real production systems, monitoring and logging help teams detect issues early, reduce downtime, troubleshoot faster, and improve application reliability.
For banking, e-commerce, healthcare, SaaS, and cloud-native platforms, observability is not optional. It is a core requirement for running reliable Kubernetes applications.