Monitoring and Metrics with Prometheus and Grafana

Modern microservices architectures generate enormous amounts of operational data. In distributed systems running across Kubernetes clusters, cloud infrastructure, message brokers, databases, API gateways, and hundreds of Spring Boot services, understanding system health becomes impossible without centralized monitoring and observability.

Production engineering teams rely heavily on metrics, dashboards, alerting systems, distributed tracing, and real-time analytics to maintain reliability and uptime. Without proper monitoring, organizations cannot detect outages early, troubleshoot performance bottlenecks, identify memory leaks, understand traffic patterns, or maintain service-level objectives.

Prometheus and Grafana are a widely adopted open-source stack for monitoring cloud-native systems. Together, they provide metrics collection, time-series storage, querying, visualization, and alerting for distributed systems at enterprise scale.

This guide explains Prometheus architecture, Grafana dashboards, Spring Boot Micrometer integration, Kubernetes monitoring, alerting strategies, metrics design, production-grade observability patterns, troubleshooting techniques, scalability concerns, and enterprise monitoring best practices.

Table of Contents
What is Monitoring?
What is Prometheus?
What is Grafana?
Why Monitoring is Critical
Monitoring vs Observability
Core Monitoring Pillars
Prometheus Architecture
How Prometheus Works
Understanding Metrics
Types of Prometheus Metrics
Spring Boot and Micrometer
Setting Up Prometheus
Setting Up Grafana
Exposing Spring Boot Metrics
Monitoring Kubernetes
Important Production Metrics
Creating Grafana Dashboards
Alerting with Prometheus
Monitoring Microservices
Monitoring Kafka and Event Systems
Monitoring Databases
Distributed Tracing Integration
Production Alerting Strategies
Common Monitoring Mistakes
Troubleshooting Monitoring Systems
Performance and Scaling
Security Best Practices
Enterprise Monitoring Best Practices
Interview Questions and Answers
Frequently Asked Questions
Summary
Next Learning Recommendations

What You Will Learn

How Prometheus works internally
How Grafana visualizes metrics
How Spring Boot exposes operational metrics
How Micrometer integrates with Prometheus
How to monitor Kubernetes clusters
How to create production dashboards
How alerting systems work
How to monitor Kafka and databases
How enterprise monitoring systems scale
Best practices for observability architecture

What is Monitoring?

Monitoring is the process of continuously collecting, analyzing, and visualizing operational data from systems, infrastructure, applications, and services.

The purpose of monitoring is to understand:

System health
Performance behavior
Error conditions
Resource utilization
Traffic patterns
Service reliability

Simple Definition

Monitoring helps engineering teams understand what is happening inside distributed systems in real time.

What is Prometheus?

Prometheus is an open-source monitoring and alerting system designed for cloud-native environments.

It collects metrics from applications and infrastructure using a pull-based model and stores them in a time-series database.

Prometheus Features

Time-series database
Metrics scraping
Powerful query language
Alerting support
Kubernetes integration
Service discovery
Label-based metrics model

Why Prometheus is Popular

Cloud-native architecture
Excellent Kubernetes support
Powerful metric querying
Scalable monitoring model
Strong open-source ecosystem

What is Grafana?

Grafana is an open-source visualization and dashboard platform for monitoring data.

Grafana connects to Prometheus and visualizes metrics using dashboards, charts, graphs, heatmaps, and operational panels.

Grafana Capabilities

Real-time dashboards
Metrics visualization
Alerting integrations
Custom dashboards
Team collaboration
Operational analytics

Why Monitoring is Critical

In microservices environments, failures happen continuously:

Containers crash
Databases slow down
Kafka lag increases
Memory leaks occur
Network latency spikes
API requests fail
CPU usage becomes unstable

Without monitoring:

Incidents are detected late
Root cause analysis becomes difficult
Performance problems remain hidden
Customer impact increases
Downtime grows significantly

Monitoring vs Observability

Monitoring              | Observability
------------------------+------------------------------
Known issue detection   | Unknown issue analysis
Predefined metrics      | Deep system understanding
Threshold alerts        | Behavior investigation
Dashboards              | Correlated telemetry analysis

Observability typically includes:

Metrics
Logs
Distributed traces

Core Monitoring Pillars

Metrics

Numerical operational data collected over time.

Logs

Detailed event records generated by applications.

Traces

Request flow tracking across distributed services.

Observability Architecture

Applications
     |
     +-------------------+
     |                   |
     v                   v

Metrics              Logs
     |                   |
     v                   v

Prometheus         ELK Stack
     |
     v

Grafana

Prometheus Architecture

                +----------------------+
                | Spring Boot Apps     |
                | /actuator/prometheus |
                +----------------------+
                           |
                           v

                +----------------------+
                | Prometheus Server    |
                +----------------------+
                           |
         +-----------------+-----------------+
         |                                   |
         v                                   v

+------------------+              +------------------+
| Alertmanager     |              | Grafana          |
+------------------+              +------------------+

Core Components

Prometheus Server
Exporters
Alertmanager
Pushgateway
Grafana

How Prometheus Works

Prometheus periodically scrapes metrics endpoints exposed by applications.

Metrics Collection Flow

Spring Boot App
      |
      v

/actuator/prometheus
      |
      v

Prometheus Scraper
      |
      v

Time-Series Database
      |
      v

Grafana Dashboards

Pull-Based Model

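Prometheus pulls metrics from services over HTTP at each scrape interval instead of waiting for applications to push them. Short-lived jobs that cannot be scraped can push their metrics to the Pushgateway, which Prometheus then scrapes.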
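The pull model can be sketched with the Python standard library alone. In this minimal illustration, a toy HTTP server stands in for an application and a `scrape` function stands in for Prometheus; the metric name `demo_requests_total`, the `/metrics` path, and the helper names are assumptions made for the example, not part of any real setup.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class MetricsHandler(BaseHTTPRequestHandler):
    """Toy application endpoint that exposes one counter as plain text."""

    requests_total = 0  # hypothetical application counter

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = f"demo_requests_total {MetricsHandler.requests_total}\n".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet


def scrape(url):
    """The monitoring side initiates the request; the application only answers."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode()


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

MetricsHandler.requests_total = 42
scraped = scrape(f"http://127.0.0.1:{server.server_port}/metrics")

server.shutdown()
server.server_close()
print(scraped)
```

The application never needs to know where the monitoring system lives; it only has to answer when asked, which is one reason the pull model pairs well with service discovery.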

Understanding Metrics

Metrics are numerical measurements collected over time.

Examples

CPU utilization
Memory usage
HTTP request count
Request latency
Error rates
Kafka consumer lag

Metric Structure

http_server_requests_seconds_count{
  method="GET",
  status="200"
}
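To make the name-plus-labels structure concrete, here is a small sketch that splits one sample line, as it might appear in a scrape response, into its parts. The sample value `1027` and the helper name `parse_sample` are assumptions for illustration, and the parser deliberately ignores optional timestamps and comment lines.

```python
import re

# Metric name, optional {label set}, then the sample value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
# One key="value" pair inside the braces (escaped characters are tolerated).
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one sample line into (metric name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


name, labels, value = parse_sample(
    'http_server_requests_seconds_count{method="GET",status="200"} 1027'
)
print(name, labels, value)
```

Every distinct combination of label values identifies a separate time series, which is why label design matters later when cardinality comes up.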

Types of Prometheus Metrics

Counter

Only increases over time, and resets to zero when the process restarts.

Examples:

Total requests
Error count
Processed messages
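Because a counter only grows, dashboards usually show how fast it grows rather than its raw value. The sketch below computes an increase and a per-second rate from raw samples and treats any drop as a restart. The sample data and function names are made up for the example, and the real PromQL `rate()` function additionally extrapolates to the edges of the query window, which this sketch does not attempt.

```python
def increase(samples):
    """Total increase across (timestamp, value) samples, treating a drop as a counter reset."""
    total = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The process restarted and the counter began again from zero,
            # so everything counted since the restart is new.
            total += current
    return total


def per_second_rate(samples):
    """Average per-second growth between the first and last sample."""
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed


# Scraped every 15 seconds; the application restarts between t=30 and t=45.
samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(increase(samples), per_second_rate(samples))
```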

Gauge

Can increase or decrease.

Examples:

Memory usage
CPU utilization
Active connections

Histogram

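Counts observations in configurable buckets so that value distributions, and quantiles estimated from them, can be computed at query time.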

Examples:

Request latency
Response size
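A short sketch may help show how bucketed data turns into a latency percentile. It builds cumulative buckets from raw observations and then estimates a quantile by linear interpolation inside the matching bucket, which is the general idea behind PromQL's `histogram_quantile`. The bucket bounds, the latency values, and the function names are assumptions chosen for the example.

```python
import math


def cumulative_buckets(values, upper_bounds):
    """Count observations less than or equal to each bound, plus a final +Inf bucket."""
    bounds = list(upper_bounds) + [math.inf]
    return [(le, sum(1 for v in values if v <= le)) for le in bounds]


def estimate_quantile(q, buckets):
    """Estimate the q-quantile by interpolating inside the bucket that contains it."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for le, count in buckets:
        if count >= rank and count > lower_count:
            if math.isinf(le):
                # Nothing is known above the highest finite bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (le - lower_bound) * fraction
        lower_bound, lower_count = le, count
    return math.nan  # no observations at all


latencies = [0.05, 0.08, 0.12, 0.2, 0.3, 0.4, 0.45, 0.7, 0.9, 1.5]
buckets = cumulative_buckets(latencies, [0.1, 0.5, 1.0])
p50 = estimate_quantile(0.5, buckets)
p95 = estimate_quantile(0.95, buckets)
print(buckets, p50, p95)
```

The estimate is only as precise as the bucket layout: the p95 here is reported as the highest finite bound because the true value lies somewhere in the open-ended last bucket.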

Summary

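Tracks a count and sum of observations and can also compute quantiles (percentiles) inside the application. Unlike histogram buckets, these precomputed quantiles cannot be meaningfully aggregated across instances.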

Spring Boot and Micrometer

Micrometer is the metrics abstraction layer used by Spring Boot.

Micrometer integrates with:

Prometheus
Datadog
InfluxDB
CloudWatch
New Relic

Why Micrometer Matters

It standardizes metrics collection across monitoring systems.

Setting Up Prometheus

Docker Compose Example

version: '3'

services:

  prometheus:
    image: prom/prometheus

    ports:
      - "9090:9090"

    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

Prometheus Configuration

global:
  scrape_interval: 15s

scrape_configs:

  - job_name: 'spring-boot-app'

    metrics_path: '/actuator/prometheus'

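    static_configs:
      # When Prometheus runs in a container, localhost is the container itself;
      # point this at an address where the application is reachable from Prometheus.
      - targets: ['localhost:8080']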

Setting Up Grafana

Docker Compose

# Add under the services: key of the Compose file shown earlier
  grafana:
    image: grafana/grafana

    ports:
      - "3000:3000"

Grafana Workflow

Add Prometheus datasource
Create dashboards
Build visualizations
Configure alerts

Exposing Spring Boot Metrics

Maven Dependencies

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

application.yml

management:

  endpoints:
    web:
      exposure:
        include: health,info,prometheus

  endpoint:
    prometheus:
      enabled: true

  metrics:
    tags:
      application: order-service

Prometheus Endpoint

http://localhost:8080/actuator/prometheus

Monitoring Kubernetes

Kubernetes environments require cluster-level monitoring.

Important Kubernetes Metrics

Pod CPU usage
Memory utilization
Node health
Pod restart count
Network throughput
Disk utilization

Kubernetes Monitoring Stack

Kubernetes Cluster
        |
        v

Prometheus Operator
        |
        v

Prometheus Server
        |
        v

Grafana Dashboards

Important Production Metrics

Application Metrics

Request count
Latency
Error rate
Thread pool usage
Database connections

Infrastructure Metrics

CPU utilization
Memory usage
Disk usage
Network bandwidth

Business Metrics

Orders processed
Revenue generated
Failed transactions
User signups

Creating Grafana Dashboards

Recommended Dashboard Panels

Request throughput
Error rate
Response latency
JVM memory usage
CPU utilization
Kafka consumer lag

Golden Signals

Google's Site Reliability Engineering book recommends monitoring four golden signals:

Latency
Traffic
Errors
Saturation
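As a rough illustration of what each signal measures, the sketch below derives all four from a window of request records. The record fields, the worker-pool figures, and the choice of a nearest-rank 95th percentile are assumptions for the example; in practice these values would come from PromQL queries over the metrics described earlier.

```python
def golden_signals(requests, window_seconds, busy_workers, total_workers):
    """Summarize one non-empty window of request records into the four signals."""
    latencies = sorted(r["seconds"] for r in requests)
    errors = sum(1 for r in requests if r["status"] >= 500)
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    p95_index = (95 * len(latencies) + 99) // 100 - 1
    return {
        "latency_p95_seconds": latencies[p95_index],
        "traffic_rps": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "saturation": busy_workers / total_workers,
    }


# Twenty requests in a one-minute window: mostly fast, one failure, one slow outlier.
requests = (
    [{"status": 200, "seconds": 0.1}] * 18
    + [{"status": 500, "seconds": 0.8}, {"status": 200, "seconds": 2.0}]
)
signals = golden_signals(requests, window_seconds=60, busy_workers=45, total_workers=50)
print(signals)
```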

Alerting with Prometheus

Prometheus supports rule-based alerting.

Example Alert Rule

groups:

- name: application-alerts

  rules:

  - alert: HighCPUUsage

    expr: process_cpu_usage > 0.85

    for: 2m

    labels:
      severity: critical

    annotations:
      summary: "High CPU usage detected"

Alertmanager Responsibilities

Deduplication
Grouping
Notification routing
Silencing
Inhibition

Escalation policies are typically configured in the on-call tool that receives the notifications.
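Deduplication and routing can be pictured with a few lines of code. This is a conceptual sketch only, not Alertmanager's implementation: alerts with an identical label set collapse into one, and the first route whose labels all match decides the receiver. The receiver names `pager`, `chat`, and `email` are hypothetical.

```python
def fingerprint(alert):
    """Identify an alert by its full, order-independent label set."""
    return tuple(sorted(alert["labels"].items()))


def deduplicate(alerts):
    """Keep the first alert seen for each distinct label set."""
    unique = {}
    for alert in alerts:
        unique.setdefault(fingerprint(alert), alert)
    return list(unique.values())


def route(alert, routes, default_receiver):
    """Return the receiver of the first route whose labels all match the alert."""
    for match_labels, receiver in routes:
        if all(alert["labels"].get(k) == v for k, v in match_labels.items()):
            return receiver
    return default_receiver


alerts = [
    {"labels": {"alertname": "HighCPUUsage", "severity": "critical", "instance": "order-1"}},
    {"labels": {"alertname": "HighCPUUsage", "severity": "critical", "instance": "order-1"}},
    {"labels": {"alertname": "HighCPUUsage", "severity": "warning", "instance": "order-2"}},
]
routes = [({"severity": "critical"}, "pager"), ({"severity": "warning"}, "chat")]

unique_alerts = deduplicate(alerts)
receivers = [route(a, routes, "email") for a in unique_alerts]
print(len(unique_alerts), receivers)
```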

Monitoring Microservices

Distributed systems require service-level visibility.

Key Monitoring Areas

Inter-service latency
API failures
Circuit breaker states
Database bottlenecks
Thread pool exhaustion

Monitoring Kafka and Event Systems

Critical Kafka Metrics

Consumer lag
Broker availability
Message throughput
Partition imbalance
Replication health
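Consumer lag, the first metric above, is simply the gap between the newest offset in each partition and the offset the consumer group has committed. A minimal sketch with made-up offsets follows; the function name and the numbers are assumptions for illustration.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: messages written but not yet committed by the group."""
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


end_offsets = {0: 1500, 1: 1200, 2: 900}        # latest offset per partition
committed_offsets = {0: 1480, 1: 1200, 2: 650}  # consumer group position

lag = consumer_lag(end_offsets, committed_offsets)
total_lag = sum(lag.values())
print(lag, total_lag)
```

A lag value that stays flat is usually harmless; a value that keeps growing means consumers are falling behind producers, which is the condition worth alerting on.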

Kafka Monitoring Flow

Kafka Cluster
      |
      v

JMX Exporter
      |
      v

Prometheus
      |
      v

Grafana

Monitoring Databases

Important Database Metrics

Slow queries
Connection pool usage
Deadlocks
Replication lag
Cache hit ratio

Production Database Monitoring

Monitoring databases is critical because database bottlenecks affect every downstream service.

Distributed Tracing Integration

Metrics alone cannot fully explain request behavior across distributed services.

Distributed tracing complements monitoring systems.

Tracing Flow

API Gateway
     |
     v

Order Service
     |
     v

Payment Service
     |
     v

Inventory Service

Production Alerting Strategies

Avoid Alert Fatigue

Too many alerts cause teams to ignore notifications.

Alert on Symptoms

Focus on customer-impacting issues.

Use Severity Levels

Critical
Warning
Informational

Examples of Good Alerts

Error rate spikes
Service unavailable
Memory exhaustion
Kafka lag growth

Common Monitoring Mistakes

Monitoring too many useless metrics
Ignoring business metrics
No alert prioritization
Missing latency monitoring
Not monitoring dependencies
Overloading Prometheus with high-cardinality labels

High Cardinality Problem

Each unique combination of label values creates a separate time series, so labels with unbounded values drive up Prometheus memory usage.

Examples of labels to avoid:

user_id
session_id
transaction_id
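A quick calculation shows why such labels are a problem. The worst-case number of time series for one metric is the product of the number of distinct values of each label. The label counts below are invented for the example, and real systems usually see fewer series than this upper bound because not every combination occurs.

```python
from math import prod


def max_series(label_cardinalities):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


bounded = {"method": 5, "status": 6, "uri": 40}
with_user_id = dict(bounded, user_id=100_000)

print(max_series(bounded))       # bounded labels only
print(max_series(with_user_id))  # same metric after adding a per-user label
```

Adding a single per-user label multiplies the bound by the number of users, which is why identifiers like these belong in logs or traces rather than in metric labels.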

Troubleshooting Monitoring Systems

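Missing Metrics

Verify the scrape configuration and check the target's status and last scrape error on the Prometheus Targets page.

Dashboard Not Updating

Check that the Grafana data source can reach Prometheus, then confirm the dashboard's time range and refresh interval.

High Prometheus Memory Usage

Reduce metric cardinality by dropping or bounding labels with many unique values.

Slow Queries

Optimize PromQL queries by narrowing label selectors and time ranges, and precompute expensive expressions with recording rules.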

Common Commands

kubectl get pods

kubectl logs prometheus-pod

kubectl port-forward svc/prometheus 9090:9090

Performance and Scaling

Scaling Prometheus

Large enterprises often use:

Federation
Remote storage
Sharding
Long-term storage systems
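Sharding means each Prometheus server scrapes only a slice of the targets. One common way to split them is a stable hash of the target address modulo the number of shards, sketched below. The target names and shard count are hypothetical, and the hashing shown here is an illustration of the idea rather than a claim about the exact function Prometheus uses.

```python
import hashlib


def shard_for(target, shard_count):
    """Map a target address to a shard using a hash that is stable across runs."""
    digest = hashlib.md5(target.encode()).digest()
    return int.from_bytes(digest[-8:], "big") % shard_count


def targets_for_shard(targets, shard, shard_count):
    """The subset of targets that one Prometheus shard would scrape."""
    return [t for t in targets if shard_for(t, shard_count) == shard]


targets = [f"order-service-{i}:8080" for i in range(12)]
assignments = {t: shard_for(t, 3) for t in targets}

for shard in range(3):
    print(shard, targets_for_shard(targets, shard, 3))
```

Because the assignment depends only on the target address, every shard can compute its own slice independently, with no coordination between servers.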

Performance Optimization Tips

Reduce unnecessary metrics
Optimize scrape intervals
Limit label cardinality
Archive historical data

Security Best Practices

Protect Monitoring Endpoints

Do not expose metrics publicly.

Secure Grafana Access

Use authentication and role-based access control.

Encrypt Communication

Use HTTPS between monitoring components.

Restrict Sensitive Metrics

Avoid exposing confidential operational data.

Enterprise Monitoring Best Practices

Standardize dashboards
Use centralized observability platforms
Correlate logs, metrics, and traces
Monitor business KPIs
Implement SLA monitoring
Automate incident response workflows
Continuously improve alert quality

Interview Questions and Answers

What is Prometheus?

Prometheus is an open-source monitoring and alerting platform for collecting and storing time-series metrics.

What is Grafana?

Grafana is a visualization platform used to create dashboards and monitor operational data.

How does Prometheus collect metrics?

Prometheus uses a pull-based model to scrape metrics endpoints from applications.

What is Micrometer?

Micrometer is the metrics abstraction library used by Spring Boot.

What are Prometheus labels?

Labels are key-value pairs used to categorize metrics.

Why is high cardinality dangerous?

High cardinality increases memory consumption and degrades Prometheus performance.

Frequently Asked Questions

Can Prometheus monitor Kubernetes?

Yes. Prometheus has deep Kubernetes integration and is widely used for cluster monitoring.

Does Grafana store metrics?

No. Grafana visualizes data from external data sources like Prometheus.

What is PromQL?

PromQL is the query language used by Prometheus for querying metrics.

Can Spring Boot expose metrics automatically?

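Yes. With Spring Boot Actuator and the Micrometer Prometheus registry on the classpath and the prometheus endpoint exposed, Spring Boot publishes JVM, HTTP, and other built-in metrics without custom code.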

What is Alertmanager?

Alertmanager manages notifications, deduplication, and alert routing for Prometheus alerts.

Why are metrics important in microservices?

Metrics provide visibility into performance, failures, scaling, and system health.

Summary

Monitoring and observability are foundational requirements for operating modern microservices architectures reliably.

Prometheus and Grafana together provide:

Metrics collection
Operational dashboards
Real-time analytics
Alerting systems
Infrastructure monitoring
Application observability

In this guide, you learned:

How Prometheus works internally
How Grafana visualizes operational data
How Spring Boot integrates with Micrometer
How Kubernetes monitoring operates
How alerting systems are designed
How enterprise monitoring scales
Production monitoring best practices

Strong monitoring capabilities directly improve uptime, reliability, scalability, incident response speed, and engineering productivity in distributed systems.
