Published: 2026-06-01 • Updated: 2026-06-20

Monitoring and Observability for AI Java Apps with Prometheus and Grafana

Building a production-grade Artificial Intelligence (AI) application in Java using Spring Boot is only half the battle. Once your application is deployed to production, you must ensure it runs efficiently, reliably, and within budget. Unlike traditional CRUD (Create, Read, Update, Delete) applications, AI-powered applications introduce unique runtime challenges. These include heavy CPU/GPU utilization, high latency during model inference, external API dependencies, and variable token-based costs.

To keep these systems healthy, you need a robust observability stack. In this guide, we will explore how to implement monitoring and observability for AI-powered Java applications using Prometheus and Grafana. We will cover the core metrics you must track, how to instrument your Spring Boot application using Micrometer, and how to build dashboards that give you deep insights into your AI workloads.

If you have not yet provisioned your target cloud execution environment or established the core access control layers necessary to run these endpoints, read our deployment playbook: Provisioning AWS AI Infrastructure with Terraform.

The Core Concepts of Observability in AI Apps

Observability is the measure of how well you can understand the internal states of a system based on its external outputs. In the context of Java AI applications, observability is divided into three main pillars: Metrics, Logs, and Traces. This guide focuses on Metrics, which are numeric values measured over time, scraped by Prometheus, and visualized by Grafana.

When monitoring traditional Java applications, you typically focus on standard JVM metrics such as garbage collection pauses, heap memory usage, thread counts, and HTTP request latencies. However, for AI-driven Java applications, you must monitor an additional layer of AI-specific metrics:

  • Model Inference Latency: The time taken by a locally hosted model (for example, one served through ONNX Runtime or Deep Java Library) or an external LLM API (like OpenAI or Anthropic) to return a response.
  • Token Consumption: The number of prompt tokens sent and completion tokens received. This is critical for tracking costs and rate limits.
  • Semantic Cache Hit Rate: If you use a vector database to cache LLM responses, you need to track how often requests are served from the cache versus hitting the raw model.
  • Vector Database Query Latency: The time it takes to perform similarity searches in databases like Pgvector, Pinecone, or Milvus.
  • API Error Rates: The frequency of HTTP 429 (Too Many Requests) or 5xx errors returned by downstream AI model providers.
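
As a small illustration of the last item, the sketch below maps a provider's HTTP status code to a short, fixed set of outcome labels that could later be used as a metric tag. The class name LlmOutcome and the label values are hypothetical and not part of any library; adjust them to your own conventions.

```java
// Hypothetical helper: collapses HTTP status codes into a small, fixed set of labels.
final class LlmOutcome {

    private LlmOutcome() {}

    // Returns a low-cardinality label for a downstream provider response status.
    static String classify(int httpStatus) {
        if (httpStatus >= 200 && httpStatus < 300) return "success";
        if (httpStatus == 429) return "rate_limited";   // Too Many Requests
        if (httpStatus >= 500 && httpStatus < 600) return "server_error";
        if (httpStatus >= 400 && httpStatus < 500) return "client_error";
        return "other";
    }

    public static void main(String[] args) {
        for (int status : new int[] {200, 429, 503, 401}) {
            System.out.println(status + " -> " + classify(status));
        }
    }
}
```

Because the set of labels is bounded, using one of them as a tag value adds only a handful of time series per meter.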

To see how these metrics fit into a distributed microservice architecture before diving into specific tools, explore Designing AI-Driven Microservices Architectures.

Architecture of the Monitoring Stack

To monitor a Java AI microservice, we use a pull-based metrics collection architecture. The diagram below illustrates how metrics flow from your Spring Boot application to the developer's dashboard:

+--------------------------------------------------+
|              Spring Boot AI Application          |
|  [LangChain4j / Spring AI] -> [Micrometer Core]  |
+--------------------------------------------------+
                         |
                         | Exposes /actuator/prometheus
                         v
+--------------------------------------------------+
|                  Prometheus Server               |
|        (Pulls metrics at regular intervals)      |
+--------------------------------------------------+
                         |
                         | Queries PromQL
                         v
+--------------------------------------------------+
|                  Grafana Dashboard               |
|        (Visualizes Latency, Tokens, Errors)      |
+--------------------------------------------------+
        

The Java application uses the Micrometer library to collect metrics. Micrometer acts as a facade, allowing you to instrument your code once and output metrics to various monitoring systems. We configure Micrometer to expose a Prometheus-compatible endpoint at /actuator/prometheus. The Prometheus server scrapes this endpoint periodically, stores the time-series data, and Grafana queries Prometheus to display real-time graphs.
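
To make the scrape contract concrete, here is a minimal plain-Java sketch of the text format that Prometheus reads from such an endpoint: one line per sample, with the metric name, optional labels in braces, and a numeric value. In a real application Micrometer's Prometheus registry generates this output for you; the class ExpositionFormatSketch and its methods are hypothetical names used only for illustration.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustration only: renders a single sample line in the Prometheus text format.
final class ExpositionFormatSketch {

    private ExpositionFormatSketch() {}

    // Produces: name{label="value",...} value  (labels sorted by key for stable output)
    static String sample(String name, Map<String, String> labels, double value) {
        String labelPart = new TreeMap<>(labels).entrySet().stream()
                .map(e -> e.getKey() + "=\"" + escape(e.getValue()) + "\"")
                .collect(Collectors.joining(","));
        return name + (labelPart.isEmpty() ? "" : "{" + labelPart + "}") + " " + value;
    }

    // Label values escape backslashes, double quotes, and newlines.
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        System.out.println("# TYPE ai_llm_tokens_total counter");
        System.out.println(sample("ai_llm_tokens_total",
                Map.of("type", "prompt", "model", "gpt-4"), 42.0));
    }
}
```

Seeing the format spelled out helps when you inspect the endpoint by hand: each distinct label combination appears as its own line, which is also how Prometheus stores it.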

For a complete guide to setting up a local Java toolchain in which you can build and debug this instrumentation, see Setting up Java Development Environment for AI.

Setting Up Micrometer in Spring Boot for AI Metrics

To begin collecting metrics, you must add the required dependencies to your Spring Boot project. If you are using Maven, add the Spring Boot Actuator and Micrometer Prometheus registry dependencies to your pom.xml file:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

Next, configure your application.properties file to expose the Prometheus actuator endpoint. For security reasons, recent Spring Boot versions expose only the health endpoint over HTTP by default, so the Prometheus endpoint must be listed explicitly:

# Expose the prometheus and health endpoints
management.endpoints.web.exposure.include=health,prometheus

# Enable detailed metrics for JVM and HTTP requests
management.metrics.enable.all=true
management.metrics.tags.application=ai-java-service

Once you start your Spring Boot application, you can navigate to http://localhost:8080/actuator/prometheus in your browser. You will see a plain-text list of raw metrics formatted for Prometheus scraping.

To see how to expose these metrics across standard web paths, visit Building AI-Powered Spring Boot REST APIs. If you need a sandbox setup to track memory performance before going to the cloud, look over Integrating OpenAI, HuggingFace, and Local LLMs via Ollama.

Practical Code Example: Tracking LLM Latency and Tokens

Now that the infrastructure is set up, let us write custom Java code to track AI-specific metrics. We will create an LLM service wrapper that records prompt tokens, completion tokens, and the execution time of the model call using Micrometer's MeterRegistry.

package com.example.ai.monitoring;

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;
import java.util.concurrent.TimeUnit;

@Service
public class OpenAiMonitoringService {

    private final MeterRegistry meterRegistry;
    private final Counter promptTokenCounter;
    private final Counter completionTokenCounter;
    private final Timer inferenceTimer;

    public OpenAiMonitoringService(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;

        this.promptTokenCounter = Counter.builder("ai.llm.tokens")
                .description("Total number of tokens sent to the LLM")
                .tag("type", "prompt")
                .tag("model", "gpt-4")
                .register(meterRegistry);

        this.completionTokenCounter = Counter.builder("ai.llm.tokens")
                .description("Total number of tokens received from the LLM")
                .tag("type", "completion")
                .tag("model", "gpt-4")
                .register(meterRegistry);

        this.inferenceTimer = Timer.builder("ai.llm.inference.latency")
                .description("Inference latency of the LLM API calls")
                .tag("model", "gpt-4")
                .publishPercentiles(0.5, 0.9, 0.95, 0.99)
                .register(meterRegistry);
    }

    public String callModelWithMetrics(String prompt) {
        long startTime = System.nanoTime();
        
        try {
            String response = mockExternalLlmCall(prompt);
            
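            // Rough approximation: whitespace-separated words are not real model tokens.
            // In production, use the token usage reported in the provider's API response.
            int promptTokens = prompt.split("\\s+").length;
            int completionTokens = response.split("\\s+").length;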
            
            promptTokenCounter.increment(promptTokens);
            completionTokenCounter.increment(completionTokens);
            
            return response;
        } finally {
            long duration = System.nanoTime() - startTime;
            inferenceTimer.record(duration, TimeUnit.NANOSECONDS);
        }
    }

    private String mockExternalLlmCall(String prompt) {
        try {
            Thread.sleep(1500);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "This is a simulated response containing ten words of text.";
    }
}

In this code, we use two primary Micrometer meter types:

  • Counter: A monotonically increasing value used to accumulate token counts over time. We use tags (type and model) to allow filtering by prompt vs. completion tokens and by specific models (e.g., gpt-4 vs. gpt-3.5-turbo).
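  • Timer: Measures short-duration latencies and counts how many calls were recorded. By configuring publishPercentiles(0.5, 0.9, 0.95, 0.99), Micrometer computes the 50th (median), 90th, 95th, and 99th percentiles of your LLM response times inside the application. This is vital for detecting long-tail latency spikes that ruin user experience. These precomputed percentiles are per instance and cannot be meaningfully averaged across instances; if you need cluster-wide percentiles, publish histogram buckets with publishPercentileHistogram() and aggregate them in PromQL with histogram_quantile.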
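
To show what a percentile expresses, here is a minimal plain-Java sketch that uses the simple nearest-rank method on a handful of latency samples. This is an illustration of the concept only; Micrometer uses its own internal algorithm, and the class PercentileSketch and the sample values are hypothetical.

```java
import java.util.Arrays;

// Illustration only: nearest-rank percentile over a small set of latency samples.
final class PercentileSketch {

    private PercentileSketch() {}

    // Returns the smallest sample such that at least p (0 < p <= 1) of all samples are <= it.
    static double percentile(double[] samples, double p) {
        if (samples.length == 0) throw new IllegalArgumentException("no samples");
        if (p <= 0.0 || p > 1.0) throw new IllegalArgumentException("p must be in (0, 1]");
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length); // 1-based rank
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Ten hypothetical LLM call latencies in seconds, one of them very slow.
        double[] latencies = {0.8, 0.9, 1.0, 1.1, 1.2, 1.2, 1.3, 1.4, 1.5, 6.0};
        System.out.println("p50 = " + percentile(latencies, 0.50));
        System.out.println("p99 = " + percentile(latencies, 0.99));
    }
}
```

In this sample set the single slow call leaves the median untouched but fully determines the 99th percentile, which is why tail percentiles are the better signal for user-facing latency.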

To explore how low-level telemetry metrics integrate with managed AWS infrastructure APIs, check out Integrating AWS Bedrock and SageMaker with Spring Boot. If you want to see how these metric frameworks are handled within the core Spring AI abstract layer, see Introduction to the Spring AI Framework.

Configuring Prometheus and Grafana

To collect the metrics generated by your Java application, you need to configure a Prometheus server. Create a file named prometheus.yml and add your Spring Boot application as a scrape target:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'spring-boot-ai-app'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['host.docker.internal:8080']

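Note: If you are running Prometheus inside a Docker container, host.docker.internal lets the container reach the host machine where your Spring Boot app is running. This name resolves out of the box on Docker Desktop; on Linux you typically need to start the container with --add-host=host.docker.internal:host-gateway.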

Once Prometheus starts scraping, open your Grafana dashboard and add Prometheus as a data source. You can then build custom panels using Prometheus Query Language (PromQL). Here are some useful PromQL queries for your AI dashboards:

  • Average LLM Latency (Last 5 minutes): rate(ai_llm_inference_latency_seconds_sum[5m]) / rate(ai_llm_inference_latency_seconds_count[5m])
  • Total Tokens Consumed per Model: sum(increase(ai_llm_tokens_total[1h])) by (model, type)
  • 95th Percentile Latency: ai_llm_inference_latency_seconds{quantile="0.95"}
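
The first query divides the growth of the timer's cumulative sum by the growth of its cumulative count. The plain-Java sketch below works through the same arithmetic for two scrapes of those series. It is a simplification: PromQL's rate() also handles counter resets and extrapolation, and the class name and the numbers are hypothetical.

```java
// Illustration only: the arithmetic behind rate(..._sum) / rate(..._count).
final class AverageLatencySketch {

    private AverageLatencySketch() {}

    // Average seconds per call between two scrapes of the cumulative sum and count.
    // The window length cancels out because both rates are divided by the same duration.
    static double averageLatencySeconds(double sumStart, long countStart,
                                        double sumEnd, long countEnd) {
        long calls = countEnd - countStart;
        if (calls <= 0) {
            return Double.NaN; // no calls in the window, or the counter was reset
        }
        return (sumEnd - sumStart) / calls;
    }

    public static void main(String[] args) {
        // Hypothetical values: 30 calls took 45 seconds in total within the window.
        System.out.println(averageLatencySeconds(120.0, 80, 165.0, 110));
    }
}
```

With these numbers the average is 45 seconds over 30 calls, or 1.5 seconds per call.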

To manage conversation state and monitor its caches under high chat volume, check out Managing Chat Memory and Conversational Context in Spring Boot.

Observability in Vector Databases and Decoupled Ingestion Pipelines

In production enterprise setups, AI services often retrieve supporting context from a vector store as part of Retrieval-Augmented Generation (RAG). To maintain performance, you should track vector query latency and data transformation times alongside your main model metrics.

To learn how to calculate text embedding delays, look over Understanding Vector Databases and Embeddings in Java. To watch vector delays inside your application logic, read Implementing RAG with Spring AI.

Additionally, processing heavy payloads asynchronously can cause delays in your message brokers. To monitor event consumption delays and keep your processing pipelines healthy under heavy loads, check out our guide: Asynchronous AI Processing with Kafka.

Container Instrumentation and Enterprise Cloud Topologies

Isolating application code inside thin, highly efficient images ensures your performance metrics accurately reflect model runtime behavior rather than container overhead.

To learn how to configure lean base layers for your monitoring containers, check out Containerizing AI-Enabled Java Applications with Docker. To launch your instrumented containers across scalable test systems, follow Deploying AI Java Microservices to Kubernetes.

For applications running on AWS managed nodes, have your pods (rather than the underlying instances) use IAM Roles for Service Accounts (IRSA) to authenticate with secure metrics endpoints. See our deployment guide: Deploying Java AI Microservices on AWS EKS. If your cluster schedules workloads onto GPU nodes directly, ensure your monitoring stack captures accelerator performance by following Kubernetes Scaling & GPU Resources for AI Workloads.

Finally, to prevent malicious traffic spikes from blowing out your infrastructure costs, secure your public endpoints by following Securing AI APIs, Prompts, and Data Pipelines in Spring Boot. To further protect your budget, compile your applications into thin, fast binaries by reviewing Optimizing Java AI Applications: GraalVM Native Images & Cost Management.

Real-World Use Cases

  • Cost Control and Budgeting: By tracking token usage per model and application module, you can calculate the exact dollar cost of your AI features. If a specific user or feature starts consuming millions of tokens unexpectedly, you can trigger alerts or implement rate limiting.
  • SLA Compliance: Many enterprise contracts include strict Service Level Agreements (SLAs) on response times. Monitoring the 95th and 99th percentile latencies of your AI workflows helps you detect slow API calls or slow vector database lookups before they breach those targets.
  • Model Drift and Degradation Detection: If your average inference latency and completion tokens per request both drop sharply while request volume remains constant, it might indicate that your model is returning empty or truncated responses due to API errors or context-length limits.
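
The cost calculation from the first use case can be sketched in plain Java as follows. The per-1,000-token prices are placeholder values, not real provider pricing, and the class TokenCostSketch is a hypothetical name; in practice the token counts would come from your metrics.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Illustration only: converts token counts into a dollar amount.
final class TokenCostSketch {

    private TokenCostSketch() {}

    // Prices are expressed per 1,000 tokens; the result is rounded to four decimal places.
    static BigDecimal cost(long promptTokens, long completionTokens,
                           BigDecimal promptPricePer1k, BigDecimal completionPricePer1k) {
        BigDecimal prompt = BigDecimal.valueOf(promptTokens).multiply(promptPricePer1k);
        BigDecimal completion = BigDecimal.valueOf(completionTokens).multiply(completionPricePer1k);
        return prompt.add(completion)
                .divide(BigDecimal.valueOf(1000), 4, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        // Placeholder prices (assumptions): 0.03 per 1K prompt tokens, 0.06 per 1K completion tokens.
        BigDecimal total = cost(120_000, 45_000, new BigDecimal("0.03"), new BigDecimal("0.06"));
        System.out.println("Estimated cost: " + total);
    }
}
```

With these placeholder numbers the estimate comes to 6.30. BigDecimal is used instead of double so that the monetary result does not pick up binary rounding noise.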

Common Mistakes to Avoid

  • High Cardinality Tags: Do not include unique values like User IDs, Session IDs, or raw prompt strings as tags in your Micrometer meters. Prometheus is designed for low-to-medium cardinality data. Every distinct combination of tag values becomes a separate time series, so unbounded tags inflate Prometheus memory use and can slow down or crash your monitoring infrastructure.
  • Blocking Main Execution Threads: Do not perform complex metric processing or external logging calls inside your main application thread. Micrometer is highly optimized and non-blocking, but custom calculations or synchronous logging can slow down your application.
  • Ignoring Rate Limit Headers: External LLM providers return rate limit information in their HTTP response headers (e.g., remaining requests, remaining tokens). Failing to parse and monitor these headers can lead to sudden, unexplained application failures when limits are reached.
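
For the rate limit headers mentioned in the last item, a minimal plain-Java sketch of the parsing step might look like this. The header names used in main (x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens) are assumptions and differ between providers, so check your provider's documentation; the class RateLimitHeaderSketch is hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.OptionalLong;

// Illustration only: extracts a numeric "remaining" value from response headers.
final class RateLimitHeaderSketch {

    private RateLimitHeaderSketch() {}

    // Looks the header up case-insensitively and returns empty when it is missing or malformed.
    static OptionalLong remaining(Map<String, List<String>> headers, String headerName) {
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            String key = entry.getKey();
            if (key != null && key.equalsIgnoreCase(headerName) && !entry.getValue().isEmpty()) {
                try {
                    return OptionalLong.of(Long.parseLong(entry.getValue().get(0).trim()));
                } catch (NumberFormatException e) {
                    return OptionalLong.empty();
                }
            }
        }
        return OptionalLong.empty();
    }

    public static void main(String[] args) {
        // Assumed header names; real names depend on the provider.
        Map<String, List<String>> headers = Map.of(
                "x-ratelimit-remaining-requests", List.of("58"),
                "x-ratelimit-remaining-tokens", List.of("14500"));
        System.out.println(remaining(headers, "X-RateLimit-Remaining-Tokens"));
    }
}
```

The parsed value could then back a gauge in your metrics registry, so that a dashboard or alert shows the remaining budget shrinking before requests start failing.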

Interview Notes and Questions

  • Question: Why is the Prometheus pull model preferred over the push model?
  • Answer: The pull model prevents the monitoring server from being overwhelmed by high-frequency push requests from thousands of microservice instances. It also allows Prometheus to detect if an instance is dead (if it fails to scrape it) without relying on heartbeats.
  • Question: How do you handle high-cardinality data in enterprise logging stacks?
  • Answer: High-cardinality tokens such as request identifiers, account keys, or raw textual inputs should be routed to centralized log indexing systems (like Loki or ELK) rather than time-series metrics stores like Prometheus. Keep metrics limited to structured categorization dimensions like HTTP codes, target model identities, and endpoint types.
  • Question: What is the difference between tracing and metrics in a RAG pipeline?
  • Answer: Metrics tell you the aggregate health of the system, such as average similarity search delay or overall token consumption counts. Traces track a single request's end-to-end lifecycle, allowing you to pinpoint exactly which microservice or database lookup caused a specific transaction to slow down or fail.

Summary

Implementing monitoring and observability for your Java AI applications ensures visibility into application health, token burn rates, and inference performance. By using Spring Boot Actuator and Micrometer to collect metrics, configuring Prometheus to scrape data, and building clean Grafana dashboards, you can monitor and optimize your enterprise AI systems at scale.

To learn how to streamline your infrastructure configurations, minimize cold-start performance delays, and lower cloud compute costs across your entire cluster setup, read the next chapter in this series: Optimizing Java AI Applications: GraalVM Native Images & Cost Management.

About the Author

Naresh Kumar

Senior Java Backend Engineer experienced in Banking, Payments, ISO 20022, Spring Boot, Microservices, Kafka, Docker, Kubernetes, AWS and Cloud Native Systems.

Built enterprise payment solutions, transaction processing systems, API platforms and scalable microservices used in production.
