Published: 2026-06-01 • Updated: 2026-06-20

Monitoring and Observability in Spring AI

Monitoring and observability are critical for production Spring AI applications. A conventional Spring Boot API can be monitored with logs, metrics, traces, error rates, and response times. An AI application needs more visibility, because a response can succeed technically (HTTP 200, no exception) and still be wrong, slow, expensive, unsafe, or poorly grounded.

A Spring AI application may call chat models, embedding models, vector databases, tools, memory stores, document pipelines, and external APIs. If any one of these layers fails or slows down, the user experience degrades, and without per-layer visibility it is hard to tell which layer is responsible.


What is Monitoring?

Monitoring means tracking the health and performance of your application using predefined metrics.

Examples:

  • API response time
  • Error count
  • CPU usage
  • Memory usage
  • LLM latency
  • Vector search latency
  • Tool call failure count

What is Observability?

Observability means understanding what is happening inside the system by using logs, metrics, traces, and events.

User Request
   |
   v
Controller
   |
   v
ChatClient
   |
   v
RAG Search
   |
   v
Tool Call
   |
   v
Model Response
   |
   v
Final Answer

Observability helps you identify at which of these steps a problem occurred, instead of only knowing that the request as a whole was slow or failed.


Why Is Observability Important in Spring AI?

AI applications can fail in many ways:

  • Model response is slow
  • Model provider is unavailable
  • Prompt is too large
  • Token cost is too high
  • RAG retrieves wrong documents
  • Vector database is slow
  • Tool call fails
  • Memory context is wrong
  • AI hallucinates an answer
  • Output parser fails

Spring AI Observability Architecture

Spring AI Application
      |
      +-- Metrics
      +-- Logs
      +-- Traces
      +-- Token Usage
      +-- Tool Events
      +-- RAG Events
      +-- User Feedback
      |
      v
Prometheus / Grafana / Loki / Jaeger

Core Observability Areas

Area            | What to Track
Chat Model      | Latency, errors, token usage, cost
Embedding Model | Embedding time, failures, dimensions
Vector Store    | Search latency, empty results, similarity score
RAG             | Retrieved chunks, source quality, fallback count
Tools           | Tool calls, success rate, failures, authorization blocks
Memory          | Conversation size, retrieval latency, memory leaks
Security        | Prompt injection attempts, unsafe outputs

Real-World Learning Platform Example

For a learning website, the AI assistant may answer questions about Java, Spring Boot, Docker, Kubernetes, Spring AI, RAG, and Agentic AI.

You should monitor:

  • Which topics users ask most
  • Which answers get poor feedback
  • Which RAG documents are retrieved
  • Which courses are recommended
  • How much each AI request costs
  • How long ChatClient takes to respond

Real-World Banking Example

For a banking AI assistant, observability is even more important.

Track:

  • Transaction explanation tool calls
  • Unauthorized access attempts
  • Prompt injection attempts
  • Failed tool calls
  • Masked data usage
  • Audit events
  • Response validation failures

Real-World E-Commerce Example

For an e-commerce AI assistant, monitor:

  • Order tracking tool latency
  • Refund policy retrieval quality
  • Product recommendation accuracy
  • Cancellation confirmation events
  • Customer satisfaction feedback
  • Fallback responses

Step 1: Add Spring Boot Actuator

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

Step 2: Add Micrometer Prometheus Registry

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

Step 3: Configure Actuator Endpoints

management.endpoints.web.exposure.include=health,info,metrics,prometheus
management.endpoint.health.show-details=always
management.metrics.tags.application=spring-ai-app

Step 4: Basic AI Metrics Service

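import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Service;

// Central place for AI-related counters, so metric names stay consistent.
@Service
public class AiMetricsService {

    private final MeterRegistry meterRegistry;

    public AiMetricsService(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    // Counts chat calls that returned a response.
    public void recordChatSuccess() {
        meterRegistry.counter("ai.chat.success").increment();
    }

    // Counts chat calls that threw an exception.
    public void recordChatFailure() {
        meterRegistry.counter("ai.chat.failure").increment();
    }

    // Counts tool invocations, tagged by tool name. Keep tool names to a
    // small fixed set so the "tool" tag does not create unbounded series.
    public void recordToolCall(String toolName) {
        meterRegistry.counter("ai.tool.calls", "tool", toolName).increment();
    }

    // Counts requests that fell back to a non-RAG answer.
    public void recordRagFallback() {
        meterRegistry.counter("ai.rag.fallback").increment();
    }
}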

Step 5: Measure ChatClient Latency

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;

@Service
public class ObservableChatService {

    private final ChatClient chatClient;
    private final MeterRegistry meterRegistry;

    public ObservableChatService(ChatClient.Builder builder,
                                 MeterRegistry meterRegistry) {
        this.chatClient = builder.build();
        this.meterRegistry = meterRegistry;
    }

    public String ask(String message) {

        // Start timing before the model call.
        Timer.Sample sample = Timer.start(meterRegistry);

        try {
            String response = chatClient.prompt()
                    .system("You are a helpful Spring AI assistant.")
                    .user(message)
                    .call()
                    .content();

            meterRegistry.counter("ai.chat.success").increment();

            return response;

        } catch (RuntimeException ex) {

            // Count the failure, then let the caller handle the exception.
            meterRegistry.counter("ai.chat.failure").increment();
            throw ex;

        } finally {

            // Record latency for both successful and failed calls.
            sample.stop(meterRegistry.timer("ai.chat.latency"));
        }
    }
}

Important AI Metrics

  • ai.chat.latency
  • ai.chat.success
  • ai.chat.failure
  • ai.tool.calls
  • ai.tool.failure
  • ai.rag.search.latency
  • ai.rag.empty.results
  • ai.token.usage
  • ai.cost.estimated

RAG Observability

RAG systems need special monitoring because poor retrieval leads to poor answers.

Track:

  • Number of retrieved documents
  • Top similarity score
  • Empty retrieval count
  • Vector search latency
  • Source document names
  • Fallback answer count

RAG Monitoring Flow

User Question
      |
      v
Vector Search
      |
      +-- Search Latency
      +-- Retrieved Count
      +-- Similarity Score
      +-- Source Documents
      |
      v
Chat Model Answer

Vector Search Metric Example

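// Method of a service class. Assumes injected fields:
//   vectorStore   (org.springframework.ai.vectorstore.VectorStore)
//   meterRegistry (io.micrometer.core.instrument.MeterRegistry)
public List<Document> search(String question) {

    Timer.Sample sample = Timer.start(meterRegistry);

    try {
        List<Document> documents =
                vectorStore.similaritySearch(question);

        meterRegistry.counter("ai.rag.search.count").increment();

        // Nothing retrieved means the answer cannot be grounded in a source.
        // A null result is treated defensively as empty.
        if (documents == null || documents.isEmpty()) {
            meterRegistry.counter("ai.rag.empty.results").increment();
        }

        return documents;

    } finally {
        // Record latency whether the search succeeded or threw.
        sample.stop(meterRegistry.timer("ai.rag.search.latency"));
    }
}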
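
The retrieval signals listed above (retrieved count, top similarity score, empty results) can be summarised per search before they are recorded as metrics. The following is a minimal plain-Java sketch, not Spring AI API: `RetrievalStats`, its fields, and the threshold parameter are hypothetical names, and it assumes similarity scores have already been extracted as numbers where higher means more similar. How scores are read from a retrieved document depends on the Spring AI version and vector store in use.

```java
import java.util.List;

/** Hypothetical helper: summarises the similarity scores of one vector search. */
final class RetrievalStats {

    final int retrievedCount;
    final double topScore;        // 0.0 when nothing was retrieved
    final boolean lowConfidence;  // true when empty or below the threshold

    private RetrievalStats(int retrievedCount, double topScore, boolean lowConfidence) {
        this.retrievedCount = retrievedCount;
        this.topScore = topScore;
        this.lowConfidence = lowConfidence;
    }

    /**
     * @param scores             similarity scores, assumed to be in the range 0..1
     * @param minAcceptableScore assumed threshold below which retrieval is suspect
     */
    static RetrievalStats from(List<Double> scores, double minAcceptableScore) {
        double top = 0.0;
        for (double score : scores) {
            top = Math.max(top, score);
        }
        boolean low = scores.isEmpty() || top < minAcceptableScore;
        return new RetrievalStats(scores.size(), top, low);
    }
}
```

A service could count `lowConfidence` results alongside ai.rag.empty.results, which helps separate "nothing was found" from "something was found, but it was a weak match".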

Tool Calling Observability

Tool calls must be monitored because tools connect AI to real business systems.

Track:

  • Tool name
  • Tool success/failure
  • Tool latency
  • Unauthorized tool attempts
  • Missing parameter errors
  • High-risk action requests

Tool Call Logging Example

log.info("tool_call tool={} userIdHash={} success={} latencyMs={}",
        toolName,
        userIdHash,
        success,
        latencyMs);

Never log passwords, OTPs, full card numbers, API keys, or sensitive prompts.
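
The `userIdHash` field in the log line above has to come from somewhere. One possible approach, sketched here with the JDK only, is a keyed hash (HMAC-SHA256) of the user ID, truncated for readability. `UserIdHasher` is a hypothetical class, and the secret is assumed to be loaded from secure configuration rather than hard-coded. A keyed hash is used instead of a plain hash because plain hashes of short, guessable IDs can be reversed by enumeration.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical helper: turns a user ID into a stable, non-reversible log token. */
final class UserIdHasher {

    private final SecretKeySpec key;

    /** @param secret non-empty key material, assumed to come from secure config */
    UserIdHasher(byte[] secret) {
        this.key = new SecretKeySpec(secret, "HmacSHA256");
    }

    /** Returns the first 16 hex characters of HMAC-SHA256(secret, userId). */
    String hash(String userId) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(key);
            byte[] digest = mac.doFinal(userId.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 8; i++) {
                hex.append(String.format("%02x", digest[i]));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 is not available", e);
        }
    }
}
```

The same user always maps to the same token, so log lines can still be correlated per user without exposing the real identifier.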


Memory Observability

Chat memory can grow and affect cost, latency, and privacy.

Track:

  • Conversation count
  • Average messages per conversation
  • Memory retrieval latency
  • Memory clear events
  • Token usage from history
  • Cross-user access attempts
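
As a rough illustration of the first few items, the sketch below keeps message counts per conversation in memory and derives the conversation count and average size from them. `ConversationStats` is a hypothetical class and not part of Spring AI; it assumes the application calls it whenever a message is added to, or cleared from, chat memory.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical in-process tracker for chat-memory size. */
final class ConversationStats {

    private final Map<String, AtomicInteger> messagesByConversation = new ConcurrentHashMap<>();

    /** Call once for every message stored in chat memory. */
    void recordMessage(String conversationId) {
        messagesByConversation
                .computeIfAbsent(conversationId, id -> new AtomicInteger())
                .incrementAndGet();
    }

    /** Call when a conversation's memory is cleared. */
    void recordClear(String conversationId) {
        messagesByConversation.remove(conversationId);
    }

    int conversationCount() {
        return messagesByConversation.size();
    }

    double averageMessagesPerConversation() {
        if (messagesByConversation.isEmpty()) {
            return 0.0;
        }
        long total = 0;
        for (AtomicInteger count : messagesByConversation.values()) {
            total += count.get();
        }
        return (double) total / messagesByConversation.size();
    }
}
```

Both derived values are natural candidates for gauges on a dashboard. In a multi-instance deployment they would need to be computed from the shared memory store instead of per-process state.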

Token and Cost Monitoring

AI cost can grow quickly if prompts are large or requests are repeated.

Track:

  • Input tokens
  • Output tokens
  • Total tokens
  • Average tokens per request
  • Estimated cost per request
  • Daily cost
  • Cost by user
  • Cost by feature
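
Estimated cost per request follows directly from the token counts and the provider's price list. The sketch below shows the arithmetic in plain Java. `CostEstimator` is a hypothetical class, and the prices passed to it are placeholders, not real provider pricing. It assumes prices are quoted per one million tokens and that input and output token counts are read from the model response; the exact accessor names for those counts vary between Spring AI versions.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

/** Hypothetical helper: estimates request cost from token counts. */
final class CostEstimator {

    private static final BigDecimal ONE_MILLION = BigDecimal.valueOf(1_000_000);

    private final BigDecimal inputUsdPerMillion;
    private final BigDecimal outputUsdPerMillion;

    /** Prices are decimal strings (for example "0.50") per one million tokens. */
    CostEstimator(String inputUsdPerMillion, String outputUsdPerMillion) {
        this.inputUsdPerMillion = new BigDecimal(inputUsdPerMillion);
        this.outputUsdPerMillion = new BigDecimal(outputUsdPerMillion);
    }

    /** cost = (inputTokens * inputPrice + outputTokens * outputPrice) / 1,000,000 */
    BigDecimal estimateUsd(long inputTokens, long outputTokens) {
        BigDecimal input = inputUsdPerMillion.multiply(BigDecimal.valueOf(inputTokens));
        BigDecimal output = outputUsdPerMillion.multiply(BigDecimal.valueOf(outputTokens));
        return input.add(output).divide(ONE_MILLION, 6, RoundingMode.HALF_UP);
    }
}
```

With placeholder prices of 0.50 and 1.50 per million tokens, a request with 1,200 input tokens and 300 output tokens comes to (600 + 450) / 1,000,000 = 0.00105. Summing these estimates by user or by feature gives the per-user and per-feature views listed above.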

Cost Control Flow

AI Request
   |
   v
Estimate Token Usage
   |
   v
Check User Quota
   |
   +-- Allowed → Process
   |
   +-- Exceeded → Reject Safely
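
The quota check in this flow can be sketched as a small in-memory guard. `TokenQuota` is a hypothetical class: it assumes a single application instance, a fixed daily limit per user, and an external job that resets usage each day (not shown). A real deployment would more likely keep usage in a shared store.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical per-user token quota, held in memory. */
final class TokenQuota {

    private final long dailyLimit;
    private final Map<String, Long> usedToday = new ConcurrentHashMap<>();

    TokenQuota(long dailyLimit) {
        this.dailyLimit = dailyLimit;
    }

    /**
     * Reserves the estimated tokens if the user stays within the limit.
     * Returns false, and reserves nothing, when the limit would be exceeded.
     */
    boolean tryConsume(String userIdHash, long estimatedTokens) {
        boolean[] allowed = {false};
        // compute() runs atomically per key, so concurrent requests cannot overspend.
        usedToday.compute(userIdHash, (id, used) -> {
            long current = (used == null) ? 0L : used;
            if (current + estimatedTokens <= dailyLimit) {
                allowed[0] = true;
                return current + estimatedTokens;
            }
            return used; // unchanged; returning null keeps the key absent
        });
        return allowed[0];
    }

    long used(String userIdHash) {
        return usedToday.getOrDefault(userIdHash, 0L);
    }
}
```

When `tryConsume` returns false, the "Reject Safely" branch applies: return a clear quota message instead of calling the model, and count the rejection as a metric.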

Structured Logs

Use structured logs (one JSON object per event) instead of free-form text, so that fields can be filtered and aggregated.

{
  "event": "ai_chat_request",
  "userIdHash": "abc123",
  "conversationId": "conv-789",
  "model": "gpt-4o-mini",
  "latencyMs": 1200,
  "success": true
}
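
To make the shape of such an event concrete, the sketch below builds one JSON log line from an ordered map of fields using only the JDK. `JsonLogLine` is a hypothetical helper written for illustration. In a real application a JSON log encoder or a JSON library would normally do this work, and this sketch only handles strings, numbers, and booleans.

```java
import java.util.Map;

/** Hypothetical helper: renders a flat map as a single-line JSON object. */
final class JsonLogLine {

    static String of(Map<String, Object> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, Object> entry : fields.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            first = false;
            sb.append('"').append(escape(entry.getKey())).append("\":");
            Object value = entry.getValue();
            if (value instanceof Number || value instanceof Boolean) {
                sb.append(value); // numbers and booleans are written unquoted
            } else {
                sb.append('"').append(escape(String.valueOf(value))).append('"');
            }
        }
        return sb.append('}').toString();
    }

    /** Escapes quotes, backslashes, and control characters. */
    private static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                out.append('\\').append(c);
            } else if (c < 0x20) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Passing a `LinkedHashMap` keeps the fields in insertion order, which makes log lines easier to scan by eye.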

Safe Logging Rules

  • Log metadata, not sensitive content
  • Mask user identifiers
  • Do not log raw prompts with private data
  • Do not log API keys
  • Do not log full tool payloads
  • Log failures with safe error messages

Distributed Tracing

Tracing helps you understand how one user request moves across services.

Request Trace
   |
   +-- API Gateway
   +-- Spring AI Service
   +-- Vector Store
   +-- Tool Service
   +-- Model Provider
   +-- Response Validator

Why Tracing Matters

If a user says the AI is slow, tracing shows whether the delay came from:

  • Controller
  • Vector database
  • Tool API
  • Chat model
  • Network
  • Output parser
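
The idea can be illustrated without a tracing backend: record how long each stage of one request took, then ask which stage dominated. `StageTimings` and the stage names are hypothetical. A real tracer records spans automatically and propagates them across services; this sketch only shows the kind of per-stage breakdown a trace provides.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical per-request record of how long each stage took. */
final class StageTimings {

    private final Map<String, Long> millisByStage = new LinkedHashMap<>();

    /** Adds a duration to a stage; repeated stages (such as two tool calls) accumulate. */
    void record(String stage, long millis) {
        millisByStage.merge(stage, millis, Long::sum);
    }

    /** Returns the stage with the largest total duration, or null if none recorded. */
    String slowestStage() {
        String slowest = null;
        long max = -1;
        for (Map.Entry<String, Long> entry : millisByStage.entrySet()) {
            if (entry.getValue() > max) {
                max = entry.getValue();
                slowest = entry.getKey();
            }
        }
        return slowest;
    }

    long totalMillis() {
        long total = 0;
        for (long millis : millisByStage.values()) {
            total += millis;
        }
        return total;
    }
}
```

For a request that spent 120 ms in vector search, 340 ms in a tool call, 2,100 ms in the chat model, and 5 ms in the output parser, the breakdown makes it obvious that the model call accounts for most of the 2,565 ms total.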

Prometheus Configuration Example

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "spring-ai-app"
    metrics_path: "/actuator/prometheus"
    static_configs:
      - targets: ["spring-ai-app:8080"]

Grafana Dashboard Panels

  • AI request count
  • Chat model latency
  • Chat model error rate
  • Vector search latency
  • RAG empty result count
  • Tool call success rate
  • Token usage trend
  • Estimated AI cost
  • Prompt injection attempts
  • Fallback response count

Alerting Rules

Create alerts for:

  • High model latency
  • High AI error rate
  • Vector search failures
  • Tool failure spikes
  • Too many fallback responses
  • Unusual token usage
  • Prompt injection spike
  • Provider outage

Example Alert Conditions

If ai.chat.failure rate > 5% for 5 minutes → Alert

If ai.chat.latency p95 > 5 seconds → Alert

If ai.rag.empty.results increases suddenly → Alert

If ai.token.usage doubles unexpectedly → Alert
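
The first two conditions can be made precise with a little arithmetic. The sketch below evaluates them in plain Java over a window of samples. `AlertRules` is a hypothetical class, and the p95 uses the nearest-rank method, which is one of several percentile definitions. In production these checks would normally be expressed as alerting rules in the monitoring system, not in application code.

```java
import java.util.Arrays;

/** Hypothetical helper: evaluates simple alert conditions over a sample window. */
final class AlertRules {

    /** True when failures / (successes + failures) is strictly above the threshold. */
    static boolean failureRateTooHigh(long successes, long failures, double threshold) {
        long total = successes + failures;
        return total > 0 && (double) failures / total > threshold;
    }

    /** 95th percentile by nearest rank: the value at position ceil(0.95 * n). */
    static long p95(long[] latenciesMs) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (95 * sorted.length + 99) / 100; // integer ceil(0.95 * n), 1-based
        return sorted[rank - 1];
    }
}
```

For example, 6 failures out of 101 calls is about 5.9%, which is above a 5% threshold, while 5 out of 100 is exactly 5% and does not trigger the alert. A p95 above 5,000 ms would trigger the latency alert.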

Quality Monitoring

AI quality must also be monitored.

Track:

  • User thumbs up/down
  • Reported wrong answers
  • Hallucination reports
  • Low-confidence responses
  • Unsupported claims
  • Repeated user rephrasing

User Feedback Table

ai_feedback
   |
   +-- id
   +-- user_id
   +-- conversation_id
   +-- question
   +-- answer
   +-- rating
   +-- feedback_text
   +-- created_at

Feedback Controller Example

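import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

// AiFeedbackService and AiFeedbackRequest are application classes defined elsewhere.
@RestController
@RequestMapping("/api/ai/feedback")
public class AiFeedbackController {

    private final AiFeedbackService feedbackService;

    public AiFeedbackController(AiFeedbackService feedbackService) {
        this.feedbackService = feedbackService;
    }

    // Stores one feedback entry for a previously returned AI answer.
    @PostMapping
    public String submitFeedback(@RequestBody AiFeedbackRequest request) {
        feedbackService.save(request);
        return "Feedback submitted successfully.";
    }
}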
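
The `AiFeedbackRequest` type is not shown in this article. The following is one possible shape, offered as a sketch only: the field names, the 1 to 5 rating scale, and the 2,000 character limit are all assumptions. The user ID is deliberately left out of the request body, on the assumption that it is taken from the authenticated session. How the class binds from JSON depends on the application's Jackson configuration, so the sketch focuses on validation.

```java
/** Hypothetical request body for the feedback endpoint; field names are assumptions. */
final class AiFeedbackRequest {

    private final String conversationId;
    private final int rating;          // assumed scale: 1 (worst) to 5 (best)
    private final String feedbackText; // optional free text, may be null

    AiFeedbackRequest(String conversationId, int rating, String feedbackText) {
        if (conversationId == null || conversationId.isBlank()) {
            throw new IllegalArgumentException("conversationId is required");
        }
        if (rating < 1 || rating > 5) {
            throw new IllegalArgumentException("rating must be between 1 and 5");
        }
        if (feedbackText != null && feedbackText.length() > 2000) {
            throw new IllegalArgumentException("feedbackText is too long");
        }
        this.conversationId = conversationId;
        this.rating = rating;
        this.feedbackText = feedbackText;
    }

    String getConversationId() {
        return conversationId;
    }

    int getRating() {
        return rating;
    }

    String getFeedbackText() {
        return feedbackText;
    }

    /** Ratings of 1 or 2 are treated as negative feedback in this sketch. */
    boolean isNegative() {
        return rating <= 2;
    }
}
```

Rejecting malformed feedback at the boundary keeps the stored ratings usable for the quality metrics described above.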

Prompt Version Monitoring

Prompt changes can affect answer quality.

Track prompt version with every AI request.

{
  "promptName": "rag-answer-prompt",
  "promptVersion": "1.0.3",
  "model": "gpt-4o-mini",
  "latencyMs": 1300
}

Why Prompt Version Tracking Matters

  • Find which prompt caused poor responses
  • Compare old and new prompts
  • Roll back bad prompt changes
  • Debug quality regressions
  • Run A/B testing
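
Comparing old and new prompts becomes straightforward once every rating is stored together with the prompt version that produced the answer. The sketch below groups ratings by version and reports the average. `PromptVersionStats` is a hypothetical class and the rating scale is an assumption; in practice the same grouping would usually be a query over the feedback table.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: average user rating per prompt version. */
final class PromptVersionStats {

    // version -> {sum of ratings, number of ratings}
    private final Map<String, long[]> totals = new HashMap<>();

    void record(String promptVersion, int rating) {
        long[] total = totals.computeIfAbsent(promptVersion, v -> new long[2]);
        total[0] += rating;
        total[1]++;
    }

    /** Returns the average rating, or NaN if the version has no ratings yet. */
    double averageRating(String promptVersion) {
        long[] total = totals.get(promptVersion);
        return (total == null) ? Double.NaN : (double) total[0] / total[1];
    }
}
```

If version 1.0.2 averages 4.5 and version 1.0.3 averages 2.0 over comparable traffic, that is a strong hint that the newer prompt caused a regression and is a candidate for rollback.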

Production Observability Flow

User Request
      |
      v
Generate Trace ID
      |
      v
Validate Input
      |
      v
RAG Search
      |
      v
Tool Calls
      |
      v
Chat Model
      |
      v
Output Validation
      |
      v
Record Metrics + Logs + Feedback

Security Observability

Track AI security events:

  • Prompt injection attempts
  • Unsafe tool requests
  • Unauthorized document retrieval
  • Blocked file uploads
  • Unsafe output detection
  • Rate limit violations

Common Monitoring Mistakes

1. Monitoring Only HTTP Status

An AI response may be wrong even with HTTP 200.

2. Not Tracking Token Usage

Costs may increase silently.

3. No RAG Metrics

Wrong retrieval causes wrong answers.

4. No Tool Metrics

Tool failures break agent workflows.

5. Logging Sensitive Data

Logs can become a security risk.


Best Practices

  • Use Actuator and Micrometer
  • Expose Prometheus metrics
  • Track model latency and error rate
  • Monitor vector search quality
  • Track tool calls and failures
  • Measure token usage and cost
  • Use structured logs
  • Use distributed tracing
  • Collect user feedback
  • Track prompt versions
  • Alert on abnormal behavior
  • Never log sensitive prompts or secrets

Production Checklist

  • Actuator enabled
  • Prometheus metrics enabled
  • Grafana dashboard created
  • Chat latency tracked
  • Model failures tracked
  • RAG metrics tracked
  • Tool metrics tracked
  • Memory usage tracked
  • Token usage tracked
  • Cost estimated
  • Prompt version logged
  • User feedback collected
  • Security events monitored
  • Alerts configured

Interview Questions

Q1: Why is observability important in Spring AI?

Because AI applications can fail logically even when APIs return success. Observability helps track latency, cost, retrieval quality, tool calls, and response quality.

Q2: What should be monitored in ChatClient calls?

Latency, success rate, failure rate, token usage, model name, prompt version, and cost.

Q3: What should be monitored in RAG?

Vector search latency, retrieved document count, similarity score, empty results, source documents, and fallback responses.

Q4: Why track tool calls?

Tools connect AI to real systems, so failures, latency, unauthorized attempts, and wrong parameters must be monitored.

Q5: Why is token monitoring important?

Token usage directly affects cost, latency, and model context limits.


Advanced Interview Questions

Q1: Why is HTTP 200 not enough for AI monitoring?

The API may succeed technically, but the AI answer may still be wrong, hallucinated, unsafe, or irrelevant.

Q2: How do you detect poor RAG quality?

Monitor empty retrievals, low similarity scores, wrong source documents, user feedback, and hallucination reports.

Q3: How do you monitor AI cost?

Track input tokens, output tokens, total tokens, model used, feature name, user usage, and estimated price per request.

Q4: What is prompt version observability?

It means logging prompt names and versions with requests so response quality can be compared and bad prompt changes can be rolled back.

Q5: What security events should be monitored?

Prompt injection attempts, unsafe tool calls, unauthorized RAG access, blocked uploads, and rate limit violations.


Summary

Monitoring and observability are essential for production Spring AI applications. They help developers understand performance, reliability, cost, security, retrieval quality, and user satisfaction.

A strong observability setup should include metrics, structured logs, traces, token tracking, RAG monitoring, tool monitoring, memory monitoring, prompt version tracking, and user feedback.

For real-world applications such as learning platforms, banking assistants, e-commerce support bots, SaaS help desks, and enterprise AI agents, observability is the difference between an experimental AI demo and a trustworthy production AI system.

About the Author

Naresh Kumar

Senior Java Backend Engineer experienced in Banking, Payments, ISO 20022, Spring Boot, Microservices, Kafka, Docker, Kubernetes, AWS and Cloud Native Systems.

Built enterprise payment solutions, transaction processing systems, API platforms and scalable microservices used in production.
