Monitoring and Observability in Spring AI

Monitoring and observability are critical for production Spring AI applications. A normal Spring Boot API can be monitored using logs, metrics, traces, error rates, and response times. But an AI application needs more visibility because an AI response can be technically successful but still wrong, slow, expensive, unsafe, or poorly grounded.

A Spring AI application may call chat models, embedding models, vector databases, tools, memory stores, document pipelines, and external APIs. If any layer fails or becomes slow, the user experience becomes poor.

What is Monitoring?

Monitoring means tracking the health and performance of your application using predefined metrics.

Examples:

API response time
Error count
CPU usage
Memory usage
LLM latency
Vector search latency
Tool call failure count

What is Observability?

Observability means understanding what is happening inside the system by using logs, metrics, traces, and events.

User Request
   |
   v
Controller
   |
   v
ChatClient
   |
   v
RAG Search
   |
   v
Tool Call
   |
   v
Model Response
   |
   v
Final Answer

In a request flow like the one above, observability helps you identify which step caused the problem.

Why Is Observability Important in Spring AI?

AI applications can fail in many ways:

Model response is slow
Model provider is unavailable
Prompt is too large
Token cost is too high
RAG retrieves wrong documents
Vector database is slow
Tool call fails
Memory context is wrong
AI hallucinates an answer
Output parser fails

Spring AI Observability Architecture

Spring AI Application
      |
      +-- Metrics
      +-- Logs
      +-- Traces
      +-- Token Usage
      +-- Tool Events
      +-- RAG Events
      +-- User Feedback
      |
      v
Prometheus / Grafana / Loki / Jaeger

Core Observability Areas

Area	What to Track
Chat Model	Latency, errors, token usage, cost
Embedding Model	Embedding time, failures, dimensions
Vector Store	Search latency, empty results, similarity score
RAG	Retrieved chunks, source quality, fallback count
Tools	Tool calls, success rate, failures, authorization blocks
Memory	Conversation size, retrieval latency, memory leaks
Security	Prompt injection attempts, unsafe outputs

Real-Time Learning Platform Example

For a learning website, AI may answer questions about Java, Spring Boot, Docker, Kubernetes, Spring AI, RAG, and Agentic AI.

You should monitor:

Which topics users ask most
Which answers get poor feedback
Which RAG documents are retrieved
Which courses are recommended
How much each AI request costs
How long ChatClient takes to respond

Real-Time Banking Example

For a banking AI assistant, observability is even more important because requests touch sensitive financial data and actions must be auditable.

Track:

Transaction explanation tool calls
Unauthorized access attempts
Prompt injection attempts
Failed tool calls
Masked data usage
Audit events
Response validation failures

Real-Time E-Commerce Example

For an e-commerce AI assistant, monitor:

Order tracking tool latency
Refund policy retrieval quality
Product recommendation accuracy
Cancellation confirmation events
Customer satisfaction feedback
Fallback responses

Step 1: Add Spring Boot Actuator

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

Step 2: Add Micrometer Prometheus Registry

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

Step 3: Configure Actuator Endpoints

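# Expose only the endpoints that monitoring needs
management.endpoints.web.exposure.include=health,info,metrics,prometheus
# "always" shows component details to every caller; consider when-authorized in production
management.endpoint.health.show-details=always
# Adds an application tag to every metric
management.metrics.tags.application=spring-ai-app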

Step 4: Basic AI Metrics Service

import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Service;

@Service
public class AiMetricsService {

    private final MeterRegistry meterRegistry;

    public AiMetricsService(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    public void recordChatSuccess() {
        meterRegistry.counter("ai.chat.success").increment();
    }

    public void recordChatFailure() {
        meterRegistry.counter("ai.chat.failure").increment();
    }

    public void recordToolCall(String toolName) {
        meterRegistry.counter("ai.tool.calls", "tool", toolName).increment();
    }

    public void recordRagFallback() {
        meterRegistry.counter("ai.rag.fallback").increment();
    }
}

Step 5: Measure ChatClient Latency

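import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;

@Service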
public class ObservableChatService {

    private final ChatClient chatClient;
    private final MeterRegistry meterRegistry;

    public ObservableChatService(ChatClient.Builder builder,
                                 MeterRegistry meterRegistry) {
        this.chatClient = builder.build();
        this.meterRegistry = meterRegistry;
    }

    public String ask(String message) {

        Timer.Sample sample = Timer.start(meterRegistry);

        try {
            String response = chatClient.prompt()
                    .system("You are a helpful Spring AI assistant.")
                    .user(message)
                    .call()
                    .content();

            meterRegistry.counter("ai.chat.success").increment();

            return response;

        } catch (Exception ex) {

            meterRegistry.counter("ai.chat.failure").increment();
            throw ex;

        } finally {

            sample.stop(meterRegistry.timer("ai.chat.latency"));
        }
    }
}

Important AI Metrics

ai.chat.latency
ai.chat.success
ai.chat.failure
ai.tool.calls
ai.tool.failure
ai.rag.search.latency
ai.rag.empty.results
ai.token.usage
ai.cost.estimated

RAG Observability

RAG systems need special monitoring because poor retrieval leads to poor answers.

Track:

Number of retrieved documents
Top similarity score
Empty retrieval count
Vector search latency
Source document names
Fallback answer count

RAG Monitoring Flow

User Question
      |
      v
Vector Search
      |
      +-- Search Latency
      +-- Retrieved Count
      +-- Similarity Score
      +-- Source Documents
      |
      v
Chat Model Answer

Vector Search Metric Example

public List<Document> search(String question) {

    Timer.Sample sample = Timer.start(meterRegistry);

    try {
        List<Document> documents =
                vectorStore.similaritySearch(question);

        meterRegistry.counter("ai.rag.search.count").increment();

        if (documents.isEmpty()) {
            meterRegistry.counter("ai.rag.empty.results").increment();
        }

        return documents;

    } finally {
        sample.stop(meterRegistry.timer("ai.rag.search.latency"));
    }
}
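
Latency and empty-result counts cover only part of the RAG list above. The following is a minimal sketch of how the retrieved count and top similarity score could be summarized for one search. RetrievalStats and the minScore threshold are hypothetical names introduced for illustration, and the scores are assumed to be supplied by whatever vector store you use.

```java
import java.util.List;

/** Summary of one vector search, suitable for metrics or a structured log line. */
record RetrievalStats(int retrievedCount, double topScore, boolean empty, boolean lowConfidence) {

    /**
     * Builds the summary from the similarity scores of the retrieved chunks.
     * minScore is an assumed threshold; tune it for your embedding model.
     */
    static RetrievalStats from(List<Double> scores, double minScore) {
        boolean empty = scores.isEmpty();
        // With no results the top score defaults to 0.0
        double top = scores.stream().mapToDouble(Double::doubleValue).max().orElse(0.0);
        return new RetrievalStats(scores.size(), top, empty, empty || top < minScore);
    }
}
```

A search that returns chunks but is flagged lowConfidence is a useful signal to count separately, because it often precedes a fallback answer.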

Tool Calling Observability

Tool calls must be monitored because tools connect AI to real business systems.

Track:

Tool name
Tool success/failure
Tool latency
Unauthorized tool attempts
Missing parameter errors
High-risk action requests

Tool Call Logging Example

log.info("tool_call tool={} userIdHash={} success={} latencyMs={}",
        toolName,
        userIdHash,
        success,
        latencyMs);

Never log passwords, OTPs, full card numbers, API keys, or sensitive prompts.
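
The userIdHash field in the log line above has to come from somewhere. One possible approach, sketched here with JDK classes only, is a salted SHA-256 hash truncated for readability. LogSafeIds, the salt parameter, and the 12-character length are assumptions made for illustration; where user IDs are easy to guess, a keyed hash such as HMAC may be preferable.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

/** Hypothetical helper that turns a user ID into a log-safe identifier. */
final class LogSafeIds {

    private LogSafeIds() {
    }

    /** Returns the first 12 hex characters of SHA-256(salt + userId). */
    static String hashUserId(String userId, String salt) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest((salt + userId).getBytes(StandardCharsets.UTF_8));
            // HexFormat requires Java 17 or later
            return HexFormat.of().formatHex(hash).substring(0, 12);
        } catch (NoSuchAlgorithmException ex) {
            // Every Java platform is required to provide SHA-256
            throw new IllegalStateException("SHA-256 is not available", ex);
        }
    }
}
```

The same user always maps to the same value, so requests can still be correlated in logs without exposing the raw identifier.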

Memory Observability

Chat memory can grow and affect cost, latency, and privacy.

Track:

Conversation count
Average messages per conversation
Memory retrieval latency
Memory clear events
Token usage from history
Cross-user access attempts

Token and Cost Monitoring

AI cost can grow quickly if prompts are large or requests are repeated.

Track:

Input tokens
Output tokens
Total tokens
Average tokens per request
Estimated cost per request
Daily cost
Cost by user
Cost by feature
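
Estimated cost per request is simple arithmetic once the token counts are known. The sketch below assumes prices are quoted per one million tokens. TokenPricing is a hypothetical type, and the prices passed in are placeholders supplied by the caller, not real provider rates.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

/** Prices per one million tokens; the values are supplied by the caller. */
record TokenPricing(BigDecimal inputPerMillion, BigDecimal outputPerMillion) {

    private static final BigDecimal ONE_MILLION = BigDecimal.valueOf(1_000_000);

    /** Estimated cost = (input tokens * input price + output tokens * output price) / 1,000,000. */
    BigDecimal estimateCost(long inputTokens, long outputTokens) {
        BigDecimal input = inputPerMillion.multiply(BigDecimal.valueOf(inputTokens));
        BigDecimal output = outputPerMillion.multiply(BigDecimal.valueOf(outputTokens));
        return input.add(output).divide(ONE_MILLION, 8, RoundingMode.HALF_UP);
    }
}
```

For example, with placeholder prices of 0.15 and 0.60 per million tokens, a request with 1,000 input tokens and 500 output tokens works out to 0.00015 + 0.00030 = 0.00045. BigDecimal is used so that many small amounts can be summed into a daily cost without floating-point drift.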

Cost Control Flow

AI Request
   |
   v
Estimate Token Usage
   |
   v
Check User Quota
   |
   +-- Allowed → Process
   |
   +-- Exceeded → Reject Safely
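
The flow above can be sketched as a small in-memory quota check. TokenQuota is a hypothetical class, the daily reset is left out, and roughTokenEstimate uses the common approximation of about four characters per token, which is only a heuristic and not a real tokenizer.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical per-user token quota kept in memory (no daily reset in this sketch). */
class TokenQuota {

    private final long dailyLimit;
    private final Map<String, Long> usedByUser = new ConcurrentHashMap<>();

    TokenQuota(long dailyLimit) {
        this.dailyLimit = dailyLimit;
    }

    /** Rough heuristic: about four characters per token, rounded up. */
    static long roughTokenEstimate(String text) {
        return (text.length() + 3) / 4;
    }

    /** Reserves the tokens and returns true, or returns false if the limit would be exceeded. */
    boolean tryConsume(String userId, long estimatedTokens) {
        boolean[] allowed = {false};
        // compute() runs atomically per key, so concurrent requests cannot overspend
        usedByUser.compute(userId, (id, used) -> {
            long current = used == null ? 0L : used;
            if (current + estimatedTokens > dailyLimit) {
                return used; // leave usage unchanged when the request is rejected
            }
            allowed[0] = true;
            return current + estimatedTokens;
        });
        return allowed[0];
    }

    /** Returns the tokens consumed so far by the given user. */
    long used(String userId) {
        return usedByUser.getOrDefault(userId, 0L);
    }
}
```

When tryConsume returns false, the application can skip the model call entirely and return a safe rejection message, which is also a good place to increment a quota-exceeded counter.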

Structured Logs

Use structured logs with consistent field names instead of free-form text logs, so they can be searched and aggregated.

{
  "event": "ai_chat_request",
  "userIdHash": "abc123",
  "conversationId": "conv-789",
  "model": "gpt-4o-mini",
  "latencyMs": 1200,
  "success": true
}

Safe Logging Rules

Log metadata, not sensitive content
Mask user identifiers
Do not log raw prompts with private data
Do not log API keys
Do not log full tool payloads
Log failures with safe error messages

Distributed Tracing

Tracing helps you understand how one user request moves across services.

Request Trace
   |
   +-- API Gateway
   +-- Spring AI Service
   +-- Vector Store
   +-- Tool Service
   +-- Model Provider
   +-- Response Validator

Why Tracing Matters?

If a user says the AI is slow, tracing shows whether the delay came from:

Controller
Vector database
Tool API
Chat model
Network
Output parser

Prometheus Configuration Example

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "spring-ai-app"
    metrics_path: "/actuator/prometheus"
    static_configs:
      - targets: ["spring-ai-app:8080"]

Grafana Dashboard Panels

AI request count
Chat model latency
Chat model error rate
Vector search latency
RAG empty result count
Tool call success rate
Token usage trend
Estimated AI cost
Prompt injection attempts
Fallback response count

Alerting Rules

Create alerts for:

High model latency
High AI error rate
Vector search failures
Tool failure spikes
Too many fallback responses
Unusual token usage
Prompt injection spike
Provider outage

Example Alert Conditions

If ai.chat.failure rate > 5% for 5 minutes → Alert

If ai.chat.latency p95 > 5 seconds → Alert

If ai.rag.empty.results increases suddenly → Alert

If ai.token.usage doubles unexpectedly → Alert
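
In practice these conditions would usually be written as alerting rules in the monitoring system. The sketch below only makes the arithmetic behind the first two explicit, using a nearest-rank percentile over a window of latency samples. AlertConditions and its method names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical helper showing the arithmetic behind failure-rate and p95 alerts. */
final class AlertConditions {

    private AlertConditions() {
    }

    /** True when failures / total is strictly greater than maxRate (for example 0.05 for 5%). */
    static boolean failureRateExceeded(long failures, long total, double maxRate) {
        return total > 0 && (double) failures / total > maxRate;
    }

    /** Nearest-rank percentile: the smallest sample with at least the given percent of samples at or below it. */
    static long percentile(long[] latenciesMs, int percent) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("at least one sample is required");
        }
        if (percent < 1 || percent > 100) {
            throw new IllegalArgumentException("percent must be between 1 and 100");
        }
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        // Integer ceiling of percent * n / 100 avoids floating-point rounding
        int rank = (percent * sorted.length + 99) / 100;
        return sorted[rank - 1];
    }

    /** True when the p95 latency of the window is above the threshold. */
    static boolean p95Exceeded(long[] latenciesMs, long thresholdMs) {
        return percentile(latenciesMs, 95) > thresholdMs;
    }
}
```

With 20 samples, the p95 is the 19th smallest value, so a single slow outlier does not trigger the alert but two or more do.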

Quality Monitoring

AI quality must also be monitored.

Track:

User thumbs up/down
Reported wrong answers
Hallucination reports
Low-confidence responses
Unsupported claims
Repeated user rephrasing

User Feedback Table

ai_feedback
   |
   +-- id
   +-- user_id
   +-- conversation_id
   +-- question
   +-- answer
   +-- rating
   +-- feedback_text
   +-- created_at

Feedback Controller Example

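import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController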
@RequestMapping("/api/ai/feedback")
public class AiFeedbackController {

    private final AiFeedbackService feedbackService;

    public AiFeedbackController(AiFeedbackService feedbackService) {
        this.feedbackService = feedbackService;
    }

    @PostMapping
    public String submitFeedback(@RequestBody AiFeedbackRequest request) {
        feedbackService.save(request);
        return "Feedback submitted successfully.";
    }
}
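
The controller refers to an AiFeedbackRequest type that is not shown. One possible shape, mirroring the ai_feedback columns, is sketched below. The field names, the 1 to 5 rating scale, and the isNegative helper are assumptions made for illustration; the original design may differ, for example a simple thumbs up/down flag.

```java
/** Hypothetical request body for the feedback endpoint; the 1-5 rating scale is an assumption. */
record AiFeedbackRequest(String conversationId,
                         String question,
                         String answer,
                         int rating,
                         String feedbackText) {

    // Compact constructor: validate and normalize before the record is created
    AiFeedbackRequest {
        if (conversationId == null || conversationId.isBlank()) {
            throw new IllegalArgumentException("conversationId is required");
        }
        if (rating < 1 || rating > 5) {
            throw new IllegalArgumentException("rating must be between 1 and 5");
        }
        feedbackText = feedbackText == null ? "" : feedbackText.strip();
    }

    /** Ratings of 1 or 2 are treated as negative feedback in this sketch. */
    boolean isNegative() {
        return rating <= 2;
    }
}
```

The user ID is deliberately not part of the request body here, on the assumption that it would be taken from the authenticated session rather than trusted from the client.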

Prompt Version Monitoring

Prompt changes can affect answer quality.

Track prompt version with every AI request.

{
  "promptName": "rag-answer-prompt",
  "promptVersion": "1.0.3",
  "model": "gpt-4o-mini",
  "latencyMs": 1300
}

Why Prompt Version Tracking Matters?

Find which prompt caused poor responses
Compare old and new prompts
Rollback bad prompt changes
Debug quality regressions
Run A/B testing
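
Once the prompt version is recorded with each request, comparing versions becomes a simple aggregation. The sketch below groups rated responses by prompt version and computes the share of thumbs-up ratings. RatedResponse and PromptVersionReport are hypothetical names, and the data is assumed to come from the stored feedback.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** One rated AI response, tagged with the prompt version that produced it. */
record RatedResponse(String promptVersion, boolean thumbsUp) {
}

/** Hypothetical report comparing user approval across prompt versions. */
final class PromptVersionReport {

    private PromptVersionReport() {
    }

    /** Returns the fraction of thumbs-up ratings per prompt version, sorted by version string. */
    static Map<String, Double> approvalRateByVersion(List<RatedResponse> responses) {
        return responses.stream().collect(Collectors.groupingBy(
                RatedResponse::promptVersion,
                TreeMap::new,
                Collectors.averagingDouble(r -> r.thumbsUp() ? 1.0 : 0.0)));
    }
}
```

A drop in approval rate after a version change is a signal to investigate or roll back, although small sample sizes should be treated with caution.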

Production Observability Flow

User Request
      |
      v
Generate Trace ID
      |
      v
Validate Input
      |
      v
RAG Search
      |
      v
Tool Calls
      |
      v
Chat Model
      |
      v
Output Validation
      |
      v
Record Metrics + Logs + Feedback

Security Observability

Track AI security events:

Prompt injection attempts
Unsafe tool requests
Unauthorized document retrieval
Blocked file uploads
Unsafe output detection
Rate limit violations

Common Monitoring Mistakes

1. Monitoring Only HTTP Status

An AI response may be wrong even when the API returns HTTP 200.

2. Not Tracking Token Usage

Costs may increase silently.

3. No RAG Metrics

Wrong retrieval causes wrong answers.

4. No Tool Metrics

Tool failures break agent workflows.

5. Logging Sensitive Data

Logs can become a security risk.

Best Practices

Use Actuator and Micrometer
Expose Prometheus metrics
Track model latency and error rate
Monitor vector search quality
Track tool calls and failures
Measure token usage and cost
Use structured logs
Use distributed tracing
Collect user feedback
Track prompt versions
Alert on abnormal behavior
Never log sensitive prompts or secrets

Production Checklist

Actuator enabled
Prometheus metrics enabled
Grafana dashboard created
Chat latency tracked
Model failures tracked
RAG metrics tracked
Tool metrics tracked
Memory usage tracked
Token usage tracked
Cost estimated
Prompt version logged
User feedback collected
Security events monitored
Alerts configured

Interview Questions

Q1: Why is observability important in Spring AI?

Because AI applications can fail logically even when APIs return success. Observability helps track latency, cost, retrieval quality, tool calls, and response quality.

Q2: What should be monitored in ChatClient calls?

Latency, success rate, failure rate, token usage, model name, prompt version, and cost.

Q3: What should be monitored in RAG?

Vector search latency, retrieved document count, similarity score, empty results, source documents, and fallback responses.

Q4: Why track tool calls?

Tools connect AI to real systems, so failures, latency, unauthorized attempts, and wrong parameters must be monitored.

Q5: Why is token monitoring important?

Token usage directly affects cost, latency, and model context limits.

Advanced Interview Questions

Q1: Why is HTTP 200 not enough for AI monitoring?

The API may succeed technically, but the AI answer may still be wrong, hallucinated, unsafe, or irrelevant.

Q2: How do you detect poor RAG quality?

Monitor empty retrievals, low similarity scores, wrong source documents, user feedback, and hallucination reports.

Q3: How do you monitor AI cost?

Track input tokens, output tokens, total tokens, model used, feature name, user usage, and estimated price per request.

Q4: What is prompt version observability?

It means logging prompt names and versions with requests so response quality can be compared and bad prompt changes can be rolled back.

Q5: What security events should be monitored?

Prompt injection attempts, unsafe tool calls, unauthorized RAG access, blocked uploads, and rate limit violations.

Summary

Monitoring and observability are essential for production Spring AI applications. They help developers understand performance, reliability, cost, security, retrieval quality, and user satisfaction.

A strong observability setup should include metrics, structured logs, traces, token tracking, RAG monitoring, tool monitoring, memory monitoring, prompt version tracking, and user feedback.

For real-world applications such as learning platforms, banking assistants, e-commerce support bots, SaaS help desks, and enterprise AI agents, observability is the difference between an experimental AI demo and a trustworthy production AI system.

About the Author

Naresh Kumar