Implementing Retrieval-Augmented Generation (RAG) with Spring AI
Retrieval-Augmented Generation, commonly called RAG, is one of the most important architectures for building reliable AI applications. A normal chat model answers using the knowledge it learned during training. But enterprise applications usually need answers from private, updated, domain-specific data such as documents, FAQs, policies, product catalogs, interview questions, course content, tickets, or database records.
RAG solves this problem by retrieving relevant information from your own knowledge base and giving that context to the AI model before generating the final answer.
Spring AI provides support for RAG flows using VectorStore, embeddings, ChatClient, and Advisor APIs. The Spring AI documentation explains that RAG helps overcome LLM limitations around long-form content, factual accuracy, and context awareness, and it provides Advisor-based support such as QuestionAnswerAdvisor for common RAG workflows. ([docs.spring.io](https://docs.spring.io/spring-ai/reference/api/retrieval-augmented-generation.html))
What is RAG?
RAG stands for Retrieval-Augmented Generation.
It combines two steps:
- Retrieval: Find relevant information from your own documents or database
- Generation: Send the retrieved context to an AI model and generate an answer
Simple RAG Flow
User Question
|
v
Search Knowledge Base
|
v
Retrieve Relevant Context
|
v
Send Context + Question to LLM
|
v
Generate Grounded Answer
Why Is RAG Needed?
Without RAG, an AI model may:
- Give outdated answers
- Guess information
- Hallucinate facts
- Fail to answer company-specific questions
- Miss private business knowledge
- Provide generic responses
With RAG, the model receives real context from your documents before answering.
Normal LLM vs RAG-Based LLM
| Normal LLM | RAG-Based LLM |
|---|---|
| Uses model training knowledge | Uses retrieved enterprise knowledge |
| May hallucinate | More grounded in provided context |
| May be outdated | Can use updated documents |
| Generic answers | Domain-specific answers |
| No source tracking | Can track source documents |
Real-World Learning Platform Example
Suppose a learning platform has course content about:
- Java
- Spring Boot
- Microservices
- Docker
- Kubernetes
- Spring AI
- Agentic AI
User asks:
Which course should I learn to build scalable backend systems?
A RAG system can search existing course content, retrieve relevant lessons, and generate a personalized answer using platform-specific content.
Real-World Banking Example
A banking AI assistant may answer questions using verified banking documents.
User:
Amount was debited but UPI transaction failed. When will it be reversed?
RAG flow:
- Search failed UPI transaction policy
- Retrieve reversal timeline document
- Send policy context to chat model
- Generate grounded response
The model should not guess; it should answer only from the retrieved banking policy.
Real-World E-Commerce Example
An e-commerce AI assistant may answer refund and delivery questions.
User:
Can I return a damaged mobile phone after delivery?
RAG retrieves:
- Return policy
- Damaged product policy
- Refund processing timeline
- Replacement rules
The AI then generates a clear customer-friendly answer.
Core Components of RAG
| Component | Purpose |
|---|---|
| Documents | Knowledge source |
| Chunking | Splits large documents into smaller parts |
| Embedding Model | Converts text into vectors |
| Vector Store | Stores and searches embeddings |
| Retriever | Finds relevant chunks |
| Chat Model | Generates answer using retrieved context |
Spring AI RAG Architecture
Documents
|
v
Text Extraction
|
v
Chunking
|
v
EmbeddingModel
|
v
VectorStore
|
v
Similarity Search
|
v
ChatClient
|
v
Grounded Answer
Spring AI Building Blocks for RAG
Spring AI provides important abstractions for RAG:
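- Document: a piece of text plus its metadata
- EmbeddingModel: converts text into vectors
- VectorStore: stores embeddings and runs similarity search
- ChatClient: fluent API for sending prompts to a chat model
- QuestionAnswerAdvisor: an advisor that retrieves context from a vector store and adds it to the chat request
- VectorStoreRetriever: a read-only view of a vector store that exposes similarity search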
Spring AI also provides VectorStoreRetriever, a read-only view of a vector store that exposes similarity search functionality. This is useful in RAG applications where the application only needs retrieval access and should not modify vector data. ([docs.spring.io](https://docs.spring.io/spring-ai/reference/api/vectordbs.html))
Step 1: Choose a Vector Database
RAG needs a vector store to save and search embeddings.
Common options:
- PGVector
- Pinecone
- MongoDB Atlas Vector Search
- Redis Vector Search
- Qdrant
- Milvus
- Weaviate
- Elasticsearch
- OpenSearch
When to Use PGVector?
- You already use PostgreSQL
- Your dataset is small to medium
- You want simple local development
- You prefer SQL-based infrastructure
When to Use Pinecone or Cloud Vector DB?
- You need managed vector infrastructure
- You expect large-scale vector search
- You want easier scaling
- You need production-ready cloud vector retrieval
Step 2: Add Dependencies
Example using OpenAI and PGVector:
```xml
<dependencyManagement>
    <dependencies>
        <!-- The BOM keeps all Spring AI module versions aligned -->
        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-bom</artifactId>
            <version>1.0.0</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>

<dependencies>
    <!-- OpenAI chat and embedding models -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-starter-model-openai</artifactId>
    </dependency>
    <!-- PGVector-backed VectorStore -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-starter-vector-store-pgvector</artifactId>
    </dependency>
    <!-- PostgreSQL JDBC driver -->
    <dependency>
        <groupId>org.postgresql</groupId>
        <artifactId>postgresql</artifactId>
        <scope>runtime</scope>
    </dependency>
</dependencies>
```
Advisor Dependency for Spring AI RAG
Spring AI documentation says QuestionAnswerAdvisor and VectorStoreChatMemoryAdvisor require the spring-ai-advisors-vector-store dependency. ([docs.spring.io](https://docs.spring.io/spring-ai/reference/api/retrieval-augmented-generation.html))
```xml
<!-- Add inside <dependencies>; provides QuestionAnswerAdvisor -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-advisors-vector-store</artifactId>
</dependency>
```
Step 3: Configure application.properties
```properties
spring.application.name=spring-ai-rag-demo

# Chat model
spring.ai.model.chat=openai
spring.ai.openai.api-key=${OPENAI_API_KEY}
spring.ai.openai.chat.options.model=gpt-4o-mini

# Embedding model (used when documents are added and when queries are searched)
spring.ai.model.embedding=openai
spring.ai.openai.embedding.options.model=text-embedding-3-small

# PostgreSQL with the pgvector extension
spring.datasource.url=jdbc:postgresql://localhost:5432/spring_ai
spring.datasource.username=postgres
spring.datasource.password=postgres

# Vector store table and index settings
spring.ai.vectorstore.pgvector.initialize-schema=true
spring.ai.vectorstore.pgvector.index-type=HNSW
spring.ai.vectorstore.pgvector.distance-type=COSINE_DISTANCE
spring.ai.vectorstore.pgvector.dimensions=1536
```
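The dimensions value must match the output dimension of the embedding model. text-embedding-3-small returns 1536-dimensional vectors by default, which is why dimensions is set to 1536.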
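The following plain-Java sketch illustrates the COSINE_DISTANCE setting and the dimension rule. It uses no Spring AI types, and VectorMath and its methods are hypothetical helpers. Cosine distance is computed as 1 minus cosine similarity, which gives 0 for vectors pointing the same way and 1 for orthogonal ones. requireDimensions fails fast before a vector of the wrong size reaches the store.

```java
// Hypothetical helper for illustration; not part of Spring AI.
final class VectorMath {

    private VectorMath() {}

    // Fails fast when a vector does not have the size the store was configured for.
    static void requireDimensions(float[] vector, int expected) {
        if (vector.length != expected) {
            throw new IllegalArgumentException(
                    "Expected " + expected + " dimensions but got " + vector.length);
        }
    }

    // Cosine similarity: dot(a, b) / (|a| * |b|), in the range [-1, 1].
    static double cosineSimilarity(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            throw new IllegalArgumentException("A zero vector has no direction");
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Cosine distance: 0 means same direction, 1 means orthogonal.
    static double cosineDistance(float[] a, float[] b) {
        return 1.0 - cosineSimilarity(a, b);
    }
}
```

For example, calling VectorMath.requireDimensions(embedding, 1536) before writing to the store turns a dimension mismatch into an explicit application error.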
Step 4: Start PGVector with Docker
```shell
docker run --name pgvector-db \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=spring_ai \
  -p 5432:5432 \
  -d pgvector/pgvector:pg16
```
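With initialize-schema=true, Spring AI's PGVector store attempts to create the vector extension itself during schema initialization. The manual step is therefore mainly needed when schema initialization is disabled. To enable the extension manually:
```shell
docker exec -it pgvector-db psql -U postgres -d spring_ai \
  -c "CREATE EXTENSION IF NOT EXISTS vector;"
```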
Step 5: Create Document Ingestion Service
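The ingestion service adds documents to the vector store. When add is called, the vector store uses the configured EmbeddingModel to embed each document's text, then saves the text, metadata, and vector together.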
```java
package com.dhanish.rag.service;

import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;

import java.util.List;
import java.util.Map;

@Service
public class DocumentIngestionService {

    private final VectorStore vectorStore;

    public DocumentIngestionService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    public void loadSampleDocuments() {
        Document doc1 = new Document(
                "Spring AI helps Java developers build AI applications using chat models, embeddings, vector stores, and RAG.",
                Map.of(
                        "source", "spring-ai-course",
                        "topic", "spring-ai"
                )
        );

        Document doc2 = new Document(
                "Retrieval-Augmented Generation retrieves relevant documents from a vector database and uses them as context for a chat model.",
                Map.of(
                        "source", "rag-guide",
                        "topic", "rag"
                )
        );

        Document doc3 = new Document(
                "PGVector is a PostgreSQL extension used to store and search vector embeddings for semantic search.",
                Map.of(
                        "source", "pgvector-guide",
                        "topic", "vector-database"
                )
        );

        vectorStore.add(List.of(doc1, doc2, doc3));
    }
}
```
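As written, every call to loadSampleDocuments() adds the same three texts again, for example when /load is called twice. This happens because a Document created without an ID gets a newly generated random ID by default.

A minimal sketch, assuming you want re-ingestion to be repeatable:
- Derive a deterministic UUID from the source and the text.
- Pass it to the Document(String id, String text, Map<String, Object> metadata) constructor.
- Use a UUID so the value also fits stores that key rows by UUID, such as the default PGVector schema.

Whether a repeated ID overwrites or conflicts with the existing row depends on the vector store implementation, so verify that before relying on it. StableIds is a hypothetical helper.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical helper for illustration; not part of Spring AI.
final class StableIds {

    private StableIds() {}

    // Derives a deterministic, name-based (version 3) UUID from the source and chunk text,
    // so re-ingesting the same chunk yields the same ID instead of a new random one.
    static String idFor(String source, String text) {
        // The separator keeps ("ab", "c") and ("a", "bc") from producing the same key.
        String key = source + "\0" + text;
        return UUID.nameUUIDFromBytes(key.getBytes(StandardCharsets.UTF_8)).toString();
    }
}
```

Usage in the ingestion service would look like new Document(StableIds.idFor("rag-guide", text), text, Map.of("source", "rag-guide")).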
Step 6: Create Search Service
```java
package com.dhanish.rag.service;

import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;

import java.util.List;

@Service
public class SemanticSearchService {

    private final VectorStore vectorStore;

    public SemanticSearchService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    public List<Document> search(String question) {
        return vectorStore.similaritySearch(question);
    }
}
```
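similaritySearch(String) uses default search settings. To control how many chunks come back and how weak a match may be, pass a request instead. For example, `vectorStore.similaritySearch(SearchRequest.builder().query(question).topK(4).similarityThreshold(0.5).build())` sets both limits. The values 4 and 0.5 are illustrative starting points to tune, not recommendations.

The plain-Java sketch below shows what those two settings do to a list of already-scored results. TopKSelector and ScoredChunk are hypothetical names.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration; not part of Spring AI.
final class TopKSelector {

    private TopKSelector() {}

    // A retrieved chunk with its similarity score (higher means more similar).
    record ScoredChunk(String text, double score) {}

    // Keeps chunks scoring at least `threshold`, best first, and at most `topK` of them.
    static List<ScoredChunk> select(List<ScoredChunk> candidates, int topK, double threshold) {
        return candidates.stream()
                .filter(c -> c.score() >= threshold)
                .sorted(Comparator.comparingDouble(ScoredChunk::score).reversed())
                .limit(topK)
                .toList();
    }
}
```

Raising the threshold trades recall for precision. Lowering topK reduces prompt size but risks dropping the one chunk that holds the answer.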
Step 7: Manual RAG with ChatClient
This approach gives full control over the RAG prompt.
```java
package com.dhanish.rag.service;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;

import java.util.List;
import java.util.stream.Collectors;

@Service
public class ManualRagService {

    private final VectorStore vectorStore;
    private final ChatClient chatClient;

    public ManualRagService(VectorStore vectorStore,
                            ChatClient.Builder builder) {
        this.vectorStore = vectorStore;
        this.chatClient = builder.build();
    }

    public String answer(String question) {
        // 1. Retrieve the chunks most similar to the question
        List<Document> documents =
                vectorStore.similaritySearch(question);

        // 2. Join their text into one context string
        String context = documents.stream()
                .map(Document::getText)
                .collect(Collectors.joining("\n\n"));

        // 3. Ask the model to answer from that context only
        return chatClient.prompt()
                .system("""
                        You are a helpful AI assistant.
                        Rules:
                        1. Answer only using the provided context.
                        2. Do not guess.
                        3. If the answer is not in the context, say:
                           I do not have enough information.
                        4. Keep the answer clear and practical.
                        """)
                .user("""
                        Context:
                        %s

                        Question:
                        %s
                        """.formatted(context, question))
                .call()
                .content();
    }
}
```
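The manual service joins raw chunk text, so the model cannot tell which chunk came from which document, and a large result set can crowd the prompt. A minimal sketch of one alternative, assuming chunks arrive best-first:
- Label each chunk with its source metadata.
- Stop adding chunks once a character budget is reached. Characters are a rough stand-in for tokens.

ContextBuilder and Chunk are hypothetical. With Spring AI you would feed it doc.getText() and doc.getMetadata().

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Spring AI.
final class ContextBuilder {

    private ContextBuilder() {}

    // Minimal stand-in for a retrieved chunk: text plus metadata.
    record Chunk(String text, Map<String, Object> metadata) {}

    // Labels each chunk with its source and stops before the context exceeds maxChars.
    static String build(List<Chunk> chunks, int maxChars) {
        StringBuilder context = new StringBuilder();
        for (Chunk chunk : chunks) {
            Object source = chunk.metadata().getOrDefault("source", "unknown");
            String block = "[source: " + source + "]\n" + chunk.text() + "\n\n";
            if (context.length() + block.length() > maxChars) {
                break; // chunks arrive best-first, so the weakest ones are dropped
            }
            context.append(block);
        }
        return context.toString().strip();
    }
}
```

If even the first chunk exceeds the budget, the result is empty. Treat that case like an empty retrieval rather than sending an empty context to the model.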
Step 8: RAG with QuestionAnswerAdvisor
Spring AI also supports Advisor-based RAG. Advisors can enrich a ChatClient request with retrieved context automatically.
```java
package com.dhanish.rag.service;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.client.advisor.vectorstore.QuestionAnswerAdvisor;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;

@Service
public class AdvisorRagService {

    private final ChatClient chatClient;

    public AdvisorRagService(ChatClient.Builder builder,
                             VectorStore vectorStore) {
        this.chatClient = builder
                .defaultAdvisors(new QuestionAnswerAdvisor(vectorStore))
                .build();
    }

    public String answer(String question) {
        return chatClient.prompt()
                .user(question)
                .call()
                .content();
    }
}
```
The Advisor approach is cleaner for standard RAG flows, while manual RAG is useful when you want complete control over prompt format and retrieval behavior.
Step 9: Create REST Controller
```java
package com.dhanish.rag.controller;

import com.dhanish.rag.service.DocumentIngestionService;
import com.dhanish.rag.service.ManualRagService;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/api/rag")
public class RagController {

    private final DocumentIngestionService ingestionService;
    private final ManualRagService ragService;

    public RagController(DocumentIngestionService ingestionService,
                         ManualRagService ragService) {
        this.ingestionService = ingestionService;
        this.ragService = ragService;
    }

    @PostMapping("/load")
    public String load() {
        ingestionService.loadSampleDocuments();
        return "Documents loaded into vector store successfully.";
    }

    @GetMapping("/ask")
    public String ask(@RequestParam String question) {
        return ragService.answer(question);
    }
}
```
Step 10: Test RAG APIs
Load Documents
```shell
curl -X POST http://localhost:8080/api/rag/load
```
Ask Question
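The question contains spaces and a question mark, so let curl URL-encode it:
```shell
curl -G "http://localhost:8080/api/rag/ask" \
  --data-urlencode "question=What is RAG used for?"
```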
Expected RAG Request Flow
Client
|
v
/api/rag/ask
|
v
RagController
|
v
ManualRagService
|
v
VectorStore Similarity Search
|
v
Relevant Documents
|
v
ChatClient
|
v
Grounded Answer
Document Chunking Strategy
Real-world documents are usually large. Do not store an entire PDF or article as one vector.
Split documents into meaningful chunks:
- By heading
- By paragraph
- By section
- By page
- By topic
Chunking Flow
Large Document
|
+-- Chunk 1: Introduction
+-- Chunk 2: Architecture
+-- Chunk 3: Code Example
+-- Chunk 4: Best Practices
+-- Chunk 5: Troubleshooting
Why Does Chunking Matter?
- Improves retrieval accuracy
- Reduces irrelevant context
- Controls token usage
- Improves answer grounding
- Makes source tracking easier
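Spring AI ships a TokenTextSplitter (package org.springframework.ai.transformer.splitter) that splits documents by token count. The plain-Java sketch below shows the idea behind paragraph-aware chunking:
- Paragraphs are packed into chunks of at most maxChars characters.
- The last paragraph of each chunk is repeated at the start of the next as overlap.

ParagraphChunker is hypothetical, and character counts stand in for tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not Spring AI's TokenTextSplitter.
final class ParagraphChunker {

    private ParagraphChunker() {}

    // Packs blank-line-separated paragraphs into chunks of at most maxChars characters.
    // When a chunk is closed, its last paragraph is repeated at the start of the next
    // chunk (if it fits) so text near a boundary keeps some surrounding context.
    static List<String> chunk(String document, int maxChars) {
        List<String> paragraphs = new ArrayList<>();
        for (String p : document.split("\\n\\s*\\n")) {
            String trimmed = p.strip();
            // Hard-split any paragraph that is longer than a whole chunk.
            for (int i = 0; i < trimmed.length(); i += maxChars) {
                paragraphs.add(trimmed.substring(i, Math.min(trimmed.length(), i + maxChars)));
            }
        }

        List<String> chunks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int currentLength = 0;
        for (String paragraph : paragraphs) {
            int added = paragraph.length() + (current.isEmpty() ? 0 : 2); // "\n\n" separator
            if (!current.isEmpty() && currentLength + added > maxChars) {
                chunks.add(String.join("\n\n", current));
                String overlap = current.get(current.size() - 1);
                current = new ArrayList<>();
                currentLength = 0;
                if (overlap.length() + 2 + paragraph.length() <= maxChars) {
                    current.add(overlap);
                    currentLength = overlap.length();
                }
                added = paragraph.length() + (current.isEmpty() ? 0 : 2);
            }
            current.add(paragraph);
            currentLength += added;
        }
        if (!current.isEmpty()) {
            chunks.add(String.join("\n\n", current));
        }
        return chunks;
    }
}
```

Splitting on blank lines keeps sentences intact. A heading-based splitter follows the same packing logic with a different split pattern.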
Metadata Strategy
Always store metadata with chunks.
```json
{
  "source": "spring-ai-rag-guide",
  "topic": "rag",
  "module": "spring-ai",
  "page": 3,
  "tenantId": "dhanish-empower"
}
```
Metadata helps:
- Filtering
- Source tracking
- Citations
- Debugging
- Tenant isolation
- Access control
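Metadata filtering keeps only chunks whose metadata matches the request. In Spring AI the filter is usually passed as a portable expression, which the vector store translates into its own query. For example: `SearchRequest.builder().query(question).filterExpression("topic == 'rag' && module == 'spring-ai'")`.

The sketch below only illustrates the semantics of such an AND-of-equalities filter over plain maps. MetadataFilter is a hypothetical helper and does not parse Spring AI's expression syntax.

```java
import java.util.Map;
import java.util.Objects;

// Hypothetical helper for illustration; not part of Spring AI.
final class MetadataFilter {

    private MetadataFilter() {}

    // True when every required key is present in the chunk metadata with an equal value
    // (an AND of equality conditions).
    static boolean matches(Map<String, Object> metadata, Map<String, Object> required) {
        for (Map.Entry<String, Object> condition : required.entrySet()) {
            if (!Objects.equals(metadata.get(condition.getKey()), condition.getValue())) {
                return false;
            }
        }
        return true;
    }
}
```

Objects.equals is type-sensitive. A page number stored as a Long will not match a required Integer 3, so keep metadata types consistent at ingestion time.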
RAG Prompt Template
```text
You are a helpful assistant.
Use only the provided context.

Context:
{context}

Question:
{question}

Rules:
1. Do not guess.
2. If the answer is missing from the context, say you do not have enough information.
3. Keep the answer clear.
4. Mention the source if available.
```
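Spring AI's PromptTemplate renders {name} placeholders like the ones in this template. The plain-Java sketch below shows the substitution itself; RagPromptTemplate.fill is a hypothetical helper, not the Spring AI class. It replaces placeholders in a single pass over the template. Braces or $ characters that appear inside retrieved context are therefore inserted literally instead of being expanded again.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not Spring AI's PromptTemplate.
final class RagPromptTemplate {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    private RagPromptTemplate() {}

    // Replaces {name} placeholders in one pass over the template, so text inside the
    // substituted values is never scanned for placeholders a second time.
    static String fill(String template, Map<String, String> values) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        return matcher.replaceAll(match -> {
            String value = values.get(match.group(1));
            if (value == null) {
                throw new IllegalArgumentException("No value for {" + match.group(1) + "}");
            }
            // quoteReplacement stops '$' and '\' in the value being read as group references.
            return Matcher.quoteReplacement(value);
        });
    }
}
```

Failing on a missing value catches template typos early instead of silently sending a literal {context} to the model.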
Good RAG Answer Behavior
| Situation | Correct Behavior |
|---|---|
| Context contains answer | Answer using context |
| Context does not contain answer | Say information is unavailable |
| Context is conflicting | Mention uncertainty |
| User asks unrelated question | Do not hallucinate |
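The "Context does not contain answer" row can be enforced partly in code rather than left entirely to the prompt. When retrieval returns nothing, skip the model call and return a fixed fallback. In this minimal sketch, GroundedAnswerer is hypothetical, and a Function<String, String> stands in for the chat model call (for example, a ChatClient prompt).

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical helper for illustration; not part of Spring AI.
final class GroundedAnswerer {

    static final String FALLBACK = "I do not have enough information to answer that.";

    private GroundedAnswerer() {}

    // Calls the model only when retrieval produced some context; otherwise returns a fixed
    // fallback, so the model never gets the chance to answer from training data alone.
    static String answer(String question, List<String> retrievedChunks,
                         Function<String, String> model) {
        if (retrievedChunks.isEmpty()) {
            return FALLBACK;
        }
        String prompt = "Context:\n" + String.join("\n\n", retrievedChunks)
                + "\n\nQuestion:\n" + question;
        return model.apply(prompt);
    }
}
```

Skipping the call also saves tokens on questions the knowledge base cannot answer.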
RAG for AI Agents
Agentic AI systems use RAG to retrieve knowledge before deciding actions.
User Goal
|
v
Agent Understands Intent
|
v
RAG Retrieves Relevant Knowledge
|
v
Agent Plans Action
|
v
Tool Execution
|
v
Final Response
RAG with Tool Calling
Some workflows need both RAG and tools.
Example
User:
Where is my order and what is the refund policy?
System flow:
- Call Order API for live order status
- Use RAG to retrieve refund policy
- Generate combined answer
Hybrid RAG Architecture
User Question
|
+-- Tool Call for Live Data
|
+-- Vector Search for Policy Data
|
v
Combined Context
|
v
Chat Model
|
v
Final Answer
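Combining both sources works best when the model can tell live data apart from policy text. This minimal sketch assumes the order status is fetched before the prompt is built. HybridContext and its OrderStatusTool interface are hypothetical stand-ins for the Order API call or a tool the model invokes.

```java
import java.util.List;

// Hypothetical helper for illustration; not part of Spring AI.
final class HybridContext {

    private HybridContext() {}

    // Hypothetical stand-in for a live Order API call.
    interface OrderStatusTool {
        String statusFor(String orderId);
    }

    // Builds one context with clearly separated sections for live data and retrieved policy.
    static String build(String orderId, OrderStatusTool orders, List<String> policyChunks) {
        StringBuilder context = new StringBuilder();
        context.append("Live order data:\n")
               .append("Order ").append(orderId).append(": ").append(orders.statusFor(orderId))
               .append("\n\nRetrieved policy:\n");
        if (policyChunks.isEmpty()) {
            context.append("(no matching policy found)");
        } else {
            context.append(String.join("\n\n", policyChunks));
        }
        return context.toString();
    }
}
```

An explicit "no matching policy" marker lets the prompt rules tell the model to say the policy is unavailable instead of inventing one.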
Security in RAG
RAG systems can expose sensitive documents if access control is missing.
Always apply:
- User authentication
- Authorization checks
- Tenant filters
- Metadata filters
- Safe logging
- Document-level access control
Safe RAG Retrieval Flow
User Request
|
v
Authenticate User
|
v
Check Permissions
|
v
Apply Tenant Filter
|
v
Vector Search
|
v
Allowed Documents Only
Multi-Tenant RAG Example
In a SaaS application:
Tenant A user → search only Tenant A documents
Tenant B user → search only Tenant B documents
Without tenant isolation, one user may receive another customer’s data.
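The tenant filter has to come from the authenticated identity on the server, never from a request parameter the user controls. The sketch below builds a Spring AI-style text filter such as tenantId == 'dhanish-empower'. It rejects tenant IDs outside a conservative character whitelist, so a crafted value cannot change the expression.

The resulting string could be passed to `SearchRequest.builder().filterExpression(...)`, or per request through the QuestionAnswerAdvisor.FILTER_EXPRESSION advisor parameter. TenantFilters and the whitelist are assumptions, not Spring AI APIs.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Spring AI.
final class TenantFilters {

    // Conservative whitelist (an assumption): letters, digits, underscore, hyphen.
    private static final Pattern SAFE_TENANT_ID = Pattern.compile("[A-Za-z0-9_-]{1,64}");

    private TenantFilters() {}

    // Builds a filter expression for the authenticated user's tenant.
    // The caller must take tenantId from the security context, not from request input.
    static String forTenant(String tenantId) {
        if (tenantId == null || !SAFE_TENANT_ID.matcher(tenantId).matches()) {
            throw new IllegalArgumentException("Invalid tenant ID");
        }
        return "tenantId == '" + tenantId + "'";
    }
}
```

Rejecting unusual IDs is simpler and safer than trying to escape quotes inside the expression.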
Evaluating RAG Quality
RAG quality depends on retrieval quality and answer quality.
Measure:
- Did retrieval find the right documents?
- Did the answer use retrieved context?
- Did the answer avoid guessing?
- Was the answer clear?
- Was the source correct?
RAG Evaluation Dataset Example
| Question | Expected Source | Expected Behavior |
|---|---|---|
| What is PGVector? | pgvector-guide | Explain PGVector |
| What is the refund timeline? | refund-policy | Answer from policy |
| What is the CEO's salary? | None | Say not enough information |
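The retrieval half of this table can be checked automatically. For each question with an expected source, test whether that source appears among the retrieved results; this is a hit rate, often called recall@k. Rows with no expected source, like the CEO salary question, are judged on answer behavior instead, so the sketch skips them. RetrievalEval is hypothetical, and retrieveSources stands in for a top-k search that returns each chunk's source metadata.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical helper for illustration; not part of Spring AI.
final class RetrievalEval {

    private RetrievalEval() {}

    // expectedSource is null when no document should answer the question.
    record EvalCase(String question, String expectedSource) {}

    // Fraction of answerable cases whose expected source appears among the retrieved sources.
    static double hitRate(List<EvalCase> cases, Function<String, List<String>> retrieveSources) {
        int answerable = 0;
        int hits = 0;
        for (EvalCase c : cases) {
            if (c.expectedSource() == null) {
                continue; // judged on answer behavior (refusal), not on retrieval
            }
            answerable++;
            if (retrieveSources.apply(c.question()).contains(c.expectedSource())) {
                hits++;
            }
        }
        return answerable == 0 ? 0.0 : (double) hits / answerable;
    }
}
```

Running this after every chunking or embedding change shows whether retrieval improved before you look at generated answers.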
Monitoring RAG in Production
Track:
- Vector search latency
- Embedding generation time
- Empty retrieval count
- Average similarity score
- Top-K relevance
- RAG fallback rate
- User feedback score
- Hallucination reports
- Token usage
- Cost per answer
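Several of these metrics can be recorded per request and aggregated. Below is a minimal in-memory sketch for two of them: the empty-retrieval rate and the average best similarity score. RagMetrics is hypothetical. In production the same values would typically be published to a metrics system such as Micrometer rather than held in fields.

```java
import java.util.List;

// Hypothetical helper for illustration; not part of Spring AI.
final class RagMetrics {

    private long requests;
    private long emptyRetrievals;
    private double topScoreSum;
    private long scoredRequests;

    // Records the similarity scores returned for one request, in any order.
    synchronized void record(List<Double> scores) {
        requests++;
        if (scores.isEmpty()) {
            emptyRetrievals++;
            return;
        }
        topScoreSum += scores.stream().mapToDouble(Double::doubleValue).max().orElseThrow();
        scoredRequests++;
    }

    // Share of requests for which retrieval returned nothing.
    synchronized double emptyRetrievalRate() {
        return requests == 0 ? 0.0 : (double) emptyRetrievals / requests;
    }

    // Average best similarity score per request, over requests that returned anything.
    synchronized double averageTopScore() {
        return scoredRequests == 0 ? 0.0 : topScoreSum / scoredRequests;
    }
}
```

A rising empty-retrieval rate or a falling average top score often points to missing documents or a changed embedding model.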
Production RAG Monitoring Flow
RAG Request
|
+-- Retrieval Metrics
+-- Prompt Metrics
+-- LLM Metrics
+-- Answer Quality Feedback
|
v
Observability Dashboard
Common RAG Mistakes
1. Poor Chunking
Large or random chunks reduce retrieval quality.
2. No Metadata
Difficult to filter, cite, and debug.
3. No Access Control
Sensitive data may leak.
4. Letting the Model Guess
RAG should instruct the model to avoid unsupported answers.
5. Too Many Retrieved Chunks
This increases token cost and may confuse the model.
6. Not Updating Embeddings
Changed documents require updated embeddings.
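One way to find changed documents is to store a hash of each chunk's text in its metadata at ingestion time and compare it on the next run. This minimal sketch uses SHA-256 from the JDK. ContentHashes is hypothetical, and the metadata key you store the hash under is up to you.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Hypothetical helper for illustration; not part of Spring AI.
final class ContentHashes {

    private ContentHashes() {}

    // Hex-encoded SHA-256 of the text; store it as chunk metadata when ingesting.
    static String sha256(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(digest.digest(text.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 must be available on every Java platform", e);
        }
    }

    // True when the stored hash is missing or differs, meaning the chunk must be re-embedded.
    static boolean needsReembedding(String currentText, String storedHash) {
        return storedHash == null || !storedHash.equals(sha256(currentText));
    }
}
```

Only chunks whose hash changed need a new embedding call, which keeps re-indexing cheap for large, mostly stable document sets.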
Best Practices for RAG
- Use meaningful chunks
- Add metadata to every document
- Use consistent embedding models
- Use access control before retrieval
- Keep top-k reasonable
- Use clear RAG prompts
- Monitor retrieval quality
- Evaluate answers with real user questions
- Track source documents
- Re-index changed documents
- Use fallback responses when context is missing
Production RAG Architecture
Frontend
|
v
Spring Boot AI API
|
+-- Authentication
+-- Document Retrieval
+-- VectorStore
+-- Prompt Builder
+-- ChatClient
+-- Response Validator
+-- Monitoring
|
v
Grounded Answer
Common Errors and Fixes
1. Empty Answers
Possible causes:
- No documents loaded
- Poor similarity match
- Wrong vector store configuration
- Embedding model mismatch
2. Wrong Answers
Possible causes:
- Irrelevant chunks retrieved
- Weak prompt instructions
- Too much unrelated context
- Documents are outdated
3. Dimension Mismatch
Vector store dimension must match embedding model dimension.
4. Slow RAG Response
Possible fixes:
- Optimize vector index
- Reduce top-k
- Use caching
- Use a faster model
- Shorten context
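For the caching fix, a small least-recently-used (LRU) cache is often enough for repeated embeddings or answers. This minimal sketch is built on LinkedHashMap in access order, and LruCache is hypothetical. Cache answers only when the key captures everything that changes the result, such as the tenant and the document version. Otherwise one user may receive another user's cached answer.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper for illustration; not part of Spring AI.
final class LruCache<K, V> {

    private final Map<K, V> entries;

    LruCache(int maxEntries) {
        // accessOrder = true keeps iteration order least-recently-used first,
        // so removeEldestEntry evicts the LRU entry once the cache grows past maxEntries.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Returns the cached value, or computes, stores, and returns it.
    synchronized V get(K key, Function<K, V> compute) {
        V value = entries.get(key);
        if (value == null) {
            value = compute.apply(key);
            entries.put(key, value);
        }
        return value;
    }

    synchronized int size() {
        return entries.size();
    }
}
```

A key such as tenantId + ":" + normalizedQuestion is one way to keep tenants from sharing cached answers.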
5. Hallucination Despite RAG
Fix:
- Strengthen prompt rules
- Use better retrieval
- Add answer validation
- Reject unsupported claims
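Answer validation can start with cheap structural checks. Assume the prompt asks the model to cite sources in a fixed form such as [source: rag-guide]; this is an assumed convention, not something Spring AI enforces. The application can then reject answers that cite nothing, or that cite a source not retrieved for this request. CitationValidator is hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Spring AI.
final class CitationValidator {

    // Matches citations written as [source: some-name].
    private static final Pattern CITATION = Pattern.compile("\\[source:\\s*([^\\]]+)\\]");

    private CitationValidator() {}

    // Accepts an answer only if it cites at least one source and every cited source
    // is among the sources that were actually retrieved for this request.
    static boolean isSupported(String answer, Set<String> retrievedSources) {
        Set<String> cited = new LinkedHashSet<>();
        Matcher matcher = CITATION.matcher(answer);
        while (matcher.find()) {
            cited.add(matcher.group(1).strip());
        }
        return !cited.isEmpty() && retrievedSources.containsAll(cited);
    }
}
```

A fallback reply such as "I do not have enough information" contains no citation, so check for it before running this validator.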
Interview Questions
Q1: What is RAG?
RAG stands for Retrieval-Augmented Generation. It retrieves relevant external knowledge and gives it to a language model to generate grounded answers.
Q2: Why is RAG needed?
RAG helps AI systems answer using updated, private, and domain-specific data instead of relying only on model training knowledge.
Q3: What are the main components of RAG?
Documents, chunking, embeddings, vector store, retriever, prompt builder, and chat model.
Q4: What is the role of VectorStore in RAG?
VectorStore stores embeddings and retrieves semantically similar documents for the user question.
Q5: What is QuestionAnswerAdvisor?
QuestionAnswerAdvisor is a Spring AI Advisor that supports common RAG flows by retrieving relevant context from a vector store and adding it to the chat request.
Advanced Interview Questions
Q1: Manual RAG vs Advisor-based RAG?
Manual RAG gives full control over retrieval and prompt construction, while Advisor-based RAG provides cleaner integration for common use cases.
Q2: How do you reduce hallucination in RAG?
Use high-quality retrieval, strong prompts, response validation, source tracking, and instructions to avoid guessing.
Q3: Why is metadata important in RAG?
Metadata supports filtering, tenant isolation, source citation, debugging, and access control.
Q4: What happens if the embedding model changes?
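Existing vectors must be regenerated with the new model. Vectors from different embedding models are not comparable even when their dimensions happen to match, and a model with a different dimension will not fit the store's configured vector size.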
Q5: How do you secure multi-tenant RAG?
Authenticate users, enforce authorization, apply tenant metadata filters, and retrieve only allowed documents.
Recommended Learning Path
- Introduction to Spring AI
- Introduction to Embeddings
- Vector Databases and Vector Stores
- Integrating PGVector with Spring AI
- Integrating Pinecone and Cloud Vector Databases
- Implementing RAG with Spring AI
- Java AI Agents
Summary
Retrieval-Augmented Generation is a powerful architecture for building reliable AI applications that answer using your own knowledge base. Instead of depending only on the model’s internal knowledge, RAG retrieves relevant documents and uses them as context for the final answer.
In Spring AI, RAG can be implemented using embeddings, VectorStore, ChatClient, and Advisor APIs such as QuestionAnswerAdvisor.
For production systems such as learning platforms, banking assistants, e-commerce support bots, SaaS knowledge bases, and enterprise AI agents, RAG improves factual accuracy, context awareness, user trust, and answer quality.
A strong RAG system depends on good chunking, high-quality embeddings, secure retrieval, metadata filtering, clear prompts, monitoring, and regular evaluation.