Building Document Readers and ETL Pipelines in Spring AI
Modern AI applications need reliable knowledge before they can provide useful answers. A chat model alone may answer general questions, but enterprise applications usually need answers from documents such as PDFs, Word files, web pages, FAQs, policies, course lessons, product manuals, support tickets, and database records.
To make these documents useful for AI applications, we need a proper document processing flow. This flow is commonly called an ETL pipeline.
ETL stands for:
- Extract: Read content from documents or data sources
- Transform: Clean, split, enrich, and prepare the content
- Load: Store the processed content into a vector database or knowledge store
In Spring AI, document readers and ETL pipelines are the foundation for Retrieval-Augmented Generation (RAG), semantic search, AI knowledge assistants, customer support bots, and agentic AI systems.
What is a Document Reader?
A document reader is a component that reads content from a source and converts it into a format that the AI system can process.
A source can be:
- PDF file
- Text file
- Markdown file
- HTML page
- Word document
- CSV file
- Database record
- API response
- Website content
The reader extracts useful text and converts it into Document objects.
What is an ETL Pipeline?
An ETL pipeline is a structured process that moves data from raw source format into a usable AI-ready format.
Raw Documents
|
v
Extract Text
|
v
Transform and Clean
|
v
Split into Chunks
|
v
Generate Embeddings
|
v
Load into Vector Store
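The flow above can be sketched in plain Java as three composable stages. Spring AI models these stages with its DocumentReader, DocumentTransformer, and DocumentWriter interfaces, which build on the same java.util.function types; the class and method names below (EtlPipelineSketch, run, trimAndDropEmpty) are hypothetical and only illustrate the shape of the pipeline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Plain-Java sketch of the Extract -> Transform -> Load stages.
// EtlPipelineSketch and its methods are hypothetical, not Spring AI classes.
class EtlPipelineSketch {

    // Runs the three stages in order: read, transform, write.
    static <T> void run(Supplier<List<T>> reader,
                        Function<List<T>, List<T>> transformer,
                        Consumer<List<T>> writer) {
        writer.accept(transformer.apply(reader.get()));
    }

    // Example transform: trim each text and drop entries that end up empty.
    static List<String> trimAndDropEmpty(List<String> texts) {
        List<String> result = new ArrayList<>();
        for (String text : texts) {
            String trimmed = text.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }
}
```

Keeping each stage behind a small functional interface means a reader, a cleaner, or a writer can be swapped without touching the other two.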
Why Are ETL Pipelines Important for AI?
Poor document processing leads to poor AI answers. Even if the language model is powerful, it cannot answer correctly if the retrieved content is messy, outdated, incomplete, or poorly chunked.
A good ETL pipeline improves:
- Retrieval accuracy
- RAG answer quality
- Search relevance
- Source traceability
- Content freshness
- Security filtering
- AI agent reliability
Simple RAG ETL Architecture
PDF / Docs / Web Pages / Database
|
v
Document Reader
|
v
Text Cleaner
|
v
Text Splitter
|
v
Metadata Enricher
|
v
Embedding Model
|
v
Vector Store
|
v
RAG Application
Real-World Learning Platform Example
A learning platform may have:
- Java course lessons
- Spring Boot interview questions
- Docker tutorials
- Kubernetes articles
- AI and RAG guides
- Project documentation
The ETL pipeline reads all this content, splits it into meaningful sections, generates embeddings, and stores them in a vector database. When a learner asks a question, the AI retrieves the most relevant content and answers from the platform's own knowledge.
Real-World Banking Example
A banking support assistant may process:
- UPI failure policy
- Loan eligibility rules
- Credit card FAQs
- Account statement guide
- Refund reversal timelines
When a customer asks:
Amount was debited but UPI failed. When will I get my money?
The RAG system searches the vector store and retrieves the failed UPI reversal policy. The AI then answers using verified banking content instead of guessing.
Real-World E-Commerce Example
An e-commerce assistant may process:
- Refund policy
- Return policy
- Warranty rules
- Delivery timelines
- Product descriptions
When a user asks:
Can I return a damaged phone after delivery?
The AI retrieves the return-policy and damaged-product-policy chunks before generating the answer.
Spring AI Document Object
In Spring AI, document content is usually represented using a Document object.
A document contains:
- Text content
- Metadata
Example
Document document = new Document(
        "Spring AI helps Java developers build AI applications.",
        Map.of(
                "source", "spring-ai-course",
                "topic", "spring-ai"
        )
);
Why Metadata Matters
Metadata gives meaning and traceability to document chunks.
Useful metadata includes:
- Source file
- Page number
- Topic
- Category
- Author
- Created date
- Updated date
- Tenant ID
- Access level
Metadata Example
{
"source": "refund-policy.pdf",
"page": 3,
"category": "refund",
"tenantId": "company-a",
"updatedDate": "2026-05-20"
}
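To illustrate why fields such as tenantId matter, here is a minimal plain-Java sketch that filters chunks by tenant before they are handed to the chat model. MetadataChunk and MetadataFilterSketch are hypothetical stand-ins and not Spring AI types; in a real system the filter would usually be pushed down into the vector store query.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a retrieved chunk; not Spring AI's Document class.
record MetadataChunk(String text, Map<String, Object> metadata) {}

class MetadataFilterSketch {

    // Keeps only chunks whose "tenantId" metadata equals the caller's tenant.
    // Chunks without a tenantId are dropped, which is the safer default.
    static List<MetadataChunk> forTenant(List<MetadataChunk> chunks, String tenantId) {
        return chunks.stream()
                .filter(chunk -> tenantId.equals(chunk.metadata().get("tenantId")))
                .toList();
    }
}
```

Calling forTenant(chunks, "company-a") would keep only chunks tagged for that tenant, so content from another tenant never reaches the prompt.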
Step 1: Extract Documents
The extract phase reads raw content from files, databases, APIs, or websites.
Source File
|
v
Document Reader
|
v
Raw Text
Common Document Reader Types
| Reader Type | Use Case |
|---|---|
| Text Reader | Plain text files |
| PDF Reader | PDF documents |
| Markdown Reader | Developer documentation |
| HTML Reader | Web pages and articles |
| CSV Reader | Tabular data |
| Database Reader | Rows from application database |
| API Reader | External knowledge services |
Text Document Reader Example
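import java.util.Map;

import org.springframework.ai.document.Document;
import org.springframework.stereotype.Service;

@Service
public class TextDocumentReaderService {

    // Wraps already-loaded text in a Document and records where it came from.
    public Document readText(String content, String source) {
        return new Document(
                content,
                Map.of(
                        "source", source,
                        "type", "text"
                )
        );
    }
}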
PDF Reader Concept
PDF files usually require text extraction before they can be embedded.
PDF File
|
v
Extract Text Page by Page
|
v
Create Document Objects
|
v
Add Page Metadata
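The last two steps of this flow can be sketched in plain Java. The sketch assumes the page texts have already been extracted by some PDF library; PageDocument and PageDocumentSketch are hypothetical names and not Spring AI classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one page-level document; not Spring AI's Document class.
record PageDocument(String text, Map<String, Object> metadata) {}

class PageDocumentSketch {

    // Turns already-extracted page texts into one document per non-blank page.
    // Page numbers are 1-based so they match what a reader sees in the PDF.
    static List<PageDocument> fromPages(String source, List<String> pageTexts) {
        List<PageDocument> documents = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            String text = pageTexts.get(i).strip();
            if (text.isEmpty()) {
                continue; // skip blank pages, for example separators without text
            }
            Map<String, Object> metadata = Map.of("source", source, "page", i + 1);
            documents.add(new PageDocument(text, metadata));
        }
        return documents;
    }
}
```

Keeping the original page index, even when blank pages are skipped, lets an answer cite the exact page the text came from.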
PDF Document Example
Document document = new Document(
        "Refunds are processed within 5 to 7 business days.",
        Map.of(
                "source", "refund-policy.pdf",
                "page", "2",
                "category", "refund"
        )
);
Step 2: Transform Documents
The transform phase prepares extracted content for embeddings and retrieval.
Transform operations include:
- Remove unnecessary spaces
- Remove headers and footers
- Remove duplicate content
- Normalize formatting
- Fix broken paragraphs
- Add metadata
- Split into chunks
Text Cleaning Example
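public String cleanText(String text) {
    if (text == null) {
        return "";
    }
    return text
            // Remove page markers first so the whitespace pass below closes the gaps they leave.
            // This also strips phrases like "Page 3" from body text; tighten the pattern if that matters.
            .replaceAll("Page\\s+\\d+", "")
            .replaceAll("\\s+", " ")
            .trim();
}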
Why Text Cleaning Matters
Raw documents often contain:
- Repeated headers
- Footer text
- Page numbers
- Broken lines
- Navigation menus
- Legal boilerplate
- Duplicate content
If these are embedded without cleaning, retrieval quality becomes poor.
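One way to remove repeated headers and footers is to drop any line that appears on every page. The sketch below is a simple heuristic under that assumption; RepeatedLineCleaner is a hypothetical helper, and real documents may need a looser threshold (for example, lines that appear on most pages).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Sketch: drop lines that repeat on every page (typical headers and footers).
// RepeatedLineCleaner is a hypothetical helper, not a Spring AI class.
class RepeatedLineCleaner {

    static List<String> removeRepeatedLines(List<String> pages) {
        // Count on how many pages each trimmed line appears (at most once per page).
        Map<String, Integer> pageCounts = new HashMap<>();
        for (String page : pages) {
            for (String line : new HashSet<>(trimmedLines(page))) {
                pageCounts.merge(line, 1, Integer::sum);
            }
        }
        List<String> cleaned = new ArrayList<>();
        for (String page : pages) {
            List<String> kept = new ArrayList<>();
            for (String line : trimmedLines(page)) {
                // A line present on every page of a multi-page document is treated as boilerplate.
                boolean boilerplate = pages.size() > 1 && pageCounts.get(line) == pages.size();
                if (!line.isEmpty() && !boilerplate) {
                    kept.add(line);
                }
            }
            cleaned.add(String.join("\n", kept));
        }
        return cleaned;
    }

    private static List<String> trimmedLines(String page) {
        List<String> lines = new ArrayList<>();
        for (String line : page.split("\\R")) {
            lines.add(line.strip());
        }
        return lines;
    }
}
```

Running this before chunking keeps a company name or confidentiality notice from being embedded once per page and crowding out real content in search results.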
Step 3: Split Documents into Chunks
Chunking is one of the most important parts of RAG quality.
A large document should be split into smaller meaningful chunks before embedding.
Large Document
|
+-- Chunk 1: Introduction
+-- Chunk 2: Eligibility Rules
+-- Chunk 3: Refund Timeline
+-- Chunk 4: Exceptions
+-- Chunk 5: Contact Support
Why Chunking Is Needed
- Improves retrieval precision
- Reduces irrelevant context
- Controls prompt size
- Improves answer accuracy
- Helps cite exact sources
Bad Chunking Example
Chunk 1:
Random first 500 characters from document.
Chunk 2:
Next 500 characters, possibly cutting a sentence in half.
This may break meaning and reduce retrieval quality.
Good Chunking Example
Chunk 1:
Refund eligibility rules.
Chunk 2:
Refund processing timeline.
Chunk 3:
Refund exceptions and rejected cases.
Meaningful chunks give better RAG results.
Simple Java Chunking Example
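// Naive fixed-size splitter: simple, but it can cut sentences in half.
public List<String> splitText(String text, int chunkSize) {
    if (chunkSize <= 0) {
        // Without this guard the loop below would never advance.
        throw new IllegalArgumentException("chunkSize must be positive");
    }
    List<String> chunks = new ArrayList<>();
    for (int i = 0; i < text.length(); i += chunkSize) {
        int end = Math.min(i + chunkSize, text.length());
        chunks.add(text.substring(i, end));
    }
    return chunks;
}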
Better Chunking Strategy
For production systems, split by:
- Headings
- Paragraphs
- Sections
- Pages
- Business topics
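As one possible take on paragraph-based splitting, the sketch below packs whole paragraphs into chunks up to a character limit instead of cutting at fixed offsets. ParagraphChunker is a hypothetical helper, and the blank-line paragraph separator is an assumption about the input text.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: pack whole paragraphs into chunks of at most maxChars characters.
// ParagraphChunker is a hypothetical helper, not a Spring AI class.
class ParagraphChunker {

    static List<String> split(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // Paragraphs are assumed to be separated by one or more blank lines.
        for (String paragraph : text.split("\\R\\s*\\R")) {
            String trimmed = paragraph.strip();
            if (trimmed.isEmpty()) {
                continue;
            }
            // Start a new chunk when adding this paragraph would exceed the limit.
            if (current.length() > 0 && current.length() + 2 + trimmed.length() > maxChars) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append("\n\n");
            }
            // A single paragraph longer than maxChars is kept whole rather than cut mid-sentence.
            current.append(trimmed);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
```

Spring AI also ships a token-based splitter, TokenTextSplitter, which may be a better fit when chunk size has to track model token limits rather than characters.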
Chunk Metadata Example
Document chunk = new Document(
        "Refunds are processed within 5 to 7 business days.",
        Map.of(
                "source", "refund-policy.pdf",
                "section", "Refund Timeline",
                "page", "3"
        )
);
Step 4: Generate Embeddings
After chunking, each chunk is converted into an embedding vector.
Document Chunk
|
v
Embedding Model
|
v
Vector Embedding
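Spring AI uses EmbeddingModel to generate embeddings. Vector store implementations typically call the configured EmbeddingModel themselves when documents are added, so application code rarely has to compute vectors by hand.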
Embedding Concept Example
Text:
Spring AI supports vector stores.
Vector:
[0.11, -0.45, 0.78, ...]
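Embeddings become useful once two vectors can be compared. Cosine similarity is one common measure for this; the sketch below is a plain-Java illustration with a hypothetical class name, and a given vector store may use a different distance metric.

```java
// Sketch: cosine similarity, one common way to compare two embedding vectors.
// CosineSimilaritySketch is a hypothetical helper, not a Spring AI class.
class CosineSimilaritySketch {

    // Returns a value in [-1, 1]; values near 1 mean the vectors point in a similar direction.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // similarity with a zero vector is undefined; 0.0 is a convention here
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

At query time the question is embedded the same way, and the chunks whose vectors score highest against it are the ones retrieved.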
Step 5: Load into Vector Store
The load phase stores the chunks and their embeddings in a vector database.
Vector store options include:
- PGVector
- Pinecone
- MongoDB Atlas Vector Search
- Redis Vector Search
- Qdrant
- Milvus
- Weaviate
- Elasticsearch
Vector Store Load Example
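import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;

@Service
public class VectorLoadService {

    private final VectorStore vectorStore;

    public VectorLoadService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    // Stores the documents in whichever VectorStore implementation is configured.
    public void load(List<Document> documents) {
        vectorStore.add(documents);
    }
}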
Complete ETL Service Example
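import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;

@Service
public class DocumentEtlService {

    private final VectorStore vectorStore;

    public DocumentEtlService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    // Clean -> chunk -> wrap in Documents -> load.
    public void processText(String content, String source) {
        String cleanedText = cleanText(content);
        if (cleanedText.isEmpty()) {
            return; // nothing to index
        }
        List<String> chunks = splitText(cleanedText, 800);
        List<Document> documents = chunks.stream()
                .map(chunk -> new Document(
                        chunk,
                        Map.of(
                                "source", source,
                                "type", "text"
                        )
                ))
                .toList();
        vectorStore.add(documents);
    }

    private String cleanText(String text) {
        if (text == null) {
            return "";
        }
        return text.replaceAll("\\s+", " ").trim();
    }

    // Fixed-size splitting keeps the example short; see the chunking section for its limits.
    private List<String> splitText(String text, int chunkSize) {
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < text.length(); i += chunkSize) {
            int end = Math.min(i + chunkSize, text.length());
            chunks.add(text.substring(i, end));
        }
        return chunks;
    }
}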
ETL Controller Example
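import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/api/etl")
public class DocumentEtlController {

    private final DocumentEtlService etlService;

    public DocumentEtlController(DocumentEtlService etlService) {
        this.etlService = etlService;
    }

    // Accepts raw text in the request body and the source name as a query parameter.
    @PostMapping("/text")
    public String processText(@RequestParam String source,
                              @RequestBody String content) {
        etlService.processText(content, source);
        return "Document processed successfully.";
    }
}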
Testing ETL API
curl -X POST "http://localhost:8080/api/etl/text?source=spring-ai-notes" \
-H "Content-Type: text/plain" \
-d "Spring AI helps Java developers build AI applications using chat models, embeddings, vector stores, and RAG."
ETL + RAG Flow
ETL Time:
Documents → Clean → Chunk → Embed → Store
Query Time:
Question → Search Vector Store → Retrieve Chunks → Chat Model → Answer
Building a Database Reader
Many applications store knowledge in relational databases. For example, your website may store courses, interview questions, projects, and articles in MySQL or PostgreSQL.
Database Rows
|
v
Read Content
|
v
Create Documents
|
v
Add Metadata
|
v
Store in Vector DB
Database Reader Example
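import java.util.List;
import java.util.Map;

import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;

@Service
public class CourseDatabaseReader {

    private final CourseRepository courseRepository;
    private final VectorStore vectorStore;

    public CourseDatabaseReader(CourseRepository courseRepository,
                                VectorStore vectorStore) {
        this.courseRepository = courseRepository;
        this.vectorStore = vectorStore;
    }

    // Converts every course row into a Document and stores it.
    // Map.of rejects null values, so id and slug are assumed to be non-null here.
    // Each call adds every course again, so repeated runs may create duplicate
    // vectors unless old entries are removed first.
    public void indexCourses() {
        List<Document> documents = courseRepository.findAll()
                .stream()
                .map(course -> new Document(
                        course.getTitle() + "\n" + course.getDescription(),
                        Map.of(
                                "type", "course",
                                "courseId", course.getId().toString(),
                                "slug", course.getSlug()
                        )
                ))
                .toList();
        vectorStore.add(documents);
    }
}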
Building an HTML Reader
An HTML reader extracts useful article content from web pages and removes navigation, footer, scripts, and styling.
HTML Page
|
v
Remove Tags / Scripts
|
v
Extract Main Content
|
v
Create Document
|
v
Store in Vector DB
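The first two steps of this flow can be sketched with regular expressions. This is a deliberately naive illustration: HtmlTextSketch is a hypothetical helper, it does not decode entities such as &amp;amp;, and a dedicated HTML parser would handle malformed or nested markup far more reliably.

```java
import java.util.regex.Pattern;

// Naive sketch of HTML-to-text extraction using regular expressions.
// HtmlTextSketch is hypothetical; prefer a real HTML parser for production input.
class HtmlTextSketch {

    // Elements whose content is dropped entirely, tags included.
    private static final Pattern NON_CONTENT_BLOCKS = Pattern.compile(
            "(?is)<(script|style|nav|footer)\\b[^>]*>.*?</\\1\\s*>");

    // Any remaining tag; its inner text is kept.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    static String extractText(String html) {
        String withoutBlocks = NON_CONTENT_BLOCKS.matcher(html).replaceAll(" ");
        String withoutTags = ANY_TAG.matcher(withoutBlocks).replaceAll(" ");
        // Collapse the whitespace left behind by removed markup.
        return withoutTags.replaceAll("\\s+", " ").strip();
    }
}
```

The extracted text can then go through the same cleaning and chunking steps as any other source.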
Building a CSV Reader
CSV readers are useful for structured datasets.
course_title,description,category
Spring Boot,Backend Java framework,Java
Docker,Container platform,DevOps
Each row can become a document with metadata.
CSV Document Example
Document document = new Document(
        "Spring Boot is a backend Java framework.",
        Map.of(
                "type", "course",
                "category", "Java",
                "source", "courses.csv"
        )
);
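The row-to-document mapping can be sketched in plain Java as follows. The sketch assumes the three-column layout of the sample CSV in this section and simple fields without quotes or embedded commas; RowDocument and CsvRowSketch are hypothetical names, and a CSV library would be a safer choice for real files.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a row-level document; not Spring AI's Document class.
record RowDocument(String text, Map<String, Object> metadata) {}

class CsvRowSketch {

    // Assumes a header row followed by rows of: course_title, description, category.
    static List<RowDocument> toDocuments(String csv, String source) {
        String[] lines = csv.strip().split("\\R");
        List<RowDocument> documents = new ArrayList<>();
        for (int i = 1; i < lines.length; i++) { // index 0 is the header row
            String[] fields = lines[i].split(",", -1);
            if (fields.length < 3) {
                continue; // skip malformed rows instead of failing the whole file
            }
            Map<String, Object> metadata = new LinkedHashMap<>();
            metadata.put("type", "course");
            metadata.put("category", fields[2].strip());
            metadata.put("source", source);
            String text = fields[0].strip() + ": " + fields[1].strip();
            documents.add(new RowDocument(text, metadata));
        }
        return documents;
    }
}
```

Putting the title and description into the text while keeping the category as metadata lets the category be used for filtering without diluting the embedded content.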
Incremental ETL
In production, you should avoid reprocessing every document every time.
Use incremental ETL:
- Index only new documents
- Update changed documents
- Delete removed documents
- Track document version
- Track updated timestamp
Incremental ETL Flow
Check Last Indexed Time
|
v
Find New or Updated Documents
|
v
Process Only Changed Data
|
v
Update Vector Store
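The "find new or updated documents" step might look like the sketch below, assuming each source record carries a last-modified timestamp. SourceRecord and IncrementalSelector are hypothetical names; comparing content hashes is a common alternative when timestamps are unreliable.

```java
import java.time.Instant;
import java.util.List;

// Hypothetical view of a source record with its last-modified time.
record SourceRecord(String id, Instant updatedAt) {}

class IncrementalSelector {

    // Returns records changed after the last successful indexing run.
    // A null lastIndexedAt means nothing has been indexed yet, so everything is selected.
    static List<SourceRecord> changedSince(List<SourceRecord> records, Instant lastIndexedAt) {
        return records.stream()
                .filter(r -> lastIndexedAt == null || r.updatedAt().isAfter(lastIndexedAt))
                .toList();
    }
}
```

The last indexed time should be advanced only after the vector store update succeeds, otherwise a failed run could silently skip documents.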
Document Versioning
Versioning helps prevent outdated AI answers.
{
"source": "refund-policy.pdf",
"version": "2026-05-20",
"status": "active"
}
Handling Deleted Documents
If a document is deleted from your main system, remove or deactivate its vector records.
Document Deleted
|
v
Find Related Vector Chunks
|
v
Delete from Vector Store
|
v
Prevent Outdated Retrieval
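Finding the related chunks is easiest when the pipeline records which chunk ids belong to each source at load time. The sketch below keeps that mapping in memory; ChunkIndex is a hypothetical bookkeeping class, and a real system would persist the mapping or query the vector store by source metadata.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: remember which chunk ids belong to each source so they can be deleted together.
// ChunkIndex is a hypothetical bookkeeping class, not part of Spring AI.
class ChunkIndex {

    private final Map<String, List<String>> chunkIdsBySource = new HashMap<>();

    // Call when a chunk is stored, recording its id under the source document.
    void register(String source, String chunkId) {
        chunkIdsBySource.computeIfAbsent(source, key -> new ArrayList<>()).add(chunkId);
    }

    // Call when the source is deleted; returns the chunk ids to remove from the vector store.
    List<String> removeSource(String source) {
        List<String> ids = chunkIdsBySource.remove(source);
        return ids == null ? List.of() : ids;
    }
}
```

The returned ids could then be passed to the vector store's delete operation. Spring AI's VectorStore exposes a delete method that accepts a list of ids, though the exact signature should be checked against the version in use.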
ETL Scheduling
ETL pipelines can run:
- Immediately after upload
- Every hour
- Every night
- After database changes
- Through message queues
Scheduled ETL Example
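// Runs daily at 02:00 (Spring cron fields: second minute hour day month weekday).
// Scheduling must be enabled, for example with @EnableScheduling on a configuration class.
@Scheduled(cron = "0 0 2 * * *")
public void runDailyIndexing() {
    courseDatabaseReader.indexCourses();
}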
Event-Driven ETL
For scalable systems, use event-driven processing.
Document Uploaded
|
v
Publish Event
|
v
ETL Worker Consumes Event
|
v
Process Document
|
v
Store Vectors
Message Queue Options
- Kafka
- RabbitMQ
- Amazon SQS
- Redis Streams
- Google Pub/Sub
ETL Error Handling
Document processing can fail due to:
- Unsupported file format
- Corrupted PDF
- Empty content
- Embedding API failure
- Vector database failure
- Timeouts
- Permission issues
Error Handling Flow
ETL Step Fails
|
v
Log Safe Error
|
v
Retry if Temporary
|
v
Move to Failed Queue if Permanent
|
v
Notify Admin
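The retry and failed-queue steps of this flow can be sketched as below. RetryingProcessor is a hypothetical class that treats every runtime exception as retryable for simplicity; a production version would distinguish temporary from permanent failures and wait between attempts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch: retry a processing step a bounded number of times, then park the document.
// RetryingProcessor is hypothetical and treats every RuntimeException as retryable.
class RetryingProcessor {

    private final List<String> failedDocuments = new ArrayList<>();

    // Returns true if the step succeeded within maxAttempts, false if the document was parked.
    boolean process(String documentId, Consumer<String> step, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                step.accept(documentId);
                return true;
            } catch (RuntimeException e) {
                // Log only the id and error type, never the document content.
                System.err.println("Attempt " + attempt + " failed for " + documentId
                        + ": " + e.getClass().getSimpleName());
            }
        }
        failedDocuments.add(documentId); // stands in for a failed queue that an admin reviews
        return false;
    }

    List<String> failedDocuments() {
        return List.copyOf(failedDocuments);
    }
}
```

Because a failure is caught per document, one corrupted file ends up in the failed list while the rest of the batch continues.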
ETL Monitoring
Track:
- Documents processed
- Chunks created
- Embedding generation time
- Vector store load time
- Failed documents
- Empty documents
- Average chunk size
- Index freshness
Production ETL Dashboard
ETL Metrics
|
+-- Total documents indexed
+-- Failed documents
+-- Average processing time
+-- Vector store latency
+-- Last successful run
+-- Documents pending
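A few of these numbers can be collected with simple thread-safe counters, as in the sketch below. EtlMetrics is a hypothetical class; a real service would more likely publish the values through a metrics library and a dashboard tool.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: thread-safe counters for a few of the ETL metrics listed above.
// EtlMetrics is a hypothetical class, not part of Spring AI.
class EtlMetrics {

    private final AtomicLong documentsProcessed = new AtomicLong();
    private final AtomicLong documentsFailed = new AtomicLong();
    private final AtomicLong chunksCreated = new AtomicLong();
    private final AtomicLong totalChunkChars = new AtomicLong();

    // Record one successfully processed document and the chunks it produced.
    void recordSuccess(List<String> chunks) {
        documentsProcessed.incrementAndGet();
        chunksCreated.addAndGet(chunks.size());
        for (String chunk : chunks) {
            totalChunkChars.addAndGet(chunk.length());
        }
    }

    void recordFailure() {
        documentsFailed.incrementAndGet();
    }

    long documentsProcessed() { return documentsProcessed.get(); }

    long documentsFailed() { return documentsFailed.get(); }

    long chunksCreated() { return chunksCreated.get(); }

    // Average chunk size in characters; 0.0 when no chunks have been recorded.
    double averageChunkSize() {
        long count = chunksCreated.get();
        return count == 0 ? 0.0 : (double) totalChunkChars.get() / count;
    }
}
```

A sudden drop in average chunk size or a rise in failed documents is often the first visible sign that a source format has changed.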
Security Best Practices
- Scan uploaded files before processing
- Validate file type and size
- Do not index secrets such as passwords, API keys, or tokens
- Apply tenant metadata
- Enforce access control at query time so users retrieve only chunks they are allowed to see
- Encrypt sensitive storage
- Do not log full sensitive documents
- Remove deleted content from vector store
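The "validate file type and size" item could start with a check like the one below. The allowed extensions and the 10 MiB limit are arbitrary example values, and UploadValidator is a hypothetical helper; an extension check alone does not prove what a file contains, so it complements malware scanning and content inspection rather than replacing them.

```java
import java.util.Locale;
import java.util.Set;

// Sketch: reject uploads with an unexpected extension or size before any parsing happens.
// UploadValidator and its limits are hypothetical example values.
class UploadValidator {

    private static final Set<String> ALLOWED_EXTENSIONS = Set.of("pdf", "txt", "md", "html", "csv");
    private static final long MAX_BYTES = 10L * 1024 * 1024; // 10 MiB, an arbitrary example limit

    static boolean isAcceptable(String fileName, long sizeInBytes) {
        if (fileName == null || sizeInBytes <= 0 || sizeInBytes > MAX_BYTES) {
            return false;
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false; // no extension at all
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return ALLOWED_EXTENSIONS.contains(extension);
    }
}
```

Running this check at the upload endpoint keeps unsupported or oversized files out of the ETL queue entirely.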
Common ETL Mistakes
1. Indexing Raw Documents Without Cleaning
This reduces retrieval quality.
2. Poor Chunking
Random chunks can break meaning.
3. No Metadata
Without metadata, results are difficult to filter, cite, or debug.
4. No Incremental Updates
Old content may remain in vector search.
5. No Error Handling
A single bad document can break the pipeline.
Best Practices
- Extract clean text from documents
- Use meaningful chunking
- Store rich metadata
- Track document version
- Use incremental indexing
- Monitor ETL failures
- Use queues for large workloads
- Protect sensitive data
- Delete outdated vectors
- Test retrieval quality regularly
Interview Questions
Q1: What is an ETL pipeline in AI applications?
It is a process that extracts documents, transforms them into AI-ready chunks, and loads them into a vector store for retrieval.
Q2: Why are document readers important?
They extract useful text from files, databases, APIs, or web pages and convert it into AI-processable documents.
Q3: Why is chunking important?
Chunking improves retrieval quality by splitting large documents into meaningful smaller sections.
Q4: What metadata should be stored?
Source, page number, topic, category, tenant ID, version, updated date, and access level.
Q5: What is incremental ETL?
Incremental ETL processes only new or updated content instead of reprocessing everything.
Advanced Interview Questions
Q1: How do you handle deleted documents in vector databases?
Find and remove related vector chunks or mark them inactive so outdated content is not retrieved.
Q2: Why use queues for ETL?
Queues support scalable, asynchronous document processing and prevent large uploads from blocking API requests.
Q3: How do you secure document ETL?
Validate files, scan uploads, avoid indexing secrets, add tenant metadata, encrypt sensitive data, and enforce access control.
Q4: What causes poor RAG answers after ETL?
Poor chunking, noisy text, missing metadata, outdated vectors, wrong embedding model, or weak prompts.
Q5: How do you monitor ETL pipelines?
Track processed documents, failed documents, chunk count, embedding time, vector load time, and index freshness.
Recommended Learning Path
- Introduction to Spring AI
- Introduction to Embeddings
- Vector Databases and Vector Stores
- Implementing RAG
- Document Readers and ETL Pipelines
- Java AI Agents
Summary
Document readers and ETL pipelines are essential for building high-quality Spring AI RAG applications. They convert raw files, database records, APIs, and web content into clean, chunked, metadata-rich documents that can be embedded and stored in vector databases.
A strong ETL pipeline improves retrieval quality, reduces hallucination, supports source tracking, improves security, and keeps AI knowledge up to date.
For learning platforms, banking assistants, e-commerce support systems, SaaS knowledge bases, and enterprise AI agents, ETL quality directly affects AI answer quality.