Building Document Readers and ETL Pipelines in Spring AI

Modern AI applications need reliable knowledge before they can provide useful answers. A chat model alone may answer general questions, but enterprise applications usually need answers from documents such as PDFs, Word files, web pages, FAQs, policies, course lessons, product manuals, support tickets, and database records.

To make these documents useful for AI applications, we need a proper document processing flow. This flow is commonly called an ETL pipeline.

ETL stands for:

Extract: Read content from documents or data sources
Transform: Clean, split, enrich, and prepare the content
Load: Store the processed content into a vector database or knowledge store

In Spring AI, document readers and ETL pipelines are the foundation for Retrieval-Augmented Generation (RAG), semantic search, AI knowledge assistants, customer support bots, and agentic AI systems.

What is a Document Reader?

A document reader is a component that reads content from a source and converts it into a format that the AI system can process.

A source can be:

PDF file
Text file
Markdown file
HTML page
Word document
CSV file
Database record
API response
Website content

The reader extracts useful text and converts it into Document objects.

What is an ETL Pipeline?

An ETL pipeline is a structured process that moves data from its raw source format into a usable, AI-ready format.

Raw Documents
      |
      v
Extract Text
      |
      v
Transform and Clean
      |
      v
Split into Chunks
      |
      v
Generate Embeddings
      |
      v
Load into Vector Store
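
The stages above map naturally onto three functional shapes: a supplier that extracts, a function that transforms, and a consumer that loads. Spring AI's DocumentReader, DocumentTransformer, and DocumentWriter interfaces follow the same Supplier / Function / Consumer pattern over lists of Document objects. The sketch below is a plain-Java illustration only: the class name EtlPipelineSketch is hypothetical, and plain strings stand in for documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch: strings play the role of documents.
class EtlPipelineSketch {

    // Wires the three stages together: extract, then transform, then load.
    static void run(Supplier<List<String>> extract,
                    Function<List<String>, List<String>> transform,
                    Consumer<List<String>> load) {
        load.accept(transform.apply(extract.get()));
    }

    static List<String> demo() {
        List<String> store = new ArrayList<>(); // stands in for a vector store
        run(
                () -> List.of("  Refund policy  ", "", " Return policy "),
                docs -> docs.stream()
                        .map(String::trim)
                        .filter(s -> !s.isEmpty())
                        .toList(),
                store::addAll
        );
        return store;
    }
}
```

Keeping each stage behind a small interface makes it possible to swap a reader or a splitter without touching the rest of the pipeline.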

Why ETL Pipelines Are Important for AI

Poor document processing leads to poor AI answers. Even if the language model is powerful, it cannot answer correctly if the retrieved content is messy, outdated, incomplete, or poorly chunked.

A good ETL pipeline improves:

Retrieval accuracy
RAG answer quality
Search relevance
Source traceability
Content freshness
Security filtering
AI agent reliability

Simple RAG ETL Architecture

PDF / Docs / Web Pages / Database
              |
              v
Document Reader
              |
              v
Text Cleaner
              |
              v
Text Splitter
              |
              v
Metadata Enricher
              |
              v
Embedding Model
              |
              v
Vector Store
              |
              v
RAG Application

Real-World Learning Platform Example

A learning platform may have:

Java course lessons
Spring Boot interview questions
Docker tutorials
Kubernetes articles
AI and RAG guides
Project documentation

The ETL pipeline reads all this content, splits it into meaningful sections, generates embeddings, and stores them in a vector database. When a learner asks a question, the AI retrieves the most relevant content and answers from the platform's own material.

Real-World Banking Example

A banking support assistant may process:

UPI failure policy
Loan eligibility rules
Credit card FAQs
Account statement guide
Refund reversal timelines

When a customer asks:

Amount was debited but UPI failed. When will I get my money?

The RAG system searches the vector store and retrieves the failed UPI reversal policy. The AI then answers using verified banking content instead of guessing.

Real-World E-Commerce Example

An e-commerce assistant may process:

Refund policy
Return policy
Warranty rules
Delivery timelines
Product descriptions

When a user asks:

Can I return a damaged phone after delivery?

The AI retrieves return policy and damaged product policy chunks before generating the answer.

Spring AI Document Object

In Spring AI, document content is usually represented using a Document object.

A document contains:

Text content
Metadata

Example

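import java.util.Map;
import org.springframework.ai.document.Document;

// A document is text content plus a metadata map.
Document document = new Document(
        "Spring AI helps Java developers build AI applications.",
        Map.of(
                "source", "spring-ai-course",
                "topic", "spring-ai"
        )
);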

Why Metadata Matters

Metadata gives meaning and traceability to document chunks.

Useful metadata includes:

Source file
Page number
Topic
Category
Author
Created date
Updated date
Tenant ID
Access level

Metadata Example

{
  "source": "refund-policy.pdf",
  "page": 3,
  "category": "refund",
  "tenantId": "company-a",
  "updatedDate": "2026-05-20"
}

Step 1: Extract Documents

The extract phase reads raw content from files, databases, APIs, or websites.

Source File
    |
    v
Document Reader
    |
    v
Raw Text

Common Document Reader Types

Reader Type	Use Case
Text Reader	Plain text files
PDF Reader	PDF documents
Markdown Reader	Developer documentation
HTML Reader	Web pages and articles
CSV Reader	Tabular data
Database Reader	Rows from application database
API Reader	External knowledge services

Text Document Reader Example

@Service
public class TextDocumentReaderService {

    public Document readText(String content, String source) {

        return new Document(
                content,
                Map.of(
                        "source", source,
                        "type", "text"
                )
        );
    }
}

PDF Reader Concept

PDF files usually require text extraction before they can be embedded.

PDF File
   |
   v
Extract Text Page by Page
   |
   v
Create Document Objects
   |
   v
Add Page Metadata

PDF Document Example

Document document = new Document(
        "Refunds are processed within 5 to 7 business days.",
        Map.of(
                "source", "refund-policy.pdf",
                "page", "2",
                "category", "refund"
        )
);
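
The page-by-page flow above can be sketched without a PDF library by assuming the page texts have already been extracted. In this sketch the Doc record is a hypothetical stand-in for Spring AI's Document, and PageDocumentSketch and toDocuments are illustrative names. Spring AI provides PDF readers in a separate module; check the reference documentation for the classes available in your version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class PageDocumentSketch {

    // Hypothetical stand-in for Spring AI's Document: text plus metadata.
    record Doc(String text, Map<String, Object> metadata) {}

    // Turns already-extracted page texts into one document per non-blank page.
    static List<Doc> toDocuments(List<String> pages, String source) {
        List<Doc> docs = new ArrayList<>();
        for (int i = 0; i < pages.size(); i++) {
            String text = pages.get(i).strip();
            if (text.isEmpty()) {
                continue; // skip blank pages, but keep original page numbering
            }
            // Page numbers are 1-based so they match what a reader sees in the PDF.
            docs.add(new Doc(text, Map.of("source", source, "page", i + 1)));
        }
        return docs;
    }
}
```

Keeping the original page number in metadata, even when blank pages are skipped, lets the assistant cite the exact page later.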

Step 2: Transform Documents

The transform phase prepares extracted content for embeddings and retrieval.

Transform operations include:

Remove unnecessary spaces
Remove headers and footers
Remove duplicate content
Normalize formatting
Fix broken paragraphs
Add metadata
Split into chunks

Text Cleaning Example

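public String cleanText(String text) {

    if (text == null) {
        return "";
    }

    return text
            // Strip page markers first so the gap they leave is collapsed below.
            // Note: this also removes in-sentence mentions such as "see Page 3".
            .replaceAll("Page \\d+", "")
            // Collapse runs of whitespace (including line breaks) into one space.
            .replaceAll("\\s+", " ")
            .trim();
}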

Why Text Cleaning Matters

Raw documents often contain:

Repeated headers
Footer text
Page numbers
Broken lines
Navigation menus
Legal boilerplate
Duplicate content

If these are embedded without cleaning, retrieval quality becomes poor.
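
Repeated headers and footers can often be detected because the same line shows up on many pages. The following is a minimal sketch of that idea, assuming page texts are already available as strings. BoilerplateFilterSketch, stripRepeatedLines, and the minPages threshold are illustrative names, not part of Spring AI.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class BoilerplateFilterSketch {

    // Removes every line that appears on at least minPages pages.
    static List<String> stripRepeatedLines(List<String> pages, int minPages) {
        // Count, for each distinct line, how many pages contain it.
        Map<String, Integer> pageCounts = new HashMap<>();
        for (String page : pages) {
            page.lines()
                    .map(String::strip)
                    .filter(line -> !line.isEmpty())
                    .distinct()
                    .forEach(line -> pageCounts.merge(line, 1, Integer::sum));
        }
        // Rebuild each page without the lines that repeat too often.
        return pages.stream()
                .map(page -> page.lines()
                        .filter(line -> pageCounts.getOrDefault(line.strip(), 0) < minPages)
                        .collect(Collectors.joining("\n")))
                .toList();
    }
}
```

A threshold that is too low can remove legitimate repeated sentences, so it is worth tuning on a sample of real documents.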

Step 3: Split Documents into Chunks

Chunking is one of the most important parts of RAG quality.

A large document should be split into smaller meaningful chunks before embedding.

Large Document
      |
      +-- Chunk 1: Introduction
      +-- Chunk 2: Eligibility Rules
      +-- Chunk 3: Refund Timeline
      +-- Chunk 4: Exceptions
      +-- Chunk 5: Contact Support

Why Chunking Is Needed

Improves retrieval precision
Reduces irrelevant context
Controls prompt size
Improves answer accuracy
Helps cite exact sources

Bad Chunking Example

Chunk 1:
Random first 500 characters from document.

Chunk 2:
Next 500 characters, possibly cutting a sentence in half.

This may break meaning and reduce retrieval quality.

Good Chunking Example

Chunk 1:
Refund eligibility rules.

Chunk 2:
Refund processing timeline.

Chunk 3:
Refund exceptions and rejected cases.

Meaningful chunks give better RAG results.

Simple Java Chunking Example

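public List<String> splitText(String text, int chunkSize) {

    // Guard against a non-positive size, which would keep the loop below from advancing.
    if (chunkSize <= 0) {
        throw new IllegalArgumentException("chunkSize must be positive");
    }

    List<String> chunks = new ArrayList<>();

    // Fixed-size split: simple, but it can cut a sentence in half.
    for (int i = 0; i < text.length(); i += chunkSize) {

        int end = Math.min(i + chunkSize, text.length());

        chunks.add(text.substring(i, end));
    }

    return chunks;
}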

Better Chunking Strategy

For production systems, split by:

Headings
Paragraphs
Sections
Pages
Business topics
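
One way to respect paragraph boundaries is to pack whole paragraphs into a chunk until a size limit would be exceeded. The sketch below assumes paragraphs are separated by blank lines. ParagraphChunkerSketch and maxChars are illustrative names, and a single paragraph longer than maxChars is kept whole rather than cut.

```java
import java.util.ArrayList;
import java.util.List;

class ParagraphChunkerSketch {

    // Packs whole paragraphs into chunks of at most maxChars characters where possible.
    static List<String> chunkByParagraph(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();

        // Paragraphs are assumed to be separated by one or more blank lines.
        for (String paragraph : text.split("\\R\\s*\\R")) {
            String p = paragraph.strip();
            if (p.isEmpty()) {
                continue;
            }
            // Close the current chunk if this paragraph would push it over the limit.
            if (current.length() > 0 && current.length() + 2 + p.length() > maxChars) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append("\n\n");
            }
            current.append(p);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
```

Because chunks end at paragraph boundaries, no sentence is split across two embeddings, which addresses the problem shown in the bad chunking example.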

Chunk Metadata Example

Document chunk = new Document(
        "Refunds are processed within 5 to 7 business days.",
        Map.of(
                "source", "refund-policy.pdf",
                "section", "Refund Timeline",
                "page", "3"
        )
);

Step 4: Generate Embeddings

After chunking, each chunk is converted into an embedding vector.

Document Chunk
      |
      v
Embedding Model
      |
      v
Vector Embedding

Spring AI uses EmbeddingModel to generate embeddings.

Embedding Concept Example

Text:
Spring AI supports vector stores.

Vector:
[0.11, -0.45, 0.78, ...]
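
Vectors like the one above are compared numerically at query time. A common measure is cosine similarity, and the sketch below shows the arithmetic in plain Java. Which metric a given vector store actually uses depends on its configuration, so treat this as an illustration of the idea only. CosineSimilaritySketch is a hypothetical class name.

```java
class CosineSimilaritySketch {

    // Returns a value in [-1, 1]; values near 1 mean the vectors point the same way.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // a zero vector has no direction; treat it as unrelated
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Two chunks about the same topic should produce vectors with a similarity close to 1, while unrelated chunks score closer to 0.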

Step 5: Load into Vector Store

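The load phase stores the chunks and their embeddings in a vector database. In Spring AI, a VectorStore implementation typically calls the configured EmbeddingModel itself when documents are added, so the pipeline usually does not need a separate embedding call.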

Vector store options include:

PGVector
Pinecone
MongoDB Atlas Vector Search
Redis Vector Search
Qdrant
Milvus
Weaviate
Elasticsearch

Vector Store Load Example

@Service
public class VectorLoadService {

    private final VectorStore vectorStore;

    public VectorLoadService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    public void load(List<Document> documents) {
        vectorStore.add(documents);
    }
}

Complete ETL Service Example

@Service
public class DocumentEtlService {

    private final VectorStore vectorStore;

    public DocumentEtlService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    public void processText(String content, String source) {

        String cleanedText = cleanText(content);

        List<String> chunks = splitText(cleanedText, 800);

        List<Document> documents = chunks.stream()
                .map(chunk -> new Document(
                        chunk,
                        Map.of(
                                "source", source,
                                "type", "text"
                        )
                ))
                .toList();

        vectorStore.add(documents);
    }

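    private String cleanText(String text) {
        // Treat missing content as empty so the pipeline does not fail on null input.
        if (text == null) {
            return "";
        }
        return text.replaceAll("\\s+", " ").trim();
    }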

    private List<String> splitText(String text, int chunkSize) {

        List<String> chunks = new ArrayList<>();

        for (int i = 0; i < text.length(); i += chunkSize) {
            int end = Math.min(i + chunkSize, text.length());
            chunks.add(text.substring(i, end));
        }

        return chunks;
    }
}

ETL Controller Example

@RestController
@RequestMapping("/api/etl")
public class DocumentEtlController {

    private final DocumentEtlService etlService;

    public DocumentEtlController(DocumentEtlService etlService) {
        this.etlService = etlService;
    }

    @PostMapping("/text")
    public String processText(@RequestParam String source,
                              @RequestBody String content) {

        etlService.processText(content, source);

        return "Document processed successfully.";
    }
}

Testing ETL API

curl -X POST "http://localhost:8080/api/etl/text?source=spring-ai-notes" \
-H "Content-Type: text/plain" \
-d "Spring AI helps Java developers build AI applications using chat models, embeddings, vector stores, and RAG."

ETL + RAG Flow

ETL Time:
Documents → Clean → Chunk → Embed → Store

Query Time:
Question → Search Vector Store → Retrieve Chunks → Chat Model → Answer

Building a Database Reader

Many applications store knowledge in relational databases. For example, your website may store courses, interview questions, projects, and articles in MySQL or PostgreSQL.

Database Rows
     |
     v
Read Content
     |
     v
Create Documents
     |
     v
Add Metadata
     |
     v
Store in Vector DB

Database Reader Example

@Service
public class CourseDatabaseReader {

    private final CourseRepository courseRepository;
    private final VectorStore vectorStore;

    public CourseDatabaseReader(CourseRepository courseRepository,
                                VectorStore vectorStore) {
        this.courseRepository = courseRepository;
        this.vectorStore = vectorStore;
    }

    public void indexCourses() {

        List<Document> documents = courseRepository.findAll()
                .stream()
                .map(course -> new Document(
                        course.getTitle() + "\n" + course.getDescription(),
                        Map.of(
                                "type", "course",
                                "courseId", course.getId().toString(),
                                "slug", course.getSlug()
                        )
                ))
                .toList();

        vectorStore.add(documents);
    }
}

Building an HTML Reader

An HTML reader extracts useful article content from web pages and removes navigation, footer, scripts, and styling.

HTML Page
   |
   v
Remove Tags / Scripts
   |
   v
Extract Main Content
   |
   v
Create Document
   |
   v
Store in Vector DB
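
The flow above can be approximated with regular expressions for simple, well-formed pages. This is a rough sketch only: regex-based stripping breaks on malformed or deeply nested markup, and a real HTML parser such as jsoup is usually the safer choice in production. HtmlTextSketch and extractText are illustrative names.

```java
class HtmlTextSketch {

    // Removes non-content elements and tags, then normalizes whitespace.
    static String extractText(String html) {
        return html
                // Drop whole elements that rarely carry article text.
                .replaceAll("(?is)<(script|style|nav|footer)\\b[^>]*>.*?</\\1>", " ")
                // Drop every remaining tag but keep the text between tags.
                .replaceAll("(?s)<[^>]+>", " ")
                // Decode the two most common entities; a full parser handles the rest.
                .replace("&nbsp;", " ")
                .replace("&amp;", "&")
                // Collapse the gaps left behind by removed markup.
                .replaceAll("\\s+", " ")
                .strip();
    }
}
```

The resulting string can then be cleaned, chunked, and wrapped in Document objects like any other text source.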

Building a CSV Reader

CSV readers are useful for structured datasets.

course_title,description,category
Spring Boot,Backend Java framework,Java
Docker,Container platform,DevOps

Each row can become a document with metadata.

CSV Document Example

Document document = new Document(
        "Spring Boot is a backend Java framework.",
        Map.of(
                "type", "course",
                "category", "Java",
                "source", "courses.csv"
        )
);
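
A row-to-document mapping for the sample CSV above might look like the following sketch. It assumes a header row, exactly the three columns shown, and no quoted fields or embedded commas; a CSV library is the better choice for real data. The Doc record is a hypothetical stand-in for Spring AI's Document, and CsvRowSketch is an illustrative name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class CsvRowSketch {

    // Hypothetical stand-in for Spring AI's Document: text plus metadata.
    record Doc(String text, Map<String, Object> metadata) {}

    // Converts each data row (title, description, category) into one document.
    static List<Doc> read(String csv, String source) {
        List<String> lines = csv.lines().filter(line -> !line.isBlank()).toList();
        List<Doc> docs = new ArrayList<>();
        for (int i = 1; i < lines.size(); i++) { // index 0 is the header row
            String[] cols = lines.get(i).split(",", -1);
            if (cols.length < 3) {
                continue; // skip malformed rows instead of failing the whole file
            }
            String text = cols[0].strip() + ": " + cols[1].strip();
            docs.add(new Doc(text, Map.of(
                    "type", "course",
                    "category", cols[2].strip(),
                    "source", source,
                    "row", i)));
        }
        return docs;
    }
}
```

Putting the category in metadata rather than in the text keeps it available for filtering without affecting the embedding.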

Incremental ETL

In production, you should avoid reprocessing every document every time.

Use incremental ETL:

Index only new documents
Update changed documents
Delete removed documents
Track document version
Track updated timestamp

Incremental ETL Flow

Check Last Indexed Time
        |
        v
Find New or Updated Documents
        |
        v
Process Only Changed Data
        |
        v
Update Vector Store
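
The flow above relies on a last-indexed timestamp. An alternative that also works when timestamps are unreliable is to compare a content hash, and that is the technique sketched here. It assumes the pipeline keeps a map of document ID to the hash recorded at the previous run. IncrementalIndexSketch, Changes, and diff are illustrative names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class IncrementalIndexSketch {

    // IDs to (re)index and IDs whose vectors should be removed.
    record Changes(Set<String> toIndex, Set<String> toDelete) {}

    static String sha256(String content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // indexedHashes: docId -> hash from the last run; current: docId -> content now.
    static Changes diff(Map<String, String> indexedHashes, Map<String, String> current) {
        Set<String> toIndex = new TreeSet<>();
        current.forEach((id, content) -> {
            // New documents have no stored hash; changed ones have a different hash.
            if (!sha256(content).equals(indexedHashes.get(id))) {
                toIndex.add(id);
            }
        });
        Set<String> toDelete = new TreeSet<>(indexedHashes.keySet());
        toDelete.removeAll(current.keySet()); // indexed before, gone now
        return new Changes(toIndex, toDelete);
    }
}
```

Unchanged documents fall into neither set, so they are never sent to the embedding model again.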

Document Versioning

Versioning helps prevent outdated AI answers.

{
  "source": "refund-policy.pdf",
  "version": "2026-05-20",
  "status": "active"
}

Handling Deleted Documents

If a document is deleted from your main system, remove or deactivate its vector records.

Document Deleted
      |
      v
Find Related Vector Chunks
      |
      v
Delete from Vector Store
      |
      v
Prevent Outdated Retrieval
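
Finding the related chunks is easier if the pipeline records which chunk IDs belong to each source at load time. The sketch below keeps that mapping in memory with deterministic IDs of the form source#index, which is an assumed convention for illustration. ChunkRegistrySketch is a hypothetical class; the returned IDs would then be passed to the vector store's delete operation (Spring AI's VectorStore offers a delete method that accepts document IDs, though the exact signature varies by version).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ChunkRegistrySketch {

    private final Map<String, List<String>> idsBySource = new HashMap<>();

    // Records deterministic chunk IDs for a source when it is indexed.
    List<String> register(String source, int chunkCount) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < chunkCount; i++) {
            ids.add(source + "#" + i);
        }
        idsBySource.put(source, ids);
        return ids;
    }

    // Returns the chunk IDs to delete from the vector store and forgets the source.
    List<String> remove(String source) {
        List<String> ids = idsBySource.remove(source);
        return ids == null ? List.of() : ids;
    }
}
```

In a real system this mapping would live in a database table so it survives restarts.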

ETL Scheduling

ETL pipelines can run:

Immediately after upload
Every hour
Every night
After database changes
Through message queues

Scheduled ETL Example

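// Requires @EnableScheduling on a configuration class.
// Cron fields: second minute hour day-of-month month day-of-week, so this runs daily at 02:00.
@Scheduled(cron = "0 0 2 * * *")
public void runDailyIndexing() {
    courseDatabaseReader.indexCourses();
}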

Event-Driven ETL

For scalable systems, use event-driven processing.

Document Uploaded
      |
      v
Publish Event
      |
      v
ETL Worker Consumes Event
      |
      v
Process Document
      |
      v
Store Vectors
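
The publish and consume steps above can be illustrated with an in-process queue. This sketch uses java.util.concurrent.BlockingQueue as a stand-in for a real broker, and the processing step is a placeholder. EtlWorkerSketch and DocumentUploaded are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class EtlWorkerSketch {

    // Event published when a document is uploaded.
    record DocumentUploaded(String documentId) {}

    private final BlockingQueue<DocumentUploaded> queue = new LinkedBlockingQueue<>();
    private final List<String> processed = new ArrayList<>();

    // Producer side: the upload endpoint only enqueues and returns quickly.
    void publish(String documentId) {
        queue.add(new DocumentUploaded(documentId));
    }

    // Consumer side: handles everything currently queued and returns the count.
    // A long-running worker would instead loop on queue.take() in its own thread.
    int drain() {
        int count = 0;
        DocumentUploaded event;
        while ((event = queue.poll()) != null) {
            processed.add(event.documentId()); // placeholder for clean, chunk, store
            count++;
        }
        return count;
    }

    List<String> processed() {
        return List.copyOf(processed);
    }
}
```

Because the upload path only enqueues an event, a slow embedding call never blocks the user's request.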

Message Queue Options

Kafka
RabbitMQ
Amazon SQS
Redis Streams
Google Pub/Sub

ETL Error Handling

Document processing can fail due to:

Unsupported file format
Corrupted PDF
Empty content
Embedding API failure
Vector database failure
Timeouts
Permission issues

Error Handling Flow

ETL Step Fails
      |
      v
Log Safe Error
      |
      v
Retry if Temporary
      |
      v
Move to Failed Queue if Permanent
      |
      v
Notify Admin
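
The retry and failed-queue steps above can be sketched as follows. The sketch assumes temporary failures are signalled by a dedicated exception type and everything else is permanent. RetrySketch, TransientException, and failedQueue are illustrative names, and the delay between attempts is omitted to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

class RetrySketch {

    // Marks failures worth retrying, such as timeouts or rate limits.
    static class TransientException extends RuntimeException {
        TransientException(String message) {
            super(message);
        }
    }

    // Stands in for a dead-letter queue that an admin can inspect.
    final List<String> failedQueue = new ArrayList<>();

    // Runs the step up to maxAttempts times; returns true if it eventually succeeds.
    boolean process(String documentId, Runnable step, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                step.run();
                return true;
            } catch (TransientException e) {
                // Temporary failure: try again (a real worker would back off first).
            } catch (RuntimeException e) {
                break; // permanent failure: retrying will not help
            }
        }
        failedQueue.add(documentId);
        return false;
    }
}
```

Separating temporary from permanent failures keeps a corrupted file from consuming retries that a timed-out embedding call could use.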

ETL Monitoring

Track:

Documents processed
Chunks created
Embedding generation time
Vector store load time
Failed documents
Empty documents
Average chunk size
Index freshness

Production ETL Dashboard

ETL Metrics
   |
   +-- Total documents indexed
   +-- Failed documents
   +-- Average processing time
   +-- Vector store latency
   +-- Last successful run
   +-- Documents pending

Security Best Practices

Scan uploaded files before processing
Validate file type and size
Do not index secrets unnecessarily
Apply tenant metadata
Use access control before retrieval
Encrypt sensitive storage
Do not log full sensitive documents
Remove deleted content from vector store
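
Tenant metadata only protects data if it is checked at retrieval time. The sketch below shows the check as a post-filter over retrieved chunks. In practice it is usually better to push the same condition into the vector store query as a metadata filter so other tenants' chunks are never returned at all. TenantFilterSketch and Chunk are illustrative names.

```java
import java.util.List;
import java.util.Map;

class TenantFilterSketch {

    // Hypothetical stand-in for a retrieved chunk with its metadata.
    record Chunk(String text, Map<String, Object> metadata) {}

    // Keeps only chunks whose tenantId matches; chunks without a tenantId are dropped.
    static List<Chunk> visibleTo(String tenantId, List<Chunk> candidates) {
        return candidates.stream()
                .filter(chunk -> tenantId.equals(chunk.metadata().get("tenantId")))
                .toList();
    }
}
```

Dropping chunks that have no tenant ID is the safer default, because a missing value should never grant access.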

Common ETL Mistakes

1. Indexing Raw Documents Without Cleaning

This reduces retrieval quality.

2. Poor Chunking

Random chunks can break meaning.

3. No Metadata

Without metadata, results are difficult to filter, cite, or debug.

4. No Incremental Updates

Old content may remain in vector search.

5. No Error Handling

A single bad document can break the pipeline.

Best Practices

Extract clean text from documents
Use meaningful chunking
Store rich metadata
Track document version
Use incremental indexing
Monitor ETL failures
Use queues for large workloads
Protect sensitive data
Delete outdated vectors
Test retrieval quality regularly

Interview Questions

Q1: What is an ETL pipeline in AI applications?

It is a process that extracts documents, transforms them into AI-ready chunks, and loads them into a vector store for retrieval.

Q2: Why are document readers important?

They extract useful text from files, databases, APIs, or web pages and convert it into AI-processable documents.

Q3: Why is chunking important?

Chunking improves retrieval quality by splitting large documents into meaningful smaller sections.

Q4: What metadata should be stored?

Source, page number, topic, category, tenant ID, version, updated date, and access level.

Q5: What is incremental ETL?

Incremental ETL processes only new or updated content instead of reprocessing everything.

Advanced Interview Questions

Q1: How do you handle deleted documents in vector databases?

Find and remove related vector chunks or mark them inactive so outdated content is not retrieved.

Q2: Why use queues for ETL?

Queues support scalable, asynchronous document processing and prevent large uploads from blocking API requests.

Q3: How do you secure document ETL?

Validate files, scan uploads, avoid indexing secrets, add tenant metadata, encrypt sensitive data, and enforce access control.

Q4: What causes poor RAG answers after ETL?

Poor chunking, noisy text, missing metadata, outdated vectors, wrong embedding model, or weak prompts.

Q5: How do you monitor ETL pipelines?

Track processed documents, failed documents, chunk count, embedding time, vector load time, and index freshness.

Summary

Document readers and ETL pipelines are essential for building high-quality Spring AI RAG applications. They convert raw files, database records, APIs, and web content into clean, chunked, metadata-rich documents that can be embedded and stored in vector databases.

A strong ETL pipeline improves retrieval quality, reduces hallucination, supports source tracking, improves security, and keeps AI knowledge up to date.

For learning platforms, banking assistants, e-commerce support systems, SaaS knowledge bases, and enterprise AI agents, ETL quality directly affects AI answer quality.