Published: 2026-06-01 • Updated: 2026-06-20

Building Document Readers and ETL Pipelines in Spring AI

Modern AI applications need reliable knowledge before they can provide useful answers. A chat model alone may answer general questions, but enterprise applications usually need answers from documents such as PDFs, Word files, web pages, FAQs, policies, course lessons, product manuals, support tickets, and database records.

To make these documents useful for AI applications, we need a proper document processing flow. This flow is commonly called an ETL pipeline.

ETL stands for:

  • Extract: Read content from documents or data sources
  • Transform: Clean, split, enrich, and prepare the content
  • Load: Store the processed content into a vector database or knowledge store

In Spring AI, document readers and ETL pipelines are the foundation for Retrieval-Augmented Generation (RAG), semantic search, AI knowledge assistants, customer support bots, and agentic AI systems.


What is a Document Reader?

A document reader is a component that reads content from a source and converts it into a format that the AI system can process.

A source can be:

  • PDF file
  • Text file
  • Markdown file
  • HTML page
  • Word document
  • CSV file
  • Database record
  • API response
  • Website content

The reader extracts useful text and converts it into Document objects.


What is an ETL Pipeline?

An ETL pipeline is a structured process that moves data from its raw source format into an AI-ready format.

Raw Documents
      |
      v
Extract Text
      |
      v
Transform and Clean
      |
      v
Split into Chunks
      |
      v
Generate Embeddings
      |
      v
Load into Vector Store

Why ETL Pipelines Are Important for AI

Poor document processing leads to poor AI answers. Even if the language model is powerful, it cannot answer correctly if the retrieved content is messy, outdated, incomplete, or poorly chunked.

A good ETL pipeline improves:

  • Retrieval accuracy
  • RAG answer quality
  • Search relevance
  • Source traceability
  • Content freshness
  • Security filtering
  • AI agent reliability

Simple RAG ETL Architecture

PDF / Docs / Web Pages / Database
              |
              v
Document Reader
              |
              v
Text Cleaner
              |
              v
Text Splitter
              |
              v
Metadata Enricher
              |
              v
Embedding Model
              |
              v
Vector Store
              |
              v
RAG Application

Real-World Learning Platform Example

A learning platform may have:

  • Java course lessons
  • Spring Boot interview questions
  • Docker tutorials
  • Kubernetes articles
  • AI and RAG guides
  • Project documentation

The ETL pipeline reads all this content, splits it into meaningful sections, generates embeddings, and stores them in a vector database. When a learner asks a question, the AI retrieves the most relevant content and answers from the platform's own material.


Real-World Banking Example

A banking support assistant may process:

  • UPI failure policy
  • Loan eligibility rules
  • Credit card FAQs
  • Account statement guide
  • Refund reversal timelines

When a customer asks:

Amount was debited but UPI failed. When will I get my money?

The RAG system searches the vector store and retrieves the failed UPI reversal policy. The AI then answers using verified banking content instead of guessing.


Real-World E-Commerce Example

An e-commerce assistant may process:

  • Refund policy
  • Return policy
  • Warranty rules
  • Delivery timelines
  • Product descriptions

When a user asks:

Can I return a damaged phone after delivery?

The AI retrieves return policy and damaged product policy chunks before generating the answer.


Spring AI Document Object

In Spring AI, document content is usually represented using a Document object.

A document contains:

  • Text content
  • Metadata

Example

Document document = new Document(
        "Spring AI helps Java developers build AI applications.",
        Map.of(
                "source", "spring-ai-course",
                "topic", "spring-ai"
        )
);

Why Metadata Matters

Metadata gives meaning and traceability to document chunks.

Useful metadata includes:

  • Source file
  • Page number
  • Topic
  • Category
  • Author
  • Created date
  • Updated date
  • Tenant ID
  • Access level

Metadata Example

{
  "source": "refund-policy.pdf",
  "page": 3,
  "category": "refund",
  "tenantId": "company-a",
  "updatedDate": "2026-05-20"
}
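
Metadata such as tenantId can also restrict which chunks a caller is allowed to see. The following is a minimal plain-Java sketch of that idea. The TenantFilter class and its Chunk record are hypothetical stand-ins used only for illustration (Spring AI's own Document also carries text plus a metadata map), so treat this as a sketch rather than framework code.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical helper: keeps only the chunks that belong to one tenant. */
final class TenantFilter {

    /** Stand-in for a document chunk: text content plus a metadata map. */
    record Chunk(String text, Map<String, Object> metadata) {}

    /** Returns the candidates whose "tenantId" metadata equals the given tenant. */
    static List<Chunk> visibleTo(String tenantId, List<Chunk> candidates) {
        return candidates.stream()
                // Chunks without a tenantId are hidden rather than shown to everyone.
                .filter(chunk -> tenantId.equals(chunk.metadata().get("tenantId")))
                .toList();
    }
}
```

Hiding chunks that lack a tenantId is a deliberately conservative choice in this sketch. A real system would decide explicitly how shared content is labeled.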

Step 1: Extract Documents

The extract phase reads raw content from files, databases, APIs, or websites.

Source File
    |
    v
Document Reader
    |
    v
Raw Text

Common Document Reader Types

Reader Type        Use Case
-----------        --------
Text Reader        Plain text files
PDF Reader         PDF documents
Markdown Reader    Developer documentation
HTML Reader        Web pages and articles
CSV Reader         Tabular data
Database Reader    Rows from application database
API Reader         External knowledge services

Text Document Reader Example

@Service
public class TextDocumentReaderService {

    public Document readText(String content, String source) {

        return new Document(
                content,
                Map.of(
                        "source", source,
                        "type", "text"
                )
        );
    }
}

PDF Reader Concept

PDF files usually require text extraction before they can be embedded.

PDF File
   |
   v
Extract Text Page by Page
   |
   v
Create Document Objects
   |
   v
Add Page Metadata

PDF Document Example

Document document = new Document(
        "Refunds are processed within 5 to 7 business days.",
        Map.of(
                "source", "refund-policy.pdf",
                "page", "2",
                "category", "refund"
        )
);

Step 2: Transform Documents

The transform phase prepares extracted content for embeddings and retrieval.

Transform operations include:

  • Remove unnecessary spaces
  • Remove headers and footers
  • Remove duplicate content
  • Normalize formatting
  • Fix broken paragraphs
  • Add metadata
  • Split into chunks

Text Cleaning Example

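public String cleanText(String text) {

    if (text == null) {
        return "";
    }

    return text
            // Drop page markers such as "Page 12" that sit on their own line,
            // before line breaks are collapsed, so prose like "see Page 3" is kept.
            .replaceAll("(?m)^\\s*Page \\d+\\s*$", "")
            // Collapse runs of whitespace (including line breaks) into one space.
            .replaceAll("\\s+", " ")
            .trim();
}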

Why Text Cleaning Matters

Raw documents often contain:

  • Repeated headers
  • Footer text
  • Page numbers
  • Broken lines
  • Navigation menus
  • Legal boilerplate
  • Duplicate content

If these are embedded without cleaning, retrieval quality becomes poor.
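
Repeated headers and footers can often be found by counting how many pages contain the same line. The sketch below shows one way to do that in plain Java. The RepeatedLineCleaner class and the minPages threshold are illustrative assumptions, not part of Spring AI, and the approach assumes the text is available page by page.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: drops lines that repeat across many pages (likely headers or footers). */
final class RepeatedLineCleaner {

    /**
     * @param pages    raw text of each page, in order
     * @param minPages a line found on at least this many pages is treated as boilerplate
     * @return the pages with blank lines and boilerplate lines removed
     */
    static List<String> removeRepeatedLines(List<String> pages, int minPages) {
        // Count on how many distinct pages each trimmed line appears.
        Map<String, Integer> pageCounts = new HashMap<>();
        for (String page : pages) {
            Set<String> seenOnPage = new HashSet<>();
            for (String line : page.split("\\R")) {
                String key = line.trim();
                if (!key.isEmpty() && seenOnPage.add(key)) {
                    pageCounts.merge(key, 1, Integer::sum);
                }
            }
        }

        List<String> cleaned = new ArrayList<>();
        for (String page : pages) {
            StringBuilder kept = new StringBuilder();
            for (String line : page.split("\\R")) {
                String key = line.trim();
                if (key.isEmpty() || pageCounts.get(key) >= minPages) {
                    continue; // skip blank lines and repeated boilerplate
                }
                if (kept.length() > 0) {
                    kept.append('\n');
                }
                kept.append(key);
            }
            cleaned.add(kept.toString());
        }
        return cleaned;
    }
}
```

A threshold close to the total page count is the safer starting point, because a low value may also remove legitimate sentences that happen to repeat.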


Step 3: Split Documents into Chunks

Chunking is one of the most important parts of RAG quality.

A large document should be split into smaller meaningful chunks before embedding.

Large Document
      |
      +-- Chunk 1: Introduction
      +-- Chunk 2: Eligibility Rules
      +-- Chunk 3: Refund Timeline
      +-- Chunk 4: Exceptions
      +-- Chunk 5: Contact Support

Why Chunking Is Needed

  • Improves retrieval precision
  • Reduces irrelevant context
  • Controls prompt size
  • Improves answer accuracy
  • Helps cite exact sources

Bad Chunking Example

Chunk 1:
Random first 500 characters from document.

Chunk 2:
Next 500 characters, possibly cutting a sentence in half.

This may break meaning and reduce retrieval quality.


Good Chunking Example

Chunk 1:
Refund eligibility rules.

Chunk 2:
Refund processing timeline.

Chunk 3:
Refund exceptions and rejected cases.

Meaningful chunks give better RAG results.


Simple Java Chunking Example

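// Fixed-size splitting: simple, but it can cut sentences in half.
public List<String> splitText(String text, int chunkSize) {

    // A zero or negative size would never advance the loop correctly.
    if (chunkSize <= 0) {
        throw new IllegalArgumentException("chunkSize must be positive");
    }

    List<String> chunks = new ArrayList<>();

    for (int i = 0; i < text.length(); i += chunkSize) {

        int end = Math.min(i + chunkSize, text.length());

        chunks.add(text.substring(i, end));
    }

    return chunks;
}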

Better Chunking Strategy

For production systems, split by:

  • Headings
  • Paragraphs
  • Sections
  • Pages
  • Business topics
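
As an illustration of paragraph-based splitting, the sketch below packs whole paragraphs into chunks up to a character limit instead of cutting at fixed offsets. ParagraphChunker and its maxChars parameter are hypothetical names. The sketch assumes paragraphs are separated by blank lines and measures size in characters, not tokens.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: packs whole paragraphs into chunks of at most maxChars characters. */
final class ParagraphChunker {

    static List<String> split(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        if (text == null || text.isBlank()) {
            return chunks;
        }

        StringBuilder current = new StringBuilder();
        // Assumption: paragraphs are separated by one or more blank lines.
        for (String paragraph : text.strip().split("\\R\\s*\\R")) {
            String p = paragraph.strip();
            if (p.isEmpty()) {
                continue;
            }
            // Start a new chunk when adding this paragraph would exceed the limit.
            if (current.length() > 0 && current.length() + 2 + p.length() > maxChars) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append("\n\n");
            }
            // A single paragraph longer than maxChars is kept whole, not cut mid-sentence.
            current.append(p);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
```

Keeping an oversized paragraph whole trades a strict size limit for intact meaning. A production splitter would likely fall back to sentence boundaries in that case.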

Chunk Metadata Example

Document chunk = new Document(
        "Refunds are processed within 5 to 7 business days.",
        Map.of(
                "source", "refund-policy.pdf",
                "section", "Refund Timeline",
                "page", "3"
        )
);

Step 4: Generate Embeddings

After chunking, each chunk is converted into an embedding vector.

Document Chunk
      |
      v
Embedding Model
      |
      v
Vector Embedding

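Spring AI uses the EmbeddingModel interface to generate embeddings. Most VectorStore implementations call the configured EmbeddingModel themselves when documents are added, so application code often does not invoke it directly.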


Embedding Concept Example

Text:
Spring AI supports vector stores.

Vector:
[0.11, -0.45, 0.78, ...]
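
Vector stores typically compare such vectors with a similarity measure, and cosine similarity is a common choice. The sketch below shows the calculation in plain Java. The CosineSimilarity class is a hypothetical helper for illustration, since a real vector store performs this comparison internally.

```java
/** Hypothetical helper: cosine similarity between two embedding vectors. */
final class CosineSimilarity {

    /** Returns a value in [-1, 1]; higher means the vectors point in a more similar direction. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // a zero vector has no direction; treat it as unrelated
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```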

Step 5: Load into Vector Store

The load phase stores embedded documents into a vector database.

Vector store options include:

  • PGVector
  • Pinecone
  • MongoDB Atlas Vector Search
  • Redis Vector Search
  • Qdrant
  • Milvus
  • Weaviate
  • Elasticsearch

Vector Store Load Example

@Service
public class VectorLoadService {

    private final VectorStore vectorStore;

    public VectorLoadService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    public void load(List<Document> documents) {
        vectorStore.add(documents);
    }
}

Complete ETL Service Example

@Service
public class DocumentEtlService {

    private final VectorStore vectorStore;

    public DocumentEtlService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    public void processText(String content, String source) {

        String cleanedText = cleanText(content);

        List<String> chunks = splitText(cleanedText, 800);

        List<Document> documents = chunks.stream()
                .map(chunk -> new Document(
                        chunk,
                        Map.of(
                                "source", source,
                                "type", "text"
                        )
                ))
                .toList();

        vectorStore.add(documents);
    }

    private String cleanText(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    private List<String> splitText(String text, int chunkSize) {

        List<String> chunks = new ArrayList<>();

        for (int i = 0; i < text.length(); i += chunkSize) {
            int end = Math.min(i + chunkSize, text.length());
            chunks.add(text.substring(i, end));
        }

        return chunks;
    }
}

ETL Controller Example

@RestController
@RequestMapping("/api/etl")
public class DocumentEtlController {

    private final DocumentEtlService etlService;

    public DocumentEtlController(DocumentEtlService etlService) {
        this.etlService = etlService;
    }

    @PostMapping("/text")
    public String processText(@RequestParam String source,
                              @RequestBody String content) {

        etlService.processText(content, source);

        return "Document processed successfully.";
    }
}

Testing ETL API

curl -X POST "http://localhost:8080/api/etl/text?source=spring-ai-notes" \
-H "Content-Type: text/plain" \
-d "Spring AI helps Java developers build AI applications using chat models, embeddings, vector stores, and RAG."

ETL + RAG Flow

ETL Time:
Documents → Clean → Chunk → Embed → Store

Query Time:
Question → Search Vector Store → Retrieve Chunks → Chat Model → Answer

Building a Database Reader

Many applications store knowledge in relational databases. For example, your website may store courses, interview questions, projects, and articles in MySQL or PostgreSQL.

Database Rows
     |
     v
Read Content
     |
     v
Create Documents
     |
     v
Add Metadata
     |
     v
Store in Vector DB

Database Reader Example

@Service
public class CourseDatabaseReader {

    private final CourseRepository courseRepository;
    private final VectorStore vectorStore;

    public CourseDatabaseReader(CourseRepository courseRepository,
                                VectorStore vectorStore) {
        this.courseRepository = courseRepository;
        this.vectorStore = vectorStore;
    }

    public void indexCourses() {

        List<Document> documents = courseRepository.findAll()
                .stream()
                .map(course -> new Document(
                        course.getTitle() + "\n" + course.getDescription(),
                        Map.of(
                                "type", "course",
                                "courseId", course.getId().toString(),
                                "slug", course.getSlug()
                        )
                ))
                .toList();

        vectorStore.add(documents);
    }
}

Building an HTML Reader

An HTML reader extracts useful article content from web pages and removes navigation, footer, scripts, and styling.

HTML Page
   |
   v
Remove Tags / Scripts
   |
   v
Extract Main Content
   |
   v
Create Document
   |
   v
Store in Vector DB
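
The first two steps of this flow can be sketched with regular expressions from the JDK alone. SimpleHtmlTextExtractor is a hypothetical class, and the approach is deliberately naive: it assumes simple, well-formed pages and decodes only a few entities. Production pipelines usually rely on a real HTML parser such as jsoup instead.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: naive regex-based HTML-to-text conversion for simple pages. */
final class SimpleHtmlTextExtractor {

    // Blocks whose content is not useful as article text.
    private static final Pattern NON_CONTENT_BLOCKS = Pattern.compile(
            "(?is)<(script|style|nav|footer|header)\\b[^>]*>.*?</\\1\\s*>");
    private static final Pattern TAGS = Pattern.compile("(?s)<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String extractText(String html) {
        if (html == null) {
            return "";
        }
        // Remove scripts, styles, and page chrome first, then every remaining tag.
        String text = NON_CONTENT_BLOCKS.matcher(html).replaceAll(" ");
        text = TAGS.matcher(text).replaceAll(" ");
        // Decode only a handful of common entities; "&amp;" goes last to avoid double decoding.
        text = text.replace("&nbsp;", " ")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }
}
```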

Building a CSV Reader

CSV readers are useful for structured datasets.

course_title,description,category
Spring Boot,Backend Java framework,Java
Docker,Container platform,DevOps

Each row can become a document with metadata.


CSV Document Example

Document document = new Document(
        "Spring Boot is a backend Java framework.",
        Map.of(
                "type", "course",
                "category", "Java",
                "source", "courses.csv"
        )
);
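
A row-to-document conversion for the sample file above might look like the following sketch. CourseCsvReader and its CsvDocument record are hypothetical (the record stands in for Spring AI's Document), and the parsing assumes no quoted fields or embedded commas. A CSV library would be the safer choice for real data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: turns simple CSV rows into text plus metadata. */
final class CourseCsvReader {

    /** Stand-in for a document: text content plus a metadata map. */
    record CsvDocument(String text, Map<String, Object> metadata) {}

    static List<CsvDocument> read(String csv, String sourceName) {
        List<CsvDocument> documents = new ArrayList<>();
        String[] lines = csv.strip().split("\\R");
        if (lines.length < 2) {
            return documents; // empty input or header only
        }
        String[] header = lines[0].split(",", -1);
        for (int row = 1; row < lines.length; row++) {
            if (lines[row].isBlank()) {
                continue;
            }
            // Assumption: no quoted fields, so a plain comma split is enough.
            String[] cells = lines[row].split(",", -1);
            Map<String, String> values = new LinkedHashMap<>();
            for (int col = 0; col < header.length && col < cells.length; col++) {
                values.put(header[col].trim(), cells[col].trim());
            }
            String text = values.getOrDefault("course_title", "") + ": "
                    + values.getOrDefault("description", "");
            Map<String, Object> metadata = new LinkedHashMap<>();
            metadata.put("type", "course");
            metadata.put("category", values.getOrDefault("category", ""));
            metadata.put("source", sourceName);
            metadata.put("row", row); // 1-based data row, useful for tracing back to the file
            documents.add(new CsvDocument(text, metadata));
        }
        return documents;
    }
}
```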

Incremental ETL

In production, you should avoid reprocessing every document every time.

Use incremental ETL:

  • Index only new documents
  • Update changed documents
  • Delete removed documents
  • Track document version
  • Track updated timestamp

Incremental ETL Flow

Check Last Indexed Time
        |
        v
Find New or Updated Documents
        |
        v
Process Only Changed Data
        |
        v
Update Vector Store
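
The flow above can be sketched as a filter on each document's updated timestamp. IncrementalSelector, SourceDoc, and the single lastIndexedAt watermark are assumptions made for illustration. A real pipeline would load the documents from its own repository and persist the watermark between runs.

```java
import java.time.Instant;
import java.util.List;

/** Hypothetical helper: selects documents changed since the last indexing run. */
final class IncrementalSelector {

    /** Stand-in for a source record with an update timestamp. */
    record SourceDoc(String id, String content, Instant updatedAt) {}

    /** Returns documents updated after lastIndexedAt; a null watermark means "index everything". */
    static List<SourceDoc> changedSince(List<SourceDoc> all, Instant lastIndexedAt) {
        return all.stream()
                .filter(doc -> lastIndexedAt == null || doc.updatedAt().isAfter(lastIndexedAt))
                .toList();
    }

    /** Returns the watermark to store after the given documents were processed successfully. */
    static Instant nextWatermark(List<SourceDoc> processed, Instant previous) {
        Instant latest = previous;
        for (SourceDoc doc : processed) {
            if (latest == null || doc.updatedAt().isAfter(latest)) {
                latest = doc.updatedAt();
            }
        }
        return latest;
    }
}
```

Advancing the watermark only after a successful load means a failed run is simply retried on the next pass.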

Document Versioning

Versioning helps prevent outdated AI answers.

{
  "source": "refund-policy.pdf",
  "version": "2026-05-20",
  "status": "active"
}

Handling Deleted Documents

If a document is deleted from your main system, remove or deactivate its vector records.

Document Deleted
      |
      v
Find Related Vector Chunks
      |
      v
Delete from Vector Store
      |
      v
Prevent Outdated Retrieval
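
Finding the related chunks is easier if the pipeline records which chunk IDs it created for each source. The sketch below keeps that mapping in memory. ChunkRegistry is a hypothetical class, and a real system would persist the mapping or store the source ID as chunk metadata.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: remembers which chunk IDs belong to each source document. */
final class ChunkRegistry {

    private final Map<String, List<String>> chunkIdsBySource = new HashMap<>();

    /** Record a chunk ID at indexing time. */
    void register(String sourceId, String chunkId) {
        chunkIdsBySource.computeIfAbsent(sourceId, key -> new ArrayList<>()).add(chunkId);
    }

    /** Forget a deleted source and return its chunk IDs so they can be removed from the vector store. */
    List<String> removeSource(String sourceId) {
        List<String> ids = chunkIdsBySource.remove(sourceId);
        return ids == null ? List.of() : List.copyOf(ids);
    }
}
```

The returned IDs can then be handed to the vector store's delete-by-ID operation. Spring AI's VectorStore exposes a delete method that accepts a list of document IDs, though the exact signature should be checked for the version in use.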

ETL Scheduling

ETL pipelines can run:

  • Immediately after upload
  • Every hour
  • Every night
  • After database changes
  • Through message queues

Scheduled ETL Example

// Runs every day at 02:00; scheduling must be enabled with @EnableScheduling.
@Scheduled(cron = "0 0 2 * * *")
public void runDailyIndexing() {
    courseDatabaseReader.indexCourses();
}

Event-Driven ETL

For scalable systems, use event-driven processing.

Document Uploaded
      |
      v
Publish Event
      |
      v
ETL Worker Consumes Event
      |
      v
Process Document
      |
      v
Store Vectors

Message Queue Options

  • Kafka
  • RabbitMQ
  • Amazon SQS
  • Redis Streams
  • Google Pub/Sub

ETL Error Handling

Document processing can fail due to:

  • Unsupported file format
  • Corrupted PDF
  • Empty content
  • Embedding API failure
  • Vector database failure
  • Timeouts
  • Permission issues

Error Handling Flow

ETL Step Fails
      |
      v
Log Safe Error
      |
      v
Retry if Temporary
      |
      v
Move to Failed Queue if Permanent
      |
      v
Notify Admin
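
One possible shape for the retry and failed-queue steps is sketched below. RetryingProcessor and TransientEtlException are hypothetical names, and the in-memory list only stands in for a real failed queue. Waiting between attempts and notifying an admin are left out to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: retries temporary failures and parks documents that keep failing. */
final class RetryingProcessor {

    /** Hypothetical marker for failures worth retrying, for example a timeout. */
    static class TransientEtlException extends RuntimeException {
        TransientEtlException(String message) {
            super(message);
        }
    }

    private final List<String> failedDocuments = new ArrayList<>();

    /** Runs the step up to maxAttempts times and returns true on success. */
    boolean process(String documentId, Consumer<String> step, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                step.accept(documentId);
                return true;
            } catch (TransientEtlException e) {
                // Temporary problem: try again (a real system would back off before retrying).
            } catch (RuntimeException e) {
                break; // permanent problem: retrying will not help
            }
        }
        failedDocuments.add(documentId); // stands in for a failed or dead-letter queue
        return false;
    }

    /** Documents that could not be processed and need attention. */
    List<String> failedDocuments() {
        return List.copyOf(failedDocuments);
    }
}
```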

ETL Monitoring

Track:

  • Documents processed
  • Chunks created
  • Embedding generation time
  • Vector store load time
  • Failed documents
  • Empty documents
  • Average chunk size
  • Index freshness

Production ETL Dashboard

ETL Metrics
   |
   +-- Total documents indexed
   +-- Failed documents
   +-- Average processing time
   +-- Vector store latency
   +-- Last successful run
   +-- Documents pending
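
A few of these metrics can be collected with plain counters, as in the sketch below. EtlMetrics is a hypothetical, non-thread-safe class meant only to show which values are worth recording. A Spring Boot application would more likely publish them through a metrics library such as Micrometer.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper: in-memory counters for one ETL worker (not thread-safe). */
final class EtlMetrics {

    private long documentsProcessed;
    private long documentsFailed;
    private long chunksCreated;
    private long totalChunkChars;
    private Instant lastSuccessfulRun;

    /** Record one successfully indexed document. */
    void recordSuccess(int chunkCount, long chunkChars, Instant finishedAt) {
        documentsProcessed++;
        chunksCreated += chunkCount;
        totalChunkChars += chunkChars;
        lastSuccessfulRun = finishedAt;
    }

    /** Record one document that could not be indexed. */
    void recordFailure() {
        documentsFailed++;
    }

    long documentsProcessed() {
        return documentsProcessed;
    }

    long documentsFailed() {
        return documentsFailed;
    }

    /** Average chunk size in characters, or 0 when nothing has been indexed yet. */
    double averageChunkSize() {
        return chunksCreated == 0 ? 0.0 : (double) totalChunkChars / chunksCreated;
    }

    /** Index freshness: time since the last successful run, or null if there was none. */
    Duration indexAge(Instant now) {
        return lastSuccessfulRun == null ? null : Duration.between(lastSuccessfulRun, now);
    }
}
```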

Security Best Practices

  • Scan uploaded files before processing
  • Validate file type and size
  • Do not index secrets unnecessarily
  • Apply tenant metadata
  • Use access control before retrieval
  • Encrypt sensitive storage
  • Do not log full sensitive documents
  • Remove deleted content from vector store
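
The file type and size check from the list above could be sketched as follows. UploadValidator, its allowed extensions, and the 10 MiB limit are illustrative assumptions. Checking the extension alone does not prove what a file contains, so this complements scanning rather than replacing it.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: basic checks on an uploaded file before it enters the ETL pipeline. */
final class UploadValidator {

    // Illustrative values; choose them to match the readers the pipeline actually supports.
    private static final Set<String> ALLOWED_EXTENSIONS = Set.of("pdf", "txt", "md", "html", "csv");
    private static final long MAX_BYTES = 10L * 1024 * 1024; // 10 MiB

    static boolean isAcceptable(String fileName, long sizeBytes) {
        if (fileName == null || sizeBytes <= 0 || sizeBytes > MAX_BYTES) {
            return false;
        }
        // Reject path components so a name like "../secret.pdf" cannot leave the upload directory.
        if (fileName.contains("/") || fileName.contains("\\") || fileName.contains("..")) {
            return false;
        }
        int dot = fileName.lastIndexOf('.');
        if (dot <= 0 || dot == fileName.length() - 1) {
            return false; // no base name or no extension
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return ALLOWED_EXTENSIONS.contains(extension);
    }
}
```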

Common ETL Mistakes

1. Indexing Raw Documents Without Cleaning

This reduces retrieval quality.

2. Poor Chunking

Random chunks can break meaning.

3. No Metadata

Without metadata, chunks are difficult to filter, cite, or debug.

4. No Incremental Updates

Old content may remain in vector search.

5. No Error Handling

A single bad document can break the pipeline.


Best Practices

  • Extract clean text from documents
  • Use meaningful chunking
  • Store rich metadata
  • Track document version
  • Use incremental indexing
  • Monitor ETL failures
  • Use queues for large workloads
  • Protect sensitive data
  • Delete outdated vectors
  • Test retrieval quality regularly

Interview Questions

Q1: What is an ETL pipeline in AI applications?

It is a process that extracts documents, transforms them into AI-ready chunks, and loads them into a vector store for retrieval.

Q2: Why are document readers important?

They extract useful text from files, databases, APIs, or web pages and convert it into AI-processable documents.

Q3: Why is chunking important?

Chunking improves retrieval quality by splitting large documents into meaningful smaller sections.

Q4: What metadata should be stored?

Source, page number, topic, category, tenant ID, version, updated date, and access level.

Q5: What is incremental ETL?

Incremental ETL processes only new or updated content instead of reprocessing everything.


Advanced Interview Questions

Q1: How do you handle deleted documents in vector databases?

Find and remove related vector chunks or mark them inactive so outdated content is not retrieved.

Q2: Why use queues for ETL?

Queues support scalable, asynchronous document processing and prevent large uploads from blocking API requests.

Q3: How do you secure document ETL?

Validate files, scan uploads, avoid indexing secrets, add tenant metadata, encrypt sensitive data, and enforce access control.

Q4: What causes poor RAG answers after ETL?

Poor chunking, noisy text, missing metadata, outdated vectors, wrong embedding model, or weak prompts.

Q5: How do you monitor ETL pipelines?

Track processed documents, failed documents, chunk count, embedding time, vector load time, and index freshness.


Summary

Document readers and ETL pipelines are essential for building high-quality Spring AI RAG applications. They convert raw files, database records, APIs, and web content into clean, chunked, metadata-rich documents that can be embedded and stored in vector databases.

A strong ETL pipeline improves retrieval quality, reduces hallucination, supports source tracking, improves security, and keeps AI knowledge up to date.

For learning platforms, banking assistants, e-commerce support systems, SaaS knowledge bases, and enterprise AI agents, ETL quality directly affects AI answer quality.

About the Author

Naresh Kumar

Senior Java Backend Engineer experienced in Banking, Payments, ISO 20022, Spring Boot, Microservices, Kafka, Docker, Kubernetes, AWS and Cloud Native Systems.

Built enterprise payment solutions, transaction processing systems, API platforms and scalable microservices used in production.