Pipeline steps (Innovation Release)

The pipeline converts raw data into structured, AI-ready vectors through a series of modular transformation steps, ensuring each file is cleaned, parsed, and optimized for retrieval.

The following steps are available for use in a pipeline:

  • Chunking: To fit within LLM context windows, the parsed text is divided into smaller segments.

    • Chunk size: Determines the length of each segment. Smaller chunks are better for precise fact retrieval, while larger chunks preserve more surrounding context in each match.

    • Overlap: A portion of text is repeated between adjacent chunks to ensure that context isn't lost at the seams of a document.
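    The interaction of chunk size and overlap can be illustrated with plain SQL. This is an illustration only, not the implementation of the ChunkText step: it splits a sample string into 20-character chunks whose starts are 15 characters apart, so each adjacent pair of chunks shares a 5-character overlap.

    ```sql
    -- Illustration only (standard PostgreSQL, not the ChunkText step itself):
    -- chunk_size = 20, overlap = 5, so the stride is chunk_size - overlap = 15.
    WITH doc AS (
      SELECT 'PostgreSQL is a powerful, open source object-relational database.' AS body
    )
    SELECT
      n                                  AS chunk_no,
      substr(body, (n - 1) * 15 + 1, 20) AS chunk   -- last 5 chars repeat in the next chunk
    FROM doc,
         generate_series(1, (length(body) + 14) / 15) AS n;
    ```

    Because each chunk repeats the tail of the previous one, a sentence that straddles a chunk boundary still appears intact in at least one chunk.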

  • Parsing: Once text is accessible, this step identifies the structural elements of the document, such as headings, tables, and metadata. This distinction helps the pipeline understand the hierarchy of the information before it is broken down.
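    As a mental model (the actual output schema of the ParseHtml and ParsePdf steps may differ — this table layout is hypothetical), a parsed document can be pictured as one row per structural element, preserving document order and hierarchy:

    ```sql
    -- Hypothetical shape of parsed output; the real ParseHtml/ParsePdf
    -- schema may differ. One row per structural element.
    CREATE TEMP TABLE parsed_elements (
      element_id int,     -- position in document order
      parent_id  int,     -- enclosing element; NULL for top level
      kind       text,    -- 'heading', 'paragraph', 'table', 'metadata', ...
      content    text
    );

    INSERT INTO parsed_elements VALUES
      (1, NULL, 'heading',   'Quarterly report'),
      (2, 1,    'paragraph', 'Revenue grew 12% year over year.'),
      (3, 1,    'table',     'quarter,revenue' || E'\n' || 'Q1,4.1M' || E'\n' || 'Q2,4.6M');
    ```

    Keeping the hierarchy explicit is what lets a later chunking step avoid splitting, say, a table away from the heading that describes it.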

  • Performing OCR: This step transforms visual image data into queryable text strings by leveraging specialized AI models such as NVIDIA NIM PaddleOCR. The workflow involves registering an OCR-capable model with your API credentials and calling the aidb.perform_ocr() function on image bytes. The system then automatically unnests the results, returning a table in which each detected text block is assigned a part_id to preserve its original sequence and structure, effectively turning non-searchable images into structured database records.
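  A minimal sketch of that workflow, assuming a model has already been registered under a name of your choosing (the model name and the images table here are hypothetical, and the exact signature of aidb.perform_ocr() may differ):

  ```sql
  -- Sketch: OCR a stored image and read back the detected text blocks.
  -- 'my_paddleocr_model' and the images table are hypothetical; check the
  -- aidb reference for the exact perform_ocr() signature.
  SELECT ocr.part_id, ocr.text
  FROM   images AS i,
         LATERAL aidb.perform_ocr('my_paddleocr_model', i.image_bytes) AS ocr
  WHERE  i.id = 42
  ORDER  BY ocr.part_id;  -- part_id preserves the original reading order
  ```

  Because the result is an ordinary table, the recovered text can flow straight into the chunking and embedding steps like any other parsed document.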

  • Summarizing: To provide better context for large documents, an optional summarization step can generate a high-level overview of the content. This summary can be embedded alongside the raw text to improve retrieval accuracy during a RAG (Retrieval-Augmented Generation) flow.
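  Conceptually, the generated overview is just another text column stored next to the source document. The function call below is purely hypothetical (the real SummarizeText step is configured as part of a pipeline, and no aidb.summarize_text() signature is documented here); it only illustrates where the summary would live:

  ```sql
  -- Hypothetical illustration only: aidb.summarize_text() is an assumed
  -- name, not a documented aidb function.
  UPDATE documents
  SET    summary = aidb.summarize_text(body)
  WHERE  id = 7;
  ```

  Embedding the summary alongside the raw chunks gives the retriever a "whole-document" vector to match against broad queries that no single chunk answers well.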

  • Embedding (knowledge base): The final stage of the pipeline is the knowledge base, a vector-indexed repository that enables your applications to perform semantic searches. Unlike traditional keyword matching, this method allows you to retrieve information based on conceptual meaning and context.

    This abstraction automates the heavy lifting of AI modeling by managing embeddings and vector indexes for you. It is highly versatile, supporting both standard text models and multi-modal models like CLIP (for image and text cross-referencing). Once your data is indexed, you can use the aidb.retrieve_key() and aidb.retrieve_text() functions to perform high-speed semantic searches against the stored knowledge.
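    A hedged sketch of a retrieval query (the knowledge base name and query text are placeholders, and this assumes the functions take a knowledge base name, a query string, and a result count — verify the exact signatures in the aidb reference):

    ```sql
    -- Assumed signature: (knowledge_base_name, query_text, result_count).
    -- Returns the keys of the most semantically similar entries.
    SELECT *
    FROM   aidb.retrieve_key('my_kb', 'refund policy for enterprise customers', 5);

    -- retrieve_text() additionally returns the matched source text,
    -- ready to be passed to an LLM as RAG context.
    SELECT *
    FROM   aidb.retrieve_text('my_kb', 'refund policy for enterprise customers', 5);
    ```

    The query string is embedded with the same model used at indexing time, so matches are ranked by vector similarity rather than by shared keywords.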

Learn how to configure each pipeline step — ChunkText, ParseHtml, ParsePdf, PerformOcr, SummarizeText, and KnowledgeBase — when building AI pipelines in EDB Postgres AI.