Data preparation SQL functions (Innovation Release)

Each pipeline step operation is also available as a standalone SQL function. You can call these functions directly in queries for one-off transformations, for testing and exploration, or to build custom workflows outside of the pipeline framework.

| Function | Operation | Description |
|----------|-----------|-------------|
| aidb.chunk_text() | ChunkText | Divides long text into smaller, semantically coherent segments. |
| aidb.parse_html() | ParseHtml | Extracts readable text from HTML, stripping tags while preserving structure. |
| aidb.parse_pdf() | ParsePdf | Extracts text from binary PDF data, with page-level part_id output. |
| aidb.perform_ocr() | PerformOcr | Extracts text from images using an OCR-capable AI model. |
| aidb.summarize_text() | SummarizeText | Generates concise summaries of long text passages using an AI model. |
| aidb.summarize_text_aggregate() | SummarizeText | Summarizes text across multiple rows using a SQL aggregate pattern. |

Note: To use these operations as steps inside a pipeline, see Pipeline steps.

Text chunking

The chunking step divides long text into smaller segments based on configurable parameters, optimizing it for processing by LLMs and embedding in knowledge bases.
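
As a rough illustration of calling the chunking function on its own, the query below passes a literal string and a JSON options object to aidb.chunk_text(). The two-argument form and the option keys desired_length and max_length are assumptions made for this sketch; check the function reference for the exact signature in your release.

```sql
-- Sketch only. Assumed shape: aidb.chunk_text(input text, options jsonb).
-- The option keys "desired_length" and "max_length" are assumptions.
SELECT *
FROM aidb.chunk_text(
    'This is a longer passage that may need to be split into smaller chunks. '
    'Each chunk should stay within a size limit so that it can be embedded '
    'or sent to an LLM.',
    '{"desired_length": 60, "max_length": 80}'::jsonb
);
```

Selecting with * avoids depending on specific output column names, which makes the query convenient for exploring how different length settings change the resulting segments.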

Data parsing

The parsing step extracts structured text from various formats (like HTML and PDF) using AI models, preparing it for downstream processing in the pipeline.
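
The following sketch shows how the two parsing functions from the table above might be called directly. It assumes that each function accepts the input as its first argument and that any options argument can be omitted. The table my_pdfs and its columns id and content (of type bytea) are hypothetical names used only for illustration.

```sql
-- Sketch only. Assumes the options argument is optional for both functions.

-- Parse an inline HTML string.
SELECT *
FROM aidb.parse_html(
    '<html><body><h1>Title</h1><p>First paragraph.</p></body></html>'
);

-- Parse PDFs stored in a hypothetical table my_pdfs(id, content bytea).
-- Each source row can produce several output rows, one per part_id.
SELECT d.id, p.*
FROM my_pdfs AS d
CROSS JOIN LATERAL aidb.parse_pdf(d.content) AS p;
```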

Performing OCR

The OCR step extracts text from images using AI models, enabling the conversion of visual data into searchable text for indexing in knowledge bases.
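
A minimal sketch of a direct OCR call is shown below. It assumes that aidb.perform_ocr() takes the image bytes followed by a JSON options object whose model key names an OCR-capable model that is already registered. The model name my_ocr_model and the table my_images with columns id and image_data (of type bytea) are hypothetical.

```sql
-- Sketch only. Assumed shape: aidb.perform_ocr(input bytea, options jsonb).
-- "my_ocr_model" and my_images(id, image_data bytea) are hypothetical names.
SELECT i.id, o.*
FROM my_images AS i
CROSS JOIN LATERAL aidb.perform_ocr(
    i.image_data,
    '{"model": "my_ocr_model"}'::jsonb
) AS o;
```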

Text summarizing

The summarizing step generates concise summaries of long text passages using AI models, improving retrieval accuracy in RAG applications.
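
The sketch below illustrates the two summarizing functions from the table above: one call on a single string, and one aggregate call across rows. The argument order (text first, then a JSON options object with a model key), the model name my_summarizer_model, and the table my_articles with columns id and body are all assumptions for illustration.

```sql
-- Sketch only. Assumed shape: aidb.summarize_text(input text, options jsonb).
-- "my_summarizer_model" is a hypothetical, already-registered model.
SELECT aidb.summarize_text(
    'PostgreSQL is an open source relational database. It supports '
    'extensions that add new data types, functions, and index methods.',
    '{"model": "my_summarizer_model"}'::jsonb
) AS summary;

-- Aggregate form over a hypothetical table my_articles(id, body text).
-- ORDER BY inside the call fixes the order in which rows are fed in.
SELECT aidb.summarize_text_aggregate(
    body,
    '{"model": "my_summarizer_model"}'::jsonb
    ORDER BY id
) AS summary
FROM my_articles;
```

The aggregate form fits cases where the text to summarize is already split across rows, for example after chunking or page-level PDF parsing.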

Embedding

The embedding step transforms processed text or image data into vector representations using AI models, creating a searchable knowledge base for semantic retrieval in RAG applications.