Data preparation SQL functions Innovation Release
Each pipeline step operation is also available as a standalone SQL function. You can call these functions directly in queries for one-off transformations, for testing and exploration, or to build custom workflows outside of the pipeline framework.
| Function | Operation | Description |
|---|---|---|
| aidb.chunk_text() | ChunkText | Divides long text into smaller, semantically coherent segments. |
| aidb.parse_html() | ParseHtml | Extracts readable text from HTML, stripping tags while preserving structure. |
| aidb.parse_pdf() | ParsePdf | Extracts text from binary PDF data, with page-level part_id output. |
| aidb.perform_ocr() | PerformOcr | Extracts text from images using an OCR-capable AI model. |
| aidb.summarize_text() | SummarizeText | Generates concise summaries of long text passages using an AI model. |
| aidb.summarize_text_aggregate() | SummarizeText | Summarizes text across multiple rows using a SQL aggregate pattern. |
Note
To use these operations as steps inside a pipeline, see Pipeline steps.
Text chunking
The chunking step divides long text into smaller segments based on configurable parameters, optimizing it for processing by LLMs and embedding in knowledge bases.
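To make length-bounded chunking concrete, here is a minimal Python sketch of the general idea. It is an illustration only and not how aidb.chunk_text() is implemented: the function name `chunk_by_length` and the `max_length` parameter are hypothetical, and the sketch splits on whitespace rather than on semantic boundaries.

```python
def chunk_by_length(text: str, max_length: int) -> list[str]:
    """Greedily pack whitespace-separated words into chunks.

    Hypothetical helper for illustration: each chunk is at most
    max_length characters, except when a single word is longer than
    max_length, in which case that word is kept whole.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")

    chunks: list[str] = []
    current = ""
    for word in text.split():
        # Try to append the next word to the chunk being built.
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_length:
            current = candidate
        else:
            # The word does not fit: close the current chunk and start a new one.
            if current:
                chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = (
        "Pipelines prepare source text. Chunks keep each segment small. "
        "Smaller segments are easier to embed."
    )
    # Number the chunks, similar in spirit to a part_id column.
    for part_id, chunk in enumerate(chunk_by_length(sample, max_length=40)):
        print(part_id, chunk)
```

A production chunker would typically prefer sentence or paragraph boundaries and may overlap neighboring chunks; this sketch only shows how a size parameter bounds each segment.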
Data parsing
The parsing step extracts structured text from various formats (like HTML and PDF) using AI models, preparing it for downstream processing in the pipeline.
Performing OCR
The OCR step extracts text from images using AI models, enabling the conversion of visual data into searchable text for indexing in knowledge bases.
Text summarizing
The summarizing step generates concise summaries of long text passages using AI models, improving retrieval accuracy in RAG applications.
Embedding
The embedding step transforms processed text or image data into vector representations using AI models, creating a searchable knowledge base for semantic retrieval in RAG applications.