Text chunking
Use the aidb.chunk_text() function to intelligently divide long strings into smaller, semantically coherent segments. This process is essential for staying within the context window limits of Large Language Models (LLMs) and optimizing vector storage.
SELECT * FROM aidb.chunk_text(
    input => 'This is a significantly longer text example that might require splitting into smaller chunks. The purpose of this function is to partition text data into segments of a specified maximum length, for example, this sentence is 145 characters. This enables processing or storage of data in manageable parts.',
    options => '{"desired_length": 120, "max_length": 150}'
);
 part_id | chunk
---------+---------------------------------------------------------------------------------------------------------------------------------------------------
       0 | This is a significantly longer text example that might require splitting into smaller chunks.
       1 | The purpose of this function is to partition text data into segments of a specified maximum length, for example, this sentence is 145 characters.
       2 | This enables processing or storage of data in manageable parts.
(3 rows)

Configuration options
- `desired_length` (required): The target size for each chunk. The splitter attempts to reach this size while preserving semantic boundaries. If `max_length` is omitted, this value becomes a strict upper limit.
- `max_length` (optional): The upper bound for chunk size. If specified, the function tries to generate chunks close to `desired_length` but may extend up to `max_length` to preserve larger semantic units (such as full sentences or paragraphs). Chunks exceed `desired_length` only when necessary to avoid cutting across meaningful boundaries. The unit depends on the `strategy` used.
  - Specifying `desired_length = max_length` results in fixed-size chunks (for example, when filling a context window exactly for embeddings).
  - Use a larger `max_length` to stay within a soft limit while allowing some flexibility to preserve higher semantic units, which is common in RAG, summarization, or Q&A applications.
- `overlap_length` (optional): The amount of content to overlap between consecutive chunks. This helps preserve context across chunk boundaries by duplicating a portion of the content from the end of one chunk at the beginning of the next. Defaults to 0 (no overlap). The unit depends on the `strategy` used.
- `strategy` (optional): The chunking strategy to use. Can be either `"chars"` (default) for character-based chunking that splits strictly by character count, or `"words"` for word-based chunking that preserves word boundaries. The strategy determines the unit for `desired_length`, `max_length`, and `overlap_length`: characters for `"chars"`, words for `"words"`.
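As a sketch of how these options combine, the call below requests word-based chunks of roughly 80 words (never more than 100), with a 10-word overlap carried between consecutive chunks. The parameter names follow the documentation above; the input text and size values are placeholders.

```sql
-- Word-based chunking with a 10-word overlap between consecutive chunks.
-- Because strategy is "words", desired_length, max_length, and
-- overlap_length are all counted in words rather than characters.
SELECT * FROM aidb.chunk_text(
    input   => 'Your long document text goes here ...',
    options => '{"desired_length": 80, "max_length": 100, "overlap_length": 10, "strategy": "words"}'
);
```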
Algorithm summary
Text is split using a hierarchy of semantic boundaries: characters, graphemes, words, sentences, and increasingly long newline sequences (for example, paragraphs).
The splitter attempts to form the largest semantically valid chunk that fits within the specified size range.
Chunks may be returned that are shorter than `desired_length` if adding the next semantic unit would exceed `max_length`.
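The greedy behavior described above can be illustrated with a simplified sketch. This is not the aidb implementation: it uses only one boundary level (sentences) instead of the full hierarchy, and counts characters. It shows why chunks can fall short of `desired_length`: a chunk is emitted as soon as the next semantic unit would push it past `max_length`.

```python
import re

def chunk_text_sketch(text, desired_length, max_length=None):
    """Illustrative greedy splitter (not the aidb algorithm).

    Accumulates semantic units (here, sentences) into the current
    chunk while it stays within max_length, and emits the chunk once
    adding the next unit would overflow. If max_length is omitted,
    desired_length acts as a strict upper limit.
    """
    if max_length is None:
        max_length = desired_length
    # One boundary level only: split on sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_length:
            # Next unit would exceed max_length: emit the chunk even
            # though it may be shorter than desired_length.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Running this sketch on the example text with `desired_length = 120` and `max_length = 150` yields the same three sentence-aligned chunks as the SQL example above.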
Tip
This operation transforms the shape of the data, automatically unnesting collections by introducing a `part_id` column. See the unnesting concept for more details.