Text chunking (Innovation Release)

Use the aidb.chunk_text() function to intelligently divide long strings into smaller, semantically coherent segments. This process is essential for staying within the context window limits of Large Language Models (LLMs) and optimizing vector storage.

SELECT * FROM aidb.chunk_text(
    input => 'This is a significantly longer text example that might require splitting into smaller chunks. The purpose of this function is to partition text data into segments of a specified maximum length, for example, this sentence is 145 characters. This enables processing or storage of data in manageable parts.',
    options => '{"desired_length": 120, "max_length": 150}'
);
Output
 part_id |                                                                       chunk
---------+---------------------------------------------------------------------------------------------------------------------------------------------------
       0 | This is a significantly longer text example that might require splitting into smaller chunks.
       1 | The purpose of this function is to partition text data into segments of a specified maximum length, for example, this sentence is 145 characters.
       2 | This enables processing or storage of data in manageable parts.
(3 rows)

Configuration options

  • desired_length (required): The target size for each segment. The splitter attempts to reach this size while preserving semantic boundaries. If max_length is omitted, desired_length becomes a strict upper limit. The unit depends on the strategy used.

  • max_length (optional): The upper bound for chunk size. If specified, the function will try to generate chunks close to desired_length but may extend up to max_length to preserve larger semantic units (like full sentences or paragraphs). Chunks will exceed desired_length only when it's necessary to avoid cutting across meaningful boundaries. The unit depends on the strategy used.

    • Setting desired_length equal to max_length produces fixed-size chunks, for example when filling a context window exactly for embeddings.

    • Use a larger max_length if you want to stay within a soft limit but allow some flexibility to preserve higher semantic units, common in RAG, summarization, or Q&A applications.

  • overlap_length (optional): The amount of content to overlap between consecutive chunks. This helps preserve context across chunk boundaries by duplicating a portion of the content from the end of one chunk to the beginning of the next chunk. Defaults to 0 (no overlap). The unit depends on the strategy used.

  • strategy (optional): The chunking strategy to use. Can be either "chars" (default) for character-based chunking that splits strictly by character count, or "words" for word-based chunking that preserves word boundaries. This determines the unit for desired_length, max_length, and overlap_length: characters for "chars", words for "words".
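For instance, the options above can be combined to chunk by words with a small overlap. The option values below are illustrative; with strategy set to "words", all three length options are measured in words:

```sql
-- Word-based chunking with overlap: desired_length, max_length, and
-- overlap_length are all counted in words rather than characters.
SELECT * FROM aidb.chunk_text(
    input => 'This is a significantly longer text example that might require splitting into smaller chunks.',
    options => '{"desired_length": 8, "max_length": 10, "overlap_length": 2, "strategy": "words"}'
);
```

With overlap_length set to 2, consecutive chunks share roughly two words at their boundary, which helps downstream retrieval keep context across splits.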

Algorithm summary

  • Text is split using a hierarchy of semantic boundaries: characters, graphemes, words, sentences, and increasingly long newline sequences (for example, paragraphs).

  • The splitter attempts to form the largest semantically valid chunk that fits within the specified size range.

  • Chunks shorter than desired_length may be returned if adding the next semantic unit would exceed max_length.
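A sketch of the strict case described above: when desired_length equals max_length under the default "chars" strategy, the size becomes a hard cap, so every returned chunk is at most that many characters, and where each split lands depends on the boundary hierarchy and the input text:

```sql
-- Fixed-size chunking: desired_length = max_length makes 20 characters
-- a hard upper limit for every chunk.
SELECT * FROM aidb.chunk_text(
    input => 'A short example sentence for fixed-size chunking.',
    options => '{"desired_length": 20, "max_length": 20}'
);
```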

Tip

This operation transforms the shape of the data, automatically unnesting collections by introducing a part_id column. See the unnesting concept for more details.
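In practice, the function is often applied to a table column through a lateral join, so each source row fans out into one row per chunk. The documents table and its columns below are hypothetical:

```sql
-- Hypothetical table "documents" with columns id and body.
-- Each source row produces multiple result rows, one per chunk,
-- distinguished by the part_id column.
SELECT d.id, c.part_id, c.chunk
FROM documents d,
     LATERAL aidb.chunk_text(
         input => d.body,
         options => '{"desired_length": 120, "max_length": 150}'
     ) AS c;
```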