3,528 documents
Ground-truth chunking for benchmarking using Voronoi boundaries with word alignment.
Langroid's [`ParsingConfig`][langroid.parsing.parser.ParsingConfig]
This document defines a proposed chunking and indexing strategy for semantic search in `pginbox`.
Chunking strategies are critical for dividing large texts into manageable parts, enabling effective content processing and extraction. These strategies are foundational in cosine similarity-based extraction techniques, which allow users to retrieve only the most relevant chunks of content for a given query. Additionally, they facilitate direct integration into RAG (Retrieval-Augmented Generation) systems for structured and scalable workflows.
CodeRAG uses Abstract Syntax Tree (AST) parsing to split code into semantic chunks rather than arbitrary character or line-based splits. This produces more meaningful search units.
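To illustrate the general idea of AST-based splitting, here is a minimal sketch using Python's standard `ast` module. It is not CodeRAG's implementation: the function name `chunk_python_source`, the chunk dictionary layout, and the choice to emit one chunk per top-level function or class are all assumptions made for the example.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class (hypothetical helper)."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based and end_lineno is inclusive (Python 3.8+).
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append(
                {"name": node.name, "kind": type(node).__name__, "text": text}
            )
    return chunks


SAMPLE = '''import os

def add(a, b):
    return a + b

class Greeter:
    def hello(self):
        return "hi"
'''

chunks = chunk_python_source(SAMPLE)
for chunk in chunks:
    print(chunk["kind"], chunk["name"])
```

Each chunk is a complete definition, so a search hit returns a whole function or class rather than a fragment cut at an arbitrary character offset. Module-level statements such as imports are skipped in this sketch; a real indexer would need a policy for them.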
LLMs have context limits. You can't pass an entire 200-page SEC filing to an LLM for entity extraction. Documents must be broken into smaller pieces—**chunks**—that fit within processing limits.
**Supersedes:** Previous `extractChunks()` + `splitLargeChunk()` approach
The original system created too many tiny chunks (14 chunks for 1,793 characters), which fragmented context and reduced answer quality. The new **adaptive chunking system** handles all document types and chooses a chunk size suited to each.
<!-- markdownlint-disable-file MD029 MD036 MD026 -->
Instead of predicting one action at a time, predict a sequence of actions (chunk). This captures temporal structure and is a key idea from ACT (Action Chunking with Transformers) that carries into Pi0.
title: "Text Chunking Strategies for RAG Applications"
*Status: Accepted – 2025-01-27*
ProcessorBase["ProcessorBase"]
Issue #368 — Smoothing-based topic boundary detection for memory chunking.
This document explains the *current* message flushing/chunking pipeline used by TomoriBot when streaming model output to Discord.
title: Row-Parallel Chunking
This guide walks through the two building blocks of the GPT chat knowledge base:
**Critical Bug**: [clustering_rpn.py:43-48](knowledge3d/cranium/clustering_rpn.py#L43-L48) was truncating 128-dimensional embeddings to **4 dimensions**:
← Back to flow index: [`docs/ai-playbook/flows.md`](../flows.md)
> **Positioning**: Half of a vector database's retrieval quality depends on the chunking strategy. Chunks too large fill search results with irrelevant content; chunks too small fracture semantics. This chapter covers MemPalace's two chunking strategies -- fixed windows for project files, Q&A pairs for conversations -- and why conversation text cannot use fixed windows.
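
The contrast between the two strategies can be sketched as follows. This is an illustrative sketch only, not MemPalace's code: the function names `fixed_windows` and `qa_pairs`, the `(role, text)` turn format, and the window and overlap parameters are assumptions made for the example.

```python
def fixed_windows(text: str, size: int, overlap: int = 0) -> list[str]:
    """Cut text into windows of `size` characters, each sharing `overlap` characters with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [text[i : i + size] for i in range(0, len(text), step)]


def qa_pairs(turns: list[tuple[str, str]]) -> list[str]:
    """Group each user turn with the assistant turn that follows it."""
    chunks = []
    pending = None  # a question still waiting for its answer
    for role, text in turns:
        if role == "user":
            if pending is not None:
                chunks.append(pending)  # keep an unanswered question on its own
            pending = f"Q: {text}"
        elif role == "assistant" and pending is not None:
            chunks.append(f"{pending}\nA: {text}")
            pending = None
        # Assistant turns with no preceding question are ignored in this sketch.
    if pending is not None:
        chunks.append(pending)
    return chunks


windows = fixed_windows("abcdefghij", size=4, overlap=1)
pairs = qa_pairs(
    [
        ("user", "What is a chunk?"),
        ("assistant", "A retrievable unit of text."),
        ("user", "How big should it be?"),
    ]
)
print(windows)
print(pairs)
```

A fixed window can cut between a question and its answer, leaving a chunk that holds only one half of the exchange. Pairing turns keeps each question together with its answer, so the retrieved unit is meaningful on its own.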
This RFC proposes a modification to the Kimchi proof system and the pickles recursion layer that raises the circuit size limit by splitting a circuit's polynomials into 'chunks', each smaller than the hard limit of 2^16 that Mina / SnarkyJS support.
- **Status**: Accepted