Loading...
Loading...
Loading...
Langroid's [`ParsingConfig`][langroid.parsing.parser.ParsingConfig]
# Document Chunking/Splitting in Langroid
Langroid's [`ParsingConfig`][langroid.parsing.parser.ParsingConfig]
provides several document chunking strategies through the `Splitter` enum:
## 1. MARKDOWN (`Splitter.MARKDOWN`) (The default)
**Purpose**: Structure-aware splitting that preserves markdown formatting.
**How it works**:
- Preserves document hierarchy (headers and sections)
- Enriches chunks with header information
- Uses word count instead of token count (with adjustment factor)
- Supports "rollup" to maintain document structure
- Ideal for markdown documents where preserving formatting is important
## 2. TOKENS (`Splitter.TOKENS`)
**Purpose**: Creates chunks of approximately equal token size.
**How it works**:
- Tokenizes the text using tiktoken
- Aims for chunks of size `chunk_size` tokens (default: 200)
- Looks for natural breakpoints like punctuation or newlines
- Prefers splitting at sentence/paragraph boundaries
- Ensures chunks are at least `min_chunk_chars` long (default: 350)
## 3. PARA_SENTENCE (`Splitter.PARA_SENTENCE`)
**Purpose**: Splits documents respecting paragraph and sentence boundaries.
**How it works**:
- Recursively splits documents until chunks are below 1.3ร the target size
- Maintains document structure by preserving natural paragraph breaks
- Adjusts chunk boundaries to avoid cutting in the middle of sentences
- Stops when it can't split chunks further without breaking coherence
## 4. SIMPLE (`Splitter.SIMPLE`)
**Purpose**: Basic splitting using predefined separators.
**How it works**:
- Uses a list of separators to split text (default: `["\n\n", "\n", " ", ""]`)
- Splits on the first separator in the list
- Doesn't attempt to balance chunk sizes
- Simplest and fastest splitting method
## Basic Configuration
```python
from langroid.parsing.parser import ParsingConfig, Splitter
config = ParsingConfig(
splitter=Splitter.MARKDOWN, # Most feature-rich option
chunk_size=200, # Target tokens per chunk
chunk_size_variation=0.30, # Allowed variation from target
overlap=50, # Token overlap between chunks
token_encoding_model="text-embedding-3-small"
)
```
## Format-Specific Configuration
```python
# Customize PDF parsing
config = ParsingConfig(
splitter=Splitter.PARA_SENTENCE,
pdf=PdfParsingConfig(
library="pymupdf4llm" # Default PDF parser
)
)
# Use Gemini for PDF parsing
config = ParsingConfig(
pdf=PdfParsingConfig(
library="gemini",
gemini_config=GeminiConfig(
model_name="gemini-2.0-flash",
requests_per_minute=5
)
)
)
```
# Setting Up Parsing Config in DocChatAgentConfig
You can configure document parsing when creating a `DocChatAgent` by customizing the `parsing` field within the `DocChatAgentConfig`. Here's how to do it:
```python
from langroid.agent.special.doc_chat_agent import DocChatAgentConfig
from langroid.parsing.parser import ParsingConfig, Splitter, PdfParsingConfig
# Create a DocChatAgent with custom parsing configuration
agent_config = DocChatAgentConfig(
parsing=ParsingConfig(
# Choose the splitting strategy
splitter=Splitter.MARKDOWN, # Structure-aware splitting with header context
# Configure chunk sizes
chunk_size=800, # Target tokens per chunk
overlap=150, # Overlap between chunks
# Configure chunk behavior
max_chunks=5000, # Maximum number of chunks to create
min_chunk_chars=250, # Minimum characters when truncating at punctuation
discard_chunk_chars=10, # Discard chunks smaller than this
# Configure context window
n_neighbor_ids=3, # Store 3 chunk IDs on either side
# Configure PDF parsing specifically
pdf=PdfParsingConfig(
library="pymupdf4llm", # Choose PDF parsing library
)
)
)
```[](https://github.com/BUAADreamer/EasyRAG/blob/main/licence)
Welcome to the most comprehensive n8n AI Agent course! Build powerful automation workflows and intelligent AI agents using n8n's visual workflow builder.
Chunking is the process that decides which modules are placed into which bundles, and the relationship between these bundles.
<img src="frontend/public/logo.svg" alt="FinSight AI Logo" width="80" height="80" />