# AST-Based Chunking
CodeRAG uses Abstract Syntax Tree (AST) parsing to split code into semantic chunks rather than arbitrary character or line-based splits. This produces more meaningful search units.
## How Tree-Sitter Parsers Work
CodeRAG uses Synth parsers (built on tree-sitter) to parse code into AST nodes. Tree-sitter is a parser generator that creates fast, incremental parsers for programming languages.
**Key concepts:**
- **Nodes**: AST nodes represent code constructs (functions, classes, expressions, etc.)
- **Spans**: Each node has a span with start/end positions (line, column, offset)
- **Types**: Node types identify the construct (e.g., `FunctionDeclaration`, `ClassDeclaration`)
- **Tree structure**: Nodes form a hierarchical tree with parent/child relationships
**Example AST:**
```typescript
// Code:
function hello(name: string) {
  return `Hello, ${name}`
}

// Simplified AST:
{
  type: "FunctionDeclaration",
  span: { start: { line: 0, column: 0 }, end: { line: 2, column: 1 } },
  children: [
    { type: "Identifier", value: "hello" },
    { type: "Parameters", children: [...] },
    { type: "BlockStatement", children: [...] }
  ]
}
```
**Synth parser interface:**
```typescript
interface SynthParser {
  parseAsync: (source: string, options?: Record<string, unknown>) => Promise<Tree>
}

interface Tree {
  meta: { language: string; source: string }
  root: NodeId
  nodes: BaseNode[]
}
```
Because all Synth parsers are WASM-based (v0.3.x), parsing is always asynchronous: call `parseAsync` and await the result.
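To make the node and tree concepts concrete, here is a minimal sketch of walking a flat node list to collect the outermost nodes of a given type. The `SketchNode` and `SketchTree` shapes and the `collectBoundaryNodes` helper are assumptions for illustration only: the real `BaseNode` fields in Synth may differ, and the sketch assumes a node's id is its index in `nodes`.

```typescript
// Hypothetical node shape: the real BaseNode in Synth may differ.
type NodeId = number

interface SketchNode {
  id: NodeId
  type: string
  children: NodeId[]
}

interface SketchTree {
  root: NodeId
  nodes: SketchNode[] // assumption: nodes[id] is the node with that id
}

// Depth-first walk that collects the outermost nodes whose type is in `types`.
function collectBoundaryNodes(tree: SketchTree, types: ReadonlySet<string>): SketchNode[] {
  const found: SketchNode[] = []
  const stack: NodeId[] = [tree.root]
  while (stack.length > 0) {
    const id = stack.pop() as NodeId
    const node = tree.nodes[id]
    if (node === undefined) continue
    if (types.has(node.type)) {
      found.push(node) // do not descend: nested matches stay inside this node
      continue
    }
    // Push children in reverse so they are visited in source order.
    for (const child of [...node.children].reverse()) {
      stack.push(child)
    }
  }
  return found
}

const sampleTree: SketchTree = {
  root: 0,
  nodes: [
    { id: 0, type: 'Program', children: [1, 2] },
    { id: 1, type: 'ImportDeclaration', children: [] },
    { id: 2, type: 'FunctionDeclaration', children: [3] },
    { id: 3, type: 'BlockStatement', children: [] },
  ],
}

const boundaryNodes = collectBoundaryNodes(sampleTree, new Set(['FunctionDeclaration']))
```

Stopping at the first matching node keeps nested constructs (such as a function declared inside another function) within their parent rather than emitting them twice.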
## Semantic Boundaries
Semantic boundaries define where code should be split. CodeRAG chunks code at boundaries defined in the language configuration.
**Common boundaries by language:**
**TypeScript/JavaScript:**
- `FunctionDeclaration`
- `ClassDeclaration`
- `InterfaceDeclaration`
- `TypeAliasDeclaration`
- `ExportNamedDeclaration`
- `ExportDefaultDeclaration`
**Python:**
- `FunctionDef`
- `AsyncFunctionDef`
- `ClassDef`
- `Module`
**Go:**
- `FuncDecl`
- `MethodDecl`
- `TypeSpec`
- `GenDecl`
**Rust:**
- `function_item`
- `impl_item`
- `struct_item`
- `enum_item`
- `trait_item`
**Why semantic boundaries?**
Semantic chunking ensures each chunk is a complete, meaningful unit:
```typescript
// Good: Function-level chunking
// Chunk 1:
export function calculateBM25(tf: number, idf: number, docLen: number, avgDocLen: number): number {
  const k1 = 1.2
  const b = 0.75
  const numerator = tf * (k1 + 1)
  const denominator = tf + k1 * (1 - b + b * docLen / avgDocLen)
  return idf * (numerator / denominator)
}

// Bad: Character-based chunking at 100 chars
// Chunk 1: "export function calculateBM25(tf: number, idf: number, docLen: number, avgDocLen: number): number {"
// Chunk 2: "const k1 = 1.2\n const b = 0.75\n const numerator = tf * (k1 + 1)\n const denominator = tf + k"
// Result: Incomplete, nonsensical chunks
```
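For contrast, the character-based approach criticized above can be reproduced in a few lines. The `splitByChars` function below is a hypothetical illustration, not part of CodeRAG; it shows how a fixed-size cut ignores syntax entirely.

```typescript
// Naive fixed-size splitter, shown only for contrast with AST-based chunking.
function splitByChars(text: string, size: number): string[] {
  if (size <= 0) throw new RangeError('size must be positive')
  const pieces: string[] = []
  for (let i = 0; i < text.length; i += size) {
    pieces.push(text.slice(i, i + size))
  }
  return pieces
}

const sampleSource = [
  'export function add(a: number, b: number): number {',
  '  return a + b',
  '}',
].join('\n')

// Cutting every 40 characters slices through the parameter list.
const charPieces = splitByChars(sampleSource, 40)
for (const piece of charPieces) {
  console.log(JSON.stringify(piece))
}
```

No piece here is a complete function, which is exactly the failure mode that boundary-based chunking avoids.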
## Language-Specific Configurations
Each language has a configuration defining its parser, boundaries, and context types.
**Configuration structure:**
```typescript
interface LanguageConfig {
  parser: string                        // NPM package name
  extensions: readonly string[]         // File extensions
  boundaries: readonly string[]         // AST node types for chunking
  contextTypes?: readonly string[]      // Context to preserve (imports, types)
  embedded?: EmbeddedLanguageConfig[]   // Embedded languages
  parserOptions?: Record<string, unknown>
}
```
**Example: TypeScript configuration:**
```typescript
typescript: {
  parser: '@sylphx/synth-js',
  extensions: ['.ts', '.mts', '.cts'],
  boundaries: [
    'FunctionDeclaration',
    'ClassDeclaration',
    'InterfaceDeclaration',
    'TypeAliasDeclaration',
    'EnumDeclaration',
    'MethodDefinition',
    'ExportNamedDeclaration',
    'ExportDefaultDeclaration',
  ],
  contextTypes: ['ImportDeclaration', 'TypeAliasDeclaration', 'InterfaceDeclaration'],
  parserOptions: { sourceType: 'module' },
}
```
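A configuration keyed by language name makes extension-based lookup straightforward. The sketch below is an assumption about how such a lookup could work, not CodeRAG's actual code: `sketchRegistry` is a trimmed-down stand-in, `detectLanguage` is a hypothetical helper, and the Python parser name is a placeholder.

```typescript
// Trimmed-down stand-in for the language configuration shown above.
interface SketchLanguageConfig {
  parser: string
  extensions: readonly string[]
  boundaries: readonly string[]
}

const sketchRegistry: Record<string, SketchLanguageConfig> = {
  typescript: {
    parser: '@sylphx/synth-js',
    extensions: ['.ts', '.mts', '.cts'],
    boundaries: ['FunctionDeclaration', 'ClassDeclaration'],
  },
  python: {
    parser: 'placeholder-python-parser', // placeholder, not a real package name
    extensions: ['.py'],
    boundaries: ['FunctionDef', 'ClassDef'],
  },
}

// Return the language whose config lists the file's extension, if any.
function detectLanguage(filePath: string): string | undefined {
  const dot = filePath.lastIndexOf('.')
  if (dot === -1) return undefined
  const ext = filePath.slice(dot).toLowerCase()
  for (const [language, config] of Object.entries(sketchRegistry)) {
    if (config.extensions.includes(ext)) return language
  }
  return undefined
}

console.log(detectLanguage('src/index.ts'))
console.log(detectLanguage('README'))
```

A file with no matching extension yields `undefined`, which is the point at which a chunker would have to fall back to a non-AST strategy.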
**Context preservation:**
When `preserveContext` is enabled, nodes matching `contextTypes` (imports, type definitions) can be prepended to each chunk so that the chunk is understandable on its own:
```typescript
// With preserveContext: true
// Chunk 1 (with context):
import { User } from './types'
export function getUser(id: string): User {
  return database.findById(id)
}

// Without context, the chunk would start at "export function..."
```
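Conceptually, context preservation amounts to prepending the collected context snippets to the chunk body. The `withContext` helper below is a hypothetical sketch of that idea; how CodeRAG actually assembles context is not specified here.

```typescript
// Hypothetical helper: prepend context snippets (e.g. import statements) to a chunk.
function withContext(contextSnippets: readonly string[], chunkContent: string): string {
  if (contextSnippets.length === 0) return chunkContent
  return [...contextSnippets, chunkContent].join('\n')
}

const contextualChunk = withContext(
  ["import { User } from './types'"],
  'export function getUser(id: string): User {\n  return database.findById(id)\n}',
)

console.log(contextualChunk)
```

The trade-off is size: every chunk carries a copy of the shared context, so chunks grow in exchange for being self-explanatory.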
## Chunk Metadata
Each chunk includes metadata for precise navigation and filtering.
**ChunkResult interface:**
```typescript
interface ChunkResult {
readonly content: string // Chunk source code
readonly type: string // AST node type
readonly startLine: number // 1-indexed
readonly endLine: number // 1-indexed
readonly metadata: Record<string, unknown>
}
```
**Example chunk:**
```typescript
{
  content: "export function parseQuery(query: string): string[] {\n  return query.toLowerCase().split(/\\s+/)\n}",
  type: "FunctionDeclaration",
  startLine: 5,
  endLine: 7,
  metadata: {
    name: "parseQuery",
    exported: true
  }
}
```
**Metadata uses:**
- **Search results**: Display which function/class matched
- **Navigation**: Jump to exact line in editor
- **Filtering**: Search only specific node types (e.g., only functions)
- **Ranking**: Boost certain types (e.g., exported functions)
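The uses listed above can be sketched with plain array operations over chunks. The helpers `filterByType`, `exportedFirst`, and `toLocation` are hypothetical names for illustration, and the sketch assumes the `exported` and `name` metadata keys shown in the example chunk.

```typescript
// Mirrors the ChunkResult interface shown above.
interface ChunkResult {
  readonly content: string
  readonly type: string
  readonly startLine: number
  readonly endLine: number
  readonly metadata: Record<string, unknown>
}

// Filtering: keep only chunks whose AST node type is in the allowed list.
function filterByType(chunks: readonly ChunkResult[], types: readonly string[]): ChunkResult[] {
  return chunks.filter((chunk) => types.includes(chunk.type))
}

// Ranking: move exported chunks ahead of the rest, keeping relative order otherwise.
function exportedFirst(chunks: readonly ChunkResult[]): ChunkResult[] {
  const isExported = (chunk: ChunkResult): boolean => chunk.metadata['exported'] === true
  return [...chunks.filter(isExported), ...chunks.filter((chunk) => !isExported(chunk))]
}

// Navigation: build a "file:line" location an editor can jump to.
function toLocation(filePath: string, chunk: ChunkResult): string {
  return `${filePath}:${chunk.startLine}`
}

const sampleChunks: ChunkResult[] = [
  { content: 'function a() {}', type: 'FunctionDeclaration', startLine: 1, endLine: 1, metadata: { name: 'a', exported: false } },
  { content: 'export class B {}', type: 'ClassDeclaration', startLine: 3, endLine: 3, metadata: { name: 'B', exported: true } },
  { content: 'export function c() {}', type: 'FunctionDeclaration', startLine: 5, endLine: 5, metadata: { name: 'c', exported: true } },
]

const functionsOnly = filterByType(sampleChunks, ['FunctionDeclaration'])
const ranked = exportedFirst(sampleChunks)
```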
## Chunking Process
Step-by-step chunking algorithm:
1. **Detect language**: Determine language from file extension
2. **Load parser**: Get Synth parser for the language
3. **Parse AST**: Parse source code into syntax tree
4. **Extract chunks**: Traverse tree, extract nodes at semantic boundaries
5. **Merge small chunks**: Combine chunks smaller than `minChunkSize`
6. **Split large chunks**: Recursively split chunks larger than `maxChunkSize`
7. **Add metadata**: Attach type, line numbers, and other metadata
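Steps 5 and 6 above can be sketched as follows. This is a minimal illustration under stated assumptions, not CodeRAG's implementation: the `Piece` type, `mergeSmall`, and `splitLarge` are hypothetical names, merging is assumed to fold an undersized piece into the piece that follows it, and splitting is assumed to halve on line boundaries.

```typescript
// Hypothetical intermediate shape used between extraction and metadata attachment.
interface Piece {
  content: string
  startLine: number // 1-indexed
  endLine: number   // 1-indexed
}

// Step 5: fold each undersized piece into the piece that follows it.
function mergeSmall(pieces: readonly Piece[], minChunkSize: number): Piece[] {
  const merged: Piece[] = []
  let pending: Piece | undefined
  for (const piece of pieces) {
    const current: Piece = pending === undefined
      ? piece
      : { content: `${pending.content}\n${piece.content}`, startLine: pending.startLine, endLine: piece.endLine }
    if (current.content.length < minChunkSize) {
      pending = current
    } else {
      merged.push(current)
      pending = undefined
    }
  }
  if (pending !== undefined) merged.push(pending) // a trailing small piece is kept as-is
  return merged
}

// Step 6: recursively halve an oversized piece on line boundaries.
function splitLarge(piece: Piece, maxChunkSize: number): Piece[] {
  if (piece.content.length <= maxChunkSize) return [piece]
  const lines = piece.content.split('\n')
  if (lines.length === 1) return [piece] // a single long line is not split in this sketch
  const mid = Math.ceil(lines.length / 2)
  const first: Piece = {
    content: lines.slice(0, mid).join('\n'),
    startLine: piece.startLine,
    endLine: piece.startLine + mid - 1,
  }
  const second: Piece = {
    content: lines.slice(mid).join('\n'),
    startLine: piece.startLine + mid,
    endLine: piece.endLine,
  }
  return [...splitLarge(first, maxChunkSize), ...splitLarge(second, maxChunkSize)]
}

const mergedPieces = mergeSmall(
  [
    { content: 'a', startLine: 1, endLine: 1 },
    { content: 'bb', startLine: 2, endLine: 2 },
    { content: 'x'.repeat(10), startLine: 3, endLine: 3 },
  ],
  5,
)
const splitPieces = splitLarge({ content: 'aaaa\nbbbb\ncccc\ndddd', startLine: 10, endLine: 13 }, 10)
```

Splitting on line boundaries keeps `startLine` and `endLine` exact for every resulting piece, which a mid-line cut would not.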
**API usage:**
```typescript
import { chunkCodeByAST } from '@sylphx/coderag'

const chunks = await chunkCodeByAST(
  sourceCode,
  'example.ts',
  {
    maxChunkSize: 1000,     // Max chars per chunk
    minChunkSize: 100,      // Min chars per chunk
    preserveContext: true,  // Include imports/types
    nodeTypes: undefined,   // Chunk all boundaries
    parseEmbedded: true     // Parse code blocks in markdown
  }
)

for (const chunk of chunks) {
  console.log(`${chunk.type} (lines ${chunk.startLine}-${chunk.endLine})`)
  console.log(chunk.content)
}
```
**Fallback behavior:**
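If AST parsing fails (for example, an unknown language or a syntax error), CodeRAG falls back to character-based chunking. Fallback chunks have type `text` and carry `fallback: true` plus a `reason` in their metadata: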
```typescript
// Fallback chunk
{
  content: "... raw text ...",
  type: "text",
  startLine: 0,
  endLine: 0,
  metadata: { fallback: true, reason: "no-semantic-boundaries" }
}
```
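Consumers that care about chunk quality can separate fallback chunks from AST-derived ones by inspecting the metadata. The sketch below assumes only the `fallback` flag shown in the example above; `isFallbackChunk` and `partitionByFallback` are hypothetical helper names.

```typescript
// Mirrors the ChunkResult interface from the Chunk Metadata section.
interface ChunkResult {
  readonly content: string
  readonly type: string
  readonly startLine: number
  readonly endLine: number
  readonly metadata: Record<string, unknown>
}

// A chunk counts as a fallback when its metadata carries `fallback: true`.
function isFallbackChunk(chunk: ChunkResult): boolean {
  return chunk.metadata['fallback'] === true
}

// Split a result list into AST-derived chunks and fallback chunks.
function partitionByFallback(chunks: readonly ChunkResult[]): { semantic: ChunkResult[]; fallback: ChunkResult[] } {
  const semantic: ChunkResult[] = []
  const fallback: ChunkResult[] = []
  for (const chunk of chunks) {
    if (isFallbackChunk(chunk)) {
      fallback.push(chunk)
    } else {
      semantic.push(chunk)
    }
  }
  return { semantic, fallback }
}

const partitioned = partitionByFallback([
  { content: 'function f() {}', type: 'FunctionDeclaration', startLine: 1, endLine: 1, metadata: { name: 'f' } },
  { content: '... raw text ...', type: 'text', startLine: 0, endLine: 0, metadata: { fallback: true, reason: 'no-semantic-boundaries' } },
])

console.log(`${partitioned.semantic.length} semantic, ${partitioned.fallback.length} fallback`)
```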
## Supported Languages
CodeRAG supports 15+ languages through Synth parsers:
**Tier 1 (Full AST support):**
- JavaScript/TypeScript/JSX/TSX
- Python
- Go
- Java
- C/C++
- Rust
**Tier 2 (Markup/Config):**
- Markdown
- HTML
- XML
- JSON
- YAML
- TOML
- INI
**Tier 3 (Specialized):**
- Protobuf
See `languages.md` for full details.