AST-Based Chunking

CodeRAG uses Abstract Syntax Tree (AST) parsing to split code into semantic chunks rather than arbitrary character or line-based splits. Each chunk then corresponds to a complete construct such as a function or class, which makes it a more meaningful search unit.

How Tree-Sitter Parsers Work

CodeRAG uses Synth parsers (built on tree-sitter) to parse code into AST nodes. Tree-sitter is a parser generator that creates fast, incremental parsers for programming languages.

Key concepts:

Nodes: AST nodes represent code constructs (functions, classes, expressions, etc.)
Spans: Each node has a span with start/end positions (line, column, offset)
Types: Node types identify the construct (e.g., FunctionDeclaration, ClassDeclaration)
Tree structure: Nodes form a hierarchical tree with parent/child relationships

Example AST:

// Code:
function hello(name: string) {
  return `Hello, ${name}`
}

// Simplified AST:
{
  type: "FunctionDeclaration",
  span: { start: { line: 0, column: 0 }, end: { line: 2, column: 1 } },
  children: [
    { type: "Identifier", value: "hello" },
    { type: "Parameters", children: [...] },
    { type: "BlockStatement", children: [...] }
  ]
}

Synth parser interface:

interface SynthParser {
  parseAsync: (source: string, options?: Record<string, unknown>) => Promise<Tree>
}

interface Tree {
  meta: { language: string; source: string }
  root: NodeId
  nodes: BaseNode[]
}

As of v0.3.x, all Synth parsers are WASM-based, so parsing is asynchronous (parseAsync).

Semantic Boundaries

Semantic boundaries are the AST node types at which CodeRAG splits code. Each language's configuration lists its own boundaries.

Common boundaries by language:

TypeScript/JavaScript:

FunctionDeclaration
ClassDeclaration
InterfaceDeclaration
TypeAliasDeclaration
ExportNamedDeclaration
ExportDefaultDeclaration

Python:

FunctionDef
AsyncFunctionDef
ClassDef
Module

Go:

FuncDecl
MethodDecl
TypeSpec
GenDecl

Rust:

function_item
impl_item
struct_item
enum_item
trait_item

Why semantic boundaries?

Semantic chunking ensures each chunk is a complete, meaningful unit:

// Good: Function-level chunking
// Chunk 1:
export function calculateBM25(tf: number, idf: number, docLen: number, avgDocLen: number): number {
  const k1 = 1.2
  const b = 0.75
  const numerator = tf * (k1 + 1)
  const denominator = tf + k1 * (1 - b + b * docLen / avgDocLen)
  return idf * (numerator / denominator)
}

// Bad: Character-based chunking at 100 chars
// Chunk 1: "export function calculateBM25(tf: number, idf: number, docLen: number, avgDocLen: number): number {"
// Chunk 2: "const k1 = 1.2\n  const b = 0.75\n  const numerator = tf * (k1 + 1)\n  const denominator = tf + k"
// Result: Incomplete, nonsensical chunks
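
The "bad" case can be reproduced in a few lines. The following is a minimal sketch of fixed-size character splitting, the approach that AST-based chunking is meant to replace. `chunkByChars` is a hypothetical helper written for this illustration.

```typescript
// Naive fixed-size splitting: cuts every `size` characters, ignoring syntax.
function chunkByChars(text: string, size: number): string[] {
  if (size <= 0) throw new RangeError('size must be positive')
  const chunks: string[] = []
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size))
  }
  return chunks
}

const source = [
  'export function add(a: number, b: number): number {',
  '  return a + b',
  '}',
].join('\n')

// The first piece ends in the middle of the signature, so no piece
// is a complete function on its own.
const pieces = chunkByChars(source, 40)
console.log(pieces)
```

Joining the pieces restores the original text, but each piece on its own is a fragment that neither a reader nor a search index can interpret reliably.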

Language-Specific Configurations

Each language has a configuration defining its parser, boundaries, and context types.

Configuration structure:

interface LanguageConfig {
  parser: string                      // NPM package name
  extensions: readonly string[]       // File extensions
  boundaries: readonly string[]       // AST node types for chunking
  contextTypes?: readonly string[]    // Context to preserve (imports, types)
  embedded?: EmbeddedLanguageConfig[] // Embedded languages
  parserOptions?: Record<string, unknown>
}

Example TypeScript configuration:

typescript: {
  parser: '@sylphx/synth-js',
  extensions: ['.ts', '.mts', '.cts'],
  boundaries: [
    'FunctionDeclaration',
    'ClassDeclaration',
    'InterfaceDeclaration',
    'TypeAliasDeclaration',
    'EnumDeclaration',
    'MethodDefinition',
    'ExportNamedDeclaration',
    'ExportDefaultDeclaration',
  ],
  contextTypes: ['ImportDeclaration', 'TypeAliasDeclaration', 'InterfaceDeclaration'],
  parserOptions: { sourceType: 'module' },
}
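
A configuration of this shape lends itself to lookup by file extension. The sketch below assumes a registry object keyed by language name. `registry`, `detectLanguage`, and the trimmed-down `LanguageConfig` are illustrative, and the `toy` entry (including its parser name) is invented for the demo. How CodeRAG actually stores and resolves its configurations is not shown here.

```typescript
// Trimmed-down version of the LanguageConfig interface.
interface LanguageConfig {
  parser: string
  extensions: readonly string[]
  boundaries: readonly string[]
}

// Illustrative registry. The typescript entry mirrors part of the example
// configuration; the "toy" entry is made up for this demo.
const registry: Record<string, LanguageConfig> = {
  typescript: {
    parser: '@sylphx/synth-js',
    extensions: ['.ts', '.mts', '.cts'],
    boundaries: ['FunctionDeclaration', 'ClassDeclaration'],
  },
  toy: {
    parser: 'hypothetical-toy-parser',
    extensions: ['.toy'],
    boundaries: ['Block'],
  },
}

// Return the language whose extension list contains the file's extension,
// or undefined when no configuration matches.
function detectLanguage(filename: string): string | undefined {
  const dot = filename.lastIndexOf('.')
  if (dot <= 0) return undefined // no dot at all, or only a leading dot
  const ext = filename.slice(dot).toLowerCase()
  for (const language of Object.keys(registry)) {
    const config = registry[language]
    if (config && config.extensions.indexOf(ext) !== -1) return language
  }
  return undefined
}

console.log(detectLanguage('src/index.ts')) // typescript
console.log(detectLanguage('README'))       // undefined
```

A file with no matching configuration returns undefined, which is the point where a caller would switch to a non-AST fallback.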

Context preservation:

Context types (imports, type definitions) can be prepended to each chunk so that the chunk can be understood on its own:

// With preserveContext: true

// Chunk 1 (with context):
import { User } from './types'

export function getUser(id: string): User {
  return database.findById(id)
}

// Without context, the chunk would start at "export function..."
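
Mechanically, context preservation amounts to placing the collected context snippets in front of the chunk body. A minimal sketch, assuming the context has already been extracted as an array of source strings (`withContext` is a hypothetical helper, not a CodeRAG function):

```typescript
// Prepend context snippets (e.g. import lines) to a chunk's content,
// separated from the chunk body by a blank line.
function withContext(chunkContent: string, contextSnippets: readonly string[]): string {
  if (contextSnippets.length === 0) return chunkContent
  return `${contextSnippets.join('\n')}\n\n${chunkContent}`
}

const importLines = ["import { User } from './types'"]
const functionBody = [
  'export function getUser(id: string): User {',
  '  return database.findById(id)',
  '}',
].join('\n')

console.log(withContext(functionBody, importLines))
```

Prepending context makes each chunk larger, so it trades some index size for chunks that carry the names they depend on.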

Chunk Metadata

Each chunk includes metadata for precise navigation and filtering.

ChunkResult interface:

interface ChunkResult {
  readonly content: string       // Chunk source code
  readonly type: string          // AST node type
  readonly startLine: number     // 1-indexed
  readonly endLine: number       // 1-indexed
  readonly metadata: Record<string, unknown>
}

Example chunk:

{
  content: "export function parseQuery(query: string): string[] {\n  return query.toLowerCase().split(/\\s+/)\n}",
  type: "FunctionDeclaration",
  startLine: 5,
  endLine: 7,
  metadata: {
    name: "parseQuery",
    exported: true
  }
}

Metadata uses:

Search results: Display which function/class matched
Navigation: Jump to exact line in editor
Filtering: Search only specific node types (e.g., only functions)
Ranking: Boost certain types (e.g., exported functions)
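
These uses can be sketched directly against the ChunkResult shape. The helpers below (`filterByType`, `boostScore`, `formatLocation`) and the boost factor are assumptions made for illustration. They are not CodeRAG functions, and the library's own ranking may work differently.

```typescript
interface ChunkResult {
  readonly content: string
  readonly type: string
  readonly startLine: number
  readonly endLine: number
  readonly metadata: Record<string, unknown>
}

// Filtering: keep only chunks of the given node types.
function filterByType(chunks: readonly ChunkResult[], types: readonly string[]): ChunkResult[] {
  return chunks.filter((c) => types.indexOf(c.type) !== -1)
}

// Ranking: multiply the base score of exported chunks by a boost factor.
function boostScore(chunk: ChunkResult, baseScore: number, exportedBoost = 1.5): number {
  return chunk.metadata['exported'] === true ? baseScore * exportedBoost : baseScore
}

// Navigation: a "file:line" string that many editors can jump to.
function formatLocation(file: string, chunk: ChunkResult): string {
  return `${file}:${chunk.startLine}`
}

const chunks: ChunkResult[] = [
  {
    content: 'export function parseQuery(query: string): string[] { /* ... */ }',
    type: 'FunctionDeclaration',
    startLine: 5,
    endLine: 7,
    metadata: { name: 'parseQuery', exported: true },
  },
  {
    content: 'interface Options { limit: number }',
    type: 'InterfaceDeclaration',
    startLine: 1,
    endLine: 3,
    metadata: { name: 'Options', exported: false },
  },
]

for (const chunk of filterByType(chunks, ['FunctionDeclaration'])) {
  console.log(formatLocation('query.ts', chunk), boostScore(chunk, 2)) // query.ts:5 3
}
```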

Chunking Process

Step-by-step chunking algorithm:

1. Detect language: Determine the language from the file extension
2. Load parser: Get the Synth parser for that language
3. Parse AST: Parse the source code into a syntax tree
4. Extract chunks: Traverse the tree and extract nodes at semantic boundaries
5. Merge small chunks: Combine chunks smaller than minChunkSize
6. Split large chunks: Recursively split chunks exceeding maxChunkSize
7. Add metadata: Attach type, line numbers, and other metadata
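
The merge and split steps are the least obvious part of this process, so here is a minimal sketch of one way they could work on plain strings. `mergeSmall`, `splitLarge`, and `normalizeSizes` are hypothetical helpers. The real implementation operates on AST nodes and may pick split points differently.

```typescript
// Merge neighbouring pieces until each merged piece reaches minSize.
function mergeSmall(pieces: readonly string[], minSize: number): string[] {
  const merged: string[] = []
  let buffer = ''
  for (const piece of pieces) {
    buffer = buffer === '' ? piece : `${buffer}\n${piece}`
    if (buffer.length >= minSize) {
      merged.push(buffer)
      buffer = ''
    }
  }
  if (buffer !== '') {
    // Trailing remainder: attach it to the previous piece if there is one.
    const last = merged.pop()
    merged.push(last === undefined ? buffer : `${last}\n${buffer}`)
  }
  return merged
}

// Recursively halve a piece at a line boundary until every part fits maxSize.
function splitLarge(piece: string, maxSize: number): string[] {
  if (maxSize <= 0) throw new RangeError('maxSize must be positive')
  if (piece.length <= maxSize) return [piece]
  const lines = piece.split('\n')
  if (lines.length === 1) {
    // A single over-long line: fall back to a hard cut.
    return [piece.slice(0, maxSize), ...splitLarge(piece.slice(maxSize), maxSize)]
  }
  const mid = Math.ceil(lines.length / 2)
  return [
    ...splitLarge(lines.slice(0, mid).join('\n'), maxSize),
    ...splitLarge(lines.slice(mid).join('\n'), maxSize),
  ]
}

// Merge first, then split, in the order the steps are listed.
function normalizeSizes(pieces: readonly string[], minSize: number, maxSize: number): string[] {
  const result: string[] = []
  for (const piece of mergeSmall(pieces, minSize)) {
    result.push(...splitLarge(piece, maxSize))
  }
  return result
}

console.log(normalizeSizes(['aa', 'bb', 'cccccccccc'], 5, 8))
// [ 'aa\nbb', 'cccccccc', 'cc' ]
```

Note that splitting can leave a remainder shorter than the minimum (the final 'cc' above). This sketch does not re-merge such remainders.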

API usage:

import { chunkCodeByAST } from '@sylphx/coderag'

const chunks = await chunkCodeByAST(
  sourceCode,
  'example.ts',
  {
    maxChunkSize: 1000,      // Max chars per chunk
    minChunkSize: 100,       // Min chars per chunk
    preserveContext: true,   // Include imports/types
    nodeTypes: undefined,    // Chunk all boundaries
    parseEmbedded: true      // Parse code blocks in markdown
  }
)

for (const chunk of chunks) {
  console.log(`${chunk.type} (lines ${chunk.startLine}-${chunk.endLine})`)
  console.log(chunk.content)
}

Fallback behavior:

If AST parsing fails (unknown language, syntax error) or yields no semantic boundaries, CodeRAG falls back to character-based chunking:

// Fallback chunk
{
  content: "... raw text ...",
  type: "text",
  startLine: 0,
  endLine: 0,
  metadata: { fallback: true, reason: "no-semantic-boundaries" }
}
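
A caller-side view of this behavior can be sketched as a wrapper that tries an AST chunker first and switches to character slices when it throws or returns nothing. Everything here is an assumption for illustration: `Chunker`, `fallbackChunks`, `chunkWithFallback`, and the 'parse-error' reason string are hypothetical. Only the 'no-semantic-boundaries' reason and the zeroed line numbers are taken from the example above.

```typescript
interface ChunkResult {
  readonly content: string
  readonly type: string
  readonly startLine: number
  readonly endLine: number
  readonly metadata: Record<string, unknown>
}

// Any async function that turns source text into chunks.
type Chunker = (source: string) => Promise<ChunkResult[]>

// Character-based fallback: fixed-size slices tagged as plain text.
// Line numbers are left at 0, mirroring the fallback chunk example.
function fallbackChunks(source: string, maxChunkSize: number, reason: string): ChunkResult[] {
  if (maxChunkSize <= 0) throw new RangeError('maxChunkSize must be positive')
  const chunks: ChunkResult[] = []
  for (let i = 0; i < source.length; i += maxChunkSize) {
    chunks.push({
      content: source.slice(i, i + maxChunkSize),
      type: 'text',
      startLine: 0,
      endLine: 0,
      metadata: { fallback: true, reason },
    })
  }
  return chunks
}

// Try the AST chunker; fall back if it throws or finds no chunks.
async function chunkWithFallback(
  source: string,
  astChunker: Chunker,
  maxChunkSize = 1000,
): Promise<ChunkResult[]> {
  try {
    const chunks = await astChunker(source)
    if (chunks.length > 0) return chunks
    return fallbackChunks(source, maxChunkSize, 'no-semantic-boundaries')
  } catch {
    return fallbackChunks(source, maxChunkSize, 'parse-error') // hypothetical reason string
  }
}

// Demo with a chunker that always fails, standing in for a parse error.
const failingChunker: Chunker = async () => {
  throw new Error('syntax error')
}

chunkWithFallback('some unparseable text', failingChunker, 10).then((chunks) => {
  console.log(chunks.map((c) => c.content))
})
```

Downstream code can check metadata.fallback to treat such chunks differently, for example by ranking them below AST-derived chunks.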

Supported Languages

CodeRAG supports 15+ languages through Synth parsers:

Tier 1 (Full AST support):

JavaScript/TypeScript/JSX/TSX
Python
Go
Java
C/C++
Rust

Tier 2 (Markup/Config):

Markdown
HTML
XML
JSON
YAML
TOML
INI

Tier 3 (Specialized):

Protobuf

See languages.md for full details.
