libragen: First-Class AST-Aware Code Chunking Support

Version: 1.0.0
Date: 2024-12-20
Status: In Progress

Summary

Integrate the code-chunk library from supermemoryai to provide AST-aware, semantic code chunking as a first-class feature in libragen. This replaces the current LangChain-based RecursiveCharacterTextSplitter for supported code files with tree-sitter-powered chunking that respects semantic boundaries (functions, classes, methods) and provides rich context (scope chains, imports, siblings, entity signatures) for better embedding quality.

Objectives & Scope

In Scope

Add code-chunk as a dependency to @libragen/core
Create a new CodeChunker class that wraps code-chunk and implements the same interface pattern as the existing Chunker
Extend Chunk and ChunkMetadata types to include semantic context from code-chunk (scope, entities, imports, siblings)
Update Builder to use CodeChunker for supported code files, falling back to the existing Chunker for unsupported files
Store semantic context in the database for retrieval-time enrichment
Store contextualizedText in the database alongside raw content for maximum flexibility
Update CLI and MCP to expose new chunking options (e.g., --no-ast-chunking, --context-mode)
Add a build option to enable or disable AST-aware chunking (default: enabled for code files)
Update documentation across all packages

Out of Scope

WASM/Cloudflare Workers support (code-chunk supports this, but libragen is Node.js-focused)
Effect.js integration (use Promise-based API)
Streaming chunking API (batch processing is sufficient for build)

Future Considerations

Custom tree-sitter grammar support beyond what code-chunk provides. This would let users add support for additional languages by supplying their own tree-sitter grammars.

Assumptions & Open Questions

Assumptions

code-chunk is stable enough for production use at its current version (v0.1.11)
The contextualizedText field from code-chunk is suitable for embedding (this is its intended use)
Node.js native tree-sitter bindings will work in libragen's target environments (Node 24+)
Users will benefit from richer chunk context even though it increases storage (estimated at roughly 20-30%)

Resolved Questions

Should AST chunking be opt-in or opt-out?
- Answer: Opt-out (enabled by default for code files)
Should we store both raw text and contextualizedText?
- Answer: Yes, store both. Use contextualizedText for embeddings and keep the raw content for display. Keeping both also leaves room for re-embedding later.
How should we handle files that code-chunk doesn't support?
- Answer: Fall back to existing Chunker
Should chunk context (scope, entities, etc.) be stored in the database?
- Answer: Yes, in the metadata JSON column
What about backward compatibility with old libraries?
- Answer: Old libraries won't have semantic context data, but CLI/MCP should handle both old and new formats gracefully without errors

Requirements

Functional

FR-1: Support AST-aware chunking for TypeScript, JavaScript, Python, Rust, Go, and Java files
FR-2: Fall back to existing text-based chunking for unsupported file types
FR-3: Store semantic context (scope chain, entities, imports, siblings) in chunk metadata
FR-4: Store contextualizedText in the database for embedding and retrieval
FR-5: Use contextualizedText for embedding generation to improve retrieval quality
FR-6: Expose --no-ast-chunking flag in CLI to disable AST-aware chunking
FR-7: Expose --context-mode option (none, minimal, full) for controlling context richness (default: full)
FR-8: Maintain backward compatibility with existing .libragen files
FR-9: CLI and MCP must handle libraries with and without semantic context gracefully
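
FR-6 and FR-7 could be handled by a small flag parser along the lines below. This is a minimal sketch, not libragen's actual CLI code: the function name parseChunkingFlags and the ChunkingOptions shape are hypothetical, and the real CLI may well delegate to an argument-parsing library. Only the flag names, the three mode values, and the defaults come from the requirements above.

```typescript
// Hypothetical option shape; only the flag names and defaults come from FR-6/FR-7.
type ContextMode = "none" | "minimal" | "full";

interface ChunkingOptions {
  astChunking: boolean;     // false when --no-ast-chunking is passed
  contextMode: ContextMode; // defaults to "full" (FR-7)
}

const CONTEXT_MODES: readonly ContextMode[] = ["none", "minimal", "full"];

function isContextMode(value: string): value is ContextMode {
  return (CONTEXT_MODES as readonly string[]).indexOf(value) !== -1;
}

function parseChunkingFlags(argv: readonly string[]): ChunkingOptions {
  const options: ChunkingOptions = { astChunking: true, contextMode: "full" };
  const args = argv.slice(); // copy so the caller's array is not consumed

  while (args.length > 0) {
    const arg = args.shift() as string;
    if (arg === "--no-ast-chunking") {
      options.astChunking = false;
    } else if (arg === "--context-mode" || arg.indexOf("--context-mode=") === 0) {
      // Accept both "--context-mode full" and "--context-mode=full".
      const eq = arg.indexOf("=");
      const value = eq === -1 ? args.shift() : arg.slice(eq + 1);
      if (value === undefined || !isContextMode(value)) {
        throw new Error(`--context-mode must be one of: ${CONTEXT_MODES.join(", ")}`);
      }
      options.contextMode = value;
    }
    // Any other argument is ignored in this sketch.
  }
  return options;
}
```

Rejecting unknown modes up front keeps an invalid value from silently falling back to the default and producing a library with unexpected context richness.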

Non-Functional

NFR-1 (Performance): AST chunking should not significantly increase build time (tree-sitter is fast)
NFR-2 (Storage): The storage increase should stay reasonable: storing contextualizedText should grow the library file size by less than 30%
NFR-3 (Compatibility): Must work with Node.js 24+ on macOS, Linux, and Windows
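
The NFR-2 budget can be checked mechanically by comparing the size of a library built with and without AST chunking. The helper below is an illustrative sketch only: the function names are hypothetical, and the byte counts in the usage note are made-up inputs, not measurements.

```typescript
// Relative growth of a library file, e.g. 0.25 means 25% larger than the baseline.
function storageIncrease(baselineBytes: number, newBytes: number): number {
  if (baselineBytes <= 0) {
    throw new RangeError("baselineBytes must be positive");
  }
  return (newBytes - baselineBytes) / baselineBytes;
}

// NFR-2 asks for an increase strictly below 30%.
function withinStorageBudget(
  baselineBytes: number,
  newBytes: number,
  maxIncrease: number = 0.3
): boolean {
  return storageIncrease(baselineBytes, newBytes) < maxIncrease;
}
```

For example, a library that grows from 10000000 to 12500000 bytes has increased by 25% and passes, while one that grows to 13000000 bytes has increased by exactly 30% and does not.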

Architecture & Design Overview

Data Flow

Source Files
     │
     ▼
┌─────────────────────────────────────────────────────────┐
│ Builder._chunkSource()                                   │
│   ├─ For each file:                                      │
│   │   ├─ Is code file supported by code-chunk?           │
│   │   │   ├─ YES → CodeChunker.chunkText()               │
│   │   │   │         └─ Returns Chunk[] with context      │
│   │   │   │            + embeddingContent                │
│   │   │   └─ NO  → Chunker.chunkText() (existing)        │
│   │   │             └─ Returns Chunk[] (basic metadata)  │
│   │   └─ Collect all chunks                              │
└─────────────────────────────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────────────────────────────┐
│ Builder._generateEmbeddings()                            │
│   └─ Use chunk.embeddingContent ?? chunk.content         │
│      (contextualizedText for code, raw for others)       │
└─────────────────────────────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────────────────────────────┐
│ VectorStore.addChunks()                                  │
│   └─ Store:                                              │
│      - chunk.content (raw text for display)              │
│      - chunk.embeddingContent (contextualizedText)       │
│      - chunk.metadata (includes codeContext)             │
└─────────────────────────────────────────────────────────┘
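
The branching in the first box of the diagram can be sketched as a small routing helper. Everything here is an assumption made for illustration: the extension list is derived from the FR-1 languages rather than from what code-chunk actually reports as supported, and isAstSupported and selectChunker are hypothetical names, not part of Builder.

```typescript
// ASSUMPTION: extensions guessed from the FR-1 language list
// (TypeScript, JavaScript, Python, Rust, Go, Java).
const AST_EXTENSIONS: readonly string[] = [
  ".ts", ".tsx", ".js", ".jsx", ".py", ".rs", ".go", ".java",
];

// True when the file's extension is in the (assumed) supported set.
function isAstSupported(sourceFile: string): boolean {
  const dot = sourceFile.lastIndexOf(".");
  if (dot === -1) {
    return false; // no extension at all, e.g. "Makefile"
  }
  return AST_EXTENSIONS.indexOf(sourceFile.slice(dot).toLowerCase()) !== -1;
}

// Picks the AST-aware chunker for supported code files and the existing
// text chunker for everything else, or for every file when AST chunking
// has been disabled.
function selectChunker<T>(
  sourceFile: string,
  codeChunker: T,
  fallbackChunker: T,
  astChunking: boolean = true
): T {
  return astChunking && isAstSupported(sourceFile) ? codeChunker : fallbackChunker;
}
```

The helper is generic over the chunker type so the sketch makes no claim about the signature of chunkText on either class.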

Key Interfaces

// Semantic context from code-chunk
interface CodeContext {
  scope: EntityInfo[];        // Scope chain (e.g., class > method)
  entities: ChunkEntityInfo[]; // Entities defined in this chunk
  siblings: SiblingInfo[];     // Nearby entities for context
  imports: ImportInfo[];       // Relevant imports
}

// Extended ChunkMetadata
interface ChunkMetadata {
  sourceFile: string;
  startLine?: number;
  endLine?: number;
  language?: string;
  // NEW: Semantic context from code-chunk
  codeContext?: CodeContext;
}

// Extended Chunk
interface Chunk {
  content: string;              // Raw code text (for display)
  embeddingContent?: string;    // contextualizedText (for embedding)
  metadata: ChunkMetadata;
}
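
The optional fields above are what let old and new chunks share one code path. The sketch below shows how a consumer might pick the text to embed and detect whether semantic context is present. The interfaces are repeated so the snippet compiles on its own, with the code-chunk element types loosened to unknown because their real shapes are not reproduced in this document. embeddingInput and hasCodeContext are hypothetical helper names.

```typescript
// Element types are placeholders; the real ones come from code-chunk.
interface CodeContext {
  scope: unknown[];
  entities: unknown[];
  siblings: unknown[];
  imports: unknown[];
}

interface ChunkMetadata {
  sourceFile: string;
  startLine?: number;
  endLine?: number;
  language?: string;
  codeContext?: CodeContext;
}

interface Chunk {
  content: string;
  embeddingContent?: string;
  metadata: ChunkMetadata;
}

// Text handed to the embedder: contextualized text when available,
// raw content for fallback chunks and chunks from older libraries.
function embeddingInput(chunk: Chunk): string {
  return chunk.embeddingContent ?? chunk.content;
}

// Type guard so callers only touch codeContext when it exists.
function hasCodeContext(
  chunk: Chunk
): chunk is Chunk & { metadata: ChunkMetadata & { codeContext: CodeContext } } {
  return chunk.metadata.codeContext !== undefined;
}
```

A chunk produced by the existing Chunker carries neither embeddingContent nor codeContext, so embeddingInput returns its raw content and hasCodeContext returns false, which is the graceful behavior FR-9 asks for.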

Database Schema Changes

The chunks table already has a metadata JSON column. We will:

Store codeContext in the metadata JSON
Add a new column embedding_content to store contextualizedText separately from content

This allows:

Displaying raw code to users
Using enriched context for embeddings
Retrieving semantic context at search time
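
One way to picture the resulting row layout is a pair of mapping functions between a Chunk and a database row. This is a sketch under assumptions: the ChunkRow type and the toRow and fromRow helpers are hypothetical, and treating embedding_content as nullable (or absent in libraries built before this change) is an assumption about how backward compatibility would be handled, not a confirmed schema detail.

```typescript
interface ChunkMetadata {
  sourceFile: string;
  startLine?: number;
  endLine?: number;
  language?: string;
  codeContext?: unknown; // stored inside the metadata JSON
}

interface Chunk {
  content: string;
  embeddingContent?: string;
  metadata: ChunkMetadata;
}

// Hypothetical row shape for the chunks table.
interface ChunkRow {
  content: string;                   // raw text for display
  embedding_content?: string | null; // contextualizedText; missing in old libraries
  metadata: string;                  // JSON, includes codeContext when present
}

function toRow(chunk: Chunk): ChunkRow {
  return {
    content: chunk.content,
    embedding_content: chunk.embeddingContent ?? null,
    metadata: JSON.stringify(chunk.metadata),
  };
}

// Tolerates rows from older libraries: no embedding_content column and
// no codeContext key in the metadata JSON.
function fromRow(row: ChunkRow): Chunk {
  const chunk: Chunk = {
    content: row.content,
    metadata: JSON.parse(row.metadata) as ChunkMetadata,
  };
  if (row.embedding_content !== undefined && row.embedding_content !== null) {
    chunk.embeddingContent = row.embedding_content;
  }
  return chunk;
}
```

Reading an old row simply yields a Chunk without embeddingContent or codeContext, so search and display code can branch on their presence instead of on a library version.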

Decisions & Trade-offs

Decision: Store both content and embeddingContent (contextualizedText)
- Rationale: Maximum flexibility: raw content for display, contextualized text for embeddings and potential re-embedding
- Trade-off: ~20-30% storage increase, but worth it for quality
Decision: Store semantic context in metadata JSON column
- Rationale: Enables rich retrieval-time features without a schema migration for the context itself, since the metadata JSON column already exists
- Trade-off: Increases storage, but provides valuable context
Decision: Default to AST chunking with contextMode: 'full'
- Rationale: Best quality out of the box
- Trade-off: Larger library files by default; users who want smaller files can opt out with --no-ast-chunking or choose a lighter --context-mode
Decision: Graceful fallback for unsupported files and old libraries
- Rationale: Seamless experience regardless of file type or library age
- Trade-off: Some code paths need to handle both cases

Task Grid

Status	ID	Task	Priority	Depends On	Acceptance Criteria
[ ]	T-01	Add `code-chunk` dependency	H	—	Package installed, types available
[ ]	T-02	Extend `Chunk` and `ChunkMetadata` types	H	T-01	Types include semantic context fields
[ ]	T-03	Create `CodeChunker` class	H	T-02	Class wraps code-chunk, matches Chunker pattern
[ ]	T-04	Update `Builder` to use `CodeChunker`	H	T-03	Builder uses AST chunking for supported files
[ ]	T-05	Update `VectorStore` schema	M	T-02	`embedding_content` column added
[ ]	T-06	Add build options for AST chunking	M	T-04	`noAstChunking`, `contextMode` options work
[ ]	T-07	Update CLI with new options	M	T-06	`--no-ast-chunking`, `--context-mode` flags
[ ]	T-08	Update MCP tools with new options	M	T-06	MCP build tool accepts new options
[ ]	T-09	Write unit tests	H	T-03, T-04	≥90% coverage for new code
[ ]	T-10	Update documentation	M	T-07, T-08	READMEs, website docs updated