TreeDex — Comprehensive Documentation

Tree-based, vectorless document RAG framework. Available for Python (3.10+) and Node.js (18+) with identical APIs and cross-compatible index format.

Architecture Overview
Pipeline Deep Dive
Smart Hierarchy Detection
API Reference
LLM Backend System
Token Management Strategy
Benchmarks
Case Studies
Cross-Language Parity
Error Handling
Configuration & Tuning
Limitations & Edge Cases

Architecture Overview

TreeDex has six core modules, each implemented identically in Python and TypeScript:

┌─────────────────────────────────────────────────────┐
│                  Document Loaders                    │
│  PDFLoader · TextLoader · HTMLLoader · DOCXLoader    │
│  auto_loader() — auto-detect format                  │
└──────────────────────┬──────────────────────────────┘
                       │
          ┌────────────▼────────────┐
          │     PDF Parser          │
          │  extract_pages()        │
          │  extract_toc()          │
          │  [H1][H2][H3] markers   │
          │  group_pages()          │
          └────────┬───────┬────────┘
                   │       │
         ┌─────────▼─┐   ┌▼──────────────────────┐
         │ ToC found  │   │  No ToC → LLM path    │
         │ Zero LLM   │   │  Heading-guided        │
         │ calls      │   │  Capped continuation   │
         └─────┬──────┘   └─────────┬──────────────┘
               │                    │
          ┌────▼────────────────────▼─────┐
          │        Tree Builder           │
          │  toc_to_sections()            │
          │  repair_orphans()             │
          │  list_to_tree()               │
          │  assign_page_ranges()         │
          │  assign_node_ids()            │
          │  embed_text_in_tree()         │
          └──────────────┬────────────────┘
                         │
          ┌──────────────▼────────────────┐
          │         TreeDex Core          │
          │  .query()  .save()  .load()   │
          │  .stats()  .show_tree()       │
          └──────────────┬────────────────┘
                         │
          ┌──────────────▼────────────────┐
          │         QueryResult           │
          │  .context  .node_ids          │
          │  .page_ranges  .reasoning     │
          │  .answer (agentic mode)       │
          └───────────────────────────────┘

Module Map

Module	Python	TypeScript	LOC	Purpose
Core	`treedex/core.py`	`src/core.ts`	303 / 398	Main `TreeDex` class, indexing and querying
LLM Backends	`treedex/llm_backends.py`	`src/llm-backends.ts`	700 / 738	18+ provider integrations
Loaders	`treedex/loaders.py`	`src/loaders.ts`	153 / 190	Format-specific document loading
PDF Parser	`treedex/pdf_parser.py`	`src/pdf-parser.ts`	133 / 154	PDF extraction, ToC, heading detection
Prompts	`treedex/prompts.py`	`src/prompts.ts`	91 / 105	LLM prompt templates
Tree Builder	`treedex/tree_builder.py`	`src/tree-builder.ts`	120 / 156	Flat sections → hierarchical tree
Tree Utils	`treedex/tree_utils.py`	`src/tree-utils.ts`	152 / 188	Traversal, serialization, JSON extraction
Types	—	`src/types.ts`	— / 42	TypeScript type definitions

Pipeline Deep Dive

1. Document Loading

TreeDex auto-detects format by extension and delegates to the appropriate loader:

Format	Library (Python)	Library (Node.js)	Images	Headings
PDF	PyMuPDF (fitz)	pdfjs-dist	Base64 extraction	Font-size analysis
TXT/MD	stdlib	Node.js fs	—	—
HTML	HTMLParser (stdlib)	htmlparser2 / regex fallback	Alt text extraction	—
DOCX	python-docx	mammoth	Inline images	—

Each loader returns a list of Page objects:

{
  "page_num": 0,
  "text": "Chapter 1: Introduction...",
  "token_count": 342,
  "images": [{"data": "base64...", "mime_type": "image/png"}]
}

Non-PDF formats split text into synthetic pages by character count (default 3,000 chars/page).

2. PDF ToC Extraction

Before any LLM calls, TreeDex checks for PDF bookmarks:

# Python
toc = extract_toc("document.pdf")
# Returns: [{"level": 1, "title": "Introduction", "physical_index": 0}, ...]
# Or None if < 3 entries

// TypeScript
const toc = await extractToc("document.pdf");

How it works:

Python: Uses fitz.open(path).get_toc() which reads the PDF outline/bookmarks metadata
TypeScript: Uses pdfjs.getDocument().getOutline() and recursively walks the outline tree, resolving page destinations

If a usable ToC is found (3+ entries), the entire LLM extraction step is skipped — the tree is built directly from ToC entries via toc_to_sections().

3. Font-Size Heading Detection

When no ToC is available, TreeDex analyzes font sizes to inject hierarchy markers:

Step 1: Analyze — Sample up to 50 pages, collect all font sizes weighted by character count.

Step 2: Classify — The most-used font size = body text. All sizes > body + 0.5pt = headings. Top 3 heading sizes → H1, H2, H3 (largest first).

Step 3: Annotate — Each line gets a marker based on its font size:

Without markers:          With markers:
─────────────────         ─────────────────
3 System Architecture     [H2] 3 System Architecture
TreeDex consists of...    TreeDex consists of...
3.1 Architecture          [H3] 3.1 Architecture overview
overview                  Figure 1 illustrates...
Figure 1 illustrates...

Overhead: Only +2.7% more tokens (314 extra tokens on a 12k-token document), but dramatically improves hierarchy accuracy.

4. Page Grouping

Documents are split into token-budgeted groups for the LLM:

Pages: [P0, P1, P2, P3, P4, P5, P6, P7, P8, P9]
                                                    max_tokens = 4000
Group 1: [P0, P1, P2, P3, P4, P5, P6]  ← 3,868 tokens
           overlap ↕
Group 2: [P6, P7, P8, P9, P10, P11, P12]  ← 3,926 tokens
           overlap ↕
Group 3: [P12, P13, P14, P15, P16, P17]  ← 3,631 tokens

Each page is wrapped in XML tags: <physical_index_N>text</physical_index_N> so the LLM knows which page each section starts on.

5. Structure Extraction

Group 1 gets the full extraction prompt. The LLM returns:

[
  {"structure": "1", "title": "Introduction", "physical_index": 1},
  {"structure": "1.1", "title": "Background", "physical_index": 1},
  {"structure": "1.2", "title": "Methods", "physical_index": 3}
]

Groups 2+ get a continuation prompt with capped context:

Document Size	Old Context (all sections)	New Context (capped)	Savings
100 pages	9,750 tokens	4,800 tokens	50.8%
300 pages	117,200 tokens	19,200 tokens	83.6%
500 pages	317,200 tokens	31,200 tokens	90.2%

The capped context includes:

All top-level sections (chapters) — to maintain overall structure awareness
Last 30 detailed sections — to continue numbering correctly
Total section count and last structure ID

6. Orphan Repair

After extraction, repair_orphans() fixes broken hierarchy:

Before repair:                After repair:
─────────────                 ─────────────
1    — Introduction           1    — Introduction
1.1  — Background             1.1  — Background
2.3.1 — Deep orphan           2    — Section 2      ← synthetic
                              2.3  — Section 2.3    ← synthetic
                              2.3.1 — Deep orphan

7. Tree Construction

The flat section list becomes a hierarchical tree:

list_to_tree() — Parent lookup by structure prefix ("1.2.3" → parent "1.2")
assign_page_ranges() — Each node gets start/end page indices
assign_node_ids() — DFS assigns "0001", "0002", etc.
embed_text_in_tree() — Each node gets concatenated page text for its range

8. Querying

result = index.query("What methods were used?")

Step 1: Strip text from tree (keep only structure/title/page_ranges/node_ids) to minimize tokens.

Step 2: Send stripped tree + question to LLM. LLM returns {node_ids: [...], reasoning: "..."}.

Step 3: Look up full text for selected nodes via the node map (O(1) per node).

Step 4 (agentic): Optionally generate an answer from the retrieved context.

Smart Hierarchy Detection

Strategy Priority

1. PDF has bookmarks/outline?
   YES → extract_toc() → toc_to_sections() → build tree (0 LLM calls)
   NO  ↓

2. PDF file?
   YES → detect_headings=True → font-size analysis → [H1][H2][H3] markers
   NO  → plain text extraction

3. LLM extraction with heading-guided prompts
   → repair_orphans() → build tree

How Font-Size Detection Works

# Step 1: Collect font sizes from first 50 pages
size_chars = Counter()
for page in doc[:50]:
    for block in page.get_text("dict")["blocks"]:
        for line in block["lines"]:
            for span in line["spans"]:
                size_chars[round(span["size"], 1)] += len(span["text"])

# Step 2: Body = most common size
body_size = size_chars.most_common(1)[0][0]  # e.g., 10.0

# Step 3: Headings = larger sizes, sorted descending
# e.g., 17.2 → H1, 12.0 → H2, 11.0 → H3

# Step 4: Annotate lines
# "[H2] 3 System Architecture"
# "TreeDex consists of five core modules..."

How ToC Numbering Works

PDF Outline:               → Structure Numbering:
─────────────              ─────────────────────
Level 1: Introduction      → "1"
  Level 2: Background      → "1.1"
  Level 2: Motivation      → "1.2"
Level 1: Methods           → "2"
  Level 2: Data            → "2.1"
    Level 3: Surveys       → "2.1.1"

Counters are maintained per level. When a new Level 1 appears, all deeper counters reset.

API Reference

TreeDex

Method	Python	Node.js	Description
Build from file	`TreeDex.from_file(path, llm, **opts)`	`await TreeDex.fromFile(path, llm, opts?)`	Full pipeline: load → detect → extract → build
Build from pages	`TreeDex.from_pages(pages, llm, **opts)`	`await TreeDex.fromPages(pages, llm, opts?)`	From pre-extracted pages
Build from tree	`TreeDex.from_tree(tree, pages, llm)`	`TreeDex.fromTree(tree, pages, llm)`	From existing tree (no LLM)
Query	`index.query(q, llm=, agentic=)`	`await index.query(q, {llm?, agentic?})`	Retrieve relevant sections
Save	`index.save(path)`	`await index.save(path)`	Export to JSON
Load	`TreeDex.load(path, llm)`	`await TreeDex.load(path, llm)`	Import from JSON
Show tree	`index.show_tree()`	`index.showTree()`	Pretty-print
Stats	`index.stats()`	`index.stats()`	Return `{total_pages, total_tokens, ...}`
Find large	`index.find_large_sections(**opts)`	`index.findLargeSections(opts?)`	Nodes exceeding thresholds

`from_file` Options

Parameter	Python	Node.js	Default	Description
path	`str`	`string`	required	Document file path
llm	`BaseLLM`	`BaseLLM`	required	LLM backend instance
loader	`Loader`	`{load()}`	`None`/`undefined`	Custom loader (auto-detect if omitted)
max_tokens	`int`	`number`	20000	Token budget per page group
overlap	`int`	`number`	1	Page overlap between groups
verbose	`bool`	`boolean`	`True`/`true`	Print progress
extract_images	`bool`	`boolean`	`False`/`false`	Extract images from PDFs

QueryResult

Property	Python	Node.js	Type	Description
Context	`.context`	`.context`	`str`	Concatenated text from selected nodes
Node IDs	`.node_ids`	`.nodeIds`	`list[str]`	IDs of selected tree nodes
Page ranges	`.page_ranges`	`.pageRanges`	`list[tuple]`	`[(start, end), ...]`
Pages string	`.pages_str`	`.pagesStr`	`str`	`"pages 5-8, 12-15"`
Reasoning	`.reasoning`	`.reasoning`	`str`	LLM's selection explanation
Answer	`.answer`	`.answer`	`str`	Generated answer (agentic mode only)

Hierarchy Utilities

Function	Python	Node.js	Description
Extract PDF ToC	`extract_toc(path)`	`await extractToc(path)`	Returns `[{level, title, physical_index}]` or `None`/`null`
ToC → sections	`toc_to_sections(toc)`	`tocToSections(toc)`	Convert ToC entries to numbered sections
Repair orphans	`repair_orphans(sections)`	`repairOrphans(sections)`	Insert synthetic parents for orphaned subsections
List to tree	`list_to_tree(sections)`	`listToTree(sections)`	Flat sections → hierarchical tree
Extract JSON	`extract_json(text)`	`extractJson(text)`	Robust JSON extraction from LLM output

LLM Backend System

Architecture

BaseLLM (abstract)
├── SDK-based (lazy import)
│   ├── GeminiLLM ─────── @google/generative-ai
│   ├── OpenAILLM ─────── openai
│   ├── ClaudeLLM ─────── @anthropic-ai/sdk
│   ├── MistralLLM ────── @mistralai/mistralai
│   └── CohereLLM ─────── cohere-ai
│
├── OpenAI-compatible (zero deps, fetch/urllib only)
│   ├── GroqLLM ─────────── api.groq.com
│   ├── TogetherLLM ─────── api.together.xyz
│   ├── FireworksLLM ────── api.fireworks.ai
│   ├── OpenRouterLLM ───── openrouter.ai
│   ├── DeepSeekLLM ─────── api.deepseek.com
│   ├── CerebrasLLM ─────── api.cerebras.ai
│   └── SambanovaLLM ────── api.sambanova.ai
│
├── Fetch-based (zero deps)
│   ├── HuggingFaceLLM ──── huggingface.co
│   └── OllamaLLM ──────── localhost:11434
│
├── Universal
│   ├── OpenAICompatibleLLM ── any URL + key
│   ├── LiteLLM ───────────── litellm (Python only)
│   └── FunctionLLM ────────── wrap any callable
│
└── BaseLLM ── subclass to build your own

Vision Support

Three backends support image description for the extract_images feature:

Backend	Method	Image Format
GeminiLLM	`generate_content()` with inline_data	Base64 PNG/JPEG
OpenAILLM	Chat completion with image_url	Base64 data URI
ClaudeLLM	Messages API with image source	Base64 with media_type

Custom Backend

# Python
class MyLLM(BaseLLM):
    def generate(self, prompt: str) -> str:
        return my_api.call(prompt)

# Or use FunctionLLM for quick wrapping:
llm = FunctionLLM(lambda prompt: my_api.call(prompt))

// TypeScript
const llm = new FunctionLLM((prompt: string) => myApi.call(prompt));

Token Management Strategy

Encoding

TreeDex uses cl100k_base (GPT-3.5/4 standard) for all token counting:

Python: tiktoken.get_encoding("cl100k_base")
Node.js: gpt-tokenizer (encode function)

Where Tokens Matter

Stage	Token Concern	How TreeDex Handles It
Page extraction	Pre-compute per page	`page.token_count` field
Page grouping	Fit within LLM context	`group_pages(maxTokens=20000)`
Continuation context	Don't balloon the prompt	Capped: top-level + last 30 sections
Query (tree structure)	Keep tree JSON small	`strip_text_from_tree()` removes all node text
Query (retrieved context)	Return relevant text	Only selected nodes' text is concatenated

Configuring for Your Model

# Small context model (8k)
index = TreeDex.from_file("doc.pdf", llm, max_tokens=6000)

# Large context model (128k)
index = TreeDex.from_file("doc.pdf", llm, max_tokens=100000)

Lower max_tokens = more groups = more LLM calls but works with smaller context windows. Higher max_tokens = fewer groups = faster but needs larger context.

Benchmarks

All benchmarks on research-paper.pdf (21 pages, 11,710 tokens, 41 ToC entries). Node.js, measured with performance.now().

Core Operations

Operation	Time	Notes
ToC extraction	30.9 ms	Reads PDF outline metadata
Page extraction (no headings)	298.9 ms	Plain text via pdfjs-dist
Page extraction (with headings)	423.5 ms	+41.7% for font analysis
Heading token overhead	+314 tokens	+2.7% (12,024 vs 11,710)

Page Grouping Speed

max_tokens	Groups	Avg tokens/group	Time
4,000	4	3,434	0.6 ms
8,000	2	6,279	0.1 ms
20,000	1	11,958	0.07 ms
128,000	1	11,958	0.05 ms

Tree Building Speed

Sections	Build Time	Nodes	Roots
10	0.6 ms	12	2
50	0.3 ms	50	10
200	0.5 ms	200	40
500	1.3 ms	500	100

Orphan Repair

Orphan Count	Time	Sections Added
5	0.2 ms	10 synthetic parents
20	0.5 ms	40 synthetic parents
50	1.5 ms	100 synthetic parents
100	3.0 ms	200 synthetic parents

tocToSections Speed

ToC Entries	Time/call	1000 calls
10	13 μs	12.8 ms
50	44 μs	43.7 ms
200	156 μs	155.6 ms

Memory Usage (Full Pipeline)

Metric	Value
Heap delta	+1.54 MB
RSS delta	+4.70 MB
External delta	+0.84 MB

Case Studies

Case Study 1: Before vs After Hierarchy Fix

Problem: On a 21-page research paper, the old extraction would produce 41 root-level nodes with max depth 1 — every section treated as a top-level chapter.

After v0.1.5 (ToC extraction):

Metric	Before (flat)	After (hierarchical)	Improvement
Root nodes	41	10	75.6% reduction
Max depth	1	3	3x deeper
Child nodes	0	31	Proper nesting
LLM calls	1+	0	100% saved

Tree output (after):

[0001] 1: Introduction (pages 1-1)
  [0002] 1.1: Background (pages 1-1)
  [0003] 1.2: Limitations of vector-based RAG (pages 1-1)
  [0004] 1.3: Our contribution (pages 2-2)
[0005] 2: Related Work (pages 2-4)
  [0006] 2.1: Retrieval-augmented generation (pages 2-2)
  [0007] 2.2: Document chunking strategies (pages 3-3)
  ...
[0011] 3: System Architecture (pages 5-8)
  [0012] 3.1: Architecture overview (pages 5-5)
  ...7 subsections...

Case Study 2: Heading Detection Impact

Page 2 without headings (what the LLM used to see):

1 Introduction  1.1 Background  Large Language Models (LLMs),
accessible primarily through web APIs...

Page 2 with headings (what the LLM sees now):

[H2] 1 Introduction
[H3] 1.1 Background
Large Language Models (LLMs), accessible primarily through web
APIs...

The [H2] and [H3] markers make the hierarchy unambiguous. The LLM prompt instructs:

[H1] = top-level chapters ("1", "2")
[H2] = sections ("1.1", "1.2")
[H3] = subsections ("1.1.1", "1.1.2")

Cost: Only 314 extra tokens across the entire 21-page document (+2.7%).

Case Study 3: Capped Continuation Context

For a 500-page document generating ~976 sections across ~56 page groups:

Old approach (Group 50 of 56):

Sends all 900+ previously extracted sections as JSON
Context: 317,200 tokens — exceeds most model limits
LLM truncates or hallucinates structure

New approach (Group 50 of 56):

Sends: 15 top-level chapters + 30 most recent sections + metadata
Context: 31,200 tokens — fits comfortably
LLM has structural overview + continuation point
90.2% token savings

Document Size	Old Context	Capped Context	Savings
100 pages	9,750 tok	4,800 tok	50.8%
300 pages	117,200 tok	19,200 tok	83.6%
500 pages	317,200 tok	31,200 tok	90.2%

Case Study 4: Orphan Repair in Practice

Scenario: LLM outputs 2.3.1 without ever producing 2 or 2.3:

Input (broken):              Output (repaired):
──────────────               ──────────────────
1    — Introduction          1    — Introduction
1.1  — Background            1.1  — Background
2.3.1 — Deep section         2    — Section 2      ← synthetic
3.1.2 — Another orphan       2.3  — Section 2.3    ← synthetic
4    — Conclusion            2.3.1 — Deep section
                             3    — Section 3      ← synthetic
                             3.1  — Section 3.1    ← synthetic
                             3.1.2 — Another orphan
                             4    — Conclusion

5 input sections → 9 after repair. The tree now has correct hierarchy instead of 3 orphaned root nodes.

Case Study 5: TreeDex vs Vector DB RAG

Dimension	TreeDex	Vector DB (Chroma/Pinecone)
Indexing	LLM extracts structure (or PDF ToC)	Embedding model generates vectors
Storage	JSON file (human-readable)	Vector database (opaque)
Retrieval	LLM navigates tree	Cosine similarity search
Structure	Preserves chapters/sections/subsections	Flat chunks, no hierarchy
Attribution	Exact page ranges per node	Approximate chunk boundaries
Infrastructure	None (just JSON)	Database server required
Dependencies	1 LLM API	1 LLM API + 1 embedding API + 1 DB
Debugging	Inspect JSON tree directly	Query embedding space (impractical)
Cost per query	1 LLM call (tree nav) + 1 optional (answer)	1 embedding call + 1 LLM call
Scales to 1M+ tokens	Yes (page grouping + capped context)	Yes (vector DB handles scale)
Best for	Structured docs (papers, books, manuals)	Unstructured content (logs, chat, mixed)

Cross-Language Parity

TreeDex maintains identical behavior across Python and Node.js. The JSON index format is fully cross-compatible — build in Python, query from Node.js, or vice versa.

Naming Conventions

Concept	Python	Node.js
Functions	`snake_case`	`camelCase`
Classes	`PascalCase`	`PascalCase`
JSON fields	`snake_case`	`snake_case` (shared format)
File names	`snake_case.py`	`kebab-case.ts`

Implementation Differences

Feature	Python	TypeScript	Notes
PDF library	PyMuPDF (fitz)	pdfjs-dist	PyMuPDF is faster; pdfjs runs in browser
Token counting	tiktoken	gpt-tokenizer	Both use cl100k_base
Async	Synchronous	`async/await`	Python blocks; TS is non-blocking
LiteLLM	Supported	Not available	Python-only universal backend
Image extraction	Full base64	Metadata only	pdfjs doesn't easily export encoded images
Heading detection	`page.get_text("dict")`	`getTextContent()` items	Both analyze font sizes

Index Format (Shared)

{
  "version": "1.0",
  "framework": "TreeDex",
  "tree": [
    {
      "structure": "1",
      "title": "Introduction",
      "physical_index": 0,
      "node_id": "0001",
      "nodes": [
        {
          "structure": "1.1",
          "title": "Background",
          "physical_index": 0,
          "node_id": "0002",
          "nodes": []
        }
      ]
    }
  ],
  "pages": [
    {"page_num": 0, "text": "...", "token_count": 342}
  ]
}

Error Handling

JSON Extraction (Multi-Pass)

LLM output is often imperfect. extract_json() tries 4 strategies:

Direct JSON.parse(text) — clean output
Code block extraction (```json ... ```) — markdown-wrapped
Trailing comma fix (, } → }) — common LLM error
Brace/bracket matching — find outermost {...} or [...]

Image Description (Graceful Fallback)

Has alt text?        → "[Image: alt text]"
Has vision LLM?      → "[Image: LLM description]"
Vision fails?        → "[Image present]"
No vision support?   → "[Image present]"

API Failures

HTTP errors include status code and response body
Timeout: 120s default for fetch-based backends
No automatic retries (user controls retry logic)

Query Without LLM

try:
    result = index.query("question")
except ValueError as e:
    # "No LLM provided. Pass llm= to query() or TreeDex constructor."

Configuration & Tuning

For Small Context Models (4k-8k)

index = TreeDex.from_file("doc.pdf", llm, max_tokens=4000, overlap=2)

More page groups, more LLM calls, but each call fits within the context window. Increase overlap to 2 for better section boundary detection.

For Large Context Models (128k+)

index = TreeDex.from_file("doc.pdf", llm, max_tokens=100000)

Fewer groups, fewer LLM calls. A 300-page document might fit in 3-5 groups instead of 56.

For Image-Heavy Documents

llm = GeminiLLM(api_key)  # Must support vision
index = TreeDex.from_file("slides.pdf", llm, extract_images=True)

Images are described by the vision LLM and appended as [Image: description] to page text.

Forcing Heading Detection

Heading detection is automatic when no ToC is found. To force it even with a custom loader:

from treedex.loaders import PDFLoader
loader = PDFLoader(detect_headings=True)
pages = loader.load("doc.pdf")
index = TreeDex.from_pages(pages, llm)

Limitations & Edge Cases

Current Limitations

Multi-column layouts — Font-size detection works but column order may be wrong
Dense tables — Treated as regular text; no special table parsing
RTL languages — Text extraction works but heading detection may miss markers
Streaming — LLM responses must be complete (no streaming support)
Concurrent queries — Not thread-safe; use separate TreeDex instances
Context overflow — Retrieved context isn't capped; may exceed model limits for very large sections

Edge Cases Handled

Single-page sections — end_index clamped to >= start_index
Empty pages — Token count is 0; grouping skips gracefully
No ToC — Falls back to LLM extraction with heading markers
No headings detected — Falls back to plain text (no font-size variation)
Orphaned sections — Synthetic parents auto-inserted
Malformed LLM JSON — Multi-pass extraction with fallbacks
Missing node IDs in query — Silently skipped (no crash)
PDF without text — Empty pages; images can still be described via vision LLM

Not Handled

Automatic section splitting — Large sections aren't auto-split; use find_large_sections() to identify them
Query result deduplication — Overlapping page ranges aren't merged
Incremental indexing — Re-index entire document on changes
Page-level granularity — Minimum unit is a page; sub-page sections share the full page text

Complexity Analysis

Operation	Time	Space
`extract_pages()`	O(pages)	O(pages × text)
`extract_toc()`	O(toc entries)	O(entries)
`group_pages()`	O(pages)	O(groups × text)
`list_to_tree()`	O(n)	O(n)
`repair_orphans()`	O(n × depth)	O(inserts)
`assign_page_ranges()`	O(n)	O(1) in-place
`assign_node_ids()`	O(n)	O(1) in-place
`embed_text_in_tree()`	O(n × pages_per_node)	O(text)
`create_node_mapping()`	O(n)	O(n)
`query()`	O(1 LLM call + n)	O(context text)
`save()`	O(n + pages)	O(JSON size)
`load()`	O(JSON size)	O(n + pages)

Where n = number of tree nodes, typically 10-500 for most documents.

TreeDex — Comprehensive Documentation

TreeDex — Comprehensive Documentation

Table of Contents

Architecture Overview

Module Map

Pipeline Deep Dive

1. Document Loading

2. PDF ToC Extraction

3. Font-Size Heading Detection

4. Page Grouping

5. Structure Extraction

6. Orphan Repair

7. Tree Construction

8. Querying

Smart Hierarchy Detection

Strategy Priority

How Font-Size Detection Works

How ToC Numbering Works

API Reference

TreeDex

from_file Options

QueryResult

Hierarchy Utilities

LLM Backend System

Architecture

Vision Support

Custom Backend

Token Management Strategy

Encoding

Where Tokens Matter

Configuring for Your Model

Benchmarks

Core Operations

Page Grouping Speed

Tree Building Speed

Orphan Repair

tocToSections Speed

Memory Usage (Full Pipeline)

Case Studies

Case Study 1: Before vs After Hierarchy Fix

Case Study 2: Heading Detection Impact

Case Study 3: Capped Continuation Context

Case Study 4: Orphan Repair in Practice

Case Study 5: TreeDex vs Vector DB RAG

Cross-Language Parity

Naming Conventions

Implementation Differences

Index Format (Shared)

Error Handling

JSON Extraction (Multi-Pass)

Image Description (Graceful Fallback)

API Failures

Query Without LLM

Configuration & Tuning

For Small Context Models (4k-8k)

For Large Context Models (128k+)

For Image-Heavy Documents

Forcing Heading Detection

Limitations & Edge Cases

Current Limitations

Edge Cases Handled

Not Handled

Complexity Analysis

Related Documents

Zig 0.16.0 Context Document for LLMs

所有收集类项目

Attaching virtual persistent memory in a {{site.data.keyword.powerSys_notm}} instance

`from_file` Options