
# arXiv Knowledge Base — Specification

**Version:** 0.1.0
**Last updated:** 2026-01-31
**Python:** >=3.10

---

## Current Status

| Metric | Value |
|--------|-------|
| Total Papers | 9,500 |
| Legacy Chunks (v1) | 12,443 |
| Enriched Chunks (v2) | 1,035 |
| Extraction: completed | 4,612 |
| Extraction: pending | 4,886 |
| Extraction: failed | 2 |
| Embedding: completed | 185 |
| Embedding: pending | 9,315 |
| V2 Chunks by level | summary: 407, section: 628 |
| V2 Contribution types | core: 664, comparison: 134, citation: 237 |
| V2 Key results | 97 |
| V2 Embedding status | completed: 966, pending: 69 |

**Top categories:** cs.LG (2,791), cs.CL (1,655), cs.AI (1,242), cs.CV (1,107), cs.CR (305), stat.ML (224)

---

## Architecture

```
arxiv_data/
├── config/
│   └── settings.py              # Pydantic Settings (env: ARXIV_KB_*)
├── src/
│   ├── cli.py                   # Typer CLI
│   ├── api/main.py              # FastAPI REST endpoints
│   ├── mcp/server.py            # FastMCP server for Claude
│   ├── collectors/
│   │   └── arxiv_fetcher.py     # arXiv API + PDF downloads
│   ├── extractors/
│   │   ├── pdf_extractor.py     # Docling PDF→Markdown + chunking
│   │   └── entity_extractor.py  # Hybrid entity extraction
│   ├── embedders/
│   │   └── embedder.py          # Sentence-transformers embeddings
│   └── storage/
│       ├── database.py          # SQLite (metadata + FTS5 + chunks)
│       └── vector_store.py      # LanceDB (vector search)
├── scripts/
│   └── migrate_to_v2.py         # v1→v2 migration tool
└── data/
    ├── arxiv.db                 # SQLite (82.8 MB)
    ├── lancedb/                 # Vector database
    ├── pdfs/{YYYY-MM}/          # Downloaded PDFs
    └── markdown/{YYYY-MM}/      # Extracted markdown + JSON metadata
```

---

## Stack

| Component | Technology |
|-----------|-----------|
| Collection | `arxiv` (pip) — official wrapper, rate-limiting |
| PDF Extraction | **Docling** (IBM) — GPU-accelerated, Markdown output |
| Metadata DB | **SQLite + FTS5** — full-text search on titles/abstracts |
| Vector DB | **LanceDB** — serverless, cosine metric, IVF index |
| Embeddings | **Qwen3-Embedding-0.6B** (configurable, 1024-dim) |
| REST API | **FastAPI** |
| MCP Server | **FastMCP** — Claude integration |
| CLI | **Typer + Rich** |
| Config | **Pydantic Settings** (env prefix: `ARXIV_KB_`) |

---

## Configuration

### Path Settings

| Setting | Default |
|---------|---------|
| `data_dir` | `{base_dir}/data` |
| `pdf_dir` | `{data_dir}/pdfs` |
| `markdown_dir` | `{data_dir}/markdown` |
| `sqlite_path` | `{data_dir}/arxiv.db` |
| `lancedb_path` | `{data_dir}/lancedb` |

### arXiv Settings

| Setting | Default |
|---------|---------|
| `arxiv_categories` | `["cs.LG", "cs.CL", "cs.AI", "stat.ML"]` |
| `arxiv_delay_seconds` | 3.0 |
| `arxiv_page_size` | 500 |
| `arxiv_num_retries` | 5 |

### Embedding Settings

| Setting | Default |
|---------|---------|
| `embedding_model` | `Qwen/Qwen3-Embedding-0.6B` |
| `embedding_dimension` | 1024 |
| `embedding_batch_size` | 32 |

### Chunking Settings (v1 — Legacy)

| Setting | Default |
|---------|---------|
| `chunk_size` | 1024 tokens |
| `chunk_overlap` | 100 tokens |
| `min_chunk_size` | 100 tokens |

### Hierarchical Chunking Settings (v2)

| Setting | Default |
|---------|---------|
| `enable_hierarchical_chunking` | True |
| `summary_chunk_max_tokens` | 512 |
| `section_chunk_max_tokens` | 1024 |
| `atomic_chunk_max_tokens` | 256 |
| `atomic_chunk_min_tokens` | 50 |

### Entity Extraction Settings

| Setting | Default |
|---------|---------|
| `enable_entity_extraction` | True |
| `use_llm_for_classification` | False |
| `entity_extraction_llm` | `Qwen/Qwen2.5-3B-Instruct` |
| `extract_atomic_facts` | True |

### API & Scheduler

| Setting | Default |
|---------|---------|
| `api_host` | `0.0.0.0` |
| `api_port` | 8000 |
| `sync_hour` | 6 |
| `sync_minute` | 0 |

---

## Database Schema

### `papers`

| Column | Type | Description |
|--------|------|-------------|
| id | TEXT PK | arXiv ID (e.g. `2401.12345`) |
| title | TEXT NOT NULL | |
| abstract | TEXT | |
| authors | TEXT | JSON array |
| categories | TEXT | JSON array |
| primary_category | TEXT | |
| published_date | TEXT | ISO format |
| updated_date | TEXT | ISO format |
| doi | TEXT | |
| journal_ref | TEXT | |
| pdf_url | TEXT | |
| pdf_path | TEXT | Local path |
| markdown_path | TEXT | Local path |
| extraction_status | TEXT | `pending` / `completed` / `failed` |
| embedding_status | TEXT | `pending` / `completed` / `failed` |
| version | INTEGER | Paper version number |
| created_at | TEXT | |
| updated_at | TEXT | |

**Indexes:** `published_date`, `primary_category`, `extraction_status`, `embedding_status`

**FTS5:** Virtual table on `id`, `title`, `abstract`, `authors` with auto-sync triggers

### `chunks` (v1 — Legacy)

| Column | Type | Description |
|--------|------|-------------|
| id | INTEGER PK AUTO | |
| paper_id | TEXT FK | → papers.id |
| chunk_index | INTEGER | Position in sequence |
| chunk_type | TEXT | `abstract`, `introduction`, `methodology`, `results`, etc. |
| content | TEXT NOT NULL | |
| token_count | INTEGER | |
| embedding_id | TEXT | Vector store reference |

### `chunks_v2` (Enriched — Hierarchical)

| Column | Type | Description |
|--------|------|-------------|
| id | TEXT PK | `{paper_id}_{chunk_level}_{index}` |
| paper_id | TEXT FK | → papers.id |
| chunk_level | TEXT NOT NULL | `summary` / `section` / `atomic` |
| section_type | TEXT | `abstract`, `introduction`, `methodology`, `results`, etc. |
| content | TEXT NOT NULL | |
| token_count | INTEGER | |
| techniques | TEXT | JSON array: `["MoE", "attention"]` |
| models_mentioned | TEXT | JSON array: `["BERT", "GPT-4"]` |
| benchmarks | TEXT | JSON array: `["GLUE", "MMLU"]` |
| metrics | TEXT | JSON object: `{"accuracy": 85.3}` |
| contribution_type | TEXT | `core` / `comparison` / `baseline` / `citation` |
| is_key_result | BOOLEAN | |
| embedding_status | TEXT | `pending` / `completed` / `failed` |
| embedding_id | TEXT | |
| created_at | TEXT | |
| updated_at | TEXT | |

**Indexes:** `paper_id`, `chunk_level`, `contribution_type`, `is_key_result`, `embedding_status`

### `sync_state`

Single-row table tracking last sync date, last paper ID, papers fetched count, and status (`idle`/`in_progress`/`completed`/`failed`).

---

## Vector Store (LanceDB)

### `paper_chunks` (v1 — Legacy)

| Field | Type |
|-------|------|
| id | str (`{paper_id}_{chunk_index}`) |
| paper_id | str |
| chunk_index | int |
| chunk_type | str |
| content | str |
| title | str |
| authors | str |
| primary_category | str |
| published_date | str |
| vector | Vector(1024) |

### `enriched_paper_chunks` (v2)

| Field | Type |
|-------|------|
| id | str (`{paper_id}_{chunk_level}_{index}`) |
| paper_id | str |
| chunk_level | str |
| section_type | str |
| content | str |
| title | str |
| authors | str |
| primary_category | str |
| published_date | str |
| techniques | str (JSON) |
| models_mentioned | str (JSON) |
| benchmarks | str (JSON) |
| metrics | str (JSON) |
| contribution_type | str |
| is_key_result | bool |
| vector | Vector(1024) |

**Index config:** Cosine metric, 256 partitions, 96 sub-vectors (IVF)

---

## Data Pipeline

```
1. SYNC (arxiv_fetcher.py)
   arXiv API → papers table + PDFs to data/pdfs/{YYYY-MM}/
   Status: extraction_status = "pending"

2. EXTRACT (pdf_extractor.py)
   PDF → Docling → Markdown → Sections → Chunks
   Saves: data/markdown/{YYYY-MM}/{id}.md + {id}.json
   v1: chunks table (flat)
   v2: chunks_v2 table (hierarchical + entities)
   Status: extraction_status = "completed"

3. EMBED (embedder.py)
   Chunks → Qwen3-Embedding-0.6B → LanceDB
   v1: paper_chunks table
   v2: enriched_paper_chunks table
   Status: embedding_status = "completed"

4. SEARCH
   Query → Embed → Vector similarity + metadata filters → Results
```

---

## Hierarchical Chunking Strategy (v2)

### Level 1: Summary (1 per paper)

- Abstract + first paragraph of introduction
- Entities extracted from abstract
- Always classified as `core` contribution
- Used for overview/discovery queries

### Level 2: Section (variable per paper)

- Split by semantic boundaries (tables, figures, subsections, bold headers)
- Each section chunk gets independent entity extraction
- Contribution type classified per chunk
- Results sections split by experiment/benchmark
- Methodology sections split by component/step

### Level 3: Atomic (variable per paper)

- Fine-grained facts: `"DeltaNet achieves 85.3% on GLUE"`
- Only generated from results/methodology sections
- Only when entities (models, benchmarks, metrics) are present
- Each fact links model + benchmark + metric

---

## Entity Extraction

### Approach: Hybrid (Rules + Patterns + Optional LLM)

1. **Known entities** — curated lists matched via compiled regex:
   - 100+ benchmarks (GLUE, ImageNet, MMLU, ...)
   - 150+ models (BERT, GPT-4, LLaMA, ...)
   - 100+ techniques (attention, MoE, LoRA, ...)
2. **Metrics** — regex patterns for 20+ metric types:
   - accuracy, F1, BLEU, ROUGE, perplexity, mAP, WER, FLOPs, etc.
3. **Contribution classification** — rule-based (optional LLM):
   - `core`: "we propose", "our method", "novel", methodology sections
   - `comparison`: "compared to", "outperforms", "baseline"
   - `baseline`: baseline references
   - `citation`: related work, default
4. **Key result detection**:
   - "state-of-the-art", "SOTA", "new record"
   - Chunks with metrics + core/comparison type

---

## CLI Commands

| Command | Description |
|---------|-------------|
| `arxiv-kb sync [--days N] [--incremental]` | Fetch papers from arXiv |
| `arxiv-kb extract [PAPER_ID] [--all] [--limit N]` | Extract text from PDFs |
| `arxiv-kb embed [PAPER_ID] [--all] [--limit N]` | Generate embeddings |
| `arxiv-kb search QUERY [--fulltext] [--cat CS.LG]` | Search papers |
| `arxiv-kb stats` | Show database statistics |
| `arxiv-kb paper PAPER_ID` | Show paper details |
| `arxiv-kb serve [--port 8000]` | Start FastAPI server |
| `arxiv-kb mcp` | Start MCP server |
| `arxiv-kb pipeline [--days 7] [--limit 100]` | Full sync→extract→embed |

---

## REST API Endpoints

### Search

| Method | Path | Description |
|--------|------|-------------|
| POST | `/search` | Semantic search (query, categories, date range, limit) |
| POST | `/search/hybrid` | Hybrid semantic + FTS (vector_weight: 0-1) |
| GET | `/search/fulltext?query=...` | Full-text search on titles/abstracts |

### Papers

| Method | Path | Description |
|--------|------|-------------|
| GET | `/papers/{id}` | Paper metadata |
| GET | `/papers/{id}/chunks` | All text chunks |
| GET | `/papers/{id}/markdown` | Extracted markdown text |
| GET | `/papers/{id}/pdf` | Download PDF file |
| GET | `/papers/{id}/similar?limit=10` | Find similar papers |
| POST | `/papers/filter` | Filter by categories, authors, dates |

### Browse

| Method | Path | Description |
|--------|------|-------------|
| GET | `/recent?days=7&categories=cs.LG` | Recent papers |
| GET | `/authors/{name}` | Papers by author |

### Metadata

| Method | Path | Description |
|--------|------|-------------|
| GET | `/` | API info |
| GET | `/health` | Health check |
| GET | `/stats` | Database statistics |

---

## MCP Tools (Claude Integration)

### Legacy Tools (v1)

| Tool | Args | Description |
|------|------|-------------|
| `search_papers` | query, categories?, limit | Semantic search |
| `get_paper_details` | paper_id, include_full_text? | Paper metadata + optional text |
| `find_similar_papers` | paper_id, limit | Similar papers |
| `search_by_author` | author_name, limit | Papers by author |
| `get_recent_papers` | days, categories?, limit | Recent papers |
| `fulltext_search` | query, limit | Keyword-based search |
| `get_paper_chunks` | paper_id | All chunks for a paper |
| `get_knowledge_base_stats` | — | DB statistics |

### Enriched Tools (v2)

| Tool | Args | Description |
|------|------|-------------|
| `search_model_results` | model_name, benchmark?, include_baselines | Find benchmark results for a model |
| `find_innovations` | topic, techniques?, days, limit | Core contributions on a topic |
| `compare_models` | models (comma-sep), benchmark? | Compare model performance |
| `search_by_technique` | technique, contribution_type, limit | Papers by technique |
| `get_key_results` | query, limit | SOTA / significant results |
| `get_paper_entities` | paper_id | Aggregated entities for a paper |
| `advanced_search` | query, models?, benchmarks?, techniques?, contribution_type?, chunk_level? | Multi-filter search |

---

## Migration (v1 → v2)

```bash
python scripts/migrate_to_v2.py status                         # Check progress
python scripts/migrate_to_v2.py init                           # Create v2 tables
python scripts/migrate_to_v2.py migrate                        # Migrate all papers
python scripts/migrate_to_v2.py migrate --limit 100
python scripts/migrate_to_v2.py migrate --paper-id 2401.12345
python scripts/migrate_to_v2.py migrate --no-skip              # Re-migrate existing
```

The migration re-reads existing markdown files (no re-extraction needed), applies hierarchical chunking + entity extraction, generates new embeddings, and stores in both SQLite `chunks_v2` and LanceDB `enriched_paper_chunks`.
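
During migration each chunk is re-classified with the rule-based contribution and key-result rules listed under Entity Extraction. The following is a minimal sketch of how such rules could look, not the code in `entity_extractor.py`. It assumes case-insensitive whole-phrase matching, a fixed precedence (`core` before `comparison`, with `citation` as the default), and that either key-result bullet is sufficient on its own. The names `classify_contribution` and `is_key_result` are hypothetical, and the `baseline` type is left out because the specification gives no distinct cue phrases for it.

```python
import re

# Cue phrases quoted in the specification; how they are matched is an assumption.
CORE_CUES = ("we propose", "our method", "novel")
COMPARISON_CUES = ("compared to", "outperforms", "baseline")
KEY_RESULT_CUES = ("state-of-the-art", "sota", "new record")


def _has_cue(text: str, cues: tuple[str, ...]) -> bool:
    """Return True if any cue occurs in the text as a whole phrase."""
    lowered = text.lower()
    return any(re.search(rf"\b{re.escape(cue)}\b", lowered) for cue in cues)


def classify_contribution(text: str, section_type: str = "") -> str:
    """Assign a contribution type using an assumed precedence order."""
    if section_type == "methodology" or _has_cue(text, CORE_CUES):
        return "core"
    if _has_cue(text, COMPARISON_CUES):
        return "comparison"
    return "citation"  # documented default


def is_key_result(text: str, contribution_type: str, metrics: dict) -> bool:
    """Flag a chunk if it has a key-result phrase, or metrics plus a
    core/comparison type (treating the two rules as alternatives)."""
    if _has_cue(text, KEY_RESULT_CUES):
        return True
    return bool(metrics) and contribution_type in {"core", "comparison"}


label = classify_contribution("DeltaNet outperforms BERT on GLUE.")
flag = is_key_result("DeltaNet achieves 85.3% on GLUE", "core", {"accuracy": 85.3})
```

Whole-phrase matching keeps "novel" from firing on "novelty"; a production classifier would likely need a richer cue list or the optional LLM path enabled by `use_llm_for_classification`.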
---

## Dependencies

```toml
[project]
dependencies = [
    "arxiv>=2.1.0",
    "httpx>=0.27.0",
    "aiofiles>=24.1.0",
    "marker-pdf>=1.0.0",
    "lancedb>=0.10.0",
    "sentence-transformers>=3.0.0",
    "fastapi>=0.115.0",
    "uvicorn>=0.30.0",
    "fastmcp>=0.3.0",
    "apscheduler>=3.10.0",
    "typer>=0.12.0",
    "rich>=13.0.0",
    "pydantic-settings>=2.0.0",
    "python-dotenv>=1.0.0",
]
```

---

## File Storage Layout

```
data/
├── arxiv.db                          # SQLite (82.8 MB)
│   ├── papers (9,500 rows)
│   ├── chunks (12,443 rows)
│   ├── chunks_v2 (1,035 rows)
│   ├── papers_fts (FTS5 index)
│   └── sync_state (1 row)
├── lancedb/
│   ├── paper_chunks/                 # v1 embeddings (521 vectors)
│   └── enriched_paper_chunks/        # v2 embeddings (966 vectors)
├── pdfs/{YYYY-MM}/{paper_id}.pdf     # Original PDFs
└── markdown/{YYYY-MM}/
    ├── {paper_id}.md                 # Extracted text
    └── {paper_id}.json               # Section metadata
```
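
The row counts in this layout, and the per-level breakdowns under Current Status, are the kind of figures a plain `GROUP BY` over `chunks_v2` produces. The sketch below illustrates that with an in-memory SQLite database. The DDL is a trimmed reconstruction from the `chunks_v2` column table and may differ from what `database.py` actually creates (the foreign key to `papers` is omitted so the example runs on its own). `make_chunk_id` and `count_by` are hypothetical helper names, and the inserted rows are made-up sample data.

```python
import json
import sqlite3

# Trimmed schema reconstructed from the chunks_v2 column table (an assumption).
SCHEMA = """
CREATE TABLE chunks_v2 (
    id TEXT PRIMARY KEY,
    paper_id TEXT NOT NULL,
    chunk_level TEXT NOT NULL,
    content TEXT NOT NULL,
    benchmarks TEXT,
    contribution_type TEXT,
    is_key_result BOOLEAN DEFAULT 0,
    embedding_status TEXT DEFAULT 'pending'
);
CREATE INDEX idx_chunks_v2_level ON chunks_v2 (chunk_level);
"""

GROUPABLE = {"chunk_level", "contribution_type", "embedding_status"}


def make_chunk_id(paper_id: str, chunk_level: str, index: int) -> str:
    """Build an ID in the documented {paper_id}_{chunk_level}_{index} form."""
    return f"{paper_id}_{chunk_level}_{index}"


def count_by(conn: sqlite3.Connection, column: str) -> dict:
    """Count chunks per distinct value of a whitelisted column."""
    if column not in GROUPABLE:  # column names cannot be bound as parameters
        raise ValueError(f"unsupported column: {column}")
    cursor = conn.execute(
        f"SELECT {column}, COUNT(*) FROM chunks_v2 GROUP BY {column}"
    )
    return dict(cursor.fetchall())


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Illustrative sample rows only; not taken from the real database.
samples = [
    ("2401.12345", "summary", 0, "Abstract text", [], "core"),
    ("2401.12345", "section", 0, "Results text", ["GLUE"], "comparison"),
    ("2401.12345", "section", 1, "Related work text", [], "citation"),
]
conn.executemany(
    "INSERT INTO chunks_v2 (id, paper_id, chunk_level, content, benchmarks, "
    "contribution_type) VALUES (?, ?, ?, ?, ?, ?)",
    [
        (make_chunk_id(pid, level, idx), pid, level, text, json.dumps(bench), ctype)
        for pid, level, idx, text, bench, ctype in samples
    ],
)

level_counts = count_by(conn, "chunk_level")
status_counts = count_by(conn, "embedding_status")
```

JSON-valued columns such as `benchmarks` are stored as serialized text, matching the "JSON array" convention in the schema tables, so filtering on them requires either `json_each` or decoding in Python.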
