# arXiv Knowledge Base — Specification
**Version:** 0.1.0
**Last updated:** 2026-01-31
**Python:** >=3.10
---
## Current Status
| Metric | Value |
|--------|-------|
| Total Papers | 9,500 |
| Legacy Chunks (v1) | 12,443 |
| Enriched Chunks (v2) | 1,035 |
| Extraction: completed | 4,612 |
| Extraction: pending | 4,886 |
| Extraction: failed | 2 |
| Embedding: completed | 185 |
| Embedding: pending | 9,315 |
| V2 Chunks by level | summary: 407, section: 628 |
| V2 Contribution types | core: 664, comparison: 134, citation: 237 |
| V2 Key results | 97 |
| V2 Embedding status | completed: 966, pending: 69 |
**Top categories:** cs.LG (2,791), cs.CL (1,655), cs.AI (1,242), cs.CV (1,107), cs.CR (305), stat.ML (224)
---
## Architecture
```
arxiv_data/
├── config/
│   └── settings.py              # Pydantic Settings (env: ARXIV_KB_*)
├── src/
│   ├── cli.py                   # Typer CLI
│   ├── api/main.py              # FastAPI REST endpoints
│   ├── mcp/server.py            # FastMCP server for Claude
│   ├── collectors/
│   │   └── arxiv_fetcher.py     # arXiv API + PDF downloads
│   ├── extractors/
│   │   ├── pdf_extractor.py     # Docling PDF→Markdown + chunking
│   │   └── entity_extractor.py  # Hybrid entity extraction
│   ├── embedders/
│   │   └── embedder.py          # Sentence-transformers embeddings
│   └── storage/
│       ├── database.py          # SQLite (metadata + FTS5 + chunks)
│       └── vector_store.py      # LanceDB (vector search)
├── scripts/
│   └── migrate_to_v2.py         # v1→v2 migration tool
└── data/
    ├── arxiv.db                 # SQLite (82.8 MB)
    ├── lancedb/                 # Vector database
    ├── pdfs/{YYYY-MM}/          # Downloaded PDFs
    └── markdown/{YYYY-MM}/      # Extracted markdown + JSON metadata
```
---
## Stack
| Component | Technology |
|-----------|-----------|
| Collection | `arxiv` (pip) — official wrapper, rate-limiting |
| PDF Extraction | **Docling** (IBM) — GPU-accelerated, Markdown output |
| Metadata DB | **SQLite + FTS5** — full-text search on titles/abstracts |
| Vector DB | **LanceDB** — serverless, cosine metric, IVF index |
| Embeddings | **Qwen3-Embedding-0.6B** (configurable, 1024-dim) |
| REST API | **FastAPI** |
| MCP Server | **FastMCP** — Claude integration |
| CLI | **Typer + Rich** |
| Config | **Pydantic Settings** (env prefix: `ARXIV_KB_`) |
---
## Configuration
### Path Settings
| Setting | Default |
|---------|---------|
| `data_dir` | `{base_dir}/data` |
| `pdf_dir` | `{data_dir}/pdfs` |
| `markdown_dir` | `{data_dir}/markdown` |
| `sqlite_path` | `{data_dir}/arxiv.db` |
| `lancedb_path` | `{data_dir}/lancedb` |
### arXiv Settings
| Setting | Default |
|---------|---------|
| `arxiv_categories` | `["cs.LG", "cs.CL", "cs.AI", "stat.ML"]` |
| `arxiv_delay_seconds` | 3.0 |
| `arxiv_page_size` | 500 |
| `arxiv_num_retries` | 5 |
### Embedding Settings
| Setting | Default |
|---------|---------|
| `embedding_model` | `Qwen/Qwen3-Embedding-0.6B` |
| `embedding_dimension` | 1024 |
| `embedding_batch_size` | 32 |
### Chunking Settings (v1 — Legacy)
| Setting | Default |
|---------|---------|
| `chunk_size` | 1024 tokens |
| `chunk_overlap` | 100 tokens |
| `min_chunk_size` | 100 tokens |
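
The three v1 settings describe a sliding window over tokens. The sketch below is one plausible reading of them, not the project's actual chunker: `chunk_tokens` is a hypothetical helper, it works on an already tokenized list, and the handling of short tails is an assumption.

```python
def chunk_tokens(tokens, chunk_size=1024, chunk_overlap=100, min_chunk_size=100):
    """Split a token list into overlapping windows (illustrative only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        # Assumption: a trailing fragment below min_chunk_size is dropped,
        # unless it is the only chunk the document produces.
        if len(window) < min_chunk_size and chunks:
            break
        chunks.append(window)
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Usage: 2,000 dummy tokens with the documented defaults.
example_chunks = chunk_tokens(list(range(2000)))
```

With the defaults, consecutive windows share exactly `chunk_overlap` tokens, so the last 100 tokens of one chunk reappear at the start of the next.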
### Hierarchical Chunking Settings (v2)
| Setting | Default |
|---------|---------|
| `enable_hierarchical_chunking` | True |
| `summary_chunk_max_tokens` | 512 |
| `section_chunk_max_tokens` | 1024 |
| `atomic_chunk_max_tokens` | 256 |
| `atomic_chunk_min_tokens` | 50 |
### Entity Extraction Settings
| Setting | Default |
|---------|---------|
| `enable_entity_extraction` | True |
| `use_llm_for_classification` | False |
| `entity_extraction_llm` | `Qwen/Qwen2.5-3B-Instruct` |
| `extract_atomic_facts` | True |
### API & Scheduler
| Setting | Default |
|---------|---------|
| `api_host` | `0.0.0.0` |
| `api_port` | 8000 |
| `sync_hour` | 6 |
| `sync_minute` | 0 |
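
The project loads these values through Pydantic Settings with the `ARXIV_KB_` prefix. To illustrate how a prefixed environment variable maps onto a setting, here is a standard-library sketch; the `Settings` dataclass and `load_settings` helper are hypothetical stand-ins for `config/settings.py`, and only a few of the documented defaults are included.

```python
import os
from dataclasses import dataclass, fields

ENV_PREFIX = "ARXIV_KB_"  # prefix named in the spec


@dataclass
class Settings:
    # Subset of the documented defaults; the real settings class has more fields.
    arxiv_delay_seconds: float = 3.0
    arxiv_page_size: int = 500
    embedding_model: str = "Qwen/Qwen3-Embedding-0.6B"
    api_port: int = 8000


def load_settings(environ=None):
    """Return Settings where ARXIV_KB_<FIELD_NAME> overrides the default."""
    environ = os.environ if environ is None else environ
    overrides = {}
    for field in fields(Settings):
        raw = environ.get(ENV_PREFIX + field.name.upper())
        if raw is not None:
            # Cast the string from the environment to the type of the default.
            overrides[field.name] = type(field.default)(raw)
    return Settings(**overrides)


# Usage: override the API port, keep every other default.
settings = load_settings({"ARXIV_KB_API_PORT": "9000"})
```

Under this mapping, `ARXIV_KB_API_PORT=9000` changes `api_port` while unset variables fall back to the table defaults.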
---
## Database Schema
### `papers`
| Column | Type | Description |
|--------|------|-------------|
| id | TEXT PK | arXiv ID (e.g. `2401.12345`) |
| title | TEXT NOT NULL | |
| abstract | TEXT | |
| authors | TEXT | JSON array |
| categories | TEXT | JSON array |
| primary_category | TEXT | |
| published_date | TEXT | ISO format |
| updated_date | TEXT | ISO format |
| doi | TEXT | |
| journal_ref | TEXT | |
| pdf_url | TEXT | |
| pdf_path | TEXT | Local path |
| markdown_path | TEXT | Local path |
| extraction_status | TEXT | `pending` / `completed` / `failed` |
| embedding_status | TEXT | `pending` / `completed` / `failed` |
| version | INTEGER | Paper version number |
| created_at | TEXT | |
| updated_at | TEXT | |
**Indexes:** `published_date`, `primary_category`, `extraction_status`, `embedding_status`
**FTS5:** Virtual table on `id`, `title`, `abstract`, `authors` with auto-sync triggers
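
The two status columns act as work queues for the extract and embed stages. A minimal sketch with `sqlite3`, assuming a reduced `papers` table, a hypothetical index name, and placeholder titles (the real schema has all the columns listed above):

```python
import sqlite3

VALID_STATUSES = ("pending", "completed", "failed")


def open_db():
    """Create an in-memory database with a reduced papers table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE papers (
            id TEXT PRIMARY KEY,
            title TEXT NOT NULL,
            extraction_status TEXT DEFAULT 'pending',
            embedding_status TEXT DEFAULT 'pending'
        )
        """
    )
    # Index name is illustrative; the spec only says the column is indexed.
    conn.execute("CREATE INDEX idx_papers_extraction ON papers(extraction_status)")
    return conn


def pending_extraction(conn, limit=100):
    """Return IDs of papers still waiting for PDF extraction."""
    rows = conn.execute(
        "SELECT id FROM papers WHERE extraction_status = 'pending' ORDER BY id LIMIT ?",
        (limit,),
    )
    return [row[0] for row in rows]


def set_extraction_status(conn, paper_id, status):
    """Move a paper to a new extraction status."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status}")
    conn.execute("UPDATE papers SET extraction_status = ? WHERE id = ?", (status, paper_id))


conn = open_db()
conn.executemany(
    "INSERT INTO papers (id, title) VALUES (?, ?)",
    [("2401.12345", "Placeholder title A"), ("2401.12346", "Placeholder title B")],
)
set_extraction_status(conn, "2401.12345", "completed")
```

After the update, only `2401.12346` remains in the pending queue, which is the set a command such as `arxiv-kb extract --all` would presumably iterate over.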
### `chunks` (v1 — Legacy)
| Column | Type | Description |
|--------|------|-------------|
| id | INTEGER PK AUTO | |
| paper_id | TEXT FK | → papers.id |
| chunk_index | INTEGER | Position in sequence |
| chunk_type | TEXT | `abstract`, `introduction`, `methodology`, `results`, etc. |
| content | TEXT NOT NULL | |
| token_count | INTEGER | |
| embedding_id | TEXT | Vector store reference |
### `chunks_v2` (Enriched — Hierarchical)
| Column | Type | Description |
|--------|------|-------------|
| id | TEXT PK | `{paper_id}_{chunk_level}_{index}` |
| paper_id | TEXT FK | → papers.id |
| chunk_level | TEXT NOT NULL | `summary` / `section` / `atomic` |
| section_type | TEXT | `abstract`, `introduction`, `methodology`, `results`, etc. |
| content | TEXT NOT NULL | |
| token_count | INTEGER | |
| techniques | TEXT | JSON array: `["MoE", "attention"]` |
| models_mentioned | TEXT | JSON array: `["BERT", "GPT-4"]` |
| benchmarks | TEXT | JSON array: `["GLUE", "MMLU"]` |
| metrics | TEXT | JSON object: `{"accuracy": 85.3}` |
| contribution_type | TEXT | `core` / `comparison` / `baseline` / `citation` |
| is_key_result | BOOLEAN | |
| embedding_status | TEXT | `pending` / `completed` / `failed` |
| embedding_id | TEXT | |
| created_at | TEXT | |
| updated_at | TEXT | |
**Indexes:** `paper_id`, `chunk_level`, `contribution_type`, `is_key_result`, `embedding_status`
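
To make the ID format and the JSON-encoded columns concrete, the sketch below builds one `chunks_v2` row as a plain dict. `make_chunk_row` is a hypothetical helper and omits the timestamp and embedding columns.

```python
import json

CHUNK_LEVELS = ("summary", "section", "atomic")


def make_chunk_row(paper_id, chunk_level, index, content, *,
                   techniques=(), models_mentioned=(), benchmarks=(),
                   metrics=None, contribution_type="citation", is_key_result=False):
    """Build a dict shaped like a chunks_v2 row (illustrative subset of columns)."""
    if chunk_level not in CHUNK_LEVELS:
        raise ValueError(f"unknown chunk_level: {chunk_level}")
    return {
        "id": f"{paper_id}_{chunk_level}_{index}",  # documented ID format
        "paper_id": paper_id,
        "chunk_level": chunk_level,
        "content": content,
        # Entity columns are stored as JSON text, as described in the table.
        "techniques": json.dumps(list(techniques)),
        "models_mentioned": json.dumps(list(models_mentioned)),
        "benchmarks": json.dumps(list(benchmarks)),
        "metrics": json.dumps(metrics or {}),
        "contribution_type": contribution_type,
        "is_key_result": is_key_result,
    }


# Usage with the example values from the schema table.
row = make_chunk_row(
    "2401.12345", "section", 3, "Results on GLUE ...",
    models_mentioned=["BERT"], benchmarks=["GLUE"],
    metrics={"accuracy": 85.3}, contribution_type="comparison",
)
```

Readers of the table decode the entity columns with `json.loads` before filtering on them.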
### `sync_state`
Single-row table tracking the last sync date, the last paper ID, the number of papers fetched, and the sync status (`idle`/`in_progress`/`completed`/`failed`).
---
## Vector Store (LanceDB)
### `paper_chunks` (v1 — Legacy)
| Field | Type |
|-------|------|
| id | str (`{paper_id}_{chunk_index}`) |
| paper_id | str |
| chunk_index | int |
| chunk_type | str |
| content | str |
| title | str |
| authors | str |
| primary_category | str |
| published_date | str |
| vector | Vector(1024) |
### `enriched_paper_chunks` (v2)
| Field | Type |
|-------|------|
| id | str (`{paper_id}_{chunk_level}_{index}`) |
| paper_id | str |
| chunk_level | str |
| section_type | str |
| content | str |
| title | str |
| authors | str |
| primary_category | str |
| published_date | str |
| techniques | str (JSON) |
| models_mentioned | str (JSON) |
| benchmarks | str (JSON) |
| metrics | str (JSON) |
| contribution_type | str |
| is_key_result | bool |
| vector | Vector(1024) |
**Index config:** Cosine metric, 256 partitions, 96 sub-vectors (IVF-PQ)
---
## Data Pipeline
```
1. SYNC (arxiv_fetcher.py)
   arXiv API → papers table + PDFs to data/pdfs/{YYYY-MM}/
   Status: extraction_status = "pending"

2. EXTRACT (pdf_extractor.py)
   PDF → Docling → Markdown → Sections → Chunks
   Saves: data/markdown/{YYYY-MM}/{id}.md + {id}.json
   v1: chunks table (flat)
   v2: chunks_v2 table (hierarchical + entities)
   Status: extraction_status = "completed"

3. EMBED (embedder.py)
   Chunks → Qwen3-Embedding-0.6B → LanceDB
   v1: paper_chunks table
   v2: enriched_paper_chunks table
   Status: embedding_status = "completed"

4. SEARCH
   Query → Embed → Vector similarity + metadata filters → Results
```
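
The SEARCH step combines a metadata filter with cosine ranking. The brute-force sketch below shows only that logic: LanceDB performs the real search through its index, the 3-dimensional vectors stand in for the 1024-dimensional embeddings, and the row IDs and `search` helper are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_vector, rows, category=None, limit=10):
    """Filter rows by primary_category, then rank by cosine similarity."""
    candidates = [
        row for row in rows
        if category is None or row["primary_category"] == category
    ]
    scored = [(cosine_similarity(query_vector, row["vector"]), row["id"]) for row in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:limit]]


# Toy rows; real rows carry the full enriched_paper_chunks fields.
ROWS = [
    {"id": "a_summary_0", "primary_category": "cs.LG", "vector": [1.0, 0.0, 0.0]},
    {"id": "b_summary_0", "primary_category": "cs.CL", "vector": [0.9, 0.1, 0.0]},
    {"id": "c_summary_0", "primary_category": "cs.LG", "vector": [0.0, 1.0, 0.0]},
]

all_hits = search([1.0, 0.0, 0.0], ROWS)
lg_hits = search([1.0, 0.0, 0.0], ROWS, category="cs.LG")
```

Filtering before ranking keeps the `limit` meaningful: the top results always come from the requested category.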
---
## Hierarchical Chunking Strategy (v2)
### Level 1: Summary (1 per paper)
- Abstract + first paragraph of introduction
- Entities extracted from abstract
- Always classified as `core` contribution
- Used for overview/discovery queries
### Level 2: Section (variable per paper)
- Split by semantic boundaries (tables, figures, subsections, bold headers)
- Each section chunk gets independent entity extraction
- Contribution type classified per chunk
- Results sections split by experiment/benchmark
- Methodology sections split by component/step
### Level 3: Atomic (variable per paper)
- Fine-grained facts: `"DeltaNet achieves 85.3% on GLUE"`
- Only generated from results/methodology sections
- Only when entities (models, benchmarks, metrics) are present
- Each fact links model + benchmark + metric
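
The atomic-level rules above can be sketched as a small guard function. This is an assumption-laden illustration: `AtomicFact` and `build_fact` are hypothetical names, and the rendered sentence assumes a percentage-style metric as in the example fact.

```python
from dataclasses import dataclass
from typing import Optional

ATOMIC_SECTIONS = {"results", "methodology"}  # sections that may yield atomic facts


@dataclass(frozen=True)
class AtomicFact:
    model: str
    benchmark: str
    metric: str
    value: float

    def text(self):
        # Mirrors the example fact format; assumes the value is a percentage.
        return f"{self.model} achieves {self.value}% on {self.benchmark}"


def build_fact(section_type, model, benchmark, metric, value) -> Optional[AtomicFact]:
    """Return a fact only when the section qualifies and all entities are present."""
    if section_type not in ATOMIC_SECTIONS:
        return None
    if not (model and benchmark and metric) or value is None:
        return None
    return AtomicFact(model, benchmark, metric, value)


fact = build_fact("results", "DeltaNet", "GLUE", "accuracy", 85.3)
skipped = build_fact("introduction", "DeltaNet", "GLUE", "accuracy", 85.3)
```

Keeping model, benchmark, and metric as separate fields is what lets tools such as `compare_models` filter facts without re-parsing the sentence.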
---
## Entity Extraction
### Approach: Hybrid (Rules + Patterns + Optional LLM)
1. **Known entities** — curated lists matched via compiled regex:
- 100+ benchmarks (GLUE, ImageNet, MMLU, ...)
- 150+ models (BERT, GPT-4, LLaMA, ...)
- 100+ techniques (attention, MoE, LoRA, ...)
2. **Metrics** — regex patterns for 20+ metric types:
- accuracy, F1, BLEU, ROUGE, perplexity, mAP, WER, FLOPs, etc.
3. **Contribution classification** — rule-based (optional LLM):
- `core`: "we propose", "our method", "novel", methodology sections
- `comparison`: "compared to", "outperforms", "baseline"
- `baseline`: baseline references
- `citation`: related work, default
4. **Key result detection**:
- "state-of-the-art", "SOTA", "new record"
- Chunks with metrics + core/comparison type
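
The rule-based path can be sketched with the standard `re` module. Everything below is illustrative: the entity lists are tiny samples of the curated lists, the metric pattern covers one phrasing only, the cue precedence (core before comparison) is an assumption, and the `baseline` class is omitted because the spec gives no distinguishing cue for it.

```python
import re

KNOWN_MODELS = ["BERT", "GPT-4", "LLaMA"]
KNOWN_BENCHMARKS = ["GLUE", "ImageNet", "MMLU"]
KNOWN_TECHNIQUES = ["attention", "MoE", "LoRA"]


def compile_names(names):
    """Compile one regex matching any name as a whole token (longest first)."""
    ordered = sorted(names, key=len, reverse=True)
    body = "|".join(re.escape(name) for name in ordered)
    return re.compile(rf"(?<!\w)(?:{body})(?!\w)")


MODEL_RE = compile_names(KNOWN_MODELS)
BENCHMARK_RE = compile_names(KNOWN_BENCHMARKS)
TECHNIQUE_RE = compile_names(KNOWN_TECHNIQUES)
# Matches phrasings such as "accuracy of 85.3" or "F1: 91.2".
METRIC_RE = re.compile(
    r"(?P<name>accuracy|F1|BLEU|perplexity)\s*(?:of|:|=)?\s*(?P<value>\d+(?:\.\d+)?)",
    re.IGNORECASE,
)

CORE_CUES = ("we propose", "our method", "novel")
COMPARISON_CUES = ("compared to", "outperforms", "baseline")
KEY_RESULT_CUES = ("state-of-the-art", "sota", "new record")


def classify_contribution(text):
    lowered = text.lower()
    if any(cue in lowered for cue in CORE_CUES):
        return "core"
    if any(cue in lowered for cue in COMPARISON_CUES):
        return "comparison"
    return "citation"  # documented default


def extract_entities(text):
    metrics = {m.group("name").lower(): float(m.group("value")) for m in METRIC_RE.finditer(text)}
    contribution = classify_contribution(text)
    has_cue = any(cue in text.lower() for cue in KEY_RESULT_CUES)
    return {
        "models": sorted(set(MODEL_RE.findall(text))),
        "benchmarks": sorted(set(BENCHMARK_RE.findall(text))),
        "techniques": sorted(set(TECHNIQUE_RE.findall(text))),
        "metrics": metrics,
        "contribution_type": contribution,
        "is_key_result": has_cue or (bool(metrics) and contribution in ("core", "comparison")),
    }


entities = extract_entities(
    "We propose DeltaNet, which outperforms BERT on GLUE with accuracy of 85.3 using attention."
)
```

The lookarounds keep `GLUE` from matching inside a longer name such as `SuperGLUE`, which matters once the lists grow to hundreds of entries.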
---
## CLI Commands
| Command | Description |
|---------|-------------|
| `arxiv-kb sync [--days N] [--incremental]` | Fetch papers from arXiv |
| `arxiv-kb extract [PAPER_ID] [--all] [--limit N]` | Extract text from PDFs |
| `arxiv-kb embed [PAPER_ID] [--all] [--limit N]` | Generate embeddings |
| `arxiv-kb search QUERY [--fulltext] [--cat cs.LG]` | Search papers |
| `arxiv-kb stats` | Show database statistics |
| `arxiv-kb paper PAPER_ID` | Show paper details |
| `arxiv-kb serve [--port 8000]` | Start FastAPI server |
| `arxiv-kb mcp` | Start MCP server |
| `arxiv-kb pipeline [--days 7] [--limit 100]` | Full sync→extract→embed |
---
## REST API Endpoints
### Search
| Method | Path | Description |
|--------|------|-------------|
| POST | `/search` | Semantic search (query, categories, date range, limit) |
| POST | `/search/hybrid` | Hybrid semantic + FTS (vector_weight: 0-1) |
| GET | `/search/fulltext?query=...` | Full-text search on titles/abstracts |
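
The spec does not say how `/search/hybrid` combines the two signals. One common choice, shown here purely as an assumption, is a linear blend in which `vector_weight` scales the semantic score and the remainder scales the full-text score. Both scores are assumed to be normalized to the range 0 to 1, and the helper names and the default of 0.5 are hypothetical.

```python
def hybrid_score(vector_score, fts_score, vector_weight=0.5):
    """Linear blend of a semantic score and a full-text score."""
    if not 0.0 <= vector_weight <= 1.0:
        raise ValueError("vector_weight must be between 0 and 1")
    return vector_weight * vector_score + (1.0 - vector_weight) * fts_score


def rank_hybrid(vector_scores, fts_scores, vector_weight=0.5):
    """Rank paper IDs by blended score; a missing score counts as 0."""
    paper_ids = set(vector_scores) | set(fts_scores)
    combined = {
        pid: hybrid_score(vector_scores.get(pid, 0.0), fts_scores.get(pid, 0.0), vector_weight)
        for pid in paper_ids
    }
    return sorted(combined, key=lambda pid: combined[pid], reverse=True)


# Toy scores keyed by hypothetical paper IDs.
vector_scores = {"p1": 0.9, "p2": 0.2}
fts_scores = {"p2": 1.0, "p3": 0.5}

semantic_only = rank_hybrid(vector_scores, fts_scores, vector_weight=1.0)
keyword_only = rank_hybrid(vector_scores, fts_scores, vector_weight=0.0)
```

At the extremes the blend reduces to pure semantic search (`vector_weight=1`) or pure full-text search (`vector_weight=0`).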
### Papers
| Method | Path | Description |
|--------|------|-------------|
| GET | `/papers/{id}` | Paper metadata |
| GET | `/papers/{id}/chunks` | All text chunks |
| GET | `/papers/{id}/markdown` | Extracted markdown text |
| GET | `/papers/{id}/pdf` | Download PDF file |
| GET | `/papers/{id}/similar?limit=10` | Find similar papers |
| POST | `/papers/filter` | Filter by categories, authors, dates |
### Browse
| Method | Path | Description |
|--------|------|-------------|
| GET | `/recent?days=7&categories=cs.LG` | Recent papers |
| GET | `/authors/{name}` | Papers by author |
### Metadata
| Method | Path | Description |
|--------|------|-------------|
| GET | `/` | API info |
| GET | `/health` | Health check |
| GET | `/stats` | Database statistics |
---
## MCP Tools (Claude Integration)
### Legacy Tools (v1)
| Tool | Args | Description |
|------|------|-------------|
| `search_papers` | query, categories?, limit | Semantic search |
| `get_paper_details` | paper_id, include_full_text? | Paper metadata + optional text |
| `find_similar_papers` | paper_id, limit | Similar papers |
| `search_by_author` | author_name, limit | Papers by author |
| `get_recent_papers` | days, categories?, limit | Recent papers |
| `fulltext_search` | query, limit | Keyword-based search |
| `get_paper_chunks` | paper_id | All chunks for a paper |
| `get_knowledge_base_stats` | — | DB statistics |
### Enriched Tools (v2)
| Tool | Args | Description |
|------|------|-------------|
| `search_model_results` | model_name, benchmark?, include_baselines | Find benchmark results for a model |
| `find_innovations` | topic, techniques?, days, limit | Core contributions on a topic |
| `compare_models` | models (comma-sep), benchmark? | Compare model performance |
| `search_by_technique` | technique, contribution_type, limit | Papers by technique |
| `get_key_results` | query, limit | SOTA / significant results |
| `get_paper_entities` | paper_id | Aggregated entities for a paper |
| `advanced_search` | query, models?, benchmarks?, techniques?, contribution_type?, chunk_level? | Multi-filter search |
---
## Migration (v1 → v2)
```bash
python scripts/migrate_to_v2.py status # Check progress
python scripts/migrate_to_v2.py init # Create v2 tables
python scripts/migrate_to_v2.py migrate # Migrate all papers
python scripts/migrate_to_v2.py migrate --limit 100
python scripts/migrate_to_v2.py migrate --paper-id 2401.12345
python scripts/migrate_to_v2.py migrate --no-skip # Re-migrate existing
```
The migration re-reads the existing markdown files (no re-extraction needed), applies hierarchical chunking and entity extraction, generates new embeddings, and stores the results in both the SQLite `chunks_v2` table and the LanceDB `enriched_paper_chunks` table.
---
## Dependencies
```toml
[project]
dependencies = [
    "arxiv>=2.1.0",
    "httpx>=0.27.0",
    "aiofiles>=24.1.0",
    "marker-pdf>=1.0.0",
    "lancedb>=0.10.0",
    "sentence-transformers>=3.0.0",
    "fastapi>=0.115.0",
    "uvicorn>=0.30.0",
    "fastmcp>=0.3.0",
    "apscheduler>=3.10.0",
    "typer>=0.12.0",
    "rich>=13.0.0",
    "pydantic-settings>=2.0.0",
    "python-dotenv>=1.0.0",
]
```
---
## File Storage Layout
```
data/
├── arxiv.db                          # SQLite (82.8 MB)
│   ├── papers (9,500 rows)
│   ├── chunks (12,443 rows)
│   ├── chunks_v2 (1,035 rows)
│   ├── papers_fts (FTS5 index)
│   └── sync_state (1 row)
├── lancedb/
│   ├── paper_chunks/                 # v1 embeddings (521 vectors)
│   └── enriched_paper_chunks/        # v2 embeddings (966 vectors)
├── pdfs/{YYYY-MM}/{paper_id}.pdf     # Original PDFs
└── markdown/{YYYY-MM}/
    ├── {paper_id}.md                 # Extracted text
    └── {paper_id}.json               # Section metadata
```
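
The layout implies a deterministic mapping from a paper to its three files. The sketch below assumes the `{YYYY-MM}` folder comes from the paper's `published_date`; the spec does not state this, and the folder could equally be derived from the arXiv ID. `paper_paths` is a hypothetical helper.

```python
from datetime import date
from pathlib import Path


def paper_paths(data_dir, paper_id, published_date):
    """Return the PDF, markdown, and metadata paths for one paper."""
    # Assumption: the month folder is taken from the ISO published_date.
    month = date.fromisoformat(published_date[:10]).strftime("%Y-%m")
    data_dir = Path(data_dir)
    return {
        "pdf": data_dir / "pdfs" / month / f"{paper_id}.pdf",
        "markdown": data_dir / "markdown" / month / f"{paper_id}.md",
        "metadata": data_dir / "markdown" / month / f"{paper_id}.json",
    }


paths = paper_paths("data", "2401.12345", "2024-01-15T10:00:00Z")
```

Slicing the first ten characters keeps the helper tolerant of ISO timestamps that carry a time or timezone suffix.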