# Wiki Search - Technical Documentation
## Project Overview
Wiki Search is a **semantic search engine** for Wikipedia articles built on modern NLP embeddings and vector similarity search. It enables users to find relevant articles based on semantic meaning rather than keyword matching.
### What It Does
The system ingests Wikipedia articles, breaks them into chunks, converts chunks into semantic embeddings (numerical representations), indexes them for fast retrieval, and provides an interactive search interface that ranks results by semantic similarity.
### Why Semantic Search?
Traditional keyword search fails when:
- Users search with different vocabulary than article content
- Multiple terms refer to the same concept (synonyms)
- Context matters more than exact wording
Semantic search solves this by understanding the **meaning** of text, not just keywords.
---
## Architecture
### System Architecture Diagram
```
┌─────────────────────────────────┐
│ Wikipedia Dataset               │
│ (10,000 article sample via      │
│ Hugging Face Datasets)          │
└────────────────┬────────────────┘
                 ↓
        [load_articles.py]
                 ↓
┌─────────────────────────────────┐
│ SQLite Database (wiki.db)       │
│ ┌─────────────┐ ┌──────────┐    │
│ │ articles    │ │ chunks   │    │
│ │ table       │ │ table    │    │
│ └─────────────┘ └──────────┘    │
└─────────────────────────────────┘
                 ↓
        [chunk_articles.py]
                 ↓
┌─────────────────────────────────┐
│ Chunked Articles in DB          │
│ (50-250 words per chunk)        │
└─────────────────────────────────┘
                 ↓
         [embed_chunks.py]
                 ↓
┌─────────────────────────────────┐
│ Embeddings Generated            │
│ (384-dim vectors)               │
│ - embeddings.npy                │
│ - chunk_ids.npy                 │
└─────────────────────────────────┘
                 ↓
         [build_faiss.py]
                 ↓
┌─────────────────────────────────┐
│ FAISS Vector Index              │
│ (faiss.index)                   │
│ IndexFlatIP, L2-normalized      │
└─────────────────────────────────┘
                 ↓
            [search.py]
                 ↓
┌─────────────────────────────────┐
│ Interactive Search Interface    │
│ Returns ranked results by       │
│ semantic similarity             │
└─────────────────────────────────┘
```
### Pipeline Stages
#### Stage 1: Data Ingestion ([load_articles.py](data/load_articles.py))
**Purpose:** Load Wikipedia articles into persistent storage
**Process:**
1. Fetch Wikipedia dataset: `wikipedia-20220301-en-sample-10k`
2. Extract metadata: `id`, `title`, `url`, `text`
3. Insert into SQLite `articles` table
4. Enable fast lookups by article ID
**Key Parameters:**
- Dataset: Hugging Face `kaitchup/wikipedia-20220301-en-sample-10k`
- Size: 10,000 articles
- Database: SQLite (file-based, serverless)
**Database Schema:**
```sql
CREATE TABLE articles (
id TEXT PRIMARY KEY,
title TEXT,
url TEXT,
text TEXT
);
```
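As a rough illustration of the insert step, the sketch below writes two made-up records into an in-memory SQLite database using the schema above. The sample rows and the `store_articles` helper are hypothetical and not taken from `load_articles.py`; the real script reads its records from the Hugging Face dataset.

```python
import sqlite3

# Hypothetical sample rows standing in for records from the dataset.
sample_articles = [
    {"id": "1", "title": "Example Article One",
     "url": "https://example.org/one", "text": "First example body."},
    {"id": "2", "title": "Example Article Two",
     "url": "https://example.org/two", "text": "Second example body."},
]


def store_articles(conn, articles):
    """Create the articles table if needed and insert the given rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "id TEXT PRIMARY KEY, title TEXT, url TEXT, text TEXT)"
    )
    # INSERT OR REPLACE keeps the load idempotent if the script is re-run.
    conn.executemany(
        "INSERT OR REPLACE INTO articles (id, title, url, text) "
        "VALUES (:id, :title, :url, :text)",
        articles,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # the project uses a file-based database
store_articles(conn, sample_articles)
article_count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
print(article_count)
```

Using `executemany` with named placeholders keeps the insert to a single prepared statement, which matters more as the article count grows.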
---
#### Stage 2: Chunking ([chunk_articles.py](indexing/chunk_articles.py))
**Purpose:** Break long articles into semantic chunks for embedding
**Problem Solved:**
- Full articles are too long (~5KB average) for effective embeddings
- Embeddings work best on sentences/paragraphs (50-250 words)
- Finer granularity = better search precision
**Algorithm:**
```
For each article:
1. Normalize text:
- Convert \r to \n (newline normalization)
- Collapse multiple spaces/tabs to single space
2. Smart splitting:
- First try: split by paragraph breaks (\n)
- If too few parts: split by sentence boundaries ([.!?])
3. Chunk building:
- Use sliding window approach
- Constraints:
* Max: 250 words per chunk
* Min: 50 words per chunk
- Skip chunks < 50 words
```
**Configuration:**
```python
MAX_WORDS = 250 # Maximum chunk size
MIN_WORDS = 50 # Minimum chunk size (skip smaller)
```
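The chunking steps can be sketched in plain Python as follows. This is a simplified greedy packer, not the code in `chunk_articles.py`: it has no window overlap, the "too few parts" test (fewer than two paragraphs) is an assumption, and the function names are illustrative.

```python
import re

MAX_WORDS = 250
MIN_WORDS = 50


def normalize(text):
    """Unify newlines and collapse runs of spaces/tabs."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return re.sub(r"[ \t]+", " ", text)


def split_parts(text):
    """Split on paragraph breaks, falling back to sentence boundaries."""
    parts = [p.strip() for p in text.split("\n") if p.strip()]
    if len(parts) < 2:  # assumption: "too few parts" means a single paragraph
        parts = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return parts


def chunk_text(text, max_words=MAX_WORDS, min_words=MIN_WORDS):
    """Pack parts into chunks of at most max_words; drop chunks below min_words."""
    chunks, current = [], []
    for part in split_parts(normalize(text)):
        words = part.split()
        # Close the current chunk if this part would push it over the limit.
        if current and len(current) + len(words) > max_words:
            if len(current) >= min_words:
                chunks.append(" ".join(current))
            current = []
        current.extend(words)
        # A single oversized part is cut into max_words-sized slices.
        while len(current) > max_words:
            chunks.append(" ".join(current[:max_words]))
            current = current[max_words:]
    if len(current) >= min_words:
        chunks.append(" ".join(current))
    return chunks


# Ten 60-word paragraphs pack into chunks of 240, 240 and 120 words.
demo = "\n".join([" ".join(["word"] * 60)] * 10)
print([len(c.split()) for c in chunk_text(demo)])  # [240, 240, 120]
```

Note that a trailing fragment shorter than `MIN_WORDS` is discarded, so very short articles produce no chunks at all.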
**Database Table Created:**
```sql
CREATE TABLE chunks (
chunk_id INTEGER PRIMARY KEY AUTOINCREMENT,
article_id TEXT,
content TEXT, -- Fixed from "contect" typo
FOREIGN KEY(article_id) REFERENCES articles(id)
);
```
**Example Output:**
- Input: 1 article (2000 words)
- Output: 12-15 chunks (average 150 words each)
---
#### Stage 3: Embedding ([embed_chunks.py](indexing/embed_chunks.py))
**Purpose:** Convert text chunks into dense vector representations
**Model:** `sentence-transformers/all-MiniLM-L6-v2`
**Specifications:**
- **Architecture:** MiniLM (lightweight BERT variant)
- **Output Dimension:** 384 dimensions
- **Training:** Fine-tuned on semantic similarity tasks
- **Inference:** Fast (~2000 chunks/minute on CPU)
**Process:**
```
For each chunk:
1. Combine: "{article_title}. {chunk_content}"
(Title context improves embeddings)
2. Encode: Pass through sentence-transformer model
Output: 384-dimensional vector
3. Normalize: Apply L2 normalization
(Makes vectors magnitude-invariant)
4. Store:
- embeddings.npy: numpy matrix (N × 384)
- chunk_ids.npy: chunk ID mappings
```
**Batch Processing:**
```python
BATCH_SIZE = 64 # Process 64 chunks at a time
# Speeds up GPU/multi-core processing
# Reduces memory overhead vs. processing one-by-one
```
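The input preparation and batching can be sketched without loading the model. The helpers `build_inputs` and `batched` below are hypothetical names; in the real script each batch would presumably be handed to the sentence-transformer's `encode` call.

```python
BATCH_SIZE = 64


def build_inputs(rows):
    """rows: iterable of (chunk_id, article_title, chunk_content) tuples."""
    ids, texts = [], []
    for chunk_id, title, content in rows:
        ids.append(chunk_id)
        # Prefix the title so each chunk carries its article context.
        texts.append(f"{title}. {content}")
    return ids, texts


def batched(items, batch_size=BATCH_SIZE):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical rows: 150 chunks from one article.
rows = [(i, "Alan Turing", f"chunk text {i}") for i in range(150)]
ids, texts = build_inputs(rows)
print(texts[0])                             # Alan Turing. chunk text 0
print([len(b) for b in batched(texts)])     # [64, 64, 22]
```

Keeping `ids` in the same order as `texts` is what lets `chunk_ids.npy` map row positions in `embeddings.npy` back to `chunk_id` values.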
**Output Files:**
- `embeddings.npy`: Shape (N_chunks, 384) - float32
- `chunk_ids.npy`: Shape (N_chunks,) - chunk identifiers
**Performance:**
- 10,000 articles → ~100,000 chunks
- Embeddings file: ~150 MB (100K × 384 × 4 bytes)
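
The size estimate is plain arithmetic and can be reproduced for other chunk counts or dimensions. The chunk count of 100,000 is the approximate figure quoted above, not a measured value.

```python
N_CHUNKS = 100_000        # approximate chunk count
DIM = 384                 # all-MiniLM-L6-v2 output dimension
BYTES_PER_FLOAT32 = 4

size_bytes = N_CHUNKS * DIM * BYTES_PER_FLOAT32
size_mb = size_bytes / 1_000_000
size_mib = size_bytes / (1024 ** 2)

print(f"{size_bytes:,} bytes = {size_mb:.1f} MB ({size_mib:.1f} MiB)")
# 153,600,000 bytes = 153.6 MB (146.5 MiB)
```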
---
#### Stage 4: Indexing ([build_faiss.py](indexing/build_faiss.py))
**Purpose:** Create fast similarity search index for embeddings
**FAISS (Facebook AI Similarity Search):**
- Open-source library for vector similarity search (a library, not a database server)
- Optimized for billion-scale searches
- Multiple index types available
**Index Configuration:**
```python
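import faiss
import numpy as np

# Load the matrix written by embed_chunks.py; FAISS expects float32.
embeddings = np.load("data/db/embeddings.npy").astype("float32")
dim = embeddings.shape[1]  # 384 for all-MiniLM-L6-v2

# Normalize rows in place BEFORE adding them to the index:
# for unit-length vectors, inner product = cosine similarity.
# Cosine scores lie in [-1, 1], not [0, 1].
faiss.normalize_L2(embeddings)

# IndexFlatIP = flat inner-product index: exact search, no quantization.
# A reasonable default for datasets under roughly 1M vectors.
index = faiss.IndexFlatIP(dim)
index.add(embeddings)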
```
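The relationship between inner product and cosine similarity can be checked with a few lines of plain Python. This is a standalone illustration using toy 2-D vectors, not project code; the function names are illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine computed directly from the definition."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


a, b = [3.0, 4.0], [4.0, 3.0]
opposite = [-3.0, -4.0]

# After normalization, the plain inner product matches the cosine.
same = inner_product(l2_normalize(a), l2_normalize(b))
print(round(same, 6), round(cosine_similarity(a, b), 6))  # 0.96 0.96

# Vectors pointing in opposite directions score -1, so the range is [-1, 1].
print(round(inner_product(l2_normalize(a), l2_normalize(opposite)), 6))  # -1.0
```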
**Index Properties:**
- **Search Complexity:** O(N) - linear scan (acceptable for 100K vectors)
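- **Memory:** A flat index stores every vector uncompressed, so the index is about the size of the embedding matrix (~150 MB for 100K × 384 float32 vectors)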
- **Similarity Metric:** Cosine similarity via dot product
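
To make the O(N) linear scan concrete, here is a pure-Python stand-in for what an exact inner-product search does conceptually: score every stored vector against the query and keep the best K. It is an illustration only; FAISS performs the same exhaustive comparison with optimized native code, and `flat_ip_search` is a hypothetical name.

```python
import heapq


def flat_ip_search(index_vectors, query, k):
    """Return the k best (score, position) pairs by inner product."""
    scored = (
        (sum(q * v for q, v in zip(query, vec)), position)
        for position, vec in enumerate(index_vectors)
    )
    # nlargest touches each of the N scores once: a linear scan.
    return heapq.nlargest(k, scored)


# Toy unit-length vectors standing in for normalized embeddings.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
print(flat_ip_search(vectors, [1.0, 0.0], k=2))  # [(1.0, 0), (0.6, 2)]
```

The returned positions play the role of row indices into `chunk_ids.npy`, which then map to `chunk_id` values in the database.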
**Save/Load:**
```python
faiss.write_index(index, "data/db/faiss.index")
index = faiss.read_index("data/db/faiss.index")
```
---
#### Stage 5: Search ([search.py](search/search.py))
**Purpose:** Interactive semantic search interface
**Algorithm:**
```
User Query
↓
1. Load FAISS index, embeddings, chunk_ids, database
2. Encode query using same model (all-MiniLM-L6-v2)
3. Normalize query vector (L2)
4. Search index for top-K nearest neighbors (K=8)
5. Filter: score >= MIN_SCORE (0.25)
6. Retrieve articles from database using chunk IDs
7. Deduplicate: show one result per article
8. Display results with scores and previews
```
**Configuration:**
```python
TOP_K = 8 # Retrieve 8 nearest neighbors
MIN_SCORE = 0.25 # Minimum similarity (relevance threshold)
```
**Result Deduplication:**
```python
seen_articles = set()
for title in results:
if title not in seen_articles:
seen_articles.add(title)
display_result(title)
```
This ensures the user does not see the same article more than once when several of its chunks match the query.
**Output Format:**
```
Article Title
Score: 0.543
Preview text (first 350 chars)...
```
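Steps 5 to 8 of the search algorithm (filter, deduplicate, preview, display) can be sketched as below. The `postprocess` and `format_result` helpers and the sample hits are hypothetical; the sketch assumes hits arrive sorted by descending score, as a nearest-neighbor search returns them.

```python
MIN_SCORE = 0.25
PREVIEW_CHARS = 350


def postprocess(hits, min_score=MIN_SCORE):
    """hits: list of (score, title, content), sorted by descending score."""
    seen_articles, results = set(), []
    for score, title, content in hits:
        if score < min_score:
            continue  # below the relevance threshold
        if title in seen_articles:
            continue  # keep only the best-scoring chunk per article
        seen_articles.add(title)
        results.append((title, score, content[:PREVIEW_CHARS]))
    return results


def format_result(title, score, preview):
    return f"{title}\nScore: {score:.3f}\n{preview}..."


# Hypothetical hits for illustration.
hits = [
    (0.543, "Alan Turing", "Turing was a mathematician " * 20),
    (0.501, "Alan Turing", "A second chunk from the same article."),
    (0.310, "Enigma machine", "The Enigma machine is a cipher device."),
    (0.120, "Unrelated", "Filtered out by MIN_SCORE."),
]
for title, score, preview in postprocess(hits):
    print(format_result(title, score, preview))
    print()
```

Because the hits are already ordered by score, the first chunk seen for each article is also its best match, so no extra sorting is needed.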
---
## Database Schema
### articles Table
```sql
CREATE TABLE articles (
id TEXT PRIMARY KEY, -- Wikipedia article ID
title TEXT, -- Article title
url TEXT, -- Wikipedia URL
text TEXT -- Full article text
);
```
**Indexes:** Primary key on `id` for fast lookups
### chunks Table
```sql
CREATE TABLE chunks (
chunk_id INTEGER PRIMARY KEY AUTOINCREMENT, -- Unique identifier
article_id TEXT, -- Foreign key
content TEXT, -- Chunk text (50-250 words)
FOREIGN KEY(article_id) REFERENCES articles(id)
);
```
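**Indexes:**
- `chunk_id`: `INTEGER PRIMARY KEY` is an alias for SQLite's rowid, so lookups by `chunk_id` need no separate index
- `article_id`: SQLite does not index foreign-key columns automatically; create an explicit index on `chunks(article_id)` if joins or per-article lookups become slow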
**Sample Query:**
```sql
SELECT a.title, c.content
FROM chunks c
JOIN articles a ON c.article_id = a.id
WHERE c.chunk_id = 42;
```
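The sample query can be run end to end against an in-memory database, as in the sketch below. The inserted rows and the index name `idx_chunks_article_id` are hypothetical; the explicit `CREATE INDEX` is included because SQLite does not create an index on a foreign-key column by itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE articles (
        id TEXT PRIMARY KEY, title TEXT, url TEXT, text TEXT
    );
    CREATE TABLE chunks (
        chunk_id INTEGER PRIMARY KEY AUTOINCREMENT,
        article_id TEXT,
        content TEXT,
        FOREIGN KEY(article_id) REFERENCES articles(id)
    );
    -- Explicit index: foreign keys are not indexed automatically in SQLite.
    CREATE INDEX IF NOT EXISTS idx_chunks_article_id ON chunks(article_id);
    """
)

# Hypothetical sample data.
conn.execute(
    "INSERT INTO articles VALUES (?, ?, ?, ?)",
    ("a1", "Alan Turing", "https://example.org/turing", "Full text..."),
)
conn.execute(
    "INSERT INTO chunks (article_id, content) VALUES (?, ?)",
    ("a1", "Turing proposed the imitation game."),
)

row = conn.execute(
    "SELECT a.title, c.content FROM chunks c "
    "JOIN articles a ON c.article_id = a.id WHERE c.chunk_id = ?",
    (1,),  # the first AUTOINCREMENT value is 1
).fetchone()
index_names = [r[1] for r in conn.execute("PRAGMA index_list('chunks')")]
print(row)
print(index_names)
```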
---
## Technologies & Dependencies
### Core Libraries
| Technology | Version | Purpose | Usage |
|-----------|---------|---------|-------|
| **Python** | 3.8+ | Language | Runtime |
| **SQLite3** | Built-in | Database | Schema, CRUD operations |
| **NumPy** | Latest | Arrays | Matrix operations on embeddings |
| **sentence-transformers** | Latest | Embeddings | Generate semantic vectors |
| **FAISS** | `faiss-cpu` | Indexing | Vector similarity search |
| **Hugging Face Datasets** | Latest | Data source | Load Wikipedia data |
### Optional Dependencies
- `faiss-gpu`: GPU-accelerated FAISS (for large-scale indexing)
### Installation
```bash
pip install sentence-transformers faiss-cpu numpy datasets
```
---
## File Descriptions
### `/data` - Data Management Layer
| File | Purpose | Key Operations |
|------|---------|-----------------|
| `init_db.py` | Initialize schema | `CREATE TABLE articles, chunks` |
| `load_articles.py` | Load data | `INSERT INTO articles` |
| `check_db.py` | Verify data | `SELECT COUNT(*), sample rows` |
| `check_chunks.py` | Verify chunking | Display sample chunks with titles |
| `check_counts.py` | Statistics | Show article and chunk counts |
### `/indexing` - Processing Pipeline
| File | Purpose | Key Operations |
|------|---------|-----------------|
| `chunk_articles.py` | Create chunks | Read articles, split, INSERT chunks |
| `embed_chunks.py` | Generate embeddings | Load chunks, encode, save .npy files |
| `build_faiss.py` | Build index | Load embeddings, create FAISS, save |
### `/search` - Search Interface
| File | Purpose | Key Operations |
|------|---------|-----------------|
| `search.py` | Interactive search | Load index, encode query, search, display results |
### `/data/db` - Artifacts
| File | Purpose | Format | Size |
|------|---------|--------|------|
| `wiki.db` | SQLite database | SQLite3 | ~50-100 MB |
| `faiss.index` | Vector index | FAISS | ~150 MB (flat index stores all vectors) |
| `embeddings.npy` | Embedding matrix | NumPy (float32) | ~150 MB |
| `chunk_ids.npy` | ID mapping | NumPy (int64) | ~1 MB |
---
## Configuration & Parameters
### Chunking Parameters
```python
MAX_WORDS = 250 # Maximum words in a chunk
MIN_WORDS = 50 # Minimum words (skip smaller)
```
**Effect:** Tuning these affects:
- **Higher MAX_WORDS** → Fewer chunks, less granular search
- **Lower MIN_WORDS** → More chunks, higher search granularity
- **Trade-off:** More chunks = slower indexing but better precision
### Embedding Parameters
```python
BATCH_SIZE = 64 # Chunks processed per batch
MODEL = "all-MiniLM-L6-v2" # Sentence-transformer model
```
**Effect:**
- **Higher BATCH_SIZE** → Faster but more memory
- **Model choice** → Different embedding quality/speed tradeoffs
### Search Parameters
```python
TOP_K = 8 # Number of results to retrieve
MIN_SCORE = 0.25 # Cosine similarity threshold (scores range from -1 to 1)
```
**Effect:**
- **TOP_K**: More results = more computation but better coverage
- **MIN_SCORE**: Higher = stricter filtering (fewer results)
---
## Execution Flow & Commands
### Complete Pipeline
```bash
# Step 1: Initialize database
python data/init_db.py
# Output: "Database initialized at data/db/wiki.db"
# Step 2: Load Wikipedia articles
python data/load_articles.py
# Output: "Loading 10000 articles...", "Articles stored in sqlite"
# Step 3: Chunk articles
python indexing/chunk_articles.py
# Output: "Found X articles", "Processed X articles → Y chunks"
# Step 4: Generate embeddings
python indexing/embed_chunks.py
# Output: "Loading embedding model...", "Embedding X chunks...", progr... bar
# Step 5: Build index
python indexing/build_faiss.py
# Output: "Loading embeddings...", "Building FAISS index..."
# "FAISS index built with X vectors"
# Step 6: Search interface
python search/search.py
# Interactive loop: prompts for queries, displays results
```
### Verification Commands
```bash
# Check database setup
python data/check_db.py
# View sample chunks
python data/check_chunks.py
# Get statistics
python data/check_counts.py
```
---
## Performance Characteristics
### Processing Speed
| Stage | Input | Output | Time (approx) |
|-------|-------|--------|--------------|
| Load Articles | Dataset API | 10K articles | 30-60 sec |
| Chunk Articles | 10K articles | 100K chunks | 15-30 sec |
| Embed Chunks | 100K chunks | Embeddings | 10-20 min (CPU) |
| Build Index | Embeddings | FAISS index | 5-10 sec |
**Hardware Assumptions:** CPU-based (no GPU acceleration)
### Search Speed
- **Query encoding:** ~1-2 ms
- **FAISS search (K=8):** ~5-10 ms
- **Database lookups:** ~10-20 ms
- **Total per search:** ~20-30 ms (interactive speed)
### Memory Usage
- **Model in memory:** ~50 MB (all-MiniLM-L6-v2)
- **FAISS index:** ~200-300 MB (for 100K embeddings)
- **Database file:** ~100 MB
- **Total:** ~400-500 MB
---
## Design Decisions
### 1. Why SQLite?
- ✅ Serverless, file-based
- ✅ No external dependencies
- ✅ Fast enough for 10K articles
- ✅ Easy to inspect with any SQL tool
- ❌ Not suitable for > 1M chunks (would need PostgreSQL)
### 2. Why FAISS?
- ✅ Production-grade, battle-tested at Meta
- ✅ Fast exact search for our scale
- ✅ Supports GPU acceleration if needed
- ✅ Industry standard
- ❌ Overkill for tiny datasets, adds complexity
### 3. Why all-MiniLM-L6-v2?
- ✅ Lightweight (22M parameters)
- ✅ Fast inference (~2000 chunks/min CPU)
- ✅ Good quality embeddings
- ✅ Only 384 dimensions (vs. 768 for base models)
- ❌ Less accurate than larger models (e.g., all-mpnet-base-v2)
### 4. Why Chunking?
- ✅ Improves search precision (semantic coherence)
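- ✅ Keeps inputs near the embedding model's input limit (all-MiniLM-L6-v2 truncates text beyond 256 word pieces by default, so chunks close to 250 words may be partly cut off)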
- ✅ Enables fine-grained search results
- ❌ Adds processing complexity
### 5. Why L2 Normalization?
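- ✅ Makes the inner product equal to cosine similarity
- ✅ Bounds scores to the range -1 to 1, so a fixed threshold such as `MIN_SCORE` is meaningful
- ✅ Needed for `IndexFlatIP` to rank by cosine similarity (the index itself also accepts unnormalized vectors)
- ❌ Adds a one-time normalization pass and discards vector magnitude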
---
## Potential Improvements
### Short-term
1. **Create requirements.txt** for dependency pinning
2. **Fix database schema typo** (`contect` → `content`)
3. **Add error handling** for missing files/database
4. **Implement config file** (YAML/JSON) for parameters
### Medium-term
1. **Support larger datasets** (1M+ articles → PostgreSQL + Milvus)
2. **Add result filtering** (by date, source, category)
3. **Implement caching** for popular queries
4. **Add reranking** (use BERT-large for final ranking)
### Long-term
1. **Web interface** (FastAPI + React frontend)
2. **Real-time indexing** (stream new articles)
3. **Multi-language support** (multilingual embeddings)
4. **Question-answering** (RAG with LLM)
5. **User feedback loop** (learn from click-through)
---
## Troubleshooting
### Database Issues
```python
# If "contect" column error appears:
# Solution: Fix schema in init_db.py or recreate database
```
### Missing Embeddings
```python
# If "embeddings.npy not found":
# Re-run: python indexing/embed_chunks.py
```
### FAISS Index Corrupted
```python
# If index read fails:
# Re-run: python indexing/build_faiss.py
```
### Out of Memory
```python
# If OOM during embedding:
# Reduce BATCH_SIZE in embed_chunks.py
BATCH_SIZE = 32 # Instead of 64
```
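
### Checking for Missing Artifacts

A small pre-flight check can report which pipeline outputs are absent and which step regenerates them. This is a sketch only: the `missing_artifacts` helper is hypothetical, and it assumes all four artifacts live in `data/db` as listed in the file descriptions.

```python
from pathlib import Path

# Artifact name -> command that (re)creates it, per the pipeline steps.
REQUIRED_ARTIFACTS = {
    "wiki.db": "python data/init_db.py && python data/load_articles.py",
    "embeddings.npy": "python indexing/embed_chunks.py",
    "chunk_ids.npy": "python indexing/embed_chunks.py",
    "faiss.index": "python indexing/build_faiss.py",
}


def missing_artifacts(db_dir="data/db"):
    """Return {artifact: rebuild command} for every file not found."""
    base = Path(db_dir)
    return {
        name: command
        for name, command in REQUIRED_ARTIFACTS.items()
        if not (base / name).is_file()
    }


for name, command in missing_artifacts().items():
    print(f"Missing {name}. Re-run: {command}")
```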
---
## References
- [FAISS Documentation](https://github.com/facebookresearch/faiss)
- [Sentence-Transformers](https://www.sbert.net/)
- [all-MiniLM-L6-v2 Model Card](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
- [Wikipedia Dataset](https://huggingface.co/datasets/kaitchup/wikipedia-20220301-en-sample-10k)
---
## Author Notes
This project demonstrates end-to-end semantic search engineering. Key learning points:
1. **Embeddings are just vectors** - understanding their properties is crucial
2. **Chunking strategy matters** - affects search quality significantly
3. **FAISS is powerful but simple** - don't over-engineer for small scales
4. **Semantic search > keyword search** - for most real-world use cases
---
**Last Updated:** February 2026
**Project Status:** Production-ready for up to 1M articles
**License:** MIT (if published)