# Retrieve: Implementation Plan
> **Full-stack RAG (Retrieval-Augmented Generation) Application**
> Node.js + React | OpenAI Embeddings + LLM | Vector Search
---
## Phase 1: Project Setup
### Step 1: Create Backend Project
```
backend/
├── src/
│   ├── routes/        # API endpoints
│   ├── services/      # Business logic
│   ├── ingestion/     # Document processing pipeline
│   ├── retrieval/     # Search & vector queries
│   └── rerank/        # Result reranking
├── uploads/           # Uploaded files storage
├── data/              # Vector store data
└── package.json
```
**Dependencies:** `openai`, `multer`, `pdf-parse`, `tesseract.js`, `sharp`, `xlsx`, `express`, `cors`
### Step 2: Create React App
**Pages:**
- **Upload**: Upload documents (PDF, images, text, Excel)
- **Search / Chat**: Query documents with text or images
- **Results Viewer**: View text + image results with source links
**Components:** `SearchBox`, `FileUploader`, `ResultCard`, `ImagePreview`
---
## Phase 2: Ingestion Pipeline
> Triggered when documents are uploaded.
### Step 3: Upload Document API
```
POST /api/upload → { file, metadata }
```
The backend stores the file and triggers asynchronous processing.
### Step 4: Detect File Type
| Input | Processing |
|-------------|-------------------------------|
| PDF | Extract text + page images |
| Image | Caption + OCR + description |
| TXT / DOC | Extract & clean text |
| Excel | Convert tables to text |
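
The routing in the table above could be sketched as a lookup from file extension to processing steps. The extension lists, step names, and the `detectFileType` helper are assumptions made for illustration; a real implementation would likely also check the MIME type reported by the upload.

```javascript
// Hypothetical step names, one list per row of the table above.
const PIPELINES = new Map([
  ["pdf", ["extract_text", "extract_page_images"]],
  ["image", ["caption", "ocr", "describe"]],
  ["text", ["extract_text", "clean_text"]],
  ["excel", ["tables_to_text"]],
]);

// Assumed extension lists; adjust to the formats the app accepts.
const EXTENSION_TO_TYPE = new Map([
  ["pdf", "pdf"],
  ["png", "image"], ["jpg", "image"], ["jpeg", "image"],
  ["txt", "text"], ["doc", "text"], ["docx", "text"],
  ["xls", "excel"], ["xlsx", "excel"],
]);

// Map a filename to its input type and the processing steps to run.
function detectFileType(filename) {
  const dot = filename.lastIndexOf(".");
  const ext = dot === -1 ? "" : filename.slice(dot + 1).toLowerCase();
  const type = EXTENSION_TO_TYPE.get(ext);
  if (!type) throw new Error(`Unsupported file type: ${filename}`);
  return { type, steps: PIPELINES.get(type) };
}
```

Rejecting unknown extensions up front keeps unsupported files out of the asynchronous pipeline.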
### Step 5: Extract Content
- **Text documents** → Clean → Semantic chunking → Add metadata
- **Images** → Generate: (1) caption, (2) detailed description, (3) OCR text
  - *Example:* `"Machine dashboard showing error spikes and temperature warning"`
- **PDF diagrams** → Extract: page image, figure caption, nearby text
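
As a rough illustration of the text path (clean, chunk, add metadata), the sketch below packs paragraphs into chunks of bounded size. Paragraph packing is only a simple stand-in for semantic chunking; the `chunkText` name, the `maxChars` default, and the `chunk_index` metadata key are assumptions.

```javascript
// Split text into paragraph-aligned chunks and attach basic metadata.
// Note: a single paragraph longer than maxChars is kept whole as one oversize chunk.
function chunkText(text, { sourceFile, maxChars = 1000 } = {}) {
  // Clean: normalize line endings, collapse whitespace, drop empty paragraphs.
  const paragraphs = text
    .replace(/\r\n/g, "\n")
    .split(/\n\s*\n/)
    .map((p) => p.replace(/\s+/g, " ").trim())
    .filter(Boolean);

  // Chunk: start a new chunk when adding the next paragraph would exceed maxChars.
  const chunks = [];
  let current = "";
  for (const para of paragraphs) {
    if (current && current.length + 1 + para.length > maxChars) {
      chunks.push(current);
      current = "";
    }
    current = current ? `${current}\n${para}` : para;
  }
  if (current) chunks.push(current);

  // Metadata: field names follow the vector store schema in Step 7.
  return chunks.map((content_text, index) => ({
    content_text,
    modality: "text",
    source_file: sourceFile,
    metadata: { chunk_index: index },
  }));
}
```

Keeping chunk boundaries on paragraph breaks avoids cutting sentences in half, at the cost of uneven chunk sizes.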
### Step 6: Create Embeddings (OpenAI)
```
model: text-embedding-3-large
```
Generate embeddings for every chunk, caption, and OCR result.
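
Since every chunk, caption, and OCR result needs an embedding, grouping inputs into batches reduces the number of API calls. The sketch below only builds the request payloads; the `buildEmbeddingRequests` helper and the batch size of 100 are assumptions, and each payload is shaped for the official Node SDK's `embeddings.create({ model, input })` call, which accepts an array of strings as `input`.

```javascript
const EMBEDDING_MODEL = "text-embedding-3-large";

// Group texts into request payloads of at most batchSize inputs each.
// batchSize = 100 is an arbitrary illustrative default, not a documented limit.
function buildEmbeddingRequests(texts, batchSize = 100) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error("batchSize must be a positive integer");
  }
  const requests = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    requests.push({
      model: EMBEDDING_MODEL,
      input: texts.slice(i, i + batchSize),
    });
  }
  return requests;
}
```

Each payload would then be sent with the API client, and the returned vectors matched back to their inputs by position within the batch.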
### Step 7: Store in Vector Database
| Field | Description |
|---------------|------------------------------------|
| `id` | Unique identifier |
| `embedding` | Vector from OpenAI |
| `content_text` | Original text content |
| `modality` | `text` / `image` / `table` |
| `source_file` | Original filename |
| `page_number` | Page (if applicable) |
| `image_url` | Path to extracted image |
| `metadata` | Additional info (date, tags, etc.) |
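
A record matching the fields above might be built and validated as in this sketch. The `makeRecord` helper and its validation rules are assumptions; the actual vector database may impose its own schema.

```javascript
const MODALITIES = new Set(["text", "image", "table"]);

// Build one vector store record, filling optional fields with defaults.
function makeRecord({
  id,
  embedding,
  content_text,
  modality,
  source_file,
  page_number = null, // only set for paged sources such as PDFs
  image_url = null,   // only set when an image was extracted
  metadata = {},
}) {
  if (!id) throw new Error("id is required");
  if (!Array.isArray(embedding) || embedding.length === 0) {
    throw new Error("embedding must be a non-empty array");
  }
  if (!MODALITIES.has(modality)) {
    throw new Error(`Unknown modality: ${modality}`);
  }
  return {
    id, embedding, content_text, modality,
    source_file, page_number, image_url, metadata,
  };
}
```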
---
## Phase 3: Retrieval Engine
> Called when a user performs a search.
### Step 8: Query API
```
POST /api/search → { query_text, image?, filters? }
```
### Step 9: Query Embedding
Convert user query → embedding vector.
### Step 10: Vector Search
Retrieve **top K = 20** results by cosine similarity.
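
For a small in-memory store, top-K retrieval by cosine similarity can be sketched as a brute-force scan. The `cosineSimilarity` and `topK` helpers are illustrative names; a dedicated vector database would normally replace the scan with an index.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("Vector lengths differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat zero vectors as unrelated
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every record against the query and keep the k best, highest first.
function topK(queryEmbedding, records, k = 20) {
  return records
    .map((record) => ({
      record,
      score: cosineSimilarity(queryEmbedding, record.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

The scan is linear in the number of stored records, which is usually acceptable for a prototype but not for a large corpus.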
### Step 11: Hybrid Search (Recommended)
Combine **vector similarity** + **keyword match** for improved accuracy.
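
One simple way to combine the two signals is a weighted sum, sketched below. The keyword score (fraction of query terms found in the text), the `alpha` weight of 0.7, and the `vector_score` field are all assumptions; rank-fusion methods or a BM25 index are common alternatives.

```javascript
// Lowercase alphanumeric tokens; a deliberately simple tokenizer.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Fraction of distinct query terms that appear in the text (0 to 1).
function keywordScore(query, text) {
  const queryTerms = new Set(tokenize(query));
  if (queryTerms.size === 0) return 0;
  const textTerms = new Set(tokenize(text));
  let matched = 0;
  for (const term of queryTerms) {
    if (textTerms.has(term)) matched++;
  }
  return matched / queryTerms.size;
}

// Blend vector similarity with keyword match; alpha weights the vector side.
function hybridRank(query, hits, alpha = 0.7) {
  return hits
    .map((hit) => {
      const keyword_score = keywordScore(query, hit.content_text);
      const hybrid_score =
        alpha * hit.vector_score + (1 - alpha) * keyword_score;
      return { ...hit, keyword_score, hybrid_score };
    })
    .sort((a, b) => b.hybrid_score - a.hybrid_score);
}
```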
---
## Phase 4: Reverse HyDE (Advanced, Optional)
For each retrieved result:
1. Get text representation
2. Ask LLM: *"What question does this content answer?"*
3. Compare generated question to user query
4. Re-rank results (or re-query vector DB with generated question)
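
Steps 3 and 4 could look like the sketch below, which assumes the LLM call in step 2 has already run and stored its output in a hypothetical `generated_question` field on each result. Token overlap (Jaccard similarity) is used here only as a stand-in for the comparison; comparing embeddings of the two questions would be a natural alternative.

```javascript
// Lowercase alphanumeric tokens; a deliberately simple tokenizer.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Jaccard similarity between two token lists (0 to 1).
function jaccard(tokensA, tokensB) {
  const setA = new Set(tokensA);
  const setB = new Set(tokensB);
  if (setA.size === 0 && setB.size === 0) return 0;
  let shared = 0;
  for (const token of setA) {
    if (setB.has(token)) shared++;
  }
  return shared / (setA.size + setB.size - shared);
}

// Re-rank results by how closely each generated question matches the user query.
function rerankByGeneratedQuestion(userQuery, results) {
  const queryTokens = tokenize(userQuery);
  return results
    .map((result) => ({
      ...result,
      question_match: jaccard(queryTokens, tokenize(result.generated_question)),
    }))
    .sort((a, b) => b.question_match - a.question_match);
}
```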
---
## Phase 5: Reranking
Use a stronger model to verify relevance:
- **Input:** user query + retrieved content
- **Output:** relevance score (0 to 1)
- Re-sort results by score
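
The re-sort step can be sketched with the relevance model injected as a function. `scoreFn` is a hypothetical synchronous stand-in for the stronger model; a real model call would be asynchronous and the wrapper would need to await it.

```javascript
// Score each result against the query, clamp to [0, 1], and sort descending.
function rerank(query, results, scoreFn) {
  return results
    .map((result) => {
      const raw = scoreFn(query, result.content_text);
      // Guard against out-of-range or non-numeric model output.
      const relevance = Number.isFinite(raw)
        ? Math.min(1, Math.max(0, raw))
        : 0;
      return { ...result, relevance };
    })
    .sort((a, b) => b.relevance - a.relevance);
}
```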
---
## Phase 6: Context Assembly
Prepare final context for LLM:
- Text chunks
- Image URLs + captions
- Table data
- Source references
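
A minimal sketch of the assembly step, assuming results carry the fields from the Step 7 schema. The `assembleContext` helper, the numbered `[n]` reference format, and the `maxChars` budget are illustrative choices, not part of the plan.

```javascript
// Turn ranked results into one context string plus a list of source references.
function assembleContext(results, maxChars = 6000) {
  const blocks = [];
  const sources = [];
  let used = 0;
  for (const r of results) {
    const ref = sources.length + 1;
    const where =
      r.page_number != null
        ? `${r.source_file}, p. ${r.page_number}`
        : r.source_file;
    // Images contribute their URL and caption; text and tables their content.
    const body =
      r.modality === "image" && r.image_url
        ? `Image: ${r.image_url}\nCaption: ${r.content_text}`
        : r.content_text;
    const block = `[${ref}] (${r.modality}; ${where})\n${body}`;
    if (used + block.length > maxChars) break; // stay within the budget
    blocks.push(block);
    used += block.length;
    sources.push({
      ref,
      source_file: r.source_file,
      page_number: r.page_number ?? null,
      image_url: r.image_url ?? null,
    });
  }
  return { context: blocks.join("\n\n"), sources };
}
```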
---
## Phase 7: Answer Generation
Send assembled context to LLM with prompt:
> *"Answer based only on retrieved knowledge. Include image references if useful."*
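
The prompt above might be packaged as chat messages as in this sketch. The `buildMessages` helper and the layout of the user message are assumptions; the `{ role, content }` message shape follows the common chat-completion format.

```javascript
const SYSTEM_PROMPT =
  "Answer based only on retrieved knowledge. Include image references if useful.";

// Build the message list sent to the LLM for one question.
function buildMessages(context, question) {
  return [
    { role: "system", content: SYSTEM_PROMPT },
    {
      role: "user",
      content: `Retrieved knowledge:\n${context}\n\nQuestion: ${question}`,
    },
  ];
}
```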
---
## Phase 8: Response to Frontend
```json
{
"answer": "...",
"sources": [...],
"images": [...],
"confidence": 0.92
}
```
---
## Phase 9: React UI Flow
- **Search flow:** Query → API → Results → Show text snippet + image preview + source link
- **Chat flow:** Conversation memory stored client-side for multi-turn dialogue
---
## Phase 10: Image Query Support (Advanced)
User uploads an image to search:
1. Caption the uploaded image
2. Convert caption → embedding
3. Search the vector DB, e.g. *"Find similar diagrams"*
---
## Phase 11: Security & Scaling
- [ ] Document permissions
- [ ] Embedding caching
- [ ] Background ingestion queue
- [ ] Chunk overlap tuning
- [ ] Monitoring & logging
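
For the embedding caching item, a small in-memory cache keyed by model and text could avoid re-embedding identical content. The `EmbeddingCache` class, its least-recently-used eviction, and the default size are assumptions; a shared store would be needed once the backend runs as more than one process.

```javascript
// In-memory LRU cache for embeddings, keyed by (model, text).
class EmbeddingCache {
  constructor(maxEntries = 1000) {
    this.maxEntries = maxEntries;
    this.entries = new Map(); // Map insertion order doubles as recency order
  }

  static keyFor(model, text) {
    return JSON.stringify([model, text]); // unambiguous composite key
  }

  // Return the cached embedding, or undefined on a miss.
  get(model, text) {
    const key = EmbeddingCache.keyFor(model, text);
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key);
    this.entries.delete(key); // re-insert to mark as most recently used
    this.entries.set(key, value);
    return value;
  }

  // Store an embedding, evicting the least recently used entry if full.
  set(model, text, embedding) {
    const key = EmbeddingCache.keyFor(model, text);
    this.entries.delete(key);
    this.entries.set(key, embedding);
    if (this.entries.size > this.maxEntries) {
      const oldestKey = this.entries.keys().next().value;
      this.entries.delete(oldestKey);
    }
  }
}
```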