# Socrates Development Plan
## 1. Architecture & Infrastructure Setup
### 1.1 Project Structure
- Create two top-level folders: `backend/` and `frontend/`
- Backend: Python monolith with FastAPI, organized by domain (agents/, rag/, knowledge_graph/, voice/, etc.)
- Frontend: React app with WebRTC integration
- Add Docker Compose at root to orchestrate backend and frontend, plus a local PostgreSQL container only if hosted Supabase is not used
- Database: Supabase PostgreSQL with pgvector extension enabled
### 1.2 Database Schema Design
**PostgreSQL with pgvector (Supabase)**
- Enable `pgvector` extension in Supabase
- `users` table: user_id, created_at, settings
- `knowledge_graph` table: user_id, topic, mastery_score (0-100), last_reviewed, misconceptions (JSON)
- `session_logs` table: user_id, session_id, timestamp, topic_touched, responses_graded
- `documents` table: document_id, user_id, filename, upload_date, page_count, chunks_indexed
- `document_chunks` table: chunk_id, document_id, page_number, section_title, content (text), embedding (vector with pgvector), bbox (JSON)
- `citations` table: chunk_id, page_number, bbox, content
**Vector Search (pgvector in Supabase)**
- Store embeddings directly in `document_chunks` table as vector column
- Use cosine similarity for semantic search
- Metadata co-located with vectors: document_id, page_number, chunk_index, section_title
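As a reference for what the similarity search computes, here is a minimal pure-Python sketch of cosine similarity and top-k ranking. The function names and the in-memory list of `(chunk_id, embedding)` pairs are illustrative assumptions; in the plan the ranking would run inside PostgreSQL, where pgvector's `<=>` operator returns cosine distance (1 minus cosine similarity).

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    chunks: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank (chunk_id, embedding) pairs by similarity to the query, best first."""
    scored = [(chunk_id, cosine_similarity(query, emb)) for chunk_id, emb in chunks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

A sketch like this can double as a test oracle when checking that the database query returns chunks in the expected order.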
### 1.3 API Contract Definition
- Define WebSocket message schema for audio streaming (Speech->Text->JSON)
- Define JSON payload structure for LLM output (text_response, audio_response, visual_data)
- Define REST endpoints: upload document, get knowledge graph, get session history
- Define vision payload: base64 image + comparison context
---
## 2. Phase 1: The "Blind" Tutor (Text-Only, No Voice)
### 2.1 Core LangGraph State Machine
- **Goal**: Build the conversation loop without voice latency concerns
- Design the state graph:
  - `assess_state`: Generate quiz question from uploaded text
  - `listen_state`: Accept user text input
  - `grade_state`: LLM grades response, extracts misconceptions
  - `update_memory_state`: Update knowledge_graph table
  - `decide_state`: Route to remediation, repetition, or advancement
- **Implementation**: LangGraph Node structure with conditional edges based on mastery_score
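The routing decision in `decide_state` could look roughly like the sketch below. It is deliberately framework-free so the logic can be unit-tested on its own before being wired into a LangGraph conditional edge. The `TutorState` dataclass, the rule that any open misconception forces remediation, and the lower threshold of 40 are assumptions made for illustration; the upper threshold of 70 is borrowed from the remediation rule in Section 2.4.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical thresholds: the plan only says routing depends on mastery_score.
REMEDIATE_BELOW = 40
ADVANCE_AT_OR_ABOVE = 70


@dataclass
class TutorState:
    """Illustrative slice of the graph state needed for routing."""
    topic: str
    mastery_score: float  # 0-100, as in the knowledge_graph table
    misconceptions: List[str] = field(default_factory=list)


def decide_state(state: TutorState) -> str:
    """Return the name of the next branch: remediation, repetition, or advancement."""
    # Assumption: an unresolved misconception always triggers remediation.
    if state.misconceptions or state.mastery_score < REMEDIATE_BELOW:
        return "remediation"
    if state.mastery_score >= ADVANCE_AT_OR_ABOVE:
        return "advancement"
    return "repetition"
```

A function with this shape (state in, branch name out) is the kind of callable a conditional edge expects, so the thresholds can be tuned without touching the graph wiring.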
### 2.2 Extraction Chain (Background Task)
- After each user response, spawn an async task to:
  1. Parse the response for concepts mentioned
  2. Update the `misconceptions` column in knowledge_graph if errors are detected
  3. Recalculate mastery_score using a simple formula (e.g., weighted average of last 5 attempts)
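One possible reading of "weighted average of last 5 attempts" is a linearly weighted mean in which newer attempts count more. The sketch below assumes that weighting scheme and assumes each attempt is already graded on the 0-100 scale; neither detail is fixed by the plan.

```python
from typing import Sequence


def mastery_score(attempt_scores: Sequence[float], window: int = 5) -> float:
    """Linearly weighted average of the most recent `window` scores (0-100).

    Scores are ordered oldest to newest; the newest attempt gets the largest weight.
    """
    recent = list(attempt_scores)[-window:]
    if not recent:
        return 0.0  # assumption: a topic with no attempts starts at zero
    weights = range(1, len(recent) + 1)  # oldest gets 1, newest gets len(recent)
    weighted_total = sum(w * s for w, s in zip(weights, recent))
    return weighted_total / sum(weights)
```

For example, scores of 100, 0, 50 (oldest first) give (1·100 + 2·0 + 3·50) / 6 ≈ 41.7, so the recent partial answer outweighs the early perfect one.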
### 2.3 RAG Setup (Strict Mode)
- Ingest uploaded text file → split into semantic chunks (paragraph-level, not fixed token count)
- Use Gemini 2.5 Flash to extract section headers and concept names during chunking
- Generate embeddings for each chunk using Gemini's embedding API (or LangChain's embedding wrapper)
- Store chunks + embeddings in Supabase `document_chunks` table with metadata: `{document_id, page_number, section, bbox_coords}`
- Use pgvector cosine similarity search for retrieval
- Retriever must mark chunks as "from document" vs. "general knowledge" to prevent hallucination
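The paragraph-level split and the "from document" tag could be sketched as follows. The blank-line paragraph rule, the function name, and the `source` field are assumptions for illustration; header extraction and embedding generation are left out.

```python
import re
from typing import Dict, List


def split_into_paragraph_chunks(text: str, document_id: str) -> List[Dict[str, object]]:
    """Split text on blank lines and tag every chunk as originating from the document."""
    # One or more blank (or whitespace-only) lines separate paragraphs.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [
        {
            "document_id": document_id,
            "chunk_index": index,
            "content": paragraph,
            "source": "document",  # hypothetical tag, as opposed to "general knowledge"
        }
        for index, paragraph in enumerate(paragraphs)
    ]
```

Tagging at ingestion time means the retriever never has to infer provenance later: anything without a `source` of `"document"` can be treated as general knowledge.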
### 2.4 Memory Initialization
- On user login, query knowledge_graph table
- If any topic has mastery_score < 70 and was last reviewed >3 days ago, flag for remediation
- Pass this context to the first `assess_state` to warm-start the conversation
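The remediation rule above can be expressed as a small filter over rows from the knowledge_graph table. This sketch assumes each row is a dict with `topic`, `mastery_score`, and a timezone-aware `last_reviewed` datetime; the function and constant names are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

MASTERY_THRESHOLD = 70          # from the rule above: mastery_score < 70
STALE_AFTER = timedelta(days=3)  # from the rule above: last reviewed >3 days ago


def topics_needing_remediation(rows: List[Dict], now: datetime) -> List[str]:
    """Return topics that are both weak and not reviewed recently."""
    flagged = []
    for row in rows:
        is_weak = row["mastery_score"] < MASTERY_THRESHOLD
        is_stale = now - row["last_reviewed"] > STALE_AFTER
        if is_weak and is_stale:
            flagged.append(row["topic"])
    return flagged


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [{"topic": "renal", "mastery_score": 55, "last_reviewed": now - timedelta(days=5)}]
    print(topics_needing_remediation(sample, now))
```

Passing `now` in explicitly keeps the function deterministic under test.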
### 2.5 Deliverable (Phase 1)
- Terminal-based CLI app
- User uploads a .txt file (e.g., medical notes)
- Agent quizzes the user in a loop
- Knowledge graph persists across sessions
---
## 3. Phase 2: Voice Integration (LiveKit Agent)
### 3.1 Architecture: LiveKit Voice Pipeline
- Use **LiveKit Agents** framework for seamless voice streaming
- Simplifies STT/TTS pipeline vs. manual WebSocket management
- **Components**:
  - **STT (Speech-to-Text)**: Deepgram STT plugin (`nova-2` model for low latency)
  - **TTS (Text-to-Speech)**: Cartesia TTS plugin (`sonic-english` for natural voice)
  - **VAD (Voice Activity Detection)**: Silero VAD (on-device, no API calls)
  - **LLM**: Gemini 2.5 Flash (streaming responses for minimal latency)
### 3.2 Worker Process Architecture
- **`worker.py`**: Runs LiveKit agent processes (separate from REST API)
  - Listens for incoming rooms/participants from LiveKit
  - Manages audio streaming to/from clients via WebRTC
  - No need for custom WebSocket management; LiveKit handles it
  - Scales horizontally: multiple worker processes for multiple concurrent sessions
### 3.3 Audio Pipeline Flow
1. **User speaks** → Audio captured by client via WebRTC
2. **STT (Deepgram)** → Transcribes to text in real-time
3. **LLM (Gemini)** → Generates response with streaming tokens
4. **TTS (Cartesia)** → Converts text to speech while LLM still generating
5. **Audio played** → User hears response via WebRTC
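Step 4 depends on handing text to TTS before the LLM has finished. The sketch below illustrates the idea with a naive sentence buffer over a token stream; the voice framework may already do this internally, so treat it as an explanation of the overlap rather than code the worker needs. The function name and the punctuation rule are assumptions, and the rule would mis-split abbreviations or decimals that end a token with a period.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of LLM tokens."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Flush as soon as the buffered text ends a sentence, so TTS can start early.
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush any trailing fragment when the stream ends


if __name__ == "__main__":
    stream = ["Mito", "chondria", " make", " ATP.", " Any", " questions", "?"]
    for sentence in sentences_from_tokens(stream):
        print(sentence)
```

Because the generator yields the first sentence while later tokens are still arriving, speech synthesis for sentence one can overlap with generation of sentence two.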
### 3.4 Dynamic VAD & Latency
- Silero VAD handles speech detection (no API calls, <50ms overhead)
- Can adjust silence thresholds dynamically based on question type
- Streaming tokens from Gemini reduce "thinking" latency
- Target: <800ms round-trip (audio in → response out)
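Adjusting the silence threshold by question type could be as simple as a lookup table consulted when deciding whether the user has finished speaking. The question types and millisecond values below are hypothetical placeholders, not measured settings; how the value is passed to the VAD depends on the framework version in use.

```python
# Hypothetical question types and silence windows, in milliseconds.
SILENCE_MS_BY_QUESTION_TYPE = {
    "recall": 600,       # short factual answers: end the turn quickly
    "explain": 1200,     # explanations: tolerate mid-sentence pauses
    "derivation": 2000,  # multi-step reasoning: allow long thinking pauses
}
DEFAULT_SILENCE_MS = 800


def silence_threshold_ms(question_type: str) -> int:
    """Return how long the user must be silent before the turn is considered over."""
    return SILENCE_MS_BY_QUESTION_TYPE.get(question_type, DEFAULT_SILENCE_MS)


def end_of_turn(silence_ms: int, question_type: str) -> bool:
    """True once observed silence reaches the threshold for this question type."""
    return silence_ms >= silence_threshold_ms(question_type)
```

Longer silence windows trade responsiveness for fewer interruptions, so the values would need to be checked against the round-trip target.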
### 3.5 Integration Points
- **LiveKit server**: Handles WebRTC connection, room management, auth
- **Socrates backend**: REST API still serves document upload, knowledge graph
- **Worker nodes**: Stateless agents that process voice, access DB for context
- **Gemini API**: Used for LLM responses; embeddings are generated separately during document ingestion (Section 2.3)
### 3.6 Deliverable (Phase 2)
- ✅ Voice-to-voice interaction working via LiveKit
- ✅ STT + TTS streaming with <1s latency
- ✅ Dynamic VAD with question-type awareness
- ✅ Stateless worker architecture for scaling
- ✅ Integration with knowledge graph for context
---
## 4. Phase 3: Multimodal Output (Whiteboard & Structured Output)
### 4.1 Structured LLM Output
- Modify LLM prompt to output JSON with fields:
```json
{
  "text_response": "...",
  "visual_data": {
    "type": "mermaid" | "latex" | "matplotlib_python",
    "content": "...",
    "description": "..."
  },
  "citation": {
    "page": 42,
    "bbox": [x, y, w, h],
    "quote": "..."
  }
}
```
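Since LLM output cannot be trusted to match the schema, the backend would likely validate it before forwarding anything to the frontend. The sketch below uses only the standard library and checks the fields shown above. It assumes `visual_data` and `citation` are optional, which the plan does not state; the function name is illustrative.

```python
import json

ALLOWED_VISUAL_TYPES = {"mermaid", "latex", "matplotlib_python"}


def parse_tutor_output(raw: str) -> dict:
    """Parse the LLM's JSON reply and raise ValueError if it violates the schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")
    if not isinstance(data.get("text_response"), str):
        raise ValueError("text_response must be a string")

    visual = data.get("visual_data")  # assumption: optional
    if visual is not None:
        if not isinstance(visual, dict) or visual.get("type") not in ALLOWED_VISUAL_TYPES:
            raise ValueError("visual_data.type must be one of the allowed types")
        if not isinstance(visual.get("content"), str):
            raise ValueError("visual_data.content must be a string")

    citation = data.get("citation")  # assumption: optional
    if citation is not None:
        if not isinstance(citation, dict):
            raise ValueError("citation must be an object")
        bbox = citation.get("bbox")
        if not (isinstance(bbox, list) and len(bbox) == 4):
            raise ValueError("citation.bbox must be [x, y, w, h]")
    return data
```

Raising a single exception type keeps the caller simple: any `ValueError` can feed the same retry or fallback path.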
### 4.2 Prompt Engineering for Code Generation
- Test prompts that reliably generate valid Mermaid.js syntax (flowcharts, graphs)
- Test LaTeX generation for mathematical expressions
- Test Python code for Matplotlib graphs (with error handling)
- Include a "Self-Correction" chain:
  - If Mermaid render fails, LLM re-generates with error message in context
  - Max 2 retry attempts to avoid infinite loops
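The self-correction chain can be sketched as a bounded loop around two injected callables. Here `generate` stands in for the LLM call and `validate` for the renderer check; both are hypothetical interfaces chosen so the loop can be tested without either service.

```python
from typing import Callable, Optional

MAX_RETRIES = 2  # from the plan: at most 2 retry attempts


def generate_with_self_correction(
    generate: Callable[[Optional[str]], str],
    validate: Callable[[str], Optional[str]],
    max_retries: int = MAX_RETRIES,
) -> Optional[str]:
    """Call `generate`, feeding back the validation error, until the output is valid.

    `generate(error)` receives None on the first call and the previous error
    message on retries. `validate(code)` returns None when the code is valid,
    otherwise an error message. Returns None if every attempt fails.
    """
    error: Optional[str] = None
    for _ in range(1 + max_retries):  # one initial attempt plus the retries
        candidate = generate(error)
        error = validate(candidate)
        if error is None:
            return candidate
    return None
```

Returning `None` after the final failure leaves the fallback (for example, a text-only answer) to the caller.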
### 4.3 Frontend Whiteboard Component
- React component that:
  - Displays Mermaid diagrams (use mermaid.js library)
  - Renders LaTeX with react-latex
  - Requests Matplotlib renders from the backend, which executes the Python and returns SVG
  - Syncs animation timing with audio playback (visual highlight appears when audio mentions it)
### 4.4 Document Viewer Integration
- Show the uploaded PDF alongside the whiteboard
- When citation is returned, highlight the referenced bbox with yellow overlay
- Auto-scroll PDF to the cited page
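To position the highlight independently of zoom level, the backend could convert a citation's `bbox` into percentages of the page size. This sketch assumes `bbox` is `[x, y, w, h]` with a top-left origin and the same units as the page dimensions; PDF-native coordinates use a bottom-left origin, so a flip would be needed if the extractor reports those. The function name and the returned keys are illustrative.

```python
from typing import Dict, Sequence


def bbox_to_overlay_style(
    bbox: Sequence[float], page_width: float, page_height: float
) -> Dict[str, str]:
    """Convert [x, y, w, h] into percentage offsets for an absolutely positioned overlay."""
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")
    x, y, w, h = bbox
    return {
        "left": f"{100 * x / page_width:.2f}%",
        "top": f"{100 * y / page_height:.2f}%",
        "width": f"{100 * w / page_width:.2f}%",
        "height": f"{100 * h / page_height:.2f}%",
    }
```

Percentages keep the overlay aligned when the viewer rescales the page, since both the page and the highlight scale together.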
### 4.5 Deliverable (Phase 3)
- Full web UI with React
- Agent speaks, draws diagrams, and cites sources simultaneously
- User sees real-time visual explanations
---
## 5. Phase 4: Vision & Deployment
### 5.1 Vision Integration
- Capture frame from user's webcam when they say "Show me" or manually trigger
- Send the frame as a base64-encoded image, plus comparison context, to Gemini 2.5 Flash
- Prompt template: "Analyze this handwritten solution. Compare to the correct answer. Identify the error step."
- Return structured feedback with visual markup (arrows, annotations)
### 5.2 Frontend Camera Component
- React component with webcam capture
- Display captured image, overlay agent feedback with annotations
- Allow user to dismiss or request another take
### 5.3 Deployment & Scaling
- Dockerize backend (FastAPI) and frontend (React + Node)
- Add Docker Compose for local dev, Kubernetes manifests for production
- Environment config: API keys, Supabase DB credentials
- Set up monitoring: latency dashboards, error rates, hallucination detection
### 5.4 Testing & KPI Validation
- Latency benchmarks: measure voice-to-voice round-trip
- Retention tests: verify knowledge_graph recalls topics from >3 sessions ago
- Hallucination tests: create test suite of "out-of-document" queries, verify agent says "I don't know"
### 5.5 Deliverable (Phase 4)
- Production-ready deployment
- All 4 core pillars functional
---
## 6. Critical Path & Dependencies
**Must Complete First:**
1. Supabase project setup + pgvector schema design
2. LangGraph state machine (Phase 1)
3. FastAPI WebSocket scaffolding
**Blockers to Watch:**
- Silero VAD tuning (Phase 2) - difficult to get right, may need ML fine-tuning
- Gemini 2.5 Flash structured output reliability (Phase 3) - prompt engineering heavy
- Vision API accuracy on handwritten math (Phase 4) - test early with sample images
- pgvector cosine similarity relevance (Phase 1) - test embedding quality and similarity thresholds
**Parallelizable:**
- React UI development can start during Phase 2
- Docker setup can happen anytime
- Frontend and backend can be worked on concurrently once APIs are defined
- Supabase schema can be set up independently while backend infrastructure is being built
---
## 7. Risk Mitigation Checklist
| Risk | Mitigation | Owner | Timeline |
|------|-----------|-------|----------|
| Voice latency >1s | Profile each step; use speculative tokens; test with real Deepgram/Cartesia | Backend | Phase 2 |
| Silero VAD errors | Create a labeled dataset of "thinking" vs. "done" pauses; tune the threshold per question type | Audio | Phase 2 |
| Invalid Mermaid code from LLM | Implement self-correction loop with max 2 retries; test 50+ diagram prompts | Backend | Phase 3 |
| Hallucinations on out-of-doc facts | Enforce citation requirement; add explicit "I don't know" branching logic | RAG | Phase 1 |
| Vision API fails on bad handwriting | Test with low-res images early; set user expectations; fallback to text input | Vision | Phase 4 |
| WebRTC connection drops mid-session | Implement reconnection logic with session resumption; persist state to DB | Frontend | Phase 2 |
| pgvector embedding quality | Test with multiple embedding models; tune similarity threshold; benchmark retrieval accuracy | Backend | Phase 1 |
| Supabase latency on vector queries | Profile query times with large document sets; index optimization if needed | DB | Phase 1 |
---
## 8. Success Metrics (Testing Checklist)
- [ ] Phase 1: Text-based quiz loop works, knowledge_graph updates correctly
- [ ] Phase 2: Voice round-trip latency <1000ms on average; interrupt detection works
- [ ] Phase 3: Diagrams render correctly; citations highlight PDF correctly
- [ ] Phase 4: Vision API identifies errors in handwritten work; deployment passes load test
---
This plan is the blueprint. Ready to start Phase 1 implementation when needed.