# Socrates Development Plan

## 1. Architecture & Infrastructure Setup

### 1.1 Project Structure

- Create two top-level folders: `backend/` and `frontend/`
- Backend: Python monolith with FastAPI, organized by domain (agents/, rag/, knowledge_graph/, voice/, etc.)
- Frontend: React app with WebRTC integration
- Add Docker Compose at root to orchestrate backend, frontend, and PostgreSQL (or use Supabase for hosted DB)
- Database: Supabase PostgreSQL with pgvector extension enabled

### 1.2 Database Schema Design

**PostgreSQL with pgvector (Supabase)**

- Enable `pgvector` extension in Supabase
- `users` table: user_id, created_at, settings
- `knowledge_graph` table: user_id, topic, mastery_score (0-100), last_reviewed, misconceptions (JSON)
- `session_logs` table: user_id, session_id, timestamp, topic_touched, responses_graded
- `documents` table: document_id, user_id, filename, upload_date, page_count, chunks_indexed
- `document_chunks` table: chunk_id, document_id, page_number, section_title, content (text), embedding (vector with pgvector), bbox (JSON)
- `citations` table: chunk_id, page_number, bbox, content

**Vector Search (pgvector in Supabase)**

- Store embeddings directly in `document_chunks` table as vector column
- Use cosine similarity for semantic search
- Metadata co-located with vectors: document_id, page_number, chunk_index, section_title

### 1.3 API Contract Definition

- Define WebSocket message schema for audio streaming (Speech->Text->JSON)
- Define JSON payload structure for LLM output (text_response, audio_response, visual_data)
- Define REST endpoints: upload document, get knowledge graph, get session history
- Define vision payload: base64 image + comparison context

---

## 2. Phase 1: The "Blind" Tutor (Text-Only, No Voice)

### 2.1 Core LangGraph State Machine

- **Goal**: Build the conversation loop without voice latency concerns
- Design the state graph:
  - `assess_state`: Generate quiz question from uploaded text
  - `listen_state`: Accept user text input
  - `grade_state`: LLM grades response, extracts misconceptions
  - `update_memory_state`: Update knowledge_graph table
  - `decide_state`: Route to remediation, repetition, or advancement
- **Implementation**: LangGraph Node structure with conditional edges based on mastery_score

### 2.2 Extraction Chain (Background Task)

- After each user response, spawn an async task to:
  1. Parse the response for concepts mentioned
  2. Update `misconceptions_list` in knowledge_graph if errors detected
  3. Recalculate mastery_score using a simple formula (e.g., weighted average of last 5 attempts)

### 2.3 RAG Setup (Strict Mode)

- Ingest uploaded text file → split into semantic chunks (paragraph-level, not fixed token count)
- Use Gemini 2.5 Flash to extract section headers and concept names during chunking
- Generate embeddings for each chunk using Gemini's embedding API (or LangChain's embedding wrapper)
- Store chunks + embeddings in Supabase `document_chunks` table with metadata: `{document_id, page_number, section, bbox_coords}`
- Use pgvector cosine similarity search for retrieval
- Retriever must mark chunks as "from document" vs. "general knowledge" to prevent hallucination

### 2.4 Memory Initialization

- On user login, query knowledge_graph table
- If any topic has mastery_score < 70 and was last reviewed >3 days ago, flag for remediation
- Pass this context to the first `assess_state` to warm-start the conversation

### 2.5 Deliverable (Phase 1)

- Terminal-based CLI app
- User uploads a .txt file (e.g., medical notes)
- Agent quizzes the user in a loop
- Knowledge graph persists across sessions

---

## 3. Phase 2: Voice Integration (LiveKit Agent)

### 3.1 Architecture: LiveKit Voice Pipeline

- Use **LiveKit Agents** framework for seamless voice streaming
- Simplifies STT/TTS pipeline vs. manual WebSocket management
- **Components**:
  - **STT (Speech-to-Text)**: Deepgram STT plugin (`nova-2` model for low latency)
  - **TTS (Text-to-Speech)**: Cartesia TTS plugin (`sonic-english` for natural voice)
  - **VAD (Voice Activity Detection)**: Silero VAD (on-device, no API calls)
  - **LLM**: Gemini 2.5 Flash (streaming responses for minimal latency)

### 3.2 Worker Process Architecture

- **`worker.py`**: Runs LiveKit agent processes (separate from REST API)
- Listens for incoming rooms/participants from LiveKit
- Manages audio streaming to/from clients via WebRTC
- No need for custom WebSocket management; LiveKit handles it
- Scales horizontally: multiple worker processes for multiple concurrent sessions

### 3.3 Audio Pipeline Flow

1. **User speaks** → Audio captured by client via WebRTC
2. **STT (Deepgram)** → Transcribes to text in real-time
3. **LLM (Gemini)** → Generates response with streaming tokens
4. **TTS (Cartesia)** → Converts text to speech while LLM still generating
5. **Audio played** → User hears response via WebRTC

### 3.4 Dynamic VAD & Latency

- Silero VAD handles speech detection (no API calls, <50ms overhead)
- Can adjust silence thresholds dynamically based on question type
- Streaming tokens from Gemini reduce "thinking" latency
- Target: <800ms round-trip (audio in → response out)

### 3.5 Integration Points

- **LiveKit server**: Handles WebRTC connection, room management, auth
- **Socrates backend**: REST API still serves document upload, knowledge graph
- **Worker nodes**: Stateless agents that process voice, access DB for context
- **Gemini API**: Used for LLM responses (not embeddings; that's separate)

### 3.6 Deliverable (Phase 2)

- ✅ Voice-to-voice interaction working via LiveKit
- ✅ STT + TTS streaming with <1s latency
- ✅ Dynamic VAD with question-type awareness
- ✅ Stateless worker architecture for scaling
- ✅ Integration with knowledge graph for context

---

## 4. Phase 3: Multimodal Output (Whiteboard & Structured Output)

### 4.1 Structured LLM Output

- Modify LLM prompt to output JSON with fields:

```json
{
  "text_response": "...",
  "visual_data": {
    "type": "mermaid" | "latex" | "matplotlib_python",
    "content": "...",
    "description": "..."
  },
  "citation": {
    "page": 42,
    "bbox": [x, y, w, h],
    "quote": "..."
  }
}
```

### 4.2 Prompt Engineering for Code Generation

- Test prompts that reliably generate valid Mermaid.js syntax (flowcharts, graphs)
- Test LaTeX generation for mathematical expressions
- Test Python code for Matplotlib graphs (with error handling)
- Include a "Self-Correction" chain:
  - If Mermaid render fails, LLM re-generates with error message in context
  - Max 2 retry attempts to avoid infinite loops

### 4.3 Frontend Whiteboard Component

- React component that:
  - Displays Mermaid diagrams (use mermaid.js library)
  - Renders LaTeX with react-latex
  - Executes Python/matplotlib on backend and returns SVG
  - Syncs animation timing with audio playback (visual highlight appears when audio mentions it)

### 4.4 Document Viewer Integration

- Show the uploaded PDF alongside the whiteboard
- When citation is returned, highlight the referenced bbox with yellow overlay
- Auto-scroll PDF to the cited page

### 4.5 Deliverable (Phase 3)

- Full web UI with React
- Agent speaks, draws diagrams, and cites sources simultaneously
- User sees real-time visual explanations

---

## 5. Phase 4: Vision & Deployment

### 5.1 Vision Integration

- Capture frame from user's webcam when they say "Show me" or manually trigger
- Send the frame as a base64-encoded image to Gemini 2.5 Flash Vision API
- Prompt template: "Analyze this handwritten solution. Compare to the correct answer. Identify the error step."
- Return structured feedback with visual markup (arrows, annotations)

### 5.2 Frontend Camera Component

- React component with webcam capture
- Display captured image, overlay agent feedback with annotations
- Allow user to dismiss or request another take

### 5.3 Deployment & Scaling

- Dockerize backend (FastAPI) and frontend (React + Node)
- Add Docker Compose for local dev, Kubernetes manifests for production
- Environment config: API keys, Supabase DB credentials
- Set up monitoring: latency dashboards, error rates, hallucination detection

### 5.4 Testing & KPI Validation

- Latency benchmarks: measure voice-to-voice round-trip
- Retention tests: verify knowledge_graph recalls topics from >3 sessions ago
- Hallucination tests: create test suite of "out-of-document" queries, verify agent says "I don't know"

### 5.5 Deliverable (Phase 4)

- Production-ready deployment
- All 4 core pillars functional

---

## 6. Critical Path & Dependencies

**Must Complete First:**

1. Supabase project setup + pgvector schema design
2. LangGraph state machine (Phase 1)
3. FastAPI WebSocket scaffolding

**Blockers to Watch:**

- Silero VAD tuning (Phase 2) - difficult to get right, may need ML fine-tuning
- Gemini 2.5 Flash structured output reliability (Phase 3) - prompt engineering heavy
- Vision API accuracy on handwritten math (Phase 4) - test early with sample images
- pgvector cosine similarity relevance (Phase 1) - test embedding quality and similarity thresholds

**Parallelizable:**

- React UI development can start during Phase 2
- Docker setup can happen anytime
- Frontend and backend can be worked on concurrently once APIs are defined
- Supabase schema can be set up independently while backend infrastructure is being built

---

## 7. Risk Mitigation Checklist

| Risk | Mitigation | Owner | Timeline |
|------|-----------|-------|----------|
| Voice latency >1s | Profile each step; use speculative tokens; test with real Deepgram/Cartesia | Backend | Phase 2 |
| Silero VAD errors | Create labeled dataset of thinking vs. done; tune threshold per question type | Audio | Phase 2 |
| Invalid Mermaid code from LLM | Implement self-correction loop with max 2 retries; test 50+ diagram prompts | Backend | Phase 3 |
| Hallucinations on out-of-doc facts | Enforce citation requirement; add explicit "I don't know" branching logic | RAG | Phase 1 |
| Vision API fails on bad handwriting | Test with low-res images early; set user expectations; fallback to text input | Vision | Phase 4 |
| WebSocket connection drops mid-session | Implement reconnection logic with session resumption; persist state to DB | Frontend | Phase 2 |
| pgvector embedding quality | Test with multiple embedding models; tune similarity threshold; benchmark retrieval accuracy | Backend | Phase 1 |
| Supabase latency on vector queries | Profile query times with large document sets; index optimization if needed | DB | Phase 1 |

---

## 8. Success Metrics (Testing Checklist)

- [ ] Phase 1: Text-based quiz loop works, knowledge_graph updates correctly
- [ ] Phase 2: Voice round-trip latency <1000ms on average; interrupt detection works
- [ ] Phase 3: Diagrams render correctly; citations highlight PDF correctly
- [ ] Phase 4: Vision API identifies errors in handwritten work; deployment passes load test

---

This plan is the blueprint. Ready to start Phase 1 implementation when needed.
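
As an illustration of the mastery-score recalculation in section 2.2 and the remediation check in section 2.4, here is a minimal Python sketch. The plan only specifies "weighted average of last 5 attempts", a threshold of 70, and a 3-day review window; the linear recency weights, the `TopicRecord` class, and the function names below are assumptions made for the example, not part of the plan.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Values taken from the plan: last 5 attempts, mastery_score < 70, > 3 days.
WINDOW = 5
MASTERY_THRESHOLD = 70.0
REVIEW_INTERVAL = timedelta(days=3)


def mastery_score(attempt_scores: list[float]) -> float:
    """Weighted average (0-100) of the most recent attempts.

    Assumption: weights grow linearly with recency (oldest = 1, newest = n).
    The plan does not say which weighting to use.
    """
    recent = attempt_scores[-WINDOW:]
    if not recent:
        return 0.0
    weights = range(1, len(recent) + 1)
    return sum(w * s for w, s in zip(weights, recent)) / sum(weights)


@dataclass
class TopicRecord:
    """Hypothetical in-memory view of one knowledge_graph row."""
    topic: str
    mastery_score: float
    last_reviewed: datetime


def needs_remediation(record: TopicRecord, now: datetime) -> bool:
    """Flag a topic when it is both weak and stale, as described in 2.4."""
    return (
        record.mastery_score < MASTERY_THRESHOLD
        and now - record.last_reviewed > REVIEW_INTERVAL
    )
```

In this sketch, topics returned by a `knowledge_graph` query on login would be filtered through `needs_remediation` and the flagged ones passed as context to the first `assess_state`.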
