# Implementation Plan: Ethnobotany Vectorized Database
**Created**: 2025-11-01
**Updated**: 2025-11-09 (Added docling-rag-agent reference architecture)
**Status**: Ready for Implementation
**Reference Architecture**: [docling-rag-agent](https://github.com/coleam00/ottomator-agents/tree/main/docling-rag-agent)
**Documentation**:
- [Feature Specification](./spec.md)
- [Updated Specification](./specs/001-docling-rag-update/spec.md) - docling-rag-agent patterns integrated
- [Data Model](./data-model.md)
- [API Specification](./api-specification.md)
- Research: [Embeddings](./research/research-embedding-models.md) • [Databases](./research/research-databases.md) • [LLM APIs](./research/research-llm-apis.md)
---
## Technical Context
### Problem Statement
Ethnobotany researchers need a semantic search engine over scientific literature to discover relationships between traditional plant use, regional communities, and research findings that keyword search cannot surface.
### Solution Architecture
**Based on [docling-rag-agent](https://github.com/coleam00/ottomator-agents/tree/main/docling-rag-agent)** - A containerized RAG application that:
- Ingests documents in multiple formats (PDF, DOCX, PPTX, XLSX, HTML, MD, TXT) via Docling
- Stores metadata and vector embeddings in PostgreSQL + pgvector
- Provides chat interface via PydanticAI agent framework with tool-calling
- Supports multi-provider LLMs (OpenAI, Claude, Gemini) with user-provided API keys
- Implements streaming responses and async connection pooling
**Key Design Principles**:
1. **Reference Implementation First**: Follow docling-rag-agent patterns before customizing
2. **No Full-Text Retention**: Metadata + embeddings + source links only
3. **CARE Principles**: Add ethnobotany-specific governance on top of base RAG
---
## Architecture Overview
### System Components
```
┌─────────────────────────────────────────────────────────────────┐
│                      Web Frontend (React)                       │
│  - Article upload/URL submission interface                      │
│  - Chat-based semantic search                                   │
│  - Recommendation explorer                                      │
│  - Trend analysis dashboard                                     │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                    REST/WebSocket
                           │
┌──────────────────────────▼──────────────────────────────────────┐
│                  Backend API (FastAPI/Python)                   │
│  ┌────────────────┬────────────────┬─────────────────────────┐  │
│  │ Article Ingest │ Search Service │ Recommendation Engine   │  │
│  │ - PDF Parser   │ - Semantic     │ - Similarity Analysis   │  │
│  │ - URL Fetcher  │ - Chat LLM     │ - Cross-domain Links    │  │
│  │ - Metadata Ext │   Interface    │                         │  │
│  └────────┬───────┴────────┬───────┴────────────┬────────────┘  │
│           │                │                    │               │
├───────────┴────────────────┴────────────────────┴───────────────┤
│              Embedding Service (Python + SPECTER2)              │
│  - Batch embedding generation                                   │
│  - Async processing                                             │
└───────────────┬─────────────────────────────────┬───────────────┘
                │                                 │
   ┌────────────▼────────────┐       ┌────────────▼────────────┐
   │ Vector Database         │       │ SQL Database            │
   │ (PostgreSQL pgvector)   │       │ (PostgreSQL)            │
   │ - 768-dim embeddings    │       │ - Article metadata      │
   │ - Similarity search     │       │ - User preferences      │
   │ - Payload filtering     │       │ - Search history        │
   └─────────────────────────┘       └─────────────────────────┘
```
### Technology Stack
| Layer | Technology | Rationale |
|-------|-----------|-----------|
| **Frontend** | React 18 + TypeScript | Modern, type-safe, large ecosystem |
| **Backend** | FastAPI (Python 3.11+) | Async-first, excellent for concurrent tasks |
| **Embeddings** | SPECTER2 (Hugging Face) | Scientific-domain optimized, task-adaptive |
| **Vector DB** | PostgreSQL pgvector (MVP) + Qdrant (Prod) | Open-source, cost-effective, reliable |
| **SQL DB** | PostgreSQL 15+ | ACID transactions, full-text search, pgvector |
| **LLM Integration** | LiteLLM + Claude/Gemini/ChatGPT APIs | Multi-provider, unified interface |
| **Article Fetching** | httpx, crossref-commons, arxiv | Async HTTP, native metadata APIs |
| **PDF Processing** | pypdf (successor to the deprecated PyPDF2), pdfplumber | Text extraction, metadata parsing |
| **Container** | Docker + Docker Compose | Reproducible deployment, local development |
| **Queue** | Celery + Redis | Async task processing (optional for MVP) |
---
## Data Model & Persistence
### PostgreSQL Schema (Core)
```sql
-- Enable pgvector (required for the vector type and HNSW indexes)
CREATE EXTENSION IF NOT EXISTS vector;

-- Articles table (core metadata)
CREATE TABLE articles (
    id BIGSERIAL PRIMARY KEY,
    doi VARCHAR(255) UNIQUE,                -- UNIQUE already creates an index on doi
    title TEXT NOT NULL,
    authors TEXT[] NOT NULL,                -- Array of author names
    year INTEGER,
    publication_venue TEXT,
    abstract TEXT,
    keywords TEXT[],
    source_url TEXT,                        -- URL to original article (journal/DOI/arXiv)
    source_type VARCHAR(50),                -- 'doi', 'journal_url', 'arxiv', 'researchgate'
    language VARCHAR(10) DEFAULT 'en',      -- 'en' or 'pt'
    ingestion_method VARCHAR(20),           -- 'pdf_upload' or 'url_fetch'
    metadata_confidence NUMERIC(3,2),       -- 0.00-1.00 extraction quality
    embedding_status VARCHAR(20) DEFAULT 'pending',
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW(),
    -- Full-text search (note: 'english' configuration is applied to 'pt' rows too)
    search_vector tsvector GENERATED ALWAYS AS (
        to_tsvector('english', COALESCE(title, '') || ' ' ||
                               COALESCE(abstract, ''))
    ) STORED
);

CREATE INDEX idx_articles_year ON articles(year);
CREATE INDEX idx_articles_search ON articles USING GIN(search_vector);

-- Embeddings table
CREATE TABLE embeddings (
    id BIGSERIAL PRIMARY KEY,
    article_id BIGINT UNIQUE REFERENCES articles(id) ON DELETE CASCADE,
    embedding vector(768),                  -- SPECTER2 output dimension
    model_version VARCHAR(50),
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_embeddings_similarity ON embeddings USING hnsw (embedding vector_cosine_ops);

-- CARE Principles: Community metadata
CREATE TABLE communities (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL UNIQUE,
    region VARCHAR(255),                    -- e.g., "Amazonia", "Cerrado", "Caatinga"
    country VARCHAR(100),
    research_domain VARCHAR(255),           -- e.g., "ethnobotany", "ethnopharmacology"
    approval_status VARCHAR(20),            -- 'pending', 'approved', 'declined'
    notes TEXT
);

CREATE TABLE article_communities (
    id BIGSERIAL PRIMARY KEY,
    article_id BIGINT REFERENCES articles(id) ON DELETE CASCADE,
    community_id INTEGER REFERENCES communities(id),
    knowledge_origin TEXT,                  -- Description of traditional knowledge
    validation_status VARCHAR(20),          -- 'pending', 'approved', 'disputed'
    validated_by VARCHAR(255),              -- Community representative name
    validated_at TIMESTAMP,
    UNIQUE(article_id, community_id)
);

-- User interactions
CREATE TABLE search_queries (
    id BIGSERIAL PRIMARY KEY,
    user_id VARCHAR(255),
    query_text TEXT NOT NULL,
    query_embedding vector(768),            -- For analytics
    results_count INTEGER,
    selected_articles BIGINT[],             -- Which articles user clicked (articles.id is BIGINT)
    llm_provider VARCHAR(50),
    created_at TIMESTAMP DEFAULT NOW()
);

-- User preferences & LLM config
CREATE TABLE user_settings (
    user_id VARCHAR(255) PRIMARY KEY,
    preferred_llm VARCHAR(50) DEFAULT 'claude',  -- 'claude', 'gemini', 'chatgpt'
    api_keys_encrypted JSONB,               -- Encrypted: {claude: enc_key, gemini: enc_key, ...}
    language_preference VARCHAR(10) DEFAULT 'pt',
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Annotations & community feedback
CREATE TABLE annotations (
    id BIGSERIAL PRIMARY KEY,
    article_id BIGINT REFERENCES articles(id) ON DELETE CASCADE,
    user_id VARCHAR(255),
    annotation_type VARCHAR(50),            -- 'correction', 'clarification', 'dispute', 'confirmation'
    text TEXT NOT NULL,
    evidence_url TEXT,                      -- Link supporting annotation
    created_at TIMESTAMP DEFAULT NOW()
);
```
---
## Implementation Phases
### Phase 1: Foundation (Weeks 1-3)
**Deliverables**: Working MVP with PDF upload and basic semantic search
#### 1.1 Infrastructure Setup
- [ ] Docker environment (PostgreSQL + pgvector, FastAPI, Redis)
- [ ] Project structure (backend/frontend folder layout)
- [ ] Environment configuration (.env, secrets management)
- [ ] CI/CD pipeline skeleton (GitHub Actions)
#### 1.2 Core Backend Services
- [ ] FastAPI application with basic endpoints
- [ ] PostgreSQL connection and schema initialization
- [ ] SPECTER2 model loading and inference service
- [ ] pgvector index configuration
#### 1.3 Article Ingestion (PDF Upload)
- [ ] PDF upload endpoint with file validation
- [ ] PDF text extraction service (pdfplumber)
- [ ] Metadata extraction from PDF (title, authors, year via heuristics)
- [ ] Async metadata enrichment from CrossRef API (when DOI found)
- [ ] Duplicate detection (by DOI or title similarity)
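
The duplicate-detection step could be sketched as follows. This is a minimal illustration, not the planned implementation: the function names, the dict-shaped article records, and the `title_threshold` value of 0.92 are all assumptions, and title similarity uses the standard library's `difflib` rather than any method the plan specifies.

```python
import re
from difflib import SequenceMatcher


def normalize_doi(raw: str) -> str:
    """Lower-case a DOI and strip common URL or 'doi:' prefixes."""
    doi = raw.strip().lower()
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", doi)


def normalize_title(title: str) -> str:
    """Lower-case, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", title.lower()).split())


def is_duplicate(new: dict, existing: dict, title_threshold: float = 0.92) -> bool:
    """Compare by DOI when both records have one, otherwise by title similarity.

    title_threshold is an illustrative value, not one taken from the plan.
    """
    if new.get("doi") and existing.get("doi"):
        return normalize_doi(new["doi"]) == normalize_doi(existing["doi"])
    ratio = SequenceMatcher(
        None, normalize_title(new["title"]), normalize_title(existing["title"])
    ).ratio()
    return ratio >= title_threshold
```

In practice the DOI check would likely be a database lookup against the `UNIQUE` `doi` column, with the title comparison reserved for records that lack a DOI.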
#### 1.4 Embedding Generation
- [ ] Batch embedding service using SPECTER2
- [ ] Storage in pgvector
- [ ] Async processing queue (Celery optional for MVP)
#### 1.5 Frontend Basics
- [ ] Upload form component
- [ ] Article listing/search results page
- [ ] Basic styling and layout
**Success Metrics**:
- Upload 10 sample PDFs successfully
- Metadata extraction works for 95%+ of uploads
- Embeddings generated and stored
- Basic search UI functional
---
### Phase 2: Search & Chat Interface (Weeks 4-5)
**Deliverables**: Semantic search + LLM-powered chat
#### 2.1 Semantic Search Service
- [ ] Vector similarity search implementation
- [ ] Ranking and filtering logic
- [ ] Full-text search hybrid (SQL + vector)
- [ ] Search result formatting with context
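
One common way to combine a full-text ranking with a vector ranking is reciprocal rank fusion (RRF). The plan does not name a fusion method, so RRF here is an assumption, as are the function name and the conventional constant `k=60`. The sketch takes two or more ranked lists of article IDs and returns one merged ranking.

```python
def reciprocal_rank_fusion(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Merge ranked ID lists; each appearance contributes 1 / (k + rank)."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, article_id in enumerate(ranking, start=1):
            scores[article_id] = scores.get(article_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by article ID for determinism.
    return sorted(scores, key=lambda article_id: (-scores[article_id], article_id))


# Example: IDs ranked by vector similarity and by full-text match.
vector_hits = [3, 1, 2]
text_hits = [1, 4, 3]
merged = reciprocal_rank_fusion([vector_hits, text_hits])
```

RRF only needs ranks, not comparable scores, which sidesteps the problem of putting `ts_rank` values and cosine distances on a common scale.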
#### 2.2 Chat Interface & LLM Integration
- [ ] WebSocket connection for real-time chat
- [ ] Prompt engineering for article synthesis
- [ ] LiteLLM wrapper for Claude/Gemini/ChatGPT
- [ ] Secure API key storage (encrypted)
- [ ] Response streaming to frontend
#### 2.3 URL-Based Submission
- [ ] URL submission endpoint
- [ ] Article metadata fetching from CrossRef/arXiv/PubMed APIs
- [ ] Fallback web scraping for unsupported sources
- [ ] Manual metadata entry as fallback
#### 2.4 Chat Frontend
- [ ] Message display with streaming
- [ ] Chat history persistence
- [ ] Article reference links
- [ ] Follow-up suggestion UI
**Success Metrics**:
- Semantic search returns relevant results (manual validation)
- Chat responses synthesize article findings correctly
- URL submission works for 90%+ of common sources
- Sub-2-second response times for typical queries
---
### Phase 3: Recommendations & Analytics (Weeks 6-7)
**Deliverables**: Article recommendations + trend analysis dashboard
#### 3.1 Recommendation Engine
- [ ] Co-citation analysis
- [ ] Semantic similarity clustering
- [ ] Regional cross-reference detection
- [ ] Related article ranking
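
Related-article ranking by semantic similarity can be illustrated in plain Python. This is only a sketch with hypothetical names: it holds embeddings in a dict and scans them all, whereas the deployed system would more likely delegate the search to pgvector, whose `<=>` operator returns cosine distance (1 minus the similarity computed here).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def related_articles(
    target_id: int, embeddings: dict[int, list[float]], top_k: int = 10
) -> list[tuple[int, float]]:
    """Return up to top_k (article_id, similarity) pairs, most similar first."""
    target = embeddings[target_id]
    scored = [
        (other_id, cosine_similarity(target, vector))
        for other_id, vector in embeddings.items()
        if other_id != target_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```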
#### 3.2 Trend Analysis Service
- [ ] Plant frequency aggregation
- [ ] Regional research patterns
- [ ] Gap identification (plants, regions, properties)
- [ ] Temporal trends (publication year analysis)
#### 3.3 Analytics Dashboard
- [ ] Plant study frequency visualization
- [ ] Regional distribution maps
- [ ] Research gap heatmaps
- [ ] Trend time-series
**Success Metrics**:
- Recommendations identify 10+ related articles per article
- 70%+ user satisfaction on recommendation relevance
- Gap analysis identifies known under-researched areas
---
### Phase 4: CARE Principles & Monitoring (Weeks 8-9)
**Deliverables**: Community governance + publication monitoring
#### 4.1 CARE Implementation
- [ ] Community registration system
- [ ] Knowledge attribution framework
- [ ] Community approval workflows
- [ ] Audit logging
#### 4.2 Publication Monitoring
- [ ] Scheduled journal API polling (CrossRef, arXiv)
- [ ] Ethnobotany query automation
- [ ] Automatic ingestion pipeline
- [ ] Duplicate detection across sources
#### 4.3 Governance UI
- [ ] Community management dashboard
- [ ] Knowledge attribution display
- [ ] Approval status indicators
- [ ] Usage analytics for communities
**Success Metrics**:
- 100% of articles tagged with community origin
- New publications discovered within 7 days
- Automated ingestion succeeds for 90%+ of discoveries
---
### Phase 5: Polish & Deployment (Weeks 10-11)
**Deliverables**: Production-ready system
#### 5.1 Quality Assurance
- [ ] End-to-end integration tests
- [ ] Load testing (50+ concurrent users)
- [ ] Security audit (API keys, authentication)
- [ ] Performance optimization
#### 5.2 Documentation
- [ ] API documentation (OpenAPI/Swagger)
- [ ] User guides (Portuguese + English)
- [ ] Deployment guide
- [ ] Contributing guidelines
#### 5.3 Deployment
- [ ] Production Docker Compose setup
- [ ] Database backup strategy
- [ ] Monitoring and logging (Prometheus, ELK)
- [ ] Disaster recovery procedures
**Success Metrics**:
- 99.5% uptime target
- All endpoints documented
- Deployable in under 30 minutes
---
## Technology Decisions & Justification
### Why SPECTER2 for Embeddings?
**Decision**: Use the SPECTER2 model from AllenAI (Allen Institute for AI) for all article embeddings
**Rationale**:
- Trained on 6M scientific paper triplets across 23 fields (includes botany, medicine, biology)
- Task-adaptive: task-specific adapters produce embeddings tuned to different uses (e.g., similarity search vs. classification)
- Citation network aware: papers close in citation graph are close in embedding space
- Free and open-source (Apache 2.0)
- Supports future fine-tuning on ethnobotany-specific data
**Alternatives Considered**:
- SciBERT: Simpler but less sophisticated for document-level similarity
- Sentence-Transformers: General purpose, requires fine-tuning for scientific content
- OpenAI Embeddings: Excellent quality but proprietary and recurring API costs ($0.02/1M tokens)
**Recommendation**: Use SPECTER2 for MVP. Switch to fine-tuned models or OpenAI only if user feedback indicates insufficient quality.
---
### Why PostgreSQL + pgvector (MVP) + Qdrant (Production)?
**Decision**:
- **MVP**: Single PostgreSQL database with pgvector extension
- **Production**: PostgreSQL for metadata + Qdrant for dedicated vector search
**Rationale for MVP**:
- Single database reduces operational complexity
- pgvector provides native vector support in familiar SQL
- Sufficient for 50k articles (50-100ms query latency acceptable)
- Cost-effective ($5-30/month for self-hosted or managed)
- Hybrid search in single query (SQL filter + vector similarity)
**Rationale for Production**:
- Qdrant provides sub-20ms query latency
- Scales to billions of vectors if needed
- Separates concerns: PostgreSQL (metadata+transactions) vs Qdrant (vectors)
- Both open-source, zero vendor lock-in
- Cost-effective ($9-15/month Qdrant Cloud + PostgreSQL)
**Alternatives Considered**:
- Milvus: Best performance but requires Kubernetes expertise (overkill for 50k articles)
- Weaviate: Feature-rich but higher operational complexity
- Pinecone: Fully managed but expensive ($50-100+/month) and vendor lock-in
---
### Why LiteLLM for Multi-LLM Support?
**Decision**: Use LiteLLM library to abstract LLM provider differences
**Rationale**:
- Single interface for Claude, Gemini, ChatGPT
- Easy provider switching (configuration change only)
- Built-in error handling and retries
- Cost tracking per provider
- Community-maintained, well-tested
**Implementation Pattern**:
```python
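import litellm

# preferred_llm stores 'claude' | 'gemini' | 'chatgpt'; map it to a LiteLLM model ID.
# Example IDs only; check the LiteLLM provider docs for current names.
MODEL_IDS = {"claude": "claude-3-5-sonnet-20240620",
             "gemini": "gemini/gemini-2.5-flash", "chatgpt": "gpt-4o"}

response = litellm.completion(
    model=MODEL_IDS[user_settings.preferred_llm],
    messages=messages,
    api_key=user_api_key,  # the user's own key, decrypted for this request
    temperature=0.7,
    stream=True,
)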
```
**User Cost Model**: Each user provides their own API keys (encrypted storage)
- Claude: $1-3/month for typical usage
- Gemini: $0.10-0.50/month (cheapest)
- ChatGPT: $3-8/month
**Zero platform API costs** with this model.
---
### Why URL-Based Submission?
**Decision**: Support both PDF upload and article URL submission
**Rationale**:
- PDF upload: For offline articles or when users have local copies
- URL submission: Persistent access to original sources, no storage burden
- Metadata APIs (CrossRef, arXiv): Automated extraction, high success rate
- No full-text retention: Respects publisher rights, minimizes storage
**Implementation**:
```
URL Input (DOI/journal link/arXiv)
→ Fetch metadata via API
→ If API fails → Attempt web scraping
→ If both fail → Allow manual entry
→ Extract abstract for embedding
→ Link to original source
```
**Success Target**: 90%+ of URL submissions processed automatically without manual intervention
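
The fallback chain above can be sketched as an ordered list of fetchers tried in turn. Everything here is illustrative: the fetcher callables are hypothetical stand-ins for the CrossRef/arXiv client and the scraper, and the `"manual_entry"` label is an assumed marker telling the caller to show the manual form.

```python
from typing import Callable, Optional

Fetcher = Callable[[str], Optional[dict]]


def resolve_metadata(
    url: str, fetchers: list[tuple[str, Fetcher]]
) -> tuple[Optional[dict], str]:
    """Try each (name, fetcher) in order; return (metadata, method_used)."""
    for name, fetch in fetchers:
        try:
            metadata = fetch(url)
        except Exception:
            # Sketch only: a real service would catch specific errors and log them.
            continue
        if metadata and metadata.get("title"):
            # Always keep the link back to the original source.
            metadata.setdefault("source_url", url)
            return metadata, name
    return None, "manual_entry"
```

Recording which method succeeded also gives a direct way to measure the automatic-processing rate against the 90% target.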
---
## Development Workflow
### Git & Branching (Main-Only)
- All work commits to **main** branch
- No feature branches (per user requirement)
- Atomic, descriptive commits with full context
### Code Organization
```
project/
├── backend/
│   ├── app/
│   │   ├── main.py                  # FastAPI app entry
│   │   ├── models.py                # Pydantic models
│   │   ├── database.py              # SQLAlchemy setup
│   │   ├── api/
│   │   │   ├── articles.py          # Article ingestion endpoints
│   │   │   ├── search.py            # Search endpoints
│   │   │   ├── recommendations.py
│   │   │   ├── trends.py
│   │   │   └── chat.py              # WebSocket chat
│   │   ├── services/
│   │   │   ├── embedding.py         # SPECTER2 service
│   │   │   ├── pdf_extraction.py
│   │   │   ├── url_fetcher.py       # Article URL fetching
│   │   │   ├── metadata.py          # Extraction & enrichment
│   │   │   ├── search_service.py
│   │   │   └── llm_service.py       # LLM integration
│   │   └── utils/
│   │       ├── crypto.py            # API key encryption
│   │       ├── validators.py
│   │       └── logging.py
│   ├── requirements.txt
│   ├── Dockerfile
│   └── .env.example
│
├── frontend/
│   ├── src/
│   │   ├── components/
│   │   │   ├── Upload.tsx
│   │   │   ├── SearchChat.tsx
│   │   │   ├── Recommendations.tsx
│   │   │   └── TrendsDashboard.tsx
│   │   ├── services/
│   │   │   ├── api.ts
│   │   │   └── websocket.ts
│   │   └── pages/
│   ├── package.json
│   └── Dockerfile
│
├── docker-compose.yml
├── postgres-init.sql
└── README.md
```
### Testing Strategy
**Unit Tests**:
- Metadata extraction (PDF, URL)
- Embedding service
- Search ranking
- LLM response formatting
**Integration Tests**:
- End-to-end article upload → search
- URL ingestion → embedding
- Chat context handling
- API key encryption/decryption
**Load Tests**:
- 50 concurrent users
- 100 simultaneous searches
- Batch embedding performance
---
## Deployment & Operations
### Docker Compose (Development)
```yaml
version: '3.8'
services:
  postgres:
    # PostgreSQL 15 with pgvector preinstalled (the stock postgres image lacks the extension)
    image: pgvector/pgvector:pg15
    environment:
      POSTGRES_DB: ethnobotany
      POSTGRES_PASSWORD: dev_password
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./postgres-init.sql:/docker-entrypoint-initdb.d/init.sql
    ports:
      - "5432:5432"
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
  backend:
    build: ./backend
    environment:
      DATABASE_URL: postgresql://postgres:dev_password@postgres:5432/ethnobotany
      REDIS_URL: redis://redis:6379
    ports:
      - "8000:8000"
    depends_on:
      - postgres
      - redis
  frontend:
    build: ./frontend
    ports:
      - "3000:3000"
    depends_on:
      - backend

# Named volume referenced by the postgres service
volumes:
  postgres_data:
```
### Production Deployment
**Recommended Setup**:
- PostgreSQL on managed service (AWS RDS, DigitalOcean, etc.)
- Qdrant Cloud for vector search ($9-15/month)
- Backend on container orchestration (Docker Swarm or a minimal Kubernetes setup)
- Frontend on static hosting (Vercel, Netlify, S3 + CloudFront)
- Cloudflare for CDN and DDoS protection
**Cost Estimate**: $40-80/month for production
---
## Risk Mitigation
| Risk | Impact | Mitigation |
|------|--------|-----------|
| PDF extraction failure on some papers | Blocks article indexing | Implement OCR fallback, allow manual entry |
| URL API rate limits (CrossRef, arXiv) | Slow ingestion | Implement caching, backoff strategy |
| Embedding model quality issues | Poor search results | Allow fine-tuning, fallback to keyword search |
| LLM API outages | Chat unavailable | Fallback to previous responses, offline mode |
| Community approval bottleneck | Slow CARE compliance | Automate binary approvals, escalate complex cases |
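
The backoff strategy listed for API rate limits could look roughly like this. It is a generic exponential-backoff-with-jitter helper, not something taken from the reference architecture; the function name, default delays, and attempt count are assumptions. The `sleep` parameter is injectable so the behaviour can be tested without waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    retry_on: tuple[type[BaseException], ...] = (Exception,),
) -> T:
    """Call `call`, doubling the wait after each failure up to max_delay."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter keeps concurrent workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
    raise ValueError("max_attempts must be at least 1")
```

A CrossRef or arXiv request would be wrapped as `retry_with_backoff(lambda: client.get(url))`, with `retry_on` narrowed to the HTTP client's timeout and rate-limit errors.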
---
## Success Metrics & Milestones
| Phase | Milestone | Success Criteria |
|-------|-----------|------------------|
| 1 | MVP Ready | 10 PDFs ingested, search functional |
| 2 | Chat Live | Semantic search + chat working for 50 articles |
| 3 | Analytics Beta | Recommendations engine running, showing insights |
| 4 | CARE Ready | Community approvals implemented, 100% attributed |
| 5 | Production | 50k articles, 50+ concurrent users, 99.5% uptime |
---
## Next Steps
1. **Backend Setup** (Week 1): Initialize FastAPI, PostgreSQL, pgvector
2. **Frontend Foundation** (Week 1): React project setup, basic components
3. **PDF Ingestion** (Week 2): Implement upload and metadata extraction
4. **Embedding Service** (Week 2): Integrate SPECTER2, batch processing
5. **Search API** (Week 3): Vector similarity search endpoints
6. **Chat Interface** (Week 4): WebSocket, LLM integration
7. **URL Submission** (Week 4): CrossRef API integration
8. **Recommendations** (Week 6): Similarity clustering algorithm
9. **Trends Dashboard** (Week 7): Analytics queries and UI
10. **CARE & Monitoring** (Week 8): Community workflows, publication monitoring
11. **Testing & Deployment** (Week 10): End-to-end validation, containerization
12. **Production Launch** (Week 11): Deploy, documentation, onboarding
---
## Conclusion
This plan provides a clear pathway from MVP to production for the ethnobotany vectorized database system. The architecture prioritizes:
1. **Cost-effectiveness**: Open-source models + databases + user-provided LLM keys
2. **Scalability**: From 50k to 50M articles with minimal architectural changes
3. **User value**: Semantic search + recommendations + trend insights
4. **Ethical governance**: CARE principles integrated from the start
5. **Flexibility**: Multi-LLM support, multiple ingestion methods
The 11-week implementation timeline is achievable with a lean team of 2-3 engineers.