Loading...
Loading...
Loading...
> **Purpose**: This document captures the complete project history and achievements for resume/portfolio purposes.
# AI Code Agent - Project Documentation
> **Purpose**: This document captures the complete project history and achievements for resume/portfolio purposes.
---
## Project Overview
**AI Code Agent** is a production-style AI backend service for semantic code search and retrieval, built with a focus on correctness, explainability, and clean architecture.
**Stack**: Python 3.11, FastAPI, Docker, PostgreSQL + pgvector, OpenAI
**Architecture**: API-first, microservices-ready, containerized
**Status**: Complete RAG (Retrieval-Augmented Generation) system with semantic search and question-answering
---
## Core Functionality Built
### 1. Repository Ingestion Pipeline
- **GitHub Repository Cloning**: Automatic cloning of public repositories
- **Python File Extraction**: Recursively walks file tree, filters only `.py` files
- **Smart Directory Filtering**: Skips `.git`, `__pycache__`, `venv`, `.venv`, `env`, `node_modules`, `dist`, `build`
- **UTF-8 Safe Reading**: Handles encoding errors gracefully
- **File Metadata**: Captures file paths and content
**Implementation**: `backend/app/ingest.py`
### 2. Code Chunking System
- **Character-Based Chunking**: Splits files into 500-character chunks with 75-character overlap
- **Context Preservation**: Overlapping chunks maintain context across boundaries
- **Chunk Metadata**: Each chunk tagged with `repo_name`, `file_path`, `chunk_id`, `text`
- **RAG Preparation**: Structured for future embedding and retrieval
**Implementation**: `backend/app/chunking.py`
**Results**:
- Flask repository (83 files) → 1,416 chunks
- 17x increase in granularity for better retrieval precision
### 3. PostgreSQL Persistence Layer
- **Database Schema**: UUID-based primary keys, composite unique constraints
- **Connection Pooling**: AsyncPG with connection pool (2-10 connections)
- **Automatic Schema Creation**: `CREATE TABLE IF NOT EXISTS` on startup
- **UPSERT Logic**: Handles re-ingestion of same repositories gracefully
- **Indexed Queries**: Repository-based indexing for fast lookups
- **Pagination Support**: Efficient chunk retrieval with limit/offset
**Implementation**: `backend/app/database.py`
**Schema**:
```sql
code_chunks (
id UUID PRIMARY KEY,
repo_name TEXT NOT NULL,
file_path TEXT NOT NULL,
chunk_id INTEGER NOT NULL,
content TEXT NOT NULL,
embedding VECTOR(1536),
created_at TIMESTAMP,
UNIQUE(repo_name, file_path, chunk_id)
)
```
### 4. Semantic Search with Embeddings
- **OpenAI Integration**: Uses `text-embedding-3-small` model (1536 dimensions)
- **Batch Embedding Generation**: Efficient API calls for large repositories
- **Vector Storage**: pgvector extension for high-dimensional vector storage
- **HNSW Indexing**: Hierarchical Navigable Small World index for fast similarity search
- **Cosine Similarity**: Vector distance calculation using `<->` operator
- **Semantic Query**: Natural language queries converted to embeddings
**Implementation**:
- `backend/app/embeddings.py` - Embedding generation
- `backend/app/database.py` - Vector storage and similarity search
**Performance**:
- Ingestion: ~30-60 seconds for Flask repo (1,416 chunks)
- Search: ~0.01 seconds for similarity queries (with HNSW index)
### 6. RAG (Retrieval-Augmented Generation) Pipeline
- **Question Embedding**: Converts natural language questions to vector embeddings
- **Semantic Retrieval**: Uses vector similarity search to find relevant code chunks
- **Test File Filtering**: Automatically filters out test files (`tests/`, `test_`) if non-test chunks are available
- **Context Assembly**: Assembles retrieved chunks into formatted prompt with file paths
- **LLM Generation**: Uses OpenAI Chat Completions (gpt-4o-mini) to generate grounded answers
- **Strict Grounding**: System prompt enforces "answer only from context" to prevent hallucinations
- **Source Citations**: Returns file paths, chunk IDs, and distance scores for transparency
- **Distance-Based Ranking**: Uses cosine distance (lower = more similar) instead of similarity scores
**Implementation**: `backend/app/rag.py`
**Features**:
- Retrieves 2x chunks initially to allow filtering (e.g., top_k=5 retrieves 10, filters to 5)
- Prefers source code over test files for better answer quality
- Falls back to test files if no source code matches
- Returns distance scores (0-2 range, lower is better) for result ranking
- Handles empty results gracefully with informative messages
**Performance**:
- RAG query: ~2-5 seconds (includes embedding generation + LLM call)
- Filtering overhead: Negligible (<1ms)
### 5. RESTful API Endpoints
#### `GET /health`
- Health check endpoint
- Returns: `{"status": "healthy"}`
#### `POST /ingest`
- **Input**: `{"repo_url": "https://github.com/user/repo"}`
- **Process**: Clone → Extract → Chunk → Generate Embeddings → Store
- **Output**: `{"repo_name": "user/repo", "chunk_count": 1416, "status": "ingested_with_embeddings"}`
- **Duration**: 30-60 seconds (includes OpenAI API calls)
#### `GET /repos/{repo_name}/chunks`
- **Query Params**: `limit` (default 100), `offset` (default 0)
- **Returns**: Chunk metadata with 200-character content preview
- **Includes**: Statistics (chunk_count, file_count, last_updated), pagination info
#### `POST /search`
- **Input**: `{"repo_name": "user/repo", "query": "authentication middleware", "top_k": 5}`
- **Process**: Convert query → embedding → vector similarity search
- **Output**: Top K most semantically similar chunks with distance scores (lower = more similar)
- **Response Time**: ~0.01 seconds
#### `POST /ask` (RAG - Retrieval-Augmented Generation)
- **Input**: `{"repo_name": "user/repo", "question": "How does Flask handle routing?", "top_k": 5}`
- **Process**:
1. Generate embedding for question
2. Retrieve top-k relevant code chunks (semantic search)
3. Filter out test files (deprioritizes `tests/`, `test_` patterns)
4. Assemble grounded prompt with retrieved context
5. Generate answer using OpenAI Chat Completions (gpt-4o-mini)
6. Return answer with source citations
- **Output**:
- `answer`: Generated answer grounded in retrieved code
- `sources`: List of source chunks with `file_path`, `chunk_id`, `distance`
- **Features**:
- Strict grounding: System prompt enforces "answer only from context"
- Test file filtering: Prefers source code over test files
- Distance-based ranking: Returns cosine distance (lower = more similar)
- Source citations: Transparent attribution to code chunks used
- **Response Time**: ~2-5 seconds (includes LLM generation)
**Implementation**:
- `backend/app/main.py` - API endpoints
- `backend/app/rag.py` - RAG pipeline logic
---
## Technical Achievements
### Infrastructure
- ✅ **Docker Compose Setup**: Multi-container orchestration (FastAPI + PostgreSQL)
- ✅ **Environment Management**: Secure `.env` file handling with `.gitignore`
- ✅ **Database Migrations**: Automatic schema updates with backwards compatibility
- ✅ **Connection Pooling**: Efficient database connection management
- ✅ **Container Lifecycle**: Proper startup/shutdown handlers
### Data Processing
- ✅ **Batch Operations**: Efficient embedding generation (batch API calls)
- ✅ **Error Handling**: Graceful UTF-8 decoding, missing file handling
- ✅ **Data Validation**: Pydantic models for API request validation
- ✅ **Type Safety**: Full type hints throughout codebase
### Vector Search
- ✅ **pgvector Integration**: Native PostgreSQL vector extension
- ✅ **HNSW Index**: Logarithmic-time similarity search
- ✅ **Semantic Understanding**: Meaning-based code discovery (not keyword-based)
- ✅ **Scalability**: Designed for repositories with 100K+ chunks
- ✅ **Test File Filtering**: Automatically deprioritizes test files in retrieval
- ✅ **Distance-Based Ranking**: Returns cosine distance scores (lower = more similar)
### Code Quality
- ✅ **Clean Architecture**: Separation of concerns (ingest, chunking, embeddings, database)
- ✅ **Documentation**: Comprehensive docstrings explaining purpose and design
- ✅ **Minimal Dependencies**: Only essential libraries (FastAPI, asyncpg, openai, gitpython)
- ✅ **No Premature Optimization**: Simple, readable code that works correctly
---
## Project Statistics
### Repository Tested
- **Flask** (pallets/flask): 83 Python files, 1,416 chunks after chunking
### Performance Metrics
- **Ingestion Time**: ~30-60 seconds (includes OpenAI API calls)
- **Search Latency**: ~0.01 seconds (with HNSW index)
- **Chunk Size**: 500 characters with 75-character overlap
- **Embedding Dimension**: 1536 (OpenAI text-embedding-3-small)
### Code Metrics
- **Files Created**: 7 Python modules (main, ingest, chunking, embeddings, database, rag)
- **Lines of Code**: ~900 (including documentation)
- **API Endpoints**: 4 (health, ingest, search, ask, chunk retrieval)
- **Database Tables**: 1 (code_chunks with vector column)
---
## Technology Stack
### Backend
- **Python 3.11**: Modern Python with type hints
- **FastAPI**: High-performance async web framework
- **Uvicorn**: ASGI server with async support
- **AsyncPG**: Async PostgreSQL driver
### Database
- **PostgreSQL 16**: Production-grade relational database
- **pgvector**: Vector similarity search extension
- **HNSW Index**: Fast approximate nearest neighbor search
### AI/ML
- **OpenAI API**: Text embeddings (text-embedding-3-small) and Chat Completions (gpt-4o-mini)
- **Vector Embeddings**: 1536-dimensional semantic representations
- **RAG Pipeline**: Retrieval-Augmented Generation for grounded question-answering
### DevOps
- **Docker**: Containerization
- **Docker Compose**: Multi-container orchestration
- **Git**: Version control
### Libraries
- **GitPython**: Repository cloning
- **Pydantic**: Data validation
---
## Architecture Decisions
### Why Character-Based Chunking?
- Simple, predictable results
- Fast processing
- Can be improved later with AST-aware chunking
- Good baseline for RAG systems
### Why OpenAI Embeddings?
- High-quality semantic understanding
- Pre-trained on code datasets
- Cost-effective (text-embedding-3-small)
- Ready-to-use (no training required)
### Why pgvector?
- Native PostgreSQL integration
- Efficient HNSW indexing
- Standard tool for vector similarity search
- No separate vector database needed
### Why FastAPI?
- Async support (critical for I/O-bound operations)
- Automatic API documentation
- Type safety with Pydantic
- High performance
### Why Individual Inserts (Not Batch)?
- Better asyncpg vector type handling
- Simpler error handling
- Acceptable performance for ingestion (not user-facing)
---
## Current Capabilities
### What It Does
1. **Clone GitHub repositories** automatically
2. **Extract and chunk Python files** into searchable pieces
3. **Generate semantic embeddings** for each chunk
4. **Store in PostgreSQL** with vector support
5. **Enable semantic search** ("find authentication code" → returns relevant chunks)
6. **Answer questions about codebases** using RAG (Retrieval-Augmented Generation)
7. **Filter test files** automatically (deprioritizes test code in retrieval)
8. **Retrieve chunks** by repository with pagination
9. **Generate grounded answers** with source citations
### What It Doesn't Do (By Design)
- ❌ LangChain/LangGraph integration (not yet - will use LangGraph later)
- ❌ AutoGPT-style autonomous agents (avoided by design)
- ❌ AST-aware chunking (currently character-based, can be improved)
- ❌ Multi-language support (Python only currently)
- ❌ Code execution or modification (read-only system)
- ❌ Multi-repo cross-repository search (single repo at a time)
---
## Future Roadmap (Not Implemented Yet)
### Phase 1: RAG Implementation ✅ **COMPLETE**
- ✅ LLM integration for answer generation
- ✅ Context assembly from retrieved chunks
- ✅ Query → Retrieve → Generate pipeline
- ✅ Test file filtering
- ✅ Source citation tracking
### Phase 2: LangGraph Orchestration
- State machine for complex workflows
- Multi-step reasoning
- Agentic capabilities
### Phase 3: Production Features
- Authentication/authorization
- Rate limiting
- Caching layer
- Monitoring/logging
- Multi-repo search
- Hybrid search (keyword + semantic)
---
## Key Files Structure
```
ai-code-agent/
├── backend/
│ ├── app/
│ │ ├── main.py # FastAPI app, API endpoints
│ │ ├── ingest.py # Repository cloning and file extraction
│ │ ├── chunking.py # Text chunking utilities
│ │ ├── embeddings.py # OpenAI embedding generation
│ │ ├── database.py # PostgreSQL operations, vector search
│ │ └── rag.py # RAG pipeline (retrieval + generation)
│ ├── Dockerfile # Python 3.11, git, dependencies
│ └── requirements.txt # Dependencies
├── docker-compose.yml # Backend + PostgreSQL services
├── .env # Environment variables (API keys)
└── .gitignore # Excludes .env, __pycache__, etc.
```
---
## Resume-Ready Accomplishments
### Technical Skills Demonstrated
- ✅ **Backend Development**: FastAPI, async Python, RESTful APIs
- ✅ **Database Design**: PostgreSQL schema design, indexing strategies
- ✅ **Vector Databases**: pgvector, embedding storage, similarity search
- ✅ **AI/ML Integration**: OpenAI API, semantic embeddings, RAG pipeline (complete)
- ✅ **LLM Integration**: OpenAI Chat Completions, prompt engineering, grounded generation
- ✅ **DevOps**: Docker, Docker Compose, container orchestration
- ✅ **Version Control**: Git, .gitignore best practices
- ✅ **Code Architecture**: Clean separation of concerns, modular design
- ✅ **Performance Optimization**: Database indexing, batch processing, connection pooling
- ✅ **RAG Implementation**: Retrieval-Augmented Generation with strict grounding
### Project Highlights
- **Built complete semantic code search system** from scratch
- **Integrated OpenAI embeddings** with custom ingestion pipeline
- **Designed scalable vector search** using pgvector and HNSW indexing
- **Implemented full RAG pipeline** with question-answering capabilities
- **Achieved sub-100ms search latency** for similarity queries
- **Handled large-scale ingestion** (1,416+ chunks per repository)
- **Implemented production-ready** error handling and validation
- **Added intelligent filtering** to deprioritize test files in retrieval
- **Generated grounded answers** with source citations and strict hallucination prevention
### Technical Depth
- **Vector Similarity Search**: Implemented cosine distance calculations with pgvector
- **Embedding Pipeline**: Batch processing of 1,000+ text chunks to vectors
- **Database Migrations**: Automated schema updates with backwards compatibility
- **Async Programming**: Full async/await pattern for I/O-bound operations
- **API Design**: RESTful endpoints with proper error handling and validation
---
## Testing Examples
### Ingestion
```bash
curl -X POST http://localhost:8000/ingest \
-H "Content-Type: application/json" \
-d '{"repo_url": "https://github.com/pallets/flask"}'
# Result: 1,416 chunks ingested with embeddings
```
### Semantic Search
```bash
curl -X POST http://localhost:8000/search \
-H "Content-Type: application/json" \
-d '{"repo_name": "pallets/flask", "query": "error handling", "top_k": 5}'
# Result: Top 5 semantically similar code chunks with distance scores
```
### RAG Question-Answering
```bash
curl -X POST http://localhost:8000/ask \
-H "Content-Type: application/json" \
-d '{"repo_name": "pallets/flask", "question": "How does Flask handle routing?", "top_k": 5}'
# Result: Generated answer with source citations
```
---
## Lessons Learned / Design Decisions
1. **Start Simple**: Character-based chunking first, can improve later
2. **Use Managed Services**: OpenAI embeddings instead of training own model
3. **Leverage Specialized Tools**: pgvector for vector search instead of custom implementation
4. **Async by Default**: FastAPI + asyncpg for I/O-bound operations
5. **Migration Strategy**: Always add backwards-compatible schema changes
6. **Vector Format Handling**: AsyncPG requires explicit string formatting for vectors
---
## Notes for Resume/Interview
### What to Emphasize
- **Full-Stack AI Backend**: Designed and implemented complete semantic search and RAG system
- **Production-Ready**: Docker, error handling, database migrations, API validation
- **Performance Focused**: Sub-100ms search, efficient batch processing, database indexing
- **Clean Architecture**: Modular design, separation of concerns, well-documented
- **Complete RAG Pipeline**: Retrieval + generation with strict grounding and source citations
- **Intelligent Filtering**: Automatic test file deprioritization for better answer quality
### Technical Talking Points
- "Implemented vector similarity search using pgvector with HNSW indexing, achieving logarithmic-time queries"
- "Designed embedding pipeline processing 1,400+ code chunks with batch API calls to OpenAI"
- "Built complete RAG system with retrieval-augmented generation, strict grounding, and source citations"
- "Developed async REST API with FastAPI handling repository ingestion, semantic search, and question-answering"
- "Architected PostgreSQL schema with vector column and automatic migrations for embedding storage"
- "Implemented intelligent test file filtering to improve answer quality by prioritizing source code"
---
## Date Created
January 2025
## Project Status
✅ **Functional**: Complete RAG system with ingestion, search, and question-answering
✅ **RAG Complete**: Retrieval-Augmented Generation pipeline fully implemented
📋 **Planned**: LangGraph orchestration, frontend UI, production features
---
## For Another AI/Developer
This project is a **complete RAG-powered code search and question-answering system** that:
1. Ingests GitHub repositories automatically
2. Chunks Python files intelligently (500 chars with 75-char overlap)
3. Generates embeddings for semantic understanding (OpenAI text-embedding-3-small)
4. Stores in PostgreSQL with pgvector and HNSW indexing
5. Enables meaning-based code search (not keyword-based)
6. Answers questions about codebases using RAG with strict grounding
7. Filters test files automatically to improve answer quality
8. Returns source citations for transparency
**Key Features:**
- **RAG Pipeline**: Complete retrieval-augmented generation with OpenAI Chat Completions
- **Test File Filtering**: Automatically deprioritizes test files in retrieval
- **Distance-Based Ranking**: Returns cosine distance scores (lower = more similar)
- **Strict Grounding**: System prompts enforce "answer only from context" to prevent hallucinations
- **Source Citations**: Every answer includes file paths and chunk IDs used
The architecture is clean, documented, and production-ready. All code follows best practices with error handling, type hints, and comprehensive documentation. The RAG system is fully functional and can answer questions about any ingested codebase.
This roadmap outlines planned enhancements to transform cheap-RAG from a functional document retrieval system into a production-ready, state-of-the-art RAG framework. Priorities are based on impact vs. effort analysis and alignment with mainstream RAG best practices.
See `specs/Semblance-MVP-Plan-v2.md` for full technical specification.
All notable changes to AvocadoDB will be documented in this file.
**Goal:** Stand up Toasty as a reliable service wired to BLT/GitHub events; deliver safe, useful summaries early.