
# AI Code Agent - Project Documentation

> **Purpose**: This document captures the complete project history and achievements for resume/portfolio purposes.

---

## Project Overview

**AI Code Agent** is a production-style AI backend service for semantic code search and retrieval, built with a focus on correctness, explainability, and clean architecture.

**Stack**: Python 3.11, FastAPI, Docker, PostgreSQL + pgvector, OpenAI

**Architecture**: API-first, microservices-ready, containerized

**Status**: Complete RAG (Retrieval-Augmented Generation) system with semantic search and question-answering

---

## Core Functionality Built

### 1. Repository Ingestion Pipeline

- **GitHub Repository Cloning**: Automatic cloning of public repositories
- **Python File Extraction**: Recursively walks file tree, filters only `.py` files
- **Smart Directory Filtering**: Skips `.git`, `__pycache__`, `venv`, `.venv`, `env`, `node_modules`, `dist`, `build`
- **UTF-8 Safe Reading**: Handles encoding errors gracefully
- **File Metadata**: Captures file paths and content

**Implementation**: `backend/app/ingest.py`

### 2. Code Chunking System

- **Character-Based Chunking**: Splits files into 500-character chunks with 75-character overlap
- **Context Preservation**: Overlapping chunks maintain context across boundaries
- **Chunk Metadata**: Each chunk tagged with `repo_name`, `file_path`, `chunk_id`, `text`
- **RAG Preparation**: Structured for future embedding and retrieval

**Implementation**: `backend/app/chunking.py`

**Results**:

- Flask repository (83 files) → 1,416 chunks
- 17x increase in granularity for better retrieval precision

### 3. PostgreSQL Persistence Layer

- **Database Schema**: UUID-based primary keys, composite unique constraints
- **Connection Pooling**: AsyncPG with connection pool (2-10 connections)
- **Automatic Schema Creation**: `CREATE TABLE IF NOT EXISTS` on startup
- **UPSERT Logic**: Handles re-ingestion of same repositories gracefully
- **Indexed Queries**: Repository-based indexing for fast lookups
- **Pagination Support**: Efficient chunk retrieval with limit/offset

**Implementation**: `backend/app/database.py`

**Schema**:

```sql
CREATE TABLE IF NOT EXISTS code_chunks (
    id UUID PRIMARY KEY,
    repo_name TEXT NOT NULL,
    file_path TEXT NOT NULL,
    chunk_id INTEGER NOT NULL,
    content TEXT NOT NULL,
    embedding VECTOR(1536),
    created_at TIMESTAMP,
    UNIQUE (repo_name, file_path, chunk_id)
);
```

### 4. Semantic Search with Embeddings

- **OpenAI Integration**: Uses `text-embedding-3-small` model (1536 dimensions)
- **Batch Embedding Generation**: Efficient API calls for large repositories
- **Vector Storage**: pgvector extension for high-dimensional vector storage
- **HNSW Indexing**: Hierarchical Navigable Small World index for fast similarity search
- **Cosine Distance**: Vector distance calculation using pgvector's cosine-distance operator `<=>` (`<->` is L2 distance)
- **Semantic Query**: Natural language queries converted to embeddings

**Implementation**:

- `backend/app/embeddings.py` - Embedding generation
- `backend/app/database.py` - Vector storage and similarity search

**Performance**:

- Ingestion: ~30-60 seconds for Flask repo (1,416 chunks)
- Search: ~0.01 seconds for similarity queries (with HNSW index)

### 5. RAG (Retrieval-Augmented Generation) Pipeline

- **Question Embedding**: Converts natural language questions to vector embeddings
- **Semantic Retrieval**: Uses vector similarity search to find relevant code chunks
- **Test File Filtering**: Automatically filters out test files (`tests/`, `test_`) if non-test chunks are available
- **Context Assembly**: Assembles retrieved chunks into formatted prompt with file paths
- **LLM Generation**: Uses OpenAI Chat Completions (gpt-4o-mini) to generate grounded answers
- **Strict Grounding**: System prompt enforces "answer only from context" to reduce hallucinations
- **Source Citations**: Returns file paths, chunk IDs, and distance scores for transparency
- **Distance-Based Ranking**: Uses cosine distance (lower = more similar) instead of similarity scores

**Implementation**: `backend/app/rag.py`

**Features**:

- Retrieves 2x chunks initially to allow filtering (e.g., top_k=5 retrieves 10, filters to 5)
- Prefers source code over test files for better answer quality
- Falls back to test files if no source code matches
- Returns distance scores (0-2 range, lower is better) for result ranking
- Handles empty results gracefully with informative messages

**Performance**:

- RAG query: ~2-5 seconds (includes embedding generation + LLM call)
- Filtering overhead: Negligible (<1ms)

### 6. RESTful API Endpoints

#### `GET /health`

- Health check endpoint
- Returns: `{"status": "healthy"}`

#### `POST /ingest`

- **Input**: `{"repo_url": "https://github.com/user/repo"}`
- **Process**: Clone → Extract → Chunk → Generate Embeddings → Store
- **Output**: `{"repo_name": "user/repo", "chunk_count": 1416, "status": "ingested_with_embeddings"}`
- **Duration**: 30-60 seconds (includes OpenAI API calls)

#### `GET /repos/{repo_name}/chunks`

- **Query Params**: `limit` (default 100), `offset` (default 0)
- **Returns**: Chunk metadata with 200-character content preview
- **Includes**: Statistics (chunk_count, file_count, last_updated), pagination info

#### `POST /search`

- **Input**: `{"repo_name": "user/repo", "query": "authentication middleware", "top_k": 5}`
- **Process**: Convert query → embedding → vector similarity search
- **Output**: Top K most semantically similar chunks with distance scores (lower = more similar)
- **Response Time**: ~0.01 seconds

#### `POST /ask` (RAG - Retrieval-Augmented Generation)

- **Input**: `{"repo_name": "user/repo", "question": "How does Flask handle routing?", "top_k": 5}`
- **Process**:
  1. Generate embedding for question
  2. Retrieve top-k relevant code chunks (semantic search)
  3. Filter out test files (deprioritizes `tests/`, `test_` patterns)
  4. Assemble grounded prompt with retrieved context
  5. Generate answer using OpenAI Chat Completions (gpt-4o-mini)
  6. Return answer with source citations
- **Output**:
  - `answer`: Generated answer grounded in retrieved code
  - `sources`: List of source chunks with `file_path`, `chunk_id`, `distance`
- **Features**:
  - Strict grounding: System prompt enforces "answer only from context"
  - Test file filtering: Prefers source code over test files
  - Distance-based ranking: Returns cosine distance (lower = more similar)
  - Source citations: Transparent attribution to code chunks used
- **Response Time**: ~2-5 seconds (includes LLM generation)

**Implementation**:

- `backend/app/main.py` - API endpoints
- `backend/app/rag.py` - RAG pipeline logic

---

## Technical Achievements

### Infrastructure

- ✅ **Docker Compose Setup**: Multi-container orchestration (FastAPI + PostgreSQL)
- ✅ **Environment Management**: Secure `.env` file handling with `.gitignore`
- ✅ **Database Migrations**: Automatic schema updates with backwards compatibility
- ✅ **Connection Pooling**: Efficient database connection management
- ✅ **Container Lifecycle**: Proper startup/shutdown handlers

### Data Processing

- ✅ **Batch Operations**: Efficient embedding generation (batch API calls)
- ✅ **Error Handling**: Graceful UTF-8 decoding, missing file handling
- ✅ **Data Validation**: Pydantic models for API request validation
- ✅ **Type Safety**: Full type hints throughout codebase

### Vector Search

- ✅ **pgvector Integration**: Native PostgreSQL vector extension
- ✅ **HNSW Index**: Approximate nearest-neighbor search with roughly logarithmic query time
- ✅ **Semantic Understanding**: Meaning-based code discovery (not keyword-based)
- ✅ **Scalability**: Designed for repositories with 100K+ chunks
- ✅ **Test File Filtering**: Automatically deprioritizes test files in retrieval
- ✅ **Distance-Based Ranking**: Returns cosine distance scores (lower = more similar)

### Code Quality

- ✅ **Clean Architecture**: Separation of concerns (ingest, chunking, embeddings, database)
- ✅ **Documentation**: Comprehensive docstrings explaining purpose and design
- ✅ **Minimal Dependencies**: Only essential libraries (FastAPI, asyncpg, openai, gitpython)
- ✅ **No Premature Optimization**: Simple, readable code that works correctly

---

## Project Statistics

### Repository Tested

- **Flask** (pallets/flask): 83 Python files, 1,416 chunks after chunking

### Performance Metrics

- **Ingestion Time**: ~30-60 seconds (includes OpenAI API calls)
- **Search Latency**: ~0.01 seconds (with HNSW index)
- **Chunk Size**: 500 characters with 75-character overlap
- **Embedding Dimension**: 1536 (OpenAI text-embedding-3-small)

### Code Metrics

- **Files Created**: 6 Python modules (main, ingest, chunking, embeddings, database, rag)
- **Lines of Code**: ~900 (including documentation)
- **API Endpoints**: 5 (health, ingest, search, ask, chunk retrieval)
- **Database Tables**: 1 (code_chunks with vector column)

---

## Technology Stack

### Backend

- **Python 3.11**: Modern Python with type hints
- **FastAPI**: High-performance async web framework
- **Uvicorn**: ASGI server with async support
- **AsyncPG**: Async PostgreSQL driver

### Database

- **PostgreSQL 16**: Production-grade relational database
- **pgvector**: Vector similarity search extension
- **HNSW Index**: Fast approximate nearest neighbor search

### AI/ML

- **OpenAI API**: Text embeddings (text-embedding-3-small) and Chat Completions (gpt-4o-mini)
- **Vector Embeddings**: 1536-dimensional semantic representations
- **RAG Pipeline**: Retrieval-Augmented Generation for grounded question-answering

### DevOps

- **Docker**: Containerization
- **Docker Compose**: Multi-container orchestration
- **Git**: Version control

### Libraries

- **GitPython**: Repository cloning
- **Pydantic**: Data validation

---

## Architecture Decisions

### Why Character-Based Chunking?

- Simple, predictable results
- Fast processing
- Can be improved later with AST-aware chunking
- Good baseline for RAG systems

### Why OpenAI Embeddings?
- High-quality semantic understanding
- Pre-trained on code datasets
- Cost-effective (text-embedding-3-small)
- Ready-to-use (no training required)

### Why pgvector?

- Native PostgreSQL integration
- Efficient HNSW indexing
- Standard tool for vector similarity search
- No separate vector database needed

### Why FastAPI?

- Async support (critical for I/O-bound operations)
- Automatic API documentation
- Type safety with Pydantic
- High performance

### Why Individual Inserts (Not Batch)?

- Better asyncpg vector type handling
- Simpler error handling
- Acceptable performance for ingestion (not user-facing)

---

## Current Capabilities

### What It Does

1. **Clone GitHub repositories** automatically
2. **Extract and chunk Python files** into searchable pieces
3. **Generate semantic embeddings** for each chunk
4. **Store in PostgreSQL** with vector support
5. **Enable semantic search** ("find authentication code" → returns relevant chunks)
6. **Answer questions about codebases** using RAG (Retrieval-Augmented Generation)
7. **Filter test files** automatically (deprioritizes test code in retrieval)
8. **Retrieve chunks** by repository with pagination
9. **Generate grounded answers** with source citations

### What It Doesn't Do (By Design)

- ❌ LangChain/LangGraph integration (not yet - will use LangGraph later)
- ❌ AutoGPT-style autonomous agents (avoided by design)
- ❌ AST-aware chunking (currently character-based, can be improved)
- ❌ Multi-language support (Python only currently)
- ❌ Code execution or modification (read-only system)
- ❌ Cross-repository search (single repo at a time)

---

## Future Roadmap (Not Implemented Yet)

### Phase 1: RAG Implementation ✅ **COMPLETE**

- ✅ LLM integration for answer generation
- ✅ Context assembly from retrieved chunks
- ✅ Query → Retrieve → Generate pipeline
- ✅ Test file filtering
- ✅ Source citation tracking

### Phase 2: LangGraph Orchestration

- State machine for complex workflows
- Multi-step reasoning
- Agentic capabilities

### Phase 3: Production Features

- Authentication/authorization
- Rate limiting
- Caching layer
- Monitoring/logging
- Multi-repo search
- Hybrid search (keyword + semantic)

---

## Key Files Structure

```
ai-code-agent/
├── backend/
│   ├── app/
│   │   ├── main.py           # FastAPI app, API endpoints
│   │   ├── ingest.py         # Repository cloning and file extraction
│   │   ├── chunking.py       # Text chunking utilities
│   │   ├── embeddings.py     # OpenAI embedding generation
│   │   ├── database.py       # PostgreSQL operations, vector search
│   │   └── rag.py            # RAG pipeline (retrieval + generation)
│   ├── Dockerfile            # Python 3.11, git, dependencies
│   └── requirements.txt      # Dependencies
├── docker-compose.yml        # Backend + PostgreSQL services
├── .env                      # Environment variables (API keys)
└── .gitignore                # Excludes .env, __pycache__, etc.
```

---

## Resume-Ready Accomplishments

### Technical Skills Demonstrated

- ✅ **Backend Development**: FastAPI, async Python, RESTful APIs
- ✅ **Database Design**: PostgreSQL schema design, indexing strategies
- ✅ **Vector Databases**: pgvector, embedding storage, similarity search
- ✅ **AI/ML Integration**: OpenAI API, semantic embeddings, RAG pipeline (complete)
- ✅ **LLM Integration**: OpenAI Chat Completions, prompt engineering, grounded generation
- ✅ **DevOps**: Docker, Docker Compose, container orchestration
- ✅ **Version Control**: Git, .gitignore best practices
- ✅ **Code Architecture**: Clean separation of concerns, modular design
- ✅ **Performance Optimization**: Database indexing, batch processing, connection pooling
- ✅ **RAG Implementation**: Retrieval-Augmented Generation with strict grounding

### Project Highlights

- **Built complete semantic code search system** from scratch
- **Integrated OpenAI embeddings** with custom ingestion pipeline
- **Designed scalable vector search** using pgvector and HNSW indexing
- **Implemented full RAG pipeline** with question-answering capabilities
- **Achieved sub-100ms search latency** for similarity queries
- **Handled large-scale ingestion** (1,416+ chunks per repository)
- **Implemented production-ready** error handling and validation
- **Added intelligent filtering** to deprioritize test files in retrieval
- **Generated grounded answers** with source citations and strict grounding to reduce hallucinations

### Technical Depth

- **Vector Similarity Search**: Implemented cosine distance calculations with pgvector
- **Embedding Pipeline**: Batch processing of 1,000+ text chunks to vectors
- **Database Migrations**: Automated schema updates with backwards compatibility
- **Async Programming**: Full async/await pattern for I/O-bound operations
- **API Design**: RESTful endpoints with proper error handling and validation

---

## Testing Examples

### Ingestion

```bash
curl -X POST http://localhost:8000/ingest \
  -H "Content-Type: application/json" \
  -d '{"repo_url": "https://github.com/pallets/flask"}'
# Result: 1,416 chunks ingested with embeddings
```

### Semantic Search

```bash
curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"repo_name": "pallets/flask", "query": "error handling", "top_k": 5}'
# Result: Top 5 semantically similar code chunks with distance scores
```

### RAG Question-Answering

```bash
curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"repo_name": "pallets/flask", "question": "How does Flask handle routing?", "top_k": 5}'
# Result: Generated answer with source citations
```

---

## Lessons Learned / Design Decisions

1. **Start Simple**: Character-based chunking first, can improve later
2. **Use Managed Services**: OpenAI embeddings instead of training own model
3. **Leverage Specialized Tools**: pgvector for vector search instead of custom implementation
4. **Async by Default**: FastAPI + asyncpg for I/O-bound operations
5. **Migration Strategy**: Always add backwards-compatible schema changes
6. **Vector Format Handling**: AsyncPG requires explicit string formatting for vectors

---

## Notes for Resume/Interview

### What to Emphasize

- **Full-Stack AI Backend**: Designed and implemented complete semantic search and RAG system
- **Production-Ready**: Docker, error handling, database migrations, API validation
- **Performance Focused**: Sub-100ms search, efficient batch processing, database indexing
- **Clean Architecture**: Modular design, separation of concerns, well-documented
- **Complete RAG Pipeline**: Retrieval + generation with strict grounding and source citations
- **Intelligent Filtering**: Automatic test file deprioritization for better answer quality

### Technical Talking Points

- "Implemented vector similarity search using pgvector with HNSW indexing, achieving logarithmic-time queries"
- "Designed embedding pipeline processing 1,400+ code chunks with batch API calls to OpenAI"
- "Built complete RAG system with retrieval-augmented generation, strict grounding, and source citations"
- "Developed async REST API with FastAPI handling repository ingestion, semantic search, and question-answering"
- "Architected PostgreSQL schema with vector column and automatic migrations for embedding storage"
- "Implemented intelligent test file filtering to improve answer quality by prioritizing source code"

---

## Date Created

January 2025

## Project Status

- ✅ **Functional**: Complete RAG system with ingestion, search, and question-answering
- ✅ **RAG Complete**: Retrieval-Augmented Generation pipeline fully implemented
- 📋 **Planned**: LangGraph orchestration, frontend UI, production features

---

## For Another AI/Developer

This project is a **complete RAG-powered code search and question-answering system** that:

1. Ingests GitHub repositories automatically
2. Chunks Python files into overlapping pieces (500 chars with 75-char overlap)
3. Generates embeddings for semantic understanding (OpenAI text-embedding-3-small)
4. Stores in PostgreSQL with pgvector and HNSW indexing
5. Enables meaning-based code search (not keyword-based)
6. Answers questions about codebases using RAG with strict grounding
7. Filters test files automatically to improve answer quality
8. Returns source citations for transparency

**Key Features:**

- **RAG Pipeline**: Complete retrieval-augmented generation with OpenAI Chat Completions
- **Test File Filtering**: Automatically deprioritizes test files in retrieval
- **Distance-Based Ranking**: Returns cosine distance scores (lower = more similar)
- **Strict Grounding**: System prompts enforce "answer only from context" to reduce hallucinations
- **Source Citations**: Every answer includes file paths and chunk IDs used

The architecture is clean, documented, and production-ready. All code follows best practices with error handling, type hints, and comprehensive documentation. The RAG system is fully functional and can answer questions about any ingested codebase.
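
The character-based chunking summarized above (500-character chunks with a 75-character overlap) can be illustrated with a minimal sketch. This is not the code in `backend/app/chunking.py`; the function name `chunk_text` and its parameters are hypothetical, and the real module also tags each chunk with `repo_name` and `file_path`.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 75) -> list[dict]:
    """Split text into fixed-size character chunks that overlap by `overlap` characters.

    Hypothetical helper for illustration; defaults mirror the sizes quoted in this document.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # each chunk starts `step` characters after the previous one
    chunks = []
    for chunk_id, start in enumerate(range(0, len(text), step)):
        chunks.append({"chunk_id": chunk_id, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks
```

With the defaults, consecutive chunks start 425 characters apart, so the last 75 characters of one chunk are repeated at the start of the next. That repetition is what keeps a statement that straddles a boundary visible in at least one chunk.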

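
The test file filtering described in this document (retrieve 2x candidates, prefer source code, fall back to test files when nothing else matches) could look roughly like the sketch below. The helper names `is_test_path` and `prefer_source_chunks` are hypothetical, and the exact path-matching rule is an assumption based on the `tests/` and `test_` patterns mentioned above; the logic in `backend/app/rag.py` may differ.

```python
def is_test_path(file_path: str) -> bool:
    """Assumed rule: a file is a test file if it lives under a `tests` directory
    or its file name starts with `test_`."""
    parts = file_path.replace("\\", "/").split("/")
    return "tests" in parts[:-1] or parts[-1].startswith("test_")


def prefer_source_chunks(candidates: list[dict], top_k: int) -> list[dict]:
    """Pick up to `top_k` chunks from candidates already sorted by ascending distance.

    Non-test chunks are preferred; test chunks are returned only when no
    non-test chunk is available.
    """
    source = [c for c in candidates if not is_test_path(c["file_path"])]
    if source:
        return source[:top_k]
    return candidates[:top_k]  # fallback: only test files matched
```

A caller would typically fetch `2 * top_k` candidates from the vector search and pass them through `prefer_source_chunks`, so that dropping test files still leaves enough results to fill the answer context.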