
# Loop #1 – RAG Pipeline

## Project Overview

Building a YouTube video RAG pipeline for an AI coach agent. The pipeline pulls transcripts from a specific YouTube channel, chunks them, embeds them, and stores them in a vector database for agent retrieval.

---

## Tech Stack

| Component         | Choice                                      |
|-------------------|---------------------------------------------|
| Language          | Python                                      |
| Transcript source | Supadata API                                |
| Channel listing   | YouTube Data API v3                         |
| Database          | Supabase (PostgreSQL + pgvector)            |
| Embedding model   | OpenAI `text-embedding-3-small` (1536 dims) |
| Chunking          | Custom character-based sliding window       |

---

## Architecture & Pipeline Flow

```
YouTube Data API v3 (publishedAfter filter, last 7 days)
  → filter already-processed video IDs against Supabase
  → Supadata /v1/transcript (only for new videos)
  → character-based chunker with sentence boundary detection
  → OpenAI text-embedding-3-small
  → Supabase upsert (youtube_videos + youtube_chunks tables)
```

Trigger: simple polling script run manually or via cron.

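The first two steps of the flow (date window, then dedup) can be sketched in plain Python. This is a minimal illustration, not the pipeline's actual code: `published_after` and `select_new_videos` are hypothetical helper names, and the video dicts are assumed to carry an `id` key holding the YouTube video ID.

```python
from datetime import datetime, timedelta, timezone


def published_after(now: datetime, days: int = 7) -> str:
    """Return the RFC 3339 UTC timestamp that starts the lookback window.

    The YouTube Data API `publishedAfter` parameter expects this format.
    """
    cutoff = now.astimezone(timezone.utc) - timedelta(days=days)
    return cutoff.strftime("%Y-%m-%dT%H:%M:%SZ")


def select_new_videos(listed: list[dict], processed_ids: set[str]) -> list[dict]:
    """Keep only listed videos whose ID is not stored yet, preserving order.

    `processed_ids` is assumed to come from a lookup on `youtube_videos`.
    """
    return [video for video in listed if video["id"] not in processed_ids]


if __name__ == "__main__":
    window_start = published_after(datetime.now(timezone.utc))
    candidates = [{"id": "abc123"}, {"id": "def456"}]
    print(window_start, select_new_videos(candidates, {"abc123"}))
```

Running the dedup check before any transcript call is what keeps Supadata credits from being spent on videos that are already in the database.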

---

## Decisions

### Fetching Strategy

- Use **YouTube Data API v3** to list videos from the configured channel with a `publishedAfter` filter
- Only call **Supadata** for transcripts on videos that pass the date filter and dedup check
- Rationale: precise date filtering before any Supadata credits are spent; no wasted API calls
- Channel ID specified via environment variable (`YOUTUBE_CHANNEL_ID`)

### Video Filtering

- Only process videos published within the last 7 days
- Check the `youtube_videos` table before fetching a transcript; skip if the video ID already exists

### Transcript Fetching

- Use Supadata `GET /v1/transcript` with timestamp segments
- If the transcript fetch fails: retry once, then log the error and skip the video (do not crash the pipeline)
- Handle "no transcript available" as a graceful skip, not an error

### Chunking

- Strategy: Docling `HybridChunker` (token-aware + semantic boundaries)
- Tokenizer: `sentence-transformers/all-MiniLM-L6-v2`
- Max tokens per chunk: 512 (configurable via `CHUNK_MAX_TOKENS`)
- `merge_peers=True`: merges small adjacent chunks for better coherence
- Chunk timestamp: locate each Docling chunk's start position in the concatenated transcript string and binary-search the position map to assign the `offset_ms` of the first matching segment
- Reference: `PRPs/examples/docling_hybrid_chunking.py`

### Embeddings

- Model: `text-embedding-3-small`
- Dimensions: 1536
- Provider: OpenAI

### Database Tables

**`youtube_videos`**: dedup tracking and video metadata

```sql
CREATE TABLE youtube_videos (
    id            TEXT PRIMARY KEY,          -- YouTube video ID
    channel_id    TEXT NOT NULL,
    title         TEXT,
    url           TEXT NOT NULL,
    published_at  TIMESTAMPTZ,
    processed_at  TIMESTAMPTZ,
    status        TEXT DEFAULT 'pending',    -- pending | processed | failed
    created_at    TIMESTAMPTZ DEFAULT NOW()
);
```

**`youtube_chunks`**: RAG chunks with embeddings

```sql
CREATE TABLE youtube_chunks (
    id             BIGSERIAL PRIMARY KEY,
    video_id       TEXT REFERENCES youtube_videos(id),
    channel_id     TEXT NOT NULL,
    content        TEXT NOT NULL,
    chunk_index    INT NOT NULL,
    start_time_ms  BIGINT,                   -- timestamp of first segment in this chunk
    embedding      vector(1536),             -- requires the pgvector extension
    created_at     TIMESTAMPTZ DEFAULT NOW()
);
```

### Extensibility

- Pipeline structured so other data sources can be added alongside the YouTube source in future
- Channel ID driven by env var to support switching channels without code changes

---

## Environment Variables

| Variable             | Description                                  |
|----------------------|----------------------------------------------|
| `YOUTUBE_CHANNEL_ID` | Target YouTube channel ID                    |
| `YOUTUBE_API_KEY`    | YouTube Data API v3 key                      |
| `SUPADATA_API_KEY`   | Supadata API key                             |
| `OPENAI_API_KEY`     | OpenAI API key                               |
| `SUPABASE_URL`       | Supabase project URL                         |
| `SUPABASE_KEY`       | Supabase service role key                    |
| `CHUNK_MAX_TOKENS`   | Max tokens per chunk for HybridChunker (512) |

---

## Known Considerations & Gotchas

- The Supadata `createdAt` field in `/v1/metadata` had a known bug after Jan 2026; use the YouTube Data API for publish date filtering instead
- Getting video title + channel name via Supadata costs 2 credits per video (metadata + transcript); avoided by using the YouTube Data API for metadata
- Transcript language: some videos may have no transcript available; handle as a graceful skip
- Chunk timestamp mapping: chunk boundaries won't align with transcript segment boundaries; assign the timestamp of the first segment within the chunk

---

# Loop #2 – RAG Agent (YouTube AI Coach)

## Project Overview

A Pydantic AI agent with a FastAPI streaming API. The agent acts as a YouTube AI coach, searching the ingested transcript knowledge base via RAG tools and answering questions about video content.

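The agent's timestamped citations depend on the `start_time_ms` values written during Loop #1 chunking, so the position-map lookup from the Chunking decision is worth illustrating here. The sketch below is a rough approximation under stated assumptions: segments are dicts with `text` and `offset_ms` keys, the transcript is built by joining segment texts with single spaces, and each chunk's text appears verbatim in that string (a real chunker may normalize whitespace, which would need extra handling). `build_position_map` and `chunk_start_ms` are hypothetical names.

```python
from bisect import bisect_right
from typing import Optional


def build_position_map(segments: list[dict]) -> tuple[str, list[int], list[int]]:
    """Concatenate segment texts and record where each segment starts.

    Returns (transcript, starts, offsets): starts[i] is the character index
    of segment i in the transcript, offsets[i] is its offset_ms.
    """
    starts: list[int] = []
    offsets: list[int] = []
    parts: list[str] = []
    pos = 0
    for seg in segments:
        starts.append(pos)
        offsets.append(seg["offset_ms"])
        parts.append(seg["text"])
        pos += len(seg["text"]) + 1  # +1 for the joining space
    return " ".join(parts), starts, offsets


def chunk_start_ms(
    chunk_text: str,
    transcript: str,
    starts: list[int],
    offsets: list[int],
    search_from: int = 0,
) -> Optional[int]:
    """Return offset_ms of the segment containing the chunk's first character."""
    pos = transcript.find(chunk_text, search_from)
    if pos == -1:
        return None  # chunk text not found verbatim; caller decides the fallback
    # Binary search: index of the last segment starting at or before pos.
    idx = bisect_right(starts, pos) - 1
    return offsets[idx]


if __name__ == "__main__":
    segs = [
        {"text": "hello world", "offset_ms": 0},
        {"text": "this is a test", "offset_ms": 1500},
        {"text": "of chunk mapping", "offset_ms": 4000},
    ]
    transcript, starts, offsets = build_position_map(segs)
    print(chunk_start_ms("is a test of chunk", transcript, starts, offsets))
```

Passing `search_from` as the previous chunk's start position helps when the same phrase repeats later in a transcript.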

---

## Tech Stack

| Component       | Choice                                                                |
|-----------------|-----------------------------------------------------------------------|
| Language        | Python                                                                |
| Agent framework | Pydantic AI                                                           |
| API framework   | FastAPI (streaming SSE)                                               |
| LLM             | Configurable via env (`LLM_CHOICE`, any OpenAI-compatible provider)   |
| Embeddings      | Configurable via env (`EMBEDDING_MODEL_CHOICE`)                       |
| Database        | Supabase (existing project `noyddkiaggbhqhsliybl`)                    |
| Auth            | Supabase JWT verification                                             |

---

## Architecture

```
Frontend → POST /api/pydantic-agent → FastAPI (main.py)
                                          │
                          verify JWT (Supabase /auth/v1/user)
                                          │
                             fetch conversation history
                                          │
                             agent.iter() (Pydantic AI)
                                          │
                     ┌────────────────────┼────────────────────┐
                     ▼                    ▼                    ▼
               search_chunks     get_full_transcript   get_video_url_with_timestamp
              (vector search)     (full video text)        (URL + timestamp)
                     └────────────────────┼────────────────────┘
                                          │
                               stream tokens to client
                                          │
                                store messages in DB
```

---

## File Structure

```
src/
├── agent/
│   ├── __init__.py
│   ├── agent.py        # Pydantic AI agent, AgentDeps dataclass, get_model()
│   ├── tools.py        # Tool implementation functions (search, transcript, URL)
│   ├── prompt.py       # AGENT_SYSTEM_PROMPT
│   └── db_utils.py     # Conversation/message DB helpers (copied from example)
├── main.py             # FastAPI app: streaming endpoint + JWT auth
├── rag_pipeline/       # Unchanged from Loop 1
└── utils/
    └── config.py       # Extended with agent env vars

supabase/migrations/
└── 004_agent_tables.sql
```

---

## Decisions

### Agent & Model

- `get_model(config)` reads `LLM_CHOICE`, `LLM_BASE_URL`, `LLM_API_KEY` from `PipelineConfig`
- `api_key` falls back to `config.openai_api_key` if `llm_api_key` is empty
- Returns `OpenAIModel` with `OpenAIProvider`; supports OpenAI, OpenRouter, Ollama
- Same pattern for the embedding client: `EMBEDDING_BASE_URL`, `EMBEDDING_API_KEY`

### AgentDeps

```python
from dataclasses import dataclass

from httpx import AsyncClient
from openai import AsyncOpenAI
from supabase import Client


@dataclass
class AgentDeps:
    supabase: Client                 # Supabase client for chunk and message queries
    embedding_client: AsyncOpenAI    # embeds search queries
    http_client: AsyncClient         # reserved for a future web search tool
    transcript_max_chars: int        # cap for the full-transcript tool
```

- `http_client` included for consistency with the example pattern (future web search tool)
- No `memories` field: Mem0 deliberately excluded to keep it simple
- `transcript_max_chars` passed from config so the tool respects the env setting

### Three RAG Tools

| Tool | When to use | Implementation |
|------|-------------|----------------|
| `search_youtube_knowledge_base` | Semantic similarity search across all chunks | Embeds query → `match_youtube_chunks` RPC → formatted results |
| `get_full_video_transcript` | Need full context of a specific video | Fetches all chunks by `video_id`, ordered by `chunk_index`, joined |
| `get_video_url_with_timestamp` | Cite a specific moment in a video | Returns `https://www.youtube.com/watch?v={id}&t={seconds}s` |

### SQL Function: `match_youtube_chunks`

- Joins `youtube_chunks` with `youtube_videos` to return `video_title` and `video_url` inline
- Filters `WHERE yc.embedding IS NOT NULL`
- Accepts a `match_count` parameter (default 5)
- Returns: `id`, `video_id`, `content`, `chunk_index`, `start_time_ms`, `video_title`, `video_url`, `similarity`

### API Endpoint (`/api/pydantic-agent`)

- Mirrors the example `agent_api.py`, apart from the deliberate omissions listed below
- **Kept**: JWT verification, rate limiting, conversation history, title generation, streaming via `agent.iter()`, error handling
- **Removed**: Mem0 (memory), Langfuse (tracing), file attachments (binary upload)
- Streaming: `PartStartEvent` / `PartDeltaEvent` → chunked JSON lines
- Final chunk includes `session_id`, `conversation_title`, `complete: true`

### Database Migration (`004_agent_tables.sql`)

Adds agent tables to the existing Supabase project (does NOT touch `youtube_videos` / `youtube_chunks`):

- `user_profiles`: auto-created on Supabase auth signup via trigger
- `requests`: per-request tracking for rate limiting (5 req/min default)
- `conversations`: session records with auto-generated titles
- `messages`: stores both human and AI turns; `message_data` TEXT stores the Pydantic AI serialized message format for history replay
- `computed_session_user_id` generated column on `messages` extracts the UUID from the `{uuid}~{random}` session_id format
- All tables: RLS enabled, no DELETE policy

### Config Extensions (`PipelineConfig`)

New optional fields added (all have defaults):

- `environment: str = "development"`
- `llm_base_url: str = "https://api.openai.com/v1"`
- `llm_api_key: str = ""`
- `llm_choice: str = "gpt-4o-mini"`
- `embedding_base_url: str = "https://api.openai.com/v1"`
- `embedding_api_key: str = ""`
- `embedding_model_choice: str = "text-embedding-3-small"`
- `transcript_max_chars: int = 20000`

### Rate Limiting

- Default 5 requests/minute per user
- Tracked via the `requests` table (checked before the agent run)
- Returns a streaming error response if exceeded

---

## Environment Variables (Loop 2 additions)

| Variable                 | Description                                        | Default                     |
|--------------------------|----------------------------------------------------|-----------------------------|
| `ENVIRONMENT`            | `development` or `production`                      | `development`               |
| `LLM_BASE_URL`           | LLM provider base URL                              | `https://api.openai.com/v1` |
| `LLM_API_KEY`            | LLM API key (falls back to `OPENAI_API_KEY`)       | (none)                      |
| `LLM_CHOICE`             | Model name (e.g. `gpt-4o-mini`)                    | `gpt-4o-mini`               |
| `EMBEDDING_BASE_URL`     | Embedding provider base URL                        | `https://api.openai.com/v1` |
| `EMBEDDING_API_KEY`      | Embedding API key (falls back to `OPENAI_API_KEY`) | (none)                      |
| `EMBEDDING_MODEL_CHOICE` | Embedding model name                               | `text-embedding-3-small`    |
| `TRANSCRIPT_MAX_CHARS`   | Max chars returned by full transcript tool         | `20000`                     |
| `SUPABASE_SERVICE_KEY`   | Alias for `SUPABASE_KEY` (already supported)       | (none)                      |
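
The fallback behaviour in the table (provider-specific key first, then `OPENAI_API_KEY`) can be sketched with the standard library alone. This is an illustrative approximation, not the actual `PipelineConfig` code: `ProviderSettings` and `resolve_provider` are hypothetical names, and treating an empty string the same as an unset variable is an assumption.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass

DEFAULT_BASE_URL = "https://api.openai.com/v1"


@dataclass(frozen=True)
class ProviderSettings:
    base_url: str
    api_key: str
    model: str


def resolve_provider(
    env: Mapping[str, str],
    prefix: str,
    model_var: str,
    default_model: str,
) -> ProviderSettings:
    """Read `<prefix>_BASE_URL` and `<prefix>_API_KEY`, applying documented defaults.

    An unset or empty API key falls back to OPENAI_API_KEY.
    """
    fallback_key = env.get("OPENAI_API_KEY", "")
    return ProviderSettings(
        base_url=env.get(f"{prefix}_BASE_URL") or DEFAULT_BASE_URL,
        api_key=env.get(f"{prefix}_API_KEY") or fallback_key,
        model=env.get(model_var) or default_model,
    )


if __name__ == "__main__":
    llm = resolve_provider(os.environ, "LLM", "LLM_CHOICE", "gpt-4o-mini")
    emb = resolve_provider(
        os.environ, "EMBEDDING", "EMBEDDING_MODEL_CHOICE", "text-embedding-3-small"
    )
    print(llm.base_url, llm.model, emb.model)
```

Taking the environment as a `Mapping` argument rather than reading `os.environ` directly keeps the function easy to test with a plain dict.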
