# Loop #1 – RAG Pipeline
## Project Overview
Building a YouTube video RAG pipeline for an AI coach agent. The pipeline pulls transcripts from a specific YouTube channel, chunks them, embeds them, and stores them in a vector database for agent retrieval.
---
## Tech Stack
| Component | Choice |
|-------------------|---------------------------------------------|
| Language | Python |
| Transcript source | Supadata API |
| Channel listing | YouTube Data API v3 |
| Database | Supabase (PostgreSQL + pgvector) |
| Embedding model | OpenAI `text-embedding-3-small` (1536 dims) |
| Chunking | Docling `HybridChunker` (token-aware, semantic boundaries) |
---
## Architecture & Pipeline Flow
```
YouTube Data API v3 (publishedAfter filter — last 7 days)
→ filter already-processed video IDs against Supabase
→ Supadata /v1/transcript (only for new videos)
→ Docling HybridChunker (token-aware, semantic boundaries)
→ OpenAI text-embedding-3-small
→ Supabase upsert (youtube_videos + youtube_chunks tables)
```
Trigger: simple polling script run manually or via cron.
---
## Decisions
### Fetching Strategy
- Use **YouTube Data API v3** to list videos from the configured channel with a `publishedAfter` filter
- Only call **Supadata** for transcripts on videos that pass the date filter and dedup check
- Rationale: precise date filtering before any Supadata credits are spent; no wasted API calls
- Channel ID specified via environment variable (`YOUTUBE_CHANNEL_ID`)
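
A minimal sketch of how the `publishedAfter` query could be built for the YouTube Data API v3 `search.list` method. The function name `build_search_params` and the `lookback_days` argument are hypothetical; the parameter names are those of `search.list`.

```python
from datetime import datetime, timedelta, timezone

# Endpoint of the YouTube Data API v3 search.list method.
YOUTUBE_SEARCH_URL = "https://www.googleapis.com/youtube/v3/search"


def build_search_params(channel_id: str, api_key: str, now: datetime,
                        lookback_days: int = 7) -> dict:
    """Build search.list query parameters restricted to recent uploads.

    Hypothetical helper: `now` must be a timezone-aware datetime.
    """
    cutoff = now.astimezone(timezone.utc) - timedelta(days=lookback_days)
    return {
        "part": "snippet",
        "channelId": channel_id,
        "type": "video",
        "order": "date",
        "maxResults": 50,
        # publishedAfter expects an RFC 3339 timestamp.
        "publishedAfter": cutoff.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "key": api_key,
    }
```

The resulting dict can be passed as query parameters to any HTTP client, with `channel_id` read from `YOUTUBE_CHANNEL_ID` and `api_key` from `YOUTUBE_API_KEY`.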
### Video Filtering
- Only process videos published within the last 7 days
- Check `youtube_videos` table before fetching transcript — skip if video ID already exists
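
The two filters above can be sketched as one pure function. The name `select_videos_to_process` and the dict keys `id` and `published_at` are assumptions for illustration; in the pipeline, `processed_ids` would come from the `youtube_videos` table.

```python
from datetime import datetime, timedelta, timezone


def select_videos_to_process(videos: list[dict], processed_ids: set[str],
                             now: datetime, lookback_days: int = 7) -> list[dict]:
    """Keep videos published within the lookback window that are not yet stored.

    Hypothetical helper: each video is a dict with an "id" string and a
    timezone-aware "published_at" datetime.
    """
    cutoff = now - timedelta(days=lookback_days)
    return [
        video for video in videos
        if video["published_at"] >= cutoff and video["id"] not in processed_ids
    ]
```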
### Transcript Fetching
- Use Supadata `GET /v1/transcript` with timestamp segments
- If transcript fetch fails: retry once, then log error and skip the video (do not crash pipeline)
- Handle "no transcript available" as a graceful skip, not an error
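
A sketch of the retry-once-then-skip behaviour, with the actual Supadata call injected as a callable so the control flow stays independent of any HTTP client. `TranscriptUnavailable` and `fetch_with_retry` are hypothetical names, not part of the Supadata API.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("rag_pipeline")


class TranscriptUnavailable(Exception):
    """Hypothetical signal that a video has no transcript at all."""


def fetch_with_retry(video_id: str, fetch: Callable[[str], list],
                     retries: int = 1) -> Optional[list]:
    """Return transcript segments, or None if the video should be skipped."""
    for attempt in range(retries + 1):
        try:
            return fetch(video_id)
        except TranscriptUnavailable:
            # Not an error: skip quietly and do not retry.
            logger.info("No transcript for %s; skipping", video_id)
            return None
        except Exception as exc:
            if attempt < retries:
                logger.warning("Transcript fetch failed for %s, retrying: %s", video_id, exc)
            else:
                logger.error("Transcript fetch failed for %s, skipping: %s", video_id, exc)
    return None
```

Returning `None` instead of raising lets the caller continue with the next video, so one bad transcript never stops the run.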
### Chunking
- Strategy: Docling `HybridChunker` (token-aware + semantic boundaries)
- Tokenizer: `sentence-transformers/all-MiniLM-L6-v2`
- Max tokens per chunk: 512 (configurable via `CHUNK_MAX_TOKENS`)
- `merge_peers=True` — merges small adjacent chunks for better coherence
- Chunk timestamp: locate each Docling chunk's start position in the concatenated transcript string and binary-search the position map to assign `offset_ms` of the first matching segment
- Reference: `PRPs/examples/docling_hybrid_chunking.py`
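
The timestamp mapping can be sketched with the standard-library `bisect` module. This is an illustration under assumptions, not the code in the referenced example: the helper names are hypothetical, segments are assumed to be dicts with `text` and `offset_ms` keys, and chunk text is assumed to appear verbatim in the concatenated transcript.

```python
from bisect import bisect_right
from typing import Optional


def build_position_map(segments: list[dict], separator: str = " "):
    """Concatenate segment texts and record where each segment starts.

    Returns (full_text, starts, offsets): starts[i] is the character index
    of segment i in full_text, offsets[i] is its offset_ms.
    """
    starts, offsets, parts = [], [], []
    cursor = 0
    for segment in segments:
        starts.append(cursor)
        offsets.append(segment["offset_ms"])
        parts.append(segment["text"])
        cursor += len(segment["text"]) + len(separator)
    return separator.join(parts), starts, offsets


def chunk_start_ms(chunk_text: str, full_text: str, starts: list[int],
                   offsets: list[int], search_from: int = 0) -> Optional[int]:
    """Return offset_ms of the segment containing the chunk's first character."""
    position = full_text.find(chunk_text, search_from)
    if position == -1:
        return None  # chunk text was altered (e.g. whitespace normalised)
    # Last segment whose start is <= position.
    index = bisect_right(starts, position) - 1
    return offsets[index]
```

If the chunker normalises whitespace, the verbatim `find` can miss; matching on a short prefix of the chunk is one possible fallback.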
### Embeddings
- Model: `text-embedding-3-small`
- Dimensions: 1536
- Provider: OpenAI
### Database Tables
**`youtube_videos`** — dedup tracking and video metadata
```sql
id TEXT PRIMARY KEY, -- YouTube video ID
channel_id TEXT NOT NULL,
title TEXT,
url TEXT NOT NULL,
published_at TIMESTAMPTZ,
processed_at TIMESTAMPTZ,
status TEXT DEFAULT 'pending', -- pending | processed | failed
created_at TIMESTAMPTZ DEFAULT NOW()
```
**`youtube_chunks`** — RAG chunks with embeddings
```sql
id BIGSERIAL PRIMARY KEY,
video_id TEXT REFERENCES youtube_videos(id),
channel_id TEXT NOT NULL,
content TEXT NOT NULL,
chunk_index INT NOT NULL,
start_time_ms BIGINT, -- timestamp of first segment in this chunk
embedding vector(1536),
created_at TIMESTAMPTZ DEFAULT NOW()
```
### Extensibility
- Pipeline structured so other data sources can be added alongside the YouTube source in future
- Channel ID driven by env var to support switching channels without code changes
---
## Environment Variables
| Variable | Description |
|-----------------------|------------------------------------------|
| `YOUTUBE_CHANNEL_ID` | Target YouTube channel ID |
| `YOUTUBE_API_KEY` | YouTube Data API v3 key |
| `SUPADATA_API_KEY` | Supadata API key |
| `OPENAI_API_KEY` | OpenAI API key |
| `SUPABASE_URL` | Supabase project URL |
| `SUPABASE_KEY` | Supabase service role key |
| `CHUNK_MAX_TOKENS` | Max tokens per chunk for HybridChunker (default: 512) |
---
## Known Considerations & Gotchas
- Supadata `createdAt` field in `/v1/metadata` had a known bug after Jan 2026 — use YouTube Data API for publish date filtering instead
- Getting video title + channel name via Supadata costs 2 credits per video (metadata + transcript) — avoided by using YouTube Data API for metadata
- Transcript availability: some videos have no transcript; handle these as a graceful skip, not an error
- Chunk timestamp mapping: chunk boundaries won't align with transcript segment boundaries — assign timestamp of first segment within the chunk
---
# Loop #2 – RAG Agent (YouTube AI Coach)
## Project Overview
A Pydantic AI agent with a FastAPI streaming API. The agent acts as a YouTube AI coach, searching the ingested transcript knowledge base via RAG tools and answering questions about video content.
---
## Tech Stack
| Component | Choice |
|-----------------|------------------------------------------------------|
| Language | Python |
| Agent framework | Pydantic AI |
| API framework | FastAPI (streaming SSE) |
| LLM | Configurable via env (`LLM_CHOICE`, any OpenAI-compat provider) |
| Embeddings | Configurable via env (`EMBEDDING_MODEL_CHOICE`) |
| Database | Supabase (existing project `noyddkiaggbhqhsliybl`) |
| Auth | Supabase JWT verification |
---
## Architecture
```
Frontend → POST /api/pydantic-agent → FastAPI (main.py)
│
verify JWT (Supabase /auth/v1/user)
│
fetch conversation history
│
agent.iter() — Pydantic AI
│
┌─────────────┼──────────────┐
▼ ▼ ▼
search_chunks get_full_transcript get_video_url_with_timestamp
(vector search) (full video text) (URL + timestamp)
└─────────────┼──────────────┘
│
stream tokens to client
│
store messages in DB
```
---
## File Structure
```
src/
├── agent/
│ ├── __init__.py
│ ├── agent.py # Pydantic AI agent, AgentDeps dataclass, get_model()
│ ├── tools.py # Tool implementation functions (search, transcript, URL)
│ ├── prompt.py # AGENT_SYSTEM_PROMPT
│ └── db_utils.py # Conversation/message DB helpers (copied from example)
├── main.py # FastAPI app — streaming endpoint + JWT auth
├── rag_pipeline/ # Unchanged from Loop 1
└── utils/
└── config.py # Extended with agent env vars
supabase/migrations/
└── 004_agent_tables.sql
```
---
## Decisions
### Agent & Model
- `get_model(config)` reads `LLM_CHOICE`, `LLM_BASE_URL`, `LLM_API_KEY` from `PipelineConfig`
- `api_key` falls back to `config.openai_api_key` if `llm_api_key` is empty
- Returns `OpenAIModel` with `OpenAIProvider` — supports OpenAI, OpenRouter, Ollama
- Same pattern for embedding client: `EMBEDDING_BASE_URL`, `EMBEDDING_API_KEY`
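
The key fallback can be sketched without any framework dependency. `ModelSettings` and `resolve_model_settings` are hypothetical names; the real `get_model(config)` would pass the resolved values to `OpenAIModel` and `OpenAIProvider`.

```python
from dataclasses import dataclass


@dataclass
class ModelSettings:
    """Hypothetical container for the resolved connection settings."""
    model_name: str
    base_url: str
    api_key: str


def resolve_model_settings(llm_choice: str, llm_base_url: str,
                           llm_api_key: str, openai_api_key: str) -> ModelSettings:
    """Use the dedicated LLM key when set, otherwise fall back to the OpenAI key."""
    return ModelSettings(
        model_name=llm_choice,
        base_url=llm_base_url,
        # An empty string is falsy, so `or` implements the fallback.
        api_key=llm_api_key or openai_api_key,
    )
```

The same helper shape would apply to the embedding client with `EMBEDDING_BASE_URL` and `EMBEDDING_API_KEY`.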
### AgentDeps
```python
@dataclass
class AgentDeps:
supabase: Client
embedding_client: AsyncOpenAI
http_client: AsyncClient
transcript_max_chars: int
```
- `http_client` included for consistency with example pattern (future web search tool)
- No `memories` field — Mem0 deliberately excluded to keep it simple
- `transcript_max_chars` passed from config so tool respects env setting
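
A sketch of how the full-transcript tool might apply `transcript_max_chars`. The function name `join_transcript`, the blank-line separator, and plain truncation at the limit are assumptions; the notes only say the tool respects the setting.

```python
def join_transcript(chunks: list[dict], max_chars: int) -> str:
    """Join chunk contents in chunk_index order and cap the total length.

    Hypothetical helper: each chunk is a dict with "chunk_index" and "content".
    """
    ordered = sorted(chunks, key=lambda chunk: chunk["chunk_index"])
    text = "\n\n".join(chunk["content"] for chunk in ordered)
    # Truncate so a long video cannot overflow the model's context.
    return text[:max_chars]
```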
### Three RAG Tools
| Tool | When to use | Implementation |
|------|-------------|----------------|
| `search_youtube_knowledge_base` | Semantic similarity search across all chunks | Embeds query → `match_youtube_chunks` RPC → formatted results |
| `get_full_video_transcript` | Need full context of a specific video | Fetches all chunks by `video_id`, ordered by `chunk_index`, joined |
| `get_video_url_with_timestamp` | Cite a specific moment in a video | Returns `https://www.youtube.com/watch?v={id}&t={seconds}s` |
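
The URL format in the last row can be sketched as below. The conversion from `start_time_ms` to whole seconds and the fallback to `0` for a missing timestamp are assumptions.

```python
from typing import Optional


def video_url_with_timestamp(video_id: str, start_time_ms: Optional[int]) -> str:
    """Build a YouTube watch URL that starts playback at the chunk's timestamp."""
    # Integer division drops the sub-second part; None or negatives map to 0.
    seconds = max(0, (start_time_ms or 0) // 1000)
    return f"https://www.youtube.com/watch?v={video_id}&t={seconds}s"
```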
### SQL Function: `match_youtube_chunks`
- Joins `youtube_chunks` with `youtube_videos` to return `video_title` and `video_url` inline
- Filters `WHERE yc.embedding IS NOT NULL`
- Accepts `match_count` parameter (default 5)
- Returns: `id`, `video_id`, `content`, `chunk_index`, `start_time_ms`, `video_title`, `video_url`, `similarity`
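
For intuition, the ranking can be modelled in pure Python. This is a reference model, not the SQL function: it assumes cosine similarity as the metric (the notes do not say which pgvector operator is used), and `match_chunks` is a hypothetical name.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_chunks(query_embedding: list[float], rows: list[dict],
                 match_count: int = 5) -> list[dict]:
    """Rank rows by similarity to the query, skipping rows without an embedding."""
    scored = [
        {**row, "similarity": cosine_similarity(query_embedding, row["embedding"])}
        for row in rows
        if row.get("embedding") is not None  # mirrors the IS NOT NULL filter
    ]
    scored.sort(key=lambda row: row["similarity"], reverse=True)
    return scored[:match_count]
```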
### API Endpoint (`/api/pydantic-agent`)
- Mirrors the example `agent_api.py`, with the deliberate omissions listed below
- **Kept**: JWT verification, rate limiting, conversation history, title generation, streaming via `agent.iter()`, error handling
- **Removed**: Mem0 (memory), Langfuse (tracing), file attachments (binary upload)
- Streaming: `PartStartEvent` / `PartDeltaEvent` → chunked JSON lines
- Final chunk includes `session_id`, `conversation_title`, `complete: true`
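
A sketch of the chunked JSON-lines framing. The `text` key and both function names are assumptions; only `session_id`, `conversation_title`, and `complete` come from the notes above.

```python
import json


def encode_text_chunk(text: str) -> str:
    """Encode the text streamed so far as one JSON line."""
    return json.dumps({"text": text}) + "\n"


def encode_final_chunk(text: str, session_id: str, conversation_title: str) -> str:
    """Encode the closing line that tells the client the stream is complete."""
    payload = {
        "text": text,
        "session_id": session_id,
        "conversation_title": conversation_title,
        "complete": True,
    }
    return json.dumps(payload) + "\n"
```

One JSON object per line lets the client parse each line as it arrives, without waiting for the full response.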
### Database Migration (`004_agent_tables.sql`)
Adds agent tables to the existing Supabase project (does NOT touch `youtube_videos` or `youtube_chunks`):
- `user_profiles` — auto-created on Supabase auth signup via trigger
- `requests` — per-request tracking for rate limiting (5 req/min default)
- `conversations` — session records with auto-generated titles
- `messages` — stores both human and AI turns; `message_data` TEXT stores Pydantic AI serialized message format for history replay
- `computed_session_user_id` generated column on messages extracts UUID from `{uuid}~{random}` session_id format
- All tables: RLS enabled, no DELETE policy
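
The `{uuid}~{random}` session ID format can be illustrated in Python. This mirrors what the generated column is described as extracting; the function name `session_user_id` is hypothetical, and the migration does this in SQL.

```python
import uuid
from typing import Optional


def session_user_id(session_id: str) -> Optional[uuid.UUID]:
    """Extract the user UUID from a "{uuid}~{random}" session ID, if well formed."""
    prefix, separator, _ = session_id.partition("~")
    if not separator:
        return None  # no "~" present, so not in the expected format
    try:
        return uuid.UUID(prefix)
    except ValueError:
        return None
```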
### Config Extensions (`PipelineConfig`)
New optional fields added (all have defaults):
- `environment: str = "development"`
- `llm_base_url: str = "https://api.openai.com/v1"`
- `llm_api_key: str = ""`
- `llm_choice: str = "gpt-4o-mini"`
- `embedding_base_url: str = "https://api.openai.com/v1"`
- `embedding_api_key: str = ""`
- `embedding_model_choice: str = "text-embedding-3-small"`
- `transcript_max_chars: int = 20000`
### Rate Limiting
- Default 5 requests/minute per user
- Tracked via `requests` table (checked before agent run)
- Returns streaming error response if exceeded
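
The check can be sketched as a pure function over request timestamps. The sliding 60-second window and the name `is_rate_limited` are assumptions; in the API, the timestamps would come from the `requests` table for the current user.

```python
from datetime import datetime, timedelta, timezone


def is_rate_limited(request_times: list[datetime], now: datetime,
                    limit: int = 5, window_seconds: int = 60) -> bool:
    """Return True if the user already made `limit` requests within the window."""
    window_start = now - timedelta(seconds=window_seconds)
    recent = sum(1 for timestamp in request_times if timestamp > window_start)
    return recent >= limit
```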
---
## Environment Variables (Loop 2 additions)
| Variable | Description | Default |
|-------------------------|----------------------------------------------------|---------|
| `ENVIRONMENT` | `development` or `production` | `development` |
| `LLM_BASE_URL` | LLM provider base URL | `https://api.openai.com/v1` |
| `LLM_API_KEY` | LLM API key (falls back to `OPENAI_API_KEY`) | — |
| `LLM_CHOICE` | Model name (e.g. `gpt-4o-mini`) | `gpt-4o-mini` |
| `EMBEDDING_BASE_URL` | Embedding provider base URL | `https://api.openai.com/v1` |
| `EMBEDDING_API_KEY` | Embedding API key (falls back to `OPENAI_API_KEY`) | — |
| `EMBEDDING_MODEL_CHOICE`| Embedding model name | `text-embedding-3-small` |
| `TRANSCRIPT_MAX_CHARS` | Max chars returned by full transcript tool | `20000` |
| `SUPABASE_SERVICE_KEY` | Alias for `SUPABASE_KEY` (already supported) | — |