rag-vector-chunking

Embedding-Based Retrieval Architecture

The architecture you described is the standard setup for embedding-based retrieval in RAG systems:

Embedding Model: This neural network converts text into dense vector representations. Popular models include:
- Sentence-BERT/Sentence Transformers
- OpenAI's text-embedding models
- Models like E5, BGE, or GTE
Vector Database: Stores these embeddings and provides efficient search capabilities. Examples include:
- Pinecone
- Weaviate
- Milvus
- Chroma
- FAISS (Facebook AI Similarity Search)
Indexing Component: This is critical but often overlooked. It:
- Organizes vectors for efficient retrieval
- Creates data structures that allow for fast approximate nearest neighbor search
- May use techniques like hierarchical navigable small worlds (HNSW) or inverted file index (IVF)

Cosine Similarity Explained

Cosine similarity is the most common way to measure how similar two vectors are in embedding space:

Mathematical Definition: It measures the cosine of the angle between two vectors
- Formula: cos(θ) = (A·B)/(||A||·||B||)
- Returns values between -1 and 1 (though with typical embeddings, values range from 0 to 1)
Intuitive Explanation:
- Think of two vectors as arrows pointing in different directions from the origin
- If they point in the exact same direction (perfectly similar): cosine = 1
- If they point in perpendicular directions (unrelated): cosine = 0
- If they point in opposite directions (opposite meaning): cosine = -1
Why It's Used for Embeddings:
- It focuses on the direction of vectors, not their magnitude
- This means a document's length doesn't overly influence similarity scores
- It works well in high-dimensional spaces where Euclidean distance can be problematic

RAG Evaluation Metrics

The two metrics you mentioned are key for evaluating retrieval quality:

Context Precision: The proportion of retrieved chunks that are actually relevant
- High precision means most retrieved information is useful
- Low precision means the system is retrieving a lot of irrelevant information
Context Recall: The proportion of all relevant chunks that were successfully retrieved
- High recall means the system found most of the important information
- Low recall means the system missed key information

These metrics help balance the tradeoff between retrieving too much information (high recall, low precision) versus missing important information (high precision, low recall).

Chunking Strategies - The Deep Dive

Chunking is perhaps the most underappreciated aspect of RAG systems:

Size Considerations:
- Too large: Wastes context window, may dilute relevant information
- Too small: Loses important context, creates fragmented knowledge
- Optimal size often depends on your specific content and use case
Advanced Chunking Strategies:
- Semantic Chunking: Breaking at natural semantic boundaries rather than arbitrary character counts
- Overlapping Chunks: Having some overlap between chunks to avoid breaking concepts
- Hierarchical Chunking: Creating both small detailed chunks and larger overview chunks
- Recursive Chunking: Using the model itself to identify logical break points
Metadata Augmentation:
- Adding title, source, date, author, category, etc.
- Including parent-child relationships between chunks
- Preserving document structure information (section, chapter, etc.)
- This metadata becomes crucial for contextualizing retrieved information

Retrieval Optimization Techniques

Query Rewriting:
- Using the model to expand or reformulate the original query
- Creating multiple query variations to improve retrieval coverage
- Example: User asks "How tall is the tower?" → System rewrites to "What is the height of the Eiffel Tower in Paris, France?"
Reranking:
- Two-stage retrieval process:
  1. Initial broad retrieval using fast methods
  2. More sophisticated reranking of the top results
- Can use cross-attention models (like cross-encoders) that look at query-document pairs together
- Often dramatically improves precision at a reasonable computational cost

These optimization techniques are becoming standard in advanced RAG systems because they significantly improve retrieval quality without requiring complete architecture changes.

Embedding-Based Retrieval Architecture

Cosine Similarity Explained

RAG Evaluation Metrics

Chunking Strategies - The Deep Dive

Retrieval Optimization Techniques

Related Documents

SUMMARY

Retrieval & Prompts

App Review Support Guide — Switch2Go

RFC-BLite: High-Performance Embedded Document Database for .NET