# FastEmbed-rb Roadmap
This document outlines features from the original [FastEmbed Python library](https://github.com/qdrant/fastembed) that are not yet implemented in fastembed-rb.
## Current Status (v1.0.0)
### Implemented
- Dense text embeddings with 12 models
- Automatic model downloading from HuggingFace
- Lazy evaluation via `Enumerator`
- Query/passage prefixes for retrieval models
- Mean pooling and L2 normalization
- Configurable batch size and threading
- CoreML execution provider support
- CLI tool (`fastembed`)
- **Reranking / Cross-Encoder models** (5 models)
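For readers unfamiliar with the "mean pooling and L2 normalization" step listed above, here is a minimal pure-Ruby sketch of the idea. It is not the gem's implementation; the method names `mean_pool` and `l2_normalize` and the input layout (one array of floats per token, plus a 0/1 attention mask) are assumptions made for illustration.

```ruby
# Illustrative only: average the token vectors that are not padding,
# then scale the result to unit length.
def mean_pool(token_embeddings, attention_mask)
  dim = token_embeddings.first.size
  sum = Array.new(dim, 0.0)
  count = 0
  token_embeddings.each_with_index do |vec, i|
    next if attention_mask[i].zero? # skip padding tokens
    vec.each_with_index { |v, j| sum[j] += v }
    count += 1
  end
  count = 1 if count.zero? # avoid dividing by zero on an all-padding input
  sum.map { |v| v / count }
end

def l2_normalize(vector)
  norm = Math.sqrt(vector.sum { |v| v * v })
  return vector if norm.zero?

  vector.map { |v| v / norm }
end

tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0] # the third token is padding
pooled = mean_pool(tokens, mask) # => [2.0, 3.0]
embedding = l2_normalize(pooled)
```

After normalization, the dot product of two embeddings equals their cosine similarity, which is why the two steps are usually paired.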
## Feature Gap Analysis
### High Priority
#### 1. Sparse Text Embeddings
The Python library supports sparse embedding models that return indices and values rather than dense vectors. These are useful for hybrid search combining keyword and semantic matching.
**Models to support:**
- `Qdrant/bm25` - Classic BM25 (0.010 GB)
- `Qdrant/bm42-all-minilm-l6-v2-attentions` - Attention-based sparse (0.090 GB)
- `prithivida/Splade_PP_en_v1` - SPLADE++ (0.532 GB)
**API design:**
```ruby
sparse = Fastembed::SparseTextEmbedding.new
result = sparse.embed(["hello world"]).first
# => { indices: [123, 456, 789], values: [0.5, 0.3, 0.2] }
```
**Implementation notes:**
- Need new `SparseTextEmbedding` class
- Different output format (sparse vectors instead of dense)
- May require different tokenization approach for BM25
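To show how the proposed `{ indices:, values: }` output could be consumed, the sketch below scores a query against a document with a sparse dot product. The helper name `sparse_dot` is hypothetical and not part of any existing API; it only assumes the hash shape from the API design above.

```ruby
# Hypothetical helper: dot product of two sparse vectors given as
# { indices: [...], values: [...] } hashes.
def sparse_dot(a, b)
  weights = a[:indices].zip(a[:values]).to_h
  b[:indices].zip(b[:values]).sum(0.0) do |index, value|
    # Indices missing from `a` contribute nothing to the score.
    weights.fetch(index, 0.0) * value
  end
end

query = { indices: [123, 456], values: [0.5, 0.25] }
doc = { indices: [456, 789], values: [2.0, 1.0] }
score = sparse_dot(query, doc) # => 0.5 (only index 456 overlaps)
```

Only overlapping indices contribute, so the cost depends on the number of non-zero entries rather than on the vocabulary size.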
#### 2. Late Interaction (ColBERT) Models
ColBERT-style models produce token-level embeddings rather than a single vector per document. This enables finer-grained matching between individual query and document tokens.
**Models to support:**
- `answerdotai/answerai-colbert-small-v1` (96 dim)
- `colbert-ir/colbertv2.0` (128 dim)
- `jinaai/jina-colbert-v2` (128 dim)
**API design:**
```ruby
colbert = Fastembed::LateInteractionTextEmbedding.new
result = colbert.embed(["hello world"]).first
# => Array of token embeddings, shape: [num_tokens, dim]
```
**Implementation notes:**
- Returns 2D array per document (tokens × dimensions)
- Different pooling strategy (no pooling, keep all tokens)
- Scoring requires MaxSim operation between query and document tokens
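The MaxSim operation mentioned above can be sketched in a few lines of plain Ruby. For each query token it takes the best-matching document token by dot product, then sums those maxima. The method names are illustrative, and the sketch assumes both inputs are non-empty arrays of equal-length float vectors.

```ruby
# Dot product of two equal-length vectors.
def dot(a, b)
  a.zip(b).sum(0.0) { |x, y| x * y }
end

# MaxSim: for every query token, keep the highest similarity to any
# document token, then sum over query tokens.
# Assumes doc_tokens is non-empty.
def max_sim(query_tokens, doc_tokens)
  query_tokens.sum(0.0) do |q|
    doc_tokens.map { |d| dot(q, d) }.max
  end
end

query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 0.25]]
score = max_sim(query, doc) # => 1.5 (1.0 for the first token, 0.5 for the second)
```

A production version would likely run this as a matrix multiplication rather than nested Ruby loops; the sketch is meant only to pin down the definition.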
#### ~~3. Reranking / Cross-Encoder Models~~ ✅ IMPLEMENTED
See `Fastembed::TextCrossEncoder` class.
### Medium Priority
#### ~~4. Image Embeddings~~ ✅ IMPLEMENTED
Vision models for converting images to vectors. Requires `mini_magick` gem.
**Supported models:**
- `Qdrant/resnet50-onnx` (2048 dim)
- `Qdrant/clip-ViT-B-32-vision` (512 dim)
- `jinaai/jina-clip-v1` (768 dim)
**Usage:**
```ruby
# Add to Gemfile: gem "mini_magick"
image_embed = Fastembed::ImageEmbedding.new
vector = image_embed.embed(["path/to/image.jpg"]).first
```
#### ~~5. Custom Model Support~~ ✅ IMPLEMENTED
Implemented via `CustomModelRegistry` module. Users can register custom models:
```ruby
Fastembed.register_model(
  model_name: "my-org/my-model",
  dim: 768,
  sources: { hf: "my-org/my-model" }
)
embed = Fastembed::TextEmbedding.new(model_name: "my-org/my-model")
```
Also supports local model loading via `local_model_dir` parameter.
### Low Priority
#### 6. Multimodal Late Interaction (ColPali)
ColPali models that can embed both images and text for document retrieval.
**Models to support:**
- `vidore/colpali-v1.2`
- `vidore/colqwen2-v1.0`
**Implementation notes:**
- Combines image and text embedding
- Requires vision preprocessing
- Complex architecture, lower priority
#### 7. Quantized Models
Support for INT8/INT4 quantized models for faster inference and lower memory usage.
**Implementation notes:**
- ONNX Runtime supports quantized models natively
- Need to add quantized model variants to registry
- Trade-off between speed and accuracy
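To make the accuracy side of that trade-off concrete, here is a small sketch of symmetric INT8 quantization applied to a handful of floats. It is an illustration of the general technique only: ONNX Runtime performs quantization inside the model graph, and the function names below are invented for this example.

```ruby
# Symmetric INT8 quantization: map floats to integers in -127..127
# using a single scale factor derived from the largest magnitude.
def quantize_int8(values)
  max_abs = values.map(&:abs).max
  scale = max_abs.zero? ? 1.0 : max_abs / 127.0
  quantized = values.map { |v| (v / scale).round.clamp(-127, 127) }
  [quantized, scale]
end

# Recover approximate floats from the quantized integers.
def dequantize(quantized, scale)
  quantized.map { |q| q * scale }
end

weights = [0.5, -1.0, 0.25, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# The round-trip error per value is bounded by about half the scale.
max_error = weights.zip(restored).map { |w, r| (w - r).abs }.max
```

Each value now fits in one byte instead of four, at the cost of a bounded rounding error per weight.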
## ~~CLI Enhancements~~ ✅ IMPLEMENTED
All planned CLI features have been implemented:
- ✅ `fastembed download <model>` - Pre-download models for offline use
- ✅ `fastembed benchmark` - Run performance benchmarks with configurable iterations
- ✅ `fastembed info <model>` - Show detailed model information including cache status
- ✅ `-i input.txt` - Read texts from file (one per line)
- ✅ `-p` / `--progress` - Show progress bar during embedding
- ✅ `-q` / `--quiet` - Suppress progress output for scripting
## Breaking Changes for v2.0
If we do a major version bump:
1. Consider making `embed()` return an Array instead of Enumerator by default
2. Rename `query_embed`/`passage_embed` to `embed_query`/`embed_passage` for consistency
3. Use keyword arguments consistently throughout
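The first item matters because an `Enumerator` defers work until it is consumed, whereas an `Array` computes everything up front. The sketch below mimics that behavior with a stand-in for inference; `lazy_embed` and its `log:` parameter are hypothetical and exist only to make the laziness observable.

```ruby
# Stand-in for a lazy embed method: yields one fake 1-dimensional
# vector per document and records each batch it actually processes.
def lazy_embed(documents, batch_size: 2, log: [])
  Enumerator.new do |yielder|
    documents.each_slice(batch_size) do |batch|
      log << batch # record that this batch was "run through the model"
      batch.each { |doc| yielder << [doc.length.to_f] }
    end
  end
end

log = []
enum = lazy_embed(%w[a bb ccc dddd], log: log)
nothing_ran_yet = log.empty?   # => true, no work until consumption

first_vector = enum.first      # => [1.0], only the first batch runs
batches_after_first = log.size # => 1

all_vectors = enum.to_a        # restarts from the beginning and runs every batch
```

Note that each consumption restarts the enumerator, so calling `first` and then `to_a` repeats work. Returning an `Array` by default would avoid that surprise at the cost of always embedding every document.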
---
## Refactoring Plan
### Completed: Phase 1 - Extract Shared Helpers
- [x] Create `Validators` module for document validation
- [x] Extract `prepare_model_inputs` to BaseModel
- [x] Extract `setup_model_and_tokenizer` to BaseModel
- [x] Update all model classes to use shared helpers
**Result:** Removed roughly 60 lines of duplicated code across 4 model classes.
---
### Completed: Phase 2 - Add Missing Features (Medium Risk)
Goal: Achieve API consistency across all model types.
#### 2.1 Add `passage_embed` to TextSparseEmbedding ✅ IMPLEMENTED
`TextSparseEmbedding` now provides `passage_embed`, which wraps a single string in an array and delegates to `embed`:
```ruby
# lib/fastembed/sparse_embedding.rb
def passage_embed(passages, batch_size: 32)
  # Accept a single string as well as an array of strings
  passages = [passages] if passages.is_a?(String)
  embed(passages, batch_size: batch_size)
end
```
#### 2.2 Add async methods to all embedding classes ✅ IMPLEMENTED
Added async methods to all model classes:
- TextSparseEmbedding: embed_async, query_embed_async, passage_embed_async
- LateInteractionTextEmbedding: embed_async, query_embed_async, passage_embed_async
- TextCrossEncoder: rerank_async, rerank_with_scores_async
```ruby
# Add to TextSparseEmbedding
def embed_async(documents, batch_size: 32)
  Async::Future.new { embed(documents, batch_size: batch_size).to_a }
end

def query_embed_async(queries, batch_size: 32)
  Async::Future.new { query_embed(queries, batch_size: batch_size).to_a }
end

def passage_embed_async(passages, batch_size: 32)
  Async::Future.new { passage_embed(passages, batch_size: batch_size).to_a }
end

# Add to TextCrossEncoder
def rerank_async(query:, documents:, batch_size: 64)
  Async::Future.new { rerank(query: query, documents: documents, batch_size: batch_size) }
end
```
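The implementation of `Async::Future` is not shown in this document. As a rough mental model only, a thread-backed future could look like the sketch below; the class name `SimpleFuture` and its methods are assumptions, not the gem's API.

```ruby
# Hypothetical thread-backed future: runs the block in a background
# thread and hands back the result on demand.
class SimpleFuture
  def initialize(&block)
    @thread = Thread.new(&block)
  end

  # Blocks until the work finishes. Thread#value re-raises any
  # exception that was raised inside the block.
  def value
    @thread.value
  end

  def completed?
    !@thread.alive?
  end
end

future = SimpleFuture.new { [1, 2, 3].map { |n| n * 2 } }
result = future.value # => [2, 4, 6]
```

The `.to_a` calls in the async methods above matter for this pattern: without them, the block would return an unevaluated `Enumerator` and the real work would still happen on the caller's thread.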
#### 2.3 Add progress callback support to all embedding classes ✅ IMPLEMENTED
Added progress callback support to TextSparseEmbedding and LateInteractionTextEmbedding.
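The callback signature is not specified here. One plausible shape, shown purely as an assumption, is a block that receives the current batch number and the total batch count after each batch completes:

```ruby
# Hypothetical batching loop with an optional progress block.
# The "embedding" is a stand-in (string length) so the sketch runs
# without a model.
def embed_with_progress(documents, batch_size: 32)
  total = (documents.size / batch_size.to_f).ceil
  batches = documents.each_slice(batch_size).each_with_index.map do |batch, index|
    result = batch.map(&:length)
    yield(index + 1, total) if block_given? # report progress after each batch
    result
  end
  batches.flatten(1)
end

events = []
output = embed_with_progress(%w[a bb ccc], batch_size: 2) do |current, total|
  events << [current, total]
end
# output => [1, 2, 3]
# events => [[1, 2], [2, 2]]
```

Making the block optional keeps existing callers working unchanged, which fits the goal of adding the feature across classes without breaking the API.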
#### 2.4 Add `show_progress` parameter to TextCrossEncoder ✅ IMPLEMENTED
`show_progress` is now configurable; it was previously hardcoded to `true`.
---
### Completed: Phase 3 - Unify Initialization (Higher Risk)
Goal: Consistent initialization API across all model types.
#### 3.1 Add quantization support to all models ✅ IMPLEMENTED
Added quantization parameter to all model classes (TextSparseEmbedding, LateInteractionTextEmbedding, TextCrossEncoder).
#### 3.2 Add local_model_dir support to all models ✅ IMPLEMENTED
Added local_model_dir, model_file, and tokenizer_file parameters to all model classes. Shared logic extracted to BaseModel (initialize_from_local, create_local_model_info).
#### 3.3 Document batch size rationale ✅ DOCUMENTED
Default batch sizes vary by model type based on memory requirements:
| Model Type | Default Batch Size | Rationale |
|------------|-------------------|-----------|
| TextEmbedding | 256 | Dense embeddings have fixed output size (e.g., 384 floats). Memory is predictable and efficient. |
| TextSparseEmbedding | 32 | SPLADE models output logits for entire vocabulary (~30k tokens) per sequence position. Much higher memory per document. |
| LateInteractionTextEmbedding | 32 | ColBERT keeps per-token embeddings (not pooled), so output size scales with sequence length × embedding dim. |
| TextCrossEncoder | 64 | Processes query-document pairs together. Each pair requires more memory than single documents, but less than sparse/late interaction. |
Users can override these defaults via the `batch_size` parameter if they have different memory constraints.
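A back-of-the-envelope estimate shows why the defaults differ. The sketch below computes the size of the raw float32 output tensor for one batch, using the figures from the table plus two assumptions that are not from this document: a sequence length of 128 tokens and a ColBERT dimension of 128. It ignores activations and model weights, so treat the numbers as relative rather than absolute.

```ruby
BYTES_PER_FLOAT32 = 4

# Size in MiB of the raw float32 output for one batch.
def output_megabytes(batch_size:, floats_per_document:)
  batch_size * floats_per_document * BYTES_PER_FLOAT32 / (1024.0 * 1024.0)
end

seq_len = 128   # assumed sequence length
vocab = 30_000  # approximate vocabulary size from the table above

dense = output_megabytes(batch_size: 256, floats_per_document: 384)
sparse = output_megabytes(batch_size: 32, floats_per_document: seq_len * vocab)
late = output_megabytes(batch_size: 32, floats_per_document: seq_len * 128)
# dense  => 0.375
# sparse => 468.75
# late   => 2.0
```

Even at one eighth of the dense batch size, the sparse model's vocabulary-sized logits dominate by roughly three orders of magnitude under these assumptions.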
---
### Implementation Priority
| Task | Risk | Effort | Value |
|------|------|--------|-------|
| 2.1 Add passage_embed to Sparse | Low | Small | Medium |
| 2.2 Add async to all classes | Low | Medium | High |
| 2.3 Add progress to all classes | Medium | Medium | Medium |
| 2.4 Add show_progress to CrossEncoder | Low | Small | Low |
| 3.1 Add quantization to all | Medium | Medium | Medium |
| 3.2 Add local_model_dir to all | Medium | Large | Medium |
| 3.3 Document batch size rationale | Low | Small | Low |
---
## Contributing
Contributions are welcome! If you'd like to implement any of these features:
1. Open an issue to discuss the approach
2. Follow the existing code style (run `bundle exec rubocop`)
3. Add tests for new functionality
4. Update the README and CHANGELOG